# Remote Data Science

The purpose of this notebook is to describe the end-goal for how Syft and Grid will facilitate privacy preserving data science. The hope is that this notebook will drive interesting conversation around features and use cases of the tools we build. However, at the time of writing none of this functionality exists yet.

Scenario: the scenario is that you are seeking to perform research to predict what it takes to get a good night's sleep. Specifically, the project is split into three pieces:

- Federated Learning: train a classifier to predict whether someone will get a good night sleep based on various input factors

- Federated Analytics: use the classifier to estimate the amount of sleep that the population of the USA is getting (importantly including folks who do not record their own sleep data).

- Self Classification: leveraging the model from FL, predict what decisions I should make to get a better night sleep.

# Step 1: Imports

To begin, we must import syft.

<!-- #### Development Notes: 
_To start, from a client perspective, we want to maximize for convenience and minimize the number of dependencies one needs to install to work with PyGrid. Thus, in an ideal world, users only have to install one python package in order to work with all of pygrid. I like the current design in syft 0.2.x where we have grid clients in a grid package inside of Syft. The thing we definitely want to avoid here is the need for users of PyGrid to have to install all of the dependencies needed to run grid nodes (flask, databases, etc.) just to be able to interact with the grid. Putting grid inside of syft solves this as well._ -->

In [322]:
import syft as sy
from syft import grid as gr

# Step 2: View our Available Networks

In this step, we need to see if we are connected to some number of data networks which we can use to search for data relating to sleep. Conveniently, the PySyft library remembers the networks we've previously used in other experiments. The list of "known networks" can be displayed by simply executing `gr.newtorks` as below.

<!-- ### Development Notes

_By default, it would be really great if we could support a combination of two lists of networks:_

- networks which all users of PySyft have by default (OpenGrid)
- a history of all networks previously accessed (stored in some local config file)

_We should be able to view these available networks by just calling `gr.networks` which should pretty-print information about them. Below we show one way to do pretty-print using just a Pandas table as shown below._ -->

In [8]:
# possible food for thought for how to implement this feature: https://hyperledger-indy.readthedocs.io/projects/hipe/en/latest/text/0036-mediators-and-relays/README.html

gr.networks

Unnamed: 0_level_0,id,datasets,models,domains,online,registered,server-domains,mobile-domains
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
OpenGrid,8278092196,235262,2352,2532,2352,76,100%,0%
FitByte,3709750237,34734,2352,2532,2352,37,2%,98%
DeepMined,1854867712,34734,2352,2532,2352,15,100%,0%
OpanAI,720117246,0,2352,2532,2352,81,2%,98%
Damonios Pizza,5410816127,0,2352,2532,2352,73,99%,1%
MyFitnessPal,5997732196,0,2352,2532,2352,31,50%,50%
TrackMyRun,3489349824,0,2352,2532,2352,7,0%,100%
Netflax,6558315376,935685,7473,346,216,0,54%,46%
AMA,2980287202,2352,236622,53,52,56,100%,0%
CDC,1854778893,35,0,5,5,5,100%,0%


This table displays several useful pieces of information:

- Id: the unique id of the network
- Name: the name of the network
- Models: the number of models currently hosted on the network (public and private)
- Domains: the number of domains registered to this network (i.e., the number of individual hospitals)
- Online: the number of domains which are currently online
- Registered: the number of domains for which this user already has an account 
- Server-domains: the percentage of domains which are server based (cloud/on-prem compute cluster based grid noes)
- Mobile-domains: the percentage of domains which are mobile based (smartphone based grid noes)

Already from this list we can get some idea as to what kinds of datasets which may be available to us. 
- FitByte seems like a good source of gold-standard "sleep data"
- Netflax seems like it might have some information on what people are doing before bed (watching tv)
- TrackMiRun seems like it might have good information on exercise, which could affect sleep.
- MiFitnessPal is a place where people journal what they eat - this should be very informative!

# Step 3: Local Wallet

On the right side of the table, we see the column "Registered". This indicates the number of domains within the network with which we already have user accounts which let us leverage the data. The fact that many of these values are >0 means that we've already been doing some data science on sources within these networks. To confirm, we can see that we have public and private keys corresponding to each account.

<!-- ### Development Notes: 

_Somewhere in the local filesystem, we need to save the set of all keys/logins which this user has for various domains around the world. We should be able to see them here. Note this list is what creates the "Registered" number for each network._ -->

In [330]:
gr.wallet.domain_keys

Unnamed: 0,network,domain,pubkey,prikey
0,OpenGrid,PatrickCason,1797b4705276f9b932e590163aa9b028c080afc85152a7...,a95844a031bc77bab435859c708eb0bc9ab0f26c6232bb...
1,OpenGrid,AndrewTrask,59c1aca6f688ab93007592d7ea2ef334150b9e5602c73d...,19da4ba5a266b761ea9db7ee5630487f9e43be42164333...
2,OpenGrid,TudorCebere,7d90f4c545ef771dc2dfe64a52ae262f4c79d207e87d36...,0e4b3394ceadec2c7402321cd0365af92b4bfcc2bf0360...
3,OpenGrid,JasonMancuso,fc876cb2592d52afe5c5a220531ecbac9d42b148d5775d...,3c71f8ac4b9a7e59fd5f967e79eafc85ff4e51c9603681...
4,OpenGrid,BobbyWagner,b2ef0b460ea6462ee1beea55c9c6aae7b2bdbfaaf3533e...,78e3d7ef0f93429d38aa9d373dc1d8c1a7e86179f26b06...
5,AMA,UCSF,d8029cd54090b0db69d1942232823f3b66a1492ace40bc...,64e01691028588a99917c27874327b3ac9d1fde75e79b0...
6,AMA,Vanderbilt,f25e9e0c94dd432bf44b4de348be7fa8baa272b63ecc18...,96b562d957baf1abf0c4d463eb3d17187f5023f2085e45...
7,AMA,MDAnderson,6272298bb9969f4aa48dd7b4ba25dbf35a4d69b7c69dd8...,b7b52bfdf37ae418fae8c0760c28cc09ce51e2a18d5a3c...
8,AMA,BostonGeneral,d5f4f5559cd893a0e374fdf99df41cb0ee37c4afaf0c77...,7671668eede2ff59f640c0ca4f058d18e59118dca33420...
9,AMA,HCA,293561c999f802158196e9c24bb964dcd7acfdbc858a04...,137c09c93daf18d47ab805c2e07340711124375267c124...


# Step 4: Adding Another Network

Before we continue, we're also aware of an interesting sleep study done within the NHS. We'd like to see if we can access data relating to that sleep study which could be useful to our project. To add the new network to our list of "known networks", we just call the `gr.save_network()` method.

In [332]:
gr.save_network('http://localhost:8015/') # it's a network

Connecting... SUCCESS!


Unnamed: 0_level_0,id,datasets,models,domains,online,registered
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
NHS,2352,86585,6585,5,5,0


In [333]:
gr.networks

Unnamed: 0_level_0,id,datasets,models,domains,online,registered,server-domains,mobile-domains
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
OpenGrid,7489181118,235262,2352,2532,2352,94,100%,0%
FitByte,3171669185,34734,2352,2532,2352,36,2%,98%
DeepMined,4986989490,34734,2352,2532,2352,14,100%,0%
OpanAI,81881178,0,2352,2532,2352,95,2%,98%
Damonios Pizza,6588945438,0,2352,2532,2352,80,99%,1%
MyFitnessPal,5951547480,0,2352,2532,2352,90,50%,50%
TrackMyRun,4478900429,0,2352,2532,2352,60,0%,100%
Netflax,1524599871,935685,7473,346,216,0,54%,46%
AMA,1340266147,2352,236622,53,52,33,100%,0%
CDC,7590082165,35,0,5,5,5,100%,0%


Before we continue, let's talk about what these columns represent. A "network", as you may have guessed, is a hosted service which exists to help you find data. Actually, it exists to help you find just about any kind of object within data science you might be looking for (data, compute, models, etc.), as long as that object exists within the network. 

The various members of the network are called "domains". The difference between "network" and "domain" is that a "domain" represents someone who actually *owns* a dataset whereas a "network" exists to help you *find* a dataset. So, for example, even though the "FitByte" network might help you *find* data relating to people's fitness, each individual person running their own *domain* (on their phone) is considered the owner. Importantly, the domain owner is the one which accepts/denies your request to perform data science (not the network owner).

So, the next logical step is to start training on the data, right? Let's ask some domain owners for permission!

Not so fast!

First, we need to find which datasets we want to leverage - asking domain owners to approve us for data science can take some time - and we can do a bit of the work ahead of time by performing a search over the datasets they are hosting first. 

# Step 5: Search

So, we're notionally connected to some networks which have domains with data. This allows us to do searches like so. We'll try the search term "diabetes".

In [516]:
# string search anywhere in an object's public metadata
result = gr.search(anywhere="diabetes")
result() #TODO: add distributed_models

Unnamed: 0,distributed_datasets,datasets,tensors,dataset_schemas,tensor_schemas,models,model_schemas
0,23,75474,947467,532,23,235,62


Looks like a few of these datasets relate to diabetes. We can view a few of them here!

In [517]:
pd_table = result.datasets.pandas()
pd_table

Unnamed: 0_level_0,network,domain,$/eps,$/gpu_flop_hour,validity_certs,top_validity_certs,user_certs,avg_user_cert_rank,id,upload-date,...,frameworks,train_rows,dev_rows,test_rows,schema,tags,description,private,metadata,gpu_available
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
COVID Mortality,UCSF,UCSF,$364.32,$5.35,25,"UCSF, FDA, NSF",6432,8.3%,db1a7ef00e962171e446c325f61f219848305fee44fc03...,12/18/2019,...,PT/TF/PD/NP/JX,2626,353,366,COVID-MORT-2,#covid #or...,This is the official statistics for COVID deat...,True,{'collected':2019},True
US COVID Deaths,CDC,Atlanta,$253.12,$4.22,23,"CDC, FDA, NSF",3523,12.5%,a85a8bc4c163be82b9ce838418e5f71fc9b47fdf2755e9...,1/18/2020,...,PT/TF/PD/NP/JX,34632,355,0,COVID-MORT-2,#covid #or...,"Nationally reported on a daily basis, this dat...",True,{'collected':2020},False
US COVID Deaths,CDC,Chicago,$263.56,$3.12,15,"UCSF, FDA, NSF",1532,3.3%,a85a8bc4c163be82b9ce838418e5f71fc9b47fdf2755e9...,1/18/2020,...,PT/TF/PD/NP/JX,34632,355,0,COVID-MORT-2,#covid #or...,"Nationally reported on a daily basis, this dat...",True,{'collected':2020},True
COVID Deaths,AMA,Boston General,$135.37,$5.32,1,AMA,3523,20.3%,b7e4324344926ae0d107575f4253990f9adeb67ef13bae...,2/20/2020,...,PT/TF/PD/NP/JX,2352,335,0,COVID-MORT-2,#covid #or...,With attributes including risk factors like di...,True,{'collected':2020},True
COVID Deaths,AMA,Doctor's Direct,$135.37,$5.32,1,John Mantle,93.5%,5dee370c2d36b9c95db65855c56489978ccf6b35362b68...,2/20/2020,2,...,2352,335,0,COVID-MORT-2,#covid #or...,With attributes including risk factors like di...,True,{'collected':2020},True,
Diabetes Pump Trial Data,AMA,Boston General,"$5,364.32",$1.32,1,AMA,235,32.1%,45da5e15082e1141cb97162fbe6a576c5420e3e063e37a...,1/3/2018,...,PT/TF/PD/NP/JX,23267,335,3463,AMA-DIABETES-TRIAL-252,#diabetes #or...,"In 2018, the American Medical Association...",True,{'collected':2018},True


### Understanding the Results

Before we go and look into more relevant (sleep related) data, let's take a moment to consider what this table is all about. The far left "name" column is the name of the dataset. This name isn't designed to be a unique name, it's just whatever the Domain owner chose to call their dataset. The network and domain are what you'd expect, domain is the owner of the data and the network is the network the domain is hosted in (note a domain can connect to multiple networks).

#### Pricing

The next two columns are perhaps a bit perplexing. The goal here is to get a general sense for how expensive the dataset is as well as how expensive the co-located compute is. 

##### Compute Prices

"USD/gpu_flop_hour" is perhaps the most intuitive. It's the average price (over the GPUs available) of an hour of GPU time normalized by flop (since some GPUs are more powerful than others). This is not the metric you would actually pay for (you'd pick your GPU and rent by the hour like most providers). However, this is to give a sense for how expensive compute is on each domain provider. (Maybe someday we'll "pay per flop" but not yet.)

##### Epsilon Prices

USD/eps is perhaps a bit less intuitive. Epsilon is referring to the notion of epsilon from Differential Privacy literature.  Is is a measure of "statistical uniqueness" of results and it's the most important pricing metric for a variety of reasons:

- If epsilon is sufficiently high, the buyer could reverse engineer the dataset and then (in theory) sell it (or derivatives) as a competitor to you. Economically, this is the "hard limit" creating resource scarcity for how much a dataset can be used. While a dataset can always be worked with (in theory), giving too much epsilon to one player (or to players who will collude) can accidentally create a competitor.
- Given that epsilon is a measure of resource scarcity, it's a great way to price the difference between two queries. Intuitively, if a seller of data has 10 queries against their data to consider, and one of those queries could be used to restore the entire dataset, then the price of that query must exceed the price of the other 9 queries combined in order for that query to be answered first. Of course, if the seller of data has reason to believe that no further queries will come, then selling to the 10th buyer becomes more reasonable. 
- The epsilon also can be used to measure the risk that an individual's identity will be revealed as being in the data. Risk should be priced, which is why this measure matters.

This is the metric which creates a liquid market for insights. If you can answer the question using less sensitive/scarce/valuable data, you'll be financially incentivised to do so because of the scarcity constraints mentioned above.

#### Certifications (certs)

A perennial problem with working with data you cannot see - how do you know it is real? The next set of columns seeks to address this.


##### Validity Certifications

These certifications are certifications of entities who claim to have "seen the data". Registered users of the domain are able to receive signed hashes of the dataset, signed using the private keys of well-known brands (such as major hospitals, etc.). The primary utility here is that data can be hosted by domains/networks on behalf of others, but still retain the "stamp" from the original creators of the data that "this is something I endorse". 

Note that the dataset from "Doctor's Direct" doesn't have any good certifications - this dataset should be considered suspect.

##### User Certifications

This type of certification is basically a form of dataset reviews. Individuals have the ability to report the ranking of datasets within their experiment, meaning a ranking of "which dataset was most predictive for my task, second most predictive, etc.". The metric "avg_user_cert_ranking" is a measure of the average "top percentage" that a dataset took across all tasks it's been used for. So a score of "5%" would be quite good, it would mean that a dataset ranked in the top 5% of all datasets for that task. (This can be measured using cross validation.) Note that each ranking is signed by the user's private key along with some cryptographic material which only users of the dataset receive (issued by the domain owner before experimentation begins).

Note that the dataset from "Doctor's Direct" has a pretty bad user score. I'll bet this dataset is (probably) fake.

#### Schema

Farther to the right in the table, you'll see the "schema" column. This is an extremely useful column wherein datasets can put forward the name of the schema the dataset is formatted in. This allows multiple datasets to subscribe to the same schema which makes federated learning easier. Note further that if you go through the trouble to format N datasets into a single schema for training, you can sell this transformation back to the data owners as a new dataset. It makes their data more valuable to be transformed into uniform/popular schemas.

#### Other Fields

Take a second to peruse the other fields in the table as well. Most of the names are quite intuitive (description, date uploaded, etc.). You'll notice that every dataset has a train, test, and validation piece. Each train/test/validation section is a single tensor. (People can choose for themselves which columns are input data and which ones should be the target.) 

#### Comprehensive Field Descriptions

Note that not all columns are printed by default (because there are so many). Default columns are just an intuitive perspective on the table. If you'd like the full description of all the columns (you don't need to know this now), it's commented out below. Just double click this cell and you can read it:
<!-- 
- Dataset: This is a dataset object existing within a single Domain. It consists of 

    - name (public - value - required): the name of the dataset
    - network (public - value - required): the network in which this dataset is hosted. Note that you might find the same dataset in different networks. They will be different rows in this table for now.
    - domain (public - value - required): the owner of the data who is running the compute within which the data is located.
    
    - data_hash: a hash of the .data attribute
    - metadata_hash: a hash of these attributes on the dataset not including .data
    - hash: a hash of the dataset including all attributes and the .data attribute

    - per-user-dataset USD/eps stages (public - value - required) - the list of price stages for varying amounts of eps. This is concerned with information we expect the user to not share with others.
    - per-lifetime-dataset USD/eps stages (public - value - required) - the list of price stages for varying amounts of eps. This is concerned with information we expect the user to make public (such as via a model or finding)   
    - per-user-entity USD/eps stages (public - value - required) - the list of price stages for varying amounts of eps. This is concerned with information we expect the user to not share with others.
    - per-lifetime-entity USD/eps stages (public - value - required) - the list of price stages for varying amounts of eps. This is concerned with information we expect the user to make public (such as via a model or finding)
    
    - USD/eps (public - func - required) - a measure of the current price of purchasing a query from a dataset - measured by the epsilon privacy leakage of that query
    - USD/.1 eps (public - func - required) - the cost of 1 epsilon
    - USD/1 eps (public - func - required) - the cost of 1 epsilon
    - USD/2 eps (public - func - required) - the cost of 2 epsilon
    - USD/5 eps (public - func - required) - the cost of 5 epsilon
    - USD/10 eps (public - func - required) - the cost of 10 epsilon
    
    - available_compute: a list of "Device" objects (see below) which are available within this domain. See device spec for schema of objects in this list. (https://github.com/OpenMined/PySyft/issues/3867)

    - Fixed Device Prices:
        - USD/gpu mean flop hour - (public - func - required) a measure of the average price of flops across the GPU compute available within this domain
        - USD/gpu min flop hour - (public - func - required) a measure of the minimum price of flops across the GPU compute available within this domain 
        - USD/gpu max flop hour - (public - func - required) a measure of the maximum price of flops across the GPU compute available within this domain     
        - USD/gpu mean RAM MB hour - (public - func - required) a measure of the average price of a MB of RAM across the GPU compute available within this domain
        - USD/gpu min RAM MB hour - (public - func - required) a measure of the minimum price of a MB of RAM across the GPU compute available within this domain     
        - USD/gpu max RAM MB hour - (public - func - required) a measure of the maximum price of a MB of RAM across the GPU compute available within this domain         
        - USD/cpu mean flop hour - (public - func - required) a measure of the average price of flops across the CPU compute available within this domain
        - USD/cpu min flop hour - (public - func - required) a measure of the minimum price of flops across the CPU compute available within this domain    
        - USD/cpu max flop hour - (public - func - required) a measure of the maximum price of flops across the CPU compute available within this domain        
        - USD/cpu mean RAM MB hour - (public - func - required) a measure of the average price of a MB of RAM across the CPU compute available within this domain
        - USD/cpu min RAM MB hour - (public - func - required) a measure of the minimum price of a MB of RAM across the CPU compute available within this domain  
        - USD/cpu max RAM MB hour - (public - func - required) a measure of the maximum price of a MB of RAM across the CPU compute available within this domain  
        
    - Spot Device Prices:
        - USD/spot gpu mean flop hour - (public - func - required) a measure of the average price of flops across the spot GPU compute available within this domain
        - USD/spot gpu min flop hour - (public - func - required) a measure of the minimum price of flops across the spot GPU compute available within this domain 
        - USD/spot gpu max flop hour - (public - func - required) a measure of the maximum price of flops across the spot GPU compute available within this domain     
        - USD/spot gpu mean RAM MB hour - (public - func - required) a measure of the average price of a MB of RAM across the spot GPU compute available within this domain
        - USD/spot gpu min RAM MB hour - (public - func - required) a measure of the minimum price of a MB of RAM across the spot GPU compute available within this domain     
        - USD/spot gpu max RAM MB hour - (public - func - required) a measure of the maximum price of a MB of RAM across the spot GPU compute available within this domain         
        - USD/spot cpu mean flop hour - (public - func - required) a measure of the average price of flops across the spot CPU compute available within this domain
        - USD/spot cpu min flop hour - (public - func - required) a measure of the minimum price of flops across the spot CPU compute available within this domain    
        - USD/spot cpu max flop hour - (public - func - required) a measure of the maximum price of flops across the spot CPU compute available within this domain        
        - USD/spot cpu mean RAM MB hour - (public - func - required) a measure of the average price of a MB of RAM across the spot CPU compute available within this domain
        - USD/spot cpu min RAM MB hour - (public - func - required) a measure of the minimum price of a MB of RAM across the spot CPU compute available within this domain  
        - USD/spot cpu max RAM MB hour - (public - func - required) a measure of the maximum price of a MB of RAM across the spot CPU compute available within this domain  
    
    - validity certs - (public - required) the number of entities which have "seen the data" and verify that it is genuine (can include an actual statement from each certifier available elsewhere)
    - top_validity_certs - (public - required) a short list of the primary entities (brands) who signed this data
    - user_certs - (public - required) the number of users who have certified as using this data for an experiment
    - avg_user_cert_rank - (public - required) the average ranking this dataset had relative to other datasets in a federated learning pool
    - arbitrary_certs - (public - required) anyone can sign this dataset and say something about it. This is the number of people who have done so. These are available through the API.
    - provenance_claim_certs - (public - optional) if this dataset was created using other hosted datasets/models/etc., a certificate can be issued which claims as such
    - provenance_verified_computation_certs - (public - optional) if this dataset was created using other hosted datasets/models/etc by performing verified computation, than these certificates can verify that the computation was indeed genuine.

    - id (public - required): the uid of the dataset
    - upload-date (public - required): the date the dataset was uploaded
    - version (public - required): is this dataset a new version of previous datasets? If so, what version is it?
    - previous_version_id (public - required): if the version of this dataset >0, what is the id of the previous version?
    - frameworks (public - required): the available frameworks for this dataset (derived from supported frameworks for the worker). Grouped into train, dev, and test.
    
    - data (public - required): the tensor object containing all data (including train, dev, and test)    
    - train_n_rows (public - required): the number of rows in the training dataset
    - train_indices (public - required): a list of the indices in the "data" object which correspond to the training data
    - dev_n_rows (public - required): the number of rows in the dev dataset   
    - dev_indices (public - required): a list of the indices in the "data" object which correspond to the dev data
    - test_n_rows (public - required): the number of rows in the test dataset            
    - test_indices (public - required): a list of the indices in the "data" object which correspond to the test data
    
    - schema (public - required): the DatasetSchema of the dataset - which is the name->schema mapping for each TensorSchema. Identical across train, dev, and test
    
    - tags (public - required): a list of tags affiliated with this dataset
    - description (public - required): a free text description of the dataset
    - metadata (public - optional): additional metadata someone wants to use for this dataset. We assume all of this data is public.    
    
    - raw (private - optional): the raw version of the dataset (such as a CSV file, free text file, etc.)
    
    - private: is the dataset's "data" tensor private or can it be downloaded?    
    - worst_case_user_budget: inferred values based on the worst case user-buget parameter within the dataset's tensors (see tensor user-budget)
    - dataset_budget_params (public - required): the per-user epsilon privacy budget parameters for this dataset:
    
        - entity_centric_lifetime_train
        - entity_centric_lifetime_dev
        - entity_centric_lifetime_test
    
        - lifetime_train: the total epsilon which can be published to the greater public (i.e., when a data scientist intends to release a number openly)
        - lifetime_dev: the total epsilon which can be published to the greater public (i.e., when a data scientist intends to release a number openly)        
        - lifetime_test: the total epsilon which can be published to the greater public (i.e., when a data scientist intends to release a number openly)    
            
        - user_lifetime_train: the total epsilon each data scientist gets when interacting with the training dataset
        - user_lifetime_dev: the total epsilon each data scientist gets when interacting with the dev dataset
        - user_lifetime_test: the total epsilon each data scientist gets when interacting with the dev dataset

        - daily_auto_train: the amount of epsilon each data scientist gets per day for sample statistics which doesn't require compliance officer review
        - daily_auto_dev: the amount of epsilon each data scientist gets per day for sample statistics which doesn't require compliance officer review        
        - daily_auto_test: the amount of epsilon each data scientist gets per day for sample statistics which doesn't require compliance officer review                

        - query_auto_train: the maximum amount of epsilon one query can return which doesn't require officer review when interacting with the training dataset
        - query_auto_dev: the maximum amount of epsilon one query can return which doesn't require officer review when interacting with the training dataset
        - query_auto_test: the maximum amount of epsilon one query can return which doesn't require officer review when interacting with the training dataset



### Types of Dataset Search

Given all these fields, you can also perform a wide variety of queries. However, because you're searching over a distributed/decentralized database (which could even include devices like mobile phones!) we split the ability to search into two types:

- Distant Search: the first step in searching is to request a bulk of information from the network to be copied to your local device. Searching can take a really long time, especially if devices are offline a lot (such as mobile devices) and the information hasn't been properly cached on the network node. Additionally, this is to reduce the burden on the network nodes themselves so that they're cheaper to run (and thus more likely to be run as a gift to the community).

- Local Search: searching, sorting, and filtering the metadata returned to you from the Distant Search.



#### Distant Search - Query 1: Basic Keyword Search

This is the type of search you've already seen. It searches for your term over all attributes and returns every dataset on the network which has the term mentioned somewhere.

In [518]:
# string search anywhere in an object's public metadata
result = gr.search(anywhere="diabetes")
result()

Unnamed: 0,distributed_datasets,datasets,tensors,dataset_schemas,tensor_schemas,models,model_schemas
0,23,75474,947467,532,23,235,62


However, what you can't see under the hood is that this "result" object is still populating. It will continue attempting to populate for the next 24 hours if you let it. You can also set a timeout if you like.

In [519]:
# string search anywhere in an object's public metadata
result = gr.search(anywhere="diabetes", timeout_hours=48, percent_response_threshold=0.8)

Now it will keep listening for the answer to the query for the next 48 hours.

You can view your list of ongoing search by checking `gr.searches`

In [520]:
gr.searches

[{'anywhere': 'diabetes', 'hours_remaining': 48}]

You can also past in a list of strings, which will be processed separately

In [521]:
result = gr.search(anywhere=["diabetes", "california"], timeout_hours=48)

This will return the set union between a search for "diabetes" and "california". However, if you only wanted queries which had BOTH terms as a requirement, you can add the require_all flag.

In [522]:
result = gr.search(anywhere=["diabetes", "california"], timeout_hours=48, require_all=True)

#### Distant Search - Query 2: Specific Text Attribute Search:

In this kind of search, you can query any attribute by name using "kwargs". timeout_hours, multiple strings, and require_all flags all still apply. For example, all of these are valid queries:

In [523]:
# can also search for tags
result = gr.search(tags="diabetes")
result = gr.search(tags=["diabetes"], require_all=True)

# can also search on the names of objects
result = gr.search(name_includes="MNIST")
result = gr.search(name_exact="MNIST")

# can also search on the description of objects
result = gr.search(description="diabetes COVID mortality")
result = gr.search(description=["diabetes", "COVID", "mortality"])

Additionally, if custom metadata was added to the metadata file, you can also simply search for it using kwargs. For example, if "rocket_origin='NASA'" was a piece of metadata in a file, you could add it.

In [524]:
# can also search for arbitrary metadata
result = gr.search(rocket_booster='NASA')

# Note: if you wanted to just search for anything that mentioned "rocket_booster", use the "anywhere"

If no such metadata including "rocket_booster" exists, the query will simply return empty.

#### Distant Search - Query 3: Numerical Search

For all of the numerical parameters, you can limit your search to specific ranges if you like. However, remember that for this first search bandwidth is really high but latency is really long. So it's probably better for you to not execute these kinds of searches until you get to "Local Search". However, if you want to use them, they're available.

In [525]:
result = gr.search(usd_spot_gpu_min_ram_mb_hour="<0.001")

The trick is to pass these in as strings and use "<" or ">" in the beginning of the string. You can even do multiple by using a "-".

In [526]:
result = gr.search(usd_spot_gpu_min_ram_mb_hour="0.0001-0.001")

#### Distant Search - Query 4: Credential Search (TODO)

Perhaps the most elusive attributes on the dataset object refer to credentials. 

    - validity certs - (public - required) the number of entities which have "seen the data" and verify that it is genuine (can include an actual statement from each certifier available elsewhere)
    - top_validity_certs - (public - required) a short list of the primary entities (brands) who signed this data
    - user_certs - (public - required) the number of users who have certified as using this data for an experiment
    - avg_user_cert_rank - (public - required) the average ranking this dataset had relative to other datasets in a federated learning pool
    - arbitrary_certs - (public - required) anyone can sign this dataset and say something about it. This is the number of people who have done so. These are available through the API.
    - provenance_claim_certs - (public - optional) if this dataset was created using other hosted datasets/models/etc., a certificate can be issued which claims as such
    - provenance_verified_computation_certs - (public - optional) if this dataset was created using other hosted datasets/models/etc by performing verified computation, than these certificates can verify that the computation was indeed genuine.

Consider "validity certs". "Certs" is short for "certificates". What a certificate is is a statement by a specific person. So, unlike all of the other attributes on the dataset object, which have no explicit author (and thus are implied to be authored by the domain owner), these certificates were authored by people other than the domain owner. They exist explicitly because the domain owner might be dishonest. In particular, it is possible that the domain owner might lie about the kinds of data he/she has. Thus, each certificate is a "signed statement" by a third party claiming something about the dataset. What kinds of claims are made?

- Validity Certs: these certificates exist to all say the same thing, namely that, "I, (certificate author) have seen the dataset and testify that it is genuine, namely that the rest of the metadata on the dataset object is true.". Validity certificates may also be accompanied by contracts which are binding in one or more legal jurisdictions, further adding weight to the claims. Note that the Domain owner may host these certificates themself because they are exclusively endorsements of the dataset, thus the domain owner has incentive to broadcast them. An important question to ask at this point is, "can the domain owner forge certificates?". The answer is yes and no. Yes, they can add as many certificates as they want to the list. However, each certificate is only as good as it's "signature". Much like in the real world, a signature is a symbol which only a specific individual could make. On paper, we have people use a pen and ink. In a data structure, we have people use a "cryptographic signature". In this way, a certificate is only useful if it is signed by someone who is known to be credible. "Credibility" in this case is usually associated with whether or not the individual has something to lose if they lose their reputation - aka, they're probably in the business of disseminating data and if word got out that they were disseminating fake data they would lose their shirt. The final important question you should ask is, "how do I know if a signature came from a specific person?". We will cover this later.

- User Certs: these certificates are like Validity certificates except that:
    - They aren't just endorsements of the data. They can also be negative (claims that the data is un-useful).
    - Because they can be negative, they must have alternative sources to "host" them, typically the network (by default), but they can also be other domains (who have a competitive reason to hold their competitors accountable) or other network nodes. The preferred storage is a mix between the dataset's domain holder and other domain holders because they are maximally incentivized to hold positive/negative user certs respectively.
    - These certificates come in several types:
        - Cross val rank certificates: these certificates basically say, "I trained on this dataset using federated learning, but before averaging all the models together, I created a set of cross validation scores, one score for each dataset which was with-held from the model average. This allowed me to rank the datasets by their relative contribution to the overall accuracy. On average, this dataset was in the top X% of datasets for my task on dataset schema Y".
        - Holdout accuracy certificates: I trained on this dataset exclusively and got a holdout score using X test dataset of Y.

- Provenance Certificates: these are claims or cryptographic proofs about where a particular dataset comes from.
    - Provenance Claim Certs: structured similarly to the user and validity certificates, this is merely a claim that a dataset originates from a particular transformation (such as a model) on another dataset.
    - Provenance Verified Computation Certs: a different kind of certificate which provides cryptographic proof that a dataset was created using one or more others passed through a specific function. This is computationally more intensive than Provenance Claim Certs, but in the right context they can lend very strong credibility to the claim.

- Arbitrary Certs: these are simply signed strings attached to the dataset. They can say anything, although they do have a "type" field which can allow for standards to emerge if a specific kind of certificate is deemed useful.

So, now that we have an intuition for what these certificates exist to do, we can consider how we might use them in a search query to find high-quality datasets.

In [9]:
result = gr.search(anywhere="diabetes", latest_version_only = True)
result.datasets()
# result.validity_identities()

Unnamed: 0_level_0,network,domain,$/eps,$/gpu_flop_hour,validity_certs,top_validity_certs,user_certs,avg_user_cert_rank,id,upload-date,...,frameworks,train_rows,dev_rows,test_rows,schema,tags,description,private,metadata,gpu_available
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
COVID Mortality,UCSF,UCSF,$364.32,$5.35,25,"UCSF, FDA, NSF",6432,8.3%,997a017255cff6202e21e191df09710c06a5be452cd610...,12/18/2019,...,PT/TF/PD/NP/JX,2626,353,366,COVID-MORT-2,#covid #or...,This is the official statistics for COVID deat...,True,{'collected':2019},True
US COVID Deaths,CDC,Atlanta,$253.12,$4.22,23,"CDC, FDA, NSF",3523,12.5%,447f0b38b15b4647a8835e2b9c5fb02f631fca75f4d2f2...,1/18/2020,...,PT/TF/PD/NP/JX,34632,355,0,COVID-MORT-2,#covid #or...,"Nationally reported on a daily basis, this dat...",True,{'collected':2020},False
US COVID Deaths,CDC,Chicago,$263.56,$3.12,15,"UCSF, FDA, NSF",1532,3.3%,447f0b38b15b4647a8835e2b9c5fb02f631fca75f4d2f2...,1/18/2020,...,PT/TF/PD/NP/JX,34632,355,0,COVID-MORT-2,#covid #or...,"Nationally reported on a daily basis, this dat...",True,{'collected':2020},True
COVID Deaths,AMA,Boston General,$135.37,$5.32,1,AMA,3523,20.3%,a4b20ac7d8fa38b12461da9d264cb733a7088156b85108...,2/20/2020,...,PT/TF/PD/NP/JX,2352,335,0,COVID-MORT-2,#covid #or...,With attributes including risk factors like di...,True,{'collected':2020},True
COVID Deaths,AMA,Doctor's Direct,$135.37,$5.32,1,John Mantle,93.5%,941631b1edf05ba549be6b43a25edef1b207aa4a10fbad...,2/20/2020,2,...,2352,335,0,COVID-MORT-2,#covid #or...,With attributes including risk factors like di...,True,{'collected':2020},True,
Diabetes Pump Trial Data,AMA,Boston General,"$5,364.32",$1.32,1,AMA,235,32.1%,d13ec1f16387cc1c28c9c7974fe966f766b28f7e0b07cc...,1/3/2018,...,PT/TF/PD/NP/JX,23267,335,3463,AMA-DIABETES-TRIAL-252,#diabetes #or...,"In 2018, the American Medical Association...",True,{'collected':2018},True


In [10]:
validity_cert_ids=result.validity_identities(only_entities_with...) #TODO: change to "validity_identities" - FURTURE TODO: when Opus schema matures, be able to search on the schema.
validity_cert_ids

Unnamed: 0_level_0,signer_pubkey,signer_name,signer url,dataset_id,dataset_name,certificate_authority_type,certificate_authority_url,certificate_metadata
cert_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
54b54571ed,4b04396a2b15b941eaece34f04f1cee9d4e266fd686d0b...,UCSF,http://opus-id.org/ucsf,c45629b08b,Covid Mortality,https,https://letsencrypt.org/,{...}
cdc6a0832a,0951a6a5f69dafdc9d70afc76756d1e2976359b1228623...,FDA,http://opus-id.org/fda,255764c350,Covid Mortality,https,https://letsencrypt.org/,{...}
929491354e,c356df8baebafdc169ac8eca5372ab10b2235e2024985c...,NSF,http://opus-id.org/nsf,aef7415f9f,Covid Mortality,https,https://letsencrypt.org/,{...}
2649ec46d5,d211ce0fb46da65293fdafa05f19dee8e593c3d7ccf8c4...,AMA,http://opus-id.org/ama,6684672643,COVID Deaths,https,https://letsencrypt.org/,{...}
e5b741c34c,ccef55a681050fa7faedb572090773290b77e96b9d9846...,CDC,http://opus-id.org/cdc,2d6490877c,US COVID Deaths,https,https://letsencrypt.org/,{...}
892710abcd,ccef55a681050fa7faedb572090773290b77e96b9d9846...,CDC,http://opus-id.org/cdc,e1f5eeef3b,Covid Mortality,https,https://letsencrypt.org/,{...}


Above we have printed all of the valid certificates matching this query. Namely, these are all the certificates attached to objects which were returned from the query. We can use them in a search by first searching for identities.

In [534]:
entities = gr.search_entities(signer_name="UCSF") #TODO - OPUS Project: what would we search for on an identity?
entities

Unnamed: 0_level_0,name,aliases,opus_url,verified_secondary_url
pubkey,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
9e56984e86bb827eb10794a7214db27ebdd41db047dee8307b683be8ab6e5bde,University of California San Franscisco,"[UCSF, ucsf]",http://opus-id.org/ucsf,https://www.ucsf.edu/


In [538]:
result = gr.search(anywhere="diabetes", validity_cert_ids=entities)

In [539]:
result.datasets()

Unnamed: 0_level_0,network,domain,id,upload-date,version,frameworks,train_rows,dev_rows,test_rows,schema,tags,description,private,metadata,gpu_available
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
COVID Mortality,UCSF,UCSF,d6aa4be5a09776c4cad34e2faec28b20358880e06f34d3...,12/18/2019,1,PT/TF/PD/NP/JX,2626,353,366,COVID-MORT-2,#covid #or...,This is the official statistics for COVID deat...,True,{'collected':2019},True


And as you can see, the new search only returned datasets which have been verified by the group of entities returned from the identity search.

In [12]:
result = gr.search(anywhere="diabetes")
result.datasets()

Unnamed: 0_level_0,network,domain,$/eps,$/gpu_flop_hour,validity_certs,top_validity_certs,user_certs,avg_user_cert_rank,id,upload-date,...,frameworks,train_rows,dev_rows,test_rows,schema,tags,description,private,metadata,gpu_available
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
COVID Mortality,UCSF,UCSF,$364.32,$5.35,25,"UCSF, FDA, NSF",6432,8.3%,a25b642faa717dbe0461e7b020799e2e5b5b8db3699ff5...,12/18/2019,...,PT/TF/PD/NP/JX,2626,353,366,COVID-MORT-2,#covid #or...,This is the official statistics for COVID deat...,True,{'collected':2019},True
US COVID Deaths,CDC,Atlanta,$253.12,$4.22,23,"CDC, FDA, NSF",3523,12.5%,037dee33389547f76a97da437f390c141aae7caed5146c...,1/18/2020,...,PT/TF/PD/NP/JX,34632,355,0,COVID-MORT-2,#covid #or...,"Nationally reported on a daily basis, this dat...",True,{'collected':2020},False
US COVID Deaths,CDC,Chicago,$263.56,$3.12,15,"UCSF, FDA, NSF",1532,3.3%,037dee33389547f76a97da437f390c141aae7caed5146c...,1/18/2020,...,PT/TF/PD/NP/JX,34632,355,0,COVID-MORT-2,#covid #or...,"Nationally reported on a daily basis, this dat...",True,{'collected':2020},True
COVID Deaths,AMA,Boston General,$135.37,$5.32,1,AMA,3523,20.3%,607caa7b8c30706fb6609341cc3036e637cafacecf1ee0...,2/20/2020,...,PT/TF/PD/NP/JX,2352,335,0,COVID-MORT-2,#covid #or...,With attributes including risk factors like di...,True,{'collected':2020},True
COVID Deaths,AMA,Doctor's Direct,$135.37,$5.32,1,John Mantle,93.5%,f5bf9bbae2b43497ff7cd4d92ea5afe82205a459ab0dd2...,2/20/2020,2,...,2352,335,0,COVID-MORT-2,#covid #or...,With attributes including risk factors like di...,True,{'collected':2020},True,
Diabetes Pump Trial Data,AMA,Boston General,"$5,364.32",$1.32,1,AMA,235,32.1%,c7e1f945c60f5aca2de5a4d0c8f4190943d88fce459a19...,1/3/2018,...,PT/TF/PD/NP/JX,23267,335,3463,AMA-DIABETES-TRIAL-252,#diabetes #or...,"In 2018, the American Medical Association...",True,{'collected':2018},True


#### Raw User Certification Data Structure (for Dataset)

- Dataset ID: the id of the dataset which was used for an experiment
- Certifier Public Key: the id of the individual who is making the claim (positive/negative endorsement)
- Dataset Schema: COVID-MORT-2
- Input Columns: ["Blood pressure", "BMI", "Age"]
- Target Columns: ["Mortality"]
- Dev Dataset: the dataset we're using for evaluation to create the FL Crossval Metic. (options: local dataset, <dataset_id>)
- FL Crossval (list of objects with the following schema):
    - Pool Size: (Example: pool_size=30) the number of participants training models in a single round of federated learning.
    - Evaluation Metric: (precision, recall, F1, accuracy, BLU, Entropy, Mean Squared Error, ROC, etc.)
    - Crossval Rank: (Example: crossval_rank=11) (11 / 30 = 36%)
- Experiment Notes (optional): the way the experiment was run - details about modeling choices, etc.

In [13]:
result = gr.search(anywhere="diabetes")
result.datasets()

Unnamed: 0_level_0,network,domain,$/eps,$/gpu_flop_hour,validity_certs,top_validity_certs,user_certs,avg_user_cert_rank,id,upload-date,...,frameworks,train_rows,dev_rows,test_rows,schema,tags,description,private,metadata,gpu_available
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
COVID Mortality,UCSF,UCSF,$364.32,$5.35,25,"UCSF, FDA, NSF",6432,8.3%,17c6ebfa1446e2a834322ef402cd3587ebe3d10bef665a...,12/18/2019,...,PT/TF/PD/NP/JX,2626,353,366,COVID-MORT-2,#covid #or...,This is the official statistics for COVID deat...,True,{'collected':2019},True
US COVID Deaths,CDC,Atlanta,$253.12,$4.22,23,"CDC, FDA, NSF",3523,12.5%,d4e194c5f9a27cb9e9298847709396cb2ed3cee4b180a9...,1/18/2020,...,PT/TF/PD/NP/JX,34632,355,0,COVID-MORT-2,#covid #or...,"Nationally reported on a daily basis, this dat...",True,{'collected':2020},False
US COVID Deaths,CDC,Chicago,$263.56,$3.12,15,"UCSF, FDA, NSF",1532,3.3%,d4e194c5f9a27cb9e9298847709396cb2ed3cee4b180a9...,1/18/2020,...,PT/TF/PD/NP/JX,34632,355,0,COVID-MORT-2,#covid #or...,"Nationally reported on a daily basis, this dat...",True,{'collected':2020},True
COVID Deaths,AMA,Boston General,$135.37,$5.32,1,AMA,3523,20.3%,c565bd5ca2b8de730092f3516daeb60b3405bef4910a90...,2/20/2020,...,PT/TF/PD/NP/JX,2352,335,0,COVID-MORT-2,#covid #or...,With attributes including risk factors like di...,True,{'collected':2020},True
COVID Deaths,AMA,Doctor's Direct,$135.37,$5.32,1,John Mantle,93.5%,dc1c2319302ab6ce73670496c3799a7627cff63a64d631...,2/20/2020,2,...,2352,335,0,COVID-MORT-2,#covid #or...,With attributes including risk factors like di...,True,{'collected':2020},True,
Diabetes Pump Trial Data,AMA,Boston General,"$5,364.32",$1.32,1,AMA,235,32.1%,95f608b2714037d7459a107af33ecc14841dad858a3970...,1/3/2018,...,PT/TF/PD/NP/JX,23267,335,3463,AMA-DIABETES-TRIAL-252,#diabetes #or...,"In 2018, the American Medical Association...",True,{'collected':2018},True


In [15]:
# result.validity_certificates()
user_cert_ids = result.user_cert_ids(filter="exclude people who on average overestimated the percentile ranking by 50% or more")
user_cert_ids

Unnamed: 0_level_0,signer_pubkey,signer_name,signer url,dataset_id,dataset_name,certificate_authority_type,certificate_authority_url,certificate_metadata
cert_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
b370c47bb4,3743986866ed3f27b38e73c3b8ce746284ff13364842e4...,UCSF,http://opus-id.org/ucsf,e30172c687,Covid Mortality,https,https://letsencrypt.org/,{...}
2d3e5b1954,ac02a9b5aa4c42abe4f77062641d88bcf8603fa93d56aa...,FDA,http://opus-id.org/fda,b0ba9d2853,Covid Mortality,https,https://letsencrypt.org/,{...}
07c5153cf7,22e68daa6134af71d0a444d26180bc106e21c23fed8ff2...,NSF,http://opus-id.org/nsf,f2cf25fd3c,Covid Mortality,https,https://letsencrypt.org/,{...}
d9c8fb350e,1fd1287036de81e9f05be59f82c7758c1a407667d4e9e7...,AMA,http://opus-id.org/ama,cac64498fe,COVID Deaths,https,https://letsencrypt.org/,{...}
349eb7f6be,a28b09c2d3952ecccd23f0be3ec7966e2db07cdb24cc03...,CDC,http://opus-id.org/cdc,2a1a2d40f7,US COVID Deaths,https,https://letsencrypt.org/,{...}
a497a960b5,a28b09c2d3952ecccd23f0be3ec7966e2db07cdb24cc03...,CDC,http://opus-id.org/cdc,45092f692c,Covid Mortality,https,https://letsencrypt.org/,{...}


In [None]:
# returns the list of datasets we trust to be high quality.
result = result.search(search="diabetes", user_cert_ids=user_cert_ids, user_cert_id_rank="85%" validity_cert_ids=validity_cert_ids)

#### Provenance Search - Query 6: Looking for datasets of a certain origin

##### Object Provenance Certificate

    - Object ID: the id of the object being signed
    - Current Owner Public Key: identifying who the current owner is
    - Provenance Link Type: {Creation, Transfer}
    - Transfer Public Key: the public key of of the target of the provenance link (the person the dataset is being transferred to)
    - Current Validity Certificate: a signed certificate of the dataset + metadata (includes all previous certificates) in its current state.

In [16]:
result = gr.search(anywhere="diabetes")
provenance_cert_ids = result.provenance_cert_ids()
provenance_cert_ids # result.validity_certificates()

Unnamed: 0_level_0,signer_pubkey,signer_name,signer url,dataset_id,dataset_name,certificate_authority_type,certificate_authority_url,certificate_metadata
cert_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0b009893e1,5807f218e8e30214f7579a36a37513d2965adfff36f2f0...,UCSF,http://opus-id.org/ucsf,ce99c3f99f,Covid Mortality,https,https://letsencrypt.org/,{...}
c1769a05d4,2607c476da8eecdcca2c8de547b80c928211afdc83da77...,FDA,http://opus-id.org/fda,69098d2d7e,Covid Mortality,https,https://letsencrypt.org/,{...}
472ed4a92a,a10e7f046fd1c3cc25c1b77d86d3cb2a9c5de31b38515c...,NSF,http://opus-id.org/nsf,c61318f1d4,Covid Mortality,https,https://letsencrypt.org/,{...}
3b26fea6e8,465a57e7a55ef02779261561a5605578f0a3158213a572...,AMA,http://opus-id.org/ama,85031f13f5,COVID Deaths,https,https://letsencrypt.org/,{...}
e5bab81b9c,2ac50c6fb32e145240b2cca7f2d183de2db270ab5a4916...,CDC,http://opus-id.org/cdc,9d6da83499,US COVID Deaths,https,https://letsencrypt.org/,{...}
9f9abd2887,2ac50c6fb32e145240b2cca7f2d183de2db270ab5a4916...,CDC,http://opus-id.org/cdc,d69c3135a1,Covid Mortality,https,https://letsencrypt.org/,{...}


In [None]:
provenance_cert_ids = result.provenance_cert_ids(filter="IDs corresponding to UCSF employed doctors")
result = result.search(provenance_cert_ids_must_include=provenance_cert_ids, provenance_cert_ids_must_exclude=result.provenance_cert_ids(filter="IDs corresponding to remote-worker UCSF doctors"))

#### Dataset Schema (Distributed Dataset) Search - Query 7: Looking for dataset schemas verified to be good for a specific problem

#### Distant Search - Query 5: REGEX Search

You can also do all of these queries using regex parameters like so

In [417]:
result = gr.search_regex(anywhere="[a-z]{4}-[0-9]{4}-[a-z]{4}-[0-9]{4}")

#### Local Search - Queries: All

And you can actually perform all of the same searches locally as well using the same parameters. However, it'll return almost instantly since it's running on your own device.

Just run your queries against the "result" variable. If you lose this variable you can find it again in `gr.searches`

In [420]:
local_result = result.search(anywhere="trial")
local_result()

Unnamed: 0,distributed_datasets,datasets,tensors,dataset_schemas,tensor_schemas,models,model_schemas
0,23,75474,947467,532,23,235,62


And even this local result can be further refined in the same way! (It's recursive)

In [421]:
local_result2 = local_result.search(anywhere="trial")
local_result2()

Unnamed: 0,distributed_datasets,datasets,tensors,dataset_schemas,tensor_schemas,models,model_schemas
0,23,75474,947467,532,23,235,62


# Step 5: Search (again)

Ok, enough about search in general! Let's continue with our experiment to find the secret to a good night's sleep. As a review, we're looking to accomplish three things:

- Federated Learning: train a classifier to predict whether someone will get a good night sleep based on various input factors

- Federated Analytics: use the classifier to estimate the amount of sleep that the population of the USA is getting (importantly including folks who do not record their own sleep data).

- Self Classification: leveraging the model from FL, predict what decisions I should make to get a better night sleep.

We have several networks at our disposal:

In [422]:
gr.networks

Unnamed: 0_level_0,id,datasets,models,domains,online,registered,server-domains,mobile-domains
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
OpenGrid,8393646751,235262,2352,2532,2352,6,100%,0%
FitByte,1580245361,34734,2352,2532,2352,19,2%,98%
DeepMined,7144419394,34734,2352,2532,2352,66,100%,0%
OpanAI,6147168007,0,2352,2532,2352,33,2%,98%
Damonios Pizza,6099664154,0,2352,2532,2352,61,99%,1%
MyFitnessPal,1641633989,0,2352,2532,2352,19,50%,50%
TrackMyRun,1578719189,0,2352,2532,2352,7,0%,100%
Netflax,932716859,935685,7473,346,216,0,54%,46%
AMA,5827234287,2352,236622,53,52,66,100%,0%
CDC,6384693059,35,0,5,5,5,100%,0%


Let's start simple and search for "sleep"

In [551]:
result = gr.search(anywhere="sleep")
result()

Unnamed: 0,distributed_datasets,datasets,tensors,dataset_schemas,tensor_schemas,models,model_schemas
0,23,75474,947467,532,23,235,62


In [552]:
result.status() # return the status of the query (what percentage of known devices have replied)

Unnamed: 0,query,timestamp,duration (hours),time remaining,% of all known domains,% of domains alive in last 24 hours
0,anywhere=sleep,1595524000.0,24,23.9,35%,98%


In [553]:
result.datasets()

Unnamed: 0_level_0,network,domain,id,upload-date,version,frameworks,train_rows,dev_rows,test_rows,schema,tags,description,private,metadata,gpu_available
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
COVID Mortality,UCSF,UCSF,b428cf3a54636dd31740a832a3fabf9a19f30e29742f1b...,12/18/2019,1,PT/TF/PD/NP/JX,2626,353,366,COVID-MORT-2,#covid #or...,This is the official statistics for COVID deat...,True,{'collected':2019},True


In [350]:
# show the individual dataset results
result.datasets(latest_version_only=True, ignore_duplicates=False, require_gpu=False)

Unnamed: 0_level_0,network,domain,id,upload-date,version,frameworks,train_rows,dev_rows,test_rows,schema,tags,description,private,metadata,gpu_available
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
COVID Mortality,UCSF,UCSF,a63641a11d506e63b5eaca37f30f70c2fdaca5c663d085...,12/18/2019,1,PT/TF/PD/NP/JX,2626,353,366,COVID-MORT-2,#covid #or...,This is the official statistics for COVID deat...,True,{'collected':2019},True
US COVID Deaths,CDC,Atlanta,eb09edc205b459ff06901ddf6e9b9f3406f925e77c2071...,1/18/2020,23,PT/TF/PD/NP/JX,34632,355,0,COVID-MORT-2,#covid #or...,"Nationally reported on a daily basis, this dat...",True,{'collected':2020},False
US COVID Deaths,CDC,Chicago,eb09edc205b459ff06901ddf6e9b9f3406f925e77c2071...,1/18/2020,23,PT/TF/PD/NP/JX,34632,355,0,COVID-MORT-2,#covid #or...,"Nationally reported on a daily basis, this dat...",True,{'collected':2020},True
COVID Deaths,AMA,Boston General,9bab67dab948fad522d57db7eea00b3b055a53753f2ff3...,2/20/2020,2,PT/TF/PD/NP/JX,2352,335,0,COVID-MORT-2,#covid #or...,With attributes including risk factors like di...,True,{'collected':2020},True
Diabetes Pump Trial Data,AMA,Boston General,607ab6481b74bcc1438c090d8ea02bbacbd3337e811b9d...,1/3/2018,26,PT/TF/PD/NP/JX,23267,335,3463,AMA-DIABETES-TRIAL-252,#diabetes #or...,"In 2018, the American Medical Association...",True,{'collected':2018},True


In [305]:
# latest_return_only=True by default

result.distributed()

Unnamed: 0_level_0,networks,domains,popular_dataset_description,n_datasets,n_datasets_dedup,train_rows,dev_rows,test_rows,max_train_rows_per_domain,max_dev_rows_per_domain,max_test_rows_per_domain
schema-name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
COVID-MORT-2,"UCSF,CDC,AMA","UCSF,Atlanta,Chicago,Boston General","Nationally reported on a daily basis, this dat...",4,3,39610,1063,366,34632,355,366
AMA-DIABETES-TRIAL-252,AMA,Boston General,"In 2018, the American Medical Association...",4,3,23267,335,3463,23267,335,3463


# Bulk Domain Registration

Let's say I found 10,000 phones with data I want to rain on - i need to ask 10,000 people for permission to do so.

# Select and Allocate Compute

Now that we have found an interesting distributed dataset, we need to set up some compute to do our analysis. However, the important thing to consider is that we can't use just any compute, we have to use compute which is co-located with each dataset we want to analyze. For example, part of the COVID-MORT-2 distributed dataset we found above is in UCSF's datacenters, so we need to spin up some compute within UCSF's "Domain". A "Domain" is the official word we use when referring to "all the data and compute within the ownership and jurisdiction of a single entity, known as the domain owner". 

So, since the dataset were most interested in "COVID-MORT-2" is actually distributed across multiple domain owners, we need to get setup with some compute within each one.



 
    
- Tensor:
    - name: the name of a tensor
    - schema (required - public - TensorSchema object): the schema of the tensor (type, name, and description for each column)
    - mock (generated): a mock tensor generated from the TensorSchema
    - id: the uid of the tensor
    - data: the tensor's values
    - tags (optional):
    - description (optional):
    - shape (required - public): the shape of the tensor
    - value: the tensor itself
    - private: is the tensor a private tensor?
    - sensitivity (optional): the sensitivity metadata for a tensor
        - h (public - derived from schema) - the max values a tensor can take on, derived from the schema
        - l (public - derived from schema)- the minimum values a tensor can take on, derived from the schema
        - e^h (private) - the max contributions from entities, initialized with the tensor
        - e^l (private) - the min contributions from entities, initialized with the tensor
    - accountant (private reference to global privacy accountant)
    - worst_case_user_budget: inferred values based on the worst case user-budget parameter across the entities in the tensor (see Entity.user_budget)

- Entity:
    - uid (required, randomly generated, public)
    - metadata (optional)
    - user_budget (public - required): the per-user privacy budget parameters for this dataset:
        - lifetime_train: the total epsilon which can be published to the greater public (i.e., when a data scientist intends to release a number openly)
        - lifetime_dev: the total epsilon which can be published to the greater public (i.e., when a data scientist intends to release a number openly)        
        - lifetime_test: the total epsilon which can be published to the greater public (i.e., when a data scientist intends to release a number openly)                
        - user_lifetime_train: the total epsilon each data scientist gets when interacting with the training dataset
        - user_lifetime_dev: the total epsilon each data scientist gets when interacting with the dev dataset
        - user_lifetime_test: the total epsilon each data scientist gets when interacting with the dev dataset
        - daily_auto_train: the amount of epsilon each data scientist gets per day for sample statistics which doesn't require compliance officer review
        - daily_auto_dev: the amount of epsilon each data scientist gets per day for sample statistics which doesn't require compliance officer review        
        - daily_auto_test: the amount of epsilon each data scientist gets per day for sample statistics which doesn't require compliance officer review                
        - query_auto_train: the maximum amount of epsilon one query can return which doesn't require officer review when interacting with the training dataset
        - query_auto_dev: the maximum amount of epsilon one query can return which doesn't require officer review when interacting with the training dataset
       - query_auto_test: the maximum amount of epsilon one query can return which doesn't require officer review when interacting with the training dataset        


- TensorSchema:
    - name: the name of the schema 
    - columns: each column has a type, name, and description for the column

- SchemaColumn:
    - type
    - name
    - description
    - vocabulary (optional - for text datasets)
    
- DatasetSchema: this is the schema of a dataset. Importantly, we try to encourage datasets in multiple locations to intentionally subscribe to the same schema so as to best facilitate Federated Learning.

- DistributedDataset: this is a virtual object which refers to a collection of Dataset objects which all subscribe to the same Dataset Schema. It is a convenient object because it gives you fast access to datasets at multiple institutions which are appropriate to train on together.


In [6]:
import pandas as pd
from time import time
import random
import os
from binascii import hexlify

def get_key():
    key = hexlify(os.urandom(32)).decode()
    return key

class Grid():
    ""
class Wallet():
    ""
gr = Grid()
gr.wallet = Wallet()

#ignore this...it's just to support the mock API
columns=["network", "domain", "pubkey", "prikey"]
data = [["OpenGrid", "PatrickCason", get_key(), get_key()],
       ["OpenGrid", "AndrewTrask", get_key(), get_key()],
       ["OpenGrid", "TudorCebere", get_key(), get_key()],
       ["OpenGrid", "JasonMancuso", get_key(), get_key()],
       ["OpenGrid", "BobbyWagner", get_key(), get_key()],
       ["AMA", "UCSF", get_key(), get_key()],
       ["AMA", "Vanderbilt", get_key(), get_key()],
       ["AMA", "MDAnderson", get_key(), get_key()],
       ["AMA", "BostonGeneral", get_key(), get_key()],
       ["AMA", "HCA", get_key(), get_key()],
       ["CDC", "Atlanta", get_key(), get_key()],
       ["CDC", "New York", get_key(), get_key()],
       ]
domain_keys = pd.DataFrame(columns=columns, data=data)
gr.wallet.domain_keys = domain_keys

#ignore this...it's just to support the mock API
columns=["id", "name", "datasets", "models", "domains", "online", "registered", "server-domains", "mobile-domains"]
data = [[random.randint(0,1e10), "OpenGrid", 235262, 2352, 2532, 2352, random.randint(0,100), "100%", "0%"],
       [random.randint(0,1e10), "FitByte", 34734, 2352, 2532, 2352, random.randint(0,100), "2%", "98%"],
       [random.randint(0,1e10), "DeepMined", 34734, 2352, 2532, 2352, random.randint(0,100), "100%", "0%"], 
       [random.randint(0,1e10), "OpanAI", 0, 2352, 2532, 2352, random.randint(0,100), "2%", "98%"],  
       [random.randint(0,1e10), "Damonios Pizza", 0, 2352, 2532, 2352, random.randint(0,100), "99%", "1%"],   
       [random.randint(0,1e10), "MyFitnessPal", 0, 2352, 2532, 2352, random.randint(0,100), "50%", "50%"],           
       [random.randint(0,1e10), "TrackMyRun", 0, 2352, 2532, 2352, random.randint(0,100), "0%", "100%"],                   
       [random.randint(0,1e10), "Netflax", 935685, 7473, 346, 216, 0, "54%", "46%"],        
       [random.randint(0,1e10), "AMA", 2352, 236622, 53, 52, random.randint(0,100), "100%", "0%"],
       [random.randint(0,1e10), "CDC", 35, 0, 5, 5, 5, "100%", "0%"]]
networks = pd.DataFrame(columns=columns, data=data)
networks = networks.set_index("name")    
gr.networks = networks

def save_network(network):
    columns=["id", "name", "datasets", "models", "domains", "online", "registered"]
    data = [[2352, "NHS", 86585, 6585, 5, 5, 0]]
    network = pd.DataFrame(columns=columns, data=data)
    network = network.set_index("name")
    
    gr.networks = pd.concat([gr.networks, network])
    print("Connecting... SUCCESS!")
    return network

gr.save_network = save_network

def search_diabetes(*args, **kwargs):
    
    if('anywhere' in kwargs and kwargs['anywhere'] == "diabetes"):

        if("validity_cert_ids" in kwargs):
            columns=["distributed_datasets", "datasets", "tensors", "dataset_schemas", "tensor_schemas", "models", "model_schemas"]
            data = [[23, 75474, 947467, 532, 23, 235, 62]]
            nets = pd.DataFrame(columns=columns, data=data)

            key2 = get_key()

            columns=["name", "network", "domain", "id", "upload-date", "version", "frameworks", "train_rows", "dev_rows", "test_rows", "schema", "tags", "description", "private", "metadata", "gpu_available"]
            data = [["COVID Mortality", "UCSF", "UCSF", get_key(), "12/18/2019", "1", "PT/TF/PD/NP/JX", 2626, 353, 366, "COVID-MORT-2", "#covid #or...", "This is the official statistics for COVID deaths within...", "True", "{'collected':2019}", "True"]]
            datasets = pd.DataFrame(columns=columns, data=data)
            datasets = datasets.set_index("name")


            columns=["schema-name", "networks", "domains", "popular_dataset_description", "n_datasets", "n_datasets_dedup", "train_rows", "dev_rows", "test_rows", "max_train_rows_per_domain", "max_dev_rows_per_domain", "max_test_rows_per_domain"]
            data = [["COVID-MORT-2", "UCSF,CDC,AMA", "UCSF,Atlanta,Chicago,Boston General", "Nationally reported on a daily basis, this dataset includes", 4,3, 39610, 1063, 366, 34632, 355, 366],
                   ["AMA-DIABETES-TRIAL-252", "AMA", "Boston General", "In 2018, the American Medical Association...", 4,3, 23267, 335, 3463, 23267, 335, 3463]]
            distributed = pd.DataFrame(columns=columns, data=data)
            distributed = distributed.set_index("schema-name")

            class Networks():
                def __call__(self):
                    return nets

                def datasets(self, *args, **kwargs):
                    return datasets

                def distributed(self, *args, **kwargs):
                    return distributed

                def search(self, *args, **kwargs):
                    return self            

            networks = Networks()
            return networks

        
        columns=["distributed_datasets", "datasets", "tensors", "dataset_schemas", "tensor_schemas", "models", "model_schemas"]
        data = [[23, 75474, 947467, 532, 23, 235, 62]]
        nets = pd.DataFrame(columns=columns, data=data)

        key2 = get_key()

        columns=["name", "network", "domain", "$/eps", "$/gpu_flop_hour", "validity_certs", "top_validity_certs", "user_certs", "avg_user_cert_rank", "id", "upload-date", "version", "frameworks", "train_rows", "dev_rows", "test_rows", "schema", "tags", "description", "private", "metadata", "gpu_available"]
        data = [["COVID Mortality", "UCSF", "UCSF", "$364.32", "$5.35", 25, "UCSF, FDA, NSF", 6432, "8.3%", get_key(), "12/18/2019", "1", "PT/TF/PD/NP/JX", 2626, 353, 366, "COVID-MORT-2", "#covid #or...", "This is the official statistics for COVID deaths within...", "True", "{'collected':2019}", "True"],
               ["US COVID Deaths", "CDC", "Atlanta", "$253.12", "$4.22", 23, "CDC, FDA, NSF", 3523, "12.5%", key2, "1/18/2020", "23", "PT/TF/PD/NP/JX", 34632, 355, 0, "COVID-MORT-2", "#covid #or...", "Nationally reported on a daily basis, this dataset includes", "True", "{'collected':2020}", "False"],
               ["US COVID Deaths", "CDC", "Chicago", "$263.56", "$3.12", 15,  "UCSF, FDA, NSF", 1532, "3.3%", key2, "1/18/2020", "23", "PT/TF/PD/NP/JX", 34632, 355, 0, "COVID-MORT-2", "#covid #or...", "Nationally reported on a daily basis, this dataset includes", "True", "{'collected':2020}", "True"],            
               ["COVID Deaths", "AMA", "Boston General", "$135.37", "$5.32", 1, "AMA", 3523, "20.3%", get_key(), "2/20/2020", "2", "PT/TF/PD/NP/JX", 2352, 335, 0, "COVID-MORT-2", "#covid #or...", "With attributes including risk factors like diabetes...", "True", "{'collected':2020}", "True"],
               ["COVID Deaths", "AMA", "Doctor's Direct", "$135.37", "$5.32", 1, "John Mantle", "93.5%", get_key(), "2/20/2020", "2", "PT/TF/PD/NP/JX", 2352, 335, 0, "COVID-MORT-2", "#covid #or...", "With attributes including risk factors like diabetes...", "True", "{'collected':2020}", "True"],                
               ["Diabetes Pump Trial Data", "AMA", "Boston General", "$5,364.32", "$1.32", 1, "AMA", 235, "32.1%", get_key(), "1/3/2018", "26", "PT/TF/PD/NP/JX", 23267, 335, 3463, "AMA-DIABETES-TRIAL-252", "#diabetes #or...", "In 2018, the American Medical Association...", "True", "{'collected':2018}", "True"]]
        datasets = pd.DataFrame(columns=columns, data=data)
        datasets = datasets.set_index("name")


        columns=["schema-name", "networks", "domains", "popular_dataset_description", "n_datasets", "n_datasets_dedup", "train_rows", "dev_rows", "test_rows", "max_train_rows_per_domain", "max_dev_rows_per_domain", "max_test_rows_per_domain"]
        data = [["COVID-MORT-2", "UCSF,CDC,AMA", "UCSF,Atlanta,Chicago,Boston General", "Nationally reported on a daily basis, this dataset includes", 4,3, 39610, 1063, 366, 34632, 355, 366],
               ["AMA-DIABETES-TRIAL-252", "AMA", "Boston General", "In 2018, the American Medical Association...", 4,3, 23267, 335, 3463, 23267, 335, 3463]]
        distributed = pd.DataFrame(columns=columns, data=data)
        distributed = distributed.set_index("schema-name")

        class Networks():
            def __call__(self):
                return nets

            def datasets(self, *args, **kwargs):
                return datasets

            def distributed(self, *args, **kwargs):
                return distributed
            
            def search(self, *args, **kwargs):
                return self
            
            def validity_certificates(self, *args, **kwargs):
                columns=["cert_id", "signer_pubkey", "signer_name", "signer url", "dataset_id", "dataset_name", "certificate_authority_type", "certificate_authority_url", "certificate_metadata"]
                
                cdc = get_key()
                
                data = [[get_key()[0:10], get_key(), "UCSF", "http://opus-id.org/ucsf", get_key()[0:10], "Covid Mortality", "https", "https://letsencrypt.org/", "{...}"],
                       [get_key()[0:10], get_key(), "FDA", "http://opus-id.org/fda", get_key()[0:10], "Covid Mortality", "https", "https://letsencrypt.org/", "{...}"],
                       [get_key()[0:10], get_key(), "NSF", "http://opus-id.org/nsf", get_key()[0:10], "Covid Mortality", "https", "https://letsencrypt.org/", "{...}"],
                       [get_key()[0:10], get_key(), "AMA", "http://opus-id.org/ama", get_key()[0:10], "COVID Deaths", "https", "https://letsencrypt.org/", "{...}"],
                       [get_key()[0:10], cdc, "CDC", "http://opus-id.org/cdc", get_key()[0:10], "US COVID Deaths", "https", "https://letsencrypt.org/", "{...}"],
                       [get_key()[0:10], cdc, "CDC", "http://opus-id.org/cdc", get_key()[0:10], "Covid Mortality", "https", "https://letsencrypt.org/", "{...}"]]
                
                certs = pd.DataFrame(columns=columns, data=data)
                certs = certs.set_index("cert_id")
                return certs

        networks = Networks()

    
        return networks
    
    else:
        
        columns=["distributed_datasets", "datasets", "tensors", "dataset_schemas", "tensor_schemas", "models", "model_schemas"]
        data = [[23, 75474, 947467, 532, 23, 235, 62]]
        nets = pd.DataFrame(columns=columns, data=data)

        key2 = get_key()

        columns=["name", "network", "domain", "id", "upload-date", "version", "frameworks", "train_rows", "dev_rows", "test_rows", "schema", "tags", "description", "private", "metadata", "gpu_available"]
        data = [["COVID Mortality", "UCSF", "UCSF", get_key(), "12/18/2019", "1", "PT/TF/PD/NP/JX", 2626, 353, 366, "COVID-MORT-2", "#covid #or...", "This is the official statistics for COVID deaths within...", "True", "{'collected':2019}", "True"]]
        datasets = pd.DataFrame(columns=columns, data=data)
        datasets = datasets.set_index("name")


        columns=["schema-name", "networks", "domains", "popular_dataset_description", "n_datasets", "n_datasets_dedup", "train_rows", "dev_rows", "test_rows", "max_train_rows_per_domain", "max_dev_rows_per_domain", "max_test_rows_per_domain"]
        data = [["COVID-MORT-2", "UCSF,CDC,AMA", "UCSF,Atlanta,Chicago,Boston General", "Nationally reported on a daily basis, this dataset includes", 4,3, 39610, 1063, 366, 34632, 355, 366],
               ["AMA-DIABETES-TRIAL-252", "AMA", "Boston General", "In 2018, the American Medical Association...", 4,3, 23267, 335, 3463, 23267, 335, 3463]]
        distributed = pd.DataFrame(columns=columns, data=data)
        distributed = distributed.set_index("schema-name")

        class Networks():
            def __call__(self):
                return nets

            def datasets(self, *args, **kwargs):
                return datasets

            def distributed(self, *args, **kwargs):
                return distributed
            
            def search(self, *args, **kwargs):
                return self   
            
            def status(self):
                columns=["query", "timestamp", "duration (hours)", "time remaining", "% of all known domains", "% of domains alive in last 24 hours"]
                data = [["anywhere=sleep", time(), 24, 23.9, "35%", "98%"]]
                datasets = pd.DataFrame(columns=columns, data=data)
                return datasets

        networks = Networks()

    
        return networks

def search_entities(*args, **kwargs):
    if("ucsf" in kwargs['anywhere'] or "UCSF" in kwargs['anywhere']):
        columns=["pubkey", "name", "aliases", "opus_url", "verified_secondary_url"]
        data = [[get_key(), "University of California San Franscisco", ["UCSF", "ucsf"], "http://opus-id.org/ucsf", "https://www.ucsf.edu/"]]
        entities = pd.DataFrame(columns=columns, data=data)
        entities = entities.set_index("pubkey")

        return entities
    
gr.search = search_diabetes
gr.search_regex = search_diabetes
gr.searches = [{"anywhere":"diabetes", "hours_remaining":48}]
gr.search_entities = search_entities

In [461]:
result = gr.search(anywhere="diabetes")
result.validity_certificates()

Unnamed: 0_level_0,signer_pubkey,signer_name,signer url,dataset_id,dataset_name,certificate_authority_type,certificate_authority_url,certificate_metadata
cert_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
9289cd04c9,480bde0ed26895797eb40b6e8b3894dd713ac05ab5aa13...,UCSF,https://www.ucsf.edu/,af4c764567,Covid Mortality,https,https://letsencrypt.org/,{...}
8bd0cecebb,b22c6853427cc07275bb99712e605648a4bb706384e6de...,FDA,https://www.fda.gov/home,c586adb7af,Covid Mortality,https,https://letsencrypt.org/,{...}
14ab1cf435,e67f22cebbc593a8ae580e3720024c26fdedba8df6f991...,NSF,https://www.nsf.gov/,eba8c42b15,Covid Mortality,https,https://letsencrypt.org/,{...}
84eff4445f,507318e0aad6dd0494e088df8b5fd8c1e086227144931f...,AMA,https://www.ucsf.edu/,af0a2f4d8f,COVID Deaths,https,https://letsencrypt.org/,{...}
6bf5cc6f29,3104ff32cc69154edf4b526174f9630def5375ab1035d2...,CDC,https://www.cdc.gov/,816cbfdb12,US COVID Deaths,https,https://letsencrypt.org/,{...}
78e21a91ad,3104ff32cc69154edf4b526174f9630def5375ab1035d2...,CDC,https://www.cdc.gov/,00861bcab1,Covid Mortality,https,https://letsencrypt.org/,{...}


In [None]:


# diabetesSearch = network.search('diabetes') # search dataset name, description, and tags for 'diabetes'
diabetesSearch = network.search({ tag: 'diabetes' }) # specifically search for datasets with a tag of 'diabetes'

print(diabetesSearch)

"""
[
  {
    id: 1,
    name: 'Diabetes is terrible',
    description: '',
    node: 'ws://ucsf.com/pygrid',
    tags: ['diabetes', 'california', 'ucsf'],
    tensors: [
      {
        id: '1a',
        name: 'data',
        schema: []
      },
      {
        id: '1b',
        name: 'target',
        schema: []
      }
    ]
  },
  ...
]
"""

network.disconnect()

client = grid.connect(diabetesSearch[0].node) # 'ws://ucsf.com/pygrid'

user = client.signup('me@patrickcason.com', 'password')
# user = client.login('me@patrickcason.com', 'password')  # or, if you're already signed up

computeTypes = client.getComputeTypes()

"""
[
  {
    id: 1,
    name: 'EC2 P3',
    provider: 'AWS',
    cpu: {
      type: 'Intel Xeon 3.4GHz',
      cores: 32
    },
    gpu: {
      type: 'Tesla V100',
      min: 0,
      max: 8
    },
    ram: {
      value: 64,
      ordinal: 'gb'
    }
  },
  ...
]
"""

# env = user.createEnvironment() # creates the basic "default" environment for exploring

env = user.createEnvironment(computeTypes[0].id, {
    ram: Grid.RAM(32, 'gb'),
    gpu: 3
})

# Do stuff with "env"

# user.getEnvironments();