### Ignore the following cell

The following cell creates the fake output that later cells in this notebook return.

In [114]:
import pandas as pd

import os
from binascii import hexlify

def get_key():
    key = hexlify(os.urandom(32)).decode()
    return key

class Grid:
    """Mock stand-in for the grid client."""
class Wallet:
    """Mock stand-in for the local key wallet."""
gr = Grid()
gr.wallet = Wallet()

#ignore this...it's just to support the mock API
columns=["network", "domain", "pubkey", "prikey"]
data = [["OpenGrid", "PatrickCason", get_key(), get_key()],
       ["OpenGrid", "AndrewTrask", get_key(), get_key()],
       ["OpenGrid", "TudorCebere", get_key(), get_key()],
       ["OpenGrid", "JasonMancuso", get_key(), get_key()],
       ["OpenGrid", "BobbyWagner", get_key(), get_key()],
       ["AMA", "UCSF", get_key(), get_key()],
       ["AMA", "Vanderbilt", get_key(), get_key()],
       ["AMA", "MDAnderson", get_key(), get_key()],
       ["AMA", "BostonGeneral", get_key(), get_key()],
       ["AMA", "HCA", get_key(), get_key()],
       ["CDC", "Atlanta", get_key(), get_key()],
       ["CDC", "New York", get_key(), get_key()],
       ]
domain_keys = pd.DataFrame(columns=columns, data=data)
gr.wallet.domain_keys = domain_keys

#ignore this...it's just to support the mock API
columns=["id", "name", "datasets", "models", "domains", "online", "registered"]
data = [[235252, "OpenGrid", 235262, 2352, 2532, 2352, 23],
       [634252, "AMA", 2352, 236622, 53, 52, 23],
       [745742, "CDC", 35, 0, 5, 5, 5]]
networks = pd.DataFrame(columns=columns, data=data)
networks = networks.set_index("name")    
gr.networks = networks

def save_network(url):
    # mock: the url is ignored and a fixed "NHS" row is always returned
    columns=["id", "name", "datasets", "models", "domains", "online", "registered"]
    data = [[2352, "NHS", 86585, 6585, 5, 5, 0]]
    network = pd.DataFrame(columns=columns, data=data)
    network = network.set_index("name")
    
    gr.networks = pd.concat([gr.networks, network])
    print("Connecting... SUCCESS!")
    return network

gr.save_network = save_network

# Step 1: Imports

To start, from a client perspective, we want to maximize convenience and minimize the number of dependencies one needs to install to work with PyGrid. Thus, in an ideal world, users only have to install one Python package in order to work with all of PyGrid. I like the current design in syft 0.2.x, where the grid clients live in a grid package inside of Syft. The thing we definitely want to avoid is requiring users of PyGrid to install all of the dependencies needed to _run grid nodes_ (Flask, databases, etc.) just to interact with the grid. Putting grid inside of Syft solves this as well.

In [115]:
# import syft as sy
# from syft import grid as gr

# Step 2: Default Networks

By default, it would be really great if we could support a combination of two lists of networks:

- networks which all users of PySyft have by default (OpenGrid)
- a history of all networks previously accessed (stored in some local config file)

We should be able to view these available networks by just calling `gr.networks`, which should pretty-print information about them. One way to pretty-print is with a plain pandas table, as shown below.

In [116]:
gr.networks

name,id,datasets,models,domains,online,registered
OpenGrid,235252,235262,2352,2532,2352,23
AMA,634252,2352,236622,53,52,23
CDC,745742,35,0,5,5,5


This table displays several useful pieces of information:

- Id: the unique id of the network
- Name: the name of the network
- Datasets: the number of datasets currently hosted on the network
- Models: the number of models currently hosted on the network (public and private)
- Domains: the number of domains registered to this network (i.e., the number of individual hospitals)
- Online: the number of domains which are currently online
- Registered: the number of domains for which this user already has an account 
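
The combination of built-in defaults and locally stored history described in this step could be sketched roughly as follows. This is only a sketch under assumptions: the `DEFAULT_NETWORKS` constant, the `load_networks` helper, and the JSON layout of the history file are all hypothetical and not part of any existing PySyft or PyGrid API.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical built-in defaults; the numbers mirror the mock table above.
DEFAULT_NETWORKS = [
    {"id": 235252, "name": "OpenGrid", "datasets": 235262, "models": 2352,
     "domains": 2532, "online": 2352, "registered": 23},
]


def load_networks(history_path):
    """Combine built-in networks with those recorded in a local JSON file."""
    path = Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else []
    # Defaults come first; a history entry that reuses a default name is dropped.
    known = {n["name"] for n in DEFAULT_NETWORKS}
    rows = DEFAULT_NETWORKS + [n for n in history if n["name"] not in known]
    return pd.DataFrame(rows).set_index("name")


# Usage: write a small history file, then load the combined table.
with tempfile.TemporaryDirectory() as tmp:
    history_file = Path(tmp) / "networks.json"
    history_file.write_text(json.dumps([
        {"id": 634252, "name": "AMA", "datasets": 2352, "models": 236622,
         "domains": 53, "online": 52, "registered": 23},
    ]))
    networks_table = load_networks(history_file)

print(networks_table)
```

Keeping the defaults in code and the history in a user-owned file means a fresh install still shows OpenGrid, while previously visited networks survive across sessions.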

# Step 3: Local Wallet

Somewhere in the local filesystem, we need to save the set of all keys/logins which this user has for various domains around the world. We should be able to see them here. Note that this list is what determines the "Registered" number for each network.

In [117]:
gr.wallet.domain_keys

,network,domain,pubkey,prikey
0,OpenGrid,PatrickCason,b0aa2f9121e03c12036b99fea2a29efd2580176627acf8...,a39a0ee3fbc246157696c37acaf3372107408a2021fb86...
1,OpenGrid,AndrewTrask,5f3943f8466a7ed9f716cb9e6afd5dd08d1b16384fcfb8...,c619dea3935d05ee697718df35dce8daf1a21214ebfc40...
2,OpenGrid,TudorCebere,eaf2ca574c2c8bd26daefd0b1a8e09bf4b9b3b897b1628...,d9a48f06bedd4e0ebfcab8b9f6053fd809824ba8613a08...
3,OpenGrid,JasonMancuso,71f2d1b793ddeabc2793eb7c7858891c2b9bf4d281ea91...,f4a15b8fc20ae647bf606c30d7a55cddb6be12d8cac455...
4,OpenGrid,BobbyWagner,91aef91491ee2f2ae5080192981084c13cadefd18de5ef...,c3334f60a553af41f73f3c2e00138d7808155135dfcb8f...
5,AMA,UCSF,c0ebfa72bcb131d17a3eeebebe63a554423ed26330bc8c...,f861834ca3501747b1184d7e5668fb200b8e8880e5c879...
6,AMA,Vanderbilt,dd6ee00c47640100e00c90f27fb306b81726b740619d9e...,289b31546174285d842bd50a214fd554bf10ae2aeff0bc...
7,AMA,MDAnderson,04648e0e12ddcb902c95038d39832bb15bb5ea1b894628...,9ee8e21a4b8776c216a8202642937d2f52aaae7cf14332...
8,AMA,BostonGeneral,e9b21c88947b2bc3cc7bd7e30285a9c9e312d8e5dd1273...,7506c14b1515ff07a7d57da76559b662753ab2c8747237...
9,AMA,HCA,a48d2ed6d8c65b9c21f99d5ac5fb3459d62494dacaf4e2...,573c518fce2fd1832288571793b7ecf880ddaefaf69066...
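
As a rough illustration of how the wallet could drive the "Registered" column, the sketch below counts wallet entries per network and writes the result into a networks table. The two small DataFrames are trimmed stand-ins invented for this example (keys are omitted), so the counts are illustrative and do not match the mock numbers shown earlier.

```python
import pandas as pd

# A trimmed stand-in for gr.wallet.domain_keys (key columns omitted for brevity).
wallet_keys = pd.DataFrame(
    columns=["network", "domain"],
    data=[["OpenGrid", "PatrickCason"], ["OpenGrid", "AndrewTrask"],
          ["AMA", "UCSF"], ["CDC", "Atlanta"]],
)

# A trimmed stand-in for gr.networks.
networks_table = pd.DataFrame(
    columns=["name", "domains"],
    data=[["OpenGrid", 2532], ["AMA", 53], ["CDC", 5], ["NHS", 5]],
).set_index("name")

# Count wallet entries per network; networks with no keys get 0.
registered = wallet_keys.groupby("network").size()
networks_table["registered"] = registered.reindex(
    networks_table.index, fill_value=0
)

print(networks_table)
```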


# Step 4: Adding Another Network

We should be able to add another network by simply dropping in the URL of the network node (much like adding another PyPI or npm repository).

In [118]:
gr.save_network('http://nhs.co.uk/pygrid') # it's a network

Connecting... SUCCESS!


name,id,datasets,models,domains,online,registered
NHS,2352,86585,6585,5,5,0


In [119]:
gr.networks

name,id,datasets,models,domains,online,registered
OpenGrid,235252,235262,2352,2532,2352,23
AMA,634252,2352,236622,53,52,23
CDC,745742,35,0,5,5,5
NHS,2352,86585,6585,5,5,0
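
The mock above always returns the same row regardless of its argument. A slightly more realistic sketch, still without any real networking, might take the URL, ask the node for its summary, and merge the result into the known networks. Everything here is an assumption for illustration: `fake_fetch_summary` stands in for whatever request the real client would make, and the function returns the updated table instead of mutating `gr.networks`.

```python
import pandas as pd

COLUMNS = ["id", "name", "datasets", "models", "domains", "online", "registered"]


def fake_fetch_summary(url):
    """Hypothetical stand-in for a request to the network node at `url`."""
    return {"id": 2352, "name": "NHS", "datasets": 86585, "models": 6585,
            "domains": 5, "online": 5, "registered": 0}


def save_network(networks, url, fetch_summary=fake_fetch_summary):
    """Return (updated table, new row) after adding the network at `url`."""
    summary = fetch_summary(url)
    row = pd.DataFrame([summary], columns=COLUMNS).set_index("name")
    # Drop any stale row with the same name so re-adding does not duplicate it.
    updated = pd.concat([networks.drop(index=row.index, errors="ignore"), row])
    return updated, row


known = pd.DataFrame(
    [[745742, "CDC", 35, 0, 5, 5, 5]], columns=COLUMNS
).set_index("name")

known, added = save_network(known, "http://nhs.co.uk/pygrid")
known, _ = save_network(known, "http://nhs.co.uk/pygrid")  # adding twice is harmless

print(known)
```

Passing the fetch function as a parameter keeps the sketch testable offline; a real client would presumably also persist the URL to the local history file.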


# Step 5: Find a Dataset

Now that we are connected to a variety of domains within a group of networks we know about, we want to begin doing some data science. There are a few data structures we need to know about first:

- Dataset: This is a dataset object existing within a single Domain. It consists of:
    - name (public - required): the name of the dataset
    - id (public - required): the uid of the dataset
    - frameworks (public - required): the available frameworks for this dataset (derived from supported frameworks for the worker). Grouped into train, dev, and test.
    - tensors (public - required): a name->tensor dictionary enumerating the dataset's tensors (stored by default as pandas dataframes)
    - schema (public - required): the DatasetSchema of the dataset - which is the name->schema mapping for each TensorSchema. Identical across train, dev, and test
    - tags (public - required): a list of tags affiliated with this dataset
    - description (public - required): a free text description of the dataset
    - raw (private - optional): the raw version of the dataset (such as a CSV file, free text file, etc.)
    - metadata (public - optional): additional metadata someone wants to use for this dataset. We assume all of this data is public.
    - worst_case_user_budget: inferred values based on the worst case user-budget parameter within the dataset's tensors (see tensor user-budget)
    - private: does the dataset contain private tensors?
    
- Tensor:
    - name: the name of a tensor
    - schema (required - public - TensorSchema object): the schema of the tensor (type, name, and description for each column)
    - mock (generated): a mock tensor generated from the TensorSchema
    - id: the uid of the tensor
    - data: the tensor's values
    - tags (optional):
    - description (optional):
    - shape (required - public): the shape of the tensor
    - value: the tensor itself
    - private: is the tensor a private tensor?
    - sensitivity (optional): the sensitivity metadata for a tensor
        - h (public - derived from schema): the maximum values a tensor can take on, derived from the schema
        - l (public - derived from schema): the minimum values a tensor can take on, derived from the schema
        - e^h (private) - the max contributions from entities, initialized with the tensor
        - e^l (private) - the min contributions from entities, initialized with the tensor
    - accountant (private reference to global privacy accountant)
    - worst_case_user_budget: inferred values based on the worst case user-budget parameter across the entities in the tensor (see Entity.user_budget)

- Entity:
    - uid (required, randomly generated, public)
    - metadata (optional)
    - user_budget (public - required): the per-user privacy budget parameters for this dataset:
        - lifetime_train: the total epsilon from the training dataset which can be published to the greater public (i.e., when a data scientist intends to release a number openly)
        - lifetime_dev: the total epsilon from the dev dataset which can be published to the greater public (i.e., when a data scientist intends to release a number openly)
        - lifetime_test: the total epsilon from the test dataset which can be published to the greater public (i.e., when a data scientist intends to release a number openly)
        - user_lifetime_train: the total epsilon each data scientist gets when interacting with the training dataset
        - user_lifetime_dev: the total epsilon each data scientist gets when interacting with the dev dataset
        - user_lifetime_test: the total epsilon each data scientist gets when interacting with the test dataset
        - daily_auto_train: the amount of epsilon each data scientist gets per day for sample statistics which doesn't require compliance officer review
        - daily_auto_dev: the amount of epsilon each data scientist gets per day for sample statistics which doesn't require compliance officer review        
        - daily_auto_test: the amount of epsilon each data scientist gets per day for sample statistics which doesn't require compliance officer review                
        - query_auto_train: the maximum amount of epsilon one query can return which doesn't require officer review when interacting with the training dataset
        - query_auto_dev: the maximum amount of epsilon one query can return which doesn't require officer review when interacting with the dev dataset
        - query_auto_test: the maximum amount of epsilon one query can return which doesn't require officer review when interacting with the test dataset


- TensorSchema:
    - name: the name of the schema 
    - columns: each column has a type, name, and description
    
- DatasetSchema: this is the schema of a dataset. Importantly, we try to encourage datasets in multiple locations to intentionally subscribe to the same schema so as to best facilitate Federated Learning.

- DistributedDataset: this is a virtual object which refers to a collection of Dataset objects which all subscribe to the same DatasetSchema. It is a convenient object because it gives you fast access to datasets at multiple institutions which are appropriate to train on together.
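
To make two of these relationships concrete, here is a minimal sketch using plain dataclasses. It is not a proposed implementation: the classes carry only a few of the fields listed above, "worst case" is assumed to mean the smallest (most restrictive) epsilon across entities, and the `group_by_schema` helper and the `MNIST-1` schema name are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entity:
    uid: str
    # Maps a budget parameter name (e.g. "lifetime_train") to an epsilon value.
    user_budget: dict = field(default_factory=dict)


@dataclass
class Tensor:
    name: str
    entities: list = field(default_factory=list)

    def worst_case_user_budget(self):
        """Smallest epsilon per parameter across this tensor's entities.

        Assumption: "worst case" means the most restrictive (minimum) value.
        """
        worst = {}
        for entity in self.entities:
            for key, eps in entity.user_budget.items():
                worst[key] = min(eps, worst.get(key, eps))
        return worst


@dataclass
class Dataset:
    name: str
    domain: str
    schema: str  # name of the DatasetSchema this dataset subscribes to


def group_by_schema(datasets):
    """Group dataset names by shared schema, as a DistributedDataset might."""
    groups = defaultdict(list)
    for ds in datasets:
        groups[ds.schema].append(ds.name)
    return dict(groups)


tensor = Tensor("data", [
    Entity("a", {"lifetime_train": 1.0, "daily_auto_train": 0.1}),
    Entity("b", {"lifetime_train": 0.5, "daily_auto_train": 0.2}),
])
worst = tensor.worst_case_user_budget()

groups = group_by_schema([
    Dataset("COVID Mortality", "UCSF", "COVID-MORT-2"),
    Dataset("US COVID Deaths", "CDC", "COVID-MORT-2"),
    Dataset("MNIST", "OpenGrid", "MNIST-1"),  # hypothetical schema name
])

print(worst)
print(groups)
```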

In [218]:
def search_diabetes(*args, **kwargs):
    
    columns=["distributed_datasets", "datasets", "tensors", "dataset_schemas", "tensor_schemas"]
    data = [[23, 75474, 947467, 532, 23]]
    nets = pd.DataFrame(columns=columns, data=data)
    
    columns=["name", "domain", "id", "upload-date", "version", "frameworks", "train_rows", "dev_rows", "test_rows", "schema", "tags", "description", "private", "metadata"]
    data = [["COVID Mortality", "UCSF", get_key(), "12/18/2019", "1", "PT/TF/PD/NP/JX", 2626, 353, 366, "COVID-MORT-2", "#covid #or...", "This is the official statistics for COVID deaths within...", "True", "{'collected':2019}"],
           ["US COVID Deaths", "CDC", get_key(), "1/18/2020", "23", "PT/TF/PD/NP/JX", 34632, 355, 0, "COVID-MORT-2", "#covid #or...", "Nationally reported on a daily basis, this dataset includes", "True", "{'collected':2020}"],
           ["COVID Deaths", "AMA", get_key(), "2/20/2020", "2", "PT/TF/PD/NP/JX", 2352, 335, 0, "COVID-MORT-2", "#covid #or...", "With attributes including risk factors like diabetes...", "True", "{'collected':2020}"]]
    datasets = pd.DataFrame(columns=columns, data=data)
    datasets = datasets.set_index("name")
    
    class Networks():
        def __repr__(self):
            return str(nets)
        
        def datasets(self, *args, **kwargs):
            return datasets
            
    networks = Networks()
    
    
    return networks
gr.search = search_diabetes

In [219]:
result = gr.search(anywhere="diabetes")

result = gr.search(anywhere=["diabetes"])
result = gr.search(tags="diabetes")
result = gr.search(exact_tags=["diabetes"])
result = gr.search(description="diabetes COVID mortality")
result = gr.search(description=["diabetes", "COVID", "mortality"])
result = gr.search(name="MNIST")
result = gr.search(exact_name="MNIST")

result

   distributed_datasets  datasets  tensors  dataset_schemas  tensor_schemas
0                    23     75474   947467              532              23
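
The keyword arguments shown in the search calls above could be interpreted in several ways. The sketch below picks one set of semantics purely as an assumption: `name`, `tags`, `description`, and `anywhere` are case-insensitive substring matches, `exact_name` and `exact_tags` require equality, a list means every term must match, and all supplied filters are combined with AND. The `catalog` table is invented for the example.

```python
import pandas as pd

catalog = pd.DataFrame(
    columns=["name", "tags", "description"],
    data=[
        ["COVID Mortality", ["covid", "diabetes"], "Official COVID death statistics"],
        ["MNIST", ["digits"], "Handwritten digit images"],
        ["MNIST-diabetes", ["toy"], "A toy dataset"],
    ],
)


def _terms(value):
    """Accept either a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def search(df, anywhere=None, tags=None, exact_tags=None,
           description=None, name=None, exact_name=None):
    mask = pd.Series(True, index=df.index)
    lower_name = df["name"].str.lower()
    lower_desc = df["description"].str.lower()
    lower_tags = df["tags"].apply(lambda ts: [t.lower() for t in ts])

    if exact_name is not None:
        mask &= df["name"] == exact_name
    if name is not None:
        mask &= lower_name.str.contains(name.lower(), regex=False)
    for term in _terms(description or []):
        mask &= lower_desc.str.contains(term.lower(), regex=False)
    for term in _terms(exact_tags or []):
        mask &= lower_tags.apply(lambda ts, t=term.lower(): t in ts)
    for term in _terms(tags or []):
        mask &= lower_tags.apply(lambda ts, t=term.lower(): any(t in x for x in ts))
    for term in _terms(anywhere or []):
        t = term.lower()
        mask &= (lower_name.str.contains(t, regex=False)
                 | lower_desc.str.contains(t, regex=False)
                 | lower_tags.apply(lambda ts, t=t: any(t in x for x in ts)))
    return df[mask]


print(search(catalog, anywhere="diabetes")["name"].tolist())
print(search(catalog, exact_name="MNIST")["name"].tolist())
```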

In [220]:
result.datasets(latest_version_only=True)

name,domain,id,upload-date,version,frameworks,train_rows,dev_rows,test_rows,schema,tags,description,private,metadata
COVID Mortality,UCSF,a0bfac695d6542527fd04d0574a6e0516d4819974da2dd...,12/18/2019,1,PT/TF/PD/NP/JX,2626,353,366,COVID-MORT-2,#covid #or...,This is the official statistics for COVID deat...,True,{'collected':2019}
US COVID Deaths,CDC,8d26a2fa01636a8067435f0e4dbac1f95472cd6abc1f15...,1/18/2020,23,PT/TF/PD/NP/JX,34632,355,0,COVID-MORT-2,#covid #or...,"Nationally reported on a daily basis, this dat...",True,{'collected':2020}
COVID Deaths,AMA,2c27aa9571515c16bbb17126e75a1ab68ad585013708c4...,2/20/2020,2,PT/TF/PD/NP/JX,2352,335,0,COVID-MORT-2,#covid #or...,With attributes including risk factors like di...,True,{'collected':2020}
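
One plausible reading of `latest_version_only=True` is "keep only the highest version of each dataset within a domain". The sketch below assumes exactly that, and also assumes versions are integer-like strings as in the mock table; the `latest_only` helper and the sample rows are hypothetical.

```python
import pandas as pd

versions = pd.DataFrame(
    columns=["name", "domain", "version"],
    data=[["COVID Deaths", "AMA", "1"],
          ["COVID Deaths", "AMA", "2"],
          ["US COVID Deaths", "CDC", "23"],
          ["US COVID Deaths", "CDC", "9"]],
)


def latest_only(df):
    """Keep the row with the highest numeric version per (domain, name)."""
    # Compare versions as integers so that "23" ranks above "9".
    ranked = df.assign(_v=df["version"].astype(int))
    newest = ranked.loc[ranked.groupby(["domain", "name"])["_v"].idxmax()]
    return newest.drop(columns="_v").sort_index()


print(latest_only(versions))
```

Comparing the raw strings would rank "9" above "23", which is why the sketch converts to integers first.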


In [None]:


# diabetesSearch = network.search('diabetes') # search dataset name, description, and tags for 'diabetes'
diabetesSearch = network.search({'tag': 'diabetes'}) # specifically search for datasets with a tag of 'diabetes'

print(diabetesSearch)

"""
[
  {
    id: 1,
    name: 'Diabetes is terrible',
    description: '',
    node: 'ws://ucsf.com/pygrid',
    tags: ['diabetes', 'california', 'ucsf'],
    tensors: [
      {
        id: '1a',
        name: 'data',
        schema: []
      },
      {
        id: '1b',
        name: 'target',
        schema: []
      }
    ]
  },
  ...
]
"""

network.disconnect()

client = grid.connect(diabetesSearch[0].node) # 'ws://ucsf.com/pygrid'

user = client.signup('me@patrickcason.com', 'password')
# user = client.login('me@patrickcason.com', 'password')  # or, if you're already signed up

computeTypes = client.getComputeTypes()

"""
[
  {
    id: 1,
    name: 'EC2 P3',
    provider: 'AWS',
    cpu: {
      type: 'Intel Xeon 3.4GHz',
      cores: 32
    },
    gpu: {
      type: 'Tesla V100',
      min: 0,
      max: 8
    },
    ram: {
      value: 64,
      ordinal: 'gb'
    }
  },
  ...
]
"""

# env = user.createEnvironment() # creates the basic "default" environment for exploring

env = user.createEnvironment(computeTypes[0].id, {
    'ram': Grid.RAM(32, 'gb'),
    'gpu': 3
})

# Do stuff with "env"

# user.getEnvironments()
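
The compute type listing above includes limits such as a GPU `min`/`max` and a RAM value, which suggests that an environment request could be checked on the client before it is sent. A minimal sketch of such a check follows; the `ComputeType` dataclass, the `validate_request` helper, and the interpretation of the RAM value as an upper bound are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ComputeType:
    id: int
    name: str
    gpu_min: int
    gpu_max: int
    ram_gb: int  # assumed to be the maximum RAM available, in gigabytes


def validate_request(compute_type, ram_gb, gpu):
    """Raise ValueError if the request exceeds what the compute type offers."""
    if not compute_type.gpu_min <= gpu <= compute_type.gpu_max:
        raise ValueError(
            f"gpu must be between {compute_type.gpu_min} and {compute_type.gpu_max}"
        )
    if not 0 < ram_gb <= compute_type.ram_gb:
        raise ValueError(f"ram must be between 1 and {compute_type.ram_gb} gb")
    return {"compute_type": compute_type.id, "ram_gb": ram_gb, "gpu": gpu}


# Values taken from the 'EC2 P3' example listing above.
p3 = ComputeType(id=1, name="EC2 P3", gpu_min=0, gpu_max=8, ram_gb=64)
request = validate_request(p3, ram_gb=32, gpu=3)
print(request)
```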