### User: Data Scientist


#### Goals:
- Select a dataset
- ETL the dataset into proper format

#### Summary:
- Select a network
- Select a dataset from the network or domain
- Extract and explore the dataset
- Perform transformations on the dataset if necessary

In [4]:
import syft as sy

# Let's check the list of the networks available
sy.networks

Unnamed: 0,Name,Hosted Domains,Datasets,Description,Tags,Url
0,United Nations,4,5,The UN hosts data related to the commodity and...,"[Commodities, Health]",https://un.openmined.org


In [47]:
# Get a client to the United Nations network
un_network = sy.networks[0]
un_network_client = un_network.login(email="sheldon@caltech.edu", password="bazinga")

# Now, let's check the list of datasets available on the network
un_network_client.datasets

Unnamed: 0,Name,Tags,Description,Dtype,Id,Domain,Shape
0,breast_cancer,"[mri, breast cancer, dicoms]",Labelled image dataset of patients suffering d...,ImageClassificationDataset,56lkw24,WHO,"((25000, 300, 300), (25000))"
1,canada_trade_data,"[canada, trade, un, commodities]",This dataset represents aggregated trade stati...,DataFrame,f3s9h1m,Canada,"(25000, 22)"
2,netherlands_trade_data,"[netherlands, trade, commodities, export]",This dataset represents aggregated trade stati...,DataFrame,2kf3o5d,Netherlands,"(35000, 22)"
3,italy_trade_data,"[italy, trade, un, commodities, export, import]",This dataset represents aggregated trade stati...,DataFrame,42wk65l,Italy,"(30000, 22)"
4,us_trade_data,"[us, trade, un, commodities]",This dataset represents aggregated trade stati...,DataFrame,86pfgh1,United States,"(40000, 22)"


In [None]:
# Select one of the commodity trade datasets by position
ca_trade_dataset_ptr = un_network_client.datasets[1]  # Select the Canada trade dataset

# A more expressive alternative is a .filter method,
# similar to the filter provided by Django's object manager.
# It can be left out of the demo if the dev effort is too high, but it would be useful in the longer run.
# For reference: https://docs.djangoproject.com/en/3.2/ref/models/querysets/#date
ca_trade_dataset_ptr = un_network_client.datasets.filter(tags__icontains="trade", dtype="DataFrame")[0]
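
A minimal sketch of how Django-style lookups such as `tags__icontains` could be evaluated against a dataset listing held in a pandas DataFrame. The helper `filter_datasets` and the three-row listing are hypothetical stand-ins, not part of syft, and only exact matches and `__icontains` are handled.

```python
import pandas as pd

# Hypothetical stand-in for the network's dataset listing (subset of the dummy store).
datasets = pd.DataFrame(
    [
        {
            "Name": "breast_cancer",
            "Tags": ["mri", "breast cancer", "dicoms"],
            "Dtype": "ImageClassificationDataset",
        },
        {
            "Name": "canada_trade_data",
            "Tags": ["canada", "trade", "un", "commodities"],
            "Dtype": "DataFrame",
        },
        {
            "Name": "us_trade_data",
            "Tags": ["us", "trade", "un", "commodities"],
            "Dtype": "DataFrame",
        },
    ]
)


def filter_datasets(df: pd.DataFrame, **lookups) -> pd.DataFrame:
    """Hypothetical helper: apply `field=value` or `field__icontains=value` lookups."""
    mask = pd.Series(True, index=df.index)
    for key, value in lookups.items():
        field, _, op = key.partition("__")
        column = df[field.capitalize()]  # listing columns are capitalized, e.g. "Tags"
        if op == "icontains":
            needle = str(value).lower()
            # List cells match if any item contains the needle; other cells by substring.
            mask &= column.apply(
                lambda cell: any(needle in str(item).lower() for item in cell)
                if isinstance(cell, list)
                else needle in str(cell).lower()
            )
        elif op in ("", "exact"):
            mask &= column == value
        else:
            raise ValueError(f"Unsupported lookup: {op!r}")
    return df[mask].reset_index(drop=True)


matches = filter_datasets(datasets, tags__icontains="trade", dtype="DataFrame")
print(matches["Name"].tolist())
```

All keyword arguments are combined with a logical AND, which mirrors how Django chains multiple lookups inside a single `filter` call.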

In [49]:
# Let's print the description and tags associated with the dataset to validate that we chose the right dataset.

print(f"Description: {ca_trade_dataset_ptr.description}\t")
print(f"Tags: {ca_trade_dataset_ptr.tags}\t")
print(f"Type: {ca_trade_dataset_ptr.dtype}\t")
print(f"Shape: {ca_trade_dataset_ptr.shape}\t")

Description: This dataset represents aggregated trade statistics as reported by Canada about what it believes was imported/exported to/from its country in Feb 2021.	
Tags: ['canada', 'trade', 'un', 'commodities']	
Type: DataFrame	
Shape: (25000, 22)	


In [16]:
# From the description and tags we can see that we have the correct dataset.
# We can also see that the dataset is tabular.
# Let's check if there is any metadata attached to the dataset
ca_trade_dataset_ptr.metadata

{'country': 'canada', 'type': 'trade', 'origin': 'un'}

In [34]:
# From the metadata it is clear that we have a pointer to Canada's commodity trade dataset.
# Now, let's check if any sample data has been provided

ca_trade_dataset_ptr.sample_data

Unnamed: 0,Classification,Year,Period,Period Desc.,Aggregate Level,Is Leaf Code,Trade Flow Code,Trade Flow,Reporter Code,Reporter,...,Partner,Partner ISO,Commodity Code,Commodity,Qty Unit Code,Qty Unit,Qty,Netweight (kg),Trade Value (US$),Flag
0,HS,2021,202102,February 2021,4,0,1,Imports,124,Canada,...,"Other Asia, nes",,6117,"Clothing accessories; made up, knitted or croc...",0,,,,9285,0
1,HS,2021,202102,February 2021,2,0,1,Imports,124,Canada,...,Egypt,,18,Cocoa and cocoa preparations,0,,,0.0,116604,0
2,HS,2021,202102,February 2021,2,0,1,Imports,124,Canada,...,United Kingdom,,18,Cocoa and cocoa preparations,0,,,0.0,1495175,0


In [40]:
# Great, we have sample data attached. This will help us understand the dataset in more depth.

# Let's check all the columns present in the sample data
ca_trade_dataset_ptr.sample_data.columns

Index(['Classification', 'Year', 'Period', 'Period Desc.', 'Aggregate Level',
       'Is Leaf Code', 'Trade Flow Code', 'Trade Flow', 'Reporter Code',
       'Reporter', 'Reporter ISO', 'Partner Code', 'Partner', 'Partner ISO',
       'Commodity Code', 'Commodity', 'Qty Unit Code', 'Qty Unit', 'Qty',
       'Netweight (kg)', 'Trade Value (US$)', 'Flag'],
      dtype='object')

In [31]:
# We can see that each row describes a commodity and the type of trade being performed, i.e. import or export.
# It also defines the partner with which the trade is performed, the quantity of the commodity
# being traded, and the amount transacted (Trade Value in USD).

# To understand the dataset in more detail, let's check if a description of the columns is provided.
# ca_trade_dataset_ptr.column_description

Unnamed: 0,Column,Description,Private
0,Classification,Commodity Classification (HS= Harmonized System),True
1,Year,4-digit year,False
2,Period,yyyymm,False
3,Period Desc.,Description,False
4,Aggregate level,"Level of reporting (6,4,2,0, where 0=total level)",True
5,Is Leaf code,Basic/Aggregated (0=basic level),True
6,Trade Flow Code,"Imports, Re-imports, Exports, Re-exports",True
7,Trade Flow,Description,True
8,Reporter Code,UN Country Code,False
9,Reporter,Description,False


In [None]:
# As the Data Scientist, we want to return a list of commodities 
# where the ratio of expected imports / exports is off by 10% or more.

# To achieve the above, we don't need all the columns of the dataset.
# Let's keep only the columns we need.

required_columns = ["Classification", "Commodity Code", "Commodity", "Trade Value (US$)", "Partner"]

filtered_dataset_ptr = ca_trade_dataset_ptr[required_columns]
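
The stated goal (commodities whose import/export ratio is off by 10% or more) can be sketched locally with pandas. This is only one possible reading: the notebook does not define the "expected" ratio, so `expected_ratio` and `tolerance` are assumed parameters, `flag_ratio_outliers` is a hypothetical helper, and the sample values are made up. Note that the calculation also needs the `Trade Flow` column, which is not in `required_columns`.

```python
import pandas as pd

# Made-up rows using column names from the sample data; values are illustrative only.
sample = pd.DataFrame(
    {
        "Commodity Code": [18, 18, 6117, 6117, 10, 10],
        "Commodity": ["Cocoa", "Cocoa", "Clothing", "Clothing", "Cereals", "Cereals"],
        "Trade Flow": ["Imports", "Exports", "Imports", "Exports", "Imports", "Exports"],
        "Trade Value (US$)": [1000, 1000, 1500, 1000, 950, 1000],
    }
)


def flag_ratio_outliers(df, expected_ratio=1.0, tolerance=0.10):
    """Return commodities whose imports/exports ratio deviates from
    `expected_ratio` by at least `tolerance` (relative deviation)."""
    # Total trade value per commodity, one column per trade flow.
    totals = df.pivot_table(
        index=["Commodity Code", "Commodity"],
        columns="Trade Flow",
        values="Trade Value (US$)",
        aggfunc="sum",
        fill_value=0,
    )
    ratio = totals["Imports"] / totals["Exports"]
    deviation = (ratio - expected_ratio).abs() / expected_ratio
    result = totals.assign(ratio=ratio, deviation=deviation)
    return result[result["deviation"] >= tolerance].reset_index()


flagged = flag_ratio_outliers(sample)
print(flagged[["Commodity", "ratio", "deviation"]])
```

With these sample values only "Clothing" is flagged, since its imports are 1.5 times its exports, while "Cereals" at 0.95 stays within the 10% tolerance.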

#### Great! We were able to select a dataset, extract it, and load it as per our requirements.

### Dummy Data Creation

In [29]:
import pandas as pd
from enum import Enum


class bcolors(Enum):
    HEADER = "\033[95m"
    OKBLUE = "\033[94m"
    OKCYAN = "\033[96m"
    OKGREEN = "\033[92m"
    WARNING = "\033[93m"
    FAIL = "\033[91m"
    ENDC = "\033[0m"
    BOLD = "\033[1m"
    UNDERLINE = "\033[4m"


# Dummy available networks
available_networks = [
    {
        "Name": "United Nations",
        "Hosted Domains": 4,
        "Datasets": 5,
        "Description": "The UN hosts data related to the commodity and health sector.",
        "Tags": ["Commodities", "Health"],
        "Url": "https://un.openmined.org",
    }
]
available_networks = pd.DataFrame(available_networks)

## Dummy Data Store
dataset_store = [
    {
        "Name": "breast_cancer",
        "Tags": ["mri", "breast cancer", "dicoms"],
        "Description": "Labelled image dataset of patients suffering different types of breast cancer",
        "Dtype": "ImageClassificationDataset",
        "Id": "56lkw24",
        "Domain": "WHO",
        "Shape": "((25000, 300, 300), (25000))",
    },
    {
        "Name": "canada_trade_data",
        "Tags": ["canada", "trade", "un", "commodities"],
        "Description": "This dataset represents aggregated trade statistics as reported by Canada about what it believes was imported/exported to/from its country in Feb 2021.",
        "Dtype": "DataFrame",
        "Id": "f3s9h1m",
        "Domain": "Canada",
        "Shape": "(25000, 22)",
    },
    {
        "Name": "netherlands_trade_data",
        "Tags": ["netherlands", "trade", "commodities", "export"],
        "Description": "This dataset represents aggregated trade statistics as reported by Netherlands about what it believes was imported/exported to/from its country in Feb 2021.",
        "Dtype": "DataFrame",
        "Id": "2kf3o5d",
        "Domain": "Netherlands",
        "Shape": "(35000, 22)",
    },
    {
        "Name": "italy_trade_data",
        "Tags": ["italy", "trade", "un", "commodities", "export", "import"],
        "Description": "This dataset represents aggregated trade statistics as reported by Italy about what it believes was imported/exported to/from its country in Feb 2021.",
        "Dtype": "DataFrame",
        "Id": "42wk65l",
        "Domain": "Italy",
        "Shape": "(30000, 22)",
    },
    {
        "Name": "us_trade_data",
        "Tags": ["us", "trade", "un", "commodities"],
        "Description": "This dataset represents aggregated trade statistics as reported by United States about what it believes was imported/exported to/from its country in Feb 2021.",
        "Dtype": "DataFrame",
        "Id": "86pfgh1",
        "Domain": "United States",
        "Shape": "(40000, 22)",
    },
]

dataset_store = pd.DataFrame(dataset_store)

# print(f"Description: {dataset_store['Description'][1]}\t")
# print(f"Tags: {dataset_store['Tags'][1]}\t")
# print(f"Type: {dataset_store['Dtype'][1]}\t")
# print(f"Shape: {dataset_store['Shape'][1]}\t")


# dummy canada metadata
ca_metadata = {"country": "canada", "type": "trade", "origin": "un"}

# dummy canada dataset
ca_dataset = pd.read_csv("datasets/ca - feb 2021.csv")

# Dummy dataset schema
dataset_schema = pd.read_csv("datasets/schema.csv")

private_values = [
    True,
    False,
    False,
    False,
    True,
    True,
    True,
    True,
    False,
    False,
    False,
    False,
    False,
    False,
    False,
    True,
    True,
    False,
    False,
    False,
    True,
    False,
]

dataset_schema["Private"] = private_values
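
A positional list of 22 booleans is easy to misalign with the schema rows. As a sketch of an alternative, the flags can be derived from a set of private column names. It assumes the schema has a `Column` field (as in the column description output) and that the positional flags follow the column order of the sample data. The `schema` frame below is a stand-in for `datasets/schema.csv`, which is not reproduced here.

```python
import pandas as pd

# Stand-in for datasets/schema.csv: one row per column, names as in the sample data
# (with the casing variants seen in the column description output).
schema = pd.DataFrame(
    {
        "Column": [
            "Classification", "Year", "Period", "Period Desc.", "Aggregate level",
            "Is Leaf code", "Trade Flow Code", "Trade Flow", "Reporter Code",
            "Reporter", "Reporter ISO", "Partner Code", "Partner", "Partner ISO",
            "Commodity Code", "Commodity", "Qty Unit Code", "Qty Unit", "Qty",
            "Netweight (kg)", "Trade Value (US$)", "Flag",
        ]
    }
)

# Columns marked True in the positional list, assuming that list follows the column order.
private_columns = {
    "Classification", "Aggregate Level", "Is Leaf Code", "Trade Flow Code",
    "Trade Flow", "Commodity", "Qty Unit Code", "Trade Value (US$)",
}

# Compare case-insensitively, since the schema and the data differ in casing.
private_lower = {name.lower() for name in private_columns}
schema_lower = schema["Column"].str.lower()

# Fail loudly if a private name does not exist in the schema at all.
unknown = private_lower - set(schema_lower)
if unknown:
    raise ValueError(f"Unknown private columns: {sorted(unknown)}")

schema["Private"] = schema_lower.isin(private_lower)
print(schema[schema["Private"]]["Column"].tolist())
```

Keying the flags by name means a reordered or extended schema file does not silently shift the privacy markers.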


