### Searching for a Dataset

A Dataset object represents a dataset uploaded to a domain. When a user searches for a dataset, the following properties of each dataset are visible:

- Id (unique Id of the dataset)
- Domain (Name of the domain)
- Network (Name of the network)
- Assets (Name of each asset along with its type: DataFrame, Tensor, NumPy array, etc. A dataset can have multiple assets)
- Name (Name of the dataset)
- Tags (List of tags)
- Description (Description of the dataset)
- Usage (Number of users who have used this dataset) **[P1]**
- Added On (Date on which the dataset was added to the domain) **[P1]**

Users should be able to perform the following operations during a search for hosted datasets:
- List all the available datasets **[P0]**
- Filter the datasets via the properties of the dataset **[P1]** 
  
  Properties on which the user can perform a filter:
  - Id
  - Domain
  - Network
  - Name
  - Tags
- Group the datasets by Domain or Network. **[P2]**
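
The grouping operation is not demonstrated in the cells that follow, so here is a minimal sketch of what it could look like. It works on plain dictionaries rather than real `syft` objects, and `group_datasets` is a hypothetical helper, not part of any existing API.

```python
from collections import defaultdict

# Hypothetical catalogue rows, reduced to the columns needed for grouping.
CATALOG = [
    {"Name": "Diabetes Dataset", "Domain": "California Healthcare Foundation", "Network": "WHO"},
    {"Name": "Canada Commodities Dataset", "Domain": "Canada Domain", "Network": "United Nations"},
    {"Name": "Italy Commodities Dataset", "Domain": "Italy Domain", "Network": "United Nations"},
]


def group_datasets(datasets, by):
    """Return {group value: [dataset names]} for `by` in {"Domain", "Network"}."""
    if by not in ("Domain", "Network"):
        raise ValueError(f"Cannot group by {by!r}; use 'Domain' or 'Network'.")
    groups = defaultdict(list)
    for dataset in datasets:
        groups[dataset[by]].append(dataset["Name"])
    return dict(groups)


by_network = group_datasets(CATALOG, "Network")
for network, names in by_network.items():
    print(f"{network}: {names}")
```

Restricting `by` to the two allowed properties mirrors the requirement above; a real implementation might expose this differently.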

In [40]:
import syft as sy

# Let's list all the available datasets
sy.datasets

Unnamed: 0,Id,Name,Tags,Assets,Description,Domain,Network,Usage,Added On
0,b16562aaec574696a380504c99b63d3f,Diabetes Dataset,"[Health, Classification, Dicom]","[""Images""] -> Tensor; [""Labels""] -> Tensor",A large set of high-resolution retina images,California Healthcare Foundation,WHO,102,Jan 15 2021
1,01a50a9ce5514bcb80317f5f150a8227,Canada Commodities Dataset,"[Commodities, Canada, Trade]","[""ca-feb2021""] -> DataFrame",Commodity Trade Dataset,Canada Domain,United Nations,40,Mar 11 2021
2,d536d72d23aa463d9bb713343d28c127,Italy Commodities Dataset,"[Commodities, Italy, Trade]","[""ca-feb2021""] -> DataFrame",Commodity Trade Dataset,Italy Domain,United Nations,23,Mar 15 2021
3,a32a57ea288345cca312c9f6a102527f,Netherlands Commodities Dataset,"[Commodities, Netherlands, Trade]","[""ca-feb2021""] -> DataFrame",Commodity Trade Dataset,Netherland Domain,United Nations,20,Apr 12 2021
4,39c7cb402c5b4bd093b9bf9ef36ff4fe,Pneumonia Dataset,"[Health, Pneumonia, X-Ray]","[""X-Ray-Images""] -> Tensor; [""labels""] -> Tensor",Chest X-Ray images. All provided images are in...,RSNA,WHO,334,Jan 15 2021


In [53]:
# Let's select the first dataset
diabetes_dataset = sy.datasets[0] # via Index
# Or
diabetes_dataset = sy.datasets["b16562aaec574696a380504c99b63d3f"] # via Id
# Or
diabetes_dataset = sy.datasets["Diabetes Dataset"] # via Name

diabetes_dataset


Dataset: Diabetes Dataset
Description: A large set of high-resolution retina images



Unnamed: 0,Asset Key,Type,Shape
0,"[""Images""]",Tensor,"(1000, 512, 512, 3)"
1,"[""Labels""]",Tensor,"(1000, 7)"


In [5]:
# If a dataset is selected via `Name` and that `Name` is not unique, an error is raised.

# Let's assume there are two datasets with the Name `Pneumonia Dataset`
pneumonia_dataset = ca_domain.datasets["Pneumonia Dataset"]


    MultipleDatasetsReturned:
        There is more than one dataset with the `Name`: `Pneumonia Dataset`.
        Please select the dataset using the `Id` or `index` of the dataset.
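
The selection rules shown so far (by index, by `Id`, by `Name`, with an error on ambiguous names) could be sketched roughly as follows. `DatasetCollection` is a hypothetical stand-in for `sy.datasets`, the third catalogue entry is a made-up duplicate, and trying a string key as an `Id` before a `Name` is an assumption, not something the spec states.

```python
class MultipleDatasetsReturned(Exception):
    """Raised when a Name lookup matches more than one dataset."""


class DatasetCollection:
    """Hypothetical stand-in for `sy.datasets`, backed by a list of dicts."""

    def __init__(self, datasets):
        self._datasets = list(datasets)

    def __getitem__(self, key):
        # Integer keys are treated as positional indexes.
        if isinstance(key, int):
            return self._datasets[key]
        # String keys are tried as an Id first (assumed order), then as a Name.
        for dataset in self._datasets:
            if dataset["Id"] == key:
                return dataset
        matches = [d for d in self._datasets if d["Name"] == key]
        if len(matches) > 1:
            raise MultipleDatasetsReturned(
                f"There is more than one dataset with the `Name`: `{key}`. "
                "Please select the dataset using the `Id` or `index` of the dataset."
            )
        if not matches:
            raise KeyError(key)
        return matches[0]


datasets = DatasetCollection([
    {"Id": "b16562aaec574696a380504c99b63d3f", "Name": "Diabetes Dataset"},
    {"Id": "39c7cb402c5b4bd093b9bf9ef36ff4fe", "Name": "Pneumonia Dataset"},
    {"Id": "0" * 32, "Name": "Pneumonia Dataset"},  # made-up duplicate name
])

print(datasets["Diabetes Dataset"]["Id"])
```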



In [59]:
# Let's print the properties of the selected dataset
print(f"""
    Id: {diabetes_dataset.id}
    Name: {diabetes_dataset.name}
    Tags: {diabetes_dataset.tags}
    Description: {diabetes_dataset.description}
    Domain: {diabetes_dataset.domain}
    Network: {diabetes_dataset.network}
""")


    Id: b16562aaec574696a380504c99b63d3f
    Name: Diabetes Dataset
    Tags: ['Health', 'Classification', 'Dicom']
    Description: A large set of high-resolution retina images
    Domain: California Healthcare Foundation
    Network: WHO



In [33]:
# If a user who is not logged into the domain tries to access an asset of the dataset
diabetes_dataset["Images"]

# Or
diabetes_dataset["Labels"]


    AccessDeniedException:
        You need to be logged into the domain to access the dataset.
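
One way to picture this behaviour: dataset metadata stays readable, while asset access is gated on a login. The sketch below is illustrative only. The `Dataset` class and its `logged_in` flag are hypothetical and stand in for whatever session check the real client would perform.

```python
class AccessDeniedException(Exception):
    """Raised when assets are requested without a domain login."""


class Dataset:
    """Hypothetical dataset handle; `logged_in` stands in for a real session check."""

    def __init__(self, name, assets, logged_in=False):
        self.name = name
        self._assets = assets
        self.logged_in = logged_in

    def __getitem__(self, asset_key):
        # Metadata such as `name` stays readable, but assets require a login.
        if not self.logged_in:
            raise AccessDeniedException(
                "You need to be logged into the domain to access the dataset."
            )
        return self._assets[asset_key]


diabetes_dataset = Dataset("Diabetes Dataset", {"Images": "<tensor pointer>"})
try:
    diabetes_dataset["Images"]
except AccessDeniedException as error:
    print(f"AccessDeniedException: {error}")
```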



If the list of datasets is large, the user can narrow it down with the `.filter` operation.

A user can filter results via three operations:
- `filter(property=value)` Exact match. (Equivalent to an exact match query in SQL)
- `filter(property__contains=value)` Case-insensitive containment test. (Equivalent to an ILIKE query in SQL)
- `filter(property__in=[value1, value2, value3])` Membership in a given iterable, often a list or tuple. (Equivalent to an IN query in SQL)
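
To make the three lookup styles concrete, here is a small sketch of how they might be evaluated over plain dictionaries. `filter_datasets` and `matches` are hypothetical helpers, the lowercase property keys are an assumption, and treating list-valued properties such as tags element by element is also an assumption.

```python
def matches(dataset, lookup, expected):
    """Evaluate one `property` or `property__op` lookup against a dataset."""
    # "name__contains" -> ("name", "contains"); "name" -> ("name", "").
    prop, _, op = lookup.partition("__")
    actual = dataset[prop]
    if op == "":
        return actual == expected  # exact match
    if op == "contains":
        # Case-insensitive containment; lists (e.g. tags) are checked per element.
        values = actual if isinstance(actual, list) else [actual]
        return any(str(expected).lower() in str(value).lower() for value in values)
    if op == "in":
        return actual in expected  # membership in the given iterable
    raise ValueError(f"Unsupported filter operation: {op!r}")


def filter_datasets(datasets, **lookups):
    """Keep the datasets that satisfy every lookup."""
    return [d for d in datasets if all(matches(d, k, v) for k, v in lookups.items())]


CATALOG = [
    {"name": "Diabetes Dataset", "tags": ["Health", "Classification", "Dicom"]},
    {"name": "Canada Commodities Dataset", "tags": ["Commodities", "Canada", "Trade"]},
    {"name": "Italy Commodities Dataset", "tags": ["Commodities", "Italy", "Trade"]},
]

for dataset in filter_datasets(CATALOG, name__contains="commodities"):
    print(dataset["name"])
```

Multiple keyword arguments are combined with a logical AND here; the spec does not say how combined filters should behave, so that choice is illustrative.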

In [41]:
# For example, a user wants to find the dataset with the Name `Diabetes Dataset`
sy.datasets.filter(name="Diabetes Dataset")

Unnamed: 0,Id,Name,Tags,Assets,Description,Domain,Network,Usage,Added On
0,b16562aaec574696a380504c99b63d3f,Diabetes Dataset,"[Health, Classification, Dicom]","[""Images""] -> Tensor; [""Labels""] -> Tensor",A large set of high-resolution retina images,California Healthcare Foundation,WHO,102,Jan 15 2021


In [43]:
# But let's say the user wants to find all the datasets with `Commodities` in their names
sy.datasets.filter(name__contains="Commodities")

Unnamed: 0,Id,Name,Tags,Assets,Description,Domain,Network,Usage,Added On
1,01a50a9ce5514bcb80317f5f150a8227,Canada Commodities Dataset,"[Commodities, Canada, Trade]","[""ca-feb2021""] -> DataFrame",Commodity Trade Dataset,Canada Domain,United Nations,40,Mar 11 2021
2,d536d72d23aa463d9bb713343d28c127,Italy Commodities Dataset,"[Commodities, Italy, Trade]","[""ca-feb2021""] -> DataFrame",Commodity Trade Dataset,Italy Domain,United Nations,23,Mar 15 2021
3,a32a57ea288345cca312c9f6a102527f,Netherlands Commodities Dataset,"[Commodities, Netherlands, Trade]","[""ca-feb2021""] -> DataFrame",Commodity Trade Dataset,Netherland Domain,United Nations,20,Apr 12 2021


In [30]:
# Similarly, a user may want to select the datasets whose names appear in a given list

names_list = ["Diabetes Dataset", "Canada Commodities Dataset"]

sy.datasets.filter(name__in=names_list)

Unnamed: 0,Id,Name,Tags,Assets,Description,Domain,Network,Usage,Added On
0,b16562aaec574696a380504c99b63d3f,Diabetes Dataset,"[Health, Classification, Dicom]","[""Images""] -> Tensor; [""Labels""] -> Tensor",A large set of high-resolution retina images,California Healthcare Foundation,WHO,102,Jan 15 2021
1,01a50a9ce5514bcb80317f5f150a8227,Canada Commodities Dataset,"[Commodities, Canada, Trade]","[""ca-feb2021""] -> DataFrame",Commodity Trade Dataset,Canada Domain,United Nations,40,Mar 11 2021


Similarly, a user can perform the filter operations described above on the following properties:
- Id
- Name
- Domain
- Network
- Tags

### Dummy Data

In [1]:
import pandas as pd
from enum import Enum
import uuid
import torch
import datetime


class bcolors(Enum):
    HEADER = "\033[95m"
    OKBLUE = "\033[94m"
    OKCYAN = "\033[96m"
    OKGREEN = "\033[92m"
    WARNING = "\033[93m"
    FAIL = "\033[91m"
    ENDC = "\033[0m"
    BOLD = "\033[1m"
    UNDERLINE = "\033[4m"

In [38]:
all_datasets = [
    {
        "Id": uuid.uuid4().hex,
        "Name": "Diabetes Dataset",
        "Tags": ["Health", "Classification", "Dicom"],
        "Assets": '''["Images"] -> Tensor; ["Labels"] -> Tensor''',
        "Description": "A large set of high-resolution retina images",
        "Domain": "California Healthcare Foundation",
        "Network": "WHO",
        "Usage": 102,
        "Added On": datetime.date(2021, 1, 15).strftime("%b %d %Y")  # fixed date so the output is reproducible
    },
    {
        "Id": uuid.uuid4().hex,
        "Name": "Canada Commodities Dataset",
        "Tags": ["Commodities", "Canada", "Trade"],
        "Assets": '''["ca-feb2021"] -> DataFrame''',
        "Description": "Commodity Trade Dataset",
        "Domain": "Canada Domain",
        "Network": "United Nations",
        "Usage": 40,
        "Added On": datetime.date(2021, 3, 11).strftime("%b %d %Y")
    },
    {
        "Id": uuid.uuid4().hex,
        "Name": "Italy Commodities Dataset",
        "Tags": ["Commodities", "Italy", "Trade"],
        "Assets": '''["ca-feb2021"] -> DataFrame''',
        "Description": "Commodity Trade Dataset",
        "Domain": "Italy Domain",
        "Network": "United Nations",
        "Usage": 23,
        "Added On": datetime.date(2021, 3, 15).strftime("%b %d %Y")
    },
    {
        "Id": uuid.uuid4().hex,
        "Name": "Netherlands Commodities Dataset",
        "Tags": ["Commodities", "Netherlands", "Trade"],
        "Assets": '''["ca-feb2021"] -> DataFrame''',
        "Description": "Commodity Trade Dataset",
        "Domain": "Netherland Domain",
        "Network": "United Nations",
        "Usage": 20,
        "Added On": datetime.date(2021, 4, 12).strftime("%b %d %Y")
    },
    {
        "Id": uuid.uuid4().hex,
        "Name": "Pneumonia Dataset",
        "Tags": ["Health", "Pneumonia", "X-Ray"],
        "Assets": '''["X-Ray-Images"] -> Tensor;  ["labels"] -> Tensor''',
        "Description": "Chest X-Ray images. All provided images are in DICOM format.",
        "Domain": "RSNA",
        "Network": "WHO",
        "Usage": 334,
        "Added On": datetime.date(2021, 1, 15).strftime("%b %d %Y")
    },
]

all_datasets_df = pd.DataFrame(all_datasets)

In [39]:
all_datasets_df

Unnamed: 0,Id,Name,Tags,Assets,Description,Domain,Network,Usage,Added On
0,b16562aaec574696a380504c99b63d3f,Diabetes Dataset,"[Health, Classification, Dicom]","[""Images""] -> Tensor; [""Labels""] -> Tensor",A large set of high-resolution retina images,California Healthcare Foundation,WHO,102,Jan 15 2021
1,01a50a9ce5514bcb80317f5f150a8227,Canada Commodities Dataset,"[Commodities, Canada, Trade]","[""ca-feb2021""] -> DataFrame",Commodity Trade Dataset,Canada Domain,United Nations,40,Mar 11 2021
2,d536d72d23aa463d9bb713343d28c127,Italy Commodities Dataset,"[Commodities, Italy, Trade]","[""ca-feb2021""] -> DataFrame",Commodity Trade Dataset,Italy Domain,United Nations,23,Mar 15 2021
3,a32a57ea288345cca312c9f6a102527f,Netherlands Commodities Dataset,"[Commodities, Netherlands, Trade]","[""ca-feb2021""] -> DataFrame",Commodity Trade Dataset,Netherland Domain,United Nations,20,Apr 12 2021
4,39c7cb402c5b4bd093b9bf9ef36ff4fe,Pneumonia Dataset,"[Health, Pneumonia, X-Ray]","[""X-Ray-Images""] -> Tensor; [""labels""] -> Tensor",Chest X-Ray images. All provided images are in...,RSNA,WHO,334,Jan 15 2021


In [49]:
dataset_detail = [
    {
        "Asset Key": '["Images"]',
        "Type": "Tensor",
        "Shape": "(1000, 512, 512, 3)"
    },
    {
        "Asset Key": '["Labels"]',
        "Type": "Tensor",
        "Shape": "(1000, 7)"
    }
]
dataset_detail_df = pd.DataFrame(dataset_detail)
print('''
Dataset: Diabetes Dataset
Description: A large set of high-resolution retina images
'''
)
dataset_detail_df

In [50]:

first = all_datasets_df.iloc[0]  # first row, i.e. the Diabetes Dataset
print(f"""
    Id: {first['Id']}
    Name: {first['Name']}
    Tags: {first['Tags']}
    Description: {first['Description']}
    Domain: {first['Domain']}
    Network: {first['Network']}
""")




Unnamed: 0,Asset Key,Type,Shape
0,"[""Images""]",Tensor,"(1000, 512, 512, 3)"
1,"[""Labels""]",Tensor,"(1000, 7)"


In [31]:
error_on_access_dataset = f"""
    {bcolors.FAIL.value}AccessDeniedException{bcolors.ENDC.value}:
        You need to be logged into the domain to access the dataset.
"""

print(error_on_access_dataset)


    AccessDeniedException:
        You need to be logged into the domain to access the dataset.



In [3]:
multiple_datasets = f"""
    {bcolors.FAIL.value}MultipleDatasetsReturned{bcolors.ENDC.value}:
        There is more than one dataset with the `Name`: `Pneumonia Dataset`.
        Please select the dataset using the `Id` or `index` of the dataset.
"""