## Datasets & assets

Datasets in PySyft let researchers run studies remotely on private data that would otherwise be inaccessible to them. The whole dataset (namely the data, access to it, and any result computed from it) is guarded, managed and maintained by the data owner.

Datasets are hosted on a PySyft server.

Datasets (`sy.Dataset`) are Syft objects which hold data assets. Assets (`sy.Asset`) are Syft objects which directly hold the data you want to upload and work with. 

<div class="admonition alert alert-info">
    <p class="admonition-title" style="font-weight:bold">Info</p>
    A <code>Syft Dataset</code> can contain one or more <code>Syft Assets</code>. An asset must belong to a dataset, and cannot be uploaded on its own. Throughout the documentation, dataset refers to <code>Syft Dataset</code>, and asset refers to <code>Syft Asset</code>.
</div>    


In [None]:
# syft absolute
import syft as sy

In [None]:
# launching a test node
node = sy.orchestra.launch(name="test_domain", port=8080, dev_mode=False, reset=True)

# logging in with default credentials (just for the example)
domain = sy.login(email="info@openmined.org", password="changethis", port=8080)

### Structuring your Data

For this demonstration, we are going to use `The Age Dataset 2023` from Kaggle. This extensive dataset provides a rich collection of demographic and life event records for more than a million individuals across multiple countries. It covers a wide range of indicators and attributes related to personal information, birth and death events, gender, occupation, and associated countries. The dataset offers valuable insights into population dynamics and various aspects of human life, enabling comprehensive analyses and cross-country comparisons.
<br>

Source: [The Age Dataset 2023](https://www.kaggle.com/datasets/lasaljaywardena/age-dataset-2023)

In [None]:
# stdlib
import ast
from random import randint

# third party
import numpy as np
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

In [None]:
!curl -O https://openminedblob.blob.core.windows.net/csvs/ages_dataset.csv

In [None]:
data_path = "ages_dataset.csv"
age_df = pd.read_csv(data_path)
age_df = age_df.dropna(how="any")
print(age_df.shape)
age_df.head()

In [None]:
age_df.info()

In [None]:
age_df.describe()

In [None]:
print(age_df["Id"].nunique())
print(age_df["Name"].nunique())

In [None]:
age_df["Gender"].value_counts()

In [None]:
print("No. of unique Occupations:", age_df["Occupation"].nunique())
age_df["Occupation"].value_counts()[:10]

In [None]:
print(
    "No. of unique combinations of Countries:", age_df["Associated Countries"].nunique()
)
age_df["Associated Countries"].value_counts()[:10]

### Creating Mock data
As mentioned in previous tutorials, data scientists never interact with the real private data directly. Instead, they work with a synthetic dataset that mimics the real one: this is the concept of _mock data_. Because mock data contains no private information, it can be freely opened up to researchers who want to learn about the private assets, without giving them any access to the real data.

In this example, we are going to use the `Faker` library for generating mock data. You can install this Python library by running the following cell. 

```bash
pip install Faker
```

In [None]:
!pip install Faker

# third party
from faker import Faker

**Mock Data Generation Steps:**

- Generate fake `Id` and `Name`
- Generate fake `Short description`
- Encode each unique `Gender` with a standard naming convention (i.e. `Gender 1`, `Gender 2` etc.) to protect the underrepresented genders, and generate the fake `Gender` value from the encoded ones
- Generate fake `Occupation`, including cases with multiple occupations
- Generate fake `Associated Countries`, including cases with multiple countries
- Generate fake `Age of death` by adding random noise of ± `threshold` years to ages sampled from the original data
- Adjust the `Birth year` and `Death year` according to the fake `Age of death`
- Generate fake `Associated Country Life Expectancy`, making sure each value maps to the corresponding fake `Associated Countries` entry in order to maintain consistency
- After generating fake values for all the above columns, it is safe to reuse the original `Manner of death` values, sampled so that the distribution present in the real data is preserved
- For this exercise, we do not generate values for the remaining two columns, `Country` and `Associated Country Coordinates (Lat/Lon)`; they are filled with a placeholder so that the mock data keeps the same columns as the real data
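
Several of the steps above sample mock values in proportion to how often they occur in the real column, and the age step then adds bounded noise. The following is a minimal sketch of that idea on toy data, not the tutorial's own implementation: the values, the variable names and the `threshold` of 5 are all illustrative assumptions. The detail worth noting is that the categories are taken from the index of `value_counts()`, so each category stays aligned with its own probability.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for a private categorical column (values are made up).
real = pd.Series(
    ["accident", "natural causes", "natural causes", "natural causes", "homicide"]
)

# value_counts() sorts by frequency, so read the categories from its index
# to keep every category lined up with its own probability.
dist = real.value_counts(normalize=True)
mock = pd.Series(rng.choice(dist.index.to_numpy(), size=1000, p=dist.to_numpy()))

# Numeric columns can additionally be perturbed with bounded integer noise.
threshold = 5  # hypothetical noise bound, in years
real_ages = pd.Series([70, 82, 65, 90])
sampled_ages = rng.choice(real_ages.to_numpy(), size=1000)
mock_ages = sampled_ages + rng.integers(-threshold, threshold + 1, size=1000)
```

The mock column only ever contains categories seen in the real column, while the noisy ages stay within `threshold` years of a real age.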

#### Step 1: Maintain relationships in the mock data [optional]

In [None]:
# Convert <string> type list of strings to python <list> type
age_df["Associated Countries"] = age_df["Associated Countries"].apply(ast.literal_eval)

age_df["Associated Country Coordinates (Lat/Lon)"] = age_df[
    "Associated Country Coordinates (Lat/Lon)"
].apply(ast.literal_eval)

age_df["Associated Country Life Expectancy"] = age_df[
    "Associated Country Life Expectancy"
].apply(ast.literal_eval)

In [None]:
# Separate countries from lists and calculate their individual value_counts() which will be
# used by random.choice function later as distributions


def value_counts_of_lists(series_with_lists):
    # Concatenate all the lists in the Series into a single list
    unpacked_list = [item for sublist in series_with_lists for item in sublist]

    # Create a new Series from the unpacked list
    unpacked_series = pd.Series(unpacked_list)

    # Use value_counts to get the count of unique values
    value_counts = unpacked_series.value_counts()

    return unpacked_list, value_counts


# Create a dictionary where each unique country from all the lists in Associated Countries
# is a key and the corresponding life expectancy is the value

unpacked_cnt_list, cnt_value_counts = value_counts_of_lists(
    age_df["Associated Countries"].values
)

unpacked_exp_list, exp_value_counts = value_counts_of_lists(
    age_df["Associated Country Life Expectancy"].values
)

print(len(unpacked_cnt_list))
print(len(unpacked_exp_list))

cnt_dict = dict.fromkeys(unpacked_cnt_list, None)

for i in range(len(unpacked_exp_list)):
    if cnt_dict[unpacked_cnt_list[i]] is None:
        cnt_dict[unpacked_cnt_list[i]] = unpacked_exp_list[i]

#### Step 2: Generate Mock Data

In [None]:
NUM_OF_ROWS = age_df.shape[0]  # 10000

Faker.seed(0)
faker = Faker()

In [None]:
gender_encode_dict = {
    "Male": "Gender 1",
    "Female": "Gender 2",
    "Transgender Female": "Gender 3",
    "Transgender Male": "Gender 4",
    "Intersex": "Gender 5",
    "Eunuch; Male": "Gender 6; Gender 1",
    "Transgender Female; Female": "Gender 3; Gender 2",
}


def generate_random_choice_columns(age_df, num):
    # Generate Id
    id_list = np.arange(1, num + 2000)
    fake_id = np.random.choice(id_list, size=num, replace=False)
    fake_id = pd.Series(fake_id).apply(lambda x: "Q" + str(x))

    # Generate Gender
    gender_dist = age_df["Gender"].value_counts(normalize=True)
    gender = np.random.choice(
        gender_dist.index.tolist(),  # same order as the probabilities below
        size=num,
        replace=True,
        p=gender_dist.values,  # probability
    )
    gender = pd.Series(gender).replace(gender_encode_dict)

    # Generate Age of death, then add noise: a random int in [-5, 5] per sampled age
    age_of_death_dist = age_df["Age of death"].value_counts(normalize=True)
    age_of_death = np.random.choice(
        age_of_death_dist.index.tolist(),  # same order as the probabilities below
        size=num,
        replace=True,
        p=age_of_death_dist.values,  # probability
    )
    age_of_death = (
        pd.Series(age_of_death).apply(lambda x: x + randint(-5, 5)).astype("float64")
    )

    # Generate Associated Countries
    # Lists are unhashable, so count their string representation instead
    assc_cnt_dist = (
        age_df["Associated Countries"].astype(str).value_counts(normalize=True)
    )
    assc_cnt = np.random.choice(
        assc_cnt_dist.index.tolist(),  # same order as the probabilities below
        size=num,
        replace=True,
        p=assc_cnt_dist.values,  # probability
    )
    assc_cnt = pd.Series(assc_cnt).apply(ast.literal_eval)

    # Generate Life Expectancy using the dictionary created above
    assc_life_exp = pd.Series(assc_cnt).apply(lambda x: [cnt_dict[i] for i in x])

    # Generate Manner of death
    manner_of_death_dist = age_df["Manner of death"].value_counts(normalize=True)
    manner_of_death = np.random.choice(
        manner_of_death_dist.index.tolist(),  # same order as the probabilities below
        size=num,
        replace=True,
        p=manner_of_death_dist.values,  # probability
    )
    manner_of_death = pd.Series(manner_of_death)

    return fake_id, gender, age_of_death, assc_cnt, assc_life_exp, manner_of_death


def make_faker_data(num):
    fake_data = [
        {
            "Name": faker.name(),
            "Short description": faker.paragraph(nb_sentences=2),
            "Occupation": faker.job(),
            "Death year": float(faker.year()),
        }
        for x in range(num)
    ]

    return fake_data

In [None]:
age_mock_df = pd.DataFrame()
(
    age_mock_df["Id"],
    age_mock_df["Gender"],
    age_mock_df["Age of death"],
    age_mock_df["Associated Countries"],
    age_mock_df["Associated Country Life Expectancy"],
    age_mock_df["Manner of death"],
) = generate_random_choice_columns(age_df, num=NUM_OF_ROWS)

fake_data = pd.DataFrame(make_faker_data(num=NUM_OF_ROWS))

for col in fake_data.columns.to_list():
    age_mock_df[col] = fake_data[col]

# Generate Birth year by subtracting Age of death from Death year
age_mock_df["Birth year"] = age_mock_df["Death year"].astype(int) - age_mock_df[
    "Age of death"
].astype(int)

print(age_mock_df.shape)
age_mock_df.head()

#### Step 3: Match Shapes between Real and Mock data

In [None]:
age_mock_df["Country"] = ["Not Available"] * age_mock_df.shape[0]
age_mock_df["Associated Country Coordinates (Lat/Lon)"] = [
    "Not Available"
] * age_mock_df.shape[0]

In [None]:
print(age_mock_df.shape)
age_mock_df.head()

In [None]:
cols = age_mock_df.columns
age_df[cols].info()

In [None]:
age_mock_df[cols].info()

In [None]:
age_df[cols].describe()

In [None]:
age_mock_df[cols].describe()

In [None]:
age_df["Manner of death"].value_counts()[:5]

In [None]:
age_mock_df["Manner of death"].value_counts()[:5]
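
The side-by-side `info()` and `describe()` calls above are checked by eye. As a possible complement, the sketch below compares the two schemas programmatically. The helper `schema_mismatches` is hypothetical (it is not part of PySyft or pandas), and the toy frames only stand in for `age_df` and `age_mock_df`.

```python
import pandas as pd


def schema_mismatches(real: pd.DataFrame, mock: pd.DataFrame) -> dict:
    """Report columns missing from either frame and columns whose dtypes differ."""
    shared = [col for col in real.columns if col in mock.columns]
    return {
        "missing_in_mock": sorted(set(real.columns) - set(mock.columns)),
        "missing_in_real": sorted(set(mock.columns) - set(real.columns)),
        "dtype_diff": {
            col: (str(real[col].dtype), str(mock[col].dtype))
            for col in shared
            if real[col].dtype != mock[col].dtype
        },
    }


# Toy frames: the mock lacks "Country" and stores ages as integers.
real = pd.DataFrame(
    {"Id": ["Q1", "Q2"], "Age of death": [70.0, 82.0], "Country": ["A", "B"]}
)
mock = pd.DataFrame({"Id": ["Q9", "Q8"], "Age of death": [71, 80]})

report = schema_mismatches(real, mock)
print(report)
```

An empty report for every key would indicate that the mock data has the same columns and dtypes as the real data.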

### Creating a sy.Dataset

In [None]:
# How an asset would be defined:
main_contributor = sy.Contributor(
    name="Jeffrey Salazar", role="Dataset Creator", email="jsala@ailab.com"
)

asset = sy.Asset(
    name="asset_name",
    data=age_df,  # real dataframe
    mock=age_mock_df,  # mock dataframe
    contributors=[main_contributor],
)

In [None]:
description_template = """### About the dataset
This extensive dataset provides a rich collection of demographic and life events records for individuals across multiple countries. It covers a wide range of indicators and attributes related to personal information, birth and death events, gender, occupation, and associated countries. The dataset offers valuable insights into population dynamics and various aspects of human life, enabling comprehensive analyses and cross-country comparisons. The dataset is the largest one on notable deceased people and includes individuals from a variety of social groups, including but not limited to 107k females, 90k researchers, and 124 non-binary individuals, spread across more than 300 contemporary or historical regions.

### Dataset usage policy
This dataset is subject to compliance with internal data use and misuse policies at our organisation. The following rules apply:
- only aggregate statistics can be released from data computation
- data subjects should never be identifiable through the data computation outcomes
- a fixed privacy budget of eps=5 must be preserved by each researcher

### Data collection and pre-processing
The dataset is based on open data hosted by Wikimedia Foundation.

**Age**
Whenever possible, age was calculated based on the birth and death year mentioned in the description of the individual.

**Gender**
Gender was available in the original dataset for 50% of participants. For the remaining, it was added from predictions based on name, country and century in which they lived. (97.51% accuracy and 98.89% F1-score)

**Occupation**
The occupation was available in the original dataset for 66% of the individuals. For the remaining, it was added from the predictions of a multiclass text classification model. (93.4% accuracy for 84% of the dataset)

More details about the features can be found by reading the paper.

### Key features
1. **Id**: Unique identifier for each individual.
2. **Name**: Name of the person.
3. **Short description**: Brief description or summary of the individual.
4. **Gender**: Gender/s of the individual.
5. **Country**: Countries/Kingdoms of residence and/or origin.
6. **Occupation**: Occupation or profession of the individual.
7. **Birth year**: Year of birth for the individual.
8. **Death year**: Year of death for the individual.
9. **Manner of death**: Details about the circumstances or manner of death.
10. **Age of death**: Age at the time of death for the individual.
11. **Associated Countries**: Modern Day Countries associated with the individual.
12. **Associated Country Coordinates (Lat/Lon)**: Modern Day Latitude and longitude coordinates of the associated countries.
13. **Associated Country Life Expectancy**: Life expectancy of the associated countries.

### Use cases
- Analyze demographic trends and birth rates in different countries.
- Investigate factors affecting life expectancy and mortality rates.
- Study the relationship between gender and occupation across regions.
- Explore correlations between age of death and associated country attributes.
- Examine patterns of migration and associated countries' life expectancy.


### Getting started

```
!curl -O https://openminedblob.blob.core.windows.net/csvs/ages_dataset.csv

age_df = pd.read_csv("ages_dataset.csv")
```

### Execution environment
The data is hosted in a remote compute environment with the following specifications:
- X CPU cores
- 1 GPU of type Y
- Z RAM
- A additional available storage

### Citation
Annamoradnejad, Issa; Annamoradnejad, Rahimberdi (2022), “Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people”, In Workshop Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM), doi: 10.36190/2022.82
"""

In [None]:
age_dataset = sy.Dataset(
    name="Dataset name",
    description=description_template,
    asset_list=[asset],
    contributors=[main_contributor],
)

### Uploading Dataset

To upload a dataset to a domain, use the `upload_dataset` function. You need to be logged in to the domain (low side or high side).

<div class="admonition info">
    <p class="admonition-title" style="font-weight:bold">Info</p>
Assets can only be uploaded as part of a dataset.
</div>    


In [None]:
domain.upload_dataset(age_dataset)

### Working with datasets and assets: access

In [None]:
# returns a list of all the available datasets for that domain (or empty list if none)
domain.datasets

In [None]:
# access a particular asset by its index, or by its unique key (name)
asset = domain.datasets[0].assets[0]  # or domain.datasets[0].assets["asset_name"]

In [None]:
# Access the mock or the real data within an asset
dataset = domain.datasets[0]

mock_data = dataset.assets[0].mock  # or dataset.assets[0].data, for the real data

In [None]:
# Stop domain
node.land()