## 1. Install Syft

In [None]:
# !pip install -U syft

In [None]:
SYFT_VERSION = ">=0.9,<0.9.2"
package_string = f'"syft{SYFT_VERSION}"'

In [None]:
# syft absolute
import syft as sy

sy.requires(SYFT_VERSION)
print(sy.__version__)

## 2. Running a Python server

In [None]:
server = sy.orchestra.launch(name="test-datasite-1", port=8081, reset=True)

In [None]:
admin_client = server.login(email="info@openmined.org", password="changethis")

In [None]:
admin_client

## 3. Uploading Data

In [None]:
# stdlib
import os

if not os.path.exists("ages_dataset.csv"):
    !curl -O https://openminedblob.blob.core.windows.net/csvs/ages_dataset.csv
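
The `!curl` line above is IPython shell syntax, so it only works inside Jupyter. If the same step has to run as a plain Python script, a standard-library equivalent could look like the sketch below. The helper name `download_if_missing` is hypothetical and not part of Syft; the URL in the comment is the one used above.

```python
# Hypothetical helper (not part of Syft): fetch a file only when it is absent.
import os
import urllib.request


def download_if_missing(url: str, path: str) -> bool:
    """Return True if the file was downloaded, False if it already existed."""
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True


# Example call (performs a network request, so it is left commented out):
# download_if_missing(
#     "https://openminedblob.blob.core.windows.net/csvs/ages_dataset.csv",
#     "ages_dataset.csv",
# )
```

Returning a boolean makes it easy to log whether a cached copy was reused.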

In [None]:
# third party
import pandas as pd

# syft absolute
import syft as sy

age_df = pd.read_csv("ages_dataset.csv")
age_df = age_df.dropna(how="any")
age_df.head()

In [None]:
# stdlib
# TODO: also move to dataset repo
import os

if not os.path.exists("ages_mock_dataset.csv"):
    !curl -O https://openminedblob.blob.core.windows.net/csvs/ages_mock_dataset.csv

In [None]:
age_mock_df = pd.read_csv("ages_mock_dataset.csv")
age_mock_df = age_mock_df.dropna(how="any")
age_mock_df.head()

In [None]:
# How an asset for the low side and the high side would be defined:
main_contributor = sy.Contributor(
    name="Jeffrey Salazar", role="Dataset Creator", email="jsala@ailab.com"
)

asset = sy.Asset(
    name="asset_name",
    data=age_df,  # real dataframe
    mock=age_mock_df,  # mock dataframe
    contributors=[main_contributor],
)

In [None]:
description_template = """### About the dataset
This extensive dataset provides a rich collection of demographic and life events records for individuals across multiple countries. It covers a wide range of indicators and attributes related to personal information, birth and death events, gender, occupation, and associated countries. The dataset offers valuable insights into population dynamics and various aspects of human life, enabling comprehensive analyses and cross-country comparisons. The dataset is the largest one on notable deceased people and includes individuals from a variety of social groups, including but not limited to 107k females, 90k researchers, and 124 non-binary individuals, spread across more than 300 contemporary or historical regions.

### Dataset usage policy
This dataset is subject to compliance with internal data use and misuse policies at our organisation. The following rules apply:
- only aggregate statistics can be released from data computation
- data subjects should never be identifiable through the data computation outcomes
- a fixed privacy budget of eps=5 must be preserved by each researcher

### Data collection and pre-processing
The dataset is based on open data hosted by Wikimedia Foundation.

**Age**
Whenever possible, age was calculated based on the birth and death year mentioned in the description of the individual.

**Gender**
Gender was available in the original dataset for 50% of participants. For the remainder, it was predicted from name, country, and the century in which they lived (97.51% accuracy and 98.89% F1-score).

**Occupation**
The occupation was available in the original dataset for 66% of the individuals. For the remainder, it was predicted by a multiclass text classifier (93.4% accuracy for 84% of the dataset).

More details about the features can be found by reading the paper.

### Key features
1. **Id**: Unique identifier for each individual.
2. **Name**: Name of the person.
3. **Short description**: Brief description or summary of the individual.
4. **Gender**: Gender/s of the individual.
5. **Country**: Countries/Kingdoms of residence and/or origin.
6. **Occupation**: Occupation or profession of the individual.
7. **Birth year**: Year of birth for the individual.
8. **Death year**: Year of death for the individual.
9. **Manner of death**: Details about the circumstances or manner of death.
10. **Age of death**: Age at the time of death for the individual.
11. **Associated Countries**: Modern Day Countries associated with the individual.
12. **Associated Country Coordinates (Lat/Lon)**: Modern Day Latitude and longitude coordinates of the associated countries.
13. **Associated Country Life Expectancy**: Life expectancy of the associated countries.

### Use cases
- Analyze demographic trends and birth rates in different countries.
- Investigate factors affecting life expectancy and mortality rates.
- Study the relationship between gender and occupation across regions.
- Explore correlations between age of death and associated country attributes.
- Examine patterns of migration and associated countries' life expectancy.


### Getting started

```
!curl -O https://openminedblob.blob.core.windows.net/csvs/ages_dataset.csv

age_df = pd.read_csv("ages_dataset.csv")
```

### Execution environment
The data is hosted in a remote compute environment with the following specifications:
- X CPU cores
- 1 GPU of type Y
- Z RAM
- A additional available storage

### Citation
Annamoradnejad, Issa; Annamoradnejad, Rahimberdi (2022), “Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people”, In Workshop Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM), doi: 10.36190/2022.82
"""

In [None]:
dataset = sy.Dataset(
    name="Age Dataset",
    description=description_template,
    asset_list=[asset],
    contributors=[main_contributor],
)

In [None]:
# Uploading the dataset
admin_client.upload_dataset(dataset)

In [None]:
admin_client.datasets

## 4. Register a Data Scientist (DS)

In [None]:
# register a new user
admin_client.register(
    name="John Doe", email="john@email.com", password="pass", password_verify="pass"
)

## 5. Working With Remote Data (This is where you start ... )

Try different scenarios for user code submission, for example:
- Vary the query size, result type, and result size
- Raise an intentional error in the code
- Experiment with input and output policies
- Access private and mock data
- Submit nested code

In [None]:
ds_client = server.login(email="john@email.com", password="pass")

In [None]:
ds_client.datasets

In [None]:
mock_data = ds_client.datasets[0].assets[0].mock
mock_data

In [None]:
private_data = ds_client.datasets[0].assets[0].data
private_data

### 5.1 Standard and Custom Input/Output Policies and the Syft Function Decorator

In [None]:
asset = ds_client.datasets[0].assets[0]

In [None]:
@sy.syft_function_single_use(ages_data=asset)
def how_are_people_dying_statistics(ages_data):
    df = ages_data
    avg_age_death_gender = (
        df.groupby("Gender")["Age of death"].mean().reset_index(name="Avg_Age_of_Death")
    )
    manner_of_death_count = (
        df.groupby("Manner of death")
        .size()
        .reset_index(name="Count")
        .sort_values(by="Count", ascending=False)
    )

    return (manner_of_death_count, avg_age_death_gender)
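
To make the two return values concrete, here is a plain-Python sketch of the same aggregations on a few invented rows. It mirrors the logic only, not the exact pandas output format (for instance, pandas sorts groups by key), and the `summarize` helper and sample rows are hypothetical.

```python
# Plain-Python sketch of the two aggregations; the rows below are made up.
from collections import Counter, defaultdict
from statistics import mean

rows = [
    {"Gender": "Female", "Manner of death": "natural causes", "Age of death": 80},
    {"Gender": "Female", "Manner of death": "accident", "Age of death": 40},
    {"Gender": "Male", "Manner of death": "natural causes", "Age of death": 70},
]


def summarize(rows):
    # Count rows per manner of death, most frequent first.
    manner_counts = Counter(r["Manner of death"] for r in rows).most_common()
    # Collect ages per gender, then average each group.
    ages_by_gender = defaultdict(list)
    for r in rows:
        ages_by_gender[r["Gender"]].append(r["Age of death"])
    avg_age = {gender: mean(ages) for gender, ages in ages_by_gender.items()}
    return manner_counts, avg_age


manner_counts, avg_age = summarize(rows)
print(manner_counts)  # [('natural causes', 2), ('accident', 1)]
print(avg_age)
```

Both results are aggregates over groups, which is in line with the usage policy in the dataset description that only aggregate statistics may be released.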

In [None]:
# stdlib
from typing import Any

# third party
from result import Err
from result import Ok

# syft absolute
from syft.client.api import AuthedServiceContext
from syft.client.api import ServerIdentity


class CustomExactMatch(sy.CustomInputPolicy):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        pass

    def filter_kwargs(self, kwargs, context, code_item_id):
        try:
            allowed_inputs = self.allowed_ids_only(
                allowed_inputs=self.inputs, kwargs=kwargs, context=context
            )
            results = self.retrieve_from_db(
                code_item_id=code_item_id,
                allowed_inputs=allowed_inputs,
                context=context,
            )
        except Exception as e:
            return Err(str(e))
        return results

    def retrieve_from_db(self, code_item_id, allowed_inputs, context):
        # syft absolute
        from syft import ServerType
        from syft.service.action.action_object import TwinMode

        action_service = context.server.get_service("actionservice")
        code_inputs = {}

        # When we retrieve the code from the database, we need to use the server's
        # verify key as the credentials. This is because when we approve the code,
        # we allow the private data to be used only for this specific code,
        # but we do not modify the permissions of the private data itself.

        root_context = AuthedServiceContext(
            server=context.server, credentials=context.server.verify_key
        )
        if context.server.server_type == ServerType.DATASITE:
            for var_name, arg_id in allowed_inputs.items():
                kwarg_value = action_service._get(
                    context=root_context,
                    uid=arg_id,
                    twin_mode=TwinMode.NONE,
                    has_permission=True,
                )
                if kwarg_value.is_err():
                    return Err(kwarg_value.err())
                code_inputs[var_name] = kwarg_value.ok()
        else:
            raise Exception(
                f"Invalid Server Type for Code Submission:{context.server.server_type}"
            )
        return Ok(code_inputs)

    def allowed_ids_only(
        self,
        allowed_inputs,
        kwargs,
        context,
    ):
        # syft absolute
        from syft import ServerType
        from syft import UID

        if context.server.server_type == ServerType.DATASITE:
            server_identity = ServerIdentity(
                server_name=context.server.name,
                server_id=context.server.id,
                verify_key=context.server.signing_key.verify_key,
            )
            allowed_inputs = allowed_inputs.get(server_identity, {})
        else:
            raise Exception(
                f"Invalid Server Type for Code Submission:{context.server.server_type}"
            )
        filtered_kwargs = {}
        for key in allowed_inputs.keys():
            if key in kwargs:
                value = kwargs[key]
                uid = value
                if not isinstance(uid, UID):
                    uid = getattr(value, "id", None)

                if uid != allowed_inputs[key]:
                    raise Exception(
                        f"Input with uid: {uid} for `{key}` not in allowed inputs: {allowed_inputs}"
                    )
                filtered_kwargs[key] = value
        return filtered_kwargs

    def _is_valid(
        self,
        context,
        usr_input_kwargs,
        code_item_id,
    ):
        filtered_input_kwargs = self.filter_kwargs(
            kwargs=usr_input_kwargs,
            context=context,
            code_item_id=code_item_id,
        )

        if filtered_input_kwargs.is_err():
            return filtered_input_kwargs

        filtered_input_kwargs = filtered_input_kwargs.ok()

        expected_input_kwargs = set()
        for _inp_kwargs in self.inputs.values():
            for k in _inp_kwargs.keys():
                if k not in usr_input_kwargs:
                    return Err(f"Function missing required keyword argument: '{k}'")
            expected_input_kwargs.update(_inp_kwargs.keys())

        permitted_input_kwargs = list(filtered_input_kwargs.keys())
        not_approved_kwargs = set(expected_input_kwargs) - set(permitted_input_kwargs)
        if len(not_approved_kwargs) > 0:
            return Err(
                f"Input arguments: {not_approved_kwargs} to the function are not approved yet."
            )
        return Ok(True)



In [None]:
class RepeatedCallPolicy(sy.CustomOutputPolicy):
    n_calls: int = 0
    downloadable_output_args: list[str] = []
    state: dict[Any, Any] = {}

    def __init__(self, n_calls=1, downloadable_output_args: list[str] = None):
        self.downloadable_output_args = (
            downloadable_output_args if downloadable_output_args is not None else []
        )
        self.n_calls = n_calls
        self.state = {"counts": 0}

    def public_state(self):
        return self.state["counts"]

    def update_policy(self, context, outputs):
        self.state["counts"] += 1

    def apply_to_output(self, context, outputs, update_policy=True):
        if hasattr(outputs, "syft_action_data"):
            outputs = outputs.syft_action_data
        output_dict = {}
        if self.state["counts"] < self.n_calls:
            for output_arg in self.downloadable_output_args:
                output_dict[output_arg] = outputs[output_arg]
            if update_policy:
                self.update_policy(context, outputs)
        else:
            return None
        return output_dict

    def _is_valid(self, context):
        return self.state["counts"] < self.n_calls
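
The counting behaviour of `RepeatedCallPolicy` can be tried out without a Syft server. The sketch below is a simplified stand-in: `CallLimiter` is a hypothetical class, not a Syft API, and it always increments the counter, whereas the policy above does so only when `update_policy` is true.

```python
# Syft-free sketch of the call-count logic in RepeatedCallPolicy.
# CallLimiter is a hypothetical stand-in, not a Syft class.
class CallLimiter:
    def __init__(self, n_calls=1, downloadable_output_args=None):
        self.n_calls = n_calls
        self.downloadable_output_args = downloadable_output_args or []
        self.counts = 0

    def apply_to_output(self, outputs):
        # Once the call budget is spent, nothing is released.
        if self.counts >= self.n_calls:
            return None
        # Release only the listed keys; `outputs` must be a mapping.
        released = {key: outputs[key] for key in self.downloadable_output_args}
        self.counts += 1
        return released


limiter = CallLimiter(n_calls=2, downloadable_output_args=["y"])
first = limiter.apply_to_output({"y": 1, "secret": 2})
second = limiter.apply_to_output({"y": 3, "secret": 4})
third = limiter.apply_to_output({"y": 5, "secret": 6})
print(first, second, third)  # {'y': 1} {'y': 3} None
```

Because the lookup is `outputs[output_arg]`, a function guarded by this policy is expected to return a mapping that contains every name listed in `downloadable_output_args`.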

### 5.2 Creating Query Function

In [None]:
@sy.syft_function(
    input_policy=CustomExactMatch(ages_data=asset),
    output_policy=RepeatedCallPolicy(n_calls=10, downloadable_output_args=["y"]),
)
def how_are_people_dying_statistics_custom(ages_data):
    df = ages_data
    avg_age_death_gender = (
        df.groupby("Gender")["Age of death"].mean().reset_index(name="Avg_Age_of_Death")
    )
    manner_of_death_count = (
        df.groupby("Manner of death")
        .size()
        .reset_index(name="Count")
        .sort_values(by="Count", ascending=False)
    )

    return (manner_of_death_count, avg_age_death_gender)
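
The exact-match check that `CustomExactMatch` applies to `ages_data` comes down to comparing the id of each supplied argument with the id that was approved for that argument name. A minimal Syft-free sketch, assuming plain `uuid.UUID` values in place of Syft `UID`s and a hypothetical `filter_allowed` helper:

```python
# Syft-free sketch of the exact-match check in allowed_ids_only.
# uuid.UUID stands in for Syft's UID; filter_allowed is a hypothetical name.
import uuid


def filter_allowed(allowed_inputs, kwargs):
    filtered = {}
    for key, allowed_id in allowed_inputs.items():
        if key not in kwargs:
            continue
        value = kwargs[key]
        # Accept either a bare id or an object exposing an `id` attribute.
        uid = value if isinstance(value, uuid.UUID) else getattr(value, "id", None)
        if uid != allowed_id:
            raise ValueError(f"Input with uid: {uid} for `{key}` is not allowed")
        filtered[key] = value
    return filtered


approved = uuid.uuid4()
allowed = {"ages_data": approved}
print(filter_allowed(allowed, {"ages_data": approved}) == {"ages_data": approved})  # True

try:
    filter_allowed(allowed, {"ages_data": uuid.uuid4()})
except ValueError as err:
    print("rejected:", err)
```

Arguments that are missing from `kwargs` are skipped here rather than rejected; in the policy above, the missing-argument check happens separately in `_is_valid`.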

### 5.3 Test on Mock Data

In [None]:
pointer = how_are_people_dying_statistics(ages_data=asset)
result = pointer.get()

In [None]:
result[0]

In [None]:
result[1]

### 5.4 Submit Code

In [None]:
# Create a new project
new_project = sy.Project(
    name="The project about death",
    description="Hi, I want to calculate some statistics on how folks are dying",
    members=[ds_client],
)
new_project

In [None]:
result = new_project.create_code_request(how_are_people_dying_statistics, ds_client)

In [None]:
project = new_project.send()
project

## 6. Approve Requests (As Admin)

We will approve all incoming requests for now, as the focus of this exercise is to test different scenarios for user code submission.

In [None]:
admin_client.projects[0]

In [None]:
project = admin_client.projects[0]
project.requests

In [None]:
request = project.requests[0]
request

In [None]:
result = request.approve()

## 7. Running Code (Data Scientist)

In [None]:
ds_client.code

In [None]:
ds_client.requests

In [None]:
result = ds_client.code.how_are_people_dying_statistics(ages_data=asset)

In [None]:
result[0]

In [None]:
result[1]