# Compare Demographics across the four Datasites

As the first step of our analysis, we want to collect information about the demographics in each dataset.

> 💡 This information is crucial to understand possible differences in data distributions, and therefore how data
varies across the four Hospitals! (Another good reason to appreciate the benefits of working with **more data**.)

Naturally, we can't collect this information from _mock_ data! But we can use the mock data to prepare our code. Afterwards, we will send a request to each datasite to gather the **true** statistics we're interested in.

## Step 1. Login to datasites as **External Researcher**

⚠️ First verify that the Datasites are already running. If needed, launch the following command in a new terminal session:

```bash
$ python launch_datasites.py
```

**Note**: In Jupyter Lab, you can open a new terminal session via `File >> New >> Terminal`

In [None]:
import syft as sy

In [None]:
from datasites import DATASITE_URLS

datasites = {}
for name, url in DATASITE_URLS.items():
    datasites[name] = sy.login(url=url, email="researcher@openmined.org", password="****")

## Step 2. Get Mock data and prepare your data science code

In [None]:
dataset = datasites["Cleveland Clinic"].datasets["Heart Disease Dataset"]
dataset.description

_From the dataset description_: the `age` and `sex` columns correspond to the demographics in our dataset, while the `num` column is the outcome of the study.

Let's download the mock data to start working on our code!

> 💡 Remember: Mock data is an _artificially created_ version of the true (non-public) data that is only meant for code prototyping!

In [None]:
mock_data = dataset.assets["Heart Study Data"].mock

In [None]:
mock_data.head()

We can do some `pandas` [**magic**](https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html) to generate statistics about the _disease prevalence_ in the dataset, aggregated by the demographics (`age`, `sex`).

In [None]:
import pandas as pd

def aggregate_factors(data):
    """Gather demographics categorical factors from data:
    - Age will be mapped to three Age Ranges
    - New Diagnosis column for binary outcome of the study (and better plotting legends)
    - New Sex-Label column to better decode the `sex` column in data.
    """
    info = pd.DataFrame()
    info["diagnosis"] = data["num"].map(lambda v: "present" if v > 0 else "absent")
    info["sex-label"] = data["sex"].map({0: "female", 1: "male"})

    age_range = lambda v: "0-40" if v < 40 else "40-65" if v <= 65 else "Over 65"
    info["age-range"] = pd.Categorical(data["age"].apply(age_range), 
                                       categories=["0-40", "40-65", "Over 65"],
                                       ordered=True)
    return info

def disease_prevalence_per_demographic(data):
    cats = aggregate_factors(data)
    prevalence_by_demographics = pd.crosstab(
        index = cats["age-range"], columns = [cats["sex-label"], cats["diagnosis"]],
    )
    return prevalence_by_demographics

disease_prevalence_per_demographic(mock_data)
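
If the shape of the `pd.crosstab` output is not obvious, the following minimal sketch may help. It uses a handful of made-up records (the `toy` frame is hypothetical and not taken from the Heart Disease Dataset) to show how the rows and the two-level columns are built:

```python
import pandas as pd

# Hypothetical toy records, for illustration only.
toy = pd.DataFrame(
    {
        "age-range": ["0-40", "40-65", "40-65", "Over 65", "40-65"],
        "sex-label": ["female", "male", "male", "female", "female"],
        "diagnosis": ["absent", "present", "absent", "present", "present"],
    }
)

# Rows: age range. Columns: a two-level index of (sex, diagnosis).
table = pd.crosstab(
    index=toy["age-range"],
    columns=[toy["sex-label"], toy["diagnosis"]],
)
print(table)

# Each cell counts the records with that exact combination of factors.
males_40_65_present = table.loc["40-65", ("male", "present")]
print(males_40_65_present)
```

Every record lands in exactly one cell, so the cells always add up to the number of rows in the input.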

## Step 3. Run code remotely on all datasites

Let's now rewrite our `disease_prevalence_per_demographic` function to be **self-contained**: 

> All dependencies necessary for the execution must be defined within the body of the function.

In [None]:
for name, datasite in datasites.items():
    # data asset on the DataSite
    data_asset = datasite.datasets["Heart Disease Dataset"].assets["Heart Study Data"]  
    
    @sy.syft_function_single_use(data=data_asset)
    def disease_prevalence_per_demographic(data) -> pd.DataFrame:
        # third party dependency
        import pandas as pd
        
        def aggregate_factors():
            info = pd.DataFrame()
            info["diagnosis"] = data["num"].map(lambda v: "present" if v > 0 else "absent")
            info["sex-label"] = data["sex"].map({0: "female", 1: "male"})
        
            age_range = lambda v: "0-40" if v < 40 else "40-65" if v <= 65 else "Over 65"
            info["age-range"] = pd.Categorical(
                data["age"].apply(age_range),
                categories=["0-40", "40-65", "Over 65"],
                ordered=True,
            )
            return info
        
        cats = aggregate_factors()
        prevalence_by_demographics = pd.crosstab(
            index = cats["age-range"], columns = [cats["sex-label"], cats["diagnosis"]],
        )
        return prevalence_by_demographics

    # Submit simple code request
    datasite.code.request_code_execution(disease_prevalence_per_demographic)
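
To see why the function has to be self-contained, here is a plain-Python analogy. It is only a sketch and makes no claim about how PySyft executes code internally: we simply assume the remote side runs the function in a namespace that does not share our notebook's imports. The helper `run_in_fresh_namespace` and the `mean_age` functions are hypothetical names used for illustration.

```python
# Source of two candidate functions, as a remote environment might receive them.
NOT_SELF_CONTAINED = '''
def mean_age(ages):
    return statistics.mean(ages)  # relies on an import made elsewhere
'''

SELF_CONTAINED = '''
def mean_age(ages):
    import statistics  # dependency declared inside the body
    return statistics.mean(ages)
'''


def run_in_fresh_namespace(source, ages):
    """Execute `source` in an empty namespace and call `mean_age`."""
    namespace = {}
    exec(source, namespace)
    return namespace["mean_age"](ages)


try:
    run_in_fresh_namespace(NOT_SELF_CONTAINED, [40, 50, 60])
    outcome = "ok"
except NameError as err:
    outcome = f"NameError: {err}"

print(outcome)
print(run_in_fresh_namespace(SELF_CONTAINED, [40, 50, 60]))
```

The first call fails because `statistics` was never imported in the fresh namespace, while the second one succeeds because the import travels with the function body.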

In [None]:
# Check status of requests
from utils import check_status_last_code_requests

check_status_last_code_requests(datasites)

🎉 All requests should be (automatically) `APPROVED`! (_If that's not the case, check again until they are all approved_)

## Step 4. Gather results from all datasites 

In [None]:
demographics = {}
for name, datasite in datasites.items():
    data_asset = datasite.datasets["Heart Disease Dataset"].assets["Heart Study Data"]
    # use .get_from() to download the result
    data_stats = datasite.code.disease_prevalence_per_demographic(data=data_asset).get_from(datasite)
    demographics[name] = data_stats

Let's now compare statistics across the **four datasets**, and plot the results for clearer insights and comparisons:

In [None]:
# Let's plot the result for better visualisation
from matplotlib import pyplot as plt
from itertools import product

def plot_disease_prevalence(axis, data, name) -> None:
    data.plot.bar(ax=axis)
    for container in axis.containers:
        axis.bar_label(container)
    axis.set_ylim([0, 130])
    axis.set_title(f"Disease Prevalence (per Demographic) in {name}", fontsize = "medium")

fig, axes = plt.subplots(2, 2, figsize=(10,8))

for coords, (name, data) in zip(product(range(2), repeat=2), demographics.items()):
    plot_disease_prevalence(axis=axes[coords], data=data, name=name)

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
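
Raw counts are hard to compare when the datasets differ in size, so it may also help to look at prevalence _rates_. The sketch below is one possible way to do that, assuming each result has the same layout as the crosstab built earlier (rows indexed by `age-range`, columns indexed by `sex-label` and `diagnosis`). The `prevalence_rate` helper and the `toy_demographics` numbers are hypothetical; in the notebook you would pass the real `demographics` dictionary instead.

```python
import pandas as pd


def prevalence_rate(counts: pd.DataFrame) -> pd.Series:
    """Share of 'present' diagnoses per age range, pooled over sex.

    Assumes a 'present' label exists in the `diagnosis` column level.
    """
    present = counts.xs("present", axis=1, level="diagnosis").sum(axis=1)
    total = counts.sum(axis=1)
    return present / total  # NaN where an age range has no patients


# Hypothetical counts with the same layout as the crosstab results.
columns = pd.MultiIndex.from_product(
    [["female", "male"], ["absent", "present"]],
    names=["sex-label", "diagnosis"],
)
index = pd.Index(["0-40", "40-65", "Over 65"], name="age-range")

toy_demographics = {
    "Site A": pd.DataFrame(
        [[3, 1, 2, 2], [10, 5, 10, 15], [1, 1, 1, 3]], index=index, columns=columns
    ),
    "Site B": pd.DataFrame(
        [[0, 0, 4, 4], [2, 2, 20, 40], [0, 1, 3, 6]], index=index, columns=columns
    ),
}

# One column of rates per site, aligned on the age range.
rates = pd.concat(
    {name: prevalence_rate(counts) for name, counts in toy_demographics.items()},
    axis=1,
)
print(rates.round(2))
```

With the real results, the same table could be plotted with `rates.plot.bar()` to compare the sites on a common scale.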


## Conclusions

Data distributions across the four datasets/datasites are very different, which means that we may expect different results when training classifiers on each dataset.

Moreover, apart from the dataset in the "Cleveland Clinic", we have discovered that `age` is **not** likely to be a good indicator for data partitioning, as the data is too skewed when combined with other demographics!

Let's now continue our analyses with some machine learning modelling!

### Final Remarks

In this example, the **true** statistics about the data are returned. This is a fair assumption, considering the data we are working with. However, in more realistic scenarios, additional **PET**s (**P**rivacy **E**nhancing **T**echnologies) could be used to better protect the privacy of the data. In fact, these are exactly the types of query that techniques like [Differential Privacy](https://opendp.org/about) can help with! 
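
To give a flavour of the idea, here is a minimal sketch of the classic Laplace mechanism applied to a table of counts. It is not PySyft or OpenDP code: the `laplace_noisy_counts` helper and the toy table are hypothetical, and we assume that each patient contributes to exactly one cell, so adding or removing one patient changes the table by at most 1. Plain floating-point sampling like this is fine for building intuition, but a vetted library should be preferred for real releases.

```python
import numpy as np
import pandas as pd


def laplace_noisy_counts(counts: pd.DataFrame, epsilon: float, seed=None) -> pd.DataFrame:
    """Add Laplace noise with scale 1/epsilon to every cell of a count table.

    Assumes sensitivity 1 (one patient affects exactly one cell).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng(seed)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return counts + noise


# Hypothetical counts, for illustration only.
toy_counts = pd.DataFrame(
    {"absent": [5, 20, 2], "present": [3, 30, 6]},
    index=pd.Index(["0-40", "40-65", "Over 65"], name="age-range"),
)

# Smaller epsilon means more noise and stronger privacy.
noisy = laplace_noisy_counts(toy_counts, epsilon=1.0, seed=42)
print(noisy.round(1))
```

The noisy table keeps the overall shape of the distribution while hiding the exact contribution of any single patient, at the cost of some accuracy in small cells.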

I will definitely show you how to use `DP` with PySyft in another tutorial! But in the meantime, please feel free to try it yourself, and then send a new [PR](https://github.com/OpenMined/syft-heart-disease-tutorial/pulls) to contribute to this tutorial!