# CDC NDI Mortality - Syft Duet - Data Scientist 🥁

This worksheet is intended to illustrate functionality of a shared statistical platform, using a partially synthetic public-use dataset that mirrors the restricted-use dataset. Ultimately, these processes would apply to the restricted-use data.

The sample data were compiled from the public-use linked mortality files shared at https://www.cdc.gov/nchs/data-linkage/mortality.htm by the National Center for Health Statistics (NCHS).

## PART 1: Connect to a Remote Duet Server

As the Data Scientist, you want to perform data science on data that is sitting in the Data Owner's Duet server in their Notebook.

To do this, run the code that the Data Owner sends you, which includes their Duet Server ID. The code will look like this, with their real Server ID in place of the placeholder.

```
import syft as sy
duet = sy.duet('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
```

This will create a direct connection from your notebook to the remote Duet server. Once the connection is established, all traffic is sent directly between the two nodes.

Paste the code or Server ID that the Data Owner gives you and run it in the cell below. It will return your Client ID, which you must send to the Data Owner to enter into Duet so that it can pair the two notebooks.

In [None]:
import syft as sy
import pandas as pd
import numpy as np
import statsmodels
duet = sy.join_duet(loopback=True)

### <img src="https://github.com/OpenMined/design-assets/raw/master/logos/OM/mark-primary-light.png" alt="he-black-box" width="100"/> Checkpoint 0 : Now STOP and run the Data Owner notebook until the next checkpoint.

In [None]:
duet.store.pandas

In [None]:
df_ptr = duet.store["df"]

In [None]:
cancer_ptr = df_ptr[(df_ptr["UCOD_LEADING"] == 2) & (df_ptr["ELIGSTAT"] == 1)]

In [None]:
heart_ptr = df_ptr[(df_ptr["UCOD_LEADING"] == 1) & (df_ptr["ELIGSTAT"] == 1)]

In [None]:
# Compute the simple (unweighted) mean for the cancer subgroup: the share of
# deaths that had diabetes listed as a multiple cause of death
cancer_mean_ptr = cancer_ptr["DIABETES"].mean()

In [None]:
# Compute the simple (unweighted) mean for the heart subgroup: the share of
# deaths that had diabetes listed as a multiple cause of death
heart_mean_ptr = heart_ptr["DIABETES"].mean()

In [None]:
cancer_mean = cancer_mean_ptr.get(request_block=True, delete_obj=False)
cancer_mean

In [None]:
heart_mean = heart_mean_ptr.get(request_block=True, delete_obj=False)
heart_mean

In [None]:
# Sample means should account for the survey weights. Define a custom function that applies them.


def weighted_mean(dx, key, weight_key="WGT_NEW"):
    w = dx[weight_key]
    v = dx[key]
    return (w * v).sum() / w.sum()


cancer_wm_ptr = weighted_mean(cancer_ptr, "DIABETES")
heart_wm_ptr = weighted_mean(heart_ptr, "DIABETES")
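
To see what `weighted_mean` computes, here is a minimal local sketch that uses plain Python lists instead of remote pointers. The flags and weights are made up for illustration and are not drawn from the NCHS files, and `weighted_mean_local` is a hypothetical helper name.

```python
def weighted_mean_local(values, weights):
    """Return sum(w * v) / sum(w), the same formula as weighted_mean above."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * v for v, w in zip(values, weights)) / total_weight


# Made-up toy data: a 0/1 indicator and a survey weight per record.
diabetes_flags = [1, 0, 0, 1]
survey_weights = [100.0, 300.0, 500.0, 100.0]

unweighted = sum(diabetes_flags) / len(diabetes_flags)
weighted = weighted_mean_local(diabetes_flags, survey_weights)
print(unweighted, weighted)
```

In this toy case the two records with the flag set carry small weights, so the weighted mean comes out lower than the unweighted one.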

In [None]:
# Example of a small subgroup (sample size = 6)
# Cancer-deaths from males aged 47 who died in 2015
# We should check for small cell sizes here
subgroup = cancer_ptr[
    (cancer_ptr["SEX"] == 1)
    & (cancer_ptr["AGE_P"] == 47)
    & (cancer_ptr["DODYEAR"] == 2015)
]
print(subgroup["DIABETES"].mean().get(request_block=True, delete_obj=False))
print(weighted_mean(subgroup, "DIABETES").get(request_block=True, delete_obj=False))
print(len(subgroup))

# These stats are problematic, as the subgroup is too small to report (n=6)
subgroup.get(request_block=True, delete_obj=False)
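
One way to guard against small cells is to check the subgroup size before releasing any statistic. The sketch below is a local illustration only: the threshold of 10 and the helper name `safe_mean` are assumptions, not NCHS disclosure rules or part of the Syft API.

```python
MIN_CELL_SIZE = 10  # assumed threshold, for illustration only


def safe_mean(values, min_cell_size=MIN_CELL_SIZE):
    """Return the mean of values, or None when the cell is too small to report."""
    n = len(values)
    if n < min_cell_size:
        return None  # suppress the statistic instead of reporting it
    return sum(values) / n


small_cell = [1, 0, 0, 0, 1, 0]  # n = 6, like the subgroup above
large_cell = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]  # n = 12

print(safe_mean(small_cell))  # suppressed because n < MIN_CELL_SIZE
print(safe_mean(large_cell))
```

In a Duet setting, a check like this would more naturally live on the Data Owner's side, since that is where requests are approved or denied.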

In [None]:
# import statsmodels.api as sm
# from statsmodels.genmod.generalized_linear_model import GLM
# from statsmodels.genmod.families import Binomial

# # Drop any missing values in the dataset (those under 18)
# df = df.dropna(subset=["MORTSTAT"])
# # Keep only the eligible portion
# df = df[df.ELIGSTAT == 1]

# # Ignore people > 80
# df = df[df.AGE_P <= 80]

# # A person is alive if MORTSTAT==0
# df["is_alive"] = df.MORTSTAT == 0

# # Assign a helpful column for sex (SEX: 1==male, 2==female)
# df["sex"] = "male"
# df.loc[df.SEX == 2, "sex"] = "female"

# x = df["AGE_P"]
# _x = sm.add_constant(x)
# _y = df["is_alive"]

# results = GLM(_y, _x, family=Binomial()).fit()
# print(results.summary())

In [None]:
# See the available remote statsmodels API
duet.statsmodels

In [None]:
# Drop any missing values in the dataset (those under 18)
df = df_ptr.dropna(subset=["MORTSTAT"])
# Keep only the eligible portion
df = df[df["ELIGSTAT"] == 1]

# Ignore people > 80
df = df[df["AGE_P"] <= 80]

# A person is alive if MORTSTAT==0
df["is_alive"] = df["MORTSTAT"] == 0

# Assign a helpful column for sex (SEX: 1==male, 2==female)
df["sex"] = "male"
df.loc[df["SEX"] == 2, "sex"] = "female"

x_ptr = df["AGE_P"]
_x_ptr = duet.statsmodels.api.add_constant(x_ptr)
_y_ptr = df["is_alive"]

model = duet.statsmodels.genmod.generalized_linear_model.GLM(
    _y_ptr, _x_ptr, family=duet.statsmodels.genmod.families.Binomial()
)
results = model.fit()

In [None]:
remote_summary = results.get(request_block=True, delete_obj=False)
print(remote_summary)

In [None]:
import pandas as pd
import io

summary_df = pd.read_csv(io.StringIO(remote_summary), names=[1, 2, 3, 4, 5, 6, 7])
summary_df

In [None]:
x = x_ptr.get(request_block=True, delete_obj=False)

In [None]:
_y = _y_ptr.get(request_block=True, delete_obj=False)

In [None]:
sex = df["sex"].get(request_block=True, delete_obj=False)

In [None]:
# TODO finish adding range and dynamic object attributes for remote invocation
predict_x = range(x.min(), x.max() + 1, 1)
# preds = results.predict(sm.add_constant(predict_x))
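
For reference, a Binomial GLM with its default logit link turns the linear term `b0 + b1 * age` into a probability through the inverse logit. The sketch below shows that calculation locally. The coefficients are hypothetical placeholders, not the fitted values from `results`, and `predict_probability` is an illustrative helper name.

```python
import math


def predict_probability(age, intercept, slope):
    """Inverse logit of the linear predictor intercept + slope * age."""
    linear = intercept + slope * age
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients, chosen only to illustrate the shape of the curve.
b0, b1 = 5.0, -0.05

ages = range(18, 81)
probs = [predict_probability(age, b0, b1) for age in ages]
print(probs[0], probs[-1])
```

With a negative slope, the predicted probability of being alive falls as age increases, which is the general shape the fitted curve is expected to have.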

In [None]:
# request the results.predict calculation and retrieve the results

In [None]:
preds = duet.store["preds"].get(request_block=True, delete_obj=False)

In [None]:
plot_df = pd.DataFrame()

In [None]:
plot_df["AGE_P"] = x
plot_df["is_alive"] = _y
plot_df["sex"] = sex

In [None]:
try:
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure(figsize=(12, 5))
    plt.plot(predict_x, preds, "k", lw=3, label="Best Fit for all data")
    sns.lineplot(data=plot_df, x="AGE_P", y="is_alive", hue="sex", err_style="bars")
    sns.despine()
except ImportError:
    print("Can't import seaborn; try:\n!pip install seaborn")