# CDC NDI Mortality - Syft Duet - Data Owner 🎸

This worksheet illustrates the functionality of a shared statistical platform, using a partially synthetic public-use dataset that mirrors the restricted-use dataset. Ultimately, the same processes would be applied to the restricted-use data.

The sample data were compiled from the public-use linked mortality files shared at https://www.cdc.gov/nchs/data-linkage/mortality.htm, provided by the National Center for Health Statistics (NCHS).

## PART 1: Launch a Duet Server and Connect

As a Data Owner, you want to allow someone else to perform data science on data that you own and likely want to protect.

To do this, you load your data into a server running locally within this notebook. We call this server a "Duet".

To begin, you must launch Duet and help your Duet "partner" (a Data Scientist) connect to this server.

Run the code below, send your partner the code snippet containing your unique Server ID, and then follow the instructions it prints.

In [None]:
import syft as sy

duet = sy.launch_duet(loopback=True)

In [None]:
sy.load("pandas")

In [None]:
duet.requests.add_handler(action="accept")

In [None]:
# download data
from syft.util import get_root_data_path
import urllib.request
import shutil
import os

csv_file = "mort_match_nhis_all_years.csv"
zip_file = f"{csv_file}.zip"
url = f"https://datahub.io/madhava/mort_match_nhis_all_years/r/{zip_file}"
data_path = f"{get_root_data_path()}/CDC"
zip_path = f"{data_path}/{zip_file}"
csv_path = f"{data_path}/{csv_file.upper()}"
if not os.path.exists(zip_path):
    os.makedirs(data_path, exist_ok=True)
    urllib.request.urlretrieve(url, zip_path)
if not os.path.exists(csv_path):
    shutil.unpack_archive(zip_path, data_path)
assert os.path.exists(csv_path)
csv_path

In [None]:
import pandas as pd

df = pd.read_csv(csv_path)
df[df.MORTSTAT == 0].head()

In [None]:
# TODO: fix size issues / serde performance
df = df.head(10000)  # make smaller
len(df)

In [None]:
df_ptr = df.send(duet, tags=["df"])

In [None]:
df

In [None]:
# local stats

# Select the records that died due to cancer and were eligible for linkage
# 002-Malignant neoplasms (C00-C97)
cancer = df[(df.UCOD_LEADING == 2) & (df.ELIGSTAT == 1)]

# Select the records that died due to heart disease and were eligible for linkage
# 001-Diseases of heart (I00-I09, I11, I13, I20-I51)
heart = df[(df.UCOD_LEADING == 1) & (df.ELIGSTAT == 1)]

In [None]:
# Compute the simple (unweighted) mean for the cancer subgroup:
# the share of records with diabetes listed as a multiple cause of death
cancer["DIABETES"].mean()

In [None]:
# Compute the simple (unweighted) mean for the heart disease subgroup:
# the share of records with diabetes listed as a multiple cause of death
heart["DIABETES"].mean()

In [None]:
# Sample means should account for the survey weights. Define a custom function that uses them.


def weighted_mean(dx, key, weight_key="WGT_NEW"):
    w = dx[weight_key]
    v = dx[key]
    return (w * v).sum() / w.sum()


weighted_mean(cancer, "DIABETES"), weighted_mean(heart, "DIABETES")
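
To see why the weights matter, here is a minimal, dependency-free sketch of the same formula on made-up numbers. The helper name `weighted_mean_plain` and all values below are illustrative assumptions, not part of the NCHS data or the notebook's API.

```python
def weighted_mean_plain(values, weights):
    """Weighted mean of `values`, mirroring (w * v).sum() / w.sum()."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * v for v, w in zip(values, weights)) / total_weight


# Made-up indicator values (1 = diabetes listed) and made-up survey weights.
values = [1, 0, 0, 1]
weights = [100.0, 300.0, 500.0, 100.0]

unweighted = sum(values) / len(values)
weighted = weighted_mean_plain(values, weights)
print(unweighted, weighted)
```

In this toy case half of the records have the indicator set, but they carry only 200 of the 1000 total weight, so the weighted mean is well below the unweighted one.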

In [None]:
# Example of a small subgroup (sample size = 6)
# Cancer-deaths from males aged 47 who died in 2015
# We should check for small cell sizes here
# subgroup = cancer[(cancer.SEX==1) & (cancer.AGE_P==47) & (cancer.DODYEAR==2001)]

# Run a different query that has matches within the first 10k records kept above:
# cancer deaths among males aged 51 who died in 2013
subgroup = cancer[(cancer.SEX == 1) & (cancer.AGE_P == 51) & (cancer.DODYEAR == 2013)]
print(subgroup["DIABETES"].mean())
print(weighted_mean(subgroup, "DIABETES"))
print(len(subgroup))

# These statistics are problematic: the subgroup is too small to report
# (n=6 in the original full-data query)
subgroup
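
One way to act on the small-cell concern is to withhold a statistic whenever it rests on too few records. The sketch below is only an illustration: the helper `suppress_small_cell` and the threshold `MIN_CELL_SIZE = 10` are hypothetical choices, not an NCHS rule or a Syft feature.

```python
MIN_CELL_SIZE = 10  # hypothetical disclosure threshold, chosen for illustration


def suppress_small_cell(n, statistic, min_cell_size=MIN_CELL_SIZE):
    """Return `statistic` only if it is based on at least `min_cell_size` records."""
    if n < min_cell_size:
        return None  # too few records: withhold the value
    return statistic


# A statistic from 6 records is withheld; one from 25 records is released.
small = suppress_small_cell(6, 0.33)
large = suppress_small_cell(25, 0.33)
print(small, large)
```

In this notebook the call could look like `suppress_small_cell(len(subgroup), weighted_mean(subgroup, "DIABETES"))`, so that the mean is only reported when the subgroup is large enough.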

In [None]:
# import statsmodels.api as sm
# from statsmodels.genmod.generalized_linear_model import GLM
# from statsmodels.genmod.families import Binomial

# Drop records with a missing MORTSTAT value (those under 18)
df = df.dropna(subset=["MORTSTAT"])
# Keep only the eligible portion
df = df[df.ELIGSTAT == 1]

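# Exclude people older than 80; copy so the column assignments below
# modify an independent DataFrame rather than a slice of the original
df = df[df.AGE_P <= 80].copy()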

# A person is alive if MORTSTAT==0
df["is_alive"] = df.MORTSTAT == 0

# Add a readable column for sex (SEX == 1 is male, SEX == 2 is female)
df["sex"] = "male"
df.loc[df.SEX == 2, "sex"] = "female"

x = df["AGE_P"]
# _x = sm.add_constant(x)
_y = df["is_alive"]

# results = GLM(_y, _x, family=Binomial()).fit()
# print(results.summary())
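
For reference, the commented-out model above is a binomial GLM on a constant plus `AGE_P`. Assuming the default logit link of the statsmodels `Binomial` family is left unchanged, it corresponds to a logistic regression of the form below, where $p_i$ is the probability that record $i$ has `is_alive` set and $\beta_0$, $\beta_1$ are the fitted coefficients (symbols introduced here for illustration):

$$
\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 \,\mathrm{AGE\_P}_i
$$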

In [None]:
df

In [None]:
x, _y

In [None]:
# import pylab as plt
# # import seaborn as sns

# plt.figure(figsize=(12,5))
# predict_x = range(x.min(), x.max()+1, 1)
# plt.plot(predict_x, results.predict(sm.add_constant(predict_x)), 'k', lw=3,
#          label='Best Fit for all data')
# sns.lineplot(
#     data=df, x='AGE_P', y='is_alive', hue='sex',
#     err_style="bars")
# sns.despine()