## Step 1. Log in to datasites as an **External Researcher**

Launch the datasites if they are not already running:
```bash
$ python launch_datasites.py
```

In [1]:
import syft as sy

In [2]:
from datasites import CONNECTION_STRINGS

datasites = {}
for name, url in CONNECTION_STRINGS.items():
    datasites[name] = sy.login_as_guest(url=url)

Logged into <Cleveland Clinic: High-side Datasite> as GUEST
Logged into <Hungarian Inst. of Cardiology: High-side Datasite> as GUEST
Logged into <Univ. Hospitals Zurich and Basel: High-side Datasite> as GUEST
Logged into <V.A. Medical Center: High-side Datasite> as GUEST


## Step 2. Get mock data and test the model training code

In [3]:
mock_data = datasites["Cleveland Clinic"].datasets["Heart Disease Dataset"].assets["Heart Study Data"].mock

In [4]:
# DS/ML libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Model persistence
import joblib
from io import BytesIO

# Input data is not "ready" for ML experiments, so we need to 
# (1) extract features and labels; (2) train/test split data 
# before training our model.
def by_demographics(data):
    # NO age stratification as data is too skewed
    # (see notebook 01-Compare-Demographics.ipynb)
    sex = data["sex"].map(lambda v: '0' if v == 0 else '1')
    target = data["num"].map(lambda v: '0' if v == 0 else '1')
    return (sex+target).values

# 1. get features and labels
X = mock_data.drop(columns=["age", "sex", "num"], axis=1)
y = mock_data["num"].map(lambda v: 0 if v == 0 else 1)
# 2. partition data
X_train, _, y_train, _ = train_test_split(
    X, y, random_state=12345, stratify=by_demographics(mock_data)
)
# 3. train model: a tree-based model, as it is invariant to feature scale and tolerates sparse data
model = RandomForestClassifier(random_state=12345)
model.fit(X_train, y_train)
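
To make the stratification key concrete, the following is a minimal, pandas-free sketch of what `by_demographics` computes for each row: a two-character label combining sex with the binarised target. The `toy_rows` values and the `stratum_key` name are made up for illustration and are not part of the original notebook.

```python
from collections import Counter

# Hypothetical toy rows (NOT real study data): "sex" is 0/1,
# "num" is the diagnosis score, where 0 means no disease.
toy_rows = [
    {"sex": 0, "num": 0},
    {"sex": 1, "num": 2},
    {"sex": 1, "num": 0},
    {"sex": 0, "num": 3},
    {"sex": 1, "num": 1},
]


def stratum_key(row):
    """Mirror by_demographics for one row: sex flag + binarised target."""
    sex = "0" if row["sex"] == 0 else "1"
    target = "0" if row["num"] == 0 else "1"
    return sex + target


keys = [stratum_key(row) for row in toy_rows]
counts = Counter(keys)

print(keys)    # ['00', '11', '10', '01', '11']
print(counts)  # how many rows fall in each of the (at most) four strata
```

`train_test_split` uses these labels to keep the proportion of each stratum roughly equal in both partitions. It raises a `ValueError` when a stratum contains a single row, which is one practical reason to keep the key coarse rather than adding a skewed variable such as age.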

## Step 3. Submit Experiment to each datasite

In [5]:
for name, datasite in datasites.items():
    print(f"Datasite: {name}")
    # 1. Get data asset from datasite
    data_asset = datasite.datasets["Heart Disease Dataset"].assets["Heart Study Data"]
    
    @sy.syft_function_single_use(data=data_asset)
    def train(data):
        # DS/ML libraries
        from sklearn.model_selection import train_test_split
        from sklearn.ensemble import RandomForestClassifier
        # Model persistence
        import joblib
        from io import BytesIO
        
        # Input data is not "ready" for ML experiments, so we need to 
        # (1) extract features and labels; (2) train/test split data 
        # before training our model.
        def by_demographics(data):
            sex = data["sex"].map(lambda v: '0' if v == 0 else '1')
            target = data["num"].map(lambda v: '0' if v == 0 else '1')
            return (sex+target).values
        
        # 1. get features and labels
        X = data.drop(columns=["age", "sex", "num"], axis=1)
        y = data["num"].map(lambda v: 0 if v == 0 else 1)
        # 2. partition data
        X_train, _, y_train, _ = train_test_split(
            X, y, random_state=12345, stratify=by_demographics(data)
        )
        # 3. train model
        model = RandomForestClassifier(random_state=12345)
        model.fit(X_train, y_train)
        # 4. model persistence: return the serialised model
        serialised_model = BytesIO()
        joblib.dump(model, serialised_model)

        return serialised_model
    
    ml_training_project = sy.Project(
        name="Training RandomForest Classifier on Heart Study Data",
        description="""I would like to train a RandomForest Classifier on the Heart Study data.
        Submitted code will partition the dataset, using demographics info in the data, and will 
        return the trained model serialised.
        """,
        members=[datasite],
    )
    ml_training_project.create_code_request(train, datasite)
    project = ml_training_project.send()

Datasite: Cleveland Clinic


Datasite: Hungarian Inst. of Cardiology


Datasite: Univ. Hospitals Zurich and Basel


Datasite: V.A. Medical Center


In [7]:
from utils import check_status_last_code_requests

check_status_last_code_requests(datasites)

Datasite: Cleveland Clinic


Datasite: Hungarian Inst. of Cardiology


Datasite: Univ. Hospitals Zurich and Basel


Datasite: V.A. Medical Center


## Step 4. Train Models on all datasites

In [8]:
from utils import dump_model

In [9]:
for name, datasite in datasites.items():
    print(f"Datasite: {name}")
    data_asset = datasite.datasets["Heart Disease Dataset"].assets["Heart Study Data"]
    serialised_model = datasite.code.train(data=data_asset).get_from(datasite)
    print(dump_model(datasite_name=name, model_buffer=serialised_model))

Datasite: Cleveland Clinic
Model saved in models/cleveland_clinic_model.jbl
Datasite: Hungarian Inst. of Cardiology
Model saved in models/hungarian_inst_of_cardiology_model.jbl
Datasite: Univ. Hospitals Zurich and Basel
Model saved in models/univ_hospitals_zurich_and_basel_model.jbl
Datasite: V.A. Medical Center
Model saved in models/va_medical_center_model.jbl
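
The `dump_model` helper is imported from `utils` and its implementation is not shown here. As a rough sketch only, a helper producing the file names seen in the output could look like the code below. Everything in it (`slugify`, `dump_model_sketch`, the `folder` parameter) is hypothetical, and it assumes the value returned by the datasite behaves like a `BytesIO` buffer.

```python
import re
import tempfile
from io import BytesIO
from pathlib import Path


def slugify(name):
    """Lower-case the name, drop punctuation, join words with underscores."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", name.lower())
    return "_".join(cleaned.split())


def dump_model_sketch(datasite_name, model_buffer, folder="models"):
    """Hypothetical stand-in for utils.dump_model (assumed behaviour)."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{slugify(datasite_name)}_model.jbl"
    # getvalue() returns the whole buffer regardless of the current
    # read/write position, so no explicit seek(0) is needed here.
    path.write_bytes(model_buffer.getvalue())
    return f"Model saved in {path.as_posix()}"


# Demo with placeholder bytes standing in for a serialised model.
with tempfile.TemporaryDirectory() as tmp:
    buffer = BytesIO(b"placeholder bytes, not a real model")
    message = dump_model_sketch("V.A. Medical Center", buffer, folder=tmp)
    saved = Path(tmp) / "va_medical_center_model.jbl"
    round_trip_ok = saved.read_bytes() == buffer.getvalue()
    print(message)
```

A file written this way from a `joblib.dump` buffer can be read back with `joblib.load(path)`. When loading directly from the in-memory buffer instead, the buffer has to be rewound with `seek(0)` first, because `joblib.dump` leaves the position at the end of the stream.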


## Conclusions

We have independently trained and stored **four** Random Forest classifiers, each on the (non-public) data held at one hospital (datasite), and <u>_without ever seeing the training data_</u>!