# Train Machine Learning Models on Multiple Medical Datasets

In this notebook we will learn how to train machine learning models on remote (non-public) data using **scikit-learn** and **PySyft**.

## Step 1. Log in to the datasites as an **External Researcher**

⚠️ First verify that the datasites are already running. If they are not, run the following command in a new terminal session:

```bash
$ python launch_datasites.py
```

**Note**: In Jupyter Lab, you can open a new terminal session via `File >> New >> Terminal`

In [None]:
import syft as sy

In [None]:
from datasites import DATASITE_URLS

datasites = {}
for name, url in DATASITE_URLS.items():
    datasites[name] = sy.login(url=url, email="researcher@openmined.org", password="****")

## Step 2. Get the mock data and test the model training code

In [None]:
mock_data = datasites["Cleveland Clinic"].datasets["Heart Disease Dataset"].assets["Heart Study Data"].mock

In [None]:
# DS/ML libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier


# The input data is not "ready" for ML experiments, so we need to
# (1) extract features and labels; (2) split the data into train/test sets
# before training our model.
def by_demographics(data):
    # No age stratification, as the age distribution is too skewed
    # (see notebook 01-Compare-Demographics.ipynb)
    sex = data["sex"].map(lambda v: '0' if v == 0 else '1')
    target = data["num"].map(lambda v: '0' if v == 0 else '1')
    return (sex+target).values

# 1. get features and labels
X = mock_data.drop(columns=["age", "sex", "num"])
y = mock_data["num"].map(lambda v: 0 if v == 0 else 1)
# 2. partition data
X_train, _, y_train, _ = train_test_split(
    X, y, random_state=12345, stratify=by_demographics(mock_data)
)
# 3. train model: a tree-based model, as it is invariant to feature scale and tolerates sparse data
model = RandomForestClassifier(random_state=12345)
model.fit(X_train, y_train)
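
The cell above discards the test partition, so it only confirms that the training code runs. As a minimal sketch of how the same split could also yield a held-out score, the example below keeps the test partition and computes accuracy. It runs on a synthetic DataFrame, not on the mock asset: the `toy` frame, its values, and the `chol` and `thalach` feature columns are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the mock data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(30, 80, size=n),
    "sex": rng.integers(0, 2, size=n),
    "chol": rng.normal(240, 40, size=n),
    "thalach": rng.normal(150, 20, size=n),
    "num": rng.integers(0, 5, size=n),
})


def by_demographics(data):
    # Composite label: one character for sex, one for the binarised target.
    sex = data["sex"].map(lambda v: "0" if v == 0 else "1")
    target = data["num"].map(lambda v: "0" if v == 0 else "1")
    return (sex + target).values


X = toy.drop(columns=["age", "sex", "num"])
y = toy["num"].map(lambda v: 0 if v == 0 else 1)

# Keep the test partition this time instead of discarding it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=12345, stratify=by_demographics(toy)
)

model = RandomForestClassifier(random_state=12345)
model.fit(X_train, y_train)

# Accuracy on rows the model never saw during training.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Because the toy features are random noise, the score itself carries no meaning; the point is only the shape of the workflow. With the default `test_size`, a quarter of the rows end up in the test partition.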

## Step 3. Submit Experiment to each datasite

In [None]:
for name, datasite in datasites.items():
    print(f"Datasite: {name}")
    # 1. Get data asset from datasite
    data_asset = datasite.datasets["Heart Disease Dataset"].assets["Heart Study Data"]
    
    @sy.syft_function_single_use(data=data_asset)
    def train(data):
        # DS/ML libraries
        from sklearn.model_selection import train_test_split
        from sklearn.ensemble import RandomForestClassifier
        # Extra dependencies for model persistence (see 4.)
        import joblib
        from io import BytesIO
        
        def by_demographics(data):
            sex = data["sex"].map(lambda v: '0' if v == 0 else '1')
            target = data["num"].map(lambda v: '0' if v == 0 else '1')
            return (sex+target).values
        
        # 1. get features and labels
        X = data.drop(columns=["age", "sex", "num"])
        y = data["num"].map(lambda v: 0 if v == 0 else 1)
        # 2. partition data
        X_train, _, y_train, _ = train_test_split(
            X, y, random_state=12345, stratify=by_demographics(data)
        )
        # 3. train model
        model = RandomForestClassifier(random_state=12345)
        model.fit(X_train, y_train)
        # 4. model persistence: return the serialised model
        serialised_model = BytesIO()
        joblib.dump(model, serialised_model)

        return serialised_model
    
    ml_training_project = sy.Project(
        name="Training RandomForest Classifier on Heart Study Data",
        description="""I would like to train a classifier on the Heart Study data.
        The code will partition the dataset using sex and target, and will train 
        a RandomForest classifier, that will be returned serialised.
        """,
        members=[datasite],
    )
    ml_training_project.create_code_request(train, datasite)
    project = ml_training_project.send()

In [None]:
from utils import check_status_last_code_requests

check_status_last_code_requests(datasites)

## Step 4. Train Models on all datasites

In [None]:
from utils import dump_model

In [None]:
for name, datasite in datasites.items():
    print(f"Datasite: {name}")
    data_asset = datasite.datasets["Heart Disease Dataset"].assets["Heart Study Data"]
    serialised_model = datasite.code.train(data=data_asset).get_from(datasite)
    print(dump_model(datasite_name=name, model_buffer=serialised_model))
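
The `dump_model` helper lives in `utils` and is not shown in this notebook. As a rough sketch of what handling a serialised buffer involves, the example below writes an object into a `BytesIO`, rewinds it, loads it back, and stores the raw bytes on disk. It uses the standard-library `pickle` as a stand-in for `joblib` so that it runs without extra dependencies; `joblib.load` also accepts a file object, so the same rewind-then-load pattern should apply. The `fake_model` dictionary and the output file name are hypothetical.

```python
import pickle
import tempfile
from io import BytesIO
from pathlib import Path

# Hypothetical stand-in for a fitted model; any picklable object behaves the same.
fake_model = {"estimator": "RandomForestClassifier", "random_state": 12345}

# Serialise into an in-memory buffer, similar to what the remote `train` function does.
buffer = BytesIO()
pickle.dump(fake_model, buffer)

# After writing, the cursor sits at the end of the buffer.
position_after_dump = buffer.tell()

# Rewind before reading, otherwise the loader finds no data.
buffer.seek(0)
restored = pickle.load(buffer)

# Persist the raw bytes; getvalue() returns them regardless of cursor position.
output_path = Path(tempfile.mkdtemp()) / "example_model.pkl"  # hypothetical file name
output_path.write_bytes(buffer.getvalue())

# Reload from disk to confirm the round trip.
with output_path.open("rb") as fh:
    reloaded = pickle.load(fh)

print(restored == fake_model, reloaded == fake_model)
```

Only load serialised models from sources you trust, since both `pickle` and `joblib` can execute arbitrary code during deserialisation.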

## Conclusions

We have independently trained and stored **four** Random Forest classifiers, using the (non-public) data in each hospital, and <u>_without seeing or downloading the training data_</u>!