# Evaluate Machine Learning Performance on Non-Public Medical Data

In this notebook we will learn how to assess the performance of trained ML models, using the test data held on each Datasite.

## Step 1. Login to datasites as **External Researcher**

⚠️ First verify that the Datasites are already running. If needed, launch the following command in a new terminal session:

```bash
$ python launch_datasites.py
```

**Note**: In Jupyter Lab, you can open a new terminal session via `File >> New >> Terminal`

In [None]:
import syft as sy

In [None]:
from datasites import DATASITE_URLS

datasites = {}
for name, url in DATASITE_URLS.items():
    datasites[name] = sy.login(url=url, email="researcher@openmined.org", password="****")

## Step 2. Get Mock data and test the model evaluation code

In [None]:
mock_data = datasites["Cleveland Clinic"].datasets["Heart Disease Dataset"].assets["Heart Study Data"].mock

In [None]:
# DS/ML libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef as mcc_score
from sklearn.metrics import confusion_matrix
from utils import load_model  # utility function to load stored trained models from disk


# ML Data preparation - same strategy as in 02-Model-Training-Experiment.ipynb
def by_demographics(data):
    sex = data["sex"].map(lambda v: '0' if v == 0 else '1')
    target = data["num"].map(lambda v: '0' if v == 0 else '1')
    return (sex+target).values

# 1. get features and labels
X = mock_data.drop(columns=["age", "sex", "num"])
y = mock_data["num"].map(lambda v: 0 if v == 0 else 1)
# 2. partition data
_, X_test, _, y_test = train_test_split(
    X, y, random_state=12345, stratify=by_demographics(mock_data)
)
# 3. Load a single trained model as example
model_dump_file = "./models/cleveland_clinic_model.jbl"
classifier = load_model(model_dump_file)
# 4. Evaluate Metrics (MCC score and Confusion Matrix)
y_pred = classifier.predict(X_test)
mcc_test = mcc_score(y_test, y_pred)  # MCC uses all four confusion-matrix cells (TP, TN, FP, FN)
cm = confusion_matrix(y_test, y_pred)  # rows are true classes, columns are predicted classes
print(mcc_test)  # should be close to zero, as the mock data is random! MCC ranges from -1 to 1
print(cm)
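
The `by_demographics` helper above builds one label per row by joining a sex flag with the binarised target, so the split is stratified on both at once. As a minimal sketch of that idea in plain Python (the `demographic_key` function and the sample records are hypothetical and only illustrate the four possible strata):

```python
from collections import Counter


def demographic_key(sex, num):
    """Combine the sex flag and the binarised target into one stratum label."""
    sex_flag = "0" if sex == 0 else "1"
    target_flag = "0" if num == 0 else "1"
    return sex_flag + target_flag


# Hypothetical (sex, num) records, not taken from any datasite
records = [(0, 0), (1, 2), (1, 0), (0, 3), (1, 1), (1, 0)]

# Count how many records fall into each of the strata "00", "01", "10", "11"
strata = Counter(demographic_key(sex, num) for sex, num in records)
print(dict(strata))
```

Passing such labels to `stratify` asks `train_test_split` to keep the proportion of each stratum roughly equal in the train and test partitions.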

## Step 3. Submit Experiment to each datasite

In [None]:
# get model files
from utils import load_models
# Load saved models from disk
models = load_models(datasites)
assert len(models) == 4

In [None]:
from utils import serialize_and_upload
remote_models = {}

for name, datasite in datasites.items():
    print(f"Datasite: {name}")
    # 1. Get data asset from datasite
    data_asset = datasite.datasets["Heart Disease Dataset"].assets["Heart Study Data"]
    # 1.1 Upload models to Datasite (to be passed as input of the Syft function)
    remote_model = serialize_and_upload(model=models[name], to=datasite)
    remote_models[name] = remote_model
    
    @sy.syft_function_single_use(data=data_asset, model=remote_model)
    def evaluate(data, model):
        # DS/ML libraries
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import matthews_corrcoef as mcc_score
        from sklearn.metrics import confusion_matrix
        import joblib  # to load serialised input model
        
        # ML Data preparation - same strategy as in model training
        def by_demographics(data):
            sex = data["sex"].map(lambda v: '0' if v == 0 else '1')
            target = data["num"].map(lambda v: '0' if v == 0 else '1')
            return (sex+target).values
        
        # 1. get features and labels
        X = data.drop(columns=["age", "sex", "num"])
        y = data["num"].map(lambda v: 0 if v == 0 else 1)
        # 2. partition data
        _, X_test, _, y_test = train_test_split(
            X, y, random_state=12345, stratify=by_demographics(data)
        )
        # 3. Get trained model
        classifier = joblib.load(model)  # load model
        # 4. Evaluate Metrics (MCC and Confusion Matrix)
        y_pred = classifier.predict(X_test)
        return mcc_score(y_test, y_pred), confusion_matrix(y_test, y_pred)
    
    ml_eval_project = sy.Project(
        name="Evaluate performance of trained classifier on Heart Study Data",
        description=(
            "I would like to calculate the MCC score and the confusion matrix on the "
            "test data partition for the input trained RandomForest classifier."
        ),
        members=[datasite],
    )
    ml_eval_project.create_code_request(evaluate, datasite)
    project = ml_eval_project.send()

In [None]:
from utils import check_status_last_code_requests

check_status_last_code_requests(datasites)

## Step 4. Evaluate Models on all datasites

In [None]:
mcc_scores, confusion_matrices = {}, {}
for name, datasite in datasites.items():
    print(f"Datasite: {name}")
    data_asset = datasite.datasets["Heart Disease Dataset"].assets["Heart Study Data"]
    remote_model = remote_models[name]
    results = datasite.code.evaluate(data=data_asset, model=remote_model).get_from(datasite)
    mcc_scores[name], confusion_matrices[name] = results

In [None]:
mcc_scores

The data in 'Univ. Hospitals Zurich and Basel' is so skewed (as expected) that the model is basically always predicting the same outcome (i.e. `MCC = 0`).
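
To see why a constant prediction leads to `MCC = 0`, here is a minimal sketch that computes MCC directly from the four cells of a binary confusion matrix. The `mcc_from_counts` helper and the counts are hypothetical, and the zero-denominator rule mirrors what scikit-learn's `matthews_corrcoef` is understood to do, so treat it as an assumption rather than a reference implementation.

```python
import math


def mcc_from_counts(tn, fp, fn, tp):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # If a whole row or column of the matrix is empty, the ratio is 0/0.
    # Assumption: report 0.0 in that case, as scikit-learn appears to do.
    return numerator / denominator if denominator else 0.0


# Hypothetical skewed test set: 45 positives, 5 negatives
always_positive = mcc_from_counts(tn=0, fp=5, fn=0, tp=45)  # model never predicts class 0
perfect = mcc_from_counts(tn=5, fp=0, fn=0, tp=45)          # every sample classified correctly

print(always_positive, perfect)
```

A model that always predicts the majority class can still reach high accuracy on skewed data (90% in this made-up case), which is why MCC is the more informative score here.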

Let's double-check the resulting confusion matrices, and then we will see if we can do better!

In [None]:
from matplotlib import pyplot as plt
from utils import plot_all_confusion_matrices

plot_all_confusion_matrices(confusion_matrices)
plt.show()
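
When reading the plots, it can help to turn each matrix into per-class rates. The sketch below assumes a binary problem with labels ordered `0, 1`, so that the matrix follows scikit-learn's `[[TN, FP], [FN, TP]]` layout. The `per_class_rates` helper and the example counts are hypothetical.

```python
def per_class_rates(cm):
    """Return (sensitivity, specificity) from a 2x2 confusion matrix.

    Assumes rows are true classes and columns are predicted classes,
    i.e. cm == [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    # Guard against classes that are absent from the test partition
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical counts, not results from any datasite
example_cm = [[15, 5], [10, 40]]
sensitivity, specificity = per_class_rates(example_cm)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

The same function should also accept the NumPy arrays stored in `confusion_matrices`, since it only unpacks the two rows.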

## Conclusions

We have gathered evaluation metrics (MCC and confusion matrix) for each trained `RandomForestClassifier` model, using the test data on each corresponding datasite. In addition to running our code on the non-public data, in this notebook we have learnt how to upload an ML model to a datasite so that it can be used in a Syft function.

### Exercise

As an exercise, you could try to check model performance using _test data_ from datasites other than the ones used in training!

> 💡 Considering the code of our experiment, the only thing you'd need to change is _which_ model gets passed in as input to the `evaluate` function! 😉
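
As a starting point, one way to organise the exercise is to enumerate every (model, data) pairing in which the model was trained at a different site. The sketch below only builds that list; the site names are hypothetical placeholders, and in the notebook they would come from the keys of `datasites`.

```python
from itertools import permutations

# Hypothetical datasite names, used only for illustration
site_names = ["Site A", "Site B", "Site C"]

# Ordered (model_site, data_site) pairs; permutations never pairs a site with itself
cross_pairs = list(permutations(site_names, 2))

for model_site, data_site in cross_pairs:
    print(f"model trained at {model_site!r} -> test data at {data_site!r}")
```

For each pair, the model loaded for `model_site` would be the one uploaded to the datasite identified by `data_site`, with the rest of the experiment left unchanged.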

In [None]:
# Your code here