## Implementing Horizontal Federated Learning with Random Forest in the Healthcare Sector

## Project overview

1. We demonstrate this project using 5 heart disease datasets.


2. All the datasets were pre-processed beforehand. The pre-processing steps are not shown here; the pre-processed datasets are used throughout this implementation.


3. This approach applies only to horizontal federated learning, where every party holds the same features for different patients.


4. In this project we consider two clients, each holding one hospital's data: Cleveland and Hungary.


5. Our federated server also has its own local data, which we built by merging the data of 3 other hospitals.
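
To make "horizontal" concrete, here is a minimal sketch of row-wise merging under a shared schema. It assumes three hypothetical hospital frames (`hospital_a`, `hospital_b`, `hospital_c`) with toy values; these names and the helpers `same_schema` and `merge_rows` are illustrative and are not part of the project's files.

```python
import pandas as pd

# Hypothetical toy frames; the values are for illustration only.
hospital_a = pd.DataFrame({"age": [57, 63], "cholesterol": [289, 254], "target": [1, 1]})
hospital_b = pd.DataFrame({"age": [48], "cholesterol": [274], "target": [1]})
hospital_c = pd.DataFrame({"age": [51, 60], "cholesterol": [222, 258], "target": [0, 1]})


def same_schema(frames):
    """Return True if every frame has the same columns in the same order."""
    first = list(frames[0].columns)
    return all(list(f.columns) == first for f in frames[1:])


def merge_rows(frames):
    """Stack frames row-wise (horizontal partitioning: same features, different samples)."""
    if not same_schema(frames):
        raise ValueError("All parties must share the same feature columns.")
    return pd.concat(frames, ignore_index=True)


server_df = merge_rows([hospital_a, hospital_b, hospital_c])
print(server_df.shape)  # (5, 3)
```

If one party had an extra or missing column, `merge_rows` would raise instead of silently filling the gap with NaN values.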

## Steps: 

### Phase 1:

1. First, we initialize a Random Forest model on the federated central server, which gives moderate evaluation metrics on the server data. We then save the model so that its parameters can be sent to the clients.


2. Next, we send the initial model's parameters to both clients. Each client uses them to run a Random Forest model on its own server.


3. After that, each client tunes its model. If the evaluation metrics improve, the tuned model is saved.


4. Finally, both clients send their tuned model parameters (not the data) to the federated server, and we check whether the federated model's accuracy improves.
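
The four steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is synthetic (`make_classification`), the file names are hypothetical, and the client's "tune around the received values" grid is one possible choice rather than the project's actual tuning procedure.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins for the server and client datasets (assumption: not the real data).
X_server, y_server = make_classification(n_samples=200, n_features=8, random_state=0)
X_client, y_client = make_classification(n_samples=150, n_features=8, random_state=1)

workdir = tempfile.mkdtemp()
server_path = os.path.join(workdir, "server_model.joblib")  # hypothetical file name
client_path = os.path.join(workdir, "client_model.joblib")  # hypothetical file name

# 1. Server: fit an initial model and save it for the clients.
server_model = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=1)
server_model.fit(X_server, y_server)
dump(server_model, server_path)

# 2. Client: load the server model and read its hyperparameters as a starting point.
received = load(server_path)
start = {"n_estimators": received.n_estimators, "max_depth": received.max_depth}

# 3. Client: tune around the received values using local data only.
client_grid = {
    "n_estimators": [start["n_estimators"], start["n_estimators"] + 10],
    "max_depth": [start["max_depth"], start["max_depth"] + 1],
}
search = GridSearchCV(RandomForestClassifier(random_state=1), client_grid, cv=3)
search.fit(X_client, y_client)
dump(search.best_estimator_, client_path)

# 4. Server: read back the client's tuned hyperparameters; the raw client data never moves.
client_model = load(client_path)
server_grid = {
    "n_estimators": [client_model.n_estimators],
    "max_depth": [client_model.max_depth],
}
print(server_grid)
```

Note that a saved fitted forest carries its trees as well as its hyperparameters; the sketch reads only the hyperparameters on the server side.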

### Phase 2:

1. We increase the data on the federated server as well as on the clients' servers. We do this because this type of data never stays the same (online data): new records can be added at any time.


2. We repeat the same process on the enlarged data to see whether the accuracy changes.

<img src='method.jpeg' width = 450px>

In [1]:
import numpy as np
import pandas as pd
from joblib import dump, load  # saving and loading models
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

### Helper Functions to receive parameters from clients

In [2]:
def no_repeat(lst):
    """Return the items of lst in first-seen order, with duplicates removed."""
    new_lst = []
    for i in lst:
        if i not in new_lst:
            new_lst.append(i)
    return new_lst
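
For hashable values such as the integers, strings, and `None` that Random Forest hyperparameters usually take, the same order-preserving de-duplication can be written with `dict.fromkeys`. The name `no_repeat_fast` is hypothetical and shown only as an alternative sketch.

```python
def no_repeat_fast(lst):
    """Order-preserving de-duplication for hashable items (dicts keep insertion order)."""
    return list(dict.fromkeys(lst))


print(no_repeat_fast([4, 5, 4, "log2", "log2", None]))  # [4, 5, 'log2', None]
```

Unlike the loop version, this variant fails on unhashable items such as lists, so the loop remains the safer choice if such values can occur.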

In [3]:
def parameters(model_lst):
    """Collect the hyperparameters of saved client models into a search grid.

    model_lst is a list of paths to joblib files, each holding a Random Forest model.
    """
    md, mf, ne, cr = [], [], [], []

    for path in model_lst:
        loaded_model = load(path)
        md.append(loaded_model.max_depth)
        mf.append(loaded_model.max_features)
        ne.append(loaded_model.n_estimators)
        cr.append(loaded_model.criterion)

    # Keep each distinct value once so the grid has no repeated candidates
    return {
        'max_depth': no_repeat(md),
        'max_features': no_repeat(mf),
        'n_estimators': no_repeat(ne),
        'criterion': no_repeat(cr),
    }
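
As a usage illustration, the sketch below builds the same kind of grid from two saved models using scikit-learn's `get_params()`. The helper `collect_param_grid`, the client names, and the hyperparameter values are assumptions chosen for the demo; the models are unfitted because only their hyperparameters are read.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

KEYS = ("max_depth", "max_features", "n_estimators", "criterion")


def collect_param_grid(model_paths, keys=KEYS):
    """Build a GridSearchCV-style grid from the hyperparameters of saved models."""
    grid = {key: [] for key in keys}
    for path in model_paths:
        params = load(path).get_params()
        for key in keys:
            if params[key] not in grid[key]:  # keep first-seen order, skip repeats
                grid[key].append(params[key])
    return grid


# Two hypothetical client models, saved to a temporary directory for the demo.
tmp = tempfile.mkdtemp()
paths = []
for name, depth, trees in [("client_a", 4, 90), ("client_b", 5, 60)]:
    model = RandomForestClassifier(max_depth=depth, n_estimators=trees,
                                   max_features="log2", criterion="gini")
    path = os.path.join(tmp, f"{name}.joblib")
    dump(model, path)
    paths.append(path)

grid = collect_param_grid(paths)
print(grid)
# {'max_depth': [4, 5], 'max_features': ['log2'], 'n_estimators': [90, 60], 'criterion': ['gini']}
```

The resulting dictionary can be passed directly as `param_grid` to `GridSearchCV`.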

## FL with newly added data (Phase: 2)

In [4]:
df = pd.read_csv("federated_full.csv")
df.head()

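|   | age | resting_blood_pressure | cholesterol | fasting_blood_sugar | max_heart_rate_achieved | exercise_induced_angina | st_depression | target | sex_male | chest_pain_type_atypical angina | chest_pain_type_non-anginal pain | chest_pain_type_typical angina | rest_ecg_left ventricular hypertrophy | rest_ecg_normal | st_slope_flat | st_slope_upsloping |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 57 | 165 | 289 | 1 | 124 | 0 | 1.0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 63 | 130 | 254 | 0 | 147 | 0 | 1.4 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 48 | 124 | 274 | 0 | 166 | 0 | 0.5 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 51 | 100 | 222 | 0 | 143 | 1 | 1.2 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 60 | 150 | 258 | 0 | 157 | 0 | 2.6 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |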


In [5]:
df.shape

(579, 16)

In [6]:
X = df.drop("target", axis=1)
y = df["target"]

X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, 
                                                        test_size=0.25, stratify=y,
                                                        random_state=5)
print('-----------remodeled-Training Set------------------')
print(X_train2.shape)
print(y_train2.shape)

print('------------Test Set------------------')
print(X_test2.shape)
print(y_test2.shape)

-----------remodeled-Training Set------------------
(434, 15)
(434,)
------------Test Set------------------
(145, 15)
(145,)
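
The split sizes and the effect of `stratify=y` can be checked with a small sketch. The labels below are synthetic with a 60/40 class balance (an assumption for illustration, not the project's data); the size calculation reflects how `train_test_split` rounds the test share up.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Test-set size for 579 rows at test_size=0.25: the fractional count is rounded up.
n_test = math.ceil(0.25 * 579)
print(n_test, 579 - n_test)  # 145 434

# Synthetic labels with a 60/40 class balance.
y = np.array([0] * 60 + [1] * 40)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=5
)

# With stratify=y, both splits keep approximately the original share of class 1.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a small test set can end up with a noticeably different class balance from the training set, which makes precision and recall harder to compare across runs.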


In [7]:
model1 = RandomForestClassifier(random_state=1)  # base estimator for the grid search

### Tuning the model on the dataset with newly added samples

# 1

In [8]:
# Define the grid of hyperparameters 'params_rf4'
params_rf4 = {'n_estimators': list(range(50, 100, 5)), 'max_depth': list(range(3, 7)),
              'max_features': ['log2', 'sqrt'], 'criterion': ['gini', 'entropy']}


# Instantiate a 5-fold CV grid search object 'grid_rf4'
grid_rf4 = GridSearchCV(estimator=model1, param_grid=params_rf4, scoring='accuracy', cv=5, n_jobs=-1)

grid_rf4.fit(X_train2, y_train2)

# Extract the best model from 'grid_rf4'
best_model4 = grid_rf4.best_estimator_

# Extract the best hyperparameters from 'grid_rf4'
best_hyperparams4 = grid_rf4.best_params_
print('Best hyperparameters', best_hyperparams4)

# Evaluate the best model on the held-out test set
pred4 = best_model4.predict(X_test2)

test_acc4 = accuracy_score(y_test2, pred4)
print('Test Accuracy_4:', test_acc4*100)

precision_4 = precision_score(y_test2, pred4)
print('Precision_4:', precision_4*100)

recall_4 = recall_score(y_test2, pred4)
print('Recall_4:', recall_4*100)

f1_score_4 = f1_score(y_test2, pred4)
print('f1_score_4:', f1_score_4*100)

Best hyperparameters {'criterion': 'entropy', 'max_depth': 6, 'max_features': 'log2', 'n_estimators': 50}
Test Accuracy_4: 88.27586206896552
Precision_4: 90.1639344262295
Recall_4: 83.33333333333334
f1_score_4: 86.61417322834646
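
To get a feel for the cost of this search, the number of candidate combinations in the grid can be counted with scikit-learn's `ParameterGrid`. The sketch repeats the grid values from the cell above; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "n_estimators": list(range(50, 100, 5)),  # 10 values: 50, 55, ..., 95
    "max_depth": list(range(3, 7)),           # 4 values: 3, 4, 5, 6
    "max_features": ["log2", "sqrt"],
    "criterion": ["gini", "entropy"],
}

n_candidates = len(ParameterGrid(params))
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 160 800
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training set after the 800 cross-validation fits.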


### Saving the model for clients

In [9]:
dump(best_model4, "fed_model_3.joblib")

['fed_model_3.joblib']
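
A quick way to confirm that a saved model survives the round trip is to reload it and compare predictions. This is a minimal sketch on synthetic data with a hypothetical file name, not the project's model.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=6, random_state=0)  # synthetic data
model = RandomForestClassifier(n_estimators=10, random_state=1).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "demo_model.joblib")  # hypothetical file name
dump(model, path)
restored = load(path)

# The restored model keeps both its hyperparameters and its fitted trees.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(restored.n_estimators, same_predictions)  # 10 True
```

Because joblib files are pickle-based, loading one can execute arbitrary code, so in a federated setting model files should only be loaded from parties that are trusted.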

# 2

In [10]:
clients = ["client_cleveland_3.joblib", "client_hungarian_3.joblib"]
params_rf_5 = parameters(clients)
params_rf_5

{'max_depth': [4, 5],
 'max_features': ['log2'],
 'n_estimators': [90, 60],
 'criterion': ['gini']}

In [11]:
model1 = RandomForestClassifier(random_state=1)

# Instantiate a 5-fold CV grid search object 'grid_rf_5' over the client-derived grid
grid_rf_5 = GridSearchCV(estimator=model1, param_grid=params_rf_5, scoring='accuracy', cv=5, n_jobs=-1)

grid_rf_5.fit(X_train2, y_train2)

# Extract the best model from 'grid_rf_5'
best_model5 = grid_rf_5.best_estimator_

# Extract the best hyperparameters from 'grid_rf_5'
best_hyperparams_5 = grid_rf_5.best_params_
print('Best hyperparameters', best_hyperparams_5)

# Evaluate the best model on the held-out test set
pred5 = best_model5.predict(X_test2)

test_acc5 = accuracy_score(y_test2, pred5)
print('Test Accuracy_5:', test_acc5*100)

precision_5 = precision_score(y_test2, pred5)
print('Precision_5:', precision_5*100)

recall_5 = recall_score(y_test2, pred5)
print('Recall_5:', recall_5*100)

f1_score_5 = f1_score(y_test2, pred5)
print('f1_score_5:', f1_score_5*100)

Best hyperparameters {'criterion': 'gini', 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 90}
Test Accuracy_5: 85.51724137931035
Precision_5: 86.88524590163934
Recall_5: 80.3030303030303
f1_score_5: 83.46456692913385
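
The same four metrics are computed after every search, so the evaluation could be factored into one helper. The function `evaluate` below is a hypothetical sketch, demonstrated on synthetic data rather than the heart disease features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return the four test metrics used in this notebook, as percentages."""
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred) * 100,
        "precision": precision_score(y_test, pred) * 100,
        "recall": recall_score(y_test, pred) * 100,
        "f1": f1_score(y_test, pred) * 100,
    }


# Synthetic data stands in for the real features (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=5)
demo_model = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_tr, y_tr)

metrics = evaluate(demo_model, X_te, y_te)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Returning a dictionary also makes it easy to collect the results of several runs into one table for comparison.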


# 3

In [41]:
# Define a wider grid 'params_rf_6' (max_depth now ranges from 3 to 9)
params_rf_6 = {'n_estimators': list(range(50, 100, 5)), 'max_depth': list(range(3, 10)),
               'max_features': ['log2', 'sqrt'], 'criterion': ['gini', 'entropy']}

# params_rf4 = {'n_estimators': list(range(30, 80, 5)), 'max_depth': list(range(2,9)),
#              'max_features': ['log2','sqrt'], 'criterion': ['gini', 'entropy']}

# Instantiate a 5-fold CV grid search object 'grid_rf_6'
grid_rf_6 = GridSearchCV(estimator=model1, param_grid=params_rf_6, scoring='accuracy', cv=5, n_jobs=-1)

grid_rf_6.fit(X_train2, y_train2)

# Extract the best model from 'grid_rf_6'
best_model6 = grid_rf_6.best_estimator_

# Extract the best hyperparameters from 'grid_rf_6'
best_hyperparams_6 = grid_rf_6.best_params_
print('Best hyperparameters', best_hyperparams_6)

# Evaluate the best model on the held-out test set
pred6 = best_model6.predict(X_test2)

test_acc6 = accuracy_score(y_test2, pred6)
print('Test Accuracy_6:', test_acc6*100)

precision_6 = precision_score(y_test2, pred6)
print('Precision_6:', precision_6*100)

recall_6 = recall_score(y_test2, pred6)
print('Recall_6:', recall_6*100)

f1_score_6 = f1_score(y_test2, pred6)
print('f1_score_6:', f1_score_6*100)

Best hyperparameters {'criterion': 'entropy', 'max_depth': 9, 'max_features': 'log2', 'n_estimators': 65}
Test Accuracy_6: 95.86206896551724
Precision_6: 96.875
Recall_6: 93.93939393939394
f1_score_6: 95.38461538461539
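
For a side-by-side view, the test metrics printed by the three searches can be gathered into one table. The values below are copied from the outputs above and rounded to two decimals; the row labels are descriptive names chosen for this sketch.

```python
import pandas as pd

# Test-set metrics from the three runs above, in percent (rounded to two decimals).
results = pd.DataFrame(
    {
        "accuracy": [88.28, 85.52, 95.86],
        "precision": [90.16, 86.89, 96.88],
        "recall": [83.33, 80.30, 93.94],
        "f1": [86.61, 83.46, 95.38],
    },
    index=["1: max_depth 3-6 grid", "2: client parameters", "3: max_depth 3-9 grid"],
)

best_run = results["accuracy"].idxmax()
print(results.sort_values("accuracy", ascending=False))
print("Best run:", best_run)
```

All three runs are scored on the same 145-row test split, so these accuracies correspond to 128, 124, and 139 correct predictions respectively; each additional correct prediction moves accuracy by roughly 0.7 percentage points.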


In [42]:
dump(best_model6, "fed_model_4.joblib")

['fed_model_4.joblib']