## Implementing Horizontal Federated Learning with Random Forest in the Healthcare Sector

## Project overview

1. We'll demonstrate this project using five heart disease datasets.


2. All the datasets were pre-processed beforehand. We won't show the pre-processing steps here, but we'll use the pre-processed datasets throughout this implementation.


3. This workflow applies only to horizontal federated learning, in which every participant holds the same feature columns for different patients.


4. In this project we consider two clients, each holding one hospital's data: Cleveland and Hungary.


5. Our federated server also has its own local data, which we built by merging the data of three other hospitals.
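
Because horizontal federated learning relies on every participant storing the same columns, it can help to verify that before any model is exchanged. The following is a minimal sketch of such a check; the helper name `shared_schema` and the shortened column lists are hypothetical and only illustrate the idea, they are not part of the project code.

```python
def shared_schema(column_lists):
    """Return True when every participant exposes exactly the same columns."""
    reference = set(column_lists[0])
    return all(set(cols) == reference for cols in column_lists[1:])


# Hypothetical, shortened column lists for the server and the two clients
server_cols = ["age", "cholesterol", "sex_male", "target"]
cleveland_cols = ["age", "cholesterol", "sex_male", "target"]
hungarian_cols = ["cholesterol", "age", "target", "sex_male"]  # order may differ

print(shared_schema([server_cols, cleveland_cols, hungarian_cols]))
```

The check compares column names only and ignores their order, which is enough here because the participants exchange model settings rather than rows.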

## Steps: 

### Phase 1:

1. First, we'll initialize a Random Forest model on our federated central server. This gives us moderate evaluation metrics on the server data. We'll then save the model so that its parameters can be sent to our clients.


2. Next, we'll send the initial model's parameters to both clients, which use them to run Random Forest models on their own servers.


3. After that, we'll tune our clients' models. If the evaluation metrics improve, we'll save the tuned models.


4. Finally, we'll send both models' tuned parameters (hyperparameters such as `max_depth` and `n_estimators`, not the data) to the federated server and check whether the federated model's accuracy improves.

### Phase 2:

1. We'll increase the data on the federated server as well as on our clients' servers. We do this because this type of data never stays the same: it arrives online, and new records can be added at any time.


2. We'll repeat the same process on the enlarged data to see whether the accuracy changes.

<img src='method.jpeg' width = 450px>

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from joblib import dump, load

In [2]:
dt = pd.read_csv("federated_80%.csv")
dt.head()

Unnamed: 0,age,resting_blood_pressure,cholesterol,fasting_blood_sugar,max_heart_rate_achieved,exercise_induced_angina,st_depression,sex_male,chest_pain_type_atypical angina,chest_pain_type_non-anginal pain,chest_pain_type_typical angina,rest_ecg_left ventricular hypertrophy,rest_ecg_normal,st_slope_flat,st_slope_upsloping,target
0,40,140,199,0,178,1,1.4,1,0,0,1,0,1,0,1,0
1,42,120,295,0,162,0,0.0,1,1,0,0,0,1,0,1,0
2,54,108,309,0,156,0,0.0,1,1,0,0,0,1,0,1,0
3,58,125,220,0,144,0,0.4,1,1,0,0,0,1,1,0,0
4,58,120,340,0,172,0,0.0,0,0,1,0,0,1,0,1,0


In [3]:
dt.shape

(463, 16)

In [4]:
features =dt.drop("target", axis=1)

In [5]:
target = dt["target"]

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(features, target, 
                                                        test_size=0.3, stratify=target,
                                                        random_state=5)
print('-----------remodeled-Training Set------------------')
print(X_train.shape)
print(y_train.shape)

-----------remodeled-Training Set------------------
(324, 15)
(324,)
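
The `stratify=target` argument asks `train_test_split` to keep the class balance of the full dataset in both subsets. Here is a small sketch of that effect on synthetic stand-in data; the arrays `X_demo` and `y_demo` are assumptions made up for illustration and are unrelated to the heart disease data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 200 rows, 40% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 120 + [1] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=5
)

# With stratify, each subset keeps (close to) the overall positive rate
print("overall:", y_demo.mean())
print("train:  ", y_tr.mean())
print("test:   ", y_te.mean())
```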


In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

### Training our initial federated model

In [9]:
model1 = RandomForestClassifier(random_state=1)

# Define the grid of hyperparameters 'params_rf'
params_rf = {'n_estimators': list(range(1,21,5)), 'max_depth': list(range(2,5)),
             'max_features': ['log2','sqrt'], 'criterion': ['gini', 'entropy']}

# Instantiate a 5-fold CV grid search object 'grid_rf'
grid_rf = GridSearchCV(estimator=model1, param_grid=params_rf, scoring='accuracy', cv=5, n_jobs=-1)

grid_rf.fit(X_train, y_train)

# Extract best model from 'grid_rf'
best_model1 = grid_rf.best_estimator_

# Extract best hyperparameters from 'grid_rf'
best_hyperparams = grid_rf.best_params_
print('Best hyperparameters:', best_hyperparams)

# Evaluate test set accuracy
pred1 = best_model1.predict(X_test)

test_acc1 = accuracy_score(y_test, pred1)
print('Test Accuracy:', test_acc1*100)

precision_1 = precision_score(y_test, pred1)
print('Precision_1:', precision_1*100)

recall_1 = recall_score(y_test, pred1)
print('Recall_1:', recall_1*100)

f1_score_1 = f1_score(y_test, pred1)
print('f1_score_1:', f1_score_1*100)

Best hyperparameters: {'criterion': 'entropy', 'max_depth': 4, 'max_features': 'log2', 'n_estimators': 11}
Test Accuracy: 81.29496402877699
Precision_1: 87.75510204081633
Recall_1: 68.25396825396825
f1_score_1: 76.78571428571428
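
To see how the four metrics above relate to each other, we can rebuild them from confusion-matrix counts. The counts below are not printed anywhere in this notebook; they are inferred from the reported percentages and the 139-row test set (463 rows minus 324 training rows), so treat them as an assumption.

```python
# Counts inferred from the reported metrics (an assumption, see the text above)
tp, fp, fn, tn = 43, 6, 20, 70

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1)]:
    print(f"{name}: {value * 100:.2f}")
```

Under these counts the model rarely raises a false alarm (6 false positives) but misses 20 of the 63 patients with the condition, which is why recall is the weakest metric of the initial model.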


### Saving the initial federated model

In [10]:
# saving the model
from joblib import dump, load

In [11]:
dump(best_model1, "initial_fed_model.joblib")

['initial_fed_model.joblib']
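
The client notebooks are not part of this file, so the following is only a sketch of how a client might pick up the saved model and reuse its settings. The synthetic data, the temporary directory, and the names `server_model`, `received`, `shared` and `client_model` are all assumptions for illustration. Note that a joblib file stores the whole fitted estimator; the sketch reads only the hyperparameters from it.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the server side: fit and save a small model on synthetic data
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)
server_model = RandomForestClassifier(
    n_estimators=11, max_depth=4, max_features="log2",
    criterion="entropy", random_state=1,
).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "initial_fed_model.joblib")
    dump(server_model, path)
    received = load(path)  # what a client would do with the file it receives

# Read only the four hyperparameters this project exchanges
keys = ("n_estimators", "max_depth", "max_features", "criterion")
shared = {k: received.get_params()[k] for k in keys}

# The client starts its own, still unfitted, model from those settings
client_model = RandomForestClassifier(random_state=1, **shared)
print(shared)
```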

### Helper Functions to receive parameters from clients

In [12]:
def no_repeat(lst):
    """Return the items of lst in first-seen order, without duplicates."""
    new_lst = []
    for i in lst:
        if i not in new_lst:
            new_lst.append(i)
    return new_lst

In [13]:
def parameters(model_lst):
    """Load each saved client model and collect its distinct hyperparameter values."""
    md = []
    mf = []
    ne = []
    cr = []

    for path in model_lst:
        # Each entry is the file name of a model saved with joblib
        loaded_model = load(path)

        md.append(loaded_model.max_depth)
        mf.append(loaded_model.max_features)
        ne.append(loaded_model.n_estimators)
        cr.append(loaded_model.criterion)

    # Build the grid once, after every client model has been read
    return {
        'max_depth': no_repeat(md),
        'max_features': no_repeat(mf),
        'n_estimators': no_repeat(ne),
        'criterion': no_repeat(cr),
    }

In [14]:
# loading the model from saved file
client_cleveland_1 = load("client_cleveland_1.joblib")
print('client_cleveland_1:', client_cleveland_1)

# loading the model from saved file
client_hungarian_1  = load("client_hungarian_1.joblib")
print('client_hungarian_1:', client_hungarian_1)

client_cleveland_1: RandomForestClassifier(max_depth=7, max_features='log2', n_estimators=85,
                       random_state=1)
client_hungarian_1: RandomForestClassifier(max_depth=4, max_features='log2', n_estimators=35,
                       random_state=1)


In [15]:
clients = ["client_cleveland_1.joblib", "client_hungarian_1.joblib"]
params_rf = parameters(clients)
params_rf

{'max_depth': [7, 4],
 'max_features': ['log2'],
 'n_estimators': [85, 35],
 'criterion': ['gini']}
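
Since this dictionary is passed to `GridSearchCV` as a grid, the server will try every combination of the clients' values, not just the two settings the clients actually used. A short sketch that enumerates the candidates; the name `candidates` and the explicit count are additions for illustration.

```python
from itertools import product

# The merged client grid shown above
params_rf = {'max_depth': [7, 4],
             'max_features': ['log2'],
             'n_estimators': [85, 35],
             'criterion': ['gini']}

# Every combination a grid search over params_rf would evaluate
candidates = [dict(zip(params_rf, combo)) for combo in product(*params_rf.values())]
for candidate in candidates:
    print(candidate)

cv_folds = 5
print("candidates:", len(candidates))                       # 4
print("cross-validation fits:", len(candidates) * cv_folds)  # 20
```

The four candidates include mixed settings such as `max_depth=7` with `n_estimators=35`, which neither client submitted on its own.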

### Applying our clients' parameters to the federated model

In [16]:
model1 = RandomForestClassifier(random_state=1)

# Instantiate a 5-fold CV grid search object 'grid_rf2' over the clients' values
grid_rf2 = GridSearchCV(estimator=model1, param_grid=params_rf, scoring='accuracy', cv=5, n_jobs=-1)

grid_rf2.fit(X_train, y_train)

# Extract best model from 'grid_rf2'
best_model2 = grid_rf2.best_estimator_

# Extract best hyperparameters from 'grid_rf2'
best_hyperparams2 = grid_rf2.best_params_
print('Best hyperparameters', best_hyperparams2)

# Evaluate test set accuracy
pred2 = best_model2.predict(X_test)

test_acc2 = accuracy_score(y_test, pred2)
print('Test Accuracy_2:', test_acc2*100)

precision_2 = precision_score(y_test, pred2)
print('Precision_2:', precision_2*100)

recall_2 = recall_score(y_test, pred2)
print('Recall_2:', recall_2*100)

f1_score_2 = f1_score(y_test, pred2)
print('f1_score_2:', f1_score_2*100)

Best hyperparameters {'criterion': 'gini', 'max_depth': 7, 'max_features': 'log2', 'n_estimators': 35}
Test Accuracy_2: 87.05035971223022
Precision_2: 94.11764705882352
Recall_2: 76.19047619047619
f1_score_2: 84.21052631578947


### Further tuning the federated model

In [17]:
# Define the grid of hyperparameters 'params_rf3'
params_rf3 = {'n_estimators': list(range(20,150,10)), 'max_depth': list(range(3,8)),
              'max_features': ['log2','sqrt'], 'criterion': ['gini', 'entropy']}


# Instantiate a 5-fold CV grid search object 'grid_rf3'
grid_rf3 = GridSearchCV(estimator=model1, param_grid=params_rf3, scoring='accuracy', cv=5, n_jobs=-1)

grid_rf3.fit(X_train, y_train)

# Extract best model from 'grid_rf3'
best_model3 = grid_rf3.best_estimator_

# Extract best hyperparameters from 'grid_rf3'
best_hyperparams3 = grid_rf3.best_params_
print('Best hyperparameters', best_hyperparams3)

# Evaluate test set accuracy
pred3 = best_model3.predict(X_test)

test_acc3 = accuracy_score(y_test, pred3)
print('Test Accuracy_3:', test_acc3*100)

precision_3 = precision_score(y_test, pred3)
print('Precision_3:', precision_3*100)

recall_3 = recall_score(y_test, pred3)
print('Recall_3:', recall_3*100)

f1_score_3 = f1_score(y_test, pred3)
print('f1_score_3:', f1_score_3*100)

Best hyperparameters {'criterion': 'entropy', 'max_depth': 7, 'max_features': 'log2', 'n_estimators': 30}
Test Accuracy_3: 87.76978417266187
Precision_3: 92.5925925925926
Recall_3: 79.36507936507937
f1_score_3: 85.47008547008546


In [18]:
dump(best_model3, "fed_model_2.joblib")

['fed_model_2.joblib']
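
The three Phase 1 evaluations all use the same test set, so their accuracies can be turned back into counts of correctly classified patients. The sketch below does that conversion; the test set size of 139 is derived from the shapes printed earlier (463 rows minus 324 training rows), and the dictionary labels are our own.

```python
n_test = 139  # 463 rows in total minus 324 training rows

# Test accuracies (in percent) as printed by the three Phase 1 runs
accuracies = {
    "initial model": 81.29496402877699,
    "client grid": 87.05035971223022,
    "further tuned": 87.76978417266187,
}

# Convert each percentage back to a number of correct predictions
correct = {name: round(acc / 100 * n_test) for name, acc in accuracies.items()}
for name, count in correct.items():
    print(f"{name}: {count} of {n_test} correct")
```

Seen this way, the clients' grid adds 8 correct predictions over the initial model, while the further tuning adds 1 more, a difference small enough that it may not carry over to new data.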

## FL with newly added data (Phase 2)

In [19]:
df = pd.read_csv("federated_full.csv")
df.head()

Unnamed: 0,age,resting_blood_pressure,cholesterol,fasting_blood_sugar,max_heart_rate_achieved,exercise_induced_angina,st_depression,target,sex_male,chest_pain_type_atypical angina,chest_pain_type_non-anginal pain,chest_pain_type_typical angina,rest_ecg_left ventricular hypertrophy,rest_ecg_normal,st_slope_flat,st_slope_upsloping
0,57,165,289,1,124,0,1.0,1,1,0,0,0,1,0,1,0
1,63,130,254,0,147,0,1.4,1,1,0,0,0,1,0,1,0
2,48,124,274,0,166,0,0.5,1,1,0,0,0,1,0,1,0
3,51,100,222,0,143,1,1.2,0,1,0,1,0,0,1,1,0
4,60,150,258,0,157,0,2.6,1,0,0,0,0,1,0,1,0


In [20]:
df.shape

(579, 16)

In [21]:
X = df.drop("target", axis=1)
y = df["target"]

X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, 
                                                        test_size=0.25, stratify=y,
                                                        random_state=5)
print('-----------remodeled-Training Set------------------')
print(X_train2.shape)
print(y_train2.shape)

print('------------Test Set------------------')
print(X_test2.shape)
print(y_test2.shape)

-----------remodeled-Training Set------------------
(434, 15)
(434,)
------------Test Set------------------
(145, 15)
(145,)
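
The subset sizes follow directly from `test_size`. The sketch below assumes that the test share is rounded up and the training set receives the remaining rows, which is consistent with the shapes printed in this notebook; the helper `split_sizes` is a hypothetical name, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(split_sizes(463, 0.3))   # Phase 1: (324, 139)
print(split_sizes(579, 0.25))  # Phase 2: (434, 145)
```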


### Tuning the model on the dataset with newly added samples

In [36]:
# Define the grid of hyperparameters 'params_rf4'
params_rf4 = {'n_estimators': list(range(30, 40, 5)), 'max_depth': list(range(2,8)),
              'max_features': ['log2','sqrt'], 'criterion': ['gini', 'entropy']}


# Instantiate a 5-fold CV grid search object 'grid_rf4'
grid_rf4 = GridSearchCV(estimator=model1, param_grid=params_rf4, scoring='accuracy', cv=5, n_jobs=-1)

grid_rf4.fit(X_train2, y_train2)

# Extract best model from 'grid_rf4'
best_model4 = grid_rf4.best_estimator_

# Extract best hyperparameters from 'grid_rf4'
best_hyperparams4 = grid_rf4.best_params_
print('Best hyperparameters', best_hyperparams4)

# Evaluate test set accuracy
pred4 = best_model4.predict(X_test2)

test_acc4 = accuracy_score(y_test2, pred4)
print('Test Accuracy_4:', test_acc4*100)

precision_4 = precision_score(y_test2, pred4)
print('Precision_4:', precision_4*100)

recall_4 = recall_score(y_test2, pred4)
print('Recall_4:', recall_4*100)

f1_score_4 = f1_score(y_test2, pred4)
print('f1_score_4:', f1_score_4*100)

Best hyperparameters {'criterion': 'entropy', 'max_depth': 7, 'max_features': 'log2', 'n_estimators': 35}
Test Accuracy_4: 92.41379310344827
Precision_4: 95.08196721311475
Recall_4: 87.87878787878788
f1_score_4: 91.33858267716536


### Saving the model for clients

In [37]:
dump(best_model4, "fed_model_3.joblib")

['fed_model_3.joblib']