### Random Forest Modelling: COVID-19 Dataset

#### Content
##### 1) Import Packages
##### 2) Reading Into Data
##### 3) Train Test Splitting
##### 4) Dealing with Target Class Imbalance
##### 5) Necessary Functions
##### 6) Random Forest: Random Under-Sampling
##### 7) Random Forest: SMOTE Over-Sampling
##### 8) Random Forest: Random Under-Sampling and SMOTE
##### 9) Summary

#### 1) Import Packages

In [11]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE


from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

#### 2) Reading Into Data

In [12]:
df = pd.read_csv('COVID_Clean_Data_OHE.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,USMER,SEX,PATIENT_TYPE,PNEUMONIA,AGE,PREGNANT,DIABETES,COPD,ASTHMA,...,MEDICAL_UNIT_5,MEDICAL_UNIT_6,MEDICAL_UNIT_7,MEDICAL_UNIT_8,MEDICAL_UNIT_9,MEDICAL_UNIT_10,MEDICAL_UNIT_11,MEDICAL_UNIT_12,MEDICAL_UNIT_13,DEATH
0,0,2,1,1,1,65,2,2,2,2,...,0,0,0,0,0,0,0,0,0,1
1,1,2,2,1,1,72,2,2,2,2,...,0,0,0,0,0,0,0,0,0,1
2,2,2,2,2,2,55,2,1,2,2,...,0,0,0,0,0,0,0,0,0,1
3,3,2,1,1,2,53,2,2,2,2,...,0,0,0,0,0,0,0,0,0,1
4,4,2,2,1,2,68,2,1,2,2,...,0,0,0,0,0,0,0,0,0,1


In [13]:
df.shape

(1021977, 38)

In [14]:
df.drop(columns=['Unnamed: 0'], inplace=True)

#### 3) Train Test Splitting

In [15]:
df.columns

Index(['USMER', 'SEX', 'PATIENT_TYPE', 'PNEUMONIA', 'AGE', 'PREGNANT',
       'DIABETES', 'COPD', 'ASTHMA', 'INMSUPR', 'HIPERTENSION',
       'OTHER_DISEASE', 'CARDIOVASCULAR', 'OBESITY', 'RENAL_CHRONIC',
       'TOBACCO', 'CLASIFFICATION_FINAL_1', 'CLASIFFICATION_FINAL_2',
       'CLASIFFICATION_FINAL_3', 'CLASIFFICATION_FINAL_4',
       'CLASIFFICATION_FINAL_5', 'CLASIFFICATION_FINAL_6',
       'CLASIFFICATION_FINAL_7', 'MEDICAL_UNIT_1', 'MEDICAL_UNIT_2',
       'MEDICAL_UNIT_3', 'MEDICAL_UNIT_4', 'MEDICAL_UNIT_5', 'MEDICAL_UNIT_6',
       'MEDICAL_UNIT_7', 'MEDICAL_UNIT_8', 'MEDICAL_UNIT_9', 'MEDICAL_UNIT_10',
       'MEDICAL_UNIT_11', 'MEDICAL_UNIT_12', 'MEDICAL_UNIT_13', 'DEATH'],
      dtype='object')

In [16]:
features = ['USMER', 'SEX', 'PATIENT_TYPE', 'PNEUMONIA', 'AGE', 'PREGNANT',
            'DIABETES', 'COPD', 'ASTHMA', 'INMSUPR', 'HIPERTENSION',
            'OTHER_DISEASE', 'CARDIOVASCULAR', 'OBESITY', 'RENAL_CHRONIC',
            'TOBACCO', 'CLASIFFICATION_FINAL_1', 'CLASIFFICATION_FINAL_2',
            'CLASIFFICATION_FINAL_3', 'CLASIFFICATION_FINAL_4',
            'CLASIFFICATION_FINAL_5', 'CLASIFFICATION_FINAL_6',
            'CLASIFFICATION_FINAL_7', 'MEDICAL_UNIT_1', 'MEDICAL_UNIT_2',
            'MEDICAL_UNIT_3', 'MEDICAL_UNIT_4', 'MEDICAL_UNIT_5', 'MEDICAL_UNIT_6',
            'MEDICAL_UNIT_7', 'MEDICAL_UNIT_8', 'MEDICAL_UNIT_9', 'MEDICAL_UNIT_10',
            'MEDICAL_UNIT_11', 'MEDICAL_UNIT_12', 'MEDICAL_UNIT_13']
X = df[features]
y = df['DEATH']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Stratify on y so that train and test keep the same proportion of 'DEATH' values
lst = [X_train, X_test, y_train, y_test]
for x in lst:
    print(x.shape)

(817581, 36)
(204396, 36)
(817581,)
(204396,)
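One way to confirm that stratification worked is to compare the share of each class in the two label sets. The helper below is a minimal sketch: `class_share` is a hypothetical name that does not appear in the notebook, and the toy labels are made up for illustration.

```python
from collections import Counter

def class_share(labels):
    """Return {class: fraction of rows} for any iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Toy labels (made up): 10% positives in both splits
toy_train = [0] * 90 + [1] * 10
toy_test = [0] * 18 + [1] * 2
print(class_share(toy_train))
print(class_share(toy_test))
```

In the notebook, `class_share(y_train)` and `class_share(y_test)` should give nearly identical fractions if the stratified split behaved as intended.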


#### 4) Dealing with Target Class Imbalance
##### Under-sampling and over-sampling: we could over-sample the minority class, at the potential cost of over-fitting, because very similar instances of the minority class are repeated and can reinforce a relationship that is not necessarily present. We could under-sample the majority class, at the cost of losing information, because it reduces the variability present in the majority sample. We could also combine both, for example over-sampling the minority from a 1:100 to a 1:5 ratio and then under-sampling the majority until the classes are equal (https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/#:~:text=Random%20oversampling%20duplicates%20examples%20from,information%20invaluable%20to%20a%20model).
##### SMOTE: Synthetic Minority Over-sampling Technique. SMOTE is said to reduce the variability of the minority class, which is not something we want. On the other hand, it has been reported to reduce class imbalance effectively for low-dimensional data, which is consistent with our number of features. The same source reports that, as class imbalance grew larger, the reduction in bias increased with increasing sample size (although its sample sizes were not in the millions). These results are based on testing the prediction balance of K-Nearest Neighbour and Random Forest classifiers, among other models (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3648438/). We are planning to use these models on this data.
##### Another resource mentions that with very large datasets (in the millions of rows) under-sampling is an option, again at the cost of information loss, which is essentially the same point as the first source (https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/).

##### A further source suggests that combining under-sampling and over-sampling can work well. There are many ways to do this, and we will compare a few of them (https://machinelearningmastery.com/combine-oversampling-and-undersampling-for-imbalanced-classification/).

##### I will compare three approaches: random under-sampling, SMOTE over-sampling, and random under-sampling followed by SMOTE.
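To make the SMOTE idea above concrete, the sketch below builds synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. It is a simplified illustration written with NumPy only, not the imblearn implementation; `smote_like_sample`, the toy points and `k=2` are assumptions made for the example.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic row between a random minority row and one of its
    k nearest minority neighbours (simplified SMOTE-style interpolation)."""
    rng = np.random.default_rng() if rng is None else rng
    base = minority[rng.integers(len(minority))]
    # Euclidean distance from the base row to every minority row
    dists = np.linalg.norm(minority - base, axis=1)
    # Position 0 of the sort is a row at distance 0 (the base itself), so skip it
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbours)]
    gap = rng.random()  # uniform in [0, 1)
    return base + gap * (neighbour - base)

# Toy minority class (made up): four points on the corners of a unit square
minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
rng = np.random.default_rng(42)
synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic row lies on a line segment between two real minority rows, so the interpolation can produce fractional values. For binary or one-hot columns such as those in this dataset, that means synthetic rows may hold values between the original codes, which is worth keeping in mind when interpreting the SMOTE results.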


#### Resampling the data
##### Source: https://machinelearningmastery.com/combine-oversampling-and-undersampling-for-imbalanced-classification/

#### Re-Sampling with Random Under-Sampling

In [17]:
under_base = RandomUnderSampler(sampling_strategy=1, random_state=42)  # shrink the majority class until it matches the minority class in size

In [18]:
X_train_u, y_train_u = under_base.fit_resample(X_train, y_train)
y_train_u.value_counts()

0    59726
1    59726
Name: DEATH, dtype: int64

#### Re-sampling with SMOTE over-sampling

In [19]:
smote = SMOTE(sampling_strategy=1, random_state=42)

In [20]:
X_train_s, y_train_s = smote.fit_resample(X_train, y_train)
y_train_s.value_counts()

0    757855
1    757855
Name: DEATH, dtype: int64

#### Re-Sampling with Under-sampling and SMOTE Combination

In [23]:
under_half = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_train_us, y_train_us = under_half.fit_resample(X_train, y_train)
X_train_us, y_train_us = smote.fit_resample(X_train_us, y_train_us)
y_train_us.value_counts()

0    119452
1    119452
Name: DEATH, dtype: int64
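The three sets of class counts above follow from how a float `sampling_strategy` is interpreted: it is the desired ratio of minority to majority rows after resampling. The sketch below reproduces the arithmetic in plain Python. The helper names are hypothetical, the library's exact rounding is ignored, and the starting counts are the ones implied by the outputs above (757855 majority rows, 59726 minority rows).

```python
def undersample_counts(n_major, n_minor, ratio):
    """Counts after shrinking the majority so that n_minor / n_major == ratio."""
    return int(n_minor / ratio), n_minor

def oversample_counts(n_major, n_minor, ratio):
    """Counts after growing the minority so that n_minor / n_major == ratio."""
    return n_major, int(n_major * ratio)

n_major, n_minor = 757855, 59726  # training counts implied by the outputs above

under_only = undersample_counts(n_major, n_minor, 1)
smote_only = oversample_counts(n_major, n_minor, 1)
half = undersample_counts(n_major, n_minor, 0.5)   # majority cut to twice the minority
combined = oversample_counts(*half, 1)             # minority then raised to match

print(under_only)   # (59726, 59726)
print(smote_only)   # (757855, 757855)
print(combined)     # (119452, 119452)
```

The combined strategy therefore trains on about twice as many rows as pure under-sampling, with half of the minority rows being synthetic.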

#### 5) Necessary Functions

In [24]:
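# Drop the target and helper columns after extracting metrics
def drop(X):
    """Remove the target, probability and prediction columns from X in place,
    printing a message for any column that is not present."""
    for i in ['DEATH', 'prob_survive', 'prob_die', 'y_pred']:
        try:
            X.drop(columns=i, inplace=True)
        except KeyError:
            print(f'Column {i} doesn\'t exist')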
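A possible alternative to the try/except loop is pandas' own `errors='ignore'` option on `DataFrame.drop`, which skips absent columns without printing anything. The sketch below is only an illustration: `drop_quiet`, `HELPER_COLUMNS` and the toy frame are hypothetical names, and unlike `drop` it returns a new frame instead of modifying its argument in place.

```python
import pandas as pd

# Same column names that drop() removes
HELPER_COLUMNS = ['DEATH', 'prob_survive', 'prob_die', 'y_pred']

def drop_quiet(X):
    """Return a copy of X without the helper columns; absent columns are skipped."""
    return X.drop(columns=HELPER_COLUMNS, errors='ignore')

# Toy frame (made up) holding one feature and two helper columns
toy = pd.DataFrame({'AGE': [65, 72], 'DEATH': [1, 0], 'prob_die': [0.9, 0.2]})
print(drop_quiet(toy).columns.tolist())
```

Returning a copy avoids modifying a frame that other cells still rely on, at the cost of extra memory on large training sets.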

In [25]:
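# Grid search to find the best hyperparameters
def grid_search(param_dict, classifier_object, folds, X, y):
    """Run a cross-validated grid search and print the best score and parameters.
    Requires GridSearchCV from sklearn.model_selection."""
    # Track precision and F1, but select and refit the best model on precision
    gs = GridSearchCV(classifier_object, param_grid=param_dict, cv=folds, verbose=1, scoring=['precision', 'f1'],
                      refit='precision')
    gs.fit(X, y)
    print(gs.best_score_)  # mean cross-validated precision of the best candidate
    print(gs.best_params_)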

In [26]:
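def cutoff_iterator(X, y, iterations, accuracy=None, precision=None, f1_score=None):
    """Works through many cutoff points and returns a table of accuracy, precision and F1 scores.
    Requires the sklearn metrics package and a 'prob_die' column in X."""
    # Default to None: mutable default lists would keep growing across calls
    accuracy = [] if accuracy is None else accuracy
    precision = [] if precision is None else precision
    f1_score = [] if f1_score is None else f1_score
    for i in iterations:
        # Predict death wherever the predicted probability exceeds the cutoff
        X['y_pred'] = np.where(X['prob_die'] > i, 1, 0)
        accuracy.append(metrics.accuracy_score(y, X['y_pred']))
        precision.append(metrics.precision_score(y, X['y_pred']))
        f1_score.append(metrics.f1_score(y, X['y_pred']))
    dict_2 = {'Cut_Off_Points': iterations,
              'Accuracy': accuracy,
              'Precision': precision,
              'F1': f1_score}
    metrics_table = pd.DataFrame(dict_2)
    return metrics_table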
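To show what `cutoff_iterator` computes at each cutoff, the sketch below derives accuracy, precision and F1 from raw counts in plain Python. `metrics_at_cutoff` is a hypothetical helper and the probabilities and labels are made up. Returning 0 when there are no positive predictions is a choice made for this sketch.

```python
def metrics_at_cutoff(probs, labels, cutoff):
    """Return (accuracy, precision, f1) when predicting 1 wherever prob > cutoff."""
    preds = [1 if p > cutoff else 0 for p in probs]
    tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)
    tn = len(labels) - tp - fp - fn
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0  # no positive predictions: report 0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, f1

# Toy scores (made up): three true deaths among six patients
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
for cutoff in (0.25, 0.50, 0.75):
    print(cutoff, metrics_at_cutoff(probs, labels, cutoff))
```

On this toy data, raising the cutoff from 0.25 to 0.75 lifts precision from 0.6 to 1.0 while recall falls from 1.0 to about 0.67, which is the trade-off the cutoff tables in the following sections explore.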

#### 6) Random Forest: Random Under-Sampling

In [27]:
X_train_u['DEATH'] = y_train_u
X_train_u_sample = X_train_u.sample(n=1000)
drop(X_train_u)

Column prob_survive doesn't exist
Column prob_die doesn't exist
Column y_pred doesn't exist


In [28]:
y_u = X_train_u_sample['DEATH']
X_u = X_train_u_sample
drop(X_u)

Column prob_survive doesn't exist
Column prob_die doesn't exist
Column y_pred doesn't exist


In [29]:
X_u.columns

Index(['USMER', 'SEX', 'PATIENT_TYPE', 'PNEUMONIA', 'AGE', 'PREGNANT',
       'DIABETES', 'COPD', 'ASTHMA', 'INMSUPR', 'HIPERTENSION',
       'OTHER_DISEASE', 'CARDIOVASCULAR', 'OBESITY', 'RENAL_CHRONIC',
       'TOBACCO', 'CLASIFFICATION_FINAL_1', 'CLASIFFICATION_FINAL_2',
       'CLASIFFICATION_FINAL_3', 'CLASIFFICATION_FINAL_4',
       'CLASIFFICATION_FINAL_5', 'CLASIFFICATION_FINAL_6',
       'CLASIFFICATION_FINAL_7', 'MEDICAL_UNIT_1', 'MEDICAL_UNIT_2',
       'MEDICAL_UNIT_3', 'MEDICAL_UNIT_4', 'MEDICAL_UNIT_5', 'MEDICAL_UNIT_6',
       'MEDICAL_UNIT_7', 'MEDICAL_UNIT_8', 'MEDICAL_UNIT_9', 'MEDICAL_UNIT_10',
       'MEDICAL_UNIT_11', 'MEDICAL_UNIT_12', 'MEDICAL_UNIT_13'],
      dtype='object')

In [30]:
for x in [X_u,y_u]:
    print(x.shape)

(1000, 36)
(1000,)


In [31]:
param_dict = {'n_estimators': [75, 100, 125],
              'max_depth': [5, 6, 7, 8],
              'min_samples_split': [2, 3, 4],
              'min_samples_leaf': [1, 2, 3]}
grid_search(param_dict=param_dict, classifier_object=RandomForestClassifier(random_state=42), folds=10, X=X_u,
            y=y_u)

Fitting 10 folds for each of 108 candidates, totalling 1080 fits


KeyboardInterrupt: 
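The size of the search explains why it is slow enough to be interrupted: the grid has 3 × 4 × 3 × 3 combinations, and each is fitted once per fold. The sketch below counts them with the same values as the cell above; the variable names `grid` and `n_folds` are chosen for this example only.

```python
from itertools import product

# Same hyperparameter values as the grid-search cell above
grid = {'n_estimators': [75, 100, 125],
        'max_depth': [5, 6, 7, 8],
        'min_samples_split': [2, 3, 4],
        'min_samples_leaf': [1, 2, 3]}
n_folds = 10

candidates = list(product(*grid.values()))
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)  # 108 1080
```

If run time is a concern, reducing the number of folds or passing `n_jobs=-1` to `GridSearchCV` to use all cores are two options that might shorten the search.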

In [32]:
drop(X_train_u)

Column DEATH doesn't exist
Column prob_survive doesn't exist
Column prob_die doesn't exist
Column y_pred doesn't exist


In [33]:
rf = RandomForestClassifier(max_depth=5, n_estimators=75, min_samples_leaf=1, min_samples_split=2, random_state=42)
rf.fit(X_train_u, y_train_u)

RandomForestClassifier(max_depth=5, n_estimators=75, random_state=42)

In [34]:
X_train_u[['prob_survive', 'prob_die']] = rf.predict_proba(X_train_u)

In [35]:
iterations = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85,
              0.90, 0.95]
cutoff_iterator(X=X_train_u, y=y_train_u, iterations=iterations, accuracy=[], precision=[], f1_score=[])

  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,Cut_Off_Points,Accuracy,Precision,F1
0,0.05,0.613594,0.564097,0.721229
1,0.1,0.690528,0.617872,0.763437
2,0.15,0.808233,0.724159,0.838516
3,0.2,0.860136,0.785147,0.87639
4,0.25,0.88531,0.820346,0.895868
5,0.3,0.898244,0.841554,0.906041
6,0.35,0.905167,0.855116,0.91141
7,0.4,0.908792,0.866,0.913829
8,0.45,0.910131,0.873698,0.914309
9,0.5,0.911035,0.881458,0.914356


In [62]:
drop(X_test)

Column DEATH doesn't exist


In [63]:
X_test[['prob_survive', 'prob_die']] = rf.predict_proba(X_test)
X_test['y_pred'] = np.where(X_test['prob_die'] > 0.75, 1, 0)
print('Precision: ', metrics.precision_score(y_test, X_test['y_pred']))
print('f1_score: ', metrics.f1_score(y_test, X_test['y_pred']))

Precision:  0.4751364848279656
f1_score:  0.5823132780082987


#### 7) Random Forest: SMOTE Over-Sampling

In [36]:
X_train_s['DEATH'] = y_train_s
X_train_s_sample = X_train_s.sample(n=1000)
drop(X_train_s)

Column prob_survive doesn't exist
Column prob_die doesn't exist
Column y_pred doesn't exist


In [37]:
y_s = X_train_s_sample['DEATH']
X_s = X_train_s_sample
drop(X_s)

Column prob_survive doesn't exist
Column prob_die doesn't exist
Column y_pred doesn't exist


In [38]:
X_s.columns

Index(['USMER', 'SEX', 'PATIENT_TYPE', 'PNEUMONIA', 'AGE', 'PREGNANT',
       'DIABETES', 'COPD', 'ASTHMA', 'INMSUPR', 'HIPERTENSION',
       'OTHER_DISEASE', 'CARDIOVASCULAR', 'OBESITY', 'RENAL_CHRONIC',
       'TOBACCO', 'CLASIFFICATION_FINAL_1', 'CLASIFFICATION_FINAL_2',
       'CLASIFFICATION_FINAL_3', 'CLASIFFICATION_FINAL_4',
       'CLASIFFICATION_FINAL_5', 'CLASIFFICATION_FINAL_6',
       'CLASIFFICATION_FINAL_7', 'MEDICAL_UNIT_1', 'MEDICAL_UNIT_2',
       'MEDICAL_UNIT_3', 'MEDICAL_UNIT_4', 'MEDICAL_UNIT_5', 'MEDICAL_UNIT_6',
       'MEDICAL_UNIT_7', 'MEDICAL_UNIT_8', 'MEDICAL_UNIT_9', 'MEDICAL_UNIT_10',
       'MEDICAL_UNIT_11', 'MEDICAL_UNIT_12', 'MEDICAL_UNIT_13'],
      dtype='object')

In [39]:
for x in [X_s, y_s]:
    print(x.shape)

(1000, 36)
(1000,)


In [68]:
param_dict = {'n_estimators': [75, 100, 125],
              'max_depth': [5, 6, 7, 8],
              'min_samples_split': [2, 3, 4],
              'min_samples_leaf': [1, 2, 3]}
grid_search(param_dict=param_dict, classifier_object=RandomForestClassifier(random_state=42), folds=10, X=X_s,
            y=y_s)

Fitting 10 folds for each of 108 candidates, totalling 1080 fits
0.8980697542267395
{'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 125}


In [40]:
drop(X_train_s)

Column DEATH doesn't exist
Column prob_survive doesn't exist
Column prob_die doesn't exist
Column y_pred doesn't exist


In [41]:
rf = RandomForestClassifier(max_depth=5, n_estimators=125, min_samples_leaf=1, min_samples_split=2, random_state=42)
rf.fit(X_train_s, y_train_s)

RandomForestClassifier(max_depth=5, n_estimators=125, random_state=42)

In [42]:
X_train_s[['prob_survive', 'prob_die']] = rf.predict_proba(X_train_s)

In [43]:
iterations = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85,
              0.90, 0.95]
cutoff_iterator(X=X_train_s, y=y_train_s, iterations=iterations, accuracy=[], precision=[], f1_score=[])

  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,Cut_Off_Points,Accuracy,Precision,F1
0,0.05,0.583539,0.545583,0.705959
1,0.1,0.714843,0.636997,0.777935
2,0.15,0.812157,0.728294,0.841305
3,0.2,0.864913,0.790999,0.880136
4,0.25,0.886339,0.821193,0.896804
5,0.3,0.899192,0.842268,0.906932
6,0.35,0.907315,0.857494,0.913352
7,0.4,0.911337,0.870059,0.916021
8,0.45,0.913956,0.880089,0.917626
9,0.5,0.91467,0.889472,0.917344


In [73]:
drop(X_test)

Column DEATH doesn't exist


In [74]:
X_test[['prob_survive', 'prob_die']] = rf.predict_proba(X_test)
X_test['y_pred'] = np.where(X_test['prob_die'] > 0.80, 1, 0)
print('Precision: ', metrics.precision_score(y_test, X_test['y_pred']))
print('f1_score: ', metrics.f1_score(y_test, X_test['y_pred']))

Precision:  0.537595111137739
f1_score:  0.5675162861299096


#### 8) Random Forest: Random Under-Sampling and SMOTE

In [44]:
X_train_us['DEATH'] = y_train_us
X_train_us_sample = X_train_us.sample(n=1000)
drop(X_train_us)

Column prob_survive doesn't exist
Column prob_die doesn't exist
Column y_pred doesn't exist


In [45]:
y_us = X_train_us_sample['DEATH']
X_us = X_train_us_sample
drop(X_us)

Column prob_survive doesn't exist
Column prob_die doesn't exist
Column y_pred doesn't exist


In [46]:
X_us.columns

Index(['USMER', 'SEX', 'PATIENT_TYPE', 'PNEUMONIA', 'AGE', 'PREGNANT',
       'DIABETES', 'COPD', 'ASTHMA', 'INMSUPR', 'HIPERTENSION',
       'OTHER_DISEASE', 'CARDIOVASCULAR', 'OBESITY', 'RENAL_CHRONIC',
       'TOBACCO', 'CLASIFFICATION_FINAL_1', 'CLASIFFICATION_FINAL_2',
       'CLASIFFICATION_FINAL_3', 'CLASIFFICATION_FINAL_4',
       'CLASIFFICATION_FINAL_5', 'CLASIFFICATION_FINAL_6',
       'CLASIFFICATION_FINAL_7', 'MEDICAL_UNIT_1', 'MEDICAL_UNIT_2',
       'MEDICAL_UNIT_3', 'MEDICAL_UNIT_4', 'MEDICAL_UNIT_5', 'MEDICAL_UNIT_6',
       'MEDICAL_UNIT_7', 'MEDICAL_UNIT_8', 'MEDICAL_UNIT_9', 'MEDICAL_UNIT_10',
       'MEDICAL_UNIT_11', 'MEDICAL_UNIT_12', 'MEDICAL_UNIT_13'],
      dtype='object')

In [47]:
for x in [X_us, y_us]:
    print(x.shape)

(1000, 36)
(1000,)


In [48]:
param_dict = {'n_estimators': [75, 100, 125],
              'max_depth': [5, 6, 7, 8],
              'min_samples_split': [2, 3, 4],
              'min_samples_leaf': [1, 2, 3]}
grid_search(param_dict=param_dict, classifier_object=RandomForestClassifier(random_state=42), folds=10, X=X_us,
            y=y_us)

Fitting 10 folds for each of 108 candidates, totalling 1080 fits


KeyboardInterrupt: 

In [49]:
drop(X_train_us)

Column DEATH doesn't exist
Column prob_survive doesn't exist
Column prob_die doesn't exist
Column y_pred doesn't exist


In [50]:
rf = RandomForestClassifier(max_depth=8, n_estimators=100, min_samples_leaf=2, min_samples_split=2, random_state=42)
rf.fit(X_train_us, y_train_us)

RandomForestClassifier(max_depth=8, min_samples_leaf=2, random_state=42)

In [51]:
X_train_us[['prob_survive', 'prob_die']] = rf.predict_proba(X_train_us)

In [52]:
iterations = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85,
              0.90, 0.95]
cutoff_iterator(X=X_train_us, y=y_train_us, iterations=iterations, accuracy=[], precision=[], f1_score=[])

Unnamed: 0,Cut_Off_Points,Accuracy,Precision,F1
0,0.05,0.738125,0.656544,0.792256
1,0.1,0.843272,0.763439,0.863895
2,0.15,0.875841,0.805254,0.888709
3,0.2,0.890412,0.826399,0.900199
4,0.25,0.901668,0.844526,0.909198
5,0.3,0.909286,0.858705,0.91526
6,0.35,0.912668,0.866297,0.917867
7,0.4,0.915112,0.873501,0.919591
8,0.45,0.917444,0.881592,0.921148
9,0.5,0.917942,0.889465,0.920836


In [84]:
drop(X_test)

Column DEATH doesn't exist


In [85]:
X_test[['prob_survive', 'prob_die']] = rf.predict_proba(X_test)
X_test['y_pred'] = np.where(X_test['prob_die'] > 0.85, 1, 0)
print('Precision: ', metrics.precision_score(y_test, X_test['y_pred']))
print('f1_score: ', metrics.f1_score(y_test, X_test['y_pred']))

Precision:  0.5490489052353313
f1_score:  0.5909934430285785


#### 9) Summary
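##### The best model is the random forest trained on data that was randomly under-sampled and then SMOTE over-sampled, which achieved a precision of 0.55 and an F1 score of 0.59 on the test data. The gap in precision and F1 score between the training and test data was very large, which suggests that the model has yet again overfit, although part of the gap may arise because the training metrics were computed on balanced resampled data while the test set keeps the original class imbalance.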
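When reading the train/test gap, it may help to remember that precision depends on the share of positives in the data being scored. The training tables were computed on balanced resampled data, while the test set keeps the original imbalance (59726 deaths in 817581 training rows before resampling, roughly 7.3%). The sketch below shows how a fixed classifier's precision would change between those two settings. The true-positive and false-positive rates are hypothetical values chosen for illustration, not measurements from the models above.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by a true-positive rate, a false-positive rate
    and the share of positives in the scored data."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

tpr, fpr = 0.80, 0.08  # hypothetical rates, NOT measured from the notebook's models

balanced = precision_at_prevalence(tpr, fpr, 0.5)               # 1:1 resampled data
imbalanced = precision_at_prevalence(tpr, fpr, 59726 / 817581)  # original class share
print(round(balanced, 3), round(imbalanced, 3))
```

Under these assumed rates, precision drops substantially when moving from balanced to imbalanced data even though the classifier is unchanged, so comparing the models on a held-out validation split with the original class distribution might give a clearer picture of how much of the gap is over-fitting.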