## Set Up a Virtual Environment to Avoid Package Conflicts

<li> python -m venv venv
<li> venv\Scripts\activate (on Windows)
<li> ipython kernel install --user --name=venv
<li> Change the notebook kernel to venv
<li> When finished: jupyter kernelspec uninstall venv
<li> Install and import packages

In [1]:
pip install pandas scikit-learn imbalanced-learn xgboost

Note: you may need to restart the kernel to use updated packages.





In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, precision_score, f1_score

In [3]:
df = pd.read_csv('OPTIMAL_combined_3studies_6feb2020.csv')

### Inspecting Data

In [4]:
df.shape

(1842, 22)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1842 entries, 0 to 1841
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ID                    1842 non-null   int64  
 1   age                   1842 non-null   float64
 2   gender                1842 non-null   object 
 3   dementia              1808 non-null   float64
 4   dementia_all          1842 non-null   int64  
 5   educationyears        1842 non-null   float64
 6   EF                    1634 non-null   float64
 7   PS                    1574 non-null   float64
 8   Global                1534 non-null   float64
 9   diabetes              1842 non-null   int64  
 10  smoking               1831 non-null   object 
 11  hypertension          1842 non-null   object 
 12  hypercholesterolemia  1842 non-null   object 
 13  lacunes_num           1842 non-null   object 
 14  fazekas_cat           1842 non-null   object 
 15  study                

In [6]:
df.describe()

Unnamed: 0,ID,age,dementia,dementia_all,educationyears,EF,PS,Global,diabetes,SVD Simple Score,SVD Amended Score,Fazekas
count,1842.0,1842.0,1808.0,1842.0,1842.0,1634.0,1574.0,1534.0,1842.0,1165.0,1165.0,1842.0
mean,29897.929967,65.952588,0.045354,0.062975,11.139522,-0.063088,-0.066649,-0.02686,0.122693,0.719313,1.491845,1.311075
std,67056.874773,8.923488,0.208137,0.242984,2.983946,0.785264,0.871836,0.677071,0.328173,0.932063,1.623277,0.799495
min,1.0,38.0,0.0,0.0,1.0,-5.2,-2.68,-2.42,0.0,0.0,0.0,0.0
25%,223.5,60.0,0.0,0.0,9.0,-0.436896,-0.670805,-0.476881,0.0,0.0,0.0,1.0
50%,612.5,66.151393,0.0,0.0,10.0,0.13,-0.02,0.006667,0.0,0.0,1.0,1.0
75%,1713.25,73.0,0.0,0.0,13.0,0.484327,0.55,0.423333,0.0,1.0,2.0,2.0
max,211301.0,90.0,1.0,1.0,24.0,2.35,2.73,1.853333,1.0,3.0,7.0,3.0


In [7]:
pd.set_option("display.max_columns", None)
df.head(5)

Unnamed: 0,ID,age,gender,dementia,dementia_all,educationyears,EF,PS,Global,diabetes,smoking,hypertension,hypercholesterolemia,lacunes_num,fazekas_cat,study,study1,SVD Simple Score,SVD Amended Score,Fazekas,lac_count,CMB_count
0,1,52.67,male,0.0,0,11.0,-2.403333,-1.29,-1.287,0,current-smoker,Yes,Yes,more-than-zero,2 to 3,scans,scans,3.0,7.0,3,>5,>=1
1,10,64.58,male,0.0,0,10.0,1.28,0.36,0.744,0,ex-smoker,Yes,Yes,more-than-zero,0 to 1,scans,scans,2.0,3.0,1,1 to 2,>=1
2,100,74.92,male,0.0,0,8.0,-1.44,-1.52,-0.922,0,never-smoker,Yes,Yes,more-than-zero,0 to 1,scans,scans,1.0,2.0,1,1 to 2,0
3,101,74.83,male,1.0,1,9.0,,-2.136271,-1.301102,0,current-smoker,Yes,Yes,more-than-zero,2 to 3,scans,scans,2.0,4.0,2,3 to 5,0
4,102,79.25,male,0.0,0,10.0,-0.92,-1.493333,-0.924,0,ex-smoker,Yes,Yes,more-than-zero,2 to 3,scans,scans,2.0,3.0,2,1 to 2,0


Remove columns that carry no useful information for modelling. 'dementia' is already covered by 'dementia_all'. 'ID' is only a record identifier. 'lacunes_num' is a less detailed version of 'lac_count', and 'fazekas_cat' is a less detailed version of 'Fazekas'. 'study' and 'study1' only record which study each case came from, which adds nothing for modelling.
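The claim that one column is a less detailed version of another can be checked before dropping it. The sketch below is one possible check, not part of the original analysis: the helper name `is_coarsening_of` is hypothetical, and the toy rows (including the 'zero' label for 'lacunes_num') are illustrative assumptions rather than real records.

```python
import pandas as pd

def is_coarsening_of(frame, coarse, detailed):
    """Return True if each value of `detailed` maps to exactly one value of `coarse`."""
    return bool(frame.groupby(detailed)[coarse].nunique().le(1).all())

# Toy rows in the style of the dataset (illustrative values, not real records)
toy = pd.DataFrame({
    "lac_count":   ["Zero", "1 to 2", "3 to 5", ">5", "1 to 2"],
    "lacunes_num": ["zero", "more-than-zero", "more-than-zero",
                    "more-than-zero", "more-than-zero"],
})

# Every lac_count value implies a single lacunes_num value, so lacunes_num is redundant
print(is_coarsening_of(toy, coarse="lacunes_num", detailed="lac_count"))  # True
# The reverse does not hold: 'more-than-zero' covers several lac_count values
print(is_coarsening_of(toy, coarse="lac_count", detailed="lacunes_num"))  # False
```

If the check returns True on the real data, dropping the coarser column loses no information.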

In [8]:
# Drop the redundant and non-informative columns in a single call
df.drop(columns=['dementia', 'ID', 'lacunes_num', 'fazekas_cat', 'study', 'study1'], inplace=True)

In [9]:
df.head(5)

Unnamed: 0,age,gender,dementia_all,educationyears,EF,PS,Global,diabetes,smoking,hypertension,hypercholesterolemia,SVD Simple Score,SVD Amended Score,Fazekas,lac_count,CMB_count
0,52.67,male,0,11.0,-2.403333,-1.29,-1.287,0,current-smoker,Yes,Yes,3.0,7.0,3,>5,>=1
1,64.58,male,0,10.0,1.28,0.36,0.744,0,ex-smoker,Yes,Yes,2.0,3.0,1,1 to 2,>=1
2,74.92,male,0,8.0,-1.44,-1.52,-0.922,0,never-smoker,Yes,Yes,1.0,2.0,1,1 to 2,0
3,74.83,male,1,9.0,,-2.136271,-1.301102,0,current-smoker,Yes,Yes,2.0,4.0,2,3 to 5,0
4,79.25,male,0,10.0,-0.92,-1.493333,-0.924,0,ex-smoker,Yes,Yes,2.0,3.0,2,1 to 2,0


<strong> Column information </strong>
<li> Age (Years)
<li> Gender (M/F)
<li> Presence of Dementia (1 = Dementia/ 0 = No)
<li> Years of Education
<li> EF (Score that represents Executive Function)
<li> PS (Score that represents Processing Speed)
<li> Global (Global Cognitive Score)
<li> Diabetes (presence of diabetes or not)
<li> Smoking (Current smoker, ex-smoker, or never smoked)
<li> Hypertension (Yes/No)
<li> Hypercholesterolemia (Yes/No)
<li> SVD Simple score (Brain injury score)
<li> SVD amended score (Brain injury score)
<li> Fazekas (Rating of white matter hyperintensities from MRI data)
<li> lac_count (Count of lacunes [Cavities that appear when brain tissue has died])
<li> CMB_count (Cerebral Microbleeds [Small chronic brain hemorrhages])

### Cleaning Data

In [10]:
df.isnull().sum()

age                       0
gender                    0
dementia_all              0
educationyears            0
EF                      208
PS                      268
Global                  308
diabetes                  0
smoking                  11
hypertension              0
hypercholesterolemia      0
SVD Simple Score        677
SVD Amended Score       677
Fazekas                   0
lac_count                 0
CMB_count                 0
dtype: int64

More than a third of the SVD scores are missing (677 of 1842 rows). Dropping those rows or imputing a value for them would strongly influence the outcome, so both SVD columns will be removed instead.

In [11]:
df.drop('SVD Simple Score', axis=1, inplace=True)
df.drop('SVD Amended Score', axis=1, inplace=True)
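The decision to drop the SVD columns rests on the share of missing values (677 / 1842, roughly 0.37). A minimal sketch of making that rule explicit is shown here on toy data. The helper names and the 0.3 threshold are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

def missing_fraction(frame):
    """Fraction of missing values per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)

def drop_sparse_columns(frame, threshold):
    """Return a copy without columns whose missing fraction exceeds `threshold`."""
    fractions = missing_fraction(frame)
    return frame.drop(columns=fractions[fractions > threshold].index)

# Toy frame: column "a" is 50% missing, "b" is 25% missing, "c" is complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1, 2, 3, 4],
})

print(missing_fraction(toy))
print(list(drop_sparse_columns(toy, threshold=0.3).columns))  # ['b', 'c']
```

Applied to the real DataFrame, a threshold of 0.3 would remove only the two SVD columns, since the next largest share (Global, 308 / 1842) is about 0.17.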

In [12]:
df.head(5)

Unnamed: 0,age,gender,dementia_all,educationyears,EF,PS,Global,diabetes,smoking,hypertension,hypercholesterolemia,Fazekas,lac_count,CMB_count
0,52.67,male,0,11.0,-2.403333,-1.29,-1.287,0,current-smoker,Yes,Yes,3,>5,>=1
1,64.58,male,0,10.0,1.28,0.36,0.744,0,ex-smoker,Yes,Yes,1,1 to 2,>=1
2,74.92,male,0,8.0,-1.44,-1.52,-0.922,0,never-smoker,Yes,Yes,1,1 to 2,0
3,74.83,male,1,9.0,,-2.136271,-1.301102,0,current-smoker,Yes,Yes,2,3 to 5,0
4,79.25,male,0,10.0,-0.92,-1.493333,-0.924,0,ex-smoker,Yes,Yes,2,1 to 2,0


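The other columns with null values are EF, PS, Global and smoking. Far fewer values are missing in these columns than in the SVD scores, so they will be kept. Missing values are filled with the median for the numeric columns (EF, PS, Global) and with the most frequent value for smoking.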

In [13]:
df['EF'] = df['EF'].fillna(df['EF'].median())
df['PS'] = df['PS'].fillna(df['PS'].median())
df['Global'] = df['Global'].fillna(df['Global'].median())
df['smoking'] = df['smoking'].fillna(df['smoking'].mode()[0])
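One design consideration: the medians above are computed on the whole dataset, including rows that are later used for validation. The sketch below shows an alternative in which the fill values are learned from training rows only, using scikit-learn's `SimpleImputer`. The toy values and the train/test frames are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (illustrative values only)
train = pd.DataFrame({"EF": [0.1, np.nan, 0.5, -0.3], "PS": [1.0, 0.0, np.nan, 2.0]})
test = pd.DataFrame({"EF": [np.nan, 0.2], "PS": [0.5, np.nan]})

imputer = SimpleImputer(strategy="median")

# Learn the medians from the training rows only
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Reuse the training medians for the held-out rows
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(imputer.statistics_)  # medians learned from the training rows
print(test_filled)
```

For the categorical smoking column, `SimpleImputer(strategy="most_frequent")` plays the same role as `mode()[0]`.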

In [14]:
df.isnull().sum()

age                     0
gender                  0
dementia_all            0
educationyears          0
EF                      0
PS                      0
Global                  0
diabetes                0
smoking                 0
hypertension            0
hypercholesterolemia    0
Fazekas                 0
lac_count               0
CMB_count               0
dtype: int64

### Encoding Labels

In [15]:
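to_transform = ['gender', 'smoking', 'hypertension', 'hypercholesterolemia']

# Keep one fitted encoder per column; a single shared encoder would only
# remember the classes of the last column it was fitted on
encoders = {}
for col in to_transform:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# encoders[col].inverse_transform(df[col]) recovers the original labels later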
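To see which integer each label receives, the fitted encoder's `classes_` attribute can be inspected. `LabelEncoder` stores the classes in sorted order and uses the position as the code. The short example below uses a toy Series with the three smoking labels from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

smoking = pd.Series(["current-smoker", "ex-smoker", "never-smoker", "current-smoker"])

encoder = LabelEncoder()
codes = encoder.fit_transform(smoking)

# Classes are stored in sorted order; the position in classes_ is the code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'current-smoker': 0, 'ex-smoker': 1, 'never-smoker': 2}

# inverse_transform turns the codes back into the original labels
print(list(encoder.inverse_transform(codes)))
```

Note that these codes impose an ordering (0 < 1 < 2) that the categories may not have. For models that are sensitive to this, one-hot encoding is a common alternative.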

In [16]:
df.head(5)

Unnamed: 0,age,gender,dementia_all,educationyears,EF,PS,Global,diabetes,smoking,hypertension,hypercholesterolemia,Fazekas,lac_count,CMB_count
0,52.67,1,0,11.0,-2.403333,-1.29,-1.287,0,0,1,1,3,>5,>=1
1,64.58,1,0,10.0,1.28,0.36,0.744,0,1,1,1,1,1 to 2,>=1
2,74.92,1,0,8.0,-1.44,-1.52,-0.922,0,2,1,1,1,1 to 2,0
3,74.83,1,1,9.0,0.13,-2.136271,-1.301102,0,0,1,1,2,3 to 5,0
4,79.25,1,0,10.0,-0.92,-1.493333,-0.924,0,1,1,1,2,1 to 2,0


#### Converting other categorical values to numerical

Columns 'lac_count' and 'CMB_count' contain values that are not suitable for modelling, such as >5 and >=1, which would have to be fixed.

In [17]:
# Map each range label to a representative number. Assigning the result back
# avoids the chained inplace call, which stops working in pandas 3.0.
df['lac_count'] = df['lac_count'].replace({'1 to 2': 1.5, '3 to 5': 4, '>5': 7, 'Zero': 0}).astype(float)

To make the values usable for modelling, '1 to 2' is replaced with its midpoint 1.5 and '3 to 5' with its midpoint 4. '>5' is replaced with 7. This is an arbitrary choice; with more research it could be replaced by a value that better reflects the typical number of lacunes above 5. If most such cases actually have 6 lacunes, using 7 could introduce bias into the model. 'Zero' is replaced with 0.
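The mapping described above can also be written as a single dictionary with a guard against labels that have no numeric value. This is a sketch on toy data; the names `LAC_COUNT_VALUES` and `encode_ranges` are hypothetical.

```python
import pandas as pd

# Representative values described above; 7 for '>5' is the text's arbitrary choice
LAC_COUNT_VALUES = {"Zero": 0.0, "1 to 2": 1.5, "3 to 5": 4.0, ">5": 7.0}

def encode_ranges(series, mapping):
    """Map range labels to numbers, raising if a label has no mapping."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped labels: {sorted(unknown)}")
    return series.map(mapping).astype(float)

toy = pd.Series(["Zero", "1 to 2", ">5", "3 to 5"])
print(encode_ranges(toy, LAC_COUNT_VALUES).tolist())  # [0.0, 1.5, 7.0, 4.0]
```

Raising on unknown labels means an unexpected category in the data is noticed immediately instead of silently staying a string or becoming NaN.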

In [18]:
# Assign the result back instead of using a chained inplace call (see the pandas 3.0 change)
df['CMB_count'] = df['CMB_count'].replace({'>=1': 5}).astype(float)

'>=1' is replaced with 5. Research suggests the number of cerebral microbleeds can exceed 10, so 5 was chosen as a rough average.

### Dealing with Class Imbalances

In [19]:
df['dementia_all'].value_counts()

dementia_all
0    1726
1     116
Name: count, dtype: int64
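The counts above can be turned into proportions, and they also give a useful reference point: a model that always predicts "no dementia" already reaches high accuracy while finding no cases. The sketch below rebuilds the labels from the reported counts only (1726 and 116); no other data from the study is used.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Rebuild the label counts reported above: 1726 without dementia, 116 with
y = pd.Series([0] * 1726 + [1] * 116, name="dementia_all")

print(y.value_counts(normalize=True).round(3))

# A baseline that always predicts the majority class (no dementia)
always_negative = np.zeros(len(y), dtype=int)
print("Baseline accuracy:", round(accuracy_score(y, always_negative), 3))  # 0.937
print("Baseline recall:", recall_score(y, always_negative))  # 0.0
```

This is why recall, precision and F1 are more informative than accuracy for this dataset.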

### Processing data, splitting into targets and training data, stratified k-fold cross validation

In [20]:
x = df.drop('dementia_all', axis=1)
y = df['dementia_all']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

Roughly 6% of the cases have dementia, compared with 94% that do not. We will oversample the minority class with SMOTE to create more dementia cases for training. Oversampling carries a risk of overfitting, and if synthetic samples end up in the data used for evaluation, the scores become overly optimistic. We therefore apply oversampling inside the k-fold cross-validation loop, to the training folds only, so that each validation fold contains only real cases. The results can then be compared with a single train/test split in which only the training portion is oversampled.

In [50]:
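def train_and_evaluate_model(x, y, model, use_stratified_kfold=True, use_smote=True, n_splits=10):
    """Fit `model` and print accuracy, recall, precision and F1.

    With use_stratified_kfold=True the metrics are averaged over n_splits
    stratified folds; otherwise a single 80/20 train/test split is used.
    With use_smote=True, SMOTE is applied to the training portion only.
    """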
    accuracies = []
    recalls = []
    precisions = []
    f1s = []
    
    if use_stratified_kfold:
        kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1)
        for train_index, val_index in kf.split(x, y):
            kf_x_train, kf_x_val = x.iloc[train_index], x.iloc[val_index]
            kf_y_train, kf_y_val = y.iloc[train_index], y.iloc[val_index]
            
            if use_smote:
                smote = SMOTE(random_state=42)
                kf_x_oversampled, kf_y_oversampled = smote.fit_resample(kf_x_train, kf_y_train)
                model.fit(kf_x_oversampled, kf_y_oversampled)
            else:
                model.fit(kf_x_train, kf_y_train)
                
            y_pred = model.predict(kf_x_val)
            accuracies.append(accuracy_score(kf_y_val, y_pred))
            recalls.append(recall_score(kf_y_val, y_pred))
            precisions.append(precision_score(kf_y_val, y_pred))
            f1s.append(f1_score(kf_y_val, y_pred))
            
    else:
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
        
        
        if use_smote:
            smote = SMOTE(random_state=42)
            x_oversampled, y_oversampled = smote.fit_resample(x_train, y_train)
            model.fit(x_oversampled, y_oversampled)
        else:
            model.fit(x_train, y_train)
 
        y_pred = model.predict(x_test)      
        accuracies.append(accuracy_score(y_test, y_pred))
        recalls.append(recall_score(y_test, y_pred))
        precisions.append(precision_score(y_test, y_pred))
        f1s.append(f1_score(y_test, y_pred))
        
    average_accuracy = np.mean(accuracies)
    average_recall = np.mean(recalls)
    average_precision = np.mean(precisions)
    average_f1 = np.mean(f1s)
    
    print("Model:", model)
    print("Average Accuracy:", average_accuracy)
    print("Average Recall:", average_recall)
    print("Average Precision:", average_precision)
    print("Average F1-Score:", average_f1)

In [49]:
# Using Stratified K-Fold and SMOTE
model = RandomForestClassifier(random_state=42)
train_and_evaluate_model(x, y, model, use_stratified_kfold=True, use_smote=True)

Model: RandomForestClassifier(random_state=42)
Average Accuracy: 0.9104259694477085
Average Recall: 0.25
Average Precision: 0.26763902763902764
Average F1-Score: 0.25454048494094256


We will also assess the model with SMOTE applied to a single train/test split, without stratified k-fold cross-validation.

In [48]:
# Using just SMOTE
model = RandomForestClassifier(random_state=42)
train_and_evaluate_model(x, y, model, use_stratified_kfold=False, use_smote=True)

Model: RandomForestClassifier(random_state=42)
Average Accuracy: 0.9051490514905149
Average Recall: 0.16
Average Precision: 0.2222222222222222
Average F1-Score: 0.18604651162790697


#### To do

<li> Properly print out metrics for the appropriate model
<li> Implement other models, compare and analyse
<li> Test without SMOTE, and with k-fold
<li> Also test k-fold without SMOTE
<li> Analyse the best model and explain why
<li> Project complete