<h1>GBC Model Implementation</h1>

<h3>Group 3</h3>
<p>By:<br>
    Aaron Norwood, 218330434<br>
    Joshua Anthony, 219466473<br>
    Roger Middenway, 217602784<br>
    David Adams, 216110104<br>
    Linden Hutchinson, 218384326<br>
    Dale Orders, 219106283
</p>

# Imported libraries

In [57]:
import numpy as np
import pandas as pd

from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN
from collections import Counter

<h3>Read in the data and store it in a DataFrame</h3>

In [58]:
df = pd.read_csv('./data/healthcare-dataset-stroke-data.csv')

# Tidying up the data

Implementing consistent capitalization and replacing underscores and spaces with hyphens in the data.

In [59]:
## convert gender to lowercase
df['gender'] = df['gender'].apply(lambda x: x.lower())

## convert work_type to lowercase and replace underscores with hyphens
df['work_type'] = df['work_type'].apply(lambda x: x.lower().replace('_','-'))

## rename the Residence_type column to lowercase, then convert its values to lowercase
df.rename(columns={'Residence_type':'residence_type'}, inplace=True)
df['residence_type'] = df['residence_type'].apply(lambda x: x.lower())

## convert smoking_status to lowercase and replace spaces with hyphens
df['smoking_status'] = df['smoking_status'].apply(lambda x: x.lower().replace(' ', '-'))
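The same clean-up can also be written with pandas' vectorised `.str` accessor, which skips over missing values instead of raising an error. The sketch below is only an illustration on a made-up toy frame (the rows are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy rows mimicking the raw category labels
toy = pd.DataFrame({
    'work_type': ['Self-employed', 'Govt_job', 'Never_worked'],
    'smoking_status': ['never smoked', 'formerly smoked', 'Unknown'],
})

# .str methods operate element-wise and pass NaN through untouched
toy['work_type'] = toy['work_type'].str.lower().str.replace('_', '-', regex=False)
toy['smoking_status'] = toy['smoking_status'].str.lower().str.replace(' ', '-', regex=False)

print(toy)
```

Passing `regex=False` makes the replacement a plain substring substitution, which is all that is needed here.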

<h3>Replacing gender with a binary indicator (male = 1, female = 0) for easier visualisation</h3>

In [60]:
df['gender'] = df['gender'].str.lower().map({'male': 1, 'female': 0})
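One side effect of `map` with a dictionary is worth keeping in mind: any label that is not a key in the dictionary becomes `NaN`, and the column is then stored as float. That is a likely source of the single null reported for `gender` in the null counts. A small sketch, where the third label `'other'` is a hypothetical stand-in for any unmapped value:

```python
import pandas as pd

genders = pd.Series(['male', 'female', 'other'])  # 'other' is a hypothetical third label
encoded = genders.map({'male': 1, 'female': 0})

print(encoded.tolist())         # the unmapped label shows up as NaN
print(encoded.dtype)            # float, because NaN cannot live in an int64 column
print(encoded.isnull().sum())   # number of values the mapping did not cover
```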

<h3>Indexes of the outliers with a bmi above 60 for verification purposes</h3>

<h3>Set bmi values outside the range 12 to 60 to NaN so that they are imputed later</h3>

In [61]:
df['bmi'] = df['bmi'].apply(lambda bmi_value: bmi_value if 12 < bmi_value < 60 else np.nan)
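Note that the cell above does not cap the values: anything outside the open interval (12, 60) is replaced with `NaN`, and existing `NaN` values stay `NaN` because the comparison is false for them. The sketch below contrasts that masking behaviour with true capping via `Series.clip`, using made-up BMI values:

```python
import pandas as pd

bmi = pd.Series([10.3, 25.0, 59.9, 71.9, 97.6])  # made-up values for illustration

# Masking (what the cell above does): out-of-range readings become NaN
masked = bmi.where((bmi > 12) & (bmi < 60))

# Capping: out-of-range readings are pulled to the nearest bound instead
capped = bmi.clip(lower=12, upper=60)

print(pd.DataFrame({'raw': bmi, 'masked': masked, 'capped': capped}))
```

Masking hands the outliers to the imputation step, whereas capping keeps them as extreme but valid-looking readings.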

### Impute missing BMI values
<p>Replace missing BMI values with the average BMI found in rows with the same age and gender.</p>

#### Check initial number of nulls

In [62]:
##get number of nulls in df
df.isnull().sum()

id                     0
gender                 1
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
residence_type         0
avg_glucose_level      0
bmi                  218
smoking_status         0
stroke                 0
dtype: int64

<h3>Impute missing values</h3>

In [63]:
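## round age to whole years so it can be used as a grouping key
df['age'] = df['age'].apply(lambda x : round(x))

m_df = df[df['gender'] == 1]
f_df = df[df['gender'] == 0]

## mean bmi per age, computed separately for each gender
m_bmi_avg = m_df.groupby('age')['bmi'].mean()
f_bmi_avg = f_df.groupby('age')['bmi'].mean()
## round to one decimal place to fit with the other bmi values
m_bmi_avg = round(m_bmi_avg,1)
f_bmi_avg = round(f_bmi_avg,1)

## select only the rows with a missing bmi; rows with other nulls
## (e.g. a missing gender) must not have their bmi overwritten
missing_vals = df[df['bmi'].isnull()]

for index, row in missing_vals.iterrows():
    if row['gender'] == 1:
        df.loc[index, 'bmi'] = m_bmi_avg[row['age']]
    else:
        df.loc[index, 'bmi'] = f_bmi_avg[row['age']]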
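The group-mean imputation can also be expressed without an explicit loop, using `groupby(...).transform('mean')` to build a per-row group mean and `fillna` to apply it only where BMI is missing. This is a sketch on a made-up toy frame, not a drop-in replacement for the cell above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two genders, one age, one missing bmi in each group
toy = pd.DataFrame({
    'gender': [1, 1, 1, 0, 0, 0],
    'age':    [40, 40, 40, 40, 40, 40],
    'bmi':    [30.0, 32.0, np.nan, 24.0, 27.0, np.nan],
})

# Mean bmi of each (gender, age) group, aligned row by row with the frame;
# NaN values are ignored when the mean is computed
group_mean = toy.groupby(['gender', 'age'])['bmi'].transform('mean').round(1)

# Only the missing entries are filled; observed values are left alone
toy['bmi'] = toy['bmi'].fillna(group_mean)
print(toy)
```

If a group has no observed BMI at all, its mean is itself `NaN` and the value stays missing, so a null check afterwards is still worthwhile.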


<h3>Checking again for nulls to verify imputation success</h3>
<p>NOTE: the rows that contained nulls are kept in a separate DataFrame (<code>missing_vals</code>) in case they are needed later.</p>

In [64]:
df.isnull().sum()

id                   0
gender               1
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

### Imputing a single missing gender value

In [65]:
df['gender'] = df['gender'].replace(np.nan, 0)

In [66]:
df.isnull().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [67]:
## I realised that gender had all been mapped to float, so I've fixed that here
df['gender'] = df['gender'].astype(int)

## Creating the df used for GBC and Decision Trees
### Dropping some features that proved unhelpful for the model

In [68]:
df2 = df.drop(['id','work_type','residence_type','smoking_status', 'ever_married'],axis=1)

In [69]:
df2.head()

   gender  age  hypertension  heart_disease  avg_glucose_level   bmi  stroke
0       1   67             0              1             228.69  36.6       1
1       0   61             0              0             202.21  29.1       1
2       1   80             0              1             105.92  32.5       1
3       0   49             0              0             171.23  34.4       1
4       0   79             1              0             174.12  24.0       1


## Basic decision tree
Very poor recall (0.11) and precision (0.12) on the stroke class, despite an overall accuracy of 0.92.

In [70]:
data = df2.drop(['stroke'], axis=1)
target = df2['stroke']

## hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
     data, target, test_size=0.3, random_state=0)

print(y_test.shape)
print(X_test.shape)
clf = tree.DecisionTreeClassifier()
clf.fit(X_train,y_train)
pred = clf.predict(X_test)

print(classification_report(y_test,pred))

(1533,)
(1533, 6)
              precision    recall  f1-score   support

           0       0.95      0.96      0.96      1457
           1       0.12      0.11      0.11        76

    accuracy                           0.92      1533
   macro avg       0.54      0.53      0.54      1533
weighted avg       0.91      0.92      0.92      1533



In [71]:
confusion_matrix(y_test,pred)

array([[1400,   57],
       [  68,    8]], dtype=int64)
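The stroke-class figures in the classification report follow directly from this confusion matrix. As a check, the sketch below recomputes them by hand, relying on scikit-learn's layout of rows as true classes and columns as predicted classes:

```python
# Confusion matrix from the decision tree above, laid out as
# [[TN, FP], [FN, TP]] for labels 0 (no stroke) and 1 (stroke)
cm = [[1400, 57],
      [68, 8]]
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)                  # share of predicted strokes that were real
recall = tp / (tp + fn)                     # share of real strokes that were found
accuracy = (tp + tn) / (tn + fp + fn + tp)  # share of all rows classified correctly

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Only 8 of the 76 stroke cases in the test set are detected, which is why the high accuracy says little about the model's usefulness.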

## GBC no SMOTE
Also very poor recall (0.07) and precision (0.26) on the stroke class.

In [72]:
X_train, X_test, y_train, y_test = train_test_split(
     data, target, test_size=0.3, random_state=0)

## loss is left at its default (log loss, named 'deviance' in older
## scikit-learn releases) so the cell runs on old and new versions alike
gb_clf = GradientBoostingClassifier(max_depth=10,
                                    learning_rate=0.5,
                                    n_estimators=600,
                                    random_state=42,
                                    criterion='friedman_mse',
                                    min_samples_split=0.01)
print(y_test.shape)
print(X_test.shape)
## fit once on the training split
gb_clf.fit(X_train,y_train)
pred = gb_clf.predict(X_test)

print(classification_report(y_test,pred))

(1533,)
(1533, 6)
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1457
           1       0.26      0.07      0.11        76

    accuracy                           0.94      1533
   macro avg       0.61      0.53      0.54      1533
weighted avg       0.92      0.94      0.93      1533



In [73]:
confusion_matrix(y_test,pred)

array([[1443,   14],
       [  71,    5]], dtype=int64)
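The 0.94 accuracy looks strong until it is compared with a trivial baseline. The sketch below computes what a model that always predicts "no stroke" would score on the same test set, using the support counts from the report:

```python
# Test-set class counts taken from the support column of the report above
n_no_stroke, n_stroke = 1457, 76

# A "classifier" that always predicts class 0 is right on every no-stroke row
baseline_accuracy = n_no_stroke / (n_no_stroke + n_stroke)
# ...and never finds a single stroke case
baseline_recall_stroke = 0 / n_stroke

print(f"always-0 accuracy: {baseline_accuracy:.2f}")
print(f"always-0 stroke recall: {baseline_recall_stroke:.2f}")
```

Because the classes are this imbalanced, accuracy alone cannot separate a useful model from the trivial one; recall and precision on the stroke class are the figures to watch.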

## Setting up SMOTE

In [74]:
#X_smote, y_smote = SMOTE().fit_resample(data, target)

#counter = Counter(y_smote)
#print('After',counter)
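SMOTE balances the classes by creating synthetic minority samples: each new point is placed at a random position on the line between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step with NumPy on made-up points. It is a simplified illustration, not the imbalanced-learn implementation, and the helper name `smote_like_sample` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(minority, k=2):
    """Return one synthetic point between a random minority sample
    and one of its k nearest minority neighbours."""
    i = rng.integers(len(minority))
    x = minority[i]
    dists = np.linalg.norm(minority - x, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # position 0 is the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                        # uniform in [0, 1)
    return x + gap * (minority[j] - x)

# Made-up minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 3.0]])
synthetic = np.array([smote_like_sample(minority) for _ in range(5)])
print(synthetic)
```

Because the synthetic rows are derived from existing ones, resampling should be fitted on the training split only, so that no information from the test rows leaks into training.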

## GBC with SMOTE
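Compared with the GBC without SMOTE, recall on the stroke class rises from 0.07 to 0.20, but precision drops from 0.26 to 0.13 and accuracy from 0.94 to 0.89, so the results are still poor.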

In [75]:
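## no random_state is set here, so the split and the SMOTE samples differ between runs
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3)

## oversample the training split only; the test split keeps its original class balance
X_smote, y_smote = SMOTE().fit_resample(X_train, y_train)

counter = Counter(y_smote)
print('After',counter)

## loss is left at its default (log loss, named 'deviance' in older
## scikit-learn releases) so the cell runs on old and new versions alike
gb_clf = GradientBoostingClassifier(max_depth=10,
                                    learning_rate=0.5,
                                    n_estimators=600,
                                    random_state=42,
                                    criterion='friedman_mse',
                                    min_samples_split=0.01)
## fit once, on the oversampled training data
gb_clf.fit(X_smote,y_smote)
pred = gb_clf.predict(X_test)

print(classification_report(y_test,pred))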

After Counter({0: 3404, 1: 3404})
              precision    recall  f1-score   support

           0       0.96      0.93      0.94      1457
           1       0.13      0.20      0.15        76

    accuracy                           0.89      1533
   macro avg       0.54      0.56      0.55      1533
weighted avg       0.92      0.89      0.90      1533



In [76]:
confusion_matrix(y_test,pred)

array([[1354,  103],
       [  61,   15]], dtype=int64)
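To compare the three models on the class that matters, the stroke-class F1 score can be recomputed from the confusion matrices printed in this notebook. The helper `f1_positive` below is a hypothetical name introduced for this sketch:

```python
def f1_positive(cm):
    """F1 for class 1 from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (_, fp), (fn, tp) = cm
    return 2 * tp / (2 * tp + fp + fn)

# Confusion matrices copied from the outputs above
matrices = {
    'decision tree':   [[1400, 57], [68, 8]],
    'GBC, no SMOTE':   [[1443, 14], [71, 5]],
    'GBC, with SMOTE': [[1354, 103], [61, 15]],
}

scores = {name: f1_positive(cm) for name, cm in matrices.items()}
for name, score in scores.items():
    print(f"{name}: stroke-class F1 = {score:.2f}")
```

SMOTE gives the highest stroke-class F1 of the three, but at roughly 0.15 it remains low in absolute terms.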