# Capstone Project: Context

For confidentiality reasons, I cannot use internal company data for this Capstone project, so I am using this dataset, which mirrors a similar data source.

The WHO states that stroke is the second leading cause of death globally. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status.

In [95]:
#import relevant libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE,RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

In [44]:
#Read dataset and view head
df = pd.read_csv('healthcare-dataset-stroke-data.csv')
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [45]:
#view information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [103]:
#Checking target value occurrence
df['stroke'].value_counts()

0    4860
1     249
Name: stroke, dtype: int64

As you can see from the above, the data is imbalanced: there are far more non-stroke cases (4,860) than stroke cases (249). We will therefore rebalance the data later using SMOTE (Synthetic Minority Over-sampling Technique).
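
To put a number on the imbalance, the class proportions can be computed directly. The following is a minimal sketch on a toy Series whose counts are copied from the output above; the name `stroke` is a stand-in for `df['stroke']`, not a variable from the notebook.

```python
import pandas as pd

# Toy stand-in for df['stroke'], built from the counts shown above
stroke = pd.Series([0] * 4860 + [1] * 249, name="stroke")

# Share of each class as a fraction of all rows
proportions = stroke.value_counts(normalize=True)
print(proportions.round(4))

# Ratio of minority (stroke) to majority (non-stroke) cases
minority_ratio = (stroke == 1).sum() / (stroke == 0).sum()
print(f"minority/majority ratio: {minority_ratio:.3f}")
```

The minority/majority ratio is the quantity that a float `sampling_strategy` in imblearn's `SMOTE` refers to, so it gives a baseline for choosing that value.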

# Data Preprocessing

In [53]:
df = pd.read_csv('healthcare-dataset-stroke-data.csv')
df.head()

#Data preprocessing needed

#There are null values for BMI - going to fill with median value of bmi for each gender
male_df = df[df['gender'] == 'Male']
female_df = df[df['gender'] == 'Female']  
median_bmi_male = male_df['bmi'].median()
median_bmi_female = female_df['bmi'].median()
df.loc[df['gender'] == 'Male', 'bmi'] = df.loc[df['gender'] == 'Male', 'bmi'].fillna(median_bmi_male)
df.loc[df['gender'] == 'Female', 'bmi'] = df.loc[df['gender'] == 'Female', 'bmi'].fillna(median_bmi_female)

#remove rows with gender = 'Other' as it is only one row
df = df[df['gender'] != 'Other']

#drop id column as it is not useful
df = df.drop('id', axis=1)

#Binary encoding of columns with two unique values
df['gender'] = df['gender'].map({'Male': 0, 'Female': 1})
df['ever_married'] = df['ever_married'].map({'No': 0, 'Yes': 1})
df['Residence_type'] = df['Residence_type'].map({'Urban': 0, 'Rural': 1})

#One-hot encoding of other columns (work type and smoking_status)
df_e = pd.get_dummies(df, columns=['work_type', 'smoking_status'])

# Insert the 'stroke' column back into the DataFrame at the last position
stroke_column = df_e.pop('stroke')
df_e['stroke'] = stroke_column

#View new dataframe
df_e.head()
        

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes,stroke
0,0,67.0,0,1,1,0,228.69,36.6,0,0,1,0,0,0,1,0,0,1
1,1,61.0,0,0,1,1,202.21,27.8,0,0,0,1,0,0,0,1,0,1
2,0,80.0,0,1,1,1,105.92,32.5,0,0,1,0,0,0,0,1,0,1
3,1,49.0,0,0,1,0,171.23,34.4,0,0,1,0,0,0,0,0,1,1
4,1,79.0,1,0,1,1,174.12,24.0,0,0,0,1,0,0,0,1,0,1
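
The per-gender median fill above can also be written as a single group-wise operation. This is a minimal sketch on a toy DataFrame with made-up values (the name `toy` and its contents are illustrative only), offered as an alternative formulation rather than a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with missing BMI entries
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "bmi": [30.0, np.nan, 34.0, 22.0, 26.0, np.nan],
})

# Median BMI per gender, broadcast back to every row of that gender
group_median = toy.groupby("gender")["bmi"].transform("median")

# Fill only the missing entries with their own group's median
toy["bmi"] = toy["bmi"].fillna(group_median)
print(toy)
```

With `groupby(...).transform("median")` no separate per-gender DataFrames are needed, and the same line keeps working if more categories are present.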


# Models / Algorithms

Before splitting the data into a training set and a test set, we need to divide the data into two arrays: the first, X, is a 2D array containing all the predictors, and the second, y, is a 1D array containing the response.

In [65]:
#Splitting data into two arrays, one containing predictors and the other with the response
Xy = np.array(df_e)
X=Xy[:,:-1]
y=Xy[:,-1]

We start with a random forest algorithm and a train/test split of 70/30. First I will try this without SMOTE rebalancing, to check the resulting confusion matrix.

In [105]:
#Now split the data to test/train.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#Initialize Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

#Fit the classifier to the training data
rf_classifier.fit(X_train, y_train)

#Predict classes for the test set
y_pred = rf_classifier.predict(X_test)

#Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 0.9412915851272016
Confusion Matrix:
[[1443    1]
 [  89    0]]


As you can see above, without SMOTE rebalancing the model looks accurate with a score of 0.94; however, it identifies none of the 89 actual stroke cases in the test set, so this model is not useful.
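
Accuracy hides this failure because the non-stroke class dominates the test set. As a worked example, recall and precision for the stroke class can be derived from the confusion matrix printed above; the numbers are copied from that output and the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
conf_matrix = np.array([[1443, 1],
                        [89, 0]])
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
recall = tp / (tp + fn)      # share of actual strokes that were detected
precision = tp / (tp + fp)   # share of predicted strokes that were correct

print(f"accuracy:  {accuracy:.4f}")
print(f"recall:    {recall:.4f}")
print(f"precision: {precision:.4f}")
```

For an imbalanced target like this one, recall on the stroke class is a more informative headline figure than accuracy.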

Now we will try a random forest algorithm with SMOTE rebalancing.

In [111]:
smote = SMOTE(sampling_strategy=0.3, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Split the resampled data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)

#Initialize Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

#Fit the classifier to the training data
rf_classifier.fit(X_train, y_train)

#Predict classes for the test set
y_pred = rf_classifier.predict(X_test)

#Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 0.9525316455696202
Confusion Matrix:
[[1444    2]
 [  88  362]]


As you can see above, after oversampling the stroke cases the model now identifies many more of them (362 of the 450 stroke samples in the test set), and is therefore more useful.

We will now also trial a support vector machine with SMOTE rebalancing.

In [106]:
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)

# Split the resampled data into training and testing sets (65% train, 35% test)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.35, random_state=42)

#Initialize Support Vector Machine classifier
svm_classifier = SVC(kernel='rbf', random_state=42)

#Fit the classifier to the training data
svm_classifier.fit(X_train, y_train)

#Predict classes for the test set
y_pred = svm_classifier.predict(X_test)

#Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 0.7648442092886537
Confusion Matrix:
[[1230  486]
 [ 314 1372]]


As you can see above, the SVM performed worse than the random forest. Even after varying the kernel, the accuracy was always lower than that of the random forest algorithm.
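
One possible contributor, not tested in this notebook, is that the SVM received unscaled features: `age` and `avg_glucose_level` span much larger ranges than the 0/1 columns, and RBF kernels are sensitive to feature scale. The sketch below shows how standardisation could be added in a pipeline, using synthetic data as a stand-in (the names `X_demo`, `y_demo`, and `svm_scaled` are hypothetical). Whether this improves results on the stroke data is an assumption that would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the resampled arrays
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
# Put one feature on a much larger scale, like glucose level next to 0/1 flags
X_demo[:, 0] *= 200

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data
svm_scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
svm_scaled.fit(X_tr, y_tr)

acc = accuracy_score(y_te, svm_scaled.predict(X_te))
print("Accuracy:", acc)
```

Placing the scaler inside the pipeline keeps the test data out of the scaling statistics.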

# Conclusion

We ended with a result of 95% accuracy using a random forest algorithm. It was important to rebalance the data using SMOTE, as without it the algorithm failed to identify the stroke cases in the test set. An SVM algorithm was also trialled, but it performed worse than the random forest algorithm.