In this notebook, I used the Brain Stroke Dataset, publicly available on Kaggle here: https://www.kaggle.com/datasets/jillanisofttech/brain-stroke-dataset

To build a Decision Tree Classifier, I used the DecisionTreeClassifier implementation from the sklearn.tree module.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import tree
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

Here is the dataset as downloaded from Kaggle, with the 'work_type' column dropped:

In [100]:
df = pd.read_csv("/content/brain_stroke.csv").drop(columns=['work_type'])
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Urban,228.69,36.6,formerly smoked,1
1,Male,80.0,0,1,Yes,Rural,105.92,32.5,never smoked,1
2,Female,49.0,0,0,Yes,Urban,171.23,34.4,smokes,1
3,Female,79.0,1,0,Yes,Rural,174.12,24.0,never smoked,1
4,Male,81.0,0,0,Yes,Urban,186.21,29.0,formerly smoked,1


Some of the columns hold text values that a decision tree classifier cannot use directly, so I had to convert them to numbers. Some were simple: Yes/No, Male/Female and similar pairs are binary, so I assigned 0 to one value and 1 to the other. In the rare cases where gender was given as 'Other', I used the midpoint 0.5.

The most difficult feature to process was 'smoking_status'. It has four categories: 'formerly smoked', 'never smoked', 'smokes', and 'Unknown'.

I considered dropping the smoking column altogether, but decided against it because I believed that a person's smoking history would have a large impact on whether they have a stroke. So I tried to encode it. 'never smoked' and 'smokes' were easy to encode as 0 and 1. 'formerly smoked' was the wild card, since it is hard to place without more detail, so I again went with 0.5. 'Unknown' was the biggest curve ball, as it provides no information at all. I decided the best course of action was to assume either that the person never smoked or that the person smokes, so I made two versions of the data, one per assumption, and evaluated both.

In [101]:
df['gender'] = df['gender'].replace(['Male','Female','Other'],['1','0', '0.5'])
df['ever_married'] = df['ever_married'].replace(['Yes', 'No'], ['1', '0'])
df['Residence_type'] = df['Residence_type'].replace(['Urban', 'Rural'], ['1', '0'])
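
The cell above writes the codes as strings ('1', '0', '0.5'). As a variant, the same encoding can be written with numeric targets so the columns are numeric from the start. This is only a sketch on a made-up three-row frame; the `encodings` dictionary and the toy rows are assumptions for illustration, and only the column names and codes follow the notebook.

```python
import pandas as pd

# Toy rows, made up for illustration; only the column names follow the dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Urban"],
})

# One dictionary per column keeps each category next to its numeric code.
encodings = {
    "gender": {"Male": 1.0, "Female": 0.0, "Other": 0.5},
    "ever_married": {"Yes": 1.0, "No": 0.0},
    "Residence_type": {"Urban": 1.0, "Rural": 0.0},
}

encoded = toy.copy()
for column, mapping in encodings.items():
    # Series.map turns any category missing from the mapping into NaN,
    # which makes an unexpected label easy to spot.
    encoded[column] = toy[column].map(mapping)

print(encoded.dtypes)
print(encoded)
```

Compared with `replace`, `map` does not silently pass unknown labels through, which may or may not be what you want.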

In [102]:
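# .copy() gives each variant its own data; plain assignment would make both
# names point at the same DataFrame, so the two encodings could not differ.
df_smoke = df.copy()  # 'Unknown' will be treated as smoker
df_clean = df.copy()  # 'Unknown' will be treated as non-smoker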

In [103]:
df_smoke.smoking_status = df_smoke.smoking_status.replace(['formerly smoked', 'never smoked', 'smokes', 'Unknown'], ['0.5', '0', '1', '1'])
df_clean.smoking_status = df_clean.smoking_status.replace(['formerly smoked', 'never smoked', 'smokes', 'Unknown'], ['0.5', '0', '1', '0'])
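
A note on building two variants from one frame: a plain assignment such as `a = b` only adds a second name for the same DataFrame, so a replacement made through one name also shows up under the other, while `.copy()` creates an independent object. A minimal sketch on a made-up column (the names `base`, `alias` and `independent` are hypothetical):

```python
import pandas as pd

# Toy column, made up for illustration.
base = pd.DataFrame({"smoking_status": ["smokes", "Unknown", "never smoked"]})

alias = base               # second name for the same object
independent = base.copy()  # separate object with its own data

# Replace 'Unknown' through the alias only.
alias["smoking_status"] = alias["smoking_status"].replace("Unknown", "1")

print(alias is base)                           # same object under two names
print(base["smoking_status"].tolist())         # changed through the alias
print(independent["smoking_status"].tolist())  # still contains 'Unknown'
```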

In [104]:
df_smoke.head(3)

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,1,67.0,0,1,1,1,228.69,36.6,0.5,1
1,1,80.0,0,1,1,0,105.92,32.5,0.0,1
2,0,49.0,0,0,1,1,171.23,34.4,1.0,1


In [105]:
df_clean.head(3)

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,1,67.0,0,1,1,1,228.69,36.6,0.5,1
1,1,80.0,0,1,1,0,105.92,32.5,0.0,1
2,0,49.0,0,0,1,1,171.23,34.4,1.0,1


In [106]:
df.shape # print this just to see how much data we have

(4981, 10)

In [107]:
df.stroke.value_counts()

0    4733
1     248
Name: stroke, dtype: int64
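
The counts show a strong class imbalance, which gives a useful reference point for the accuracy figures later in the notebook: a rule that always answers "no stroke" is already right on every 0 row. The sketch below is plain arithmetic on the counts printed above; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
no_stroke = 4733
stroke = 248
total = no_stroke + stroke

# Accuracy of a rule that always predicts "no stroke" (class 0).
baseline_accuracy = no_stroke / total
stroke_share = stroke / total

print(f"rows: {total}")
print(f"always-0 accuracy: {baseline_accuracy:.4f}")
print(f"stroke share: {stroke_share:.4f}")
```

With these counts the always-0 rule scores about 0.950, so accuracies near that value are worth reading with the imbalance in mind.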

Here we shuffle the rows so that nothing downstream depends on the order of the file, whose first rows are all stroke cases. train_test_split also shuffles by default, so this step is mostly a precaution.

In [108]:
df_smoke = df_smoke.sample(frac=1).reset_index(drop=True)
df_clean = df_clean.sample(frac=1).reset_index(drop=True)
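
If the two frames should end up in the same row order, so that later splits contain the same people, a fixed `random_state` can be passed to `sample`. A minimal sketch on made-up frames; `SEED`, `frame_a` and `frame_b` are hypothetical names, and the values are not from the dataset.

```python
import pandas as pd

# Made-up frames standing in for the two encoded variants.
frame_a = pd.DataFrame({"row_id": range(6), "smoking_status": [1, 0, 1, 0.5, 1, 0]})
frame_b = pd.DataFrame({"row_id": range(6), "smoking_status": [0, 0, 1, 0.5, 0, 0]})

SEED = 42  # hypothetical seed; any fixed integer works

shuffled_a = frame_a.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_b = frame_b.sample(frac=1, random_state=SEED).reset_index(drop=True)

# Same seed and same length give the same row order in both frames.
same_order = shuffled_a["row_id"].tolist() == shuffled_b["row_id"].tolist()
# Shuffling only reorders: every original row is still present exactly once.
same_rows = sorted(shuffled_a["row_id"].tolist()) == list(range(6))

print(same_order, same_rows)
```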

Now we can split the data into training and test sets. I set test_size to 500 rows, about 10% of the data.

In [109]:
X_smoke = df_smoke[["gender","age","hypertension","heart_disease","ever_married","Residence_type","avg_glucose_level","bmi","smoking_status"]]
y_smoke = df_smoke.stroke

X_train1, X_test1, y_train1, y_test1 = train_test_split(X_smoke, y_smoke, test_size = 500, random_state=42)
X_train1.shape, y_train1.shape, X_test1.shape, y_test1.shape

((4481, 9), (4481,), (500, 9), (500,))

In [110]:
X_clean = df_clean[["gender","age","hypertension","heart_disease","ever_married","Residence_type","avg_glucose_level","bmi","smoking_status"]]
y_clean = df_clean.stroke

X_train2, X_test2, y_train2, y_test2 = train_test_split(X_clean, y_clean, test_size = 500, random_state=42)
X_train2.shape, y_train2.shape, X_test2.shape, y_test2.shape

((4481, 9), (4481,), (500, 9), (500,))
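
Because stroke cases are rare, a plain random split can leave the test set with a somewhat different share of positives than the training set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels with a similar imbalance; the data is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the notebook's imbalance (5% positives).
y = np.array([1] * 50 + [0] * 950)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

# stratify=y asks for the same class proportions in both parts of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=100, stratify=y, random_state=42
)

print("train positive share:", y_train.mean())
print("test positive share: ", y_test.mean())
```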

In [111]:
# DECISION TREE CLASSIFIER

In [112]:
score = 0
for i in range(300):
    # random_state is not set, so ties between equally good splits
    # may be broken differently on each fit
    classification_tree1 = tree.DecisionTreeClassifier()
    classification_tree1 = classification_tree1.fit(X_train1, y_train1)  # "Unknown" = smoked data
    score += accuracy_score(y_test1, classification_tree1.predict(X_test1))
print(score / 300)  # mean test accuracy over 300 fits

0.9037599999999988


In [113]:
score = 0
for i in range(300):
    classification_tree2 = tree.DecisionTreeClassifier()
    classification_tree2 = classification_tree2.fit(X_train2, y_train2)  # "Unknown" = has NOT smoked data
    score += accuracy_score(y_test2, classification_tree2.predict(X_test2))
print(score / 300)  # mean test accuracy over 300 fits

0.9201599999999992


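It seems that the "Unknown" = NOT smoked data gave the more accurate result, 92.0% against 90.4%. One caution: the two frames were shuffled independently, so the two test sets contain different rows, and part of this gap may come from the split rather than from the encoding. I had expected the smoking feature to be rather important to the outcome and to weigh heavily on the model. I then wondered what the results would look like if we got rid of the smoking data entirely.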
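
To separate the effect of the encoding from the effect of the split, both variants can be evaluated on exactly the same rows. The following is a minimal sketch on synthetic data, not the notebook's method: the frame `raw`, the label rule, and the `variants` dictionary are all assumptions made up for illustration, and the printed scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in data; values are random and carry no medical meaning.
raw = pd.DataFrame({
    "age": rng.uniform(20, 90, n),
    "smoking_status": rng.choice(
        ["formerly smoked", "never smoked", "smokes", "Unknown"], n
    ),
})
y = (rng.random(n) < 0.05 + 0.002 * (raw["age"] - 20)).astype(int)

base_codes = {"formerly smoked": 0.5, "never smoked": 0.0, "smokes": 1.0}
variants = {
    "Unknown=smoked": {**base_codes, "Unknown": 1.0},
    "Unknown=not smoked": {**base_codes, "Unknown": 0.0},
}

# Split row positions once so both variants share the same train and test rows.
train_idx, test_idx = train_test_split(
    np.arange(n), test_size=200, stratify=y, random_state=42
)

scores = {}
for name, codes in variants.items():
    X = raw.assign(smoking_status=raw["smoking_status"].map(codes))
    model = DecisionTreeClassifier(random_state=0)  # fixed seed for repeatability
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    scores[name] = accuracy_score(y.iloc[test_idx], model.predict(X.iloc[test_idx]))

print(scores)
```

With the split and the tree's seed held fixed, any difference between the two scores can only come from how 'Unknown' was encoded.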

In [114]:
df = df.drop(columns=['smoking_status']) # dropping the whole smoking data column

In [115]:
X = df[["gender","age","hypertension","heart_disease","ever_married","Residence_type","avg_glucose_level","bmi"]]
y = df.stroke

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 500, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((4481, 8), (4481,), (500, 8), (500,))

In [116]:
score = 0
for i in range(300):
    classification_tree = tree.DecisionTreeClassifier()
    classification_tree = classification_tree.fit(X_train, y_train)  # no smoking column
    score += accuracy_score(y_test, classification_tree.predict(X_test))
print(score / 300)  # mean test accuracy over 300 fits

0.8960933333333325


As expected, accuracy dropped, to about 89.6%. That is still a respectable score for a model with no smoking data, but the drop suggests the feature is worth keeping to get the best accuracy.

Next, we'll try the Bagging and Boosting classifier ensembles. Since the Unknown = NOT smoked data gave the best accuracy above, we'll use that version for both ensembles.
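
Another way to see how much a tree relies on a column, without retraining after dropping it, is the fitted tree's `feature_importances_` attribute. The sketch below uses synthetic data in which only one column drives the label, so it shows the mechanics rather than anything about the stroke data; `feature_names` and the label rule are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
feature_names = ["age", "bmi", "smoking_status"]

# Synthetic features; in this toy setup only "age" determines the label.
X = np.column_stack([
    rng.uniform(20, 90, n),
    rng.uniform(15, 45, n),
    rng.choice([0.0, 0.5, 1.0], n),
])
y = (X[:, 0] > 65).astype(int)

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across the features.
importances = dict(zip(feature_names, model.feature_importances_))
for name, importance in importances.items():
    print(f"{name}: {importance:.3f}")
```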

In [117]:
# BAGGING CLASSIFIER

In [137]:
bag_model = BaggingClassifier(n_estimators=10)

We will now try Repeated Stratified K fold cross validation on the Bagging model to test out its accuracy.

In [138]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
n_scores = cross_val_score(bag_model, X_train2, y_train2, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('\t\tAccuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

		Accuracy: 0.942 (0.005)


We get a high accuracy of about 94%, and the standard deviation across folds is low at 0.005. The only drawback was that the repeated stratified k-fold cross-validation took a while to compute, around 10 seconds by my estimate. This makes sense, since bagging trains multiple decision trees for every fold.
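
The running time is easier to picture by counting the fits involved. A small sketch using the same cross-validation settings as above; `n_estimators` mirrors the value given to `bag_model`, and the variable names are only illustrative.

```python
from sklearn.model_selection import RepeatedStratifiedKFold

n_estimators = 10  # trees per bagging ensemble, as in bag_model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

ensemble_fits = cv.get_n_splits()         # one ensemble fit per fold and repeat
tree_fits = ensemble_fits * n_estimators  # individual trees trained in total

print(ensemble_fits, tree_fits)
```

So one call to `cross_val_score` here trains 30 ensembles, which is 300 individual trees.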

Now, let's test out the model on the testing dataset.

In [139]:
score = 0
for i in range(300):
    bag_model.fit(X_train2, y_train2)  # refit: bootstrap samples differ on each run
    score += accuracy_score(y_test2, bag_model.predict(X_test2))
print(score / 300)  # mean test accuracy over 300 fits

0.9444133333333322


We get a really good score, one that is very similar to our original cross validation accuracy.

In [95]:
# BOOSTING CLASSIFIER

In [134]:
boost_model = AdaBoostClassifier(n_estimators=4, random_state=42, algorithm='SAMME')

Now we will try Repeated Stratified K fold cross validation on the Boosting model to test out its accuracy.

In [135]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
n_scores = cross_val_score(boost_model, X_train2, y_train2, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('\t\tAccuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))

		Accuracy: 0.950 (0.001)


We get a very good accuracy of 95%. The standard deviation, at 0.001, is even lower than the bagging model's under the same repeated stratified k-fold cross-validation. The whole thing was also computed very quickly.

Now let's try computing the accuracy on the test dataset.

In [136]:
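score = 0
for i in range(300):
    # boost_model has random_state=42, so every fit is identical
    # and the average equals the score of a single run
    boost_model.fit(X_train2, y_train2)
    score += accuracy_score(y_test2, boost_model.predict(X_test2))
print(score / 300)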

0.948


We get a very good score here as well with 94.8% accuracy.
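
Since stroke cases are a small share of the rows, it can help to look at per-class results next to accuracy. The sketch below uses made-up labels and a made-up "always 0" prediction, not the notebook's models, simply to show how `recall_score` and `confusion_matrix` complement `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 95 negatives and 5 positives, with predictions that are always 0.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)  # share of positives found
matrix = confusion_matrix(y_true, y_pred)               # rows: true, columns: predicted

print(accuracy)
print(recall)
print(matrix)
```

Running the same three calls on `y_test2` and a model's predictions would show whether a high accuracy also comes with detected stroke cases.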

---



##3##

Of the three models I implemented above, I believe I can confidently state that the Boosting ensemble model was the best model for this project.

First, it beat the other two models on final test accuracy. Compared with the regular decision tree, the boosting ensemble was about 3 percentage points more accurate (94.8% against 92.0%). The gap isn't huge, but both models were trained and tested on the same split of the same data, so the improvement comes from the ensemble itself.

Furthermore, compared with our bagging ensemble model, it was slightly more accurate (94.8% against 94.4%) while being much faster to compute. This means that our bagging model used a lot more computational power to provide a less accurate result. Computing the accuracy over 300 iterations also took a lot longer for the bagging model.

Additionally, the boosting model's results are more consistent, while the bagging model's are a little more spread out. In cross validation, the boosting model's standard deviation was about five times lower than the bagging model's (0.001 against 0.005). Consistency and reliability are important qualities to look for in a model. We don't want the answers to be very accurate one moment and off the mark the next.

It is also important for the model to be fast. Although our small project doesn't really require that much speed, real-world models must be trained on very large datasets and still be accurate, so they need to be efficient and not take up too much time. This matters even more in life-saving areas such as brain strokes. Thus, I can confidently say that the boosting ensemble method was the best method.
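
The comparisons in this section can be checked with a few lines of arithmetic. The figures below are copied (and rounded) from the outputs printed earlier in the notebook; the dictionary and variable names are only illustrative.

```python
# Test accuracies and cross-validation standard deviations, copied from
# the outputs printed earlier in this notebook (rounded).
test_accuracy = {
    "decision tree": 0.92016,
    "bagging": 0.94441,
    "boosting": 0.948,
}
cv_std = {"bagging": 0.005, "boosting": 0.001}

gap_vs_tree = test_accuracy["boosting"] - test_accuracy["decision tree"]
gap_vs_bagging = test_accuracy["boosting"] - test_accuracy["bagging"]
std_ratio = cv_std["bagging"] / cv_std["boosting"]

print(f"boosting - tree:    {gap_vs_tree * 100:.2f} points")
print(f"boosting - bagging: {gap_vs_bagging * 100:.2f} points")
print(f"std ratio:          {std_ratio:.1f}")
```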

