# Business Context

- According to the NPR Organization, someone dies of a stroke every 3 minutes on average. Predicting whether a person is prone to a stroke is therefore very important, since at-risk people can manage their health to reduce the chance of it happening.


- This ML pipeline predicts whether a person is in danger of having a stroke based on the following aspects:
    - Gender
    - Age
    - BMI
    - Hypertensive or not
    - Has a heart disease or not
    - Marital status
    - Work type
    - Residence type
    - Average glucose level
    - Smoking history
 
 
- In this pipeline I implement the data preprocessing and preparation techniques needed to train a model with the highest possible prediction precision, with the aim of identifying the patients who are prone to having a stroke.

# Import Dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE, ADASYN

# Load Dataset

- Loaded the `stroke-prediction` dataset using Pandas. It can be found at https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

In [3]:
df_og = pd.read_csv("E:\\Documents\\Masters\\Intro_to_AI\\healthcare-dataset-stroke-data.csv")
df_og.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


---------------------------------------------------------------------------------------------

# Data Exploration

- Checked the shape, datatypes and non-null count of the DataFrame
- Identified unique values of each feature
- Counted implicit missing values for the attribute `smoking_status`
- Counted explicit missing values for the attribute `bmi`
- Counted number of instances where the `gender` is "other"
- Checked the class distribution

In [4]:
#Dimensions of the dataset
df_og.shape

(5110, 12)

In [5]:
#Printing Information on the dataset
df_og.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [6]:
#Checking unique values for each feature
df_og.nunique()

id                   5110
gender                  3
age                   104
hypertension            2
heart_disease           2
ever_married            2
work_type               5
Residence_type          2
avg_glucose_level    3979
bmi                   418
smoking_status          4
stroke                  2
dtype: int64

In [7]:
df_og.isnull().any()

id                   False
gender               False
age                  False
hypertension         False
heart_disease        False
ever_married         False
work_type            False
Residence_type       False
avg_glucose_level    False
bmi                   True
smoking_status       False
stroke               False
dtype: bool

In [8]:
#Checking number of null values in bmi
sum(df_og['bmi'].isna())

201

In [9]:
#Checking number of implicit null values in smoking_status
sum(df_og['smoking_status'] == 'Unknown')

1544

In [10]:
#Checking number of records with gender other
sum(df_og['gender'] == 'Other')

1

In [11]:
#Checking class balance
print(df_og['stroke'].value_counts())

0    4861
1     249
Name: stroke, dtype: int64


Summary of the original dataset:
- It contains 5110 records and 12 columns (11 features and 1 label). The columns are:
    - `id`
    - `gender`
    - `age`
    - `hypertension`
    - `heart_disease`	
    - `ever_married`
    - `work_type`
    - `Residence_type`
    - `avg_glucose_level`	
    - `bmi`
    - `smoking_status`
    - `stroke`: Class 0 (won't have a stroke) and Class 1 (will have a stroke)
    
    
- Only the `bmi` column has explicit missing values: 201 of them.


- `smoking_status` has an implicit missing value which is "Unknown".


- The `gender` feature has 3 unique values (Male, Female and Other); however, "Other" appears in only 1 record.


- The classes are imbalanced as Class 0 has 4861 records and Class 1 has 249 records.
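
To put the imbalance in numbers, the short calculation below uses the class counts printed above. It is only a worked illustration; the variable names are made up for this sketch. It shows why plain accuracy can look good on this dataset even for a model that never predicts a stroke.

```python
# Class counts taken from the value_counts() output above
n_no_stroke = 4861
n_stroke = 249
total = n_no_stroke + n_stroke

# A classifier that always predicts class 0 is correct on every class-0 record
baseline_accuracy = n_no_stroke / total
minority_share = n_stroke / total

print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
print(f"Share of stroke records: {minority_share:.1%}")
```

Any model evaluated on this data should therefore be judged on class-1 metrics, not on accuracy alone.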

--------------------------------------------------------------------------

# Data Preprocessing

The following steps were applied to clean and prepare the dataset:

- Created an independent copy of the dataset (not a reference to the same object) to preserve the original one
- The column `id` was dropped as it has no predictive significance and the index can be used as the primary key.
- Split the dataset into a training dataset (80% of the data) and a test dataset (20% of the data) so that preprocessing is fitted on the training data only, to avoid data leakage
- There's only 1 row where the `gender` is "Other", so it's dropped to avoid an outlier category
- Since there are **201** rows with a missing `bmi` value, I fill the missing values with the mean `bmi` of the training dataset to avoid data loss, as the dataset is already small
- Dropped rows with `smoking_status` equal to "Unknown" since it's an implicit missing value
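
The split-then-impute order described above can be sketched on a toy table. This is a minimal illustration, not the notebook's code: the `toy` frame and its values are invented, and `stratify` and `random_state` are optional extras (assumptions of this sketch) that keep the class ratio and make the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the stroke data (values are made up for illustration)
toy = pd.DataFrame({
    "bmi": [36.6, np.nan, 32.5, 34.4, 24.0, np.nan, 27.4, 22.8, 30.1, 25.5],
    "stroke": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

# stratify keeps the class ratio in both parts; random_state makes the split repeatable
toy_train, toy_test = train_test_split(
    toy, test_size=0.5, stratify=toy["stroke"], random_state=0
)

# Learn the imputation value from the training part only ...
toy_train_mean = toy_train["bmi"].mean()

# ... and apply that same value to both parts
toy_train = toy_train.assign(bmi=toy_train["bmi"].fillna(toy_train_mean))
toy_test = toy_test.assign(bmi=toy_test["bmi"].fillna(toy_train_mean))

print(toy_train)
print(toy_test)
```

Because the mean is computed after the split, no information from the test rows influences the value used for imputation.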

In [12]:
#Creating an independent copy of the dataset (not a reference to the original)
df = df_og.copy()
df

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [13]:
#Dropping id column
df = df.drop(['id'], axis=1)

In [14]:
#Splitting dataset into 80% training data and 20% test data
train_df, test_df = train_test_split(df ,test_size=0.2, shuffle=True)


In [15]:
#Displaying the train dataset
train_df

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
4748,Female,28.0,0,0,Yes,Govt_job,Rural,86.91,21.1,formerly smoked,0
4732,Female,45.0,0,0,Yes,Private,Urban,108.03,37.3,never smoked,0
1562,Female,81.0,1,0,Yes,Private,Urban,58.71,34.5,never smoked,0
1047,Female,5.0,0,0,No,children,Rural,84.93,17.6,Unknown,0
3770,Female,60.0,0,0,Yes,Govt_job,Rural,145.94,29.2,Unknown,0
...,...,...,...,...,...,...,...,...,...,...,...
1211,Female,79.0,0,0,Yes,Private,Rural,90.77,22.5,never smoked,0
772,Male,61.0,0,0,Yes,Private,Rural,55.26,33.2,Unknown,0
1640,Male,76.0,0,1,Yes,Private,Urban,79.05,,Unknown,0
1553,Male,66.0,0,0,Yes,Govt_job,Rural,218.54,38.9,smokes,0


Training data before cleansing had **4088** records and **11** columns (10 features and the `stroke` label)

**Data Cleaning**

In [16]:
print(f'Number of missing values in "bmi" column in the training dataset: {train_df["bmi"].isnull().sum()}')

Number of missing values in "bmi" column in the training dataset: 155


In [17]:
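#Fill the null values in the bmi column using the bmi mean of the rest of the train dataset
train_mean_bmi = train_df['bmi'].mean()
#Assigning the result back avoids fillna(inplace=True) on a column selection, which may not update train_df
train_df = train_df.assign(bmi=train_df['bmi'].fillna(train_mean_bmi))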

#Dropping the record with the gender "Other" by only including records with gender not equal to "Other"
train_df = train_df.loc[train_df['gender']!= "Other"]

#Dropping implicit null value of the column smoking_status by only including records with smoking_status not equal to "Unknown"
train_df=train_df.loc[train_df["smoking_status"]!="Unknown"]

#Displaying training dataset after cleansing
train_df

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
4748,Female,28.0,0,0,Yes,Govt_job,Rural,86.91,21.1,formerly smoked,0
4732,Female,45.0,0,0,Yes,Private,Urban,108.03,37.3,never smoked,0
1562,Female,81.0,1,0,Yes,Private,Urban,58.71,34.5,never smoked,0
1041,Female,30.0,0,0,Yes,Govt_job,Rural,110.55,30.9,smokes,0
4678,Female,66.0,0,0,Yes,Self-employed,Rural,74.88,32.6,never smoked,0
...,...,...,...,...,...,...,...,...,...,...,...
711,Female,37.0,0,0,No,Self-employed,Rural,134.39,22.7,formerly smoked,0
2035,Female,45.0,0,0,Yes,Private,Urban,73.27,22.2,smokes,0
1211,Female,79.0,0,0,Yes,Private,Rural,90.77,22.5,never smoked,0
1553,Male,66.0,0,0,Yes,Govt_job,Rural,218.54,38.9,smokes,0


Training data after cleansing has **2852** records and **11** columns (10 features and the `stroke` label)

# One-hot Encoding

- Encoding categorical variables into numerical features using OneHotEncoder() in order to be able to train the model as the models only take numerical features
- Encoding the following categorical features:
     - `gender`
     - `ever_married`
     - `work_type`
     - `Residence_type`
     - `smoking_status`

In [18]:
#Getting a list of the categorical features
catg_cols = train_df.select_dtypes(include=['object']).columns.tolist()
catg_cols

['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

In [19]:
#Initializing OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')

#Fitting the OneHotEncoder on the train dataset and transforming the data while keeping the dataframe form
enc_data=pd.DataFrame(enc.fit_transform(train_df[catg_cols]).toarray())

#Extracting feature names for the encoded features
enc_data.columns = enc.get_feature_names_out(catg_cols)

In [20]:
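#Resetting the index of the train dataset to be able to merge it with the encoded data on the index
#reset_index returns a new DataFrame, so the result is assigned back before displaying it
train_df = train_df.reset_index(drop=True)
train_df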

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Female,28.0,0,0,Yes,Govt_job,Rural,86.91,21.1,formerly smoked,0
1,Female,45.0,0,0,Yes,Private,Urban,108.03,37.3,never smoked,0
2,Female,81.0,1,0,Yes,Private,Urban,58.71,34.5,never smoked,0
3,Female,30.0,0,0,Yes,Govt_job,Rural,110.55,30.9,smokes,0
4,Female,66.0,0,0,Yes,Self-employed,Rural,74.88,32.6,never smoked,0
...,...,...,...,...,...,...,...,...,...,...,...
2837,Female,37.0,0,0,No,Self-employed,Rural,134.39,22.7,formerly smoked,0
2838,Female,45.0,0,0,Yes,Private,Urban,73.27,22.2,smokes,0
2839,Female,79.0,0,0,Yes,Private,Rural,90.77,22.5,never smoked,0
2840,Male,66.0,0,0,Yes,Govt_job,Rural,218.54,38.9,smokes,0


In [21]:
#Merge train dataset with enc_data based on the row index
train_df= pd.merge(train_df,enc_data, on=train_df.index)

In [22]:
#Drop categorical columns that have been already encoded
train_df =train_df.drop(catg_cols, axis =1)

#Drop the key column that was automatically assigned with merge as it's not needed
train_df =train_df.drop(['key_0'], axis =1)

#Displaying dataset after encoding categorical features
print("Training dataset after encoding categorical features:")
train_df

Training dataset after encoding categorical features:


Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,ever_married_No,ever_married_Yes,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,28.0,0,0,86.91,21.1,0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1,45.0,0,0,108.03,37.3,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,81.0,1,0,58.71,34.5,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,30.0,0,0,110.55,30.9,0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,66.0,0,0,74.88,32.6,0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2837,37.0,0,0,134.39,22.7,0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
2838,45.0,0,0,73.27,22.2,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2839,79.0,0,0,90.77,22.5,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2840,66.0,0,0,218.54,38.9,0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


-----------------------------------------------------------------------------------------------

# Data Standardization

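- Standardizing the data because the features are on very different scales: most columns hold values of 0 or 1, whereas `age`, `avg_glucose_level` and `bmi` are numerical values with a wide range. Distance-based models such as SVM and KNN would otherwise be dominated by the wide-range columns.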
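
As a small check of what standardization does, the sketch below compares `StandardScaler` with the manual z-score $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ come from the training values only. The arrays are toy ages chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy "age" columns (one feature), split into train and test
ages_train = np.array([[28.0], [45.0], [81.0], [30.0], [66.0]])
ages_test = np.array([[5.0], [60.0]])

# The scaler learns the mean and standard deviation from the training data
toy_scaler = StandardScaler().fit(ages_train)

# Manual z-score using the training mean and (population) standard deviation
manual = (ages_test - ages_train.mean(axis=0)) / ages_train.std(axis=0)

# The fitted scaler applies the same formula to unseen data
scaled = toy_scaler.transform(ages_test)

print(np.allclose(manual, scaled))
```

The test values are scaled with the training statistics, so they are not guaranteed to have zero mean or unit variance themselves.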

In [23]:
#Splitting training dataset to features (x_train) and label (y_train)
x_train= train_df.drop(['stroke'], axis=1)
y_train=train_df['stroke']

In [24]:
#Instantiate standard scaler 
train_scaler= StandardScaler()

#Keep the column names to rebuild the DataFrame after scaling
cols = x_train.columns

#Fit the scaler on x_train, transform it and convert the result back to a pd DataFrame
x_train = pd.DataFrame(train_scaler.fit_transform(x_train), columns=cols)

In [25]:
print("Training Dataset after Standardization:")
x_train

Training Dataset after Standardization:


Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,gender_Female,gender_Male,ever_married_No,ever_married_Yes,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,-1.104665,-0.379028,-0.265392,-0.460827,-1.274007,0.800548,-0.800548,-0.557029,0.557029,2.391375,-0.056364,-1.367738,-0.467091,-0.132453,1.013461,-1.013461,1.747629,-1.068491,-0.531499
1,-0.202080,-0.379028,-0.265392,-0.023264,0.955316,0.800548,-0.800548,-0.557029,0.557029,-0.418169,-0.056364,0.731134,-0.467091,-0.132453,-0.986717,0.986717,-0.572204,0.935899,-0.531499
2,1.709278,2.638330,-0.265392,-1.045072,0.570001,0.800548,-0.800548,-0.557029,0.557029,-0.418169,-0.056364,0.731134,-0.467091,-0.132453,-0.986717,0.986717,-0.572204,0.935899,-0.531499
3,-0.998479,-0.379028,-0.265392,0.028945,0.074596,0.800548,-0.800548,-0.557029,0.557029,2.391375,-0.056364,-1.367738,-0.467091,-0.132453,1.013461,-1.013461,-0.572204,-1.068491,1.881472
4,0.912879,-0.379028,-0.265392,-0.710063,0.308537,0.800548,-0.800548,-0.557029,0.557029,-0.418169,-0.056364,-1.367738,2.140910,-0.132453,1.013461,-1.013461,-0.572204,0.935899,-0.531499
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2837,-0.626826,-0.379028,-0.265392,0.522860,-1.053827,0.800548,-0.800548,1.795239,-1.795239,-0.418169,-0.056364,-1.367738,2.140910,-0.132453,1.013461,-1.013461,1.747629,-1.068491,-0.531499
2838,-0.202080,-0.379028,-0.265392,-0.743419,-1.122633,0.800548,-0.800548,-0.557029,0.557029,-0.418169,-0.056364,0.731134,-0.467091,-0.132453,-0.986717,0.986717,-0.572204,-1.068491,1.881472
2839,1.603092,-0.379028,-0.265392,-0.380856,-1.081349,0.800548,-0.800548,-0.557029,0.557029,-0.418169,-0.056364,0.731134,-0.467091,-0.132453,1.013461,-1.013461,-0.572204,0.935899,-0.531499
2840,0.912879,-0.379028,-0.265392,2.266272,1.175496,-1.249144,1.249144,-0.557029,0.557029,2.391375,-0.056364,-1.367738,-0.467091,-0.132453,1.013461,-1.013461,-0.572204,-1.068491,1.881472


-----------------------------------------------------------------------------------------

# Dimensionality Reduction
- Note that dimensionality reduction using PCA was tried before balancing the data, however it made the models perform badly as the features are already few and the dataset size is medium
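
For reference, a PCA experiment of this kind can be sketched as follows. The data is random and purely illustrative, and the 95% variance threshold is an assumption of this sketch, not a setting taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random toy features; column 1 is made almost a copy of column 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
X_demo[:, 1] = X_demo[:, 0] * 0.9 + rng.normal(scale=0.1, size=200)

# PCA is scale-sensitive, so standardize first
X_std = StandardScaler().fit_transform(X_demo)

# A float n_components keeps just enough components to reach that share of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.round(3))
```

With few, mostly uncorrelated features, PCA can only drop a component or two, which is consistent with the limited benefit reported above.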

--------------------------------------------------------------------------

# Handling Class Imbalance by Oversampling (ADASYN)

- Fitting the oversampling algorithm (ADASYN) to the training data (features and label) to balance the classes, as they are imbalanced.
- SMOTE was also tested; however, ADASYN performed better with the models.


In [26]:
# Print class distribution before applying ADASYN
print('Class distribution before oversampling:\n',y_train.value_counts())

Class distribution before oversampling:
 0    2681
1     161
Name: stroke, dtype: int64


In [27]:
#Instantiate ADASYN
ada = ADASYN(random_state=130)

# Fit ADASYN to all training data
x_train, y_train = ada.fit_resample(x_train, y_train)

# Print class distribution after applying ADASYN
print('Class distribution after oversampling:\n',y_train.value_counts())

Class distribution after oversampling:
 0    2681
1    2635
Name: stroke, dtype: int64


As seen, the class distribution is now balanced.

-------------------------------------------------------------------

# Model Training

- Conducted hyperparameter tuning on 3 different classification algorithms and then trained them using the best parameters. The algorithms are:
    - Decision Trees
    - SVM
    - KNN
- Chose the one with the highest precision, as precision is the metric I'm most interested in: the priority is to correctly classify the records that will have a stroke.
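
Since the goal mixes "being right when predicting a stroke" (precision) with "finding the stroke records" (recall), it can help to track both during tuning. The sketch below is a hedged illustration on synthetic data, with a made-up grid: `GridSearchCV` accepts a dictionary of scorers, and `refit` names the one used to pick the best parameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data used only for this illustration
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.8, 0.2], random_state=0
)

demo_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 8]},
    scoring={"precision": "precision", "recall": "recall"},
    refit="precision",  # the best parameters are chosen by precision
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ holds one mean score per metric and parameter combination
best = demo_grid.best_index_
cv_precision = demo_grid.cv_results_["mean_test_precision"][best]
cv_recall = demo_grid.cv_results_["mean_test_recall"][best]

print(demo_grid.best_params_)
print(f"precision={cv_precision:.3f}, recall={cv_recall:.3f}")
```

Here `best_score_` is the mean cross-validated precision of the selected parameters, not an accuracy.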

# Decision Trees (DT)

In [28]:
#Defining Parameter range
param_grid_dt = {'criterion': ['gini', 'entropy'],   
              'max_depth': range(1,50)}

#Chose precision as the scoring metric as it's more important for this pipeline, with 10-fold cross-validation
dt_grid = GridSearchCV(DecisionTreeClassifier(), param_grid_dt, scoring='precision', cv=10) 
   
# fitting the model for grid search 
dt_grid.fit(x_train, y_train) 
 
# print best parameter after tuning 
print(f"The best parameters are: {dt_grid.best_params_}") 

The best parameters are: {'criterion': 'entropy', 'max_depth': 47}


In [29]:
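#Mean cross-validated precision of the best parameters (best_score_ reports the scoring metric, here precision)
dt_cv_precision = dt_grid.best_score_ * 100
print("Cross-validated precision on the training dataset with tuning is: {:.1f}%".format(dt_cv_precision))

Cross-validated precision on the training dataset with tuning is: 87.8%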


# Support Vector Machines (SVM)

In [30]:
#Defining parameter range 
param_grid = {'C': [1, 10, 100],   
              'gamma':['scale', 'auto'],
              'kernel': ['linear', 'rbf']}  
   
#Chose precision as the scoring metric as it's more important for this pipeline (cv is left at its default number of folds)
svm_grid = GridSearchCV(SVC(class_weight='balanced'), param_grid, scoring='precision') 
   
# fitting the model for grid search 
svm_grid.fit(x_train, y_train) 
 
# print best parameter after tuning 
print(f"The best parameters are: {svm_grid.best_params_}") 

The best parameters are: {'C': 100, 'gamma': 'auto', 'kernel': 'rbf'}


In [31]:
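#Mean cross-validated precision of the best parameters (best_score_ reports the scoring metric, here precision)
svm_cv_precision = svm_grid.best_score_ * 100
print("Cross-validated precision on the training dataset with tuning is: {:.1f}%".format(svm_cv_precision))

Cross-validated precision on the training dataset with tuning is: 89.4%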


# K-Nearest Neighbour (KNN)

In [32]:
knn= KNeighborsClassifier()

#Defining parameter range 
knn_parameters = { 'n_neighbors' : range(1, 21),
               'weights' : ['uniform','distance'],
               'metric' : ['minkowski','euclidean','manhattan']}

#Chose precision as the scoring metric as it's more important for this pipeline, with 10-fold cross-validation
knn_grid = GridSearchCV(knn, knn_parameters, scoring='precision', cv=10)
  
# fitting the model for grid search
knn_grid.fit(x_train, y_train)

#Printing best parameters
print((f"The best parameters are: {knn_grid.best_params_}"))

The best parameters are: {'metric': 'manhattan', 'n_neighbors': 2, 'weights': 'uniform'}


In [29]:
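#Mean cross-validated precision of the best parameters (best_score_ reports the scoring metric, here precision)
knn_cv_precision = knn_grid.best_score_ * 100
print("Cross-validated precision on the training dataset with tuning is: {:.1f}%".format(knn_cv_precision))

Cross-validated precision on the training dataset with tuning is: 91.5%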


Cross-validated precision of the 3 algorithms on the training dataset with hyperparameter tuning (the grid searches use `scoring='precision'`, so these are precision scores, not accuracy):

- Decision Trees: 87.8%
- Support Vector Machines: 89.4%
- K-Nearest Neighbour: 91.5%

Precision on class 1 is the selection metric for this pipeline because class 1 (stroke) is the class of interest. Note that precision measures how many predicted strokes are real, whereas finding ALL records that will have a stroke is measured by recall. All algorithms were previously tested for precision on the test dataset: KNN had the highest precision, although DT had similar values.

---------------------------------------------------------------------------------

# Data Preprocessing on Test data

- Applying all the preprocessing steps used on the training dataset, except for data balancing, so that the test data keeps its real class distribution and measures generalization

In [30]:
#Fill the null values in the bmi column using the bmi mean of the train dataset to ensure generalization
test_df = test_df.assign(bmi=test_df['bmi'].fillna(train_mean_bmi))

#Dropping the record with the gender "Other" 
test_df = test_df.loc[test_df['gender']!= "Other"]

#Dropping implicit null value of the column smoking_status which are equal to "Unknown"
test_df=test_df.loc[test_df["smoking_status"]!="Unknown"]

In [31]:
#One-hot encoding categorical features with the encoder already fitted on the train dataset (transform only, no refitting)
enc_test_data=pd.DataFrame(enc.transform(test_df[catg_cols]).toarray())
enc_test_data.columns = enc.get_feature_names_out(catg_cols)

#Resetting the index of the test dataset to be able to merge (reset_index returns a new DataFrame, so it is assigned back)
test_df = test_df.reset_index(drop=True)

#Merge test dataset with enc_data based on the row index
test_df= pd.merge(test_df,enc_test_data, on=test_df.index)

#Drop categorical columns that has already been encoded
test_df =test_df.drop(catg_cols, axis =1)

#Drop the key column that was automatically assigned with merge as it's not needed
test_df =test_df.drop(['key_0'], axis =1)

In [32]:
#Splitting test dataset to features (x_test) and label (y_test)
x_test= test_df.drop(['stroke'], axis=1)
y_test=test_df['stroke']

In [33]:
#Standardize test data using the scaler fitted on the train set (transform only, no refitting)

#Keep the column names to rebuild the DataFrame after scaling
cols = x_test.columns

#Transforming data back to pd dataframe form
x_test = pd.DataFrame(train_scaler.transform(x_test), columns=cols)

#Display test dataset
x_test

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,gender_Female,gender_Male,ever_married_No,ever_married_Yes,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,0.635302,-0.379173,-0.259354,-0.665716,-0.445694,0.812096,-0.812096,-0.556859,0.556859,-0.426181,-0.053,-1.321421,2.080471,-0.142704,-0.993717,0.993717,-0.566029,0.929640,-0.532537
1,-0.370358,-0.379173,3.855732,2.423181,0.564999,0.812096,-0.812096,-0.556859,0.556859,2.346422,-0.053,-1.321421,-0.480660,-0.142704,-0.993717,0.993717,-0.566029,-1.075686,1.877804
2,0.899950,-0.379173,-0.259354,-0.782784,-0.183393,-1.231381,1.231381,-0.556859,0.556859,-0.426181,-0.053,-1.321421,2.080471,-0.142704,-0.993717,0.993717,1.766695,-1.075686,-0.532537
3,-1.164301,-0.379173,-0.259354,-0.491145,-0.459731,0.812096,-0.812096,1.795787,-1.795787,2.346422,-0.053,-1.321421,-0.480660,-0.142704,-0.993717,0.993717,-0.566029,-1.075686,1.877804
4,-0.635006,-0.379173,-0.259354,-1.004759,0.775560,0.812096,-0.812096,-0.556859,0.556859,-0.426181,-0.053,0.756761,-0.480660,-0.142704,-0.993717,0.993717,1.766695,-1.075686,-0.532537
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
704,1.217527,-0.379173,3.855732,0.685715,0.382513,0.812096,-0.812096,-0.556859,0.556859,-0.426181,-0.053,-1.321421,2.080471,-0.142704,-0.993717,0.993717,-0.566029,-1.075686,1.877804
705,0.158937,-0.379173,-0.259354,-0.463939,-0.894890,0.812096,-0.812096,-0.556859,0.556859,-0.426181,-0.053,0.756761,-0.480660,-0.142704,-0.993717,0.993717,1.766695,-1.075686,-0.532537
706,0.370655,-0.379173,-0.259354,-0.474039,-0.347432,-1.231381,1.231381,-0.556859,0.556859,-0.426181,-0.053,0.756761,-0.480660,-0.142704,1.006323,-1.006323,1.766695,-1.075686,-0.532537
707,1.746822,-0.379173,3.855732,-0.750838,-0.431656,-1.231381,1.231381,-0.556859,0.556859,-0.426181,-0.053,-1.321421,2.080471,-0.142704,1.006323,-1.006323,1.766695,-1.075686,-0.532537


------------------------------------------------------------------------------------

# Model Assessment

As mentioned above, the tuned KNN model is applied to the test dataset to predict its labels, as KNN had the highest cross-validated score on the training dataset.

In [42]:
#Predict the label for the test data using the KNN model
grid_predictions = knn_grid.predict(x_test) 

In [43]:
from imblearn.metrics import classification_report_imbalanced
#Print a report showing precision, recall, specificity, F1 and more
print(classification_report_imbalanced(y_test, grid_predictions))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.95      0.94      0.12      0.94      0.34      0.12       668
          1       0.11      0.12      0.94      0.11      0.34      0.10        41

avg / total       0.90      0.89      0.17      0.89      0.34      0.12       709



In [44]:
from sklearn import metrics
#Print model accuracy on the test data
print(metrics.accuracy_score(y_test,grid_predictions))

0.8899858956276445


Metrics after applying the tuned KNN model to the test dataset:
- Precision for class 1 is very low (11%)
- Precision for class 0 is good (95%)
- Recall and F1 were also low for class 1 and good for class 0.

Although the accuracy is good (89%), the precision is very low for the class we are most interested in, even though all preprocessing and class imbalance handling were implemented.
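
To make these numbers concrete, the sketch below rebuilds a confusion matrix that is consistent with the report above. The four counts are inferred from the reported support (668 and 41), the class-1 recall and the accuracy. They are an assumption of this sketch, not output produced by the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Counts inferred from the report above (an assumption, not notebook output)
tn, fp, fn, tp = 626, 42, 36, 5

# Rebuild label vectors that produce exactly these counts
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

precision_1 = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall_1 = recall_score(y_true, y_pred)        # tp / (tp + fn)
accuracy = accuracy_score(y_true, y_pred)      # (tp + tn) / all records

print(f"precision(class 1) = {precision_1:.2f}")
print(f"recall(class 1)    = {recall_1:.2f}")
print(f"accuracy           = {accuracy:.4f}")
```

Under these counts only 5 of the 41 stroke records are found, while the 626 correctly classified class-0 records account for almost all of the accuracy.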

--------------------------------------------------------------

# Conclusion

- After data cleaning, categorical encoding, standardization, class balancing and hyperparameter tuning, the model's performance improved, although precision for class 1 is still low. I therefore believe more records should be added to the dataset, especially for the stroke class (1), to improve the quality of predictions.


- Tried implementing dimensionality reduction but it only made the model perform poorly as the dataset is already small and there are few features (11 before encoding)


- The imputation mean and the scaler were fitted on the training data only and then applied to the test data, which guards this pipeline against data leakage.


- Recommendations for future work:
    - A correlation matrix between the features can be implemented to understand which features affect the prediction most and focus on them.
    - Try algorithms that handle imbalance better, such as Random Forest Classifier, or try ensemble methods.
    - Add more records of class 1 (will have a stroke) to the dataset.
    - Add a feature stating whether the patient is diabetic or not, and which type of diabetes, as according to several health journals diabetes is highly related to having a stroke.