#Full DL Solution

© 2023, Zaka AI, Inc. All Rights Reserved.

---

###**Case Study:** Stroke Prediction

**Objective:** The goal of this project is to walk you through a case study where you can apply the deep learning concepts you learned during the week. By the end of this project, you will have developed a solution that predicts whether a person will have a stroke.


**Dataset Explanation:** We will be using the stroke dataset. Its features are:


* **id:** unique identifier
* **gender:** "Male", "Female" or "Other"
* **age:** age of the patient
* **hypertension:** 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
* **heart_disease:** 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
* **ever_married:** "No" or "Yes"
* **work_type:** "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
* **Residence_type:** "Rural" or "Urban"
* **avg_glucose_level:** average glucose level in blood
* **bmi:** body mass index
* **smoking_status:** "formerly smoked", "never smoked", "smokes" or "Unknown"
* **stroke:** 1 if the patient had a stroke or 0 if not

#Importing Libraries

In [2]:
pip install scikeras

Collecting scikeras
  Downloading scikeras-0.11.0-py3-none-any.whl (27 kB)
Installing collected packages: scikeras
Successfully installed scikeras-0.11.0


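We start by importing the libraries we need: numpy and pandas for data handling, scikit-learn for preprocessing, splitting and grid search, Keras for the model, imbalanced-learn for SMOTE, and scikeras to wrap the Keras model for scikit-learn.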

In [3]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras.metrics import Precision, Recall
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV
from scikeras.wrappers import KerasClassifier

#Loading the Dataset

We load the dataset from a CSV file and look at its first rows.

In [4]:
data = pd.read_csv('healthcare-dataset-stroke-data.csv')
data.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


#Exploratory Data Analysis

Now we start the exploratory data analysis.

###Shape of the data

First, you need to know the shape of the data (how many examples and features we have).

In [5]:
print("We have {} rows and {} columns".format(data.shape[0], data.shape[1]))

We have 5110 rows and 12 columns


###Types of different Columns

Check the type of each feature and whether any column contains null values.

In [6]:
data.info()
data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
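
To put the 201 missing `bmi` values in perspective, the share of missing entries per column can be computed as well. The following is a minimal sketch on a small made-up frame (`toy` is hypothetical data, not the stroke dataset), followed by the same arithmetic for the counts reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the stroke data.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_share = toy.isnull().mean()
print(missing_share)

# Same arithmetic for the counts reported by data.isnull().sum() above.
bmi_missing_pct = 201 / 5110 * 100
print("bmi missing: {:.2f}%".format(bmi_missing_pct))
```

Roughly 4% of the rows lack a `bmi` value, which is small enough that dropping them is a defensible first choice.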

###Dealing with categorical variables

Now we will walk through the categorical variables that we have to see the categories and the counts of each of them.

In [7]:
categ_columns = ['gender','ever_married','work_type', 'Residence_type', 'smoking_status']
for col in categ_columns:
    print("{}\n".format(data[col].value_counts()))

Female    2994
Male      2115
Other        1
Name: gender, dtype: int64

Yes    3353
No     1757
Name: ever_married, dtype: int64

Private          2925
Self-employed     819
children          687
Govt_job          657
Never_worked       22
Name: work_type, dtype: int64

Urban    2596
Rural    2514
Name: Residence_type, dtype: int64

never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: smoking_status, dtype: int64



#Preprocessing

Prepare the data so that it is ready to train a DL model.

In [8]:
#dropping unused columns
data.drop('id', axis=1, inplace=True)

In [10]:
#dropping missing values
data = data.dropna()
data.isnull().sum()

gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
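
Dropping rows is the approach used in this notebook. One possible alternative, shown here only as a sketch on made-up values (`toy` is hypothetical), is to fill the missing `bmi` entries with the column median so that no rows are lost. In a real pipeline the median would ideally be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column name mirrors the dataset.
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, 24.0]})

# Fill missing values with the column median instead of dropping the rows.
bmi_median = toy["bmi"].median()
toy_imputed = toy.fillna({"bmi": bmi_median})

print(toy_imputed)
```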

In [11]:
#Converting categorical columns to numerical
encoder = LabelEncoder()
for col in categ_columns:
    data[col] = encoder.fit_transform(data[col])
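
The loop above reuses a single `LabelEncoder`, so after the loop it only remembers the mapping of the last column. If the integer codes need to be translated back to category names later, one option is to keep one encoder per column. The sketch below assumes a small made-up frame (`toy` and the `encoders` dictionary are illustrative names, not part of the notebook).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two of the categorical columns.
toy = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Rural"],
})

# One encoder per column, so each mapping stays available afterwards.
encoders = {}
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Codes follow the sorted order of the category names.
print(list(encoders["ever_married"].classes_))
# Recover the original labels from the integer codes.
print(encoders["Residence_type"].inverse_transform(toy["Residence_type"]))
```

Note that label encoding imposes an arbitrary order on nominal categories such as `work_type`; one-hot encoding is a common alternative when that order is a concern.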

In [12]:
#Normalization
scaler = MinMaxScaler()
scaler.fit(data)
data = pd.DataFrame(scaler.transform(data), columns=data.columns)
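
The cell above fits the scaler on the full dataset, which includes the rows that later become the test set. A stricter variant, sketched here on random stand-in data (`X_toy` and `y_toy` are hypothetical), splits first and learns the minimum and maximum from the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical random features and labels standing in for the stroke data.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=97
)

# Learn min/max from the training rows only, then reuse them on the test rows.
scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this ordering the test features may fall slightly outside the range from 0 to 1, which is expected and harmless.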

In [13]:
#checking the data after preprocessing
data.info()
data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4909 entries, 0 to 4908
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             4909 non-null   float64
 1   age                4909 non-null   float64
 2   hypertension       4909 non-null   float64
 3   heart_disease      4909 non-null   float64
 4   ever_married       4909 non-null   float64
 5   work_type          4909 non-null   float64
 6   Residence_type     4909 non-null   float64
 7   avg_glucose_level  4909 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     4909 non-null   float64
 10  stroke             4909 non-null   float64
dtypes: float64(11)
memory usage: 422.0 KB


| | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4909.0 | 4909.0 | 4909.0 | 4909.0 | 4909.0 | 4909.0 | 4909.0 | 4909.0 | 4909.0 | 4909.0 | 4909.0 |
| mean | 0.205032 | 0.522282 | 0.091872 | 0.049501 | 0.652679 | 0.542524 | 0.507232 | 0.231674 | 0.212981 | 0.458478 | 0.042575 |
| std | 0.246154 | 0.275331 | 0.288875 | 0.216934 | 0.476167 | 0.273148 | 0.499999 | 0.20508 | 0.089966 | 0.355774 | 0.201917 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.304199 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.10133 | 0.151203 | 0.0 | 0.0 |
| 50% | 0.0 | 0.536133 | 0.0 | 0.0 | 1.0 | 0.5 | 1.0 | 0.168775 | 0.203895 | 0.666667 | 0.0 |
| 75% | 0.5 | 0.731445 | 0.0 | 0.0 | 1.0 | 0.75 | 1.0 | 0.269827 | 0.261168 | 0.666667 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


#Building the DL Model

Now it's time to build the actual model. Propose a DL architecture suitable for this problem and print its summary.

In [14]:
def create_model():
    model = Sequential()
    model.add(Dense(32, input_dim=10, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(4, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    return model

model = create_model()
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_4 (Dense)             (None, 32)                352       
                                                                 
 dense_5 (Dense)             (None, 16)                528       
                                                                 
 dense_6 (Dense)             (None, 4)                 68        
                                                                 
 dense_7 (Dense)             (None, 1)                 5         
                                                                 
Total params: 953 (3.72 KB)
Trainable params: 953 (3.72 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
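
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic for the layer widths used in `create_model()`:

```python
# Layer widths of create_model(): 10 inputs, then 32, 16, 4 and 1 units.
layer_sizes = [10, 32, 16, 4, 1]

# A Dense layer has (inputs * units) weights plus one bias per unit.
dense_params = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

print(dense_params)
print(sum(dense_params))
```

The per-layer values and their total agree with the 352, 528, 68, 5 and 953 shown in the summary.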


###Compiling the model

Now we need to compile the model.

In [15]:
model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy', Precision(thresholds = 0.5), Recall(thresholds = 0.5)])
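
To make the two extra metrics concrete, here is a minimal sketch of how precision and recall follow from thresholded predictions. The labels and probabilities are made up for illustration, and the code mirrors the definitions rather than the Keras implementation.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 0, 0, 1])
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.1, 0.8])

# Apply the 0.5 threshold to turn probabilities into class predictions.
y_pred = (y_prob > 0.5).astype(int)

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # correctly flagged positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # negatives flagged as positive
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # positives that were missed

# Guard against division by zero when nothing is predicted positive.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
print(precision, recall)
```

Precision answers "of the cases flagged as stroke, how many were right?", while recall answers "of the real stroke cases, how many were found?".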

###Fitting the model

We split our dataset into training and testing sets, fit the model on the training data (70%), and validate on the testing data (30%).

In [16]:
X = data.drop('stroke', axis=1)
y = data['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=97)

model.fit(X_train, y_train, validation_data=(X_test,y_test), epochs=30, batch_size=100)

Epoch 1/30
...
Epoch 30/30


<keras.src.callbacks.History at 0x7ec5d20b3880>
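
With a rare positive class, a purely random split can leave the two sets with different class ratios. A possible refinement, sketched here on made-up labels (`X_toy` and `y_toy` are hypothetical), is to pass `stratify` to `train_test_split` so that both splits keep the same share of positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 40 positives out of 1000 rows.
y_toy = np.array([1] * 40 + [0] * 960)
X_toy = np.arange(1000).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=97, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```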

In [17]:
#Evaluation
scores = model.evaluate(X_test, y_test)
print('Accuracy : {}'.format(scores[1]))
print('Precision : {}'.format(scores[2]))
print('Recall : {}'.format(scores[3]))

Accuracy : 0.9592667818069458
Precision : 0.0
Recall : 0.0


What can you deduce from the results you obtained?

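**[The accuracy is high only because the model predicts the negative class for essentially every test example. Strokes make up about 4% of the data (the mean of `stroke` is 0.0426), so always predicting "no stroke" already gives roughly 96% accuracy. Precision and recall are both zero, which means the model did not correctly identify a single positive case.]**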
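
The effect can be reproduced without any model. The `describe()` output shows a `stroke` mean of 0.042575 over 4909 rows, which corresponds to 209 positives and 4700 negatives. The sketch below scores a hypothetical "classifier" that always answers 0 on labels with those counts.

```python
import numpy as np

# Class counts of the preprocessed data: 4700 negatives, 209 positives.
y_all = np.array([0] * 4700 + [1] * 209)

# A "model" that always predicts the majority class (no stroke).
y_pred = np.zeros_like(y_all)

accuracy = np.mean(y_pred == y_all)
true_positives = int(np.sum((y_pred == 1) & (y_all == 1)))
recall = true_positives / int(np.sum(y_all == 1))

print("accuracy: {:.4f}, recall: {:.1f}".format(accuracy, recall))
```

This is why accuracy alone is a poor guide on imbalanced data and why precision and recall are tracked alongside it.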

#Improving DL Models

**TIP: When tuning your model to obtain a better performance, make sure you use a validation set**

###Data Improvement

After having studied your data in previous parts, enhance the performance of your model with one data improvement using **SMOTE**.

In [18]:
print('Classes counts before using SMOTE :\n{}'.format(y.value_counts()))

Classes counts before using SMOTE :
0.0    4700
1.0     209
Name: stroke, dtype: int64


In [19]:
#Applying SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print('Classes counts after using SMOTE :\n{}'.format(y_resampled.value_counts()))

Classes counts after using SMOTE :
1.0    4700
0.0    4700
Name: stroke, dtype: int64
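
SMOTE creates new minority samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The following is a simplified numpy sketch of that idea on made-up points; it is not the imbalanced-learn implementation, and `synthesize` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples with two features each.
minority = np.array([[0.2, 0.3], [0.4, 0.1], [0.3, 0.5], [0.6, 0.4]])

def synthesize(samples, n_new, k=2):
    """Create n_new points by interpolating towards a random near neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(samples))
        # Distances from sample i to all samples; position 0 after sorting is i itself.
        dists = np.linalg.norm(samples - samples[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # position along the segment between the two points
        new_points.append(samples[i] + lam * (samples[j] - samples[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=5)
print(synthetic)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.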


Comment on the performance you obtained.

In [20]:
#Training the same model further on the resampled data (its weights are not reset)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_resampled, y_resampled, test_size=0.3)
model.fit(X_train_s, y_train_s, validation_data=(X_test_s, y_test_s), epochs=30, batch_size=100)

Epoch 1/30
...
Epoch 30/30


<keras.src.callbacks.History at 0x7ec5c8296170>

In [None]:
#Evaluation after resampling
scores = model.evaluate(X_test_s, y_test_s)
print('Accuracy : {}'.format(scores[1]))
print('Precision : {}'.format(scores[2]))
print('Recall : {}'.format(scores[3]))

Accuracy : 0.8361701965332031
Precision : 0.7892767786979675
Recall : 0.9107913374900818


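**[The model is now trained on a balanced mix of negative and positive cases. After applying SMOTE, precision (0.79) and recall (0.91) for the stroke class are far above the previous zeros, and accuracy (0.84) is measured on a balanced test set, so it is no longer inflated by the majority class. Note that SMOTE was applied before the train/test split, so the test set also contains synthetic samples and these scores are likely optimistic for real new patients.]**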
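
For an evaluation that is closer to real use, the data can be split first and only the training part resampled. The sketch below illustrates that ordering on made-up data; as an assumption for the sake of a self-contained example, it uses plain random oversampling via `sklearn.utils.resample` in place of SMOTE.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 50 positives out of 1000 rows.
rng = np.random.default_rng(0)
X_toy = rng.random((1000, 4))
y_toy = np.array([1] * 50 + [0] * 950)

# 1) Split first, so the test rows are never touched by resampling.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=97, stratify=y_toy
)

# 2) Oversample the minority class inside the training split only.
X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_tr_bal = np.vstack([X_maj, X_min_up])
y_tr_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

print(np.bincount(y_tr_bal), np.bincount(y_te))
```

The training set ends up balanced while the test set keeps the original class ratio, so the reported metrics reflect the imbalance the model would face in practice.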

###Model Design

Propose one model design method to improve the performance of your model even more.

In [21]:
def create_model_v2():
    model = Sequential()
    model.add(Dense(32, input_dim=10, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(16, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(4, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(1, activation='sigmoid'))
    return model

model_v2 = create_model_v2()
model_v2.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_8 (Dense)             (None, 32)                352       
                                                                 
 batch_normalization (Batch  (None, 32)                128       
 Normalization)                                                  
                                                                 
 dense_9 (Dense)             (None, 16)                528       
                                                                 
 batch_normalization_1 (Bat  (None, 16)                64        
 chNormalization)                                                
                                                                 
 dense_10 (Dense)            (None, 4)                 68        
                                                                 
 batch_normalization_2 (Bat  (None, 4)                

In [22]:
#Training and fitting the new model
model_v2.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy', Precision(thresholds = 0.5), Recall(thresholds = 0.5)])
model_v2.fit(X_train_s, y_train_s, validation_data=(X_test_s, y_test_s), epochs=30, batch_size=100)

Epoch 1/30
...
Epoch 30/30


<keras.src.callbacks.History at 0x7ec5c80a4a00>

In [23]:
#Evaluation for the new model
scores_v2 = model_v2.evaluate(X_test_s, y_test_s)
print('Accuracy : {}'.format(scores_v2[1]))
print('Precision : {}'.format(scores_v2[2]))
print('Recall : {}'.format(scores_v2[3]))

Accuracy : 0.8755319118499756
Precision : 0.82097989320755
Recall : 0.9519301056861877


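Comment on the performance of your model.

**[Adding batch normalization after each hidden layer improved every metric on the resampled test set: accuracy rose from 0.836 to 0.876, precision from 0.789 to 0.821 and recall from 0.911 to 0.952.]**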
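
As a reminder of what the added layers do, here is a minimal numpy sketch of the training-time batch normalization formula applied to made-up activations. It ignores the moving averages used at inference time, and the variable names are illustrative. Each unit carries four values (scale, shift, moving mean, moving variance), which matches the 128 and 64 parameters listed for the 32-unit and 16-unit layers in the summary.

```python
import numpy as np

# Hypothetical activations for a batch of 4 samples and 3 units.
x = np.array([[1.0, 50.0, 0.1],
              [2.0, 60.0, 0.2],
              [3.0, 70.0, 0.3],
              [4.0, 80.0, 0.4]])

gamma = np.ones(3)   # learnable scale, initialised to 1
beta = np.zeros(3)   # learnable shift, initialised to 0
eps = 1e-3           # small constant for numerical stability

# Standardise each unit over the batch, then apply scale and shift.
mean = x.mean(axis=0)
var = x.var(axis=0)
x_norm = gamma * (x - mean) / np.sqrt(var + eps) + beta

print(x_norm.mean(axis=0))
print(x_norm.std(axis=0))
```

After normalization the three units have comparable scales even though their raw activations differ by orders of magnitude, which tends to make training more stable.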

###Hyperparameter Tuning

Now we will tune some hyperparameters of our model. Pick two hyperparameters to optimize, and run a grid search to optimize them. Then fit your model on the best parameters.

In [24]:
#Hyperparameter tuning
def create_model_v3():
    model = Sequential()
    model.add(Dense(32, input_dim=10, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(16, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(4, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy', Precision(thresholds = 0.5), Recall(thresholds = 0.5)])
    return model

model_v3 = KerasClassifier(model=create_model_v3, verbose=0)

param_grid = {
    'epochs': [25,50,100],
    'batch_size': [50,100,200]
}
grid_search = GridSearchCV(model_v3, param_grid=param_grid, scoring='accuracy', cv=3)
grid_search.fit(X_train_s, y_train_s)
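
Grid search trains one model per parameter combination and per cross-validation fold, so the cost grows quickly. A short sketch using scikit-learn's `ParameterGrid` to count the work implied by the grid above (`cv_folds` is just a local name for the `cv=3` setting):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
param_grid = {
    'epochs': [25, 50, 100],
    'batch_size': [50, 100, 200],
}
cv_folds = 3

combinations = list(ParameterGrid(param_grid))
n_fits = len(combinations) * cv_folds

print(len(combinations), "combinations,", n_fits, "cross-validation fits")
for params in combinations[:3]:
    print(params)
```

With the default `refit=True`, `GridSearchCV` additionally trains one final model on the full training data using the best combination.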

In [25]:
#printing the best hyperparameters
best_epochs = grid_search.best_params_['epochs']
best_batch_size = grid_search.best_params_['batch_size']
print("Best Number of Epochs: ", best_epochs)
print("Best Batch Size: ", best_batch_size)

Best Number of Epochs:  100
Best Batch Size:  200


In [27]:
#Fitting the new model on the best hyperparameters
best_model = create_model_v3()
best_model.fit(X_train_s, y_train_s, validation_data=(X_test_s, y_test_s), epochs=best_epochs, batch_size=best_batch_size)

Epoch 1/100
...
Epoch 77/100
Epoch 78

<keras.src.callbacks.History at 0x7ec5c33b9240>

In [28]:
#Evaluation for the new model after hyperparameter tuning
scores_v3 = best_model.evaluate(X_test_s, y_test_s)
print('Accuracy : {}'.format(scores_v3[1]))
print('Precision : {}'.format(scores_v3[2]))
print('Recall : {}'.format(scores_v3[3]))

Accuracy : 0.9028368592262268
Precision : 0.8466876745223999
Recall : 0.9774217009544373


Comment on the performance of your model.

**[This is the best model so far: accuracy (0.903), precision (0.847) and recall (0.977) are all higher than in the previous runs.]**
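
Precision and recall can be folded into a single number, the F1 score, to compare the three resampled runs at a glance. The sketch below applies the standard formula to the precision and recall values printed by the evaluation cells; the dictionary keys are descriptive labels chosen here, not names from the notebook.

```python
# Precision and recall as reported by the three evaluation cells above.
results = {
    "SMOTE": (0.7892767786979675, 0.9107913374900818),
    "SMOTE + batch norm": (0.82097989320755, 0.9519301056861877),
    "SMOTE + batch norm + tuning": (0.8466876745223999, 0.9774217009544373),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = {name: f1_score(p, r) for name, (p, r) in results.items()}
for name, value in f1.items():
    print("{}: {:.3f}".format(name, value))
```

The F1 score increases with each improvement step, in line with the individual metrics.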