## Project Name: Stroke Prediction

The main aim of this project is to predict whether a patient is likely to have a stroke, based on input parameters such as gender, age, existing conditions (hypertension, heart disease), and smoking status.

### Dataset link
https://www.kaggle.com/fedesoriano/stroke-prediction-dataset

## Lifecycle Stages of a Data Science Project
1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Building
5. Model Deployment

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px
import plotly.graph_objs as go

!pip install dataprep


In [4]:
# Import only the EDA helpers used in this notebook
from dataprep.eda import create_report, plot_correlation


In [18]:
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')


In [19]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [20]:
df.columns


Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

In [21]:
df.select_dtypes(exclude=['int64','float64']).columns


Index(['gender', 'ever_married', 'work_type', 'Residence_type',
       'smoking_status'],
      dtype='object')

In [22]:
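# Encode the categorical columns as integers (labels not listed in a mapping stay unchanged).
# Assigning the result back avoids relying on inplace=True through attribute access.
df['gender'] = df['gender'].replace({'Male': 1, 'Female': 0})
df['ever_married'] = df['ever_married'].replace({'No': 0, 'Yes': 1})
df['work_type'] = df['work_type'].replace(
    {'Private': 0, 'Self-employed': 1, 'children': 2, 'Govt_job': 3, 'Never_worked': 4})
df['Residence_type'] = df['Residence_type'].replace({'Urban': 0, 'Rural': 1})
df['smoking_status'] = df['smoking_status'].replace(
    {'never smoked': 0, 'Unknown': 1, 'formerly smoked': 2, 'smokes': 3})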
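
The mappings above leave any label that is not listed untouched, which can silently leave strings in a column that is expected to be numeric. The following is a minimal sketch of a check for that, using a small made-up frame rather than the Kaggle file; `encode_column` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

def encode_column(series, mapping):
    """Map labels to integers and report labels that the mapping does not cover."""
    unmapped = sorted(set(series.dropna().unique()) - set(mapping))
    encoded = series.map(mapping)  # labels missing from the mapping become NaN
    return encoded, unmapped

# Toy data for illustration only (not rows from the dataset)
toy = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})
encoded, unmapped = encode_column(toy["gender"], {"Male": 1, "Female": 0})

print(unmapped)             # labels that still need a decision
print(encoded.isna().sum()) # number of rows affected
```

Unlike `replace`, `Series.map` turns uncovered labels into NaN, so they show up in the usual missing-value checks instead of remaining as strings.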



In [23]:
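# Indicator for Female: 1 where gender was encoded as 0 above, 0 otherwise.
# This matches the gender values shown in the later df.head() output.
df['gender'] = (df['gender'] == 0).astype(int)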

In [28]:
create_report(df)



Statistic,Value
Number of Variables,11
Number of Rows,5110
Missing Cells,201
Missing Cells (%),0.4%
Duplicate Rows,0
Duplicate Rows (%),0.0%
Total Size in Memory,404.3 KB
Average Row Size in Memory,81.0 B
Numerical Variables,11

Per-variable statistics from the report (variables listed in DataFrame column order):

Statistic,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
Distinct Count,2,104,2,2,2,5,2,3979,418,4,2
Unique (%),0.0%,2.0%,0.0%,0.0%,0.0%,0.1%,0.0%,77.9%,8.5%,0.1%,0.0%
Missing,0,0,0,0,0,0,0,0,201,0,0
Missing (%),0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,3.9%,0.0%,0.0%
Infinite,0,0,0,0,0,0,0,0,0,0,0
Infinite (%),0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%
Memory Size,44.9 KB,79.8 KB,79.8 KB,79.8 KB,79.8 KB,79.8 KB,79.8 KB,79.8 KB,76.7 KB,79.8 KB,79.8 KB
Mean,0.5859,43.2266,0.09746,0.05401,0.6562,0.8321,0.492,106.1477,28.8932,1.1117,0.04873
Minimum,0,0.08,0,0,0,0,0,55.12,10.3,0,0
5-th Percentile,0,5.0,0,0,0,0,0,60.7135,17.64,0,0
Q1,0,25.0,0,0,0,0,0,77.245,23.5,0,0
Median,1,45.0,0,0,1,0,0,91.885,28.1,1,0
Q3,1,61.0,0,0,1,2,1,114.09,33.1,2,0
95-th Percentile,1,79.0,1,1,1,3,1,216.2945,42.96,3,0
Maximum,1,82.0,1,1,1,4,1,271.74,97.6,3,1
Range,1,81.92,1,1,1,4,1,216.62,87.3,3,1
IQR,1,36.0,0,0,1,2,1,36.845,9.6,2,0
Standard Deviation,0.4926,22.6126,0.2966,0.2261,0.475,1.1099,0.5,45.2836,7.8541,1.0718,0.2153
Variance,0.2427,511.3318,0.08798,0.0511,0.2257,1.2319,0.25,2050.6008,61.6864,1.1488,0.04636
Sum,2994.0,220888.0,498.0,276.0,3353.0,4252.0,2514.0,542414.63,141836.9,5681.0,249.0
Skewness,-0.3488,-0.137,2.7146,3.9461,-0.6576,0.9744,0.0321,1.5718,1.055,0.5295,4.1921
Kurtosis,-1.8783,-0.9912,5.369,13.5716,-1.5676,-0.4964,-1.999,1.6777,3.358,-1.0015,15.5733
Coefficient of Variation,0.8408,0.5231,3.0435,4.1854,0.724,1.3339,1.0163,0.4266,0.2718,0.9641,4.4188


In [29]:
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,0,67.0,0,1,1,0,0,228.69,36.6,2,1
1,1,61.0,0,0,1,1,1,202.21,,0,1
2,0,80.0,0,1,1,0,1,105.92,32.5,0,1
3,1,49.0,0,0,1,0,0,171.23,34.4,3,1
4,1,79.0,1,0,1,1,1,174.12,24.0,0,1


In [30]:
# 'id' is only a row identifier; errors='ignore' lets this cell be re-run after the column is gone
df = df.drop(columns='id', errors='ignore')

In [31]:
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,0,67.0,0,1,1,0,0,228.69,36.6,2,1
1,1,61.0,0,0,1,1,1,202.21,,0,1
2,0,80.0,0,1,1,0,1,105.92,32.5,0,1
3,1,49.0,0,0,1,0,0,171.23,34.4,3,1
4,1,79.0,1,0,1,1,1,174.12,24.0,0,1


In [32]:
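# fillna returns a new DataFrame, so df itself is unchanged here. In addition, a DataFrame
# passed as the fill value is aligned on the row index, so only rows whose index appears in
# df.mode() could be filled. The missing bmi values therefore remain, as the output shows.
df.fillna(df.mode())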

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,0,67.0,0,1,1,0,0,228.69,36.6,2,1
1,1,61.0,0,0,1,1,1,202.21,,0,1
2,0,80.0,0,1,1,0,1,105.92,32.5,0,1
3,1,49.0,0,0,1,0,0,171.23,34.4,3,1
4,1,79.0,1,0,1,1,1,174.12,24.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
5105,1,80.0,1,0,1,0,0,83.75,,0,0
5106,1,81.0,0,0,1,1,0,125.20,40.0,0,0
5107,1,35.0,0,0,1,1,1,82.99,30.6,0,0
5108,0,51.0,0,0,1,0,1,166.29,25.6,2,0
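
The output above still shows empty bmi cells, because `fillna` returns a new frame and a DataFrame fill value is aligned by row index. The following is a minimal sketch, on made-up values, of filling each column with its mode and keeping the result; whether the mode or another statistic such as the median suits bmi better is a modelling choice this sketch does not settle.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not rows from the dataset)
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 32.5, np.nan],
                    "age": [67.0, 61.0, 80.0, 49.0, 79.0]})

# DataFrame.mode() returns a DataFrame; its first row holds one mode per column.
modes = toy.mode().iloc[0]

# With a Series as the fill value, each column is filled with the value under its own label.
filled = toy.fillna(modes)

print(filled["bmi"].tolist())
print(toy["bmi"].isna().sum())  # the original frame is untouched
```

Assigning the result (here to `filled`) is what makes the imputation take effect; the notebook instead drops the incomplete rows in the next cell.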


In [None]:
# Drop the rows that still contain missing values (the missing bmi entries)
df = df.dropna()

In [None]:
# One box plot per column, in the same order as the DataFrame columns
columns = ['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
           'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
           'smoking_status', 'stroke']
data = [go.Box(name=col, y=df[col]) for col in columns]
plotly.offline.iplot(data)

In [None]:
plot_correlation(df, "stroke")


In [None]:
fig = px.scatter_matrix(df, dimensions=['gender', 'age', 'hypertension', 'stroke'])
fig.show()

In [None]:
fig = px.scatter_matrix(df, dimensions=[ 'heart_disease', 'ever_married',
       'work_type', 'stroke'])
fig.show()

In [None]:
fig = px.scatter_matrix(df, dimensions=['Residence_type', 'avg_glucose_level', 'bmi',
                                        'stroke'])
fig.show()

In [None]:
fig = px.scatter_matrix(df, dimensions=['Residence_type', 'avg_glucose_level', 'bmi',
                                        'smoking_status','stroke'])
fig.show()

In [None]:
df[df.isnull().any(axis=1)]

In [None]:
X = df.drop(columns=["stroke"])
y = df["stroke"]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
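
The report earlier in the notebook gives a mean of 0.04873 for stroke, so fewer than 5% of rows are positive. With such an imbalance, a plain random split can give the training and test sets slightly different class proportions. The following is a sketch on synthetic data of the `stratify` and `random_state` arguments of `train_test_split`; the array names and sizes are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 5% positives (not the real data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 50 + [0] * 950)

# stratify keeps the class ratio the same in both parts; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

print(y_tr.mean(), y_te.mean())
```

Fixing `random_state` also makes later comparisons between models easier, since every model is then evaluated on the same test rows.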

### RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier=RandomForestClassifier(n_estimators=10).fit(X_train,y_train)
prediction=rf_classifier.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
print(confusion_matrix(y_test,prediction))
print(accuracy_score(y_test,prediction))
print(classification_report(y_test,prediction))
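
Because stroke cases are rare in this data (mean of stroke 0.04873 in the report), accuracy alone can look good even for a model that never predicts a stroke. The following is a small pure-Python sketch, on made-up labels, of how accuracy, precision, and recall differ in that situation; `binary_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up labels: 95 negatives and 5 positives; the "model" always predicts 0
y_true_toy = [0] * 95 + [1] * 5
y_pred_toy = [0] * 100
accuracy, precision, recall = binary_metrics(y_true_toy, y_pred_toy)
print(accuracy, precision, recall)
```

This is why the recall for class 1 in the classification report deserves as much attention as the accuracy score.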

The main parameters used by a Random Forest Classifier are:

- criterion = the function used to evaluate the quality of a split.
- max_depth = maximum number of levels allowed in each tree.
- max_features = maximum number of features considered when splitting a node.
- min_samples_leaf = minimum number of samples which can be stored in a tree leaf.
- min_samples_split = minimum number of samples necessary in a node to cause node splitting.
- n_estimators = number of trees in the ensemble.

In [None]:
### Manual Hyperparameter Tuning
model=RandomForestClassifier(n_estimators=300,criterion='entropy',
                             max_features='sqrt',min_samples_leaf=10,random_state=100).fit(X_train,y_train)
predictions=model.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(accuracy_score(y_test,predictions))
print(classification_report(y_test,predictions))

### Genetic Algorithms
Genetic algorithms try to apply natural selection mechanisms to machine learning contexts.

Let's imagine we create a population of N machine learning models with some predefined hyperparameters. We can then calculate the accuracy of each model and keep just half of the models (the ones that perform best). We can now generate offspring with hyperparameters similar to those of the best models, so that we again have a population of N models. At this point we can again calculate the accuracy of each model and repeat the cycle for a defined number of generations. In this way, only the best models survive at the end of the process.
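
The cycle described above can be sketched in plain Python. This is only an illustration of the select-and-mutate loop, not how TPOT is implemented: the `fitness` function is a hypothetical stand-in for model accuracy (training real models is omitted), and the names `mutate` and `evolve` are assumptions made for the example.

```python
import random

def fitness(params):
    """Hypothetical stand-in for accuracy; highest at n_estimators=300, max_depth=10."""
    return -abs(params["n_estimators"] - 300) / 100 - abs(params["max_depth"] - 10) / 10

def mutate(params, rng):
    """Create an offspring whose hyperparameters are close to the parent's."""
    return {
        "n_estimators": max(10, params["n_estimators"] + rng.randint(-50, 50)),
        "max_depth": max(1, params["max_depth"] + rng.randint(-3, 3)),
    }

def evolve(population_size=8, generations=20, seed=0):
    rng = random.Random(seed)
    population = [{"n_estimators": rng.randint(10, 1000), "max_depth": rng.randint(1, 50)}
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Keep the better half of the population
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        history.append(fitness(survivors[0]))
        # Refill the population with mutated copies of the survivors
        offspring = [mutate(rng.choice(survivors), rng)
                     for _ in range(population_size - len(survivors))]
        population = survivors + offspring
    return max(population, key=fitness), history

best, history = evolve()
print(best, history[-1])
```

Because the survivors are carried over unchanged, the best score recorded per generation never decreases in this sketch.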

In [None]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 1000,10)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10,14]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,6,8]
# Create the random grid
param = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
              'criterion':['entropy','gini']}
print(param)

In [None]:
from tpot import TPOTClassifier


tpot_classifier = TPOTClassifier(generations= 5, population_size= 24, offspring_size= 12,
                                 verbosity= 2, early_stop= 12,
                                 config_dict={'sklearn.ensemble.RandomForestClassifier': param}, 
                                 cv = 4, scoring = 'accuracy')
tpot_classifier.fit(X_train,y_train)

In [None]:
accuracy = tpot_classifier.score(X_test, y_test)
print(accuracy)

### Optimize hyperparameters of the model using Optuna
The hyperparameters tuned here are n_estimators and max_depth for the random forest (plus C when an SVC is sampled); we try different values to see whether the model accuracy can be improved. The objective function accepts a trial object, whose suggest methods sample the hyperparameter values. We then create a study to run the hyperparameter optimization and finally read the best hyperparameters from it.

In [None]:
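import optuna
import sklearn.ensemble
import sklearn.model_selection
import sklearn.svm

def objective(trial):
    # Let Optuna choose which model family to try in this trial
    classifier = trial.suggest_categorical('classifier', ['RandomForest', 'SVC'])

    if classifier == 'RandomForest':
        # Number of trees: 200 to 2000 in steps of 10
        n_estimators = trial.suggest_int('n_estimators', 200, 2000, step=10)
        # Depth is sampled on a log scale, then truncated to an integer
        max_depth = int(trial.suggest_float('max_depth', 10, 100, log=True))

        clf = sklearn.ensemble.RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth)
    else:
        # Regularization strength C is sampled on a log scale
        c = trial.suggest_float('svc_c', 1e-10, 1e10, log=True)

        clf = sklearn.svm.SVC(C=c, gamma='auto')

    # Objective value: mean 3-fold cross-validated accuracy on the training set
    return sklearn.model_selection.cross_val_score(
        clf, X_train, y_train, n_jobs=-1, cv=3).mean()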


In [None]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

trial = study.best_trial

print('Accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

In [None]:
trial

In [None]:
study.best_params

In [None]:
rf=RandomForestClassifier(n_estimators=330,max_depth=30)
rf.fit(X_train,y_train)

In [None]:
y_pred=rf.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))