Use Case

Objective Statement:
Gain business insight into whether a patient is likely to have a stroke.
Reduce the risk of misjudging which patients are likely to have a stroke.
Improve prediction efficiency by using a person's features such as age, gender, work type, and other medical conditions.

Challenges:
The data set is too large to maintain in an Excel spreadsheet.
Several conditions need to be checked at the same time.

Methodology / Analytic Technique:
Descriptive analysis
Graph analysis

Business Benefit:
Help the Business Development Team create predictions based on each patient's characteristics.
Know how to treat customers with specific medical conditions.

Expected Outcome:
Know how many patients are at risk of a stroke.

Business Understanding:
Why is it important to learn about stroke?
Stroke is the second leading cause of death and disability worldwide. According to the WHO, 15 million people worldwide suffer a stroke every year; of these, about 5 million die and another 5 million are left permanently disabled.
In the USA, someone has a stroke every 40 seconds, and every 4 minutes someone dies of one. The aftermath is devastating, with victims experiencing a wide range of disabling symptoms.
The economic burden on the US healthcare system amounts to about $34 billion per year.
Who is affected?
While there is no single, absolute risk factor that determines one’s chances of having a stroke, certain characteristics and factors may increase a person’s odds.
It is estimated that 60 to 80% of strokes could be prevented through healthy lifestyle changes. 
The dataset will help to examine a multitude of variables to better understand which, if any, play a significant role in predicting the odds of having a stroke.
Why do we care?
Understanding one’s risk factors could help motivate an individual to better educate themselves on their chances of having a stroke, 
more closely monitor their health, make healthier choices, and ultimately decrease their overall risk of stroke. 

Data Understanding:
Data Set: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
Date Created: 01/26/2021
The dataset has 12 columns and 5,110 rows.

Data Preparation:
Python Version: 3.7.6
Packages: Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, imbalanced-learn (imblearn), XGBoost, and mlxtend

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, classification_report,
                             precision_score, recall_score,
                             confusion_matrix, roc_auc_score)
import sklearn.metrics as metrics
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

In [None]:
data=pd.read_csv('healthcare-dataset-stroke-data.csv') #Reading csv file
data.head() # Displaying top 5 rows

Exploratory Data Analysis

In [None]:
data.info()  # Show information about the dataset

In [None]:
data.describe() # Showing data's statistical features

In [None]:
# id is only a unique identifier assigned to each patient and carries no predictive information, so drop it.
data.drop("id", inplace=True, axis=1)

In [None]:
# Gender of the patients
print('Unique values\n', data['gender'].unique())
print('Value counts\n', data['gender'].value_counts())
# The code above gives the unique gender values and the count of each.

sns.countplot(data=data, x='gender')  # Count of records in each gender category
plt.show()  # Finish this figure so the next plot is not drawn on the same axes
sns.countplot(data=data, x='gender', hue='stroke')  # How stroke cases are distributed across genders
plt.show()

In [None]:
# Age
print(data['age'].nunique())  # Number of unique age values
sns.displot(data['age'])  # Distribution plot of age
plt.figure(figsize=(15, 7))
sns.countplot(data=data, x='age', hue='stroke')  # How stroke cases are distributed across ages
plt.show()
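A count plot with one bar per distinct age can be hard to read. As a rough sketch, ages can be grouped into bands and the share of stroke cases computed per band. The toy records, the bin edges, and the names `toy`, `age_band`, and `stroke_rate` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy records standing in for the 'age' and 'stroke' columns.
toy = pd.DataFrame({
    "age":    [5, 17, 25, 38, 44, 52, 59, 63, 71, 79, 82, 67],
    "stroke": [0,  0,  0,  0,  0,  0,  1,  0,  1,  1,  0,  1],
})

# Group ages into bands; the bin edges are an illustrative choice.
bins = [0, 20, 40, 60, 80, 120]
labels = ["0-19", "20-39", "40-59", "60-79", "80+"]
toy["age_band"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# The mean of a 0/1 column is the share of stroke cases within each band.
stroke_rate = toy.groupby("age_band", observed=False)["stroke"].mean()
print(stroke_rate)
```

On the real data, the same two steps would be applied to `data['age']` and `data['stroke']`.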

In [None]:
# Previous heart disease
print('Unique values\n', data['heart_disease'].unique())
print('Value counts\n', data['heart_disease'].value_counts())
# The code above gives the unique values of heart_disease and the count of each.
sns.countplot(data=data, x='heart_disease')  # Count plot of heart_disease
plt.show()
sns.countplot(data=data, x='heart_disease', hue='stroke')  # How stroke cases are distributed by heart disease status
plt.show()

In [None]:
# Hypertensive patients
print('Unique values\n', data['hypertension'].unique())
print('Value counts\n', data['hypertension'].value_counts())
# The code above gives the unique values of hypertension and the count of each.
sns.countplot(data=data, x='hypertension')  # Count plot of hypertension
plt.show()
sns.countplot(data=data, x='hypertension', hue='stroke')  # How stroke cases are distributed by hypertension status
plt.show()

In [None]:
# Ever married
print('Unique values\n', data['ever_married'].unique())
print('Value counts\n', data['ever_married'].value_counts())
# The code above gives the unique values of ever_married and the count of each.
sns.countplot(data=data, x='ever_married')  # Count plot of ever_married
plt.show()
sns.countplot(data=data, x='ever_married', hue='stroke')  # ever_married with respect to stroke
plt.show()

In [None]:
# Work type of patients
print('Unique values\n', data['work_type'].unique())
print('Value counts\n', data['work_type'].value_counts())
# The code above gives the unique values of work_type and the count of each.
sns.countplot(data=data, x='work_type')  # Count plot of work_type
plt.show()
sns.countplot(data=data, x='work_type', hue='stroke')  # work_type with respect to stroke
plt.show()

In [None]:
# Residence type of patients
print('Unique values\n', data['Residence_type'].unique())
print('Value counts\n', data['Residence_type'].value_counts())
# The code above gives the unique values of Residence_type and the count of each.
sns.countplot(data=data, x='Residence_type')  # Count plot of Residence_type
plt.show()
sns.countplot(data=data, x='Residence_type', hue='stroke')  # Residence_type with respect to stroke
plt.show()

In [None]:
# Body mass index (BMI)
print(data['bmi'].isna().sum())  # Number of missing values
data['bmi'] = data['bmi'].fillna(data['bmi'].mean())  # Fill missing values with the column mean
print(data['bmi'].nunique())  # Number of unique values
sns.displot(data['bmi'])  # Distribution of BMI
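Filling missing BMI values with the mean of the whole column uses information from rows that later end up in the test set. A minimal sketch of one alternative, assuming the data is split first: learn the fill value from the training rows only and reuse it on the test rows. The toy values are made up, and the median is swapped in for the mean purely as an illustrative choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical BMI values with gaps, standing in for data['bmi'].
bmi = np.array([[22.0], [np.nan], [31.5], [27.0], [np.nan], [40.0], [24.5], [29.0]])
labels = np.array([0, 0, 1, 0, 0, 1, 0, 1])

bmi_train, bmi_test, lab_train, lab_test = train_test_split(
    bmi, labels, test_size=0.25, random_state=0
)

# Learn the fill value from the training rows only, then reuse it on the test rows.
imputer = SimpleImputer(strategy="median")
bmi_train_filled = imputer.fit_transform(bmi_train)
bmi_test_filled = imputer.transform(bmi_test)

print("fill value learned from training rows:", imputer.statistics_[0])
```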

In [None]:
# Smoking status of patients
print('Unique values\n', data['smoking_status'].unique())
print('Value counts\n', data['smoking_status'].value_counts())
# The code above gives the unique values of smoking_status and the count of each.
sns.countplot(data=data, x='smoking_status')  # Count plot of smoking_status
plt.show()
sns.countplot(data=data, x='smoking_status', hue='stroke')  # smoking_status with respect to stroke
plt.show()

In [None]:
# Stroke: our target variable. It is 1 if the patient had a stroke and 0 otherwise.
print('Unique values\n', data['stroke'].unique())
print('Value counts\n', data['stroke'].value_counts())
# The code above gives the unique values of stroke and the count of each.
sns.countplot(data=data, x='stroke')  # Count plot of stroke
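When one class of the target is much rarer than the other, plain accuracy can look high even for a model that never predicts a stroke. The sketch below shows the arithmetic on a made-up label vector; the 950/50 split is an assumption for illustration, not the real class counts.

```python
import numpy as np

# Hypothetical label vector: 950 negatives and 50 positives (not the real counts).
y_toy = np.array([0] * 950 + [1] * 50)

positive_share = y_toy.mean()
# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = max(positive_share, 1 - positive_share)

print(f"positive share: {positive_share:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model should be judged against this baseline, and with metrics such as recall or F1 that are sensitive to the minority class.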

Feature Engineering

In [None]:
# Fetching the columns whose data type is object.
cols = data.select_dtypes(include=['object']).columns
print(cols)
le = LabelEncoder()  # Initializing the label encoder
data[cols] = data[cols].apply(le.fit_transform)  # Converting each categorical column to integer codes
print(data.head())
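Label encoding assigns integer codes, which can suggest an ordering among categories that have none (for example, work types). One common alternative is one-hot encoding, sketched here with `pd.get_dummies`. The toy rows and the names `toy` and `encoded` are hypothetical.

```python
import pandas as pd

# Hypothetical rows using category names in the style of the 'work_type' column.
toy = pd.DataFrame({
    "work_type": ["Private", "Self-employed", "Govt_job", "Private"],
    "age": [45, 61, 38, 52],
})

# One indicator column per category, so no artificial ordering is implied.
encoded = pd.get_dummies(toy, columns=["work_type"], dtype=int)
print(encoded.columns.tolist())
```

Tree-based models are usually less sensitive to this choice than linear models such as logistic regression.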

In [None]:
# Plotting a heat map to check correlations
plt.figure(figsize=(15, 10))
sns.heatmap(data.corr(), annot=True, fmt='.2f')
# age, hypertension, heart_disease, ever_married, and avg_glucose_level show the strongest correlation with stroke

In [None]:
# To cross-check, use SelectKBest, which scores the features and selects the best ones
# f_classif computes the ANOVA F-value between each feature and the target
classifier = SelectKBest(score_func=f_classif, k=5)
fits = classifier.fit(data.drop('stroke', axis=1), data['stroke'])
x = pd.DataFrame(fits.scores_)
columns = pd.DataFrame(data.drop('stroke', axis=1).columns)
fscores = pd.concat([columns, x], axis=1)
fscores.columns = ['Attribute', 'Score']
fscores.sort_values(by='Score', ascending=False)

In [None]:
# age has the highest score; keep only the features whose score exceeds a threshold of 50
cols=fscores[fscores['Score']>50]['Attribute']
print(cols)
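As a sketch of the same idea on synthetic data, the selector can also report which columns it kept through `get_support()`, and the p-values in `pvalues_` give a second view next to the raw F-scores. The feature names `signal`, `noise_a`, and `noise_b` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y_toy = rng.integers(0, 2, size=n)
# Hypothetical features: 'signal' depends on the label, the 'noise' columns do not.
X_toy = pd.DataFrame({
    "signal": y_toy * 2.0 + rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

selector = SelectKBest(score_func=f_classif, k=1).fit(X_toy, y_toy)

# Pair each column name with its F-score and p-value.
report = pd.DataFrame({
    "Attribute": X_toy.columns,
    "Score": selector.scores_,
    "p_value": selector.pvalues_,
}).sort_values("Score", ascending=False)
print(report)

# Names of the columns the selector kept.
selected = X_toy.columns[selector.get_support()].tolist()
print(selected)
```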

In [None]:
#Defining independent and dependent variables
X = data[cols]
y = data['stroke']

In [None]:
# StandardScaler transforms each feature so that it has mean 0 and standard deviation 1
sc = StandardScaler()

In [None]:
X = sc.fit_transform(X)

In [None]:
# Checking the counts of labels 0 and 1 in the dataset
print("Before OverSampling, counts of label '1': {}".format(sum(y==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y==0)))
# Creating new synthetic minority-class samples from the existing ones with SMOTE
sm = SMOTE(random_state=2)
X, y = sm.fit_resample(X, y)
# Checking the shapes of X and y after oversampling
print('After OverSampling, the shape of X: {}'.format(X.shape))
print('After OverSampling, the shape of y: {} \n'.format(y.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y==0)))
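Oversampling the full dataset before it is split means synthetic points derived from test rows can end up in the training data, which tends to inflate test scores. A common remedy is to split first and oversample only the training part. The sketch below illustrates the order of operations on made-up data. Plain resampling with replacement (`sklearn.utils.resample`) is used here in place of SMOTE to keep the example dependency-light, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Hypothetical imbalanced data: 190 negatives, 10 positives.
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([0] * 190 + [1] * 10)

# 1) Hold out the test set first, keeping the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# 2) Oversample the minority class in the training part only.
#    Random resampling with replacement stands in for SMOTE here.
X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_tr_bal = np.vstack([X_maj, X_min_up])
y_tr_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

print("train counts:", np.bincount(y_tr_bal))
print("test counts: ", np.bincount(y_te))
```

The test set keeps its original imbalance, so the scores measured on it reflect how the model would behave on unseen, real-world data.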

Modelling

In [None]:
#Splitting data in train and test 
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=50,test_size=0.20)

Model 1 - Logistic Regression

In [None]:
# Instantiating the logistic regression model
lr = LogisticRegression()
lr.fit(X_train, y_train)  # Fitting the model on the training data

In [None]:
# Predicting labels for the test data and storing them in y_pred
y_pred = lr.predict(X_test)

Evaluation of Model 1 - Logistic Regression

In [None]:
# Computing the confusion matrix, which directly shows the counts of true positives, false positives, true negatives, and false negatives.
cm = confusion_matrix(y_test, y_pred)
print(cm)
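For binary labels, scikit-learn lays out the confusion matrix as [[TN, FP], [FN, TP]]. The sketch below shows how precision, recall, and F1 follow from those four counts. The labels and predictions are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, not results from this notebook.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1])

# Unpack the four cells in scikit-learn's row-major order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that were real
recall = tp / (tp + fn)     # share of real positives that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For stroke prediction, a false negative (a missed stroke) is usually the costlier error, which makes recall on the positive class worth reading alongside accuracy.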

In [None]:
#Plotting the confusion matrix
from mlxtend.plotting import plot_confusion_matrix
fig, ax = plot_confusion_matrix(conf_mat=cm, figsize=(5, 5), cmap=plt.cm.Greens)
plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

In [None]:
# Computing the accuracy score from y_test and y_pred
# accuracy_score returns the fraction of samples whose predicted label exactly matches the actual label
logreg_acc = accuracy_score(y_test, y_pred)
logreg_acc

In [None]:
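# Printing the ROC AUC score
# ROC is a probability curve, and AUC measures how well the model separates the two classes
# AUC is computed from the predicted probability of the positive class rather than from hard labels
roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])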

In [None]:
#Classification report- Build a text report showing the main classification metrics.
print(metrics.classification_report(y_test, y_pred)) 

Model 2 - Decision Tree Classifier

In [None]:
# Instantiating the decision tree classifier
dt = DecisionTreeClassifier(criterion='entropy', random_state=42)
dt.fit(X_train, y_train)  # Fitting the model on the training data
dt_pred_train = dt.predict(X_train)

Evaluation of Model 2 - Decision Tree Classifier

In [None]:
# Printing the F1-score on the training data
print('Training Set Evaluation F1-Score=> ', f1_score(y_train, dt_pred_train))

In [None]:
dt_pred_test = dt.predict(X_test)
# Printing the F1-score on the test data
print('Testing Set Evaluation F1-Score=> ', f1_score(y_test, dt_pred_test))
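An unpruned decision tree usually scores close to perfectly on its own training data, so the gap between training and test F1 is a useful sign of overfitting. Cross-validation gives a steadier estimate than a single split. The sketch below uses synthetic data from `make_classification` as a stand-in; in the notebook, the real feature matrix and labels would be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the stroke dataset.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", random_state=42)

# Five-fold cross-validated F1: each fold is scored on rows the tree did not see.
cv_f1 = cross_val_score(tree, X_toy, y_toy, cv=5, scoring="f1")

# F1 on the same rows the tree was trained on, for comparison.
train_f1 = f1_score(y_toy, tree.fit(X_toy, y_toy).predict(X_toy))

print("cross-validated F1 per fold:", cv_f1.round(3))
print("mean cross-validated F1:", round(cv_f1.mean(), 3))
print("training F1:", round(train_f1, 3))
```

If the training score is much higher than the cross-validated mean, limiting `max_depth` or `min_samples_leaf` is a typical next step.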

Model 3 - Random Forest Classifier 

In [None]:
#Building Random Forest Classifier
rfc = RandomForestClassifier(criterion = 'entropy', random_state = 42)
rfc.fit(X_train, y_train)

Evaluation of Model 3 - Random Forest Classifier 

In [None]:
#Evaluating on Training set
rfc_pred_train = rfc.predict(X_train)
print('Training Set Evaluation F1_score=> ', f1_score(y_train, rfc_pred_train))

In [None]:
rfc_pred_test = rfc.predict(X_test)
# Printing the F1-score on the test data
print('Testing Set Evaluation F1-Score=> ', f1_score(y_test, rfc_pred_test))

Model 4 - Extreme Gradient Boosting 

In [None]:
#Building Xgboost Classifier

xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)

Evaluation of Model 4 - Extreme Gradient Boosting

In [None]:
#Evaluating on Training set
print('Training Score: {}'.format(xgb_model.score(X_train, y_train))) 

In [None]:
#Evaluating on test set
print('Test Score: {}'.format(xgb_model.score(X_test, y_test)))

In [None]:
y_pred = xgb_model.predict(X_test)

In [None]:
# Printing the accuracy score for XGBoost
xgb_acc = accuracy_score(y_test, y_pred)
xgb_acc

In [None]:
# Printing the ROC AUC score for XGBoost, computed from the predicted probability of the positive class
roc_auc_score(y_test, xgb_model.predict_proba(X_test)[:, 1])

In [None]:
#Classification report for xgboost
print(metrics.classification_report(y_test, y_pred))

Based on the models above, Model 4 (Extreme Gradient Boosting) has the highest accuracy on both the training and test data sets.
So, we can use this model for our predictions.
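The evaluations above report accuracy for some models and F1 for others, which makes a direct comparison harder. A minimal sketch of scoring every model with the same metrics on the same split is shown below. It runs on synthetic data and covers only the three scikit-learn models; an `XGBClassifier` could be added to the `models` dictionary in the same way. The names `models` and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the stroke dataset.
X_toy, y_toy = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=50)

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=42),
    "Random Forest": RandomForestClassifier(criterion="entropy", random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "roc_auc": roc_auc_score(y_te, proba),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```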