<a href="https://www.kaggle.com/code/shlesha/stroke-prediction-analysis-and-modelling?scriptVersionId=111380722" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import pandas as pd
import numpy as np    
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline  
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import plot_confusion_matrix,classification_report
from sklearn.metrics import plot_precision_recall_curve
from sklearn.metrics import f1_score, accuracy_score, matthews_corrcoef
from sklearn import metrics

In [None]:
df = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
df.head()

#### Let's see a concise summary of the dataframe.

In [None]:
df.info()

The 'id' column only holds person identifiers and carries no predictive information, so we drop it.

In [None]:
df.drop(columns='id', inplace=True)   # 'columns=' already implies axis=1

The column named 'bmi' has some missing values. Let us check it.

In [None]:
df['bmi'].isnull().sum()

The 'bmi' column has 201 missing values. We replace them with the mean of the column.

In [None]:
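bmi_mean = df['bmi'].mean()
# assign the result back: replace(..., inplace=True) on a column selection
# may not propagate to df in newer pandas versions
df['bmi'] = df['bmi'].fillna(bmi_mean)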
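
A minimal sketch of mean imputation on a toy frame may make the step easier to follow. The values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing BMI value (illustrative values only)
toy = pd.DataFrame({"bmi": [20.0, np.nan, 30.0]})

# Series.mean() skips NaN by default, so the mean is taken over 20 and 30
toy_mean = toy["bmi"].mean()

# Fill the gap and assign the column back to the frame
toy["bmi"] = toy["bmi"].fillna(toy_mean)
print(toy)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when about 4% of the values are filled in this way.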

#### Basic statistical summary of the data.

In [None]:
df.describe()

In [None]:
# number of entries for each categories in 'stroke' column
df['stroke'].value_counts()

Only 249 of the 5110 rows (about 5%) have 'stroke' = 1, so the dataset is heavily imbalanced.

## A- Visualising the data
### 1. Age and Gender 

In [None]:
fig, axs = plt.subplots(3,2,figsize=(8,10))
sns.set_theme(style='whitegrid')
sns.boxplot(x=df.hypertension, y=df.age,hue=df.gender, data = df, ax =axs[0,0])
sns.stripplot(x=df.hypertension, y=df.age,hue=df.gender, data = df, ax =axs[0,1])

sns.boxplot(x=df.heart_disease, y=df.age,hue=df.gender, data = df, ax = axs[1,0])
sns.stripplot(x=df.heart_disease, y=df.age,hue=df.gender, data = df, ax = axs[1,1])

sns.boxplot(x=df.stroke, y=df.age,hue=df.gender, data = df, ax = axs[2,0])
sns.stripplot(x=df.stroke, y=df.age,hue=df.gender, data = df, ax = axs[2,1])

We want to know the average age of patients with hypertension, heart disease and stroke, separately for males and females.

In [None]:
a =df.groupby(['hypertension', 'gender'])
print("Average 'age' for hypertension: ",'\n',a['age'].aggregate('mean'),'\n')

b =df.groupby(['heart_disease','gender'])
print("Average 'age' for Heart-disease : ",'\n' ,b['age'].aggregate('mean'),'\n')

c=df.groupby(['stroke','gender'])
print("Average 'age' for stroke : ",'\n' ,c['age'].aggregate('mean'))


**So from the above analysis we can say**

**1. The average age of hypertension patients is about 63 years for females and 61 years for males, so it is slightly higher for females.**

**2. The plot of heart disease versus age clearly shows that heart disease is very rare below the age of 40 and most common at around 70 years. The average age of heart-disease patients differs little between males and females.**

**3. The average age of stroke patients does not differ much between females (about 67 years) and males (about 69 years).**

**4. All three conditions are very rare below the age of 40 and common at age 60 and above.**

### 2. Hypertension 

In [None]:
plt.figure(figsize=(8,10))   # plt.plot() does not set the figure size
sns.set_theme(style='whitegrid')
sns.countplot(data = df,  x='stroke', hue='hypertension')

print('\n','Stroke vs Hypertension Frequency table : ','\n')
pd.crosstab(df['stroke'], 
                   df['hypertension'],  normalize='all', margins = True)

**About 90% of persons do not have hypertension, and about 87% of all persons have neither hypertension nor a stroke.**
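
How the frequency table is read depends on the `normalize` argument of `pd.crosstab`. The sketch below uses a made-up toy frame (not the stroke data) to contrast `normalize='all'`, as used above, with `normalize='columns'`.

```python
import pandas as pd

# Toy data (made up): 10 persons, 2 with hypertension, 1 stroke in each group
toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "stroke":       [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
})

# normalize='all': each cell is a share of ALL rows
share_all = pd.crosstab(toy["stroke"], toy["hypertension"], normalize="all")

# normalize='columns': each cell is a share within its hypertension group
share_col = pd.crosstab(toy["stroke"], toy["hypertension"], normalize="columns")

print(share_all)
print(share_col)
```

With `normalize='all'` the cells answer "what fraction of everyone falls in this cell", while `normalize='columns'` answers "what fraction of each hypertension group had a stroke", which is the more natural view for comparing risk between groups.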

### 3. Heart-disease

In [None]:
plt.figure(figsize=(8,10))   # plt.plot() does not set the figure size
sns.set_theme(style='whitegrid')
sns.countplot(data = df, x='stroke', hue='heart_disease')

print('\n','Stroke vs Heart-disease Frequency table: ', '\n')
pd.crosstab(df['stroke'], 
                  df['heart_disease'], normalize='all', margins = True)

**The data show that about 90% of all persons have neither heart disease nor a stroke.**


### 4. Marital-status

In [None]:
fig, axs = plt.subplots(3, figsize=(8,10))
sns.set_theme(style='whitegrid')
sns.countplot(data = df, x='ever_married', hue='hypertension', ax=axs[0])

sns.countplot(data = df, x='ever_married', hue='stroke', ax=axs[1])

sns.countplot(data = df, x='ever_married', hue='heart_disease', ax= axs[2])


In [None]:
pd.crosstab([df['hypertension'], df['heart_disease'],df['stroke']], 
            df['ever_married'])

**Those who were ever-married have more cases of hypertension, heart-disease and stroke than those who were not.**

### 5. Work-type
Let us see if there is any trend between the type of work a person does and disease.


In [None]:
fig, axs = plt.subplots(3, figsize=(8,10))
sns.set_theme(style='whitegrid')
sns.countplot(data = df, x='work_type', hue='hypertension', ax=axs[0])

sns.countplot(data = df, x='work_type', hue='stroke', ax=axs[1])

sns.countplot(data = df, x='work_type', hue='heart_disease', ax= axs[2])

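**In absolute counts, private-sector workers account for the most cases of each of these diseases, and people who never worked for the fewest. Because these are raw counts, the pattern also reflects how many people fall in each work type and not risk alone.**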

### 6. Residence-type
Let us now look for a trend between Residence-type and the diseases.

In [None]:
fig, axs = plt.subplots(3, figsize=(8,10))
sns.set_theme(style='whitegrid')
sns.countplot(data = df, x='Residence_type', hue='hypertension', ax=axs[0])

sns.countplot(data = df, x='Residence_type', hue='stroke', ax=axs[1])

sns.countplot(data = df, x='Residence_type', hue='heart_disease', ax= axs[2]) 

**The countplots for hypertension, heart-disease and stroke show no major difference between urban and rural residents. So in this data, residence location does not appear to play a significant role in stroke.**

### 7. Smoking_status

In [None]:
#     smoking_status    

fig, axs = plt.subplots(3, figsize=(8,10))
sns.set_theme(style='whitegrid')
sns.countplot(data = df, x='smoking_status', hue='hypertension', ax=axs[0])

sns.countplot(data = df, x='smoking_status', hue='stroke', ax=axs[1])

sns.countplot(data = df, x='smoking_status', hue='heart_disease', ax= axs[2]) 

In [None]:
pd.crosstab(df['smoking_status'], df['hypertension'], normalize = 'columns')

**Among all hypertension patients, the largest share (about 47%) comes from the 'never smoked' group, while 24% are from the 'formerly smoked' group and 18% from the 'smokes' group. For 10% of hypertension patients the smoking status is unknown.**

**Although the smoking status of 52 of these patients is unknown, the data still show that a large share (about 47%, i.e. 234 out of 498) of hypertension patients never smoked.**

### 8. Body-Mass-Index (bmi)
Let us find out if there is any relation between these diseases and bmi.

In [None]:
sns.catplot(x="hypertension", y="bmi",
                hue="heart_disease", col="stroke",
                data=df, kind="bar",
                height=5);

### 9. Average Glucose-level

In [None]:
sns.catplot(x="hypertension", y="avg_glucose_level",
                hue="heart_disease", col="stroke",
                data=df, kind="bar",
                height=5);

In [None]:
b =df.groupby(['hypertension', 'gender'])
print("Average 'avg_glucose_level' for hypertension: ",'\n'*2 ,b['avg_glucose_level'].aggregate('mean'),'\n')

c =df.groupby(['heart_disease','gender'])
print("Average 'avg_glucose_level' for Heart-disease : ",'\n'*2 ,c['avg_glucose_level'].aggregate('mean'),'\n')

a=df.groupby(['stroke','gender'])
print("Average 'avg_glucose_level' for stroke : ",'\n' *2 ,a['avg_glucose_level'].aggregate('mean'))

**Mean glucose level among hypertension patients is slightly higher for males (132) than for females (128).**

**Mean glucose level among heart-disease patients is higher for females (about 143) than for males (about 132).**

**Among stroke patients, the mean glucose level is about 124 for females and 143 for males, so females show a lower mean glucose level than males.**

Overall, we can say that high glucose level is common among all these patients.

## B- Preprocessing
Let us find all the columns that contain string-type categorical data and,
using the Scikit-learn library, convert them to numerical type.

In [None]:
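cat_col = df.select_dtypes(['object']).columns
print(cat_col)
label_encode = LabelEncoder()    # initializing an object of class LabelEncoder
# encode only the string-typed categorical columns; looping over df.columns
# would also replace numeric values (age, bmi, glucose) with rank codes
for i in cat_col:
    df[i] = label_encode.fit_transform(df[i])
df.head()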
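
To illustrate what label encoding does, here is a small sketch on a made-up frame. It selects the non-numeric columns with `select_dtypes(exclude="number")`, which is an assumption of this example and slightly different from the `['object']` selection above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0],
})

# Encode only the non-numeric columns; numeric columns stay untouched
for col in toy.select_dtypes(exclude="number").columns:
    toy[col] = LabelEncoder().fit_transform(toy[col])

print(toy)
```

LabelEncoder assigns integer codes in sorted order of the class labels, so 'Female' becomes 0 and 'Male' becomes 1 here. The codes imply an ordering that has no meaning for nominal categories, which matters more for distance-based models such as KNN than for tree-based models.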

Let us look at the correlation table and heat-map.

In [None]:
df.corr()

In [None]:
plt.figure(figsize=(15,15))   # plt.plot() does not set the figure size
sns.heatmap(df.corr())

###  Feature-selection
**Feature selection techniques permit researchers to choose measures that are maximally predictive of relevant outcomes, even when there are interactions or nonlinearities. These techniques facilitate decisions about which measures may be dropped from a study while maintaining efficiency of prediction.**

**Let us choose best features, using SelectKBest method, which is a univariate feature selection method, using f_classif as scoring function. For classification, scoring functions based on F-test estimate the degree of linear dependency between two random variables.**
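
As a rough illustration of how `f_classif` scores behave, the sketch below builds two synthetic features (made up for this example): one whose mean shifts with the class and one that is pure noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y_demo = np.array([0] * 50 + [1] * 50)

# Feature 0 shifts with the class; feature 1 is pure noise (synthetic data)
informative = rng.normal(loc=y_demo * 2.0, scale=1.0)
noise = rng.normal(size=100)
X_demo = np.column_stack([informative, noise])

selector = SelectKBest(score_func=f_classif, k=1).fit(X_demo, y_demo)
print(selector.scores_)        # one ANOVA F-statistic per feature
print(selector.get_support())  # boolean mask of the kept feature
```

The informative feature receives a much larger F-statistic than the noise feature, which is the same ordering principle used to rank the stroke features.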

In [None]:
# target column, excluded from the feature set
drop_col = ['stroke']

# dataset containing the feature columns
x_feat = df.drop(columns=drop_col)     # independent variables (predictors)
y_tar = df['stroke']      # target column

best_feat = SelectKBest(score_func = f_classif, k='all')
features_fitted = best_feat.fit(x_feat, y_tar)
df_scores = pd.DataFrame(features_fitted.scores_)
df_columns = pd.DataFrame(x_feat.columns)


In [None]:
# concatenate dataframes
feature_scores = pd.concat([df_columns , df_scores], axis =1)
feature_scores.columns = ['Features','Score']

# sorting the feature dataframe based on score values
# (sort_values returns a new dataframe, so assign it back)
feature_scores = feature_scores.sort_values(by='Score', ascending=False)
feature_scores

Selecting features with scores greater than 60

In [None]:
cols = feature_scores[feature_scores['Score']>60]['Features']
print('Chosen features : ','\n', cols)

### Normalizing the dataset.

In [None]:
x = df[cols].values
y = df['stroke'].values
# StandardScaler is already imported at the top of the notebook
x = StandardScaler().fit_transform(x)

### Splitting the dataset into training and testing sets.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y,random_state = 50, test_size =0.25)
x_train.shape,x_test.shape,y_train.shape, y_test.shape
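
The notebook scales the full dataset before splitting, so the scaler also sees the test rows. A common alternative, sketched below on synthetic data (not the stroke data), is to split first with `stratify` and let a pipeline fit the scaler on the training rows only. All names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the stroke data (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.95], random_state=0
)

# stratify keeps the class ratio (nearly) the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=50
)

# The pipeline fits the scaler on the training rows only
model_demo = make_pipeline(StandardScaler(), LogisticRegression())
model_demo.fit(X_tr, y_tr)
print(X_tr.shape, X_te.shape, model_demo.score(X_te, y_te))
```

With a rare positive class, stratifying avoids test sets that happen to contain very few stroke cases.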

### **Modelling**

### Let us now implement Logistic regression algorithm.

Let's build our model using LogisticRegression from the Scikit-learn package. This class implements logistic regression and can use different numerical optimizers to find the parameters, including the ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’ and ‘saga’ solvers. The C parameter is the inverse of the regularization strength and must be a positive float; smaller values specify stronger regularization.
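
The effect of C can be seen in a small sketch on synthetic data (generated here only for illustration): the smaller C is, the more the coefficients are shrunk towards zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    model_demo = LogisticRegression(C=C, solver="liblinear").fit(X_demo, y_demo)
    # Euclidean norm of the fitted coefficient vector
    norms[C] = float(np.linalg.norm(model_demo.coef_))
    print(f"C={C}: coefficient norm = {norms[C]:.3f}")
```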



In [None]:
#from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(C=0.1, solver='liblinear').fit(x_train, y_train)

yp_lr = LR.predict(x_test)

#f1_score:
yf = f1_score(y_test,yp_lr )

ya = metrics.accuracy_score(y_test,yp_lr)     

print('F1 Score for Logistic regression model : %.2f'%yf)
print('Accuracy : %.2f'%ya)

plot_confusion_matrix(LR, x_test, y_test,cmap='YlGn')
plt.title('Confusion matrix for Logistic regression model')

plt.show()  

**As the confusion matrix shows, the Logistic regression classifier gives zero true-positives, so its recall for the stroke class is zero and its precision is zero or undefined.**

**The resulting F1-score of zero shows that the model fails to detect any stroke case.**
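
A small worked example, with made-up labels, shows how a classifier that predicts only the majority class can reach a high accuracy while its recall and F1-score are zero.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 2 positives among 10 samples, classifier predicts all negative
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

# zero_division=0 reports the undefined 0/0 precision as 0 without a warning
p = precision_score(y_true, y_pred, zero_division=0)
r = recall_score(y_true, y_pred)
f = f1_score(y_true, y_pred, zero_division=0)
accuracy = sum(t == q for t, q in zip(y_true, y_pred)) / len(y_true)
print(p, r, f, accuracy)
```

Here the accuracy is 0.8 even though no positive case is found, which is why accuracy alone is a weak guide on imbalanced data.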

Now let us try three more algorithms, namely K-Nearest Neighbor, Decision Tree and Support Vector Machine. Then we will compare their results using different evaluation metrics.

In [None]:
# declaring three list variables to store various evaluation metrics
acc_val = []
f1_val = []
macc_val = []

### Implementing KNearest neighbour

In [None]:
# Selecting best K-value(number of neighbours) based upon the accuracy-score
ks = 10
mean_acc = np.zeros((ks-1))
std_acc = np.zeros((ks-1))
for n in range (1,ks):
    neg = KNeighborsClassifier(n_neighbors=n).fit(x_train, y_train)
    ypred_knn = neg.predict(x_test)
    mean_acc[n-1]= metrics.accuracy_score(y_test, ypred_knn)
    std_acc[n-1]= np.std(ypred_knn==y_test)/np.sqrt(ypred_knn.shape[0])
# argmax() returns a 0-based index, while k starts at 1
print(' Best accuracy of %.3f '%mean_acc.max(), 'was with k = ', mean_acc.argmax() + 1)

In [None]:
plt.figure(figsize=(10,5))
plt.plot(range(1,ks),mean_acc,'g')
plt.fill_between(range(1,ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1,ks),mean_acc - 3 * std_acc,mean_acc + 3 * std_acc, alpha=0.10,color="green")
plt.legend(('Accuracy ', '+/- 1xstd','+/- 3xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

**The plot above depicts the changing values of accuracy with K-values.**
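
Choosing K by test-set accuracy, as above, tends to give an optimistic estimate. One alternative, shown here only as a sketch on synthetic data, is to pick K by cross-validation on the training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=600, n_features=5, random_state=0)

# Mean 5-fold cross-validated accuracy for each K from 1 to 9
cv_acc = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5).mean()
    for k in range(1, 10)
}
best_k = max(cv_acc, key=cv_acc.get)
print(best_k, round(cv_acc[best_k], 3))
```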

In [None]:
#Finding Training and testing set accuracy for best k
knn = KNeighborsClassifier(n_neighbors=7).fit(x_train, y_train)

train_scores= metrics.accuracy_score(y_train, knn.predict(x_train))
test_scores= metrics.accuracy_score(y_test, knn.predict(x_test))
print(' Train-set accuracy with k=7 is: %.3f'%train_scores)
print(' Test-set accuracy with k=7 is: %.3f'%test_scores)
# use the k=7 model; ypred_knn still holds predictions from the last loop iteration (k=9)
k_acc = test_scores
acc_val.append(k_acc)
#print(' Accuracy-score : %.3f'%k_acc)

### Implementing Decision tree

In [None]:
d_tree = DecisionTreeClassifier(criterion='entropy')
d_tree.fit(x_train, y_train)
ypred_tree = d_tree.predict(x_test)
#Evaluating the model
d_acc =  metrics.accuracy_score(y_test, ypred_tree)
acc_val.append(d_acc)

### Implementing Support Vector Machine

**SVM algorithms use a set of mathematical functions known as kernels. A kernel takes the data as input and transforms it into the required form. Different SVM variants use different kernel functions, for example linear, polynomial, radial basis function (RBF), and sigmoid.**
**The RBF kernel is the most commonly used, because its response is localized: it decays smoothly towards zero as the distance between two points grows.**
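
The RBF kernel between two points is $K(a, b) = \exp(-\gamma \lVert a - b \rVert^2)$. The sketch below checks this formula against scikit-learn's `rbf_kernel`; the points and the value of `gamma` are chosen only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])
gamma = 0.5  # illustrative value; SVC's default is gamma='scale'

# K(a, b) = exp(-gamma * ||a - b||^2); here ||a - b||^2 = 2
manual = np.exp(-gamma * np.sum((a - b) ** 2))
library = rbf_kernel(a, b, gamma=gamma)[0, 0]
print(manual, library)
```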

In [None]:
clf = svm.SVC(kernel='rbf')
sv = clf.fit(x_train, y_train)
ypred_svm = clf.predict(x_test)
s_acc = accuracy_score(y_test, ypred_svm)
acc_val.append(s_acc)

## **Evaluating the model**

### Accuracy-score
**Let us compare the accuracy of the three algorithms for the given dataset.**

In [None]:
print('KNN model accuracy : %.3f' %k_acc)
print('Decision tree model accuracy : %.3f'%d_acc)
print('SVM model accuracy : %.3f' %s_acc)

**So the accuracy scores for all of the chosen models are equal.**

### Computing the Confusion Matrix

In [None]:
fig, axs = plt.subplots(nrows=3, figsize=(10,15), constrained_layout = True)
plt.rcParams['font.size'] = '16'
plot_confusion_matrix(d_tree, x_test, y_test,ax=axs[0],cmap='YlGn')
axs[0].set_title('Confusion matrix for Decision tree model')

plot_confusion_matrix(sv, x_test, y_test,ax = axs[1],cmap='YlGn')
axs[1].set_title('Confusion matrix for SVM model')

plot_confusion_matrix(knn, x_test, y_test, ax = axs[2],cmap='YlGn')
axs[2].set_title('Confusion matrix for KNN model')
plt.show()  

#### As seen from the confusion matrices,
**1. The support vector machine model gives the highest number of True-Negatives.**

**2. The decision tree model gives the highest number of True-Positives, followed by the KNN model, while the SVM model fails to give any True-Positives.**

Here we are trying to build a model to predict whether a person has a stroke or not, so we want to capture as many positives as possible. The decision tree classifier performs best among the three models, capturing the highest number of True-Positives (as desired).

### Classification Report

In [None]:
np.set_printoptions(precision=2)
t_n = ['Stroke = 0', 'Stroke = 1']
print('\n',"Classification report for KNN classifier")
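# use the k=7 model; ypred_knn still holds predictions from the last loop iteration (k=9)
print(classification_report(y_test, knn.predict(x_test), labels=[0,1], target_names=t_n))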
print('\n',"Classification report for Decision tree classifier")
print(classification_report(y_test, ypred_tree, labels=[0,1], target_names=t_n))
print('\n',"Classification report for SVM classifier")
print(classification_report(y_test, ypred_svm, labels=[0,1],target_names=t_n))

**The average Precision and Recall are highest for the Decision tree.**

**The F1 score**

In [None]:
print('F1-score for the KNN model is : %.3f '  %f1_score(y_test, knn.predict(x_test)))
print('F1-score for the decision tree model is : %.3f'%f1_score(y_test, ypred_tree))
print('F1-score for the support vector machine model : %.3f' %f1_score(y_test,ypred_svm))
f1_val.append(f1_score(y_test, knn.predict(x_test)))
f1_val.append(f1_score(y_test, ypred_tree))
f1_val.append(f1_score(y_test,ypred_svm))

**The F1-score for decision tree classifier is higher than the other two classifiers.**

### Matthews correlation coefficient
#### The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classifications. The MCC is in essence a correlation coefficient with a value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 a prediction no better than random, and -1 an inverse prediction. The statistic is also known as the phi coefficient.
#### The Matthews correlation coefficient (MCC) is a more reliable statistic, which produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset.
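
In terms of the four confusion matrix counts, the coefficient is

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

The sketch below computes this by hand on made-up labels and compares it with `matthews_corrcoef`.

```python
import math
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Made-up labels for illustration
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

mcc_manual = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
mcc_sklearn = matthews_corrcoef(y_true, y_pred)
print(round(mcc_manual, 3), round(mcc_sklearn, 3))
```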

In [None]:
print("The Matthews correlation coefficient for Decision tree model : %.3f" %matthews_corrcoef(y_test, ypred_tree))
print("The Matthews correlation coefficient for K nearest neighbour model : %.3f" %matthews_corrcoef(y_test, knn.predict(x_test)))
print("The Matthews correlation coefficient for SVM model : %.3f" %matthews_corrcoef(y_test, ypred_svm))
macc_val.append(matthews_corrcoef(y_test, knn.predict(x_test)))
macc_val.append(matthews_corrcoef(y_test, ypred_tree))
macc_val.append(matthews_corrcoef(y_test, ypred_svm))

#### The Decision Tree model has the highest Matthews correlation coefficient among the chosen models.

In [None]:
eval_mat = pd.DataFrame(list(zip(acc_val, f1_val, macc_val)), 
                        columns=['Accuracy','F1_score','Matthews_cor_coef'],
                       index = ['KNN', 'Decision_tree', 'SVM'])
eval_mat

Analyzing the various evaluation metrics for our models, we clearly see that the Decision tree classifier performs best among the three classifiers.

### Precision-Recall curves for the three classifiers

In [None]:
fig, axs = plt.subplots(figsize=(8,8))
plot_precision_recall_curve(knn, x_test, y_test, name = 'K-Nearest Neighbour',ax=axs)
plot_precision_recall_curve(d_tree, x_test, y_test, name = 'Decision tree',ax=axs)
plot_precision_recall_curve(clf, x_test, y_test, name = 'Support Vector Machine', ax=axs)
plt.legend(loc=(0, -.30), prop=dict(size=14))
plt.title('Precision-Recall curve for different models')
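
The area under a precision-recall curve can also be summarized as a single number with `average_precision_score`. The sketch below does this for two hypothetical models fitted on synthetic, imbalanced data; it does not reproduce the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (about 10% positives), for illustration only
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.9], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

models_demo = [
    ("knn", KNeighborsClassifier(n_neighbors=7)),
    ("tree", DecisionTreeClassifier(random_state=0)),
]
ap = {}
for name, est in models_demo:
    # probability of the positive class, used as the ranking score
    scores = est.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    ap[name] = average_precision_score(y_te, scores)
print(ap)
```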

### Conclusion
The precision-recall curve for the decision tree has a higher area under the curve, confirming that it is better than the other two models. This agrees with our conclusion that the decision tree classifier is the best of the three when compared to K-nearest neighbour and Support Vector Machine.
