# STROKE RISK PREDICTION USING MACHINE LEARNING MODELS

Stroke is a serious medical condition that requires timely intervention to reduce its impact on health. Early prediction of stroke risk can play a vital role in preventive healthcare. This project aims to predict the likelihood of stroke based on various symptoms and risk factors, including chest pain, shortness of breath, dizziness, high blood pressure, and others. The dataset contains multiple features related to patient symptoms and medical history.

To build an accurate prediction model, several machine learning algorithms were employed, including Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Logistic Regression, Random Forest, and Decision Tree classifiers. These models were evaluated using key performance metrics to identify the best-performing model for stroke risk prediction.

Importing Dataset and Libraries

In [1]:
import pandas as pd
import warnings 
warnings.filterwarnings('ignore')
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
df=pd.read_csv(r"E:\data_analytics\ml_works\machine_learning_projects\stroke_risk_dataset.csv")
df.head()

Unnamed: 0,Chest Pain,Shortness of Breath,Irregular Heartbeat,Fatigue & Weakness,Dizziness,Swelling (Edema),Pain in Neck/Jaw/Shoulder/Back,Excessive Sweating,Persistent Cough,Nausea/Vomiting,High Blood Pressure,Chest Discomfort (Activity),Cold Hands/Feet,Snoring/Sleep Apnea,Anxiety/Feeling of Doom,Age,At Risk (Binary)
0,0,1,1,1,0,0,0,1,1,1,0,1,1,0,0,54,1
1,0,0,1,0,0,1,0,0,0,0,1,0,1,1,0,49,0
2,1,0,0,1,1,1,0,0,1,0,0,0,0,1,0,62,1
3,1,0,1,1,0,1,1,1,1,1,1,0,0,0,0,48,1
4,0,0,1,0,0,1,0,1,0,1,1,0,0,1,1,61,1


Information about Data

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Chest Pain                      70000 non-null  int64
 1   Shortness of Breath             70000 non-null  int64
 2   Irregular Heartbeat             70000 non-null  int64
 3   Fatigue & Weakness              70000 non-null  int64
 4   Dizziness                       70000 non-null  int64
 5   Swelling (Edema)                70000 non-null  int64
 6   Pain in Neck/Jaw/Shoulder/Back  70000 non-null  int64
 7   Excessive Sweating              70000 non-null  int64
 8   Persistent Cough                70000 non-null  int64
 9   Nausea/Vomiting                 70000 non-null  int64
 10  High Blood Pressure             70000 non-null  int64
 11  Chest Discomfort (Activity)     70000 non-null  int64
 12  Cold Hands/Feet                 70000 non-null  int64
 13  S

Finding and removing null values

In [3]:
#removing null values
df.isna().sum()

Chest Pain                        0
Shortness of Breath               0
Irregular Heartbeat               0
Fatigue & Weakness                0
Dizziness                         0
Swelling (Edema)                  0
Pain in Neck/Jaw/Shoulder/Back    0
Excessive Sweating                0
Persistent Cough                  0
Nausea/Vomiting                   0
High Blood Pressure               0
Chest Discomfort (Activity)       0
Cold Hands/Feet                   0
Snoring/Sleep Apnea               0
Anxiety/Feeling of Doom           0
Age                               0
At Risk (Binary)                  0
dtype: int64

Replacing Space in Column name

In [4]:
#replacing space
df.columns=df.columns.str.replace(" ","_")
print(df.columns)

Index(['Chest_Pain', 'Shortness_of_Breath', 'Irregular_Heartbeat',
       'Fatigue_&_Weakness', 'Dizziness', 'Swelling_(Edema)',
       'Pain_in_Neck/Jaw/Shoulder/Back', 'Excessive_Sweating',
       'Persistent_Cough', 'Nausea/Vomiting', 'High_Blood_Pressure',
       'Chest_Discomfort_(Activity)', 'Cold_Hands/Feet', 'Snoring/Sleep_Apnea',
       'Anxiety/Feeling_of_Doom', 'Age', 'At_Risk_(Binary)'],
      dtype='object')


Removing Duplicates

In [5]:
#remove duplicates
df.duplicated().sum()
df.drop_duplicates(inplace=True)

Independent and Dependent Variable

In [6]:
x=df.iloc[:,[0,1,2,3,6,7,10,11,13,15]]
x=pd.DataFrame(x)
print(x)
y=df['At_Risk_(Binary)'].values
y=pd.DataFrame(y)
print(y)

       Chest_Pain  Shortness_of_Breath  Irregular_Heartbeat  \
0               0                    1                    1   
1               0                    0                    1   
2               1                    0                    0   
3               1                    0                    1   
4               0                    0                    1   
...           ...                  ...                  ...   
69995           1                    0                    0   
69996           0                    0                    0   
69997           1                    1                    0   
69998           0                    1                    1   
69999           0                    1                    0   

       Fatigue_&_Weakness  Pain_in_Neck/Jaw/Shoulder/Back  Excessive_Sweating  \
0                       1                               0                   1   
1                       0                               0                   0   


Splitting Variable to Test and Train

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=42)

### Logistic Regression Model

In [7]:
from sklearn.linear_model import LogisticRegression

Model Fitting

In [None]:
model=LogisticRegression(max_iter=1000)
model.fit(x_test,y_test)

Model Prediction

In [9]:
y_pred=model.predict(x_test)
print(y_test)
print(y_pred)

       0
16355  0
52816  1
12074  1
11639  0
46141  0
...   ..
65630  1
33281  0
67846  1
37844  1
61251  1

[17245 rows x 1 columns]
[1 1 1 ... 1 1 1]


Model Evaluation

In [10]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.11974485358074805
mean squared error: 0.11974485358074805
root mean squared error: 0.34604169341388336


Prediction Score

In [11]:
score=metrics.accuracy_score(y_test,y_pred)
print(score*100,"%")

88.0255146419252 %


### SVM Model

In [12]:
from sklearn.svm import SVC

Model Fitting

In [13]:
classifier=SVC(kernel='linear',random_state=42)
classifier.fit(x_train,y_train)

Model Prediction

In [14]:
y_pred=classifier.predict(x_test)
print(y_pred)
print(y_test)

[1 1 1 ... 1 1 1]
       0
16355  0
52816  1
12074  1
11639  0
46141  0
...   ..
65630  1
33281  0
67846  1
37844  1
61251  1

[17245 rows x 1 columns]


Model Evaluation

In [15]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.11823717019425921
mean squared error: 0.11823717019425921
root mean squared error: 0.3438563220216537


Prediction Accuracy

In [16]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

88.17628298057409 %


### KNN Model

In [17]:
from sklearn.neighbors import KNeighborsClassifier

Model Fitting

In [18]:
classifier=KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
classifier.fit(x_train,y_train)

Model Prediction

In [19]:
y_pred=classifier.predict(x_test)
print(y_pred)
print(y_test)

[1 0 0 ... 0 1 1]
       0
16355  0
52816  1
12074  1
11639  0
46141  0
...   ..
65630  1
33281  0
67846  1
37844  1
61251  1

[17245 rows x 1 columns]


Model Evaluation

In [20]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.1494346187300667
mean squared error: 0.1494346187300667
root mean squared error: 0.3865677414504044


Prediction Accuracy

In [21]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

85.05653812699333 %


### Decision Tree Model

In [22]:
from sklearn.tree import DecisionTreeClassifier

Model Fitting

In [23]:
classifier=DecisionTreeClassifier(criterion='entropy',random_state=42)
classifier.fit(x_train,y_train)

Model prediction

In [24]:
y_pred=classifier.predict(x_test)
print(y_pred)
print(y_test)

[1 0 0 ... 0 1 0]
       0
16355  0
52816  1
12074  1
11639  0
46141  0
...   ..
65630  1
33281  0
67846  1
37844  1
61251  1

[17245 rows x 1 columns]


Model Evaluation

In [25]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.17019425920556683
mean squared error: 0.17019425920556683
root mean squared error: 0.4125460691917532


Prediction accuracy

In [26]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

82.98057407944331 %


### Random Forest Model

In [27]:
from sklearn.ensemble import RandomForestClassifier

Model Fitting

In [28]:
classifier=RandomForestClassifier(criterion='entropy',n_estimators=10,random_state=42)
classifier.fit(x_train,y_train)

Model Prediction

In [29]:
y_pred=classifier.predict(x_test)
print(y_test)
print(y_pred)

       0
16355  0
52816  1
12074  1
11639  0
46141  0
...   ..
65630  1
33281  0
67846  1
37844  1
61251  1

[17245 rows x 1 columns]
[1 0 0 ... 1 1 1]


Model Evaluation

In [30]:
print("mean absolute error:",metrics.mean_absolute_error(y_test,y_pred))
print("mean squared error:",metrics.mean_squared_error(y_test,y_pred))
print("root mean squared error:",np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

mean absolute error: 0.16027834154827486
mean squared error: 0.16027834154827486
root mean squared error: 0.4003477757503779


Prediction Accuracy

In [31]:
accuracy=metrics.accuracy_score(y_test,y_pred)
print(accuracy*100,"%")

83.97216584517251 %


Conclusion:
In this stroke risk prediction project, multiple machine learning models were employed, including SVM, KNN, Logistic Regression, Random Forest, and Decision Tree classifiers. Among all the models, the Support Vector Machine (SVM) demonstrated the highest accuracy of 88.176%, closely followed by the Logistic Regression model with an accuracy of 88.02%.

The SVM model's superior performance indicates its effectiveness in handling complex relationships within the dataset and making precise predictions. Logistic Regression also performed well, showing that the data is likely well-suited for linear classification methods.

Overall, the high accuracy of both models suggests a reliable approach for predicting stroke risks based on the provided symptoms and medical history. The developed model could potentially be integrated into healthcare systems to support early diagnosis and intervention for stroke prevention. Further improvements could be explored by fine-tuning hyperparameters or using advanced ensemble techniques to enhance the predictive power.