# **Stroke Risk Prediction Model Development with Data Imbalance Handling**

# **Introduction**
The aim of this project is to develop a machine learning model for predicting the risk of stroke. The dataset used consists of multiple health-related attributes, such as age, hypertension, heart disease, and other medical features. The model's goal is to predict whether a person has a high risk of stroke, based on these features.

The workflow covers data preprocessing, model training, and evaluation. We train a Random Forest classifier and evaluate it with accuracy, F1 score, the confusion matrix, and the ROC curve with its AUC score. Because stroke cases are rare in the dataset, we also compare over-sampling (SMOTE), under-sampling, and a combination of both.

# **EDA**

In [None]:
import pandas as pd
data = pd.read_csv("/kaggle/input/cerebral-stroke-predictionimbalaced-dataset/dataset.csv")
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.isnull().sum()

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
data['stroke'].value_counts()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x='stroke', data=data)
plt.title("Class Distribution of the Stroke Label")
plt.show()
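
The count plot shows the imbalance visually; it can also be expressed as a ratio. The following is a minimal sketch on a small hypothetical label series (the 95/5 split is an assumption for illustration, not the dataset's actual distribution).

```python
import pandas as pd

# Hypothetical labels standing in for data['stroke'] (0 = no stroke, 1 = stroke).
labels = pd.Series([0] * 95 + [1] * 5, name="stroke")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction per class
imbalance_ratio = counts.max() / counts.min()  # majority : minority

print(counts.to_dict())
print(shares.round(3).to_dict())
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
```

Running the same three lines on `data['stroke']` would give the ratio for the real dataset.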

## **Label Encoding**

In [None]:
data.columns

In [None]:
from sklearn.preprocessing import LabelEncoder

# Categorical columns to convert to integer codes
categorical_cols = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

for col in categorical_cols:
    # Fit a fresh encoder per column so each column gets its own mapping
    data[col] = LabelEncoder().fit_transform(data[col])

# Check the data after Label Encoding
print(data.head())
data.info()  # info() prints directly and returns None, so no print() is needed

Label encoding has converted the categorical columns to integer codes. The dataset contains 43,400 rows. Only the bmi column still has missing values; the other columns are complete. All columns now have the dtype int64 or float64.
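
Label encoding assigns integer codes in sorted order, which implies an ordering among categories that may not exist. One-hot encoding avoids this by creating one indicator column per category. The sketch below contrasts the two on a hypothetical toy column; `pd.get_dummies` is not used in this notebook and is shown only for comparison.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column; the category names are illustrative only.
toy = pd.DataFrame({"work_type": ["Private", "Govt_job", "Private", "children"]})

# Label encoding: one integer per category, assigned in sorted order.
encoder = LabelEncoder()
label_encoded = encoder.fit_transform(toy["work_type"])
mapping = {cls: int(code) for code, cls in enumerate(encoder.classes_)}
print(mapping)

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy["work_type"], prefix="work_type")
print(one_hot.columns.tolist())
```

Tree-based models such as Random Forest are usually tolerant of label-encoded categories, which may be why the notebook uses the more compact encoding.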

## **Handling Missing Values**

In [None]:
# Count the missing values
print("Missing values:\n", data.isnull().sum())

In [None]:
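from sklearn.impute import KNNImputer
import numpy as np

# Fill each missing value from the 5 nearest rows (the default n_neighbors)
imputer = KNNImputer(missing_values=np.nan)
tab = imputer.fit_transform(data)

# fit_transform returns a float array, so rebuild the DataFrame with the original column names
data = pd.DataFrame(tab, columns=data.columns)
# Restore the integer dtype of the target, which the imputer cast to float
data['stroke'] = data['stroke'].astype(int)
print(data.isnull().sum())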

In [None]:
data.head(10)
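
With its default uniform weights, `KNNImputer` fills a missing value with the mean of that feature over the nearest rows that have it. The sketch below illustrates this on a hypothetical toy frame and, as one possible variation, imputes only the feature columns so the target is left untouched. The values and `n_neighbors=2` are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame; values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "age": [30.0, 32.0, 70.0, 72.0],
    "bmi": [22.0, np.nan, 30.0, 32.0],
    "stroke": [0, 0, 1, 1],
})

feature_cols = ["age", "bmi"]  # impute features only, keep the target untouched
imputer = KNNImputer(n_neighbors=2)
toy[feature_cols] = imputer.fit_transform(toy[feature_cols])
print(toy)
```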

# **Model Evaluation Without Resampling**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, f1_score, roc_curve
import matplotlib.pyplot as plt

X = data.drop('stroke', axis=1)
y = data['stroke']

# Stratify so the rare stroke class keeps the same share in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Predicted probability of the positive (stroke) class, used for ROC-AUC and the ROC curve
y_pred_proba = model.predict_proba(X_test)[:, 1]

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# zero_division=0 reports precision as 0 (not 1) when no stroke case is predicted
print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=0))

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

# F1 Score
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.4f}")

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, color='blue', lw=2, label=f"ROC Curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
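
On heavily imbalanced labels, accuracy can look high even when no stroke case is detected, which is why the F1 score and ROC-AUC are reported alongside it. The sketch below shows this with hypothetical labels and a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical test labels: 98 negatives and 2 positives.
y_true = np.array([0] * 98 + [1] * 2)

# A "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
rec = recall_score(y_true, y_majority, zero_division=0)
f1 = f1_score(y_true, y_majority, zero_division=0)
print(f"accuracy={acc:.2f}, recall={rec:.2f}, f1={f1:.2f}")
```

Any model trained on the imbalanced data should at least beat this baseline on recall and F1, not just on accuracy.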

# **OverSampling (SMOTE)**

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, f1_score

X = data.drop('stroke', axis=1)
y = data['stroke']

# Over-sample the minority class until both classes have the same size
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Note: the split happens after resampling, so the test set also contains synthetic samples
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# ROC-AUC, computed from predicted probabilities rather than hard labels
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("ROC-AUC:", roc_auc)

# F1 Score
f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)
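
The cell above resamples the full dataset and then splits it, so the test set is balanced and includes synthetic points. A common alternative is to split first and resample only the training part, leaving the test set at its original class ratio. The sketch below illustrates that order on synthetic data from `make_classification`. It uses random duplication via `sklearn.utils.resample` as a simple stand-in for SMOTE, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (about 5% positives); a stand-in for the stroke data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.95, 0.05], random_state=42
)

# 1) Split first, stratified so both parts keep the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# 2) Over-sample the minority class in the training part only.
#    Random duplication is used here as a simple stand-in for SMOTE.
minority_mask = y_tr == 1
n_majority = int((~minority_mask).sum())
X_min_up, y_min_up = resample(
    X_tr[minority_mask], y_tr[minority_mask],
    replace=True, n_samples=n_majority, random_state=42,
)
X_tr_bal = np.vstack([X_tr[~minority_mask], X_min_up])
y_tr_bal = np.concatenate([y_tr[~minority_mask], y_min_up])

print("train class counts:", np.bincount(y_tr_bal))
print("test class counts: ", np.bincount(y_te))
```

With imbalanced-learn, the same order would mean calling `fit_resample` on the training split only.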

# **UnderSampling**

In [None]:
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows until both classes have the same size
under_sampler = RandomUnderSampler(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = under_sampler.fit_resample(X, y)

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# ROC-AUC, computed from predicted probabilities rather than hard labels
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("ROC-AUC:", roc_auc)

# F1 Score
f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)

# **Combining OverSampling and UnderSampling**

In [None]:
# Step 1: over-sample the minority class with SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled_smote, y_resampled_smote = smote.fit_resample(X, y)

# Step 2: under-sample the majority class
# Note: with sampling_strategy='auto' in both steps, SMOTE already balances the classes,
# so this under-sampler has nothing left to remove
under_sampler = RandomUnderSampler(sampling_strategy='auto', random_state=42)
X_resampled_combined, y_resampled_combined = under_sampler.fit_resample(X_resampled_smote, y_resampled_smote)

X_train, X_test, y_train, y_test = train_test_split(X_resampled_combined, y_resampled_combined, test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# ROC-AUC, computed from predicted probabilities rather than hard labels
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("ROC-AUC:", roc_auc)

# F1 Score
f1 = f1_score(y_test, y_pred)
print("F1 Score:", f1)
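
When both samplers use `sampling_strategy='auto'`, SMOTE raises the minority class to the majority count and the under-sampler then has nothing to remove. Combined schemes therefore usually pass a float to each sampler; in imbalanced-learn, for binary problems, that float is the desired minority-to-majority ratio after resampling. The sketch below only computes the resulting class counts for hypothetical starting counts and does not call imbalanced-learn; `combined_counts` is an illustrative helper, not a library function.

```python
def combined_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Class counts after over-sampling, then under-sampling (binary case).

    over_ratio:  target minority/majority ratio after over-sampling.
    under_ratio: target minority/majority ratio after under-sampling.
    """
    # Over-sampling grows the minority class and leaves the majority alone.
    minority_after = max(n_minority, int(over_ratio * n_majority))
    # Under-sampling shrinks the majority class and leaves the minority alone.
    majority_after = min(n_majority, int(minority_after / under_ratio))
    return majority_after, minority_after

# Hypothetical counts, chosen only to illustrate the arithmetic.
# Equivalent to 'auto' in both steps: the under-sampling step changes nothing.
print(combined_counts(10_000, 200, over_ratio=1.0, under_ratio=1.0))
# Partial over-sampling followed by partial under-sampling.
print(combined_counts(10_000, 200, over_ratio=0.25, under_ratio=0.5))
```

In the second call, the minority class grows to a quarter of the majority and the majority is then cut to twice the minority, so fewer synthetic rows are needed than with full SMOTE balancing.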

# **GridSearchCV**

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [10, 20, 30]}

rf_model = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_

In [None]:
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# F1 Score
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1}")

# ROC-AUC Score
auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score: {auc}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, color='blue', lw=2, label=f"ROC Curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
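
The confusion matrix printed above can be unpacked into its four cells to read off false positives and false negatives directly. In scikit-learn, rows are true classes and columns are predicted classes, so for a binary problem `ravel()` yields TN, FP, FN, TP in that order. The sketch below uses hypothetical labels and predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```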

# **Conclusion**
The stroke risk prediction model achieved 99.12% accuracy, an F1 score of 0.9911, and a ROC-AUC of 0.9985 on the test split. Only 5 false positives indicate high precision, while 219 false negatives show that recall is the weaker of the two and the main target for improvement, since each false negative is a missed stroke case.

The ROC curve is consistent with the high AUC. Note that the data was resampled before the train/test split, so the test split is balanced and contains SMOTE-generated samples; performance on the original imbalanced data may be lower.

## **Future Recommendations**
* Reduce false negatives by adjusting the classification threshold.
* Validate on real-world data to ensure generalizability.
* Explore other models like XGBoost or Neural Networks for potential improvements.
* Enhance feature selection with additional medical or lifestyle factors.
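
The first recommendation, adjusting the classification threshold, trades additional false positives for fewer false negatives. The sketch below illustrates the effect with hypothetical labels and probabilities; `errors_at` is an illustrative helper, and the thresholds 0.5 and 0.3 are arbitrary choices.

```python
import numpy as np

# Hypothetical true labels and predicted stroke probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.20, 0.35, 0.60, 0.30, 0.45, 0.70, 0.90])

def errors_at(threshold):
    """Return (false_negatives, false_positives) at a given threshold."""
    y_hat = (proba >= threshold).astype(int)
    fn = int(((y_true == 1) & (y_hat == 0)).sum())
    fp = int(((y_true == 0) & (y_hat == 1)).sum())
    return fn, fp

for t in (0.5, 0.3):
    fn, fp = errors_at(t)
    print(f"threshold={t}: FN={fn}, FP={fp}")
```

With the tuned model, `proba` would be `best_model.predict_proba(X_test)[:, 1]`, and the threshold should be chosen on a validation split rather than on the test set.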