Data Preprocessing:

- Drop the first column, which is just an index.
- Handle any missing values.
- Convert categorical variables using Label Encoding.
- Split the data into features (X) and target (y).
- Split these into training and testing sets.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Load the data
file_path = 'stroke_data_cleaned.csv'
stroke_data = pd.read_csv(file_path)

# Quick look at the dataset
stroke_data.head()


Unnamed: 0.1,Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke,bmi_imp,smoking_status_imp,rounded_age
0,0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1,36.6,formerly smoked,67
1,1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1,28.1,never smoked,61
2,2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1,32.5,never smoked,80
3,3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1,34.4,smokes,49
4,4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1,24.0,never smoked,79


- The index column was removed.
- Missing values in 'bmi' were filled with the imputed values.
- Categorical variables were encoded.
- The dataset was split into features and the target variable, and then into training and testing sets.
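
The encoding step can be sketched with one `LabelEncoder` per column, which keeps each column's label-to-integer mapping available for later inspection. This is only a minimal sketch on a toy frame built from a few values in the preview above; the `toy` frame and the `encoders` dictionary are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with a few values taken from the dataset preview (illustrative only)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "work_type": ["Private", "Self-employed", "Private", "Private"],
})

# One encoder per column, so every mapping can be inspected or inverted later
encoders = {}
for column in ["gender", "work_type"]:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# LabelEncoder assigns integer codes in sorted label order
gender_mapping = {label: code for code, label in enumerate(encoders["gender"].classes_)}
print(gender_mapping)

# The stored encoder can turn the codes back into the original labels
print(encoders["work_type"].inverse_transform(toy["work_type"]))
```

The `work_type` codes in this toy frame differ from the notebook output, because the full column contains more categories than the two used here.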

Now let's proceed with training the Decision Tree and Random Forest classifiers. After training, we will evaluate them using accuracy and a classification report, which includes precision, recall, and F1-score for both models. Let's start with the Decision Tree.

In [2]:
# Drop the 'Unnamed: 0' column as it's just an index
stroke_data.drop(columns=['Unnamed: 0'], inplace=True)

# Count missing values per column (computed before the 'bmi' fill below, so 'bmi' still shows its gaps)
missing_values = stroke_data.isnull().sum()

# Encode categorical variables
label_encoder = LabelEncoder()
categorical_columns = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

for column in categorical_columns:
    stroke_data[column] = label_encoder.fit_transform(stroke_data[column])

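# Fill missing values in 'bmi' with the imputed values in 'bmi_imp'
# (assign the result back; inplace fillna on a column selection triggers warnings in newer pandas)
stroke_data['bmi'] = stroke_data['bmi'].fillna(stroke_data['bmi_imp'])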

# Separate the dataset into X (features) and y (target)
X = stroke_data.drop(['stroke'], axis=1)  # features
y = stroke_data['stroke']  # target

# Split the data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Display the processed data and the missing values (if any)
processed_data_head = stroke_data.head()
missing_values, processed_data_head


(gender                  0
 age                     0
 hypertension            0
 heart_disease           0
 ever_married            0
 work_type               0
 Residence_type          0
 avg_glucose_level       0
 bmi                   201
 smoking_status          0
 stroke                  0
 bmi_imp                 0
 smoking_status_imp      0
 rounded_age             0
 dtype: int64,
    gender   age  hypertension  heart_disease  ever_married  work_type  \
 0       1  67.0             0              1             1          2   
 1       0  61.0             0              0             1          3   
 2       1  80.0             0              1             1          2   
 3       0  49.0             0              0             1          2   
 4       0  79.0             1              0             1          3   
 
    Residence_type  avg_glucose_level   bmi  smoking_status  stroke  bmi_imp  \
 0               1             228.69  36.6               1       1     36.6   
 

In [4]:
# Encode the 'smoking_status_imp' column, which still holds strings
stroke_data['smoking_status_imp'] = label_encoder.fit_transform(stroke_data['smoking_status_imp'])

# Drop the original 'smoking_status' column; 'smoking_status_imp' is its imputed counterpart
stroke_data.drop('smoking_status', axis=1, inplace=True)

# Redefine X and y with the updated dataset
X = stroke_data.drop(['stroke'], axis=1)  # features
y = stroke_data['stroke']  # target

# Split the data again into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train the Decision Tree classifier again
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
dt_predictions = dt_classifier.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_predictions)
dt_classification_report = classification_report(y_test, dt_predictions)

# Train the Random Forest classifier again
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)
rf_predictions = rf_classifier.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_classification_report = classification_report(y_test, rf_predictions)

(dt_accuracy, dt_classification_report, rf_accuracy, rf_classification_report)


(0.9178082191780822,
 '              precision    recall  f1-score   support\n\n           0       0.96      0.96      0.96       972\n           1       0.15      0.14      0.14        50\n\n    accuracy                           0.92      1022\n   macro avg       0.55      0.55      0.55      1022\nweighted avg       0.92      0.92      0.92      1022\n',
 0.949119373776908,
 '              precision    recall  f1-score   support\n\n           0       0.95      1.00      0.97       972\n           1       0.00      0.00      0.00        50\n\n    accuracy                           0.95      1022\n   macro avg       0.48      0.50      0.49      1022\nweighted avg       0.90      0.95      0.93      1022\n')

The models have been trained and evaluated. Here are the results:

Decision Tree Classifier:

- Accuracy: 91.78%
- Precision for classifying stroke: 15%
- Recall for classifying stroke: 14%
- F1-score for classifying stroke: 14%

Random Forest Classifier:

- Accuracy: 94.91%
- Precision for classifying stroke: 0% (none of its stroke predictions were correct)
- Recall for classifying stroke: 0% (no true positive predictions for stroke cases)
- F1-score for classifying stroke: 0% (follows from zero precision and recall)

The accuracy metric is misleading here because the dataset is imbalanced. Most of the data belongs to the non-stroke class, which the classifiers predict almost exclusively. This is why we see high overall accuracy but very low precision and recall for the actual stroke predictions.

For imbalanced datasets, accuracy is not the best metric. Instead, one should consider using the F1-score, precision, recall, and ROC-AUC scores to evaluate model performance on the minority class, which in this case is the occurrence of a stroke. The Random Forest classifier didn't predict any stroke cases correctly, which suggests it may not be the best model without further tuning or addressing the class imbalance more effectively.
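
To make the point about accuracy concrete, the following sketch scores a baseline that always predicts the majority class. It is an illustration rather than part of the original analysis: the labels only reproduce the test-set class counts reported above (972 non-stroke, 50 stroke), and the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Labels with the same class counts as the test set reported above
# (972 non-stroke, 50 stroke); the features are placeholders.
y_true = np.array([0] * 972 + [1] * 50)
X_placeholder = np.zeros((len(y_true), 1))

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true)
y_pred = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred, zero_division=0)
baseline_balanced_accuracy = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy:          {baseline_accuracy:.4f}")
print(f"stroke recall:     {baseline_recall:.4f}")
print(f"balanced accuracy: {baseline_balanced_accuracy:.4f}")
```

Such a baseline reaches about 95.1% accuracy (972 of 1022) without detecting a single stroke, slightly above the Random Forest's 94.91%, while its balanced accuracy is only 0.5.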

In [5]:
# The Random Forest above was trained and evaluated without any handling of the class imbalance.
# Here it is trained again with a different approach:
# class weights are adjusted within the Random Forest classifier.

# Train the Random Forest classifier with class weight adjustment
rf_classifier_balanced = RandomForestClassifier(random_state=42, class_weight='balanced')
rf_classifier_balanced.fit(X_train, y_train)

# Predict on the test set with the balanced classifier
rf_balanced_predictions = rf_classifier_balanced.predict(X_test)

# Evaluate the Random Forest classifier with class weight adjustment
rf_balanced_accuracy = accuracy_score(y_test, rf_balanced_predictions)
rf_balanced_classification_report = classification_report(y_test, rf_balanced_predictions)

(rf_balanced_accuracy, rf_balanced_classification_report)


(0.9500978473581213,
 '              precision    recall  f1-score   support\n\n           0       0.95      1.00      0.97       972\n           1       0.00      0.00      0.00        50\n\n    accuracy                           0.95      1022\n   macro avg       0.48      0.50      0.49      1022\nweighted avg       0.90      0.95      0.93      1022\n')

After training the Random Forest classifier with class weights adjusted to account for the imbalance, the results are:

- Accuracy: 95.01%
- Precision for classifying stroke: 0% (still none of its stroke predictions were correct)
- Recall for classifying stroke: 0% (no true positive predictions for stroke cases)
- F1-score for classifying stroke: 0% (follows from zero precision and recall)

Adjusting the class weights did not improve the classifier's ability to detect the minority class, which in this dataset is the occurrence of a stroke. The classifier is still biased towards predicting the majority class, which suggests that stronger measures are needed to handle the imbalance, such as oversampling the minority class with SMOTE or more extensive model tuning. Evaluation should also rely on metrics that reflect performance on the minority class, such as the ROC-AUC score.
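
As a rough illustration of oversampling, the sketch below uses plain random oversampling with `sklearn.utils.resample`. This is a simpler technique than SMOTE, which synthesizes new minority samples and is provided by the separate imbalanced-learn package. The data frame here is synthetic and only stands in for the training split; its column values are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic, imbalanced training frame (illustrative stand-in for the training split)
rng = np.random.default_rng(42)
train = pd.DataFrame({
    "age": rng.uniform(20, 90, size=1000),
    "avg_glucose_level": rng.uniform(60, 250, size=1000),
    "stroke": [1] * 50 + [0] * 950,
})

majority = train[train["stroke"] == 0]
minority = train[train["stroke"] == 1]

# Sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Recombine and shuffle the balanced training data
train_balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1.0, random_state=42)
    .reset_index(drop=True)
)

print(train_balanced["stroke"].value_counts())
```

Only the training split should be resampled; the test set keeps its original class distribution so that the evaluation stays realistic.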

In [6]:
# Initialize the Support Vector Machine classifier with class weight adjustment for imbalance
svm_classifier = SVC(random_state=42, class_weight='balanced')

# Train the SVM classifier
svm_classifier.fit(X_train, y_train)

# Predict on the test set with the SVM classifier
svm_predictions = svm_classifier.predict(X_test)

# Evaluate the SVM classifier
svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_classification_report = classification_report(y_test, svm_predictions)

(svm_accuracy, svm_classification_report)


(0.700587084148728,
 '              precision    recall  f1-score   support\n\n           0       0.99      0.69      0.82       972\n           1       0.12      0.82      0.21        50\n\n    accuracy                           0.70      1022\n   macro avg       0.55      0.76      0.51      1022\nweighted avg       0.94      0.70      0.79      1022\n')

After training the Support Vector Machine (SVM) classifier with class weights adjusted for the imbalance, the results are quite different from the previous models:

- Accuracy: 70.06%
- Precision for classifying stroke: 12% (about one in eight predicted strokes was an actual stroke)
- Recall for classifying stroke: 82% (41 of the 50 actual stroke cases were identified)
- F1-score for classifying stroke: 21%

The SVM classifier achieves far higher recall for the stroke class than the Random Forest classifier, which means it identifies most of the actual stroke cases. However, its precision is low, so it also produces a considerable number of false positives. This trade-off between precision and recall is common in imbalanced datasets.

The decrease in overall accuracy compared to the Random Forest model reflects the SVM model's increased focus on the minority class. This is a good example of how accuracy is not always the most important metric in imbalanced class situations. Instead, the improvement in recall for the stroke class might be considered more valuable in a healthcare context, where missing out on potential stroke cases could have serious consequences.
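
The size of this trade-off can be worked out from the report itself. The sketch below recovers the confusion-matrix counts from the reported accuracy, stroke recall, and class supports. The helper `confusion_counts` is a hypothetical name introduced for this example. Because the report rounds recall to two decimals, the true-positive count is inferred: 41 is the only count out of 50 that rounds to 0.82.

```python
def confusion_counts(accuracy, recall, n_positive, n_negative):
    """Recover TP, FN, TN, FP from accuracy, positive-class recall, and class supports."""
    total = n_positive + n_negative
    tp = round(recall * n_positive)      # recall = TP / (TP + FN)
    fn = n_positive - tp
    correct = round(accuracy * total)    # accuracy = (TP + TN) / total
    tn = correct - tp
    fp = n_negative - tn
    return tp, fn, tn, fp


# Values reported for the SVM above: accuracy 0.700587..., stroke recall 0.82,
# support 50 (stroke) and 972 (non-stroke)
tp, fn, tn, fp = confusion_counts(0.700587084148728, 0.82, 50, 972)
precision = tp / (tp + fp)

print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"precision = {precision:.4f}")
```

Under these assumptions the SVM catches 41 of the 50 strokes but raises 297 false alarms among the 972 non-stroke patients, which is consistent with the reported precision of 12%.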

In [7]:
# Initialize the Logistic Regression classifier with class weight adjustment for imbalance
logreg_classifier = LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000)

# Train the Logistic Regression classifier
logreg_classifier.fit(X_train, y_train)

# Predict on the test set with the Logistic Regression classifier
logreg_predictions = logreg_classifier.predict(X_test)

# Evaluate the Logistic Regression classifier
logreg_accuracy = accuracy_score(y_test, logreg_predictions)
logreg_classification_report = classification_report(y_test, logreg_predictions)

(logreg_accuracy, logreg_classification_report)


(0.7524461839530333,
 '              precision    recall  f1-score   support\n\n           0       0.99      0.75      0.85       972\n           1       0.14      0.80      0.24        50\n\n    accuracy                           0.75      1022\n   macro avg       0.56      0.78      0.55      1022\nweighted avg       0.95      0.75      0.82      1022\n')

After training the Logistic Regression classifier with class weights adjusted for the imbalance, here are the results:

- Accuracy: 75.24%
- Precision for classifying stroke: 14% (about one in seven predicted strokes was an actual stroke)
- Recall for classifying stroke: 80% (40 of the 50 actual stroke cases were identified)
- F1-score for classifying stroke: 24%

The Logistic Regression model performed similarly to the SVM in terms of recall, which is quite high, but the precision is still low, resulting in a modest F1-score. The overall accuracy is lower than that of the Random Forest classifier but higher than that of the SVM.

Like the SVM, the Logistic Regression model's lower overall accuracy is due to it prioritizing the minority class (stroke cases), as seen by the high recall rate. This could be more desirable in a medical diagnosis context where it is crucial not to miss any potential stroke cases.

When considering the performance of the various models we trained, it's essential to look beyond just accuracy, especially since we're dealing with an imbalanced dataset where the minority class (patients who have had a stroke) is of particular interest. Here's a summary of the performance of each model on the test data:

Decision Tree Classifier:
- Accuracy: 91.78%
- Precision for stroke: 15%
- Recall for stroke: 14%
- F1-score for stroke: 14%

Random Forest Classifier:
- Accuracy: 94.91%
- Precision for stroke: 0%
- Recall for stroke: 0%
- F1-score for stroke: 0%

Random Forest with Balanced Class Weights:
- Accuracy: 95.01%
- Precision for stroke: 0%
- Recall for stroke: 0%
- F1-score for stroke: 0%

Support Vector Machine (SVM) with Balanced Class Weights:
- Accuracy: 70.06%
- Precision for stroke: 12%
- Recall for stroke: 82%
- F1-score for stroke: 21%

Logistic Regression with Balanced Class Weights:
- Accuracy: 75.24%
- Precision for stroke: 14%
- Recall for stroke: 80%
- F1-score for stroke: 24%
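
For a side-by-side comparison, the same figures can be collected into a single table and ranked by the stroke-class F1-score instead of accuracy. This is a small sketch; the numbers are copied from the reports above, and the column names are chosen for the example.

```python
import pandas as pd

# Metrics copied from the classification reports above (stroke class, test set)
summary = pd.DataFrame(
    [
        ("Decision Tree", 0.9178, 0.15, 0.14, 0.14),
        ("Random Forest", 0.9491, 0.00, 0.00, 0.00),
        ("Random Forest (balanced)", 0.9501, 0.00, 0.00, 0.00),
        ("SVM (balanced)", 0.7006, 0.12, 0.82, 0.21),
        ("Logistic Regression (balanced)", 0.7524, 0.14, 0.80, 0.24),
    ],
    columns=["model", "accuracy", "precision", "recall", "f1"],
)

# Rank by stroke-class F1 rather than accuracy
ranked = summary.sort_values("f1", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```

Ranking by accuracy would put the balanced Random Forest first, even though it detects no strokes; ranking by F1 puts Logistic Regression first.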

Model Performance Evaluation:

- Decision Tree: Moderately high accuracy but low precision and recall for predicting strokes, suggesting limited usefulness for our specific aim.

- Random Forest (Standard and Balanced): High accuracy but failed to predict any stroke cases correctly, indicating that both variants are heavily biased towards the majority class.

- SVM: Lower accuracy, but far higher recall than the tree-based models and the highest recall overall (82%), so it is much better at identifying the minority class (stroke cases). However, its precision is low, leading to a high number of false positives.

- Logistic Regression: Combines high recall (80%) with the highest F1-score among all models for the stroke class, making it potentially the most useful model for our purposes.

Considering the aim is to predict strokes, a condition where failing to predict a positive case could be life-threatening, the models' ability to detect the positive class (high recall) is crucial. However, it is also important to maintain a reasonable precision to avoid too many false positives, which could lead to unnecessary anxiety and medical interventions.

Best Performing Model:

The Logistic Regression with Balanced Class Weights model appears to be the most suitable model for this task, based on our evaluations. It has the highest F1-score for stroke predictions (24%), which balances precision and recall. It nearly matches the SVM's recall (80% vs. 82%) while producing fewer false positives, as its higher precision and accuracy show.

The SVM had a higher recall but lower F1-score due to many false positives, which may not be as preferable in a medical setting where false alarms can have significant consequences.
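
If recall is judged more important than precision, one option is to compare the two models with an F-beta score, which generalizes F1. The sketch below computes it by hand from the rounded precision and recall values in the reports, so the results are approximate. The helper `f_beta` and the choice of beta = 2 are assumptions made for illustration.

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 gives recall more weight than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded stroke-class precision and recall from the reports above
reported = {
    "SVM": (0.12, 0.82),
    "Logistic Regression": (0.14, 0.80),
}

for name, (precision, recall) in reported.items():
    f1 = f_beta(precision, recall, beta=1)
    f2 = f_beta(precision, recall, beta=2)
    print(f"{name}: F1={f1:.3f}  F2={f2:.3f}")
```

With these rounded inputs, Logistic Regression stays ahead of the SVM under F2 as well (roughly 0.41 versus 0.38), so weighting recall more heavily does not change the ranking here.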

It's important to note that these models can be further fine-tuned and evaluated using more sophisticated techniques and metrics, like the ROC-AUC score, to potentially improve their performance. Additionally, ensembling methods, more advanced oversampling techniques, and feature engineering could be explored to further enhance the model's ability to predict stroke events accurately.
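
As a sketch of the threshold-free metrics mentioned above, ROC-AUC and average precision can be computed from predicted probabilities instead of hard class labels. The example runs on synthetic data from `make_classification` with roughly 5% positives, which only stands in for the stroke dataset. The scores it prints say nothing about the models in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives (stand-in for the stroke dataset)
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Threshold-free metrics need scores or probabilities, not hard 0/1 predictions
positive_scores = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, positive_scores)
pr_auc = average_precision_score(y_test, positive_scores)

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"Average precision (PR-AUC): {pr_auc:.3f}")
```

For an `SVC` trained without `probability=True`, the output of `decision_function` can be passed to these metrics in place of probabilities.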