# Stroke predictions

### Introduction

In this notebook we investigate the occurrence of strokes based on anonymized patient data. We will train various machine learning algorithms to predict strokes based on input parameters such as age, body mass index, and smoking status. There are many metrics to assess model performance. However, in this analysis we will select the best model based on the F1-score only.

### Imports and settings

In [None]:
# Standard Python libraries:
import sys
import time
from typing import List, Dict, Any, Union
import warnings

# Data processing and modeling:
from imblearn.over_sampling import SMOTE
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Data visualization:
import matplotlib.pyplot as plt
import seaborn as sns

# Settings:
warnings.filterwarnings("ignore")
np.set_printoptions(threshold=sys.maxsize)
pd.set_option("display.max_colwidth", None)
sns.set_theme()

### Data parsing

The variable *input_path* corresponds to the path of the input CSV file (the healthcare dataset). This file is converted to a pandas dataframe where each row corresponds to a unique patient.

In [None]:
input_path = "../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv"
df = pd.read_csv(input_path)

Our dataframe contains the following columns:

In [None]:
df.columns = df.columns.str.lower()
df.info()

In [None]:
df.head()

The meanings of the columns are rather self-explanatory. There are eleven features in total, plus the label. The label is represented by the *stroke* column. Note that the *bmi* column has missing values, which is something that we will tackle later. Furthermore, a number of features are categorical rather than numerical. As most machine learning algorithms require numeric input, we will address this issue for each categorical feature individually.

### Data exploration and preprocessing

In the following we will study each column in more detail. Besides exploratory analyses, we will also immediately perform most of the required preprocessing.

#### ID feature

Let's start with the *id* column. Since this column will not be relevant to predict strokes, we will simply drop it. However, before we remove it let us make sure that the dataset does not contain any duplicate IDs.

In [None]:
df = df.drop_duplicates(subset="id")
df = df.drop(["id"], axis=1)

#### Gender feature

The *gender* column shows the following distribution:

In [None]:
df["gender"].value_counts()

For simplicity, we will only consider two gender options. The *Other* value is replaced by the most frequent value, which is *Female*. Furthermore, we will convert gender into a numerical variable using binary encoding:

In [None]:
df["gender"] = df["gender"].replace(["Other"], "Female")
gender_conversion = {"Male": 0, "Female": 1}
df["gender"] = df["gender"].map(gender_conversion)
df["gender"] = df["gender"].astype(int)

In the plot below we can see how gender affects the probability of getting a stroke:

In [None]:
plt.figure(figsize=(6,4))
sns.barplot(x="gender", y="stroke", data=df)
plt.xticks(list(gender_conversion.values()), list(gender_conversion.keys()))
plt.xlabel("Gender", fontweight="bold")
plt.ylabel("Stroke (mean)", fontweight="bold")
plt.show()

The wide, colored bars represent mean values and the black lines mark the boundaries of the 95% confidence interval. We note that there is no big difference between males and females.
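
To make the meaning of those black lines more concrete, here is a minimal sketch of a percentile bootstrap interval for a mean, which is the kind of interval seaborn draws by default for bar plots. The function name `bootstrap_mean_ci` and the toy outcomes are our own illustrations and not part of the dataset.

```python
import numpy as np


def bootstrap_mean_ci(values, n_boot=1000, level=0.95, seed=0):
    """Return (mean, lower, upper) for a percentile bootstrap interval of the mean."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement and record the mean of every resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(boot_means, [alpha, 1.0 - alpha])
    return values.mean(), lower, upper


# Hypothetical binary outcomes: 5 strokes among 100 patients.
toy_stroke = np.array([1] * 5 + [0] * 95)
mean, lower, upper = bootstrap_mean_ci(toy_stroke)
print(f"mean={mean:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Because strokes are rare, small groups yield wide intervals, which is worth keeping in mind for the bar plots in the following sections.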

#### Age feature

Next up is the *age* column:

In [None]:
df["age"].describe()

The correlation between age and having a stroke is as follows:

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(df[df["stroke"] == 0]["age"], binwidth=5, binrange=[0, 85], stat="probability", color="limegreen", label="No stroke")
sns.histplot(df[df["stroke"] == 1]["age"], binwidth=5, binrange=[0, 85], stat="probability", color="firebrick", label="Stroke")
plt.xlabel("Age", fontweight="bold")
plt.ylabel("Stroke (probability)", fontweight="bold")
plt.legend()
plt.show()

The sum of the bar heights for each scenario (stroke vs. no stroke) equals unity. We can infer that the chances of having a stroke significantly increase with age.
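
The normalization behind `stat="probability"` can be sketched with plain NumPy. The helper `probability_histogram` and the toy ages below are hypothetical; the point is only that each group is divided by its own total, so the plot compares the shapes of the two distributions and not their absolute counts.

```python
import numpy as np


def probability_histogram(values, bin_edges):
    """Bin heights normalized so that they sum to 1 within one group."""
    counts, _ = np.histogram(values, bins=bin_edges)
    return counts / counts.sum()


bin_edges = np.arange(0, 90, 5)  # 5-year bins covering ages 0 to 85
toy_ages = np.array([3, 27, 45, 52, 61, 67, 70, 74, 79, 82])  # hypothetical ages
heights = probability_histogram(toy_ages, bin_edges)
print(heights.sum())
```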

#### Hypertension feature

The *hypertension* field takes on the values 0 (no hypertension) and 1 (hypertension): 

In [None]:
df["hypertension"].value_counts()

The relation with *stroke* can be visualized as follows:

In [None]:
hypertension_conversion = {"No hypertension": 0, "Hypertension": 1}
plt.figure(figsize=(6,4))
sns.barplot(x="hypertension", y="stroke", data=df)
plt.xticks(list(hypertension_conversion.values()), list(hypertension_conversion.keys()))
plt.xlabel("Hypertension", fontweight="bold")
plt.ylabel("Stroke (mean)", fontweight="bold")
plt.show()

We note that patients with hypertension are more likely to experience a stroke.

#### Heart disease feature

The variable *heart_disease* is either 0 (no heart disease) or 1 (heart disease):

In [None]:
df["heart_disease"].value_counts()

The plot below shows the correlation between heart disease and strokes:

In [None]:
heart_conversion = {"No heart disease": 0, "Heart disease": 1}
plt.figure(figsize=(6,4))
sns.barplot(x="heart_disease", y="stroke", data=df)
plt.xticks(list(heart_conversion.values()), list(heart_conversion.keys()))
plt.xlabel("Heart disease", fontweight="bold")
plt.ylabel("Stroke (mean)", fontweight="bold")
plt.show()

From this plot we infer that heart disease increases the chances of experiencing a stroke.

#### Ever married feature

The column *ever_married* shows whether the patient has ever been married:

In [None]:
df["ever_married"].value_counts()

To convert this feature to a numerical one, we can apply binary encoding:

In [None]:
married_conversion = {"No": 0, "Yes": 1}
df["ever_married"] = df["ever_married"].map(married_conversion)
df["ever_married"] = df["ever_married"].astype(int)

The distribution of this variable with respect to strokes is as follows:

In [None]:
plt.figure(figsize=(6,4))
sns.barplot(x="ever_married", y="stroke", data=df)
plt.xticks(list(married_conversion.values()), list(married_conversion.keys()))
plt.xlabel("Ever married", fontweight="bold")
plt.ylabel("Stroke (mean)", fontweight="bold")
plt.show()

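Patients who have ever been married show a noticeably higher stroke rate. This most likely reflects age rather than marriage itself: the never-married group includes all children, and we saw above that the chances of a stroke increase strongly with age.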
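
One way to probe whether such a difference is really an age effect is to compare stroke rates within age bands. The sketch below does this on a tiny synthetic dataframe; the function `stroke_rate_by_age_band`, the band edges, and the toy rows are assumptions for illustration and not results from the dataset.

```python
import pandas as pd


def stroke_rate_by_age_band(frame, band_edges=(0, 40, 60, 120)):
    """Mean stroke rate per (age band, ever_married) group."""
    # Left-closed bands such as [0, 40), [40, 60), [60, 120).
    bands = pd.cut(frame["age"], bins=list(band_edges), right=False)
    return frame.groupby([bands, "ever_married"], observed=True)["stroke"].mean()


# Synthetic rows, only meant to show the shape of the output.
toy = pd.DataFrame({
    "age":          [25, 30, 35, 65, 70, 75, 68, 72],
    "ever_married": [0, 0, 1, 1, 1, 1, 0, 0],
    "stroke":       [0, 0, 0, 1, 0, 1, 1, 0],
})
rates = stroke_rate_by_age_band(toy)
print(rates)
```

If the gap between the married and never-married groups shrinks within each band on the real data, age is the more plausible driver.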

#### Work type feature

Next, we have the column *work_type*. There are five possible values:

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x="work_type", y="stroke", data=df)
plt.xlabel("Work type", fontweight="bold")
plt.ylabel("Stroke (mean)", fontweight="bold")
plt.show()

Note that *Private* and *Govt_job* have a similar impact on strokes. The same is true for *children* and *Never_worked*. For this reason, we can create three categories instead without losing much information. Moreover, the fact that a patient is a child is actually redundant as it is already captured by the *age* field.

In [None]:
df["work_type"] = df["work_type"].replace(["Self-employed"], "self-employed")
df["work_type"] = df["work_type"].replace(["Private", "Govt_job"], "employed")
df["work_type"] = df["work_type"].replace(["children", "Never_worked"], "never_employed")

# Enforce that children (up to age 12) are labeled as never employed:
df.loc[df["age"] < 13, "work_type"] = "never_employed"

df["work_type"].value_counts()

Since the work type categories are nominal, we can use one-hot encoding to create numerical variables. We drop one of the resulting columns because it carries redundant information:

In [None]:
df_work_ohe = pd.get_dummies(
    df["work_type"], 
    prefix="work_ohe", 
    drop_first=True,
)
df = pd.concat([df, df_work_ohe], axis=1)
df = df.drop(["work_type"], axis=1)
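
To see what `drop_first=True` does, consider the following small sketch on a hypothetical series. The dropped category (the first one in sorted order) becomes the baseline and is represented by zeros in all remaining columns. Passing `dtype=int` is our own choice here to obtain 0/1 integers; the cell above uses the pandas default.

```python
import pandas as pd

# Hypothetical work types, mirroring the three merged categories.
toy_work = pd.Series(["employed", "never_employed", "self-employed", "employed"])
dummies = pd.get_dummies(toy_work, prefix="work_ohe", drop_first=True, dtype=int)
print(dummies)
```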

#### Residence type feature

For the *residence_type* field, the distribution is as follows:

In [None]:
df["residence_type"].value_counts()

We can convert this categorical field to a numerical one using binary encoding:

In [None]:
residence_conversion = {"Rural": 0, "Urban": 1}
df["residence_type"] = df["residence_type"].map(residence_conversion)
df["residence_type"] = df["residence_type"].astype(int)

The plot below shows the relation of this variable with *stroke*:

In [None]:
plt.figure(figsize=(6,4))
sns.barplot(x="residence_type", y="stroke", data=df)
plt.xticks(list(residence_conversion.values()), list(residence_conversion.keys()))
plt.xlabel("Residence type", fontweight="bold")
plt.ylabel("Stroke (mean)", fontweight="bold")
plt.show()

Residence type does not seem to affect strokes much.

#### Glucose level feature

The following column, *avg_glucose_level*, describes the average glucose level in mg/dL. Its statistical details read:

In [None]:
df["avg_glucose_level"].describe()

From the [Mayo Clinic](https://www.mayoclinic.org/diseases-conditions/diabetes/diagnosis-treatment/drc-20371451) we learn the following in relation to diabetes:

> A blood sugar level less than 140 mg/dL is normal. A reading of more than 200 mg/dL after two hours indicates diabetes. A reading between 140 and 199 mg/dL indicates prediabetes.

Glucose level and strokes are related as follows:

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(df[df["stroke"] == 0]["avg_glucose_level"], binwidth=10, binrange=[50, 280], stat="probability", color="limegreen", label="No stroke")
sns.histplot(df[df["stroke"] == 1]["avg_glucose_level"], binwidth=10, binrange=[50, 280], stat="probability", color="firebrick", label="Stroke")
plt.xlabel("Average glucose level (mg/dL)", fontweight="bold")
plt.ylabel("Stroke (probability)", fontweight="bold")
plt.legend()
plt.show()

From the above plot we infer that there is a (positive) relation between strokes and being (pre)diabetic.
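
The thresholds quoted above can be written down as a small helper. Note that this is a rough sketch: the quoted values refer to a reading taken two hours after a glucose tolerance test, whereas our column is an average glucose level, so applying them here is an assumption. The function name `glucose_category` and the handling of exactly 200 mg/dL are also our own choices.

```python
def glucose_category(level_mg_dl: float) -> str:
    """Map a glucose level (mg/dL) to the category quoted from the Mayo Clinic."""
    if level_mg_dl < 140:
        return "normal"
    if level_mg_dl < 200:
        return "prediabetes"
    # Assumption: a reading of exactly 200 mg/dL is grouped with diabetes.
    return "diabetes"


for level in (95, 150, 230):
    print(level, glucose_category(level))
```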

#### Smoking status feature

The next feature column is *smoking_status*. There are four possible values:

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x="smoking_status", y="stroke", data=df)
plt.xlabel("Smoking status", fontweight="bold")
plt.ylabel("Stroke (mean)", fontweight="bold")
plt.show()

Although smoking does not seem beneficial for your health, the relationship with strokes is not so clear from this plot due to the large confidence intervals.

The value *Unknown* means that information on smoking status is unavailable. To decide on what to do with this, let us look at the number of occurrences:

In [None]:
df["smoking_status"].value_counts()

Since *Unknown* occurs 1544 times in our dataset, it is best to leave it as a separate category rather than to replace it by guesses.

We consider this variable to be nominal. For that reason we can apply one-hot encoding to convert it into a numerical feature:

In [None]:
df_smoking_ohe = pd.get_dummies(
    df["smoking_status"], 
    prefix="smoking_ohe", 
    drop_first=True,
)
df_smoking_ohe = df_smoking_ohe.rename(columns={
    "smoking_ohe_never smoked": "smoking_ohe_never_smoked", 
    "smoking_ohe_formerly smoked": "smoking_ohe_formerly_smoked",
})
df = pd.concat([df, df_smoking_ohe], axis=1)
df = df.drop(["smoking_status"], axis=1)

#### BMI feature

The next column is *bmi*, the body mass index (BMI) in kg/m$^2$. Its statistical details are as follows:

In [None]:
df["bmi"].describe()

From the [CDC](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html) we learn the following in relation to obesity:

| BMI | Weight status |
| ---: | :--- |
| < 18.5 | Underweight |
| 18.5 - 24.9 | Normal weight |
| 25.0 - 29.9 | Overweight |
| ≥ 30.0 | Obese |

From earlier we know that there are 201 missing values for this field. There are various ways to deal with this. The best option might be to infer those missing values from the other features using a regression model. For convenience, let us pick the easiest regression algorithm for this, i.e. linear regression:

In [None]:
train_data = df.dropna()
X_train = train_data.drop("bmi", axis=1)
y_train = train_data["bmi"]

test_data = df[df["bmi"].isnull()]
X_test = test_data.drop("bmi", axis=1)

model = LinearRegression()
model.fit(X_train, y_train)
y_test = model.predict(X_test)

bmi_slice = df["bmi"].copy()
bmi_slice[np.isnan(bmi_slice)] = y_test
df["bmi"] = bmi_slice

Now that our BMI values are complete, let us study the relationship with strokes:

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(df[df["stroke"] == 0]["bmi"], binwidth=2, binrange=[10, 100], stat="probability", color="limegreen", label="No stroke")
sns.histplot(df[df["stroke"] == 1]["bmi"], binwidth=2, binrange=[10, 100], stat="probability", color="firebrick", label="Stroke")
plt.xlabel("BMI (kg/m2)", fontweight="bold")
plt.ylabel("Stroke (probability)", fontweight="bold")
plt.legend()
plt.show()

The distributions for stroke and no stroke are fairly similar. However, strokes seem to be relatively common in the BMI range of 26-32 kg/m$^2$.
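
To relate that range to the CDC table, the weight categories can be expressed as a small helper. The name `bmi_category` is hypothetical, and the categories apply to adults, so this sketch should not be used for the children in the dataset.

```python
def bmi_category(bmi: float) -> str:
    """Map an adult BMI (kg/m^2) to the weight status from the CDC table."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal weight"
    if bmi < 30.0:
        return "overweight"
    return "obese"


for value in (17.0, 22.0, 26.0, 32.0):
    print(value, bmi_category(value))
```

The range of 26-32 kg/m$^2$ therefore spans the overweight category and the lower end of the obese category.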

#### Stroke label

Finally, we have the target column *stroke*. Its values refer to whether the patient has experienced a stroke (1) or not (0).

In [None]:
df["stroke"].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'

We note that this dataset is extremely imbalanced. In fact, a model that always predicts 0 would be correct about 95% of the time! For this reason, accuracy is not a suitable metric to measure model performance. Furthermore, in the next section we will balance the training data to reduce model bias toward the majority class.
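
The following sketch illustrates why accuracy is misleading here. It scores a trivial model that always predicts 0 on hypothetical labels with a 95/5 split (the function name `score_constant_zero` is our own).

```python
def score_constant_zero(labels):
    """Accuracy and recall of a model that predicts 0 (no stroke) for everyone."""
    predictions = [0] * len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    positives = sum(labels)
    accuracy = correct / len(labels)
    recall = true_positives / positives if positives else 0.0
    return accuracy, recall


# Hypothetical labels: 95 patients without a stroke and 5 with a stroke.
toy_labels = [0] * 95 + [1] * 5
accuracy, recall = score_constant_zero(toy_labels)
print(accuracy, recall)
```

The accuracy is high although not a single stroke is detected, so the recall (and with it the F1-score) is zero.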

#### Wrap-up

To conclude the preprocessing, let us verify that there are no longer any missing values and that all fields are numeric in nature:

In [None]:
df.info()

In [None]:
df.head()

### Model building

Now that we have almost fully preprocessed our dataset, we can try to model it with various machine learning algorithms. First, we will split the data into training and test sets. After the split we will perform a few more preprocessing steps. Then we will quickly train and test various algorithms to see which perform best. The best ones will be examined in more detail to further improve the predictions.

#### Train-test split

First, we make a train-test split of the data:

In [None]:
X = df.drop(["stroke"], axis=1)
y = df["stroke"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Let us take a look at the distribution of the target label:

In [None]:
y_train.value_counts()

Since our dataset is highly imbalanced, there is a risk that our models will be biased toward predicting no stroke. To combat this issue, we can apply an oversampling technique called SMOTE, which is short for Synthetic Minority Oversampling Technique. This technique makes use of the K-nearest neighbors algorithm to synthesize more data for the minority class. After applying SMOTE to the training set, the stroke vs. no-stroke rows are more balanced.

In [None]:
oversampler = SMOTE(sampling_strategy=0.65, random_state=0)
X_train, y_train = oversampler.fit_resample(X_train, y_train)
y_train.value_counts()
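
The core idea of SMOTE can be sketched in a few lines of NumPy: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the line segment between them. The function `smote_like_samples` below is a simplified illustration under our own assumptions (at least two minority samples, Euclidean distance) and is not the imblearn implementation. As for the setting used above, a float *sampling_strategy* of 0.65 asks for a minority class that is 0.65 times the size of the majority class after resampling.

```python
import numpy as np


def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    minority = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    k = min(k, len(minority) - 1)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # For every new point: a random base sample and one of its k neighbors.
    base = rng.integers(0, len(minority), size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[chosen] - minority[base])


# Hypothetical minority samples on the corners of the unit square.
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(toy_minority, n_new=6, k=2)
print(synthetic.shape)
```

Since every synthetic point is an interpolation between two real minority samples, it is strongly tied to the samples it was derived from.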

Next, we apply standardization (meaning zero mean and unit variance) to all features to ensure that all data 'lives' at the same scale. The scaler is fitted on *X_train* only and then applied to both *X_train* and *X_test*, so that no information from the test set enters the preprocessing:

In [None]:
scaler = StandardScaler()
# Fit on the training set only, then reuse the same statistics for the test set:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
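
For reference, standardization with training-set statistics can be sketched in plain NumPy as follows. The helper `standardize_with_train_stats` and the toy arrays are hypothetical. The essential point is that the mean and standard deviation come from the training data only and are reused unchanged for the test data.

```python
import numpy as np


def standardize_with_train_stats(train, test):
    """Scale both sets using the mean and standard deviation of the training set."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (train - mean) / std, (test - mean) / std


toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])
train_scaled, test_scaled = standardize_with_train_stats(toy_train, toy_test)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Only the training set ends up with exactly zero mean and unit variance. The test set is merely expressed on the same scale, which is what the trained models expect.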

Before we start training, it is good practice to verify the dimensions of the data:

In [None]:
y_train = np.array(y_train)
print("Dimensions of the training set:\n")
print("Features:\t", X_train.shape, "\nLabels:\t\t", y_train.shape)

y_test = np.array(y_test)
print("\nDimensions of the test set:\n")
print("Features:\t", X_test.shape, "\nLabels:\t\t", y_test.shape)

#### Naive modeling

Using the Scikit-learn library, we can quickly try many different machine learning algorithms and select the best one. To this end, we define a function that we can invoke for each algorithm that calculates the mean F1-score that results from cross validation:

In [None]:
def cross_validation(algorithm: Any, X_train: Any, y_train: Any) -> Dict[str, float]:
    """Performs and assesses cross validation for a given machine learning algorithm.
    
    The performance metric is the F1-score.
    
    Args:
        algorithm: The Scikit-learn algorithm class.
        X_train: The input for training.
        y_train: The labels for training.
    
    Returns:
        The mean and standard deviation for the F1-score.
    """
    
    cross_val_scores = cross_val_score(algorithm, X_train, y_train, cv=5, scoring="f1")
    f1_mean = round(cross_val_scores.mean(), 3)
    f1_std = round(cross_val_scores.std(), 3)
    results = {"Mean": f1_mean, "Standard deviation": f1_std}

    return results

Let us now collect all algorithms that we want to try out. In this subsection we perform a 'quick and dirty' analysis using the Scikit-learn default settings for the algorithms.

In [None]:
algorithms = {
    "Gaussian naive Bayes": GaussianNB(),
    "K-nearest neighbors": KNeighborsClassifier(),
    "Support vector machine": SVC(),
    "Logistic regression": LogisticRegression(),
    "Multilayer perceptron": MLPClassifier(random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}

Next, we can perform cross validation for all these algorithms. We collect the mean and standard deviation for all F1-scores. The results are as follows:

In [None]:
cv_results = []
for name, algorithm in algorithms.items():
    start_time = time.time()
    results = cross_validation(algorithm, X_train, y_train)
    elapsed_time = round(time.time() - start_time, 2)
    f1_mean = results["Mean"]
    f1_std = results["Standard deviation"]
    cv_results.append((name, f1_mean))
    print("\n{}\nDuration: \t{} seconds\nF1-score: \t{} ± {}".format(name.upper(), elapsed_time, f1_mean, f1_std))

To select the most suitable algorithms, let us sort them by their (mean) F1-scores:

In [None]:
sorted_cv_results = sorted(cv_results, key=lambda x: x[1], reverse=True)
pd.DataFrame(sorted_cv_results, columns=["Model", "F1-score"])

The four best models are the random forest (RF), decision tree (DT), K-nearest neighbors (KNN), and multilayer perceptron (MLP) models. An RF in general performs better than a DT as an RF essentially averages a number of DTs through ensembling to improve generalizability. For this reason, we will discard the DT model and only continue with the RF, KNN, and MLP models.
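
The benefit of ensembling can be illustrated with an idealized calculation. Suppose each classifier is correct with probability 0.6 and all classifiers err independently (a strong assumption that the trees in a real RF only approximate). The hypothetical function `majority_vote_accuracy` then gives the probability that a majority vote is correct.

```python
from math import comb


def majority_vote_accuracy(p_single: float, n_voters: int) -> float:
    """Probability that a majority of n independent voters is correct (n odd)."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_single ** k * (1 - p_single) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

The accuracy of the vote grows with the number of voters, which is the intuition behind preferring the RF over a single DT.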

#### Improved modeling

In this subsection we aim to improve the RF, KNN, and MLP models. To this end, we can perform hyperparameter grid searches to optimize the models. For convenience, let us define the following function that facilitates grid searches:

In [None]:
def grid_search(
        algorithm: Any, 
        parameters: Dict[str, List[Any]],
        X_train: Any, 
        y_train: Any, 
) -> Dict[str, Union[float, Dict[str, Any]]]:
    """Performs a grid search for a given algorithm to find the best hyperparameters.
    
    All parameter combinations are considered. The best parameters are those that 
    maximize the F1-score using cross validation.
    
    Args:
        algorithm: The Scikit-learn algorithm class.
        parameters: A grid of hyperparameters that the algorithm will loop through.
        X_train: The input for training.
        y_train: The labels for training.
    
    Returns:
        The best F1-score and the corresponding hyperparameter settings.
    """
    
    clf = GridSearchCV(algorithm, parameters, cv=5, scoring="f1")
    clf.fit(X_train, y_train)
    best_f1 = round(clf.best_score_, 3)
    best_parameters = clf.best_params_
    results = {"F1-score": best_f1, "Parameters": best_parameters}

    return results

Let us now define the hyperparameters that we would like to try:

In [None]:
knn_parameters = {
    "n_neighbors": [2, 3, 5, 8, 13],
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
    "p": [1, 2],
}  # Number of combinations: 5*2*3*2 = 60
mlp_parameters = {
    "hidden_layer_sizes": [(8,), (13,), (21,), (8, 8,), (13, 13,), (21, 21,)],
    "solver": ["adam", "lbfgs"],
    "alpha": [0.01, 0.1, 1.0],
    "random_state": [0],
}  # Number of combinations: 6*2*3 = 36
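rf_parameters = {
    "criterion": ["gini", "entropy"],
    # "sqrt" is what "auto" meant for classifiers; newer Scikit-learn releases no longer accept "auto":
    "max_features": ["sqrt", 5, 8],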
    "max_depth": [8, 13],
    "min_samples_split": [5, 8],
    "ccp_alpha": [0.001, 0.01, 0.1],
    "random_state": [0],
}  # Number of combinations: 2*3*2*2*3 = 72

settings = {
    "K-nearest neighbors": (KNeighborsClassifier(), knn_parameters),
    "Multilayer perceptron": (MLPClassifier(), mlp_parameters),
    "Random forest": (RandomForestClassifier(), rf_parameters),
}
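
The combination counts in the comments above can be double-checked with a short sketch. The helper `count_grid` is our own and not part of Scikit-learn. It multiplies the number of options per hyperparameter and also reports the number of model fits for five-fold cross validation (by default, the grid search performs one additional refit on the full training set afterwards).

```python
from math import prod


def count_grid(param_grid: dict, cv_folds: int = 5) -> tuple:
    """Return (number of combinations, number of cross-validation fits)."""
    n_combinations = prod(len(values) for values in param_grid.values())
    return n_combinations, n_combinations * cv_folds


# Copy of the KNN grid defined above.
example_grid = {
    "n_neighbors": [2, 3, 5, 8, 13],
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
    "p": [1, 2],
}
print(count_grid(example_grid))
```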

Now we are ready to perform the grid search. For each algorithm, we collect the best F1-scores and corresponding hyperparameters. The results are as follows:

In [None]:
search_results = []
for name, grid in settings.items():
    algorithm = grid[0]
    parameters = grid[1]
    start_time = time.time()
    results = grid_search(algorithm, parameters, X_train, y_train)
    elapsed_time = round(time.time() - start_time, 2)
    best_f1_score = results["F1-score"]
    best_parameters = results["Parameters"]
    search_results.append((name, best_f1_score, best_parameters))
    print("\n{}\nDuration: \t\t{} seconds\nBest F1-score: \t\t{}\nHyperparameters: \t{}".format(name.upper(), elapsed_time, best_f1_score, best_parameters))

To select the most suitable algorithm, let us sort them by their F1-scores:

In [None]:
sorted_search_results = sorted(search_results, key=lambda x: x[1], reverse=True)
pd.DataFrame(sorted_search_results, columns=["Model", "F1-score", "Hyperparameters"])

All three models perform well during cross validation based on the F1-score. However, from this improved analysis we learn that the RF model with the given hyperparameters is most suitable for our use case. Note that the F1-score is slightly lower than before. The reason for this is that we enforce pruning, which is controlled by the parameter *ccp_alpha*. 

### Conclusions

In this final section we will look at some characteristics of our final RF model. Furthermore, we will see how it performs on our test set.

#### Model characteristics

A nice thing about RFs is that we can easily infer what the relative importance is of the different features regarding stroke predictions. Studying the features this way also provides a good check with respect to the exploratory analysis we performed earlier.

In [None]:
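# Look up the random forest by name rather than relying on its rank in the sorted results:
best_rf_params = next(params for name, _, params in search_results if name == "Random forest")
random_forest = RandomForestClassifier(**best_rf_params)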
random_forest.fit(X_train, y_train)

weights = list(random_forest.feature_importances_.round(3))
feature_names = list(X.columns)
relevancies = list(zip(feature_names, weights))
sorted_relevancies = sorted(relevancies, key=lambda x: x[1], reverse=True)
pd.DataFrame(sorted_relevancies, columns=["Feature", "Relative importance"])

From this overview, we conclude that *age*, *avg_glucose_level*, and *bmi* are the most important features to predict whether a patient is susceptible to a stroke. These results are fairly consistent with our exploratory analysis, with the possible exception of the bottom three features, *ever_married*, *heart_disease*, and *hypertension*. Earlier, those features showed significantly different distributions for stroke vs. no stroke. Apparently they do not play a very important role in our final RF model.

Recall that an RF model is built up from an ensemble of DTs. To get a better feel for our model, let us visualize one of its trees (the first one in the ensemble):

In [None]:
fig = plt.figure(dpi=2048)
tree.plot_tree(
    random_forest.estimators_[0],
    feature_names=feature_names, 
    class_names=["no stroke", "stroke"],
    filled=True,
)
fig.savefig('rf_example_tree.png')

Each node (except the leaves) splits into two nodes. The left child contains the samples that satisfy the condition stated in the parent node, and the right child contains those that do not. The colors provide information about the class label. The bluer the node, the stronger it predicts a stroke. Likewise, the more orange the node, the higher the probability that it corresponds to no stroke.

#### Test results

We have optimized our model by performing grid searches and cross validation. Now that we have selected our final model, we need to test its performance on the test set that we defined earlier. One important difference with cross validation is that in contrast to the training set, the test set has not been oversampled. This means that the test set reflects the true, imbalanced situation where strokes are rare.

In [None]:
y_pred = random_forest.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = round(accuracy_score(y_test, y_pred), 3)
precision = round(precision_score(y_test, y_pred), 3)
recall = round(recall_score(y_test, y_pred), 3)
f1 = round(f1_score(y_test, y_pred), 3)    
print("Confusion matrix: \n\n{}\n\nAccuracy: \t{}\nPrecision: \t{}\nRecall: \t{}\nF1-score: \t{}".format(conf_matrix, accuracy, precision, recall, f1))

From left to right, the first row of the confusion matrix contains the True Negatives (TNs) and the False Positives (FPs). The second row, again from left to right, contains the False Negatives (FNs) and the True Positives (TPs). As the diagonal contains the correct predictions, our goal has been to make this matrix as diagonal as possible. A high precision means that the FPs are suppressed, while a high recall corresponds to a low number of FNs. Ideally, both of these metrics are large. Both quantities are combined in the F1-score, which is the harmonic mean of precision and recall. For this reason, the F1-score is probably the most suitable metric to assess our model performance.
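
The relation between these quantities can be made explicit with a small sketch. The function `precision_recall_f1` is our own helper and the counts are hypothetical, not the results of this notebook. With Scikit-learn, the four counts of a binary confusion matrix can be obtained via `tn, fp, fn, tp = conf_matrix.ravel()`.

```python
def precision_recall_f1(tn: int, fp: int, fn: int, tp: int):
    """Compute precision, recall and F1-score from confusion matrix counts."""
    # tn is accepted for completeness; none of the three metrics depends on it.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for an imbalanced test set.
precision, recall, f1 = precision_recall_f1(tn=900, fp=60, fn=30, tp=10)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Note that the TNs do not enter any of the three metrics, which is why the F1-score is not inflated by the large number of patients without a stroke.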

#### Summary and discussion

In this notebook we have considered various machine learning algorithms to predict strokes in patients. Before modeling, we performed some exploratory data analyses (mainly visual) and converted the categorical data into numeric data using either binary encoding or one-hot encoding. The missing BMI values were imputed from a linear regression model that was trained on the remaining data. To tackle the issue of significantly imbalanced class labels, we appealed to the SMOTE algorithm to oversample the stroke labels, creating a more balanced training set. Subsequently, we standardized all features, i.e. we enforced a zero mean and unit variance.

After having fully preprocessed the data, we applied five-fold cross validation to seven different machine learning algorithms using the Scikit-learn default settings. Based on the F1-score we kept three algorithms (random forest, K-nearest neighbors, and multilayer perceptron), having discarded the decision tree in favor of the random forest. For these three algorithms we then performed a grid search to find optimal sets of hyperparameters. A random forest model came out best from cross validation. Perhaps surprisingly, only a marginal amount of pruning (controlled by the *ccp_alpha* parameter) was applied.

Finally, we investigated the obtained random forest model in more detail. In particular, we looked at the relative importance of the different features and we visualized a decision tree. To assess the performance of this model in a more realistic setting, we had it predict on a hold-out test set, without oversampling. Unfortunately, the F1-score came out relatively low, much lower than during cross validation. Although it is not shown explicitly in this notebook, varying the oversampling factor would not resolve this issue. As it turned out, no oversampling at all would make the results much worse. Of course, one could again search for better hyperparameters to improve the results, but then one is essentially fitting to the test set, which is not a preferred workflow.
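
One possible contributor to the gap between the cross validation and test scores, which this notebook does not test, is that the training set was oversampled before cross validation. The validation folds then contain synthetic samples that are closely related to samples in the training folds, which can make the cross validation scores too optimistic. The sketch below shows an alternative in which only the training part of each fold is oversampled. The functions `duplicate_minority` and `cv_f1_resample_inside_folds` are hypothetical, the data are synthetic, and plain random duplication is used as a stand-in for SMOTE (in this notebook one could pass `SMOTE(...).fit_resample` as the `resample` argument instead).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def duplicate_minority(X, y, seed=0):
    """Stand-in oversampler: duplicate random minority rows until classes match."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx))
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]


def cv_f1_resample_inside_folds(model, X, y, resample, n_splits=5):
    """F1-score per fold; only the training part of each fold is oversampled."""
    scores = []
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in splitter.split(X, y):
        X_fit, y_fit = resample(X[train_idx], y[train_idx])
        fitted = clone(model).fit(X_fit, y_fit)
        # The validation part keeps its original, imbalanced distribution.
        scores.append(f1_score(y[valid_idx], fitted.predict(X[valid_idx])))
    return np.array(scores)


# Synthetic, imbalanced data (roughly 10% positives), not the stroke dataset.
X_toy, y_toy = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
scores = cv_f1_resample_inside_folds(
    LogisticRegression(max_iter=1000), X_toy, y_toy, duplicate_minority
)
print(scores.round(3))
```

Scores obtained this way are measured on untouched, imbalanced validation folds, so they should be more comparable to the score on the hold-out test set.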

Any feedback on this notebook is more than welcome! I have also shared this project on my [GitHub](https://github.com/tvdaal). Finally, a big thank you to [fedesoriano](https://www.kaggle.com/fedesoriano) for sharing this interesting dataset.