In [None]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder

## CRISP-DM Process

1. Business Understanding: Stroke is a disease that affects the arteries leading to and within the brain. It is the number 5 cause of death and a leading cause of disability in the United States. A stroke occurs when a blood vessel that carries oxygen and nutrients to the brain is either blocked by a clot or bursts (or ruptures). When that happens, part of the brain cannot get the blood (and oxygen) it needs, so it and brain cells die. Stroke is a medical emergency. Prompt treatment is crucial. Early action can reduce brain damage and other complications. The good news is that strokes can be treated and prevented, and many fewer Americans die of stroke now than in the past. (https://www.cdc.gov/stroke/index.htm)

2. Data Understanding: This dataset was collected from patients, some of whom have been diagnosed with stroke. It contains 5110 observations (rows) and 12 variables (columns), of which 5 are numerical and 7 are categorical.

3. Prepare Data: Download the dataset from Kaggle and import it into a Jupyter Notebook. Check for missing values and outliers, then clean the dataset by removing them.

4. Data Modeling: Use the cleaned dataset to build the model. The model will be built by using the following algorithms: Logistic Regression, Decision Tree, Random Forest, and XGBoost.

5. Evaluate the Results: Evaluate the results by using the following metrics: Accuracy, Precision, Recall, F1-Score, and ROC-AUC.
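Steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual model: it uses synthetic, imbalanced data from `make_classification` in place of the stroke dataset, shows only Logistic Regression out of the four listed algorithms, and the `class_weight="balanced"` setting is an assumption added to cope with the rare positive class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (assumption: ~10% positive class).
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

# Stratify so both classes appear in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight="balanced" is an illustrative choice for imbalanced targets.
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard 0/1 predictions
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, y_score),  # needs scores, not labels
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a rare positive class, accuracy alone can look high even when few strokes are detected, which is why recall, F1-Score, and ROC-AUC are listed alongside it.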

In [None]:
df = pd.read_csv("../DataSet/brain_stroke.csv")
df.head()

## Process categorical variables

In [None]:
cat_cols = df.select_dtypes(include=["object"]).columns.to_list()

In [None]:
def label_encoder(df, cat_cols):
    """
    This function takes in a dataframe and a list of categorical columns and returns a dataframe with the categorical columns encoded.

    Args:
        df (dataframe): dataframe to be encoded
        cat_cols (list): list of categorical columns

    Returns:
        dataframe: dataframe with categorical columns encoded
    """
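    encoded = df.copy()  # work on a copy so the caller's dataframe is not mutated
    for col in cat_cols:
        # fit a fresh encoder per column; labels map to integers in sorted order
        encoded[col] = LabelEncoder().fit_transform(encoded[col])
    return encoded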

In [None]:
df = label_encoder(df, cat_cols)
df.head()
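After encoding, the columns hold integers, so it helps to know which integer stands for which original label when reading the plots later on. The sketch below shows one way to recover that mapping from a fitted `LabelEncoder`. The toy dataframe and the name `gender_mapping` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for one categorical column (values are illustrative only).
toy = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

le = LabelEncoder()
toy["gender_code"] = le.fit_transform(toy["gender"])

# classes_ stores the original labels in sorted order; the position of a
# label in classes_ is the integer code it was assigned.
gender_mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(gender_mapping)
```

Keeping the fitted encoder (or the mapping) around makes it possible to translate codes back with `le.inverse_transform` when labelling charts.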

## Handle missing values

In [None]:
df.info()

In [None]:
df.dropna(inplace=True)
df.info()
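The data preparation step also mentions outliers, which `dropna` does not address. One common way to flag them is the 1.5 × IQR rule, sketched below on a toy dataframe. The column names, the values, and the helper `iqr_outlier_mask` are hypothetical, and the IQR rule is one possible choice rather than the method used in this notebook.

```python
import pandas as pd

# Toy data with one missing value and one extreme value (illustrative only).
toy = pd.DataFrame({
    "age": [45.0, 61.0, 38.0, 70.0, 52.0, 49.0],
    "bmi": [24.1, 27.5, None, 26.0, 25.2, 60.0],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Drop incomplete rows, as the notebook does with dropna.
clean = toy.dropna().reset_index(drop=True)


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


bmi_outliers = iqr_outlier_mask(clean["bmi"])
print(missing_per_column)
print(clean[bmi_outliers])
```

Rows flagged this way can be inspected first and removed with `clean[~bmi_outliers]` if they turn out to be data errors.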

## Analysis, Modeling, Visualization

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

### Question 1: What is the ratio of stroke patients to non-stroke patients in this dataset, and does it match the ratio in the real world?

In [None]:
# Count each class once; sort by label so 0 (no stroke) comes before 1 (stroke)
sizes = df['stroke'].value_counts().sort_index()
labels = ['0: No Stroke', '1: Stroke']  # same order as the sorted index

plt.figure(figsize=(7,7))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)

plt.title('Share of stroke and non-stroke patients in the dataset')
plt.legend()
plt.show()

Answer: The ratio of stroke patients to non-stroke patients in this dataset is 1:24, which differs from the real-world ratio of 1:6.
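The ratio can also be read directly from the class counts instead of from the pie chart. Below is a small sketch using a toy series in place of `df['stroke']`. The helper name `class_ratio` and the toy values are assumptions for illustration, so the printed number is not the dataset's ratio.

```python
import pandas as pd


def class_ratio(target):
    """Return the number of non-stroke cases (0) per stroke case (1)."""
    counts = target.value_counts()
    positives = int(counts.get(1, 0))
    negatives = int(counts.get(0, 0))
    if positives == 0:
        raise ValueError("no positive cases, ratio is undefined")
    return negatives / positives


# Toy target: nine non-stroke cases and one stroke case (illustrative only).
toy_stroke = pd.Series([0] * 9 + [1])
ratio = class_ratio(toy_stroke)
print(f"stroke : non-stroke = 1:{ratio:.1f}")
```

In the notebook, calling `class_ratio(df['stroke'])` after loading the data would give the figure to compare against the answer above.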

### Question 2: Does the stroke rate differ by gender?

In [None]:
# catplot is figure-level and creates its own figure, so size it with
# height/aspect instead of plt.figure(), which would only add an empty figure.
# With kind="bar", each bar height is the mean of `stroke`, i.e. the stroke rate.
for hue in ["heart_disease", "Residence_type", "hypertension"]:
    sns.catplot(x="gender", y="stroke", hue=hue, palette="pastel",
                kind="bar", data=df, height=7, aspect=1.2)
plt.show()

Answer: Stroke rates for males and females are similar when split by `heart_disease` and `Residence_type`. When split by `hypertension`, however, the rate for males is higher than for females.
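The bar heights in these plots are group means of `stroke`, so the same comparison can be checked numerically with a `groupby`. The sketch below uses a toy dataframe with encoded `gender` and `hypertension` columns. The values are invented for illustration and do not reflect the real data.

```python
import pandas as pd

# Toy data: gender and hypertension are already label-encoded as 0/1.
toy = pd.DataFrame({
    "gender":       [0, 0, 0, 0, 1, 1, 1, 1],
    "hypertension": [0, 0, 1, 1, 0, 0, 1, 1],
    "stroke":       [0, 0, 0, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 target within each group is the stroke rate for that group.
rates = (
    toy.groupby(["gender", "hypertension"])["stroke"]
       .mean()
       .unstack("hypertension")  # rows: gender, columns: hypertension
)
print(rates)
```

Replacing `toy` with `df` and swapping the second grouping column for `heart_disease` or `Residence_type` gives the numbers behind the other two plots.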

### Question 3: What is the relationship between age and stroke?

In [None]:
# Plot the relationship between age and stroke

plt.figure(figsize=(17,7))
sns.histplot(data=df, x="age", hue="stroke", multiple="stack")
plt.xlabel('Age')
plt.ylabel('Number of People')
plt.title('Age Distribution')
plt.show()

Answer: The relationship between age and stroke is positive: the older the patient, the higher the risk of stroke.
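A stacked histogram shows counts, which makes it hard to compare risk across ages because the group sizes differ. One way to make the trend explicit is to bin age and compute the stroke rate per bin, as sketched below. The bin edges, labels, and toy values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (illustrative only): strokes occur mostly at higher ages.
toy = pd.DataFrame({
    "age":    [10, 25, 35, 45, 55, 65, 70, 75, 80, 85],
    "stroke": [0,  0,  0,  0,  0,  1,  0,  1,  1,  0],
})

# Right-closed bins: (0, 40], (40, 60], (60, 100].
bins = [0, 40, 60, 100]
labels = ["0-40", "40-60", "60-100"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Share of stroke cases within each age group.
rate_by_age = toy.groupby("age_group", observed=True)["stroke"].mean()
print(rate_by_age)
```

A rate that rises from one bin to the next supports the positive relationship described in the answer.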

## Correlation

In [None]:
plt.figure(figsize=(16,8))
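# annot=True prints each coefficient in its cell so values can be read exactly
sns.heatmap(df.corr(), cmap="Blues", annot=True, fmt=".2f");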

We can see that the correlations of `age` with `stroke` and with `married` are greater than 0.7.
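Colour shades in a heatmap are hard to compare against a threshold such as 0.7, so it can be useful to print the coefficients themselves. The sketch below ranks features by their Pearson correlation with the target on a toy dataframe. The column names and values are illustrative assumptions and do not reproduce the notebook's results.

```python
import pandas as pd

# Toy data (illustrative only), already numeric as after label encoding.
toy = pd.DataFrame({
    "age":     [20, 30, 40, 50, 60, 70, 80, 90],
    "married": [0,  0,  1,  1,  1,  1,  1,  1],
    "stroke":  [0,  0,  0,  0,  0,  1,  1,  1],
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_stroke = (
    toy.corr()["stroke"]
       .drop("stroke")  # a column's correlation with itself is always 1
       .sort_values(ascending=False)
)
print(corr_with_stroke)

# Features whose absolute correlation with the target exceeds a threshold.
strong = corr_with_stroke[corr_with_stroke.abs() > 0.7]
print(strong)
```

Running the same lines on `df` would show which pairs actually clear the 0.7 mark.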