### Module Import

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

### Dataset definition

In [None]:
df = pd.read_csv('stroke_dataset.csv')
df.head()

In [None]:
df.info()

### Field descriptions

- Gender: The person's gender, indicating whether they are male or female.

- Age: The person's age, indicating how many years old they are. This variable includes floats.

- Hypertension: Indicates whether the person has hypertension or high blood pressure (1 if they have it, 0 if they don't).

- Heart disease: Indicates whether the person has heart disease (1 if they have it, 0 if they don't).

- Ever_married: Indicates whether the person has ever been married (yes or no).

- Work_type: The type of work the person does, which can be categorized in various ways, such as office work, manual labor, etc.

- Residence_type: The type of residence of the person, which can be "Rural" or "Urban," indicating whether they live in a rural or urban area.

- Avg_glucose_level: The person's average blood glucose level, which is an important measure for assessing a person's health, especially in relation to diabetes. This variable has a float data type.

- bmi: The person's Body Mass Index (BMI), which is a measure that relates a person's weight and height to assess their body composition and potential obesity. It is a float.

- Smoking_status: The person's smoking status, which can be categorized into different states such as "never smoked", "former smoker", "current smoker" and "Unknown".

### Categorical Variables

In [None]:
cat = df.select_dtypes(include = ['object'])
cat_columns = list(cat)

In [None]:
for col in cat_columns:
    print(f'Column name: {col}')
    print(df[col].value_counts())
    print()

### Numeric Variables

In [None]:
num = df.select_dtypes(include = ['number'])
num_columns = list(num)
print(num_columns)

In [None]:
for col in num_columns:
    print(f'Column name: {col}')
    print(df[col].value_counts())
    print()

In [None]:
df.describe()

### Null Values Verification

In [None]:
df.isnull().sum()

### Duplicate Check

In [None]:
df.duplicated().sum()

### Cardinality Verification

In [None]:
df.nunique()

- There are no duplicate rows or null values in the dataset.

- We found a few columns with unbalanced categories: heart_disease, hypertension, stroke.

- We found multiple variables with a numeric datatype that are really boolean. We could change the datatype in the future and see how the model responds.

- avg_glucose_level has high cardinality. In the future we could try grouping this variable into categories.
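The datatype change mentioned above could look roughly like the sketch below. It runs on a few made-up rows, not on the real dataset, and the rule "a column is a flag when its only values are 0 and 1" is an assumption for illustration.

```python
import pandas as pd

# Made-up rows that only mimic the dataset's 0/1 columns (not real data).
sample = pd.DataFrame({
    "hypertension": [0, 1, 0, 0],
    "heart_disease": [0, 0, 1, 0],
    "stroke": [0, 0, 1, 0],
    "age": [34.0, 61.0, 79.0, 12.0],
})

# Assumption: a numeric column is a boolean flag when its only values are 0 and 1.
flag_columns = [
    col for col in sample.columns
    if set(sample[col].dropna().unique()) <= {0, 1}
]

# Cast only the detected flag columns; the other columns keep their dtype.
sample_bool = sample.astype({col: bool for col in flag_columns})

print(flag_columns)
print(sample_bool.dtypes)
```

Whether a boolean dtype changes anything for the model is something to check empirically; many scikit-learn estimators convert the input to floats internally.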

### Outliers Exploration

In [None]:
num_outliers = df[['age', 'avg_glucose_level', 'bmi','stroke']]
num_columns = list(num_outliers)

In [None]:
plt.figure(figsize=(14, 6))

sns.boxplot(data=num_outliers, orient="v", palette="Set2")

plt.xticks(rotation=45)
plt.ylabel("Values")
plt.title("Boxplots")

plt.tight_layout()
plt.show()

- avg_glucose_level has outliers that could be treated as a separate group. We can investigate them separately.
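To isolate those points for a separate look, one option is the 1.5 × IQR rule that boxplots use by default for their whiskers. The helper name `iqr_outlier_mask` and the sample values below are made up for illustration.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up glucose readings; only the last one lies far from the rest.
glucose = pd.Series(
    [80.0, 85.0, 90.0, 95.0, 100.0, 105.0, 110.0, 230.0],
    name="avg_glucose_level",
)

mask = iqr_outlier_mask(glucose)
print(glucose[mask])
```

On the real data the same mask could be applied as `df[iqr_outlier_mask(df['avg_glucose_level'])]` to inspect the flagged rows as their own group.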

In [None]:
bmi = df[['bmi']]

plt.figure(figsize=(14, 6))

# Horizontal boxplot: the BMI values run along the x-axis
sns.boxplot(data=bmi, orient="h", palette="Set2")

plt.xticks(rotation=45)
plt.xlabel("Values")
plt.title("BMI Boxplot")

plt.tight_layout()
plt.show()

In [None]:
sns.pairplot(data=df[num_columns], hue = "stroke", corner= True)

- The age graph shows a high concentration of strokes on the right side (50+ years old).

- We will continue with a detailed analysis of each variable.

- avg_glucose_level has two peaks in the graphs, which can be investigated as two different groups (levels 1-160 and 160-300).

In [None]:
sns.relplot(data= df, x="age", y ='bmi', hue = 'stroke', col ='stroke')

- Based on the graph above, many people aged 50+ who had strokes are also obese, so there may be some correlation (to be confirmed by the heatmap).

- Underweight: BMI below 18.5

- Normal weight: BMI between 18.5 and 24.9

- Overweight: BMI between 25 and 29.9

- Mild obesity: BMI between 30 and 34.9

- Moderate obesity: BMI between 35 and 39.9

- Severe obesity: BMI of 40 or higher
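The BMI ranges listed above can be turned into a categorical column with `pd.cut`. This is a minimal sketch on made-up values; the helper name `bmi_category` and the English labels are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Left-closed bins, so 18.5 counts as normal weight and 25 as overweight.
bmi_bins = [0, 18.5, 25, 30, 35, 40, np.inf]
bmi_labels = [
    "Underweight",
    "Normal weight",
    "Overweight",
    "Mild obesity",
    "Moderate obesity",
    "Severe obesity",
]

def bmi_category(bmi: pd.Series) -> pd.Series:
    """Map BMI values to the weight categories listed above."""
    return pd.cut(bmi, bins=bmi_bins, labels=bmi_labels, right=False)

# Made-up BMI values, one per category.
example_bmi = pd.Series([17.0, 22.4, 27.9, 31.5, 36.0, 42.3])
example_categories = bmi_category(example_bmi)
print(example_categories.tolist())
```

Applied to the notebook's data this would be `df['bmi_category'] = bmi_category(df['bmi'])`, where `bmi_category` as a column name is hypothetical.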

In [None]:
sns.relplot(data= df, x="age", y ='avg_glucose_level', hue = 'stroke', col ='stroke')

- Normal glucose levels are between 70 and 126 mg/dL, values of 140-199 mg/dL indicate the onset of diabetes, and values over 200 mg/dL indicate a critical stage of diabetes. In the graph the points cluster on two sides, the normal and the critical levels, with few data points in the prediabetes range.

In [None]:
sns.relplot(data= df, x="bmi", y ='avg_glucose_level', hue = 'stroke', col ='stroke')

- From this view we can define two groups: (1) normal glucose level and high BMI (overweight), and (2) high glucose level (diabetes) and high BMI (overweight). The idea is to create additional group columns in the dataset to analyse the impact by group.
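A possible way to build such a group column is sketched below on made-up rows. The cut-offs (BMI of 25 or more as overweight, glucose of 200 or more as high) and the column name `glucose_bmi_group` are assumptions for illustration and would need to be agreed on before use.

```python
import numpy as np
import pandas as pd

# Made-up patients (not real data).
patients = pd.DataFrame({
    "bmi": [22.0, 31.0, 33.5, 28.0],
    "avg_glucose_level": [90.0, 95.0, 215.0, 230.0],
})

# Assumed thresholds, see the lead-in above.
overweight = patients["bmi"] >= 25
high_glucose = patients["avg_glucose_level"] >= 200

# Rows that match neither condition fall into "other".
patients["glucose_bmi_group"] = np.select(
    [overweight & ~high_glucose, overweight & high_glucose],
    ["normal_glucose_high_bmi", "high_glucose_high_bmi"],
    default="other",
)

print(patients)
```

The new column can then be used as `hue` in the plots above or one-hot encoded together with the other categorical variables.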

In [None]:
contingency_table = pd.crosstab(df['gender'], df['smoking_status'])

print(contingency_table)

contingency_table.plot(kind='bar', stacked=False)
plt.title('Smoking status by gender')
plt.xlabel('Gender')  # the crosstab index (gender) is on the x-axis
plt.ylabel('Number of patients')
plt.show()

- There is a high number of "Unknown" values in the smoking_status column, so there are different ways to deal with it: impute "never smoked" for all children under 12 years old, and for the rest either use KNN imputation or leave "Unknown" as its own class.

- According to the graph, more females than males have never smoked.
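The rule-based part of that idea (children under 12 with an "Unknown" status become "never smoked") could be sketched as follows, on made-up rows. The KNN imputation step for the remaining rows is not shown here.

```python
import pandas as pd

# Made-up rows (not real data).
people = pd.DataFrame({
    "age": [3.0, 11.0, 12.0, 45.0],
    "smoking_status": ["Unknown", "Unknown", "Unknown", "smokes"],
})

# Only rows that are both under 12 and "Unknown" are changed.
is_child_unknown = (people["age"] < 12) & (people["smoking_status"] == "Unknown")
people.loc[is_child_unknown, "smoking_status"] = "never smoked"

print(people)
```

Rows aged 12 or older keep their "Unknown" label, so either of the two remaining options (imputation or a separate class) can still be applied afterwards.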

In [None]:
contingency_table = pd.crosstab([df['gender'], df['smoking_status']], df['stroke'])

print(contingency_table)

contingency_table.plot(kind='bar', stacked=False)
plt.title('Stroke counts by gender and smoking status')
plt.xlabel('Gender, smoking status')
plt.ylabel('Number of patients')
plt.legend(title='Stroke', labels=['No Stroke', 'Stroke'])
plt.show()

### Heatmap: Correlation between Numerical Variables

In [None]:
# Correlation heatmap

# Use `num` (numeric columns only) so corr() is not given object columns
correlacion_numericas = num.corr()

# Create the heatmap
plt.figure(figsize=(8, 6))  # Sets the figure size; optional.
sns.heatmap(correlacion_numericas, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numeric Columns')
plt.show()

We did not observe any strong correlation between the numeric variables.

In [None]:
#Ever married-stroke
Ever_married=pd.crosstab(df['ever_married'],df['stroke'])
Ever_married.div(Ever_married.sum(1).astype(float), axis=0).plot.bar(stacked=False, figsize=(4,4),color = ['skyblue','coral'])
plt.xticks(rotation = 360)
plt.xlabel('Ever Married');

We observed a higher proportion of strokes among people who have ever been married.

### Transforming categorical variables into numerical (Heatmap)

Make a copy of the dataset.

In [None]:
df_onehot=df.copy()

We apply a one-hot encoder to the categorical columns, dropping the first column when the variable is binary.

In [None]:
cat = df.select_dtypes(include = ['object'])
cat_columns = list(cat)

print(cat_columns)

In [None]:
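# `sparse` was renamed to `sparse_output` in scikit-learn 1.2; use `sparse=False` on older versions
onehot = OneHotEncoder(drop = 'if_binary', handle_unknown='ignore', sparse_output=False)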


In [None]:
X = onehot.fit_transform(cat)

column_names = onehot.get_feature_names_out()

df_encoded = pd.DataFrame(data = X, columns = column_names)

print(df_encoded)

In [None]:
concatenated_df = pd.concat([num, df_encoded], axis=1)
concatenated_df.head()

In [None]:
concatenated_df.shape

In [None]:
# Correlation heatmap
all_correlation = concatenated_df.corr()

# Create the heatmap, masking the upper triangle to avoid duplicated cells
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(all_correlation, dtype = bool))
sns.heatmap(all_correlation, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5, mask = mask)
plt.title('Heatmap of all Variables (Categorical Encoded)')
plt.show()

- Moderate correlation between age and work_type_children (-0.64)

- Moderate correlation between age and ever_married_Yes (0.68)

- Moderate correlation between ever_married_Yes and work_type_children (-0.55)

- Moderate correlation between work_type_Private and work_type_Self-employed (-0.51)

- Moderate correlation between work_type_children and smoking_status_Unknown (0.51)

- Moderate correlation between smoking_status_Unknown and smoking_status_never smoked (-0.50)

There may be multicollinearity between variables whose correlation coefficient exceeds 0.6 in absolute value. We could try eliminating one of them or combining them into a single variable to improve the model's performance.
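Instead of reading the candidate pairs off the heatmap, they can be listed programmatically. The sketch below uses a hypothetical helper, `high_correlation_pairs`, and a small made-up frame; the 0.6 threshold comes from the note above.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(frame: pd.DataFrame, threshold: float = 0.6) -> pd.Series:
    """List column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr()
    # Keep the strictly upper triangle so each pair appears only once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    strong = pairs[pairs.abs() > threshold]
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)

# Made-up data: "b" is roughly 2 * "a", "c" is only weakly related to both.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

strong_pairs = high_correlation_pairs(demo)
print(strong_pairs)
```

On the notebook's data the call would be `high_correlation_pairs(concatenated_df.drop('stroke', axis=1))`, which leaves the target out of the comparison.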

### ML model test

In [None]:
model = RandomForestClassifier()

In [None]:
X, y = concatenated_df.drop('stroke',axis = 1),concatenated_df['stroke'] 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
model.fit(X_train,y_train)
y_pred = model.predict(X_train)

# Compute the metrics on the training set
accuracy = round(accuracy_score(y_train, y_pred), 3)
precision = round(precision_score(y_train, y_pred), 3)
recall = round(recall_score(y_train, y_pred), 3)

print('Model: {} || Accuracy: {} || Precision: {} || Recall: {}'.format(model,accuracy, precision, recall))

In [None]:
conf_matrix = confusion_matrix(y_train, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Training Set)')
plt.show()

In [None]:
y_pred = model.predict(X_test)

# Compute the metrics on the test set
accuracy = round(accuracy_score(y_test, y_pred), 3)
precision = round(precision_score(y_test, y_pred), 3)
recall = round(recall_score(y_test, y_pred), 3)

print('Model: {} || Accuracy: {} || Precision: {} || Recall: {}'.format(model,accuracy, precision, recall))

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Test Set)')
plt.show()

With this preliminary test of a Random Forest model, we see that the model overfits the training data. This is probably due to the unbalanced classes: there are few samples in the stroke category.
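One possible next step, sketched here on synthetic data from `make_classification` rather than on the stroke dataset, is to stratify the split, weight the classes, and limit tree depth. The specific values (`max_depth=5`, a 95/5 class ratio) are assumptions for illustration, not tuned settings, and whether this helps on the real data would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (about 5% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# stratify keeps the class ratio the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# class_weight="balanced" up-weights the rare class;
# max_depth limits how closely each tree can fit the training set.
balanced_model = RandomForestClassifier(
    n_estimators=100, max_depth=5, class_weight="balanced", random_state=42
)
balanced_model.fit(X_tr, y_tr)

train_recall = recall_score(y_tr, balanced_model.predict(X_tr))
test_recall = recall_score(y_te, balanced_model.predict(X_te))
print(f"Train recall: {train_recall:.3f} || Test recall: {test_recall:.3f}")
```

Comparing train and test recall side by side makes the size of the overfitting gap explicit, which is harder to judge from accuracy alone when one class dominates.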
