## **Notebook Goal**

This project focuses on the development of a Machine Learning Model for Stroke Prediction, carried out by the Data Analysis department at Hospital F5.

The stroke prediction model uses 10 predictor variables plus the `stroke` target (11 columns in total), drawn from a dataset of 4,981 records of positive and negative stroke cases. The predictors serve as features to train and evaluate several machine learning classification algorithms.

## **Notebook Content**

0. Importing Libraries and Dataset

1. Basic Understanding of Data

2. Exploratory Data Analysis (EDA)

3. Feature Engineering

4. Data Preprocessing

5. Model Building

6. Model Performance Check

7. Model Hyperparameter Tuning

8. Analysis of the Most Influential Features in the Model

9. Conclusion

## **Importing Libraries and Dataset**

In [1]:
# Analysis libraries
import numpy as np
import pandas as pd
import math
from scipy import stats
from tabulate import tabulate

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries

In [2]:
import pandas as pd
path = "/work/stroke_dataset.csv"
df = pd.read_csv(path)

df.head()

| | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 2 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 3 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
| 4 | Male | 81.0 | 0 | 0 | Yes | Private | Urban | 186.21 | 29.0 | formerly smoked | 1 |


## **1. Basic Understanding of Data**

1.1. Data Description

1.2. Unique Values in Each Column

1.3. Data Dimension Check

1.4. Data Type Check

🔹 Categorical

🔹 Numeric

🔹 Mixed Data Types

🔹 Errors or Typos

1.5. Duplicate Data Check

1.6. Total Number and Percentage of Missing Values Check

1.7. Cardinality Check of Categorical Features

### **1.1. Data Description**

In [3]:
# Feature, data type and non-null count
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4981 entries, 0 to 4980
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             4981 non-null   object 
 1   age                4981 non-null   float64
 2   hypertension       4981 non-null   int64  
 3   heart_disease      4981 non-null   int64  
 4   ever_married       4981 non-null   object 
 5   work_type          4981 non-null   object 
 6   Residence_type     4981 non-null   object 
 7   avg_glucose_level  4981 non-null   float64
 8   bmi                4981 non-null   float64
 9   smoking_status     4981 non-null   object 
 10  stroke             4981 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 428.2+ KB


This dataset is a table represented as a Pandas DataFrame, with the following characteristics:

**Number of Entries (Rows):** 4981

**Number of Columns (Attributes):** 11

Here are the descriptions of each of the columns:

**gender:** A categorical variable representing the gender of individuals (Male or Female in this dataset).

**age:** A numerical variable representing the age of individuals in years (data type: float).

**hypertension:** A binary variable (0 or 1) indicating whether the person has hypertension (0 = No, 1 = Yes).

**heart_disease:** A binary variable (0 or 1) indicating whether the person has heart disease (0 = No, 1 = Yes).

**ever_married:** A categorical variable indicating whether the person has been married before (e.g., Yes or No).

**work_type:** A categorical variable describing the type of work the person does (Private, Self-employed, Govt_job, or children).

**Residence_type:** A categorical variable describing the type of residence of the person (e.g., urban or rural).

**avg_glucose_level:** A numerical variable representing the average blood glucose level of individuals (data type: float).

**bmi:** A numerical variable representing the body mass index (BMI) of individuals (data type: float).

**smoking_status:** A categorical variable describing the smoking status of individuals (formerly smoked, never smoked, smokes, or Unknown).

**stroke:** A binary variable (0 or 1) indicating whether the person has had a stroke (0 = No, 1 = Yes).

In [4]:
# Summary statistics for numerical features
df.describe()

| | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
|---|---|---|---|---|---|---|
| count | 4981.0 | 4981.0 | 4981.0 | 4981.0 | 4981.0 | 4981.0 |
| mean | 43.419859 | 0.096165 | 0.05521 | 105.943562 | 28.498173 | 0.049789 |
| std | 22.662755 | 0.294848 | 0.228412 | 45.075373 | 6.790464 | 0.217531 |
| min | 0.08 | 0.0 | 0.0 | 55.12 | 14.0 | 0.0 |
| 25% | 25.0 | 0.0 | 0.0 | 77.23 | 23.7 | 0.0 |
| 50% | 45.0 | 0.0 | 0.0 | 91.85 | 28.1 | 0.0 |
| 75% | 61.0 | 0.0 | 0.0 | 113.86 | 32.6 | 0.0 |
| max | 82.0 | 1.0 | 1.0 | 271.74 | 48.9 | 1.0 |


### **1.2.  Unique Values in Each Column**

In [5]:
# Collect the unique values of the low-cardinality columns
cols_con_unicos = [
    'gender',
    'hypertension',
    'ever_married',
    'work_type',
    'Residence_type',
    'smoking_status',
    'stroke'
]

vals_unicos = {}
for col in cols_con_unicos:
    valores = df[col].unique()
    vals_unicos[col] = valores

# Convert the dictionary into a list of rows for tabulate
tabla_datos = []
for col, valores in vals_unicos.items():
    tabla_datos.append([col, ', '.join(map(str, valores))])

# Print the table
print(tabulate(tabla_datos, headers=["Column", "Unique Values"], tablefmt="grid"))

+----------------+------------------------------------------------+
| Column         | Unique Values                                  |
+================+================================================+
| gender         | Male, Female                                   |
+----------------+------------------------------------------------+
| hypertension   | 0, 1                                           |
+----------------+------------------------------------------------+
| ever_married   | Yes, No                                        |
+----------------+------------------------------------------------+
| work_type      | Private, Self-employed, Govt_job, children     |
+----------------+------------------------------------------------+
| Residence_type | Urban, Rural                                   |
+----------------+------------------------------------------------+
| smoking_status | formerly smoked, never smoked, smokes, Unknown |
+----------------+------------------------------------------------+
| stroke         | 1, 0                                           |
+----------------+------------------------------------------------+

### **1.3. Data Dimension Check**

In [9]:
# Number of rows and columns in dataset
df.shape

(4981, 11)

### **1.4. Data Type Check**

**Categorical**

**Numeric**

**Mixed Data Types**

**Errors or Typos**
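The mixed-type and typo checks listed above could be approached along the following lines. This is only a sketch on a small made-up frame (the `sample` values, including the string `"32.5"` and the label `"male "`, are invented for illustration and do not come from the stroke dataset); in the notebook the same calls would run on `df`.

```python
import pandas as pd

# Hypothetical sample with one mixed-type column and one mistyped label.
sample = pd.DataFrame({
    "gender": ["Male", "Female", "male ", "Female"],
    "bmi": [36.6, "32.5", 34.4, 24.0],
})

# Count the distinct Python types in each column; more than one means mixed types.
types_per_column = sample.apply(lambda col: col.map(type).nunique())
mixed_columns = types_per_column[types_per_column > 1].index.tolist()
print("Columns with mixed types:", mixed_columns)

# Labels that collapse together after trimming and lowercasing hint at typos.
raw_labels = sample["gender"].nunique()
normalized_labels = sample["gender"].str.strip().str.lower().nunique()
print("Raw labels:", raw_labels, "| normalized labels:", normalized_labels)
```

If `normalized_labels` is smaller than `raw_labels`, some categories differ only by case or stray whitespace and are worth inspecting by hand.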

### **1.5. Duplicate Data Check**
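A duplicate check typically counts rows that repeat an earlier row exactly. The sketch below uses a small hypothetical frame (`sample`, with one row deliberately repeated) so that it runs on its own; in the notebook the same two calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical sample; the last row repeats the first one on purpose.
sample = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Male"],
    "age": [67.0, 80.0, 49.0, 67.0],
    "stroke": [1, 1, 1, 1],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(sample.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# Drop the repeats, keeping the first occurrence of each row.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(f"Rows after removing duplicates: {len(deduplicated)}")
```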

### **1.6. Missing Values Check**

In [10]:
# Check for missing values in the dataset
missing_values = df.isnull().sum()
missing_values

gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
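The section outline also mentions the percentage of missing values. A possible way to report counts and percentages side by side is sketched here on a hypothetical frame with invented gaps (the stroke dataset itself shows no missing values in the output above); `missing_summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with artificial gaps, for illustration only.
sample = pd.DataFrame({
    "age": [67.0, 80.0, np.nan, 79.0],
    "bmi": [36.6, np.nan, np.nan, 24.0],
    "stroke": [1, 1, 1, 1],
})

# Absolute count of missing values per column.
missing_count = sample.isnull().sum()

# Share of missing values per column, as a percentage.
missing_pct = (sample.isnull().mean() * 100).round(2)

missing_summary = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": missing_pct,
})
print(missing_summary)
```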

### **1.7. Cardinality Check**

**Cardinality of Numeric Columns**

In [11]:
numeric_columns = df.select_dtypes(include=["float64", "int64"]).columns.tolist()

**Cardinality of Categorical Columns**

In [12]:
categorical_columns = df.select_dtypes(include=["object"]).columns.tolist()
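Cardinality is the number of distinct values in a column. As a rough sketch, `nunique()` gives it directly, and the result can then be split into numeric and categorical columns. The frame below is a hypothetical subset built from the first rows shown earlier; the split uses `pd.api.types.is_numeric_dtype` rather than the dtype lists above, which is an assumption made here for portability across pandas versions.

```python
import pandas as pd

# Hypothetical subset of columns, based on the first rows of the dataset.
sample = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Male"],
    "work_type": ["Private", "Private", "Private", "Self-employed", "Private"],
    "age": [67.0, 80.0, 49.0, 79.0, 81.0],
    "stroke": [1, 1, 1, 1, 1],
})

# Number of distinct values in every column.
cardinality = sample.nunique()

# Split the result by column kind.
numeric_names = [c for c in sample.columns if pd.api.types.is_numeric_dtype(sample[c])]
numeric_cardinality = cardinality[numeric_names]
categorical_cardinality = cardinality.drop(numeric_names)

print("Numeric columns:\n", numeric_cardinality)
print("Categorical columns:\n", categorical_cardinality)
```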

## **2. Exploratory Data Analysis (EDA)**


### **2.1. Analysis of the Target Variable 'stroke'**
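The `describe()` output earlier reports a mean of about 0.05 for `stroke`, which suggests that roughly 5% of the records are positive cases. A class-distribution table makes that imbalance explicit. The sketch below uses a hypothetical ten-value series so it runs standalone; in the notebook, `df["stroke"]` would take its place.

```python
import pandas as pd

# Hypothetical target column: nine negative cases and one positive case.
stroke = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 0], name="stroke")

# Absolute and relative frequency of each class.
class_counts = stroke.value_counts()
class_share = stroke.value_counts(normalize=True).mul(100).round(2)

target_summary = pd.DataFrame({"count": class_counts, "percent": class_share})
print(target_summary)
```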

## **3. Feature Engineering**


**TO DO**

- Round 'age' and convert it to an integer.
- Reduce the 'smoking_status' column to a binary variable: true or false (1, 0).
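One way the two items above might be implemented is sketched below. The binary mapping is an assumption, since the notebook does not say how the four categories should collapse: here 1 means the person currently smokes or formerly smoked, and 0 covers both "never smoked" and "Unknown". The `sample` frame and the `ever_smoked` set are illustrative names.

```python
import pandas as pd

# Hypothetical sample covering all four smoking categories.
sample = pd.DataFrame({
    "age": [67.0, 0.08, 49.0, 1.64],
    "smoking_status": ["formerly smoked", "Unknown", "smokes", "never smoked"],
})

# Round age to the nearest year and store it as an integer.
# Note that infants with fractional ages below 0.5 become 0.
sample["age"] = sample["age"].round().astype(int)

# Assumed mapping: 1 = has ever smoked, 0 = never smoked or unknown.
ever_smoked = {"formerly smoked", "smokes"}
sample["smoking_status"] = sample["smoking_status"].isin(ever_smoked).astype(int)

print(sample)
```

Folding "Unknown" into 0 discards the distinction between non-smokers and missing information, so it may be worth comparing against a model that keeps "Unknown" as its own category.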


## **4. Data Preprocessing**

4.1. Dropping Unused Columns

4.2. Null Values Imputation

4.3. Feature Encoding

4.4. Feature Balancing

4.5. Feature and Target Variable Selection

4.6. Train-Test Split

4.7. Feature Scaling

### **4.1. Dropping Unused Columns**

### **4.2. Null Values Imputation**

**TO DO**

- Replace the 'Unknown' values in 'smoking_status' for children under 12 years old with 'never smoked'.
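The rule above can be expressed as a boolean mask combined with `.loc` assignment. This is a minimal sketch on a hypothetical frame; "under 12" is read as strictly less than 12, which is an assumption about the intended cutoff.

```python
import pandas as pd

# Hypothetical sample with children and one adult whose status is unknown.
sample = pd.DataFrame({
    "age": [3.0, 11.0, 12.0, 45.0, 8.0],
    "smoking_status": ["Unknown", "Unknown", "Unknown", "Unknown", "never smoked"],
})

# Rows for children under 12 whose smoking status is unknown.
is_young_unknown = (sample["age"] < 12) & (sample["smoking_status"] == "Unknown")

# Overwrite only those rows; everyone else keeps the original value.
sample.loc[is_young_unknown, "smoking_status"] = "never smoked"

print(sample)
```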

### **4.3. Feature Encoding**
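The notebook does not yet specify an encoding scheme. One common option, shown here purely as a sketch, is to map two-level columns to 0/1 and one-hot encode columns with more levels using `pd.get_dummies`. The `sample` frame is hypothetical and the choice of columns is an assumption.

```python
import pandas as pd

# Hypothetical sample with one binary and one multi-level categorical column.
sample = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "work_type": ["Private", "children", "Govt_job"],
})

# Binary column: map the two labels directly to 0 and 1.
sample["ever_married"] = sample["ever_married"].map({"No": 0, "Yes": 1})

# Multi-level column: one indicator column per category, stored as integers.
encoded = pd.get_dummies(sample, columns=["work_type"], dtype=int)

print(encoded)
```

For linear models, passing `drop_first=True` to `pd.get_dummies` removes one indicator per feature and avoids perfectly collinear columns.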

### **4.4. Feature Balancing**

### **4.5. Feature and Target Variable Selection**

### **4.6. Train-Test Split**

### **4.7. Feature Scaling**

## **5. Model Building**

## **6. Model Performance Check**

## **7. Model Hyperparameter Tuning**

## **8. Analysis of the Most Influential Features in the Model**

## **9. Conclusion**
