### Module Import

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Definition

In [2]:
df = pd.read_csv('stroke_dataset.csv')
df.head()

,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
2,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
3,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
4,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4981 entries, 0 to 4980
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             4981 non-null   object 
 1   age                4981 non-null   float64
 2   hypertension       4981 non-null   int64  
 3   heart_disease      4981 non-null   int64  
 4   ever_married       4981 non-null   object 
 5   work_type          4981 non-null   object 
 6   Residence_type     4981 non-null   object 
 7   avg_glucose_level  4981 non-null   float64
 8   bmi                4981 non-null   float64
 9   smoking_status     4981 non-null   object 
 10  stroke             4981 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 428.2+ KB


### Field Descriptions

- Gender: The person's gender, indicating whether they are male or female.

- Age: The person's age in years, stored as a float. Ages under one year appear as fractions (for example, 0.08).

- Hypertension: Indicates whether the person has hypertension or high blood pressure (1 if they have it, 0 if they don't).

- Heart disease: Indicates whether the person has heart disease (1 if they have it, 0 if they don't).

- Ever_married: Indicates whether the person has ever been married (yes or no).

- Work_type: The person's type of work. The dataset uses four categories: "Private", "Self-employed", "Govt_job", and "children".

- Residence_type: The type of residence of the person, which can be "Rural" or "Urban," indicating whether they live in a rural or urban area.

- Avg_glucose_level: The person's average blood glucose level, a health measure that is especially relevant to diabetes. Stored as a float.

- bmi: The person's Body Mass Index (BMI), which relates weight to height and is used as an indicator of body composition and potential obesity. Stored as a float.

- Smoking_status: The person's smoking status. The dataset uses four categories: "never smoked", "formerly smoked", "smokes", and "Unknown".
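
The value codings described above can be checked against the data before any analysis. The following is a minimal sketch, not part of the original notebook: `EXPECTED_VALUES` and `unexpected_values` are hypothetical names, and the sample rows are made up so that the block runs without `stroke_dataset.csv`.

```python
import pandas as pd

# Allowed values per column, taken from the field descriptions above.
# The dictionary and the helper below are illustrative, not from the notebook.
EXPECTED_VALUES = {
    "gender": {"Male", "Female"},
    "hypertension": {0, 1},
    "heart_disease": {0, 1},
    "ever_married": {"Yes", "No"},
    "Residence_type": {"Rural", "Urban"},
}


def unexpected_values(frame, expected):
    """Return {column: values found in the data but not in the expected set}."""
    problems = {}
    for col, allowed in expected.items():
        # tolist() converts NumPy scalars to plain Python values
        extra = set(frame[col].unique().tolist()) - allowed
        if extra:
            problems[col] = extra
    return problems


# Made-up rows that only mimic the column layout of the dataset
sample = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "hypertension": [0, 1, 0],
    "heart_disease": [1, 0, 0],
    "ever_married": ["Yes", "Yes", "No"],
    "Residence_type": ["Urban", "Rural", "Urban"],
})

# An empty dict means every value matches the documented coding
print(unexpected_values(sample, EXPECTED_VALUES))
```

On the real data, the same call would be `unexpected_values(df, EXPECTED_VALUES)`.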

### Categorical Variables

In [4]:
cat = df.select_dtypes(include = ['object'])
cat_columns = list(cat)

In [5]:
for col in cat_columns:
    print(f'Column name: {col}')
    print(df[col].value_counts())
    print()

Column name: gender
Female    2907
Male      2074
Name: gender, dtype: int64

Column name: ever_married
Yes    3280
No     1701
Name: ever_married, dtype: int64

Column name: work_type
Private          2860
Self-employed     804
children          673
Govt_job          644
Name: work_type, dtype: int64

Column name: Residence_type
Urban    2532
Rural    2449
Name: Residence_type, dtype: int64

Column name: smoking_status
never smoked       1838
Unknown            1500
formerly smoked     867
smokes              776
Name: smoking_status, dtype: int64



### Numeric Variables

In [6]:
num = df.select_dtypes(include = ['number'])
num_columns = list(num)

In [7]:
for col in num_columns:
    print(f'Column name: {col}')
    print(df[col].value_counts())
    print()

Column name: age
78.00    102
57.00     92
54.00     85
51.00     84
79.00     84
        ... 
0.48       3
1.16       3
0.40       2
0.08       2
0.16       1
Name: age, Length: 104, dtype: int64

Column name: hypertension
0    4502
1     479
Name: hypertension, dtype: int64

Column name: heart_disease
0    4706
1     275
Name: heart_disease, dtype: int64

Column name: avg_glucose_level
93.88     6
83.16     5
73.00     5
72.49     5
91.68     5
         ..
120.09    1
197.58    1
99.91     1
133.76    1
60.50     1
Name: avg_glucose_level, Length: 3895, dtype: int64

Column name: bmi
28.7    42
28.4    41
27.3    38
26.1    37
27.7    37
        ..
46.6     1
47.9     1
46.3     1
48.0     1
14.9     1
Name: bmi, Length: 342, dtype: int64

Column name: stroke
0    4733
1     248
Name: stroke, dtype: int64



In [8]:
df.describe()

,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,4981.0,4981.0,4981.0,4981.0,4981.0,4981.0
mean,43.419859,0.096165,0.05521,105.943562,28.498173,0.049789
std,22.662755,0.294848,0.228412,45.075373,6.790464,0.217531
min,0.08,0.0,0.0,55.12,14.0,0.0
25%,25.0,0.0,0.0,77.23,23.7,0.0
50%,45.0,0.0,0.0,91.85,28.1,0.0
75%,61.0,0.0,0.0,113.86,32.6,0.0
max,82.0,1.0,1.0,271.74,48.9,1.0


### Null Values Verification

In [9]:
df.isnull().sum()

gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

### Duplicate Check

In [10]:
df.duplicated().sum()

0

### Cardinality Verification

In [11]:
df.nunique()

gender                  2
age                   104
hypertension            2
heart_disease           2
ever_married            2
work_type               4
Residence_type          2
avg_glucose_level    3895
bmi                   342
smoking_status          4
stroke                  2
dtype: int64

- The dataset contains no duplicate rows and no null values.

- Three binary columns are heavily imbalanced: hypertension (479 of 4,981 rows positive, about 9.6%), heart_disease (275, about 5.5%), and stroke (248, about 5.0%).

- hypertension, heart_disease, and stroke are stored as int64 but are effectively boolean. We could change their datatype in the future and see how the model responds.

- avg_glucose_level has high cardinality (3,895 distinct values across 4,981 rows). We could try grouping this variable into categories in the future.
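
The follow-up ideas in these notes could look roughly like the sketch below. It is only an illustration under stated assumptions: the sample rows are made up, and the bin edges and labels for `avg_glucose_level` are arbitrary placeholders, not clinical thresholds and not choices made in this notebook.

```python
import pandas as pd

# Made-up rows that only mimic the relevant columns of the dataset
sample = pd.DataFrame({
    "hypertension": [0, 1, 0, 0],
    "heart_disease": [0, 0, 0, 1],
    "stroke": [0, 0, 0, 1],
    "avg_glucose_level": [72.5, 95.0, 130.0, 228.7],
})

binary_columns = ["hypertension", "heart_disease", "stroke"]

# Share of each class per binary column, to quantify the imbalance
class_share = {
    col: sample[col].value_counts(normalize=True).to_dict()
    for col in binary_columns
}

# Cast the 0/1 integer columns to bool; astype returns a new DataFrame
converted = sample.astype({col: bool for col in binary_columns})

# Group the continuous glucose values into a few ordered categories.
# ASSUMPTION: these edges and labels are placeholders for illustration only.
glucose_bins = [0, 90, 120, 180, float("inf")]
glucose_labels = ["low", "medium", "high", "very high"]
converted["glucose_group"] = pd.cut(
    converted["avg_glucose_level"], bins=glucose_bins, labels=glucose_labels
)

print(class_share)
print(converted.dtypes)
print(converted["glucose_group"].value_counts())
```

`pd.cut` uses right-closed intervals by default, so a value equal to an edge falls into the lower group. Quantile-based grouping with `pd.qcut` is an alternative when equally sized groups matter more than fixed thresholds.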

### Outliers Exploration

In [12]:
num = df.select_dtypes(include = ['number'])
num_columns = list(num)
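
One common way to continue from here is the interquartile range (IQR) rule, which flags values more than 1.5 IQRs below the first quartile or above the third. The sketch below is an assumption about how the exploration might proceed, not the notebook's own method: `iqr_outlier_counts` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        mask = (frame[col] < lower) | (frame[col] > upper)
        counts[col] = int(mask.sum())
    return counts


# Made-up values; the last bmi entry is deliberately extreme
sample = pd.DataFrame({
    "age": [30.0, 40.0, 45.0, 50.0, 55.0, 60.0],
    "bmi": [23.7, 27.0, 28.1, 29.5, 32.6, 95.0],
})

print(iqr_outlier_counts(sample, ["age", "bmi"]))
```

On the real data it makes sense to pass only the continuous columns (`age`, `avg_glucose_level`, `bmi`) rather than all of `num_columns`. For the 0/1 columns the quartiles are both 0 (see `df.describe()`), so the rule would flag every positive case as an outlier.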
