## Heart Stroke Prediction

### 0. Summary

#### Goal

Overall: To forecast the likelihood of a patient experiencing a stroke as a function of:
 - presence of diseases,
 - age,
 - gender,
 - smoking status,
 - etc.
 
 This part focuses on a relatively extensive exploratory data analysis (EDA) of the Heart Stroke Prediction data set.

#### Dataset

The data is a plain-text file of comma-separated values (CSV), available in various places (for example [here](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)), with the following features:
 1. gender (object)
 2. age (float)
 3. hypertension (int)
 4. heart_disease (int)
 5. ever_married (object)
 6. work_type (object)
 7. Residence_type (object)
 8. avg_glucose_level (float)
 9. bmi (float)
 10. smoking_status (object)
 11. stroke (int)

There is a total of 5110 records.

---
### 1. Python Modules

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as tck
import pandas as pd
import seaborn as sns
import sklearn as skl 
import statsmodels.api as smapi
import scipy as scp
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

---
### 2. Preprocess Data for EDA

In [7]:
df_raw = pd.read_csv("heart_stroke_prediction.csv")

Split the dataframe into 3 dataframes:
1. Numerical features
2. Categorical features
3. Boolean features

In [8]:
df_cat = df_raw[['gender','ever_married','work_type','Residence_type','smoking_status']].copy() # objects
df_num = df_raw[['age','avg_glucose_level','bmi']].copy() # floats
df_int = df_raw[['hypertension','heart_disease','stroke']].copy() # ints (values are either 1 or 0)
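
The split above lists the columns by hand. As an alternative, the same grouping can be derived from the dtypes themselves. The following is a minimal sketch on a made-up toy frame (the names `toy`, `toy_cat`, `toy_num`, and `toy_int` are hypothetical and the values are illustrative, not taken from the data set). Note that any other integer column in the real file would also land in the integer group.

```python
import pandas as pd

# Toy frame mimicking the dtypes of the source data (values are made up).
toy = pd.DataFrame({
    "gender": pd.Series(["Male", "Female"], dtype="object"),
    "age": pd.Series([67.0, 0.08], dtype="float64"),
    "hypertension": pd.Series([0, 1], dtype="int64"),
    "bmi": pd.Series([36.6, None], dtype="float64"),
    "stroke": pd.Series([1, 0], dtype="int64"),
})

# Select columns by dtype instead of by name.
toy_cat = toy.select_dtypes(include=["object"]).copy()   # string-valued columns
toy_num = toy.select_dtypes(include=["float64"]).copy()  # continuous columns
toy_int = toy.select_dtypes(include=["int64"]).copy()    # 0/1 flag columns

print(list(toy_cat.columns), list(toy_num.columns), list(toy_int.columns))
```

Selecting by dtype avoids typos in column names, at the cost of trusting that the dtypes inferred by `read_csv` match the intended meaning of each column.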

#### Categorical dataframe

In [9]:
for col in df_cat.columns:
    print(col, df_cat[col].unique())

gender ['Male' 'Female' 'Other']
ever_married ['Yes' 'No']
work_type ['Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked']
Residence_type ['Urban' 'Rural']
smoking_status ['formerly smoked' 'never smoked' 'smokes' 'Unknown']


There are no phantom or duplicated categories. Normalize these categorical features by replacing dashes and spaces with underscores, and by converting everything to lower case.

In [10]:
# 1. replace spaces and dashes with underscores
df_cat['work_type']      = df_cat['work_type'     ].replace('Self-employed',   'Self_employed')
df_cat['smoking_status'] = df_cat['smoking_status'].replace('formerly smoked', 'formerly_smoked')
df_cat['smoking_status'] = df_cat['smoking_status'].replace('never smoked',    'never_smoked')
# 2. convert all values to lower case
for col in df_cat.columns:
    df_cat[col] = df_cat[col].str.lower()
# 3. convert the column names to lower case (only 'Residence_type' is capitalized)
df_cat.columns = df_cat.columns.str.lower()
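
The normalization above replaces each offending label individually. A more general variant can be sketched with vectorized string methods, so that any future label containing spaces or dashes is handled as well. The helper name `normalize_labels` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with a few of the labels printed above.
toy_cat = pd.DataFrame({
    "work_type": ["Private", "Self-employed", "Govt_job"],
    "smoking_status": ["formerly smoked", "never smoked", "Unknown"],
    "Residence_type": ["Urban", "Rural", "Urban"],
})

def normalize_labels(s: pd.Series) -> pd.Series:
    """Replace runs of whitespace or dashes with one underscore, then lower-case."""
    return s.str.replace(r"[\s-]+", "_", regex=True).str.lower()

toy_cat = toy_cat.apply(normalize_labels)        # applied column by column
toy_cat.columns = toy_cat.columns.str.lower()    # normalize the column names too

print(toy_cat["work_type"].tolist())
print(toy_cat["smoking_status"].tolist())
```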

Count all NaN, None, and NaT entries in the categorical dataframe. A count of 0 means the column has no missing values.

In [11]:
df_cat.isna().sum()

gender            0
ever_married      0
work_type         0
residence_type    0
smoking_status    0
dtype: int64

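Check for entries of the wrong type: every value in the categorical dataframe should be a string. (A check against `object` would pass for any Python value, so the check is made against `str`.)

In [12]:
for col in df_cat.columns:
    print(col, df_cat[col].apply(lambda x: isinstance(x, str)).value_counts().get(True))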

gender 5110
ever_married 5110
work_type 5110
residence_type 5110
smoking_status 5110
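
A count of 5110 in every column means no entry failed the type check. To see what a failing entry would look like, here is a minimal sketch on a made-up series (the series `s` and its contents are purely illustrative). It contrasts a check against `object`, which every Python value passes, with a check against `str`.

```python
import pandas as pd

# Made-up column containing two strings, one missing value, and one stray integer.
s = pd.Series(["male", None, 3, "female"], dtype="object")

always_true = s.apply(lambda x: isinstance(x, object))  # every Python value is an object
is_string = s.apply(lambda x: isinstance(x, str))       # only genuine strings pass

print(int(always_true.sum()), int(is_string.sum()))
print(s[~is_string].tolist())  # the entries that are not strings
```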


#### Numerical dataframe

In [13]:
df_num.describe()

| | age | avg_glucose_level | bmi |
|---|---|---|---|
| count | 5110.0 | 5110.0 | 4909.0 |
| mean | 43.226614 | 106.147677 | 28.893237 |
| std | 22.612647 | 45.28356 | 7.854067 |
| min | 0.08 | 55.12 | 10.3 |
| 25% | 25.0 | 77.245 | 23.5 |
| 50% | 45.0 | 91.885 | 28.1 |
| 75% | 61.0 | 114.09 | 33.1 |
| max | 82.0 | 271.74 | 97.6 |


* The values for `age` range from 0.08 (newborn) to 82, and there are no nonsensical values.
* Columns `hypertension`, `heart_disease`, and `stroke` are stored as int64 but are effectively boolean (each value is either 1 or 0). Quantities such as the mean and the std are therefore meaningless for the purposes of this step, which is why these columns are kept in `df_int` and left out of the summary above.
* `avg_glucose_level` has a maximum of 271.74, which indicates a medical emergency, but which does not appear to be unreasonable.
* Similarly, the maximum `bmi` value (97.6) indicates a medically urgent scenario, but is nevertheless entirely plausible.

Count all NaN, None, and NaT entries in the numerical dataframe.

In [14]:
df_num.isna().sum()

age                    0
avg_glucose_level      0
bmi                  201
dtype: int64
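
The raw counts can be turned into percentages by taking the mean of the missing-value mask. The sketch below uses a made-up toy frame (the name `toy_num` and its values are illustrative) and then repeats the arithmetic with the counts reported above.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing bmi value out of four rows.
toy_num = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# The mean of a boolean mask is the fraction of True entries.
missing_pct = toy_num.isna().mean() * 100
print(missing_pct)

# Same arithmetic with the counts reported above: 201 missing out of 5110 rows.
bmi_missing_pct = round(201 / 5110 * 100, 2)
print(bmi_missing_pct)
```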

Around 4% of the `bmi` values (201 of 5110) are missing. Replacing them with the `mean()` would be problematic, because the mean is pulled toward the outliers. So, replace them with the median of the existing values instead. This imputation is meant only for the EDA steps ahead; different imputation techniques will be explored during the feature engineering stage.

In [15]:
median_bmi_val = df_num["bmi"].median()
df_num["bmi"] = df_num["bmi"].fillna(median_bmi_val)

In [16]:
df_num.isna().sum()

age                  0
avg_glucose_level    0
bmi                  0
dtype: int64
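
To illustrate why the median was preferred over the mean for this imputation, here is a minimal sketch on a made-up series with one extreme value and one gap (the values are illustrative, apart from 97.6, which is the maximum reported above).

```python
import numpy as np
import pandas as pd

# Four ordinary values, one extreme value, and one missing entry.
bmi = pd.Series([22.0, 24.0, 26.0, 28.0, 97.6, np.nan])

mean_val = bmi.mean()      # pulled upward by the extreme value
median_val = bmi.median()  # middle of the observed values; NaN is skipped

filled = bmi.fillna(median_val)
print(mean_val, median_val)
print(filled.tolist())
```

In this toy case the mean lands above four of the five observed values, whereas the median stays in the middle of the bulk of the data.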

Next, check for out-of-place datatypes. Note that every value evaluates to True when type checked against `object`, whereas an int evaluates to False when type checked against `float`.

In [17]:
# count, per column, the entries that are floats
for col in df_num.columns:
    print(df_num[col].apply(lambda x: isinstance(x, float)).value_counts().get(True))

5110
5110
5110
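
The element-wise check above relies on how `isinstance` treats NumPy scalars. The following sketch shows that behaviour on made-up values, together with a column-level dtype check as a cheaper alternative. The names `float_checks` and `toy` are hypothetical. Note that NaN is itself a float, so this check says nothing about missing values.

```python
import numpy as np
import pandas as pd

float_checks = {
    "np.float64": isinstance(np.float64(1.5), float),  # np.float64 subclasses Python float
    "np.int64": isinstance(np.int64(1), float),        # integers fail a float check
    "nan": isinstance(float("nan"), float),            # NaN passes, since it is a float
}
print(float_checks)

# Column-level alternative: inspect the dtype once instead of every element.
toy = pd.DataFrame({"age": [67.0, 0.08], "stroke": [1, 0]})
age_is_float = pd.api.types.is_float_dtype(toy["age"])
stroke_is_float = pd.api.types.is_float_dtype(toy["stroke"])
print(age_is_float, stroke_is_float)
```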
