### Intro

This is my first dive into the Stroke Prediction dataset.
My primary goal is to create a simple, insightful EDA while I learn to visualize data, primarily with the plotly.express package.
Eventually, I will use this to walk through basic Machine Learning models.

Importing Libraries

In [1]:
import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
import os

import plotly
import plotly.express as px
import plotly.io as pio
# from dash import Dash, html, dcc
# import plotly.graph_objects as go
# import plotly.figure_factory as ff

from datatile.summary.df import DataFrameSummary

from sklearn.preprocessing import LabelEncoder 


Loading the data into a dataframe

In [2]:
filepath = '../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv'
raw_log = pd.read_csv(filepath)

In [3]:
df = raw_log.copy()

dfs = DataFrameSummary(df)

### Context

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

Attribute Information

1) id: unique identifier 

2) gender: "Male", "Female" or "Other" 

3) age: age of the patient 

4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension 

5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease 

6) ever_married: "No" or "Yes" 

7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed" 

8) Residence_type: "Rural" or "Urban"

9) avg_glucose_level: average glucose level in blood 

10) bmi: body mass index 

11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

12) stroke: 1 if the patient had a stroke or 0 if not 

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

## General Data Observation

In [4]:
df.shape

(5110, 12)

In [5]:
df.head(10)

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
6,53882,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
7,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,never smoked,1
8,27419,Female,59.0,0,0,Yes,Private,Rural,76.15,,Unknown,1
9,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,Unknown,1


In [6]:
df.tail(10)

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
5100,68398,Male,82.0,1,0,Yes,Self-employed,Rural,71.97,28.3,never smoked,0
5101,36901,Female,45.0,0,0,Yes,Private,Urban,97.95,24.5,Unknown,0
5102,45010,Female,57.0,0,0,Yes,Private,Rural,77.93,21.7,never smoked,0
5103,22127,Female,18.0,0,0,No,Private,Urban,82.85,46.9,Unknown,0
5104,14180,Female,13.0,0,0,No,children,Rural,103.08,18.6,Unknown,0
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.2,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0
5109,44679,Female,44.0,0,0,Yes,Govt_job,Urban,85.28,26.2,Unknown,0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [8]:
df.describe(include= "object")

,gender,ever_married,work_type,Residence_type,smoking_status
count,5110,5110,5110,5110,5110
unique,3,2,5,2,4
top,Female,Yes,Private,Urban,never smoked
freq,2994,3353,2925,2596,1892


In [9]:
dfs.summary()

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
count,5110.0,,5110.0,5110.0,5110.0,,,,5110.0,4909.0,,5110.0
mean,36517.829354,,43.226614,0.097456,0.054012,,,,106.147677,28.893237,,0.048728
std,21161.721625,,22.612647,0.296607,0.226063,,,,45.28356,7.854067,,0.21532
min,67.0,,0.08,0.0,0.0,,,,55.12,10.3,,0.0
25%,17741.25,,25.0,0.0,0.0,,,,77.245,23.5,,0.0
50%,36932.0,,45.0,0.0,0.0,,,,91.885,28.1,,0.0
75%,54682.0,,61.0,0.0,0.0,,,,114.09,33.1,,0.0
max,72940.0,,82.0,1.0,1.0,,,,271.74,97.6,,1.0
counts,5110,5110,5110,5110,5110,5110,5110,5110,5110,4909,5110,5110
uniques,5110,3,104,2,2,2,5,2,3979,418,4,2


In [10]:
df.duplicated().value_counts()

False    5110
dtype: int64

Notes before we move on to cleaning:
- 5110 individuals in this data
- No duplicates
- ~4% of observations (201 of 5110) are missing a BMI
- Unique values in 'gender', 'work_type', and 'smoking_status' (all categorical) need to be addressed
- The 'id' column is just a unique identifier and can be dropped

## Data Cleaning

In [11]:
df = df.drop(columns= ['id'])

In [12]:
# Convert all columns to lowercase since 'Residence_type' is the only column not lowercase 
df.columns = df.columns.str.lower()

In [13]:
# Replace missing data in bmi with the mean
# (assignment instead of inplace=True, which newer pandas versions warn about on a column selection)
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())

# There are more refined routes for this, like a Decision Tree, or at least assigning the respective mean of each gender, but we will stick with a basic approach for now.

# Double check all missing values are filled
df.isnull().sum()

gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
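
The comment in the cell above mentions filling each missing BMI with the mean of that patient's gender group rather than the global mean. A minimal sketch of that idea is shown here on a small made-up frame; the values and the `bmi_filled` column name are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the technique.
toy = pd.DataFrame({
    'gender': ['Male', 'Male', 'Female', 'Female', 'Female'],
    'bmi': [30.0, None, 22.0, 26.0, None],
})

# Mean BMI per gender, broadcast back to every row of that group.
group_mean = toy.groupby('gender')['bmi'].transform('mean')

# Fill only the missing entries; observed values stay untouched.
toy['bmi_filled'] = toy['bmi'].fillna(group_mean)
print(toy)
```

The same two lines would apply to `df` directly if this route were preferred over the global mean.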

### Categorical Data Corrections

In [14]:
df['gender'].value_counts()

Female    2994
Male      2115
Other        1
Name: gender, dtype: int64

Since there is only one 'Other' in the dataset, we will make gender binary to make things easier down the road. We will change 'Other' to 'Female' since it is the mode.

In [15]:
df['gender'] = df['gender'].replace('Other', 'Female')
df['gender'].value_counts()

Female    2995
Male      2115
Name: gender, dtype: int64

In [16]:
df['work_type'].value_counts()

Private          2925
Self-employed     819
children          687
Govt_job          657
Never_worked       22
Name: work_type, dtype: int64

In [17]:
df['smoking_status'].value_counts()

never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: smoking_status, dtype: int64

Values in 'work_type' could potentially be grouped ([Private, Govt_job] and [Self-employed, children, Never_worked]), but we won't go there with this data.

'smoking_status' looks good; just keep in mind that 'Unknown' is effectively missing data and makes up a large portion here (1544 of 5110 rows, ~30%).

### Checking for Possible Outliers in the Numeric Data

In [18]:
table0 = px.box(
    data_frame= df[['age', 'avg_glucose_level', 'bmi']].melt(),  # select columns with a list; recent pandas rejects a set
    y= 'value',
    facet_col= 'variable',
    color= 'variable',
    points= 'outliers',
    template= 'ggplot2',
    title= 'Identifying Possible Outliers'
)
table0.update_yaxes(matches= None)
table0.for_each_annotation(lambda x: x.update(text= x.text.split("=")[-1]))
table0.show()              

In [19]:
table0 = px.box(
    data_frame= df[['age', 'avg_glucose_level', 'bmi', 'stroke']].melt(value_vars= ['avg_glucose_level', 'bmi', 'age'], id_vars= ['stroke']),
    y= 'value',
    x='stroke',
    facet_col= 'variable',
    color= 'variable',
    points= 'outliers',
    template= 'ggplot2',
    title= 'Identifying Possible Outliers- Stroke Split'
)
table0.update_yaxes(matches= None)
table0.for_each_annotation(lambda x: x.update(text= x.text.split("=")[-1]))
table0.show()     

In [20]:
px.scatter(
    df,
    x= 'avg_glucose_level',
    y= 'bmi',
    marginal_x= 'box',
    marginal_y= 'box',
    color= 'age',
    color_continuous_scale=px.colors.sequential.GnBu,
    template= 'ggplot2',
    title= "BMI vs Avg Glucose"
    
)

A few takeaways:
- The boxplots show several statistical outliers in bmi and avg_glucose_level; however, we will not remove them, as explained later in the analysis.
- Older patients are much more concentrated at higher avg_glucose_level values, but not necessarily at higher BMI (though we will see a higher correlation between BMI and Age later on)
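
The points drawn beyond the whiskers follow the usual 1.5 × IQR rule. The sketch below reproduces that rule with pandas so the flagged values can be counted; `iqr_outliers` is a hypothetical helper name, the sample readings are made up, and plotly's own quartile method may differ slightly from `Series.quantile`.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical readings, only to show the rule in action.
sample = pd.Series([70, 75, 80, 85, 90, 95, 100, 250])
flagged = iqr_outliers(sample)
print(flagged.tolist())
```

On the real data, something like `len(iqr_outliers(df['bmi']))` would give the number of flagged BMI values.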

## Visualizing Trends

#### Total Strokes

In [21]:
table1 = px.histogram(
    data_frame = df,
    x = 'stroke',
    text_auto = True,
    color = 'stroke',
    template= 'ggplot2',
    title = "Stroke Count"
)

table1.update_layout(bargap=0.1)
table1.show()

In [22]:
table2 = px.pie(
    df,
    names= 'stroke',
    color= 'stroke',
    template= 'ggplot2',
    title= "Stroke Percentage"
)

table2.show()

In [23]:
px.pie(
    df,
    names= 'gender',
    color= 'gender',
    template= 'ggplot2',
    title= "Gender Percentage"
)



A few takeaways:
- The set is imbalanced in terms of gender (~58.6% female to ~41.4% male), which is not a good representation of the general population (roughly 50.5% to 49.5%).
- Patients with confirmed strokes make up <5% of the set, so a model that simply predicts "no stroke" for every patient would already score ~95% accuracy. Something to keep in mind when we build our models later.
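
To make the accuracy caveat concrete, here is a small arithmetic sketch of a baseline that always predicts 0. The count of 249 stroke cases is an assumption derived from the 0.0487 stroke mean in the summary table, not a separately verified number.

```python
# Counts derived from the summary above: 5110 rows, stroke mean ~0.0487,
# i.e. roughly 249 positive cases (an approximation, not a verified count).
n_total = 5110
n_stroke = 249

# A "model" that always predicts 0 (no stroke) gets every negative right
# and every positive wrong.
accuracy = (n_total - n_stroke) / n_total
recall = 0 / n_stroke  # no stroke case is ever detected

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

This is why metrics such as recall or precision on the stroke class are usually worth tracking alongside accuracy for data this imbalanced.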

#### Quick Correlation Chart

In [24]:
fig1 = px.imshow(
    df.corr(numeric_only=True).round(3),  # text columns are still present here; skip them explicitly
    color_continuous_scale= 'gnbu',
    text_auto=True,
    template= 'ggplot2',
    title= "Correlation Heatmap"
)

fig1.show()

A few takeaways:
- No particularly strong correlations
- *Age* has a **higher** correlation with the other factors than they share with each other, especially with *BMI*
- *BMI* has a surprisingly **low** correlation with *heart disease* and *stroke*

#### Age and Gender

In [25]:
px.histogram(
    df,
    x= 'age',
    y= 'stroke',
    marginal= 'histogram',
    barmode= 'relative',
    template= 'ggplot2',
    title= "Stroke totals by Age"
).update_layout(bargap=0.1)

Breakdown of Strokes by Age with the distribution of *Age for the entire dataset* in the margin. 

In [26]:
px.histogram(
    df,
    x= 'age',
    y= 'stroke',
    color= 'gender',
    barmode= 'group',
    template= 'ggplot2',
    title= "Gender Stroke totals by Age"
)

A few takeaways:
- There are 2 stroke cases among children; the bulk are ages 40+.
- *Females* account for a higher share of strokes **up to age 56**.
- *Males* account for a greater number of strokes from **58 to 65**.
- After 66, *Females* tend to have more strokes, but this could just be because they have a higher life expectancy, so there are more older females in general.

#### Hypertension

In [27]:
px.histogram(
    df,
    x= 'age',
    y= 'stroke',
    color= 'hypertension',
    barmode= 'group',
    nbins= 10,
    template= 'ggplot2',
    title= "Total Strokes by Age with Hypertension"
)


#### Heart Disease

In [28]:
px.histogram(
    df,
    x= 'age',
    y= 'stroke',
    color= 'heart_disease',
    barmode= 'group',
    template= 'ggplot2',
    title= "Total Strokes by Age with Heart Disease"
)

Note: the colors for 0 and 1 are flipped relative to the hypertension chart; this still needs to be fixed.

#### Avg Glucose Level

In [29]:
px.histogram(
    df,
    x= 'avg_glucose_level',
    y= 'stroke',
    barmode= 'group',
    template= 'ggplot2',
    title= "Total Strokes across Glucose Level"
).update_layout(bargap=0.1)

#### BMI

In [30]:
px.histogram(
    df,
    x= 'bmi',
    y= 'stroke',
    barmode= 'group',
    template= 'ggplot2',
    title= "Total Strokes across BMI"
).update_layout(bargap=0.1)

The spike of 57 total strokes for BMI is a little fishy. 
If you remember, we were originally missing 201 BMIs from our data that we replaced with the mean of **28.89**.

Let's isolate and check the raw data that had a *null* BMI value.

In [31]:
# Keep only the raw rows where BMI was missing
df_test = raw_log.loc[raw_log['bmi'].isnull()]


px.histogram(
    df_test,
    x = 'stroke',
    text_auto = True,
    color = 'stroke',
    template= 'ggplot2',
    title = "Stroke Count where BMI was missing"
).update_layout(bargap=0.1)

There you have it. **40** of the 201 rows with a missing BMI are stroke cases (**~20%**, versus <5% for the dataset as a whole), and mean imputation stacked all of them onto the mean BMI, which is not representative of the dataset.

Therefore, even though we already knew BMI and Stroke had a low correlation, we will probably not trust this metric. 

For kicks, let's look at the distribution of BMIs across stroke patients if we remove the null data.
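
The gap between the two groups can be checked with a few lines of arithmetic. The figures are the ones quoted in this notebook, except the total of 249 stroke cases, which is an assumption inferred from the 0.0487 stroke mean.

```python
# Figures quoted in this notebook; 249 is inferred from the 0.0487 stroke mean.
missing_rows, missing_strokes = 201, 40
total_rows, total_strokes = 5110, 249

rate_missing = missing_strokes / missing_rows
rate_overall = total_strokes / total_rows

print(f"stroke rate, BMI missing: {rate_missing:.1%}")
print(f"stroke rate, all rows:    {rate_overall:.1%}")
print(f"ratio: {rate_missing / rate_overall:.1f}x")
```

A stroke rate roughly four times the overall rate suggests the missing values are not missing at random, which is one more reason to be wary of plain mean imputation here.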

In [32]:
px.histogram(
    raw_log.loc[raw_log['bmi'].notnull()],
    x= 'bmi',
    y= 'stroke',
    barmode= 'group',
    template= 'ggplot2',
    title= "Total Strokes across BMI- null data removed"
).update_layout(bargap=0.1)

### Categorical Conversion and LabelEncoder

Before we continue into the trends of the other categorical variables, let's get a correlation check like we did with the previous variables.

However, to do this we need to convert the categorical variables into encoded numbers. 

e.g. smoking_status : ["Unknown", "formerly smoked", "never smoked", "smokes"] to [0, 1, 2, 3] (LabelEncoder numbers the labels in sorted order)

This could be done manually, but let's use this opportunity to use the **LabelEncoder** tool from the **sklearn.preprocessing** package.

EDIT: We will be using a custom class, **MultiColumnLabelEncoder**, so that we can easily invert the encoded data back for cleaner visualizations. Thank you and credit to *gereleth* for the clean [solution](https://stackoverflow.com/questions/58217005/how-to-reverse-label-encoder-from-sklearn-for-multiple-columns) 
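
Before the multi-column wrapper, it may help to see what a single LabelEncoder does on its own. This is a small sketch on a made-up list of statuses; it shows that the codes come from the sorted unique labels and that `inverse_transform` restores the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels, in arbitrary order.
statuses = ['never smoked', 'formerly smoked', 'smokes', 'Unknown', 'smokes']

encoder = LabelEncoder().fit(statuses)

# classes_ holds the sorted unique labels; a label's code is its position here.
print(list(encoder.classes_))

codes = encoder.transform(statuses)
print(codes.tolist())

# inverse_transform maps the codes back to the original strings.
restored = encoder.inverse_transform(codes)
print(restored.tolist())
```

Because the numbering is alphabetical, the codes carry no natural ranking, which is worth remembering when reading correlations computed on them.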

In [33]:
class MultiColumnLabelEncoder:

    def __init__(self, columns=None):
        self.columns = columns # array of column names to encode


    def fit(self, X, y=None):
        self.encoders = {}
        columns = X.columns if self.columns is None else self.columns
        for col in columns:
            self.encoders[col] = LabelEncoder().fit(X[col])
        return self


    def transform(self, X):
        output = X.copy()
        columns = X.columns if self.columns is None else self.columns
        for col in columns:
            output[col] = self.encoders[col].transform(X[col])
        return output


    def fit_transform(self, X, y=None):
        return self.fit(X,y).transform(X)


    def inverse_transform(self, X):
        output = X.copy()
        columns = X.columns if self.columns is None else self.columns
        for col in columns:
            output[col] = self.encoders[col].inverse_transform(X[col])
        return output

In [34]:
multi = MultiColumnLabelEncoder(columns=['gender', 'ever_married', 'work_type', 'residence_type', 'smoking_status'])

df = multi.fit_transform(df)

In [35]:
df.head(10)

,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,1,67.0,0,1,1,2,1,228.69,36.6,1,1
1,0,61.0,0,0,1,3,0,202.21,28.893237,2,1
2,1,80.0,0,1,1,2,0,105.92,32.5,2,1
3,0,49.0,0,0,1,2,1,171.23,34.4,3,1
4,0,79.0,1,0,1,3,0,174.12,24.0,2,1
5,1,81.0,0,0,1,2,1,186.21,29.0,1,1
6,1,74.0,1,1,1,2,0,70.09,27.4,2,1
7,0,69.0,0,0,0,2,1,94.39,22.8,2,1
8,0,59.0,0,0,1,2,0,76.15,28.893237,0,1
9,0,78.0,0,0,1,2,1,58.57,24.2,0,1


In [36]:
fig2 = px.imshow(
    df.corr().round(3),
    color_continuous_scale= 'gnbu',
    text_auto=True,
    template= 'ggplot2',
    title= "Correlation Heatmap 2"
)

fig2.show()

That is a lot to take in, so let's also look at just those variables we have not touched yet.

In [37]:
fig3 = px.imshow(
    df[['gender', 'age', 'ever_married', 'work_type', 'residence_type', 'smoking_status', 'stroke']].corr().round(3),
    color_continuous_scale= 'gnbu',
    text_auto=True,
    template= 'ggplot2',
    title= "Correlation Heatmap 2.1"
)

fig3.show()

A few takeaways:
- No particularly strong correlations except *Age* and *Ever_married*
- *Work_type* has a **higher** negative correlation with *Age* and *Ever_married*
- *Residence_type* and *Gender* have virtually **NO** correlation with any other factors
- *Smoking_status* carries a **higher** correlation with *BMI*, *Age* AND *Ever_married*; not that useful for our context, but it is interesting nonetheless. 
- The codes for *Work_type* and *Smoking_status* follow alphabetical order rather than any natural ranking, so their coefficients are only a rough guide.

Invert all of the encoded data back to its respective labels before continuing with the viz.

In [38]:
df = multi.inverse_transform(df)

df.tail(10)

,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
5100,Male,82.0,1,0,Yes,Self-employed,Rural,71.97,28.3,never smoked,0
5101,Female,45.0,0,0,Yes,Private,Urban,97.95,24.5,Unknown,0
5102,Female,57.0,0,0,Yes,Private,Rural,77.93,21.7,never smoked,0
5103,Female,18.0,0,0,No,Private,Urban,82.85,46.9,Unknown,0
5104,Female,13.0,0,0,No,children,Rural,103.08,18.6,Unknown,0
5105,Female,80.0,1,0,Yes,Private,Urban,83.75,28.893237,never smoked,0
5106,Female,81.0,0,0,Yes,Self-employed,Urban,125.2,40.0,never smoked,0
5107,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0
5109,Female,44.0,0,0,Yes,Govt_job,Urban,85.28,26.2,Unknown,0


#### Residence Type

In [39]:
px.histogram(
    df,
    'residence_type',
    color = 'residence_type',
    text_auto = True,
    template= 'ggplot2',
    title= "Residence Count"
    
)

In [40]:
px.histogram(
    df,
    x= 'age',
    y= 'stroke',
    color= 'residence_type',
    barmode= 'group',
    nbins= 20,
    template= 'ggplot2',
    title= "Total Strokes by Age with Residence Type"
)

#### Work Type

In [41]:
px.histogram(
    df,
    x= 'work_type',
    y= 'stroke',
    color= 'work_type',
    marginal= 'histogram',
    template= 'ggplot2',
    title= "Total Strokes by Work Type (with total Work Type marginal)"
)

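Patients in the 'children' category show very few strokes. Note that this category appears to label patients who are themselves children (for example, row 5104 is a 13-year-old with work_type 'children'), so the low count most likely reflects age rather than occupation.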

#### Smoking Status

In [42]:
px.histogram(
    df,
    x= 'smoking_status',
    y= 'stroke',
    color= 'smoking_status',
    marginal= 'histogram',
    template= 'ggplot2',
    title= "Total Strokes by Smoking status (with total marginal)"
)

The large quantity of "Unknown" values makes it hard to use this data, but nonetheless there appears to be no obvious relationship between smoking status and strokes!

#### Marriage Status

In [43]:
px.histogram(
    df,
    x= 'age',
    y= 'stroke',
    color= 'ever_married',
    barmode= 'group',
    template= 'ggplot2',
    title= "Total Strokes by Age with Marriage"
)

### Addressing Multicollinearity 

Collinearities were examined in our "Correlation Heatmap 2".

*age* and *ever_married* are fairly strongly correlated at **0.679**.

Even though that is not in the dangerous 0.75-0.80+ range, for the sake of practice we are going to remove *ever_married* before progressing.
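
Instead of reading pairs off the heatmap by eye, the candidates can be listed programmatically. The sketch below is one way to do that; `correlated_pairs` and its 0.6 default threshold are hypothetical choices, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.6) -> pd.Series:
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep the upper triangle only, so each pair appears once and the
    # diagonal (always 1.0) is excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical toy data: b tracks a closely, c does not.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': [2, 4, 5, 8, 10, 13],
    'c': [5, 1, 4, 2, 6, 3],
})
print(correlated_pairs(toy))
```

Run on the label-encoded frame, a call like `correlated_pairs(df)` should include the *age* and *ever_married* pair, given the 0.679 reported above.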

In [44]:
df = df.drop(['ever_married'], axis= 1)

df.head(5)

,gender,age,hypertension,heart_disease,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Private,Urban,228.69,36.6,formerly smoked,1
1,Female,61.0,0,0,Self-employed,Rural,202.21,28.893237,never smoked,1
2,Male,80.0,0,1,Private,Rural,105.92,32.5,never smoked,1
3,Female,49.0,0,0,Private,Urban,171.23,34.4,smokes,1
4,Female,79.0,1,0,Self-employed,Rural,174.12,24.0,never smoked,1


### Next Steps

- Apply feature standardization to the continuous numeric data to reduce potential bias in some of the models. Options: StandardScaler from sklearn
- Build models.
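
As a preview of the standardization step, here is a minimal sketch with StandardScaler on a made-up frame. The column names match the continuous features in this notebook, but the values are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical rows; column names match the continuous features in this notebook.
toy = pd.DataFrame({
    'age': [20.0, 40.0, 60.0, 80.0],
    'avg_glucose_level': [70.0, 90.0, 110.0, 250.0],
    'bmi': [22.0, 25.0, 28.0, 37.0],
})
continuous = ['age', 'avg_glucose_level', 'bmi']

scaler = StandardScaler()
scaled = toy.copy()
# Each column is shifted to mean 0 and rescaled to unit (population) variance.
scaled[continuous] = scaler.fit_transform(toy[continuous])

print(scaled.round(3))
```

When models are built, the scaler would normally be fitted on the training split only and then applied to the test split, so that no information leaks from the held-out data.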