This file operates on stroke_full.csv, which was created by DSI_project_v2.ipynb from the dataset in healthcare-dataset-stroke-data.csv.

For further details on how stroke_full.csv was created, and the data it contains, please read the documentation in DSI_project_v2.ipynb

In [592]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from scipy.stats import chi2_contingency
from sklearn.feature_selection import VarianceThreshold

In [593]:
df = pd.read_csv("stroke_full.csv")
print(df.dtypes)

gender                     object
age                       float64
hypertension                int64
heart_disease               int64
ever_married               object
work_type                  object
Residence_type             object
avg_glucose_level         float64
bmi                       float64
smoking_status             object
stroke                      int64
gender_encoded              int64
ever_married_encoded        int64
work_type_encoded           int64
Residence_type_encoded      int64
smoking_status_encoded      int64
Age_temp                    int64
avg_glucose_level_temp      int64
dtype: object


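Note that gender and smoking_status show up as object variables even though they were category variables when this dataframe was written to stroke_full.csv by DSI_project_v2.ipynb. This is expected behaviour rather than a bug: CSV is a plain-text format that does not store pandas dtypes, so the category dtype is lost on write and has to be restored after reading.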
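As an alternative to converting the columns after loading, the category dtype could be requested while reading. The following is a minimal sketch, assuming a hypothetical two-row in-memory CSV in place of stroke_full.csv:

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for stroke_full.csv
csv_text = "gender,smoking_status,age\nMale,smokes,67.0\nFemale,never smoked,61.0\n"

# Default read: text columns come back with a generic text dtype
plain = pd.read_csv(io.StringIO(csv_text))

# Passing dtype restores the category dtype while reading
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"gender": "category", "smoking_status": "category"},
)
print(plain.dtypes)
print(typed.dtypes)
```

Either approach should give the same result; converting after loading, as done in this notebook, keeps the read call simpler.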

In [594]:
print(df.shape)

(3413, 18)


In [595]:
# Convert categorical columns back to "category"
categorical_cols = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
for col in categorical_cols:
    df[col] = df[col].astype('category')

print(df.dtypes)

gender                    category
age                        float64
hypertension                 int64
heart_disease                int64
ever_married              category
work_type                 category
Residence_type            category
avg_glucose_level          float64
bmi                        float64
smoking_status            category
stroke                       int64
gender_encoded               int64
ever_married_encoded         int64
work_type_encoded            int64
Residence_type_encoded       int64
smoking_status_encoded       int64
Age_temp                     int64
avg_glucose_level_temp       int64
dtype: object


In [596]:
# heart_risk is 1 if the patient has hypertension or heart disease (or both), otherwise 0
df['heart_risk'] = (df['hypertension'] | df['heart_disease']).astype(int)

# Verify the result
print(df[['hypertension', 'heart_disease', 'heart_risk']].head(10))

   hypertension  heart_disease  heart_risk
0             0              1           1
1             0              0           0
2             0              1           1
3             0              0           0
4             1              0           1
5             0              0           0
6             1              1           1
7             0              0           0
8             0              0           0
9             0              0           0
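The composite flag relies on a bitwise OR of two 0/1 columns. A small sketch on hypothetical rows that cover all four combinations shows the resulting truth table:

```python
import pandas as pd

# Hypothetical rows covering all four combinations of the two flags
flags = pd.DataFrame({"hypertension": [0, 0, 1, 1], "heart_disease": [0, 1, 0, 1]})

# Bitwise OR on 0/1 integers: 1 if either flag is set, 0 only if both are 0
flags["heart_risk"] = (flags["hypertension"] | flags["heart_disease"]).astype(int)
print(flags)
```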


In [597]:
print(df.dtypes)


gender                    category
age                        float64
hypertension                 int64
heart_disease                int64
ever_married              category
work_type                 category
Residence_type            category
avg_glucose_level          float64
bmi                        float64
smoking_status            category
stroke                       int64
gender_encoded               int64
ever_married_encoded         int64
work_type_encoded            int64
Residence_type_encoded       int64
smoking_status_encoded       int64
Age_temp                     int64
avg_glucose_level_temp       int64
heart_risk                   int64
dtype: object


In [598]:
# Select all category variables in this dataset and print the count of each level of those variables.
category_columns = df.select_dtypes(include='category').columns

for col in category_columns:
    print(f"\nCount of each level in '{col}':")
    counts = df[col].value_counts()
    for category, count in counts.items():
        print(f"{category}: {count}")


Count of each level in 'gender':
Female: 2026
Male: 1387

Count of each level in 'ever_married':
Yes: 3051
No: 362

Count of each level in 'work_type':
Private: 2090
Self-employed: 753
Govt_job: 570

Count of each level in 'Residence_type':
Urban: 1729
Rural: 1684

Count of each level in 'smoking_status':
never smoked: 1371
formerly smoked: 773
Unknown: 661
smokes: 608


In [599]:
# Run a Chi square test of independence on two nominal variables.
# Returns the Chi square test statistic, p-value, degrees of freedom, and expected frequencies,
# or None if the test could not be performed.
def chi_square(dataframe, var1, var2):
    try:
        # Create the contingency table
        contingency_table = pd.crosstab(dataframe[var1], dataframe[var2])

        # Do the chi-square test
        chi2, p, dof, expected = chi2_contingency(contingency_table)

        return chi2, p, dof, expected

    except Exception as exc:
        # Report which pair failed and why, then signal failure to the caller
        print(f"Error performing chi-square test on '{var1}' and '{var2}': {exc}")
        return None
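
Two details of chi2_contingency are worth keeping in mind when reading the results that follow: by default it applies Yates' continuity correction to tables with one degree of freedom, and the test is usually considered reliable only when the expected counts are not too small (a common rule of thumb is at least 5 per cell). The sketch below uses a hypothetical 2x2 table to illustrate both points:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table (rows: outcome, columns: group)
table = np.array([[30, 10], [20, 40]])

# Default call: Yates' continuity correction is applied when dof == 1
chi2_corrected, p_corrected, dof, expected = chi2_contingency(table)

# Same table without the continuity correction
chi2_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

# Rule-of-thumb check that every expected count is at least 5
expected_ok = bool((expected >= 5).all())
print(chi2_corrected, chi2_raw, expected_ok)
```

The corrected statistic is smaller than the uncorrected one, so the default is the more conservative choice for the 2x2 tests in this notebook.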



In [600]:
#stroke and gender
result = chi_square(df, 'stroke', 'gender')
if result:
    chi2, p, dof, expected = result
    print(f"Chi-square statistic: {chi2}")
    print(f"P-value: {p}")
    print(f"Degrees of freedom: {dof}")
    print("Expected frequencies:")
    print(expected)

Chi-square statistic: 0.9177594052664405
P-value: 0.33806396780881875
Degrees of freedom: 1
Expected frequencies:
[[1879.3776736 1286.6223264]
 [ 146.6223264  100.3776736]]


The p-value of the Chi square test of independence on gender and stroke is 0.338. Hence we fail to reject the null hypothesis of no association between gender and stroke at the 5% significance level.
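
The expected frequencies printed above can be reproduced from the marginal totals alone. In this sketch the gender totals come from the level counts shown earlier, and the stroke totals (3166 without stroke, 247 with stroke) are the row sums of the expected table:

```python
import numpy as np

# Marginal totals taken from the outputs above
stroke_totals = np.array([3166, 247])   # no stroke, stroke
gender_totals = np.array([2026, 1387])  # Female, Male
n = 3413

# Expected count under independence: row total * column total / n
expected = np.outer(stroke_totals, gender_totals) / n
print(expected)
```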

In [601]:
#Stroke and heart risk
result = chi_square(df, 'stroke', 'heart_risk')
if result:
    chi2, p, dof, expected = result
    print(f"Chi-square statistic: {chi2}")
    print(f"P-value: {p}")
    print(f"Degrees of freedom: {dof}")
    print("Expected frequencies:")
    print(expected)


Chi-square statistic: 67.27529316498607
P-value: 2.361219893995417e-16
Degrees of freedom: 1
Expected frequencies:
[[2528.71842953  637.28157047]
 [ 197.28157047   49.71842953]]


The p-value for the Chi square test of independence on heart risk and stroke is approximately 2.36e-16, which is far below 0.05. Hence we reject H0 at the 5% significance level and conclude that there **is** a significant association between heart risk and stroke.
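
Very small p-values are easier to report in scientific notation, since fixed-point formatting rounds them to zero. A short sketch using the p-value from the output above:

```python
# p-value copied from the output above
p = 2.361219893995417e-16

# Fixed-point formatting hides the magnitude; scientific notation keeps it
fixed = f"{p:.3f}"
scientific = f"{p:.2e}"
print(fixed)
print(scientific)
```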

In [602]:
#Stroke and ever married
result = chi_square(df, 'stroke', 'ever_married')
if result:
    chi2, p, dof, expected = result
    print(f"Chi-square statistic: {chi2}")
    print(f"P-value: {p}")
    print(f"Degrees of freedom: {dof}")
    print("Expected frequencies:")
    print(expected)

Chi-square statistic: 0.004196368695238362
P-value: 0.9483496555138795
Degrees of freedom: 1
Expected frequencies:
[[ 335.80193378 2830.19806622]
 [  26.19806622  220.80193378]]


p value of the Chi square test of independence on stroke and ever married is 0.9483. Hence we fail to reject the null hypothesis of no association between stroke and ever married at the 5% significance level.

In [603]:
#Stroke and work type
result = chi_square(df, 'stroke', 'work_type')
if result:
    chi2, p, dof, expected = result
    print(f"Chi-square statistic: {chi2}")
    print(f"P-value: {p}")
    print(f"Degrees of freedom: {dof}")
    print("Expected frequencies:")
    print(expected)

Chi-square statistic: 3.998458019195217
P-value: 0.13543966567485147
Degrees of freedom: 2
Expected frequencies:
[[ 528.74890126 1938.74597129  698.50512745]
 [  41.25109874  151.25402871   54.49487255]]


p value of the Chi square test of independence on stroke and work_type is 0.1354. Hence we fail to reject the null hypothesis of no association between stroke and work type at the 5% significance level.

In [604]:
#Stroke and Residence type
result = chi_square(df, 'stroke', 'Residence_type')
if result:
    chi2, p, dof, expected = result
    print(f"Chi-square statistic: {chi2}")
    print(f"P-value: {p}")
    print(f"Degrees of freedom: {dof}")
    print("Expected frequencies:")
    print(expected)

Chi-square statistic: 1.2237362549577369
P-value: 0.2686286548452952
Degrees of freedom: 1
Expected frequencies:
[[1562.12833285 1603.87166715]
 [ 121.87166715  125.12833285]]


p value of the Chi square test of independence on stroke and Residence_type is 0.2686. Hence we fail to reject the null hypothesis of no association between stroke and Residence type at the 5% significance level.

In [605]:
#Stroke and Smoking type
result = chi_square(df, 'stroke', 'smoking_status')
if result:
    chi2, p, dof, expected = result
    print(f"Chi-square statistic: {chi2}")
    print(f"P-value: {p}")
    print(f"Degrees of freedom: {dof}")
    print("Expected frequencies:")
    print(expected)

Chi-square statistic: 5.011184251740086
P-value: 0.17098000795349666
Degrees of freedom: 3
Expected frequencies:
[[ 613.16319953  717.05772048 1271.78025198  563.99882801]
 [  47.83680047   55.94227952   99.21974802   44.00117199]]


p value of the Chi square test of independence on stroke and smoking_status is 0.1710. Hence we fail to reject the null hypothesis of no association between stroke and smoking status at the 5% significance level.

However, smoking status is an interesting variable because its levels are never smoked, formerly smoked, Unknown and smokes. Let's try some other Chi square tests of independence and see what happens. In the first test we will combine formerly smoked and smokes into one category and keep never smoked and Unknown as the other two categories.

In [606]:
#  Combines two or three levels of 'smoking_status' into a new level.
def combine_smoker_categories(df, new_level, level1, level2, level3 = None):
    # Create a mapping dictionary
    category_mapping = {level1: new_level, level2: new_level}
    if level3 is not None:
        category_mapping[level3] = new_level
    
    # Apply the mapping to the column
    df['smoking_status'] = df['smoking_status'].map(category_mapping).fillna(df['smoking_status'])
    
    # Convert back to categorical and remove unused categories
    df['smoking_status'] = df['smoking_status'].astype('category').cat.remove_unused_categories()
    
    return df
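
The same merging of levels can be illustrated on a standalone Series. This is only a sketch on hypothetical values, not a replacement for the function above; it goes through plain strings so that unmapped levels are kept as they are:

```python
import pandas as pd

# Hypothetical smoking_status values
smoking = pd.Series(
    ["never smoked", "smokes", "formerly smoked", "Unknown", "smokes"],
    dtype="category",
)

# Levels to merge into the new level
mapping = {"smokes": "Ever_Smoked", "formerly smoked": "Ever_Smoked"}

# Map via plain strings; levels missing from the mapping are left unchanged
combined = (
    smoking.astype(str)
    .map(lambda level: mapping.get(level, level))
    .astype("category")
)
print(combined.value_counts())
```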


In [607]:
df_smoke_test = combine_smoker_categories(df.copy(),'Ever_Smoked', 'smokes', 'formerly smoked')
result = chi_square(df_smoke_test, 'stroke', 'smoking_status')
if result:
    chi2, p, dof, expected = result
    print(f"Chi-square statistic: {chi2}")
    print(f"P-value: {p}")
    print(f"Degrees of freedom: {dof}")
    print("Expected frequencies:")
    print(expected)

Chi-square statistic: 2.6728070843107927
P-value: 0.26278908094368353
Degrees of freedom: 2
Expected frequencies:
[[1281.05654849  613.16319953 1271.78025198]
 [  99.94345151   47.83680047   99.21974802]]


In [608]:
df_smoke_test_2 = combine_smoker_categories(df.copy(), 'Ever_Smoked', 'smokes', 'formerly smoked', 'Unknown' )
result = chi_square(df_smoke_test_2, 'stroke', 'smoking_status')
if result:
    chi2, p, dof, expected = result
    print(f"Chi-square statistic: {chi2}")
    print(f"P-value: {p}")
    print(f"Degrees of freedom: {dof}")
    print("Expected frequencies:")
    print(expected)

Chi-square statistic: 1.3807519931334042
P-value: 0.2399730970890941
Degrees of freedom: 1
Expected frequencies:
[[1894.21974802 1271.78025198]
 [ 147.78025198   99.21974802]]


The p-values for the Chi square test on stroke and smoking_status remain non-significant even when we combine the levels of the smoking_status variable in the two ways tried above, one of which is non-intuitive.

Our inference so far is: 

**No significant association was found between Stroke and the following variables**: gender, ever_married, work_type, Residence_type

**An association exists between Stroke and heart risk**, where heart risk is a composite variable obtained by combining the binary variables hypertension and heart_disease

A question remains over whether there is any association between Stroke and smoking status. The Chi Square test of independence finds no significant association based on the original levels of smoking status (never smoked, formerly smoked, Unknown, and smokes). Combining formerly smoked and smokes into a single level called Ever_Smoked also yielded a non-significant p-value (0.263) at the 5% significance level, as did combining formerly smoked, smokes and Unknown into a single Ever_Smoked level (p = 0.240). This second combination was arbitrary and was made on the assumption that many former smokers would refuse to disclose that they had previously smoked. In each of the three framings we tried, the dataset shows no significant association between Stroke and smoking status at the 5% significance level.

Our revised inference therefore is:

**No significant association was found between Stroke and the following variables**: gender, ever_married, work_type, Residence_type, and smoking_status

**An association exists between Stroke and heart risk**, where heart risk is a composite variable obtained by combining the binary variables hypertension and heart_disease
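
The per-variable tests summarised above could also be run in a single loop that collects the results into one table. The sketch below uses hypothetical toy data; in this notebook `df` and the full list of categorical columns would be passed instead:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data standing in for df
toy = pd.DataFrame({
    "stroke": [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    "gender": ["F", "M", "F", "M", "F", "M", "M", "F", "F", "M", "M", "F"],
    "Residence_type": ["Urban", "Rural", "Urban", "Urban", "Rural", "Rural",
                       "Urban", "Rural", "Urban", "Rural", "Rural", "Urban"],
})

rows = []
for col in ["gender", "Residence_type"]:
    # One contingency table and one test per candidate variable
    table = pd.crosstab(toy["stroke"], toy[col])
    chi2, p, dof, _ = chi2_contingency(table)
    rows.append({"variable": col, "chi2": chi2, "p_value": p, "dof": dof,
                 "significant_at_5pct": p < 0.05})

summary = pd.DataFrame(rows)
print(summary)
```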


**Numerical Features**

There are three candidate features in this dataset which are numerical variables: age, avg_glucose_level, and bmi. Using the VarianceThreshold method we will check whether each of these variables has a variance above 0.9, in which case it has enough variability to be potentially useful for prediction. We will standardize these numerical variables before we apply the VarianceThreshold method to them.


In [609]:
df_numerical_predictors = df[['age', 'avg_glucose_level', 'bmi']]
#df_numerical_predictors.head(6)

In [610]:
def scale_numerical_predictors(df, numerical_predictors):
    scaler = StandardScaler()

    # Create a copy of the dataframe to avoid modifying the original.
    df_scaled = df.copy()

    # Scale the numerical columns.
    df_scaled[numerical_predictors] = scaler.fit_transform(df_scaled[numerical_predictors])

    return df_scaled

In [611]:
numerical_predictors = ['age', 'avg_glucose_level', 'bmi']
df_numerical_predictors_scaled = scale_numerical_predictors(df_numerical_predictors, numerical_predictors)
print(df_numerical_predictors_scaled)

           age  avg_glucose_level       bmi
0     0.763532           2.312038  0.887420
1     0.343728           1.787070 -0.305926
2     1.673108          -0.121883  0.260149
3    -0.495881           1.172891  0.550835
4     1.603141           1.230185 -1.040292
...        ...                ...       ...
3408  1.673108          -0.561404 -0.367123
3409  1.743076           0.260344  1.407596
3410 -1.475424          -0.576471 -0.030538
3411 -0.355946           1.074955 -0.795503
3412 -0.845718          -0.531072 -0.703707

[3413 rows x 3 columns]


In [612]:
var_thr = VarianceThreshold(threshold = 0.90)
var_thr.fit(df_numerical_predictors_scaled)
mask = var_thr.get_support()
print(mask)


[ True  True  True]


Our inference is:

All three numerical predictors (age, avg_glucose_level, and bmi) pass the 0.90 variance threshold after standardization. Note that StandardScaler rescales every non-constant column to unit variance, so this check mainly confirms that none of the three variables is constant; it does not by itself show that they are useful for predicting the binary outcome variable stroke (presence or absence of stroke). Whether these three numerical predictors are significant predictors of the outcome variable will have to be determined by a statistical test or model.
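
The interaction between standardization and the variance threshold can be checked on synthetic data. In this sketch the three columns are hypothetical and deliberately placed on very different scales, with the third having almost no spread:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

# Hypothetical columns on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50, 20, size=500),
    rng.normal(100, 45, size=500),
    rng.normal(28, 0.01, size=500),
])

# On the raw data the near-constant third column falls below the threshold
raw_mask = VarianceThreshold(threshold=0.90).fit(X).get_support()

# After standardization every column has variance 1, so all columns pass
X_scaled = StandardScaler().fit_transform(X)
scaled_variances = X_scaled.var(axis=0)
scaled_mask = VarianceThreshold(threshold=0.90).fit(X_scaled).get_support()

print(raw_mask)
print(scaled_variances)
print(scaled_mask)
```

If the aim is to screen out low-variance features, applying the threshold before scaling (with a threshold chosen for each variable's units) is one option to consider.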

At this point, we are left with four predictors for stroke: the three numerical variables age, avg_glucose_level and bmi, and the nominal binary variable heart_risk (a composite variable created by combining the hypertension and heart_disease variables).