# Week 3: Introduction to Data Science 📈
# Additional Exercises

To wrap up this week's exploration of the heart failure dataset, we will take a closer look at the statistical analysis from the end of the tutorial module, where we used a logistic regression model to identify which features are statistically significant and how strongly each one is associated with the death of a patient.

The focus for this module is the following:
1. Statistical analysis using logistic regression

First, run the cell below to load the dataset after cleaning and modifying it in the tutorial (do not modify this code).

In [1]:
import pandas as pd
df = pd.read_csv("hf_data_tut.csv")
df

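| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1.0 | 265000.00 | 1.9 | 130 | 1 | 0.0 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0.0 | 263358.03 | 1.1 | 136 | 1 | 0.0 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0.0 | 162000.00 | 1.3 | 129 | 1 | 1.0 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0.0 | 210000.00 | 1.9 | 137 | 1 | 0.0 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0.0 | 327000.00 | 2.7 | 116 | 0 | 0.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | 62.0 | 0 | 61 | 1 | 38 | 1.0 | 155000.00 | 1.1 | 143 | 1 | 1.0 | 0 |
| 295 | 55.0 | 0 | 1820 | 0 | 38 | 0.0 | 270000.00 | 1.2 | 139 | 0 | 0.0 | 0 |
| 296 | 45.0 | 0 | 2060 | 1 | 60 | 0.0 | 742000.00 | 0.8 | 138 | 0 | 0.0 | 0 |
| 297 | 45.0 | 0 | 2413 | 0 | 38 | 0.0 | 140000.00 | 1.4 | 140 | 1 | 1.0 | 0 |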


## A Deeper Dive: Statistical analysis with logistic regression
Recall that in the tutorial, we showed the results of applying a statistical model after separating the predictor and target variables. We used logistic regression because this is a classification task (the outcome is one of two predefined categories: death or no death). Here we will explain how to interpret the results shown in the tutorial module.

First, a brief recap of the steps we took:
1. Obtain a logistic regression model from statsmodels, a Python library for statistical analysis.
2. Fit the model to our data; the Logit model in statsmodels requires the target variable `y` and predictor variables `x` for this purpose. This is where the parameters in the model equation are estimated.
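
For reference, the model equation mentioned in step 2 can be sketched as follows. This is the standard logistic regression form; the symbols are our own notation rather than the tutorial's: $p$ is the probability of death, $x_1, \dots, x_k$ are the predictor variables, and $\beta_1, \dots, \beta_k$ are the parameters estimated during fitting. Because the code in this module passes `x` to `sm.Logit` without adding a constant column, the equation is written without an intercept term.

$$
\log\left(\frac{p}{1-p}\right) = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k
$$

The left-hand side is the log-odds of death, which is why each coefficient in the summary is read on the log-odds scale.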

We print the result of this fitting to obtain statistics below: 

In [7]:
df.head()

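| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1.0 | 265000.0 | 1.9 | 130 | 1 | 0.0 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0.0 | 263358.03 | 1.1 | 136 | 1 | 0.0 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0.0 | 162000.0 | 1.3 | 129 | 1 | 1.0 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0.0 | 210000.0 | 1.9 | 137 | 1 | 0.0 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0.0 | 327000.0 | 2.7 | 116 | 0 | 0.0 | 1 |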


In [8]:
import numpy as np
import statsmodels.api as sm

# split predictor and target variables
x = df.drop(['DEATH_EVENT'], axis=1).copy()
y = df['DEATH_EVENT'].copy()
# obtain and fit logistic regression model
logit_model=sm.Logit(y,x)
result=logit_model.fit()

print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.494072
         Iterations 6
                             Results: Logit
Model:                 Logit              Method:             MLE       
Dependent Variable:    DEATH_EVENT        Pseudo R-squared:   0.213     
Date:                  2023-11-26 16:22   AIC:                317.4552  
No. Observations:      299                BIC:                358.1601  
Df Model:              10                 Log-Likelihood:     -147.73   
Df Residuals:          288                LL-Null:            -187.67   
Converged:             1.0000             LLR p-value:        5.2675e-13
No. Iterations:        6.0000             Scale:              1.0000    
------------------------------------------------------------------------
                          Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
------------------------------------------------------------------------
age                       0.0575   0.0130  4.4192 0

A few definitions on the P>|z| and Coef. columns above:

<span style="background-color: #AFEEEE">**p-value [P>|z|]**</span>: the probability of seeing an association at least as strong as the one observed if the variable truly had no effect on the outcome. A lower p-value indicates higher statistical significance. Often, p < 0.05 is considered statistically significant. 

In the summary above, six variables have a p-value (the P>|z| column) greater than 0.05: anaemia, diabetes, high_blood_pressure, platelets, sex, and smoking. For these variables, the data do not provide strong evidence of an effect on the death outcome; any apparent effect could plausibly be due to chance. 

<span style="background-color: #AFEEEE">**Coefficients (in logistic regression) [Coef.]**</span>: a statistical measure of how much each predictor variable affects the observed outcome, if all other variables are held constant. For continuous variables (such as age), the coefficient tells us how much the log-odds of the patient dying change for a 1-unit increase in the variable (for age, a 1-year increase). For binary categorical variables (such as smoking, which is either True or False), the coefficient tells us how much the log-odds differ for a smoker compared to a non-smoker. A positive coefficient raises the log-odds; a negative coefficient lowers them.

<span style="background-color: #AFEEEE">**log-odds**</span>: the logarithm of the odds.

<span style="background-color: #AFEEEE">**odds**</span>: the ratio of the probability that an outcome occurs to the probability that it does not occur. 
In this case, the odds are the ratio of the probability of death to the probability of no death. If the probability of death is the same as the probability of no death, the odds are 1. If death is more probable, the odds are greater than 1. If no death is more probable, the odds are less than 1. 
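
The relationship between probability, odds, and log-odds can be sketched in a few lines of Python. The function names below are our own, chosen for illustration; they are not part of statsmodels or the tutorial code.

```python
import math

def odds_from_probability(p):
    """Odds = probability the event happens / probability it does not."""
    return p / (1 - p)

def log_odds_from_probability(p):
    """Log-odds = natural logarithm of the odds."""
    return math.log(odds_from_probability(p))

def probability_from_log_odds(z):
    """Invert the log-odds back to a probability (the logistic function)."""
    return 1 / (1 + math.exp(-z))

# Illustrative probabilities of death (not taken from the dataset)
for p in (0.2, 0.5, 0.8):
    odds = odds_from_probability(p)
    log_odds = log_odds_from_probability(p)
    print(f"p={p:.1f}  odds={odds:.2f}  log-odds={log_odds:.2f}")
```

A probability of 0.5 gives odds of 1 and log-odds of 0, matching the description above; probabilities above 0.5 give odds greater than 1 and positive log-odds.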

It is easier to interpret coefficients by converting them from the log-odds scale to the odds scale. We do this by taking the exponential of the coefficient, which gives the factor by which the odds are multiplied:

<h1><center> 

odds = exp(log-odds)

For example, ejection_fraction has a coefficient of -0.0710.

odds = exp(-0.0710) = 0.93146\
0.93146 - 1 = -0.0685

</center></h1>

Thus, a 1-unit increase in ejection fraction multiplies the odds of death by about 0.931, which is a 6.85% *decrease* in the odds that the patient will die.
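
The same arithmetic can be checked in Python. This is a minimal sketch: `percent_change_in_odds` is a hypothetical helper written for this exercise, and the `units` argument (for increases larger than 1 unit) is an extension we add for illustration, relying on the fact that a `units`-sized increase shifts the log-odds by `coef * units`.

```python
import math

def percent_change_in_odds(coef, units=1.0):
    """Percent change in the odds for a `units`-sized increase in a predictor."""
    odds_multiplier = math.exp(coef * units)
    return (odds_multiplier - 1.0) * 100.0

# ejection_fraction coefficient quoted in the text above
ef_coef = -0.0710

# Change in odds for a 1-unit increase in ejection fraction
print(round(percent_change_in_odds(ef_coef), 2))

# Change in odds for a 10-unit increase (the effect compounds, it does not simply add)
print(round(percent_change_in_odds(ef_coef, units=10), 2))
```

Note that the 10-unit result is not ten times the 1-unit result, because the odds are multiplied by the same factor for every additional unit.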

**Q1**: Calculate the odds increase for anaemia from the coefficient in the result above and interpret the results (remember that anaemia is a categorical variable). Based on the coefficient and p-value, is anaemia likely a risk factor for death?

<span style="background-color: #FFD700">**Write your answer here**</span>


We must find the features with the highest increase in odds while simultaneously filtering out features with low statistical significance. The code below keeps only the statistically significant features (p < 0.05) and computes the change in odds for each one. Run it to generate a table with just these features from the results summary.

In [9]:
# Creating new data frame of p values
pvals = result.pvalues.T.to_frame()
pvals.index.name = 'Features'

# Creating new data frame of coefficients
coefs = result.params.T.to_frame()
coefs.index.name = 'Features'

# Merge into one data frame
results = pd.merge(coefs, pvals, how = "left", on = "Features",suffixes=("params","pvalues")).fillna(0).reset_index()
results = results.rename(columns={'0params':'Coef','0pvalues':'P-value'})

# Keep statistically significant features
final_results = results.loc[(results['P-value'] < 0.05)].reset_index(drop=True)

# Take exp of coefs to get odds
final_results['Odds'] = np.exp(final_results['Coef'])
# Calculate percent increase in odds
final_results['Percent Increase'] = (final_results['Odds'] - 1)*100
# Sort by odds
final_results = final_results.sort_values(by=['Odds'], ascending=False).reset_index(drop=True)
final_results

| | Features | Coef | P-value | Odds | Percent Increase |
|---|---|---|---|---|---|
| 0 | serum_creatinine | 0.699365 | 6.6e-05 | 2.012475 | 101.247512 |
| 1 | age | 0.057537 | 1e-05 | 1.059224 | 5.92242 |
| 2 | creatinine_phosphokinase | 0.000282 | 0.04762 | 1.000282 | 0.028185 |
| 3 | serum_sodium | -0.021686 | 0.004251 | 0.978547 | -2.145268 |
| 4 | ejection_fraction | -0.071023 | 2e-06 | 0.931441 | -6.855928 |


Note that this is the same output provided to you in the tutorial module. At this point in your learning journey you may not know how to implement all of the code above that performs the filtering and data manipulation, but that's okay! You will become more familiar with these functions and libraries as you continue learning.
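
If the merge-based code feels dense, the same filtering idea can be sketched more compactly. This is one possible alternative, not the tutorial's method: `summarize_significant` is a hypothetical helper that takes a Series of coefficients and a Series of p-values sharing the same index. The example values are copied from the table above, plus one made-up entry (`hypothetical_feature`) included only to show the filter removing a non-significant feature.

```python
import numpy as np
import pandas as pd

def summarize_significant(params, pvalues, alpha=0.05):
    """Build a table of odds changes for features with p-value below `alpha`."""
    table = pd.DataFrame({"Coef": params, "P-value": pvalues})
    # Keep statistically significant features only
    table = table[table["P-value"] < alpha].copy()
    # Exponentiate coefficients and express the change as a percentage
    table["Odds"] = np.exp(table["Coef"])
    table["Percent Increase"] = (table["Odds"] - 1) * 100
    table = table.sort_values("Odds", ascending=False)
    return table.rename_axis("Features").reset_index()

# Values copied from the table above; the last entry is made up for illustration
params = pd.Series({
    "serum_creatinine": 0.699365,
    "age": 0.057537,
    "creatinine_phosphokinase": 0.000282,
    "serum_sodium": -0.021686,
    "ejection_fraction": -0.071023,
    "hypothetical_feature": 0.10,
})
pvalues = pd.Series({
    "serum_creatinine": 6.6e-05,
    "age": 1e-05,
    "creatinine_phosphokinase": 0.04762,
    "serum_sodium": 0.004251,
    "ejection_fraction": 2e-06,
    "hypothetical_feature": 0.60,
})

print(summarize_significant(params, pvalues))
```

With a fitted model, the call would be `summarize_significant(result.params, result.pvalues)`, since both attributes are pandas Series indexed by feature name when the model is fit on a DataFrame.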

## Conclusion
To wrap up the week 3 modules, today you have learned about:
1. The statsmodels library
2. How to interpret results from statsmodels logistic regression