# Week 3: Introduction to Data Science 📈
# Additional Exercises

To wrap up this week's exploration of the heart failure dataset, we will take a closer look at the statistical analysis from the end of the tutorial module, where we used a logistic regression model to identify which features are statistically significant and have the greatest impact on patient death.

The focus for this module is the following:
1. Statistical analysis using logistic regression

First, **run the cell below** to load the dataset after cleaning and modifying it in the tutorial (do not modify this code).

In [None]:
import pandas as pd
df = pd.read_csv("hf_data_tut.csv")
df

### A Deeper Dive: Statistical analysis with logistic regression
Recall that in the tutorial, we showed the results of applying a statistical model after separating the predictor and target variables. We used logistic regression as this is a classification task (the outcome can be one of two predefined categories: death or no death). Here we will explain how you actually interpret the results we showed in the tutorial module.

First, a brief recap of the steps we took:
1. Obtain a logistic regression model from `statsmodels`, a Python library for statistical analysis.
2. Fit the model to our data; the `Logit` model in `statsmodels` requires the target variable `y` and the predictor variables `x` for this purpose. This is where the parameters (coefficients) of the model equation are estimated.

We then print the results of this fit to obtain the summary statistics below:

In [None]:
df.head()

In [None]:
import numpy as np
import statsmodels.api as sm

# Split predictor and target variables
x = df.drop(['DEATH_EVENT'], axis=1).copy()
y = df['DEATH_EVENT'].copy()

# Obtain and fit logistic regression model
logit_model = sm.Logit(y, x)
result = logit_model.fit()

print(result.summary2())

---
A few definitions on the P>|z| and Coef. columns above:

<span style="background-color: #AFEEEE">**p-value [P>|z|]**</span>: the probability of observing an effect at least as large as the one estimated if the variable truly had no effect on the outcome. A lower p-value means stronger evidence against "no effect", that is, higher statistical significance. By convention, p < 0.05 is often considered statistically significant.

In the summary above, we see 6 variables with a p-value (the P>|z| column) greater than 0.05: `anaemia`, `diabetes`, `high_blood_pressure`, `platelets`, `sex`, and `smoking`. For these variables, we cannot rule out that the estimated effect on the death outcome is due to chance.
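To make the P>|z| column more concrete, here is a minimal sketch of how a two-sided p-value can be computed from a z-score (the coefficient divided by its standard error), assuming the test statistic follows a standard normal distribution. The helper name `two_sided_p_value` and the z-scores are hypothetical and are not taken from the summary above.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a z-score under a standard normal distribution."""
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical z-scores for illustration only (not from the fitted model)
for z in (0.5, 1.96, 3.0):
    p = two_sided_p_value(z)
    print(f"z = {z}: p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

The further the z-score is from zero, the smaller the p-value, which is why large coefficients with small standard errors end up statistically significant.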

<span style="background-color: #AFEEEE">**Coefficients (in logistic regression) [Coef.]**</span>: a statistical measure of how much each predictor variable affects the observed outcome when all other variables are held constant. For continuous variables (such as `age`), the Coef. tells us how much the log-odds of the patient dying change for a 1-year increase in age. For binary categorical variables (such as `smoking`, which is either `True` or `False`), the Coef. tells us how much the log-odds change for a smoker compared to a non-smoker. A positive coefficient raises the log-odds, and a negative coefficient lowers them.
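The phrase "all other variables held constant" can be illustrated with a small sketch. It assumes a toy model with only two terms, and the coefficient values, the patient record, and the helper `log_odds` are all hypothetical (they are not the fitted values from the summary above).

```python
def log_odds(coefs, patient):
    """Log-odds predicted by a logistic model: the sum of coefficient * value."""
    return sum(coefs[name] * patient[name] for name in coefs)


# Hypothetical coefficients and patient, for illustration only
coefs = {"age": 0.05, "smoking": 0.30}
patient = {"age": 60, "smoking": 0}

# Change one variable at a time, keeping the other constant
one_year_older = dict(patient, age=61)
smoker = dict(patient, smoking=1)

base = log_odds(coefs, patient)
print("Change for +1 year of age:", round(log_odds(coefs, one_year_older) - base, 4))
print("Change for smoker vs. non-smoker:", round(log_odds(coefs, smoker) - base, 4))
```

In both cases the change in log-odds equals the coefficient of the variable that was changed, regardless of the values of the other variables.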

<span style="background-color: #AFEEEE">**log-odds**</span>: the logarithm of the odds.

<span style="background-color: #AFEEEE">**odds**</span>: the ratio of the probability that an outcome occurs to the probability that it does not occur.
In this case, the odds are the probability of death divided by the probability of no death. If the probability of death is the same as the probability of no death, the odds would be 1. If death is more probable, the odds would be greater than 1. If no death is more probable, the odds would be less than 1.
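The relationship between probability, odds, and log-odds can be sketched in a few lines. The helper names and the probabilities below are hypothetical and chosen only for illustration.

```python
import math


def odds_from_probability(p):
    """Odds = probability of the outcome / probability of no outcome."""
    return p / (1 - p)


def log_odds_from_probability(p):
    """Log-odds = natural logarithm of the odds."""
    return math.log(odds_from_probability(p))


# Hypothetical probabilities of death, for illustration only
for p in (0.2, 0.5, 0.8):
    print(
        f"p = {p}: odds = {odds_from_probability(p):.2f}, "
        f"log-odds = {log_odds_from_probability(p):.2f}"
    )
```

Note that odds below 1 correspond to negative log-odds, odds of exactly 1 to log-odds of 0, and odds above 1 to positive log-odds.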

Coefficients are easier to interpret after converting them from the log-odds scale to the odds scale. We do this by taking the exponential of the coefficient, which gives the factor by which the odds are multiplied (the odds ratio):

<h1><center> 

odds = exp(log-odds)

For example, ejection_fraction has a coefficient of -0.0710.

odds = exp(-0.0710) = 0.93146\
0.93146 - 1 = -0.0685

</center></h1>

Thus, a 1-unit increase in ejection fraction results in a 6.85% *decrease* in the odds that the patient will die.
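The conversion above can be wrapped in a small helper, sketched here under the assumption that the coefficient is on the log-odds scale. The function name `percent_change_in_odds` and its `units` parameter are hypothetical; the value -0.0710 is the `ejection_fraction` coefficient quoted in the text.

```python
import math


def percent_change_in_odds(coef, units=1):
    """Percent change in the odds for a `units`-sized increase in a predictor."""
    # Each unit multiplies the odds by exp(coef), so `units` units
    # multiply the odds by exp(coef * units)
    return (math.exp(coef * units) - 1) * 100


coef = -0.0710  # ejection_fraction coefficient quoted in the text

print(f"1-unit increase:  {percent_change_in_odds(coef):.2f}%")
print(f"10-unit increase: {percent_change_in_odds(coef, units=10):.2f}%")
```

Because the effect is multiplicative, the change for a 10-unit increase is not simply 10 times the change for a 1-unit increase.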

---
**Q1: Calculate the change in odds for `anaemia` from its coefficient in the result above and interpret the result (remember that `anaemia` is a categorical variable). Based on the coefficient and p-value, is anaemia likely a risk factor for death?**

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

Next, we want to find the features with the largest increase in odds while excluding features with low statistical significance. The code below keeps only the statistically significant features (p < 0.05) and computes their odds. **Run the code below** to generate a table with just these features from the results summary.

In [None]:
# Creating new data frame of p values
pvals = result.pvalues.T.to_frame()
pvals.index.name = 'Features'

# Creating new data frame of coefficients
coefs = result.params.T.to_frame()
coefs.index.name = 'Features'

# Merge into one data frame
results = pd.merge(coefs, pvals, how = "left", on = "Features",suffixes=("params","pvalues")).fillna(0).reset_index()
results = results.rename(columns={'0params':'Coef','0pvalues':'P-value'})

# Keep statistically significant features
final_results = results.loc[(results['P-value'] < 0.05)].reset_index(drop=True)

# Take exp of coefs to get odds
final_results['Odds'] = np.exp(final_results['Coef'])

# Calculate percent increase in odds
final_results['Percent Increase'] = (final_results['Odds'] - 1)*100

# Sort by odds
final_results = final_results.sort_values(by=['Odds'], ascending=False).reset_index(drop=True)
final_results

Note that this is the same output provided to you in the tutorial module. At this point in your learning journey, you may not know how to implement all of the code above that performs the filtering and data manipulation, but that's okay! You will become more familiar with these functions and libraries as you continue learning.
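If the code above feels dense, the same idea can be sketched more compactly. This is an alternative illustration, not the tutorial's implementation: the feature names and numbers below are hypothetical placeholders, and with a fitted model `result.params` and `result.pvalues` would take the place of the two hand-written Series.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients and p-values, for illustration only.
# These are NOT the values from the model fitted above.
coefs = pd.Series({"feature_a": 0.40, "feature_b": -0.07, "feature_c": 0.10})
pvals = pd.Series({"feature_a": 0.001, "feature_b": 0.020, "feature_c": 0.600})

# Combine both Series into one table, aligned on the feature names
table = pd.DataFrame({"Coef": coefs, "P-value": pvals}).rename_axis("Features")

# Keep statistically significant features (copy to avoid modifying a view)
significant = table[table["P-value"] < 0.05].copy()

# Convert coefficients to odds and to a percent change in odds
significant["Odds"] = np.exp(significant["Coef"])
significant["Percent Increase"] = (significant["Odds"] - 1) * 100

# Sort from largest to smallest odds
significant = significant.sort_values("Odds", ascending=False).reset_index()
print(significant)
```

Here `feature_c` is dropped because its p-value is above 0.05, and the remaining features are ordered from the largest to the smallest odds.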

## Conclusion
To wrap up the week 3 modules, today you have learned about:
1. The `statsmodels` library
2. How to interpret results from `statsmodels` logistic regression