# Exercise 09: Causal Thinking

Welcome to the tenth exercise for Applied Machine Learning.

Your objectives for this session are to:

- understand the fragility of correlations, 
- recognize how correlations can be misinterpreted as causation, and
- practice drawing a DAG to visualize your causal thinking.
------

In [None]:
# load our libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

----
### Part 1: Simulating correlations

We know "correlation doesn't imply causation," but sometimes a correlation can be indicative of a causal relationship. Intuitively, it seems unlikely that we'd observe a strong correlation between two absolutely unrelated variables, right? Let's do a quick simulation to test our intuition.

In [None]:
# let's run 10 iterations of our simulation
for i in range(0, 10): 
    
    # create x: a random sample of 10 numbers from a uniform distribution between 0 and 100
    x = np.random.uniform(0, 100, size = 10)  
    
    # create y: a random sample of 10 numbers from a uniform distribution between 0 and 100
    y = np.random.uniform(0, 100, size = 10)
    
    # print the iteration number and the Pearson correlation (r) between x and y
    print("Iteration", i+1, "\n")
    print("r =", np.corrcoef(x, y)[0,1], "\n")

Compare your results with a classmate. Do you see any correlations? (I do!) 

For reference, in psychology, 0.1 < |r| < 0.3 is often considered a weak correlation, 0.3 < |r| < 0.5 is a moderate correlation, and |r| > 0.5 is a strong correlation.

Maybe spurious correlations are more common than we thought...

Of course, we'd expect that if we ran this simulation many, many times, the average correlation would be zero. We just made two random vectors that are entirely independent of one another, and thus, there is no true relationship between them. However, collecting data in the real world is not so unlike running this simulation just once: there's always a chance of spurious correlation (or association).
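To check that claim, we can repeat the simulation many times and look at the whole distribution of r rather than ten individual values. The block below is a minimal sketch: the number of repetitions, the seed, and the variable names (`n_sims`, `sample_size`, `rs`) are my own choices for illustration, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

n_sims = 10_000    # hypothetical number of repetitions
sample_size = 10   # same sample size as in the loop above

# draw independent x and y for every repetition at once (one row per repetition)
x = rng.uniform(0, 100, size=(n_sims, sample_size))
y = rng.uniform(0, 100, size=(n_sims, sample_size))

# compute the Pearson correlation for each row
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
rs = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

mean_r = rs.mean()
share_moderate = (np.abs(rs) > 0.3).mean()  # at least "moderate" by the thresholds above
share_strong = (np.abs(rs) > 0.5).mean()    # "strong" by the thresholds above

print("mean r:", round(mean_r, 3))
print("share of runs with |r| > 0.3:", round(share_moderate, 3))
print("share of runs with |r| > 0.5:", round(share_strong, 3))
```

With only 10 observations per sample, the average r should sit close to zero while a sizeable share of individual runs still crosses the |r| > 0.3 threshold. Increasing `sample_size` should shrink that share.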

------
### Part 2: Modelling HR data

For the next part of the exercise, we'll work with a dataset from the world of HR Analytics (or "People Analytics"). The dataset comprises 10,000 instances, where each instance represents a recent graduate employed in the retail industry in the United Kingdom. The task at hand is to *understand how working overtime affects income*.

The dataset has the following attributes:
* `Income` (target variable): the annual salary after tax in thousands of GBP
* `Consc`: conscientiousness, as scored in a survey of the employees' managers; a personality trait defined as "the quality of wishing to do one's work or duty well and thoroughly"
* `Happiness`: the self-reported happiness of the individual 
* `Overtime`: average weekly overtime hours worked 

Let's read in the data.

In [None]:
hr_data = pd.read_csv('hr_data.csv')

Intuitively, it seems clear that working overtime would increase income — retail employees usually get paid hourly and thus get extra pay for overtime shifts. So let's do what we do best; let's build a model with scikit-learn and interpret the outputs.

Let's fit a linear regression with just `Overtime` predicting `Income`.

In [None]:
X = hr_data[['Overtime']]
y = hr_data['Income']

lr = LinearRegression()
lr.fit(X,y)

Now use 5-fold cross-validation to evaluate the model in terms of R-squared and mean absolute error (MAE) like so:

In [None]:
r2 = cross_val_score(lr, X, y, cv=5).mean()
mae = -cross_val_score(lr, X, y, cv=5, scoring='neg_mean_absolute_error').mean()

print(f"R-squared: {r2}")
print(f"MAE: {mae}")

R-squared is pretty high. Looking good!

Now let's inspect the coefficient to see how much `Overtime` affects `Income`.

In [None]:
coefs = lr.coef_
sorted_coefs = sorted(zip(X.columns, coefs), key=lambda e: e[1], reverse=True)
sorted_coefs

It looks like 1 hour of overtime per week increases annual income by about 2.6 thousand GBP on average. Impressive!


# <font color='red'>TASK 1</font>


But what about the other attributes? It seems like conscientiousness could also influence income. 

Fit a linear regression to the data with `Overtime` and `Consc` predicting `Income`. Use 5-fold cross-validation to evaluate the model in terms of R-squared and MAE, and then inspect the coefficients.

In [None]:
# your code here - define X and y, then fit the model


In [None]:
# your code here - use 5-fold cross validation to evaluate R-squared and MAE


In [None]:
# your code here - report the coefficients


Does this model seem better or worse than the previous one? What does this model say the effect of working overtime on income is? 

# <font color='red'>TASK 2</font>

But what about the other attribute? Maybe happiness could also influence income. 

Fit a linear regression to the data with `Overtime`, `Consc`, and `Happiness` predicting `Income`. Use 5-fold cross-validation to see how well the model fits the data, and then inspect the coefficients.

In [None]:
# your code here -  define X and y, then fit the model


In [None]:
# your code here - use 5-fold cross validation to evaluate R-squared and MAE


In [None]:
# your code here -  report the coefficients


Does this model seem better or worse than the previous one? What does this model say the effect of working overtime on income is? 

------
### Part 3: Thinking about causality

Based on your modelling, what do you think the effect of working overtime on income is? Which is the correct coefficient? Based on the R-squared values, it seems like the final model with all attributes is the best, right?

Not so fast! Remember, R-squared tells us how well a model fits some data, which can be indicative of how well it would fit new, unseen data to make predictions. But R-squared does nothing to distinguish association from causation. Similarly, MAE tells us how accurate a model's predictions are, but it tells us nothing about whether a model accurately estimates a cause-and-effect relationship. And since our goal was to *understand how working overtime affects income*, we're looking for an explanation, an estimate of cause-and-effect, not a prediction.

In the figure below, I've drawn a DAG to visualize my causal assumptions about the relationship between the variables in the data using domain knowledge. `Overtime` is the exposure or treatment variable (i.e., the attribute of interest) and `Income` is the outcome variable (i.e., the target variable).

![dag](dag.png) 


I think there's a direct, causal path between `Overtime` and `Income` such that working overtime *causes* one's income to increase, hence the directed green line.

But I also think that `Consc` causally influences both `Overtime` and `Income` — I think conscientious people are more inclined to work overtime, and I think conscientious people generally earn more because they're hard workers who are viewed favorably by their bosses.

As for `Happiness`, I think this is caused by `Income`, not the other way around. And I also think `Happiness` is causally affected by `Overtime` — working overtime decreases happiness.

By the logic visualized in the DAG above, `Consc` is a *confounder* and `Happiness` is a *collider*. To estimate the true, total effect of `Overtime` on `Income`, we should thus include `Consc` but not `Happiness`: **the true coefficient is 2.00**. Including a collider in our model biases the estimate despite (potentially) increasing the model's predictive ability, as displayed by R-squared in this case.
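To make the confounder/collider reasoning concrete, the DAG described above can be written down as a list of directed edges and queried with a few lines of plain Python. This is only a sketch under simplifying assumptions: the helper names `parents` and `children` are hypothetical, and the rules below only look at direct causes and direct effects, which is enough for this four-node graph but is not a general definition of confounding.

```python
# edges of the DAG described above, written as (cause, effect) pairs
edges = [
    ("Consc", "Overtime"),
    ("Consc", "Income"),
    ("Overtime", "Income"),
    ("Overtime", "Happiness"),
    ("Income", "Happiness"),
]

def parents(node, edges):
    """Return the set of direct causes of `node`."""
    return {cause for cause, effect in edges if effect == node}

def children(node, edges):
    """Return the set of direct effects of `node`."""
    return {effect for cause, effect in edges if cause == node}

exposure, outcome = "Overtime", "Income"

# simplified rule: a confounder is a direct cause of both exposure and outcome
confounders = parents(exposure, edges) & parents(outcome, edges)

# simplified rule: a collider is a direct effect of both exposure and outcome
colliders = children(exposure, edges) & children(outcome, edges)

print("confounders:", confounders)  # {'Consc'}
print("colliders:", colliders)      # {'Happiness'}
```

Changing the direction of a single edge (say, making `Happiness` a cause of `Income`) changes which set a variable lands in, which is exactly why the assumptions encoded in the DAG matter.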

The problem with using predictive machine learning models (in this case, linear regression) is that they only find associations. Basically, what the model sees in the data is the following:

![correlation-graph](correlation_graph.png) 

No directed causal paths. Just undirected correlational links between every variable. This is good for prediction, but bad for explanation.
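A toy example can illustrate why associations carry no direction. In the sketch below (a hypothetical two-variable setup of my own, not the HR data), `x` causes `y` by construction, yet the correlation is the same in both directions and a regression fits happily either way.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# hypothetical toy data in which x causes y, and not the other way around
x = rng.normal(size=5_000)
y = 2 * x + rng.normal(size=5_000)

# the correlation matrix is symmetric: corr(x, y) equals corr(y, x)
corr = np.corrcoef(x, y)
r_xy = corr[0, 1]
r_yx = corr[1, 0]

# a simple regression can be fit in either direction
slope_y_on_x = np.polyfit(x, y, 1)[0]  # y as a function of x
slope_x_on_y = np.polyfit(y, x, 1)[0]  # x as a function of y

print("corr(x, y):", round(r_xy, 3), "| corr(y, x):", round(r_yx, 3))
print("slope of y on x:", round(slope_y_on_x, 3))
print("slope of x on y:", round(slope_x_on_y, 3))
```

Both slopes are clearly non-zero, so nothing in these outputs tells us that `x` is the cause. That information has to come from outside the data, for example from a DAG.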

Now you might be wondering: can't we just look at a DAG, pick the appropriate attributes to include in a model, and then interpret the model outputs in an explanatory way?

Yes, we can!

The problem is that drawing a DAG requires domain knowledge and making lots of assumptions. In reality, I have no idea if the DAG above is correct. The only way I know it's correct here, and that the true coefficient is for sure 2.00, is because I actually simulated the HR Analytics dataset and specified the relationship between `Income` and `Overtime`. Here's the code I used to do it:

```
np.random.seed(0)

# randomly sample 10,000 values from a standard normal distribution and then add 4 to each
conscientiousness = np.random.randn(10000) + 4

# define overtime as a function of conscientiousness -- i.e., overtime is caused by conscientiousness
overtime = conscientiousness + 3.5 + 0.75*np.random.randn(10000)

# define income as a function of overtime and conscientiousness -- i.e., income is caused by overtime and conscientiousness
income = conscientiousness + 2*overtime + 10 + 0.75*np.random.randn(10000)

# define happiness as a function of overtime and income -- i.e., happiness is caused by income and overtime
happiness = income - 10*overtime + 6 + 0.75*np.random.randn(10000)

# put it all in a dataframe
data = pd.DataFrame({
    'Consc': conscientiousness,
    'Overtime': overtime,
    'Happiness': happiness,
    'Income': income
})

# export the data as a csv
data.to_csv('hr_data.csv', index=False, header=True)
```
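Because the data-generating process above is known, we can check how each modelling choice recovers (or distorts) the coefficient on `Overtime`. The block below is a sketch that regenerates data with the same equations and compares three specifications. The helper `overtime_coef` and the shortened variable names are my own and hypothetical, and the block works on NumPy arrays directly rather than reading `hr_data.csv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
n = 10_000

# same structural equations as in the simulation code above
consc = np.random.randn(n) + 4
overtime = consc + 3.5 + 0.75 * np.random.randn(n)
income = consc + 2 * overtime + 10 + 0.75 * np.random.randn(n)
happiness = income - 10 * overtime + 6 + 0.75 * np.random.randn(n)

def overtime_coef(*features):
    """Regress income on the given feature arrays and return the
    coefficient of the first feature (always overtime here)."""
    X = np.column_stack(features)
    return LinearRegression().fit(X, income).coef_[0]

coef_unadjusted = overtime_coef(overtime)                  # confounder omitted
coef_adjusted = overtime_coef(overtime, consc)             # confounder included
coef_collider = overtime_coef(overtime, consc, happiness)  # collider included too

print("Overtime only:                 ", round(coef_unadjusted, 2))
print("Overtime + Consc:              ", round(coef_adjusted, 2))
print("Overtime + Consc + Happiness:  ", round(coef_collider, 2))
```

Working through the algebra of these equations, the three estimates should come out at roughly 2.64, 2.00, and 6.00. Only the model that adjusts for the confounder and leaves out the collider lands on the coefficient that was built into the simulation.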

-----

### Part 4: Draw your DAG

For the next part of the exercise, we'll use a web app called DAGitty. Everything you need to know about how to use DAGitty is in the user manual here (especially see sections 4-5): http://www.dagitty.net/manual-2.x.pdf.

Open the following link in a separate tab or window: http://www.dagitty.net. Then click the link to "Launch DAGitty online in your browser," and then click Model > New Model to start with a blank canvas.

# <font color='red'>TASK 3</font>

By this point in the semester, you should have chosen a dataset to work with for the course project. 

**If you're doing a supervised machine learning task for your project (i.e., regression or classification),** use DAGitty to draw your causal assumptions:
1. Start with a node for your target variable (the "outcome")
2. Then add a node for the attribute that you plan to include in your modelling and expect to be most important in predicting your target (an "exposure").
3. Add more attributes and map out the causal structure between the attributes and the target variable.   
4. Are there any variables not included in your dataset that might be causally relevant to your case? Add them as "unobserved" nodes and map out any additional causal structure (e.g., if my target is `exam_grade` and I have two attributes in my dataset, `GPA` and `prior_degree_earned`, it would seem likely that I should consider `weekly_hours_spent_studying` as a causally-relevant, unobserved variable.)
5. Interpret your DAG — are there any biasing paths, unobserved confounders, or colliders that would make it difficult to interpret your coefficients or feature importance in an explanatory way? 

*Hint: If you're having a hard time getting started with your DAG, check out this blog for some tips: https://towardsdatascience.com/creating-minimal-dags-step-by-step-d604cb05e59a*
    

**If you're doing an unsupervised machine learning task for your project (i.e., clustering),** use DAGitty to visualise your causal assumptions for the stroke prediction case you did in Exercise 03:
1. Start with a node for the target variable, `stroke` (the "outcome")
2. Then add a node for the attribute that you plan to include in your modelling and expect to be most important in predicting your target (an "exposure").
3. Add more attributes and map out the causal structure between the attributes and the target variable.   
4. Are there any variables not included in your dataset that might be causally relevant to your case? Add them as "unobserved" nodes and map out any additional causal structure (i.e., are there any attributes not included in the dataset that probably influence the risk of stroke?)
5. Interpret your DAG — are there any biasing paths, unobserved confounders, or colliders that would make it difficult to interpret your coefficients or feature importance in an explanatory way? 

*Hint: If you're having a hard time getting started with your DAG, check out this blog for some tips: https://towardsdatascience.com/creating-minimal-dags-step-by-step-d604cb05e59a*

*Here are the attributes in the stroke dataset:*
* `gender`: "Male", "Female" or "Other"
* `age`: age of the patient
* `hypertension`: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
* `heart_disease`: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease 
* `ever_married`: "No" or "Yes"
* `work_type`: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
* `Residence_type`: "Rural" or "Urban"
* `avg_glucose_level`: average glucose level in blood
* `bmi`: body mass index
* `smoking_status`: "formerly smoked", "never smoked", "smokes" or "Unknown"
* `stroke`: 1 if the patient had a stroke or 0 if not

------

**Congratulations, you made it through all the exercise notebooks for the course!** 
