# Causal analysis in drug evaluation
This notebook shows how causal inference can be used to resolve an instance of Simpson's paradox. It is based on an exercise in Judea Pearl's book "Causal Inference in Statistics: A Primer". The data describe an observational study in which all patients suffer from the same illness. We are tasked with determining the effectiveness of a drug that some of the patients have been prescribed. Curiously, the drug seems to help both the male and the female subpopulations, but in the overall population it seems to inhibit recovery. Should we recommend the drug to a random patient? To a man? To a woman?

## Extracting information from data
### a) Read the data into a pandas DataFrame!

In [29]:
import pandas as pd
df = pd.read_csv("drugs_data.csv", index_col=0)
df

Unnamed: 0,Sex,Drug,Recovered
0,Woman,1,0
1,Woman,1,1
2,Man,0,1
3,Man,1,1
4,Woman,1,1
...,...,...,...
9995,Man,0,1
9996,Woman,0,1
9997,Woman,1,1
9998,Man,0,0


### b) Find out how many of the patients
- are men,
- are women,
- have taken the drug,
- have not taken the drug,
- have recovered,
- have not recovered.

### Note that 1 means True, 0 means False. Provide your answers in percentages, not absolute numbers.

In [25]:


def count_properties(df, prop, value):
    """Print the share of rows where df[prop] == value, as a percentage."""
    share = (df[prop] == value).mean()  # fraction of matching rows, between 0 and 1
    print(f"{prop} {value} holds for {share:.2%} of patients")

count_properties(df, "Sex", "Man")
count_properties(df, "Sex", "Woman")
count_properties(df, "Drug", True)
count_properties(df, "Drug", False)
count_properties(df, "Recovered", True)
count_properties(df, "Recovered", False)

Sex Man holds for 50.06% of patients
Sex Woman holds for 49.94% of patients
Drug True holds for 50.51% of patients
Drug False holds for 49.49% of patients
Recovered True holds for 80.26% of patients
Recovered False holds for 19.74% of patients
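
The same shares can also be read off without a helper function. The following is a minimal sketch, assuming a small hypothetical `toy` DataFrame (not the study data) with the same column names; it relies on `Series.value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical toy data with the same columns as drugs_data.csv (not the real study data).
toy = pd.DataFrame({
    "Sex": ["Man", "Man", "Man", "Woman"],
    "Drug": [1, 0, 1, 1],
    "Recovered": [1, 1, 0, 1],
})

# Share of each value per column, expressed in percent.
shares = {col: toy[col].value_counts(normalize=True) * 100 for col in toy.columns}

for col, pct in shares.items():
    for value, share in pct.items():
        print(f"{col} = {value}: {share:.1f}%")
```

Replacing `toy` by `df` should give the six numbers requested in b) in one pass.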


### c) Create a table (or dictionary etc.) that shows how frequently recovery has been observed for
- Men that have taken the drug,
- Men that have not taken the drug,
- Women that have taken the drug,
- Women that have not taken the drug,
- Patients that have taken the drug,
- Patients that have not taken the drug.
### Note that 1 means True, 0 means False. Provide your answers in percentages, not absolute numbers.

In [32]:
def get_recovery_rate(df, sex, drug):
    """Return the recovery rate for one sex ("Man"/"Woman") or for "Both"."""
    if sex == "Both":
        filtered_df = df[df.Drug == drug]
    else:
        filtered_df = df[(df.Sex == sex) & (df.Drug == drug)]
    return filtered_df["Recovered"].mean()

# Rows: subpopulation; columns: recovery rate without and with the drug.
results_df = pd.DataFrame()
sexes = ["Man", "Woman", "Both"]
for drug in [0, 1]:
    results_df[drug] = [get_recovery_rate(df, sex, drug) for sex in sexes]
results_df.index = sexes
results_df.columns = ["No Drug", "Drug"]
print(results_df)

        No Drug      Drug
Man    0.870668  0.915039
Woman  0.670114  0.738515
Both   0.831481  0.774302
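
A table of this shape can also be built with `groupby`, which avoids the explicit loop. This is only a sketch on a hypothetical `toy` DataFrame (not the study data); the column names match the CSV, everything else is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the real study data), same columns as drugs_data.csv.
toy = pd.DataFrame({
    "Sex": ["Man", "Man", "Man", "Man", "Woman", "Woman", "Woman", "Woman"],
    "Drug": [0, 0, 1, 1, 0, 0, 1, 1],
    "Recovered": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Recovery rate per (Sex, Drug) cell, with the Drug values spread over the columns.
table = toy.groupby(["Sex", "Drug"])["Recovered"].mean().unstack("Drug")

# Recovery rate per Drug value over all patients, appended as a "Both" row.
table.loc["Both"] = toy.groupby("Drug")["Recovered"].mean()

# Convert to percentages and use readable column names.
table = (table * 100).rename(columns={0: "No Drug", 1: "Drug"})
print(table)
```

Multiplying by 100 reports the rates as percentages, as the task asks.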


### d) Imagine using a linear regression to model the dependency of
- i) Recovery on Drug,
- ii) Recovery on Drug and Sex.

### Predict what the coefficient of Drug will look like in each model.

Answer: i) negative and close to 0; ii) positive and close to 0.
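
One way to arrive at a prediction: with a single binary regressor, the least-squares slope equals the difference between the two group means. For model i) this suggests a Drug coefficient of about 0.774302 - 0.831481 ≈ -0.057, taken from the "Both" row of the table in c). For model ii) the coefficient is a weighted average of the within-sex differences (about 0.044 for men and 0.068 for women), so it should land between those two values. The sketch below checks the first fact on hypothetical toy arrays (not the study data) using NumPy.

```python
import numpy as np

# Hypothetical toy data: binary treatment indicator and binary outcome.
drug = np.array([0, 0, 0, 0, 1, 1, 1, 1])
recovered = np.array([1, 1, 1, 0, 1, 1, 0, 0])

# Least-squares fit of recovered = intercept + slope * drug.
X = np.column_stack([np.ones_like(drug), drug])
(intercept, slope), *_ = np.linalg.lstsq(X, recovered, rcond=None)

# Difference in recovery rates between treated and untreated patients.
mean_diff = recovered[drug == 1].mean() - recovered[drug == 0].mean()

print(f"slope = {slope:.3f}, difference in means = {mean_diff:.3f}")
```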

### e) Using a package of your choice (e.g. statsmodels), fit the linear regression and confirm your intuition from the previous task.

In [35]:
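import statsmodels.api as sm

Y = df["Recovered"]  # the outcome column is named "Recovered"

# Model i): Recovered ~ Drug
X_drug = sm.add_constant(df[["Drug"]])
print(sm.OLS(Y, X_drug).fit().params)

# Model ii): Recovered ~ Drug + Sex, with Sex encoded as a 0/1 indicator
X_both = sm.add_constant(df[["Drug"]].assign(Woman=(df["Sex"] == "Woman").astype(int)))
print(sm.OLS(Y, X_both).fit().params)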

ModuleNotFoundError: No module named 'statsmodels'

### f) Describe the paradox that the data seem to present!

Answer: Taking the drug seems to help both men and women in their respective subpopulations, but for the overall population not taking it seems to be the better choice.
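
The reversal is arithmetically possible because each "Both" rate is a weighted average of the per-sex rates, and the weights (the sex composition of each treatment group) may differ between the two groups. The sketch below takes the rates from the table in c) and solves for the share of men that each "Both" value implies. The helper name `implied_share_of_men` is made up for this illustration.

```python
# Recovery rates copied from the table in part c).
rates = {
    "No Drug": {"Man": 0.870668, "Woman": 0.670114, "Both": 0.831481},
    "Drug": {"Man": 0.915039, "Woman": 0.738515, "Both": 0.774302},
}


def implied_share_of_men(group):
    """Solve Both = w * Man + (1 - w) * Woman for w, the share of men in the group."""
    r = rates[group]
    return (r["Both"] - r["Woman"]) / (r["Man"] - r["Woman"])


for group in rates:
    print(f"{group}: implied share of men = {implied_share_of_men(group):.1%}")
```

The implied shares come out at roughly 80% men among untreated patients and roughly 20% men among treated patients, so the two treatment groups have very different compositions.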

## Causal inference to the rescue (qualitative)
### a) Randomized Controlled Trials (RCTs) are the gold standard for clinical studies. Our patients have decided for themselves whether to take the drug. In an RCT, the experimenters decide for each patient whether they will take the drug or not (i.e. receive a placebo). What would the causal graph over Sex, Drug and Recovery look like if it was generated by an RCT?

### b) Can you draw the causal graph for the scenario that has generated our data? Start by stating which edges must (not) exist and identify the correct graph from the remaining possibilities.

### c) What is the difference between the two graphs?
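
For comparing candidate graphs, it can help to write each one down as a set of directed edges. The sketch below encodes one plausible pair of answers to a) and b) as an assumption: randomization removes every arrow into Drug, while in the observational setting Sex may also influence who takes the drug. The edge sets and the `parents` helper are illustrative and are not taken from the exercise.

```python
# Assumed edge sets (cause, effect); they encode one plausible answer, not a given.
rct_graph = {("Drug", "Recovery"), ("Sex", "Recovery")}
observational_graph = rct_graph | {("Sex", "Drug")}


def parents(graph, node):
    """Return the set of direct causes of `node` in `graph`."""
    return {cause for cause, effect in graph if effect == node}


print("Extra edges in the observational graph:", observational_graph - rct_graph)
print("Parents of Drug (RCT):", parents(rct_graph, "Drug"))
print("Parents of Drug (observational):", parents(observational_graph, "Drug"))
```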

### d) Can you find an explanation for the paradox in our data?

### e) Taking into account your explanation, would you recommend taking the drug to a random person? To a man? To a woman?

### f) Can you think of a scenario where the data is exactly the same, but your recommendation is the opposite? You are allowed to replace Sex by a different binary variable with the same distribution (e.g. for an Income variable, you would replace each 'Man' by 'low' and each 'Woman' by 'high').

## Causal inference to the rescue (quantitative)
### Given that our qualitative considerations about the data-generating process still require some mental gymnastics, let us exploit Pearl's theory of graph-based causal inference and automate the causal analysis with the `cause2e` package. Based on the example analysis with the Sprinkler dataset, create your own causal end-to-end analysis and decide if you would recommend taking the drug! As a starting point, use the example analysis notebook. You will see that only minor changes are necessary to adapt the analysis to the new use case.
- read the data
- provide domain knowledge (qualitative and quantitative)
- learn the causal graph from data and domain knowledge
- postprocess it (if necessary) or provide new domain knowledge
- estimate all causal effects
- describe the results and make your recommendation
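
The estimated effect of Drug on Recovery can be sanity-checked by hand. The sketch below assumes that Sex is the only confounder (Sex influences both Drug and Recovery), in which case the backdoor adjustment formula applies: P(Recovered | do(Drug = d)) = Σ_s P(Recovered | Drug = d, Sex = s) · P(Sex = s). The numbers are copied from parts b) and c). The graph assumption and the helper name `adjusted_recovery` are ours, and the sketch does not use the `cause2e` API.

```python
# Values copied from parts b) and c); the confounding assumption is ours.
p_sex = {"Man": 0.5006, "Woman": 0.4994}
recovery = {
    ("Man", 0): 0.870668,
    ("Man", 1): 0.915039,
    ("Woman", 0): 0.670114,
    ("Woman", 1): 0.738515,
}


def adjusted_recovery(drug):
    """Backdoor adjustment over Sex: sum over s of P(Recovered | Drug, s) * P(s)."""
    return sum(recovery[(sex, drug)] * p for sex, p in p_sex.items())


# Average causal effect of the drug under the stated assumption.
ate = adjusted_recovery(1) - adjusted_recovery(0)

print(f"P(Recovered | do(Drug=1)) = {adjusted_recovery(1):.4f}")
print(f"P(Recovered | do(Drug=0)) = {adjusted_recovery(0):.4f}")
print(f"Average causal effect     = {ate:+.4f}")
```

Under this assumption the adjusted effect is positive, in line with the per-sex rows of the table rather than the "Both" row. If the adjusting variable were instead a consequence of taking the drug, this adjustment would not be appropriate.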