In [None]:
%reload_ext nb_black

# Kidney Treatment Analysis

You have data collected about experimental kidney stone treatments, and you want to decide which treatment is more effective: A or B.

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

%matplotlib inline

data_url = "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/kidney_stone_data.csv"

* Read and inspect the data.  Do we have any missing values to deal with?

In [None]:
kidney_df = pd.read_csv(data_url)
kidney_df.head()

Which treatment is more successful? How do we go about investigating this?

* Investigate the `pd.crosstab()` function and use it as a way to assess treatment A vs B.
* What do you conclude?
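As a quick orientation to `pd.crosstab()`, here is a minimal sketch on a made-up table. The `toy_df` frame and its `group` and `outcome` columns are hypothetical and are not the kidney stone data. It shows the difference between raw counts and row-normalized rates.

```python
import pandas as pd

# Hypothetical toy data, not the kidney stone dataset.
toy_df = pd.DataFrame(
    {
        "group": ["x", "x", "x", "y", "y", "y", "y", "x"],
        "outcome": [1, 0, 1, 1, 0, 0, 1, 1],
    }
)

# Raw counts: rows are groups, columns are outcome values.
counts = pd.crosstab(toy_df["group"], toy_df["outcome"])

# Row-normalized: each row sums to 1, so the column for
# outcome 1 reads as a success rate per group.
rates = pd.crosstab(toy_df["group"], toy_df["outcome"], normalize="index")

print(counts)
print(rates)
```

Passing `normalize="index"` is usually the more readable view when the groups have different sizes, since raw counts alone can hide the rate.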

We could more formally analyze these numbers with a $\chi^2$ ("chi square") test of independence.  See more on what this procedure is doing in this video from [Khan Academy](https://www.khanacademy.org/math/ap-statistics/chi-square-tests/chi-square-tests-two-way-tables/v/chi-square-test-association-independence).

What do you conclude from this test?

In [None]:
# Input your crosstab here w/o normalizing or row/col totals
crosstab = ____
chi2, p, df, expected = stats.chi2_contingency(crosstab)
p
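For reference, this is roughly what `stats.chi2_contingency` returns on a small table. The counts in `toy_table` are hypothetical and chosen only for illustration; the degrees of freedom are named `dof` here rather than `df`.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table of counts
# (rows: two groups, columns: failure / success).
toy_table = np.array([[20, 80], [35, 65]])

# For 2x2 tables SciPy applies Yates' continuity correction by default.
chi2_stat, p_value, dof, expected_counts = stats.chi2_contingency(toy_table)

print(f"chi2 = {chi2_stat:.3f}, p = {p_value:.4f}, dof = {dof}")

# Counts we would expect in each cell if rows and columns were independent.
print(expected_counts)
```

The `expected_counts` array is worth comparing against the observed table: the test statistic grows as the two diverge.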

Now, include the `'stone_size'` column in your crosstab analysis.

What do you conclude?

The small effect seen in the success rates has reversed! For all stone sizes, treatment A has a higher success rate than treatment B. This is an example of Simpson's paradox:

> Simpson's paradox (or Simpson's reversal, Yule–Simpson effect, amalgamation paradox, or reversal paradox) is a phenomenon in probability and statistics, in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.

from [Wikipedia](https://en.wikipedia.org/wiki/Simpson%27s_paradox)
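To see how such a reversal can arise, here is a small sketch with invented counts. The `toy_counts` table, its `easy` and `hard` groups, and all of its numbers are hypothetical and were picked to make the effect obvious; they are not the kidney stone data.

```python
import pandas as pd

# Hypothetical counts chosen to produce a reversal.
toy_counts = pd.DataFrame(
    {
        "group": ["easy", "easy", "hard", "hard"],
        "treatment": ["A", "B", "A", "B"],
        "successes": [9, 80, 30, 2],
        "patients": [10, 100, 100, 10],
    }
)

# Success rate within each group: A is ahead in both groups.
within = toy_counts.assign(rate=toy_counts["successes"] / toy_counts["patients"])
print(within)

# Success rate after pooling the groups: B is ahead.
pooled = toy_counts.groupby("treatment")[["successes", "patients"]].sum()
pooled["rate"] = pooled["successes"] / pooled["patients"]
print(pooled)
```

In this toy table the reversal comes from how patients are allocated: A is used mostly on the hard cases and B mostly on the easy ones, so the pooled rates largely reflect case difficulty rather than treatment quality.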

----

If we were to run a $\chi^2$ test of independence:

In [None]:
# Input your crosstab here w/o normalizing or row/col totals
crosstab = ____
chi2, p, df, expected = stats.chi2_contingency(crosstab)
p

# Let's p-hack!

First, let's go over the theory behind it.

### Sample size and the t statistic

In a t-test, the p value is determined by the t statistic and the degrees of freedom.  As the absolute value of t increases, p decreases.  The definition of t is below.

$$t = \frac{signal}{noise} = \frac{\overline{x}_{1}-\overline{x}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}$$

The denominator (aka $noise$) is the component that is affected by sample size.  The intuition is that as your sample grows, individual 'noisy' observations get averaged out, so there is less noise overall.

This means that as `n` increases, our denominator decreases.  In fractions/division, when we hold the numerator constant and the denominator gets smaller, the result gets larger (e.g. $\frac{1}{4} = 0.25$ & $\frac{1}{2} = 0.5$).

All of this builds up to... our t statistic will get larger as `n` increases (assuming everything else stays relatively the same).
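A minimal numeric sketch of this claim, assuming both groups have the same `n`. The helper `t_statistic` and the summary statistics passed to it are hypothetical and chosen only for illustration.

```python
import numpy as np


def t_statistic(mean_1, mean_2, std_1, std_2, n):
    """Two-sample t statistic when both groups have n observations."""
    signal = mean_1 - mean_2
    noise = np.sqrt(std_1**2 / n + std_2**2 / n)
    return signal / noise


# Same hypothetical means and standard deviations, two sample sizes.
t_small = t_statistic(5.5, 5.0, 3.0, 3.0, n=25)
t_large = t_statistic(5.5, 5.0, 3.0, 3.0, n=100)

print(t_small, t_large, t_large / t_small)
```

Because `n` sits under a square root, quadrupling the sample size doubles t rather than quadrupling it.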

----

Enough with the theory, prove it.  We have 2 means and standard deviations defined below.

In [None]:
mean_x1 = 11
mean_x2 = 10
std_x1 = 2
std_x2 = 2

* Write a `for` loop that loops over the different values in the `ns` list
* In each iteration, calculate a `t` and `p` value for the given means, standard deviations, and value of `n` (assume both groups had `n` observations).
* Store the p values in a list to print/plot the relationship between p and n

In [None]:
ns = [10, 50, 100, 500, 1000, 5000]
ps = []
for ____:
    signal = ____
    noise = ____
    t = ____

    # Look up p value for given value of t and sample size
    p = stats.t.sf(np.abs(t), 2 * n - 2) * 2
    ____

### Sample size and the confidence interval

The formula we've been using for a 95% confidence interval for a t-test is shown below.  Reason out what will happen to our confidence interval as sample size increases.

$$\overline{x}_{1}-\overline{x}_{2} \pm 1.96 \times \sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}$$

Write a for loop similar to the one above but this time with a focus on confidence intervals.  What happens to the confidence interval as n increases?
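As a starting point, here is a sketch of the interval calculation for a single value of `n`, assuming equal group sizes and the 1.96 multiplier from the formula above. The helper `ci_95` and the numbers passed to it are hypothetical.

```python
import numpy as np


def ci_95(mean_1, mean_2, std_1, std_2, n):
    """Approximate 95% CI for the difference in means, n per group."""
    diff = mean_1 - mean_2
    margin = 1.96 * np.sqrt(std_1**2 / n + std_2**2 / n)
    return diff - margin, diff + margin


# Same hypothetical summary statistics at two sample sizes.
low_25, high_25 = ci_95(5.5, 5.0, 3.0, 3.0, n=25)
low_400, high_400 = ci_95(5.5, 5.0, 3.0, 3.0, n=400)

print((low_25, high_25), high_25 - low_25)
print((low_400, high_400), high_400 - low_400)
```

The interval stays centered on the same difference in means; only its width changes with `n`.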