In [1]:
%reload_ext nb_black

In [None]:
!pip install plotly

## Table of Contents

*(Clicking a link should jump to that section if you are working in Jupyter.)*

* [Imports](#Imports)
  * Self-explanatory
* [Helper functions](#Helper-functions)
  * You can skip this section unless you're curious.  The contents of these functions are outside the scope of what we care about today.
* [Off to the races](#Off-to-the-races)
  * A warm-up to build intuition for the ideas behind t-tests and hypothesis testing
* [t-test math](#t-test-math)
  * Walking through the formula, hypotheses, and p-value
* [Performing t-tests in python](#Performing-t-tests-in-python)
  * Self-explanatory

### Imports

In [2]:
import pandas as pd
import numpy as np

# For performing t-test
from scipy import stats

# For (relatively) easy animated plots
# !pip install plotly
import plotly.graph_objects as go
import plotly.express as px

# For typical plotting
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

### Helper functions

In [None]:
# Helpers for generating race data
def _gen_player_data(color, name="", y=0, n_steps=20):
    """Generate a random race time and split it out over time steps for plotting"""
    s = 0.3
    if color == "#4C78A8":
        s = 4.0

    # Generate random race times
    finish_time = np.random.normal(n_steps * 0.8, s)

    # Find x position for each frame of race
    rate = 1 / finish_time
    time_steps = np.arange(n_steps + 1)
    x_pos = time_steps * rate

    # Store all plotting info for plotly
    race_df = pd.DataFrame(
        {
            "time": time_steps,
            "x": x_pos,
            "y": y,
            "color": color,
            "name": name,
            "finish_time": finish_time,
        }
    )

    # Add a little jitter to be less boring
    excitement = np.ones_like(x_pos) * 0.01
    excitement[: len(excitement) // 2] *= -1
    np.random.shuffle(excitement)
    race_df["x"] += excitement
    race_df.loc[0, "x"] = 0

    return race_df


def _gen_race_data(players, colors=px.colors.qualitative.T10):
    """'Simulate' a marble race between players"""
    race_dfs = []
    name_colors = zip(players, colors)
    for i, (name, color) in enumerate(name_colors):
        race_df = _gen_player_data(color, name, i * 0.1)
        race_dfs.append(race_df)

    return pd.concat(race_dfs).reset_index(drop=True)


def marble_race(players, seed=None):
    """'Simulate' a marble race"""
    if isinstance(seed, int):
        np.random.seed(seed)
    race_df = _gen_race_data(players)

    return (
        race_df[["color", "name", "finish_time"]]
        .drop_duplicates()
        .reset_index(drop=True)
    )

In [None]:
def plot_marble_race(players, seed=None):
    """'Simulate' and plot a marble race"""
    if isinstance(seed, int):
        np.random.seed(seed)
    race_df = _gen_race_data(players)

    color_df = race_df[["color", "name"]].drop_duplicates()
    color_discrete_map = {}
    for _, row in color_df.iterrows():
        color_discrete_map[row["name"]] = row["color"]

    fig = px.scatter(
        data_frame=race_df,
        x="x",
        y="y",
        color="name",
        text="name",
        animation_frame="time",
        title="Thinkful Marble Racing Series",
        color_discrete_map=color_discrete_map,
    )

    fig.update_traces(marker={"size": 20})
    fig.update_layout(showlegend=False)

    fig.add_trace(
        go.Scatter(x=[1, 1], y=[-300, 300], mode="lines", line={"color": "black"},)
    )

    fig.update_xaxes(
        {"range": [-0.1, 1.1], "showgrid": False, "zeroline": False, "visible": False,}
    )
    fig.update_yaxes(
        {"range": [-0.1, 1.1], "showgrid": False, "zeroline": False, "visible": False,}
    )

    return fig

### Off to the races

In [None]:
racers = [
    "Adam",
    "Anthony",
    "Dillan",
    "Gaukhar",
    "Harinder",
    "James",
    "Joshua",
    "Leon",
    "Mason",
    "Rachel",
    "Steve",
]

**Claim: Blue marbles are faster than other marbles.**

* Do you believe me that blue marbles are faster? Why or why not?

#### Race 1

In [None]:
plot_marble_race(racers, 90)

In [None]:
results = marble_race(racers, 90)
results.sort_values("finish_time").head(3)

* Do you believe me now that blue marbles are faster? Why or why not?

#### Race 2

In [None]:
plot_marble_race(racers, 1337)

In [None]:
results = marble_race(racers, 1337)
results.sort_values("finish_time").head(3)

* Do you believe me now that blue marbles are faster? Why or why not?

#### Race 3

In [None]:
plot_marble_race(racers, 42)

In [None]:
results = marble_race(racers, 42)
results.sort_values("finish_time").head(3)

* Do you believe me now that blue marbles are faster? Why or why not?

#### Race 4

In [None]:
plot_marble_race(racers, 8675309)

In [None]:
results = marble_race(racers, 8675309)
results.sort_values("finish_time").head(3)

* Do you believe me now that blue marbles are faster? Why or why not?

What would it take for me to convince you?

## t-test math

Intuitively, we can think of a t-test (and many statistical tests) as a ratio of signal to noise:

**Intuition only $t$ formula:**

$$t = \frac{signal}{noise}$$


* If we keep $noise$ the same and the $signal$ gets larger, what happens to the value of $t$?  Does $t$ get larger or smaller as the $signal$ gets large?
* If we keep $signal$ the same and the $noise$ gets larger, what happens to the value of $t$?  Does $t$ get larger or smaller as the $noise$ gets large?
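A quick numeric sketch of the ratio may help when thinking through the two questions above. The numbers are made up for illustration, and `t_ratio` is a hypothetical helper, not part of any library.

```python
def t_ratio(signal, noise):
    """Intuition-only t: signal divided by noise."""
    return signal / noise


# Hold the noise fixed at 2.0 and grow the signal
fixed_noise = [t_ratio(signal, 2.0) for signal in (1.0, 2.0, 4.0)]

# Hold the signal fixed at 2.0 and grow the noise
fixed_signal = [t_ratio(2.0, noise) for noise in (1.0, 2.0, 4.0)]

print(fixed_noise)
print(fixed_signal)
```

Try changing the values and predicting the direction of the change before you run the cell.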


### Signal

The signal is the thing we are actually trying to measure.  Intuitively, the signal of a t-test asks: "is there a difference of means between the two groups?"  Mathematically, this signal is $\overline{x}_{1}-\overline{x}_{2}$ (the difference of means between the two groups).

**Updated $t$ formula:**

* $\overline{x}$ = sample mean

$$t = \frac{\overline{x}_{1}-\overline{x}_{2}}{noise}$$

If the 2 group means are identical, what is the value of $t$?

### Noise

The noise is trying to quantify the quality and consistency of our evidence.  Do we have a lot of evidence (i.e. do we have a large sample size)?  Is this evidence consistent or are the numbers all over the place (i.e. do we have high variance)?

For a slightly more advanced intuition: if you're familiar with the '[standard error](https://www.statisticshowto.com/what-is-the-standard-error-of-a-sample/)', the noise of a t-test follows the same idea, but accounts for both of our groups at once.
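As a rough illustration of the standard error idea for a single group, the sketch below draws two samples of different sizes from the same made-up population. The `standard_error` helper, the seed, and the population parameters are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only


def standard_error(sample):
    """Sample standard deviation (ddof=1) divided by sqrt(sample size)."""
    sample = np.asarray(sample)
    return sample.std(ddof=1) / np.sqrt(len(sample))


# Same population, two different amounts of evidence
small = rng.normal(loc=10, scale=2, size=20)
large = rng.normal(loc=10, scale=2, size=2000)

se_small = standard_error(small)
se_large = standard_error(large)
print(se_small, se_large)
```

More evidence (a larger $n$) should shrink the noise; a more scattered sample (a larger $s$) should grow it.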

**Updated $t$ formula:**


* $n$ = sample size (how much evidence?)
* $s$ = sample standard deviation (how consistent?)

$$t = \frac{signal}{noise}$$

$$t = \frac{\overline{x}_{1}-\overline{x}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}}$$

Translate this formula to python:

In [None]:
np.random.seed(42)

# test data (expected output of calculating t is about -169)
x1 = np.random.normal(10, 0.1, 50)
x2 = np.random.normal(13, 0.1, 50)

In [None]:
signal = ____ - ____
noise = np.sqrt(____ / ____ + ____ / ____)
t = signal / noise

In [None]:
t
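After filling in the blanks, one way to sanity-check a hand-written $t$ is to compare it against SciPy. The sketch below uses its own made-up data rather than `x1` and `x2`, and `welch_t` is a hypothetical helper name. Passing `equal_var=False` to `stats.ttest_ind()` requests the version of the test that keeps $s_1$ and $s_2$ separate, as the formula above does (often called Welch's t-test).

```python
import numpy as np
from scipy import stats


def welch_t(a, b):
    """t = (mean_a - mean_b) / sqrt(s_a^2 / n_a + s_b^2 / n_b)."""
    a, b = np.asarray(a), np.asarray(b)
    signal = a.mean() - b.mean()
    # ddof=1 gives the sample variance; NumPy's default is ddof=0
    noise = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return signal / noise


rng = np.random.default_rng(1)  # illustrative data only
a = rng.normal(5.0, 1.0, 40)
b = rng.normal(5.5, 2.0, 60)

manual_t = welch_t(a, b)
scipy_t = stats.ttest_ind(a, b, equal_var=False).statistic
print(manual_t, scipy_t)
```

The two printed values should agree up to floating-point error. Note the `ddof=1`: `np.std` and `np.var` divide by $n$ by default, while the sample standard deviation $s$ divides by $n - 1$.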

### p-values & hypotheses

If the 2 group means are identical, we have a $t$ value of 0.  As the distance between the 2 means increases, so does the magnitude of $t$.  At what point do we declare significance?  This is where the famous "p-value" comes in.  A p-value measures how likely a value of $t$ at least this extreme would be, assuming there's no difference in means.

Remember in our marble race, we said the burden of proof should fall on the person claiming blue is faster.  Our default position is the status quo.

In court, people are innocent until proven guilty.  In a t-test, the 2 means are the same until proven different; this means our default position is that the 2 means are equal.  The way this is represented in statistical testing is through 'null' and 'alternative' hypotheses.  The null hypothesis is the default position (the means are equal); the alternative hypothesis challenges the default position (the means are different).  Formally, you might see these written as below.

* $H_0$: There is no difference of means; $\mu_1 - \mu_2 = 0$
* $H_a$: There is a difference of means; $\mu_1 - \mu_2 \neq 0$

(Here $\mu$ is a group's true population mean, which the sample mean $\overline{x}$ estimates.)

Our p-value is telling us "what's the probability we'd observe a difference of means at least this large, assuming that there's truly no difference?" or "what's the probability we'd observe data like this, assuming our null hypothesis?".

What's the probability that the blue marble wins a single race given there's no difference in speed based on marble color?  There's a pretty good chance (i.e. a high **p**robability blue can win 1 without a true speed difference), so we indicate this with a large p-value.

What's the probability that the blue marble wins 10 out of 20 races given there's no difference in speed based on marble color?  There's still a chance that this happens, but the **p**robability of seeing this many wins without a true difference is lower, so we indicate this with a lower p-value.

What's the probability that the blue marble wins 20 out of 20 races given there's no difference in speed based on marble color?  There's a small chance that this happens, but the **p**robability of seeing this without a true difference is very small, so we indicate this with a low p-value.
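The three marble questions above can be put into rough numbers with a binomial model. This is only a sketch under two stated assumptions: "no difference" is taken to mean each of the 11 racers is equally likely to win any race, and "wins a single race" is read as "wins at least 1 of 20 races". `prob_at_least` is a hypothetical helper.

```python
from scipy import stats

n_racers = 11  # length of the `racers` list
n_races = 20
# Assumption: with no difference, every marble is equally likely to win
p_win = 1 / n_racers


def prob_at_least(k_wins):
    """P(blue wins k_wins or more of n_races) under the no-difference assumption."""
    # sf(k) is P(X > k), so pass k_wins - 1 to include k_wins itself
    return stats.binom.sf(k_wins - 1, n_races, p_win)


for k in (1, 10, 20):
    print(k, prob_at_least(k))
```

The printed probabilities shrink as the number of wins grows, which is the pattern the three questions describe.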

* Step 1: Assume there's no difference.
* Step 2: Collect data.
* Step 3: What's the probability we see this data given our assumption of no difference?
* Step 4: Compare this probability to our threshold of being convinced of a difference (commonly 5%, but this can be changed)
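The four steps above can be made concrete with a permutation test. Note that this is a different technique from the t-test: instead of using the $t$ formula, it shuffles the group labels many times to see how often chance alone produces a gap as large as the observed one. The data, seed, and number of shuffles are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

# Step 2: collect data (synthetic samples standing in for two groups)
group_a = rng.normal(16.0, 1.0, 30)
group_b = rng.normal(17.0, 1.0, 30)
observed = abs(group_a.mean() - group_b.mean())

# Step 1: assume no difference, so the group labels are interchangeable
pooled = np.concatenate([group_a, group_b])
n_shuffles = 5000
count_extreme = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:30].mean() - shuffled[30:].mean())
    if diff >= observed:
        count_extreme += 1

# Step 3: how often does chance alone produce a gap this large?
p_value = count_extreme / n_shuffles

# Step 4: compare to the threshold of being convinced
alpha = 0.05
print(p_value, p_value < alpha)
```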

## Performing t-tests in python

**Enough with your discussion and intuition! Tell me the steps as directly as possible.**

* Separate your data into the 2 groups of interest
* Check that the data is normal enough
* Input the data into `stats.ttest_ind()` and out comes a value of $t$ and a $p$-value
* Compare the $p$-value to 5%
  * Is the $p$-value less than 5%? Significant difference of means
  * Is the $p$-value greater than 5%? No significant difference of means
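A minimal end-to-end sketch of these steps on synthetic data follows (the normality check is left out here). The group names, seed, and parameters are assumptions for illustration. One detail worth knowing: `stats.ttest_ind()` defaults to `equal_var=True`, which pools the two variances; `equal_var=False` keeps $s_1$ and $s_2$ separate, as in the formula from the math section.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed; the data are synthetic

# Separate the data into the 2 groups of interest
group_a = rng.normal(10.0, 1.0, 200)
group_b = rng.normal(10.0, 1.0, 200)  # drawn with the same mean as group_a

# Input the data; out come t and the p-value
result = stats.ttest_ind(group_a, group_b, equal_var=False)

# Compare the p-value to 5%
alpha = 0.05
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
if result.pvalue < alpha:
    print("Significant difference of means")
else:
    print("No significant difference of means")
```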

----

Let's do this for our marbles.

* Use the `marble_race` function to run 100 races and combine these dataframes into 1 big dataframe.

In [None]:
race_dfs = []
for i in ____:
    race_df = marble_race(racers)
    race_dfs.____(____)

marble_data = pd.____(race_dfs)

* Create a column in the dataframe that indicates what group the marble belongs to (i.e. blue vs not blue)
* Create a boxplot or violinplot that compares these 2 groups
* What are your thoughts at this point? Do you expect a difference?

In [None]:
color_blue = "#4C78A8"

marble_data['is_blue'] = _____

sns._____(x=____, y=____, data=marble_data)
plt.show()

* Separate the data into 2 dataframes: `blue` & `not_blue`

In [None]:
blue = ____
not_blue = ____

* Are the 2 groups normal enough?
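There are several ways to judge "normal enough"; the sketch below shows two common ones on a synthetic sample (not the marble data): summary skewness and kurtosis from `stats.describe()`, and the Shapiro-Wilk test from `stats.shapiro()`. A histogram or Q-Q plot is a reasonable visual companion. Any cutoffs you apply to these numbers are a judgment call.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)  # arbitrary seed
# Synthetic stand-in for one group's finish times
sample = rng.normal(16.0, 4.0, 500)

description = stats.describe(sample)
print("skewness:", description.skewness)
# Excess kurtosis; 0 for a normal distribution
print("kurtosis:", description.kurtosis)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
shapiro_stat, shapiro_p = stats.shapiro(sample)
print("Shapiro-Wilk p-value:", shapiro_p)
```

You would run the same checks once for `blue` and once for `not_blue`.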

* Perform the t-test.  Is there a significant difference? Do we reject the null hypothesis?

In [None]:
stats.ttest_ind(____, ____)

* If there is a significant difference, how big is it?
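One possible way to answer "how big?" is a confidence interval around the difference of means. The sketch below is an assumption about where this could go, not the notebook's own follow-up; it uses synthetic groups and the Welch-Satterthwaite approximation for the degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)  # arbitrary seed; synthetic data
group_a = rng.normal(16.0, 4.0, 100)
group_b = rng.normal(17.0, 1.0, 1000)

n_a, n_b = len(group_a), len(group_b)
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)

# Signal and noise, as in the t formula
diff = group_a.mean() - group_b.mean()
se = np.sqrt(var_a / n_a + var_b / n_b)

# Welch-Satterthwaite approximation for the degrees of freedom
df = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)

t_crit = stats.t.ppf(0.975, df)  # two-sided 95% interval
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
print(f"difference = {diff:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```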

...to be continued...