## Tiny Statistics Review

What is a distribution and what is a standard deviation?

Let's look to [this resource](https://www.mathsisfun.com/data/standard-normal-distribution.html) from Math is Fun!
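
As a quick illustration of both ideas, here is a minimal sketch using only Python's standard library. The sample values are made up for this example and do not come from any dataset in this notebook. It computes a sample standard deviation, then asks a standard normal distribution how much of its area falls within one, two, and three standard deviations of the mean.

```python
import statistics

# Hypothetical sample, invented for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

sample_mean = statistics.mean(sample)
sample_std = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)

# A normal distribution with mean 0 and standard deviation 1.
standard_normal = statistics.NormalDist(mu=0, sigma=1)

# Share of the distribution within k standard deviations of the mean.
share_within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

print(sample_mean, round(sample_std, 3))
for k, share in share_within.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The three shares are the familiar "68, 95, 99.7" rule of thumb for normally distributed data.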

## Confidence Intervals

>The 95% confidence interval is a range of values that you can be 95% certain contains the true mean of the population. As the sample size increases, the range of interval values will narrow, meaning that you know that mean with much more accuracy compared with a smaller sample.

- [Simply Psychology](https://www.simplypsychology.org/confidence-interval.html)

Typically we use the normal distribution for calculating confidence intervals when we have more than about 120 samples. For smaller samples (under about 120), we can use the wider, flatter [t-distribution](https://www.statisticshowto.com/probability-and-statistics/t-distribution/), whose heavier tails account for the extra uncertainty in a standard deviation estimated from few points, and which _looks like_ the normal distribution at and above about 120 samples.
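
To see that convergence, the sketch below compares the critical value each distribution would give for a two-sided 95% interval. It assumes `scipy` is installed, and the list of degrees of freedom is an arbitrary choice for illustration.

```python
from scipy.stats import norm, t

confidence = 0.95
upper_tail = 1 - (1 - confidence) / 2  # 0.975 for a two-sided 95% interval

# Critical value from the normal distribution (does not depend on sample size).
normal_critical = norm.ppf(upper_tail)

# Critical values from the t-distribution at a few degrees of freedom.
t_critical = {df: t.ppf(upper_tail, df) for df in (2, 6, 30, 120, 1000)}

print(f"normal: {normal_critical:.3f}")
for df, value in t_critical.items():
    print(f"t with {df:>4} degrees of freedom: {value:.3f}")
```

With very few samples the t critical value is much larger, which widens the interval. By around 120 degrees of freedom it differs from the normal value only in the second decimal place.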

In [22]:
import sys
!{sys.executable} -m pip install scipy 

import math
from scipy.stats import t
import numpy as np

def confidence_interval_for_collection(sample_size=(), standard_deviation=(), mean=(), confidence=0.95):
    """Return (low_end, high_end) lists of t-based confidence bounds, one pair per group."""
    degrees_freedom = [count - 1 for count in sample_size]
    # Two-sided interval: split the leftover probability evenly between both tails.
    outlier_tails = (1.0 - confidence) / 2.0
    # t.ppf of the lower tail is negative, so flip the sign to get the critical value.
    t_distribution_number = [-1 * t.ppf(outlier_tails, df) for df in degrees_freedom]

    # Standard error of the mean, then the margin of error for each group.
    standard_error = [std / math.sqrt(count) for std, count in zip(standard_deviation, sample_size)]
    margin = [error * t_number for error, t_number in zip(standard_error, t_distribution_number)]

    low_end = [mean_num - step_num for mean_num, step_num in zip(mean, margin)]
    high_end = [mean_num + step_num for mean_num, step_num in zip(mean, margin)]

    return low_end, high_end



In [27]:
import pandas as pd

# Build one row per activity and year: the target, plus the mean, std, and count of completion times.
aggregation = pd.read_csv('metrics.csv') \
        .assign(year=lambda df: df["Period Start"].str[-4:]) \
        .assign(activity_year=lambda df: df["Activity"] + " (" + df["year"] + ")") \
        .assign(average_days_to_complete_activity=lambda df: df["Average Days to Complete Activity"].astype(float)) \
        .groupby('activity_year') \
        .agg({
             'Target Response Days': 'max', 
             'average_days_to_complete_activity': ['mean','std'],
             'Activity' : 'count'
            })\
        .reset_index()

# Flatten the two-level column names that .agg produces, e.g. "Activity count".
aggregation.columns = [' '.join(col).strip() for col in aggregation.columns.values]
aggregation["conf_interval_bottom"], aggregation["conf_interval_top"] = confidence_interval_for_collection(
    sample_size=aggregation["Activity count"],
    standard_deviation=aggregation["average_days_to_complete_activity std"],
    mean=aggregation["average_days_to_complete_activity mean"],
)

# Slippage is how far past the target the work ran; negative means ahead of target.
aggregation["average_slippage"] = aggregation["average_days_to_complete_activity mean"] - aggregation["Target Response Days max"]
# Pessimistic version: measure the top of the confidence interval against the target.
aggregation["slippage_corrected"] = aggregation["conf_interval_top"] - aggregation["Target Response Days max"]

aggregation

| | activity_year | Target Response Days max | average_days_to_complete_activity mean | average_days_to_complete_activity std | Activity count | conf_interval_bottom | conf_interval_top | average_slippage | slippage_corrected |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alley Grading-Unimproved (2011) | 180 | 37.270000 | 30.426184 | 3 | -38.312832 | 112.852832 | -142.730000 | -67.147168 |
| 1 | Alley Grading-Unimproved (2012) | 180 | 203.708571 | 159.513088 | 7 | 56.183571 | 351.233572 | 23.708571 | 171.233572 |
| 2 | Alley Grading-Unimproved (2013) | 180 | 233.506957 | 229.657309 | 23 | 134.195689 | 332.818224 | 53.506957 | 152.818224 |
| 3 | Alley Grading-Unimproved (2014) | 180 | 222.213750 | 283.339112 | 8 | -14.663676 | 459.091176 | 42.213750 | 279.091176 |
| 4 | Alley Grading-Unimproved (2015) | 180 | 151.196667 | 177.340770 | 18 | 63.007176 | 239.386157 | -28.803333 | 59.386157 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 257 | Wire Down (2014) | 1 | 1.130192 | 0.165417 | 52 | 1.084140 | 1.176245 | 0.130192 | 0.176245 |
| 258 | Wire Down (2015) | 1 | 1.518269 | 2.128002 | 52 | 0.925830 | 2.110709 | 0.518269 | 1.110709 |
| 259 | Wire Down (2016) | 1 | 1.102692 | 0.111604 | 52 | 1.071622 | 1.133763 | 0.102692 | 0.133763 |
| 260 | Wire Down (2017) | 1 | 1.143269 | 0.353918 | 52 | 1.044738 | 1.241801 | 0.143269 | 0.241801 |
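
As a sanity check, one row of the output above can be reproduced by hand. This sketch assumes `scipy` is installed and copies the count, mean, and standard deviation from the "Alley Grading-Unimproved (2012)" row. The formula is the same one the function uses: mean plus or minus the t critical value times the standard deviation over the square root of the sample size.

```python
import math
from scipy.stats import t

# Summary statistics copied from the "Alley Grading-Unimproved (2012)" row.
sample_size = 7
mean = 203.708571
standard_deviation = 159.513088
confidence = 0.95

# Two-sided critical value with sample_size - 1 degrees of freedom.
t_critical = t.ppf(1 - (1 - confidence) / 2, sample_size - 1)

# Margin of error = critical value * standard error of the mean.
margin = t_critical * standard_deviation / math.sqrt(sample_size)
low_end, high_end = mean - margin, mean + margin

print(round(low_end, 3), round(high_end, 3))
```

The result should agree with that row's `conf_interval_bottom` and `conf_interval_top` to within rounding. Note how wide the interval is with only 7 samples.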


### Is the distance between two means statistically significant? 

It's possible that it is, even if the two confidence intervals overlap.

In [None]:
def t_test_for(num_samples_1, standard_deviation_1, mean_1, num_samples_2, standard_deviation_2, mean_2, confidence=0.95):
    alpha = 1 - confidence
    total_degrees_freedom = num_samples_1 + num_samples_2 - 2

    # Two-tailed test: abs(t_value) is compared below, so put alpha / 2 in each tail.
    t_distribution_number = -1 * t.ppf(alpha / 2.0, total_degrees_freedom)

    degrees_freedom_1 = num_samples_1 - 1
    degrees_freedom_2 = num_samples_2 - 1
    sum_of_squares_1 = (standard_deviation_1 ** 2) * degrees_freedom_1
    sum_of_squares_2 = (standard_deviation_2 ** 2) * degrees_freedom_2

    # Pooled variance: each sample's variance weighted by its degrees of freedom.
    combined_variance = (sum_of_squares_1 + sum_of_squares_2) / (degrees_freedom_1 + degrees_freedom_2)
    first_dividend_addend = combined_variance/float(num_samples_1)
    second_dividend_addend = combined_variance/float(num_samples_2)

    # Standard error of the difference between the two means.
    denominator = math.sqrt(first_dividend_addend + second_dividend_addend)
    numerator = mean_1 - mean_2
    t_value = float(numerator)/float(denominator)

    # True means we fail to reject the null hypothesis: the results are not significant.
    accept_null_hypothesis = abs(t_value) < abs(t_distribution_number)

    return accept_null_hypothesis, t_value
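
To make the overlap point concrete, here is a minimal sketch with hypothetical summary statistics (two groups of 50, both with standard deviation 1.0, means 0.45 apart). Instead of the function above it uses `scipy.stats.ttest_ind_from_stats`, which runs a pooled-variance two-sample t-test from summary statistics when `equal_var=True`. The `interval` helper is a name made up for this example.

```python
import math
from scipy.stats import t, ttest_ind_from_stats

# Hypothetical summary statistics, chosen for illustration.
n, std = 50, 1.0
mean_1, mean_2 = 0.0, 0.45


def interval(mean, std, n, confidence=0.95):
    """Two-sided t-based confidence interval for a single mean."""
    margin = t.ppf(1 - (1 - confidence) / 2, n - 1) * std / math.sqrt(n)
    return mean - margin, mean + margin


interval_1 = interval(mean_1, std, n)
interval_2 = interval(mean_2, std, n)

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = interval_1[1] > interval_2[0] and interval_2[1] > interval_1[0]

# Pooled-variance (equal_var=True) two-sample t-test from summary statistics.
result = ttest_ind_from_stats(mean_1, std, n, mean_2, std, n, equal_var=True)
t_value, p_value = result.statistic, result.pvalue

print("intervals overlap:", intervals_overlap)
print("t value:", round(t_value, 3), "significant at 0.05:", p_value < 0.05)
```

In this setup the two 95% intervals overlap, yet the two-sided test still rejects the null hypothesis at the 0.05 level. Overlapping intervals alone are not evidence that a difference is insignificant.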

## Multiple Comparisons

You may have 95% certainty that any one significant result isn’t a fluke, but when you run a bunch of comparisons, eventually one of them will be. In fact, when you run 100 independent comparisons at that level, the likelihood that none of them comes up significant by fluke is 0.95 multiplied by itself 100 times, which drops to a measly 0.6%.
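
The arithmetic behind that drop is short enough to check directly. This sketch assumes the comparisons are independent and that each has a 95% chance of not producing a fluke. The function name is made up for this example.

```python
def chance_of_no_flukes(comparisons, confidence=0.95):
    """Probability that none of `comparisons` independent tests is a false positive."""
    return confidence ** comparisons


for comparisons in (1, 5, 14, 100):
    print(f"{comparisons:>3} comparisons: {chance_of_no_flukes(comparisons):.1%} chance of no flukes")
```

Under these assumptions the chance of a clean run falls below one half by about 14 comparisons, and it is well under 1% by 100.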

One solution is the **Bonferroni correction**: divide your intended significance threshold (the p value cutoff, often 0.05) by the number of comparisons you're running. This method gets criticism for being too harsh and missing important findings, so sometimes folks temper it by _lowering_ the threshold, but not all the way to the Bonferroni correction prescription.
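
Here is a minimal sketch of the correction in code. The p values are hypothetical, invented to show how the stricter threshold changes which results count as significant.

```python
alpha = 0.05

# Hypothetical p values from five separate comparisons.
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]

# Bonferroni: divide the threshold by the number of comparisons.
bonferroni_threshold = alpha / len(p_values)

significant_uncorrected = [p for p in p_values if p < alpha]
significant_bonferroni = [p for p in p_values if p < bonferroni_threshold]

print("threshold after correction:", bonferroni_threshold)
print("significant without correction:", significant_uncorrected)
print("significant with Bonferroni:", significant_bonferroni)
```

Four of the five results pass the uncorrected threshold, but only one survives the correction. That drop illustrates the "too harsh" criticism: some of the three that were dropped may be real findings.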

## Continuity Errors

This type of data representation error happens when someone takes continuous data that does not fall into discrete categories (like body mass index) and either misrepresents it as discrete categories or interprets it in some way that isn’t true to the data.

For example, with body mass index, we frequently see two categories: ‘normal weight’ and ‘overweight.’ First of all, body mass index has been demonstrated to be a poor measure of health and fitness, so we already have some issues. But let’s stick to continuity errors specifically.

Frequently body mass index data gets categorized as ‘normal weight’ (24.9 or lower) or ‘overweight’ (above 24.9). Where is underweight? Also, what is the difference between a 24.8 and a 25.1? When these middle values get averaged together with the extremes at the far end of each category, it looks like this 0.3 difference in the middle is a night and day difference. It’s not. We’re just representing a wide range with a tiny number of categories.
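
A small sketch shows the effect. The four body mass index values are hypothetical, and the `two_bucket_label` function is a name made up for this example that applies the 24.9 cutoff described above.

```python
import statistics


def two_bucket_label(bmi):
    """Apply the two-category split: 24.9 or lower versus above 24.9."""
    return "normal weight" if bmi <= 24.9 else "overweight"


# Hypothetical values: two near the cutoff, two far from it.
bmis = [17.0, 24.8, 25.1, 38.0]
labels = {bmi: two_bucket_label(bmi) for bmi in bmis}

# Average the values inside each bucket, as a summary chart might.
bucket_means = {
    label: statistics.mean(bmi for bmi in bmis if labels[bmi] == label)
    for label in ("normal weight", "overweight")
}
gap_between_bucket_means = bucket_means["overweight"] - bucket_means["normal weight"]

print(labels)
print(bucket_means, round(gap_between_bucket_means, 2))
```

Here 24.8 and 25.1 land in different buckets even though they are 0.3 apart, while 17.0 and 24.8 share a bucket despite being 7.8 apart. The bucket averages then differ by more than 10 points, which makes the boundary look far sharper than the underlying data does.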

It’s worth examining whether and why we need to categorize continuous variables before we do it. There are good reasons (get evenly sized buckets of points to compare means, draw meaningful visualizations, et cetera). But it’s not a default thing to do.