In [None]:
import matplotlib.pyplot as plt
import numpy as np
import math
from sklearn.datasets import fetch_california_housing

In [None]:
def four_million_sample_means(population, sample_size):
    "Approximates the sampling distribution of the mean with 4,000,000 sample means; the true distribution is too large to calculate in class"
    return np.mean(np.random.choice(population, size=(4000000, sample_size)), axis=1)

In [None]:
def plot_histogram(title, distribution):
    "Plots a simple histogram with 300 bins"
    fig, axis = plt.subplots()
    axis.set_title(title)
    axis.hist(distribution, bins=300)

def print_calculation(title, calculation):
    print(title + ": " + str(calculation))

# The Central Limit Theorem

## The Goal: How Do We Test for Statistical Significance?

## Our Method

<div style="display: flex; align-items: center;">
  <div style="flex: 1;">
    <img src="normal_distribution.jpg" alt="Normal Distribution diagram with μ, σ, and percentages" width="2000">
      <span style="font-size: 0.8em; font-style: italic; text-align: left; display: block;">(Image source: https://static.vecteezy.com/system/resources/previews/007/695/520/original/gauss-distribution-standard-normal-distribution-gaussian-bell-graph-curve-business-and-marketing-concept-math-probability-theory-editable-stroke-illustration-isolated-on-white-background-vector.jpg)</span>
  </div>
  <div style="flex: 1">
    <ol style="margin-top: 0;">
      <li>Represent our problem as a normal distribution.</li>
      <li>Find the μ (the mean) of that distribution.</li>
      <li>Find the σ² (the variance) of that distribution.</li>
      <li>Check if the distance of our result from μ, measured in σ (standard deviations), exceeds our threshold of statistical significance.</li>
    </ol>
  </div>
</div>

**What is "statistical significance"?**

A result is statistically significant when the relationship we observe in our data is unlikely to be the result of random chance.

**How likely?**

We get to decide, but in the social sciences the threshold is usually 0.95, or 95%.

**Used in a sentence:**

"Our level of confidence that the relationship in our data *did not occur by random chance* is 95%."

**Another common formulation:**

You will also see statistical significance expressed in inverse terms using α, where α = 1 − (confidence level). For example, if we want 95% confidence that our result didn't occur due to random chance, we would decide "to conduct our study using α = 0.05."
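
The link between a confidence level and α can be sketched in a few lines. This snippet is an illustration, not part of the original notebook: the helper names `alpha_from_confidence` and `two_sided_critical_z` are hypothetical, and the two-sided test is an assumption. It relies on `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

def alpha_from_confidence(confidence):
    # α is the complement of the confidence level
    return 1 - confidence

def two_sided_critical_z(alpha):
    # Assumes a two-sided test: α is split evenly between both tails
    return NormalDist().inv_cdf(1 - alpha / 2)

alpha = alpha_from_confidence(0.95)
critical_z = two_sided_critical_z(alpha)

print("α:", round(alpha, 2))
print("Critical distance from μ in σ:", round(critical_z, 2))
```

Under these assumptions, 95% confidence corresponds to a result roughly 1.96σ away from μ.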

## What is the normal distribution?

<img src="normal_distribution.jpg" alt="Normal Distribution diagram with μ, σ, and percentages" width="80%">

### An experiment to check μ:

In [None]:
normally_distributed_data = np.random.normal(loc=10, scale=2, size=500000)
plot_histogram("My (approximately) normally distributed data", normally_distributed_data)

In [None]:
print_calculation(title="μ (mean) of the data", calculation=np.mean(normally_distributed_data)) 

### An experiment to check σ:

<img src="normal_distribution.jpg" alt="Normal Distribution diagram with μ, σ, and percentages" width="80%">

In [None]:
normally_distributed_data = np.random.normal(loc=10, scale=2, size=500000)
plot_histogram("My (approximately) normally distributed data", normally_distributed_data)

In [None]:
def percentage_within(distribution, lower_bound, upper_bound):
    "Returns the fraction (between 0 and 1) of values inside [lower_bound, upper_bound]"
    values = np.asarray(distribution)
    # True where a datapoint falls inside the bounds (inclusive on both ends)
    is_between = (values >= lower_bound) & (values <= upper_bound)
    # The mean of a boolean array is the share of True entries
    return np.mean(is_between)

<img src="normal_distribution.jpg" alt="Normal Distribution diagram with μ, σ, and percentages" width="50%">

In [None]:
mean = np.mean(normally_distributed_data)
standard_deviation = np.std(normally_distributed_data)

print_calculation(
    "Fraction within 1σ of μ", 
    percentage_within(
        normally_distributed_data,
        mean - (1 * standard_deviation), 
        mean + (1 * standard_deviation)
    )
)
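
The same experiment can be extended to 2σ and 3σ. The sketch below is self-contained and not part of the original notebook: it draws its own data with a fixed seed (an assumption, for reproducibility) and uses a hypothetical helper named `fraction_within_sigmas`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed is an assumption, for reproducibility
data = rng.normal(loc=10, scale=2, size=500_000)

def fraction_within_sigmas(values, number_of_sigmas):
    # Hypothetical helper: share of values within k standard deviations of the mean
    mean = np.mean(values)
    standard_deviation = np.std(values)
    distance = np.abs(values - mean)
    return np.mean(distance <= number_of_sigmas * standard_deviation)

fractions = {k: fraction_within_sigmas(data, k) for k in (1, 2, 3)}
for k, fraction in fractions.items():
    print(f"Fraction within {k}σ of μ: {fraction:.4f}")
```

For normally distributed data the three values should land near 0.68, 0.95, and 0.997.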

## Our Method

<div style="display: flex; align-items: center;">
  <div style="flex: 1;">
    <img src="normal_distribution.jpg" alt="Normal Distribution diagram with μ, σ, and percentages" width="2000">
  </div>
  <div style="flex: 1">
    <ol style="margin-top: 0;">
      <li>Represent our problem as a normal distribution.</li>
      <li>Find the μ (the mean) of that distribution.</li>
      <li>Find the σ² (the variance) of that distribution.</li>
      <li>Check if the distance of our result from μ, measured in σ (standard deviations), exceeds our threshold of statistical significance.</li>
    </ol>
  </div>
</div>
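
Steps 2 through 4 can be sketched as follows, assuming step 1 is already done and the problem really is normally distributed. This is an illustration, not part of the original notebook: the helper name `is_statistically_significant`, the observed result of 15, the 95% threshold, and the two-sided test are all assumptions.

```python
import math
from statistics import NormalDist

def is_statistically_significant(result, mean, variance, confidence=0.95):
    # Hypothetical helper; assumes a two-sided test on a normal distribution
    standard_deviation = math.sqrt(variance)
    # Distance of the result from μ, measured in σ
    distance_in_sigmas = abs(result - mean) / standard_deviation
    # Number of σ that marks the chosen significance threshold
    threshold_in_sigmas = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return distance_in_sigmas, distance_in_sigmas > threshold_in_sigmas

# Illustrative numbers only: μ = 10 and σ² = 4 mirror the simulated data, 15 is made up
distance, significant = is_statistically_significant(result=15, mean=10, variance=4)

print("Distance from μ in σ:", distance)
print("Exceeds the 95% threshold:", significant)
```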

# (Finally) The Central Limit Theorem

Given a sufficiently large sample:
1. The means of the samples in a set of samples (the sample means) will be approximately normally distributed,
2. This normal distribution will have a mean close to the mean of the population, and
3. The variance of the sample means will be close to the variance of the population divided by the sample size.

<span style="font-size: 0.8em; font-style: italic; text-align: left; display: block;">(Source: https://ocw.mit.edu/courses/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/resources/mit6_0002f16_lec8/)</span>
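
The three claims above can be checked with a small simulation. This is a minimal sketch, not part of the original notebook: the exponential population, the sample size of 50, the 20,000 samples, and the fixed seed are all assumptions chosen to keep the run fast and reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed is an assumption, for reproducibility

# A deliberately non-normal (right-skewed) population; the exponential choice is illustrative
population = rng.exponential(scale=2.0, size=100_000)
sample_size = 50
number_of_samples = 20_000

# Draw many samples with replacement and take the mean of each one
samples = rng.choice(population, size=(number_of_samples, sample_size))
sample_means = samples.mean(axis=1)

# Claim 1: roughly 68% of the sample means should sit within 1σ of their mean
within_one_sigma = np.mean(
    np.abs(sample_means - sample_means.mean()) <= sample_means.std()
)
print("Fraction of sample means within 1σ:", within_one_sigma)

# Claim 2: the mean of the sample means is close to the population mean
print("Population mean:", np.mean(population))
print("Mean of sample means:", np.mean(sample_means))

# Claim 3: the variance of the sample means is close to population variance / sample size
print("Population variance / sample size:", np.var(population) / sample_size)
print("Variance of sample means:", np.var(sample_means))
```

Even though the population itself is skewed, the sample means should behave approximately like a normal distribution centered on the population mean.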