---
---
---

# ✨ **TUTORIAL**: Normal Distributions and the Central Limit Theorem ✨

In this tutorial, you'll test your understanding of normal distributions and the Central Limit Theorem by writing custom functions that sample from dummy data and produce your own approximately normal distribution of sample means.

---
---

If the class lecture was your first exposure to the Central Limit Theorem (CLT), it can seem a bit confusing.  The goal of this notebook is to demystify the CLT by having you write an algorithm that actually uses sampling to approximate a normal distribution from a non-normally distributed data set.  

In this notebook you will:

1. Run code to generate a non-normal data set.  
2. Create a function to randomly sample subsets of data.
3. Create a data set of the means of each sample.
4. Visualize the distribution of the means of each sample.  

---

### 🔹 **Creating our Dummy Data** 🔹

---

We're going to use `numpy` to create a non-normal distribution.  The easiest way to do this is just to create a uniform distribution!  

**TASKS:** Run the code below to import `numpy` and set a random seed, and then use numpy to create a uniform distribution with integer values between 0 and 100.

(Hint: For integer values, `random.uniform` is not our best choice since it generates floats.  Which `numpy` method should you use to generate a uniform distribution of random integers?)
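
As a small illustration of the hint, the sketch below contrasts a float-valued draw with an integer-valued one. It uses a local `np.random.RandomState` with an arbitrary seed (an assumption made here so the example does not disturb the notebook's global seed), and the variable names are illustrative only. Note that `randint` excludes its `high` endpoint.

```python
import numpy as np

# Local generator so this sketch leaves the notebook's global seed untouched.
rng = np.random.RandomState(0)

# uniform() returns floats drawn from the half-open interval [0, 100).
float_draws = rng.uniform(low=0, high=100, size=5)

# randint() returns integers; `high` is exclusive, so these lie in 0..99.
int_draws = rng.randint(low=0, high=100, size=5)

# Passing high=101 would make 100 a possible value as well.
int_draws_inclusive = rng.randint(low=0, high=101, size=5)

print(float_draws.dtype, int_draws.dtype)
```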

In [1]:
# Run this cell to import the packages you'll need and set a seed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Don't change this -- otherwise, you'll get different results from everyone else!
np.random.seed(42)

In [2]:
# Create a uniform distribution of 10,000 integers from 0 to 99 (randint's `high` bound is exclusive).
non_normal_data = np.random.randint(low=0, high=100, size=10000)

# TODO: Use plt.hist() to visualize the distribution of our dummy dataset.
pass
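
One possible way to complete the plotting step is sketched below. `plot_histogram` is a hypothetical helper name, not part of the notebook, and the bin count of 20 is an arbitrary choice. The `np.histogram` call is included only to check numerically that the counts are roughly flat.

```python
import numpy as np

def plot_histogram(data, bins=20, title="Distribution of dummy data"):
    """Draw a histogram of `data` and return the matplotlib Axes."""
    # Imported inside the function so the numeric check below runs on its own.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(data, bins=bins, edgecolor="black")
    ax.set_xlabel("Value")
    ax.set_ylabel("Count")
    ax.set_title(title)
    return ax

# Recreate the dummy data with a local generator (same arguments as the notebook's call).
rng = np.random.RandomState(42)
non_normal_data = rng.randint(low=0, high=100, size=10000)

# For a uniform distribution, each of 10 equal-width bins should hold about 1,000 values.
counts, bin_edges = np.histogram(non_normal_data, bins=10, range=(0, 100))
print(counts)
```

Calling `plot_histogram(non_normal_data)` in a notebook cell should show bars of roughly equal height, with no central peak.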

---

### 🔹 **Creating a Sampling Function** 🔹

---

Now that we have created our data set, we'll need to sample from it.  To do this, you'll create two functions: a `get_sample()` function that draws a random sample of size `n` and returns its mean, and a `create_sample_distribution()` function that uses this helper to build a distribution of `size` sample means.

Your `get_sample()` function should:

1.  Take a keyword argument for sample size (called `n` for short)
1.  Randomly draw `n` values from the dataset with replacement (a selected value is NOT removed from the original data set, so it can be drawn again).
1.  Calculate the mean of the sub-sample and return it.
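
To make "with replacement" concrete, here is a minimal sketch on a tiny made-up array (the values, seed, and variable names are illustrative assumptions). With `replace=True` the same element can be drawn more than once, which is why a sample can even be larger than the data it comes from, and the original array is left unchanged.

```python
import numpy as np

rng = np.random.RandomState(0)  # local generator; the seed value is arbitrary
small_data = np.array([10, 20, 30, 40, 50])

# With replacement: drawing 8 values from 5 is fine, so repeats must occur.
with_replacement = rng.choice(small_data, size=8, replace=True)

# Without replacement: at most len(small_data) values, each used once.
without_replacement = rng.choice(small_data, size=5, replace=False)

print(with_replacement)
print(sorted(without_replacement))  # same five values as small_data
print(small_data)                   # the source array is not modified
```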


Your `create_sample_distribution()` function should:

1.  Take a keyword argument for size, which will determine the total size of the sample distribution.
1.  Use the `get_sample()` helper function to draw a sample and calculate its mean.
1.  Store the sample mean.
1.  Repeat this process until a distribution of `size` sample means exists.  When the data set is complete, return it as a `numpy` array.

``` python
def get_sample(dataset, n=30):
    """Grabs a random subsample of size 'n' from dataset.
    Outputs the mean of the subsample."""
    pass

def create_sample_distribution(dataset, size=100):
    """Creates a dataset of subsample means.  The length of the dataset is specified by the 'size'
    keyword argument. Should return the entire sample distribution as a numpy array.  """
    pass
```

In [3]:
# TODO: Complete the two functions below.
def get_sample(dataset, n=30):
    """Grabs a random subsample of size 'n' from dataset.
    Outputs the mean of the subsample."""
    pass

def create_sample_distribution(dataset, size=100):
    """Creates a dataset of subsample means.  The length of the dataset is specified by the 'size'
    keyword argument. Should return the entire sample distribution as a numpy array.  """
    pass
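
One possible way to fill in the two stubs is sketched below; it is not the only valid solution. It assumes `np.random.choice` with `replace=True` for the sampling step and keeps the signatures from the skeleton, so every sample uses the default `n=30`. The seed and data setup are repeated only to keep the example self-contained.

```python
import numpy as np

def get_sample(dataset, n=30):
    """Grabs a random subsample of size 'n' from dataset (with replacement).
    Outputs the mean of the subsample."""
    sample = np.random.choice(dataset, size=n, replace=True)
    return sample.mean()

def create_sample_distribution(dataset, size=100):
    """Creates a dataset of `size` subsample means and returns it as a numpy array."""
    sample_means = []
    # Keep drawing samples until we have collected `size` means.
    while len(sample_means) < size:
        sample_means.append(get_sample(dataset))
    return np.array(sample_means)

np.random.seed(42)
non_normal_data = np.random.randint(low=0, high=100, size=10000)
sample_means = create_sample_distribution(non_normal_data, size=1000)

print(sample_means.shape)   # (1000,)
print(sample_means.mean())  # should land near the population mean
```

A list comprehension would work just as well as the `while` loop; the loop is used here only because it mirrors the "repeat until done" wording of the steps above.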

---

### 🔹 **Visualizing Our Sampling Distribution** 🔹

---

Now that we can create a distribution of sample means, let's visualize it to see whether it looks approximately normal.  

**TASK:** Use `matplotlib` or `seaborn` to visualize our sample distribution.

In [4]:
# TODO: Visualize our sample distribution below.
# Remember, we aliased matplotlib.pyplot as plt!
pass
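
A possible sketch of this step follows. `plot_sample_means` is a hypothetical helper name, and the sample means are regenerated here in a vectorized way (an assumption made for brevity, not the notebook's required approach) so the block runs on its own. It also compares the spread of the sample means with the value the CLT predicts, the population standard deviation divided by the square root of `n`.

```python
import numpy as np

def plot_sample_means(sample_means, bins=30):
    """Histogram of sample means; returns the matplotlib Axes."""
    # Imported lazily so the numeric check below runs without a plotting backend.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(sample_means, bins=bins, edgecolor="black")
    ax.axvline(np.mean(sample_means), color="red", linestyle="--",
               label="mean of sample means")
    ax.set_xlabel("Sample mean (n = 30)")
    ax.set_ylabel("Count")
    ax.legend()
    return ax

rng = np.random.RandomState(42)
data = rng.randint(low=0, high=100, size=10000)

# 1,000 sample means, each computed from 30 draws with replacement.
n = 30
sample_means = rng.choice(data, size=(1000, n), replace=True).mean(axis=1)

# The CLT predicts a spread of about sigma / sqrt(n) for the sample means.
predicted_se = data.std() / np.sqrt(n)
print(sample_means.std(), predicted_se)
```

Calling `plot_sample_means(sample_means)` in a notebook cell should show a bell-shaped histogram centered near the mean of the original data, even though the data itself is flat.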

Great work!

Now that you've seen the Central Limit Theorem in action, you can treat the distribution of sample means as approximately normal even when the underlying dataset is not.  You can now compute z-scores and probabilities for sample means drawn from these datasets!  
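
To make this closing point concrete, the sketch below computes a z-score and a tail probability for a sample mean and checks the result against a simulation. The observed mean of 55, the helper `normal_cdf`, and the simulation size are all illustrative assumptions; the normal CDF is built from `math.erf` to avoid an extra dependency.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = np.random.RandomState(42)
data = rng.randint(low=0, high=100, size=10000)

n = 30
mu = data.mean()
standard_error = data.std() / math.sqrt(n)

# Z-score of a hypothetical sample mean of 55 (an arbitrary value chosen for illustration).
observed_mean = 55.0
z = (observed_mean - mu) / standard_error

# Probability, under the normal approximation, of a sample mean this large or larger.
p_normal = 1.0 - normal_cdf(z)

# Empirical check: fraction of 20,000 simulated sample means at or above 55.
simulated_means = rng.choice(data, size=(20000, n), replace=True).mean(axis=1)
p_empirical = np.mean(simulated_means >= observed_mean)

print(z, p_normal, p_empirical)
```

The two probabilities should agree closely, which is the practical payoff of the CLT: a normal model with standard deviation equal to the standard error describes the sample means well, even though the individual values are uniformly distributed.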

---
---
---