In [1]:
%pip install numpy pandas pingouin scipy statsmodels;



In [2]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

**Exercises**: Let's analyze some fake data to get a feel for these two statistics tools.

Generate the Data: Run the code below to create the dataset `data`.

In [5]:
rnd = np.random.RandomState(seed=42)  # Makes sure the pseudorandom number generators reproduce the same data for us all.

variables = ['a', 'b', 'c', 'd']
data_a = rnd.normal(0, 1, size=20)
data_b = rnd.normal(0.2, 1, size=20)
data_c = rnd.normal(0.7, 1, size=20)
data_d = (data_a - 0.2) + rnd.normal(0, 0.2, size=20)

data = np.array([data_a, data_b, data_c, data_d]).T
data

array([[ 0.49671415,  1.66564877,  1.43846658,  0.20087931],
       [-0.1382643 , -0.0257763 ,  0.87136828, -0.3753961 ],
       [ 0.64768854,  0.2675282 ,  0.58435172,  0.22642154],
       [ 1.52302986, -1.22474819,  0.3988963 ,  1.08378853],
       [-0.23415337, -0.34438272, -0.77852199, -0.27164821],
       [-0.23413696,  0.31092259, -0.01984421, -0.16288895],
       [ 1.57921282, -0.95099358,  0.23936123,  1.36481079],
       [ 0.76743473,  0.57569802,  1.75712223,  0.76814131],
       [-0.46947439, -0.40063869,  1.04361829, -0.59714718],
       [ 0.54256004, -0.09169375, -1.06304016,  0.21353609],
       [-0.46341769, -0.40170661,  1.02408397, -0.59113857],
       [-0.46572975,  2.05227818,  0.31491772, -0.35812244],
       [ 0.24196227,  0.18650278,  0.023078  ,  0.03479706],
       [-1.91328024, -0.85771093,  1.31167629, -1.80035151],
       [-1.72491783,  1.02254491,  1.73099952, -2.44886685],
       [-0.56228753, -1.02084365,  1.63128012, -0.59790703],
       [-1.01283112,  0.

## Visualize the Data

What do these four variables look like, when compared against each other?  Let's take a look using three types of plots and matplotlib:

| Plot type | Function | Example Code |
| :--  | :-- | :-- |
| **Box Plot** | `plt.boxplot()` | `plt.boxplot([x, y, z])` |
| **Violin Plot** | `plt.violinplot()` | `plt.violinplot([x, y, z])` |
| **Strip Plot** | `plt.scatter()` | `plt.scatter(x=[1]*len(x) + np.random.uniform(-.3, .3, len(x)), y=x) ` |
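
As a rough illustration of the three plot types side by side, here is a minimal sketch. It uses hypothetical samples `x`, `y`, and `z` (not the exercise dataset), and the subplot layout and variable names are assumptions made for this example only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical example data (not the exercise dataset): three samples of 30 points each.
rng = np.random.RandomState(seed=0)
x = rng.normal(0, 1, size=30)
y = rng.normal(1, 1, size=30)
z = rng.normal(2, 1, size=30)
samples = [x, y, z]

fig, (ax_box, ax_violin, ax_strip) = plt.subplots(ncols=3, figsize=(10, 3), sharey=True)

# Box plot: median, quartiles, and outliers for each sample.
ax_box.boxplot(samples)
ax_box.set_title("Box Plot")

# Violin plot: a smoothed estimate of each sample's distribution.
ax_violin.violinplot(samples)
ax_violin.set_title("Violin Plot")

# Strip plot: every point is drawn, with horizontal jitter so points do not overlap.
for position, sample in enumerate(samples, start=1):
    jitter = rng.uniform(-0.3, 0.3, size=len(sample))
    ax_strip.scatter(position + jitter, sample, s=10)
ax_strip.set_xticks([1, 2, 3])
ax_strip.set_title("Strip Plot")
```

In a notebook the figure is displayed when the cell finishes; in a plain script you would typically call `plt.show()` afterwards.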


**Exercises**

Make a Violin Plot of the four columns of `data`:

Make a Box Plot of the four columns of `data`:

Make a Strip Plot of the four columns of `data` (Tip: Matplotlib has no strip plot function, so adapt the long code from the table above):

**Discussion**: Which of the three plots above do you find most interesting?  What information do you get from each of them?

# Doing T-Tests with Scipy.Stats

[**Scipy.Stats**](https://docs.scipy.org/doc/scipy/reference/stats.html) has all the stats functions you know and love from statistics class.  Like all the functions in the [scipy](https://docs.scipy.org/doc/scipy/getting_started.html) package, it is fully-compatible with Numpy.


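T-tests compare the means of samples that are assumed to come from normally-distributed populations. Under the null hypothesis that the population means are equal, the p-value is the probability of observing a difference at least as large as the one in the data. When the p-value is very low, the data are unlikely under that hypothesis, which is evidence (not proof) that the samples came from populations with different means. Both packages have functions for t-tests; this section uses `scipy.stats`.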
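
To make the p-value less of a black box, here is a minimal sketch that computes a one-sample t-statistic and two-sided p-value by hand and compares them with `stats.ttest_1samp`. The `sample` array and `popmean` value are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample (not the exercise dataset).
rng = np.random.RandomState(seed=0)
sample = rng.normal(0.5, 1, size=20)
popmean = 0  # the population mean assumed under the null hypothesis

n = len(sample)
standard_error = sample.std(ddof=1) / np.sqrt(n)  # ddof=1 gives the sample standard deviation
t_manual = (sample.mean() - popmean) / standard_error

# Two-sided p-value: probability of a t-value at least this extreme, in either direction,
# under a t-distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(sample, popmean)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The two printed pairs should agree up to floating-point error.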


| Test | `scipy.stats` Function |
| :---: | :---: |
| One-Sample T-Test | [**stats.ttest_1samp(x, 0)**](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html#scipy.stats.ttest_1samp) |
| Independent T-Test | [**stats.ttest_ind(x, y)**](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind) |
| Paired T-Test | [**stats.ttest_rel(x, y)**](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html#scipy.stats.ttest_rel) |
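
The three functions share the same calling pattern and return a result with `statistic` and `pvalue` fields. The sketch below calls each one on hypothetical arrays `x` and `y` (not the exercise dataset) to show how the results can be read.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (not the exercise dataset).
rng = np.random.RandomState(seed=1)
x = rng.normal(0, 1, size=20)
y = rng.normal(0.5, 1, size=20)

one_sample = stats.ttest_1samp(x, 0)   # is the mean of x different from 0?
independent = stats.ttest_ind(x, y)    # do x and y have different means (unrelated samples)?
paired = stats.ttest_rel(x, y)         # is the mean of the pairwise differences x - y different from 0?

for name, result in [("one-sample", one_sample),
                     ("independent", independent),
                     ("paired", paired)]:
    print(f"{name}: t = {result.statistic:.3f}, p = {result.pvalue:.3f}")

# The result can also be unpacked directly into the t-statistic and the p-value.
t_value, p_value = stats.ttest_ind(x, y)
```

Note that `stats.ttest_rel` requires both arrays to have the same length, since it pairs the values by position.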


**Exercises**

**A vs 0, One-Sample T-Test**: Is the mean of the normally-distributed population that dataset `A` was generated from unlikely to be zero?

*Example*:

In [50]:
stats.ttest_1samp(data[:, 0], 0)

Ttest_1sampResult(statistic=-0.797966433655592, pvalue=0.43475058842710046)

**B vs 1, One-Sample T-Test**: Is the mean of the normally-distributed population that dataset `B` was generated from unlikely to be one?

**A vs B, Independent Samples T-Test**: Is the mean of the normally-distributed population that dataset `A` was generated from unlikely to be the same as the mean of the normally-distributed population that dataset `B` was generated from?

**A vs C, Independent Samples T-Test**: Is the mean of the normally-distributed population that dataset `A` was generated from unlikely to be the same as the mean of the normally-distributed population that dataset `C` was generated from?

**A vs C, Paired Samples T-Test (a.k.a. Related Samples T-Test)**: Is the mean of the differences between paired samples drawn from the two normally-distributed populations `A` and `C` unlikely to be 0?

**A vs D, Paired Samples T-Test**: Is the mean of the differences between paired samples drawn from the two normally-distributed populations `A` and `D` unlikely to be 0?
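
When comparing paired and independent tests, it can help to see that a paired t-test is equivalent to a one-sample t-test on the pairwise differences. The sketch below checks this on hypothetical `before` and `after` arrays (not the exercise dataset), where `after` is built from `before` plus a small shift and noise, an assumption made purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (not the exercise dataset):
# each "after" value is its "before" value plus a shift of 0.3 and a little noise.
rng = np.random.RandomState(seed=0)
before = rng.normal(0, 1, size=20)
after = before + 0.3 + rng.normal(0, 0.2, size=20)

paired = stats.ttest_rel(before, after)
one_sample_on_diffs = stats.ttest_1samp(before - after, 0)
independent = stats.ttest_ind(before, after)

# The paired test and the one-sample test on the differences give the same answer.
print(paired.statistic, one_sample_on_diffs.statistic)
print(paired.pvalue, one_sample_on_diffs.pvalue)

# The independent test ignores the pairing, so the large spread between
# individuals hides the small, consistent shift within each pair.
print(independent.pvalue)
```

Because the pairing removes the variation shared by each pair, the paired test in this setup is much more sensitive to the shift than the independent test.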