# Doing Statistics on Data with the Scipy.Stats and Pingouin Packages

**Scipy.Stats** has all the stats functions you know and love from stats class!

**Pingouin** is a new statistics package in Python that uses pandas, seaborn, and scipy-stats!  It's main benefit? It's really easy to use and it works great with DataFrames!

https://pingouin-stats.org/index.html



## Basic Stats with Pengouin and Scipy-Stats

#### Pairwise Tests

T-tests compare the means of two samples of data generated from a normally-distributed population and compute the probability that they have the same mean. Both packages have functions for t-tests! 


| Test, | `scipy.stats` Function, | `pengouin` Function |
| :---: | :---: | :---: |
| One-Sampled T-Test | `stats.ttest_1samp()` | `pg.ttest(x, 0)` |
| Independent T-Test | `stats.ttest_ind()` | `pg.ttest(x,y)` |
| Paired T-test | `stats.ttest_rel()` | `pg.ttest(x, y, paired=True)`
| Pairwise T-tests |   | `pg.pairwise_ttests(padjust='fdr_bh')`


In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg

ModuleNotFoundError: No module named 'pingouin'

**Exercises** Using **both* packages, let's do some analysis on some fake data

##### Generate the Data

In [None]:
np.random.seed(42)  # Makes sure we all have the same random numbers
n = 25
df = pd.DataFrame({
    'A': np.random.normal(loc=0, scale=1, size=n),
    'B': np.random.normal(loc=0, scale=1, size=n),
    'C': np.random.normal(loc=0.5, scale=1, size=n),
})
df['D'] = df['A'] 0.3 + np.random.normal(loc=0, scale=0.5, size=n)
df.head()

Unnamed: 0,A,B,C,D,E
0,0.496714,0.110923,0.824084,8.719025,2.037386
1,-0.138264,-1.150994,0.114918,1.370471,1.582281
2,0.647689,0.375698,-0.176922,-2.490074,2.800176
3,1.52303,-0.600639,1.111676,1.417608,3.883406
4,-0.234153,-0.291694,1.531,-19.375689,1.568127


##### Analyze the Data with Pairwise Tests
Use both pingouin and scipy.stats to do a ttest comparing the two variables stated (first scipy stats, then penguoin).  Do you get the same results?  

A vs B (Ind)

A vs C (Ind)

B vs C (Paired)

A vs D (Paired)

A vs Zero (Is A's mean different from 0?)

C vs Zero

Do all Pairwise Tests (all vs all)

(extra: Google it!)  Are A and C correlated?

(extra: Google it!) Are A and D correlated?

## More-Complete Analysis: Walking the Frequentist Statistics Decision Tree

### Fertility Dataset

### Load the Data

In [None]:
import statsmodels.api as sm  # !pip install statsmodels
df = sm.datasets.fertility.load()['data']
df.head(3)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
0,Aruba,ABW,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,4.82,4.655,4.471,4.271,4.059,3.842,...,1.786,1.769,1.754,1.739,1.726,1.713,1.701,1.69,,
1,Andorra,AND,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,,,,,,,...,,,1.24,1.18,1.25,1.19,1.22,,,
2,Afghanistan,AFG,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,7.671,7.671,7.671,7.671,7.671,7.671,...,7.136,6.93,6.702,6.456,6.196,5.928,5.659,5.395,,


### What significant differences are there between the fertility rates in 1990, 2000, and 2010?

### Parametric Tests

Follow the flowchart in the **ANOVA** section of the penguoin docs to test for differences in the mean fertility rate between these 3 years.  Even the deta is not homoscedastic, go ahead and do the anova and pairwise tests

https://pingouin-stats.org/guidelines.html#id5

### Nonparametric Tests

The data isn't normally distributred!  So let's do the same thing, but with the tests that don't need to make any assumptions about the distribution of our data.  Follow the flowchart in the **Non-Parametric** section of the penguoin docs to test for differences in the mean fertility rate between these 3 years.  

https://pingouin-stats.org/guidelines.html#id7

## Correlation Tests

Follow the flowchart in the **Non-Parametric** section of the penguoin docs to test for differences in the mean fertility rate between these 3 years.  

https://pingouin-stats.org/guidelines.html#id6

### Is there a significant correlation between time Germany's fertility rate and France's fertility rate?

### Is there a significant correlation between time Germany's fertility rate and India's fertility rate?

### Is there a significant correlation between time (Year) fertility rate?

Before running the correlation test, Get a new dataframe with just "Country Name", "Year", and "Fertility Rate" columns, and change the Year column to integers using the `DataFrame.melt()` method

*Hint*: A list of column names in a DataFrame is at `DataFrame.columns`
Get a new dataframe with just "Country Name", "Year", and "Fertility Rate" columns, and change the Year column to integers


### Is there a significant correlation between time (Year) and Germany's fertility rate?

## Further Reading

Nice article on Pingouin here: https://towardsdatascience.com/the-new-kid-on-the-statistics-in-python-block-pingouin-6b353a1db57c

<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=61a4b91b-9261-42f0-a4f5-a139c5d33a06' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>