# Doing Statistics on Data with the Scipy.Stats and Pingouin Packages

**Scipy.Stats** has all the stats functions you know and love from stats class!

**Pingouin** is a newer statistics package in Python that builds on pandas, seaborn, and scipy.stats!  Its main benefit? It's really easy to use and it works great with DataFrames!

https://pingouin-stats.org/index.html



## Basic Stats with Pingouin and Scipy.Stats

#### Pairwise Tests

T-tests compare the means of samples drawn from normally-distributed populations. The p-value they report is the probability of seeing a difference in means at least as large as the observed one if the population means were actually equal. Both packages have functions for t-tests!


| Test | `scipy.stats` Function | `pingouin` Function |
| :---: | :---: | :---: |
| One-Sample T-Test | `stats.ttest_1samp()` | `pg.ttest(x, 0)` |
| Independent T-Test | `stats.ttest_ind()` | `pg.ttest(x, y)` |
| Paired T-Test | `stats.ttest_rel()` | `pg.ttest(x, y, paired=True)` |
| Pairwise T-Tests |   | `pg.pairwise_tests(padjust='fdr_bh')` (`pg.pairwise_ttests()` in older versions) |


In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg

**Exercises** Using **both** packages, let's do some analysis on some fake data.

##### Generate the Data

In [None]:
np.random.seed(42)  # Makes sure we all have the same random numbers
n = 25
df = pd.DataFrame({
    'A': np.random.normal(loc=0, scale=1, size=n),
    'B': np.random.normal(loc=0, scale=1, size=n),
    'C': np.random.normal(loc=0.8, scale=1, size=n),
})
df['D'] = df['A'] * 0.4 + np.random.normal(loc=0, scale=0.5, size=n)
df = np.round(df, 2)  # round all values to two decimal places
df.head()

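|   | A | B | C | D |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 0.5 | 0.11 | 1.12 | 0.61 |
| 1 | -0.14 | -1.15 | 0.41 | -0.01 |
| 2 | 0.65 | 0.38 | 0.12 | 0.11 |
| 3 | 1.52 | -0.6 | 1.41 | 0.66 |
| 4 | -0.23 | -0.29 | 1.83 | -1.09 |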


##### Analyze the Data with Pairwise Tests
Use both pingouin and scipy.stats to run t-tests and correlations comparing the two variables stated (first scipy.stats, then pingouin).  Do you get the same results?

#### Scipy.stats

https://docs.scipy.org/doc/scipy/reference/stats.html

Some useful functions: `stats.ttest_ind()`, `stats.ttest_1samp()`, `stats.ttest_rel()`, `stats.wilcoxon()`, `stats.kruskal()`, `stats.pearsonr()`, `stats.spearmanr()`


A vs B (t-test with Independent Samples)

A vs C (t-test with Independent Samples)

A vs C (t-test with Independent Samples, Alternative Hypothesis is C is greater than A)

B vs C (Wilcoxon Signed-Rank Test)

Correlation of A vs C (Pearson Correlation)

Correlation of A vs D (Spearman Correlation)
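
One possible way to work through the comparisons above with scipy.stats is sketched below. It regenerates the same fake data so the block runs on its own; the result variable names are just illustrative choices, and the `alternative=` argument assumes a reasonably recent scipy release.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Recreate the fake data from the "Generate the Data" cell
np.random.seed(42)
n = 25
df = pd.DataFrame({
    'A': np.random.normal(loc=0, scale=1, size=n),
    'B': np.random.normal(loc=0, scale=1, size=n),
    'C': np.random.normal(loc=0.8, scale=1, size=n),
})
df['D'] = df['A'] * 0.4 + np.random.normal(loc=0, scale=0.5, size=n)
df = np.round(df, 2)

# Independent-samples t-tests (two-sided by default)
t_ab = stats.ttest_ind(df['A'], df['B'])
t_ac = stats.ttest_ind(df['A'], df['C'])

# One-sided version: the alternative hypothesis is mean(A) < mean(C), i.e. C is greater
t_ac_less = stats.ttest_ind(df['A'], df['C'], alternative='less')

# Wilcoxon signed-rank test (treats B and C as paired, row by row)
w_bc = stats.wilcoxon(df['B'], df['C'])

# Correlations: each call returns a coefficient and a p-value
r_ac, p_r_ac = stats.pearsonr(df['A'], df['C'])
rho_ad, p_rho_ad = stats.spearmanr(df['A'], df['D'])

print('A vs B:', t_ab)
print('A vs C:', t_ac)
print('A vs C (C greater):', t_ac_less)
print('B vs C (Wilcoxon):', w_bc)
print('Pearson A vs C:', r_ac, p_r_ac)
print('Spearman A vs D:', rho_ad, p_rho_ad)
```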

#### Pingouin

Pingouin also has a wealth of statistical functions, but compared to scipy.stats, they are a bit more concisely organized (and, I find, easier to read):  https://pingouin-stats.org/api.html

Some useful Functions: `pg.ttest()`, `pg.corr()`


A vs B (t-test with Independent Samples)

A vs C (t-test with Independent Samples)

A vs C (t-test with Independent Samples, Alternative Hypothesis is C is greater than A)

B vs C (Nonparametric alternative to the paired t-test, for example the Wilcoxon Signed-Rank Test)

Correlation of A vs C (Pearson Correlation)

Correlation of A vs D (Spearman Correlation)

(*Extra*): Full Pairwise Analysis.  Pingouin has functions whose names start with `pairwise_` that compare every pair of variables against each other. Some of them, like `pg.pairwise_tests()` (called `pg.pairwise_ttests()` in older pingouin releases), require a long dataframe (`df.melt()`), while others want a wide dataframe (`df.pivot()`).

**`pg.pairwise_tests()`** wants a long dataframe.  Below it's prepared; try using it to do a pairwise t-test analysis with pingouin in the cell below it.

In [None]:
dfl = df.reset_index().melt(id_vars=['index'])
dfl.sample(4)

|   | index | variable | value |
| :---: | :---: | :---: | :---: |
| 63 | 13 | C | -0.4 |
| 11 | 11 | A | -0.47 |
| 56 | 6 | C | -0.04 |
| 84 | 9 | D | -0.19 |


**`pg.pairwise_corr()`** wants a wide dataframe (the original is already wide).  Try using it to compute the correlation for every pair of variables.


**Short Discussion**: What are some differences you noticed in these functions between working with scipy.stats and pingouin?

## More-Complete Analysis: Walking the Frequentist Statistics Decision Tree

### Fertility Dataset

### Load the Data

In [None]:
import statsmodels.api as sm  # !pip install statsmodels
df = sm.datasets.fertility.load()['data']
number_cols = [col for col in df.columns if col.isdigit()]
df = df[['Country Name'] + number_cols ]
df.head(3)  # wide version

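|   | Country Name | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | Aruba | 4.82 | 4.655 | 4.471 | 4.271 | 4.059 | 3.842 | 3.625 | 3.417 | 3.226 | ... | 1.786 | 1.769 | 1.754 | 1.739 | 1.726 | 1.713 | 1.701 | 1.69 |   |   |
| 1 | Andorra |   |   |   |   |   |   |   |   |   | ... |   |   | 1.24 | 1.18 | 1.25 | 1.19 | 1.22 |   |   |   |
| 2 | Afghanistan | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | 7.671 | ... | 7.136 | 6.93 | 6.702 | 6.456 | 6.196 | 5.928 | 5.659 | 5.395 |   |   |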


In [None]:
dfl = df.melt(
    id_vars=['Country Name'], 
    value_vars=number_cols, 
    var_name='Year', 
    value_name='FertilityRate'
)
dfl = dfl.astype({'Year': int})
dfl.head()

|   | Country Name | Year | FertilityRate |
| :---: | :---: | :---: | :---: |
| 0 | Aruba | 1960 | 4.82 |
| 1 | Andorra | 1960 |   |
| 2 | Afghanistan | 1960 | 7.671 |
| 3 | Angola | 1960 | 7.316 |
| 4 | Albania | 1960 | 6.186 |


**Exercises**
Answer the following questions, using any Python tools we've covered so far

What countries are in this dataset?

How many measurements are in this dataset?

What years are covered in this dataset?

What is the distribution of fertility rates in this dataset?

For Germany, were there any significant differences between the fertility rates in 1990 and 2010?

For your home country, were there any significant differences between the fertility rates in 1990 and 2010?

Which countries had significant differences between the fertility rates in 1990 and 2010?

Were the fertility rates of India and France correlated?

Was there a steady change in fertility rate in Indonesia over the study period? (i.e. was year and fertility rate correlated)?  If so, did it tend to increase or decrease?

Did the world's fertility rate overall increase or decrease over the study period?

### Need Some Statistics Help?  Follow this Flowchart!

Pingouin's docs have a nice flowchart (https://pingouin-stats.org/guidelines.html) that can be helpful for choosing an appropriate statistical test for a study.

#### Further Reading

Nice article on Pingouin here: https://towardsdatascience.com/the-new-kid-on-the-statistics-in-python-block-pingouin-6b353a1db57c
