# Doing Statistics on Data with the Scipy.Stats and Pingouin Packages

**Scipy.Stats** has all the stats functions you know and love from stats class!

**Pingouin** is a new statistics package in Python that uses pandas, seaborn, and scipy.stats! Its main benefit? It's really easy to use and it works great with DataFrames!

https://pingouin-stats.org/index.html



## Basic Stats with Pingouin and Scipy.Stats

### T-Tests

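T-tests compare the mean of a sample against a fixed value, or against the mean of another sample, assuming the data come from normally-distributed populations. The p-value they report is the probability of seeing a difference at least this large if the population means really were the same. Both packages have functions for t-tests!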


| Test | `scipy.stats` Function | `pingouin` Function |
| :---: | :---: | :---: |
| One-Sample T-Test | `stats.ttest_1samp(x, 0)` | `pg.ttest(x, 0)` |
| Independent T-Test | `stats.ttest_ind(x, y)` | `pg.ttest(x, y)` |
| Paired T-Test | `stats.ttest_rel(x, y)` | `pg.ttest(x, y, paired=True)` |
| Pairwise T-Tests |   | `pg.pairwise_ttests(padjust='fdr_bh')` |
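
To make the one-sample row less of a black box, here is a minimal sketch that computes the t statistic and two-sided p-value by hand and compares them with `stats.ttest_1samp`. The sample `x` and the value `popmean` are made up for illustration and are not part of the exercises below.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)           # hypothetical sample, not the notebook's df
x = rng.normal(0.3, 1.0, size=20)
popmean = 0.0                            # value the mean is tested against

n = x.size
# t = (sample mean - hypothesised mean) / standard error of the mean
t_manual = (x.mean() - popmean) / (x.std(ddof=1) / np.sqrt(n))
# Two-sided p-value: chance of a |t| at least this large if the null is true
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(x, popmean)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Both printed pairs should agree, which shows that the p-value describes how surprising the data are under the null hypothesis.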


In [35]:
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg

**Exercises**: Using **both** packages, let's do some analysis on some fake data.

Generate the Data: Run the code below to create the dataset `df`.

In [10]:
np.random.seed(42)  # Makes sure we all have the same random numbers
df = pd.DataFrame()
df['A'] = np.random.normal(0, 1, size=20)
df['B'] = np.random.normal(0, 1, size=20)
df['C'] = np.random.normal(0.5, 1, size=20)
df['D'] = df['A'] * 0.3 + np.random.normal(0, 0.5, size=20)
df.head()

Unnamed: 0,A,B,C,D
0,0.496714,1.465649,1.238467,-0.090573
1,-0.138264,-0.225776,0.671368,-0.134309
2,0.647689,0.067528,0.384352,-0.358861
3,1.52303,-1.424748,0.198896,-0.141194
4,-0.234153,-0.544383,-0.978522,0.336017


Analyze the Data with Pairwise Tests

**(First do the questions in a section with scipy.stats, then do the section with Pingouin before moving on to the next section.)**

**A vs 0, One-Sample T-Test**: Is the mean of the normally-distributed population that the dataset `A` is generated from unlikely to be zero?

**Example**: with `scipy.stats`:

In [14]:
stats.ttest_1samp(df['A'], 0)

Ttest_1sampResult(statistic=-0.797966433655592, pvalue=0.43475058842710046)

**Example:** with `pingouin`:

In [15]:
pg.ttest(df['A'], 0)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-0.797966,19,two-sided,0.434751,"[-0.62, 0.28]",0.178431,0.309,0.118021


**B vs 1, One-Sample T-Test**: Is the mean of the normally-distributed population that the dataset `B` is generated from unlikely to be one?

with `scipy.stats`:

In [16]:
stats.ttest_1samp(df['B'], 1)

Ttest_1sampResult(statistic=-5.848539710012047, pvalue=1.2401934925939586e-05)

with `pingouin`:

In [17]:
pg.ttest(df['B'], 1)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-5.84854,19,two-sided,1.2e-05,"[-0.72, 0.19]",1.307773,1812.832,0.999824


**A vs B, Independent Samples T-Test**: Is the mean of the normally-distributed population that the dataset `A` is generated from unlikely to be the same as the mean of the normally-distributed population that the dataset `B` is generated from?

with `scipy.stats`:

In [19]:
stats.ttest_ind(df['A'], df['B'])

Ttest_indResult(statistic=0.31056072990949674, pvalue=0.757831664080289)

with `pingouin`:

In [20]:
pg.ttest(df.A, df.B)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,0.310561,38,two-sided,0.757832,"[-0.52, 0.71]",0.098208,0.321,0.060568
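
A side note on the independent test: `stats.ttest_ind` assumes equal variances by default, and passing `equal_var=False` switches to Welch's t-test. The sketch below uses made-up samples `x` and `y` with different sizes and spreads to show that the two variants give different answers. Pingouin's `pg.ttest` appears to expose the same choice through a `correction` argument; check its documentation for the default behaviour.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)           # hypothetical data, not the notebook's df
x = rng.normal(0.0, 1.0, size=20)
y = rng.normal(0.0, 3.0, size=35)        # different spread and sample size

student = stats.ttest_ind(x, y)                   # assumes equal variances (default)
welch = stats.ttest_ind(x, y, equal_var=False)    # Welch's t-test

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

With equal sample sizes, as in the exercises here, the two t statistics coincide and only the degrees of freedom differ.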


**A vs C, Independent Samples T-Test**: Is the mean of the normally-distributed population that the dataset `A` is generated from unlikely to be the same as the mean of the normally-distributed population that the dataset `C` is generated from?

with `scipy.stats`:

In [21]:
stats.ttest_ind(df.A, df.C)

Ttest_indResult(statistic=-2.282284751241357, pvalue=0.028158918415237755)

with `pingouin`:

In [22]:
pg.ttest(df.A, df.C)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-2.282285,38,two-sided,0.028159,"[-1.22, -0.07]",0.721722,2.28,0.604274


**A vs C, Paired Samples T-Test**: Is the mean of the differences between each pair of samples generated from the two normally-distributed populations `A` and `C` unlikely to be 0?

with `scipy.stats`:

In [23]:
stats.ttest_rel(df.A, df.C)

Ttest_relResult(statistic=-1.962887803520587, pvalue=0.06446923770352178)

with `pingouin`:

In [24]:
pg.ttest(df.A, df.C, paired=True)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-1.962888,19,two-sided,0.064469,"[-1.33, 0.04]",0.721722,1.141,0.864518


**A vs D, Paired Samples T-Test**: Is the mean of the differences between each pair of samples generated from the two normally-distributed populations `A` and `D` unlikely to be 0?

with `scipy.stats`:

In [32]:
stats.ttest_rel(df.A, df.D)

Ttest_relResult(statistic=-0.5277415478664295, pvalue=0.6037879300093334)

with `pingouin`:

In [33]:
pg.ttest(df.A, df.D, paired=True)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-0.527742,19,two-sided,0.603788,"[-0.52, 0.31]",0.128979,0.263,0.085076


The paired t-test is equivalent to a one-sample t-test of the differences against 0 for some statistics, but not others. What values are different? Compare the results below for `A - D` to the ones you generated above:

In [31]:
pg.ttest(df.A - df.D, 0)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,-0.527742,19,two-sided,0.603788,"[-0.52, 0.31]",0.118007,0.263,0.079282
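
One possible explanation for the mismatch, sketched below on made-up data: the t statistic and p-value depend only on the differences, while an effect size can be standardised in more than one way. The names `d_z` and `d_av` are labels chosen here for illustration, and the claim that Pingouin uses these two definitions is an assumption. It is at least consistent with the output above, where the one-sample `cohen-d` of 0.118007 equals |T| / sqrt(20).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)                 # hypothetical data, not the notebook's df
x = rng.normal(0, 1, size=20)
y = x * 0.3 + rng.normal(0, 0.5, size=20)      # partly built from x, like column D

diff = x - y
n = diff.size

# The test statistic is identical for both ways of running the test
t_paired = stats.ttest_rel(x, y).statistic
t_onesample = stats.ttest_1samp(diff, 0).statistic

# Effect size standardised by the spread of the differences
d_z = abs(diff.mean()) / diff.std(ddof=1)
# Effect size standardised by the average spread of the two original samples
d_av = abs(x.mean() - y.mean()) / np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)

print(t_paired, t_onesample)
print(d_z, d_av)
```

Since power is derived from the effect size, a different effect-size definition would also explain a different `power` value.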


(extra: Google it!)  Are A and C correlated?

(extra: Google it!) Are A and D correlated?
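
As a starting point for the two extra questions, here is a minimal sketch using `stats.pearsonr`, which returns the correlation coefficient and its p-value. The arrays below are regenerated with a different random generator, so the numbers will not match the `df` used above. Pingouin's counterpart is `pg.corr(x, y)`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)                # hypothetical data, not the notebook's df
a = rng.normal(0, 1, size=20)
c = rng.normal(0.5, 1, size=20)                # generated independently of a
d = a * 0.3 + rng.normal(0, 0.5, size=20)      # partly built from a

r_ac, p_ac = stats.pearsonr(a, c)
r_ad, p_ad = stats.pearsonr(a, d)

print(f"a vs c: r={r_ac:.3f}, p={p_ac:.3f}")
print(f"a vs d: r={r_ad:.3f}, p={p_ad:.3f}")
```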

## More-Complete Analysis: Walking the Frequentist Statistics Decision Tree

### Fertility Dataset

### Load the Data

In [None]:
import statsmodels.api as sm  # !pip install statsmodels
df = sm.datasets.fertility.load()['data']
df.head(3)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
0,Aruba,ABW,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,4.82,4.655,4.471,4.271,4.059,3.842,...,1.786,1.769,1.754,1.739,1.726,1.713,1.701,1.69,,
1,Andorra,AND,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,,,,,,,...,,,1.24,1.18,1.25,1.19,1.22,,,
2,Afghanistan,AFG,"Fertility rate, total (births per woman)",SP.DYN.TFRT.IN,7.671,7.671,7.671,7.671,7.671,7.671,...,7.136,6.93,6.702,6.456,6.196,5.928,5.659,5.395,,


### What significant differences are there between the fertility rates in 1990, 2000, and 2010?

### Parametric Tests

Follow the flowchart in the **ANOVA** section of the Pingouin docs to test for differences in the mean fertility rate between these 3 years. Even if the data is not homoscedastic, go ahead and do the ANOVA and pairwise tests.

https://pingouin-stats.org/guidelines.html#id5
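
A rough sketch of the parametric route, using only `scipy.stats` on made-up long-format data: an equal-variance check, a one-way ANOVA, then pairwise t-tests with a Bonferroni correction. The table `long_df` and its values are hypothetical, and the Bonferroni step is a simple stand-in for the `fdr_bh` adjustment mentioned earlier. In Pingouin the corresponding calls would be along the lines of `pg.homoscedasticity`, `pg.anova`, and `pg.pairwise_tests`.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical long-format data, not the real fertility table
long_df = pd.DataFrame({
    "Year": np.repeat([1990, 2000, 2010], 30),
    "Fertility Rate": np.concatenate([
        rng.normal(4.0, 1.5, 30),
        rng.normal(3.5, 1.5, 30),
        rng.normal(3.0, 1.5, 30),
    ]),
})

years = sorted(long_df["Year"].unique())
# groupby sorts by key, so groups follow the same order as `years`
groups = [g["Fertility Rate"].to_numpy() for _, g in long_df.groupby("Year")]

levene = stats.levene(*groups)      # homoscedasticity check
anova = stats.f_oneway(*groups)     # one-way ANOVA across the three years

# Pairwise t-tests, Bonferroni-corrected by the number of comparisons
pairs = list(combinations(range(len(groups)), 2))
pairwise = {
    (years[i], years[j]): min(1.0, stats.ttest_ind(groups[i], groups[j]).pvalue * len(pairs))
    for i, j in pairs
}

print("Levene p:", levene.pvalue)
print("ANOVA p:", anova.pvalue)
print(pairwise)
```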

### Nonparametric Tests

The data isn't normally distributed! So let's do the same thing, but with tests that don't make any assumptions about the distribution of our data. Follow the flowchart in the **Non-Parametric** section of the Pingouin docs to test for differences in fertility rate between these 3 years.

https://pingouin-stats.org/guidelines.html#id7
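
The non-parametric route can be sketched the same way with `scipy.stats`: a Shapiro-Wilk normality check per group, a Kruskal-Wallis test across all groups, and a Mann-Whitney U test for one pair. The skewed samples below are made up and treat the years as independent groups, which is a simplifying assumption. Pingouin offers `pg.normality`, `pg.kruskal`, and `pg.mwu` for the same steps.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical, deliberately skewed samples, not the real fertility data
groups = {
    1990: rng.exponential(4.0, size=30),
    2000: rng.exponential(3.0, size=30),
    2010: rng.exponential(2.0, size=30),
}

# Shapiro-Wilk: a small p-value suggests the sample is not normally distributed
normality_p = {year: stats.shapiro(values).pvalue for year, values in groups.items()}

# Kruskal-Wallis: rank-based alternative to the one-way ANOVA
kruskal = stats.kruskal(*groups.values())

# Mann-Whitney U: rank-based alternative to the independent t-test
mwu_1990_2010 = stats.mannwhitneyu(groups[1990], groups[2010], alternative="two-sided")

print(normality_p)
print("Kruskal-Wallis p:", kruskal.pvalue)
print("Mann-Whitney p (1990 vs 2010):", mwu_1990_2010.pvalue)
```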

## Correlation Tests

Follow the flowchart in the **Correlation** section of the Pingouin docs to answer the questions below.

https://pingouin-stats.org/guidelines.html#id6
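
A simplified reading of that flowchart, sketched with `scipy.stats`: check both variables for normality, use Pearson's correlation if both pass, and fall back to Spearman's rank correlation otherwise. The arrays `x` and `y`, the 0.05 threshold, and the reduction of the flowchart to this single branch are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.exponential(1.0, size=40)            # hypothetical, skewed variable
y = x ** 2 + rng.normal(0, 0.1, size=40)     # monotonic but non-linear link to x

alpha = 0.05                                 # assumed significance threshold
both_normal = stats.shapiro(x).pvalue > alpha and stats.shapiro(y).pvalue > alpha

if both_normal:
    method, result = "pearson", stats.pearsonr(x, y)
else:
    method, result = "spearman", stats.spearmanr(x, y)

r, p = result                                # both results unpack to (coefficient, p-value)
print(method, r, p)
```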

### Is there a significant correlation between Germany's fertility rate and France's fertility rate?

### Is there a significant correlation between Germany's fertility rate and India's fertility rate?

### Is there a significant correlation between time (Year) and fertility rate?

Before running the correlation test, use the `DataFrame.melt()` method to get a new dataframe with just "Country Name", "Year", and "Fertility Rate" columns, then change the Year column to integers.

*Hint*: A list of column names in a DataFrame is at `DataFrame.columns`.
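
A minimal sketch of that reshaping step on a tiny made-up table. The country names and values are hypothetical, and the code assumes the year columns are strings of digits, as the column headers in the output above suggest.

```python
import pandas as pd

# Hypothetical miniature of the wide fertility table (values are made up)
wide = pd.DataFrame({
    "Country Name": ["Xland", "Yland"],
    "Country Code": ["XXX", "YYY"],
    "1990": [3.1, 2.0],
    "2000": [2.8, 1.9],
    "2010": [2.5, 1.8],
})

# Pick out the year columns from the full list of column names
year_cols = [col for col in wide.columns if col.isdigit()]

# Wide -> long: one row per (country, year) pair
long_df = wide.melt(
    id_vars="Country Name",
    value_vars=year_cols,
    var_name="Year",
    value_name="Fertility Rate",
)
long_df["Year"] = long_df["Year"].astype(int)   # "1990" -> 1990

print(long_df)
```

The resulting long table can then be filtered by country and passed to a correlation test with `Year` as one of the variables.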


### Is there a significant correlation between time (Year) and Germany's fertility rate?

## Further Reading

Nice article on Pingouin here: https://towardsdatascience.com/the-new-kid-on-the-statistics-in-python-block-pingouin-6b353a1db57c
