# Statistical Testing

## Correlation Tests


1. Estimates  
2. Sampling
3. Power
4. Statistical Tests
    1. Normality Tests
    2. **Correlation Tests**
    3. Parametric Statistical Hypothesis Tests
    4. Nonparametric Statistical Hypothesis Tests

## Correlation Tests

1. Pearson's Correlation Coefficient
2. Spearman's Rank Correlation
3. Kendall's Rank Correlation
4. Chi-Squared Test

Correlation tests check whether two samples are related.
They are often used for feature selection and multivariate analysis during data preprocessing and exploration, before any statistical modelling or further data analysis.   
Correlation helps us investigate relationships between variables.


### 1. Pearson’s Correlation Coefficient
1. Tests whether two samples have a linear relationship. 
    - The correlation coefficient is a numerical measure of the sign and strength of the linear association between two variables.
    - It is like covariance, but divides out the standard deviations of both variables.
    - It is unitless and always lies between -1 and +1, where +1 is a perfect positive correlation, -1 is a perfect negative correlation, and 0 means no linear association.
    - Correlation is not causation. A causal relationship exists when the independent variable is an underlying determinant of the dependent variable. Correlation may suggest a causal relationship, but it does not prove that one exists.
2. Assumptions
    - Observations in each sample are independent and identically distributed (iid).
    - Observations in each sample are normally distributed.
    - Observations in each sample have the same variance.
3. Interpretation
    - H0: the two samples are independent.
    - H1: there is a dependency between the samples.
4. Python Code  
    ```python
    from scipy.stats import pearsonr
    data1, data2 = ...
    corr, p = pearsonr(data1, data2)
    ```
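As a minimal sketch of the "covariance divided by the standard deviations" description above, the coefficient can be computed by hand and compared with `pearsonr`. The data values and the variable names (`x`, `y`, `r_manual`) are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative data (hypothetical values, not from these notes)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Sample covariance divided by the product of the sample standard deviations
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# pearsonr returns the coefficient and a two-sided p-value
r_scipy, p_value = pearsonr(x, y)

print(r_manual, r_scipy, p_value)
```

The two coefficients should agree up to floating-point error; the p-value comes only from `pearsonr`.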

#### Caveats to Pearson's Correlation

##### 1. Simpson's Paradox
- Simpson's paradox: a correlation can be misleading, or even reverse direction, when a confounding variable is ignored (*Data Science from Scratch*).  
- Correlation measures the relationship between two variables *with all else being equal*, so check for confounding factors before drawing conclusions from a correlation.  

Demonstrating the effect of confounding variables: at the aggregate level, HFD appears to have the higher average spend.

In [1]:
import pandas as pd
d = {'region': ['SAFD','HFD'], 'num_customers': ['103,000', '101,000'], 'avg_spend': ['$646.80', '$819.70']}
df = pd.DataFrame(data=d)
df

| | region | num_customers | avg_spend |
|---|---|---|---|
| 0 | SAFD | 103000 | $646.80 |
| 1 | HFD | 101000 | $819.70 |


But when you separate customers by loyalty class, the San Antonio market (SAFD) has the higher average spend in both the loyal and non-loyal classes. Loyalty is the confounding variable here: loyal customers spend far more, and HFD has a much larger share of them, which lifts its overall average.

In [2]:
d = {'region': ['SAFD','SAFD','HFD','HFD'], 'loyalty': ['loyal','non-loyal','loyal','non-loyal'],
     'num_customers': ['33,000','70,000','66,000','35,000'], 'avg_spend': ['$1340.00','$320.00','$1090.00','$310.00']}
df = pd.DataFrame(data=d)
df

| | region | loyalty | num_customers | avg_spend |
|---|---|---|---|---|
| 0 | SAFD | loyal | 33000 | $1340.00 |
| 1 | SAFD | non-loyal | 70000 | $320.00 |
| 2 | HFD | loyal | 66000 | $1090.00 |
| 3 | HFD | non-loyal | 35000 | $310.00 |
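A minimal sketch of where the reversal comes from, assuming the segment figures above are stored as numbers rather than formatted strings. The names `seg`, `overall`, and `loyal_share` are illustrative. The overall average is a customer-weighted mean of the segment averages, so a region with more loyal customers can lead overall while trailing within each class.

```python
import pandas as pd

# Same figures as the segmented table, stored as numeric values
seg = pd.DataFrame({
    'region': ['SAFD', 'SAFD', 'HFD', 'HFD'],
    'loyalty': ['loyal', 'non-loyal', 'loyal', 'non-loyal'],
    'num_customers': [33_000, 70_000, 66_000, 35_000],
    'avg_spend': [1340.00, 320.00, 1090.00, 310.00],
})

# Total spend per segment, then the customer-weighted average per region
seg['total_spend'] = seg['num_customers'] * seg['avg_spend']
totals = seg.groupby('region')[['total_spend', 'num_customers']].sum()
overall = totals['total_spend'] / totals['num_customers']

# Share of each region's customers that are loyal (the confounder)
loyal_customers = seg[seg['loyalty'] == 'loyal'].set_index('region')['num_customers']
loyal_share = loyal_customers / totals['num_customers']

print(overall.round(2))      # matches the aggregate averages shown earlier
print(loyal_share.round(2))  # HFD has roughly twice SAFD's share of loyal customers
```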


##### 2. Linear Relationship
- Correlation measures the *linear* relationship between two variables.   
- A correlation of 0 indicates there is no *linear* relationship.  
- However, there may be other types of relationships, such as a quadratic or absolute value relationship. 


In [3]:
d = {'x': [-2, -1, 0, 1, 2], 'y': [2, 1, 0, 1, 2]}
df = pd.DataFrame(data=d)
df

| | x | y |
|---|---|---|
| 0 | -2 | 2 |
| 1 | -1 | 1 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
| 4 | 2 | 2 |


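In [4]:
from scipy.stats import linregress
# Pass x and y explicitly; rvalue is the Pearson correlation coefficient (index 0 is the slope)
linregress(df['x'], df['y']).rvalue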

0.0
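As a rough illustration that a zero correlation does not mean "no relationship", the sketch below reuses the same x and y values and shows that y is a deterministic function of x. Transforming x with `np.abs` is an illustrative choice, not something the original notebook does.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([-2, -1, 0, 1, 2])
y = np.abs(x)  # same values as the table above: y is fully determined by x

r_raw, _ = pearsonr(x, y)          # no linear relationship between x and y
r_abs, _ = pearsonr(np.abs(x), y)  # perfectly linear once x is transformed

print(r_raw, r_abs)
```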

##### 3. Scale of the Relationship
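Correlation tells you nothing about how large the relationship is. In the data below, y changes by only 0.01 for each unit change in x, yet the correlation is a perfect 1.0.  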



In [5]:
d = {'x': [-2, -1, 0, 1, 2], 'y': [99.98,99.99,100.00,100.01,100.02]}
df = pd.DataFrame(data=d)
df

| | x | y |
|---|---|---|
| 0 | -2 | 99.98 |
| 1 | -1 | 99.99 |
| 2 | 0 | 100.0 |
| 3 | 1 | 100.01 |
| 4 | 2 | 100.02 |


In [6]:
from scipy.stats import linregress
# Pass x and y explicitly; rvalue is the Pearson correlation coefficient
linregress(df['x'], df['y']).rvalue


1.0
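To separate strength from scale, one option is to look at the regression slope alongside the correlation. The sketch below compares the series above with a hypothetical, much steeper series (`y_large` is invented for illustration); both should give a correlation of 1 but very different slopes.

```python
import numpy as np
from scipy.stats import linregress

x = np.array([-2, -1, 0, 1, 2])
y_small = np.array([99.98, 99.99, 100.00, 100.01, 100.02])
y_large = 100 + 50 * x  # hypothetical series with a much steeper slope

fit_small = linregress(x, y_small)
fit_large = linregress(x, y_large)

# slope captures the size of the effect; rvalue only its linear consistency
print(fit_small.slope, fit_small.rvalue)
print(fit_large.slope, fit_large.rvalue)
```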

##### 4. Correlation != Causation
*see spurious correlations*  
http://www.tylervigen.com/spurious-correlations   
https://hbr.org/2015/06/beware-spurious-correlations  

### 2. Spearman’s Rank Correlation
1. Tests whether two samples have a monotonic relationship.
2. Assumptions
    - Observations in each sample are independent and identically distributed (iid).
    - Observations in each sample can be ranked.
3. Interpretation
    - H0: the two samples are independent.
    - H1: there is a dependency between the samples.
4. Python Code  
    ```python
    from scipy.stats import spearmanr
    data1, data2 = ...
    corr, p = spearmanr(data1, data2)
    ```
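A small sketch of what "monotonic" buys you, using made-up data: for a strictly increasing but non-linear relationship, Spearman's coefficient is 1 while Pearson's is noticeably lower. It also illustrates that Spearman's coefficient is Pearson's coefficient applied to the ranks; the exponential data is an assumption chosen for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

x = np.arange(1, 11)
y = np.exp(x)  # hypothetical data: strictly increasing, strongly non-linear

rho, p = spearmanr(x, y)  # rank-based: sees a perfect monotonic relationship
r, _ = pearsonr(x, y)     # linear measure: penalised by the curvature

# Spearman's rho is Pearson's r computed on the ranks of each sample
rho_manual, _ = pearsonr(rankdata(x), rankdata(y))

print(rho, r, rho_manual)
```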

### 3. Kendall’s Rank Correlation
1. Tests whether two samples have a monotonic relationship.
2. Assumptions
    - Observations in each sample are independent and identically distributed (iid).
    - Observations in each sample can be ranked.
3. Interpretation
    - H0: the two samples are independent.
    - H1: there is a dependency between the samples.
4. Python Code  
    ```python
    from scipy.stats import kendalltau
    data1, data2 = ...
    corr, p = kendalltau(data1, data2)
    ```
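Kendall's tau is built from concordant and discordant pairs: a pair of observations is concordant when both variables move in the same direction. The sketch below counts pairs by hand on a small made-up sample without ties and compares the result with `kendalltau`; with ties, SciPy's default tau-b uses a different denominator, so the simple formula here would no longer match.

```python
from itertools import combinations
from scipy.stats import kendalltau

# Hypothetical sample with no tied values
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

concordant = discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    direction = (x2 - x1) * (y2 - y1)
    if direction > 0:
        concordant += 1   # both variables move the same way
    elif direction < 0:
        discordant += 1   # the variables move in opposite ways

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (concordant - discordant) / n_pairs

tau_scipy, p = kendalltau(x, y)
print(concordant, discordant, tau_manual, tau_scipy)
```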

### 4. Chi-Squared Test
1. Tests whether two categorical variables are related or independent.
2. Assumptions
    - Observations used in the calculation of the contingency table are independent.
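    - Enough examples in each cell of the contingency table. These notes use 25 or more per cell; a widely used rule of thumb only requires an expected count of at least 5 in each cell.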
3. Interpretation
    - H0: the two variables are independent.
    - H1: there is a dependency between the variables.
4. Python Code  
    ```python
    from scipy.stats import chi2_contingency
    table = ...
    stat, p, dof, expected = chi2_contingency(table)
    ```
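A minimal sketch with a concrete table, assuming hypothetical counts for two groups and a yes/no outcome (the labels `group_a`, `group_b`, `yes`, and `no` are invented). The `expected` array returned by `chi2_contingency` can be inspected to check whichever minimum-cell-size rule you adopt. Note that for a 2x2 table SciPy applies Yates' continuity correction by default.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows are groups, columns are outcomes
table = pd.DataFrame(
    [[60, 40],
     [30, 70]],
    index=['group_a', 'group_b'],
    columns=['yes', 'no'],
)

stat, p, dof, expected = chi2_contingency(table)

# expected holds the counts implied by independence (H0)
print(stat, p, dof)
print(expected)
print(expected.min())  # smallest expected cell count
```

In practice the table is often built from raw categorical columns with `pd.crosstab` before being passed to the test.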

### Appendix

#### Covariance
While variance measures how a single variable deviates from its mean, covariance measures how two variables vary in tandem from their means. 
- A large positive covariance means that x tends to be large when y is large, and small when y is small.  
- A large negative covariance means that x tends to be large when y is small, and vice versa.  
- Its units are the product of the inputs' units, which can be hard to interpret (e.g. *friend-minutes-per-day... what's a "friend-minutes-per-day"?*) 
- *If each user had twice as many friends, but the same number of minutes, the covariance would be twice as large.  But in a sense the variables would be just as interrelated...it's hard to say what counts as a "large" covariance.*   
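The scaling point above can be checked numerically. The sketch below uses invented `friends` and `minutes` values: doubling one variable doubles the covariance but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical data in the spirit of the friends/minutes example
friends = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
minutes = np.array([15.0, 30.0, 28.0, 45.0, 60.0])

cov = np.cov(friends, minutes)[0, 1]
cov_doubled = np.cov(2 * friends, minutes)[0, 1]

corr = np.corrcoef(friends, minutes)[0, 1]
corr_doubled = np.corrcoef(2 * friends, minutes)[0, 1]

print(cov, cov_doubled)    # covariance scales with the units
print(corr, corr_doubled)  # correlation does not
```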

Principal Component Analysis (PCA) is an application of correlation analysis. Should it use a correlation matrix or a covariance matrix? Use the covariance matrix when the variables are on similar scales, and the correlation matrix when their scales differ, because otherwise the variables with the largest variances dominate the components.
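A rough sketch of why the choice matters, using synthetic data with two variables on very different scales (the names `height_m` and `income` and all numbers are invented). The correlation matrix is the covariance matrix of the standardized variables, and the leading eigenvector, taken here as the first principal direction, differs sharply between the two matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variables on very different scales
height_m = rng.normal(1.7, 0.1, size=200)
income = 30_000 + 20_000 * height_m + rng.normal(0, 5_000, size=200)
X = np.column_stack([height_m, income])

cov_matrix = np.cov(X, rowvar=False)
corr_matrix = np.corrcoef(X, rowvar=False)

# Standardizing each column turns the covariance matrix into the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_of_standardized = np.cov(Z, rowvar=False)

# eigh returns eigenvalues in ascending order, so the last column is the leading direction
_, vecs_cov = np.linalg.eigh(cov_matrix)
_, vecs_corr = np.linalg.eigh(corr_matrix)
pc1_cov = vecs_cov[:, -1]    # almost entirely the large-scale variable
pc1_corr = vecs_corr[:, -1]  # weights both variables equally in magnitude

print(pc1_cov, pc1_corr)
```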

