## t-Test
- Analyzes whether there is a significant difference between the mean values of two groups
- Assumptions:
    - The dependent variable is metric (interval or ratio scale)
    - The values in each group are approximately normally distributed
    - Variances in groups must be approximately equal (independent samples t-Test)
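
These assumptions can be checked before running the test. The following is a minimal sketch on synthetic data (the arrays `sample_a` and `sample_b` are hypothetical and are not part of the course data sets). It uses the Shapiro-Wilk test for normality and Levene's test for equal variances, and falls back to Welch's t-test, which does not assume equal variances.

```python
import numpy as np
from scipy import stats

# Synthetic samples, for illustration only (not the notebook's data)
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=50, scale=5, size=40)
sample_b = rng.normal(loc=52, scale=5, size=40)

# Shapiro-Wilk test: H0 = the sample comes from a normal distribution
shapiro_a = stats.shapiro(sample_a)
shapiro_b = stats.shapiro(sample_b)

# Levene test: H0 = the groups have equal variances
levene = stats.levene(sample_a, sample_b)

print(f"Shapiro-Wilk p-value (A): {shapiro_a.pvalue:.3f}")
print(f"Shapiro-Wilk p-value (B): {shapiro_b.pvalue:.3f}")
print(f"Levene p-value:           {levene.pvalue:.3f}")

# Welch's t-test: a variant that does not assume equal variances
welch = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(f"Welch t-test p-value:     {welch.pvalue:.3f}")
```

A small p-value in either check suggests that the corresponding assumption may not hold for that sample.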

### t-Test in Python ###
**ttest_ind** for independent samples <br />
**ttest_rel** for related samples
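
The difference between the two functions can be illustrated with a small sketch. The `before` and `after` arrays are synthetic, hypothetical measurements of the same 25 subjects. For such paired data, `ttest_rel` is equivalent to a one-sample t-test on the per-subject differences, whereas `ttest_ind` ignores the pairing.

```python
import numpy as np
from scipy import stats

# Synthetic paired measurements, for illustration only
rng = np.random.default_rng(1)
before = rng.normal(loc=300, scale=30, size=25)
after = before - rng.normal(loc=10, scale=5, size=25)

# Treats the two arrays as unrelated groups
independent = stats.ttest_ind(before, after)

# Uses the pairing: tests whether the mean difference is zero
paired = stats.ttest_rel(before, after)

# Same test expressed as a one-sample t-test on the differences
one_sample = stats.ttest_1samp(before - after, popmean=0)

print("independent:", independent.statistic, independent.pvalue)
print("paired:     ", paired.statistic, paired.pvalue)
print("one-sample: ", one_sample.statistic, one_sample.pvalue)
```

Choosing `ttest_ind` for data that are actually paired usually discards information, because the variation between subjects then hides the within-subject change.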

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns

In [None]:
df = pd.read_csv("data/Gender-ResponseTime.csv")
df

**Null Hypothesis ($ H_0 $):** The mean response time of males is equal to the mean response time of females <br/>
**Alternative Hypothesis ($ H_1 $):** The mean response time of males differs from the mean response time of females

In [None]:
# Response times of the male participants (missing values dropped)
group2 = df.loc[df.Gender == 'male', 'Response_time'].dropna()

In [None]:
# Response times of the female participants (missing values dropped)
group1 = df.loc[df.Gender == 'female', 'Response_time'].dropna()

In [None]:
group1

In [None]:
group2

In [None]:
stats.ttest_ind(group1,group2)

**pvalue (0.895) > threshold (0.05)** <br />
We cannot reject the null hypothesis: the data show no significant difference between the two group means
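
To see what `ttest_ind` computes with its default settings, the statistic and the two-sided p-value can be reproduced by hand from the pooled variance. This is a sketch on made-up numbers (the arrays `x` and `y` are hypothetical, not the response-time data).

```python
import numpy as np
from scipy import stats

# Made-up samples, for illustration only
x = np.array([312.0, 298.0, 305.0, 321.0, 290.0, 310.0, 301.0, 315.0])
y = np.array([308.0, 295.0, 299.0, 325.0, 288.0, 306.0, 311.0, 300.0])

n1, n2 = len(x), len(y)
dof = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / dof

# t statistic: difference of means divided by its standard error
t_manual = (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

result = stats.ttest_ind(x, y)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```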


Let's shift the response times of one group (subtract 10 from every value in the male group) and repeat the test

In [None]:
# Subtract 10 from every response time in group2 (vectorized)
group2 = group2 - 10

In [None]:
stats.ttest_ind(group1,group2)

**pvalue (0.004) < threshold (0.05)** <br />
We reject the null hypothesis
<br/>
<br/>
<br/>



## ANOVA
- Analyzes whether there is a significant difference between the mean values of more than two groups
- Assumptions:
    - The dependent variable is metric (interval or ratio scale)
    - The values in each group are approximately normally distributed
    - Variances in groups must be approximately equal (homogeneity of variance)

In [None]:
df = pd.read_csv("data/OneWayANOVA.csv")
df

In [None]:
df.Drug.unique()

**Null Hypothesis ($ H_0 $):** There are no differences between the mean systolic blood pressures of the individual groups<br/>
**Alternative Hypothesis ($ H_1 $):** At least two group means differ from each other in the population

In [None]:
groups = df.groupby("Drug").groups
group1 = df.BP[groups["Drug A"]]
group2 = df.BP[groups["Drug B"]]
group3 = df.BP[groups["Drug C"]]
group1.describe()

In [None]:
group2.describe()

In [None]:
group3.describe()

In [None]:
# Perform the ANOVA
stats.f_oneway(group1,group2,group3)

**Alternative ANOVA  using *statsmodels* library**

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

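# Fit a linear model with BP as the outcome and Drug as a categorical predictor
model = ols('BP ~ Drug', data=df).fit()

# Type 2 ANOVA table for the fitted model
anova_result = sm.stats.anova_lm(model, typ=2)
print(anova_result)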

Both calculations show that **pvalue < threshold (0.05)** <br />
We can reject the null hypothesis
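
The F statistic behind both calculations is the ratio of the variance between groups to the variance within groups. The sketch below reproduces `f_oneway` by hand on made-up numbers (the three arrays are hypothetical, not the blood pressure data).

```python
import numpy as np
from scipy import stats

# Made-up samples for three groups, for illustration only
samples = [
    np.array([120.0, 125.0, 130.0, 128.0, 122.0]),
    np.array([135.0, 138.0, 132.0, 140.0, 136.0]),
    np.array([118.0, 121.0, 117.0, 123.0, 119.0]),
]

all_values = np.concatenate(samples)
grand_mean = all_values.mean()
k = len(samples)        # number of groups
n = all_values.size     # total number of observations

# Sum of squares between groups and within groups
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

result = stats.f_oneway(*samples)
print(f_manual, p_manual)
print(result.statistic, result.pvalue)
```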


## $ \chi^2 $ Test
- A hypothesis test that determines whether there is a relationship between **two categorical variables**.
- Assumptions:
    - The expected frequency in each cell is at least 5
    - The variables are treated as nominal: only the categories are used, not any ranking among them
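
The expected frequencies follow from the row and column totals: each expected count is (row total × column total) / grand total. The sketch below computes them by hand for a made-up 3×2 table (the `observed` array is hypothetical, not the newspaper data), checks the assumption, and compares the result with `chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table (rows: categories of A, columns: categories of B)
observed = np.array([
    [30, 20],
    [25, 35],
    [15, 25],
])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)

# Expected counts under independence
expected_manual = row_totals * col_totals / observed.sum()

# Assumption check: every expected count should be at least 5
assumption_ok = (expected_manual >= 5).all()

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected_manual)
print(assumption_ok, chi2_manual, chi2, dof)
```

The degrees of freedom are (rows − 1) × (columns − 1), which gives 2 for this table.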


In [None]:
df = pd.read_csv("data/Gender-Newspaper.csv")
df

**Null Hypothesis ($ H_0 $):** There is **no relationship** between gender and the preferred newspaper<br/>
**Alternative Hypothesis ($ H_1 $):** There **is a relationship** between gender and the preferred newspaper<br/><br/>
Use the pandas *crosstab* function to create a contingency table

In [None]:
contingency = pd.crosstab(df.Newspaper, df.Gender)
contingency

In [None]:
# Chi-square test of independence
from scipy.stats import chi2_contingency
c, p, dof, expected = chi2_contingency(contingency)
# Print the test statistic, degrees of freedom, p-value and expected counts
print('Chi statistic: ' + str(c))
print('df: ' + str(dof))
print('pvalue: ' + str(p))
print('expected:')
print(expected)


**expected** shows what the contingency table would look like if the two variables were perfectly independent <br/>
**pvalue** is greater than the threshold (0.05), so we cannot reject the null hypothesis<br/><br/>
Let's increase the counts of one gender category for every second newspaper and repeat the test

In [None]:
# Add 20 to the first gender column in every second row (positions 1, 3, ...)
contingency.iloc[1::2, 0] += 20
contingency

In [None]:
# Chi-square test of independence on the adjusted table
from scipy.stats import chi2_contingency
c, p, dof, expected = chi2_contingency(contingency)
# Print the test statistic, degrees of freedom, p-value and expected counts
print('Chi statistic: ' + str(c))
print('df: ' + str(dof))
print('pvalue: ' + str(p))
print('expected:')
print(expected)


**pvalue (0.04) < threshold (0.05)** <br />
We reject the null hypothesis
<br/>
<br/>
<br/>


## Correlation Analysis
- Measures the strength and direction of the relationship between two variables<br/>
#### Pearson Correlation (r)
- Measures the **linear** relationship between two variables
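
Pearson's r is the covariance of the two variables divided by the product of their standard deviations, so it always lies between -1 and 1. The sketch below computes it by hand on made-up numbers (the arrays `x` and `y` are hypothetical) and compares the result with `stats.pearsonr`.

```python
import numpy as np
from scipy import stats

# Made-up data with a nearly linear relationship, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Center both variables
x_c = x - x.mean()
y_c = y - y.mean()

# r = sum of cross-products / sqrt(product of the sums of squares)
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

r, p = stats.pearsonr(x, y)
print(r_manual, r, p)
```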


In [None]:
df = pd.read_csv("data/ReactionTime.csv")
df

**Null Hypothesis ($ H_0 $):** The correlation coefficient does not differ significantly from zero - no linear relationship<br/>
**Alternative Hypothesis ($ H_1 $):** The correlation coefficient differs significantly from zero - there is a linear relationship<br/><br/>


In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
corr = df.corr(method='pearson')

In [None]:
corr

In [None]:
stats.pearsonr(df.Before_Intervention, df.After_Intervention)

#### Accept or reject the null hypothesis?<br/><br/>

#### Spearman Correlation ($ r_s $)

- Used when the two variables are **not** linearly related<br/>
- Describes how well the variables' relationship can be described by a monotonic function
- Computed as a Pearson correlation between the **ranked** variables
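
The equivalence with a Pearson correlation on ranks can be checked directly. The sketch below uses made-up data with a monotonic but strongly non-linear relationship (the arrays `x` and `y` are hypothetical) and compares `stats.spearmanr` with `stats.pearsonr` applied to the output of `stats.rankdata`.

```python
import numpy as np
from scipy import stats

# Made-up data: y grows monotonically but not linearly with x
x = np.arange(1.0, 9.0)
y = np.exp(x)

# Pearson on the raw values underestimates the monotonic relationship
pearson_raw = stats.pearsonr(x, y)[0]

# Rank both variables, then apply Pearson to the ranks
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)
pearson_on_ranks = stats.pearsonr(rank_x, rank_y)[0]

# Spearman computed directly
spearman = stats.spearmanr(x, y)[0]

print(pearson_raw, pearson_on_ranks, spearman)
```

When the rank columns are not already in the data, `stats.spearmanr` saves the manual ranking step.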

In [None]:
df = pd.read_csv("data/Reaction-Age-Spearman.csv")
df

In [None]:
# Create a scatterplot
plt.scatter(df['Reaction_time'], df['Age'])
# Create label for x-axis
plt.xlabel('Reaction Time')
# Create label for y-axis
plt.ylabel('Age')
# Create title
plt.title('Reaction Time vs. Subject Age')

In [None]:
# Create a scatterplot
plt.scatter(df['Ranks_reaction_time'], df['Ranks_age'])
# Create label for x-axis
plt.xlabel('Reaction Time Rank')
# Create label for y-axis
plt.ylabel('Age Rank')
# Create title
plt.title('Ranked Reaction Time vs. Ranked Subject Age')

In [None]:
stats.pearsonr(df.Ranks_reaction_time, df.Ranks_age)