# Statistical Tests - Topic 01: Overview, Shapiro and Chi-Squared

### Lesson Learning Outcome
The Statistical Tests Lesson consists of 3 units.
By the end of this lesson, you should be able to:
- Understand the core concepts behind a statistical test
- Use and interpret statistical tests including the Shapiro-Wilk, Chi-Squared, T-Test, Paired T-Test, ANOVA, Mann-Whitney, Wilcoxon and Kruskal-Wallis tests

### Topic Objectives
- Understand the core concepts behind a statistical test
- Use and interpret the Shapiro-Wilk and Chi-Squared tests

### Why do we study Statistical Tests?
Statistical tests let us judge whether differences or similarities between groups are real. They also help us evaluate whether a predictor variable has a statistically significant relationship with a target variable.
A difference between groups can sometimes be seen or measured yet arise purely from random chance. A result is statistically significant when the observed relationship between two or more variables would be unlikely if chance alone were at work.

### Additional Context for Learning
- Add code cells and try other options, play around with parameter values in a function/method, or consider additional function parameters etc.
- Also, add your comments to the cells. It can help you to consolidate your learning.
- Functions may have mandatory, default, or optional parameters; check package docs for details.

Useful documentation:
- [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [Pingouin Documentation](https://pingouin-stats.org/)

In [None]:
# Import required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import pingouin as pg
import scipy.stats as stats

### Statistical Tests Overview
A statistical test is a procedure for making a decision about a process based on sample data.

The idea is to see whether there is enough evidence to reject a hypothesis about the process. If there is not, we fail to reject it; we never prove it true.

### Hypothesis Testing
Hypothesis testing helps form conclusions from data:
- Null Hypothesis (H0): Usually states no difference between groups.
- Alternative Hypothesis (H1): States there is a difference.

We compare observed data to what we expect under H0.

### Significance Level (alpha)
- The probability of rejecting the null hypothesis when it is actually true (a false positive, or Type I error).
- Usually set to 0.05 (5%).
- A lower alpha means a stricter test (used in high-stakes research).
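
To make the meaning of alpha concrete, the sketch below simulates many experiments in which the null hypothesis is true by construction and counts how often a test rejects it anyway. The one-sample t-test, the sample size of 30 and the number of experiments are illustrative assumptions, not part of the lesson material.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 2000  # illustrative choice

# Every row is one experiment drawn from a population whose true mean is 0,
# so H0 ("the mean equals 0") is true in every experiment.
samples = rng.normal(loc=0, scale=1, size=(n_experiments, 30))
p_values = stats.ttest_1samp(samples, popmean=0, axis=1).pvalue

# Share of experiments in which we wrongly rejected a true H0
false_positive_rate = np.mean(p_values < alpha)
print(f"Share of true nulls rejected: {false_positive_rate:.3f}")
```

The printed share should land close to alpha, which is what "5% chance of rejecting a true null hypothesis" means in practice.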

### Test Statistic
A single number, computed from the sample, that measures how far the observed data depart from what H0 predicts (for example, how different two groups are).
The formula differs depending on the test.
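
As one illustration of a test statistic, the sketch below computes Welch's t statistic by hand (difference in group means divided by its standard error) and compares it with the value returned by `scipy.stats.ttest_ind`. The two groups are simulated, and their means and spreads are made-up values chosen only for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=5, size=40)  # hypothetical group A
group_b = rng.normal(loc=53, scale=5, size=40)  # hypothetical group B

# Standard error of the difference in means (unequal variances allowed)
se = np.sqrt(group_a.var(ddof=1) / group_a.size + group_b.var(ddof=1) / group_b.size)

# Welch's t statistic: difference in means measured in standard-error units
t_manual = (group_a.mean() - group_b.mean()) / se

t_scipy = stats.ttest_ind(group_a, group_b, equal_var=False).statistic
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}")
```

Other tests use different formulas, but the idea is the same: condense the sample into one number whose distribution under H0 is known.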

### P-value
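- Probability of observing data at least as extreme as ours, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true.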
- Smaller p-value means stronger evidence against H0.
- Decision rule:
  - If p-value < alpha: reject H0
  - Else: fail to reject H0
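
The decision rule can be written as a small helper. The function name `decide` is hypothetical and used here only for illustration.

```python
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a given p-value (illustrative helper)."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    return "reject H0" if p_value < alpha else "fail to reject H0"


print(decide(0.003))             # below the default alpha of 0.05
print(decide(0.20))              # above the default alpha of 0.05
print(decide(0.03, alpha=0.01))  # same rule with a stricter alpha
```

Note that the same p-value of 0.03 leads to a different decision once alpha is tightened from 0.05 to 0.01.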

## Shapiro-Wilk Test for Normality
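- Null hypothesis: the data come from a normal distribution
- Alternative: the data do not come from a normal distribution
- If p-value < alpha (0.05), reject the null hypothesis. A p-value >= alpha does not prove normality; it only means the test found no evidence against it.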
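
As a minimal sketch of the test outside pingouin, `scipy.stats.shapiro` returns the W statistic and the p-value for a single sample. The two samples below are simulated for illustration; the seed and distribution parameters are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
samples = {
    "normal": rng.normal(loc=0, scale=2, size=200),
    "exponential": rng.exponential(scale=20, size=200),
}

alpha = 0.05
results = {}
for name, sample in samples.items():
    # shapiro returns (W statistic, p-value)
    w_stat, p_value = stats.shapiro(sample)
    results[name] = (w_stat, p_value)
    verdict = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"{name}: W={w_stat:.3f}, p={p_value:.4f} -> {verdict}")
```

A strongly skewed sample such as the exponential one should be rejected as non-normal, while the outcome for the normal sample depends on the particular draw.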

In [None]:
# Generate example data
from scipy.stats import skewnorm
np.random.seed(1)
size = 200

X1 = np.random.normal(loc=40, scale=2, size=int(size/2))
X2 = np.random.normal(loc=10, scale=4, size=int(size/2))
bi_modal = np.concatenate([X1, X2])

X1 = np.random.normal(loc=40, scale=4, size=int(size/4))
X2 = np.random.normal(loc=10, scale=4, size=int(size/4))
X3 = np.random.normal(loc=0, scale=2, size=int(size/4))
X4 = np.random.normal(loc=80, scale=2, size=int(size/4))
multi_modal = np.concatenate([X1, X2, X3, X4])

df = pd.DataFrame(data={
    'Normal': np.random.normal(loc=0, scale=2, size=size),
    'Positive Skewed': skewnorm.rvs(a=10, size=size),
    'Negative Skewed': skewnorm.rvs(a=-10, size=size),
    'Exponential': np.random.exponential(scale=20, size=size),
    'Uniform': np.random.uniform(low=0.0, high=1.0, size=size),
    'Bimodal': bi_modal,
    'Multimodal': multi_modal,
    'Poisson': np.random.poisson(lam=1.0, size=size),
    'Discrete': np.random.choice([10,12,14,15,16,17,20], size=size)
}).round(3)

df.head(3)

### Visualize distributions with boxplot and histogram

In [None]:
for col in df.columns:
    fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(7,7), gridspec_kw={"height_ratios": (.15, .85)})
    sns.boxplot(data=df, x=col, ax=axes[0])
    axes[0].set_xlabel('')
    sns.histplot(data=df, x=col, kde=True, ax=axes[1])
    fig.suptitle(f"{col} Distribution - Boxplot and Histogram")
    plt.show()
    print("\n\n")

### Use pingouin to test if variables are normally distributed
Passing the whole DataFrame runs the test on every numerical column at once. The output has one row per column.

In [None]:
pg.normality(data=df, alpha=0.05)

### Example: Penguins dataset
Check if `bill_length_mm` is normally distributed within each species

In [None]:
df_penguins = sns.load_dataset('penguins')
print(df_penguins.shape)
df_penguins.head(3)

In [None]:
pg.normality(data=df_penguins, dv='bill_length_mm', group='species', alpha=0.05)

In [None]:
# Check normality of bill_length_mm overall
pg.normality(data=df_penguins['bill_length_mm'], alpha=0.05)

### Plot histograms for bill_length_mm
Overall and grouped by species

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15,7))
sns.histplot(data=df_penguins, x='bill_length_mm', kde=True, ax=axes[0])
sns.histplot(data=df_penguins, x='bill_length_mm', hue='species', kde=True, palette='Set2', ax=axes[1])
plt.show()

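## Chi-Squared Test (Test of Independence)
Tests whether two categorical variables are associated by comparing the observed frequencies in a contingency table with the frequencies expected if the variables were independent.

**Hypotheses:**
- Null: The two variables are independent (no difference in frequency/proportion between groups)
- Alternative: The two variables are associated (there is a difference)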
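
For a test of independence between two categorical variables, as applied to the heart disease data in this topic, the expected count in each cell is (row total × column total) / grand total. The sketch below works this out by hand and checks it against `scipy.stats.chi2_contingency`. The 2x2 table of counts is made up for illustration, and `correction=False` turns off the Yates continuity correction so the manual and library values match.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table of counts (rows: group A/B, columns: outcome no/yes)
observed = np.array([[30, 20],
                     [15, 35]])

# Expected counts if rows and columns were independent
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

chi2_scipy, p_value, dof, expected_scipy = stats.chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2_manual:.3f}, dof = {dof}, p = {p_value:.4f}")
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.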

### Example dataset from pingouin on heart disease

In [None]:
df_hd = pg.read_dataset('chi2_independence')
print(df_hd.shape)
df_hd.head()

In [None]:
# Check target value counts
df_hd['target'].value_counts()

### Is fasting blood sugar (fbs) associated with heart disease (target)?

In [None]:
sns.countplot(x='fbs', hue='target', data=df_hd)
plt.title('Fasting Blood Sugar (fbs) by Heart Disease (target)')
plt.show()

### Chi-Squared test for independence
Is fbs independent of target?

In [None]:
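# chi2_independence returns three objects: expected counts, observed counts, and a table of test results
expected, observed, chi2_stats = pg.chi2_independence(data=df_hd, x='fbs', y='target')
chi2_stats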

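### Interpretation:
- Read the `pval` column in the test results (the `pearson` row is the standard Chi-Squared test).
- If p-value < 0.05, we reject the null hypothesis: fbs and target are NOT independent, so fasting blood sugar is associated with heart disease.
- If p-value >= 0.05, we fail to reject the null hypothesis: the data give no evidence of an association between fbs and target.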
