# 2. Statistical Testing & Hypothesis Validation
Once you find insights during EDA, statistical testing serves as the "courtroom trial" for your data: it checks whether your findings are statistically significant or could plausibly be a coincidence (random noise).

## The Foundation: Hypotheses and P-Values ⚖️

- **Null Hypothesis ($H_0$)**: The "innocent until proven guilty" baseline. It assumes there is no relationship, difference, or effect. Any difference seen is due to random chance.
- **Alternative Hypothesis ($H_a$)**: What you are trying to prove. It states there is a real, statistically significant difference or relationship.
- **P-Value 📉**: The probability of observing results at least as extreme as yours if the Null Hypothesis is true (a "coincidence detector").
  - If **$p < 0.05$** (a conventional threshold), you *reject* $H_0$ and treat the effect as statistically significant.
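
To make the p-value definition concrete, here is a minimal sketch of a permutation test. The wait-time values, the variable names, and the choice of 10,000 shuffles are illustrative assumptions. The idea is that if $H_0$ is true, the group labels carry no information, so shuffling them shows how often chance alone produces a gap as large as the observed one.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical wait times in minutes (illustrative values only)
group_a = np.array([4, 5, 3, 4, 2, 4, 3])
group_b = np.array([6, 7, 5, 6, 7, 8, 6])

observed_diff = group_b.mean() - group_a.mean()

# Under H0 the group labels are interchangeable, so shuffle them many times
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_permutations = 10_000
count_extreme = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = shuffled[n_a:].mean() - shuffled[:n_a].mean()
    # Two-sided: count shuffles at least as extreme as the observed gap
    if abs(diff) >= abs(observed_diff) - 1e-12:
        count_extreme += 1

# Add 1 to numerator and denominator so the estimate is never exactly zero
p_value = (count_extreme + 1) / (n_permutations + 1)

print(f"Observed difference in means: {observed_diff:.2f}")
print(f"Permutation p-value: {p_value:.4f}")
```

The printed p-value is the share of shuffles that matched or beat the observed difference, which is the "at least as extreme, assuming $H_0$" probability in simulated form.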

*Business Case Example*: Rolling out a GenAI chatbot versus an old rules-based bot.
- **$H_0$**: The GenAI bot makes no difference in wait times compared to the old bot.
- **$H_a$**: The GenAI bot lowers wait times.
- **Result**: If the calculated $p$-value is $0.02$ (which is $< 0.05$), a drop in wait time at least this large would occur only about 2% of the time if the two bots truly performed the same. We reject $H_0$ and have statistical support for rolling out the new bot! 🚀

## Choosing the Right Test 🧮
Different scenarios require different mathematical tests. Below is a guide on when to use which test and the underlying test statistic formulas:

| Test | When to Use (Variables) | Why / Purpose | Formula / Test Statistic |
| :--- | :--- | :--- | :--- |
| **T-Test** | 1 Categorical (2 groups) vs <br>1 Continuous | Compares the **means of two groups** (e.g., Does Group A spend more than Group B?). | $$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$<br>*(Welch's form: does not assume equal group variances)* |
| **ANOVA** (Analysis of Variance) | 1 Categorical (3+ groups) vs <br>1 Continuous | Compares the **means of three or more groups** at once to see if at least one is different. | $$F = \frac{\text{Variance between groups}}{\text{Variance within groups}}$$ |
| **Chi-Square Test** ($\chi^2$) | 2 Categorical variables | Determines if there is a significant **association between categorical variables** (e.g., Is conversion related to device type?). | $$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$<br>*(O = Observed, E = Expected)* |
| **A/B Testing** | Framework | A randomized experimental framework to compare two versions (A and B) of a variable. Relies heavily on T-Tests or Chi-Square tests to evaluate the results. | *(Relies on the statistical tests above depending on the metric)* |
| **Multivariate Tests** (e.g., MANOVA) | Multiple Continuous vs <br>Categorical | Used when you have **multiple dependent variables** that might be correlated, testing them simultaneously across groups. | $$\Lambda = \frac{|W|}{|T|}$$ *(Wilks' Lambda, comparing within-group and total variance matrices)* |
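
As a quick check on the T-Test formula in the table, the sketch below computes the statistic by hand with NumPy and compares it to SciPy. The sample values are illustrative assumptions. The formula as written uses an unpooled standard error, which corresponds to `equal_var=False` (Welch's t-test) in `scipy.stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (illustrative values only)
x1 = np.array([4, 5, 3, 4, 2, 4, 3], dtype=float)
x2 = np.array([6, 7, 5, 6, 7, 8, 6], dtype=float)

# Sample variances use ddof=1 (divide by n - 1)
s1_sq, s2_sq = x1.var(ddof=1), x2.var(ddof=1)
n1, n2 = len(x1), len(x2)

# Difference in means divided by the unpooled standard error
t_manual = (x1.mean() - x2.mean()) / np.sqrt(s1_sq / n1 + s2_sq / n2)

# equal_var=False selects Welch's t-test, which uses the same standard error
t_scipy, p_scipy = stats.ttest_ind(x1, x2, equal_var=False)

print(f"Manual t: {t_manual:.4f}")
print(f"SciPy t:  {t_scipy:.4f}")
```

The two statistics should agree up to floating-point error. The p-value additionally depends on the degrees of freedom, which SciPy derives for you.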

```python
from scipy import stats
import numpy as np

# Example of calculating a P-value using an Independent T-Test
# We feed the arrays of wait times from two groups into a T-test function

genai_wait_times = np.array([4, 5, 3, 4, 2, 4, 3])
old_bot_wait_times = np.array([6, 7, 5, 6, 7, 8, 6])

# equal_var=False runs Welch's t-test, matching the unpooled formula in the table above.
# The test is two-sided by default; pass alternative="less" to test only for lower wait times.
t_stat, p_value = stats.ttest_ind(genai_wait_times, old_bot_wait_times, equal_var=False)

print(f"T-Statistic: {t_stat:.2f}")
print(f"P-Value: {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant: Reject the Null Hypothesis")
else:
    print("Not statistically significant: Fail to reject the Null Hypothesis")
```
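
The same pattern carries over to the other tests in the table. The sketch below runs a one-way ANOVA and a Chi-Square test of independence with SciPy. All the numbers (a third bot, the device-by-conversion counts) are made-up values for illustration only.

```python
import numpy as np
from scipy import stats

# ANOVA: hypothetical wait times (minutes) for three bot versions
bot_a = [4, 5, 3, 4, 2, 4, 3]
bot_b = [6, 7, 5, 6, 7, 8, 6]
bot_c = [5, 5, 4, 6, 5, 4, 5]
f_stat, p_anova = stats.f_oneway(bot_a, bot_b, bot_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# Chi-Square: hypothetical contingency table of counts
# rows = device type (mobile, desktop), columns = (converted, not converted)
observed = np.array([[30, 170],
                     [45, 155]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-Square: chi2 = {chi2:.2f}, p = {p_chi2:.4f}, dof = {dof}")
print("Expected counts under H0:")
print(expected)
```

A significant ANOVA result only says that at least one group mean differs. It does not say which one, so a follow-up comparison is usually needed.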

## Pre-Modeling Steps: Feature Engineering 🛠️
After validating your insights, you must format the data so machine learning algorithms can digest it. 

**Feature Engineering** involves extracting or creating new columns (features) from your existing data to better highlight underlying patterns for the model.

*Example*: 
A raw timestamp like `"2023-10-27 14:35:00"` is not directly useful to a machine learning model trying to predict queue wait times. Instead, you can extract features such as:
- **Hour of the day** (e.g., `14`): Wait times might spike around lunch or rush hour.
- **Day of the week** (e.g., `Friday`): Traffic might be systematically higher on Fridays.
- **Is Weekend?** (Boolean `True/False`): Helps capture the difference between typical weekday and weekend volume.
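
The three features above can be sketched with pandas as follows. The column names and the second timestamp are assumptions added for illustration. Only the first timestamp comes from the example.

```python
import pandas as pd

# First timestamp is from the example; the second is a hypothetical weekend row
df = pd.DataFrame({"timestamp": ["2023-10-27 14:35:00", "2023-10-28 09:10:00"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Hour of the day (0-23)
df["hour"] = df["timestamp"].dt.hour

# Day of the week as a name, e.g. "Friday"
df["day_of_week"] = df["timestamp"].dt.day_name()

# dayofweek is Monday=0 ... Sunday=6, so 5 and 6 are the weekend
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5

print(df)
```

Most models need `day_of_week` converted to numbers (for example, with one-hot encoding) before training.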