# **Statistical Hypothesis Testing**

Hypothesis testing is a statistical method used to make inferences about a population based on sample data.

## **Key Steps in Hypothesis Testing**
1. **Formulate the null and alternative hypotheses**  
2. **Choose a significance level (alpha)**  
3. **Collect and prepare data**  
4. **Select and calculate the test statistic**  
5. **Determine the p-value**  
6. **Make a decision about the null hypothesis**  
7. **Interpret the results in the context of the original problem**  

The **test statistic** is a numerical value calculated from the sample data during hypothesis testing. It helps determine whether to reject the **null hypothesis (\(H_0\))** by comparing it to a **critical value** or using it to compute a **p-value**.

- It **quantifies the evidence** against the null hypothesis.
- It is computed using a **formula specific to the chosen hypothesis test**.
- Its value helps determine the **statistical significance** of the results.

The **critical value** is a threshold that helps decide whether to reject \(H_0\). It is chosen based on:
- The **significance level (alpha)**, usually set at 0.05 (5%).
- The **statistical test used** (Z-test, T-test, Chi-square test, etc.).
- The **sampling distribution** (Normal, t-distribution, etc.).

If the **test statistic is more extreme than the critical value**, we reject the null hypothesis.
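Critical values do not have to come from printed tables. The sketch below looks them up with SciPy's inverse CDF (`ppf`); the degrees of freedom (`df = 9`) are a hypothetical choice for illustration only.

```python
from scipy import stats

alpha = 0.05  # significance level

# Two-tailed Z-test: alpha is split across both tails of the standard normal
z_critical = stats.norm.ppf(1 - alpha / 2)

# Two-tailed t-test for a hypothetical sample of n = 10 (df = n - 1 = 9)
df = 9
t_critical = stats.t.ppf(1 - alpha / 2, df)

# One-tailed (right-tailed) Z-test: all of alpha sits in a single tail
z_critical_one_tailed = stats.norm.ppf(1 - alpha)

print(f"Two-tailed z critical value:  {z_critical:.3f}")
print(f"Two-tailed t critical value:  {t_critical:.3f} (df = {df})")
print(f"One-tailed z critical value:  {z_critical_one_tailed:.3f}")
```

The t critical value is larger than the z critical value at the same alpha, reflecting the extra uncertainty from estimating the standard deviation in a small sample.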

Imagine you are **checking if a new medicine works**. You test it on a group and compare their results to a known standard.

- **Test Statistic:** It’s like your exam score after taking a test.
- **Critical Value:** It’s like the **passing mark** set by the teacher.
- If your score (test statistic) **exceeds the passing mark (critical value)**, you **reject the idea that the medicine has no effect**.

In hypothesis testing:
- If the test statistic does not reach the critical value, we **fail to reject \(H_0\)** (there is not enough evidence that the medicine has an effect; this is not proof that it does nothing).
- If the test statistic is **far beyond the critical value**, we **reject \(H_0\)** (the medicine has an effect).

## **Types of Test Statistics**

Different hypothesis tests use different test statistics. Some common ones include:

| **Test Statistic** | **Used In** | **Formula** | **Explanation** |
|------------------|------------|------------|----------------|
| **Z-Statistic (Z-Score)** | Z-test (known population variance) | $$ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} $$ | Used when the population standard deviation ($\sigma$) is known. With large samples ($n > 30$) it is also a common approximation when $\sigma$ is estimated from the sample. |
| **T-Statistic (T-Score)** | T-test (unknown population variance) | $$ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} $$ | Used when the population standard deviation is unknown and is estimated by the sample standard deviation ($s$). This matters most for small samples ($n < 30$); for large samples the t- and Z-distributions nearly coincide. |
| **Chi-Square ($\chi^2$) Statistic** | Chi-Square tests (categorical data, variance tests) | $$ \chi^2 = \sum \frac{(O - E)^2}{E} $$ | Compares observed frequencies ($O$) to expected frequencies ($E$) in categorical data. |
| **F-Statistic** | ANOVA (Analysis of Variance) | $$ F = \frac{\text{variance between groups}}{\text{variance within groups}} $$ | Used to compare the means of multiple groups and determine if there is a significant difference. |

Each test statistic helps in determining whether to reject the **null hypothesis $H_0$** in different scenarios.
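To make one of the table's formulas concrete, here is a minimal sketch that computes the chi-square statistic by hand and compares it with `scipy.stats.chisquare`. The die-roll counts are hypothetical data invented for this illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: counts of each face in 60 rolls of a die
observed = np.array([8, 12, 9, 11, 6, 14])

# Under H0 (fair die) every face is expected equally often: 60 / 6 = 10
expected = np.full(observed.size, observed.sum() / observed.size)

# Chi-square statistic straight from the formula: sum((O - E)^2 / E)
chi2_manual = np.sum((observed - expected) ** 2 / expected)

# The same statistic (plus a p-value) from SciPy
chi2_scipy, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

print(f"Manual chi-square: {chi2_manual:.2f}")
print(f"SciPy chi-square:  {chi2_scipy:.2f}, p-value: {p_value:.3f}")
```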

### **How is the Test Statistic Used?**

1. **Calculate the test statistic** using the appropriate formula.
2. **Compare it to the critical value** (from statistical tables) or use it to compute the **p-value**.
3. If the **test statistic falls in the rejection region** or if the **p-value is smaller than the significance level (alpha)**, reject \(H_0\).
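The two decision rules in step 3 always agree for the same test. The sketch below checks this for a two-tailed t-test; the test statistic (`2.5`) and degrees of freedom (`20`) are hypothetical values chosen only to illustrate the comparison.

```python
from scipy import stats

alpha = 0.05
t_statistic = 2.5  # hypothetical test statistic
df = 20            # hypothetical degrees of freedom

# Rule 1: rejection region. Reject if |t| exceeds the two-tailed critical value.
t_critical = stats.t.ppf(1 - alpha / 2, df)
reject_by_critical_value = abs(t_statistic) > t_critical

# Rule 2: p-value. Two-tailed p-value is twice the upper-tail area beyond |t|.
p_value = 2 * stats.t.sf(abs(t_statistic), df)
reject_by_p_value = p_value < alpha

print(f"Critical value: {t_critical:.3f}, reject: {reject_by_critical_value}")
print(f"P-value:        {p_value:.4f}, reject: {reject_by_p_value}")
```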

---

## **Scenario: Exam Score Hypothesis Testing**

We want to test whether the **average exam score** of a class is **75**. A sample of **10 students** is taken, and their scores are:

$$
\{78, 74, 72, 80, 77, 69, 85, 73, 76, 71\}
$$

### **Step 1: Formulate Hypotheses**
- **Null Hypothesis (\(H_0\))**: The true mean exam score is **75**.
- **Alternative Hypothesis (\(H_1\))**: The true mean exam score is **not 75**.

### **Step 2: Compute T-Statistic**
Using the formula:

$$
t = \frac{\bar{X} - \mu}{s / \sqrt{n}}
$$

where:
- $\bar{X}$ = sample mean = **75.5**
- $\mu$ = hypothesized population mean = **75**
- $s$ = sample standard deviation = **4.74**
- $n$ = sample size = **10**

Substituting these values gives $t = \frac{75.5 - 75}{4.74 / \sqrt{10}} \approx \frac{0.5}{1.5}$, so the **calculated T-Statistic** is **0.333**.

### **Step 3: Compute P-Value**
The corresponding two-sided **p-value**, from a t-distribution with $n - 1 = 9$ degrees of freedom, is **0.747**.

### **Step 4: Decision Rule**
- If \( p \)-value < **significance level (alpha = 0.05)**, we reject \(H_0\).
- Since **0.747 > 0.05**, we **fail to reject \(H_0\)**, meaning **there is no strong evidence that the true mean is different from 75**.
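
The exam-score calculation can be reproduced in a few lines. This is a minimal sketch using the ten scores listed above: it computes the t-statistic from the formula and cross-checks it with `scipy.stats.ttest_1samp`, which runs a two-sided one-sample t-test by default.

```python
import numpy as np
from scipy import stats

scores = np.array([78, 74, 72, 80, 77, 69, 85, 73, 76, 71])
mu_0 = 75  # hypothesized population mean under H0

n = scores.size
sample_mean = scores.mean()
sample_std = scores.std(ddof=1)  # ddof=1 gives the sample standard deviation

# t = (sample mean - hypothesized mean) / (s / sqrt(n))
t_manual = (sample_mean - mu_0) / (sample_std / np.sqrt(n))

# Same test via SciPy (two-sided by default)
t_scipy, p_value = stats.ttest_1samp(scores, popmean=mu_0)

print(f"Sample mean: {sample_mean:.2f}, sample std: {sample_std:.2f}")
print(f"t (manual): {t_manual:.3f}, t (SciPy): {t_scipy:.3f}, p-value: {p_value:.3f}")
```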


## **Scenario: Effectiveness of a New Customer Service Training Program**

A telecom company, **"ConnectTel"**, recently introduced a **new customer service training program** for its call center employees. The company’s goal is to **reduce the average call handling time (AHT)** and improve customer satisfaction.

### **Objective**
The management team wants to determine whether employees who underwent the new training program have significantly **lower average call handling times** compared to those who did not receive the training.

### **Data Collection**
To evaluate the effectiveness of the training program, the company collected data from **two groups** of employees:
- **Control Group**: Employees who did **not** undergo the new training.
- **Treatment Group**: Employees who completed the new training program.

The average call handling time (AHT) is measured for both groups.

### **Key Question**
**Does the new training program significantly reduce the average call handling time compared to employees who did not receive the training?**

To answer this question, we will conduct a **t-test** to compare the mean AHT between the two groups.

### **Steps to Simulate Data**

### **1. Assume a Normal Distribution for Call Handling Time**
- For simplicity, we **assume** that call handling times are approximately **normally distributed**. Real call durations are often right-skewed, so this is a modeling simplification for the simulation.
- We generate random values that follow a **normal distribution** using `numpy.random.normal()`.

### **2. Define Mean and Standard Deviation**
- We set **different means** for the **control** and **treatment** groups.
- Example:
  - **Control Group (No Training)** → Mean = **8.5 minutes**, Std Dev = **1.2 minutes**.
  - **Treatment Group (With Training)** → Mean = **7.8 minutes**, Std Dev = **1.1 minutes**.
- The **treatment group** is expected to have a **lower mean** because training improves efficiency. 

### **3. Set Sample Size**
- We choose a **reasonable sample size** (e.g., **30 employees** in each group).
- The **sample size** should be **large enough for statistical validity** but **small enough for practical business data collection**.
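
One hedged way to judge whether 30 employees per group is "large enough" is to estimate by simulation how often a t-test would detect the assumed difference. The sketch below repeats the experiment many times with the means and standard deviations defined above; the number of simulations (`2000`) and the random seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n_simulations = 2000            # arbitrary number of repeated experiments
n_per_group = 30
alpha = 0.05

# Each row is one simulated experiment with 30 employees per group
control = rng.normal(loc=8.5, scale=1.2, size=(n_simulations, n_per_group))
treatment = rng.normal(loc=7.8, scale=1.1, size=(n_simulations, n_per_group))

# Welch's t-test (two-sided) applied row by row
_, p_values = stats.ttest_ind(control, treatment, axis=1, equal_var=False)

# Fraction of experiments in which the difference is detected at level alpha
estimated_power = np.mean(p_values < alpha)
print(f"Estimated chance of detecting the difference: {estimated_power:.2f}")
```

If the estimated detection rate is too low for the business decision at hand, increasing `n_per_group` and rerunning shows how much data would be needed.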

### **4. Generate Random Data**
- We use `numpy.random.normal(loc=mean, scale=std, size=n)` to generate:
  - **Random call handling times** for each group.

In [3]:
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Define parameters for the control group (No Training)
control_mean = 8.5   # Average call handling time (minutes)
control_std = 1.2    # Standard deviation
control_size = 30    # Number of employees

# Define parameters for the treatment group (With Training)
treatment_mean = 7.8  # Expected lower average call handling time
treatment_std = 1.1   # Standard deviation
treatment_size = 30   # Number of employees

# Generate synthetic data
control_group = np.random.normal(loc=control_mean, scale=control_std, size=control_size)
treatment_group = np.random.normal(loc=treatment_mean, scale=treatment_std, size=treatment_size)

# Print first few values to check
print("Control Group Call Handling Times:", control_group[:5])
print("Treatment Group Call Handling Times:", treatment_group[:5])

Control Group Call Handling Times: [ 9.09605698  8.33408284  9.27722625 10.32763583  8.21901595]
Treatment Group Call Handling Times: [7.13812273 9.837506   7.78515305 6.63651798 8.7047994 ]


### **Formulating Hypothesis & Defining Significance Level**

### **1. Formulate Hypotheses**
To determine whether the new customer service training program has a significant effect on **call handling time**, we set up the following hypotheses:

- **Null Hypothesis ($H_0$)**: The training program has **no effect** on call handling time.  
  $$ H_0: \mu_{\text{trained}} = \mu_{\text{untrained}} $$
  (The mean call handling time of trained employees is equal to that of untrained employees.)

- **Alternative Hypothesis ($H_1$)**: The training program **reduces** call handling time.  
  $$ H_1: \mu_{\text{trained}} < \mu_{\text{untrained}} $$
  (The mean call handling time of trained employees is **less than** that of untrained employees.)

This is a **one-tailed (left-tailed) t-test** since we expect the **training to reduce** the call handling time.
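
A left-tailed test can be requested from SciPy directly. The following is a minimal sketch, assuming SciPy 1.6 or later (where `ttest_ind` accepts the `alternative` argument) and regenerating the simulated data with the same seed and parameters so that it runs on its own. The variable names `trained` and `untrained` are illustrative.

```python
import numpy as np
from scipy import stats

# Regenerate the simulated data (same seed, order, and parameters as above)
np.random.seed(42)
untrained = np.random.normal(loc=8.5, scale=1.2, size=30)
trained = np.random.normal(loc=7.8, scale=1.1, size=30)

# Two-sided Welch's t-test (SciPy's default): H1 is "means differ"
t_two_sided, p_two_sided = stats.ttest_ind(trained, untrained, equal_var=False)

# Left-tailed Welch's t-test: H1 is "mean(trained) < mean(untrained)"
t_left, p_left = stats.ttest_ind(
    trained, untrained, equal_var=False, alternative="less"
)

print(f"t-statistic:       {t_left:.3f}")
print(f"Two-sided p-value: {p_two_sided:.4f}")
print(f"Left-tailed p-value: {p_left:.4f}")
```

Because the observed difference lies in the hypothesized direction (negative t-statistic with `trained` first), the left-tailed p-value is exactly half of the two-sided one.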


### **2. Define the Significance Level**
- The **significance level** ($\alpha$) represents the probability of rejecting $H_0$ when it is actually true.
- We set $\alpha = 0.05$ (**5% level of significance**).
- This means that we accept a **5% risk** of concluding that the training program is effective when it actually is not.

If the **p-value** obtained from the t-test is **less than 0.05**, we will **reject $H_0$** and conclude that the training program significantly reduces call handling time.


### **Why Use a T-Test?**

- **We Are Comparing the Means of Two Independent Groups**  
  The control group (employees without training) and the treatment group (employees with training) are **separate, independent groups**. A **t-test** is the appropriate statistical method to compare their mean call handling times.

- **The Sample Size is Limited**  
  The **t-test is preferred over a Z-test** when the sample size is **small or moderate** (e.g., 30 employees per group). Since we **do not have a large sample** and need a test that accounts for variability in small samples, a t-test is the right choice.

- **The Population Standard Deviation is Unknown**  
  A **Z-test requires a known population standard deviation ($\sigma$)**, but in this case, we only have the **sample standard deviation ($s$)**. The **t-test estimates $\sigma$ from the sample**, making it the correct test for this scenario.

In [5]:
import numpy as np  # Import NumPy for numerical computations
import scipy.stats as stats  # Import SciPy's stats module for statistical tests
import matplotlib.pyplot as plt  # Import Matplotlib for plotting
import seaborn as sns  # Import Seaborn for enhanced visualizations

# Perform an independent two-sample t-test (Welch's t-test)
t_statistic, p_value = stats.ttest_ind(control_group, treatment_group, equal_var=False)
# `ttest_ind` compares the means of two independent samples: `control_group` and `treatment_group`.
# `equal_var=False` selects Welch's t-test, an adaptation of Student's t-test that does not assume
# equal population variances. Real-world groups often differ in variance, so this is the safer choice.
# Note: `ttest_ind` is two-sided by default, so `p_value` here is a two-sided p-value.
# For the one-sided H₁ stated above, pass `alternative='greater'` (SciPy 1.6+, control group first);
# this halves the p-value when the t-statistic is positive.
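
To show what Welch's test computes, the sketch below rebuilds the t-statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand and compares them with `ttest_ind`. It regenerates the simulated data with the same seed so that it runs on its own; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Regenerate the simulated data (same seed, order, and parameters as above)
np.random.seed(42)
control = np.random.normal(8.5, 1.2, 30)
treatment = np.random.normal(7.8, 1.1, 30)

# Squared standard error of each group's mean: s^2 / n
se2_control = control.var(ddof=1) / control.size
se2_treatment = treatment.var(ddof=1) / treatment.size

# Welch's t-statistic: difference in means over the combined standard error
t_manual = (control.mean() - treatment.mean()) / np.sqrt(se2_control + se2_treatment)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch = (se2_control + se2_treatment) ** 2 / (
    se2_control ** 2 / (control.size - 1) + se2_treatment ** 2 / (treatment.size - 1)
)

# Two-sided p-value from the t-distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

t_scipy, p_scipy = stats.ttest_ind(control, treatment, equal_var=False)
print(f"Manual: t = {t_manual:.4f}, df = {df_welch:.1f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```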

In [6]:
# Define significance level (commonly set to 0.05)
alpha = 0.05  

# Determine if we reject the null hypothesis based on the p-value.
# `p_value` is two-sided, so also check the direction: a positive t-statistic means
# the control group's mean handling time is higher than the treatment group's.
decision = "Reject H₀ (Training significantly reduces call handling time)" if (p_value < alpha and t_statistic > 0) else "Fail to reject H₀ (No significant reduction in call handling time)"

# If the p-value is below alpha and the difference is in the expected direction, we reject H₀.
# Otherwise, we fail to reject H₀: there is no statistically significant reduction.

In [7]:
# Compute 95% Confidence Intervals for both groups

conf_interval_control = stats.t.interval(
    0.95,  # 95% confidence level: intervals built this way cover the true mean in about 95% of samples
    df=len(control_group)-1,  # Degrees of freedom (sample size - 1)
    loc=np.mean(control_group),  # Sample mean: the average value of the control group
    scale=stats.sem(control_group)  # Standard error of the mean (SEM): measures variability of sample mean
)

conf_interval_treatment = stats.t.interval(
    0.95,  # 95% confidence level
    df=len(treatment_group)-1,  # Degrees of freedom
    loc=np.mean(treatment_group),  # Sample mean for the treatment group
    scale=stats.sem(treatment_group)  # Standard error of the mean for treatment group
)

# The function `stats.t.interval` calculates the range (lower, upper bounds) within which we expect 
# the true mean of each group to lie, based on the t-distribution.

Confidence intervals (CIs) express the uncertainty in a sample estimate by giving a range within which the true population mean is likely to fall. For example, instead of saying "the average call handling time is 10 minutes," a CI lets us say "we are 95% confident that the true average is between 9.5 and 10.5 minutes."

A p-value only tells us whether a difference is statistically significant; it does not tell us how large or important the difference is. A CI shows the plausible range of values and so gives a better sense of the effect size.

The intervals computed above describe each group's mean separately. To assess the difference itself, we look at the CI for the **difference between the two means**: if that interval does not include 0 (or whatever difference $H_0$ specifies), we reject the null hypothesis at the corresponding significance level. If it does include 0, we fail to reject the null hypothesis, because the data are compatible with no difference.
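
A confidence interval for the difference in means can be built from the same ingredients as Welch's t-test. The following is a minimal sketch under the same simulation settings (data regenerated with the same seed so the block runs on its own); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Regenerate the simulated data (same seed, order, and parameters as above)
np.random.seed(42)
control = np.random.normal(8.5, 1.2, 30)
treatment = np.random.normal(7.8, 1.1, 30)

# Point estimate: how much longer untrained employees take on average
diff = control.mean() - treatment.mean()

# Standard error of the difference and Welch-Satterthwaite degrees of freedom
se2_control = control.var(ddof=1) / control.size
se2_treatment = treatment.var(ddof=1) / treatment.size
se_diff = np.sqrt(se2_control + se2_treatment)
df_welch = (se2_control + se2_treatment) ** 2 / (
    se2_control ** 2 / (control.size - 1) + se2_treatment ** 2 / (treatment.size - 1)
)

# 95% CI: estimate +/- critical t value * standard error
t_critical = stats.t.ppf(0.975, df_welch)
ci_lower = diff - t_critical * se_diff
ci_upper = diff + t_critical * se_diff

print(f"Difference in means: {diff:.2f} minutes")
print(f"95% CI for the difference: ({ci_lower:.2f}, {ci_upper:.2f})")
print("CI excludes 0:", ci_lower > 0 or ci_upper < 0)
```

An interval that excludes 0 corresponds to a two-sided Welch test that rejects $H_0$ at the 5% level, and the width of the interval shows how precisely the size of the reduction is known.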

In [8]:
results = {
    "T-Statistic": t_statistic,
    "P-Value": p_value,
    "Decision": decision,
    "Control Group 95% CI": conf_interval_control,
    "Treatment Group 95% CI": conf_interval_treatment
}

In [9]:
results

{'T-Statistic': np.float64(2.23552670033675),
 'P-Value': np.float64(0.029257824976781182),
 'Decision': 'Reject H₀ (Training significantly reduces call handling time)',
 'Control Group 95% CI': (np.float64(7.870942217242202),
  np.float64(8.677505232715308)),
 'Treatment Group 95% CI': (np.float64(7.284274080601427),
  np.float64(8.049168484760703))}

# **Interpretation of Statistical Findings**

## **T-Statistic ($2.24$)**
- The **T-statistic** quantifies the **difference between the two groups** relative to the **variability in the data** (the standard error of the difference).
- A **higher absolute T-value** indicates **stronger evidence** of a difference between groups.
- Here, $T = 2.24$ means the observed difference in mean call handling time is about 2.24 standard errors away from zero, which is more than random sampling variation alone would typically produce.
- The positive sign (control minus treatment) shows that untrained employees had the higher mean handling time. The t-statistic alone does not measure practical importance; the size of the difference in minutes does.

---

## **P-Value ($0.029$)**
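- The **P-value** is the **probability** of observing a difference at least this large **if there were no real effect** (i.e., if the training had no impact).
- Since **$p = 0.029 < \alpha = 0.05$**, we **reject the null hypothesis** ($H_0$).
- **Conclusion:** The training program **significantly reduced call handling time**. If the training had no effect, a difference this large would be expected in only about 2.9% of samples. The reported value is two-sided (SciPy's default); because the observed difference is in the hypothesized direction, the one-sided p-value for $H_1: \mu_{\text{trained}} < \mu_{\text{untrained}}$ is half of that, about 0.015, so the conclusion is unchanged.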

---

## **Confidence Intervals (95% CI)**
- **Control Group (No Training):** The mean call handling time is estimated to lie between **$7.87$ and $8.68$ minutes**.
- **Treatment Group (With Training):** The mean call handling time is estimated to lie between **$7.28$ and $8.05$ minutes**.

**Interpretation:**  
- The two intervals overlap slightly (the treatment group's upper bound of 8.05 exceeds the control group's lower bound of 7.87). Overlapping per-group intervals do not imply a non-significant difference: the test is based on the standard error of the difference between the means, which is smaller than the sum of the two individual margins of error. The p-value shows that the difference is statistically significant, and the slight overlap does not contradict that conclusion.


## **Actionable Business Recommendations**

## 📌 **1. Expand the Training Program to More Employees**
- The data confirms that **trained employees handle calls faster** than untrained employees.
- **Recommendation:** Scale up the training to **all call center employees** to improve operational efficiency.


## 📌 **2. Set an Expected Performance Benchmark**
- In this sample, employees who completed training have an **average handling time of ~7.7 minutes**, compared to **~8.3 minutes** for untrained employees (the midpoints of the confidence intervals above).
- **Recommendation:** Set a **performance benchmark of ~7.7 minutes per call** for employees post-training.


## 📌 **3. Quantify Business Impact**
- Calculate the **time and cost savings** from the reduced call handling time.
- **Example Calculation:**
  - If each employee **handles 50 calls per day**,
  - A **reduction of ~0.6 minutes per call** (the difference between the two sample means, roughly 8.3 minus 7.7),
  - **Saves ~30 minutes per employee daily**.
- **Recommendation:** Use these insights to **optimize workforce allocation** and **reduce operational costs**.
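
The example calculation above can be wrapped in a small helper so that other assumptions are easy to try. In this sketch, `daily_minutes_saved` is a hypothetical function name, and the team size of 30 is an assumption used only to show how the figure scales.

```python
def daily_minutes_saved(calls_per_day, minutes_saved_per_call):
    """Minutes saved per employee per day for a given per-call reduction."""
    return calls_per_day * minutes_saved_per_call


# Figures from the example calculation: 50 calls per day, ~0.6 minutes saved per call
per_employee_minutes = daily_minutes_saved(calls_per_day=50, minutes_saved_per_call=0.6)

# Hypothetical team size, for illustration only
team_size = 30
team_hours_per_day = per_employee_minutes * team_size / 60

print(f"Saved per employee per day: {per_employee_minutes:.0f} minutes")
print(f"Saved per day for a team of {team_size}: {team_hours_per_day:.1f} hours")
```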


## 📌 **4. Estimate Cost Savings from Reduced Call Handling Time**
- If each trained agent **reduces call time by 0.6 minutes per call**, this could lead to:
  - **More calls handled per hour**, increasing overall productivity.
  - **Reduced labor costs** by optimizing call center workforce needs.
  - **Faster customer service**, leading to **higher customer satisfaction**.


## 📌 **5. Conduct Follow-Up Evaluation After Full Implementation**
- The study was conducted with a **limited sample size** (30 employees per group).
- **Recommendation:** After implementing the training across all employees, conduct a **long-term evaluation** to:
  - **Validate improvements** in efficiency.
  - **Measure the financial impact** of reduced call handling times.
  - **Assess whether refresher training or ongoing coaching is needed** to sustain improvements.
- Establish a **tracking system** to ensure **improvements persist over time**.


## 📌 **6. Consider Optimizing Other Performance Metrics**
- While call time reduction is positive, it should **not come at the expense of** other important business metrics, in particular the **quality of service**.
- **Recommendation:** Track **customer satisfaction, issue resolution rates, and first-call resolution (FCR)** alongside handling time to verify that **faster handling does not compromise service quality**.