# 📊 Notebook 4: Hypothesis Testing

**Project:** Telco Customer Churn Analysis  
**Goal:** Statistically validate the relationships identified in EDA using T-tests, ANOVA, and Chi-Square tests.

---

## 1. Imports and Data Loading

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats

df = pd.read_csv('../data/processed/telco_churn_cleaned.csv')

## 2. Hypothesis 1: Monthly Charges

**Null Hypothesis ($H_0$):** The mean monthly charge is the same for churned and retained customers: $\mu_{\text{churn}} = \mu_{\text{retained}}$.  
**Alternative Hypothesis ($H_1$):** The means differ: $\mu_{\text{churn}} \neq \mu_{\text{retained}}$.

We use **Welch's two-sample t-test**, which does not assume equal variances in the two groups.
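
For reference, the statistic behind Welch's test can be written out as below. The notation is introduced here and is not from the notebook: $\bar{x}_i$, $s_i^2$, and $n_i$ are the sample mean, sample variance, and sample size of group $i$, with group 1 the churned customers and group 2 the retained customers.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

Here $\nu$ is the Welch-Satterthwaite approximation to the degrees of freedom used to turn $t$ into a p-value.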

In [2]:
churn_charges = df[df['Churn']=='Yes']['MonthlyCharges']
retained_charges = df[df['Churn']=='No']['MonthlyCharges']

t_stat, p_val = stats.ttest_ind(churn_charges, retained_charges, equal_var=False)

print(f"Average Charge (Churn): ${churn_charges.mean():.2f}")
print(f"Average Charge (Retained): ${retained_charges.mean():.2f}")
print(f"Difference: ${churn_charges.mean() - retained_charges.mean():.2f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.4e}")

if p_val < 0.05:
    print("Result: REJECT Null Hypothesis. There is a significant difference.")
else:
    print("Result: FAIL TO REJECT Null Hypothesis.")

Average Charge (Churn): $74.44
Average Charge (Retained): $61.27
Difference: $13.18
T-statistic: 18.4075
P-value: 8.5924e-73
Result: REJECT Null Hypothesis. There is a significant difference.


**Conclusion:** Churned customers pay significantly more on average (about $13.18 more per month). This is consistent with price sensitivity being a factor in churn.
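
The p-value says the difference is unlikely to be zero, but not how large it is relative to the spread of monthly charges. One common way to express that is Cohen's d. The sketch below is a minimal illustration using the pooled standard deviation; `cohens_d` is a hypothetical helper name, and the numbers passed to it are toy data, not values from the Telco dataset.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled sample std (ddof=1)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy data for illustration only (not taken from the Telco dataset).
example_d = cohens_d([2, 4, 6], [1, 3, 5])
print(f"Cohen's d: {example_d:.2f}")  # prints "Cohen's d: 0.50"
```

In the notebook itself, the call would presumably be `cohens_d(churn_charges, retained_charges)`, using the two Series defined in the cell above.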

## 3. Hypothesis 2: Contract Type (Chi-Square Test of Independence)

**$H_0$:** Contract type and Churn are independent.  
**$H_1$:** There is a relationship between Contract type and Churn.

In [3]:
# Create contingency table
contingency_table = pd.crosstab(df['Contract'], df['Churn'])
print("Contingency Table:")
print(contingency_table)

# Chi-square test
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)

print(f"\nChi2 Statistic: {chi2:.4f}")
print(f"P-value: {p:.4e}")

if p < 0.05:
    print("Result: REJECT Null Hypothesis. Contract type and Churn are dependent.")

Contingency Table:
Churn             No   Yes
Contract                  
Month-to-month  2220  1655
One year        1307   166
Two year        1647    48

Chi2 Statistic: 1184.5966
P-value: 5.8630e-258
Result: REJECT Null Hypothesis. Contract type and Churn are dependent.
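
To see what the test is doing, the statistic can be recomputed by hand from the printed table. The sketch below hard-codes the observed counts from the output above, builds the counts expected under independence (row total times column total, divided by the grand total), and sums the squared deviations. The variable names are illustrative. Because `chi2_contingency` only applies its continuity correction when there is one degree of freedom, the manual value for this 3x2 table should agree with the statistic reported above.

```python
import numpy as np

# Observed counts copied from the contingency table above
# (rows: Month-to-month, One year, Two year; columns: No, Yes).
observed = np.array([
    [2220, 1655],
    [1307, 166],
    [1647, 48],
], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n_total = observed.sum()

# Expected count per cell if Contract and Churn were independent.
expected_manual = row_totals @ col_totals / n_total

# Pearson chi-square statistic and its degrees of freedom.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(np.round(expected_manual, 1))
print(f"Chi2 (manual): {chi2_manual:.4f}, dof: {dof_manual}")
```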


**Calculating Effect Size (Cramér's V):**  
With a large sample, even a weak association can reach statistical significance. Effect size tells us how *strong* the relationship is.

In [4]:
n = contingency_table.sum().sum()
min_dim = min(contingency_table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))

print(f"Cramér's V: {cramers_v:.4f}")

Cramér's V: 0.4101


**Interpretation:**  
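- V ≈ 0.41 indicates a **moderate-to-strong** association. For a table whose smaller dimension is 2, Cohen's benchmarks put 0.3 at medium and 0.5 at large.
- Contract type is likely an important predictor of churn.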
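
The direction of the association is easier to read from churn rates per contract type than from V alone. The sketch below recomputes them from the counts in the contingency table above; the dictionary names are illustrative.

```python
# Counts copied from the contingency table above: (retained, churned).
counts = {
    "Month-to-month": (2220, 1655),
    "One year": (1307, 166),
    "Two year": (1647, 48),
}

# Share of customers in each contract type who churned.
churn_rates = {
    contract: churned / (retained + churned)
    for contract, (retained, churned) in counts.items()
}

for contract, rate in churn_rates.items():
    print(f"{contract:<15} {rate:.1%}")
```

This gives roughly 43% churn for month-to-month contracts, 11% for one-year contracts, and 3% for two-year contracts. With the DataFrame loaded, `pd.crosstab(df['Contract'], df['Churn'], normalize='index')` should produce the same proportions directly.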

## 4. Hypothesis 3: Internet Service & Tenure (ANOVA)

Do customers with different Internet Services (DSL, Fiber, No) have different average tenures?

**$H_0$:** $\mu_{\text{DSL}} = \mu_{\text{Fiber optic}} = \mu_{\text{No}}$, where $\mu$ is the mean tenure of each group.  
**$H_1$:** At least one group mean differs from the others.

In [5]:
dsl = df[df['InternetService']=='DSL']['tenure']
fiber = df[df['InternetService']=='Fiber optic']['tenure']
no_net = df[df['InternetService']=='No']['tenure']

f_stat, p_val = stats.f_oneway(dsl, fiber, no_net)

print(f"ANOVA F-statistic: {f_stat:.4f}")
print(f"P-value: {p_val:.4e}")

if p_val < 0.05:
    print("Result: REJECT H0. There are significant differences in tenure.")

ANOVA F-statistic: 5.3897
P-value: 4.5824e-03
Result: REJECT H0. There are significant differences in tenure.
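
As with the chi-square test, a significant F-statistic on a large sample does not say how much of the variation in tenure the grouping explains. Eta-squared (between-group sum of squares divided by total sum of squares) is one common measure. The sketch below is a minimal illustration; `eta_squared` is a hypothetical helper name, and the inputs are toy data rather than the Telco tenures.

```python
import numpy as np

def eta_squared(*groups):
    """Eta-squared for a one-way design: SS_between / SS_total."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Between-group sum of squares: group size times squared offset of its mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Toy data for illustration only (not the Telco tenures).
example_eta2 = eta_squared([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f"Eta-squared: {example_eta2:.2f}")  # prints "Eta-squared: 0.50"
```

In the notebook, the call would presumably be `eta_squared(dsl, fiber, no_net)`. If all 7,043 rows from the contingency table enter the ANOVA, the reported F of 5.39 with 2 and 7,040 degrees of freedom corresponds to an eta-squared of roughly 0.0015, so the effect is statistically significant but very small.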


### Post-hoc Test (Tukey's HSD)
Which specific groups are different?

In [6]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(endog=df['tenure'], groups=df['InternetService'], alpha=0.05)
print(tukey)

     Multiple Comparison of Means - Tukey HSD, FWER=0.05      
   group1      group2   meandiff p-adj   lower   upper  reject
--------------------------------------------------------------
        DSL Fiber optic   0.0964 0.9885 -1.4646  1.6574  False
        DSL          No  -2.2744 0.0128  -4.155 -0.3938   True
Fiber optic          No  -2.3708 0.0057 -4.1704 -0.5712   True
--------------------------------------------------------------


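**Result:** Only the comparisons involving the 'No' group are significant. Customers without internet service have a mean tenure about 2.3 months shorter than both 'DSL' customers (p-adj = 0.0128) and 'Fiber optic' customers (p-adj = 0.0057). 'DSL' and 'Fiber optic' customers do not differ significantly (meandiff = 0.0964, p-adj = 0.9885).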