
# 📘 Comprehensive Hypothesis Testing Notebook

This notebook walks through a wide range of hypothesis tests using the seaborn `tips` dataset and simulated before/after measurements. Each section includes:
- A brief explanation of the test.
- The implementation using Python.
- Interpretation of the result.


In [1]:
# Hypothesis Testing - All Major Types

# Setup

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.formula.api import ols
import statsmodels.api as sm
import os

| **Dataset** | **Structure**                   | **Purpose**                                                                               | **Typical Use**                                                             |
| ----------- | ------------------------------- | ----------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| `sim_df`    | DataFrame (from `sim_tips.csv`) | The seaborn `tips` dataset saved to CSV, with an added binary `high_tip` column (tip > $5). | Used for tests on means, proportions, and variances, plus ANOVA and regression. |
| `before`    | Series (from `before.csv`)      | Simulated measurements taken **before** an event, treatment, or change.                   | Used for **paired comparison** with `after`, e.g., **paired t-test**.       |
| `after`     | Series (from `after.csv`)       | Simulated measurements taken **after** an event, treatment, or change.                    | Compared with `before` to check for **significant difference/improvement**. |


In [3]:

# Create the output directory if it does not exist yet
os.makedirs('/mnt/data', exist_ok=True)

tips = sns.load_dataset("tips")
# Creating a binary column 'high_tip' for testing proportions (e.g., tip > $5)
tips['high_tip'] = (tips['tip'] > 5).astype(int)
tips.to_csv("/mnt/data/sim_tips.csv", index=False)



In [4]:
np.random.seed(42)
before = np.random.normal(loc=50, scale=10, size=30)
after = before + np.random.normal(loc=5, scale=5, size=30)  # assumed improvement

pd.DataFrame({'before': before}).to_csv("/mnt/data/before.csv", index=False)
pd.DataFrame({'after': after}).to_csv("/mnt/data/after.csv", index=False)

In [5]:
# Loading dataset
sim_df = pd.read_csv("/mnt/data/sim_tips.csv")
before = pd.read_csv("/mnt/data/before.csv")['before']
after = pd.read_csv("/mnt/data/after.csv")['after']

In [6]:
def basic_summary(df):
    summary = pd.DataFrame(df.dtypes, columns=['Data Type']).reset_index().rename(columns={'index': 'Feature'})
    summary['Num of Nulls'] = df.isnull().sum().values
    summary['Num of Unique'] = df.nunique().values
    return summary

In [7]:
# Converting Series to DataFrames with column names for compatibility
before_df = before.to_frame()
after_df = after.to_frame()

In [8]:
# Display summaries
print("Summary for sim_df:")
display(basic_summary(sim_df))

print("Summary for before:")
display(basic_summary(before_df))

print("Summary for after:")
display(basic_summary(after_df))

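Summary for sim_df:

|   | Feature    | Data Type | Num of Nulls | Num of Unique |
| - | ---------- | --------- | ------------ | ------------- |
| 0 | total_bill | float64   | 0            | 229           |
| 1 | tip        | float64   | 0            | 123           |
| 2 | sex        | object    | 0            | 2             |
| 3 | smoker     | object    | 0            | 2             |
| 4 | day        | object    | 0            | 4             |
| 5 | time       | object    | 0            | 2             |
| 6 | size       | int64     | 0            | 6             |
| 7 | high_tip   | int64     | 0            | 2             |

Summary for before:

|   | Feature | Data Type | Num of Nulls | Num of Unique |
| - | ------- | --------- | ------------ | ------------- |
| 0 | before  | float64   | 0            | 30            |

Summary for after:

|   | Feature | Data Type | Num of Nulls | Num of Unique |
| - | ------- | --------- | ------------ | ------------- |
| 0 | after   | float64   | 0            | 30            |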



# 1. Tests for Means

# One-Sample t-test


*   # ✅ Purpose:

---
- The one-sample t-test is used to test whether the mean of a single sample differs significantly from a known or hypothesized population mean, when the population standard deviation is unknown.




In [9]:
t_stat, p_val = stats.ttest_1samp(sim_df['tip'], popmean=3.0)
print(f"One-sample t-test: t-stat={t_stat:.3f}, p-value={p_val:.3f}")

One-sample t-test: t-stat=-0.019, p-value=0.985


- Since p-value = 0.985 > 0.05, we fail to reject the null hypothesis.
- ## 🔍 Inference:
 There is no statistically significant difference between the sample mean tip and the hypothesized population mean of $3.00.





# One-Sample z-test


*   # ✅ Purpose:
---
- The one-sample z-test is used to compare a sample mean to a known population mean when the population standard deviation is known, or when the sample size is large (n > 30) and the Central Limit Theorem applies, allowing use of the sample standard deviation as an approximation.

In [10]:
# One-Sample z-test (approximation with large sample)
mean_tip = sim_df['tip'].mean()
std_tip = sim_df['tip'].std()
n = len(sim_df)
z = (mean_tip - 3.0) / (std_tip / np.sqrt(n))
p = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"One-sample z-test: z-stat={z:.3f}, p-value={p:.3f}")

One-sample z-test: z-stat=-0.019, p-value=0.984


- Since p-value = 0.984 > 0.05, we fail to reject the null hypothesis.

- # 🔍 Inference:
There is no significant difference between the sample mean tip and the hypothesized value of $3.00, consistent with the result from the t-test.
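
The t-test and the large-sample z-test above share the same statistic; only the reference distribution differs, which is why the two p-values are nearly identical at this sample size. A minimal sketch on synthetic data (the sample below is generated for illustration and is not the `tip` column):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=1.4, size=244)  # synthetic stand-in data
mu0 = 3.0                                          # hypothesized mean

# Shared statistic: (mean - mu0) / (s / sqrt(n)), with s the sample std (ddof=1)
n = sample.size
stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

p_z = 2 * stats.norm.sf(abs(stat))          # normal reference distribution
p_t = 2 * stats.t.sf(abs(stat), df=n - 1)   # Student t with n - 1 degrees of freedom

# Cross-check against SciPy's one-sample t-test
t_check = stats.ttest_1samp(sample, popmean=mu0)
print(f"statistic={stat:.4f}, p_z={p_z:.4f}, p_t={p_t:.4f}")
```

The t-based p-value is never smaller than the z-based one, because the t distribution has heavier tails; the gap shrinks as n grows.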

- # Independent Two-Sample t-test:
Compares the means of two independent groups (e.g., male vs female tips).

In [11]:
# Independent Two-Sample t-test
male_tips = sim_df[sim_df['sex'] == 'Male']['tip']
female_tips = sim_df[sim_df['sex'] == 'Female']['tip']
t_stat, p_val = stats.ttest_ind(male_tips, female_tips)
print(f"Two-sample t-test: t-stat={t_stat:.3f}, p-value={p_val:.3f}")

Two-sample t-test: t-stat=1.388, p-value=0.166


- Interpretation: The difference in average tips between male and female customers is not statistically significant. We cannot say one gender gives higher tips than the other (p > 0.05).

- # Paired Sample t-test:
Compares means from the same group at two different times (e.g., before vs after).

In [12]:
# Paired Sample t-test
paired_t, paired_p = stats.ttest_rel(before, after)
print(f"Paired t-test: t-stat={paired_t:.3f}, p-value={paired_p:.3f}")

Paired t-test: t-stat=-5.170, p-value=0.000


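- Interpretation: There is a statistically significant difference between the "before" and "after" values (p < 0.001). The negative t-statistic means the "after" values are higher on average, which is expected because the data were simulated with an average improvement of 5. With real data, a paired test shows that a change occurred, not what caused it.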
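
A paired t-test is equivalent to a one-sample t-test on the pairwise differences, which can make the result easier to reason about. A small sketch, assuming synthetic `before_demo` and `after_demo` arrays (hypothetical names, generated here rather than loaded from the CSV files):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before_demo = rng.normal(50, 10, size=30)              # synthetic baseline
after_demo = before_demo + rng.normal(5, 5, size=30)   # synthetic follow-up

# Paired test on the two columns
paired = stats.ttest_rel(before_demo, after_demo)

# Same test expressed as a one-sample test on the differences against 0
differences = before_demo - after_demo
one_sample = stats.ttest_1samp(differences, popmean=0.0)

mean_change = (after_demo - before_demo).mean()
print(f"paired t={paired.statistic:.3f}, one-sample t={one_sample.statistic:.3f}")
print(f"mean change (after - before)={mean_change:.3f}")
```

Reporting the mean change alongside the p-value makes the direction and size of the effect explicit.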

- # Two-sample z-test:
Similar to a t-test but used when population variance is known or sample size is large.

In [13]:
# Two-sample z-test
n1, n2 = len(male_tips), len(female_tips)
m1, m2 = male_tips.mean(), female_tips.mean()
s1, s2 = male_tips.std(), female_tips.std()
z = (m1 - m2) / np.sqrt(s1**2/n1 + s2**2/n2)
p = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"Two-sample z-test: z-stat={z:.3f}, p-value={p:.3f}")

Two-sample z-test: z-stat=1.490, p-value=0.136


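- Interpretation: No significant difference was found in mean tips between male and female customers (p = 0.136 > 0.05). The statistic differs from the t-test's 1.388 because this z-test uses unpooled group variances, whereas `stats.ttest_ind` pools them by default. As with the t-test, failing to reject the null hypothesis is not proof that sex has no effect on tipping.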
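
The gap between the t-statistic (1.388) and the z-statistic (1.490) comes from how the standard error is computed. The sketch below, on synthetic groups with unequal sizes and spreads (illustrative values, not the tips data), shows that the unpooled z-statistic matches Welch's t-statistic, which SciPy exposes through `equal_var=False`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(3.1, 1.5, size=150)   # synthetic group with larger spread
group_b = rng.normal(2.8, 1.1, size=90)    # synthetic group with smaller spread

pooled = stats.ttest_ind(group_a, group_b)                   # default: equal_var=True
welch = stats.ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test

# Unpooled standard error, as used in the two-sample z-test above
se_unpooled = np.sqrt(group_a.var(ddof=1) / group_a.size
                      + group_b.var(ddof=1) / group_b.size)
z_manual = (group_a.mean() - group_b.mean()) / se_unpooled

print(f"pooled t={pooled.statistic:.3f}, Welch t={welch.statistic:.3f}, z={z_manual:.3f}")
```

When group variances or sizes differ noticeably, Welch's version is usually the safer default.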


# 2. Tests for Proportions


- # One-Proportion z-test:
Tests if a single group's proportion equals a hypothesized value.

In [14]:

count = sim_df['high_tip'].sum()
nobs = len(sim_df)
stat, pval = proportions_ztest(count, nobs, value=0.5)
print(f"One-proportion z-test: z-stat={stat:.3f}, p-value={pval:.3f}")

One-proportion z-test: z-stat=-25.471, p-value=0.000


- Interpretation: The observed proportion of high tips (tips above $5, as defined by `high_tip`) is significantly different from the hypothesized 0.5 (50%). There is very strong evidence against the 50% claim.
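
The size of the z-statistic depends on which proportion is used in the standard error. The printed value appears consistent with a standard error based on the sample proportion rather than on the hypothesized 0.5. The sketch below computes both variants by hand; the count of 18 out of 244 is an assumption chosen because it roughly reproduces the printed statistic, not a value read from the dataset here.

```python
import numpy as np
from scipy import stats

count, nobs, p0 = 18, 244, 0.5   # assumed counts (see note above), hypothesized proportion
p_hat = count / nobs

# Variant 1: standard error from the sample proportion
se_sample = np.sqrt(p_hat * (1 - p_hat) / nobs)
z_sample = (p_hat - p0) / se_sample

# Variant 2: standard error from the null proportion
se_null = np.sqrt(p0 * (1 - p0) / nobs)
z_null = (p_hat - p0) / se_null

p_sample = 2 * stats.norm.sf(abs(z_sample))
p_null = 2 * stats.norm.sf(abs(z_null))
print(f"sample-variance z={z_sample:.3f}, null-variance z={z_null:.3f}")
```

Both variants reject the 50% hypothesis decisively in this case, but they can disagree when the observed proportion is closer to the hypothesized one.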

- # Two-Proportion z-test:
Compares proportions between two groups (e.g., smokers vs non-smokers).

In [15]:
# Two-Proportion z-test
smoker_counts = sim_df.groupby('smoker')['high_tip'].sum()
smoker_n = sim_df.groupby('smoker')['high_tip'].count()
stat, pval = proportions_ztest(count=smoker_counts, nobs=smoker_n)
print(f"Two-proportion z-test: z-stat={stat:.3f}, p-value={pval:.3f}")

Two-proportion z-test: z-stat=0.434, p-value=0.664


- Interpretation: There is no significant difference in the proportion of high tips (tips above $5) between smokers and non-smokers (p = 0.664 > 0.05).

# 3. Tests for Variances
- # F-test:
Compares variances between two groups.

In [16]:

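# Variance-ratio F-test: male variance in the numerator, so F > 1 means male tips vary more
f_stat = male_tips.var() / female_tips.var()
df1, df2 = len(male_tips)-1, len(female_tips)-1
# Upper-tail (one-sided) p-value: the alternative is that the male variance is larger
p_val = 1 - stats.f.cdf(f_stat, df1, df2)
print(f"F-test: F={f_stat:.3f}, p-value={p_val:.3f}")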

F-test: F=1.649, p-value=0.006


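- Interpretation: With F = 1.649 and a one-sided p-value of 0.006, tips from male customers show significantly greater variance than tips from female customers. This p-value is upper-tail only; a two-sided test of "the variances differ" would double it (about 0.012), which is still below 0.05. The variance-ratio F-test is sensitive to non-normality, so treat the result with caution if the tip amounts are skewed.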
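
For a two-sided conclusion, the tail probability can be doubled, and Levene's test is a common alternative that is less sensitive to non-normal data. A minimal sketch on synthetic groups (the arrays `wide` and `narrow` are hypothetical, generated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
wide = rng.normal(0, 2.0, size=100)    # synthetic group with larger spread
narrow = rng.normal(0, 1.0, size=80)   # synthetic group with smaller spread

f_stat = wide.var(ddof=1) / narrow.var(ddof=1)
df1, df2 = wide.size - 1, narrow.size - 1

p_upper = stats.f.sf(f_stat, df1, df2)   # one-sided: numerator variance is larger
p_lower = stats.f.cdf(f_stat, df1, df2)  # one-sided: numerator variance is smaller
p_two_sided = min(1.0, 2 * min(p_upper, p_lower))

# Levene's test (median-centered by default in SciPy) as a more robust check
levene = stats.levene(wide, narrow)
print(f"F={f_stat:.3f}, two-sided p={p_two_sided:.4g}, Levene p={levene.pvalue:.4g}")
```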

- # Chi-square test for variance:
Tests if the sample variance matches a known population variance.

In [17]:
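# Chi-square test for variance
sample_var = sim_df['tip'].var()
n = len(sim_df)
# Test statistic: (n - 1) * s^2 / sigma0^2, with sigma0^2 = 1 assumed for illustration
chi2_stat = (n - 1) * sample_var / 1.0
# Upper-tail (one-sided) p-value: the alternative is that the true variance exceeds 1
p_val = 1 - stats.chi2.cdf(chi2_stat, df=n-1)
print(f"Chi-square test: Chi2={chi2_stat:.3f}, p-value={p_val:.3f}")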

Chi-square test: Chi2=465.212, p-value=0.000


- Interpretation: The sample variance (465.212 / 243 ≈ 1.91) is significantly larger than the assumed population variance of 1 (upper-tail p < 0.001). The value 1 is an arbitrary reference chosen for illustration.

# 4. Tests for Association / Independence

- # Chi-square Test of Independence:
 Tests whether two categorical variables are associated.

In [18]:

contingency = pd.crosstab(sim_df['sex'], sim_df['smoker'])
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"Chi-square Test of Independence: chi2={chi2:.3f}, p-value={p:.3f}")

Chi-square Test of Independence: chi2=0.000, p-value=1.000


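- Interpretation: There is no evidence of an association between sex and smoker status (p = 1.000). The statistic of 0.000 reflects observed counts that are very close to the expected counts, combined with the Yates continuity correction that `chi2_contingency` applies by default to 2×2 tables; it does not prove that the variables are perfectly independent.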
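
To see how much the continuity correction matters, the test can be run with and without it. The 2×2 table below is hypothetical and chosen so the arithmetic is easy to follow (every expected count is 25):

```python
import numpy as np
from scipy import stats

table = np.array([[30, 20],
                  [20, 30]])   # hypothetical counts, not the sex-by-smoker table

# Default: Yates' continuity correction is applied because dof == 1
chi2_yates, p_yates, dof, expected = stats.chi2_contingency(table)

# Same test without the correction
chi2_plain, p_plain, _, _ = stats.chi2_contingency(table, correction=False)

print(f"with correction:    chi2={chi2_yates:.2f}, p={p_yates:.4f}")
print(f"without correction: chi2={chi2_plain:.2f}, p={p_plain:.4f}")
```

The correction moves each observed count by 0.5 toward its expected count, so the corrected statistic is smaller (3.24 versus 4.00 here). When observed and expected counts differ by less than 0.5, the corrected statistic collapses to zero.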

- # Chi-square Goodness-of-Fit:
Tests if observed frequencies match expected ones.

In [19]:
# Goodness of Fit
obs_counts = sim_df['day'].value_counts().values
expected = [len(sim_df)/4] * 4
chi2, p = stats.chisquare(obs_counts, f_exp=expected)
print(f"Chi-square Goodness-of-Fit: chi2={chi2:.3f}, p-value={p:.3f}")


Chi-square Goodness-of-Fit: chi2=43.705, p-value=0.000


- Interpretation: The number of recorded bills is not uniformly distributed across the four days (p < 0.001). Some days clearly have more records than others.
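
The goodness-of-fit statistic is the sum of (observed - expected)² / expected over the categories. A short sketch of that computation; the four counts are an assumption chosen to be consistent with the statistic printed above, not values read from `sim_df` here:

```python
import numpy as np
from scipy import stats

observed = np.array([87, 76, 62, 19])   # assumed per-day counts (see note above)
expected = np.full(observed.size, observed.sum() / observed.size)  # uniform: 61 each

# Manual chi-square statistic and upper-tail p-value with k - 1 degrees of freedom
chi2_manual = ((observed - expected) ** 2 / expected).sum()
p_manual = stats.chi2.sf(chi2_manual, df=observed.size - 1)

# Cross-check against SciPy
chi2_scipy, p_scipy = stats.chisquare(observed, f_exp=expected)
print(f"manual chi2={chi2_manual:.3f}, scipy chi2={chi2_scipy:.3f}, p={p_manual:.3g}")
```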

# 5. Non-parametric Tests
Used when normality assumptions are not met.
- # Mann-Whitney U test:
Compares two independent groups (non-parametric alternative to t-test).

In [20]:
stat, p = stats.mannwhitneyu(male_tips, female_tips)
print(f"Mann-Whitney U: stat={stat:.3f}, p-value={p:.3f}")

Mann-Whitney U: stat=7289.500, p-value=0.383


- Interpretation: The distributions of tips for males and females are not significantly different. A non-parametric test agrees with the t-test finding.

- # Wilcoxon Signed-Rank test:
Paired sample test (non-parametric).

In [21]:
# Wilcoxon Signed-Rank test
stat, p = stats.wilcoxon(before, after)
print(f"Wilcoxon Signed-Rank: stat={stat:.3f}, p-value={p:.3f}")

Wilcoxon Signed-Rank: stat=46.000, p-value=0.000


- Interpretation: There's a significant shift between "before" and "after" values in paired data, supporting the conclusion that the change is not due to chance.

- # Kruskal-Wallis test:
Non-parametric ANOVA for comparing more than two groups.

In [22]:
# Kruskal-Wallis test
stat, p = stats.kruskal(
    *[group['tip'] for _, group in sim_df.groupby('day')]
)
print(f"Kruskal-Wallis: stat={stat:.3f}, p-value={p:.3f}")

Kruskal-Wallis: stat=8.566, p-value=0.036


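- Interpretation: At least one day's tip distribution differs significantly from the others (p = 0.036 < 0.05). If the distributions have similar shapes, this can be read as a difference in medians. Post-hoc pairwise tests would reveal which days differ.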
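
One simple post-hoc approach is pairwise Mann-Whitney U tests with a Bonferroni correction. This is a sketch of that idea on synthetic groups; the `groups` dictionary and its values are hypothetical, and other post-hoc procedures (such as Dunn's test) are also commonly used:

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
groups = {                      # synthetic right-skewed samples, one per day
    "Thur": rng.gamma(4.0, 0.70, size=60),
    "Fri": rng.gamma(4.0, 0.70, size=20),
    "Sat": rng.gamma(4.0, 0.75, size=85),
    "Sun": rng.gamma(4.0, 0.85, size=75),
}

pairs = list(combinations(groups, 2))   # all unordered pairs of group names
results = []
for a, b in pairs:
    _, p_raw = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    p_adj = min(1.0, p_raw * len(pairs))   # Bonferroni: multiply by number of comparisons
    results.append((a, b, p_raw, p_adj))

for a, b, p_raw, p_adj in results:
    print(f"{a} vs {b}: raw p={p_raw:.3f}, Bonferroni p={p_adj:.3f}")
```

Bonferroni is conservative; with six comparisons, a raw p-value must fall below roughly 0.0083 to stay significant at the 0.05 level.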

- # Friedman test:
Non-parametric alternative to repeated measures ANOVA for three or more related samples.

In [23]:
# Friedman test
repeated_df = pd.DataFrame({"before": before, "after": after, "after2": after + np.random.normal(1, 1, len(after))})
stat, p = stats.friedmanchisquare(repeated_df['before'], repeated_df['after'], repeated_df['after2'])
print(f"Friedman test: stat={stat:.3f}, p-value={p:.3f}")

Friedman test: stat=26.867, p-value=0.000


- Interpretation: There are significant differences across the three related samples (before, after, and after2; p < 0.001). Because `after` and `after2` were simulated by adding positive shifts to `before`, this result is expected.

- # Sign test:
Tests if there is a consistent direction of change in paired data.

In [24]:
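# Sign test (exact binomial test on the signs of the paired differences)
diff = after - before
n_pos = (diff > 0).sum()      # pairs where the value increased
n_total = (diff != 0).sum()   # ties (zero differences) are dropped
# Under H0, increases and decreases are equally likely (p = 0.5)
p = stats.binomtest(k=n_pos, n=n_total, p=0.5).pvalue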
print(f"Sign test: p-value={p:.3f}")

Sign test: p-value=0.001


- Interpretation: The signs of the paired differences are not evenly split between increases and decreases (p = 0.001), so the change has a consistent direction. Together with the negative paired t-statistic for before minus after, this indicates that most values increased.

# 6. ANOVA
Used to compare means across multiple groups.


- # One-Way ANOVA:
Tests if mean differs by one factor (e.g., day).

In [25]:

anova_model = ols('tip ~ C(day)', data=sim_df).fit()
f, p = sm.stats.anova_lm(anova_model, typ=2).iloc[0][['F', 'PR(>F)']]
print(f"One-Way ANOVA: F={f:.3f}, p-value={p:.3f}")

One-Way ANOVA: F=1.672, p-value=0.174


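- Interpretation: There is no statistically significant difference in mean tips across days (p = 0.174 > 0.05). This is a failure to reject the null hypothesis, not proof that the day has no effect; note that the Kruskal-Wallis test above, which compares ranks rather than means, gave p = 0.036 for the same grouping.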
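
The one-way ANOVA F-statistic is the ratio of between-group to within-group mean squares. The sketch below computes it by hand on synthetic groups (illustrative means and sizes, not the tips data) and checks it against `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Synthetic groups: (mean, size) pairs are illustrative only
groups = [rng.normal(m, 1.0, size=n) for m, n in [(2.8, 60), (2.7, 20), (3.0, 85), (3.2, 75)]]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = all_values.size - len(groups)

f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F={f_manual:.3f}, scipy F={f_scipy:.3f}, p={p_manual:.3f}")
```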

- # Two-Way ANOVA:
Tests effect of two factors (e.g., day and sex).

In [26]:
# Two-Way ANOVA
anova_model = ols('tip ~ C(day) + C(sex)', data=sim_df).fit()
f_table = sm.stats.anova_lm(anova_model, typ=2)
print("Two-Way ANOVA:")
print(f_table)

Two-Way ANOVA:
              sum_sq     df         F    PR(>F)
C(day)      7.446900    3.0  1.306497  0.272937
C(sex)      1.594561    1.0  0.839258  0.360533
Residual  454.092042  239.0       NaN       NaN


- Interpretation: Neither the day (p = 0.273) nor the customer's sex (p = 0.361) shows a statistically significant effect on tip amount. The model is additive, so their interaction was not tested here.

- # Repeated Measures ANOVA:
 For comparing means of the same subjects under different conditions.

In [27]:
# Repeated Measures ANOVA
# (reshape the wide data to long format: one row per subject and condition)
from statsmodels.stats.anova import AnovaRM
long_df = repeated_df.melt(var_name='condition', value_name='score')
long_df['subject'] = np.tile(np.arange(len(repeated_df)), 3)
rm = AnovaRM(long_df, 'score', 'subject', within=['condition'])
res = rm.fit()
print("Repeated Measures ANOVA:")
print(res)

Repeated Measures ANOVA:
                 Anova
          F Value Num DF  Den DF Pr > F
---------------------------------------
condition 32.6069 2.0000 58.0000 0.0000



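- Interpretation: There are significant differences in mean score across the three conditions (before, after, and after2) measured on the same subjects (F = 32.61, p < 0.001). Because these values are simulated, the result illustrates the method rather than a real treatment effect.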

# 7. Regression-Based Tests
Used to evaluate relationships between continuous variables.

- # t-test for slope:
Checks if the predictor has a significant effect.

In [28]:

model = ols('tip ~ total_bill', data=sim_df).fit()
print("Simple Linear Regression:")
print(model.summary())

Simple Linear Regression:
                            OLS Regression Results                            
Dep. Variable:                    tip   R-squared:                       0.457
Model:                            OLS   Adj. R-squared:                  0.454
Method:                 Least Squares   F-statistic:                     203.4
Date:                Sun, 25 May 2025   Prob (F-statistic):           6.69e-34
Time:                        21:43:04   Log-Likelihood:                -350.54
No. Observations:                 244   AIC:                             705.1
Df Residuals:                     242   BIC:                             712.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.9203     

- Output: R² = 0.457, F = 203.4, p = 6.69e-34

- Interpretation: The regression model is highly significant. The variable total_bill explains about 46% of the variation in tip amount. As the bill increases, tips increase as well.

- # F-test:
Assesses overall regression significance.

In [29]:
# F-test for overall regression significance
f_stat = model.fvalue
f_pval = model.f_pvalue
print(f"F-test overall: F={f_stat:.3f}, p-value={f_pval:.3f}")


F-test overall: F=203.358, p-value=0.000


- Interpretation:
Since the p-value is far below 0.05 (6.69e-34, which prints as 0.000 at three decimal places), the overall regression model is highly statistically significant. This means the model provides a meaningful explanation of the variation in the dependent variable (tip amount).

- The high F-statistic value (203.358) also indicates that the variance explained by the regression is much greater than the unexplained variance (noise).
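
In a regression with a single predictor, the slope t-test and the overall F-test are the same test: the F-statistic equals the squared t-statistic of the slope. A minimal sketch using `scipy.stats.linregress` on synthetic `x` and `y` (generated for illustration, not `total_bill` and `tip`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(5, 50, size=244)                    # synthetic predictor
y = 1.0 + 0.1 * x + rng.normal(0, 1.0, size=244)    # synthetic response

fit = stats.linregress(x, y)
df_resid = x.size - 2                               # n minus two estimated parameters

# t-test for the slope: estimate divided by its standard error
t_slope = fit.slope / fit.stderr
p_slope = 2 * stats.t.sf(abs(t_slope), df_resid)

# Overall F-test expressed through R-squared
r2 = fit.rvalue ** 2
f_overall = r2 / (1 - r2) * df_resid
p_overall = stats.f.sf(f_overall, 1, df_resid)

print(f"t={t_slope:.3f}, t^2={t_slope**2:.3f}, F={f_overall:.3f}")
```

For the fitted statsmodels model in this notebook, the per-coefficient statistics should be available as `model.tvalues` and `model.pvalues`. The equivalence no longer holds once the model has more than one predictor.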