**Aman Dubal T076**

Practical 4:
Hypothesis Testing
1. Formulate null and alternative hypotheses for a given problem.
2. Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
3. Interpret the results and draw conclusions based on the test outcomes.


# Practical 4

Categorical Feature Engineering Code

Import Libraries

In [None]:
import pandas as pd
import numpy as np
import math
from math import sqrt
from scipy.stats import norm
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import ttest_rel

Load Dataset


In [None]:
df = pd.read_csv("globalAirQuality.csv")

# Create AQI Groups
def categorize(aqi):
    if aqi <= 50:
        return "Good"
    elif aqi <= 100:
        return "Moderate"
    elif aqi <= 150:
        return "Unhealthy"
    else:
        return "Very Unhealthy"

df["aqi_level"] = df["aqi"].apply(categorize)
print(df["aqi_level"].value_counts())


aqi_level
Unhealthy         8874
Moderate          8108
Very Unhealthy     847
Good               171
Name: count, dtype: int64


In [None]:
df

Unnamed: 0,timestamp,country,city,latitude,longitude,pm25,pm10,no2,so2,o3,co,aqi,temperature,humidity,wind_speed,aqi_level
0,2025-11-04 18:25:17.554219,US,New York,40.713,-74.006,50.295,108.938,27.998,6.539,52.568,1.096,108,18.504,70.168,3.725,Unhealthy
1,2025-11-04 19:25:17.554219,US,New York,40.713,-74.006,32.083,63.043,36.120,4.021,43.536,1.075,90,5.838,80.088,8.969,Moderate
2,2025-11-04 20:25:17.554219,US,New York,40.713,-74.006,42.250,82.553,26.935,9.538,23.320,0.977,84,31.833,62.783,9.650,Moderate
3,2025-11-04 21:25:17.554219,US,New York,40.713,-74.006,30.403,79.951,63.536,7.609,31.369,0.230,158,23.140,89.153,8.956,Very Unhealthy
4,2025-11-04 22:25:17.554219,US,New York,40.713,-74.006,21.083,66.423,38.997,6.919,45.615,1.085,97,13.632,76.499,4.017,Moderate
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17995,2025-11-19 13:25:17.554219,CH,Zurich,47.377,8.542,27.899,74.179,41.474,6.677,50.869,1.028,103,7.079,52.443,7.452,Unhealthy
17996,2025-11-19 14:25:17.554219,CH,Zurich,47.377,8.542,2.950,47.988,42.235,2.821,35.551,0.644,105,28.734,85.678,4.496,Unhealthy
17997,2025-11-19 15:25:17.554219,CH,Zurich,47.377,8.542,61.347,72.908,46.976,5.763,66.492,0.947,122,21.951,72.311,9.660,Unhealthy
17998,2025-11-19 16:25:17.554219,CH,Zurich,47.377,8.542,40.722,95.152,32.957,5.524,53.193,0.868,95,24.042,31.880,2.642,Moderate


**T-tests to evaluate whether the data are consistent with our hypotheses.**

One Sample t-test

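Compare the mean PM2.5 against a reference value of 35 µg/m³. This practical treats 35 µg/m³ as the WHO standard; it corresponds to the WHO interim target 1 for the annual mean, while the 2021 WHO guideline values themselves are lower.

Hypotheses

H0: Mean PM2.5 = 35 µg/m³

H1: Mean PM2.5 ≠ 35 µg/m³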

In [None]:
pm25_values = df['pm25']

t_stat, p_val = stats.ttest_1samp(pm25_values, 35)

print("\n--- One Sample t-test (PM2.5 vs WHO Standard) ---")
print("T-statistic:", t_stat)
print("P-value:", p_val)

if p_val < 0.05:
    print("Reject Null Hypothesis → PM2.5 mean is significantly different from 35")
else:
    print("Fail to Reject Null Hypothesis → No significant difference")



--- One Sample t-test (PM2.5 vs WHO Standard) ---
T-statistic: 40.818620184487955
P-value: 0.0
Reject Null Hypothesis → PM2.5 mean is significantly different from 35
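
With about 18,000 rows, even a small departure from 35 produces a very large t-statistic, so the size of the difference is worth reporting alongside the test. Below is a minimal sketch of one way to do that, computing the statistic by hand and adding a confidence interval for the mean. The helper name `one_sample_summary` and the synthetic sample are illustrative assumptions, not part of the practical's dataset.

```python
import numpy as np
from scipy import stats

def one_sample_summary(values, mu_0, confidence=0.95):
    """Hypothetical helper: t-statistic, two-sided p-value and CI for the mean."""
    x = np.asarray(values, dtype=float)
    n = x.size
    mean = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)                  # standard error of the mean
    t_stat = (mean - mu_0) / se
    p_val = 2 * stats.t.sf(abs(t_stat), df=n - 1)    # two-sided p-value
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    ci = (mean - t_crit * se, mean + t_crit * se)
    return t_stat, p_val, ci

# Synthetic stand-in for df['pm25'] (assumption: these are not the real readings)
rng = np.random.default_rng(0)
sample = rng.normal(loc=36.0, scale=10.0, size=500)

t_stat, p_val, ci = one_sample_summary(sample, mu_0=35)
print(f"t = {t_stat:.3f}, p = {p_val:.4f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

On the real data, the same call would be `one_sample_summary(df['pm25'], 35)`; the interval shows how far the mean sits from 35 in µg/m³, which the p-value alone does not.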


Independent Two-Sample t-test

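Compare mean PM2.5 between two AQI groups (here: Moderate vs Unhealthy). Passing `equal_var=False` runs Welch's t-test, which does not assume the two groups have equal variances.

H0: Mean PM2.5 is the same in both groups

H1: Mean PM2.5 differs between the two groups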

In [None]:
group1 = df[df['aqi_level'] == 'Moderate']['pm25']
group2 = df[df['aqi_level'] == 'Unhealthy']['pm25']

t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

print("\n--- Independent Two-Sample t-test (PM2.5 Moderate vs Unhealthy) ---")
print("T-statistic:", t_stat)
print("P-value:", p_val)

if p_val < 0.05:
    print("Reject Null Hypothesis → Groups have significantly different PM2.5")
else:
    print("Fail to Reject Null Hypothesis → No significant difference")



--- Independent Two-Sample t-test (PM2.5 Moderate vs Unhealthy) ---
T-statistic: -62.694700939899015
P-value: 0.0
Reject Null Hypothesis → Groups have significantly different PM2.5
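
To make clear what `equal_var=False` computes, here is a rough sketch of Welch's t-test written out by hand and checked against `stats.ttest_ind`. The function name `welch_t` and the synthetic groups are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Hypothetical helper: Welch's t-statistic, degrees of freedom and p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size                      # squared standard error, group a
    vb = b.var(ddof=1) / b.size                      # squared standard error, group b
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p_val = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_val

# Synthetic groups with unequal sizes and variances (not the practical's data)
rng = np.random.default_rng(1)
moderate = rng.normal(32.0, 8.0, size=300)
unhealthy = rng.normal(35.0, 12.0, size=200)

t_stat, dof, p_val = welch_t(moderate, unhealthy)
print(f"t = {t_stat:.3f}, df = {dof:.1f}, p = {p_val:.4g}")
```

The statistic uses the same unpooled standard error as the two-sample Z-test later in this practical, which is why both report -62.69 on this dataset; only the reference distribution (t versus normal) differs.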


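Paired t-test

Compare PM2.5 between two countries using readings that share the same timestamp.

H0: Mean paired difference in PM2.5 = 0

H1: Mean paired difference in PM2.5 ≠ 0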

In [None]:
# Choose two countries to compare
country1 = "US"
country2 = "IN"

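# Filter rows for each country and rename the PM2.5 column for clarity
# (rename returns a new frame, which avoids SettingWithCopyWarning on a slice)
df1 = df[df['country'] == country1][['timestamp', 'pm25']].rename(columns={'pm25': f'pm25_{country1}'})
df2 = df[df['country'] == country2][['timestamp', 'pm25']].rename(columns={'pm25': f'pm25_{country2}'})

# Merge on timestamp. Caveat: when a country has several rows per timestamp,
# this is a many-to-many merge, so one reading is reused across several rows
# (note the repeated pm25_US value in the preview) and the rows are not
# independent pairs.
paired_data = pd.merge(df1, df2, on='timestamp')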

print("\nPaired dataset preview:")
print(paired_data.head())

# Run paired t-test
t_stat, p_value = ttest_rel(paired_data[f'pm25_{country1}'], paired_data[f'pm25_{country2}'])

print("\n--- Paired Sample t-test (PM2.5 between countries) ---")
print(f"Comparing: {country1} vs {country2}")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

if p_value < 0.05:
    print("Reject Null Hypothesis → Significant difference in PM2.5 between the two countries.")
else:
    print("Fail to Reject Null Hypothesis → No significant difference in PM2.5 between the two countries.")


Paired dataset preview:
                    timestamp  pm25_US  pm25_IN
0  2025-11-04 18:25:17.554219   50.295   45.820
1  2025-11-04 18:25:17.554219   50.295   44.829
2  2025-11-04 18:25:17.554219   50.295    5.001
3  2025-11-04 18:25:17.554219   50.295   53.758
4  2025-11-04 18:25:17.554219   50.295   29.044

--- Paired Sample t-test (PM2.5 between countries) ---
Comparing: US vs IN
T-statistic: 0.08330170502384954
P-value: 0.9336134974020501
Fail to Reject Null Hypothesis → No significant difference in PM2.5 between the two countries.
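
The preview shows the same `pm25_US` value (50.295) matched with several `pm25_IN` values, so the merge produces every US-IN combination per timestamp rather than one pair per timestamp. One possible remedy, sketched below under the assumption that a per-country mean per timestamp is a sensible pairing unit, is to aggregate before pairing. The helper `paired_by_timestamp` and the toy frame are hypothetical; only the column names follow the dataset.

```python
import pandas as pd
from scipy.stats import ttest_rel

# Toy frame with the dataset's column names; the values are made up
toy = pd.DataFrame({
    "timestamp": ["t1"] * 4 + ["t2"] * 4 + ["t3"] * 4,
    "country": ["US", "US", "IN", "IN"] * 3,
    "pm25": [50.0, 40.0, 45.0, 55.0, 30.0, 34.0, 38.0, 42.0, 20.0, 26.0, 29.0, 25.0],
})

def paired_by_timestamp(frame, country_a, country_b):
    """Hypothetical helper: one row per timestamp with each country's mean PM2.5."""
    subset = frame[frame["country"].isin([country_a, country_b])]
    means = (
        subset.groupby(["timestamp", "country"])["pm25"]
        .mean()
        .unstack("country")   # countries become columns
        .dropna()             # keep timestamps present for both countries
    )
    return means[[country_a, country_b]]

pairs = paired_by_timestamp(toy, "US", "IN")
result = ttest_rel(pairs["US"], pairs["IN"])
print(pairs)
print("T-statistic:", result.statistic, "P-value:", result.pvalue)

# For contrast: the direct merge yields every US x IN combination per timestamp
naive = toy[toy["country"] == "US"].merge(toy[toy["country"] == "IN"], on="timestamp")
print(len(pairs), "aggregated pairs vs", len(naive), "merged rows")
```

Aggregating by city pair or by hour would be equally defensible; the point is that each observation should enter exactly one pair.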


Z-Test

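We use a Z-test when:

The sample is large (n ≥ 30), so the sample mean is approximately normally distributed

The population standard deviation is known, or the sample is large enough for the sample standard deviation to stand in for it

We are comparing one mean with a hypothesized value, or two means from large samples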
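
For reference, the statistics computed in the Z-test cells of this practical can be written as follows. The notation is an assumption chosen to mirror the code: $\bar{x}$ is a sample mean, $s$ a sample standard deviation (with `ddof=1`), $n$ a sample size, $\mu_0$ the hypothesized mean, and $\Phi$ the standard normal CDF.

```latex
% One-sample z statistic (s is used in place of the population \sigma)
z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

% Two-sample z statistic with unpooled variances
z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}

% Two-sided p-value
p = 2\,\bigl(1 - \Phi(|z|)\bigr)
```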

In [None]:
# Column
data = df['pm25']

# Hypothesized population mean
mu_0 = 35

# Calculate Z value
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)  # sample standard deviation
n = len(data)

z_value = (sample_mean - mu_0) / (sample_std / math.sqrt(n))

# Two-sided p-value (sf = 1 - cdf, but more accurate in the far tails)
p_value = 2 * stats.norm.sf(abs(z_value))

print("\n--- Z Test (PM2.5 vs WHO Standard = 35) ---")
print(f"Z-Statistic: {z_value}")
print(f"P-value: {p_value}")

if p_value < 0.05:
    print("Reject Null Hypothesis → Mean PM2.5 is significantly different from 35")
else:
    print("Fail to Reject Null Hypothesis → No significant difference")


--- Z Test (PM2.5 vs WHO Standard = 35) ---
Z-Statistic: 40.818620184488076
P-value: 0.0
Reject Null Hypothesis → Mean PM2.5 is significantly different from 35
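
A printed p-value of 0.0 does not mean the probability is exactly zero; it is smaller than the smallest number a double can represent. If the magnitude matters, one option is to work on the log scale with `norm.logsf`. This is a small sketch; the helper name `two_sided_log10_p` is an assumption, and the z value is copied from the output above.

```python
import math
from scipy.stats import norm

def two_sided_log10_p(z):
    """Hypothetical helper: log10 of the two-sided normal p-value, stable for large |z|."""
    return (math.log(2) + norm.logsf(abs(z))) / math.log(10)

z = 40.818620184488076                      # Z-statistic reported by the practical

naive_p = 2 * (1 - norm.cdf(abs(z)))        # direct computation underflows
print("Direct p-value:", naive_p)
print("log10(p):", two_sided_log10_p(z))
```

Reporting `log10(p)` (or simply "p < 0.001") is usually more informative than a literal zero.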


Two-sample Z-test

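Compare mean PM2.5 levels between two AQI categories, for example Moderate vs Unhealthy.

H0: Mean PM2.5 is the same in both categories

H1: Mean PM2.5 differs between the two categories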

In [None]:
# Select two groups
group1 = df[df['aqi_level'] == 'Moderate']['pm25']
group2 = df[df['aqi_level'] == 'Unhealthy']['pm25']

# Sample statistics
mean1, mean2 = np.mean(group1), np.mean(group2)
std1, std2 = np.std(group1, ddof=1), np.std(group2, ddof=1)
n1, n2 = len(group1), len(group2)

# Calculate Z value
z_stat = (mean1 - mean2) / sqrt((std1**2 / n1) + (std2**2 / n2))

# p-value (two-tailed test); sf = 1 - cdf, but more accurate in the far tails
p_val = 2 * norm.sf(abs(z_stat))

print("\n--- Two-Sample Z Test (PM2.5: Moderate vs Unhealthy) ---")
print(f"Mean Group1 (Moderate): {mean1}")
print(f"Mean Group2 (Unhealthy): {mean2}")
print(f"Z-statistic: {z_stat}")
print(f"P-value: {p_val}")

if p_val < 0.05:
    print("Reject Null Hypothesis → PM2.5 levels differ significantly between groups.")
else:
    print("Fail to Reject Null Hypothesis → No significant difference between groups.")


--- Two-Sample Z Test (PM2.5: Moderate vs Unhealthy) ---
Mean Group1 (Moderate): 32.19417476566354
Mean Group2 (Unhealthy): 46.244707685372994
Z-statistic: -62.69470093989896
P-value: 0.0
Reject Null Hypothesis → PM2.5 levels differ significantly between groups.
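
With several thousand rows per group, a significant result is almost guaranteed, so an effect size helps judge how large the gap between the group means is. The sketch below computes Cohen's d with a pooled standard deviation; this is one common convention, not something the practical itself uses, and the function name and synthetic groups are assumptions.

```python
import numpy as np

def cohens_d(a, b):
    """Hypothetical helper: Cohen's d using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)
    ) / (a.size + b.size - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic stand-ins for the Moderate and Unhealthy PM2.5 groups
rng = np.random.default_rng(2)
group_a = rng.normal(32.0, 10.0, size=400)
group_b = rng.normal(46.0, 10.0, size=400)

d = cohens_d(group_a, group_b)
print(f"Cohen's d: {d:.2f}")
```

On the real data, the call would be `cohens_d(group1, group2)` with the two series selected in the cell above.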


Chi-Square Test

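Check whether `aqi_level` is associated with another categorical column, here temperature binned into three levels.

H0: AQI level and temperature level are independent

H1: AQI level and temperature level are associated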

In [None]:
# Convert temperature to 3 equal-width bins (pd.cut splits the value range, not the row count)
df['temp_level'] = pd.cut(df['temperature'], bins=3, labels=['Low', 'Medium', 'High'])

cont_table = pd.crosstab(df['aqi_level'], df['temp_level'])

chi2, p, dof, expected = stats.chi2_contingency(cont_table)

print("\n--- Chi-Square Test (AQI Category vs Temperature Level) ---")
print("Chi-square value:", chi2)
print("Degrees of Freedom:", dof)
print("P-value:", p)

if p < 0.05:
    print("Reject Null Hypothesis → Variables are associated")
else:
    print("Fail to Reject Null Hypothesis → No evidence of association")



--- Chi-Square Test (AQI Category vs Temperature Level) ---
Chi-square value: 7.065882199247431
Degrees of Freedom: 6
P-value: 0.3147974299452369
Fail to Reject Null Hypothesis → No evidence of association
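
Two follow-up checks are often made around a chi-square test: whether the expected counts are large enough for the approximation (a common rule of thumb is at least 5 per cell), and how strong the association is, for example via Cramér's V. The sketch below shows both on a made-up 4x3 table; the helper `chi_square_summary` and the counts are assumptions, not results from the dataset.

```python
import numpy as np
from scipy import stats

def chi_square_summary(table):
    """Hypothetical helper: chi-square test plus minimum expected count and Cramér's V."""
    observed = np.asarray(table, dtype=float)
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    n = observed.sum()
    k = min(observed.shape) - 1              # smaller table dimension minus one
    cramers_v = np.sqrt(chi2 / (n * k))
    return {
        "chi2": chi2,
        "p": p,
        "dof": dof,
        "min_expected": expected.min(),
        "cramers_v": cramers_v,
    }

# Made-up counts: 4 AQI levels (rows) x 3 temperature levels (columns)
toy_table = [
    [20, 30, 25],
    [40, 35, 45],
    [30, 30, 35],
    [10, 15, 5],
]

summary = chi_square_summary(toy_table)
for name, value in summary.items():
    print(f"{name}: {value}")
```

On the real data, `chi_square_summary(cont_table)` would accept the crosstab directly, since `np.asarray` converts a DataFrame to its underlying counts.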
