## T-tests are a family of statistical tests used to see if the mean (average) of a group is significantly different from a reference value, or from the mean of another group.

You’ll be doing:

- a) One-sample t-test (`ttest_1samp`)
  Checks if the average of a single group is different from a known or hypothesized value.
  Example: Is the average age in our dataset different from 40 years?

- b) Two-sample t-test (`ttest_ind`)
  Checks if two independent groups have different means.
  Example: Is the average salary (or hours per week) different for males vs females?

“Independent” means the observations in one group are not paired with, or otherwise related to, the observations in the other group.

## Why it matters
This is a core statistical skill for data science because you’ll often:
- Compare groups (A/B testing, marketing campaigns, product experiments)
- Test whether an observed difference is larger than random sampling variation alone would plausibly produce

In [2]:
import pandas as pd
from scipy import stats

df = pd.read_csv("data/adult.csv")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [3]:

# Hypothesized mean
hyp_mean = 40  

# Extract "hours.per.week"
hours = df["hours.per.week"]

# One-sample t-test
t_stat, p_val = stats.ttest_1samp(hours, hyp_mean)

print("One-sample t-test:")
print(f"T-statistic = {t_stat:.4f}, P-value = {p_val:.4f}")

One-sample t-test:
T-statistic = 6.3930, P-value = 0.0000


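### The average hours per week in this dataset is significantly different from 40 hours (the p-value rounds to 0.0000). If the true mean were 40, a gap this large would be very unlikely to come from random sampling alone. Statistical significance does not by itself mean the difference is large in practical terms.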

In [4]:
print(df["hours.per.week"].mean())

40.437455852092995
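
The sample mean is only about 0.44 hours above 40, so it helps to look at the size of the difference as well as the p-value. Here is a minimal sketch, assuming synthetic data in place of the real column: `sample_hours`, the seed, and the distribution parameters are made up for illustration. Cohen's d and a 95% confidence interval are one common way to gauge practical size; they are not part of the original analysis.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for df["hours.per.week"] (assumption: NOT the real data)
rng = np.random.default_rng(42)
sample_hours = rng.normal(loc=40.4, scale=12.0, size=30_000)
hyp_mean = 40

# Same test as in the notebook
t_stat, p_val = stats.ttest_1samp(sample_hours, hyp_mean)

# Cohen's d for one sample: mean difference in units of the sample std
cohens_d = (sample_hours.mean() - hyp_mean) / sample_hours.std(ddof=1)

# 95% confidence interval for the mean, from the t distribution
n = sample_hours.size
std_err = stats.sem(sample_hours)  # sample std (ddof=1) / sqrt(n)
ci_low, ci_high = stats.t.interval(
    0.95, df=n - 1, loc=sample_hours.mean(), scale=std_err
)

print(f"t = {t_stat:.4f}, p = {p_val:.4g}")
print(f"Cohen's d = {cohens_d:.4f}")
print(f"95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
```

With a sample this large, even a very small d can produce a tiny p-value, so the two numbers answer different questions.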


In [5]:
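# The t-statistic measures how many standard errors the sample mean
# lies from the hypothesized mean:

# t = (sample mean - hypothesized mean) / standard error of the mean
# where standard error of the mean = sample std / sqrt(n)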
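
To see where the t-statistic comes from, the formula above can be computed by hand and compared against `stats.ttest_1samp`. This is a small sketch on synthetic data: `sample_hours` and its parameters are assumptions for illustration, not the real `hours.per.week` column.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for df["hours.per.week"] (assumption: NOT the real data)
rng = np.random.default_rng(0)
sample_hours = rng.normal(loc=40.4, scale=12.0, size=1_000)
hyp_mean = 40

# Standard error of the mean: sample std (ddof=1) divided by sqrt(n)
n = sample_hours.size
std_err = sample_hours.std(ddof=1) / np.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_manual = (sample_hours.mean() - hyp_mean) / std_err

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's result for comparison
t_scipy, p_scipy = stats.ttest_1samp(sample_hours, hyp_mean)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4g}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4g}")
```

The two lines should agree, which confirms that the library call is doing nothing more than this arithmetic.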

In [7]:
# Split into male & female groups
male_hours = df.loc[df["sex"] == "Male", "hours.per.week"]
female_hours = df.loc[df["sex"] == "Female", "hours.per.week"]

# Two-sample t-test
t_stat2, p_val2 = stats.ttest_ind(male_hours, female_hours, equal_var=False)  # Welch's t-test

print("\nTwo-sample t-test (Male vs Female):")
print(f"T-statistic = {t_stat2:.4f}, P-value = {p_val2:.4f}")


Two-sample t-test (Male vs Female):
T-statistic = 42.8820, P-value = 0.0000


In [8]:
print(male_hours.mean(), female_hours.mean())

42.42808627810923 36.410361154953115
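
The `equal_var=False` option used above runs Welch's t-test, which does not assume the two groups share the same variance. A rough sketch of that calculation follows. It uses synthetic groups: `group_a`, `group_b`, their sizes, and their parameters are invented for illustration and are not the real male/female data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups (assumption: NOT the real data)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=42.4, scale=12.0, size=400)
group_b = rng.normal(loc=36.4, scale=11.0, size=250)

# Per-group squared standard errors: sample variance (ddof=1) / n
se2_a = group_a.var(ddof=1) / group_a.size
se2_b = group_b.var(ddof=1) / group_b.size

# Welch's t: difference in means over the combined standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation for the degrees of freedom
dof = (se2_a + se2_b) ** 2 / (
    se2_a**2 / (group_a.size - 1) + se2_b**2 / (group_b.size - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df=dof)

# SciPy's Welch test for comparison
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4g}, df = {dof:.1f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4g}")
```

Because each group keeps its own variance estimate, this version is a reasonable default when the group sizes or spreads differ, as they may for the male and female groups here.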
