# File 'student-mat-1.csv' contains a list of students and their alcohol consumption habits along with some demographic data.

### Using this data, answer the following questions:

Q1 Is there a difference in mean "workday" alcohol consumption between students who reported their mother as being 'at_home' and those whose mothers are working? Look at column Mjob. (10 points)

In [3]:
import pandas as pd
from scipy import stats

# Load the data
df = pd.read_csv("student-mat-1.csv")

# Select workday alcohol consumption by mother's job
at_home = df.loc[df["Mjob"] == "at_home", "Workday consumption"]
working = df.loc[df["Mjob"] != "at_home", "Workday consumption"]

# Descriptive statistics
mean_at_home = at_home.mean()
mean_working = working.mean()

print("Mean workday alcohol consumption (mother at home):", mean_at_home)
print("Mean workday alcohol consumption (mother working):", mean_working)

# Welch's two-sample t-test
t_stat, p_value = stats.ttest_ind(at_home, working, equal_var=False)

print("t-statistic:", t_stat)
print("p-value:", p_value)

# Hypothesis test decision
alpha = 0.05
if p_value <= alpha:
    print("Reject the null hypothesis: significant difference exists.")
else:
    print("Fail to reject the null hypothesis: no significant difference.")

Mean workday alcohol consumption (mother at home): 1.3898305084745763
Mean workday alcohol consumption (mother working): 1.4970238095238095
t-statistic: -0.9847643643972255
p-value: 0.3273282691537669
Fail to reject the null hypothesis: no significant difference.
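
To make explicit what `equal_var=False` does in the cell above, here is a minimal sketch of the Welch t-statistic and the Welch-Satterthwaite degrees of freedom, using only the standard library. The function name `welch_t` and the two small samples are hypothetical and chosen for illustration; they are not taken from `student-mat-1.csv`.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variances (n - 1 denominator), each scaled by its group size
    var_a = statistics.variance(sample_a) / n_a
    var_b = statistics.variance(sample_b) / n_b
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t_stat = mean_diff / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical consumption scores, NOT values from the dataset
at_home_demo = [1, 1, 2, 1, 3]
working_demo = [1, 2, 2, 3, 1, 4]

t_demo, dof_demo = welch_t(at_home_demo, working_demo)
print("t-statistic:", t_demo)
print("degrees of freedom:", dof_demo)
```

Unlike the pooled-variance t-test, the degrees of freedom here are generally not an integer, because each group keeps its own variance estimate.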


Q2 Create a new column called "Total Consumption" by adding the workday and weekend consumption. Is the mean total consumption higher for students aged <18 compared to those aged >=18? (10 points)

In [4]:
import pandas as pd
from scipy import stats

# Load the data
df = pd.read_csv("student-mat-1.csv")

# Create Total Consumption column
df["Total Consumption"] = df["Workday consumption"] + df["Weekend consumption"]

# Split by age group
under_18 = df.loc[df["age"] < 18, "Total Consumption"]
over_equal_18 = df.loc[df["age"] >= 18, "Total Consumption"]

# Descriptive statistics
print("Mean total consumption (age < 18):", under_18.mean())
print("Mean total consumption (age >= 18):", over_equal_18.mean())

# Welch two-sample t-test
t_stat, p_value = stats.ttest_ind(
    under_18,
    over_equal_18,
    equal_var=False
)

print("t-statistic:", t_stat)
print("p-value:", p_value)

# Hypothesis test decision
alpha = 0.05
if p_value <= alpha:
    print("Reject the null hypothesis: significant difference exists.")
else:
    print("Fail to reject the null hypothesis: no significant difference.")

Mean total consumption (age < 18): 3.711267605633803
Mean total consumption (age >= 18): 3.9279279279279278
t-statistic: -0.9578127002231458
p-value: 0.3393513530149026
Fail to reject the null hypothesis: no significant difference.
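
Q2 asks a directional question (is the mean higher for students under 18), while `stats.ttest_ind` reports a two-sided p-value by default. Because the t distribution is symmetric, a one-sided p-value can be recovered from the two-sided one and the sign of the t-statistic. The sketch below does this for the values printed above; the helper name `one_sided_p_greater` is hypothetical.

```python
def one_sided_p_greater(t_stat, two_sided_p):
    """One-sided p-value for H1: mean of the first group > mean of the second."""
    half = two_sided_p / 2
    # A positive t supports H1, so only the upper tail counts;
    # a negative t puts most of the probability mass against H1.
    return half if t_stat > 0 else 1 - half


# Values copied from the Q2 output above (first group = age < 18)
t_stat = -0.9578127002231458
p_two_sided = 0.3393513530149026

p_greater = one_sided_p_greater(t_stat, p_two_sided)
print("one-sided p-value (age < 18 higher):", p_greater)
```

Since the t-statistic is negative, the one-sided p-value is above 0.5 and the conclusion is unchanged: there is no evidence that students under 18 consume more. Recent SciPy versions also accept an `alternative="greater"` argument to `ttest_ind`, which should give the same result directly.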


Q3 Do individuals consume more alcohol over the weekend compared to workdays? (Analyse this as a paired test.) (10 points)

In [5]:
import pandas as pd
from scipy import stats

# Load the data
df = pd.read_csv("student-mat-1.csv")

# Extract paired samples
workday = df["Workday consumption"]
weekend = df["Weekend consumption"]

# Descriptive statistics
print("Mean workday consumption:", workday.mean())
print("Mean weekend consumption:", weekend.mean())

# Paired t-test
t_stat, p_value = stats.ttest_rel(weekend, workday)

print("t-statistic:", t_stat)
print("p-value:", p_value)

# Hypothesis test decision
alpha = 0.05
if p_value <= alpha:
    print("Reject the null hypothesis: weekend consumption is significantly different.")
else:
    print("Fail to reject the null hypothesis: no significant difference.")

Mean workday consumption: 1.481012658227848
Mean weekend consumption: 2.2911392405063293
t-statistic: 16.378501737000736
p-value: 2.340452168567084e-46
Reject the null hypothesis: weekend consumption is significantly different.
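
A paired t-test is equivalent to a one-sample t-test on the per-student differences, which is why the two columns must stay in the same row order. The following is a minimal sketch of that computation using only the standard library. The six pairs of scores are hypothetical, not values from `student-mat-1.csv`.

```python
import math
import statistics

# Hypothetical paired scores for six students, NOT values from the dataset
workday_demo = [1, 1, 2, 1, 3, 2]
weekend_demo = [2, 3, 2, 4, 3, 5]

# Per-student difference (weekend minus workday), matching ttest_rel(weekend, workday)
diffs = [we - wd for we, wd in zip(weekend_demo, workday_demo)]
n = len(diffs)

# Standard error of the mean difference, then the t statistic with n - 1 degrees of freedom
standard_error = statistics.stdev(diffs) / math.sqrt(n)
t_paired = statistics.mean(diffs) / standard_error

print("mean difference:", statistics.mean(diffs))
print("paired t-statistic:", t_paired, "with", n - 1, "degrees of freedom")
```

A positive t-statistic here means weekend consumption exceeds workday consumption on average, which is the direction Q3 asks about.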


Q4 Is average school attendance worse for students whose total consumption is >4? (Worse attendance = more absences from school.) (10 points)

In [8]:
import pandas as pd
from scipy import stats

# Load the data
df = pd.read_csv("student-mat-1.csv")

# Create Total Consumption column
df["Total Consumption"] = df["Workday consumption"] + df["Weekend consumption"]

# Split data by total consumption threshold
high_consumption = df.loc[df["Total Consumption"] > 4, "absences from school"]
low_consumption = df.loc[df["Total Consumption"] <= 4, "absences from school"]

# Descriptive statistics
print("Mean absences (Total Consumption > 4):", high_consumption.mean())
print("Mean absences (Total Consumption ≤ 4):", low_consumption.mean())

# Welch two-sample t-test
t_stat, p_value = stats.ttest_ind(
    high_consumption,
    low_consumption,
    equal_var=False
)

print("t-statistic:", t_stat)
print("p-value:", p_value)

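# Hypothesis test decision
# The test is two-sided, so check the sign of t_stat before claiming a direction:
# t_stat > 0 means more absences in the high-consumption group.
alpha = 0.05
if p_value <= alpha and t_stat > 0:
    print("Reject the null hypothesis: attendance is significantly worse for high consumption students.")
elif p_value <= alpha:
    print("Reject the null hypothesis: attendance is significantly better for high consumption students.")
else:
    print("Fail to reject the null hypothesis: no significant difference in attendance.")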

Mean absences (Total Consumption > 4): 7.546218487394958
Mean absences (Total Consumption ≤ 4): 4.916666666666667
t-statistic: 2.790851665523653
p-value: 0.005798142412445125
Reject the null hypothesis: attendance is significantly worse for high consumption students.
