# 04 - Hypothesis Testing

*This notebook is exploratory and may be removed if it does not contribute meaningfully to the project. It is included to strengthen the analysis with formal statistical tests, with a distinction grade in mind.*

## Objectives

- Ask specific questions about the dataset
- Use statistical tests to check assumptions and relationships
- Support findings from EDA with quantitative evidence


In [2]:
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Load cleaned dataset
df_chunk = pd.read_csv("../outputs/datasets/collection/HousePricesRecords_clean.csv")
df_chunk.head()

| | Price | Date of Transfer | Old/New | Duration | Town/City | County | PPDCategory Type | Year | Month | Property_D | Property_F | Property_S | Property_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25000 | 1995-08-18 | 0 | 1 | OLDHAM | GREATER MANCHESTER | A | 1995 | 8 | False | False | False | True |
| 1 | 42500 | 1995-08-09 | 0 | 1 | GRAYS | THURROCK | A | 1995 | 8 | False | False | True | False |
| 2 | 45000 | 1995-06-30 | 0 | 1 | HIGHBRIDGE | SOMERSET | A | 1995 | 6 | False | False | False | True |
| 3 | 43150 | 1995-11-24 | 0 | 1 | BEDFORD | BEDFORDSHIRE | A | 1995 | 11 | False | False | False | True |
| 4 | 18899 | 1995-06-23 | 0 | 1 | WAKEFIELD | WEST YORKSHIRE | A | 1995 | 6 | False | False | True | False |


## Hypothesis 1: Are new houses more expensive than old ones?

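- **Null hypothesis (H0):** The mean price of new houses is equal to the mean price of old houses.
- **Alternative hypothesis (H1):** The mean price of new houses is higher than the mean price of old houses.

We'll compare the two groups with an independent two-sample t-test, using Welch's version (`equal_var=False`) so that equal variances are not assumed. `stats.ttest_ind` is two-sided by default, so the direction of any difference is read from the sign of the t-statistic.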
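For reference, here is a sketch of the statistic computed by the unequal-variance (Welch) form of the test, which is what `equal_var=False` selects in `stats.ttest_ind`. The symbols are introduced here for illustration only: $\bar{x}$, $s^2$ and $n$ are the sample mean, sample variance and sample size of each group.

$$
t = \frac{\bar{x}_{\text{new}} - \bar{x}_{\text{old}}}{\sqrt{\dfrac{s_{\text{new}}^2}{n_{\text{new}}} + \dfrac{s_{\text{old}}^2}{n_{\text{old}}}}}
$$

The p-value is taken from a t-distribution whose degrees of freedom are approximated by the Welch-Satterthwaite formula:

$$
\nu \approx \frac{\left(\dfrac{s_{\text{new}}^2}{n_{\text{new}}} + \dfrac{s_{\text{old}}^2}{n_{\text{old}}}\right)^2}{\dfrac{\left(s_{\text{new}}^2 / n_{\text{new}}\right)^2}{n_{\text{new}} - 1} + \dfrac{\left(s_{\text{old}}^2 / n_{\text{old}}\right)^2}{n_{\text{old}} - 1}}
$$

A positive $t$ means the new-house sample has the higher mean price.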


In [3]:
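# Split prices by the Old/New flag (1 = new, 0 = old, as used in this notebook)
new_prices = df_chunk.loc[df_chunk["Old/New"] == 1, "Price"]
old_prices = df_chunk.loc[df_chunk["Old/New"] == 0, "Price"]

# Welch's t-test: two-sided by default, does not assume equal variances
t_stat, p_val = stats.ttest_ind(new_prices, old_prices, equal_var=False)

print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_val:.4f}")

# A positive t-statistic means new houses have the higher mean price,
# so H1 is only supported when the result is significant AND t_stat > 0
if p_val < 0.05 and t_stat > 0:
    print("Reject the null hypothesis – New houses are significantly more expensive.")
elif p_val < 0.05:
    print("Significant difference, but new houses are cheaper – H1 is not supported.")
else:
    print("Fail to reject the null hypothesis – No significant difference in price.")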

T-statistic: 2.38
P-value: 0.0185
Reject the null hypothesis – New houses are significantly more expensive.
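The test above is two-sided, while H1 is directional, and a p-value alone does not say how large the difference is. The sketch below shows one way to address both points: a one-sided Welch test through the `alternative="greater"` argument of `stats.ttest_ind` (available in SciPy 1.6 and later) and Cohen's d as a rough effect size. The helper names `one_sided_welch` and `cohens_d` are hypothetical, and the prices are synthetic values generated for illustration, not the project data.

```python
import numpy as np
from scipy import stats


def one_sided_welch(sample_a, sample_b):
    """Welch's t-test of H1: mean(sample_a) > mean(sample_b)."""
    result = stats.ttest_ind(
        sample_a, sample_b, equal_var=False, alternative="greater"
    )
    return result.statistic, result.pvalue


def cohens_d(sample_a, sample_b):
    """Standardised mean difference using the pooled standard deviation."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    pooled_var = (
        (a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)
    ) / (a.size + b.size - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Synthetic, made-up prices purely for illustration (NOT the project data)
rng = np.random.default_rng(0)
demo_new = rng.lognormal(mean=11.2, sigma=0.5, size=200)
demo_old = rng.lognormal(mean=11.0, sigma=0.5, size=2000)

t_demo, p_one_sided = one_sided_welch(demo_new, demo_old)
_, p_two_sided = stats.ttest_ind(demo_new, demo_old, equal_var=False)
d_demo = cohens_d(demo_new, demo_old)

print(f"t = {t_demo:.2f}")
print(f"one-sided p = {p_one_sided:.4g}, two-sided p = {p_two_sided:.4g}")
print(f"Cohen's d = {d_demo:.2f}")
```

With the notebook's data, the same helpers could be called as `one_sided_welch(new_prices, old_prices)` and `cohens_d(new_prices, old_prices)`. When the t-statistic is positive, the one-sided p-value is half the two-sided one.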


## Hypothesis 2: Does house price vary by property type?

- **Null hypothesis (H0):** All property types have the same average price.
- **Alternative hypothesis (H1):** At least one property type has a different average price.

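We'll use a one-way ANOVA, which compares the mean price across the four one-hot encoded property-type columns (`Property_D`, `Property_F`, `Property_S`, `Property_T`). A significant result only indicates that at least one group mean differs; it does not identify which one.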
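As a reminder of what the test measures, the one-way ANOVA F-statistic is the ratio of between-group to within-group variability. The notation is introduced here for illustration: $k$ is the number of groups, $n_g$ and $\bar{x}_g$ are the size and mean price of group $g$, $\bar{x}$ is the overall mean price and $N$ is the total number of observations.

$$
F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}
  = \frac{\displaystyle\sum_{g=1}^{k} n_g \left(\bar{x}_g - \bar{x}\right)^2 \Big/ (k - 1)}
         {\displaystyle\sum_{g=1}^{k} \sum_{i=1}^{n_g} \left(x_{gi} - \bar{x}_g\right)^2 \Big/ (N - k)}
$$

A large $F$ means the group means are spread out more than the variation within groups would explain.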


In [4]:
print(df_chunk.columns)

Index(['Price', 'Date of Transfer', 'Old/New', 'Duration', 'Town/City',
       'County', 'PPDCategory Type', 'Year', 'Month', 'Property_D',
       'Property_F', 'Property_S', 'Property_T'],
      dtype='object')


In [5]:
# Collect the prices for each one-hot encoded property type
property_types = ['Property_D', 'Property_F', 'Property_S', 'Property_T']
grouped_prices = [
    df_chunk.loc[df_chunk[col] == 1, "Price"]
    for col in property_types
    if col in df_chunk.columns  # skip any dummy column missing from the data
]

# One-way ANOVA across the property-type groups
f_stat, p_val = stats.f_oneway(*grouped_prices)
print(f"F-statistic: {f_stat:.2f}")
print(f"P-value: {p_val:.4f}")

if p_val < 0.05:
    print("Reject the null hypothesis – Property type affects price.")
else:
    print("Fail to reject the null hypothesis – No significant difference between property types.")

F-statistic: 32.08
P-value: 0.0000
Reject the null hypothesis – Property type affects price.
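One-way ANOVA assumes roughly normal residuals and similar variances across groups, and house prices are often right-skewed, so it may be worth cross-checking the result. The sketch below runs Levene's test for equal variances, the ANOVA itself, and the rank-based Kruskal-Wallis test, and reports eta squared as a rough effect size. The helper names `compare_groups` and `eta_squared` are hypothetical, and the groups are synthetic values generated for illustration, not the project data.

```python
import numpy as np
from scipy import stats


def compare_groups(groups):
    """Run Levene, one-way ANOVA and Kruskal-Wallis on a list of 1-D samples."""
    _, levene_p = stats.levene(*groups, center="median")
    f_stat, anova_p = stats.f_oneway(*groups)
    h_stat, kruskal_p = stats.kruskal(*groups)
    return {
        "levene_p": levene_p,
        "anova_f": f_stat,
        "anova_p": anova_p,
        "kruskal_h": h_stat,
        "kruskal_p": kruskal_p,
    }


def eta_squared(groups):
    """Share of total variance explained by group membership."""
    arrays = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(arrays)
    grand_mean = all_values.mean()
    ss_between = sum(a.size * (a.mean() - grand_mean) ** 2 for a in arrays)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total


# Synthetic, made-up prices for four groups (NOT the project data)
rng = np.random.default_rng(42)
demo_groups = [
    rng.lognormal(mean=mu, sigma=0.5, size=300)
    for mu in (11.6, 11.0, 11.3, 11.1)
]

results = compare_groups(demo_groups)
for name, value in results.items():
    print(f"{name}: {value:.4g}")
print(f"eta_squared: {eta_squared(demo_groups):.3f}")
```

With the notebook's data, `compare_groups(grouped_prices)` would run the same checks on the real property-type groups. If ANOVA and Kruskal-Wallis agree, the conclusion is less likely to depend on the normality assumption.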
