# 04 – Hypothesis Testing & Statistical Evaluation

**Notebook Name:** `04_Hypothesis_Testing_Statistical_Evaluation.ipynb`

## Objectives
- Welch’s t-test: New vs Old builds.
- One-way ANOVA + Tukey: Price by property type.
- Optional t-test: London vs the rest of England and Wales.
- Report statistics, p-values, and decisions at α=0.05.

## Inputs
- `HousePricesRecords_clean.csv`

## Section 0: Libraries & Load Data

In [4]:
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import matplotlib.pyplot as plt
import seaborn as sns


df_chunk = pd.read_csv("../outputs/datasets/collection/HousePricesRecords_clean.csv")
df_chunk.head()

| Unnamed: 0 | Price | Date of Transfer | Old/New | Duration | Town/City | County | PPDCategory Type | Year | Month | Property_D | Property_F | Property_O | Property_S | Property_T | Region | RegionMedianPrice | RegionSaleCount | CountyMedianPrice | CountySaleCount | LogPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 277000 | 2017-06-29 | 0 | 1 | WICKFORD | ESSEX | A | 2017 | 6 | False | False | False | True | False | WICKFORD | 301000.0 | 2 | 299995.0 | 29 | 12.531776 |
| 1 | 30000 | 2017-06-29 | 0 | 0 | HULL | CITY OF KINGSTON UPON HULL | A | 2017 | 6 | False | True | False | False | False | HULL | 95000.0 | 5 | 95000.0 | 5 | 10.308986 |
| 2 | 551000 | 2017-06-29 | 0 | 1 | CHISLEHURST | GREATER LONDON | A | 2017 | 6 | False | False | False | False | True | CHISLEHURST | 551000.0 | 1 | 485000.0 | 77 | 13.219492 |
| 3 | 240000 | 2017-06-29 | 0 | 1 | BEDFORD | BEDFORD | B | 2017 | 6 | False | False | False | True | False | BEDFORD | 385000.0 | 7 | 385000.0 | 7 | 12.388398 |
| 4 | 527500 | 2017-06-29 | 0 | 1 | HEMEL HEMPSTEAD | HERTFORDSHIRE | B | 2017 | 6 | True | False | False | False | False | HEMEL HEMPSTEAD | 527500.0 | 1 | 336500.0 | 18 | 13.175906 |


## H1 - New vs Old Builds
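
For reference, the two-sided hypotheses and the statistic computed by `stats.ttest_ind(..., equal_var=False)` can be written as below. The subscripts and the symbols $\bar{x}$ (sample mean), $s^2$ (sample variance), $n$ (group size) and $\nu$ (approximate degrees of freedom) are notation introduced here, not taken from the notebook.

$$
H_0: \mu_{\text{new}} = \mu_{\text{old}} \qquad H_1: \mu_{\text{new}} \neq \mu_{\text{old}}
$$

$$
t = \frac{\bar{x}_{\text{new}} - \bar{x}_{\text{old}}}{\sqrt{\dfrac{s_{\text{new}}^2}{n_{\text{new}}} + \dfrac{s_{\text{old}}^2}{n_{\text{old}}}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_{\text{new}}^2}{n_{\text{new}}} + \dfrac{s_{\text{old}}^2}{n_{\text{old}}}\right)^2}{\dfrac{\left(s_{\text{new}}^2 / n_{\text{new}}\right)^2}{n_{\text{new}} - 1} + \dfrac{\left(s_{\text{old}}^2 / n_{\text{old}}\right)^2}{n_{\text{old}} - 1}}
$$

With the new-build sample passed as the first argument, a negative $t$ means the new-build sample mean is below the old-build sample mean.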

In [5]:
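# Split prices by build status (assumes Old/New == 1 marks new builds and 0 marks old builds).
new_prices = df_chunk.loc[df_chunk["Old/New"] == 1, "Price"]
old_prices = df_chunk.loc[df_chunk["Old/New"] == 0, "Price"]

# Welch's t-test: two-sided, does not assume equal variances.
t_stat, p_val = stats.ttest_ind(new_prices, old_prices, equal_var=False)

print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_val:.4f}")

# The test is two-sided, so a rejection only shows that the means differ;
# the sign of t_stat (new minus old) indicates the direction.
if p_val < 0.05:
    print("Reject the null hypothesis – Mean prices of new and old builds differ significantly.")
else:
    print("Fail to reject the null hypothesis – No significant difference in price.")

T-statistic: -3.96
P-value: 0.0051
Reject the null hypothesis – Mean prices of new and old builds differ significantly.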
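
A two-sided test only establishes that the means differ, so the direction has to be read from the sign of the statistic or from the group means. The sketch below shows one way to report both together. The helper name `welch_report`, its labels, and the synthetic data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats


def welch_report(a, b, label_a="A", label_b="B", alpha=0.05):
    """Two-sided Welch's t-test with group means and a directional summary."""
    a = pd.Series(a).dropna()
    b = pd.Series(b).dropna()
    t_stat, p_val = stats.ttest_ind(a, b, equal_var=False)
    mean_diff = a.mean() - b.mean()  # has the same sign as t_stat
    if p_val < alpha:
        direction = "higher" if mean_diff > 0 else "lower"
        decision = f"Reject H0: mean of {label_a} is {direction} than mean of {label_b}."
    else:
        decision = "Fail to reject H0: no significant difference in means."
    return {
        "n_a": len(a),
        "n_b": len(b),
        "mean_a": a.mean(),
        "mean_b": b.mean(),
        "t_stat": t_stat,
        "p_value": p_val,
        "decision": decision,
    }


# Synthetic, made-up data purely to exercise the helper (not the notebook's dataset).
rng = np.random.default_rng(0)
demo_a = rng.normal(loc=200_000, scale=40_000, size=200)
demo_b = rng.normal(loc=260_000, scale=90_000, size=300)
report = welch_report(demo_a, demo_b, label_a="group A", label_b="group B")
print(report["decision"])
```

In the notebook this could be called as `welch_report(new_prices, old_prices, "new builds", "old builds")`. The same helper might also serve the optional London comparison, for example by splitting on `County == "GREATER LONDON"`, assuming that label identifies London sales in this dataset.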


## H2 - Price by Property Type

In [6]:
# One-hot property-type columns to compare (Property_O is not included).
property_types = ['Property_D', 'Property_F', 'Property_S', 'Property_T']
# One price series per property type; columns missing from the data are skipped.
grouped_prices = [df_chunk.loc[df_chunk[col] == 1, "Price"] for col in property_types if col in df_chunk.columns]

# One-way ANOVA across the property-type groups.
f_stat, p_val = stats.f_oneway(*grouped_prices)
print(f"F-statistic: {f_stat:.2f}")
print(f"P-value: {p_val:.4f}")

if p_val < 0.05:
    print("Reject the null hypothesis – Mean price differs between property types.")
else:
    print("Fail to reject the null hypothesis – No significant difference between property types.")

F-statistic: 43.93
P-value: 0.0000
Reject the null hypothesis – Mean price differs between property types.
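
A significant ANOVA indicates only that at least one group mean differs; a Tukey HSD post-hoc test identifies which pairs differ. The sketch below shows one possible way to run it with `pairwise_tukeyhsd`, which expects a single group label per row rather than one-hot columns. The helper name `tukey_by_property_type` and the synthetic data are illustrative assumptions, and rows not flagged with exactly one of the listed types are dropped.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd


def tukey_by_property_type(df, type_cols, value_col="Price", alpha=0.05):
    """Collapse one-hot property columns into one label and run Tukey's HSD."""
    flags = df[type_cols].astype(int)
    # Keep only rows flagged with exactly one of the listed types.
    flags = flags.loc[flags.sum(axis=1) == 1]
    labels = flags.idxmax(axis=1)  # name of the column holding the 1
    values = df.loc[flags.index, value_col]
    return pairwise_tukeyhsd(
        endog=values.to_numpy(), groups=labels.to_numpy(), alpha=alpha
    )


# Synthetic, made-up data purely to exercise the helper (not the notebook's dataset).
rng = np.random.default_rng(1)
type_cols = ["Property_D", "Property_F", "Property_S", "Property_T"]
demo_means = {
    "Property_D": 400_000,
    "Property_F": 200_000,
    "Property_S": 250_000,
    "Property_T": 210_000,
}
frames = []
for col, mu in demo_means.items():
    part = pd.DataFrame({"Price": rng.normal(mu, 30_000, size=100)})
    for c in type_cols:
        part[c] = c == col  # one-hot flag for this group
    frames.append(part)
demo = pd.concat(frames, ignore_index=True)

result = tukey_by_property_type(demo, type_cols)
print(result.summary())
```

In the notebook the call would presumably be `tukey_by_property_type(df_chunk, property_types)`. The summary table lists each pair of property types with its mean difference, adjusted p-value, confidence interval, and reject flag.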
