# 04 – Hypothesis Testing & Statistical Evaluation  
**CRISP-DM Phase 5: Evaluation**  
We test our market hypotheses with formal statistical tests and decide whether to reject or fail to reject each null hypothesis at α = 0.05.

### Objectives
* Formulate and test the hypotheses:  
  1. **H1** – New builds have a higher mean sale price than existing homes (Welch’s t-test).  
  2. **H2** – Price differs by Property Type (one-way ANOVA + Tukey post-hoc).  
  3. **(Optional H3)** – County-level comparison (e.g. London vs Rest of E&W).  
* Compute test statistics and p-values.  
* Make clear reject / fail-to-reject decisions at α = 0.05.

### Inputs
* `outputs/datasets/collection/HousePricesRecords_clean.csv`  

### Outputs
* Inline test summaries: t-statistic, F-statistic, p-values  
* Markdown verdicts (“Reject H₀” or “Fail to reject H₀”)  

### Additional Comments  
#### Business Requirements Addressed  
* **BR2**: Provides statistical evidence on market hypotheses for the Hypotheses tab.  

#### Additional Notes  
* Save any summary tables (e.g. ANOVA table) for download in the Streamlit page.  

---

### Import Required Libraries  
This cell brings in the packages needed to perform our statistical tests and visualise the results:  
- **pandas** (`pd`) for data manipulation.  
- **scipy.stats** for Welch’s t-test and one-way ANOVA functions.  
- **matplotlib.pyplot** and **seaborn** for plotting test results and distributions.


In [1]:
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

#### Load Cleaned Dataset for Testing  
This cell reads in the fully cleaned dataset (`HousePricesRecords_clean.csv`) produced in Notebook 02. We display the first few rows with `.head()` to confirm that the data has loaded correctly and contains the expected columns for our statistical tests.


In [2]:

df_chunk = pd.read_csv("../outputs/datasets/collection/HousePricesRecords_clean.csv")
df_chunk.head()

| | Price | Date of Transfer | Old/New | Duration | Town/City | County | PPDCategory Type | Year | Month | Property_D | Property_F | Property_S | Property_T | Region | RegionMedianPrice | RegionSaleCount | CountyMedianPrice | CountySaleCount | LogPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25000 | 1995-08-18 | 0 | 1 | OLDHAM | GREATER MANCHESTER | A | 1995 | 8 | False | False | False | True | OLDHAM | 27675.0 | 4 | 37000.0 | 40 | 10.126671 |
| 1 | 42500 | 1995-08-09 | 0 | 1 | GRAYS | THURROCK | A | 1995 | 8 | False | False | True | False | GRAYS | 54995.0 | 3 | 55000.0 | 5 | 10.657283 |
| 2 | 45000 | 1995-06-30 | 0 | 1 | HIGHBRIDGE | SOMERSET | A | 1995 | 6 | False | False | False | True | HIGHBRIDGE | 32500.0 | 2 | 41500.0 | 9 | 10.71444 |
| 3 | 43150 | 1995-11-24 | 0 | 1 | BEDFORD | BEDFORDSHIRE | A | 1995 | 11 | False | False | False | True | BEDFORD | 79995.0 | 5 | 76747.5 | 8 | 10.672461 |
| 4 | 18899 | 1995-06-23 | 0 | 1 | WAKEFIELD | WEST YORKSHIRE | A | 1995 | 6 | False | False | True | False | WAKEFIELD | 53500.0 | 5 | 46750.0 | 38 | 9.846917 |


## Hypothesis 1: Are new houses more expensive than old ones?

- **Null hypothesis (H0):** The mean sale price of new houses equals the mean sale price of old houses.
- **Alternative hypothesis (H1):** The mean sale price of new houses is higher than that of old houses.

We'll compare the two groups with Welch’s t-test, an independent two-sample t-test that does not assume equal variances.


#### Separate Price Series for New and Existing Homes  
This cell filters the cleaned DataFrame into two pandas Series based on the `Old/New` flag:  
- **`new_prices`** contains the sale prices where `Old/New == 1` (new builds).  
- **`old_prices`** contains the sale prices where `Old/New == 0` (existing homes).  

By isolating these two groups, we can later apply Welch’s t-test to compare their mean prices.


In [3]:
# Split data into two groups
new_prices = df_chunk[df_chunk["Old/New"] == 1]["Price"]
old_prices = df_chunk[df_chunk["Old/New"] == 0]["Price"]

#### Perform Welch’s t-test on New vs Existing Homes  
This cell applies Welch’s t-test (which does not assume equal variances) to compare the mean sale prices of new builds (`new_prices`) against existing homes (`old_prices`):

1. `stats.ttest_ind(..., equal_var=False)` returns the **t-statistic** and **p-value**. The p-value is two-sided by default, and a positive t-statistic means the first group (`new_prices`) has the higher sample mean.  
2. The **p-value** is compared to the significance level (α = 0.05).  
3. If `p_val < 0.05`, we **reject the null hypothesis**, indicating a statistically significant difference in mean prices; the sign of the t-statistic gives the direction of that difference. Otherwise, we **fail to reject** the null hypothesis.
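
For reference, the quantities behind `equal_var=False` can be written out as follows. The symbols are notation introduced here for illustration (sample means $\bar{x}$, sample variances $s^2$, and group sizes $n$ for the new and old groups); they do not appear elsewhere in the notebook.

$$
t = \frac{\bar{x}_{\text{new}} - \bar{x}_{\text{old}}}{\sqrt{\dfrac{s_{\text{new}}^2}{n_{\text{new}}} + \dfrac{s_{\text{old}}^2}{n_{\text{old}}}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_{\text{new}}^2}{n_{\text{new}}} + \dfrac{s_{\text{old}}^2}{n_{\text{old}}}\right)^{2}}{\dfrac{\left(s_{\text{new}}^2 / n_{\text{new}}\right)^{2}}{n_{\text{new}} - 1} + \dfrac{\left(s_{\text{old}}^2 / n_{\text{old}}\right)^{2}}{n_{\text{old}} - 1}}
$$

Here $\nu$ is the Welch–Satterthwaite approximation to the degrees of freedom, which is what lets the test drop the equal-variance assumption.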


In [4]:
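# Welch's t-test (two-sided by default); a positive t_stat means new builds have the higher mean
t_stat, p_val = stats.ttest_ind(new_prices, old_prices, equal_var=False)

print(f"T-statistic: {t_stat:.2f}")
print(f"P-value: {p_val:.4f}")

# Check the sign of t_stat so the verdict states the correct direction
if p_val < 0.05 and t_stat > 0:
    print("Reject the null hypothesis – New houses are significantly more expensive.")
elif p_val < 0.05:
    print("Reject the null hypothesis – New houses are significantly cheaper.")
else:
    print("Fail to reject the null hypothesis – No significant difference in price.")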

T-statistic: 3.26
P-value: 0.0013
Reject the null hypothesis – New houses are significantly more expensive.
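
Because H1 is directional (new builds are more expensive), a one-sided test matches the hypothesis more closely than the two-sided default. The sketch below shows how the two variants relate, assuming SciPy 1.6 or later (where `ttest_ind` accepts `alternative`). The arrays `new_demo` and `old_demo` are synthetic stand-ins with made-up values, not the notebook's data, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for new_prices / old_prices (illustrative values only)
rng = np.random.default_rng(0)
new_demo = rng.normal(loc=260_000, scale=90_000, size=200)
old_demo = rng.normal(loc=200_000, scale=60_000, size=800)

# Two-sided Welch test (the SciPy default)
t_two, p_two = stats.ttest_ind(new_demo, old_demo, equal_var=False)

# One-sided Welch test: H1 is mean(new) > mean(old)
t_one, p_one = stats.ttest_ind(
    new_demo, old_demo, equal_var=False, alternative="greater"
)

print(f"t = {t_one:.2f}")
print(f"two-sided p = {p_two:.4g}, one-sided p = {p_one:.4g}")
```

The t-statistic is identical in both calls; when it is positive, the one-sided p-value is half the two-sided one.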


## Hypothesis 2: Does house price vary by property type?

- **Null hypothesis (H0):** All property types have the same average price.
- **Alternative hypothesis (H1):** At least one property type has a different average price.

We'll use a one-way ANOVA test.


#### List All Columns in the Cleaned DataFrame  
This cell prints out all column names in `df_chunk`, helping you confirm which features are available for grouping and statistical tests (e.g., property flags, date fields) and ensuring you reference the correct names in subsequent analysis.


In [5]:
print(df_chunk.columns)

Index(['Price', 'Date of Transfer', 'Old/New', 'Duration', 'Town/City',
       'County', 'PPDCategory Type', 'Year', 'Month', 'Property_D',
       'Property_F', 'Property_S', 'Property_T', 'Region', 'RegionMedianPrice',
       'RegionSaleCount', 'CountyMedianPrice', 'CountySaleCount', 'LogPrice'],
      dtype='object')


#### Extract Sale Prices by Property Type  
This cell defines the one-hot encoded property type columns (`Property_D`, `Property_F`, `Property_S`, `Property_T`) and then builds a list of pandas Series, each containing the sale prices for properties where that flag is set. The flags are stored as booleans, so the comparison `== 1` selects the rows where the flag is `True`. The resulting `grouped_prices` list will be used as input to the one-way ANOVA test in the next step.


In [6]:
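# One-hot encoded property type flags
property_types = ['Property_D', 'Property_F', 'Property_S', 'Property_T']
# One price Series per property type; columns missing from the DataFrame are skipped
grouped_prices = [df_chunk[df_chunk[col] == 1]["Price"] for col in property_types if col in df_chunk.columns]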
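
Before running ANOVA it can help to check that every group is non-empty and to glance at the group means. The sketch below does this on a small made-up frame named `toy` that only mimics the one-hot layout of the cleaned dataset; the prices are illustrative and `summary` is a hypothetical helper table, not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking the one-hot layout of the cleaned dataset (values are made up)
toy = pd.DataFrame({
    "Price": [250_000, 90_000, 140_000, 110_000, 310_000, 95_000],
    "Property_D": [True, False, False, False, True, False],
    "Property_F": [False, True, False, False, False, True],
    "Property_S": [False, False, True, False, False, False],
    "Property_T": [False, False, False, True, False, False],
})
property_types = ["Property_D", "Property_F", "Property_S", "Property_T"]

# Boolean flags can be used directly as a row mask
grouped = {
    col: toy.loc[toy[col], "Price"]
    for col in property_types
    if col in toy.columns
}

# Per-group size and mean, to spot empty or very small groups
summary = pd.DataFrame({
    "n": {col: len(prices) for col, prices in grouped.items()},
    "mean_price": {col: prices.mean() for col, prices in grouped.items()},
})
print(summary)
```

If the flags are mutually exclusive and exhaustive, the `n` column sums to the number of rows in the frame.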



#### Conduct One-Way ANOVA Across Property Types  
This cell runs a one-way ANOVA to test whether mean sale prices differ significantly between our four property-type groups (`Property_D`, `Property_F`, `Property_S`, `Property_T`):

1. `stats.f_oneway(*grouped_prices)` computes the **F-statistic** and **p-value** across the input price series.  
2. We compare the **p-value** to α = 0.05.  
3. If `p_val < 0.05`, we **reject the null hypothesis**, indicating that at least one property type has a significantly different average price; ANOVA alone does not say which types differ. Otherwise, we **fail to reject** the null.
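
As a reminder of what the F-statistic measures, it is the ratio of between-group to within-group variability. The notation here is introduced for illustration: $k$ groups (four in this notebook), $n_g$ observations in group $g$, $N$ observations in total, group means $\bar{x}_g$, and overall mean $\bar{x}$.

$$
F = \frac{\displaystyle\sum_{g=1}^{k} n_g \left(\bar{x}_g - \bar{x}\right)^2 \Big/ (k - 1)}{\displaystyle\sum_{g=1}^{k} \sum_{i=1}^{n_g} \left(x_{gi} - \bar{x}_g\right)^2 \Big/ (N - k)}
$$

A large $F$ means the group means are spread out relative to the scatter within each group.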


In [7]:
f_stat, p_val = stats.f_oneway(*grouped_prices)
print(f"F-statistic: {f_stat:.2f}")
print(f"P-value: {p_val:.4f}")

if p_val < 0.05:
    print("Reject the null hypothesis – Property type affects price.")
else:
    print("Fail to reject the null hypothesis – No significant difference between property types.")

F-statistic: 74.65
P-value: 0.0000
Reject the null hypothesis – Property type affects price.
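
The objectives list a Tukey post-hoc comparison to follow the ANOVA, which identifies the specific pairs of property types that differ. Below is a minimal sketch of that step, assuming SciPy 1.8 or later (where `scipy.stats.tukey_hsd` is available). The `demo_groups` data are synthetic and made up for illustration, so the output is not a result from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic price samples per property type (illustrative values only)
rng = np.random.default_rng(42)
demo_groups = {
    "Property_D": rng.normal(300_000, 50_000, 60),
    "Property_F": rng.normal(150_000, 50_000, 60),
    "Property_S": rng.normal(200_000, 50_000, 60),
    "Property_T": rng.normal(160_000, 50_000, 60),
}
labels = list(demo_groups)

# Tukey's HSD compares every pair of groups while controlling the family-wise error rate
res = stats.tukey_hsd(*demo_groups.values())

# res.pvalue[i, j] is the adjusted p-value for the pair (labels[i], labels[j])
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        p_adj = res.pvalue[i, j]
        verdict = "differ" if p_adj < 0.05 else "no evidence of a difference"
        print(f"{labels[i]} vs {labels[j]}: p = {p_adj:.4f} ({verdict})")
```

On the real data, the same call would take `*grouped_prices`, with `property_types` supplying the labels.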
