# 03 – Exploratory Data Analysis

**Notebook Name:** `03_Exploratory_Data_Analysis.ipynb`

## Objectives
- Compute summary statistics for numeric fields.
- Visualise distribution of `Price` (histogram + KDE, log-scale).
- Plot time trends by Year and Year–Month.
- Compare average prices by County.
- Generate Pearson & Spearman correlation heatmaps.

## Inputs
- `outputs/datasets/collection/HousePricesRecords_clean.csv`

## Outputs
- Histograms, line charts, bar charts, heatmaps
- `Key Insights` summary

## Libraries & Data Load

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

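df_full = pd.read_csv('../outputs/datasets/collection/HousePricesRecords_clean.csv')
# Cap the sample at the population size: sample(n=1000) without replacement raises
# "ValueError: Cannot take a larger sample than population when 'replace=False'" on smaller files.
n_sample = min(1000, len(df_full))
df = df_full.sample(n=n_sample, random_state=42).reset_index(drop=True)
print(f'Sample size: {len(df)} of {len(df_full)} total')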

## Descriptive Statistics

In [None]:
df.describe()

## Categorical Value Counts

In [None]:
for col in ['Old/New','Duration','County','Town/City']:
    print(col, df[col].value_counts(), sep=':\n', end='\n\n')

## Price Distribution

In [None]:
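plt.figure(figsize=(10,6))
# Log-scale the x-axis (as listed in the objectives) so the long right tail stays readable
sns.histplot(df['Price'], bins=50, kde=True, log_scale=True)
plt.title('Price Distribution (log scale)')
plt.xlabel('Price')
plt.show()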
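
The objectives call for viewing `Price` on a log scale. The reason is that house prices are usually strongly right-skewed, and a log transform pulls the tail in. The sketch below illustrates this on synthetic log-normal data, not the notebook's dataset; the distribution parameters are assumptions chosen only to produce a plausible skew.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (log-normal); NOT the notebook's data.
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=5000), name="Price")

# Sample skewness before and after a log10 transform.
raw_skew = prices.skew()
log_skew = np.log10(prices).skew()

print(f"raw skew:   {raw_skew:.2f}")
print(f"log10 skew: {log_skew:.2f}")
```

A skewness near zero after the transform suggests the log-scaled histogram will look roughly symmetric. Running the same two `.skew()` calls on `df['Price']` would show whether that holds for the real sample.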

## Time Trends

In [None]:
plt.figure(figsize=(10,5))
avg_year = df.groupby('Year')['Price'].mean()
avg_year.plot(marker='o')
plt.title('Avg Price by Year')
plt.show()

plt.figure(figsize=(12,6))
df['YearMonth'] = pd.to_datetime(df['Date of Transfer']).dt.to_period('M')
avg_ym = df.groupby('YearMonth')['Price'].mean()
avg_ym.plot()
plt.title('Avg Price by Year-Month')
plt.xticks(rotation=45)
plt.show()
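
With a sample of at most 1,000 rows, some months may contain only a handful of sales, so a monthly mean can swing on a single expensive property. One possible check, sketched here on synthetic data, is to report the median and the number of sales alongside the mean. The column names mirror the notebook, while the data, the `demo` frame, and the `MIN_SALES` threshold are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic transactions; column names mirror the notebook, values are made up.
rng = np.random.default_rng(0)
offsets = pd.to_timedelta(rng.integers(0, 730, size=300), unit="D")
demo = pd.DataFrame({
    "Date of Transfer": pd.to_datetime("2020-01-01") + offsets,
    "Price": rng.lognormal(mean=12.0, sigma=0.5, size=300),
})
demo["YearMonth"] = demo["Date of Transfer"].dt.to_period("M")

# Mean, median and number of sales per month.
monthly = demo.groupby("YearMonth")["Price"].agg(["mean", "median", "count"])

# Hypothetical threshold: flag months with too few sales to trust the mean.
MIN_SALES = 10
monthly["reliable"] = monthly["count"] >= MIN_SALES

print(monthly.head())
```

Months flagged as unreliable could be greyed out in the line chart or excluded, depending on how the trend will be interpreted.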