# 03 - Exploratory Data Analysis

In this notebook, we explore the cleaned house price dataset to understand patterns and trends in the data.

**Objectives:**
- Understand the distribution of house prices
- Explore trends over time
- Identify differences between regions
- Understand correlations between variables


The cell below imports three Python libraries:

- `pandas`, which helps work with data in tables
- `matplotlib.pyplot`, which lets you create charts
- `seaborn`, which helps you make nicer, more advanced charts

It then reads the cleaned data from the saved file and stores it in a variable called `df_chunk`, ready for exploration.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


df_chunk = pd.read_csv("../outputs/datasets/collection/HousePricesRecords_clean.csv")

### Summary Statistics
This gives an overview of the numerical columns: count, mean, standard deviation, minimum, quartiles, and maximum.

In [None]:
df_chunk.describe()
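One way to read the summary table is to compare the mean with the median (the `50%` row): a mean well above the median suggests a few very high values are pulling the average up. The sketch below illustrates this on a small made-up frame; `toy` and its values are hypothetical and only the `Price` column name mirrors this notebook.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration (not taken from the real dataset).
toy = pd.DataFrame({"Price": [100_000, 150_000, 200_000, 250_000, 1_300_000]})

# describe() returns a Series for a single column, indexed by statistic name.
summary = toy["Price"].describe()

# Gap between mean and median: a large positive gap hints at a right-skewed column.
mean_median_gap = summary["mean"] - summary["50%"]
print(summary[["mean", "50%"]])
print("Mean minus median:", mean_median_gap)
```

The same comparison could be run on `df_chunk["Price"]` once the data is loaded.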

### Value Counts for Key Categorical Columns
Shows how often each value appears in the selected columns. This helps spot issues, such as unexpected or misspelled categories, and shows how the data is distributed.

In [None]:
categorical_cols = ["Old/New", "Duration", "County", "Town/City"]
for col in categorical_cols:
    print(f"\nValue counts for {col}:")
    print(df_chunk[col].value_counts())
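Raw counts can be hard to compare across columns, so proportions are sometimes easier to read. A minimal sketch using `value_counts(normalize=True)` on a made-up column follows; the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({"Old/New": ["N", "N", "N", "Y"]})

# normalize=True returns the share of rows per value instead of the raw count.
shares = toy["Old/New"].value_counts(normalize=True)
print(shares)
```

Passing `dropna=False` as well would include missing values in the tally, which can be useful when checking data quality.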

### Distribution of House Prices
This histogram shows the spread of house prices across 50 bins, with a kernel density estimate (KDE) curve overlaid.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df_chunk['Price'], bins=50, kde=True)
plt.title('Distribution of House Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
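If the histogram turns out to have a long right tail, a log transform is one common way to make the bulk of the distribution easier to see. The sketch below shows the transform itself on made-up prices; `toy_prices` is hypothetical, and whether the real data needs this is an assumption to check against the plot above.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, invented for illustration only.
toy_prices = pd.Series([100_000, 1_000_000, 10_000_000], name="Price")

# log10 compresses large values: each step of 1 corresponds to a tenfold price increase.
log_prices = np.log10(toy_prices)
print(log_prices)
```

Recent seaborn versions also accept `log_scale=True` in `sns.histplot`, which applies the same idea to the axis without changing the data.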

### Average House Price Per Year
This line chart plots the mean price for each year to highlight pricing trends over time.

In [None]:
avg_price_year = df_chunk.groupby("Year")['Price'].mean()
avg_price_year.plot(kind='line', marker='o', figsize=(10, 5))
plt.title('Average House Price Per Year')
plt.ylabel('Average Price')
plt.xlabel('Year')
plt.grid(True)
plt.show()
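The mean can be pulled upward by a handful of very expensive sales, so it may be worth comparing it with the median and the number of sales per year. A minimal sketch on made-up data follows; `toy` and its values are hypothetical, and only the `Year` and `Price` column names come from this notebook.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020, 2020],
    "Price": [100_000, 200_000, 900_000, 150_000, 250_000],
})

# Compute several statistics per year in one pass.
yearly = toy.groupby("Year")["Price"].agg(["mean", "median", "count"])
print(yearly)
```

In this toy example the 2019 mean is twice the median because of a single high sale, which is the kind of effect the comparison is meant to reveal.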

### Average House Price by County
This bar chart shows regional price differences, with counties ordered from the highest to the lowest average price.

In [None]:
plt.figure(figsize=(14, 6))
df_chunk.groupby("County")["Price"].mean().sort_values(ascending=False).plot(kind='bar')
plt.title("Average House Price by County")
plt.ylabel("Average Price")
plt.xticks(rotation=90)
plt.show()
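With many counties, a bar chart of every one can be crowded. One option is to keep only the top few by average price. The sketch below uses `nlargest` on made-up data; the `toy` frame and its values are hypothetical, and the cut-off of 2 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "County": ["KENT", "KENT", "ESSEX", "SURREY", "SURREY"],
    "Price": [100_000, 200_000, 300_000, 200_000, 300_000],
})

# Mean price per county, then keep the 2 highest averages (already sorted descending).
top_counties = toy.groupby("County")["Price"].mean().nlargest(2)
print(top_counties)
```

The resulting Series can be passed to `.plot(kind='bar')` in the same way as the full ranking.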

### Correlation Heatmap
This heatmap shows the pairwise Pearson correlations between the numerical variables, helping to reveal which ones move together.

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(df_chunk.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
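`DataFrame.corr` uses Pearson correlation by default, which measures linear relationships and can be sensitive to extreme values. As a cross-check, a rank-based (Spearman) correlation can be computed the same way. The sketch below compares the two on made-up data; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration: Price always rises with Year,
# but not in a straight line.
toy = pd.DataFrame({
    "Year": [1, 2, 3, 4],
    "Price": [10, 20, 40, 1000],
})

# Pearson measures linear association; Spearman measures monotonic association via ranks.
# All columns here are numeric, so numeric_only is not needed for this toy frame.
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print("Pearson: ", pearson.loc["Year", "Price"])
print("Spearman:", spearman.loc["Year", "Price"])
```

A large gap between the two values suggests a relationship that is monotonic but not linear, or one that is influenced by outliers.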