# HomeVista: Exploratory Data Analysis (EDA)

**Goal:** Validate the "Market Simulation Engine" by comparing Real (Calibration) Data vs. Synthetic (Simulated) Data.

## 1. Setup & Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
from pathlib import Path

# Set plot style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Load Data
DATA_PATH = Path('../data/processed/analytical_dataset.csv')
df = pd.read_csv(DATA_PATH)

print(f"Dataset Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head()
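
The later cells rely on a fixed set of column names, so checking for them right after loading can make a schema mismatch fail early. The sketch below is a minimal illustration: the list is taken from the columns used in this notebook, while the helper name `missing_columns` and the toy frame are hypothetical.

```python
import pandas as pd

# Columns referenced by the cells in this notebook.
EXPECTED_COLUMNS = [
    "data_source", "annual_rent", "size_sqft", "bedrooms", "bathrooms",
    "amenity_count", "price_per_sqft", "neighborhood", "tier",
]


def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame`."""
    return [col for col in expected if col not in frame.columns]


# Toy frame for illustration only; in the notebook, pass `df` instead.
toy = pd.DataFrame({"annual_rent": [90_000], "size_sqft": [850], "tier": ["mid"]})
missing = missing_columns(toy)
print(missing)
```

An empty list means every column the notebook needs is present.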

## 2. Market Simulation Validation

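The simulation engine is only credible if the **Synthetic Data (Simulation)** reproduces the rent distribution of the **Real Data (Calibration)**. We first count the rows from each source, then overlay the two rent distributions as separately normalized densities so that differing sample sizes do not distort the comparison.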

In [None]:
# Check Data Source Distribution
source_counts = df['data_source'].value_counts()
print("Data Source Distribution:")
print(source_counts)

# Plot Rent Distribution by Source
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='annual_rent', hue='data_source', element='step', stat='density', common_norm=False)
plt.title('Rent Distribution: Real vs. Simulated')
plt.xlabel('Annual Rent (AED)')
plt.xlim(0, 400000)  # Focus on main range
plt.show()
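
A histogram overlay is a visual check only. One way to put numbers next to it is to compare rent quantiles per source, as in the sketch below. It assumes the `data_source` and `annual_rent` columns used above. The helper name `compare_rent_quantiles`, the source labels `real` and `synthetic`, and the toy values are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def compare_rent_quantiles(
    frame: pd.DataFrame,
    value_col: str = "annual_rent",
    source_col: str = "data_source",
    quantiles=(0.1, 0.25, 0.5, 0.75, 0.9),
) -> pd.DataFrame:
    """Return one row per quantile and one column per data source."""
    table = (
        frame.groupby(source_col)[value_col]
        .quantile(list(quantiles))
        .unstack(level=0)  # move the source labels into columns
    )
    table.index.name = "quantile"
    return table


# Toy frame for illustration only; the labels and rents are made up.
toy = pd.DataFrame({
    "data_source": ["real"] * 4 + ["synthetic"] * 4,
    "annual_rent": [50_000, 60_000, 70_000, 80_000,
                    55_000, 65_000, 75_000, 85_000],
})
quantile_table = compare_rent_quantiles(toy, quantiles=(0.5,))
print(quantile_table)
```

In the notebook, `compare_rent_quantiles(df)` would show at which quantiles the simulated rents drift away from the calibration data, which the density plot alone does not quantify.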

## 3. Feature Analysis

This section examines how the numeric features correlate with annual rent to identify the key drivers of rental price.

In [None]:
# Correlation Matrix
numeric_cols = ['annual_rent', 'size_sqft', 'bedrooms', 'bathrooms', 'amenity_count', 'price_per_sqft']
corr = df[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()
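
The heatmap shows every pairwise correlation, but the question here is which features move with rent. The sketch below ranks numeric features by the absolute value of their correlation with `annual_rent`. The helper name `rank_rent_correlations` and the toy values are hypothetical. One caveat rests on an assumption: if `price_per_sqft` is computed from `annual_rent` (the notebook does not say how it is derived), its correlation with rent is partly mechanical and should not be read as an independent driver.

```python
import pandas as pd


def rank_rent_correlations(
    frame: pd.DataFrame, target: str = "annual_rent", method: str = "pearson"
) -> pd.Series:
    """Correlation of each numeric column with `target`, strongest first."""
    corr = frame.select_dtypes("number").corr(method=method)[target].drop(target)
    # Order by absolute strength while keeping the sign of each correlation.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy frame for illustration only: rent is exactly 75 * size_sqft.
toy = pd.DataFrame({
    "size_sqft": [500, 800, 1200, 2000],
    "bedrooms": [2, 1, 2, 3],
    "annual_rent": [37_500, 60_000, 90_000, 150_000],
})
ranked = rank_rent_correlations(toy)
print(ranked)
```

In the notebook, `rank_rent_correlations(df[numeric_cols])` would give the same information as the `annual_rent` row of the heatmap, sorted for easier reading.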

## 4. Location Analysis

Rent distribution by neighborhood, with boxes colored by tier.

In [None]:
plt.figure(figsize=(14, 8))
sns.boxplot(data=df, x='annual_rent', y='neighborhood', hue='tier')
plt.title('Rent Distribution by Neighborhood')
plt.xlabel('Annual Rent (AED)')
plt.show()
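
A box plot with many neighborhoods is easier to read next to a sorted summary table. The sketch below computes the median rent and listing count per tier and neighborhood, using the `tier`, `neighborhood`, and `annual_rent` columns from the plot above. The helper name `median_rent_by_neighborhood` and the toy labels and rents are hypothetical.

```python
import pandas as pd


def median_rent_by_neighborhood(frame: pd.DataFrame) -> pd.DataFrame:
    """Median rent and listing count per (tier, neighborhood), highest rent first."""
    return (
        frame.groupby(["tier", "neighborhood"])["annual_rent"]
        .agg(median_rent="median", listings="count")
        .reset_index()
        .sort_values("median_rent", ascending=False, ignore_index=True)
    )


# Toy frame for illustration only; names and rents are made up.
toy = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "B"],
    "tier": ["premium", "premium", "standard", "standard", "standard"],
    "annual_rent": [200_000, 220_000, 80_000, 100_000, 90_000],
})
summary = median_rent_by_neighborhood(toy)
print(summary)
```

The resulting neighborhood order can also be passed to `sns.boxplot` through its `order` parameter, so that the boxes appear from the highest to the lowest median rent.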