# Exploratory Data Analysis (EDA)

## House Prices Dataset

### Purpose
This notebook explores the House Prices dataset to understand:
- Data structure and quality
- Feature distributions
- Missing-value patterns
- Relationships between features and target
- Potential data quality issues

### Key Questions
1. What is the distribution of house prices?
2. Which features are most correlated with price?
3. How many missing values are there?
4. Are there outliers?
5. What feature engineering opportunities exist?
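Questions 2 and 3 can be answered with a couple of small pandas helpers. The sketch below is illustrative only: `missing_summary`, `top_correlations`, and the `demo` frame are hypothetical names and made-up rows, not part of this project's `src` package or the real dataset. It assumes the target column is called `SalePrice`, as in the cells that follow.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(df),
    })
    # Keep only columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


def top_correlations(df: pd.DataFrame, target: str, n: int = 10) -> pd.Series:
    """Numeric features ranked by absolute Pearson correlation with the target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(n)


# Tiny made-up frame for illustration only; not real House Prices rows.
demo = pd.DataFrame({
    "SalePrice": [100_000, 150_000, 200_000, 250_000, 300_000],
    "GrLivArea": [800, 1200, 1500, 2100, 2400],
    "LotFrontage": [60.0, np.nan, 70.0, np.nan, 85.0],
    "Neighborhood": ["A", "B", "A", None, "C"],
})

print(missing_summary(demo))
print(top_correlations(demo, target="SalePrice"))
```

On the real data, the same helpers could be called as `missing_summary(train_df)` and `top_correlations(train_df, target='SalePrice')`. Pearson correlation only captures linear relationships and ignores categorical columns, so it is a first pass rather than a full feature ranking.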

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
# Load data
import sys
from pathlib import Path

# Add src to path
sys.path.append(str(Path('..').resolve()))

from src.data_loader import DataLoader

loader = DataLoader(data_dir='../data')
train_df = loader.load_train_data()

print(f"Dataset shape: {train_df.shape}")
print(f"\nColumns: {train_df.columns.tolist()}")

## 1. Target Variable Analysis

In [None]:
# Analyze SalePrice (target variable)
target = train_df['SalePrice']

print("Target Statistics:")
print(target.describe())

# Plot distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
axes[0].hist(target, bins=50, edgecolor='black')
axes[0].set_title('SalePrice Distribution')
axes[0].set_xlabel('Sale Price ($)')
axes[0].set_ylabel('Frequency')

# log1p transformation, i.e. log(1 + x); often useful for right-skewed targets
axes[1].hist(np.log1p(target), bins=50, edgecolor='black', color='orange')
axes[1].set_title('log1p(SalePrice) Distribution')
axes[1].set_xlabel('log1p(Sale Price)')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# pandas reports excess kurtosis (Fisher definition), so a normal distribution scores 0
print(f"\nSkewness: {target.skew():.2f}")
print(f"Excess kurtosis: {target.kurtosis():.2f}")
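To see why the log-scale histogram matters, it helps to compare skewness before and after the transformation. The sketch below uses a synthetic log-normal sample as a stand-in for prices (an assumption for illustration; the values are generated, not taken from `SalePrice`), and also shows that `np.expm1` undoes `np.log1p`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (log-normal); not the real SalePrice column.
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5_000))
log_prices = np.log1p(prices)

print(f"Skewness (raw):   {prices.skew():.2f}")
print(f"Skewness (log1p): {log_prices.skew():.2f}")

# np.expm1 inverts np.log1p, so values on the log scale
# can be mapped back to the original units.
recovered = np.expm1(log_prices)
print(f"Round trip OK: {np.allclose(recovered, prices)}")
```

If the real target behaves similarly, the same comparison is `target.skew()` versus `np.log1p(target).skew()`. A skewness much closer to 0 after the transform suggests that modeling the log-scale target may be worth trying.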