# 🏠 House Price Prediction - Exploratory Data Analysis

This notebook provides a comprehensive exploratory data analysis (EDA) for the Zillow house price prediction dataset.

## 📋 Table of Contents
1. [Data Loading and Basic Information](#data-loading)
2. [Data Quality Assessment](#data-quality)
3. [Univariate Analysis](#univariate-analysis)
4. [Bivariate Analysis](#bivariate-analysis)
5. [Multivariate Analysis](#multivariate-analysis)
6. [Feature Engineering Insights](#feature-engineering)
7. [Data Preprocessing Recommendations](#preprocessing-recommendations)

---

## 🎯 Project Objective
Predict house prices from historical Zillow data using features such as location, square footage, and the number of bedrooms and bathrooms.


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Import our custom modules
import sys
sys.path.append('../src')
from data_preprocessing import DataPreprocessor, create_sample_data

print("📚 Libraries imported successfully!")
print("🔧 Custom modules loaded!")


## 1. Data Loading and Basic Information {#data-loading}

Let's start by loading the dataset and understanding its structure.


In [None]:
# Initialize the data preprocessor
preprocessor = DataPreprocessor()

# For this demo, we'll create sample data that mimics the Zillow dataset structure
# In a real scenario, you would load the actual dataset:
# df = preprocessor.load_data('../data/zillow.csv')

# Create sample data for demonstration
df = create_sample_data()

print("📊 Sample dataset created!")
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Display basic information
preprocessor.basic_info(df)


In [None]:
# Display first few rows
print("🔍 First 5 rows of the dataset:")
df.head()


In [None]:
# Display dataset statistics
print("📈 Dataset Statistics:")
df.describe()


## 2. Data Quality Assessment {#data-quality}

Let's assess the quality of our data by examining missing values, duplicates, and data types.


In [None]:
# Check for missing values
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
})

print("🔍 Missing Values Analysis:")
print(missing_df[missing_df['Missing Count'] > 0])

# Visualize missing values
if missing_data.sum() > 0:
    # DataFrame.plot opens its own figure, so set the size there instead of calling plt.figure first
    missing_df[missing_df['Missing Count'] > 0].plot(kind='bar', y='Missing Percentage', figsize=(10, 6))
    plt.title('Missing Values by Column')
    plt.ylabel('Percentage Missing')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
else:
    print("✅ No missing values found!")


In [None]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"🔍 Duplicate rows: {duplicates}")

# Check data types
print("\n📊 Data Types:")
print(df.dtypes)

# Check for potential outliers using IQR method
numerical_cols = df.select_dtypes(include=[np.number]).columns
print(f"\n🔍 Numerical columns: {list(numerical_cols)}")

outlier_summary = {}
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    outlier_summary[col] = len(outliers)

outlier_df = pd.DataFrame(list(outlier_summary.items()), columns=['Column', 'Outlier Count'])
print("\n🚨 Outliers detected:")
print(outlier_df)
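
The IQR rule used in the loop above can also be written as a small reusable helper. The following is a minimal sketch on made-up values; the function name `iqr_outlier_mask` and the toy `sqft` series are hypothetical and not part of `data_preprocessing`.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN evaluate to False, so missing values are never flagged
    return (values < lower) | (values > upper)

# Hypothetical toy data with one obviously extreme value
sqft = pd.Series([900, 1100, 1200, 1300, 1500, 9000], name="sqft")
mask = iqr_outlier_mask(sqft)
print(sqft[mask].tolist())
```

Returning a mask rather than a count keeps the choice open between dropping, capping, or merely inspecting the flagged rows.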


## 3. Univariate Analysis {#univariate-analysis}

Let's examine the distribution of individual variables to understand their characteristics.


In [None]:
# Create distribution plots for numerical variables
numerical_cols = df.select_dtypes(include=[np.number]).columns
n_cols = len(numerical_cols)
n_rows = (n_cols + 2) // 3  # 3 columns per row

fig, axes = plt.subplots(n_rows, 3, figsize=(15, 5*n_rows))
# plt.subplots returns a 1D array for a single row; normalise to a flat array in every case
axes = np.atleast_1d(axes).flatten()

for i, col in enumerate(numerical_cols):
    if i < len(axes):
        # Histogram
        axes[i].hist(df[col].dropna(), bins=30, alpha=0.7, color='skyblue', edgecolor='black')
        axes[i].set_title(f'Distribution of {col}')
        axes[i].set_xlabel(col)
        axes[i].set_ylabel('Frequency')
        axes[i].grid(True, alpha=0.3)

# Hide empty subplots
for i in range(len(numerical_cols), len(axes)):
    axes[i].set_visible(False)

plt.tight_layout()
plt.show()


In [None]:
# Box plots for outlier detection
fig, axes = plt.subplots(n_rows, 3, figsize=(15, 5*n_rows))
# plt.subplots returns a 1D array for a single row; normalise to a flat array in every case
axes = np.atleast_1d(axes).flatten()

for i, col in enumerate(numerical_cols):
    if i < len(axes):
        # Box plot
        axes[i].boxplot(df[col].dropna())
        axes[i].set_title(f'Box Plot of {col}')
        axes[i].set_ylabel(col)
        axes[i].grid(True, alpha=0.3)

# Hide empty subplots
for i in range(len(numerical_cols), len(axes)):
    axes[i].set_visible(False)

plt.tight_layout()
plt.show()


## 4. Bivariate Analysis {#bivariate-analysis}

Let's examine relationships between variables, especially with our target variable.


In [None]:
# Correlation matrix
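# Restrict to numeric columns; recent pandas versions raise an error on non-numeric data
correlation_matrix = df.select_dtypes(include=[np.number]).corr()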
print("🔗 Correlation Matrix:")
print(correlation_matrix)

# Visualize correlation matrix
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix Heatmap')
plt.tight_layout()
plt.show()
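
A heatmap is easy to scan, but the strongest pairwise relationships can also be pulled out as a ranked list. The sketch below assumes Pearson correlation and uses a hypothetical helper, `top_correlated_pairs`, on a made-up toy frame; the column names are illustrative only.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes(include=[np.number]).corr()
    # Keep the strict lower triangle so each pair appears once and the diagonal is excluded
    lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
    pairs = lower.stack().dropna()
    # Rank by magnitude but keep the signed value in the result
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Hypothetical toy data
toy = pd.DataFrame({
    "sqft": [1000, 1500, 2000, 2500, 3000],
    "tax_value": [200, 310, 390, 520, 600],
    "bedrooms": [2, 3, 3, 4, 5],
    "noise": [5, 1, 4, 2, 3],
})
pairs = top_correlated_pairs(toy, n=2)
print(pairs)
```

Pairs of features that are highly correlated with each other, rather than with the target, are candidates for removal or combination before fitting linear models.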


In [None]:
# Scatter plots for key relationships
key_features = ['taxvaluedollarcnt', 'calculatedfinishedsquarefeet', 'bedroomcnt', 'bathroomcnt']
target = 'logerror'

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()

for i, feature in enumerate(key_features):
    if feature in df.columns and i < len(axes):
        axes[i].scatter(df[feature], df[target], alpha=0.6, color='blue')
        axes[i].set_xlabel(feature)
        axes[i].set_ylabel(target)
        axes[i].set_title(f'{target} vs {feature}')
        axes[i].grid(True, alpha=0.3)
        
        # Add correlation coefficient
        corr = df[feature].corr(df[target])
        axes[i].text(0.05, 0.95, f'Corr: {corr:.3f}', transform=axes[i].transAxes,
                    bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()


## 5. Multivariate Analysis {#multivariate-analysis}

Let's explore more complex relationships and patterns in the data.


In [None]:
# Pair plot for key numerical variables
key_vars = ['taxvaluedollarcnt', 'calculatedfinishedsquarefeet', 'bedroomcnt', 'bathroomcnt', 'logerror']
available_vars = [var for var in key_vars if var in df.columns]

if len(available_vars) >= 3:
    # pairplot creates its own figure; a preceding plt.figure() call would only leave an empty one behind
    sns.pairplot(df[available_vars], diag_kind='hist', plot_kws={'alpha': 0.6})
    plt.suptitle('Pair Plot of Key Variables', y=1.02)
    plt.show()
else:
    print("⚠️ Not enough variables available for pair plot")


In [None]:
# Interactive 3D scatter plot using Plotly
if 'taxvaluedollarcnt' in df.columns and 'calculatedfinishedsquarefeet' in df.columns and 'logerror' in df.columns:
    fig = px.scatter_3d(df, 
                        x='taxvaluedollarcnt', 
                        y='calculatedfinishedsquarefeet', 
                        z='logerror',
                        color='bedroomcnt' if 'bedroomcnt' in df.columns else 'logerror',
                        title='3D Scatter Plot: Tax Value vs Square Footage vs Log Error',
                        labels={'taxvaluedollarcnt': 'Tax Value ($)',
                               'calculatedfinishedsquarefeet': 'Square Feet',
                               'logerror': 'Log Error'})
    fig.show()
else:
    print("⚠️ Required columns not available for 3D plot")


## 6. Feature Engineering Insights {#feature-engineering}

Let's explore potential new features that could improve our model performance.


In [None]:
# Create engineered features
df_engineered = df.copy()

# Price per square foot
if 'taxvaluedollarcnt' in df_engineered.columns and 'calculatedfinishedsquarefeet' in df_engineered.columns:
    df_engineered['price_per_sqft'] = df_engineered['taxvaluedollarcnt'] / df_engineered['calculatedfinishedsquarefeet']
    df_engineered['price_per_sqft'] = df_engineered['price_per_sqft'].replace([np.inf, -np.inf], np.nan)

# Total rooms
if 'bedroomcnt' in df_engineered.columns and 'bathroomcnt' in df_engineered.columns:
    df_engineered['total_rooms'] = df_engineered['bedroomcnt'] + df_engineered['bathroomcnt']

# Property age
if 'yearbuilt' in df_engineered.columns:
    current_year = 2023
    df_engineered['property_age'] = current_year - df_engineered['yearbuilt']
    df_engineered['property_age'] = df_engineered['property_age'].clip(lower=0)

# Log transformations
if 'taxvaluedollarcnt' in df_engineered.columns:
    df_engineered['log_taxvaluedollarcnt'] = np.log1p(df_engineered['taxvaluedollarcnt'])

if 'calculatedfinishedsquarefeet' in df_engineered.columns:
    df_engineered['log_calculatedfinishedsquarefeet'] = np.log1p(df_engineered['calculatedfinishedsquarefeet'])

print("🔧 Feature engineering completed!")
print(f"New features created: {[col for col in df_engineered.columns if col not in df.columns]}")

# Display correlation with target for new features
new_features = [col for col in df_engineered.columns if col not in df.columns]
if new_features and 'logerror' in df_engineered.columns:
    print("\n📊 Correlation of new features with target:")
    for feature in new_features:
        corr = df_engineered[feature].corr(df_engineered['logerror'])
        print(f"{feature}: {corr:.4f}")
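
To see why the log transformations above are worth trying, it helps to compare skewness before and after `np.log1p`. This is a minimal sketch on made-up, right-skewed values standing in for a column such as `taxvaluedollarcnt`; the numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: most are moderate, one is very large
values = pd.Series([50_000, 80_000, 120_000, 150_000, 200_000, 250_000, 2_000_000])

skew_before = values.skew()
skew_after = np.log1p(values).skew()

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after log1p:  {skew_after:.2f}")

# np.expm1 inverts np.log1p, so transformed values can be mapped back to the original scale
restored = np.expm1(np.log1p(values))
```

`log1p` computes `log(1 + x)`, which stays defined at zero, so it is a common choice for non-negative quantities such as dollar amounts and areas.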


## 7. Data Preprocessing Recommendations {#preprocessing-recommendations}

Based on our analysis, let's summarize key findings and recommendations.


In [None]:
# Summary of key findings
print("📋 EDA SUMMARY AND RECOMMENDATIONS")
print("="*50)

print("\n🔍 DATA QUALITY:")
print(f"• Dataset shape: {df.shape}")
print(f"• Missing values: {df.isnull().sum().sum()}")
print(f"• Duplicate rows: {df.duplicated().sum()}")

print("\n📊 KEY INSIGHTS:")
if 'logerror' in df.columns:
    print(f"• Target variable (logerror) range: {df['logerror'].min():.4f} to {df['logerror'].max():.4f}")
    print(f"• Target variable mean: {df['logerror'].mean():.4f}")
    print(f"• Target variable std: {df['logerror'].std():.4f}")

# Top correlations with target
if 'logerror' in df.columns:
    # Restrict to numeric columns and drop the target's self-correlation (always 1.0)
    correlations = (
        df.select_dtypes(include=[np.number])
          .corr()['logerror']
          .drop('logerror')
          .abs()
          .sort_values(ascending=False)
    )
    print("\n🔗 TOP CORRELATIONS WITH TARGET:")
    for rank, (feature, corr) in enumerate(correlations.head(5).items(), start=1):
        print(f"  {rank}. {feature}: {corr:.4f}")

print("\n🛠️ PREPROCESSING RECOMMENDATIONS:")
print("1. Handle missing values using median imputation")
print("2. Remove outliers using IQR method")
print("3. Apply log transformation to skewed features")
print("4. Create engineered features (price per sqft, total rooms, property age)")
print("5. Scale features using StandardScaler")
print("6. Consider feature selection based on correlation analysis")

print("\n🎯 MODELING RECOMMENDATIONS:")
print("1. Start with Random Forest for baseline")
print("2. Try XGBoost for better performance")
print("3. Use cross-validation for robust evaluation")
print("4. Consider ensemble methods")
print("5. Focus on feature engineering for improvement")

print("\n✅ EDA COMPLETED SUCCESSFULLY!")
print("Ready to proceed with model training and evaluation.")
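
Some of the preprocessing recommendations above (median imputation, log transformation, scaling) can be sketched in plain pandas. This is an illustration under stated assumptions, not the behaviour of `DataPreprocessor`: the helper `preprocess_numeric` and the toy columns are hypothetical, every column is assumed to be numeric, outlier removal is left out, and the manual standardisation stands in for `StandardScaler`.

```python
import numpy as np
import pandas as pd

def preprocess_numeric(frame: pd.DataFrame, skewed_cols: list) -> pd.DataFrame:
    """Impute, log-transform, and standardise an all-numeric frame."""
    out = frame.copy()
    # Median imputation: each column's NaNs are filled with that column's median
    out = out.fillna(out.median(numeric_only=True))
    # log1p compresses the long right tail of skewed, non-negative columns
    for col in skewed_cols:
        out[col] = np.log1p(out[col])
    # Standardise to zero mean and unit variance; ddof=0 is the population standard deviation
    std = out.std(ddof=0).replace(0, 1.0)  # guard against constant columns
    return (out - out.mean()) / std

# Hypothetical toy data with missing values
toy = pd.DataFrame({
    "sqft": [1000.0, 1500.0, np.nan, 2500.0],
    "tax_value": [200_000.0, np.nan, 400_000.0, 3_000_000.0],
})
scaled = preprocess_numeric(toy, skewed_cols=["tax_value"])
print(scaled.round(3))
```

In a modelling workflow, the medians, means, and standard deviations would normally be computed on the training split only and then reused on validation and test data to avoid leakage.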
