# Day 3: Housing Market Prediction with AI

## Overview
Predict house prices with machine learning, using the Kaggle House Prices dataset (residential sales in Ames, Iowa).

- **Dataset**: 1,460 houses with 79 features each
- **Target**: SalePrice (\$34,900 to \$755,000)
- **Goal**: Build accurate price prediction models

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import xgboost as xgb
import warnings
# Silence all warnings to keep cell output readable (this also hides deprecation notices)
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")

In [None]:
# Load data, using the Id column as the index
train_df = pd.read_csv('data/train.csv', index_col='Id')
test_df = pd.read_csv('data/test.csv', index_col='Id')

print(f"Training data: {train_df.shape}")
print(f"Test data: {test_df.shape}")
print(f"Price range: ${train_df['SalePrice'].min():,.0f} - ${train_df['SalePrice'].max():,.0f}")

train_df.head()

In [None]:
# Target variable analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('SalePrice Analysis', fontsize=16)

# Distribution
sns.histplot(train_df['SalePrice'], kde=True, ax=axes[0,0])
axes[0,0].set_title('Price Distribution')

# Box plot
sns.boxplot(y=train_df['SalePrice'], ax=axes[0,1])
axes[0,1].set_title('Price Box Plot')

# Q-Q plot
from scipy import stats
stats.probplot(train_df['SalePrice'], dist="norm", plot=axes[1,0])
axes[1,0].set_title('Q-Q Plot')

# Log transformation: log1p(x) = log(1 + x) compresses the long right tail
log_prices = np.log1p(train_df['SalePrice'])
sns.histplot(log_prices, kde=True, ax=axes[1,1])
axes[1,1].set_title('Log-Transformed Prices')

plt.tight_layout()
plt.show()

print(f"Skewness: {train_df['SalePrice'].skew():.3f}")
print(f"Log-transformed skewness: {log_prices.skew():.3f}")
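
The effect of the log transform can be checked in isolation. The sketch below is only an illustration: it uses synthetic log-normal values (an assumption, not the Kaggle data) to show that `np.log1p` reduces right skew and that `np.expm1` inverts it, which matters later when predictions made on the log scale have to be converted back to dollars.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" drawn from a log-normal distribution.
# These are NOT the Kaggle values; the parameters are arbitrary assumptions.
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000), name="SalePrice")

log_prices = np.log1p(prices)      # forward transform: log(1 + x)
recovered = np.expm1(log_prices)   # inverse transform: exp(y) - 1

print(f"Raw skewness: {prices.skew():.3f}")
print(f"Log skewness: {log_prices.skew():.3f}")
print(f"Round trip recovers prices: {np.allclose(recovered, prices)}")
```

If a model is trained on `log1p(SalePrice)`, its predictions need `np.expm1` applied before they are reported as prices.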

In [None]:
# Feature analysis
numerical_features = train_df.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = train_df.select_dtypes(include=['object']).columns.tolist()

if 'SalePrice' in numerical_features:
    numerical_features.remove('SalePrice')

print(f"Numerical features: {len(numerical_features)}")
print(f"Categorical features: {len(categorical_features)}")

# Missing values
missing = train_df.isnull().sum()
missing_percent = 100 * missing / len(train_df)
missing_table = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_percent
})
missing_table = missing_table[missing_table['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

print("\nTop 10 features with missing values:")
print(missing_table.head(10))
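
One possible way to handle these gaps is sketched below. It assumes a simple policy: non-numeric columns get the placeholder string `'None'` (in this dataset's data description, a missing value in columns such as `PoolQC` or `Alley` means the house has no pool or alley access), and numeric columns get the column median. The helper name `fill_missing` and the toy values are hypothetical; other imputation strategies may suit individual columns better.

```python
import numpy as np
import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: 'None' for non-numeric columns, median for numeric ones."""
    out = df.copy()  # leave the caller's frame untouched
    num_cols = out.select_dtypes(include=[np.number]).columns
    cat_cols = out.select_dtypes(exclude=[np.number]).columns
    out[cat_cols] = out[cat_cols].fillna("None")
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out

# Toy frame with real column names but made-up values
toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan, np.nan],
    "Alley": ["Grvl", np.nan, np.nan, "Pave"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
})

clean = fill_missing(toy)
print(clean)
```

To avoid leaking information, the medians would normally be computed on `train_df` only and then reused when filling `test_df`.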

In [None]:
# Correlation analysis (Pearson, numerical features only)
correlations = train_df[numerical_features + ['SalePrice']].corr()['SalePrice'].sort_values(ascending=False)
# Drop the target's self-correlation, then keep the 15 largest positive values
top_correlations = correlations.drop('SalePrice').head(15)

print("Top 15 features correlated with SalePrice:")
for feature, corr in top_correlations.items():
    print(f"{feature:20s}: {corr:6.3f}")

# Visualize correlations
plt.figure(figsize=(10, 8))
top_correlations.sort_values().plot(kind='barh')  # ascending order puts the strongest bar at the top
plt.title('Top Features Correlated with SalePrice')
plt.xlabel('Correlation Coefficient')
plt.tight_layout()
plt.show()
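
Sorting in descending order surfaces only positive correlations, so a strongly negative predictor would never appear in the list. A minimal sketch of ranking by magnitude instead is shown below; the data and column names (`area`, `age`, `noise`, `price`) are synthetic assumptions chosen to include one negative relationship.

```python
import numpy as np
import pandas as pd

# Synthetic data: one positive, one negative, and one unrelated feature
rng = np.random.default_rng(0)
n = 500
area = rng.normal(1500, 400, n)      # hypothetical positively related feature
age = rng.normal(40, 15, n)          # hypothetical negatively related feature
noise_col = rng.normal(0, 1, n)      # unrelated feature
price = 100 * area - 2000 * age + rng.normal(0, 20000, n)

df = pd.DataFrame({"area": area, "age": age, "noise": noise_col, "price": price})

corr = df.corr()["price"].drop("price")
# Order by absolute value but keep the signed coefficients for interpretation
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Applied to the notebook's data, the same `reindex` step on `correlations.drop('SalePrice')` would show whether any negative correlations rival the positive ones.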

In [None]:
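# Key relationships: the four numerical features most correlated with SalePrice
# after OverallQual (an ordinal 1-10 rating, so less suited to a scatter plot)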
top_features = ['GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF']

fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('SalePrice vs Top Features', fontsize=16)

for ax, feature in zip(axes.flat, top_features):
    sns.scatterplot(data=train_df, x=feature, y='SalePrice', ax=ax, alpha=0.6)
    ax.set_title(f'SalePrice vs {feature}')

    # Annotate each panel with the Pearson correlation, anchored inside the top-left corner
    corr = train_df[feature].corr(train_df['SalePrice'])
    ax.text(0.05, 0.95, f'r = {corr:.3f}',
            transform=ax.transAxes, verticalalignment='top',
            bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

## Next Steps

1. **Data Preprocessing**: Handle missing values, outliers
2. **Feature Engineering**: Create new features, encode categoricals
3. **Model Building**: Linear Regression, XGBoost
4. **Model Evaluation**: Compare performance metrics
5. **Prediction**: Generate submission file

**Key Insights**:
- SalePrice is right-skewed; a log transformation brings it closer to a normal distribution
- OverallQual has the highest correlation with SalePrice (r = 0.791)
- GrLivArea, GarageCars, and GarageArea are also strong predictors
- Many features have missing values that need handling before modeling
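
The model building and evaluation steps listed under Next Steps might look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's actual pipeline: the features are synthetic stand-ins, only `LinearRegression` is fitted, and the target is modeled on the log scale because the Kaggle competition scores submissions by RMSE between the logarithms of predicted and actual prices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (NOT real Kaggle rows)
rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.normal(1500, 400, n),   # stand-in for living area in square feet
    rng.integers(1, 11, n),     # stand-in for a 1-10 quality rating
])
log_price = 10.5 + 0.0004 * X[:, 0] + 0.12 * X[:, 1] + rng.normal(0, 0.1, n)
y = np.expm1(log_price)         # prices on the original dollar scale

# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on log1p(price), then invert predictions with expm1
model = LinearRegression()
model.fit(X_train, np.log1p(y_train))
pred_log = model.predict(X_val)
pred_price = np.expm1(pred_log)

rmse_log = np.sqrt(mean_squared_error(np.log1p(y_val), pred_log))
r2 = r2_score(np.log1p(y_val), pred_log)
print(f"Validation RMSE (log scale): {rmse_log:.4f}")
print(f"Validation R^2 (log scale): {r2:.3f}")
```

Because `xgb.XGBRegressor` follows the same `fit`/`predict` interface, it could be swapped in for `LinearRegression` here to compare the two models on the same validation split.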