# Task 2: Machine Learning Model for House Price Prediction

This notebook implements:
- **Task 2a**: Robust ML algorithm for price prediction
- **Task 2b**: Feature relationship analysis and importance

## Table of Contents
1. Data Loading and Preprocessing
2. Feature Engineering
3. Model Training and Evaluation
4. Feature Importance Analysis
5. Feature Relationships with Price
6. Model Evaluation - Detailed Analysis
7. Save Best Model

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Import custom modules
import sys
sys.path.append('../src')
from data_preprocessing import DataPreprocessor
from feature_engineering import FeatureEngineer
from model_training import HousePricePredictor
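# Import only the helpers this notebook actually calls
from model_evaluation import (
    plot_model_comparison,
    plot_feature_importance,
    create_evaluation_report,
)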

print("Libraries imported successfully!")

## 1. Data Loading and Preprocessing

In [None]:
# Load data
df = pd.read_csv('../data/train.csv')
print(f"Original data shape: {df.shape}")
df.head()

In [None]:
# Initialize preprocessor
preprocessor = DataPreprocessor()

# Preprocess the data
X, y = preprocessor.preprocess(df, target_col='SalePrice', 
                                scale=False, handle_outliers_flag=True)

print("\nProcessed data shapes:")
print(f"Features (X): {X.shape}")
print(f"Target (y): {y.shape}")
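
The internals of `DataPreprocessor` are not shown in this notebook. As a rough illustration of what `handle_outliers_flag=True` could do, the sketch below clips numeric columns to the usual 1.5 × IQR fences. The helper `clip_outliers_iqr` and the toy frame are hypothetical and are not taken from `data_preprocessing`.

```python
import pandas as pd


def clip_outliers_iqr(frame: pd.DataFrame, factor: float = 1.5) -> pd.DataFrame:
    """Clip each numeric column to [Q1 - factor*IQR, Q3 + factor*IQR]."""
    clipped = frame.copy()
    numeric_cols = clipped.select_dtypes(include="number").columns
    q1 = clipped[numeric_cols].quantile(0.25)
    q3 = clipped[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    # Per-column bounds are aligned to the columns via axis=1
    clipped[numeric_cols] = clipped[numeric_cols].clip(
        lower=q1 - factor * iqr, upper=q3 + factor * iqr, axis=1
    )
    return clipped


# Toy data (made up): one extreme living-area value and one categorical column
demo = pd.DataFrame({
    "GrLivArea": [1000.0, 1200.0, 1100.0, 1300.0, 9000.0],
    "Street": ["Pave"] * 5,
})
demo_clipped = clip_outliers_iqr(demo)
```

Clipping keeps every row, which may be preferable to dropping rows on a dataset of this size; whether the real preprocessor clips or drops is not stated here.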

## 2. Feature Engineering

In [None]:
# Initialize feature engineer
feature_engineer = FeatureEngineer()

# Create new features
X_engineered = feature_engineer.create_all_features(X)

print(f"\nShape after feature engineering: {X_engineered.shape}")
print(f"Added {X_engineered.shape[1] - X.shape[1]} new features")
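
`create_all_features` lives in `feature_engineering`, which is not shown here. The Key Findings mention `TotalSF`, `HouseAge`, and `YearsSinceRemodel`; assuming the standard Ames/Kaggle column names (`TotalBsmtSF`, `1stFlrSF`, `2ndFlrSF`, `YrSold`, `YearBuilt`, `YearRemodAdd`) survive preprocessing, such features might be derived roughly as follows. The helper name and the exact formulas are assumptions.

```python
import pandas as pd


def add_example_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for a few of the engineered features."""
    out = frame.copy()
    # Total area: basement plus both above-ground floors
    out["TotalSF"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    # Age of the house, and years since the last remodel, at the time of sale
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    out["YearsSinceRemodel"] = out["YrSold"] - out["YearRemodAdd"]
    return out


# Two made-up rows for illustration
sample = pd.DataFrame({
    "TotalBsmtSF": [800, 0],
    "1stFlrSF": [900, 1200],
    "2ndFlrSF": [700, 0],
    "YrSold": [2008, 2010],
    "YearBuilt": [2000, 1960],
    "YearRemodAdd": [2005, 1960],
})
sample_fe = add_example_features(sample)
```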

In [None]:
# Optional: Select top features
# Uncomment the following lines to use feature selection
# X_selected = feature_engineer.select_features(X_engineered, y, k=50)
# X_final = X_selected

# Using all engineered features
X_final = X_engineered
print(f"Final feature count: {X_final.shape[1]}")
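
How `select_features` picks its `k` columns is not shown. One common approach is univariate selection with scikit-learn's `SelectKBest` and `f_regression`; the sketch below illustrates that technique on synthetic data and should not be read as the actual implementation. `select_top_k` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression


def select_top_k(X: pd.DataFrame, y: pd.Series, k: int) -> pd.DataFrame:
    """Keep the k columns with the highest univariate F-score against y."""
    selector = SelectKBest(score_func=f_regression, k=min(k, X.shape[1]))
    selector.fit(X, y)
    # get_support() is a boolean mask, so the original column order is kept
    return X.loc[:, selector.get_support()]


# Synthetic data: only f0 and f1 carry signal
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 5)),
                      columns=[f"f{i}" for i in range(5)])
y_demo = 3 * X_demo["f0"] - 2 * X_demo["f1"] + rng.normal(scale=0.1, size=200)
X_top2 = select_top_k(X_demo, y_demo, k=2)
```

If selection is enabled, fitting the selector on the training split only avoids leaking validation information into the feature choice.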

## 3. Model Training and Evaluation

In [None]:
# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X_final, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")

In [None]:
# Initialize predictor
predictor = HousePricePredictor()

# Train all models and compare
results_df = predictor.train_all_models(X_train, y_train, X_val, y_val)

print("\nModel Comparison Results:")
results_df
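
`train_all_models` is defined in `model_training` and its candidate list is not shown. The comparison it performs can be sketched as a loop that fits each candidate and records validation RMSE. The model choices, the synthetic data, and the `compare_models` helper below are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def compare_models(models, X_tr, y_tr, X_va, y_va):
    """Fit each candidate and report validation RMSE, best model first."""
    rows = []
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        rmse = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))
        rows.append({"Model": name, "RMSE": rmse})
    return pd.DataFrame(rows).sort_values("RMSE").reset_index(drop=True)


# Synthetic regression problem standing in for the housing data
X_syn, y_syn = make_regression(n_samples=300, n_features=8,
                               noise=10.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn,
                                          test_size=0.2, random_state=42)
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
}
comparison = compare_models(candidates, X_tr, y_tr, X_va, y_va)
```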

In [None]:
# Visualize model comparison
plot_model_comparison(results_df, metric='RMSE', 
                     title='Model Performance Comparison')

In [None]:
# Cross-validation on best model
best_model = predictor.best_model
cv_results = predictor.cross_validate(best_model, X_train, y_train, cv=5)

print(f"\nCross-Validation Results for {predictor.best_model_name}:")
print(f"Mean RMSE: ${cv_results['mean_score']:,.2f}")
print(f"Std RMSE: ${cv_results['std_score']:,.2f}")
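
The cell above reads `mean_score` and `std_score` from the result of `predictor.cross_validate`. A minimal sketch of how such a result could be produced with scikit-learn's `cross_val_score` follows; `cross_validate_rmse` is a hypothetical helper, and the real method may differ.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def cross_validate_rmse(model, X, y, cv=5):
    """Return per-fold RMSE plus its mean and standard deviation."""
    # scikit-learn maximizes scores, so RMSE is exposed as a negated value
    neg_rmse = cross_val_score(model, X, y, cv=cv,
                               scoring="neg_root_mean_squared_error")
    fold_rmse = -neg_rmse
    return {
        "scores": fold_rmse,
        "mean_score": float(fold_rmse.mean()),
        "std_score": float(fold_rmse.std()),
    }


# Synthetic data standing in for X_train / y_train
X_cv, y_cv = make_regression(n_samples=200, n_features=5,
                             noise=5.0, random_state=0)
cv_demo = cross_validate_rmse(Ridge(alpha=1.0), X_cv, y_cv, cv=5)
```

A standard deviation that is large relative to the mean suggests the score depends heavily on which rows land in each fold.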

## 4. Feature Importance Analysis (Task 2b)

In [None]:
# Get feature importance from best model (if tree-based)
if hasattr(predictor.best_model, 'feature_importances_'):
    importance_df = pd.DataFrame({
        'Feature': X_final.columns,
        'Importance': predictor.best_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    print("Top 20 Most Important Features:")
    print(importance_df.head(20))
    
    # Visualize
    plot_feature_importance(importance_df, 
                           title=f'Feature Importance - {predictor.best_model_name}',
                           top_n=20)
else:
    print(f"{predictor.best_model_name} does not provide feature importances")
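
When the best model has no `feature_importances_` attribute (for example, a linear model), the cell above only prints a message. One model-agnostic alternative is permutation importance from `sklearn.inspection`, sketched below on synthetic data. This is a different technique from the impurity-based importances used above, and the data and column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

# Synthetic data: f0 dominates, f1 contributes a little, f2 and f3 are noise
rng = np.random.default_rng(0)
X_pi = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f0", "f1", "f2", "f3"])
y_pi = 5 * X_pi["f0"] + 0.5 * X_pi["f1"] + rng.normal(scale=0.1, size=300)

# Ridge exposes coef_ but not feature_importances_
model = Ridge(alpha=1.0).fit(X_pi, y_pi)

# Shuffle each column in turn and measure how much the score drops
result = permutation_importance(model, X_pi, y_pi, n_repeats=10, random_state=0)

perm_df = pd.DataFrame({
    "Feature": X_pi.columns,
    "Importance": result.importances_mean,
}).sort_values("Importance", ascending=False).reset_index(drop=True)
```

In practice the importances are usually computed on held-out data such as `X_val`, so they reflect generalization rather than fit to the training set.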

## 5. Feature Relationships with Price (Task 2b)

In [None]:
# Correlation analysis for top features
if hasattr(predictor.best_model, 'feature_importances_'):
    top_features = importance_df.head(10)['Feature'].tolist()
    
    # Add target to correlation analysis
    corr_data = X_train[top_features].copy()
    corr_data['SalePrice'] = y_train.values
    
    # Calculate correlations
    correlations = corr_data.corr()['SalePrice'].sort_values(ascending=False)
    print("Correlation of Top Features with Sale Price:")
    print(correlations)
    
    # Visualize correlation heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_data.corr(), annot=True, fmt='.2f', cmap='coolwarm',
                center=0, square=True, linewidths=1)
    plt.title('Correlation Matrix - Top 10 Important Features + SalePrice')
    plt.tight_layout()
    plt.show()
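
`DataFrame.corr()` defaults to Pearson correlation, which captures linear association only. For features whose relationship with price is monotonic but curved, a rank-based measure such as Spearman can be a useful cross-check. The toy example below uses made-up data to show the difference.

```python
import numpy as np
import pandas as pd

# A perfectly monotonic but strongly non-linear relationship
x = np.linspace(1, 10, 50)
toy = pd.DataFrame({"feature": x, "price": np.exp(x)})

pearson = toy.corr(method="pearson").loc["feature", "price"]
spearman = toy.corr(method="spearman").loc["feature", "price"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Applied to the notebook's data, the same call would be `corr_data.corr(method="spearman")`.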

In [None]:
# Scatter plots showing relationships
if hasattr(predictor.best_model, 'feature_importances_'):
    top_4_features = importance_df.head(4)['Feature'].tolist()
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    axes = axes.flatten()
    
    for i, feat in enumerate(top_4_features):
        axes[i].scatter(X_train[feat], y_train, alpha=0.5, 
                       edgecolors='k', linewidth=0.5)
        axes[i].set_xlabel(feat)
        axes[i].set_ylabel('Sale Price ($)')
        axes[i].set_title(f'{feat} vs Sale Price')
        axes[i].grid(True, alpha=0.3)
        
        # Add trendline
        z = np.polyfit(X_train[feat], y_train, 1)
        p = np.poly1d(z)
        axes[i].plot(X_train[feat], p(X_train[feat]), "r--", alpha=0.8, linewidth=2)
    
    plt.tight_layout()
    plt.show()

## 6. Model Evaluation - Detailed Analysis

In [None]:
# Get predictions from best model
y_pred = predictor.best_model.predict(X_val)

# Create comprehensive evaluation report
metrics = create_evaluation_report(y_val, y_pred, 
                                  model_name=predictor.best_model_name,
                                  save_dir='../outputs')
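
`create_evaluation_report` is not shown, but the three metrics named in the Key Findings have standard definitions. A minimal NumPy sketch, with a hypothetical helper name and a tiny made-up example:

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R² from true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    # R² = 1 - (residual sum of squares / total sum of squares)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "R2": r2}


# Every prediction is off by exactly 10, so RMSE and MAE are both 10
demo_metrics = regression_metrics([100, 200, 300], [110, 190, 310])
```

RMSE and MAE are in the same units as the target (dollars here), while R² is unitless.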

## 7. Save Best Model

In [None]:
# Save the best model
predictor.save_model(model_name=predictor.best_model_name, 
                    filepath='../models/best_model.pkl')

# Also save the preprocessor and feature engineer for later use
import joblib
joblib.dump(preprocessor, '../models/preprocessor.pkl')
joblib.dump(feature_engineer, '../models/feature_engineer.pkl')

print("Models and processors saved successfully!")
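
To confirm that a saved artifact can be used later, a `joblib` round trip can be checked as sketched below. The example uses a small scikit-learn model and a temporary directory rather than the notebook's own objects, since the methods of `DataPreprocessor` and `FeatureEngineer` needed at inference time are not shown.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic model standing in for predictor.best_model
rng = np.random.default_rng(0)
X_small = rng.normal(size=(50, 3))
y_small = X_small @ np.array([1.0, 2.0, 3.0])
fitted = LinearRegression().fit(X_small, y_small)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(fitted.predict(X_small),
                               restored.predict(X_small))
```

New data has to pass through the same preprocessing and feature engineering steps as the training data before it reaches the restored model, which is why the preprocessor and feature engineer are saved alongside it.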

## Key Findings - Task 2

### Task 2a: Machine Learning Algorithm
- **Best Model**: The model stored in `predictor.best_model` after `train_all_models`; its name and metrics appear in the comparison table in Section 3
- **Performance**: Evaluated using RMSE, MAE, and R² metrics
- **Robustness**: 5-fold cross-validation on the training set estimates how well the best model generalizes

### Task 2b: Feature Relationships and Price Variation
- **Feature Importance**: Top features identified through model analysis
- **Correlations**: Pearson correlations between the top 10 features and `SalePrice` are computed and shown as a heatmap
- **Price Variation**: Price varies based on:
  - Quality metrics (OverallQual, etc.)
  - Size/Area features (GrLivArea, TotalSF, etc.)
  - Age-related features (HouseAge, YearsSinceRemodel, etc.)
  - Garage and Basement features
  - Location (Neighborhood)