# House Price Prediction Project
## Ben Belnap (003177064)

This notebook implements a machine learning solution to predict house prices using the Ames Housing Dataset. The project follows the CRISP-DM methodology and aims to provide accurate home value estimates for a mortgage company.

## 1. Setup and Required Libraries

Import all necessary libraries for data analysis, modeling, and visualization.

In [21]:
# Setup Python path
import os
import sys
notebook_dir = os.path.abspath(os.getcwd())  # Get the notebook's directory
project_root = os.path.dirname(notebook_dir)  # Get the project root directory
if project_root not in sys.path:
    sys.path.append(project_root)
print(f"Added to Python path: {project_root}")

# Data manipulation and analysis
import pandas as pd
import numpy as np
import polars as pl

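# Machine learning libraries
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
# Metrics used by the helper functions defined in later cells
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score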

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Custom modules
from src.data_preprocessing import check_data_quality, remove_outliers, prepare_data
from src.model_utils import get_default_models, debug_model_performance
from src.evaluation_utils import evaluate_model, plot_predictions

# Settings
plt.style.use('seaborn-v0_8')  
sns.set_theme()  # Using seaborn's default theme
%matplotlib inline

Added to Python path: d:\Repos\House-Price-Prediction-Machine-Learning


## 2. Data Understanding

In this section, we will load and explore the Ames Housing Dataset. We'll analyze its structure, generate summary statistics, and visualize key relationships.

In [22]:
# Load the dataset with error handling
try:
    df = pd.read_csv('../data/amesHousing.csv')
    print("Dataset successfully loaded!")
    print("\nDataset Shape:", df.shape)
    print("\nFirst few rows:")
    print(df.head())
    print("\nDataset Info:")
    df.info()
    
    # Check for missing values
    missing_values = df.isnull().sum()
    print("\nColumns with missing values:")
    print(missing_values[missing_values > 0])
    
    # Display basic statistics
    print("\nNumerical columns statistics:")
    print(df.describe())
except FileNotFoundError:
    print("Error: Dataset file not found. Please make sure 'amesHousing.csv' is in the data directory.")
except pd.errors.EmptyDataError:
    print("Error: The dataset file is empty.")
except pd.errors.ParserError:
    print("Error: Unable to parse the CSV file. Please check if it's a valid CSV format.")
except Exception as e:
    print(f"An unexpected error occurred: {str(e)}")

Dataset successfully loaded!

Dataset Shape: (2930, 82)

First few rows:
   Order        PID  MS SubClass MS Zoning  Lot Frontage  Lot Area Street  \
0      1  526301100           20        RL         141.0     31770   Pave   
1      2  526350040           20        RH          80.0     11622   Pave   
2      3  526351010           20        RL          81.0     14267   Pave   
3      4  526353030           20        RL          93.0     11160   Pave   
4      5  527105010           60        RL          74.0     13830   Pave   

  Alley Lot Shape Land Contour  ... Pool Area Pool QC  Fence Misc Feature  \
0   NaN       IR1          Lvl  ...         0     NaN    NaN          NaN   
1   NaN       Reg          Lvl  ...         0     NaN  MnPrv          NaN   
2   NaN       IR1          Lvl  ...         0     NaN    NaN         Gar2   
3   NaN       Reg          Lvl  ...         0     NaN    NaN          NaN   
4   NaN       IR1          Lvl  ...         0     NaN  MnPrv          NaN   
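
One way to start looking at key relationships is to rank the numeric columns by their correlation with `SalePrice`. The sketch below is illustrative only: the helper `rank_by_correlation` is a hypothetical name, and the toy frame uses made-up values under a few real Ames column names rather than the actual dataset.

```python
import pandas as pd

# Toy stand-in for a few Ames columns; the values are illustrative, not real data.
toy = pd.DataFrame({
    "Gr Liv Area": [1000, 1500, 2000, 2500, 3000],
    "Overall Qual": [4, 5, 6, 8, 9],
    "Yr Sold": [2010, 2008, 2009, 2007, 2010],
    "MS Zoning": ["RL", "RH", "RL", "RL", "RL"],  # non-numeric, excluded below
    "SalePrice": [110000, 150000, 205000, 260000, 310000],
})


def rank_by_correlation(frame, target="SalePrice"):
    """Return numeric features sorted by absolute Pearson correlation with target."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


ranking = rank_by_correlation(toy)
print(ranking)
```

On the real data the same call would be `rank_by_correlation(df)`. Pearson correlation only captures linear relationships, so a low score does not rule out a useful feature.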

 

## 3. Data Preparation

Here we will clean the data, handle missing values, remove outliers, and prepare features for modeling.

In [27]:
# Report data types, missing values, and duplicate rows in the raw data
check_data_quality(df)

=== Data Quality Report ===

Data Types:
object     43
int64      28
float64    11
Name: count, dtype: int64

Columns with missing values:
Lot Frontage       486
Alley             2649
Mas Vnr Type      1766
Mas Vnr Area        22
Bsmt Qual           80
Bsmt Cond           80
Bsmt Exposure       83
BsmtFin Type 1      80
BsmtFin SF 1         1
BsmtFin Type 2      81
BsmtFin SF 2         1
Bsmt Unf SF          1
Total Bsmt SF        1
Electrical           1
Bsmt Full Bath       2
Bsmt Half Bath       2
Fireplace Qu      1422
Garage Type        157
Garage Yr Blt      159
Garage Finish      159
Garage Cars          1
Garage Area          1
Garage Qual        159
Garage Cond        159
Pool QC           2836
Fence             2280
Misc Feature      2741
dtype: int64

Number of duplicate rows: 0

Checking numerical columns for invalid values...


{'missing': Order               0
 PID                 0
 MS SubClass         0
 MS Zoning           0
 Lot Frontage      486
                  ... 
 Mo Sold             0
 Yr Sold             0
 Sale Type           0
 Sale Condition      0
 SalePrice           0
 Length: 82, dtype: int64,
 'duplicates': np.int64(0)}

In [29]:
# Apply data preparation using our utility functions
df_cleaned = prepare_data(df.copy())

check_data_quality(df_cleaned)

Starting data preparation...
=== Data Quality Report ===

Data Types:
object     43
int64      28
float64    11
Name: count, dtype: int64

Columns with missing values:
Lot Frontage       486
Alley             2649
Mas Vnr Type      1766
Mas Vnr Area        22
Bsmt Qual           80
Bsmt Cond           80
Bsmt Exposure       83
BsmtFin Type 1      80
BsmtFin SF 1         1
BsmtFin Type 2      81
BsmtFin SF 2         1
Bsmt Unf SF          1
Total Bsmt SF        1
Electrical           1
Bsmt Full Bath       2
Bsmt Half Bath       2
Fireplace Qu      1422
Garage Type        157
Garage Yr Blt      159
Garage Finish      159
Garage Cars          1
Garage Area          1
Garage Qual        159
Garage Cond        159
Pool QC           2836
Fence             2280
Misc Feature      2741
dtype: int64

Number of duplicate rows: 0

Checking numerical columns for invalid values...

Removing outliers from SalePrice...
Removed 68 outliers (2.39% of data)

Handling missing values...
Remaining missing 

{'missing': Order             0
 PID               0
 MS SubClass       0
 MS Zoning         0
 Lot Frontage      0
                  ..
 Mo Sold           0
 Yr Sold           0
 Sale Type         0
 Sale Condition    0
 SalePrice         0
 Length: 82, dtype: int64,
 'duplicates': np.int64(0)}
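
The bodies of `remove_outliers` and `prepare_data` live in `src.data_preprocessing` and are not shown in this notebook. As a rough sketch of what such steps often look like, the example below applies the common 1.5 × IQR rule to `SalePrice`, then fills numeric gaps with the column median and categorical gaps with a `"None"` label. The function names `remove_iqr_outliers` and `fill_missing`, the multiplier, and the fill strategy are all assumptions for illustration and may differ from the project's actual code.

```python
import pandas as pd


def remove_iqr_outliers(frame, column, k=1.5):
    """Drop rows whose `column` value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask].copy()


def fill_missing(frame):
    """Fill numeric gaps with the column median and other gaps with 'None'."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("None")
    return out


# Toy data with one extreme price and a few gaps (values are made up).
toy = pd.DataFrame({
    "SalePrice": [100000, 120000, 130000, 140000, 150000, 900000],
    "Lot Frontage": [60.0, None, 80.0, 70.0, None, 90.0],
    "Alley": [None, "Grvl", None, None, "Pave", None],
})

cleaned = fill_missing(remove_iqr_outliers(toy, "SalePrice"))
print(cleaned)
```

Removing outliers before imputing means the medians are computed without the extreme rows, which is one reasonable ordering but not the only one.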

## 4. Model Development

We will implement and train multiple models, including Random Forest and XGBoost.

In [None]:
# Initialize models using our utility function
models = get_default_models()
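
`get_default_models` is defined in `src.model_utils` and its return value is not shown here. The sketch below assumes it returns a dictionary mapping model names to unfitted regressors, and shows how such a dictionary could be trained and scored on a held-out split. `build_example_models` is a hypothetical stand-in, the data is synthetic, and scikit-learn's `GradientBoostingRegressor` takes the place of XGBoost so that the example needs only scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def build_example_models(random_state=42):
    """Hypothetical stand-in for get_default_models(): name -> unfitted regressor."""
    return {
        "Random Forest": RandomForestRegressor(n_estimators=50, random_state=random_state),
        # Stand-in for XGBoost; swapped in so only scikit-learn is required.
        "Gradient Boosting": GradientBoostingRegressor(random_state=random_state),
    }


# Synthetic data: price depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = 100_000 + 150_000 * X[:, 0] + 50_000 * X[:, 1] + rng.normal(0, 5_000, size=300)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for name, model in build_example_models().items():
    model.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: test MAE = ${results[name]:,.2f}")
```

Fixing `random_state` in both the split and the models keeps runs repeatable, which makes later comparisons between models easier to trust.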

In [None]:
# Model validation and debugging helper
# Note: this local definition shadows the debug_model_performance imported from src.model_utils
def debug_model_performance(model, X_train, X_test, y_train, y_test, model_name):
    """Debug model performance and print detailed analysis"""
    print(f"=== {model_name} Debug Report ===")
    
    # Training performance
    y_train_pred = model.predict(X_train)
    train_mae = mean_absolute_error(y_train, y_train_pred)
    train_r2 = r2_score(y_train, y_train_pred)
    
    # Testing performance
    y_test_pred = model.predict(X_test)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    print("\nPerformance Metrics:")
    print(f"Training MAE: ${train_mae:,.2f}")
    print(f"Testing MAE: ${test_mae:,.2f}")
    print(f"Training R²: {train_r2:.4f}")
    print(f"Testing R²: {test_r2:.4f}")
    
    # Check for overfitting: an overfit model does better on training data
    # than on test data, so use signed gaps rather than absolute differences
    print("\nOverfitting Analysis:")
    mae_gap = test_mae - train_mae
    r2_gap = train_r2 - test_r2
    print(f"MAE gap (test - train): ${mae_gap:,.2f}")
    print(f"R² gap (train - test): {r2_gap:.4f}")
    
    # Thresholds are rough heuristics, not calibrated values
    if mae_gap > 10000 or r2_gap > 0.1:
        print("WARNING: Possible overfitting detected!")
        
    # Feature importance for tree-based models
    if hasattr(model, 'feature_importances_'):
        print("\nTop 10 Important Features:")
        importances = pd.DataFrame({
            'feature': X_train.columns,
            'importance': model.feature_importances_
        }).sort_values('importance', ascending=False)
        print(importances.head(10))
        
    return {
        'train_mae': train_mae,
        'test_mae': test_mae,
        'train_r2': train_r2,
        'test_r2': test_r2
    }

## 5. Model Evaluation

We will evaluate our models using mean absolute error (MAE), root mean squared error (RMSE), and the R² score.

In [None]:
def evaluate_model(y_true, y_pred, model_name):
    """Calculate and display model performance metrics"""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    
    print(f"\nModel: {model_name}")
    print(f"Mean Absolute Error: ${mae:,.2f}")
    print(f"Root Mean Squared Error: ${rmse:,.2f}")
    print(f"R² Score: {r2:.4f}")
    
    return {'mae': mae, 'rmse': rmse, 'r2': r2}
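
To make the three metrics concrete, the following sketch computes them by hand with NumPy on three made-up prices. It mirrors the standard definitions that scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` implement; the numbers are illustrative and are not results from this project.

```python
import numpy as np

# Three made-up actual and predicted prices, in dollars.
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred = np.array([110_000.0, 190_000.0, 330_000.0])

errors = y_true - y_pred                        # [-10000, 10000, -30000]
mae = np.mean(np.abs(errors))                   # average absolute miss
rmse = np.sqrt(np.mean(errors ** 2))            # penalizes large misses more heavily
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # share of variance explained

print(f"MAE:  ${mae:,.2f}")    # MAE:  $16,666.67
print(f"RMSE: ${rmse:,.2f}")   # RMSE: $19,148.54
print(f"R²:   {r2:.4f}")       # R²:   0.9450
```

RMSE is larger than MAE here because the single $30,000 miss is squared before averaging. A wide gap between the two usually points to a few large errors rather than uniformly mediocre predictions.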

## 6. Model Comparison

Finally, we will compare the performance of different models and visualize their predictions.

In [None]:
def plot_predictions(y_true, y_pred, model_name):
    """Create scatter plot of predicted vs actual values"""
    plt.figure(figsize=(10, 6))
    plt.scatter(y_true, y_pred, alpha=0.5)
    plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--', lw=2)
    plt.xlabel('Actual Price')
    plt.ylabel('Predicted Price')
    plt.title(f'{model_name}: Predicted vs Actual Home Prices')
    plt.tight_layout()
    plt.show()
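
Since `evaluate_model` returns a dictionary of metrics for each model, those dictionaries can be collected into one table for a side-by-side comparison. The sketch below assumes that structure. `build_comparison_table` is a hypothetical helper, and the numbers are placeholders chosen for illustration, not results produced by this notebook.

```python
import pandas as pd

# Placeholder metrics for illustration only; these are NOT results from this notebook.
example_results = {
    "Model A": {"mae": 18_000.0, "rmse": 26_000.0, "r2": 0.88},
    "Model B": {"mae": 15_500.0, "rmse": 23_000.0, "r2": 0.91},
}


def build_comparison_table(results):
    """Turn {model_name: metrics_dict} into a table sorted by MAE, best first."""
    table = pd.DataFrame.from_dict(results, orient="index")
    table.index.name = "model"
    return table.sort_values("mae")


comparison = build_comparison_table(example_results)
print(comparison)

best_model = comparison.index[0]
print(f"Lowest MAE: {best_model}")
```

Sorting by MAE matches the project's stated focus on mean absolute error; another metric can be used by changing the column passed to `sort_values`.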