# Project 7: House Price Prediction

This notebook tackles the regression task of predicting house prices. We will use a simplified version of the Ames Housing dataset to build and compare two models:
1. A baseline **Linear Regression** model.
2. A more advanced **XGBoost** model.

The goal is to see how a powerful gradient boosting model like XGBoost compares to a simple linear model on this structured dataset.

## 1. Setup and Library Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn import metrics

## 2. Data Loading and EDA

In [None]:
# Load the dataset
try:
    df = pd.read_csv('data/train.csv')
    print("Data loaded successfully.")
except FileNotFoundError:
    print("Data file not found. Make sure 'train.csv' is in the 'data/' directory.")
    raise  # Stop here: every later cell depends on df being defined

df.head()

In [None]:
df.describe()

### Visualizing Key Relationships

In [None]:
# Scatter plot for living area vs. sale price
plt.figure(figsize=(10, 6))
sns.scatterplot(x='GrLivArea', y='SalePrice', data=df)
plt.title('Living Area vs. Sale Price')
plt.xlabel('Above Ground Living Area (sq ft)')
plt.ylabel('Sale Price ($)')
plt.show()

In [None]:
# Box plot for overall quality vs. sale price
plt.figure(figsize=(10, 6))
sns.boxplot(x='OverallQual', y='SalePrice', data=df)
plt.title('Overall Quality vs. Sale Price')
plt.xlabel('Overall Quality')
plt.ylabel('Sale Price ($)')
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()
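
With many numeric columns, an annotated heatmap can be hard to read. As a complement, the correlations with the target can be ranked directly. The snippet below is a minimal sketch on a small made-up frame: the values in `toy` are invented for illustration, and only the column names `GrLivArea`, `OverallQual`, and `SalePrice` come from the plots above (`YrSold` is included as an assumed weakly related column).

```python
import pandas as pd

# Toy data for illustration only; these numbers are NOT from the real dataset
toy = pd.DataFrame({
    'GrLivArea': [1000, 1500, 2000, 2500, 3000],
    'OverallQual': [4, 5, 7, 6, 9],
    'YrSold': [2010, 2008, 2009, 2007, 2010],
    'SalePrice': [100000, 150000, 210000, 240000, 320000],
})

# Correlation of every numeric feature with the target,
# sorted by absolute strength (strongest first)
corr_with_target = (
    toy.corr(numeric_only=True)['SalePrice']
    .drop('SalePrice')
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

On the real data, replacing `toy` with `df` should give the same kind of ranking, which can help decide which features deserve a closer look.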

## 3. Feature Engineering and Preprocessing

In [None]:
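# Define features (X) and target (y)
# For this simplified project, we use all numeric columns except Id and SalePrice as features
# (LinearRegression cannot be fit on text columns)
X = df.drop(['Id', 'SalePrice'], axis=1).select_dtypes(include='number')
y = df['SalePrice']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fill any missing values with training-set medians so no test information leaks into training
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)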
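
Linear Regression needs numeric input, so any text columns have to be dropped or encoded before fitting. One common option is one-hot encoding with `pd.get_dummies`. The sketch below is illustrative only: the `train` and `test` frames and their values are made up, and `Neighborhood` is an assumed categorical column that may not exist in the simplified dataset.

```python
import pandas as pd

# Hypothetical miniature train/test frames (values invented for illustration)
train = pd.DataFrame({
    'GrLivArea': [1500, 2000, 1200],
    'Neighborhood': ['NAmes', 'CollgCr', 'NAmes'],
})
test = pd.DataFrame({
    'GrLivArea': [1800],
    'Neighborhood': ['Edwards'],  # a category never seen in training
})

# One-hot encode the categorical column in each split
train_encoded = pd.get_dummies(train, columns=['Neighborhood'], dtype=int)
test_encoded = pd.get_dummies(test, columns=['Neighborhood'], dtype=int)

# Align the test columns to the training columns:
# unseen categories are dropped, missing ones are filled with 0
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(train_encoded)
print(test_encoded)
```

Aligning on the training columns keeps the feature matrix identical in shape for fitting and prediction, which both models require.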

## 4. Model Building and Training

### 4.1. Linear Regression (Baseline)

In [None]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_predictions = lr_model.predict(X_test)

### 4.2. XGBoost (Advanced Model)

In [None]:
xgb_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=42)
xgb_model.fit(X_train, y_train)
xgb_predictions = xgb_model.predict(X_test)
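
To make the idea behind gradient boosting more concrete, here is a toy sketch in plain NumPy. It is not how XGBoost is implemented (XGBoost adds regularization, second-order gradient information, and much more efficient tree building). It only shows the core loop for squared-error loss: each round fits a small tree (here a one-split "stump") to the current residuals and adds a fraction of its prediction, controlled by `learning_rate`. All names (`fit_stump`, `predict_stump`, `boost`) and the synthetic data are made up for this illustration.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    # Exclude the largest value so the right-hand side is never empty
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error: repeatedly fit stumps to residuals."""
    base = y.mean()  # start from a constant prediction
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - prediction  # negative gradient of squared error
        stump = fit_stump(x, residuals)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction

# Synthetic nonlinear data (invented for illustration)
rng = np.random.default_rng(42)
toy_x = rng.uniform(0, 10, size=200)
toy_y = np.sin(toy_x) * 50 + rng.normal(0, 5, size=200)

base, stumps, fitted = boost(toy_x, toy_y)
mse_constant = np.mean((toy_y - base) ** 2)
mse_boosted = np.mean((toy_y - fitted) ** 2)
print(f'Training MSE, constant prediction: {mse_constant:.2f}')
print(f'Training MSE, after boosting:      {mse_boosted:.2f}')
```

The parameters `n_rounds` and `learning_rate` here play roughly the same role as `n_estimators` and `learning_rate` in `XGBRegressor`: more rounds with a smaller step usually fit more gradually.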

## 5. Model Evaluation

In [None]:
# Store results in a dictionary
results = {}

# Linear Regression Evaluation
lr_mae = metrics.mean_absolute_error(y_test, lr_predictions)
lr_mse = metrics.mean_squared_error(y_test, lr_predictions)
lr_r2 = metrics.r2_score(y_test, lr_predictions)
results['Linear Regression'] = {'MAE': lr_mae, 'MSE': lr_mse, 'R2': lr_r2}

print("--- Linear Regression ---")
print(f'MAE: {lr_mae:.2f}')
print(f'MSE: {lr_mse:.2f}')
print(f'R2 Score: {lr_r2:.4f}')

# XGBoost Evaluation
xgb_mae = metrics.mean_absolute_error(y_test, xgb_predictions)
xgb_mse = metrics.mean_squared_error(y_test, xgb_predictions)
xgb_r2 = metrics.r2_score(y_test, xgb_predictions)
results['XGBoost'] = {'MAE': xgb_mae, 'MSE': xgb_mse, 'R2': xgb_r2}

print("--- XGBoost Regressor ---")
print(f'MAE: {xgb_mae:.2f}')
print(f'MSE: {xgb_mse:.2f}')
print(f'R2 Score: {xgb_r2:.4f}')
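
MSE is expressed in squared dollars, which makes it hard to read on its own. Its square root (RMSE) is back in dollars and can be compared directly with MAE. The sketch below recomputes the metrics by hand with NumPy on a tiny made-up example, to show what each number means. The helper `regression_report` and the toy prices are hypothetical and not part of the notebook above.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R2 from scratch."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))          # average absolute error, in dollars
    mse = np.mean(errors ** 2)             # average squared error, in dollars squared
    rmse = np.sqrt(mse)                    # back in dollars
    ss_res = np.sum(errors ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2}

# Toy prices for illustration only
toy_true = [200000, 150000, 300000, 250000]
toy_pred = [210000, 140000, 290000, 260000]

report = regression_report(toy_true, toy_pred)
for name, value in report.items():
    print(f'{name}: {value:.4f}')
```

On the real predictions, `np.sqrt(lr_mse)` and `np.sqrt(xgb_mse)` would give the corresponding RMSE values for the two models.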

## 6. Conclusion

In this project, we compared a simple Linear Regression model with a more complex XGBoost model for predicting house prices. As expected, the **XGBoost model significantly outperformed the baseline model** across all evaluation metrics (lower MAE/MSE and higher R² score).

This demonstrates the power of gradient boosting algorithms for structured, tabular data. While Linear Regression provides a good, interpretable starting point, models like XGBoost are often capable of capturing more complex patterns and achieving higher predictive accuracy.