# 🎓 Assignment – House Price Prediction with Linear Regression & Random Forest

## Part A – Practical (Jupyter Notebook)

**Objective:** In this part, you will implement **house price prediction** using **Linear Regression** and **Random Forest Regressor**.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

### Step 0: Data Preprocessing (Recreating Lesson 3 Cleaning)
Before modeling, clean the raw `house_l3_dataset.csv` by repeating the Lesson 3 cleaning steps.

In [None]:
# Load raw data
df = pd.read_csv('../../../dataset/house_l3_dataset.csv')

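# Clean Price: remove '$' and ',' then convert to numeric
df['Price'] = df['Price'].astype(str).str.replace(r'[$,]', '', regex=True)
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
# Prices that could not be parsed are now NaN; drop those rows, since the models cannot train on a missing target
df = df.dropna(subset=['Price'])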

# Clean Location: fix typos, then fill missing values with the most frequent category
df['Location'] = df['Location'].replace({'Subrb': 'Suburb', '??': np.nan})
df['Location'] = df['Location'].fillna(df['Location'].mode()[0])

# Impute Missing Values
for col in ['Size_sqft', 'Bedrooms', 'Bathrooms', 'YearBuilt']:
    df[col] = df[col].fillna(df[col].median())

# Remove Duplicates
df = df.drop_duplicates()

# Handle Price outliers: clip values to the 1.5 * IQR fences
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df['Price'] = df['Price'].clip(lower=lower_bound, upper=upper_bound)

# One-Hot Encode Location
df = pd.get_dummies(df, columns=['Location'], drop_first=True, dtype=int)
print("Columns after encoding:", df.columns.tolist())

# Save Cleaned Data
df.to_csv('clean_house_dataset.csv', index=False)
print("Data Cleaned and Saved to clean_house_dataset.csv")
df.head()
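
To see what the IQR clipping step does, here is a minimal sketch on a hypothetical five-value price series (the numbers are made up for illustration and are not taken from the dataset). It applies the same `quantile` and `clip` calls as the cleaning cell above.

```python
import pandas as pd

# Hypothetical toy prices; the last value is an obvious outlier.
prices = pd.Series([100_000, 120_000, 130_000, 150_000, 1_000_000])

# Same fence computation as in the cleaning cell.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside the fences are pulled back to the nearest fence,
# so no rows are removed.
clipped = prices.clip(lower=lower_bound, upper=upper_bound)

print(f"Fences: [{lower_bound:,.0f}, {upper_bound:,.0f}]")
print(clipped.tolist())
```

With these toy values the fences are 75,000 and 195,000, so the 1,000,000 entry is replaced by 195,000 while the other four values stay unchanged. Clipping keeps every row, which differs from filtering outliers out.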

### Step 1: Load Dataset
Load the cleaned dataset.

In [None]:
df = pd.read_csv('clean_house_dataset.csv')
df.info()  # info() prints its summary itself; wrapping it in print() would also print "None"

### Step 2: Prepare Features & Target
Target (`y`) = `Price`
Features (`X`) = all columns except `Price` and `LogPrice` (if `LogPrice` is present, it is dropped so that the target does not leak into the features).

In [None]:
# Drop LogPrice if it exists
if 'LogPrice' in df.columns:
    df = df.drop(columns=['LogPrice'])

X = df.drop(columns=['Price'])
y = df['Price']

print("Features:", X.columns.tolist())

### Step 3: Split Data
Split into 80% training and 20% testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training sets: {X_train.shape}, {y_train.shape}")
print(f"Testing sets: {X_test.shape}, {y_test.shape}")
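
The effect of `test_size=0.2` and `random_state=42` can be checked on a small example. The sketch below uses hypothetical arrays (`X_demo`, `y_demo`) with ten rows, not the house dataset, and shows that the split is 8/2 and that the same seed reproduces the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print("Train:", X_tr.shape, "Test:", X_te.shape)

# Splitting again with the same seed selects the same test rows.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te_again)
print("Reproducible split:", same_split)
```

Fixing `random_state` matters when comparing two models: both are then evaluated on exactly the same held-out rows.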

### Step 4: Train Models
Train Linear Regression and Random Forest Regressor.

In [None]:
# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
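
As a rough illustration of how the two models behave differently, the sketch below fits both on hypothetical, noise-free synthetic data where price is an exact linear function of size. The coefficients (50,000 and 150) and the names `lr_demo` and `rf_demo` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data: price = 50,000 + 150 * size, no noise.
rng = np.random.default_rng(0)
size = rng.uniform(500, 3000, size=200).reshape(-1, 1)
price = 50_000 + 150 * size.ravel()

lr_demo = LinearRegression().fit(size, price)
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42).fit(size, price)

# Linear regression recovers the generating slope and intercept.
slope = lr_demo.coef_[0]
intercept = lr_demo.intercept_
print(f"Slope: {slope:.2f}, Intercept: {intercept:,.0f}")

# Far outside the training range, the line keeps extrapolating, while the
# forest averages training targets and so cannot exceed the largest one.
lr_far = lr_demo.predict([[10_000]])[0]
rf_far = rf_demo.predict([[10_000]])[0]
print(f"LR at 10,000 sqft: {lr_far:,.0f}")
print(f"RF at 10,000 sqft: {rf_far:,.0f} (max training price {price.max():,.0f})")
```

On real data neither behavior is automatically better: the forest can capture non-linear patterns and interactions that a straight line misses, but it will not extrapolate beyond the range of prices seen in training.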

### Step 5: Evaluate Performance
Print R², MAE, MSE, RMSE for both models.

In [None]:
def evaluate_metrics(y_true, y_pred, model_name):
    """Print R², MAE, MSE, and RMSE for one model's predictions."""
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # RMSE is in the same units as Price
    print(f"{model_name} Performance:")
    print(f"  R²   : {r2:.2f}")
    print(f"  MAE  : {mae:,.0f}")
    print(f"  MSE  : {mse:,.0f}")
    print(f"  RMSE : {rmse:,.0f}\n")

print("--- Model Evaluation ---")
evaluate_metrics(y_test, lr_model.predict(X_test), "Linear Regression")
evaluate_metrics(y_test, rf_model.predict(X_test), "Random Forest")
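
To make the metrics concrete, this sketch computes R², MAE, MSE, and RMSE by hand with NumPy on four hypothetical price/prediction pairs and compares the results with the scikit-learn functions. The values in `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices.
y_true = np.array([200_000.0, 250_000.0, 300_000.0, 350_000.0])
y_pred = np.array([210_000.0, 240_000.0, 310_000.0, 330_000.0])

errors = y_true - y_pred

mae_manual = np.mean(np.abs(errors))   # average absolute error
mse_manual = np.mean(errors ** 2)      # average squared error
rmse_manual = np.sqrt(mse_manual)      # back in price units

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE  : {mae_manual:,.0f}")
print(f"MSE  : {mse_manual:,.0f}")
print(f"RMSE : {rmse_manual:,.0f}")
print(f"R²   : {r2_manual:.3f}")

# The manual values agree with scikit-learn.
matches = (
    np.isclose(mae_manual, mean_absolute_error(y_true, y_pred))
    and np.isclose(mse_manual, mean_squared_error(y_true, y_pred))
    and np.isclose(r2_manual, r2_score(y_true, y_pred))
)
print("Matches sklearn:", matches)
```

For these toy values MAE is 12,500, MSE is 175,000,000, RMSE is about 13,229, and R² is 0.944. RMSE is larger than MAE because squaring gives the single 20,000 error more weight.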

### Step 6: Single-row Sanity Check
Compare actual price with predictions for a single test row.

In [None]:
i = 0
single_row = X_test.iloc[[i]]
actual_price = y_test.iloc[i]
lr_pred = lr_model.predict(single_row)[0]
rf_pred = rf_model.predict(single_row)[0]

print(f"Sanity Check (Test Row {i}):")
print(f"  Actual Price   : ${actual_price:,.2f}")
print(f"  LR Prediction  : ${lr_pred:,.2f}")
print(f"  RF Prediction  : ${rf_pred:,.2f}")
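
A single row can be misleading, so the same check can be extended to a handful of rows. The sketch below is a self-contained illustration on hypothetical synthetic data (the `demo` frame, its coefficients, and the noise level are assumptions); in the notebook, the same pattern would use `X_test`, `y_test`, and the trained models.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the cleaned house dataset.
rng = np.random.default_rng(42)
demo = pd.DataFrame({"Size_sqft": rng.uniform(500, 3000, 50)})
demo["Price"] = 50_000 + 150 * demo["Size_sqft"] + rng.normal(0, 10_000, 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["Size_sqft"]], demo["Price"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_tr, y_tr)

# Side-by-side view of the first five test rows.
comparison = pd.DataFrame({
    "Actual": y_te.iloc[:5].to_numpy(),
    "Predicted": model.predict(X_te.iloc[:5]),
})
comparison["Error"] = comparison["Predicted"] - comparison["Actual"]
print(comparison.round(0))
```

Looking at the sign and size of `Error` across several rows gives a quick sense of whether a model tends to over- or under-predict, which one row alone cannot show.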