# House Prices Regression - Prediction Model

**Student ID:** 202210221  
**Student Name:** Ahmad Abu Ghazaleh  
**Project:** DS&AI Projects - SQA Implementation

## Objective
Predict house prices using regression algorithms on the Ames Housing dataset.

**Target Metric:** Root Mean Squared Error (RMSE) as per SQA Plan
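
As a reference for the target metric, RMSE is the square root of the mean squared difference between actual and predicted prices. The following is a minimal sketch on toy values; the `rmse` helper and the sample prices are illustrative assumptions and do not come from the SQA Plan or the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy prices for illustration only (not taken from the dataset)
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 290000]

score = rmse(actual, predicted)
print(f"RMSE: {score:.2f}")
```

Because errors are squared before averaging, RMSE penalizes a few large misses more heavily than many small ones, and it is expressed in the same unit as the target (here, sale price).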

---

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


## 1. Data Integrity Testing (Unit Testing Phase)
Per the SQA Plan, the dataset must be "clean" (0 null values in the selected features) before training.
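
The "0 null values" requirement can be expressed as an explicit check that fails loudly. The sketch below is one possible form, assuming a hypothetical helper named `assert_no_nulls` and a toy DataFrame; neither is part of the notebook or the SQA Plan.

```python
import pandas as pd

def assert_no_nulls(frame, columns):
    """Raise AssertionError if any listed column contains null values."""
    null_counts = frame[columns].isnull().sum()
    offenders = null_counts[null_counts > 0]
    if not offenders.empty:
        raise AssertionError(f"Null values found: {offenders.to_dict()}")

# Toy frame for illustration only; LotFrontage has one missing value
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, None, 68.0],
})

assert_no_nulls(toy, ["LotArea"])  # passes silently

try:
    assert_no_nulls(toy, ["LotArea", "LotFrontage"])
    check_failed = False
except AssertionError as exc:
    check_failed = True
    print(exc)
```

Running a check like this right before training turns the data-integrity rule into a test that either passes or stops the pipeline.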

In [2]:
# Load dataset
df = pd.read_csv('../datasets/house-prices/train.csv')

print(f"Dataset Shape: {df.shape}")
print(f"Number of features: {df.shape[1]}")
print(f"Number of samples: {df.shape[0]}")
print(f"\nFirst 5 rows:")
df.head()

Dataset Shape: (1460, 81)
Number of features: 81
Number of samples: 1460

First 5 rows:


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [3]:
# Display basic information
print("="*60)
print("DATASET INFORMATION")
print("="*60)
print(f"\nData types distribution:")
print(df.dtypes.value_counts())

print(f"\nTarget variable (SalePrice) statistics:")
print(df['SalePrice'].describe())

print(f"\nMissing values (top 15 features):")
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing.head(15))

DATASET INFORMATION

Data types distribution:
object     43
int64      35
float64     3
Name: count, dtype: int64

Target variable (SalePrice) statistics:
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

Missing values (top 15 features):
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
MasVnrType       872
FireplaceQu      690
LotFrontage      259
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
BsmtExposure      38
BsmtFinType2      38
BsmtQual          37
dtype: int64


## 2. Feature Selection and Data Cleaning
Select a subset of numerical and categorical features for house price prediction, avoiding columns with many missing values.
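
The feature lists in this notebook are hand-picked. As an illustration of how candidates could be screened by missingness first, here is a small sketch; the helper `low_missing_columns` and the 5% threshold are hypothetical choices, and the toy frame only borrows column names from the dataset.

```python
import pandas as pd

def low_missing_columns(frame, max_missing_fraction=0.05):
    """Return column names whose share of null values is within the threshold."""
    missing_fraction = frame.isnull().mean()
    keep = missing_fraction[missing_fraction <= max_missing_fraction]
    return list(keep.index)

# Toy frame for illustration only; PoolQC is mostly missing
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "PoolQC": [None, None, None, "Gd"],
    "KitchenQual": ["Gd", "TA", "Gd", "Gd"],
})

kept = low_missing_columns(toy, max_missing_fraction=0.05)
print(kept)
```

A screen like this only narrows the candidates; choosing which of the remaining columns are relevant to price is still a modeling decision.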

In [6]:
# Select features for modeling (avoiding high-missing features)
numerical_features = [
    'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
    'YearRemodAdd', 'TotalBsmtSF', 'GrLivArea',
    '1stFlrSF', '2ndFlrSF', 'BedroomAbvGr',
    'TotRmsAbvGrd', 'Fireplaces', 'GarageArea',
    'YrSold'
]

categorical_features = [
    'MSZoning', 'Street', 'LotShape', 'Neighborhood',
    'HouseStyle', 'RoofStyle', 'Foundation',
    'Heating', 'CentralAir', 'KitchenQual',
    'SaleCondition'
]

target = 'SalePrice'

# Combine selected features
selected_features = numerical_features + categorical_features

print(f"Selected {len(selected_features)} features:")
print(f"  - Numerical: {len(numerical_features)}")
print(f"  - Categorical: {len(categorical_features)}")

Selected 25 features:
  - Numerical: 14
  - Categorical: 11


In [7]:
# Create working dataframe with selected features
df_clean = df[selected_features + [target]].copy()

print(f"Before cleaning: {df_clean.shape[0]} rows")

# Check missing values
print(f"\nMissing values in selected features:")
missing_counts = df_clean.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)

# Handle missing values
# Numerical features: fill with the column median
for col in numerical_features:
    if df_clean[col].isnull().any():
        # Assign back rather than calling fillna(inplace=True) on a column
        # selection, which may not update df_clean in recent pandas versions
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())

# Categorical features: fill with the most frequent value (mode)
for col in categorical_features:
    if df_clean[col].isnull().any():
        df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])

print(f"\nAfter cleaning: {df_clean.shape[0]} rows")
print(f"Data retention: {df_clean.shape[0]/df.shape[0]*100:.1f}%")
print(f"Remaining missing values: {df_clean.isnull().sum().sum()}")

Before cleaning: 1460 rows

Missing values in selected features:
Series([], dtype: int64)

After cleaning: 1460 rows
Data retention: 100.0%
Remaining missing values: 0


## 3. Feature Engineering
Create derived features that may improve model performance.
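
Log transforms are a common derived feature for area columns, but `np.log` evaluates to `-inf` at zero. The sketch below compares it with `np.log1p` on toy values (the zero is included deliberately and is not a claim about the dataset).

```python
import numpy as np

# Toy areas for illustration only; the 0 is deliberate
areas = np.array([0.0, 856.0, 2566.0])

# Silence the divide-by-zero warning so the comparison prints cleanly
with np.errstate(divide="ignore"):
    plain_log = np.log(areas)   # log(0) evaluates to -inf

safe_log = np.log1p(areas)      # log(1 + 0) evaluates to 0.0

# expm1 is the inverse of log1p, so the original values can be recovered
recovered = np.expm1(safe_log)

print("Infinite values with np.log:  ", int(np.isinf(plain_log).sum()))
print("Infinite values with np.log1p:", int(np.isinf(safe_log).sum()))
```

For large areas the two transforms are nearly identical, so `np.log1p` costs little while removing the zero-value failure mode.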

In [8]:
# Feature Engineering
# Create new features that might be predictive

# Total square footage
df_clean['TotalSF'] = df_clean['TotalBsmtSF'] + \
    df_clean['1stFlrSF'] + df_clean['2ndFlrSF']

# House age
df_clean['HouseAge'] = df_clean['YrSold'] - df_clean['YearBuilt']

# Log-transformed area features (the log compresses large values).
# np.log1p computes log(1 + x), so a zero area maps to 0 instead of -inf;
# plain np.log(0) would evaluate to -inf and break model training.
df_clean['Log_LotArea'] = np.log1p(df_clean['LotArea'])
df_clean['Log_GrLivArea'] = np.log1p(df_clean['GrLivArea'])
df_clean['Log_TotalSF'] = np.log1p(df_clean['TotalSF'])

print("✓ Feature engineering completed")
print(f"\nNew features created:")
print("  - TotalSF (Total Square Footage)")
print("  - HouseAge (Age of house)")
print("  - Log_LotArea (Log of lot area)")
print("  - Log_GrLivArea (Log of living area)")
print("  - Log_TotalSF (Log of total SF)")

# Check for any issues
print(f"\nChecking for invalid values:")
print(
    f"  Infinite values: {np.isinf(df_clean.select_dtypes(include=[np.number])).sum().sum()}")
print(f"  NaN values: {df_clean.isnull().sum().sum()}")

✓ Feature engineering completed

New features created:
  - TotalSF (Total Square Footage)
  - HouseAge (Age of house)
  - Log_LotArea (Log of lot area)
  - Log_GrLivArea (Log of living area)
  - Log_TotalSF (Log of total SF)

Checking for invalid values:
  Infinite values: 0
  NaN values: 0


