# **Environment Setup**

Welcome to the foundation of our **house price prediction model**! In this section, we'll prepare our **workspace** by **importing essential libraries**, **models**, and **metrics** while establishing **crucial constants** for a **smooth and reliable workflow**.

* **Library Inclusion:** We'll import pivotal Python libraries like pandas, numpy, and matplotlib to handle data, perform calculations, and visualize our insights.

* **Models and Metrics:** We'll bring in the necessary tools from scikit-learn, including models such as Linear Regression and Random Forest, and metrics like Mean Squared Error for evaluating model performance.

* **Ensuring Reproducibility:** We'll set a random seed, vital for reproducibility, guaranteeing consistent results across multiple runs.

Let's lay down this robust foundation, combining professionalism with the thrill of building something innovative!

In [1]:
# Common Imports
import keras
import numpy as np

# Data Imports
import pandas as pd

# Data Visualization
import plotly.express as px
import matplotlib.pyplot as plt

# Data Processing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model
from keras import layers
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Model Evaluations
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In order to guarantee that the results are reproducible, we will declare a random seed.

In [2]:
# Setting random seed to the Magic Number
random_seed = 42
np.random.seed(random_seed)
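
As a small illustration of why the seed matters, the sketch below uses made-up toy arrays (not the housing data) to show that passing the same `random_state` to `train_test_split` yields the same split on every call. Note that `np.random.seed` only seeds NumPy's global generator; libraries that keep their own generator, such as Keras with a TensorFlow backend, would likely need to be seeded separately.

```python
import numpy as np
from sklearn.model_selection import train_test_split

random_seed = 42

# Toy data, used only to illustrate the effect of the seed
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same random_state give identical splits
_, x_test_a, _, y_test_a = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=random_seed
)
_, x_test_b, _, y_test_b = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=random_seed
)

same_split = np.array_equal(x_test_a, x_test_b) and np.array_equal(y_test_a, y_test_b)
print("Identical splits:", same_split)
```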

# **Data Loading**

Welcome to the data loading phase! Here, we load the dataset and take a first look at it. We examine summary statistics such as the mean, median, and quartiles to understand its characteristics, and we check for null values, a crucial step before any modeling begins. This check tells us which preprocessing steps the data needs.

In [3]:
# Set the data file path
file_path = "/kaggle/input/house-prices-advanced-regression-techniques/train.csv"

# Loading the Data
df = pd.read_csv(file_path)
df.drop(columns=["Id"], inplace=True)
df

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2007,WD,Normal,175000
1456,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2010,WD,Normal,142125


That's how our data set looks. Let's uncover some hidden insights about the data.
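
The summary statistics and null check mentioned in the introduction to this section can be obtained with `describe()` and `isnull().sum()`. The sketch below is only an illustration on a toy frame: the values are copied from the first rows shown above, except that the last `LotFrontage` value is replaced by `NaN` so the null count is not zero.

```python
import numpy as np
import pandas as pd

# Toy frame built from the first rows shown above (last LotFrontage set to NaN
# purely for illustration)
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, 68.0, 60.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# count, mean, std, min, quartiles (50% is the median) and max per column
summary = toy.describe()
print(summary)

# Number of missing values per column
null_counts = toy.isnull().sum()
print(null_counts)
```

On the real data, the same two calls on `df` would give the per-column statistics and null counts.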

In [4]:
len(df.columns)

80

The data set consists of **80 columns** at the **current stage** (79 features plus the target, `SalePrice`). Because this number is **high**, we will go through a **feature selection process** in order to **remove some of the features** and **only preserve the important ones**.

In [5]:
# Record feature names with extremely high null values
high_null_count_features = []

# Record total number of features with null values
null_feature_values = 0

for feature, null_count in df.isnull().sum().items():
    if null_count > 0:
        print(f"{feature:20} : {null_count:4}")
        null_feature_values += 1
    
    if null_count >= 1000:
        high_null_count_features.append(feature)

print(f"\nFeatures with Null Values: {null_feature_values}")
print(f"Features with High Null Values: {len(high_null_count_features)}")

LotFrontage          :  259
Alley                : 1369
MasVnrType           :  872
MasVnrArea           :    8
BsmtQual             :   37
BsmtCond             :   37
BsmtExposure         :   38
BsmtFinType1         :   37
BsmtFinType2         :   38
Electrical           :    1
FireplaceQu          :  690
GarageType           :   81
GarageYrBlt          :   81
GarageFinish         :   81
GarageQual           :   81
GarageCond           :   81
PoolQC               : 1453
Fence                : 1179
MiscFeature          : 1406

Features with Null Values: 19
Features with High Null Values: 4


Upon inspection for **null values**, **19 of the dataset's features** exhibit **missing data**. Notably, four of them, **'Alley'**, **'PoolQC'**, **'Fence'**, and **'MiscFeature'**, contain **more than 1000 null values** out of **1460 rows**.

This **substantial share** of **missing data** makes **straightforward imputation techniques** like **mean or mode** substitution unsuitable for these features: **filling most of a column with a single value** would turn it from a **reflection of real-world information** into a largely **artificial, synthesized representation**.

Because the **total number of features is high (80)**, we can simply **remove** the **features with extremely high null value counts**.

In [6]:
# Drop columns with high null value count
df.drop(columns = high_null_count_features, inplace = True)

# Confirm the deletion
print(f"Features left: {len(df.columns)}")

Features left: 76


After deleting the **4 features** with **extremely high null value counts**, we are **left with 76 features**. Of these, **15 features** still have **null values** that we need to deal with.
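
The cutoff of 1000 nulls used above is tied to this dataset's 1460 rows. A possible variation, sketched below on a toy frame, expresses the cutoff as a fraction of rows instead. The helper name `high_null_columns` and the 0.5 threshold are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def high_null_columns(frame, max_null_fraction=0.5):
    """Return the columns whose share of missing values exceeds the threshold."""
    null_fraction = frame.isnull().mean()  # fraction of NaN per column
    return null_fraction[null_fraction > max_null_fraction].index.tolist()

# Toy frame: "Alley" is mostly missing, "LotFrontage" only partly
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

to_drop = high_null_columns(toy, max_null_fraction=0.5)
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```

A fractional threshold keeps the rule meaningful if the number of rows changes, for example when the same step is applied to a different file.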

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 76 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   LotShape       1460 non-null   object 
 6   LandContour    1460 non-null   object 
 7   Utilities      1460 non-null   object 
 8   LotConfig      1460 non-null   object 
 9   LandSlope      1460 non-null   object 
 10  Neighborhood   1460 non-null   object 
 11  Condition1     1460 non-null   object 
 12  Condition2     1460 non-null   object 
 13  BldgType       1460 non-null   object 
 14  HouseStyle     1460 non-null   object 
 15  OverallQual    1460 non-null   int64  
 16  OverallCond    1460 non-null   int64  
 17  YearBuilt      1460 non-null   int64  
 18  YearRemo

Upon looking at the **data type** of the **different features,** we can clearly see that our data is a **mixture of both numeric** and **categorical features.** 

# **Data Preprocessing**

Presently, our data poses **two significant challenges**. 

* Primarily, the **existence of null values** necessitates meticulous handling to **maintain data integrity**. 

* Secondly, the dataset comprises **categorical features**, which inherently pose a **challenge for computational analysis** due to computers' preference for **numerical data**. 

To address this, employing an **ordinal encoder** becomes imperative to seamlessly **transform these categorical attributes** into **numeric equivalents**, **facilitating our computational processes**.

In [8]:
# Reanalysis of null values with data types
null_features = []
for feature, null_count in df.isnull().sum().items():
    if null_count > 0:
        print(f"{feature:20} -> {str(df[feature].dtype):10} -> {null_count:5}")
        null_features.append(feature)

LotFrontage          -> float64    ->   259
MasVnrType           -> object     ->   872
MasVnrArea           -> float64    ->     8
BsmtQual             -> object     ->    37
BsmtCond             -> object     ->    37
BsmtExposure         -> object     ->    38
BsmtFinType1         -> object     ->    37
BsmtFinType2         -> object     ->    38
Electrical           -> object     ->     1
FireplaceQu          -> object     ->   690
GarageType           -> object     ->    81
GarageYrBlt          -> float64    ->    81
GarageFinish         -> object     ->    81
GarageQual           -> object     ->    81
GarageCond           -> object     ->    81


We can clearly see that the majority of the **null values** are present in **categorical features**. Thus we will first **impute them** and then **encode them**.

In [9]:
# Fill the missing values
for feature in null_features:
    
    if str(df[feature].dtype) == "object":
        
        # Impute using Mode Values
        imputer = SimpleImputer(strategy = "most_frequent")
        df[feature] = imputer.fit_transform(df[feature].to_numpy().reshape(-1, 1)).ravel()

    else:
        # Impute using mean values
        imputer = SimpleImputer(strategy = "mean")
        df[feature] = imputer.fit_transform(df[feature].to_numpy().reshape(-1, 1)).ravel()

In [10]:
print(f"Null Values Left: {any(df.isnull().sum())}")

Null Values Left: False


Excellent progress! We've effectively managed the issue of **null values** by leveraging the **Simple Imputer**. **Numeric nulls** have been **addressed** by replacing them with their **respective mean values**, while **categorical features** have been handled by **imputing the mode**, representing the **most frequent value within each category.**
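
The same idea can be written without a per-column loop by imputing all numeric columns and all categorical columns in two calls. This is a minimal sketch on a toy frame with made-up values, assuming the categorical columns have `object` dtype as in the `df.info()` output above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "LotFrontage": pd.Series([65.0, np.nan, 68.0, 60.0]),
    "GarageType": pd.Series(["Attchd", "Attchd", np.nan, "Detchd"], dtype=object),
})

# Split columns by dtype so each group gets a suitable strategy
categorical_cols = toy.select_dtypes(include="object").columns
numeric_cols = toy.select_dtypes(exclude="object").columns

# Mean for numeric columns, most frequent value (mode) for categorical ones
toy[numeric_cols] = SimpleImputer(strategy="mean").fit_transform(toy[numeric_cols])
toy[categorical_cols] = SimpleImputer(strategy="most_frequent").fit_transform(
    toy[categorical_cols]
)

print(toy)
```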

---

To prepare our data for **modeling purposes**, it's imperative to **transform categorical features into numerical representations.** 

In [11]:
# Record the feature and its respective encoder
encoders = {}

# Use an ordinal encoder to convert categories into numeric values
for feature in df.columns:
    if str(df[feature].dtype) == "object":
        
        # Initialize the ordinal encoder
        encoder = OrdinalEncoder()
        
        # Apply the ordinal encoder
        encoder.fit(df[feature].to_numpy().reshape(-1, 1))
        df[feature] = encoder.transform(df[feature].to_numpy().reshape(-1, 1)).ravel()
        
        # Save the encoder
        encoders[feature] = encoder

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 76 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   float64
 2   LotFrontage    1460 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   float64
 5   LotShape       1460 non-null   float64
 6   LandContour    1460 non-null   float64
 7   Utilities      1460 non-null   float64
 8   LotConfig      1460 non-null   float64
 9   LandSlope      1460 non-null   float64
 10  Neighborhood   1460 non-null   float64
 11  Condition1     1460 non-null   float64
 12  Condition2     1460 non-null   float64
 13  BldgType       1460 non-null   float64
 14  HouseStyle     1460 non-null   float64
 15  OverallQual    1460 non-null   int64  
 16  OverallCond    1460 non-null   int64  
 17  YearBuilt      1460 non-null   int64  
 18  YearRemo

Upon reviewing **the dataset information,** it's evident that **our data** has been **seamlessly transformed** into **numerical values**. Consequently, we have **effectively managed** and **resolved** the **challenge** posed by **categorical features**, successfully **converting them into numeric representations.**
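
To make the encoding step more concrete, the sketch below fits an `OrdinalEncoder` on a toy frame and inspects the result. By default the integer codes follow the sorted order of the labels, so they carry no quality ranking unless an explicit `categories` order is supplied. The toy values are taken from the column samples shown earlier.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two categorical columns
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "Reg"],
}, dtype=object)

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy)

# categories_ lists, per column, the labels in the order of their integer codes
print(encoder.categories_)
print(encoded)

# inverse_transform maps the codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Keeping the fitted encoders, as the `encoders` dictionary above does, is what makes this reverse mapping possible later.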

# **Exploratory Data Analysis**

Given the extensive number of feature columns in our dataset, manually exploring each feature for visual analysis isn't a practical or efficient approach. Instead, an optimal strategy involves utilizing a correlation matrix to identify the relationships between different features. This matrix offers a comprehensive overview, highlighting correlations between specific features, thereby enabling focused exploration and plotting of those specific relationships. This method streamlines our analysis, allowing us to pinpoint and delve deeper into the most relevant and interrelated features, optimizing our insights.

In [13]:
import seaborn as sns

In [14]:
# Compute Spearman correlation
corr = df.corr(method = "spearman")
corr = np.round(corr, 2)

# Visualize Spearman correlation
fig = px.imshow(corr, text_auto = True, height = 1000)
fig.show()

With the abundance of features, the correlation matrix reveals numerous correlations, yet only a subset of these correlations holds significance for our analysis. It's evident that only certain features exhibit substantial correlations among themselves, warranting our attention. To streamline our visual exploration, we'll employ a threshold of 0.7 for correlation values. This criterion will allow us to focus solely on visualizing those correlations that surpass this threshold, ensuring a targeted examination of the most pertinent and influential feature relationships.

In [15]:
# Compute Spearman correlation
corr = df.corr(method = "spearman")
corr = corr[np.round(corr, 2) > .7]

# Visualize Spearman correlation
fig = px.imshow(corr, text_auto = True, height = 800, color_continuous_scale='gray')
fig.show()

It's evident that within our dataset, a **sparse number of features** exhibit **significant correlations** with each other. Notably, the **correlations predominantly display positive relationships**, with a **solitary instance** of a **negative correlation**.
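
Reading pairs off a large heatmap is error-prone, so it can help to list them directly. The sketch below is one possible approach, with a hypothetical helper `strong_pairs` and toy columns `a` to `d`. It filters on the absolute value of the coefficient, so strong negative correlations are also reported, which a plain `> .7` filter would hide.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.7):
    """Return feature pairs whose absolute Spearman correlation exceeds `threshold`."""
    corr = frame.corr(method="spearman")
    # Boolean mask of the strict upper triangle: each pair once, no diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    pairs = pairs[pairs.abs() > threshold]
    # Order the surviving pairs from strongest to weakest
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order)

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 5, 8, 10, 13],   # rises with "a"
    "c": [6, 5, 4, 3, 2, 1],     # falls as "a" rises
    "d": [3, 6, 1, 4, 2, 5],     # roughly unrelated to "a"
})

result = strong_pairs(toy)
print(result)
```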

---
Let's visualize the first correlation between the overall quality and the sales price. The correlation coefficient is 0.8 signifying a strong positive correlation.

In [16]:
fig = px.histogram(df, 'SalePrice', color="OverallQual", title="OverallQual vs SalePrice (0.8)")
fig.show()

fig = px.box(df, y = 'SalePrice', x = 'OverallQual', color="OverallQual", title="OverallQual vs SalePrice (0.8)")
fig.show()

A clear **monotonic relationship** emerges: as the **overall quality rating rises**, **sale prices increase**. This aligns logically, as **superior quality** commands **higher prices**. Houses with a **quality rating of 10**, the **highest achievable rating**, sell for notably **higher prices** than those **rated 4 or 5**. The **histogram** further shows that **far more houses were sold** at the **lower quality levels** than at the **highest ones**.

In [17]:
fig = px.histogram(df, x='YearBuilt', y='GarageYrBlt', title="YearBuilt vs GarageYrBlt (0.8)", text_auto=True)
fig.show()

fig = px.scatter(df, x = 'YearBuilt', y = 'GarageYrBlt', title="YearBuilt vs GarageYrBlt (0.8)")
fig.show()

Upon examining **the plots**, a **clear linear relationship** between **the year a house was built** and **the year its garage was built** becomes apparent. Most points lie on or near the diagonal, meaning the garage was typically built **in the same year as the house**, with some garages added later. Notably, this relationship isn't equally consistent throughout the **dataset's history**: it becomes **pronounced from around 1920 onward**. Keep in mind that the **81 missing `GarageYrBlt` values** were imputed with the column mean, so those houses all share one garage year regardless of when they were built.

In [18]:
fig = px.histogram(df, x='Exterior1st', title="Histogram of Exterior1st", text_auto=True)
fig.show()

fig = px.histogram(df, x='Exterior2nd', title="Histogram of Exterior2nd", text_auto=True)
fig.show()

Both variables show a notably **strong correlation**, with a **coefficient of 0.85**. Because both columns are **ordinal-encoded categories**, this coefficient should not be read as a linear relationship between quantities. In the dataset's data description, **`Exterior1st`** is the exterior covering of the house and **`Exterior2nd`** is a second covering material where more than one is used. The high correlation therefore most likely reflects that many houses list the **same material in both columns**, which the two similar histograms are consistent with.

In [19]:
fig = px.histogram(df, x = 'TotalBsmtSF', y = '1stFlrSF', title="TotalBsmtSF vs 1stFlrSF (0.8)", text_auto=True)
fig.show()

fig = px.scatter(df, x = 'TotalBsmtSF', y = '1stFlrSF', title="TotalBsmtSF vs 1stFlrSF (0.8)")
fig.show()

Upon **visual inspection**, a seemingly **linear relationship emerges**; however, it's noteworthy that this **linearity is somewhat influenced by an outlier,** skewing the otherwise apparent linearity. On **closer examination** within the **value range of approximately 500 to 2500,** a semblance of **linearity exists**, albeit **not consistently throughout**. This **inconsistency points** to a **fluctuating linear pattern** within this specific value range.

In [20]:
fig = px.histogram(df, 'GrLivArea', color="TotRmsAbvGrd", title="TotRmsAbvGrd vs GrLivArea (0.8)")
fig.show()

fig = px.box(df, y = 'GrLivArea', x = 'TotRmsAbvGrd', color="TotRmsAbvGrd", title="TotRmsAbvGrd vs GrLivArea (0.8)")
fig.show()

Here, the box plot makes the relationship visible: the median living area rises steadily with the number of rooms, although outliers are present at several room counts.

In [21]:
fig = px.histogram(df, x='SalePrice', y='GarageYrBlt', title="SalePrice vs GarageYrBlt (0.8)", text_auto=True)
fig.show()

fig = px.scatter(df, x = 'SalePrice', y = 'GarageYrBlt', title="SalePrice vs GarageYrBlt (0.8)")
fig.show()

In [22]:
fig = px.histogram(df, 'GarageArea', color="GarageCars", title="GarageArea vs GarageCars (0.8)")
fig.show()

fig = px.box(df, y = 'GarageArea', x = 'GarageCars', color="GarageCars", title="GarageArea vs GarageCars (0.8)")
fig.show()

The **correlation** observed between **Garage Area** and the **number of garage cars** appears intuitively **logical**. As the **count of cars escalates**, a **corresponding increase** in **the necessary garage area** for accommodating these **cars naturally follows**. This relationship adheres to an expected and pragmatic pattern, where the need for more space aligns with an increased number of cars needing storage within the garage.

# **ML Models - Data Preprocessing**

Prior to advancing further, a crucial step involves meticulous **data preprocessing** to ensure **compatibility with our machine learning models**. Firstly, we'll partition **the dataset** into **separate training and testing sets**, essential for **model evaluation**. Additionally, the **disparate scales** among the **features necessitate scaling to establish uniformity**, a practice known to **enhance model performance**. Employing the **StandardScaler,** we aim to **standardize the data**, aligning **all feature scales**, a **recognized method** for **optimizing model** performance by ensuring consistency **across the dataset.**

In [23]:
# Target Column
Y_data = df.pop('SalePrice')

# Feature Columns
X_data = df.copy()

In [24]:
# Applying Standard Scaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_data)

In [25]:
# Splitting data into training and testing set
x_train, x_test, y_train, y_test = train_test_split(
    X_scaled, Y_data, 
    shuffle = True,
    random_state = random_seed,
    train_size = 0.9,
    test_size = 0.1
)
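
In the cells above, the scaler is fitted on the full dataset before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; the toy arrays are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_data and Y_data
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=1000.0, scale=250.0, size=(100, 3))
y_toy = rng.normal(loc=180000.0, scale=50000.0, size=100)

# 1. Split first
x_train, x_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=42
)

# 2. Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.mean(axis=0).round(6))
print(x_test_scaled.shape)
```

With this order, the test set stays unseen until evaluation, which tends to give a more honest estimate of performance on new data.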

# **ML Models**

With the successful application of **data preprocessing steps**, our dataset is **now primed** for **compatibility with various machine learning models**. Having standardized the data and made it conducive for **model interpretation**, we're poised to initiate the **crucial phases of training** and **testing** our predictive model.

In [26]:
def metrics(true, pred):
    mse = mean_squared_error(true, pred)
    mae = mean_absolute_error(true, pred)
    r2_s = r2_score(true, pred)
    return mse, mae, r2_s

In [27]:
# Record all the training and validation scores
model_names = []
train_mses = []
test_mses = []
train_maes = []
test_maes = []
train_r2s = []
test_r2s = []

In [28]:
model_name = "LinearRegression"

# Initialize and train the Model
lr = LinearRegression()
lr.fit(x_train, y_train)

# Calculate metrics
train_pred = lr.predict(x_train)
test_pred = lr.predict(x_test)
train_mse, train_mae, train_r2 = metrics(y_train, train_pred)
test_mse, test_mae, test_r2 = metrics(y_test, test_pred)

# Append the metrics to the lists
model_names.append(model_name)
train_mses.append(train_mse)
test_mses.append(test_mse)
train_maes.append(train_mae)
test_maes.append(test_mae)
train_r2s.append(train_r2)
test_r2s.append(test_r2)

# Print the appended results
print("Model Name :", model_name)
print("Train MSE  :", train_mse)
print("Test MSE   :", test_mse)
print("Train MAE  :", train_mae)
print("Test MAE   :", test_mae)
print("Train R2   :", train_r2)
print("Test R2    :", test_r2)

Model Name : LinearRegression
Train MSE  : 920248745.155921
Test MSE   : 1360842282.1077387
Train MAE  : 18631.17258699519
Test MAE   : 20822.111757196693
Train R2   : 0.8464176712013937
Test R2    : 0.8510565906732396


The Linear Regression model showcases a moderately favorable performance on the test data. It demonstrates an ability to predict house prices with reasonable accuracy, as indicated by the R-squared value of approximately 0.85, suggesting that around 85% of the variance in house prices is explained by the model. However, the Mean Squared Error (MSE) and Mean Absolute Error (MAE) values reveal some discrepancies between predicted and actual prices, implying room for further improvement in capturing the finer nuances of pricing dynamics. Overall, while the model shows promise, fine-tuning or exploring alternative models might enhance its predictive precision.
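
The fit, predict, score, and record steps in the cell above are repeated for every model that follows. One way to avoid the repetition is a small helper, sketched below. The name `evaluate_model` and the list-of-dicts `records` are hypothetical and not part of the original notebook; the toy data only exercises the helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_model(name, model, x_train, y_train, x_test, y_test, records):
    """Fit the model, compute train/test metrics, and append them to `records`."""
    model.fit(x_train, y_train)
    row = {"Name": name}
    for split, x, y in (("Train", x_train, y_train), ("Test", x_test, y_test)):
        pred = model.predict(x)
        row[f"{split} MSE"] = mean_squared_error(y, pred)
        row[f"{split} MAE"] = mean_absolute_error(y, pred)
        row[f"{split} R2"] = r2_score(y, pred)
    records.append(row)
    return row

# Toy data with an exactly linear target, used only to exercise the helper
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(60, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 1.0

records = []
row = evaluate_model(
    "LinearRegression", LinearRegression(),
    X_toy[:50], y_toy[:50], X_toy[50:], y_toy[50:], records,
)
print(row)
```

Passing `records` to `pd.DataFrame` would then give a comparison table with one row per model.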

In [29]:
model_name = "SupportVectorRegression"

# Initialize and train the Model
svr = SVR()
svr.fit(x_train, y_train)

# Calculate metrics
train_pred = svr.predict(x_train)
test_pred = svr.predict(x_test)
train_mse, train_mae, train_r2 = metrics(y_train, train_pred)
test_mse, test_mae, test_r2 = metrics(y_test, test_pred)

# Append the metrics to the lists
model_names.append(model_name)
train_mses.append(train_mse)
test_mses.append(test_mse)
train_maes.append(train_mae)
test_maes.append(test_mae)
train_r2s.append(train_r2)
test_r2s.append(test_r2)

# Print the appended results
print("Model Name :", model_name)
print("Train MSE  :", train_mse)
print("Test MSE   :", test_mse)
print("Train MAE  :", train_mae)
print("Test MAE   :", test_mae)
print("Train R2   :", train_r2)
print("Test R2    :", test_r2)

Model Name : SupportVectorRegression
Train MSE  : 6249131009.706121
Test MSE   : 9467853895.498758
Train MAE  : 54676.1766707412
Test MAE   : 62950.85577393378
Train R2   : -0.04293116235180228
Test R2    : -0.036251192914940944


The Support Vector Regression model performs poorly on both sets. The negative R-squared values for the training and test sets mean that the model does worse than a baseline that always predicts the mean house price. The high Mean Squared Error (MSE) and Mean Absolute Error (MAE) values, with a test MAE of about 63,000, confirm large gaps between predicted and actual prices. With its default settings, this model fails to capture the underlying patterns in the data, so it needs hyperparameter tuning or a different modeling strategy.
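
One plausible explanation for this result is that `SVR` is used with its default settings (`C=1.0`, `epsilon=0.1`) on a target measured in hundreds of thousands, which leaves the model almost no room to move away from a constant prediction. The sketch below, on synthetic data with a price-like target, compares a plain `SVR` with one wrapped in `TransformedTargetRegressor` so that the target is standardized during fitting. It is an illustration of the idea under these assumptions, not a tuned solution for the housing data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features and a target on a price-like scale
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 5))
y_toy = (
    180000
    + 50000 * X_toy[:, 0]
    + 20000 * X_toy[:, 1]
    + rng.normal(scale=5000, size=300)
)

x_tr, x_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

# Default SVR on the raw target
raw_svr = SVR().fit(x_tr, y_tr)
r2_raw = r2_score(y_te, raw_svr.predict(x_te))

# Same SVR, but the target is standardized for fitting and mapped back for prediction
scaled_svr = TransformedTargetRegressor(regressor=SVR(), transformer=StandardScaler())
scaled_svr.fit(x_tr, y_tr)
r2_scaled = r2_score(y_te, scaled_svr.predict(x_te))

print("R2 with raw target   :", r2_raw)
print("R2 with scaled target:", r2_scaled)
```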

In [30]:
model_name = "DecisionTreeRegression"

# Initialize and train the Model
dtr = DecisionTreeRegressor()
dtr.fit(x_train, y_train)

# Calculate metrics
train_pred = dtr.predict(x_train)
test_pred = dtr.predict(x_test)
train_mse, train_mae, train_r2 = metrics(y_train, train_pred)
test_mse, test_mae, test_r2 = metrics(y_test, test_pred)

# Append the metrics to the lists
model_names.append(model_name)
train_mses.append(train_mse)
test_mses.append(test_mse)
train_maes.append(train_mae)
test_maes.append(test_mae)
train_r2s.append(train_r2)
test_r2s.append(test_r2)

# Print the appended results
print("Model Name :", model_name)
print("Train MSE  :", train_mse)
print("Test MSE   :", test_mse)
print("Train MAE  :", train_mae)
print("Test MAE   :", test_mae)
print("Train R2   :", train_r2)
print("Test R2    :", test_r2)

Model Name : DecisionTreeRegression
Train MSE  : 0.0
Test MSE   : 1162384929.69863
Train MAE  : 0.0
Test MAE   : 22168.369863013697
Train R2   : 1.0
Test R2    : 0.8727776343697895


The Decision Tree Regression model demonstrates an intriguingly contrasting performance between training and testing sets based on the provided metrics. With a Train Mean Squared Error (MSE) and Train Mean Absolute Error (MAE) of 0.0, the model perfectly fits the training data, implying an exact match between predicted and actual prices, a scenario uncommon in real-world applications. However, on the test set, while it exhibits a relatively high R-squared value of approximately 0.87, signifying a good fit to the test data, there's a noticeable increase in MSE and MAE. This suggests some level of overfitting, where the model excessively tailors itself to the training data, resulting in reduced generalization to unseen data. Further regularization or fine-tuning is recommended to strike a balance between complexity and predictive accuracy for improved real-world applicability.

In [31]:
model_name = "DecisionTreeRegression(MD=5)"

# Initialize and train the Model
dtr = DecisionTreeRegressor(max_depth=5)
dtr.fit(x_train, y_train)

# Calculate metrics
train_pred = dtr.predict(x_train)
test_pred = dtr.predict(x_test)
train_mse, train_mae, train_r2 = metrics(y_train, train_pred)
test_mse, test_mae, test_r2 = metrics(y_test, test_pred)

# Append the metrics to the lists
model_names.append(model_name)
train_mses.append(train_mse)
test_mses.append(test_mse)
train_maes.append(train_mae)
test_maes.append(test_mae)
train_r2s.append(train_r2)
test_r2s.append(test_r2)

# Print the appended results
print("Model Name :", model_name)
print("Train MSE  :", train_mse)
print("Test MSE   :", test_mse)
print("Train MAE  :", train_mae)
print("Test MAE   :", test_mae)
print("Train R2   :", train_r2)
print("Test R2    :", test_r2)

Model Name : DecisionTreeRegression(MD=5)
Train MSE  : 837911112.7901535
Test MSE   : 1411009240.766638
Train MAE  : 21577.830107540503
Test MAE   : 24862.38163046515
Train R2   : 0.8601591790198645
Test R2    : 0.8455658457452984


The Decision Tree Regression model with a maximum depth of 5 exhibits favorable performance, showcasing a balanced fit between training and test datasets. With a moderate difference between training and test metrics, the model demonstrates good generalization to unseen data. The R-squared values for both training (approximately 0.86) and test sets (around 0.85) indicate a substantial portion of variance in house prices being captured by the model, suggesting a reasonably good fit. While the Mean Squared Error (MSE) and Mean Absolute Error (MAE) are relatively higher compared to a perfect fit, they still imply an acceptable level of predictive accuracy. This model configuration strikes a balance between complexity and performance, showcasing promising results for practical deployment.
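
The depth of 5 used above is a single hand-picked value. A minimal sketch of how several depths could be compared with cross-validation follows. The synthetic data and the candidate depths are assumptions made for illustration, and the scores it prints say nothing about the housing data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with some noise
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = 5.0 * X_toy[:, 0] + np.sin(X_toy[:, 1]) + rng.normal(scale=1.0, size=200)

cv_scores = {}
for depth in (2, 5, 10, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X_toy, y_toy, cv=5, scoring="r2")
    cv_scores[depth] = scores.mean()

# Depth with the highest mean cross-validated R2
best_depth = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("Best depth:", best_depth)
```

Because every score comes from held-out folds, this comparison does not reward a tree for memorizing its training rows.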

In [32]:
model_name = "RandomForestRegression"

# Initialize and train the Model
rfr = RandomForestRegressor()
rfr.fit(x_train, y_train)

# Calculate metrics
train_pred = rfr.predict(x_train)
test_pred = rfr.predict(x_test)
train_mse, train_mae, train_r2 = metrics(y_train, train_pred)
test_mse, test_mae, test_r2 = metrics(y_test, test_pred)

# Append the metrics to the lists
model_names.append(model_name)
train_mses.append(train_mse)
test_mses.append(test_mse)
train_maes.append(train_mae)
test_maes.append(test_mae)
train_r2s.append(train_r2)
test_r2s.append(test_r2)

# Print the appended results
print("Model Name :", model_name)
print("Train MSE  :", train_mse)
print("Test MSE   :", test_mse)
print("Train MAE  :", train_mae)
print("Test MAE   :", test_mae)
print("Train R2   :", train_r2)
print("Test R2    :", test_r2)

Model Name : RandomForestRegression
Train MSE  : 116414506.95307802
Test MSE   : 954689586.8188685
Train MAE  : 6391.348926940639
Test MAE   : 16533.422602739727
Train R2   : 0.9805713279394194
Test R2    : 0.8955097708388952


The RandomForest Regression model performs strongly. Its test R-squared of approximately 0.90 and test Mean Absolute Error (MAE) of about 16,500 are better than those of the linear, support vector, and decision tree models above. The training metrics are considerably better still (R-squared of approximately 0.98, MAE of about 6,400), so the model fits the training data more closely than unseen data, but the gap is far smaller than for the unrestricted decision tree. Overall, the model generalizes well to new data, making it a compelling choice for predictive modeling in this context.

In [33]:
model_name = "XGBRegressor"

# Initialize and train the Model
xgb = XGBRegressor()
xgb.fit(x_train, y_train)

# Calculate metrics
train_pred = xgb.predict(x_train)
test_pred = xgb.predict(x_test)
train_mse, train_mae, train_r2 = metrics(y_train, train_pred)
test_mse, test_mae, test_r2 = metrics(y_test, test_pred)

# Append the metrics to the lists
model_names.append(model_name)
train_mses.append(train_mse)
test_mses.append(test_mse)
train_maes.append(train_mae)
test_maes.append(test_mae)
train_r2s.append(train_r2)
test_r2s.append(test_r2)

# Print the appended results
print("Model Name :", model_name)
print("Train MSE  :", train_mse)
print("Test MSE   :", test_mse)
print("Train MAE  :", train_mae)
print("Test MAE   :", test_mae)
print("Train R2   :", train_r2)
print("Test R2    :", test_r2)

Model Name : XGBRegressor
Train MSE  : 1602708.6614431557
Test MSE   : 720266915.3514816
Train MAE  : 879.9216401493532
Test MAE   : 16234.793637628425
Train R2   : 0.9997325204409071
Test R2    : 0.9211671981329388


The XGBRegressor model achieves the best test results of all models evaluated here: a test R-squared of approximately 0.92, together with the lowest test Mean Squared Error (MSE) and test Mean Absolute Error (MAE, about 16,200). Its training R-squared of approximately 0.9997 shows that it fits the training data almost perfectly, so a clear gap between training and test error remains, which points to some overfitting. Even so, its test performance makes it a compelling choice for predictive modeling in this domain, and regularization or tuning might narrow the gap.

In [34]:
model_evals = pd.DataFrame(data={
    "Name": model_names,
    "Train MSE": train_mses,
    "Test MSE": test_mses,
    "Train MAE": train_maes,
    "Test MAE": test_maes,
    "Train R2": train_r2s,
    "Test R2": test_r2s
})

In [35]:
# Train MSE Bar Graph
train_mse_bar = px.bar(model_evals, x="Name", y="Train MSE", title="Train MSE Bar Graph", color="Name")
train_mse_bar.update_layout(showlegend=False)
train_mse_bar.show()

# Test MSE Bar Graph
test_mse_bar = px.bar(model_evals, x="Name", y="Test MSE", title="Test MSE Bar Graph", color="Name")
test_mse_bar.update_layout(showlegend=False)
test_mse_bar.show()

## Mean Squared Error (MSE) Comparison

The bar graphs show the MSE of each model on the training and test datasets. On the test set, Extreme Gradient Boosting (XGBRegressor) has the lowest MSE of all the models, meaning it minimizes the squared differences between predicted and actual house prices most effectively on unseen data. Its training MSE is also very low, although the unrestricted Decision Tree fits the training data perfectly and therefore scores lower still on that set.

RandomForest Regression also achieves relatively low MSE values, but XGBRegressor improves on it on both the training and test datasets.
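
Rankings are easier to verify from a sorted table than from a bar chart. The sketch below shows one way to do that with a small helper, `rank_models`, which is a hypothetical name and not part of the notebook. The frame uses toy values chosen only to show that the training ranking and the test ranking can differ; in the notebook the same call would be applied to `model_evals`.

```python
import pandas as pd


def rank_models(evals: pd.DataFrame, metric: str, ascending: bool = True) -> pd.DataFrame:
    """Return the evaluation table sorted by one metric column, best first."""
    return evals.sort_values(metric, ascending=ascending).reset_index(drop=True)


# Toy values for illustration only; they are NOT results from this notebook.
toy_evals = pd.DataFrame({
    "Name": ["model_a", "model_b", "model_c"],
    "Train MSE": [0.0, 4.0, 9.0],
    "Test MSE": [25.0, 16.0, 12.0],
})

# Lower MSE is better, so sort ascending.
train_ranking = rank_models(toy_evals, "Train MSE")["Name"].tolist()
test_ranking = rank_models(toy_evals, "Test MSE")["Name"].tolist()

print("Best on train:", train_ranking)
print("Best on test :", test_ranking)
```

For R2, where higher is better, the same helper would be called with `ascending=False`.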



In [36]:
# Train MAE Bar Graph
train_mae_bar = px.bar(model_evals, x="Name", y="Train MAE", title="Train MAE Bar Graph", color="Name")
train_mae_bar.update_layout(showlegend=False)
train_mae_bar.show()

# Test MAE Bar Graph
test_mae_bar = px.bar(model_evals, x="Name", y="Test MAE", title="Test MAE Bar Graph", color="Name")
test_mae_bar.update_layout(showlegend=False)
test_mae_bar.show()

Across the regression models evaluated for predicting house prices, distinct patterns emerged in their errors. Linear Regression showed moderate predictive capability, with reasonable accuracy in estimating house prices. Support Vector Regression had significantly higher errors on both the training and test datasets, indicating large discrepancies between predicted and actual prices. The two Decision Tree models differed: the unrestricted tree showed clear signs of overfitting, with a perfect score on the training set and a marked increase in error on the test set, while the depth-limited variant had more balanced errors on both datasets.

The Random Forest Regression model performed robustly, with relatively low errors on both the training and test datasets. The XGBRegressor model did better still, with very low error on the training set and the lowest error on the test set, making it the most accurate of the evaluated models on unseen data.


In [37]:
# Train R2 Bar Graph
train_r2_bar = px.bar(model_evals, x="Name", y="Train R2", title="Train R2 Bar Graph", color="Name")
train_r2_bar.update_layout(showlegend=False)
train_r2_bar.show()

# Test R2 Bar Graph
test_r2_bar = px.bar(model_evals, x="Name", y="Test R2", title="Test R2 Bar Graph", color="Name")
test_r2_bar.update_layout(showlegend=False)
test_r2_bar.show()


Across the evaluated regression models for predicting house prices, distinct trends emerged in their R-squared (R2) scores, which measure how much of the variance in the data each model explains. Linear Regression had reasonable R2 scores on both the training and test datasets, indicating a moderate fit. Support Vector Regression notably underperformed, with negative R2 scores on both datasets, meaning it fits worse than a horizontal line that always predicts the mean price. The Decision Tree models differed: the unrestricted tree showed signs of overfitting to the training set, while the depth-limited tree had a more controlled fit on both datasets.

Random Forest Regression had strong R2 scores, indicating a robust fit to the data. XGBRegressor outperformed all other models on the test set and had a very high R2 score on the training set, so it explains more of the variance in unseen house prices than any other model evaluated. This makes it the strongest candidate among the evaluated models for house price prediction.
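
To make the meaning of a negative R2 score concrete, here is a minimal sketch of the standard definition, R2 = 1 - SS_res / SS_tot, written in plain Python. The helper `r2` and the toy prices are illustrative assumptions; the notebook itself relies on `r2_score` from scikit-learn.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy prices for illustration only (not taken from the dataset).
y_true = [100.0, 200.0, 300.0]

perfect = r2(y_true, [100.0, 200.0, 300.0])    # every prediction exact -> 1.0
mean_only = r2(y_true, [200.0, 200.0, 200.0])  # always predict the mean -> 0.0
biased = r2(y_true, [150.0, 150.0, 150.0])     # constant, away from the mean -> -0.375

print(perfect, mean_only, biased)
```

A score of 0 corresponds to always predicting the mean, so any model whose squared error exceeds that baseline ends up below zero.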

---
The Extreme Gradient Boosting (XGBRegressor) model performs best across the metrics considered. Its Mean Squared Error (MSE) and Mean Absolute Error (MAE) are very low on the training set and the lowest of the evaluated models on the test set. Its R-squared values, approximately 0.999 for training and 0.92 for the test set, show a near-perfect fit to the training data and good, though not perfect, generalization to new, unseen data.

### Why XGBRegressor is an Exceptional Model:

1. **Outstanding Accuracy:** The model has the lowest test errors of the evaluated models, with a test MAE of about 16,235, indicating high precision in predicting house prices.

2. **Near-Perfect Fit:** The model achieves an almost perfect fit to the training data, with a training R-squared of approximately 0.999.

3. **Strong Generalization:** Although its test scores are lower than its training scores, the model still reaches a test R-squared of approximately 0.92, providing reliable predictions for unseen instances.

4. **Versatility:** XGBoost is a versatile and powerful algorithm known for handling complex datasets and achieving strong results across various domains.

In summary, the XGBRegressor model's accuracy, strong fit to the data, and good generalization make it the standout choice among the evaluated models for house price prediction in practical scenarios where precision is paramount.
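
The generalization claim above rests on a single train/test split. One common way to check that it is stable is k-fold cross-validation, sketched below with scikit-learn's `cross_val_score`. This is not something the notebook does: the synthetic data from `make_regression` is a stand-in for `x_train` and `y_train`, and `RandomForestRegressor` is used only to keep the example light. Any scikit-learn compatible regressor, including `XGBRegressor`, can be passed to the same call.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the notebook's x_train / y_train would go here.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Stand-in model; swap in the regressor being evaluated.
model = RandomForestRegressor(n_estimators=50, random_state=42)

# One R2 score per held-out fold (5 folds).
scores = cross_val_score(model, X, y, cv=5, scoring="r2")

print("Fold R2 scores:", scores.round(3))
print("Mean R2:", round(float(scores.mean()), 3))
print("Std R2 :", round(float(scores.std()), 3))
```

A small standard deviation across folds would suggest that the single-split test score is representative; a large one would suggest that it depends on which rows happened to land in the test set.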

# **Neural Network**

Exploring the realm of **predictive models**, I also used a **neural network** to forecast **house prices**. I tried **various configurations** and **network architectures**, aiming for a **robust design** that strikes a **balance between performance and generalization**. One layer in the model is **commented out**: with it enabled, the network performs **better during training** but **overfits**, and the improvement does **not carry over to the testing phase**.

Consequently, I've settled on the **current model design**, prioritizing **robustness**, while acknowledging that its **performance does not match the results of extreme gradient boosting**. One likely reason is that **tree-based ensembles** tend to perform well on **structured, tabular data** with little tuning, whereas **neural networks** usually need more data, careful feature scaling, and hyperparameter tuning to reach comparable results on such datasets.

In [38]:
# Initialize the neural network
model_name = "NeuralNetwork"
net = keras.Sequential([
    # Disabled: with this wider layer the network fit the training data better but overfit.
    # layers.Dense(512, activation='relu'),
    layers.Dense(256, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(1)
])

# Compile the model
net.compile(
    loss='mse',
    optimizer=keras.optimizers.Adam(learning_rate = 1e-3)
)

# Train the Neural Network
net.fit(
    x_train, y_train,
    epochs = 20,
#     verbose = 0
)

# Calculate metrics
train_pred = net.predict(x_train, verbose = 0)
test_pred = net.predict(x_test, verbose = 0)
train_mse, train_mae, train_r2 = metrics(y_train, train_pred)
test_mse, test_mae, test_r2 = metrics(y_test, test_pred)

# Print the results (they are not appended to the comparison lists)
print("\nModel Name :", model_name)
print("Train MSE  :", train_mse)
print("Test MSE   :", test_mse)
print("Train MAE  :", train_mae)
print("Test MAE   :", test_mae)
print("Train R2   :", train_r2)
print("Test R2    :", test_r2)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

Model Name : NeuralNetwork
Train MSE  : 939238147.3087851
Test MSE   : 1562725599.300677
Train MAE  : 20446.145360659248
Test MAE   : 25326.662617722603
Train R2   : 0.8432484882815778
Test R2    : 0.8289605770908725


### Neural Network vs. XGBRegressor:

- **Neural Network:**
  - Train MSE: 939,238,147, Test MSE: 1,562,725,599
  - Train MAE: 20,446, Test MAE: 25,327
  - Train R2: 0.843, Test R2: 0.829

- **XGBRegressor:**
  - Train MSE: 1,602,709, Test MSE: 720,266,915
  - Train MAE: 880, Test MAE: 16,235
  - Train R2: 0.999, Test R2: 0.921

### Comparison:

- **MSE and MAE:**
  - The Neural Network has higher MSE and MAE than XGBRegressor on both the training and test sets, meaning its house price predictions are further from the actual values on average. On the test set its MAE is about 25,327, against about 16,235 for XGBRegressor.

- **R2 Score:**
  - XGBRegressor has higher R2 scores on both the training and test sets, so it explains more of the variance in house prices. The Neural Network's training and test R2 scores are much closer together, which suggests it overfits less, but its overall fit is weaker.

### Summary:

Compared with XGBRegressor, the Neural Network has higher errors (MSE and MAE) and lower R2 scores on both the training and test datasets. Its training and test scores are closer together, so it overfits less, but XGBRegressor's test MAE is roughly 36% lower and its test R2 is higher. Overall, based on these evaluation metrics, XGBRegressor outperforms the Neural Network in predicting house prices.
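
The comparison can be reduced to two numbers per model: the gap between training and test R2, and the relative difference in test MAE. The sketch below computes both from the values printed by the two cells in this notebook. The dictionary layout and the names `r2_gaps` and `mae_reduction` are illustrative choices, not part of the original code.

```python
# Metrics copied from the two printed outputs in this notebook.
results = {
    "NeuralNetwork": {
        "train_r2": 0.8432484882815778,
        "test_r2": 0.8289605770908725,
        "test_mae": 25326.662617722603,
    },
    "XGBRegressor": {
        "train_r2": 0.9997325204409071,
        "test_r2": 0.9211671981329388,
        "test_mae": 16234.793637628425,
    },
}

# Train-test R2 gap: a rough indicator of how much each model overfits.
r2_gaps = {name: m["train_r2"] - m["test_r2"] for name, m in results.items()}

# Relative reduction in test MAE when moving from the network to XGBRegressor.
mae_reduction = 1 - results["XGBRegressor"]["test_mae"] / results["NeuralNetwork"]["test_mae"]

for name, gap in r2_gaps.items():
    print(f"{name:14s} R2 gap: {gap:.3f}")
print(f"Test MAE reduction with XGBRegressor: {mae_reduction:.1%}")
```

The network has the smaller gap, while XGBRegressor has the lower test error, which is why the summary favors XGBRegressor despite its larger degree of overfitting.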