<h1><center><b style="color: BurlyWood;">Housing Price Prediction Using <span style="color: Coral;">Advanced Regression Techniques</span></b></center></h1>

## <h2><span style="color: BurlyWood;">Table of Contents</span></h2>

<h4><span style="color: chocolate;">Introduction</span></h4>
<h4><span style="color: chocolate;">Loading Dataset and Configurations</span></h4>
<h4><span style="color: chocolate;">Data Wrangling</span></h4>
<li>Checking missing values</li>
<li>Dropping missing values</li>
<li>Handling missing values</li>
<li>Handling duplicates</li>
<h4><span style="color: chocolate;">Data Analysis</span></h4>
<li>Univariate Analysis</li>
<li>Multivariate Analysis</li>
<li>Feature Engineering</li>
<li>Feature Selection</li>
<h4><span style="color: chocolate;">Model Training and Evaluation</span></h4>
<li>Data Preprocessing</li>
<li>Linear Regression</li>
<li>Random Forest Regression</li>
<li>XGBoost Regression</li>
<li>Gradient Boost Regression</li>
<h4><span style="color: chocolate;">Hyperparameter Tuning</span></h4>
<h4><span style="color: chocolate;">Optimizing Best Model Performance</span></h4>
<h4><span style="color: chocolate;">Model Evaluation</span></h4>
<h4><span style="color: chocolate;">Feature Importance</span></h4>

## <h2><span style="color: BurlyWood;">1. Introduction</span></h2>
<p>
In the complex world of real estate, accurate house price prediction is a valuable tool for buyers, sellers, investors, and market analysts alike. This study focuses on the housing market in Ames, Iowa, utilizing a comprehensive dataset that describes residential properties through 80 distinct explanatory variables. These variables cover almost every aspect of homes in the area, providing a solid foundation for the analysis.
</p>

<h5><li><span style="color: chocolate;">Aim and Objective</span></li></h5>

<p>
My primary objective is to uncover novel insights into the factors that influence housing prices and to develop a predictive model using advanced regression techniques. By utilizing this extensive dataset, I aim to identify hidden patterns and key determinants of property values in Ames, Iowa. The resulting model will not only serve as a reliable tool for price prediction but also contribute to a deeper understanding of the local real estate market dynamics.
</p>

<h5><li><span style="color: chocolate;">Statement of Problem</span></li></h5>
<p>
This research aims to uncover the complex connections between different characteristics of homes and their impact on market value, potentially revealing unexpected influences and trends. The findings and methodologies developed through this study may also have broader applications, offering valuable insights for real estate professionals and researchers beyond the Ames market.
</p>

<h5><li><span style="color: chocolate;">Scope of Study</span></li></h5>

<p>
This study analyzes the factors that influence house prices in Ames, Iowa. It aims to identify the features with the greatest impact on price, to develop a reliable predictive model using advanced regression techniques, and to gain new insights into the Ames housing market through in-depth analysis.
</p>

<h5><li><span style="color: chocolate;">Theory and Assumptions</span></li></h5>
<h5><span style="color: pink;">Theory</span></h5>
<p>The sale price of a house is influenced by a complex interplay of various factors, including physical characteristics of the property, location-based features, and market conditions at the time of sale.</p>
<h5><span style="color: pink;">Assumptions</span></h5>
<p>
Overall Quality and Condition  
The variables OverallQual and OverallCond likely have a strong positive correlation with SalePrice. Houses rated higher in quality and condition are expected to sell for higher prices.  

Size Matters  
Variables related to the size of the property (e.g., GrLivArea, TotalBsmtSF, GarageArea) are likely to have a positive correlation with SalePrice. Larger houses tend to be more expensive.  

Location Impact   
The Neighborhood variable will likely show significant variation in SalePrice. Some neighborhoods are expected to have consistently higher prices than others.  

Age and Renovation   
YearBuilt and YearRemodAdd might have a complex relationship with SalePrice. Newer houses may generally be more expensive, but older houses that have been recently remodeled might also command high prices.  

Amenities Premium   
The presence of certain amenities like central air conditioning (CentralAir), and good KitchenQual might positively impact the SalePrice.  

Garage Influence   
Houses with garages (especially those with higher GarageCars capacity) might sell for more than those without.  

Basement Quality   
Houses with high-quality, finished basements (indicated by BsmtQual and BsmtFinType1) might have higher sale prices.  

Seasonal Variations   
The MoSold variable might show some seasonal trends in SalePrice, with certain months possibly having higher average prices.  

Market Conditions   
YrSold might reveal trends in the housing market over time, potentially showing overall increases or decreases in prices.  

Outliers   
Given the large gap between the mean SalePrice (180,921 dollars) and the maximum SalePrice (755,000 dollars), there are likely some high-end outlier properties that might skew some analyses.  

Non-linear Relationships   
Some features might have non-linear relationships with SalePrice. For example, the impact of an additional bedroom might diminish after a certain point.  

Interaction Effects   
There may be significant interaction effects between variables. For instance, the impact of OverallQual on price might be amplified in certain high-end Neighborhoods.
</p>

<h5><li><span style="color: chocolate;">Data Source</span></li></h5>
The dataset is sourced from Kaggle and contains various features describing houses in Ames, Iowa, including sale price, location, and other information.

Overview of the dataset: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

Data description: https://www.ishelp.info/data/housing_description.txt

## <h2><span style="color: BurlyWood;">2. Methodology</span></h2>

<p>This study involves the following methods:</p>

##### importing all the needed libraries  

import pandas as pd      #pandas library for data manipulation and processing  
import numpy as np       #numpy library for linear algebra  
import seaborn as sns    #seaborn for statistical data visualization  
import matplotlib.pyplot as plt     #matplotlib for plotting and visualizing data  
import altair as alt    #altair for declarative statistical visualization  
from scipy import stats     #stats module from scipy for statistical functions  

##### machine learning models  

from sklearn.preprocessing import LabelEncoder  
from sklearn.preprocessing import StandardScaler, MinMaxScaler        
from sklearn.model_selection import train_test_split    
from sklearn.linear_model import LinearRegression         
from sklearn.metrics import r2_score, mean_squared_error as MSE     
from sklearn.ensemble import RandomForestRegressor        
import xgboost as xgb   
from sklearn.ensemble import GradientBoostingRegressor  
from sklearn.model_selection import GridSearchCV        


##### importing warnings
import warnings
##### Ignoring all future warnings to avoid unnecessary warning messages during execution
warnings.simplefilter(action='ignore', category=FutureWarning)
##### Ignoring all deprecation warnings to avoid unnecessary warning messages during execution
warnings.filterwarnings("ignore",category=DeprecationWarning)

##### utility function to print markdown string
from IPython.display import Markdown, display     #needed by printmd below
def printmd(string):
    display(Markdown(string))

##### display matplot visuals
%matplotlib inline

### 2.1 Data wrangling

The dataset contains 1460 rows and 81 columns, a mix of categorical and numeric features.
![image-3.png](attachment:image-3.png)

### 2.2 Data Cleaning
A common rule of thumb is to drop columns in which more than roughly 10-20% of the values are missing, because imputing that much data can bias the analysis. Following this guideline, columns with more than 10% missing values are dropped from the dataset.  

High percentages of missing values can introduce significant bias, distort statistical analysis, and affect model performance. By removing columns with substantial missing data, the complexity of the dataset is reduced, which can lead to better model performance and more reliable results.  

In this section, I handle missing values by dropping columns in which more than 10% of the values are null and filling the remaining columns, which have less than 10% missing. I also identify and remove duplicate or redundant records, and handle outliers by analysing them separately for more insight and more accurate interpretation of the data.  

![image.png](attachment:image.png)

#### Actions Taken 

Dropped Columns with High and Moderate Missing Values

Columns Dropped: PoolQC, MiscFeature, Alley, Fence, MasVnrType, FireplaceQu, LotFrontage.  

Reason:  
Following the 10% guideline described above, columns with more than 10% missing values were dropped. This includes both those with high percentages (over ~60%) and those with moderate percentages (17.74%). The decision simplifies the dataset and keeps the focus on features with more reliable data.  
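
The 10% rule can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual code: the helper name `drop_sparse_columns` and all values in `toy` are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (all values are made up).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 6120, 7420],
    "PoolQC": [np.nan] * 9 + ["Gd"],                                   # 90% missing
    "LotFrontage": [65, 80, np.nan, 60, np.nan, 85, 75, 70, 51, 50],   # 20% missing
    "Electrical": ["SBrkr"] * 9 + [np.nan],                            # 10% missing
})

def drop_sparse_columns(df, threshold=0.10):
    """Drop columns whose fraction of missing values is strictly above `threshold`.

    Hypothetical helper; returns the reduced frame and the per-column
    missing fraction so the decision can be inspected.
    """
    missing_frac = df.isna().mean()
    to_drop = missing_frac[missing_frac > threshold].index.tolist()
    return df.drop(columns=to_drop), missing_frac.sort_values(ascending=False)

cleaned, missing_frac = drop_sparse_columns(toy)
print((missing_frac * 100).round(2))
print("Remaining columns:", list(cleaned.columns))
```

Columns at or below the threshold are kept so that they can be filled rather than discarded.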

Filled Low Percentage Missing Values

Columns Affected: GarageQual, GarageType, GarageFinish, GarageCond, GarageYrBlt, BsmtExposure, BsmtFinType2, BsmtQual, BsmtFinType1, BsmtCond, MasVnrArea, Electrical.  

Approach:  
Fillna: For columns with low missing-value percentages (less than 10%), missing values were filled based on the dataset information.  

Categorical Columns: The fillna function was used to fill missing values according to the dataset information.  
Numerical Columns: The fillna function was used to fill missing values with reference to their corresponding categorical columns.
![image-2.png](attachment:image-2.png)
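
A minimal sketch of this filling strategy, on made-up data. It assumes that a missing `GarageType` means the house has no garage (as the data description states for that column); the placeholder label `"NoGarage"`, the zero fill for the numeric columns, and the mode fill for `Electrical` are illustrative choices, not necessarily the ones used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows (values are made up); NaN marks a missing entry.
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan],
    "GarageYrBlt": [2003.0, np.nan, 1998.0, np.nan],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
})

# Categorical: a missing GarageType is treated as "no garage".
# The label "NoGarage" is a hypothetical placeholder.
toy["GarageType"] = toy["GarageType"].fillna("NoGarage")

# Numerical, guided by the matching categorical column:
# houses without a garage get 0 as their garage year.
no_garage = toy["GarageType"].eq("NoGarage")
toy.loc[no_garage, "GarageYrBlt"] = toy.loc[no_garage, "GarageYrBlt"].fillna(0)

# Assumption: a missing veneer area means no veneer, so fill with 0.
toy["MasVnrArea"] = toy["MasVnrArea"].fillna(0)

# Assumption: fill the few missing Electrical values with the most common value.
toy["Electrical"] = toy["Electrical"].fillna(toy["Electrical"].mode()[0])

print(toy)
```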


### 2.3 Data Analysis

My analytical approach will be structured into two main components:

##### Univariate Analysis: 
I will examine the distribution and characteristics of individual variables, with a particular focus on our target variable, SalePrice. This will help us identify potential outliers, skewed distributions, and gain insights into the nature of each feature.  
##### Multivariate Analysis: 
I'll investigate the relationships between each feature and the SalePrice. This will allow us to identify which variables have the strongest correlations with housing prices and potentially uncover non-linear relationships. 
Also, I will explore interactions between multiple features and their collective impact on SalePrice. This will help us understand complex relationships within the data and identify potential feature combinations that could improve our predictive model.

The statistical approach will be twofold:

##### Numerical Columns:   
These will be analyzed by examining their relationship with the target variable. This will involve exploring the mean, median, mode, standard deviation, correlations, patterns, and trends to understand how these variables affect the target outcome.

##### Categorical Columns:   
These will be analyzed using statistical tests such as one-way ANOVA, Lagrange test, and Tukey test. These tests will help in determining the significance of each categorical variable in relation to the target variable, ensuring a comprehensive understanding of their influence on the dataset.
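
As an illustration of the one-way ANOVA step, the sketch below tests whether mean SalePrice differs between the levels of a categorical column, using `scipy.stats.f_oneway`. The data are made up and `CentralAir` is chosen only as an example column; the 0.05 significance level is a conventional assumption.

```python
import pandas as pd
from scipy import stats

# Toy data (values are made up): SalePrice for houses with and without central air.
toy = pd.DataFrame({
    "CentralAir": ["Y", "Y", "Y", "Y", "N", "N", "N", "N"],
    "SalePrice": [208500, 181500, 223500, 250000, 90000, 118000, 82000, 107000],
})

# One array of prices per category level.
groups = [g["SalePrice"].to_numpy() for _, g in toy.groupby("CentralAir")]

# One-way ANOVA: the null hypothesis is that all group means are equal.
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

alpha = 0.05  # conventional significance level, assumed here
print("Significant difference in means:", p_value < alpha)
```

A small p-value only says that at least one group mean differs; a post-hoc test such as Tukey's is what identifies which pairs of groups differ.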


### 2.3.1 Univariate Analysis
![image.png](attachment:image.png)

##### SalePrice (Target): 
It has a mean of 180,921.1959, a median of 163,000, and ranges from 34,900 to 755,000.  
##### LotArea: 
This feature has the highest mean value (10516.8281) among the numerical predictors (only the target, SalePrice, has a larger mean), which is consistent with lot size being measured in square feet.  
##### Skewness: 
Many features show positive skewness (right-skewed distributions), with MiscVal having the highest skewness of 24.4768.  
##### Kurtosis: 
MiscVal also has the highest kurtosis (701.0033), indicating a distribution with very heavy tails.  
##### Zero-inflated features: 
Several features (like PoolArea, 3SsnPorch, LowQualFinSF) have a median and 25th/75th percentiles of 0, suggesting many houses don't have these features.  
##### Discrete features: 
Some features like KitchenAbvGr and BsmtHalfBath have very few unique values, indicating they are likely categorical or discrete numerical features.  
##### Continuous features: 
Features like LotArea, SalePrice, and 1stFlrSF appear to be continuous variables with a wide range of values.  
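
The shape statistics discussed above can be computed directly with pandas. The sketch below uses made-up values; note that `DataFrame.kurt` reports excess kurtosis, so a normal distribution scores about 0.

```python
import pandas as pd

# Toy columns (values are made up): one zero-inflated, one roughly symmetric.
toy = pd.DataFrame({
    "MiscVal": [0, 0, 0, 0, 0, 0, 0, 0, 0, 15500],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077],
})

# Skewness, excess kurtosis, and median per column, collected in one table.
shape = pd.DataFrame({
    "skew": toy.skew(),
    "kurtosis": toy.kurt(),
    "median": toy.median(),
})
print(shape.round(3))
```

A median of 0 combined with a large positive skew is the signature of the zero-inflated features described above.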

- **How are sale prices distributed?**

#### SalePrice Distribution (Target)
![image.png](attachment:image.png)

##### Right-Skewed Distribution:
The distribution of SalePrice is skewed to the right, meaning that there are more houses with lower prices and fewer houses with very high prices. Most of the data points are concentrated on the left side of the graph, with a long tail extending to the right.

##### Central Tendency:
The peak of the distribution occurs around the 150,000 to 200,000 dollars range, which suggests that the majority of house prices fall within this range. This is likely where the median and mode of the SalePrice would be found.

##### Outliers:
The tail on the right indicates the presence of outliers—houses that are significantly more expensive than the majority. These are the high-priced homes that are not as common but still exist in the dataset.

##### Spread of Data:
The spread of the distribution shows that SalePrice varies widely, with values ranging from less than 50,000 dollars to over 500,000 dollars.
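
One conventional way to make the outlier observation concrete is the 1.5 × IQR rule. The source does not state which rule was used, so the sketch below is an assumption, and the prices are made up.

```python
import pandas as pd

# Toy prices (made up), with one very expensive house in the right tail.
prices = pd.Series(
    [129500, 143000, 163000, 181500, 200000, 208500, 223500, 250000, 755000],
    name="SalePrice",
)

# Interquartile range and the conventional upper fence at Q3 + 1.5 * IQR.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Houses priced above the fence are flagged as high-end outliers.
outliers = prices[prices > upper_fence]
print("Upper fence:", upper_fence)
print("Flagged prices:", outliers.tolist())
print("Skewness:", round(prices.skew(), 3))
```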

- **How are the property characteristics distributed?**

#### Property Characteristics Distribution
![image.png](attachment:image.png)

The distribution of Overall Quality is slightly left-skewed, meaning that there are fewer houses with lower quality ratings (1-4) and more houses with higher quality ratings (5-7).  
The majority of houses have an Overall Quality rating of 5 or 6. These two categories dominate the distribution, with a significant count, suggesting that most houses in the dataset are of average quality.  
While ratings of 8, 9, and 10 are less common, they still represent a noticeable portion of the dataset. This suggests that some houses are of higher quality, although they are fewer in number compared to average-quality houses.

![image.png](attachment:image.png)

The majority of homes in the dataset have moderate living spaces, with a significant number concentrated around the 1,500 to 2,000 square feet range.  
The right-skewness indicates that while most homes are of average size, there are a few very large homes that are less common.  
The presence of a long tail to the right suggests that there are a few homes with very large above-ground living areas, which can be considered as outliers in terms of size.

- **Which locations and zones have the most houses?**

#### Location Distribution
![image.png](attachment:image.png)
##### Left Plot: Distribution of Neighborhood   
This plot shows the count of houses in each neighborhood.
- **Most Common Neighborhoods:**
    NAmes: The most common neighborhood in the dataset, with over 200 houses.
    CollgCr and OldTown: Also have significant numbers of houses, with CollgCr having around 150 and OldTown about 100 houses.
- **Least Common Neighborhoods:**
    Veenker, Blueste, and GrnHill have very few houses in the dataset, with Blueste having almost none.
- **Distribution Pattern:**
    There is a wide variation in the number of houses across different neighborhoods, indicating that some neighborhoods are much more densely represented in the dataset than others.
    
##### Right Plot: Distribution of MSZoning
This plot shows the count of houses across different zoning classifications (MSZoning).
- **Dominant Zoning:**
Residential LD (Low Density) is by far the most common zoning category, with over 1,000 houses. This suggests that the majority of houses in the dataset are in low-density residential areas.  
- **Other Zoning Types:**
Residential MD (Medium Density) is the second most common, with a much smaller count of houses.
Floating Village and Residential HD (High Density) are the least common zoning types in the dataset, with very few houses.

- **Which months and years have the most sales?**

#### Market Trend Distribution
![image.png](attachment:image.png)
##### Left Plot - Distribution of Sales by Month Sold:
This bar chart shows the number of properties sold in each month of the year.  
**There's a clear seasonal pattern in home sales:**  
- Peak sales occur in the summer months, with June (month 6) having the highest number of sales, followed by July (month 7).  
- Winter months (December, January, February) show the lowest number of sales.  
- There's a gradual increase in sales from winter to summer, and a decrease from summer to winter.  
- This pattern likely reflects typical real estate market trends, with more activity during warmer months.

##### Right Plot - Distribution of Sales by Year Sold:
This bar chart displays the number of properties sold in each year from 2006 to 2010.  
**Observations:**  
- 2009 had the highest number of sales in the dataset.  
- 2006 and 2007 had similar sales volumes, slightly lower than 2009.  
- 2008 shows a slight dip in sales compared to surrounding years.  
- 2010 has noticeably fewer sales, but this could be due to incomplete data for that year if the dataset doesn't cover the full year.  
- The variation across years might reflect broader economic conditions or local market factors affecting home sales.

### 2.3.2 Multivariate Analysis

- **Is there a trend in house prices over the years sold (YrSold)?**
#### Price vs. Market Trends
![image.png](attachment:image.png)
**Data Trend:**  
- 2006 to 2007: There was an increase in the average sale price from around 182,000 dollars in 2006 to a peak of approximately 186,000 dollars in 2007.
- 2007 to 2008: After reaching a peak in 2007, the average sale price dropped significantly to around 178,000 dollars in 2008.
- 2008 to 2010: Post 2008, the average sale price exhibited some volatility. There was a slight recovery in 2009, with prices rising back up to about 180,000 dollars, followed by another decline in 2010 to slightly below 178,000 dollars.

**Observations:**  
- *Peak in 2007:* The highest average sale price was observed in 2007. This could be attributed to various economic factors, including the housing bubble in the U.S., which peaked around that time.
- *Drop in 2008:* The sharp decline in 2008 aligns with the global financial crisis, which significantly impacted the housing market, leading to decreased property values.
- *Volatility:* The fluctuation in prices from 2008 to 2010 could indicate instability in the housing market as it struggled to recover from the financial crisis.

- **Do certain months of sale (MoSold) show higher prices?**

![image.png](attachment:image.png)
**Monthly Average Prices:**  
The chart shows that average house prices are fairly stable across the months. There are no dramatic spikes or drops, which suggests that the housing market in this dataset does not have significant seasonal variation in average prices.
However, slight differences in the height of the bars suggest that some months might be slightly more favorable for sellers (with higher average prices) than others.  

**Variability in Prices:**
The presence of large error bars indicates that there is significant variability in house prices within each month. For example, even if the average price in a month is 200,000 dollars, the actual prices of houses sold in that month might range widely from, say, 100,000 dollars to 300,000 dollars.  
This variability could be due to a number of factors, including the types of properties sold (e.g., size, condition, location), market conditions, or buyer/seller negotiations.  

**Seasonality and Market Behavior:**
While the average prices do not show extreme seasonality, the variability indicated by the standard deviation could imply that certain months might see a wider range of buyers or more diverse types of properties being sold.
For instance, months with higher standard deviation might be periods where both high-end and low-end properties are on the market, leading to a broader price range.  
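
The monthly averages and their error bars can be reproduced with a single group-by. This sketch uses made-up sales and assumes the error bars in the chart represent the standard deviation, as the text suggests.

```python
import pandas as pd

# Toy sales (values are made up).
toy = pd.DataFrame({
    "MoSold": [1, 1, 6, 6, 6, 7, 7, 12],
    "SalePrice": [150000, 170000, 120000, 200000, 310000, 180000, 190000, 165000],
})

# Number of sales, mean price, and standard deviation for each month.
monthly = toy.groupby("MoSold")["SalePrice"].agg(["count", "mean", "std"])
print(monthly.round(0))

# Month with the widest spread of prices (months with one sale have no std).
print("Most variable month:", monthly["std"].idxmax())
```

Reading `mean` and `std` side by side shows when a similar average hides a much wider mix of properties.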

- **How do the overall condition (OverallCond) and quality ratings impact price?**

#### Price Vs. Property Characteristics
![image.png](attachment:image.png)
##### Distribution of Sale Price by Overall Condition (Left Plot)
**Overall Condition**  
This represents the categorical variable Overall Condition  
- **Spread:**  
Lower Overall Condition scores (like 1 and 2) have narrow violins, indicating fewer data points and a narrower range of sale prices. This suggests that most homes in poor condition are clustered around lower prices.
As Overall Condition improves (moving from 3 to 9), the violins generally widen, indicating a broader distribution of sale prices. This suggests that homes in better condition have a wider range of possible sale prices, reflecting the added value of better maintenance and fewer necessary repairs.  
- **Outliers:**  
The violin plots also capture outliers at the extremes of the distribution. For example, in the Overall Condition category of 8 or 9, the violins stretch higher, indicating that some homes sold for significantly more than the typical range for that condition level.  

- **Key Insights:**
Homes in poor condition (1-3) tend to cluster around lower sale prices, and there are fewer of them in the dataset.
As the condition improves, not only does the average sale price increase, but the variability in sale prices also increases. This indicates that better-maintained homes can command a wider range of prices, likely due to additional factors such as location, size, and specific amenities.  

##### Distribution of Sale Price by Overall Quality (Right Plot)
**Overall Quality**  
This represents the categorical variable Overall Quality.  

- **Spread:**
For lower Overall Quality scores (1-4), the violins are quite narrow and low on the y-axis, indicating that houses with poor quality typically sell for lower prices and there’s less variability in those prices.  
As the Overall Quality increases (moving towards 10), the violins not only shift higher on the y-axis (indicating higher prices) but also widen, showing a larger spread of sale prices. This reflects that higher-quality homes can command a wide range of prices depending on other factors such as location, size, and additional features.  

- **Observations:**
The Overall Quality score of 10 has a particularly wide and high violin, showing that while these homes can sell for very high prices, there’s a considerable range in those prices. This might suggest that the very best quality homes are valued highly but still subject to significant variation depending on other factors.  

- **Key Insights:**
Overall Quality appears to have a stronger correlation with SalePrice than Overall Condition. Higher quality homes generally sell for much higher prices, and the range of sale prices expands significantly as quality improves.
There’s a clear upward trend in both the median sale price and the spread of sale prices as Overall Quality increases, indicating that buyers are willing to pay a premium for top-quality homes, but that premium can vary widely.  

**General Conclusions:**
- **Condition vs. Quality:**
While both Overall Condition and Overall Quality influence the sale price, Overall Quality seems to have a more substantial impact on both the level and variability of sale prices. This suggests that intrinsic quality (materials, design) is a more critical factor in determining home value than just the current physical condition of the property.  

- **Price Variability:**
As both Overall Condition and Overall Quality improve, the variability in sale prices increases. This could imply that buyers are more discerning and willing to pay more for homes that meet higher standards, but also that other factors (such as location, size, and market conditions) play a bigger role in the pricing of higher-quality homes.
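
The comparison between quality and condition can be checked numerically with a rank correlation, which suits ordinal ratings such as these. The sketch below uses made-up rows; the choice of Spearman's coefficient is an assumption, not something stated in the source.

```python
import pandas as pd

# Toy rows (values are made up): ordinal ratings and the sale price.
toy = pd.DataFrame({
    "OverallQual": [4, 5, 5, 6, 6, 7, 8, 9],
    "OverallCond": [5, 8, 5, 6, 7, 5, 5, 5],
    "SalePrice": [100000, 129500, 140000, 163000, 181500, 208500, 250000, 345000],
})

# Spearman rank correlation of each rating with SalePrice.
corr = toy.corr(method="spearman")["SalePrice"].drop("SalePrice")
print(corr.round(3))
```

On the real data, a clearly larger coefficient for `OverallQual` than for `OverallCond` would support the conclusion drawn from the violin plots.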

![image-2.png](attachment:image-2.png)

**Observations:**  
Positive Correlation: The plot shows a clear positive correlation between GrLivArea and SalePrice. Larger houses generally have higher sale prices.

**Spread of Data:**  
There is a concentration of data points in the range of 1,000 to 3,000 square feet for GrLivArea and 100,000 dollars to 400,000 dollars for SalePrice.
There are a few outliers, particularly in the higher ranges of GrLivArea (above 4,000 square feet) and SalePrice (above 500,000 dollars).  

**Overall Quality Impact:**
Houses with higher OverallQual (colored towards the orange and brown hues) tend to cluster towards the higher end of both GrLivArea and SalePrice, indicating that better-quality houses are generally larger and more expensive.
Conversely, houses with lower OverallQual (colored in green and yellow) are mostly found in the lower ranges of both GrLivArea and SalePrice.



- **Is there a correlation between YearBuilt and SalePrice?**

##### Price vs. YearBuilt and Overall Quality
![image-2.png](attachment:image-2.png)
**Key Observations** 

- **Positive Correlation:**
There seems to be a positive correlation between Year Built and Sale Price. Generally, newer houses tend to have higher sale prices. This is evident in the upward trend of the data points from left to right in the graph.

- **Impact of Overall Quality:**
The color coding based on OverallQual reveals that houses with higher quality ratings (represented by darker colors) generally have higher sale prices, regardless of the year built. This suggests that overall quality is a significant factor influencing the sale price.

- **Data Distribution:**
The distribution of data points is not uniform. There are clusters of data points in certain areas, indicating potential trends or patterns in the data. For instance, there's a denser cluster of data points for houses built between 1960 and 2005.

- **Outliers:**
Several outliers can be observed, particularly in the higher price range. These could represent houses with unique features, larger lot sizes, or premium locations, significantly impacting their sale price.      

- **Which neighborhoods have the highest and lowest average prices?**

##### Price Vs Location
![image.png](attachment:image.png)

**Variation Across Neighborhoods:**  
The distribution of sale prices varies significantly across different neighborhoods.
NoRidge, StoneBr, and NridgHt have the highest sale prices, with median values well above 400,000.
On the other hand, neighborhoods like BrDale, SWISU, and IDOTRR have much lower sale prices, with medians under 200,000.

**Spread and Density:** 
NoRidge shows a wide range of prices, indicating variability in property values within the neighborhood.
StoneBr also shows a similar spread, though slightly less extreme.
Neighborhoods like BrDale, MeadowV, and SWISU have much tighter distributions, indicating less variability in sale prices.

**Skewness:**  
Some neighborhoods exhibit skewness in sale prices. For example, NridgHt shows a right skew, with a long tail extending towards higher prices, suggesting that while most properties are clustered around a lower price, there are a few very high-value properties.

**Outliers:** 
Some neighborhoods have distinct outliers, particularly in Gilbert and Timber, where the spread extends beyond the typical range.

**Comparison Between Neighborhoods:**  
NAmes, despite being one of the more centrally clustered neighborhoods, still shows a significant number of properties selling in the mid-200,000 range, suggesting it’s a middle-ground neighborhood.
OldTown and Edwards have lower medians but show a wide spread, indicating a mix of property values.

**Conclusions:**  
Affluent Areas: Neighborhoods like NoRidge, StoneBr, and NridgHt are more affluent with higher property values, as seen from their higher median sale prices and wider distributions.  
More Affordable Areas: Neighborhoods like BrDale, SWISU, and MeadowV are more affordable, with lower median prices and tighter distributions, indicating less variance in property values.

**Investment Potential:**   
Areas with high variance and skewness like NridgHt might be attractive for investment, as there is a potential for higher returns if the right properties are selected.


##### What is the mean SalePrice for each MSZoning category?
![image-6.png](attachment:image-6.png)

**Zoning Categories**
The plot includes four zoning categories:  
Floating Village  
Residential High-Density (HD)  
Residential Low-Density (LD)  
Residential Medium-Density (MD)    

**Sale Price Comparison**  
- **Floating Village:**     
Has the highest sale price, averaging above 200,000. This could indicate that properties in this zoning category are either more desirable, larger, or located in a more premium area.

- **Residential Low-Density (LD):**   
Follows closely, with an average sale price approaching 175,000. This is typically the case in areas with larger plots and more spacious homes.

- **Residential Medium-Density (MD):**      
Has a slightly lower average sale price, around 150,000. Medium-density zoning often includes a mix of housing types, which could account for the moderate sale price.

- **Residential High-Density (HD):**    
Has the lowest sale price, around 125,000. High-density areas usually consist of smaller homes or multi-family units, which can drive the overall prices down.  

**Urban Planning and Zoning Impacts**  
The difference in sale prices across these zones suggests how urban planning and zoning regulations impact property values.
Higher-density residential areas tend to have lower prices, possibly due to the smaller lot sizes and more compact living spaces.  
Conversely, areas like the Floating Village or Residential Low-Density zones command higher prices, likely due to their unique or more spacious environments.  

**Implications for Buyers and Investors**  
*Floating Village:* Given its high sale prices, this area could be considered more exclusive or in high demand. It may offer unique living experiences or have a scarcity of available properties.  
*Residential LD and MD:* These areas offer a balance between price and living space. They may appeal to families or buyers looking for a suburban feel with reasonable prices.  
*Residential HD:* This zone could be attractive to first-time buyers or investors looking for rental properties due to its lower price point.  

**Conclusions**  
*Higher-priced Zones:* The Floating Village and Residential LD zones indicate more premium real estate, likely due to desirable locations or housing types.  
*Affordable Zones:* Residential HD offers the most affordable options, making it suitable for budget-conscious buyers or those seeking investment properties.  
*Zoning and Property Value Correlation:* The plot clearly illustrates how zoning regulations impact property values, with denser, more urban areas generally offering lower-priced properties.
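
The per-zone averages discussed above are the kind of figure a grouped mean produces. The sketch below is a minimal illustration on toy rows (not taken from the Ames dataset); it assumes the zoning labels live in a column named `MSZoning_Desc`, as in the feature list used later in this notebook.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from the Ames dataset.
toy = pd.DataFrame({
    "MSZoning_Desc": [
        "Floating Village", "Floating Village",
        "Residential Low-Density", "Residential Low-Density",
        "Residential High-Density",
    ],
    "SalePrice": [210_000, 230_000, 170_000, 190_000, 120_000],
})

# Mean sale price per zoning category, sorted from highest to lowest.
avg_price_by_zone = (
    toy.groupby("MSZoning_Desc")["SalePrice"]
       .mean()
       .sort_values(ascending=False)
)
print(avg_price_by_zone)
```

Checking the group sizes as well (for example with `value_counts()`) is worthwhile, because an average over only a handful of homes can be misleading.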

#### **Is there a significant relationship between the type of dwelling and saleprice?**
![image-4.png](attachment:image-4.png)

**Top Categories by Average Sale Price**  
- **2-STORY 1946 & NEWER:**   
This category, representing two-story houses built in 1946 or later, has the highest average sale price, making it the most expensive type of property among the listed categories.
- **1-STORY PUD (Planned Unit Development) - 1946 & NEWER:**  
This category, also for properties built in 1946 or later but specifically within a Planned Unit Development, comes in second. These are likely newer and more upscale developments.
- **2-1/2 STORY ALL AGES:**  
This category includes properties with a unique 2-1/2 story structure, regardless of their age, and also shows a high average sale price.  

**Mid-Range Categories**  
Categories such as 1-STORY 1946 & NEWER ALL STYLES and SPLIT OR MULTI-LEVEL fall in the mid-range of average sale prices. These categories likely include a mix of traditional single-story homes and more complex split or multi-level designs.
- **2-STORY 1945 & OLDER:**  
Older two-story homes (built before 1946) still hold value but are priced lower than their newer counterparts.

**Lower-End Categories**
- **1-STORY 1945 & OLDER:**  
This category represents older single-story homes and has the lowest average sale price, which might be due to age, condition, or less desirable locations.
- **PUD - MULTILEVEL - INCL SPLIT LEV/FOYER:**  
Multi-level PUD homes, including split-level designs, also show a lower average sale price, possibly due to their design or market demand.  

**Influence of Age and Type**

- **Age Factor:** 
Generally, newer properties (those labeled "1946 & NEWER") tend to have higher average sale prices compared to older properties ("1945 & OLDER"). This suggests that age and modernity play a significant role in property valuation.
- **Building Type:**   
The type of structure also influences the average sale price. More complex structures like 2-1/2 story homes or those in planned unit developments tend to command higher prices.

**Implications for Buyers and Sellers**
- **For Buyers:** 
Understanding that newer, more complex structures tend to be priced higher can help in making informed decisions based on budget and preferences.
- **For Sellers:**  
Sellers in the higher-end categories can expect better returns, while those in the lower-end categories might focus on renovations or marketing to improve sale prices.

- **Does the number of cars a garage can hold have any significant influence on the saleprice?**

![image.png](attachment:image.png)

**Trend Line** 
- **Linear Regression (Red Line):**  
The red line is the best-fit line obtained through linear regression. Its equation, shown at the bottom right of the plot, is y = 68078.0x + 60618.98. This means:  
For each additional car of garage capacity, the sale price is expected to increase by approximately 68,078 dollars.  
The y-intercept of 60,618.98 dollars is the estimated sale price for a house with no garage capacity (GarageCars = 0).    

- **Marginal Histograms**  
Top Histogram: Shows the distribution of the GarageCars variable. Most properties appear to have a garage capacity of 2 or 3 cars.  
Right Histogram: Shows the distribution of the SalePrice variable. The distribution appears right-skewed, with most properties clustered towards the lower end of sale prices.  

**Statistical Metrics**

**Correlation (Corr = 0.6404):**    
This indicates a moderate positive correlation between garage capacity (GarageCars) and sale price. A correlation of 1 would mean a perfect positive linear relationship, while 0 would mean no linear relationship.  

**R-Squared (R-Squared = 0.4101):**   
Approximately 41.01% of the variance in sale price can be explained by garage capacity alone. For a single-predictor linear regression, this value is the square of the correlation (0.6404² ≈ 0.4101).     

**P-value (p = 0.0):**  
The p-value is displayed as 0.0 because of rounding. It is not exactly zero, but it is small enough that the relationship between GarageCars and SalePrice is statistically significant.  
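
The fitted line and the reported statistics can be sanity-checked with a few lines of arithmetic. The sketch below uses only the slope, intercept, and correlation printed on the plot; the function name `predicted_sale_price` is a hypothetical helper introduced for illustration.

```python
# Coefficients as reported on the plot: y = 68078.0x + 60618.98
SLOPE = 68078.0
INTERCEPT = 60618.98


def predicted_sale_price(garage_cars: int) -> float:
    """Return the fitted line's estimate for a given garage capacity."""
    return SLOPE * garage_cars + INTERCEPT


# Estimated sale price for garage capacities 0 through 3.
for cars in range(4):
    print(cars, round(predicted_sale_price(cars), 2))

# For a one-predictor linear regression, R-squared equals the squared correlation.
corr = 0.6404
r_squared = round(corr ** 2, 4)
print(r_squared)
```

The squared correlation matching the reported R-squared confirms that the two statistics on the plot describe the same simple regression.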

**Skewness**

**GarageCars skew = -0.3425:**   
The distribution of the GarageCars variable is slightly left-skewed.  

**SalePrice skew = 1.8829:**    
The distribution of the SalePrice variable is right-skewed, indicating that most sale prices are on the lower end, with a long tail extending towards higher prices.
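
A skewness value like the ones above can be obtained directly from a pandas Series. This is a minimal sketch on toy prices (not Ames data), assuming the plot's values were computed with something equivalent to `Series.skew()`, which returns the bias-corrected sample skewness.

```python
import pandas as pd

# Toy prices for illustration only: one expensive home creates a long right tail.
prices = pd.Series([100_000, 120_000, 130_000, 140_000, 150_000, 400_000])

# Positive skewness means a long right tail; negative means a long left tail.
skew_value = prices.skew()
print(round(skew_value, 4))
```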

##### How does the size of the basement affect the saleprice of a house?
![image-2.png](attachment:image-2.png)

**Positive Relationship:**    
The positive correlation and the slope of the regression line indicate that larger basement sizes are generally associated with higher sale prices.    
**Moderate Influence:**    
The R-squared value shows that while basement size does affect sale price, other factors (such as location, overall square footage, number of bedrooms, etc.) also play a significant role.  
**Skewness and Outliers:**    
The skewness in both variables and the presence of outliers suggest that while the general trend holds, the market is diverse, with some properties significantly deviating from the average trend.

##### Is there a correlation between the number of bedrooms above ground and saleprice?
![image-3.png](attachment:image-3.png)

**Left Plot: BedroomAbvGr vs SalePrice by FullBath**  

**Sale Price and Bedrooms**  
As the number of bedrooms increases, the sale price generally tends to increase, but this trend is not strict, with a considerable spread in sale prices for the same number of bedrooms.  
Properties with 3 to 4 bedrooms seem to be the most common, with a wide range of sale prices.  

**Impact of Full Bathrooms**     
The color of the points indicates the number of full bathrooms (0 to 3).
Properties with more full bathrooms (2 or 3) tend to have higher sale prices, even within the same bedroom category. This suggests that the number of full bathrooms is an important factor in determining sale prices.  
For instance, properties with 4 bedrooms and 3 full baths (yellow points) show some of the highest sale prices.    

**Outliers**  
There are a few outliers where properties with the same number of bedrooms (e.g., 3 or 4) have significantly higher or lower sale prices, possibly due to other factors like location, condition, or additional amenities.    

**Right Plot: BedroomAbvGr vs SalePrice by HalfBath**  

**Sale Price and Bedrooms**
Similar to the left plot, sale prices generally increase with the number of bedrooms, but with significant variability.  

**Impact of Half Bathrooms**
The color of the points here indicates the number of half bathrooms (0 to 2).
Properties with 1 or 2 half bathrooms (yellow and green points) generally tend to have higher sale prices compared to those with no half bathrooms (purple points).    
The distinction in sale prices is not as pronounced with half bathrooms as it is with full bathrooms, but it is still evident.

**Data Spread**  
There is a wide spread in sale prices for properties with 3 and 4 bedrooms, indicating that while the number of bedrooms and bathrooms are important, other factors also contribute significantly to property valuation.  

**Conclusion**    
*Bathrooms Matter:*   
The number of full and half bathrooms influences the sale price significantly, with properties having more full bathrooms generally commanding higher prices.  
*Bedrooms and Value:*   
While more bedrooms generally correlate with higher sale prices, the relationship is not linear, and the impact of additional bedrooms diminishes after a certain point, particularly for properties with more than 5 bedrooms.  
*Other Factors:*   
There are outliers and a wide spread in sale prices for properties with the same number of bedrooms and bathrooms, suggesting that other factors such as location, lot size, or home condition also play a crucial role in determining sale prices.

- **Do a house's basic amenities have any influence on the saleprice?**
![image.png](attachment:image.png)

**Impact of Utilities on Sale Price**  
Homes with complete public utilities (AllPub) are much more valuable compared to those without full utilities (NoSeWa).
Insight: Ensuring that a property is equipped with all public utilities can significantly increase its market value. Developers and sellers should prioritize this aspect to attract higher offers.  

**Influence of Central Air Conditioning**  
Properties equipped with central air conditioning have a notably higher average sale price than those without it.
Insight: Central air conditioning is a highly desirable feature for home buyers, likely due to comfort and energy efficiency. Sellers looking to increase their property value might consider installing or upgrading central air systems.  

**Effect of Heating Type on Sale Price**
Heating types have a marked impact on property value.
Gas Forced Air (GasA) is the most lucrative, followed by Gas Hot Water (GasW).  
Less modern systems such as Gravity Furnace (Grav) and Floor Furnace (Floor) are associated with lower sale prices.  

**Insight**   
Buyers prefer modern and efficient heating systems, which directly translates into higher property values. Investing in upgrading the heating system could be a worthwhile consideration for homeowners or investors looking to maximize sale prices.

**Strategic Recommendations**

**For Sellers**  
If your property lacks central air or has an outdated heating system, investing in these upgrades could yield a higher sale price.  
Ensure that your property is connected to all public utilities, as this can significantly enhance its value.

**For Buyers**  
If you are looking for properties with good investment potential, focus on those with modern amenities like central air and efficient heating systems, as these features are highly valued in the market.  

**For Developers**  
When planning new developments, prioritize the inclusion of full utilities, central air conditioning, and modern heating systems. These features will not only attract more buyers but also justify a higher sale price.


### 2.3.3 Feature Engineering

- **How does saleprice trend with the age of a house? Does overall quality have any influence on the saleprice?**
![image-2.png](attachment:image-2.png)

**Observations**

**Price range:** Sale prices generally range from under 100,000 dollars to about 800,000 dollars, with most homes falling between 100,000 and 400,000 dollars.  
**Age distribution:** The graph covers houses from newly built (age 0) to about 140 years old, with a higher concentration of data points for houses under 80 years old.  
**Price vs. Age trend:** There's a slight downward trend in price as age increases, but it's not a strong linear relationship. Newer homes tend to have a wider range of prices, including some of the highest-priced homes.  
**Quality clusters:** The different colored clusters likely represent overall quality ratings.   
*Some patterns emerge:*  
Cluster 0 (blue) and Cluster 3 (green) appear frequently across all age ranges, often in the mid-price range.  
Cluster 1 (orange) is common in lower-priced homes across all ages.  
Cluster 2 (red) tends to represent higher-priced homes, especially among newer constructions.  
Cluster 5 (yellow) appears sporadically but often represents some of the highest-priced homes, particularly for newer constructions.    
**Price variability:** There's significant price variability within each age group, suggesting that factors other than age (like quality, location, size) strongly influence price.    
**Outliers:** There are several high-priced outliers, particularly for homes less than 20 years old, with some selling for over 700,000 dollars.  
**Newer homes:** The highest concentration of data points is for homes under 50 years old, indicating a larger sample of newer homes in the dataset.  
**Older homes:** While there are fewer very old homes (100+ years), they show a wide range of prices, suggesting that age alone doesn't determine value for historic properties.  
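
The house-age feature behind this plot can be derived from two existing columns. The sketch below assumes age is measured at the time of sale (`YrSold` minus `YearBuilt`); the notebook's exact definition may differ, and the rows are toy values.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "YrSold": [2008, 2007, 2006],
})

# Assumed definition: age of the house in years when it was sold.
toy["house_age"] = toy["YrSold"] - toy["YearBuilt"]
print(toy)
```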

- **Does renovation change how the age of a house influences the saleprice?**
![image-6.png](attachment:image-6.png)
**Key Observations**  
**General Trend:**  
There is a downward trend in sale price as the age of the house increases, suggesting that newer homes tend to sell for higher prices compared to older ones.  

**Impact of Renovation**
**Renovated Homes (Orange Points):**  
Renovated homes are distributed across all ages but appear to maintain relatively higher sale prices, even as the house age increases.  
Even older homes, when renovated, can achieve higher sale prices, indicating the value added by renovations.  
**Non-Renovated Homes (Blue Points):**  
These homes generally have lower sale prices as the house age increases, with fewer higher-priced outliers compared to renovated homes.   
The steep decline in prices with age is more evident in non-renovated homes.  

**Outliers**  
There are some notable outliers where older homes (some over 100 years old) have very high sale prices. These are likely due to renovations, as indicated by the orange color.  

**Interpretation**  
**Renovation Significance:**  
Renovation appears to significantly mitigate the effect of aging on property value, helping older homes retain or even increase their market value.  
**Market Value Trends:**  
Newer homes generally command higher prices, but strategic renovations can help maintain property values over time.
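
A renovated versus non-renovated split like the one in the plot can be built from the remodel year. In the Ames data dictionary, `YearRemodAdd` equals the construction year when no remodeling took place, so the sketch below flags a home as renovated when the two years differ. The column name `is_renovated` is a hypothetical choice, and the rows are toy values.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "YearBuilt": [1950, 1950, 2005],
    "YearRemodAdd": [1950, 1998, 2005],
})

# A remodel year later than the construction year indicates a renovation.
toy["is_renovated"] = toy["YearRemodAdd"] > toy["YearBuilt"]
print(toy)
```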

- **How does the neighborhood influence the saleprice zscore?**
![image-5.png](attachment:image-5.png)
**Key Insights** 

- **Outliers**  
Several neighborhoods have a significant number of outliers, indicated by red dots. Notably, NridgHt, StoneBr, and NoRidge have multiple high Z-score outliers, showing that some properties in these areas are priced significantly higher than the average.
Blueste has a notable negative outlier, indicating at least one property that sold for much less than the average.  

- **Neighborhood Variation**
NridgHt and StoneBr show a wider spread in Z-scores, indicating more variability in sale prices. This suggests that these neighborhoods might have a mix of high-end and more moderately priced properties.
Neighborhoods like Blueste and SWISU have tighter distributions, meaning the sale prices are more consistent, with less variation from the mean.  

- **Median Z-Scores**
Most neighborhoods have median Z-scores close to 0, indicating that the bulk of sale prices in these areas are near the overall mean.  
NoRidge and StoneBr have slightly higher median Z-scores, suggesting that on average, properties in these neighborhoods are priced above the mean compared to other areas.  

- **Neighborhoods with More Variability**
NridgHt, StoneBr, and NoRidge are neighborhoods with more variability in sale prices, as indicated by their wider boxes and the presence of several outliers.
OldTown and Edwards show a wide range of Z-scores within their IQR, suggesting diverse property values within these neighborhoods.  

**Implications for Market Analysis**

- **High Z-Score Neighborhoods**   
Areas like NoRidge, StoneBr, and NridgHt could be considered premium markets with a possibility of very high-priced properties. The presence of several high Z-score outliers might be indicative of luxury or highly desirable properties.  
- **Low Z-Score and Negative Outliers**   
Neighborhoods with negative Z-score outliers, like Blueste, might have distressed properties or areas where market prices are falling, which could signal opportunities for bargain hunters or areas of concern for property value stability.  
- **Investment Consideration**  
High variability neighborhoods like NridgHt and StoneBr could present opportunities for investors looking to capitalize on high-value properties or market shifts.
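
For reference, a Z-score expresses each sale price as a number of standard deviations from the mean. The sketch below computes it over the whole (toy) dataset and then summarizes it per neighborhood; whether the notebook standardized globally or within each neighborhood is not stated, so the global version here is an assumption.

```python
import pandas as pd

# Toy rows for illustration only; not values from the Ames dataset.
toy = pd.DataFrame({
    "Neighborhood": ["NridgHt"] * 3 + ["OldTown"] * 3,
    "SalePrice": [300_000, 320_000, 400_000, 100_000, 120_000, 110_000],
})

# Global Z-score: (price - overall mean) / overall standard deviation.
mean_price = toy["SalePrice"].mean()
std_price = toy["SalePrice"].std()
toy["SalePrice_ZScore"] = (toy["SalePrice"] - mean_price) / std_price

# Median Z-score per neighborhood, as read off the boxes in a box plot.
median_z = toy.groupby("Neighborhood")["SalePrice_ZScore"].median()
print(median_z)
```

A within-neighborhood variant would replace the overall mean and standard deviation with `groupby("Neighborhood")["SalePrice"].transform("mean")` and `.transform("std")`.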

### 2.3.4 Feature Selection

##### What is the correlation between the numerical variables and the target?
![image-3.png](attachment:image-3.png)

#### Insights
The correlation heatmap focuses on features whose correlation with SalePrice has an absolute value of at least 0.45. Those features and their relationships with SalePrice are discussed below.   

**YearBuilt (0.52):** There is a moderate positive correlation with SalePrice, indicating that newer homes tend to have higher sale prices. Homes built in more recent years are likely to have modern amenities and designs, which contribute to a higher value.

**YearRemodAdd (0.51):** This feature is positively correlated with SalePrice. Homes that have been remodeled more recently are likely to be more updated and therefore more valuable, leading to higher sale prices.

**MasVnrArea (0.45):** There is a moderate positive correlation with SalePrice. Masonry veneer area likely contributes to the aesthetic and possibly the durability of the home, making it more valuable.

**TotalBsmtSF (0.61):** This shows a strong positive correlation with SalePrice, indicating that homes with larger basements tend to have higher sale prices. The basement area adds to the overall usable space in a house, which is desirable for buyers.

**1stFlrSF (0.61):** This feature also shows a strong positive correlation with SalePrice. The first-floor square footage is a significant part of the living area, and a larger space here contributes to a higher home value.

**GrLivArea (0.71):** This is one of the strongest positive correlations with SalePrice. Greater living area square footage (above ground) directly contributes to the overall size and value of the home, making it a key factor in pricing.

**FullBath (0.56):** The number of full bathrooms in the home is moderately positively correlated with SalePrice. More bathrooms generally add convenience and appeal, leading to higher prices.

**TotRmsAbvGrd (0.53):** The total number of rooms above ground (excluding bathrooms) is moderately correlated with SalePrice. More rooms typically mean more living space, which is attractive to buyers.

**Fireplaces (0.47):** This feature has a moderate positive correlation with SalePrice. Fireplaces are often seen as desirable features that add comfort and aesthetic appeal, contributing to a higher home value.

**GarageCars (0.64):** The number of cars the garage can hold is strongly correlated with SalePrice. Larger garages are a significant asset, especially in suburban and rural areas, making homes with bigger garages more expensive.

**GarageArea (0.62):** Similar to GarageCars, the total garage area is strongly positively correlated with SalePrice. More garage space is a practical feature that adds value to a property.

**House_age (-0.47):** This feature is negatively correlated with SalePrice, indicating that older homes tend to have lower sale prices. As homes age, they may require more maintenance or may be outdated, which can reduce their value.

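**SalePrice_ZScore (0.61):** This feature is strongly positively correlated with SalePrice. Because it is derived from SalePrice itself, a positive correlation is expected by construction rather than being an independent finding.   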

These correlations suggest that the size of the home (living area, basement, garage), its quality, the number of rooms and bathrooms, and its age or recent updates are the most important factors associated with sale price in this dataset.
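
The 0.45 cut-off described above can be applied programmatically. This is a minimal sketch on synthetic data (the column `Noise` is a hypothetical, deliberately uninformative feature); it filters on the absolute correlation so that negatively correlated features such as house age are also retained.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 400, n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "Noise": rng.normal(0, 1, n),  # unrelated to the target by construction
    "SalePrice": 100 * area + rng.normal(0, 20_000, n),
})

# Correlation of every numerical feature with the target, excluding the target itself.
corr_with_target = toy.corr()["SalePrice"].drop("SalePrice")

# Keep features whose absolute correlation meets the threshold.
THRESHOLD = 0.45
selected = corr_with_target[corr_with_target.abs() >= THRESHOLD].index.tolist()
print(selected)
```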

![image-7.png](attachment:image-7.png)

19 columns were selected for the model: 18 predictors plus the target variable, SalePrice. Based on my analysis, each of these predictors has a moderate to high level of association with the target. The columns are:  
  
['GarageCars','OverallQual','Condition2','BsmtQual','KitchenQual','GrLivArea','CentralAir','BldgType',  'LandContourDesc','GarageType','MSZoning_Desc','TotalBsmtSF','Fireplaces','SalePriceZscore','house_age',  'BsmtExposure','Neighborhood','1stFlrSF','SalePrice']

### 2.4 Model Training and Evaluation

##### LabelEncoder
LabelEncoder was used to convert all the categorical variables to numerical format. It is a preprocessing technique that assigns a unique integer to each category in a feature, which is useful for algorithms that require numerical input.  
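
As a small illustration of the encoding step, the sketch below applies scikit-learn's `LabelEncoder` to a toy `CentralAir` column. The new column name `CentralAir_encoded` is a hypothetical choice.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only.
toy = pd.DataFrame({"CentralAir": ["Y", "N", "Y", "Y"]})

encoder = LabelEncoder()
# Classes are sorted before codes are assigned, so "N" -> 0 and "Y" -> 1.
toy["CentralAir_encoded"] = encoder.fit_transform(toy["CentralAir"])

print(list(encoder.classes_))
print(toy)
```

The integer codes follow alphabetical order rather than any meaningful ranking, which matters more for linear models than for tree-based ones. The scikit-learn documentation describes `LabelEncoder` as intended for target labels and offers `OrdinalEncoder` and `OneHotEncoder` for input features.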

##### Train-Test-Split
Train-Test Split is a method used to evaluate machine learning models. It involves dividing the dataset into two subsets: a training set and a test set. The model is trained on the training set and evaluated on the test set, which helps assess how well the model generalizes to unseen data. The split ratio used for this model is 80:20 (train/test). This technique helps detect overfitting and provides a more realistic estimate of model performance.
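
An 80:20 split can be sketched as follows on placeholder arrays. The `random_state=42` seed is an assumption added for reproducibility, not a value taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data for illustration only: 100 rows, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# test_size=0.2 reserves 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```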

##### Normalization
StandardScaler was used in this analysis. It is a preprocessing method used to standardize features by removing the mean and scaling to unit variance. It transforms the data so that it has a mean of 0 and a standard deviation of 1. This is particularly useful when features have different scales or units. Standardization helps many machine learning algorithms perform better and converge faster.
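
The standardization step can be sketched as below. Fitting the scaler on the training split only and reusing it for the test split is a common practice to avoid leaking test-set statistics; whether this notebook did so is an assumption. The arrays are toy values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features for illustration only: living area and number of bathrooms.
X_train = np.array([[1000.0, 1.0], [1500.0, 2.0], [2000.0, 3.0]])
X_test = np.array([[1750.0, 2.0]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same transformation to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```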

#### Model Selection Summary
In predicting housing sale prices in Ames, Iowa, I selected a diverse set of machine learning models: Linear Regression, Random Forest, XGBoost, and Gradient Boosting. Each of these models offers specific advantages that enhance the accuracy and robustness of the predictions.

1. **Linear Regression** 
    Reason for Selection:
    Linear Regression was chosen for its simplicity and interpretability. It provides a clear understanding of how each feature directly impacts the sale price.  
    Benefit to Prediction:
    Serves as a strong baseline model, offering insights into the linear relationships between features and the target variable, SalePrice.
    

2. **Random Forest**    
    Reason for Selection:
    Random Forest was selected for its ability to handle non-linear relationships and its robustness to overfitting.  
    Benefit to Prediction:
    Captures complex interactions between features and provides feature importance scores, helping to identify key drivers of SalePrice. 
    

3. **XGBoost**  
    Reason for Selection:
    XGBoost was chosen for its high predictive accuracy and efficient handling of large datasets. It includes regularization to prevent overfitting.  
    Benefit to Prediction:
    Delivers highly accurate predictions by effectively capturing intricate patterns in the data, while also allowing for fine-tuning to optimize performance.
    

4. **Gradient Boosting**  
    Reason for Selection:
    Gradient Boosting was selected for its sequential learning process, which focuses on reducing prediction errors iteratively.  
    Benefit to Prediction:
    Improves prediction accuracy by building on the strengths and correcting the weaknesses of previous models, offering a balanced approach to bias and variance.
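
The four models listed above can be fitted and compared in a single loop. The sketch below uses synthetic data and illustrative hyperparameters (`n_estimators=50`, fixed seeds), none of which are taken from this notebook. XGBoost is added only if the `xgboost` package is installed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data for illustration only.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "LR": LinearRegression(),
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
}

# XGBoost is an optional dependency; skip it when it is not available.
try:
    from xgboost import XGBRegressor
    models["XGB"] = XGBRegressor(n_estimators=50, random_state=0)
except ImportError:
    pass

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

print(scores)
```

On this synthetic data the target is linear in the features, so Linear Regression is expected to do well here; the ranking on the Ames data can differ.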

### 2.4.1 Linear Regression
**R² (R-squared): 0.8383**  
R-squared measures the proportion of variance in the dependent variable (e.g., SalePrice) that is predictable from the independent variables (e.g., features like square footage, number of bedrooms, etc.). 0.8383 means that approximately 83.83% of the variance in the target variable is explained by the model. This suggests a good fit, as a high R² indicates that the model accounts for most of the variability in the data. 

**Adjusted R²: 0.8224**  
Adjusted R-squared adjusts the R² value based on the number of predictors in the model. It accounts for the possibility of overfitting by penalizing the addition of irrelevant predictors. 0.8224, slightly lower than the R², indicates that the model's explanatory power remains strong even after accounting for the number of predictors. This suggests that the model is not overfitting and that most of the predictors are meaningful.  

**MSE (Mean Squared Error): 1,240,472,593.1965**  
MSE measures the average squared difference between the observed actual outcomes and the outcomes predicted by the model.
A value of 1,240,472,593.1965 indicates the average squared error in the predictions. The MSE is in the units of the squared target variable (e.g., dollars squared if predicting prices). A lower MSE indicates a better fit of the model, but without context (such as comparing it to other models), it's hard to say if this is a good value.  

**RMSE (Root Mean Squared Error): 35,220.3435**  
RMSE is the square root of the MSE, giving an error metric in the same units as the target variable (e.g., dollars). An RMSE of 35,220.3435 means that the model's predictions are typically off by about 35,220 dollars from the actual values.   

**Overall Interpretation**  
- The model explains a significant portion of the variance in the data (R² = 0.8383), which suggests that the model is quite effective at predicting the target variable.  
- The Adjusted R² being close to R² suggests that the model is not overfitting and that most of the features are contributing to the model's predictive power.
- The RMSE of 35,220 dollars indicates the typical error in predictions, which should be evaluated in the context of the typical value of the target variable. If house prices are typically much higher than this, the error might be acceptable; if not, the error might be considered high.
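
The four metrics used throughout this section follow directly from their definitions. The sketch below implements them with NumPy on toy values; `regression_metrics` is a hypothetical helper, and `n_features` is the number of predictors used in the adjusted R² penalty.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features):
    """Return R², adjusted R², MSE and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size

    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation

    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R² penalizes R² for the number of predictors.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    mse = ss_res / n
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "rmse": np.sqrt(mse)}


# Toy actual and predicted values for illustration only.
metrics = regression_metrics(
    [100, 200, 300, 400, 500], [110, 190, 310, 390, 510], n_features=1
)
print(metrics)

# The reported RMSE is the square root of the reported MSE.
print(np.sqrt(1_240_472_593.1965))
```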


### 2.4.2  Random Forest
**R² (R-squared): 0.9172**  
R-squared indicates that the Random Forest model explains 91.72% of the variance in the target variable (e.g., SalePrice).
This is higher than the R² of 0.8383 from the linear regression model, indicating that the Random Forest model provides a better fit and captures more of the variability in the data.    

**Adjusted R²: 0.9090**  
Adjusted R-squared, at 0.9090, is slightly lower than the R² but still quite high. It suggests that the model remains strong after adjusting for the number of predictors and potential overfitting.  
The Adjusted R² is also higher compared to the linear regression model's Adjusted R² of 0.8224, reinforcing that the Random Forest model is likely capturing more meaningful patterns in the data.  

**MSE (Mean Squared Error): 635,460,131.5412**  
The MSE value represents the average squared error between the predicted and actual values. The MSE for the Random Forest model is significantly lower than that of the linear regression model (1,240,472,593.1965). This indicates that the Random Forest model is making more accurate predictions on average.  

**RMSE (Root Mean Squared Error): 25,208.3346**  
RMSE is the square root of MSE, showing the typical error in the same units as the target variable (e.g., dollars).
The RMSE for the Random Forest model is lower than that of the linear regression model (35,220.3435), a reduction of about 10,000 dollars in typical prediction error.   

**Overall Interpretation:**  
Better Fit: The Random Forest model outperforms the linear regression model across all metrics, with a higher R² and Adjusted R², and lower MSE and RMSE values. This suggests that the Random Forest model provides a more accurate and reliable prediction of the target variable.  

Lower Error: The lower RMSE of 25,208.3346 dollars indicates that the Random Forest model’s predictions are generally more precise than those from the linear regression model, making it a better choice if prediction accuracy is the primary concern.  

Complexity vs. Performance: While Random Forest models are more complex and computationally intensive compared to linear regression, the improved performance (as indicated by these metrics) often justifies the added complexity. 

### 2.4.3 XGBoost Regression
**R² (R-squared): 0.9563**  
An R² value of 0.9563 means that the model explains 95.63% of the variance in the target variable. This is a very high R² value, indicating that the model captures most of the variability in the data. It suggests that the XGBoost model is highly effective at explaining the relationship between the predictors and the target variable.  

**Adjusted R²: 0.9534**  
The Adjusted R² of 0.9534 accounts for the number of predictors in the model and provides a slightly lower value than R². It indicates that 95.34% of the variance is explained by the model after adjusting for the number of predictors. The minor difference between R² and Adjusted R² suggests that the model is not overly complex and that the predictors included are relevant. This implies that the model is likely not overfitting.   

**MSE (Mean Squared Error): 335,062,682.6558**  
The MSE of approximately 335 million represents the average of the squared differences between the actual and predicted values. While MSE is in squared units and harder to interpret directly, a lower MSE indicates better predictive performance. This MSE is lower than the Random Forest model's, which suggests that the XGBoost model has improved accuracy in its predictions.  

**RMSE (Root Mean Squared Error): 18,304.7175**  
Interpretation: RMSE provides the error in the same units as the target variable, making it more interpretable. An RMSE of about 18,305 means that the model's predictions are typically off by approximately 18,305 dollars. This is lower than the Random Forest model's RMSE, indicating that the XGBoost model has smaller prediction errors and is more accurate.  

**Overall Interpretation:**  
Model Performance: The XGBoost model shows strong performance with an R² of 0.9563, meaning it explains a very high percentage of the variance in the target variable. This suggests that the model is highly effective at capturing the underlying patterns in the data.  

Error Metrics: The RMSE of 18,305 is lower than the Random Forest model's RMSE, indicating that the XGBoost model has better predictive accuracy and a smaller typical error in its predictions.  

Model Reliability: The close values of R² and Adjusted R² (0.9563 vs. 0.9534) suggest that the model is not overfitting and is using the features efficiently to predict the target variable.  

**Conclusion:**  
The XGBoost model outperforms the Random Forest model in terms of both R² and RMSE. With its high accuracy and lower prediction error, the XGBoost model would likely be the best choice for predicting the target variable, particularly if minimizing prediction error is a key goal.

### 2.4.4 Gradient Boosting Regression
**R² (R-squared): 0.9512**   
An R² value of 0.9512 indicates that the Gradient Boosting Regressor explains 95.12% of the variance in the target variable. This is a high R² value, suggesting that the model is very effective at capturing the underlying patterns in the data and explaining a large portion of the variability.   

**Adjusted R²: 0.9480**  
The Adjusted R² of 0.9480 accounts for the number of predictors and provides a slightly lower value than R². It indicates that 94.80% of the variance is explained by the model after adjusting for the number of predictors. The minor decrease from R² to Adjusted R² suggests that the model is well-specified and not overfitting, with most of the features being relevant.   

**MSE (Mean Squared Error): 374,173,945.0365**   
The MSE of approximately 374 million represents the average of the squared differences between actual and predicted values. A lower MSE indicates better predictive accuracy. In comparison to the previous MSE of 335 million (for XGBoost), this slightly higher MSE suggests that the Gradient Boosting Regressor has slightly less accurate predictions than the XGBoost model.   

**RMSE (Root Mean Squared Error): 19,343.5763**   
Interpretation: RMSE provides the error in the same units as the target variable. An RMSE of about 19,344 means that, on average, the model's predictions are off by approximately 19,344 dollars. This is higher than the RMSE of 18,305 from the XGBoost model, indicating that the Gradient Boosting Regressor has slightly higher prediction errors.   

**Overall Interpretation:**
Model Performance: The Gradient Boosting Regressor performs very well, with an R² of 0.9512, meaning it explains a high percentage of the variance in the target variable. It is effective at capturing patterns and providing accurate predictions.  

Error Metrics: The RMSE of 19,344 is slightly higher than the previous RMSE (18,305) from the XGBoost model, indicating marginally larger prediction errors.  

Model Reliability: The small difference between R² and Adjusted R² (0.9512 vs. 0.9480) suggests that the penalty for the number of predictors is minor and that most features contribute to the fit.  

**Conclusion:**
The Gradient Boosting Regressor performs comparably to the XGBoost model, with slightly lower R² and slightly higher RMSE. While both models are effective, the XGBoost model has a marginal edge in predictive accuracy and lower prediction error. However, the Gradient Boosting Regressor is still a strong contender and could be a viable alternative depending on other factors like computational efficiency and model interpretability.

## <h2><span style="color: BurlyWood;">3. Results</span></h2>

### 3.1 Model Training


| **Model** | **RMSE**        | **R²**        | **Adjusted R²** |
|-----------|-----------------|---------------|-----------------|
| LR        | 35,697.582      | 0.8339        | 0.8229          |
| RF        | 21,464.201      | 0.9399        | 0.9360          |
| XGB       | 18,304.717      | 0.9563        | 0.9534          |
| GB        | 19,343.576      | 0.9512        | 0.9480          |

![image.png](attachment:image.png)

**Models compared**

- LR: Linear Regression
- RF: Random Forest
- XGB: XGBoost
- GB: Gradient Boosting


**Performance metric:**  
The y-axis represents the RMSE, which measures the average deviation of predictions from actual values. Lower RMSE indicates better model performance.  

**Model performance (from worst to best):**

**a) LR (Linear Regression):**
- Highest RMSE (about 35,700)
- Significantly underperforms the other models
- Suggests that the relationship between the features and house prices is not purely linear

**b) RF (Random Forest):**
- Second-highest RMSE (about 21,500)
- Performs much better than LR, but not as well as the boosting methods

**c) GB (Gradient Boosting):**
- Second-best performance (RMSE about 19,300)
- Very close to XGB in performance

**d) XGB (XGBoost):**
- Best-performing model (lowest RMSE, about 18,300)
- Marginally outperforms GB


**Observations:**  
- Ensemble methods (RF, XGB, GB) significantly outperform the simple Linear Regression model.
- Boosting methods (XGB and GB) show superior performance compared to bagging (RF).
- The difference in performance between XGB and GB is relatively small, indicating both are strong candidates for this prediction task.
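
The size of these gaps can be quantified directly from the results table. The sketch below ranks the four models by RMSE and reports each model's RMSE reduction relative to Linear Regression; the RMSE values are copied from the table in this section, while the variable names are chosen here for illustration.

```python
# RMSE values copied from the results table in Section 3.1.
rmse_by_model = {
    "LR": 35_697.582,
    "RF": 21_464.201,
    "XGB": 18_304.717,
    "GB": 19_343.576,
}

baseline = rmse_by_model["LR"]

# Sort ascending: the lowest RMSE (best model) comes first.
ranking = sorted(rmse_by_model.items(), key=lambda item: item[1])

# Percentage reduction in RMSE relative to the Linear Regression baseline.
reduction_pct = {
    name: 100 * (baseline - rmse) / baseline for name, rmse in rmse_by_model.items()
}

for name, rmse in ranking:
    print(f"{name:<4} RMSE={rmse:>11,.3f}  reduction vs LR={reduction_pct[name]:5.1f}%")
```

On these figures, XGBoost cuts the RMSE of Linear Regression by a little under half, and the gap between XGBoost and Gradient Boosting is only a few percentage points.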


**Implications:**
These observations suggest that more complex models such as XGBoost and Gradient Boosting should be preferred for tasks requiring high prediction accuracy. Linear Regression appears too simplistic for this problem, while XGBoost stands out as the most accurate model. Further tuning of the ensemble models, particularly XGBoost, could lead to even better results.  

**Application to Business or Research:**  
For business or research applications, choosing a model with a lower RMSE (like XGBoost) could lead to more accurate predictions, which can translate into better decision-making and more reliable outcomes.

### 3.2 Model Performance

**Model Performance:**   
The Optimized XGBoost Regression model exhibits outstanding performance, with a very high R² of 0.97, suggesting it explains nearly all the variance in the target variable. This reflects exceptional predictive power and accuracy.  

**Error Metrics:** 
The RMSE of 14,193 is the lowest among the compared models, indicating that the Optimized XGBoost model provides the most accurate predictions with the smallest average error.  

**Model Reliability:** 
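The high values of R² and Adjusted R², along with the low MSE and RMSE on the test set, suggest that the model is well-tuned and captures the data patterns effectively, although the gap between training and test error in the learning curve below points to some overfitting.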

![image.png](attachment:image.png)

The plot shows the performance of the XGBoost model over 1000 epochs (boosting rounds), with the Root Mean Squared Error (RMSE) for both the training and test datasets. The blue line represents the training error, while the orange line shows the test error. Both start at a high RMSE of around 80,000, indicating poor initial predictions. The error drops rapidly for both sets in the first 200 epochs, demonstrating quick learning, and then decreases more gradually. The training error consistently remains lower than the test error, ending at around 5,000 and 15,000 respectively.  
This growing gap suggests some overfitting, as the model performs better on training data than on unseen test data. The model continues to improve slightly in later epochs, especially on the training data, but the gains become minimal. Overall, the graph shows effective error reduction but also a limit to how well the model generalizes to new data, suggesting potential benefits from regularization or early stopping.  
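
To make the early stopping idea concrete, here is a minimal sketch of the stopping rule applied to a recorded test-error curve. It is not the notebook's training code: the function `best_round`, its `patience` and `min_delta` parameters, and the synthetic curve are all assumptions for illustration. XGBoost provides its own `early_stopping_rounds` option that applies a similar rule during training.

```python
def best_round(test_rmse, patience=50, min_delta=0.0):
    """Return (index, value) of the best round under a patience-based stopping rule.

    Training is treated as stopped once `patience` consecutive rounds pass
    without the test RMSE improving on the best value by more than `min_delta`.
    """
    best_idx, best_val = 0, float("inf")
    for i, value in enumerate(test_rmse):
        if value < best_val - min_delta:
            best_idx, best_val = i, value  # new best round
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop here
    return best_idx, best_val


# Synthetic test-error curve (NOT the notebook's recorded history):
# it falls quickly, bottoms out, then drifts upward as overfitting sets in.
curve = [80_000, 40_000, 25_000, 18_000, 15_500, 15_000,
         15_050, 15_100, 15_200, 15_400]

idx, val = best_round(curve, patience=3)
print(f"best round: {idx}, test RMSE: {val:,}")
```

Keeping the model from the best round rather than the last one limits the widening gap between training and test error described above.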

 
![image-2.png](attachment:image-2.png)  

**Accurate Pricing Predictions:**  
The model explains about 97% of the variance in home prices (R² of 0.97) based on various property features. In most cases, the predicted prices are close to the actual sale prices, allowing you to set prices that reflect true market value.

**Small Margin of Error:**  
The typical difference between the predicted price and the actual sale price, measured by RMSE, is around $14,193. For a realtor, this means that the model is reliable enough to support informed pricing decisions, with only a minor adjustment typically needed to fine-tune the final listing price.

**Reliable Tool for Pricing Strategy:**  
 The red line on the graph represents perfect predictions. Since most of the blue dots (representing individual homes) are clustered closely around this line, it shows that the model consistently provides accurate estimates. This can be a valuable tool in your pricing strategy, helping you to stay competitive and avoid over- or under-pricing properties.

**Outliers:**  
 A few properties, particularly higher-priced ones, show more significant deviations from the predicted price. This indicates areas where further analysis might be needed, possibly due to unique features of these homes that the model doesn’t fully capture.
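
One simple way to find such properties is to flag predictions whose residual is large relative to the overall RMSE. The sketch below assumes a threshold of `k` times the RMSE; the helper `flag_large_errors`, the threshold, and the sale prices are hypothetical and only illustrate the idea.

```python
import math


def flag_large_errors(actual, predicted, k=2.0):
    """Return (index, residual) pairs whose absolute residual exceeds k * RMSE."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [(i, r) for i, r in enumerate(residuals) if abs(r) > k * rmse]


# Invented sale prices for illustration only; the last home is an expensive
# property that the (hypothetical) model underestimates.
actual = [150_000, 210_000, 185_000, 240_000, 320_000, 610_000]
predicted = [155_000, 205_000, 190_000, 236_000, 315_000, 540_000]

flagged = flag_large_errors(actual, predicted, k=2.0)
for index, residual in flagged:
    print(f"home {index}: actual - predicted = {residual:,}")
```

A positive residual means the model underestimated the sale price. Flagged homes are candidates for manual review of features the model may not capture.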

### 3.3 Feature Importance  

![image.png](attachment:image.png)

The feature importance plot shows which features matter most when evaluating or setting prices for homes. Garage capacity (GarageCars), overall quality, kitchen quality, basement quality, and above-ground living area should be prioritized when assessing a property’s value. These insights can guide you in highlighting the most valuable aspects of a property to potential buyers and in setting competitive prices based on the features that matter most in the market.
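
A ranking like the one in the plot can be derived from any list of per-feature importance scores, for example the `feature_importances_` array of a fitted tree-based model. The sketch below is illustrative only: `top_features` is a hypothetical helper, and the scores are placeholders, not the importances produced in this notebook.

```python
def top_features(names, scores, k=5):
    """Return the k (name, share) pairs with the largest importance.

    Shares are each score divided by the total, so they sum to 1 over all features.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [(name, score / total) for name, score in ranked[:k]]


# Placeholder scores for illustration, NOT the notebook's actual importances.
names = ["GarageCars", "OverallQual", "KitchenQual", "BsmtQual", "GrLivArea", "Neighborhood"]
scores = [0.30, 0.25, 0.15, 0.12, 0.10, 0.08]

top3 = top_features(names, scores, k=3)
for name, share in top3:
    print(f"{name:<12} {share:.1%}")
```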

## <h2><span style="color: BurlyWood;">4. Business Recommendation</span></h2>

This analysis utilizing the XGBoost predictive model has identified critical factors that influence property valuations. To maximize the model’s effectiveness and drive business growth, consider the following strategic recommendations:

**1. Focus on High-Impact Features in Valuations:**  
The model reveals that the number of garage spaces (GarageCars), the overall quality of the home (OverallQual), kitchen quality (KitchenQual), basement quality (BsmtQual), and the ground living area (GrLivArea) are the most influential features in predicting property values. Prioritize these features in your valuation process. Properties with more garage spaces, higher construction quality, well-equipped kitchens, high-quality basements, and a desirable living area should be appraised at a premium. This approach will align your pricing more closely with market demand and increase the accuracy of valuations.  

**2. Enhance Model Precision through Feature Analysis:**  
**Regularly Review Feature Importances:** Continuously monitor and analyze the importance of different features in the model to adapt to changing market conditions. For instance, shifts in buyer preferences could lead to changes in which features are most valued. Keeping the model updated with the latest data ensures that it remains a reliable tool for property pricing.    

**Fine-Tuning Based on Low-Impact Features:** Consider the lower-impact features like Neighborhood and Building Type when refining the model. While these features are less critical, understanding their minor influence can help in niche markets or when dealing with unique properties. This granular approach can provide more tailored valuations for specific client needs.   

**3. Leverage Model Insights for Client Consultations:**  
**Client Education:** Use the insights from the model to educate clients about the factors that most affect their property’s value. By explaining why features like garage size, overall quality, kitchen, and basement quality have a significant impact, you can set realistic expectations and build trust.  

**Tailored Investment Advice:** Provide sellers with data-driven advice on where to invest in property improvements. For example, suggesting a garage upgrade, kitchen renovation, or basement finishing could lead to a higher sale price, backed by the model’s findings.  

**4. Integrate Model Results into Marketing Strategies:**    
**Data-Driven Pricing:** Use the model’s predictions to set competitive and accurate prices that reflect current market trends. This approach ensures that your listings are attractively priced, reducing the time properties spend on the market and improving overall sales performance.  

**Marketing Focus:** Tailor your marketing campaigns to emphasize the features the model identifies as most important. For example, highlight garage spaces, overall construction quality, and kitchen upgrades in marketing materials to attract buyers willing to pay a premium for these features.  

**5. Ongoing Model Development and Validation:**   
**Continuous Improvement:** Regularly update the model with new data to maintain its predictive accuracy. As the real estate market evolves, retraining the model will ensure it continues to provide reliable and relevant insights.   

**Validation:** Periodically validate the model against actual market sales to ensure that its predictions remain aligned with real-world outcomes. This practice will help in maintaining the model’s credibility and effectiveness.
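
A periodic validation check could look like the sketch below, which compares the RMSE on recent sales against the test RMSE of 14,193 reported for the optimized XGBoost model. The function `needs_retraining`, the 25% tolerance, and the recent sale prices are assumptions for illustration; a suitable tolerance depends on the business context.

```python
import math


def needs_retraining(actual, predicted, baseline_rmse, tolerance=0.25):
    """Return (flag, rmse) for a batch of recent sales.

    flag is True when the RMSE on the batch exceeds the baseline RMSE
    by more than the given relative tolerance.
    """
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return rmse > baseline_rmse * (1 + tolerance), rmse


BASELINE_RMSE = 14_193  # test RMSE reported for the optimized XGBoost model

# Invented recent sales for illustration only.
recent_actual = [200_000, 310_000, 150_000, 420_000]
recent_predicted = [185_000, 330_000, 165_000, 395_000]

flag, rmse = needs_retraining(recent_actual, recent_predicted, BASELINE_RMSE)
print(f"recent RMSE: {rmse:,.0f}, retrain: {flag}")
```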

## <h2><span style="color: BurlyWood;">5. Conclusion</span></h2>

This study aimed to predict housing prices using advanced regression techniques, incorporating careful feature selection through exploratory data analysis (EDA). We implemented and compared several models including Linear Regression (LR), Random Forest (RF), XGBoost (XGB), and Gradient Boosting (GB).

The analysis and prediction efforts have successfully identified the most influential factors affecting property valuations, providing a deep understanding of the key elements that drive market prices. By employing advanced predictive modeling techniques like XGBoost, we have achieved high accuracy in forecasting property values, enabling more precise and data-driven decisions.

The insights gained from this analysis are not only valuable for setting competitive prices but also for guiding homeowners and investors on where to focus their improvements for maximum return on investment. The model's effectiveness demonstrates its potential as a powerful tool for real estate professionals, offering a strategic advantage in a competitive market.

Overall, this approach enhances the ability to make informed decisions, optimize property valuations, and ultimately achieve better outcomes for clients and stakeholders in the real estate industry.