# Heritage Housing Data Cleaning

## Objectives
- Load the raw Ames Housing dataset
- Clean missing values and drop irrelevant features
- Prepare the data for analysis and modeling

## Inputs
- 'data/raw/house_prices_records.csv' - the unprocessed, original dataset
- 'data/raw/inherited_houses.csv' - a supplementary dataset with inherited homes

## Outputs
- 'data/processed/cleaned_data.csv' - the final cleaned dataset with no missing values, used for modeling
- Ready-to-use data for exploratory and predictive analysis

## Additional Comments

* Inherited homes may exhibit different trends, so we may later analyse them separately before integrating them.


## Load and Inspect the Raw Housing Dataset
Before performing analysis or data cleaning, it is essential to assess the completeness of the dataset. Missing values can bias results or reduce the quality of predictions if left unaddressed.

In this step, we load the Ames Housing Dataset from the raw source. This dataset includes all recorded residential property transactions. By inspecting the structure and contents of the data, we aim to identify columns that may require cleaning or special treatment in later stages.

In [2]:
# Import required libraries 
import pandas as pd 

# Load the raw housing dataset
df = pd.read_csv("../data/raw/house_prices_records.csv")

# Display the first few rows
df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


### Filter Dataset to Inherited Houses

To focus the analysis on the inherited properties, we need to isolate the relevant records from the full dataset (house_prices_records.csv). The subset of interest is provided in a separate file (inherited_houses.csv), which contains rows corresponding to the inherited properties.

Since the inherited subset does not include a unique identifier (e.g., Id), we adopt a row-wise alignment strategy:

- A temporary index column (row_id) is assigned to both the full dataset (house_prices_records.csv) and the inherited subset (inherited_houses.csv).

- Using these indices, we filter the full dataset to retain only the rows whose row_id also appears in the inherited subset.

- The temporary index column is removed after filtering so that it does not leak into later steps.

Because row_id is simply each file's row position, this filter keeps the first rows of the full dataset, one for each row in the inherited subset. The resulting DataFrame (df_inherited_full) therefore represents the inherited houses only if they occupy those same positions in house_prices_records.csv. It is used for further data cleaning and exploratory analysis.

In [3]:
import pandas as pd 

# Load the full house dataset
df_all = pd.read_csv("../data/raw/house_prices_records.csv")

# Load the list of inherited houses
df_inherited = pd.read_csv("../data/raw/inherited_houses.csv")

# Add a row index to both datasets
df_all["row_id"] = df_all.index
df_inherited["row_id"] = df_inherited.index

# Use the row_id to filter matching rows
df_inherited_full = df_all[df_all["row_id"].isin(df_inherited["row_id"])]

# Drop the temporary row_id
df_inherited_full = df_inherited_full.drop(columns=["row_id"])

# Confirm shape and preview
print("Inherited dataset shape", df_inherited_full.shape)
df_inherited_full.head()




Inherited dataset shape (4, 24)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
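To make explicit what the row_id filter selects, here is a minimal sketch on toy data. The values and the names `full`, `subset` and `selected` are illustrative assumptions, not the real files. Since both row_id columns are just 0, 1, 2, ..., the filter returns the first `len(subset)` rows of the full table regardless of what the subset contains.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values)
full = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})
# Two houses whose values do not appear anywhere in `full`
subset = pd.DataFrame({"GrLivArea": [896, 1329]})

# Same logic as the cell above: tag each frame with its own row position
full["row_id"] = full.index
subset["row_id"] = subset.index

# Keep rows of `full` whose position also exists in `subset`
selected = full[full["row_id"].isin(subset["row_id"])].drop(columns=["row_id"])

# The result is rows 0 and 1 of `full`, not the houses listed in `subset`
print(selected)
```

If the inherited houses are not stored at the top of house_prices_records.csv, a content-based match (for example, a merge on shared feature columns) would be needed instead.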


## Assess Missing Data in Inherited Houses Dataset

As part of our data cleaning process, we evaluate the filtered dataset containing only the inherited houses to identify any missing values.

Missing data can lead to biased predictions if not handled properly. By identifying which variables have null values, we can determine appropriate strategies (e.g. imputation or removal) in the following steps.

The output below lists all features in the inherited dataset that contain one or more missing values, sorted by the number of missing entries. This is crucial before performing any statistical analysis or model building.

In [4]:
# Show missing values in the inherited houses dataset
missing = df_inherited_full.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

WoodDeckSF       3
EnclosedPorch    2
BedroomAbvGr     1
2ndFlrSF         1
dtype: int64
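Raw counts are easier to judge alongside the share of rows they represent. The sketch below, on a small hypothetical frame named `toy`, reports both; the column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with gaps in two columns
toy = pd.DataFrame({
    "WoodDeckSF": [0.0, np.nan, np.nan, np.nan],
    "EnclosedPorch": [0.0, np.nan, 0.0, np.nan],
    "GarageArea": [548, 460, 608, 642],
})

# isnull().sum() counts missing cells; isnull().mean() gives the fraction
report = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
})

# Keep only columns with at least one gap, most affected first
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```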

### Clean Missing Values 

To ensure our dataset is suitable for analysis and machine learning, we address the issue of missing values.

**Approach**

- **Drop columns** like 'EnclosedPorch', 'WoodDeckSF', and 'LotFrontage' that have too many missing entries to justify imputation.
- For **categorical variables** (e.g., 'GarageFinish', 'BsmtFinType1'), we fill missing values with "None" to indicate the absence of a feature.
- For **numerical variables** (e.g., 'BedroomAbvGr', 'GarageYrBlt'), we use the **median** value of the column for imputation. This is a robust measure that limits the influence of outliers.

This process ensures the cleaned dataset maintains its integrity, prevents model bias and supports consistent training without runtime errors.

In [5]:
# Drop columns with too many missing values 
df = df.drop(columns=["EnclosedPorch", "WoodDeckSF", "LotFrontage"])

# Fill missing values for categorical columns with 'None'
df["GarageFinish"] = df["GarageFinish"].fillna("None")
df["BsmtFinType1"] = df["BsmtFinType1"].fillna("None")

# Fill missing values for numerical columns with the median value
df["BedroomAbvGr"] = df["BedroomAbvGr"].fillna(df["BedroomAbvGr"].median())
df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["GarageYrBlt"].median())

# Check remaining missing values in the dataset
df.isnull().sum().sort_values(ascending=False).head(10)

2ndFlrSF        86
BsmtExposure    38
MasVnrArea       8
BedroomAbvGr     0
1stFlrSF         0
BsmtFinSF1       0
BsmtFinType1     0
GarageArea       0
BsmtUnfSF        0
GarageYrBlt      0
dtype: int64

### Handle Remaining Missing Values 

To ensure the dataset is fully complete and safe for modeling, we address the remaining missing values using tailored imputation strategies:

- **Numerical columns** are filled with the **median**, which is robust to outliers and helps maintain data integrity.

- **Categorical columns** (like 'BsmtExposure') are filled with "None" to preserve the structure of the data without introducing bias.

This step is critical to prevent issues during model training and ensures that all the columns in the dataset are ready for further processing and analysis.

In [6]:
# Fill missing numerical columns with the median
df["2ndFlrSF"] = df["2ndFlrSF"].fillna(df["2ndFlrSF"].median())
df["1stFlrSF"] = df["1stFlrSF"].fillna(df["1stFlrSF"].median())
df["MasVnrArea"] = df["MasVnrArea"].fillna(df["MasVnrArea"].median())
df["GarageArea"] = df["GarageArea"].fillna(df["GarageArea"].median())
df["BsmtFinSF1"] = df["BsmtFinSF1"].fillna(df["BsmtFinSF1"].median())

# Fill the missing categorical column with 'None'
df["BsmtExposure"] = df["BsmtExposure"].fillna("None")

# Final check for any remaining missing values
df.isnull().sum().sort_values(ascending=False).head(10)

1stFlrSF        0
2ndFlrSF        0
BedroomAbvGr    0
BsmtExposure    0
BsmtFinSF1      0
BsmtFinType1    0
BsmtUnfSF       0
GarageArea      0
GarageFinish    0
GarageYrBlt     0
dtype: int64
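The two rules used in the cells above (median for numerical columns, "None" for categorical ones) can also be written as a single helper. This is a sketch under that assumption; `impute_by_dtype` and the toy frame are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd


def impute_by_dtype(frame: pd.DataFrame, fill_label: str = "None") -> pd.DataFrame:
    """Return a copy where numeric gaps get the column median
    and non-numeric gaps get `fill_label`."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(fill_label)
    return out


# Hypothetical frame with one numeric and one categorical gap
toy = pd.DataFrame({
    "2ndFlrSF": [854.0, 0.0, np.nan, 866.0],
    "BsmtExposure": ["No", None, "Gd", "Mn"],
})

filled = impute_by_dtype(toy)
print(filled)
```

A helper like this applies the same rule to every column, which avoids listing columns by hand, at the cost of hiding column-specific decisions such as dropping a feature entirely.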

### Save and Reload Cleaned Dataset

After completing the data cleaning process, we save the refined dataset to a new CSV file within the 'processed' folder. This allows for consistent reuse of the cleaned data in the later stages without repeating the cleaning steps.

We set 'index=False' to prevent pandas from writing the index as an additional column, preserving the structure of the original data.

We then reload the cleaned dataset to initiate the EDA phase. Previewing the dataset here helps confirm that the cleaning operation was successful and allows us to start identifying patterns and relationships for modelling. 

In [7]:
# Save the cleaned dataset to a new CSV file in the processed folder
# We use index=False to avoid saving the row numbers as an extra column

df.to_csv("../data/processed/cleaned_data.csv", index=False)

import pandas as pd

# Load the cleaned dataset
df = pd.read_csv("../data/processed/cleaned_data.csv")

# Preview the first few rows of the dataset
df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,KitchenQual,LotArea,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,548,RFn,2003.0,...,Gd,8450,196.0,61,5,7,856,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,460,RFn,1976.0,...,TA,9600,0.0,0,8,6,1262,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,608,RFn,2001.0,...,Gd,11250,162.0,42,5,7,920,2001,2002,223500
3,961,0.0,3.0,No,216,ALQ,540,642,Unf,1998.0,...,Gd,9550,0.0,35,5,7,756,1915,1970,140000
4,1145,0.0,4.0,Av,655,GLQ,490,836,RFn,2000.0,...,Gd,14260,350.0,84,5,8,1145,2000,2000,250000
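The effect of 'index=False' can be checked without touching the disk. The sketch below writes a small hypothetical frame to in-memory buffers with and without the flag, then reads both back.

```python
import io

import pandas as pd

# Hypothetical two-row frame
toy = pd.DataFrame({"1stFlrSF": [856, 1262], "SalePrice": [208500, 181500]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_clean.columns.tolist())
```

Without the flag, the reloaded frame gains an extra 'Unnamed: 0' column that would otherwise be treated as a feature during modelling.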


## Correlation Analysis

To understand which numerical features most significantly influence house prices, we compute the Pearson correlation coefficient between each numerical feature and the target variable, 'SalePrice'. This statistical method measures the strength and direction of the linear relationship between variables. Features with high absolute correlation values (positive or negative) are considered more relevant for predictive modelling. We visualise the top 10 most strongly correlated features using a heatmap, which facilitates the identification of patterns among predictors. This supports informed feature selection and model optimisation.

In [8]:
# Import required libraries
import pandas as pd

# Load the cleaned dataset
df = pd.read_csv('../data/processed/cleaned_data.csv')

# Select only numerical features from the dataset
numerical_df = df.select_dtypes(include='number')

# Compute the Pearson correlation matrix
corr_matrix = numerical_df.corr(numeric_only=True)

# Sort the correlation values in relation to the target variable 'SalePrice'
saleprice_corr = corr_matrix['SalePrice'].sort_values(ascending=False)

# Display the sorted correlation values
saleprice_corr




SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
YearBuilt       0.522897
YearRemodAdd    0.507101
MasVnrArea      0.472614
GarageYrBlt     0.466754
BsmtFinSF1      0.386420
OpenPorchSF     0.315856
2ndFlrSF        0.312479
LotArea         0.263843
BsmtUnfSF       0.214479
BedroomAbvGr    0.155784
OverallCond    -0.077856
Name: SalePrice, dtype: float64

In [9]:
%pip install plotly

Note: you may need to restart the kernel to use updated packages.


In [10]:
# Import plotly for visualisation
import plotly.express as px

# Compute correlations with SalePrice
correlations = df.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)

# Get top 10 features most correlated with SalePrice (excluding SalePrice itself)
top_corr_features = correlations.iloc[1:11].index

# Compute correlation matrix for these features 
top_corr_matrix = df[top_corr_features].corr()

# Plot an interactive heatmap using Plotly
# 'RdBu_r' with the range fixed at [-1, 1] maps +1 to dark red, 0 to white and -1 to dark blue
fig = px.imshow(
    top_corr_matrix,
    text_auto=True,
    color_continuous_scale='RdBu_r',
    zmin=-1,
    zmax=1,
    title="Top 10 Features Correlated with SalePrice",
    labels=dict(color='Correlation Coefficient'),
    x=top_corr_features,
    y=top_corr_features
)

# Update axis titles manually for context
fig.update_layout(
    xaxis_title="Correlated Predictor Variables",
    yaxis_title="Correlated Predictor Variables"
)

fig.show()

# Save the heatmap image to images folder
fig.write_image("../images/feature_correlation_heatmap.png")

In [11]:
%pip install -U kaleido

Note: you may need to restart the kernel to use updated packages.


### Interpretation of the Heatmap 

The heatmap above presents the ten numerical variables most strongly correlated with 'SalePrice', based on Pearson correlation coefficients.

Each axis displays the same set of predictor variables, allowing for a comparison of their linear relationships with one another. These variables include measures of size, quality, and construction timing that are known to influence residential property value.


**Axis Titles:**
- *Correlated Predictor Variables*: These are the top ten features with the strongest statistical association with 'SalePrice'.
- *Pearson Correlation Coefficient (colour scale)*: This quantifies the strength and direction of the linear relationship between each pair of variables. Values close to +1 indicate a strong positive correlation, while values near -1 indicate a strong negative correlation.

**Variable Description**
- 'OverallQual': Overall material and finish quality
- 'GrLivArea': Above-ground living area (sq ft)
- 'GarageArea': Garage size (sq ft)
- 'TotalBsmtSF': Total Basement area (sq ft)
- '1stFlrSF': First Floor area (sq ft)
- 'YearBuilt': Year the house was constructed
- 'YearRemodAdd': Year of the most recent remodeling
- 'MasVnrArea': Masonry veneer area (sq ft)
- 'GarageYrBlt': Year the garage was built
- 'BsmtFinSF1': Finished area of the basement (Type 1)

These features were selected because they exhibit the strongest correlation with house sale price.

**Colour Indicator (Key):**
- The **colour bar** to the right represents the **Pearson correlation coefficient**. It uses a diverging colour scale from Plotly's 'RdBu' family to visually distinguish the strength and direction of correlation between variables.
- **Dark red** values indicate **strong positive correlation** (closer to **+1.0**), meaning that as one variable increases, the other tends to increase as well.
- **Dark blue** values indicate **strong negative correlation** (closer to **-1.0**), meaning that as one variable increases, the other tends to decrease.
- **Lighter shades (closer to white)** around **0.0** indicate **weak or no linear correlation**, meaning changes in one variable have little predictive power over the other.

This gradient helps quickly identify:

- **Highly influential predictors** (deep red against 'SalePrice')
- **Potential multicollinearity** between features (deep red or blue among non-target variables)
- **Redundancy**, where multiple features are strongly correlated with each other (and can potentially be reduced during feature selection)

**Reading the Chart:**
- Each square shows how strongly two variables are linearly related.
- For example, a strong correlation between 'GrLivArea' and '1stFlrSF' indicates that as the first-floor area increases, total above-ground living area tends to increase, which makes intuitive sense.

This visual tool helps guide **feature selection**, allowing us to identify potential multicollinearity issues and focus on the most informative predictors for modelling.

The coefficient itself is discussed further below in "Understanding the Pearson Correlation Coefficient".
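Reading multicollinearity off a heatmap by eye can be complemented with a short check that lists predictor pairs whose absolute correlation exceeds a cut-off. This is a sketch only: the helper name `correlated_pairs`, the 0.8 threshold and the toy values are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """List each pair of numeric columns with |Pearson r| >= threshold."""
    corr = frame.corr(numeric_only=True)
    rows = []
    # combinations() visits every unordered pair once and skips the diagonal
    for a, b in combinations(corr.columns, 2):
        r = corr.loc[a, b]
        if abs(r) >= threshold:
            rows.append((a, b, r))
    return pd.DataFrame(rows, columns=["feature_a", "feature_b", "r"])


# Hypothetical values: two size measures that move together, one that does not
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "1stFlrSF": [900, 1400, 2100, 2400, 3100],
    "OverallCond": [5, 7, 4, 6, 5],
})

pairs = correlated_pairs(toy, threshold=0.8)
print(pairs)
```

Pairs returned by such a check are candidates for dropping or combining one of the two features; the threshold itself is a judgment call.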


## Explanation of Heatmap Features

The heatmap visualises the **top 10 numerical features** most strongly correlated with the target variable *SalePrice*. Each variable represents a property characteristic that potentially influences house price. The strength and direction of these correlations are represented by the colour intensity and hue on the heatmap.

Below is a brief explanation of each feature:

**OverallQual** : Rates the overall material and finish quality of the house (1-10 scale). This is the strongest predictor of sale price.

**GrLivArea** : Above-ground living area in square feet. A larger living area typically increases house value.

**GarageArea** : Size of the garage in square feet. Larger garages may indicate higher-end properties.

**TotalBsmtSF** : Total area of the basement in square feet. Larger basements may add to usable space and value.

**1stFlrSF** : Area of the first floor in square feet. A larger first floor is often associated with more expensive homes.

**YearBuilt** : The original construction year. Newer homes generally command higher prices due to better condition and more modern features.

**YearRemodAdd** : Year of the latest remodel or addition. More recent updates may improve the home's value.

**MasVnrArea** : Masonry veneer area in square feet (e.g., brick or stone). May reflect exterior quality.

**GarageYrBlt** : Year the garage was built. Usually matches or follows the house's construction year.

**BsmtFinSF1** : Finished square footage of the basement (type 1). Finished basements are often desirable living spaces.

These features were selected based on their Pearson correlation coefficients with 'SalePrice'. The correlation values provide insight into how strongly each variable is associated with house prices, with darker red indicating stronger positive correlations.

### Understanding the Pearson Correlation Coefficient

The Pearson correlation coefficient, commonly denoted as **r**, is a statistical measure we use to quantify the **strength and direction** of the linear relationship between two continuous numerical variables.

It ranges from **-1 to +1**, and is interpreted as follows:

- **r = +1**: Perfect positive linear correlation. As one variable increases, the other increases proportionally.
- **r = 0**: No linear correlation. There is no consistent linear relationship between the variables.
- **r = -1**: Perfect negative linear correlation. As one variable increases, the other decreases proportionally.

In this project, Pearson correlation is applied to measure the relationship between **numerical housing features** and the target variable, 'SalePrice'. Higher absolute values of **r** indicate stronger linear associations.

For example:
- 'OverallQual' has **r = 0.79**, suggesting a strong positive linear correlation with sale price.
- 'GrLivArea' (r = 0.71) and 'GarageArea' (r = 0.62) also show high positive correlations, meaning larger homes and garages are typically more expensive.

This method is valuable in **feature selection**, as it helps identify which variables are most strongly linearly associated with house price. It also highlights potential **multicollinearity** between predictors, which can influence model performance.
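To connect the definition to the numbers pandas reports, the sketch below computes r from its textbook form (the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations) and compares it with `Series.corr`. The function name `pearson_r` and the sample values are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y) -> float:
    """Pearson r computed directly from deviations around the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Hypothetical quality scores and sale prices
quality = [5, 6, 7, 8, 9]
price = [120000, 150000, 210000, 230000, 320000]

r_manual = pearson_r(quality, price)
r_pandas = pd.Series(quality).corr(pd.Series(price))

# Both routes should agree up to floating-point error
print(round(r_manual, 4), round(r_pandas, 4))
```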



In [12]:
%pip install scipy

Note: you may need to restart the kernel to use updated packages.


In [13]:
%pip install nbformat

Note: you may need to restart the kernel to use updated packages.


## Train Predictive Model 

To help Lydia Doe accurately estimate the sale prices of her inherited houses, we now move into the predictive modelling phase. This step uses the cleaned and processed dataset to train a machine learning model that can predict house sale prices based on their features.

Given the project's objective of supporting informed pricing decisions, we use a **Linear Regression** model as a baseline due to its interpretability and performance on structured datasets.

This phase includes:
- Splitting the dataset into training and testing sets
- Training the model on the training set
- Evaluating model performance on unseen data

By validating the model's predictive accuracy, we ensure the client receives reliable price estimates for decision-making purposes.

In [14]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import required libraries for modelling and evaluation 

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the cleaned dataset
df = pd.read_csv("../data/processed/cleaned_data.csv")

# Separate features (X) and target variable (y)
X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]

# Convert categorical variables to numeric using one-hot encoding
X = pd.get_dummies(X)

# Split the data into training and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialise and train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate model performance using standard regression metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse**0.5
r2 = r2_score(y_test, y_pred)

# Print evaluation results
print("Model Evaluation Metrics:")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R\u00b2): {r2:.2f}")


Model Evaluation Metrics:
Mean Absolute Error (MAE): 21286.05
Root Mean Squared Error (RMSE): 34262.84
R-squared (R²): 0.85


## Evaluate Model Performance 

To assess the performance of the predictive model, we use three key regression metrics:

- **Mean Absolute Error (MAE):** Measures the average absolute difference between predicted and actual values, expressed in US dollars ($). It provides a straightforward interpretation of the typical prediction error.

- **Root Mean Squared Error (RMSE):** Penalises larger errors more than MAE and is also measured in dollars. It is useful when large deviations from actual prices are particularly undesirable.

- **R<sup>2</sup> (R-squared):** Indicates the proportion of variance in the target variable ('SalePrice') that is explained by the model. This metric is unitless and typically lies between 0 and 1, with values closer to 1 representing stronger predictive performance. On unseen data it can fall below 0 if the model predicts worse than simply using the mean.

In this case, the model achieved:

- **MAE:** ~21,286.05 USD
- **RMSE:** ~34,262.84 USD
- **R<sup>2</sup>:** 0.85

These results suggest a strong model fit, with approximately 85% of the variance in test-set sale prices explained by the selected features. The MAE corresponds to roughly 12% of the dataset's mean sale price (about 180,921 USD). This provides the client with a reliable and interpretable foundation for estimating the market value of inherited properties.
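The three metrics can be reproduced by hand, which helps when interpreting them. The sketch below uses four hypothetical prices and predictions (not the project's data) and only NumPy; the `_toy` suffixes are there to avoid clashing with the notebook's own variables.

```python
import numpy as np

# Hypothetical actual and predicted sale prices in USD
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_hat = np.array([210000.0, 140000.0, 280000.0, 260000.0])

errors = y_true - y_hat

# MAE: mean of absolute errors
mae_toy = np.abs(errors).mean()

# RMSE: square root of the mean squared error
rmse_toy = np.sqrt((errors ** 2).mean())

# R-squared: 1 minus residual sum of squares over total sum of squares
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2_toy = 1 - ss_res / ss_tot

print(f"MAE={mae_toy:.2f}  RMSE={rmse_toy:.2f}  R2={r2_toy:.3f}")
```

RMSE is never smaller than MAE, and the gap widens when a few predictions are far off; here the single 20,000 USD miss is what pulls RMSE above MAE.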

In [16]:
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.read_csv("../data/processed/cleaned_data.csv")

X = pd.get_dummies(df.drop(columns=["SalePrice"]))
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Save the linear regression model (path relative to this notebook, like the data paths above)
joblib.dump(model, "../models/linear_regression_model.pkl")

# Save the feature column order used during training
joblib.dump(X_train.columns.tolist(), "../models/linear_regression_features.pkl")


## Predict Inherited House Prices 

With the trained Linear Regression model, we now generate sale price predictions for the four inherited properties identified by Lydia.

Each property has been prepared with the same feature structure as the training dataset, and predictions are generated using the fitted model.

These predicted prices provide Lydia with estimated market values for her inherited houses, enabling informed decisions about whether to sell, renovate, or hold the properties.

This phase of the project directly addresses Lydia's key question:

"How much are the inherited properties actually worth?"

In [17]:
# Prepare the inherited dataset for prediction
# Drop 'SalePrice' column if present and convert categorical variables to dummies
X_inherited = df_inherited_full.drop(columns=["SalePrice"], errors="ignore")
X_inherited = pd.get_dummies(X_inherited)

# Align inherited features to match training set (ensures same structures)
X_inherited = X_inherited.reindex(columns=X.columns, fill_value=0)

# Check for any missing values in the inherited data
print("Missing values in inherited dataset:")
print(X_inherited.isnull().sum().sort_values(ascending=False).head())

# Fill any remaining missing values with the median of each column
X_inherited = X_inherited.fillna(X_inherited.median(numeric_only=True))

# Predict sale prices using the trained model
predicted_prices = model.predict(X_inherited)

# Add predictions to the inherited DataFrame
df_inherited_full["Predicted_SalePrice"] = predicted_prices

# Label each inherited house clearly
df_inherited_full["House"] = [f"House {i+1}" for i in range(len(df_inherited_full))]

# Create a simpler display DataFrame
df_inherited_display = df_inherited_full[["House", "Predicted_SalePrice"]]

# Show the final labeled predictions
df_inherited_display.head()

Missing values in inherited dataset:
2ndFlrSF        1
BedroomAbvGr    1
1stFlrSF        0
BsmtFinSF1      0
BsmtUnfSF       0
dtype: int64


     House  Predicted_SalePrice
0  House 1        215669.762525
1  House 2        191117.459821
2  House 3        230019.818498
3  House 4        179302.774627
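The alignment step in the cell above matters because one-hot encoding only creates columns for categories that actually occur in the data being encoded. The sketch below shows the effect on two small hypothetical frames; `train`, `new`, `X_train_toy` and `X_new_toy` are illustrative names.

```python
import pandas as pd

# Hypothetical training data with three garage finishes
train = pd.DataFrame({
    "GarageFinish": ["RFn", "Unf", "Fin"],
    "GarageArea": [548, 460, 608],
})
# New data in which only one finish occurs
new = pd.DataFrame({
    "GarageFinish": ["Unf", "Unf"],
    "GarageArea": [642, 500],
})

X_train_toy = pd.get_dummies(train)
X_new_toy = pd.get_dummies(new)

# Encoding `new` on its own yields fewer columns than the training matrix
print(X_train_toy.columns.tolist())
print(X_new_toy.columns.tolist())

# reindex adds the missing dummy columns (filled with 0) and fixes the column order
aligned = X_new_toy.reindex(columns=X_train_toy.columns, fill_value=0)
print(aligned)
```

Without this step, a fitted scikit-learn model would reject the new matrix because its columns would not match those seen during training.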

## Exploratory Review of the Cleaned Dataset (cleaned_data.csv)

To ensure the integrity and reliability of our analysis we conduct a brief exploratory review of the cleaned dataset prior to developing the final dashboard. This step is essential for validating that the data preparation and cleaning processes were successful.

By generating summary statistics using df.describe(), we are able to:

- Confirm that all numerical features have been correctly parsed and are free from anomalies or unexpected values.
- Examine key distribution metrics (mean, median, standard deviation, min/max values) to better understand the characteristics of the dataset.
- Identify any residual inconsistencies or outliers that may warrant additional attention before visualisation or further modelling.

This quality assurance step strengthens the foundation of our predictive modelling and ensures that the final outputs, especially those displayed in the dashboard, are based on accurate and well-understood data.



In [18]:
# Load the cleaned dataset
df = pd.read_csv("../data/processed/cleaned_data.csv")

# Display summary statistics
df.describe()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtFinSF1,BsmtUnfSF,GarageArea,GarageYrBlt,GrLivArea,LotArea,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd,SalePrice
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,1162.626712,327.994521,2.878082,443.639726,567.240411,472.980137,1978.589041,1515.463699,10516.828082,103.117123,46.660274,5.575342,6.099315,1057.429452,1971.267808,1984.865753,180921.19589
std,386.587738,433.576171,0.792485,456.098091,441.866955,213.804841,23.997022,525.480383,9981.264932,180.731373,66.256028,1.112799,1.382997,438.705324,30.202904,20.645407,79442.502883
min,334.0,0.0,0.0,0.0,0.0,0.0,1900.0,334.0,1300.0,0.0,0.0,1.0,1.0,0.0,1872.0,1950.0,34900.0
25%,882.0,0.0,2.0,0.0,223.0,334.5,1962.0,1129.5,7553.5,0.0,0.0,5.0,5.0,795.75,1954.0,1967.0,129975.0
50%,1087.0,0.0,3.0,383.5,477.5,480.0,1980.0,1464.0,9478.5,0.0,25.0,5.0,6.0,991.5,1973.0,1994.0,163000.0
75%,1391.25,714.5,3.0,712.25,808.0,576.0,2001.0,1776.75,11601.5,164.25,68.0,6.0,7.0,1298.25,2000.0,2004.0,214000.0
max,4692.0,2065.0,8.0,5644.0,2336.0,1418.0,2010.0,5642.0,215245.0,1600.0,547.0,9.0,10.0,6110.0,2010.0,2010.0,755000.0
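Summary statistics hint at outliers (for example, a maximum far above the 75th percentile), but a simple rule makes the check explicit. The sketch below counts values lying more than 1.5 interquartile ranges outside the quartiles. This rule, the helper name `iqr_outlier_counts` and the toy values are assumptions for illustration, not a step the notebook performs.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)


# Hypothetical values with one unusually large lot
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 215245],
    "OverallQual": [5, 6, 7, 7, 8],
})

outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

Flagged values are not necessarily errors; very large lots do exist, so the counts are a prompt for inspection rather than a rule for deletion.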


### Identify Key Price Influencing Factors

Earlier in the project we performed a preliminary correlation analysis to support feature selection and guide our modelling decisions. 

In this section we present a formal version of that analysis, visualising and interpreting the key relationships in the data.

Using Pearson's correlation coefficient, we examine which numerical features have the strongest linear relationship with the target variable ('SalePrice').

This helps Lydia understand which aspects of a property, such as size, quality, or garage capacity, most strongly influence its market value.

The results support transparency in model interpretation and offer practical guidance for pricing and renovation decisions.

## Feature Correlation Analysis

To identify which features are most influential in predicting house sale prices, we compute the Pearson correlation coefficients between 'SalePrice' and all other numeric features in the dataset.

The chart below visualises the **Top 10 features** with the **strongest** positive correlations to 'SalePrice'.

- **X-axis**: Feature names from the dataset
- **Y-axis**: Pearson correlation coefficient (between 0 and 1)  
- **Bar height**: Indicates the strength of the linear relationship

### Feature Key:
1. **OverallQual** – Overall material and finish quality  
2. **GrLivArea** – Above grade (ground) living area (sq ft)  
3. **GarageArea** – Size of garage (sq ft)  
4. **TotalBsmtSF** – Total basement area (sq ft)  
5. **1stFlrSF** – First floor area (sq ft)  
6. **YearBuilt** – Year the house was originally built  
7. **YearRemodAdd** – Year the house was last remodeled  
8. **MasVnrArea** – Masonry veneer area (sq ft)  
9. **GarageYrBlt** – Year the garage was built  
10. **BsmtFinSF1** – Finished basement area (Type 1, sq ft)

These insights allow Lydia to better understand what contributes most to a home's value, supporting both pricing and renovation strategies.



In [19]:
import pandas as pd
import plotly.express as px

# Compute top 10 positively correlated features with SalePrice
correlations = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)
top_features = correlations[1:11]  # Skip SalePrice itself

# Convert to DataFrame for Plotly
top_features_df = top_features.reset_index()
top_features_df.columns = ['Feature', 'Correlation']

# Create Plotly bar chart
fig = px.bar(
    top_features_df,
    x='Feature',
    y='Correlation',
    text='Correlation',
    title='Top 10 Features Most Positively Correlated with Sale Price',
    labels={'Correlation': 'Pearson Correlation Coefficient', 'Feature': 'Feature Name'}
)

fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.update_layout(
    yaxis=dict(range=[0, 1]),
    xaxis_tickangle=-45,
    showlegend=False
)
# Shows the bar chart
fig.show()
# Save the bar chart to the images folder
fig.write_image("../images/top_features_bar_chart.png")


In [20]:
# Compute correlations with SalePrice
correlations = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)

# Display top 10 positively correlated features (excluding SalePrice itself)
top_features = correlations[1:11] # Skip index 0 which is SalePrice vs SalePrice
top_features 


OverallQual     0.790982
GrLivArea       0.708624
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
YearBuilt       0.522897
YearRemodAdd    0.507101
MasVnrArea      0.472614
GarageYrBlt     0.466754
BsmtFinSF1      0.386420
Name: SalePrice, dtype: float64

In [21]:
%pip install joblib


### Saving the Trained Linear Regression Model 

In an earlier step, we trained a Linear Regression model to predict house sale prices using structured features from the dataset. This model demonstrated strong predictive accuracy and is now ready for reuse.

To ensure reproducibility and portability, we save this trained model to disk in a serialized format (.pkl) using joblib. This allows the model to be easily reloaded in future workflows — for instance, in a dashboard or deployment environment — without the need to retrain it from scratch.

The model is saved to the models directory as:

models/linear_regression_model.pkl

Saving the model is an essential step for preserving work and integrating the predictive logic into production-ready tools.
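
As a minimal sketch of the save-and-reload round trip described above, the snippet below fits a small model on made-up data, writes it to a temporary directory, and checks that the reloaded copy gives the same predictions. The toy data, `toy_model`, and the file name `toy_model.pkl` are illustrative assumptions, not the notebook's actual model or path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data following y = 3x + 2 (not the housing data)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 3.0 * X_toy.ravel() + 2.0

toy_model = LinearRegression().fit(X_toy, y_toy)

# Write the fitted model to a temporary folder, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, model_path)
    reloaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same_predictions = bool(
    np.allclose(toy_model.predict(X_toy), reloaded_model.predict(X_toy))
)
print("Reloaded model reproduces predictions:", same_predictions)
```

Pickle-based files such as `.pkl` should only be loaded from trusted sources, and they are generally best reloaded with the same scikit-learn version that produced them.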

In [22]:
import joblib

# Save the trained model to the 'models' directory
joblib.dump(model, '../models/linear_regression_model.pkl')

['../models/linear_regression_model.pkl']

## Visualisations

This section presents a series of visualisations to support the analysis of house sale prices. These plots are intended to help Lydia, the primary stakeholder, better understand how various property features influence value. By examining trends, relationships, and distributions, these visual insights also serve as an accessible way for non-technical users to interpret the model’s input variables.

Each visualisation is paired with a brief explanation of its significance and relevance to the business case.

### Heatmap: Correlation Matrix of Top Predictive Features

The heatmap below visualises the **pairwise Pearson correlation coefficients** among the top 10 numerical features most strongly correlated with `SalePrice`.

This visualisation helps identify:
- How the features most linearly associated with sale price relate to one another (the matrix itself covers the predictors only, not `SalePrice`).
- Where multicollinearity may exist — such as when two or more predictors are highly correlated with one another.

**Key Observations:**
- `OverallQual`, `GrLivArea`, and `GarageArea` have the strongest positive linear correlations with `SalePrice` (0.79, 0.71, and 0.62 in the ranking above).
- `TotalBsmtSF`, `1stFlrSF`, and `YearBuilt` also show meaningful relationships (0.61, 0.61, and 0.52).
- Moderate correlations among predictors such as `GarageArea` and `MasVnrArea` suggest some overlap in the information they provide, which should be considered during model refinement.

This plot provides Lydia with a clearer understanding of which property features are most influential when estimating market value.
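
To make the multicollinearity check concrete, the sketch below lists predictor pairs whose absolute correlation exceeds a cut-off. The helper `highly_correlated_pairs`, the 0.8 threshold, and the toy values are assumptions chosen for illustration; a suitable threshold depends on the modelling context.

```python
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.8):
    """Return (feature_a, feature_b, r) for each predictor pair with |r| >= threshold."""
    corr = frame.corr(numeric_only=True)
    columns = list(corr.columns)
    pairs = []
    for i, first in enumerate(columns):
        # Only look at columns after `first` so each pair is reported once
        for second in columns[i + 1:]:
            r = corr.loc[first, second]
            if abs(r) >= threshold:
                pairs.append((first, second, round(float(r), 3)))
    return pairs

# Hypothetical toy values, not taken from the housing dataset
toy = pd.DataFrame({
    "TotalBsmtSF": [800, 950, 1100, 1300, 1500, 1700],
    "1stFlrSF": [820, 960, 1150, 1280, 1520, 1690],
    "YearBuilt": [1995, 1960, 2005, 1950, 1980, 1975],
})

pairs = highly_correlated_pairs(toy, threshold=0.8)
print(pairs)
```

In this toy sample only the basement and first-floor areas are flagged, which mirrors the kind of overlap that may warrant dropping or combining one of the two predictors.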

In [23]:
# Heatmap code
# Import Plotly Express for interactive visualisations
import plotly.express as px

# Compute correlations between numeric features and SalePrice, sorted in descending order
correlations = df.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)

# Extract the name of the top 10 features (excluding SalePrice itself)
top_corr_features = correlations[1:11].index
# Compute the correlation matrix for these top features
top_corr_matrix = df[top_corr_features].corr()

# Create an interactive heatmap using Plotly 
fig = px.imshow(
    top_corr_matrix, # The correlation matrix as data
    text_auto=True, # Show correlation values on the heat map 
    color_continuous_scale='RdBu', # Use a diverging colour scale (Red-Blue)
    title="Top 10 Features Correlated with SalePrice", # Chart Title
    labels=dict(color="Correlation Coefficient"), # Legend Title
    x=top_corr_features, # Set x axis feature names 
    y=top_corr_features # Set y axis feature names 
)

# Improve readability with axis titles 
fig.update_layout(
    xaxis_title="Correlated Predictor Variables",
    yaxis_title="Correlated Predictor Variables"
)
# Display heatmap in the notebook 
fig.show()
# Save the heatmap to the images folder 
fig.write_image("../images/feature_correlation_heatmap.png")

### **Box Plot: Sale Price Distribution by Overall Quality**

This visualisation presents a **box plot** showing how house sale prices vary according to the **Overall Quality (`OverallQual`)** rating. This rating is a numerical scale from **1 (Very Poor)** to **10 (Very Excellent)**, representing the overall material and finish quality of the house.

#### **Interpretation and Insights:**

- **Positive Correlation**: There is a **strong upward trend** in sale price as the overall quality rating increases. Higher-quality homes consistently sell for higher prices.
- **Greater Price Spread at Higher Ratings**: Quality ratings of 8, 9, and 10 show wider interquartile ranges and more outliers, indicating that premium homes may vary more in price due to other influencing features (e.g. location, luxury additions).
- **Mid-range Stability**: Houses with quality ratings between 5 and 7 demonstrate narrower price distributions, reflecting more uniform pricing in mid-market properties.
- **Presence of Outliers**: All quality levels include some outliers, particularly at the upper end, representing properties that may be overvalued or feature unique characteristics.

This chart highlights **OverallQual** as a critical predictor of sale price, validating its inclusion as a top feature in the regression model and supporting informed pricing decisions for inherited properties.
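
The quantities a box plot draws can also be computed directly. The sketch below derives the quartiles, the interquartile range (IQR), and the number of high outliers per quality rating using the common 1.5 × IQR rule. The helper `box_stats` and the toy prices are assumptions for illustration, and plotting libraries may use a slightly different quartile convention.

```python
import pandas as pd

# Hypothetical toy sample, not drawn from the housing dataset
toy = pd.DataFrame({
    "OverallQual": [5, 5, 5, 5, 5, 8, 8, 8, 8, 8],
    "SalePrice": [120_000, 125_000, 130_000, 135_000, 140_000,
                  220_000, 260_000, 300_000, 340_000, 520_000],
})

def box_stats(prices):
    """Quartiles, IQR, and count of points above the upper fence (Q3 + 1.5 * IQR)."""
    q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr
    return pd.Series({
        "Q1": q1,
        "Median": median,
        "Q3": q3,
        "IQR": iqr,
        "HighOutliers": int((prices > upper_fence).sum()),
    })

# One row of statistics per quality rating
summary = pd.DataFrame(
    {quality: box_stats(group) for quality, group in toy.groupby("OverallQual")["SalePrice"]}
).T
print(summary)
```

A wider IQR at a given rating corresponds to a taller box in the chart, and any count in `HighOutliers` corresponds to points drawn above the upper whisker.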

> **Saved as**: `images/boxplot_saleprice_overallqual.png`


In [24]:
# Import Plotly Express for visualisation
import plotly.express as px

# Create a box plot to show the distribution of Sale Price 
fig = px.box(
    df, # DataFrame containing the data
    x="OverallQual", # Categorical variable on the x axis (quality rating from 1 to 10)
    y="SalePrice", # Numerical variable on y axis (sale price in USD)
    color="OverallQual", # Color the boxes by quality rating for better visual distinction 
    title="Sale Price Distribution by Overall Quality", # Chart title
    labels={"OverallQual": "Overall Quality", "SalePrice": "Sale Price (USD)"} # Axis label for content 
)

fig.update_layout( # Update layout to enhance readability 
    xaxis_title="Overall Quality Rating", # x axis title
    yaxis_title="Sale Price (USD)", # y axis title
    showlegend=False # hide the legend, as the colours are self-explanatory
)

fig.show() # Display the plot in the notebook

# Save the figure to the images folder for reuse or reporting
fig.write_image("../images/boxplot_saleprice_overallqual.png")

### **Scatter Plot: Sale Price vs. Above Ground Living Area (`GrLivArea`)**

This visualisation shows a **scatter plot** of house sale prices against their **Above Ground Living Area (`GrLivArea`)**, measured in square feet. This feature represents the total finished living space above the basement.

#### **Interpretation and Insights:**

- **Positive Linear Relationship**: There is a clear **positive correlation** between `GrLivArea` and `SalePrice`; larger homes tend to fetch higher sale prices.
- **Non-Uniform Price Spread**: As the living area increases, the variability in sale prices also increases. Some large homes sell at relatively moderate prices, suggesting the influence of other factors like condition, location, or design.
- **Clusters and Outliers**:
  - Most homes fall within the 1,000–2,500 sq ft range.
  - A few extreme outliers exist, notably high-priced large houses, which may reflect luxury builds or rare features.

This chart reinforces **GrLivArea** as a key continuous feature influencing property value. It offers practical insight into the trade-off between size and price for potential buyers and for pricing inherited properties.
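
The linear trend in a scatter plot like this can be summarised by an ordinary least-squares line. The sketch below fits one with NumPy on made-up points and reports the slope, which reads as the estimated price change per additional square foot. The toy values are assumptions for illustration and are not taken from the housing dataset.

```python
import numpy as np

# Hypothetical toy points, not taken from the housing dataset
living_area = np.array([900, 1200, 1500, 1800, 2100, 2600], dtype=float)
sale_price = np.array([115_000, 148_000, 176_000, 214_000, 239_000, 305_000], dtype=float)

# Degree-1 least-squares fit: sale_price ≈ slope * living_area + intercept
slope, intercept = np.polyfit(living_area, sale_price, deg=1)

fitted = slope * living_area + intercept
residuals = sale_price - fitted

# Share of the variance in price explained by the fitted line
r_squared = 1 - (residuals ** 2).sum() / ((sale_price - sale_price.mean()) ** 2).sum()

print(f"slope: {slope:.2f} USD per sq ft, intercept: {intercept:.2f} USD, R²: {r_squared:.3f}")
```

Residuals that grow with living area would be consistent with the widening price spread noted above for larger homes.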

> **Saved as**: `images/scatter_saleprice_grlivarea.png`

In [25]:
# Import Plotly Express for visualisation
import plotly.express as px

# Scatter plot of Sale Price vs. Above Ground Living Area
fig = px.scatter(
    df, # DataFrame containing the cleaned dataset
    x="GrLivArea", # x axis: Above Ground Living Area (in sq.ft)
    y="SalePrice", # y axis: House Sale Price in USD.
    title="Sale Price vs. Above Ground Living Area (GrLivArea)", # Title for the Chart 
    labels={ # Labels for the axes x and y.
        "GrLivArea": "Above Ground Living Area (sq ft)",
        "SalePrice": "Sale Price (USD)"
    },
    trendline="ols" # Add a linear (OLS) trendline to highlight the correlation; requires the statsmodels package
)

# Update layout for clarity and readability
fig.update_layout(
    xaxis_title="Above Ground Living Area (sq ft)", # x axis label
    yaxis_title="Sale Price (USD)" # y axis label 
)

# Show plot
fig.show()

# Save plot to images folder
fig.write_image("../images/scatter_saleprice_grlivarea.png")

In [26]:
%pip install statsmodels


### Histogram: Distribution of Sale Prices

This visualisation presents a histogram showing the distribution of sale prices across all properties in the dataset. It is an essential step in exploratory data analysis (EDA), as it helps to:

- Give a clear picture of how house prices are spread across the dataset.
- Identify whether most houses are sold at low, average, or high prices.
- Show whether the pricing pattern is skewed or contains unusual values, such as extremely expensive houses.

#### **Insights:**
- The distribution is **right-skewed**, with a large number of properties priced below \$250,000.
- There are fewer high-end properties, with sale prices decreasing in frequency as values rise.
- This skew may impact model performance and could suggest using a **log transformation** of `SalePrice` for regression.
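
To illustrate the suggested log transformation, the sketch below measures skewness before and after applying `np.log1p` to a small set of made-up, right-skewed prices. The helper `skewness` and the toy values are assumptions for illustration only.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score (positive means a longer right tail)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float((z ** 3).mean())

# Hypothetical right-skewed prices, not taken from the housing dataset
prices = np.array([95_000, 120_000, 135_000, 150_000, 160_000,
                   175_000, 190_000, 230_000, 320_000, 610_000], dtype=float)

# log1p computes log(1 + x), compressing the long right tail
log_prices = np.log1p(prices)

skew_raw = skewness(prices)
skew_log = skewness(log_prices)

# expm1 inverts log1p, so values on the log scale can be mapped back to USD
restored = np.expm1(log_prices)

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

If a model is trained on the transformed target, its predictions need to be passed through `np.expm1` before being reported as prices.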

> Chart saved as: `images/histogram_saleprice.png`

In [30]:
# Annotated Code Block: Histogram of Sale Prices

import pandas as pd
import plotly.express as px

# Load the dataset
df = pd.read_csv('../data/processed/cleaned_data.csv')

# Create histogram to show distribution of Sale Prices
fig = px.histogram(
    df,
    x='SalePrice',                          # X-axis: sale prices
    nbins=40,                               # Controls granularity
    title='Distribution of House Sale Prices',
    labels={'SalePrice': 'Sale Price (USD)'}, 
    opacity=0.75,                           # Transparency level
    color_discrete_sequence=['blue']       # Visual style
)

# Layout improvements
fig.update_layout(
    xaxis_title='Sale Price (USD)',
    yaxis_title='Number of Properties',
    bargap=0.05,                            # Bar spacing
    template='plotly_white'
)

# Save to images folder
fig.write_image("../images/histogram_saleprice.png")

# Show chart
fig.show()

### Clarification on Time Period Represented

The histogram above displays the **distribution of house sale prices** across the entire dataset. While the dataset contains properties built and remodeled between **1872 and 2010**, this chart is **not time-based**. 

Instead, the histogram provides a **cross-sectional view** — it groups all sale prices together to show how frequently certain price ranges occur, **regardless of the year** the property was built or sold.

This helps Lydia and other users quickly assess typical property values and identify price concentration and outliers, without being affected by the timeline.

### Bar Chart: Average Sale Price by Decade Built

This bar chart summarises the average house sale price for each decade of construction, which smooths out year-to-year noise. This helps Lydia and other stakeholders:

- Compare how sale prices differ **by construction era**.
- Observe whether **newer properties consistently command higher prices**.
- Detect price trends that may correlate with economic or construction quality shifts.

#### **Key Insights:**
- Homes built after **2000** have the **highest average sale prices**.
- Properties from the **early 20th century** show more modest pricing, likely due to age and modernisation gaps.
- The chart supports clearer interpretation by reducing yearly noise.
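
The decade grouping relies on integer division: dividing the build year by 10, discarding the remainder, and multiplying by 10 again maps every year to the start of its decade. The sketch below shows this on a few made-up rows and also counts the sales per decade, since an average based on very few homes should be read with caution. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the housing dataset
toy = pd.DataFrame({
    "YearBuilt": [1872, 1925, 1929, 1999, 2000, 2009],
    "SalePrice": [90_000, 110_000, 130_000, 180_000, 240_000, 260_000],
})

# Floor each year to the first year of its decade, e.g. 1929 -> 1920
toy["DecadeBuilt"] = (toy["YearBuilt"] // 10) * 10

# Average price and number of homes per decade
by_decade = toy.groupby("DecadeBuilt")["SalePrice"].agg(["mean", "count"])
print(by_decade)
```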

> Chart saved as: `images/bar_avg_saleprice_decadebuilt.png`

In [33]:
# Import required library
import plotly.express as px

# Step 1: Group by decade for a clearer trend view
df["DecadeBuilt"] = (df["YearBuilt"] // 10) * 10
avg_price_by_decade = df.groupby("DecadeBuilt")["SalePrice"].mean().reset_index()

# Step 2: Create a cleaner, readable bar chart
fig = px.bar(
    avg_price_by_decade,
    x="DecadeBuilt",
    y="SalePrice",
    title="Average Sale Price by Decade Built",
    labels={
        "DecadeBuilt": "Decade Built",
        "SalePrice": "Average Sale Price (USD)"
    },
    hover_data={"DecadeBuilt": True, "SalePrice": ':.2f'},  # formatted hover
    text_auto=True,
    template="plotly_white"
)

# Step 3: Tidy up layout
fig.update_layout(
    xaxis=dict(tickmode='linear', tick0=1870, dtick=10),
    yaxis_title="Average Sale Price (USD)",
    xaxis_title="Decade Built",
    title_font_size=18
)

# Step 4: Show chart
fig.show()

# Step 5: Save chart to images folder
fig.write_image("../images/bar_avg_saleprice_decadebuilt.png")

## Training a Second Model: Random Forest Regressor

To complement the earlier linear regression model, we trained a **Random Forest Regressor**, which is a non-linear ensemble model capable of capturing complex interactions between features.

This model uses the top 10 most correlated predictors from our earlier analysis to estimate house sale prices. We split the data into 80% training and 20% testing sets to evaluate generalisation.

**Model Evaluation Metrics:**
- **MAE (Mean Absolute Error)**: Measures average prediction error in dollars.
- **RMSE (Root Mean Squared Error)**: Penalises large errors more heavily.
- **R² (Coefficient of Determination)**: Indicates how much variance is explained by the model.
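
The three metrics listed above can be written out directly with NumPy, which makes their definitions explicit. The helper `regression_metrics` and the toy values are assumptions for illustration; in practice the equivalent functions in `sklearn.metrics` would normally be used.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R² from actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.abs(errors).mean()                 # average absolute error
    rmse = np.sqrt((errors ** 2).mean())        # square root of the mean squared error
    ss_res = (errors ** 2).sum()                # unexplained variation
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total variation
    r2 = 1 - ss_res / ss_tot
    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}

# Hypothetical actual and predicted prices in USD
y_true_toy = [100_000, 150_000, 200_000, 250_000]
y_pred_toy = [110_000, 140_000, 210_000, 220_000]

metrics = regression_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

Because RMSE is the square root of the mean squared error, it is expressed in the same unit as MAE (dollars) and is never smaller than MAE.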

The trained model is saved as `random_forest_model.pkl` in the `models` directory and will be compared with the linear regression model in the final analysis.

In [38]:
# Import necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib
import numpy as np

#  Define features and target variable
features = [
    'OverallQual', 'GrLivArea', 'GarageArea', 'TotalBsmtSF', '1stFlrSF',
    'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'GarageYrBlt', 'BsmtFinSF1'
]
X = df[features]
y = df['SalePrice']

#  Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred_rf)
# mean_squared_error returns the MSE, so take the square root to get the RMSE in USD
rmse = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2 = r2_score(y_test, y_pred_rf)

print(f"Random Forest MAE: {mae:.2f}")
print(f"Random Forest RMSE: {rmse:.2f}")
print(f"Random Forest R²: {r2:.4f}")

# Save the trained model
joblib.dump(rf_model, '../models/random_forest_model.pkl')

Random Forest MAE: 19014.93
Random Forest RMSE: 30640.67
Random Forest R²: 0.8776


['../models/random_forest_model.pkl']

## Model Comparison and Justification 

To improve prediction accuracy, a second model (**Random Forest**) was trained and compared with the first model (**Linear Regression**).

- **Linear Regression** is a simple model that assumes a straight-line relationship between features and house prices.
- **Random Forest** is a more advanced model that uses many decision trees to capture complex patterns in the data.

### Why use two models?

- **Linear Regression** is easy to understand and useful for explaining how different features relate to price.
- **Random Forest** handles more complex patterns and often makes more accurate predictions.

### How do they compare?

- The **Random Forest model** had a lower error and a higher R² score than the Linear Regression model.
- This means Random Forest was better at predicting sale prices.

Using both models helps us balance **interpretability** and **accuracy**, and provides Lydia with a stronger pricing tool.

## Using the Trained Models 

This section explains how the saved models can be reused to make predictions on new housing data.

Both models - Linear Regression and Random Forest - have been saved in the `models/` directory. These can be reloaded using the `joblib` library in Python, allowing predictions to be made without retraining the models.

The following Python code can be pasted into a new cell in any notebook:

`import joblib`

Load the saved model:
`model = joblib.load('../models/linear_regression_model.pkl')`

Use the model for predictions:
`predictions = model.predict(X_new)  # assuming X_new holds new data with the same feature columns used in training`

This is particularly useful for:
- Building an interactive dashboard where users can input house features and get price predictions.
- Running automated predictions on new or incoming datasets.

### Predicting House Sale Price from User Input

This section demonstrates how to use the trained machine learning model to predict the sale price of a house based on specific property features provided by a user. The model used here has been previously trained and saved using the `joblib` library. Here's how it works:

- **Model Loading**: The saved model file (`random_forest_model.pkl` or `linear_regression_model.pkl`) is loaded from disk using `joblib.load()`.
- **Input Preparation**: A new property profile is defined using a `pandas` DataFrame. It must match the format and feature names used during model training.
- **Feature Examples**:
  - `OverallQual`: Quality rating from 1 (Very Poor) to 10 (Very Excellent).
  - `GrLivArea`: Above ground living area (in square feet).
  - `GarageArea`: Size of the garage (in square feet).
  - `TotalBsmtSF`, `1stFlrSF`: Basement and 1st floor area.
  - Other features capture remodel year, build year, and basement finish details.

- **Prediction**: The model generates a predicted sale price for the house based on the input features.

This allows users — such as potential sellers or property investors — to estimate property value dynamically based on changing input parameters.
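
Because the input must match the feature names used during training, a small guard can catch missing columns before `predict` is called. The sketch below is one possible approach: `build_model_input` is a hypothetical helper, the `Notes` key is a made-up extra field, and the feature list is copied from the training cell in this notebook.

```python
import pandas as pd

# Feature order copied from the training cell in this notebook
TRAINING_FEATURES = (
    'OverallQual', 'GrLivArea', 'GarageArea', 'TotalBsmtSF', '1stFlrSF',
    'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'GarageYrBlt', 'BsmtFinSF1',
)

def build_model_input(house, feature_order=TRAINING_FEATURES):
    """Hypothetical helper: turn one house (a dict) into a single-row DataFrame
    whose columns match the training features in name and order."""
    missing = [name for name in feature_order if name not in house]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Selecting by the ordered list drops unexpected keys and fixes the column order
    return pd.DataFrame([house])[list(feature_order)]

# Keys deliberately out of order, plus one field the model was not trained on
house = {
    'GrLivArea': 1800, 'OverallQual': 7, 'GarageArea': 400, 'TotalBsmtSF': 1000,
    '1stFlrSF': 1200, 'YearBuilt': 2005, 'YearRemodAdd': 2007, 'MasVnrArea': 100,
    'GarageYrBlt': 2005, 'BsmtFinSF1': 500, 'Notes': 'corner lot',
}

model_input = build_model_input(house)
print(model_input.columns.tolist())
```

The resulting single-row DataFrame can then be passed to a loaded model's `predict` method.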

Please view the example below:

In [43]:
# Import libraries
import joblib
import pandas as pd

# Load a saved model (choose either one)
# model = joblib.load('../models/linear_regression_model.pkl')
model = joblib.load('../models/random_forest_model.pkl')

# Prepare new input data (make sure columns match training features)
new_data = pd.DataFrame({
    'OverallQual': [7],        # Rating scale from 1 (Very Poor) to 10 (Very Excellent)
    'GrLivArea': [1800],       # Above ground living area in square feet
    'GarageArea': [400],       # Size of the garage in square feet
    'TotalBsmtSF': [1000],     # Total basement area in square feet
    '1stFlrSF': [1200],        # First floor area in square feet
    'YearBuilt': [2005],       # Year the house was originally built
    'YearRemodAdd': [2007],    # Year of the most recent remodel or addition
    'MasVnrArea': [100],       # Masonry veneer area in square feet
    'GarageYrBlt': [2005],     # Year the garage was built
    'BsmtFinSF1': [500]        # Finished square feet of basement (Type 1)
})

# Predict sale price
predicted_price = model.predict(new_data)
print(f"Predicted Sale Price: ${predicted_price[0]:,.2f}")

Predicted Sale Price: $206,869.88


In [44]:
print(df.columns.tolist())

['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinType1', 'BsmtUnfSF', 'GarageArea', 'GarageFinish', 'GarageYrBlt', 'GrLivArea', 'KitchenQual', 'LotArea', 'MasVnrArea', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd', 'SalePrice', 'DecadeBuilt']
