# Heritage Housing Data Cleaning

## Objectives
- Load the raw Ames Housing dataset
- Clean missing values and drop irrelevant features
- Prepare the data for analysis and modeling

## Inputs
- 'data/raw/house_prices_records.csv' - the unprocessed, original dataset
- 'data/raw/inherited_houses.csv' - a supplementary dataset with inherited homes
- 'data/processed/cleaned_data.csv' - the final cleaned dataset used for modeling (written by this notebook and reloaded in later steps)

## Outputs
- Cleaned dataset with no missing values
- Ready-to-use data for exploratory and predictive analysis 

## Additional Comments

* Inherited homes may exhibit different trends, so we may later integrate and analyse them separately.


## Load and Inspect the Raw Housing Dataset
Before performing analysis or data cleaning, it is essential to assess the completeness of the dataset. Missing values can bias results or reduce the quality of predictions if left unaddressed.

In this step, we load the Ames Housing dataset from the raw source. This dataset includes all recorded residential property transactions. By inspecting the structure and contents of the data, we aim to identify columns that may require cleaning or special treatment in later stages.

In [9]:
# Import required libraries 
import pandas as pd 

# Load the raw housing dataset
df = pd.read_csv("../data/raw/house_prices_records.csv")

# Display the first few rows
df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000
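
Beyond `df.head()`, a short structural summary can make problem columns easier to spot. The sketch below is an illustration only: it uses a small hand-made DataFrame whose values are copied from the preview above rather than loaded from the raw file, and the helper name `summarise_columns` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: a few columns copied from the preview above (illustration only)
sample = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920, 961, 1145],
    "2ndFlrSF": [854.0, 0.0, 866.0, np.nan, np.nan],
    "BsmtExposure": ["No", "Gd", "Mn", "No", "Av"],
    "EnclosedPorch": [0.0, np.nan, 0.0, np.nan, 0.0],
})


def summarise_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
    })


summary = summarise_columns(sample)
print(summary)
```

Columns with a non-zero `n_missing` are the candidates for the cleaning steps later in the notebook.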


### Filter Dataset to Inherited Houses

To focus the analysis on the inherited properties, we need to isolate the relevant records from the full dataset (house_prices_records.csv). The subset of interest is provided in a separate file (inherited_houses.csv), which contains rows corresponding to the inherited properties.

Since the inherited subset does not include a unique identifier (e.g., Id), we adopt a row-wise alignment strategy:

- A temporary index column (row_id) is assigned to both the full dataset (house_prices_records.csv) and the inherited subset (inherited_houses.csv)

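- Using these indices, we filter the full dataset to retain only the rows that match the inherited subset.

- The temporary index column is removed after filtering so that the original column structure is preserved.

This approach produces a DataFrame (df_inherited_full) with one row per inherited house, which is used for further data cleaning and exploratory analysis. Because the match is purely positional, it keeps the first rows of the full dataset (as many as there are rows in inherited_houses.csv), so it is only valid if the inherited houses occupy those positions in house_prices_records.csv.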

In [5]:
import pandas as pd 

# Load the full house dataset
df_all = pd.read_csv("../data/raw/house_prices_records.csv")

# Load the list of inherited houses
df_inherited = pd.read_csv("../data/raw/inherited_houses.csv")

# Add a row index to both datasets
df_all["row_id"] = df_all.index
df_inherited["row_id"] = df_inherited.index

# Use the row_id to filter matching rows
df_inherited_full = df_all[df_all["row_id"].isin(df_inherited["row_id"])]

# Drop the temporary row_id
df_inherited_full = df_inherited_full.drop(columns=["row_id"])

# Confirm shape and preview
print("Inherited dataset shape", df_inherited_full.shape)
df_inherited_full.head()




Inherited dataset shape (4, 24)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
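
To make the behaviour of the row-wise alignment explicit, the sketch below reproduces it on toy data and compares it with a content-based match. The frames, the column names `area`, `quality` and `price`, and the use of an inner merge on shared columns are all assumptions made for illustration. Whether a content-based match suits the real files depends on whether the inherited rows share exact feature values with the full dataset.

```python
import pandas as pd

# Toy stand-ins for the two files (hypothetical values)
full = pd.DataFrame({
    "area": [856, 1262, 920, 961, 1145],
    "quality": [7, 6, 7, 7, 8],
    "price": [208500, 181500, 223500, 140000, 250000],
})
subset = pd.DataFrame({"area": [920, 1145], "quality": [7, 8]})

# Positional match, as in the cell above: keeps the rows whose index
# also exists in the subset, i.e. the first len(subset) rows.
positional = full[full.index.isin(subset.index)]

# Content-based match: keep the rows of `full` whose shared feature
# values also appear in `subset`.
shared = [c for c in subset.columns if c in full.columns]
by_content = full.merge(subset[shared].drop_duplicates(), on=shared, how="inner")

print(positional)
print(by_content)
```

On this toy data the two strategies return different rows, which is why the positional approach relies on the ordering of the source files.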


## Assess Missing Data in Inherited Houses Dataset

As part of our data cleaning process, we evaluate the filtered dataset containing only the inherited houses to identify any missing values.

Missing data can lead to biased predictions if not handled properly. By identifying which variables have null values, we can determine appropriate strategies (e.g. imputation or removal) in the following steps.

The output below lists all features in the inherited dataset that contain one or more missing values, sorted by the number of missing entries. This is crucial before performing any statistical analysis or model building.

In [6]:
# Show missing values in the inherited houses dataset
missing = df_inherited_full.isnull().sum()
missing[missing > 0].sort_values(ascending=False)

WoodDeckSF       3
EnclosedPorch    2
BedroomAbvGr     1
2ndFlrSF         1
dtype: int64
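
Raw counts are easier to judge next to the share of rows they represent. The following sketch is a toy example with the same missing pattern as the four rows previewed above, not the real dataset; it reports both the count and the percentage of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy frame with the same missing pattern as the four rows shown above
toy = pd.DataFrame({
    "WoodDeckSF": [0.0, np.nan, np.nan, np.nan],
    "EnclosedPorch": [0.0, np.nan, 0.0, np.nan],
    "BedroomAbvGr": [3.0, 3.0, 3.0, np.nan],
    "1stFlrSF": [856, 1262, 920, 961],
})

report = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})

# Keep only columns that actually have gaps, most affected first
report = report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(report)
```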

### Clean Missing Values 

To ensure our dataset is suitable for analysis and machine learning, we address the issue of missing values.

**Approach**

- **Drop columns** like 'EnclosedPorch', 'WoodDeckSF', and 'LotFrontage' that have too many missing entries to justify imputation.
- For **categorical variables** (e.g., 'GarageFinish', 'BsmtFinType1'), we fill missing values with '"None"' to indicate the absence of a feature.
- For **numerical variables** (e.g., 'BedroomAbvGr', 'GarageYrBlt'), we use the **median** value of the column for imputation. The median is a robust measure that limits the influence of outliers.

This process ensures the cleaned dataset maintains its integrity, prevents model bias and supports consistent training without runtime errors.

In [4]:
# Drop columns with too many missing values 
df = df.drop(columns=["EnclosedPorch", "WoodDeckSF", "LotFrontage"])

# Fill missing values for categorical columns with 'None'
df["GarageFinish"] = df["GarageFinish"].fillna("None")
df["BsmtFinType1"] = df["BsmtFinType1"].fillna("None")

# Fill missing values for numerical columns with the median value
df["BedroomAbvGr"] = df["BedroomAbvGr"].fillna(df["BedroomAbvGr"].median())
df["GarageYrBlt"] = df["GarageYrBlt"].fillna(df["GarageYrBlt"].median())

# Check remaining missing values in the dataset
df.isnull().sum().sort_values(ascending=False).head(10)

2ndFlrSF        86
BsmtExposure    38
MasVnrArea       8
BedroomAbvGr     0
1stFlrSF         0
BsmtFinSF1       0
BsmtFinType1     0
GarageArea       0
BsmtUnfSF        0
GarageYrBlt      0
dtype: int64

### Handle Remaining Missing Values 

To ensure the dataset is fully complete and safe for modeling, we address the remaining missing values using tailored imputation strategies:

- **Numerical columns** are filled with the **median**, which is robust to outliers and helps maintain data integrity.

- **Categorical columns** (like 'BsmtExposure') are filled with "None" to preserve the structure of the data without introducing bias.

This step is critical to prevent issues during model training and ensures that all the columns in the dataset are ready for further processing and analysis.

In [5]:
# Fill missing numerical columns with the median
df["2ndFlrSF"] = df["2ndFlrSF"].fillna(df["2ndFlrSF"].median())
df["1stFlrSF"] = df["1stFlrSF"].fillna(df["1stFlrSF"].median())
df["MasVnrArea"] = df["MasVnrArea"].fillna(df["MasVnrArea"].median())
df["GarageArea"] = df["GarageArea"].fillna(df["GarageArea"].median())
df["BsmtFinSF1"] = df["BsmtFinSF1"].fillna(df["BsmtFinSF1"].median())

# Fill the missing categorical column with "None"
df["BsmtExposure"] = df["BsmtExposure"].fillna("None")

# Final check for any remaining missing values
df.isnull().sum().sort_values(ascending=False).head(10)

1stFlrSF        0
2ndFlrSF        0
BedroomAbvGr    0
BsmtExposure    0
BsmtFinSF1      0
BsmtFinType1    0
BsmtUnfSF       0
GarageArea      0
GarageFinish    0
GarageYrBlt     0
dtype: int64
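
The two cleaning cells above fill columns one at a time. As a rough sketch, the same rule (median for numeric columns, "None" for everything else) could be applied by dtype in a single pass. The helper name `impute_by_dtype` and the toy values are hypothetical, and this is not the code that produced the outputs above.

```python
import numpy as np
import pandas as pd


def impute_by_dtype(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with their median and all other columns with 'None'."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("None")
    return out


# Toy data with one gap per column (hypothetical values)
toy = pd.DataFrame({
    "2ndFlrSF": [854.0, 0.0, 866.0, np.nan],
    "MasVnrArea": [196.0, np.nan, 162.0, 0.0],
    "BsmtExposure": ["No", None, "Mn", "Gd"],
})

cleaned = impute_by_dtype(toy)
print(cleaned)
```

Working on a copy leaves the input frame untouched, which makes it easy to compare the data before and after imputation.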

### Save and Reload Cleaned Dataset

After completing the data cleaning process, we save the refined dataset to a new CSV file within the 'processed' folder. This allows for consistent reuse of the cleaned data in the later stages without repeating the cleaning steps.

We set 'index=False' to prevent pandas from writing the index as an additional column, preserving the structure of the original data.

We then reload the cleaned dataset to initiate the EDA phase. Previewing the dataset here helps confirm that the cleaning operation was successful and allows us to start identifying patterns and relationships for modelling. 

In [13]:
# Save the cleaned dataset to a new CSV file in the processed folder
# We use index=False to avoid saving the row numbers as an extra column

df.to_csv("../data/processed/cleaned_data.csv", index=False)

import pandas as pd

# Load the cleaned dataset
df = pd.read_csv("../data/processed/cleaned_data.csv")

# Preview the first few rows of the dataset
df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,KitchenQual,LotArea,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,548,RFn,2003.0,...,Gd,8450,196.0,61,5,7,856,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,460,RFn,1976.0,...,TA,9600,0.0,0,8,6,1262,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,608,RFn,2001.0,...,Gd,11250,162.0,42,5,7,920,2001,2002,223500
3,961,0.0,3.0,No,216,ALQ,540,642,Unf,1998.0,...,Gd,9550,0.0,35,5,7,756,1915,1970,140000
4,1145,0.0,4.0,Av,655,GLQ,490,836,RFn,2000.0,...,Gd,14260,350.0,84,5,8,1145,2000,2000,250000
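
The effect of 'index=False' can be checked with a small round trip. The sketch below writes a toy DataFrame to in-memory buffers instead of files (an assumption made to keep the example self-contained) and compares the columns that come back.

```python
import io

import pandas as pd

toy = pd.DataFrame({"1stFlrSF": [856, 1262], "SalePrice": [208500, 181500]})

# Write to in-memory buffers instead of files (illustration only)
with_index = io.StringIO()
without_index = io.StringIO()
toy.to_csv(with_index)                  # default: the index becomes the first column
toy.to_csv(without_index, index=False)  # index is not written

with_index.seek(0)
without_index.seek(0)
reloaded_with = pd.read_csv(with_index)
reloaded_without = pd.read_csv(without_index)

print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```

A CSV saved with its index reloads with an extra 'Unnamed: 0' column, which is one possible explanation for the column of that name in the previews above.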


## Correlation Analysis

To understand which numerical features most significantly influence house prices, we compute the Pearson correlation coefficient between each numerical feature and the target variable, 'SalePrice'. This statistical method measures the strength and direction of the linear relationship between variables. Features with high absolute correlation values (positive or negative) are considered more relevant for predictive modelling. We visualise the top 10 most strongly correlated features using a heatmap, which facilitates the identification of patterns among predictors. This supports informed feature selection and model optimisation.

In [2]:
# Import required libraries
import pandas as pd
from scipy.stats import skew

# Load the cleaned dataset
df = pd.read_csv('../data/processed/cleaned_data.csv')

# Select only numerical features from the dataset
numerical_df = df.select_dtypes(include='number')

# Compute the Pearson correlation matrix
corr_matrix = numerical_df.corr(numeric_only=True)

# Sort the correlation values in relation to the target variable 'SalePrice'
saleprice_corr = corr_matrix['SalePrice'].sort_values(ascending=False)

# Display the sorted correlation values
saleprice_corr




SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
YearBuilt       0.522897
YearRemodAdd    0.507101
MasVnrArea      0.472614
GarageYrBlt     0.466754
BsmtFinSF1      0.386420
OpenPorchSF     0.315856
2ndFlrSF        0.312479
LotArea         0.263843
BsmtUnfSF       0.214479
BedroomAbvGr    0.155784
OverallCond    -0.077856
Name: SalePrice, dtype: float64

In [2]:
%pip install plotly


Collecting plotly
  Downloading plotly-6.0.1-py3-none-any.whl.metadata (6.7 kB)
Collecting narwhals>=1.15.1 (from plotly)
  Downloading narwhals-1.39.0-py3-none-any.whl.metadata (11 kB)
Downloading plotly-6.0.1-py3-none-any.whl (14.8 MB)
Downloading narwhals-1.39.0-py3-none-any.whl (339 kB)
Installing collected packages: narwhals, plotly
Successfully installed narwhals-1.39.0 plotly-6.0.1
Note: you may need to restart the kernel to use updated packages.


In [8]:
# Import plotly for visualisation
import plotly.express as px

# Get top 10 features most correlated with SalePrice (excluding SalePrice itself)
top_corr_features = saleprice_corr.iloc[1:11].index

# Compute correlation matrix for these features 
top_corr_matrix = df[top_corr_features].corr()

# Plot an interactive heatmap using Plotly
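fig = px.imshow(
    top_corr_matrix,
    text_auto=True,
    # Reversed scale so that red marks positive and blue marks negative values
    color_continuous_scale='RdBu_r',
    # Fix the colour range so that white corresponds to zero correlation
    zmin=-1,
    zmax=1,
    title="Top 10 Features Correlated with SalePrice",
    labels=dict(color='Correlation Coefficient'),
    x=top_corr_features,
    y=top_corr_features
)

# Update axis titles manually for context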
fig.update_layout(
    xaxis_title="Correlated Predictor Variables",
    yaxis_title="Correlated Predictor Variables"
)

fig.show()

### Interpretation of the Heatmap 

The heatmap above presents the ten numerical variables most strongly correlated with 'SalePrice', based on Pearson correlation coefficients.

Each axis displays the same set of predictor variables, allowing for a comparison of their linear relationships with one another. These variables include measures of size, quality, and construction timing that are known to influence residential property value.


**Axis Titles:**
- *Correlated Predictor Variables*: These are the top ten features with the strongest statistical association with 'SalePrice'.
- *Pearson Correlation Coefficient (colour scale)*: This quantifies the strength and direction of the linear relationship between each pair of variables. Values close to +1 indicate a strong positive correlation, while values near -1 indicate a strong negative correlation.

**Variable Description**
- 'OverallQual': Overall material and finish quality
- 'GrLivArea': Above-ground living area (sq ft)
- 'GarageArea': Garage size (sq ft)
- 'TotalBsmtSF': Total Basement area (sq ft)
- '1stFlrSF': First Floor area (sq ft)
- 'YearBuilt': Year the house was constructed
- 'YearRemodAdd': Year of the most recent remodeling
- 'MasVnrArea': Masonry veneer area (sq ft)
- 'GarageYrBlt': Year the garage was built
- 'BsmtFinSF1': Finished area of the basement (Type 1)

These features were selected because they exhibit the strongest correlation with house sale price.

**Colour Indicator (Key):**
- The **colour bar** to the right represents the **Pearson correlation coefficient**. It uses a diverging red-blue colour scale to visually distinguish the strength and direction of correlation between variables.
- **Dark red** values indicate **strong positive correlation** (closer to **+1.0**), meaning that as one variable increases, the other tends to increase as well.
- **Dark blue** values indicate **strong negative correlation** (closer to **-1.0**), meaning that as one variable increases, the other tends to decrease.
- **Lighter shades (closer to white)** around **0.0** indicate **weak or no linear correlation**, meaning changes in one variable have little predictive power over the other.

This gradient helps quickly identify:

- **Highly influential predictors** (deep red against 'SalePrice')
- **Potential multicollinearity** between features (deep red or blue among non-target variables)
- **Redundancy**, where multiple features are strongly correlated with each other (and can potentially be reduced in feature selection)

**Reading the Chart:**
- Each square shows how strongly two variables are linearly related.
- For example, a strong correlation between 'GrLivArea' and '1stFlrSF' indicates that as the first-floor area increases, total living area tends to increase, which makes intuitive sense.

This visual tool helps guide **feature selection**, allowing us to identify potential multicollinearity issues and focus on the most informative predictors for modelling.
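
Reading pairs off a heatmap by eye can be complemented with a simple filter. The sketch below lists predictor pairs whose absolute correlation reaches a threshold. The helper name `correlated_pairs`, the 0.8 cut-off and the toy values are assumptions for illustration, not part of the analysis above.

```python
import pandas as pd

# Toy data (hypothetical values): 'a' and 'b' move together, 'c' does not
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})


def correlated_pairs(frame, threshold=0.8):
    """List (col_1, col_2, r) for column pairs with |r| at or above threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    # Visit each unordered pair once by scanning the upper triangle
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


pairs = correlated_pairs(toy)
print(pairs)
```

Applied to a matrix such as `top_corr_matrix`, a list like this could point to features that carry overlapping information; the appropriate threshold is a judgment call.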

The coefficient itself is discussed further below in "Understanding the Pearson Correlation Coefficient".


## Explanation of Heatmap Features

The heatmap visualises the **top 10 numerical features** most strongly correlated with the target variable *SalePrice*. Each variable represents a property characteristic that potentially influences house price. The strength and direction of these correlations are represented by the colour intensity and hue on the heatmap.

Below is a brief explanation of each feature:

**OverallQual** : Rates the overall material and finish quality of the house (1-10 scale). This is the strongest predictor of sale price.

**GrLivArea** : Above-ground living area in square feet. A larger living area typically increases house value.

**GarageArea** : Size of the garage in square feet. Larger garages may indicate higher-end properties.

**TotalBsmtSF** : Total area of the basement in square feet. Larger basements may add to usable space and value.

**1stFlrSF** : Area of the first floor in square feet. A larger first floor is often associated with more expensive homes.

**YearBuilt** : The original construction year. Newer homes generally command higher prices due to better condition and more modern features.

**YearRemodAdd** : Year of the latest remodel or addition. More recent updates may improve the home's value.

**MasVnrArea** : Masonry veneer area in square feet (e.g., brick or stone). May reflect exterior quality.

**GarageYrBlt** : Year the garage was built. Usually matches or follows the house's construction year.

**BsmtFinSF1** : Finished square footage of the basement (type 1). Finished basements are often desirable living spaces.

These features were selected based on their Pearson correlation coefficients, as shown on the heatmap. The correlation values provide insight into how strongly each variable influences house prices, with the darker red indicating stronger positive correlations.

### Understanding the Pearson Correlation Coefficient

The Pearson correlation coefficient, commonly denoted as **r**, is a statistical measure we use to quantify the **strength and direction** of the linear relationship between two continuous numerical variables.

It ranges from **-1 to +1**, and is interpreted as follows:

- **r = +1**: Perfect positive linear correlation. As one variable increases, the other increases proportionally.
- **r = 0**: No linear correlation. There is no consistent linear relationship between the variables.
- **r = -1**: Perfect negative linear correlation. As one variable increases, the other decreases proportionally.

In this project, Pearson correlation is applied to measure the relationship between **numerical housing features** and the target variable, 'SalePrice'. Higher absolute values of **r** indicate stronger linear associations.

For example:
- 'OverallQual' has an **r = 0.79**, suggesting a strong positive linear correlation with sale price.
- 'GrLivArea' and 'GarageArea' also show high positive correlations, meaning larger homes and garages are typically more expensive.

This method is valuable in **feature selection**, as it helps identify which variables are most predictive of house price. It also highlights potential **multicollinearity** between predictors, which can influence model performance.
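
To show what the coefficient measures, the sketch below computes **r** directly from its definition: the summed product of deviations from the mean, divided by the product of the two spreads. The function name `pearson_r` is hypothetical, and the five value pairs are taken from the first five preview rows, so the result is not comparable to the full-dataset figure of 0.79.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from deviations around the two means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: how the two variables vary together
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: how much each variable varies on its own
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return co_variation / (spread_x * spread_y)


# OverallQual and SalePrice from the first five preview rows
overall_qual = [7, 6, 7, 7, 8]
sale_price = [208500, 181500, 223500, 140000, 250000]

print(round(pearson_r(overall_qual, sale_price), 3))
print(pearson_r([1, 2, 3], [2, 4, 6]))  # points on an increasing line
print(pearson_r([1, 2, 3], [6, 4, 2]))  # points on a decreasing line
```

The last two calls illustrate the limiting cases listed above: points on a straight line give a value of +1 or -1, up to floating-point rounding.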



In [6]:
%pip install scipy

Note: you may need to restart the kernel to use updated packages.


In [5]:
%pip install nbformat 

Collecting nbformat
  Downloading nbformat-5.10.4-py3-none-any.whl.metadata (3.6 kB)
Collecting fastjsonschema>=2.15 (from nbformat)
  Downloading fastjsonschema-2.21.1-py3-none-any.whl.metadata (2.2 kB)
Collecting jsonschema>=2.6 (from nbformat)
  Downloading jsonschema-4.23.0-py3-none-any.whl.metadata (7.9 kB)
Collecting attrs>=22.2.0 (from jsonschema>=2.6->nbformat)
  Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB)
Collecting jsonschema-specifications>=2023.03.6 (from jsonschema>=2.6->nbformat)
  Downloading jsonschema_specifications-2025.4.1-py3-none-any.whl.metadata (2.9 kB)
Collecting referencing>=0.28.4 (from jsonschema>=2.6->nbformat)
  Downloading referencing-0.36.2-py3-none-any.whl.metadata (2.8 kB)
Collecting rpds-py>=0.7.1 (from jsonschema>=2.6->nbformat)
  Downloading rpds_py-0.24.0-cp312-cp312-macosx_10_12_x86_64.whl.metadata (4.1 kB)
Collecting typing-extensions>=4.4.0 (from referencing>=0.28.4->jsonschema>=2.6->nbformat)
  Downloading typing_extensions-4.13.2

## Train Predictive Model 

To help Lydia Doe accurately estimate the sale prices of her inherited houses, we now move into the predictive modelling phase. This step uses the cleaned and processed dataset to train a machine learning model that can predict house sale prices based on their features.

Given the project's objective of supporting informed pricing decisions, we use a **Linear Regression** model as a baseline because it is interpretable and performs well on structured datasets.

This phase includes:
- Splitting the dataset into training and testing sets
- Training the model on the training set
- Evaluating model performance on unseen data

By validating the model's predictive accuracy, we ensure the client receives reliable price estimates for decision-making purposes.

In [17]:
%pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp312-cp312-macosx_10_13_x86_64.whl.metadata (31 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.0-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.6.1-cp312-cp312-macosx_10_13_x86_64.whl (12.1 MB)
Downloading joblib-1.5.0-py3-none-any.whl (307 kB)
Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.5.0 scikit-learn-1.6.1 threadpoolctl-3.6.0

In [8]:
# Import required libraries for modelling and evaluation 

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the cleaned dataset
df = pd.read_csv("../data/processed/cleaned_data.csv")

# Separate features (X) and target variable (y)
X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]

# Convert categorical variables to numeric using one-hot encoding
X = pd.get_dummies(X)

# Split the data into training and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialise and train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate model performance using standard regression metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse**0.5
r2 = r2_score(y_test, y_pred)

# Print evaluation results 
print("Model Evaluation Metrics:")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R-squared (R\u00b2): {r2:.2f}")


Model Evaluation Metrics:
Mean Absolute Error (MAE): 21286.05
Root Mean Squared Error (RMSE): 34262.84
R-squared (R²): 0.85


## Evaluate Model Performance 

To assess the performance of the predictive model, we use three key regression metrics:

- **Mean Absolute Error (MAE):** Measures the average absolute difference between predicted and actual values. It is expressed in US dollars ($) and provides a straightforward interpretation of the typical prediction error.

- **Root Mean Squared Error (RMSE):** Penalises larger errors more than MAE and is also measured in dollars. It is useful when large deviations from actual prices are particularly undesirable.

- **R<sup>2</sup> (R-squared):** Indicates the proportion of variance in the target variable ('SalePrice') that is explained by the model. This metric is unitless and typically ranges from 0 to 1, with values closer to 1 representing stronger predictive performance. On unseen data it can be negative when the model performs worse than always predicting the mean.

In this case, the model achieved:

- **MAE:** ~21,286.05 USD
- **RMSE:** ~34,262.84 USD
- **R<sup>2</sup>:** 0.85

These results suggest a strong model fit, with approximately 85% of the variance in house sale prices on the test set explained by the selected features. This provides the client with a reliable and interpretable foundation for estimating the market value of inherited properties.
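
For reference, the three metrics can be computed directly from their definitions. The sketch below does this on a handful of hypothetical prices (not taken from the model above); the function name `regression_metrics` is also hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) computed directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    # R^2 compares the model's squared error with that of predicting the mean
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical prices in USD (illustration only)
actual = [200000, 150000, 300000, 250000]
predicted = [210000, 140000, 280000, 250000]

mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R2: {r2:.3f}")
```

In this example the single 20,000 USD miss pulls RMSE above MAE, which is the sense in which RMSE penalises larger errors more heavily.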

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.read_csv("../data/processed/cleaned_data.csv")

X = pd.get_dummies(df.drop(columns=["SalePrice"]))
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)


## Predict Inherited House Prices 

With the trained Linear Regression model, we now generate sale price predictions for the four inherited properties identified by Lydia.

Each property has been prepared with the same feature structure as the training dataset, and predictions are generated using the fitted model.

These predicted prices provide Lydia with estimated market values for her inherited houses, enabling informed decisions about whether to sell, renovate, or hold the properties.

This phase of the project directly addresses Lydia's key question:

"How much are the inherited properties actually worth?"

In [12]:
# Prepare the inherited dataset for prediction
# Drop 'SalePrice' column if present and convert categorical variables to dummies
X_inherited = df_inherited_full.drop(columns=["SalePrice"], errors="ignore")
X_inherited = pd.get_dummies(X_inherited)

# Align inherited features to match training set (ensures same structures)
X_inherited = X_inherited.reindex(columns=X.columns, fill_value=0)

# Check for any missing values in the inherited data
print("Missing values in inherited dataset:")
print(X_inherited.isnull().sum().sort_values(ascending=False).head())

# Fill any remaining missing values with the median of each column
X_inherited = X_inherited.fillna(X_inherited.median(numeric_only=True))

# Predict sale prices using the trained model
predicted_prices = model.predict(X_inherited)

# Add predictions to the inherited DataFrame
df_inherited_full["Predicted_SalePrice"] = predicted_prices

# Label each inherited house clearly
df_inherited_full["House"] = [f"House {i+1}" for i in range(len(df_inherited_full))]

# Create a simpler display DataFrame
df_inherited_display = df_inherited_full[["House", "Predicted_SalePrice"]]

# Show the final labeled predictions
df_inherited_display.head()

Missing values in inherited dataset:
2ndFlrSF        1
BedroomAbvGr    1
1stFlrSF        0
BsmtFinSF1      0
BsmtUnfSF       0
dtype: int64


     House  Predicted_SalePrice
0  House 1        215669.762525
1  House 2        191117.459821
2  House 3        230019.818498
3  House 4        179302.774627
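
The column alignment step matters because one-hot encoding a small set of rows only creates dummy columns for the categories that appear in those rows. The sketch below illustrates this on toy data (hypothetical values, with two columns borrowed from the dataset) and shows how `reindex` restores the training layout.

```python
import pandas as pd

# Toy training and prediction frames (hypothetical values)
train = pd.DataFrame({
    "GarageArea": [548, 460, 608],
    "GarageFinish": ["RFn", "Unf", "Fin"],
})
new = pd.DataFrame({"GarageArea": [642], "GarageFinish": ["Unf"]})

X_train = pd.get_dummies(train)
X_new = pd.get_dummies(new)  # only produces a dummy column for 'Unf'

# Add the dummy columns the new data never produced (filled with 0),
# drop any columns unseen in training, and match the training column order
X_new_aligned = X_new.reindex(columns=X_train.columns, fill_value=0)

print(X_train.columns.tolist())
print(X_new_aligned)
```

Without this step, a fitted model would receive a feature matrix with a different set of columns from the one it was trained on.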