# Data Pre-processing

## Notebook Summary

This notebook prepares a cleaned dataset for linear modelling techniques such as linear regression and ridge regression. It covers the pre-processing steps, including standardising or normalising features so that they suit linear models, and feature engineering, which creates or transforms variables to better capture underlying patterns and improve model performance. The resulting dataset is structured to support both predictive performance and interpretability.

## Notebook Setup

In [2]:
# Imports
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import chi2_contingency
from scipy.stats import pearsonr
from sklearn.preprocessing import OrdinalEncoder

## Data Loading & Understanding

In [None]:
# Creating DataFrame
clean_house_df = pd.read_csv("london_house_price_data_clean.csv")
# Viewing DataFrame
clean_house_df.head(5)

Unnamed: 0,fullAddress,postcode,outcode,latitude,longitude,bathrooms,bedrooms,floorAreaSqM,livingRooms,tenure,...,saleEstimate_lowerPrice,saleEstimate_currentPrice,saleEstimate_upperPrice,saleEstimate_confidenceLevel,saleEstimate_ingestedAt,saleEstimate_valueChange.numericChange,saleEstimate_valueChange.percentageChange,saleEstimate_valueChange.saleDate,history_date,history_price
0,"Flat 35, Octavia House, Medway Street, London,...",SW1P 2TA,SW1P,51.495505,-0.132379,2.0,2.0,71.0,1.0,Leasehold,...,683000.0,759000.0,834000.0,MEDIUM,2025-01-10T11:04:57.114Z,49000.0,6.901408,2019-09-04,1995-01-03,249950
1,"24 Chester Court, Lomond Grove, London, SE5 7HS",SE5 7HS,SE5,51.478185,-0.092201,1.0,1.0,64.0,1.0,Leasehold,...,368000.0,388000.0,407000.0,HIGH,2024-10-07T13:26:59.894Z,28000.0,7.777778,2024-01-25,1995-01-03,32000
2,"18 Alexandra Gardens, London, N10 3RL",N10 3RL,N10,51.588774,-0.139599,1.0,4.0,137.0,2.0,Freehold,...,1198000.0,1261000.0,1324000.0,HIGH,2024-10-07T13:26:59.894Z,81000.0,6.864407,2022-12-14,1995-01-03,133000
3,"17 Collins Street, London, SE3 0UG",SE3 0UG,SE3,51.466089,0.006967,1.0,2.0,97.0,1.0,Freehold,...,897000.0,944000.0,991000.0,HIGH,2024-10-07T13:26:59.894Z,119000.0,14.424242,2021-09-30,1995-01-03,128500
4,"14 Theodore Road, London, SE13 6HT",SE13 6HT,SE13,51.451327,-0.007569,1.0,3.0,135.0,2.0,Freehold,...,690000.0,726000.0,762000.0,HIGH,2024-10-07T13:26:59.894Z,71000.0,10.839695,2021-12-06,1995-01-03,75000


In [4]:
# Shape of the data
rows, cols = clean_house_df.shape
print(f"The data has {rows} rows and {cols} columns")

The data has 265911 rows and 25 columns


### Identification of categorical columns

Regression models require strictly numerical input, so categorical variables must be transformed into a suitable numerical format before modelling. The following columns were identified as categorical:

- 'fullAddress'
- 'postcode'
- 'outcode'
- 'tenure'
- 'saleEstimate_confidenceLevel' 
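
As a minimal sketch, non-numeric columns like those listed above can be picked out by dtype and their cardinality inspected before choosing an encoding. The frame `toy_df` is a hypothetical stand-in holding a few values from the preview of `clean_house_df`; on the full data, date columns stored as strings would be selected as well.

```python
import pandas as pd

# Hypothetical stand-in for clean_house_df, using a few columns from the preview
toy_df = pd.DataFrame({
    "outcode": ["SW1P", "SE5", "N10", "SE3"],
    "tenure": ["Leasehold", "Leasehold", "Freehold", "Freehold"],
    "bedrooms": [2.0, 1.0, 4.0, 2.0],
    "history_price": [249950, 32000, 133000, 128500],
})

# Select every column that is not numeric (string columns in this toy frame)
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per non-numeric column
cardinality = toy_df[categorical_cols].nunique()
print(cardinality)
```

Columns with very many distinct values (such as full addresses) are usually poor candidates for one-hot encoding, so a count like this can help decide how each column is handled.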

#### fullAddress column

In [5]:
# Count the number of unique values in the 'fullAddress' column
print(f"The fullAddress column has {clean_house_df['fullAddress'].nunique()} unique values")

The fullAddress column has 82613 unique values


#### tenure column

The values of this column have no natural order, so one-hot encoding (dummy columns) was used.

In [6]:
# Count the number of unique values in the 'tenure' column
print(f"The tenure column has {clean_house_df['tenure'].nunique()} unique values")

# Unique values in the column
tenure_unique_values = clean_house_df['tenure'].unique()

# Displaying the unique values
print(tenure_unique_values)

The tenure column has 4 unique values
['Leasehold' 'Freehold' 'Feudal' 'Shared']


In [7]:
# Performing one-hot encoding
tenure_encoded = pd.get_dummies(clean_house_df['tenure'], prefix='tenure')

# Converting True/False to binary
tenure_encoded = tenure_encoded.astype(int)

# Concatenating the one-hot encoded column with the original dataframe
clean_house_df = pd.concat([clean_house_df, tenure_encoded], axis=1)

# Dropping original column
clean_house_df = clean_house_df.drop('tenure', axis=1)
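
One point worth considering for linear models: when every category gets its own dummy column, the dummies in each row sum to 1, so together with an intercept they are perfectly collinear (often called the dummy variable trap). The sketch below shows one common remedy, dropping one category as the baseline with `drop_first=True`. It uses a hypothetical toy series and is an illustration only, not necessarily the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical toy series with the four tenure values seen above
toy_tenure = pd.Series(
    ["Leasehold", "Freehold", "Feudal", "Shared", "Freehold"], name="tenure"
)

# Full encoding: one 0/1 column per category, so each row sums to 1
full_dummies = pd.get_dummies(toy_tenure, prefix="tenure", dtype=int)

# Reduced encoding: the first category in sorted order ('Feudal') is dropped
# and becomes the baseline that the remaining columns are compared against
reduced_dummies = pd.get_dummies(
    toy_tenure, prefix="tenure", drop_first=True, dtype=int
)

print(full_dummies.sum(axis=1).unique())
print(reduced_dummies.columns.tolist())
```

Regularised models such as ridge regression can tolerate the full set of dummies, whereas unregularised linear regression with an intercept generally benefits from the reduced set.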

#### saleEstimate_confidenceLevel column

The values of this column have a natural order, so ordinal encoding was used to convert them into numerical data: LOW = 0.0, MEDIUM = 1.0, HIGH = 2.0.

In [8]:
# Count the number of unique values in the 'saleEstimate_confidenceLevel' column
print(f"The saleEstimate_confidenceLevel column has {clean_house_df['saleEstimate_confidenceLevel'].nunique()} unique values")

# Unique values in the column
confidence_unique_values = clean_house_df['saleEstimate_confidenceLevel'].unique()

# Displaying the unique values
print(confidence_unique_values)

The saleEstimate_confidenceLevel column has 3 unique values
['MEDIUM' 'HIGH' 'LOW']


In [9]:
# Order of categories
categories = [['LOW', 'MEDIUM', 'HIGH']]

# Initialising encoder
encoder = OrdinalEncoder(categories=categories)

# Fit and transform the data
confidence_encoded = encoder.fit_transform(clean_house_df[['saleEstimate_confidenceLevel']])

# Add the encoded values to the dataframe
clean_house_df['saleEstimate_confidenceLevel_encoded'] = confidence_encoded
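
To make the mapping concrete, the sketch below reproduces the same LOW/MEDIUM/HIGH ordering with a plain pandas `map` on a hypothetical toy series. This is a swapped-in alternative to `OrdinalEncoder`, shown only to illustrate the resulting values; the names `toy_confidence` and `order` are assumptions introduced for the example.

```python
import pandas as pd

# Hypothetical toy series with the three confidence levels
toy_confidence = pd.Series(["MEDIUM", "HIGH", "LOW", "HIGH"], name="confidence")

# Explicit ordering, matching the one passed to OrdinalEncoder above
order = {"LOW": 0.0, "MEDIUM": 1.0, "HIGH": 2.0}

# Map each label to its rank; labels missing from `order` would become NaN
encoded = toy_confidence.map(order)

# Count of labels that were not covered by the mapping
unmapped = encoded.isna().sum()

print(encoded.tolist())  # [1.0, 2.0, 0.0, 2.0]
print(unmapped)          # 0
```

Unlike this mapping, `OrdinalEncoder` with its default settings raises an error when it meets a category that is not in the supplied list, which can be a useful safeguard.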

In [11]:
# Viewing DataFrame post-conversions
clean_house_df.head(5)

Unnamed: 0,fullAddress,postcode,outcode,latitude,longitude,bathrooms,bedrooms,floorAreaSqM,livingRooms,propertyType,...,saleEstimate_valueChange.numericChange,saleEstimate_valueChange.percentageChange,saleEstimate_valueChange.saleDate,history_date,history_price,tenure_Feudal,tenure_Freehold,tenure_Leasehold,tenure_Shared,saleEstimate_confidenceLevel_encoded
0,"Flat 35, Octavia House, Medway Street, London,...",SW1P 2TA,SW1P,51.495505,-0.132379,2.0,2.0,71.0,1.0,Flat/Maisonette,...,49000.0,6.901408,2019-09-04,1995-01-03,249950,0,0,1,0,1.0
1,"24 Chester Court, Lomond Grove, London, SE5 7HS",SE5 7HS,SE5,51.478185,-0.092201,1.0,1.0,64.0,1.0,Flat/Maisonette,...,28000.0,7.777778,2024-01-25,1995-01-03,32000,0,0,1,0,2.0
2,"18 Alexandra Gardens, London, N10 3RL",N10 3RL,N10,51.588774,-0.139599,1.0,4.0,137.0,2.0,End Terrace House,...,81000.0,6.864407,2022-12-14,1995-01-03,133000,0,1,0,0,2.0
3,"17 Collins Street, London, SE3 0UG",SE3 0UG,SE3,51.466089,0.006967,1.0,2.0,97.0,1.0,Mid Terrace House,...,119000.0,14.424242,2021-09-30,1995-01-03,128500,0,1,0,0,2.0
4,"14 Theodore Road, London, SE13 6HT",SE13 6HT,SE13,51.451327,-0.007569,1.0,3.0,135.0,2.0,Terrace Property,...,71000.0,10.839695,2021-12-06,1995-01-03,75000,0,1,0,0,2.0
