# 02 - Data Cleaning

In this notebook, we load the raw housing price dataset and clean it by:
- Removing duplicates
- Handling missing values
- Formatting dates
- Converting categorical variables
- Saving the cleaned dataset for modeling

---

Import the libraries used for data processing:
- pandas for handling and cleaning data
- os for interacting with files and directories

In [1]:
import pandas as pd
import os

The file is too large to load into memory at once, so only the first 1000 rows are read using `nrows=1000`.

In [2]:
input_path = '../inputs/datasets/raw/price_paid_records.csv'
df_chunk = pd.read_csv(input_path, nrows=1000)
df_chunk.head()

Unnamed: 0,Transaction unique identifier,Price,Date of Transfer,Property Type,Old/New,Duration,Town/City,District,County,PPDCategory Type,Record Status - monthly file only
0,{81B82214-7FBC-4129-9F6B-4956B4A663AD},25000,1995-08-18 00:00,T,N,F,OLDHAM,OLDHAM,GREATER MANCHESTER,A,A
1,{8046EC72-1466-42D6-A753-4956BF7CD8A2},42500,1995-08-09 00:00,S,N,F,GRAYS,THURROCK,THURROCK,A,A
2,{278D581A-5BF3-4FCE-AF62-4956D87691E6},45000,1995-06-30 00:00,T,N,F,HIGHBRIDGE,SEDGEMOOR,SOMERSET,A,A
3,{1D861C06-A416-4865-973C-4956DB12CD12},43150,1995-11-24 00:00,T,N,F,BEDFORD,NORTH BEDFORDSHIRE,BEDFORDSHIRE,A,A
4,{DD8645FD-A815-43A6-A7BA-4956E58F1874},18899,1995-06-23 00:00,S,N,F,WAKEFIELD,LEEDS,WEST YORKSHIRE,A,A
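
Reading only the first rows keeps memory use low, but those rows may not be representative of the whole file (the preview above shows only 1995 transfers). As a rough sketch of an alternative, `read_csv` can also stream a file in pieces with `chunksize`. The tiny in-memory CSV and the chunk size of 2 below are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for the raw CSV so the example runs without the real file.
raw_csv = io.StringIO(
    "Price,Date of Transfer,Property Type\n"
    "25000,1995-08-18 00:00,T\n"
    "42500,1995-08-09 00:00,S\n"
    "45000,1995-06-30 00:00,T\n"
    "43150,1995-11-24 00:00,T\n"
    "18899,1995-06-23 00:00,S\n"
)

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one large DataFrame, so only one chunk is in memory at a time.
total_rows = 0
price_sum = 0
for chunk in pd.read_csv(raw_csv, chunksize=2):
    total_rows += len(chunk)
    price_sum += int(chunk["Price"].sum())

mean_price = price_sum / total_rows
print(f"rows={total_rows}, mean price={mean_price:.1f}")
```

The same loop could be pointed at `input_path` with a larger chunk size to compute summaries over the full file.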


Load the raw dataset from the `inputs/datasets/raw` folder into a DataFrame, and check the number of rows and columns with `shape`.

In [3]:
# Load raw dataset
input_path = '../inputs/datasets/raw/price_paid_records.csv'
df_chunk = pd.read_csv(input_path, nrows=1000)
df_chunk.shape

(1000, 11)

Preview the first 5 rows to confirm the data loaded correctly.

In [4]:
df_chunk.head()

Unnamed: 0,Transaction unique identifier,Price,Date of Transfer,Property Type,Old/New,Duration,Town/City,District,County,PPDCategory Type,Record Status - monthly file only
0,{81B82214-7FBC-4129-9F6B-4956B4A663AD},25000,1995-08-18 00:00,T,N,F,OLDHAM,OLDHAM,GREATER MANCHESTER,A,A
1,{8046EC72-1466-42D6-A753-4956BF7CD8A2},42500,1995-08-09 00:00,S,N,F,GRAYS,THURROCK,THURROCK,A,A
2,{278D581A-5BF3-4FCE-AF62-4956D87691E6},45000,1995-06-30 00:00,T,N,F,HIGHBRIDGE,SEDGEMOOR,SOMERSET,A,A
3,{1D861C06-A416-4865-973C-4956DB12CD12},43150,1995-11-24 00:00,T,N,F,BEDFORD,NORTH BEDFORDSHIRE,BEDFORDSHIRE,A,A
4,{DD8645FD-A815-43A6-A7BA-4956E58F1874},18899,1995-06-23 00:00,S,N,F,WAKEFIELD,LEEDS,WEST YORKSHIRE,A,A


---


### Check Column Data Types

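Print the data type of each column to confirm that values such as dates and numbers are interpreted correctly. This helps avoid errors later. Note that "Date of Transfer" is still read as `object` (text) at this point, so it is converted to datetime in the Cleaning section.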

In [5]:
print("Column Data Types:")
print(df_chunk.dtypes)

Column Data Types:
Transaction unique identifier        object
Price                                 int64
Date of Transfer                     object
Property Type                        object
Old/New                              object
Duration                             object
Town/City                            object
District                             object
County                               object
PPDCategory Type                     object
Record Status - monthly file only    object
dtype: object
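
As an optional variation, types can also be declared when the file is read, so the dtype check passes straight away. This is only a sketch on a small in-memory CSV (hypothetical data); choosing `category` for "Property Type" is an assumption, not something the notebook does.

```python
import io
import pandas as pd

# Hypothetical stand-in for the raw CSV.
raw_csv = io.StringIO(
    "Price,Date of Transfer,Property Type\n"
    "25000,1995-08-18 00:00,T\n"
    "42500,1995-08-09 00:00,S\n"
)

# parse_dates converts the listed columns to datetime while reading;
# dtype fixes the type of individual columns up front.
df_typed = pd.read_csv(
    raw_csv,
    parse_dates=["Date of Transfer"],
    dtype={"Price": "int64", "Property Type": "category"},
)
print(df_typed.dtypes)
```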


---

### View Value Counts for Categorical Columns

This shows how often each value appears in key categorical columns such as property type, town, and whether the house is new or old. It helps spot imbalanced or unexpected values.

In [6]:
categorical_cols = ["Property Type", "Old/New", "Duration", "County", "Town/City"]
for col in categorical_cols:
    print(f"\nValue counts for {col}:")
    print(df_chunk[col].value_counts())


Value counts for Property Type:
Property Type
S    328
T    316
D    232
F    124
Name: count, dtype: int64

Value counts for Old/New:
Old/New
N    865
Y    135
Name: count, dtype: int64

Value counts for Duration:
Duration
F    806
L    194
Name: count, dtype: int64

Value counts for County:
County
GREATER LONDON           121
WEST MIDLANDS             43
GREATER MANCHESTER        40
WEST YORKSHIRE            38
ESSEX                     36
                        ... 
SLOUGH                     1
RUTLAND                    1
CONWY                      1
SOUTH GLOUCESTERSHIRE      1
YORK                       1
Name: count, Length: 85, dtype: int64

Value counts for Town/City:
Town/City
LONDON                67
BIRMINGHAM            15
LEICESTER             14
MANCHESTER            14
NOTTINGHAM            13
                      ..
HYTHE                  1
SURBITON               1
MALVERN                1
CHALFONT ST. GILES     1
HELSTON                1
Name: count, Length: 449, dtype: int64
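
To make "unexpected values" concrete, here is a small sketch that reports relative frequencies and flags codes outside an expected set. The expected set is taken from the sample output above and is an assumption; the full file may legitimately contain other codes. The `"X"` row is made up for illustration.

```python
import pandas as pd

# Hypothetical sample with one unexpected code ("X").
df_demo = pd.DataFrame({"Property Type": ["S", "T", "D", "F", "S", "X"]})

# normalize=True returns shares instead of raw counts.
shares = df_demo["Property Type"].value_counts(normalize=True)

# Codes seen in the 1000-row sample above (assumption: not an official list).
expected_codes = {"S", "T", "D", "F"}
unexpected = set(df_demo["Property Type"].unique()) - expected_codes

print(shares)
print("Unexpected codes:", unexpected)
```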

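---

### Handle Missing Values

Check for missing values in each column, then drop any rows where "Price" or "Date of Transfer" is missing, since those rows cannot be used for analysis. In this sample no values are missing, so the shape stays at (1000, 11).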

In [7]:

print("Missing values in each column:")
print(df_chunk.isnull().sum())

df_chunk.dropna(subset=["Price", "Date of Transfer"], inplace=True)
print(f"Shape after dropping missing price/date: {df_chunk.shape}")

Missing values in each column:
Transaction unique identifier        0
Price                                0
Date of Transfer                     0
Property Type                        0
Old/New                              0
Duration                             0
Town/City                            0
District                             0
County                               0
PPDCategory Type                     0
Record Status - monthly file only    0
dtype: int64
Shape after dropping missing price/date: (1000, 11)


---

### Cleaning 

Remove columns that are not useful for analysis or modeling to simplify the dataset.

In [8]:
df_chunk.drop(columns=[
    "Transaction unique identifier",
    "District",
    "Record Status - monthly file only"
], inplace=True, errors='ignore')

Convert the "Date of Transfer" column from text to datetime so that time-based features can be extracted. With `errors='coerce'`, any value that cannot be parsed becomes `NaT` instead of raising an error.

In [9]:
df_chunk["Date of Transfer"] = pd.to_datetime(df_chunk["Date of Transfer"], errors='coerce')
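
Because `errors='coerce'` silently turns unparseable values into `NaT`, and the missing-value check ran before this conversion, it may be worth counting `NaT` values afterwards. A minimal sketch on made-up data (the `"not a date"` entry is hypothetical):

```python
import pandas as pd

# Hypothetical column with one value that cannot be parsed.
dates = pd.Series(["1995-08-18 00:00", "not a date", "1995-06-30 00:00"])

# Unparseable entries become NaT rather than raising an error.
parsed = pd.to_datetime(dates, errors="coerce")

# Count how many values were lost in the conversion.
n_bad = int(parsed.isna().sum())
print(f"Unparseable dates: {n_bad}")

# Rows with NaT could then be dropped before extracting year and month.
parsed_clean = parsed.dropna()
```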

Extract the year and month from the "Date of Transfer" column and store them in new columns to allow time-based analysis.

In [10]:
df_chunk["Year"] = df_chunk["Date of Transfer"].dt.year
df_chunk["Month"] = df_chunk["Date of Transfer"].dt.month

Convert text categories into numeric format so they can be used in machine learning models:
- Old/New: N = 0, Y = 1
- Duration: F (Freehold) = 1, L (Leasehold) = 0
- Property Type: one-hot encoded into one column per category (Property_D, Property_F, Property_S, Property_T)

In [11]:
df_chunk["Old/New"] = df_chunk["Old/New"].map({'N': 0, 'Y': 1})
df_chunk["Duration"] = df_chunk["Duration"].map({'F': 1, 'L': 0})
df_chunk = pd.get_dummies(df_chunk, columns=["Property Type"], prefix="Property")
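
One thing to watch with `map`: any value missing from the dictionary becomes `NaN`. The sketch below, on made-up rows, counts such unmapped values and lists the dummy columns that `get_dummies` creates. The `"U"` duration code is a hypothetical example of an unseen value.

```python
import pandas as pd

# Hypothetical rows; "U" is not covered by the Duration mapping.
df_demo = pd.DataFrame({
    "Old/New": ["N", "Y", "N"],
    "Duration": ["F", "L", "U"],
    "Property Type": ["S", "T", "D"],
})

df_demo["Old/New"] = df_demo["Old/New"].map({"N": 0, "Y": 1})
df_demo["Duration"] = df_demo["Duration"].map({"F": 1, "L": 0})

# Values absent from the mapping dictionaries show up as NaN here.
unmapped = df_demo[["Old/New", "Duration"]].isna().sum()
print(unmapped)

# One dummy column is created per category present in the data.
encoded = pd.get_dummies(df_demo, columns=["Property Type"], prefix="Property")
dummy_cols = sorted(c for c in encoded.columns if c.startswith("Property_"))
print(dummy_cols)
```

Since dummy columns depend on the categories present, a different sample of rows could produce a different set of `Property_*` columns.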

Remove duplicate rows to keep the dataset clean and reset the index to keep row numbers in order.

In [12]:
# Drop duplicates
df_chunk.drop_duplicates(inplace=True)
df_chunk.reset_index(drop=True, inplace=True)
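
Note that duplicates are removed here after the transaction identifier has been dropped, so two distinct sales with identical remaining values would be treated as duplicates. The sketch below illustrates the difference on made-up rows; whether this matters for the real data is an open question, not something the notebook establishes.

```python
import pandas as pd

# Hypothetical rows: two distinct transactions that share all other values.
df_demo = pd.DataFrame({
    "Transaction unique identifier": ["{A}", "{B}", "{C}"],
    "Price": [25000, 25000, 42500],
    "Town/City": ["OLDHAM", "OLDHAM", "GRAYS"],
})

# With the identifier present, every row is unique.
dups_with_id = int(df_demo.duplicated().sum())

# Without it, the second row looks like a duplicate of the first.
without_id = df_demo.drop(columns=["Transaction unique identifier"])
dups_without_id = int(without_id.duplicated().sum())

print(dups_with_id, dups_without_id)
```

Counting duplicates with `duplicated().sum()` before calling `drop_duplicates` also shows how many rows the step removes.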

---

### Save Cleaned Data

Save the cleaned dataset. The output folder is created if it doesn't exist, and the file is written without the index column.

In [13]:
# Save cleaned data
output_path = '../outputs/datasets/collection/HousePricesRecords_clean.csv'
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df_chunk.to_csv(output_path, index=False)
print(f"Cleaned data saved to: {output_path}")

Cleaned data saved to: ../outputs/datasets/collection/HousePricesRecords_clean.csv
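
CSV files do not store dtypes, so the datetime column comes back as text unless it is parsed again on load. As a hedged sketch, a quick round-trip check could look like this; the temporary directory and the file name `clean.csv` are illustrative assumptions.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned data with a datetime column.
df_demo = pd.DataFrame({
    "Price": [25000, 42500],
    "Date of Transfer": pd.to_datetime(["1995-08-18", "1995-08-09"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, "collection", "clean.csv")
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    df_demo.to_csv(output_path, index=False)

    # parse_dates restores the datetime dtype that the CSV format drops.
    reloaded = pd.read_csv(output_path, parse_dates=["Date of Transfer"])

same_shape = reloaded.shape == df_demo.shape
dates_restored = pd.api.types.is_datetime64_any_dtype(reloaded["Date of Transfer"])
print(same_shape, dates_restored)
```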
