# Data Cleaning 

### Notebook summary:
This notebook details the steps taken in the data cleaning process: removing errors, inconsistencies, and irrelevant information so that the data is accurate, reliable, and ready for the pre-processing stage.



## Notebook Setup

In [1]:
# Imports
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Loading

In [5]:
# Data Loading and viewing
housing_df = pd.read_csv(r'C:\Users\sanja\capstone-SanjayRaju2000\src\data\kaggle_london_house_price_data.csv')
housing_df.head()

Unnamed: 0,fullAddress,postcode,country,outcode,latitude,longitude,bathrooms,bedrooms,floorAreaSqM,livingRooms,...,saleEstimate_upperPrice,saleEstimate_confidenceLevel,saleEstimate_ingestedAt,saleEstimate_valueChange.numericChange,saleEstimate_valueChange.percentageChange,saleEstimate_valueChange.saleDate,history_date,history_price,history_percentageChange,history_numericChange
0,"Flat 9, 35 Furnival Street, London, EC4A 1JQ",EC4A 1JQ,England,EC4A,51.517282,-0.110314,1.0,1.0,45.0,1.0,...,630000.0,HIGH,2024-10-07T13:26:59.894Z,244000.0,68.539326,2010-03-30,1995-01-02,830000,,
1,"Flat 6, 35 Furnival Street, London, EC4A 1JQ",EC4A 1JQ,England,EC4A,51.517282,-0.110314,,,,,...,660000.0,MEDIUM,2024-10-07T13:26:59.894Z,425000.0,242.857143,2000-05-26,1995-01-02,830000,,
2,"Flat 35, Octavia House, Medway Street, London,...",SW1P 2TA,England,SW1P,51.495505,-0.132379,2.0,2.0,71.0,1.0,...,834000.0,MEDIUM,2025-01-10T11:04:57.114Z,49000.0,6.901408,2019-09-04,1995-01-03,249950,,
3,"24 Chester Court, Lomond Grove, London, SE5 7HS",SE5 7HS,England,SE5,51.478185,-0.092201,1.0,1.0,64.0,1.0,...,407000.0,HIGH,2024-10-07T13:26:59.894Z,28000.0,7.777778,2024-01-25,1995-01-03,32000,,
4,"18 Alexandra Gardens, London, N10 3RL",N10 3RL,England,N10,51.588774,-0.139599,1.0,4.0,137.0,2.0,...,1324000.0,HIGH,2024-10-07T13:26:59.894Z,81000.0,6.864407,2022-12-14,1995-01-03,133000,,


In [19]:
# Data types
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418201 entries, 0 to 418200
Data columns (total 28 columns):
 #   Column                                     Non-Null Count   Dtype  
---  ------                                     --------------   -----  
 0   fullAddress                                418201 non-null  object 
 1   postcode                                   418201 non-null  object 
 2   country                                    418201 non-null  object 
 3   outcode                                    418201 non-null  object 
 4   latitude                                   418201 non-null  float64
 5   longitude                                  418201 non-null  float64
 6   bathrooms                                  340270 non-null  float64
 7   bedrooms                                   377665 non-null  float64
 8   floorAreaSqM                               392991 non-null  float64
 9   livingRooms                                357701 non-null  float64
 10  tenure  

## Data Understanding

### Data format

In [20]:
# Viewing how many rows and columns there are
housing_df.shape

(418201, 28)

- The dataset contains 418,201 rows and 28 columns.

- Each row represents a housing property with 28 associated attributes.

### Null values

In [21]:
# Viewing how many null values
housing_df.isnull().sum()

fullAddress                                       0
postcode                                          0
country                                           0
outcode                                           0
latitude                                          0
longitude                                         0
bathrooms                                     77931
bedrooms                                      40536
floorAreaSqM                                  25210
livingRooms                                   60500
tenure                                        11512
propertyType                                   1136
currentEnergyRating                           84526
rentEstimate_lowerPrice                        1741
rentEstimate_currentPrice                      1741
rentEstimate_upperPrice                        1741
saleEstimate_lowerPrice                         640
saleEstimate_currentPrice                       640
saleEstimate_upperPrice                         640
saleEstimate

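- The number of null values varies widely across columns. Dropping every row that contains a null would shrink the dataset considerably and could affect later analysis.

- Most columns are missing fewer than 40% of their values; the exceptions are the *history percentage change* and *history numeric change* columns. These two columns can be dropped without major impact, assuming the missing values occur at random and the distributions of the remaining columns stay stable.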
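Raw null counts are easier to compare against a threshold such as 40% once they are expressed as a share of all rows. The following is a minimal sketch of that conversion. It runs on a small made-up frame (`toy_df`) rather than the real dataset, and the helper name `null_share` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for housing_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0, 1.0],
    "bedrooms": [1.0, np.nan, 2.0, 4.0],
    "history_numericChange": [np.nan, np.nan, np.nan, 5000.0],
})


def null_share(df):
    """Return the percentage of missing values per column, highest first."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)


null_pct = null_share(toy_df)
print(null_pct.round(1))
```

Calling `null_share(housing_df)` in the notebook should give the same view for the full dataset.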






### Duplicates

In [22]:
# Viewing the total number of duplicates 
print(housing_df.duplicated().sum())

7


- The output shows that there are only seven duplicated rows across the whole dataset.
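`duplicated()` with its default `keep='first'` flags only the second and later copies of a row, so the count above is the number of surplus copies rather than the number of rows involved. To look at every member of a duplicated group, `keep=False` can be used. A small sketch on made-up data (`toy_df` is a hypothetical stand-in for `housing_df`):

```python
import pandas as pd

# Toy frame with one exact duplicate (rows 0 and 1); values are made up.
toy_df = pd.DataFrame({
    "fullAddress": ["1 Example Road", "1 Example Road", "2 Example Road", "1 Example Road"],
    "history_date": ["1995-01-02", "1995-01-02", "1995-01-03", "2001-06-01"],
    "history_price": [100000, 100000, 150000, 180000],
})

# Default keep='first': only the surplus copies are flagged.
n_extra_copies = int(toy_df.duplicated().sum())

# keep=False: every row that belongs to a duplicated group is flagged.
all_copies = toy_df[toy_df.duplicated(keep=False)]

print(n_extra_copies)
print(all_copies)
```

Row 3 shares an address with rows 0 and 1 but differs in date and price, so it is not treated as a duplicate.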

In [23]:
# Viewing how many unique values per column
print(housing_df.nunique())

fullAddress                                  137760
postcode                                      48392
country                                           1
outcode                                         168
latitude                                      92247
longitude                                     93060
bathrooms                                         9
bedrooms                                          9
floorAreaSqM                                    491
livingRooms                                       9
tenure                                            4
propertyType                                     19
currentEnergyRating                               7
rentEstimate_lowerPrice                         748
rentEstimate_currentPrice                       807
rentEstimate_upperPrice                         861
saleEstimate_lowerPrice                        5586
saleEstimate_currentPrice                      6152
saleEstimate_upperPrice                        6724
saleEstimate

- Some columns do not have as many unique values as entries, which is expected.
  - Properties like flats may share postcodes.
  - Numerical fields like price and rent are naturally non-unique.
  
- *Full address* might be expected to be unique, but there are only 137,760 distinct addresses across 418,201 rows (about 33%).
  - Indicates properties may be reported multiple times (e.g., different years).
  - Further examination needed.
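One way to start that examination is to count rows per address and check whether repeated addresses carry different sale dates. The sketch below does this on a made-up frame; `toy_df` and its values are illustrative assumptions, and only the column names `fullAddress` and `history_date` come from the dataset.

```python
import pandas as pd

# Toy frame: one address sold three times, another sold once (made-up values).
toy_df = pd.DataFrame({
    "fullAddress": ["1 Example Road", "1 Example Road", "1 Example Road", "2 Example Road"],
    "history_date": ["1995-01-02", "2004-07-15", "2019-03-01", "1995-01-03"],
    "history_price": [100000, 210000, 450000, 150000],
})

# Share of distinct addresses relative to the number of rows.
distinct_share = toy_df["fullAddress"].nunique() / len(toy_df)

# Addresses that appear on more than one row.
address_counts = toy_df["fullAddress"].value_counts()
repeated = address_counts[address_counts > 1]

# Number of distinct sale dates recorded for each address.
dates_per_address = toy_df.groupby("fullAddress")["history_date"].nunique()

print(distinct_share)
print(repeated)
print(dates_per_address)
```

If repeated addresses mostly have as many distinct dates as rows, that would support the reading that each row is a separate sale of the same property.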


## Omitting data

### Duplicates

In [24]:
# Dropping all duplicate rows in the dataframe
housing_df = housing_df.drop_duplicates()


In [25]:
# Checking if all the duplicate rows have been successfully dropped
print(housing_df.duplicated().sum())

0


In [26]:
# Dropping columns
housing_df = housing_df.drop(['country', 'history_percentageChange', 'history_numericChange'], axis=1)

In [27]:
# Viewing the data post column removal
housing_df.head()

Unnamed: 0,fullAddress,postcode,outcode,latitude,longitude,bathrooms,bedrooms,floorAreaSqM,livingRooms,tenure,...,saleEstimate_lowerPrice,saleEstimate_currentPrice,saleEstimate_upperPrice,saleEstimate_confidenceLevel,saleEstimate_ingestedAt,saleEstimate_valueChange.numericChange,saleEstimate_valueChange.percentageChange,saleEstimate_valueChange.saleDate,history_date,history_price
0,"Flat 9, 35 Furnival Street, London, EC4A 1JQ",EC4A 1JQ,EC4A,51.517282,-0.110314,1.0,1.0,45.0,1.0,Leasehold,...,570000.0,600000.0,630000.0,HIGH,2024-10-07T13:26:59.894Z,244000.0,68.539326,2010-03-30,1995-01-02,830000
1,"Flat 6, 35 Furnival Street, London, EC4A 1JQ",EC4A 1JQ,EC4A,51.517282,-0.110314,,,,,Leasehold,...,540000.0,600000.0,660000.0,MEDIUM,2024-10-07T13:26:59.894Z,425000.0,242.857143,2000-05-26,1995-01-02,830000
2,"Flat 35, Octavia House, Medway Street, London,...",SW1P 2TA,SW1P,51.495505,-0.132379,2.0,2.0,71.0,1.0,Leasehold,...,683000.0,759000.0,834000.0,MEDIUM,2025-01-10T11:04:57.114Z,49000.0,6.901408,2019-09-04,1995-01-03,249950
3,"24 Chester Court, Lomond Grove, London, SE5 7HS",SE5 7HS,SE5,51.478185,-0.092201,1.0,1.0,64.0,1.0,Leasehold,...,368000.0,388000.0,407000.0,HIGH,2024-10-07T13:26:59.894Z,28000.0,7.777778,2024-01-25,1995-01-03,32000
4,"18 Alexandra Gardens, London, N10 3RL",N10 3RL,N10,51.588774,-0.139599,1.0,4.0,137.0,2.0,Freehold,...,1198000.0,1261000.0,1324000.0,HIGH,2024-10-07T13:26:59.894Z,81000.0,6.864407,2022-12-14,1995-01-03,133000


- The *country* column was removed:
  - Only one unique (non-numerical) value: "England."
  - Not useful for statistical modeling.

- The *history percentage change* and *history numeric change* columns were removed:
  - High number of missing values.
  - No clear definition in the data dictionary.
  - Unlikely to significantly impact the target variable.
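Two of the criteria above (a single distinct value, a high share of missing values) can be checked mechanically. The sketch below is one possible way to do so on a made-up frame, not the method used in this notebook. The 0.5 threshold is an arbitrary illustrative choice, and the remaining criteria (data dictionary coverage, relevance to the target) still need manual judgment.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for housing_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "country": ["England"] * 4,
    "bedrooms": [1.0, 2.0, np.nan, 4.0],
    "history_numericChange": [np.nan, np.nan, np.nan, 5000.0],
})

# Columns with a single distinct value carry no information for a model.
constant_cols = [col for col in toy_df.columns if toy_df[col].nunique(dropna=False) <= 1]

# Columns whose share of missing values exceeds the (assumed) threshold.
NULL_THRESHOLD = 0.5
sparse_cols = toy_df.columns[toy_df.isnull().mean() > NULL_THRESHOLD].tolist()

reduced_df = toy_df.drop(columns=constant_cols + sparse_cols)

print(constant_cols)
print(sparse_cols)
print(list(reduced_df.columns))
```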


### Null values

In [28]:
# Dropping all rows with null values 
housing_df = housing_df.dropna()

In [29]:
# Viewing the number of rows and columns post removal of rows with missing values
housing_df.shape

(265911, 25)
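
The cost of dropping every row with a null can be worked out from the shapes already printed in this notebook: 418,201 rows originally, minus the 7 duplicates, against the 265,911 rows that remain. A short worked calculation (the variable names are illustrative):

```python
# Row counts taken from the outputs in this notebook.
rows_before = 418_201 - 7   # rows left after drop_duplicates()
rows_after = 265_911        # rows left after dropna()

rows_removed = rows_before - rows_after
retained = rows_after / rows_before

print(f"{rows_removed:,} rows removed, {retained:.1%} retained")
# → 152,283 rows removed, 63.6% retained
```

Roughly a third of the rows are lost at this step, which is worth keeping in mind when interpreting later results.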

In [30]:
# Checking if all null values have been successfully dropped
housing_df.isnull().sum()

fullAddress                                  0
postcode                                     0
outcode                                      0
latitude                                     0
longitude                                    0
bathrooms                                    0
bedrooms                                     0
floorAreaSqM                                 0
livingRooms                                  0
tenure                                       0
propertyType                                 0
currentEnergyRating                          0
rentEstimate_lowerPrice                      0
rentEstimate_currentPrice                    0
rentEstimate_upperPrice                      0
saleEstimate_lowerPrice                      0
saleEstimate_currentPrice                    0
saleEstimate_upperPrice                      0
saleEstimate_confidenceLevel                 0
saleEstimate_ingestedAt                      0
saleEstimate_valueChange.numericChange       0
saleEstimate_

## Cleaned Data

In [31]:
# Viewing the data post removal of null values
housing_df.sample(10)

Unnamed: 0,fullAddress,postcode,outcode,latitude,longitude,bathrooms,bedrooms,floorAreaSqM,livingRooms,tenure,...,saleEstimate_lowerPrice,saleEstimate_currentPrice,saleEstimate_upperPrice,saleEstimate_confidenceLevel,saleEstimate_ingestedAt,saleEstimate_valueChange.numericChange,saleEstimate_valueChange.percentageChange,saleEstimate_valueChange.saleDate,history_date,history_price
41171,"19 Chanctonbury Way, London, N12 7JB",N12 7JB,N12,51.616176,-0.190473,1.0,3.0,111.0,2.0,Freehold,...,937000.0,987000.0,1036000.0,HIGH,2024-10-07T13:26:59.894Z,67000.0,7.282609,2023-01-18,1999-07-12,182500
216081,"445 Forest Road, London, E17 5LD",E17 5LD,E17,51.590241,-0.024739,1.0,2.0,74.0,1.0,Leasehold,...,454000.0,478000.0,502000.0,HIGH,2025-01-10T11:04:57.114Z,28000.0,6.222222,2023-02-27,2016-08-10,390000
135639,"62 Durham Road, London, E12 5AX",E12 5AX,E12,51.551306,0.041801,2.0,3.0,104.0,2.0,Freehold,...,631000.0,664000.0,697000.0,HIGH,2024-10-07T13:26:59.894Z,28000.0,4.445213,2023-11-22,2007-09-28,250000
333827,"104 Peters Court, Porchester Road, London, W2 5DS",W2 5DS,W2,51.516894,-0.188278,1.0,1.0,55.0,1.0,Leasehold,...,467000.0,492000.0,516000.0,HIGH,2025-01-10T11:04:57.114Z,22000.0,4.680851,2022-12-13,2022-12-13,470000
277511,"29 Dawlish Avenue, London, N13 4HP",N13 4HP,N13,51.616467,-0.121899,2.0,3.0,137.0,1.0,Freehold,...,944000.0,993000.0,1043000.0,HIGH,2024-10-07T13:26:59.894Z,133000.0,15.465116,2021-05-12,2021-05-12,860000
288755,"24 Ayrsome Road, London, N16 0RD",N16 0RD,N16,51.560799,-0.07993,2.0,4.0,140.0,1.0,Freehold,...,1366000.0,1438000.0,1510000.0,HIGH,2025-01-10T11:04:57.114Z,187000.0,14.948042,2021-09-29,2021-09-29,1251000
230100,"Flat 504, Baldwin Point, 6 Sayer Street, Londo...",SE17 1FG,SE17,51.491992,-0.096318,1.0,1.0,57.0,1.0,Leasehold,...,540000.0,569000.0,597000.0,HIGH,2024-10-07T13:26:59.894Z,59000.0,11.568627,2023-07-25,2017-10-20,555000
121632,"2 Walnut Gardens, London, E15 1LL",E15 1LL,E15,51.551096,0.003395,3.0,5.0,150.0,5.0,Freehold,...,594000.0,625000.0,657000.0,HIGH,2024-10-07T13:26:59.894Z,25000.0,4.166667,2024-04-22,2006-09-11,330000
341050,"123 South Park Crescent, London, SE6 1JL",SE6 1JL,SE6,51.439988,0.008758,1.0,3.0,94.0,2.0,Freehold,...,517000.0,544000.0,571000.0,HIGH,2025-01-10T11:04:57.114Z,4000.0,0.740741,2023-02-07,2023-02-07,540000
130408,"15 Denman House, Lordship Terrace, London, N16...",N16 0JD,N16,51.562444,-0.081564,1.0,2.0,24.0,1.0,Leasehold,...,461000.0,486000.0,510000.0,HIGH,2024-10-07T13:26:59.894Z,41000.0,9.213483,2019-10-23,2007-05-18,250000


In [32]:
# Saving the cleaned version of the dataset
housing_df.to_csv('london_house_price_data_clean.csv', index=False)
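
A quick way to confirm the save worked is to read the file back and compare its shape and columns with the frame in memory. The sketch below does this with a made-up frame and a temporary directory, so it does not touch the real output file; `toy_df` and the file name are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned housing_df; values are made up.
toy_df = pd.DataFrame({
    "fullAddress": ["1 Example Road", "2 Example Road"],
    "bedrooms": [2.0, 3.0],
    "history_price": [100000, 150000],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_clean.csv")
    # index=False keeps the row index out of the file, so no extra column appears on reload.
    toy_df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.shape == toy_df.shape)
print(list(reloaded.columns) == list(toy_df.columns))
```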