# Data Cleaning 

## Set up

In [33]:
# Imports
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [34]:
# Data Loading and viewing
housing_df = pd.read_csv(r'C:\Users\sanja\kaggle_london_house_price_data.csv')
housing_df.head()

Unnamed: 0,fullAddress,postcode,country,outcode,latitude,longitude,bathrooms,bedrooms,floorAreaSqM,livingRooms,...,saleEstimate_upperPrice,saleEstimate_confidenceLevel,saleEstimate_ingestedAt,saleEstimate_valueChange.numericChange,saleEstimate_valueChange.percentageChange,saleEstimate_valueChange.saleDate,history_date,history_price,history_percentageChange,history_numericChange
0,"Flat 9, 35 Furnival Street, London, EC4A 1JQ",EC4A 1JQ,England,EC4A,51.517282,-0.110314,1.0,1.0,45.0,1.0,...,630000.0,HIGH,2024-10-07T13:26:59.894Z,244000.0,68.539326,2010-03-30,1995-01-02,830000,,
1,"Flat 6, 35 Furnival Street, London, EC4A 1JQ",EC4A 1JQ,England,EC4A,51.517282,-0.110314,,,,,...,660000.0,MEDIUM,2024-10-07T13:26:59.894Z,425000.0,242.857143,2000-05-26,1995-01-02,830000,,
2,"Flat 35, Octavia House, Medway Street, London,...",SW1P 2TA,England,SW1P,51.495505,-0.132379,2.0,2.0,71.0,1.0,...,834000.0,MEDIUM,2025-01-10T11:04:57.114Z,49000.0,6.901408,2019-09-04,1995-01-03,249950,,
3,"24 Chester Court, Lomond Grove, London, SE5 7HS",SE5 7HS,England,SE5,51.478185,-0.092201,1.0,1.0,64.0,1.0,...,407000.0,HIGH,2024-10-07T13:26:59.894Z,28000.0,7.777778,2024-01-25,1995-01-03,32000,,
4,"18 Alexandra Gardens, London, N10 3RL",N10 3RL,England,N10,51.588774,-0.139599,1.0,4.0,137.0,2.0,...,1324000.0,HIGH,2024-10-07T13:26:59.894Z,81000.0,6.864407,2022-12-14,1995-01-03,133000,,


In [35]:
# Data types
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418201 entries, 0 to 418200
Data columns (total 28 columns):
 #   Column                                     Non-Null Count   Dtype  
---  ------                                     --------------   -----  
 0   fullAddress                                418201 non-null  object 
 1   postcode                                   418201 non-null  object 
 2   country                                    418201 non-null  object 
 3   outcode                                    418201 non-null  object 
 4   latitude                                   418201 non-null  float64
 5   longitude                                  418201 non-null  float64
 6   bathrooms                                  340270 non-null  float64
 7   bedrooms                                   377665 non-null  float64
 8   floorAreaSqM                               392991 non-null  float64
 9   livingRooms                                357701 non-null  float64
 10  tenure  

## Shape of the data 

### Data format

In [36]:
# Viewing how many rows and columns there are
housing_df.shape

(418201, 28)

The dataset has 418,201 rows and 28 columns, so there are 418,201 records, each with 28 attributes. Each row is one record in the dataset; as the unique-value counts further down show, this is not necessarily the same as 418,201 distinct properties.

### Null values

In [37]:
# Viewing how many null values
housing_df.isnull().sum()

fullAddress                                       0
postcode                                          0
country                                           0
outcode                                           0
latitude                                          0
longitude                                         0
bathrooms                                     77931
bedrooms                                      40536
floorAreaSqM                                  25210
livingRooms                                   60500
tenure                                        11512
propertyType                                   1136
currentEnergyRating                           84526
rentEstimate_lowerPrice                        1741
rentEstimate_currentPrice                      1741
rentEstimate_upperPrice                        1741
saleEstimate_lowerPrice                         640
saleEstimate_currentPrice                       640
saleEstimate_upperPrice                         640
saleEstimate

The output above shows that the number of null values varies drastically from column to column. With portions of the data missing, dropping too many rows could substantially reduce the size of the dataset, which might harm the reliability of any statistical analysis or predictive modeling.

However, the missing values in each column appear to make up only a modest share of the rows (less than forty percent; for example, currentEnergyRating, the largest gap among the columns shown above, is missing in about 20% of rows), with the exception of the history percentage change and history numeric change columns. Rows with missing values can therefore be dropped without a significant effect on the data, as long as the values are missing at random and removing them does not change the distributions of the features.
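
To make the "share of missing values" argument concrete, the counts from `isnull().sum()` can be turned into percentages, and the cost of dropping incomplete rows can be measured before committing to it. The following is a minimal sketch on a toy frame; `toy_df` and its values are illustrative assumptions, not rows from the dataset, and the same calls would apply to `housing_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for housing_df (values are illustrative, not from the dataset)
toy_df = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0, 1.0],
    "bedrooms": [1.0, np.nan, 2.0, 4.0],
    "floorAreaSqM": [45.0, np.nan, 71.0, np.nan],
})

# Percentage of missing values per column, largest first
missing_pct = (toy_df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# How many rows would survive if every row with at least one null were dropped
rows_kept = len(toy_df.dropna())
share_kept = rows_kept / len(toy_df)
print(f"{rows_kept} of {len(toy_df)} rows kept ({share_kept:.0%})")
```

Comparing `share_kept` with the per-column percentages shows whether the nulls are concentrated in the same rows or spread across many different rows, which is what determines how much data a row-wise drop actually removes.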




### Duplicates

In [38]:
# Viewing the total number of duplicates 
print(housing_df.duplicated().sum())

7


The output shows that, across the whole dataset, there are only seven duplicated rows, i.e. rows that are identical in every column to an earlier row.
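
Before removing duplicates it can help to look at them. `duplicated()` compares whole rows and, by default, flags only the second and later copies; passing `keep=False` flags every member of a duplicate group so the copies can be inspected side by side. A small sketch, where `toy_df` and its rows are made up for illustration:

```python
import pandas as pd

# Toy stand-in for housing_df (illustrative rows only)
toy_df = pd.DataFrame({
    "fullAddress": ["1 Example Road", "1 Example Road", "2 Example Road"],
    "history_date": ["2001-05-01", "2001-05-01", "2010-07-15"],
    "history_price": [100000, 100000, 250000],
})

# Default keep="first": only repeats of an earlier row are counted
n_repeats = toy_df.duplicated().sum()

# keep=False: every row that belongs to a duplicate group is selected
all_copies = toy_df[toy_df.duplicated(keep=False)]

print(n_repeats)
print(all_copies)
```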

In [39]:
# Viewing how many unique values per column
print(housing_df.nunique())

fullAddress                                  137760
postcode                                      48392
country                                           1
outcode                                         168
latitude                                      92247
longitude                                     93060
bathrooms                                         9
bedrooms                                          9
floorAreaSqM                                    491
livingRooms                                       9
tenure                                            4
propertyType                                     19
currentEnergyRating                               7
rentEstimate_lowerPrice                         748
rentEstimate_currentPrice                       807
rentEstimate_upperPrice                         861
saleEstimate_lowerPrice                        5586
saleEstimate_currentPrice                      6152
saleEstimate_upperPrice                        6724
saleEstimate

The output above shows that most columns have far fewer unique values than there are rows, which is to be expected. Properties in the same block of flats may share a postcode, and numerical values such as sale and rental estimates will naturally repeat across properties.

However, full address is a column that should be unique per property, yet there are only 137,760 distinct addresses, about a third of the 418,201 rows. This indicates that some properties have been reported on more than once, perhaps at different dates with different prices. Further examination of this column is required.
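
One way to examine the repeated addresses is to count records per address and check whether the repeats differ in sale date. The sketch below uses a toy frame with the dataset's column names `fullAddress`, `history_date` and `history_price`; the rows themselves are invented for illustration.

```python
import pandas as pd

# Toy stand-in for housing_df (illustrative rows only)
toy_df = pd.DataFrame({
    "fullAddress": ["1 Example Road", "1 Example Road", "2 Example Road", "1 Example Road"],
    "history_date": ["1999-03-01", "2008-06-12", "2010-07-15", "2021-11-30"],
    "history_price": [90000, 210000, 250000, 400000],
})

# Distinct addresses as a share of all rows
unique_share = toy_df["fullAddress"].nunique() / len(toy_df)

# Number of records per address, most frequent first
records_per_address = toy_df["fullAddress"].value_counts()

# Distinct sale dates per address: more than one suggests separate sales of the same property
dates_per_address = toy_df.groupby("fullAddress")["history_date"].nunique()

print(unique_share)
print(records_per_address)
print(dates_per_address)
```

If most repeated addresses have as many distinct dates as records, the repeats are more plausibly a sales history than data-entry duplication.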

## Omitting data

In [40]:
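# Dropping all duplicate rows in the dataframe
# drop_duplicates() returns a new dataframe, so the result is assigned back
housing_df = housing_df.drop_duplicates()
housing_df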


Unnamed: 0,fullAddress,postcode,country,outcode,latitude,longitude,bathrooms,bedrooms,floorAreaSqM,livingRooms,...,saleEstimate_upperPrice,saleEstimate_confidenceLevel,saleEstimate_ingestedAt,saleEstimate_valueChange.numericChange,saleEstimate_valueChange.percentageChange,saleEstimate_valueChange.saleDate,history_date,history_price,history_percentageChange,history_numericChange
0,"Flat 9, 35 Furnival Street, London, EC4A 1JQ",EC4A 1JQ,England,EC4A,51.517282,-0.110314,1.0,1.0,45.0,1.0,...,630000.0,HIGH,2024-10-07T13:26:59.894Z,244000.0,68.539326,2010-03-30,1995-01-02,830000,,
1,"Flat 6, 35 Furnival Street, London, EC4A 1JQ",EC4A 1JQ,England,EC4A,51.517282,-0.110314,,,,,...,660000.0,MEDIUM,2024-10-07T13:26:59.894Z,425000.0,242.857143,2000-05-26,1995-01-02,830000,,
2,"Flat 35, Octavia House, Medway Street, London,...",SW1P 2TA,England,SW1P,51.495505,-0.132379,2.0,2.0,71.0,1.0,...,834000.0,MEDIUM,2025-01-10T11:04:57.114Z,49000.0,6.901408,2019-09-04,1995-01-03,249950,,
3,"24 Chester Court, Lomond Grove, London, SE5 7HS",SE5 7HS,England,SE5,51.478185,-0.092201,1.0,1.0,64.0,1.0,...,407000.0,HIGH,2024-10-07T13:26:59.894Z,28000.0,7.777778,2024-01-25,1995-01-03,32000,,
4,"18 Alexandra Gardens, London, N10 3RL",N10 3RL,England,N10,51.588774,-0.139599,1.0,4.0,137.0,2.0,...,1324000.0,HIGH,2024-10-07T13:26:59.894Z,81000.0,6.864407,2022-12-14,1995-01-03,133000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
418196,"9 Harvard Road, London, SE13 6SE",SE13 6SE,England,SE13,51.452919,-0.010221,2.0,3.0,123.0,2.0,...,711000.0,HIGH,2025-01-10T11:04:57.114Z,27000.0,4.153846,2024-09-27,2024-09-27,650000,17.117117,95000.0
418197,"Lower Ground Floor Flat 5, Northwood Hall, Hor...",N6 5PE,England,N6,51.571692,-0.137007,1.0,2.0,60.0,1.0,...,473000.0,HIGH,2025-01-10T11:04:57.114Z,10000.0,2.272727,2024-09-27,2024-09-27,440000,2.325581,10000.0
418198,"17 Merritt Road, London, SE4 1DU",SE4 1DU,England,SE4,51.456054,-0.035073,2.0,3.0,100.0,1.0,...,853000.0,HIGH,2025-01-10T11:04:57.114Z,38000.0,4.909561,2024-09-27,2024-09-27,774000,128.655835,435500.0
418199,"15 Chester Row, London, SW1W 9JF",SW1W 9JF,England,SW1W,51.493587,-0.152122,3.0,4.0,218.0,2.0,...,5975000.0,MEDIUM,2025-01-10T11:04:57.114Z,-818000.0,-13.088000,2024-09-27,2024-09-27,6250000,73.611111,2650000.0


In [41]:
# Dropping columns
housing_df = housing_df.drop(['country', 'history_percentageChange', 'history_numericChange'], axis=1)

In [43]:
# Viewing the data post column removal
housing_df.head()

Unnamed: 0,fullAddress,postcode,outcode,latitude,longitude,bathrooms,bedrooms,floorAreaSqM,livingRooms,tenure,...,saleEstimate_lowerPrice,saleEstimate_currentPrice,saleEstimate_upperPrice,saleEstimate_confidenceLevel,saleEstimate_ingestedAt,saleEstimate_valueChange.numericChange,saleEstimate_valueChange.percentageChange,saleEstimate_valueChange.saleDate,history_date,history_price
0,"Flat 9, 35 Furnival Street, London, EC4A 1JQ",EC4A 1JQ,EC4A,51.517282,-0.110314,1.0,1.0,45.0,1.0,Leasehold,...,570000.0,600000.0,630000.0,HIGH,2024-10-07T13:26:59.894Z,244000.0,68.539326,2010-03-30,1995-01-02,830000
1,"Flat 6, 35 Furnival Street, London, EC4A 1JQ",EC4A 1JQ,EC4A,51.517282,-0.110314,,,,,Leasehold,...,540000.0,600000.0,660000.0,MEDIUM,2024-10-07T13:26:59.894Z,425000.0,242.857143,2000-05-26,1995-01-02,830000
2,"Flat 35, Octavia House, Medway Street, London,...",SW1P 2TA,SW1P,51.495505,-0.132379,2.0,2.0,71.0,1.0,Leasehold,...,683000.0,759000.0,834000.0,MEDIUM,2025-01-10T11:04:57.114Z,49000.0,6.901408,2019-09-04,1995-01-03,249950
3,"24 Chester Court, Lomond Grove, London, SE5 7HS",SE5 7HS,SE5,51.478185,-0.092201,1.0,1.0,64.0,1.0,Leasehold,...,368000.0,388000.0,407000.0,HIGH,2024-10-07T13:26:59.894Z,28000.0,7.777778,2024-01-25,1995-01-03,32000
4,"18 Alexandra Gardens, London, N10 3RL",N10 3RL,N10,51.588774,-0.139599,1.0,4.0,137.0,2.0,Freehold,...,1198000.0,1261000.0,1324000.0,HIGH,2024-10-07T13:26:59.894Z,81000.0,6.864407,2022-12-14,1995-01-03,133000


The country column was removed because it holds a single value across the whole dataset: every property is in England. A constant column carries no information that a statistical model could use, so there was no need to keep it.

The history percentage change and history numeric change columns were removed because of their high number of missing values. The data dictionary also does not explain exactly what these columns represent, and it is fairly unlikely that they would have a significant impact on the target variable.
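
The two removal criteria used here (a single constant value, or too many missing values) can also be checked programmatically rather than read off the printed output. The sketch below is one possible way to do that on a toy frame; `toy_df`, its values and the `max_missing` cut-off are assumptions chosen for illustration, not settings taken from the analysis above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for housing_df (illustrative values only)
toy_df = pd.DataFrame({
    "country": ["England", "England", "England", "England"],
    "bedrooms": [1.0, 2.0, np.nan, 4.0],
    "history_numericChange": [np.nan, np.nan, np.nan, 95000.0],
})

# Columns with exactly one distinct value (counting NaN as a value) are constant
constant_cols = [c for c in toy_df.columns if toy_df[c].nunique(dropna=False) == 1]

# Columns whose share of missing values exceeds an assumed cut-off
max_missing = 0.5
sparse_cols = toy_df.columns[toy_df.isnull().mean() > max_missing].tolist()

# Drop both groups and keep the rest
cols_to_drop = constant_cols + sparse_cols
cleaned_df = toy_df.drop(columns=cols_to_drop)

print(cols_to_drop)
print(cleaned_df.columns.tolist())
```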

## Handling null values