## Introduction

This assessment requires me to produce a machine learning model that is trained, tested, and evaluated using a set of secondary data.

Several datasets were explored before selecting a suitable one for this project. The chosen dataset is the Price Paid Data published by HM Land Registry, the government body that records property transactions in England and Wales. It contains detailed information on residential property sales, including sale price, date of transfer, property type, location, tenure, and whether the property is newly built.
The data is published on GOV.UK and is free to use: https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads

## Importing the dataset

In [27]:
import pandas as pd
import numpy as np


The next cell reads the two CSV files and concatenates them into a single DataFrame, so that the rest of the analysis works on one unified dataset.

In [36]:
df1 = pd.read_csv("../data/pp-2023-part1.csv")
df2 = pd.read_csv("../data/pp-2023-part2.csv")
df = pd.concat([df1, df2], ignore_index=True)
df.shape

## (rows, columns)


(856734, 27)

In [40]:
# Displays the first five rows of the dataset to provide an initial overview of the data structure and values.
df.head()


| | {0E082196-CE18-5C09-E063-4704A8C0A10E} | 221000 | 2023-09-22 00:00 | PL6 6JX | T | N | F | 3 | Unnamed: 8 | PILLAR WALK | ... | 150000 | 2023-04-21 00:00 | NG10 2BH | 44 | LANDSDOWN GROVE | LONG EATON | NOTTINGHAM | EREWASH | DERBYSHIRE | B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {0E082196-CE19-5C09-E063-4704A8C0A10E} | 228000.0 | 2023-08-25 00:00 | PL7 1SJ | S | N | F | 102 | | MERAFIELD ROAD | ... | | | | | | | | | | |
| 1 | {0E082196-CE1A-5C09-E063-4704A8C0A10E} | 480000.0 | 2023-10-26 00:00 | TQ6 0AS | F | N | L | 1A | | RIVER VIEW | ... | | | | | | | | | | |
| 2 | {0E082196-CE1B-5C09-E063-4704A8C0A10E} | 625000.0 | 2023-07-14 00:00 | TQ1 2HB | D | N | F | 14 | | OXLEA CLOSE | ... | | | | | | | | | | |
| 3 | {0E082196-CE1C-5C09-E063-4704A8C0A10E} | 174000.0 | 2023-08-04 00:00 | PL2 1LL | T | N | F | 58 | | ST AUBYN AVENUE | ... | | | | | | | | | | |
| 4 | {0E082196-CE1D-5C09-E063-4704A8C0A10E} | 87500.0 | 2023-11-10 00:00 | PL1 4HR | F | N | L | 72 | | GEORGE STREET | ... | | | | | | | | | | |
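The column labels in this output are themselves transaction values (an identifier, a price, a date, a postcode), which suggests that the CSV files have no header row and that `pd.read_csv` has used the first record of each file as the header. Because the two files then carry different labels, `pd.concat` aligns only the few labels they happen to share, which would explain the 27 columns and the large number of empty cells. The following is a minimal sketch of one way to read such files, not the method used in this notebook. The helper `load_price_paid` and the names in `PPD_COLUMNS` are hypothetical labels chosen for illustration, assuming the 16-column order documented on GOV.UK, and the two sample rows are made up.

```python
import io

import pandas as pd

# Hypothetical column labels, assuming the 16-column order documented on GOV.UK.
PPD_COLUMNS = [
    "transaction_id", "price", "date_of_transfer", "postcode",
    "property_type", "old_new", "duration", "paon", "saon", "street",
    "locality", "town_city", "district", "county",
    "ppd_category", "record_status",
]


def load_price_paid(sources):
    """Read header-less Price Paid CSVs and stack them row-wise."""
    frames = [
        pd.read_csv(
            src,
            header=None,                     # the first line is data, not labels
            names=PPD_COLUMNS,               # apply the same labels to every file
            parse_dates=["date_of_transfer"],
        )
        for src in sources
    ]
    return pd.concat(frames, ignore_index=True)


# Made-up single-row stand-ins for the two real files.
part1 = io.StringIO(
    "{ID-1},221000,2023-09-22 00:00,AB1 2CD,T,N,F,3,,EXAMPLE WALK,,"
    "TOWN A,DISTRICT A,COUNTY A,A,A\n"
)
part2 = io.StringIO(
    "{ID-2},150000,2023-04-21 00:00,EF3 4GH,S,N,F,44,,EXAMPLE GROVE,"
    "LOCALITY B,TOWN B,DISTRICT B,COUNTY B,B,A\n"
)

sample = load_price_paid([part1, part2])
print(sample.shape)
print(sample.dtypes)
```

With consistent labels on both files, the concatenation stacks rows under the same 16 columns instead of creating a separate set of columns for each file.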


In [41]:
# Returns the number of rows and columns in the dataset to show its overall size.
df.shape


(856734, 27)

In [42]:
# Provides a summary of the dataset including column names, data types, and non-null value counts.
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 856734 entries, 0 to 856733
Data columns (total 27 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   {0E082196-CE18-5C09-E063-4704A8C0A10E}  428367 non-null  object 
 1   221000                                  428367 non-null  float64
 2   2023-09-22 00:00                        428367 non-null  object 
 3   PL6 6JX                                 427335 non-null  object 
 4   T                                       856734 non-null  object 
 5   N                                       856734 non-null  object 
 6   F                                       856734 non-null  object 
 7   3                                       428367 non-null  object 
 8   Unnamed: 8                              118592 non-null  object 
 9   PILLAR WALK                             421194 non-null  object 
 10  Unnamed: 10                             1598

In [43]:
# Calculates the total number of missing values in each column to identify data quality issues.
df.isnull().sum()


{0E082196-CE18-5C09-E063-4704A8C0A10E}    428367
221000                                    428367
2023-09-22 00:00                          428367
PL6 6JX                                   429399
T                                              0
N                                              0
F                                              0
3                                         428367
Unnamed: 8                                738142
PILLAR WALK                               435540
Unnamed: 10                               696921
PLYMOUTH                                  428367
CITY OF PLYMOUTH                          428367
CITY OF PLYMOUTH.1                        428367
A                                              0
A.1                                       428367
{FD226036-863F-4CB7-E053-4804A8C00430}    428367
150000                                    428367
2023-04-21 00:00                          428367
NG10 2BH                                  429576
44                  
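Many columns in this output are missing exactly 428,367 values, which is half of the 856,734 rows. Expressing the counts as percentages makes that kind of pattern easier to spot. The snippet below is a minimal sketch on a small made-up DataFrame; the helper name `missing_summary` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return per-column missing counts and percentages, worst first."""
    summary = pd.DataFrame({
        "missing": frame.isnull().sum(),
        "percent": (frame.isnull().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Toy data: two columns are half empty, one is complete.
toy = pd.DataFrame({
    "price": [221000.0, np.nan, 480000.0, np.nan],
    "postcode": ["PL6 6JX", "PL7 1SJ", None, None],
    "property_type": ["T", "S", "F", "D"],
})

report = missing_summary(toy)
print(report)
```

Calling `missing_summary(df)` on the real DataFrame would give the same two columns for every feature.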

In [44]:
# Generates descriptive statistics for numerical features, including mean, standard deviation, and quartiles.
df.describe()


| | 221000 | 150000 |
|---|---|---|
| count | 428367.0 | 428367.0 |
| mean | 404828.4 | 406198.0 |
| std | 1640404.0 | 1396228.0 |
| min | 1.0 | 100.0 |
| 25% | 175000.0 | 177000.0 |
| 50% | 272995.0 | 280000.0 |
| 75% | 420000.0 | 425000.0 |
| max | 393000000.0 | 251000000.0 |
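The summary shows a maximum of 393,000,000 against a median of 272,995 and a minimum of 1.0, which points to a long right tail and some implausibly small prices. A minimal sketch of how the tails could be inspected is given below, assuming a price column is available as a pandas Series. The values are a toy sample borrowed from the outputs above, so the printed figures do not describe the full dataset.

```python
import pandas as pd

# Toy sample: five prices from the preview rows plus the reported min and max.
prices = pd.Series(
    [1.0, 87500.0, 174000.0, 228000.0, 480000.0, 625000.0, 393000000.0],
    name="price",
)

# Request extra percentiles so the extremes are visible next to the median.
tail = prices.describe(percentiles=[0.01, 0.5, 0.99])

# Ratio of the largest value to the median, as a rough indicator of skew.
ratio = prices.max() / prices.median()

print(tail)
print(f"max / median = {ratio:,.0f}")
```

A large ratio such as this one is a hint that extreme values may need separate treatment before modelling, for example by capping them or working with log prices.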
