In [13]:
#Import libraries
import pandas as pd

# Pre-cleaning Exploration
This section only explores the dataset, which is taken from https://www.kaggle.com/arnavkulkarni/housing-prices-in-london.

The main objective here is to explore the dataset and note what needs to be cleaned. The cleaning itself will be done in a separate Python file. The goal of the project is to predict house prices in London and make recommendations based on a price/budget, giving the user an idea of location and size. The focus is therefore on keeping features that could have a big impact on the price of a house.

In [14]:
#Importing the data and viewing the head of the data
df = pd.read_csv("London.csv")
df.head()

,Unnamed: 0,Property Name,Price,House Type,Area in sq ft,No. of Bedrooms,No. of Bathrooms,No. of Receptions,Location,City/County,Postal Code
0,0,Queens Road,1675000,House,2716,5,5,5,Wimbledon,London,SW19 8NY
1,1,Seward Street,650000,Flat / Apartment,814,2,2,2,Clerkenwell,London,EC1V 3PA
2,2,Hotham Road,735000,Flat / Apartment,761,2,2,2,Putney,London,SW15 1QL
3,3,Festing Road,1765000,House,1986,4,4,4,Putney,London,SW15 1LP
4,4,Spencer Walk,675000,Flat / Apartment,700,2,2,2,Putney,London,SW15 1PL


In [15]:
#Checking how much data there is and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3480 entries, 0 to 3479
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         3480 non-null   int64 
 1   Property Name      3480 non-null   object
 2   Price              3480 non-null   int64 
 3   House Type         3480 non-null   object
 4   Area in sq ft      3480 non-null   int64 
 5   No. of Bedrooms    3480 non-null   int64 
 6   No. of Bathrooms   3480 non-null   int64 
 7   No. of Receptions  3480 non-null   int64 
 8   Location           2518 non-null   object
 9   City/County        3480 non-null   object
 10  Postal Code        3480 non-null   object
dtypes: int64(6), object(5)
memory usage: 299.2+ KB


In [16]:
# The dataset also contains rows from outside London, e.g. Surrey and Essex
df["City/County"].value_counts().head(5)

London        2972
Surrey         262
Middlesex       78
Essex           62
Twickenham      12
Name: City/County, dtype: int64

During cleaning I'll keep only the London rows, then drop the column. Some of the other values are actually London areas, for example Twickenham above, but there are only a few of them and they are spread across many different values.
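
The filter described above can be sketched as follows. This is a minimal illustration on a stand-in frame, not the actual cleaning script: `sample` and `london_only` are names introduced here, and the prices in the two non-London rows are made up.

```python
import pandas as pd

# Tiny stand-in frame; the real data is loaded from London.csv
sample = pd.DataFrame({
    "Price": [1675000, 650000, 500000, 420000],
    "City/County": ["London", "London", "Surrey", "Essex"],
})

# Keep only the rows labelled London, then drop the now-constant column
london_only = (
    sample[sample["City/County"] == "London"]
    .drop(columns="City/County")
    .reset_index(drop=True)
)
print(london_only)
```

Note that an exact match on "London" also discards London areas stored under another label, such as Twickenham, which is the trade-off accepted above.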

In [17]:
# Location within London should influence house prices. As there are many locations,
# I will look at grouping the postcodes and then drop this column, since every location would need to be converted to its own column for modelling
df["Location"].value_counts()

Putney              96
Barnes              71
Wandsworth          70
Wimbledon           68
Esher               64
                    ..
Medway Street        1
22 Bute Gardens      1
Kenninghall Road     1
Duchess Walk         1
22 Ensign Street     1
Name: Location, Length: 656, dtype: int64

In [18]:
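# It looks like bedrooms, bathrooms and receptions could hold the same values in every row
# Restrict to numeric columns; newer pandas versions raise an error when text columns are included
corr = df.select_dtypes("number").corr()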
corr

,Unnamed: 0,Price,Area in sq ft,No. of Bedrooms,No. of Bathrooms,No. of Receptions
Unnamed: 0,1.0,0.142117,0.055871,-0.018649,-0.018649,-0.018649
Price,0.142117,1.0,0.66771,0.435533,0.435533,0.435533
Area in sq ft,0.055871,0.66771,1.0,0.777299,0.777299,0.777299
No. of Bedrooms,-0.018649,0.435533,0.777299,1.0,1.0,1.0
No. of Bathrooms,-0.018649,0.435533,0.777299,1.0,1.0,1.0
No. of Receptions,-0.018649,0.435533,0.777299,1.0,1.0,1.0
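
A correlation of 1.0 only shows a perfect linear relationship, not that the columns are equal. A minimal sketch of a direct equality check is shown below; it uses a stand-in frame built from the room counts of the first four rows shown earlier, and `sample`, `room_cols` and `all_identical` are names introduced here.

```python
import pandas as pd

# Stand-in frame using the room counts from the first four rows of the data
sample = pd.DataFrame({
    "No. of Bedrooms": [5, 2, 2, 4],
    "No. of Bathrooms": [5, 2, 2, 4],
    "No. of Receptions": [5, 2, 2, 4],
})

room_cols = ["No. of Bedrooms", "No. of Bathrooms", "No. of Receptions"]

# Compare every room column with the first one, value by value
all_identical = all(
    (sample[col] == sample[room_cols[0]]).all() for col in room_cols[1:]
)
print(all_identical)
```

Running the same comparison on `df` would confirm whether two of the three columns carry no extra information.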


In [8]:
#Checking the types of houses there are
df["House Type"].value_counts()

Flat / Apartment    1565
House               1430
New development      357
Penthouse            100
Studio                10
Bungalow               9
Duplex                 7
Mews                   2
Name: House Type, dtype: int64

House type will be dropped for modelling. The features used will be area, postcode group and number of rooms.

In [9]:
df["Location"].isnull().value_counts()

False    2518
True      962
Name: Location, dtype: int64

In [10]:
# Slice the first two characters of each postal code to see how the data is distributed
df["area_code"] = df["Postal Code"].apply(lambda x: x[:2])
df["area_code"].value_counts().head(10)

SW    1525
NW     256
W1     240
KT     191
N1     145
E1     141
TW      93
EC      92
W4      81
HA      66
Name: area_code, dtype: int64

The SW prefix dominates the data, with 1,525 of the 3,480 rows (about 44%), far more than any other prefix. So I won't be using location or postcodes within the models.
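
One caveat with `x[:2]` is that it mixes two-letter postcode areas such as SW with letter-plus-digit prefixes such as W1 and N1. A minimal sketch of extracting only the leading letters is shown below; `postcode_area` is a hypothetical helper, not part of the notebook, and the last example string is made up for illustration.

```python
import re

# A UK postcode area is the one or two letters at the start of the postcode,
# e.g. "SW" in "SW19 8NY"
AREA_PATTERN = re.compile(r"^([A-Za-z]{1,2})")

def postcode_area(postcode: str) -> str:
    """Return the leading letters of a postcode, upper-cased, or '' if there are none."""
    match = AREA_PATTERN.match(postcode.strip())
    return match.group(1).upper() if match else ""

# The first two postcodes appear in the data; the third is a made-up example
for code in ["SW19 8NY", "EC1V 3PA", "N1 9XX"]:
    print(code, "->", postcode_area(code))
```

It could be applied with `df["Postal Code"].apply(postcode_area)` in place of the lambda, which would group W1 and W4 under W.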

# Data cleaning checklist:
- Drop non-London rows, then drop the City/County column
- Drop bathrooms, house type and receptions
- Drop columns with missing values (Location)
- Rename column headers to lowercase with no spaces, and create dummy variables for house type
- Convert price, area (sq ft) and bedrooms to floats
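
The checklist could be turned into a single function along these lines. This is only a sketch under stated assumptions, not the actual cleaning file: `clean` is a hypothetical helper, the header naming scheme is a guess, the third sample row is made up, and because the checklist mentions both dropping house type and creating dummy variables for it, the sketch simply drops it.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies the checklist; not the author's script."""
    # Keep London rows only, then drop the now-constant column
    out = df[df["City/County"] == "London"].drop(columns="City/County")
    # Drop bathrooms, house type and receptions (assumption: no dummies are created)
    out = out.drop(columns=["No. of Bathrooms", "House Type", "No. of Receptions"])
    # Drop any column that still contains missing values (Location in this data)
    out = out.dropna(axis="columns")
    # Lowercase headers without spaces; the exact naming scheme is an assumption
    out.columns = [
        c.lower().replace(".", "").replace("/", "_").replace(" ", "_")
        for c in out.columns
    ]
    # Convert the numeric features to floats
    numeric_cols = ["price", "area_in_sq_ft", "no_of_bedrooms"]
    out[numeric_cols] = out[numeric_cols].astype(float)
    return out.reset_index(drop=True)

# Stand-in frame: two rows from the data plus one made-up non-London row
sample = pd.DataFrame({
    "Property Name": ["Queens Road", "Seward Street", "Example Close"],
    "Price": [1675000, 650000, 500000],
    "House Type": ["House", "Flat / Apartment", "House"],
    "Area in sq ft": [2716, 814, 1000],
    "No. of Bedrooms": [5, 2, 3],
    "No. of Bathrooms": [5, 2, 3],
    "No. of Receptions": [5, 2, 3],
    "Location": ["Wimbledon", None, None],
    "City/County": ["London", "London", "Surrey"],
    "Postal Code": ["SW19 8NY", "EC1V 3PA", "KT1 1AA"],
})

cleaned = clean(sample)
print(cleaned.dtypes)
```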