# Capstone Workbook 3: Pre-processing

In [1]:
# Import libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Import data 
airbnb_ldn = pd.read_csv('airbnb_ldn_final.csv')

In [3]:
airbnb_ldn.drop(columns='Unnamed: 0', inplace=True)

In [4]:
airbnb_ldn.shape

(32674, 36)

In [5]:
airbnb_ldn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32674 entries, 0 to 32673
Data columns (total 36 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Listing Title                                      32674 non-null  object 
 1   Property Type                                      32674 non-null  object 
 2   Listing Type                                       32674 non-null  object 
 3   City                                               32674 non-null  object 
 4   Zipcode                                            32674 non-null  object 
 5   Number of Reviews                                  32674 non-null  int64  
 6   Bedrooms                                           32674 non-null  object 
 7   Bathrooms                                          32674 non-null  int64  
 8   Max Guests                                         32674 non-null  int64  
 9   Airbnb

In [6]:
# split into categorical and numerical columns
cat_cols = airbnb_ldn.select_dtypes(include='object')
num_cols = airbnb_ldn.select_dtypes(exclude='object')


In [7]:
# View categorical columns
cat_cols.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Listing Title | Cozy 2BR house with a garden view | GuestReady - Amazing home with a private garden | Cosy cottage on Richmond Park | Entire Flat. Free parking, Garden , Richmond park | Maisonette inbetween Richmond Park and Wimbledon |
| Property Type | Entire home | Entire home | Entire home | Entire rental unit | Private room in rental unit |
| Listing Type | entire_home | entire_home | entire_home | entire_home | private_room |
| City | Greater London | Greater London | Greater London | Greater London | Greater London |
| Zipcode | SW15 3 | SW15 3 | SW15 3 | SW15 3 | SW15 3 |
| Bedrooms | 2 | 2 | 1 | 2 | 1 |
| Airbnb Superhost | f | t | f | f | f |
| Cancellation Policy | strict_14_with_grace_period | NaN | NaN | strict_14_with_grace_period | strict_14_with_grace_period |
| Check-in Time | 12:00 PM - 12:00 AM | 3:00 PM - 12:00 AM | After 3:00 PM | 3:00 PM - 11:00 PM | 12:00 PM - 10:00 PM |
| Checkout Time | 10:00 AM | 11:00 AM | 11:00 AM | 11:00 AM | 11:00 AM |


## Binary Columns

Looking at the categorical columns, a couple can immediately be identified as candidates for numerical transformation.

To begin with, 'Airbnb Superhost' is a binary column and can thus be made numerical:

In [8]:
# confirm Airbnb superhost is binary:
airbnb_ldn['Airbnb Superhost']

0        f
1        t
2        f
3        f
4        f
        ..
32669    f
32670    f
32671    f
32672    f
32673    f
Name: Airbnb Superhost, Length: 32674, dtype: object

Only two values appear, f (false) and t (true), which suggests the column is binary. It will now be made numerical:

In [9]:
# make the column numerical (1 for 't', 0 otherwise) in both dataframes
cat_cols['Airbnb Superhost'] = np.where(cat_cols['Airbnb Superhost'] == 't', 1, 0)
airbnb_ldn['Airbnb Superhost'] = np.where(airbnb_ldn['Airbnb Superhost'] == 't', 1, 0)

In [10]:
# check conversion has worked:
airbnb_ldn['Airbnb Superhost'].value_counts()

Airbnb Superhost
0    24696
1     7978
Name: count, dtype: int64
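
As a side note, the binary check and conversion can be sketched on toy data as below. The series values are made up for illustration. The point of the sketch is that `unique()` lists every distinct value (including missing ones), and that an explicit mapping leaves anything other than 't' or 'f' as NaN rather than silently turning it into 0, which is what a `np.where(... == 't', 1, 0)` comparison would do.

```python
import pandas as pd

# Toy stand-in for the 'Airbnb Superhost' column (values are illustrative only).
superhost = pd.Series(['f', 't', 'f', None, 't'], name='Airbnb Superhost')

# unique() returns every distinct value, including missing ones,
# so unexpected entries are visible before converting.
distinct_values = superhost.unique().tolist()

# Explicit mapping: values outside the dictionary (here the missing one) become NaN.
superhost_flag = superhost.map({'t': 1, 'f': 0})

print(distinct_values)
print(superhost_flag)
```

If the column has no missing or unexpected values, both approaches give the same result.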

Next, other columns with a small number of distinct values, or whose values could be grouped into fewer categories, are considered. Those initially identified:

- Listing Type
- Cancellation Policy
- Check-in Time
- Checkout Time
- Bedrooms
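
One way to shortlist such columns is to count the distinct values in each non-numeric column. The sketch below uses a small made-up dataframe (the rows are illustrative, not taken from the dataset) and counts NaN as a value of its own.

```python
import pandas as pd

# Toy dataframe; column names mirror the dataset, values are illustrative.
toy = pd.DataFrame({
    'Listing Type': ['entire_home', 'private_room', 'entire_home', 'shared_room'],
    'Checkout Time': ['11:00 AM', '10:00 AM', None, '11:00 AM'],
    'Listing Title': ['A', 'B', 'C', 'D'],
    'Max Guests': [2, 4, 3, 1],
})

# Distinct values per non-numeric column, counting NaN as its own value,
# sorted so the low-cardinality candidates come first.
cardinality = (
    toy.select_dtypes(exclude='number')
       .nunique(dropna=False)
       .sort_values()
)
print(cardinality)
```

Columns near the top of the result are candidates for grouping or one-hot encoding, while free-text columns such as 'Listing Title' end up at the bottom.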

## Cancellation Policy

In [11]:
# check values within cancellation policy:
print(airbnb_ldn['Cancellation Policy'].value_counts())
print(f"Null values: {airbnb_ldn['Cancellation Policy'].isnull().sum()}")

Cancellation Policy
strict_14_with_grace_period         7027
moderate                            5671
flexible                            4971
better_strict_with_grace_period     1224
super_strict_30                       60
super_strict_60                       26
firm_30_strict_with_grace_period      18
Name: count, dtype: int64
Null values: 13677


The cancellation policies can be grouped into four main categories (new group : original values):
- No policy : Null values
- Medium : moderate, flexible, luxury_moderate
- Strict : strict_14_with_grace_period, better_strict_with_grace_period, firm_30_strict_with_grace_period
- Super strict : super_strict_30, super_strict_60

In [12]:
# create mapping function to group cancellation policy data:
def map_cancellation_policy(i):
    if i in ['moderate', 'flexible', 'luxury_moderate']:
        return 'medium'
    elif i in ['strict_14_with_grace_period', 'better_strict_with_grace_period', 'firm_30_strict_with_grace_period']:
        return 'strict'
    elif i in ['super_strict_30', 'super_strict_60']:
        return 'super_strict'
    else:
        return 'no_policy'

In [13]:
# apply function to dataframe
airbnb_ldn['Cancellation Policy'] = airbnb_ldn['Cancellation Policy'].map(map_cancellation_policy)

In [14]:
# check appropriate transformation has been applied
airbnb_ldn['Cancellation Policy'].value_counts()

Cancellation Policy
no_policy       13677
medium          10642
strict           8269
super_strict       86
Name: count, dtype: int64
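
The same grouping could also be written as a dictionary lookup, sketched below on a few illustrative values. `POLICY_GROUPS` and `unmapped` are names introduced here for the example. The extra `unmapped` check is one possible way to tell genuinely missing policies apart from policy names that are simply absent from the mapping, since both end up as 'no_policy'.

```python
import pandas as pd

# Hypothetical lookup table reproducing the grouping described above.
POLICY_GROUPS = {
    'moderate': 'medium',
    'flexible': 'medium',
    'luxury_moderate': 'medium',
    'strict_14_with_grace_period': 'strict',
    'better_strict_with_grace_period': 'strict',
    'firm_30_strict_with_grace_period': 'strict',
    'super_strict_30': 'super_strict',
    'super_strict_60': 'super_strict',
}

# Illustrative values only.
policies = pd.Series(['moderate', None, 'super_strict_60', 'flexible',
                      'strict_14_with_grace_period'])

# Non-null values that the mapping does not know about (none in this toy data).
unmapped = policies[policies.notna() & ~policies.isin(list(POLICY_GROUPS))]

# Values missing from the dictionary (including NaN) map to NaN, then to 'no_policy'.
grouped = policies.map(POLICY_GROUPS).fillna('no_policy')
print(grouped.value_counts())
```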

The 'Cancellation Policy' column will now be one-hot encoded:

In [15]:
airbnb_ldn = pd.get_dummies(airbnb_ldn, columns = ['Cancellation Policy'])

In [16]:
airbnb_ldn.columns

Index(['Listing Title', 'Property Type', 'Listing Type', 'City', 'Zipcode',
       'Number of Reviews', 'Bedrooms', 'Bathrooms', 'Max Guests',
       'Airbnb Superhost', 'Cleaning Fee (Native)', 'Extra People Fee(Native)',
       'Check-in Time', 'Checkout Time', 'Minimum Stay', 'Latitude',
       'Longitude', 'Overall Rating', 'Airbnb Communication Rating',
       'Airbnb Accuracy Rating', 'Airbnb Checkin Rating',
       'Airbnb Location Rating', 'Airbnb Value Rating', 'Amenities',
       'Airbnb Host ID', 'guest_controls', 'Pets Allowed',
       'Count Available Days LTM', 'Count Blocked Days LTM',
       'Count Reservation Days LTM', 'Occupancy Rate LTM',
       'Number of Bookings LTM',
       'Number of Bookings LTM - Number of observed month',
       'Average Daily Rate (Native)', 'Annual Revenue LTM (Native)',
       'Cancellation Policy_medium', 'Cancellation Policy_no_policy',
       'Cancellation Policy_strict', 'Cancellation Policy_super_strict'],
      dtype='object')

In [17]:
# change from 'bool' to 'int' datatype:
for col in ['Cancellation Policy_medium', 'Cancellation Policy_no_policy', 'Cancellation Policy_strict', 'Cancellation Policy_super_strict']:
    airbnb_ldn[col] = airbnb_ldn[col].astype(int)
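
The encoding and the cast to integer can also be done in one step, because `pd.get_dummies` accepts a `dtype` argument. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy rows; values are illustrative only.
toy = pd.DataFrame({
    'Bedrooms': [1.0, 2.0, 0.5],
    'Cancellation Policy': ['medium', 'strict', 'no_policy'],
})

# dtype=int makes the dummy columns 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=['Cancellation Policy'], dtype=int)
print(encoded.dtypes)
```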

## Check-in Time

The next column to transform is 'Check-in Time'. First, the number of distinct values will be found:

In [18]:
# check number of distinct values in the dataframe
airbnb_ldn['Check-in Time'].value_counts()

Check-in Time
After 3:00 PM         11044
After 2:00 PM          2741
Flexible               2261
After 4:00 PM          1700
3:00 PM - 10:00 PM     1212
                      ...  
After 11:00 PM            1
After 6:00 AM             1
10:00 AM - 4:00 PM        1
After 5:00 AM             1
After %{time}             1
Name: count, Length: 159, dtype: int64

In [19]:
# checking the null values for the 'Check-in Time' column:
airbnb_ldn['Check-in Time'].isnull().sum()

1757

There are 160 distinct values in the 'Check-in Time' column (159 plus nulls). This is quite a lot, so a way of compressing them will be sought.

To begin, 'After 3:00 PM' is the most common check-in time, and other values also appear to contain some element of 3 PM. These will be investigated:

In [20]:
(airbnb_ldn[airbnb_ldn['Check-in Time'].str.contains('3', regex=True, na=False)])['Check-in Time'].value_counts()

Check-in Time
After 3:00 PM                   11044
3:00 PM - 10:00 PM               1212
3:00 PM - 9:00 PM                 958
3:00 PM - 8:00 PM                 810
3:00 PM - 11:00 PM                713
3:00 PM - 12:00 AM                661
3:00 PM - 6:00 PM                 319
3:00 PM - 7:00 PM                 298
3:00 PM - 2:00 AM (next day)      298
3:00 PM - 5:00 PM                 136
3:00 PM - 1:00 AM (next day)      112
1:00 PM - 3:00 PM                  50
12:00 PM - 3:00 PM                 20
11:00 AM - 3:00 PM                 15
10:00 AM - 3:00 PM                  8
9:00 AM - 3:00 PM                   3
8:00 AM - 3:00 PM                   2
After 3:00 AM                       2
Name: count, dtype: int64

In [21]:
(airbnb_ldn[airbnb_ldn['Check-in Time'].str.startswith(('12', '1 ', '2', '3', '4', '5'), na=False)])['Check-in Time'].value_counts()

Check-in Time
3:00 PM - 10:00 PM               1212
3:00 PM - 9:00 PM                 958
3:00 PM - 8:00 PM                 810
3:00 PM - 11:00 PM                713
3:00 PM - 12:00 AM                661
2:00 PM - 10:00 PM                475
2:00 PM - 12:00 AM                325
3:00 PM - 6:00 PM                 319
4:00 PM - 10:00 PM                308
2:00 PM - 11:00 PM                307
3:00 PM - 7:00 PM                 298
3:00 PM - 2:00 AM (next day)      298
4:00 PM - 8:00 PM                 296
4:00 PM - 7:00 PM                 280
2:00 PM - 9:00 PM                 240
4:00 PM - 11:00 PM                209
2:00 PM - 8:00 PM                 182
4:00 PM - 12:00 AM                161
4:00 PM - 9:00 PM                 138
3:00 PM - 5:00 PM                 136
2:00 PM - 7:00 PM                 131
3:00 PM - 1:00 AM (next day)      112
2:00 PM - 6:00 PM                 102
2:00 PM - 2:00 AM (next day)       97
5:00 PM - 10:00 PM                 83
12:00 PM - 10:00 PM                8

**Dealing with the check-in times is complicated; potentially return to this at a later stage. The column will be ignored for now.**
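
If the column is revisited, one possible simplification is to keep only the earliest check-in hour, since most values start with a clock time ('After 3:00 PM', '3:00 PM - 10:00 PM'). The helper below is a hypothetical sketch, not something used elsewhere in this workbook. It returns the first clock time in the string as an hour from 0 to 23, and `None` for values without one, such as 'Flexible', the placeholder 'After %{time}', or nulls.

```python
import re

# First clock time in the string, e.g. '3:00 PM' in '3:00 PM - 10:00 PM'.
_TIME = re.compile(r'(\d{1,2}):(\d{2})\s*(AM|PM)')

def checkin_start_hour(value):
    """Return the earliest check-in hour (0-23), or None if no clock time is found."""
    if not isinstance(value, str):
        return None  # NaN / missing value
    match = _TIME.search(value)
    if match is None:
        return None  # e.g. 'Flexible' or 'After %{time}'
    hour, _minutes, meridiem = match.groups()
    # 12 AM -> 0 and 12 PM -> 12; other hours shift by 12 in the afternoon.
    return int(hour) % 12 + (12 if meridiem == 'PM' else 0)

for example in ['After 3:00 PM', '12:00 PM - 12:00 AM', 'Flexible']:
    print(example, '->', checkin_start_hour(example))
```

Applied with `airbnb_ldn['Check-in Time'].map(checkin_start_hour)`, this would reduce the column to at most 24 hour values plus nulls, which could then be binned in the same way as the checkout times.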

## Checkout Time

The next column to transform is 'Checkout Time'. First, the number of distinct values will be found:

In [22]:
print(airbnb_ldn['Checkout Time'].value_counts())
print(f"Null values :  {airbnb_ldn['Checkout Time'].isnull().sum()}")

Checkout Time
11:00 AM    15268
10:00 AM     9025
12:00 PM     3964
1:00 PM       394
2:00 PM       252
9:00 AM       185
3:00 PM       135
12:00 AM       83
4:00 PM        60
5:00 PM        42
6:00 PM        36
8:00 AM        28
11:00 PM       14
2:00 AM        10
9:00 PM         9
1:00 AM         8
8:00 PM         6
7:00 PM         5
10:00 PM        4
3:00 AM         3
7:00 AM         2
Name: count, dtype: int64
Null values :  3141


Looking at the 'Checkout Time' column, there are 22 distinct time categories (including nulls). These can be divided into six sub-groups (sub-group : values):

- morning : 7:00 AM, 8:00 AM, 9:00 AM, 10:00 AM, 11:00 AM
- afternoon : 12:00 PM, 1:00 PM, 2:00 PM, 3:00 PM, 4:00 PM, 5:00 PM
- evening : 6:00 PM, 7:00 PM, 8:00 PM, 9:00 PM
- late : 10:00 PM, 11:00 PM, 12:00 AM, 1:00 AM
- very_early : 2:00 AM, 3:00 AM
- none : NaN

The column will be split into the described groups:

In [23]:
# create mapping function to 'Checkout Time' data:
def map_checkout_time(i):
    if i in ['7:00 AM', '8:00 AM', '9:00 AM', '10:00 AM', '11:00 AM']:
        return 'morning'
    elif i in ['12:00 PM', '1:00 PM', '2:00 PM', '3:00 PM', '4:00 PM', '5:00 PM']:
        return 'afternoon'
    elif i in ['6:00 PM', '7:00 PM', '8:00 PM', '9:00 PM']:
        return 'evening'
    elif i in ['10:00 PM', '11:00 PM', '12:00 AM', '1:00 AM']:
        return 'late'
    elif i in ['2:00 AM', '3:00 AM']:
        return 'very_early'
    else:
        return 'none'

In [24]:
# apply function to dataframe
airbnb_ldn['Checkout Time'] = airbnb_ldn['Checkout Time'].map(map_checkout_time)

In [25]:
# check the correct transformation has been applied:
airbnb_ldn['Checkout Time'].value_counts()

Checkout Time
morning       24508
afternoon      4847
none           3141
late            109
evening          56
very_early       13
Name: count, dtype: int64

These different categories can now be one-hot encoded:

In [26]:
airbnb_ldn = pd.get_dummies(airbnb_ldn, columns = ['Checkout Time'])

In [27]:
airbnb_ldn.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Listing Title | Cozy 2BR house with a garden view | GuestReady - Amazing home with a private garden | Cosy cottage on Richmond Park | Entire Flat. Free parking, Garden , Richmond park | Maisonette inbetween Richmond Park and Wimbledon |
| Property Type | Entire home | Entire home | Entire home | Entire rental unit | Private room in rental unit |
| Listing Type | entire_home | entire_home | entire_home | entire_home | private_room |
| City | Greater London | Greater London | Greater London | Greater London | Greater London |
| Zipcode | SW15 3 | SW15 3 | SW15 3 | SW15 3 | SW15 3 |
| Number of Reviews | 9 | 11 | 1 | 20 | 0 |
| Bedrooms | 2 | 2 | 1 | 2 | 1 |
| Bathrooms | 2 | 1 | 2 | 1 | 1 |
| Max Guests | 6 | 4 | 3 | 4 | 2 |
| Airbnb Superhost | 0 | 1 | 0 | 0 | 0 |


In [28]:
# change the datatype to integer
for col in ['Checkout Time_afternoon', 'Checkout Time_evening', 'Checkout Time_late', 'Checkout Time_morning', 'Checkout Time_none', 'Checkout Time_very_early']:
    airbnb_ldn[col] = airbnb_ldn[col].astype(int)

As shown, the 'Checkout Time' column has been split into the relevant categories.
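
Listing every time string by hand is easy to get subtly wrong (for example '12:00 AM' versus '12:00 PM'). A possible alternative, sketched below, parses each value into an hour and buckets by hour range. `checkout_bucket` is a hypothetical helper, and extending 'very_early' up to 6 AM is an assumption made here, since no checkout times between 4 AM and 6 AM appear in the data.

```python
from datetime import datetime

def checkout_bucket(value):
    """Map a time string such as '11:00 AM' to a coarse checkout group."""
    if not isinstance(value, str):
        return 'none'  # NaN / missing value
    try:
        hour = datetime.strptime(value.strip(), '%I:%M %p').hour
    except ValueError:
        return 'none'  # string that is not a clock time
    if 7 <= hour <= 11:
        return 'morning'
    if 12 <= hour <= 17:
        return 'afternoon'
    if 18 <= hour <= 21:
        return 'evening'
    if hour >= 22 or hour <= 1:
        return 'late'  # 10 PM to 1 AM, wrapping past midnight
    return 'very_early'  # 2 AM to 6 AM (4-6 AM is an assumption, see above)

for example in ['11:00 AM', '12:00 PM', '12:00 AM', '2:00 AM']:
    print(example, '->', checkout_bucket(example))
```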

## Bedrooms

The 'Bedrooms' column will now be assessed:

In [29]:
print(airbnb_ldn['Bedrooms'].value_counts())
print(f"Null values: {airbnb_ldn['Bedrooms'].isnull().sum()}")

Bedrooms
1         19457
2          7541
3          2470
Studio     1934
4           840
5           301
6            75
0            21
7            18
8             7
10            4
9             2
16            1
15            1
12            1
22            1
Name: count, dtype: int64
Null values: 0


There are 16 distinct bedroom values. The 'Bedrooms' column is currently an 'object' column because of the 'Studio' value in the column's data. For modelling, a numerical datatype is preferable. Hence, the 'Studio' values will be changed to '0.5' and the column's datatype will be converted to 'float'.

Begin by changing the 'Studio' values to '0.5':

In [30]:
airbnb_ldn['Bedrooms'] = airbnb_ldn['Bedrooms'].replace({'Studio' : '0.5'})

The column will now be converted to a 'float' datatype:

In [31]:
airbnb_ldn['Bedrooms'] = airbnb_ldn['Bedrooms'].astype(float)
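
The two steps above can be sketched on toy data as follows, here using `pd.to_numeric` for the conversion. The values are illustrative. Like `astype(float)`, `pd.to_numeric` raises an error by default if any non-numeric label other than 'Studio' is still present, which acts as a check that nothing was missed.

```python
import pandas as pd

# Toy stand-in for the 'Bedrooms' column (values are illustrative only).
bedrooms = pd.Series(['1', '2', 'Studio', '0', '10'], name='Bedrooms')

# Replace the one text label, then convert; an unexpected label would raise ValueError.
bedrooms_num = pd.to_numeric(bedrooms.replace({'Studio': '0.5'}))
print(bedrooms_num)
```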

## Listing Type

The 'Listing Type' column will now be evaluated:

In [32]:
airbnb_ldn['Listing Type'].unique()

array(['entire_home', 'private_room', 'shared_room', 'hotel_room'],
      dtype=object)

Above are the four values present within the 'Listing Type' column. These can be one-hot encoded:

In [33]:
airbnb_ldn = pd.get_dummies(airbnb_ldn, columns = ['Listing Type'])

In [34]:
# convert to numerical binary
for col in ['Listing Type_entire_home', 'Listing Type_hotel_room', 'Listing Type_private_room', 'Listing Type_shared_room']:
    airbnb_ldn[col] = airbnb_ldn[col].astype('int')

In [35]:
airbnb_ldn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32674 entries, 0 to 32673
Data columns (total 47 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Listing Title                                      32674 non-null  object 
 1   Property Type                                      32674 non-null  object 
 2   City                                               32674 non-null  object 
 3   Zipcode                                            32674 non-null  object 
 4   Number of Reviews                                  32674 non-null  int64  
 5   Bedrooms                                           32674 non-null  float64
 6   Bathrooms                                          32674 non-null  int64  
 7   Max Guests                                         32674 non-null  int64  
 8   Airbnb Superhost                                   32674 non-null  int32  
 9   Cleani

## Pets Allowed

Looking at other binary columns, 'Pets Allowed' can also be converted to a numerical binary column.

In [36]:
# convert 'pets allowed' to numerical
airbnb_ldn['Pets Allowed'] = airbnb_ldn['Pets Allowed'].astype('int')

This finishes the data pre-processing for now; the next step is to create the first model.

In [37]:
airbnb_ldn.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Listing Title | Cozy 2BR house with a garden view | GuestReady - Amazing home with a private garden | Cosy cottage on Richmond Park | Entire Flat. Free parking, Garden , Richmond park | Maisonette inbetween Richmond Park and Wimbledon |
| Property Type | Entire home | Entire home | Entire home | Entire rental unit | Private room in rental unit |
| City | Greater London | Greater London | Greater London | Greater London | Greater London |
| Zipcode | SW15 3 | SW15 3 | SW15 3 | SW15 3 | SW15 3 |
| Number of Reviews | 9 | 11 | 1 | 20 | 0 |
| Bedrooms | 2.0 | 2.0 | 1.0 | 2.0 | 1.0 |
| Bathrooms | 2 | 1 | 2 | 1 | 1 |
| Max Guests | 6 | 4 | 3 | 4 | 2 |
| Airbnb Superhost | 0 | 1 | 0 | 0 | 0 |
| Cleaning Fee (Native) | 154.8 | 0.0 | 0.0 | 34.8 | 0.0 |


## City

First, the number of distinct values within this column will be determined:

In [38]:
airbnb_ldn['City'].nunique()

263

There are quite a few different values for city. Considering all the rows are specific to London postcodes, this is a surprisingly large number of distinct values within the column. This will be investigated further:

In [39]:
# determine what the distinct values within the 'city' column are:
airbnb_ldn['City'].unique()

array(['Greater London', 'London', 'Ealing', 'london', 'Chiswick',
       'Wimbledon', 'England', 'Putney', 'West London', ' London',
       'Hammersmith, London', 'Hammersmith', 'Central London', 'London ',
       'Kensington', 'GB', 'Battersea ', 'West Sussex',
       'South Kensington', 'London Kensington', 'Knightsbridge ',
       'LONDON ', 'Pimlico', 'Central London ', 'Westminster',
       'South Norwood', 'Croydon', 'London Borough of Lewisham',
       'Brixton', 'Peckham, London', 'BETHWIN ROAD', 'Vauxhall',
       'London Borough of Southwark', 'London South', 'Swan Mead', '*',
       'Hanwell', 'Harrow', 'The Hyde', 'Edgware', 'Colindale ',
       'Colindale', 'Mill hill Broadway ', 'North Kensington',
       'London W8', 'Denbigh Road', 'Notting Hill', 'North Kensington ',
       ' Notting Hill ,Greater London', 'Brent', 'Marylebone',
       'City of Westminster', 'Paddington', 'Fitzrovia, London', 'LONDON',
       'Maida Vale', 'Flask Walk ', 'Hampstead', 'Camden Town',
  

It looks as though, for many properties, a district or borough name has been entered in the 'City' column. The relative frequency of these values will be determined:

In [40]:
airbnb_ldn['City'].value_counts().head(30)

City
Greater London          25346
London                   6553
London                    106
England                    54
 London                    38
Hammersmith                17
Enfield                    16
GB                         15
Islington                  14
New Barnet                 10
Edgware                    10
Putney                     10
Bayswater                   9
Barnet                      9
Wimbledon                   9
Central London              9
Kensington                  8
Brixton                     7
Marylebone                  7
Central London              7
Hackney                     7
United Kingdom              6
Chiswick                    6
Londra                      6
Clapham                     6
Limehouse, London           5
Clerkenwell, London         5
Ealing                      5
Stratford                   5
Hampstead                   5
Name: count, dtype: int64
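
Several of the values above differ only by surrounding whitespace or letter case. A quick way to see how much of the variety is purely formatting is to normalise the strings before counting, as in this sketch on a handful of illustrative values:

```python
import pandas as pd

# Illustrative values showing whitespace and case variants.
city = pd.Series(['Greater London', 'London', 'London ', ' London',
                  'london', 'LONDON ', 'Ealing'])

# Strip surrounding whitespace and fold case so formatting variants collapse.
normalised = city.str.strip().str.casefold()

raw_distinct = city.nunique()
clean_distinct = normalised.nunique()
print(raw_distinct, clean_distinct)
```

Normalising in this way would not merge district names such as 'Ealing' with 'London', so it only addresses part of the inconsistency.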

Looking at the 30 most common 'City' values, the vast majority of rows are labelled 'Greater London', followed by 'London'. Some hosts appear to have entered a borough or specific district within London as the city. The column is therefore dominated by the same few general values ('Greater London' and 'London'), while the remaining values give more detailed location information but follow no consistent format. Location is also already captured by the 'Zipcode', 'Latitude' and 'Longitude' columns, so this column would not be helpful in further analysis and can be dropped.

In [41]:
airbnb_ldn.drop(columns='City', inplace=True)

Export the preprocessed data to CSV:

In [42]:
airbnb_ldn.to_csv('airbnb_ldn_pp.csv')
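
A note on the export: `to_csv` writes the row index as an unnamed first column by default, and `read_csv` then loads it back as 'Unnamed: 0', which is presumably where the column dropped at the start of this workbook came from. The sketch below shows the round trip on a toy dataframe, using an in-memory buffer instead of a file. Passing `index=False` is one way to avoid the extra column, provided any later workbook does not expect it.

```python
import io
import pandas as pd

# Toy dataframe; values are illustrative only.
toy = pd.DataFrame({'Bedrooms': [1.0, 0.5], 'Max Guests': [2, 1]})

# Default export: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```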