# Capstone Workbook 3: Pre-processing

In [None]:
# Import libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Import data 
airbnb_ldn = pd.read_csv('airbnb_ldn_final.csv')

In [None]:
airbnb_ldn.drop(columns='Unnamed: 0', inplace=True)

In [None]:
airbnb_ldn.shape

In [None]:
airbnb_ldn.info()

In [None]:
# split into categorical and numerical columns (copies, so they can be modified safely)
cat_cols = airbnb_ldn.select_dtypes(include='object').copy()
num_cols = airbnb_ldn.select_dtypes(exclude='object').copy()


In [None]:
# View categorical columns
cat_cols.head().T

Looking at the categorical columns, a couple can immediately be identified as candidates for numerical transformation.

To begin with, 'Airbnb Superhost' is a binary column and can thus be made numerical:

In [None]:
# confirm 'Airbnb Superhost' is binary by listing its values (including nulls):
airbnb_ldn['Airbnb Superhost'].value_counts(dropna=False)

The presence of only two values, f (false) and t (true), confirms the column is binary. It will now be made numerical:

In [None]:
# make the column numerical (1 = 't', 0 = anything else) in both dataframes
cat_cols['Airbnb Superhost'] = np.where(cat_cols['Airbnb Superhost'] == 't', 1, 0)
airbnb_ldn['Airbnb Superhost'] = np.where(airbnb_ldn['Airbnb Superhost'] == 't', 1, 0)

In [None]:
# check the conversion has worked:
airbnb_ldn['Airbnb Superhost'].value_counts()
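
Note that `np.where(... == 't', 1, 0)` sends every value other than 't' to 0, including nulls or unexpected strings. As a hedged alternative, the sketch below maps the two known values explicitly so that anything else shows up as NaN. The `superhost` Series is made-up sample data, not the real column.

```python
import pandas as pd

# Hypothetical sample standing in for the 'Airbnb Superhost' column
superhost = pd.Series(['t', 'f', 't', None], name='Airbnb Superhost')

# Explicit mapping: values outside {'t', 'f'} become NaN instead of 0
superhost_num = superhost.map({'t': 1, 'f': 0})

# Count anything that did not map cleanly before casting to int
n_unmapped = superhost_num.isnull().sum()
print(f"Unmapped values: {n_unmapped}")
```

If the count is zero, the column can be cast to `int` without silently losing information.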

The next step is to look at other columns that have a small number of distinct values, or whose values can be grouped into fewer categories. Initially identified ones:

- Listing Type
- Cancellation Policy
- Check-in Time
- Checkout Time
- Bedrooms

Beginning with 'Cancellation Policy':

In [None]:
# check values within cancellation policy:
print(airbnb_ldn['Cancellation Policy'].value_counts())
print(f"Null values: {airbnb_ldn['Cancellation Policy'].isnull().sum()}")

The cancellation policy values can be grouped into four main categories (new grouping : original values):
- No policy : Null values
- Medium : moderate, flexible, luxury_moderate
- Strict : strict_14_with_grace_period, better_strict_with_grace_period, firm_30_strict_with_grace_period
- Super strict : super_strict_30, super_strict_60

In [None]:
# create mapping function to group cancellation policy data:
def map_cancellation_policy(i):
    if i in ['moderate', 'flexible', 'luxury_moderate']:
        return 'medium'
    elif i in ['strict_14_with_grace_period', 'better_strict_with_grace_period', 'firm_30_strict_with_grace_period']:
        return 'strict'
    elif i in ['super_strict_30', 'super_strict_60']:
        return 'super_strict'
    else:
        return 'no_policy'

In [None]:
# apply function to dataframe
airbnb_ldn['Cancellation Policy'] = airbnb_ldn['Cancellation Policy'].map(map_cancellation_policy)

In [None]:
# check appropriate transformation has been applied
airbnb_ldn['Cancellation Policy'].value_counts()
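
The same grouping can also be expressed as a lookup table rather than an if/elif chain. The following is a minimal sketch under stated assumptions: `policy_groups`, `policy_lookup` and the `sample` Series are illustrative names and made-up data, not part of the original workbook.

```python
import pandas as pd

# Groupings copied from the list above: new group -> original values
policy_groups = {
    'medium': ['moderate', 'flexible', 'luxury_moderate'],
    'strict': ['strict_14_with_grace_period',
               'better_strict_with_grace_period',
               'firm_30_strict_with_grace_period'],
    'super_strict': ['super_strict_30', 'super_strict_60'],
}

# Invert to original value -> new group
policy_lookup = {
    original: group
    for group, originals in policy_groups.items()
    for original in originals
}

# Hypothetical sample standing in for the real column
sample = pd.Series(['flexible', 'super_strict_60', None,
                    'strict_14_with_grace_period'])

# Values missing from the lookup (including nulls) become 'no_policy'
grouped = sample.map(policy_lookup).fillna('no_policy')
print(grouped.value_counts())
```

As with the function above, any value that is not listed ends up in 'no_policy', so it is worth comparing the counts before and after the mapping.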

The 'Cancellation Policy' column will now be one-hot encoded:

In [None]:
airbnb_ldn = pd.get_dummies(airbnb_ldn, columns = ['Cancellation Policy'])

In [None]:
airbnb_ldn.columns

In [None]:
# change from 'bool' to 'int' datatype:
for col in ['Cancellation Policy_medium', 'Cancellation Policy_no_policy', 'Cancellation Policy_strict', 'Cancellation Policy_super_strict']:
    airbnb_ldn[col] = airbnb_ldn[col].astype(int)
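
The encoding and the cast to integers can also be combined, since `pd.get_dummies` accepts a `dtype` argument. A small sketch on made-up data (the `df` frame below is illustrative only):

```python
import pandas as pd

# Hypothetical frame with an already-grouped policy column
df = pd.DataFrame(
    {'Cancellation Policy': ['medium', 'strict', 'no_policy', 'medium']}
)

# dtype=int makes the dummy columns integers directly
encoded = pd.get_dummies(df, columns=['Cancellation Policy'], dtype=int)
print(encoded.dtypes)
```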

The next column to transform is 'Check-in Time'. First, its distinct values will be counted:

In [None]:
# check the distinct values in the column
airbnb_ldn['Check-in Time'].value_counts()

In [None]:
airbnb_ldn['Check-in Time'].isnull().sum()

The 'Check-in Time' column has 160 distinct values (including nulls). This is too many to encode directly, so a way of grouping them is needed.

To begin, 'After 3: 00 PM' appears to be the most common check-in time, and other values also seem to contain some element of 3 PM. These will be investigated:

In [None]:
(airbnb_ldn[airbnb_ldn['Check-in Time'].str.contains('3', regex=True, na=False)])['Check-in Time'].value_counts()

In [None]:
(airbnb_ldn[airbnb_ldn['Check-in Time'].str.startswith(('12', '1 ', '2', '3', '4', '5'), na=False)])['Check-in Time'].value_counts()

**Dealing with the check-in times is complicated; this may be revisited at a later stage. The column will be ignored for now.**
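
If the column is revisited, one possible starting point is to pull the first clock time out of each string and convert it to a 24-hour value. The sketch below is only an illustration: the sample strings are assumptions about what the column might contain, and the names `checkin`, `parts` and `earliest_hour` are hypothetical.

```python
import pandas as pd

# Hypothetical examples; the real column's formats are an assumption here
checkin = pd.Series(['After 3:00 PM', '3:00 PM - 10:00 PM',
                     'Flexible', None, '12:00 AM'])

# Extract the first "H:MM AM/PM" time that appears in each string
parts = checkin.str.extract(r'(?P<hour>\d{1,2}):\s*\d{2}\s*(?P<ampm>AM|PM)')

# Convert to a 24-hour value; rows without a match stay NaN
hour12 = parts['hour'].astype(float) % 12
earliest_hour = hour12 + parts['ampm'].map({'AM': 0, 'PM': 12})
print(earliest_hour)
```

The resulting hour could then be grouped into a handful of bands, in the same way as the check-out times.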

The next column to transform is 'Checkout Time'. First, its distinct values will be counted:

In [None]:
print(airbnb_ldn['Checkout Time'].value_counts())
print(f"Null values :  {airbnb_ldn['Checkout Time'].isnull().sum()}")

The 'Checkout Time' column has 22 distinct time categories (including nulls). These can be divided into the following sub-groups (sub-group : values):

- morning : 7:00 AM, 8:00 AM, 9:00 AM, 10:00 AM, 11:00 AM
- afternoon : 12:00 PM, 1:00 PM, 2:00 PM, 3:00 PM, 4:00 PM, 5:00 PM
- evening : 6:00 PM, 7:00 PM, 8:00 PM, 9:00 PM 
- late : 10:00 PM, 11:00 PM, 12:00 AM, 1:00 AM
- very_early : 2:00 AM, 3:00 AM
- none : NaN

The column will be split into the described groups:

In [None]:
# create mapping function to group 'Checkout Time' data:
def map_checkout_time(i):
    if i in ['7:00 AM', '8:00 AM', '9:00 AM', '10:00 AM', '11:00 AM']:
        return 'morning'
    elif i in ['12:00 PM', '1:00 PM', '2:00 PM', '3:00 PM', '4:00 PM', '5:00 PM']:
        return 'afternoon'
    elif i in ['6:00 PM', '7:00 PM', '8:00 PM', '9:00 PM']:
        return 'evening'
    elif i in ['10:00 PM', '11:00 PM', '12:00 AM', '1:00 AM']:
        return 'late'
    elif i in ['2:00 AM', '3:00 AM']:
        return 'very_early'
    else:
        return 'none'

In [None]:
# apply function to dataframe
airbnb_ldn['Checkout Time'] = airbnb_ldn['Checkout Time'].map(map_checkout_time)

In [None]:
# check the correct transformation has been applied:
airbnb_ldn['Checkout Time'].value_counts()
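
Listing every time string by hand makes it easy to mistype an entry. As a hedged alternative, the times can be parsed to an hour and grouped by range. This sketch assumes the raw values follow the 'H:MM AM/PM' pattern shown in the list above, and it places 1:00 AM in 'late'. `bucket_hour` and the sample data are illustrative, not part of the original workbook.

```python
import pandas as pd

def bucket_hour(hour):
    """Map a 24-hour value to the check-out groups described above."""
    if pd.isnull(hour):
        return 'none'
    hour = int(hour)
    if 7 <= hour <= 11:
        return 'morning'
    if 12 <= hour <= 17:
        return 'afternoon'
    if 18 <= hour <= 21:
        return 'evening'
    if hour >= 22 or hour <= 1:
        return 'late'
    if 2 <= hour <= 3:
        return 'very_early'
    return 'none'  # hours 4-6 are not covered by the listed groups

# Hypothetical sample of raw check-out strings
sample = pd.Series(['11:00 AM', '12:00 PM', '12:00 AM',
                    '1:00 AM', '3:00 AM', None])

# Parse to datetimes (unparseable values become NaT), then take the hour
hours = pd.to_datetime(sample, format='%I:%M %p', errors='coerce').dt.hour
groups = hours.map(bucket_hour)
print(groups.tolist())
```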

These different categories can now be one-hot encoded:

In [None]:
airbnb_ldn = pd.get_dummies(airbnb_ldn, columns = ['Checkout Time'])

In [None]:
airbnb_ldn.head().T

In [None]:
# change the datatype to integer
for col in ['Checkout Time_afternoon', 'Checkout Time_evening', 'Checkout Time_late', 'Checkout Time_morning', 'Checkout Time_none', 'Checkout Time_very_early']:
    airbnb_ldn[col] = airbnb_ldn[col].astype(int)

As shown, the 'Checkout Time' column has been split into the relevant categories.

The 'Bedrooms' column will now be assessed:

In [None]:
print(airbnb_ldn['Bedrooms'].value_counts())
print(f"Null values: {airbnb_ldn['Bedrooms'].isnull().sum()}")

There are 16 distinct bedroom values. The 'Bedrooms' column is currently an 'object' column because of the 'Studio' value within the column's data. For modelling, it would be better if this column had a numerical datatype. Hence, the 'Studio' values will be changed to '0.5' and the column's datatype will be converted to 'float'.

Begin by changing the 'Studio' values to '0.5':

In [None]:
airbnb_ldn['Bedrooms'] = airbnb_ldn['Bedrooms'].replace({'Studio' : '0.5'})

The column will now be converted to a 'float' datatype:

In [None]:
airbnb_ldn['Bedrooms'] = airbnb_ldn['Bedrooms'].astype(float)
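
`astype(float)` raises an error if any other non-numeric string is left in the column. A more forgiving sketch uses `pd.to_numeric` with `errors='coerce'` and then lists whatever failed to parse. The sample values below, including 'n/a', are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; 'n/a' stands in for an unexpected value
bedrooms = pd.Series(['1', 'Studio', '3', 'n/a', None])

# Replace 'Studio' as above, then coerce anything unparseable to NaN
as_float = pd.to_numeric(bedrooms.replace({'Studio': '0.5'}), errors='coerce')

# Non-null originals that did not convert and need a decision
unparsed = bedrooms[as_float.isnull() & bedrooms.notnull()]
print(unparsed.tolist())
```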

The 'Listing Type' column will now be evaluated:

In [None]:
airbnb_ldn['Listing Type'].unique()

Above are the four values present within the 'Listing Type' column. These can be one-hot encoded:

In [None]:
airbnb_ldn = pd.get_dummies(airbnb_ldn, columns = ['Listing Type'])

In [None]:
# convert to numerical binary
for col in ['Listing Type_entire_home', 'Listing Type_hotel_room', 'Listing Type_private_room', 'Listing Type_shared_room']:
    airbnb_ldn[col] = airbnb_ldn[col].astype('int')
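
Rather than typing out each dummy column name, the boolean columns can be selected by dtype and cast together. A minimal sketch on a made-up frame (the `df` data is illustrative only):

```python
import pandas as pd

# Hypothetical frame with one categorical and one numerical column
df = pd.DataFrame({
    'Listing Type': ['entire_home', 'private_room', 'shared_room'],
    'Bedrooms': [2.0, 1.0, 1.0],
})
df = pd.get_dummies(df, columns=['Listing Type'])

# Find whichever columns came out as bool and cast only those
bool_cols = df.select_dtypes(include='bool').columns
df = df.astype({col: int for col in bool_cols})
print(df.dtypes)
```

This avoids a `KeyError` if a category (and therefore its dummy column) is absent from the data.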

In [None]:
airbnb_ldn.info()

Looking at other binary columns, 'Pets Allowed' can be converted to a numerical binary column:

In [None]:
# convert 'pets allowed' to numerical
airbnb_ldn['Pets Allowed'] = airbnb_ldn['Pets Allowed'].astype('int')

Data preprocessing is finished for now; the next step is to create the first model.

In [None]:
airbnb_ldn.head().T

Export the preprocessed data to CSV:

In [None]:
airbnb_ldn.to_csv('airbnb_ldn_pp.csv')
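
By default `to_csv` also writes the row index, which comes back as an 'Unnamed: 0' column when the file is read again (the kind of column dropped at the start of this workbook). The sketch below shows the difference on a made-up frame, using an in-memory buffer instead of a real file. Whether to pass `index=False` depends on what later workbooks expect.

```python
import io
import pandas as pd

# Hypothetical two-row frame
df = pd.DataFrame({'Bedrooms': [0.5, 2.0], 'Pets Allowed': [1, 0]})

# Default: the index is written as an unnamed first column
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False: only the data columns are written
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```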