# Capstone Workbook 5: NLP

There are numerous columns within the dataframe that contain some form of text information. The columns contain essential information, describing components such as the listing title and the included amenities.

The advanced modelling part of this project will require these text columns to be represented numerically. A range of different approaches will be used to process them. Once represented numerically, the relevant information can be fed into the machine learning models, producing a more in-depth and useful predictive output.

In [1]:
# Import libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
import statsmodels.api as sm

In [2]:
# Import data 
airbnb_ldn = pd.read_csv('airbnb_ldn_pp.csv')

In [3]:
# drop 'Unnamed: 0'
airbnb_ldn = airbnb_ldn.drop(columns = 'Unnamed: 0')

Split the dataframe so that the object-datatype columns (`X_obj`) can be viewed separately from the remaining columns (`X_num`):

In [4]:
# take an explicit copy so later column assignments do not raise SettingWithCopyWarning
X_obj = airbnb_ldn.select_dtypes(include='object').copy()
X_num = airbnb_ldn.select_dtypes(exclude='object')

In [5]:
# confirm datatype of remaining columns:
X_obj.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32674 entries, 0 to 32673
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Listing Title   32674 non-null  object
 1   Property Type   32674 non-null  object
 2   Zipcode         32674 non-null  object
 3   Amenities       32674 non-null  object
 4   guest_controls  32674 non-null  object
dtypes: object(5)
memory usage: 1.2+ MB


Now that the object datatype columns have been isolated, the data will be checked to confirm there are no null values present:

In [6]:
# check for nulls
X_obj.isnull().sum()

Listing Title     0
Property Type     0
Zipcode           0
Amenities         0
guest_controls    0
dtype: int64

## Starting With Listing Title

The listing title column will now be processed. Two different types of embedding will be used: count vectorizing and TF-IDF. Both will be carried through to the modelling stage, so that their relative impact can be evaluated to determine which is the more effective.

In [7]:
# isolate the 'Listing Title' column:
lt_array = X_obj['Listing Title']

In [8]:
# View the listing title array:
lt_array

0                        Cozy 2BR house with a garden view
1          GuestReady - Amazing home with a private garden
2                            Cosy cottage on Richmond Park
3        Entire Flat. Free parking, Garden , Richmond park
4         Maisonette inbetween Richmond Park and Wimbledon
                               ...                        
32669                Service  Apartment- London Thamesmead
32670       Large Double Room with Free Parking and Garden
32671                   Forest view room in welcoming home
32672          Spacious Double Room with En-suite bathroom
32673    Large Room & Bathroom close to Forest and Station
Name: Listing Title, Length: 32674, dtype: object

In [9]:
# required imports 
from sklearn.feature_extraction.text import CountVectorizer

The listing title column will now be count vectorized:

Several parameters have been included:
- min_df: the minimum number of listing titles (documents) a text item must appear in for it to be included
- stop_words: remove English stopwords, such as 'the' and 'and'
- ngram_range: include single-word items and multi-word blocks, up to a length of 4 words per block.
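To make the effect of these parameters concrete, the following is a minimal sketch on a hypothetical three-title corpus (the titles, the variable names and the lower `min_df` of 2 are illustrative assumptions, not values used in this workbook):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy titles, not taken from the dataset
toy_titles = [
    "Cosy room with private garden",
    "Private garden flat near the park",
    "Large room near the park",
]

# min_df=2 keeps only n-grams found in at least two titles;
# stop words are removed before the n-grams are built
toy_cv = CountVectorizer(min_df=2, stop_words="english", ngram_range=(1, 2))
toy_counts = toy_cv.fit_transform(toy_titles)

# Terms that survived the filtering, in column order
kept_terms = sorted(toy_cv.vocabulary_)
print(kept_terms)
print(toy_counts.toarray())
```

Terms that appear in only one title (such as 'cosy') are dropped, while bigrams shared by two titles (such as 'private garden') are kept as their own columns.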


In [10]:
# 1. Instantiate transformer object
bagofwords_lt = CountVectorizer(min_df = 50,
                                stop_words = "english",
                                ngram_range = (1, 4))

# 2. Fit
bagofwords_lt.fit(lt_array)

# 3. Transform
lt_transformed = bagofwords_lt.transform(lt_array)
lt_transformed

<32674x739 sparse matrix of type '<class 'numpy.int64'>'
	with 192522 stored elements in Compressed Sparse Row format>

In [11]:
# View the words that are occurring in the Listing Title:
bagofwords_lt.vocabulary_

{'cozy': 200,
 '2br': 16,
 'house': 353,
 'garden': 302,
 'view': 713,
 'house garden': 356,
 'garden view': 306,
 'guestready': 322,
 'amazing': 26,
 'home': 343,
 'private': 540,
 'private garden': 545,
 'cosy': 186,
 'cottage': 196,
 'park': 516,
 'entire': 245,
 'flat': 267,
 'free': 296,
 'parking': 517,
 'entire flat': 246,
 'flat free': 274,
 'free parking': 297,
 'maisonette': 441,
 'wimbledon': 734,
 'room': 575,
 'private room': 546,
 'friendly': 298,
 'clean': 169,
 'place': 531,
 'people': 525,
 'nice': 491,
 'quiet': 556,
 'double': 216,
 'single': 605,
 'rooms': 592,
 'near': 473,
 'piccadilly': 529,
 'line': 393,
 'house near': 358,
 'lovely': 416,
 'lovely room': 427,
 'beautiful': 64,
 'views': 714,
 'stay': 648,
 'family': 255,
 'east': 234,
 'guests': 323,
 'family home': 257,
 'home garden': 346,
 'large': 379,
 'large room': 385,
 'stunning': 661,
 'london': 403,
 'london home': 411,
 'bedroom': 83,
 'bathroom': 56,
 'private double': 543,
 'double bedroom': 218,
 

In [12]:
# see the quantity of tokens:
len(bagofwords_lt.vocabulary_)

739

In [13]:
# create a dataframe for the listing title count vectorized words
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it):
listing_title_cv = pd.DataFrame(columns=bagofwords_lt.get_feature_names_out(), data=lt_transformed.toarray())

In [14]:
# Viewing the listing title dataframe:
display(listing_title_cv)

Unnamed: 0,10,10 mins,15,1bd,1bd flat,1bed,1br,1br flat,1st,20,...,westminster,wharf,wi,wi fi,wifi,wimbledon,wimbledon tennis,wonderful,wood,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32669,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32670,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32671,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32672,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# gain some statistical insights to the included text:
listing_title_cv.describe()

Unnamed: 0,10,10 mins,15,1bd,1bd flat,1bed,1br,1br flat,1st,20,...,westminster,wharf,wi,wi fi,wifi,wimbledon,wimbledon tennis,wonderful,wood,zone
count,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,...,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0
mean,0.005693,0.002081,0.003061,0.003917,0.002204,0.004377,0.006488,0.002571,0.001561,0.002785,...,0.003673,0.012426,0.001867,0.001836,0.007345,0.011997,0.001989,0.002816,0.002601,0.025433
std,0.075235,0.045573,0.05579,0.062468,0.046891,0.066012,0.08029,0.050639,0.039478,0.052701,...,0.060492,0.111054,0.043168,0.042814,0.085391,0.109436,0.044558,0.052989,0.050939,0.157633
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0


The same 'Listing Title' column will now be vectorized using TF-IDF:

In [16]:
# required import
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text

The parameters used for the TF-IDF embedding are similar to those used for the count vectorizing. The main difference is the n-gram length. In this instance, we are only concerned with bigrams and hence the ngram_range has been specified as (2,2).
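As a rough illustration of what the bigram-only TF-IDF embedding produces, the sketch below uses a hypothetical three-title corpus with a lower `min_df` of 2 (both are assumptions for demonstration only). It relies on the `TfidfVectorizer` default of L2-normalising each row, which is why a title containing a single retained bigram receives a weight of 1.0:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy titles, not taken from the dataset
toy_titles = [
    "Double room near park",
    "Double room with garden",
    "Garden flat near park",
]

# Bigrams only, keeping those that occur in at least two titles
toy_tfidf = TfidfVectorizer(min_df=2, ngram_range=(2, 2), stop_words="english")
toy_weights = toy_tfidf.fit_transform(toy_titles).toarray()

print(sorted(toy_tfidf.vocabulary_))
print(toy_weights.round(3))

# Each non-empty row is scaled to unit length (default norm='l2')
row_norms = np.linalg.norm(toy_weights, axis=1)
print(row_norms)
```

The second title retains only one bigram, so its single weight is scaled to 1.0, whereas the first title shares its unit length across two bigrams.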

In [17]:
# instantiate and fit the TF-IDF vectorizer (default tokenizer):
tfidf = TfidfVectorizer(min_df=50,
                        ngram_range = (2,2), # just looking at bi-grams
                        stop_words = "english")
tfidf.fit(lt_array)

# create dataframe of tf-idf bigrams:
lt_tfidf_transformed = tfidf.transform(lt_array)
listing_title_tfidf = pd.DataFrame(columns=tfidf.get_feature_names_out(), data=lt_tfidf_transformed.toarray())

In [18]:
# display dataframe:
display(listing_title_tfidf)

Unnamed: 0,10 mins,1bd flat,1br flat,2bd flat,2bed 2bath,2br flat,amazing location,apartment balcony,apartment camden,apartment central,...,victoria park,victorian flat,victorian home,victorian house,west end,west hampstead,west kensington,west london,wi fi,wimbledon tennis
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32670,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32671,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32672,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
# statistical insights of the tf-idf words:
listing_title_tfidf.describe()

Unnamed: 0,10 mins,1bd flat,1br flat,2bd flat,2bed 2bath,2br flat,amazing location,apartment balcony,apartment camden,apartment central,...,victoria park,victorian flat,victorian home,victorian house,west end,west hampstead,west kensington,west london,wi fi,wimbledon tennis
count,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,...,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0
mean,0.00152,0.001796,0.002175,0.00177,0.001808,0.001602,0.001223,0.002517,0.001177,0.003768,...,0.001724,0.001942,0.001491,0.004471,0.001852,0.001399,0.001914,0.005139,0.001543,0.001622
std,0.0341,0.038904,0.043727,0.038994,0.04011,0.036339,0.031038,0.044371,0.030547,0.052267,...,0.038706,0.04089,0.036429,0.061878,0.040634,0.03305,0.039761,0.062858,0.036516,0.037015
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


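The TF-IDF embedding has returned fewer text items than the count vectorizer: only 300, compared with 739. This reduction comes from restricting the ngram_range to bigrams rather than from the TF-IDF weighting itself, since both vectorizers share the same min_df and stop word settings.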

## Postcode Column

The 'Zipcode' column will be split so that only the first portion (the postal district, e.g. SW15) is retained. This reduces the number of distinct values and groups the properties by district, giving a clearer idea of which postcode locations are more predictive of a higher annual revenue:

In [20]:
# splitting the zipcode column, to return the district code only:
X_obj['Zipcode'].str.split().str[0]

0        SW15
1        SW15
2        SW15
3        SW15
4        SW15
         ... 
32669    SE28
32670    SE28
32671      E4
32672      E4
32673      E4
Name: Zipcode, Length: 32674, dtype: object

The column within the dataframe will be redefined, to just contain the district code:

In [21]:
# re-assign the zipcode column with the reduced postal district value:
X_obj['Zipcode'] = X_obj['Zipcode'].str.split().str[0]

In [22]:
zc_array = X_obj['Zipcode']

This column can now be processed. As the postcode values do not have any meaning other than to indicate the area they are assigned to, their occurrence is the important element within this context. Count vectorizing this column will therefore be sufficient:

In [23]:
# 1. Instantiate transformer object
bagofwords_zc = CountVectorizer()

# 2. Fit
bagofwords_zc.fit(zc_array)

# 3. Transform
zc_transformed = bagofwords_zc.transform(zc_array)
zc_transformed

<32674x180 sparse matrix of type '<class 'numpy.int64'>'
	with 32674 stored elements in Compressed Sparse Row format>

In [24]:
# create a dataframe of the count vectorized postcode column
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it):
zipcode_cv = pd.DataFrame(columns=bagofwords_zc.get_feature_names_out(), data=zc_transformed.toarray())

In [25]:
# View this column:
display(zipcode_cv)

Unnamed: 0,e1,e10,e11,e12,e13,e14,e15,e16,e17,e18,...,wc1r,wc1v,wc1x,wc2a,wc2b,wc2e,wc2h,wc2n,wc2r,wd23
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32669,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32670,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32671,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
32672,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
# gain some statistical insights to the postcode values:
zipcode_cv.describe()

Unnamed: 0,e1,e10,e11,e12,e13,e14,e15,e16,e17,e18,...,wc1r,wc1v,wc1x,wc2a,wc2b,wc2e,wc2h,wc2n,wc2r,wd23
count,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,...,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0,32674.0
mean,0.025555,0.003948,0.004254,0.001194,0.003336,0.02176,0.006672,0.010314,0.00655,0.000673,...,0.000857,0.000214,0.00606,9.2e-05,0.00153,0.003856,0.007009,0.00153,0.002357,6.1e-05
std,0.157807,0.062711,0.065086,0.034529,0.057662,0.145903,0.08141,0.101034,0.080665,0.02594,...,0.029262,0.014636,0.07761,0.009582,0.039089,0.06198,0.083425,0.039089,0.048488,0.007824
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [27]:
print(f"There are now 180 different postcode values, whereas previously there were {airbnb_ldn['Zipcode'].nunique()}.")

There are now 180 different postcode values, whereas previously there were 699.


By reducing the quantity of different postcodes, the model results will be easier to interpret, as it becomes clearer which postcode areas are the most lucrative.

## Looking at 'Amenities' column:

The different values within the 'Amenities' column can be split to see the components that they are made up of:

In [28]:
# original amenities split:
list_amenities = X_obj['Amenities'][0][1:-1].split(', ')

In [29]:
# updated amenities split to remove the outer quotation marks:
list_amenities = [amenity[1:-1] for amenity in list_amenities]
list_amenities

['Free parking on premises',
 'Air conditioning',
 'Wifi',
 'Kitchen',
 'Indoor fireplace',
 'Cable TV',
 'Dryer',
 'Dedicated workspace',
 'Hair dryer',
 'TV',
 'Shampoo',
 'Iron',
 'Hangers',
 'Washer',
 'Heating',
 'Essentials',
 'Bathtub',
 'Bidet',
 'Body soap',
 'Cleaning products',
 'Conditioner',
 'Hot water',
 'Shower gel',
 'Bed linens',
 'Clothing storage',
 'Drying rack for clothing',
 'Extra pillows and blankets',
 'Room-darkening shades',
 'Safe',
 'Piano',
 'Sound system',
 'Baby bath',
 'Babysitter recommendations',
 'Board games',
 'Children’s books and toys',
 'Children’s dinnerware',
 'Crib',
 'Portable fans',
 'Carbon monoxide alarm',
 'Fire extinguisher',
 'First aid kit',
 'Smoke alarm',
 'Pocket wifi',
 'Baking sheet',
 'Barbecue utensils',
 'Bread maker',
 'Coffee maker',
 'Cooking basics',
 'Dining table',
 'Dishes and silverware',
 'Dishwasher',
 'Freezer',
 'Hot water kettle',
 'Keurig coffee machine',
 'Kitchenette',
 'Microwave',
 'Mini fridge',
 'Oven',
 '

It looks as though the amenities column is storing the different amenities as one continuous string, not as an actual list. The values will be converted into a list format, so that the individual features can be correctly isolated:

In [30]:
# required imports 
import ast

In [31]:
# convert the amenities column to a list:
X_obj['Amenities'] = X_obj['Amenities'].apply(ast.literal_eval)
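To illustrate what `ast.literal_eval` does to a single cell, the sketch below parses a hypothetical, shortened value written in the same style as the 'Amenities' column (the sample string is an assumption, not a row from the dataset):

```python
import ast

# Hypothetical cell value in the same style as the 'Amenities' column
raw_value = "['Wifi', 'Kitchen', 'Hair dryer']"

# literal_eval safely evaluates the string as a Python literal
parsed_value = ast.literal_eval(raw_value)

# Before: one string, measured in characters; after: a list, measured in items
print(type(raw_value).__name__, len(raw_value))
print(type(parsed_value).__name__, len(parsed_value))
```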

In [32]:
# isolate the original amenities column:
amenities = pd.DataFrame(X_obj['Amenities'])

In [33]:
X_obj['Amenities'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 32674 entries, 0 to 32673
Series name: Amenities
Non-Null Count  Dtype 
--------------  ----- 
32674 non-null  object
dtypes: object(1)
memory usage: 255.4+ KB


Now that the 'Amenities' column is in the correct list format, the specific items can be isolated:

In [34]:
# Get unique features from all rows:
unique_features = set([feature for sublist in X_obj['Amenities'] for feature in sublist])

In [35]:
# Add columns for each unique feature and then initialise with 0:
for feature in unique_features:
    X_obj[feature] = 0


In [36]:
# Iterate over each row and set values based on features
for index, row in X_obj.iterrows():
    for feature in row['Amenities']:
        X_obj.at[index, feature] = 1

In [37]:
# view the newly produced dataframe with the separate amenities columns:
X_obj

Unnamed: 0,Listing Title,Property Type,Zipcode,Amenities,guest_controls,Record player,Lockbox,Smoke detector,Wifi,Private entrance,...,Pack ’n play/Travel crib,Dryer,Full kitchen,Barbecue utensils,TV,EV charger,Single level home,Dedicated workspace,Gym,Keurig coffee machine
0,Cozy 2BR house with a garden view,Entire home,SW15,"[Free parking on premises, Air conditioning, W...","{""allows_children"": true, ""allows_infants"": tr...",0,1,0,1,1,...,0,1,0,1,1,0,0,1,0,1
1,GuestReady - Amazing home with a private garden,Entire home,SW15,"[Wifi, Kitchen, Dryer, Dedicated workspace, Ha...","{""allows_children"": true, ""allows_infants"": tr...",0,1,0,1,0,...,0,1,0,0,1,0,0,1,0,0
2,Cosy cottage on Richmond Park,Entire home,SW15,"[Free parking on premises, Air conditioning, W...","{""allows_children"": false, ""allows_infants"": f...",0,0,0,1,0,...,0,1,0,0,1,0,0,1,0,0
3,"Entire Flat. Free parking, Garden , Richmond park",Entire rental unit,SW15,"[Free parking on premises, Wifi, Kitchen, Dedi...","{""allows_children"": true, ""allows_infants"": tr...",0,1,0,1,1,...,0,0,0,0,1,0,0,1,0,0
4,Maisonette inbetween Richmond Park and Wimbledon,Private room in rental unit,SW15,"[Free parking on premises, Wifi, Breakfast, Ki...","{""allows_children"": false, ""allows_infants"": f...",0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32669,Service Apartment- London Thamesmead,Entire condo,SE28,"[Free parking on premises, Wifi, Kitchen, Dedi...","{""allows_children"": true, ""allows_infants"": tr...",0,1,0,1,1,...,0,0,0,0,1,0,1,1,0,0
32670,Large Double Room with Free Parking and Garden,Private room in home,SE28,"[Free parking on premises, Wifi, Kitchen, Dedi...","{""allows_children"": true, ""allows_infants"": tr...",0,0,0,1,0,...,0,0,0,0,1,0,0,1,0,0
32671,Forest view room in welcoming home,Private room in home,E4,"[Wifi, Kitchen, Hair dryer, TV, Iron, Hangers,...","{""allows_children"": true, ""allows_infants"": fa...",0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
32672,Spacious Double Room with En-suite bathroom,Private room in home,E4,"[Wifi, Kitchen, Hair dryer, TV, Iron, Hangers,...","{""allows_children"": false, ""allows_infants"": f...",0,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0


The amenities have now been processed effectively: each individual amenity has its own column, with a 1 indicating that the amenity is present for a given property and a 0 indicating that it is absent.
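The same kind of 0/1 encoding can also be produced without a row-by-row loop. The following is a minimal sketch using scikit-learn's `MultiLabelBinarizer` on a hypothetical three-row sample; it is an alternative to, not a description of, the approach used in this workbook:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample of already-parsed amenity lists
toy_amenities = pd.Series([
    ["Wifi", "Kitchen", "Dryer"],
    ["Wifi", "TV"],
    ["Kitchen"],
])

# One column per distinct amenity, 1 where the listing includes it
mlb = MultiLabelBinarizer()
amenity_flags = pd.DataFrame(
    mlb.fit_transform(toy_amenities),
    columns=mlb.classes_,
    index=toy_amenities.index,
)
print(amenity_flags)
```

Building all of the columns in one step also avoids adding columns to the dataframe one at a time.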

## Guest Controls

The guest control column stores a dictionary-style string for each listing, with each item included alongside its corresponding true or false value.

The quantity of different 'guest controls' will be determined:

In [38]:
# determining the number of individual features in the 'guest controls' column:
len([x.split(':')[0] for x in X_obj['guest_controls'][0][1:].split(', ')])

40

In [39]:
# looking at individual text items from the 'guest_controls' column:
([x.split(':')[0] for x in X_obj['guest_controls'][0][1:].split(', ')])

['"allows_children"',
 '"allows_infants"',
 '"allows_pets"',
 '"allows_smoking"',
 '"allows_events"',
 '"id"',
 '"host_check_in_time_message"',
 '"localized_structured_house_rules_with_tips"',
 '"p3_structured_house_rules"',
 '"No pets"',
 '"No parties or events"',
 '"Self check-in with lockbox"]',
 '"structured_house_rules"',
 '"No pets"',
 '"No parties or events"]',
 '"structured_house_rules_with_tips"',
 '"long_term_text"',
 '"text"',
 '"tip"',
 '"details"',
 '"airmoji_key"',
 '{"key"',
 '"long_term_text"',
 '"text"',
 '"tip"',
 '"details"',
 '"airmoji_key"',
 '{"key"',
 '"long_term_text"',
 '"text"',
 '"tip"',
 '"details"',
 '"airmoji_key"',
 '{"key"',
 '"long_term_text"',
 '"text"',
 '"tip"',
 '"details"',
 '"airmoji_key"',
 '"allows_non_china_users"']

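The split returns 40 items, although the output shows that several of these are repeated nested keys (such as "long_term_text" and "text") or list values (such as "No pets") rather than distinct controls, so the true number of 'guest control' options is lower. This is a fairly small quantity of features when compared to the previously processed amenities column. Therefore, this column will not be processed further (it could potentially be processed and included in a future project iteration).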
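Should this column be revisited, one possible approach is to parse each value rather than split it on commas. The sketch below assumes the values are valid JSON, which the displayed rows suggest but which has not been verified across the whole column; the sample string is hypothetical and shortened:

```python
import json

# Hypothetical, shortened value in the style of the 'guest_controls' column
raw_controls = (
    '{"allows_children": true, "allows_pets": false, '
    '"structured_house_rules": ["No pets", "No parties or events"]}'
)

# Parse the string into a dictionary (assumes valid JSON)
controls = json.loads(raw_controls)

# Top-level keys only, so nested keys and list values are not counted
top_level_keys = list(controls)

# Keep just the true/false controls, converted to 1/0
boolean_controls = {k: int(v) for k, v in controls.items() if isinstance(v, bool)}

print(top_level_keys)
print(boolean_controls)
```

Counting only the top-level keys avoids the inflation seen when nested keys and list values are split out individually.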

## Concatenating New Dataframe with Processed Text Columns

Two dataframes are to be made. Both dataframes will contain the processed 'amenities' and 'zipcode' columns. One of the dataframes will contain the count vectorized 'Listing Title' with the other containing the TF-IDF processed 'Listing title' column. Both of the dataframes will be modelled, to determine which type of processing has been most successful for the 'Listing title' column.
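When several encoded blocks are joined column-wise, two blocks can end up sharing a column name (for example, a title token that matches a district code). Whether this happens in the real data has not been checked here. The sketch below shows one possible safeguard on hypothetical toy frames, using `add_prefix` before `pd.concat`; the frame names and prefixes are illustrative assumptions:

```python
import pandas as pd

# Hypothetical toy blocks sharing the same row index
toy_num = pd.DataFrame({"Bedrooms": [1, 2]})
toy_zip = pd.DataFrame({"e4": [1, 0], "sw15": [0, 1]})
toy_title = pd.DataFrame({"garden": [1, 0], "e4": [0, 0]})

# Prefix each encoded block so identical tokens stay distinguishable
combined = pd.concat(
    [toy_num, toy_zip.add_prefix("zip_"), toy_title.add_prefix("title_")],
    axis=1,
)

print(combined.columns.tolist())
print(combined.columns.is_unique)
```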

Now that several columns have been transformed to be represented through numeric values, their original columns can be dropped:

In [40]:
# drop the redundant columns:
X_obj.drop(columns=[
    'Listing Title', 'Zipcode', 'Amenities', 'guest_controls', 'Property Type'
], inplace=True)

In [41]:
# concatenate with the count vectorized listing title column:
airbnb_ldn_cv = pd.concat([X_num, X_obj, zipcode_cv, listing_title_cv], axis=1)

In [42]:
# concatenate with the tf-idf listing title column:
airbnb_ldn_tfidf = pd.concat([X_num, X_obj, zipcode_cv, listing_title_tfidf], axis=1)

Some additional columns need to be dropped - columns on the original dataframe that have now been numerically represented:

### Export these dataframes:

The two concatenated dataframes will be exported:

Some additional columns need to be dropped as copies are now made since concatenated with the main dataframe:

In [43]:
# for the count vectorized column:
airbnb_ldn_cv.to_csv('airbnb_ldn_cv.csv', index=False)

In [44]:
# for the TF-IDF column:
airbnb_ldn_tfidf.to_csv('airbnb_ldn_tfidf.csv', index=False)

In [45]:
airbnb_ldn_cv.shape

(32674, 1093)

### DataFrame to export to produce the map visualisation

A version of the processed data will be exported so that certain components can be visualised using other software (Tableau). An amended dataframe will be created and exported for that purpose.

This is completed at this stage, as the postcodes grouped by postal district are required.

In [46]:
# concatenate the relevant dataframes together:
tableau_airbnb = pd.concat([airbnb_ldn[['Latitude', 'Longitude', 'Annual Revenue LTM (Native)']], zc_array], axis=1)

In [47]:
# export the dataframe:
tableau_airbnb.to_csv('location_airbnb_ldn_3.csv', index=False)

# Conclusion

Various approaches have been taken to deal with the different text columns. The different formats in which the text columns originally existed meant that a 'one size fits all' approach was not possible.

Through count vectorizing, TF-IDF and a version of one-hot encoding, these text columns have been embedded numerically and can now be fed into the machine learning models. Models that account for these text columns are built in the following workbook, 'Advanced Modelling'.