### Principal Component Analysis

## Reviews Preparation for Natural Language Processing

The goal is to attach review_scores_rating from the listings data to the reviews data. The listings data only holds the review score tied to the most recent review of each listing, so many reviews have no score. Those reviews are removed during the merge.
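
One way to read the merge described above is to keep only the reviews whose date matches the listing's last_review, since that is the only review a score can be tied to. The following is a minimal sketch on toy data, not the notebook's actual merge. It assumes columns named `listing_id`, `date`, `id`, `last_review` and `review_scores_rating`; the frames and values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical values)
toy_reviews = pd.DataFrame({
    'listing_id': [1, 1, 2],
    'date': pd.to_datetime(['2019-01-05', '2019-06-01', '2019-03-10']),
    'comments': ['Nice', 'Great stay', 'Clean room'],
})
toy_listings = pd.DataFrame({
    'id': [1, 2],
    'last_review': pd.to_datetime(['2019-06-01', '2019-03-10']),
    'review_scores_rating': [95.0, 88.0],
})

# Inner merge on listing and date: a review that is not the most recent
# one for its listing finds no match and is dropped
scored = toy_reviews.merge(
    toy_listings,
    left_on=['listing_id', 'date'],
    right_on=['id', 'last_review'],
    how='inner',
)
print(scored[['listing_id', 'date', 'review_scores_rating']])
```

With an inner join the unscored reviews disappear in the same step, so no separate filtering is needed afterwards.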

In [59]:
#Read in libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import nltk

In [60]:
#Suppress future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [61]:
#Increase number of columns and rows displayed by Pandas
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',100)

#Ignore all other warnings (warnings is already imported above)
warnings.simplefilter('ignore')

In [62]:
#Set path to listings and review data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate/'

In [63]:
#Listings columns to parse as dates
date = ['calendar_last_scraped', 'calendar_updated', 'first_review', 'host_since', 'last_review']

#Read in Airbnb Listings data
listings = pd.read_csv(path + '01_04_2020_Listings_Cleaned.csv', parse_dates=date, index_col=0, low_memory=True, sep=',')

#Read in Airbnb Reviews data
reviews = pd.read_csv(path + '01_04_2020_Reviews_Cleaned.csv', sep=',',
                      parse_dates=['date'], low_memory=True, index_col=0)

In [64]:
#Preview listings
listings.head().T

Unnamed: 0,0,1,2,3,5
accommodates,3,5,2,2,6
amenities,TV Cable TV Internet Wifi Kitchen Pets liv...,Internet Wifi Kitchen Heating Family/kid fri...,TV Internet Wifi Kitchen Free street parking...,TV Internet Wifi Kitchen Free street parking...,TV Cable TV Internet Wifi Kitchen Free par...
availability_30,0,0,30,30,0
availability_365,77,0,365,365,20
bathrooms,1,1,4,4,1
bed_type,Real Bed,Real Bed,Real Bed,Real Bed,Real Bed
bedrooms,1,2,1,1,2
beds,2,3,1,1,3
calculated_host_listings_count,1,1,9,9,1
calculated_host_listings_count_private_rooms,0,0,9,9,0


In [65]:
#Preview reviews data
reviews.head()

Unnamed: 0,comments,date,id,listing_id,reviewer_id,reviewer_name
19330,...,2013-12-01,9000494,209514,9215434,Ramon
143113,Stop and book it now Rea Website hidd...,2017-06-07,158659946,4833101,35954713,Tim
1021372,So I moved to SF in late May from Mich...,2013-06-02,4928809,635850,6542011,Michael
64636,This was the perfect home from home ou...,2014-10-16,21374058,1150867,13431837,Chris & Tess
174143,We loved our time in beautiful SF The p...,2018-08-10,305042501,7226841,73281468,Jessica ( + Mark)


**Merge review_scores_rating from listings to corresponding reviews**

Ratings on Airbnb

On Airbnb, hosts and guests are not reviewed in the same way. Guests simply receive a written review, whereas hosts also receive a star rating from 1 to 5 on six parameters:

- Accuracy
- Communication
- Cleanliness
- Location
- Check In
- Value

These are also combined into one overall rating.

In [66]:
listings_cols=['host_is_superhost', 'host_response_time', 'latitude', 'longitude', 
               'neighbourhood_cleansed', 'number_of_reviews', 'room_type','id']

#Merge the selected listing attributes into the reviews dataframe on listing id.
#Note: listings_cols does not include review_scores_rating, and the merge key does not include last_review
review_scores = reviews.merge(listings[listings_cols], left_on=['listing_id'],
                              right_on=['id'], suffixes=('_review', '_listings'))

#Drop fully duplicated rows
review_scores.drop_duplicates(inplace=True)

#Drop the redundant listing id column (it duplicates listing_id)
review_scores.drop(columns=['id_listings'], inplace=True)

#View review_scores shape
print('Data shape:', review_scores.shape)

#Check
review_scores.head().T

Data shape: (3762306, 13)


Unnamed: 0,0,1,2,3,4
comments,...,...,...,...,...
date,2013-12-01 00:00:00,2013-12-01 00:00:00,2013-12-01 00:00:00,2013-12-01 00:00:00,2013-12-01 00:00:00
id_review,9000494,9000494,9000494,9000494,9000494
listing_id,209514,209514,209514,209514,209514
reviewer_id,9215434,9215434,9215434,9215434,9215434
reviewer_name,Ramon,Ramon,Ramon,Ramon,Ramon
host_is_superhost,False,False,False,False,False
host_response_time,within an hour,within an hour,,within an hour,within an hour
latitude,37.7712,37.7712,37.7712,37.7712,37.7712
longitude,-122.45,-122.45,-122.45,-122.45,-122.45
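
The preview above shows the same id_review on every row. One possible cause is a merge key that is not unique on the listings side, which repeats each review once per matching listing row. The following is a hedged sketch of how one might check for that, using hypothetical toy frames; `validate` is a standard `DataFrame.merge` parameter.

```python
import pandas as pd

toy_reviews = pd.DataFrame({'listing_id': [10, 10, 20], 'id': [1, 2, 3]})
# Listing 10 appears twice here (hypothetical), so its reviews would be repeated
toy_listings = pd.DataFrame({'id': [10, 10, 20],
                             'number_of_reviews': [400, 439, 7]})

# Count rows whose listing id has already been seen
n_dup_keys = toy_listings['id'].duplicated().sum()
print('Duplicated listing ids:', n_dup_keys)

# validate='many_to_one' raises MergeError when the right-hand key is not unique
try:
    toy_reviews.merge(toy_listings, left_on='listing_id', right_on='id',
                      suffixes=('_review', '_listings'), validate='many_to_one')
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print('Merge passed validation:', merge_ok)

# One way to restore uniqueness: keep a single row per listing id
deduped = toy_listings.drop_duplicates(subset='id', keep='last')
merged = toy_reviews.merge(deduped, left_on='listing_id', right_on='id',
                           suffixes=('_review', '_listings'), validate='many_to_one')
print(merged.shape)
```

Which row to keep per listing (`keep='last'` here) is a modelling choice and depends on how the listings file was assembled.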


### Quick clean up for NLP

In [67]:
#View missing values in review_scores
print('\nMissing values:\n', review_scores.isna().sum())


Missing values:
 comments                       0
date                           0
id_review                      0
listing_id                     0
reviewer_id                    0
reviewer_name                  0
host_is_superhost              0
host_response_time        313129
latitude                       0
longitude                      0
neighbourhood_cleansed         0
number_of_reviews              0
room_type                      0
dtype: int64


In [68]:
#Remove rows with missing host_response_time (~ inverts the boolean mask)
review_scores = review_scores[~review_scores.host_response_time.isna()]

#View updated reviews shape
print('Updated reviews data shape:',review_scores.shape)

Updated reviews data shape: (3449177, 13)


Sample Data 

We'll take a 5% sample for our analysis

In [69]:
#Sample
sample = review_scores.sample(frac=0.05,random_state=1)

#Sample Preview
print('Sample shape:', sample.shape)
display(sample.head().T)

Sample shape: (172459, 13)


Unnamed: 0,1849956,2220730,1156352,2038492,796105
comments,This a jewel in the DogpatchPotrero area Super...,Megs place is sparkling clean and in an awesom...,We had a great time in SF and loved staying at...,Very cozy place with the beach literally right...,Kevin is very professional prompt and warmhear...
date,2016-01-26 00:00:00,2019-03-08 00:00:00,2015-08-16 00:00:00,2018-10-17 00:00:00,2018-01-10 00:00:00
id_review,60697388,421167935,42860595,337874233,226453615
listing_id,288213,10427768,4269254,4252808,1393654
reviewer_id,308818,32888504,35391497,76550363,133118406
reviewer_name,Desigan,家惠,Hallie,Patrick,Daniel
host_is_superhost,True,True,True,True,True
host_response_time,within an hour,within an hour,within an hour,within an hour,within an hour
latitude,37.7597,37.7736,37.783,37.7573,37.7524
longitude,-122.388,-122.426,-122.432,-122.509,-122.459


Stop Word Removal

Language Parsing

In [70]:
#Import stopwords (requires the NLTK stopwords corpus, e.g. via nltk.download('stopwords'))
from nltk.corpus import stopwords

#Check stopwords
stop = stopwords.words('english')
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [71]:
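#Exclude stopwords from comments. Use a set for fast lookup and lowercase each
#word before the test so capitalised stopwords such as "This" are removed too
stop_set = set(stop)
sample['comments_parsed'] = sample['comments'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in stop_set]))

#Check (this previews review_scores, so the comments_parsed column added to sample is not shown)
review_scores.head()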

Unnamed: 0,comments,date,id_review,listing_id,reviewer_id,reviewer_name,host_is_superhost,host_response_time,latitude,longitude,neighbourhood_cleansed,number_of_reviews,room_type
0,...,2013-12-01,9000494,209514,9215434,Ramon,False,within an hour,37.77118,-122.44963,Haight Ashbury,400,Private room
1,...,2013-12-01,9000494,209514,9215434,Ramon,False,within an hour,37.77118,-122.44963,Haight Ashbury,439,Private room
3,...,2013-12-01,9000494,209514,9215434,Ramon,False,within an hour,37.77118,-122.44963,Haight Ashbury,476,Private room
4,...,2013-12-01,9000494,209514,9215434,Ramon,False,within an hour,37.771184,-122.449626,Haight Ashbury,389,Private room
6,...,2013-12-01,9000494,209514,9215434,Ramon,False,within an hour,37.77118,-122.44963,Haight Ashbury,429,Private room
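
The NLTK stop words are all lowercase, so a case-sensitive membership test leaves capitalised tokens such as "This" or "We" in place. The following is a minimal sketch of a case-insensitive filter. It uses a small hand-written stop list instead of `stopwords.words('english')` so that it runs without the NLTK corpus, and the helper name `remove_stopwords` is hypothetical.

```python
# Small illustrative stop list; the notebook uses NLTK's English list instead
toy_stop = {'this', 'is', 'a', 'the', 'we', 'in', 'and'}

def remove_stopwords(text, stop_words):
    """Drop tokens whose lowercase form is in stop_words; keep original casing."""
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

example = 'This is a jewel in the Dogpatch area'
print(remove_stopwords(example, toy_stop))
```

Keeping the stop words in a set makes each lookup constant time, which matters when the filter runs over a large number of comments.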


In [72]:
#Set path to write processed data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\03_Processed'

#Write to csv
sample.to_csv(path + '/01_10_2020_Reviews_Processed_Text_Analysis.csv',sep=',', index=False)
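
Building paths by string concatenation makes it easy to drop or double a separator. The following is a sketch of the same kind of write using `pathlib`. It points at a temporary directory rather than the folder above, and the toy frame and the file name `reviews_processed.csv` are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_sample = pd.DataFrame({'comments_parsed': ['jewel Dogpatch area']})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / '03_Processed'          # placeholder output folder
    out_dir.mkdir(parents=True, exist_ok=True)    # create it if it is missing
    out_file = out_dir / 'reviews_processed.csv'  # hypothetical file name
    toy_sample.to_csv(out_file, index=False)
    round_trip = pd.read_csv(out_file)            # read back to confirm the write

print(round_trip)
```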