### Principal Component Analysis

## Reviews Preparation for Natural Language Processing

Add review_scores_rating from listings data to reviews data. Listings data only has review scores pertaining to the most recent review for a particular listing. This means that there will be many reviews that do not have a score, which we will remove during the merge.

In [54]:
#Read in libraries
import pandas as pd
import swifter

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import nltk

In [55]:
#Suppress future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [56]:
#Increase number of columns and rows displayed by Pandas
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',100)

#Ignore all remaining warnings (warnings is already imported above)
warnings.simplefilter('ignore')

In [57]:
#Set path to listings and review data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate/'

In [58]:
#Parse in listings dates
date = ['calendar_last_scraped', 'calendar_updated', 'first_review' ,'host_since', 'last_review']

#Read in Airbnb Listings Data
listings = pd.read_csv(path + '01_04_2020_Listings_Cleaned.csv',parse_dates=date, index_col=0, low_memory=True, sep=',')

#Read in Airbnb Reviews data
reviews = pd.read_csv(path + '01_04_2020_Reviews_Cleaned.csv', sep = ',',
                       parse_dates=['date'], low_memory=True,index_col=0)

In [59]:
#Preview listings
listings.head().T

Unnamed: 0,0,1,2,3,5
accommodates,3,5,2,2,6
amenities,TV Cable TV Internet Wifi Kitchen Pets liv...,Internet Wifi Kitchen Heating Family/kid fri...,TV Internet Wifi Kitchen Free street parking...,TV Internet Wifi Kitchen Free street parking...,TV Cable TV Internet Wifi Kitchen Free par...
availability_30,0,0,30,30,0
availability_365,77,0,365,365,20
bathrooms,1,1,4,4,1
bed_type,Real Bed,Real Bed,Real Bed,Real Bed,Real Bed
bedrooms,1,2,1,1,2
beds,2,3,1,1,3
calculated_host_listings_count,1,1,9,9,1
calculated_host_listings_count_private_rooms,0,0,9,9,0


In [60]:
#Preview reviews data
reviews.head()

Unnamed: 0,comments,date,id,listing_id,reviewer_id,reviewer_name
19330,...,2013-12-01,9000494,209514,9215434,Ramon
143113,Stop and book it now. Rea (Website hi...,2017-06-07,158659946,4833101,35954713,Tim
1021372,So I moved to SF in late May from Mich...,2013-06-02,4928809,635850,6542011,Michael
64636,"This was the perfect home from home, o...",2014-10-16,21374058,1150867,13431837,Chris & Tess
174143,We loved our time in beautiful SF! The ...,2018-08-10,305042501,7226841,73281468,Jessica ( + Mark)


**Merge review_scores_rating from listings to corresponding reviews**

Ratings on Airbnb
On Airbnb, hosts and guests are not reviewed in the same way. Guests simply get a written review, whereas hosts also receive a star rating from 1 to 5 on six parameters:

Accuracy
Communication
Cleanliness
Location
Check In
Value

These six ratings are also combined into one overall rating.

In [61]:
listings_cols=['host_is_superhost', 'host_response_time', 'latitude', 'longitude', 
               'neighbourhood_cleansed', 'number_of_reviews', 'room_type','id']

#Merge the selected listings columns into the reviews dataframe, matching reviews.listing_id to listings.id
reviews = reviews.merge(listings[listings_cols], left_on= ['listing_id'], 
                              right_on=['id'], suffixes=('_review', '_listings'))

#Drop duplicate values
reviews.drop_duplicates(inplace=True)

#Drop the listings id column, which duplicates listing_id after the merge
reviews.drop(columns=['id_listings'], inplace=True)

#View merged reviews shape
print('Data shape:', reviews.shape)

#Check
reviews.head().T

Data shape: (3730148, 13)


Unnamed: 0,0,1,2,3,4
comments,...,...,...,...,...
date,2013-12-01 00:00:00,2013-12-01 00:00:00,2013-12-01 00:00:00,2013-12-01 00:00:00,2013-12-01 00:00:00
id_review,9000494,9000494,9000494,9000494,9000494
listing_id,209514,209514,209514,209514,209514
reviewer_id,9215434,9215434,9215434,9215434,9215434
reviewer_name,Ramon,Ramon,Ramon,Ramon,Ramon
host_is_superhost,False,False,False,False,False
host_response_time,within an hour,within an hour,,within an hour,within an hour
latitude,37.7712,37.7712,37.7712,37.7712,37.7712
longitude,-122.45,-122.45,-122.45,-122.45,-122.45
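
The preview above shows the same `id_review` repeated across rows, which may indicate that `id` is not unique in the listings table and that the merge multiplied some reviews. A minimal sketch of one way to check this, using made-up toy frames (`toy_reviews` and `toy_listings` are illustrative names and values, not the real data) and the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for the reviews and listings frames (illustrative values only)
toy_reviews = pd.DataFrame({"id": [10, 11, 12], "listing_id": [1, 1, 2]})
toy_listings = pd.DataFrame({
    "id": [1, 2, 2],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt"],
})

# Count listing ids that appear more than once on the right-hand side
dup_ids = toy_listings["id"].duplicated().sum()
print("Duplicated listing ids:", dup_ids)

# validate='many_to_one' raises MergeError when the right-hand key is not unique
try:
    toy_reviews.merge(toy_listings, left_on="listing_id", right_on="id",
                      suffixes=("_review", "_listings"), validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print("Merge passed validation:", merge_ok)

# De-duplicating the key first keeps exactly one row per review
deduped = toy_listings.drop_duplicates(subset="id")
merged = toy_reviews.merge(deduped, left_on="listing_id", right_on="id",
                           suffixes=("_review", "_listings"), validate="many_to_one")
print(merged.shape)
```

Whether de-duplicating on `id` is appropriate for the real listings file depends on why the duplicates exist, so this is a diagnostic sketch and not a prescribed fix.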


### Quick clean up for NLP

In [62]:
#View missing values in reviews
print('\nMissing values:\n', reviews.isna().sum())


Missing values:
 comments                       0
date                           0
id_review                      0
listing_id                     0
reviewer_id                    0
reviewer_name                  0
host_is_superhost              0
host_response_time        311302
latitude                       0
longitude                      0
neighbourhood_cleansed         0
number_of_reviews              0
room_type                      0
dtype: int64


In [63]:
#Remove rows with a missing host_response_time
reviews = reviews[reviews.host_response_time.notna()]

#View updated reviews shape
print('Updated reviews data shape:',reviews.shape)

Updated reviews data shape: (3418846, 13)


Sample Data 

We'll take a 5% random sample of the reviews for our analysis.

In [64]:
#Sample
sample = reviews.sample(frac=0.05,random_state=1)

#Sample Preview
print('Sample shape:', sample.shape)
display(sample.head().T)

Sample shape: (170942, 13)


Unnamed: 0,4436320,1957096,3955476,4287431,1933145
comments,Great stay. Place is large and a great value....,I had the best experience ever in Airbnb with ...,Je was very hospitable & sweet. The common are...,I felt genuinely welcome at Tammy and Gabriel'...,we loved staying with caro! my friend and i ar...
date,2018-09-25 00:00:00,2018-08-08 00:00:00,2017-10-21 00:00:00,2013-10-01 00:00:00,2018-10-03 00:00:00
id_review,328282929,304031277,205287584,7758550,331801305
listing_id,26909554,11437138,20368086,1667732,21220773
reviewer_id,1178520,52206767,151432903,3438775,12008848
reviewer_name,William (Gui),Jihee,Ling,Jesper,Ariel
host_is_superhost,False,True,False,False,False
host_response_time,within a few hours,within an hour,within an hour,within an hour,within a few hours
latitude,37.749,37.7773,37.7466,37.7551,37.7206
longitude,-122.481,-122.411,-122.478,-122.41,-122.429
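
Fixing `random_state` is what makes the 5% sample reproducible from run to run. A small sketch on a made-up frame (`toy` is an illustrative name) showing that two calls with the same seed select the same rows:

```python
import pandas as pd

# Illustrative frame standing in for the reviews data
toy = pd.DataFrame({"listing_id": range(1000)})

# Two draws with the same fraction and seed
s1 = toy.sample(frac=0.05, random_state=1)
s2 = toy.sample(frac=0.05, random_state=1)

print(len(s1))                      # number of sampled rows
print(s1.index.equals(s2.index))    # same rows in the same order
```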


Stop Word Removal

Language Parsing

In [65]:
#Import stopwords
from nltk.corpus import stopwords

#check stopwords
stop =stopwords.words('english')
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [66]:
#Exclude stopwords from comments
sample['comments_parsed'] = sample['comments'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

#Check
sample.head()

Unnamed: 0,comments,date,id_review,listing_id,reviewer_id,reviewer_name,host_is_superhost,host_response_time,latitude,longitude,neighbourhood_cleansed,number_of_reviews,room_type,comments_parsed
4436320,Great stay. Place is large and a great value....,2018-09-25,328282929,26909554,1178520,William (Gui),False,within a few hours,37.74905,-122.48099,Outer Sunset,29,Entire home/apt,Great stay. Place large great value. Five star...
1957096,I had the best experience ever in Airbnb with ...,2018-08-08,304031277,11437138,52206767,Jihee,True,within an hour,37.77733,-122.41078,South of Market,150,Private room,I best experience ever Airbnb Maria. I say bes...
3955476,Je was very hospitable & sweet. The common are...,2017-10-21,205287584,20368086,151432903,Ling,False,within an hour,37.74657,-122.47787,Parkside,77,Private room,Je hospitable & sweet. The common area super c...
4287431,I felt genuinely welcome at Tammy and Gabriel'...,2013-10-01,7758550,1667732,3438775,Jesper,False,within an hour,37.75511,-122.41,Mission,24,Private room,"I felt genuinely welcome Tammy Gabriel's, than..."
1933145,we loved staying with caro! my friend and i ar...,2018-10-03,331801305,21220773,12008848,Ariel,False,within a few hours,37.72063,-122.42917,Excelsior,153,Private room,loved staying caro! friend huge princess diari...
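
The preview above still contains words such as `I` and `The`, because the NLTK list is lowercase and the membership test is case-sensitive. One possible variant is sketched below. It lowercases each token before the lookup and uses a set for faster membership tests. The helper name `remove_stopwords` is hypothetical, and `stop_set` is a small hand-written stand-in for `stopwords.words('english')` so that the example runs without downloading the NLTK corpus.

```python
# Small stand-in for stopwords.words('english'); the real list is much longer
stop_set = {"i", "the", "is", "and", "a", "was", "very", "with", "in"}

def remove_stopwords(text, stop_words=stop_set):
    """Drop tokens whose lowercase form is in stop_words (hypothetical helper)."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

cleaned = remove_stopwords("I had the best experience ever in Airbnb with Maria")
print(cleaned)
```

Tokens are still split on whitespace only, so a word with attached punctuation (for example `stay.`) is compared with the punctuation included.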


## Sentiment Analysis

In [67]:
#Import and instantiate sentiment intensity analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

#Write functions to capture negative, positive, neutral, and compound scores to later apply to sample.comments_parsed
def neg_scores(comment):
    #Function to capture negative sentiment score 
    score = analyzer.polarity_scores(comment)['neg']
    return score

def pos_scores(comment):
    #Function to capture positive sentiment score 
    score = analyzer.polarity_scores(comment)['pos']
    return score

def neutral_scores(comment):
    #Function to capture neutral sentiment score 
    score = analyzer.polarity_scores(comment)['neu']
    return score

def compound_scores(comment):
    #Function to capture compound sentiment score 
    score = analyzer.polarity_scores(comment)['compound']
    return score
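
Each helper above calls `polarity_scores` again on the same text, so every comment ends up being scored four times. Since `polarity_scores` returns a dict with the keys `neg`, `neu`, `pos`, and `compound`, a single pass can fill all four columns. The sketch below uses a hypothetical `fake_polarity_scores` in place of the VADER analyzer so that it runs without the library. With the real analyzer, `analyzer.polarity_scores` would be passed to `map` instead.

```python
import pandas as pd

def fake_polarity_scores(text):
    """Hypothetical stand-in returning the same keys as VADER's polarity_scores."""
    n_pos = text.lower().count("great")
    n_neg = text.lower().count("dirty")
    total = max(len(text.split()), 1)
    return {"neg": n_neg / total, "neu": 1 - (n_pos + n_neg) / total,
            "pos": n_pos / total, "compound": (n_pos - n_neg) / total}

# Illustrative comments, not taken from the dataset
toy = pd.DataFrame({"comments_parsed": ["Great stay great value", "Room dirty"]})

# Score each comment once, then expand the list of dicts into columns
scores = pd.DataFrame(toy["comments_parsed"].map(fake_polarity_scores).tolist(),
                      index=toy.index).add_prefix("sentiment_")
toy = toy.join(scores)
print(toy.columns.tolist())
```

The NLTK stop list printed earlier includes negations such as `not`, and VADER takes negation into account when scoring, so it may be worth comparing scores computed on `comments` and on `comments_parsed`.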

In [68]:
#Apply functions to the parsed comments and assign each score to its own column
sample['sentiment_neg']= sample['comments_parsed'].swifter.apply(neg_scores)
sample['sentiment_pos']= sample['comments_parsed'].swifter.apply(pos_scores)
sample['sentiment_neu']= sample['comments_parsed'].swifter.apply(neutral_scores)
sample['sentiment_compound']= sample['comments_parsed'].swifter.apply(compound_scores)

In [69]:
#Set path to write processed data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\03_Processed'

#Write to csv
sample.to_csv(path + '/01_10_2020_Reviews_Processed_Text_Analysis.csv',sep=',', index=False)