**Pre-processing Airbnb Review Data for NLP**

# Introduction

## Read in libraries, data, and set notebook preferences

**Read in libraries**

In [27]:
#Read in libraries
import pandas as pd
import swifter
import numpy as np

import matplotlib.pyplot as plt

import nltk
import sklearn

**Read in data**

In [11]:
#Set path to data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate'

#Read in data
df = pd.read_csv(path + '/2020_0131_Reviews_Cleaned.csv',sep=',',index_col=0,
                 parse_dates=['date'])
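
The read above joins a raw Windows folder string to the file name with a `/`. A minimal sketch of building the same kind of path with `pathlib` is shown below. The folder `C:\data\02_Intermediate` and the helper `build_csv_path` are hypothetical names used only for illustration, and `PureWindowsPath` is chosen so the example prints the same result on any operating system; in the notebook itself `pathlib.Path` would be the more likely choice.

```python
from pathlib import PureWindowsPath

def build_csv_path(folder, filename):
    """Join a folder and a file name without hand-written separators."""
    return PureWindowsPath(folder) / filename

# Hypothetical folder, used only for illustration
csv_path = build_csv_path(r'C:\data\02_Intermediate', '2020_0131_Reviews_Cleaned.csv')
print(csv_path)
```

The resulting path object can be passed to `pd.read_csv` in place of the concatenated string.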

**Set preferences for notebook**

In [32]:
#Ignore warnings
import warnings; warnings.simplefilter('ignore')

#Increase number of columns and rows displayed by Pandas
pd.set_option('display.max_colwidth',1000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',100)

#Set matplotlib style to FiveThirtyEight
plt.style.use('fivethirtyeight')

## Preview data

In [13]:
#View data and shape
print('Reviews data shape:', df.shape)
display(df.head())

Reviews data shape: (456909, 7)


Unnamed: 0,comments,date,id,listing_id,reviewer_id,reviewer_name,language
19330,"Hello Josh Thank you very much for everything. I found myself very comfortable in your home. Quiet, comfortable and very complete and very clean, which I value highly. Next time I'd come with my family. I hope it's possible.",2013-12-01,9000494,209514,9215434,Ramon,en
143113,"Stop and book it now. Rea (Website hidden by Airbnb) this later!!! If your a single person looking for a story book San Francisco experience, look no farther. Staying in Mikes place couldn't be any more wonderful. If your familiar with ""Tales of the City"" Mike is the Olympia Dukakis. The home is warm and inviting with all the nuances of an old Victorian. Mike is an amazing host . He can tell you how walk drive or public transit the city (don't bother with a car). Would love to keep the gem to myself but everyone deserves this unique place to lay your head. Make sure while you're there be introduced to William . Book IT you won't be disappointed .",2017-06-07,158659946,4833101,35954713,Tim,en
1021372,"So I moved to SF in late May from Michigan to intern at Genentech for the summer. I stayed at Anjan’s apartment for 7 days while I was looking for a more permanent housing situation. Anjan was extremely hospitable and welcoming throughout the week. He was also very knowledgeable about the area and always offered to help in any way that he could. The area (SOMA) is very safe and is very “walkable.” There are plenty of restaurants and stores nearby (there’s even a target a few blocks away), so you have everything you need within a couple blocks from the apartment. As for the bedroom, it was spacious and clean. The bathroom was nice and I had to myself for the entirety of my stay. I felt very comfortable living at Anjan’s for a week and I really enjoyed staying there. If you’re a respectful person and are looking for a place to stay in SF for a short time, I highly recommend staying at Anjan’s. He’s a great person and a great host.",2013-06-02,4928809,635850,6542011,Michael,en
64636,"This was the perfect home from home, our host was amazing like most California's we had a wonderful time.",2014-10-16,21374058,1150867,13431837,Chris & Tess,en
147460,"旅游期间在budi的房子住了三晚,体验非常的不错!首先我所住的房间在首层,靠近马路,窗户很大,采光很好,而且这条马路车流很少,所以也很安静;房间装修很好,大床还有电动折叠功能,写字桌上还有budi免费提供的零食;厨房设备齐全,吃不惯美式快餐的人可以去超市买点东西回来自己料理!其次,房东Budi为人很热情,也会说少量中文,他在我头一天到达后,很详细地给我介绍了交通,景点信息,当地风俗,让我迅速了解了不少攻略｡房子里还有一个美国老奶奶,每天早上都会在厨房泡上一壶咖啡,想喝的都可以免费去喝!最后是交通方面,从房子走到金门公园(加州科学博物馆,笛洋美术馆)只需要十五分钟左右,房子附近五分钟路程内有公交车可以到达市区内各个景点,而且基本都不超过40分钟(到联合广场大概半小时)｡",2017-02-01,129708498,4948327,40378027,被公司知道微博名故此修改,zh-cn


In [14]:
#View data types
df.dtypes

comments                 object
date             datetime64[ns]
id                        int64
listing_id                int64
reviewer_id               int64
reviewer_name            object
language                 object
dtype: object

# Preprocessing

## spaCy pipeline

## Stopword removal before sentiment analysis with VADER

In [65]:
#Import stopwords
from nltk.corpus import stopwords

#Check stopwords
stop = stopwords.words('english')
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [66]:
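#Exclude stopwords from comments
#Note: sample is not created in the cells shown here; it is assumed to be a DataFrame of reviews loaded earlier
#Note: the match is case-sensitive, so capitalized stopwords such as 'I' and 'The' are kept
stop_set = set(stop)  #Set membership checks are faster than list lookups
sample['comments_parsed'] = sample['comments'].swifter.apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop_set))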

#Check
sample.head()

Unnamed: 0,comments,date,id_review,listing_id,reviewer_id,reviewer_name,host_is_superhost,host_response_time,latitude,longitude,neighbourhood_cleansed,number_of_reviews,room_type,comments_parsed
4436320,Great stay. Place is large and a great value....,2018-09-25,328282929,26909554,1178520,William (Gui),False,within a few hours,37.74905,-122.48099,Outer Sunset,29,Entire home/apt,Great stay. Place large great value. Five star...
1957096,I had the best experience ever in Airbnb with ...,2018-08-08,304031277,11437138,52206767,Jihee,True,within an hour,37.77733,-122.41078,South of Market,150,Private room,I best experience ever Airbnb Maria. I say bes...
3955476,Je was very hospitable & sweet. The common are...,2017-10-21,205287584,20368086,151432903,Ling,False,within an hour,37.74657,-122.47787,Parkside,77,Private room,Je hospitable & sweet. The common area super c...
4287431,I felt genuinely welcome at Tammy and Gabriel'...,2013-10-01,7758550,1667732,3438775,Jesper,False,within an hour,37.75511,-122.41,Mission,24,Private room,"I felt genuinely welcome Tammy Gabriel's, than..."
1933145,we loved staying with caro! my friend and i ar...,2018-10-03,331801305,21220773,12008848,Ariel,False,within a few hours,37.72063,-122.42917,Excelsior,153,Private room,loved staying caro! friend huge princess diari...
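
The parsed comments above still contain capitalized stopwords such as "I" and "The", because the NLTK list is lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive variant follows. The helper `remove_stopwords` and the small `STOP_WORDS` subset are assumptions made for illustration; in the notebook the set would presumably be built from `stopwords.words('english')`.

```python
# Small illustrative subset of stopwords (an assumption for this sketch);
# in the notebook this would be set(stopwords.words('english')).
STOP_WORDS = {'i', 'the', 'is', 'and', 'a', 'was', 'in'}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stopword; keep the casing of the rest."""
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

cleaned = remove_stopwords('I had the best experience ever in Airbnb')
print(cleaned)
```

Tokens with attached punctuation (for example "stay.") are compared as-is, so this sketch only changes how casing is handled.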


## Sentiment Analysis

In [67]:
#Import and instantiate sentiment intensity analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

#Write functions to capture negative, positive, neutral, and compound scores to later apply to sample['comments_parsed']
def neg_scores(comment):
    #Function to capture negative sentiment score
    score = analyzer.polarity_scores(comment)['neg']
    return score

def pos_scores(comment):
    #Function to capture positive sentiment score
    score = analyzer.polarity_scores(comment)['pos']
    return score

def neutral_scores(comment):
    #Function to capture neutral sentiment score
    score = analyzer.polarity_scores(comment)['neu']
    return score

def compound_scores(comment):
    #Function to capture compound sentiment score
    score = analyzer.polarity_scores(comment)['compound']
    return score

In [68]:
#Apply functions to the parsed comments and assign each score to its own column
sample['sentiment_neg'] = sample['comments_parsed'].swifter.apply(neg_scores)
sample['sentiment_pos'] = sample['comments_parsed'].swifter.apply(pos_scores)
sample['sentiment_neu'] = sample['comments_parsed'].swifter.apply(neutral_scores)
sample['sentiment_compound'] = sample['comments_parsed'].swifter.apply(compound_scores)
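
Each of the four applies above calls `polarity_scores` on every comment, so every review is scored four times. A minimal sketch of a single-pass alternative is shown below. The helper `score_comments` and the stand-in `stub_polarity` are hypothetical; the stub only mimics the shape of the dictionary returned by `polarity_scores` (keys `neg`, `neu`, `pos`, `compound`) so the example runs without `vaderSentiment` installed, and its values are not real VADER output.

```python
def score_comments(comments, polarity_fn):
    """Call polarity_fn once per comment and collect the four scores per row."""
    rows = []
    for comment in comments:
        scores = polarity_fn(comment)  # dict with 'neg', 'neu', 'pos', 'compound'
        rows.append({
            'sentiment_neg': scores['neg'],
            'sentiment_pos': scores['pos'],
            'sentiment_neu': scores['neu'],
            'sentiment_compound': scores['compound'],
        })
    return rows

def stub_polarity(comment):
    # Stand-in scorer so the sketch is self-contained; NOT real VADER output
    return {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

rows = score_comments(['Great stay.', 'Place large great value.'], stub_polarity)
print(rows[0])
```

In the notebook, passing `analyzer.polarity_scores` as `polarity_fn` and wrapping the result in `pd.DataFrame(rows, index=sample.index)` would give four columns that can be joined back onto `sample`.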

# Write file to CSV

In [69]:
#Set path to write processed data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\03_Processed'

#Write to csv
sample.to_csv(path + '/2020_0207_Reviews_Processed_NLP.csv',sep=',', index=False)