### Introduction

In this notebook, I preprocess Airbnb review data for later modeling.

**Import libraries**

In [25]:
import pandas as pd
import swifter
import spacy
import warnings

**Set notebook preferences**

In [26]:
#Set pandas preferences
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 200)

#Suppress warnings
warnings.filterwarnings('ignore')

**Read in data**

In [27]:
#Set path to reviews data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\02_Cleaned'

#Read in reviews data
df = pd.read_csv(path + '/2020_0526_Reviews_Cleaned.csv', parse_dates=['date'], dtype = {'host_id':'int'},
                 index_col=0)
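The read above joins a backslash-separated Windows folder to the file name with a forward slash. A minimal sketch of building the same kind of path with `pathlib` follows. The directory shown is a hypothetical placeholder, not the real data folder, and `PureWindowsPath` is used only so the example prints the same on any operating system; in the notebook itself `pathlib.Path` would be the natural choice.

```python
from pathlib import PureWindowsPath

# Hypothetical directory for illustration; substitute the real data folder
data_dir = PureWindowsPath(r'C:\data\airbnb\02_Cleaned')

# The / operator joins path parts with the correct separator
reviews_file = data_dir / '2020_0526_Reviews_Cleaned.csv'

print(reviews_file)
```

`pd.read_csv` accepts path objects, so a path built this way can be passed in place of the concatenated string.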

**Preview data**

In [28]:
print('Data shape:', df.shape)
df.head()

Data shape: (39943, 6)


Unnamed: 0,comments,date,listing_id,reviewer_id,review_scores_rating,host_id
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,2010-10-04,44680,140276,100.0,196626
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,2011-05-23,59831,501557,20.0,287859
12146,"He's great. Location is perfect, especially if you have a bicycle.",2011-09-30,71779,654056,60.0,368770
27172,"Rebecca's studio is great. I felt completely at home with all the comforts and amenities that one could expect. Both the building and studio are very clean, modern and convenient to public transportation and San Francisco. Rebecca was very helpful and accommodating. I'd stay at her place again and would recommend anyone visiting SF to consider it as an excellent alternative to a hotel, especially if you prefer a modern accommodation.",2011-11-23,261358,1395774,80.0,1257432
507880,"Susie is a great hostess, very attentive and also gave me my privacy when I needed it. Unfortunately for things beyond her control, some kind of machinery malfunction or something from another apt, best we could figure, the room wasn't very quiet at night during the week I stayed. But otherwise it is a lovely place and I would return.\r\nSusie is very nice and has a loveable pooch Zoey!",2012-02-04,284811,1434864,80.0,1427641


To do

- Translate
- strip punctuation, lowercase, remove stop words
- run spell check
- tokenize
- apply lemma and stemming to english


### Feature Engineering

**Translate Non-English Reviews**

In [30]:
#Import Google translator
import googletrans
from googletrans import Translator

#Check that every language found in the reviews is supported by Google's translator
print('# of languages in Reviews not in Google\'s Translator:', len(df[~(df.language.isin(googletrans.LANGUAGES.keys()))].language.unique()))


# of languages in Reviews not in Google's Translator: 0
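The same check can be expressed as a set difference, which also lists any unsupported codes rather than only counting them. This is a small sketch, not the notebook's code: `supported_langs` and `review_langs` are hypothetical stand-ins for `googletrans.LANGUAGES.keys()` and `df.language.unique()`.

```python
# Hypothetical stand-ins: in the notebook these would come from
# googletrans.LANGUAGES.keys() and df.language.unique()
supported_langs = {'en', 'es', 'fr', 'de', 'zh-cn'}
review_langs = ['en', 'fr', 'es', 'en', 'de']

# Codes present in the reviews but missing from the translator
unsupported = sorted(set(review_langs) - supported_langs)

print("# of languages in Reviews not in Google's Translator:", len(unsupported))
print('Unsupported codes:', unsupported)
```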


**Normalize English comments**

*Until I figure out how to translate non-English reviews, they are left as-is*

In [31]:
#Subset english reviews
english_df = df.loc[df.language == 'en'].copy() #copy so new columns are added to an independent frame

#View shape
english_df.shape

(37432, 7)

In [32]:
import nltk
import spacy
import en_core_web_sm

In [33]:
nlp = en_core_web_sm.load()
stopwords = spacy.lang.en.stop_words.STOP_WORDS

print(stopwords)

{'as', "'re", 'he', 'without', 'via', 'towards', 'nine', 'up', 'of', 'where', 'seems', '’s', 'while', 'whence', 'and', "'m", 'show', 'top', 'become', 'how', 'himself', 'unless', 'nor', 'call', 'whose', 'yourself', 'which', 'get', 'therefore', 'i', 'ourselves', 'very', 'part', 'whenever', 'everywhere', 'nothing', 'more', 'same', 'anyhow', 'move', 'first', 'already', 'what', 'off', 'with', 'under', 'mostly', 'rather', 'latter', 'former', 'herein', 'why', 'whither', 'yours', 'when', 'many', 'beside', 'or', 'should', 'along', 'made', 'next', 'will', 'my', 'eight', 'hereafter', 'thru', '’ll', 'once', 'wherever', 'such', 'thereafter', 'namely', 'beforehand', 'across', 'at', "'s", 'thus', 'formerly', 'us', 'became', 'sixty', 'were', 'most', 'ten', 'two', 'mine', 'although', 'might', 'something', 'about', 'side', 'to', 'make', 'on', 'therein', 'the', 'this', 'their', 'throughout', 'them', 'could', 'three', "'ve", 'alone', 'whereas', 'somewhere', 'still', 'through', 'meanwhile', 'empty', 'onto'

In [34]:
#Lowercase, remove punctuation and excess whitespace (tokenization and stop word removal still to do)
import re
def normalized_tokens(comments):
    """
    comments: string containing the text you would like normalized.
    Normalized meaning raw text is converted to lower case with punctuation and excess whitespace removed.
    Tokenization and stop word removal are not applied yet (see the commented-out lines)."""
    comments = comments.lower()
    comments = re.sub(r'[^\w\s]+', ' ', comments) #Replace punctuation with a space
    comments = re.sub(r'\s\s+', ' ', comments) #Collapse runs of whitespace into a single space
    comments = comments.strip() #Strip leading/trailing whitespace (strip returns a new string, so reassign)
#     raw_tokens = nlp(comments)#tokenize
#     clean_tokens = [word.text for word in raw_tokens if word.text not in stopwords]
    return comments


english_df['comments_normalized'] = english_df['comments'].swifter.apply(normalized_tokens)
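The tokenization and stop word steps are still commented out in the function above. A minimal sketch of what they might look like is given below, assuming plain whitespace tokenization instead of spaCy's tokenizer. `remove_stopwords` and `EXAMPLE_STOPWORDS` are hypothetical names introduced for illustration; the spaCy `stopwords` set loaded earlier would replace the tiny example list.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration;
# the notebook's spaCy `stopwords` set would take its place
EXAMPLE_STOPWORDS = {'a', 'and', 'has', 'is', 'the', 'his'}

def remove_stopwords(text, stopwords=EXAMPLE_STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r'[^\w\s]+', ' ', text)  # replace punctuation with a space
    tokens = text.split()                  # split() also collapses repeated whitespace
    return [tok for tok in tokens if tok not in stopwords]

tokens = remove_stopwords('Paul has a super nice place and is a super nice guy.')
print(tokens)
```

Whitespace splitting keeps the example dependency-free, but it will not separate contractions or handle other cases that a full tokenizer covers.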






**Remove automated cancellation comments**

In [35]:
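#Build the mask from english_df itself and match the text literally (regex=False)
english_df = english_df.loc[~(english_df['comments'].str.contains('This is an automated posting.', regex=False))]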

english_df.shape

(36753, 8)
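One detail worth noting about this filter: `str.contains` treats its pattern as a regular expression by default, so the period at the end of the marker text matches any character. The sketch below uses made-up comments and only the standard library to show the difference between a regex match and a literal match.

```python
import re

marker = 'This is an automated posting.'

# Made-up comments for illustration
comments = [
    'The host canceled this reservation. This is an automated posting.',
    'Great stay, would book again!',
    'This is an automated postingX',  # differs only in the final character
]

# As a regex, the trailing '.' matches any character
regex_hits = [bool(re.search(marker, c)) for c in comments]

# A plain substring test, or an escaped pattern, matches the literal period only
literal_hits = [marker in c for c in comments]
escaped_hits = [bool(re.search(re.escape(marker), c)) for c in comments]

print(regex_hits)
print(literal_hits)
print(escaped_hits)
```

In pandas, passing `regex=False` to `str.contains` gives the literal behaviour.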

**Comment word counts**

In [36]:
#Count number of words in comments
english_df['word_count'] = english_df['comments'].str.count(' ') + 1

#Check
display(english_df.head(10))

Unnamed: 0,comments,date,listing_id,reviewer_id,review_scores_rating,host_id,language,comments_tokens,word_count
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,2010-10-04,44680,140276,100.0,196626,en,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment,33
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,2011-05-23,59831,501557,20.0,287859,en,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were,32
12146,"He's great. Location is perfect, especially if you have a bicycle.",2011-09-30,71779,654056,60.0,368770,en,he s great location is perfect especially if you have a bicycle,11
27172,"Rebecca's studio is great. I felt completely at home with all the comforts and amenities that one could expect. Both the building and studio are very clean, modern and convenient to public transportation and San Francisco. Rebecca was very helpful and accommodating. I'd stay at her place again and would recommend anyone visiting SF to consider it as an excellent alternative to a hotel, especially if you prefer a modern accommodation.",2011-11-23,261358,1395774,80.0,1257432,en,rebecca s studio is great i felt completely at home with all the comforts and amenities that one could expect both the building and studio are very clean modern and convenient to public transportation and san francisco rebecca was very helpful and accommodating i d stay at her place again and would recommend anyone visiting sf to consider it as an excellent alternative to a hotel especially if you prefer a modern accommodation,71
507880,"Susie is a great hostess, very attentive and also gave me my privacy when I needed it. Unfortunately for things beyond her control, some kind of machinery malfunction or something from another apt, best we could figure, the room wasn't very quiet at night during the week I stayed. But otherwise it is a lovely place and I would return.\r\nSusie is very nice and has a loveable pooch Zoey!",2012-02-04,284811,1434864,80.0,1427641,en,susie is a great hostess very attentive and also gave me my privacy when i needed it unfortunately for things beyond her control some kind of machinery malfunction or something from another apt best we could figure the room wasn t very quiet at night during the week i stayed but otherwise it is a lovely place and i would return susie is very nice and has a loveable pooch zoey,69
488284,"Lynnore is a very friendly person with a great personality, and has lots of local SF knowledge. Her place in Bernal Heights is close to a number of great restaurants and is just a short walk from the 24th and Mission BART. She definitely made me feel at home, and was a great host! A+",2012-08-11,598064,174446,97.0,1094388,en,lynnore is a very friendly person with a great personality and has lots of local sf knowledge her place in bernal heights is close to a number of great restaurants and is just a short walk from the 24th and mission bart she definitely made me feel at home and was a great host a,58
493290,We are two italians and we stayed at Josh's loft for 4 days. He is very hospitable and available! His loft is equipped with everything you need and comfortable.,2012-09-01,199334,2428097,99.0,648181,en,we are two italians and we stayed at josh s loft for 4 days he is very hospitable and available his loft is equipped with everything you need and comfortable,29
417223,Helena is an amazing host! Her apartment is absolutely stunning and at the best possible location in SF! She's also been very helpful and made sure that we have a smooth accommodation.,2012-09-14,511991,2686896,100.0,723220,en,helena is an amazing host her apartment is absolutely stunning and at the best possible location in sf she s also been very helpful and made sure that we have a smooth accommodation,32
488242,"We spent a week at Owen and we really enjoyed our stay. It was perfect. The apartment is clean, spacious and bright, and it is located in a really nice and well served by public transport.\r\nCommunication with Owen was perfect. We had a very good experience. \r\nI highly recommend you to stay at Owen during your next visit to San Francisco !",2012-10-20,375168,1345310,100.0,170323,en,we spent a week at owen and we really enjoyed our stay it was perfect the apartment is clean spacious and bright and it is located in a really nice and well served by public transport communication with owen was perfect we had a very good experience i highly recommend you to stay at owen during your next visit to san francisco,62
8226,"Had a pleasant stay at Tom's home for 3 nights. Safe and comfortable place for 2 adults and 2 young children. Tom was very responsive to my inquires and was flexible with my check-in and check-out time. He was nice enough to let us borrow his parking pass, but we ended up finding parking closer to the property so that wasn't too bad. The home was well organized, clean and had everything we needed. Love his huge solid wood dining table (fits 6 ppl). Area was quiet and we felt really sa...",2012-11-13,51374,504203,100.0,236023,en,had a pleasant stay at tom s home for 3 nights safe and comfortable place for 2 adults and 2 young children tom was very responsive to my inquires and was flexible with my check in and check out time he was nice enough to let us borrow his parking pass but we ended up finding parking closer to the property so that wasn t too bad the home was well organized clean and had everything we needed love his huge solid wood dining table fits 6 ppl area was quiet and we felt really safe especially wit...,177
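Counting spaces and adding one is a quick approximation of the word count. It can drift when a comment contains line breaks or repeated spaces. A short sketch with made-up strings compares it with splitting on whitespace:

```python
# Made-up comments for illustration
samples = [
    "He's great. Location is perfect.",
    'Lovely place.\r\nSusie is very nice!',  # line break instead of a space
    'Two  spaces here',                      # doubled space
]

# Approach used above: number of spaces plus one
space_counts = [s.count(' ') + 1 for s in samples]

# Alternative: split on any run of whitespace and count the pieces
split_counts = [len(s.split()) for s in samples]

print(space_counts)
print(split_counts)
```

The pandas equivalent of the second approach would be `english_df['comments'].str.split().str.len()`.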


### Preprocess data

**Standardize data**

- commit changes to cleaning
- create py file for cleaning and tokenization
- remove rows with 
- Create tokens
- remove stop words
- spell check tokens
- n grams(bi and tri)
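For the n-gram item, a minimal sketch of generating bigrams and trigrams from a token list is shown below. `ngrams` is a hypothetical helper written for illustration, and the token list is made up; libraries such as NLTK offer their own utilities for this.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up token list for illustration
tokens = ['location', 'is', 'perfect', 'for', 'cyclists']

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A comment with fewer than `n` tokens simply yields an empty list.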