### Introduction

In this notebook, I preprocess Airbnb reviews data for later modeling.

**Import libraries**

In [1]:
import pandas as pd
import swifter
import spacy
import warnings

**Set notebook preferences**

In [2]:
#Set pandas preferences
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 200)

#Suppress warnings
warnings.filterwarnings('ignore')

**Read in data**

In [3]:
#Set path to reviews data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\02_Cleaned'

#Read in reviews data
df = pd.read_csv(path + '/2020_0526_Reviews_Cleaned.csv', parse_dates=['date'], dtype = {'host_id':'int'},
                 index_col=0)

**Preview data**

In [4]:
print('Data shape:', df.shape)
df.head()

Data shape: (39192, 7)


Unnamed: 0,comments,date,listing_id,reviewer_id,review_scores_rating,host_id,language
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,2010-10-04,44680,140276,100.0,196626,en
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,2011-05-23,59831,501557,20.0,287859,en
12146,"He's great. Location is perfect, especially if you have a bicycle.",2011-09-30,71779,654056,60.0,368770,en
27172,"Rebecca's studio is great. I felt completely at home with all the comforts and amenities that one could expect. Both the building and studio are very clean, modern and convenient to public transportation and San Francisco. Rebecca was very helpful and accommodating. I'd stay at her place again and would recommend anyone visiting SF to consider it as an excellent alternative to a hotel, especially if you prefer a modern accommodation.",2011-11-23,261358,1395774,80.0,1257432,en
507880,"Susie is a great hostess, very attentive and also gave me my privacy when I needed it. Unfortunately for things beyond her control, some kind of machinery malfunction or something from another apt, best we could figure, the room wasn't very quiet at night during the week I stayed. But otherwise it is a lovely place and I would return.\r\nSusie is very nice and has a loveable pooch Zoey!",2012-02-04,284811,1434864,80.0,1427641,en


**To do**

- Translate non-English reviews
- Strip punctuation, lowercase, remove stop words
- Run spell check
- Tokenize
- Apply lemmatization and stemming to English reviews


### Text Processing

**Translate Non-English Reviews**

https://pypi.org/project/translate/

A translation request could fail for two reasons:
1. The IP address is temporarily blocked.
2. The character limit has been reached.

I faced the same issue and ended up using another package called `translate`, which works flawlessly. The syntax is pretty similar too. It is available at the PyPI link above, or via `pip install translate`.

#Subset non-English reviews
non_english_df = df.loc[df.language != 'en']


**Normalize English comments**

*Until I figure out how to translate the non-English reviews, I will leave them alone.*

**Data prep**

In [5]:
#Subset English reviews (copy so new columns can be added without a SettingWithCopyWarning)
english_df = df.loc[df.language == 'en'].copy()

#View shape
english_df.shape

(36755, 7)

**Normalize comments**

Normalizing here means removing punctuation, lowercasing all letters, and stripping extra whitespace.
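The `normalized_text` helper lives in a local `Text_Processors` module that is not shown in this notebook. As a rough idea of what such a function might look like, here is a minimal sketch using only the standard library. The name `normalize_text_sketch` and the exact regular expressions are assumptions for illustration, not the actual implementation.

```python
import re

def normalize_text_sketch(text: str) -> str:
    """Hypothetical stand-in for normalized_text: lowercase, drop punctuation, tidy whitespace."""
    lowered = text.lower()
    # Replace anything that is not an ASCII letter, digit, or whitespace with a space
    no_punct = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse runs of whitespace (including \r\n) into one space and trim the ends
    return re.sub(r"\s+", " ", no_punct).strip()

sample = "He's great. Location is perfect, especially if you have a bicycle."
print(normalize_text_sketch(sample))
# → he s great location is perfect especially if you have a bicycle
```

Replacing punctuation with a space (rather than deleting it) is why "He's" becomes "he s". Note that this sketch also drops accented letters, which is only reasonable for English text.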

In [6]:
#Import normalized_text
from Text_Processors import normalized_text

#Normalize comments
english_df['comments_normalized'] = english_df['comments'].apply(normalized_text)

display(english_df.head(3))

Unnamed: 0,comments,date,listing_id,reviewer_id,review_scores_rating,host_id,language,comments_normalized
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,2010-10-04,44680,140276,100.0,196626,en,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,2011-05-23,59831,501557,20.0,287859,en,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were
12146,"He's great. Location is perfect, especially if you have a bicycle.",2011-09-30,71779,654056,60.0,368770,en,he s great location is perfect especially if you have a bicycle


**Tokenize and lemmatize comments**

In [7]:
#Import libraries
import spacy
import en_core_web_sm

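#Init spacy model and stop words
nlp = spacy.load('en_core_web_sm')
#Note: stopwords is not used below; token.is_stop does the filtering
stopwords = nlp.Defaults.stop_words

#Tokenize comments_normalized (tokenizer only, the rest of the pipeline is not run)
english_df['tokens_raw'] = [nlp.tokenizer(text) for text in english_df['comments_normalized']]

#Remove stopwords and lemmatize tokens_raw
#Note: the lemmas below come from the spaCy version used here; in spaCy v3 and later,
#token.lemma_ is generally empty unless the full pipeline (nlp(text) or nlp.pipe) is run
english_df['tokens_clean'] = english_df['tokens_raw'].apply(lambda x: [token.lemma_ for token in x if not token.is_stop])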

#Check
display(english_df.head(3))

Unnamed: 0,comments,date,listing_id,reviewer_id,review_scores_rating,host_id,language,comments_normalized,tokens_raw,tokens_clean
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,2010-10-04,44680,140276,100.0,196626,en,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment,"(paul, has, a, super, nice, place, and, is, a, super, nice, guy, the, apartment, is, extremely, clean, and, has, an, excellent, location, nestled, between, the, mission, and, noe, valley, definitely, recommend, his, apartment)","[paul, super, nice, place, super, nice, guy, apartment, extremely, clean, excellent, location, nestle, mission, noe, valley, definitely, recommend, apartment]"
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,2011-05-23,59831,501557,20.0,287859,en,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were,"(did, not, stay, here, there, was, a, challenge, that, was, not, resolved, inflexible, personality, i, asked, for, and, lawrence, refused, to, refund, anything, mumbled, under, his, breath, how, rediculous, we, were)","[stay, challenge, resolve, inflexible, personality, ask, lawrence, refuse, refund, mumble, breath, rediculous]"
12146,"He's great. Location is perfect, especially if you have a bicycle.",2011-09-30,71779,654056,60.0,368770,en,he s great location is perfect especially if you have a bicycle,"(he, s, great, location, is, perfect, especially, if, you, have, a, bicycle)","[s, great, location, perfect, especially, bicycle]"


**Remove tokens that appear in fewer than X documents**

In [8]:
english_df.head()

Unnamed: 0,comments,date,listing_id,reviewer_id,review_scores_rating,host_id,language,comments_normalized,tokens_raw,tokens_clean
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,2010-10-04,44680,140276,100.0,196626,en,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment,"(paul, has, a, super, nice, place, and, is, a, super, nice, guy, the, apartment, is, extremely, clean, and, has, an, excellent, location, nestled, between, the, mission, and, noe, valley, definitely, recommend, his, apartment)","[paul, super, nice, place, super, nice, guy, apartment, extremely, clean, excellent, location, nestle, mission, noe, valley, definitely, recommend, apartment]"
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,2011-05-23,59831,501557,20.0,287859,en,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were,"(did, not, stay, here, there, was, a, challenge, that, was, not, resolved, inflexible, personality, i, asked, for, and, lawrence, refused, to, refund, anything, mumbled, under, his, breath, how, rediculous, we, were)","[stay, challenge, resolve, inflexible, personality, ask, lawrence, refuse, refund, mumble, breath, rediculous]"
12146,"He's great. Location is perfect, especially if you have a bicycle.",2011-09-30,71779,654056,60.0,368770,en,he s great location is perfect especially if you have a bicycle,"(he, s, great, location, is, perfect, especially, if, you, have, a, bicycle)","[s, great, location, perfect, especially, bicycle]"
27172,"Rebecca's studio is great. I felt completely at home with all the comforts and amenities that one could expect. Both the building and studio are very clean, modern and convenient to public transportation and San Francisco. Rebecca was very helpful and accommodating. I'd stay at her place again and would recommend anyone visiting SF to consider it as an excellent alternative to a hotel, especially if you prefer a modern accommodation.",2011-11-23,261358,1395774,80.0,1257432,en,rebecca s studio is great i felt completely at home with all the comforts and amenities that one could expect both the building and studio are very clean modern and convenient to public transportation and san francisco rebecca was very helpful and accommodating i d stay at her place again and would recommend anyone visiting sf to consider it as an excellent alternative to a hotel especially if you prefer a modern accommodation,"(rebecca, s, studio, is, great, i, felt, completely, at, home, with, all, the, comforts, and, amenities, that, one, could, expect, both, the, building, and, studio, are, very, clean, modern, and, convenient, to, public, transportation, and, san, francisco, rebecca, was, very, helpful, and, accommodating, i, d, stay, at, her, place, again, and, would, recommend, anyone, visiting, sf, to, consider, it, as, an, excellent, alternative, to, a, hotel, especially, if, you, prefer, a, modern, accomm...","[rebecca, s, studio, great, feel, completely, home, comfort, amenity, expect, build, studio, clean, modern, convenient, public, transportation, san, francisco, rebecca, helpful, accommodate, have, stay, place, recommend, visit, sf, consider, excellent, alternative, hotel, especially, prefer, modern, accommodation]"
507880,"Susie is a great hostess, very attentive and also gave me my privacy when I needed it. Unfortunately for things beyond her control, some kind of machinery malfunction or something from another apt, best we could figure, the room wasn't very quiet at night during the week I stayed. But otherwise it is a lovely place and I would return.\r\nSusie is very nice and has a loveable pooch Zoey!",2012-02-04,284811,1434864,80.0,1427641,en,susie is a great hostess very attentive and also gave me my privacy when i needed it unfortunately for things beyond her control some kind of machinery malfunction or something from another apt best we could figure the room wasn t very quiet at night during the week i stayed but otherwise it is a lovely place and i would return susie is very nice and has a loveable pooch zoey,"(susie, is, a, great, hostess, very, attentive, and, also, gave, me, my, privacy, when, i, needed, it, unfortunately, for, things, beyond, her, control, some, kind, of, machinery, malfunction, or, something, from, another, apt, best, we, could, figure, the, room, wasn, t, very, quiet, at, night, during, the, week, i, stayed, but, otherwise, it, is, a, lovely, place, and, i, would, return, susie, is, very, nice, and, has, a, loveable, pooch, zoey)","[susie, great, hostess, attentive, give, privacy, need, unfortunately, thing, control, kind, machinery, malfunction, apt, well, figure, room, wasn, t, quiet, night, week, stay, lovely, place, return, susie, nice, loveable, pooch, zoey]"
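The step of removing tokens that appear in too few documents is not implemented yet. One possible approach is sketched below in plain Python; the function name `filter_rare_tokens`, the `min_doc_count` threshold, and the toy `docs` list are all assumptions for illustration.

```python
from collections import Counter

def filter_rare_tokens(docs, min_doc_count=2):
    """Keep only tokens that occur in at least min_doc_count documents."""
    # set(doc) ensures a token repeated inside one document is counted once
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_doc_count] for doc in docs]

# Toy token lists standing in for the tokens_clean column
docs = [
    ["great", "location", "clean"],
    ["great", "host", "clean"],
    ["great", "bicycle"],
]
filtered = filter_rare_tokens(docs, min_doc_count=2)
print(filtered)
# → [['great', 'clean'], ['great', 'clean'], ['great']]
```

On the real data, the same function could presumably be applied to `english_df['tokens_clean'].tolist()`, with the threshold chosen after inspecting the document-frequency distribution.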


**Stem and Lemmatize tokens**

### Feature Engineering

**Comment word counts**

In [9]:
#Count number of words in comments (approximated as number of spaces + 1)
english_df['word_count'] = english_df['comments'].str.count(' ') + 1

#Check
display(english_df.head(3))

Unnamed: 0,comments,date,listing_id,reviewer_id,review_scores_rating,host_id,language,comments_normalized,tokens_raw,tokens_clean,word_count
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,2010-10-04,44680,140276,100.0,196626,en,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment,"(paul, has, a, super, nice, place, and, is, a, super, nice, guy, the, apartment, is, extremely, clean, and, has, an, excellent, location, nestled, between, the, mission, and, noe, valley, definitely, recommend, his, apartment)","[paul, super, nice, place, super, nice, guy, apartment, extremely, clean, excellent, location, nestle, mission, noe, valley, definitely, recommend, apartment]",33
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,2011-05-23,59831,501557,20.0,287859,en,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were,"(did, not, stay, here, there, was, a, challenge, that, was, not, resolved, inflexible, personality, i, asked, for, and, lawrence, refused, to, refund, anything, mumbled, under, his, breath, how, rediculous, we, were)","[stay, challenge, resolve, inflexible, personality, ask, lawrence, refuse, refund, mumble, breath, rediculous]",32
12146,"He's great. Location is perfect, especially if you have a bicycle.",2011-09-30,71779,654056,60.0,368770,en,he s great location is perfect especially if you have a bicycle,"(he, s, great, location, is, perfect, especially, if, you, have, a, bicycle)","[s, great, location, perfect, especially, bicycle]",11
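Counting spaces and adding one is only an approximation of the word count: double spaces inflate it, and line breaks such as `\r\n` are not counted as separators. The small sketch below compares that approach with splitting on any whitespace; the helper names and sample strings are made up for illustration.

```python
def count_words_by_spaces(text):
    # Mirrors the approach above: number of spaces + 1
    return text.count(" ") + 1

def count_words_by_split(text):
    # str.split() with no arguments splits on any run of whitespace
    return len(text.split())

double_space = "Great host.  Lovely place."
line_break = "Lovely place.\r\nWould return!"

print(count_words_by_spaces(double_space), count_words_by_split(double_space))
# → 5 4
print(count_words_by_spaces(line_break), count_words_by_split(line_break))
# → 3 4
```

In pandas, the whitespace-splitting variant can be written as `english_df['comments'].str.split().str.len()`.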


### Preprocess data

**Standardize data**

- Spell check tokens
- Drop tokens that are only a few characters long
- Build n-grams (bigrams and trigrams)
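Two of the planned steps above, dropping very short tokens and building n-grams, can be sketched in plain Python. This is only one way to do it: the helper names `drop_short_tokens` and `ngrams` and the `min_length=3` cutoff are assumptions, and the token list is taken from the third review in the preview.

```python
def drop_short_tokens(tokens, min_length=3):
    """Remove tokens shorter than min_length characters (e.g. a stray 's')."""
    return [t for t in tokens if len(t) >= min_length]

def ngrams(tokens, n):
    """Return consecutive n-token sequences joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["s", "great", "location", "perfect", "especially", "bicycle"]
kept = drop_short_tokens(tokens)
bigrams = ngrams(kept, 2)
trigrams = ngrams(kept, 3)

print(kept)
# → ['great', 'location', 'perfect', 'especially', 'bicycle']
print(bigrams)
# → ['great location', 'location perfect', 'perfect especially', 'especially bicycle']
print(trigrams)
# → ['great location perfect', 'location perfect especially', 'perfect especially bicycle']
```

Dropping short tokens before building n-grams keeps fragments such as the leftover "s" from "he's" out of the resulting phrases.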