### Introduction

In this notebook, I preprocess Airbnb reviews data for later modeling.

**Import libraries**

In [112]:
import pandas as pd
import swifter
import spacy
import warnings

**Set notebook preferences**

In [113]:
#Set pandas preferences
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 200)

#Suppress warnings
warnings.filterwarnings('ignore')

**Read in data**

In [114]:
#Set path to reviews data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\02_Cleaned'

#Read in reviews data
df = pd.read_csv(path + '/2020_0526_Reviews_Cleaned.csv',
                 index_col=0)
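
The file path above is built by concatenating a raw Windows path with a forward-slash file name. A small sketch of the same join with `pathlib`, which inserts the separator itself. `PureWindowsPath` is used here only so the snippet behaves identically on any operating system, and `data_dir` and `csv_path` are illustrative names, not part of the original notebook.

```python
from pathlib import PureWindowsPath

# Directory taken from the cell above
data_dir = PureWindowsPath(r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\02_Cleaned')

# The / operator joins path parts with the correct separator
csv_path = data_dir / '2020_0526_Reviews_Cleaned.csv'

print(csv_path.name)
print(csv_path.parent == data_dir)
```

In the notebook itself, `pathlib.Path` would be the usual choice, and `pd.read_csv` accepts such path objects directly.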

**Preview data**

In [115]:
print('Data shape:', df.shape)
df.head()

Data shape: (36753, 3)


Unnamed: 0,comments,review_scores_rating,language
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,100.0,en
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,20.0,en
12146,"He's great. Location is perfect, especially if you have a bicycle.",60.0,en
27172,"Rebecca's studio is great. I felt completely at home with all the comforts and amenities that one could expect. Both the building and studio are very clean, modern and convenient to public transportation and San Francisco. Rebecca was very helpful and accommodating. I'd stay at her place again and would recommend anyone visiting SF to consider it as an excellent alternative to a hotel, especially if you prefer a modern accommodation.",80.0,en
507880,"Susie is a great hostess, very attentive and also gave me my privacy when I needed it. Unfortunately for things beyond her control, some kind of machinery malfunction or something from another apt, best we could figure, the room wasn't very quiet at night during the week I stayed. But otherwise it is a lovely place and I would return.\r\nSusie is very nice and has a loveable pooch Zoey!",80.0,en


### Text Processing

**Normalize comments**

In [116]:
#Import normalized_text
from Text_Processors import normalized_text

#Normalize comments
df['comments_normalized'] = df['comments'].apply(normalized_text)

display(df.head(3))

Unnamed: 0,comments,review_scores_rating,language,comments_normalized
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,100.0,en,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,20.0,en,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were
12146,"He's great. Location is perfect, especially if you have a bicycle.",60.0,en,he s great location is perfect especially if you have a bicycle
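
`normalized_text` comes from the local `Text_Processors` module, which is not shown in this notebook. Judging only from the output above (lowercased text, with punctuation replaced by single spaces), a function along these lines would reproduce the previewed rows. This is a sketch under that assumption, not the module's actual code, and `normalize_sketch` is a hypothetical name. Whether the real function keeps digits or accented characters cannot be told from the preview.

```python
import re

def normalize_sketch(text):
    """Lowercase text and replace each run of non-alphanumeric characters with one space."""
    lowered = text.lower()
    # Punctuation, apostrophes and line breaks all collapse to a single space
    spaced = re.sub(r'[^a-z0-9]+', ' ', lowered)
    return spaced.strip()

sample = "He's great. Location is perfect, especially if you have a bicycle."
print(normalize_sketch(sample))
# he s great location is perfect especially if you have a bicycle
```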


**Tokenize, lemmatize, and add POS tags to comments**

In [117]:
#Import libraries (spacy is already imported at the top of the notebook)
import nltk

#Init spacy tokenizer and stop words
nlp = spacy.load('en_core_web_sm')
stopwords = nlp.Defaults.stop_words

#Tokenize comments_normalized
df['tokens_raw'] = [nlp.tokenizer(text) for text in df['comments_normalized']]

#Remove stopwords, lemmatize tokens_raw, and remove tokens shorter than 2 characters
df['tokens_clean'] = df['tokens_raw'].apply(lambda x: [token.lemma_ for token in x if not token.is_stop and len(token) >1])

#Apply POS tags to tokens_clean
df['tokens_clean_pos'] = df['tokens_clean'].swifter.apply(lambda x: nltk.pos_tag(x))

#Check
display(df.head(3))

Unnamed: 0,comments,review_scores_rating,language,comments_normalized,tokens_raw,tokens_clean,tokens_clean_pos
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,100.0,en,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment,"(paul, has, a, super, nice, place, and, is, a, super, nice, guy, the, apartment, is, extremely, clean, and, has, an, excellent, location, nestled, between, the, mission, and, noe, valley, definitely, recommend, his, apartment)","[paul, super, nice, place, super, nice, guy, apartment, extremely, clean, excellent, location, nestle, mission, noe, valley, definitely, recommend, apartment]","[(paul, NN), (super, NN), (nice, JJ), (place, NN), (super, JJ), (nice, JJ), (guy, NN), (apartment, NN), (extremely, RB), (clean, JJ), (excellent, JJ), (location, NN), (nestle, JJ), (mission, NN), (noe, FW), (valley, NN), (definitely, RB), (recommend, JJ), (apartment, NN)]"
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,20.0,en,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were,"(did, not, stay, here, there, was, a, challenge, that, was, not, resolved, inflexible, personality, i, asked, for, and, lawrence, refused, to, refund, anything, mumbled, under, his, breath, how, rediculous, we, were)","[stay, challenge, resolve, inflexible, personality, ask, lawrence, refuse, refund, mumble, breath, rediculous]","[(stay, NN), (challenge, NN), (resolve, VBP), (inflexible, JJ), (personality, NN), (ask, NN), (lawrence, NN), (refuse, NN), (refund, NN), (mumble, JJ), (breath, NN), (rediculous, JJ)]"
12146,"He's great. Location is perfect, especially if you have a bicycle.",60.0,en,he s great location is perfect especially if you have a bicycle,"(he, s, great, location, is, perfect, especially, if, you, have, a, bicycle)","[great, location, perfect, especially, bicycle]","[(great, JJ), (location, NN), (perfect, NN), (especially, RB), (bicycle, VB)]"


### Feature Engineering

**Comment word counts**

In [118]:
#Approximate word count: number of spaces in comments plus one
df['comments_word_count'] = df['comments'].str.count(' ') + 1

#Check
display(df.head(3))

Unnamed: 0,comments,review_scores_rating,language,comments_normalized,tokens_raw,tokens_clean,tokens_clean_pos,comments_word_count
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,100.0,en,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment,"(paul, has, a, super, nice, place, and, is, a, super, nice, guy, the, apartment, is, extremely, clean, and, has, an, excellent, location, nestled, between, the, mission, and, noe, valley, definitely, recommend, his, apartment)","[paul, super, nice, place, super, nice, guy, apartment, extremely, clean, excellent, location, nestle, mission, noe, valley, definitely, recommend, apartment]","[(paul, NN), (super, NN), (nice, JJ), (place, NN), (super, JJ), (nice, JJ), (guy, NN), (apartment, NN), (extremely, RB), (clean, JJ), (excellent, JJ), (location, NN), (nestle, JJ), (mission, NN), (noe, FW), (valley, NN), (definitely, RB), (recommend, JJ), (apartment, NN)]",33
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,20.0,en,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were,"(did, not, stay, here, there, was, a, challenge, that, was, not, resolved, inflexible, personality, i, asked, for, and, lawrence, refused, to, refund, anything, mumbled, under, his, breath, how, rediculous, we, were)","[stay, challenge, resolve, inflexible, personality, ask, lawrence, refuse, refund, mumble, breath, rediculous]","[(stay, NN), (challenge, NN), (resolve, VBP), (inflexible, JJ), (personality, NN), (ask, NN), (lawrence, NN), (refuse, NN), (refund, NN), (mumble, JJ), (breath, NN), (rediculous, JJ)]",32
12146,"He's great. Location is perfect, especially if you have a bicycle.",60.0,en,he s great location is perfect especially if you have a bicycle,"(he, s, great, location, is, perfect, especially, if, you, have, a, bicycle)","[great, location, perfect, especially, bicycle]","[(great, JJ), (location, NN), (perfect, NN), (especially, RB), (bicycle, VB)]",11
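
Counting spaces and adding one treats every space as a word boundary, so repeated spaces inflate the count and line breaks such as `\r\n` are not counted as separators. A minimal sketch of the difference on plain strings follows. `count_by_spaces` and `count_by_split` are hypothetical helper names, and the whitespace-split variant is an alternative, not what the notebook uses.

```python
def count_by_spaces(text):
    # Mirrors the notebook's approach: number of space characters plus one
    return text.count(' ') + 1

def count_by_split(text):
    # str.split() with no argument splits on any run of whitespace
    return len(text.split())

examples = ["He's great.", "two  spaces here", "I would return.\r\nSusie is very nice"]
for text in examples:
    print(repr(text), count_by_spaces(text), count_by_split(text))
```

On the DataFrame, the second variant corresponds to `df['comments'].str.split().str.len()`.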


**Cluster scores and assign great/good/poor label**

In [119]:
#Init KMeans
from sklearn.cluster import KMeans

#Fit to data and print clusters
#(n_jobs is omitted: the argument was removed from KMeans in scikit-learn 1.0)
kmeans = KMeans(n_clusters=3, random_state=24).fit(df.review_scores_rating.values.reshape(-1,1))
print('Cluster centers found by kmeans:\n',kmeans.cluster_centers_)

Cluster centers found by kmeans:
 [[88.9625    ]
 [97.94612717]
 [58.66981132]]
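
K-means assigns cluster indices arbitrarily, so any index-to-label mapping has to be checked against the printed centers. A minimal sketch of deriving the mapping from the centers themselves, assuming the labels should follow the order of the scores. `centers` is hard-coded from the output above, and `label_names`, `ranked_ids` and `labels_by_center` are illustrative names.

```python
# Centers copied from the k-means output above, indexed by cluster id
centers = [88.9625, 97.94612717, 58.66981132]

# Label names in ascending order of score (assumed naming, lowest to highest)
label_names = ['poor', 'good', 'great']

# Rank cluster ids by center value, then pair each id with a name
ranked_ids = sorted(range(len(centers)), key=lambda i: centers[i])
labels_by_center = {cluster_id: name for cluster_id, name in zip(ranked_ids, label_names)}

print(labels_by_center)
# {2: 'poor', 0: 'good', 1: 'great'}
```

With a fitted model, `centers` would come from `kmeans.cluster_centers_.ravel()`.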


In [120]:
#Map k_means cluster indices to labels, following the cluster centers printed above
#(cluster 1 ~ 97.9 is highest, cluster 0 ~ 89.0 is in the middle, cluster 2 ~ 58.7 is lowest)
labels = {0: 'good',
         1: 'great',
         2: 'poor'}

#Convert and append review_labels to df
review_labels = [labels[k] for k in kmeans.labels_]

df['review_label'] = review_labels

#Check
display(df.head())

Unnamed: 0,comments,review_scores_rating,language,comments_normalized,tokens_raw,tokens_clean,tokens_clean_pos,comments_word_count,review_label
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,100.0,en,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment,"(paul, has, a, super, nice, place, and, is, a, super, nice, guy, the, apartment, is, extremely, clean, and, has, an, excellent, location, nestled, between, the, mission, and, noe, valley, definitely, recommend, his, apartment)","[paul, super, nice, place, super, nice, guy, apartment, extremely, clean, excellent, location, nestle, mission, noe, valley, definitely, recommend, apartment]","[(paul, NN), (super, NN), (nice, JJ), (place, NN), (super, JJ), (nice, JJ), (guy, NN), (apartment, NN), (extremely, RB), (clean, JJ), (excellent, JJ), (location, NN), (nestle, JJ), (mission, NN), (noe, FW), (valley, NN), (definitely, RB), (recommend, JJ), (apartment, NN)]",33,great
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,20.0,en,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were,"(did, not, stay, here, there, was, a, challenge, that, was, not, resolved, inflexible, personality, i, asked, for, and, lawrence, refused, to, refund, anything, mumbled, under, his, breath, how, rediculous, we, were)","[stay, challenge, resolve, inflexible, personality, ask, lawrence, refuse, refund, mumble, breath, rediculous]","[(stay, NN), (challenge, NN), (resolve, VBP), (inflexible, JJ), (personality, NN), (ask, NN), (lawrence, NN), (refuse, NN), (refund, NN), (mumble, JJ), (breath, NN), (rediculous, JJ)]",32,poor
12146,"He's great. Location is perfect, especially if you have a bicycle.",60.0,en,he s great location is perfect especially if you have a bicycle,"(he, s, great, location, is, perfect, especially, if, you, have, a, bicycle)","[great, location, perfect, especially, bicycle]","[(great, JJ), (location, NN), (perfect, NN), (especially, RB), (bicycle, VB)]",11,poor
27172,"Rebecca's studio is great. I felt completely at home with all the comforts and amenities that one could expect. Both the building and studio are very clean, modern and convenient to public transportation and San Francisco. Rebecca was very helpful and accommodating. I'd stay at her place again and would recommend anyone visiting SF to consider it as an excellent alternative to a hotel, especially if you prefer a modern accommodation.",80.0,en,rebecca s studio is great i felt completely at home with all the comforts and amenities that one could expect both the building and studio are very clean modern and convenient to public transportation and san francisco rebecca was very helpful and accommodating i d stay at her place again and would recommend anyone visiting sf to consider it as an excellent alternative to a hotel especially if you prefer a modern accommodation,"(rebecca, s, studio, is, great, i, felt, completely, at, home, with, all, the, comforts, and, amenities, that, one, could, expect, both, the, building, and, studio, are, very, clean, modern, and, convenient, to, public, transportation, and, san, francisco, rebecca, was, very, helpful, and, accommodating, i, d, stay, at, her, place, again, and, would, recommend, anyone, visiting, sf, to, consider, it, as, an, excellent, alternative, to, a, hotel, especially, if, you, prefer, a, modern, accomm...","[rebecca, studio, great, feel, completely, home, comfort, amenity, expect, build, studio, clean, modern, convenient, public, transportation, san, francisco, rebecca, helpful, accommodate, stay, place, recommend, visit, sf, consider, excellent, alternative, hotel, especially, prefer, modern, accommodation]","[(rebecca, NN), (studio, NN), (great, JJ), (feel, NN), (completely, RB), (home, NN), (comfort, NN), (amenity, NN), (expect, VBP), (build, JJ), (studio, NN), (clean, JJ), (modern, JJ), (convenient, NN), (public, JJ), (transportation, NN), (san, JJ), (francisco, NN), (rebecca, NN), (helpful, JJ), (accommodate, NN), (stay, NN), (place, NN), (recommend, VBP), (visit, NN), (sf, NN), (consider, VB), (excellent, JJ), (alternative, JJ), (hotel, NN), (especially, RB), (prefer, VBP), (modern, JJ), (ac...",71,good
507880,"Susie is a great hostess, very attentive and also gave me my privacy when I needed it. Unfortunately for things beyond her control, some kind of machinery malfunction or something from another apt, best we could figure, the room wasn't very quiet at night during the week I stayed. But otherwise it is a lovely place and I would return.\r\nSusie is very nice and has a loveable pooch Zoey!",80.0,en,susie is a great hostess very attentive and also gave me my privacy when i needed it unfortunately for things beyond her control some kind of machinery malfunction or something from another apt best we could figure the room wasn t very quiet at night during the week i stayed but otherwise it is a lovely place and i would return susie is very nice and has a loveable pooch zoey,"(susie, is, a, great, hostess, very, attentive, and, also, gave, me, my, privacy, when, i, needed, it, unfortunately, for, things, beyond, her, control, some, kind, of, machinery, malfunction, or, something, from, another, apt, best, we, could, figure, the, room, wasn, t, very, quiet, at, night, during, the, week, i, stayed, but, otherwise, it, is, a, lovely, place, and, i, would, return, susie, is, very, nice, and, has, a, loveable, pooch, zoey)","[susie, great, hostess, attentive, give, privacy, need, unfortunately, thing, control, kind, machinery, malfunction, apt, well, figure, room, wasn, quiet, night, week, stay, lovely, place, return, susie, nice, loveable, pooch, zoey]","[(susie, RB), (great, JJ), (hostess, JJ), (attentive, JJ), (give, NN), (privacy, NN), (need, VBP), (unfortunately, JJ), (thing, NN), (control, NN), (kind, NN), (machinery, NN), (malfunction, NN), (apt, RB), (well, RB), (figure, NN), (room, NN), (wasn, NN), (quiet, JJ), (night, NN), (week, NN), (stay, NN), (lovely, RB), (place, JJ), (return, NN), (susie, NN), (nice, JJ), (loveable, JJ), (pooch, NN), (zoey, NN)]",69,good


### Write file to CSV

In [121]:
#Set path
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\03_Processed'

#Write file
df.to_csv(path + '/2020_0616_Reviews_Processed.csv')