### Introduction

In this notebook, I preprocess Airbnb reviews data for later modeling.

**Import libraries**

In [1]:
import pandas as pd
import warnings

**Set notebook preferences**

In [2]:
#Set pandas preferences
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 200)

#Suppress warnings
warnings.filterwarnings('ignore')

**Read in data**

In [3]:
#Set path to reviews data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\02_Cleaned'

#Read in reviews data
df = pd.read_csv(path + '/2020_0526_Reviews_Cleaned.csv',
                 index_col=0)

**Preview data**

In [4]:
print('Data shape:', df.shape)
df.head()

Data shape: (36750, 2)


Unnamed: 0,comments,review_scores_rating
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,100.0
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,20.0
12146,"He's great. Location is perfect, especially if you have a bicycle.",60.0
27172,"Rebecca's studio is great. I felt completely at home with all the comforts and amenities that one could expect. Both the building and studio are very clean, modern and convenient to public transportation and San Francisco. Rebecca was very helpful and accommodating. I'd stay at her place again and would recommend anyone visiting SF to consider it as an excellent alternative to a hotel, especially if you prefer a modern accommodation.",80.0
507880,"Susie is a great hostess, very attentive and also gave me my privacy when I needed it. Unfortunately for things beyond her control, some kind of machinery malfunction or something from another apt, best we could figure, the room wasn't very quiet at night during the week I stayed. But otherwise it is a lovely place and I would return.\r\nSusie is very nice and has a loveable pooch Zoey!",80.0


### Text Processing

**Normalize comments**

In [5]:
#Import normalized_text
from Text_Processors import normalized_text

#Normalize comments
df['comments_normalized'] = df['comments'].apply(normalized_text)

display(df.head(3))

Unnamed: 0,comments,review_scores_rating,comments_normalized
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,100.0,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,20.0,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were
12146,"He's great. Location is perfect, especially if you have a bicycle.",60.0,he s great location is perfect especially if you have a bicycle
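
`normalized_text` comes from the local `Text_Processors` module, which is not shown in this notebook. Judging only from the output above, it appears to lowercase each comment and replace punctuation and line breaks with single spaces. A minimal sketch of a function with that behavior follows; the name `normalize_text_sketch` and the regular expression are assumptions, not the module's actual code.

```python
import re

def normalize_text_sketch(text):
    """Hypothetical stand-in for normalized_text: lowercase and keep only letters/digits."""
    # Lowercase so that 'Location' and 'location' become the same token later on
    text = str(text).lower()
    # Replace every run of characters that are not a-z or 0-9 with one space
    text = re.sub(r'[^a-z0-9]+', ' ', text)
    # Drop the leading/trailing space left behind by punctuation at the edges
    return text.strip()

sample = "He's great. Location is perfect, especially if you have a bicycle."
normalized_sample = normalize_text_sketch(sample)
print(normalized_sample)
```

On the third review shown above, this sketch reproduces the `comments_normalized` value, including the stray `s` left over from `He's`.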


**Tokenize, lemmatize, and add POS tags to comments**

In [6]:
#Import libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

#Tokenize comments_normalized
df['tokens'] =  df.comments_normalized.apply(word_tokenize)

#Init stopwords (named stop_words so the imported stopwords module is not overwritten on re-run)
stop_words = set(stopwords.words("english"))

#Init WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

#Write function to remove stopwords and lemmatize the remaining tokens
def remove_stopwords(row):
    temp_list = row['tokens']
    new_tokens = [lemmatizer.lemmatize(token) for token in temp_list if token not in stop_words]
    return new_tokens

#Apply to df
df['tokens_clean'] = df.apply(remove_stopwords, axis = 1)

#Add POS to tokens
df['tokens_pos_tags'] = df['tokens_clean'].apply(nltk.pos_tag)

#Check
display(df.head(3))

Unnamed: 0,comments,review_scores_rating,comments_normalized,tokens,tokens_clean,tokens_pos_tags
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,100.0,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment,"[paul, has, a, super, nice, place, and, is, a, super, nice, guy, the, apartment, is, extremely, clean, and, has, an, excellent, location, nestled, between, the, mission, and, noe, valley, definitely, recommend, his, apartment]","[paul, super, nice, place, super, nice, guy, apartment, extremely, clean, excellent, location, nestled, mission, noe, valley, definitely, recommend, apartment]","[(paul, NN), (super, NN), (nice, JJ), (place, NN), (super, JJ), (nice, JJ), (guy, NN), (apartment, NN), (extremely, RB), (clean, JJ), (excellent, JJ), (location, NN), (nestled, VBN), (mission, NN), (noe, FW), (valley, NN), (definitely, RB), (recommend, JJ), (apartment, NN)]"
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,20.0,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were,"[did, not, stay, here, there, was, a, challenge, that, was, not, resolved, inflexible, personality, i, asked, for, and, lawrence, refused, to, refund, anything, mumbled, under, his, breath, how, rediculous, we, were]","[stay, challenge, resolved, inflexible, personality, asked, lawrence, refused, refund, anything, mumbled, breath, rediculous]","[(stay, NN), (challenge, NN), (resolved, VBD), (inflexible, JJ), (personality, NN), (asked, VBD), (lawrence, NN), (refused, VBD), (refund, NN), (anything, NN), (mumbled, VBN), (breath, RB), (rediculous, JJ)]"
12146,"He's great. Location is perfect, especially if you have a bicycle.",60.0,he s great location is perfect especially if you have a bicycle,"[he, s, great, location, is, perfect, especially, if, you, have, a, bicycle]","[great, location, perfect, especially, bicycle]","[(great, JJ), (location, NN), (perfect, NN), (especially, RB), (bicycle, VB)]"
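
The cleaning step above only needs the token list, so it can also be written as a function of the list itself rather than of the whole row. The sketch below shows that shape with a tiny hand-picked stopword set and an identity "lemmatizer" so that it runs without NLTK data; `clean_tokens`, `tiny_stop_words`, and the default `lemmatize` argument are illustrative names, not part of the notebook.

```python
def clean_tokens(tokens, stop_words, lemmatize=lambda token: token):
    """Drop stopwords, then lemmatize what is left (lemmatize defaults to a no-op)."""
    return [lemmatize(token) for token in tokens if token not in stop_words]

# Hypothetical, deliberately small stopword set used only for this example
tiny_stop_words = {'he', 's', 'is', 'if', 'you', 'have', 'a'}

tokens = ['he', 's', 'great', 'location', 'is', 'perfect',
          'especially', 'if', 'you', 'have', 'a', 'bicycle']

cleaned = clean_tokens(tokens, tiny_stop_words)
print(cleaned)
```

In the notebook, the same function could be applied column-wise with `df['tokens'].apply(...)`, passing the NLTK stopword set and `lemmatizer.lemmatize` as the extra arguments, which avoids the slower row-wise `df.apply(..., axis=1)`.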


### Feature Engineering

**Comment word counts**

In [7]:
#Count number of words in comments
df['comments_word_count'] = df['comments'].str.count(' ') + 1

#Check
display(df.head(3))

Unnamed: 0,comments,review_scores_rating,comments_normalized,tokens,tokens_clean,tokens_pos_tags,comments_word_count
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,100.0,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment,"[paul, has, a, super, nice, place, and, is, a, super, nice, guy, the, apartment, is, extremely, clean, and, has, an, excellent, location, nestled, between, the, mission, and, noe, valley, definitely, recommend, his, apartment]","[paul, super, nice, place, super, nice, guy, apartment, extremely, clean, excellent, location, nestled, mission, noe, valley, definitely, recommend, apartment]","[(paul, NN), (super, NN), (nice, JJ), (place, NN), (super, JJ), (nice, JJ), (guy, NN), (apartment, NN), (extremely, RB), (clean, JJ), (excellent, JJ), (location, NN), (nestled, VBN), (mission, NN), (noe, FW), (valley, NN), (definitely, RB), (recommend, JJ), (apartment, NN)]",33
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,20.0,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were,"[did, not, stay, here, there, was, a, challenge, that, was, not, resolved, inflexible, personality, i, asked, for, and, lawrence, refused, to, refund, anything, mumbled, under, his, breath, how, rediculous, we, were]","[stay, challenge, resolved, inflexible, personality, asked, lawrence, refused, refund, anything, mumbled, breath, rediculous]","[(stay, NN), (challenge, NN), (resolved, VBD), (inflexible, JJ), (personality, NN), (asked, VBD), (lawrence, NN), (refused, VBD), (refund, NN), (anything, NN), (mumbled, VBN), (breath, RB), (rediculous, JJ)]",32
12146,"He's great. Location is perfect, especially if you have a bicycle.",60.0,he s great location is perfect especially if you have a bicycle,"[he, s, great, location, is, perfect, especially, if, you, have, a, bicycle]","[great, location, perfect, especially, bicycle]","[(great, JJ), (location, NN), (perfect, NN), (especially, RB), (bicycle, VB)]",11
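
Counting spaces and adding one matches the true word count only when words are separated by exactly one space. Comments that contain line breaks (such as the `\r\n` in one of the reviews previewed earlier) or doubled spaces are miscounted. The sketch below compares the two approaches on plain strings; the function names and sample strings are illustrative.

```python
def count_words_by_spaces(text):
    # Mirrors the notebook's approach: number of spaces plus one
    return text.count(' ') + 1

def count_words_by_split(text):
    # str.split() with no argument splits on any run of whitespace
    return len(text.split())

line_break_sample = "Great place.\r\nWould return"   # 4 words, only 2 spaces
double_space_sample = "nice  host"                    # 2 words, 2 spaces

for sample in (line_break_sample, double_space_sample):
    print(repr(sample), count_words_by_spaces(sample), count_words_by_split(sample))
```

The pandas equivalent of the second function is `df['comments'].str.split().str.len()`.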


**Cluster scores and assign great/good/poor labels**

In [8]:
#Import KMeans
from sklearn.cluster import KMeans

#Fit to review_scores_rating and print clusters
#(n_jobs is omitted: it was deprecated in scikit-learn 0.23 and removed in 1.0)
kmeans = KMeans(n_clusters=3, random_state=24).fit(df.review_scores_rating.values.reshape(-1,1))
print('Cluster centers found by kmeans:\n',kmeans.cluster_centers_)

Cluster centers found by kmeans:
 [[97.7305724 ]
 [87.7121433 ]
 [57.40273038]]
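
K-means numbers its clusters in no particular order, so any hand-written mapping from cluster id to a label has to be checked against the printed centers. One way to avoid that manual step is to rank the centers and assign names by rank. The sketch below does this for the three centers printed above; `names_low_to_high`, `order`, and `label_map` are illustrative names.

```python
# Cluster centers as printed above, indexed by cluster id
centers = [97.7305724, 87.7121433, 57.40273038]

# Label names ordered from the lowest-scoring cluster to the highest
names_low_to_high = ['poor', 'good', 'great']

# Cluster ids sorted by their center value, lowest first
order = sorted(range(len(centers)), key=lambda cluster_id: centers[cluster_id])

# Pair each cluster id with the name that matches its rank
label_map = {cluster_id: name for cluster_id, name in zip(order, names_low_to_high)}
print(label_map)
```

With a fitted model, `centers` would be `kmeans.cluster_centers_.ravel()`, and the mapping stays correct even if a different seed or library version reorders the clusters.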


In [9]:
#Map k-means cluster ids to labels (cluster 0 has the highest center, cluster 2 the lowest)
labels = {0: 'great',
         1: 'good',
         2: 'poor'}

#Convert and append review_labels to df
review_labels = [labels[k] for k in kmeans.labels_]

df['review_label'] = review_labels

#Check
display(df.head())

Unnamed: 0,comments,review_scores_rating,comments_normalized,tokens,tokens_clean,tokens_pos_tags,comments_word_count,review_label
7790,Paul has a super nice place and is a super nice guy. The apartment is extremely clean and has an excellent location nestled between the Mission and Noe Valley. Definitely recommend his apartment!,100.0,paul has a super nice place and is a super nice guy the apartment is extremely clean and has an excellent location nestled between the mission and noe valley definitely recommend his apartment,"[paul, has, a, super, nice, place, and, is, a, super, nice, guy, the, apartment, is, extremely, clean, and, has, an, excellent, location, nestled, between, the, mission, and, noe, valley, definitely, recommend, his, apartment]","[paul, super, nice, place, super, nice, guy, apartment, extremely, clean, excellent, location, nestled, mission, noe, valley, definitely, recommend, apartment]","[(paul, NN), (super, NN), (nice, JJ), (place, NN), (super, JJ), (nice, JJ), (guy, NN), (apartment, NN), (extremely, RB), (clean, JJ), (excellent, JJ), (location, NN), (nestled, VBN), (mission, NN), (noe, FW), (valley, NN), (definitely, RB), (recommend, JJ), (apartment, NN)]",33,great
10317,Did not stay here. There was a challenge that was not resolved. Inflexible personality. I asked for and Lawrence refused to refund anything.. Mumbled under his breath how 'rediculous' we were.,20.0,did not stay here there was a challenge that was not resolved inflexible personality i asked for and lawrence refused to refund anything mumbled under his breath how rediculous we were,"[did, not, stay, here, there, was, a, challenge, that, was, not, resolved, inflexible, personality, i, asked, for, and, lawrence, refused, to, refund, anything, mumbled, under, his, breath, how, rediculous, we, were]","[stay, challenge, resolved, inflexible, personality, asked, lawrence, refused, refund, anything, mumbled, breath, rediculous]","[(stay, NN), (challenge, NN), (resolved, VBD), (inflexible, JJ), (personality, NN), (asked, VBD), (lawrence, NN), (refused, VBD), (refund, NN), (anything, NN), (mumbled, VBN), (breath, RB), (rediculous, JJ)]",32,poor
12146,"He's great. Location is perfect, especially if you have a bicycle.",60.0,he s great location is perfect especially if you have a bicycle,"[he, s, great, location, is, perfect, especially, if, you, have, a, bicycle]","[great, location, perfect, especially, bicycle]","[(great, JJ), (location, NN), (perfect, NN), (especially, RB), (bicycle, VB)]",11,poor
27172,"Rebecca's studio is great. I felt completely at home with all the comforts and amenities that one could expect. Both the building and studio are very clean, modern and convenient to public transportation and San Francisco. Rebecca was very helpful and accommodating. I'd stay at her place again and would recommend anyone visiting SF to consider it as an excellent alternative to a hotel, especially if you prefer a modern accommodation.",80.0,rebecca s studio is great i felt completely at home with all the comforts and amenities that one could expect both the building and studio are very clean modern and convenient to public transportation and san francisco rebecca was very helpful and accommodating i d stay at her place again and would recommend anyone visiting sf to consider it as an excellent alternative to a hotel especially if you prefer a modern accommodation,"[rebecca, s, studio, is, great, i, felt, completely, at, home, with, all, the, comforts, and, amenities, that, one, could, expect, both, the, building, and, studio, are, very, clean, modern, and, convenient, to, public, transportation, and, san, francisco, rebecca, was, very, helpful, and, accommodating, i, d, stay, at, her, place, again, and, would, recommend, anyone, visiting, sf, to, consider, it, as, an, excellent, alternative, to, a, hotel, especially, if, you, prefer, a, modern, accomm...","[rebecca, studio, great, felt, completely, home, comfort, amenity, one, could, expect, building, studio, clean, modern, convenient, public, transportation, san, francisco, rebecca, helpful, accommodating, stay, place, would, recommend, anyone, visiting, sf, consider, excellent, alternative, hotel, especially, prefer, modern, accommodation]","[(rebecca, NN), (studio, NN), (great, JJ), (felt, VBD), (completely, RB), (home, VBN), (comfort, NN), (amenity, NN), (one, CD), (could, MD), (expect, VB), (building, NN), (studio, NN), (clean, JJ), (modern, JJ), (convenient, NN), (public, JJ), (transportation, NN), (san, JJ), (francisco, NN), (rebecca, NN), (helpful, JJ), (accommodating, VBG), (stay, JJ), (place, NN), (would, MD), (recommend, VB), (anyone, NN), (visiting, VBG), (sf, JJ), (consider, VB), (excellent, JJ), (alternative, JJ), (h...",71,good
507880,"Susie is a great hostess, very attentive and also gave me my privacy when I needed it. Unfortunately for things beyond her control, some kind of machinery malfunction or something from another apt, best we could figure, the room wasn't very quiet at night during the week I stayed. But otherwise it is a lovely place and I would return.\r\nSusie is very nice and has a loveable pooch Zoey!",80.0,susie is a great hostess very attentive and also gave me my privacy when i needed it unfortunately for things beyond her control some kind of machinery malfunction or something from another apt best we could figure the room wasn t very quiet at night during the week i stayed but otherwise it is a lovely place and i would return susie is very nice and has a loveable pooch zoey,"[susie, is, a, great, hostess, very, attentive, and, also, gave, me, my, privacy, when, i, needed, it, unfortunately, for, things, beyond, her, control, some, kind, of, machinery, malfunction, or, something, from, another, apt, best, we, could, figure, the, room, wasn, t, very, quiet, at, night, during, the, week, i, stayed, but, otherwise, it, is, a, lovely, place, and, i, would, return, susie, is, very, nice, and, has, a, loveable, pooch, zoey]","[susie, great, hostess, attentive, also, gave, privacy, needed, unfortunately, thing, beyond, control, kind, machinery, malfunction, something, another, apt, best, could, figure, room, quiet, night, week, stayed, otherwise, lovely, place, would, return, susie, nice, loveable, pooch, zoey]","[(susie, RB), (great, JJ), (hostess, NN), (attentive, NN), (also, RB), (gave, VBD), (privacy, NN), (needed, VBN), (unfortunately, RB), (thing, NN), (beyond, IN), (control, NN), (kind, NN), (machinery, NN), (malfunction, NN), (something, NN), (another, DT), (apt, JJ), (best, NN), (could, MD), (figure, VB), (room, NN), (quiet, JJ), (night, NN), (week, NN), (stayed, VBD), (otherwise, RB), (lovely, JJ), (place, NN), (would, MD), (return, VB), (susie, JJ), (nice, JJ), (loveable, JJ), (pooch, NN),...",69,good


### Write file to CSV

In [10]:
#Drop intermediate cols and sort cols alphabetically
df.drop(['comments_normalized','tokens'], axis=1, inplace = True)
df = df.reindex(sorted(df.columns), axis=1)

#Print shape of dataframe
print('Final shape of processed data:', df.shape)

#Set path
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python\In Progress\Airbnb - San Francisco\Data\03_Processed'

#Write file
df.to_csv(path + '/2020_0616_Reviews_Processed.csv')

Final shape of processed data: (36750, 6)
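
Because `tokens_clean` and `tokens_pos_tags` hold Python lists, `to_csv` stores each cell as the text representation of the list, and reading the file back yields strings rather than lists. A minimal sketch of how such cells can be parsed again with the standard library follows; the sample values are taken from the third review above and the variable names are illustrative.

```python
import ast

tokens_clean = ['great', 'location', 'perfect', 'especially', 'bicycle']
tokens_pos_tags = [('great', 'JJ'), ('location', 'NN')]

# What ends up in a CSV cell is the string form of each list
stored_tokens = str(tokens_clean)
stored_tags = str(tokens_pos_tags)

# ast.literal_eval safely parses Python literals (lists, tuples, strings, numbers)
restored_tokens = ast.literal_eval(stored_tokens)
restored_tags = ast.literal_eval(stored_tags)

print(type(stored_tokens).__name__, '->', type(restored_tokens).__name__)
```

When loading the processed file in the modeling notebook, the same parser can be supplied through the `converters` argument of `pd.read_csv`, for example `converters={'tokens_clean': ast.literal_eval}`.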
