# ASHEVILLE AIRBNB SENTIMENT ANALYSIS

> The purpose of this report is **to analyze customer reviews of Airbnb listings in Asheville, North Carolina, United States**. It acts as a stepping stone **to understanding what customers think of the service offered by Asheville's Airbnb hosts, and the analysis could help show whether the hosts are providing good customer service**. The analysis is split across several notebooks and covers *data preprocessing, text preprocessing, topic modelling, visualization, model building, and model testing*.

> This notebook covers only the **TEXT PREPROCESSING** and **TOPIC MODELLING** parts.

> The dataset contains the **detailed review data for listings in Asheville, North Carolina**, compiled on **08 November 2020**. The data come from the **Inside Airbnb site** and are sourced from publicly available information on the Airbnb site. The data have been analyzed, cleansed, and aggregated where appropriate to facilitate public discussion. For more on this and similar data, see this [link](http://insideairbnb.com/get-the-data.html).

## IMPORT LIBRARIES

In [43]:
# data wrangling

import re
import string
import pandas as pd
import numpy as np
import spacy

# data visualization

import matplotlib.pyplot as plt
import seaborn as sns

# text processing

import nltk
import en_core_web_sm
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# filter warning

import warnings
warnings.filterwarnings('ignore')

## OVERVIEW

In [2]:
# load data

df = pd.read_csv('asheville-reviews-clean.csv')

In [3]:
# show top 5

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned
0,108061,553741,2011-09-21,822907,Pedro & Katie,"Lisa is superb hostess, she will treat you lik...",lisa superb hostess treat like family provide ...
1,108061,683278,2011-11-01,236064,Tim,This was a lovely little place walking distanc...,lovely little place walking distance downtown ...
2,108061,714889,2011-11-13,1382707,Shane,"Lisa was very nice to work with. However, we ...",lisa nice work however realize house old norma...
3,108061,1766157,2012-07-21,416731,Brenda,I feel very lucky to have found this beautiful...,feel lucky found beautiful home asheville quie...
4,108061,2033065,2012-08-19,1858880,Lindsey,"Great roomy little apartment, beautiful privat...",great roomy little apartment beautiful private...


In [4]:
# check info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173892 entries, 0 to 173891
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   listing_id        173892 non-null  int64 
 1   id                173892 non-null  int64 
 2   date              173892 non-null  object
 3   reviewer_id       173892 non-null  int64 
 4   reviewer_name     173892 non-null  object
 5   comments          173892 non-null  object
 6   comments_cleaned  173709 non-null  object
dtypes: int64(3), object(4)
memory usage: 9.3+ MB


In [5]:
# function to check data summary

def summary(df):
    
    columns = df.columns.to_list()
    
    dtypes = []
    unique_counts = []
    missing_counts = []
    missing_percentages = []
    total_counts = [df.shape[0]] * len(columns)

    for col in columns:
        dtype = str(df[col].dtype)
        dtypes.append(dtype)
        unique_count = df[col].nunique()
        unique_counts.append(unique_count)
        missing_count = df[col].isnull().sum()
        missing_counts.append(missing_count)
        missing_percentage = round((missing_count/df.shape[0]) * 100, 2)
        missing_percentages.append(missing_percentage)

    df_summary = pd.DataFrame({
        "column": columns,
        "dtypes": dtypes,
        "unique_count": unique_counts,
        "missing_values": missing_counts,
        "missing_percentage": missing_percentages,
        "total_count": total_counts,
    })

    return df_summary.sort_values(by="missing_percentage", ascending=False).reset_index(drop=True)

In [6]:
# check summary

summary(df)

Unnamed: 0,column,dtypes,unique_count,missing_values,missing_percentage,total_count
0,comments_cleaned,object,168409,183,0.11,173892
1,listing_id,int64,2044,0,0.0,173892
2,id,int64,173892,0,0.0,173892
3,date,object,2904,0,0.0,173892
4,reviewer_id,int64,158449,0,0.0,173892
5,reviewer_name,object,16279,0,0.0,173892
6,comments,object,170971,0,0.0,173892


> Although these were fixed in the previous notebook, some `dtypes` are still not appropriate, and there are missing values in the *comments_cleaned* feature. I'll therefore clean the data once more before moving on to text cleaning.

## PREPROCESSING

In [7]:
# check the missing values

df[df['comments_cleaned'].isna()]

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned
254,155305,286556497,2018-07-06,199962397,Leif,A,
633,156926,552755567,2019-10-22,7946489,Юлия,Время проведенное с Дарьей было увлекательным ...,
638,156926,560000486,2019-11-05,61670213,Oxana,"Очень интересно!! не жалею о новом опыте, и вп...",
1292,259576,203258198,2017-10-14,149002825,David,.,
1386,259576,359959590,2018-12-18,229443928,Raphael,.,
...,...,...,...,...,...,...,...
171567,42981397,645925705,2020-08-02,93550848,Zach,.,
171568,42981397,652297593,2020-08-16,93550848,Zach,.,
171666,43087424,702326399,2020-10-20,106022454,Hau,i,
172534,43817239,656384354,2020-08-25,37130879,Dani,...,


> The missing values seem to come from reviews written in other languages or from improper comments, as shown above. I'll fill these values with *No Description* and then move on to fixing the datatypes.
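
> As a rough illustration of why these rows end up empty after cleaning, the sketch below flags comments that contain almost no ASCII letters. The helper name `is_uninformative` and the `min_letters` threshold are assumptions made for this example; they are not part of the earlier cleaning notebook, and the heuristic would miss non-English reviews written in Latin script.

```python
import re

# Hypothetical helper: the name and the threshold are illustrative assumptions.
def is_uninformative(text, min_letters=2):
    """Return True when a comment has fewer than `min_letters` ASCII letters."""
    letters = re.findall(r"[A-Za-z]", str(text))
    return len(letters) < min_letters

# Sample comments taken from the missing-value rows above, plus one ordinary review.
samples = ["A", ".", "...", "Очень интересно!!", "Great roomy little apartment"]
flags = [is_uninformative(s) for s in samples]
print(flags)
```

> A flag like this could be stored in its own column, so that such rows can be filtered out later instead of being scored as if they were real reviews.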

In [8]:
# fill missing values

df['comments_cleaned'] = df['comments_cleaned'].fillna('No Description')

In [9]:
# fix column dtypes

for i in df.columns:
    if i in ('listing_id', 'id', 'reviewer_id'):
        # IDs are labels, not quantities; `np.object` was removed in NumPy 1.24, so use the builtin `object`
        df[i] = df[i].astype(object)
    elif i == 'date':
        df[i] = pd.to_datetime(df[i])

In [10]:
# check summary

summary(df)

Unnamed: 0,column,dtypes,unique_count,missing_values,missing_percentage,total_count
0,listing_id,object,2044,0,0.0,173892
1,id,object,173892,0,0.0,173892
2,date,datetime64[ns],2904,0,0.0,173892
3,reviewer_id,object,158449,0,0.0,173892
4,reviewer_name,object,16279,0,0.0,173892
5,comments,object,170971,0,0.0,173892
6,comments_cleaned,object,168410,0,0.0,173892


> Everything seems to be in order. I'll move on to text processing, ahead of the modelling steps.

## TEXT PROCESSING

### LEMMATIZATION

> For this part, I'll lemmatize the text in the *comments_cleaned* feature, keeping only nouns and adjectives, to get the tokenized text for the modelling part.

In [11]:
# load spaCy model for lemmatization (parser and NER are disabled because only lemmas and POS tags are needed)

nlp = en_core_web_sm.load(disable=['parser', 'ner'])

In [12]:
# function to lemmatize text

def lemmatization(texts,allowed_postags=['NOUN', 'ADJ']): 
    output = []
    for text in texts:
        doc = nlp(text) 
        output.append(' '.join([word.lemma_ for word in doc if word.pos_ in allowed_postags ]))
    return output

In [13]:
# apply lemmatization

comment_list = df['comments_cleaned'].tolist()
comment_tokenized = lemmatization(comment_list)

In [14]:
# check lemmatized comment

print(comment_tokenized[0])

treat family coziest little home experience magical town private sunny apartment little flat need people impeccable lovely neighborhood


In [15]:
# create new feature to store tokenized text

df['comments_tokenized'] = comment_tokenized

In [16]:
# show dataframe

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized
0,108061,553741,2011-09-21,822907,Pedro & Katie,"Lisa is superb hostess, she will treat you lik...",lisa superb hostess treat like family provide ...,treat family coziest little home experience ma...
1,108061,683278,2011-11-01,236064,Tim,This was a lovely little place walking distanc...,lovely little place walking distance downtown ...,lovely little place distance downtown responsi...
2,108061,714889,2011-11-13,1382707,Shane,"Lisa was very nice to work with. However, we ...",lisa nice work however realize house old norma...,work old case floor permanent renter squeaky f...
3,108061,1766157,2012-07-21,416731,Brenda,I feel very lucky to have found this beautiful...,feel lucky found beautiful home asheville quie...,lucky beautiful home quiet clean guest gloriou...
4,108061,2033065,2012-08-19,1858880,Lindsey,"Great roomy little apartment, beautiful privat...",great roomy little apartment beautiful private...,great roomy little apartment beautiful private...


### SENTIMENT ANALYSIS - VADER

> Now I'll analyze the sentiment using **VADER (Valence Aware Dictionary and sEntiment Reasoner)**. It is a lexicon- and rule-based model for text sentiment analysis that is sensitive to both the polarity (positive/negative) and the intensity (strength) of emotion. VADER relies on a dictionary that maps lexical features to emotion intensities, known as sentiment scores. The compound score of a text is obtained by summing the rule-adjusted intensity of each word in the text and then normalizing that sum to lie between -1 and 1.
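
> To make the compound score more concrete, here is a minimal sketch of the normalization step and of the thresholds used in this notebook. The summed valence passed in (`4.0`) is an invented number, and the function names `normalize` and `label` are chosen for this example; in practice `SentimentIntensityAnalyzer` applies its own rules (negation, intensifiers, punctuation) before the sum is normalized.

```python
import math

def normalize(score, alpha=15):
    """Squash a summed valence score into (-1, 1), as VADER does for its compound score."""
    return score / math.sqrt(score * score + alpha)

def label(compound, threshold=0.05):
    """Map a compound score to a class using the same cut-offs as this notebook."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

summed_valence = 4.0  # hypothetical sum of word intensities for one review
compound = normalize(summed_valence)
print(round(compound, 4), label(compound))
```

> Because of the square root in the denominator, long and very positive reviews saturate close to 1 rather than growing without bound, which is why many scores in this dataset sit above 0.8.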

In [17]:
# function to assign sentiment class based on compound score

def sentiment(comp):
    if comp >= 0.05:
        return 'positive'
    elif (comp > -0.05) and (comp < 0.05):
        return 'neutral'
    elif comp <= -0.05 :
        return 'negative'

In [18]:
# initialize sentiment analyzer

analyzer = SentimentIntensityAnalyzer()

# calculate compound score

compound_score = []
for i in df['comments_tokenized']:
    compound_score.append(analyzer.polarity_scores(i)['compound'])

In [19]:
# create new feature to store compound score

df['compound_score'] = compound_score

In [20]:
# check sentiment on first data

sentiment(df['compound_score'][0])

'positive'

In [21]:
# calculate sentiment based on compound score

sent = []
for i in range(0, len(df)):
    sent.append(sentiment(df['compound_score'][i]))

In [22]:
# create new feature to store sentiment

df['sentiment'] = sent

In [23]:
# check final dataframe

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized,compound_score,sentiment
0,108061,553741,2011-09-21,822907,Pedro & Katie,"Lisa is superb hostess, she will treat you lik...",lisa superb hostess treat like family provide ...,treat family coziest little home experience ma...,0.8519,positive
1,108061,683278,2011-11-01,236064,Tim,This was a lovely little place walking distanc...,lovely little place walking distance downtown ...,lovely little place distance downtown responsi...,0.8481,positive
2,108061,714889,2011-11-13,1382707,Shane,"Lisa was very nice to work with. However, we ...",lisa nice work however realize house old norma...,work old case floor permanent renter squeaky f...,0.8176,positive
3,108061,1766157,2012-07-21,416731,Brenda,I feel very lucky to have found this beautiful...,feel lucky found beautiful home asheville quie...,lucky beautiful home quiet clean guest gloriou...,0.9957,positive
4,108061,2033065,2012-08-19,1858880,Lindsey,"Great roomy little apartment, beautiful privat...",great roomy little apartment beautiful private...,great roomy little apartment beautiful private...,0.9351,positive


> We now have everything we need for modelling. First, I'll do the topic modelling.

## TOPIC MODELLING

> **Topic modeling** falls under unsupervised machine learning: documents are processed to extract the topics they relate to. It is an important concept in traditional Natural Language Processing because it can surface semantic relationships between words across document clusters.

> We will use one of **sklearn's topic modeling methods**, NMF, which is based on Non-negative Matrix Factorization. We will train the NMF model on tf-idf feature vectors.
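
> For readers unfamiliar with tf-idf, the sketch below computes it by hand on three invented documents. It follows what I understand to be the default `TfidfVectorizer` weighting (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document), but it tokenizes with a plain `split()`, which is a simplification of sklearn's tokenizer.

```python
import math
from collections import Counter

# Toy documents, invented for this example.
docs = [
    "great location great host",
    "clean room comfortable bed",
    "great place clean room",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df_counts = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized tf-idf vector of one document over `vocab`."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
print(vocab)
print([round(v, 3) for v in matrix[0]])
```

> Words that occur in many reviews (here `great`) get a lower idf than words that occur in few (here `host`), so they contribute less to the vectors that NMF later factorizes.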

### NON NEGATIVE MATRIX FACTORIZATION (NMF)

> **Non-Negative Matrix Factorization** is a statistical method for reducing the dimension of the input corpus. It uses a factor-analysis approach that gives comparatively less weight to words with less coherence. NMF is often reported to produce more coherent topics than LDA, and it tends to produce sparse representations: most entries are close to zero and only a few have significant values. It is a reasonable choice when we want a small number of topics.

> In short, **the goal of NMF is to find two non-negative matrices (W, H) whose product approximates the non-negative matrix X**. This factorization can be used, for example, for dimensionality reduction, source separation, or topic extraction. We will be using sklearn's implementation of NMF.
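
> The idea of approximating X by W times H can be sketched with plain NumPy. The example below uses the classic multiplicative update rule of Lee and Seung on a small random matrix; this is a swapped-in technique for illustration only, not the solver sklearn uses by default (coordinate descent), and the matrix sizes and iteration count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy non-negative matrix X (6 "documents" x 4 "terms"), invented for this example.
X = rng.random((6, 4))

n_topics = 2
W = rng.random((6, n_topics))   # document-topic weights
H = rng.random((n_topics, 4))   # topic-term weights
eps = 1e-10                     # guards against division by zero

def frobenius_error(X, W, H):
    """Frobenius norm of the reconstruction residual X - WH."""
    return np.linalg.norm(X - W @ H)

initial_error = frobenius_error(X, W, H)
for _ in range(200):
    # Multiplicative updates keep every entry non-negative.
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)
final_error = frobenius_error(X, W, H)

print(W.shape, H.shape)
print(final_error < initial_error)
```

> In the notebook, X is the tf-idf matrix, each row of H is a topic expressed as weights over the vocabulary, and each row of W gives the weight of every topic in one review.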

In [24]:
# convert the text to a tf-idf document-term matrix

vectorizer = TfidfVectorizer(max_features=100)
tf_idf = vectorizer.fit_transform(df['comments_tokenized'])

In [25]:
# show feature names (`get_feature_names` was removed in scikit-learn 1.2)

list(vectorizer.get_feature_names_out())

['able',
 'access',
 'amazing',
 'amenity',
 'apartment',
 'area',
 'available',
 'awesome',
 'bathroom',
 'beautiful',
 'bed',
 'bedroom',
 'brewery',
 'check',
 'clean',
 'close',
 'coffee',
 'comfortable',
 'comfy',
 'communication',
 'convenient',
 'cottage',
 'couple',
 'cozy',
 'cute',
 'day',
 'distance',
 'dog',
 'downtown',
 'drive',
 'easy',
 'enough',
 'excellent',
 'experience',
 'extra',
 'family',
 'fantastic',
 'friend',
 'friendly',
 'getaway',
 'good',
 'great',
 'guest',
 'helpful',
 'home',
 'host',
 'hot',
 'house',
 'kitchen',
 'large',
 'little',
 'local',
 'location',
 'lot',
 'lovely',
 'many',
 'minute',
 'morning',
 'mountain',
 'much',
 'neighborhood',
 'next',
 'nice',
 'night',
 'parking',
 'peaceful',
 'people',
 'perfect',
 'place',
 'plenty',
 'porch',
 'private',
 'question',
 'quick',
 'quiet',
 'recommend',
 'recommendation',
 'responsive',
 'restaurant',
 'room',
 'short',
 'small',
 'space',
 'spacious',
 'spot',
 'stay',
 'super',
 'sure',
 'thank'

In [26]:
# applying NMF factorization

nmf_model = NMF(n_components=5, init='nndsvd', random_state=42)
W = nmf_model.fit_transform(tf_idf)
H = nmf_model.components_
print(W.shape, H.shape)

(173892, 5) (5, 100)


In [27]:
# get topics

feature_names = vectorizer.get_feature_names_out()
get_topics = []
for topic in H:
    # argsort is ascending, so the strongest word of each topic comes last
    get_topics.append(' '.join([feature_names[i] for i in topic.argsort()[-5:]]))

In [28]:
# show topics

get_topics

['home beautiful perfect space host',
 'communication spot host location great',
 'amazing nice great stay place',
 'quiet drive minute close downtown',
 'room bed nice comfortable clean']

> The first, third, and fifth topics seem to be about how nice the place and the host are. The second and fourth topics are specifically about the location.
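
> When reading the topic strings, note that `argsort` sorts in ascending order, so the last word in each string is the one with the largest weight. The sketch below shows the difference on an invented topic row and vocabulary; reversing the order is only a suggestion for readability.

```python
import numpy as np

# Toy vocabulary and one topic row, invented for this example.
feature_names = np.array(["clean", "downtown", "host", "location", "quiet", "room"])
topic = np.array([0.1, 0.9, 0.3, 0.7, 0.5, 0.0])

# Same slicing as in the notebook: top 3 words, weakest of them first.
ascending = feature_names[topic.argsort()[-3:]].tolist()

# Reversed variant: strongest word first.
descending = feature_names[topic.argsort()[::-1][:3]].tolist()

print(" ".join(ascending))
print(" ".join(descending))
```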

In [30]:
# predict the topics based on tokenized comments

topics = []

for i in df['comments_tokenized']:
    text_to_vector = vectorizer.transform([i])
    prob_score = nmf_model.transform(text_to_vector)
    topics.append(get_topics[np.argmax(prob_score)])

In [31]:
# create new feature to store the topics

df['topics'] = topics

In [32]:
# show top 5 data

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized,compound_score,sentiment,topics
0,108061,553741,2011-09-21,822907,Pedro & Katie,"Lisa is superb hostess, she will treat you lik...",lisa superb hostess treat like family provide ...,treat family coziest little home experience ma...,0.8519,positive,home beautiful perfect space host
1,108061,683278,2011-11-01,236064,Tim,This was a lovely little place walking distanc...,lovely little place walking distance downtown ...,lovely little place distance downtown responsi...,0.8481,positive,quiet drive minute close downtown
2,108061,714889,2011-11-13,1382707,Shane,"Lisa was very nice to work with. However, we ...",lisa nice work however realize house old norma...,work old case floor permanent renter squeaky f...,0.8176,positive,quiet drive minute close downtown
3,108061,1766157,2012-07-21,416731,Brenda,I feel very lucky to have found this beautiful...,feel lucky found beautiful home asheville quie...,lucky beautiful home quiet clean guest gloriou...,0.9957,positive,room bed nice comfortable clean
4,108061,2033065,2012-08-19,1858880,Lindsey,"Great roomy little apartment, beautiful privat...",great roomy little apartment beautiful private...,great roomy little apartment beautiful private...,0.9351,positive,home beautiful perfect space host


In [33]:
# show example on topic 1

df['comments'][0]

'Lisa is superb hostess, she will treat you like family and provide you with the coziest little home in Asheville which will definitely enhance your experience of the magical town! Just like the Eco-retreat, the Private sunny apartment is a neat little flat with all you need for up to 3 people, the place was impeccable in lovely neighborhood. You can hardly beat this one!'

In [34]:
# show example on topic 4

df['comments'][1]

'This was a lovely little place walking distance from downtown. Lisa was very responsive. My best Airbnb experience yet!'

In [35]:
# show example on topic 5

df['comments'][3]

'I feel very lucky to have found this beautiful home in Asheville. It was very quiet, clean and well thought out for a guest\'s needs. I stayed here for a glorious month while attending classes at UNCA. The location and apartment was absolutely perfect. I walked to campus through the woods everyday and walked to downtown almost as often...I could be anywhere in ten minutes but I loved being "home" here. It was so peaceful and lush. There\'s a nice big kitchen, which was awesome because I love to cook and there were beautiful gardens with vegetables and herbs to use. Loved the bathroom, it was like a suite unto itself, huge vanity and a great walk in shower. I also thought the bed was very comfortable and I appreciated the good quality linens and pillows. Lisa made it easy for me to feel at home. She gave me a tour of town and showed me how to find the trail to UNCA and some cool bike trails too...let me borrow her bike, which was really nice. I had a wonderful time, will definitely com

> The model seems to assign topics reasonably well. In the next phase, I'll dump the cleaned data and build models using various algorithms in a separate notebook. One thing to address first: some rows, identifiable through the *comments* and *comments_cleaned* features, end up with misleading sentiment and topics. For the sake of modelling, I would rather drop these rows later if possible. I'll show them below.

In [40]:
# dump to a new CSV file

df.to_csv('asheville-reviews-tokenized.csv', index=False)

In [39]:
# show the anomaly

df[df['comments']=='No Description']

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized,compound_score,sentiment,topics
11888,1827412,426615038,2019-03-21,247099480,David,No Description,description,description,0.0,neutral,home beautiful perfect space host
15856,2411109,406175955,2019-01-28,110949606,Josi,No Description,description,description,0.0,neutral,home beautiful perfect space host
18887,3095136,127198043,2017-01-16,111272759,Ash,No Description,description,description,0.0,neutral,home beautiful perfect space host
20030,3225871,209550903,2017-11-05,147365729,Andrew,No Description,description,description,0.0,neutral,home beautiful perfect space host
20390,3314819,52436690,2015-10-29,46496931,Dwight,No Description,description,description,0.0,neutral,home beautiful perfect space host
30057,5144212,555444539,2019-10-27,4592,Rebecca,No Description,description,description,0.0,neutral,home beautiful perfect space host
31303,5696919,441087924,2019-04-21,140470172,Michael,No Description,description,description,0.0,neutral,home beautiful perfect space host
33816,6234618,311153650,2018-08-20,74602012,Eileen,No Description,description,description,0.0,neutral,home beautiful perfect space host
39678,7556089,441017868,2019-04-21,72106403,Tori,No Description,description,description,0.0,neutral,home beautiful perfect space host
42045,8051829,339537756,2018-10-21,166411083,Michael,No Description,description,description,0.0,neutral,home beautiful perfect space host


In [42]:
# show the anomaly 2

df[df['comments_cleaned']=='No Description']

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized,compound_score,sentiment,topics
254,155305,286556497,2018-07-06,199962397,Leif,A,No Description,description,0.0,neutral,home beautiful perfect space host
633,156926,552755567,2019-10-22,7946489,Юлия,Время проведенное с Дарьей было увлекательным ...,No Description,description,0.0,neutral,home beautiful perfect space host
638,156926,560000486,2019-11-05,61670213,Oxana,"Очень интересно!! не жалею о новом опыте, и вп...",No Description,description,0.0,neutral,home beautiful perfect space host
1292,259576,203258198,2017-10-14,149002825,David,.,No Description,description,0.0,neutral,home beautiful perfect space host
1386,259576,359959590,2018-12-18,229443928,Raphael,.,No Description,description,0.0,neutral,home beautiful perfect space host
...,...,...,...,...,...,...,...,...,...,...,...
171567,42981397,645925705,2020-08-02,93550848,Zach,.,No Description,description,0.0,neutral,home beautiful perfect space host
171568,42981397,652297593,2020-08-16,93550848,Zach,.,No Description,description,0.0,neutral,home beautiful perfect space host
171666,43087424,702326399,2020-10-20,106022454,Hau,i,No Description,description,0.0,neutral,home beautiful perfect space host
172534,43817239,656384354,2020-08-25,37130879,Dani,...,No Description,description,0.0,neutral,home beautiful perfect space host
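
> Every row in the two tables above is assigned to the first topic. A plausible explanation, stated here as an assumption rather than something verified in this notebook: the word *description* does not appear in the feature list shown earlier, so these reviews presumably become all-zero tf-idf vectors with all-zero topic weights, and `np.argmax` returns index 0 when all values are tied. The sketch below reproduces that behaviour on an invented weight matrix and marks such rows as unassigned; the `-1` marker is a convention chosen for this example.

```python
import numpy as np

# Toy document-topic weights, invented for this example.
# The last row is all zeros, as for a document with no in-vocabulary terms.
W = np.array([
    [0.02, 0.00, 0.10],
    [0.30, 0.05, 0.00],
    [0.00, 0.00, 0.00],
])

best = W.argmax(axis=1)          # ties resolve to the first index, so the zero row gets 0
has_signal = W.sum(axis=1) > 0   # False where no topic has any weight

# -1 marks "unassigned" (an assumed convention, not part of the notebook).
topic_ids = np.where(has_signal, best, -1)

print(best.tolist())
print(topic_ids.tolist())
```

> Filtering on a mask like `has_signal` would be one way to drop these rows before the modelling notebook, instead of matching on the *No Description* placeholder.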


## REFERENCES

>- https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664
>- https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45