# ASHEVILLE AIRBNB SENTIMENT ANALYSIS

> The purpose of this report is **to analyze customer reviews of Airbnb listings in Asheville, North Carolina, United States**. It serves as a stepping stone **toward understanding what customers think of the service offered by Asheville's Airbnb hosts, and whether those hosts are providing good customer service**. The analysis is split across several notebooks and covers *data preprocessing, text preprocessing, topic modelling, visualization, model building, and model testing*.

> This notebook covers only the **MODEL BUILDING** and **MODEL SELECTION** parts.

> The dataset contains the **detailed review data for listings in Asheville, North Carolina**, compiled on **08 November 2020**. The data come from the **Inside Airbnb site**, which sources them from publicly available information on the Airbnb site. The data have been analyzed, cleansed, and aggregated where appropriate to facilitate public discussion. For more on this and similar data, see this [link](http://insideairbnb.com/get-the-data.html).

## IMPORT LIBRARIES

In [1]:
# data wrangling

import re
import string
import pandas as pd
import numpy as np
import spacy

# data visualization

import matplotlib.pyplot as plt
import seaborn as sns

# text processing

import nltk
import en_core_web_sm
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# modelling

from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from imblearn.over_sampling import SMOTE

# filter warning

import warnings
warnings.filterwarnings('ignore')

## OVERVIEW

In [2]:
# load data

df = pd.read_csv('asheville-reviews-tokenized.csv')

In [3]:
# show top 5

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized,compound_score,sentiment,topics
0,108061,553741,2011-09-21,822907,Pedro & Katie,"Lisa is superb hostess, she will treat you lik...",lisa superb hostess treat like family provide ...,treat family coziest little home experience ma...,0.8519,positive,home beautiful perfect space host
1,108061,683278,2011-11-01,236064,Tim,This was a lovely little place walking distanc...,lovely little place walking distance downtown ...,lovely little place distance downtown responsi...,0.8481,positive,quiet drive minute close downtown
2,108061,714889,2011-11-13,1382707,Shane,"Lisa was very nice to work with. However, we ...",lisa nice work however realize house old norma...,work old case floor permanent renter squeaky f...,0.8176,positive,quiet drive minute close downtown
3,108061,1766157,2012-07-21,416731,Brenda,I feel very lucky to have found this beautiful...,feel lucky found beautiful home asheville quie...,lucky beautiful home quiet clean guest gloriou...,0.9957,positive,room bed nice comfortable clean
4,108061,2033065,2012-08-19,1858880,Lindsey,"Great roomy little apartment, beautiful privat...",great roomy little apartment beautiful private...,great roomy little apartment beautiful private...,0.9351,positive,home beautiful perfect space host


In [4]:
# check info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173892 entries, 0 to 173891
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   listing_id          173892 non-null  int64  
 1   id                  173892 non-null  int64  
 2   date                173892 non-null  object 
 3   reviewer_id         173892 non-null  int64  
 4   reviewer_name       173892 non-null  object 
 5   comments            173892 non-null  object 
 6   comments_cleaned    173892 non-null  object 
 7   comments_tokenized  172705 non-null  object 
 8   compound_score      173892 non-null  float64
 9   sentiment           173892 non-null  object 
 10  topics              173892 non-null  object 
dtypes: float64(1), int64(3), object(7)
memory usage: 14.6+ MB


In [5]:
# function to check data summary

def summary(df):
    
    columns = df.columns.to_list()
    
    dtypes = []
    unique_counts = []
    missing_counts = []
    missing_percentages = []
    total_counts = [df.shape[0]] * len(columns)

    for col in columns:
        dtype = str(df[col].dtype)
        dtypes.append(dtype)
        unique_count = df[col].nunique()
        unique_counts.append(unique_count)
        missing_count = df[col].isnull().sum()
        missing_counts.append(missing_count)
        missing_percentage = round((missing_count/df.shape[0]) * 100, 2)
        missing_percentages.append(missing_percentage)

    df_summary = pd.DataFrame({
        "column": columns,
        "dtypes": dtypes,
        "unique_count": unique_counts,
        "missing_values": missing_counts,
        "missing_percentage": missing_percentages,
        "total_count": total_counts,
    })

    return df_summary.sort_values(by="missing_percentage", ascending=False).reset_index(drop=True)

In [6]:
# check summary

summary(df)

Unnamed: 0,column,dtypes,unique_count,missing_values,missing_percentage,total_count
0,comments_tokenized,object,161619,1187,0.68,173892
1,listing_id,int64,2044,0,0.0,173892
2,id,int64,173892,0,0.0,173892
3,date,object,2904,0,0.0,173892
4,reviewer_id,int64,158449,0,0.0,173892
5,reviewer_name,object,16279,0,0.0,173892
6,comments,object,170971,0,0.0,173892
7,comments_cleaned,object,168410,0,0.0,173892
8,compound_score,float64,1841,0,0.0,173892
9,sentiment,object,3,0,0.0,173892


> Although these issues were fixed in the previous notebook, some `dtypes` are again not appropriate (reading from CSV does not preserve them), there are missing values in the *comments_tokenized* feature, and the *No Description* comments noted earlier need to be checked again. I'll therefore clean the data once more before going further.

## PREPROCESSING

In [7]:
# check the missing values

df[df['comments_tokenized'].isna()]

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized,compound_score,sentiment,topics
471,156926,67089326,2016-03-26,54401579,Raquel,"Un sitio excelente, comodo, limpio y en una zo...",un sitio excelente comodo limpio en una zona a...,,0.0,neutral,home beautiful perfect space host
489,156926,108262942,2016-10-15,50062310,Chen-Yi,Everything was described correctly!,everything described correctly,,0.0,neutral,home beautiful perfect space host
571,156926,363795585,2018-12-29,55670211,Kathryn,Thanks for having me,thanks,,0.0,neutral,home beautiful perfect space host
592,156926,454126304,2019-05-17,64777113,Ashton,I enjoyed my stay. If I am ever in Asheville a...,enjoyed stay ever asheville would definitely stay,,0.0,neutral,home beautiful perfect space host
595,156926,467554866,2019-06-10,112577954,Axelle,Excellente auberge de jeunesse! Je recommande!!,excellente auberge de jeunesse je recommande,,0.0,neutral,home beautiful perfect space host
...,...,...,...,...,...,...,...,...,...,...,...
173535,45023214,666276487,2020-09-20,159345242,Jonathan,This house was very nice and well kept up. It ...,house nice well kept exactly expected would de...,,0.0,neutral,home beautiful perfect space host
173802,45623198,701646936,2020-10-18,168239689,Alex,Very clean and cozy!,clean cozy,,0.0,neutral,home beautiful perfect space host
173832,45684376,705149781,2020-10-30,50451438,Taliyah,Very cozy,cozy,,0.0,neutral,home beautiful perfect space host
173875,45846658,703921445,2020-10-25,371770957,Alexander,Very cozy space and thoughtful host!,cozy space thoughtful host,,0.0,neutral,home beautiful perfect space host


> Looking at the missing values above, the reviews themselves still seem valid. For modelling purposes, however, I'll drop these rows so that they do not disturb the model.

In [8]:
# see the anomaly

anomaly = df[(df['comments']=='No Description') | (df['comments_cleaned']=='No Description')]

In [9]:
# show anomaly

anomaly.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned,comments_tokenized,compound_score,sentiment,topics
254,155305,286556497,2018-07-06,199962397,Leif,A,No Description,description,0.0,neutral,home beautiful perfect space host
633,156926,552755567,2019-10-22,7946489,Юлия,Время проведенное с Дарьей было увлекательным ...,No Description,description,0.0,neutral,home beautiful perfect space host
638,156926,560000486,2019-11-05,61670213,Oxana,"Очень интересно!! не жалею о новом опыте, и вп...",No Description,description,0.0,neutral,home beautiful perfect space host
1292,259576,203258198,2017-10-14,149002825,David,.,No Description,description,0.0,neutral,home beautiful perfect space host
1386,259576,359959590,2018-12-18,229443928,Raphael,.,No Description,description,0.0,neutral,home beautiful perfect space host


In [10]:
# see percentages

print(f'Length of anomaly : {round(len(anomaly)/len(df)*100, 2)}%')

Length of anomaly : 0.14%


In [11]:
anomaly.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 237 entries, 254 to 173730
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   listing_id          237 non-null    int64  
 1   id                  237 non-null    int64  
 2   date                237 non-null    object 
 3   reviewer_id         237 non-null    int64  
 4   reviewer_name       237 non-null    object 
 5   comments            237 non-null    object 
 6   comments_cleaned    237 non-null    object 
 7   comments_tokenized  237 non-null    object 
 8   compound_score      237 non-null    float64
 9   sentiment           237 non-null    object 
 10  topics              237 non-null    object 
dtypes: float64(1), int64(3), object(7)
memory usage: 22.2+ KB


> The anomalies make up only about 0.14% of the total data, so I think it is safe to drop them.

In [12]:
# dropping the anomaly

df = df[~((df['comments'].isin(anomaly['comments'])) | (df['comments_tokenized'].isin(anomaly['comments_tokenized'])))]

In [13]:
# drop missing values

df = df.dropna()

In [14]:
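# fixing column dtypes

df = df.copy()  # work on an explicit copy of the filtered frame
for i in df.columns:
    if i in ('listing_id', 'id', 'reviewer_id'):
        # `np.object` was removed in NumPy 1.24; the builtin `object` is equivalent
        df[i] = df[i].astype(object)
    elif i == 'date':
        df[i] = pd.to_datetime(df[i])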

In [15]:
# check summary

summary(df)

Unnamed: 0,column,dtypes,unique_count,missing_values,missing_percentage,total_count
0,listing_id,object,2042,0,0.0,172465
1,id,object,172465,0,0.0,172465
2,date,datetime64[ns],2904,0,0.0,172465
3,reviewer_id,object,157219,0,0.0,172465
4,reviewer_name,object,16166,0,0.0,172465
5,comments,object,169865,0,0.0,172465
6,comments_cleaned,object,167552,0,0.0,172465
7,comments_tokenized,object,161618,0,0.0,172465
8,compound_score,float64,1841,0,0.0,172465
9,sentiment,object,3,0,0.0,172465


> Now for the modelling part. Since this is a **multiclass classification** problem and the data are heavily **imbalanced**, I'll use **Gaussian Naive Bayes and Stochastic Gradient Descent**, and the data may need to be reweighted or resampled later.

In [16]:
# show the imbalanced target

df['sentiment'].value_counts()

positive    164534
neutral       6861
negative      1070
Name: sentiment, dtype: int64

## MODELLING

> A problem with imbalanced classification is that **there are too few examples of the minority class for a model to effectively learn the decision boundary**. One way **to solve this problem is to oversample the examples in the minority class**. This can be achieved by simply duplicating examples from the minority class in the training dataset prior to fitting a model. This can balance the class distribution but does not provide any additional information to the model.

> **Perhaps the most widely used approach to synthesizing new examples is called the Synthetic Minority Oversampling TEchnique, or SMOTE**. It works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line. Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space.

> We will use this method below.
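
> The interpolation step described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the `imblearn` implementation used in this notebook; the function name `smote_sketch`, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours.

    Hypothetical helper for illustration only (needs at least two rows in X_min).
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # pairwise Euclidean distances between minority examples
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                       # random minority example
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one of its k neighbours
    gap = rng.random((n_new, 1))                                # random point along the line
    return X_min[base] + gap * (X_min[picked] - X_min[base])

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_minority, n_new=6, k=2)
print(synthetic.shape)
```

> Every synthetic row lies on a segment between two real minority examples, so no new region of feature space is invented. The pairwise distance matrix makes this sketch practical only for small inputs.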

In [17]:
# encode the target (LabelEncoder sorts classes alphabetically: negative=0, neutral=1, positive=2)

df['sentiment'] = LabelEncoder().fit_transform(df['sentiment'])

In [18]:
# build the TF-IDF feature matrix (the independent variables)

vectorizer = TfidfVectorizer(max_features = 100)
tf_idf = vectorizer.fit_transform(df['comments_cleaned']).toarray()
print(tf_idf)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.39453716 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.09858178]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.26343102]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [19]:
# split data

X_train, X_test, y_train, y_test = train_test_split(tf_idf, df['sentiment'], test_size=0.2, random_state=101)

### GAUSSIAN NAIVE BAYES MODEL

> Naive Bayes can be extended to real-valued attributes, most commonly by assuming a **Gaussian distribution**. This extension of naive Bayes is called **Gaussian Naive Bayes**. Other functions can be used to estimate the distribution of the data, but the Gaussian (or Normal distribution) is the easiest to work with because we only need to estimate the mean and the standard deviation from our training data.
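
> As a minimal sketch of the paragraph above (and not scikit-learn's actual implementation), the model only needs a prior, a mean, and a variance per class and feature. The helper names `fit_gaussian_nb` and `predict_gaussian_nb`, the `var_floor` constant, and the toy data are assumptions made for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # var_floor is a hypothetical small constant that avoids division by zero
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_floor
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class maximising log prior + sum of per-feature Gaussian log densities."""
    log_like = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :]
        + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    ).sum(axis=2)
    return classes[np.argmax(np.log(priors)[None, :] + log_like, axis=1)]

X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1])
params = fit_gaussian_nb(X_toy, y_toy)
pred_toy = predict_gaussian_nb(np.array([[0.1, 0.0], [1.0, 1.0]]), *params)
print(pred_toy)
```

> TF-IDF features are sparse and non-negative, so the Gaussian assumption is only a rough fit for them, which may be part of why this model struggles below.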

In [20]:
# initialize model

model_gnb = GaussianNB()
model_gnb.fit(X_train, y_train)

GaussianNB()

In [21]:
# predict and see the classification report

y_pred_gnb = model_gnb.predict(X_test)
print(confusion_matrix(y_test,y_pred_gnb))
print(classification_report(y_test,y_pred_gnb))

[[  109    62    35]
 [  230   939   188]
 [ 3574  6532 22824]]
              precision    recall  f1-score   support

           0       0.03      0.53      0.05       206
           1       0.12      0.69      0.21      1357
           2       0.99      0.69      0.82     32930

    accuracy                           0.69     34493
   macro avg       0.38      0.64      0.36     34493
weighted avg       0.95      0.69      0.79     34493
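
> The figures in the report above can be reproduced by hand from the confusion matrix (rows are true labels, columns are predictions). The sketch below copies the matrix printed above; the variable names are my own.

```python
import numpy as np

# confusion matrix of the first Gaussian Naive Bayes run (rows: true, columns: predicted)
cm = np.array([[109, 62, 35],
               [230, 939, 188],
               [3574, 6532, 22824]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)   # correct predictions / everything predicted as that class
recall = true_pos / cm.sum(axis=1)      # correct predictions / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), round(accuracy, 2))
```

> For class 0 (negative), 109 of the 3,913 reviews predicted as negative are correct, which is where the 0.03 precision comes from, even though 109 of the 206 true negatives (recall 0.53) are found.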



> The model's predictions are quite poor: precision for the negative and neutral classes is only 0.03 and 0.12. I'll try the same model again, this time with **SMOTE**.

In [22]:
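# set and apply SMOTE ('minority' oversamples only the smallest class, here label 0)
# recent imbalanced-learn releases need the keyword argument and `fit_resample` (formerly `fit_sample`)

smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X_train, y_train)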
print(X_sm.shape, y_sm.shape)

(268712, 100) (268712,)


In [23]:
# refit the model on the oversampled data

model_gnb.fit(X_sm, y_sm)

GaussianNB()

In [24]:
# predict and see the classification report with oversampled data

y_pred_gnb_sm = model_gnb.predict(X_test)
print(confusion_matrix(y_test,y_pred_gnb_sm))
print(classification_report(y_test,y_pred_gnb_sm))

[[   98    57    51]
 [  300   844   213]
 [ 2434  6486 24010]]
              precision    recall  f1-score   support

           0       0.03      0.48      0.06       206
           1       0.11      0.62      0.19      1357
           2       0.99      0.73      0.84     32930

    accuracy                           0.72     34493
   macro avg       0.38      0.61      0.37     34493
weighted avg       0.95      0.72      0.81     34493



> Although overall accuracy improved slightly (0.69 to 0.72), recall on the negative and neutral classes dropped, so this model's predictions are still poor.

### STOCHASTIC GRADIENT DESCENT

> **Stochastic Gradient Descent (SGD)** is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). scikit-learn's `SGDClassifier` provides **linear classifiers (SVM, logistic regression, among others) with SGD training**. This estimator **implements regularized linear models with stochastic gradient descent (SGD) learning**: the gradient of the loss is estimated one sample at a time, and the model is updated along the way with a decreasing strength schedule (learning rate). SGD allows minibatch (online/out-of-core) learning via the `partial_fit` method. For best results with the default learning rate schedule, the data should have zero mean and unit variance.

> This implementation works with data represented as dense or sparse arrays of floating point values for the features. The model it fits can be controlled with the `loss` parameter. **By default, it fits a Linear Support Vector Machine (SVM)**, and linear SVMs are **widely regarded as one of the best text classification algorithms**.
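
> To make the "one sample at a time" update concrete, here is a minimal binary sketch of SGD on the hinge loss with an L2 penalty. It is not scikit-learn's code: the function name `sgd_linear_svm`, the simple decaying learning rate, and the toy data are assumptions, and `SGDClassifier` additionally handles the three classes here with a one-versus-all scheme.

```python
import numpy as np

def sgd_linear_svm(X, y, alpha=1e-4, eta0=0.1, epochs=20, seed=0):
    """Hypothetical helper: linear SVM trained by SGD, labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):          # visit samples in random order
            t += 1
            eta = eta0 / (1.0 + alpha * eta0 * t)  # assumed decreasing learning rate
            margin = y[i] * (X[i] @ w + b)
            w *= 1.0 - eta * alpha                 # gradient step on the L2 penalty
            if margin < 1.0:                       # hinge loss is active for this sample
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b

X_toy = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
y_toy = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = sgd_linear_svm(X_toy, y_toy)
pred = np.sign(X_toy @ w + b)
print(pred)
```

> Only samples that violate the margin change the weights, which is why the method scales well to large, sparse text features.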

In [25]:
# initialize model 

model_sgd = SGDClassifier(random_state=42, class_weight='balanced')
model_sgd.fit(X_train, y_train)

SGDClassifier(class_weight='balanced', random_state=42)
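
> The `class_weight='balanced'` option used above weights each class by `n_samples / (n_classes * count)`, as documented by scikit-learn. The sketch below applies that formula to the full-dataset counts shown earlier, purely for illustration; the fitted model computes its weights from `y_train`, so its exact values differ slightly.

```python
import numpy as np

# counts for labels 0 (negative), 1 (neutral), 2 (positive) from value_counts() above
class_counts = np.array([1070, 6861, 164534])

# "balanced" heuristic: n_samples / (n_classes * count_per_class)
weights = class_counts.sum() / (len(class_counts) * class_counts)

for label, weight in zip(["negative", "neutral", "positive"], weights):
    print(f"{label:>8}: {weight:.2f}")
```

> A misclassified negative review therefore costs the model far more than a misclassified positive one, which counteracts the imbalance without creating synthetic rows.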

In [26]:
# predict and see the classification report

y_pred_sgd = model_sgd.predict(X_test)
print(confusion_matrix(y_test,y_pred_sgd))
print(classification_report(y_test,y_pred_sgd))

[[   79    54    73]
 [  151   736   470]
 [  584  1040 31306]]
              precision    recall  f1-score   support

           0       0.10      0.38      0.15       206
           1       0.40      0.54      0.46      1357
           2       0.98      0.95      0.97     32930

    accuracy                           0.93     34493
   macro avg       0.49      0.63      0.53     34493
weighted avg       0.95      0.93      0.94     34493



> With **class weighting**, this model **performs far better than the naive Bayes model** in terms of **accuracy and F1 score**. I'll still apply SMOTE to see whether it can be improved further.

In [27]:
# refit the model on the oversampled data

model_sgd.fit(X_sm, y_sm)

SGDClassifier(class_weight='balanced', random_state=42)

In [28]:
# predict and see the classification report with oversampled data

y_pred_sgd_sm = model_sgd.predict(X_test)
print(confusion_matrix(y_test,y_pred_sgd_sm))
print(classification_report(y_test,y_pred_sgd_sm))

[[  152    17    37]
 [  714   339   304]
 [ 2770   899 29261]]
              precision    recall  f1-score   support

           0       0.04      0.74      0.08       206
           1       0.27      0.25      0.26      1357
           2       0.99      0.89      0.94     32930

    accuracy                           0.86     34493
   macro avg       0.43      0.63      0.42     34493
weighted avg       0.95      0.86      0.90     34493



> Although recall on the negative class rose to 0.74, SMOTE lowered overall accuracy (0.93 to 0.86) and macro F1 (0.53 to 0.42). I'll therefore keep the original, non-resampled training data for hyperparameter tuning in the next notebook.
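
> For a side-by-side view, the accuracy and macro-averaged F1 values from the four reports above can be collected and ranked. The numbers are copied from those reports; ranking by macro F1 is my own assumption about a sensible criterion for imbalanced classes, not a rule stated elsewhere in this notebook.

```python
# accuracy and macro-averaged F1, copied from the four classification reports above
results = {
    "GaussianNB": {"accuracy": 0.69, "macro_f1": 0.36},
    "GaussianNB + SMOTE": {"accuracy": 0.72, "macro_f1": 0.37},
    "SGDClassifier (balanced)": {"accuracy": 0.93, "macro_f1": 0.53},
    "SGDClassifier (balanced) + SMOTE": {"accuracy": 0.86, "macro_f1": 0.42},
}

# macro F1 treats the three classes equally, so it is less dominated by the positive class
best = max(results, key=lambda name: results[name]["macro_f1"])

for name, scores in sorted(results.items(), key=lambda kv: kv[1]["macro_f1"], reverse=True):
    print(f"{name:<34} accuracy={scores['accuracy']:.2f}  macro_f1={scores['macro_f1']:.2f}")
print("best by macro F1:", best)
```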

In [29]:
# save the cleaned dataframe to a new CSV file

df.to_csv('asheville-reviews-tuning.csv', index=False)

## REFERENCES 

>- https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568
>- https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
>- https://machinelearningmastery.com/naive-bayes-for-machine-learning/#:~:text=This%20extension%20of%20naive%20Bayes,deviation%20from%20your%20training%20data