# ASHEVILLE AIRBNB SENTIMENT ANALYSIS

> The purpose of this report is **to analyze customer reviews of Airbnb listings in Asheville, North Carolina, United States**. It serves as a stepping stone **toward understanding what customers think of the service offered by Asheville's Airbnb hosts, and whether those hosts are providing good customer service**. The analysis is split across several notebooks, covering *data preprocessing, text preprocessing, topic modelling, visualization, model building, and model testing*.

> This notebook covers only the **HYPERPARAMETER TUNING** part.

> The dataset contains the **detailed review data for listings in Asheville, North Carolina**, compiled on **08 November 2020**. The data come from the **Inside Airbnb site**, which sources them from publicly available information on the Airbnb site. The data have been analyzed, cleansed and aggregated where appropriate to facilitate public discussion. For more on this and similar data, see this [link](http://insideairbnb.com/get-the-data.html).

## IMPORT LIBRARIES

In [1]:
# data wrangling

import re
import string
import pandas as pd
import numpy as np
import spacy

# data visualization

import matplotlib.pyplot as plt
import seaborn as sns

# text processing

import nltk
import en_core_web_sm
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# modelling

import joblib
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# filter warning

import warnings
warnings.filterwarnings('ignore')

## OVERVIEW

In [2]:
# load data

df = pd.read_csv('asheville-reviews-tuning.csv')

In [3]:
# function to check data summary

def summary(df):
    
    columns = df.columns.to_list()
    
    dtypes = []
    unique_counts = []
    missing_counts = []
    missing_percentages = []
    total_counts = [df.shape[0]] * len(columns)

    for col in columns:
        dtype = str(df[col].dtype)
        dtypes.append(dtype)
        unique_count = df[col].nunique()
        unique_counts.append(unique_count)
        missing_count = df[col].isnull().sum()
        missing_counts.append(missing_count)
        missing_percentage = round((missing_count/df.shape[0]) * 100, 2)
        missing_percentages.append(missing_percentage)

    df_summary = pd.DataFrame({
        "column": columns,
        "dtypes": dtypes,
        "unique_count": unique_counts,
        "missing_values": missing_counts,
        "missing_percentage": missing_percentages,
        "total_count": total_counts,
    })

    return df_summary.sort_values(by="missing_percentage", ascending=False).reset_index(drop=True)

In [4]:
# check summary

summary(df)

| | column | dtypes | unique_count | missing_values | missing_percentage | total_count |
|---|---|---|---|---|---|---|
| 0 | listing_id | int64 | 2042 | 0 | 0.0 | 172465 |
| 1 | id | int64 | 172465 | 0 | 0.0 | 172465 |
| 2 | date | object | 2904 | 0 | 0.0 | 172465 |
| 3 | reviewer_id | int64 | 157219 | 0 | 0.0 | 172465 |
| 4 | reviewer_name | object | 16166 | 0 | 0.0 | 172465 |
| 5 | comments | object | 169865 | 0 | 0.0 | 172465 |
| 6 | comments_cleaned | object | 167552 | 0 | 0.0 | 172465 |
| 7 | comments_tokenized | object | 161618 | 0 | 0.0 | 172465 |
| 8 | compound_score | float64 | 1841 | 0 | 0.0 | 172465 |
| 9 | sentiment | int64 | 3 | 0 | 0.0 | 172465 |


> Now on to the parameter tuning process.

## HYPERPARAMETER

In [5]:
# SGD Classifier param

# note: 'log' was renamed to 'log_loss' in scikit-learn 1.1; use 'log_loss' on newer versions
loss = ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron']
penalty = ['l1', 'l2', 'elasticnet'] 
alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000] 
learning_rate = ['constant', 'optimal', 'invscaling', 'adaptive'] 
eta0 = [1, 10, 100] 

sgd_param = {'loss' : loss, 'penalty': penalty, 'alpha' : alpha, 'learning_rate' : learning_rate, 'eta0' : eta0}
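
> To put the grid above in perspective, the sketch below counts how many distinct combinations it defines and what share of them a single randomized search visits. It assumes `RandomizedSearchCV` is left at its default of `n_iter=10`, as in this notebook; the names `n_combinations` and `n_sampled` are only illustrative.

```python
from math import prod

# Same candidate values as the grid defined above
sgd_param = {
    'loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
    'eta0': [1, 10, 100],
}

# Total number of distinct parameter combinations: 5 * 3 * 8 * 4 * 3
n_combinations = prod(len(values) for values in sgd_param.values())

# RandomizedSearchCV samples n_iter combinations per call (default: 10)
n_sampled = 10

print(n_combinations)                        # 1440
print(f"{n_sampled / n_combinations:.2%}")   # 0.69%
```

> Since each search only looks at a small slice of the grid, separate runs can easily land on quite different "best" parameters.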

In [6]:
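# function to automate tuning process
# note: reads the global sgd_param and uses RandomizedSearchCV's default
# n_iter=10, so each call samples only 10 random parameter combinations

def cvparam(est, xtr, ytr):
    scv = RandomizedSearchCV(estimator=est, param_distributions=sgd_param, cv=10, scoring='f1_macro')
    result = scv.fit(xtr, ytr)
    return result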

In [7]:
# label the target

df['sentiment'] = LabelEncoder().fit_transform(df['sentiment'])

In [8]:
# build the TF-IDF feature matrix (the independent variables)

vectorizer = TfidfVectorizer(max_features = 100)
tf_idf = vectorizer.fit_transform(df['comments_cleaned']).toarray()
print(tf_idf)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.39453716 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.09858178]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.26343102]
 [0.         0.         0.         ... 0.         0.         0.        ]]
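
> The cell above fits the vectorizer on every review before the data is split, and converts the result to a dense array. A possible alternative, sketched below on a handful of made-up reviews, is to split the raw text first, fit the vectorizer on the training part only, and keep the sparse matrix (which `SGDClassifier` accepts directly). The toy `texts` and the variable names are assumptions for illustration, not this notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df['comments_cleaned']
texts = [
    'great place friendly host',
    'clean room close to downtown',
    'host never answered messages',
    'lovely view and quiet street',
    'dirty bathroom would not return',
    'perfect stay highly recommend',
    'okay location but noisy at night',
    'wonderful cabin cozy and warm',
]

# Split the raw text first so the test reviews stay unseen
txt_train, txt_test = train_test_split(texts, test_size=0.25, random_state=101)

vec = TfidfVectorizer(max_features=100)
X_tr = vec.fit_transform(txt_train)  # vocabulary and IDF weights learned from training text only
X_te = vec.transform(txt_test)       # test text mapped onto the same columns

print(X_tr.shape, X_te.shape)
```

> Both matrices share the same columns, and the IDF weights no longer carry information from the test reviews.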


In [9]:
# split data

X_train, X_test, y_train, y_test = train_test_split(tf_idf, df['sentiment'], test_size=0.2, random_state=101)
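
> The sentiment classes here are heavily imbalanced, so it may be worth passing `stratify` to `train_test_split` to keep the class proportions the same in both parts. The sketch below uses made-up labels (10 / 40 / 950 samples) purely to show the effect; it is not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 / 40 / 950 samples for classes 0 / 1 / 2
y = np.array([0] * 10 + [1] * 40 + [2] * 950)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y keeps each class's share equal in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=101, stratify=y
)

print(np.bincount(y_tr) / len(y_tr))  # class shares in the training part
print(np.bincount(y_te) / len(y_te))  # class shares in the test part
```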

In [10]:
# initialize model

model_sgd = SGDClassifier(random_state = 42, class_weight='balanced')
model_sgd.fit(X_train, y_train)

SGDClassifier(class_weight='balanced', random_state=101)

In [11]:
# search parameter (three independent randomized searches)

for i in range(1,4):
    cv_sgd = cvparam(model_sgd, X_train, y_train)
    print('Score : ', i, cv_sgd.best_score_)
    print('Parameter : ', i, cv_sgd.best_params_)

Score :  1 0.5268813769260572
Parameter :  1 {'penalty': 'l1', 'loss': 'modified_huber', 'learning_rate': 'adaptive', 'eta0': 100, 'alpha': 0.0001}
Score :  2 0.3254592396512974
Parameter :  2 {'penalty': 'elasticnet', 'loss': 'modified_huber', 'learning_rate': 'optimal', 'eta0': 100, 'alpha': 100}
Score :  3 0.405856978818368
Parameter :  3 {'penalty': 'l2', 'loss': 'perceptron', 'learning_rate': 'optimal', 'eta0': 10, 'alpha': 0.1}
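
> The three searches above return different scores and parameters because each one draws its own random sample of combinations. If repeatable results are wanted, one option is to pass `random_state` (and, if desired, a larger `n_iter`) to `RandomizedSearchCV`. The sketch below shows this on a small synthetic dataset; the helper `run_search`, the reduced `demo_param` grid and the seed values are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Small synthetic dataset standing in for the TF-IDF features
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Reduced, hypothetical grid to keep the example fast
demo_param = {
    'loss': ['hinge', 'modified_huber'],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'alpha': [0.0001, 0.001, 0.01, 0.1],
    'eta0': [0.01, 0.1],
}

def run_search(seed):
    """Run one randomized search whose sampled combinations depend only on `seed`."""
    search = RandomizedSearchCV(
        estimator=SGDClassifier(random_state=42),
        param_distributions=demo_param,
        n_iter=5,                       # number of sampled combinations
        cv=StratifiedKFold(n_splits=3),
        scoring='f1_macro',
        random_state=seed,              # fixes which combinations are sampled
    )
    return search.fit(X, y)

first = run_search(seed=0)
second = run_search(seed=0)

# Same seed, same sampled combinations, same winner
print(first.best_params_ == second.best_params_)
```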


## MODEL TESTING

In [22]:
# initialize model

first_model  = SGDClassifier(penalty='l1', loss='modified_huber', learning_rate='adaptive', eta0=100, alpha=0.0001, random_state=42, class_weight='balanced')
second_model = SGDClassifier(penalty='elasticnet', loss='modified_huber', learning_rate='optimal', eta0=100, alpha=100, random_state=42, class_weight='balanced')
third_model  = SGDClassifier(penalty='l2', loss='perceptron', learning_rate='optimal', eta0=10, alpha=0.1, random_state=42, class_weight='balanced')

In [23]:
# function to automate model scoring

def score(model, X_train, X_test, y_train, y_test):
    model = model.fit(X_train, y_train)   # fit() also updates the passed-in model in place
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

In [29]:
models = [first_model, second_model, third_model]
for i in range(3):
    print('Model', i)
    score(models[i], X_train, X_test, y_train, y_test)

Model 0
              precision    recall  f1-score   support

           0       0.10      0.34      0.16       206
           1       0.42      0.55      0.47      1357
           2       0.98      0.96      0.97     32930

    accuracy                           0.94     34493
   macro avg       0.50      0.61      0.53     34493
weighted avg       0.95      0.94      0.94     34493

Model 1
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       206
           1       0.00      0.00      0.00      1357
           2       0.95      1.00      0.98     32930

    accuracy                           0.95     34493
   macro avg       0.32      0.33      0.33     34493
weighted avg       0.91      0.95      0.93     34493

Model 2
              precision    recall  f1-score   support

           0       0.06      0.40      0.10       206
           1       0.34      0.42      0.38      1357
           2       0.98      0.94      0.96     329

> Looking at the results above, I'll go with the first parameter set: it has the best F1-score on both minority classes (0 and 1), while the second model simply predicts class 2 for everything. I'll therefore dump this model and proceed to the model testing part.
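
> The reports above also show why accuracy alone is a poor guide on this data: a model that only ever predicts the majority class still scores very high accuracy. The sketch below reproduces that effect on made-up labels (not the notebook's data) and compares accuracy with macro-averaged F1, the metric used during tuning.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced ground truth: 2 / 8 / 90 samples for classes 0 / 1 / 2
y_true = [0] * 2 + [1] * 8 + [2] * 90

# A degenerate model that always predicts the majority class
y_pred = [2] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as 0 without warnings
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

print(f"accuracy: {accuracy:.2f}")   # accuracy: 0.90
print(f"macro F1: {macro_f1:.2f}")   # macro F1: 0.32
```

> Macro F1 weights every class equally, so ignoring the two small classes drags the score down even though accuracy stays high.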

In [32]:
# dump model with joblib

joblib.dump(first_model, 'model_sgd_tuned')

['model_sgd_tuned']
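
> Note that the file saved above holds only the classifier, so the fitted `TfidfVectorizer` is also needed to score new text later. One possible approach, sketched below with made-up reviews and labels, is to wrap both steps in a pipeline and save that as a single object. The file name `sgd_pipeline.joblib` and the toy data are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned reviews and sentiment labels
texts = [
    'great place friendly host',
    'dirty bathroom would not return',
    'perfect stay highly recommend',
    'host never answered messages',
    'lovely view and quiet street',
    'rude host and broken heater',
]
labels = [1, 0, 1, 0, 1, 0]

# Vectorizer and classifier are fitted and stored together
pipe = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=42))
pipe.fit(texts, labels)

# Save to a temporary folder, then load it back
path = os.path.join(tempfile.mkdtemp(), 'sgd_pipeline.joblib')
joblib.dump(pipe, path)
restored = joblib.load(path)

# The restored pipeline accepts raw text directly
print(restored.predict(['clean room and friendly host']))
```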

## REFERENCES 

>- https://www.knowledgehut.com/tutorials/machine-learning/hyperparameter-tuning-machine-learning
>- https://towardsdatascience.com/how-to-make-sgd-classifier-perform-as-well-as-logistic-regression-using-parfit-cc10bca2d3c4