# ASHEVILLE AIRBNB SENTIMENT ANALYSIS

> The purpose of this report is **to analyze customer reviews for Airbnb listings in Asheville, North Carolina, United States**. It acts as a stepping stone **to understanding what customers think of the service offered by Asheville's Airbnb hosts, which could help show whether the hosts are providing good customer service**. The analysis is split across several notebooks and covers *data preprocessing, text preprocessing, topic modelling, visualization, model building, and model testing*.

> This notebook covers only the **DATA PREPROCESSING** part.

> The dataset contains the **detailed review data for listings in Asheville, North Carolina**, compiled on **08 November, 2020**. The data come from the **Inside Airbnb site**, which sources them from publicly available information on the Airbnb site. The data have been analyzed, cleansed, and aggregated where appropriate to facilitate public discussion. For more on this and similar data, see this [link](http://insideairbnb.com/get-the-data.html).

## IMPORT LIBRARIES

In [1]:
# data wrangling

import re
import string
import pandas as pd
import numpy as np

# data visualization

import matplotlib.pyplot as plt
import seaborn as sns

# text processing

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# filter warning

import warnings
warnings.filterwarnings('ignore')

## OVERVIEW

In [2]:
# load data

df = pd.read_csv('asheville-reviews.csv')

In [3]:
# show top 5

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,108061,553741,2011-09-21,822907,Pedro & Katie,"Lisa is superb hostess, she will treat you lik..."
1,108061,683278,2011-11-01,236064,Tim,This was a lovely little place walking distanc...
2,108061,714889,2011-11-13,1382707,Shane,"Lisa was very nice to work with. However, we ..."
3,108061,1766157,2012-07-21,416731,Brenda,I feel very lucky to have found this beautiful...
4,108061,2033065,2012-08-19,1858880,Lindsey,"Great roomy little apartment, beautiful privat..."


In [4]:
# check info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173892 entries, 0 to 173891
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     173892 non-null  int64 
 1   id             173892 non-null  int64 
 2   date           173892 non-null  object
 3   reviewer_id    173892 non-null  int64 
 4   reviewer_name  173892 non-null  object
 5   comments       173838 non-null  object
dtypes: int64(3), object(3)
memory usage: 8.0+ MB


In [5]:
# function to check data summary

def summary(df):
    
    columns = df.columns.to_list()
    
    dtypes = []
    unique_counts = []
    missing_counts = []
    missing_percentages = []
    total_counts = [df.shape[0]] * len(columns)

    for col in columns:
        dtype = str(df[col].dtype)
        dtypes.append(dtype)
        unique_count = df[col].nunique()
        unique_counts.append(unique_count)
        missing_count = df[col].isnull().sum()
        missing_counts.append(missing_count)
        missing_percentage = round((missing_count/df.shape[0]) * 100, 2)
        missing_percentages.append(missing_percentage)

    df_summary = pd.DataFrame({
        "column": columns,
        "dtypes": dtypes,
        "unique_count": unique_counts,
        "missing_values": missing_counts,
        "missing_percentage": missing_percentages,
        "total_count": total_counts,
    })

    return df_summary.sort_values(by="missing_percentage", ascending=False).reset_index(drop=True)

In [6]:
# check summary

summary(df)

Unnamed: 0,column,dtypes,unique_count,missing_values,missing_percentage,total_count
0,comments,object,170970,54,0.03,173892
1,listing_id,int64,2044,0,0.0,173892
2,id,int64,173892,0,0.0,173892
3,date,object,2904,0,0.0,173892
4,reviewer_id,int64,158449,0,0.0,173892
5,reviewer_name,object,16279,0,0.0,173892


> Some `dtypes` are not appropriate, and the *comments* feature has missing values. I'll handle both in the preprocessing step before moving on to text cleaning.

## PREPROCESSING

In [7]:
# fixing column dtypes

for i in df.columns:
    if i in ('listing_id', 'id', 'reviewer_id'):
        # np.object was removed in NumPy 1.24; the builtin object is equivalent
        df[i] = df[i].astype(object)
    elif i == 'date':
        df[i] = pd.to_datetime(df[i])

In [8]:
# check info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173892 entries, 0 to 173891
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   listing_id     173892 non-null  object        
 1   id             173892 non-null  object        
 2   date           173892 non-null  datetime64[ns]
 3   reviewer_id    173892 non-null  object        
 4   reviewer_name  173892 non-null  object        
 5   comments       173838 non-null  object        
dtypes: datetime64[ns](1), object(5)
memory usage: 8.0+ MB


In [9]:
# check missing values

df[df['comments'].isna()]

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
11888,1827412,426615038,2019-03-21,247099480,David,
15856,2411109,406175955,2019-01-28,110949606,Josi,
18887,3095136,127198043,2017-01-16,111272759,Ash,
20030,3225871,209550903,2017-11-05,147365729,Andrew,
20390,3314819,52436690,2015-10-29,46496931,Dwight,
30057,5144212,555444539,2019-10-27,4592,Rebecca,
31303,5696919,441087924,2019-04-21,140470172,Michael,
33816,6234618,311153650,2018-08-20,74602012,Eileen,
39678,7556089,441017868,2019-04-21,72106403,Tori,
42045,8051829,339537756,2018-10-21,166411083,Michael,


In [10]:
# fill missing values

df['comments'] = df['comments'].fillna('No Description')

In [11]:
# check missing values

df.isna().sum()

listing_id       0
id               0
date             0
reviewer_id      0
reviewer_name    0
comments         0
dtype: int64

> Now that everything is properly cleaned, I'll continue to text processing.

## TEXT PROCESSING

> To start with, I'll clean the text in the *comments* feature by *case folding*, *tokenizing*, and *removing stopwords*.
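
> As a minimal sketch of these steps on a single made-up review, the block below uses only the standard library. The sample sentence, the tiny `SAMPLE_STOPWORDS` set, and the name `clean_text_sketch` are assumptions for illustration; the notebook itself relies on NLTK's English stopword list and `word_tokenize`, for which plain whitespace splitting stands in here.

```python
import re

# Same substitution pattern as the notebook: mentions, non-alphanumeric
# characters, URLs, and digits are each replaced by a space.
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)|\d+"

# Hypothetical, deliberately tiny stopword set (NLTK's list is much longer).
SAMPLE_STOPWORDS = {"the", "was", "and", "a", "to", "is", "it", "we", "at"}


def clean_text_sketch(text, stopwords):
    # case folding
    lowered = text.lower()
    # strip mentions, punctuation, URLs, and digits
    stripped = re.sub(PATTERN, " ", lowered)
    # tokenizing (whitespace split as a stand-in for word_tokenize)
    tokens = stripped.split()
    # removing stopwords
    return " ".join(t for t in tokens if t not in stopwords)


sample = "The cabin was GREAT, and 5 minutes to downtown! See https://example.com @host"
cleaned = clean_text_sketch(sample, SAMPLE_STOPWORDS)
print(cleaned)  # cabin great minutes downtown see
```

> Because apostrophes are replaced by spaces, a contraction such as "don't" is split into "don" and "t" before stopword filtering.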

In [12]:
# function to clean text

def clean_text(data, stopword):
    
    # case folding
    data = [i.lower() for i in data]

    # replace mentions, non-alphanumeric characters, URLs, and digits with spaces
    data = [' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|\d+", " ", i).split()) for i in data]
    res = ' '.join(data) 

    # tokenizing and removing stopwords
    word_tokens = word_tokenize(res)    
    res = ' '.join([i for i in word_tokens if i not in stopword])
    
    return res

In [13]:
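# set stopword (the NLTK 'stopwords' corpus and the tokenizer data,
# 'punkt' or 'punkt_tab' on newer NLTK releases, must be downloaded once with nltk.download)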

stop_words = set(stopwords.words('english'))

# text cleaning

comment_filtered = []
for i in df['comments']:
    comment_filtered.append(clean_text([i], stop_words))

In [14]:
# check filtered comment

comment_filtered[0]

'lisa superb hostess treat like family provide coziest little home asheville definitely enhance experience magical town like eco retreat private sunny apartment neat little flat need people place impeccable lovely neighborhood hardly beat one'

In [15]:
# create new feature to store cleaned text

df['comments_cleaned'] = comment_filtered

In [16]:
# show dataframe

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_cleaned
0,108061,553741,2011-09-21,822907,Pedro & Katie,"Lisa is superb hostess, she will treat you lik...",lisa superb hostess treat like family provide ...
1,108061,683278,2011-11-01,236064,Tim,This was a lovely little place walking distanc...,lovely little place walking distance downtown ...
2,108061,714889,2011-11-13,1382707,Shane,"Lisa was very nice to work with. However, we ...",lisa nice work however realize house old norma...
3,108061,1766157,2012-07-21,416731,Brenda,I feel very lucky to have found this beautiful...,feel lucky found beautiful home asheville quie...
4,108061,2033065,2012-08-19,1858880,Lindsey,"Great roomy little apartment, beautiful privat...",great roomy little apartment beautiful private...


> Next, I'll save the cleaned data to a new CSV file to be used in the next part.

In [17]:
# save to new CSV file

df.to_csv('asheville-reviews-clean.csv', index=False)
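
> CSV files do not store dtypes, so the conversions made earlier are not preserved in the saved file. The sketch below shows one way the next notebook might restore them when reloading. It uses a small made-up frame and an in-memory buffer instead of the real file, and the `read_csv` options chosen here are an assumption about what the later steps need, not part of the original workflow.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned reviews (values are illustrative only).
toy = pd.DataFrame({
    "listing_id": [108061, 108061],
    "date": pd.to_datetime(["2011-09-21", "2011-11-01"]),
    "comments_cleaned": ["lovely little place", ""],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(
    buffer,
    dtype={"listing_id": str},   # keep IDs as text rather than int64
    parse_dates=["date"],        # restore the datetime column
    keep_default_na=False,       # keep empty cleaned comments as '' instead of NaN
)

print(reloaded.dtypes)
```

> Without `keep_default_na=False`, a review that became an empty string after cleaning would be read back as a missing value.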
