# Data Cleaning and Pre-processing

### Airbnb Reviews from the city of Florence


**Open-source data** from: <br>
http://insideairbnb.com/get-the-data.html (data compiled: 12 July, 2021)

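The raw file contains **536,693 reviews**. After dropping rows with missing values and automatic cancellation comments, **534,217 reviews across 8,562 listings** remain, with an **average of 62.39 comments per listing** and a **standard deviation of 87.23**.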

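After the cleaning and language-identification steps below, we drew a **random sample of 51,047 reviews** (10% of the cleaned, language-filtered dataset), stratified by language so that the language shares are preserved. Non-English comments in the sample were then translated to English, and the sample was used in the sentiment, topic modeling, and topic classification analyses.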

- Drop missing data
- Drop automatic comments posted when the host canceled a reservation
- Remove punctuation, emoji, numbers, and duplicates
- Identify the language of each comment with langdetect (https://pypi.org/project/langdetect/)
- Keep only comments written in en, it, fr, es, de, pt, ru; other languages are negligible
- Translate comments in it, fr, es, de, pt, ru to English with deep-translator (https://pypi.org/project/deep-translator/)



## Import Libraries

In [62]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import string

from langdetect import detect
from deep_translator import GoogleTranslator

from tqdm import tqdm
tqdm.pandas()

from textblob import TextBlob

import seaborn as sns
sns.set_theme(style="whitegrid")

#import shapely
#from shapely.geometry import Point

import pickle 

import importlib

import nltk
from nltk.stem import WordNetLemmatizer
#nltk.download('wordnet')

from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

from nltk.tokenize import word_tokenize

import itertools

In [9]:
# Load the reviews dataset

df_reviews = pd.read_csv('data/reviews.csv')

In [10]:
df_reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,24469,536790507,2019-09-27,238649967,Marina,Excelente trato y atención por parte de los du...
1,24470,102982,2010-09-22,77244,Luciana,We loved to stay at Benedetta and Lorenzo apt....
2,24470,124303,2010-10-22,233800,Steve,A perfect place to spend a week. Amenities wer...
3,24470,440911489,2019-04-21,150908154,Paola,Buona posizione tranquilla e ottima accoglienz...
4,24472,340528,2011-06-28,239905,Michael,Lovely place. Its closeness to the train stati...


In [60]:
print(f'The dataset contains {df_reviews.shape[0]} reviews per {df_reviews.listing_id.nunique()} listings.')

The dataset contains 534217 reviews per 8562 listings.


In [12]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536693 entries, 0 to 536692
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     536693 non-null  int64 
 1   id             536693 non-null  int64 
 2   date           536693 non-null  object
 3   reviewer_id    536693 non-null  int64 
 4   reviewer_name  536692 non-null  object
 5   comments       536561 non-null  object
dtypes: int64(3), object(3)
memory usage: 24.6+ MB


In [61]:
# Mean, standard deviation, min, max of comments per listing

df_reviews.groupby('listing_id').id.count().agg([np.mean,np.std,np.min,np.max])

mean     62.393950
std      87.227643
amin      1.000000
amax    812.000000
Name: id, dtype: float64
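
Recent pandas releases may warn when NumPy functions are passed to `agg`. A minimal sketch of the same per-listing summary using string aggregation names, run on a small made-up table (the toy values are an assumption, not the Florence data):

```python
import pandas as pd

# Toy reviews table (hypothetical values): one row per review
toy = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 3],
    "id": [10, 11, 12, 13, 14, 15],
})

# Number of reviews per listing, then summary statistics of those counts
per_listing = toy.groupby("listing_id")["id"].count()
summary = per_listing.agg(["mean", "std", "min", "max"])
print(summary)
```

With string names the result rows are labelled `mean`, `std`, `min`, and `max`.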

In [13]:
# Check for missing values

df_reviews.isnull().sum()

listing_id         0
id                 0
date               0
reviewer_id        0
reviewer_name      1
comments         132
dtype: int64

In [14]:
# Drop rows with missing values

df_reviews.dropna(inplace=True)
df_reviews.shape

(536560, 6)

In [16]:
# Check length of comments and add a new column

df_reviews['length_comments'] = df_reviews['comments'].apply(lambda x: len(x.split()))

In [17]:
df_reviews['length_comments'].max()

1002

In [18]:
df_reviews['length_comments'].min()

1

In [19]:
# Lowercase comments

df_reviews['comments'] = df_reviews['comments'].str.lower()
df_reviews['comments'].head()

0    excelente trato y atención por parte de los du...
1    we loved to stay at benedetta and lorenzo apt....
2    a perfect place to spend a week. amenities wer...
3    buona posizione tranquilla e ottima accoglienz...
4    lovely place. its closeness to the train stati...
Name: comments, dtype: object

In [20]:
# Drop automatic comments where host cancelled reservation

index_canceled = df_reviews[df_reviews["comments"].str.contains("canceled", case=False, na=False)].index
df_reviews.drop(index_canceled, inplace=True)
df_reviews.shape

(534217, 7)
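
Matching the bare substring `canceled` also removes genuine reviews that merely mention a cancellation. A minimal sketch of a narrower filter, assuming the automated notices share a fixed phrase; the phrase in `AUTO_CANCEL` and the sample comments are assumptions for illustration and should be checked against the actual data:

```python
import re

# Assumed wording of the automated notice (verify against the dataset)
AUTO_CANCEL = re.compile(r"the host canceled this reservation", re.IGNORECASE)

comments = [
    "the host canceled this reservation 5 days before arrival. this is an automated posting.",
    "our flight was canceled but the host kindly let us rebook",
    "lovely place near the station",
]

# Broad filter used above: any comment containing "canceled" is dropped
broad = [c for c in comments if "canceled" not in c]
# Narrower filter: only the automated notice is dropped
narrow = [c for c in comments if not AUTO_CANCEL.search(c)]

print(len(broad), len(narrow))
```

In pandas, the same pattern could be passed to `df_reviews["comments"].str.contains(...)` in place of the plain substring.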

In [21]:
# Remove punctuation

df_reviews['comments'] = df_reviews['comments'].str.replace(r'[^\w\s]','',regex=True)
df_reviews['comments'].head()

0    excelente trato y atención por parte de los du...
1    we loved to stay at benedetta and lorenzo apt ...
2    a perfect place to spend a week amenities were...
3    buona posizione tranquilla e ottima accoglienz...
4    lovely place its closeness to the train statio...
Name: comments, dtype: object
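
Deleting punctuation outright glues together words that were separated only by an apostrophe, which is why forms such as `lospitalità` appear in the cleaned Italian comments. A sketch of an alternative that substitutes a space and then collapses whitespace; whether this is preferable depends on the downstream tokenizer, so treat it as an option rather than the notebook's method (the sample sentence is made up):

```python
import re

text = "apprezzatissima l'ospitalità dell'host, check-in anticipato!"

# Behaviour of the step above: punctuation is deleted
deleted = re.sub(r"[^\w\s]", "", text)

# Alternative: replace punctuation with a space, then collapse runs of whitespace
spaced = re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text)).strip()

print(deleted)
print(spaced)
```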

In [23]:
# Remove emoji

# Refer to: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
# Compile the pattern once instead of on every call
EMOJI_PATTERN = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags
                           "\U00002702-\U000027B0"
                           "\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)

def remove_emoji(text):
    return EMOJI_PATTERN.sub('', text)

df_reviews['comments'] = df_reviews['comments'].apply(remove_emoji)
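
The last range in the pattern, `\U000024C2-\U0001F251`, is very wide. A quick check on made-up strings suggests it removes CJK characters as well as emoji while leaving Latin and Cyrillic text intact, which is worth keeping in mind because language detection runs after this step:

```python
import re

# Same character ranges as the pattern above
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+",
    flags=re.UNICODE,
)

# Illustrative inputs (not taken from the dataset)
samples = {
    "emoji": "great stay \U0001F600",
    "russian": "отличная квартира",
    "chinese": "很好的公寓",
}

cleaned = {name: emoji_pattern.sub("", text) for name, text in samples.items()}
print(cleaned)
```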

In [None]:
# Remove numbers

df_reviews['comments'] = df_reviews['comments'].str.replace(r'[0-9]','',regex=True)
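
Removing digits leaves double spaces behind (for example `a  minutos` in the Spanish comment shown later in this notebook). A minimal sketch that also collapses the leftover whitespace; the sample sentence is illustrative:

```python
import re

text = "ubicado en el centro a 5 minutos del duomo y 10 minutos del ponte vecchio"

# Same digit removal as the step above
no_digits = re.sub(r"[0-9]", "", text)
# Collapse runs of whitespace left behind by the removed numbers
tidy = re.sub(r"\s+", " ", no_digits).strip()

print(repr(no_digits))
print(repr(tidy))
```

On the DataFrame column, the equivalent would be `.str.replace(r'\s+', ' ', regex=True).str.strip()`.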

In [None]:
# Replace empty or whitespace-only comments with the placeholder 'none', then drop those rows

df_reviews['comments'] = df_reviews.comments.replace(r'^\s*$','none', regex=True)
df_reviews.drop(df_reviews[df_reviews['comments']=='none'].index, inplace = True)

In [None]:
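# Remove duplicate comments, if any
# (drop_duplicates returns a new object, so apply it to the DataFrame in place)

df_reviews.drop_duplicates(subset='comments', inplace=True)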
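
`drop_duplicates` returns a new object rather than modifying its input, so its result has to be assigned back (or applied with `inplace=True`) for the duplicates to disappear from the DataFrame. A toy illustration with made-up comments:

```python
import pandas as pd

toy = pd.DataFrame({"comments": ["great place", "great place", "very clean"]})

# Calling the method without keeping the result leaves the frame unchanged
toy["comments"].drop_duplicates()
rows_before = len(toy)

# Keeping the result removes rows whose comment text repeats
deduped = toy.drop_duplicates(subset="comments")
rows_after = len(deduped)

print(rows_before, rows_after)
```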

In [None]:
# Identify language of comments

def safe_detect(x):
    ''' Detect the language of a comment.
    If the language cannot be identified,
    return an empty string.
    '''
    try:
        lan = detect(x)
    except Exception:
        lan = ""
    return lan
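
The langdetect documentation notes that detection is non-deterministic unless `DetectorFactory.seed` is set before the first call, so labels for short comments can vary between runs. Separately, blank input can be skipped before calling the detector at all. A minimal sketch of such a guard; `fake_detect` is a hypothetical stand-in for `langdetect.detect` so the example runs without the library:

```python
def detect_or_empty(text, detector):
    """Return detector(text), or "" for blank input or when detection fails."""
    if not text or not text.strip():
        return ""
    try:
        return detector(text)
    except Exception:
        return ""

# Hypothetical stand-in for langdetect.detect, used only to exercise the wrapper
def fake_detect(text):
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "en"

results = [detect_or_empty(t, fake_detect) for t in ["lovely place", "   ", "!!!"]]
print(results)
```

In the notebook, `detector` would be `langdetect.detect`, with `DetectorFactory.seed = 0` set beforehand for repeatable labels.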

In [None]:
# Create a new column 'language'
# NB: safe_detect takes a long time to run

#df_reviews['language'] = df_reviews['comments'].apply(safe_detect)

In [None]:
f, ax = plt.subplots(figsize=(6, 8))
sns.set_color_codes("pastel")
sns.countplot(y="language", data=df_reviews, order = df_reviews['language'].value_counts(normalize=True).index,
              label="Total", color="b")
plt.show()

In [None]:
# Keep records with comments in en, it, fr, es, de, pt, ru

df_reviews_reduced = df_reviews.loc[df_reviews['language'].isin(['en','it','fr','es','de','pt','ru'])]

In [None]:
# Save the reduced dataset of comments written in en, it, fr, es, de, pt, ru. Other languages are negligible.

#df_reviews_reduced.to_pickle('./data/df_reviews_reduced.pkl')

In [24]:
df_reviews_reduced = pd.read_pickle('./data/df_reviews_reduced.pkl')

In [25]:
df_reviews_reduced.language.value_counts(normalize=True)

en    0.733527
it    0.105975
fr    0.064106
es    0.057437
de    0.016075
pt    0.012238
ru    0.010641
Name: language, dtype: float64

In [None]:
# Pick a random sample of reviews (10%) preserving the share of languages

sample_df = df_reviews_reduced.groupby('language').apply(lambda x: x.sample(frac=0.1)).reset_index(level='language', drop=True)
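
The draw above is not repeatable because no seed is fixed. A minimal sketch of the same stratified 10% sample with `DataFrameGroupBy.sample` (available from pandas 1.1) and a fixed `random_state`; the toy language counts and the seed value 42 are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in for df_reviews_reduced (hypothetical language counts)
toy = pd.DataFrame({
    "language": ["en"] * 700 + ["it"] * 200 + ["fr"] * 100,
    "comments": ["text"] * 1000,
})

# Draw 10% within each language; random_state makes the draw repeatable
sample = toy.groupby("language").sample(frac=0.1, random_state=42)

counts = sample["language"].value_counts()
print(counts / len(sample))
```

Because the fraction is applied within each language group, the shares in the sample match those in the full table.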

In [None]:
# Translate non-English comments to English

def safe_translate(x):
    ''' Translate a comment to English
    with Google Translator.
    If the translation fails, print the comment
    and return an empty string.
    '''
    try:
        sentence = GoogleTranslator(source='auto', target='en').translate(x)
    except Exception:
        print("Failed: ", x)
        sentence = ""
    return sentence
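
Translation requests go over the network, so some failures may be transient. A sketch of retrying before giving up and returning an empty string; `flaky_translate` is a hypothetical stand-in for the real translator call, and the attempt count and wait time are arbitrary assumptions:

```python
import time

def translate_with_retry(text, translate, attempts=3, wait_seconds=0.0):
    """Call translate(text), retrying on failure; return "" if every attempt fails."""
    for attempt in range(attempts):
        try:
            return translate(text)
        except Exception:
            # Pause before the next attempt, but not after the last one
            if attempt < attempts - 1:
                time.sleep(wait_seconds)
    return ""

# Hypothetical translator that fails twice before succeeding (simulates network errors)
state = {"calls": 0}

def flaky_translate(text):
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "great location"

result = translate_with_retry("ottima posizione", flaky_translate)
print(result, state["calls"])
```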

In [None]:
# Add a new column with the translated comments (rows in English are left as NaN)

sample_df['en_translation'] = sample_df['comments'].loc[sample_df['language']
                                                        .isin(['it','fr','es','de','pt','ru'])].progress_apply(safe_translate)

In [None]:
# Save updated dataset

#sample_df.to_pickle('./data/df_reviews_sample.pkl')

In [63]:
# Check that the saved dataset loads correctly

df_reviews_sample = pd.read_pickle('./data/df_reviews_sample.pkl')

In [68]:
print(f'The dataset contains {df_reviews_sample.shape[0]} reviews per {df_reviews_sample.listing_id.nunique()} listings.')

The dataset contains 51047 reviews per 6142 listings.


In [64]:
# Validating the accuracy of some translations
df_reviews_sample.comments[df_reviews_sample['language'].isin(['it'])].iloc[0]

'lo studio è molto caratteristico e in posizione centrale ideale per muoversi a piedi verso tutti i punti di interesse principali il quartiere è piuttosto attivo e fornito di tutti i comfort necessari apprezzatissima lospitalità dellhost marco che ci ha dato la possibilità di effettuare un check in anticipato allarrivo e di lasciare le valigie in studio anche dopo il check out prima della partenza qualche piccolo problema di odori provenienti dalla corte estera e di umidità ma nel complesso esperienza fantastica'

In [65]:
df_reviews_sample.en_translation[df_reviews_sample['language'].isin(['it'])].iloc[0]

'the studio is very characteristic and in an ideal central position to move on foot to all the main points of interest the neighborhood is quite active and equipped with all the necessary comforts very much appreciated the hospitality of the host marco who gave us the opportunity to check in early upon arrival and to leave the suitcases in the studio even after the check out before departure some small problem of odors coming from the foreign court and humidity but overall a fantastic experience'

In [66]:
df_reviews_sample.comments[df_reviews_sample['language'].isin(['es'])].iloc[0]

'el departamento se encuentra en un punto estratégico ubicado en el centro a  minutos del duomo y  minutos del ponte vecchio es muy silencioso y cuenta con una habitación y un sofa cama en el estar donde se encuentra la cocina integrada mina respondio amablemente e incluso tuvimos un problema con las llaves y en  minutos se acerco otra personas para solucionarnos el problema recomendable  br'

In [67]:
df_reviews_sample.en_translation[df_reviews_sample['language'].isin(['es'])].iloc[0]

'The apartment is located in a strategic point located in the center, minutes from the duomo and minutes from the Ponte Vecchio, it is very quiet and has a room and a sofa bed in the living room where the integrated kitchen is located. Mina responded kindly and we even had a problem with the keys and in minutes other people approached to solve the recommended problem br'
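
The Spanish translation above comes back with capital letters and punctuation, unlike the Italian one. If the downstream analyses expect uniformly cleaned text, the translated column could be normalized the same way as the original comments. A minimal sketch, assuming lowercase, punctuation-free text is what is wanted; the input sentence is a shortened, illustrative variant of the translation above:

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

translated = "The apartment is located in a strategic point, minutes from the Ponte Vecchio."
normalized = normalize(translated)
print(normalized)
```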