# Exploratory Data Analysis of Google Reviews

This notebook contains an Exploratory Data Analysis (EDA) of the Google Review data. The reviews are first loaded and cleaned (pre-processing), and then searched for accessibility-related words.

## Notes

In [1]:
# To detect a language
# for r in google_reviews['Review Text']:
#     r = str(r)
#     if original in r:
#         _, _, _, detected_language = cld2.detect(r, returnVectors=True)
#         print(detected_language)

## Loading Data

In [2]:
# IMPORT REQUIRED PACKAGES
import pandas as pd
from pandas import util
import sklearn
import os
import glob
import re

import nltk
import pycld2 as cld2

In [44]:
# Load venues_ams data
venues_ams = pd.read_csv("venues_ams.csv") # NOT USING?

# Define path and files    
path = "./GoogleReviews"
#all_files = os.path.join(path, "*.csv")
all_files = glob.glob(path + "/*.csv")

# Create dataframe containing all reviews
google_reviews = pd.concat((pd.read_csv(f) for f in all_files))
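
The loading step above can be sketched in a slightly more defensive form. This is a minimal sketch, not the notebook's actual cell: the helper name `load_reviews` and the `Source File` column are illustrative assumptions, and the demo files are made up. It shows `ignore_index=True`, which gives the combined dataframe one continuous index instead of repeating each file's own 0..n index.

```python
import glob
import os
import tempfile

import pandas as pd


def load_reviews(folder):
    """Concatenate every CSV in `folder` into one dataframe (hypothetical helper)."""
    files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = []
    for f in files:
        frame = pd.read_csv(f)
        # Remember where each row came from; "Source File" is an illustrative name
        frame["Source File"] = os.path.basename(f)
        frames.append(frame)
    # ignore_index=True builds a fresh 0..n-1 index for the combined data
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two throwaway files (made-up data)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Name": ["A"], "Review Text": ["good"]}).to_csv(
        os.path.join(tmp, "a.csv"), index=False)
    pd.DataFrame({"Name": ["B", "B"], "Review Text": ["bad", "ok"]}).to_csv(
        os.path.join(tmp, "b.csv"), index=False)
    demo = load_reviews(tmp)

print(demo)
```

With a continuous index, a later `reset_index()` is only needed if the per-file row positions should be kept as a separate column.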

In [45]:
# Save google_reviews in csv file
google_reviews.to_csv("total_google_reviews.csv")

## Data Cleaning

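This step removes rows with a missing (NaN) review text, keeps only the Google-translated English part of each review (the original-language text is dropped), and changes the data types of the columns.
Punctuation is not removed yet!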

### Functions

In [71]:
def remove_nan(data, column_name):
    '''Returns only the data where values in column_name are not empty (NaN)'''
    
    data = data[data[column_name].notna()]
    
    return data


def clean_translation(data):
    '''Returns only the English (Translated by Google) part of a single review string.
        Everything from the "(Original)" separator onwards is dropped.'''
    
    sep = "(Original)"
    if sep in data:
        translation, separator, original = data.partition(sep)
        data = translation
        
    return data

def clean_dtypes(data):
    ''' Changes the dtype of the data columns.'''    
    
    # Changing the column types 
    type_dict = {'Unnamed: 0': object,
                 'Name': 'str',
                 'Review Rate': 'str',
                 'Review Time': 'str', 
                 'Review Text': 'str',
                 }

    # Change data type for all columns
    data = data.astype(type_dict)
    
    return data

def clean_string(s):
    r'''Lowercases a review and removes newlines (\n) and the "(translated by google)" tag.'''
    
    s = s.lower()
    s = s.replace("\n", '')
    s = s.replace("(translated by google)", '')
    
    return s
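
To see what the two string helpers do to a single review, here is a small self-contained sketch. The functions are copied from the cell above (with a slightly shortened body for `clean_translation`); the sample reviews are made up for illustration.

```python
def clean_translation(text):
    # Keep only the part before the "(Original)" separator
    sep = "(Original)"
    if sep in text:
        translation, _, _ = text.partition(sep)
        return translation
    return text


def clean_string(s):
    s = s.lower()
    s = s.replace("\n", "")
    s = s.replace("(translated by google)", "")
    return s


# Made-up review in the "(Translated by Google) ... (Original) ..." layout
raw = "(Translated by Google) Very tasty\n\n(Original)\nErg lekker"
cleaned = clean_string(clean_translation(raw))
print(repr(cleaned))  # ' very tasty'

# Newlines are removed without a replacement, so adjacent lines are glued together
glued = clean_string("Cozy and helpful\nWheelchair accessible")
print(repr(glued))  # 'cozy and helpfulwheelchair accessible'
```

Two side effects are visible: a leading space is left where the tag was removed, and words on either side of a newline are joined. If that is unwanted, replacing `"\n"` with a space and calling `.strip()` at the end would be one possible adjustment.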

### Apply functions

In [47]:
%%time
#%%timeit
google_reviews = remove_nan(google_reviews, 'Review Text')

CPU times: user 158 ms, sys: 37 ms, total: 195 ms
Wall time: 210 ms


In [48]:
%%time
google_reviews["Review Text"] = google_reviews["Review Text"].apply(clean_translation)

CPU times: user 195 ms, sys: 9.11 ms, total: 204 ms
Wall time: 213 ms


In [50]:
%%time
google_reviews = clean_dtypes(google_reviews)

CPU times: user 58.4 ms, sys: 8.36 ms, total: 66.7 ms
Wall time: 80.6 ms


In [51]:
%%time
google_reviews["Review Text"] = google_reviews["Review Text"].apply(clean_string)

CPU times: user 526 ms, sys: 39.2 ms, total: 566 ms
Wall time: 585 ms


### Fix indexes

In [58]:
# Reset index
google_reviews = google_reviews.reset_index()

In [59]:
google_reviews[1000:1010]

Unnamed: 0.1,index,Unnamed: 0,Name,Review Rate,Review Time,Review Text
1000,440,440,McDonald's,4 stars,3 years ago,it's all ok
1001,441,441,McDonald's,3 stars,3 years ago,the petet was cold.
1002,442,442,McDonald's,5 stars,a year ago,very tasty
1003,444,444,McDonald's,2 stars,3 years ago,busy!
1004,445,445,McDonald's,5 stars,2 years ago,good offers and promotions
1005,446,446,McDonald's,1 star,3 years ago,bad.
1006,447,447,McDonald's,3 stars,2 years ago,slow and inaccurate service
1007,448,448,McDonald's,4 stars,3 years ago,ole is barely normal
1008,449,449,McDonald's,5 stars,3 years ago,too long
1009,450,450,McDonald's,3 stars,a year ago,before


In [61]:
# Drop and rename indexes
google_reviews = google_reviews.drop(columns=["Unnamed: 0"])
google_reviews = google_reviews.rename(columns={"index": "Original Index", "Review Rate": "Rating", "Review Time": "Date", "Review Text": "Review"})

In [66]:
google_reviews[1000:1010]

Unnamed: 0,Original Index,Name,Rating,Date,Review
1000,440,McDonald's,4 stars,3 years ago,it's all ok
1001,441,McDonald's,3 stars,3 years ago,the petet was cold.
1002,442,McDonald's,5 stars,a year ago,very tasty
1003,444,McDonald's,2 stars,3 years ago,busy!
1004,445,McDonald's,5 stars,2 years ago,good offers and promotions
1005,446,McDonald's,1 star,3 years ago,bad.
1006,447,McDonald's,3 stars,2 years ago,slow and inaccurate service
1007,448,McDonald's,4 stars,3 years ago,ole is barely normal
1008,449,McDonald's,5 stars,3 years ago,too long
1009,450,McDonald's,3 stars,a year ago,before


### Save the cleaned data!

In [64]:
# Save cleaned dataframe to csv
google_reviews.to_csv("cleaned_reviews.csv")

In [67]:
# Make a copy of the reviews for testing
copy_reviews = google_reviews.copy()

## EDA

### Word Count

In [11]:
# LOAD DATA
google_reviews = pd.read_csv("cleaned_reviews.csv")
google_reviews = google_reviews.drop(columns=["Unnamed: 0"])

In [12]:
# Create DF that saves all the reviews that contain specific words
word_counts = pd.DataFrame(columns=["word", "nr. reviews", "index reviews"])

In [13]:
# How to find Reviews that contain a specific word
google_reviews[google_reviews["Review"].str.contains("wheelchair")]

Unnamed: 0,Original Index,Name,Rating,Date,Review
413,85,House of Watt,4 stars,3 years ago,for a children's party you have come to the r...
479,151,House of Watt,4 stars,3 years ago,cozy and helpfulwheelchair accessible …
489,162,House of Watt,4 stars,3 years ago,cozy and helpfulwheelchair accessible …
4437,52,Mr. Crab,2 stars,5 months ago,severely overpriced and my husband comes in in...
4872,231,Cobra Café,4 stars,7 months ago,good place in the museum district if you have...
...,...,...,...,...,...
391913,118,The Cottage,4 stars,2 years ago,very amiable staff. not a huge choice on menu ...
393160,462,Hoi Tin,5 stars,2 years ago,"delicious peking duck. upon collection, the e..."
394079,233,Zomerlust,1 star,3 years ago,unfriendly reception. as a wheelchair user it...
394469,10,The Bulldog Port 26,5 stars,4 years ago,great place to relax and smoke in peace withou...


In [14]:
# Testing
test_wheelchair = google_reviews[google_reviews["Review"].str.contains("wheelchair")]
test_wheelchair.shape

(119, 5)

In [15]:
# List of words related to accessibility
accessibility_words = ["entrance", "wheelchair", "bathroom", "toilet", "steps", "narrow", "wide", "spacious", "disability"]
word_counts["word"] = accessibility_words
word_counts

Unnamed: 0,word,nr. reviews,index reviews
0,entrance,,
1,wheelchair,,
2,bathroom,,
3,toilet,,
4,steps,,
5,narrow,,
6,wide,,
7,spacious,,
8,disability,,


In [16]:
# Find the nr. of reviews that contain the accessibility word

# Create a list with the counts
review_counts = []

# Loop over accessibility_words and count reviews in which they occur
for word in accessibility_words:
    df = google_reviews[google_reviews["Review"].str.contains(word)]
    review_counts.append(df.shape[0])

word_counts["nr. reviews"] = review_counts
word_counts

Unnamed: 0,word,nr. reviews,index reviews
0,entrance,474,
1,wheelchair,119,
2,bathroom,526,
3,toilet,1919,
4,steps,113,
5,narrow,242,
6,wide,1592,
7,spacious,917,
8,disability,3,
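
Note that `str.contains` matches substrings, so a count for a short word such as "wide" may include reviews that only contain a longer word like "worldwide". The sketch below compares substring matching with whole-word matching on made-up reviews; the helper name `count_reviews` and its `whole_word` parameter are assumptions for illustration, not part of the notebook.

```python
import re

import pandas as pd

# Made-up reviews for illustration
reviews = pd.Series([
    "wide aisles and a wide entrance",
    "known worldwide for its fries",
    "the toilets are upstairs",
    "narrow steps",
])


def count_reviews(series, word, whole_word=False):
    """Count the reviews that contain `word` (hypothetical helper)."""
    # re.escape keeps the search word literal inside the regular expression
    pattern = re.escape(word)
    if whole_word:
        # \b only matches at a word boundary
        pattern = r"\b" + pattern + r"\b"
    return int(series.str.contains(pattern, regex=True).sum())


substring_hits = count_reviews(reviews, "wide")
whole_word_hits = count_reviews(reviews, "wide", whole_word=True)
print(substring_hits, whole_word_hits)  # 2 1
```

Whole-word matching is stricter in both directions: it also skips plurals, so "toilet" no longer matches "toilets". Which variant is appropriate depends on the word.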


In [17]:
word_counts[word_counts["word"]=="entrance"]

Unnamed: 0,word,nr. reviews,index reviews
0,entrance,474,


In [174]:
x = word_counts[word_counts["word"]=="entrance"].index.values
word_counts["index reviews"][x] = "tes test"
word_counts
print(int(x))

0


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  word_counts["index reviews"][x] = "tes test"
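
The warning above is caused by chained indexing: `word_counts["index reviews"][x] = ...` first selects a column and then assigns into that selection, which may be a copy. A minimal sketch of the same assignment with a single `.loc` call, using a reduced two-row version of `word_counts`:

```python
import pandas as pd

# Reduced version of the word_counts dataframe
word_counts = pd.DataFrame({
    "word": ["entrance", "wheelchair"],
    "nr. reviews": [474, 119],
    "index reviews": [None, None],
})

# Boolean mask that selects the row of interest
mask = word_counts["word"] == "entrance"

# Row selection and column selection in one .loc call, so no intermediate copy is involved
word_counts.loc[mask, "index reviews"] = "tes test"

# Row label of the matching row, as a plain integer
row_label = int(word_counts[mask].index[0])
print(row_label)  # 0
```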


In [170]:
test_list=[1, 1, 1, 5, 6, 7, 8]
word_counts.iat[4, 2] = test_list
word_counts

Unnamed: 0,word,nr. reviews,index reviews
0,entrance,474.0,lala
1,wheelchair,119.0,"[2, 4, 55, 6, 7, 8, 6]"
2,bathroom,526.0,"[1, 1, 1]"
3,toilet,1919.0,"[1, 2, 1]"
4,steps,113.0,"[1, 1, 1, 5, 6, 7, 8]"
5,narrow,242.0,lala
6,wide,1592.0,lala
7,spacious,917.0,lala
8,disability,3.0,lala


In [18]:
# Find the indexes of the reviews that contain the accessibility words
def acc_reviews(data):
    '''Stores, for every word in data["word"], the list of google_reviews indexes
        of the reviews that contain that word.'''
    
    # The column must have dtype object to hold one list per cell
    data["index reviews"] = data["index reviews"].astype(object)
    
    for row, word in data["word"].items():
        df = google_reviews[google_reviews["Review"].str.contains(word)]
        # .at sets a single cell, so the whole list is stored as one value
        data.at[row, "index reviews"] = df.index.tolist()

    return data

In [28]:
word_counts.dtypes

word             object
nr. reviews       int64
index reviews    object
dtype: object

In [20]:
word_counts["index reviews"] = word_counts["index reviews"].apply(acc_reviews)

TypeError: 'float' object is not subscriptable
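
The `TypeError` arises because `Series.apply` calls the function once per cell, so `acc_reviews` receives a single NaN (a float) instead of the dataframe, and `data["word"]` then fails. One way around this is to map over the `word` column with a function that takes a word and returns the matching indexes. A minimal sketch on made-up reviews, where `matching_indexes` is a hypothetical helper:

```python
import pandas as pd

# Made-up reviews for illustration
google_reviews = pd.DataFrame({
    "Review": [
        "wheelchair accessible entrance",
        "steep steps at the entrance",
        "very tasty",
    ]
})
word_counts = pd.DataFrame({"word": ["entrance", "wheelchair", "steps"]})


def matching_indexes(word):
    """Return the index labels of the reviews that contain `word` (hypothetical helper)."""
    mask = google_reviews["Review"].str.contains(word, regex=False)
    return google_reviews[mask].index.tolist()


# map calls the function once per word and stores one list per row
word_counts["index reviews"] = word_counts["word"].map(matching_indexes)
word_counts["nr. reviews"] = word_counts["index reviews"].map(len)
print(word_counts)
```

Deriving `nr. reviews` from the stored lists keeps the two columns consistent with each other.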

In [None]:
%%time
acc_reviews(word_counts)

