# Disneyland Reviews Analysis


Data Source: https://www.kaggle.com/arushchillar/disneyland-reviews <br>


The Disneyland Reviews dataset consists of reviews and ratings of 3 Disneyland locations (namely California, Paris & Hong Kong), posted by visitors on TripAdvisor.<br>

Number of reviews: 42,000<br>
Timespan: Oct 2010 - May 2019<br>
Number of Attributes/Columns in data: 6 

Attribute Information:

1. Review_ID: unique id given to each review
2. Rating: ranging from 1 (unsatisfied) to 5 (satisfied)
3. Year_Month: when the reviewer visited the theme park
4. Reviewer_Location: country of origin of visitor
5. Review_Text: comments made by visitor
6. Branch: location of Disneyland Park


#### Objective:
Given a review, determine whether the review is positive (Rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review. A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.


# [1]. Reading Data

In [1]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


# using the SQLite Table to read data.
import sqlite3

import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

#from gensim.models import Word2Vec
#from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

In [2]:
con = sqlite3.connect('disneyReviews.db') 

In [3]:
# Selecting only positive and negative reviews, i.e.
# Rating=3 will be ignored as this rating is neutral (i.e. neither positive nor negative).
# SELECT * FROM DisneylandReviews WHERE Rating != 3 LIMIT 20000 returns the first 20000 matching rows.
# The limit can be changed to any other number based on your computing power.

filtered_df = pd.read_sql_query(""" SELECT * FROM DisneylandReviews WHERE Rating!=3 LIMIT 20000 """, con)
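
A minimal, self-contained sketch of the same filtering query, run against an in-memory SQLite database. The rows are made up for illustration; only the table and column names follow the notebook. Note that `LIMIT` without `ORDER BY` returns whichever matching rows SQLite reads first, not a random sample.

```python
import sqlite3

# In-memory database with hypothetical rows; only the table/column names follow the notebook.
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE DisneylandReviews (Review_ID INTEGER, Rating INTEGER, Review_Text TEXT)"
)
con_demo.executemany(
    "INSERT INTO DisneylandReviews VALUES (?, ?, ?)",
    [(1, 5, "Loved it"), (2, 3, "It was okay"), (3, 1, "Too crowded"), (4, 4, "Great day")],
)

# Drop neutral reviews (Rating = 3) and cap the row count with a bound parameter.
max_rows = 2
rows = con_demo.execute(
    "SELECT Review_ID, Rating FROM DisneylandReviews WHERE Rating != 3 LIMIT ?",
    (max_rows,),
).fetchall()
print(rows)
con_demo.close()
```

Passing the limit as a bound parameter makes it easy to adjust the sample size without editing the SQL string.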


In [4]:
# Convert reviews with Rating>3 a positive rating, and reviews with a Rating<3 a negative rating.
def partition(x):
    if x < 3:
        return 0
    return 1

In [6]:
#changing Rating less than 3 to negative (0) and Rating greater than 3 to positive (1)
actualScore = filtered_df['Rating']
positiveNegative = actualScore.map(partition) 
filtered_df['Rating'] = positiveNegative
print("Number of data points in our data", filtered_df.shape)
filtered_df.head(3)

Number of data points in our data (20000, 6)


Unnamed: 0,Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch
0,670772142,1,2019-4,Australia,If you've ever been to Disneyland anywhere you...,Disneyland_HongKong
1,670682799,1,2019-5,Philippines,Its been a while since d last time we visit HK...,Disneyland_HongKong
2,670623270,1,2019-4,United Arab Emirates,Thanks God it wasn t too hot or too humid wh...,Disneyland_HongKong


In [14]:
display_df = pd.read_sql_query("""
SELECT Review_ID,Rating, Year_Month, Reviewer_Location, Review_Text, Branch, COUNT(*)
FROM DisneylandReviews
GROUP BY Review_Text
HAVING COUNT(*)>1
""", con)

In [15]:
print(display_df.shape)
display_df.head()

(24, 7)


Unnamed: 0,Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch,COUNT(*)
0,198850214,4,2014-3,United States,3 day Military Hopper Pass; best deal around. ...,Disneyland_California,2
1,606997669,5,2018-8,France,ActiveX VT ERROR:,Disneyland_Paris,2
2,166784597,4,2013-5,United States,Disneyland we love it! The service is incomp...,Disneyland_California,2
3,226905150,5,2014-5,United States,Disneyland Paris is different then other Disne...,Disneyland_Paris,2
4,239871388,4,2014-10,Canada,"Disneyland, Hong Kong Disneyland (Hong Kong) i...",Disneyland_HongKong,2


In [27]:
display_df[display_df['Review_ID']==198850214]

Unnamed: 0,Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch,COUNT(*)
0,198850214,4,2014-3,United States,3 day Military Hopper Pass; best deal around. ...,Disneyland_California,2


In [29]:
display_df['COUNT(*)'].sum()

48

In [28]:
# Give reviews with Rating>3 a positive label (1), and reviews with Rating<3 a negative label (0).
def partition(x):
    if x < 3:
        return 0
    return 1

#  Exploratory Data Analysis

## [2] Data Preprocessing

The reviews data contains duplicate entries: as the table below shows, 24 review texts appear more than once in the full table (48 rows in total). Hence, it is necessary to remove duplicates in order to get unbiased results from the analysis. The following is an example:

In [30]:
display_df.head()

Unnamed: 0,Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch,COUNT(*)
0,198850214,4,2014-3,United States,3 day Military Hopper Pass; best deal around. ...,Disneyland_California,2
1,606997669,5,2018-8,France,ActiveX VT ERROR:,Disneyland_Paris,2
2,166784597,4,2013-5,United States,Disneyland we love it! The service is incomp...,Disneyland_California,2
3,226905150,5,2014-5,United States,Disneyland Paris is different then other Disne...,Disneyland_Paris,2
4,239871388,4,2014-10,Canada,"Disneyland, Hong Kong Disneyland (Hong Kong) i...",Disneyland_HongKong,2


As displayed above, the same review appears in multiple rows with the same values for Rating, Year_Month, Reviewer_Location, Branch and Review_Text. On further analysis it was found that <br>
<br> 
Review_Text is exactly the same for the duplicate reviews.

Rows with identical parameters are redundant. Therefore, it is imperative to reduce this redundancy: we decided to eliminate rows having exactly the same parameters to avoid bias.<br>

To do this, we first sort the data by Review_ID, then keep only the first occurrence of each duplicated Review_Text and drop the redundant records from the dataframe. For example, for Review_ID=198850214 above, only the first occurrence remains. This ensures that there is only one representative for each review; deduplication without sorting could leave a different representative for the same review, depending on the row order.

In [31]:
#Sorting data according to Review_ID in ascending order
sorted_df=filtered_df.sort_values('Review_ID', axis=0, ascending=True, inplace=False, kind='mergesort', na_position='last')

In [32]:
#Deduplication of entries
final_df=sorted_df.drop_duplicates(subset=["Review_ID","Year_Month","Reviewer_Location","Review_Text","Branch"], keep='first', inplace=False)
final_df.shape

(19993, 6)
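
The sort-then-deduplicate step can be illustrated on a toy frame. This is a sketch with made-up rows, and it deduplicates on `Review_Text` alone, as the prose describes. The cell above also lists `Review_ID` in `subset`, so it only drops rows that match on every listed column.

```python
import pandas as pd

# Hypothetical toy data: reviews 11 and 12 carry identical text.
toy = pd.DataFrame({
    "Review_ID": [12, 10, 11],
    "Review_Text": ["Great park", "Too crowded", "Great park"],
    "Rating": [1, 0, 1],
})

# A stable sort by Review_ID makes "first occurrence" well defined.
toy_sorted = toy.sort_values("Review_ID", kind="mergesort")

# Keep the first row for each distinct text; later copies are dropped.
toy_dedup = toy_sorted.drop_duplicates(subset=["Review_Text"], keep="first")
print(toy_dedup["Review_ID"].tolist())  # [10, 11]
```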

In [34]:
#Calculate the % of data that remains after deduplication
round(final_df['Review_ID'].size / filtered_df['Review_ID'].size * 100, 3)

99.965

In [38]:
#Get the dataframe shape, before starting the Text-preprocessing phase
print(final_df.shape)

#Check how many positive and negative reviews are present in our dataset
final_df['Rating'].value_counts()


(19993, 6)


1    18586
0     1407
Name: Rating, dtype: int64

Observation: the deduplicated (and probably also the original) dataset is imbalanced, with about 93% of records being positive reviews.
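
As a quick check of the imbalance, the class shares can be computed directly from the counts printed above. This is a small sketch; the variable names are illustrative.

```python
import pandas as pd

# Label counts copied from the value_counts() output above.
counts = pd.Series({1: 18586, 0: 1407}, name="Rating")

# Share of each class, in percent.
share = (counts / counts.sum() * 100).round(2)
print(share)

# Ratio of positive to negative reviews (roughly 13 to 1).
ratio = counts[1] / counts[0]
print(round(ratio, 1))
```

With this skew, a model that always predicts "positive" would already reach about 93% accuracy, so accuracy alone is a weak yardstick here.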

# [3].  Text Preprocessing.

Now that the DisneyLand Reviews dataframe has been deduplicated, the data requires some preprocessing before we can perform additional analysis and build a prediction model.

Hence, in the preprocessing phase we do the following, in this order:

1. Begin by removing the HTML tags
2. Remove any punctuation or a limited set of special characters such as , or . or #
3. Check that the word is made up of English letters and is not alphanumeric
4. Check that the length of the word is greater than 2 (as it was researched that there are no 2-letter adjectives)
5. Convert the word to lowercase
6. Remove stopwords
7. Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming)<br>

After which we collect the words used to describe positive and negative reviews
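
The steps above can be sketched as a single function. This is a minimal illustration, not the notebook's actual implementation: `clean_review` and the tiny `STOPWORDS` set are hypothetical names introduced here, and stemming defaults to a no-op so the sketch runs with the standard library only.

```python
import re

# Tiny illustrative stopword list (an assumption); the notebook imports NLTK's stopwords corpus.
STOPWORDS = {"the", "was", "and", "but", "were", "too", "this", "that", "with", "for"}

def clean_review(text, stem=lambda word: word):
    """Apply steps 1-7 to one review and return the cleaned tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # 1. strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. punctuation/special characters -> space
    tokens = []
    for word in text.split():
        if not word.isalpha():                    # 3. skip tokens containing digits
            continue
        if len(word) <= 2:                        # 4. skip words of length 2 or less
            continue
        word = word.lower()                       # 5. lowercase
        if word in STOPWORDS:                     # 6. drop stopwords
            continue
        tokens.append(stem(word))                 # 7. stem (identity by default)
    return tokens

print(clean_review("The rides were <b>GREAT</b>, but the 2nd queue was too long!"))
# ['rides', 'great', 'queue', 'long']
```

To stem as in step 7, pass `nltk.stem.SnowballStemmer("english").stem` as the `stem` argument.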

In [43]:
# printing some random reviews
review_0 = final_df['Review_Text'].values[0]
print(review_0)
print("*"*100)

review_1900 = final_df['Review_Text'].values[1900]
print(review_1900)
print("*"*100)

review_12900 = final_df['Review_Text'].values[12900]
print(review_12900)
print("*"*100)

review_19900 = final_df['Review_Text'].values[19900]
print(review_19900)
print("*"*100)

Obviously I haven't visited Hong Kong Disneyland myself, but it doesn't take a genius to deduce from the final attractions list that this ISN'T the Disneyland we all know and love. This park is distinctly lacking in attractions... and lands.There are only 4 actual moving rides: The Jungle Cruise (slow moving boat ride), The Many Adventures of Winnie The Pooh (slow moving dark ride), Buzz Lightyear's Astro Blasters (slow moving dark ride), and Space Mountain (high speed family coaster). So, not much to do for those craving excitement or thrills.There is no Pirates of the Carribean, no Big Thunder Mountain, no Haunted Mansion, no Indiana Jones Adventure, no Splash Mountain, no Peter Pan's Flight, no It's A Small World, no Star Tours, and no Autopia. Yes, you will find none of these attractions in Hong Kong Disneyland.There's also no Frontierland. What? Yes, Frontierland has been completely obliterated from the park, leaving only 4 lands: Main Street USA, Adventureland, Fantasyland and To

In [48]:
# remove urls from text python: https://stackoverflow.com/a/40823105/4084039
# Note: .values[8219] selects by position, not by the index label 8219
review_8219 = final_df['Review_Text'].values[8219]
review_8219 = re.sub(r"http\S+", "", review_8219)


print(review_8219)

I am in the senior age range and I still think there is no place like it. I go every year and it is always new. We do the 2 day hopper . I was not able to do a lot of walking this time but all the staff were so kind and helpfull. Roll on next time
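
The URL-stripping pattern can be tried on a made-up string (the sample text and URL are hypothetical):

```python
import re

# Hypothetical review text containing a link.
sample = "Book tickets at https://example.com/tickets before you go."

# r"http\S+" matches "http" plus every following non-whitespace character.
cleaned = re.sub(r"http\S+", "", sample)
print(repr(cleaned))  # 'Book tickets at  before you go.'
```

The substitution leaves a double space behind, and any punctuation glued to the end of a URL is removed with it.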


In [62]:
# check for html tags in a text column using python/BeautifulSoup : https://stackoverflow.com/questions/24856035/how-to-detect-with-python-if-the-string-contains-html-code
final_df[final_df['Review_Text'].str.contains("<")==True].Review_Text.count()

0

In [63]:
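#check for punctuation/special characters (ie '.', ',' & '#')
# Note: in the pattern ".|,|#" the unescaped "." is a regex wildcard that matches any character,
# so the count below equals the total number of rows; a character class such as r"[.,#]" tests for the literal characters.
final_df[final_df['Review_Text'].str.contains(".|,|#")==True].Review_Text.count()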

19993
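
The count above equals the number of rows because `str.contains` treats its pattern as a regular expression by default, and an unescaped `.` matches any character. A small sketch with made-up strings shows the difference between the wildcard pattern and a character class:

```python
import re

texts = ["no punctuation here", "ends with a period.", "hash # tag"]

# Unescaped "." is a wildcard, so ".|,|#" matches any non-empty string.
wildcard_hits = [bool(re.search(".|,|#", t)) for t in texts]

# A character class matches only the literal characters '.', ',' and '#'.
literal_hits = [bool(re.search(r"[.,#]", t)) for t in texts]

print(wildcard_hits)  # [True, True, True]
print(literal_hits)   # [False, True, True]
```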

In [46]:
final_df[['Review_Text']].query('Review_Text.str.contains("http")', engine='python')

Unnamed: 0,Review_Text
8219,"Hi All,Planning for a trip to Hong Kong Disney..."
8045,Hong Kong Disneyland was a huge surprise to me...
6944,We visited on a Monday and again on a Sunday. ...
19958,We just visited Disneyland and picked a dead ...
19653,"I grew up near Disneyland. Yes, I have been th..."
6219,We thoroughly enjoyed our 2 days at Disneyland...
19459,We just got back from our first trip to Disney...
18772,"I don't care how old I get, Disneyland will al..."
5910,I had gone with my 2.5 year old with slight ap...
5877,Some of the reviews for Disneyland talk about ...


Observation: the Review_Text field does not contain any HTML tags; however, it does contain punctuation/special characters, and some reviews contain "http" links (see the query above).

In [64]:
# Expanding contracted words, reference: https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
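
A short usage sketch of the same substitutions, written as an ordered rule table so the block is self-contained. `expand_contractions` and `RULES` are illustrative names, not part of the notebook.

```python
import re

# Same substitutions as decontracted(), written as an ordered rule table.
# Specific cases come first so "won't" is not turned into "wo not".
RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'d", " would"),
    (r"'ll", " will"),
    (r"'t", " not"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def expand_contractions(phrase):
    """Apply each rule in order and return the expanded phrase."""
    for pattern, replacement in RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

print(expand_contractions("We won't go back, it wasn't fun and they're rude."))
# We will not go back, it was not fun and they are rude.
print(expand_contractions("Disney's parade"))
# Disney is parade
```

The second call shows a limitation of the rule-based approach: possessive `'s` is expanded to " is" just like a contraction.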

In [66]:
final_df[final_df['Review_Text'].str.contains("'t")==True].Review_Text

8254    Obviously I haven't visited Hong Kong Disneyla...
8252    Visited Hong Kong Disneyland on the 28th Septe...
8249    I went there on a weekday. It wasn't that crow...
8248    We were there 1st and 2nd of November 2005. It...
8247    Hong Kong Disneyland is indeed very small, but...
8244    Husband, daughter and self visited on 23rd Nov...
8243    What a waste of time and money. And to think I...
8242    We visited Hong Kong Disneyland on 28 February...
8241    I took my 11 yr old daughter to Disneyland HK ...
8234    I visited the park with my 2 children ages 7 a...
8233    Many people seem to compare DLHK with its sist...
8232    I visited HK Disney quite a while ago (Jan 200...
8230    My family has been going to Disney parks since...
8228    Having read so much bad publicity I was expect...
8227    II went there with a friend and our two 3 year...
8226    After reading so many bad reviews about HK Dis...
8225    Disneyland Hong Kong. We have been twice thus ...
8224    Having