# Disneyland Reviews Analysis


Data Source: https://www.kaggle.com/arushchillar/disneyland-reviews <br>


The Disneyland Reviews dataset consists of reviews and ratings of three Disneyland locations (California, Paris, and Hong Kong), posted by visitors on TripAdvisor.<br>

Number of reviews: 42,000<br>
Timespan: Oct 2010 - May 2019<br>
Number of Attributes/Columns in data: 6 

Attribute Information:

1. Review_ID: unique id given to each review
2. Rating: ranging from 1 (unsatisfied) to 5 (satisfied)
3. Year_Month: when the reviewer visited the theme park
4. Reviewer_Location: country of origin of visitor
5. Review_Text: comments made by visitor
6. Branch: location of the Disneyland park


#### Objective:
Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We can use the Rating column. A rating of 4 or 5 is considered a positive review, and a rating of 1 or 2 is considered negative. A rating of 3 is neutral and is ignored. This is an approximate proxy for the polarity (positivity/negativity) of a review.


# [1]. Reading Data

In [1]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# Standard library
import os
import pickle
import re  # Tutorial about Python regular expressions: https://pymotw.com/2/re/
import sqlite3
import string

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# Text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
#from gensim.models import Word2Vec
#from gensim.models import KeyedVectors

# Feature extraction and evaluation
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import auc, confusion_matrix, roc_curve

In [2]:
con = sqlite3.connect('disneyReviews.db') 

In [18]:
df = pd.read_sql_query(""" SELECT * FROM DisneylandReviews WHERE Rating!=3 LIMIT 60000 """, con)
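
The query above drops neutral reviews at load time. As a minimal sketch of the same idea, the snippet below builds a small in-memory SQLite table and binds the neutral rating as a query parameter. The rows are made up for illustration and `neutral_rating` and `sample` are hypothetical names, not part of the original notebook.

```python
import sqlite3
import pandas as pd

# In-memory database with a few made-up rows (not taken from the real dataset).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE DisneylandReviews (Review_ID INTEGER, Rating INTEGER, Review_Text TEXT)"
)
con.executemany(
    "INSERT INTO DisneylandReviews VALUES (?, ?, ?)",
    [(1, 5, "Loved it"), (2, 3, "It was okay"), (3, 1, "Too crowded"), (4, 4, "Great day")],
)
con.commit()

# Bind the neutral rating as a parameter instead of hard-coding it in the SQL string.
neutral_rating = 3
sample = pd.read_sql_query(
    "SELECT * FROM DisneylandReviews WHERE Rating != ?", con, params=(neutral_rating,)
)
con.close()
print(sample)
```

Filtering in SQL keeps the neutral rows out of memory entirely; filtering in pandas after a full load would give the same result.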

In [19]:
df.head(10)

Unnamed: 0,Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch
0,670772142,4,2019-4,Australia,If you've ever been to Disneyland anywhere you...,Disneyland_HongKong
1,670682799,4,2019-5,Philippines,Its been a while since d last time we visit HK...,Disneyland_HongKong
2,670623270,4,2019-4,United Arab Emirates,Thanks God it wasn t too hot or too humid wh...,Disneyland_HongKong
3,670607911,4,2019-4,Australia,HK Disneyland is a great compact park. Unfortu...,Disneyland_HongKong
4,670607296,4,2019-4,United Kingdom,"the location is not in the city, took around 1...",Disneyland_HongKong
5,670585330,5,2019-4,India,Great place! Your day will go by and you won't...,Disneyland_HongKong
6,670571027,2,2019-4,Australia,"Feel so let down with this place,the Disneylan...",Disneyland_HongKong
7,670570869,5,2019-3,India,I can go on talking about Disneyland. Whatever...,Disneyland_HongKong
8,670443403,5,2019-4,United States,Disneyland never cease to amaze me! I've been ...,Disneyland_HongKong
9,670435886,5,2019-4,Canada,We spent the day here with our grown kids and ...,Disneyland_HongKong


In [20]:
df['Branch'].value_counts()

Disneyland_California    17745
Disneyland_Paris         11547
Disneyland_HongKong       8255
Name: Branch, dtype: int64

In [21]:
df['Rating'].value_counts()

5    23146
4    10775
2     2127
1     1499
Name: Rating, dtype: int64
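
The rating counts above suggest the two classes are far from balanced. A quick sketch of the arithmetic, using only the counts printed above (the variable names are illustrative):

```python
# Rating counts copied from the value_counts() output above.
rating_counts = {5: 23146, 4: 10775, 2: 2127, 1: 1499}

# Ratings above 3 are treated as positive, ratings below 3 as negative.
positive = sum(n for rating, n in rating_counts.items() if rating > 3)
negative = sum(n for rating, n in rating_counts.items() if rating < 3)
total = positive + negative

positive_share = positive / total
print(f"positive: {positive}, negative: {negative}, positive share: {positive_share:.1%}")
```

With roughly nine positive reviews for every negative one, plain accuracy is likely to be a misleading metric; a confusion matrix or ROC/AUC is probably a better fit for evaluating a classifier on this data.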

In [22]:
# Label reviews with Rating > 3 as positive (1) and reviews with Rating < 3 as negative (0).
def partition(x):
    if x < 3:
        return 0
    return 1
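
As a usage sketch, `partition` can be mapped over a ratings column to produce binary labels. The toy Series below stands in for `df['Rating']` and assumes neutral ratings of 3 have already been filtered out. The vectorized comparison is an equivalent alternative, not something the notebook itself uses.

```python
import pandas as pd

def partition(x):
    # Ratings below 3 are negative (0); everything else is positive (1).
    if x < 3:
        return 0
    return 1

# Toy ratings standing in for df['Rating'] (made-up values, no 3s).
ratings = pd.Series([5, 4, 2, 1, 5], name="Rating")

# Element-wise mapping with the helper function.
labels = ratings.map(partition)

# Equivalent vectorized form, valid only because 3s are absent.
labels_vectorized = (ratings > 3).astype(int)

print(labels.tolist())
```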

In [23]:
df[df.groupby(['Review_Text'])['Rating'].transform('count') > 1]

Unnamed: 0,Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch
5277,268817357,4,2015-4,United States,Hong Kong Disneyland is a very clean and lovel...,Disneyland_HongKong
5278,268817356,4,2015-4,United States,Hong Kong Disneyland is a very clean and lovel...,Disneyland_HongKong
5751,239871388,4,2014-10,Canada,"Disneyland, Hong Kong Disneyland (Hong Kong) i...",Disneyland_HongKong
5779,239015375,4,2014-10,Canada,"Disneyland, Hong Kong Disneyland (Hong Kong) i...",Disneyland_HongKong
6871,164862064,5,2013-6,Singapore,Great atmosphere... A place for everyone in th...,Disneyland_HongKong
6879,164862064,5,2013-6,Singapore,Great atmosphere... A place for everyone in th...,Disneyland_HongKong
7478,133668239,5,missing,Hong Kong,I am a Hongkonger and an international travell...,Disneyland_HongKong
7482,133552193,5,missing,Hong Kong,I am a Hongkonger and an international travell...,Disneyland_HongKong
7595,129231609,5,2012-4,United States,Let me just start off by saying that although ...,Disneyland_HongKong
7596,129207323,5,2011-9,Australia,Having never been to any Disneyland I was thri...,Disneyland_HongKong


In [24]:
df[df.groupby(['Review_Text'])['Rating'].transform('count') > 1].count()

Review_ID            44
Rating               44
Year_Month           44
Reviewer_Location    44
Review_Text          44
Branch               44
dtype: int64
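
The counts above show 44 rows whose review text appears more than once. One possible way to handle them, sketched here on a made-up toy frame, is to keep only the first occurrence of each text. Whether the original analysis removes these rows, and which copy it keeps, is an assumption.

```python
import pandas as pd

# Toy frame with made-up rows; the second and third rows share the same Review_Text.
toy = pd.DataFrame({
    "Review_ID": [101, 102, 103, 104],
    "Rating": [5, 4, 4, 1],
    "Review_Text": ["Magical day", "Clean and lovely park", "Clean and lovely park", "Long queues"],
})

# Flag every row whose text occurs more than once (same idea as the groupby/transform check).
is_repeated = toy.duplicated(subset="Review_Text", keep=False)

# Keep only the first occurrence of each review text.
deduped = toy.drop_duplicates(subset="Review_Text", keep="first").reset_index(drop=True)

print(is_repeated.sum(), len(deduped))
```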

In [16]:
# Index the column with square brackets; .values is an attribute, not a method.
df['Review_Text'].values

In [27]:
df.groupby(['Year_Month']).count()

Year_Month,Review_ID,Rating,Reviewer_Location,Review_Text,Branch
2010-10,20,20,20,20,20
2010-11,23,23,23,23,23
2010-12,52,52,52,52,52
2010-3,2,2,2,2,2
2010-4,1,1,1,1,1
2010-5,4,4,4,4,4
2010-6,7,7,7,7,7
2010-7,5,5,5,5,5
2010-8,7,7,7,7,7
2010-9,8,8,8,8,8
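
Because `Year_Month` is stored as text, the groups above sort lexicographically (`2010-10` comes before `2010-3`), and some rows carry the placeholder `missing`. A minimal sketch of one way to get chronological ordering is to parse the column into dates first. The toy values and the `Visit_Date` column name are assumptions for illustration.

```python
import pandas as pd

# Toy values in the same format as Year_Month, including the 'missing' placeholder.
toy = pd.DataFrame({"Year_Month": ["2010-10", "2010-3", "missing", "2010-9", "2010-3"]})

# errors='coerce' turns unparseable entries such as 'missing' into NaT.
toy["Visit_Date"] = pd.to_datetime(toy["Year_Month"], format="%Y-%m", errors="coerce")

# Grouping on the parsed dates sorts months chronologically; NaT rows are dropped by default.
monthly_counts = toy.groupby("Visit_Date").size()
print(monthly_counts)
```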
