# About Dataset
The dataset contains 42,000 reviews of three Disneyland branches (Paris, California and Hong Kong), posted by visitors on Tripadvisor.

Column Description:

Review_ID: unique id given to each review
Rating: ranging from 1 (unsatisfied) to 5 (satisfied)
Year_Month: when the reviewer visited the theme park
Reviewer_Location: country of origin of visitor
Review_Text: comments made by visitor
Branch: location of the Disneyland park

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    # Delete every ASCII punctuation character (string is already imported above).
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('DisneylandReviews.csv', encoding='latin-1', index_col=False)
baby_df.head(10)

Unnamed: 0,Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch
0,670772142,4,2019-4,Australia,If you've ever been to Disneyland anywhere you...,Disneyland_HongKong
1,670682799,4,2019-5,Philippines,Its been a while since d last time we visit HK...,Disneyland_HongKong
2,670623270,4,2019-4,United Arab Emirates,Thanks God it wasn t too hot or too humid wh...,Disneyland_HongKong
3,670607911,4,2019-4,Australia,HK Disneyland is a great compact park. Unfortu...,Disneyland_HongKong
4,670607296,4,2019-4,United Kingdom,"the location is not in the city, took around 1...",Disneyland_HongKong
5,670591897,3,2019-4,Singapore,"Have been to Disney World, Disneyland Anaheim ...",Disneyland_HongKong
6,670585330,5,2019-4,India,Great place! Your day will go by and you won't...,Disneyland_HongKong
7,670574142,3,2019-3,Malaysia,Think of it as an intro to Disney magic for th...,Disneyland_HongKong
8,670571027,2,2019-4,Australia,"Feel so let down with this place,the Disneylan...",Disneyland_HongKong
9,670570869,5,2019-3,India,I can go on talking about Disneyland. Whatever...,Disneyland_HongKong


In [10]:
baby_df['Review_Text'].isnull().values.any()

False

In [14]:
baby_df.dtypes

Review_ID             int64
Rating                int64
Year_Month           object
Reviewer_Location    object
Review_Text          object
Branch               object
dtype: object

In [5]:
baby_df['Review_Text']=baby_df['Review_Text'].astype(str)

In [8]:
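# Replace missing reviews with an empty string; assign back instead of calling inplace on a column.
baby_df["Review_Text"] = baby_df["Review_Text"].fillna("")
# Short test: NaN is the only value not equal to itself, so True means row 24 is not NaN.
baby_df["Review_Text"][24] == baby_df["Review_Text"][24]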

True

In [15]:
baby_df["Review_Text"] = baby_df["Review_Text"].apply(remove_punctuation)
baby_df['Review_Text'][4]

'the location is not in the city took around 1 hour from Kowlon my kids like disneyland so much everything is fine   but its really crowded and hot in Hong Kong'
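Deleting punctuation outright can fuse neighbouring words when a reviewer left no space after a full stop, which is presumably how tokens such as 'passWe' end up in the cleaned text. A minimal sketch of one alternative, assuming it is acceptable to turn each punctuation character into a space and then collapse the whitespace; punctuation_to_space is a hypothetical helper, not part of the notebook:

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def punctuation_to_space(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

cleaned = punctuation_to_space("4 day hopper pass.We usually go, don't we?")
print(cleaned)
```

The trade-off is that contractions split ("don't" becomes "don t"). With the default CountVectorizer token pattern, single-character tokens such as the stray "t" are ignored anyway.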

In [17]:
# Drop neutral (3-star) reviews with a boolean mask instead of a manual row counter.
baby_df = baby_df[baby_df["Rating"] != 3].copy()

int((baby_df["Rating"] == 3).sum())

0

In [18]:
# Map 1-2 stars to -1 (negative) and 4-5 stars to +1 (positive) in a single step.
baby_df["Rating"] = baby_df["Rating"].map({1: -1, 2: -1, 4: 1, 5: 1})
# Sanity check: every label should now be -1 or +1, so this count should be 0.
int((baby_df["Rating"] ** 2 != 1).sum())

0

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
train, test = train_test_split(baby_df, train_size=0.8, test_size=0.2)
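The split above is random on every run, so the scores reported later will vary slightly between executions. A hedged sketch of a reproducible, class-balanced split on toy data; the random_state value and the toy_df frame are illustrative assumptions, not settings used in this notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for baby_df: 8 positive and 2 negative labels (made-up data).
toy_df = pd.DataFrame({
    "Review_Text": [f"review {i}" for i in range(10)],
    "Rating": [1] * 8 + [-1] * 2,
})

# random_state fixes the shuffle; stratify keeps the label ratio in both parts.
train_toy, test_toy = train_test_split(
    toy_df, test_size=0.5, random_state=0, stratify=toy_df["Rating"]
)
print(test_toy["Rating"].value_counts().to_dict())
```

Fixing the seed would change which rows land in the test set, so the accuracy figures printed in this notebook would not be reproduced exactly.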

In [22]:
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(list(train["Review_Text"]))
y = train["Rating"]
x_test = vectorizer.transform(list(test["Review_Text"]))
y_test = test["Rating"]

In [23]:
model = LogisticRegression(max_iter=200)
model.fit(x, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
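The warning above means the solver stopped at max_iter=200 before reaching its tolerance. One common response is to raise max_iter. The sketch below does that on made-up data and turns ConvergenceWarning into an error so a non-converged fit cannot pass unnoticed. Whether 1000 iterations is enough for the full review matrix is an assumption that has not been checked here:

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus standing in for the review texts.
texts = ["great fun", "lovely day", "waste of money", "boring and dirty"] * 5
labels = [1, 1, -1, -1] * 5
x_toy = CountVectorizer().fit_transform(texts)

with warnings.catch_warnings():
    # Raise instead of warn if the solver hits the iteration limit.
    warnings.simplefilter("error", ConvergenceWarning)
    toy_model = LogisticRegression(max_iter=1000).fit(x_toy, labels)

# n_iter_ reports how many iterations the solver actually used.
print(toy_model.n_iter_)
```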


In [24]:
# Feature indexes ordered from the most negative to the most positive coefficient.
sorted_coef_indexes = np.argsort(model.coef_[0], kind="stable")
words = np.array(vectorizer.get_feature_names_out())
most_positive_words = words[sorted_coef_indexes[-10:]]
most_negative_words = words[sorted_coef_indexes[:10]]
print("Most positive words: ", most_positive_words)
print("Most negative words: ", most_negative_words)

Most positive words:  ['plenty' 'lovely' 'perfect' 'blast' 'wonderful' 'fantastic' 'downside'
 'longest' 'loved' 'amazing']
Most negative words:  ['worse' 'overcrowded' 'waste' 'boring' 'disappointing' 'miserable' 'poor'
 'worst' 'supposed' 'filthy']


In [25]:
model.predict(x_test)

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [26]:
probabilty = model.predict_proba(x_test)
probabilty

array([[3.62729281e-01, 6.37270719e-01],
       [2.89999746e-01, 7.10000254e-01],
       [5.31434652e-09, 9.99999995e-01],
       ...,
       [7.05444198e-04, 9.99294556e-01],
       [1.89503685e-03, 9.98104963e-01],
       [3.19828612e-04, 9.99680171e-01]])

In [30]:
print("Most positive: ", test.iloc[probabilty.T[1].argmax()]["Review_Text"])
print("\nMost negative: ", test.iloc[probabilty.T[0].argmax()]["Review_Text"])

Most positive:  My family of four with 2 adults and a 16  21 year old visited Disneyland park July 23   27 We purchased a 4 day hopper passWe usually go to the Disney world Resort in Orlando With that said we did enjoy our time at the park When we were there the park was open from 8 am   12am everyday They did have magic mornings with entry at 7am however we did not go that early Disneys California Adventure park hours were 8 am   10 pmThey do have security to enter the park however they only had 2 areas that you could enter and there was a very long line every day at all hours of the day to enter They check your bags and even if you dont have a bag still have to wait in the long line to enter They dont have a line for people with no bags One you have your bags checked they you go through the medal detectors Once you go through the security you are in a big courtyard like area this is the area between the two parks and Downtown Disney You can then proceed to the park you want to go to 

In [31]:
model.score(x_test, y_test)

0.9585885486018642
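Accuracy alone can look strong when one class dominates, and the predictions printed earlier are visibly dominated by the positive label. The sketch below uses made-up labels (not the notebook's data) to show how a classifier that always answers "positive" scores on accuracy, balanced accuracy and the confusion matrix:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

y_true = np.array([1] * 9 + [-1])   # illustrative: 90% positive reviews
y_pred = np.ones_like(y_true)       # a "model" that always predicts positive

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)   # mean of per-class recall
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])

print(acc, bal_acc)
print(cm)  # rows: true -1 / true 1, columns: predicted -1 / predicted 1
```

Running the same metrics on y_test and model.predict(x_test) would show how much of the reported score comes from the majority class.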

In [32]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [33]:
vectorizer_small = CountVectorizer(vocabulary=significant_words)

x_small = vectorizer_small.fit_transform(list(train["Review_Text"]))
y_small = train["Rating"]
x_test_small = vectorizer_small.transform(list(test["Review_Text"]))
y_test_small = test["Rating"]
light_model = LogisticRegression()
light_model.fit(x_small, y_small)
proba = light_model.predict_proba(x_test_small)
print("Most positive: ", test.iloc[proba.T[1].argmax()]["Review_Text"])
print("\nMost negative: ", test.iloc[proba.T[0].argmax()]["Review_Text"])
light_model.score(x_test_small, y_test_small)

Most positive:  We have just returned from an amazing four night five day stay at Disneyland Paris which was a fantastic family holiday and we all had a great time Disneyland is located in Marne le Vallee Chessy which is around 40 minutes on the RER train services from Paris Gare du Nord Getting there is very simple and around    20 for two adults and one child When using the ticket machines in Gare du Nord you need to have it in change or be prepared to pay by card As we got the Eurostar direct into Paris this was the ideal route for us and it was very easyOnce we got to Marne le Vallee we got the free shuttle bus to our hotel Sequoia Lodge to drop off our bags and collect our park tickets We then walked back to the park through the Disney village which took around fifteen minutes We had a quick stop for lunch in the village before heading down to the main Disneyland Park It   s very easy to find simply follow the crowds walking down then you pass underneath the main Disneyland Hotel 

0.9102529960053263

In [34]:
for word, coef in zip(vectorizer_small.get_feature_names_out(), light_model.coef_[0]):
    print(f"{word}: {coef}")

love: 0.8085222526541562
great: 0.955370536372613
easy: 0.8168174300883709
old: -0.07773904581399672
little: 0.34249489819952866
perfect: 1.506179489073232
loves: 0.2709680720903975
well: 0.24098105144052623
able: 0.3931337381140211
car: -0.21453935078209396
broke: -1.3898222199306463
less: -0.11925435083628011
even: -0.4993623842298293
waste: -1.4150801016694448
disappointed: -1.3154438135647366
work: -0.5481820715531313
product: -1.7759242421656691
money: -1.2346762463212277
would: -0.2964575273671813
return: -0.25958838067937956
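In a binary logistic regression, one extra occurrence of a word multiplies the odds of the positive class by exp(coefficient), with all other counts held fixed. A short sketch that converts three of the coefficients printed above into such odds multipliers; the selection of words is arbitrary:

```python
import math

# Coefficients copied from the light model's output above.
coefs = {
    "perfect": 1.506179489073232,
    "waste": -1.4150801016694448,
    "old": -0.07773904581399672,
}

# exp(coef) > 1 raises the odds of a positive review, < 1 lowers them.
odds = {word: math.exp(coef) for word, coef in coefs.items()}
for word, multiplier in odds.items():
    print(f"{word}: odds multiplier per occurrence = {multiplier:.2f}")
```

A multiplier close to 1, as for "old", indicates a word that carries little signal in this model.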


In [35]:
print(f"Light model - {light_model.score(x_test_small, y_test_small)}")
print(f"Standard model - {model.score(x_test, y_test)}")

Light model - 0.9102529960053263
Standard model - 0.9585885486018642


In [39]:
%%time
%%timeit
light_model.predict(x_test_small)

158 µs ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
CPU times: total: 12.7 s
Wall time: 13.6 s


In [40]:
%%time
%%timeit
model.predict(x_test)

3.07 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
CPU times: total: 2.03 s
Wall time: 2.44 s
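With %%time stacked on %%timeit, the "Wall time" lines appear to cover the whole benchmarking session, so the per-loop figures are the ones to compare. Outside a notebook, a similar comparison can be sketched with the standard timeit module. The data below is random and the feature widths (20 and 5000) are assumptions chosen only to mimic a small and a large vocabulary:

```python
import timeit

import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

y_toy = np.array([1, -1] * 100)

def fit_on_random_counts(n_features):
    """Fit a model on random sparse data with the given number of features."""
    x = sparse.random(200, n_features, density=0.05, format="csr", random_state=0)
    return LogisticRegression(max_iter=1000).fit(x, y_toy), x

timings = {}
for name, width in [("light", 20), ("standard", 5000)]:
    toy_model, x = fit_on_random_counts(width)
    # Best of 5 repeats, each averaging 100 predict calls.
    runs = timeit.repeat(lambda: toy_model.predict(x), number=100, repeat=5)
    timings[name] = min(runs) / 100

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} µs per predict call")
```

Absolute numbers depend on the machine and on the matrix sparsity, so they are not comparable with the timings printed above.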
