# About Dataset
The dataset includes 42,000 reviews of 3 Disneyland branches (Paris, California and Hong Kong), posted by visitors on Trip Advisor.

Column Description:

Review_ID: unique id given to each review
Rating: ranging from 1 (unsatisfied) to 5 (satisfied)
Year_Month: when the reviewer visited the theme park
Reviewer_Location: country of origin of visitor
Review_Text: comments made by visitor
Branch: location of Disneyland Park

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    # Delete every character listed in string.punctuation (imported above).
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('DisneylandReviews.csv', encoding='latin-1', index_col=False)
baby_df.head(10)

Unnamed: 0,Review_ID,Rating,Year_Month,Reviewer_Location,Review_Text,Branch
0,670772142,4,2019-4,Australia,If you've ever been to Disneyland anywhere you...,Disneyland_HongKong
1,670682799,4,2019-5,Philippines,Its been a while since d last time we visit HK...,Disneyland_HongKong
2,670623270,4,2019-4,United Arab Emirates,Thanks God it wasn t too hot or too humid wh...,Disneyland_HongKong
3,670607911,4,2019-4,Australia,HK Disneyland is a great compact park. Unfortu...,Disneyland_HongKong
4,670607296,4,2019-4,United Kingdom,"the location is not in the city, took around 1...",Disneyland_HongKong
5,670591897,3,2019-4,Singapore,"Have been to Disney World, Disneyland Anaheim ...",Disneyland_HongKong
6,670585330,5,2019-4,India,Great place! Your day will go by and you won't...,Disneyland_HongKong
7,670574142,3,2019-3,Malaysia,Think of it as an intro to Disney magic for th...,Disneyland_HongKong
8,670571027,2,2019-4,Australia,"Feel so let down with this place,the Disneylan...",Disneyland_HongKong
9,670570869,5,2019-3,India,I can go on talking about Disneyland. Whatever...,Disneyland_HongKong


In [2]:
baby_df['Review_Text'].isnull().values.any()

False

The Review_Text column contains no null values, so we can proceed with the analysis.

In [4]:
baby_df.dtypes

Review_ID             int64
Rating                int64
Year_Month           object
Reviewer_Location    object
Review_Text          object
Branch               object
dtype: object

In [5]:
baby_df['Review_Text']=baby_df['Review_Text'].astype(str)

In [6]:
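# Safeguard only: Review_Text has no nulls and was already cast to str above.
baby_df["Review_Text"] = baby_df["Review_Text"].replace(np.nan, "")
# Short test: a NaN value would not compare equal to itself.
baby_df["Review_Text"][24] == baby_df["Review_Text"][24]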

True

In [7]:
baby_df["Review_Text"] = baby_df["Review_Text"].apply(remove_punctuation)
baby_df['Review_Text'][4]

'the location is not in the city took around 1 hour from Kowlon my kids like disneyland so much everything is fine   but its really crowded and hot in Hong Kong'

In [8]:
# Drop every review with a neutral rating of 3.
baby_df.drop(index=baby_df.index[baby_df["Rating"] == 3], inplace=True)

sum(baby_df["Rating"] == 3)

0

In [9]:
# Map 1-2 stars to -1 (negative) and 4-5 stars to 1 (positive).
baby_df["Rating"] = baby_df["Rating"].map({1: -1, 2: -1, 4: 1, 5: 1})
sum(baby_df["Rating"]**2 != 1)

0

With all rating-3 records removed, punctuation stripped, and ratings above 3 relabeled as positive (1) and ratings below 3 as negative (-1), we can move on to CountVectorizer.
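The preparation steps can be illustrated on a few made-up reviews. This is a minimal, pure-Python sketch rather than the notebook's pandas code; `to_sentiment` and `toy_reviews` are hypothetical names and data introduced only for illustration.

```python
import string


def remove_punctuation(text):
    # Delete every character listed in string.punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))


def to_sentiment(rating):
    """Return 1 for ratings above 3, -1 for ratings below 3, None for 3."""
    if rating == 3:
        return None
    return 1 if rating > 3 else -1


# Hypothetical toy data: (rating, review text) pairs.
toy_reviews = [
    (5, "Great place! Loved it."),
    (3, "It was OK, I guess."),
    (1, "Dirty, crowded; a waste."),
]

# Keep only non-neutral reviews, relabel them, and strip punctuation.
prepared = [
    (to_sentiment(rating), remove_punctuation(text))
    for rating, text in toy_reviews
    if to_sentiment(rating) is not None
]
print(prepared)
```

The neutral review disappears, and the remaining two carry the labels 1 and -1 with punctuation removed.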

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
train, test = train_test_split(baby_df, train_size=0.8, test_size=0.2)

In [12]:
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(list(train["Review_Text"]))
y = train["Rating"]
x_test = vectorizer.transform(list(test["Review_Text"]))
y_test = test["Rating"]

In the cells above, we split the data into training (80%) and test (20%) sets, then convert the words in each review into numeric count vectors. The vectorizer learns its vocabulary from the training reviews only. Each review then becomes a vector of word counts, which gives the model a numerical representation it can learn from.
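To make the word-to-vector step concrete, here is a simplified pure-Python stand-in for the counting idea. It is not scikit-learn's implementation. It assumes CountVectorizer's default behaviour of lowercasing text, keeping tokens of two or more word characters, and ordering the vocabulary alphabetically. The helper names and toy sentences are hypothetical.

```python
import re
from collections import Counter

# Tokens of two or more word characters, as in CountVectorizer's default pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(texts):
    # Map each distinct training word to a column index, in alphabetical order.
    words = sorted({word for text in texts for word in tokenize(text)})
    return {word: index for index, word in enumerate(words)}


def to_count_vector(text, vocabulary):
    # Count known words; words missing from the vocabulary are ignored.
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


train_texts = ["great park great rides", "long queues and dirty park"]
vocabulary = fit_vocabulary(train_texts)
train_vectors = [to_count_vector(text, vocabulary) for text in train_texts]
test_vector = to_count_vector("great rides but dirty toilets", vocabulary)

print(vocabulary)
print(train_vectors)
print(test_vector)
```

Words that appear only in the test text ("but", "toilets") get no column, which mirrors why the notebook calls `fit_transform` on the training reviews and only `transform` on the test reviews.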

In [13]:
model = LogisticRegression(max_iter=200)
model.fit(x, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [14]:
# Feature indexes sorted by coefficient, from most negative to most positive.
sorted_coef_indexes = np.argsort(model.coef_[0], kind="stable")
words = np.array(vectorizer.get_feature_names_out())
most_positive_words = words[sorted_coef_indexes[-10:]]
most_negative_words = words[sorted_coef_indexes[:10]]
print("Most positive words: ", most_positive_words)
print("Most negative words: ", most_negative_words)

Most positive words:  ['lovely' 'awesome' 'amazing' 'blast' 'longest' 'perfect' 'loved'
 'fantastic' 'downside' 'wonderful']
Most negative words:  ['worst' 'waste' 'overcrowded' 'boring' 'disappointing' 'miserable' 'poor'
 'filthy' 'worse' 'dirty']


In [15]:
model.predict(x_test)

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [16]:
probabilty = model.predict_proba(x_test)
probabilty

array([[5.02340122e-03, 9.94976599e-01],
       [8.39392763e-02, 9.16060724e-01],
       [1.69032609e-05, 9.99983097e-01],
       ...,
       [8.30169874e-03, 9.91698301e-01],
       [2.05403297e-05, 9.99979460e-01],
       [4.38171592e-03, 9.95618284e-01]])

In [17]:
print("Most positive: ", test.iloc[probabilty.T[1].argmax()]["Review_Text"])
print("\n Most negative: ", test.iloc[probabilty.T[0].argmax()]["Review_Text"])

Most positive:  Our group of 6 recently returned from a 4 night 3 day stay We have 4 children 15 13 9 6 so planning was very important My 9 and 6 year old hadnt been there and my teenagers were young on our last trip so we had a wide range of requests to accommodate TIP 1 Have a plan aheadbut dont be so stuck on it that you lose your mind when things go off the plan because they will We met as a family before to research attractions and made a list by person of the must do attractions We made sure to get those ones done early at park open when wait times were more reasonable Get FASTPASS or MAXPASS so you can book ride times for the in demand ones TIP 2 BUY WATER DRINKS SNACKS ahead of time We picked up 24 bottled waters at Walgreens for 399 The little savings can add up over your trip and saves on another lineTIP 3 EXTRA SHOES Rotate your shoes each day and in case they get wet on Splash Mountain Grizzly River Run they will Over 3 days at the park we walked nearly 50km 30 miles and ou

In [18]:
model.score(x_test, y_test)

0.955525965379494

In [19]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [20]:
vectorizer_small = CountVectorizer(vocabulary=significant_words)

x_small = vectorizer_small.fit_transform(list(train["Review_Text"]))
y_small = train["Rating"]
x_test_small = vectorizer_small.transform(list(test["Review_Text"]))
y_test_small = test["Rating"]
light_model = LogisticRegression()
light_model.fit(x_small, y_small)
proba = light_model.predict_proba(x_test_small)
print("Most positive: ", test.iloc[proba.T[1].argmax()]["Review_Text"])
print("\nMost negative: ", test.iloc[proba.T[0].argmax()]["Review_Text"])
light_model.score(x_test_small, y_test_small)

Most positive:  This was my 3rd trip to disney since it opened in 1992 This is a great place for all adults seem to love it just as much as the children There really is so much to do When planning a trip to disney plan what you want to do ahead of time and work out how long you plan to stay My last trip was with a friend in Nov 07 and we stayed 3 nightst 4 days This was just the right amount of time in low season as the parks close earlier and being two adults we managed to get around the parks quite quickly and we had time to repeate are favourite rides big thunder mountain x 4 space mountain x 3 We travelled by eurostar which was nice quick and hassle free We stayed on site at the Disney sequoia lodge See separate review This was just the right amount of time to explore the the parks however we did not have time to do a day trip to paris or visit other attrations like the shopping complex near by There is the 2 parks and the Disney Village The disney park takes the most time it is sp

0.9071904127829561

In [21]:
for word, coef in zip(vectorizer_small.get_feature_names_out(), light_model.coef_[0]):
    print(f"{word}: {coef}")

love: 0.7720431342737724
great: 0.9563649861724409
easy: 0.9247051988648635
old: -0.05017617015704574
little: 0.32094910765600554
perfect: 1.6335466486284644
loves: 0.2446153430380924
well: 0.269662285816826
able: 0.36090663812647256
car: -0.2791029682311213
broke: -1.4130458040605405
less: -0.1933940843478206
even: -0.47922091242520787
waste: -1.5271181071944753
disappointed: -1.2855433267720395
work: -0.5079604029028281
product: -1.7074284297219113
money: -1.2663530589611804
would: -0.3154379473902324
return: -0.2341529950904864


Once more, we fit a model on word counts, but this time we restrict the vectorizer to a fixed vocabulary that we supply. Because this list contains far fewer words than the full vocabulary, we expect prediction accuracy to drop. We also print the weight (coefficient) that the model assigns to each word.
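The sketch below shows how such weights could turn word counts into a probability in a binary logistic regression. It uses a few coefficients rounded from the output above. The intercept is not shown in the notebook, so the value 0.0 is an assumption, and the resulting probabilities are illustrative only. `positive_probability` is a hypothetical helper name.

```python
import math

# Coefficients rounded from the printed output above.
coefficients = {"perfect": 1.63, "great": 0.96, "waste": -1.53, "money": -1.27}

# Assumption: the fitted intercept (light_model.intercept_) is not shown,
# so 0.0 is used here purely for illustration.
intercept = 0.0


def positive_probability(word_counts, coefficients, intercept):
    # Linear score: intercept plus count-weighted sum of coefficients.
    score = intercept + sum(
        coefficients.get(word, 0.0) * count for word, count in word_counts.items()
    )
    # The logistic (sigmoid) function maps the score to a probability.
    return 1.0 / (1.0 + math.exp(-score))


p_good = positive_probability({"great": 1, "perfect": 1}, coefficients, intercept)
p_bad = positive_probability({"waste": 1, "money": 1}, coefficients, intercept)
print(f"great + perfect: {p_good:.3f}")
print(f"waste + money:   {p_bad:.3f}")
```

Positive weights push the probability of the positive class above 0.5 and negative weights push it below. Words outside the vocabulary contribute nothing.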

In [22]:
print(f"Light model - {light_model.score(x_test_small, y_test_small)}")
print(f"Standard model - {model.score(x_test, y_test)}")

Light model - 0.9071904127829561
Standard model - 0.955525965379494


In [23]:
%%time
%%timeit
light_model.predict(x_test_small)

168 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
CPU times: total: 10.7 s
Wall time: 14.2 s


In [24]:
%%time
%%timeit
model.predict(x_test)

3.85 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
CPU times: total: 2.06 s
Wall time: 3.1 s


Comparing the light model with the original model, the original is about 5 percentage points more accurate (0.956 vs. 0.907), but it is roughly 23 times slower at prediction (3.85 ms vs. 168 µs per call).
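The figures in this comparison can be rechecked from the scores and timings printed above. This is a small arithmetic sketch with the values copied from those outputs; the variable names are chosen here for illustration.

```python
# Values copied from the notebook outputs above.
light_accuracy = 0.9071904127829561
standard_accuracy = 0.955525965379494
light_time_s = 168e-6      # 168 µs per predict call
standard_time_s = 3.85e-3  # 3.85 ms per predict call

# Accuracy gap in percentage points and prediction-time ratio.
accuracy_gap_points = (standard_accuracy - light_accuracy) * 100
slowdown = standard_time_s / light_time_s

print(f"Accuracy gap: {accuracy_gap_points:.1f} percentage points")
print(f"Slowdown:     {slowdown:.1f}x")
```

The timings are mean values with a notable spread (± 1.16 ms for the standard model), so the ratio should be read as a rough estimate.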