# Data Preprocessing
This notebook cleans the **OTHER data columns** (**NOT** the `Review_Text` column) in `Disneyland_Reviews.csv`.

The `Review_Text` column is processed and vectorized in the `starter.py` notebook, which uses the `text_preprocessing.py` file.

## 0. Import libraries

In [2]:
# Required libraries in Colab
# ! pip install transformers sentencepiece --quiet

In [3]:
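import re

import numpy as np
import pandas as pd
from tqdm import tqdm, tqdm_pandas

import matplotlib.pyplot as plt
import plotly.express as px
from wordcloud import WordCloud

# sent_tokenize relies on NLTK's Punkt tokenizer data being downloaded
from nltk.tokenize import word_tokenize, sent_tokenize

# Show full column contents (e.g. long review text) instead of truncating
pd.set_option('display.max_colwidth', None)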



## 1. Preprocessing

In [4]:
# Import data
df = pd.read_csv('../data/Disneyland_Reviews_updated.csv')

df.head()

Unnamed: 0,Rating,Year_Month,Reviewer_Location,Review_Title,Review_Text,Branch
0,5,2023-09,"Johor Bahru, Malaysia",Worth every penny and every minute,"I visited Disney Land Tokyo with my family on a weekend night in December 2022. We bought the evening entry that allowed us to enter the park after 3 p.m. at a discounted rate. We thought it was a great deal because we could still enjoy most of the attractions, parades, and shows without spending too much time or money. We arrived at the park around 4 p.m. and headed straight to Tokyo Disneyland. We were amazed by the beautiful decorations and the festive atmosphere. We had a wonderful time at Disney Land Tokyo at night with our family. We felt that it was worth every penny and every minute. We would definitely recommend it to anyone who wants to experience the best of both parks in a short time. It was a memorable visit that we will never forget.",Disneyland_Tokyo
1,5,2023-09,"Perth, Australia",The BEST day in Tokyo Disney,"Honestly this was a brilliant day at Tokyo Disney. If you come to Tokyo you simply cannot miss Disney. The wait times were not long at all and even though there were lots of people there, lines flowed seemlessly. Had lots of yummy treats like turkey legs, different flavoured popcorns - curry, strawberry cheesecake and Mickey shaped ice cream sandwiches or ice blocks. Went on all the rides with ease, most we went on 2 x. Spent 10 hrs there and it just went so fast. It was a humid day but Disney had the lines flowing so you weren’t in the sun long and water stations were everywhere. We used the 40 yr Premium Pass for 4 rides and we were able to book them and use them getting notifications on the app when the time was close. The best day. Don’t miss Disneyland Tokyo.",Disneyland_Tokyo
2,5,2023-09,,Lovely place,It is a smaller version of Orlando. Very busy and long lineups due to its popularity. We watched a special effects film there. It was pretty awesome. Great experience and worth a visit.,Disneyland_Tokyo
3,4,2023-09,"Attadale, Australia",Solo day at Disneyland,"Definitely a must see however doesn’t quite top the OG in Anaheim. A lot of the rides are in Japanese but still fantastic fun - the beauty and the beast ride was my favourite. Food wasn’t anything great however all added to the experience, lots of popcorn stands offering different flavours which was quite cool. Brings out your inner child and all the nostalgia :)",Disneyland_Tokyo
4,4,2023-09,"Orange County, CA",Happy 40th Tokyo Disneyland!,"Being a Magic Key passholder at Disneyland in California, I knew going to Tokyo Disneyland that I wouldn't need to spend too much time on rides I already know and love. Instead I concentrated my time on rides that aren't available in my neck of the woods. We arrived to the park at noon because we were had spent the morning in Tokyo exchanging JR vouchers and reacclimating to the new timezone. We also had to drop off our luggage at our hotel before heading to the parks. We had our Disney park tickets prepurchased on Klook so we were set to go. We jumped on the shuttle from our hotel and was dropped off at the train station and made our way to the park from Maihama.We purchased the Premier Access Pass for the Beauty and the Beast Ride as soon as we entered the gates via the Tokyo Disneyland app, which to me is the BEST ride in this park, hands down. It's fully immersive and just wonderful in the storytelling to the animatronics but in Japanese. If you're a fan of Beauty and the Beast, this is the ride to end all rides. Not only did Tokyo Disneyland recreate Belle's French village, they constructed the Beast's castle for the ride. The mother-effing castle is is here in the park in addition to Cinderella's castle. You walk through the beginning of the tale and then you enter a tea cup and ride a trackless ride around the castle, as a guest. If you've been on Rise of the Resistance, then you know what fully immersive means when it comes to a ride. We saw full grown adults with tears coming off of this ride. It's that good.While Tokyo Disneyland feels smaller somehow, it has massive amounts of land so everything is spaced out. Pick and choose your rides. We went on Space Mountain, Monster's Inc, Star Tours, Beauty and the Beast, It's a Small World, Pooh's Hunny Hunt, Cinderella's Fairytale Castle walk-through, Big Thunder Mountain, and Splash Mountain (the only one left since DL and WDW is retheming theirs). 
To be able to visit another Disney park is literally a dream come true and the admission cost is way more cost-effective in Japan than in the US. That's currently a FACT. We spent about $60ish USD per person per park in Japan, so do the math. Of course, you have to pay for airfare, so it might be a wash. The Castmembers at Tokyo Disneyland are so animated and seem genuinely happy to be interacting with guests. It was their 40th Anniversary when we went so it was extra magical with the option of obtaining free fast passes for certain rides. We had a wonderful time even in rainy humidity.",Disneyland_Tokyo


### 1.1 Data columns

In [5]:
# Drop rows with a missing value in any column
df = df.dropna()

# Split Year_Month ("YYYY-MM") into integer year and month columns
df[['Review_Year', 'Review_Month']] = df['Year_Month'].str.split('-', expand=True).astype('int64')
df = df.drop(columns=['Year_Month'])

# Sort newest first so that drop_duplicates keeps the most recent copy
df = df.sort_values(by=['Review_Year', 'Review_Month'], ascending=False)

# Drop reviews with duplicate text, keeping the first (most recent) occurrence
df = df.drop_duplicates(subset=['Review_Text'], keep='first')

# Restore the original row order and create a 1-based id from the original index
# (ids are not consecutive: dropped rows leave gaps)
df = df.sort_index()
df['Review_ID'] = df.index + 1


df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13321 entries, 0 to 15472
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Rating             13321 non-null  int64 
 1   Reviewer_Location  13321 non-null  object
 2   Review_Title       13321 non-null  object
 3   Review_Text        13321 non-null  object
 4   Branch             13321 non-null  object
 5   Review_Year        13321 non-null  int64 
 6   Review_Month       13321 non-null  int64 
 7   Review_ID          13321 non-null  int64 
dtypes: int64(4), object(4)
memory usage: 936.6+ KB
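
The interaction between the descending sort and `keep="first"` is easy to miss, so here is a minimal sketch of the same steps on toy data. The three rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical rows): the same review text appears twice
toy = pd.DataFrame({
    "Year_Month": ["2021-03", "2023-09", "2022-01"],
    "Review_Text": ["Great park", "Great park", "Too crowded"],
})

# Split "YYYY-MM" into integer year and month columns
toy[["Review_Year", "Review_Month"]] = (
    toy["Year_Month"].str.split("-", expand=True).astype("int64")
)

# Newest first, so keep="first" retains the most recent duplicate
toy = toy.sort_values(by=["Review_Year", "Review_Month"], ascending=False)
toy = toy.drop_duplicates(subset=["Review_Text"], keep="first")

# Back to the original order; ids come from the original index, so gaps are expected
toy = toy.sort_index()
toy["Review_ID"] = toy.index + 1

print(toy[["Review_ID", "Review_Year", "Review_Text"]])
```

The 2021 copy of "Great park" is dropped and the 2023 copy survives, which is why the remaining `Review_ID` values start at 2 rather than 1.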


In [6]:
df.head(5)

Unnamed: 0,Rating,Reviewer_Location,Review_Title,Review_Text,Branch,Review_Year,Review_Month,Review_ID
0,5,"Johor Bahru, Malaysia",Worth every penny and every minute,"I visited Disney Land Tokyo with my family on a weekend night in December 2022. We bought the evening entry that allowed us to enter the park after 3 p.m. at a discounted rate. We thought it was a great deal because we could still enjoy most of the attractions, parades, and shows without spending too much time or money. We arrived at the park around 4 p.m. and headed straight to Tokyo Disneyland. We were amazed by the beautiful decorations and the festive atmosphere. We had a wonderful time at Disney Land Tokyo at night with our family. We felt that it was worth every penny and every minute. We would definitely recommend it to anyone who wants to experience the best of both parks in a short time. It was a memorable visit that we will never forget.",Disneyland_Tokyo,2023,9,1
1,5,"Perth, Australia",The BEST day in Tokyo Disney,"Honestly this was a brilliant day at Tokyo Disney. If you come to Tokyo you simply cannot miss Disney. The wait times were not long at all and even though there were lots of people there, lines flowed seemlessly. Had lots of yummy treats like turkey legs, different flavoured popcorns - curry, strawberry cheesecake and Mickey shaped ice cream sandwiches or ice blocks. Went on all the rides with ease, most we went on 2 x. Spent 10 hrs there and it just went so fast. It was a humid day but Disney had the lines flowing so you weren’t in the sun long and water stations were everywhere. We used the 40 yr Premium Pass for 4 rides and we were able to book them and use them getting notifications on the app when the time was close. The best day. Don’t miss Disneyland Tokyo.",Disneyland_Tokyo,2023,9,2
3,4,"Attadale, Australia",Solo day at Disneyland,"Definitely a must see however doesn’t quite top the OG in Anaheim. A lot of the rides are in Japanese but still fantastic fun - the beauty and the beast ride was my favourite. Food wasn’t anything great however all added to the experience, lots of popcorn stands offering different flavours which was quite cool. Brings out your inner child and all the nostalgia :)",Disneyland_Tokyo,2023,9,4
4,4,"Orange County, CA",Happy 40th Tokyo Disneyland!,"Being a Magic Key passholder at Disneyland in California, I knew going to Tokyo Disneyland that I wouldn't need to spend too much time on rides I already know and love. Instead I concentrated my time on rides that aren't available in my neck of the woods. We arrived to the park at noon because we were had spent the morning in Tokyo exchanging JR vouchers and reacclimating to the new timezone. We also had to drop off our luggage at our hotel before heading to the parks. We had our Disney park tickets prepurchased on Klook so we were set to go. We jumped on the shuttle from our hotel and was dropped off at the train station and made our way to the park from Maihama.We purchased the Premier Access Pass for the Beauty and the Beast Ride as soon as we entered the gates via the Tokyo Disneyland app, which to me is the BEST ride in this park, hands down. It's fully immersive and just wonderful in the storytelling to the animatronics but in Japanese. If you're a fan of Beauty and the Beast, this is the ride to end all rides. Not only did Tokyo Disneyland recreate Belle's French village, they constructed the Beast's castle for the ride. The mother-effing castle is is here in the park in addition to Cinderella's castle. You walk through the beginning of the tale and then you enter a tea cup and ride a trackless ride around the castle, as a guest. If you've been on Rise of the Resistance, then you know what fully immersive means when it comes to a ride. We saw full grown adults with tears coming off of this ride. It's that good.While Tokyo Disneyland feels smaller somehow, it has massive amounts of land so everything is spaced out. Pick and choose your rides. We went on Space Mountain, Monster's Inc, Star Tours, Beauty and the Beast, It's a Small World, Pooh's Hunny Hunt, Cinderella's Fairytale Castle walk-through, Big Thunder Mountain, and Splash Mountain (the only one left since DL and WDW is retheming theirs). 
To be able to visit another Disney park is literally a dream come true and the admission cost is way more cost-effective in Japan than in the US. That's currently a FACT. We spent about $60ish USD per person per park in Japan, so do the math. Of course, you have to pay for airfare, so it might be a wash. The Castmembers at Tokyo Disneyland are so animated and seem genuinely happy to be interacting with guests. It was their 40th Anniversary when we went so it was extra magical with the option of obtaining free fast passes for certain rides. We had a wonderful time even in rainy humidity.",Disneyland_Tokyo,2023,9,5
5,3,Singapore,Princesses made our day,"It was fun! But…Stars deducted because I paid 7,500yen for Premier Access seating for the parade. The parade was cancelled due to heat, and the staff was unable provide any details on refunds. No refunds in the App till date, I lost 7,500yen for nothing.Mass diners were crowded with several diners left without tables, and tables occupied by teens for their nap or escape from the heat.Full praises for the Princes/Princesses! There was no queue system to take picture with them, BUT the princes/princesses were highly trained to scan the crowd to know who came first and naturally engaged them. WOW!",Disneyland_Tokyo,2023,8,6


### 1.2 Deconstruct `Review_Text` into sentences

In [7]:
# Split each review into a list of sentences, then expand to one row per sentence
df['Review_Text'] = df['Review_Text'].apply(sent_tokenize)
df = df.explode('Review_Text')
df = df.reset_index(drop=True)

# Create a 1-based Sentence_ID that runs across the whole dataset
df['Sentence_ID'] = df.index + 1

# Rearrange columns
df = df[["Review_ID", "Sentence_ID", "Review_Year", "Review_Month", "Branch", "Rating", "Reviewer_Location", "Review_Title", "Review_Text"]]

df.head(5)

Unnamed: 0,Review_ID,Sentence_ID,Review_Year,Review_Month,Branch,Rating,Reviewer_Location,Review_Title,Review_Text
0,1,1,2023,9,Disneyland_Tokyo,5,"Johor Bahru, Malaysia",Worth every penny and every minute,I visited Disney Land Tokyo with my family on a weekend night in December 2022.
1,1,2,2023,9,Disneyland_Tokyo,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We bought the evening entry that allowed us to enter the park after 3 p.m. at a discounted rate.
2,1,3,2023,9,Disneyland_Tokyo,5,"Johor Bahru, Malaysia",Worth every penny and every minute,"We thought it was a great deal because we could still enjoy most of the attractions, parades, and shows without spending too much time or money."
3,1,4,2023,9,Disneyland_Tokyo,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We arrived at the park around 4 p.m. and headed straight to Tokyo Disneyland.
4,1,5,2023,9,Disneyland_Tokyo,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We were amazed by the beautiful decorations and the festive atmosphere.
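
As a rough illustration of the split-and-explode pattern, the sketch below runs it on two made-up reviews. To stay self-contained it uses a regex splitter, `naive_sent_split`, as a stand-in for `sent_tokenize`. Both that helper and the `Sentence_Pos` column are assumptions for this example and are not part of the notebook.

```python
import re
import pandas as pd

def naive_sent_split(text):
    # Stand-in for nltk's sent_tokenize (assumption for this sketch):
    # split after '.', '!' or '?' when followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Toy data (hypothetical reviews)
toy = pd.DataFrame({
    "Review_ID": [1, 2],
    "Review_Text": ["It was fun! Long lines though.", "Worth a visit."],
})

# One list of sentences per review, then one row per sentence
toy["Review_Text"] = toy["Review_Text"].apply(naive_sent_split)
toy = toy.explode("Review_Text").reset_index(drop=True)

# Global 1-based sentence id, as in the cell above
toy["Sentence_ID"] = toy.index + 1

# Hypothetical extra column: position of each sentence within its review
toy["Sentence_Pos"] = toy.groupby("Review_ID").cumcount() + 1

print(toy)
```

A regex like this would wrongly split on abbreviations such as "3 p.m.", which is one reason to prefer `sent_tokenize` on the real data.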


### 1.3 TODO: Clean `Reviewer_Location` column?
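
One possible starting point for this TODO, sketched on a few location strings of the kind seen in the data: split on the last comma into a locality and a region. The column names `Location_Locality` and `Location_Region` are hypothetical, and treating a comma-free value such as "Singapore" as a region is an assumption rather than a decision made in this notebook.

```python
import pandas as pd

# Toy data using location formats that appear in the reviews
toy = pd.DataFrame({
    "Reviewer_Location": ["Johor Bahru, Malaysia", "Orange County, CA", "Singapore"],
})

# Split on the last comma only: left part is the locality, right part the country/state
parts = toy["Reviewer_Location"].str.rsplit(",", n=1, expand=True)
toy["Location_Locality"] = parts[0].str.strip()
toy["Location_Region"] = parts[1].str.strip()

# Values without a comma have no right part; treat the single token as the region
no_comma = toy["Location_Region"].isna()
toy.loc[no_comma, "Location_Region"] = toy.loc[no_comma, "Location_Locality"]
toy.loc[no_comma, "Location_Locality"] = None

print(toy)
```

Mixed granularity (a country in one row, a US state abbreviation in another) would still need a mapping step before the region column could be used for grouping.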

## 2. Export

In [8]:
# Export the sentence-level DataFrame to a pickle file

df.to_pickle("../data/processed_reviews.pkl")
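
A quick way to confirm that the export preserves the data is a round trip through `pd.read_pickle`. The sketch below does this with a small made-up frame and a temporary directory, so it does not touch `../data/`.

```python
import os
import tempfile
import pandas as pd

# Toy sentence-level frame (hypothetical rows)
toy = pd.DataFrame({
    "Review_ID": [1, 1],
    "Sentence_ID": [1, 2],
    "Review_Text": ["It was fun!", "Long lines though."],
})

# Write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_reviews.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Pickle keeps dtypes and the index, so the frames should compare equal
same = restored.equals(toy)
print(same)
```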
