# Sentiment Analysis
https://www.kaggle.com/code/ganiesenov/sentiment-analysis-using-nltk-gensim-models

ABSA
* https://github.com/yangheng95/PyABSA/blob/v2/examples-v2/aspect_polarity_classification/Aspect_Sentiment_Classification.ipynb
* https://www.kaggle.com/code/phiitm/aspect-based-sentiment-analysis
* https://www.kaggle.com/code/nkitgupta/aspect-based-sentiment-analysis

## 0. Import libraries

In [2]:
import numpy as np
import pandas as pd
import re
from tqdm import tqdm, tqdm_pandas

import matplotlib.pyplot as plt
pd.set_option('display.max_colwidth', None)
import plotly.express as px
from wordcloud import WordCloud

from nltk.corpus import PlaintextCorpusReader, stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import contractions
from num2words import num2words
from gensim import corpora, models

from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

from sklearn.cluster import KMeans



## 1. Preprocessing

In [3]:
# Import data
df = pd.read_csv('../data/Disneyland_Reviews_updated.csv')

df.head()

Unnamed: 0,Rating,Year_Month,Reviewer_Location,Review_Title,Review_Text,Branch
0,5,2023-09,"Johor Bahru, Malaysia",Worth every penny and every minute,"I visited Disney Land Tokyo with my family on a weekend night in December 2022. We bought the evening entry that allowed us to enter the park after 3 p.m. at a discounted rate. We thought it was a great deal because we could still enjoy most of the attractions, parades, and shows without spending too much time or money. We arrived at the park around 4 p.m. and headed straight to Tokyo Disneyland. We were amazed by the beautiful decorations and the festive atmosphere. We had a wonderful time at Disney Land Tokyo at night with our family. We felt that it was worth every penny and every minute. We would definitely recommend it to anyone who wants to experience the best of both parks in a short time. It was a memorable visit that we will never forget.",Disneyland_Tokyo
1,5,2023-09,"Perth, Australia",The BEST day in Tokyo Disney,"Honestly this was a brilliant day at Tokyo Disney. If you come to Tokyo you simply cannot miss Disney. The wait times were not long at all and even though there were lots of people there, lines flowed seemlessly. Had lots of yummy treats like turkey legs, different flavoured popcorns - curry, strawberry cheesecake and Mickey shaped ice cream sandwiches or ice blocks. Went on all the rides with ease, most we went on 2 x. Spent 10 hrs there and it just went so fast. It was a humid day but Disney had the lines flowing so you weren’t in the sun long and water stations were everywhere. We used the 40 yr Premium Pass for 4 rides and we were able to book them and use them getting notifications on the app when the time was close. The best day. Don’t miss Disneyland Tokyo.",Disneyland_Tokyo
2,5,2023-09,,Lovely place,It is a smaller version of Orlando. Very busy and long lineups due to its popularity. We watched a special effects film there. It was pretty awesome. Great experience and worth a visit.,Disneyland_Tokyo
3,4,2023-09,"Attadale, Australia",Solo day at Disneyland,"Definitely a must see however doesn’t quite top the OG in Anaheim. A lot of the rides are in Japanese but still fantastic fun - the beauty and the beast ride was my favourite. Food wasn’t anything great however all added to the experience, lots of popcorn stands offering different flavours which was quite cool. Brings out your inner child and all the nostalgia :)",Disneyland_Tokyo
4,4,2023-09,"Orange County, CA",Happy 40th Tokyo Disneyland!,"Being a Magic Key passholder at Disneyland in California, I knew going to Tokyo Disneyland that I wouldn't need to spend too much time on rides I already know and love. Instead I concentrated my time on rides that aren't available in my neck of the woods. We arrived to the park at noon because we were had spent the morning in Tokyo exchanging JR vouchers and reacclimating to the new timezone. We also had to drop off our luggage at our hotel before heading to the parks. We had our Disney park tickets prepurchased on Klook so we were set to go. We jumped on the shuttle from our hotel and was dropped off at the train station and made our way to the park from Maihama.We purchased the Premier Access Pass for the Beauty and the Beast Ride as soon as we entered the gates via the Tokyo Disneyland app, which to me is the BEST ride in this park, hands down. It's fully immersive and just wonderful in the storytelling to the animatronics but in Japanese. If you're a fan of Beauty and the Beast, this is the ride to end all rides. Not only did Tokyo Disneyland recreate Belle's French village, they constructed the Beast's castle for the ride. The mother-effing castle is is here in the park in addition to Cinderella's castle. You walk through the beginning of the tale and then you enter a tea cup and ride a trackless ride around the castle, as a guest. If you've been on Rise of the Resistance, then you know what fully immersive means when it comes to a ride. We saw full grown adults with tears coming off of this ride. It's that good.While Tokyo Disneyland feels smaller somehow, it has massive amounts of land so everything is spaced out. Pick and choose your rides. We went on Space Mountain, Monster's Inc, Star Tours, Beauty and the Beast, It's a Small World, Pooh's Hunny Hunt, Cinderella's Fairytale Castle walk-through, Big Thunder Mountain, and Splash Mountain (the only one left since DL and WDW is retheming theirs). To be able to visit another Disney park is literally a dream come true and the admission cost is way more cost-effective in Japan than in the US. That's currently a FACT. We spent about $60ish USD per person per park in Japan, so do the math. Of course, you have to pay for airfare, so it might be a wash. The Castmembers at Tokyo Disneyland are so animated and seem genuinely happy to be interacting with guests. It was their 40th Anniversary when we went so it was extra magical with the option of obtaining free fast passes for certain rides. We had a wonderful time even in rainy humidity.",Disneyland_Tokyo


### 1.1 Data columns

In [4]:
# Drop nulls
df = df.dropna()

# Split Year_Month
df[['Review_Year','Review_Month']] = df['Year_Month'].str.split('-', expand = True)
df['Review_Year'] = df['Review_Year'].apply(lambda x: int(x))
df['Review_Month'] = df['Review_Month'].apply(lambda x: int(x))
df = df.drop(columns=["Year_Month"])

# Sort by year and month
df = df.sort_values(by=['Review_Year', 'Review_Month'], ascending=False)

# Drop duplicates
df = df.drop_duplicates(subset=['Review_Text'], keep="first")

# Sort back by index and create an id column
df = df.sort_index()
df['Review_ID'] = df.index.map(lambda x: x+1)


df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13321 entries, 0 to 15472
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Rating             13321 non-null  int64 
 1   Reviewer_Location  13321 non-null  object
 2   Review_Title       13321 non-null  object
 3   Review_Text        13321 non-null  object
 4   Branch             13321 non-null  object
 5   Review_Year        13321 non-null  int64 
 6   Review_Month       13321 non-null  int64 
 7   Review_ID          13321 non-null  int64 
dtypes: int64(4), object(4)
memory usage: 936.6+ KB


In [5]:
df.head()

Unnamed: 0,Rating,Reviewer_Location,Review_Title,Review_Text,Branch,Review_Year,Review_Month,Review_ID
0,5,"Johor Bahru, Malaysia",Worth every penny and every minute,"I visited Disney Land Tokyo with my family on a weekend night in December 2022. We bought the evening entry that allowed us to enter the park after 3 p.m. at a discounted rate. We thought it was a great deal because we could still enjoy most of the attractions, parades, and shows without spending too much time or money. We arrived at the park around 4 p.m. and headed straight to Tokyo Disneyland. We were amazed by the beautiful decorations and the festive atmosphere. We had a wonderful time at Disney Land Tokyo at night with our family. We felt that it was worth every penny and every minute. We would definitely recommend it to anyone who wants to experience the best of both parks in a short time. It was a memorable visit that we will never forget.",Disneyland_Tokyo,2023,9,1
1,5,"Perth, Australia",The BEST day in Tokyo Disney,"Honestly this was a brilliant day at Tokyo Disney. If you come to Tokyo you simply cannot miss Disney. The wait times were not long at all and even though there were lots of people there, lines flowed seemlessly. Had lots of yummy treats like turkey legs, different flavoured popcorns - curry, strawberry cheesecake and Mickey shaped ice cream sandwiches or ice blocks. Went on all the rides with ease, most we went on 2 x. Spent 10 hrs there and it just went so fast. It was a humid day but Disney had the lines flowing so you weren’t in the sun long and water stations were everywhere. We used the 40 yr Premium Pass for 4 rides and we were able to book them and use them getting notifications on the app when the time was close. The best day. Don’t miss Disneyland Tokyo.",Disneyland_Tokyo,2023,9,2
3,4,"Attadale, Australia",Solo day at Disneyland,"Definitely a must see however doesn’t quite top the OG in Anaheim. A lot of the rides are in Japanese but still fantastic fun - the beauty and the beast ride was my favourite. Food wasn’t anything great however all added to the experience, lots of popcorn stands offering different flavours which was quite cool. Brings out your inner child and all the nostalgia :)",Disneyland_Tokyo,2023,9,4
4,4,"Orange County, CA",Happy 40th Tokyo Disneyland!,"Being a Magic Key passholder at Disneyland in California, I knew going to Tokyo Disneyland that I wouldn't need to spend too much time on rides I already know and love. Instead I concentrated my time on rides that aren't available in my neck of the woods. We arrived to the park at noon because we were had spent the morning in Tokyo exchanging JR vouchers and reacclimating to the new timezone. We also had to drop off our luggage at our hotel before heading to the parks. We had our Disney park tickets prepurchased on Klook so we were set to go. We jumped on the shuttle from our hotel and was dropped off at the train station and made our way to the park from Maihama.We purchased the Premier Access Pass for the Beauty and the Beast Ride as soon as we entered the gates via the Tokyo Disneyland app, which to me is the BEST ride in this park, hands down. It's fully immersive and just wonderful in the storytelling to the animatronics but in Japanese. If you're a fan of Beauty and the Beast, this is the ride to end all rides. Not only did Tokyo Disneyland recreate Belle's French village, they constructed the Beast's castle for the ride. The mother-effing castle is is here in the park in addition to Cinderella's castle. You walk through the beginning of the tale and then you enter a tea cup and ride a trackless ride around the castle, as a guest. If you've been on Rise of the Resistance, then you know what fully immersive means when it comes to a ride. We saw full grown adults with tears coming off of this ride. It's that good.While Tokyo Disneyland feels smaller somehow, it has massive amounts of land so everything is spaced out. Pick and choose your rides. We went on Space Mountain, Monster's Inc, Star Tours, Beauty and the Beast, It's a Small World, Pooh's Hunny Hunt, Cinderella's Fairytale Castle walk-through, Big Thunder Mountain, and Splash Mountain (the only one left since DL and WDW is retheming theirs). To be able to visit another Disney park is literally a dream come true and the admission cost is way more cost-effective in Japan than in the US. That's currently a FACT. We spent about $60ish USD per person per park in Japan, so do the math. Of course, you have to pay for airfare, so it might be a wash. The Castmembers at Tokyo Disneyland are so animated and seem genuinely happy to be interacting with guests. It was their 40th Anniversary when we went so it was extra magical with the option of obtaining free fast passes for certain rides. We had a wonderful time even in rainy humidity.",Disneyland_Tokyo,2023,9,5
5,3,Singapore,Princesses made our day,"It was fun! But…Stars deducted because I paid 7,500yen for Premier Access seating for the parade. The parade was cancelled due to heat, and the staff was unable provide any details on refunds. No refunds in the App till date, I lost 7,500yen for nothing.Mass diners were crowded with several diners left without tables, and tables occupied by teens for their nap or escape from the heat.Full praises for the Princes/Princesses! There was no queue system to take picture with them, BUT the princes/princesses were highly trained to scan the crowd to know who came first and naturally engaged them. WOW!",Disneyland_Tokyo,2023,8,6


### 1.2 Text Processing (to BoW)

Steps:
1. Decompose into sentences
2. Expand contractions
3. Convert numbers to words
4. Convert to lowercase
5. Remove punctuations and special characters
6. Stemming/Lemmatizing

In [6]:
# Split into sentences
df_sent = df.copy()

df_sent.loc[:,'Review_Text'] = df_sent['Review_Text'].apply(lambda x: sent_tokenize(x))
df_sent = df_sent.explode('Review_Text')
df_sent = df_sent.reset_index(drop=True)

# Create Sentence_ID
df_sent['Sentence_ID'] = df_sent.index.map(lambda x : x+1)

df_sent.head(5)

Unnamed: 0,Rating,Reviewer_Location,Review_Title,Review_Text,Branch,Review_Year,Review_Month,Review_ID,Sentence_ID
0,5,"Johor Bahru, Malaysia",Worth every penny and every minute,I visited Disney Land Tokyo with my family on a weekend night in December 2022.,Disneyland_Tokyo,2023,9,1,1
1,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We bought the evening entry that allowed us to enter the park after 3 p.m. at a discounted rate.,Disneyland_Tokyo,2023,9,1,2
2,5,"Johor Bahru, Malaysia",Worth every penny and every minute,"We thought it was a great deal because we could still enjoy most of the attractions, parades, and shows without spending too much time or money.",Disneyland_Tokyo,2023,9,1,3
3,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We arrived at the park around 4 p.m. and headed straight to Tokyo Disneyland.,Disneyland_Tokyo,2023,9,1,4
4,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We were amazed by the beautiful decorations and the festive atmosphere.,Disneyland_Tokyo,2023,9,1,5


In [31]:
# Function for sentence preprocessing

def preprocess_sentence(sentence, stemmer = None, lemmatizer = None):

    # Expand contractions
    output = contractions.fix(sentence)

    # Split into words
    output = output.split()
    for i in range(len(output)):

        # Convert to number
        if output[i].isnumeric():
            try:
                output[i] = num2words(output[i])
            except:
                pass

        # Make lowercase
        output[i] = output[i].lower()

    # Remove punctuation
    pattern = r"[^\w\s]"
    output = [re.sub(pattern, '', word) for word in output]

    # Define stopword list
    stopword_list = stopwords.words('english')

    # Stemming
    if stemmer != None:
        output = [stemmer.stem(word) for word in output if word not in stopword_list]  # Remove everything that is not a letter and stopwords

     # Lemmatizing
    if lemmatizer != None:
        output = [lemmatizer.lemmatize(word) for word in output if word not in stopword_list]  # Remove everything that is not a letter and stopwords

    return output


In [9]:
# Make bag of words

tqdm_pandas(tqdm())

df_sent["Bag_of_Words"] = df_sent["Review_Text"].progress_apply(lambda x: preprocess_sentence(x,
                                                                                     # stemmer = PorterStemmer(),
                                                                                    lemmatizer = WordNetLemmatizer()
                                                                                    ))

0it [00:00, ?it/s]
88918it [00:29, 3035.37it/s]


In [10]:
df_sent.head(10)

Unnamed: 0,Rating,Reviewer_Location,Review_Title,Review_Text,Branch,Review_Year,Review_Month,Review_ID,Sentence_ID,Bag_of_Words
0,5,"Johor Bahru, Malaysia",Worth every penny and every minute,I visited Disney Land Tokyo with my family on a weekend night in December 2022.,Disneyland_Tokyo,2023,9,1,1,"[visited, disney, land, tokyo, family, weekend, night, december, 2022]"
1,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We bought the evening entry that allowed us to enter the park after 3 p.m. at a discounted rate.,Disneyland_Tokyo,2023,9,1,2,"[bought, evening, entry, allowed, u, enter, park, three, pm, discounted, rate]"
2,5,"Johor Bahru, Malaysia",Worth every penny and every minute,"We thought it was a great deal because we could still enjoy most of the attractions, parades, and shows without spending too much time or money.",Disneyland_Tokyo,2023,9,1,3,"[thought, great, deal, could, still, enjoy, attraction, parade, show, without, spending, much, time, money]"
3,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We arrived at the park around 4 p.m. and headed straight to Tokyo Disneyland.,Disneyland_Tokyo,2023,9,1,4,"[arrived, park, around, four, pm, headed, straight, tokyo, disneyland]"
4,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We were amazed by the beautiful decorations and the festive atmosphere.,Disneyland_Tokyo,2023,9,1,5,"[amazed, beautiful, decoration, festive, atmosphere]"
5,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We had a wonderful time at Disney Land Tokyo at night with our family.,Disneyland_Tokyo,2023,9,1,6,"[wonderful, time, disney, land, tokyo, night, family]"
6,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We felt that it was worth every penny and every minute.,Disneyland_Tokyo,2023,9,1,7,"[felt, worth, every, penny, every, minute]"
7,5,"Johor Bahru, Malaysia",Worth every penny and every minute,We would definitely recommend it to anyone who wants to experience the best of both parks in a short time.,Disneyland_Tokyo,2023,9,1,8,"[would, definitely, recommend, anyone, want, experience, best, park, short, time]"
8,5,"Johor Bahru, Malaysia",Worth every penny and every minute,It was a memorable visit that we will never forget.,Disneyland_Tokyo,2023,9,1,9,"[memorable, visit, never, forget]"
9,5,"Perth, Australia",The BEST day in Tokyo Disney,Honestly this was a brilliant day at Tokyo Disney.,Disneyland_Tokyo,2023,9,2,10,"[honestly, brilliant, day, tokyo, disney]"


### 1.3 Vectorize

In [29]:
# Using gensim

# Make vocab dictionary
vocab = corpora.Dictionary(df_sent["Bag_of_Words"])

# Vectorize
vocab_vectors = [vocab.doc2bow(text) for text in df_sent["Bag_of_Words"]]

# Checking
record = 5

print(f"Text: {df_sent['Review_Text'][record]}")

sample = vocab_vectors[record]

for i in range(len(sample)):
    print(f'Word {sample[i][0]}: "{vocab[sample[i][0]]}" appears {sample[i][1]} time(s).')

Text: We had a wonderful time at Disney Land Tokyo at night with our family.
Word 2: "disney" appears 1 time(s).
Word 3: "family" appears 1 time(s).
Word 4: "land" appears 1 time(s).
Word 5: "night" appears 1 time(s).
Word 6: "tokyo" appears 1 time(s).
Word 32: "time" appears 1 time(s).
Word 45: "wonderful" appears 1 time(s).


In [32]:
# Using skLearn

vectorizer = TfidfVectorizer(analyzer=preprocess_sentence(lemmatizer = WordNetLemmatizer()), max_df = 0.5, min_df = 2, stop_words = 'english')

X = vectorizer.fit_transform(df_sent["Review_Text"])

print("n_samples: %d, n_features: %d" % X.shape)

TypeError: preprocess_sentence() missing 1 required positional argument: 'sentence'

In [30]:
# Do the actual clustering

true_k = 5

km = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=10, verbose = False)

print("Clustering sparse data with %s" % km)

km.fit(X)


print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Clustering sparse data with KMeans(max_iter=1000, n_clusters=5, n_init=10, verbose=False)
Top terms per cluster:
Cluster 0: is it a and to disneyland park this in of
Cluster 1: you to if a are can and in will have
Cluster 2: and to we a of for i in were are
Cluster 3: was it and a i to we not of for
Cluster 4:  fun enjoy it good great a we not go


## Try

In [20]:
testing = pd.read_pickle('../data/processed_review.pkl')

testing.head(5)

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed_review.pkl'

In [7]:
! pip install transformers sentencepiece --quiet

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the ABSA model and tokenizer
model_name = "yangheng/deberta-v3-base-absa-v1.1"
#tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

aspects = model.config.id2label
print(aspects)


s = "Park was perfect. It was my kids first time, and would come back again! Everyone was super kind and helpful! The silhouettes were the perfect thing to bring home to display!"

for aspect in aspects:
   print(aspect, classifier(s,  text_pair=aspect))

2023-09-30 10:21:26.198191: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.03k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/738M [00:00<?, ?B/s]

: 

In [23]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForTokenClassification
from transformers import pipeline


# Load the ABSA model and tokenizer
model_name = "yangheng/deberta-v3-base-absa-v1.1"
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

tokenizer2 = AutoTokenizer.from_pretrained("yanekyuk/bert-uncased-keyword-extractor")
model2 = AutoModelForTokenClassification.from_pretrained("yanekyuk/bert-uncased-keyword-extractor")


def extract_aspect(text):
    extractor = pipeline("ner", model=model2, tokenizer=tokenizer2)
    phrasesids = []
    for tag in extractor(text):
        if tag['entity'].startswith('B'):
            phrasesids.append([tag['start'], tag['end']])
        if tag['entity'].startswith('I'):
            phrasesids[-1][-1] = tag['end']
    phrases = [text[p[0]:p[1]] for p in phrasesids]
    return phrases

text =  "As a former annual pass holder it's sad to see how much the park has deviated from Walt's vision, crowded and overprice, the park is no longer a place that you can visit for the afternoon or just drop in to have dinner. Went here with some friends that had an extra ticket so I walked around with them for a few hours. Some of the rides are still easy to get on but other have long waits. If it's your first visit then you will need to allocate multiple days if you want to see everything."


pipe(text, candidate_labels=extract_aspect(text))

Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


ValueError: ignored