In [1]:
%matplotlib inline

In [2]:
# Write your imports here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import nltk
import torch

from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import *
from sklearn.feature_extraction.text import CountVectorizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification

  from .autonotebook import tqdm as notebook_tqdm


# Working with Text Lab
## Information retrieval, preprocessing, and feature extraction

In this lab, you'll be looking at and exploring European restaurant reviews. The dataset is rather tiny, but that's just because it has to run on any machine. In real life, just like with images, texts can be several terabytes long.

The dataset is located [here](https://www.kaggle.com/datasets/gorororororo23/european-restaurant-reviews) and as always, it's been provided to you in the `data/` folder.

### Problem 1. Read the dataset (1 point)
Read the dataset, get acquainted with it. Ensure the data is valid before you proceed.

How many observations are there? Which country is the most represented? What time range does the dataset represent?

Is the sample balanced in terms of restaurants, i.e., do you have an equal number of reviews for each one? Most importantly, is the dataset balanced in terms of **sentiment**?

Lets take a look at the dataset:

In [3]:
reviews_data = pd.read_csv("data/European Restaurant Reviews.csv")
reviews_data

Unnamed: 0,Country,Restaurant Name,Sentiment,Review Title,Review Date,Review
0,France,The Frog at Bercy Village,Negative,Rude manager,May 2024 •,The manager became agressive when I said the c...
1,France,The Frog at Bercy Village,Negative,A big disappointment,Feb 2024 •,"I ordered a beef fillet ask to be done medium,..."
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,Nov 2023 •,"This is an attractive venue with welcoming, al..."
3,France,The Frog at Bercy Village,Negative,Great service and wine but inedible food,Mar 2023 •,Sadly I used the high TripAdvisor rating too ...
4,France,The Frog at Bercy Village,Negative,Avoid- Worst meal in Rome - possibly ever,Nov 2022 •,From the start this meal was bad- especially g...
...,...,...,...,...,...,...
1497,Cuba,Old Square (Plaza Vieja),Negative,The Tourism Trap,Oct 2016 •,Despite the other reviews saying that this is ...
1498,Cuba,Old Square (Plaza Vieja),Negative,the beer factory,Oct 2016 •,beer is good. food is awfull The only decent...
1499,Cuba,Old Square (Plaza Vieja),Negative,brewery,Oct 2016 •,"for terrible service of a truly comedic level,..."
1500,Cuba,Old Square (Plaza Vieja),Negative,It's nothing exciting over there,Oct 2016 •,We visited the Havana's Club Museum which is l...


Lets see the number of observations for each country:

In [4]:
countries_count = reviews_data["Country"].value_counts()
countries_count

Country
France     512
Italy      318
Morroco    210
Cuba       146
Poland     135
Russia     100
India       81
Name: count, dtype: int64

As we can see, its not evenly spread out and France is by far the most represented.

As for the time frame, Lets see:

In [5]:
reviews_data["Review Date"].unique()

array(['May 2024 •', 'Feb 2024 •', 'Nov 2023 •', 'Mar 2023 •',
       'Nov 2022 •', 'Jul 2021 •', 'Jan 2020 •', 'Oct 2019 •',
       'Jun 2019 •', 'May 2019 •', 'Mar 2019 •', 'Dec 2018 •',
       'Sept 2018 •', 'Oct 2018 •', 'Jul 2018 •', 'May 2018 •',
       'Feb 2018 •', 'May 2017 •', 'Feb 2017 •', 'Aug 2017 •',
       'Jul 2017 •', 'Jun 2017 •', 'Jan 2017 •', 'May 2016 •',
       'Oct 2016 •', 'Aug 2016 •', 'Jul 2016 •', 'Jun 2016 •',
       'Feb 2016 •', 'Mar 2016 •', 'Dec 2015 •', 'Nov 2015 •',
       'Oct 2015 •', 'Sept 2015 •', 'Aug 2015 •', 'Mar 2015 •',
       'Jul 2015 •', 'May 2015 •', 'Apr 2015 •', 'Feb 2022 •',
       'Feb 2015 •', 'Nov 2014 •', 'Oct 2014 •', 'Jul 2014 •',
       'May 2014 •', 'Apr 2014 •', 'Dec 2013 •', 'Nov 2013 •',
       'Sept 2013 •', 'Aug 2013 •', 'Jul 2013 •', 'Jun 2013 •',
       'Mar 2013 •', 'Jan 2013 •', 'Dec 2012 •', 'Oct 2012 •',
       'Aug 2012 •', 'Jun 2012 •', 'May 2012 •', 'Dec 2011 •',
       'Mar 2012 •', 'Nov 2011 •', 'Feb 2012 •', 'Oc

With observation, we can see that the review dates range from September 2010 to June 2024

In [6]:
reviews_data["Restaurant Name"].value_counts()

Restaurant Name
The Frog at Bercy Village                512
Ad Hoc Ristorante (Piazza del Popolo)    318
The LOFT                                 210
Old Square (Plaza Vieja)                 146
Stara Kamienica                          135
Pelmenya                                 100
Mosaic                                    81
Name: count, dtype: int64

It looks like there is only one restaurant for each country. Lets check to make sure:

In [7]:
for country in reviews_data["Country"].unique():
    restaurants_count = len(reviews_data[reviews_data["Country"] == country]["Restaurant Name"].unique())
    print("Number of restaurants in " + country + ": " + str(restaurants_count))

Number of restaurants in France: 1
Number of restaurants in Italy: 1
Number of restaurants in Poland: 1
Number of restaurants in India: 1
Number of restaurants in Russia: 1
Number of restaurants in Morroco: 1
Number of restaurants in Cuba: 1


Our observation was correct. Now lets see if the dataset is balanced in terms of sentiment:

In [8]:
reviews_data["Sentiment"].value_counts()

Sentiment
Positive    1237
Negative     265
Name: count, dtype: int64

Well, looks like there are way more positive reviews. Lets look at the arrangement for each restaurant:

In [9]:
for restaurant in reviews_data["Restaurant Name"].unique():
    positive_reviews = len(reviews_data[(reviews_data["Restaurant Name"] == restaurant) & (reviews_data["Sentiment"] == "Positive")]["Sentiment"])
    negative_reviews = len(reviews_data[(reviews_data["Restaurant Name"] == restaurant) & (reviews_data["Sentiment"] == "Negative")]["Sentiment"])
    print("Number of positive reviews in " + restaurant + ": " + str(positive_reviews))
    print("Number of negative reviews in " + restaurant + ": " + str(negative_reviews) + "\n")

Number of positive reviews in The Frog at Bercy Village: 360
Number of negative reviews in The Frog at Bercy Village: 152

Number of positive reviews in Ad Hoc Ristorante (Piazza del Popolo): 270
Number of negative reviews in Ad Hoc Ristorante (Piazza del Popolo): 48

Number of positive reviews in Stara Kamienica: 120
Number of negative reviews in Stara Kamienica: 15

Number of positive reviews in Mosaic: 81
Number of negative reviews in Mosaic: 0

Number of positive reviews in Pelmenya: 90
Number of negative reviews in Pelmenya: 10

Number of positive reviews in The LOFT: 207
Number of negative reviews in The LOFT: 3

Number of positive reviews in Old Square (Plaza Vieja): 109
Number of negative reviews in Old Square (Plaza Vieja): 37



Lets draw a scattetplot:

In [10]:
# sentiment_df = pd.DataFrame(columns = ["Restaurant", "Positive", "Negative"])
# sentiment_df["Restaurant"] = reviews_data["Restaurant Name"].unique()

# for restaurant in reviews_data["Restaurant Name"].unique():
#     positive_reviews = len(reviews_data[(reviews_data["Restaurant Name"] == restaurant) & (reviews_data["Sentiment"] == "Positive")]["Sentiment"])
#     negative_reviews = len(reviews_data[(reviews_data["Restaurant Name"] == restaurant) & (reviews_data["Sentiment"] == "Negative")]["Sentiment"])
#     sentiment_df.loc[sentiment_df["Restaurant"] == restaurant, "Positive"] = positive_reviews
#     sentiment_df.loc[sentiment_df["Restaurant"] == restaurant, "Negative"] = negative_reviews

# colors = ["red", "green", "blue", "orange", "purple", "yellow", "cyan"]

# plt.scatter(sentiment_df["Positive"], sentiment_df["Negative"], c = colors, label = sentiment_df["Restaurant"])
# plt.legend()
# plt.show
# sentiment_df

Looks consistent for all restaurants.

Looks consistent for all restaurants.

### Problem 2. Getting acquainted with reviews (1 point)
Are positive comments typically shorter or longer? Try to define a good, robust metric for "length" of a text; it's not necessary just the character count. Can you explain your findings?

Lets create another column for the word count in each review:

In [11]:
modified_reviews_data = reviews_data
for review in modified_reviews_data["Review"]:
    word_count = len(re.split(r"\W+", review)) - 1
    modified_reviews_data.loc[modified_reviews_data["Review"] == review, "Words Count"] = word_count

modified_reviews_data["Words count"] = modified_reviews_data["Words Count"].astype(int)

Now lets look into the lengths of the reviews:

In [12]:
print(modified_reviews_data["Words count"].describe())
print("Median: " + str(modified_reviews_data["Words Count"].median()))

count    1502.000000
mean       66.379494
std        74.938374
min         2.000000
25%        26.000000
50%        43.000000
75%        76.000000
max       654.000000
Name: Words count, dtype: float64
Median: 43.0


Now lets do the same for the positive and negative reviews separately:

In [13]:
positive_reviews_word_count = modified_reviews_data[modified_reviews_data["Sentiment"] == "Positive"]["Words count"]
print(positive_reviews_word_count.describe())
print("Median: " + str(positive_reviews_word_count.median()))

count    1237.000000
mean       50.133387
std        39.105626
min         2.000000
25%        25.000000
50%        37.000000
75%        61.000000
max       339.000000
Name: Words count, dtype: float64
Median: 37.0


In [14]:
negative_reviews_word_count = modified_reviews_data[modified_reviews_data["Sentiment"] == "Negative"]["Words count"]
print(negative_reviews_word_count.describe())
print("Median: " + str(negative_reviews_word_count.median()))

count    265.000000
mean     142.215094
std      133.265921
min       13.000000
25%       55.000000
50%       95.000000
75%      179.000000
max      654.000000
Name: Words count, dtype: float64
Median: 95.0


As we can see, negative reviews are significantly longer. That may be due to the fact, that a frustrated person is more likely to describe in detail what they didnt like. They are more motivated to show other people why not to choose this place.

### Problem 3. Preprocess the review content (2 points)
You'll likely need to do this while working on the problems below, but try to synthesize (and document!) your preprocessing here. Your tasks will revolve around words and their connection to sentiment. While preprocessing, keep in mind the domain (restaurant reviews) and the task (sentiment analysis).

### Problem 3. Top words (1 point)
Use a simple word tokenization and count the top 10 words in positive reviews; then the top 10 words in negative reviews*. Once again, try to define what "top" words means. Describe and document your process. Explain your results.

\* Okay, you may want to see top N words (with $N \ge 10$).

Lets create word counters for the positive and the negative reviews separately:

First we need to create our stopword set:

In [15]:
nltk.download("stopwords")
stop = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to /home/gecata/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now lets create our stemmer:

In [16]:
stemmer = PorterStemmer()

Now we are ready to create the word counters. We are going to create a function, which will count the words in the reviews by default, but optionally it will count the words in the titles instead, because we will need it in the next problem.

In [17]:
def make_counter_by_sentiment(sentiment, title = False):
    #Make a word counter for all the words in the reviews or the titles
    if not title:
        all_sentiment_reviews = reviews_data[reviews_data["Sentiment"] == sentiment]["Review"].str.cat(sep = " ")
    elif title:
        all_sentiment_reviews = reviews_data[reviews_data["Sentiment"] == sentiment]["Review Title"].str.cat(sep = " ")
        
    words_counter_raw = Counter(re.split(r"\W+", all_sentiment_reviews.lower()))

    #Remove the stopwords
    words_counter_no_stopwords = Counter({word: count for word, count in words_counter_raw.items() if word not in stop})

    #Perform stemming
    words_counter = Counter()
    for word in words_counter_no_stopwords:
        stemmed = stemmer.stem(word)
        try:
            words_counter[stemmed] += words_counter_no_stopwords[stemmed]
        except(KeyError):
            words_counter[stemmed] = 0
            words_counter[stemmed] += words_counter_no_stopwords[stemmed]            

    return words_counter

Now lets look at the top 10 most common words for the positive and the negative reviews:

In [18]:
positive_word_counter = make_counter_by_sentiment("Positive")
positive_word_counter.most_common(10)

[('food', 1482),
 ('recommend', 1152),
 ('great', 1144),
 ('place', 1122),
 ('good', 1028),
 ('wine', 720),
 ('time', 716),
 ('menu', 705),
 ('nice', 616),
 ('special', 602)]

In [19]:
negative_word_counter = make_counter_by_sentiment("Negative")
negative_word_counter.most_common(10)

[('menu', 532),
 ('food', 494),
 ('like', 435),
 ('wine', 350),
 ('place', 297),
 ('one', 268),
 ('time', 255),
 ('us', 205),
 ('meal', 192),
 ('nice', 176)]

As we can see, words like "menu", "food", "staff" and so on are common in both negative and positive reviews. That can be explained, because thats what a restaurant offers and when people leave reviews, they are likely to write the most about that. Its really interesting, that there are no "negative" words like "bad", "terrible", "never" and so on in the top 10 words in the negative reviews.

### Problem 4. Review titles (2 point)
How do the top words you found in the last problem correlate to the review titles? Do the top 10 words (for each sentiment) appear in the titles at all? Do reviews which contain one or more of the top words have the same words in their titles?

Does the title of a comment present a good summary of its content? That is, are the titles descriptive, or are they simply meant to catch the attention of the reader?

In [20]:
positive_title_word_counter = make_counter_by_sentiment("Positive", title = True)
positive_title_word_counter.most_common(10)

[('great', 226),
 ('place', 204),
 ('food', 178),
 ('good', 106),
 ('best', 78),
 ('excellent', 78),
 ('dinner', 74),
 ('restaurant', 73),
 ('beer', 68),
 ('meal', 62)]

In [21]:
negative_title_word_counter = make_counter_by_sentiment("Negative", title = True)
negative_title_word_counter.most_common(10)

[('food', 32),
 ('tourist', 18),
 ('place', 16),
 ('bad', 15),
 ('wine', 14),
 ('rome', 14),
 ('go', 14),
 ('rude', 12),
 ('great', 12),
 ('lack', 12)]

Here are the number of reviews, where the top 10 most common positive words appear and the number of reviews, where they also appear in the title:

In [22]:
for word, _ in positive_word_counter.most_common(10):
    reviews_with_word = reviews_data[reviews_data["Review"].str.contains(word)]
    reviews_and_titles_with_word = reviews_data[(reviews_data["Review"].str.contains(word)) & (reviews_data["Review Title"].str.contains(word))]
    print(f"{word} in reviews: " + str(reviews_with_word.shape[0]) + f"\n{word} in titles:"+ str(reviews_and_titles_with_word.shape[0]) + "\n")

food in reviews: 720
food in titles:129

recommend in reviews: 302
recommend in titles:9

great in reviews: 381
great in titles:36

place in reviews: 416
place in titles:56

good in reviews: 460
good in titles:33

wine in reviews: 297
wine in titles:23

time in reviews: 257
time in titles:5

menu in reviews: 287
menu in titles:7

nice in reviews: 301
nice in titles:8

special in reviews: 159
special in titles:3



Here is the same with the negative words:

In [23]:
for word, _ in negative_word_counter.most_common(10):
    reviews_with_word = reviews_data[reviews_data["Review"].str.contains(word)]
    reviews_and_titles_with_word = reviews_data[(reviews_data["Review"].str.contains(word)) & (reviews_data["Review Title"].str.contains(word))]
    print(f"{word} in reviews: " + str(reviews_with_word.shape[0]) + f"\n{word} in titles:"+ str(reviews_and_titles_with_word.shape[0]) + "\n")

menu in reviews: 287
menu in titles:7

food in reviews: 720
food in titles:129

like in reviews: 172
like in titles:1

wine in reviews: 297
wine in titles:23

place in reviews: 416
place in titles:56

one in reviews: 357
one in titles:6

time in reviews: 257
time in titles:5

us in reviews: 913
us in titles:84

meal in reviews: 173
meal in titles:11

nice in reviews: 301
nice in titles:8



As we can see, these words are way less used in the titles. Also, by looking at the most common words in the titles like we saw above, they tend to use way more "strong" words, such as "best", "excellent", "rude" and so on. This means, that usually people write the titles not to summarize the contents in the reviews, but rather to grab the attention of the readers.

### Problem 5. Bag of words (1 point)
Based on your findings so far, come up with a good set of settings (hyperparameters) for a bag-of-words model for review titles and contents. It's easiest to treat them separately (so, create two models); but you may also think about a unified representation. I find the simplest way of concatenating the title and content too simplistic to be useful, as it doesn't allow you to treat the title differently (e.g., by giving it more weight).

The documentation for `CountVectorizer` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Familiarize yourself with all settings; try out different combinations and come up with a final model; or rather - two models :).

Lets create two separate CountVectorizer models for the reviews and the titles:

In [24]:
review_vectorizer = CountVectorizer(ngram_range = (1, 2), token_pattern = r"\b\w+\b", stop_words = "english")
review_matrix = review_vectorizer.fit_transform(reviews_data["Review"])

title_vectorizer = CountVectorizer(ngram_range = (1, 2), token_pattern = r"\b\w+\b", stop_words = "english")
title_matrix = title_vectorizer.fit_transform(reviews_data["Review Title"])

### Problem 6. Deep sentiment analysis models (1 point)
Find a suitable model for sentiment analysis in English. Without modifying, training, or fine-tuning the model, make it predict all contents (or better, combinations of titles and contents, if you can). Meaure the accuracy of the model compared to the `sentiment` column in the dataset.

First, I tried using DistilBERT, but unfortunately my laptop could not handle it. Now I`m going to try nlptown. Lets first load the pre-trained tokenizer and model, then move the model to the CPU, because I have an integrated GPU.

In [25]:
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

device = torch.device("cpu")
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

In [26]:
def preprocess(text, max_length=512):
    inputs = tokenizer(
        text, 
        return_tensors="pt", 
        padding="max_length", 
        truncation=True, 
        max_length=max_length
    )
    return inputs

In [29]:
def predict_sentiment(text_series, max_length = 512):
    sentiments = []

    for text in text_series:
        inputs = preprocess(text, max_length = max_length)
        inputs = {key: val.to(device) for key, val in inputs.items()}

        with torch.no_grad():  # Disable gradient calculation for inference
            outputs = model(**inputs)
            probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_class = torch.argmax(probabilities, dim=1).item()

            sentiment_map = {0: "Negative", 1: "Negative", 2: "Positive", 3: "Positive", 4: "Positive"}
            sentiments.append(sentiment_map[predicted_class])
    
    return pd.Series(sentiments, index=text_series.index)

predicted_sentiments = predict_sentiment(reviews_data["Review"])
predicted_sentiments

0       Negative
1       Negative
2       Positive
3       Negative
4       Negative
          ...   
1497    Negative
1498    Negative
1499    Negative
1500    Negative
1501    Negative
Length: 1502, dtype: object

Lets create a new dataset with the predictions and see how many it got right:

In [30]:
reviews_data_with_predictions = reviews_data
reviews_data_with_predictions["Predictions"] = predicted_sentiments
reviews_data_with_predictions

Unnamed: 0,Country,Restaurant Name,Sentiment,Review Title,Review Date,Review,Words Count,Words count,Predictions
0,France,The Frog at Bercy Village,Negative,Rude manager,May 2024 •,The manager became agressive when I said the c...,28.0,28,Negative
1,France,The Frog at Bercy Village,Negative,A big disappointment,Feb 2024 •,"I ordered a beef fillet ask to be done medium,...",57.0,57,Negative
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,Nov 2023 •,"This is an attractive venue with welcoming, al...",40.0,40,Positive
3,France,The Frog at Bercy Village,Negative,Great service and wine but inedible food,Mar 2023 •,Sadly I used the high TripAdvisor rating too ...,278.0,278,Negative
4,France,The Frog at Bercy Village,Negative,Avoid- Worst meal in Rome - possibly ever,Nov 2022 •,From the start this meal was bad- especially g...,249.0,249,Negative
...,...,...,...,...,...,...,...,...,...
1497,Cuba,Old Square (Plaza Vieja),Negative,The Tourism Trap,Oct 2016 •,Despite the other reviews saying that this is ...,146.0,146,Negative
1498,Cuba,Old Square (Plaza Vieja),Negative,the beer factory,Oct 2016 •,beer is good. food is awfull The only decent...,29.0,29,Negative
1499,Cuba,Old Square (Plaza Vieja),Negative,brewery,Oct 2016 •,"for terrible service of a truly comedic level,...",30.0,30,Negative
1500,Cuba,Old Square (Plaza Vieja),Negative,It's nothing exciting over there,Oct 2016 •,We visited the Havana's Club Museum which is l...,71.0,71,Negative


In [36]:
reviews_data_with_predictions[reviews_data_with_predictions["Sentiment"] == reviews_data_with_predictions["Predictions"]].shape

(1447, 9)

Wow, it got most of them right!

### Problem 7. Deep features (embeddings) (1 point)
Use the same model to perform feature extraction on the review contents (or contents + titles) instead of direct predictions. You should already be familiar how to do that from your work on images.

Use the cosine similarity between texts to try to cluster them. Are there "similar" reviews (you'll need to find a way to measure similarity) across different restaurants? Are customers generally in agreement for the same restaurant?

### \* Problem 8. Explore and model at will
In this lab, we focused on preprocessing and feature extraction and we didn't really have a chance to train (or compare) models. The dataset is maybe too small to be conclusive, but feel free to play around with ready-made models, and train your own.