In [58]:
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 
import seaborn as sns
import re
import zipfile
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter
from nltk import pos_tag
from nltk import ngrams
from nltk.corpus import opinion_lexicon
from nltk.corpus import wordnet as wn
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from transformers import pipeline
from sklearn.metrics import accuracy_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import BertModel, BertTokenizer
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
import tensorflow_hub as hub

# Working with Text Lab
## Information retrieval, preprocessing, and feature extraction

In this lab, you'll be looking at and exploring European restaurant reviews. The dataset is rather tiny, but that's just because it has to run on any machine. In real life, just like with images, texts can be several terabytes long.

The dataset is located [here](https://www.kaggle.com/datasets/gorororororo23/european-restaurant-reviews) and as always, it's been provided to you in the `data/` folder.

### Problem 1. Read the dataset (1 point)
Read the dataset, get acquainted with it. Ensure the data is valid before you proceed.

How many observations are there? Which country is the most represented? What time range does the dataset represent?

Is the sample balanced in terms of restaurants, i.e., do you have an equal number of reviews for each one? Most importantly, is the dataset balanced in terms of **sentiment**?

1. First we will load the data using pd.read_csv and create a function to turn the columns into snake_case for easier use of the column names.

In [6]:
restaurant_data = pd.read_csv('data/European Restaurant Reviews.csv')

In [7]:
restaurant_data

Unnamed: 0,Country,Restaurant Name,Sentiment,Review Title,Review Date,Review
0,France,The Frog at Bercy Village,Negative,Rude manager,May 2024 •,The manager became agressive when I said the c...
1,France,The Frog at Bercy Village,Negative,A big disappointment,Feb 2024 •,"I ordered a beef fillet ask to be done medium,..."
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,Nov 2023 •,"This is an attractive venue with welcoming, al..."
3,France,The Frog at Bercy Village,Negative,Great service and wine but inedible food,Mar 2023 •,Sadly I used the high TripAdvisor rating too ...
4,France,The Frog at Bercy Village,Negative,Avoid- Worst meal in Rome - possibly ever,Nov 2022 •,From the start this meal was bad- especially g...
...,...,...,...,...,...,...
1497,Cuba,Old Square (Plaza Vieja),Negative,The Tourism Trap,Oct 2016 •,Despite the other reviews saying that this is ...
1498,Cuba,Old Square (Plaza Vieja),Negative,the beer factory,Oct 2016 •,beer is good. food is awfull The only decent...
1499,Cuba,Old Square (Plaza Vieja),Negative,brewery,Oct 2016 •,"for terrible service of a truly comedic level,..."
1500,Cuba,Old Square (Plaza Vieja),Negative,It's nothing exciting over there,Oct 2016 •,We visited the Havana's Club Museum which is l...


In [8]:
def convert_to_snake_case(column_name):
    """
    Convert a column name to snake_case.

    Lowercases the name and replaces spaces and periods with underscores.

    Parameters:
    column_name (str): The name of the column to be converted.

    Returns:
    str: The converted column name in snake_case format.
    """
    return column_name.lower().replace('.', '_').replace(' ', '_')

In [9]:
restaurant_data_renamed_columns = restaurant_data.rename(columns=lambda col: convert_to_snake_case(col))

In [10]:
restaurant_data_renamed_columns

Unnamed: 0,country,restaurant_name,sentiment,review_title,review_date,review
0,France,The Frog at Bercy Village,Negative,Rude manager,May 2024 •,The manager became agressive when I said the c...
1,France,The Frog at Bercy Village,Negative,A big disappointment,Feb 2024 •,"I ordered a beef fillet ask to be done medium,..."
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,Nov 2023 •,"This is an attractive venue with welcoming, al..."
3,France,The Frog at Bercy Village,Negative,Great service and wine but inedible food,Mar 2023 •,Sadly I used the high TripAdvisor rating too ...
4,France,The Frog at Bercy Village,Negative,Avoid- Worst meal in Rome - possibly ever,Nov 2022 •,From the start this meal was bad- especially g...
...,...,...,...,...,...,...
1497,Cuba,Old Square (Plaza Vieja),Negative,The Tourism Trap,Oct 2016 •,Despite the other reviews saying that this is ...
1498,Cuba,Old Square (Plaza Vieja),Negative,the beer factory,Oct 2016 •,beer is good. food is awfull The only decent...
1499,Cuba,Old Square (Plaza Vieja),Negative,brewery,Oct 2016 •,"for terrible service of a truly comedic level,..."
1500,Cuba,Old Square (Plaza Vieja),Negative,It's nothing exciting over there,Oct 2016 •,We visited the Havana's Club Museum which is l...


In [11]:
restaurant_data.shape

(1502, 6)

2. The shape confirms that there are 1502 observations (rows) and 6 features (columns). The columns of the renamed copy are now in snake_case.
3. Now let's check whether the data contains any NaN values and, if so, in which columns.

In [12]:
missing_values = restaurant_data.isnull().sum()

In [13]:
missing_values

Country            0
Restaurant Name    0
Sentiment          0
Review Title       0
Review Date        0
Review             0
dtype: int64

4. There are no NaN values in the data.
5. Now let's see which country is the most represented and how many reviews each country has.

In [14]:
most_represented_country = restaurant_data_renamed_columns['country'].value_counts().idxmax()
country_counts = restaurant_data_renamed_columns['country'].value_counts()

In [15]:
most_represented_country

'France'

In [16]:
country_counts 

country
France     512
Italy      318
Morroco    210
Cuba       146
Poland     135
Russia     100
India       81
Name: count, dtype: int64

6. France is the most represented country in the set, followed by Italy. Next, let's check whether all reviews from a country belong to a single restaurant.

In [17]:
restaurant_data_renamed_columns['restaurant_name'].unique()

array(['The Frog at Bercy Village',
       'Ad Hoc Ristorante (Piazza del Popolo)', 'Stara Kamienica',
       'Mosaic', 'Pelmenya', 'The LOFT', 'Old Square (Plaza Vieja)'],
      dtype=object)

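7. Yes, each country is represented by exactly one restaurant: there are seven restaurant names for seven countries. One caveat on validity: the title in row 4 ("Avoid- Worst meal in Rome - possibly ever") is attributed to a restaurant in France, so some country labels may be unreliable.
8. Now let's see the time range of the reviews.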

In [18]:
# Create a new DataFrame that copies the original dataset
restaurant_data_processed = restaurant_data_renamed_columns.copy()

# Process the review_date in the new DataFrame
restaurant_data_processed['review_date'] = restaurant_data_processed['review_date'].str.replace('•', '', regex=False).str.strip()

# Convert the cleaned date strings to datetime, assuming the first day of the month
restaurant_data_processed['review_date'] = pd.to_datetime(
    restaurant_data_processed['review_date'],
    format='%b %Y',
    errors='coerce'
)

# Check the time range in the new DataFrame
time_range_processed = restaurant_data_processed['review_date'].min(), restaurant_data_processed['review_date'].max()

# Display the result
print(f"The review dates in the processed data range from {time_range_processed[0]} to {time_range_processed[1]}.")

The review dates in the processed data range from 2010-09-01 00:00:00 to 2024-07-01 00:00:00.
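Because `errors='coerce'` silently turns unparseable strings into `NaT`, and `min()`/`max()` skip those values, it is worth checking how many dates failed to parse. The sketch below mirrors the cleanup on single values using only the standard library; `parse_review_date` is a hypothetical helper, and the `'Sept 2024 •'` and empty samples are made up to show the failure case.

```python
from datetime import datetime

def parse_review_date(raw):
    """Parse strings such as 'May 2024 •' into a datetime, or return None."""
    cleaned = raw.replace('•', '').strip()
    try:
        # '%b %Y' expects an abbreviated English month name and a 4-digit year
        return datetime.strptime(cleaned, '%b %Y')
    except ValueError:
        return None

# The first two samples follow the format seen in the data; the last two are made up
samples = ['May 2024 •', 'Oct 2016 •', 'Sept 2024 •', '']
parsed = [parse_review_date(s) for s in samples]
unparsed = [s for s, d in zip(samples, parsed) if d is None]

print(parsed[0])
print("Could not parse:", unparsed)
```

In the notebook itself, the equivalent check would be `restaurant_data_processed['review_date'].isna().sum()`, which should be 0 if every date was parsed.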


8. The time range of the reviews in the set is between September 2010 and July 2024.

9. Now let's see whether the reviews are evenly distributed across restaurants

In [19]:
restaurant_review_counts = restaurant_data_renamed_columns['restaurant_name'].value_counts()

In [20]:
restaurant_review_counts 

restaurant_name
The Frog at Bercy Village                512
Ad Hoc Ristorante (Piazza del Popolo)    318
The LOFT                                 210
Old Square (Plaza Vieja)                 146
Stara Kamienica                          135
Pelmenya                                 100
Mosaic                                    81
Name: count, dtype: int64

10. Since each country has exactly one restaurant, the per-restaurant counts mirror the per-country counts. The sample is not balanced across restaurants: "The Frog at Bercy Village" has 512 reviews, while "Mosaic" has only 81. A balanced dataset would have roughly the same number of reviews for each restaurant.
11. Now let's see whether the sentiments are balanced.

In [21]:
sentiment_counts = restaurant_data_processed['sentiment'].value_counts()

In [22]:
sentiment_counts

sentiment
Positive    1237
Negative     265
Name: count, dtype: int64

12. The dataset is not balanced in terms of sentiment: there are 1237 positive reviews and only 265 negative ones. A balanced dataset would have roughly equal numbers of both. Here positive reviews make up about 82% of the data, so the dataset is strongly skewed towards positive sentiment.
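To put the imbalance into numbers, the following sketch computes the majority-class baseline, i.e., the accuracy of a model that always predicts the most common label. The counts are copied from the `sentiment_counts` output above; the variable names are only illustrative.

```python
from collections import Counter

# Counts copied from the sentiment_counts output above
sentiment_counts = Counter({'Positive': 1237, 'Negative': 265})

total = sum(sentiment_counts.values())
majority_label, majority_count = sentiment_counts.most_common(1)[0]

# Accuracy of a trivial model that always predicts the majority label
baseline_accuracy = majority_count / total
# How many majority-class reviews there are per minority-class review
imbalance_ratio = majority_count / min(sentiment_counts.values())

print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.1%}")
print(f"Imbalance ratio: {imbalance_ratio:.2f} : 1")
```

Any sentiment model evaluated on this data should be compared against this baseline rather than against 50%, and accuracy alone may hide poor performance on negative reviews.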

### Problem 2. Getting acquainted with reviews (1 point)
Are positive comments typically shorter or longer? Try to define a good, robust metric for "length" of a text; it's not necessarily just the character count. Can you explain your findings?

1. To analyze whether positive comments are typically shorter or longer, I will compute length metrics for each review and compare their averages between positive and negative sentiments.
2. First I will use a simple method that needs no external resources: the total number of characters, and the number of words obtained by splitting the text on whitespace.

In [23]:
def calculate_lengths(text):
    word_count = len(text.split())
    char_count = len(text)
    return word_count, char_count

# Apply the function to calculate word and character counts for each review
restaurant_data_processed['word_count'], restaurant_data_processed['char_count'] = zip(*restaurant_data_processed['review'].apply(calculate_lengths))

# Calculate the average word count and character count for positive and negative reviews
length_summary = restaurant_data_processed.groupby('sentiment')[['word_count', 'char_count']].mean()

In [24]:
length_summary

sentiment,word_count,char_count
Negative,140.573585,761.007547
Positive,50.183508,281.910267


3. The summary shows that negative reviews are, on average, almost three times as long as positive ones, both in word count (about 141 vs. 50) and in character count (about 761 vs. 282). A plausible explanation is that people who had a bad experience tend to describe what went wrong in detail and justify their rating, whereas satisfied customers often leave a short note of praise.

4. Now let's use the nltk library to do the same operation

In [25]:
def calculate_lengths_nltk(text):
    # word_tokenize also emits punctuation marks as separate tokens
    words = nltk.word_tokenize(text)
    word_count = len(words)
    char_count = len(text)
    return word_count, char_count

# Apply the function to calculate token and character counts for each review
restaurant_data_processed['word_count'], restaurant_data_processed['char_count'] = zip(*restaurant_data_processed['review'].apply(calculate_lengths_nltk))

# Calculate the average word count and character count for positive and negative reviews
length_summary_1 = restaurant_data_processed.groupby('sentiment')[['word_count', 'char_count']].mean()

In [26]:
length_summary_1

sentiment,word_count,char_count
Negative,158.05283,761.007547
Positive,57.075182,281.910267


5. This summary shows the same trend: negative reviews tend to be longer than positive ones. Only the word count changes, because `word_tokenize` treats punctuation marks as separate tokens and splits contractions, so it yields more tokens than whitespace splitting. Whichever tokenizer is used, the conclusion is the same.
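Word and character counts are strongly correlated, so a more robust picture of "length" could also include the number of sentences and the average sentence length, and compare medians rather than means (a few very long reviews can inflate the mean). The sketch below is one possible way to do this. It uses a naive regex sentence splitter as a stand-in for `sent_tokenize`, and both the helper `text_length_metrics` and the sample reviews are made up for illustration.

```python
import re
from statistics import median

def text_length_metrics(text):
    """Return word count, sentence count, and average words per sentence."""
    words = text.split()
    # Naive splitter: break after '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    sentence_count = max(len(sentences), 1)  # avoid division by zero
    return {
        'word_count': len(words),
        'sentence_count': sentence_count,
        'avg_sentence_length': len(words) / sentence_count,
    }

# Made-up reviews for illustration only
reviews = [
    "Great food. Friendly staff!",
    "We waited an hour for a table. The waiter was rude. Never again.",
]
metrics = [text_length_metrics(r) for r in reviews]

for m in metrics:
    print(m)
print("Median word count:", median(m['word_count'] for m in metrics))
```

On the real data, the same function could be applied with `restaurant_data_processed['review'].apply(text_length_metrics)` and summarized per sentiment with `.median()` instead of `.mean()`.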

### Problem 3. Preprocess the review content (2 points)
You'll likely need to do this while working on the problems below, but try to synthesize (and document!) your preprocessing here. Your tasks will revolve around words and their connection to sentiment. While preprocessing, keep in mind the domain (restaurant reviews) and the task (sentiment analysis).

In [27]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [28]:
def preprocess_review(review):
    # Lowercase
    review = review.lower()
    
    # Tokenize
    tokens = word_tokenize(review)
    
    # Remove punctuation and non-alphabetic words
    tokens = [word for word in tokens if word.isalpha()]
    
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    
    # Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # Join the tokens back into a single string
    processed_review = ' '.join(tokens)
    
    return processed_review

# Apply preprocessing to the review column
restaurant_data_renamed_columns['cleaned_review'] = restaurant_data_renamed_columns['review'].apply(preprocess_review)

In [29]:
restaurant_data_renamed_columns

Unnamed: 0,country,restaurant_name,sentiment,review_title,review_date,review,cleaned_review
0,France,The Frog at Bercy Village,Negative,Rude manager,May 2024 •,The manager became agressive when I said the c...,manager became agressive said carbonara good r...
1,France,The Frog at Bercy Village,Negative,A big disappointment,Feb 2024 •,"I ordered a beef fillet ask to be done medium,...",ordered beef fillet ask done medium got well d...
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,Nov 2023 •,"This is an attractive venue with welcoming, al...",attractive venue welcoming albeit somewhat slo...
3,France,The Frog at Bercy Village,Negative,Great service and wine but inedible food,Mar 2023 •,Sadly I used the high TripAdvisor rating too ...,sadly used high tripadvisor rating literally f...
4,France,The Frog at Bercy Village,Negative,Avoid- Worst meal in Rome - possibly ever,Nov 2022 •,From the start this meal was bad- especially g...,start meal especially given price visited husb...
...,...,...,...,...,...,...,...
1497,Cuba,Old Square (Plaza Vieja),Negative,The Tourism Trap,Oct 2016 •,Despite the other reviews saying that this is ...,despite review saying place hang especially be...
1498,Cuba,Old Square (Plaza Vieja),Negative,the beer factory,Oct 2016 •,beer is good. food is awfull The only decent...,beer good food awfull decent thing shish kabob...
1499,Cuba,Old Square (Plaza Vieja),Negative,brewery,Oct 2016 •,"for terrible service of a truly comedic level,...",terrible service truly comedic level full pint...
1500,Cuba,Old Square (Plaza Vieja),Negative,It's nothing exciting over there,Oct 2016 •,We visited the Havana's Club Museum which is l...,visited havana club museum located old havana ...
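One thing to keep in mind for sentiment analysis is that standard stopword lists contain negations such as "not" and "no", so removing all stopwords turns "not good" into "good". A possible variant, sketched below, keeps negations. The small `BASE_STOPWORDS` set is only a stand-in for `stopwords.words('english')`, and the `NEGATIONS` set and the function name are assumptions of this example, not part of the preprocessing above.

```python
import re

# Tiny stand-in for stopwords.words('english'); the real NLTK list is much longer
BASE_STOPWORDS = {'the', 'was', 'is', 'a', 'and', 'not', 'no', 'it', 'but'}
# Negations carry sentiment, so this sketch keeps them (an assumption, not a rule)
NEGATIONS = {'not', 'no', 'nor', 'never'}
SENTIMENT_STOPWORDS = BASE_STOPWORDS - NEGATIONS

def preprocess_keep_negations(review):
    """Lowercase, extract alphabetic runs, and drop stopwords except negations."""
    tokens = re.findall(r'[a-z]+', review.lower())
    return ' '.join(t for t in tokens if t not in SENTIMENT_STOPWORDS)

print(preprocess_keep_negations("The food was not good, and the service was no better."))
```

With the NLTK list, the same idea would be `stop_words = set(stopwords.words('english')) - NEGATIONS`, leaving the rest of `preprocess_review` unchanged.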


### Problem 3. Top words (1 point)
Use a simple word tokenization and count the top 10 words in positive reviews; then the top 10 words in negative reviews*. Once again, try to define what "top" words means. Describe and document your process. Explain your results.

\* Okay, you may want to see top N words (with $N \ge 10$).

1. Let's create a function to count the most frequent words used in positive and negative reviews. 

In [30]:
def get_top_words(reviews, top_n=10):
    all_words = ' '.join(reviews).split()
    word_freq = Counter(all_words)
    return word_freq.most_common(top_n)

# Separate positive and negative reviews
positive_reviews = restaurant_data_renamed_columns[restaurant_data_renamed_columns['sentiment'] == 'Positive']['cleaned_review']
negative_reviews = restaurant_data_renamed_columns[restaurant_data_renamed_columns['sentiment'] == 'Negative']['cleaned_review']

# Get top words
top_positive_words = get_top_words(positive_reviews, top_n=10)
top_negative_words = get_top_words(negative_reviews, top_n=10)

In [31]:
top_positive_words

[('food', 746),
 ('great', 571),
 ('service', 545),
 ('good', 513),
 ('restaurant', 434),
 ('place', 398),
 ('nice', 307),
 ('wine', 302),
 ('menu', 265),
 ('staff', 259)]

In [32]:
top_negative_words

[('restaurant', 251),
 ('food', 249),
 ('u', 209),
 ('wine', 204),
 ('table', 172),
 ('good', 153),
 ('menu', 152),
 ('service', 146),
 ('one', 142),
 ('would', 130)]

2. Many of the top words are domain words such as food, service, restaurant, and wine, and they appear in both lists. A few sentiment-bearing words such as great, good, and nice also show up. The token "u" is most likely an artifact of lemmatization: "us" is not in the NLTK stopword list, and the WordNet lemmatizer strips its final "s". The most frequent words are therefore not necessarily the most informative ones. For sentiment, it makes more sense to exclude the shared domain words and to focus on adjectives, or on bigrams and trigrams such as "great service".
3. Let's begin by editing the function so that it can exclude words of our choice.

In [33]:
def get_top_words(reviews, top_n=10, exclude_words=None):
    all_words = ' '.join(reviews).split()
    word_freq = Counter(all_words)
    
    # If there are words to exclude, filter them out
    if exclude_words:
        word_freq = Counter({word: count for word, count in word_freq.items() if word not in exclude_words})
    
    return word_freq.most_common(top_n)

# Separate positive and negative reviews
positive_reviews = restaurant_data_renamed_columns[restaurant_data_renamed_columns['sentiment'] == 'Positive']['cleaned_review']
negative_reviews = restaurant_data_renamed_columns[restaurant_data_renamed_columns['sentiment'] == 'Negative']['cleaned_review']

# Common words to exclude for positive reviews
exclude_words = set(["food", "service", "restaurant", "place", "wine", "menu"])

# Get top words after filtering
top_positive_words_filtered = get_top_words(positive_reviews, top_n=10, exclude_words=exclude_words)

# Display the results
print("Top 10 filtered words in positive reviews:", top_positive_words_filtered)

Top 10 filtered words in positive reviews: [('great', 571), ('good', 513), ('nice', 307), ('staff', 259), ('friendly', 234), ('excellent', 231), ('delicious', 223), ('really', 217), ('u', 217), ('time', 203)]


4. This looks better: most of the words are now adjectives that we can connect with positive reviews. Let's try the same for negative reviews.

In [34]:
# get_top_words is reused as defined in the previous cell

# Common words to exclude for negative reviews
exclude_words_negative = set(["food", "service", "restaurant", "place", "wine", "menu"])

# Get top words after filtering for negative reviews
top_negative_words_filtered = get_top_words(negative_reviews, top_n=10, exclude_words=exclude_words_negative)

# Display the results
print("Top 10 filtered words in negative reviews:", top_negative_words_filtered)

Top 10 filtered words in negative reviews: [('u', 209), ('table', 172), ('good', 153), ('one', 142), ('would', 130), ('waitress', 110), ('rome', 107), ('truffle', 105), ('meal', 103), ('could', 101)]


5. This doesn't work as well for negative reviews: the list is dominated by neutral words such as table, one, and would. For me, the top words are not the ones that occur most often but the ones that carry sentiment. We will use two methods to find them.
6. The first method uses VADER, a popular lexicon-based sentiment analysis tool that measures the positivity, negativity, and neutrality of text. It's often used for social media text but can work well with review data.
7. The second method focuses specifically on adjectives or adverbs, identified with part-of-speech tagging, that carry negative sentiment.

In [35]:
sid = SentimentIntensityAnalyzer()

def get_top_negative_words_vader(reviews, top_n=10):
    word_scores = {}
    
    for review in reviews:
        tokens = review.split()
        for word in tokens:
            score = sid.polarity_scores(word)
            if score['compound'] < 0:  # Negative sentiment score
                word_scores[word] = word_scores.get(word, 0) + 1
                
    word_freq = Counter(word_scores)
    return word_freq.most_common(top_n)

# Get top negative words using VADER
top_negative_words_vader = get_top_negative_words_vader(negative_reviews, top_n=20)

print("Top 20 negative words using VADER:", top_negative_words_vader)

Top 20 negative words using VADER: [('bad', 40), ('terrible', 36), ('disappointed', 35), ('problem', 30), ('empty', 29), ('poor', 28), ('hard', 27), ('disappointing', 22), ('sorry', 19), ('awful', 18), ('wrong', 17), ('sadly', 16), ('rude', 15), ('worst', 14), ('disappointment', 14), ('unfortunately', 14), ('charged', 12), ('pay', 12), ('rotten', 12), ('avoid', 11)]


In [36]:
def get_top_negative_adjectives(reviews, top_n=10):
    negative_adjectives = []

    for review in reviews:
        tokens = word_tokenize(review)
        tagged = pos_tag(tokens)
        for word, tag in tagged:
            if tag == 'JJ':  # Adjective
                score = sid.polarity_scores(word)
                if score['compound'] < 0:  # Check if the word has negative sentiment
                    negative_adjectives.append(word)

    word_freq = Counter(negative_adjectives)
    return word_freq.most_common(top_n)

# Get top negative adjectives
top_negative_adjectives = get_top_negative_adjectives(negative_reviews, top_n=20)

In [37]:
top_negative_adjectives

[('bad', 40),
 ('terrible', 36),
 ('poor', 28),
 ('empty', 27),
 ('hard', 25),
 ('awful', 18),
 ('disappointed', 18),
 ('wrong', 16),
 ('disappointing', 16),
 ('rotten', 12),
 ('sorry', 11),
 ('uncomfortable', 10),
 ('negative', 9),
 ('rude', 9),
 ('sad', 7),
 ('stop', 6),
 ('unpleasant', 6),
 ('horrible', 6),
 ('aggressive', 6),
 ('low', 5)]

8. Both methods agree that bad, terrible, poor, disappointed, disappointing, and awful are the top words we can connect with negative reviews.
9. Now let's do the same two methods for positive words to see whether they are the same as the ones we originally got.

In [38]:
sia = SentimentIntensityAnalyzer()

def extract_sentiment_words(reviews, sentiment='pos'):
    sentiment_words = []
    for review in reviews:
        tokens = nltk.word_tokenize(review)
        tagged_tokens = nltk.pos_tag(tokens)
        
        for word, tag in tagged_tokens:
            # Only consider adjectives or adverbs
            if tag.startswith('JJ') or tag.startswith('RB'):
                if (sentiment == 'pos' and sia.polarity_scores(word)['compound'] > 0) or \
                   (sentiment == 'neg' and sia.polarity_scores(word)['compound'] < 0):
                    sentiment_words.append(word)
    
    word_freq = Counter(sentiment_words)
    return word_freq.most_common(20)

# Get the top positive words from positive reviews
positive_reviews = restaurant_data_renamed_columns[restaurant_data_renamed_columns['sentiment'] == 'Positive']['cleaned_review']
top_positive_words = extract_sentiment_words(positive_reviews, sentiment='pos')

print("Top 20 positive sentiment words in positive reviews:", top_positive_words)

Top 20 positive sentiment words in positive reviews: [('great', 571), ('good', 502), ('nice', 292), ('friendly', 234), ('delicious', 221), ('excellent', 185), ('lovely', 174), ('best', 167), ('well', 157), ('definitely', 115), ('wonderful', 96), ('special', 86), ('fantastic', 76), ('fresh', 72), ('happy', 64), ('beautiful', 62), ('perfect', 60), ('helpful', 53), ('amazing', 51), ('sure', 43)]


10. Most of the words match the earlier list, but this list is more accurate than the first attempt, which included neutral words such as staff. Combining VADER with part-of-speech tags is a good way to extract words for sentiment analysis, because it keeps the adjectives and adverbs that carry positive or negative sentiment.
11. Top positive words: great, good, nice, friendly. Top negative words: bad, terrible, disappointed, poor.
12. We could also look at n-grams (combinations of words), but these results satisfy my requirements for now. I can come back to n-grams later if the sentiment analysis needs better accuracy.
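A lexicon-free way to define "top" words, not used in the analysis above, is to rank words by how much more frequent they are in one class than in the other. The sketch below computes a smoothed log-ratio of relative frequencies. The function name, the `min_count` cutoff, and the toy reviews are all assumptions made for illustration.

```python
from collections import Counter
import math

def distinctive_words(target_docs, other_docs, top_n=5, min_count=2):
    """Rank words by smoothed log-ratio of relative frequency in target vs. other."""
    target = Counter(' '.join(target_docs).split())
    other = Counter(' '.join(other_docs).split())
    vocab_size = len(set(target) | set(other))
    target_total = sum(target.values())
    other_total = sum(other.values())

    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # skip rare words whose ratio is unreliable
        # Add-one smoothing avoids division by zero for words unseen in the other class
        p_target = (count + 1) / (target_total + vocab_size)
        p_other = (other[word] + 1) / (other_total + vocab_size)
        scores[word] = math.log(p_target / p_other)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Made-up cleaned reviews for illustration only
negative = ["food bad service rude", "bad food cold table", "rude waitress food"]
positive = ["food great service friendly", "great food nice staff", "food delicious"]

print(distinctive_words(negative, positive, top_n=3))
```

Words that are frequent in both classes, such as "food", get a score near zero and drop down the ranking without needing a hand-written exclusion list.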

### Problem 4. Review titles (2 point)
How do the top words you found in the last problem correlate to the review titles? Do the top 10 words (for each sentiment) appear in the titles at all? Do reviews which contain one or more of the top words have the same words in their titles?

Does the title of a comment present a good summary of its content? That is, are the titles descriptive, or are they simply meant to catch the attention of the reader?

1. Let's create a function to check whether each of our top words appears in the review titles.

In [39]:
def word_in_titles(word, titles):
    return titles.str.contains(word, case=False, regex=False).any()

# Check if top positive words appear in titles
positive_titles = restaurant_data_renamed_columns[restaurant_data_renamed_columns['sentiment'] == 'Positive']['review_title']
positive_words_in_titles = {word: word_in_titles(word, positive_titles) for word, _ in top_positive_words}

# Check if top negative words appear in titles
negative_titles = restaurant_data_renamed_columns[restaurant_data_renamed_columns['sentiment'] == 'Negative']['review_title']
negative_words_in_titles = {word: word_in_titles(word, negative_titles) for word, _ in top_negative_words_vader}

In [40]:
positive_words_in_titles

{'great': True,
 'good': True,
 'nice': True,
 'friendly': True,
 'delicious': True,
 'excellent': True,
 'lovely': True,
 'best': True,
 'well': True,
 'definitely': True,
 'wonderful': True,
 'special': True,
 'fantastic': True,
 'fresh': True,
 'happy': True,
 'beautiful': True,
 'perfect': True,
 'helpful': True,
 'amazing': True,
 'sure': True}

In [41]:
negative_words_in_titles

{'bad': True,
 'terrible': True,
 'disappointed': True,
 'problem': False,
 'empty': False,
 'poor': True,
 'hard': False,
 'disappointing': True,
 'sorry': True,
 'awful': True,
 'wrong': False,
 'sadly': False,
 'rude': True,
 'worst': True,
 'disappointment': True,
 'unfortunately': True,
 'charged': False,
 'pay': False,
 'rotten': True,
 'avoid': True}
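Note that `str.contains(..., regex=False)` performs substring matching, so "sure" also matches "pleasure" and "well" matches "farewell". A stricter whole-word check, shown here as a sketch with the standard library, uses word boundaries. The helper name and the sample titles are made up for illustration.

```python
import re

def count_titles_with_word(word, titles):
    """Count titles that contain `word` as a whole word, ignoring case."""
    pattern = re.compile(rf'\b{re.escape(word)}\b', flags=re.IGNORECASE)
    return sum(1 for title in titles if pattern.search(title))

# Made-up titles for illustration only
titles = ["A real pleasure", "Not sure I would return", "Well done", "Farewell dinner"]

for word in ('sure', 'well'):
    substring_hits = sum(1 for t in titles if word in t.lower())
    whole_word_hits = count_titles_with_word(word, titles)
    print(word, substring_hits, whole_word_hits)
```

With pandas, the same pattern can be passed as `titles.str.contains(rf'\b{re.escape(word)}\b', case=False, regex=True)`. Counting the matches with `.sum()` instead of reducing them with `.any()` also shows how common each word is, not just whether it occurs at all.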

2. All 20 top positive words appear in at least one positive title, whereas 7 of the 20 top negative words never appear in a negative title. Keep in mind that there are far fewer negative titles (265 vs. 1237), so a given word has fewer chances to show up.
3. Now let's see which of our top words occur in both the title and the review.

In [42]:
def word_in_review_and_title(word, reviews, titles):
    reviews_with_word = reviews.str.contains(word, case=False, regex=False)
    titles_with_word = titles.str.contains(word, case=False, regex=False)
    return (reviews_with_word & titles_with_word).sum()

# Analyze correlation for positive words
positive_words_correlation = {word: word_in_review_and_title(word, positive_reviews, positive_titles) for word, _ in top_positive_words}

# Analyze correlation for negative words
negative_words_correlation = {word: word_in_review_and_title(word, negative_reviews, negative_titles) for word, _ in top_negative_words_vader}


In [43]:
positive_words_correlation

{'great': 126,
 'good': 63,
 'nice': 25,
 'friendly': 15,
 'delicious': 31,
 'excellent': 36,
 'lovely': 17,
 'best': 32,
 'well': 1,
 'definitely': 0,
 'wonderful': 17,
 'special': 4,
 'fantastic': 14,
 'fresh': 5,
 'happy': 2,
 'beautiful': 13,
 'perfect': 7,
 'helpful': 0,
 'amazing': 30,
 'sure': 1}

In [44]:
negative_words_correlation

{'bad': 2,
 'terrible': 7,
 'disappointed': 1,
 'problem': 0,
 'empty': 0,
 'poor': 2,
 'hard': 0,
 'disappointing': 2,
 'sorry': 0,
 'awful': 3,
 'wrong': 0,
 'sadly': 0,
 'rude': 7,
 'worst': 2,
 'disappointment': 1,
 'unfortunately': 0,
 'charged': 0,
 'pay': 0,
 'rotten': 4,
 'avoid': 1}

4. Many words are rarely used in both the title and the review of the same entry. Most positive words co-occur at least a few times (great leads with 126), while for negative words the counts are much lower: none exceeds 7, and 9 of the 20 never co-occur.
5. Now let's see whether the review titles are descriptive or mainly meant to grab attention. I will look at three things. First, the word overlap between title and review: a high overlap suggests that the title summarizes the review. Second, the sentiment consistency: whether the title has the same polarity as the review (e.g., negative title and negative review) rather than the opposite. Finally, whether the titles contain attention-grabbing words such as "amazing" or "terrible".
6. Let's begin with the word overlap.

In [45]:
def calculate_word_overlap(row):
    title_words = set(row['review_title'].split())
    review_words = set(row['cleaned_review'].split())
    overlap = title_words.intersection(review_words)
    return len(overlap) / len(title_words) if len(title_words) > 0 else 0

# Apply the function to calculate word overlap for each review
restaurant_data_renamed_columns['word_overlap'] = restaurant_data_renamed_columns.apply(calculate_word_overlap, axis=1)

# Analyze the average overlap
average_overlap = restaurant_data_renamed_columns['word_overlap'].mean()
print(f"Average word overlap between titles and reviews: {average_overlap:.2%}")


Average word overlap between titles and reviews: 12.85%
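The overlap above compares raw title words (original case and punctuation) with the lowercased, cleaned review, so a title word such as "Rude" cannot match "rude". The following sketch shows how normalizing the title changes the result for a single pair. The title is taken from the first row of the data, the cleaned review text is a made-up example, and `normalized_word_overlap` is a hypothetical helper.

```python
import re

def normalized_word_overlap(title, cleaned_review):
    """Share of title words that also occur in the review, ignoring case and punctuation."""
    title_words = set(re.findall(r'[a-z]+', title.lower()))
    review_words = set(cleaned_review.split())
    if not title_words:
        return 0.0
    return len(title_words & review_words) / len(title_words)

# Title from the first row above; the review text is a made-up cleaned example
title = "Rude manager"
cleaned_review = "manager became rude said carbonara good"

# Same computation as calculate_word_overlap: raw title tokens vs. cleaned review tokens
raw_title_words = set(title.split())
raw_overlap = len(raw_title_words & set(cleaned_review.split())) / len(raw_title_words)

print("Raw overlap:", raw_overlap)
print("Normalized overlap:", normalized_word_overlap(title, cleaned_review))
```

A fuller fix would pass the title through the same `preprocess_review` function as the review, so that stopwords and inflected forms are treated identically on both sides.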


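7. On average, only about 13% of a title's words also appear in the review, which suggests that titles only partially reflect the content. This figure is probably an underestimate, though: the title is compared in its raw form (original case, punctuation, stopwords), while the review has been lowercased, stripped of stopwords, and lemmatized. A capitalized title word such as "Rude" can never match "rude" in the cleaned review.
8. Let's see the sentiment consistency.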

In [46]:
def calculate_sentiment_consistency(row):
    title_sentiment = sid.polarity_scores(row['review_title'])['compound']
    review_sentiment = sid.polarity_scores(row['cleaned_review'])['compound']
    return 1 if (title_sentiment > 0 and review_sentiment > 0) or (title_sentiment < 0 and review_sentiment < 0) else 0

# Apply the function to calculate sentiment consistency for each review
restaurant_data_renamed_columns['sentiment_consistency'] = restaurant_data_renamed_columns.apply(calculate_sentiment_consistency, axis=1)

# Analyze the average consistency
average_consistency = restaurant_data_renamed_columns['sentiment_consistency'].mean()
print(f"Sentiment consistency between titles and reviews: {average_consistency:.2%}")


Sentiment consistency between titles and reviews: 67.84%
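The consistency function counts a pair as inconsistent whenever either compound score is exactly 0, which can easily happen for short titles that contain no lexicon words. A possible refinement is to label scores as positive, negative, or neutral and to report neutral titles separately. The sketch below uses the commonly cited ±0.05 cutoff for VADER compound scores; the function names and the sample scores are made up for illustration.

```python
def label_from_compound(score, threshold=0.05):
    """Map a VADER compound score to 'pos', 'neg', or 'neu'."""
    if score >= threshold:
        return 'pos'
    if score <= -threshold:
        return 'neg'
    return 'neu'

def compare_sentiment(title_score, review_score):
    """Return 'agree', 'disagree', or 'neutral_title' for a pair of compound scores."""
    title_label = label_from_compound(title_score)
    review_label = label_from_compound(review_score)
    if title_label == 'neu':
        return 'neutral_title'
    return 'agree' if title_label == review_label else 'disagree'

# Made-up (title, review) compound scores for illustration only
pairs = [(0.62, 0.91), (0.0, 0.85), (-0.48, 0.30)]
print([compare_sentiment(t, r) for t, r in pairs])
```

Separating neutral titles from genuine disagreements gives a fairer picture of how often a title actually contradicts its review.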


9. Titles reflect the sentiment of the review in about two thirds of the cases. The remaining third is not necessarily contradictory: the function also counts a pair as inconsistent when VADER gives either text a compound score of exactly 0, which can easily happen for short titles. Let's now look at the attention-grabbing words.

In [47]:
attention_grabbing_words = {"amazing", "terrible", "best", "worst", "fantastic", "awful", "must-visit", "avoid"}

def contains_attention_grabbing_words(title):
    title_words = set(title.split())
    return 1 if title_words.intersection(attention_grabbing_words) else 0

# Apply the function to check for attention-grabbing words in each title
restaurant_data_renamed_columns['attention_grabbing'] = restaurant_data_renamed_columns['review_title'].apply(contains_attention_grabbing_words)

# Calculate the percentage of titles with attention-grabbing words
percentage_attention_grabbing = restaurant_data_renamed_columns['attention_grabbing'].mean()
print(f"Percentage of titles with attention-grabbing words: {percentage_attention_grabbing:.2%}")


Percentage of titles with attention-grabbing words: 3.79%
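The check above compares raw, whitespace-separated tokens, so capitalised or punctuated words are missed. A minimal sketch of a normalised variant follows; the function name and the regular expression are assumptions, not part of the original analysis. It lowercases the title and extracts alphabetic tokens while keeping inner hyphens, so that "must-visit" still matches.

```python
import re

ATTENTION_WORDS = {"amazing", "terrible", "best", "worst", "fantastic",
                   "awful", "must-visit", "avoid"}


def contains_attention_word_normalized(title):
    """Return 1 if the lowercased, punctuation-free title contains a listed word."""
    # Alphabetic runs, optionally joined by single hyphens ("must-visit")
    tokens = set(re.findall(r"[a-z]+(?:-[a-z]+)*", title.lower()))
    return 1 if tokens & ATTENTION_WORDS else 0


# Titles taken from the rows shown later in this notebook
print(contains_attention_word_normalized("Avoid- Worst meal in Rome - possibly ever"))
print(contains_attention_word_normalized("Rude manager"))
```

Applied to the whole `review_title` column, this would most likely give a higher percentage than the case-sensitive version, since titles often start with a capital letter.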


10. Surprisingly, only 3.79% of titles contain one of the listed attention-grabbing words, which suggests that review titles are not focused purely on grabbing attention. Note that the check is case-sensitive and does not strip punctuation, so a title such as "Avoid- Worst meal in Rome - possibly ever" is not flagged (its `attention_grabbing` value is 0); the true share is therefore probably higher.
11. Let's review our answers and summarize them.
12. The low word overlap and moderate sentiment consistency indicate that, while titles give a general idea of the review, they often do not go into specifics. Titles might focus on key points or emotions rather than summarizing the entire review.
13. In 32.16% of the reviews the sentiment of the title does not match the sentiment of the body, which suggests that some titles might be misleading, either intentionally or unintentionally. This figure also includes cases where either VADER score is exactly 0. Such inconsistency can affect how readers perceive a review, especially if the title is more positive or more negative than the actual text.
14. The low incidence of attention-grabbing words suggests that most reviews prioritize content over sensationalism, keeping in mind that the count above is likely an underestimate. This could indicate that the reviews are more reliable and thoughtful, aiming to genuinely inform rather than just attract attention.
15. In general, the data suggests that titles are neither overly sensational nor deeply descriptive. This balance might be a way to engage readers without resorting to clickbait, but it also means that readers need to read the full review to get a complete picture.

### Problem 5. Bag of words (1 point)
Based on your findings so far, come up with a good set of settings (hyperparameters) for a bag-of-words model for review titles and contents. It's easiest to treat them separately (so, create two models); but you may also think about a unified representation. I find the simplest way of concatenating the title and content too simplistic to be useful, as it doesn't allow you to treat the title differently (e.g., by giving it more weight).

The documentation for `CountVectorizer` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Familiarize yourself with all settings; try out different combinations and come up with a final model; or rather - two models :).

In [48]:
# Bag-of-Words for Titles
title_vectorizer = CountVectorizer(
    ngram_range=(1, 2),
    max_features=300,
    min_df=2,
    max_df=0.85,
    stop_words='english',
    binary=True
)

title_bow = title_vectorizer.fit_transform(restaurant_data_renamed_columns['review_title'])

# Bag-of-Words for Review Contents
content_vectorizer = CountVectorizer(
    ngram_range=(1, 3),
    max_features=1000,
    min_df=3,
    max_df=0.75,
    stop_words='english',
    binary=False
)

content_bow = content_vectorizer.fit_transform(restaurant_data_renamed_columns['review'])

In [49]:
title_bow 

<1502x300 sparse matrix of type '<class 'numpy.int64'>'
	with 3786 stored elements in Compressed Sparse Row format>

In [50]:
content_bow 

<1502x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 33695 stored elements in Compressed Sparse Row format>
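The problem statement mentions a unified representation in which the title gets more weight than the content. One way this could be sketched, under the assumption that a simple scalar weight on the title columns is acceptable, is to scale the title matrix and stack it next to the content matrix with `scipy.sparse.hstack`. The toy texts and the `title_weight` value are made up for illustration.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy data (not from the dataset)
titles = ["Great pizza", "Awful service", "Great service"]
reviews = [
    "The pizza was great and the staff were friendly",
    "The service was awful and the food was cold",
    "Friendly staff and great service overall",
]

title_vec = CountVectorizer(binary=True)
review_vec = CountVectorizer(stop_words="english")

title_counts = title_vec.fit_transform(titles)
review_counts = review_vec.fit_transform(reviews)

# Hypothetical weight: each title term counts three times as much as a content term
title_weight = 3.0
combined = hstack([title_counts * title_weight, review_counts]).tocsr()

print(combined.shape)
print(combined[0, title_vec.vocabulary_["great"]])
```

The title columns come first in `combined`, so `title_vec.vocabulary_` still indexes them directly; content columns are offset by `title_counts.shape[1]`.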

### Problem 6. Deep sentiment analysis models (1 point)
Find a suitable model for sentiment analysis in English. Without modifying, training, or fine-tuning the model, make it predict all contents (or better, combinations of titles and contents, if you can). Measure the accuracy of the model compared to the `sentiment` column in the dataset.

1. After many unsuccessful attempts to get torch working, I decided to use VADER (a rule-based, lexicon-driven analyzer rather than a deep model) for the sentiment predictions. Let's create a function that returns VADER's compound score. It will be applied to the title and review combined, and the sentiment label will be derived from the compound score.

In [59]:
def get_vader_sentiment(text):
    sentiment_score = sid.polarity_scores(text)
    # VADER gives a compound score that summarizes the sentiment
    return sentiment_score['compound']

# Combine title and content into a single text per review
restaurant_data_renamed_columns['combined_text'] = restaurant_data_renamed_columns['review_title'] + " " + restaurant_data_renamed_columns['review']

# Apply VADER sentiment analysis
restaurant_data_renamed_columns['vader_sentiment'] = restaurant_data_renamed_columns['combined_text'].apply(get_vader_sentiment)

# Label by the sign of the compound score; a neutral score of exactly 0 is labelled 'Positive'
restaurant_data_renamed_columns['vader_sentiment_label'] = restaurant_data_renamed_columns['vader_sentiment'].apply(lambda score: 'Positive' if score >= 0 else 'Negative')


2. Let's see how accurate it is

In [60]:
# Calculate accuracy
accuracy = (restaurant_data_renamed_columns['vader_sentiment_label'] == restaurant_data_renamed_columns['sentiment']).mean()
print(f"VADER Sentiment Analysis Accuracy: {accuracy:.2%}")


VADER Sentiment Analysis Accuracy: 89.68%
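A single accuracy number can hide class-specific errors, particularly if one sentiment is much more frequent than the other. As a sketch of what could be checked in addition, the block below computes a confusion matrix and per-class recall with scikit-learn on toy labels (the labels are invented; in the notebook they would come from the `sentiment` and `vader_sentiment_label` columns).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

labels = ["Negative", "Positive"]

# Toy ground truth and predictions (not from the dataset)
y_true = ["Negative", "Negative", "Positive", "Positive", "Positive"]
y_pred = ["Negative", "Positive", "Positive", "Positive", "Positive"]

accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
# average=None returns one recall value per class
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

print(f"Accuracy: {accuracy:.2%}")
print(cm)
print(dict(zip(labels, per_class_recall)))
```

In this toy case the overall accuracy looks fine while half of the negative reviews are missed, which is the kind of pattern worth checking on the real data.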


3. An accuracy of 89.68% is pretty good, but let's also check some examples to see where the predictions are right and where they go wrong.

In [61]:
# Select a subset of the columns to compare predictions with actual sentiments
comparison_df = restaurant_data_renamed_columns[['combined_text', 'sentiment', 'vader_sentiment_label']]

# Optionally, display the first few rows to see the comparison
print(comparison_df.head(10))  # Show the first 10 rows for comparison


                                       combined_text sentiment  \
0  Rude manager The manager became agressive when...  Negative   
1  A big disappointment I ordered a beef fillet a...  Negative   
2  Pretty Place with Bland Food This is an attrac...  Negative   
3  Great service and wine but inedible food Sadly...  Negative   
4  Avoid- Worst meal in Rome - possibly ever From...  Negative   
5  Shocking management, TERRIBLE service by mum a...  Negative   
6  We tired the tasting menu - avoid We tired the...  Negative   
7  Huge Disappointment This restaurant’s high rat...  Negative   
8  Expensive mediocre food and service We got the...  Negative   
9  all around awful My wife and I booked well in ...  Negative   

  vader_sentiment_label  
0              Negative  
1              Negative  
2              Negative  
3              Positive  
4              Negative  
5              Negative  
6              Negative  
7              Negative  
8              Negative  
9            

4. As can be seen, the predictions are mostly right, but VADER does get mixed up. For the review at index 3 (the fourth row), it predicts Positive: the title starts with "Great service and wine", and the "but inedible food" part apparently does not carry enough weight. A likely explanation is that the positive words in the combined text outweigh the negative ones.
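To look at errors systematically rather than only at the first ten rows, the disagreements can be filtered out directly. The sketch below uses a small toy frame built from three rows shown in this notebook (titles, labels, and compound scores are copied from the dataframe output further down); the variable names are hypothetical.

```python
import pandas as pd

# Three rows copied from the notebook's dataframe output
toy = pd.DataFrame({
    "review_title": ["Rude manager",
                     "Great service and wine but inedible food",
                     "The Tourism Trap"],
    "sentiment": ["Negative", "Negative", "Negative"],
    "vader_sentiment": [-0.9460, 0.9967, 0.9262],
    "vader_sentiment_label": ["Negative", "Positive", "Positive"],
})

# Keep only rows where the prediction disagrees with the dataset label
mismatches = toy[toy["sentiment"] != toy["vader_sentiment_label"]]

# Most confidently wrong predictions first (largest absolute compound score)
mismatches = mismatches.sort_values(
    "vader_sentiment", key=lambda s: s.abs(), ascending=False
)

print(mismatches[["review_title", "sentiment", "vader_sentiment"]])
```

Sorting by the absolute compound score puts the cases where VADER was most certain and still wrong at the top, which are usually the most informative ones to read.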

### Problem 7. Deep features (embeddings) (1 point)
Use the same model to perform feature extraction on the review contents (or contents + titles) instead of direct predictions. You should already be familiar how to do that from your work on images.

Use the cosine similarity between texts to try to cluster them. Are there "similar" reviews (you'll need to find a way to measure similarity) across different restaurants? Are customers generally in agreement for the same restaurant?

1. VADER is lexicon-based and does not produce embeddings, so we will use the pre-trained Universal Sentence Encoder (USE) model to embed the texts instead.

In [54]:
texts = (restaurant_data_renamed_columns['review_title'] + " " + restaurant_data_renamed_columns['review']).tolist()

# Load the USE model
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Compute embeddings
embeddings = embed(texts)

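# Calculate cosine similarity (USE embeddings are approximately unit-length,
# so the inner product is approximately the cosine similarity)
cosine_sim_matrix = np.inner(embeddings, embeddings)

# Clustering
from sklearn.cluster import AgglomerativeClustering
# Note: `affinity` was renamed to `metric` in scikit-learn 1.2 and removed in 1.4;
# on newer versions pass metric='precomputed' instead
clustering = AgglomerativeClustering(n_clusters=5, affinity='precomputed', linkage='complete')
clustering.fit(1 - cosine_sim_matrix)  # 1 - cosine similarity to use as distance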

# Add cluster labels to the dataframe
restaurant_data_renamed_columns['cluster'] = clustering.labels_
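The similarity-to-distance step in the cell above can be checked on a toy example that does not need the USE model. This is a sketch under stated assumptions: the four 2-D vectors stand in for real embeddings, and the `try`/`except` is only there because the keyword for a precomputed distance matrix is `metric` in newer scikit-learn releases and `affinity` in older ones.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

# Toy "embeddings": two vectors pointing right, two pointing up
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

# cosine_similarity normalises the rows itself, so inputs need not be unit-length
similarity = cosine_similarity(toy_embeddings)

# Cosine distance; clip tiny negative values caused by floating-point rounding
distance = np.clip(1.0 - similarity, 0.0, None)
np.fill_diagonal(distance, 0.0)

try:
    model = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="complete")
except TypeError:
    # Older scikit-learn versions only know the `affinity` keyword
    model = AgglomerativeClustering(n_clusters=2, affinity="precomputed", linkage="complete")

labels = model.fit_predict(distance)
print(labels)
```

The two vectors pointing in roughly the same direction end up in the same cluster; which cluster gets label 0 and which gets label 1 is arbitrary.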



1. First, the `review_title` and `review` columns of the DataFrame are combined into a single text string per row and converted into a list, which is the input for the embedding.
2. Then the Universal Sentence Encoder (USE), a pre-trained model from TensorFlow Hub, converts each text (title plus review) into a high-dimensional vector (embedding). These embeddings capture semantic information about the text, so similar texts get similar embeddings.
3. Next, the cosine similarity between every pair of text embeddings is computed. Cosine similarity measures how similar two texts are in terms of their direction in the vector space, regardless of their magnitude.
4. Agglomerative clustering (a type of hierarchical clustering) with complete linkage then groups the reviews into `n_clusters=5` clusters. It uses the cosine distance (i.e., 1 - cosine similarity) as the distance measure, so more similar reviews (higher cosine similarity) are closer together in the clustering process.
5. Finally, a new column (`cluster`) is added to the DataFrame, assigning each review a cluster label (0 through 4). These labels indicate which cluster each review belongs to.
6. Let's see what we got in the dataframe.

In [73]:
restaurant_data_renamed_columns

Unnamed: 0,country,restaurant_name,sentiment,review_title,review_date,review,cleaned_review,word_overlap,sentiment_consistency,attention_grabbing,combined_text,vader_sentiment,vader_sentiment_label,cluster
0,France,The Frog at Bercy Village,Negative,Rude manager,May 2024 •,The manager became agressive when I said the c...,manager became agressive said carbonara good r...,0.500000,1,0,Rude manager The manager became agressive when...,-0.9460,Negative,3
1,France,The Frog at Bercy Village,Negative,A big disappointment,Feb 2024 •,"I ordered a beef fillet ask to be done medium,...",ordered beef fillet ask done medium got well d...,0.000000,1,0,A big disappointment I ordered a beef fillet a...,-0.7684,Negative,3
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,Nov 2023 •,"This is an attractive venue with welcoming, al...",attractive venue welcoming albeit somewhat slo...,0.000000,1,0,Pretty Place with Bland Food This is an attrac...,-0.5112,Negative,0
3,France,The Frog at Bercy Village,Negative,Great service and wine but inedible food,Mar 2023 •,Sadly I used the high TripAdvisor rating too ...,sadly used high tripadvisor rating literally f...,0.428571,1,0,Great service and wine but inedible food Sadly...,0.9967,Positive,4
4,France,The Frog at Bercy Village,Negative,Avoid- Worst meal in Rome - possibly ever,Nov 2022 •,From the start this meal was bad- especially g...,start meal especially given price visited husb...,0.125000,0,0,Avoid- Worst meal in Rome - possibly ever From...,-0.4701,Negative,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1497,Cuba,Old Square (Plaza Vieja),Negative,The Tourism Trap,Oct 2016 •,Despite the other reviews saying that this is ...,despite review saying place hang especially be...,0.000000,0,0,The Tourism Trap Despite the other reviews say...,0.9262,Positive,2
1498,Cuba,Old Square (Plaza Vieja),Negative,the beer factory,Oct 2016 •,beer is good. food is awfull The only decent...,beer good food awfull decent thing shish kabob...,0.333333,0,0,the beer factory beer is good. food is awfull...,0.4404,Positive,3
1499,Cuba,Old Square (Plaza Vieja),Negative,brewery,Oct 2016 •,"for terrible service of a truly comedic level,...",terrible service truly comedic level full pint...,0.000000,0,0,brewery for terrible service of a truly comedi...,-0.1189,Negative,3
1500,Cuba,Old Square (Plaza Vieja),Negative,It's nothing exciting over there,Oct 2016 •,We visited the Havana's Club Museum which is l...,visited havana club museum located old havana ...,0.000000,0,0,It's nothing exciting over there We visited th...,0.2306,Positive,2


7. The clusters created by the agglomerative clustering algorithm represent groups of reviews that are similar to each other based on their text content (title plus review). Reviews in the same cluster are considered more semantically similar to each other than to reviews in other clusters.

8. Let's export this dataframe to a CSV file so that we can explore it on our own and check whether there is any deeper meaning behind the cluster numbers. The numbers assigned to the clusters (0, 1, 2, 3, 4) are arbitrary labels: they have no inherent meaning other than identifying which reviews belong to the same cluster.
9. What each cluster actually means (e.g., "positive reviews", "complaints about service") depends on the underlying data and on how the algorithm groups it. To interpret the clusters, you would typically read the reviews within each one and look for shared themes, language, sentiment, or other textual characteristics.

In [74]:
output_file_path = 'restaurant_reviews_with_clusters.csv'
restaurant_data_renamed_columns.to_csv(output_file_path, index=False)

10. Let's see if people have the same thoughts about restaurants based on what clusters their reviews fall into.

In [75]:
# Group by restaurant and cluster, and count the number of reviews in each group
cluster_distribution = restaurant_data_renamed_columns.groupby(['restaurant_name', 'cluster']).size().unstack(fill_value=0)

# Display the cluster distribution for each restaurant
print(cluster_distribution)


cluster                                  0  1    2    3   4
restaurant_name                                            
Ad Hoc Ristorante (Piazza del Popolo)  225  2    0   35  56
Mosaic                                  79  0    0    2   0
Old Square (Plaza Vieja)                 5  0  123   14   4
Pelmenya                                32  0    0   12  56
Stara Kamienica                         73  0    0   12  50
The Frog at Bercy Village              200  3    4  228  77
The LOFT                               154  0    3    5  48


11. As can be seen, for several restaurants a large share of the reviews falls into a single cluster, which points to some agreement in what people think of the restaurant. The exception is The Frog at Bercy Village, where two clusters are almost equal in size (200 reviews in cluster 0 and 228 in cluster 3). It is also the restaurant with the most reviews, so a more mixed picture is to be expected.
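Because the restaurants have very different numbers of reviews, raw counts are hard to compare. A minimal sketch of one way to normalise them: convert each row of the table above into shares and report the largest share per restaurant. The counts are copied from the printed output; the column names of `summary` are assumptions.

```python
import pandas as pd

# Counts copied from the cluster distribution printed above
cluster_counts = pd.DataFrame(
    [
        [225, 2, 0, 35, 56],
        [79, 0, 0, 2, 0],
        [5, 0, 123, 14, 4],
        [32, 0, 0, 12, 56],
        [73, 0, 0, 12, 50],
        [200, 3, 4, 228, 77],
        [154, 0, 3, 5, 48],
    ],
    index=[
        "Ad Hoc Ristorante (Piazza del Popolo)",
        "Mosaic",
        "Old Square (Plaza Vieja)",
        "Pelmenya",
        "Stara Kamienica",
        "The Frog at Bercy Village",
        "The LOFT",
    ],
    columns=[0, 1, 2, 3, 4],
)

# Share of each restaurant's reviews that falls into each cluster
cluster_shares = cluster_counts.div(cluster_counts.sum(axis=1), axis=0)

summary = pd.DataFrame({
    "n_reviews": cluster_counts.sum(axis=1),
    "dominant_cluster": cluster_shares.idxmax(axis=1),
    "dominant_share": cluster_shares.max(axis=1).round(3),
}).sort_values("dominant_share", ascending=False)

print(summary)
```

A high dominant share indicates that reviewers of that restaurant write semantically similar texts; whether they also agree in sentiment would need a cross-check against the `sentiment` column.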

### \* Problem 8. Explore and model at will
In this lab, we focused on preprocessing and feature extraction and we didn't really have a chance to train (or compare) models. The dataset is maybe too small to be conclusive, but feel free to play around with ready-made models, and train your own.