In [1]:
import pandas as pd
import re
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')
  

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sergelen.n\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sergelen.n\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sergelen.n\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\sergelen.n\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


# Working with Text Lab
## Information retrieval, preprocessing, and feature extraction

In this lab, you'll explore European restaurant reviews. The dataset is tiny so that it can run on any machine. In real life, just like image datasets, text corpora can reach several terabytes.

The dataset is located [here](https://www.kaggle.com/datasets/gorororororo23/european-restaurant-reviews) and as always, it's been provided to you in the `data/` folder.

### Problem 1. Read the dataset (1 point)
Read the dataset, get acquainted with it. Ensure the data is valid before you proceed.

How many observations are there? Which country is the most represented? What time range does the dataset represent?

Is the sample balanced in terms of restaurants, i.e., do you have an equal number of reviews for each one? Most importantly, is the dataset balanced in terms of **sentiment**?

In [2]:
def standardize_column_names(df):
    def to_snake_case(name):
        name = name.lower()
        name = re.sub(r'[^a-z0-9]', '_', name)
        name = re.sub(r'_+', '_', name)
        name = name.strip('_')
        return name
    
    df.columns = [to_snake_case(col) for col in df.columns]
    return df

In [3]:
df = pd.read_csv('data/european_restaurant_reviews.csv')
df = standardize_column_names(df)


print("=" * 30)
print("Standardized column names:")
print(df.columns.tolist())

print("=" * 30)
print("Dataset Info:")
print("=" * 30)
print(df.info())

print("=" * 30)
print("First 5 Rows:")
print("=" * 30)
print(df.head())

Standardized column names:
['country', 'restaurant_name', 'sentiment', 'review_title', 'review_date', 'review']
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1502 entries, 0 to 1501
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   country          1502 non-null   object
 1   restaurant_name  1502 non-null   object
 2   sentiment        1502 non-null   object
 3   review_title     1502 non-null   object
 4   review_date      1502 non-null   object
 5   review           1502 non-null   object
dtypes: object(6)
memory usage: 70.5+ KB
None
First 5 Rows:
  country            restaurant_name sentiment  \
0  France  The Frog at Bercy Village  Negative   
1  France  The Frog at Bercy Village  Negative   
2  France  The Frog at Bercy Village  Negative   
3  France  The Frog at Bercy Village  Negative   
4  France  The Frog at Bercy Village  Negative   

                                review_title re

In [4]:
df['review_date'] = df['review_date'].str.replace(r'\bSept\b', 'Sep', regex=True)

df['review_date'] = df['review_date'].str.extract(r'(\w+\s\d{4})')

df['review_date'] = pd.to_datetime(df['review_date'], format='%b %Y', errors='coerce')

date_range = (df['review_date'].min(), df['review_date'].max())
print(f"Dataset time range: {date_range[0]} to {date_range[1]}")

Dataset time range: 2010-09-01 00:00:00 to 2024-07-01 00:00:00


In [5]:
num_observations = len(df)
print("=" * 30)
print(f"Number of observations: {num_observations}")
print("=" * 30)

most_represented_country = df['country'].value_counts().idxmax()
print(f"Most represented country: {most_represented_country}")
print("=" * 30)

restaurant_counts = df['restaurant_name'].value_counts()
print("Restaurant Review Counts:")
print(restaurant_counts.describe())
print("=" * 30)

sentiment_counts = df['sentiment'].value_counts(normalize=True)
print("Sentiment Distribution:")
print(sentiment_counts)
print("=" * 30)

Number of observations: 1502
Most represented country: France
Restaurant Review Counts:
count      7.000000
mean     214.571429
std      153.396933
min       81.000000
25%      117.500000
50%      146.000000
75%      264.000000
max      512.000000
Name: count, dtype: float64
Sentiment Distribution:
sentiment
Positive    0.823569
Negative    0.176431
Name: proportion, dtype: float64


### Problem 2. Getting acquainted with reviews (1 point)
Are positive comments typically shorter or longer? Try to define a good, robust metric for the "length" of a text; it's not necessarily just the character count. Can you explain your findings?

In [6]:
df['review_word_count'] = df['review'].apply(lambda x: len(str(x).split()))
df['review_unique_word_count'] = df['review'].apply(lambda x: len(set(str(x).split())))
df['review_char_count'] = df['review'].apply(lambda x: len(str(x)))
df['review_avg_word_length'] = df['review_char_count'] / df['review_word_count']

sentiment_stats = df.groupby('sentiment')[['review_word_count', 'review_unique_word_count', 'review_char_count', 'review_avg_word_length']].describe()
print("Sentiment Statistics:")
print(sentiment_stats)


Sentiment Statistics:
          review_word_count                                                   \
                      count        mean         std   min   25%   50%    75%   
sentiment                                                                      
Negative              265.0  140.573585  131.759636  13.0  55.0  95.0  177.0   
Positive             1237.0   50.183508   38.741043   2.0  25.0  37.0   61.0   

                 review_unique_word_count             ... review_char_count  \
             max                    count       mean  ...               75%   
sentiment                                             ...                     
Negative   646.0                    265.0  93.626415  ...             983.0   
Positive   340.0                   1237.0  40.817300  ...             340.0   

                  review_avg_word_length                                \
              max                  count      mean       std       min   
sentiment                        
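
One possible way to make "length" more robust is to count word tokens rather than characters and to compare medians rather than means, since the negative reviews above are strongly skewed (mean of about 141 words against a median of 95). The sketch below is only an illustration: `length_metrics` and the toy reviews are hypothetical and not part of the dataset.

```python
import re
from statistics import median

def length_metrics(text):
    """Return token-based length measures for one review."""
    # Tokens are runs of letters, digits, or apostrophes, so punctuation is not counted.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Rough sentence split on ., ! and ? (good enough for a length estimate).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
        "sentences": len(sentences),
    }

# Toy reviews, invented for illustration only.
toy_reviews = {
    "Positive": ["Great food!", "Lovely staff, great wine. Will return."],
    "Negative": ["We waited an hour. The soup was cold. "
                 "The waiter was rude and the bill was wrong."],
}

# The median is less sensitive to a few very long reviews than the mean.
median_tokens = {
    label: median(length_metrics(r)["tokens"] for r in reviews)
    for label, reviews in toy_reviews.items()
}
print(median_tokens)
```

The same function could be applied to the `review` column with `df['review'].apply(length_metrics)` and grouped by `sentiment`.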

### Problem 3. Preprocess the review content (2 points)
You'll likely need to do this while working on the problems below, but try to synthesize (and document!) your preprocessing here. Your tasks will revolve around words and their connection to sentiment. While preprocessing, keep in mind the domain (restaurant reviews) and the task (sentiment analysis).

In [7]:
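# Lowercase, then delete every character that is not a letter or whitespace.
# Deleting (instead of replacing with a space) merges contractions: "don't" becomes "dont".
df['clean_review'] = df['review'].str.lower()
df['clean_review'] = df['clean_review'].apply(lambda x: re.sub(r'[^a-z\s]', '', x))

# Split each review into word tokens.
df['clean_review'] = df['clean_review'].apply(word_tokenize)

# Drop English stop words. NLTK's list includes negations such as "not" and "no",
# which carry sentiment, so this step is a trade-off for the task.
stop_words = set(stopwords.words('english'))
df['clean_review'] = df['clean_review'].apply(lambda x: [word for word in x if word not in stop_words])

# Lemmatize. lemmatize() treats every token as a noun by default, so verb forms stay as they are.
lemmatizer = WordNetLemmatizer()
df['clean_review'] = df['clean_review'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

print(df['clean_review'].head())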

0    [manager, became, agressive, said, carbonara, ...
1    [ordered, beef, fillet, ask, done, medium, got...
2    [attractive, venue, welcoming, albeit, somewha...
3    [sadly, used, high, tripadvisor, rating, liter...
4    [start, meal, bad, especially, given, price, v...
Name: clean_review, dtype: object
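
Because the task is sentiment analysis, it may be worth keeping negations during preprocessing. The following is a minimal sketch of that idea using only the standard library. The stop-word set is a small hypothetical stand-in (NLTK's English list is much longer), and `preprocess` is an illustrative name, not the function used in the cell above.

```python
import re

# Small illustrative stop-word list (an assumption; NLTK's English list is much longer).
BASIC_STOP_WORDS = {"the", "a", "an", "was", "is", "it", "and", "i", "to", "of", "but", "not", "no"}
# Negations carry sentiment, so they are kept even though stop-word lists usually contain them.
NEGATIONS = {"not", "no", "never"}
SENTIMENT_STOP_WORDS = BASIC_STOP_WORDS - NEGATIONS

def preprocess(text):
    text = text.lower()
    # Expand regular "n't" contractions ("wasn't" -> "was not") before stripping punctuation.
    # Irregular forms such as "can't" or "won't" would need special-casing.
    text = re.sub(r"\b(\w+)n't\b", r"\1 not", text)
    # Replace non-letters with a space so adjacent words are not glued together.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in SENTIMENT_STOP_WORDS]

print(preprocess("The pasta wasn't good, but the wine was great."))
# → ['pasta', 'not', 'good', 'wine', 'great']
```

With this variant, "not good" survives as two tokens, which matters if bigrams are used later.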


### Problem 3. Top words (1 point)
Use a simple word tokenization and count the top 10 words in positive reviews; then the top 10 words in negative reviews*. Once again, try to define what "top" words means. Describe and document your process. Explain your results.

\* Okay, you may want to see top N words (with $N \ge 10$).

In [8]:
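# This repeats the stop-word filter from the preprocessing cell; after lemmatization
# it only removes tokens whose lemma happens to be a stop word.
stop_words = set(stopwords.words('english'))
df['clean_review'] = df['clean_review'].apply(lambda x: [word for word in x if word not in stop_words])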
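
One way to define "top" words is by document frequency: the number of reviews a word appears in, so that a single long review repeating a word does not dominate. The sketch below assumes token lists like those in `clean_review`; `top_words` and the toy data are hypothetical.

```python
from collections import Counter

def top_words(token_lists, n=10):
    """Rank words by the number of reviews they occur in (document frequency)."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each word at most once per review
    return doc_freq.most_common(n)

# Toy token lists, invented for illustration only.
toy_tokens = {
    "Positive": [["great", "food", "great", "wine"], ["great", "service"], ["food", "delicious"]],
    "Negative": [["bad", "service", "cold", "food"], ["bad", "bad", "bad", "food"]],
}

for label, token_lists in toy_tokens.items():
    print(label, top_words(token_lists, n=3))
```

On the notebook's data this could be called as `top_words(df.loc[df['sentiment'] == 'Positive', 'clean_review'])`, and likewise for the negative reviews. Words that rank high in both groups (for example "food") say little about sentiment and can be filtered out when comparing the two lists.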

### Problem 4. Review titles (2 points)
How do the top words you found in the last problem correlate to the review titles? Do the top 10 words (for each sentiment) appear in the titles at all? Do reviews which contain one or more of the top words have the same words in their titles?

Does the title of a comment present a good summary of its content? That is, are the titles descriptive, or are they simply meant to catch the attention of the reader?

In [9]:
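# Candidate top words, hard-coded here rather than computed from the token counts.
top_positive_words = ['delicious', 'great', 'amazing', 'love', 'best']
top_negative_words = ['bad', 'terrible', 'disappointed', 'horrible', 'worst']

def check_title_for_top_words(row, top_words):
    # split() keeps punctuation attached, so a title word like "great!" will not match "great".
    title_words = row['review_title'].lower().split()
    return any(word in title_words for word in top_words)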

df['top_positive_in_title'] = df.apply(lambda x: check_title_for_top_words(x, top_positive_words), axis=1)
df['top_negative_in_title'] = df.apply(lambda x: check_title_for_top_words(x, top_negative_words), axis=1)

positive_in_titles = df['top_positive_in_title'].sum()
negative_in_titles = df['top_negative_in_title'].sum()

print(f"Number of reviews with top positive words in the title: {positive_in_titles}")
print(f"Number of reviews with top negative words in the title: {negative_in_titles}")

Number of reviews with top positive words in the title: 393
Number of reviews with top negative words in the title: 38
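
To address whether reviews that contain a top word repeat the same word in their title, one option is to intersect the title tokens, the body tokens, and the top-word list for each review. This is a sketch under that assumption; `tokens` and `shared_top_words` are hypothetical helpers, and the example strings are invented.

```python
import re

def tokens(text):
    # Lowercase word tokens with punctuation stripped, so "Great!" matches "great".
    return set(re.findall(r"[a-z']+", text.lower()))

def shared_top_words(title, review, top_words):
    """Top words that occur in both the title and the review body."""
    return tokens(title) & tokens(review) & set(top_words)

top_positive_words = ['delicious', 'great', 'amazing', 'love', 'best']

# The title repeats a top word from the body.
print(shared_top_words("Great!", "The food was great and the staff amazing.", top_positive_words))
# → {'great'}

# Both texts contain top words, but not the same one.
print(shared_top_words("Amazing dinner", "Great pasta.", top_positive_words))
# → set()
```

Applied row by row, the share of reviews with a non-empty result gives a rough measure of how often titles echo the body's top words.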


### Problem 5. Bag of words (1 point)
Based on your findings so far, come up with a good set of settings (hyperparameters) for a bag-of-words model for review titles and contents. It's easiest to treat them separately (so, create two models); but you may also think about a unified representation. I find the simplest way of concatenating the title and content too simplistic to be useful, as it doesn't allow you to treat the title differently (e.g., by giving it more weight).

The documentation for `CountVectorizer` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Familiarize yourself with all settings; try out different combinations and come up with a final model; or rather - two models :).

In [10]:
from sklearn.feature_extraction.text import CountVectorizer


title_vectorizer = CountVectorizer(
    ngram_range=(1, 2),      
    stop_words='english',   
    max_features=1000,       
    min_df=3                
)

X_title = title_vectorizer.fit_transform(df['review_title'])

print(title_vectorizer.get_feature_names_out()[:20])  

['100' 'abbie' 'absolutely' 'absolutely best' 'ad' 'ad hoc' 'affordable'
 'albeit' 'albeit overpriced' 'amazing' 'amazing dinner'
 'amazing experience' 'amazing food' 'amazing place' 'amazing restaurant'
 'ambiance' 'ambience' 'american' 'anniversary' 'apology']
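
For a unified representation that still treats the title differently, one option is to keep title and body counts in separate feature spaces and scale the title block by a weight. The pure-Python sketch below only illustrates the idea; `combined_features` and the default `title_weight=2.0` are arbitrary assumptions, not tuned values.

```python
from collections import Counter

def bag_of_words(text):
    # Simple whitespace tokenization; a real model would reuse the vectorizer's tokenizer.
    return Counter(text.lower().split())

def combined_features(title, review, title_weight=2.0):
    """Keep title and body counts in separate feature spaces and scale the title block."""
    features = {}
    for word, count in bag_of_words(title).items():
        features[("title", word)] = title_weight * count
    for word, count in bag_of_words(review).items():
        features[("body", word)] = float(count)
    return features

example = combined_features("Great dinner", "great food great wine")
print(example)
```

With scikit-learn, the same idea would roughly correspond to fitting one `CountVectorizer` per field, multiplying the title matrix by the weight, and stacking the two sparse matrices side by side (for example with `scipy.sparse.hstack`).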


### Problem 6. Deep sentiment analysis models (1 point)
Find a suitable model for sentiment analysis in English. Without modifying, training, or fine-tuning the model, make it predict all contents (or better, combinations of titles and contents, if you can). Measure the accuracy of the model compared to the `sentiment` column in the dataset.

In [13]:
from transformers import pipeline

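# No model name is passed, so the pipeline loads its default sentiment checkpoint.
sentiment_analyzer = pipeline("sentiment-analysis", truncation=True, max_length=512)
df['combined_review'] = df['review_title'] + " " + df['review']
# x[:512] slices characters, not tokens, so only the first 512 characters of each text are scored.
predictions = df['combined_review'].apply(lambda x: sentiment_analyzer(x[:512])[0]['label'])
# Map the model's labels to the spelling used in the `sentiment` column.
df['predicted_sentiment'] = predictions.map({'POSITIVE': 'Positive', 'NEGATIVE': 'Negative'})
accuracy = (df['predicted_sentiment'] == df['sentiment']).mean()
print(f"Accuracy of sentiment analysis: {accuracy * 100:.2f}%")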

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


Accuracy of sentiment analysis: 96.01%
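
Accuracy alone can be misleading here: about 82% of the reviews are positive, so a model that always answers "Positive" would already reach roughly 82% accuracy. A per-class view is a useful complement. The sketch below uses invented toy labels (not the model's actual predictions), and `per_class_recall` is a hypothetical helper.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Toy labels, invented for illustration only.
y_true = ["Positive"] * 8 + ["Negative"] * 2
y_pred = ["Positive"] * 8 + ["Positive", "Negative"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = sum(recall.values()) / len(recall)

print(accuracy)           # → 0.9
print(recall)             # → {'Positive': 1.0, 'Negative': 0.5}
print(balanced_accuracy)  # → 0.75
```

In the toy example the accuracy is 0.9 even though half of the negative reviews are missed, which is exactly the effect to watch for on an imbalanced dataset. On the notebook's data the inputs would be `df['sentiment']` and `df['predicted_sentiment']`.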


### Problem 7. Deep features (embeddings) (1 point)
Use the same model to perform feature extraction on the review contents (or contents + titles) instead of direct predictions. You should already be familiar with how to do that from your work on images.

Use the cosine similarity between texts to try to cluster them. Are there "similar" reviews (you'll need to find a way to measure similarity) across different restaurants? Are customers generally in agreement for the same restaurant?
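
For reference, the cosine similarity of two embedding vectors is $\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$. The sketch below shows one simple way to find "similar" texts with it: list all pairs whose similarity exceeds a threshold. The embeddings are toy vectors, and `most_similar_pairs` with its `threshold=0.9` default is a hypothetical choice, not a recommended setting.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def most_similar_pairs(embeddings, threshold=0.9):
    """Return (i, j, similarity) for all i < j with similarity >= threshold."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs

# Toy "embeddings": vectors 0 and 1 point in the same direction, vector 2 is orthogonal.
toy_embeddings = [[1.0, 0.0, 1.0], [2.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
pairs = most_similar_pairs(toy_embeddings)
print(pairs)
```

With real embeddings, comparing the restaurant names of each returned pair shows whether similar reviews cross restaurants. The average pairwise similarity within one restaurant gives a rough sense of how much its customers agree.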

### \* Problem 8. Explore and model at will
In this lab, we focused on preprocessing and feature extraction and we didn't really have a chance to train (or compare) models. The dataset is maybe too small to be conclusive, but feel free to play around with ready-made models, and train your own.