# **Text Mining | <span style="color: khaki;">Project Notebook</span>**


#### NOVA IMS / BSc in Data Science / Text Mining 2024/2025
### <b>Group Project: Solving the Hyderabadi Word Soup</b>
#### Notebook `Multilabel Classification & Sentiment Analysis & Topic Modelling & Co-occurrence Analysis`

#### Group:
- `Francisco Gomes (20221810)`
- `Maria Henriques (20221952)`
- `Carolina Almeida (20221855)`
- `Duarte Carvalho (20221900)`
- `Marta Monteiro (20221954)`

#### <font color='#BFD72F'>Table of Contents </font> <a class="anchor" id='toc'></a>
- [1. Data Understanding](#P1)
- [2. General Data Preparation](#P2)
- [3. Multilabel Classification](#P3)
    - [3.1 Specific Data Preparation](#P31)
    - [3.2 Model Implementation](#P32)
    - [3.3 Model Evaluation](#P3n)
- [4. Sentiment Analysis](#P4)
    - [4.1 Specific Data Preparation](#P41)
    - [4.2 Model Implementation](#P42)
    - [4.3 Model Evaluation](#P43)
- [5. Topic Modelling](#P5)
- [6. Co-occurrence Analysis](#P6)

---

<font color='#BFD72F' size=5>1. Data Understanding</font> <a class="anchor" id="P1"></a>
  
[Back to TOC](#toc)

## <b>Importing the Datasets and Libraries</b>

This section imports the datasets and the libraries used throughout the notebook.

In [None]:
# ==============================
# Import Essential Libraries
# ==============================

# General-purpose libraries
import re
import time
import random
import string
import emoji
from collections import Counter
from itertools import combinations
from copy import deepcopy

# Data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
from prettytable import PrettyTable, HRuleStyle, VRuleStyle
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
from wordcloud import WordCloud
import seaborn as sns

# Text processing and NLP
import nltk
from nltk.tokenize import word_tokenize, PunktSentenceTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from num2words import num2words
import spacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Statistical and correlation analysis
from scipy.stats import pearsonr

# Machine Learning Libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA, TruncatedSVD, LatentDirichletAllocation
from gensim.models import Word2Vec
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score, make_scorer, classification_report, mean_squared_error, mean_absolute_error
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.neural_network import MLPClassifier

# Import TextBlob for sentiment analysis
from textblob import TextBlob

# Network graph libraries
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities 
from adjustText import adjust_text
import matplotlib.cm as cm 

# Suppress warnings
import warnings
from pandas.errors import SettingWithCopyWarning
warnings.filterwarnings("ignore", category = RuntimeWarning)
warnings.filterwarnings("ignore", category = SettingWithCopyWarning)

# ==============================
# Load Project-specific Modules
# ==============================
from Functions_Group_8 import *

# ==============================
# Configure Notebook Settings
# ==============================

# Enable autoreload for custom modules
%load_ext autoreload
%autoreload 2

The cell below contains the respective paths for the group members. 
<br>
Please add any new paths as needed to run the code.

In [None]:
# path = '/Users/franciscogomes/Desktop/Faculdade/3rd year/1st Semester/Text Mining/Project TM08/Project-Text-Mining-main/data_hyderabad'
path = '/Users/carol/Desktop/NOVA IMS/Third Year - First Semester/Text Mining/Group Project/Group 8/Text_Mining_Group_8/Data'
# path = 'C:/Users/marga/OneDrive/Documentos/universidade/3º ano/1º semestre/Text Mining/Text Mining - Project/data_hyderabad'
# path = 'C:/Users/dacar/OneDrive - NOVAIMS/Ambiente de Trabalho/Text Mining Project/data_hyderabad'
# path = '\Users\marta\Desktop\eu\faculdade\3\First Semester\Text Mining\Projeto\Project Statement\data_hyderabad'

In [None]:
reviews = pd.read_csv(path + '/10k_reviews.csv')
restaurants = pd.read_csv(path + '/105_restaurants.csv')
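
As an alternative to commenting paths in and out, the data folder could be resolved automatically. The following is only a sketch: the helper `resolve_data_dir` and the environment variable `TM_DATA_DIR` are hypothetical names introduced for illustration and are not part of the project.

```python
import os
import tempfile
from pathlib import Path


def resolve_data_dir(candidates, env_var="TM_DATA_DIR"):
    """Return the first existing directory, trying the environment variable first."""
    override = os.environ.get(env_var)
    ordered = ([override] if override else []) + list(candidates)
    for candidate in ordered:
        directory = Path(candidate).expanduser()
        if directory.is_dir():
            return directory
    raise FileNotFoundError(f"None of these directories exist: {ordered}")


# Demonstration: a temporary directory stands in for one member's data folder
with tempfile.TemporaryDirectory() as temporary_folder:
    found = resolve_data_dir(
        ["/hypothetical/folder/that/does/not/exist", temporary_folder],
        env_var="TM_DATA_DIR_UNSET_FOR_DEMO",
    )
    found_matches = found == Path(temporary_folder)

print(found_matches)
```

With such a helper, each member's folder would be listed once and the files read with, for example, `pd.read_csv(path / '10k_reviews.csv')`.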

## <b>Preliminary Data Analysis and Arrangements</b>

In this part, we extracted some initial information regarding data types, missing values and memory usage.

We used `deepcopy` on both DataFrames so that subsequent modifications operate on independent copies of the loaded data. Note that the copies are assigned to the same variable names, so the untouched originals can only be recovered by re-reading the CSV files.

In [None]:
reviews = deepcopy(reviews)
restaurants = deepcopy(restaurants)

The following outputs show the structure and content of the <b>Reviews</b> and <b>Restaurants</b> DataFrames, confirming that the data has been successfully duplicated and is ready for further analysis.

In [None]:
display(reviews.head())
display(restaurants.head())

The code below helps us examine missing values, data types, and memory usage of the DataFrames.

In [None]:
# DataFrame.info() prints its report and returns None, so it is called directly rather than wrapped in display()
reviews.info(memory_usage = 'deep')
restaurants.info(memory_usage = 'deep')

A quick glance at the <b>Reviews</b> dataset reveals that the variables <u>Reviewer</u>, <u>Review</u>, <u>Rating</u>, <u>Metadata</u>, and <u>Time</u> contain missing values. 
<br>
In the <b>Restaurant</b> dataset, only the <u>Collections</u> variable has missing values. 

For the first dataset, the missing values will be dropped as they constitute an insignificant proportion. 
<br>
However, for the second dataset, the missing values in the collections column will remain untouched, as this variable will not be used in the modeling phase.

Finally, we parsed the variable <u>Time</u> as a datetime and extracted the year into a new <u>Year</u> column, which is more suitable for future manipulations.

In [None]:
reviews['Year'] = pd.to_datetime(reviews['Time'], format = '%m/%d/%Y %H:%M').dt.year
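
To illustrate how this parsing behaves when some timestamps are missing, here is a minimal sketch on toy data (the sample values are made up for the example). It assumes we prefer unparseable entries to become `NaT` rather than raise an error, which is what `errors="coerce"` does.

```python
import pandas as pd

# Toy timestamps in the same month/day/year layout as the Time column; the last one is missing
sample = pd.DataFrame({"Time": ["5/25/2019 15:54", "7/29/2018 20:34", None]})

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(sample["Time"], format="%m/%d/%Y %H:%M", errors="coerce")

# The nullable Int64 dtype keeps years as integers even when some rows are missing
sample["Year"] = parsed.dt.year.astype("Int64")
print(sample)
```

Without the cast, a column containing missing years is stored as floats (for example 2019.0).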

## <b>Data Exploration and Treatment</b>

This section aims to extract useful data insights.

<u>Reviews Dataset</u>

In [None]:
reviews.describe(include = 'object').T

**What are some insights that we can take?**

- **Restaurant:** There are 10000 entries, with 100 unique restaurant names. The most frequently reviewed restaurant is "Beyond Flavours," appearing 100 times.
- **Reviewer:** Among the 9962 entries, there are 7446 unique reviewers. The most active reviewer is "Parijat Ray," contributing 13 reviews.
- **Review:** Of the 9955 reviews, 9364 are unique. The most commonly occurring review is the word "good," appearing 237 times, suggesting it might be a frequent descriptor.
- **Rating:** There are 9962 ratings spanning 10 unique values, with the highest frequency being 3832 for a rating of 5, indicating a strong positive bias in ratings.
- **Metadata:** Metadata is present in 9962 entries, with 2477 unique values. The most common metadata entry is "1 Review," appearing 919 times, possibly indicating a majority of single-review users.
- **Time:** The time column includes 9962 entries with 9782 unique values. The most frequent timestamp is "7/29/2018 20:34," appearing 3 times, which might suggest either duplicate entries or coincidental timing.

The code below converts the <u>Rating</u> column to numeric (non-numeric entries become `NaN`) and then checks for values that fall outside the valid range of 1 to 5.

In [None]:
reviews['Rating'] = pd.to_numeric(reviews['Rating'], errors = 'coerce')

out_of_range_ratings = reviews[(reviews['Rating'] < 1) | (reviews['Rating'] > 5)]

# Display the out-of-range ratings
if not out_of_range_ratings.empty:
    print("Values in 'Rating' that are out of range (not between 1 and 5):")
    print(out_of_range_ratings)
else:
    print("All values in 'Rating' are within the range of 1 to 5.")
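
One caveat of this check is that entries turned into `NaN` by `errors = 'coerce'` are never reported as out of range, because comparisons with `NaN` are always false. The sketch below, on toy data with a hypothetical non-numeric rating, shows one way to list such entries before they disappear.

```python
import pandas as pd

# Toy ratings; "Like" is a hypothetical non-numeric entry
ratings = pd.Series(["5", "4.5", "Like", None])
numeric = pd.to_numeric(ratings, errors="coerce")

# Entries that were present as text but became NaN during conversion
coerced = ratings[numeric.isna() & ratings.notna()]

# Comparisons with NaN are always False, so coerced entries never show up as out of range
out_of_range = numeric[(numeric < 1) | (numeric > 5)]
print(coerced.tolist(), len(out_of_range))
```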

This code removes rows from the **Reviews** DataFrame where all values in the specified columns are missing. Any review text that is still missing afterwards is replaced with the placeholder 'No Review'.

In [None]:
reviews.dropna(how = 'all', subset = ['Reviewer', 'Review', 'Rating', 'Metadata', 'Time'], inplace = True)

In [None]:
reviews['Review'] = reviews['Review'].fillna('No Review')

This code identifies and counts the total number of duplicate rows.

In [None]:
reviews[reviews.duplicated()].count()

There are no duplicate entries, eliminating the need to address duplicate data.

To gain a quick insight into the distribution of our reviews, we implemented a simple keyword-based function that categorizes reviews into positive, neutral, and negative groups. 
<br>
A more detailed analysis will be provided in `4. Sentiment Analysis`.

In [None]:
Reviews_ = reviews.groupby('Review').size().to_frame().reset_index().rename(columns = {0: 'Count'})

positive_words = ['good', 'very good', 'excellent', 'great', 'nice', 'best', 'amazing', 'superlative']
negative_words = ['bad', 'worst', 'terrible', 'horrible', 'poor', 'disappointing', 'late', 'disappointed']

def categorize_review(review):
    if any(word in review.lower() for word in positive_words):
        return 'Positive'
    elif any(word in review.lower() for word in negative_words):
        return 'Negative'
    else:
        return 'Neutral'

Reviews_['Category'] = Reviews_['Review'].apply(categorize_review)

Reviews_[Reviews_['Count'] > 1].sort_values(by = 'Count', ascending = False)

In [None]:
bar_plot(Reviews_, variable = 'Category', x_label = 'Category')

We observe that the most frequent reviews are positive, followed by neutral, and then negative.
<br>
While this is a basic approach and doesn't capture all nuances of the reviews, we expect the overall distribution to remain similar.
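
One nuance the substring test misses is word boundaries: for example, "chocolate" contains "late" and would be flagged as negative. A possible refinement, shown here only as a sketch (the function name `categorize_review_whole_words` is introduced for illustration), matches the same keywords as whole words.

```python
import re

POSITIVE_WORDS = ["good", "very good", "excellent", "great", "nice", "best", "amazing", "superlative"]
NEGATIVE_WORDS = ["bad", "worst", "terrible", "horrible", "poor", "disappointing", "late", "disappointed"]


def build_pattern(words):
    """Compile one case-insensitive pattern that matches any of the words as whole words."""
    alternatives = "|".join(re.escape(word) for word in words)
    return re.compile(rf"\b(?:{alternatives})\b", flags=re.IGNORECASE)


POSITIVE_PATTERN = build_pattern(POSITIVE_WORDS)
NEGATIVE_PATTERN = build_pattern(NEGATIVE_WORDS)


def categorize_review_whole_words(review):
    # Same priority as the notebook's function: positive keywords are checked first
    if POSITIVE_PATTERN.search(review):
        return "Positive"
    if NEGATIVE_PATTERN.search(review):
        return "Negative"
    return "Neutral"


print(categorize_review_whole_words("The chocolate cake was okay"))
```

This remains a keyword heuristic: negation ("not good") and sarcasm are still not handled.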

During this exploration, we also noticed the presence of the word **nic**.

In [None]:
reviews[reviews['Review'].str.contains(r'\bnic\b', case = False, na = False, regex = True)]

We observed that this word appears in two rows, both associated with a rating of 5, which suggests it was intended to be **nice**. 
<br>
Therefore, we will address these two cases. 

While more thorough preprocessing of the review text will be done later, we chose not to ignore these instances for now.

In [None]:
reviews.loc[3795, 'Review'] = re.sub(r'\bnic\b', 'nice', reviews.loc[3795, 'Review'], flags = re.IGNORECASE)
reviews.loc[7574, 'Review'] = re.sub(r'\bnic\b', 'nice', reviews.loc[7574, 'Review'], flags = re.IGNORECASE)
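
The two rows are addressed by index label above. An equivalent correction that does not depend on specific labels could apply the same pattern to the whole column; the sketch below uses toy data rather than the real reviews.

```python
import pandas as pd

reviews_demo = pd.DataFrame({"Review": ["Nic place", "Very nice food", "nic"]})

# Replace the whole word "nic" (any capitalisation) with "nice" in every row
reviews_demo["Review"] = reviews_demo["Review"].str.replace(
    r"\bnic\b", "nice", case=False, regex=True
)
print(reviews_demo["Review"].tolist())
```

Note that the replacement is always the lowercase "nice", so an original "Nic" loses its capital letter, just as in the `re.sub` calls above.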

<u>Restaurants Dataset</u>

In [None]:
restaurants.describe(include = 'object').T

**What are some insights that we can take?**

- **Name:** There are 105 entries, each corresponding to a unique restaurant name.
- **Links:** Each restaurant has a unique link.
- **Cost:** This column contains 105 entries, with 29 unique values. The most common cost is 500, appearing 13 times.
- **Collections:** It has 51 entries, with 42 unique values. The most frequent collection is "Food Hygiene Rated Restaurants in Hyderabad," which appears 4 times.
- **Cuisines:** There are 105 entries in this column, with 92 unique values. The most common cuisine combination is "North Indian, Chinese," appearing 4 times.
- **Timings:** The "Timings" column contains 104 entries, with 77 unique values. The most frequent timing is "11 AM to 11 PM," appearing 6 times.

The code checks for inconsistent values in the <u>Cost</u> column that fall outside the positive range.

In [None]:
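# Remove thousands separators, if present (e.g. '1,200'), so that such values are not coerced to NaN
restaurants['Cost'] = pd.to_numeric(restaurants['Cost'].astype(str).str.replace(',', '', regex = False), errors = 'coerce')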

negative_costs = restaurants[restaurants['Cost'] < 0]

if not negative_costs.empty:
    print("Negative values in 'Cost':")
    print(negative_costs)
else:
    print("No negative values in 'Cost'.")

This code identifies and counts the total number of duplicate rows.

In [None]:
restaurants[restaurants.duplicated()].count()

There are no duplicate entries, eliminating the need to address duplicate data.

## <b>Data Visualization</b>

In this section, we explored the data and created visualizations to draw some initial conclusions and insights.

**Distribution of Ratings in Reviews**

In [None]:
bar_plot(reviews, variable = 'Rating', x_label = 'Rating', y_label = 'Count')

We observe that 3832 reviews have a rating of 5, 2373 a rating of 4, 1193 a rating of 3, 684 a rating of 2, and 1735 a rating of 1. 
<br>
This indicates that positive ratings (4 and 5) clearly outnumber negative ones (1 and 2), as already mentioned above, although a rating of 1 is more common than a rating of 2 or 3. 
<br>
Additionally, we note the presence of some decimal ratings, reflecting indecisiveness among a few reviewers. However, these cases represent a small minority.

**Distribution of Reviews with Pictures**

In [None]:
bar_plot(reviews, variable = 'Pictures', x_label = 'Pictures posted with review', y_label = 'Count')

The majority of reviews were submitted without accompanying pictures, although some reviews do include images.

**Top 10 Most Frequent Words in Reviews**

In [None]:
donut_chart(top_words(reviews, 'Review', top_n = 10), title = 'Top 10 Words in the column Review')

The <b>most frequently used word</b> in the <b>Reviews</b> is <u>the</u>, followed by <u>and</u> and <u>was</u>.

**Top 20 Most Frequent Words in Reviews**

In [None]:
word_cloud_1 = word_cloud(top_words(reviews, column_name = 'Review').set_index('Word')['Frequency'].to_dict())

plt.figure(figsize = (10, 6))
plt.imshow(word_cloud_1, interpolation = 'bilinear')
plt.axis('off')
plt.title('Top 20 Words in the column Review', fontsize = 16)
plt.show()

This word cloud conveys the same insights as the previous donut chart but presents them in a more visually engaging manner.

It is important to highlight that the most frequent words in the reviews are predominantly stop words. 
<br>
These words will be removed during the [2. General Data Preparation](#P2) stage, as they hold no significance for the final multilabel classification model.
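
To make the effect of stop words on these rankings concrete, here is a small sketch. The function `count_top_words` and the three-word stop list are assumptions made for the example; the notebook's own `top_words` helper lives in the project module and may work differently.

```python
import re
from collections import Counter


def count_top_words(texts, top_n=10, stop_words=frozenset()):
    """Count lowercase word tokens across texts, optionally skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(token for token in tokens if token not in stop_words)
    return counts.most_common(top_n)


sample_reviews = ["The food was good and the service was good", "The place was nice"]
tiny_stop_list = {"the", "was", "and"}  # illustrative only, far shorter than a real stop word list

with_stop_words = count_top_words(sample_reviews, top_n=1)
without_stop_words = count_top_words(sample_reviews, top_n=1, stop_words=tiny_stop_list)
print(with_stop_words, without_stop_words)
```

Once the stop words are skipped, a content word ("good") takes the top position in this toy sample.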

**Number of Reviews per Year**

In [None]:
number_of_reviews_per_year = reviews.groupby('Year').size().reset_index(name = 'Review Count')
line_plot(number_of_reviews_per_year, 'Year', 'Review Count', 'Number of Reviews per Year')

In 2016, there were only 43 reviews, with a slight increase in 2017. 
<br>
However, the most significant growth occurred between 2017 and 2018, when the number of reviews surged by over 4000. 
<br>
The peak was reached in 2018, with 4903 reviews, followed by a slight decrease in 2019.

**Popularity of Restaurants Mentioned in Reviews**

In [None]:
word_cloud_2 = word_cloud(reviews['Restaurant'].value_counts())

plt.figure(figsize = (10, 6))
plt.imshow(word_cloud_2, interpolation = 'bilinear')
plt.axis('off')
plt.title('Popularity of Restaurants in Reviews', fontsize = 16)
plt.show()

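<u>Beyond Flavours</u>, <u>Driven Cafe</u>, and <u>Eat India Company</u> stand out in the word cloud. Note, however, that the raw data contains exactly 100 reviews for each of the 100 restaurants, so differences in size can only stem from the few rows dropped earlier and from the word-cloud layout, and do not indicate genuine popularity.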

**Popularity of Cuisine Types**

In [None]:
list_of_cuisines = [cuisine.strip() for cuisine in restaurants['Cuisines'].dropna().str.split(',').sum()]
word_cloud_3 = word_cloud(pd.Series(list_of_cuisines).value_counts())

plt.figure(figsize = (10, 6))
plt.imshow(word_cloud_3, interpolation = 'bilinear')
plt.axis('off')
plt.title('Popularity of Cuisine Types', fontsize = 16)
plt.show()

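<u>North Indian</u> is the cuisine type listed by the most restaurants; the counts come from the <b>Restaurants</b> dataset, not from the number of reviews.
<br>
The plot alone does not tell us whether the restaurants mentioned above serve this type of cuisine, although it is plausible given how common it is.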
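
The counting behind this word cloud can also be written with `explode`, which avoids concatenating Python lists with `.sum()`. The sketch below uses a toy series in place of the <u>Cuisines</u> column.

```python
import pandas as pd

# Toy stand-in for the Cuisines column, with inconsistent spacing and a missing value
cuisines = pd.Series(["North Indian, Chinese", "Chinese,Continental", None])

cuisine_counts = (
    cuisines.dropna()
    .str.split(",")   # one list of cuisines per restaurant
    .explode()        # one row per (restaurant, cuisine) pair
    .str.strip()      # trim the whitespace left by the split
    .value_counts()
)
print(cuisine_counts)
```

Each restaurant contributes one count per cuisine it lists, so the result measures how many restaurants offer a cuisine.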

---

<font color='#BFD72F' size=5>2. General Data Preparation</font> <a class="anchor" id="P2"></a>
  
[Back to TOC](#toc)

## <b>Preprocessing Text</b>

At this stage, we aim to clean the <u>Review</u> column, as it contains many irrelevant elements that could negatively impact the model's performance.

We begin by integrating the two datasets (<i>reviews</i> and <i>restaurants</i>) and removing irrelevant columns. 
<br>
The resulting <b>integrated_datasets</b> is shown below.

In [None]:
integrated_datasets = pd.merge(restaurants, reviews, left_on = 'Name', right_on = 'Restaurant')
integrated_datasets.drop(['Collections', 'Reviewer','Name', 'Pictures','Links', 'Metadata', 'Cost', 'Timings', 'Time', 'Year'], axis = 1, inplace = True)
display(integrated_datasets.head())
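
Since `pd.merge` performs an inner join by default, restaurants without any matching review are silently left out (the Restaurants dataset has 105 names, while the Reviews dataset covers 100). A quick way to see which names do not match, sketched here on toy data, is a left join with `indicator = True`.

```python
import pandas as pd

restaurants_demo = pd.DataFrame({"Name": ["A", "B", "C"], "Cuisines": ["x", "y", "z"]})
reviews_demo = pd.DataFrame({"Restaurant": ["A", "A", "B"], "Review": ["good", "nice", "bad"]})

# how="left" keeps every restaurant; the _merge column records where each row came from
merged = pd.merge(
    restaurants_demo, reviews_demo,
    left_on="Name", right_on="Restaurant",
    how="left", indicator=True,
)
unmatched = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
print(unmatched)
```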

From now on, considering that we have two phases in this project, each requiring different types of text, we will create two new datasets. 
<br>
For now, these are simply copies of <i>integrated_datasets</i>, but they will undergo different modifications later on.

In [None]:
data_for_multilabel_classification = deepcopy(integrated_datasets)
data_for_sentiment_analysis = deepcopy(integrated_datasets)

Both datasets will undergo the same cleaning process, with some exceptions: 

In [None]:
data_for_multilabel_classification_clean = clean_dataframe(data_for_multilabel_classification, 'Review')
data_for_sentiment_analysis_clean = clean_dataframe(data_for_sentiment_analysis, 'Review', False, False, False, False, True)

The difference is that the text in *data_for_multilabel_classification_clean* is fully cleaned: emojis and punctuation are removed, all text is converted to lowercase, stop words are eliminated, and the result is tokenized. These steps are not applied to *data_for_sentiment_analysis_clean*.
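
The steps listed above can be sketched in a few lines. This is not the implementation of `clean_dataframe`, which lives in the project module; the function name, the tiny stop word list, and the crude ASCII filter used to drop emojis are all assumptions made for the example.

```python
import string

TINY_STOP_WORDS = {"the", "was", "and", "a", "is", "it"}  # illustrative only


def clean_review_sketch(text, stop_words=TINY_STOP_WORDS):
    """Lowercase, strip punctuation and non-ASCII characters, drop stop words, tokenize."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Crude emoji removal: this also drops accented letters, unlike a dedicated emoji library
    text = text.encode("ascii", "ignore").decode("ascii")
    return [token for token in text.split() if token not in stop_words]


print(clean_review_sketch("The food was GREAT!!! \U0001F60B"))
```

For sentiment analysis, capitalisation, punctuation and emojis can carry signal, which is one reason to leave that dataset largely untouched.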

Since the first phase of this project focuses on multilabel classification, the following two visualizations were created using the corresponding dataset. 

**Distribution of Review Lengths (in Tokens)**

In [None]:
token_lengths = [len(tokens) for tokens in data_for_multilabel_classification_clean['Review_cleaned']]
token_lengths_dataframe = pd.DataFrame(token_lengths, columns = ['Token Length'])
plot_histogram(token_lengths_dataframe, x = 'Token Length', nbins = 20, title = 'Visualization of the Size of the Review', labels = {'Token Length': 'Number of Tokens'}, xaxis_title = 'Number of Tokens', yaxis_title = 'Count of Reviews')

We observe that the majority of reviews (8702) contain between 0 and 49 tokens, while 920 reviews have between 50 and 99 tokens. 
<br>
These are relatively typical review lengths, and we consider them normal for this dataset.

**Top 10 Most Frequent Words in Cleaned Reviews**

In [None]:
data_for_multilabel_classification_clean['Review_cleaned_Temporary'] = data_for_multilabel_classification_clean['Review_cleaned'].apply(lambda x: ' '.join(x)) # Temporary column, just for this visualization

donut_chart(top_words(data_for_multilabel_classification_clean, 'Review_cleaned_Temporary', top_n = 10), title = 'Top 10 Words in the column Review (After Cleaning)')

In [None]:
data_for_multilabel_classification_clean.drop(columns = ['Review_cleaned_Temporary'], inplace = True) # The temporary column is removed

After the initial cleaning, we can observe that the most frequent words are <u>good</u>, <u>food</u> and <u>place</u>. 
<br>
Unlike before, none of these are stop words.

## <b>POS Tag and Lemmatization</b>

In this section, we convert the tokens in the 'Review_cleaned' column of *data_for_multilabel_classification_clean* into lemmas, as these will be used as input for the model. 
<br>
Additionally, we process the Part-of-Speech (POS) tags. 

In [None]:
data_for_multilabel_classification_clean[['Lemmas', 'POS_Tags']] = data_for_multilabel_classification_clean['Review_cleaned'].apply(lemmatize_tokens).apply(pd.Series)

In [None]:
plot_treemap(data_for_multilabel_classification_clean['Lemmas'], data_for_multilabel_classification_clean['POS_Tags'], 20)

We can check the frequency of different words, this time with the correct POS tags associated with each.

The following line of code converts all the cuisine entries in the 'Cuisines' column to lowercase.

In [None]:
data_for_multilabel_classification_clean['Cuisines'] = data_for_multilabel_classification_clean['Cuisines'].str.lower()

## <b>Dimensionality Reduction</b>

In this section, we chose to remove words that appear frequently across different types of cuisines, as they do not help in distinguishing between them. 
<br>
Additionally, we eliminated words that occur rarely in the dataset, as they could negatively impact the model's performance.

In [None]:
data_for_multilabel_classification_clean['Lemmas'] = data_for_multilabel_classification_clean['Lemmas'].apply(lambda x: ' '.join(x))

In [None]:
One_Hot_Encoder =  CountVectorizer(min_df = 40, max_df = 134600, dtype = np.int8, max_features = None, binary = True)

words_per_cuisine_type = One_Hot_Encoder.fit_transform(data_for_multilabel_classification_clean['Lemmas'].to_list())

words_per_cuisine_type = pd.DataFrame(words_per_cuisine_type.toarray(), columns = One_Hot_Encoder.get_feature_names_out())

words_per_cuisine_type.index = data_for_multilabel_classification_clean.index

words_per_cuisine_type['Cuisine Type'] = data_for_multilabel_classification_clean['Cuisines'].copy()

In [None]:
words_per_cuisine_type['Cuisine Type'] = words_per_cuisine_type['Cuisine Type'].str.split(',')

words_per_cuisine_type_exploded = words_per_cuisine_type.explode('Cuisine Type')

words_per_cuisine_type_exploded['Cuisine Type'] = words_per_cuisine_type_exploded['Cuisine Type'].str.strip()

words_per_cuisine_type_exploded

The DataFrame above represents the words that appear for each type of cuisine, where 0 indicates the word does not appear and 1 indicates the word is present.

**Word Clouds for Each Cuisine Type**

In [None]:
unique_cuisines = words_per_cuisine_type_exploded['Cuisine Type'].unique()

plt.figure(figsize = (15, 50))

for i, cuisine in enumerate(unique_cuisines):
    word_frequencies = generate_word_frequencies_for_cuisine(words_per_cuisine_type_exploded, cuisine)
    
    plt.subplot(14, 3, i + 1)
    wordcloud = word_cloud(word_frequencies)
    
    plt.imshow(wordcloud, interpolation = 'bilinear')
    plt.axis('off')
    plt.title(f"{cuisine}", fontsize = 12) 

plt.tight_layout()
plt.show()

From the grid above, we can get an idea of the words that appear for each type of cuisine. 
<br>
At first glance, words like *food*, *good*, *place*, *order*, *ambience*, and *service* seem to appear in most, if not all, of the word clouds. 
<br>
We will remove these words.

In [None]:
words_to_drop = ['food', 'good', 'place', 'order', 'ambience', 'service']

In [None]:
words_per_cuisine_type_exploded_2 = (words_per_cuisine_type_exploded.groupby('Cuisine Type').sum()).T
display(words_per_cuisine_type_exploded_2)

In [None]:
cuisine_totals = words_per_cuisine_type_exploded_2.sum()
percentages = words_per_cuisine_type_exploded_2.div(cuisine_totals, axis = 1) * 100 
display(percentages)

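In the DataFrame above, each column shows, for one cuisine type, the percentage of word occurrences accounted for by each word (so every column sums to 100). 
<br>
Our goal is to drop words whose percentages, summed across all cuisine types, are smaller than 1 or greater than 99.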

In [None]:
words_to_drop = set(words_to_drop) | set(percentages[(percentages.sum(axis = 1) < 1) | (percentages.sum(axis = 1) > 99)].index) 
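
To make the filter above easier to follow, here is the same sequence of operations on a toy one-hot table with two cuisine types. The data is invented for the example; the point is that the condition is applied to each word's percentages summed across cuisines.

```python
import pandas as pd

one_hot = pd.DataFrame({
    "biryani": [1, 1, 0, 0],
    "pasta": [0, 0, 1, 1],
    "tasty": [1, 1, 1, 1],
    "Cuisine Type": ["hyderabadi", "hyderabadi", "italian", "italian"],
})

# Words as rows, cuisine types as columns
counts = one_hot.groupby("Cuisine Type").sum().T

# Share of each word within a cuisine type: every column sums to 100
percentages = counts.div(counts.sum(), axis=1) * 100

# Sum of a word's shares across all cuisine types, as in the notebook's filter
row_sums = percentages.sum(axis=1)
dropped = set(row_sums[(row_sums < 1) | (row_sums > 99)].index)
print(row_sums.to_dict(), dropped)
```

Here "tasty" accounts for 50% of the occurrences in each cuisine, so its summed share is 100 and it is dropped, while the cuisine-specific words are kept.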

In [None]:
data_for_multilabel_classification_clean['Lemmas_Treated'] = data_for_multilabel_classification_clean['Lemmas'].apply(lambda lemmas: ' '.join(word for word in lemmas.split() if word not in words_to_drop))

**Word Clouds for Each Cuisine Type (After Dimensionality Reduction)**

In [None]:
words_per_cuisine_type_exploded.drop(columns = words_to_drop, inplace = True, errors = 'ignore')

plt.figure(figsize = (15, 50))

for i, cuisine in enumerate(unique_cuisines):
    word_frequencies = generate_word_frequencies_for_cuisine(words_per_cuisine_type_exploded, cuisine)
    
    wordcloud = word_cloud(word_frequencies)
    
    plt.subplot(14, 3, i + 1)
    plt.imshow(wordcloud, interpolation = 'bilinear')
    plt.axis('off')
    plt.title(f"{cuisine}", fontsize = 12)

plt.tight_layout()
plt.show()

Once again, we can visualize the most frequent words in the reviews for each cuisine type, this time after applying the previously described process to reduce dimensionality.

**Comparison of Word Counts: Lemmas vs Lemmas (Treated)**

In [None]:
plot_word_count_comparison(data_for_multilabel_classification_clean, 'Lemmas', 'Lemmas_Treated')

After making these modifications, we observe that exactly <b>36 673</b> lemmas were removed.

---

<font color='#BFD72F' size=5>3. Multilabel Classification</font> <a class="anchor" id="P3"></a>
  
[Back to TOC](#toc)

*How well can we classify a restaurant’s cuisine type using the content of their reviews as input?*

<font color='#BFD72F' size=5>3.1. Specific Data Preparation</font> <a class="anchor" id="P31"></a>

First, we will define the <b>data</b> by selecting only the necessary columns: 
<br>
<i>Cuisines</i> (containing the target) and <i>Lemmas_Treated</i> (the reviews after all the mentioned modifications). 

This data is shown below.

In [None]:
data = data_for_multilabel_classification_clean[['Cuisines','Lemmas_Treated']].copy() # Explicit copy, so that new columns can be added without affecting the source DataFrame
display(data)

## <b>Vectorization</b>

In this section, we will test different types of vectorization, and all of them will be added to our data for testing in the model.

---
# **Term Frequency-Inverse Document Frequency (TF-IDF)**
---

We applied the TF-IDF (Term Frequency-Inverse Document Frequency) technique to extract the most important features from the text data. 
<br>
Using the `TfidfVectorizer` with a limit of 5000 features and removal of English stop words, we transformed the cleaned text in the Lemmas_Treated column into a TF-IDF matrix, which was then stored in a new column, TF_IDF.

In [None]:
TFIDF_model = TfidfVectorizer(max_features = 5000, stop_words = 'english')

In [None]:
TFIDF_matrix = TFIDF_model.fit_transform(data['Lemmas_Treated'])

In [None]:
data['TF_IDF'] = list(TFIDF_matrix.toarray())

---
# **Word2Vec Skip-gram**
---

We trained a `Word2Vec` model using the Lemmas_Treated column, where each sentence was split into words. 
<br>
The model was configured with a vector size of 25, a context window of 5 words, and a minimum word count of 1. 
<br>
We used the Skip-gram approach (sg = 1) to learn word embeddings, with 4 worker threads for parallel processing. 
<br>
The resulting word embeddings capture semantic relationships between words in the text data.

In [None]:
Word2Vec_model = Word2Vec(sentences = data["Lemmas_Treated"].apply(lambda x: x.split()), vector_size = 25, window = 5, min_count = 1, workers = 4, sg = 1)

In [None]:
table = PrettyTable()

table.field_names = ['Category', '\033[1mtaste\033[0m']  # Random word

table.hrules = HRuleStyle.ALL  
table.vrules = VRuleStyle.ALL 

similarity_result = similarity(Word2Vec_model, 'taste')
prediction_result = prediction(Word2Vec_model, 'taste')
text_gen_result = text_generator(Word2Vec_model, ['great', 'taste'], 5, random_nr = True, random_nr_max = 10)

table.add_row(['Similarity (Most)', f"{similarity_result[0]} ({similarity_result[1]:.4f})"])
table.add_row(['Similarity (Least)', f"{similarity_result[2]} ({similarity_result[3]:.4f})"])
table.add_row(['Prediction (Most)', f"{prediction_result[0]} ({prediction_result[1]:.4f})"])
table.add_row(['Prediction (Least)', f"{prediction_result[2]} ({prediction_result[3]:.4f})"])
table.add_row(['Text Generator', ', '.join(text_gen_result)]) 

print(table)

In this experiment, we applied the Word2Vec model to a randomly selected word, **taste**, to explore various relationships and predictions within the text data. 


The results were presented in the following categories:

- **Similarity (Most):** The word "preparation" showed the highest similarity to "taste" with a score of 0.9220, indicating a strong semantic connection between the two words.
- **Similarity (Least):** The word "su" had the lowest similarity to "taste", although its score of 0.4896 is still moderate, suggesting a much weaker relationship.
- **Prediction (Most):** The model predicted "burger" as the most likely related word, albeit with a very low score of 0.0009, indicating that the prediction strength is not very high.
- **Prediction (Least):** The model predicted "wrap" with an even lower score of 0.0005, suggesting an even weaker association.
- **Text Generator**: Using the words "great" and "taste," the model generated a list of related words, including "coffee," "chocolate," "burger," and "brownie," which align with common associations of taste in the dataset.
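
For reference, similarity scores between word embeddings are usually cosine similarities, which is what gensim's `most_similar` reports; we assume here that the project's `similarity` helper does the same. A minimal sketch of the measure on toy vectors:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine_similarity([1, 0], [1, 0]))
print(cosine_similarity([1, 0], [0, 1]))
print(cosine_similarity([1, 0], [-1, 0]))
```

Because the measure ranges from -1 to 1, a "least similar" score that is still well above 0 indicates vectors that point in broadly the same direction.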

In [None]:
PCA_Scatter_Plot_1(Word2Vec_model)

In [None]:
vectors_Word2Vec_model = pd.DataFrame(Word2Vec_model.wv.vectors, index = list(Word2Vec_model.wv.key_to_index.keys()))
display(vectors_Word2Vec_model)

In [None]:
data['Word2Vec_Skip_gram'] = data['Lemmas_Treated'].apply(lambda x: sentence_vectorizer(x, Word2Vec_model.wv, vector_size = 25))
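
A common way to turn word embeddings into one vector per review is to average the vectors of the words it contains. We do not claim that this is what `sentence_vectorizer` does, since its code is in the project module; the function below is a hypothetical sketch of that general idea, using a plain dictionary in place of `Word2Vec_model.wv` (which supports the same `in` and `[]` operations).

```python
import numpy as np


def average_vector(sentence, word_vectors, vector_size):
    """Average the vectors of known words; return zeros when no word is known."""
    vectors = [word_vectors[word] for word in sentence.split() if word in word_vectors]
    if not vectors:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)


toy_vectors = {"great": np.array([1.0, 0.0]), "taste": np.array([0.0, 1.0])}
print(average_vector("great taste unknownword", toy_vectors, vector_size=2))
print(average_vector("", toy_vectors, vector_size=2))
```

Averaging discards word order, and reviews left empty after the earlier filtering all map to the same zero vector.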

---
# **Word2Vec Continuous Bag-of-Words (CBOW)**
---

We trained a `Word2Vec` model using the Lemmas_Treated column, where each sentence was split into words.
<br>
The model was configured with a vector size of 25, a minimum word count of 1, and 4 worker threads for parallel processing.
<br>
We used the Continuous Bag of Words (CBOW) approach (sg = 0) to learn word embeddings, with 5 epochs for training and a negative sampling exponent (ns_exponent = -1).
<br>
The resulting word embeddings capture the semantic relationships between words in the text data.

In [None]:
CBOW_MODEL = Word2Vec(sentences = data["Lemmas_Treated"].apply(lambda x: x.split()), vector_size = 25, min_count = 1, workers = 4, sg = 0, epochs = 5, ns_exponent = -1)

In [None]:
table = PrettyTable()

table.field_names = ['Category', '\033[1mtaste\033[0m']  # Random word

table.hrules = HRuleStyle.ALL  
table.vrules = VRuleStyle.ALL 

similarity_result = similarity(CBOW_MODEL, 'taste')
prediction_result = prediction(CBOW_MODEL, 'taste')
text_gen_result = text_generator(CBOW_MODEL, ['great', 'taste'], 5, random_nr = True, random_nr_max = 10)

table.add_row(['Similarity (Most)', f"{similarity_result[0]} ({similarity_result[1]:.4f})"])
table.add_row(['Similarity (Least)', f"{similarity_result[2]} ({similarity_result[3]:.4f})"])
table.add_row(['Prediction (Most)', f"{prediction_result[0]} ({prediction_result[1]:.4f})"])
table.add_row(['Prediction (Least)', f"{prediction_result[2]} ({prediction_result[3]:.4f})"])
table.add_row(['Text Generator', ', '.join(text_gen_result)]) 

print(table)

In this experiment, we applied the Word2Vec model to a randomly selected word, **taste**, to analyze its relationships and predictions within the text data.

The results were presented in the following categories:

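- **Similarity (Most):** The word "well" showed the highest similarity to "taste" with a score of 1.0000 (after rounding). Since even the least similar word scores 0.6381, this may reflect poorly separated CBOW embeddings rather than a genuinely strong semantic connection between the two words.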
- **Similarity (Least):** The word "jaago" had the lowest similarity to "taste" with a score of 0.6381, suggesting a much weaker relationship.
- **Prediction (Most):** The model predicted "not" as the most likely related word, with a score of 0.3328, indicating a relatively weak but notable association.
- **Prediction (Least):** The model predicted "restaurant" with an even lower score of 0.0184, suggesting an even weaker connection.
- **Text Generator:** Using the words "great" and "taste," the model generated a list of related words, including "restaurant," "no," "like," and "visit," reflecting common associations of taste in the dataset.

In [None]:
PCA_Scatter_Plot_1(CBOW_MODEL)

In [None]:
vectors_CBOW_model = pd.DataFrame(CBOW_MODEL.wv.vectors, index = list(CBOW_MODEL.wv.key_to_index.keys()))
display(vectors_CBOW_model)

In [None]:
data['CBOW'] = data['Lemmas_Treated'].apply(lambda x: sentence_vectorizer(x, CBOW_MODEL.wv, vector_size = 25))

---
# **GloVe - Pretrained Model**
---

In our experiment, we decided to use `GloVe` (Global Vectors for Word Representation), a pre-trained word embedding model, to explore semantic relationships in the text data. 
<br>
Instead of loading the GloVe embeddings through a library helper, we chose to load the pre-trained GloVe vectors manually to have more control over the loading process.

We used the `manual_loading` function to load the GloVe vectors from the file *glove.6B.50d.txt*, which contains 50-dimensional word embeddings for a large vocabulary. 
<br>
This allowed us to work with a specific version of the model that suited our needs, ensuring that we could fine-tune and manipulate the embeddings as required for our analysis.
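
As a minimal sketch of what such manual loading can look like, the snippet below parses the GloVe text format (one token per line, followed by its vector components). The helper name `load_glove_vectors` and the in-memory sample are hypothetical and only illustrative; the notebook's own `manual_loading` is defined elsewhere and may differ.

```python
import io
import numpy as np

def load_glove_vectors(file_obj):
    """Parse GloVe text format: a token followed by its float components on each line."""
    vectors = {}
    for line in file_obj:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Tiny in-memory stand-in for a GloVe file (3-dimensional, made-up values)
sample_file = io.StringIO("taste 0.1 0.2 0.3\nfood 0.4 0.5 0.6\n")
toy_vectors = load_glove_vectors(sample_file)
print(toy_vectors['taste'].shape)
```

With a real file, the same helper could be called on an open file handle, for example `open('glove/glove.6B.50d.txt', encoding='utf-8')`.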

In [None]:
glove_vectors = manual_loading('glove/glove.6B.50d.txt') 
# glove_vectors = manual_loading('glove.6B.50d.txt') 

In [None]:
data['GloVe'] = data['Lemmas_Treated'].apply(lambda lemmas: glove_vectorization(lemmas, glove_vectors))

<font color='#BFD72F' size=5>3.2. Model Implementation</font> <a class="anchor" id="P32"></a>

For this next section, we will explore various combinations of vectorization techniques and machine learning models to identify the best-performing model for multilabel classification.
<br>
We aim to test different feature representations, namely TF-IDF, Word2Vec (Skip-gram and CBOW), and GloVe, alongside various classification algorithms. 

We preprocessed the target column by splitting the comma-separated values into individual cuisine labels and then applied the `MultiLabelBinarizer` to encode these labels into a binary format, storing the resulting encoded lists in the 'Cuisines_Encoded' column.
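
To make the encoding step concrete, here is a small sketch on made-up cuisine strings (not the project data). It shows how the comma-separated values become lists and how `MultiLabelBinarizer` turns them into one binary column per label.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up examples in the same "comma + space" format as the target column
toy_cuisines = ["Chinese, Thai", "Italian", "Chinese, Italian"]
toy_labels = [entry.split(', ') for entry in toy_cuisines]

toy_binarizer = MultiLabelBinarizer()
toy_encoding = toy_binarizer.fit_transform(toy_labels)

print(list(toy_binarizer.classes_))  # column order of the encoding (sorted labels)
print(toy_encoding)                  # one row per sample, one column per label
```

Each row can contain several ones, which is what distinguishes this multilabel encoding from a one-hot multiclass encoding.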

In [None]:
data['Cuisines'] = data['Cuisines'].apply(lambda x: x.split(', '))

multi_label_binarizer = MultiLabelBinarizer()

encoding = multi_label_binarizer.fit_transform(data['Cuisines'])

data['Cuisines_Encoded'] = encoding.tolist()

In [None]:
display(data.head())

**Distribution of Cuisine Types Across the Dataset**

In [None]:
bar_plot(words_per_cuisine_type_exploded, 'Cuisine Type', 'Types of Cuisines')

We can observe that the <b>dataset is imbalanced</b>, which will need to be taken into consideration when selecting the metrics for our model.
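
To illustrate why imbalance matters when choosing metrics, the toy sketch below (made-up labels for a single rare cuisine, not project data) scores a predictor that always outputs the majority class. Accuracy looks high, while the F1 score for the rare label is zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# One rare label: present in only 1 of 10 samples
rare_true = [0] * 9 + [1]
# A predictor that always outputs the majority class
rare_pred = [0] * 10

rare_accuracy = accuracy_score(rare_true, rare_pred)
rare_f1 = f1_score(rare_true, rare_pred, zero_division=0)
print(rare_accuracy, rare_f1)
```

This is why F1 score, precision, and recall are more informative than accuracy for the comparisons that follow.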

Below, we run a series of tests across vectorization methods and classifiers to identify the best-performing combination.

---
# **Term-Frequency-Inverse Document Frequency (TF-IDF)**
---

We split our TF-IDF vectorized input into training and validation sets. 
<br>
This process will be repeated for all vectorization methods, resulting in different arrays that will be tested with various models.

In [None]:
X_train_TF_IDF, X_val_TF_IDF, y_train_TF_IDF, y_val_TF_IDF = train_test_split(np.array(data['TF_IDF'].tolist()), encoding, test_size = 0.2, random_state = 0)

Splitting does not itself create NaN values, but the vectorized features may contain them, so we implemented the following code to check for their presence after the split.

In [None]:
print("Missing values in X_train_TF_IDF:", np.isnan(X_train_TF_IDF).sum())
print("Missing values in X_val_TF_IDF:", np.isnan(X_val_TF_IDF).sum())
print("Missing values in y_train_TF_IDF:", np.isnan(y_train_TF_IDF).sum() if isinstance(y_train_TF_IDF, np.ndarray) else 0)
print("Missing values in y_val_TF_IDF:", np.isnan(y_val_TF_IDF).sum() if isinstance(y_val_TF_IDF, np.ndarray) else 0)

However, no missing values were found.

In this code, we train several classifiers — *Logistic Regression*, *Decision Tree*, *Multilayer Perceptron*, *Random Forest*, and *Dummy Classifier* — using both the `OneVsRest` and `ClassifierChain` strategies. 
<br>
These strategies are employed to adapt to our multilabel classification problem, where each input can belong to multiple classes. 

This process will be repeated for all vectorization methods, allowing us to evaluate the performance of different models with various feature representations.
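
The difference between the two strategies can be sketched on synthetic data (generated with `make_multilabel_classification`, not the project data). `OneVsRestClassifier` fits one independent binary model per label, while `ClassifierChain` feeds the earlier labels in the chain as extra features to later models, which shows up in the number of input features each fitted estimator sees.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# Synthetic multilabel data: 200 samples, 10 features, 3 labels
toy_X, toy_Y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(toy_X, toy_Y)
chain = ClassifierChain(LogisticRegression(max_iter=1000)).fit(toy_X, toy_Y)

ovr_pred = ovr.predict(toy_X)
chain_pred = chain.predict(toy_X)

# Number of input features seen by each per-label estimator
ovr_feature_counts = [est.coef_.shape[1] for est in ovr.estimators_]
chain_feature_counts = [est.coef_.shape[1] for est in chain.estimators_]
print(ovr_feature_counts, chain_feature_counts)
```

The chain can exploit correlations between labels, at the cost of making the result depend on the label order.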

In [None]:
# Logistic Regression
One_vs_Rest_Logistic_Regression_1 = OneVsRestClassifier(LogisticRegression(C = 0.1, solver = 'lbfgs', class_weight = 'balanced')).fit(X_train_TF_IDF, y_train_TF_IDF)
Classifer_Chain_Logistic_Regression_1 = ClassifierChain(LogisticRegression(C = 0.1, solver = 'lbfgs', class_weight = 'balanced')).fit(X_train_TF_IDF, y_train_TF_IDF)

# Decision Tree Classifier
One_vs_Rest_Decision_Tree_Classifier_1 = OneVsRestClassifier(DecisionTreeClassifier(criterion = 'gini', max_depth = 10, min_samples_leaf = 1, min_samples_split = 50)).fit(X_train_TF_IDF, y_train_TF_IDF)
Classifer_Chain_Decision_Tree_Classifier_1 = ClassifierChain(DecisionTreeClassifier(criterion = 'gini', max_depth = 10, min_samples_leaf = 1, min_samples_split = 50)).fit(X_train_TF_IDF, y_train_TF_IDF)

# Multilayer Perceptron Classifier
One_vs_Rest_Multilayer_Perceptron_Classifier_1 = OneVsRestClassifier(MLPClassifier(random_state = 42, hidden_layer_sizes = (128, 32), activation = 'relu', solver = 'adam', alpha = 0.001, max_iter = 70, early_stopping = True, learning_rate_init = 0.001)).fit(X_train_TF_IDF, y_train_TF_IDF)
Classifer_Chain_Multilayer_Perceptron_Classifier_1 = ClassifierChain(MLPClassifier(random_state = 42, hidden_layer_sizes = (128, 32), activation = 'relu', solver = 'adam', alpha = 0.001, max_iter = 70, early_stopping = True, learning_rate_init = 0.001)).fit(X_train_TF_IDF, y_train_TF_IDF)

# Random Forest Classifier
One_vs_Rest_Random_Forest_Classifier_1 = OneVsRestClassifier(RandomForestClassifier(n_estimators = 50, max_depth = 10, min_samples_leaf = 1, class_weight = 'balanced')).fit(X_train_TF_IDF, y_train_TF_IDF)
Classifer_Chain_Random_Forest_Classifier_1 = ClassifierChain(RandomForestClassifier(n_estimators = 50, max_depth = 10, min_samples_leaf = 1, class_weight = 'balanced')).fit(X_train_TF_IDF, y_train_TF_IDF)

# Dummy Classifier
One_vs_Rest_Dummy_Classifier_1 = OneVsRestClassifier(DummyClassifier(strategy = 'most_frequent')).fit(X_train_TF_IDF, y_train_TF_IDF)
Classifier_Chain_Dummy_Classifier_1 = ClassifierChain(DummyClassifier(strategy = 'most_frequent')).fit(X_train_TF_IDF, y_train_TF_IDF)

Next, metrics such as F1 score, precision, and recall were calculated for each of these models.
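
For reference, multilabel metrics of this kind can be computed as in the sketch below. The helper name `multilabel_scores` and the choice of `average` are assumptions for illustration; the notebook's `calculate_metrics` is defined elsewhere and may use a different averaging scheme.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def multilabel_scores(y_true, y_pred, average='micro'):
    """Return F1, precision and recall for binary indicator matrices."""
    return {
        'f1': f1_score(y_true, y_pred, average=average, zero_division=0),
        'precision': precision_score(y_true, y_pred, average=average, zero_division=0),
        'recall': recall_score(y_true, y_pred, average=average, zero_division=0),
    }

# Made-up indicator matrices: 3 samples, 3 labels
scores_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
scores_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

toy_scores = multilabel_scores(scores_true, scores_pred)
print(toy_scores)
```

Here three of the five true labels are recovered with no false positives, so micro-averaged precision is 1.0 and recall is 0.6.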

In [None]:
# Calculate metrics for OneVsRest classifiers
One_vs_Rest_TF_IDF = {
    "LogisticRegression": calculate_metrics(One_vs_Rest_Logistic_Regression_1, X_val_TF_IDF, y_val_TF_IDF),
    "DecisionTree": calculate_metrics(One_vs_Rest_Decision_Tree_Classifier_1, X_val_TF_IDF, y_val_TF_IDF),
    "MLPClassifier": calculate_metrics(One_vs_Rest_Multilayer_Perceptron_Classifier_1, X_val_TF_IDF, y_val_TF_IDF),
    "RandomForest": calculate_metrics(One_vs_Rest_Random_Forest_Classifier_1, X_val_TF_IDF, y_val_TF_IDF),
    "Dummy": calculate_metrics(One_vs_Rest_Dummy_Classifier_1, X_val_TF_IDF, y_val_TF_IDF)}

# Calculate metrics for ClassifierChain classifiers
Classifer_Chain_TF_IDF = {
    "LogisticRegression": calculate_metrics(Classifer_Chain_Logistic_Regression_1, X_val_TF_IDF, y_val_TF_IDF),
    "DecisionTree": calculate_metrics(Classifer_Chain_Decision_Tree_Classifier_1, X_val_TF_IDF, y_val_TF_IDF),
    "MLPClassifier": calculate_metrics(Classifer_Chain_Multilayer_Perceptron_Classifier_1, X_val_TF_IDF, y_val_TF_IDF),
    "RandomForest": calculate_metrics(Classifer_Chain_Random_Forest_Classifier_1, X_val_TF_IDF, y_val_TF_IDF),
    "Dummy": calculate_metrics(Classifier_Chain_Dummy_Classifier_1, X_val_TF_IDF, y_val_TF_IDF)}

**Model Performance Comparison for TF-IDF Vectorization**

In [None]:
plot_model_performance(One_vs_Rest_TF_IDF, Classifer_Chain_TF_IDF, 'Term-Frequency-Inverse Document Frequency (TF-IDF)')

All models outperformed the dummy classifier, which is a positive indication. 
<br>
For this vectorization method, the model with the highest F1 score, precision, and recall appears to be the *MLPClassifier* with a *ClassifierChain* approach.

---
# **Word2Vec Skip-gram**
---

We split our Word2Vec Skip-gram vectorized input into training and validation sets. 

In [None]:
X_train_Word2Vec_Skip_gram, X_val_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram = train_test_split(np.array(data['Word2Vec_Skip_gram'].tolist()), encoding ,test_size = 0.2, random_state = 0)

Splitting does not itself create NaN values, but the vectorized features may contain them, so we implemented the following code to check for their presence after the split.

In [None]:
print("Missing values in X_train_Word2Vec_Skip_gram:", np.isnan(X_train_Word2Vec_Skip_gram).sum())
print("Missing values in X_val_Word2Vec_Skip_gram:", np.isnan(X_val_Word2Vec_Skip_gram).sum())
print("Missing values in y_train_Word2Vec_Skip_gram:", np.isnan(y_train_Word2Vec_Skip_gram).sum() if isinstance(y_train_Word2Vec_Skip_gram, np.ndarray) else 0)
print("Missing values in y_val_Word2Vec_Skip_gram:", np.isnan(y_val_Word2Vec_Skip_gram).sum() if isinstance(y_val_Word2Vec_Skip_gram, np.ndarray) else 0)

Missing values were detected in the X_train and X_val arrays.
<br>
These missing values were treated, as seen below.
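
One plausible treatment is to drop every row whose feature vector contains a NaN, together with its label row. The sketch below uses a hypothetical helper `drop_nan_rows` on made-up arrays; the notebook's `clean_nan_values` is defined elsewhere and may work differently. A possible source of such NaN rows, which we state as an assumption, is averaging word vectors for a review with no in-vocabulary tokens.

```python
import numpy as np

def drop_nan_rows(X, y):
    """Keep only the rows of X (and matching rows of y) that contain no NaN."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]

# Made-up features with one all-NaN row, plus aligned multilabel targets
nan_X = np.array([[0.1, 0.2], [np.nan, np.nan], [0.3, 0.4]])
nan_y = np.array([[1, 0], [0, 1], [1, 1]])

clean_X, clean_y = drop_nan_rows(nan_X, nan_y)
print(clean_X.shape, clean_y.shape)
```

Filtering X and y with the same mask keeps features and labels aligned.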

In [None]:
# Clean train dataset
X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram = clean_nan_values(X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram)

# Clean validation dataset
X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram = clean_nan_values(X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram)

In this code, we train several classifiers — *Logistic Regression*, *Decision Tree*, *Multilayer Perceptron*, *Random Forest*, and *Dummy Classifier* — using both the `OneVsRest` and `ClassifierChain` strategies. 
<br>
These strategies are employed to adapt to our multilabel classification problem, where each input can belong to multiple classes. 

In [None]:
# Logistic Regression
One_vs_Rest_Logistic_Regression_2 = OneVsRestClassifier(LogisticRegression(solver = 'lbfgs', random_state = 0, class_weight = calculate_class_weights(y_train_Word2Vec_Skip_gram))).fit(X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram)
Classifer_Chain_Logistic_Regression_2 = ClassifierChain(LogisticRegression(solver = 'lbfgs', random_state = 0, class_weight = calculate_class_weights(y_train_Word2Vec_Skip_gram))).fit(X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram)

# Decision Tree Classifier
One_vs_Rest_Decision_Tree_Classifier_2 = OneVsRestClassifier(DecisionTreeClassifier(random_state = 42, criterion = 'gini', max_depth = 20, min_samples_leaf = 10, min_samples_split = 150, class_weight = calculate_class_weights(y_train_Word2Vec_Skip_gram))).fit(X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram)
Classifer_Chain_Decision_Tree_Classifier_2 = ClassifierChain(DecisionTreeClassifier(random_state = 42, criterion = 'gini', max_depth = 20, min_samples_leaf = 10, min_samples_split = 150, class_weight = calculate_class_weights(y_train_Word2Vec_Skip_gram))).fit(X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram)

# Multilayer Perceptron Classifier
One_vs_Rest_Multilayer_Perceptron_Classifier_2 = OneVsRestClassifier(MLPClassifier(random_state = 42, hidden_layer_sizes = (128, 32), activation = 'relu', solver = 'adam', alpha = 0.001, max_iter = 70, early_stopping = True, learning_rate_init = 0.001)).fit(X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram)
Classifer_Chain_Multilayer_Perceptron_Classifier_2 = ClassifierChain(MLPClassifier(random_state = 42, hidden_layer_sizes = (128, 32), activation = 'relu', solver = 'adam', alpha = 0.001, max_iter = 70, early_stopping = True, learning_rate_init = 0.001)).fit(X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram)

# Random Forest Classifier
One_vs_Rest_Random_Forest_Classifier_2 = OneVsRestClassifier(RandomForestClassifier(random_state = 42, criterion = 'log_loss', n_estimators = 100, max_depth = 10, min_samples_leaf = 5, class_weight = calculate_class_weights(y_train_Word2Vec_Skip_gram))).fit(X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram)
Classifer_Chain_Random_Forest_Classifier_2 = ClassifierChain(RandomForestClassifier(random_state = 42, criterion = 'log_loss', n_estimators = 100, max_depth = 10, min_samples_leaf = 5, class_weight = calculate_class_weights(y_train_Word2Vec_Skip_gram))).fit(X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram)

# Dummy Classifier
One_vs_Rest_Dummy_Classifier_2 = OneVsRestClassifier(DummyClassifier(strategy = 'most_frequent')).fit(X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram)
Classifier_Chain_Dummy_Classifier_2 = ClassifierChain(DummyClassifier(strategy = 'most_frequent')).fit(X_train_Word2Vec_Skip_gram, y_train_Word2Vec_Skip_gram)

Next, metrics such as F1 score, precision, and recall were calculated for each of these models.

In [None]:
# Calculate metrics for OneVsRest classifiers
One_vs_Rest_Word2Vec_Skip_gram = {
    "LogisticRegression": calculate_metrics(One_vs_Rest_Logistic_Regression_2, X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram),
    "DecisionTree": calculate_metrics(One_vs_Rest_Decision_Tree_Classifier_2, X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram),
    "MLPClassifier": calculate_metrics(One_vs_Rest_Multilayer_Perceptron_Classifier_2, X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram),
    "RandomForest": calculate_metrics(One_vs_Rest_Random_Forest_Classifier_2, X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram),
    "Dummy": calculate_metrics(One_vs_Rest_Dummy_Classifier_2, X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram)}

# Calculate metrics for ClassifierChain classifiers
Classifer_Chain_Word2Vec_Skip_gram = {
    "LogisticRegression": calculate_metrics(Classifer_Chain_Logistic_Regression_2, X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram),
    "DecisionTree": calculate_metrics(Classifer_Chain_Decision_Tree_Classifier_2, X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram),
    "MLPClassifier": calculate_metrics(Classifer_Chain_Multilayer_Perceptron_Classifier_2, X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram),
    "RandomForest": calculate_metrics(Classifer_Chain_Random_Forest_Classifier_2, X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram),
    "Dummy": calculate_metrics(Classifier_Chain_Dummy_Classifier_2, X_val_Word2Vec_Skip_gram, y_val_Word2Vec_Skip_gram)}

**Model Performance Comparison for Word2Vec Skip-gram Vectorization**

In [None]:
plot_model_performance(One_vs_Rest_Word2Vec_Skip_gram, Classifer_Chain_Word2Vec_Skip_gram, 'Word2Vec Skip-gram')

In this case, the comparison with the dummy classifier is less favorable, as the MLPClassifier performs about as well as the dummy baseline. 
<br>
Overall, the scores for the other models appear lower compared to the TF-IDF method. 
<br>
However, the best-performing model seems to be the *RandomForest* with a *OneVsRest* approach.

---
# **Word2Vec Continuous Bag-of-Words (CBOW)**
---

We split our CBOW vectorized input into training and validation sets. 

In [None]:
X_train_CBOW, X_val_CBOW, y_train_CBOW, y_val_CBOW = train_test_split(np.array(data['CBOW'].tolist()), encoding ,test_size = 0.2, random_state = 0)

Splitting does not itself create NaN values, but the vectorized features may contain them, so we implemented the following code to check for their presence after the split.

In [None]:
print("Missing values in X_train_CBOW:", np.isnan(X_train_CBOW).sum())
print("Missing values in X_val_CBOW:", np.isnan(X_val_CBOW).sum())
print("Missing values in y_train_CBOW:", np.isnan(y_train_CBOW).sum() if isinstance(y_train_CBOW, np.ndarray) else 0)
print("Missing values in y_val_CBOW:", np.isnan(y_val_CBOW).sum() if isinstance(y_val_CBOW, np.ndarray) else 0)

Missing values were detected in the X_train and X_val arrays.
<br>
These missing values were treated, as seen below.

In [None]:
# Clean train dataset
X_train_CBOW, y_train_CBOW = clean_nan_values(X_train_CBOW, y_train_CBOW)

# Clean validation dataset
X_val_CBOW, y_val_CBOW = clean_nan_values(X_val_CBOW, y_val_CBOW)

In this code, we train several classifiers — *Logistic Regression*, *Decision Tree*, *Multilayer Perceptron*, *Random Forest*, and *Dummy Classifier* — using both the `OneVsRest` and `ClassifierChain` strategies. 
<br>
These strategies are employed to adapt to our multilabel classification problem, where each input can belong to multiple classes. 

In [None]:
# Logistic Regression
One_vs_Rest_Logistic_Regression_3 = OneVsRestClassifier(LogisticRegression(solver = 'lbfgs', random_state = 0, class_weight = calculate_class_weights(y_train_CBOW))).fit(X_train_CBOW, y_train_CBOW)
Classifer_Chain_Logistic_Regression_3 = ClassifierChain(LogisticRegression(solver = 'lbfgs', random_state = 0, class_weight = calculate_class_weights(y_train_CBOW))).fit(X_train_CBOW, y_train_CBOW)

# Decision Tree Classifier
One_vs_Rest_Decision_Tree_Classifier_3 = OneVsRestClassifier(DecisionTreeClassifier(random_state = 42, criterion = 'gini', max_depth = 20, min_samples_leaf = 10, min_samples_split = 150, class_weight = calculate_class_weights(y_train_CBOW))).fit(X_train_CBOW, y_train_CBOW)
Classifer_Chain_Decision_Tree_Classifier_3 = ClassifierChain(DecisionTreeClassifier(random_state = 42, criterion = 'gini', max_depth = 20, min_samples_leaf = 10, min_samples_split = 150, class_weight = calculate_class_weights(y_train_CBOW))).fit(X_train_CBOW, y_train_CBOW)

# Multilayer Perceptron Classifier
One_vs_Rest_Multilayer_Perceptron_Classifier_3 = OneVsRestClassifier(MLPClassifier(random_state = 42, hidden_layer_sizes = (128, 32), activation = 'relu', solver = 'adam', alpha = 0.001, max_iter = 70, early_stopping = True, learning_rate_init = 0.001)).fit(X_train_CBOW, y_train_CBOW)
Classifer_Chain_Multilayer_Perceptron_Classifier_3 = ClassifierChain(MLPClassifier(random_state = 42, hidden_layer_sizes = (128, 32), activation = 'relu', solver = 'adam', alpha = 0.001, max_iter = 70, early_stopping = True, learning_rate_init = 0.001)).fit(X_train_CBOW, y_train_CBOW)

# Random Forest Classifier
One_vs_Rest_Random_Forest_Classifier_3 = OneVsRestClassifier(RandomForestClassifier(random_state = 42, criterion = 'log_loss', n_estimators = 100, max_depth = 10, min_samples_leaf = 5, class_weight = calculate_class_weights(y_train_CBOW))).fit(X_train_CBOW, y_train_CBOW)
Classifer_Chain_Random_Forest_Classifier_3 = ClassifierChain(RandomForestClassifier(random_state = 42, criterion = 'log_loss', n_estimators = 100, max_depth = 10, min_samples_leaf = 5, class_weight = calculate_class_weights(y_train_CBOW))).fit(X_train_CBOW, y_train_CBOW)

# Dummy Classifier
One_vs_Rest_Dummy_Classifier_3 = OneVsRestClassifier(DummyClassifier(strategy = 'most_frequent')).fit(X_train_CBOW, y_train_CBOW)
Classifier_Chain_Dummy_Classifier_3 = ClassifierChain(DummyClassifier(strategy = 'most_frequent')).fit(X_train_CBOW, y_train_CBOW)

Next, metrics such as F1 score, precision, and recall were calculated for each of these models.

In [None]:
# Calculate metrics for OneVsRest classifiers
One_vs_Rest_CBOW = {
    "LogisticRegression": calculate_metrics(One_vs_Rest_Logistic_Regression_3, X_val_CBOW, y_val_CBOW),
    "DecisionTree": calculate_metrics(One_vs_Rest_Decision_Tree_Classifier_3, X_val_CBOW, y_val_CBOW),
    "MLPClassifier": calculate_metrics(One_vs_Rest_Multilayer_Perceptron_Classifier_3, X_val_CBOW, y_val_CBOW),
    "RandomForest": calculate_metrics(One_vs_Rest_Random_Forest_Classifier_3, X_val_CBOW, y_val_CBOW),
    "Dummy": calculate_metrics(One_vs_Rest_Dummy_Classifier_3, X_val_CBOW, y_val_CBOW)}

# Calculate metrics for ClassifierChain classifiers
Classifer_Chain_CBOW = {
    "LogisticRegression": calculate_metrics(Classifer_Chain_Logistic_Regression_3, X_val_CBOW, y_val_CBOW),
    "DecisionTree": calculate_metrics(Classifer_Chain_Decision_Tree_Classifier_3, X_val_CBOW, y_val_CBOW),
    "MLPClassifier": calculate_metrics(Classifer_Chain_Multilayer_Perceptron_Classifier_3, X_val_CBOW, y_val_CBOW),
    "RandomForest": calculate_metrics(Classifer_Chain_Random_Forest_Classifier_3, X_val_CBOW, y_val_CBOW),
    "Dummy": calculate_metrics(Classifier_Chain_Dummy_Classifier_3, X_val_CBOW, y_val_CBOW)}

**Model Performance Comparison for Word2Vec Continuous Bag-of-Words (CBOW)**

In [None]:
plot_model_performance(One_vs_Rest_CBOW, Classifer_Chain_CBOW, 'Word2Vec Continuous Bag-of-Words (CBOW)')

Here, once again, the MLPClassifier shows much lower performance, and the other models also perform worse than in previous cases. 
<br>
The best model, which is the *RandomForest* with a *OneVsRest* classifier, exhibits a significant disparity between recall and both precision and F1 score, indicating that it won't be a suitable choice for the final model.

---
# **GloVe - Pretrained Model**
---

We split our GloVe vectorized input into training and validation sets. 

In [None]:
X_train_GloVe, X_val_GloVe, y_train_GloVe, y_val_GloVe = train_test_split(np.array(data['GloVe'].tolist()), encoding, test_size = 0.2, random_state = 0)

Splitting does not itself create NaN values, but the vectorized features may contain them, so we implemented the following code to check for their presence after the split.

In [None]:
print("Missing values in X_train_GloVe:", np.isnan(X_train_GloVe).sum())
print("Missing values in X_val_GloVe:", np.isnan(X_val_GloVe).sum())
print("Missing values in y_train_GloVe:", np.isnan(y_train_GloVe).sum() if isinstance(y_train_GloVe, np.ndarray) else 0)
print("Missing values in y_val_GloVe:", np.isnan(y_val_GloVe).sum() if isinstance(y_val_GloVe, np.ndarray) else 0)

However, no missing values were found.

In this code, we train several classifiers — *Logistic Regression*, *Decision Tree*, *Multilayer Perceptron*, *Random Forest*, and *Dummy Classifier* — using both the `OneVsRest` and `ClassifierChain` strategies. 
<br>
These strategies are employed to adapt to our multilabel classification problem, where each input can belong to multiple classes. 

In [None]:
# Logistic Regression
One_vs_Rest_Logistic_Regression_4 = OneVsRestClassifier(LogisticRegression(solver = 'lbfgs', random_state = 0, class_weight = calculate_class_weights(y_train_GloVe))).fit(X_train_GloVe, y_train_GloVe)
Classifer_Chain_Logistic_Regression_4 = ClassifierChain(LogisticRegression(solver = 'lbfgs', random_state = 0, class_weight = calculate_class_weights(y_train_GloVe))).fit(X_train_GloVe, y_train_GloVe)

# Decision Tree Classifier
One_vs_Rest_Decision_Tree_Classifier_4 = OneVsRestClassifier(DecisionTreeClassifier(random_state = 42, criterion = 'gini', max_depth = 20, min_samples_leaf = 10, min_samples_split = 150, class_weight = calculate_class_weights(y_train_GloVe))).fit(X_train_GloVe, y_train_GloVe)
Classifer_Chain_Decision_Tree_Classifier_4 = ClassifierChain(DecisionTreeClassifier(random_state = 42, criterion = 'gini', max_depth = 20, min_samples_leaf = 10, min_samples_split = 150, class_weight = calculate_class_weights(y_train_GloVe))).fit(X_train_GloVe, y_train_GloVe)

# Multilayer Perceptron Classifier
One_vs_Rest_Multilayer_Perceptron_Classifier_4 = OneVsRestClassifier(MLPClassifier(random_state = 42, hidden_layer_sizes = (128, 32), activation = 'relu', solver = 'adam', alpha = 0.001, max_iter = 70, early_stopping = True, learning_rate_init = 0.001)).fit(X_train_GloVe, y_train_GloVe)
Classifer_Chain_Multilayer_Perceptron_Classifier_4 = ClassifierChain(MLPClassifier(random_state = 42, hidden_layer_sizes = (128, 32), activation = 'relu', solver = 'adam', alpha = 0.001, max_iter = 70, early_stopping = True, learning_rate_init = 0.001)).fit(X_train_GloVe, y_train_GloVe)

# Random Forest Classifier
One_vs_Rest_Random_Forest_Classifier_4 = OneVsRestClassifier(RandomForestClassifier(random_state = 42, criterion = 'log_loss', n_estimators = 100, max_depth = 10, min_samples_leaf = 5, class_weight = calculate_class_weights(y_train_GloVe))).fit(X_train_GloVe, y_train_GloVe)
Classifer_Chain_Random_Forest_Classifier_4 = ClassifierChain(RandomForestClassifier(random_state = 42, criterion = 'log_loss', n_estimators = 100, max_depth = 10, min_samples_leaf = 5, class_weight = calculate_class_weights(y_train_GloVe))).fit(X_train_GloVe, y_train_GloVe)

# Dummy Classifier
One_vs_Rest_Dummy_Classifier_4 = OneVsRestClassifier(DummyClassifier(strategy = 'most_frequent')).fit(X_train_GloVe, y_train_GloVe)
Classifier_Chain_Dummy_Classifier_4 = ClassifierChain(DummyClassifier(strategy = 'most_frequent')).fit(X_train_GloVe, y_train_GloVe)

Next, metrics such as F1 score, precision, and recall were calculated for each of these models.

In [None]:
# Calculate metrics for OneVsRest classifiers
One_vs_Rest_GloVe = {
    "LogisticRegression": calculate_metrics(One_vs_Rest_Logistic_Regression_4, X_val_GloVe, y_val_GloVe),
    "DecisionTree": calculate_metrics(One_vs_Rest_Decision_Tree_Classifier_4, X_val_GloVe, y_val_GloVe),
    "MLPClassifier": calculate_metrics(One_vs_Rest_Multilayer_Perceptron_Classifier_4, X_val_GloVe, y_val_GloVe),
    "RandomForest": calculate_metrics(One_vs_Rest_Random_Forest_Classifier_4, X_val_GloVe, y_val_GloVe),
    "Dummy": calculate_metrics(One_vs_Rest_Dummy_Classifier_4, X_val_GloVe, y_val_GloVe)}

# Calculate metrics for ClassifierChain classifiers
Classifer_Chain_GloVe = {
    "LogisticRegression": calculate_metrics(Classifer_Chain_Logistic_Regression_4, X_val_GloVe, y_val_GloVe),
    "DecisionTree": calculate_metrics(Classifer_Chain_Decision_Tree_Classifier_4, X_val_GloVe, y_val_GloVe),
    "MLPClassifier": calculate_metrics(Classifer_Chain_Multilayer_Perceptron_Classifier_4, X_val_GloVe, y_val_GloVe),
    "RandomForest": calculate_metrics(Classifer_Chain_Random_Forest_Classifier_4, X_val_GloVe, y_val_GloVe),
    "Dummy": calculate_metrics(Classifier_Chain_Dummy_Classifier_4, X_val_GloVe, y_val_GloVe)}

**Model Performance Comparison for GloVe - Pretrained Model**

In [None]:
plot_model_performance(One_vs_Rest_GloVe, Classifer_Chain_GloVe, 'GloVe - Pretrained Model')

Once again, the MLPClassifier shows poor performance, and the other models also do not perform exceptionally well. 
<br>
The best model appears to be the *RandomForest* with a *ClassifierChain* approach.

<font color='#BFD72F' size=5>3.3. Model Evaluation</font> <a class="anchor" id="P33"></a>

Based on the plots above, we decided that our final model should be a <b>Multilayer Perceptron Classifier</b> with the <b>ClassifierChain</b> strategy, using the <b>TF-IDF</b> vectorizer. 
<br>
This choice was made because it outperformed the dummy classifier by the widest margin across the evaluated metrics.

In [None]:
Classifer_Chain_Multilayer_Perceptron_Classifier_1

**Checking for Overfitting in the Multilayer Perceptron Classifier (Classifier Chain) with TF-IDF**

In [None]:
check_overfitting(Classifer_Chain_Multilayer_Perceptron_Classifier_1, X_train_TF_IDF, y_train_TF_IDF, X_val_TF_IDF, y_val_TF_IDF)

We observe that the model performs better on the training data than on the validation data across all metrics, with a significant gap, which indicates overfitting. 
<br>
One approach to address this issue is oversampling, which we will apply next.

## **Oversampling**

In this step, we applied **random resampling with replacement** to our training and validation datasets, with the aim of addressing the class imbalance. 

First, we combined the feature and label arrays for both the training and validation sets. 
<br>
Then, using the `resample` function, we drew (with replacement) as many rows as each original set contains. Because rows are drawn uniformly at random, this is a bootstrap sample that keeps the label proportions roughly as they were, rather than a class-aware rebalancing. 
<br>
The resulting resampled datasets were then split back into their respective feature (X) and label (y) components. 

The intent of this step is to reduce overfitting and improve the model's generalization by exposing it to a more balanced dataset.
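
The stack, resample, and split-back steps can be sketched on made-up arrays as follows. Note that `resample` called this way draws rows uniformly at random, so it is a bootstrap of the rows rather than a class-aware rebalancing.

```python
import numpy as np
from sklearn.utils import resample

# Made-up data: 6 samples, 2 features, 2 labels (second label is rare)
boot_features = np.arange(12, dtype=float).reshape(6, 2)
boot_labels = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [0, 1]])

# Stack features and labels so each row is resampled as a unit
combined = np.hstack((boot_features, boot_labels))
resampled = resample(combined, replace=True, n_samples=len(combined), random_state=42)

# Split back into features and labels using the number of label columns
n_label_cols = boot_labels.shape[1]
resampled_X = resampled[:, :-n_label_cols]
resampled_y = resampled[:, -n_label_cols:]
print(resampled_X.shape, resampled_y.shape)
```

Stacking before resampling guarantees that every sampled feature row stays attached to its own labels.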

In [None]:
Combined_X_y_train = np.hstack((X_train_TF_IDF, y_train_TF_IDF))
Combined_X_y_val = np.hstack((X_val_TF_IDF, y_val_TF_IDF))

oversampled_train_data = resample(Combined_X_y_train, replace = True, n_samples = len(X_train_TF_IDF), random_state = 42)
oversampled_val_data = resample(Combined_X_y_val, replace = True, n_samples = len(X_val_TF_IDF), random_state = 42)

X_train_TF_IDF_random_oversample = oversampled_train_data[:, : - y_train_TF_IDF.shape[1]]
y_train_TF_IDF_random_oversample = oversampled_train_data[:, - y_train_TF_IDF.shape[1]:]

X_val_TF_IDF_random_oversample = oversampled_val_data[:, : - y_val_TF_IDF.shape[1]]
y_val_TF_IDF_random_oversample = oversampled_val_data[:, - y_val_TF_IDF.shape[1]:]

**Evaluating Model Performance After Random Oversampling for Multilayer Perceptron Classifier**

In [None]:
check_overfitting(Classifer_Chain_Multilayer_Perceptron_Classifier_1, X_train_TF_IDF_random_oversample, y_train_TF_IDF_random_oversample, X_val_TF_IDF_random_oversample, y_val_TF_IDF_random_oversample)

After applying the resampling to the training and validation datasets, we did not observe a significant change in performance. 
<br>
Note that the classifier evaluated here is the one fitted earlier on the original training set (it was not refit on the resampled data), so the model's results remained largely unchanged, suggesting that this step was not an effective solution in this case.

## **Evaluating Performance for Each Cuisine Type**

However, the main objective of this model is to assess how well it can predict each type of cuisine. 
<br>
To achieve this, we will present separate metrics for each label, as this approach is more suitable given the varying levels of class imbalance across the different labels.
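
As a small illustration of per-label reporting, the sketch below runs `classification_report` on made-up indicator matrices with three hypothetical cuisine labels. The rare label that is never predicted ends up with an F1 score of 0, even though the frequent label scores perfectly.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

report_classes = ['Chinese', 'Italian', 'Mexican']
# Made-up ground truth and predictions: 4 samples, 3 labels
report_true = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 0, 0]])
report_pred = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]])

report = classification_report(
    report_true, report_pred, target_names=report_classes,
    output_dict=True, zero_division=0)

# Keep only the per-label rows, dropping the averaged summaries
per_label = pd.DataFrame(report).transpose().loc[report_classes]
print(per_label[['precision', 'recall', 'f1-score', 'support']])
```

Aggregated scores would hide this difference between labels, which is why the per-label view is used here.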

In [None]:
# Predictions on the Training Set
train_predictions = Classifer_Chain_Multilayer_Perceptron_Classifier_1.predict(X_train_TF_IDF)

Classification_Report_Train = classification_report(y_train_TF_IDF, train_predictions, target_names = multi_label_binarizer.classes_, output_dict = True)

Cuisine_Type_Performances_Train = pd.DataFrame(Classification_Report_Train).transpose()

In [None]:
# Predictions on the Validation Set
validation_predictions = Classifer_Chain_Multilayer_Perceptron_Classifier_1.predict(X_val_TF_IDF)

Classification_Report_Validation = classification_report(y_val_TF_IDF, validation_predictions, target_names = multi_label_binarizer.classes_, output_dict = True)

Cuisine_Type_Performances_Validation = pd.DataFrame(Classification_Report_Validation).transpose()

Using the following code, we created a DataFrame that displays the F1 scores for both the training and validation sets for each type of cuisine.

In [None]:
Cuisine_Type_Performances_Train_Filter = Cuisine_Type_Performances_Train.loc[multi_label_binarizer.classes_]
Cuisine_Type_Performances_Validation_Filter = Cuisine_Type_Performances_Validation.loc[multi_label_binarizer.classes_]

f1_scores = pd.DataFrame({'Cuisine Type': multi_label_binarizer.classes_, 'F1 Score (Train)': Cuisine_Type_Performances_Train_Filter['f1-score'], 'F1 Score (Validation)': Cuisine_Type_Performances_Validation_Filter['f1-score']}).set_index('Cuisine Type')

display(f1_scores)

**F1 Scores for Each Cuisine Type**

In [None]:
F1_Scores_Cuisine_Types(f1_scores)

Despite our efforts, we observed that the most frequent cuisine types achieved better F1 scores, while the less frequent cuisines, such as Mexican, had a score of 0. 
<br>
This indicates that the model struggles with predicting these less represented classes. 
<br>
Additionally, overfitting remains a significant issue, as the model performs much better on the training data compared to the validation data. 

Overall, we did not achieve the desired results, and further improvements are needed to address these challenges.

---

<font color='#BFD72F' size=5>4. Sentiment Analysis</font> <a class="anchor" id="P4"></a>
  
[Back to TOC](#toc)

*How well can we predict a restaurant’s Zomato score using the polarity of their reviews as input?*

The aim of this section is to apply sentiment analysis models to assess the polarity of the reviews.

<font color='#BFD72F' size=5>4.1. Specific Data Preparation</font> <a class="anchor" id="P41"></a>

First, we will define the <b>data_</b> DataFrame by selecting only the necessary columns: 
<br>
<i>Cuisines</i>, <i>Review_cleaned</i> and <i>Rating</i>.

This data is shown below.

In [None]:
data_ = data_for_sentiment_analysis_clean[['Cuisines', 'Review_cleaned', 'Rating']]
display(data_)

<font color='#BFD72F' size=5>4.2. Model Implementation</font> <a class="anchor" id="P42"></a>

---
# <b>Valence Aware Dictionary and sEntiment Reasoner (VADER)</b>
---

In this step, we used the VADER sentiment analysis tool to analyze the polarity of the cleaned reviews.
<br>
The `SentimentIntensityAnalyzer` was applied to the `Review_cleaned` column, with the sentiment scores averaged across sentences for each review. 
<br>
The results were then displayed in the updated dataset.
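
The sentence-level averaging can be sketched as below. The helper `mean_sentence_compound`, the regex-based sentence split, and the stand-in scorer are all assumptions for illustration; the notebook's `Apply_Vader` is defined elsewhere and may tokenize and aggregate differently.

```python
import re

def mean_sentence_compound(review, polarity_scores):
    """Average the 'compound' score over the sentences of one review."""
    # Simplified sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', review) if s.strip()]
    if not sentences:
        return 0.0  # assumption: treat empty reviews as neutral
    return sum(polarity_scores(s)['compound'] for s in sentences) / len(sentences)

def toy_scorer(sentence):
    """Stand-in scorer with made-up values, used only to keep the sketch self-contained."""
    return {'compound': 0.8 if 'good' in sentence else 0.0}

toy_mean = mean_sentence_compound("The food was good. The service was slow.", toy_scorer)
print(toy_mean)
```

With VADER, the scorer argument would be the analyzer's `polarity_scores` method, which returns a dictionary containing a `compound` key.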

In [None]:
Vader_Sentiment_Analyzer = SentimentIntensityAnalyzer()

In [None]:
data_ = Apply_Vader(data_, column = 'Review_cleaned', mean_sentence = True)

display(data_.head())
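
The `Apply_Vader` helper is defined earlier in the notebook and is not shown here. As a rough illustration of what averaging across sentences means, the sketch below splits a review into sentences and averages a per-sentence score. The regex splitter, the `mean_sentence_score` name, and the toy lexicon scorer are all assumptions made for this example; they stand in for the real tokenizer and for VADER's compound score.

```python
import math
import re

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def mean_sentence_score(text, score_fn):
    """Average score_fn over the sentences of text; NaN when there are no sentences."""
    sentences = split_sentences(text)
    if not sentences:
        return math.nan
    return sum(score_fn(s) for s in sentences) / len(sentences)

# Hypothetical toy scorer standing in for the VADER compound score.
TOY_LEXICON = {'good': 0.5, 'great': 0.75, 'bad': -0.5}

def toy_score(sentence):
    words = re.findall(r'[a-z]+', sentence.lower())
    return sum(TOY_LEXICON.get(w, 0.0) for w in words)

review = "Great biryani. Bad service. We waited an hour."
sentence_mean = mean_sentence_score(review, toy_score)   # averages 0.75, -0.5 and 0.0
empty_mean = mean_sentence_score("", toy_score)          # no sentences, so NaN
```

With the real analyzer, `score_fn` could be something like `lambda s: Vader_Sentiment_Analyzer.polarity_scores(s)['compound']`. In this sketch a review without any sentence yields NaN; whether the notebook's helper behaves the same way is an assumption.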

Then we generated summary statistics for the sentiment scores.
<br>
This helped us get an overview of the distribution of both the individual compound scores and the mean compound scores for each review, allowing us to better understand the overall sentiment of the dataset.

In [None]:
data_[['Compound Score (VADER)', 'Mean Compound Score (VADER)']].describe()

- **Skewed Sentiment Distribution:** The mean of the Compound Score (VADER) is 0.4876, indicating a general positive sentiment across the dataset. Most reviews appear to express a positive sentiment overall. The Mean Compound Score (VADER) has a lower mean of 0.2833, suggesting that the overall sentiment of the reviews may be somewhat less positive on average when considering the sentence-level sentiment.

- **High Variability:** Both the Compound Score (VADER) and Mean Compound Score (VADER) show significant variability, with a high standard deviation (0.5796 and 0.3734 respectively). This suggests a wide range of sentiments across the reviews, with many reviews showing highly positive or negative sentiments.

- **Sentiment Range:** The minimum value for both scores is negative (around -0.99), indicating some reviews have strong negative sentiments. The maximum values for both scores are near 1 (0.9997 for Compound Score and 0.9956 for Mean Compound Score), reflecting reviews with very positive sentiments.

- **Sentiment Concentration:** The 50th percentile (median) of the Compound Score (VADER) is 0.7841, suggesting that half of the reviews are more positive than this value. For the Mean Compound Score (VADER), the median is 0.3280, reflecting that the sentence-level sentiment tends to be less positive but still generally favorable.

- **Positive Sentiment Dominance:** The 75th percentile of the Compound Score (VADER) is 0.9393, meaning that a quarter of the reviews score above this value. Together with the median of 0.7841, this confirms that most reviews lean towards the positive side, many of them strongly.

The `Pearson correlation` between the *Compound Score (VADER)* and *Mean Compound Score (VADER)* was computed.

In [None]:
correlation_between_scores = compute_pearson_correlation(data_, 'Compound Score (VADER)', 'Mean Compound Score (VADER)')
print(f"Pearson correlation between 'Compound Score' and 'Mean Compound Score': {correlation_between_scores:.2f}")

It resulted in a value of 0.82, indicating a strong positive linear relationship between the overall and sentence-level sentiment scores.
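
For reference, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below is a dependency-free version of that formula; it is not the notebook's `compute_pearson_correlation` helper, and the input values are made up for illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson's r: co-variation of xs and ys divided by the product of their spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # The 1/n factors of covariance and variances cancel, so raw sums are enough.
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for illustration only, not values from the dataset.
compound = [0.9, 0.5, -0.6, 0.1]
mean_compound = [0.6, 0.4, -0.3, 0.0]
r = pearson_correlation(compound, mean_compound)
```

The result always lies between -1 and 1. A constant input has zero variance, in which case this sketch raises a `ZeroDivisionError`.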

This relationship is illustrated in the plot below.

**Correlation Comparison Between Compound Score and Mean Compound Score (VADER)**

In [None]:
plot_correlation_comparison(data_)

**Distribution of Sentiment Scores Using VADER**

In [None]:
data_to_plot = deepcopy(data_)

data_to_plot = extract_scores(data_to_plot, review_column = 'Review_cleaned')

plot_histogram_VADER(data_to_plot, positive_column = 'Positive', negative_column = 'Negative', neutral_column = 'Neutral', compound_column = 'Compound Score')

**Sentiment Analysis Heatmap Using VADER**

In [None]:
plot_heatmap_VADER(data_to_plot)

The heatmap shows a strong negative correlation between the neutral and positive sentiment scores. 
<br>
This suggests that as a review's sentiment becomes more positive, it tends to have a lower neutral score, and vice versa. 
<br>
Part of this relationship is structural: VADER's positive, neutral, and negative scores are proportions of the text that sum to 1, so a higher positive share necessarily leaves less room for the neutral share, especially when negative content is scarce.

Understanding such correlations can be helpful in refining sentiment analysis models to better interpret review data and improve accuracy in sentiment classification.

**Sentiment Scores by Cuisine Type**

In [None]:
data_['Cuisines_List'] = data_['Cuisines'].apply(lambda x: x.split(', '))

exploded_data_ = data_.explode('Cuisines_List').reset_index(drop = True)

In [None]:
score_per_cuisine(exploded_data_)

The histogram for compound scores across different types of cuisines shows a skew towards positive reviews for all cuisines.

---
# <b>TextBlob</b>
---

In this step, we used the TextBlob sentiment analysis tool to analyze the polarity of the cleaned reviews.
<br>
A custom function was applied to the Review_cleaned column to calculate the sentiment scores for each review.
<br>
The results were then displayed in the updated dataset.

In [None]:
data_ = Apply_TextBlob(data_, column = 'Review_cleaned')

display(data_.head())

**Distribution of Polarity and Subjectivity from TextBlob**

In [None]:
plot_histogram_TextBlob(data_)

This code segment categorizes the reviews based on their sentiment and subjectivity scores obtained from TextBlob analysis. Reviews are divided into different groups:

In [None]:
highly_positive_reviews = data_[data_['Polarity Score (TextBlob)'] > 0.7]
highly_negative_reviews = data_[data_['Polarity Score (TextBlob)'] < - 0.7]

neutral_reviews = data_[(data_['Polarity Score (TextBlob)'] > -0.1) & (data_['Polarity Score (TextBlob)'] < 0.1)]

highly_subjective_reviews = data_[data_['Subjectivity Score (TextBlob)'] > 0.9]
highly_objective_reviews = data_[data_['Subjectivity Score (TextBlob)']< 0.2]

print("Highly Positive Reviews:", len(highly_positive_reviews))
print("Highly Negative Reviews:", len(highly_negative_reviews))
print("Neutral Reviews:", len(neutral_reviews))
print("Highly Subjective Reviews:", len(highly_subjective_reviews))
print("Highly Objective Reviews:", len(highly_objective_reviews))

The result of the categorization is as follows:

- **Highly Positive Reviews:** 659 reviews have a polarity score greater than 0.7, indicating strong positive sentiment.
- **Highly Negative Reviews:** 140 reviews have a polarity score less than -0.7, indicating strong negative sentiment.
- **Neutral Reviews:** 1658 reviews have a polarity score between -0.1 and 0.1, reflecting a neutral sentiment.
- **Highly Subjective Reviews:** 482 reviews have a subjectivity score greater than 0.9, suggesting they are highly subjective and personal.
- **Highly Objective Reviews:** 559 reviews have a subjectivity score less than 0.2, indicating they are highly objective and factual.

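These results show the distribution of sentiment and subjectivity in the reviews. Among the polarity groups, neutral reviews (1658) are by far the largest, and strongly positive reviews (659) clearly outnumber strongly negative ones (140). Highly objective reviews (559) slightly outnumber highly subjective ones (482).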
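
The thresholds used in the cell above can also be written as a small labelling function, which makes it easier to see that the groups are not a partition: a review may receive one polarity label, one subjectivity label, both, or neither. The `categorize_review` name is hypothetical and the function is only a sketch of the same rules.

```python
def categorize_review(polarity, subjectivity):
    """Return the labels a review receives under the thresholds used above."""
    labels = []
    # Polarity groups (strict inequalities, as in the filtering cell).
    if polarity > 0.7:
        labels.append('highly positive')
    elif polarity < -0.7:
        labels.append('highly negative')
    elif -0.1 < polarity < 0.1:
        labels.append('neutral')
    # Subjectivity groups.
    if subjectivity > 0.9:
        labels.append('highly subjective')
    elif subjectivity < 0.2:
        labels.append('highly objective')
    return labels

examples = [(0.8, 0.95), (0.0, 0.0), (0.4, 0.5)]
labelled = [categorize_review(p, s) for p, s in examples]
```

A review with polarity 0.4 and subjectivity 0.5 falls into none of the groups, which is why the five counts do not need to add up to the number of reviews.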

**Polarity Scores by Cuisine Type**

In [None]:
data_['Cuisines_List'] = data_['Cuisines'].apply(lambda x: x.split(', '))

exploded_data_ = data_.explode('Cuisines_List').reset_index(drop = True)

In [None]:
score_per_cuisine(exploded_data_, vader = False)

All histograms of the polarity scores appear roughly bell-shaped for each type of cuisine.
<br>
This suggests that sentiment is distributed similarly across the different cuisines, with each cuisine receiving a mix of positive, neutral, and negative reviews.

<font color='#BFD72F' size=5>4.3. Model Evaluation</font> <a class="anchor" id="P43"></a>

We will compare the scores from these models with the actual ratings. 

First, we check for any missing or infinite values in the desired columns, as errors could have been introduced during the modeling process.

In [None]:
print('Missing Values:')
print(data_[['Compound Score (VADER)', 'Mean Compound Score (VADER)', 'Polarity Score (TextBlob)', 'Subjectivity Score (TextBlob)']].isnull().sum())
print() 
print('Infinite Values:')
print(np.isinf(data_[['Compound Score (VADER)', 'Mean Compound Score (VADER)', 'Polarity Score (TextBlob)', 'Subjectivity Score (TextBlob)']]).sum())

Only the 'Mean Compound Score (VADER)' column had 8 missing values, which will be removed.

In [None]:
data_ = data_.dropna(subset = ['Mean Compound Score (VADER)'])

In this step, we apply the `MinMaxScaler` to normalize the values of the 'Rating' column and the sentiment scores ('Compound Score (VADER)' and 'Polarity Score (TextBlob)'). 
<br>
This ensures that all values are within the range [0, 1], making them comparable. 
<br>
We also check for any missing values in the scaled columns to ensure data integrity.

In [None]:
Min_Max_Scaler = MinMaxScaler()

Rating_Scaled = Min_Max_Scaler.fit_transform(data_['Rating'].to_numpy().reshape(-1, 1))
Compound_Score_Vader_Scaled = Min_Max_Scaler.fit_transform(data_['Compound Score (VADER)'].to_numpy().reshape(-1, 1))
Polarity_Score_TextBlob_Scaled = Min_Max_Scaler.fit_transform(data_['Polarity Score (TextBlob)'].to_numpy().reshape(-1, 1))

print("Missing Values in Rating_Scaled:", np.isnan(Rating_Scaled).sum())
print("Missing Values in Compound_Score_Vader_Scaled:", np.isnan(Compound_Score_Vader_Scaled).sum())
print("Missing Values in Polarity_Score_TextBlob_Scaled:", np.isnan(Polarity_Score_TextBlob_Scaled).sum())
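
Min-max scaling maps each value x to (x - min) / (max - min), using the minimum and maximum observed in that column. The sketch below is a dependency-free illustration of that formula, not a replacement for `MinMaxScaler`. It mimics the documented scikit-learn behaviour of ignoring NaN when computing the range and keeping NaN in the output, which is why a missing-value check after scaling is still meaningful. The function name and the sample values are made up.

```python
import math

def min_max_scale(values):
    """Rescale values to [0, 1]; NaN entries are ignored for min/max and kept as NaN."""
    finite = [v for v in values if not math.isnan(v)]
    lo, hi = min(finite), max(finite)
    span = hi - lo
    if span == 0:
        # Constant column: map every non-missing value to 0.0 (assumption of this sketch).
        return [v if math.isnan(v) else 0.0 for v in values]
    return [v if math.isnan(v) else (v - lo) / span for v in values]

ratings = [1.0, 3.0, 5.0, math.nan]   # made-up ratings on a 1-5 scale
compound = [-1.0, 0.0, 0.5, 1.0]      # made-up compound scores in [-1, 1]
ratings_scaled = min_max_scale(ratings)
compound_scaled = min_max_scale(compound)
```

Because the range comes from the observed data, a column whose values never reach the theoretical extremes is still stretched to cover the full [0, 1] interval.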

The following code removes any rows with missing values in the scaled 'Rating' column or in either of the scaled sentiment score columns, so that the three arrays stay aligned before further analysis.

In [None]:
mask = ~np.isnan(Rating_Scaled.reshape(-1)) & ~np.isnan(Compound_Score_Vader_Scaled.reshape(-1)) & ~np.isnan(Polarity_Score_TextBlob_Scaled.reshape(-1))

Rating_Scaled, Compound_Score_Vader_Scaled, Polarity_Score_TextBlob_Scaled = [x[mask] for x in [Rating_Scaled, Compound_Score_Vader_Scaled, Polarity_Score_TextBlob_Scaled]]

This code was used to compute sentiment analysis metrics by comparing the scaled values of the 'Rating' column with the scaled sentiment scores from VADER and TextBlob. 

The resulting metrics were displayed for evaluation.

In [None]:
sentiment_analysis_metrics = Sentiment_Analysis_Metrics(Rating_Scaled, Compound_Score_Vader_Scaled, Polarity_Score_TextBlob_Scaled)
display(sentiment_analysis_metrics)

- **Error Metrics:** Both models demonstrate relatively low error rates, and neither is better on every metric. TextBlob has a slightly lower Mean Squared Error (0.0773) and Root Mean Squared Error (0.2780) than VADER (MSE 0.0794, RMSE 0.2818), whereas VADER has a lower Mean Absolute Error (0.2080) than TextBlob (0.2297).
- **Correlation:** The Pearson correlation between sentiment scores and ratings is reasonably strong for both models, with VADER scoring slightly higher (0.7018) than TextBlob (0.6957). This indicates a good positive relationship between sentiment scores and ratings.

Overall, both models align reasonably well with actual ratings. VADER has a slight edge on MAE and correlation, while TextBlob is marginally better on MSE and RMSE.
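
The three error metrics reported above can be sketched in a few lines. This is not the notebook's `Sentiment_Analysis_Metrics` helper, whose implementation is not shown here; the function name and the sample values are assumptions for illustration.

```python
import math

def regression_errors(y_true, y_pred):
    """Return MSE, RMSE and MAE between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return {'MSE': mse, 'RMSE': math.sqrt(mse), 'MAE': mae}

# Made-up scaled values for illustration only.
rating_scaled = [1.0, 0.75, 0.0, 0.5]
score_scaled = [0.5, 0.75, 0.5, 0.5]
errors = regression_errors(rating_scaled, score_scaled)
```

Since RMSE is the square root of MSE, the two always rank models in the same order, which matches the reported values (0.2818² ≈ 0.0794 and 0.2780² ≈ 0.0773). MAE penalises large errors less heavily, so it can rank the models differently.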

**Regression Plot: VADER vs. TextBlob Sentiment Scores**

In [None]:
plot_regression(Compound_Score_Vader_Scaled, Polarity_Score_TextBlob_Scaled)

As mentioned above, a positive correlation is observed.

**Normalized Distributions of Rating and Sentiment Scores**

In [None]:
plot_normalized_distributions(Rating_Scaled, Compound_Score_Vader_Scaled, Polarity_Score_TextBlob_Scaled)

The curves for the rating and the Compound Score (VADER) appear similar, whereas the curve for the Polarity Score (TextBlob) differs significantly from both.
<br>
This is consistent with VADER's lower MAE and higher correlation with the ratings.

---

<font color='#BFD72F' size=5>5. Topic Modelling</font> <a class="anchor" id="P5"></a>
  
[Back to TOC](#toc)

*Can the reviews be classified according to emergent topics? (e.g., can review j be made up of 0.5 topic “service; speed; sympathy”, and 0.3 topic “ambiance; decoration; furniture”?) What do the emergent topics mean? (i.e., are they meaningful regarding the project’s context?) Can relevant insights be extracted from the topics?*

In this step, we perform topic modeling on the preprocessed text data (the same used for the multilabel classification) using various vectorization methods and models.
<br>
By applying Bag of Words (BoW) and TF-IDF for text representation, and `Latent Semantic Analysis` (LSA) and `Latent Dirichlet Allocation` (LDA) for modeling, we aim to identify underlying topics within the data.
<br>
The analysis explores up to 5 topics, extracting the top 10 words for each, using the Lemmas_Treated column as the input text.

In [None]:
Topic_Modelling = run_topic_modelling(data = data_for_multilabel_classification_clean, vectorization_methods = ['bow', 'tfidf'], models = ['LSA', 'LDA'], max_k = 5, n_top_words = 10, text_column = 'Lemmas_Treated')

- BoW-based methods (both LSA and LDA) generally produced higher coherence scores, making them more effective for extracting meaningful topics from this dataset.

- TF-IDF-based methods resulted in lower coherence scores, indicating that the additional weighting of terms may not have been as effective in this context.

Overall, the results suggest that **LSA with BoW** performed the best in terms of topic coherence and interpretability, offering a clear view of the most prominent themes, such as food quality, service experience, and delivery.
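
The difference between the two text representations can be illustrated on a toy corpus. The sketch below builds raw counts (BoW) and a TF-IDF weighting with the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, which is the default in scikit-learn's TF-IDF implementation before normalisation. Whether `run_topic_modelling` uses exactly this weighting is an assumption, and the documents are made up.

```python
import math
from collections import Counter

def bow_vectors(docs):
    """Raw term counts per document (Bag of Words)."""
    return [Counter(doc.split()) for doc in docs]

def tfidf_vectors(docs):
    """Term counts reweighted by a smoothed inverse document frequency."""
    counts = bow_vectors(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

docs = ["good food good place", "good biryani", "slow delivery"]
bow = bow_vectors(docs)
tfidf = tfidf_vectors(docs)
```

In the second document, "good" and "biryani" have the same BoW count, but TF-IDF gives "biryani" the larger weight because it appears in fewer documents. Down-weighting very frequent words such as "good" changes which terms dominate each topic, which is one possible reason for the different coherence scores.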

In [None]:
fig = make_subplots(rows = 2, cols = 2, subplot_titles = ["LSA & TFIDF", "LSA & BOW", "LDA & TFIDF", "LDA & BOW"], shared_yaxes = True, shared_xaxes = False)

fig.add_trace(heatmap_topic_modelling(data_for_multilabel_classification_clean, 'LSA', 'TFIDF'), row = 1, col = 1)
fig.add_trace(heatmap_topic_modelling(data_for_multilabel_classification_clean, 'LSA', 'BOW'), row = 1, col = 2)
fig.add_trace(heatmap_topic_modelling(data_for_multilabel_classification_clean, 'LDA', 'TFIDF'), row = 2, col = 1)
fig.add_trace(heatmap_topic_modelling(data_for_multilabel_classification_clean, 'LDA', 'BOW'), row = 2, col = 2)

fig.update_layout(height = 800, width = 1000, title_text = "Topic Modelling Heatmaps", title_font = dict(size = 20, family = "Arial", weight = "bold"), template = "plotly_white", 
                  showlegend = False, coloraxis = dict(colorscale = "Blues", colorbar = dict(title="Correlation", tickvals = [-1, 0, 1])))

fig.show()

| **Aspect**              | **LSA with BOW**                          | **LDA with BOW**                          | **LSA with TFIDF**                        | **LDA with TFIDF**                         |
|-------------------------|-------------------------------------------|-------------------------------------------|------------------------------------------|-------------------------------------------|
| **Coherence Score**     | 0.4536                                   | 0.4095                                    | 0.4202                                   | 0.3802                                    |
| **Top Themes (Topics)** | 1. General food/service (good, food, place)<br>2. Specific dishes (chicken, biryani, fried)<br>3. Veg focus (taste, veg, paneer)<br>4. Social setting (place, friends, chocolate)<br>5. Mixed food mentions (wings, burger, mandi) | 1. General food (chicken, ordered, taste)<br>2. Service and experience (staff, visit, time)<br>3. Delivery-related terms (order, delivery, zomato)<br>4. Ambience and quality (place, great, nice)<br>5. Food mentions (biryani, paneer, quantity) | 1. General satisfaction (good, service, taste)<br>2. Experience focus (place, service, great)<br>3. Social setting (friends, hangout, music)<br>4. Delivery performance (delivery, fast, quick)<br>5. Food issues (biryani, bad, spicy) | 1. General food satisfaction (good, food, taste)<br>2. Service and experience (awesome, nice, staff)<br>3. Delivery speed/issues (fast, superb, waste)<br>4. Food quality (spicy, tasty, avg)<br>5. Specific food items (paratha, stale, dal) |
| **Topic Overlap**       | Moderate overlap (Topics 1 and 2)         | Better separation; Topics distinct         | Slight redundancy (Topics 1 and 2 overlap) | Clearer separation, some niche focus       |
| **Correlation Highlights** | - Strong negative: Topic 0 and Topic 2 (-0.69)<br>- Weak correlations elsewhere | - Moderate negative: Topic 3 and Topic 0 (-0.35)<br>- Topics moderately separated | - Strong negative: Topic 0 and Topic 1 (-0.72)<br>- Weak correlations overall | - Strong negative: Topic 0 and Topic 1 (-0.73)<br>- Weak correlations elsewhere |
| **Granularity**         | Balanced, broad topics                   | Granular and specific topics              | Balanced themes with distinct nuances     | Very granular, specific word-level topics  |
| **Strengths**           | Best coherence score, broad interpretability | Clear separation of topics               | Captures delivery/service nuances         | Clear separation, niche-specific terms     |
| **Weaknesses**          | Topics 1 and 2 overlap                     | Lower coherence, slight noise in terms    | Topics 1 and 2 overlap; slight redundancy   | Lower coherence, noisy rare terms          |

---

<font color='#BFD72F' size=5>6. Co-occurrence Analysis</font> <a class="anchor" id="P6"></a>
  
[Back to TOC](#toc)

*What dishes are mentioned together in the reviews? Do they form clusters? Can you identify cuisine types based on those clusters?*

For this phase, we decided to utilize the treated reviews as they were prepared for the multilabel classification task.

In [None]:
# Copy the selection so that new columns can be added without a SettingWithCopyWarning
data_clustering = data_for_multilabel_classification_clean[['Cuisines', 'Lemmas_Treated']].copy()

We extracted dish mentions from the treated lemmas in the reviews and stored them in a new column, 'Dishes', for further analysis.

In [None]:
data_clustering['Dishes'] = data_clustering['Lemmas_Treated'].apply(find_dishes)

We processed this new column to create a structured list of individual dishes. 
<br>
Using this, we generated all possible pairs of co-occurring dishes and computed their frequencies using `Counter`. 
<br>
The unique dishes were extracted and used to construct a square matrix, with rows and columns representing dishes and their co-occurrence counts.

In [None]:
data_clustering['Dishes_List'] = data_clustering['Dishes'].dropna().apply(lambda x: [dish.strip() for dish in x.split(',')])

pairs = []
for dishes in data_clustering['Dishes_List'].dropna():
    pairs.extend(combinations(sorted(dishes), 2))

pair_counts = Counter(pairs)

unique_dishes = sorted({dish for dishes in data_clustering['Dishes_List'].dropna() for dish in dishes})

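matrix = pd.DataFrame(0, index = unique_dishes, columns = unique_dishes)

# Fill the matrix symmetrically with the co-occurrence counts
for (dish1, dish2), count in pair_counts.items():
    matrix.loc[dish1, dish2] = count
    matrix.loc[dish2, dish1] = count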
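
To make the pair counting concrete, the toy example below applies the same `combinations` and `Counter` idea to three made-up reviews and stores the result in a symmetric nested dictionary instead of a DataFrame. The variable names are hypothetical. Unlike the cell above, this sketch also removes repeated mentions of a dish within one review via `set()`, which is a design choice of the example.

```python
from collections import Counter
from itertools import combinations

# Made-up dish lists, one per review (illustration only).
toy_reviews = [
    ['chicken biryani', 'butter naan'],
    ['butter naan', 'chicken biryani', 'gulab jamun'],
    ['gulab jamun'],
]

toy_pair_counts = Counter()
for dishes in toy_reviews:
    # set() drops repeated mentions; sorted() makes (a, b) and (b, a) the same key.
    toy_pair_counts.update(combinations(sorted(set(dishes)), 2))

toy_dishes = sorted({d for dishes in toy_reviews for d in dishes})
# Symmetric co-occurrence matrix: toy_matrix[a][b] == toy_matrix[b][a].
toy_matrix = {a: {b: 0 for b in toy_dishes} for a in toy_dishes}
for (a, b), count in toy_pair_counts.items():
    toy_matrix[a][b] = count
    toy_matrix[b][a] = count
```

A review that mentions a single dish contributes no pairs, so only reviews with at least two distinct dishes shape the network.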

**Visualizing the Top 10 Dishes Co-occurrence Network**

We plotted the top 10 dishes to simplify the visualization for better understanding.

In [None]:
G = nx.Graph()
for (dish1, dish2), weight in pair_counts.items():
    G.add_edge(dish1, dish2, weight = weight)

dish_degrees = G.degree(weight = 'weight')
top_dishes = sorted(dish_degrees, key = lambda x: x[1], reverse = True)[:10]
top_dishes = [dish for dish, _ in top_dishes]

subgraph = G.subgraph(top_dishes)

sorted_edges = sorted(subgraph.edges(data = True), key = lambda edge: edge[2]['weight'], reverse = True)[:3]
top_3_edges = {(u, v): d['weight'] for u, v, d in sorted_edges}

plt.figure(figsize = (12, 8))
pos = nx.spring_layout(subgraph, k = 0.3, seed = 42)  

edge_widths = [2 if (u, v) in top_3_edges or (v, u) in top_3_edges else 0.7 for u, v in subgraph.edges]

nx.draw_networkx_nodes(subgraph, pos, node_size = 500, node_color = (141/255, 160/255, 203/255), edgecolors = "black")

nx.draw_networkx_edges(subgraph, pos, alpha = 0.7, width = edge_widths)

edge_labels = {edge: weight for edge, weight in top_3_edges.items()}
nx.draw_networkx_edge_labels(subgraph, pos, edge_labels = edge_labels, font_size = 10)

nx.draw_networkx_labels(subgraph, pos, font_size = 10, font_family = "sans-serif")

plt.title("Top 10 Dishes Co-occurrence Network", fontsize = 16, fontweight = "bold")
plt.axis("off")
plt.show()
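
The "top 10" selection in the cell above ranks dishes by weighted degree, that is, the sum of the weights of all edges touching a dish. The dependency-free sketch below shows the same ranking on a made-up set of pair counts; the function names are hypothetical.

```python
from collections import defaultdict

def weighted_degrees(pair_counts):
    """Sum, for each dish, the weights of all edges that touch it."""
    degree = defaultdict(int)
    for (a, b), weight in pair_counts.items():
        degree[a] += weight
        degree[b] += weight
    return dict(degree)

def top_n(pair_counts, n):
    """Return the n dishes with the highest weighted degree."""
    degrees = weighted_degrees(pair_counts)
    # Sort by degree (descending), then by name so that ties are resolved deterministically.
    return sorted(degrees, key=lambda d: (-degrees[d], d))[:n]

# Made-up co-occurrence counts for illustration only.
toy_counts = {
    ('chicken', 'chicken biryani'): 5,
    ('chicken', 'chicken tikka'): 3,
    ('butter naan', 'chicken tikka'): 1,
}
toy_degrees = weighted_degrees(toy_counts)
toy_top = top_n(toy_counts, 2)
```

A dish can therefore rank highly either because it co-occurs with many different dishes or because it co-occurs very often with a few.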

The co-occurrence network visualization for the top 10 dishes reveals significant patterns in dish pairings. 
<br>
The **strongest connection**, represented by the thickest edge, is between **Chicken Biryani** and **Chicken** (weight: 174), indicating that these two dishes are frequently mentioned together in reviews. 
<br>
Other notable connections include Chicken and Chicken Tikka (weight: 117) and Butter Chicken and Chicken (weight: 61), highlighting the centrality of chicken-based dishes in customer preferences.

From the graph, it is evident that dishes like Chicken Biryani, Chicken, and Butter Chicken act as key nodes, linking with multiple other dishes such as Paneer Tikka, Tandoori Chicken, and Chicken Soup. 
<br>
This suggests that these dishes are central to the dining experience and are often ordered or discussed together.

Furthermore, non-chicken dishes such as Gulab Jamun and Butter Naan are also present in the network, showing their relevance as complementary items to chicken-based meals. 
<br>
Overall, the network emphasizes the dominance of chicken-based dishes in co-occurrence and highlights key pairings that could inform menu structuring or promotional strategies.

**Network Visualization of Dishes for Each Cuisine**

In [None]:
unique_cuisines = sorted(set(cuisine.strip() for cuisines in data_clustering['Cuisines'].dropna() for cuisine in cuisines.split(',')))

n_rows = -(-len(unique_cuisines) // 3)  # ceiling division: three plots per row

fig, axes = plt.subplots(n_rows, 3, figsize = (20, 5 * n_rows))
axes = axes.flatten() 

for i, cuisine in enumerate(unique_cuisines):
    ax = axes[i]
    subgraph = get_cuisine_subgraph(cuisine, data_clustering)
    pos = nx.spring_layout(subgraph, k = 0.3, seed = 42) 

    nx.draw_networkx_nodes(subgraph, pos, node_size = 500, node_color = [(141/255, 160/255, 203/255)] * subgraph.number_of_nodes(), edgecolors = "black", ax = ax)
    nx.draw_networkx_edges(subgraph, pos, alpha = 0.7, ax = ax)
    nx.draw_networkx_labels(subgraph, pos, font_size = 8, ax = ax)

    ax.set_title(cuisine.capitalize(), fontsize = 10, fontweight = "bold")
    ax.axis("off")

for i in range(len(unique_cuisines), len(axes)):
    axes[i].axis("off")

plt.tight_layout()
plt.show()

The network visualizations for different cuisines highlight unique patterns of dish co-occurrence within each cuisine. 
<br>
Prominent dishes are often central in their respective networks, forming strong connections with complementary items. 
<br>
The graphs reveal distinct clusters for each cuisine, showcasing key pairings and relationships that could inform menu optimization, marketing strategies, or customer preference analysis. 

Next, we decided to investigate whether the dishes formed distinct clusters based on their co-occurrence in reviews. 
<br>
By applying the `Greedy Modularity` method, we identified communities that represent groups of dishes frequently mentioned together. 
<br>
This clustering approach aimed to uncover underlying patterns and relationships between dishes, providing insights into how they are paired or grouped by customers.

In [None]:
G = nx.Graph()
for (dish1, dish2), weight in pair_counts.items():
    G.add_edge(dish1, dish2, weight = weight)

communities = list(greedy_modularity_communities(G))

cluster_report = []
for idx, community in enumerate(communities):
    subgraph = G.subgraph(community)
    
    nodes = list(subgraph.nodes)
    size = len(nodes)
    degrees = subgraph.degree(weight = 'weight')
    prominent_nodes = sorted(degrees, key = lambda x: x[1], reverse = True)[:3]  
    prominent_nodes = [node for node, degree in prominent_nodes]
    
    cluster_report.append({'Cluster ID': idx + 1, 'Nodes': nodes, 'Size': size, 'Prominent Nodes': prominent_nodes})

cluster_report = pd.DataFrame(cluster_report)
display(cluster_report)

# cm.get_cmap was removed in Matplotlib 3.9; plt.get_cmap accepts the same arguments
colormap = plt.get_cmap('Blues', len(communities))
colors = [colormap(i) for i in range(len(communities))]

for idx, community in enumerate(communities):
    subgraph = G.subgraph(community)
    plt.figure(figsize = (8, 6))
    
    pos = nx.spring_layout(subgraph, k = 0.3, seed = 42) 
    
    nx.draw_networkx_nodes(subgraph, pos, node_size = 500, node_color = colors[idx], edgecolors = "black")
    
    nx.draw_networkx_edges(subgraph, pos, alpha = 0.7)
    
    nx.draw_networkx_labels(subgraph, pos, font_size = 10, font_family = "sans-serif")
    
    plt.title(f"Community {idx + 1}", fontsize = 16, fontweight = "bold")
    plt.axis("off")
    plt.show()

The clustering analysis revealed **five distinct communities** of dishes based on their co-occurrence patterns:

- **Cluster 1:** The largest cluster, containing 27 dishes, includes a diverse mix such as Shahi Paneer, Barbeque Chicken, and Cheese Burger, indicating these dishes are frequently mentioned together and form a highly interconnected network.
- **Cluster 2:** With 22 dishes, this cluster features items like Andhra Chicken Tikka, Apollo Fish, and Afghani Chicken, suggesting a focus on regional and protein-rich dishes.
- **Cluster 3:** Comprising 20 dishes, this cluster includes favorites such as Dal Makhani, Paneer Butter Masala, and Achari Chicken, highlighting popular Indian comfort foods often enjoyed together.
- **Cluster 4:** A smaller cluster of 6 dishes, including Alfredo Pasta and Creamy Alfredo Pasta, points to a specialization around pasta and creamy dishes, indicating a cohesive group of Italian-inspired options.
- **Cluster 5:** The smallest cluster with only 2 dishes, Egg Curry and Barfi Ice Cream, may reflect a niche pairing or a less frequent combination.
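
Greedy modularity maximisation starts with every dish in its own community and repeatedly merges the pair of communities that increases modularity the most. The sketch below computes the modularity Q of a given partition, so the quantity being optimised is explicit: for each community, the share of edge weight inside it minus the share expected from its total degree. It is a dependency-free illustration on a made-up graph, not the networkx implementation.

```python
def modularity(edges, communities):
    """Weighted modularity Q of a partition.

    edges: dict mapping (u, v) pairs to positive weights (undirected, no self-loops).
    communities: iterable of node collections covering every node exactly once.
    """
    communities = [set(group) for group in communities]
    total_weight = sum(edges.values())
    membership = {node: idx for idx, group in enumerate(communities) for node in group}
    internal = [0.0] * len(communities)    # edge weight inside each community
    degree_sum = [0.0] * len(communities)  # summed weighted degree of each community
    for (u, v), weight in edges.items():
        degree_sum[membership[u]] += weight
        degree_sum[membership[v]] += weight
        if membership[u] == membership[v]:
            internal[membership[u]] += weight
    return sum(
        internal[c] / total_weight - (degree_sum[c] / (2 * total_weight)) ** 2
        for c in range(len(communities))
    )

# Toy graph: two triangles joined by a single edge, all weights equal to 1.
toy_edges = {('a', 'b'): 1, ('b', 'c'): 1, ('a', 'c'): 1,
             ('d', 'e'): 1, ('e', 'f'): 1, ('d', 'f'): 1, ('c', 'd'): 1}
q_split = modularity(toy_edges, [{'a', 'b', 'c'}, {'d', 'e', 'f'}])
q_single = modularity(toy_edges, [{'a', 'b', 'c', 'd', 'e', 'f'}])
```

Splitting the toy graph into its two triangles gives a higher Q than keeping everything in one community, which is the kind of improvement the greedy procedure looks for. Note that networkx's `greedy_modularity_communities` takes an optional `weight` argument naming the edge attribute to use; the call above does not pass one, so whether the co-occurrence counts influence the communities depends on that default.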