# Student Information
---

> Dionysios Rigatos <br />
> Department of Informatics  <br />
> Athens University of Economics and Business <br />
> p3200262@aueb.gr

# Roget's Thesaurus in the 21st Century

The first known thesaurus was written in the 1st century CE by [Philo of Byblos](https://en.wikipedia.org/wiki/Philo_of_Byblos); it was called *Περὶ τῶν διαφόρως σημαινομένων*, loosely translated in English as *On Synonyms*. Fast forward about two millennia and we arrive at the best-known thesaurus, compiled by [Peter Mark Roget](https://en.wikipedia.org/wiki/Peter_Mark_Roget), a British physician, natural theologian, and lexicographer. [Roget's Thesaurus](https://en.wikipedia.org/wiki/Roget%27s_Thesaurus) was released on 29 April 1852, containing 15,000 words. Subsequent editions were larger, with the latest totalling 443,000 words. In Greek, the best-known thesaurus, *Αντιλεξικόν ή Ονομαστικόν της Νεοελληνικής Γλώσσης*, was released in 1949 by [Θεολόγος Βοσταντζόγλου](https://el.wikipedia.org/wiki/%CE%98%CE%B5%CE%BF%CE%BB%CF%8C%CE%B3%CE%BF%CF%82_%CE%92%CE%BF%CF%83%CF%84%CE%B1%CE%BD%CF%84%CE%B6%CF%8C%CE%B3%CE%BB%CE%BF%CF%85); the latest updated edition was released in 2008 and remains an indispensable source for writing in Greek.

Roget organised the entries of the thesaurus in a hierarchy of categories. Your task in this assignment is to investigate how these categories fare with the meaning of English words as captured by Machine Learning techniques, namely, their embeddings.

Note that this is an assignment that requires initiative and creativity on your part. There is no simple right or wrong answer. It is up to you to find the best solution. You have three weeks to do it. Make them count.

In [None]:
# Standard Libraries
import json
import re
from ast import literal_eval

In [None]:
# Data Handling and Visualization Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px

# ML Model & Data Preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             homogeneity_score, completeness_score, v_measure_score,
                             classification_report, confusion_matrix)
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import umap

# Imbalanced Dataset Handling
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# Web Scraping
import requests
from bs4 import BeautifulSoup

# Deep Learning
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Dense, Dropout, BatchNormalization, Input, PReLU)
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import (EarlyStopping, ReduceLROnPlateau, LearningRateScheduler)
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.losses import CategoricalFocalCrossentropy 
from tensorflow.keras.regularizers import l1

# NLP Libraries
import spacy

# External Model APIs
import voyageai
from mistralai.client import MistralClient

%matplotlib inline

In [None]:
spacy_en = spacy.load("en_core_web_sm")

* If the embedding file is not found, the embeddings will be fetched automatically using the APIs.

* Fetching requires Voyage AI and Mistral API keys. They are read from the `VOYAGE_API_KEY` and `MISTRAL_API_KEY` environment variables rather than being hard-coded in the notebook.

In [None]:
import os

voyage_client = voyageai.Client(api_key=os.environ.get("VOYAGE_API_KEY"))
mistral_client = MistralClient(api_key=os.environ.get("MISTRAL_API_KEY"))

## Get Roget's Thesaurus Classification

You can find [Roget's Thesaurus classification online at Wiktionary](https://en.wiktionary.org/wiki/Appendix:Roget%27s_thesaurus_classification). You must download the categorisation (and the words belonging in each category), save them and store them in the way that you deem most convenient for processing.

### Step 1: Fetch the Thesaurus

* Initially we want to fetch the thesaurus from [Project Gutenberg's Roget's Thesaurus](https://www.gutenberg.org/cache/epub/22/pg22-images.html).

* We will do this using BeautifulSoup as it is a simple and easy to use library for web scraping.

In [None]:
dataset_url = "https://www.gutenberg.org/cache/epub/22/pg22-images.html"

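try:
    response = requests.get(dataset_url, timeout=30)
    response.raise_for_status()  # Treat HTTP error codes (4xx/5xx) as failures.
except requests.RequestException as e:
    # Stop here: the cells below need the parsed page.
    raise RuntimeError(f"Could not retrieve {dataset_url}") from e
else:
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    print(f"Page retrieval OK with Status Code ({response.status_code})")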

### Step 2: Extract Classes, Sections and Words

* Once we have the scraped HTML, we want to extract the classes, sections and entries from the thesaurus.

* Initially we will only extract the necessary information without performing any text processing techniques on the data, therefore it will be in a raw format.

From a quick inspection of the HTML, we can see that:

* There are multiple `h2` tags which might not contain relevant information.
* Divisions, despite being hierarchically under Classes, are not nested inside them in the HTML. Both appear as sibling `h2` tags.
* Sections are under divisions as `h3` tags.
* Entries are under sections as `p` tags, belonging to the `p2` class.

```html
<h2>ROGET’S THESAURUS<br>OF<br>ENGLISH WORDS AND PHRASES</h2>

<h2><a id="class01"></a>CLASS I<br>
WORDS EXPRESSING ABSTRACT RELATIONS</h2>
<h3><a id="sect01"></a>S<small>ECTION</small> I. EXISTENCE</h3>
<h4>1. BEING, IN THE ABSTRACT</h4>
<p class="p2">
<b>#1. Existence.—N.</b> existence, being, entity, <i>ens</i>, <i>esse</i>,
subsistence.<br>
```

##### A convention
* Since we wish to classify using two levels, if a Class contains Divisions, then the Divisions will act as the second level of the hierarchy and contain all the words their Sections had. If a Class does not contain Divisions, then the Sections will act as the second level and contain all the words they originally had.
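
The convention can be sketched on a toy hierarchy. The nested dictionary and the helper `second_level` below are hypothetical and only illustrate the flattening rule; they are not the scraping code used in this notebook.

```python
# Toy hierarchy (made-up data): a class either holds divisions, which hold
# sections, or holds sections directly.
toy_thesaurus = {
    "class_with_divisions": {
        "divisions": {
            "division_a": {"section_1": ["idea", "thought"], "section_2": ["memory"]},
            "division_b": {"section_1": ["speech"]},
        }
    },
    "class_without_divisions": {
        "sections": {"section_1": ["existence", "being"], "section_2": ["relation"]}
    },
}


def second_level(class_node):
    """Return {second-level name: words} following the convention above."""
    if "divisions" in class_node:
        # Each division absorbs the words of all of its sections.
        return {
            division: [word for words in sections.values() for word in words]
            for division, sections in class_node["divisions"].items()
        }
    # No divisions: the sections themselves are the second level.
    return dict(class_node["sections"])


flattened = {name: second_level(node) for name, node in toy_thesaurus.items()}
print(flattened)
```

Note that flattening is also why section numbers can repeat inside a class: two divisions may each contribute a "Section I".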

In [None]:
raw_data = {}

# Find all classes and divisions
h2_headers = soup.find_all("h2", id=None)
relevant_h2 = [h2 for h2 in h2_headers if re.search("class|division", h2.text, re.IGNORECASE)]

current_class = ""

for header in relevant_h2:
    header_text = header.text.strip()
    
    # Determine if it is a class or division
    if "class" in header_text.lower():
        current_class = header_text
        class_data = {}
        
        # Find all sections within each class only
        class_sections = [section for section in header.find_all_next('h3') if section.find_previous('h2') == header]
        
        # Iterate over each section
        for section in class_sections:
            section_title = section.text
                    
            # Find all 'p' tags belonging to this section only
            entries = [entry.text for entry in section.find_all_next('p', class_="p2") if entry.find_previous('h3') == section]
            
            class_data[section_title] = entries
            
        raw_data[current_class] = class_data
        
    # Divisions will act as sections within the class
    elif "division" in header_text.lower():
        # Skip divisions that appear before any class header has been seen.
        if current_class:
            section_title = header_text
            
            entries = [entry.text for entry in header.find_all_next('p', class_="p2") if entry.find_previous('h2') == header]
            
            raw_data[current_class][section_title] = entries

* We will store our extracted, yet **unprocessed** data in JSON format.

In [None]:
with open('thesaurus_raw.json', 'w') as json_file:
    json.dump(raw_data, json_file, indent=4, ensure_ascii=False)

* Let's take a look at our hierarchy as well as how many entries are contained per Section.

* We notice that a Class might have multiple Sections with the same numbering; this is because we flattened the Divisions.

* Each entry is a word along with all of its synonym phrases - it is still a single string as it is unprocessed.

In [None]:
def print_structure(data):
    total_entries = 0

    for class_title, class_data in data.items():
        print(f"\n{class_title}")
        for section_title, entries in class_data.items():
            print(f"  {section_title} [Entries: {len(entries)}]")
            total_entries += len(entries)
    print(f"\nTotal Entries: {total_entries}")

In [None]:
print_structure(raw_data)

* We have slightly over 1000 entries, which matches what we expected from the thesaurus.

##### Entry Example

```json
"\n#1. Existence.—N. existence, being, entity, ens, esse,\r\nsubsistence.\r\n     reality, actuality; positiveness &c. adj.; fact, matter of fact, sober\r\nreality; truth &c. 494; actual existence.\r\n     presence &c. (existence in space) 186; coexistence &c. 120.\r\n     stubborn fact, hard fact; not a dream &c. 515; no joke.\r\n     center of life, essence, inmost nature, inner reality, vital\r\nprinciple.\r\n     [Science of existence], ontology.\r\n     V. exist, be; have being &c. n.; subsist, live, breathe, stand,\r\nobtain, be the case; occur &c. (event) 151; have place, prevail; find\r\noneself, pass the time, vegetate.\r\n     consist in, lie in; be comprised in, be contained in, be constituted\r\nby.\r\n     come into existence &c. n.; arise &c. (begin) 66; come forth &c.\r\n(appear) 446.\r\n     become &c. (be converted) 144; bring into existence &c. 161.\r\n     abide, continue, endure, last, remain, stay.\r\n     Adj. existing &c. v.; existent, under the sun; in existence &c. n.;\r\nextant; afloat, afoot, on foot, current, prevalent; undestroyed.\r\n     real, actual, positive, absolute; true &c. 494; substantial,\r\nsubstantive; self-existing, self-existent; essential.\r\n     well-founded,  well-grounded; unideal[obs3], unimagined; not potential\r\n&c. 2; authentic.\r\n     Adv. actually &c. adj.; in fact, in point of fact, in reality; indeed;\nde facto, ipso facto.\r\n     Phr. ens rationis; ergo sum cogito; \"thinkest thou existence doth\r\ndepend on time?\" [Byron].\r\n"
```

### Step 3: Cleaning the Dataset & Extracting Words 

* We now have our dataset, but it's still quite useless for analysis as its format is not suitable for processing.

* We will **clean the data** and **extract the synonyms** from the entries, so that we have a dataset ready for natural language pre-processing.

* Let's initialize a function that can extract phrases from entries. A few notable issues in entries are:


    * Phrases are *usually* separated by commas, but also by semicolons and full stops.
        * We will handle this by replacing semicolons and full stops with commas.


    * There are a lot of special characters, line breaks and other formatting issues (e.g. `&c.`).
        * We will handle this by treating every run of characters other than letters, apostrophes and whitespace as a phrase separator.


    * Whitespace between words is inconsistent (line breaks, repeated spaces).
        * We will handle this by splitting each phrase on whitespace and re-joining its words with single spaces.

    
    * There are words and phrases in brackets - which are sometimes relevant and sometimes not (e.g. `[obs3]` is not relevant).
        * We will handle this by extracting any words or phrases in brackets and we will handle them later. This might lead to some useless phrases such as `obs`, but this is easily cleaned later on if deemed problematic - it is much more important to keep some relevant phrases that are in brackets.
    

    * We might end up with phrases that are 1 character long or less. 
        * These will be removed.



* We will also create two functions for extracting Class and Section names.
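
To make the rules above concrete, here is a minimal sketch applied to a short made-up entry. The helper name `split_entry_sketch` and the sample string are illustrative assumptions; the function actually used in the notebook follows in the next cell.

```python
import re


def split_entry_sketch(entry):
    """Hypothetical helper illustrating the cleaning rules on one raw entry."""
    # Every run of characters other than letters, apostrophes and whitespace
    # (punctuation, digits, brackets, '&') becomes a phrase separator.
    entry = re.sub(r"[^A-Za-z'\s]+", ",", entry)

    phrases = []
    for chunk in entry.split(","):
        # Lowercase and collapse line breaks / repeated spaces.
        phrase = " ".join(chunk.lower().split())
        # Drop empty and single-character leftovers such as the 'c' of '&c.'.
        if len(phrase) > 1:
            phrases.append(phrase)
    return phrases


sample = "Existence, being;  truth &c. 494;\r\nunideal[obs3]."
print(split_entry_sketch(sample))
```

The bracketed marker survives as the phrase `obs`, which is the kind of leftover that has to be filtered out in a later step.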

In [None]:
def sanitize_entry(entry):
    cleaned_phrases = []
    
    # Replace every run of characters other than letters, apostrophes and whitespace with a comma.
    entry = re.sub(r"[^A-Za-z'\s]+", ",", entry)
    
    for phrase in entry.split(','):
        # Lowercase the words and collapse whitespace (including line breaks) into single spaces.
        cleaned_phrase = " ".join(word.lower() for word in phrase.split())
        
        # Keep only phrases that are longer than one character.
        if len(cleaned_phrase) > 1:
            cleaned_phrases.append(cleaned_phrase)
    
    return cleaned_phrases

In [None]:
def sanitize_section(section_name):
    section_name = section_name.lower()
    relevant_name = re.sub(r"[^A-Za-z\s]+", "", section_name)
    relevant_name = "_".join(relevant_name.split()[2:])
    relevant_name = re.sub(r"\s+", "_", relevant_name)
    
    return "sc_" + relevant_name

In [None]:
class_names = ["existence", "space", "matter", "intellect", "volition", "affections"]
class_map = {raw_name: clean_name for raw_name, clean_name in zip(raw_data.keys(), class_names)}

def sanitize_class(class_name):
    return "cl_" + class_map[class_name]

* Since there are only six classes, we create the list of their names manually.

* Let's clean!

In [None]:
cleaned_data = {}

for class_title, class_data in raw_data.items():
    cleaned_class_data = {}
    
    for section_title, entries in class_data.items():
        cleaned_entries = []
        
        for entry in entries:
            cleaned_phrases = sanitize_entry(entry)
            cleaned_entries.extend(cleaned_phrases)
        
        cleaned_section_name = sanitize_section(section_title)
        cleaned_class_data[cleaned_section_name] = cleaned_entries
    
    cleaned_class_name = sanitize_class(class_title)
    cleaned_data[cleaned_class_name] = cleaned_class_data

* Let's take a look at our data structure now that we have cleaned it.

In [None]:
print_structure(cleaned_data)

* We notice that the amount of "Entries" is significantly higher, but now each entry was split into the phrases it contained.

* We still might have irrelevant phrases, such as `obs` or other stopwords.

* We will store our cleaned data in a new JSON file.

In [None]:
with open('thesaurus_cleaned.json', 'w') as json_file:
    json.dump(cleaned_data, json_file, indent=4, ensure_ascii=False)

### Step 4: Who is JSON? (Dataset Restructuring)

* We now have a scraped, categorised and cleaned dataset that may be used for a variety of tasks. Yay!

* We have been working with JSON files as they are easy to read, write and parse in structured dictionary-like formats.

* It is however important to note that we are going to be performing analytical and machine learning techniques, and JSON/dictionaries might not be the best formats for this.

* We will restructure our dataset into a pandas DataFrame, which is a more suitable format for data analysis and machine learning.

In [None]:
class_names = []
section_names = []
phrase_entry = []
for class_name, sections in cleaned_data.items():
    for section_name, phrases in sections.items():
        for phrase in phrases:
            class_names.append(class_name)
            section_names.append(section_name)
            phrase_entry.append(phrase)

data_df = pd.DataFrame({'class': class_names, 'section': section_names, 'phrase': phrase_entry})
data_df.info()

### Step 5: Text Preprocessing

* We will now perform whatever text preprocessing is left on our dataset to prepare it for our models.

#### Stopword and Non-English Phrase Removal

* Stopwords are words that are very common and carry little meaning, such as "the", "and", "is", "in", etc. They are usually removed from text when performing analytical tasks as they do not carry much contextual meaning.

* Some words and phrases in our dataset are not in English, and we will remove them as well.

* Some dataset-specific stopwords will also be removed.
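
As a small illustration of the filtering rule (keep a word only if it is in the vocabulary, is not a stopword and has more than one character), the sketch below uses a made-up vocabulary and stopword list. The names `filter_phrase_sketch`, `toy_vocabulary` and `toy_stopwords` are assumptions for the example; the notebook itself uses spaCy's vocabulary and stopwords.

```python
def filter_phrase_sketch(phrase, vocabulary, stopwords):
    """Keep words that are in the vocabulary, are not stopwords and have 2+ characters."""
    kept = [
        word
        for word in phrase.split()
        if word in vocabulary and word not in stopwords and len(word) > 1
    ]
    return " ".join(kept)


# Made-up vocabulary and stopword list, for illustration only.
toy_vocabulary = {"the", "stubborn", "fact", "of", "matter", "obs"}
toy_stopwords = {"the", "of", "obs"}

print(filter_phrase_sketch("the stubborn fact", toy_vocabulary, toy_stopwords))
print(repr(filter_phrase_sketch("ens rationis", toy_vocabulary, toy_stopwords)))
```

A phrase whose words are all filtered out becomes an empty string, which is why empty phrases have to be dropped from the DataFrame afterwards.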

In [None]:
def sanitize_phrase(phrase, vocabulary, stopwords):
    phrase = " ".join([word for word in phrase.split() if word in vocabulary and word not in stopwords and len(word) > 1])
    return phrase

In [None]:
custom_stopwords = ["obs", "adj", "adv", "lat", "fr", "ens", "de", "et", "al", "cf", "gr"]
stopwords = set(spacy_en.Defaults.stop_words).union(custom_stopwords)

en_vocabulary_set = set(spacy_en.vocab.strings)

data_df['phrase'] = data_df['phrase'].apply(sanitize_phrase, vocabulary=en_vocabulary_set, stopwords=stopwords)
data_df = data_df[data_df['phrase'].apply(lambda x: len(x) > 0)].dropna(how='any').reset_index(drop=True)
data_df.info()

* We removed almost half of our phrases, which suggests that many of them consisted only of stopwords, abbreviations or non-English terms.

#### Encoding

* We will encode our classes and sections into numerical values.
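
As a rough illustration of what the encoding produces: scikit-learn's `LabelEncoder` assigns integers to the sorted unique labels. The pure-Python sketch below mimics that behaviour on made-up labels; `encode_labels_sketch` is a hypothetical stand-in, not the encoder itself.

```python
def encode_labels_sketch(labels):
    """Map each label to its position among the sorted unique labels."""
    classes = sorted(set(labels))
    label_to_code = {label: code for code, label in enumerate(classes)}
    codes = [label_to_code[label] for label in labels]
    return codes, classes


toy_labels = ["cl_space", "cl_existence", "cl_space", "cl_matter"]
codes, classes = encode_labels_sketch(toy_labels)
print(codes)
print(classes)

# Decoding is a lookup into the sorted class list.
print([classes[code] for code in codes])
```

Keeping the sorted class list around is what makes it possible to translate predictions back into readable class and section names later on.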

In [None]:
class_encoder = LabelEncoder()
section_encoder = LabelEncoder()
data_df['class_'] = class_encoder.fit_transform(data_df['class'])
data_df['section_'] = section_encoder.fit_transform(data_df['section'])

# The fitted encoders already hold the label of every code in `classes_` (index == code).
section_mappings = dict(enumerate(section_encoder.classes_))
class_mappings = dict(enumerate(class_encoder.classes_))

In [None]:
class_mappings, section_mappings

#### More Filtering

* Let's take a look at the class distribution of our dataset.

In [None]:
class_distribution = data_df['class'].value_counts(normalize=True).reset_index()
class_distribution.columns = ['class', 'percentage']
class_distribution

* There's significant class imbalance. This will be very apparent in clustering and classification tasks as well.

* One technique that worked very well was removing phrases that appear in at least half of the sections. This is a very high threshold, so it targets phrases that are too generic to be informative. We also keep only the first occurrence of each phrase, so that every phrase carries a single label.
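
The section-frequency filter can be sketched in plain Python on made-up data, using the same `>= n_sections // 2` threshold as the cell below. The rows and variable names here are illustrative assumptions. Note that the counts have to be taken before duplicates are dropped, otherwise every phrase appears in exactly one section.

```python
from collections import defaultdict

# Made-up (section, phrase) pairs, for illustration only.
toy_rows = [
    ("sc_a", "thing"), ("sc_b", "thing"), ("sc_c", "thing"),
    ("sc_a", "entity"), ("sc_b", "motion"), ("sc_d", "colour"),
]

# Collect the set of sections each phrase appears in.
sections_per_phrase = defaultdict(set)
for section, phrase in toy_rows:
    sections_per_phrase[phrase].add(section)

n_sections = len({section for section, _ in toy_rows})
threshold = n_sections // 2

# Phrases spread over at least `threshold` sections are considered too generic.
too_common = {p for p, secs in sections_per_phrase.items() if len(secs) >= threshold}
filtered_rows = [(s, p) for s, p in toy_rows if p not in too_common]

print(too_common)
print(filtered_rows)
```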

In [None]:
# Count in how many sections each phrase appears *before* dropping duplicates;
# afterwards every phrase would belong to exactly one section.
phrase_counts = data_df.groupby('phrase')['section'].nunique()
phrases_to_remove = phrase_counts[phrase_counts >= data_df['section'].nunique()//2].index 

data_df = data_df[~data_df['phrase'].isin(phrases_to_remove)]
data_df = data_df.drop_duplicates(subset='phrase').dropna(how='any').reset_index(drop=True)

In [None]:
class_distribution = data_df['class'].value_counts(normalize=True).reset_index()
class_distribution.columns = ['class', 'percentage']
class_distribution

#### Honorable Mentions

* Lemmatization was implemented but it did not yield any significant improvements.

* Other filtering techniques to remove irrelevant phrases were tested, but they did not yield any significant improvements or introduced significant bias.

#### Finished!

* We have now finished our text preprocessing and we are ready to move on to the next step.

* A large portion of our dataset was removed, but this is a good thing as it means we had a lot of irrelevant data.

* There is no need for tokenization, as we will be using pre-trained embeddings and tokenization is performed by the respective APIs.

* We will export our **clean** dataset as a CSV file for easy access and use in the next steps.

In [None]:
data_df.to_csv('thesaurus_processed.csv', index=False)

Let's also extract a vocabulary of all the words in our dataset.

In [None]:
dataset_vocabulary = data_df['phrase'].unique()
len(dataset_vocabulary)

## Get Word Embeddings

You will get embeddings for the word entries in Roget's Thesaurus. It is up to you to find the embeddings; you can use any of the available models. Older models like word2vec, GloVe, BERT, etc., may be easier to use, but recent models like Llama 2 and Mistral have been trained on larger corpora. OpenAI and Google offer their embeddings through APIs, but they are not free.

You should think about how to store the embeddings you retrieve. You may use plain files (e.g., JSON, CSV) and vanilla Python, or a vector database.

### Voyage AI Embeddings

* Initially, we will use the [Voyage AI](https://www.voyageai.com/) embeddings as they are (almost) free and easy to use. 

* A benefit of Voyage AI is that it automatically embeds phrases, which is useful for our dataset as it contains phrases and not just single words. 

* It also handles out-of-vocabulary words and phrases.

In [None]:
def extract_embeddings_voyageai(vocabulary_array, batch_size=128):
    unique_embeddings = {}
    unique_vocabulary = np.unique(vocabulary_array).tolist()
    
    for i in tqdm(range(0, len(unique_vocabulary), batch_size), desc="VoyageAI Processing"):
        batch_phrases = unique_vocabulary[i:i+batch_size]
        batch_embeddings = voyage_client.embed(texts=batch_phrases, model="voyage-large-2").embeddings
        
        for phrase, embedding in zip(batch_phrases, batch_embeddings):
            unique_embeddings[phrase] = embedding
            
    return unique_embeddings

### Mistral AI Embeddings

* For our second embedding model, we will use the [Mistral](https://mistral.ai/) embeddings.

* Mistral is SOTA and has been trained on a large corpus, so it is a good choice for our task. 

* Just like Voyage, Mistral performs very well on well known benchmarks such as HuggingFace's [MTEB](https://huggingface.co/spaces/mteb/leaderboard).

* Mistral's embeddings are not free, but they are very cheap and we will only be using a small amount of them (around 2 cents per run).

In [None]:
def extract_embeddings_mistralai(vocabulary_array, batch_size=2048):
    unique_embeddings = {}
    unique_vocabulary = np.unique(vocabulary_array).tolist()
        
    for i in tqdm(range(0, len(unique_vocabulary), batch_size), desc="MistralAI Processing"):
        batch_phrases = unique_vocabulary[i:i+batch_size]
        
        embeddings_batch_response = mistral_client.embeddings(model="mistral-embed", input=batch_phrases)
        batch_embeddings = [e.embedding for e in embeddings_batch_response.data]
        
        for phrase, embedding in zip(batch_phrases, batch_embeddings):
            unique_embeddings[phrase] = embedding
            
    return unique_embeddings

### Storage of Embeddings

* In order to save time and money, we will store the embeddings in a CSV file as it is convenient and easy to use. 

* Our embedding-fetching functions are smart - they only fetch once per phrase in our vocabulary, so we will not be making any repetitive API calls.
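
The "fetch once per phrase, in batches" idea can be sketched without any API access. Below, `fake_embed_batch` is a made-up stand-in for a real embedding call and `embed_unique_sketch` is a hypothetical helper; neither is part of the Voyage AI or Mistral clients.

```python
def embed_unique_sketch(phrases, embed_batch, batch_size=3):
    """Embed each distinct phrase once, sending the phrases in fixed-size batches."""
    unique_phrases = sorted(set(phrases))
    embeddings = {}
    for start in range(0, len(unique_phrases), batch_size):
        batch = unique_phrases[start:start + batch_size]
        for phrase, vector in zip(batch, embed_batch(batch)):
            embeddings[phrase] = vector
    return embeddings


calls = []


def fake_embed_batch(batch):
    """Stand-in for a real embedding API: records the call and returns dummy vectors."""
    calls.append(list(batch))
    return [[float(len(phrase))] for phrase in batch]


phrases = ["being", "entity", "being", "reality", "fact", "entity"]
embeddings = embed_unique_sketch(phrases, fake_embed_batch)
print(len(calls), "calls for", len(embeddings), "unique phrases")
```

Six input phrases collapse to four unique ones, so a batch size of three needs only two calls.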

In [None]:
def store_embeddings(vocabulary_array):
    # Extract embeddings
    vygai_embeddings = extract_embeddings_voyageai(vocabulary_array)
    mistral_embeddings = extract_embeddings_mistralai(vocabulary_array)
    
    # Combine embeddings and phrases into a DataFrame
    embeddings_df = pd.DataFrame({'phrase': vocabulary_array, 
                                  'vygai_embedding': [vygai_embeddings[phrase] for phrase in vocabulary_array],
                                  'mistralai_embedding': [mistral_embeddings[phrase] for phrase in vocabulary_array]})
    
    return embeddings_df

* When fetching embeddings, we first want to check locally whether we have the embeddings already. If we do, we will retrieve the appropriate file.
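
A minimal sketch of this load-or-build pattern, using only the standard library and a temporary file. The helper `load_or_build_sketch`, the column names and the dummy vectors are assumptions for illustration. It also shows why the vectors must be parsed with `literal_eval` when read back: CSV stores each list as text.

```python
import csv
import os
import tempfile
from ast import literal_eval


def load_or_build_sketch(path, build):
    """Return {phrase: vector}, reading `path` if it exists and building it otherwise."""
    if os.path.exists(path):
        with open(path, newline="") as handle:
            # CSV stores the vector as text, so it has to be parsed back into a list.
            return {row["phrase"]: literal_eval(row["embedding"]) for row in csv.DictReader(handle)}

    table = build()
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["phrase", "embedding"])
        writer.writeheader()
        for phrase, vector in table.items():
            writer.writerow({"phrase": phrase, "embedding": vector})
    return table


build_calls = []


def fake_build():
    """Stand-in for the expensive embedding step."""
    build_calls.append(1)
    return {"being": [0.1, 0.2], "entity": [0.3, 0.4]}


with tempfile.TemporaryDirectory() as folder:
    cache_path = os.path.join(folder, "toy_embeddings.csv")
    first = load_or_build_sketch(cache_path, fake_build)   # builds and writes the file
    second = load_or_build_sketch(cache_path, fake_build)  # reads the file, no rebuild

print(first == second, len(build_calls))
```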

In [None]:
def attach_embeddings_to_df(data_df):
    try:
        embeddings_df = pd.read_csv('thesaurus_embeddings.csv', converters={'vygai_embedding': literal_eval, 'mistralai_embedding': literal_eval})
    except FileNotFoundError:
        embeddings_df = store_embeddings(data_df['phrase'].unique())
        embeddings_df.to_csv('thesaurus_embeddings.csv', index=False)
    
    data_df = data_df.merge(embeddings_df, on="phrase", how="left")
    
    return data_df

In [None]:
data_df = attach_embeddings_to_df(data_df)

In [None]:
print(f"VoyageAI Embedding Length: {len(data_df['vygai_embedding'].iloc[0])}")
print(f"MistralAI Embedding Length: {len(data_df['mistralai_embedding'].iloc[0])}")

* Our Voyage embeddings are 1536-dimensional, while our Mistral embeddings are 1024-dimensional.

* Now that we have loaded them, let's ensure that we have no missing values.

In [None]:
# Report missing embeddings first; after dropna() these counts would always be zero.
missing_vygai = data_df['vygai_embedding'].isna().sum()
missing_mistralai = data_df['mistralai_embedding'].isna().sum()

if missing_vygai > 0:
    print(f"Missing VoyageAI Embeddings: {missing_vygai}")
    
if missing_mistralai > 0:
    print(f"Missing MistralAI Embeddings: {missing_mistralai}")

data_df = data_df.dropna(how='any')

* All good!

In [None]:
copy_df = data_df.copy()

### Visualization of Embeddings

* Exploring the embeddings is a good idea to understand how they represent our data. We will also see whether they cluster in any meaningful way.

* Since our embeddings span over a thousand dimensions, we will need to reduce their dimensionality in order to visualize them. Luckily, we are well equipped for this.

* We can also store these reductions in a column so as to save time in the future. 

* We will play around with a few different dimensionality reduction techniques to visualize our embeddings as a warmup for the next tasks.
    * For every technique, we will create a `*_reducer` function that will do the dirty work for us and simply return the dataframe-ready embeddings.

* **NOTE**: For the sake of time and space, we will omit certain visualizations that are not very informative and have nothing new to offer. They have been filtered out during the creation of this notebook.

* Let's initially define a visualization function that will allow us to visualize our embeddings in 2 and 3 dimensions. 

In [None]:
def visualize_embeddings(data_df, embedding_col_name, label_col_name, method="", dimensions=2):
    if dimensions == 2:
        fig = px.scatter(data_df, x=data_df[embedding_col_name].apply(lambda x: x[0]), 
                         y=data_df[embedding_col_name].apply(lambda x: x[1]), 
                         color=data_df[label_col_name],
                         color_discrete_sequence=px.colors.qualitative.Antique,
                         labels={'x': 'Dimension 1', 'y': 'Dimension 2', 'color': label_col_name})
        fig.update_traces(marker={'size': 3})
    elif dimensions == 3:
        fig = px.scatter_3d(data_df, 
                            x=data_df[embedding_col_name].apply(lambda x: x[0]), 
                            y=data_df[embedding_col_name].apply(lambda x: x[1]), 
                            z=data_df[embedding_col_name].apply(lambda x: x[2]), 
                            color=data_df[label_col_name], 
                            color_discrete_sequence=px.colors.qualitative.Antique, 
                            labels={'x': 'Dimension 1', 'y': 'Dimension 2', 'z': 'Dimension 3', 'color': label_col_name})
        fig.update_traces(marker={'size': 2})
    else:
        raise ValueError(f"dimensions must be 2 or 3, got {dimensions}")
        
    fig.update_layout(title=f'Visualization of Phrases by {label_col_name} w/ {method}')
    fig.show()

#### Principal Component Analysis - A True Classic

* We will start with PCA, which is a classic and simple dimensionality reduction technique.

* PCA is unsupervised and linear, which means it is fast and easy to use.

* We can easily measure the explained variance of our embeddings, which is a good indicator of how well they are represented in a lower dimension.
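
For intuition, the explained variance ratio of a principal component is its eigenvalue of the data's covariance matrix divided by the sum of all eigenvalues. Here is a minimal sketch for 2-D points, using the closed-form eigenvalues of a symmetric 2x2 matrix. It is pure Python on toy data with a hypothetical function name, not the scikit-learn implementation used below.

```python
import math


def explained_variance_ratio_2d(points):
    """Explained variance ratio of the two principal components of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Entries of the sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    centre = (var_x + var_y) / 2
    spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    eigenvalues = [centre + spread, centre - spread]

    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


# Points scattered tightly around the line y = 2x: almost all variance lies along one direction.
toy_points = [(0, 0.1), (1, 1.9), (2, 4.1), (3, 5.9), (4, 8.1)]
ratios = explained_variance_ratio_2d(toy_points)
print(ratios)
```

When the first ratio is close to 1, a one-dimensional projection loses almost nothing. Low ratios, by contrast, mean that a 2D or 3D plot shows only a small part of the structure.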

In [None]:
def pca_reducer(embeddings, components=2):
    embeddings_array = np.array(embeddings.tolist())
    
    pca = PCA(n_components=components, random_state=0)
    reduced = pca.fit_transform(embeddings_array)
    
    return reduced.tolist(), pca.explained_variance_ratio_

##### PCA - VoyageAI

* We will start with 2D PCA for our VoyageAI embeddings. 

* We will also store the explained variance in a variable for later use.

In [None]:
data_df["vygai_embedding_pca_2d"], vygai_pca_2d_var = pca_reducer(data_df["vygai_embedding"], components=2)

visualize_embeddings(data_df, "vygai_embedding_pca_2d", "class", method="PCA", dimensions=2)
print(f"VoyageAI PCA 2D Variance Ratio: {vygai_pca_2d_var.sum()}")

* It seems that our embeddings are not very well represented in 2D, as we only capture approximately 6% of the variance. This is problematic, but not unexpected, as we are dealing with high-dimensional data.

* Our plot is not informative at all - let's try 3D PCA.

In [None]:
data_df["vygai_embedding_pca_3d"], vygai_pca_3d_var = pca_reducer(data_df["vygai_embedding"], components=3)

visualize_embeddings(data_df, "vygai_embedding_pca_3d", "class", method="PCA", dimensions=3)

print(f"VoyageAI PCA 3D Variance Ratio: {vygai_pca_3d_var.sum()}")

* The results are not much better, as we only capture approximately 7.7% of the variance.

* The plot is still not very informative: the classes are not well separated, although we can make out the general direction of each one.

* Let's also plot by section for completeness.

In [None]:
visualize_embeddings(data_df, "vygai_embedding_pca_2d", "section", method="PCA", dimensions=2)

##### PCA - Mistral

* Similarly, we will perform 2D and 3D PCA for our Mistral embeddings.

In [None]:
data_df["mistralai_embedding_pca_2d"], mistralai_pca_2d_var = pca_reducer(data_df["mistralai_embedding"], components=2)

visualize_embeddings(data_df, "mistralai_embedding_pca_2d", "class", method="PCA", dimensions=2)
print(f"MistralAI PCA 2D Variance Ratio: {mistralai_pca_2d_var.sum()}")

In [None]:
data_df["mistralai_embedding_pca_3d"], mistralai_pca_3d_var = pca_reducer(data_df["mistralai_embedding"], components=3)

visualize_embeddings(data_df, "mistralai_embedding_pca_3d", "class", method="PCA", dimensions=3)

print(f"MistralAI PCA 3D Variance Ratio: {mistralai_pca_3d_var.sum()}")

* The results are worse than VoyageAI, as we only capture approximately 4.5% and 5.7% of the variance respectively for 2D and 3D PCA.

* The plots are also not expected to be very informative, as the embeddings are not well separated.

* Let's also plot by section for completeness.

In [None]:
visualize_embeddings(data_df, "mistralai_embedding_pca_2d", "section", method="PCA", dimensions=2)

##### PCA - Conclusions and Disappointment

* Honestly, the results are not great. We can see that the explained variance is very low, which means that our embeddings are not well represented in a lower dimension.

* We might be suffering from the **curse of dimensionality**, which is a common issue with high-dimensional data.

* How many components do we need to explain most of the variance?

In [None]:
def plot_variance_ratio(benchmark, title):
    cumulative_pca_result = np.cumsum(benchmark)
    
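    reached_90 = cumulative_pca_result >= 0.9
    # np.argmax returns 0 when no component reaches 90%, so handle that case explicitly.
    n_components_for_90 = int(np.argmax(reached_90)) + 1 if reached_90.any() else 0
    cumulative_pca_90 = cumulative_pca_result[n_components_for_90 - 1] if n_components_for_90 > 0 else 0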

    fig = make_subplots(rows=1, cols=2, subplot_titles=("Explained Variance per PC", "Cumulative Explained Variance"))

    fig.add_trace(go.Scatter(x=list(range(1, len(benchmark)+1)), y=benchmark, mode='lines+markers',
                             name='PVE per PC'), row=1, col=1)
    
    fig.add_trace(go.Scatter(x=list(range(1, len(benchmark)+1)), y=cumulative_pca_result, mode='lines+markers',
                             name='Cumulative PVE', line=dict(color='orange')), row=1, col=2)

    if n_components_for_90 > 0:
        fig.add_shape(type='line', line=dict(dash='dash', color='red'),
                      x0=n_components_for_90, x1=n_components_for_90, y0=0, y1=cumulative_pca_90,
                      row=1, col=2, name="90% PVE")
        fig.add_trace(go.Scatter(x=[n_components_for_90], y=[cumulative_pca_90], text=["0.9"], mode="text", showlegend=False),
                      row=1, col=2)

    fig.update_xaxes(title_text="Principal Component", row=1, col=1)
    fig.update_xaxes(title_text="Principal Component", row=1, col=2)

    fig.update_yaxes(title_text="PVE", row=1, col=1)
    fig.update_yaxes(title_text="Cumulative PVE", range=[0, 1.05], row=1, col=2)

    fig.update_layout(height=600, width=1000, title_text=title)

    fig.show()

* Let's try with 700 components and see if we can explain most of the variance.

* This is a lot of components, but it is still less than the original dimensions.

In [None]:
N = 700

voyage_pca_benchmark, voyage_pca_benchmark_var = pca_reducer(data_df['vygai_embedding'], components=N)
mistral_pca_benchmark, mistral_pca_benchmark_var = pca_reducer(data_df['mistralai_embedding'], components=N)

* Let's create a scree plot to visualize the explained variance.

* The plot is interactive, so we can zoom in and see the explained variance for different numbers of components.

In [None]:
plot_variance_ratio(voyage_pca_benchmark_var, "VoyageAI PCA Explained Variance Ratio")
plot_variance_ratio(mistral_pca_benchmark_var, "MistralAI PCA Explained Variance Ratio")

* We see that, in both cases, we need a lot of components to explain most of the variance, which is not ideal for visualization. Both need around **500-600 components** to explain 90% of the variance.
    * MistralAI requires slightly fewer components than VoyageAI, but it is still a lot.

* Let's play around with some other techniques and see if we can get better results.

##### UMAP - The New Kid on the Block

* UMAP is a newer, non-linear dimensionality reduction technique that is known for its ability to preserve both local and global structure in high-dimensional data.

* It is often seen as a faster alternative to t-SNE, which is another popular non-linear dimensionality reduction technique.

* UMAP is sensitive to its hyperparameters (such as `n_neighbors` and `min_dist`), so getting the desired results can take a lot of tuning.

* We will only perform 2D UMAP, as 3D UMAP did not yield any improvements.

In [None]:
def umap_reducer(embeddings, components=2):
    embeddings_array = np.array(embeddings.tolist())

    # Manhattan distance is used instead of cosine similarity, on the assumption that the embeddings are already normalized
    umap_ = umap.UMAP(n_components=components, metric='manhattan', init='random', n_neighbors=50, min_dist=0.0)
    reduced = umap_.fit_transform(embeddings_array)
    
    return reduced.tolist()
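
* The metric choice in `umap_reducer` rests on the embeddings being unit-normalized. For unit vectors, squared Euclidean distance and cosine similarity are tied by $\|a - b\|^2 = 2 - 2\cos(a, b)$, so Euclidean distance preserves the cosine ordering exactly. Manhattan distance has no such exact identity, so using it in place of cosine is an approximation. Below is a small, self-contained check of the identity on made-up vectors (all helper names are hypothetical):

```python
from math import isclose, sqrt


def normalize(vector):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors, normalized to unit length
a = normalize([1.0, 2.0, 3.0])
b = normalize([3.0, 1.0, 0.5])

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
identity_holds = isclose(squared_euclidean(a, b), 2 - 2 * dot(a, b))
```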

##### UMAP - VoyageAI

In [None]:
data_df["vygai_embedding_umap_2d"] = umap_reducer(data_df["vygai_embedding"], components=2)

visualize_embeddings(data_df, "vygai_embedding_umap_2d", "class", method="UMAP", dimensions=2)

* ... let's also plot by section.

In [None]:
visualize_embeddings(data_df, "vygai_embedding_umap_2d", "section", method="UMAP", dimensions=2)

##### UMAP - MistralAI

In [None]:
data_df["mistralai_embedding_umap_2d"] = umap_reducer(data_df["mistralai_embedding"], components=2)

visualize_embeddings(data_df, "mistralai_embedding_umap_2d", "class", method="UMAP", dimensions=2)

* Section plotting omitted as it is not informative.

##### UMAP - Conclusions

* UMAP does not seem to perform much better than PCA: the classes still overlap heavily in the 2D projection. Unlike PCA, UMAP does not report an explained variance ratio, so this comparison is purely visual.

* This might be due to the low number of components we are using, but it is still not ideal for visualization. We expect it to perform better than PCA in higher dimensions.

* MistralAI's section-level plot was omitted for brevity, as it was almost identical to VoyageAI's.

#### Linear Discriminant Analysis (LDA) - Almost Cheating?

* So far we have only used unsupervised techniques, but we can also employ supervised techniques for dimensionality reduction.

* LDA is a supervised technique that is known for its ability to separate classes in high-dimensional data by finding the best projection that maximizes the distance between classes.

* We will use LDA to reduce our embeddings' dimensions and see if we can get a better visualization than PCA and UMAP.

In [None]:
def lda_reducer(embeddings, labels, n=3):
    embeddings_array = np.array(embeddings.tolist())
    
    lda = LDA(n_components=n)
    lda_result = lda.fit_transform(embeddings_array, labels)
    
    return lda_result.tolist(), lda.explained_variance_ratio_

##### LDA - VoyageAI

In [None]:
data_df["vygai_embedding_lda_2d"], vygai_lda_2d_var = lda_reducer(data_df["vygai_embedding"], data_df["class"], n=2)

visualize_embeddings(data_df, "vygai_embedding_lda_2d", "class", method="LDA", dimensions=2)

In [None]:
data_df["vygai_embedding_lda_3d"], vygai_lda_3d_var = lda_reducer(data_df["vygai_embedding"], data_df["class"], n=3)

visualize_embeddings(data_df, "vygai_embedding_lda_3d", "class", method="LDA", dimensions=3)

* ... let's also plot by section.

In [None]:
visualize_embeddings(data_df, "vygai_embedding_lda_2d", "section", method="LDA", dimensions=2)

##### LDA - Mistral

In [None]:
data_df["mistralai_embedding_lda_2d"], mistralai_lda_2d_var = lda_reducer(data_df["mistralai_embedding"], data_df["class"], n=2)

visualize_embeddings(data_df, "mistralai_embedding_lda_2d", "class", method="LDA", dimensions=2)

In [None]:
data_df["mistralai_embedding_lda_3d"], mistralai_lda_3d_var = lda_reducer(data_df["mistralai_embedding"], data_df["class"], n=3)

visualize_embeddings(data_df, "mistralai_embedding_lda_3d", "class", method="LDA", dimensions=3)

* ... let's also plot by section.

In [None]:
visualize_embeddings(data_df, "mistralai_embedding_lda_2d", "section", method="LDA", dimensions=2)

##### LDA - Conclusions

In [None]:
print(f"3D LDA Explained VoyageAI Variance Ratio: {vygai_lda_3d_var.sum()}")
print(f"3D LDA Mistral Explained Variance Ratio: {mistralai_lda_3d_var.sum()}")

* LDA performed very well on our embeddings, which is not surprising as it is a supervised technique. With just 3 components, we were able to capture around 70-75% of the between-class variance. Note that LDA's explained variance ratio is computed over the discriminant components, so it is not directly comparable to the PCA figures.

* We were finally able to get a sneak peek of our embeddings, and we can see that they are somewhat separated in this 3D space. By rotating the plot, we can see that the embeddings are not perfectly separated, but they are not a complete mess either, as they lean towards different directions.

* It is important to note that LDA is a supervised technique, and it is not always possible to use it. It is also not always the best choice for dimensionality reduction, but it seems to have worked well for our embeddings.

#### Honorable Mentions 

##### t-SNE

* t-SNE is a very popular and powerful dimensionality reduction technique, but it is also very slow on large, high-dimensional datasets such as ours.

* In our case each run took over 10 minutes, which is not feasible for our task and not worth the time.

* The results were also not great: the embeddings were not well separated in the lower-dimensional projection.

#### Conclusions

* We have explored a few different dimensionality reduction techniques and we have seen that our embeddings are not well represented in a lower dimension.

* This is a common issue with high-dimensional data, and it is known as the curse of dimensionality. Additionally, we have *a lot* of data - something that will make visualization difficult anyway.

* We are unlikely to enjoy any satisfying visualizations of our embeddings, but we can still use them for analytical and machine learning tasks.

* We have prepared our visualization-friendly, dimensionality-reduced embeddings for the next steps, and we are ready to move on.

## Clustering

With the embeddings at hand, you can check whether unsupervised Machine Learning methods can arrive at classifications that are comparable to the Roget's Thesaurus Classification. You can use any clustering method of your choice (experiment freely). You must decide how to measure the agreement between the clusters you find and the classes defined by Roget's Thesaurus and report your results accordingly. The comparison will be at the class level (six classes) and the section / division level (so there must be two different clusterings, unless you can find good results with hierarchical clustering).

### Evaluation Metrics

* Since we're focusing on checking agreement between clusters and labels, we will not use techniques such as the Elbow Method or Silhouette Score, as they are not really required for our task.

* Instead, we will measure the ground truth labels against the clusters with the techniques that follow.

* Additionally, we will cheat a bit as we know the expected number of clusters. We will use this information to try and replicate Roget's classification, although this is not applicable in real-world scenarios. Perhaps the student does not fall far from the professor's slides (where student = apple and professor's slides = tree).

* A *note*; we will not visualize the clusters every single time as it will result in an extremely large amount of plots. We will only visualize the clusters we find interesting or important.

In [None]:
expected_classes_n = len(data_df['class'].unique())
expected_sections_n = len(data_df['section'].unique())

print(f"Expected Classes: {expected_classes_n}")
print(f"Expected Sections: {expected_sections_n}")

#### Pivot Table

* We define a function that creates a pivot table, which will allow us to compare our clusters with the Roget's Thesaurus classification.

* With the pivot table, we can easily see how many words from each class and section were assigned to each cluster.

In [None]:
def cluster_distribution(df_clustered, cluster_column, hue):
    distr = df_clustered.groupby([cluster_column, hue]).size().unstack(fill_value=0)
    distr['total'] = distr.sum(axis=1)
    distr['winner'] = distr.iloc[:, :-1].idxmax(axis=1)
    return distr
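
* The `winner` column can also be condensed into a single number, often called purity: the share of words that fall under their cluster's majority label. The sketch below is a plain-Python illustration of that idea on made-up labels. `cluster_purity` is a hypothetical helper, not something this notebook defines or uses elsewhere.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_labels, true_labels):
    """Return each cluster's majority label and the overall purity."""
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_labels, true_labels):
        members[cluster][label] += 1

    # Majority ("winner") label per cluster, like the pivot table's `winner` column
    winners = {cluster: counts.most_common(1)[0][0] for cluster, counts in members.items()}

    # Purity: fraction of points that carry their cluster's majority label
    majority_total = sum(counts.most_common(1)[0][1] for counts in members.values())
    return winners, majority_total / len(cluster_labels)


# Made-up assignments, for illustration only
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_classes = ["a", "a", "b", "b", "b", "a"]
toy_winners, toy_purity = cluster_purity(toy_clusters, toy_classes)
```

* Purity is easy to read but rises with the number of clusters, which is one reason to rely on the measures that follow.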

#### Numerical Measures

* **Adjusted Rand Index (ARI)** - ARI is a measure of the similarity between two data clusterings. It considers all pairs of samples and counts pairs that are assigned in the same or different clusters in the predicted and true clusterings, then corrects this count for chance. The maximum value is 1, which indicates perfect agreement; values around 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.

* **Normalized Mutual Information (NMI)** - NMI is a measure of the agreement between two clusterings of the same data. It rescales the mutual information to the range 0 to 1, but, unlike ARI, it is not adjusted for chance. A value of 1 indicates perfect agreement and 0 indicates that the two clusterings share no information.

* **Homogeneity, Completeness and V-measure** - These are three related measures that can be used to evaluate the quality of clusters. Their values range from 0 to 1, where 1 is the best score. Like NMI, they are not adjusted for chance.
    * Homogeneity measures whether all of the clusters contain only data points which are members of a single class. 
    * Completeness measures whether all members of a given class are assigned to the same cluster. 
    * V-measure is the harmonic mean of homogeneity and completeness.
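
* To make the pair-counting idea behind ARI concrete, here is a minimal plain-Python sketch on made-up labels. `adjusted_rand_index` is a hypothetical helper written for illustration; the notebook itself relies on scikit-learn's `adjusted_rand_score`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """Pair-counting form of the Adjusted Rand Index."""
    n = len(true_labels)

    # Pairs that share both a true label and a cluster
    pairs_same_both = sum(comb(c, 2) for c in Counter(zip(true_labels, cluster_labels)).values())
    # Pairs that share a true label / a cluster, counted separately
    pairs_same_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    pairs_same_cluster = sum(comb(c, 2) for c in Counter(cluster_labels).values())

    # Value expected under random assignment, and the maximum attainable value
    expected = pairs_same_true * pairs_same_cluster / comb(n, 2)
    maximum = (pairs_same_true + pairs_same_cluster) / 2

    if maximum == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (pairs_same_both - expected) / (maximum - expected)


# Cluster ids are arbitrary: a pure renaming still counts as perfect agreement
ari_renamed = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Every cluster mixes both classes: the score drops below zero
ari_discordant = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

* The second toy case shows why a negative ARI is possible: the clustering agrees with the labels less often than random assignment would.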

In [None]:
def measure_agreement(true_labels, predictions, setup=""):
    predicted_labels = np.array(predictions)

    ari = adjusted_rand_score(true_labels, predicted_labels)
    nmi = normalized_mutual_info_score(true_labels, predicted_labels)
    homogeneity = homogeneity_score(true_labels, predicted_labels)
    completeness = completeness_score(true_labels, predicted_labels)
    v_measure = v_measure_score(true_labels, predicted_labels)
    
    return pd.DataFrame({'setup': setup, 'ARI': ari, 'NMI': nmi, 'homogeneity': homogeneity, 'completeness': completeness, 'v-measure': v_measure}, index=[0])
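
* Homogeneity and completeness are both built from conditional entropies, which is easier to see in code. The following is a minimal plain-Python sketch on made-up labels. All helper names are hypothetical; `measure_agreement` above uses the scikit-learn implementations.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a labelling (natural log)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """Entropy of `labels` that remains once `given` is known."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity_completeness_v(true_labels, cluster_labels):
    h_classes = entropy(true_labels)
    h_clusters = entropy(cluster_labels)
    # Homogeneity: how little class uncertainty remains inside each cluster
    homogeneity = 1.0 if h_classes == 0 else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_classes
    # Completeness: how little cluster uncertainty remains inside each class
    completeness = 1.0 if h_clusters == 0 else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_clusters
    total = homogeneity + completeness
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


# Over-segmentation: every point gets its own cluster
toy_h, toy_c, toy_v = homogeneity_completeness_v([0, 0, 1, 1], [0, 1, 2, 3])
```

* In the toy case each cluster is trivially pure, so homogeneity is perfect, while completeness is penalized because every class is split across clusters.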

* We will use a copy of our original dataframe to perform these evaluations, as we will be modifying the dataframe in the process.

* For each clustering technique, we will perform these evaluations and store the results in a dataframe for easy comparison.

In [None]:
cluster_data_df = data_df.copy()

### K-Means Clustering

* We will start off with K-Means clustering, which is a simple and popular clustering technique.

* K-Means relies on Euclidean distances, which become less informative in high dimensions, but it is still worth trying out.

In [None]:
def perform_kmeans(embeddings, n_clusters):
    if isinstance(embeddings, pd.Series):
        embeddings = np.array(embeddings.tolist())

    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    predictions = kmeans.fit_predict(embeddings)
    
    return predictions

In [None]:
kmeans_class_metrics_df = pd.DataFrame(columns=['setup', 'ARI', 'NMI', 'homogeneity', 'completeness', 'v-measure'])
kmeans_section_metrics_df = pd.DataFrame(columns=['setup', 'ARI', 'NMI', 'homogeneity', 'completeness', 'v-measure'])

#### Class Level Clustering 

##### All Dimension Clustering

* We will start off by clustering our embeddings without any dimensionality reduction.

* We do not expect great results, but it is a good experiment to see how K-Means performs on high-dimensional data.

In [None]:
kmeans_class_default_vygai_setup = "vygai_kmeans_class_default"

cluster_data_df[kmeans_class_default_vygai_setup] = perform_kmeans(cluster_data_df['vygai_embedding'], n_clusters=expected_classes_n)

kmeans_class_metrics_df = pd.concat([kmeans_class_metrics_df, measure_agreement(cluster_data_df['class_'], cluster_data_df[kmeans_class_default_vygai_setup], 
                                                                                setup=kmeans_class_default_vygai_setup)], ignore_index=True)

In [None]:
cluster_distribution(cluster_data_df, kmeans_class_default_vygai_setup, 'class')

In [None]:
kmeans_class_default_mistralai_setup = "mistralai_kmeans_class_default"

cluster_data_df[kmeans_class_default_mistralai_setup] = perform_kmeans(cluster_data_df['mistralai_embedding'], n_clusters=expected_classes_n)

kmeans_class_metrics_df = pd.concat([kmeans_class_metrics_df, measure_agreement(cluster_data_df['class_'], cluster_data_df[kmeans_class_default_mistralai_setup], 
                                                                                setup=kmeans_class_default_mistralai_setup)], ignore_index=True
                                    )

In [None]:
cluster_distribution(cluster_data_df,kmeans_class_default_mistralai_setup, 'class')

* The distribution of words in the clusters is not very good, as we can see that the words are not well separated. This was expected due to the high dimensionality of our data.

* In fact, for both our embeddings, `cl_volition` strongly dominates the clusters, which is not ideal.

##### PCA & Clustering

* We will now use PCA to reduce the dimensions of our embeddings and then perform K-Means clustering.

* We will use 550 components for VoyageAI and 510 components for MistralAI, as these are the number of components required to explain around 90% of the variance.

In [None]:
vygai_pca_emb_optimal = pca_reducer(cluster_data_df['vygai_embedding'], components=550)
mistralai_pca_emb_optimal = pca_reducer(cluster_data_df['mistralai_embedding'], components=510)

In [None]:
print(f"VoyageAI PCA Optimal Variance Ratio: {vygai_pca_emb_optimal[1].sum()}")
print(f"MistralAI PCA Optimal Variance Ratio: {mistralai_pca_emb_optimal[1].sum()}")

In [None]:
kmeans_class_pca_vygai_setup = "vygai_kmeans_class_pca"

cluster_data_df[kmeans_class_pca_vygai_setup] = perform_kmeans(vygai_pca_emb_optimal[0], n_clusters=expected_classes_n)

kmeans_class_metrics_df = pd.concat([kmeans_class_metrics_df, measure_agreement(cluster_data_df['class_'], 
                                                                                cluster_data_df[kmeans_class_pca_vygai_setup],
                                                                                setup=kmeans_class_pca_vygai_setup)
                                     ], ignore_index=True)

In [None]:
cluster_distribution(cluster_data_df, kmeans_class_pca_vygai_setup, 'class')

In [None]:
kmeans_class_pca_mistralai_setup = "mistralai_kmeans_class_pca"

cluster_data_df[kmeans_class_pca_mistralai_setup] = perform_kmeans(mistralai_pca_emb_optimal[0], n_clusters=expected_classes_n)

kmeans_class_metrics_df = pd.concat([kmeans_class_metrics_df, measure_agreement(cluster_data_df['class_'], cluster_data_df[kmeans_class_pca_mistralai_setup],
                                                                                setup=kmeans_class_pca_mistralai_setup)], ignore_index=True)

In [None]:
cluster_distribution(cluster_data_df, kmeans_class_pca_mistralai_setup, 'class')

* We can still see that the words are not well separated in the expected clusters - as there is no distinct category dominating in words for each cluster.

* Let's take a look at the 2D PCA plot for VoyageAI.

In [None]:
visualize_embeddings(cluster_data_df, "vygai_embedding_pca_2d", kmeans_class_pca_vygai_setup, method="KMeans", dimensions=2)

* The plot shows that the kmeans algorithm *thinks* that it is clustering in an acceptable manner, however the real distribution is completely different.

##### UMAP & Clustering

* Finally, we will use UMAP to reduce the dimensions of our embeddings and then perform K-Means clustering.

* We will use the 2-component UMAP embeddings for both VoyageAI and MistralAI, as using more components made no noticeable difference in the results.

In [None]:
kmeans_class_umap_vygai_setup = "vygai_kmeans_class_umap"

cluster_data_df[kmeans_class_umap_vygai_setup] = perform_kmeans(data_df["vygai_embedding_umap_2d"], n_clusters=expected_classes_n)

kmeans_class_metrics_df = pd.concat([kmeans_class_metrics_df, measure_agreement(cluster_data_df['class_'], cluster_data_df[kmeans_class_umap_vygai_setup],
                                                                                setup=kmeans_class_umap_vygai_setup)], ignore_index=True)

In [None]:
cluster_distribution(cluster_data_df, kmeans_class_umap_vygai_setup, 'class')

In [None]:
kmeans_class_umap_mistralai_setup = "mistralai_kmeans_class_umap"

cluster_data_df[kmeans_class_umap_mistralai_setup] = perform_kmeans(data_df["mistralai_embedding_umap_2d"], n_clusters=expected_classes_n)

kmeans_class_metrics_df = pd.concat([kmeans_class_metrics_df, measure_agreement(cluster_data_df['class_'], cluster_data_df[kmeans_class_umap_mistralai_setup],
                                                                                setup=kmeans_class_umap_mistralai_setup)], ignore_index=True)

In [None]:
cluster_distribution(cluster_data_df, kmeans_class_umap_mistralai_setup, 'class')

* We still see that the words are not well separated in the expected clusters - as there is no distinct category for each cluster.

* Let's take a look at the 2D UMAP plot for VoyageAI.

In [None]:
visualize_embeddings(cluster_data_df, "vygai_embedding_umap_2d", kmeans_class_umap_vygai_setup, method="UMAP", dimensions=2)

* It looks like a nice kite.

##### Conclusions

In [None]:
kmeans_class_metrics_df = kmeans_class_metrics_df.sort_values(by='setup', ascending=False, ignore_index=True)
kmeans_class_metrics_df

* As we see, K-Means clustering did not perform well on our high-dimensional data despite our best efforts to reduce the dimensions.

* We can see that the words are not well separated in the clusters, and there is no distinct category for each cluster.

* We will now move on to section / division level clustering with our heads high.

In [None]:
cluster_data_df['vygai_kmeans_class_pca'] = cluster_data_df['vygai_kmeans_class_pca'].astype(str)
visualize_embeddings(cluster_data_df, "vygai_embedding_pca_2d", "vygai_kmeans_class_pca", method="PCA for clustering labels", dimensions=2)

In [None]:
visualize_embeddings(cluster_data_df, "vygai_embedding_pca_2d", "class", method="PCA for actual labels", dimensions=2)

#### Section Level Clustering

* We will also perform clustering at the section/division level, as this is the second level of hierarchy in Roget's Thesaurus.
    * Reminder: We name our section/division level as "sc_" for simplicity, regardless of type.

* For brevity, we will only perform the clustering with all dimensions and PCA-reduced embeddings as we have seen that UMAP does not perform well.

##### All Dimension Clustering

In [None]:
kmeans_section_default_vygai_setup = "vygai_kmeans_section_default"

cluster_data_df[kmeans_section_default_vygai_setup] = perform_kmeans(cluster_data_df['vygai_embedding'], n_clusters=expected_sections_n)

kmeans_section_metrics_df = pd.concat([kmeans_section_metrics_df, measure_agreement(cluster_data_df['section_'], cluster_data_df[kmeans_section_default_vygai_setup],
                                                                                setup=kmeans_section_default_vygai_setup)], ignore_index=True)

In [None]:
cluster_distribution(cluster_data_df, kmeans_section_default_vygai_setup, 'section')

In [None]:
kmeans_section_default_mistralai_setup = "mistralai_kmeans_section_default"

cluster_data_df[kmeans_section_default_mistralai_setup] = perform_kmeans(cluster_data_df['mistralai_embedding'], n_clusters=expected_sections_n)

kmeans_section_metrics_df = pd.concat([kmeans_section_metrics_df, measure_agreement(cluster_data_df['section_'], cluster_data_df[kmeans_section_default_mistralai_setup],
                                                                                setup=kmeans_section_default_mistralai_setup)], ignore_index=True)

In [None]:
cluster_distribution(cluster_data_df, kmeans_section_default_mistralai_setup, 'section')[['total', 'winner']]

##### PCA & Clustering

In [None]:
kmeans_section_pca_vygai_setup = "vygai_kmeans_section_pca"

cluster_data_df[kmeans_section_pca_vygai_setup] = perform_kmeans(vygai_pca_emb_optimal[0], n_clusters=expected_sections_n)

kmeans_section_metrics_df = pd.concat([kmeans_section_metrics_df, measure_agreement(cluster_data_df['section_'],
                                                                                cluster_data_df[kmeans_section_pca_vygai_setup],
                                                                                setup=kmeans_section_pca_vygai_setup)], ignore_index=True)

In [None]:
cluster_distribution(cluster_data_df, kmeans_section_pca_vygai_setup, 'section')[['total', 'winner']]

In [None]:
kmeans_section_pca_mistralai_setup = "mistralai_kmeans_section_pca"

cluster_data_df[kmeans_section_pca_mistralai_setup] = perform_kmeans(mistralai_pca_emb_optimal[0], n_clusters=expected_sections_n)

kmeans_section_metrics_df = pd.concat([kmeans_section_metrics_df, measure_agreement(cluster_data_df['section_'],
                                                                                cluster_data_df[kmeans_section_pca_mistralai_setup],
                                                                                setup=kmeans_section_pca_mistralai_setup)], ignore_index=True)

In [None]:
cluster_distribution(cluster_data_df, kmeans_section_pca_mistralai_setup, 'section')[['total', 'winner']]

##### Conclusions

In [None]:
kmeans_section_metrics_df = kmeans_section_metrics_df.sort_values(by='setup', ascending=False, ignore_index=True)
kmeans_section_metrics_df

* The results are still not great. A few sections dominate the clusters, and the words are not well separated.

* VoyageAI's results are slightly better than MistralAI's, but they are still not in the realm of good.

* Let's visualize the clusters against the actual labels for the sake of it.

In [None]:
cluster_data_df['vygai_kmeans_section_pca'] = cluster_data_df['vygai_kmeans_section_pca'].astype(str)
visualize_embeddings(cluster_data_df, "vygai_embedding_pca_2d", "vygai_kmeans_section_pca", method="PCA", dimensions=2)

In [None]:
visualize_embeddings(data_df, "vygai_embedding_pca_2d", "section", method="PCA", dimensions=2)

### Honorable Mentions (that I attempted)

The following clustering techniques were tested but omitted so as to limit the length of the notebook and save run time. Their results were either worse than K-Means or they never finished running.

* **HDBSCAN** - HDBSCAN is a density-based clustering technique that is known for its ability to find clusters of varying density. While attempting to use it, we found that it was not suitable for our task as it was taking too long.

* **Gaussian Mixture Models** - GMM is a probabilistic model that assumes that the data is generated from a mixture of several Gaussian distributions. We attempted to use it, but it was not suitable for our task as it was taking too long. Running it on all dimensions is infeasible, and running it on our reduced embeddings seems pointless as they are not isotropic.

* **Spectral Clustering** - Spectral clustering is a technique that uses the eigenvalues of a similarity matrix to reduce the dimensionality of the data before clustering in a lower dimensional space. Apart from being slow, it was not suitable for our task: it gave us a warning about the graph not being fully connected, and the results were worse than random. Running it on all dimensions is infeasible, and running it on our reduced embeddings seems pointless as the data is not in the proper distribution for it to work well.

* **Hierarchical Clustering** - Hierarchical clustering is a technique that builds a hierarchy of clusters. Its runtime was too long and the results too disappointing to be included in this notebook.

### A Cheating Swan Song featuring Linear Discriminant Analysis & K-Means

* As we saw earlier, LDA performed very well on our embeddings, which is not surprising as it is a supervised technique. With just 3 components, we were able to capture around 70-75% of the between-class variance.

* Using its results completely defeats the purpose of our task, as we are trying to find clusters without any prior knowledge of the data.

* However, I could not help but try it, as I find it fascinating.

* For brevity, we will only perform it on the MistralAI embeddings.

* We will re-perform LDA with the maximum number of components (# of classes - 1) and then perform K-Means clustering.

* We will write our conclusions in the end.

#### Class Level Clustering

In [None]:
kmeans_class_lda_mistralai_setup = "mistralai_kmeans_class_lda"

lda_optimal_class, _ = lda_reducer(cluster_data_df['mistralai_embedding'], cluster_data_df['class'], n=expected_classes_n-1)

cluster_data_df[kmeans_class_lda_mistralai_setup] = perform_kmeans(lda_optimal_class, n_clusters=expected_classes_n)

kmeans_class_metrics_df = pd.concat([kmeans_class_metrics_df, measure_agreement(cluster_data_df['class_'], 
                                                                                cluster_data_df[kmeans_class_lda_mistralai_setup],
                                                                                setup=kmeans_class_lda_mistralai_setup)
                                     ], ignore_index=True)

In [None]:
cluster_data_df['mistralai_kmeans_class_lda'] = cluster_data_df['mistralai_kmeans_class_lda'].astype(str)
visualize_embeddings(cluster_data_df, "mistralai_embedding_lda_2d", kmeans_class_lda_mistralai_setup, method="LDA", dimensions=2)

In [None]:
cluster_distribution(cluster_data_df, kmeans_class_lda_mistralai_setup, 'class')

* The separation is not perfect, but it is much better than what we have seen so far.

In [None]:
kmeans_class_metrics_df

* The numerical measures are also much better than what we have seen so far compared to the other techniques.

#### Section Level Clustering

In [None]:
kmeans_section_lda_mistralai_setup = "mistralai_kmeans_section_lda"

lda_optimal_section, _ = lda_reducer(cluster_data_df['mistralai_embedding'], cluster_data_df['section_'], n=expected_sections_n-1)

cluster_data_df[kmeans_section_lda_mistralai_setup] = perform_kmeans(lda_optimal_section, n_clusters=expected_sections_n)

kmeans_section_metrics_df = pd.concat([kmeans_section_metrics_df, measure_agreement(cluster_data_df['section_'],
                                                                                cluster_data_df[kmeans_section_lda_mistralai_setup],
                                                                                setup=kmeans_section_lda_mistralai_setup)], ignore_index=True)

In [None]:
cluster_data_df['mistralai_kmeans_section_lda'] = cluster_data_df['mistralai_kmeans_section_lda'].astype(str)
visualize_embeddings(cluster_data_df, "mistralai_embedding_lda_2d", kmeans_section_lda_mistralai_setup, method="LDA", dimensions=2)

In [None]:
cluster_distribution(cluster_data_df, kmeans_section_lda_mistralai_setup, 'section')

In [None]:
kmeans_section_metrics_df

* Likewise, the separation is not perfect, but it is much better than what we have seen so far.

#### Conclusions

* Reducing the dimensions of our embeddings with LDA and then performing K-Means clustering gave us the best results so far.

* Is this cheating? Yes, it is. But it is also a good experiment to see how well our embeddings can be separated with prior knowledge.

* Why is it cheating? Because we are clustering on a biased subspace of our data that has been selected to maximize the separation of the classes.

* It was still fun to play with!

### Clustering Conclusions

* Clustering did not yield satisfactory results - it seems that unsupervised methods cannot arrive at classifications that are comparable to the Roget's Thesaurus Classification, at least with the embeddings we have.

* This might be because the thesaurus was published in 1852 while our embeddings are derived from internet corpora, which are much more recent. Language changes over time, and the meaning of words changes with it. A lot of the words in the thesaurus are quite uncommon and archaic, and they might not be well represented in our embeddings. 

* Due to the high dimensionality of our embeddings, we were unable to find any meaningful clusters. We attempted to reduce the dimensions with PCA, UMAP and (cheekily) LDA, but the results were still not satisfactory. Both simple clustering algorithms such as k-means and more sophisticated ones such as HDBSCAN and GMM did not yield any meaningful results.

* The above applies to both class and section level clustering, although the results were slightly better for the section level.

## Class Prediction

Now we flip over to supervised Machine Learning methods. You must experiment and come up with the best classification method, whose input will be a word and its target will be its class, or its section / division (so there must be two different models).

* Now we will attempt to predict the class and section of a word using our embeddings with supervised machine learning methods.

* A **note**; in order to avoid repetition, we will only use the embeddings that yielded the best results for each technique. 

In [None]:
supervised_df = data_df.copy()

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(supervised_df['mistralai_embedding'].tolist(), supervised_df['class'].tolist(), test_size=0.2, random_state=0)
X_train_section, X_test_section, Y_train_section, Y_test_section = train_test_split(supervised_df['mistralai_embedding'].tolist(), supervised_df['section'].tolist(), test_size=0.2, random_state=0)

### Evaluation Plots

* We will define any necessary functions for evaluation plots.

#### Confusion Matrix with Heatmap

In [None]:
def heatmap_confusion_matrix(true_labels, predicted_labels, title):
    cm = confusion_matrix(true_labels, predicted_labels)

    _, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(cm, annot=True, fmt='d', ax=ax)
    ax.set_title(title)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    plt.show()

### Dummy Classifier - Baseline

* Initially, we will use a dummy classifier as a baseline to compare our models against.

* We will use the stratified strategy to ensure that the class distribution is preserved.

* We will reuse the MistralAI split created above. The dummy classifier ignores the input features, so the choice of embedding does not affect the baseline.

#### Class Level Baseline

* Let's start with the class classification baseline.

In [None]:
dummy_class = DummyClassifier(strategy="stratified", random_state=0)
dummy_class.fit(X_train, Y_train)

In [None]:
dummy_class_predictions = dummy_class.predict(X_test)

print(classification_report(Y_test, dummy_class_predictions))

* The dummy classifier, as expected, performs at chance level: with 6 equally frequent classes, random guessing would give about 17% accuracy (1/6).
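
* Strictly speaking, the `stratified` strategy samples its predictions from the training class frequencies, so its expected accuracy is the sum of squared class proportions (assuming the test set follows the same distribution). This equals 1/6 only when the six classes are equally frequent. A minimal sketch with made-up labels follows; `expected_stratified_accuracy` is a hypothetical helper.

```python
from collections import Counter


def expected_stratified_accuracy(train_labels):
    """Expected accuracy of a classifier that guesses each class with its
    training frequency, evaluated on data with the same class distribution."""
    n = len(train_labels)
    return sum((count / n) ** 2 for count in Counter(train_labels).values())


# Six equally frequent classes: the expected accuracy is 1/6
balanced = expected_stratified_accuracy(["a", "b", "c", "d", "e", "f"])

# A skewed toy distribution pushes the baseline well above 1/number_of_classes
skewed = expected_stratified_accuracy(["a"] * 9 + ["b"])
```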

#### Section Level Baseline

In [None]:
dummy_section = DummyClassifier(strategy="stratified", random_state=0)
dummy_section.fit(X_train_section, Y_train_section)

In [None]:
dummy_section_predictions = dummy_section.predict(X_test_section)

print(classification_report(Y_test_section, dummy_section_predictions))

* Our baseline is set! Let's move on to the real models.

### Naive Bayes Classifier

* We will start off with a Naive Bayes classifier, which is a simple and popular classification technique for text data.

* We will use the Gaussian Naive Bayes classifier, as it is suitable for continuous data.
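
* To make the model concrete: Gaussian Naive Bayes fits one mean and one variance per feature and class, and predicts the class with the highest log-prior plus summed Gaussian log-likelihoods. The following is a minimal plain-Python sketch of that idea on toy data. The helper names and the small variance floor `eps` are assumptions for illustration; scikit-learn's `GaussianNB` remains what we actually use.

```python
from collections import defaultdict
from math import log, pi


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate a log-prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps variances strictly positive (an assumed smoothing constant)
        variances = [sum((v - m) ** 2 for v in col) / n + eps for col, m in zip(columns, means)]
        model[label] = (log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Pick the class with the highest log-prior + Gaussian log-likelihood."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * log(2 * pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


# Toy 2D data with two well-separated classes
toy_X = [[0.0, 0.0], [0.1, -0.1], [5.0, 5.0], [5.2, 4.9]]
toy_y = ["a", "a", "b", "b"]
toy_model = fit_gaussian_nb(toy_X, toy_y)
toy_prediction = predict_gaussian_nb(toy_model, [0.05, 0.0])
```

* The "naive" part is the sum over features: each embedding dimension is treated as independent given the class, which is a strong assumption for dense embeddings.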

#### Class Level Classification

* We start off by fitting our model.

In [None]:
gnb = GaussianNB()
gnb.fit(X_train, Y_train)

* We can then predict on our test set and evaluate the model's performance.

In [None]:
Y_pred_gnb_class = gnb.predict(X_test)
print(classification_report(Y_test, Y_pred_gnb_class))

* Let's also visualize the confusion matrix.

In [None]:
heatmap_confusion_matrix(Y_test, Y_pred_gnb_class, "Gaussian Naive Bayes Confusion Matrix")

#### Section Level Classification

In [None]:
gnb_section = GaussianNB()
gnb_section.fit(X_train_section, Y_train_section)

In [None]:
Y_pred_gnb_section = gnb_section.predict(X_test_section)
print(classification_report(Y_test_section, Y_pred_gnb_section))

#### Conclusions

* Naive Bayes does not perform very well, despite being almost 3 times better than the dummy classifier.

* Our data is extremely high-dimensional, and a simple probabilistic model such as Naive Bayes, which assumes that the features are independent, is perhaps not the best choice for our task.

* Still, we have outperformed our baseline and it sets a new standard for us to beat.

### Support Vector Machine (SVM) Classifier feat. LDA

* SVM is a true classic and a powerful classification technique that is known for its ability to find the best hyperplane that separates the classes.

* Training an SVM on high-dimensional data takes too long, so we will first use LDA to reduce the dimensions.

* For SVM, VoyageAI's embeddings will be used for diversity.

* First, we want to split our data into training and testing sets.

In [None]:
X_train_class_svm, X_test_class_svm, Y_train_class_svm, Y_test_class_svm = train_test_split(supervised_df["vygai_embedding"].tolist(), supervised_df['class'].tolist(), test_size=0.2, random_state=0)
X_train_section_svm, X_test_section_svm, Y_train_section_svm, Y_test_section_svm = train_test_split(supervised_df["vygai_embedding"].tolist(), supervised_df['section'].tolist(), test_size=0.2, random_state=0)

#### Class Level Prediction

* As we said, we'll use LDA to reduce the dimensions of our embeddings. 

* In order to avoid test set leakage, we will fit the LDA on the training set and then transform both the training and testing sets. 

* We will use the maximum number of components that LDA can provide, which is the number of classes minus one.
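
The cap on the number of LDA components follows from the between-class scatter matrix, whose rank is at most the number of classes minus one. The sketch below checks this on hypothetical toy data with NumPy only; the blob data, the variable names and the explicit tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_features, n_per_class = 4, 10, 30

# Hypothetical toy data: each class is a Gaussian blob around its own centre
centres = rng.normal(0.0, 5.0, size=(n_classes, n_features))
X = np.vstack([rng.normal(c, 1.0, size=(n_per_class, n_features)) for c in centres])
y = np.repeat(np.arange(n_classes), n_per_class)

# Between-class scatter: weighted outer products of (class mean - overall mean)
overall_mean = X.mean(axis=0)
S_b = np.zeros((n_features, n_features))
for c in range(n_classes):
    X_c = X[y == c]
    d = (X_c.mean(axis=0) - overall_mean).reshape(-1, 1)
    S_b += len(X_c) * (d @ d.T)

# Explicit tolerance so that floating point noise is not counted as rank
rank_S_b = np.linalg.matrix_rank(S_b, tol=1e-8)
print(f"rank of between-class scatter: {rank_S_b} (n_classes - 1 = {n_classes - 1})")
```

The centred class means always sum (with class-size weights) to zero, so one direction is lost, which is why LDA cannot return more than `n_classes - 1` discriminant axes regardless of the embedding size.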

In [None]:
lda_svm = LDA(n_components=expected_classes_n - 1)
X_train_class_svm = lda_svm.fit_transform(X_train_class_svm, Y_train_class_svm)
X_test_class_svm = lda_svm.transform(X_test_class_svm)

In [None]:
svm = SVC(kernel='rbf', class_weight='balanced')
svm.fit(X_train_class_svm, Y_train_class_svm) 
Y_pred_class_svm = svm.predict(X_test_class_svm)

* Let's evaluate our model.

In [None]:
print(classification_report(Y_test_class_svm, Y_pred_class_svm))

* It seems that our model is not performing very well, but we are still going upwards! 

* In fact, the LDA trick performs much better than the other dimensionality reduction techniques we tested.

In [None]:
heatmap_confusion_matrix(Y_test_class_svm, Y_pred_class_svm, "SVM Class Classification Confusion Matrix")

#### Section Level Prediction

* Likewise, we'll train our SVM model and make predictions.

In [None]:
lda_svm = LDA(n_components=expected_sections_n - 1)
X_train_section_svm = lda_svm.fit_transform(X_train_section_svm, Y_train_section_svm)
X_test_section_svm = lda_svm.transform(X_test_section_svm)

In [None]:
svm = SVC(kernel='rbf', class_weight='balanced')

svm.fit(X_train_section_svm, Y_train_section_svm)
Y_pred_section_svm = svm.predict(X_test_section_svm)

In [None]:
print(classification_report(Y_test_section_svm, Y_pred_section_svm))

In [None]:
# Scatter plot of the class predictions on the first two LDA components
fig = px.scatter(x=[x[0] for x in X_test_class_svm], y=[x[1] for x in X_test_class_svm], color=Y_pred_class_svm, color_discrete_sequence=px.colors.qualitative.Antique)
fig.update_layout(title="SVM Class Classification on LDA Reduced VoyageAI Embeddings")
fig.show()

#### Conclusions

* SVM was an improvement over anything we have seen so far, but we have yet to reach a satisfactory level of performance.

* LDA was of great help in terms of both accuracy and training time, and it seems to be the best choice for our task.

### Solo Linear Discriminant Analysis (LDA) 

* LDA performed very well on our embeddings, which is not surprising as it is a supervised technique. With just 3 components, we were able to capture around 70-75% of the variance. Let's see how it performs in a classification task!

In [None]:
X_train_vyg_lda_class, X_test_vyg_lda_class, Y_train_vyg_lda_class, Y_test_vyg_lda_class = train_test_split(supervised_df["mistralai_embedding"].tolist(), supervised_df['class'].tolist(), test_size=0.2, random_state=0)
X_train_vyg_lda_section, X_test_vyg_lda_section, Y_train_vyg_lda_section, Y_test_vyg_lda_section = train_test_split(supervised_df["mistralai_embedding"].tolist(), supervised_df['section'].tolist(), test_size=0.2, random_state=0)

#### Class Level Prediction

In [None]:
lda = LDA(n_components=expected_classes_n - 1)
lda.fit(X_train_vyg_lda_class, Y_train_vyg_lda_class)

In [None]:
Y_pred_lda_class = lda.predict(X_test_vyg_lda_class)
print(classification_report(Y_test_vyg_lda_class, Y_pred_lda_class))

#### Section Level Prediction

In [None]:
lda = LDA(n_components=expected_sections_n - 1)
lda.fit(X_train_vyg_lda_section, Y_train_vyg_lda_section)

In [None]:
Y_pred_lda_section = lda.predict(X_test_vyg_lda_section)
print(classification_report(Y_test_vyg_lda_section, Y_pred_lda_section))

#### Conclusions

* Class-level classification seems almost identical to SVM, but section-level classification has improved slightly.

### Neural Network

* Let's move on to more sophisticated models. We will use a simple neural network for classification.

#### Helper Functions

* In order to promote code reusability, we will define a few helper functions that will allow us to easily create and evaluate our models.

* First, we'll define the data splitting function.

* We want a training, validation and testing set. 

* We also want to give the option for resampling, as our dataset is imbalanced. (A lot of experimentation was done with this, but it did not yield any significant improvements.)
    * It is important to resample **only** the training set, as we do not want to introduce bias in our validation and testing sets.
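
The "split first, resample afterwards" order can be sketched as follows. This is a minimal illustration on hypothetical toy data, and it uses plain random oversampling in NumPy as a stand-in for SMOTE and InstanceHardnessThreshold, so it shows the order of operations rather than the exact resampling we used.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=target - n, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

rng = np.random.default_rng(0)

# Hypothetical imbalanced toy data: 90 samples of class 0, 10 of class 1
X = rng.normal(size=(100, 8))
y = np.array([0] * 90 + [1] * 10)

# 1) Split first, so the held-out rows are fixed before any resampling
perm = rng.permutation(len(y))
test_idx, train_idx = perm[:20], perm[20:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Resample the training portion only; the test rows keep their original distribution
X_train_res, y_train_res = random_oversample(X_train, y_train, rng)
print(np.unique(y_train_res, return_counts=True), X_test.shape)
```

Resampling before the split would let copies (or synthetic neighbours) of the same sample end up on both sides of the split, which inflates the validation and test scores.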

In [None]:
from imblearn.under_sampling import InstanceHardnessThreshold

def prepare_dataset(df, embedding_col_name, label_col_name, test_size=0.2, val_size=0.2, enable_resampling=False, random_state=0):
    X = np.stack(df[embedding_col_name].values)
    Y = df[label_col_name].values
    
    Y_categorical = to_categorical(Y)

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y_categorical, test_size=test_size, random_state=random_state)
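    # Seed the second split as well, so that the validation set is reproducible
    X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=val_size, random_state=random_state)
    
    if enable_resampling:
        oversampler = SMOTE(sampling_strategy='not majority', random_state=random_state)
        undersampler = InstanceHardnessThreshold(random_state=random_state, estimator=GaussianNB(), sampling_strategy='majority', cv=10)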
        pipeline = Pipeline([('undersampling', undersampler), ('oversampling', oversampler)])
        X_train, Y_train = pipeline.fit_resample(X_train, Y_train)
    
    return X_train, X_val, X_test, Y_train, Y_val, Y_test

* Now we'll define our trainer and evaluator functions.

* We will use the Adam optimizer, as it is a good choice for most tasks. 

* We will use the categorical focal crossentropy loss function (our labels are one-hot encoded), as it is suitable for *unbalanced* multi-class classification tasks.

* Finally, we'll use callbacks to control the training process. Specifically:
    * EarlyStopping will stop training if the validation loss does not improve for 15 epochs, and will restore the best weights.
    * ReduceLROnPlateau will multiply the learning rate by 0.1 if the validation loss does not improve for 5 epochs.
    * LearningRateScheduler will multiply the learning rate by 0.2 every 10 epochs.
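
The loss and the step schedule described above can be sketched in NumPy. This is an illustration and not the Keras implementation: the function names are hypothetical, and `alpha=0.25` and `gamma=2.0` are assumed here because, to our knowledge, they are the Keras defaults for the focal loss.

```python
import numpy as np

def categorical_focal_loss(y_true, y_prob, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for one-hot targets: alpha * (1 - p)^gamma * (-log p), summed over classes."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    cross_entropy = -y_true * np.log(p)
    modulating = (1.0 - p) ** gamma  # shrinks the loss of confident predictions
    return np.sum(alpha * modulating * cross_entropy, axis=-1)

def step_decay(epoch, initial_lr=0.001, drop=0.2, epochs_per_drop=10):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

# Two hypothetical predictions for a sample whose true class is 0
y_true = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
y_prob = np.array([[0.9, 0.05, 0.05], [0.2, 0.4, 0.4]])
easy_loss, hard_loss = categorical_focal_loss(y_true, y_prob)
print(f"easy example: {easy_loss:.5f}, hard example: {hard_loss:.5f}")

# Learning rate at a few selected epochs
lrs = [step_decay(e) for e in (0, 9, 10, 25)]
print(lrs)
```

The confident prediction contributes almost nothing to the loss, so training focuses on the hard (often minority-class) examples. One caveat worth keeping in mind: as far as we understand the Keras callbacks, a LearningRateScheduler that ignores the current learning rate sets it again at the start of every epoch, so it can overwrite the reductions made by ReduceLROnPlateau.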

In [None]:
def trainer(model, X_train, Y_train, X_val, Y_val, epochs=100, batch_size=32, learning_rate=0.001, verbose=0):
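    # Step decay: start from the requested learning rate and multiply it by 0.2 every 10 epochs
    step_decay = lambda epoch: float(learning_rate * 0.2 ** (epoch // 10))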

    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss=CategoricalFocalCrossentropy(), metrics=['accuracy', 'AUC'])
    
    callbacks = [
        EarlyStopping(monitor='val_loss', patience=15, verbose=1, restore_best_weights=True),
        ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, verbose=1, min_lr=0.00001),
        LearningRateScheduler(step_decay)
    ]
    
    history = model.fit(
        X_train, Y_train,
        validation_data=(X_val, Y_val),
        epochs=epochs,  
        batch_size=batch_size,
        callbacks=callbacks, 
        verbose=verbose, 
        shuffle=True,
    )
    
    return history

def evaluate_model(model, X_test, y_test):
    loss, accuracy, auc = model.evaluate(X_test, y_test, verbose=0)
    
    return loss, accuracy, auc    

* And their respective plotting functions.

In [None]:
def plot_training_history(history, title):
    fig = make_subplots(rows=1, cols=2, subplot_titles=("Loss", "Accuracy"))

    fig.add_trace(go.Scatter(x=list(range(1, len(history.history['loss'])+1)), y=history.history['loss'], mode='lines+markers', name='Training Loss'), row=1, col=1)
    fig.add_trace(go.Scatter(x=list(range(1, len(history.history['val_loss'])+1)), y=history.history['val_loss'], mode='lines+markers', name='Validation Loss'), row=1, col=1)

    fig.add_trace(go.Scatter(x=list(range(1, len(history.history['accuracy'])+1)), y=history.history['accuracy'], mode='lines+markers', name='Training Accuracy'), row=1, col=2)
    
    fig.add_trace(go.Scatter(x=list(range(1, len(history.history['val_accuracy'])+1)), y=history.history['val_accuracy'], mode='lines+markers', name='Validation Accuracy'), row=1, col=2)
    
    fig.update_xaxes(title_text="Epoch", row=1, col=1) 
    fig.update_xaxes(title_text="Epoch", row=1, col=2) 
    
    fig.update_yaxes(title_text="Loss", row=1, col=1)
    fig.update_yaxes(title_text="Accuracy", row=1, col=2)
    
    fig.update_layout(title_text=title, height=500, width=1000)
    
    fig.show()

#### Class Level Prediction

* We are now ready to train our neural network for class-level prediction.

* We will build a neural network with two hidden layers, with 64 neurons in the first layer and 32 neurons in the second layer. 
    * The number of neurons was chosen through experimentation with validation data.
    * We will use L1 regularization to pick up on the most important features and prevent overfitting.
* We will also use the PReLU activation function, which learns the slope applied to negative inputs instead of zeroing them out as ReLU does.
* Each batch will be normalized so as to improve convergence.
* Finally, Dropout will be used to prevent overfitting and improve generalization as it acts like an ensemble of models.
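
The two less common building blocks listed above, PReLU and Dropout, can be sketched in a few lines of NumPy. The function names are hypothetical and the slope of 0.25 is only a placeholder, since in the real layer the slope is a learned parameter.

```python
import numpy as np

def prelu(x, slope):
    """PReLU: identity for positive inputs, a (normally learned) slope for negative ones."""
    return np.where(x > 0, x, slope * x)

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of the units and rescale the rest."""
    if not training:
        return x  # at inference time the layer does nothing
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.array([-2.0, -0.5, 0.0, 1.5])

after_prelu = prelu(activations, slope=0.25)  # placeholder slope, for illustration
after_dropout = dropout(after_prelu, rate=0.5, rng=rng)
print(after_prelu)
print(after_dropout)
```

Because every training step samples a different mask, each step effectively trains a different thinned network, which is where the "ensemble of models" intuition comes from.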

In [None]:
X_train_nn_class, X_val_nn_class, X_test_nn_class, Y_train_nn_class, Y_val_nn_class, Y_test_nn_class = prepare_dataset(supervised_df, 'mistralai_embedding', 'class_', test_size=0.2, val_size=0.2, random_state=0)

nn_class_mistralai = Sequential([ 
    Input(shape=(X_train_nn_class.shape[1],)),
    Dense(64, kernel_regularizer=l1(1e-5)),
    BatchNormalization(),
    PReLU(),
    Dropout(0.5),
    Dense(32, kernel_regularizer=l1(1e-5)),
    BatchNormalization(),
    PReLU(),
    Dropout(0.5),    
    Dense(Y_train_nn_class.shape[1], activation='softmax')
])

* Let's take a look at our model!

In [None]:
nn_class_mistralai.summary()

* Now we'll plot our training results.

In [None]:
history_nn_class_mistralai = trainer(nn_class_mistralai, X_train_nn_class, Y_train_nn_class, X_val_nn_class, Y_val_nn_class, epochs=150, batch_size=32, learning_rate=0.001)

plot_training_history(history_nn_class_mistralai, "Neural Network Training History for MistralAI Class Classification")

* Let's see how it predicts.

In [None]:
loss_nn_class_mistralai, accuracy_nn_class_mistralai, auc_nn_class_mistralai = evaluate_model(nn_class_mistralai, X_test_nn_class, Y_test_nn_class)
print(f"Neural Network MistralAI Class Classification Loss: {loss_nn_class_mistralai}")
print(f"Neural Network MistralAI Class Classification Accuracy: {accuracy_nn_class_mistralai}")
print(f"Neural Network MistralAI Class Classification AUC: {auc_nn_class_mistralai}")

heatmap_confusion_matrix(np.argmax(Y_test_nn_class, axis=1), np.argmax(nn_class_mistralai.predict(X_test_nn_class), axis=1), "Neural Network MistralAI Class Classification Confusion Matrix")

#### Section Level Prediction

* Let's repeat the process for section-level prediction.

In [None]:
X_train_nn_section, X_val_nn_section, X_test_nn_section, Y_train_nn_section, Y_val_nn_section, Y_test_nn_section = prepare_dataset(supervised_df, 'mistralai_embedding', 'section_', test_size=0.2, val_size=0.2, random_state=0) 

nn_section_mistralai = Sequential([
    Input(shape=(X_train_nn_section.shape[1],)),
    Dense(64, kernel_regularizer=l1(1e-5)),
    BatchNormalization(),
    PReLU(),
    Dropout(0.5),
    Dense(32, kernel_regularizer=l1(1e-5)),
    BatchNormalization(),
    PReLU(),
    Dropout(0.5),    
    Dense(Y_train_nn_section.shape[1], activation='softmax')
])

In [None]:
history_nn_section_mistralai = trainer(nn_section_mistralai, X_train_nn_section, Y_train_nn_section, X_val_nn_section, Y_val_nn_section, epochs=150, batch_size=32, learning_rate=0.001)

plot_training_history(history_nn_section_mistralai, "Neural Network Training History for MistralAI Section Classification")

In [None]:
loss_nn_section_mistralai, accuracy_nn_section_mistralai, auc_nn_section_mistralai = evaluate_model(nn_section_mistralai, X_test_nn_section, Y_test_nn_section)

print(f"Neural Network MistralAI Section Classification Loss: {loss_nn_section_mistralai}")
print(f"Neural Network MistralAI Section Classification Accuracy: {accuracy_nn_section_mistralai}")
print(f"Neural Network MistralAI Section Classification AUC: {auc_nn_section_mistralai}")

#### Conclusions

* Our neural network delivered only average performance, and honestly no better than the LDA-assisted SVM or LDA on its own.

* Sadly, more complex networks overfit and did not yield any significant improvements. Simpler networks did not perform any better than our last baseline.
    * We could start throwing CNNs and RNNs at the problem, but they are unlikely to bring any significant improvement.

* Despite applying a few advanced techniques such as PReLU and focal loss, we were unable to achieve any significant improvements.



## Conclusions & Playground

* Classifying the entries of Roget's Thesaurus was a challenging task, and despite the disappointing results, a lot was learned from experimenting with different techniques.


* Due to the age of the thesaurus, part of the language used in it is quite peculiar and may not be well represented in our embeddings, which also suffer from the curse of dimensionality.
    * This is especially noticeable when we try to cluster the words and predict their classes/sections.



* In retrospect, using SOTA embeddings of such high dimensionality was not the best choice for our task. A tamer set of embeddings might have eased the performance issues and allowed for more sophisticated techniques to be used. 


* Similarly, supervised classification techniques did not perform well at all. We quickly reached a ceiling and were unable to achieve any significant improvements between primitive and advanced models. 


* The points above hint at the following:
    * The human-made classes and sections in the century-old Roget's Thesaurus are not well represented in our embedding space.
    * There was an issue with the quality of the data preprocessing techniques applied to the dataset. In my defense, a lot of different methods were tested and the best ones were chosen. 
    * We did not have the resources to try out more sophisticated techniques such as pre-trained language models, which might have performed better - although it is not guaranteed due to the ceiling.
    * Flattening the sections into divisions might have introduced some bias and created over-represented sections, which might have affected our results. The dataset was imbalanced on a class-level as well, with resampling techniques not yielding any significant improvements.

* It would be very easy to fall into the trap of introducing bias in our models without realizing it, for example by performing SMOTE on the training, validation and test sets alike, or by fitting LDA on the entire dataset. It is important to be aware of these issues and to avoid them as much as possible, as they can lead to misleading results.

* I, as a student, really enjoyed working with LDA and thinking up ways to utilize it in a classification task, as seen with the SVM.
    * I also had to scrap more than half of the notebook before delivery as it was too long.

Predictions can be made using the following function. Simply replace the string with the phrase or word you want to classify, and pick the class-level or the section-level model. Specifically, we will allow the reader to predict using our Neural Network model with the MistralAI embeddings.

In [None]:
def make_predictions(model, phrase) -> np.ndarray:
    
    embeddings_batch_response = mistral_client.embeddings(model="mistral-embed", input=[phrase])
    phrase_embedding = [np.array(e.embedding) for e in embeddings_batch_response.data][0]
    phrase_embedding = np.array(phrase_embedding).reshape(1, -1)
        
    predictions = model.predict(phrase_embedding)
    y_pred = np.argmax(predictions, axis=1)    
    
    return y_pred

class_prediction_model = nn_class_mistralai
section_prediction_model = nn_section_mistralai

##### Class Level

In [None]:
class_phrase = "I love LDA" 

class_pred = make_predictions(class_prediction_model, class_phrase)

pred_class_label = class_encoder.inverse_transform(class_pred)
print(f"Predicted Class: {pred_class_label}")

##### Section Level

In [None]:
section_phrase = "SVM too"

section_pred = make_predictions(section_prediction_model, section_phrase)

pred_section_label = section_encoder.inverse_transform(section_pred)
print(f"Predicted Section: {pred_section_label}")

## Submission Instructions

* You must submit your assignment as a Jupyter notebook that will contain the full code and documentation of how you solved the questions, plus all accompanying material, such as embedding files, etc.

* You are not required to upload your assignment; you may, if you wish, do your work in GitHub and submit a link to the private repository you will be using. If you do that, make sure to share the private repository with your instructor. 

* You may also include plain Python files that contain code that is called by your Jupyter notebook.

* You must use [poetry](https://python-poetry.org/) for all dependency management. Somebody wishing to replicate your work should be able to do so by using the poetry file.

## Honor Code

You understand that this is an individual assignment, and as such you must carry it out alone. You may discuss with your colleagues in order to better understand the questions, if they are not clear enough, but you should not ask them to share their answers with you, or to help you by giving specific advice. You can use ChatGPT or other chatbots, if you find them useful, along with traditional web search.