# Roget's Thesaurus in the 21st Century

The first known thesaurus was written in the 1st century CE by [Philo of Byblos](https://en.wikipedia.org/wiki/Philo_of_Byblos); it was called *Περὶ τῶν διαφόρως σημαινομένων*, loosely translated into English as *On Synonyms*. Fast forward about two millennia and we arrive at the best-known thesaurus, compiled by [Peter Mark Roget](https://en.wikipedia.org/wiki/Peter_Mark_Roget), a British physician, natural theologian, and lexicographer. [Roget's Thesaurus](https://en.wikipedia.org/wiki/Roget%27s_Thesaurus) was released on 29 April 1852, containing 15,000 words. Subsequent editions were larger, with the latest totalling 443,000 words. In Greek, the best-known thesaurus, *Αντιλεξικόν ή Ονομαστικόν της Νεοελληνικής Γλώσσης*, was released in 1949 by [Θεολόγος Βοσταντζόγλου](https://el.wikipedia.org/wiki/%CE%98%CE%B5%CE%BF%CE%BB%CF%8C%CE%B3%CE%BF%CF%82_%CE%92%CE%BF%CF%83%CF%84%CE%B1%CE%BD%CF%84%CE%B6%CF%8C%CE%B3%CE%BB%CE%BF%CF%85); the latest updated edition was released in 2008 and remains an indispensable source for writing in Greek.

Roget organised the entries of the thesaurus in a hierarchy of categories. Your task in this assignment is to investigate how well these categories agree with the meaning of English words as captured by Machine Learning techniques, namely their embeddings.

Note that this is an assignment that requires initiative and creativity on your part. There is no simple right or wrong answer. It is up to you to find the best solution. You have three weeks to do it. Make them count.

## Get Roget's Thesaurus Classification

You can find [Roget's Thesaurus classification online at Wiktionary](https://en.wiktionary.org/wiki/Appendix:Roget%27s_thesaurus_classification). You must download the categorisation (and the words belonging to each category) and store it in the way that you deem most convenient for processing.

First, we import the modules the code needs. We will use the following:

- requests to fetch the HTML from the site

- BeautifulSoup to parse the HTML content

- json to write the JSON file where we store the words

In [1]:
%pip install requests
%pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup
import json


Now we send a GET request to fetch the page (here, the Project Gutenberg HTML edition of the thesaurus) and initialize the parser for the HTML content.

In [2]:
# Step 1: Send a GET request to the URL
url = "https://www.gutenberg.org/cache/epub/22/pg22-images.html"
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed
html = response.text

# Step 2: Parse the HTML content
soup = BeautifulSoup(html, "html.parser")

Now we are ready to parse the content and arrange it for the JSON file. I found it most convenient to store it following the pattern Class > Divisions > Sections > Content where divisions exist, and Class > Sections > Content where they don't. The Content holds one sentence (a paragraph starting with "#") for each headword in the current section; each sentence lists all the words given for the headword at its front.

In [3]:
Chapters_data = [] # List to store the data

# Find all the chapters

Chapters = soup.find_all(class_="chapter") # Extract all the chapters

# Extract data from each chapter

for chapter in Chapters: # Process each chapter in turn
    chapter_data = {} # Create a dictionary to store the data
    chapter_data["Class"] = chapter.find("h2").text.strip() # Extract the class of the chapter
    divisions = chapter.find_all("h2") # Extract all the divisions
    divisions = divisions[1:] # Remove the first division as it is the chapter

    if divisions: # If there are divisions
        chapter_data["Divisions"] = [] # Create a list to store the divisions
        for division in divisions: # Extract data from each division
            division_data = {} # Create a dictionary to store the data
            division_data["Division"] = division.text.strip() # Extract the division
            section_data = [] # Create a list to store the sections
            next_sibling = division.next_sibling # Extract the next sibling
            while next_sibling: # Extract data from each section
                if next_sibling.name == "h2": # If the next sibling is a division,
                    break # break the loop
                if next_sibling.name == "h3": # If the next sibling is a section
                    section = next_sibling.text.strip() # Extract the section
                    paragraphs = [] # Create a list to store the paragraphs
                    next_paragraph = next_sibling.next_sibling # Extract the next sibling
                    while next_paragraph: # Extract data from each paragraph
                        if next_paragraph.name == "h3" or next_paragraph.name == "h2": # If the next sibling is a section or division,
                            break # break the loop
                        if next_paragraph.name == "p" and next_paragraph.text.strip().startswith("#"): # If the next sibling is a paragraph
                            paragraphs.append(next_paragraph.text.strip()) # Append the paragraph to the list
                        next_paragraph = next_paragraph.next_sibling # Move to the next sibling
                    section_data.append({"Section": section, "Content": paragraphs}) # Append the section and its content to the list
                next_sibling = next_sibling.next_sibling # Move to the next sibling
            division_data["Sections"] = section_data # Add the sections to the division
            chapter_data["Divisions"].append(division_data) # Add the division to the chapter
    else: # If there are no divisions
        section_data = [] # Create a list to store the sections
        sections = chapter.find_all("h3") # Extract all the sections
        for section in sections: # Extract data from each section
            section_name = section.text.strip() # Extract the section
            paragraphs = [] # Create a list to store the paragraphs
            next_paragraph = section.next_sibling # Extract the next sibling
            while next_paragraph: # Extract data from each paragraph
                if next_paragraph.name == "h3" or next_paragraph.name == "h2": # If the next sibling is a section or division,
                    break # break the loop
                if next_paragraph.name == "p" and next_paragraph.text.strip().startswith("#"): # If the next sibling is a paragraph
                    paragraphs.append(next_paragraph.text.strip()) # Append the paragraph to the list
                next_paragraph = next_paragraph.next_sibling # Move to the next sibling
            section_data.append({"Section": section_name, "Content": paragraphs}) # Append the section and its content to the list
        chapter_data["Sections"] = section_data # Add the sections to the chapter

    Chapters_data.append(chapter_data) # Add the chapter to the list

Finally, with the data ready, we write it to a JSON file.

In [4]:
# Step 3: Write data to JSON file

with open("gutenberg_data.json", "w") as json_file:
    json.dump(Chapters_data, json_file, indent=4)

print("Data has been written to gutenberg_data.json")

Data has been written to gutenberg_data.json
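
Because the saved file mixes two layouts (classes with divisions and classes without), any code reading it has to branch on the presence of "Divisions". The following is a minimal sketch of a helper that hides this branching. It runs on a hand-written toy record, not on the real file; the name `iter_sections` and all toy values are assumptions made for illustration.

```python
def iter_sections(chapters):
    """Yield (class_name, division_name, section_name, content) for both layouts.

    division_name is None for classes stored without divisions.
    """
    for chapter in chapters:
        if "Divisions" in chapter:
            for division in chapter["Divisions"]:
                for section in division["Sections"]:
                    yield (chapter["Class"], division["Division"],
                           section["Section"], section["Content"])
        else:
            for section in chapter["Sections"]:
                yield (chapter["Class"], None,
                       section["Section"], section["Content"])


# Toy data mimicking the two layouts (made-up values, not the real thesaurus)
toy = [
    {"Class": "CLASS A", "Sections": [
        {"Section": "SECTION I", "Content": ["#1. Alpha. -- N. alpha, first."]}]},
    {"Class": "CLASS B", "Divisions": [
        {"Division": "DIVISION I", "Sections": [
            {"Section": "SECTION I", "Content": ["#2. Beta. -- N. beta, second."]}]}]},
]

rows = list(iter_sections(toy))
for class_name, division_name, section_name, content in rows:
    print(class_name, division_name, section_name, len(content))
```

With a helper of this kind, the nested loops that walk the JSON file would only need to be written once.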


## Get Word Embeddings

You will need embeddings for the word entries in Roget's Thesaurus. It is up to you to find the embeddings; you can use any of the available models. Older models like word2vec, GloVe, BERT, etc., may be easier to use, but recent models like Llama 2 and Mistral have been trained on larger corpora. OpenAI and Google offer their embeddings through APIs, but they are not free.

You should think about how to store the embeddings you retrieve. You may use plain files (e.g., JSON, CSV) and vanilla Python, or a vector database.

As always, the first step is to import the modules we need. We will use the following:

- NumPy to convert the embeddings array to a list

- re to find words with a regular expression

- gensim to load the model that provides the embeddings

In [5]:
%pip install numpy
%pip install gensim
import numpy as np
import re
from gensim import downloader



Next, we extract the words from the sentences in the JSON file and keep only those for which the model has an embedding. After that, all that is left is to save the embeddings to a JSON file.
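
To see what the word-extraction step produces, here is a small sketch that applies the same regular expression as the cell below to a made-up line written in the style of a thesaurus entry. The sample line and the `MARKERS` set are assumptions for illustration only. One thing it suggests is that, since `&` is not a word character, the token that comes out of "&c" is "c", so a comparison against "&c" would not be expected to match anything.

```python
import re

# Same pattern as in the notebook: word tokens that are not purely numeric
TOKEN_RE = re.compile(r"\b(?![0-9]+\b)\w+\b")

# Made-up line in the style of a thesaurus entry (not copied from the data)
sample = "#1. Existence. -- N. existence, being &c. adj."

tokens = TOKEN_RE.findall(sample)
print(tokens)

# Hypothetical set of editorial markers to drop; note "c" rather than "&c"
MARKERS = {"N", "adj", "c"}
cleaned = [t for t in tokens if t not in MARKERS]
print(cleaned)
```

Whether dropping a bare "c" is appropriate depends on the data; this only illustrates how the tokens come out of the pattern.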

In [6]:
print("Opening file...")
with open("gutenberg_data.json", "r") as json_file:
    data = json.load(json_file)
    print("File loaded.")

# Extract sentences from the data
sentences = []
for chapter in data:
    if "Divisions" in chapter:
        for division in chapter["Divisions"]:
            for section in division["Sections"]:
                for con in section["Content"]:
                    sentences.append(con)
    else:
        for section in chapter["Sections"]:
            for con in section["Content"]:
                sentences.append(con)

# Extract words from the sentences
words = []

for sentence in sentences:
    words.extend(re.findall(r'\b(?![0-9]+\b)\w+\b', sentence))


print("Words found:", len(words))

print("Loading word2vec model...")

model = downloader.load("word2vec-google-news-300")  # Load the word2vec model online

# Alternatively, load the word2vec model from a local file (needs `import gensim` first)
#model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300', binary=True)


print("Finding embeddings...")

words_in_model = []
embeddings = []
for word in words:
    if word in model and word != 'N' and word !='adj' and word != "&c":
        embeddings.append(model[word]) # Add the embedding to the list
        words_in_model.append(word)

print("Embeddings found for", len(embeddings), "words.")

embeddings = np.array(embeddings).tolist()  # Convert NumPy array to Python list

print("Opening file...")

# Write embeddings to JSON file
with open("embeddings.json", "w") as json_file:
    json.dump(embeddings, json_file, indent=4)

print("Embeddings stored successfully.")


Opening file...
File loaded.
Words found: 199784
Loading word2vec model...
Finding embeddings...
Embeddings found for 166850 words.
Opening file...
Embeddings stored successfully.
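
The cell above writes only the vectors to embeddings.json; the list `words_in_model` is not saved with them, and indented JSON tends to produce a very large file for 300-dimensional vectors. As one possible alternative, the sketch below stores words and vectors together in a single compressed NumPy archive. The toy words, the toy vectors, and the file name `embeddings.npz` are assumptions for illustration, not part of the original solution.

```python
import os
import tempfile

import numpy as np

# Made-up stand-ins for the notebook's words_in_model and embeddings lists
words_in_model = ["existence", "being", "entity"]
embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.npz")  # hypothetical file name

    # Save both arrays in one compressed archive
    np.savez_compressed(path, words=np.array(words_in_model), vectors=embeddings)

    # Load them back; row i of vectors belongs to words[i]
    with np.load(path) as archive:
        loaded_words = archive["words"].tolist()
        loaded_vectors = archive["vectors"]

print(loaded_words)
print(loaded_vectors.shape)
```

Keeping the words next to their vectors means later cells can look up which word a given row belongs to without repeating the extraction.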


## Clustering

With the embeddings at hand, you can check whether unsupervised Machine Learning methods can arrive at classifications that are comparable to the Roget's Thesaurus Classification. You can use any clustering method of your choice (experiment freely). You must decide how to measure the agreement between the clusters you find and the classes defined by Roget's Thesaurus and report your results accordingly. The comparison will be at the class level (six classes) and the section / division level (so there must be two different clusterings, unless you can find good results with hierarchical clustering).

Once again, we import the modules we need:

- KMeans from sklearn.cluster to perform the clustering

- two metrics from sklearn.metrics to reach a conclusion about the agreement between Roget's classes and the clusters

In [7]:
%pip install scikit-learn
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score




With that done, we first load the embeddings we are going to cluster and record the Roget class of each word in a list, so that we can later compare these classes with the cluster labels using the metrics.

In [8]:
# Load word embeddings
with open("embeddings.json", "r") as json_file:
    embeddings = json.load(json_file)

# Convert embeddings to NumPy array
embeddings_array = np.array(embeddings)

# Load Roget's Thesaurus classification
with open("gutenberg_data.json", "r") as json_file:
    data = json.load(json_file)

# Create a list of classes for each word based on the chapters
rogets_classes = []
for i, chapter in enumerate(data):
    if "Divisions" in chapter:
        for division in chapter["Divisions"]:
            for section in division["Sections"]:
                for content in section["Content"]:
                    words = re.findall(r'\b(?![0-9]+\b)\w+\b', content)  # Split content into words
                    for word in words:
                        if word in model and word != 'N' and word !='adj' and word != "&c":
                            rogets_classes.append(i)  # Assign class index to each word
    else:
        for section in chapter["Sections"]:
            for content in section["Content"]:
                words = re.findall(r'\b(?![0-9]+\b)\w+\b', content)  # Split content into words
                for word in words:
                    if word in model and word != 'N' and word !='adj' and word != "&c":
                        rogets_classes.append(i)  # Assign class index to each word

print("Roget's classes assigned to words.")

Roget's classes assigned to words.
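
The cell above repeats the tokenisation and filtering of the embedding step so that the labels line up with the rows of the embeddings array; if the two passes ever differed, the labels would be silently misaligned. The sketch below shows one way to build the words, the class labels, and the section labels in a single pass. The section labels would be needed for the second, finer-grained comparison. The flattened `rows`, the `vocab` set (a stand-in for `word in model`), and all values are made up for illustration.

```python
import re

TOKEN_RE = re.compile(r"\b(?![0-9]+\b)\w+\b")

# Made-up flattened rows: (class index, section name, content line)
rows = [
    (0, "SECTION I", "#1. Alpha. -- N. alpha, first."),
    (0, "SECTION II", "#2. Beta. -- N. beta, second."),
    (1, "SECTION I", "#3. Gamma. -- N. gamma, third."),
]

# Hypothetical stand-in for the membership test `word in model`
vocab = {"alpha", "first", "beta", "gamma", "third"}

section_ids = {}  # (class index, section name) -> integer label
words, class_labels, section_labels = [], [], []

for class_idx, section_name, line in rows:
    # Sections are numbered in order of first appearance, per class
    key = (class_idx, section_name)
    section_id = section_ids.setdefault(key, len(section_ids))
    for token in TOKEN_RE.findall(line):
        if token in vocab:
            words.append(token)
            class_labels.append(class_idx)
            section_labels.append(section_id)

print(words)
print(class_labels)
print(section_labels)
```

Since the three lists are appended to together, they always have the same length and order.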


Now we perform the clustering. We create as many clusters as there are Roget classes, that is, 6.

In [9]:
# Cluster the word embeddings using K-means
print("Clustering embeddings...")
num_clusters = 6 # Number of Roget's classes (chapters)
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10) # n_init set explicitly to avoid the default-change warning
cluster_labels = kmeans.fit_predict(embeddings_array)
print("Clustering complete.")

Clustering embeddings...




Clustering complete.


Finally, we compute two metrics that measure the agreement between Roget's classes and the cluster labels.

In [10]:
# Calculate Adjusted Rand Index and Adjusted Mutual Information
ari = adjusted_rand_score(rogets_classes, cluster_labels)
ami = adjusted_mutual_info_score(rogets_classes, cluster_labels)
print('rogets_classes:', rogets_classes[:5])
print('cluster_labels:', cluster_labels[:5])

print("Adjusted Rand Index:", ari)
print("Adjusted Mutual Information:", ami)

rogets_classes: [0, 0, 0, 0, 0]
cluster_labels: [3 3 0 0 5]
Adjusted Rand Index: 0.027231992609547843
Adjusted Mutual Information: 0.030546747126600788
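
Both scores above are close to 0. For these chance-adjusted measures, values near 0 correspond to the agreement expected from unrelated labelings, while 1 means identical groupings. To make clear what the Adjusted Rand Index measures, here is a plain-Python sketch that computes it from pair counts. It is an illustrative re-implementation, not the scikit-learn code, and the function name and toy labels are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from pair counts (plain-Python sketch)."""
    n = len(labels_true)
    if n < 2:  # no pairs to compare
        return 1.0
    pairs_total = comb(n, 2)
    # Pairs placed together in both partitions
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    # Pairs placed together in each partition separately
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / pairs_total
    maximum = (together_true + together_pred) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (together_both - expected) / (maximum - expected)


truth = [0, 0, 1, 1]
same_grouping = adjusted_rand_index(truth, [1, 1, 0, 0])  # names swapped only
crossing = adjusted_rand_index(truth, [0, 1, 0, 1])       # cuts across the truth
print(same_grouping, crossing)
```

The first call shows that the score ignores how the clusters are numbered, which is why the arbitrary cluster indices from K-means can be compared directly with the class indices. The second shows that the score can drop below 0 when the grouping is worse than chance.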


## Class Prediction

Now we flip over to supervised Machine Learning methods. You must experiment and come up with the best classification method, whose input will be a word and whose target will be its class, or its section / division (so there must be two different models).

## Submission Instructions

* You must submit your assignment as a Jupyter notebook that will contain the full code and documentation of how you solved the questions, plus all accompanying material, such as embedding files, etc.

* You are not required to upload your assignment; you may, if you wish, do your work in GitHub and submit a link to the private repository you will be using. If you do that, make sure to share the private repository with your instructor. 

* You may also include plain Python files that contain code that is called by your Jupyter notebook.

* You must use [poetry](https://python-poetry.org/) for all dependency management. Somebody wishing to replicate your work should be able to do so by using the poetry file.

## Honor Code

You understand that this is an individual assignment, and as such you must carry it out alone. You may discuss with your colleagues in order to better understand the questions, if they are not clear enough, but you should not ask them to share their answers with you, or to help you by giving specific advice. You can use ChatGPT or other chatbots, if you find them useful, along with traditional web search.