# Roget's Thesaurus in the 21st Century

The first known thesaurus was written in the 1st century CE by [Philo of Byblos](https://en.wikipedia.org/wiki/Philo_of_Byblos); it was called *Περὶ τῶν διαφόρως σημαινομένων*, loosely translated into English as *On Synonyms*. Fast forward about two millennia and we arrive at the best-known thesaurus, compiled by [Peter Mark Roget](https://en.wikipedia.org/wiki/Peter_Mark_Roget), a British physician, natural theologian, and lexicographer. [Roget's Thesaurus](https://en.wikipedia.org/wiki/Roget%27s_Thesaurus) was released on 29 April 1852, containing 15,000 words. Subsequent editions were larger, with the latest totalling 443,000 words. In Greek, the best-known thesaurus, *Αντιλεξικόν ή Ονομαστικόν της Νεοελληνικής Γλώσσης*, was released in 1949 by [Θεολόγος Βοσταντζόγλου](https://el.wikipedia.org/wiki/%CE%98%CE%B5%CE%BF%CE%BB%CF%8C%CE%B3%CE%BF%CF%82_%CE%92%CE%BF%CF%83%CF%84%CE%B1%CE%BD%CF%84%CE%B6%CF%8C%CE%B3%CE%BB%CE%BF%CF%85); the latest updated edition was released in 2008 and remains an indispensable source for writing in Greek.

Roget organised the entries of the thesaurus in a hierarchy of categories. Your task in this assignment is to investigate how well these categories agree with the meaning of English words as captured by Machine Learning techniques, namely, their embeddings.

Note that this is an assignment that requires initiative and creativity on your part. There is no simple right or wrong answer. It is up to you to find the best solution. You have three weeks to do it. Make them count.

## Get Roget's Thesaurus Classification

You can find [Roget's Thesaurus classification online at Wiktionary](https://en.wiktionary.org/wiki/Appendix:Roget%27s_thesaurus_classification). You must download the categorisation (and the words belonging to each category) and store them in whatever way you find most convenient for processing.

In [70]:
import requests
from bs4 import BeautifulSoup
import json

# Step 1: Send a GET request to the URL
url = "https://www.gutenberg.org/cache/epub/22/pg22-images.html"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed
html = response.text

# Step 2: Parse the HTML content
soup = BeautifulSoup(html, "html.parser")

Chapters_data = []

# Find all the chapters

Chapters = soup.find_all(class_="chapter")

# Extract data from each chapter

for chapter in Chapters:
    chapter_data = {}
    chapter_data["Class"] = chapter.find("h2").text.strip()
    divisions = chapter.find_all("h2")
    divisions = divisions[1:]

    if divisions:
        chapter_data["Divisions"] = []
        for division in divisions:
            division_data = {}
            division_data["Division"] = division.text.strip()
            section_data = []
            next_sibling = division.next_sibling
            while next_sibling:
                if next_sibling.name == "h2":
                    break
                if next_sibling.name == "h3":
                    section = next_sibling.text.strip()
                    paragraphs = []
                    next_paragraph = next_sibling.next_sibling
                    while next_paragraph:
                        if next_paragraph.name == "h3" or next_paragraph.name == "h2":
                            break
                        if next_paragraph.name == "p" and next_paragraph.text.strip().startswith("#"):
                            paragraphs.append(next_paragraph.text.strip())
                        next_paragraph = next_paragraph.next_sibling
                    section_data.append({"Section": section, "Content": paragraphs})
                next_sibling = next_sibling.next_sibling
            division_data["Sections"] = section_data
            chapter_data["Divisions"].append(division_data)
    else:
        section_data = []
        sections = chapter.find_all("h3")
        for section in sections:
            section_name = section.text.strip()
            paragraphs = []
            next_paragraph = section.next_sibling
            while next_paragraph:
                if next_paragraph.name == "h3" or next_paragraph.name == "h2":
                    break
                if next_paragraph.name == "p" and next_paragraph.text.strip().startswith("#"):
                    paragraphs.append(next_paragraph.text.strip())
                next_paragraph = next_paragraph.next_sibling
            section_data.append({"Section": section_name, "Content": paragraphs})
        chapter_data["Sections"] = section_data

    Chapters_data.append(chapter_data)

# Step 3: Write data to JSON file

with open("gutenberg_data.json", "w") as json_file:
    json.dump(Chapters_data, json_file, indent=4)



In [71]:
import requests
from bs4 import BeautifulSoup
import json

# Step 1: Send a GET request to the URL
url = "https://www.gutenberg.org/cache/epub/22/pg22-images.html"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed
html = response.text

# Step 2: Parse the HTML content
soup = BeautifulSoup(html, "html.parser")

Chapters_data = {}

# Find all the chapters
Chapters = soup.find_all(class_="chapter")

# Extract data from each chapter
for chapter in Chapters:
    class_name = chapter.find("h2").text.strip()
    Chapters_data[class_name] = {}

    divisions = chapter.find_all("h2")
    divisions = divisions[1:]

    if divisions:
        for division in divisions:
            division_name = division.text.strip()
            Chapters_data[class_name][division_name] = {"content": []}

            next_sibling = division.next_sibling
            while next_sibling:
                if next_sibling.name == "h2":
                    break
                if next_sibling.name == "h3":
                    section = next_sibling.text.strip()
                    paragraphs = []
                    next_paragraph = next_sibling.next_sibling
                    while next_paragraph:
                        if next_paragraph.name == "h3" or next_paragraph.name == "h2":
                            break
                        if next_paragraph.name == "p" and next_paragraph.text.strip().startswith("#"):
                            paragraphs.append(next_paragraph.text.strip())
                        next_paragraph = next_paragraph.next_sibling
                    Chapters_data[class_name][division_name]["content"].extend(paragraphs)
                next_sibling = next_sibling.next_sibling
    else:
        sections = chapter.find_all("h3")
        for section in sections:
            section_name = section.text.strip()
            Chapters_data[class_name][section_name] = {"content": []}

            next_paragraph = section.next_sibling
            while next_paragraph:
                if next_paragraph.name == "h3" or next_paragraph.name == "h2":
                    break
                if next_paragraph.name == "p" and next_paragraph.text.strip().startswith("#"):
                    Chapters_data[class_name][section_name]["content"].append(next_paragraph.text.strip())
                next_paragraph = next_paragraph.next_sibling

# Step 3: Write data to JSON file
with open("gutenberg_data2.json", "w") as json_file:
    json.dump(Chapters_data, json_file, indent=4)

In [72]:
import requests
from bs4 import BeautifulSoup
import json

# Step 1: Send a GET request to the URL
url = "https://www.gutenberg.org/cache/epub/22/pg22-images.html"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed
html = response.text

# Step 2: Parse the HTML content
soup = BeautifulSoup(html, "html.parser")

Chapters_data = {}

# Find all the chapters
Chapters = soup.find_all(class_="chapter")

# Extract data from each chapter
for chapter in Chapters:
    class_name = chapter.find("h2").text.strip()
    Chapters_data[class_name] = {}

    divisions = chapter.find_all("h2")
    divisions = divisions[1:]

    if divisions:
        for division in divisions:
            division_name = division.text.strip()
            Chapters_data[class_name][division_name] = {}

            next_sibling = division.next_sibling
            while next_sibling:
                if next_sibling.name == "h2":
                    break
                if next_sibling.name == "h3":
                    section = next_sibling.text.strip()
                    paragraphs = []
                    next_paragraph = next_sibling.next_sibling
                    while next_paragraph:
                        if next_paragraph.name == "h3" or next_paragraph.name == "h2":
                            break
                        if next_paragraph.name == "p" and next_paragraph.text.strip().startswith("#"):
                            paragraphs.append(next_paragraph.text.strip())
                        next_paragraph = next_paragraph.next_sibling
                    Chapters_data[class_name][division_name][section] = {"content": paragraphs}
                next_sibling = next_sibling.next_sibling
    else:
        sections = chapter.find_all("h3")
        for section in sections:
            section_name = section.text.strip()
            Chapters_data[class_name][section_name] = {"content": []}

            next_paragraph = section.next_sibling
            while next_paragraph:
                if next_paragraph.name == "h3" or next_paragraph.name == "h2":
                    break
                if next_paragraph.name == "p" and next_paragraph.text.strip().startswith("#"):
                    Chapters_data[class_name][section_name]["content"].append(next_paragraph.text.strip())
                next_paragraph = next_paragraph.next_sibling

# Step 3: Write data to JSON file
with open("gutenberg_data3.json", "w") as json_file:
    json.dump(Chapters_data, json_file, indent=4)

## Get Word Embeddings

You will need embeddings for the word entries in Roget's Thesaurus. It is up to you to find the embeddings; you can use any of the available models. Older models like word2vec, GloVe, BERT, etc., may be easier to use, but recent models like Llama 2 and Mistral have been trained on larger corpora. OpenAI and Google offer their embeddings through APIs, but they are not free.

You should think about how to store the embeddings you retrieve. You may use plain files (e.g., JSON, CSV) and vanilla Python, or a vector database.
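As one possible starting point for the plain-file option, the sketch below round-trips a word-to-vector mapping through JSON. The helper names `save_embeddings` and `load_embeddings`, the file name, and the toy vectors are all made up for illustration; real vectors would come from whichever model you choose.

```python
import json
import tempfile
from pathlib import Path


def save_embeddings(embeddings, path):
    """Write a {word: vector} mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(embeddings, fh)


def load_embeddings(path):
    """Read the mapping back; vectors come back as lists of floats."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


# Toy vectors, invented for this example only.
toy_embeddings = {
    "existence": [0.12, -0.40, 0.33],
    "being": [0.10, -0.35, 0.30],
}

# Use a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_embeddings.json"
    save_embeddings(toy_embeddings, path)
    restored = load_embeddings(path)
```

JSON is easy to inspect but grows quickly with high-dimensional vectors; a binary array format or a vector database may be a better fit for the full vocabulary.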

In [79]:
import json

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

print("Opening file...")
with open("gutenberg_data.json", "r") as json_file:
    data = json.load(json_file)
    print("File loaded.")

# Collect every "#..." paragraph, whether or not the class is split into divisions
sentences = []
for chapter in data:
    if "Divisions" in chapter:
        for division in chapter["Divisions"]:
            for section in division["Sections"]:
                sentences.extend(section["Content"])
    else:
        for section in chapter["Sections"]:
            sentences.extend(section["Content"])

print(len(sentences))

# Word2Vec expects each sentence as a list of tokens. A raw string would be
# iterated character by character, so no word would ever reach the vocabulary.
# simple_preprocess lowercases and keeps tokens of 2 to 15 characters.
tokenized = [simple_preprocess(paragraph) for paragraph in sentences]

# Train Word2Vec model
print("Training Word2Vec model...")
model = Word2Vec(tokenized, vector_size=100, window=5, min_count=1, workers=4)

# One vector per vocabulary word
embeddings = {word: model.wv[word].tolist() for word in model.wv.index_to_key}

# Store embeddings in a file (e.g., JSON)
print("Storing embeddings...")
with open("rogets_embeddings.json", "w") as json_file:
    json.dump(embeddings, json_file)

print("Embeddings stored successfully.")


Opening file...
File loaded.
1043
Training Word2Vec model...
Storing embeddings...
Embeddings stored successfully.


## Clustering

With the embeddings at hand, you can check whether unsupervised Machine Learning methods can arrive at classifications that are comparable to the Roget's Thesaurus Classification. You can use any clustering method of your choice (experiment freely). You must decide how to measure the agreement between the clusters you find and the classes defined by Roget's Thesaurus and report your results accordingly. The comparison will be at the class level (six classes) and the section / division level (so there must be two different clusterings, unless you can find good results with hierarchical clustering).
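Measuring agreement between clusters and Roget's classes can be done in several ways. The sketch below shows two common choices, purity and the adjusted Rand index, written in plain Python so the arithmetic is visible. The function names and the toy labels are assumptions made for illustration; a library implementation would normally be preferable for real experiments.

```python
from collections import Counter
from math import comb


def purity(true_labels, cluster_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    by_cluster = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        by_cluster.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


def adjusted_rand_index(true_labels, cluster_labels):
    """Pair-counting agreement corrected for chance; 1.0 means identical partitions."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to compare
    pair_counts = Counter(zip(true_labels, cluster_labels))
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_class = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cluster = sum(comb(c, 2) for c in Counter(cluster_labels).values())
    expected = sum_class * sum_cluster / comb(n, 2)
    maximum = (sum_class + sum_cluster) / 2
    if maximum == expected:  # degenerate case, e.g. one class and one cluster
        return 1.0
    return (sum_pairs - expected) / (maximum - expected)


# Toy example: six words, two true classes, two clusters (one word misplaced).
true_classes = ["A", "A", "A", "B", "B", "B"]
found_clusters = [0, 0, 1, 1, 1, 1]

toy_purity = purity(true_classes, found_clusters)
toy_ari = adjusted_rand_index(true_classes, found_clusters)
```

Purity rewards clusterings with many small clusters, whereas the adjusted Rand index corrects for chance agreement, so reporting both (or another chance-corrected measure) gives a fairer picture.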

## Class Prediction

Now we flip over to supervised Machine Learning methods. You must experiment and come up with the best classification method, whose input will be a word and whose target will be its class, or its section / division (so there must be two different models).
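A simple baseline can help calibrate more elaborate models. The sketch below is a nearest-centroid classifier over embedding vectors using cosine similarity; it is only one possible baseline, not a recommended solution. The function names, the two-dimensional toy vectors, and the shortened class labels are assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fit_centroids(vectors, labels):
    """Average the training vectors of each class into one centroid per class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is most similar to the given vector."""
    return max(centroids, key=lambda label: cosine(centroids[label], vector))


# Toy training data: invented 2-d "embeddings" with shortened class names.
train_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
train_labels = ["SPACE", "SPACE", "MATTER", "MATTER"]

centroids = fit_centroids(train_vectors, train_labels)
prediction = predict(centroids, [0.8, 0.3])
```

The same code works unchanged when the labels are sections or divisions instead of classes, which makes it a convenient reference point for both required models.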

## Submission Instructions

* You must submit your assignment as a Jupyter notebook that will contain the full code and documentation of how you solved the questions, plus all accompanying material, such as embedding files, etc.

* You are not required to upload your assignment; you may, if you wish, do your work in GitHub and submit a link to the private repository you will be using. If you do that, make sure to share the private repository with your instructor. 

* You may also include plain Python files that contain code that is called by your Jupyter notebook.

* You must use [poetry](https://python-poetry.org/) for all dependency management. Somebody wishing to replicate your work should be able to do so by using the poetry file.

## Honor Code

You understand that this is an individual assignment, and as such you must carry it out alone. You may discuss with your colleagues in order to better understand the questions, if they are not clear enough, but you should not ask them to share their answers with you, or to help you by giving specific advice. You can use ChatGPT or other chatbots, if you find them useful, along with traditional web search.