### 0 Imports

If this code block is failing, make sure you have all of the necessary packages installed. See `requirements.txt` in the root directory for more information.

In [22]:
import os
from gensim.models import Word2Vec
from nltk.corpus import words
import nltk
import random
import requests
import json
from dotenv import load_dotenv
from tqdm import tqdm
from openai import OpenAI
import re
from nltk.corpus import wordnet as wn

### 1 Specify Target Directories and Variables

Make sure that `models_dir` points to the directory where you stored the models from the previous steps. Also make sure that directory contains no other files ending in `.model` that you want excluded, because the code opens every `.model` file it finds. The synonym maps generated in the following steps are saved to the directory you specify in `synonyms_dir`.
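Because every `.model` file is loaded, it can help to list what will be picked up before running the later steps. The sketch below applies the same `.endswith(".model")` filter that step 2.1 uses; `list_model_files` is a hypothetical helper, not part of the original notebook.

```python
import os

def list_model_files(directory):
    """Return the sorted names of files in `directory` that end in '.model'."""
    return sorted(
        name for name in os.listdir(directory)
        if name.endswith(".model")
    )
```

Calling `list_model_files(models_dir)` after the cell below has run should show exactly the models you expect to compare.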

In [7]:
models_dir = "../models"
synonyms_dir = "../synonyms"

### 2 Generate List of Vocabulary Items in All Models

This step requires that `1.Train_Embedding_Models.ipynb` (or its `.py` version) has been run, along with steps 0 and 1 of this file. Here, all of the models from the previous training on world corpora are loaded and we calculate the intersection of their vocabularies. From this intersection, we then take a sample of 1000 words that all models contain so that we can evaluate fairly across all the models. These 1000 words are used for further investigation and are saved as `sampled_words.txt`.

#### 2.1 Find All Overlapping Words

This code loads the vocabulary of every trained embedding model and intersects the sets, leaving only words that appear in every model for every English variety. The intersection is then filtered against the NLTK `words` list.

In [3]:
common_vocab = None

for file_name in os.listdir(models_dir):
    if file_name.endswith(".model"):
        model_path = os.path.join(models_dir, file_name)
        model = Word2Vec.load(model_path)
        vocab = set(model.wv.index_to_key)
        
        if common_vocab is None:
            common_vocab = vocab
        else:
            common_vocab = vocab.intersection(common_vocab)

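# Fail early with a clear message if no models were found.
if common_vocab is None:
    raise FileNotFoundError(f"No .model files found in {models_dir}")

nltk.download('words')
english_words = set(words.words())
common_vocab = common_vocab.intersection(english_words)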
print("Filtered Vocabulary:", common_vocab)

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\johns\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
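To make the intersection step concrete, here is a minimal sketch on toy data. The three vocabularies are invented for illustration and stand in for the per-variety models; `functools.reduce` is used here in place of the explicit loop, but the result is the same kind of set.

```python
from functools import reduce

# Toy vocabularies standing in for the per-variety models (made-up data).
vocabularies = [
    {"colour", "lorry", "tea", "house"},
    {"color", "truck", "tea", "house"},
    {"tea", "house", "bush"},
]

# Intersect the sets one after another, as the loop in 2.1 does.
common = reduce(set.intersection, vocabularies)
print(sorted(common))
```

Only words present in every vocabulary survive, which is why variety-specific items such as `lorry` or `truck` drop out.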




#### 2.2 Randomly Sample 1000 Words for Study

In [4]:
random.seed(42)
sampled_words = random.sample(sorted(common_vocab), 1000)
os.makedirs(synonyms_dir, exist_ok=True)  # create the output directory if it is missing
output_file_path = os.path.join(synonyms_dir, "sampled_words.txt")

with open(output_file_path, "w") as file:
    for word in sampled_words:
        file.write(word + "\n")

print(f"Sampled words saved to {output_file_path}")

Sampled words saved to ../synonyms\sampled_words.txt
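The sampling above sorts the vocabulary before drawing from it. Sets have no guaranteed iteration order, so sorting is what lets a fixed seed reproduce the same sample on another machine (recent Python versions also refuse to sample directly from a set). The sketch below illustrates this with a local `random.Random` instance, which is a variation on the notebook's global `random.seed(42)`; the helper name and toy vocabulary are assumptions.

```python
import random

def sample_words(vocab, k, seed=42):
    """Draw k words reproducibly: sort first, then sample with a local RNG."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return rng.sample(sorted(vocab), k)

vocab = {"delta", "alpha", "charlie", "bravo", "echo"}
first = sample_words(vocab, 3)
# Build the same set in a different insertion order and sample again.
second = sample_words(set(reversed(sorted(vocab))), 3)
print(first == second)
```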


### 3 Traditional Synonymy Methods

In this larger stage, we look at synonyms encoded by common static resources: the Merriam Webster thesaurus and Princeton WordNet. Sections 3.1 and 3.2 both require steps 0 and 1 of this file and the file `{synonyms_dir}/sampled_words.txt` (generated in step 2). Sections 3.1 and 3.2 can be run independently of each other.

#### 3.1 Get Merriam Webster Synonyms

This part uses the Merriam Webster API to get the Merriam Webster synonyms for each word under observation. You must also create a Merriam Webster API key and place it in a file `/src/.env` so that the environment can pick it up and connect to the API (see `.env-example` for an example). You can set up an API key for free [here](https://dictionaryapi.com/).
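A quick way to confirm that the `.env` file was picked up is to check for the expected variables after calling `load_dotenv()`. The sketch below is a hypothetical helper; the variable names are the ones this notebook reads (`MERRIAM_WEBSTER_API_KEY` here, `OPENAI_API_KEY` and `LAMBDA_API_KEY` in section 4).

```python
import os

REQUIRED_KEYS = ["MERRIAM_WEBSTER_API_KEY", "OPENAI_API_KEY", "LAMBDA_API_KEY"]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]
```

Running `missing_env_vars(REQUIRED_KEYS)` after `load_dotenv()` returns an empty list when all three keys are available. If you only plan to run one section, pass only the key that section needs.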

##### 3.1.1 Make All API Calls and Collect Raw Data

In [None]:
sampled_words_file = os.path.join(synonyms_dir, "sampled_words.txt")
with open(sampled_words_file, "r") as file:
    sampled_words = [line.strip() for line in file.readlines()]

load_dotenv()
API_KEY = os.getenv("MERRIAM_WEBSTER_API_KEY")
if not API_KEY:
    raise ValueError("API key not found in .env file. Please add MERRIAM_WEBSTER_API_KEY.")

API_URL = "https://www.dictionaryapi.com/api/v3/references/thesaurus/json"

synonyms_map = {}
for word in tqdm(sampled_words):
    response = requests.get(f"{API_URL}/{word}?key={API_KEY}")
    if response.status_code == 200:
        data = response.json()
        synonyms_map[word] = data
    else:
        print(f"Failed to fetch synonyms for {word}. Status code: {response.status_code}")
        synonyms_map[word] = []

output_file = os.path.join(synonyms_dir, "mw_synonyms.json")
with open(output_file, "w") as file:
    json.dump(synonyms_map, file, indent=4)

print(f"Synonyms map saved to {output_file}")

100%|██████████| 1000/1000 [06:37<00:00,  2.51it/s]


Synonyms map saved to ../synonyms\mw_synonyms.json


##### 3.1.2 Process and Clean Raw Data From API Calls

Merriam Webster returns each word in an inconsistent format. For future processing, the responses are cleaned and placed into one large map where each key is the target word and the value is a list of synonym strings. The intermediate output is saved as `mw_synonyms.json` and the final cleaned output is saved as `cleaned_mw_synonyms.json`.

In [6]:
raw_file = os.path.join(synonyms_dir, "mw_synonyms.json")

with open(raw_file, "r") as file:
    raw_data = json.load(file)

cleaned_map = {}
for word, entries in raw_data.items():
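    # An empty response has no first element to inspect; record it and move on.
    if not entries:
        print(f"Error processing word: {word}, entries: {entries}")
        cleaned_map[word] = []
        continue  # keep processing the remaining words instead of stopping
    # A list of plain strings is stored unchanged.
    if isinstance(entries, list) and isinstance(entries[0], str):
        cleaned_map[word] = entries
        continue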
    synonyms = set() 
    for entry in entries:
        if "meta" in entry and "syns" in entry["meta"]:
            for syn_list in entry["meta"]["syns"]:
                synonyms.update(syn_list)
    cleaned_map[word] = list(synonyms)

cleaned_file = os.path.join(synonyms_dir, f"cleaned_{os.path.basename(raw_file)}")
with open(cleaned_file, "w") as file:
    json.dump(cleaned_map, file, indent=4)

print(f"Cleaned synonyms map saved to {cleaned_file}")

Error processing word: staphylococcus, entries: []
Cleaned synonyms map saved to ../synonyms\cleaned_mw_synonyms.json
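The cleaning logic handles three response shapes: a list of entry objects with `meta.syns`, a list of plain strings, and an empty list. The sketch below restates that logic as a small function and runs it on invented data; `clean_entries` and the sample values are assumptions for illustration, not real API output.

```python
def clean_entries(entries):
    """Flatten one raw thesaurus response into a list of synonym strings."""
    if not entries:
        return []  # nothing came back for this word
    if all(isinstance(entry, str) for entry in entries):
        return list(entries)  # plain strings are kept unchanged
    synonyms = set()
    for entry in entries:
        if isinstance(entry, dict):
            # Each entry may hold several synonym lists, one per sense.
            for syn_list in entry.get("meta", {}).get("syns", []):
                synonyms.update(syn_list)
    return sorted(synonyms)

# Invented examples of the three shapes.
raw = {
    "help": [{"meta": {"syns": [["aid", "assist"], ["assist", "support"]]}}],
    "wordx": ["word", "words"],
    "emptyword": [],
}
cleaned = {word: clean_entries(entries) for word, entries in raw.items()}
print(cleaned)
```

Sorting the result is a small deviation from the cell above, which converts the set with `list(...)`; it only makes the output order stable between runs.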


#### 3.2 WordNet Synonyms

This section uses the Princeton WordNet package to get synonyms for the sampled words. The output will be a file `{synonyms_dir}/cleaned_wordnet_synonyms.json`.

In [27]:
sampled_words_file = os.path.join(synonyms_dir, "sampled_words.txt")
with open(sampled_words_file, "r") as file:
    sampled_words = [line.strip() for line in file.readlines()]

wordnet_syns = {}
for word in sampled_words:
    all_syns = set()
    for syn in wn.synsets(word):
        all_syns.update(syn.lemma_names())
    wordnet_syns[word] = list(all_syns)

output_file = os.path.join(synonyms_dir, "cleaned_wordnet_synonyms.json")
with open(output_file, "w") as file:
    json.dump(wordnet_syns, file, indent=4)
print(f"WordNet synonyms map saved to {output_file}")

WordNet synonyms map saved to ../synonyms\cleaned_wordnet_synonyms.json
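WordNet lemma names join multiword expressions with underscores, and the lemma lists for a word typically include the word itself. The notebook stores the names as returned. If you want them in the same surface form as the other sources, an optional normalization step could look like the sketch below; `normalize_lemmas` is a hypothetical helper and is not applied anywhere in the original pipeline.

```python
def normalize_lemmas(word, lemma_names):
    """Replace underscores with spaces, lowercase, and drop the target word."""
    cleaned = {name.replace("_", " ").lower() for name in lemma_names}
    cleaned.discard(word.lower())
    return sorted(cleaned)

print(normalize_lemmas("help", ["help", "aid", "help_oneself", "Aid"]))
```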


### 4 LLM Generated Synonyms

This stage of the file generates synonyms for all the target words using three different LLMs. GPT-4o-mini, DeepSeek-V3, and Llama3.3 are used in sections 4.1, 4.2, and 4.3 respectively. Each section can be run independently and depends on steps 0 and 1 of this file and the file `{synonyms_dir}/sampled_words.txt` (generated in step 2).

#### 4.1 Get GPT Generated Synonyms

For GPT generated synonyms, we use a batched generation approach which completes within 24 hours. This section cost approximately $0.01 in credits to run. In order to run the generated requests, you need to have an OpenAI API key set up and included in the environment. To include the key, place it in a file `/src/.env` for the environment to pick it up and connect to the API (see `.env-example` for an example). You can set up an API key [here](https://platform.openai.com/docs/overview) and load needed credits.

##### 4.1.1 Generate Batch Request

To start with a batch request, each request needs to be added to a larger `.jsonl` file. The code will loop through the selected sampled words and add a request for the synonyms of each to the file. The full batch with all requests will be saved to `{synonyms_dir}/batch_gpt_req.jsonl`.

In [10]:
batch_requests = []

sampled_words_file = os.path.join(synonyms_dir, "sampled_words.txt")
with open(sampled_words_file, "r") as file:
    sampled_words = [line.strip() for line in file.readlines()]

for word in sampled_words:
    batch_requests.append({
        "custom_id": word,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini-2024-07-18",
            "messages": [
                {
                    "role": "system", 
                    "content": "You are a a machine that simply functions as a thesaurus. You will be given a word and you will return a map from that word to a list of synonyms. You should include as many or as few synonyms as match the word. (E.g. given 'help' you might return {'help': ['assist', 'aid', 'support']})"
                },
                {
                    "role": "user",
                    "content": word
                }
            ],
            "max_tokens": 300,
            "temperature": 0.7
        }
    })

with open(os.path.join(synonyms_dir, "batch_gpt_req.jsonl"), "w") as f:
    for obj in batch_requests:
        f.write(json.dumps(obj) + '\n')
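Before uploading, it can be worth checking that the request file is well formed. The sketch below is a hypothetical validator: it confirms that each line parses as a JSON object, carries the four fields written above, and uses a `custom_id` that has not appeared before. The uniqueness check reflects my understanding of the Batch API's requirements and should be treated as an assumption.

```python
import json

REQUIRED_FIELDS = ("custom_id", "method", "url", "body")

def validate_batch_file(path):
    """Return (request_count, problems) for a batch request .jsonl file."""
    seen_ids = set()
    problems = []
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                request = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append(f"line {line_no}: invalid JSON ({err.msg})")
                continue
            if not isinstance(request, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            count += 1
            missing = [key for key in REQUIRED_FIELDS if key not in request]
            if missing:
                problems.append(f"line {line_no}: missing {missing}")
            custom_id = request.get("custom_id")
            if custom_id in seen_ids:
                problems.append(f"line {line_no}: duplicate custom_id {custom_id!r}")
            seen_ids.add(custom_id)
    return count, problems
```

For the file written above, `validate_batch_file(os.path.join(synonyms_dir, "batch_gpt_req.jsonl"))` should report one request per sampled word and an empty problem list.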

##### 4.1.2 Submit Batch Request

Once all the requests are batched and placed in `{synonyms_dir}/batch_gpt_req.jsonl`, the file can be submitted to the OpenAI API. This code submits it and prints the created batch object. Save the id from the output object for future steps and to check the status of your batch. If you lose track of your batch, you should be able to find all of your running batches [here](https://platform.openai.com/batches). Your batch should finish within 24 hours but often finishes much faster. Check the status of your batch using the id in the following step.

In [11]:
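load_dotenv()  # loads OPENAI_API_KEY into the environment

# OpenAI() reads OPENAI_API_KEY from the environment on its own.
client = OpenAI()

# Upload the request file, closing the handle once the upload finishes.
with open(os.path.join(synonyms_dir, "batch_gpt_req.jsonl"), "rb") as batch_file:
    batch_input_file = client.files.create(
        file=batch_file,
        purpose="batch"
    )

batch_input_file_id = batch_input_file.id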

job_info = client.batches.create(
  input_file_id=batch_input_file_id,
  endpoint="/v1/chat/completions",
  completion_window="24h",
  metadata={
    "description": "generate synonyms for sampled words"
  }
)

print(job_info)

Batch(id='batch_680bdd66ab8c819099ddd3f73a092bb2', completion_window='24h', created_at=1745608038, endpoint='/v1/chat/completions', input_file_id='file-DUTS3EJ1WDi4Rs5WexzAJA', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1745694438, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'generate synonyms for sampled words'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))


##### 4.1.3 Check Batch Status

Use this code block to check the status of your batch. Once it is completed, you can continue with further analysis. Make sure to set your batch ID in `BATCH_ID` using the output of the previous cell.

In [15]:
# copy from above
BATCH_ID = "batch_680bdd66ab8c819099ddd3f73a092bb2"

client = OpenAI()

print("Job Status:", client.batches.retrieve(BATCH_ID).status)

Job Status: expired
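Rather than re-running the status cell by hand, you could poll until the batch reaches a terminal state. The sketch below is one way to do that. The set of terminal statuses is my reading of the Batch API's status values and should be treated as an assumption, and `wait_for_batch` is a hypothetical helper that takes the status lookup as a callable so it can be tried without network access.

```python
import time

# Statuses after which the batch will not change again (assumed list).
TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(get_status, poll_seconds=60, max_polls=1440, sleep=time.sleep):
    """Call get_status() until it returns a terminal status, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)  # wait before asking again
    raise TimeoutError("batch did not reach a terminal status in time")
```

With the client from the cell above, a call might look like `wait_for_batch(lambda: client.batches.retrieve(BATCH_ID).status)`. Only a `completed` result means the next step can proceed as written; a status such as `expired`, as in the recorded output above, means the batch did not finish inside its completion window.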


##### 4.1.4 Get and Decode GPT Responses

Once your job status shows `completed`, you can run this step. First, run 4.1.3 to set `BATCH_ID`. This step gets the output file from the OpenAI API and processes each response. The code attempts to decode a variety of possible formats that the GPT model might output, but cannot always decode every response. All of the successfully decoded synonyms are placed in a JSON map that is saved in `{synonyms_dir}/cleaned_gpt_synonyms.json` and any responses that could not be decoded are placed into `{synonyms_dir}/failed_gpt_synonyms.json`. Before performing any analysis (like the analysis in the following Jupyter notebook) **you must manually look at the failed outputs and add the represented data to `{synonyms_dir}/cleaned_gpt_synonyms.json`**.

In [20]:
client = OpenAI()

file_response = client.files.content(client.batches.retrieve(BATCH_ID).output_file_id)

gpt_results = {}
failed_responses = {}
for line in file_response.text.split('\n'):
    if line:
        response = json.loads(line)
        word = response["custom_id"]
        synonyms = response["response"]["body"]["choices"][0]["message"]["content"]
        try:
            json_str = re.search(r'\{.*\}', synonyms, re.DOTALL).group(0).strip()
            json_str = re.sub(r'(?<!\w)\'(.*?)\'(?!\w)', r'"\1"', json_str)
            json_obj = json.loads(json_str)
            gpt_results.update(json_obj)
        except Exception:  # no braces found, invalid JSON, or a non-map result
            print(f"Error decoding for word '{word}'")
            failed_responses[word] = synonyms

output_file = os.path.join(synonyms_dir, "cleaned_gpt_synonyms.json")
with open(output_file, "w") as file:
    json.dump(gpt_results, file, indent=4)

print(f"Synonyms map saved to {output_file}")

with open(os.path.join(synonyms_dir, "failed_gpt_synonyms.json"), "w") as file:
    json.dump(failed_responses, file, indent=4)

print(f"Failed responses saved to {os.path.join(synonyms_dir, 'failed_gpt_synonyms.json')}")

Error decoding for word 'solvent'
Error decoding for word 'sprayer'
Error decoding for word 'edification'
Error decoding for word 'open'
Error decoding for word 'monument'
Error decoding for word 'analgesia'
Error decoding for word 'duo'
Error decoding for word 'skim'
Synonyms map saved to ../synonyms\cleaned_gpt_synonyms.json
Failed responses saved to ../synonyms\failed_gpt_synonyms.json
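The decoding step does two things: it pulls the outermost `{...}` span out of the reply, then rewrites single-quoted strings as double-quoted ones so that `json.loads` will accept a Python-style dict. The sketch below wraps the same two regular expressions in a hypothetical function, `decode_synonym_map`, that returns `None` instead of raising, which makes the failure cases easy to try out.

```python
import json
import re

def decode_synonym_map(text):
    """Return the dict embedded in a model reply, or None if it cannot be decoded."""
    match = re.search(r"\{.*\}", text, re.DOTALL)  # greedy: first '{' to last '}'
    if match is None:
        return None
    # Turn 'word' into "word" while leaving apostrophes inside words alone.
    json_str = re.sub(r"(?<!\w)'(.*?)'(?!\w)", r'"\1"', match.group(0))
    try:
        obj = json.loads(json_str)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

print(decode_synonym_map("{'help': ['assist', 'aid', 'support']}"))
print(decode_synonym_map("{'a': ['x']} and {'b': ['y']}"))
```

The second call returns `None` because the greedy match spans both maps and the text between them is not valid JSON. Replies of that shape are one kind of response that ends up in the failed file.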


#### 4.2 DeepSeek Generation

For DeepSeek generated synonyms, we submit a request for each individual vocabulary item of interest. DeepSeek is an open source model that can be downloaded and run on its own, but due to system limitations on my computer and the affordability of using an API, I elected to use that option. For this exploration, I am using the LambdaAI API. In order to run this code segment, you need to have a LambdaAI API key set up and included in the environment. To include the key, place it in a file `/src/.env` for the environment to pick it up and connect to the API (see `.env-example` for an example). You can set up an API key [here](https://docs.lambda.ai/public-cloud/lambda-inference-api). Unlike OpenAI, you do not need to load credits and can instead set up direct billing and only be billed for your usage. This step cost approximately $0.05 to run.

##### 4.2.1 Set Up

Here, we load the necessary environment variables and all of the sampled words from the respective file in order to run the requests in the following steps.

In [None]:
load_dotenv()
api_key = os.getenv("LAMBDA_API_KEY")
api_base = "https://api.lambda.ai/v1"

client = OpenAI(
    api_key=api_key,
    base_url=api_base,
)

model = "deepseek-v3-0324"

sampled_words_file = os.path.join(synonyms_dir, "sampled_words.txt")
with open(sampled_words_file, "r") as file:
    sampled_words = [line.strip() for line in file.readlines()]

##### 4.2.2 Submitting Requests and Storing Decoded Results

In this step, a request is submitted for each target word and an attempt is made to decode the response. The response from the API cannot always be decoded; any failures must be resolved manually later.

In [72]:
deepseek_map = {}
failed_responses = {}

for word in tqdm(sampled_words):
    chat_completion = client.chat.completions.create(
        messages=[{
            "role": "system",
            "content": "You are a a machine that simply functions as a thesaurus. You will be given a word and you will return a map from that word to a list of synonyms. You should include as many or as few synonyms as match the word. (E.g. given 'help' you might return {'help': ['assist', 'aid', 'support']})"
        }, {
            "role": "user",
            "content": word
        }],
        model=model,
    )

    try:
        json_str = re.search(r'\{.*\}', chat_completion.choices[0].message.content, re.DOTALL).group(0).strip()
        json_str = re.sub(r'(?<!\w)\'(.*?)\'(?!\w)', r'"\1"', json_str)
        json_obj = json.loads(json_str)
        deepseek_map.update(json_obj)
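    # AttributeError covers replies with no {...} span, where re.search returns None.
    except (AttributeError, json.JSONDecodeError) as e: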
        print(f"Error decoding JSON for word '{word}': {e}")
        failed_responses[word] = chat_completion.choices[0].message.content


 30%|██▉       | 299/1000 [10:34<39:13,  3.36s/it]  

Error decoding JSON for word 'casson': Extra data: line 1 column 57 (char 56)


 46%|████▌     | 456/1000 [16:08<40:47,  4.50s/it]

Error decoding JSON for word 'nobs': Extra data: line 4 column 3 (char 100)


 77%|███████▋  | 767/1000 [27:19<08:23,  2.16s/it]

Error decoding JSON for word 'altern': Expecting property name enclosed in double quotes: line 3 column 1 (char 82)


 97%|█████████▋| 972/1000 [34:44<00:53,  1.91s/it]

Error decoding JSON for word 'sabbath': Invalid \escape: line 1 column 64 (char 63)


100%|██████████| 1000/1000 [35:40<00:00,  2.14s/it]


##### 4.2.3 Saving Results

All of the successfully decoded synonyms are placed in a JSON map that is saved in `{synonyms_dir}/cleaned_deepseek_synonyms.json` and any responses that could not be decoded are placed into `{synonyms_dir}/failed_deepseek_responses.json`. Before performing any analysis (like the analysis in the following Jupyter notebook) **you must manually look at the failed outputs and add the represented data to `{synonyms_dir}/cleaned_deepseek_synonyms.json`**.

In [73]:
deepseek_synonyms_file = os.path.join(synonyms_dir, "cleaned_deepseek_synonyms.json")
with open(deepseek_synonyms_file, "w") as file:
    json.dump(deepseek_map, file, indent=4)

print(f"Deepseek synonyms map saved to {deepseek_synonyms_file}")

failed_responses_file = os.path.join(synonyms_dir, "failed_deepseek_responses.json")
with open(failed_responses_file, "w") as file:
    json.dump(failed_responses, file, indent=4)

print(f"Failed responses saved to {failed_responses_file}")

Deepseek synonyms map saved to ../synonyms\cleaned_deepseek_synonyms.json
Failed responses saved to ../synonyms\failed_deepseek_responses.json
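Once you have read the failed responses and worked out the intended synonyms by hand, they need to be written back into the cleaned file. A minimal sketch of that merge step follows; `merge_manual_fixes` is a hypothetical helper, and the `fixes` dictionary is something you would fill in yourself from the failed responses file.

```python
import json

def merge_manual_fixes(cleaned_path, fixes):
    """Add hand-corrected entries to a cleaned synonyms file and return the result."""
    with open(cleaned_path, "r", encoding="utf-8") as f:
        cleaned = json.load(f)
    for word, synonyms in fixes.items():
        cleaned[word] = list(synonyms)  # overwrite or add the corrected entry
    with open(cleaned_path, "w", encoding="utf-8") as f:
        json.dump(cleaned, f, indent=4)
    return cleaned
```

A call would pass the path to `cleaned_deepseek_synonyms.json` and a map of the form `{"word": ["synonym", ...]}`. The same helper would work for the GPT and Llama files.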


#### 4.3 Llama Synonyms

For Llama generated synonyms, we submit a request for each individual vocabulary item of interest. Llama is an open source model that can be downloaded and run on its own, but due to system limitations on my computer and the affordability of using an API, I elected to use that option. For this exploration, I am using the LambdaAI API. In order to run this code segment, you need to have a LambdaAI API key set up and included in the environment. To include the key, place it in a file `/src/.env` for the environment to pick it up and connect to the API (see `.env-example` for an example). You can set up an API key [here](https://docs.lambda.ai/public-cloud/lambda-inference-api). Unlike OpenAI, you do not need to load credits and can instead set up direct billing and only be billed for your usage. This step cost approximately $0.02 to run.

##### 4.3.1 Set Up

Here, we load the necessary environment variables and all of the sampled words from the respective file in order to run the requests in the following steps.

In [8]:
load_dotenv()
api_key = os.getenv("LAMBDA_API_KEY")
api_base = "https://api.lambda.ai/v1"

client = OpenAI(
    api_key=api_key,
    base_url=api_base,
)

model = "llama3.3-70b-instruct-fp8"

sampled_words_file = os.path.join(synonyms_dir, "sampled_words.txt")
with open(sampled_words_file, "r") as file:
    sampled_words = [line.strip() for line in file.readlines()]

##### 4.3.2 Submitting Requests and Storing Decoded Results

In this step, a request is submitted for each target word and an attempt is made to decode the response. The response from the API cannot always be decoded; any failures must be resolved manually later.

In [13]:
llama_map = {}
failed_responses = {}

for word in tqdm(sampled_words):
    chat_completion = client.chat.completions.create(
        messages=[{
            "role": "system",
            "content": "You are a a machine that simply functions as a thesaurus. You will be given a word and you will return a map from that word to a list of synonyms. You should include as many or as few synonyms as match the word. (E.g. given 'help' you might return {'help': ['assist', 'aid', 'support']})"
        }, {
            "role": "user",
            "content": word
        }],
        model=model,
    )

    try:
        json_str = re.search(r'\{.*\}', chat_completion.choices[0].message.content, re.DOTALL).group(0).strip()
        json_str = re.sub(r'(?<!\w)\'(.*?)\'(?!\w)', r'"\1"', json_str)
        json_obj = json.loads(json_str)
        llama_map.update(json_obj)
    except Exception:  # no braces found, invalid JSON, or a non-map result
        print(f"Error decoding for word '{word}'")
        failed_responses[word] = chat_completion.choices[0].message.content


  5%|▍         | 47/1000 [00:49<32:45,  2.06s/it]

Error decoding for word 'lucy'


 11%|█         | 112/1000 [01:51<22:08,  1.50s/it]

Error decoding for word 'gazi'


 35%|███▌      | 352/1000 [05:35<25:59,  2.41s/it]

Error decoding for word 'faraday'


 58%|█████▊    | 580/1000 [09:10<05:52,  1.19it/s]

Error decoding for word 'gynecologic'


 61%|██████    | 612/1000 [09:39<06:05,  1.06it/s]

Error decoding for word 'autonomously'


 72%|███████▏  | 720/1000 [11:23<05:21,  1.15s/it]

Error decoding for word 'mao'


100%|██████████| 1000/1000 [15:33<00:00,  1.07it/s]


##### 4.3.3 Saving Results

All of the successfully decoded synonyms are placed in a JSON map that is saved in `{synonyms_dir}/cleaned_llama_synonyms.json` and any responses that could not be decoded are placed into `{synonyms_dir}/failed_llama_responses.json`. Before performing any analysis (like the analysis in the following Jupyter notebook) **you must manually look at the failed outputs and add the represented data to `{synonyms_dir}/cleaned_llama_synonyms.json`**.

In [14]:
llama_synonyms_file = os.path.join(synonyms_dir, "cleaned_llama_synonyms.json")
with open(llama_synonyms_file, "w") as file:
    json.dump(llama_map, file, indent=4)

print(f"Llama synonyms map saved to {llama_synonyms_file}")

failed_responses_file = os.path.join(synonyms_dir, "failed_llama_responses.json")
with open(failed_responses_file, "w") as file:
    json.dump(failed_responses, file, indent=4)

print(f"Failed responses saved to {failed_responses_file}")

Llama synonyms map saved to ../synonyms\cleaned_llama_synonyms.json
Failed responses saved to ../synonyms\failed_llama_responses.json
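Because decoded replies are merged with `update(json_obj)`, the keys in each cleaned map come from the model's reply rather than from the requested word. A reply keyed by a different spelling or casing would therefore leave the requested word without an entry. The sketch below is a hypothetical coverage check that could be run on any of the cleaned maps before analysis; both function names are assumptions.

```python
def find_missing_words(sampled_words, synonym_map):
    """Return sampled words that have no entry in the synonym map."""
    return [word for word in sampled_words if word not in synonym_map]

def find_unexpected_keys(sampled_words, synonym_map):
    """Return keys in the synonym map that were never requested."""
    requested = set(sampled_words)
    return sorted(key for key in synonym_map if key not in requested)

# Invented example: the reply for 'help' came back keyed as 'Help'.
words_requested = ["help", "duo"]
decoded = {"Help": ["assist"], "duo": ["pair"]}
print(find_missing_words(words_requested, decoded))
print(find_unexpected_keys(words_requested, decoded))
```

Words reported as missing are either in the failed responses file or were stored under an unexpected key, so checking both lists together shows which entries need manual attention.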
