# Generating Embeddings for a Dictionary
Given an Excel file of dictionary names and their phrases, this notebook converts each dictionary into embeddings and caches them for downstream inference tasks.

## 0: Installing Dependencies and Imports
Installs the necessary dependencies, imports the required modules, and sets up the configuration variables (model name and file names).

In [1]:
%pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [2]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
import pandas as pd
import os
import pickle

In [None]:
MODEL_NAME = "yiyanghkust/finbert-pretrain"

In the following cell, set `DICTIONARY_SRC` to the name of the dictionary file whose embeddings you wish to generate, without the `.xlsx` extension. Set `TARGET_FILE_NAME` to the desired name of the embedding file, without the `.pkl` extension.

In [None]:
DICTIONARY_SRC = "Dictionaries_v5"
TARGET_FILE_NAME = "DictEmbeddings_v5_all-fibert"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

## 1: Formatting and Pre-processing Data
Given an Excel file in the expected format, the following cell cleans up the data and converts it into a format compatible with embedding generation.

In [None]:
PATH_TO_FILE = os.path.join(os.getcwd(), f"{DICTIONARY_SRC}.xlsx")

df = pd.read_excel(PATH_TO_FILE)

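# Cleaning up the Excel file
df.columns = df.iloc[0]                   # the first row holds the real column headers
df = df[1:]                               # drop that header row from the data
df = df.reset_index(drop=True)
df = df.drop(columns="Dictionary Name")   # remove the label column, keeping one column per dictionary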

dictionaryNames = df.columns.to_list()
dictionaryDataCleaned = {}

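for dictionary in dictionaryNames:
    # Drop empty cells, then join all phrases into a single string,
    # so each dictionary is embedded as one text
    phrases = df[dictionary].dropna().to_list()
    dictionaryDataCleaned[dictionary] = [' '.join(phrases)]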


# Preview options
# df
# dictionaryDataCleaned['Negative Prescriptions']
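
The cleanup steps above can be tried on a small in-memory table without an Excel file. The sketch below is only an illustration: the sheet layout, the dictionary names (`Positive`, `Negative`), and the phrases are assumptions made for the example, not the contents of the real file.

```python
import pandas as pd

# Hypothetical raw sheet, shaped the way the cleanup code expects it:
# the real headers sit in the first data row, and a "Dictionary Name"
# label column comes first. All values here are made up.
raw = pd.DataFrame(
    [
        ["Dictionary Name", "Positive", "Negative"],
        ["Phrases", "strong growth", "weak demand"],
        [None, "record profit", None],
    ]
)

df = raw.copy()
df.columns = df.iloc[0]                  # promote the first row to headers
df = df[1:].reset_index(drop=True)       # drop the header row from the data
df = df.drop(columns="Dictionary Name")  # keep one column per dictionary

# Same transformation as the notebook: one joined string per dictionary
cleaned = {
    name: [" ".join(df[name].dropna().to_list())]
    for name in df.columns.to_list()
}
print(cleaned)
```

Each value is a one-element list, so every dictionary is later embedded as a single text.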

## 2: Generating Embeddings for a Single Dictionary
The following is the core function used to generate embeddings for a dictionary. It takes the list of phrases in the dictionary, along with the model and tokenizer loaded above, and returns a NumPy array of embeddings, which is later cached under the dictionary's name.

In [None]:
def generateDictionaryEmbeddings(phrases: list[str], model: AutoModel, tokenizer: AutoTokenizer) -> np.ndarray:
    """Generates the embeddings for each phrase in a single dictionary.

    Given a list of phrases, this function tokenizes them, runs the model,
    and returns one mean-pooled embedding per phrase as a NumPy array of
    shape (len(phrases), hidden_size).
    """
    # Setting up object to capture tokenized results
    tokens = {'input_ids': [], 'attention_mask': []}

    # Tokenizing phrases
    for phrase in phrases:
        new_tokens = tokenizer.encode_plus(phrase, max_length=128, truncation=True, padding='max_length',
                                           return_tensors='pt')
        tokens['input_ids'].append(new_tokens['input_ids'][0])
        tokens['attention_mask'].append(new_tokens['attention_mask'][0])

    # Post-tokenizing: stacking all the input_ids and attention_masks into one tensor
    tokens['input_ids'] = torch.stack(tokens['input_ids'])
    tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

    # Generating embeddings (no gradients are needed for inference)
    with torch.no_grad():
        outputs = model(**tokens)
    embeddings = outputs.last_hidden_state

    # Masking: zero out padding positions so they do not affect the mean
    attention = tokens['attention_mask']
    mask = attention.unsqueeze(-1).expand(embeddings.shape).float()
    mask_embeddings = embeddings * mask

    # Generating mean-pooled values
    summed = torch.sum(mask_embeddings, 1)
    counts = torch.clamp(mask.sum(1), min=1e-9)
    mean_pooled = summed / counts
    mean_pooled = mean_pooled.detach().numpy()

    return mean_pooled    
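
The masked mean-pooling step inside the function can be checked by hand on a tiny example. The sketch below swaps in NumPy for PyTorch and uses made-up token embeddings, so it only illustrates the arithmetic and is not the function itself.

```python
import numpy as np

# Toy stand-ins (assumed values): 1 sequence, 3 token positions, hidden size 2.
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])  # the last position is padding

# Expand the mask to the embedding shape, as unsqueeze(-1).expand(...) does
mask = np.broadcast_to(
    attention_mask[..., None].astype(float), token_embeddings.shape
)

summed = (token_embeddings * mask).sum(axis=1)   # sum over real tokens only
counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
mean_pooled = summed / counts

print(mean_pooled)  # [[2. 3.]]
```

The padding vector `[100, 100]` has no effect on the result, which is the reason for multiplying by the mask before averaging.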

## 3: Caching Results for Downstream Tasks
This cell generates the embeddings for each dictionary and saves them to a JSON file, which is used during the `cosine_similarity` stage. As an alternative, it also serializes the same mapping with `pickle`.

In [None]:
# Creating a dictionary which maps a dictionary name to its embeddings
embeddings = {}

for dictionaryName, phrases in dictionaryDataCleaned.items():
    embeddings[dictionaryName] = generateDictionaryEmbeddings(phrases, model, tokenizer)

# Converting the dictionary to JSON (one record per dictionary name)
dfOut = pd.DataFrame(list(embeddings.items()), columns=["dictionaryName", "embedding"])
dfOut.to_json("dictionaryEmbeddings.json", orient="records")

# Alternate solution: using pickle.dump for serializing
with open(f"{TARGET_FILE_NAME}.pkl", "wb") as outFile:
    pickle.dump(embeddings, outFile)
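
A downstream step might load the pickled mapping and compare a query embedding against each dictionary. The following is a minimal sketch under stated assumptions: the helper `cosine_similarity`, the file name `toy_embeddings.pkl`, and all vectors are hypothetical and chosen only for illustration, and the real `cosine_similarity` stage may work differently.

```python
import os
import pickle
import tempfile

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors (hypothetical helper)."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up cache with the same structure as `embeddings`:
# dictionary name -> array of shape (1, hidden_size)
toy = {
    "Positive": np.array([[1.0, 0.0]]),
    "Negative": np.array([[0.0, 1.0]]),
}
path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.pkl")
with open(path, "wb") as outFile:
    pickle.dump(toy, outFile)

# Loading the cache back, as a downstream notebook would
with open(path, "rb") as inFile:
    cached = pickle.load(inFile)

query = np.array([0.9, 0.1])  # stand-in for an embedded input text
scores = {name: cosine_similarity(query, emb) for name, emb in cached.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.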