# Generating Embeddings for a Dataset of Advertisements
Given a dataset of advertisements and their ad texts, this notebook converts each advertisement into embeddings and caches them for downstream inference tasks.

## 0: Installing dependencies and Imports
This section installs the necessary dependencies, performs the required imports, and defines the parameters used throughout the notebook.

In [None]:
%pip install transformers

In [None]:
import sys
sys.path.append('..')

In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import numpy as np
import pandas as pd
import pickle
from Scripts.adsApi import getAllAdTexts

### Defining parameters
`MODEL_NAME` refers to the Hugging Face model checkpoint loaded through `transformers` (here, a FinBERT checkpoint).
`IN_FILE` refers to the file name of the advertisement dataset.
`OUT_FILE` refers to the file name (without extension) that the ad embeddings are saved to.

In [None]:
MODEL_NAME = "yiyanghkust/finbert-pretrain"
IN_FILE = "dataset_nov23"
OUT_FILE = "AdEmbeddings_nov23_finbert"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

## 1: Formatting and pre-processing data
Given the dataset of advertisements and attributes such as ad text, the following cell cleans up the data and extracts the text of each advertisement, keyed by its identifier.

In [None]:
ads = getAllAdTexts(IN_FILE)

## 2: Generating Embeddings for a Single Advertisement
The following is the core function used to generate embeddings for an advertisement. It takes the `adText` (the advertisement's text as a single string, which is split into sentences) together with the model and tokenizer, and returns a NumPy array holding one embedding per sentence. The result is later cached under the advertisement's identifier.

In [None]:
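def generateAdEmbeddings(adText: str, model: AutoModelForMaskedLM, tokenizer: AutoTokenizer) -> np.ndarray:
    """Generates one embedding per sentence for a single advertisement.

    The ad text is split into sentences, each sentence is tokenized and passed
    through the model, and the outputs are mean-pooled over non-padding tokens.
    Returns the sentinel (-1, -1) if the ad text yields no sentences.
    """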
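    # Pre-processing: split on periods and drop empty fragments.
    # A trailing fragment without a final period is kept rather than discarded.
    sentences = [sentence.strip() for sentence in adText.split(".") if sentence.strip()]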

    # Setting up object to capture tokenized results
    tokens = {'input_ids': [], 'attention_mask': []}

    # Tokenizing phrases
    for sentence in sentences:
        new_tokens = tokenizer.encode_plus(sentence, max_length=128, truncation=True, padding='max_length', 
                                        return_tensors='pt')
        tokens['input_ids'].append(new_tokens['input_ids'][0])
        tokens['attention_mask'].append(new_tokens['attention_mask'][0])

    # Ads that yield no sentences cannot be embedded; return a sentinel value
    if len(tokens['input_ids']) == 0:
        return (-1, -1)

    # Post-tokenizing: stacking all the input_ids and attention_masks into batched tensors
    tokens['input_ids'] = torch.stack(tokens['input_ids'])
    tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

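    # Generating embeddings (gradients are not needed for inference)
    with torch.no_grad():
        outputs = model(**tokens)
    # Note: with a masked-LM head, `logits` are per-token vocabulary scores,
    # so each pooled vector has vocabulary-size dimensions.
    embeddings = outputs.logits
    # Masking out padding positions so they do not contribute to the mean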
    attention = tokens['attention_mask']
    mask = attention.unsqueeze(-1).expand(embeddings.shape).float()
    mask_embeddings = embeddings * mask

    # Generating mean-pooled values
    summed = torch.sum(mask_embeddings, 1)
    counts = torch.clamp(mask.sum(1), min=1e-9)
    mean_pooled = summed / counts
    mean_pooled = mean_pooled.detach().numpy()

    return mean_pooled
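
The masked mean-pooling step can be illustrated in isolation. The sketch below is not part of the notebook's pipeline: it uses NumPy instead of `torch` and a hand-made toy input, and the helper name `masked_mean_pool` is hypothetical. It only aims to show how padding positions are excluded from the average.

```python
import numpy as np

def masked_mean_pool(token_vectors: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring positions where the mask is 0.

    Hypothetical helper for illustration; mirrors the unsqueeze/expand, sum and
    clamp steps of the function above, but with NumPy.
    """
    # Expand the mask from (batch, seq_len) to (batch, seq_len, 1) so it
    # broadcasts across the feature axis.
    mask = attention_mask[..., np.newaxis].astype(np.float32)
    # Zero out padded positions, then sum over the sequence axis.
    summed = (token_vectors * mask).sum(axis=1)
    # Number of real tokens per sentence, clamped to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy input: 2 sentences, 3 token positions, 2 features per token.
token_vectors = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
], dtype=np.float32)
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(token_vectors, attention_mask)
print(pooled)  # one row per sentence: [[2. 3.] [4. 2.]]
```

In the first sentence the padded vector `[100, 100]` has no effect on the result, which is the point of multiplying by the attention mask before summing.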

## 3: Caching Results for Downstream Tasks
The following cells generate the embeddings for each advertisement and serialize them to a `.pkl` file, which is used during the `cosine_similarity` stage. A `DataFrame` view of the results is also built, but these cells do not write it to disk.

In [None]:
# Creating a dictionary which maps each ad identifier to its embeddings
embeddings = {}

for identifier, adText in ads.items():
    embeddings[identifier] = generateAdEmbeddings(adText, model, tokenizer)
    print(identifier)

In [None]:
# Converting the dictionary to a DataFrame (not written to disk here)
dfOut = pd.DataFrame(embeddings.items(), columns=["identifier", "embedding"])
# Serializing the embeddings dictionary with pickle.dump
with open(f"{OUT_FILE}.pkl", "wb") as outFile:
    pickle.dump(embeddings, outFile)
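
As a rough sketch of how a downstream stage might consume the cached file, the example below round-trips a toy dictionary through `pickle`, filters out the `(-1, -1)` sentinel entries, and scores each sentence embedding against a query vector. The downstream code is not shown in this notebook, so the helper names (`is_valid_embedding`, `cosine_similarity`), the toy data and the query vector are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

def is_valid_embedding(value) -> bool:
    """True for a 2-D array of sentence embeddings, False for the (-1, -1) sentinel."""
    return isinstance(value, np.ndarray) and value.ndim == 2

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two 1-D vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Toy stand-in for the cached dictionary (identifier -> embeddings or sentinel).
toy_embeddings = {
    "ad_1": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "ad_2": (-1, -1),  # ad whose text produced no sentences
}

# Write and re-read the pickle, as a consumer of the cache would.
path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.pkl")
with open(path, "wb") as out_file:
    pickle.dump(toy_embeddings, out_file)
with open(path, "rb") as in_file:
    loaded = pickle.load(in_file)

# Keep only ads with real embeddings, then score every sentence against a query.
usable = {key: value for key, value in loaded.items() if is_valid_embedding(value)}
query = np.array([1.0, 0.0])  # hypothetical query embedding
scores = {key: [cosine_similarity(row, query) for row in value]
          for key, value in usable.items()}
print(scores)
```

Checking for the sentinel before any similarity computation avoids shape errors from entries that are tuples rather than arrays.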

In [None]:
# from transformers import file_utils
# print(file_utils.default_cache_path)