# Generating Text Embeddings Using Sentence-Transformers
This notebook was used to create text embeddings for the paper abstracts in the High Energy Physics Citation Network dataset. Run the cells below in order and follow the written instructions.

## Load the Data
Set the `ABSTRACT_FILE` variable to the location where you saved the cleaned abstract text data from the **"CleanDataset"** notebook.

In [None]:
ABSTRACT_FILE = "abs_text.txt"
NUMBER_OF_PAPERS = 29555
id_to_text = {}
with open(ABSTRACT_FILE, 'r') as f:
    # Each line holds a paper ID and its abstract, separated by a tab.
    for line_ in f:
        id_, text_ = line_.split('\t', 1)
        id_to_text[id_] = text_.strip()
# Sanity check: the number of abstracts loaded should match the paper count.
print(len(id_to_text) == NUMBER_OF_PAPERS)

True
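
The loading cell above assumes that each line of the abstract file holds a paper ID and its abstract, separated by a tab. As a minimal sketch of that parsing step on in-memory data, the following may help when checking the file format before running the full notebook. The helper name `parse_abstract_lines` and the sample lines are hypothetical and not part of the original notebook.

```python
def parse_abstract_lines(lines):
    """Map paper ID -> abstract text from tab-separated lines."""
    id_to_text = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only; any later tabs stay in the abstract.
        paper_id, text = line.split("\t", 1)
        id_to_text[paper_id] = text.strip()
    return id_to_text


# Hypothetical sample lines, for illustration only.
sample = ["9201001\tFirst abstract text.\n", "9201002\tSecond\tabstract text.\n"]
parsed = parse_abstract_lines(sample)
```

A line with no tab at all would raise a `ValueError` here, which is probably the desired behavior for a malformed file.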


## Load the Model

In [None]:
from sentence_transformers import SentenceTransformer
# Load the pretrained all-MiniLM-L6-v2 model (downloaded on first use).
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

## Generate Embeddings Using Sentence-Transformers

In [None]:
from tqdm.notebook import tqdm
id_to_embedding = {}
# Encode each abstract individually and store its embedding under the paper ID.
for id_, text_ in tqdm(id_to_text.items()):
    id_to_embedding[id_] = model.encode(text_)


In [None]:
#id_to_embedding
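
The generation cell encodes one abstract per call. A possible alternative, sketched here under the assumption that the encoder accepts a list of texts and returns one embedding per text in the same order, is to encode in batches. The helper `embed_in_batches` and the stand-in encoder `fake_encode` are hypothetical names introduced only for illustration.

```python
def embed_in_batches(id_to_text, encode_fn, batch_size=64):
    """Return {id: embedding}, calling encode_fn on lists of texts."""
    ids = list(id_to_text)
    id_to_embedding = {}
    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        batch_texts = [id_to_text[i] for i in batch_ids]
        # encode_fn must return one embedding per input text, in order.
        batch_embeddings = encode_fn(batch_texts)
        for id_, embedding in zip(batch_ids, batch_embeddings):
            id_to_embedding[id_] = embedding
    return id_to_embedding


# Stand-in encoder for illustration: maps each text to [its length].
def fake_encode(texts):
    return [[len(t)] for t in texts]


demo = embed_in_batches({"a": "xx", "b": "yyy", "c": "z"}, fake_encode, batch_size=2)
```

With the notebook's model, the call would look like `embed_in_batches(id_to_text, model.encode)`. Since `SentenceTransformer.encode` also accepts a list of sentences and batches them internally through its `batch_size` argument, passing the full list of abstracts to it directly may be simpler still.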

Save the text embeddings to a pickle (`.pkl`) file using the cell below.

In [None]:
import pickle
with open('sentence_transformers_embeddings.pkl', 'wb') as f:
    pickle.dump(id_to_embedding, f)
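
To use the saved embeddings later, the file can be read back with `pickle.load`. The following is a minimal round-trip sketch on a small stand-in dictionary written to a temporary directory. The names `demo_embeddings` and `demo_embeddings.pkl` are hypothetical, and the real values in `id_to_embedding` are NumPy arrays rather than plain lists.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for id_to_embedding, for illustration only.
demo_embeddings = {"9201001": [0.1, 0.2], "9201002": [0.3, 0.4]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_embeddings.pkl")
    # Write the dictionary in binary mode, as in the cell above.
    with open(path, "wb") as f:
        pickle.dump(demo_embeddings, f)
    # Read it back in binary mode.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

round_trip_ok = loaded == demo_embeddings
```

For the notebook's own output, the same pattern would apply with `'sentence_transformers_embeddings.pkl'` as the path. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.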