## Generating Embeddings for PubMed Papers and SBIR Abstracts

**Related to Paper Section: “Data Preprocessing (Embedding Creation)”**  

In this notebook (`pmid2emb.ipynb`), we convert text data (PubMed abstracts and SBIR abstracts) into **high-dimensional embedding vectors** using a pretrained text-embedding model:

1. **Data Loading**  
   - **PubMed**: Loads a CSV file (`out.csv`) into `pubmed_df`, each row containing a `pmid` and an `abstract`.  
   - **SBIR**: Loads a CSV file (`sbir_df.csv`) into `sbir_df`, each row containing an `sbid` and an `Abstract`.  
   - We prepend the prefix `"passage: "` to each abstract to match the input format the model expects.

2. **Model and Tokenizer Initialization**  
   - Uses the `intfloat/e5-large-v2` model from Hugging Face Transformers.  
   - Moves the model to the GPU, wraps it in `nn.DataParallel` to use all available GPUs, and defines a helper function `average_pool(...)` that mean-pools the last hidden states while masking out padding tokens.

3. **Batch-wise Inference**  
   - Splits the data into chunks of size `th` (e.g., 2,000).  
   - Tokenizes each chunk’s texts (max length 512, padding, truncation).  
   - Forwards them through the model to obtain last hidden states.  
   - Applies `average_pool` to produce a single vector per abstract.

4. **Dictionary of Embeddings**  
   - Saves each abstract’s vector in a dictionary keyed by `pmid` (`pubmed_embeddings_dict.pkl`) or `sbid` (`sbir_embeddings_dict.pkl`).  
   - Checkpoints the dictionary to disk every 100,000 rows, so that an interrupted run does not lose all of the embeddings computed so far.

5. **Output**  
   - By the end, we have two serialized dictionaries:
     1. **PubMed**: `pmid → [vector of size 1024]`  
     2. **SBIR**: `sbid → [vector of size 1024]`  

These embeddings subsequently enable **dimensionality reduction**, **semantic distance** calculations, and other downstream analyses in our research pipeline.
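
As a minimal sketch of the semantic-distance step, assuming the pickled dictionaries map IDs to 1-D NumPy arrays as described above, the cosine distance between a PubMed embedding and an SBIR embedding could be computed as follows. The helper name `cosine_distance` and the toy 4-dimensional vectors are illustrative only and do not come from the notebook.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Return 1 - cosine similarity of two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return float(1.0 - np.dot(a, b) / denom)

# Toy stand-ins for entries of the PubMed and SBIR embedding dictionaries.
pubmed_embeddings = {28867731: np.array([1.0, 0.0, 0.0, 0.0])}
sbir_embeddings = {"S000001": np.array([1.0, 1.0, 0.0, 0.0])}

d = cosine_distance(pubmed_embeddings[28867731], sbir_embeddings["S000001"])
print(round(d, 4))
```

With real embeddings the same call works unchanged on the 1024-dimensional vectors; only the dictionary contents differ.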

In [1]:
from tqdm import tqdm
import pickle
import pandas as pd
import numpy as np
import torch
import requests
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import plotly.express as px
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModel

In [6]:
PICKLES_ADDRESS =  '../../data/pickles/'
SBIR_ADDRESS = '../../data/sbir_dataset/'
PUBMED_ADDRESS = '../../data/pubmed_dataset/'

In [7]:
import psutil

def available_memory():
    available_ram = psutil.virtual_memory().available
    print(f"Available RAM: {available_ram / 1024 ** 3:.2f} GB")

available_memory()

Available RAM: 303.45 GB


In [8]:
def gpu_info():
    device = torch.device('cuda:0')
    torch.cuda.set_device(device)
    print('GPUs available:', torch.cuda.device_count())
    print('current GPU number: ', torch.cuda.current_device())
    print('GPU name: ', torch.cuda.get_device_name(device))
    print('GPU capability: ', torch.cuda.get_device_capability(device))
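    print('GPU memory: ', torch.cuda.get_device_properties(device).total_memory)  # total bytes on the device
    print('GPU memory allocated: ', torch.cuda.memory_allocated(device))
    # "cached" and "reserved" both report the caching allocator's pool (memory_reserved).
    print('GPU memory cached: ', torch.cuda.memory_reserved(device))
    print('GPU memory reserved: ', torch.cuda.memory_reserved(device))
    # Reserved-but-unallocated bytes inside PyTorch's pool, not the free memory on the device.
    print('GPU memory free: ', torch.cuda.memory_reserved(device) - torch.cuda.memory_allocated(device))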


def ram_info():
    !free -h

gpu_info()  
ram_info()

GPUs available: 3
current GPU number:  0
GPU name:  NVIDIA RTX 6000 Ada Generation
GPU capability:  (8, 9)
GPU memory:  51010207744
GPU memory allocated:  0
GPU memory cached:  0
GPU memory reserved:  0
GPU memory free:  0
               total        used        free      shared  buff/cache   available
Mem:           503Gi       192Gi       222Gi       332Mi        89Gi       303Gi
Swap:          1.0Ti       223Mi       1.0Ti


In [9]:
# Load sbir_df and add the "passage: " prefix to each abstract
sbir_df = pd.read_csv(SBIR_ADDRESS + 'sbir_df.csv', engine='python')
sbir_df['Abstract'] = 'passage: ' + sbir_df['Abstract'].astype(str)
sbir_df = sbir_df[['sbid', 'Abstract']]
sbir_df.head()

|   | sbid | Abstract |
|---|------|----------|
| 0 | S000001 | passage: In the last decade there has been an ... |
| 1 | S000003 | passage: 1109 BRAVO’s Neuropak provides an eff... |
| 2 | S000005 | passage: Glaucoma, a disease that damages your... |
| 3 | S000006 | passage: 2X4Lab will investigate the applicabi... |
| 4 | S000009 | passage: Technology is changing at a rapid pac... |


In [10]:
# Load pubmed_df and add the "passage: " prefix to each abstract
pubmed_df = pd.read_csv(PUBMED_ADDRESS + 'out.csv', engine='python')
pubmed_df['abstract'] = 'passage: ' + pubmed_df['abstract'].astype(str)
pubmed_df = pubmed_df[['pmid', 'abstract']]
pubmed_df.head()

|   | pmid | abstract |
|---|------|----------|
| 0 | 28867731 | passage: Epithelial-to-mesenchymal transition ... |
| 1 | 23611783 | passage: In plants, flavonoids have been shown... |
| 2 | 31440131 | passage: a priori In recent years, digital com... |
| 3 | 30551143 | passage: Toxin-antitoxin (TA) systems are ubiq... |
| 4 | 30229639 | passage: nan |
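
The last row of the preview shows `passage: nan`: `astype(str)` turns a missing abstract into the literal string `nan`, which is then embedded like any other text. One possible way to handle this, sketched here under the assumption that rows without an abstract can simply be dropped (the notebook itself keeps them), is to filter before adding the prefix. The helper `add_passage_prefix` and the toy frame are hypothetical.

```python
import pandas as pd

def add_passage_prefix(df: pd.DataFrame, text_col: str) -> pd.DataFrame:
    """Drop rows whose text is missing or blank, then prepend 'passage: '."""
    text = df[text_col]
    keep = text.notna() & (text.astype(str).str.strip() != "")
    out = df.loc[keep].copy()
    out[text_col] = "passage: " + out[text_col].astype(str)
    return out.reset_index(drop=True)

# Toy frame: one real abstract, one missing value, one whitespace-only string.
toy = pd.DataFrame({
    "pmid": [1, 2, 3],
    "abstract": ["Toxin-antitoxin systems are ubiquitous.", None, "   "],
})
cleaned = add_passage_prefix(toy, "abstract")
print(cleaned)
```

Dropping rows changes the number of embeddings produced, so any later check that compares the dictionary size with the DataFrame length should use the filtered frame.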


In [17]:
sbir_df[sbir_df['sbid'] == 'S031663']['Abstract'].values[0]

'passage: One of the most significant events in our lives is the birth of a child  At the same time  on average     of couples find difficulties conceiving within the first year of trying  In our society  historically  fertility issues are mostly associated with femalesandapos  health  However  in         of these cases male fertility issues may be at the source of the problem  while  typically  are much easier to treat  The fundamental evaluation of male fertility in the clinical laboratory begins with sperm characterization  where the golden standard is to use a benchtop microscope  manually  and or a computer assisted automated semen analysis  CASA  system  Though  partially due to cost and partially due to social stigma associated with clinical evaluation or treatment  male infertility often stays undetected  This has generated a demand for inexpensive in home sperm evaluation kits  However  there is currently no product on the market that offers in home and quantitative measuremen

In [None]:
# 'sbid' is a column of sbir_df; pubmed_df only has 'pmid' and 'abstract'.
sbir_df[sbir_df['sbid'] == 'S002593']

In [11]:
def average_pool(last_hidden_states: torch.Tensor,
                 attention_mask: torch.Tensor) -> torch.Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.

device = torch.device('cuda')

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-large-v2')
model = AutoModel.from_pretrained('intfloat/e5-large-v2').to(device)
model = nn.DataParallel(model)  # Utilize multiple GPUs with DataParallel
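
To make the pooling step concrete, here is a NumPy analogue of `average_pool` applied to a toy batch. It illustrates the same masked-mean arithmetic and is not the notebook's PyTorch code; the name `masked_mean_pool` and the toy values are made up for this example.

```python
import numpy as np

def masked_mean_pool(last_hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over the sequence axis, ignoring padded positions.

    last_hidden:    (batch, seq_len, hidden) float array
    attention_mask: (batch, seq_len) array of 1 (real token) / 0 (padding)
    """
    mask = attention_mask[..., None].astype(last_hidden.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden * mask).sum(axis=1)                   # padding contributes 0
    counts = attention_mask.sum(axis=1)[..., None]              # real tokens per row
    return summed / counts

hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],      # no padding
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded position in the first row holds large values on purpose: because it is masked out, it has no effect on the pooled vector.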

### pmid2emb

In [5]:
th = 2000  # rows per chunk
pubmed_embeddings = {}
# Iterate over pubmed_df in chunks of th rows.
for i in range(0, len(pubmed_df), th):
    print(i)
    # .iloc slices are half-open, so consecutive chunks do not overlap
    # (.loc[i:i+th] is inclusive and would process row i+th twice).
    sub_df = pubmed_df.iloc[i:i + th]
    pmids = sub_df['pmid'].to_list()
    abstracts = sub_df['abstract'].to_list()
    # Tokenize the chunk, truncating each abstract to 512 tokens.
    batch_dict = tokenizer(abstracts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    batch_dict = {key: value.to(device) for key, value in batch_dict.items()}
    with torch.no_grad():
        outputs = model(**batch_dict)
    # Mean-pool over non-padding tokens to get one vector per abstract.
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask']).detach().cpu().numpy()
    for pmid, emb in zip(pmids, embeddings):
        pubmed_embeddings[pmid] = emb
    # Checkpoint the dictionary every 100,000 rows.
    if i % 100000 == 0:
        with open(PUBMED_ADDRESS + 'pubmed_embeddings_dict.pkl', 'wb') as f:
            pickle.dump(pubmed_embeddings, f)
        print('saved at row', i)
# Final save after the last chunk.
with open(PUBMED_ADDRESS + 'pubmed_embeddings_dict.pkl', 'wb') as f:
    pickle.dump(pubmed_embeddings, f)
print('saved at row', i)

0
saved at row 0
2000
4000
...
100000
saved at row 100000
...
200000
saved at row 200000
...
284000
[output truncated]

In [6]:
with open(PUBMED_ADDRESS + 'pubmed_embeddings_dict.pkl', 'wb') as f:
    pickle.dump(pubmed_embeddings, f)
print('saved at row', i)

saved at row 10928000
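
The batching loop walks the DataFrame in steps of `th`. One detail worth keeping in mind is that pandas label slicing with `.loc[a:b]` includes both endpoints, whereas positional slicing with `.iloc[a:b]` is half-open like a Python list slice. The following sketch produces half-open bounds that cover every row exactly once; `chunk_bounds` is a hypothetical helper, not something defined in the notebook.

```python
from typing import Iterator, Tuple

def chunk_bounds(n_rows: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) pairs that tile range(n_rows) without overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

bounds = list(chunk_bounds(n_rows=4500, chunk_size=2000))
print(bounds)  # [(0, 2000), (2000, 4000), (4000, 4500)]
```

Each pair can be passed to `df.iloc[start:stop]`, and the last chunk is simply shorter when the row count is not a multiple of the chunk size.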


In [9]:
len(pubmed_embeddings), len(pubmed_df)

(10928078, 10928078)
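
Because the checkpoint file is overwritten in place, an interruption during `pickle.dump` could leave a partially written file. A possible hardening, sketched here with standard-library calls only, is to write to a temporary file and rename it, then skip already-embedded IDs when resuming. The helpers `save_checkpoint` and `load_checkpoint` and the file name are assumptions for illustration, not part of the notebook.

```python
import os
import pickle
import tempfile

def save_checkpoint(obj, path: str) -> None:
    """Pickle obj to a temp file in the target directory, then atomically rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def load_checkpoint(path: str) -> dict:
    """Return the saved dictionary, or an empty one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)

# Demo in a throwaway directory with toy data.
with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = os.path.join(tmp_dir, "embeddings_dict.pkl")
    done = load_checkpoint(ckpt)      # {} on the first run
    done[28867731] = [0.1, 0.2]
    save_checkpoint(done, ckpt)
    resumed = load_checkpoint(ckpt)
    # On restart, only IDs missing from the checkpoint still need embedding.
    todo = [pid for pid in (28867731, 23611783) if pid not in resumed]
    print(todo)  # [23611783]
```

For a dictionary with millions of entries, rewriting the full pickle at every checkpoint is itself slow, so the checkpoint interval is a trade-off between lost work and I/O time.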

### sbid2emb

In [18]:
th = 2000  # rows per chunk
sbir_embeddings = {}
# Iterate over sbir_df in chunks of th rows.
for i in range(0, len(sbir_df), th):
    print(i)
    # .iloc slices are half-open, so consecutive chunks do not overlap
    # (.loc[i:i+th] is inclusive and would process row i+th twice).
    sub_df = sbir_df.iloc[i:i + th]
    sbids = sub_df['sbid'].to_list()
    abstracts = sub_df['Abstract'].to_list()
    # Tokenize the chunk, truncating each abstract to 512 tokens.
    batch_dict = tokenizer(abstracts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    batch_dict = {key: value.to(device) for key, value in batch_dict.items()}
    with torch.no_grad():
        outputs = model(**batch_dict)
    # Mean-pool over non-padding tokens to get one vector per abstract.
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask']).detach().cpu().numpy()
    for sbid, emb in zip(sbids, embeddings):
        sbir_embeddings[sbid] = emb
    # Checkpoint the dictionary every 100,000 rows.
    if i % 100000 == 0:
        with open(SBIR_ADDRESS + 'sbir_embeddings_dict.pkl', 'wb') as f:
            pickle.dump(sbir_embeddings, f)
        print('saved at row', i)
# Final save after the last chunk.
with open(SBIR_ADDRESS + 'sbir_embeddings_dict.pkl', 'wb') as f:
    pickle.dump(sbir_embeddings, f)
print('saved at row', len(sbir_embeddings))

0
saved at row 0
2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
22000
24000
26000
28000
30000
32000
34000
36000
38000
40000
42000
44000
46000
48000
50000
52000
54000
56000
58000
60000
62000
saved at row 63488
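
Downstream steps such as dimensionality reduction usually expect a 2-D array rather than a dictionary. A minimal sketch of that conversion, assuming each dictionary value is a 1-D NumPy array of equal length, keeps an ID list aligned with the matrix rows and optionally L2-normalizes them. The helpers `dict_to_matrix` and `l2_normalize` and the toy 2-dimensional vectors are hypothetical.

```python
import numpy as np

def dict_to_matrix(embeddings: dict):
    """Return (ids, matrix) where row i of matrix is the vector for ids[i]."""
    ids = list(embeddings.keys())
    matrix = np.stack([embeddings[k] for k in ids]).astype(np.float32)
    return ids, matrix

def l2_normalize(matrix: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length; eps guards against division by zero."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)

# Toy stand-in for the SBIR embedding dictionary.
toy = {"S000001": np.array([3.0, 4.0]), "S000003": np.array([0.0, 2.0])}
ids, X = dict_to_matrix(toy)
X_unit = l2_normalize(X)
print(ids, X.shape)
```

After normalization, the dot product of two rows equals their cosine similarity, which makes pairwise comparisons a single matrix multiplication.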
