# Text Embeddings with SPECTER

## Overview 

This notebook generates and stores embeddings for a set of papers. It reads the raw data, combines the title and abstract of each paper, and converts the text into an embedding using a pre-trained model, in this case SPECTER, which produces a 768-dimensional vector per paper. The embeddings are then stored in either CSV or NumPy format, organized by the publication year of the papers.

Transforming raw text into embeddings is a computationally intensive task. While it's possible to perform this operation on a CPU, it's not recommended due to the significant time and resource overhead. Instead, using a GPU is advisable for efficiency.

Given the computational demands of this task, it's essential to adopt a memory-efficient approach. Instead of loading the entire dataset into memory, we process the data iteratively: we read a small chunk of papers from the raw file, generate their embeddings, and immediately write those embeddings to storage before reading the next chunk. This ensures that only a small amount of data is held in memory at any given time.
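A minimal sketch of this streaming pattern, using only the standard library. The helper `iter_chunks`, the header names, and the in-memory sample data are hypothetical and only stand in for the raw file; the tab delimiter mirrors the reader used later in this notebook.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of up to `chunk_size` rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical in-memory stand-in for the raw file (tab-separated, one header row).
raw = io.StringIO(
    "PaperID\tDate\tTitle\tAbstract\n"
    "1\t2019-05-01\tTitle A\tAbstract A\n"
    "2\t2019-07-12\tTitle B\tAbstract B\n"
    "3\t2020-01-30\tTitle C\tAbstract C\n"
)
reader = csv.reader(raw, delimiter="\t", quotechar='"')
next(reader)  # skip the header row

# Only one chunk of rows is held in memory at a time.
chunk_sizes = [len(chunk) for chunk in iter_chunks(reader, chunk_size=2)]
print(chunk_sizes)
```

Because `iter_chunks` consumes the reader lazily, memory use depends on the chunk size rather than on the size of the file.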

## Workflow

- **Loading the Embedding Model**: The notebook starts by loading a pre-trained embedding model. If a GPU is available, the model is moved to the GPU to accelerate processing; otherwise, it runs on the CPU.

- **Counting the Number of Papers**: It counts the total number of papers in the dataset by reading the raw CSV file and counting the lines (minus the header). This count is needed to track progress with a `tqdm` progress bar.

- **Setting Storage Method:** Users can choose between 'csv' and 'numpy' as the storage method for the generated embeddings. This choice determines the format in which the embeddings will be saved.

- **Processing Data in Chunks:** To limit memory usage, the notebook processes the data in chunks. It reads `CHUNK_SIZE` papers at a time (50 in the configuration below) and generates embeddings for each chunk before moving on to the next.

- **Generating and Storing Embeddings:** For each chunk of data, the script performs the following steps:
    - Reads the papers’ data from the raw CSV file.
    - Extracts the publication year of each paper.
    - Combines the title and abstract of each paper and generates an embedding using the pre-loaded model.
    - Stores the generated embeddings in a list, organized by publication year.


- **Saving Embeddings:** After each chunk, the notebook writes the embeddings of each year group to disk. It creates a new file the first time a year is encountered and appends to that file afterwards, storing the embeddings in either CSV or NumPy format depending on the chosen storage method.
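The grouping step in the workflow above can be sketched as follows. This is an illustration only: `group_by_year` and `fake_embed` are hypothetical names, the sample rows are invented, and `fake_embed` returns toy 3-dimensional vectors in place of the real model's 768-dimensional output.

```python
from collections import defaultdict


def group_by_year(rows, date_col=1):
    """Group rows by the year parsed from a 'YYYY-MM-DD' date column."""
    groups = defaultdict(list)
    for row in rows:
        year = int(row[date_col].split("-")[0])
        groups[year].append(row)
    return dict(groups)


def fake_embed(texts):
    """Placeholder for the real model: one 3-dimensional vector per text."""
    return [[float(len(text)), 0.0, 1.0] for text in texts]


# Hypothetical chunk: [paper id, date, title, abstract]
chunk = [
    ["p1", "2019-05-01", "Title A", "Abstract A"],
    ["p2", "2020-01-30", "Title B", "Abstract B"],
    ["p3", "2019-07-12", "Title C", "Abstract C"],
]

rows_by_year = {}
for year, papers in group_by_year(chunk).items():
    texts = [paper[2] + " " + paper[3] for paper in papers]
    vectors = fake_embed(texts)
    # Prefix each vector with its paper id, ready to be written to the year's file.
    rows_by_year[year] = [[paper[0]] + vec for paper, vec in zip(papers, vectors)]

print(sorted(rows_by_year))
```

Embedding one year group at a time keeps each batch aligned with the file it will be written to.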


## Considerations for Storing Embeddings

When deciding how to store the generated embeddings, several factors come into play:

- **Read/Write Speed**: For operations where speed is crucial, binary formats such as NumPy's `.npy` (a single array) or `.npz` (an archive of several arrays, optionally compressed) are recommended. These formats offer faster read/write speeds than CSV files.

- **Interoperability**: If the embeddings need to be accessed by various software or tools, the CSV format is more universal. However, it's worth noting that CSV files tend to be larger and slower to read/write compared to binary formats.

- **Data Volume**: If dealing with a vast amount of embeddings, it might be beneficial to process and store the data in chunks. This approach can further optimize memory usage and improve overall efficiency.
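To make the trade-off concrete, the sketch below writes the same array in both formats and compares file sizes. The array is random data with an arbitrary 100 x 768 shape chosen for illustration, and the file names are hypothetical; actual sizes and timings depend on the data and on the text format used for the CSV.

```python
import os
import tempfile

import numpy as np

# Random stand-in for 100 embeddings of dimension 768.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((100, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "vectors.npy")
    csv_path = os.path.join(tmp, "vectors.csv")

    np.save(npy_path, vectors)                    # binary, 4 bytes per float32 value
    np.savetxt(csv_path, vectors, delimiter=",")  # text, many characters per value

    npy_size = os.path.getsize(npy_path)
    csv_size = os.path.getsize(csv_path)

    # Both formats can be read back into an array of the same shape.
    from_npy = np.load(npy_path)
    from_csv = np.loadtxt(csv_path, delimiter=",", dtype=np.float32)

print(f".npy: {npy_size} bytes, .csv: {csv_size} bytes")
```

The binary file also preserves the dtype and shape, whereas a CSV reader has to parse every value from text.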

With these considerations in mind, we'll now delve into the process of generating and storing embeddings using the SPECTER model.

## Output
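The notebook generates files containing the embeddings of the papers, organized by publication year. Files are written to a `vectors` folder inside the output path and are named `<year>_vectors.csv` or `<year>_vectors.npy`, depending on the user's choice. CSV files contain a `PaperID` column followed by the vector components; NumPy files contain only the vectors, without the paper IDs.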
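As an illustration of how a per-year CSV file might be read back, here is a minimal sketch. The helper `read_year_csv` is hypothetical, and the tiny file it reads is created on the spot with 3 components per vector instead of 768.

```python
import csv
import os
import tempfile


def read_year_csv(path):
    """Return (paper_ids, vectors) from a CSV whose first column is the paper id."""
    paper_ids, vectors = [], []
    with open(path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            paper_ids.append(row[0])
            vectors.append([float(x) for x in row[1:]])
    return paper_ids, vectors


# Write a small hypothetical file in the same layout, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "2019_vectors.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["PaperID", "0", "1", "2"])
        writer.writerow(["p1", 0.1, 0.2, 0.3])
        writer.writerow(["p2", 0.4, 0.5, 0.6])
    ids, vecs = read_year_csv(path)

print(ids, len(vecs[0]))
```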


In [None]:
import os
import csv
import torch
import numpy as np
from tqdm.notebook import tqdm
from transformers import AutoTokenizer, AutoModel

%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(1, '../science_novelty/')

from embeddings import load_model, get_embedding

## Increase the max size of a line reading, otherwise an error is raised
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

In [None]:
# Constants
PATH_OUTPUT = '../data/'
PATH_INPUT = '../data/raw/'
STORAGE = 'csv'
CHUNK_SIZE = 50
TOTAL_PAPERS = None

# Check that the input and output paths exist
for path in (PATH_INPUT, PATH_OUTPUT):
    if not os.path.exists(path):
        raise FileNotFoundError(f"Path does not exist: {path}")

# Load the embedding model
print('Loading the embedding model...')
tokenizer, model = load_model()

# Move the model to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
print(f"Using {device.upper()}.")

# Count the number of papers
print('Get the number of papers to process...')
with open(PATH_INPUT + 'papers_raw.csv', 'r', encoding='utf-8') as file:
    line_count = sum(1 for line in file)
TOTAL_PAPERS = line_count - 1  # Subtract 1 for the header


def get_last_processed_index(path_output):
    """Count the embeddings already saved, so an interrupted run can resume."""
    total_processed = 0
    vectors_path = os.path.join(path_output, 'vectors')
    
    if not os.path.exists(vectors_path):
        return total_processed
    
    for file in os.listdir(vectors_path):
        file_path = os.path.join(vectors_path, file)
        if file.endswith('.csv'):
            with open(file_path, 'r', encoding='utf-8') as f:
                total_processed += sum(1 for line in f) - 1  # Subtract 1 to exclude the header row
        elif file.endswith('.npy'):
            # .npy files are binary, so count rows from the array shape instead of text lines
            total_processed += np.load(file_path, allow_pickle=True).shape[0]
    
    return total_processed

def save_vectors(vectors, year, storage, path_output):
    vectors_path = os.path.join(path_output, 'vectors')
    os.makedirs(vectors_path, exist_ok=True)  # Ensure the directory exists
    
    file_path = os.path.join(vectors_path, f'{year}_vectors')
    
    if storage == 'csv':
        file_path += '.csv'
        mode = 'a' if os.path.exists(file_path) else 'w'
        with open(file_path, mode, encoding='utf-8', newline='') as writer:
            csv_writer = csv.writer(writer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            if mode == 'w':
                print(f'Creating new file for year {year}...')
                csv_writer.writerow(["PaperID"] + [f"{i}" for i in range(len(vectors[0]) - 1)])  # Adjusted header format
            csv_writer.writerows(vectors)
    elif storage == 'numpy':
        file_path += '.npy'
        vectors = np.array([vec[1:] for vec in vectors])  # Exclude PaperID for numpy storage
        if os.path.exists(file_path):
            existing_vectors = np.load(file_path, allow_pickle=True)
            vectors = np.vstack((existing_vectors, vectors))
        np.save(file_path, vectors)
    else:
        raise ValueError("Unsupported storage format. Use 'csv' or 'numpy'.")

def process_papers(start_index):
    with open(PATH_INPUT + 'papers_raw.csv', 'r', encoding='utf-8') as reader:
        csv_reader = csv.reader(reader, delimiter='\t', quotechar='"')

        # Skip the header row (1 line) and the already processed papers
        print('Skipping header and already processed papers...')
        for _ in tqdm(range(start_index + 1)):
            next(csv_reader)

        for chunk_start in tqdm(range(start_index, TOTAL_PAPERS, CHUNK_SIZE)):
            chunk_data = [line_csv for _, line_csv in zip(range(CHUNK_SIZE), csv_reader)]
            
            # Group by year
            papers_by_year = {}
            for data in chunk_data:
                year = int(data[1].split('-')[0])
                if year not in papers_by_year:
                    papers_by_year[year] = []
                papers_by_year[year].append(data)

            # Process each year group
            for year, papers in papers_by_year.items():
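                # Join title and abstract with the tokenizer's separator token, as in the SPECTER model card
                texts = [paper[2] + tokenizer.sep_token + paper[3] for paper in papers]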
                vectors = get_embedding(texts, tokenizer, model)
                vectors_with_id = [[paper[0]] + list(vectors[i]) for i, paper in enumerate(papers)]
                save_vectors(vectors_with_id, year, STORAGE, PATH_OUTPUT)


# Get the last processed paper index
last_processed_index = get_last_processed_index(PATH_OUTPUT)
print(f"Resuming from paper {last_processed_index + 1}.")

# Process the papers
process_papers(last_processed_index)