# Text Embeddings with SPECTER

## Overview 

This notebook generates and stores embeddings for a set of papers. It reads the raw data, combines the title and abstract of each paper, and converts the text into an embedding using a pre-loaded model. Here the model is SPECTER, and each embedding is a 768-dimensional vector. The embeddings are then stored in either CSV or NumPy format, with one file per publication year.

Transforming raw text into embeddings is a computationally intensive task. While it's possible to perform this operation on a CPU, it's not recommended due to the significant time and resource overhead. Instead, using a GPU is advisable for efficiency.

Given the computational demands of this task, it's essential to adopt a memory-efficient approach. Instead of loading the entire dataset into memory, we'll process the data iteratively. This involves reading the papers in small chunks, generating an embedding for each one, and writing the embeddings to disk as soon as all papers of a publication year have been processed. As a result, at most one year's worth of embeddings is held in memory at any given time.
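As a minimal sketch of this iterative style, the generator below yields one parsed row at a time instead of loading the whole file. The function name `iter_rows`, the demo file, and the column names in the demo header are hypothetical; the tab delimiter matches the reader used later in this notebook.

```python
import csv
import os
import tempfile
from typing import Iterator, List


def iter_rows(path: str, delimiter: str = "\t") -> Iterator[List[str]]:
    """Yield parsed rows one at a time, skipping the header line."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter, quotechar='"')
        next(reader, None)  # skip the header row, if there is one
        for row in reader:
            yield row  # only the current row is held in memory


# Demonstration on a throwaway file with made-up rows (hypothetical columns)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "papers_demo.csv")
    with open(demo_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["PaperID", "Date", "Title", "Abstract"])
        writer.writerow(["p1", "2001-05-01", "A title", "An abstract."])
        writer.writerow(["p2", "2002-01-15", "Another title", "More text."])

    # Materialized as a list only for the demo; real use would loop lazily
    demo_rows = list(iter_rows(demo_path))
```

Because the generator is lazy, memory use depends on the size of a single row rather than on the size of the file.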

## Workflow

- **Loading the Embedding Model:** The notebook starts by loading a pre-trained embedding model. If a GPU is available, the model is moved to the GPU to accelerate the processing; otherwise, it continues using the CPU.

- **Counting the Number of Papers:** It counts the total number of papers in the dataset by reading the raw CSV file and counting the lines. This count is needed to track progress with a progress bar (`tqdm`).

- **Setting Storage Method:** Users can choose between 'csv' and 'numpy' as the storage method for the generated embeddings. This choice determines the format in which the embeddings will be saved.

- **Processing Data in Chunks:** To optimize memory usage, the script processes the data in chunks. It reads and processes a specified number of papers at a time, generating embeddings for each chunk before moving on to the next.

- **Generating and Storing Embeddings:** For each chunk of data, the script performs the following steps:
    - Reads the papers’ data from the raw CSV file.
    - Extracts the publication year of each paper.
    - Combines the title and abstract of each paper and generates an embedding using the pre-loaded model.
    - Stores the generated embeddings in a list, organized by publication year.


- **Saving Embeddings:** Whenever the publication year changes, and once more after the last paper, the notebook saves the accumulated embeddings to disk. It writes one file per year, storing the embeddings in either CSV or NumPy format depending on the chosen storage method.
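The chunking and year-boundary steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's exact code: `chunked`, `embed_by_year`, `fake_embed`, and `save_in_memory` are hypothetical names, the embedder is a stand-in that returns two numbers instead of a 768-dimensional SPECTER vector, and the rows are assumed to arrive sorted by publication year, which the year-change logic appears to rely on.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def chunked(rows: Iterable[List[str]], size: int) -> Iterator[List[List[str]]]:
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def embed_by_year(
    rows: Iterable[List[str]],
    embed: Callable[[str], List[float]],
    save: Callable[[int, List[list]], None],
    chunk_size: int = 50,
) -> None:
    """Embed rows chunk by chunk and flush the buffer when the year changes."""
    current_year = None
    buffer: List[list] = []
    for chunk in chunked(rows, chunk_size):
        for paper_id, date, title, abstract in chunk:
            year = int(date.split("-")[0])
            if current_year is not None and year != current_year:
                save(current_year, buffer)  # previous year is complete
                buffer = []
            current_year = year
            buffer.append([paper_id] + list(embed(title + " " + abstract)))
    if buffer:
        save(current_year, buffer)  # flush the final year once, after the loop


def fake_embed(text: str) -> List[float]:
    """Stand-in embedder (assumption): NOT the SPECTER model."""
    return [float(len(text)), 0.0]


saved: Dict[int, List[list]] = {}


def save_in_memory(year: int, vectors: List[list]) -> None:
    """Stand-in for writing a per-year file."""
    saved[year] = vectors


demo_rows = [
    ["p1", "2001-05-01", "A title", "An abstract."],
    ["p2", "2001-11-30", "Second", "Text."],
    ["p3", "2002-01-15", "Third", "More text."],
]
embed_by_year(demo_rows, fake_embed, save_in_memory, chunk_size=2)
```

Passing `embed` and `save` as arguments keeps the year-boundary logic independent of the model and of the storage format, so the same loop can serve both the CSV and the NumPy branch.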


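One caveat about the counting step: counting physical lines overestimates the number of papers whenever a quoted field, such as an abstract, contains a line break. The sketch below contrasts the two counts on a made-up sample; `count_records` is a hypothetical helper, and the sample columns are assumptions.

```python
import csv
import io


def count_records(handle, delimiter: str = "\t") -> int:
    """Count parsed CSV records, excluding the header."""
    reader = csv.reader(handle, delimiter=delimiter, quotechar='"')
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)


# One paper whose quoted abstract spans two physical lines (made-up data)
sample = (
    "PaperID\tDate\tTitle\tAbstract\n"
    'p1\t2001-05-01\tA title\t"Line one\nline two"\n'
)

physical_lines = sum(1 for _ in io.StringIO(sample)) - 1  # minus the header
records = count_records(io.StringIO(sample))
```

Here `physical_lines` and `records` differ, so a progress bar sized from the line count would not reach its total exactly. Parsing is slower than counting lines, so this is a trade-off rather than a strict improvement.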
## Considerations for Storing Embeddings

When deciding how to store the generated embeddings, several factors come into play:

- **Read/Write Speed**: For operations where speed is crucial, binary formats like NumPy's `.npy` (a single array) or `.npz` (an archive of several arrays, optionally compressed) are recommended. These formats offer faster read/write speeds compared to traditional CSV files.

- **Interoperability**: If the embeddings need to be accessed by various software or tools, the CSV format is more universal. However, it's worth noting that CSV files tend to be larger and slower to read/write compared to binary formats.

- **Data Volume**: If dealing with a vast amount of embeddings, it might be beneficial to process and store the data in chunks. This approach can further optimize memory usage and improve overall efficiency.
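To make the trade-off between the two formats concrete, the following sketch writes the same matrix as CSV and as `.npy` and compares the file sizes. The data is random and the file names are hypothetical; real SPECTER vectors would have the same shape per row (768 values) but different contents.

```python
import os
import tempfile

import numpy as np

# 100 made-up vectors with the same dimensionality as SPECTER embeddings
rng = np.random.default_rng(0)
vectors = rng.standard_normal((100, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo_vectors.csv")
    npy_path = os.path.join(tmp_dir, "demo_vectors.npy")

    np.savetxt(csv_path, vectors, delimiter=",")  # text representation
    np.save(npy_path, vectors)                    # raw binary representation

    csv_bytes = os.path.getsize(csv_path)
    npy_bytes = os.path.getsize(npy_path)

    # Read both files back to check that the values survive the round trip
    from_csv = np.loadtxt(csv_path, delimiter=",", dtype=np.float32)
    from_npy = np.load(npy_path)

print(f"CSV: {csv_bytes} bytes, .npy: {npy_bytes} bytes")
```

The binary file stores four bytes per value, while the text file spends many characters on each number, which is where most of the size and parsing cost of CSV comes from.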

With these considerations in mind, we'll now delve into the process of generating and storing embeddings using the SPECTER model.

## Output
The notebook generates files containing the embeddings of the papers, organized by their publication year. Each file is named after the corresponding year (`<year>_vectors.csv` or `<year>_vectors.npy`) and contains the embeddings in either CSV or NumPy format, depending on the user’s choice.
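A per-year CSV file can be read back with the standard library alone. The sketch below assumes the layout written by the code in this notebook, namely a `PaperID` column followed by the vector components; `load_year_csv` is a hypothetical helper, and the demo file uses three components instead of 768 to stay short.

```python
import csv
import os
import tempfile
from typing import List, Tuple


def load_year_csv(path: str) -> Tuple[List[str], List[List[float]]]:
    """Return paper IDs and their vectors from one per-year CSV file."""
    ids: List[str] = []
    vectors: List[List[float]] = []
    with open(path, "r", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=",", quotechar='"')
        next(reader, None)  # header: PaperID followed by component indices
        for row in reader:
            ids.append(row[0])
            vectors.append([float(value) for value in row[1:]])
    return ids, vectors


# Demonstration with a tiny made-up file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "2001_vectors.csv")
    with open(demo_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["PaperID", 0, 1, 2])
        writer.writerow(["p1", 0.5, -1.0, 2.0])
        writer.writerow(["p2", 0.0, 0.25, -0.75])

    loaded_ids, loaded_vectors = load_year_csv(demo_path)
```

Keeping the IDs and the vectors in separate lists makes it straightforward to convert the vectors to a numeric array afterwards.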


In [None]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(1, '../science_novelty/')

import embeddings
import csv
from tqdm.notebook import tqdm
import numpy as np  # required by np.save / np.array in the 'numpy' storage branch
import torch


## Increase the max size of a line reading, otherwise an error is raised
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

In [None]:
# Load the embedding model
print('Load the embedding model...')
tokenizer, model = embeddings.load_model()

# Move the model to GPU if available
if torch.cuda.is_available():
    model = model.to('cuda')
    print("Model moved to GPU.")
else:
    print("Using CPU.")

# Count the number of papers
print('Get the number of papers to process...')
with open('../data/raw/papers_raw.csv', 'r', encoding='utf-8') as file:
    line_count = sum(1 for line in file)
total_papers = line_count - 1  # Subtract 1 for the header

# Choose storage method: 'csv' or 'numpy'
storage = 'csv'  # Change to 'numpy' if needed

# Process data in chunks for memory efficiency
chunk_size = 50
print('Processing...')
current_year_vectors = []
current_year = None

with open('../data/raw/papers_raw.csv', 'r', encoding='utf-8') as reader:
    csv_reader = csv.reader(reader, delimiter='\t', quotechar='"')
    next(csv_reader)  # Skip header

    for chunk_start in tqdm(range(0, total_papers, chunk_size)):
        chunk_data = [line for _, line in zip(range(chunk_size), csv_reader)]
        
        # Generate embeddings for the chunk
        for line in chunk_data:
            year = int(line[1].split('-')[0])  # Assuming the year is in the second column
            
            if current_year is None:
                current_year = year
            
            if year != current_year:
                # Save the vectors for the previous year
                if storage == 'csv':
                    with open(f'../data/vectors/{current_year}_vectors.csv', 'w', encoding='utf-8', newline='') as writer:
                        csv_writer = csv.writer(writer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                        csv_writer.writerow(["PaperID"] + list(range(0, 768)))
                        csv_writer.writerows(current_year_vectors)
                elif storage == 'numpy':
                    np.save(f'../data/vectors/{current_year}_vectors.npy', np.array(current_year_vectors))
                
                # Reset for the new year
                current_year_vectors = []
                current_year = year
            
            text = line[2] + ' ' + line[3]  # Title and abstract, separated so their words do not fuse
            vector = embeddings.get_embedding(text, tokenizer, model)
            current_year_vectors.append([line[0]] + list(vector))  # Add PaperID at the beginning

    # Save vectors for the last year once, after all chunks have been processed
    if storage == 'csv':
        with open(f'../data/vectors/{current_year}_vectors.csv', 'w', encoding='utf-8', newline='') as writer:
            csv_writer = csv.writer(writer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            csv_writer.writerow(["PaperID"] + list(range(0, 768)))
            csv_writer.writerows(current_year_vectors)
    elif storage == 'numpy':
        np.save(f'../data/vectors/{current_year}_vectors.npy', np.array(current_year_vectors))
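
A note on the `'numpy'` branch: when each row mixes a string ID with float components, `np.array` converts the whole array to a string dtype. One possible alternative, shown here only as a sketch with made-up values and a hypothetical file name, is to store the IDs and the vectors as two separate arrays in a single `.npz` archive.

```python
import os
import tempfile

import numpy as np

# Rows shaped like the notebook's buffer: ID first, then vector components
rows = [["p1", 0.1, 0.2, 0.3], ["p2", 0.4, 0.5, 0.6]]

mixed = np.array(rows)  # strings and floats together become a string array

# Alternative: keep IDs and numeric vectors in separate, typed arrays
ids = np.array([row[0] for row in rows])
vectors = np.array([row[1:] for row in rows], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "2001_vectors.npz")
    np.savez(path, ids=ids, vectors=vectors)

    with np.load(path) as archive:
        loaded_ids = archive["ids"]
        loaded_vectors = archive["vectors"]
```

With this layout the vectors stay numeric on disk, so they can be used directly after loading without converting strings back to floats.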