# Creating the PatentSBERTa embeddings dataset (explanatory notebook, not executable)

This notebook explains how the embeddings are created from the Content_file:

- "/bigstorage/DATASETS_JSON/Content_JSONs/Cited_2020_Uncited_2010-2019_Cleaned_Content_22k/CLEANED_CONTENT_DATASET_cited_patents_by_2020_uncited_2010-2019.json"



## Content_File

Each patent within the Content_file is structured with the following information:

{
- "Application_Number": "2097973",
- "Application_Date": "2013-04-03",
- "Application_Category": "B1",
- "Content": {
    - "title": "DYNAMIC RANGE IMPROVEMENTS OF LOAD MODULATED AMPLIFIERS",
    - "c-en-0001": "An antibody or antibody fragment according to claim 1, wherein said polypeptide is a recombinant polypeptide, and optionallywherein said recombinant polypeptide is biotinylated.",
    - "c-en-0002": ...,
    - "c-en-0003": ...,
    - "c-en-0004": ...,
    - "c-en-0005": ...,
    - "c-en-0006": ...,
    - "p0001": "The present invention relates generally to power amplifiers and amplifying methods and more specifically to high efficiency power amplifiers.",
    - "p0002": ...,
    - "p0003": ...,
    - "p0004": ...,
    - "p0005": ...,
    - "p0006": ...,
    - "p0007": ...,
    - "p0008": ...,
    - "p0010": ...,
    - "p0011": ...,
    - "p0013": ...,
    - "p0014": ...,
    - ... ,
    - "p0099": ...

        }

    }
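
The key names in the structure above suggest a simple way to tell the content types apart. The following is a minimal sketch, assuming that claims always use the `c-en-` prefix and paragraphs a `p` followed by digits, as in the example; `classify_key` is a hypothetical helper and not part of the original pipeline.

```python
def classify_key(key: str) -> str:
    """Guess the content type from a key of the "Content" dictionary.

    Assumption: the prefixes seen in the example hold for every patent.
    """
    if key == "title":
        return "title"
    if key.startswith("c-en-"):
        return "claim"
    if key.startswith("p") and key[1:].isdigit():
        return "paragraph"
    return "other"


# Toy record that mirrors the structure shown above (elided values kept as "...")
record = {
    "Application_Number": "2097973",
    "Application_Date": "2013-04-03",
    "Application_Category": "B1",
    "Content": {
        "title": "DYNAMIC RANGE IMPROVEMENTS OF LOAD MODULATED AMPLIFIERS",
        "c-en-0001": "...",
        "c-en-0002": "...",
        "p0001": "...",
        "p0002": "...",
    },
}

# Count how many entries of each type the record contains
counts = {}
for key in record["Content"]:
    kind = classify_key(key)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)
```

Keys that match none of the assumed patterns fall into `"other"`, so unexpected entries are easy to spot.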

## Requirements

In [2]:
import pandas as pd
import numpy as np
import json
import os

import argparse
# import faiss
# from faiss import write_index, read_index
from sentence_transformers import SentenceTransformer, models




# Creation of embeddings

## Process

The process to embed the content from each patent is as follows:

1. We load the content file and extract the text that we want to embed: content fields such as the abstract, claims, paragraphs and figure references.

2. For each text, the SBERT model splits the input into tokens, computes one embedding per token, and then applies the chosen pooling strategy to combine the token embeddings into a single embedding for the whole text/sentence.

    For instance, for the first claim, the model tokenizes the claim and, with the mean pooling strategy, averages the embeddings of all its tokens.
    The resulting embedding is a vector of 768 dimensions.

    BEFORE

    - "An antibody or antibody fragment according to claim 1, wherein said polypeptide is a recombinant polypeptide, and optionallywherein said recombinant polypeptide is biotinylated."

    AFTER

    - [-0.16003, -0.268536, -0.242169, -0.123718, 0.078339, -0.404047, 0.366489, -0.170657, -0.340436, 0.315698, ..., 0.031384, 0.310652, 0.108265, 0.036854, -0.043038, -0.157491, 0.004529, 0.138287, -0.117745, -0.139216]

3. Each embedded sentence is then stored in the file: "/bigstorage/you/rag_project/embeddings_preindexed/embeddings_PatentSBERTa_mean.npy"

## Pooling Strategy

Pooling: This refers to the process of combining the per-token output vectors of the network into a single fixed-size vector, regardless of how many tokens the input contains. It is a common technique to summarize a variable-length sequence and to capture the essence of the input features in one embedding.

The common pooling strategies are:

- Mean Pooling: Takes the average of all output vectors, effectively capturing the average effect of all the input tokens. For embeddings, mean pooling would average the embeddings of all tokens in the sequence, providing a single embedding vector representing the whole input.

- Max Pooling: Takes the maximum value across the output vectors in each dimension, which can be thought of as capturing the most significant feature in each dimension of the embeddings. This is sometimes used to ensure that the strongest signal in each dimension is not diluted by averaging.

- CLS Pooling: In models like BERT, the "CLS" token is a special token placed at the beginning of the input sequence. The embedding corresponding to this token can be used as the aggregate representation of the input sequence after the model has been fine-tuned, as it is designed to capture the context of the entire sequence.
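
The three strategies can be illustrated on a toy example. This is a minimal NumPy sketch, not the code the sentence-transformers library runs internally: the function `pool`, the 2-dimensional token vectors and the attention mask are all assumptions made for illustration (real PatentSBERTa token vectors have 768 dimensions).

```python
import numpy as np


def pool(token_embeddings: np.ndarray, attention_mask: np.ndarray, mode: str = "mean") -> np.ndarray:
    """Combine token vectors of shape (n_tokens, dim) into one vector of shape (dim,).

    attention_mask has shape (n_tokens,), with 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    if mode == "mean":
        # Average over real tokens only; padding contributes nothing
        return (token_embeddings * mask).sum(axis=0) / mask.sum()
    if mode == "max":
        # Replace padding rows with -inf so they can never be the maximum
        masked = np.where(mask > 0, token_embeddings, -np.inf)
        return masked.max(axis=0)
    if mode == "cls":
        # The first position holds the special classification token
        return token_embeddings[0]
    raise ValueError(f"Unknown pooling mode: {mode}")


# Three token positions; the last one is padding and must be ignored
tokens = np.array([[1.0, 4.0],
                   [3.0, 0.0],
                   [9.0, 9.0]])
mask = np.array([1, 1, 0])

print(pool(tokens, mask, "mean"))  # [2. 2.]
print(pool(tokens, mask, "max"))   # [3. 4.]
print(pool(tokens, mask, "cls"))   # [1. 4.]
```

Whatever the number of tokens, each strategy returns a vector with the same dimension as a single token vector, which is why every text in the dataset ends up as one row of equal length.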

## Code to embed

In [14]:
def main():
    parser = argparse.ArgumentParser(description='Create sentence embeddings for patent content')

    # Chosen model is SBERT (Sentence-BERT) 
    parser.add_argument('--model', '-m', type=str, default='AI-Growth-Lab/PatentSBERTa', help='The model to use for embeddings')
    
    # Chosen pooling strategy
    parser.add_argument('--pooling', '-p', type=str, default='mean', choices=['mean', 'max', 'cls'], help='The pooling strategy to use for embeddings')
    
    # Input file to embed
    parser.add_argument('--input_file', '-i', type=str, help='The input file to create embeddings for', 
                        default='/bigstorage/DATASETS_JSON/Content_JSONs/Cited_2020_Uncited_2010-2019_Cleaned_Content_22k/CLEANED_CONTENT_DATASET_cited_patents_by_2020_uncited_2010-2019.json')
    
    # Output directory where the file will be saved
    parser.add_argument('--output_dir', '-o', type=str, default='embeddings_preindexed', help='The output directory to save the embeddings to')
    
    # This line parses all of the arguments provided via the command line when the script is run.
    # The parsed arguments become attributes of the args object, so you can access the values provided by the user with args.model, args.pooling, etc.
    args = parser.parse_args()
    

    # Load the input JSON file
    with open(args.input_file, 'r') as json_file:
        data = json.load(json_file)
    print(f"Loaded {len(data)} documents from {args.input_file}")
    
    # Convert the JSON to a pandas dataframe (long table of content)
    columns = ['Application_Number', 'Application_Date', 'Application_Category', 'Content_Type', 'Content']

    data_accumulator = []
    for doc in data:
        for content_type, content in doc['Content'].items():
            # Create a dictionary for each row and append it to the list
            row_data = {
                'Application_Number': doc['Application_Number'], 
                'Application_Date': doc['Application_Date'], 
                'Application_Category': doc['Application_Category'], 
                'Content_Type': content_type, 
                'Content': content
            }
            data_accumulator.append(row_data)

    # Create the DataFrame from the accumulated data list only once
    df = pd.DataFrame(data_accumulator, columns=columns)
    print(df)

    # Load the model
    base_model = models.Transformer(args.model, max_seq_length=512)
    pooling_model = models.Pooling(base_model.get_word_embedding_dimension(), pooling_mode=args.pooling)
    model = SentenceTransformer(modules=[base_model, pooling_model]).to('cuda')

    # Encode the corpus with potentially adjusted batch size
    corpus_embeddings = model.encode(df['Content'].tolist(), show_progress_bar=True, batch_size=128)  # Adjust batch size as needed
    print(f"Encoded {len(corpus_embeddings)} documents")

    # Save the embeddings as a numpy array, creating the output directory if it doesn't exist
    os.makedirs(args.output_dir, exist_ok=True)

    output_path = os.path.join(args.output_dir, f'embeddings_{args.model.split("/")[-1]}_{args.pooling}.npy')
    np.save(output_path, corpus_embeddings)
    print(f"Saved the embeddings to {output_path}")
    


if __name__ == '__main__':
    main()


## Loading File + Shape + Size

In [19]:
import json
import numpy as np

def open_npy_file(path_to_npy):
    # Load the numpy array from the .npy file
    npy_data = np.load(path_to_npy)
    return npy_data

path = "/bigstorage/you/rag_project/embeddings_precalculated/embeddings_PatentSBERTa_mean.npy"
patentSBERTa = open_npy_file(path)


size_in_bytes = patentSBERTa.nbytes
# Convert the size to gigabytes
size_in_gb = size_in_bytes / (1024 ** 3)

print("The file \"embeddings_PatentSBERTa_mean.npy\":")
print(" Shape:", patentSBERTa.shape)
print(" Size in GB:", size_in_gb)


The file "embeddings_PatentSBERTa_mean.npy":
 Shape: (2382315, 768)
 Size in GB: 6.815857887268066
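
The reported size can be checked by hand from the shape. This is a short sketch, assuming the embeddings are stored as 32-bit floats (4 bytes per value), which is consistent with the size printed above. Note that dividing by `1024 ** 3` gives gibibytes (GiB), which the output labels as GB.

```python
rows, dims = 2_382_315, 768   # shape printed above
bytes_per_value = 4           # assumption: float32

size_in_bytes = rows * dims * bytes_per_value
size_in_gib = size_in_bytes / (1024 ** 3)

print(size_in_bytes)          # 7318471680
print(round(size_in_gib, 3))  # 6.816
```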
