## 2_embedding of chunks.ipynb

This notebook processes and embeds text chunks using a pre-trained model. The workflow involves:
1. Loading chunked JSON files containing document text.
2. Encoding the text into embeddings using the `BAAI/bge-m3` model, loaded through `FlagModel`.
3. Saving the embeddings as `.pkl` files for efficient storage and retrieval.
4. Optionally merging all embedding files into a single `.pkl` file.

### Output
- Separate `.pkl` files for each chunked JSON input file, named in the format `m3_chunk_<chunksize>_embedding_<index>.pkl`.
- A merged `.pkl` file containing all embeddings, if enabled.

### Notes
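- The embeddings are computed in batches (`batch_size=128` in the calls to `model.encode`) to bound memory usage.
- The merged `.pkl` file is written as a sequence of single-entry pickled dictionaries (one `pickle.dump` call per chunk), so it has to be read back with repeated `pickle.load` calls rather than a single one.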
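
The batching mentioned in the notes can be pictured with a minimal sketch. `iter_batches` is a hypothetical helper written for illustration only; it is not part of FlagEmbedding, and `model.encode` handles its own batching when given `batch_size`. The corpus here is stand-in data.

```python
from typing import Iterator, List


def iter_batches(texts: List[str], batch_size: int = 128) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Hypothetical stand-in corpus of 300 short strings.
texts = [f"chunk {i}" for i in range(300)]

# Only one batch needs to be held (and encoded) at a time.
batch_lengths = [len(batch) for batch in iter_batches(texts, batch_size=128)]
print(batch_lengths)  # [128, 128, 44]
```

A smaller `batch_size` lowers peak memory at the cost of more iterations, which is why the value is worth tuning per device.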

In [2]:
import torch
import pandas as pd
import numpy as np
import pickle
import json

from FlagEmbedding import FlagModel

In [3]:
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

print(f"Using device: {device}")

Using device: cuda


In [4]:
model = FlagModel('BAAI/bge-m3',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

In [None]:
def process_and_save_embedding(corpus_path, output_path):
    # Load dataset (the JSON files are assumed to be UTF-8 encoded)
    with open(corpus_path, 'r', encoding='utf-8') as f:
        corpus = json.load(f)
    
    # Extract ID and text
    chunk_ids = list(corpus.keys())
    chunk_texts = list(corpus.values())
    
    # Encode the text in batches of 128
    chunk_embedding = model.encode(chunk_texts, batch_size=128)
    
    # Store IDs and embedding results in a dictionary
    chunk_embedding_dict = {chunk_id: embedding for chunk_id, embedding in zip(chunk_ids, chunk_embedding)}
    
    # Save the result as a pickle file
    with open(output_path, 'wb') as f:
        pickle.dump(chunk_embedding_dict, f)
    
    # Delete temporary variables to save memory
    del corpus, chunk_ids, chunk_texts, chunk_embedding, chunk_embedding_dict
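
The function above relies on each JSON file being a flat `{chunk_id: text}` mapping, so that `keys()` and `values()` stay aligned. The sketch below illustrates that assumption and the pickle round trip with stand-in data: the chunk ids, the file name, and `fake_embeddings` (a placeholder for the output of `model.encode`) are all hypothetical.

```python
import json
import os
import pickle
import tempfile

# Hypothetical corpus in the assumed layout: {chunk_id: chunk_text}.
corpus = json.loads('{"doc1_0": "first chunk", "doc1_1": "second chunk"}')

# Python dicts preserve insertion order, so ids and texts stay aligned.
chunk_ids = list(corpus.keys())
chunk_texts = list(corpus.values())

# Placeholder for model.encode(chunk_texts): one short vector per text.
fake_embeddings = [[float(len(text)), 0.0] for text in chunk_texts]

chunk_embedding_dict = dict(zip(chunk_ids, fake_embeddings))

# Round trip through a temporary pickle file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_embedding.pkl")
    with open(path, "wb") as f:
        pickle.dump(chunk_embedding_dict, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(list(restored))  # ids come back in the original order
```

If a chunk file used a different layout (for example a list of records), the id and text extraction would need to change accordingly.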

In [None]:
# Loop through each chunk file (numbered 1..filenum)
filenum = 5
for i in range(1, filenum + 1):
    corpus_path = f'../chunk_embedding/chunk_512/chunk_doc_512_{i}.json'
    output_path = f'./m3_chunk_512_embedding_{i}.pkl'
    process_and_save_embedding(corpus_path, output_path)
    print(f"Processed and saved embeddings for file_{i} as {output_path}")

Inference Embeddings: 100%|██████████| 21351/21351 [4:00:17<00:00,  1.48it/s]  


Processed and saved embeddings for file_5 as ./m3_chunk_512_embedding_5.pkl


Clear memory or restart the kernel before merging the embedding files.

In [None]:
import pickle
import glob

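# Get paths to all per-file pickles, sorted so the merge order is deterministic.
# Note: the encoding loop writes to ./, so this assumes the files were moved to ../pkl/m3_chunk_512/.
file_paths = sorted(glob.glob('../pkl/m3_chunk_512/m3_chunk_512_embedding_*.pkl'))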

# Create a file to store the merged data
with open('./m3_chunk_512_embedding.pkl', 'wb') as f_out:
    # Iterate through each per-file pickle
    for file_path in file_paths:
        with open(file_path, 'rb') as f_in:
            # Load data one file at a time
            chunk_embedding_dict = pickle.load(f_in)
            # Append one single-entry dict per chunk, so the output file
            # holds a stream of pickled objects rather than one large dict
            for key, value in chunk_embedding_dict.items():
                # Each dump call appends a separate pickle to the same file
                pickle.dump({key: value}, f_out, protocol=pickle.HIGHEST_PROTOCOL)

print("All embeddings have been merged and saved in batches.")
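
Because the merge cell calls `pickle.dump` once per chunk, a single `pickle.load` on the merged file returns only the first entry. A minimal sketch of reading such a file back follows; `iter_pickled_objects` and `load_merged` are hypothetical helper names, and the demo file holds stand-in data rather than real embeddings.

```python
import os
import pickle
import tempfile
from typing import Any, Dict, Iterator


def iter_pickled_objects(path: str) -> Iterator[Any]:
    """Yield every object stored back to back in one pickle file."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                # End of file: no more pickled objects to read.
                break


def load_merged(path: str) -> Dict[str, Any]:
    """Rebuild a single {chunk_id: embedding} dict from the stream."""
    merged: Dict[str, Any] = {}
    for record in iter_pickled_objects(path):
        merged.update(record)
    return merged


# Demo: write two single-entry dicts the same way the merge cell does.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "merged_demo.pkl")
    with open(demo_path, "wb") as f_out:
        for key, value in {"a": [0.1, 0.2], "b": [0.3, 0.4]}.items():
            pickle.dump({key: value}, f_out, protocol=pickle.HIGHEST_PROTOCOL)

    # A single load sees only the first record.
    with open(demo_path, "rb") as f_in:
        first_only = pickle.load(f_in)

    merged = load_merged(demo_path)

print(first_only)   # {'a': [0.1, 0.2]}
print(len(merged))  # 2
```

Iterating with `iter_pickled_objects` directly, instead of calling `load_merged`, keeps only one entry in memory at a time, which may matter for large corpora.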

Alternatively, load a single file containing all chunked documents and encode it in one pass.

In [5]:
# Load the dataset
corpus_path = '../chunk_embedding/chunk_doc_128.json'

# Load the corpus data (the JSON file is assumed to be UTF-8 encoded)
with open(corpus_path, 'r', encoding='utf-8') as f:
    corpus = json.load(f)

chunk_ids = list(corpus.keys())
chunk_texts = list(corpus.values())

In [None]:
# Choose batch_size to fit the device's memory (128 or 64 usually works well)
chunk_embedding = model.encode(chunk_texts,batch_size=128)

In [None]:
# save result to pickle
chunk_embedding_dict = {chunk_id: embedding for chunk_id, embedding in zip(chunk_ids, chunk_embedding)}
with open('./m3_chunk_128_embedding.pkl', 'wb') as f:
    pickle.dump(chunk_embedding_dict, f)