**OBJECTIVE:** Test different ways to split PDFs into smaller chunks that can be handled by the algorithms we are using.

**AUTHOR:** [Aksh Sabherwal](https://www.github.com/akshsabherwal) (edited by [@jonjoncardoso](https://github.com/jonjoncardoso))

⚙️ **SETUP**

- Ensure you are running with the `chat-lse` conda environment. See [README.md](../../README.md) for more information.
- Install the packages we will need for this experiment:

    ```bash
    pip install PyPDF2 nest-asyncio==1.6.0 scikit-learn==1.5.0 spacy==3.7.5 nltk==3.8.1 sentence-transformers
    ```

**Imports**

In [10]:
import os
import re
import sys
import spacy
import nltk
import PyPDF2
import asyncio
import nest_asyncio

# Add ../../fastapi_app package to sys.path
sys.path.append(os.path.abspath('../../'))
import fastapi_app # Reuse the same embedding functions in our project
from fastapi_app.embeddings import compute_text_embedding

from tqdm.notebook import tqdm, trange
from transformers import GPT2Tokenizer

In [2]:
# NLTK needs additional setup
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/j.cardoso-
[nltk_data]     silva/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

**Constants**

In [3]:
DOCS_FOLDER = "./sample-docs/"

**Utils functions**

In [4]:
def read_pdf(file_path):
    # Initialize a variable to hold all the text
    all_text = ""
    
    # Open the PDF file
    with open(file_path, "rb") as file:
        # Initialize a PDF reader object
        pdf_reader = PyPDF2.PdfReader(file)
        
        # Iterate through each page in the PDF
        for page in pdf_reader.pages:
            # Extract text from the page
            text = page.extract_text()
            if text:
                all_text += text  # Append the extracted text to all_text

    return all_text

def clean_text(text):
    # Replace all newline characters with an empty string
    cleaned_text = re.sub(r'\n', '', text)
    # Replace two or more spaces with a single space
    cleaned_text = re.sub(r' {2,}', ' ', cleaned_text)
    # Replace a space followed by a period with just a period
    cleaned_text = re.sub(r' \.', '.', cleaned_text)
    # Replace a space followed by a comma with just a comma
    cleaned_text = re.sub(r' ,', ',', cleaned_text)
    return cleaned_text

# ***Summarised findings***

* I found three viable methods of chunking and two non-viable methods. The working ones are methods 1, 4, and 5
* The other two methods (2 and 3) rely on topic modelling/semantic similarity. These did not work because topics and semantics do not vary significantly within a single document.
* Method 1 is a sliding window method, and proved to be the second most effective method. It has the benefit of being universally applicable to all structures of text as we can determine the token size of the window. However, because of this, we also may encounter issues of hanging sentences 
* Method 4 is the method from the Google Colab notebook Riya shared. It is similar to a sliding window, but the window is measured in sentences rather than tokens, so some chunks may still exceed the embedding model's token limit.
* Method 5 is the best method I found, which uses llama-index's SentenceSplitter() functionality. We set a default chunk token size (512), and the function splits the text so that each chunk has at most 512 tokens. However, an additional benefit is that this function keeps sentences together in one chunk, so we do not have issues of hanging sentences like in method 1
* Essentially, method 5 is an improved version of method 4. It has method 4's benefit of chunking based on sentences to prevent hanging sentences, but we are also able to determine a chunk size to ensure we do not exceed token limits. Another big benefit is that the code for using SentenceSplitter() is very clean and simple.
* We choose a default chunk size of 512 because the GTE large model accepts a maximum of 512 tokens.

* In a bonus method, I combine the ideas of the sliding functionality with the SentenceSplitter(), so that we get the benefits of both. However, the issue that arises is that we get many (many) more total chunks and embeddings (roughly three times the amount). This is a tradeoff that we should discuss. 
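To make the trade-off in the last bullet a bit more concrete, here is a rough sketch for estimating how many chunks a windowed splitter produces. The function name `estimate_num_chunks` and the example numbers are illustrative assumptions, not measurements from this notebook.

```python
import math

def estimate_num_chunks(n_tokens: int, window: int, step: int) -> int:
    """Estimate how many chunks a sliding window yields for a text of n_tokens."""
    if n_tokens <= window:
        return 1
    # One chunk for the first window, then one more per step needed to reach the end
    return 1 + math.ceil((n_tokens - window) / step)

# Hypothetical document of 10,000 tokens
no_overlap = estimate_num_chunks(10_000, window=512, step=512)
heavy_overlap = estimate_num_chunks(10_000, window=512, step=170)
print(no_overlap, heavy_overlap)
```

With a step of about a third of the window, the chunk count (and therefore the number of embeddings to compute and store) grows to nearly three times the non-overlapping case.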

# 1. Read and clean PDFs

Read the PDFs from the docs folder and perform the necessary cleaning using regex.

In [5]:
path_all_pdfs = [os.path.join(DOCS_FOLDER, file) for file in os.listdir(DOCS_FOLDER)]
path_all_pdfs

['./sample-docs/LSE-2030-booklet.pdf',
 './sample-docs/ConfidentialityPolicy.pdf',
 './sample-docs/Appeals-Regulations.pdf',
 './sample-docs/InterruptionPolicy.pdf',
 './sample-docs/UG-Student-Handbook-Department-of-International-History-2023-24 (1).pdf',
 './sample-docs/In-Course-Financial-Support.pdf',
 './sample-docs/BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf',
 './sample-docs/comPro.pdf',
 './sample-docs/bsc-handbook-21.22.pdf',
 './sample-docs/Formatting-and-binding-your-thesis-2021-22.pdf',
 './sample-docs/Exam-Procedures-for-Candidates.pdf',
 './sample-docs/MSc-Mark-Frame.pdf',
 './sample-docs/Student-Guidance-Deferral.pdf',
 './sample-docs/Spring-Exam-Timetable-2024-Final.pdf']

Read all the PDFs into text:

In [6]:
test_texts = [read_pdf(file) for file in tqdm(path_all_pdfs)]

  0%|          | 0/14 [00:00<?, ?it/s]

Clean them up:

In [7]:
cleaned_texts = [clean_text(i) for i in test_texts]

# 2. Begin investigations...

If you run the commented-out code below, you will find that the full texts are too long to embed in one go... we need to find a way to chunk them, but in a way that keeps the chunks usable.

In [9]:
# from fastapi_app.embeddings import compute_text_embedding  
# from openai_clients import create_openai_embed_client
# import asyncio

# async def main():
#     # Create the OpenAI embedding client
#     client, model, dimensions = await create_openai_embed_client()
#     dimensions = int(dimensions) if dimensions else 1536  # Default dimensions if not set

#     # Iterate over texts and compute their embeddings
#     embeddings = []
#     for text in cleaned_texts:
#         embedding = await compute_text_embedding(
#             q=text,
#             openai_client=client,
#             embed_model=model,
#             embedding_dimensions=dimensions
#         )
#         embeddings.append(embedding)
    
#     # Example of how to use embeddings (here we just print them)
#     for i, embedding in enumerate(embeddings):
#         print(f"Embedding for Text {i+1}: {embedding}")

# # Run the asynchronous main function
# if __name__ == "__main__":
#     loop = asyncio.get_event_loop()
#     if loop.is_running():
#         # Reuse the existing running loop in Jupyter Notebook
#         tasks = asyncio.ensure_future(main())  # Schedule main to run
#         # You may run tasks.result() in another cell to get the result if needed
#     else:
#         # If somehow the loop is not running, use asyncio.run (unlikely in Jupyter)
#         asyncio.run(main())

# 3. Chunking methods

I found three viable methods of chunking and two non-viable methods. (The working ones are methods 1, 4, and 5)

Methods 2 and 3 rely upon topic modelling/semantic meaning. These did not work because topics and semantics do not vary significantly within a single document.

## 3.1 Method 1: Sliding window approach

I do not think sentence-based chunking is the best way to calculate embeddings for this context ... there will be issues with scalability and more importantly lost context. 

For this reason, I think it will be worth it to try a __sliding window method__, where we establish a "window" of a certain length upon which the embeddings will be calculated and a step size for the window. This will likely be a good idea because a majority of LSE documents do not have a uniform structure, but this method allows us to maintain a contextual link between each chunk.
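As a toy illustration of the idea (a minimal sketch with a tiny window so the overlap is visible; the helper name `sliding_window_tokens` is just for this example), each window starts `step_size` tokens after the previous one, so consecutive windows share `max_length - step_size` tokens:

```python
def sliding_window_tokens(tokens, max_length, step_size):
    """Return overlapping windows of tokens."""
    chunks = []
    for start in range(0, len(tokens), step_size):
        chunks.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

tokens = "the quick brown fox jumps over the lazy dog again".split()
for chunk in sliding_window_tokens(tokens, max_length=4, step_size=2):
    print(" ".join(chunk))
```

The overlap is what keeps a contextual link between neighbouring chunks, at the cost of embedding some tokens more than once.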

Let's start with a window of 8192 tokens and a step size of 4096 (half of the window):

In [None]:
# import nest_asyncio

# nest_asyncio.apply()

# def sliding_window(text, max_length=8192, step_size=4096):
#     tokens = text.split()  # Simple whitespace tokenizer
#     chunks = []

#     for i in range(0, len(tokens), step_size):
#         chunk = tokens[i:i + max_length]
#         chunk_text = " ".join(chunk)
#         chunks.append(chunk_text)
#         if i + max_length >= len(tokens):
#             break
    
#     return chunks

# import asyncio

# async def main():
#     client, model, dimensions = await create_openai_embed_client()
#     dimensions = int(dimensions) if dimensions else 1536  # Default dimensions if not set
#     all_embeddings = []
#     for text in cleaned_texts:
#         text_chunks = sliding_window(text)
#         embeddings = []
#         for chunk in text_chunks:
#             embedding = await compute_text_embedding(
#                 q=chunk,
#                 openai_client=client,
#                 embed_model=model,
#                 embedding_dimensions=dimensions
#             )
#             embeddings.append(embedding)
#         all_embeddings.append(embeddings)

#     for i, embeddings in enumerate(all_embeddings):
#         print(f"Embedding for Text {i+1}:")
#         for j, embedding in enumerate(embeddings):
#             print(f"  Chunk {j+1}: {embedding}")

# await main()

Seems that there are still issues with token length even though we specified a max length of 8192 tokens earlier... this could be because of differences in how whitespace and special characters are counted as tokens by the model.

For this reason, I'll try decreasing the token and step size of the sliding window function:

In [22]:
import pandas as pd

pd.Series([len(text.split()) for text in cleaned_texts])

0      1264
1       672
2      3113
3      1718
4     28918
5      3579
6      1398
7      4091
8     15573
9       687
10     8954
11      337
12     1270
13     6411
dtype: int64

In [23]:
nest_asyncio.apply()


def sliding_window(text, max_length=6500, step_size=3000):
    tokens = text.split()  # Simple whitespace tokenizer
    chunks = []

    for i in trange(0, len(tokens), step_size, desc="Creating sliding window chunks"):
        chunk = tokens[i : i + max_length]
        chunk_text = " ".join(chunk)
        chunks.append(chunk_text)
        if i + max_length >= len(tokens):
            break

    return chunks


embeddings_store = {}


async def main():

    text_id = 1  # A simple counter or identifier for each text
    text_chunks = sliding_window(cleaned_texts[4])
    embeddings = []
    for chunk in tqdm(text_chunks, desc="Computing embeddings for each chunk"):
        embedding = await compute_text_embedding(q=chunk)
        embeddings.append(embedding)
    embeddings_store[text_id] = embeddings
    text_id += 1

    for text_id, embeddings in embeddings_store.items():
        print(f"Embedding for Text {text_id}:")
        for i, embedding in enumerate(embeddings):
            print(f"  Chunk {i+1}: {embedding}")


await main()

embeddings_store

Creating sliding window chunks:   0%|          | 0/10 [00:00<?, ?it/s]

Computing embeddings for each chunk:   0%|          | 0/9 [00:00<?, ?it/s]



Embedding for Text 1:
  Chunk 1: [-0.004293882288038731, -0.014189883135259151, 0.011240503750741482, -0.04798765480518341, 0.0094725601375103, 0.009533737786114216, -0.02992323786020279, 0.015568594448268414, 0.00822216086089611, 0.025811173021793365, 0.025236191228032112, 0.0037564656231552362, 0.0038099316880106926, -0.014822551980614662, -0.006081395782530308, 0.017917048186063766, -0.03894433379173279, -0.024364955723285675, 0.007823136635124683, -0.025712719187140465, 0.0080458614975214, 0.01232099812477827, -0.06078542396426201, -0.05011631175875664, -0.010189472697675228, 0.030141141265630722, 0.03863455355167389, -0.012388666160404682, 0.07173485308885574, 0.058533404022455215, 0.009137784130871296, -0.01988179422914982, 0.01402102131396532, -0.04744552820920944, -0.008020171895623207, -0.012772041372954845, 0.03258080407977104, -0.04140235111117363, -0.0019221141701564193, -0.051797110587358475, 0.02018694020807743, -0.021568341180682182, 0.005327848717570305, -0.038103628903

{1: [[-0.004293882288038731,
   -0.014189883135259151,
   0.011240503750741482,
   ...

Looks like it worked for `cleaned_texts[4]` (the International History UG student handbook, the longest of our texts)... let's try it for the other ones as well

In [None]:
# import nest_asyncio

# nest_asyncio.apply()

# def sliding_window(text, max_length=6500, step_size=3000):
#     tokens = text.split()  # Simple whitespace tokenizer
#     chunks = []

#     for i in range(0, len(tokens), step_size):
#         chunk = tokens[i:i + max_length]
#         chunk_text = " ".join(chunk)
#         chunks.append(chunk_text)
#         if i + max_length >= len(tokens):
#             break
    
#     return chunks

# embeddings_store = {}

# async def main():
#     client, model, dimensions = await create_openai_embed_client()
#     dimensions = int(dimensions) if dimensions else 1536

#     text_id = 1  # A simple counter or identifier for each text
#     for text in cleaned_texts:
#         text_chunks = sliding_window(text)
#         embeddings = []
#         for chunk in text_chunks:
#             embedding = await compute_text_embedding(
#                 q=chunk,
#                 openai_client=client,
#                 embed_model=model,
#                 embedding_dimensions=dimensions
#             )
#             embeddings.append(embedding)
#         embeddings_store[text_id] = embeddings
#         text_id += 1

#     for text_id, embeddings in embeddings_store.items():
#         print(f"Embedding for Text {text_id}:")
#         for i, embedding in enumerate(embeddings):
#             print(f"  Chunk {i+1}: {embedding}")

# await main()

# embeddings_store

The error suggests that one of the texts still has 15001 tokens, which is strange since we specified that the limit of the window is 6500...

I found out that the method I used for chunking makes the crucial and incorrect assumption that one whitespace-separated word equals one token, which is not the case for OpenAI models... Let's use a byte-pair encoding (BPE) tokenizer instead to chunk more accurately based on the number of tokens.

This is still only an estimate of the number of tokens seen by this specific embeddings model, so we keep the window below the 8192-token limit.
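To see why counting whitespace-separated words undercounts, here is a small sketch. Note that `rough_subword_count` is a crude stand-in I made up for illustration (it splits off punctuation), not a BPE tokenizer; a real BPE tokenizer will give different numbers and usually splits rare words further.

```python
import re

def whitespace_count(text: str) -> int:
    # What text.split() gives us: one "token" per whitespace-separated word
    return len(text.split())

def rough_subword_count(text: str) -> int:
    # Crude approximation: runs of word characters and single punctuation marks count separately
    return len(re.findall(r"\w+|[^\w\s]", text))

sample = "Students must submit appeals within 10 working-days (see Section 4.2)."
print(whitespace_count(sample), rough_subword_count(sample))
```

Even this simple refinement gives a noticeably higher count than `text.split()`, which is consistent with a window of 6500 "words" still exceeding the model's token limit.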

In [None]:
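# create_openai_embed_client is not imported at the top of the notebook; same import as in the cells further down
from openai_clients import create_openai_embed_client

nest_asyncio.apply()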

# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def sliding_window(text, max_length=8000, step_size=3000):
    # Tokenize the text
    tokens = tokenizer.tokenize(text)
    
    chunks = []
    for i in range(0, len(tokens), step_size):
        chunk = tokens[i:i + max_length]
        chunk_text = tokenizer.convert_tokens_to_string(chunk)
        chunks.append(chunk_text)
        if i + max_length >= len(tokens):
            break
    
    return chunks

embeddings_store = {}

async def main():
    client, model, dimensions = await create_openai_embed_client()
    dimensions = int(dimensions) if dimensions else 1536

    text_id = 1  # A simple counter or identifier for each text
    for text in cleaned_texts:
        text_chunks = sliding_window(text)
        embeddings = []
        for chunk in text_chunks:
            embedding = await compute_text_embedding(
                q=chunk,
                openai_client=client,
                embed_model=model,
                embedding_dimensions=dimensions
            )
            embeddings.append(embedding)
        embeddings_store[text_id] = embeddings
        text_id += 1

    for text_id, embeddings in embeddings_store.items():
        print(f"Embedding for Text {text_id}:")
        for i, embedding in enumerate(embeddings):
            print(f"  Chunk {i+1}: {embedding}")

await main()

Note: I initially encountered an issue where the embedding for the first chunk of text 4 was a vector whose entries were all null. I fixed this by adding more substitutions to the regex cleaning function at the start.
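A cheap way to catch this kind of problem early is to validate each embedding before storing it. This is only a sketch; `is_valid_embedding` is a hypothetical helper, and it assumes embeddings come back as plain Python lists of numbers.

```python
import math

def is_valid_embedding(vector) -> bool:
    """True if the vector is non-empty and every entry is a finite number."""
    if not vector:
        return False
    return all(
        isinstance(x, (int, float)) and math.isfinite(x)
        for x in vector
    )

print(is_valid_embedding([0.1, -0.2, 0.3]))
print(is_valid_embedding([None, None, None]))
print(is_valid_embedding([0.1, float("nan")]))
```

Calling this right after `compute_text_embedding` would flag the offending chunk immediately instead of leaving a null vector in `embeddings_store`.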

## 3.2 Method 2: Topic modelling (incompatible with our purposes)

Another way we could try is by determining what the key topics are in each text and chunking based on that. 

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Assuming cleaned_texts is a list of document strings
for index, text in enumerate(cleaned_texts):
    document = [text]
    vectorizer = CountVectorizer(stop_words='english')
    X = vectorizer.fit_transform(document)

    # Step 2: Apply LDA
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    # Heading for the output
    print(f"\nText {index + 1} Analysis:")  # Dynamic heading for each text

    # Step 3: View topics (each topic as a list of words)
    def print_topics(model, vectorizer, top_n=20):
        for idx, topic in enumerate(model.components_):
            print(f"Topic {idx + 1}")
            print([(vectorizer.get_feature_names_out()[i], topic[i]) for i in topic.argsort()[:-top_n - 1:-1]])

    print_topics(lda, vectorizer)

    # Example of assigning topics to new documents
    doc_topic_dist = lda.transform(X)
    print("\nDocument topic distribution:")
    print(doc_topic_dist)


In [None]:
from nltk.tokenize import sent_tokenize  # needed by chunk_text_by_keywords below

def extract_keywords(model, vectorizer, top_n=20):
    topic_keywords = []
    for idx, topic in enumerate(model.components_):
        keywords = [vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-top_n - 1:-1]]
        topic_keywords.append(keywords)
    return topic_keywords

def chunk_text_by_keywords(text, keywords):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    # Initialize chunks (a list of lists, each list is a chunk of sentences)
    chunks = [[] for _ in keywords]

    # Assign sentences to the chunk of the topic they are most relevant to
    for sentence in sentences:
        # Count keyword occurrences in the sentence
        keyword_counts = [sum(sentence.count(keyword) for keyword in topic_keywords) for topic_keywords in keywords]
        # Find the topic with the maximum count of keywords
        max_topic = keyword_counts.index(max(keyword_counts))
        # Append the sentence to the corresponding chunk
        chunks[max_topic].append(sentence)

    # Join sentences back to form coherent chunks
    return [" ".join(chunk) for chunk in chunks]

for index, text in enumerate(cleaned_texts):
    document = [text]
    vectorizer = CountVectorizer(stop_words='english')
    X = vectorizer.fit_transform(document)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    # Extract keywords for the current text's topics
    keywords = extract_keywords(lda, vectorizer)
    print(f"\nText {index + 1} Keywords and Chunks:")
    for i, topic_keywords in enumerate(keywords):
        print(f"Topic {i + 1} Keywords: {topic_keywords}")

    # Chunk the text based on these keywords
    text_chunks = chunk_text_by_keywords(text, keywords)
    for i, chunk in enumerate(text_chunks):
        print(f"Chunk {i + 1} for Text {index + 1}: {chunk[:100]}...")  # Print the first 100 characters of each chunk


The issue seems to be that we have insufficient content for topic modelling: LDA is fitted on one document at a time, so it has very little to separate topics with... this could be the reason why the keywords are the same in topic 1 and topic 2, and so we effectively only get 1 chunk for each text. For this reason I do not think this is a viable method for chunking.
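One way to quantify "the keywords are the same" rather than eyeballing the printed lists is to measure the overlap between the two topics' keyword sets. A minimal sketch using Jaccard similarity; the keyword lists below are made up for illustration and are not the output of the LDA cell above.

```python
def keyword_overlap(topic_a, topic_b) -> float:
    """Jaccard similarity between two keyword lists (1.0 means identical sets)."""
    set_a, set_b = set(topic_a), set(topic_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical keyword lists
topic_1 = ["student", "exam", "school", "appeal"]
topic_2 = ["exam", "student", "appeal", "school"]
topic_3 = ["fees", "student", "loan", "exam"]
print(keyword_overlap(topic_1, topic_2))
print(keyword_overlap(topic_1, topic_3))
```

An overlap close to 1.0 between the topics returned by `extract_keywords` would confirm that the two topics are not really distinct, so chunking by topic cannot separate the text.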

## 3.3 Method 3: Semantic similarity-based splitting (incompatible with our purposes)

When I input a prompt into the chat interface, I would ideally want it to return sections that are directly relevant to my question. So, I think it would be worth trying ways to split sections of the docs semantically. It is similar to method 2, except we would be considering semantic similarity rather than topic modelling.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel


# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].mean(dim=0)


def alt_clean_text(text):
    # Same cleaning as clean_text, but newlines are kept so we can split on paragraphs
    # Replace two or more spaces with a single space
    cleaned_text = re.sub(r' {2,}', ' ', text)
    # Replace a space followed by a period with just a period
    cleaned_text = re.sub(r' \.', '.', cleaned_text)
    # Replace a space followed by a comma with just a comma
    cleaned_text = re.sub(r' ,', ',', cleaned_text)
    return cleaned_text

alt_cleaned_texts = [alt_clean_text(i) for i in test_texts]

# Example text
text = alt_cleaned_texts[-2]
paragraphs = text.split('\n\n')

# Compute embeddings
embeddings = [get_embedding(para) for para in paragraphs]

# Determine breakpoints based on embedding similarity
for i in range(1, len(embeddings)):
    # cosine_similarity returns a 1x1 matrix here, so take the single scalar value
    sim = cosine_similarity([embeddings[i-1].detach().numpy()], [embeddings[i].detach().numpy()])[0][0]
    if sim < 0.999:  # Threshold needs adjustment based on your specific needs
        print(f"Split between paragraph {i-1} and {i} due to low similarity: {sim:.4f}")



Hmmmmm, even though we have a very high threshold for cosine similarity, we still aren't getting any results of paragraph splitting. Perhaps this is because semantically-speaking, each document is self-contained, and so it would be difficult to differentiate based on this alone... let's try something else.
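For reference, here is the breakpoint logic in isolation, as a self-contained sketch that swaps the BERT embeddings for simple bag-of-words vectors (so it is not the same model, only the same thresholding idea; the paragraphs are made up). One thing it makes obvious: if `text.split('\n\n')` yields a single paragraph, the loop never runs and nothing is printed, so it is worth checking `len(paragraphs)` before blaming the threshold.

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    # Bag-of-words counts as a stand-in for a real embedding
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find_breakpoints(paragraphs, threshold):
    """Indices i where paragraph i is dissimilar enough from paragraph i-1 to start a new chunk."""
    vectors = [bow_vector(p) for p in paragraphs]
    return [i for i in range(1, len(vectors)) if cosine(vectors[i - 1], vectors[i]) < threshold]

paragraphs = [
    "exam appeals must be submitted in writing",
    "appeals about exam results must be submitted promptly",
    "library opening hours change during summer",
]
print(find_breakpoints(paragraphs, threshold=0.3))
print(find_breakpoints(["only one paragraph"], threshold=0.3))
```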

## 3.4 Method 4: D. Bourke's method (template repo)

In the Google Colab notebook, they use a method similar to method 1, but they omit the "sliding window" step and just split the text into chunks of 5, 7, or 10 sentences.

They also use NLP to handle splitting into sentences, which might be more robust and efficient.

In [None]:
from spacy.lang.en import English # see https://spacy.io/usage for install instructions

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/ 
nlp.add_pipe("sentencizer")

# Create a document instance as an example
doc = nlp("This is a sentence. This another sentence.")
assert len(list(doc.sents)) == 2

# Access the sentences of the document
print(list(doc.sents))

So all the code did is break up the string with two sentences into a "list" with two strings.

In [None]:
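import tqdm  # note: this shadows the `tqdm` function imported at the top, hence tqdm.tqdm below
import spacy

# Reuse the `nlp` sentencizer pipeline created in the previous cell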

# Initialize a dictionary to store results
results = {}

# Process each text in the list with its index
for index, text in enumerate(tqdm.tqdm(cleaned_texts)):
    # Analyze the text with spaCy to get sentences
    doc = nlp(text)
    sentences = list(doc.sents)
    
    # Convert all Sentence objects to strings
    sentences = [str(sentence) for sentence in sentences]
    
    # Use the index as the key for each document's results
    results[f"document_{index}"] = {
        "sentences": sentences,
        "sentence_count": len(sentences)
    }


In [None]:
from pprint import pprint 

pprint(results)

In [None]:
results

In [None]:
import tqdm

# Create a list of dictionaries from cleaned texts to ensure compatibility with the method prescribed in the Google Colab notebook
dict_cleaned_texts = [{'text': text} for text in cleaned_texts]

for item in tqdm.tqdm(dict_cleaned_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    
    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    # Count the sentences 
    item["page_sentence_count_spacy"] = len(item["sentences"])

# Define split size to turn groups of sentences into chunks

num_sentence_chunk_size = 5

# Create a function that splits a list into sublists of the desired size
def split_list(input_list: list, 
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, with slice_size=10, a list of 17 sentences would be split into two lists of 10 and 7 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm.tqdm(dict_cleaned_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

In [None]:
dict_cleaned_texts[0]["sentence_chunks"]

In [None]:
for item in dict_cleaned_texts:
    # Convert each list of sentences in sentence_chunks to a single concatenated string
    item["sentence_chunks"] = [' '.join(chunk) for chunk in item["sentence_chunks"]]

print(dict_cleaned_texts[0]["sentence_chunks"])


In [None]:
from embeddings import compute_text_embedding  
from openai_clients import create_openai_embed_client
from transformers import GPT2Tokenizer
import nest_asyncio
import asyncio

nest_asyncio.apply()

# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

embeddings_store = {}

async def main():
    client, model, dimensions = await create_openai_embed_client()
    dimensions = int(dimensions) if dimensions else 1536

    text_id = 1  # A simple counter or identifier for each text
    
    # Loop through each dictionary in the list
    for item in dict_cleaned_texts:
        sentence_chunks = item['sentence_chunks']  # Accessing sentence_chunks from each dictionary
        embeddings = []
        for chunk in sentence_chunks:
            embedding = await compute_text_embedding(
                q=chunk,
                openai_client=client,
                embed_model=model,
                embedding_dimensions=dimensions
            )
            embeddings.append(embedding)
        embeddings_store[text_id] = embeddings
        text_id += 1

    for text_id, embeddings in embeddings_store.items():
        print(f"Embedding for Text {text_id}:")
        for i, embedding in enumerate(embeddings):
            print(f"  Chunk {i+1}: {embedding}")

await main()


Note: So far, I have been using the method in the Google Colab notebook. However, I have run into the same token length issue as before, even though I have restricted each chunk to 5 sentences. I will need to make use of the sliding window function from before again...

In [None]:
from embeddings import compute_text_embedding  
from openai_clients import create_openai_embed_client
from transformers import GPT2Tokenizer
import nest_asyncio
import asyncio

nest_asyncio.apply()

# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def sliding_window(text, max_length=8000, step_size=3000):
    tokens = tokenizer.tokenize(text)
    chunks = []
    for i in range(0, len(tokens), step_size):
        chunk = tokens[i:i + max_length]
        chunk_text = tokenizer.convert_tokens_to_string(chunk)
        chunks.append(chunk_text)
        if i + max_length >= len(tokens):
            break
    return chunks

embeddings_store = {}

async def main():
    client, model, dimensions = await create_openai_embed_client()
    dimensions = int(dimensions) if dimensions else 1536

    text_id = 1  # A simple counter or identifier for each text
    
    # Loop through each dictionary in the list
    for item in dict_cleaned_texts:
        sentence_chunks = item['sentence_chunks']  # Accessing sentence_chunks from each dictionary
        embeddings = []
        for chunk in sentence_chunks:
            sub_chunks = sliding_window(chunk)  # Use sliding_window to handle long chunks
            chunk_embeddings = []
            for sub_chunk in sub_chunks:
                embedding = await compute_text_embedding(
                    q=sub_chunk,
                    openai_client=client,
                    embed_model=model,
                    embedding_dimensions=dimensions
                )
                chunk_embeddings.append(embedding)
            embeddings.append(chunk_embeddings)
        embeddings_store[text_id] = embeddings
        text_id += 1

    for text_id, embeddings in embeddings_store.items():
        print(f"Embedding for Text {text_id}:")
        for i, chunk_embeddings in enumerate(embeddings):
            print(f"  Chunk {i+1}:")
            for sub_i, embedding in enumerate(chunk_embeddings):
                print(f"    Sub-chunk {sub_i+1}: {embedding}")

await main()


## 3.5 Method 5: llama-index SentenceSplitter function 

An issue with purely token-based methods is that chunks can start or end mid-sentence, so certain phrases or words are embedded incomplete. SentenceSplitter resolves that by trying to keep paragraphs and sentences together. In effect, this is a combination of methods 1 and 4, which helps to compensate for the shortfalls of both. The shortfall of method 1 (sliding window) is hanging sentences; the shortfall of method 4 is token counts exceeding the model's allowance.
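To illustrate the idea without any dependencies, here is a minimal sketch of sentence-aware chunking under a token budget. This is not llama-index's implementation: it uses a naive regex sentence splitter and counts whitespace-separated words as tokens, and the names `split_sentences` and `pack_sentences` are made up for this example.

```python
import re

def split_sentences(text: str):
    # Naive sentence splitter: break after ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def pack_sentences(text: str, max_tokens: int):
    """Greedily group whole sentences into chunks of at most max_tokens whitespace tokens."""
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = ("Appeals must be in writing. They are due within ten days. "
        "Late appeals are rejected. Contact the office for help.")
for chunk in pack_sentences(text, max_tokens=12):
    print(chunk)
```

Every chunk ends on a sentence boundary, so there are no hanging sentences. In this sketch a single sentence longer than `max_tokens` would still end up as an oversized chunk; as far as I understand, SentenceSplitter handles that case by falling back to finer-grained splits.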

In [None]:
from llama_index.core.node_parser import SentenceSplitter

# Assuming constants and imports are defined elsewhere
DEFAULT_CHUNK_SIZE = 500  # This means each chunk has at most 500 tokens
SENTENCE_CHUNK_OVERLAP = 50  # Example overlap
CHUNKING_REGEX = r"[^,\.;]+[,\.;]?"  # Simple sentence splitter regex
DEFAULT_PARAGRAPH_SEP = "\n\n"  # Paragraph separator

# Import required functions and classes - make sure these are defined
# from your_module import get_tokenizer, split_by_sentence_tokenizer, split_by_sep, split_by_regex, split_by_char, CallbackManager, default_id_func

class _Split:
    def __init__(self, text, is_sentence, token_size):
        self.text = text
        self.is_sentence = is_sentence
        self.token_size = token_size

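# Instantiate SentenceSplitter with the chunk size and overlap defined above
# (without these arguments the library defaults are used instead)
splitter = SentenceSplitter(chunk_size=DEFAULT_CHUNK_SIZE, chunk_overlap=SENTENCE_CHUNK_OVERLAP)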
import nest_asyncio
from transformers import GPT2Tokenizer
from embeddings import compute_text_embedding  
from openai_clients import create_openai_embed_client

nest_asyncio.apply()

# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def sliding_window(text, max_length=8000, step_size=3000):
    tokens = tokenizer.tokenize(text)
    chunks = []
    for i in range(0, len(tokens), step_size):
        chunk = tokens[i:i + max_length]
        chunk_text = tokenizer.convert_tokens_to_string(chunk)
        chunks.append(chunk_text)
        if i + max_length >= len(tokens):
            break
    return chunks

embeddings_store = {}

async def main():
    client, model, dimensions = await create_openai_embed_client()
    dimensions = int(dimensions) if dimensions else 1536

    text_id = 1  # A simple counter or identifier for each text
    
    # Create the splitter with the chunk size and overlap defined above
    splitter = SentenceSplitter(chunk_size=DEFAULT_CHUNK_SIZE, chunk_overlap=SENTENCE_CHUNK_OVERLAP)

    # Loop through each text in cleaned_texts, chunk it, then calculate embeddings
    for text in cleaned_texts:
        print(f"\nProcessing Text {text_id}:")
        sentence_chunks = splitter.split_text(text)  # Use your SentenceSplitter to split the text
        embeddings = []
        
        for i, chunk in enumerate(sentence_chunks):
            print(f"  Chunk {i+1}: {chunk[:100]}...")  # Print first 100 characters of each chunk for brevity
            sub_chunks = sliding_window(chunk)  # Handle long chunks
            chunk_embeddings = []
            
            for j, sub_chunk in enumerate(sub_chunks):
                print(f"    Sub-chunk {j+1}: {sub_chunk[:50]}...")  # Print first 50 characters of each sub-chunk
                embedding = await compute_text_embedding(
                    q=sub_chunk,
                    openai_client=client,
                    embed_model=model,
                    embedding_dimensions=dimensions
                )
                chunk_embeddings.append(embedding)
                print(f"      Embedding for Sub-chunk {j+1}: {embedding[:10]}...")  # Print first 10 elements of embedding array

            embeddings.append(chunk_embeddings)
        
        embeddings_store[text_id] = embeddings
        text_id += 1

    print("\nFinished processing all texts.")

# Run the main function in the asyncio event loop
await main()
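
To make the behaviour of `sliding_window()` easier to see, here is a minimal sketch of the same windowing logic on a toy list of integers instead of GPT-2 tokens. The helper name `sliding_window_tokens` and the small `max_length`/`step_size` values are illustrative assumptions, not part of the project code.

```python
def sliding_window_tokens(tokens, max_length, step_size):
    """Return overlapping windows of at most `max_length` items, `step_size` apart."""
    windows = []
    for start in range(0, len(tokens), step_size):
        windows.append(tokens[start:start + max_length])
        # Stop as soon as a window reaches the end, to avoid redundant tail windows
        if start + max_length >= len(tokens):
            break
    return windows

toy_tokens = list(range(10))  # stand-in for a tokenized text
windows = sliding_window_tokens(toy_tokens, max_length=4, step_size=2)
for window in windows:
    print(window)
```

Consecutive windows share `max_length - step_size` items. With the values used in the cell above (`max_length=8000`, `step_size=3000`), that is an overlap of 5000 tokens between neighbouring sub-chunks.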



## 3.6 Method 6: Misc - testing GTE embedding model

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding 

embed_model = HuggingFaceEmbedding(model_name="thenlper/gte-large") 

embeddings = embed_model.get_text_embedding("Hello World!") 
print(len(embeddings))
print(embeddings)

Now that we have installed the necessary dependencies and set up the necessary objects, we will test the embeddings on `cleaned_texts`.

In [None]:
DEFAULT_CHUNK_SIZE = 512  # 512 is the model's maximum input length; longer inputs are truncated to 512 tokens

# Initialize the tokenizer and the embedding model
embed_model = HuggingFaceEmbedding(model_name="thenlper/gte-large")

embeddings_store = {}

def main():
    text_id = 1  # A simple counter or identifier for each text
    
    # Initialize the SentenceSplitter with an explicit chunk size and overlap
    splitter = SentenceSplitter(chunk_size=DEFAULT_CHUNK_SIZE, chunk_overlap=SENTENCE_CHUNK_OVERLAP)

    # Loop through each text in cleaned_texts, split it into chunks, then calculate embeddings
    for text in cleaned_texts:
        print(f"\nProcessing Text {text_id}:")
        sentence_chunks = splitter.split_text(text)  # Split on sentence boundaries
        embeddings = []
        
        for i, chunk in enumerate(sentence_chunks):
            print(f"  Chunk {i+1}: {chunk[:100]}...")  # Print first 100 characters of each chunk for brevity
            embedding = embed_model.get_text_embedding(chunk)
            embeddings.append(embedding)
            print(f" Embedding for Chunk {i+1}: {embedding[:10]}...")  # Print first 10 elements of embedding array
        
        embeddings_store[text_id] = embeddings
        text_id += 1

    print("\nFinished processing all texts.")

# Run the main function
main()


In [None]:
print(sentence_chunks)

### 3.6.1 Bonus combined method: Integrating SentenceSplitter() with sliding_window()

Method 5 is the best so far and will likely be sufficient for our purposes. However, a nice benefit of the sliding window method was that the "sliding" functionality let us retain more context between paragraphs. Let's see if we can combine the two, so that we get overlapping sub-chunks without hanging sentences.

We cannot reuse our previous sliding window function, because it cuts on token positions and would bring back the hanging sentences. Instead, we can split the text into even smaller sentence-aligned chunks and then calculate each embedding over two consecutive chunks at a time. For instance, given chunks 1, 2, 3, 4, ..., we would calculate embeddings for chunks 1 and 2, then 2 and 3, then 3 and 4, and so on.
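
As a quick illustration of that pairing scheme, here is a minimal sketch on toy strings. The helper name `pair_consecutive` and the sample chunks are made up for this example.

```python
def pair_consecutive(chunks, sep=" "):
    """Join each chunk with its successor: (1, 2), (2, 3), (3, 4), ..."""
    return [chunks[i] + sep + chunks[i + 1] for i in range(len(chunks) - 1)]

toy_chunks = ["chunk 1", "chunk 2", "chunk 3", "chunk 4"]
pairs = pair_consecutive(toy_chunks)
for pair in pairs:
    print(pair)
```

Every chunk except the first and last appears in two pairs, which is what carries context across chunk boundaries. Note that `n` chunks give `n - 1` pairs, so a text that fits in a single chunk needs to be handled separately.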

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="thenlper/gte-large")

def create_overlapping_chunks(sentence_chunks):
    """Merge each chunk with its successor: (1, 2), (2, 3), (3, 4), ..."""
    # With fewer than two chunks there is nothing to merge; keep the text as-is
    if len(sentence_chunks) < 2:
        return list(sentence_chunks)
    combined_chunks = []
    for i in range(len(sentence_chunks) - 1):
        # Merge two consecutive chunks
        combined_chunk = sentence_chunks[i] + " " + sentence_chunks[i + 1]
        combined_chunks.append(combined_chunk)
    return combined_chunks

def compute_embeddings_for_chunks(chunks):
    """Compute embeddings for each chunk."""
    embeddings = []
    for chunk in chunks:
        embedding = embed_model.get_text_embedding(chunk)
        embeddings.append(embedding)
    return embeddings

async def main():
    # Split each text into sentence-aligned chunks, then embed consecutive pairs
    embeddings_store = {}
    text_id = 1
    splitter = SentenceSplitter(chunk_size=206, chunk_overlap=SENTENCE_CHUNK_OVERLAP)
    for text in cleaned_texts:
        print(f"Processing Text {text_id}:")
        sentence_chunks = splitter.split_text(text)  # Split text into sentence chunks
        
        # Create overlapping chunks from the sentence chunks
        overlapping_chunks = create_overlapping_chunks(sentence_chunks)
        
        # Compute embeddings for each overlapping chunk
        embeddings = compute_embeddings_for_chunks(overlapping_chunks)
        
        embeddings_store[text_id] = embeddings
        text_id += 1

    # Print or process the embeddings
    for text_id, embeddings in embeddings_store.items():
        print(f"Embeddings for Text {text_id}:")
        for i, embedding in enumerate(embeddings):
            print(f"  Embedding {i+1}: {embedding[:10]}...")  # Show first 10 elements for brevity

await main()


In [None]:
import tiktoken

# Initialize the TikToken tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

def count_tokens(text, tokenizer):
    """Tokenize the input text using TikToken and return the number of tokens."""
    tokens = tokenizer.encode(text)
    return len(tokens)

def create_overlapping_chunks(sentence_chunks):
    """Merge each chunk with its successor: (1, 2), (2, 3), (3, 4), ..."""
    # With fewer than two chunks there is nothing to merge; keep the text as-is
    if len(sentence_chunks) < 2:
        return list(sentence_chunks)
    combined_chunks = []
    for i in range(len(sentence_chunks) - 1):
        # Merge two consecutive chunks
        combined_chunk = sentence_chunks[i] + " " + sentence_chunks[i + 1]
        combined_chunks.append(combined_chunk)
    return combined_chunks

# Use the same splitter settings as the previous cell so the counts describe those chunks
splitter = SentenceSplitter(chunk_size=206, chunk_overlap=SENTENCE_CHUNK_OVERLAP)

for text in cleaned_texts:
    sentence_chunks = splitter.split_text(text)
    overlapping_chunks = create_overlapping_chunks(sentence_chunks)
    
    # Count tokens using the TikToken tokenizer
    for chunk in overlapping_chunks:
        print(count_tokens(chunk, tokenizer))
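
The printed counts are only useful if we compare them against the 512-token limit mentioned earlier for the embedding model. Below is a minimal sketch of such a check. The helper `find_oversized` and the example counts are hypothetical, and because the embedding model ships its own tokenizer, counts obtained with `cl100k_base` should be treated as approximate.

```python
MODEL_MAX_TOKENS = 512  # limit quoted earlier for the embedding model

def find_oversized(token_counts, limit=MODEL_MAX_TOKENS):
    """Return (index, count) for every chunk whose token count exceeds `limit`."""
    return [(i, n) for i, n in enumerate(token_counts) if n > limit]

# Made-up token counts for four merged chunks
example_counts = [398, 412, 530, 120]
oversized = find_oversized(example_counts)
print(oversized)
```

Any chunk reported here would be truncated by the model, so its tail would not contribute to the embedding.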


## 3.7 Method 7: Retrying NLP-based methods

In [None]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

def dynamic_chunk(text, max_length=512):
    chunks = []
    current_chunk = []
    current_length = 0
    doc = nlp(text)
    sentences = [sentence.text for sentence in doc.sents]

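    for sentence in sentences:
        # Length is measured in whitespace-separated words, not model tokens
        sentence_length = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit
        # (checking `current_chunk` avoids emitting an empty first chunk)
        if current_chunk and current_length + sentence_length > max_length: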
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_length = sentence_length
        else:
            current_chunk.append(sentence)
            current_length += sentence_length

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks

# Apply chunking to each text in the list
all_chunks = [dynamic_chunk(text) for text in cleaned_texts]

# Optionally, print the chunks to verify
for text_chunks in all_chunks:
    for chunk in text_chunks:
        print(chunk)
        print("\n---\n")  # Separates chunks for readability


In [None]:
print(all_chunks)

In [None]:
# Length of each chunk of the first text, in characters (not words or tokens)
for sub_chunk in all_chunks[0]:
    print(len(sub_chunk))

In [None]:

embeddings_store = {}
text_id = 1
sub_chunk_text_id = 1
for chunk in all_chunks:
    print(f"Processing Text {text_id}:")
    for sub_chunk in chunk:
        print(f"Processing Sub-Text {sub_chunk_text_id}:")
        print(f"Text being processed: {sub_chunk[:100]}")
        # Each sub_chunk is a single string, so embed it directly rather than
        # passing it to compute_embeddings_for_chunks (which would iterate over characters)
        embedding = embed_model.get_text_embedding(sub_chunk)
        # Key by the counter itself, not by the literal string "sub_chunk_text_id"
        embeddings_store[sub_chunk_text_id] = embedding
        print(f"Embedding for {sub_chunk_text_id}: {embedding[:10]}")
        sub_chunk_text_id += 1

    text_id += 1

