# NB01 - Chunking PDFs

## Setup - Read PDFs from docs folder and perform necessary cleaning using regex

In [3]:
import PyPDF2

def read_pdf(file_path):
    # Initialize a variable to hold all the text
    all_text = ""
    
    # Open the PDF file
    with open(file_path, "rb") as file:
        # Initialize a PDF reader object
        pdf_reader = PyPDF2.PdfReader(file)
        
        # Iterate through each page in the PDF
        for page in pdf_reader.pages:
            # Extract text from the page
            text = page.extract_text()
            if text:
                all_text += text  # Append the extracted text to all_text

    return all_text

appeals_text = read_pdf("../docs/Appeals-Regulations.pdf")
classification_text = read_pdf("../docs/BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf")
complaints_text = read_pdf("../docs/comPro.pdf")
procedure_text = read_pdf("../docs/Exam-Procedures-for-Candidates.pdf")
finsupport_text = read_pdf("../docs/In-Course-Financial-Support.pdf")
interruption_text = read_pdf("../docs/InterruptionPolicy.pdf")
timetable_text = read_pdf("../docs/Spring-Exam-Timetable-2024-Final.pdf")
deferral_text = read_pdf("../docs/Student-Guidance-Deferral.pdf")

test_texts = [appeals_text, classification_text, complaints_text, procedure_text, finsupport_text, interruption_text, timetable_text, deferral_text] #save in a list for efficiency for now


In [22]:
import re

def clean_text(text):
    # Remove all newline characters (note: this can fuse the words on either side of a line break)
    cleaned_text = re.sub(r'\n', '', text)
    # Replace two or more spaces with a single space
    cleaned_text = re.sub(r' {2,}', ' ', cleaned_text)
    # Replace a space followed by a period with just a period
    cleaned_text = re.sub(r' \.', '.', cleaned_text)
    # Replace a space followed by a comma with just a comma
    cleaned_text = re.sub(r' ,', ',', cleaned_text)
    return cleaned_text

cleaned_texts = [clean_text(i) for i in test_texts]
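
One detail of the cleaning step is worth checking on a small sample: what newlines are replaced with. The sketch below is illustrative only. The sample string and the helper name `join_lines` are made up for this example, and it simply compares replacing `\n` with an empty string against replacing it with a space.

```python
import re

# Hypothetical sample resembling text extracted from a PDF page
sample = "Academic Appeals\nRegulations for Taught  Programmes ."

def join_lines(text, replacement):
    """Apply the same regex steps as clean_text, with a configurable newline replacement."""
    out = re.sub(r'\n', replacement, text)
    out = re.sub(r' {2,}', ' ', out)
    out = re.sub(r' \.', '.', out)
    out = re.sub(r' ,', ',', out)
    return out

fused = join_lines(sample, '')    # words on either side of the line break are fused
spaced = join_lines(sample, ' ')  # words stay separate

print(fused)
print(spaced)
```

Whether fused words matter depends on how the PDF extractor places line breaks, so it is worth inspecting a few pages of real output before settling on one option.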

## Begin investigations into finding most appropriate embeddings

In [2]:
from embeddings import compute_text_embedding  
from openai_clients import create_openai_embed_client
import asyncio

async def main():
    # Create the OpenAI embedding client
    client, model, dimensions = await create_openai_embed_client()
    dimensions = int(dimensions) if dimensions else 1536  # Default dimensions if not set

    # Iterate over texts and compute their embeddings
    embeddings = []
    for text in cleaned_texts:
        embedding = await compute_text_embedding(
            q=text,
            openai_client=client,
            embed_model=model,
            embedding_dimensions=dimensions
        )
        embeddings.append(embedding)
    
    # Example of how to use embeddings (here we just print them)
    for i, embedding in enumerate(embeddings):
        print(f"Embedding for Text {i+1}: {embedding}")

# Run the asynchronous main function
if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    if loop.is_running():
        # Reuse the existing running loop in Jupyter Notebook
        tasks = asyncio.ensure_future(main())  # Schedule main to run
        # You may run tasks.result() in another cell to get the result if needed
    else:
        # If somehow the loop is not running, use asyncio.run (unlikely in Jupyter)
        asyncio.run(main())

### The full texts are clearly too long to embed in a single request, so we need to find a way to chunk them that keeps the chunks usable.

## Method 1: Sliding window approach

I do not think sentence-based chunking is the best way to calculate embeddings in this context: it produces a very large number of small chunks, which scales poorly, and more importantly each sentence loses the context around it.

For this reason, I think it is worth trying a __sliding window method__, where we establish a "window" of a certain length over which each embedding is calculated, and a step size by which the window advances. This is likely to be a good fit because the majority of LSE documents do not have a uniform structure, and because the overlap between consecutive windows maintains a contextual link between chunks.

Let's start with a window of 8192 and a step size of 4096 (half of 8192), so that consecutive chunks overlap by half.
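
To make the overlap concrete, here is a minimal sketch on a toy sequence of ten items, with a window of 4 and a step of 2 (values chosen only for illustration). The helper name `windows` is hypothetical; it just mirrors the window and step idea described above.

```python
def windows(items, max_length, step_size):
    """Return overlapping slices of `items`; consecutive full windows share
    max_length - step_size elements."""
    chunks = []
    for start in range(0, len(items), step_size):
        chunks.append(items[start:start + max_length])
        # Stop once the current window already reaches the end of the sequence
        if start + max_length >= len(items):
            break
    return chunks

toy = list(range(10))
toy_chunks = windows(toy, max_length=4, step_size=2)
for chunk in toy_chunks:
    print(chunk)
```

Each window repeats the last two items of the previous one, which is the contextual link the method relies on.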

In [12]:
import nest_asyncio

nest_asyncio.apply()

def sliding_window(text, max_length=8192, step_size=4096):
    tokens = text.split()  # Simple whitespace tokenizer
    chunks = []

    for i in range(0, len(tokens), step_size):
        chunk = tokens[i:i + max_length]
        chunk_text = " ".join(chunk)
        chunks.append(chunk_text)
        if i + max_length >= len(tokens):
            break
    
    return chunks

import asyncio

async def main():
    client, model, dimensions = await create_openai_embed_client()
    dimensions = int(dimensions) if dimensions else 1536  # Default dimensions if not set
    all_embeddings = []
    for text in cleaned_texts:
        text_chunks = sliding_window(text)
        embeddings = []
        for chunk in text_chunks:
            embedding = await compute_text_embedding(
                q=chunk,
                openai_client=client,
                embed_model=model,
                embedding_dimensions=dimensions
            )
            embeddings.append(embedding)
        all_embeddings.append(embeddings)

    for i, embeddings in enumerate(all_embeddings):
        print(f"Embedding for Text {i+1}:")
        for j, embedding in enumerate(embeddings):
            print(f"  Chunk {j+1}: {embedding}")

await main()

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 9865 tokens (9865 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

There are still issues with token length even though we specified a maximum length of 8192 earlier. The likely cause is that `sliding_window` counts whitespace-separated words, whereas the model counts its own tokens, and a single word (or a run of digits and punctuation) can be split into several tokens.

For this reason, I'll try decreasing the window length and step size of the sliding window function.

In [21]:
import nest_asyncio

nest_asyncio.apply()

def sliding_window(text, max_length=6500, step_size=3000):
    tokens = text.split()  # Simple whitespace tokenizer
    chunks = []

    for i in range(0, len(tokens), step_size):
        chunk = tokens[i:i + max_length]
        chunk_text = " ".join(chunk)
        chunks.append(chunk_text)
        if i + max_length >= len(tokens):
            break
    
    return chunks

import asyncio

embeddings_store = {}

async def main():
    client, model, dimensions = await create_openai_embed_client()
    dimensions = int(dimensions) if dimensions else 1536

    text_id = 1  # A simple counter or identifier for each text
    text_chunks = sliding_window(cleaned_texts[0])
    embeddings = []
    for chunk in text_chunks:
        embedding = await compute_text_embedding(
            q=chunk,
            openai_client=client,
            embed_model=model,
            embedding_dimensions=dimensions
        )
        embeddings.append(embedding)
    embeddings_store[text_id] = embeddings
    text_id += 1

    for text_id, embeddings in embeddings_store.items():
        print(f"Embedding for Text {text_id}:")
        for i, embedding in enumerate(embeddings):
            print(f"  Chunk {i+1}: {embedding}")

await main()

embeddings_store

Embedding for Text 1:
  Chunk 1: [0.02551848627626896, 0.015385559760034084, 0.008457403630018234, -0.032792385667562485, -0.01761959120631218, 0.017433421686291695, 0.0030501838773489, -0.003171526361256838, -0.009315112605690956, -0.03712746873497963, -0.02503976598381996, -0.012958711013197899, 0.0031632152386009693, 0.00879649817943573, -0.0014178784331306815, -0.0036801674868911505, 0.0374998077750206, 0.00802522525191307, 0.0314360111951828, 0.01686161570250988, -0.021236594766378403, -0.008111661300063133, -0.02811155840754509, -0.0008759928750805557, -0.011489302851259708, 0.0037998477928340435, 0.0030501838773489, -0.01752650737762451, 0.009674153290688992, -0.021967973560094833, 0.004537875764071941, -0.01256642583757639, -0.027233904227614403, -0.0015674787573516369, -0.011894886381924152, 0.015225986018776894, -0.0037134119775146246, 0.008849688805639744, 0.03343068063259125, 0.0005597544368356466, 0.03757959604263306, 0.007101027760654688, -0.0017469990998506546, -0.001181

{1: [[0.02551848627626896,
   0.015385559760034084,
   0.008457403630018234,
   -0.032792385667562485,
   -0.01761959120631218,
   0.017433421686291695,
   0.0030501838773489,
   -0.003171526361256838,
   -0.009315112605690956,
   -0.03712746873497963,
   -0.02503976598381996,
   -0.012958711013197899,
   0.0031632152386009693,
   0.00879649817943573,
   -0.0014178784331306815,
   -0.0036801674868911505,
   0.0374998077750206,
   0.00802522525191307,
   0.0314360111951828,
   0.01686161570250988,
   -0.021236594766378403,
   -0.008111661300063133,
   -0.02811155840754509,
   -0.0008759928750805557,
   -0.011489302851259708,
   0.0037998477928340435,
   0.0030501838773489,
   -0.01752650737762451,
   0.009674153290688992,
   -0.021967973560094833,
   0.004537875764071941,
   -0.01256642583757639,
   -0.027233904227614403,
   -0.0015674787573516369,
   -0.011894886381924152,
   0.015225986018776894,
   -0.0037134119775146246,
   0.008849688805639744,
   0.03343068063259125,
   0.00055975

Looks like it worked for the "Appeals Regulations" text... let's try it for the other ones as well.

In [23]:
import nest_asyncio

nest_asyncio.apply()

def sliding_window(text, max_length=6500, step_size=3000):
    tokens = text.split()  # Simple whitespace tokenizer
    chunks = []

    for i in range(0, len(tokens), step_size):
        chunk = tokens[i:i + max_length]
        chunk_text = " ".join(chunk)
        chunks.append(chunk_text)
        if i + max_length >= len(tokens):
            break
    
    return chunks

embeddings_store = {}

async def main():
    client, model, dimensions = await create_openai_embed_client()
    dimensions = int(dimensions) if dimensions else 1536

    text_id = 1  # A simple counter or identifier for each text
    for text in cleaned_texts:
        text_chunks = sliding_window(text)
        embeddings = []
        for chunk in text_chunks:
            embedding = await compute_text_embedding(
                q=chunk,
                openai_client=client,
                embed_model=model,
                embedding_dimensions=dimensions
            )
            embeddings.append(embedding)
        embeddings_store[text_id] = embeddings
        text_id += 1

    for text_id, embeddings in embeddings_store.items():
        print(f"Embedding for Text {text_id}:")
        for i, embedding in enumerate(embeddings):
            print(f"  Chunk {i+1}: {embedding}")

await main()

embeddings_store

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 8192 tokens, however you requested 15001 tokens (15001 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

The error shows that one of the chunks still contains 15001 tokens, which is strange since we set the limit of the window to 6500...

I have found that the splitting method makes the crucial and incorrect assumption that one whitespace-separated word equals one token, which is not the case for OpenAI models... Let's use a byte-pair encoding (BPE) tokenizer instead to chunk more accurately based on the number of tokens.

The GPT-2 tokenizer used here is still only an approximation of how this specific embeddings model counts tokens, so the window needs to stay below the 8192-token limit to leave some headroom.
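
A tokenizer-agnostic version of the window function can make it easier to swap in whichever tokenizer best matches the embedding model. The sketch below is an illustration under stated assumptions, not the method used in this notebook: `sliding_window_tokens`, `encode` and `decode` are hypothetical names, and a toy character-level codec stands in for a real tokenizer so that the example runs without any downloads. With a real BPE tokenizer you would pass its own encode and decode functions instead.

```python
def sliding_window_tokens(text, encode, decode, max_tokens, step_size):
    """Chunk `text` so that each chunk holds at most `max_tokens` tokens,
    as counted by the supplied `encode` function."""
    if max_tokens <= 0 or step_size <= 0:
        raise ValueError("max_tokens and step_size must be positive")
    token_ids = encode(text)
    chunks = []
    for start in range(0, len(token_ids), step_size):
        chunks.append(decode(token_ids[start:start + max_tokens]))
        # Stop once the current window already reaches the end of the text
        if start + max_tokens >= len(token_ids):
            break
    return chunks

# Toy stand-in codec (assumption for illustration): one token per character
def toy_encode(text):
    return [ord(ch) for ch in text]

def toy_decode(ids):
    return "".join(chr(i) for i in ids)

token_chunks = sliding_window_tokens("abcdefghij", toy_encode, toy_decode,
                                     max_tokens=4, step_size=3)
print(token_chunks)
```

With a real BPE tokenizer, re-encoding a decoded chunk is not guaranteed to give exactly the same token count at the chunk boundaries, which is one more reason to keep the window somewhat below the model limit.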

In [25]:
from transformers import GPT2Tokenizer
import nest_asyncio

nest_asyncio.apply()

# Initialize the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def sliding_window(text, max_length=8000, step_size=3000):
    # Tokenize the text
    tokens = tokenizer.tokenize(text)
    
    chunks = []
    for i in range(0, len(tokens), step_size):
        chunk = tokens[i:i + max_length]
        chunk_text = tokenizer.convert_tokens_to_string(chunk)
        chunks.append(chunk_text)
        if i + max_length >= len(tokens):
            break
    
    return chunks

embeddings_store = {}

async def main():
    client, model, dimensions = await create_openai_embed_client()
    dimensions = int(dimensions) if dimensions else 1536

    text_id = 1  # A simple counter or identifier for each text
    for text in cleaned_texts:
        text_chunks = sliding_window(text)
        embeddings = []
        for chunk in text_chunks:
            embedding = await compute_text_embedding(
                q=chunk,
                openai_client=client,
                embed_model=model,
                embedding_dimensions=dimensions
            )
            embeddings.append(embedding)
        embeddings_store[text_id] = embeddings
        text_id += 1

    for text_id, embeddings in embeddings_store.items():
        print(f"Embedding for Text {text_id}:")
        for i, embedding in enumerate(embeddings):
            print(f"  Chunk {i+1}: {embedding}")

await main()

Embedding for Text 1:
  Chunk 1: [0.020265037193894386, 0.019887762144207954, 0.0032809418626129627, -0.03161022439599037, -0.013164189644157887, 0.02080400101840496, 0.008030559867620468, -0.010038199834525585, -0.0110487574711442, -0.03802389279007912, -0.018392138183116913, -0.01220752950757742, 0.0011309817200526595, 0.007073899265378714, 0.0035807404201477766, -0.004554243758320808, 0.03452062979340553, 0.01188415102660656, 0.03832032531499863, 0.013837894424796104, -0.020278511568903923, -0.01020662672817707, -0.03274204954504967, -0.005285213235765696, -0.012679122388362885, 0.008845742791891098, 0.007208640221506357, -0.016397971659898758, 0.00823940895497799, -0.0226095300167799, 0.008764898404479027, -0.01304292306303978, -0.023148493841290474, -0.0024337582290172577, -0.014821503311395645, 0.013352827169001102, -0.004631720017641783, 0.008070982061326504, 0.03832032531499863, 0.00113687664270401, 0.04093429818749428, 0.007660022471100092, -0.004096124786883593, -0.0048371995

Note: I initially encountered an issue where the embedding for the first chunk of text 4 was a vector in which every entry was null. I fixed this by adding more substitution rules to the regex cleaning function at the start.
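
A small guard can surface this kind of problem as soon as it happens. The sketch below assumes each embedding is a plain list of numbers, as in the printed output above, and that "null" shows up as `None` or NaN; the helper name `is_valid_embedding` is hypothetical.

```python
import math

def is_valid_embedding(vector):
    """Return True if `vector` is non-empty and has no None, NaN or infinite entries."""
    if not vector:
        return False
    for value in vector:
        if value is None:
            return False
        if isinstance(value, float) and not math.isfinite(value):
            return False
    return True

print(is_valid_embedding([0.02, -0.01, 0.003]))
print(is_valid_embedding([None, None, None]))
print(is_valid_embedding([float("nan"), 0.1]))
```

Calling a check like this right after each embedding is computed would point to the offending chunk directly, rather than leaving it to be spotted in the printed output.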

## Method 2: Topic modelling (incompatible with our purposes)

Another approach is to determine the key topics in each text and chunk based on those.

In [44]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Assuming cleaned_texts is a list of document strings
for index, text in enumerate(cleaned_texts):
    document = [text]
    vectorizer = CountVectorizer(stop_words='english')
    X = vectorizer.fit_transform(document)

    # Step 2: Apply LDA
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    # Heading for the output
    print(f"\nText {index + 1} Analysis:")  # Dynamic heading for each text

    # Step 3: View topics (each topic as a list of words)
    def print_topics(model, vectorizer, top_n=20):
        for idx, topic in enumerate(model.components_):
            print(f"Topic {idx + 1}")
            print([(vectorizer.get_feature_names_out()[i], topic[i]) for i in topic.argsort()[:-top_n - 1:-1]])

    print_topics(lda, vectorizer)

    # Example of assigning topics to new documents
    doc_topic_dist = lda.transform(X)
    print("\nDocument topic distribution:")
    print(doc_topic_dist)



Text 1 Analysis:
Topic 1
[('appeal', 0.5001182272337424), ('circumstances', 0.500118221305466), ('decision', 0.5001182201810146), ('evidence', 0.5001182201810146), ('academic', 0.5001182195105467), ('school', 0.5001182178851279), ('appeals', 0.5001182168929599), ('art', 0.5001182157483708), ('10', 0.5001182144184692), ('regulations', 0.5001182128609896), ('procedure', 0.5001182128609896), ('exceptional', 0.5001182128609896), ('review', 0.5001182110208952), ('student', 0.5001182061769369), ('students', 0.5001181989323576), ('procedures', 0.5001181989323576), ('submit', 0.5001181873940317), ('board', 0.5001181873940317), ('reason', 0.5001181788696052), ('classification', 0.5001181788696052)]
Topic 2
[('appeal', 77.49988177276589), ('circumstances', 27.49988177869416), ('decision', 25.499881779818608), ('evidence', 25.499881779818608), ('academic', 24.49988178048908), ('school', 22.4998817821145), ('appeals', 21.499881783106666), ('art', 20.499881784251258), ('10', 19.499881785581156), (

In [45]:
def extract_keywords(model, vectorizer, top_n=20):
    topic_keywords = []
    for idx, topic in enumerate(model.components_):
        keywords = [vectorizer.get_feature_names_out()[i] for i in topic.argsort()[:-top_n - 1:-1]]
        topic_keywords.append(keywords)
    return topic_keywords

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

def chunk_text_by_keywords(text, keywords):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    # Initialize chunks (a list of lists, each list is a chunk of sentences)
    chunks = [[] for _ in keywords]

    # Assign sentences to the chunk of the topic they are most relevant to
    for sentence in sentences:
        # Count keyword occurrences in the sentence
        keyword_counts = [sum(sentence.count(keyword) for keyword in topic_keywords) for topic_keywords in keywords]
        # Find the topic with the maximum count of keywords
        max_topic = keyword_counts.index(max(keyword_counts))
        # Append the sentence to the corresponding chunk
        chunks[max_topic].append(sentence)

    # Join sentences back to form coherent chunks
    return [" ".join(chunk) for chunk in chunks]

for index, text in enumerate(cleaned_texts):
    document = [text]
    vectorizer = CountVectorizer(stop_words='english')
    X = vectorizer.fit_transform(document)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    # Extract keywords for the current text's topics
    keywords = extract_keywords(lda, vectorizer)
    print(f"\nText {index + 1} Keywords and Chunks:")
    for i, topic_keywords in enumerate(keywords):
        print(f"Topic {i + 1} Keywords: {topic_keywords}")

    # Chunk the text based on these keywords
    text_chunks = chunk_text_by_keywords(text, keywords)
    for i, chunk in enumerate(text_chunks):
        print(f"Chunk {i + 1} for Text {index + 1}: {chunk}...")  # Print the whole chunk, followed by an ellipsis



Text 1 Keywords and Chunks:
Topic 1 Keywords: ['appeal', 'circumstances', 'decision', 'evidence', 'academic', 'school', 'appeals', 'art', '10', 'regulations', 'procedure', 'exceptional', 'review', 'student', 'students', 'procedures', 'submit', 'board', 'reason', 'classification']
Topic 2 Keywords: ['appeal', 'circumstances', 'decision', 'evidence', 'academic', 'school', 'appeals', 'art', '10', 'regulations', 'procedure', 'exceptional', 'review', 'student', 'students', 'procedures', 'submit', 'board', 'reason', 'classification']
Chunk 1 for Text 1:  Houghton Street London WC2A 2AE United Kingdom lse.ac.uk/appeals Academic Appeals Regulations for Taught Programmes These Regulations are approved by the Academic Board. These Regulations take effect from the 20 23/24 academic year and apply to all undergraduate and taught postgraduate students. See also: • Regulations for First Degrees; • Regulations for Taught Masters; • Schemes for Awards; and • The procedure for submitting Exceptional C

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1002)>


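The issue seems to be that there is insufficient content for topic modelling: each LDA model is fitted on a single document, so it has no variation across documents from which to separate topics. This is most likely why the keywords are the same in topic 1 and topic 2, and because of that every sentence is assigned to the first topic and we only get 1 chunk for each text. For this reason I do not think this is a viable method for chunking.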
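
The tie-breaking in `chunk_text_by_keywords` can be checked in isolation. When both topics share the same keywords, every sentence scores equally for both, and `list.index(max(...))` returns the first index. A minimal illustration with made-up sentences and keywords:

```python
# Two "topics" with identical keyword lists, as in the output above
keywords = [["appeal", "decision"], ["appeal", "decision"]]
sentences = ["The appeal was received.", "A decision is final.", "No keywords here."]

assignments = []
for sentence in sentences:
    counts = [sum(sentence.count(k) for k in topic) for topic in keywords]
    # On a tie, index() returns the position of the first maximum, i.e. topic 0
    assignments.append(counts.index(max(counts)))

print(assignments)
```

Every sentence is assigned to topic 0, so the second chunk stays empty regardless of the text.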

## Method 3: Semantic similarity-based splitting

We will try using a Hugging Face model to determine where there are substantial jumps in semantic space between consecutive sections, and calculate embeddings for the resulting segments.

In [47]:
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].mean(dim=0)

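# Example text
text = cleaned_texts[0]
# Note: clean_text has already removed every newline, so this split
# returns the whole text as a single paragraph
paragraphs = text.split('\n\n')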

# Compute embeddings
embeddings = [get_embedding(para) for para in paragraphs]

# Determine breakpoints based on embedding similarity
for i in range(1, len(embeddings)):
    sim = cosine_similarity([embeddings[i-1].detach().numpy()], [embeddings[i].detach().numpy()])
    if sim < 0.5:  # Threshold needs adjustment based on your specific needs
        print(f"Split between paragraph {i-1} and {i} due to low similarity: {sim}")





ImportError: 
AutoModel requires the PyTorch library but it was not found in your environment. Checkout the instructions on the
installation page: https://pytorch.org/get-started/locally/ and follow the ones that match your environment.
Please note that you may need to restart your runtime after installation.
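
Since the cell above could not run without PyTorch, the breakpoint logic itself can still be sketched on toy vectors using only the standard library. The vectors, the helper names `cosine` and `find_breakpoints`, and the 0.5 threshold are all assumptions for illustration; real paragraph embeddings would have hundreds of dimensions and the threshold would need tuning.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def find_breakpoints(embeddings, threshold=0.5):
    """Return each index i where the similarity between items i-1 and i
    falls below `threshold`."""
    return [i for i in range(1, len(embeddings))
            if cosine(embeddings[i - 1], embeddings[i]) < threshold]

# Made-up 2-D "paragraph embeddings": the third one points in a different direction
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
breakpoints = find_breakpoints(toy_embeddings)
print(breakpoints)
```

A breakpoint at index `i` means a new chunk would start at paragraph `i`.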
