**OBJECTIVE:** In [NB03 - Chunking PDFs](./NB03%20-%20Chunking%20PDFs.ipynb), we determined that the `SentenceSplitter` described in the llama-index documentation is an effective way of chunking for our purposes. This notebook implements it on the documents we collected via crawling, so that the resulting chunks can then be tested on the ChatUI.

**AUTHOR:** [Aksh Sabherwal](https://www.github.com/akshsabherwal) (edited by [@jonjoncardoso](https://github.com/jonjoncardoso))

⚙️ **SETUP**

- Ensure you are running with the `chat-lse` conda environment. See [README.md](../../README.md) for more information.
- Always re-run the environment setup to ensure you have the latest packages installed:

    ```bash
    cd chat-lse
    conda activate chat-lse
    pip install -r requirements.txt
    pip install spacy
    ```

**Imports**

In [2]:
import os
import re
import PyPDF2
import jsonlines

import pandas as pd

from tqdm.notebook import tqdm
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from spacy.lang.en import English # see https://spacy.io/usage for install instructions

**Tweaks**

In [21]:
# Filter unnecessary FutureWarning thrown by HuggingFaceEmbedding
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

tqdm.pandas()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/ 
nlp = English()
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x27c16801650>

**Constants**

In [4]:
DOCS_FOLDER = "./sample-docs/"

# Embedding model used throughout the notebook
DEFAULT_EMBED_MODEL = "thenlper/gte-large"

DEFAULT_CHUNK_SIZE = 500  # Each chunk has at most 500 tokens
SENTENCE_CHUNK_OVERLAP = 50  # Number of tokens shared by consecutive chunks (example value)
CHUNKING_REGEX = r"[^,\.;]+[,\.;]?"  # Splits after commas, periods and semicolons (clauses, not only sentences)
DEFAULT_PARAGRAPH_SEP = "\n\n"  # Paragraph separator
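
To see what `CHUNKING_REGEX` actually matches, here is a minimal sketch using only the standard `re` module. The sample sentence is made up for illustration. Note that the pattern also breaks on commas and semicolons, so it produces clause-sized pieces rather than full sentences.

```python
import re

CHUNKING_REGEX = r"[^,\.;]+[,\.;]?"  # same pattern as in the constants above

# Made-up sample text, for illustration only
sample = "Submit the form, then wait. Late forms are rejected; no exceptions"

# Each match is a run of text up to and including the next ',', '.' or ';'
pieces = re.findall(CHUNKING_REGEX, sample)
for piece in pieces:
    print(repr(piece))
```

Leading spaces are kept in every piece after the first, so a caller would probably want to `strip()` them before further processing.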

**Utils functions for reading and parsing PDFs**

In [5]:
def read_pdf(file_path):
    # file_path must point to a single PDF file (not to DOCS_FOLDER itself)
    # Initialize a variable to hold all the text
    all_text = ""
    
    # Open the PDF file
    with open(file_path, "rb") as file:
        # Initialize a PDF reader object
        pdf_reader = PyPDF2.PdfReader(file)
        
        # Iterate through each page in the PDF
        for page in pdf_reader.pages:
            # Extract text from the page
            text = page.extract_text()
            if text:
                all_text += text  # Append the extracted text to all_text

    return all_text

def clean_text(text):
    # Remove all newline characters (words separated only by a line break get joined)
    cleaned_text = re.sub(r'\n', '', text)
    # Replace two or more spaces with a single space
    cleaned_text = re.sub(r' {2,}', ' ', cleaned_text)
    # Replace a space followed by a period with just a period
    cleaned_text = re.sub(r' \.', '.', cleaned_text)
    # Replace a space followed by a comma with just a comma
    cleaned_text = re.sub(r' ,', ',', cleaned_text)
    return cleaned_text
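
A small check of what `clean_text` does to line breaks, as a sketch on a made-up string. Because `re.sub(r'\n', '', text)` deletes newlines instead of replacing them with a space, words on either side of a line break are fused together. `clean_text_keep_spaces` is a hypothetical variant, not used anywhere else in this notebook, that turns newlines into spaces first.

```python
import re

def clean_text(text):
    """Same steps as the clean_text function above."""
    cleaned_text = re.sub(r'\n', '', text)              # delete newlines
    cleaned_text = re.sub(r' {2,}', ' ', cleaned_text)  # collapse repeated spaces
    cleaned_text = re.sub(r' \.', '.', cleaned_text)    # " ." -> "."
    cleaned_text = re.sub(r' ,', ',', cleaned_text)     # " ," -> ","
    return cleaned_text

def clean_text_keep_spaces(text):
    """Hypothetical variant: newlines become spaces before the other steps run."""
    return clean_text(text.replace("\n", " "))

# Made-up sample, for illustration only
sample = "Exam Procedures\nfor Candidates , 2024 ."

print(clean_text(sample))              # words around the line break are joined
print(clean_text_keep_spaces(sample))  # words stay separated
```

Switching to the variant would change the cleaned text, and therefore the chunks and sentence counts shown in the outputs of this notebook.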

# 1. Read and clean PDFs

Read the PDFs from the docs folder and clean the extracted text using regular expressions.

In [6]:
path_all_pdfs = [file for file in os.listdir(DOCS_FOLDER)]
path_all_pdfs

['Appeals-Regulations.pdf',
 'BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf',
 'bsc-handbook-21.22.pdf',
 'comPro.pdf',
 'ConfidentialityPolicy.pdf',
 'Exam-Procedures-for-Candidates.pdf',
 'Formatting-and-binding-your-thesis-2021-22.pdf',
 'In-Course-Financial-Support.pdf',
 'InterruptionPolicy.pdf',
 'LSE-2030-booklet.pdf',
 'MSc-Mark-Frame.pdf',
 'Spring-Exam-Timetable-2024-Final.pdf',
 'Student-Guidance-Deferral.pdf',
 'UG-Student-Handbook-Department-of-International-History-2023-24 (1).pdf']

Read all the PDFs into text:

In [7]:
docs = {filename: read_pdf(os.path.join(DOCS_FOLDER, filename)) for filename in tqdm(path_all_pdfs)}

print(f"Read {len(docs)} documents")
# Uncomment lines below if you want to see the content of the documents
# from pprint import pprint
# pprint(docs)

  0%|          | 0/14 [00:00<?, ?it/s]

Read 14 documents


Create a `cleaned_docs` dictionary with a cleaned version of the text:

In [8]:
cleaned_docs = {filename: clean_text(doc) for filename, doc in tqdm(docs.items())}

# Uncomment lines below if you want to see the cleaned text
# from pprint import pprint
# pprint(cleaned_docs)

  0%|          | 0/14 [00:00<?, ?it/s]

## 1.1 Annotate documents:

In [9]:
# Convert to DataFrame
df_docs = pd.Series(cleaned_docs).to_frame("cleaned_text")
df_docs.index.name = "filename"

df_docs

| filename | cleaned_text |
|---|---|
| Appeals-Regulations.pdf | Houghton Street London WC2A 2AE United Kingdo... |
| BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf | Page 1 of 2 THREE YEAR CLASSIFICATION SCHEME F... |
| bsc-handbook-21.22.pdf | 2021/22Welcome to the Department of EconomicsU... |
| comPro.pdf | 1 Section One -How to Raise a Complaint Introd... |
| ConfidentialityPolicy.pdf | 3. where we are required to do so by law - thi... |
| Exam-Procedures-for-Candidates.pdf | Student Services Centre Exam Procedures for C... |
| Formatting-and-binding-your-thesis-2021-22.pdf | 1 Formatting and binding your thesis Please no... |
| In-Course-Financial-Support.pdf | Financial Support Office In-Course Financial S... |
| InterruptionPolicy.pdf | Page 1 of 3 INTERRUPTION OF STUDIES POLICY A ... |
| LSE-2030-booklet.pdf | Find out more lse.ac.uk/ 2030LSE 2030Shape the... |


Manually annotate the sample documents. 

Once our crawler has been integrated, this type of information would have already been collected.

In [10]:
# Add a description
df_docs.loc["ConfidentialityPolicy.pdf", "description"] = "Immigration Advice Confidentiality Policy"
df_docs.loc["Formatting-and-binding-your-thesis-2021-22.pdf", "description"] = "Formatting and binding your thesis"
df_docs.loc["LSE-2030-booklet.pdf", "description"] = "LSE 2030 Strategy"
df_docs.loc["MSc-Mark-Frame.pdf", "description"] = "MSc Mark Frame"
df_docs.loc["bsc-handbook-21.22.pdf", "description"] = "BSc Economics Handbook 2021/22"
df_docs.loc["UG-Student-Handbook-Department-of-International-History-2023-24 (1).pdf", "description"] = "UG History Department Handbook 2023/24"
df_docs.loc["Exam-Procedures-for-Candidates.pdf", "description"] = "Exam Procedures for Candidates"
df_docs.loc["Spring-Exam-Timetable-2024-Final.pdf", "description"] = "Spring Exam Timetable 2024"
df_docs.loc["InterruptionPolicy.pdf", "description"] = "Interruption of Studies Policy"
df_docs.loc["Appeals-Regulations.pdf", "description"] = "Academic Appeals Regulations for Taught Programmes"
df_docs.loc["In-Course-Financial-Support.pdf", "description"] = "In-Course Financial Support - Application form and guidance notes"
df_docs.loc["BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf", "description"] = "BA/BSc Three-Year Scheme for students from 2018/19"
df_docs.loc["comPro.pdf", "description"] = "Student Complaints Procedure"
df_docs.loc["Student-Guidance-Deferral.pdf", "description"] = "Student Guidance on Deferral"

# Add URL of the document
df_docs.loc["ConfidentialityPolicy.pdf", "url"] = "https://info.lse.ac.uk/current-students/immigration-advice/assets/documents/Info-Sheets/ConfidentialityPolicy.pdf"
df_docs.loc["Formatting-and-binding-your-thesis-2021-22.pdf", "url"] = "https://info.lse.ac.uk/current-students/phd-academy/assets/documents/Formatting-and-binding-your-thesis-2021-22.pdf"
df_docs.loc["LSE-2030-booklet.pdf", "url"] = "https://www.lse.ac.uk/2030/assets/pdf/LSE-2030-booklet.pdf"
df_docs.loc["MSc-Mark-Frame.pdf", "url"] = "https://www.lse.ac.uk/sociology/assets/documents/study/Assessment-and-Feedback/MSc-Mark-Frame.pdf"
df_docs.loc["bsc-handbook-21.22.pdf", "url"] = "https://www.lse.ac.uk/economics/Assets/Documents/undergraduate-study/bsc-handbook-21.22.pdf"
df_docs.loc["UG-Student-Handbook-Department-of-International-History-2023-24 (1).pdf", "url"] = "https://www.lse.ac.uk/International-History/Assets/Documents/student-handbooks/2023-24/UG-Student-Handbook-Department-of-International-History-2023-24.pdf"
df_docs.loc["Exam-Procedures-for-Candidates.pdf", "url"] = "https://info.lse.ac.uk/current-students/services/assets/documents/Exam-Procedures-for-Candidates.pdf"
df_docs.loc["Spring-Exam-Timetable-2024-Final.pdf", "url"] = "https://info.lse.ac.uk/current-students/services/assets/documents/Spring-Exam-Timetable-2024-Final.pdf"
df_docs.loc["InterruptionPolicy.pdf", "url"] = "https://info.lse.ac.uk/Staff/Divisions/Academic-Registrars-Division/Teaching-Quality-Assurance-and-Review-Office/Assets/Documents/Calendar/InterruptionPolicy.pdf"
df_docs.loc["Appeals-Regulations.pdf", "url"] = "https://info.lse.ac.uk/current-students/services/assets/documents/Appeals-Regulations-August-2018.pdf"
df_docs.loc["In-Course-Financial-Support.pdf", "url"] = "https://info.lse.ac.uk/current-students/financial-support/assets/documents/In-Course-Financial-Support.pdf"
df_docs.loc["BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf", "url"] = "https://info.lse.ac.uk/staff/divisions/academic-registrars-division/Teaching-Quality-Assurance-and-Review-Office/Assets/Documents/Calendar/BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf"
df_docs.loc["comPro.pdf", "url"] = "https://info.lse.ac.uk/staff/Services/Policies-and-procedures/Assets/Documents/comPro.pdf?from_serp=1"
df_docs.loc["Student-Guidance-Deferral.pdf", "url"] = "https://info.lse.ac.uk/current-students/services/assets/documents/Student-Guidance-Deferral.pdf"

In [11]:
df_docs["filetype"] = "pdf"
df_docs

| filename | cleaned_text | description | url | filetype |
|---|---|---|---|---|
| Appeals-Regulations.pdf | Houghton Street London WC2A 2AE United Kingdo... | Academic Appeals Regulations for Taught Progra... | https://info.lse.ac.uk/current-students/servic... | pdf |
| BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf | Page 1 of 2 THREE YEAR CLASSIFICATION SCHEME F... | BA/BSc Three-Year Scheme for students from 201... | https://info.lse.ac.uk/staff/divisions/academi... | pdf |
| bsc-handbook-21.22.pdf | 2021/22Welcome to the Department of EconomicsU... | BSc Economics Handbook 2021/22 | https://www.lse.ac.uk/economics/Assets/Documen... | pdf |
| comPro.pdf | 1 Section One -How to Raise a Complaint Introd... | Student Complaints Procedure | https://info.lse.ac.uk/staff/Services/Policies... | pdf |
| ConfidentialityPolicy.pdf | 3. where we are required to do so by law - thi... | Immigration Advice Confidentiality Policy | https://info.lse.ac.uk/current-students/immigr... | pdf |
| Exam-Procedures-for-Candidates.pdf | Student Services Centre Exam Procedures for C... | Exam Procedures for Candidates | https://info.lse.ac.uk/current-students/servic... | pdf |
| Formatting-and-binding-your-thesis-2021-22.pdf | 1 Formatting and binding your thesis Please no... | Formatting and binding your thesis | https://info.lse.ac.uk/current-students/phd-ac... | pdf |
| In-Course-Financial-Support.pdf | Financial Support Office In-Course Financial S... | In-Course Financial Support - Application form... | https://info.lse.ac.uk/current-students/financ... | pdf |
| InterruptionPolicy.pdf | Page 1 of 3 INTERRUPTION OF STUDIES POLICY A ... | Interruption of Studies Policy | https://info.lse.ac.uk/Staff/Divisions/Academi... | pdf |
| LSE-2030-booklet.pdf | Find out more lse.ac.uk/ 2030LSE 2030Shape the... | LSE 2030 Strategy | https://www.lse.ac.uk/2030/assets/pdf/LSE-2030... | pdf |


# 2. Use SentenceSplitter()

## 2.1 Test that the embed model works:

In [None]:
embed_model = HuggingFaceEmbedding(model_name=DEFAULT_EMBED_MODEL) 

embeddings = embed_model.get_text_embedding("Hello World!") 
print(len(embeddings))
print(embeddings)


1024
[-0.015347783453762531, 0.03219835087656975, 0.004012199118733406, -0.007076372392475605, -0.03421807661652565, 0.010364627465605736, 0.005171555560082197, 0.035093992948532104, 0.02256808988749981, 0.015681633725762367, 0.020848993211984634, -0.0038733710534870625, 0.01970479264855385, -0.008473459631204605, -0.02410852536559105, 0.02485344000160694, -0.017525017261505127, -0.034737925976514816, -0.008834024891257286, 0.004576727747917175, -0.02903289534151554, 0.005700746551156044, -0.09182633459568024, -0.053349219262599945, -0.018730292096734047, 0.05672817304730415, 0.04165910556912422, -0.014728530310094357, 0.051864806562662125, 0.07056007534265518, -0.03206554427742958, -0.02654753252863884, 0.02285906672477722, -0.04784972220659256, -0.0032983587589114904, -0.011454700492322445, 0.05072870850563049, -0.04840535670518875, -0.014219054020941257, -0.05349956080317497, 0.02206721156835556, 0.009935451671481133, 0.025218719616532326, -0.03437867760658264, -0.05116734281182289,

## 2.2 Utils functions to chunk and embed text

In [62]:
# Define a helper function to generate chunk entries
def generate_chunk_entry(embedding_model, chunk_name, chunktext):
    try:
        embedding = embedding_model.get_text_embedding(chunktext)
        return {
            "chunkname": chunk_name,
            "chunktext": chunktext,
            "embedding": embedding  # Ensure the embedding is serializable
            
        }
    except Exception as e:
        print(f"Error computing embedding for chunk {chunk_name}: {e}")
        return None

# Define the function to generate a JSON entry for each document
def generate_json_entry(embed_model, splitter, filetype, filename, description, cleaned_text, url):
    try:
        # Split the cleaned text into chunks
        sentence_chunks = splitter.split_text(cleaned_text)
        chunks = []
        chunk_id = 1
        for chunk in tqdm(sentence_chunks, desc=f"Chunking document '{filename}'"):
            chunk_entry = generate_chunk_entry(embed_model, f"{filename} - Part {chunk_id}", chunk)
            if chunk_entry:
                chunks.append(chunk_entry)
                chunk_id += 1

        return {
            "filetype": filetype,
            "filename": filename,
            "description": description,
            "cleaned_text": cleaned_text,
            "url": url,
            "chunks": chunks
        }
    except Exception as e:
        print(f"Failed to compute embedding for {filename}: {e}")
        return None

## 2.3 Test the chunking and embedding functions 

In [71]:
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=256)
doc = df_docs.reset_index().iloc[0]
print(doc)

filename                                  Appeals-Regulations.pdf
cleaned_text     Houghton Street London WC2A 2AE United Kingdo...
description     Academic Appeals Regulations for Taught Progra...
url             https://info.lse.ac.uk/current-students/servic...
filetype                                                      pdf
Name: 0, dtype: object


In [72]:
print(generate_json_entry(embed_model, splitter, **doc))

Chunking document 'Appeals-Regulations.pdf':   0%|          | 0/16 [00:00<?, ?it/s]

{'filetype': 'pdf', 'filename': 'Appeals-Regulations.pdf', 'description': 'Academic Appeals Regulations for Taught Programmes', 'cleaned_text': " Houghton Street London WC2A 2AE United Kingdom lse.ac.uk/appeals Academic Appeals Regulations for Taught Programmes These Regulations are approved by the Academic Board. These Regulations take effect from the 20 23/24 academic year and apply to all undergraduate and taught postgraduate students. See also: • Regulations for First Degrees; • Regulations for Taught Masters; • Schemes for Awards; and • The procedure for submitting Exceptional Circumstances (ECs). 1. Introduction 1.1. The London School of Economics (LSE) is committed to a high quality student experience and these Regulations reflect the School’s commitment to consider appeals in a reasonable, consistent and equitable manner. 1.2. These Regulations apply to all undergraduate and taught masters students of the School and are designed to protect students against unfair assessment res

# 3. Test multiple chunk overlap sizes

The code in this section takes about 40 minutes to an hour to run.
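
Part of that run time presumably comes from the number of chunks that have to be embedded, which grows quickly as the overlap approaches the chunk size. The sketch below estimates the chunk count under a simplified sliding-window model (fixed stride of `chunk_size - chunk_overlap` tokens, sentence boundaries ignored). The function name `estimate_num_chunks` and the 4000-token document length are assumptions for illustration, not values measured in this notebook.

```python
import math

def estimate_num_chunks(num_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Rough chunk count for a plain sliding window (ignores sentence boundaries)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if num_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new tokens covered by each extra chunk
    return 1 + math.ceil((num_tokens - chunk_size) / stride)

# Hypothetical document of 4000 tokens, chunk_size fixed at 512 as in the loop below
estimates = {overlap: estimate_num_chunks(4000, 512, overlap) for overlap in (128, 256, 384)}
print(estimates)
```

The progress bars further down show the same kind of growth: `Appeals-Regulations.pdf`, for example, goes from 11 to 16 to 29 chunks as the overlap increases. The real splitter respects sentence boundaries, so its counts need not match this estimate exactly.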

In [73]:
sentence_overlap_sizes = [128, 256, 384]

In [77]:
for overlap_size in sentence_overlap_sizes:
    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=overlap_size)

    print(f"Generating chunked and embedded data with overlap size {overlap_size}...")

    json_entries = [
        generate_json_entry(embed_model, splitter, **doc)
        for _, doc in df_docs.reset_index().iterrows()
    ]

    json_file = f"../../data/seed_lse_data_overlap_{overlap_size}.jsonl"
    with jsonlines.open(json_file, "w") as writer:
        print(f"Writing chunked and embedded data to {json_file}...")
        writer.write_all(json_entries)

Generating chunked and embedded data with overlap size 128...


Chunking document 'Appeals-Regulations.pdf':   0%|          | 0/11 [00:00<?, ?it/s]

Chunking document 'BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf':   0%|          | 0/5 [00:00<?, ?it…

Chunking document 'bsc-handbook-21.22.pdf':   0%|          | 0/54 [00:00<?, ?it/s]

Chunking document 'comPro.pdf':   0%|          | 0/14 [00:00<?, ?it/s]

Chunking document 'ConfidentialityPolicy.pdf':   0%|          | 0/2 [00:00<?, ?it/s]

Chunking document 'Exam-Procedures-for-Candidates.pdf':   0%|          | 0/29 [00:00<?, ?it/s]

Chunking document 'Formatting-and-binding-your-thesis-2021-22.pdf':   0%|          | 0/2 [00:00<?, ?it/s]

Chunking document 'In-Course-Financial-Support.pdf':   0%|          | 0/12 [00:00<?, ?it/s]

Chunking document 'InterruptionPolicy.pdf':   0%|          | 0/6 [00:00<?, ?it/s]

Chunking document 'LSE-2030-booklet.pdf':   0%|          | 0/4 [00:00<?, ?it/s]

Chunking document 'MSc-Mark-Frame.pdf':   0%|          | 0/1 [00:00<?, ?it/s]

Chunking document 'Spring-Exam-Timetable-2024-Final.pdf':   0%|          | 0/38 [00:00<?, ?it/s]

Chunking document 'Student-Guidance-Deferral.pdf':   0%|          | 0/5 [00:00<?, ?it/s]

Chunking document 'UG-Student-Handbook-Department-of-International-History-2023-24 (1).pdf':   0%|          | …

Writing chunked and embedded data to ../../data/seed_lse_data_overlap_128.jsonl...
Generating chunked and embedded data with overlap size 256...


Chunking document 'Appeals-Regulations.pdf':   0%|          | 0/16 [00:00<?, ?it/s]

Chunking document 'BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf':   0%|          | 0/7 [00:00<?, ?it…

Chunking document 'bsc-handbook-21.22.pdf':   0%|          | 0/78 [00:00<?, ?it/s]

Chunking document 'comPro.pdf':   0%|          | 0/19 [00:00<?, ?it/s]

Chunking document 'ConfidentialityPolicy.pdf':   0%|          | 0/3 [00:00<?, ?it/s]

Chunking document 'Exam-Procedures-for-Candidates.pdf':   0%|          | 0/42 [00:00<?, ?it/s]

Chunking document 'Formatting-and-binding-your-thesis-2021-22.pdf':   0%|          | 0/3 [00:00<?, ?it/s]

Chunking document 'In-Course-Financial-Support.pdf':   0%|          | 0/16 [00:00<?, ?it/s]

Chunking document 'InterruptionPolicy.pdf':   0%|          | 0/8 [00:00<?, ?it/s]

Chunking document 'LSE-2030-booklet.pdf':   0%|          | 0/4 [00:00<?, ?it/s]

Chunking document 'MSc-Mark-Frame.pdf':   0%|          | 0/1 [00:00<?, ?it/s]

Chunking document 'Spring-Exam-Timetable-2024-Final.pdf':   0%|          | 0/52 [00:00<?, ?it/s]

Chunking document 'Student-Guidance-Deferral.pdf':   0%|          | 0/5 [00:00<?, ?it/s]

Chunking document 'UG-Student-Handbook-Department-of-International-History-2023-24 (1).pdf':   0%|          | …

Writing chunked and embedded data to ../../data/seed_lse_data_overlap_256.jsonl...
Generating chunked and embedded data with overlap size 384...


Chunking document 'Appeals-Regulations.pdf':   0%|          | 0/29 [00:00<?, ?it/s]

Chunking document 'BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf':   0%|          | 0/11 [00:00<?, ?i…

Chunking document 'bsc-handbook-21.22.pdf':   0%|          | 0/144 [00:00<?, ?it/s]

Chunking document 'comPro.pdf':   0%|          | 0/37 [00:00<?, ?it/s]

Chunking document 'ConfidentialityPolicy.pdf':   0%|          | 0/4 [00:00<?, ?it/s]

Chunking document 'Exam-Procedures-for-Candidates.pdf':   0%|          | 0/83 [00:00<?, ?it/s]

Chunking document 'Formatting-and-binding-your-thesis-2021-22.pdf':   0%|          | 0/4 [00:00<?, ?it/s]

Chunking document 'In-Course-Financial-Support.pdf':   0%|          | 0/29 [00:00<?, ?it/s]

Chunking document 'InterruptionPolicy.pdf':   0%|          | 0/16 [00:00<?, ?it/s]

Chunking document 'LSE-2030-booklet.pdf':   0%|          | 0/4 [00:00<?, ?it/s]

Chunking document 'MSc-Mark-Frame.pdf':   0%|          | 0/1 [00:00<?, ?it/s]

Chunking document 'Spring-Exam-Timetable-2024-Final.pdf':   0%|          | 0/81 [00:00<?, ?it/s]

Chunking document 'Student-Guidance-Deferral.pdf':   0%|          | 0/9 [00:00<?, ?it/s]

Chunking document 'UG-Student-Handbook-Department-of-International-History-2023-24 (1).pdf':   0%|          | …

Writing chunked and embedded data to ../../data/seed_lse_data_overlap_384.jsonl...
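
Since `generate_json_entry` returns `None` when something fails, `json_entries` may contain `None` values, which would most likely end up as `null` lines in the output file. The sketch below shows one way to filter them out and to read a JSONL file back for a quick sanity check. It uses only the standard `json` module and a temporary file with made-up entries, so it does not touch the files written above.

```python
import json
import os
import tempfile

# Made-up entries mimicking the structure returned by generate_json_entry
entries = [
    {
        "filename": "a.pdf",
        "chunks": [
            {"chunkname": "a.pdf - Part 1", "chunktext": "Some text.", "embedding": [0.1, 0.2]}
        ],
    },
    None,  # what a failed document would look like
]

# Drop failed documents before writing
valid_entries = [entry for entry in entries if entry is not None]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for entry in valid_entries:
        f.write(json.dumps(entry) + "\n")  # one JSON object per line

# Read the file back and count chunks per document
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

chunks_per_doc = {doc["filename"]: len(doc["chunks"]) for doc in loaded}
print(chunks_per_doc)
```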


I found that the method above also splits chunks at sentence boundaries: when a sentence ends, the chunk can stop there too, which isn't exactly desirable. Let's try to find a way to work around it.

I have found that the issue comes from enumerating the text page by page, which means we also split by page. Let's just get rid of this.
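
To illustrate why sentence-aware chunking ends chunks at sentence boundaries, here is a simplified sketch of greedy sentence packing. It is not llama-index's implementation: `pack_sentences` is a hypothetical helper, and tokens are approximated by whitespace-separated words.

```python
def pack_sentences(sentences, max_tokens):
    """Greedily pack whole sentences into chunks of at most max_tokens words.

    A single sentence longer than max_tokens still becomes its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())  # crude token count: whitespace-separated words
        # Close the current chunk if the next sentence would not fit
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks

# Made-up sentences, for illustration only
sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]
packed = pack_sentences(sentences, max_tokens=5)
print(packed)
```

Every chunk ends where a sentence ends, because a sentence is never cut in half to fill the remaining token budget.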

**TODO:** (by Jon)

My review of the code stopped here. I will continue to review the rest of the code.

# 4. Use recursive breakdown

Use `spaCy` to split sentences:

In [19]:
# Initialize a dictionary to store results
results = {}

# Process each text in the list with its index
for filename, row in tqdm(df_docs[['cleaned_text']].iterrows(), total=len(df_docs)):
    # Analyze the text with spaCy to get sentences
    doc = nlp(row["cleaned_text"])
    sentences = list(doc.sents)
    
    # Convert all Sentence objects to strings
    sentences = [str(sentence) for sentence in sentences]
    
    # Use the index as the key for each document's results
    results[filename] = {
        "sentences": sentences,
        "sentence_count": len(sentences)
    }

df_sentences = pd.DataFrame(results).T
df_sentences.index.name = "filename"
df_sentences

  0%|          | 0/14 [00:00<?, ?it/s]

| filename | sentences | sentence_count |
|---|---|---|
| Appeals-Regulations.pdf | [ Houghton Street London WC2A 2AE United Kingd... | 190 |
| BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf | [Page 1 of 2 THREE YEAR CLASSIFICATION SCHEME ... | 51 |
| bsc-handbook-21.22.pdf | [2021/22Welcome to the Department of Economics... | 582 |
| comPro.pdf | [1 Section One -How to Raise a Complaint Intro... | 166 |
| ConfidentialityPolicy.pdf | [3., where we are required to do so by law - t... | 40 |
| Exam-Procedures-for-Candidates.pdf | [ Student Services Centre Exam Procedures for ... | 440 |
| Formatting-and-binding-your-thesis-2021-22.pdf | [1 Formatting and binding your thesis Please n... | 30 |
| In-Course-Financial-Support.pdf | [Financial Support Office In-Course Financial ... | 106 |
| InterruptionPolicy.pdf | [ Page 1 of 3 INTERRUPTION OF STUDIES POLICY A... | 106 |
| LSE-2030-booklet.pdf | [Find out more lse.ac.uk/ 2030LSE 2030Shape th... | 35 |


Utility functions to split the text into groups of a set number of sentences:

In [20]:
# Define split size to turn groups of sentences into chunks

num_sentence_chunk_size = 3

# Create a function that splits a list into consecutive slices of the desired size
def split_list(input_list: list, 
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (the last one may be shorter).

    For example, with slice_size=10 a list of 17 sentences is split into two lists of 10 and 7 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]
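
A quick check of `split_list` on a toy list, as a minimal sketch. The function is repeated here so the snippet runs on its own, and the sentence strings are placeholders.

```python
import math

def split_list(input_list: list, slice_size: int) -> list:
    """Same slicing logic as the function above."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Placeholder sentences, for illustration only
toy_sentences = [f"Sentence {i}." for i in range(1, 18)]  # 17 sentences

by_ten = split_list(toy_sentences, 10)
by_three = split_list(toy_sentences, 3)

print([len(chunk) for chunk in by_ten])    # sizes of the slices with slice_size=10
print([len(chunk) for chunk in by_three])  # sizes of the slices with slice_size=3

# The number of slices is always ceil(len(list) / slice_size)
print(len(by_three) == math.ceil(17 / 3))
```

The same rule explains the `num_sentence_chunks` column further down: a document with 190 sentences and a chunk size of 3 gives `math.ceil(190 / 3) = 64` chunks.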

Update `df_sentences` to include sentence_chunks and num_sentence_chunks:


In [23]:
df_sentences = df_sentences.assign(
    sentence_chunks=df_sentences["sentences"].progress_apply(
        lambda x: split_list(x, num_sentence_chunk_size)
    ),
    num_sentence_chunks=lambda x: x["sentence_chunks"].apply(len)
)

  0%|          | 0/14 [00:00<?, ?it/s]

| filename | sentences | sentence_count | sentence_chunks | num_sentence_chunks |
|---|---|---|---|---|
| Appeals-Regulations.pdf | [ Houghton Street London WC2A 2AE United Kingd... | 190 | [[ Houghton Street London WC2A 2AE United King... | 64 |
| BA-BSc-Three-Year-scheme-for-students-from-2018.19.pdf | [Page 1 of 2 THREE YEAR CLASSIFICATION SCHEME ... | 51 | [[Page 1 of 2 THREE YEAR CLASSIFICATION SCHEME... | 17 |
| bsc-handbook-21.22.pdf | [2021/22Welcome to the Department of Economics... | 582 | [[2021/22Welcome to the Department of Economic... | 194 |
| comPro.pdf | [1 Section One -How to Raise a Complaint Intro... | 166 | [[1 Section One -How to Raise a Complaint Intr... | 56 |
| ConfidentialityPolicy.pdf | [3., where we are required to do so by law - t... | 40 | [[3., where we are required to do so by law - ... | 14 |
| Exam-Procedures-for-Candidates.pdf | [ Student Services Centre Exam Procedures for ... | 440 | [[ Student Services Centre Exam Procedures for... | 147 |
| Formatting-and-binding-your-thesis-2021-22.pdf | [1 Formatting and binding your thesis Please n... | 30 | [[1 Formatting and binding your thesis Please ... | 10 |
| In-Course-Financial-Support.pdf | [Financial Support Office In-Course Financial ... | 106 | [[Financial Support Office In-Course Financial... | 36 |
| InterruptionPolicy.pdf | [ Page 1 of 3 INTERRUPTION OF STUDIES POLICY A... | 106 | [[ Page 1 of 3 INTERRUPTION OF STUDIES POLICY ... | 36 |
| LSE-2030-booklet.pdf | [Find out more lse.ac.uk/ 2030LSE 2030Shape th... | 35 | [[Find out more lse.ac.uk/ 2030LSE 2030Shape t... | 12 |


**TODO:**

There is a lot of repeated code in the rest of the notebook. This tends to lead to a lot of bugs and is generally not a good practice. I will refactor the code to make it more readable and maintainable.

In [None]:
# `dict_cleaned_texts` is not defined in this notebook; `df_sentences` holds the same data.
# Join each group of sentences in sentence_chunks into a single string
df_sentences["sentence_chunks"] = df_sentences["sentence_chunks"].apply(
    lambda chunks: [' '.join(chunk) for chunk in chunks]
)

print(df_sentences["sentence_chunks"].iloc[0])

In [None]:
import json
from spacy.lang.en import English
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load the spaCy English model
nlp = English()
nlp.add_pipe("sentencizer")

# Sample data, taken from the `cleaned_docs` dictionary built in section 1
documents = [
    (1, "PDF", "Immigration Advice Confidentiality Policy", cleaned_docs["ConfidentialityPolicy.pdf"], "https://info.lse.ac.uk/current-students/immigration-advice/assets/documents/Info-Sheets/ConfidentialityPolicy.pdf"),
    (2, "PDF", "Formatting and binding your thesis", cleaned_docs["Formatting-and-binding-your-thesis-2021-22.pdf"], "https://info.lse.ac.uk/current-students/phd-academy/assets/documents/Formatting-and-binding-your-thesis-2021-22.pdf"),
    (3, "PDF", "LSE 2030 Strategy", cleaned_docs["LSE-2030-booklet.pdf"], "https://www.lse.ac.uk/2030/assets/pdf/LSE-2030-booklet.pdf"),
    (4, "PDF", "MSc Mark Frame", cleaned_docs["MSc-Mark-Frame.pdf"], "https://www.lse.ac.uk/sociology/assets/documents/study/Assessment-and-Feedback/MSc-Mark-Frame.pdf"),
    (5, "PDF", "BSc Economics Handbook 2021/22", cleaned_docs["bsc-handbook-21.22.pdf"], "https://www.lse.ac.uk/economics/Assets/Documents/undergraduate-study/bsc-handbook-21.22.pdf"),
    (6, "PDF", "UG History Department Handbook 2023/24", cleaned_docs["UG-Student-Handbook-Department-of-International-History-2023-24 (1).pdf"], "https://www.lse.ac.uk/International-History/Assets/Documents/student-handbooks/2023-24/UG-Student-Handbook-Department-of-International-History-2023-24.pdf"),
    (7, "PDF", "Exam Procedures for Candidates", cleaned_docs["Exam-Procedures-for-Candidates.pdf"], "https://info.lse.ac.uk/current-students/services/assets/documents/Exam-Procedures-for-Candidates.pdf"),
    (8, "PDF", "Spring Exam Timetable 2024", cleaned_docs["Spring-Exam-Timetable-2024-Final.pdf"], "https://info.lse.ac.uk/current-students/services/assets/documents/Spring-Exam-Timetable-2024-Final.pdf"),
]

# Initialize the embedding model
embed_model = HuggingFaceEmbedding(model_name="thenlper/gte-large")

# Define a function to generate chunk entries with embeddings
def generate_chunk_entry(chunk, doc_id, doc_name, idx, embedding_model):
    try:
        embedding = embedding_model.get_text_embedding(chunk)
        return {
            "Type": "PDF",
            "Name": f"{doc_name} - Part {idx}",
            "Description": chunk,
            "Embedding": embedding
        }
    except Exception as e:
        print(f"Error computing embedding for chunk: {e}")
        return None

# Function to split text into chunks of three sentences
def split_into_chunks(text, nlp_model, chunk_size=3):
    doc = nlp_model(text)
    sentences = [str(sentence) for sentence in doc.sents]
    chunks = [' '.join(sentences[i:i + chunk_size]) for i in range(0, len(sentences), chunk_size)]
    return chunks

# Process each document
json_data = []
for document in documents:
    doc_id, doc_type, doc_name, description, link = document  # Unpack the tuple correctly
    doc_chunks = []
    sentence_chunks = split_into_chunks(description, nlp)
    for idx, chunk in enumerate(sentence_chunks, start=1):
        chunk_entry = generate_chunk_entry(chunk, doc_id, doc_name, idx, embed_model)
        if chunk_entry:
            doc_chunks.append(chunk_entry)
    
    document_entry = {
        "Id": doc_id,
        "Name": doc_name,
        "Description": description,
        "Link": link,
        "Chunks": doc_chunks
    }
    json_data.append(document_entry)

# Optionally, save the JSON data to a file
json_file_path = "formatted_json_data.json"
try:
    with open(json_file_path, "w") as f:
        json.dump(json_data, f, indent=4)
    print(f"JSON file created successfully at {json_file_path}")
except Exception as e:
    print(f"Failed to write JSON file: {e}")
