In [3]:
import os
import glob
import sys
from tqdm.notebook import tqdm
from torch import Tensor
import torch.nn.functional as F
from transformers import pipeline
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM
from pathlib import Path
import json

# 1.1 Identify and parse all PDF files in the directory

- In this case, we extract text from the PDFs with help from the unstructured package, which is more robust than reading the raw text layer directly
- The unstructured package partitions each PDF into typed elements (e.g. Title, NarrativeText); its default PDF text extraction builds on pdfminer.six rather than PyMuPDF
- The `extract_text_data_from_pdf` helper automatically labels sections of the extracted text with location data (e.g. page number, section heading, etc.)


In [4]:
# Glob all files under ../data/pdfs/

files = glob.glob('../data/pdfs/*')

# 1.1.1 Extract text using unstructured package
- I want to save the extracted data in JSON format to preserve the metadata, instead of just saving it as a .txt file
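
A minimal sketch of that save step, assuming each extracted element is a plain dict with `text` and `metadata` keys. The `save_elements_as_json` helper and the sample record are hypothetical; the real output of `extract_text_data_from_pdf` may be shaped differently.

```python
import json
from pathlib import Path


def save_elements_as_json(elements, out_path):
    """Write a list of element dicts to a UTF-8 JSON file and return its path."""
    out_path = Path(out_path)
    # Create the target directory if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented characters readable in the file
        json.dump(elements, f, ensure_ascii=False, indent=2)
    return out_path


# Hypothetical element record; the field names are illustrative assumptions
sample_elements = [
    {
        "text": "Tercera Contribución Determinada a nivel Nacional",
        "metadata": {"element_type": "Title", "page_number": 1},
    }
]
```

Unlike a flat .txt dump, the JSON file can be loaded back with `json.load` and still carries the page number and element type next to each piece of text.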

# 1.1.2 Adding Sentence-Based Chunking

In addition to extracting the raw text from PDFs, we'll now implement sentence-based chunking to create more semantically meaningful text units. This approach offers several advantages:

- **Semantic Coherence**: Each chunk contains complete sentences, preserving meaning
- **Overlap Control**: We can specify how many sentences should overlap between chunks to maintain context
- **Size Management**: By setting a maximum chunk size, we can control the length of each chunk for better processing
- **Structure Preservation**: Special handling for titles and headings ensures document structure is maintained

We'll use the `chunk_document_by_sentences` function from our utils module, which implements this chunking strategy.

Note that our helper function already collects a lot of metadata about each chunk, including its element types, page number, and paragraph numbers (see the example chunk structure below).



# 1.1.3 Understanding the Chunking Process

The `chunk_document_by_sentences` function has chunked our documents in a meaningful way:

1. **Semantic Units**: Each chunk contains complete sentences rather than arbitrary character counts
2. **Contextual Overlap**: The `overlap=2` parameter ensures that 2 sentences from the previous chunk are carried forward to the next chunk, maintaining context
3. **Special Element Handling**: Titles and headings are preserved as their own chunks for better document navigation
4. **Metadata Preservation**: Each chunk retains information about its source (page number, element type, etc.)
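
One way to sanity-check the contextual overlap is to compare the tail of one chunk with the head of the next. The sketch below assumes only that each chunk has a `sentences` list, as in the saved chunk files; the `follows_with_overlap` helper and the demo chunks are hypothetical. In real data the check may legitimately fail where a title or heading sits between two chunks as its own chunk.

```python
def follows_with_overlap(prev_chunk, next_chunk, overlap=2):
    """Return True if the last `overlap` sentences of prev_chunk open next_chunk."""
    if overlap <= 0:
        return True  # nothing is expected to be shared
    tail = prev_chunk["sentences"][-overlap:]
    head = next_chunk["sentences"][:len(tail)]
    return tail == head


# Hypothetical chunks that share two sentences
demo_chunks = [
    {"sentences": ["A.", "B.", "C."]},
    {"sentences": ["B.", "C.", "D."]},
]

print(follows_with_overlap(demo_chunks[0], demo_chunks[1], overlap=2))
```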

Let's examine the structure of our generated chunks:

In [6]:
# Let's load a chunk file if available to examine the structure
if files:  # Check if we have any files
    base_name = os.path.splitext(os.path.basename(files[0]))[0]
    chunks_file = f'../data/processed/chunks/{base_name}_chunks.json'
    
    if os.path.exists(chunks_file):
        with open(chunks_file, 'r', encoding='utf-8') as f:
            chunks = json.load(f)
            
        # Print info about the chunks
        print(f"Found {len(chunks)} chunks for {base_name}")
        
        # Show the structure of the first chunk
        print("\nExample chunk structure:")
        if chunks:
            first_chunk = chunks[0]
            print(json.dumps(first_chunk, indent=2))
            
            # Show how many sentences are in each chunk (first 5 chunks)
            print("\nSentence counts in first 5 chunks:")
            for i, chunk in enumerate(chunks[:5]):
                print(f"Chunk {i+1}: {len(chunk.get('sentences', []))} sentences")
    else:
        print(f"No chunks file found at {chunks_file}")

Found 772 chunks for andorra_spanish_20250201

Example chunk structure:
{
  "id": "chunk_0",
  "text": "Tercera contribuci\u00f3n determinada a nivel nacional de Andorra \u00a9 Oficina de l\u2019Energia i del Canvi Clim\u00e0tic 2025 Tercera Contribuci\u00f3n Determinada a nivel Nacional ante la Convenci\u00f3n Marco de las Naciones Unidas sobre Cambio Clim\u00e1tico (CMNUCC) Presentada y aprobada por el Gobierno de Andorra, 5 de febrero de 2025",
  "sentences": [
    "Tercera contribuci\u00f3n determinada a nivel nacional de Andorra",
    "\u00a9 Oficina de l\u2019Energia i del Canvi Clim\u00e0tic 2025",
    "Tercera Contribuci\u00f3n Determinada a nivel Nacional ante la Convenci\u00f3n Marco de las Naciones Unidas sobre Cambio Clim\u00e1tico (CMNUCC)",
    "Presentada y aprobada por el Gobierno de Andorra, 5 de febrero de 2025"
  ],
  "metadata": {
    "element_types": [
      "NarrativeText",
      "UncategorizedText"
    ],
    "page_number": 1,
    "paragraph_numbers": [
      1,


# 1.1.4 Benefits of Sentence-Based Chunking for NLP Tasks

The chunking approach we've implemented offers significant advantages for downstream NLP tasks:

1. **Improved Context for Embeddings**: When generating embeddings, each chunk contains complete thoughts with proper context, leading to more accurate semantic representations

2. **Better Search Results**: Searching through chunks that maintain sentence integrity gives more meaningful results compared to arbitrary text segments

3. **Control Over Granularity**: By adjusting the `max_chunk_size` and `overlap` parameters, we can fine-tune the tradeoff between context preservation and processing efficiency

4. **Enhanced Document Navigation**: Since we've preserved titles and headings as separate chunks, we can better understand the document structure

5. **Efficient Processing**: Rather than processing the entire document at once, we can efficiently work with manageable chunks while maintaining context through overlap