# PDF-Extraction and Data Annotation

__Corresponding spec of the project__: <BR>
This notebook uses the introductory code provided by Jon to start the extraction process of our unstructured text data for the rest of the project. <BR>
The parts of the spec that this notebook should cover are: <BR>
* __Data Annotation__:
    - Process PDF files to a structured data format
    - Label sections of the extracted text with location data (e.g. page number, section heading, etc.)
    - Document annotation methodology <br>
    __Course Reference: W7 Lab - Text Extraction Methods, W8 Lab - Document Chunking__

* __Quality Assurance__:
    - Validate consistency (🏆)
    - Handle edge cases (🏆) <BR>
    __Course Reference: W8 Lab - Exploratory Quality Assurance__

In [11]:
import os
import glob

from tqdm.notebook import tqdm
import importlib
import utils 

importlib.reload(utils)

<module 'utils' from '/Users/bilalhashim/Desktop/LSE/Year 4/DS205/summative_project_2/problem-set-2-BilalNHashim/notebooks/utils.py'>

### Identifying PDFs

In [7]:
# Read all files

# Glob all files under ../data/pdfs/
# The glob module finds files whose paths match a pattern (the * matches every file in the pdfs folder).
# Crucially, the files variable stores the file paths, not the files themselves.

files = glob.glob('../data/pdfs/*')

### Parsing PDFs


#### IGNORE (To aid own understanding)
The dictionary comprehension below builds our initial mapping from document names to their extracted text. It iterates over the PDF file paths stored in the files variable above. For each file, it creates a key-value pair: the key is the base name (the file name stripped of its path), and the value is the text data returned by the extract_text_data_from_pdf function from utils. That function relies on the unstructured package to do the heavy lifting of the actual text extraction, with additional functionality for formatting and cleaning the extracted text. The metadata is a crucial addition here, since it gives us extra context for each element, such as page numbers. As the cell output below shows, the function cannot read some PDFs and returns no text for them (I assume this is fine, since the function handles and skips PDFs where this occurs). Under the hood, partition_pdf processes the different elements of the PDF, populates a dict for each one with the relevant tags and metadata, and appends that dict to a list. So each key (PDF name) maps to a list whose contents span the text of the PDF, with the associated information for each element. <BR>

Note on the partition_pdf function: <BR>
* The cell below splits each PDF into elements via partition_pdf. The split follows the part of the document that each word, sentence or section corresponds to, e.g. title, table, text etc.
* This is not tokenisation per se; we still need to tokenise after this step
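To make the element-based split concrete, here is a minimal sketch that counts how many elements of each type a document contains. The `sample_elements` list and the `count_element_types` helper are illustrative assumptions; the dict layout only mirrors the `id`, `type`, `text` and `metadata` keys visible in the outputs of this notebook.

```python
from collections import Counter

# Toy elements shaped like the dicts stored for each PDF (values are made up).
sample_elements = [
    {"id": 0, "type": "Title", "text": "Example title", "metadata": {"page_number": 1}},
    {"id": 1, "type": "NarrativeText", "text": "A first paragraph.", "metadata": {"page_number": 1}},
    {"id": 2, "type": "NarrativeText", "text": "A second paragraph.", "metadata": {"page_number": 2}},
    {"id": 3, "type": "Table", "text": "a b c", "metadata": {"page_number": 2}},
]


def count_element_types(elements):
    """Count how many elements of each type a document contains."""
    return Counter(element.get("type", "Unknown") for element in elements)


type_counts = count_element_types(sample_elements)
print(type_counts.most_common())
```

On the real data, the same helper could be called on any value of the doc_text dictionary to see how a document splits into titles, narrative text and tables.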



__Note:__ The cell below currently logs an "Invalid dictionary construct" error while processing the PDFs, which means that pdfminer is failing to extract the text. Some PDFs contain non-standard or corrupted metadata, which may be causing the issue <BR>
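One way to keep track of such failures is to record them explicitly rather than rely on the log output. The sketch below is an assumption about how this could be done, not what utils does: `extract_all` and `fake_extractor` are hypothetical names, and `fake_extractor` only stands in for utils.extract_text_data_from_pdf so that the example runs on its own.

```python
import os


def extract_all(files, extractor):
    """Run `extractor` on every path, recording failures instead of stopping."""
    extracted, failed = {}, {}
    for path in files:
        name = os.path.basename(path)
        try:
            elements = extractor(path)
        except Exception as exc:  # broad on purpose: parser errors vary by PDF
            failed[name] = repr(exc)
            continue
        if elements:
            extracted[name] = elements
        else:
            failed[name] = "no elements returned"
    return extracted, failed


# Hypothetical stand-in for the real extraction function, for illustration only.
def fake_extractor(path):
    if "broken" in path:
        raise ValueError("Invalid dictionary construct")
    if "empty" in path:
        return []
    return [{"id": 0, "type": "Title", "text": "Example"}]


extracted, failed = extract_all(
    ["../data/pdfs/good.pdf", "../data/pdfs/broken.pdf", "../data/pdfs/empty.pdf"],
    fake_extractor,
)
print(sorted(extracted), sorted(failed))
```

The `failed` dictionary then gives a list of documents to inspect by hand or to retry with a different extraction strategy.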

In [8]:
doc_text = {os.path.basename(file): utils.extract_text_data_from_pdf(file, unstructured_strategy="fast") 
            for file in tqdm(files)}

  0%|          | 0/186 [00:00<?, ?it/s]

Invalid dictionary construct: [/'CS', /'DeviceCMYK', /'I', False, /'K', /b'fals', /b'e', /'S', /'Transparency', /'Type', /'Group']
Traceback (most recent call last):
  File "/Users/bilalhashim/Desktop/LSE/Year 4/DS205/summative_project_2/problem-set-2-BilalNHashim/venv/lib/python3.12/site-packages/unstructured/partition/pdf.py", line 251, in partition_pdf_or_image
    extracted_elements = extractable_elements(
                         ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bilalhashim/Desktop/LSE/Year 4/DS205/summative_project_2/problem-set-2-BilalNHashim/venv/lib/python3.12/site-packages/unstructured/partition/pdf.py", line 196, in extractable_elements
    return _partition_pdf_with_pdfminer(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/bilalhashim/Desktop/LSE/Year 4/DS205/summative_project_2/problem-set-2-BilalNHashim/venv/lib/python3.12/site-packages/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users

doc_text is a dictionary in which each key is a document's file name and each value is a list of dicts, where each dict represents one element together with its metadata
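Because every element carries its page number in the metadata, this structure can be regrouped by location. The following is a minimal sketch on toy data; `group_text_by_page` is a hypothetical helper and the sample elements are made up.

```python
from collections import defaultdict


def group_text_by_page(elements):
    """Map each page number to the list of element texts found on that page."""
    pages = defaultdict(list)
    for element in elements:
        # Elements without metadata or page_number are grouped under None.
        page = element.get("metadata", {}).get("page_number")
        pages[page].append(element.get("text", ""))
    return dict(pages)


sample = [
    {"text": "Title A", "metadata": {"page_number": 1}},
    {"text": "Para 1", "metadata": {"page_number": 1}},
    {"text": "Para 2", "metadata": {"page_number": 2}},
    {"text": "No page info"},
]

pages = group_text_by_page(sample)
print(pages)
```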

__Storing our doc_text dictionary as a pickle file so that we can reload it without rerunning the long procedure above__

In [13]:
import pickle

with open("doc_text.pkl", "wb") as f:
    pickle.dump(doc_text, f)

In [12]:
import pickle
with open("doc_text.pkl", "rb") as f:
    doc_text = pickle.load(f)
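The save and load cells above could be folded into one caching helper, so that the slow extraction only runs when no pickle exists yet. This is a sketch under that assumption; `load_or_build` and `build_toy` are hypothetical names, and the toy builder stands in for the extraction step.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build):
    """Load a pickled object if the cache file exists, else build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = build()
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def build_toy():
    """Stand-in for the slow extraction step; records each invocation."""
    calls.append(1)
    return {"example.pdf": [{"id": 0, "text": "Example"}]}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "doc_text.pkl"
    first = load_or_build(cache_file, build_toy)   # builds and writes the cache
    second = load_or_build(cache_file, build_toy)  # reads the cache instead

print(len(calls))
```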

We already collect a lot of metadata about each chunk of PDF, as can be seen below:



In [5]:
#186 documents 
print(f'No. docs: {len(doc_text)}')
#From the Algeria PDF, for instance, we can extract the element at index 28, which is a narrative paragraph on page 4; the elements at the start of the document hold the title
doc_text['algeria_english_20220601.pdf'][28]
#print(doc_text['algeria_english_20220601.pdf'][1])

No. docs: 186


{'id': 28,
 'type': 'NarrativeText',
 'text': 'Algeria is an African and a Mediterranean country covering 2 381 741 km2. Like many of the countries in its region, Algeria is affected by desertification and land degradation. Most of the country is arid or semi- arid. The areas receiving more than 400 mm of rain per year are located in a narrow strip along the coast, not-exceeding 150 km large. Moreover, due to climate changes, yearly average rainfall declined by more than 30% over the past decades.',
 'metadata': {'page_number': 4,
  'coordinates': CoordinatesMetadata(points=((70.824, 355.65031999999997), (70.824, 445.76935999999995), (527.73376, 445.76935999999995), (527.73376, 355.65031999999997)), system=<unstructured.documents.coordinates.PixelSpace object at 0x169752720>),
  'file_directory': '../data/pdfs',
  'filename': 'algeria_english_20220601.pdf',
  'languages': ['eng'],
  'last_modified': '2025-03-12T10:28:09',
  'links': [],
  '_known_field_names': frozenset({'attached_to_f

In [10]:
#Lists the file names of all 186 documents
list(doc_text.keys())

['democratic_republic_of_the_congo_french_20220601.pdf',
 'argentina_spanish_20220501.pdf',
 'georgia_english_20220601.pdf',
 'türkiye_english_20230401.pdf',
 'canada_english_20250201.pdf',
 'brazil_english_20241101.pdf',
 'venezuela_(bolivarian_republic_of)_spanish_20220601.pdf',
 'nauru_english_20220601.pdf',
 "lao_people's_democratic_republic_english_20220601.pdf",
 'mali_french_20220601.pdf',
 'morocco_french_20220601.pdf',
 'zimbabwe_english_20250201.pdf',
 'mauritania_french_20220601.pdf',
 'new_zealand_english_20250101.pdf',
 'marshall_islands_english_20250201.pdf',
 'saudi_arabia_english_20211023.pdf',
 'guatemala_spanish_20220601.pdf',
 'mongolia_english_20220601.pdf',
 'sao_tome_and_principe_english_20220601.pdf',
 'guinea-bissau_english_20220601.pdf',
 'equatorial_guinea_spanish_20221001.pdf',
 'panama_spanish_20240601.pdf',
 'vanuatu_english_20220801.pdf',
 'andorra_spanish_20230101.pdf',
 'burundi_french_20220601.pdf',
 'bosnia_and_herzegovina_english_20220601.pdf',
 'sing

In [11]:
doc_text['cuba_spanish_20250201.pdf'][0]

{'id': 0,
 'type': 'Title',
 'text': 'República de Cuba',
 'metadata': {'page_number': 1,
  'coordinates': CoordinatesMetadata(points=((192.127, 238.42984), (192.127, 269.02084), (418.98985600000003, 269.02084), (418.98985600000003, 238.42984)), system=<unstructured.documents.coordinates.PixelSpace object at 0x2cc0c7380>),
  'file_directory': '../data/pdfs',
  'filename': 'cuba_spanish_20250201.pdf',
  'languages': ['eng'],
  'last_modified': '2025-03-12T10:21:24',
  'links': [],
  '_known_field_names': frozenset({'attached_to_filename',
             'category_depth',
             'coordinates',
             'data_source',
             'detection_class_prob',
             'detection_origin',
             'emphasized_text_contents',
             'emphasized_text_tags',
             'file_directory',
             'filename',
             'filetype',
             'header_footer_type',
             'image_path',
             'is_continuation',
             'languages',
             'last_m

#### Data Validation and Quality Assurance

In [6]:
# Here we check whether any documents are empty - if our parsing procedure worked properly, we should not have any empty documents
def extraction_validation(doc_text):
    """Return the file names of documents for which no chunks were extracted."""
    return [filename for filename, chunks in doc_text.items() if not chunks]

extraction_validation(doc_text)

['morocco_french_20220601.pdf',
 'kenya_english_20220601.pdf',
 'kiribati_english_20230301.pdf',
 'iraq_arabic_20220601.pdf',
 'mauritius_english_20220601.pdf',
 "democratic_people's_republic_of_korea_english_20220601.pdf",
 'marshall_islands_english_20220601.pdf',
 'eswatini_english_20220601.pdf',
 'paraguay_spanish_20220601.pdf',
 'lesotho_english_20250201.pdf',
 'israel_english_20220601.pdf',
 'indonesia_english_20220901.pdf']

The check above shows that several documents were not parsed correctly, which may explain the errors logged during extraction. <BR>

As these documents contain no chunks, they will not form part of our downstream RAG procedure

In [18]:
# This function uses the element types assigned by the unstructured package to recognise portions of the PDF
# We check whether any documents are not narrative-dominant; this matters because we want the documents we work with to have sufficient narrative text
def flag_non_narrative_dominant_docs(doc_text, threshold=0.15):
    flagged = []

    for filename, chunks in doc_text.items():
        narrative = [c for c in chunks if c.get("type") == "NarrativeText"]
        if len(narrative) / max(1, len(chunks)) < threshold:
            flagged.append((filename, f"{len(narrative)} of {len(chunks)} are narrative"))
    
    return flagged

print(f'{len(flag_non_narrative_dominant_docs(doc_text))} docs are non-narrative heavy')

17 docs are non-narrative heavy


Note that because these documents contain a variety of formats such as tables, headings etc., it's normal to see several documents with a low share of narrative text, which is why I set the threshold so low. Documents with no chunks have a ratio of 0, so the empty documents identified above also fall below the threshold and are included in this count. Without a benchmark to compare against, this check is not a highly robust screener.
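Since there is no benchmark for the threshold, one option would be to look at the narrative share of every document before picking a cut-off. The sketch below runs on toy data; `narrative_share` and `toy_docs` are illustrative assumptions rather than part of utils.

```python
def narrative_share(chunks):
    """Fraction of a document's elements typed as NarrativeText (0.0 if empty)."""
    if not chunks:
        return 0.0
    narrative = sum(1 for c in chunks if c.get("type") == "NarrativeText")
    return narrative / len(chunks)


# Toy documents (made up) covering the three interesting cases.
toy_docs = {
    "mostly_text.pdf": [{"type": "NarrativeText"}] * 3 + [{"type": "Title"}],
    "mostly_tables.pdf": [{"type": "Table"}] * 9 + [{"type": "NarrativeText"}],
    "empty.pdf": [],
}

shares = {name: narrative_share(chunks) for name, chunks in toy_docs.items()}

# Print documents from lowest to highest narrative share.
for name, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{name}: {share:.2f}")
```

Sorting the shares like this would show whether the flagged documents form a clear cluster or whether the threshold cuts through a continuous range.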

In [19]:
chunked_documents = {}
for doc, elements in doc_text.items():
    chunked_documents[doc] = utils.chunk_document(elements)

In [20]:
chunked_documents['afghanistan_english_20220601.pdf'][9]

{'line_number': 10,
 'text': 'Target Years:',
 'type': 'Title',
 'metadata': {'page_number': 1,
  'coordinates': CoordinatesMetadata(points=((72.024, 393.14199999999994), (72.024, 405.14199999999994), (135.14, 405.14199999999994), (135.14, 393.14199999999994)), system=<unstructured.documents.coordinates.PixelSpace object at 0x2f2615a30>),
  'file_directory': '../data/pdfs',
  'filename': 'afghanistan_english_20220601.pdf',
  'languages': ['eng'],
  'last_modified': '2025-03-12T10:28:10',
  'links': [],
  '_known_field_names': frozenset({'attached_to_filename',
             'category_depth',
             'coordinates',
             'data_source',
             'detection_class_prob',
             'detection_origin',
             'emphasized_text_contents',
             'emphasized_text_tags',
             'file_directory',
             'filename',
             'filetype',
             'header_footer_type',
             'image_path',
             'is_continuation',
             'languages

In [33]:
# This chunking validation check counts how many documents contain at least one empty or short chunk
def detect_poorly_chunked_docs(chunked_documents, min_length=10):
    poorly_chunked_docs = []

    for doc_id, chunks in chunked_documents.items():
        for chunk in chunks:
            text = chunk.get("text", "")
            if not isinstance(text, str) or len(text.strip()) < min_length:
                poorly_chunked_docs.append(doc_id)
                break 

    return(f"Docs with poor chunks: {len(poorly_chunked_docs)}")

detect_poorly_chunked_docs(chunked_documents)

'Docs with poor chunks: 174'

In [34]:
# This builds on the previous check by counting the individual chunks that are empty or short, which shows how many of our chunks will be ineffective when it comes to embeddings
def detect_empty_or_short_chunks(chunked_documents, min_length=10):
    poor_chunks = []

    for doc_id, chunks in chunked_documents.items():
        for chunk in chunks:
            text = chunk.get("text", "")
            if not isinstance(text, str) or len(text.strip()) < min_length:
                poor_chunks.append(chunk)

    # This count is over chunks, not documents
    return f"Poor chunks: {len(poor_chunks)}"

detect_empty_or_short_chunks(chunked_documents)

'Poor chunks: 59317'

We see that nearly all of the 186 documents (174) contain at least one empty or short chunk. On its own, this does not tell us the magnitude of the problem; for that we need the second check <BR>

Even more concerning, almost 60,000 of our c.190,000 chunks (almost a third) are 'poor' quality. Of course, some of these are not really poor and are just short headers, titles or sentences. For the most part, though, we can expect these chunks to be of little use to the rest of our pipeline.
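To separate short-but-legitimate headings from genuinely poor chunks, the short chunks could be broken down by element type. This is a minimal sketch on made-up data; `short_chunk_types` is a hypothetical helper that reuses the same length rule as the checks above.

```python
from collections import Counter


def short_chunk_types(chunked_documents, min_length=10):
    """Count empty or short chunks per element type across all documents."""
    counts = Counter()
    for chunks in chunked_documents.values():
        for chunk in chunks:
            text = chunk.get("text", "")
            if not isinstance(text, str) or len(text.strip()) < min_length:
                counts[chunk.get("type", "Unknown")] += 1
    return counts


# Toy chunked documents (made up) with a mix of short and long chunks.
toy_chunked = {
    "a.pdf": [
        {"text": "Intro", "type": "Title"},
        {"text": "", "type": "NarrativeText"},
        {"text": "A sufficiently long paragraph.", "type": "NarrativeText"},
    ],
    "b.pdf": [
        {"text": "Targets", "type": "Title"},
        {"text": "3", "type": "UncategorizedText"},
    ],
}

short_by_type = short_chunk_types(toy_chunked)
print(short_by_type.most_common())
```

If most short chunks turned out to be titles, they could be merged into the following paragraph rather than dropped.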

__Note: as I only implemented these checks after completing the majority of this project, I didn't get the chance to better engineer the chunk database, but this would have been a major potential source of improvement for the project__

#### Committing Chunks to our Database
* This portion of the code was assisted heavily by vibe-coding. Nevertheless, the intuition behind this approach was to take our pre-processed document chunks and use SQLAlchemy to store them in our SQL database. <br>

It starts with a helper function serialize_metadata() that makes sure the metadata attached to each chunk can be saved as JSON. I had problems uploading frozensets, datetimes, and custom objects (like coordinates), so the helper converts them into plain strings, lists, or dictionaries that can be safely stored. <br>

For each document in chunked_documents, we strip the file extension from the document name (doc_id = doc_id[:-4]). This gives us the unique doc id, which is the relational key in this database <br>

For each chunk, we generate a unique ID, serialize the metadata, and build a dictionary with the relevant fields: id, doc_id, content, chunk_index, and chunk_metadata. <br>

We then pass this dictionary to the DocChunk model defined in the models.py script, a SQLAlchemy model that represents each chunk, collect the resulting objects, and add them to the current session <br>

Once all documents and their chunks are processed, we commit everything to the database with session.commit().
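As an alternative view of the serialization step described above, the standard library's json.dumps accepts a `default` callback for values it cannot encode natively. The sketch below is not the notebook's serialize_metadata(); `json_default` and `FakeCoordinates` are hypothetical names, and the metadata values are made up to resemble the outputs shown earlier.

```python
import json
from datetime import date, datetime


def json_default(value):
    """Fallback used by json.dumps for types it cannot encode natively."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, (set, frozenset)):
        return sorted(value, key=str)  # sorted so the output is deterministic
    return str(value)  # last resort for custom objects


class FakeCoordinates:
    """Hypothetical stand-in for a custom metadata object."""

    def __str__(self):
        return "FakeCoordinates(...)"


metadata = {
    "page_number": 4,
    "languages": ["eng"],
    "last_modified": datetime(2025, 3, 12, 10, 28, 9),
    "_known_field_names": frozenset({"filename", "coordinates"}),
    "coordinates": FakeCoordinates(),
}

encoded = json.dumps(metadata, default=json_default)
decoded = json.loads(encoded)
print(decoded)
```

The trade-off is that custom objects collapse to strings, whereas a hand-written helper can keep structured fields such as the coordinate points.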

In [1]:
import sys
import os
from pathlib import Path

# Add the project root to Python path
project_root = Path.cwd().parent
sys.path.append(str(project_root))

# Now you can import from climate_policy_extractor
from climate_policy_extractor.models import get_db_session, DocChunk
from dotenv import load_dotenv
import uuid

# Load environment variables and get database session
load_dotenv()
DATABASE_URL = os.getenv('DATABASE_URL')
session = get_db_session(DATABASE_URL)
session.rollback()

In [None]:

from sqlalchemy.orm import Session
from sqlalchemy.exc import SQLAlchemyError
import json
import copy
from datetime import date, datetime  # both names are used in the isinstance check below

session.rollback()
def serialize_metadata(metadata):
    """Convert metadata to JSON-serializable format."""
    if not metadata:
        return None
        
    serialized = {}
    for key, value in metadata.items():
        if key == 'coordinates' and hasattr(value, 'points'):
            # Convert coordinates to a list of lists
            serialized[key] = {
                'points': [list(point) for point in value.points],
                'system': str(value.system)
            }
        elif key == '_known_field_names' and isinstance(value, frozenset):
            # Convert frozenset to list
            serialized[key] = list(value)
        elif isinstance(value, (datetime, date)):
            # Convert datetime/date to ISO format string
            serialized[key] = value.isoformat()
        elif isinstance(value, (list, tuple)):
            # Convert lists/tuples to lists
            serialized[key] = list(value)
        else:
            # For other types, try to convert to string if not already serializable
            try:
                json.dumps(value)
                serialized[key] = value
            except (TypeError, ValueError):
                serialized[key] = str(value)
    
    return serialized

# Process each document and its chunks
for doc_id, chunks in chunked_documents.items():
    doc_id = doc_id[:-4]
    print(f"Processing document: {doc_id}")
    
    # Create DocChunk objects for each chunk
    doc_chunks = []
    for i, chunk in enumerate(chunks):
        chunk_id = str(uuid.uuid4())
        
        # Serialize the metadata before creating the DocChunk
        serialized_metadata = serialize_metadata(chunk.get('metadata', {}))
        
        # Create a dictionary with the chunk data
        chunk_data = {
            'id': chunk_id,
            'doc_id': doc_id,
            'content': chunk['text'],
            'chunk_index': i,
            'chunk_metadata': serialized_metadata
        }
        
        # Create and add the DocChunk
        doc_chunk = DocChunk(**chunk_data)
        doc_chunks.append(doc_chunk)
    
    # Add all chunks to the session
    session.add_all(doc_chunks)
    print(f"Added {len(doc_chunks)} chunks for document {doc_id}")

# Commit all chunks to the database, rolling back if the commit fails
try:
    session.commit()
    print("All chunks have been committed to the database")
except SQLAlchemyError:
    session.rollback()
    raise

### Additional Chunking Strategy

__Another feature of this task was to improve upon the chunking strategy__. Whilst I did experiment with this with the __aid of vibe-coding__, the time needed to generate embeddings on the new chunks, add them to the PostgreSQL database, and then run similarity searches on the new embeddings meant I wasn't able to validate whether this function worked better or to implement it within my RAG pipeline <BR>

However, I still thought I would include it in my project notebooks to demonstrate some of the efficiencies it brings and how I might use it if I carry this project forward independently. <BR>

I kept the function definition inside the utils.py file; below are annotations of the changes made, as well as a working demonstration and a comparison of the approach to the baseline <br>

The original chunking strategy created one chunk per text element (e.g. paragraph), which led to many small or inconsistent chunks that harmed the semantic embedding quality and created too many chunks in the vector database. <BR>

The new chunking strategy uses a rolling window that:

- Aggregates text into windows of up to 1000 characters 
- Uses a 200-character overlap between chunks to preserve semantic continuity
- Produces context-rich chunks suitable for embedding and retrieval
- Reduces the total number of chunks, improving both embedding efficiency and search performance
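The rolling-window idea listed above can be sketched as follows. This is not the improved_chunker from utils; `rolling_window_chunks` is a hypothetical, character-based illustration that assumes element texts are joined with single spaces and that the output keys match the `content` and `chunk_index` keys seen in this notebook.

```python
def rolling_window_chunks(elements, chunk_size=1000, overlap=200):
    """Join element texts and cut them into overlapping character windows."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    # Skip empty texts so they do not add stray spaces to the joined string.
    texts = (str(e.get("text", "")).strip() for e in elements)
    text = " ".join(t for t in texts if t)

    step = chunk_size - overlap  # each window starts `step` characters later
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({"content": text[start:start + chunk_size], "chunk_index": index})
        if start + chunk_size >= len(text):
            break  # the window already reaches the end of the text
    return chunks


# Toy elements (made up): 1500 + 1 + 500 = 2001 characters once joined.
toy_elements = [{"text": "a" * 1500}, {"text": ""}, {"text": "b" * 500}]
toy_chunks = rolling_window_chunks(toy_elements)
print([len(c["content"]) for c in toy_chunks])
```

Because the windows are cut purely by character count, chunks can start or end mid-word; splitting on sentence or element boundaries would avoid that at the cost of uneven chunk lengths.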

In [22]:
new_chunked_documents = {}
for doc, elements in doc_text.items():
    new_chunked_documents[doc] = utils.improved_chunker(elements)

In [27]:
# Check chunks 20 to 29 of the Algeria document
new_chunked_documents['algeria_english_20220601.pdf'][20:30]

[{'content': 'estation Plan with a global objective of reforestation of 1 245 000 ha. The mitigation actions to be implemented by Algeria, planned for the 2021-2030 period, will lead to the following contribution: Reduction of greenhouse gases emissions by 7% to 22%, by 2030, compared to a business as usual -BAU- scenario, conditional on external support in terms of finance, technology development and transfer, and capacity building. The 7% GHG reduction will be achieved with national means. The Algerian contribution regarding mitigation is defined as follows: Type of INDC: Relative reduction compared to Business as usual (BAU) scenario. Implementation period: 2021-2030 Methodological approach: combined approach: Bottom-Up concerning sectors and Top-Down concerning national objectives. Sectors covered: Energy (Generation, Transport, Building and Industry); Industrial processes; Agriculture, Forests, Land use and Waste. 6',
  'chunk_index': 20},
 {'content': 'ncerning sectors and Top-Do

Ultimately, these chunks might be __too__ long compared to those of the previous chunking strategy. This is an easy fix, however: change the default chunk length in the function within utils. It would have been interesting to see how the chunk length affected the embedding vectors produced in the next notebook and, in turn, the retrieval stage of my pipeline, but as mentioned I wasn't able to implement this due to time constraints.
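A quick way to compare the two strategies before committing to embeddings would be to summarise chunk counts and lengths. The sketch below uses made-up chunks; `chunk_length_summary` is a hypothetical helper, and the only detail taken from this notebook is that baseline chunks store their text under `text` while the new chunks use `content`.

```python
from statistics import median


def chunk_length_summary(chunks, text_key):
    """Summarise the number of chunks and their character lengths."""
    lengths = [len(chunk.get(text_key, "")) for chunk in chunks]
    if not lengths:
        return {"count": 0, "median_length": 0, "max_length": 0}
    return {
        "count": len(lengths),
        "median_length": median(lengths),
        "max_length": max(lengths),
    }


# Toy chunks (made up) for one document under each strategy.
baseline = [{"text": "Title"}, {"text": "A short sentence."}, {"text": "x" * 300}]
windowed = [{"content": "y" * 1000}, {"content": "z" * 522}]

baseline_summary = chunk_length_summary(baseline, "text")
windowed_summary = chunk_length_summary(windowed, "content")
print(baseline_summary)
print(windowed_summary)
```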