# Extract Text from PDFs, Generate Embeddings
parldebatescanner.org uses the [open data web services of parlament.ch](https://www.parlament.ch/de/%C3%BCber-das-parlament/fakten-und-zahlen/open-data-web-services) to retrieve speeches from sessions of the National Council and the Council of States of the Swiss parliament. For simplicity, and based on feedback from people interested in the code, this notebook does not use the parliament's API but extracts text from PDFs instead.

In [2]:
# Load Libraries
import os
import pandas as pd
import fitz # PyMuPDF used for text extraction from PDFs
from sentence_transformers import SentenceTransformer # Sentence Transformer to generate Embeddings

The model paraphrase-multilingual-MiniLM-L12-v2 can handle a context size of 128 tokens. For simplicity we use a rule of thumb: a token represents approximately 4 characters, so we set the chunk size to 4*128 = 512 characters. A more sophisticated approach would be to use the model's tokenizer to split the text into pieces of at most 128 tokens.
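To make the token-based alternative more concrete, here is a minimal sketch of splitting by token count instead of character count. The names `split_by_tokens` and `whitespace_tokenize` are made up for illustration, and the whitespace tokenizer is only a stand-in (an assumption): it is not the model's subword tokenizer, so its token counts will differ from what the model actually sees.

```python
from typing import Callable, List

def split_by_tokens(text: str,
                    tokenize: Callable[[str], List[str]],
                    max_tokens: int = 128) -> List[str]:
    """Split text into pieces of at most max_tokens tokens each."""
    tokens = tokenize(text)
    # Joining with spaces only makes sense for the whitespace stand-in below;
    # with a real subword tokenizer one would decode the token ids instead.
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

# Stand-in tokenizer (assumption): plain whitespace splitting
whitespace_tokenize = str.split

pieces = split_by_tokens("wort " * 300, whitespace_tokenize, max_tokens=128)
print([len(p.split()) for p in pieces])  # [128, 128, 44]
```

Passing the tokenizer as a function keeps the splitting logic independent of any particular library, so the stand-in could later be swapped for the model's own tokenizer.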

In [3]:
# Set default (relative) directories
DATA_PATH = 'data'
RAW_DATA_PATH = 'data/raw'
TRANSFORMED_DATA_PATH = 'data/transformed'

# Create directories if they don't exist
for path in (DATA_PATH, RAW_DATA_PATH, TRANSFORMED_DATA_PATH):
    os.makedirs(path, exist_ok=True)

MODEL = 'paraphrase-multilingual-MiniLM-L12-v2' # https://www.sbert.net/docs/pretrained_models.html
CHUNK_SIZE = 128*4 # Number of characters that make up a text chunk
OVERLAP = 1/5 # Fraction of the chunk size by which consecutive chunks overlap
MIN_BLOCK_LENGTH = 120 # Minimum number of characters of a text to be deemed valuable

model = SentenceTransformer(MODEL)
output_file_name = 'dataset_1.parquet'

For simplicity we extract all text blocks from the PDFs using PyMuPDF. A more sophisticated approach would be to combine and cleanse the blocks. Other PDF extraction libraries exist, some of which use OCR. However, OCR is not always better than PyMuPDF, especially if the PDF's structure is very simple.
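As one small step towards cleaner blocks, non-text blocks could be dropped before any further processing. The sketch below assumes the tuple layout that PyMuPDF documents for `page.get_text('blocks')`, namely `(x0, y0, x1, y1, text, block_no, block_type)` with `block_type` 0 for text and 1 for images. The helper `text_blocks` and the sample tuples are made up so the example runs without a PDF.

```python
from typing import List, Tuple

# Assumed layout of one block: (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]

def text_blocks(blocks: List[Block]) -> List[str]:
    """Return the text of all text blocks (block_type == 0) in their original order."""
    return [b[4] for b in blocks if b[6] == 0]

# Made-up sample data standing in for the result of page.get_text('blocks')
sample = [
    (72.0, 72.0, 500.0, 90.0, "Der Regierungsrat bedankt sich.\n", 0, 0),
    (72.0, 100.0, 300.0, 250.0, "<image: DeviceRGB, width: 100, height: 80, bpc: 8>", 1, 1),
    (72.0, 260.0, 500.0, 280.0, "Zweiter Absatz.\n", 2, 0),
]
print(text_blocks(sample))
```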

For cleansing we only replace some whitespace characters (newlines, tabs and carriage returns) with spaces. There are many more sophisticated approaches to cleanse the text.
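A slightly more thorough cleansing step might look like the sketch below. It is only one possible approach, and the function name `cleanse` is hypothetical. It normalizes Unicode (which, for example, turns ligatures into plain letters), re-joins words that were hyphenated at a line break, and collapses runs of whitespace. The de-hyphenation is a heuristic: it would also join genuine hyphenated compounds that happen to break at a line end.

```python
import re
import unicodedata

def cleanse(text: str) -> str:
    """Normalize Unicode, re-join words hyphenated at line breaks, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # "Stellung-\nnahme" -> "Stellungnahme"
    text = re.sub(r"\s+", " ", text)              # any run of whitespace -> one space
    return text.strip()

print(cleanse("Ihre  Stellung-\nnahme\tsenden Sie\r\nbitte ..."))
# Ihre Stellungnahme senden Sie bitte ...
```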

In [4]:
# List all PDF files in the folder RAW_DATA_PATH (sorted for a reproducible order)
files_to_read = sorted(f for f in os.listdir(RAW_DATA_PATH) if f.lower().endswith('.pdf'))

print('Files to read:', files_to_read)

# Placeholder for extracted text blocks
data = []

# Iterate over all files to read
for file_name in files_to_read:

    # Running index of the text blocks kept from this document
    block_index = 0

    # Open the PDF using PyMuPDF; the context manager closes the file afterwards
    with fitz.open(os.path.join(RAW_DATA_PATH, file_name)) as doc:

        # Iterate over all pages of the document, counting pages from 1
        for page_nr, page in enumerate(doc, start=1):

            # Extract all text blocks from the page
            blocks = page.get_text('blocks')

            # Iterate over all text blocks
            for block in blocks:

                # The text is at index 4 of the block tuple; replace line breaks and tabs with spaces
                block_cleansed = block[4].replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').strip()

                # Only keep blocks with a minimum length. This is a very naive way to drop
                # short fragments such as headers and page numbers
                if len(block_cleansed) >= MIN_BLOCK_LENGTH:

                    # Append filename, page number, block index and the text as a row to our dataset
                    data.append([file_name, page_nr, block_index, block_cleansed])

                    block_index += 1 # Increment block index

# Put extracted texts into a pandas DataFrame
df = pd.DataFrame(data, columns=['pdf_name', 'page_number', 'block_index', 'block_text'])
df.head()

Files to read: ['fedlex-data-admin-ch-eli-dl-proj-2023-5-cons_1-doc_12-de-pdf-a.pdf']


| | pdf_name | page_number | block_index | block_text |
|---|---|---|---|---|
| 0 | fedlex-data-admin-ch-eli-dl-proj-2023-5-cons_1... | 1 | 0 | Der Regierungsrat des Kantons Aargau bedankt s... |
| 1 | fedlex-data-admin-ch-eli-dl-proj-2023-5-cons_1... | 2 | 1 | 3. Ihre elektronische Stellungnahme senden Sie... |
| 2 | fedlex-data-admin-ch-eli-dl-proj-2023-5-cons_1... | 4 | 2 | Der Regierungsrat des Kantons Aargau begrüsst,... |
| 3 | fedlex-data-admin-ch-eli-dl-proj-2023-5-cons_1... | 4 | 3 | Die "Generaleinwilligung" ist kein Freibrief f... |
| 4 | fedlex-data-admin-ch-eli-dl-proj-2023-5-cons_1... | 5 | 4 | Erlangt die Prüfperson nach Abschluss des klin... |


The extracted text blocks are sometimes too big for the model to encode: paraphrase-multilingual-MiniLM-L12-v2 has a context length of 128 tokens. We therefore split the text blocks into chunks.

In [5]:
# Placeholder variable for text chunks
data_chunks = []

# Iterate over all text blocks
for idx, row in df.iterrows():

    text = row['block_text']
    pos = 0 # Pointer to iterate over text block
    chunk_id = 0 # Placeholder for chunk_id per text block
    
    # As long as the pointer has not reached the end of the text block, keep generating chunks
    while pos < len(text):
        # The first chunk starts at the beginning of the block, so there is nothing to overlap with
        if pos == 0:
            data_chunks.append(row.tolist() + [chunk_id, text[pos:pos+CHUNK_SIZE]])
        # Later chunks start CHUNK_SIZE*OVERLAP characters earlier so that they overlap with the preceding chunk.
        # As a result they can be up to CHUNK_SIZE*(1+OVERLAP) characters long
        else:
            data_chunks.append(row.tolist() + [chunk_id, text[pos-int(CHUNK_SIZE*OVERLAP):pos+CHUNK_SIZE]])
        
        pos += CHUNK_SIZE
        chunk_id += 1

    # Note: Chunking the text by characters is a naive approach. Using the model's tokenizer would give better results

# Combine chunks with original data set
df = df.merge(pd.DataFrame(data_chunks, columns=df.columns.tolist()+['chunk_index', 'chunk_text']), on=df.columns.tolist(), how='left')
df.head()

| | pdf_name | page_number | block_index | block_text | chunk_index | chunk_text |
|---|---|---|---|---|---|---|
| 0 | fedlex-data-admin-ch-eli-dl-proj-2023-5-cons_1... | 1 | 0 | Der Regierungsrat des Kantons Aargau bedankt s... | 0 | Der Regierungsrat des Kantons Aargau bedankt s... |
| 1 | fedlex-data-admin-ch-eli-dl-proj-2023-5-cons_1... | 2 | 1 | 3. Ihre elektronische Stellungnahme senden Sie... | 0 | 3. Ihre elektronische Stellungnahme senden Sie... |
| 2 | fedlex-data-admin-ch-eli-dl-proj-2023-5-cons_1... | 4 | 2 | Der Regierungsrat des Kantons Aargau begrüsst,... | 0 | Der Regierungsrat des Kantons Aargau begrüsst,... |
| 3 | fedlex-data-admin-ch-eli-dl-proj-2023-5-cons_1... | 4 | 3 | Die "Generaleinwilligung" ist kein Freibrief f... | 0 | Die "Generaleinwilligung" ist kein Freibrief f... |
| 4 | fedlex-data-admin-ch-eli-dl-proj-2023-5-cons_1... | 5 | 4 | Erlangt die Prüfperson nach Abschluss des klin... | 0 | Erlangt die Prüfperson nach Abschluss des klin... |
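To see which character ranges the chunking loop produces, the following sketch computes only the `(start, end)` offsets, using the same stepping and overlap logic with the notebook's settings (512 characters, 1/5 overlap) as defaults. The function name `chunk_spans` is made up for this illustration.

```python
from typing import List, Tuple

def chunk_spans(text_len: int, chunk_size: int = 512, overlap: float = 0.2) -> List[Tuple[int, int]]:
    """Return the (start, end) character offsets of the chunks for a text of text_len characters."""
    spans = []
    pos = 0
    while pos < text_len:
        # The first chunk starts at 0; later chunks reach back into the preceding chunk
        start = 0 if pos == 0 else pos - int(chunk_size * overlap)
        end = min(pos + chunk_size, text_len)
        spans.append((start, end))
        pos += chunk_size
    return spans

print(chunk_spans(1200))  # [(0, 512), (410, 1024), (922, 1200)]
```

For a block of 1200 characters, the second chunk spans 614 characters, so chunks after the first can exceed the 512-character rule of thumb by the size of the overlap.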


In [6]:
# Generate embeddings using the pretrained model from Hugging Face
# (pass a plain list of strings so that encoding does not depend on the DataFrame index)
df['embeddings'] = model.encode(df['chunk_text'].tolist(), show_progress_bar=True).tolist()


In [7]:
# Store the dataset (chunks and their embeddings) in a parquet file for search in 01_search.ipynb
df.to_parquet(f'{TRANSFORMED_DATA_PATH}/{output_file_name}')
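The stored embeddings are intended for the search in 01_search.ipynb, which is not shown here. As a rough illustration only, and not necessarily what that notebook does, embeddings are commonly compared with cosine similarity. The sketch below ranks a few made-up three-dimensional vectors against a query vector; real embeddings from the model have many more dimensions, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query: Sequence[float], embeddings: List[Sequence[float]]) -> List[int]:
    """Indices of embeddings, sorted from most to least similar to the query."""
    return sorted(range(len(embeddings)),
                  key=lambda i: cosine_similarity(query, embeddings[i]),
                  reverse=True)

# Toy vectors (made up) standing in for chunk embeddings
toy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(rank([1.0, 0.1, 0.0], toy_embeddings))  # [0, 2, 1]
```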