In [1]:
from platform import python_version

print(python_version())

3.11.5


# Create and run a local RAG pipeline 

The goal of this notebook is to build a RAG (Retrieval Augmented Generation) pipeline from scratch and have it run on a local GPU.

Specifically, we'd like to be able to open a PDF file, ask questions (queries) of it and have them answered by a Large Language Model (LLM).

Frameworks such as [LlamaIndex](https://www.llamaindex.ai/) and [LangChain](https://www.langchain.com/) already provide this kind of workflow. The goal of building from scratch, however, is to be able to inspect and customize every part.

## Import PDF Document




In [7]:
import os
import requests

# Get PDF document path
pdf_path = "human-nutrition-text.pdf"

# Download PDF
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Enter the URL of the PDF
    url = "https://students.aiu.edu/submissions/profiles/resources/onlineBook/q9f4h2_Health_Care_Systems_Around_the_World.pdf"

    # The local filename to save the downloaded file
    filename = pdf_path

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open the file and save it
        with open(filename, "wb") as file:
            file.write(response.content) 
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")

else:
    print(f"File {pdf_path} exists.")

File human-nutrition-text.pdf exists.


Now that the PDF is downloaded, let's open it!

In [46]:
import fitz # requires: !pip install PyMuPDF, see: https://github.com/pymupdf/PyMuPDF 
from tqdm.auto import tqdm # pip install tqdm

def text_formatter(text: str) -> str: 
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip()

    # Potentially more text formatting functions can go here
    return cleaned_text

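def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF and returns one dict of text and simple statistics per page."""
    doc = fitz.open(pdf_path)
    pages_and_texts = [] 
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # Offset by 31 so the first page of the main text (PDF index 31) becomes page 0
        pages_and_texts.append({"page_number": page_number - 31,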
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_setence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4, # 1 token = ~4 characters
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:3]

0it [00:00, ?it/s]

[{'page_number': -31,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_setence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''},
 {'page_number': -30,
  'page_char_count': 37,
  'page_word_count': 7,
  'page_setence_count_raw': 1,
  'page_token_count': 9.25,
  'text': 'HEALTH CARE SYSTEMS  AROUND THE WORLD'},
 {'page_number': -29,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_setence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]
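
The output above shows a word count of 1 for an empty page and 7 for a six-word title. A minimal sketch of why, recomputing the same statistics on single strings (the helper name `page_stats` is ours, for illustration only): `str.split(" ")` keeps empty strings, so an empty page and a double space each add an extra "word".

```python
def page_stats(text: str) -> dict:
    """Recompute the per-page statistics used above for a single string."""
    return {
        "page_char_count": len(text),
        "page_word_count": len(text.split(" ")),  # splitting on " " keeps empty strings
        "page_token_count": len(text) / 4,        # heuristic: 1 token = ~4 characters
    }

title_stats = page_stats("HEALTH CARE SYSTEMS  AROUND THE WORLD")
empty_stats = page_stats("")

print(title_stats)
print(empty_stats)

# Calling split() with no argument ignores runs of whitespace and empty strings
print(len("HEALTH CARE SYSTEMS  AROUND THE WORLD".split()), len("".split()))
```

If exact word counts matter, `text.split()` is probably the safer choice. For rough statistics like the ones here, either works.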

In [47]:
import math
import random
random.sample(pages_and_texts, k=5)

[{'page_number': 477,
  'page_char_count': 4164,
  'page_word_count': 712,
  'page_setence_count_raw': 23,
  'page_token_count': 1041.0,
  'text': '478\t TUVALU In 2010, tuberculosis incidence was 66.0 per  100,000 population, tuberculosis prevalence 77.0  per 100,000, and deaths due to tuberculosis among  human immunodeficiency virus (HIV)-negative  people 20.00 per 100,000. As of 2009, an estimated  0.1 percent of adults age 15 to 49 were living with  HIV or acquired immune deficiency syndrome  (AIDS).  Health Care Personnel Turkmenistan has one medical school, Turkmen Dru\xad zbi Narodov Medical Institute in Ashkhabat, which  has been offering instruction since 1932. In 2007,  Turkmenistan had 2.44 physicians per 1,000 popula\xad tion, 0.20 pharmaceutical personnel per 1,000, 0.14  dentistry personnel per 1,000, and 4.52 nurses and  midwives per 1,000. Government Role in Health Care According to WHO, in 2010, Turkmenistan spent  $535 million on health care, an amount that comes  to 

In [49]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.tail(-31)

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count,text
31,0,3837,687,21,959.25,AFGHANISTAN\t 1 AFGHANISTAN Afghanistan is a l...
32,1,3576,594,14,894.00,2\t AFGHANISTAN Extension of the BPHS to all r...
33,2,4944,868,27,1236.00,"AFGHANISTAN\t 3 2010, for 90 percent. In 2010,..."
34,3,5098,874,28,1274.50,4\t AFGHANISTAN several factors implicated in ...
35,4,4949,823,24,1237.25,AFGHANISTAN\t 5 produced 140 community midwive...
...,...,...,...,...,...,...
601,570,2521,360,3,630.25,"INDEX\t 571 insurance, 417–418 Mothers’ Index ..."
602,571,2582,377,2,645.50,572\t INDEX government role in health care and...
603,572,2560,362,3,640.00,"INDEX\t 573 health care access, facilities, pe..."
604,573,2554,369,4,638.50,574\t INDEX government role in health care and...


In [50]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count
count,606.0,606.0,606.0,606.0,606.0
mean,271.5,3930.82,667.31,22.26,982.71
std,175.08,995.11,181.24,9.58,248.78
min,-31.0,0.0,1.0,1.0,0.0
25%,120.25,3564.25,612.25,20.0,891.06
50%,271.5,3977.5,694.0,24.0,994.38
75%,422.75,4684.75,801.75,27.0,1171.19
max,574.0,7990.0,902.0,129.0,1997.5


## Splitting pages into sentences
Before grouping text into chunks, we split each page into sentences. The raw count above simply splits on `". "`; here we use spaCy's rule-based `sentencizer` instead.

In [65]:
pip install spacy

Note: you may need to restart the kernel to use updated packages.


In [66]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer 
nlp.add_pipe("sentencizer")

# Create document instance as an example
doc = nlp("This is a sentence. This another sentence. I like elephants.")
assert len(list(doc.sents)) == 3

# Print out our sentences split
list(doc.sents)

[This is a sentence., This another sentence., I like elephants.]

In [68]:
pages_and_texts[531]

{'page_number': 500,
 'page_char_count': 4463,
 'page_word_count': 760,
 'page_setence_count_raw': 23,
 'page_token_count': 1115.75,
 'text': 'UZBEKISTAN\t 501 where 1 denotes high development and 0 low develop\xad ment). Life expectancy at birth in 2012 was estimated  at 72.77 years, and estimated gross domestic product  (GDP) per capita in 2011 was $3,300. In 2003, the  Gini Index (a measure of dispersion, in which perfect  equality is denoted by 0 and maximum inequality is  denoted by 100) for family income was 36.8. Emergency Health Services According to the World Health Organization  (WHO), as of 2007, Uzbekistan had a formal and  publicly available emergency care (prehospital care)  system accessible through a national access number.  Doctors Without Borders (Médecins Sans Frontières,  or MSF) began working in Uzbekistan in 1997 and  had 117 staff members in the country at the close of  2010. MSF’s primary focus is diagnosis and treatment  of drug-resistant tuberculosis, and it c

In [69]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings (the default type is a spaCy datatype)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/606 [00:00<?, ?it/s]

In [70]:
random.sample(pages_and_texts, k=1)

[{'page_number': 353,
  'page_char_count': 3575,
  'page_word_count': 635,
  'page_setence_count_raw': 20,
  'page_token_count': 893.75,
  'text': '354\t PANAMA Public Health Programs According to WHO, in 2000 (the most recent year for  which data is available), Palau had 0.07 environmen\xad tal and public health workers per 1,000 population.  In 2010, access to improved sanitation facilities was  essentially universal, while 85 percent of the popula\xad tion had access to improved sources of drinking water  (96 percent in rural areas and 83 percent in urban). PANAMA Panama is a Central American country, sharing bor\xad ders with Costa Rica and Colombia and having coast\xad lines on the Caribbean Sea and the north Pacific  Ocean. The area of 29,120 square miles (75,420 square  kilometers) makes Panama about the size of South  Carolina, and the July 2012 population was estimated  at 3.5 million. In 2010, 75 percent of the population  lived in urban areas, and the 2010 to 2015 annual rat

In [71]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count,page_sentence_count_spacy
count,606.0,606.0,606.0,606.0,606.0,606.0
mean,271.5,3930.82,667.31,22.26,982.71,21.66
std,175.08,995.11,181.24,9.58,248.78,8.37
min,-31.0,0.0,1.0,1.0,0.0,0.0
25%,120.25,3564.25,612.25,20.0,891.06,19.0
50%,271.5,3977.5,694.0,24.0,994.38,23.0
75%,422.75,4684.75,801.75,27.0,1171.19,27.0
max,574.0,7990.0,902.0,129.0,1997.5,69.0


### Chunking our sentences together
Splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking.

There is no single correct way to do this.

We'll keep it simple and split each page into groups of 10 sentences (you could also try 5, 7, 8, or whatever you like).

Frameworks such as LangChain can help with this, but we'll stick with plain Python for now: https://python.langchain.com/docs/modules/data_connection/document_transformers/

Why we do this:

1. So our texts are easier to filter (smaller groups of text can be easier to inspect than large passages of text).
2. So our text chunks can fit into our embedding model's context window (e.g. 384 tokens as a limit).
3. So the contexts passed to an LLM can be more specific and focused.

In [72]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# Create a function to split a list into consecutive sublists of at most slice_size items
# e.g. [20] -> [10, 10] or [25] -> [10, 10, 5]
def split_list(input_list: list[str],
               slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]
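
As a quick sanity check on the slicing logic, here is a small sketch (re-declaring `split_list` so it runs on its own) confirming that a list of `n` items yields `ceil(n / slice_size)` chunks, with only the last chunk allowed to be shorter. The values of `n` are just examples.

```python
import math

def split_list(input_list: list, slice_size: int = 10) -> list[list]:
    """Split a list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

for n in (0, 9, 10, 25, 69):
    chunks = split_list(list(range(n)))
    # The number of chunks is always the list length divided by the chunk size, rounded up
    assert len(chunks) == math.ceil(n / 10)
    print(n, [len(chunk) for chunk in chunks])
```

Note that an empty list produces no chunks at all, so pages without any sentences contribute nothing downstream.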

In [73]:
# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/606 [00:00<?, ?it/s]

In [74]:
random.sample(pages_and_texts, k=1)

[{'page_number': 360,
  'page_char_count': 4577,
  'page_word_count': 782,
  'page_setence_count_raw': 27,
  'page_token_count': 1144.25,
  'text': 'PERU\t 361 operated one general hospital, one specialized hospi\xad tal, 10 regional hospitals, five outlying clinics, and 60  first-level units. The National University of Asuncion  has a teaching hospital, providing care primarily for  low-income people, and the Police Clinic in Asuncion  provides the most complex hospital care available in  Paraguay. The private nonprofit sector operates 30  care facilities, and the private for-profit sector oper\xad ate a number of hospitals, emergency services, phar\xad macies, laboratories, physician offices, and other care  facilities. As of 2010, according to WHO, Paraguay had  10.18 health posts per 100,000 population, 1.83 health  centers per 100,000, 2.17 district or rural hospitals per  100,000, 0.12 provincial hospitals per 100,000, and  0.26 specialized hospitals per 100,000. In 2009, Para\xa

In [75]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_setence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,606.0,606.0,606.0,606.0,606.0,606.0,606.0
mean,271.5,3930.82,667.31,22.26,982.71,21.66,2.6
std,175.08,995.11,181.24,9.58,248.78,8.37,0.8
min,-31.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,120.25,3564.25,612.25,20.0,891.06,19.0,2.0
50%,271.5,3977.5,694.0,24.0,994.38,23.0,3.0
75%,422.75,4684.75,801.75,27.0,1171.19,27.0,3.0
max,574.0,7990.0,902.0,129.0,1997.5,69.0,7.0


### Splitting each chunk into its own item
We'd like to embed each chunk of sentences into its own numerical representation.

That gives us a good level of granularity: we can look up exactly which text sample was used in our model.

In [76]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts): 
    for sentence_chunk in item["sentence_chunks"]: 
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the list of sentences into one paragraph-like string
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" => ". A" (works for any capital letter)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 chars

        pages_and_chunks.append(chunk_dict) 

len(pages_and_chunks)

  0%|          | 0/606 [00:00<?, ?it/s]

1577
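
To see what the join-and-regex step in the cell above does, here is a small sketch on made-up strings (the sample sentences are ours, not taken from the PDF). Joining with `""` can fuse two sentences together, and the regex re-inserts a space between a period and a following capital letter. It also shows two side effects worth knowing about: acronyms are split apart, and a period followed by a digit is left untouched.

```python
import re

pattern = r'\.([A-Z])'  # a period immediately followed by a capital letter

# Hypothetical sentences with no trailing whitespace
sentences = ["Health care is funded by taxes.", "Medicines are covered at 70 percent."]

joined = "".join(sentences).replace("  ", " ").strip()
fixed = re.sub(pattern, r'. \1', joined)
print(joined)
print(fixed)

# Side effect: periods inside acronyms are also followed by capital letters
acronym = re.sub(pattern, r'. \1', "U.S.A")
print(acronym)

# Not handled: a period followed by a digit stays fused
numbered = re.sub(pattern, r'. \1', "Comparison.3. Health Policy.")
print(numbered)
```

Joining with `" ".join(...)` instead would avoid the need for the regex, at the cost of possibly introducing double spaces.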

In [80]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 41,
  'sentence_chunk': '42\t BARBADOS 2011 Mothers’ Index, produced by the international nongovernmental organization (NGO) Save the Chil\xad dren, based on a number of health and social factors relating to women, children, and maternal and child care. According to WHO, in 2007, 100 percent of births in Barbados were attended by skilled person\xad nel (for example, physicians, nurses, or midwives). In 2007, 100 percent of pregnant women received at least one prenatal care visit. The 2010 immunization rates for 1-year-olds were 86 percent for diphtheria and pertussis (DPT3), 85 percent for measles (MCV), and 86 percent for Hib (Hib3). Cost of Drugs The Barbados Drug Service obtains essential drugs from a local manufacturer and from sources in Europe, South America, the United States, and Canada. Barba\xad dos maintains a national Drug Formulary, and some medications are provided free to children under age 16, persons over age 65, and for treatment of asthma, cancer, di

In [81]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1577.0,1577.0,1577.0,1577.0
mean,264.72,1488.85,235.38,372.21
std,164.96,553.29,87.74,138.32
min,-30.0,7.0,1.0,1.75
25%,122.0,1283.0,202.0,320.75
50%,265.0,1620.0,256.0,405.0
75%,406.0,1834.0,292.0,458.5
max,574.0,3317.0,500.0,829.25
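
The table above reports a median of 405 and a maximum of 829.25 estimated tokens per chunk, while the example context window mentioned earlier was 384 tokens. A rough sketch of how one might flag chunks that exceed such a limit, using the same 4-characters-per-token heuristic. The sample strings and the helper name `estimate_tokens` are ours, and the real limit depends on the embedding model you choose.

```python
max_tokens = 384  # example limit from the text above; check your model's actual limit

def estimate_tokens(text: str) -> float:
    """Rough token estimate using the heuristic 1 token = ~4 characters."""
    return len(text) / 4

# Hypothetical chunks; lengths chosen to mirror values seen in the statistics above
sample_chunks = ["a" * 1200, "b" * 1620, "c" * 3317]

over_limit = [chunk for chunk in sample_chunks if estimate_tokens(chunk) > max_tokens]
print(f"{len(over_limit)} of {len(sample_chunks)} chunks exceed ~{max_tokens} tokens")
```

Chunks over the limit may be truncated by the embedding model, so a smaller `num_sentence_chunk_size` could be worth trying if many chunks are flagged.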


In [84]:
df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count
0,-30,HEALTH CARE SYSTEMS AROUND THE WORLD,36,6,9.0
1,-28,HEALTH CARE SYSTEMS AROUND THE WORLD A Compara...,100,15,25.0
2,-27,"FOR INFORMATION: SAGE Publications, Inc. 2455 ...",1483,219,370.75
3,-27,2. Cross-Cultural Comparison.3. Health Policy....,184,61,46.0
4,-26,CONTENTS Afghanistan.............................,1947,44,486.75


### Filter out short chunks of text
Very short chunks may not contain much useful information.

In [96]:
# Show random chunks with 30 tokens or fewer
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 21.75 | Text: Life expectancy at birth in 2012 was estimated at 82.50 years, among the highest in the
Chunk token count: 10.0 | Text: Most (83 percent) of health care funding
Chunk token count: 9.75 | Text: Despite these difficulties, North Korea
Chunk token count: 21.5 | Text: UNITED KINGDOM) BOLIVIA PERU BRAZIL URUGUA Y  P AR AG UA Y C H I L E 0 300 Mi 0 300 Km
Chunk token count: 25.0 | Text: HEALTH CARE SYSTEMS AROUND THE WORLD A Comparative Guide Sarah E. Boslaugh Kennesaw State University


In [98]:
# Keep only the rows with more than 30 tokens
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -27,
  'sentence_chunk': 'FOR INFORMATION: SAGE Publications, Inc. 2455 Teller Road Thousand Oaks, California 91320 E-mail: order@sagepub.com SAGE Publications India Pvt. Ltd. B 1/I 1 Mohan Cooperative Industrial Area Mathura Road, New Delhi 110 044 India SAGE Publications Ltd. 1 Oliver’s Yard 55 City Road London EC1Y 1SP United Kingdom SAGE Publications Asia-Pacific Pte. Ltd. 33 Pekin Street #02-01 Far East Square Singapore 048763 Senior Editor: Jim Brace-Thompson Cover Designer: Michael Dubowe Reference Systems Manager: Leticia Gutierrez Reference Systems Coordinators: Laura Notton   \t Anna Villasenor Marketing Manager: Carmel Schrire Golson Media President and Editor: J. Geoffrey Golson Production Director: Mary Jo Scibetta Layout Editor: Oona Patrick Copyeditor: Pam Schroeder Proofreader: Rebecca Kuzins Indexer: J S Editorial Copyright © 2013 by SAGE Publications, Inc. All rights reserved. No part of this book may be reproduced or utilized in any form or by any me
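
The same filter can be sketched without pandas, which makes the boundary behaviour easy to see: a chunk with exactly `min_token_length` tokens is dropped, because only chunks strictly above the threshold are kept. The records below are made up for illustration and only mimic the shape of `pages_and_chunks`.

```python
min_token_length = 30

# Hypothetical records in the same shape as the items in pages_and_chunks
records = [
    {"sentence_chunk": "x" * 100, "chunk_token_count": 25.0},
    {"sentence_chunk": "y" * 120, "chunk_token_count": 30.0},
    {"sentence_chunk": "z" * 1483, "chunk_token_count": 370.75},
]

# Keep only records strictly above the threshold
kept = [record for record in records if record["chunk_token_count"] > min_token_length]
print([record["chunk_token_count"] for record in kept])
```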

In [99]:

random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 134,
  'sentence_chunk': 'The sys\xad tem is funded by a wage tax (3.04 percent of covered earnings) and an employer payroll tax (7.09 percent of covered earnings). Medical care covered includes inpatient and outpatient care, preventive care, mater\xad nity care, surgery, emergency care, medicines, and prostheses for the disabled. Medicines are covered at 70 percent and are provided free to those on social assistance. Pediatric health care is provided until age 5 and includes child development programs, health care, and nutrition programs. The maternity benefit is equivalent to three months’ earnings for the insured, paid for six weeks prior to and after the due date. For low-income people, a nursing allowance is also paid for the first year of the child’s life. The sickness ben\xad efit is 60 percent of earnings for up to 26 weeks; if the patient is hospitalized, the benefit is 40 percent of earnings. In 2002, 21.1 percent of the population had private health insuranc

### Embedding our text chunks
Embeddings are a broad but powerful concept.

While humans understand text, machines understand numbers.

What we'd like to do:

Turn our text chunks into numbers, specifically embeddings: a useful numerical representation of the text.

The best part about embeddings is that they are a learned representation, rather than a fixed lookup that maps each word to an arbitrary number, like this:

{"the": 0,
"a": 1,
...

For a great resource on learning embeddings, see here: https://vickiboykis.com/what_are_embeddings/
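
To make the difference concrete, here is a toy sketch contrasting a fixed index lookup with vector representations. The three-dimensional vectors are invented for illustration (real embeddings are learned by a model and have hundreds of dimensions), and `cosine_similarity` is a plain-Python helper we define here.

```python
import math

# A fixed lookup: the numbers are arbitrary and say nothing about meaning
vocab_index = {"the": 0, "a": 1, "horse": 2, "pony": 3}

# Made-up vectors standing in for learned embeddings (illustration only)
toy_embeddings = {
    "horse": [0.9, 0.8, 0.1],
    "pony": [0.85, 0.75, 0.2],
    "the": [0.1, 0.0, 0.9],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_related = cosine_similarity(toy_embeddings["horse"], toy_embeddings["pony"])
sim_unrelated = cosine_similarity(toy_embeddings["horse"], toy_embeddings["the"])
print(f"horse vs pony: {sim_related:.3f}")
print(f"horse vs the:  {sim_unrelated:.3f}")
```

With the index lookup, "horse" (2) and "pony" (3) are no more related than any other pair of words. With vectors, similar items can sit close together, which is what retrieval relies on later.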

In [1]:
from sentence_transformers import SentenceTransformer # requires: pip install sentence-transformers
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device="cpu")

# Create a list of sentences
sentences = ["The Sentence Transformer library provides an easy way to create embeddings.",
             "Sentences can be embedded one by one or in a list.",
             "I like horses!"]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print("")

ModuleNotFoundError: No module named 'sentence_transformers'