# Create and run a local RAG pipeline (From Scratch)

## What is RAG

RAG stands for **Retrieval Augmented Generation**. The goal of RAG is to retrieve specific, relevant information and pass it to an LLM so that it can generate outputs grounded in that information.

1. **Retrieval**: Find relevant information given a query.
2. **Augmentation**: Take the relevant information and *augment* our input (prompt) to an LLM with that relevant information.
3. **Generation**: Take the first two steps and pass them to an LLM for a generative output.
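
The three steps above can be sketched end to end with toy stand-ins. This is a minimal illustration, not the pipeline built later in this notebook: the documents, the function names (`retrieve`, `augment`, `generate`) and the word-overlap scoring are all assumptions made for the example, and `generate` is only a placeholder where a real LLM call would go.

```python
# Toy illustration of Retrieval -> Augmentation -> Generation.
# Every name and document below is hypothetical.
DOCUMENTS = [
    "ITCS 473 covers software quality assurance and testing.",
    "ITCS 475 covers linear and integer programming.",
    "The program uses a bimester system with two regular semesters.",
]


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Retrieval: rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Augmentation: place the retrieved passages into the prompt."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Generation: placeholder standing in for a real LLM call."""
    return f"[LLM output for a prompt of {len(prompt)} characters]"


query = "What does ITCS 475 cover?"
prompt = augment(query, retrieve(query, DOCUMENTS))
print(prompt)
print(generate(prompt))
```

Word overlap is used here only because it needs no model; the rest of this notebook works towards embedding-based retrieval instead.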

General-purpose chatbots trained on broad data (e.g. OpenAI's) have some limitations:

1. No real-time information
2. Hallucination: the model can make up plausible-sounding but wrong answers
3. No custom or domain-specific data

## Important Concepts:
1. Text Embedding
2. Vector Database

## Why Local
1. Cool
2. Privacy: No need to send company data through a third-party API.
3. Speed: No need to send data across the internet.
4. Cost: No API fees.
5. No vendor lock-in: If OpenAI exploded tomorrow, we could still operate.

In [1]:
!nvidia-smi

Thu Jun 27 12:06:46 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99                 Driver Version: 555.99         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4060      WDDM  |   00000000:01:00.0  On |                  N/A |
|  0%   39C    P8             N/A /  120W |     604MiB /   8188MiB |     24%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Sample

We will be using the ICT curriculum because why not?

1. We slice the file into smaller chunks of text as "context"
2. Embed the texts into numerical format using embedding models
3. Store them in a database or as PyTorch tensors

## What we are doing

### Document Preprocessing and Embedding Creation

1. Open a PDF document (or even a collection of PDFs)
2. Format the text of the PDF ready for an embedding model.
3. Embed all of the chunks of text in the textbook and turn them into numerical representations (embedding) which we can store for later.

### Search and Answer

4. Build a retrieval system that uses **Vector Search** to find relevant chunks of text based on a *query*.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on the passages of the textbook with an LLM.
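
As a rough sketch of what vector search means in step 4: the query and every chunk are represented as vectors, and the chunks whose vectors point in the most similar direction to the query are returned. The three-dimensional vectors and chunk labels below are made up for illustration (real embeddings come from an embedding model and have hundreds of dimensions), and cosine similarity is one common choice of score, assumed here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written "embeddings" for three chunks of text.
chunk_embeddings = {
    "course descriptions": [0.9, 0.1, 0.0],
    "admission rules": [0.1, 0.8, 0.2],
    "staff publications": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the user's query.
query_embedding = [0.8, 0.2, 0.1]

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunk_embeddings,
    key=lambda name: cosine_similarity(query_embedding, chunk_embeddings[name]),
    reverse=True,
)
best_match = ranked[0]

for name in ranked:
    score = cosine_similarity(query_embedding, chunk_embeddings[name])
    print(f"{score:.3f}  {name}")
```

The top-ranked chunks are then the ones inserted into the prompt in step 5.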

<hr>

# Steps:

## 1. Document Preprocessing and Embedding Creation

**Requirement**:
1. PDF Document (or any type of document)
2. Embedding Model of choice

**Steps**:
1. Import PDF document.
2. Process text for embedding 
    * (e.g. splitting into chunks of sentences)
3. Embed text chunks with embedding model
4. Save embeddings to file

### 1.1 Import PDF Document

In [2]:
import os
import requests

# Get PDF document path
pdf_path = "curriculum.pdf"

# Download PDF
if not os.path.exists(pdf_path):
    print(f"{pdf_path} does not exist")
    
    # Enter the URL of the PDF
    url = "https://www.ict.mahidol.ac.th/wp-content/uploads/2021/05/ICT2018-TQF2_Webversion_English.pdf"
    
    # The local filename to save the downloaded file
    filename = pdf_path
    
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if successful
    if response.status_code == 200:
        # Open the file and save it
        with open(filename, "wb") as f:
            f.write(response.content)
        print(f"File {filename} downloaded")
    else:
        print(f"Failed to download file: {response.status_code}")
        
else:
    print(f"File {pdf_path} Existed: Skipping")

File curriculum.pdf Existed: Skipping


### 1.2 Open PDF Document

There are multiple PDF-related modules; this tutorial uses [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/).

In [3]:
import fitz # PyMuPDF (Fitz is legacy/backward compatible)
print(fitz.__doc__)

None


In [4]:
# Progress Bar looks cool
from tqdm.auto import tqdm
print(tqdm.__doc__)


    Asynchronous-friendly version of tqdm.
    




In [5]:
# Perform text formatting
# Because raw PDF copy-pasting doesn't really work well
def text_formatter(text: str) -> str:
    """Performs minor formatting on text

    Args:
        text (str): Raw text extracted from a PDF page

    Returns:
        str: Formatted Text through various functions
    """
    
    # Strip trailing spaces
    # Replace "\n" with " "
    cleaned_text = text.replace("\n", " ").strip()
    
    # Potentially more text formatting functions go here
    # Better text = Better LLM
    
    return cleaned_text

In [6]:
def read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    
    # Loop through the doc with tqdm progress bar 
    # Page number and page content
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # Experiment with page number if you want
        pages_and_texts.append({
            "page_number": page_number - 3, # Offset so it matches the printed page numbers
            "page_char_count": len(text),
            "page_word_count": len(text.split(' ')),
            "page_sentence_count_raw": len(text.split('. ')),
            "page_token_count": len(text) / 4, # Rough estimate: 1 token = ~4 characters
            "text": text
                                })
    
    return pages_and_texts

**Token**: A sub-word piece of text. For example, "hello, world!" could be split into ["hello", ",", "world", "!"].

A token can be a whole word, part of a word, or a group of punctuation characters. As a rule of thumb, 1 token ~= 4 characters in English and 100 tokens ~= 75 words.
Text gets broken into tokens before being passed to an LLM.
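
The two rules of thumb above can be written as small helpers. This is only an estimate, not a real tokenizer: actual counts depend on the model's tokenizer, and the function names are made up for this sketch. The 198-character input mirrors the character count of the first page parsed in this notebook.

```python
def estimate_tokens_from_chars(text: str) -> float:
    """Rule of thumb: 1 token is roughly 4 characters of English text."""
    return len(text) / 4


def estimate_tokens_from_words(text: str) -> float:
    """Rule of thumb: 100 tokens is roughly 75 words."""
    return len(text.split()) * 100 / 75


sample = "Text gets broken into tokens before being passed to an LLM."
print(f"By characters: {estimate_tokens_from_chars(sample):.2f}")
print(f"By words:      {estimate_tokens_from_words(sample):.2f}")

# A 198-character page gives the same estimate as `page_token_count` in read_pdf.
print(estimate_tokens_from_chars("x" * 198))
```

The two estimates rarely agree exactly, which is fine: they are only used to judge whether a piece of text is roughly small enough for a model.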

### 1.3 Testing parsing and reading of PDF

In [7]:
# Testing
pages_and_text = read_pdf(pdf_path=pdf_path)
pages_and_text[:2] # First 2 samples

106it [00:00, 481.77it/s]


[{'page_number': -3,
  'page_char_count': 198,
  'page_word_count': 41,
  'page_sentence_count_raw': 1,
  'page_token_count': 49.5,
  'text': 'Bachelor of Science   in Information and Communication Technology (ICT)  (International Program)  2018 Revision                Faculty of Information and Communication Technology  Mahidol University'},
 {'page_number': -2,
  'page_char_count': 5270,
  'page_word_count': 341,
  'page_sentence_count_raw': 65,
  'page_token_count': 1317.5,
  'text': 'Table of Contents  SECTION 1.  GENERAL INFORMATION  .......................................................................................................................................... 1  1.  PROGRAM TITLE ..................................................................................................................................................................... 1  2.  DEGREE TITLE .............................................................................................................

In [8]:
# Random sample
import random
random.sample(pages_and_text, k=1)

[{'page_number': 57,
  'page_char_count': 1581,
  'page_word_count': 310,
  'page_sentence_count_raw': 2,
  'page_token_count': 395.25,
  'text': 'Degree    \uf052 Bachelor       Master        Ph.D.                 Information and Communication Technology  TQF2 Bachelor of Science in Information and Communication Technology (International Program)                 57    ITCS 473  Software Quality Assurance and Testing  3 (3 – 0 – 6)  Prerequisite    : ITCS 371  Co-requisite : None      Software quality assurance process;  avoidance of errors and other quality problems;  inspection and reviews; testing, verification and validation technique; process assurance versus  product assurance; quality process standards; product and process assurance; problem  analysis and reporting; statistical approaches to quality control  ITCS 475  Mathematical Programming  3 (3 – 0 – 6)  Prerequisite    : None  Co-requisite : None     Linear programming; integer linear programming; zero-one integer programmi

### 1.4 Performing some exploratory analysis

In [9]:
import pandas as pd

# This is why we built a list of dictionaries: it converts directly into a DataFrame
df = pd.DataFrame(pages_and_text)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -3 | 198 | 41 | 1 | 49.5 | Bachelor of Science in Information and Commu... |
| 1 | -2 | 5270 | 341 | 65 | 1317.5 | Table of Contents SECTION 1. GENERAL INFORMA... |
| 2 | -1 | 1495 | 86 | 15 | 373.75 | SECTION 6: ACADEMIC STAFF DEVELOPMENT ........... |
| 3 | 0 | 1252 | 273 | 10 | 313.0 | Degree  Bachelor Master Ph.D.... |
| 4 | 1 | 1630 | 314 | 13 | 407.5 | Degree  Bachelor Master Ph.D.... |


In [10]:
df.describe()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 106.0 | 106.0 | 106.0 | 106.0 | 106.0 |
| mean | 49.5 | 1787.084906 | 377.443396 | 7.09434 | 446.771226 |
| std | 30.743563 | 474.917479 | 127.83852 | 8.356951 | 118.72937 |
| min | -3.0 | 198.0 | 41.0 | 1.0 | 49.5 |
| 25% | 23.25 | 1594.5 | 322.0 | 2.0 | 398.625 |
| 50% | 49.5 | 1801.0 | 349.0 | 2.5 | 450.25 |
| 75% | 75.75 | 1951.75 | 387.0 | 11.0 | 487.9375 |
| max | 102.0 | 5270.0 | 812.0 | 65.0 | 1317.5 |


**Average Token Per Page**: `447`

Why care about tokens?

Tokens are an important concept because:
1. Embedding models can only handle a limited number of tokens.
2. LLMs can only handle a limited number of tokens.

For example, an embedding model may have been trained to embed sequences of `384` tokens into numerical space.

As for LLMs, they can't accept an unlimited number of tokens in their **context window**.
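
To make the limit concrete, the sketch below checks a few page sizes from the statistics above against the `384`-token example. The limit is just the illustrative figure from the text (real models differ), the helper name is hypothetical, and the token counts use the same characters-divided-by-4 estimate as the rest of the notebook.

```python
# Illustrative limit taken from the example in the text; real models differ.
EXAMPLE_MAX_TOKENS = 384


def fits_in_window(char_count: int, max_tokens: int = EXAMPLE_MAX_TOKENS) -> bool:
    """Estimate tokens as characters / 4 and compare against the limit."""
    return char_count / 4 <= max_tokens


# Character counts taken from the min, 50% and max rows of df.describe().
page_char_counts = {
    "shortest page": 198,
    "median page": 1801,
    "longest page": 5270,
}

for label, char_count in page_char_counts.items():
    estimated_tokens = char_count / 4
    verdict = "fits" if fits_in_window(char_count) else "too long"
    print(f"{label}: ~{estimated_tokens} tokens -> {verdict}")
```

Under this estimate even the median page is over the example limit, which is the motivation for splitting pages into smaller pieces.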


### 1.5 Splitting each page into sentences

Possible Ways:
1. Splitting on `"."`
2. Using an NLP library, e.g. NLTK or spaCy
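
A quick look at why the first option is fragile: splitting on `". "` treats every full stop followed by a space as a sentence boundary, including the ones inside abbreviations. The example sentence below is made up for illustration.

```python
# Made-up text containing exactly two sentences and one abbreviation.
text = "Students earn a B.Sc. degree. The program was revised in 2018."

# Naive approach: split wherever a full stop is followed by a space.
naive_sentences = text.split(". ")

for piece in naive_sentences:
    print(repr(piece))

print(f"{len(naive_sentences)} pieces from 2 sentences")
```

The abbreviation is cut in half, so the naive count is too high. A library splitter gives more control over such cases, though its output is still worth checking on your own text.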

In [12]:
from spacy.lang.en import English

nlp = English()

# Build a sentencizer pipeline.
nlp.add_pipe("sentencizer")

# Create document instance as an example.
doc = nlp("This is a sentence. This is another sentence, according to this. Hello World!")
assert len(list(doc.sents)) == 3

# Print out our sentences split
list(doc.sents)

[This is a sentence.,
 This is another sentence, according to this.,
 Hello World!]

In [13]:
pages_and_text[0]

{'page_number': -3,
 'page_char_count': 198,
 'page_word_count': 41,
 'page_sentence_count_raw': 1,
 'page_token_count': 49.5,
 'text': 'Bachelor of Science   in Information and Communication Technology (ICT)  (International Program)  2018 Revision                Faculty of Information and Communication Technology  Mahidol University'}

In [14]:
for item in tqdm(pages_and_text):
    item["sentences"] = list(nlp(item["text"]).sents)
    
    # Make sure all sentences are strings (by default they are spaCy Span objects)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    # Count the sentences
    item["page_sentence_count_spacy"] = len(item['sentences'])

100%|██████████| 106/106 [00:00<00:00, 342.62it/s]


In [15]:
random.sample(pages_and_text, k=1)

[{'page_number': 11,
  'page_char_count': 1448,
  'page_word_count': 289,
  'page_sentence_count_raw': 8,
  'page_token_count': 362.0,
  'text': 'Degree    \uf052 Bachelor       Master        Ph.D.                 Information and Communication Technology  TQF2 Bachelor of Science in Information and Communication Technology (International Program)                 11    Section 3.  Educational Management System,   Curriculum Implementation and Structure  1. Educational Management System  1.1 System  A bimester system, with two regular semesters and one special semester (also called the  “summer semester”) in an academic year.   1.2 Summer session  Summer session is offered depending on Program Faculty Members consideration   1.3 Credit equivalent to semester system  Credits transferring is applicable according to the regulations announced by Mahidol  University  1.4 Course Exempt  Course exemption is applicable according to the regulations announced by Faculty of ICT,  Mahidol University

In [16]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 106.0 | 106.0 | 106.0 | 106.0 | 106.0 | 106.0 |
| mean | 49.5 | 1787.08 | 377.44 | 7.09 | 446.77 | 5.1 |
| std | 30.74 | 474.92 | 127.84 | 8.36 | 118.73 | 5.99 |
| min | -3.0 | 198.0 | 41.0 | 1.0 | 49.5 | 1.0 |
| 25% | 23.25 | 1594.5 | 322.0 | 2.0 | 398.62 | 1.0 |
| 50% | 49.5 | 1801.0 | 349.0 | 2.5 | 450.25 | 1.0 |
| 75% | 75.75 | 1951.75 | 387.0 | 11.0 | 487.94 | 8.0 |
| max | 102.0 | 5270.0 | 812.0 | 65.0 | 1317.5 | 33.0 |


### 1.6 Chunking sentences together

The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking.

Reasons:
1. Easier to filter.
2. Can fit into embedding model context window
3. LLM can have more specific and focused contexts

Try experimenting with chunk size if wanted.

Tools such as LangChain can be used.
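
To get a feel for how chunk size changes the result, the sketch below splits a page of 33 placeholder sentences (33 is the largest spaCy sentence count in the statistics above) with a few different sizes. The helper `chunk_sentences` and the candidate sizes are assumptions for illustration only.

```python
import math


def chunk_sentences(sentences: list[str], chunk_size: int) -> list[list[str]]:
    """Group consecutive sentences into lists of at most chunk_size items."""
    return [
        sentences[i:i + chunk_size]
        for i in range(0, len(sentences), chunk_size)
    ]


# Placeholder sentences standing in for the longest page.
sentences = [f"Sentence {i}." for i in range(33)]

for chunk_size in (4, 8, 16):
    chunks = chunk_sentences(sentences, chunk_size)
    # The number of chunks is always ceil(number of sentences / chunk size).
    assert len(chunks) == math.ceil(len(sentences) / chunk_size)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks, last chunk has {len(chunks[-1])} sentence(s)")
```

Smaller chunks give more focused context but more items to embed and search; larger chunks do the opposite and are more likely to exceed a model's token limit.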

In [28]:
num_sentence_chunk_size = 8

# A function to split a list into consecutive sublists of at most slice_size items
def split_list(input_list: list[str], slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7],
 [8, 9, 10, 11, 12, 13, 14, 15],
 [16, 17, 18, 19, 20, 21, 22, 23],
 [24]]

In [29]:
# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_text):
    item["sentence_chunk"] = split_list(input_list=item['sentences'],
                                        slice_size=num_sentence_chunk_size)
    item["num_chunk"] = len(item['sentence_chunk'])

100%|██████████| 106/106 [00:00<?, ?it/s]


In [32]:
random.sample(pages_and_text, k=1)

[{'page_number': 70,
  'page_char_count': 1695,
  'page_word_count': 320,
  'page_sentence_count_raw': 11,
  'page_token_count': 423.75,
  'text': 'Degree    \uf052 Bachelor       Master        Ph.D.                 Information and Communication Technology  TQF2 Bachelor of Science in Information and Communication Technology (International Program)                 70    Section 4.  Learning Outcomes, Teaching Strategies, and Assessment Strategies  1. Development of Students’ Special Characteristics    Special Characteristics  Strategy or Student Activity  1. Have a great personality, be  professional, have a great presentation  skill and strong communication skill (in  English)   • Guide students about appropriate manner such as how to  dress properly during official events as follow by the  Mahidol University regulation  • Teach students several techniques for public presentation  and allow them to practice during class presentation as  well as extra curriculums activities at the Facu

In [33]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunk |
|---|---|---|---|---|---|---|---|
| count | 106.0 | 106.0 | 106.0 | 106.0 | 106.0 | 106.0 | 106.0 |
| mean | 49.5 | 1787.08 | 377.44 | 7.09 | 446.77 | 5.1 | 1.3 |
| std | 30.74 | 474.92 | 127.84 | 8.36 | 118.73 | 5.99 | 0.65 |
| min | -3.0 | 198.0 | 41.0 | 1.0 | 49.5 | 1.0 | 1.0 |
| 25% | 23.25 | 1594.5 | 322.0 | 2.0 | 398.62 | 1.0 | 1.0 |
| 50% | 49.5 | 1801.0 | 349.0 | 2.5 | 450.25 | 1.0 | 1.0 |
| 75% | 75.75 | 1951.75 | 387.0 | 11.0 | 487.94 | 8.0 | 1.0 |
| max | 102.0 | 5270.0 | 812.0 | 65.0 | 1317.5 | 33.0 | 5.0 |


### 1.7 Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

That gives us a good level of granularity, meaning we can look up exactly which text sample was used by the model.

In [36]:
import re

# Split each chunk into its own item
pages_and_chunk = []
for item in tqdm(pages_and_text):
    for sentence_chunk in item['sentence_chunk']:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        
        # Join the sentences together into a paragraph-like structure.
        joined_sentence_chunk = "".join(sentence_chunk).replace(" ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" => ". A"
        
        chunk_dict['sentence_chunk'] = joined_sentence_chunk
        
        # Get some stats on the chunks
        chunk_dict['chunk_char_count'] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict['chunk_token_count'] = len(joined_sentence_chunk) / 4 # 1 token = ~4 chars
        
        pages_and_chunk.append(chunk_dict)
        
len(pages_and_chunk)

100%|██████████| 106/106 [00:00<00:00, 21169.23it/s]


138

In [37]:
random.sample(pages_and_chunk, k=1)

[{'page_number': 102,
  'sentence_chunk': 'Degree    \uf052 Bachelor       Master        Ph. D.                 Information and Communication Technology  TQF2 Bachelor of Science in Information and Communication Technology (International Program)                 102    (3) Haga J, Mitrpanont J, Roungsuriyaviboon J, Sathapornwatanakul T, Sawangphol W,  Kobayashi D, MedThaiSAGE: Visualization of Thai Medical Research Data on Large Tiled  Display Walls. In: the Pacific Rim Application and Grid Middleware Assembly (PRAGMA33);  2017 Oct 16; Brisbane, Australia; 2017. (4) Sawangphol W, Li Y-F, Tack G. CP4DL: Constraint-based Reasoning for Expressive  Description Logics. In: the Fifteenth International Workshop on Constraint Modelling and  Reformulation (ModRef), 2016 Sep 5-9; Toulouse, France; 2016. (5) Kang Y-B, Pan JZ, Krishnaswamy S, Sawangphol W, Li Y-F. How Long Will It Take?Accurate  Prediction of Ontology Reasoning Performance. In: The 28th AAAI Conference on Artificial  Intelligence 

In [38]:
df = pd.DataFrame(pages_and_chunk)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 138.0 | 138.0 | 138.0 | 138.0 |
| mean | 46.75 | 1371.71 | 289.17 | 342.93 |
| std | 32.8 | 555.78 | 168.62 | 138.95 |
| min | -3.0 | 119.0 | 4.0 | 29.75 |
| 25% | 14.0 | 1044.25 | 171.75 | 261.06 |
| 50% | 47.5 | 1519.0 | 310.5 | 379.75 |
| 75% | 76.75 | 1809.5 | 352.75 | 452.38 |
| max | 102.0 | 2273.0 | 813.0 | 568.25 |


### 1.8 Filter chunks of text for short chunks

Very short chunks are filtered out, since they are unlikely to contain much useful information.

In [47]:
# Show every chunk with at most 50 tokens
min_token_length = 50
for _, row in df[df['chunk_token_count'] <= min_token_length].iterrows():
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"Chunk token count: {row['chunk_token_count']} | Text: {row['sentence_chunk']}")

Chunk token count: 49.5 | Text: Bachelor of Science   in Information and Communication Technology (ICT)  (International Program)  2018 Revision                Faculty of Information and Communication Technology  Mahidol University
Chunk token count: 44.75 | Text: STUDENT APPEAL ................................................................................................................................................................. 77
Chunk token count: 29.75 | Text: In: the 2nd International  Conference on Information  Technology (InCIT), 2017 Nov 2-3;  Nakhon Pathom, Thailand; 2017.
Chunk token count: 40.25 | Text: 4.2  Identify the code of ICT-related ethics (e.g. policy, law). 4.3  Express the awareness of business, social, security,  professional, and ICT-related ethics.
Chunk token count: 35.75 | Text: 8.3  Analyze, design, and develop solutions for research  problems. 8.4  Evaluate the solutions. 8.5  Prepare a research paper for publication.


In [48]:
# Keep only the rows with more than min_token_length (50) tokens
pages_and_chunk_over_min_token_length = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunk_over_min_token_length[:2]

[{'page_number': -2,
  'sentence_chunk': 'Table of Contents  SECTION 1. GENERAL INFORMATION  .......................................................................................................................................... 1  1. PROGRAM TITLE ..................................................................................................................................................................... 1  2. DEGREE TITLE ........................................................................................................................................................................ 1  3. MAJOR OR MINOR SUBJECTS (IF ANY) ....................................................................................................................................... 1  4. TOTAL NUMBER OF CREDITS .................................................................................................................................................... 1  5. PROGRAM CHARACTERISTICS ..........

## 2. Embedding the text chunks