# Create and run a local RAG pipeline

## What is RAG?

RAG stands for **Retrieval Augmented Generation**. The goal of RAG is to retrieve specific information and pass it to an LLM so that it can generate outputs grounded in that information.

1. **Retrieval**: Find relevant information given a query.
2. **Augmentation**: Take the relevant information and *augment* our input (prompt) to an LLM with that relevant information.
3. **Generation**: Pass the augmented prompt from the first two steps to an LLM for a generative output.
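
The three steps can be sketched end to end with toy stand-ins. Everything here is an assumption for illustration: `retrieve` ranks passages by shared words instead of embeddings, the passages are made up, and `generate` is a placeholder where a real LLM call would go.

```python
def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Toy retrieval: rank passages by how many words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment(query: str, context: list[str]) -> str:
    """Build a prompt that places the retrieved context before the question."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Use the context to answer.\nContext:\n{context_block}\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call; it only reports the prompt size."""
    return f"[LLM answer based on a {len(prompt)}-character prompt]"


# Made-up passages, not taken from the curriculum
passages = [
    "The library opens at nine in the morning.",
    "The cafeteria serves lunch from noon.",
]
query = "When does the library open?"

context = retrieve(query, passages)   # 1. Retrieval
prompt = augment(query, context)      # 2. Augmentation
answer = generate(prompt)             # 3. Generation
print(prompt)
print(answer)
```

The rest of this notebook replaces the word-overlap ranking with embeddings and vector search.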

General-purpose chatbots trained on broad data (e.g. OpenAI's) have some limitations:

1. No real-time information
2. Hallucinations (the model makes up an answer)
3. No custom, domain-specific data

## Important Concepts:
1. Text Embedding
2. Vector Database
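
To make the two concepts concrete: a text embedding is a list of numbers that represents a piece of text, and a vector database stores those lists and returns the ones closest to a query. The sketch below is only an illustration under loud assumptions: the 3-dimensional vectors are invented by hand (a real embedding model produces hundreds of dimensions) and a plain `dict` stands in for the database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; the values are invented for this example.
vector_store = {
    "course credits": [0.9, 0.1, 0.0],
    "admission requirements": [0.1, 0.8, 0.2],
    "campus location": [0.0, 0.2, 0.9],
}

# Pretend this vector came from embedding a user's question.
query_embedding = [0.8, 0.2, 0.1]

# "Vector search": pick the stored text whose embedding is most similar.
best_match = max(
    vector_store,
    key=lambda text: cosine_similarity(query_embedding, vector_store[text]),
)
print(best_match)
```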

## Why Local
1. Cool
2. Privacy: No need to send the company's data through a third-party API.
3. Speed: No need to send data across the internet.
4. Cost: No API fees.
5. No vendor lock-in: If OpenAI exploded tomorrow, we could still operate.

In [1]:
!nvidia-smi

Wed Jun 26 17:29:51 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99                 Driver Version: 555.99         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4060      WDDM  |   00000000:01:00.0  On |                  N/A |
|  0%   48C    P8             N/A /  120W |     924MiB /   8188MiB |     13%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Sample

We will be using the ICT curriculum because why not?

1. Slice the file into smaller chunks of text to use as "context"
2. Embed the chunks into a numerical format using an embedding model
3. Store the embeddings in a database or as PyTorch tensors

## What we are doing

### Document Preprocessing and Embedding Creation

1. Open a PDF document (or even a collection of PDFs).
2. Format the text of the PDF so it is ready for an embedding model.
3. Embed all of the chunks of text in the document and turn them into numerical representations (embeddings) which we can store for later.

### Search and Answer

4. Build a retrieval system that uses **Vector Search** to find relevant chunks of text based on a *query*.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on the passages of the document with an LLM.

<hr>

# Steps:

## 1. Document Preprocessing and Embedding Creation

**Requirements**:
1. PDF Document (or any type of document)
2. Embedding Model of choice

**Steps**:
1. Import PDF document.
2. Process text for embedding
    * (e.g. splitting into chunks of sentences)
3. Embed text chunks with embedding model
4. Save embeddings to file
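
Step 2 can be sketched as follows. This is a minimal illustration, not the method used later in the notebook: it assumes a naive split on `". "` and a hypothetical chunk size of 2 sentences, and `split_into_chunks` is a name made up here. A dedicated sentence splitter would handle abbreviations and decimals better.

```python
def split_into_chunks(text: str, sentences_per_chunk: int = 2) -> list[str]:
    """Split text on '. ' and group the sentences into fixed-size chunks."""
    # Naive sentence split; empty pieces are dropped
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    chunks = []
    for start in range(0, len(sentences), sentences_per_chunk):
        group = sentences[start:start + sentences_per_chunk]
        # Re-join with the separator that was removed by the split
        chunks.append(". ".join(group))
    return chunks


sample = "First sentence. Second sentence. Third sentence. Fourth one. Fifth one."
chunks = split_into_chunks(sample)
for chunk in chunks:
    print(chunk)
```

Smaller chunks keep each embedding focused on one topic; larger chunks carry more context per retrieved passage.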

### 1.1 Import PDF Document

In [9]:
import os
import requests

# Get PDF document path
pdf_path = "curriculum.pdf"

# Download PDF
if not os.path.exists(pdf_path):
    print(f"{pdf_path} does not exist")
    
    # Enter the URL of the PDF
    url = "https://www.ict.mahidol.ac.th/wp-content/uploads/2021/05/ICT2018-TQF2_Webversion_English.pdf"
    
    # The local filename to save the downloaded file
    filename = pdf_path
    
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if successful
    if response.status_code == 200:
        # Open the file and save it
        with open(filename, "wb") as f:
            f.write(response.content)
        print(f"File {filename} downloaded")
    else:
        print(f"Failed to download file: {response.status_code}")
        
else:
    print(f"File {pdf_path} already exists: skipping download")

curriculum.pdf does not exist
File curriculum.pdf downloaded


### 1.2 Open PDF Document

There are multiple PDF-related modules; this tutorial uses [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/).

In [23]:
import fitz # PyMuPDF (Fitz is legacy/backward compatible)
print(fitz.__doc__)

None


In [24]:
# Progress Bar looks cool
from tqdm.auto import tqdm
print(tqdm.__doc__)


    Asynchronous-friendly version of tqdm.
    


In [25]:
# Perform text formatting
# Because raw PDF copy-pasting doesn't really work well
def text_formatter(text: str) -> str:
    """Performs minor formatting on text

    Args:
        text (str): Raw text extracted from a single PDF page.

    Returns:
        str: The text with newlines replaced by spaces and surrounding whitespace stripped.
    """
    
    # Replace "\n" with " ", then strip leading and trailing whitespace
    cleaned_text = text.replace("\n", " ").strip()
    
    # Potentially more text formatting functions go here
    # Better text = Better LLM
    
    return cleaned_text

In [31]:
def read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    
    # Loop through the doc with tqdm progress bar 
    # Page number and page content
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # Experiment with the page number offset if you want
        pages_and_texts.append({
            "page_number": page_number - 3, # Offset the index by 3 so it matches the printed page number
            "page_char_count": len(text),
            "page_word_count": len(text.split(' ')),
            "page_sentence_count_raw": len(text.split('. ')),
            "page_token_count": len(text) / 4, # 1 token = ~4 characters of English text
            "text": text
                                })
    
    return pages_and_texts

**Token**: A sub-word piece of text. For example, "hello, world!" could be split into ["hello", ",", "world", "!"].

A token can be a whole word, part of a word, or a group of punctuation characters. As a rule of thumb, 1 token ~= 4 characters in English and 100 tokens ~= 75 words.
Text gets broken into tokens before being passed to an LLM.
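
Both rules of thumb can be turned into quick estimates. This is a rough sketch: `estimate_tokens_from_chars` and `estimate_tokens_from_words` are helper names made up here, and an exact count needs the tokenizer of the model you actually use.

```python
def estimate_tokens_from_chars(text: str) -> float:
    """1 token is roughly 4 characters of English text."""
    return len(text) / 4


def estimate_tokens_from_words(text: str) -> float:
    """100 tokens is roughly 75 words, so 1 word is roughly 100/75 tokens."""
    return len(text.split()) * 100 / 75


text = "Text gets broken into tokens before being passed to an LLM."
char_estimate = estimate_tokens_from_chars(text)
word_estimate = estimate_tokens_from_words(text)
print(char_estimate, word_estimate)
```

The two estimates usually land close to each other on ordinary English prose; a large gap is a hint that the text contains unusual spacing or formatting.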

### 1.3 Testing parsing and reading of PDF

In [32]:
# Testing
pages_and_text = read_pdf(pdf_path=pdf_path)
pages_and_text[:2] # First 2 samples

106it [00:00, 542.00it/s]


[{'page_number': -3,
  'page_char_count': 198,
  'page_word_count': 41,
  'page_sentence_count_raw': 1,
  'page_token_count': 49.5,
  'text': 'Bachelor of Science   in Information and Communication Technology (ICT)  (International Program)  2018 Revision                Faculty of Information and Communication Technology  Mahidol University'},
 {'page_number': -2,
  'page_char_count': 5270,
  'page_word_count': 341,
  'page_sentence_count_raw': 65,
  'page_token_count': 1317.5,
  'text': 'Table of Contents  SECTION 1.  GENERAL INFORMATION  .......................................................................................................................................... 1  1.  PROGRAM TITLE ..................................................................................................................................................................... 1  2.  DEGREE TITLE .............................................................................................................

In [33]:
# Random sample
import random
random.sample(pages_and_text, k=1)

[{'page_number': 83,
  'page_char_count': 2128,
  'page_word_count': 798,
  'page_sentence_count_raw': 3,
  'page_token_count': 532.0,
  'text': 'Degree    \uf052 Bachelor       Master        Ph.D.                 Information and Communication Technology  TQF2 Bachelor of Science in Information and Communication Technology (International Program)                 83    Course Code and Name  Number of  Credits  Learning Outcomes (LOs)  Program Learning Outcomes  Stream Learning Outcomes  PLO1 PLO2 PLO3 PLO4 PLO5 SLO6 SLO7 SLO8 SLO9  8 ITLG 108 Elementary Chinese I  2(1-2-3)  I        I          9 ITLG 109 Elementary Chinese II  2(1-2-3)  I        R          10 ITLG 110 Elementary Chinese III  2(1-2-3)  I        R          11 ITLG 201 Reading Skills  2(1-2-3)          I      I    12 ITLG 202 Public Speaking and Presentation  2(1-2-3)  I        R          13 ITLG 301 Business Writing  2(1-2-3)          R          13 ITLG 302 Academic Writing  2(1-2-3)  R        M/A      R    14 ITLG 303 Li
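
The sample above reports 798 words for only 2,128 characters. A likely cause (an inference from the output, not something verified against the PDF) is that `text.split(' ')` counts an empty string for every extra space in a run of spaces, and the extracted text contains many such runs. A small comparison on a short string shaped like the extracted text:

```python
text = "Course Code and Name    Number of  Credits"

# Splitting on a single space keeps an empty string between consecutive spaces.
naive_words = text.split(" ")

# split() with no argument splits on any run of whitespace and drops empties.
clean_words = text.split()

print(len(naive_words), len(clean_words))
print(naive_words)
```

Switching `page_word_count` to `len(text.split())` would give a more realistic count, at the cost of changing the numbers shown in this notebook's outputs.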

### 1.4 Performing some exploratory analysis

In [34]:
import pandas as pd

# This is why we built a list of dictionaries: it converts directly to a DataFrame
df = pd.DataFrame(pages_and_text)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -3 | 198 | 41 | 1 | 49.5 | Bachelor of Science in Information and Commu... |
| 1 | -2 | 5270 | 341 | 65 | 1317.5 | Table of Contents SECTION 1. GENERAL INFORMA... |
| 2 | -1 | 1495 | 86 | 15 | 373.75 | SECTION 6: ACADEMIC STAFF DEVELOPMENT ........... |
| 3 | 0 | 1252 | 273 | 10 | 313.0 | Degree  Bachelor Master Ph.D.... |
| 4 | 1 | 1630 | 314 | 13 | 407.5 | Degree  Bachelor Master Ph.D.... |


In [35]:
df.describe()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 106.0 | 106.0 | 106.0 | 106.0 | 106.0 |
| mean | 49.5 | 1787.084906 | 377.443396 | 7.09434 | 446.771226 |
| std | 30.743563 | 474.917479 | 127.83852 | 8.356951 | 118.72937 |
| min | -3.0 | 198.0 | 41.0 | 1.0 | 49.5 |
| 25% | 23.25 | 1594.5 | 322.0 | 2.0 | 398.625 |
| 50% | 49.5 | 1801.0 | 349.0 | 2.5 | 450.25 |
| 75% | 75.75 | 1951.75 | 387.0 | 11.0 | 487.9375 |
| max | 102.0 | 5270.0 | 812.0 | 65.0 | 1317.5 |


**Average tokens per page**: ~`447`

Why care about tokens?

Tokens are an important concept because:
1. Embedding models accept only a limited number of tokens per input.
2. LLMs accept only a limited number of tokens per prompt.

For example, an embedding model may be trained to embed sequences of up to `384` tokens into numerical space; longer inputs are typically truncated, so the extra text is lost.

As for LLMs, the whole prompt (the question plus the retrieved context) must fit inside the model's **context window**.
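
As a rough check, the `len(text) / 4` estimate used in `read_pdf` can flag pages that would not fit into a single embedding input. The sketch assumes the `384`-token limit from the example above (your embedding model's limit may differ), `pages_over_limit` is a helper name made up here, and two small stand-in records replace the real `pages_and_text` list.

```python
MAX_TOKENS = 384  # assumed limit, taken from the example in the text


def pages_over_limit(pages: list[dict], max_tokens: int = MAX_TOKENS) -> list[int]:
    """Return the page numbers whose estimated token count exceeds max_tokens."""
    return [
        page["page_number"]
        for page in pages
        if page["page_token_count"] > max_tokens
    ]


# Stand-in records with the same keys read_pdf produces
# (values copied from the first two pages shown earlier).
pages = [
    {"page_number": -3, "page_token_count": 49.5},
    {"page_number": -2, "page_token_count": 1317.5},
]
too_long = pages_over_limit(pages)
print(too_long)
```

With a median of about 450 estimated tokens per page, most pages in this document exceed 384, which is why each page is split into smaller chunks next.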


### 1.5 Splitting each page into sentences