# Create and run a local RAG pipeline (From Scratch)

## What is a RAG

RAG stands for **Retrieval Augmented Generation**. The goal of RAG is to retrieve specific information and pass it to an LLM so that it can generate outputs grounded in that information.

1. **Retrieval**: Find relevant information given a query.
2. **Augmentation**: Take the relevant information and *augment* our input (prompt) to an LLM with that relevant information.
3. **Generation**: Take the first two steps and pass them to an LLM for a generative output.
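
The three steps can be sketched end to end with stand-in functions. This is a toy illustration only: `retrieve`, `augment`, and `generate` are hypothetical names, retrieval here is naive word overlap rather than the vector search built later, and `generate` just returns the prompt instead of calling an LLM.

```python
import re

# Toy RAG loop. All function names are hypothetical, for illustration only.

def words(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Rank passages by how many words they share with the query."""
    ranked = sorted(passages, key=lambda p: len(words(query) & words(p)), reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    """Insert the retrieved passages into the prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; it only returns the prompt it was given."""
    return f"[An LLM would answer using]\n{prompt}"

passages = [
    "ITCS 461 covers encryption and digital signatures.",
    "ITCS 481 covers computer graphics.",
]
query = "Which course covers encryption?"
context = retrieve(query, passages)
prompt = augment(query, context)
print(generate(prompt))
```

The rest of this notebook replaces each stand-in with a real component: embeddings for retrieval and an LLM for generation.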

Asking a general-purpose chatbot trained on broad data (e.g. OpenAI's) has some limitations:

1. No real-time information
2. Hallucinations (the model makes up answers)
3. No custom, domain-specific data

## Important Concepts:
1. Text Embedding
2. Vector Database

## Why Local
1. Cool
2. Privacy: No need to send company data through a third-party API.
3. Speed: No need to send data across the internet.
4. Cost: No API fees.
5. No Vendor Lock-in: If OpenAI exploded tomorrow, we could still operate.

In [1]:
!nvidia-smi

Thu Jun 27 17:21:36 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.99                 Driver Version: 555.99         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4060      WDDM  |   00000000:01:00.0  On |                  N/A |
|  0%   48C    P8             N/A /  120W |     740MiB /   8188MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Sample

We will be using the ICT curriculum because why not?

1. We slice the file into smaller chunks of text as "context"
2. Embed the texts into numerical format using embedding models
3. Store them in a database or as PyTorch tensors

## What we are doing

### Document Preprocessing and Embedding Creation

1. Open a PDF document (or even a collection of PDFs).
2. Format the text of the PDF so it is ready for an embedding model.
3. Embed all of the chunks of text in the document, turning them into numerical representations (embeddings) that we can store for later.

### Search and Answer

4. Build a retrieval system that uses **Vector Search** to find relevant chunks of text based on a *query*.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on the passages of the textbook with an LLM.

<hr>

# Steps:

## 1. Document Preprocessing and Embedding Creation

**Requirement**:
1. PDF Document (or any type of document)
2. Embedding Model of choice

**Steps**:
1. Import PDF document.
2. Process text for embedding 
    * (e.g. splitting into chunks of sentences)
3. Embed text chunks with embedding model
4. Save embeddings to file

### 1.1 Import PDF Document

In [2]:
import os
import requests

# Get PDF document path
pdf_path = "curriculum.pdf"

# Download PDF
if not os.path.exists(pdf_path):
    print(f"{pdf_path} does not exist")
    
    # Enter the URL of the PDF
    url = "https://www.ict.mahidol.ac.th/wp-content/uploads/2021/05/ICT2018-TQF2_Webversion_English.pdf"
    
    # The local filename to save the downloaded file
    filename = pdf_path
    
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Check if successful
    if response.status_code == 200:
        # Open the file and save it
        with open(filename, "wb") as f:
            f.write(response.content)
        print(f"File {filename} downloaded")
    else:
        print(f"Failed to download file: {response.status_code}")
        
else:
    print(f"File {pdf_path} Existed: Skipping")

File curriculum.pdf Existed: Skipping


### 1.2 Open PDF Document

There are multiple PDF-related modules. This tutorial uses [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/).

In [3]:
import fitz # PyMuPDF (Fitz is legacy/backward compatible)
print(fitz.__doc__)

None


In [4]:
# Progress Bar looks cool
from tqdm.auto import tqdm
print(tqdm.__doc__)


    Asynchronous-friendly version of tqdm.
    




In [5]:
# Perform text formatting
# Because raw PDF copy-pasting doesn't really work well
def text_formatter(text: str) -> str:
    """Performs minor formatting on text

    Args:
        text (str): Raw text extracted from a PDF page.

    Returns:
        str: The text with newlines replaced by spaces and surrounding whitespace stripped.
    """
    
    # Replace "\n" with " ", then strip leading and trailing whitespace
    cleaned_text = text.replace("\n", " ").strip()
    
    # Potentially more text formatting functions go here
    # Better text = Better LLM
    
    return cleaned_text

In [6]:
def read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    
    # Loop through the doc with tqdm progress bar 
    # Page number and page content
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # Experiment with page number if you want
        pages_and_texts.append({
            "page_number": page_number - 3, # Offset so page_number matches the numbers printed in the document
            "page_char_count": len(text),
            "page_word_count": len(text.split(' ')),
            "page_sentence_count_raw": len(text.split('. ')),
            "page_token_count": len(text) / 4, # 1 token = ~4 characters in English
            "text": text
                                })
    
    return pages_and_texts

**Token**: A sub-word piece of text. For example, "hello, world!" could be split into ["hello", ",", "world", "!"].

A token can be a whole word, part of a word, or a group of punctuation characters. 1 token ~= 4 characters in English, and 100 tokens ~= 75 words.
Text gets broken into tokens before being passed to an LLM.
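
Both rules of thumb can be written as small helpers. This is a rough sketch; the function names are hypothetical, and an exact count requires the tokenizer of the model you actually use.

```python
def estimate_tokens_from_chars(text: str) -> float:
    """Rule of thumb: 1 token is roughly 4 characters of English text."""
    return len(text) / 4

def estimate_tokens_from_words(text: str) -> float:
    """Rule of thumb: 100 tokens is roughly 75 words."""
    return len(text.split()) * 100 / 75

sample = "Text gets broken into tokens before being passed to an LLM."
print(estimate_tokens_from_chars(sample))
print(estimate_tokens_from_words(sample))
```

The two estimates will usually differ slightly, which is fine for sizing chunks but not for enforcing a hard limit.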

### 1.3 Testing parsing and reading of PDF

In [7]:
# Testing
pages_and_text = read_pdf(pdf_path=pdf_path)
pages_and_text[:2] # First 2 samples

106it [00:00, 448.24it/s]


[{'page_number': -3,
  'page_char_count': 198,
  'page_word_count': 41,
  'page_sentence_count_raw': 1,
  'page_token_count': 49.5,
  'text': 'Bachelor of Science   in Information and Communication Technology (ICT)  (International Program)  2018 Revision                Faculty of Information and Communication Technology  Mahidol University'},
 {'page_number': -2,
  'page_char_count': 5270,
  'page_word_count': 341,
  'page_sentence_count_raw': 65,
  'page_token_count': 1317.5,
  'text': 'Table of Contents  SECTION 1.  GENERAL INFORMATION  .......................................................................................................................................... 1  1.  PROGRAM TITLE ..................................................................................................................................................................... 1  2.  DEGREE TITLE .............................................................................................................

In [8]:
# Random sample
import random
random.sample(pages_and_text, k=1)

[{'page_number': 54,
  'page_char_count': 1952,
  'page_word_count': 339,
  'page_sentence_count_raw': 2,
  'page_token_count': 488.0,
  'text': 'Degree    \uf052 Bachelor       Master        Ph.D.                 Information and Communication Technology  TQF2 Bachelor of Science in Information and Communication Technology (International Program)                 54      Roles of software and hardware in designing the embedded systems; design components  including hardware and software architectures, design methodologies and tools, and communication  protocols; design specification and modeling, hardware components and platforms, software  organization, embedded and real-time operating systems, interfacing with external environments using  sensors and actuators, and communication in distributed embedded systems; Advanced topics such  as energy management, safety and reliability, and security; case-studies of real-world systems such as  biomedical devices, smart cards, RFID, networked se

### 1.4 Performing some exploratory analysis

In [9]:
import pandas as pd

# This is why we built a list of dictionaries: it converts directly to a DataFrame
df = pd.DataFrame(pages_and_text)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,-3,198,41,1,49.5,Bachelor of Science in Information and Commu...
1,-2,5270,341,65,1317.5,Table of Contents SECTION 1. GENERAL INFORMA...
2,-1,1495,86,15,373.75,SECTION 6: ACADEMIC STAFF DEVELOPMENT ...........
3,0,1252,273,10,313.0,Degree  Bachelor Master Ph.D....
4,1,1630,314,13,407.5,Degree  Bachelor Master Ph.D....


In [10]:
df.describe()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,106.0,106.0,106.0,106.0,106.0
mean,49.5,1787.084906,377.443396,7.09434,446.771226
std,30.743563,474.917479,127.83852,8.356951,118.72937
min,-3.0,198.0,41.0,1.0,49.5
25%,23.25,1594.5,322.0,2.0,398.625
50%,49.5,1801.0,349.0,2.5,450.25
75%,75.75,1951.75,387.0,11.0,487.9375
max,102.0,5270.0,812.0,65.0,1317.5


**Average Token Per Page**: `447`

Why care about tokens?

Tokens are an important concept because:
1. Embedding models can't handle an unlimited number of tokens.
2. LLMs can't handle an unlimited number of tokens either.

For example, an embedding model may be trained to embed sequences of `384` tokens into numerical space.

As for LLMs, they can only accept a limited number of tokens in their **context window**.
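
As a rough check, estimated token counts can be compared against a model's limit. The sketch below assumes the `384`-token figure from the example above and the 4-characters-per-token heuristic; `exceeds_limit` is a hypothetical helper, and a real check would count tokens with the model's own tokenizer.

```python
MAX_TOKENS = 384  # example limit from the text; check your model's actual limit

def exceeds_limit(text: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True when the estimated token count is above the limit."""
    estimated_tokens = len(text) / 4  # heuristic: ~4 characters per token
    return estimated_tokens > max_tokens

short_page = "x" * 198   # the cover page has 198 characters
long_page = "x" * 5270   # the longest page has 5270 characters
print(exceeds_limit(short_page))
print(exceeds_limit(long_page))
```

With an average of about 447 estimated tokens per page, many whole pages would not fit under such a limit, which motivates splitting pages into smaller chunks.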


### 1.5 Splitting each page into sentences

Possible Ways:
1. Splitting on `"."`
2. Using an NLP library, e.g. NLTK or spaCy
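
A quick look at why the first option is fragile: splitting on a period followed by a space also cuts inside abbreviations. The sample sentence below is made up for illustration.

```python
text = "She holds a Ph.D. in ICT. She teaches databases."

# Naive approach: split wherever a period is followed by a space.
naive_sentences = text.split(". ")
for sentence in naive_sentences:
    print(repr(sentence))
```

The text contains two sentences, but the naive split returns three pieces because "Ph.D." ends with a period. Library-based splitters use more rules than this, although they are not perfect either.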

In [11]:
from spacy.lang.en import English

nlp = English()

# Build a sentencizer pipeline.
nlp.add_pipe("sentencizer")

# Create a document instance as an example.
doc = nlp("This is a sentence. This is another sentence, according to this. Hello World!")
assert len(list(doc.sents)) == 3

# Print out our sentences split
list(doc.sents)

[This is a sentence.,
 This is another sentence, according to this.,
 Hello World!]

In [12]:
pages_and_text[0]

{'page_number': -3,
 'page_char_count': 198,
 'page_word_count': 41,
 'page_sentence_count_raw': 1,
 'page_token_count': 49.5,
 'text': 'Bachelor of Science   in Information and Communication Technology (ICT)  (International Program)  2018 Revision                Faculty of Information and Communication Technology  Mahidol University'}

In [13]:
for item in tqdm(pages_and_text):
    item["sentences"] = list(nlp(item["text"]).sents)
    
    # Make sure all sentences are strings (spaCy returns Span objects by default)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    # Count the sentences
    item["page_sentence_count_spacy"] = len(item['sentences'])

100%|██████████| 106/106 [00:00<00:00, 273.40it/s]


In [14]:
random.sample(pages_and_text, k=1)

[{'page_number': 22,
  'page_char_count': 1774,
  'page_word_count': 440,
  'page_sentence_count_raw': 2,
  'page_token_count': 443.5,
  'text': 'Degree    \uf052 Bachelor       Master        Ph.D.                 Information and Communication Technology  TQF2 Bachelor of Science in Information and Communication Technology (International Program)                 22    Number of credits (Lecture – Laboratory – Self-study)  *ITCS  453  Data Warehousing and Data Mining  3 (3 – 0 – 6)  *ITCS  475 Mathematical Programming  3 (3 – 0 – 6)  *ITCS  481 Computer Graphics  3 (3 – 0 – 6)  *ITCS  498 Special Topics in Computer Science  3 (3 – 0 – 6)      * The course that is already offered  (7) Health Information Technology  Number of credits (Lecture – Laboratory – Self-study)  *ITCS  403 Introduction to Healthcare Systems  3 (3 – 0 – 6)  *ITCS  404 Information Technology for Healthcare Services  3 (3 – 0 – 6)  *ITCS  405 Information Models and Healthcare Information Standards  3 (3 – 0 – 6)  *IT

In [15]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,106.0,106.0,106.0,106.0,106.0,106.0
mean,49.5,1787.08,377.44,7.09,446.77,5.1
std,30.74,474.92,127.84,8.36,118.73,5.99
min,-3.0,198.0,41.0,1.0,49.5,1.0
25%,23.25,1594.5,322.0,2.0,398.62,1.0
50%,49.5,1801.0,349.0,2.5,450.25,1.0
75%,75.75,1951.75,387.0,11.0,487.94,8.0
max,102.0,5270.0,812.0,65.0,1317.5,33.0


### 1.6 Chunking sentences together

The concept of splitting larger pieces of text into smaller ones is often referred to as text splitting or chunking.

Reasons:
1. Easier to filter.
2. Can fit into embedding model context window
3. LLM can have more specific and focused contexts

Feel free to experiment with the chunk size.

Tools such as LangChain can also be used for this.
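
To get a feel for the effect of chunk size, the number of chunks per page can be worked out ahead of time: cutting `n` sentences into groups of at most `size` gives `ceil(n / size)` chunks. The helper name below is hypothetical.

```python
import math

def expected_chunk_count(num_sentences: int, chunk_size: int) -> int:
    """Number of chunks produced when a list is cut into slices of chunk_size."""
    return math.ceil(num_sentences / chunk_size)

# 33 is the largest spaCy sentence count reported for a single page above.
for chunk_size in (4, 8, 16):
    print(chunk_size, expected_chunk_count(33, chunk_size))
```

Smaller chunks give more, shorter passages to search over; larger chunks give fewer passages that each carry more context.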

In [16]:
num_sentence_chunk_size = 8

# Split a list into consecutive sublists of at most slice_size items
def split_list(input_list: list[str], slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7],
 [8, 9, 10, 11, 12, 13, 14, 15],
 [16, 17, 18, 19, 20, 21, 22, 23],
 [24]]

In [17]:
# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_text):
    item["sentence_chunk"] = split_list(input_list=item['sentences'],
                                        slice_size=num_sentence_chunk_size)
    item["num_chunk"] = len(item['sentence_chunk'])

100%|██████████| 106/106 [00:00<?, ?it/s]


In [18]:
random.sample(pages_and_text, k=1)

[{'page_number': 87,
  'page_char_count': 2092,
  'page_word_count': 769,
  'page_sentence_count_raw': 3,
  'page_token_count': 523.0,
  'text': 'Degree    \uf052 Bachelor       Master        Ph.D.                 Information and Communication Technology  TQF2 Bachelor of Science in Information and Communication Technology (International Program)                 87    4 ITCS 407 Practical Healthcare Management  3(2-2-5)  M  M    R    M/A  M  R    5 ITCS 409 Special Topics in Healthcare Systems  3(3-0-6)  M  M        M  R  R    6 ITCS 453 Data Warehousing and Data Mining  3(3-0-6)  R  R        R  R  R            (8)  Management Information Systems  1 ITCS 364 Knowledge Management  3(3-0-6)  R  R        R  R  R    2 ITCS 365 Information Systems Analysis and Design  3(3-0-6)  M  M        M  M  M    3 ITCS 366 Enterprise Architecture  3(3-0-6)  R  R    R    R  R  R    4 ITCS 367 IT Infrastructure Management  3(3-0-6)  R  R        R  R  R    5 ITCS 368 Information and Business Process  Mana

In [19]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy,num_chunk
count,106.0,106.0,106.0,106.0,106.0,106.0,106.0
mean,49.5,1787.08,377.44,7.09,446.77,5.1,1.3
std,30.74,474.92,127.84,8.36,118.73,5.99,0.65
min,-3.0,198.0,41.0,1.0,49.5,1.0,1.0
25%,23.25,1594.5,322.0,2.0,398.62,1.0,1.0
50%,49.5,1801.0,349.0,2.5,450.25,1.0,1.0
75%,75.75,1951.75,387.0,11.0,487.94,8.0,1.0
max,102.0,5270.0,812.0,65.0,1317.5,33.0,5.0


### 1.7 Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

That gives us a good level of granularity, meaning we can pinpoint exactly which text sample was passed to the model.

In [20]:
import re

# Split each chunk into its own item
pages_and_chunk = []
for item in tqdm(pages_and_text):
    for sentence_chunk in item['sentence_chunk']:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        
        # Join the sentences together into a paragraph-like structure.
        joined_sentence_chunk = "".join(sentence_chunk).replace(" ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" => ". A"
        
        chunk_dict['sentence_chunk'] = joined_sentence_chunk
        
        # Get some stats on the chunks
        chunk_dict['chunk_char_count'] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict['chunk_token_count'] = len(joined_sentence_chunk) / 4 # 1 token = ~4 chars
        
        pages_and_chunk.append(chunk_dict)
        
len(pages_and_chunk)

100%|██████████| 106/106 [00:00<00:00, 34769.39it/s]


138

In [21]:
random.sample(pages_and_chunk, k=1)

[{'page_number': 77,
  'sentence_chunk': 'Degree    \uf052 Bachelor       Master        Ph. D.                 Information and Communication Technology  TQF2 Bachelor of Science in Information and Communication Technology (International Program)                 77    (5) Have honor of students   (6) Follow the announcement by Program Faculty Members  3.2   To recognize the outstanding achievements, students who maintain a high scholastic  GPA are eligible for graduation with the following honors. First Class Honor: Earn a  cumulative GPA of 3.50 or higher. Second Class Honor: Earn a cumulative GPA of 3.25 or  higher, but less than 3.50. Never receive an ‘F’, ‘W’ or ‘I’ grade for any course. Never regrade  any course. Complete all the required courses within 4 years since initial registration. 3.3  To request for graduation, students must meet the following requirements. (1) Be students who registered and received passing grade in all courses as required in the  program study plan  (2) 

In [22]:
df = pd.DataFrame(pages_and_chunk)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,138.0,138.0,138.0,138.0
mean,46.75,1371.71,289.17,342.93
std,32.8,555.78,168.62,138.95
min,-3.0,119.0,4.0,29.75
25%,14.0,1044.25,171.75,261.06
50%,47.5,1519.0,310.5,379.75
75%,76.75,1809.5,352.75,452.38
max,102.0,2273.0,813.0,568.25


### 1.8 Filter chunks of text for short chunks

Short chunks might not contain much useful information, so we filter them out.

In [23]:
# Show chunks with at most 50 tokens
min_token_length = 50
for _, row in df[df['chunk_token_count'] <= min_token_length].iterrows():
    # Single quotes inside the f-string keep this working on Python versions before 3.12
    print(f"Chunk token count: {row['chunk_token_count']} | Text: {row['sentence_chunk']}")

Chunk token count: 49.5 | Text: Bachelor of Science   in Information and Communication Technology (ICT)  (International Program)  2018 Revision                Faculty of Information and Communication Technology  Mahidol University
Chunk token count: 44.75 | Text: STUDENT APPEAL ................................................................................................................................................................. 77
Chunk token count: 29.75 | Text: In: the 2nd International  Conference on Information  Technology (InCIT), 2017 Nov 2-3;  Nakhon Pathom, Thailand; 2017.
Chunk token count: 40.25 | Text: 4.2  Identify the code of ICT-related ethics (e.g. policy, law). 4.3  Express the awareness of business, social, security,  professional, and ICT-related ethics.
Chunk token count: 35.75 | Text: 8.3  Analyze, design, and develop solutions for research  problems. 8.4  Evaluate the solutions. 8.5  Prepare a research paper for publication.


In [24]:
# Keep only the rows with more than min_token_length tokens
pages_and_chunk_over_min_token_length = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunk_over_min_token_length[:2]

[{'page_number': -2,
  'sentence_chunk': 'Table of Contents  SECTION 1. GENERAL INFORMATION  .......................................................................................................................................... 1  1. PROGRAM TITLE ..................................................................................................................................................................... 1  2. DEGREE TITLE ........................................................................................................................................................................ 1  3. MAJOR OR MINOR SUBJECTS (IF ANY) ....................................................................................................................................... 1  4. TOTAL NUMBER OF CREDITS .................................................................................................................................................... 1  5. PROGRAM CHARACTERISTICS ..........

## 2. Embedding the text chunks

Embedding is an important concept. While humans understand text, machines understand numbers.

TODO:
1. Turn the text chunks into useful numerical representations.

An embedding is a **Learned Representation**, meaning the model has already learned a mapping from words, sentences, etc. to numbers.

Ref: [Vickiboykis.com/what_are_embeddings](<https://Vickiboykis.com/what_are_embeddings>)

### 2.1 Getting an Open-Source (Free) Embedding Model

eg:
1. Transformer Library
2. HuggingFace

Models vary in embedding dimension, maximum token count, etc.

In [25]:
# https://sbert.net/
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2") # Or use other model from sbert

# List of sentences
sentences = [
    "The sentence transformer library in Python will provide a convenient way to create an embedding model for our LLM",
    "Sentences can be embedded one at a time or as a whole list",
    "He plays Elden Ring"
]

# Sentences are then encoded and embedded using model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embedding
for sentence, embedding in embeddings_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print('')



Sentence: The sentence transformer library in Python will provide a convenient way to create an embedding model for our LLM
Embedding: [ 2.33592596e-02  7.65010715e-03 -3.25721106e-03  5.62376454e-02
 -2.78488006e-02  1.39634376e-02 -1.42549509e-02 -1.11396220e-02
  3.03600356e-02 -7.68511146e-02  1.01084337e-02  2.70497426e-02
 -4.93733585e-02  4.27146023e-03  3.42298150e-02 -3.72172557e-02
  3.39534134e-02 -8.74367892e-04 -2.54810303e-02  3.67035046e-02
 -2.60433881e-03  4.54806536e-03  2.65060784e-03  3.28643993e-02
 -2.76557971e-02 -1.52009949e-02 -1.41041940e-02  2.15806887e-02
  3.82281989e-02 -2.34311186e-02 -1.51765412e-02  1.66700594e-02
  3.46966684e-02 -3.19524929e-02  1.15220803e-06 -3.60985026e-02
 -3.87522168e-02 -2.79415715e-02 -4.43816185e-03  9.47822758e-04
  4.93406020e-02 -3.67442891e-02 -7.99418055e-03  2.38510892e-02
 -6.35800213e-02  6.14589117e-02  1.51108736e-02  5.28505966e-02
  6.84414729e-02  1.05063662e-01 -1.09240348e-02 -4.80470248e-02
  2.63744388e-02 -7.

The model has converted each sentence into numbers (https://huggingface.co/sentence-transformers/all-mpnet-base-v2). Checking the shape now:

In [26]:
embeddings[0].shape # 768 numbers to represent ONE sentence

(768,)

We *embed* all the passages, then compare the embedding of the input prompt with the embedding of each potential answer. The higher the similarity score, the more likely the passage is related to the question.

e.g. (made-up numbers):

**Source**: I like milk

**Answers**:
1. Calcium is strong for the bones. (47%)
2. Cheap livestock shop list (60%)
3. Tractor color (30%)
4. You like machine learning (2%)


In [27]:
embedding = embedding_model.encode("I like milk.")
embedding

array([ 3.29155549e-02,  9.56299976e-02, -1.85288340e-02, -3.28779072e-02,
        1.94535255e-02,  5.49064614e-02, -4.99018654e-02, -8.10288452e-03,
        6.84753060e-02,  3.75885027e-03,  2.88886111e-03, -3.35996002e-02,
        5.52557502e-03,  4.89195660e-02, -1.82248540e-02, -2.21181847e-02,
        2.82817911e-02,  1.79584697e-02,  7.78875127e-02,  1.35696065e-02,
       -2.57396651e-03,  1.04286447e-02, -2.17716321e-02, -3.28772664e-02,
       -3.96525953e-03,  1.18092215e-02,  2.69057136e-02, -4.38594595e-02,
        4.80882637e-02,  5.92225417e-02, -2.26736795e-02,  2.50691324e-02,
       -4.27515134e-02, -3.26632075e-02,  1.42373324e-06, -8.78988858e-03,
       -1.59406066e-02,  3.07522286e-02, -5.24568779e-04,  3.38814221e-02,
        7.55067263e-03, -3.28946933e-02, -1.28630465e-02, -3.41370180e-02,
       -1.87201705e-02,  1.11997366e-01,  6.01977892e-02, -2.74671298e-02,
       -2.47906782e-02,  8.13442934e-03, -1.19801741e-02, -3.94501947e-02,
       -1.07321665e-01, -

In [28]:
%%time

embedding_model.to('cuda')

# Embed each chunk
for item in tqdm(pages_and_chunk_over_min_token_length):
    item['embedding'] = embedding_model.encode(item['sentence_chunk'])

100%|██████████| 133/133 [00:02<00:00, 63.11it/s]

CPU times: total: 6.38 s
Wall time: 2.11 s





In [33]:
%%time

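# Collect the chunk texts into a list so they can be encoded in batches
# (text_chunks is not defined in the cells shown above; this definition is assumed)
text_chunks = [item["sentence_chunk"] for item in pages_and_chunk_over_min_token_length]

text_chunks_embeddings = embedding_model.encode(text_chunks,
                                                batch_size=32, # Can experiment with the size you want
                                                convert_to_tensor=True)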
text_chunks_embeddings

CPU times: total: 1.86 s
Wall time: 1.98 s


tensor([[-0.0220, -0.0461, -0.0279,  ..., -0.0214, -0.0440, -0.0476],
        [ 0.0472, -0.0364, -0.0236,  ..., -0.0217,  0.0547, -0.0346],
        [ 0.0058, -0.0424, -0.0211,  ..., -0.0189,  0.0034, -0.0449],
        ...,
        [ 0.0475, -0.0544, -0.0494,  ...,  0.0015,  0.0169,  0.0221],
        [ 0.0377, -0.0100, -0.0466,  ...,  0.0130, -0.0588, -0.0162],
        [ 0.0418,  0.0519, -0.0189,  ...,  0.0412, -0.0142, -0.0279]],
       device='cuda:0')


### 2.2 Save Embeddings to Files

In [34]:
text_chunks_and_embedding_df = pd.DataFrame(pages_and_chunk_over_min_token_length)
embedding_df_save_path = "text_chunks_and_embedding_df.csv"
text_chunks_and_embedding_df.to_csv(embedding_df_save_path, index=False)

In [35]:
# Import saved file
text_chunks_and_embedding_df_load = pd.read_csv(embedding_df_save_path)
text_chunks_and_embedding_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-2,Table of Contents SECTION 1. GENERAL INFORMAT...,1267,58,316.75,[-2.19847802e-02 -4.61262427e-02 -2.79324148e-...
1,-2,THE ABILITY TO IMPLEMENT/PROMOTE THE PROGRAM ...,1125,87,281.25,[ 4.71613593e-02 -3.63593213e-02 -2.35828217e-...
2,-2,PROGRAM SPECIFIC INFORMATION ....................,1283,70,320.75,[ 5.78025775e-03 -4.24177051e-02 -2.10961681e-...
3,-2,FIELD EXPERIENCE COURSES (INTERNSHIP OR COOPER...,1380,90,345.0,[ 1.33792264e-02 -8.28028247e-02 -2.78722942e-...
4,-1,SECTION 6: ACADEMIC STAFF DEVELOPMENT ...........,1490,81,372.5,[ 5.83376065e-02 -6.44410923e-02 -1.28114251e-...
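
Writing the DataFrame to CSV stores each embedding as text, which is why the `embedding` column above shows bracketed strings instead of arrays. A minimal sketch of turning such a string back into numbers using only the standard library; `parse_embedding` is a hypothetical helper, and the sample string is shortened to the first three values of the first row.

```python
def parse_embedding(cell: str) -> list[float]:
    """Parse a bracketed, whitespace-separated string into a list of floats."""
    return [float(value) for value in cell.strip("[]").split()]

cell = "[-2.19847802e-02 -4.61262427e-02 -2.79324148e-02]"
vector = parse_embedding(cell)
print(vector)
```

A binary format that keeps arrays as arrays would avoid this parsing step, at the cost of a file that is harder to inspect by eye.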


For a large embedding database (100k+ embeddings), consider using a **Vector Database**.

## 3. Search and Answer (RAG)

Goal: Retrieve relevant passages based on a query and use those passages to augment an input to an LLM so it can generate an output based on those relevant passages.

### 3.1 Similarity Search

Embeddings can be used for almost any type of data, e.g. images and sounds.

Comparing embeddings is known as **Similarity Search**, **Vector Search**, or **Semantic Search**. A passage can match even if it does not contain the query words themselves, because the comparison is based on context and relevance.

With **Keyword Search**, by contrast, if we search for "Computer" we get back only passages that contain "Computer".
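
For contrast, here is what plain keyword search looks like. This is a minimal sketch with a hypothetical `keyword_search` helper and two shortened passages based on course titles from the curriculum.

```python
def keyword_search(query: str, passages: list[str]) -> list[str]:
    """Return passages that contain the query as a case-insensitive substring."""
    return [p for p in passages if query.lower() in p.lower()]

passages = [
    "ITCS 461 Computer and Communication Security: encryption, cryptanalysis, digital signatures",
    "ITCS 481 Computer Graphics",
]
print(keyword_search("Computer", passages))       # both passages contain the word
print(keyword_search("cybersecurity", passages))  # no literal match, so nothing is returned
```

The second query returns an empty list even though the first passage is clearly about the topic. Closing that gap is the job of semantic search.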

In [79]:
import random
import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embedding_df.csv")

# Convert embedding column back to numpy array (it is stored as a string when saved to CSV)
text_chunks_and_embedding_df['embedding'] = text_chunks_and_embedding_df['embedding'].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Then save it to a pytorch tensor
embeddings = torch.tensor(np.stack(text_chunks_and_embedding_df['embedding'].tolist(), axis=0), dtype=torch.float32).to(device=device)


# Convert texts and embedding df to list of dicts
pages_and_chunk = text_chunks_and_embedding_df.to_dict(orient="records")

text_chunks_and_embedding_df

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-2,Table of Contents SECTION 1. GENERAL INFORMAT...,1267,58,316.75,"[-0.0219847802, -0.0461262427, -0.0279324148, ..."
1,-2,THE ABILITY TO IMPLEMENT/PROMOTE THE PROGRAM ...,1125,87,281.25,"[0.0471613593, -0.0363593213, -0.0235828217, 0..."
2,-2,PROGRAM SPECIFIC INFORMATION ....................,1283,70,320.75,"[0.00578025775, -0.0424177051, -0.0210961681, ..."
3,-2,FIELD EXPERIENCE COURSES (INTERNSHIP OR COOPER...,1380,90,345.00,"[0.0133792264, -0.0828028247, -0.0278722942, -..."
4,-1,SECTION 6: ACADEMIC STAFF DEVELOPMENT ...........,1490,81,372.50,"[0.0583376065, -0.0644410923, -0.0128114251, -..."
...,...,...,...,...,...,...
128,98,Degree  Bachelor Master Ph. D...,1200,307,300.00,"[0.0169025138, -0.0259401686, -0.0390770957, -..."
129,99,Degree  Bachelor Master Ph. D...,1411,273,352.75,"[0.0908453614, -0.00180858572, -0.0493984669, ..."
130,100,Degree  Bachelor Master Ph. D...,1352,313,338.00,"[0.0475489125, -0.0543984883, -0.0493552461, -..."
131,101,Degree  Bachelor Master Ph. D...,1456,278,364.00,"[0.0376859568, -0.00997580402, -0.0466063842, ..."


In [80]:
embeddings.shape

torch.Size([133, 768])

In [81]:
# Create model
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device=device)



Creating a small semantic search pipeline

Search for a query (e.g. "Video Editing") and get back relevant passages from the PDF.

1. Define a query string
2. Turn the query string into an embedding.
3. Compute the dot product or cosine similarity between the text embeddings and the query embedding.
4. Sort the results in descending order.
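
Steps 3 and 4 can be sketched in plain Python to show what the scoring does. The three-dimensional vectors below are made up for illustration (real embeddings here have 768 dimensions), and the helper names are hypothetical.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(query_vec: list[float], passage_vecs: list[list[float]], k: int = 2):
    """Score every passage and return the k best (score, index) pairs."""
    scores = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(passage_vecs)]
    return sorted(scores, reverse=True)[:k]

query_vec = [1.0, 0.0, 1.0]
passage_vecs = [
    [2.0, 0.0, 2.0],  # same direction as the query, but longer
    [0.0, 3.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partly aligned with the query
]
results = top_k(query_vec, passage_vecs)
print(results)
```

Cosine similarity ignores vector length, so the first passage scores highest despite being twice as long as the query. When all vectors have length 1, the dot product and cosine similarity give the same value.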

In [104]:
# 1. Define the query
query = "cybersecurity"
print(f"Query: {query}")

# 2. Embed the query
# It is important to embed the query with the SAME model that was used to embed our passages
query_embedding = embedding_model.encode(query, convert_to_tensor=True).to(device)

# 3. Get similarity score with the dot product (use cosine similarity if outputs of model aren't normalized)
# Note that vector sizes must be of the same shape and have the same datatype
from time import perf_counter as timer
start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"[INFO] Time Taken {end_time-start_time} seconds to get score on {len(embeddings)} passages.")

# 4. Get top-K results (we want 5)
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Query: cybersecurity
[INFO] Time Taken 0.00023610000062035397 seconds to get score on 133 passages.


torch.return_types.topk(
values=tensor([0.5507, 0.5404, 0.5029, 0.5023, 0.5011], device='cuda:0'),
indices=tensor([ 62,  60,  39, 113,  69], device='cuda:0'))

In [106]:
pages_and_chunk[62]

{'page_number': 44,
 'sentence_chunk': 'Degree    \uf052 Bachelor       Master        Ph. D.                 Information and Communication Technology  TQF2 Bachelor of Science in Information and Communication Technology (International Program)                 44    ITCS 461  Computer and Communication Security  3 (3 – 0 – 6)  Prerequisite    : ITCS 343 and ITCS 420  Co-requisite : None     Introduction to the security systems, encryption, cryptanalysis, data encryption standard;  cryptographic techniques and protocols in communication; applications of cryptography regarding  management; the public key systems, digital signatures, file security systems; penetration of the  database systems  ITCS 491  Senior Project I    3 (0 – 6 – 3)  Prerequisite    : Advisor’s consideration  Co-requisite : None     Topics of undergraduate-level project in Information and Communication Technology with the  approval of senior project advisors; writing a senior project proposal; presenting senior project

**The sentences in all their glory**:

Degree    \uf052 Bachelor       Master        Ph. D.                 Information and Communication Technology  TQF2 Bachelor of Science in Information and Communication Technology (International Program)                 44    ITCS 461  **Computer and Communication Security  3 (3 – 0 – 6)  Prerequisite    : ITCS 343 and ITCS 420  Co-requisite : None     Introduction to the security systems, encryption, cryptanalysis, data encryption standard;  cryptographic techniques and protocols in communication; applications of cryptography regarding  management; the public key systems, digital signatures, file security systems; penetration of the  database systems**  ITCS 491  Senior Project I    3 (0 – 6 – 3)  Prerequisite    : Advisor’s consideration  Co-requisite : None     Topics of undergraduate-level project in Information and Communication Technology with the  approval of senior project advisors; writing a senior project proposal; presenting senior project  proposal  ITCS 492  Senior Project II    3 (0 – 6 – 3)  Prerequisite    : ITCS 491 and advisor’s consideration  Co-requisite : None     Topics of undergraduate-level project in Information and Communication Technology with the  approval of a senior project advisors; developing a proposed project; writing a final senior project  document; defending a senior project    • Elective Courses   no less than 12 Credits            Number of credits (Lecture – Laboratory – Self-study)  ITCS 331  Organization of Programming Languages  3 (3 – 0 – 6)  Prerequisite    : None',