<hr>

# EXECUTE ONLY IN GPU
<hr>


In [1]:
!pip install PyMuPDF
!pip install sentence-transformers  #by huggingface, to load embedding models
!pip install transformers   #by huggingface, to load trnasformers and its tokenizer
# flash attention
!pip install flash-attn --no-build-isolation  #github repo:https://github.com/Dao-AILab/flash-attention

# bitsandbytes-custom python wrapper functions for CUDA, especially for 8 and 4 bit quantization
!pip install bitsandbytes    #github repo:https://github.com/bitsandbytes-foundation/bitsandbytes

!pip install accelerate   #hugging face library

!pip install huggingface_hub  #loggin to huggingface in CLI to access llama3.2 model

Collecting PyMuPDF
  Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.3-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.0/20.0 MB[0m [31m52.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.3
Collecting flash-attn
  Downloading flash_attn-2.7.4.post1.tar.gz (6.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.0/6.0 MB[0m [31m76.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... [?25l[?25hdone
  Created wheel for flash-attn: filename=flash_attn-2.7.4.post1-cp311-cp311-linux_x86_64.whl size=187815463 sha256=d944fc7d2f962bce83fc4708c2fc0c21eaf8255962a0b350ae919362a51b7ef2
  Stored in directory: /root/.cac

# RAG - retrieval augmentation generation
#### combining information retrieval with LLM to improve accuracy of the model and reducing hallucinations

components of RAG :
- retriever : identifies and retrieves relevant documents
- generator : takes the relevant documents and generates response related to the prompt

resources : 
- https://www.youtube.com/watch?v=qN_2fnOPY-M
- https://github.com/mrdbourke/simple-local-rag/blob/main/00-simple-local-rag.ipynb

#### retriever :

we take a bunch of documents(either saved into memory or search results, etc) split into small chunks and embed them using a embedding model(pre-trained on large amount of data). we then perform similarity search(like cosine similarity) to retrieve the information that closly matches to the prompt from the user.

#### generator :

we then attach the retrieved text from the source with the user's prompt and pass it to the LLM to generate answers

# Local RAG :

everything done locally

In [19]:
import fitz    #for reading pdf, pip install PyMuIDF
import pandas as pd
import numpy as np

from transformers import AutoTokenizer, AutoModelForCausalLM   #pip install transformers
from transformers import AutoConfig
from transformers import BitsAndBytesConfig
from transformers.utils import is_flash_attn_2_available
from sentence_transformers import SentenceTransformer   #pip install sentence-transformers
import tensorflow as tf

import os
import time
import random

In [3]:
# Check GPU availability and set memory growth
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    print(e)

1 Physical GPUs, 1 Logical GPUs


collect data, preprocess and split into chunks of text

In [5]:
# loading the resource files
paths=[
    r'./Deep Learning for Computer Vision - Image Classification, Object Detection and Face Recognition in Python by Jason Brownlee (z-lib.org).pdf',
    r'./sample.txt'
]

"""
idea: we have multiple documents to read from. so we will first read a single document, then either split the text into smaller chunks or split the text by sentences and combine N sentences into a chunk.
"""

def preprocessing_text(text):
    """preprocessing the extracted texts from the documents
        -> while preprocessing you may thing that we should remove unwanted data like abstract, content, etc but its not necessary because we will search for relevant content based on the user's prompt.
    """
    text=text.replace('\n', ' ').replace('  ',' ')
    return text

def split_chunks(text:str, chunk_size: int=200)->list[str]:
    """function to split the text into chunks of text"""
    text=text.split()
    chunks=[' '.join(text[i:i+200]) for i in range(0, len(text), chunk_size)]
    return chunks


data = []   #here we will save data from all documents

for path in paths:
    if '.pdf' in path:
        # read pdf using fitz
        text=''
        doc=fitz.open(path)
        for pages in range(len(doc)):
            text+=doc.load_page(pages).get_text()
    elif '.txt' in path:
        with open(path, 'r', encoding='utf-8') as f:
            text=f.read()

    # preprocessing the text
    text=preprocessing_text(text)
    # 'text' contains all the text from the pdf. we will split the text into chunks of 200 words
    chunks=split_chunks(text)
    for chunk in chunks:
        data.append({
            'source':path,
            'text':chunk
        })

In [6]:
data

[{'source': './Deep Learning for Computer Vision - Image Classification, Object Detection and Face Recognition in Python by Jason Brownlee (z-lib.org).pdf',
  'text': 'Deep Learning for Computer Vision Image Classification, Object Detection and Face Recognition in Python Jason Brownlee i Disclaimer The information contained within this eBook is strictly for educational purposes. If you wish to apply ideas contained in this eBook, you are taking full responsibility for your actions. The author has made every eﬀort to ensure the accuracy of the information within this book was correct at time of publication. The author does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic or mechanical, recording or by any information storage and ret

convert the data into embedding representations using embedding model. we will load the embedding model from hugging face

here we just represent our text data as numbers in N dimensional space. there are many ways to perform this like using KNN but using embedding models is most popular because it give the best representations as it has learnable weights

resources :

https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/

https://dagshub.com/blog/how-to-train-a-custom-llm-embedding-model/

https://medium.com/snowflake/how-to-build-a-state-of-the-art-text-embedding-model-a8cd0c86a19e

NOTE : ensure that the selected model has context length same or more than our chunk size and the model's embedding dimension is enough based on your computational availability(larger dimension size means more computation). also ensure that the model is trained on the same language as your source language

In [7]:
# embedding model
# we will be using 'intfloat/multilingual-e5-large-instruct' model from huggingface(https://huggingface.co/intfloat/multilingual-e5-large-instruct) fro embedding
# this model has embedding dimension of 1024 and context length of 512 tokens
embedding_model = SentenceTransformer('intfloat/multilingual-e5-large-instruct')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/128 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/140k [00:00<?, ?B/s]

sentence_xlm-roberta_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/690 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.12G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/964 [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

In [8]:
# encoding the source text chunks

# all text chunks into a list for faster encoding
text_chunks=[i['text'] for i in data]

# embedding the texts
embeddings=embedding_model.encode(text_chunks, convert_to_tensor=True, normalize_embeddings=True, batch_size=32, show_progress_bar=True)

for i in range(len(embeddings)):
    data[i]['embeddings']=embeddings[i]

Batches:   0%|          | 0/25 [00:00<?, ?it/s]

In [31]:
# to search for the most relevant chunk, we will perform cosine similarity between embedding vector of each chunk with embedding vector of the user's prompt

"""
cosine similarity :
cos θ = (A · B) / (||A|| * ||B||)

dot product between 2 vectors divided by product of magnitude(euclidean normalization/l2 normalization) of each vectors

output range=-1 to +1 where -1 being most unsimilar and +1 being perfect match between the vectors
"""

def search_topk(source_embeddings, query_embedding, k:int=5):
    """
    topk -> how many results to return
    source_embeddings ->  embeddings of the source chunks
    query_embedding -> embeddings of user's prompt
    """

    # since the output of our embedding model is already normalized, we will perform dot product between the vectors to get the cosine similarity score.
    scores=[]
    for i in range(len(source_embeddings)):
        scores.append(np.dot(source_embeddings[i].cpu(), query_embedding.cpu()))

    return tf.math.top_k(scores, k=k if len(source_embeddings)>k else len(source_embeddings))

query='who is praveen'
query_embedding=embedding_model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
scores, indices = search_topk(embeddings, query_embedding, k=1)

In [32]:
scores, indices

(<tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.9277608], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=int32, numpy=array([784], dtype=int32)>)

In [34]:
indices.numpy()[0]

784

In [39]:
print(f'query={query}')
print(f"matched text chunk={data[indices.numpy()[0]]['text']}")
print(f"matched text source={data[indices.numpy()[0]]['source']}")
print(f'score={scores.numpy()[0]}')

query=who is praveen
matched text chunk=praveen is a superstar
matched text source=./sample.txt
score=0.9277607798576355


loading deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B model from huggingface

In [10]:
compute_capability=[]
gpus=tf.config.list_physical_devices('GPU')
for gpu in gpus:
    properties=tf.config.experimental.get_device_details(gpu)
    compute_Capability=properties.get('compute_capability')
    compute_capability.append(compute_Capability[0])
    print(f'compute calability of GPU {gpu} : {compute_Capability}')

compute calability of GPU PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU') : (7, 5)


since out GPU compute capability is 7.5, we cannot use flash attention2 instead we will use flash attention 1.

In [11]:
is_flash_attn_2_available()   #eventhough our GPU's compute capapbility is
#   <8.0, is_flash_attn_2_available() so we cannot use flash attention

True

In [12]:
quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype='float16')

if is_flash_attn_2_available() and compute_capability[0]>=8:
  attn_implementation = "flash_attention_2"
else:
  attn_implementation = "sdpa"  #scaled dpt product attention
print(f"[INFO] Using attention implementation: {attn_implementation}")

[INFO] Using attention implementation: sdpa


In [16]:
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
# loading the model and its tokenizer

config=AutoConfig.from_pretrained('google/gemma-2-2b-it', trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path='google/gemma-2-2b-it',
    config=config,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,   #we want to use as much memory as possible
    attn_implementation=attn_implementation,
    device_map='auto',
    trust_remote_code=True,
    token='<YOUR-HUGGINGFACE-TOKEN>'
)

# tokenizer for the model
tokenizer=AutoTokenizer.from_pretrained('google/gemma-2-2b-it')

model.safetensors.index.json:   0%|          | 0.00/24.2k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/241M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/47.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

In [23]:
# asking the model without RAG
query = "who is praveen?"

# each model has its own chat template(a format that we should use for prompting the model properly)
dialogue_template = [
    {"role": "user",
     "content": query}
]
# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)

input_ids=tokenizer(query, return_tensors='pt').to('cuda')
outputs = model.generate(**input_ids,
                             max_new_tokens=1024)
output=tokenizer.decode(outputs[0])
print(f'query={query}\n')
print(f'chat template={prompt}\n')
print(f'model\'s response=\n {output}')

query=who is praveen?
model's response=
 <bos>who is praveen?

Please provide more context. 

For example, you could tell me:

* **"I saw Praveen on a TV show about..."**
* **"My friend Praveen is a..."**
* **"I'm looking for information about Praveen, he's..."**

The more information you give me, the better I can understand who you're asking about. 
<end_of_turn>


you can see that the model doesn't know the answer. now we will add context to the prompt using RAG, bacause ./sample.txt file contains "praveen is superstar" text, which when added as context to the prompt the model will probabily be able to answer the question more accurately

In [43]:
# now we will add context to the model using RAG and see the model's output
query='who is praveen?'
# embedding the query
query_embedding=embedding_model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

scores, indices=search_topk(embeddings, query_embedding, k=1)  #we will extract only top1 search results
context=data[indices.numpy()[0]]['text']

final_query=f"""
context :
{context}
based on the given content, please answer the following query:
{query}
think propoerly for as long as you need before answering the query and give a detailed answer.
"""

dialogue_template=[
    {"role":"user",
    "content":final_query}
]

prompt=tokenizer.apply_chat_template(conversation=dialogue_template,
                                     tokenize=False,
                                     add_generation_promp=True)

# lets tokenize the prompt
input_tokens=tokenizer(prompt, return_tensors='pt').to('cuda')

# generating model's response
output_tokens=model.generate(**input_tokens, max_new_tokens=1024)

# decoding the output
output=tokenizer.decode(output_tokens[0])

print(f'query : {query}\n')
print(f'prompt : {prompt}\n')
print(f'model output : {output}\n\n')
print(f'----------------------------------------------------------------\n')
print(f'input tokens : {input_tokens}\n')
print(f'----------------------------------------------------------------\n')
print(f'output tokens : {output_tokens}\n')
print(f'----------------------------------------------------------------\n')


query : who is praveen?

prompt : <bos><start_of_turn>user
context : 
praveen is a superstar
based on the given content, please answer the following query:
who is praveen?
think propoerly for as long as you need before answering the query and give a detailed answer.<end_of_turn>


model output : <bos><bos><start_of_turn>user
context : 
praveen is a superstar
based on the given content, please answer the following query:
who is praveen?
think propoerly for as long as you need before answering the query and give a detailed answer.<end_of_turn>


Praveen is a superstar. This statement tells us that Praveen is a person who is highly regarded and admired, likely due to their achievements, skills, or talent. 

Here's a breakdown of what we can infer from the statement:

* **Praveen is likely a successful individual:** The term "superstar" implies a level of success and recognition that goes beyond just being a regular person.
* **Praveen has achieved something significant:**  The term "super

#### we can see that the model is accurately able to say that "praveen is a superstar" based on the given context which means our RAG pipeline is working as intended

this is our local RAG(everything done locally) but when we use large models and large source data, its impractical to save everything locally. so for that we will save the embeddings in a database(espically in vector database which is optimized for working with embedding vectors) for faster access and for faster inference time.