# RAG

- to retrieve relevant passages based on a query and use those passages to augment an input to an LLM so that it can generate an output based on those relevant passages

1. Similarity Search/ Vector Search / Semantic Search

In [2]:
import torch
import random 
import numpy as np 
import pandas as pd

In [3]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device 

'cuda'

In [17]:
texts_chunks_embeddings_df = pd.read_csv('text_chunks_embeddings_df.csv')

# convert embedding to np.array
texts_chunks_embeddings_df["embedding"] = texts_chunks_embeddings_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

pages_and_chunks = texts_chunks_embeddings_df.to_dict(orient='records')
texts_chunks_embeddings_df.head(3)

Unnamed: 0,page_number,sentence_chunk,chunk_character_count,chunk_word_count,chunk_token_count,embedding
0,-17,CUDA by Example g JAson sAnders edwArd KAndrot...,226,43,56.5,"[-0.0277873948, -0.0197100993, -0.0666598231, ..."
1,-16,Many of the designations used by manufacturers...,1556,231,389.0,"[0.0397266746, -0.0556121096, -0.0714852512, 0..."
2,-16,p. cm. Includes index. ISBN 978-0-13-138768-...,130,16,32.5,"[0.0463345461, -0.0485776998, -0.0402859673, 0..."


In [20]:
embeddings = np.stack(texts_chunks_embeddings_df.embedding.tolist(), axis=0)

In [21]:
embeddings.shape 

(426, 768)

In [40]:
embeddings = torch.tensor(embeddings, dtype=torch.float32)
embeddings.shape

  embeddings = torch.tensor(embeddings, dtype=torch.float32)


torch.Size([426, 768])

## Create model

In [41]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path = 'all-mpnet-base-v2',
                                     device=device)
embedding_model 



SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

## Semantic Searching

### Steps

- Query String
- String to Embedding
- Dot product or Cosine Similarity between text embeddings and query embeddings
- Sort results (top 3, descending order)

In [42]:
embeddings = embeddings.to(device)
embeddings.shape, embeddings.dtype

(torch.Size([426, 768]), torch.float32)

In [43]:
from time import perf_counter as timer
import sentence_transformers

## for similarity measure

- If outputs from encoding are normalized embeddings, use dot product
    - else, use cosine similarity

In [61]:
query = "CUDA memory allocation"

query_embedding = embedding_model.encode(query, convert_to_tensor=True).to(device)

# ensuring the dtypes match
print(f'Dtype of embedding is: {embeddings.dtype}')
print(f'Dtype of query embedding is: {query_embedding.dtype}')

# ensuring shape of query embedding match with the overall embeddings
print(f'Shape of query embedding: {query_embedding.shape}')

start_time = timer()
dot_scores = sentence_transformers.util.dot_score(a=query_embedding, b=embeddings)[0]

end_time = timer()

print(f'{end_time-start_time:.5f} seconds')


top_n = 4

top_results = torch.topk(dot_scores, k=top_n)
print("\n\n")
top_results


Dtype of embedding is: torch.float32
Dtype of query embedding is: torch.float32
Shape of query embedding: torch.Size([768])
0.00013 seconds





torch.return_types.topk(
values=tensor([0.7307, 0.7260, 0.7245, 0.7038], device='cuda:0'),
indices=tensor([ 77, 321, 286, 329], device='cuda:0'))

In [66]:
pages_and_chunks[329]['sentence_chunk']

'cudA c on multIPle GPus 220 The only thing remaining in the cudaHostAlloc() version of the dot product is cleanup.  HANDLE_ERROR( cudaFreeHost( a ) );   HANDLE_ERROR( cudaFreeHost( b ) );   HANDLE_ERROR( cudaFreeHost( partial_c ) );   // free events   HANDLE_ERROR( cudaEventDestroy( start ) );   HANDLE_ERROR( cudaEventDestroy( stop ) );   printf( "Value calculated: %f\\n", c );   return elapsedTime; } You will notice that no matter what flags we use with cudaHostAlloc(), the memory always gets freed in the same way. Specifically, a call to cudaFreeHost() does the trick. And that’s that!All that remains is to look at how main() ties all of this together. The first thing we need to check is whether our device supports mapping host memory. We do this the same way we checked for device overlap in the previous chapter, with a call to cudaGetDeviceProperties().int main( void ) {   cudaDeviceProp prop;   int whichDevice;   HANDLE_ERROR( cudaGetDevice( &whichDevice ) );   HANDLE_ERROR( cudaGe

In [154]:
# function for semantic search
def retrieve_resources(query: str,
                       embeddings: torch.tensor, 
                       model: SentenceTransformer = embedding_model,
                       top_k = 3,
                       pages_and_chunks = pages_and_chunks,
                       print_time: bool=True,
                       display=False):
    query_embedding = model.encode(query, convert_to_tensor=True)
    start_time = timer()
    dot_scores = sentence_transformers.util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f'Time taken: {(end_time - start_time):.5f} seconds')
    scores, indices = torch.topk(input=dot_scores,
                                k = top_k)

    if display:
        
        for score, idx in zip(scores, indices):
            print(f'\nScore: {score}')
            print("Text:")
            print(pages_and_chunks[idx]["sentence_chunk"])
        
    
    return scores, indices

In [81]:
retrieve_resources(query="parallel processing", embeddings=embeddings, pages_and_chunks=pages_and_chunks, print_time=False)


Score: 0.6934302449226379
Text:
Parallel programming (Computer science) I. Kandrot, Edward. II. Title.  QA76.76. A65S255 2010  005.2'75—dc22                                                              2010017618 Copyright © 2011 NVIDIA Corporation All rights reserved. Printed in the United States of America. This publication is protected by copy- right, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to: Pearson Education, Inc. Rights and Contracts Department 501 Boylston Street, Suite 900 Boston, MA 02116 Fax: (617) 671-3447 ISBN-13: 978-0-13-138768-3 ISBN-10:    0-13-138768-5 Text printed in the United States on recycled paper at Edwards Brothers in Ann Arbor, Michigan. First printing, July 2010

Score: 0.6296412944793701
Text:
59 Chapter 5 thread Cooperation We

(tensor([0.6934, 0.6296, 0.6163, 0.5905, 0.5864], device='cuda:0'),
 tensor([  3, 107, 108, 285,  26], device='cuda:0'))

In [84]:
gpu_memory = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory/(2**30))
print(f'GPU Memory: {memory_gb} GB')

GPU Memory: 6 GB


In [87]:
torch.cuda.get_device_capability(0)

(8, 6)

### Using Google Gemma-2b-it

In [108]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig

In [109]:
from transformers.utils import is_flash_attn_2_available

In [110]:
if (is_flash_attn_2_available()):
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa" 
# sdpa = scaled dot product attention
attn_implementation

'sdpa'

In [111]:
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
tokenizer

GemmaTokenizerFast(name_or_path='google/gemma-2-2b-it', vocab_size=256000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<bos>', 'eos_token': '<eos>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'additional_special_tokens': ['<start_of_turn>', '<end_of_turn>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<eos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<bos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<mask>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
	5: AddedToken("<2mass>", rstrip=False, lstrip=False, single

In [116]:
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path = "google/gemma-2-2b-it", 
                                            torch_dtype = torch.float16, 
                                            quantization_config = None,               
                                            attn_implementation = attn_implementation)
model = model.to(device)

Loading checkpoint shards: 100%|███| 2/2 [00:11<00:00,  5.95s/it]


In [117]:
model 

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2SdpaAttention(
          (q_proj): Linear(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2304, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (post_attention_layernorm): Gemma2RMSNorm((2304,), 

In [118]:
mem_params = sum([param.nelement() * param.element_size() for param in model.parameters()])
mem_buffers = sum([buf.nelement() * buf.element_size() for buf in model.buffers()])
model_mem_bytes = mem_params + mem_buffers # in bytes

model_mem_gb = model_mem_bytes / (1024**3)

print(f'Model memory: {model_mem_gb} GB')

Model memory: 4.869603633880615 GB


In [121]:
text = "What is CUDA parallel processing? how is GPU constructed?"

dialogue_template = [
    {"role": "user",
     "content": text}
]

prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False,
                                       add_generation_prompt=True)
print(f"\nPrompt: \n{prompt}")


Prompt: 
<bos><start_of_turn>user
What is CUDA parallel processing? how is GPU constructed?<end_of_turn>
<start_of_turn>model



In [123]:
%%time
# tokenizing input
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
print(f"Model input (tokenized):\n{input_ids}\n")

Model input (tokenized):
{'input_ids': tensor([[     2,      2,    106,   1645,    108,   1841,    603, 154144,  14255,
          10310, 235336,   1368,    603,  37783,  18871, 235336,    107,    108,
            106,   2516,    108]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='cuda:0')}

CPU times: total: 15.6 ms
Wall time: 3.99 ms


In [125]:
%%time
outputs = model.generate(**input_ids,
                             max_new_tokens=256)

  is_causal = True if causal_mask is None and q_len > 1 else False


CPU times: total: 1min 52s
Wall time: 1min 53s


In [126]:
%%time 
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

Model output (decoded):
<bos><bos><start_of_turn>user
What is CUDA parallel processing? how is GPU constructed?<end_of_turn>
<start_of_turn>model
Let's break down CUDA parallel processing and how GPUs are built.

**What is CUDA Parallel Processing?**

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It allows software developers to harness the power of NVIDIA GPUs (Graphics Processing Units) for general-purpose computing tasks, not just graphics rendering. 

Here's the core idea:

* **GPUs are designed for parallel processing:**  GPUs have thousands of cores, each capable of performing simple calculations simultaneously. This is in contrast to CPUs, which are designed for sequential processing.
* **CUDA enables software to utilize these cores:**  CUDA allows developers to write code that can be executed on the GPU, taking advantage of its parallel architecture.
* **Benefits of CUDA:**
    * **Speed:**  GPUs are incre

### Managing Prompt to pass into LLM

In [149]:
def format_prompt(query: str,
                 context_items: list[dict]) -> str :
    context = f"- " + f"\n- ".join([item["sentence_chunk"] for item in context_items])

    base_prompt = """Based on the context items, please answer the query:
    Context_items: {context}
    Query: {query}
    """
    prompt = base_prompt.format(context=context, 
                               query=query)
    return context

In [150]:
queries = ["What is parallel processing?",
          "Tell me about heat transfer",
          "What is the best way to use GPU for tensors multiplication with programming?"]

query = random.choice(queries)

scores, indices = retrieve_resources(query=query,
                                    embeddings=embeddings)

context_items = [pages_and_chunks[i] for i in indices]

prompt = format_prompt(query=query, context_items=context_items)
prompt 

Time taken: 0.00004 seconds

Score: 0.6080108880996704
Text:
Parallel programming (Computer science) I. Kandrot, Edward. II. Title.  QA76.76. A65S255 2010  005.2'75—dc22                                                              2010017618 Copyright © 2011 NVIDIA Corporation All rights reserved. Printed in the United States of America. This publication is protected by copy- right, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to: Pearson Education, Inc. Rights and Contracts Department 501 Boylston Street, Suite 900 Boston, MA 02116 Fax: (617) 671-3447 ISBN-13: 978-0-13-138768-3 ISBN-10:    0-13-138768-5 Text printed in the United States on recycled paper at Edwards Brothers in Ann Arbor, Michigan. First printing, July 2010

Score: 0.5896118879318237
Text:
oopera

"- Parallel programming (Computer science) I. Kandrot, Edward. II. Title.  QA76.76. A65S255 2010  005.2'75—dc22                                                              2010017618 Copyright © 2011 NVIDIA Corporation All rights reserved. Printed in the United States of America. This publication is protected by copy- right, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to: Pearson Education, Inc. Rights and Contracts Department 501 Boylston Street, Suite 900 Boston, MA 02116 Fax: (617) 671-3447 ISBN-13: 978-0-13-138768-3 ISBN-10:    0-13-138768-5 Text printed in the United States on recycled paper at Edwards Brothers in Ann Arbor, Michigan. First printing, July 2010\n- ooperation 60 Chapter Objectives \t \t threads. You will learn a mechanism for different threa

In [135]:
%%time
input_ids = tokenizer(prompt, return_tensors = "pt").to(device)

outputs = model.generate(**input_ids, 
               max_new_tokens=200)

output_text = tokenizer.decode(outputs[0])
print(f'Query:\t{query}')
print(f'RAG Answer: {output_text.replace(prompt, '')}')

Query:	What is parallel processing?
RAG Answer: <bos>
- Parallel programming (Computer science) I. Kandrot, Edward. II. Title.  QA76.76. A65S255 2010  005.2'75—dc22                                                              2010017618 Copyright © 2011 NVIDIA Corporation All rights reserved. Printed in the United States of America. This publication is protected by copy- right, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to: Pearson Education, Inc. Rights and Contracts Department 501 Boylston Street, Suite 900 Boston, MA 02116 Fax: (617) 671-3447 ISBN-13: 978-0-13-138768-3 ISBN-10:    0-13-138768-5 Text printed in the United States on recycled paper at Edwards Brothers in Ann Arbor, Michigan. First printing, July 2010
 ooperation 60 Chapter Objectives 	 	 thread

In [152]:
%%time
# with passing query with the context
input_ids = tokenizer(prompt, return_tensors = "pt").to(device)

outputs = model.generate(**input_ids, 
               max_new_tokens=200)

output_text = tokenizer.decode(outputs[0])
print(f'Query:\t{query}')
print(f"RAG Answer: {output_text.replace(prompt, '')}")

Query:	What is parallel processing?
RAG Answer: <bos> 


**Summary:**

This document appears to be a chapter excerpt from a book titled "Parallel Programming (Computer Science)" by Kandrot and Edward. 

Here's a breakdown of the key points:

* **Parallel Programming:** The chapter focuses on parallel programming techniques, particularly on how to utilize the GPU for parallel execution.
* **CUDA Runtime:** The CUDA runtime system is discussed, which manages the launch of parallel blocks and threads.
* **Threads and Blocks:** The relationship between threads and blocks is explained, with the number of threads per block being controlled by the second argument in the CUDA launch function.
* **Task Parallelism:** The chapter introduces task parallelism, a different type of parallelism where multiple tasks are executed concurrently, even if they are unrelated.
* **CUDA Streams:** The chapter concludes by introducing CUDA streams, which allow for the execution of operations in parallel.


**O

### LLM Model with Context Items

In [164]:
def ask(query: str,
       max_new_tokens: int = 256):
    scores, indices = retrieve_resources(query=query,
                                         embeddings=embeddings)
    context_items = [pages_and_chunks[i] for i in indices]
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu()

    prompt = format_prompt(query=query,
                          context_items=context_items)
    # tokenizing
    input_ids = tokenizer(prompt, 
                         return_tensors = "pt").to(device)

    outputs = model.generate(**input_ids,
                            max_new_tokens=max_new_tokens)

    output_text = tokenizer.decode(outputs[0])
    
    output_text = output_text.replace(prompt, '').replace("<bos>", '').replace("<end_of_turn>", '').replace("\n", "")

    return output_text

In [165]:
queries = ["What is parallel processing?",
          "What is CUDA?",
          "Explain about cuda programming."]

In [162]:
%%time
query = random.choice(queries)
print(f'Query:{query}')
ask(query=query)

Query:What is parallel processing?
Time taken: 0.00005 seconds
CPU times: total: 2min 14s
Wall time: 2min 15s


' \n\n\n**Summary:**\n\nThis document appears to be a chapter excerpt from a book titled "Parallel Programming (Computer Science)" by Kandrot and Edward. \n\nHere\'s a breakdown of the key points:\n\n* **Parallel Programming:** The chapter focuses on parallel programming techniques, particularly on how to utilize the GPU for parallel execution.\n* **CUDA Runtime:** The CUDA runtime system is discussed, which manages the launch of parallel blocks and threads.\n* **Threads and Blocks:** The relationship between threads and blocks is explained, with the number of threads per block being controlled by the second argument in the CUDA launch function.\n* **Task Parallelism:** The chapter introduces task parallelism, a different type of parallelism where multiple tasks are executed concurrently, even if they are unrelated.\n* **CUDA Streams:** The chapter concludes by introducing CUDA streams, which allow for the execution of operations in parallel.\n\n\n**Overall:** This excerpt provides a f

In [167]:
%%time
query = random.choice(queries)
print(f'Query:{query}')
ask(query=query)

Query:Explain about cuda programming.
Time taken: 0.00004 seconds
CPU times: total: 2min 42s
Wall time: 2min 42s


'**Summary:**This chapter is an introduction to CUDA C programming. It emphasizes the importance of understanding the CUDA Architecture and the nuances of NVIDIA GPUs. It recommends a book for further learning and suggests that a basic understanding of C or C++ is helpful. The chapter also mentions setting up the development environment for CUDA C and then moving on to the next chapter.**Key Points:*** **CUDA C Programming:** The chapter introduces CUDA C programming as a way to develop code for NVIDIA GPUs.* **CUDA Architecture:** Understanding the CUDA Architecture is crucial for advanced CUDA C programming.* **NVIDIA GPUs:** The chapter highlights the importance of understanding how NVIDIA GPUs work.* **Programming Massively Parallel Processors:** A recommended book for deeper understanding of CUDA Architecture.* **Prerequisites:** Basic knowledge of C or C++ is helpful for understanding the concepts.* **Development Environment:** Setting up the development environment is necessary 