# Implementation of Retrieval and Generation workflow

### Step 1: Set up the LLM

Criteria

LLM of choice:
1. Llama 3.1/3.2
2. DeepSeek-R1-Distill-Qwen-7B / DeepSeek-R1-Distill-Llama-8B

The LLM may need to be deployed online. Candidate hosting providers:
1. Runpod
2. Replicate
3. OpenRouter
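
If the model ends up hosted rather than loaded locally, the call becomes an HTTP request. The sketch below builds (but does not send) a request for OpenRouter's OpenAI-compatible chat completions endpoint. The model slug and the `OPENROUTER_API_KEY` environment variable name are assumptions for illustration, and `build_chat_request` is a hypothetical helper, not part of this notebook.

```python
import json
import os
import urllib.request

# OpenRouter's OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_chat_request(messages, model, api_key):
    """Build (but do not send) a chat completion request."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    messages=[{"role": "user", "content": "Who are you?"}],
    model="meta-llama/llama-3.2-3b-instruct",  # assumed model slug
    api_key=os.environ.get("OPENROUTER_API_KEY", "missing-key"),  # assumed variable name
)

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

The `messages` list has the same role/content shape used by the local `chat` helper, so switching between a local and a hosted model should mostly be a matter of swapping the transport.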

In [72]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import sentence_transformers
import chromadb

### Loading the model directly

In [73]:

# Load model
model_name = "meta-llama/Llama-3.2-3B-Instruct"  # 3B parameters: about 12 GB in FP32, about 6 GB in FP16
# model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
# model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # 7B parameters: about 13 GB RAM in BF16, 26 GB in FP32

# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load model and tokenizer
print("Loading model...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
print("Model loaded successfully.")

Using device: cuda
Loading model...


Loading checkpoint shards: 100%|██████████| 2/2 [00:37<00:00, 18.60s/it]


Model loaded successfully.
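
The memory figures in the comments above follow from parameter count times bytes per parameter. A rough sketch of that arithmetic, assuming nominal parameter counts (3e9 and 7e9) and counting weights only; activations and the KV cache need additional memory on top. `weight_memory_gib` is a hypothetical helper.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gib(n_params, dtype):
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


# Nominal sizes; the real checkpoints have slightly different counts.
for name, n_params in [("3B", 3e9), ("7B", 7e9)]:
    for dtype in ("fp32", "bf16"):
        print(f"{name} {dtype}: {weight_memory_gib(n_params, dtype):.1f} GiB")
```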


In [70]:
# Define a helper that applies the chat template and generates a reply
def chat(messages, max_new_tokens=256):

    # Format the conversation history with the model's chat template
    formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Tokenize and move the tensors to the same device as the model
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
    # Generate the response
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode the full sequence; the returned text contains the prompt as well as the new reply
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    

    return response
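
Because `chat` decodes `outputs[0]` in full, the returned string repeats the system and user text before the reply, as the recorded outputs in this notebook show. One common way to keep only the reply is to drop the first `prompt_length` token ids before decoding. A minimal sketch with toy ids standing in for real tokenizer output; `generated_only` is a hypothetical helper.

```python
def generated_only(output_ids, prompt_length):
    """Return the token ids that come after the prompt."""
    return output_ids[prompt_length:]


# Toy ids standing in for real tokenizer output (hypothetical values).
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 44]

reply_ids = generated_only(output_ids, len(prompt_ids))
print(reply_ids)
```

Inside `chat`, the equivalent would be to decode `outputs[0][inputs["input_ids"].shape[-1]:]` instead of `outputs[0]`, since `generate` returns the prompt ids followed by the new ids for decoder-only models.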

In [None]:
# Conversation history. `retrieved_chunks` and `user_query` are defined in Steps 2 and 3 below, so run those cells first.
messages = [
    {"role": "system", "content": "You are a helpful personal assistant that is responsible to summarize, search for specific details,and give helpful answers to the user based on given context."},
    {"role": "user", "content": "Context:\n" + retrieved_chunks + "\n\nUser Query:\n" + user_query},
]

In [20]:
# Get chatbot response
response = chat(messages)
print(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


system

Cutting Knowledge Date: December 2023
Today Date: 14 Feb 2025

You are a helpful personal assistant who knows about your user's personal documents and are responsible to summarize the documents, search for specific details from specific documents, and give helpful answers to the user.user

Who are you?assistant

I'm an AI personal assistant designed to help you with information and tasks. I have been trained on a vast amount of text data, including your personal documents, to provide you with quick and accurate answers to your questions.

I can summarize documents, search for specific details, and offer helpful suggestions based on the information I have access to. I'm here to make your life easier and more productive.

To get started, what would you like to do? Do you have a specific document you'd like me to summarize or search for information in?


In [11]:
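# Prompt reconstructed from the recorded output below; the cell that defined it is not shown
prompt = "What is LLM? Do not repeat the prompt in your response. Write your answer starting here [YOUR ANSWER HERE]."

# Tokenize the prompt and move tensors to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate response; max_length counts prompt tokens plus generated tokens
outputs = model.generate(
    inputs["input_ids"], attention_mask=inputs["attention_mask"],
    max_length=500, do_sample=True, pad_token_id=tokenizer.pad_token_id,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("This is the response generated: ", response)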

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


This is the response generated:  What is LLM? Do not repeat the prompt in your response. Write your answer starting here [YOUR ANSWER HERE].


[YOUR ANSWER HERE]

Large Language Models (LLMs) are a type of artificial intelligence (AI) designed to process and understand human language. These models are trained on vast amounts of text data, allowing them to learn patterns, relationships, and structures within language. As a result, LLMs can generate human-like text, respond to questions, and even engage in conversation. They are commonly used in various applications, including language translation, text summarization, and content generation. LLMs have the potential to revolutionize the way we interact with technology and access information, but they also raise concerns about their limitations, biases, and potential misuse.


### Step 2: Retrieval

In [74]:
# An example of user_query
user_query = " When did the last time i purchased a ticket? "

In [66]:


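# Use a separate name for the embedding model so it does not overwrite the LLM stored in `model`
embedding_model = sentence_transformers.SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
# Encode the query
query = user_query
query_embedding = embedding_model.encode(query)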

# Search the vector database 
client = chromadb.PersistentClient(path="C:/Users/User/Documents/SideProject/personal_document_chatbot_with_RAG/data/vectorDB")
collection = client.get_collection(name="document_collection")
vector_response = collection.query(
    query_embeddings= query_embedding.tolist(),
    n_results=2,
    include = ["documents"]
)
print(vector_response)

# give the output

{'ids': [['1666d225-5898-4a7c-b861-f57787736355', '4102d724-89b4-4f69-98d0-7776bff6e923']], 'embeddings': None, 'documents': [['Airlines ) to a card maintained by the Cardholder with the Certificate holder * * 2 . SCHEDULE OF BENEFITS * * * * 2.1 Flight Delay * * If the Covered Person ’ s confirmed Scheduled Flight is delayed and no alternative onward transportation is made available to the Covered Person within four ( 4 ) hours of the actual departure time of the Scheduled Flight , TIGB will indemnify the actual additional expenses necessarily and reasonably incurred for hotel accommodation and restaurant meals and refreshments , up to the maximum limits as specified in the Schedule of Benefits provided that the Covered Person had been at the airport at the time of such flight delay . * * Platinum Card * * Up to RM1,000 Limit per family RM2,000 * * 2.2 Missed Flight Connection * * If the Covered Person ’ s confirmed onward connecting Scheduled Flight is missed at the transfer point du
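
Conceptually, `collection.query` returns the `n_results` stored chunks whose embeddings lie closest to the query embedding. The sketch below shows that idea as an exhaustive scan over toy 2-D vectors. It is only an illustration: Chroma uses an approximate index rather than a full scan, the distance metric depends on how the collection was created (squared Euclidean distance is assumed here), and `top_k` is a hypothetical helper.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k document vectors closest to the query vector."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: squared_l2(query_vec, doc_vecs[i]),
    )
    return ranked[:k]


# Toy embeddings (hypothetical values); real ones have many more dimensions.
doc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
query_vec = [1.0, 0.0]

nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)
```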

### Step 3: Send the retrieved context to the LLM and generate a response

In [75]:
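# collection.query returns one list of documents per query; take the list for our single query
documents = vector_response["documents"][0]
# Join the retrieved chunks with a blank line so their text does not run together
retrieved_chunks = "\n\n".join(documents)
print(retrieved_chunks)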




Airlines ) to a card maintained by the Cardholder with the Certificate holder * * 2 . SCHEDULE OF BENEFITS * * * * 2.1 Flight Delay * * If the Covered Person ’ s confirmed Scheduled Flight is delayed and no alternative onward transportation is made available to the Covered Person within four ( 4 ) hours of the actual departure time of the Scheduled Flight , TIGB will indemnify the actual additional expenses necessarily and reasonably incurred for hotel accommodation and restaurant meals and refreshments , up to the maximum limits as specified in the Schedule of Benefits provided that the Covered Person had been at the airport at the time of such flight delay . * * Platinum Card * * Up to RM1,000 Limit per family RM2,000 * * 2.2 Missed Flight Connection * * If the Covered Person ’ s confirmed onward connecting Scheduled Flight is missed at the transfer point due to the late arrival of the Covered Person ’ s incoming confirmed connecting Scheduled Flight and no alternative onward transpo

In [76]:
# message template
messages = [
    {"role": "system", "content": "You are a helpful personal assistant that is responsible to summarize, search for specific details,and give helpful answers to the user based on given context. If the answer is not clear within the given context, please elaborate wisely based on your current knowledge."},
    {"role": "user", "content": "Context:\n" + retrieved_chunks + "\n\nUser Query:\n" + user_query},
]
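
The same template is written out twice in this notebook, so it may be worth wrapping in a function. A minimal sketch, assuming the retrieved chunks are available as a list of strings; `build_messages` and the shortened `SYSTEM_PROMPT` are hypothetical, and numbering the chunks is a design choice added here, not something the notebook does.

```python
# Shortened placeholder; substitute the full system prompt used above.
SYSTEM_PROMPT = "You are a helpful personal assistant that answers based on the given context."


def build_messages(chunks, query, system_prompt=SYSTEM_PROMPT):
    """Build the system and user messages for one retrieval-augmented query."""
    # Number each chunk so the model can tell where one ends and the next begins.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context:\n{context}\n\nUser Query:\n{query.strip()}"},
    ]


messages_demo = build_messages(["first chunk", "second chunk"], " example question ")
print(messages_demo[1]["content"])
```

With the retrieval response from Step 2, the call would look like `build_messages(vector_response["documents"][0], user_query)`.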

In [77]:
# Get chatbot response
response = chat(messages)
print(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


system

Cutting Knowledge Date: December 2023
Today Date: 14 Feb 2025

You are a helpful personal assistant that is responsible to summarize, search for specific details,and give helpful answers to the user based on given context. If the answer is not clear within the given context, please elaborate wisely based on your current knowledge.user

Context:
Airlines ) to a card maintained by the Cardholder with the Certificate holder * * 2. SCHEDULE OF BENEFITS * * * * 2.1 Flight Delay * * If the Covered Person ’ s confirmed Scheduled Flight is delayed and no alternative onward transportation is made available to the Covered Person within four ( 4 ) hours of the actual departure time of the Scheduled Flight, TIGB will indemnify the actual additional expenses necessarily and reasonably incurred for hotel accommodation and restaurant meals and refreshments, up to the maximum limits as specified in the Schedule of Benefits provided that the Covered Person had been at the airport at the time of