# Implementation of the Retrieval and Generation Workflow

### Step 1: Set up the LLM

Criteria

Candidate LLMs:
1. Llama 3.1 / 3.2
2. DeepSeek-R1-Distill-Qwen-7B / DeepSeek-R1-Distill-Llama-8B

The LLM may need to be deployed online. Candidate hosting services:
1. RunPod
2. Replicate
3. OpenRouter

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch



### Loading the model directly

In [2]:

# Load model
model_name = "meta-llama/Llama-3.2-3B-Instruct"  # 3B parameters: ~12 GB in FP32, ~6 GB in FP16
# model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
# model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # 7B parameters: ~13 GB RAM in BF16, ~26 GB in FP32

# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load model and tokenizer
print("Loading model...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
print("Model loaded successfully.")

Using device: cuda
Loading model...


Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00,  4.05s/it]


Model loaded successfully.
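
The memory figures in the model-name comments follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic is below. The function name `estimate_weight_memory_gib` is hypothetical, the parameter counts are rounded, and the estimate covers the weights only (activations and the KV cache need additional memory).

```python
def estimate_weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# Bytes per parameter for common dtypes
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

# Rounded parameter counts (assumption: 3B and 7B exactly)
for label, num_params in [("3B", 3e9), ("7B", 7e9)]:
    for dtype, size in BYTES_PER_PARAM.items():
        gib = estimate_weight_memory_gib(num_params, size)
        print(f"{label} in {dtype}: {gib:.1f} GiB")
```

Halving the bytes per parameter halves the estimate, which is why loading in FP16 or BF16 is the usual choice on a single consumer GPU.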


### Loading the model by using Pipeline

In [None]:
import torch
from transformers import pipeline

model_name = "meta-llama/Llama-3.2-3B-Instruct"

pipe = pipeline(
    "text-generation",
    model=model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)


messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)

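# Print the full conversation, including the system and user messages
print(outputs[0]["generated_text"])

# Print only the assistant's reply (the last message in the conversation)
print(outputs[0]["generated_text"][-1]["content"])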

RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
partially initialized module 'torch._inductor' has no attribute 'custom_graph_pass' (most likely due to a circular import)

In [None]:
prompt = "What is LLM? Do not repeat the prompt in your response. Write your answer starting here [YOUR ANSWER HERE]"

In [11]:
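# Tokenize the prompt and move the tensors to the selected device
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate a response. pad_token_id is None for this tokenizer, so generate() falls back to eos_token_id.
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=500,  # counts the prompt tokens plus the generated tokens
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
)
# outputs[0] contains the prompt followed by the new tokens, so the decoded text repeats the prompt
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("This is the response generated: ", response)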

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


This is the response generated:  What is LLM? Do not repeat the prompt in your response. Write your answer starting here [YOUR ANSWER HERE].


[YOUR ANSWER HERE]

Large Language Models (LLMs) are a type of artificial intelligence (AI) designed to process and understand human language. These models are trained on vast amounts of text data, allowing them to learn patterns, relationships, and structures within language. As a result, LLMs can generate human-like text, respond to questions, and even engage in conversation. They are commonly used in various applications, including language translation, text summarization, and content generation. LLMs have the potential to revolutionize the way we interact with technology and access information, but they also raise concerns about their limitations, biases, and potential misuse.
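
The response above begins with the prompt because `generate` on a decoder-only model returns the prompt tokens followed by the new ones. One way to keep only the answer is to drop the first `prompt_length` token ids before decoding. The helper name `strip_prompt_tokens` is hypothetical. The sketch works on any sliceable sequence of ids, so plain lists stand in for tensors here.

```python
from typing import Sequence


def strip_prompt_tokens(output_ids: Sequence[int], prompt_length: int) -> Sequence[int]:
    """Return only the ids that were generated after the prompt."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[prompt_length:]


# Toy example: three prompt tokens followed by two generated tokens
prompt_ids = [101, 102, 103]
generated = prompt_ids + [7, 8]
new_ids = strip_prompt_tokens(generated, len(prompt_ids))
print(new_ids)
```

With the objects from the generation cell, the equivalent call would be `tokenizer.decode(strip_prompt_tokens(outputs[0], inputs["input_ids"].shape[-1]), skip_special_tokens=True)`.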


### Step 2: Retrieval

In [1]:
from sentence_transformers import SentenceTransformer
import chromadb

# Load the embedding model (named embedding_model so it does not overwrite the LLM `model` from Step 1)
embedding_model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Encode the query
query = "What is the general condition of ASB Loan"
query_embedding = embedding_model.encode(query)

# Search the vector database
client = chromadb.PersistentClient(path="C:/Users/User/Documents/SideProject/personal_document_chatbot_with_RAG/data/vectorDB")
collection = client.get_collection(name="document_collection")
vector_response = collection.query(
    query_embeddings=[query_embedding.tolist()],  # one embedding per query
    n_results=5,
    include=["documents"],
)

# Show the raw query result
print(vector_response)



{'ids': [['58c81b88-ecbe-4ddd-9600-6736008e2ab1', 'ed0f7c44-0cc4-44e3-bc49-b9d180fe1c6c', '5f083e6c-2b0a-45d3-b166-9a62000d9fd7', '72fb8831-3dd2-4cdd-ab40-0d2c2101dd2d', '8fdb1ad0-999f-4ca6-a7c7-feda8dd5d9a4']], 'embeddings': None, 'documents': [['* * 1.0 DOCUMENTATION / DOKUMENTASI * * 1.1 Definition of Customer refers to the applicant and the guarantor ( if any ) named in the Application Form upon approval of the Facility by the Bank . _Takrif bagi Pelanggan merujuk pemohon dan penjamin ( jika ada ) yang dinamakan dalam Borang Permohonan selepas Kemudahan diluluskan oleh Bank._ 1.2 The applicant and the guarantor ( if any ) named in the Application Form agrees with Affin Bank Berhad ( “ Bank ” ) that this General Terms and Conditions ( “ T & C ” ) shall be read together with the Application Form executed by the Customer and shall bind the applicant and the guarantor ( if any ) whose application for Term Loan Secured by ASB Certificate ( “ Facility ” ) has been approved by the Bank . 
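
The query result is a dictionary whose `documents` entry holds one list of chunks per query. A small helper can pull out the chunks for a single query. `extract_documents` is a hypothetical name, and the sample dictionary below is a placeholder with the same shape as the output above, not real data from the collection.

```python
from typing import List


def extract_documents(vector_response: dict, query_index: int = 0) -> List[str]:
    """Return the retrieved chunks for one query, or an empty list if none exist."""
    documents = vector_response.get("documents") or []
    if query_index >= len(documents):
        return []
    return list(documents[query_index])


# Placeholder result with the same nesting as a Chroma query response
sample_response = {
    "ids": [["id-1", "id-2"]],
    "embeddings": None,
    "documents": [["first placeholder chunk", "second placeholder chunk"]],
}
chunks = extract_documents(sample_response)
print(chunks)
```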

### Step 3: Send the Retrieved Information to the LLM and Generate a Response
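
As one possible shape for this step, the retrieved chunks can be joined into a context block and passed to the model as chat messages, in the same `role`/`content` format used in the pipeline example in Step 1. The function name `build_rag_messages`, the system prompt wording, and the placeholder chunk are all assumptions for illustration.

```python
from typing import Dict, List


def build_rag_messages(query: str, chunks: List[str]) -> List[Dict[str, str]]:
    """Combine retrieved chunks and the user query into chat messages."""
    # Number each chunk so the model can tell them apart
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    system_prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]


# The chunk is a placeholder, not real output from the vector database
messages = build_rag_messages(
    "What is the general condition of ASB Loan",
    ["placeholder chunk returned by the vector search"],
)
for message in messages:
    print(message["role"], ":", message["content"])
```

The resulting list could then be passed to `pipe(messages, max_new_tokens=256)`, or converted with `tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")` before calling `model.generate`.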