# Article Series QA Assistant with RAG
## ABB #3 - Session 4

Code authored by: Shaw Talebi

### imports

In [1]:
import os 
import json
from IPython.display import display, Markdown
from functions import semantic_search, answer_query  # project helpers used below

import torch
from sentence_transformers import SentenceTransformer
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from openai import OpenAI
from dotenv import load_dotenv

In [2]:
# load secret key (OPENAI_API_KEY) from .env file
load_dotenv()

# setup api client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

### load data & model

In [3]:
# load chunks
filename = 'data/chunk_list.json'
with open(filename, 'r', encoding='utf-8') as f:
    chunk_list = json.load(f)

# load embeddings
chunk_embeddings = torch.load('data/chunk_embeddings.pt', weights_only=False)

In [4]:
print("Num chunks:", len(chunk_list))
print(chunk_embeddings.shape)

Num chunks: 778
(778, 384)


In [5]:
# load model
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

### 1) define query

In [6]:
# define query
query = "When does it make sense to use RAG vs fine-tuning?"

### 2) context retrieval

In [7]:
results_markdown = semantic_search(query, model, chunk_embeddings, chunk_list, temp=0.1, k=10, threshold=0.01)

In [8]:
display(Markdown(results_markdown))

1. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** RAG vs Fine-tuning?  
   **Snippet:** We’ve already mentioned situations where RAG and fine-tuning perform well. However, since this is such a common question, it’s worth reemphasizing when each approach works best.  

2. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** RAG vs Fine-tuning?  
   **Snippet:** Here’s high-level guidance on when to use each.  

3. **Article title:** How to Improve LLMs with RAG  
   **Section:** Why we care  
   **Snippet:** Previous articles in this series discussed fine-tuning, which adapts an existing model for a particular use case. While this is an alternative way to endow an LLM with specialized knowledge, empirically, fine-tuning seems to be less effective than RAG at doing this [1].  

4. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** RAG vs Fine-tuning?  
   **Snippet:** RAG is when we inject relevant context into an LLM’s input prompt so that it can generate more helpful responses. For example, if we have a domain-specific knowledge base (e.g., internal company documents and emails), we might identify the items most relevant to the user’s query so that an LLM can synthesize information in an accurate and digestible way.  

5. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** RAG vs Fine-tuning?  
   **Snippet:** Notice that these approaches are not mutually exclusive. In fact, the original RAG system proposed by Facebook researchers used fine-tuning to better use retrieved information for generating responses [4].  

6. **Article title:** How to Improve LLMs with RAG  
   **Section:** Some Nuances  
   **Snippet:** Document preparation—The quality of a RAG system is driven by how well useful information can be extracted from source documents. For example, if a document is unformatted and full of images and tables, it will be more difficult to parse than a well-formatted text file.  

7. **Article title:** How to Improve LLMs with RAG  
   **Section:** Some Nuances  
   **Snippet:** While the steps for building a RAG system are conceptually simple, several nuances can make building one (in the real world) more complicated.  

8. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** When NOT to Fine-tune  
   **Snippet:** The effectiveness of any approach will depend on the details of the use case. For example, fine-tuning is less effective than retrieval augmented generation (RAG) to provide LLMs with specialized knowledge [1].  

9. **Article title:** How to Improve LLMs with RAG  
   **Section:** How it works  
   **Snippet:** There are 2 key elements of a RAG system: a retriever and a knowledge base.  

10. **Article title:** How to Improve LLMs with RAG  
   **Section:** Why we care  
   **Snippet:** Notice that RAG does not fundamentally change how we use an LLM; it's still prompt-in and response-out. RAG simply augments this process (hence the name).  
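
`semantic_search` is imported from `functions` and its body is not shown in this notebook. As a rough guess at what the `temp`, `k`, and `threshold` arguments could mean, the sketch below ranks chunks by cosine similarity, turns the similarities into softmax weights with temperature `temp`, and keeps at most `k` chunks whose weight is at least `threshold`. The name `rank_chunks` and this scoring scheme are assumptions for illustration, not the actual helper, and plain Python lists stand in for the real 384-dimensional embeddings.

```python
import math


def rank_chunks(query_vec, chunk_vecs, temp=0.1, k=10, threshold=0.01):
    """Return up to k (index, weight) pairs for the chunks closest to the query.

    Hypothetical scoring (an assumption, not the notebook's helper):
    cosine similarity -> softmax with temperature `temp`
    -> keep the top-k chunks whose softmax weight is >= `threshold`.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    sims = [cosine(query_vec, vec) for vec in chunk_vecs]
    if not sims:
        return []

    # temperature-scaled softmax; subtract the max for numerical stability
    scaled = [s / temp for s in sims]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]

    # sort by weight (descending), keep the top k, then drop weak matches
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(i, weights[i]) for i in order[:k] if weights[i] >= threshold]


# toy 2-D vectors standing in for real embeddings
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
ranked = rank_chunks([1.0, 0.1], chunk_vecs, temp=0.1, k=3, threshold=0.01)
print([i for i, _ in ranked])  # chunks 0 and 2 pass the threshold
```

Under this scheme a low temperature such as 0.1 sharpens the weights toward the best matches, so the threshold can drop weakly related chunks even when fewer than `k` remain.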



### 3) prompt engineering

In [9]:
prompt_template = lambda query, results_markdown : f"""You are an AI assistant tasked with answering user questions based on excerpts from blog posts. Use the following snippets to \
provide accurate, concise, and synthesized answers. If the snippets don’t provide enough information, let the user know and suggest further exploration.

## Question:
{query}

## Relevant Snippets:
{results_markdown}

---

## Response:
Provide a clear and concise response below, synthesizing information from the snippets and referencing them directly. If additional information is \
required, suggest further follow-ups or note what’s missing.
"""

In [10]:
prompt = prompt_template(query, results_markdown)
# print(prompt)

### 4) prompt GPT-4o

In [11]:
# make api call
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": prompt}
    ],
    temperature=0.5,
)

# extract response
answer = response.choices[0].message.content

### 5) display results

In [12]:
print()
print(query)
print()
display(Markdown(answer))


When does it make sense to use RAG vs fine-tuning?



When deciding between using Retrieval-Augmented Generation (RAG) and fine-tuning, it's important to consider the specific use case and the desired outcome. RAG is particularly effective when you have a domain-specific knowledge base, such as internal company documents, and need to inject relevant context into an LLM's input prompt to generate more accurate and helpful responses (Snippet 4). This approach is often more effective than fine-tuning for providing LLMs with specialized knowledge (Snippets 3 and 8).

Fine-tuning, on the other hand, adapts an existing model for a particular use case by training it further on a specific dataset. While this can be useful, it's generally considered less effective than RAG for endowing an LLM with specialized knowledge (Snippets 3 and 8).

It's also worth noting that these approaches are not mutually exclusive and can be combined. For example, the original RAG system proposed by Facebook researchers utilized fine-tuning to better use retrieved information for generating responses (Snippet 5).

For a more detailed understanding of when to use each approach, further exploration of the specific nuances and requirements of your use case would be beneficial, as the effectiveness of each method can depend heavily on those details (Snippets 1 and 8).

### Bonus: Streamline Process

In [13]:
query = "What are the benefits of LLM fine-tuning?"
results_markdown = semantic_search(query, model, chunk_embeddings, chunk_list, temp=0.1, k=10, threshold=0.01)
answer = answer_query(query, results_markdown, prompt_template, client)
display(Markdown(answer))

The benefits of fine-tuning large language models (LLMs) include:

1. **Improved Performance for Specific Use Cases**: Fine-tuning allows smaller models to outperform larger pre-trained models on specific tasks, especially when tailored with high-quality datasets (Snippets 6, 10).

2. **Lower Inference Costs**: Fine-tuning can reduce the computational costs associated with inference, making it a more efficient option for deploying AI assistants (Snippet 9).

3. **Customization**: Fine-tuning enables the adaptation of LLMs to meet the unique requirements of particular applications, enhancing their relevance and effectiveness (Snippet 6).

However, it's important to note that fine-tuning is not universally superior to other techniques like prompt engineering or retrieval augmented generation (RAG), and it may come with trade-offs, such as performance drops in certain tasks (Snippets 1, 5). Additionally, the quality of the training dataset is crucial for achieving optimal performance (Snippet 7).

For a deeper understanding, you might explore specific use cases or the technical aspects of fine-tuning further.
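
The `answer_query` helper used in the bonus cell also comes from `functions` and is not shown here. A minimal sketch, assuming it simply wraps steps 3 to 5 above (build the prompt, call the chat completions endpoint, return the message text), might look like the following. The `model` and `temperature` keyword arguments and their defaults are assumptions copied from the step 4 cell; the real helper may use different values.

```python
def answer_query(query, results_markdown, prompt_template, client,
                 model="gpt-4o", temperature=0.5):
    """Build the prompt, send it as one user message, and return the reply text.

    Hypothetical reconstruction: `model` and `temperature` defaults are
    assumptions taken from the step 4 cell, not confirmed by `functions`.
    """
    # step 3: fill the template with the query and retrieved snippets
    prompt = prompt_template(query, results_markdown)

    # step 4: single-turn chat completion request
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )

    # step 5: extract the generated answer
    return response.choices[0].message.content
```

Passing `client` and `prompt_template` in as arguments, as the bonus cell does, keeps the helper free of global state and makes it easy to swap in a different template or a stand-in client for testing.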