# RAG Pipeline with BGE and Llama 3
---

In this notebook, we build a Retrieval-Augmented Generation (RAG) pipeline that enhances a language model with up-to-date news information. We use **BAAI/bge-base-en-v1.5** to generate embeddings and retrieve the most relevant BBC news articles based on a user query. The retrieved documents are then injected into a prompt sent to **meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo**, a model trained on data up to December 2023.

By combining semantic retrieval with generation, the system enables the LLM to produce responses grounded in recent events (including 2024), improving factual accuracy and contextual relevance.

# Table of Contents
- [ 1 - Introduction](#1)
  - [ 1.1 RAG architecture overview](#1-1)
  - [ 1.2 Importing the necessary libraries](#1-2)
- [ 2 - Loading the dataset](#2)
- [ 3 - Main Functions](#3)
  - [ 3.1 Query news by index function](#3-1)
  - [ 3.2 Retrieve function](#3-2)
  - [ 3.3 Get relevant data](#3-3)
    - [ Exercise 1](#ex01)
  - [ 3.4 Formatting the relevant data](#3-4)
    - [ Exercise 2](#ex02)
  - [ 3.5 Generate the final prompt](#3-5)
  - [ 3.6 LLM call](#3-6)
- [ 4 - Experimenting with your RAG System](#4)

<a id='1'></a>
## 1 - Introduction

---

<a id='1-1'></a>
### 1.1 RAG Architecture Overview

Below is a simplified representation of a Retrieval-Augmented Generation (RAG) pipeline:

<div align="center">
  <img src="../src/assets/rag_overview.png" alt="RAG Overview" width="60%">
</div>

The system follows a structured workflow. A retriever first identifies the most relevant documents from the dataset based on a user query. The retrieved content is then formatted and injected into an augmented prompt. This enriched prompt is finally passed to the language model to generate a grounded response.
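The three steps above (retrieve, format, augment) can be illustrated with a toy sketch. It is not the notebook's implementation: word overlap stands in for embedding similarity, and every function name here (`toy_score`, `toy_retrieve`, `toy_augment`) is hypothetical.

```python
# Toy walk-through of the retrieve -> format -> augment steps.
# Word overlap stands in for embedding similarity; all names are hypothetical.

def toy_score(query: str, text: str) -> int:
    """Count the words shared by the query and a document (case-insensitive)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def toy_retrieve(query: str, docs: list[str], top_k: int = 1) -> list[int]:
    """Return the indices of the top_k documents with the highest overlap."""
    ranked = sorted(range(len(docs)), key=lambda i: toy_score(query, docs[i]), reverse=True)
    return ranked[:top_k]


def toy_augment(query: str, docs: list[str], top_k: int = 1) -> str:
    """Build an augmented prompt from the query and the retrieved documents."""
    context = "\n".join(docs[i] for i in toy_retrieve(query, docs, top_k))
    return f"Context:\n{context}\n\nQuery: {query}"


docs = [
    "Large tornado seen touching down in Nebraska",
    "Paris's Moulin Rouge loses windmill sails overnight",
]
augmented_prompt = toy_augment("tornado in Nebraska", docs)
print(augmented_prompt)
```

In the real pipeline the scoring step is replaced by similarity between BGE embeddings, and the augmented prompt is what gets sent to the language model.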

To evaluate the impact of retrieval, responses generated with the RAG pipeline are compared against responses produced without additional retrieved context. This comparison highlights how external knowledge influences factual accuracy and relevance in the model’s output.

<a id='1-2'></a>
### 1.2 Importing the necessary libraries

In [4]:
import os
import sys
from pathlib import Path
sys.path.extend([
    str(Path.cwd().parent),
    str(Path.cwd().parent / "src"),
])
import data
from sentence_transformers import SentenceTransformer
from utils.formatting import (
    pprint_json,
    read_dataframe,
    format_relevant_data,
)
from utils.rag_core import (
    query_news,
    build_embeddings_joblib,
    retrieve,
    get_relevant_data,
)

<a id='2'></a>
## 2 - Loading the dataset

In [5]:
NEWS_DATA = read_dataframe(path=os.path.join(os.path.dirname(data.__file__), "news_data_dedup.csv"))

In [6]:
pprint_json(NEWS_DATA[9:11])

[
  {
    "guid": "5dae28f191cfd1047f67c409e616fc3f",
    "title": "Paris's Moulin Rouge loses windmill sails overnight",
    "description": "The cause of the sails' collapse from the roof of the world famous cabaret club is not yet clear.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68895836",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  },
  {
    "guid": "d2c3ff79d4e068911d05416ca061cd51",
    "title": "Ukraine uses longer-range US missiles for first time",
    "description": "Missiles secretly delivered this month have been used to strike Russian targets in Crimea, US media say.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68893196",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  }
]


The most useful fields are `title`, `description`, `url` and `published_at`. Together they give the LLM enough context to answer most questions.
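As a minimal sketch of how a record could be trimmed to these fields before it is placed in a prompt: the helper `keep_key_fields` and the constant `KEY_FIELDS` are hypothetical and are not part of the notebook's `utils` package. Missing keys are skipped, since not every record carries every field.

```python
import json

# Fields singled out above as most useful to the LLM.
KEY_FIELDS = ("title", "description", "url", "published_at")


def keep_key_fields(record: dict) -> dict:
    """Return a copy of the record restricted to KEY_FIELDS, skipping missing keys."""
    return {field: record[field] for field in KEY_FIELDS if field in record}


record = {
    "guid": "5dae28f191cfd1047f67c409e616fc3f",
    "title": "Paris's Moulin Rouge loses windmill sails overnight",
    "description": "The cause of the sails' collapse from the roof of the world famous cabaret club is not yet clear.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68895836",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26",
}
trimmed = keep_key_fields(record)
print(json.dumps(trimmed, indent=2))
```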

In [7]:
indices = [3, 6, 9]
pprint_json(query_news(indices=indices, dataset=NEWS_DATA))

[
  {
    "guid": "e696224ac208878a5cec8bdc9f97c632",
    "title": "Europe risks dying and faces big decisions - Macron",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68898887",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  },
  {
    "guid": "4f585bad8f61b715fbafe2f022ab0ae8",
    "title": "Supreme Court divided on whether Trump has immunity",
    "description": "The justices discussed immunity, coups, pardons, Operation Mongoose - and the future of democracy.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-us-canada-68901817",
    "published_at": "2024-04-25",
    "updated_at": "2024-04-26"
  },
  {
    "guid": "5dae28f191cfd1047f67c409e616fc3f",
    "title": "Paris's Moulin Rouge loses windmill sails overnight",
    "description": "The cause of the sails' collapse from the roof of the world famous cabaret club is not yet clear.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-europe-68895836",
    "

In [8]:
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

In [9]:
EMBEDDINGS = build_embeddings_joblib(
    dataset=NEWS_DATA,
    model=model,
    output_path="embeddings.joblib",   
    fields=["title", "description"],   
    batch_size=32,
    normalize_embeddings=True
)

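The implementation of `build_embeddings_joblib` is not shown in this notebook. One plausible way to turn the `fields` argument into the text that gets embedded is to join the selected fields of each record into a single string. The helper `record_to_text` below is a hypothetical sketch of that idea, not the library's code.

```python
def record_to_text(record: dict, fields: list[str]) -> str:
    """Join the selected fields of a record into one string, skipping absent or empty ones."""
    return " ".join(str(record[f]) for f in fields if record.get(f))


sample_records = [
    {"title": "Large tornado seen touching down in Nebraska",
     "description": "Severe and powerful storms have moved across several US states."},
    {"title": "Europe risks dying and faces big decisions - Macron"},  # no description
]
texts = [record_to_text(r, ["title", "description"]) for r in sample_records]
print(texts)

# Assumption: these strings would then be encoded in batches, for example with
# model.encode(texts, batch_size=32, normalize_embeddings=True).
```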

In [28]:
model.encode("mohammed", batch_size=32, show_progress_bar=True, normalize_embeddings=True)


array([-5.71994558e-02,  9.02646314e-03,  2.34920997e-03,  1.07889855e-02,
        2.84517743e-02,  1.31535297e-02,  7.62099251e-02,  3.22549883e-03,
       -2.86554489e-02, -5.61304204e-02, -2.26937272e-02,  4.67896946e-02,
       -4.68841456e-02,  1.56930368e-02,  3.35647166e-02,  9.91301052e-03,
        6.51913509e-02, -4.43201475e-02, -8.12048744e-03, -2.03027055e-02,
        7.31112028e-04, -3.85817587e-02,  5.41246217e-03, -1.29905960e-03,
       -4.78052394e-03, -5.78692555e-03,  1.28935985e-02,  1.35813905e-02,
       -3.28958780e-02,  1.58033706e-02,  1.08167129e-02,  1.16664246e-02,
       -1.24903014e-02, -2.10305676e-02,  1.04889832e-02, -8.49692523e-03,
        3.44403349e-02, -1.61330570e-02, -1.06847147e-02,  7.03668827e-03,
       -7.28151947e-03, -1.47740617e-02,  2.62880735e-02,  3.98050323e-02,
       -2.49024499e-02, -5.42163812e-02,  6.61906740e-03, -6.93226466e-03,
        1.30417850e-02,  7.64306821e-03, -2.03467924e-02,  1.64931249e-02,
       -1.41409296e-03, -
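Because `normalize_embeddings=True` is passed, each vector returned by `encode` has unit L2 norm, so the dot product of two embeddings equals their cosine similarity. The check below uses toy 2-D vectors rather than real BGE embeddings, and the helper names are hypothetical.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# The unit vectors have length 1, and their dot product matches the cosine similarity.
print(math.isclose(dot(a_unit, a_unit), 1.0))
print(math.isclose(dot(a_unit, b_unit), cosine(a, b)))
```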

In [13]:
indices = retrieve(query="Concerts in North America", model=model, embeddings=EMBEDDINGS, top_k = 1)

In [14]:
indices

[350]
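`retrieve` is imported from `utils.rag_core`, so its body is not visible here. A common way to implement this kind of lookup, assuming unit-normalized embeddings, is to score every stored vector against the query vector with a dot product and keep the indices of the `top_k` highest scores. The function `top_k_indices` and the toy vectors below are hypothetical.

```python
def top_k_indices(query_vec: list[float], embeddings: list[list[float]], top_k: int = 5) -> list[int]:
    """Return the indices of the top_k embeddings most similar to query_vec."""
    # With unit-length vectors, the dot product is the cosine similarity.
    scores = [sum(q * e for q, e in zip(query_vec, row)) for row in embeddings]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:top_k]


toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_query = [0.8, 0.6]

print(top_k_indices(toy_query, toy_embeddings, top_k=1))
print(top_k_indices(toy_query, toy_embeddings, top_k=2))
```

With `top_k=1` the result is a one-element list of indices, the same shape as the `[350]` returned above. On a NumPy array of embeddings the same ranking is usually done with a matrix-vector product followed by a sort of the scores.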

In [17]:
import numpy as np
# Note: this definition replaces the `get_relevant_data` imported from `utils.rag_core` above.
def get_relevant_data(query: str, model: SentenceTransformer, embeddings: np.ndarray, dataset: list[dict], top_k: int = 5) -> list[dict]:
    """
    Retrieve and return the top relevant data items based on a given query.

    This function performs the following steps:
    1. Retrieves the indices of the top 'k' relevant items from a dataset based on the provided `query`.
    2. Fetches the corresponding data for these indices from the dataset.

    Parameters:
    - query (str): The search query string used to find relevant items.
    - model (SentenceTransformer): The embedding model used to encode the query.
    - embeddings (np.ndarray): The precomputed embeddings of the dataset items.
    - dataset (list[dict]): The news items from which the results are fetched.
    - top_k (int, optional): The number of top items to retrieve. Default is 5.

    Returns:
    - list[dict]: A list of dictionaries containing the data associated 
      with the top relevant items.

    """
    # Retrieve the indices of the top_k relevant items given the query.
    # Use the `embeddings` argument rather than the global EMBEDDINGS.
    relevant_indices = retrieve(query=query, model=model, embeddings=embeddings, top_k=top_k)

    # Obtain the data related to the items using the indices from the previous step
    relevant_data = query_news(indices=relevant_indices, dataset=dataset)

    return relevant_data

In [18]:
query = "Greatest storms in the US"
relevant_data = get_relevant_data(query=query, model=model, embeddings=EMBEDDINGS, dataset=NEWS_DATA, top_k = 1)
pprint_json(relevant_data)

[
  {
    "guid": "3ca548fe82c3fcae2c4c0c635d03eb2e",
    "title": "Large tornado seen touching down in Nebraska",
    "description": "Severe and powerful storms have moved across several US states, leaving many experiencing power shortages.",
    "venue": "BBC",
    "url": "https://www.bbc.co.uk/news/world-us-canada-68860070",
    "published_at": "2024-04-26",
    "updated_at": "2024-04-28"
  }
]


In [23]:
from typing import Optional

def generate_final_prompt(query: str, model: SentenceTransformer, embeddings: np.ndarray, dataset: list[dict], top_k: int = 5, use_rag: bool = True, prompt: Optional[str] = None) -> str:
    """
    Generates a final prompt based on a user query, optionally incorporating relevant data using retrieval-augmented generation (RAG).

    Args:
        query (str): The user query for which the prompt is to be generated.
        model (SentenceTransformer): The embedding model used to encode the query.
        embeddings (np.ndarray): The precomputed embeddings of the dataset items.
        dataset (list[dict]): The news items from which relevant data is fetched.
        top_k (int, optional): The number of top relevant data pieces to retrieve and incorporate. Default is 5.
        use_rag (bool, optional): A flag indicating whether to use retrieval-augmented generation (RAG)
                                  by including relevant data in the prompt. Default is True.
        prompt (str, optional): A template string for the prompt. It can contain placeholders {query} and {documents}
                                for formatting with the query and formatted relevant data, respectively.

    Returns:
        str: The generated prompt, either consisting solely of the query or expanded with relevant data
             formatted for additional context.
    """
    # If RAG is disabled, return the query unchanged
    if not use_rag:
        return query

    # Retrieve the top_k relevant data pieces based on the query
    relevant_data = get_relevant_data(query=query, model=model, embeddings=embeddings, dataset=dataset, top_k=top_k)

    # Format the retrieved relevant data
    retrieve_data_formatted = format_relevant_data(relevant_data=relevant_data)

    # If no custom prompt is provided, use the default prompt template
    if prompt is None:
        prompt = (
            "Answer the user query below. There will be provided additional information for you to compose your answer. "
            "The relevant information provided is from 2024 and it should be added as your overall knowledge to answer the query, "
            "you should not rely only on this information to answer the query, but add it to your overall knowledge."
            f"Query: {query}\n"
            f"2024 News: {retrieve_data_formatted}"
        )
    else:
        # If a custom prompt is provided, format it with the query and formatted relevant data
        prompt = prompt.format(query=query, documents=retrieve_data_formatted)

    return prompt
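The `prompt` argument is filled with `str.format`, so a custom template only needs the `{query}` and `{documents}` placeholders. The template text below is a made-up example, not one used by the notebook. Note that braces inside the substituted documents (for instance JSON) are safe, because `str.format` does not re-parse argument values; only literal braces written in the template itself would need doubling as `{{` and `}}`.

```python
# Hypothetical custom template with the two placeholders the function expects.
custom_template = (
    "Use the news items below to answer the question.\n"
    "Question: {query}\n"
    "News:\n{documents}"
)

# JSON-like documents contain braces, which is fine for substituted values.
json_docs = '{"title": "Large tornado seen touching down in Nebraska"}'
filled = custom_template.format(query="Greatest storms in the US", documents=json_docs)
print(filled)
```

A template like this could then be passed as `prompt=custom_template` when calling `generate_final_prompt`.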

In [25]:
print(generate_final_prompt(query="Tell me about the US GDP in the past 3 years.", model=model, embeddings=EMBEDDINGS, dataset=NEWS_DATA, top_k=5, use_rag=True, prompt=None))

Answer the user query below. There will be provided additional information for you to compose your answer. The relevant information provided is from 2024 and it should be added as your overall knowledge to answer the query, you should not rely only on this information to answer the query, but add it to your overall knowledge.Query: Tell me about the US GDP in the past 3 years.
2024 News: {
  "guid": "60adcbc18cfa8fee177fbe0f25dd350c",
  "title": "America's Economy Is No. 1. That Means Trouble",
  "description": "If you want a single number to capture America’s economic stature, here it is: This year, the U.S. will account for 26.3% of the global gross domestic product, the highest in almost two decades. That’s based on the latest projections from the International Monetary Fund. According to the IMF, Europe’s share of world GDP has dropped 1.4 percentage points since 2018, and Japan’s by 2.1 points. The U.S. share, by contrast, is up 2.3 points.",
  "venue": "WSJ",
  "url": "https://ww