# RAG EXAMPLES

In this notebook, we show how to use OpenAI's API to interact with a Large Language Model (LLM) for generating text based on various prompts.

Each example explores a different way to add content and enrich the model's responses.

- **First example:** we ask a question without any context, allowing the model to respond solely from its own knowledge base.
- **Second example:** introduces a document-based retrieval system to search for relevant information on-the-fly, allowing the model to answer with document-specific context.
- **Third example:** demonstrates a direct connection to an operational database, showcasing how RAG can seamlessly integrate structured data sources, like SQL databases, to answer specific questions with dynamic, up-to-date information.

By the end of this notebook, you will see how to integrate your LLM with contextual data to produce more relevant, accurate, and tailored outputs in real-world applications.

## Setting up the environment:

We'll install version 0.28 of the OpenAI library because the example code relies on legacy interfaces such as openai.Completion and openai.ChatCompletion, which are no longer supported in later versions (1.0 and above). Those later versions changed how the API is accessed and organized, requiring different syntax and method names.


In [53]:
!pip install -q openai==0.28

In [54]:
!pip show openai

Name: openai
Version: 0.28.0
Summary: Python client library for the OpenAI API
Home-page: https://github.com/openai/openai-python
Author: OpenAI
Author-email: support@openai.com
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: aiohttp, requests, tqdm
Required-by: 
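For readers who already have a newer client installed, the same chat request is written differently in openai 1.0 and above. The sketch below is illustrative and not part of the original notebook: the helper names `uses_legacy_interface` and `chat` are hypothetical, and the 1.x branch assumes the `OPENAI_API_KEY` environment variable is set.

```python
from importlib.metadata import version


def uses_legacy_interface(version_string: str) -> bool:
    """Return True for openai releases before 1.0 (module-level ChatCompletion)."""
    return int(version_string.split(".")[0]) < 1


def chat(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one user message using whichever interface is installed."""
    import openai  # imported lazily so the helper above works without the package

    messages = [{"role": "user", "content": prompt}]
    if uses_legacy_interface(version("openai")):
        # openai 0.28 style, as used throughout this notebook
        response = openai.ChatCompletion.create(model=model, messages=messages)
        return response["choices"][0]["message"]["content"]
    # openai >= 1.0 style: requests go through a client object
    client = openai.OpenAI()  # picks up OPENAI_API_KEY from the environment
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```

The rest of the notebook keeps the 0.28 syntax, so pinning the version as shown above remains the simplest way to run it unchanged.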


In [55]:
import openai
import os
from getpass import getpass

In [56]:
# Set your API key

openai.api_key = "<YOUR_API_KEY_HERE>"
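Hardcoding the key in a notebook makes it easy to leak when the file is shared. One possible alternative, sketched here with the `os` and `getpass` modules imported above, reads the key from an environment variable and falls back to an interactive prompt. The function name `load_api_key` and the variable name `OPENAI_API_KEY` used as the default are assumptions made for this example.

```python
import os
from getpass import getpass


def load_api_key(env_var: str = "OPENAI_API_KEY", prompt_fn=getpass) -> str:
    """Return the API key from the environment, or ask for it without echoing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value, so the key does not end up in the cell output
        key = prompt_fn(f"Enter the value for {env_var}: ").strip()
    if not key:
        raise ValueError(f"No API key provided for {env_var}")
    return key


# Usage in this notebook (left commented out so the sketch has no side effects):
# openai.api_key = load_api_key()
```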


## 1. FIRST EXAMPLE: Asking the model without context

We start by asking the model a question directly, without any support from a retrieval system. The model therefore answers from its pre-trained knowledge alone, with no external context or data to assist it.


### 1.1. First Test:

Let's call the Chat Completions API with a simple input prompt. In this example, we use "Hello!".

In [57]:
# Define the user prompt message
prompt = "Hello!"
# Request a chat completion with the ChatCompletion.create() function
completion = openai.ChatCompletion.create(
  # Use GPT 3.5 as the LLM
  model="gpt-3.5-turbo",
  # Pre-define conversation messages for the possible roles
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
  ]
)
# Print the returned output from the LLM model
print(completion.choices[0].message)

{
  "role": "assistant",
  "content": "Hello! How can I assist you today?",
  "refusal": null
}
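Each request to the chat endpoint is stateless, so the `messages` list has to carry the whole conversation: the system instruction first, then alternating user and assistant turns. A minimal sketch of assembling that list follows; the helper name `build_messages` and its `history` parameter are hypothetical and only illustrate the structure.

```python
def build_messages(user_prompt, system_prompt="You are a helpful assistant.", history=None):
    """Assemble the messages list: system instruction, earlier turns, new user prompt."""
    messages = [{"role": "system", "content": system_prompt}]
    # history is a list of (user_text, assistant_text) pairs from earlier turns
    for user_text, assistant_text in history or []:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_prompt})
    return messages


# Usage: a follow-up question that depends on the previous exchange
follow_up = build_messages(
    "And what is its population?",
    history=[("What is the capital of Portugal?", "The capital of Portugal is Lisbon.")],
)
```

The resulting list can be passed as the `messages` argument of `openai.ChatCompletion.create`.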


### 1.2. Second Test: asking for capitals

In [58]:
from openai import ChatCompletion

resposta = ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is the capital of France? And Portugal?"}]
)
print(resposta['choices'][0]['message']['content'])

The capital of France is Paris and the capital of Portugal is Lisbon.


## 2. SECOND EXAMPLE: Adding a simple retrieval system

In this example, we ask the model about Benfica games played in the last few months.
First, we ask a question without providing any context. Then we give the model relevant data and ask a similar question, to see how the answer changes when the model has access to additional information through the retrieval system we created.


In [59]:
# First, ask the model without any context
answer_without_context = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What was the starting 11 of Benfica vs Porto on Sunday 10/11/2024?"}])

print(answer_without_context['choices'][0]['message']['content'])


I'm sorry, but as of now, I do not have the ability to access real-time sports data or information on specific matches. I recommend checking the official websites of Benfica and Porto or sports news websites for the most up-to-date information on the starting 11 for that match.


Now we implement a simple Retrieval-Augmented Generation (RAG) pipeline: we retrieve the relevant documents and add them to the prompt as context before passing it to the model.


In [61]:
# Create the function to search for relevant documents based on the query
def search_relevant_documents(query, documents):
    # using a simple keyword matching approach
    key_words = query.lower().split()
    relevant_documents = []

    # Keep every document that contains at least one of the keywords
    # (the documents themselves are defined further down in this cell)
    for doc in documents:
        if any(word in doc.lower() for word in key_words):
            relevant_documents.append(doc)

    return relevant_documents

# Create the function to augment the prompt with our relevant context
def augment_prompt(query, documents):
    # Retrieve relevant documents for the query
    docs_relevant = search_relevant_documents(query, documents)

    # Join the relevant documents into one context block;
    # if there are none, use a fallback message instead
    if docs_relevant:
        context = "\n".join(docs_relevant)
    else:
        context = "No relevant documents found for this query."

    # Format the augmented prompt adding the new information
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {context}

    Query: {query}"""
    return augmented_prompt

# Create documents
documents = [
    "Starting eleven vs Porto, on 10/11/2024: Trubin/Bah,Araújo,Otamendi,Carreras/Florentino,Aursnes/Di María,Kokçu,Akturkoglu/Pavlidis",
    "Starting eleven vs Bayern on 6/11/2024: Trubin/Kaboré,Araújo,Otamendi,Silva,Carreras/Renato,Aursnes,Kokçu/Akturkoglu,Amdouni",
    "Starting eleven vs Farense on 2/11/2024: Trubin/Bah,Araújo,Otamendi,Carreras/Florentino,Aursnes/Di María,Kokçu,Akturkoglu/Pavlidis",
    "Starting eleven vs Santa Clara on 17/8/2024: Trubin/Bah,Silva,Otamendi,Carreras/Renato,Aursnes/Amdouni,Beste,Akturkoglu/Cabral"
]

# Ask the question
query = "What was the starting 11 of Benfica vs Farense?"

# Generate the augmented prompt to add in the conversation with the LLM
augmented_prompt = augment_prompt(query, documents)

# Ask the model with the augmented prompt
answer_with_context = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": augmented_prompt}]
)

print(answer_with_context['choices'][0]['message']['content'])


The starting eleven of Benfica vs Farense on 2/11/2024 was: Trubin/Bah, Araújo, Otamendi, Carreras/Florentino, Aursnes/Di María, Kokçu, Akturkoglu/Pavlidis.
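In the keyword matcher above, `any(...)` succeeds on a single shared word. Every document contains "starting" and "vs", so all four documents are passed to the model as context and the model does the real filtering. A possible refinement, sketched below, strips punctuation, drops common words, and ranks documents by how many remaining query words they share. The stop-word list is hand-picked for this toy corpus and the names `tokenize` and `rank_documents` are hypothetical.

```python
import string

# Hand-picked for this example: words that appear in every query or document
STOP_WORDS = {"what", "was", "the", "of", "on", "vs", "starting", "eleven", "11", "benfica"}

# Map every ASCII punctuation character to a space so "farense?" becomes "farense"
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(text):
    """Lowercase, remove punctuation, and drop stop words."""
    words = text.lower().translate(PUNCT_TO_SPACE).split()
    return [word for word in words if word not in STOP_WORDS]


def rank_documents(query, documents, top_k=1):
    """Return up to top_k documents, ordered by number of shared keywords."""
    query_tokens = set(tokenize(query))
    scored = []
    for doc in documents:
        overlap = len(query_tokens & set(tokenize(doc)))
        if overlap:
            scored.append((overlap, doc))
    # sort is stable, so ties keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


sample_documents = [
    "Starting eleven vs Porto, on 10/11/2024: Trubin/Bah,Araújo,Otamendi,Carreras",
    "Starting eleven vs Farense on 2/11/2024: Trubin/Bah,Araújo,Otamendi,Carreras",
]
best = rank_documents("What was the starting 11 of Benfica vs Farense?", sample_documents)
```

Production systems usually replace this kind of word overlap with embedding-based similarity search, but the retrieve-then-augment structure stays the same.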


## 3. THIRD EXAMPLE: Integrating with an Operational Database for Querying

Now we show how to extend the model's capabilities by connecting it to an operational database. As we saw in the "theory", the context is now retrieved from a live database, so the model can provide more accurate and up-to-date answers.

We'll use a simple database setup and show how the model can query and incorporate live data into the answers.

For this example we are using SQLAlchemy, but there are several ways to integrate an LLM with a database.



1) SQLAlchemy Database Setup

In [62]:
# Import the necessary libraries
from sqlalchemy import create_engine, Column, Integer, String, Date
# declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4;
# the old sqlalchemy.ext.declarative path is deprecated and emits a warning in 2.0
from sqlalchemy.orm import declarative_base, sessionmaker
import openai
from datetime import date


# Configure SQLAlchemy
Base = declarative_base()

# Define the class that maps to the 'games' table
class Game(Base):
    __tablename__ = 'games'

    id = Column(Integer, primary_key=True)
    opponent = Column(String, nullable=False)
    date = Column(Date, nullable=False)
    lineup = Column(String, nullable=False)

# Create the connection to the SQLite database
engine = create_engine('sqlite:///benfica_games.db')
Base.metadata.create_all(engine)

# Create the session to interact with the database
Session = sessionmaker(bind=engine)
session = Session()

# Create the data about Benfica's matches (all the matches are added as instances of the Game class)
games_data = [
    Game(opponent='Porto', date=date(2024, 11, 10), lineup="Trubin/Bah,Araújo,Otamendi,Carreras/Florentino,Aursnes/Di María,Kokçu,Akturkoglu/Pavlidis"),
    Game(opponent='Bayern', date=date(2024, 11, 6), lineup="Trubin/Kaboré,Araújo,Otamendi,Silva,Carreras/Renato,Aursnes,Kokçu/Akturkoglu,Amdouni"),
    Game(opponent='Farense', date=date(2024, 11, 2), lineup="Trubin/Bah,Araújo,Otamendi,Carreras/Florentino,Aursnes/Di María,Kokçu,Akturkoglu/Pavlidis"),
    Game(opponent='Santa Clara', date=date(2024, 10, 30), lineup="Trubin/Bah,Silva,Otamendi,Carreras/Renato,Aursnes/Amdouni,Beste,Akturkoglu/Cabral"),
    Game(opponent='Rio Ave', date=date(2024, 10, 27), lineup="Trubin/Bah,Araújo,Otamendi,Carreras/Kokçu,Aursnes/Di María,Beste,Akturkoglu/Pavlidis"),
    Game(opponent='Feyenoord', date=date(2024, 10, 23), lineup="Trubin/Bah,Araújo,Otamendi,Carreras/Florentino,Aursnes/Di María,Kokçu,Akturkoglu/Pavlidis"),
    Game(opponent='Pevidem', date=date(2024, 10, 19), lineup="Soares/Kaboré,Araújo,Silva,Carreras/Florentino,Aursnes/Amdouni,Beste,Rollheiser/Cabral"),
    Game(opponent='Atletico de Madrid', date=date(2024, 10, 2), lineup="Trubin/Bah,Araújo,Otamendi,Carreras/Florentino,Aursnes/Di María,Kokçu,Akturkoglu/Pavlidis")
]


# Add the example data to the database (if not already added)
if not session.query(Game).first():  # Avoid duplication
    session.add_all(games_data)
    session.commit()




2) Function to Search for Relevant Matches Based on the Query

We create a function to search for relevant information in the database, augmenting the prompt with the starting lineup for the specific match asked in the query.

In [63]:
# Function to search for relevant matches in the database based on the query
def search_relevant_games(query: str):
    # Normalize the query: lowercase it and strip a trailing question mark,
    # which would otherwise stay attached to the last keyword and prevent a match
    cleaned_query = query.lower().rstrip('?').strip()

    # Split the query into keywords
    keywords = cleaned_query.split()

    # Fetch all games from the database
    games = session.query(Game).all()
    relevant_games = []

    # Keep the games whose opponent name contains any keyword of the query
    for game in games:
        if any(keyword in game.opponent.lower() for keyword in keywords):
            relevant_games.append(game)
    return relevant_games

# Function to augment the prompt with the database context
def augment_prompt_from_db(query: str):
    relevant_games = search_relevant_games(query)

    # Format the context for the prompt
    if relevant_games:
        contexts = [
            f"Starting eleven vs {game.opponent} on {game.date}: {game.lineup}"
            for game in relevant_games
        ]
        context = "\n".join(contexts)
    else:
        context = "I couldn't find information about this game."

    # Build the augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {context}

    Query: {query}"""

    return augmented_prompt
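`search_relevant_games` loads every row with `session.query(Game).all()` and filters in Python, which is fine for eight rows but scales poorly. One option is to let the database do the filtering. The sketch below uses the standard library's `sqlite3` module with an in-memory table so that it runs on its own; the table mirrors the `games` schema above, and the function name `find_games_by_opponent` is hypothetical.

```python
import sqlite3


def find_games_by_opponent(conn, query):
    """Return (opponent, date, lineup) rows whose opponent contains a query keyword."""
    keywords = [word.strip("?.,!") for word in query.lower().split()]
    keywords = [word for word in keywords if word]
    if not keywords:
        return []
    # One parameterized substring test per keyword; values are never
    # interpolated into the SQL string, only the "?" placeholders are
    clauses = " OR ".join("instr(lower(opponent), ?) > 0" for _ in keywords)
    sql = f"SELECT opponent, date, lineup FROM games WHERE {clauses} ORDER BY date DESC"
    return conn.execute(sql, keywords).fetchall()


# Standalone demo data: an in-memory copy of three rows from the table above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (id INTEGER PRIMARY KEY, opponent TEXT, date TEXT, lineup TEXT)")
conn.executemany(
    "INSERT INTO games (opponent, date, lineup) VALUES (?, ?, ?)",
    [
        ("Porto", "2024-11-10", "Trubin/Bah,Araújo,Otamendi,Carreras"),
        ("Bayern", "2024-11-06", "Trubin/Kaboré,Araújo,Otamendi,Silva,Carreras"),
        ("Farense", "2024-11-02", "Trubin/Bah,Araújo,Otamendi,Carreras"),
    ],
)
rows = find_games_by_opponent(conn, "What was the starting 11 of Benfica vs Bayern?")
```

This keeps the same substring semantics as the Python loop, so a short keyword can still match part of a longer opponent name.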


3) Query the Model with the Augmented Prompt

We use the augmented prompt to query the model and get the answer based on the live database context.

In [64]:
# Query
query = "What was the starting 11 of Benfica vs Bayern?"

# Generate the augmented prompt from the database
augmented_prompt = augment_prompt_from_db(query)

# Ask the model with the augmented prompt
response_with_context = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Here is information about Benfica's starting lineup for the specified matches."},
        {"role": "user", "content": augmented_prompt}
    ]
)

# Print the answer
print(response_with_context['choices'][0]['message']['content'])

The starting eleven for Benfica against Bayern on 2024-11-06 was:
- Trubin
- Kaboré
- Araújo
- Otamendi
- Silva
- Carreras
- Renato
- Aursnes
- Kokçu
- Akturkoglu
- Amdouni
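The second and third examples follow the same three steps: retrieve context, augment the prompt, and generate the answer. A compact way to express that shared pattern is sketched below, with the retriever and the model call passed in as plain functions so the flow can be tried without an API key. The names `build_prompt` and `answer_with_rag` are hypothetical, and the stand-in generator is only there to make the sketch runnable offline.

```python
from typing import Callable, List


def build_prompt(query: str, contexts: List[str]) -> str:
    """Augment the query with the retrieved contexts."""
    context = "\n".join(contexts) if contexts else "No relevant context found."
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{context}\n\n"
        f"Query: {query}"
    )


def answer_with_rag(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve, augment, generate: the three RAG steps in order."""
    contexts = retrieve(query)              # 1. retrieval (documents or database rows)
    prompt = build_prompt(query, contexts)  # 2. augmentation
    return generate(prompt)                 # 3. generation (the LLM call)


# Offline usage with stand-in functions; in this notebook, generate would wrap
# openai.ChatCompletion.create and return the message content
demo_answer = answer_with_rag(
    "What was the starting 11 of Benfica vs Bayern?",
    retrieve=lambda q: ["Starting eleven vs Bayern on 2024-11-06: Trubin/Kaboré,Araújo"],
    generate=lambda prompt: f"(model would answer from {len(prompt)} prompt characters)",
)
```

Building the prompt by concatenation also avoids the leading spaces that an indented triple-quoted f-string inserts before "Contexts:" and "Query:".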
