# Context Augmentation and RAG

We continue to explore prompting with access to documents, building up components for the ChatBot assignment. In this exercise you will compare two approaches. Context Augmentation consists of importing a document (as text) directly into a Prompt. In Retrieval-Augmented Generation (RAG), the document is split into chunks and each chunk is converted into a vector representation by an embedding model; the chunks most similar to the query are then retrieved and passed to the LLM. Because the query is embedded with the same model as the document chunks, both live in the same vector space and can be compared directly.

In [1]:
%%capture
# let's install the packages we need first
# (%%capture is a cell magic, so it must be the first line of the cell)
!pip install openai langchain-openai faiss-cpu langchain-community tiktoken pdfplumber

In [2]:
# let's import the packages you will need - they will be explained later
# in the cells that use them
#
import pdfplumber
import tiktoken
import openai
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.evaluation.qa import QAEvalChain
from openai.types import Completion, CompletionChoice, CompletionUsage
import os
import IPython

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Set OpenAI API Key for today
os.environ["OPENAI_API_KEY"] = "***insert OPENAI API key here***"

### Choose a PDF file for the experiments. It should be a recent document that is unlikely to be part of the gpt-4o training dataset. Don't make it too large (5-10+ pages is OK).

In [5]:
# loading your file (can be pdf or txt)
from google.colab import files
uploaded = files.upload()

In [6]:
### pdfplumber is a package to extract text from a pdf file
#
def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            extracted = page.extract_text()
            if extracted:  # Avoid NoneType errors
                text += extracted + "\n"
    return text

# Extract text from PDF document
pdf_path = "/content/Thesis Andrea Mejia _MIMM_Final..pdf"  # Replace with the name of the PDF file you have uploaded
pdf_text = extract_text_from_pdf(pdf_path)



### Let's use the PDF text with last week's code for Context Augmentation

In [7]:
def get_completion(prompt, model="gpt-4o-mini"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.5
    )
    return response.choices[0].message.content

In [9]:
Prompt = f"""

Identify the main inbound strategies in the following text:
{pdf_text}
and generate a maximum of 5 parameters that generate these strategies in English

"""

### Now, **modify the above** **Prompt** to query the document about specific facts, or to ask more complex questions that require analysing the document's contents
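
To ask several different questions about the same document, it can help to keep the question separate from the document text. The helper below is a minimal sketch and not part of the original notebook: the name `build_context_prompt` and the `<document>` delimiters are assumptions chosen for illustration.

```python
def build_context_prompt(question: str, document_text: str) -> str:
    """Return a context-augmented prompt: the question plus the full document text."""
    # The <document> delimiters are a hypothetical choice; they mark
    # where the imported text starts and ends.
    return (
        "Answer the question using only the document below.\n"
        f"Question: {question}\n"
        "<document>\n"
        f"{document_text}\n"
        "</document>"
    )


# Tiny stand-in text; in the notebook you would pass pdf_text instead.
sample_text = "The thesis interviews the CEO of the company."
example_prompt = build_context_prompt("Who was interviewed?", sample_text)
print(example_prompt)
```

In the notebook, `get_completion(build_context_prompt("your question", pdf_text))` would send the question together with the whole PDF text.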

When you import content into a Prompt, you increase its token count. You might hit the token limit of a small LLM: if you run an LLM locally on your computer, you are limited by your (V)RAM size and have to use smaller models (e.g. 8B parameters), whose context window can be limited to a few thousand tokens. The function below therefore calculates the token count of the augmented Prompt.

In [10]:
# calculate the tokens for the Context-Augmented Prompt
# uses the 'tiktoken' package
#
def count_tokens(text, model="gpt-4o-mini"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

num_tokens = count_tokens(Prompt)
print(f"Estimated tokens: {num_tokens}")

Estimated tokens: 35688
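
With the token count in hand, a quick budget check shows whether the prompt would fit a given model. This is a sketch under stated assumptions: the function name `fits_context_window`, the 500 tokens reserved for the answer, and both window sizes are illustrative values, not figures from this notebook.

```python
def fits_context_window(prompt_tokens: int, context_window: int,
                        reserved_for_output: int = 500) -> bool:
    """Return True if the prompt plus the space reserved for the answer fits."""
    return prompt_tokens + reserved_for_output <= context_window


prompt_tokens = 35688   # the token count reported above
small_window = 8192     # hypothetical window of a small local model
large_window = 128000   # hypothetical window of a large hosted model

print(fits_context_window(prompt_tokens, small_window))   # False
print(fits_context_window(prompt_tokens, large_window))   # True
```

A prompt that fails this check has to be shortened, or the document has to be accessed another way, which is where retrieval comes in.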


In [11]:
# run completion and display results
response = get_completion(Prompt)
IPython.display.Markdown(response)

Based on the provided text, the main inbound marketing strategies identified can be summarized into five key parameters:

1. **Content Creation**: This involves generating high-quality, relevant content tailored to the target audience. It includes various formats such as blogs, videos, eBooks, and social media posts that provide value and establish the company as a trusted authority.

2. **SEO (Search Engine Optimization)**: Optimizing the website and content to improve visibility on search engines. This includes keyword research, on-page SEO tactics, and creating a strong online presence to attract organic traffic.

3. **Social Media Engagement**: Establishing and maintaining active social media profiles on platforms that resonate with the target audience. This strategy focuses on engaging with users through regular posts, interactions, and community building to enhance brand awareness and customer relationships.

4. **Lead Generation and Nurturing**: Implementing tactics to convert website visitors into leads through calls-to-action (CTAs), landing pages, and lead magnets (e.g., free trials, demos). Following up with personalized communication to nurture these leads into customers.

5. **Data Analysis and Optimization**: Utilizing analytics tools (such as HubSpot and Google Analytics) to track and measure the performance of marketing activities. This includes analyzing key metrics (e.g., website traffic, conversion rates) to refine strategies and improve overall effectiveness.

These parameters collectively contribute to developing a robust inbound marketing strategy tailored for small and medium-sized enterprises (SMEs) seeking to enhance their market presence and customer engagement.

OK, now you have experimented with Context Augmentation from a PDF document. Let's move on to Retrieval-Augmented Generation (RAG).

### This is your first (mini) RAG, a minimal, but user-friendly implementation

In [12]:
# The first step prior to embedding is "chunking"
# The text is split into smaller chunks before converting them into embeddings
# RecursiveCharacterTextSplitter is a LangChain utility
# you can experiment with various chunk sizes, and observe the impact on responses to queries
#
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = text_splitter.create_documents([pdf_text])
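
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-size chunker in plain Python. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraphs and sentences, which this toy version (with the hypothetical name `chunk_text`) does not do.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


toy_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so text that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.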

FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors. It is primarily used for large collections of vectors such as embeddings.

In [13]:
# This is the vectorisation part
# initialise embeddings, then use FAISS to create a vectorised "database"
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
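
Conceptually, a vector store answers the question "which stored vectors are closest to this query vector?". The brute-force sketch below illustrates the idea with hand-written 3-dimensional toy vectors and cosine similarity. It is an assumption-laden illustration, not how FAISS is implemented: FAISS uses optimised index structures, and the metric it applies depends on how the index is configured. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k=2):
    """Return the indices of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), idx)
              for idx, vec in enumerate(chunk_vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy "embeddings" for four chunks (real embeddings have hundreds of dimensions).
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [0.8, 0.6, 0.0]

nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)  # [1, 0]
```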

In [14]:
# let's define the output elements of RAG
# first the retriever, then the LLM
retriever = vectorstore.as_retriever()
llm = ChatOpenAI(model_name="gpt-4o-mini")

In [15]:
# Create Retrieval-based QA Chain
# the 'chain' uses LangChain's 'RetrievalQA' class
# the chain comprises the LLM and the Retriever
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
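
What the chain does with the retrieved chunks can be pictured as follows: the chunks are placed into a single prompt together with the question, and that prompt is sent to the LLM. The sketch below is a simplified illustration; the function name `build_rag_prompt` and the template wording are assumptions, not LangChain's actual template.

```python
from typing import List


def build_rag_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt."""
    context = "\n\n".join(retrieved_chunks)
    # Hypothetical template wording, for illustration only.
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-in chunks; in the notebook these would come from the retriever.
toy_retrieved = [
    "Interviews were held with the CEO.",
    "The thesis combines qualitative and quantitative sources.",
]
rag_prompt = build_rag_prompt("Which methods were used?", toy_retrieved)
print(rag_prompt)
```

Only the retrieved chunks reach the LLM, not the whole document, which is why the prompt stays short.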

In [20]:
# Example Query
# to obtain a response, instead of calling a completion function
# you run the chain ('qa_chain.run')
#
query = "What are the research design and methods used in this thesis?"
response = qa_chain.run(query)
print("\nGenerated Answer:", response)


Generated Answer: The research design and methods used in this thesis include a combination of qualitative and quantitative sources to provide a comprehensive analysis that addresses the research questions. Specifically, the qualitative data was collected through video and audio interviews with the CEO of the company, focusing on understanding the company's operations, client base, and marketing strategies. The methodology emphasizes gathering, analyzing, and interpreting non-numerical data to gain insights into social realities.


You can try different queries and compare accuracy across different kinds of queries, especially those about specific facts.
Below you can also count the tokens used and compare them with the tokens required by Context Augmentation.

In [21]:
# counting the tokens used (hence the API cost)
#
from langchain_community.callbacks import get_openai_callback
#
with get_openai_callback() as cb:
    response = qa_chain.run(query)
    print(f"Response: {response}")
    print(f"Total Tokens Used: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost: ${cb.total_cost:.5f}")

Response: The research design and methods used in this thesis involve a combination of qualitative and quantitative sources to provide a comprehensive analysis that addresses the research questions. The qualitative aspect includes gathering, analyzing, and interpreting non-numerical data, primarily through video and audio interviews with the CEO of the company. The initial questions of the interviews focus on understanding the company's operations, clientele, and industry, followed by inquiries into the marketing strategies employed by the company.
Total Tokens Used: 452
Prompt Tokens: 364
Completion Tokens: 88
Total Cost: $0.00011
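
The two prompt sizes reported in this notebook can be compared with a little arithmetic. Treat it as a rough comparison only: the two runs asked different questions, and these figures cover the chat completion call, not the separate step of embedding the document.

```python
context_prompt_tokens = 35688  # "Estimated tokens" for the context-augmented Prompt
rag_prompt_tokens = 364        # "Prompt Tokens" reported for the RAG query

ratio = context_prompt_tokens / rag_prompt_tokens
saving = 1 - rag_prompt_tokens / context_prompt_tokens

print(f"The context-augmented prompt is about {ratio:.0f}x larger")
print(f"Prompt tokens saved with RAG: {saving:.1%}")
```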


# Summary

*   RAG is used to access knowledge that is not in an LLM's training data (because it's too recent, private, or too specific)
*   It consumes fewer tokens than Context Augmentation
*   It's meant to retrieve **facts** from a document
*   Its performance depends on implementation (chunking, embeddings)
*   It's meant to be used with vectorisation, often a *vector database*
*   It's very popular, although not a magic bullet