### Introduction

In this notebook, we will:
- Introduce LLMs and the concept of Retrieval-Augmented Generation (RAG)
- Build a playful chatbot example (with a pirate twist) that demonstrates memory and context handling
- Dive into embedding techniques, vector stores, and visualize embeddings with UMAP and Plotly
- Implement a simple RAG pipeline using a PDF document as the knowledge source


![Intro](./data/slides/slide1.png)

![What are we building?](./data/slides/slide2.png)

![Journey?](./data/slides/slide3.png)

![What are LLMs?](./data/slides/slide4.png)

## Section 1

In [1]:
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings
from fastembed import TextEmbedding
from langchain_core.runnables import RunnablePassthrough
import chromadb
import chromadb.utils.embedding_functions as chroma_embedding_functions
import umap.umap_ as umap
import numpy as np
from tqdm import tqdm
import plotly.graph_objects as go
import getpass
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
OLLAMA_HOST = 'http://localhost:5050'


In [7]:
# Initialize the model

# model = ChatOllama(
#   model="llama3.1:8b-instruct-q4_0",
#   temperature=0,
#   seed=42,
# )

# Read the key from the environment; prompt for it only if it is not set.
openai_api_key = os.environ.get("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
model = ChatOpenAI(
  model="gpt-4o-mini",
  temperature=0,
  seed=42,
  api_key=openai_api_key
)

### Let's build a simple pirate chatbot

*Notice how we keep the instructions concise to ensure clear behavior.*

In [8]:
prompt = PromptTemplate(
  template="""
  You are a pirate, you must answer all questions with pirate speak. Also, keep your responses short and to the point.
  Question: {question}
  Answer:
  """,
  input_variables=["question"],
)

model_chain = prompt | model

In [9]:
# Let's test the chain
model_chain.invoke({"question": "How are you?"}).content

"Arrr, I be feelin' as fine as a treasure chest full o' gold!"

### <del>*"You always forget!"*</del> 
### Let's add some memory

In [10]:
model.invoke([
  SystemMessage(content="You are a pirate, you must answer all questions with pirate speak. Also, keep your responses short and to the point."),
  HumanMessage(content="Hi! I'm Bob"),
  AIMessage(content="Hello Bob! How can I assist you today?"),
  HumanMessage(content="What's my name?"),
]).content

'Ye be called Bob, matey!'

In [11]:
memory = [
  SystemMessage(content="You are a pirate, you must answer all questions with pirate speak. Also, keep your responses short and to the point."),
]
max_qs = 10 # limit memory to a maximum number of messages to avoid overloading the context window.

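def chat(message):
  global memory

  # Trim old messages, always keeping the system message at index 0
  if len(memory) > max_qs:
    memory = memory[0:1] + memory[-max_qs:]

  memory.append(HumanMessage(content=message))
  # Call the chat model directly so the history is sent as real chat messages
  # rather than being pushed through the PromptTemplate in model_chain.
  response = model.invoke(memory)
  memory.append(response)
  return response.content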
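
The trimming step in `chat` is easier to see in isolation. The following is a minimal sketch, assuming plain `(role, text)` tuples in place of LangChain message objects; `trim_history` and `max_messages` are hypothetical names used only for illustration.

```python
# Plain (role, text) tuples stand in for LangChain message objects here.
def trim_history(history, max_messages):
    """Keep the first (system) message plus the most recent max_messages."""
    if len(history) <= max_messages + 1:
        return list(history)
    return history[:1] + history[-max_messages:]

history = [("system", "You are a pirate.")]
for turn in range(1, 8):
    history.append(("human", f"question {turn}"))
    history.append(("ai", f"answer {turn}"))

trimmed = trim_history(history, max_messages=4)
print(len(history), len(trimmed))   # message counts before and after trimming
print(trimmed[0][0], "|", trimmed[1][1])
```

Keeping index 0 fixed means the persona instructions survive no matter how long the conversation runs, while older turns fall out of the window.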

In [12]:
# Let's try the memory-enabled chat
chat('I am Bob')

"Ahoy, Bob! What be yer treasure seekin' today?"

In [13]:
chat('Who am I?')

"Ye be Bob, a brave soul sailin' the seas! Arrr!"

## Section 2

### Context Lengths

In [34]:
prompt = PromptTemplate(
  template="""
  You are an AI book assistant, you will answer any question about the book I'm providing.
  You must answer the questions that are from the given book only, if you don't know the answer, 
  then just say that you don't know. Also, mention the list of references from the book that you used to answer the question in the references section.
  Answer in the given json format.
  Book: 
  {book}
  ---
  Question: {question}
  {{
    "answer": "<your answer here>",
    "references": [list of reference paragraphs]
  }}
  """,
  input_variables=["book", "question"],
)

model_chain = prompt | model

In [35]:
with open('./data/docs/a-very-old-man-with-enormous-wings.txt') as f:
  book_text = f.read()

print(f'No. of chars in the book: {len(book_text)}')

No. of chars in the book: 15586


In [36]:
response = model_chain.invoke({
  'book': book_text,
  'question': 'Summarize the book for me in 1 line'
})

print(response.content)

```json
{
  "answer": "A Very Old Man with Enormous Wings tells the story of a decrepit angel who becomes a spectacle for a curious town, revealing the absurdity of human nature and the fleeting nature of wonder.",
  "references": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
```
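
The reply above arrives wrapped in a Markdown code fence, so it cannot be passed to `json.loads` as-is. A minimal sketch of one way to unwrap it follows; `parse_json_reply` and `sample_reply` are hypothetical, and the sketch assumes the model returns at most one fenced block.

```python
import json

FENCE = "`" * 3  # the Markdown code-fence marker

def parse_json_reply(text):
    """Strip an optional Markdown code fence, then parse the JSON payload."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (for example the one tagged "json")...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # ...and everything from the closing fence onwards.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)

# Hypothetical reply shaped like the model output shown above
sample_reply = FENCE + 'json\n{\n  "answer": "A short summary.",\n  "references": [1, 2]\n}\n' + FENCE
parsed = parse_json_reply(sample_reply)
print(parsed["answer"])
print(parsed["references"])
```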


### Tokenization

In [None]:
%%html
<iframe src="https://gpt-tokenizer.dev/" height="600" width="1000"></iframe>

Context lengths of models as of Dec 2024:

| **Model**                   | **Context Length** |
|-----------------------------|--------------------|
| Gemma 2B                    | 8k tokens          |
| Gemma 7B                    | 8k tokens          |
| GPT-3.5 (Turbo)             | 16k tokens         |
| Gemini 1.0                  | 32k tokens         |
| GPT-4o/mini                 | 128k tokens        |
| Llama 3.1 (all)             | 128k tokens        |
| Claude 3                    | 200k tokens        |
| Gemini 1.5 Pro              | 2M tokens          |
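
To get a feel for these limits, here is a rough sketch that checks whether a text of a given size is likely to fit. It assumes about 4 characters per token, a common rule of thumb for English text; real counts depend on the tokenizer. `estimate_tokens`, `fits`, and the `reserve` parameter are hypothetical names for illustration.

```python
import math

# Assumption: roughly 4 characters per token for English text. Real counts
# depend on the tokenizer, so treat this as a ballpark only.
CHARS_PER_TOKEN = 4

# A few context limits (in tokens) taken from the table above.
CONTEXT_LIMITS = {
    "GPT-3.5 (Turbo)": 16_000,
    "GPT-4o/mini": 128_000,
    "Gemini 1.5 Pro": 2_000_000,
}

def estimate_tokens(num_chars):
    """Ballpark token count for a text with num_chars characters."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)

def fits(num_chars, limit, reserve=1_000):
    """True if the text plus a reserve for instructions and the answer fits."""
    return estimate_tokens(num_chars) + reserve <= limit

book_chars = 15_586  # character count printed earlier for the short story
print("Estimated tokens:", estimate_tokens(book_chars))
for name, limit in CONTEXT_LIMITS.items():
    # Compare the short story with a hypothetical text 50 times longer.
    print(name, fits(book_chars, limit), fits(book_chars * 50, limit))
```

By this estimate the short story fits in all three windows, but a document 50 times longer does not fit the smaller ones, which is where retrieval becomes useful.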



## Section 3

### Enter RAG!

RAG (Retrieval-Augmented Generation) is an AI framework that combines the strengths of traditional information retrieval systems (such as search and databases) with the capabilities of generative large language models (LLMs).

<img src="./data/slides/rag-architecture.png" width="1000"/>
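
Before bringing in embeddings, the retrieve-then-generate flow can be sketched in a few lines. This toy version ranks passages by word overlap instead of vector similarity and stops at building the prompt; `tokenize`, `retrieve`, and `build_prompt` are hypothetical helpers, not part of any library.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, corpus, k=2):
    """Rank passages by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Augment the question with the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only these documents:\n{context}\nQuestion: {question}\nAnswer:"

corpus = [
    "Mars is known as the Red Planet.",
    "Jupiter is the largest planet in our solar system.",
    "The cat sat on the mat.",
]
question = "Which planet is the largest?"
top = retrieve(question, corpus, k=1)
print(build_prompt(question, top))  # this prompt would then be sent to the LLM
```

A real pipeline swaps the word-overlap ranking for embedding similarity and sends the resulting prompt to the model, but the three stages stay the same.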

### Embeddings

In [38]:
docs = [
  "The cat sat on the mat.",
  "A dog is playing in the yard.",
  "A bird is singing in the tree.",
  "The horse is galloping in the field.",
  "The chair is next to the table.",
  "The book is on the shelf.",
  "The computer is on the desk.",
  "Mars is known as the Red Planet.",
  "Jupiter is the largest planet in our solar system.",
  "Saturn has beautiful rings."
]

In [39]:
emb_fun = chroma_embedding_functions.DefaultEmbeddingFunction() # all-MiniLM-L6-v2
# emb_fun(['The cat sat on the mat'])[0].shape

In [40]:
client = chromadb.Client()

# Clean up any existing collection for a fresh start
existing = [collection.name for collection in client.list_collections()]
if "rag" in existing:
  client.delete_collection("rag")

collection = client.create_collection(
  "rag", 
  embedding_function=emb_fun,
)

collection.add(
  documents=docs,
  ids=[str(i) for i in range(len(docs))],
)

In [41]:
# Querying the collection
results = collection.query(
  query_texts=["animals"],
  n_results=2
)
results['documents']

[['A dog is playing in the yard.', 'The horse is galloping in the field.']]
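
Under the hood, a query like this comes down to comparing the query vector with every stored vector. The sketch below ranks by cosine similarity, one common choice; the distance metric configured in Chroma may differ. The 3-dimensional vectors are hand-made for illustration and were not produced by any embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (hypothetical, not real embeddings)
toy_embeddings = {
    "A dog is playing in the yard.": [0.9, 0.1, 0.0],
    "The book is on the shelf.": [0.1, 0.9, 0.1],
    "Saturn has beautiful rings.": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stand-in for the embedding of "animals"

# Sort documents from most to least similar to the query
ranked = sorted(
    toy_embeddings,
    key=lambda doc: cosine_similarity(query_vector, toy_embeddings[doc]),
    reverse=True,
)
for doc in ranked:
    print(round(cosine_similarity(query_vector, toy_embeddings[doc]), 3), doc)
```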

### Visualizing Embeddings

In [54]:
def plot_embeds(dataset, query=(None, None), retrieved=(None, None)): # all params are tuples of (projected_embeds, docs)
  projected_dataset_embeds, dataset_docs = dataset
  projected_query_embeds, query_doc = query
  projected_retrieved_embeds, retrieved_docs = retrieved
  
  fig = go.Figure()
  fig.add_trace(
    go.Scatter(
      x=projected_dataset_embeds[:, 0], y=projected_dataset_embeds[:, 1],
      name='Dataset', mode='markers', text=dataset_docs, hoverinfo='text', 
      marker=dict(size=10, color='gray'),
    ),
  )
  if retrieved_docs:
    fig.add_trace(
      go.Scatter(
        x=projected_retrieved_embeds[:, 0], y=projected_retrieved_embeds[:, 1],
        name='Retrieved', mode='markers', text=retrieved_docs, hoverinfo='text', 
        marker=dict(size=10, color='green'),
      )
    )
  if query_doc:
    fig.add_trace(
      go.Scatter(
        x=projected_query_embeds[:, 0], y=projected_query_embeds[:, 1],
        name='Query', mode='markers', text=[query_doc], hoverinfo='text', 
        marker=dict(size=10, color='red', symbol='diamond'),
      )
    )
  fig.update_layout(
      title=f'Query: {query_doc}' if query_doc else 'Dataset Projection', showlegend=True,
      xaxis=dict(scaleanchor="y", scaleratio=1, visible=False),
      yaxis=dict(scaleanchor="x", scaleratio=1, visible=False),
  )
  fig.show()

def project_embeds(embeddings, umap_transform):
  projected_embeds = np.empty((len(embeddings), 2))
  for i, embed in enumerate(tqdm(embeddings)):
    projected_embeds[i] = umap_transform.transform([embed])
  return projected_embeds

def get_projection_transform(dataset_embeds):
  return umap.UMAP(random_state=42, transform_seed=42, n_jobs=1, n_neighbors=dataset_embeds.shape[0]-1).fit(dataset_embeds)

In [None]:
# Get dataset embeddings from collection
results = collection.get(include=['embeddings', 'documents'])
dataset_embeds, dataset_docs = results['embeddings'], results['documents']
projection_transoform = get_projection_transform(dataset_embeds)

# Project the embeddings
projected_dataset_embeds = project_embeds(dataset_embeds, projection_transoform)

In [56]:
plot_embeds((projected_dataset_embeds, dataset_docs))

In [None]:
query = "animal activities"
results = collection.query(query_texts=[query], n_results=3, include=['embeddings', 'documents'])

query_embeds = emb_fun([query])[0]  # same embedding function the collection was created with
retreived_embeds, retreived_docs = results['embeddings'][0], results['documents'][0]
projected_query_embeds = project_embeds([query_embeds], projection_transoform)
projected_retreived_embeds = project_embeds(retreived_embeds, projection_transoform)

plot_embeds(
  (projected_dataset_embeds, dataset_docs),
  (projected_query_embeds, query), 
  (projected_retreived_embeds, retreived_docs)
)

## Section 4

### Finally, let's build a RAG!

In [58]:
print(
  model.invoke("What is amazon's total revenue in 2023? Answer in 1 line").content
)

As of my last update in October 2023, Amazon's total revenue for the year 2023 was not yet available. Please check the latest financial reports or news sources for the most current information.


In [None]:
doc_loader = PyPDFLoader(
  "./data/docs/2023-amazon-annual-letter.pdf",
)
docs = doc_loader.load()

In [62]:
# Split the document into chunks with overlaps
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
doc_splits = text_splitter.split_documents(docs)
doc_splits[1]

Document(metadata={'source': './data/docs/2023-amazon-annual-letter.pdf', 'page': 1, 'page_label': '2'}, page_content='Dear Shareholders:\nLast year at this time, I shared my enthusiasm and optimism for Amazon’s future. Today, I have even more.\nThe reasons are many, but start with the progress we’ve made in our financial results and customer\nexperiences, and extend to our continued innovation and the remarkable opportunities in front of us.\nIn 2023, Amazon’s total revenue grew 12% year-over-year (“Y oY”) from $514B to $575B. By segment, North\nAmerica revenue increased 12% Y oY from $316B to $353B, International revenue grew 11% Y oY from\n$118B to $131B, and AWS revenue increased 13% Y oY from $80B to $91B.\nFurther, Amazon’s operating income and Free Cash Flow (“FCF”) dramatically improved. Operating\nincome in 2023 improved 201% Y oY from $12.2B (an operating margin of 2.4%) to $36.9B (an operating\nmargin of 6.4%). Trailing Twelve Month FCF adjusted for equipment finance leases 
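
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts fixed-width character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line breaks; `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.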

In [None]:
# Generate embeddings using fastembed (or another model if preferred)
text_embedding = TextEmbedding()
embedding_values = list(text_embedding.embed([doc.page_content for doc in doc_splits]))

In [None]:
client = chromadb.Client()

if "rag" in [collection.name for collection in client.list_collections()]:
  client.delete_collection("rag")

collection = client.create_collection("rag")
collection.add(
  documents=[doc.page_content for doc in doc_splits],
  metadatas=[doc.metadata for doc in doc_splits],
  ids=[str(i) for i in range(len(doc_splits))],
  embeddings=embedding_values,
)

In [59]:
prompt = PromptTemplate(
  template="""You are an assistant for question-answering tasks.
    Use the following documents to answer the question.
    If you don't know the answer, just say that you don't know.
    Use three sentences maximum and keep the answer concise:
    Question: {question}
    Documents: {documents}
    Answer:
    """,
  input_variables=["question", "documents"],
)

rag_chain = prompt | model

In [60]:
class RAG:
  def __init__(self, rag_chain, collection, embedder):
    self.rag_chain = rag_chain
    self.collection = collection
    self.embedder = embedder  # must be the model that embedded the stored chunks

  def print_retrieved_docs(self, retrieved):
    print("Retrieved documents:")
    for i in range(len(retrieved['ids'][0])):
      print(f'Document: {i+1}')
      print(f'Source: {retrieved["metadatas"][0][i]}')
      print(f'Data: {retrieved["documents"][0][i]}\n')
    print('-' * 120 + '\n')

  def query(self, question, verbose=False):
    # Embed the question with the same fastembed model used for the chunks;
    # query_texts would fall back to the collection's default embedding function.
    query_embedding = list(self.embedder.embed([question]))[0].tolist()
    ret = self.collection.query(query_embeddings=[query_embedding], n_results=4)
    retrieved_docs = ret['documents'][0]
    if verbose:
      self.print_retrieved_docs(ret)
    response = self.rag_chain.invoke({"question": question, "documents": retrieved_docs})
    return response.content

rag = RAG(rag_chain, collection, text_embedding)

In [61]:
print(
  rag.query("What is amazon's total revenue in 2023?", verbose=False)
)

Amazon's total revenue in 2023 is $575 billion, reflecting a 12% year-over-year growth from $514 billion in 2022. This growth includes increases in North America, International, and AWS revenue segments.


### RAG: refactored and simplified with LangChain

In [65]:
print(
  model.invoke("What is amazon's total revenue in 2023? Answer in 1 line").content
)

As of my last update in October 2023, Amazon's total revenue for the year 2023 was approximately $514 billion.


In [66]:
doc_loader = PyPDFLoader(
  "./data/docs/2023-amazon-annual-letter.pdf",
)
docs = doc_loader.load()

In [67]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
doc_splits = text_splitter.split_documents(docs)

In [68]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [69]:
ids = [str(i) for i in range(len(doc_splits))]
vector_store = Chroma.from_documents(
  doc_splits, 
  embedding=embeddings, 
  ids=ids,
  collection_name="rag-chroma-1",
)

In [70]:
retriever = vector_store.as_retriever()

prompt = PromptTemplate(
  template="""You are an assistant for question-answering tasks.
    Use the following documents to answer the question.
    If you don't know the answer, just say that you don't know.
    Use three sentences maximum and keep the answer concise:
    Question: {question}
    Documents: {documents}
    Answer:
    """,
  input_variables=["question", "documents"],
)

def format_docs(docs):
  return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"documents": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

In [71]:
print(
  rag_chain.invoke("What is amazon's total sales in 2023?")
)

Amazon's total sales in 2023 amounted to $574.785 billion. This represents a 12% increase compared to the previous year. The sales growth was driven by increased unit sales, advertising sales, and subscription services.
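
The pipe syntax in `rag_chain` can look opaque at first. The sketch below mimics the same data flow with plain Python functions: the dict step feeds one input to every branch, and each later step receives the previous step's output. All names here (`pipe`, `fan_out`, `fake_retriever`, `fake_model`, and so on) are hypothetical stand-ins, not LangChain APIs.

```python
def pipe(*steps):
    """Compose steps left to right, like chaining with the | operator."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

def fan_out(**branches):
    """Feed the same input to every branch and collect the results in a dict."""
    return lambda value: {name: branch(value) for name, branch in branches.items()}

# Hypothetical stand-ins for the retriever, prompt, model, and output parser
def fake_retriever(question):
    return ["passage one", "passage two"]

def format_docs(docs):
    return "\n\n".join(docs)

def fill_prompt(inputs):
    return f"Question: {inputs['question']}\nDocuments: {inputs['documents']}\nAnswer:"

def fake_model(prompt_text):
    return {"content": f"(reply to a prompt of {len(prompt_text)} characters)"}

def parse_output(message):
    return message["content"]

toy_chain = pipe(
    fan_out(documents=pipe(fake_retriever, format_docs), question=lambda q: q),
    fill_prompt,
    fake_model,
    parse_output,
)
print(toy_chain("What is X?"))
```

The `question=lambda q: q` branch plays the role of `RunnablePassthrough()`: it hands the original question through unchanged so the prompt can use it alongside the retrieved documents.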


In [72]:
eval_questions = [
  'What were the primary drivers of revenue growth for Amazon in 2023?',
  'What initiatives is Amazon undertaking to reduce its cost to serve in 2024?',
  'What specific advancements did AWS make in 2023 regarding infrastructure, chip technology, and Generative AI?',
  'According to Amazon\'s CEO, what defining characteristics exemplify the company\'s culture and approach to innovation?',
  'How does Amazon empower its employees to act like "builders"?',
]

In [73]:
q = eval_questions[2]
print("Q: ", q)
print("A: ", rag_chain.invoke(q))

Q:  What specific advancements did AWS make in 2023 regarding infrastructure, chip technology, and Generative AI?
A:  In 2023, AWS introduced the Graviton4 CPU chips, offering up to 30% better compute performance and 75% more memory bandwidth compared to Graviton3. They also launched AWS Trainium2 chips, which provide up to four times faster machine learning training for generative AI applications. Additionally, AWS expanded its infrastructure to include 105 Availability Zones across 33 geographic regions, with six new regions planned.


## Section 5

### Evaluation

![Evaluation](./data/slides/slide5.png)