In [1]:
import os
os.environ["OPENAI_API_KEY"] = "API KEY"
# Set your Pinecone API key here
os.environ["PINECONE_API_KEY"] = "API KEY"

# Homework 12 - Part B: Custom Data Chatbot

### Part B Goal

Build a chatbot that answers questions based on custom data from multiple documents, using LangChain, OpenAI, and the Pinecone vector DB. The chatbot learns from the external world through **R**etrieval **A**ugmented **G**eneration (RAG).

The chatbot keeps the conversation in memory so that it can build on earlier turns and summarize the conversation.

### Prerequisites

Install the following Python libraries:

- **langchain**: This is a library for GenAI. We'll use it to chain together different language models and components for our chatbot.
- **openai**: This is the official OpenAI Python client. We'll use it to interact with the OpenAI API and generate responses for our chatbot.
- **datasets**: This library provides a vast array of datasets for machine learning. We'll use it to load our knowledge base for the chatbot.
- **pinecone-client**: This is the official Pinecone Python client. We'll use it to interact with the Pinecone API and store our chatbot's knowledge base in a vector database.

**NOTE**: *OpenAI dataloaders will not load locally for on-prem devices easily. To simplify the use of these loaders, it is recommended to use an online notebook such as CoLab.*

In [2]:
!pip install -qU \
    langchain==0.0.354 \
    openai==1.6.1 \
    datasets==2.10.1 \
    pinecone-client==3.1.0 \
    tiktoken==0.5.2


### BACKGROUND: Building a Chatbot (no RAG)

We will be relying heavily on the LangChain library to bring together the different components needed for our chatbot. To begin, we'll create a simple chatbot without any retrieval augmentation. We do this by initializing a `ChatOpenAI` object. For this we do need an [OpenAI API key](https://platform.openai.com/account/api-keys).

In [3]:
import os
from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)



In [4]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great thank you. How can I help you?"),
    HumanMessage(content="I'd like to know when Lewis Hamilton is moving to Ferrari.")
]

The format is very similar to the role-based message format of the OpenAI chat API: we've just swapped the role of `"user"` for `HumanMessage`, and the role of `"assistant"` for `AIMessage`.
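To make the mapping concrete, here is a minimal sketch that converts LangChain-style messages back into role/content dictionaries. It relies only on the `type` and `content` attributes that LangChain message objects expose. The names `TYPE_TO_ROLE` and `to_openai_dicts` are illustrative, and the `SimpleNamespace` objects are stand-ins so the example runs without LangChain installed.

```python
from types import SimpleNamespace

# LangChain message `type` values mapped to OpenAI chat roles
TYPE_TO_ROLE = {"system": "system", "human": "user", "ai": "assistant"}

def to_openai_dicts(messages):
    """Convert objects with `.type` and `.content` into role/content dicts."""
    return [
        {"role": TYPE_TO_ROLE[m.type], "content": m.content}
        for m in messages
    ]

# Stand-ins for SystemMessage, HumanMessage and AIMessage (assumption: only
# `.type` and `.content` are needed, which the real classes also provide)
demo_messages = [
    SimpleNamespace(type="system", content="You are a helpful assistant."),
    SimpleNamespace(type="human", content="Hi AI, how are you today?"),
    SimpleNamespace(type="ai", content="I'm great thank you. How can I help you?"),
]

role_dicts = to_openai_dicts(demo_messages)
print([d["role"] for d in role_dicts])
```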

We generate the next response from the AI by passing these messages to the `ChatOpenAI` object.

In [5]:
res = chat(messages)
res



AIMessage(content="I'm sorry, but I don't have that information. As of now, there have been no official announcements regarding Lewis Hamilton moving to Ferrari. It's best to stay updated through reliable sources for any news regarding this matter.")

In response we get another AI message object. We can print it more clearly like so:

In [6]:
print(res.content)

I'm sorry, but I don't have that information. As of now, there have been no official announcements regarding Lewis Hamilton moving to Ferrari. It's best to stay updated through reliable sources for any news regarding this matter.


### Stringing Messages for a Conversation
Because `res` is just another `AIMessage` object, we can append it to `messages`, add another `HumanMessage`, and generate the next response in the conversation.

In [7]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="How many championships has Lewis Hamilton Won?"
)
# add to messages
messages.append(prompt)

# send to chat-gpt
res = chat(messages)

print(res.content)

As of the end of the 2021 Formula 1 season, Lewis Hamilton has won a total of 7 Formula 1 World Championships.
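The append-then-call pattern used above can be wrapped in a small helper. This is a sketch, not part of the original notebook: `ask` and `stub_chat` are hypothetical names, and the stub stands in for the `chat` object so the example runs without an API key.

```python
def ask(chat, messages, prompt_message):
    """Append the prompt, call the model, then append and return the reply."""
    messages.append(prompt_message)
    reply = chat(messages)      # the model sees the full history
    messages.append(reply)      # keep the reply so later turns can refer to it
    return reply

# Stub model for illustration only: reports how many messages it received
def stub_chat(history):
    return f"reply after seeing {len(history)} message(s)"

history = ["system: You are a helpful assistant."]
first = ask(stub_chat, history, "user: first question")
second = ask(stub_chat, history, "user: second question")
print(second, len(history))
```

In the notebook this would be called as `ask(chat, messages, HumanMessage(content="..."))`, replacing the two `messages.append(...)` calls around each request.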


### Dealing with Hallucinations

We have our chatbot, but as mentioned — the knowledge of LLMs can be limited. The reason for this is that LLMs learn all they know during training. An LLM essentially compresses the "world" as seen in the training data into the internal parameters of the model. We call this knowledge the _parametric knowledge_ of the model.

By default, LLMs have no access to the external world.

The result of this is very clear when we ask LLMs about more recent information, such as the Ferrari driver lineup for 2025.

In [8]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="Who are the drivers lined up for Ferrari in 2025?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [9]:
print(res.content)

I'm unable to provide information on future driver lineups as it is subject to change and speculation. It's best to stay updated through official announcements from Ferrari or reliable sources for the most accurate information.


Our chatbot can no longer help us because it doesn't contain the information we need to answer the question. It was very clear from this answer that the LLM doesn't know the information, but sometimes an LLM may respond as if it _does_ know the answer, and this can be very hard to detect.

OpenAI have since adjusted the behavior for this particular example as we can see below:

In [10]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="Can you tell me where will Lewis Hamilton drive next Season in F1?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [11]:
print(res.content)

I don't have real-time information on Lewis Hamilton's team for the upcoming Formula 1 season. It's best to follow official announcements from Lewis Hamilton or his team to get the latest updates on his driving plans for the next season.


### Importing the Data

In [13]:
# !pip install pypdf

In [16]:
from langchain_community.document_loaders import PyPDFLoader

#load pdf files
loader = PyPDFLoader('Ferrari F1 Team for 2025.pdf')
data = loader.load()
print(data)

[Document(page_content="Ferrari\nF1\nTeam\nfor\n2025\nThe\nScuderia\nFerrari,\nalso\nknown\nas\nthe\nFerrari\nF1\nteam,\nis\nthe\nracing\nteam\nof\nthe\niconic\nItalian\nluxury\nsports\ncar\nmanufacturer\nFerrari.\nThe\nteam\nis\nbased\nin\nMaranello,\nItaly,\nand\nhas\nbeen\na\ndominant\nforce\nin\nFormula\nOne\nracing\nsince\nits\ninception\nin\n1950.\nTeam\nManagement\nTeam\nPrincipal:\nFrédéric\nVasseur\nA\nFrench\nengineer\nand\nmanager,\nVasseur\njoined\nFerrari\nin\n2023,\nreplacing\nMattia\nBinotto.\nHe\nhas\nextensive\nexperience\nin\nF1,\nhaving\nworked\nwith\nteams\nlike\nRenault,\nToyota,\nand\nAlfa\nRomeo.\nTechnical\nDirector:\nEnrico\nCardile\nand\nEnrico\nGualtieri\nCardile\nis\nresponsible\nfor\nthe\ncar's\noverall\ndesign\nand\ndevelopment.\nGualtieri\noversees\nthe\npower\nunit\nand\ntransmission.\nDriver\nLineup\nfor\n2025\nCharles\nLeclerc\n(Monaco)\nContract:\n2025-2027\nLeclerc\njoined\nFerrari\nin\n2019\nand\nhas\nbeen\na\nkey\ndriver\nfor\nthe\nteam.\nHe\nhas\n

In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split text data into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=20)
text_chunks = text_splitter.split_documents(data)
print(len(text_chunks))

4
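To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window character chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph breaks, newlines, and spaces, so its chunk boundaries will differ. The function name `chunk_with_overlap` is illustrative.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share an overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of the previous one, so text that straddles a boundary appears in both chunks.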


In [18]:
# check the chunks
text_chunks[2]

Document(page_content="from\nother\ntop\nteams\nlike\nMercedes,\nRed\nBull,\nand\nMcLaren.\nThe\nteam\nhas\nbeen\nworking\nhard\nto\nimprove\nits\ncar\nand\noperations,\nand\nthe\naddition\nof\nLewis\nHamilton\nis\nexpected\nto\nbring\na\nnew\nlevel\nof\nexpertise\nand\ncompetitiveness.\nThe\nteam's\ngoal\nfor\n2025\nis\nto\nwin\nthe\nConstructors'\nChampionship\nand\nsupport\nits\ndrivers\nin\ntheir\nquest\nfor\nthe\nDrivers'\nChampionship.\nConclusion\nThe\nFerrari\nF1\nteam\nis\na\nlegendary\noutfit\nwith\na\nrich\nhistory\nand\na\npassionate\nfan\nbase.\nWith\na\nstrong\nteam\nmanagement,\ntalented\ndrivers,\nand\na\ncompetitive\ncar,\nthe\nteam\nis\nwell-positioned\nto", metadata={'source': 'Ferrari F1 Team for 2025.pdf', 'page': 0})
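The chunk above has a newline after every word, an artifact of how the text was extracted from this PDF. One optional cleanup, not part of the original notebook, is to collapse whitespace before embedding. Whether this improves retrieval here is untested, and `normalize_whitespace` is an illustrative name.

```python
def normalize_whitespace(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())

raw = "The\nteam's\ngoal\nfor\n2025\nis\nto\nwin\nthe\nConstructors'\nChampionship"
cleaned = normalize_whitespace(raw)
print(cleaned)
```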

In [19]:
# reformat chunks to improve vectorization; match 'jamescalam/llama-2-arxiv-papers-chunked' format sourced from Llama 2 ArXiv papers on huggingface
dataset = []

for i, chunk in enumerate(text_chunks):
    dataset.append({
        'doi': '',  # you can add a DOI here if available
        'chunk-id': str(i),
        'chunk': chunk,
        'id': '',  # you can add an ID here if available
        'title': '',  # you can add a title here if available
        'summary': '',  # you can add a summary here if available
        'source': '',  # you can add a source here if available
        'authors': [],  # you can add authors here if available
        'categories': [],  # you can add categories here if available
        'comment': '',  # you can add a comment here if available
        'journal_ref': None,  # you can add a journal reference here if available
        'primary_category': '',  # you can add a primary category here if available
        'published': '',  # you can add a published date here if available
        'updated': '',  # you can add an updated date here if available
        'references': []  # you can add references here if available
    })

print(dataset[3])

{'doi': '', 'chunk-id': '3', 'chunk': Document(page_content='succeed\nin\nthe\n2025\nF1\nseason.\nAs\nthe\nteam\ncontinues\nto\nevolve\nand\nimprove,\nfans\naround\nthe\nworld\nwill\nbe\neagerly\nwatching\nto\nsee\nif\nFerrari\ncan\nreclaim\nits\nposition\nat\nthe\ntop\nof\nthe\nF1\npodium.', metadata={'source': 'Ferrari F1 Team for 2025.pdf', 'page': 1}), 'id': '', 'title': '', 'summary': '', 'source': '', 'authors': [], 'categories': [], 'comment': '', 'journal_ref': None, 'primary_category': '', 'published': '', 'updated': '', 'references': []}


#### Dataset Overview

The dataset used is a PDF sample, `Ferrari F1 Team for 2025.pdf`, describing the Ferrari F1 team for the 2025 season.

Because most **L**arge **L**anguage **M**odels (LLMs) only contain knowledge of the world as it was during training, they cannot answer our questions about Ferrari's 2025 lineup without example data.

### Task 4: Building the Knowledge Base

We now have a dataset that can serve as our chatbot knowledge base. Our next task is to transform that dataset into the knowledge base that our chatbot can use. To do this we must use an embedding model and vector database.

We begin by initializing our connection to Pinecone, which requires a [free API key](https://app.pinecone.io).

In [21]:
from pinecone import Pinecone

# Initialize connection to Pinecone using the environment variable
api_key = os.environ.get("PINECONE_API_KEY")
pc = Pinecone(api_key=api_key)


Now we set up our index specification, which lets us define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [22]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

Then we initialize the index. We will be using OpenAI's `text-embedding-ada-002` model for creating the embeddings, so we set the `dimension` to `1536`.

In [23]:
import time

index_name = 'llama-2-rag'
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
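The index is created with `metric='dotproduct'`. OpenAI documents its embeddings as normalized to length 1, and for unit-length vectors the dot product equals the cosine similarity. The sketch below checks that on toy vectors. The helper names and the 3-dimensional vectors are illustrative stand-ins for real 1536-dimensional embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Toy 3-dimensional vectors standing in for 1536-dimensional embeddings
u = normalize([3.0, 4.0, 0.0])
w = normalize([4.0, 3.0, 0.0])

# For unit-length vectors the two scores coincide
print(round(dot(u, w), 6), round(cosine(u, w), 6))
```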

Our index is now ready but it's empty. It is a vector index, so it needs vectors. As mentioned, to create these vector embeddings we will use OpenAI's `text-embedding-ada-002` model, which we can access via LangChain like so:

In [24]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")



Using this model we can create embeddings like so:

In [25]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed_model.embed_documents(texts)
len(res), len(res[0])

(2, 1536)

From this we get two (aligning to our two chunks of text) 1536-dimensional embeddings.

We're now ready to embed and index all our data! We do this by looping through our dataset, then embedding and inserting everything in batches.

**NOTE**: *ensure that chunks are strings and that they are correctly assigned to metadata (do this with the `.page_content` attribute)*

In [26]:
import pandas as pd
from tqdm.auto import tqdm  # for progress bar

data = pd.DataFrame(dataset) # this makes it easier to iterate over the dataset

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i+batch_size)
    # get batch of data
    batch = data.iloc[i:i_end]
    # generate unique ids for each chunk (use `_` so the loop variable `i` is not shadowed)
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    # get text to embed: the chunk text only, not the Document repr with its metadata
    texts = [x['chunk'].page_content for _, x in batch.iterrows()]

    # embed text
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['chunk'].page_content,
         'source': x['source'],
         'title': x['title']} for _, x in batch.iterrows()
    ]
    # add to Pinecone as a list of (id, vector, metadata) tuples
    index.upsert(vectors=list(zip(ids, embeds, metadata)))

100%|██████████| 1/1 [00:00<00:00,  1.37it/s]
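The batching logic in the loop above can be isolated into a small generator. This is a sketch with an illustrative name, `iter_batches`, using a plain list in place of the DataFrame.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

records = list(range(250))  # stand-in for 250 chunks
batch_lengths = [len(batch) for batch in iter_batches(records, batch_size=100)]
print(batch_lengths)
```

With only four chunks and a batch size of 100, everything fits in a single batch, which matches the `1/1` shown by the progress bar above.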


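We can check the vector index using `describe_index_stats` like before. Note that the count recorded below still reads 0 even though the later similarity searches return the uploaded chunks, so the stats had likely not caught up with the upsert yet: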

In [27]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
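When the reported count lags behind an upsert like this, a small polling helper can wait for the vectors to become visible before continuing. This is a generic sketch, not part of the original notebook: `wait_until` is a hypothetical helper, and the iterator of counts is a stand-in for repeated stats calls.

```python
import time

def wait_until(predicate, timeout=30.0, interval=1.0):
    """Call `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Stand-in for a vector count that becomes visible on the third check
counts = iter([0, 0, 4])
became_visible = wait_until(lambda: next(counts) > 0, timeout=5.0, interval=0.01)
print(became_visible)
```

In the notebook, the predicate might look something like `lambda: index.describe_index_stats()['total_vector_count'] > 0`. That usage is an assumption and was not run here.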

#### Retrieval Augmented Generation

We've built a fully-fledged knowledge base. Now it's time to connect that knowledge base to our chatbot. To do that we'll be diving back into LangChain and reusing our template prompt from earlier.

To use LangChain here we need to load the LangChain abstraction for a vector index, called a `vectorstore`. We pass in our vector `index` to initialize the object.

In [28]:
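# alias the import so it does not shadow `pinecone.Pinecone` imported earlier
from langchain.vectorstores import Pinecone as PineconeVectorStore

text_field = "text"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = PineconeVectorStore(
    index, embed_model.embed_query, text_field
)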



Using this `vectorstore` we can already query the index and see if we have any relevant information given our question about Lewis Hamilton and Ferrari.

In [29]:
query = "Did Lewis Hamilton sign with Ferrari?"

vectorstore.similarity_search(query, k=3)

[Document(page_content='12-year\nstint\nwith\nMercedes.\nHe\nis\na\nseven-time\nWorld\nChampion\nand\none\nof\nthe\nmost\nsuccessful\nF1\ndrivers\nin\nhistory.\nOther\nKey\nPersonnel\nLaurent\nMekies:\nSporting\nDirector\nRiccardo\nAdami:\nHead\nof\nVehicle\nOperations\nEnrico\nGualtieri:\nHead\nof\nPower\nUnit\nFerrari\nF1\nTeam\nHistory\nFerrari\nis\nthe\noldest\nand\nmost\nsuccessful\nteam\nin\nFormula\nOne,\nwith\n16\nWorld\nChampionships\nand\nover\n250\nrace\nwins.\nThe\nteam\nhas\na\nrich\nhistory,\nhaving\nbeen\nfounded\nby\nEnzo\nFerrari\nin\n1947.\nOver\nthe\nyears,\nFerrari\nhas\nfielded\nsome\nof\nthe\ngreatest\ndrivers\nin\nF1\nhistory,\nincluding\nAlberto\nAscari,\nJuan\nManuel\nFangio,\nNiki\nLauda,\nand\nMichael\nSchumacher.\nFerrari\nF1\nTeam\nCar\nThe\nFerrari\nF1\nteam\ncar\nfor\n2025\nis\nthe\nSF-24,\ndesigned\nby\nEnrico\nCardile\nand\nhis\nteam.\nThe\ncar\nfeatures\na\n1.6-liter\nturbocharged\nV6\nengine,\nproducing\nover\n1,000\nhorsepower.\nThe\nSF-24\nalso\nfea

We return a lot of text here and it's not that clear what we need or what is relevant. Fortunately, our LLM will be able to parse this information much faster than us. All we need is to connect the output from our `vectorstore` to our `chat` chatbot. To do that we can use the same logic as we used earlier.

In [30]:
def augment_prompt(query: str):
    # get top 3 results from knowledge base
    results = vectorstore.similarity_search(query, k=3)
    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    return augmented_prompt

Using this we produce an augmented prompt:

In [32]:
print(augment_prompt(query))

Using the contexts below, answer the query.

    Contexts:
    12-year
stint
with
Mercedes.
He
is
a
seven-time
World
Champion
and
one
of
the
most
successful
F1
drivers
in
history.
Other
Key
Personnel
Laurent
Mekies:
Sporting
Director
Riccardo
Adami:
Head
of
Vehicle
Operations
Enrico
Gualtieri:
Head
of
Power
Unit
Ferrari
F1
Team
History
Ferrari
is
the
oldest
and
most
successful
team
in
Formula
One,
with
16
World
Championships
and
over
250
race
wins.
The
team
has
a
rich
history,
having
been
founded
by
Enzo
Ferrari
in
1947.
Over
the
years,
Ferrari
has
fielded
some
of
the
greatest
drivers
in
F1
history,
including
Alberto
Ascari,
Juan
Manuel
Fangio,
Niki
Lauda,
and
Michael
Schumacher.
Ferrari
F1
Team
Car
The
Ferrari
F1
team
car
for
2025
is
the
SF-24,
designed
by
Enrico
Cardile
and
his
team.
The
car
features
a
1.6-liter
turbocharged
V6
engine,
producing
over
1,000
horsepower.
The
SF-24
also
features
advanced
aerodynamics,
including
complex
wing
designs
and
a
sophisticated
drag
reduction
syst
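The printed prompt shows two layout artifacts: the template's indentation carries into the prompt, and each context keeps the one-word-per-line structure from PDF extraction. A possible variant that avoids both is sketched below. It is not the notebook's `augment_prompt`: the name `build_augmented_prompt` and the `---` separator between contexts are assumptions made for illustration.

```python
def build_augmented_prompt(query, contexts):
    """Assemble a RAG prompt from a query and a list of context strings."""
    # Collapse the per-word newlines left over from PDF extraction
    cleaned = [" ".join(c.split()) for c in contexts]
    context_block = "\n---\n".join(cleaned)
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{context_block}\n\n"
        f"Query: {query}"
    )

demo_prompt = build_augmented_prompt(
    "Did Lewis Hamilton sign with Ferrari?",
    ["12-year\nstint\nwith\nMercedes.", "He\nis\na\nseven-time\nWorld\nChampion"],
)
print(demo_prompt)
```

With the vector store from this notebook, the contexts could be supplied as `[d.page_content for d in vectorstore.similarity_search(query, k=3)]`.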

There is still a lot of text here, so let's pass it onto our chat model to see how it performs.

In [33]:
# create a new user prompt
prompt = HumanMessage(
    content=augment_prompt(query)
)
# add to messages
messages.append(prompt)

res = chat(messages)

print(res.content)

Yes, Lewis Hamilton signed with Ferrari for the 2025-2027 seasons after a successful 12-year stint with Mercedes.


We can continue with more questions about the Ferrari F1 team. Let's try _without_ RAG first:

In [34]:
prompt = HumanMessage(
    content="Who is the Team boss of Scuderia?"
)

res = chat(messages + [prompt])
print(res.content)

The Team Principal of Scuderia Ferrari is Frédéric Vasseur.


The chatbot is able to answer this question without a new retrieval step thanks to its conversational history stored in `messages`, which still contains the contexts retrieved for the previous augmented prompt.

In [35]:
prompt = HumanMessage(
    content=augment_prompt(
        "What can you say about Frédéric Vasseur ?"
    )
)

res = chat(messages + [prompt])
print(res.content)

Frédéric Vasseur is the Team Principal of the Ferrari F1 team. He is a French engineer and manager who joined Ferrari in 2023, replacing Mattia Binotto. Vasseur has extensive experience in Formula 1, having worked with teams like Renault, Toyota, and Alfa Romeo. As Team Principal, he plays a crucial role in overseeing the team's operations and strategy to help Ferrari achieve success on the track.


In [36]:
prompt = HumanMessage(
    content=augment_prompt(
        "What date should fans pay attention to?"
    )
)

res = chat(messages + [prompt])
print(res.content)

Fans should pay attention to the 2025 F1 season to see if Ferrari can reclaim its position at the top of the podium with their new driver lineup and car.


In [37]:
prompt = HumanMessage(
    content=augment_prompt(
        "What color is the Scuderia Ferrari Horse?"
    )
)

res = chat(messages + [prompt])
print(res.content)

The Scuderia Ferrari horse is typically depicted in black on a yellow background.


In [38]:
prompt = HumanMessage(
    content=augment_prompt(
        "Which country is Charles Leclerc from and do you know which continent is that country from?"
    )
)

res = chat(messages + [prompt])
print(res.content)

Charles Leclerc is from Monaco, and Monaco is a country located in Europe.


In [39]:
prompt = HumanMessage(
    content=augment_prompt(
        "How many world championships and races has Ferrari won'?"
    )
)

res = chat(messages + [prompt])
print(res.content)

Ferrari has won 16 World Championships and over 250 race wins in Formula One.


In [40]:
prompt = HumanMessage(
    content=augment_prompt(
        "Who designed Ferrari Car for 2025?"
    )
)

res = chat(messages + [prompt])
print(res.content)

Enrico Cardile and his team designed the Ferrari F1 team car for 2025, the SF-24.


In [41]:
prompt = HumanMessage(
    content=augment_prompt(
        "Who are the technical directors of Ferrari?"
    )
)

res = chat(messages + [prompt])
print(res.content)

The technical directors of Ferrari are Enrico Cardile and Enrico Gualtieri. Enrico Cardile is responsible for the car's overall design and development, while Enrico Gualtieri oversees the power unit and transmission.


In [42]:
prompt = HumanMessage(
    content=augment_prompt(
        "Summarize our chat in bullets."
    )
)

res = chat(messages + [prompt])
print(res.content)

- Lewis Hamilton has joined Ferrari for the 2025-2027 seasons after a successful 12-year stint with Mercedes.
- Ferrari F1 team aims to win the Constructors' Championship and support its drivers in their quest for the Drivers' Championship in 2025.
- The team is well-positioned with strong team management, talented drivers, and a competitive car for the upcoming F1 season.
- Ferrari F1 team, also known as Scuderia Ferrari, has been a dominant force in Formula One racing since its inception in 1950.
- Frédéric Vasseur is the Team Principal of Ferrari F1 team, and Enrico Cardile and Enrico Gualtieri serve as the Technical Director overseeing car design and power unit, respectively.


**`Observations and Limitations of the Large Language Model (LLM)`**

*Complexity of PDFs*: The LLM's ability to extract information from PDFs is hindered by the presence of special characters and formatting complexities, resulting in incomplete data capture. For instance, the LLM successfully identified the context written by Jordan Sirani but failed to attribute authorship to him.

*Chunking format*: The utilization of chunking format ensures efficient data loading and ingestion, facilitating the processing of large amounts of information.

*Prompt and response appending*: The appending of prompts and responses to messages enables the expansion of content, allowing the chatbot to engage in conversational exchanges.

*Message saving and conversation recall*: The passing forward of messages enables the chatbot to "remember" the conversation, facilitating the drawing of conclusions and analysis.

Delete the index to save resources and not be charged for non-use:

In [43]:
pc.delete_index(index_name)

---