In this notebook, we will use LangChain, OpenAI, and the Pinecone vector database to build an AI chatbot that can learn from external data using Retrieval Augmented Generation (RAG).

Our external data will be the Llama 2 ArXiv paper and other related papers, which will help our chatbot answer questions about the latest and greatest in the world of generative AI.

We'll need to get an [OpenAI API key](https://platform.openai.com/account/api-keys) and a [Pinecone API key](https://app.pinecone.io).


### Prerequisites


Before we start building our chatbot, we need to install some Python libraries. Here's a brief overview of what each library does:

- **langchain**: This is a framework for building applications on top of large language models. We'll use it to chain together different language models and components for our chatbot.
- **openai**: This is the official OpenAI Python client. We'll use it to interact with the OpenAI API and generate responses for our chatbot.
- **datasets**: This library provides a vast array of datasets for machine learning. We'll use it to load our knowledge base for the chatbot.
- **pinecone-client**: This is the official Pinecone Python client. We'll use it to interact with the Pinecone API and store our chatbot's knowledge base in a vector database.

You can install these libraries using pip like so:



In [None]:
# !pip install -qU \
#     langchain==0.0.292 \
#     openai==0.28.0 \
#     datasets==2.10.1 \
#     pinecone-client==2.2.4 \
#     tiktoken==0.5.1

# !pip install streamlit pandas cohere pinecone openai

We will be relying heavily on the LangChain library to bring together the different components needed for our chatbot. To begin, we'll create a simple chatbot without any retrieval augmentation. We do this by initializing a `ChatOpenAI` object. For this we do need an [OpenAI API key](https://platform.openai.com/account/api-keys).

In [None]:
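import os
from langchain.chat_models import ChatOpenAI

# read the key from the environment and fail early if it is missing
# (assigning None back into os.environ would raise a TypeError)
openai_api_key = os.getenv("OPENAI_API_KEY")
if openai_api_key is None:
    raise ValueError("Set the OPENAI_API_KEY environment variable first.")

chat = ChatOpenAI(
    openai_api_key=openai_api_key,
    model='gpt-3.5-turbo'
)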


In [None]:
# confirm the key is loaded without printing the secret itself
print("OPENAI_API_KEY set:", os.getenv('OPENAI_API_KEY') is not None)

In [None]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content = 'You are a helpful assistant.'),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great, thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand string theory.")
]


In [None]:
res = chat(messages)
res

In response we get another AI message object. We can print it more clearly like so:

In [None]:
print(res.content)

Because `res` is just another `AIMessage` object, we can append it to `messages`, add another `HumanMessage`, and generate the next response in the conversation.

In [None]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content= "Why do physicists believe it can produce a 'unified theory'?"
)
# add prompt to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)
print(res.content)

### Dealing with Hallucinations


We have our chatbot, but the knowledge of LLMs can be limited. The reason for this is that LLMs learn all they know during training. An LLM essentially compresses the "world" as seen in the training data into the internal parameters of the model. We call this knowledge the _parametric knowledge_ of the model.

By default, LLMs have no access to the external world.

The result of this is very clear when we ask LLMs about more recent information, such as the new (and very popular) Llama 2 LLM.

In [None]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="What is so special about Llama 2?"
)

# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)


In [None]:
print(res.content)



Our chatbot can no longer help us because it doesn't contain the information we need to answer the question. It was very clear from this answer that the LLM doesn't know the information, but sometimes an LLM may respond as if it does know the answer, and this can be very hard to detect.

OpenAI have since adjusted the behavior for this particular example as we can see below:


In [None]:
# add latest AI response to messages
messages.append(res)

# now create a new user prompt
prompt = HumanMessage(
    content="Can you tell me about the LLMChain in LangChain?"
)
# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [None]:
print(res.content)

There is another way of feeding knowledge into LLMs. It is called _source knowledge_ and it refers to any information fed into the LLM via the prompt. We can try that with the LLMChain question, using a description of this object taken from the LangChain documentation.

In [None]:
llmchain_information = [
    "A LLMChain is the most common type of chain. It consists of a PromptTemplate, a model (either an LLM or a ChatModel), and an optional output parser. This chain takes multiple input variables, uses the PromptTemplate to format them into a prompt. It then passes that to the model. Finally, it uses the OutputParser (if provided) to parse the output of the LLM into a final format.",
    "Chains is an incredibly generic concept which returns to a sequence of modular components (or other chains) combined in a particular way to accomplish a common use case.",
    "LangChain is a framework for developing applications powered by language models. We believe that the most powerful and differentiated applications will not only call out to a language model via an api, but will also: (1) Be data-aware: connect a language model to other sources of data, (2) Be agentic: Allow a language model to interact with its environment. As such, the LangChain framework is designed with the objective in mind to enable those types of applications."
]

source_knowledge = "\n".join(llmchain_information)

We can feed this additional knowledge into our prompt with some instructions telling the LLM how we'd like it to use this information alongside our original query.

In [None]:
query = "Can you tell me about the LLMChain in LangChain?"

augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query} """

Now we feed this into our chatbot as we did before.

In [None]:
# create a new user prompt
prompt = HumanMessage(
    content= augmented_prompt
)

# add to messages
messages.append(prompt)

# send to OpenAI
res = chat(messages)

In [None]:
print(res.content)



The quality of this answer is phenomenal. This is made possible by augmenting our query with external knowledge (source knowledge). There's just one problem: how do we get this information in the first place?
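
The augmentation step can be wrapped in a small helper so the same template can be reused once the contexts come from a retriever instead of a hand-written list. This is only a minimal sketch: `build_augmented_prompt` is a hypothetical name (it is not a LangChain function), and the template simply mirrors the one used above.

```python
from typing import List


def build_augmented_prompt(query: str, contexts: List[str]) -> str:
    """Join context passages and place them ahead of the query.

    Hypothetical helper (not part of LangChain); it mirrors the
    prompt template used earlier in this notebook.
    """
    source_knowledge = "\n".join(contexts)
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{source_knowledge}\n\n"
        f"Query: {query}"
    )


# Illustrative inputs only; any list of strings works as contexts
example_prompt = build_augmented_prompt(
    "Can you tell me about the LLMChain in LangChain?",
    ["first context passage", "second context passage"],
)
print(example_prompt)
```

The returned string can be passed as the `content` of a `HumanMessage`, exactly as `augmented_prompt` was.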

We learned in the previous chapters about Pinecone and vector databases. Well, they can help us here too. But first, we'll need a dataset.
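
As a rough picture of what a vector database does for us: each chunk of text is stored as an embedding vector, the query is embedded in the same way, and the chunks whose vectors are most similar to the query vector are returned. The sketch below uses made-up 3-dimensional vectors and a brute-force cosine-similarity scan. Real embeddings have far more dimensions, and production vector databases typically use approximate nearest-neighbour indexes rather than a linear scan. All names, texts, and values here are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: List[float], store: Dict[str, List[float]], k: int = 2
) -> List[Tuple[float, str]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "vector store": placeholder texts mapped to invented embeddings
toy_store = {
    "chunk about Llama 2 pretraining": [0.9, 0.1, 0.0],
    "chunk about string theory": [0.0, 0.2, 0.9],
    "chunk about Llama 2 fine-tuning": [0.8, 0.3, 0.1],
}

# Invented embedding for a question about Llama 2
query_vector = [1.0, 0.2, 0.0]

results = top_k(query_vector, toy_store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two Llama 2 placeholders rank above the string theory one because their vectors point in nearly the same direction as the query vector.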


## Import dataset

In this task, we will import our data. We will use the Hugging Face Datasets library to load the "jamescalam/llama-2-arxiv-papers-chunked" dataset. This dataset contains a collection of ArXiv papers, already split into chunks, which will serve as the external knowledge base for our chatbot.

In [None]:
from datasets import load_dataset
dataset = load_dataset(
    "jamescalam/llama-2-arxiv-papers-chunked",
    split = "train"
)
dataset

In [None]:
dataset[0]

### Dataset Overview

The dataset we are using is sourced from the Llama 2 ArXiv paper and related papers. ArXiv is a repository of electronic preprints that are approved for posting after moderation but are not peer reviewed. Each entry in the dataset represents a "chunk" of text from these papers.
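
To make the idea of a chunk concrete, here is a minimal sketch of splitting a long text into overlapping pieces. It splits on characters for simplicity. The function name, the chunk size, and the overlap are assumptions chosen for illustration, and this is not necessarily how the dataset above was produced.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample_text = "0123456789" * 10  # 100 placeholder characters
chunks = chunk_text(sample_text, chunk_size=40, overlap=10)
for i, chunk in enumerate(chunks):
    print(i, len(chunk), chunk)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.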

Because most Large Language Models (LLMs) only contain knowledge of the world as it was during training, they cannot answer our questions about Llama 2 — at least not without this data.

### Task 4: Building the Knowledge Base

We now have a dataset that can serve as our chatbot knowledge base. Our next task is to transform that dataset into the knowledge base that our chatbot can use. To do this we must use an embedding model and vector database.

We begin by initializing our connection to Pinecone, which requires a [free API key](https://app.pinecone.io).

In [None]:
# from pinecone import Pinecone, PodSpec

# pc = Pinecone(api_key= os.getenv("PINECONE_API_KEY"))

# pc.create_index(
#   name="pod-index",
#   dimension=1536,
#   metric="cosine",
#   spec=PodSpec(
#     environment="gcp-starter"
#   )
# )


In [None]:
import pinecone

# get API key from app.pinecone.io and environment from console
pinecone.init(
    api_key = os.getenv('PINECONE_API_KEY'),
    environment= os.getenv('PINECONE_ENVIRONMENT')
) 

Then we initialize the index. We will be using OpenAI's `text-embedding-ada-002` model for creating the embeddings, so we set the `dimension` to `1536`.
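
The index dimension has to match the length of every vector we later store in it. A small check like the one below can catch a mismatch before anything is sent to Pinecone. It is a sketch only: `validate_dimensions` is a hypothetical helper, and the placeholder vectors stand in for real embeddings.

```python
from typing import List, Sequence

# Output size of text-embedding-ada-002, matching the index `dimension`
EXPECTED_DIMENSION = 1536


def validate_dimensions(
    vectors: Sequence[Sequence[float]], expected: int = EXPECTED_DIMENSION
) -> List[int]:
    """Return the positions of vectors whose length differs from `expected`."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected]


# Placeholder vectors: the middle one has the wrong length on purpose
placeholder_vectors = [[0.0] * 1536, [0.0] * 768, [0.0] * 1536]
bad_positions = validate_dimensions(placeholder_vectors)
print(bad_positions)
```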

In [None]:
import time 

index_name = "llama-2-rag"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,
        metric = 'cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

index = pinecone.Index(index_name)
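
The readiness loop above waits indefinitely if the index never becomes ready. One possible refinement is to bound the wait with a timeout. The sketch below is generic: `wait_until` is a hypothetical helper, and a stand-in check replaces the real `pinecone.describe_index(index_name).status['ready']` call so the example runs without a Pinecone account.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0
) -> bool:
    """Poll `is_ready` until it returns True or `timeout` seconds pass.

    Returns True if the check succeeded, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in readiness check that succeeds on the third call
calls = {"count": 0}


def fake_ready() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


became_ready = wait_until(fake_ready, timeout=5.0, interval=0.01)
timed_out_result = wait_until(lambda: False, timeout=0.05, interval=0.01)
print(became_ready, timed_out_result)
```

With the real client, the check passed in would wrap the same `describe_index` status lookup used in the cell above.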


In [None]:
import os
import pinecone

# Reconnect to Pinecone (for example after a kernel restart)
api_key = os.getenv('PINECONE_API_KEY')
environment = os.getenv('PINECONE_ENVIRONMENT')

# Initialize the client with the same credentials as before
pinecone.init(api_key=api_key, environment=environment)

# Connect to the index created above rather than a placeholder name
index = pinecone.Index("llama-2-rag")

    
