#### [LangChain Handbook](https://pinecone.io/learn/langchain)

# Retrieval Augmentation

**L**arge **L**anguage **M**odels (LLMs) have a data freshness problem. The most powerful LLMs in the world, like GPT-4, have no idea about recent world events.

LLMs are frozen in time. What they know is a static snapshot of the world as it appeared in their training data.

A solution to this problem is *retrieval augmentation*. The idea behind this is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that.
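Before bringing in real libraries, the idea can be sketched in a few lines of plain Python. This is only a toy illustration: the knowledge base, the word-count "embedding", and the names `KNOWLEDGE_BASE`, `bag_of_words`, `retrieve`, and `build_prompt` are all assumptions made for the sketch, standing in for the embedding model and vector database used in the rest of the notebook.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base: three snippets standing in for real documents.
KNOWLEDGE_BASE = [
    "MapReduce is a programming model for processing large data sets.",
    "Pinecone is a managed vector database.",
    "LangChain is a framework for building applications with LLMs.",
]


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=1):
    """Return the k snippets most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: cosine(q, bag_of_words(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, k=1):
    """Place the retrieved snippets in front of the question."""
    context = "\n".join(retrieve(query, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


print(build_prompt("What is MapReduce?"))
```

The prompt returned by `build_prompt` is what would be sent to the LLM, so the model answers from the retrieved text rather than from its training data alone.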

To begin, we install the prerequisite libraries used in this notebook. Installing everything at once produces a conflict with the Hugging Face `datasets` library, so we install in a specific order:

In [None]:
!pip install -qU \
    datasets==2.12.0 \
    apache_beam \
    mwparserfromhell
!pip install langchain --upgrade
!pip install pdfminer.six pypdf


## Building the Knowledge Base

In [None]:
from langchain.document_loaders import PyPDFDirectoryLoader

# load every PDF in the folder; each page becomes one Document
pdf_folder_path = './pdf'
loader = PyPDFDirectoryLoader(pdf_folder_path)
data = loader.load()

In [None]:
data[1]

Document(page_content='for a rewrite of our production indexing system. Sec-\ntion 7 discusses related and future work.\n2 Programming Model\nThe computation takes a set of input key/value pairs, and\nproduces a set of output key/value pairs. The user of\nthe MapReduce library expresses the computation as two\nfunctions: Map andReduce.\nMap, written by the user, takes an input pair and pro-\nduces a set of intermediate key/value pairs. The MapRe-\nduce library groups together all intermediate values asso-\nciated with the same intermediate key Iand passes them\nto the Reduce function.\nTheReduce function, also written by the user, accepts\nan intermediate key Iand a set of values for that key. It\nmerges together these values to form a possibly smaller\nset of values. Typically just zero or one output value is\nproduced per Reduce invocation. The intermediate val-\nues are supplied to the user\'s reduce function via an iter-\nator. This allows us to handle lists of values that are too\

Now we install the remaining libraries:

In [None]:
!pip install -qU \
  langchain==0.0.162 \
  openai==0.27.7 \
  tiktoken==0.4.0 \
  "pinecone-client[grpc]"==2.2.2


---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

Every record contains *a lot* of text. Our first task is therefore to find a good preprocessing method for splitting these documents into more concise chunks, which will later be embedded and stored in our Pinecone vector database.

For this we use LangChain's `RecursiveCharacterTextSplitter` to split our text into chunks of a specified max length.

In [None]:
import tiktoken

tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [None]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

26

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=0,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)

In [None]:
chunks = text_splitter.split_documents(data)
chunks

In [None]:
tiktoken_len(chunks[0].page_content), tiktoken_len(chunks[1].page_content), tiktoken_len(chunks[2].page_content)

(376, 384, 70)

Using the `text_splitter` we get much better sized chunks of text. We'll use this functionality during the indexing process later. Now let's take a look at embedding.
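To make the splitting strategy more concrete, here is a simplified sketch of recursive splitting in plain Python. It is not LangChain's implementation: the function `recursive_split` is hypothetical, it measures length in characters by default (rather than tokens via `tiktoken_len`), and it does not support overlap, which matches the `chunk_overlap=0` setting used above.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", ""), length=len):
    """Split text into pieces of at most max_len, trying coarse separators first.

    Simplified illustration only; assumes max_len >= 1.
    """
    if length(text) <= max_len:
        return [text] if text else []

    sep, rest = separators[0], separators[1:]
    # The empty separator is the last resort: split into single characters.
    parts = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if length(candidate) <= max_len:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if length(part) <= max_len or not rest:
            current = part
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, max_len, rest, length))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee"
print(recursive_split(sample, max_len=7))
```

Passing a token-counting function as `length` would make the limit apply to tokens instead of characters, which is the role `tiktoken_len` plays in the notebook's splitter.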

## Creating Embeddings

Building embeddings using LangChain's OpenAI embedding support is fairly straightforward. We first need to add our [OpenAI API key](https://platform.openai.com) by running the next cell:

In [None]:
import os

# get openai api key from platform.openai.com
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or 'OPENAI_API_KEY'

*(Note that OpenAI is a paid service and so running the remainder of this notebook may incur some small cost)*

After initializing the API key we can initialize our `text-embedding-ada-002` embedding model like so:

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

Now we embed some text like so:

In [None]:
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed.embed_documents(texts)
len(res), len(res[0])

(2, 1536)

From this we get *two* (aligning to our two chunks of text) 1536-dimensional embeddings.
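Embeddings are useful because similar texts produce vectors that point in similar directions. The index created in the next section uses the `cosine` metric, which can be sketched in plain Python as follows. The function name `cosine_similarity` and the tiny example vectors are illustrative assumptions; real embeddings from `text-embedding-ada-002` have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors standing in for real embeddings.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

With real embeddings, the same function could be applied to `res[0]` and `res[1]` to compare the two example texts.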

Now we move on to initializing our Pinecone vector database.

## Vector Database

To create our vector database we first need a [free API key from Pinecone](https://app.pinecone.io). Then we initialize like so:

In [None]:
index_name = 'YOUR_PINECONE_INDEX_NAME'
# the index dimension must be 1536 to match text-embedding-ada-002

In [None]:
import pinecone

# find API key in console at app.pinecone.io
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or 'PINECONE_API_KEY'
# find ENV (cloud region) next to API key in console
PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

# pass the variables defined above, not string literals
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=len(res[0])  # 1536 dim of text-embedding-ada-002
    )

Then we connect to the new index:

In [None]:
index = pinecone.GRPCIndex(index_name)

index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.00039,
 'namespaces': {'': {'vector_count': 39}},
 'total_vector_count': 39}

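For a newly created Pinecone index we should see a `total_vector_count` of `0`, as we haven't added any vectors yet. The output above shows `39` vectors, which indicates it was captured from an index that had already been populated.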

## Indexing

We can perform the indexing task using the LangChain vector store object, but for now it is much faster to do it directly with the Pinecone Python client. We will do this in batches of `100` or more.

In [None]:
from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100

texts = []
metadatas = []

for i, document in enumerate(tqdm(data)):
    # first get metadata fields for this record
    metadata = {
        'source': document.metadata['source']
    }
    print(document.metadata['source'])
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(document.page_content)
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []

if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))

  0%|          | 0/13 [00:00<?, ?it/s]

pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf
pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf


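We've now indexed everything. We can check the number of vectors in our index as shown below. The count should now be greater than `0`, but index stats can lag briefly behind the upserts, so a value of `0` read immediately after indexing may simply be stale: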

In [None]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 0}},
 'total_vector_count': 0}
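The indexing loop above follows a common pattern: accumulate items, flush them once a limit is reached, then flush whatever remains at the end. A minimal sketch of that pattern as a reusable generator is shown below. The helper name `batched` is hypothetical and not part of the Pinecone or LangChain APIs.

```python
def batched(items, batch_size):
    """Yield successive lists containing at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


for group in batched(range(5), batch_size=2):
    print(group)
```

Unlike this helper, the notebook's loop checks the limit only after all chunks of a page have been added, so its batches can exceed `batch_limit`. That is why the text speaks of batches of `100` or more.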

## Creating a Vector Store and Querying

Now that we've built our index we can switch back over to LangChain. We start by initializing a vector store using the same index we just built. We do that like so:

In [None]:
from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

In [None]:
query = "What is MapReduce?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content="MapReduce: Simpli\x02ed Data Processing on Large Clusters\nJeffrey Dean and Sanjay Ghemawat\njeff@google.com, sanjay@google.com\nGoogle, Inc.\nAbstract\nMapReduce is a programming model and an associ-\nated implementation for processing and generating large\ndata sets. Users specify a map function that processes a\nkey/value pair to generate a set of intermediate key/value\npairs, and a reduce function that merges all intermediate\nvalues associated with the same intermediate key. Many\nreal world tasks are expressible in this model, as shown\nin the paper.\nPrograms written in this functional style are automati-\ncally parallelized and executed on a large cluster of com-\nmodity machines. The run-time system takes care of the\ndetails of partitioning the input data, scheduling the pro-\ngram's execution across a set of machines, handling ma-\nchine failures, and managing the required inter-machine\ncommunication. This allows programmers without any\nexperience 

All of these are good, relevant results. But what can we do with this? There are many possible tasks. One of the most interesting (and well supported by LangChain) is called _"Generative Question-Answering"_ or GQA.

## Generative Question-Answering

In GQA we take the query as a question to be answered by an LLM, but the LLM must base its answer on the information returned from the `vectorstore`.

To do this we initialize a `RetrievalQA` object like so:

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo', # model name
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

In [None]:
query = "What is MapReduce? Please explain in detail and provide the references."
qa.run(query)

'MapReduce is a programming model and implementation developed by Jeffrey Dean and Sanjay Ghemawat at Google. It is designed for processing and generating large data sets. The concept behind MapReduce is to simplify the processing of big data by allowing users to specify a map function and a reduce function.\n\nIn the MapReduce model, users define a map function that takes a key/value pair as input and generates a set of intermediate key/value pairs. The map function processes the input data and performs any necessary transformations or computations. The intermediate key/value pairs are then passed to the reduce function.\n\nThe reduce function merges all the intermediate values associated with the same intermediate key. It performs any necessary aggregations or calculations on the intermediate values to produce the final output.\n\nThe MapReduce model is highly scalable and can be executed on a large cluster of commodity machines. The runtime system takes care of partitioning the inpu
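With `chain_type="stuff"`, all retrieved documents are placed ("stuffed") into a single prompt together with the question. The sketch below shows the general shape of such a prompt. The function `stuff_prompt` and the instruction wording are assumptions for illustration, not LangChain's actual template.

```python
def stuff_prompt(question, documents):
    """Combine every retrieved document and the question into one prompt."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks standing in for the vector store results.
docs = [
    "MapReduce is a programming model for processing large data sets.",
    "Users specify a map function and a reduce function.",
]
print(stuff_prompt("What is MapReduce?", docs))
```

Because every document goes into one prompt, this approach only works while the combined text fits within the model's context window, which is one reason for chunking the source documents earlier.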

We can also include the sources of information that the LLM is using to answer our question. We can do this using a slightly different version of `RetrievalQA` called `RetrievalQAWithSourcesChain`:

In [None]:
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

In [None]:
query = "What is MapReduce?"
qa_with_sources(query)

{'question': 'What is MapReduce?',
 'answer': 'MapReduce is a programming model and implementation for processing and generating large data sets. It allows users to specify a map function that processes key/value pairs and generates intermediate key/value pairs, as well as a reduce function that merges intermediate values associated with the same key. MapReduce programs are automatically parallelized and executed on a large cluster of machines. It has been used for various tasks at Google, including large-scale machine learning, clustering, data extraction, and graph computations. The implementation of MapReduce at Google is highly scalable and has processed terabytes of data on thousands of machines. (',
 'sources': 'pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf)'}

In [None]:
query = "What is the architecture of MapReduce?"
qa_with_sources(query)

{'question': 'What is the architecture of MapReduce?',
 'answer': 'The architecture of MapReduce consists of a programming model and an associated implementation for processing and generating large data sets. It involves a map function that processes a key/value pair to generate intermediate key/value pairs, and a reduce function that merges intermediate values associated with the same key. The implementation runs on a large cluster of commodity machines and is highly scalable. It automatically parallelizes programs written in the functional style and handles partitioning of input data, scheduling of program execution, machine failures, and inter-machine communication. The architecture is described in detail in the paper "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat.\n',
 'sources': 'pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf'}

In [None]:
query = "Can you explain what experiment they produced in this paper?"
qa_with_sources(query)

{'question': 'Can you explain what experiment they produced in this paper?',
 'answer': 'The paper discusses the implementation and performance measurements of the MapReduce programming model in a cluster-based computing environment. It also explores the use of MapReduce within Google for various tasks, including large-scale machine learning problems, clustering problems, data extraction, and large-scale graph computations. The paper provides examples of the effect of backup tasks and machine failures on the execution of the sort program using MapReduce. Additionally, it presents statistics on the computational resources used by MapReduce jobs at Google, including the number of jobs, average completion time, machine days used, input data read, intermediate data produced, output data written, and average number of worker machines per job. \n',
 'sources': 'pdf/MapReduce- Simplified Data Processing on Large Clusters.pdf'}

In [None]:
query = "Can you explain what does map and reduce phases do in Mapreduce?"
qa_with_sources(query)

{'question': 'Can you explain what does map and reduce phases do in Mapreduce?',
 'answer': "In the MapReduce framework, the map phase involves processing a key/value pair to generate a set of intermediate key/value pairs. The reduce phase then merges all intermediate values associated with the same intermediate key. This allows for parallel processing of large data sets on a cluster of machines. The map and reduce functions are specified by the user and are used to express the computation. The map function emits intermediate key/value pairs, while the reduce function merges the values for each key. The MapReduce framework takes care of partitioning the input data, scheduling the program's execution, handling machine failures, and managing inter-machine communication. This allows programmers without experience in parallel and distributed systems to utilize the resources of a large distributed system. The implementation of MapReduce at Google runs on a large cluster of commodity machine

Now we get an answer to the question *and* the source of the information the LLM used to produce it.
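The results above are dictionaries with `question`, `answer`, and `sources` keys, where `sources` is a single string. A small helper can turn that string into a clean list. This is a sketch under the assumption that multiple sources are separated by commas. The text is generated by the model, so the format is not guaranteed, and the name `split_sources` is hypothetical.

```python
def split_sources(result):
    """Return the unique sources from a result dict, in order of appearance.

    Assumes result['sources'] is a comma-separated string.
    """
    seen = []
    for source in result.get("sources", "").split(","):
        source = source.strip()
        if source and source not in seen:
            seen.append(source)
    return seen


# Hypothetical result in the same shape as the outputs shown above.
example = {
    "question": "What is MapReduce?",
    "answer": "MapReduce is a programming model.",
    "sources": "pdf/a.pdf, pdf/b.pdf, pdf/a.pdf",
}
print(split_sources(example))
```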

---