[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/generation/chatgpt/plugins/langchain-docs-plugin.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/generation/chatgpt/plugins/langchain-docs-plugin.ipynb)

# Building a LangChain Docs Plugin for ChatGPT

In this walkthrough we set up a ChatGPT plugin for querying the LangChain docs.

Before running this notebook you should have already initialized the retrieval API and have it running locally or elsewhere (like on DigitalOcean). More detailed instructions for the setup and deployment can be [found in the video here](https://youtu.be/hpePPqKxNq8).

We will summarize the instructions (specific to the Pinecone datastore) before moving on to the walkthrough.

## App Quickstart

1. Install Python 3.10 if not already installed.

2. Clone the `chatgpt-retrieval-plugin` repository:

```
git clone git@github.com:openai/chatgpt-retrieval-plugin.git
```

_**Note**: To see how we set up the *hosted app* on DigitalOcean [refer to this video](https://youtu.be/hpePPqKxNq8), otherwise continue to set up the app locally by following the remaining steps._

3. Navigate to the app directory:

```
cd /path/to/chatgpt-retrieval-plugin
```

4. Install `poetry`:

```
pip install poetry
```

5. Create a new virtual environment:

```
poetry env use python3.10
```

6. Install the `retrieval-app` dependencies:

```
poetry install
```

7. Set app environment variables:

* `BEARER_TOKEN`: Secret token used by the app to authorize incoming requests. We will later include this in the request `headers`. The token can be generated however you prefer, such as using [jwt.io](https://jwt.io/).

* `OPENAI_API_KEY`: The OpenAI API key used for generating embeddings with the `text-embedding-ada-002` model. [Get an API key here](https://platform.openai.com/account/api-keys)!

8. Set Pinecone-specific environment variables:

* `DATASTORE`: set to `pinecone`.

* `PINECONE_API_KEY`: Set to your Pinecone API key. This requires a free Pinecone account and can be [found in the Pinecone console](https://app.pinecone.io/).

* `PINECONE_ENVIRONMENT`: Set to your Pinecone environment, which looks like `us-east1-gcp` or `us-west1-aws` and can be found next to your API key in the [Pinecone console](https://app.pinecone.io/).

* `PINECONE_INDEX`: Set this to your chosen index name. The name is up to you; we just recommend something descriptive like `"openai-retrieval-app"`. *Note that index names are restricted to alphanumeric characters and `"-"`, and can contain a maximum of 45 characters.*

9. Run the app with:

```
poetry run start
```

If running the app locally you should see something like:

```
INFO:     Uvicorn running on http://0.0.0.0:8000
INFO:     Application startup complete.
```

In that case, the app has automatically connected to our index (specified by `PINECONE_INDEX`). If no index with that name existed beforehand, the app creates one for us.
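If the app fails to start instead, a missing or malformed setting from steps 7 and 8 is a likely cause. The following is a minimal sketch of a pre-flight check and is not part of the retrieval app itself. The helper names `missing_env_vars` and `is_valid_index_name` are hypothetical, and the index-name rule encodes only the restriction stated above (alphanumeric characters and `-`, at most 45 characters).

```python
import os
import re

# Variable names taken from steps 7 and 8 of the quickstart
REQUIRED_VARS = [
    "BEARER_TOKEN",
    "OPENAI_API_KEY",
    "DATASTORE",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX",
]

# Assumption: encodes only the rule quoted in this notebook
INDEX_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]{1,45}")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def is_valid_index_name(name):
    """Check the index name against the length and character restriction."""
    return INDEX_NAME_PATTERN.fullmatch(name) is not None


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))

index_name = os.environ.get("PINECONE_INDEX", "")
if index_name and not is_valid_index_name(index_name):
    print(f"Index name {index_name!r} does not match the documented restriction")
```

Running this in the same shell session as `poetry run start` checks the environment the app will actually see.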

Now we're ready to move on to populating our index with some data.

## Required Libraries

There are a few Python libraries we must `pip install` for this notebook to run:

In [33]:
!pip install -qU langchain tiktoken tqdm

## Preparing Data

In this example, we will download the LangChain docs from [langchain.readthedocs.io/](https://langchain.readthedocs.io/latest/en/). We get all `.html` files located on the site like so:

In [41]:
!wget -r --no-parent -A.html -P rtdocs https://python.langchain.com/en/latest/

--2023-04-02 17:52:00--  https://python.langchain.com/en/latest/
Resolving python.langchain.com (python.langchain.com)... 104.17.33.82, 104.17.32.82, 2606:4700::6811:2052, ...
Connecting to python.langchain.com (python.langchain.com)|104.17.33.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘rtdocs/python.langchain.com/en/latest/index.html’

2023-04-02 17:52:00 (39.6 MB/s) - ‘rtdocs/python.langchain.com/en/latest/index.html’ saved [75925]

Loading robots.txt; please ignore errors.
--2023-04-02 17:52:00--  https://python.langchain.com/robots.txt
Reusing existing connection to python.langchain.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 95 [text/plain]
Saving to: ‘rtdocs/python.langchain.com/robots.txt.tmp’


2023-04-02 17:52:00 (23.7 MB/s) - ‘rt

This downloads all HTML into the `rtdocs` directory. Now we can use LangChain itself to process these docs. We do this using the `ReadTheDocsLoader` like so:

In [62]:
from langchain.document_loaders import ReadTheDocsLoader

loader = ReadTheDocsLoader('rtdocs')
docs = loader.load()[1:]
len(docs)

439

This leaves us with `439` processed doc pages. Let's take a look at the format each one contains:

In [63]:
docs[9]

Document(page_content='.rst\n.pdf\nLangChain Gallery\n Contents \nOpen Source\nMisc. Colab Notebooks\nProprietary\nLangChain Gallery#\nLots of people have built some pretty awesome stuff with LangChain.\nThis is a collection of our favorites.\nIf you see any other demos that you think we should highlight, be sure to let us know!\nOpen Source#\nHowDoI.ai\nThis is an experiment in building a large-language-model-backed chatbot. It can hold a conversation, remember previous comments/questions,\nand answer all types of queries (history, web search, movie data, weather, news, and more).\nYouTube Transcription QA with Sources\nAn end-to-end example of doing question answering on YouTube transcripts, returning the timestamps as sources to legitimize the answer.\nQA Slack Bot\nThis application is a Slack Bot that uses Langchain and OpenAI’s GPT3 language model to provide domain specific answers. You provide the documents.\nThoughtSource\nA central, open resource and community around data and t

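The plaintext page content is available through each document's `page_content` attribute. To clean it, we first install `fuzzywuzzy` and `python-Levenshtein` for fuzzy string matching: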

In [65]:
!pip install fuzzywuzzy python-Levenshtein

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Collecting python-Levenshtein
  Downloading python_Levenshtein-0.20.9-py3-none-any.whl (9.4 kB)
Collecting Levenshtein==0.20.9
  Downloading Levenshtein-0.20.9-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (175 kB)
Collecting rapidfuzz<3.0.0,>=2.3.0
  Downloading rapidfuzz-2.15.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
Installing collected packages: fuzzywuzzy, rapidfuzz, Levenshtein, python-Levenshtein
Successfully installed Levenshtein-0.20.9 fuzzywuzzy-0.18.0 python-Levenshtein-0.20.9 rapidfuzz-2.15.0


In [66]:
import re
from bs4 import BeautifulSoup
import pygments
from pygments.lexers import get_lexer_by_name
from pygments.token import Token
from fuzzywuzzy import fuzz

# Define the endings to be removed
endings = ["\nprevious\n", "\n By Harrison Chase\n", "\nBy Harrison Chase\n"]

# Regex pattern to identify code blocks
code_block_pattern = re.compile(r"(```[\s\S]*?```)", re.MULTILINE)

# Function to extract and tokenize code snippets using Pygments
def tokenize_code_snippet(snippet):
    snippet_content = snippet.strip("```")
    lexer = get_lexer_by_name('python')
    tokens = list(pygments.lex(snippet_content, lexer))

    return tokens

# Function to remove near-duplicates from a list of strings based on a similarity threshold
def remove_near_duplicates(strings, threshold=90):
    unique_strings = []

    for string in strings:
        if not any(fuzz.ratio(string, unique_string) >= threshold for unique_string in unique_strings):
            unique_strings.append(string)

    return unique_strings

# Loop through each Document object in the list
for doc in docs:
    # Get the page_content from the current Document object
    page_content = doc.page_content

    # Extract and store code snippets
    code_snippets = code_block_pattern.findall(page_content)
    page_content = code_block_pattern.sub("CODE_SNIPPET_PLACEHOLDER", page_content)

    # Tokenize and store code snippets
    tokenized_code_snippets = [tokenize_code_snippet(snippet) for snippet in code_snippets]

    # Loop through each ending to be removed
    for ending in endings:
        # Find the index of the last occurrence of the current ending
        last_index = page_content.rfind(ending)

        # If the ending was found, remove the text from last_index onwards
        if last_index != -1:
            page_content = page_content[:last_index]

    # Clean the text content
    soup = BeautifulSoup(page_content, "html.parser")
    cleaned_text = soup.get_text(separator=" ", strip=True)

    # Split cleaned text into sentences
    sentences = cleaned_text.split(". ")

    # Remove near-duplicates from sentences
    unique_sentences = remove_near_duplicates(sentences)

    # Rejoin sentences into a single cleaned text
    cleaned_text_deduplicated = ". ".join(unique_sentences)

    # Reinsert code snippets into the cleaned text content
    cleaned_text_with_code = cleaned_text_deduplicated
    for snippet, tokenized_snippet in zip(code_snippets, tokenized_code_snippets):
        tokenized_snippet_str = "".join([t[1] for t in tokenized_snippet if t[0] in Token.Text])
        cleaned_text_with_code = cleaned_text_with_code.replace("CODE_SNIPPET_PLACEHOLDER", tokenized_snippet_str, 1)

    # Update the page_content of the current Document object with the cleaned content
    doc.page_content = cleaned_text_with_code
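The near-duplicate filter in the cell above does not strictly depend on `fuzzywuzzy`. As a rough sketch, the same idea can be expressed with the standard library's `difflib`. Note the assumptions: `similarity` and `remove_near_duplicates_stdlib` are hypothetical names, and `SequenceMatcher` scores will not always equal `fuzz.ratio` scores, so the threshold may need adjusting.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score between two strings."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def remove_near_duplicates_stdlib(strings, threshold=90):
    """Keep a string only if it is not too similar to one already kept."""
    unique = []
    for s in strings:
        if not any(similarity(s, u) >= threshold for u in unique):
            unique.append(s)
    return unique


sentences = ["hello world", "hello world!", "something else"]
print(remove_near_duplicates_stdlib(sentences))
```

Because each candidate is compared against every string kept so far, the cost grows quadratically with the number of sentences on a page, which is fine for doc pages but worth remembering for larger inputs.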

In [50]:
# Define the endings to be removed
endings = ["\nprevious\n", "\n By Harrison Chase\n", "\nBy Harrison Chase\n"]

# Loop through each Document object in the list
for doc in docs:
    # Get the page_content from the current Document object
    page_content = doc.page_content

    # Loop through each ending to be removed
    for ending in endings:
        # Find the index of the last occurrence of the current ending
        last_index = page_content.rfind(ending)

        # If the ending was found, remove the text from last_index onwards
        if last_index != -1:
            page_content = page_content[:last_index]

    # Update the page_content of the current Document object with the cleaned content
    doc.page_content = page_content
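To see what the ending removal does, it can help to run the same logic on a toy string. This is a small sketch that wraps the loop above in a function; `strip_trailing_boilerplate` is a hypothetical name and the sample text is made up for illustration.

```python
def strip_trailing_boilerplate(text, endings):
    """Cut the text at the last occurrence of each ending, if present."""
    for ending in endings:
        last_index = text.rfind(ending)
        if last_index != -1:
            text = text[:last_index]
    return text


endings = ["\nprevious\n", "\n By Harrison Chase\n", "\nBy Harrison Chase\n"]

# Made-up page: real content followed by navigation links and a footer
sample = (
    "LLMChain docs\nSome useful text"
    "\nprevious\nAgents\nnext\nMemory\n By Harrison Chase\n"
)
cleaned = strip_trailing_boilerplate(sample, endings)
print(repr(cleaned))
```

Everything from the `previous` navigation link onwards is dropped, while a page that contains none of the endings is returned unchanged.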

In [56]:
docs[50]

Document(page_content='Source code for langchain.agents.conversational_chat.base\n"""An agent designed to hold a conversation in addition to using tools."""\nfrom __future__ import annotations\nimport json\nfrom typing import Any, List, Optional, Sequence, Tuple\nfrom langchain.agents.agent import Agent\nfrom langchain.agents.conversational_chat.prompt import (\n    FORMAT_INSTRUCTIONS,\n    PREFIX,\n    SUFFIX,\n    TEMPLATE_TOOL_RESPONSE,\n)\nfrom langchain.callbacks.base import BaseCallbackManager\nfrom langchain.chains import LLMChain\nfrom langchain.prompts.base import BasePromptTemplate\nfrom langchain.prompts.chat import (\n    ChatPromptTemplate,\n    HumanMessagePromptTemplate,\n    MessagesPlaceholder,\n    SystemMessagePromptTemplate,\n)\nfrom langchain.schema import (\n    AgentAction,\n    AIMessage,\n    BaseLanguageModel,\n    BaseMessage,\n    BaseOutputParser,\n    HumanMessage,\n)\nfrom langchain.tools.base import BaseTool\nclass AgentOutputParser(BaseOutputParser):\n

We can also find the source of each document:

In [6]:
docs[5].metadata['source'].replace('rtdocs/', 'https://')

'https://python.langchain.com/en/latest/model_laboratory.html'

Looks good. We also need to consider the length of each page with respect to the number of tokens that will reasonably fit within the window of a ChatGPT model. We will use `gpt-3.5-turbo` as the assumed model.

### Chunking the Text

At the time of writing, `gpt-3.5-turbo` supports a context window of 4096 tokens. That means the input tokens plus the generated (completion) output tokens cannot total more than 4096 without hitting an error.

So we must stay below this limit. We assume a very safe margin of ~2000 tokens for the input prompt into `gpt-3.5-turbo`, leaving ~2000 tokens for conversation history and completion.

With this ~2000 token limit we may want to include *five* snippets of relevant information, meaning each snippet can be no more than **400** tokens long.
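The budget arithmetic can be written out explicitly. This is only a worked example of the numbers in the text; the variable names are chosen here for illustration.

```python
context_window = 4096   # gpt-3.5-turbo limit quoted above
prompt_budget = 2000    # tokens reserved for retrieved snippets
n_snippets = 5          # number of snippets we want to fit

# Each snippet gets an equal share of the prompt budget
max_tokens_per_snippet = prompt_budget // n_snippets

# What is left for conversation history and the completion
remaining_tokens = context_window - prompt_budget

print(max_tokens_per_snippet, remaining_tokens)
```

Changing `n_snippets` or `prompt_budget` gives the corresponding `chunk_size` to use when splitting the text.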

To create these snippets we use the `RecursiveCharacterTextSplitter` from LangChain. To measure the length of snippets we also need a *length function*. This is a function that consumes text, counts the number of tokens within the text (after tokenization using the `gpt-3.5-turbo` tokenizer), and returns that number. We define it like so:

In [68]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

Note that for the tokenizer we defined the encoder as `"cl100k_base"`. This is a specific tiktoken encoder which is used by `gpt-3.5-turbo`. Other encoders exist and at the time of writing are summarized as:

| Encoder | Models |
| --- | --- |
| `cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002` |
| `p50k_base` | `text-davinci-003`, `code-davinci-002`, `code-cushman-002` |
| `r50k_base` | `text-davinci-001`, `davinci`, `text-similarity-davinci-001` |
| `gpt2` | `gpt2` |

You can find these details in the [Tiktoken `model.py` script](https://github.com/openai/tiktoken/blob/main/tiktoken/model.py), or using `tiktoken.encoding_for_model`:

In [69]:
tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

With the length function defined we can initialize our `RecursiveCharacterTextSplitter` object like so:

In [70]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,  # number of tokens overlap between chunks
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)
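To give some intuition for what the `separators` list does, here is a simplified sketch of recursive splitting. It is not LangChain's implementation: it ignores `chunk_overlap`, it measures length in characters by default rather than tokens, and `recursive_split` is a hypothetical name. The idea is to try the coarsest separator first and fall back to finer ones only for pieces that are still too long.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", ""), length_fn=len):
    """Split text into pieces of at most max_len, preferring coarse separators."""
    if length_fn(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Nothing finer to split on, so return the piece as-is
        return [text]

    sep, finer = separators[0], separators[1:]
    # An empty separator means "split between characters"
    parts = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if length_fn(candidate) <= max_len:
            current = candidate  # keep merging small parts together
            continue
        if current:
            chunks.append(current)
            current = ""
        if length_fn(part) <= max_len:
            current = part
        else:
            # This part alone is too long: retry with the finer separators
            chunks.extend(recursive_split(part, max_len, finer, length_fn))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc\n\nddd eee", max_len=7))
```

Passing a token-counting function as `length_fn` would make the limit apply to tokens instead of characters, which is the role `tiktoken_len` plays for the real splitter.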

Then we split the text for a document like so:

In [71]:
chunks = text_splitter.split_text(docs[5].page_content)
len(chunks)

34

In [72]:
tiktoken_len(chunks[0]), tiktoken_len(chunks[1])

(370, 363)

For `docs[5]` we created `34` chunks, the first two of which have token lengths `370` and `363`.

This is for a single document; we need to do this over all of our documents. While we iterate through the docs to create these chunks we will reformat them into the format required by our API app. This format needs to align with the document format required by the `/upsert` endpoint, which looks like this:

```json
[
    {
        "id": "abc",
        "text": "some important document text",
        "metadata": {
            "field1": "optional metadata goes here",
            "field2": 54
        }
    },
    {
        "id": "123",
        "text": "some other important text",
        "metadata": {
            "field1": "another metadata",
            "field2": 71,
            "field3": "not all metadatas need the same structure"
        }
    }
    ...
]
```

Every document *must* have a `"text"` field. The `"id"` and `"metadata"` fields are optional, however, we will include both.
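A quick check before upserting can catch malformed records early. The sketch below only encodes what is stated here (a required `"text"` field, optional `"id"` and `"metadata"`). It additionally assumes that `"id"` is a string and `"metadata"` is a dict, as in the example above, and `validate_document` is a hypothetical helper rather than part of the retrieval app.

```python
def validate_document(doc):
    """Return a list of problems found in one document dict (empty if none)."""
    problems = []
    if not isinstance(doc.get("text"), str) or not doc["text"]:
        problems.append("missing or empty 'text'")
    # Assumption: ids are strings, as in the example payload
    if "id" in doc and not isinstance(doc["id"], str):
        problems.append("'id' should be a string")
    # Assumption: metadata is a flat or nested dict
    if "metadata" in doc and not isinstance(doc["metadata"], dict):
        problems.append("'metadata' should be a dict")
    return problems


examples = [
    {"id": "abc", "text": "some important document text", "metadata": {"field2": 54}},
    {"id": "123", "metadata": {"field1": "no text here"}},
]
for doc in examples:
    print(doc.get("id"), validate_document(doc))
```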

The `"id"` will be created based on the URL of the text plus its chunk number.

In [74]:
import hashlib
m = hashlib.md5()  # this will convert URL into unique ID

url = docs[4].metadata['source'].replace('rtdocs/', 'https://')
print(url)

# convert URL to unique ID
m.update(url.encode('utf-8'))
uid = m.hexdigest()[:12]
print(uid)

https://python.langchain.com/en/latest/model_laboratory.html
675233ddef72
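One detail worth knowing is that a `hashlib` object accumulates every `update` call, so reusing a single `m` across several URLs makes each ID depend on the URLs hashed before it. A minimal sketch of a helper that builds a fresh hash per URL is shown below; `url_to_uid` is a hypothetical name.

```python
import hashlib


def url_to_uid(url, length=12):
    """Return a short, deterministic ID derived only from the given URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:length]


url_a = "https://python.langchain.com/en/latest/model_laboratory.html"
url_b = "https://python.langchain.com/en/latest/index.html"

# The same URL always maps to the same ID, regardless of call order
print(url_to_uid(url_a) == url_to_uid(url_a))
print(url_to_uid(url_a) != url_to_uid(url_b))
```

Deterministic IDs mean that re-running the indexing step overwrites existing records for a page instead of creating new ones under different IDs.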


Then use the `uid` alongside chunk number and actual `url` to create the format needed:

In [75]:
data = [
    {
        'id': f'{uid}-{i}',
        'text': chunk,
        'metadata': {'url': url}
    } for i, chunk in enumerate(chunks)
]
data

[{'id': '675233ddef72-0',
  'text': 'Index\n_\n | A\n | B\n | C\n | D\n | E\n | F\n | G\n | H\n | I\n | J\n | K\n | L\n | M\n | N\n | O\n | P\n | Q\n | R\n | S\n | T\n | U\n | V\n | W\n_\n__call__() (langchain.llms.AI21 method)\n(langchain.llms.AlephAlpha method)\n(langchain.llms.Anthropic method)\n(langchain.llms.AzureOpenAI method)\n(langchain.llms.Banana method)\n(langchain.llms.CerebriumAI method)\n(langchain.llms.Cohere method)\n(langchain.llms.DeepInfra method)\n(langchain.llms.ForefrontAI method)\n(langchain.llms.GooseAI method)\n(langchain.llms.HuggingFaceEndpoint method)\n(langchain.llms.HuggingFaceHub method)\n(langchain.llms.HuggingFacePipeline method)\n(langchain.llms.Modal method)\n(langchain.llms.NLPCloud method)\n(langchain.llms.OpenAI method)\n(langchain.llms.OpenAIChat method)\n(langchain.llms.Petals method)\n(langchain.llms.PromptLayerOpenAI method)\n(langchain.llms.PromptLayerOpenAIChat method)\n(langchain.llms.Replicate method)\n(langchain.llms.SagemakerEndpoint met

Now we repeat the same logic across our full dataset:

In [17]:
from tqdm.auto import tqdm

documents = []

for doc in tqdm(docs):
    url = doc.metadata['source'].replace('rtdocs/', 'https://')
    # hash each URL with a fresh md5 object so the ID depends only on that URL
    uid = hashlib.md5(url.encode('utf-8')).hexdigest()[:12]
    chunks = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks):
        documents.append({
            'id': f'{uid}-{i}',
            'text': chunk,
            'metadata': {'url': url}
        })

len(documents)

  0%|          | 0/441 [00:00<?, ?it/s]

2415

We're now left with `2415` documents in the format required by our API.

---

#### (Optional) Load Dataset from Hugging Face

Rather than running the above scripts to build the dataset, you can load a prepared version from Hugging Face Datasets like so:

In [5]:
!pip install -qU datasets

from datasets import load_dataset

documents = load_dataset('jamescalam/langchain-docs', split='train')
documents

Downloading and preparing dataset json/jamescalam--langchain-docs to /root/.cache/huggingface/datasets/jamescalam___json/jamescalam--langchain-docs-bcc23a7c6d742f0e/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/2.76M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/jamescalam___json/jamescalam--langchain-docs-bcc23a7c6d742f0e/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.


Dataset({
    features: ['id', 'text', 'source'],
    num_rows: 2212
})

In [15]:
documents[0]

{'id': '9997c866b69e-0',
 'text': '.md\n.pdf\nGlossary\n Contents \nChain of Thought Prompting\nAction Plan Generation\nReAct Prompting\nSelf-ask\nPrompt Chaining\nMemetic Proxy\nSelf Consistency\nInception\nMemPrompt\nGlossary#\nThis is a collection of terminology commonly used when developing LLM applications.\nIt contains reference to external papers or sources where the concept was first introduced,\nas well as to places in LangChain where the concept is used.\nChain of Thought Prompting#\nA prompting technique used to encourage the model to generate a series of intermediate reasoning steps.\nA less formal way to induce this behavior is to include “Let’s think step-by-step” in the prompt.\nResources:\nChain-of-Thought Paper\nStep-by-Step Paper\nAction Plan Generation#\nA prompt usage that uses a language model to generate actions to take.\nThe results of these actions can then be fed back into the language model to generate a subsequent action.\nResources:\nWebGPT Paper\nSayCan Pap

This needs to be reformatted into the format we need for the API:

In [18]:
documents = [{
    'id': doc['id'],
    'text': doc['text'],
    'metadata': {'url': doc['source']}
} for doc in documents]

documents[0]

---

### Indexing the Docs

We're now ready to begin indexing (or *upserting*) our `documents`. To make these requests to the retrieval app API, we will need to provide authorization in the form of the `BEARER_TOKEN` we set earlier. We do this below:

In [19]:
import os

BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxIiwibmFtZSI6IkFsZWt6YW5kZXIgQnl3YXRlciIsImlhdCI6MTUxNjIzOTAyMn0.Pvit0VmO89bF7BIZ5YOoQKznuID-2PsRN6jVHRxslPY"

Use the `BEARER_TOKEN` to create our authorization `headers`:

In [20]:
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}

We'll perform the upsert in batches of `batch_size`. Make sure that the `endpoint_url` variable is set to the correct location for your running *retrieval-app* API.

In [38]:
import requests
from requests.adapters import HTTPAdapter, Retry
from tqdm.auto import tqdm

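batch_size = 100
# no trailing slash, since the endpoint paths below already start with "/"
endpoint_url = "https://walrus-app-verbm.ondigitalocean.app"
s = requests.Session()

# we set up a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
# mount the adapter for both schemes so it also applies to https endpoints
s.mount('http://', HTTPAdapter(max_retries=retries))
s.mount('https://', HTTPAdapter(max_retries=retries))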

for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i+batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )

  0%|          | 0/25 [00:00<?, ?it/s]
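The batching logic can be checked in isolation, without a running API. The sketch below uses a stub list of the same size as the `2415` count printed earlier; `iter_batches` and `documents_stub` are hypothetical names used only for this illustration.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stub records standing in for the real documents
documents_stub = [{"text": f"doc {i}"} for i in range(2415)]
batches = list(iter_batches(documents_stub, 100))

print(len(batches), len(batches[0]), len(batches[-1]))
```

The number of batches matches the `25` iterations shown in the progress bar, with the final batch holding the leftover records.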

With that our LangChain doc records have all been indexed and we can move on to querying.

### Making Queries

To query the datastore, all we need to do is pass one or more queries to the `/query` endpoint. We can ask a few questions related to LangChain and see whether the results returned are relevant:

In [28]:
queries = [
    {'query': "What is the LLMChain in LangChain?"},
    {'query': "How do I use Pinecone in LangChain?"},
    {'query': "What is the difference between Knowledge Graph memory and buffer memory for "+
     "conversational memory?"}
]

res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries
    }
)
res

Now we can loop through the responses and see the results returned for each query:

In [24]:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")
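The loop above assumes the response body has a top-level `"results"` key. If the request failed (for example with a 404 from a wrong URL path), that key is missing and the loop raises a `KeyError`. Below is a hedged sketch of a more defensive formatter that works on the already-parsed JSON. The function names and the sample payloads are hypothetical, and the payload shape is inferred only from the fields the loop above reads (`query`, `results`, `text`, `score`).

```python
def format_query_result(query_result):
    """Format one query and its scored snippets as a text block."""
    lines = ["-" * 70, query_result["query"], ""]
    for result in query_result.get("results", []):
        lines.append(f"{round(result['score'], 2)}: {result['text']}")
    lines.append("-" * 70)
    return "\n".join(lines)


def format_response(payload):
    """Format a parsed /query response, or explain why it cannot be parsed."""
    if "results" not in payload:
        return f"No 'results' key in response, got keys: {sorted(payload)}"
    return "\n\n".join(format_query_result(q) for q in payload["results"])


# Hypothetical payloads for illustration only
sample_payload = {
    "results": [
        {
            "query": "What is the LLMChain in LangChain?",
            "results": [{"text": "example snippet", "score": 0.8712}],
        }
    ]
}
print(format_response(sample_payload))
print(format_response({"detail": "Not Found"}))
```

With a real response, calling `res.raise_for_status()` before `res.json()` surfaces HTTP errors directly rather than as a missing key later on.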

The top results are all relevant, as we would have hoped. With that, we've finished. The retrieval app API can be shut down and, to save resources, the Pinecone index can be deleted within the [Pinecone console](https://app.pinecone.io/).