## Document
### see https://github.com/pinecone-io/examples/blob/master/generation/chatgpt/plugins/langchain-docs-plugin.ipynb

## Required Libraries

In [1]:
!pip install -qU langchain tiktoken tqdm
!pip install beautifulsoup4
!pip install -qU python-dotenv
# eval "$(direnv hook bash)"




## Preparing Data


In [2]:
!wget -r -A.html -P rtdocs https://python.langchain.com/en/latest/


/bin/bash: wget: command not found
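The message above means `wget` is not installed on this machine, so this cell downloaded nothing. A minimal sketch, using only the standard library, for checking that a command-line tool exists before shelling out to it; the helper name `tool_available` is hypothetical and not part of the original notebook:

```python
import shutil


def tool_available(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if not tool_available("wget"):
    # Installation differs per platform, so only a generic hint is printed.
    print("wget not found on PATH; install it before running the download cell")
```

Running this check first makes the failure explicit instead of leaving an empty `rtdocs` directory for the loader.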


In [31]:
from langchain.document_loaders import ReadTheDocsLoader

loader = ReadTheDocsLoader('rtdocs')
docs = loader.load()
len(docs)





472

In [19]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [20]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,  # number of tokens overlap between chunks
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)

In [20]:
text_splitter

<langchain.text_splitter.RecursiveCharacterTextSplitter at 0x11543e700>

In [21]:
chunks = text_splitter.split_text(docs[5].page_content)
len(chunks)

5
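To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch over a list of tokens. It is an illustration only: `RecursiveCharacterTextSplitter` splits on the separators first and then merges pieces, so its chunk boundaries will differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(10))
print(split_with_overlap(tokens, chunk_size=4, chunk_overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window repeats the last item of the previous one, which is the role the 20-token overlap plays in the splitter configured above.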

In [23]:
chunks[1]

"Ask Questions On Your Custom (or Private) Files\nConnect Google Drive Files To OpenAI\nYouTube Transcripts + OpenAI\nQuestion A 300 Page Book (w/ OpenAI + Pinecone)\nWorkaround OpenAI's Token Limit With Chain Types\nBuild Your Own OpenAI + LangChain Web App in 23 Minutes\nWorking With The New ChatGPT API\nOpenAI + LangChain Wrote Me 100 Custom Sales Emails\nStructured Output From OpenAI (Clean Dirty Data)\nConnect OpenAI To +5,000 Tools (LangChain + Zapier)\nUse LLMs To Extract Data From Text (Expert Mode)\nLangChain How to and guides by Sam Witteveen:\nLangChain Basics - LLMs & PromptTemplates with Colab\nLangChain Basics - Tools and Chains\nChatGPT API Announcement & Code Walkthrough with LangChain\nConversations with Memory (explanation & code walkthrough)\nChat with Flan20B\nUsing Hugging Face Models locally (code walkthrough)\nPAL : Program-aided Language Models with LangChain code\nBuilding a Summarization System with LangChain and GPT-3 - Part 1\nBuilding a Summarization System

In [27]:
import hashlib
m = hashlib.md5()  # this will convert URL into unique ID

url = docs[5].metadata['source'].replace('rtdocs/', 'https://')
print(url)

# convert URL to unique ID
m.update(url.encode('utf-8'))
uid = m.hexdigest()[:12]
print(uid)

https://python.langchain.com/en/latest/youtube.html
001b1930c81f
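One detail worth keeping in mind when this is applied to many URLs: a `hashlib` object accumulates everything passed to `update()`, so a single shared `md5` object yields a digest of all URLs seen so far rather than of the current one. A small sketch with made-up `example.com` URLs; the helper `url_to_id` is hypothetical:

```python
import hashlib


def url_to_id(url: str, length: int = 12) -> str:
    """Hash one URL with a fresh md5 object and keep the first `length` hex chars."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:length]


# A shared hash object keeps state between update() calls.
shared = hashlib.md5()
shared.update(b"https://example.com/a")
first = shared.hexdigest()[:12]
shared.update(b"https://example.com/b")
second = shared.hexdigest()[:12]

print(first == url_to_id("https://example.com/a"))   # True
print(second == url_to_id("https://example.com/b"))  # False
# The second digest covers the concatenation of both URLs.
print(second == url_to_id("https://example.com/ahttps://example.com/b"))  # True
```

Creating a new hash object per URL keeps each ID a function of that URL alone, so re-running the pipeline in a different order produces the same IDs.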


In [25]:
data = [
    {
        'id': f'{uid}-{i}',
        'text': chunk,
        'metadata': {'url': url}
    } for i, chunk in enumerate(chunks)
]
data

NameError: name 'chunks' is not defined


In [32]:
from tqdm.auto import tqdm

documents = []

for doc in tqdm(docs):
    url = doc.metadata['source'].replace('rtdocs/', 'https://')
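    # hash each URL with a fresh md5 object; reusing `m` would fold every
    # previously seen URL into the digest
    uid = hashlib.md5(url.encode('utf-8')).hexdigest()[:12]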
    chunks = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks):
        documents.append({
            'id': f'{uid}-{i}',
            'text': chunk,
            'metadata': {'url': url}
        })

len(documents)

100%|██████████| 472/472 [00:02<00:00, 212.32it/s]


1945
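Since each record is upserted by its `id`, two records sharing an ID would overwrite each other. A minimal sanity check, assuming records shaped like the dictionaries built above (only the `'id'` key is used); `find_duplicate_ids` is a hypothetical helper and the sample data is made up:

```python
from collections import Counter


def find_duplicate_ids(records):
    """Return the sorted list of IDs that occur more than once in `records`."""
    counts = Counter(record["id"] for record in records)
    return sorted(id_ for id_, n in counts.items() if n > 1)


sample = [{"id": "abc-0"}, {"id": "abc-1"}, {"id": "abc-0"}]
print(find_duplicate_ids(sample))  # ['abc-0']
```

Calling `find_duplicate_ids(documents)` before indexing should return an empty list if every chunk has its own ID.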

## Indexing the docs

In [34]:
import os
from dotenv import load_dotenv

load_dotenv()

BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "BEARER_TOKEN_HERE"
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}
# do not echo BEARER_TOKEN here: it is a secret and would be saved in the notebook output
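Because the bearer token is a secret, it is safer to fail early when it is missing and to show only a masked form when confirming that it loaded. A minimal standard-library sketch; `require_env` and `mask_secret` are hypothetical helpers, not part of the original notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


def mask_secret(secret: str, visible: int = 4) -> str:
    """Keep the first `visible` characters and replace the rest with asterisks."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)


print(mask_secret("abcdefgh"))  # abcd****
```

With these, `mask_secret(require_env("BEARER_TOKEN"))` confirms the variable is set without writing the token into the notebook.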

In [79]:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from tqdm.auto import tqdm


batch_size = 100
endpoint_url = "https://squid-app-ercsl.ondigitalocean.app"
s = requests.Session()

# set up a retry strategy for 5xx errors; POST is not retried on status
# codes by default, so it is allowed explicitly (needs urllib3 >= 1.26)
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504],
    allowed_methods=["POST"]
)

# the endpoint is served over HTTPS, so mount the adapter on that prefix
s.mount('https://', HTTPAdapter(max_retries=retries))

for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i+batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )


100%|██████████| 20/20 [01:31<00:00,  4.60s/it]
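The progress bar reports 20 iterations because 1945 records split into batches of 100 give 19 full batches plus one of 45. The slicing can be sketched as a small generator; `batched` is a hypothetical helper and the integers stand in for the real records:

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


batches = list(batched(list(range(1945)), 100))
print(len(batches), len(batches[-1]))  # 20 45
```

Slicing past the end of a list is safe in Python, so the explicit `min(...)` bound is optional when batching this way.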


## Making Queries


In [81]:
queries = [
    {'query': "What is the LLMChain in LangChain?"},
    {'query': "How do I use Pinecone in LangChain?"},
    {'query': "What is the difference between Knowledge Graph memory and buffer memory for "+
     "conversational memory?"}
]

res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries
    }
)
res

<Response [200]>

In [83]:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")

----------------------------------------------------------------------
What is the LLMChain in LangChain?

0.87: Loading from LangChainHub next Sequential Chains  Contents    LLM Chain Additional ways of running LLM Chain Parsing the outputs Initialize from string By Harrison Chase            © Copyright 2023, Harrison Chase.          Last updated on May 07, 2023.
0.87: nThese are, in increasing order of complexity:\n\n📃 LLMs and Prompts:\n\nThis includes prompt management, prompt optimization, a generic interface for all LLMs, and common utilities for working with LLMs.\n\n🔗 Chains:\n\nChains go beyond a single LLM call and involve sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and
0.87: Started\nModules\nUse Cases\nReference Docs\nLangChain Ecosystem\nAdditional Resources\n\n\n\n\n\n\n\n\nWelcome to LangChain#\nLarge language models (LLMs) are emerging as a tra
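The formatting step can also be written as a function that is easy to test without calling the endpoint. This sketch assumes only the keys read in the cell above (`results`, `query`, `text`, `score`); the sample payload is made up and `format_results` is a hypothetical helper:

```python
def format_results(response_json):
    """Return one text block per query: the query, a blank line, then 'score: text' lines."""
    blocks = []
    for query_result in response_json["results"]:
        lines = [query_result["query"], ""]
        for match in query_result["results"]:
            lines.append(f"{round(match['score'], 2)}: {match['text']}")
        blocks.append("\n".join(lines))
    return blocks


sample = {
    "results": [
        {
            "query": "What is X?",
            "results": [
                {"text": "X is a thing.", "score": 0.8712},
                {"text": "More on X.", "score": 0.5},
            ],
        }
    ]
}

for block in format_results(sample):
    print("-" * 70)
    print(block)
```

Keeping score and text together in one dictionary avoids the parallel `answers` and `scores` lists and the need to zip them back together.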