<a href="https://colab.research.google.com/github/Phishinf/BOT4PRO/blob/main/RAG_LangChain%26Pinecone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Retrieval Augmentation in LangChain

LLMs have a data freshness problem. Even the most powerful LLMs, such as GPT-4, know nothing about events that happened after their training data was collected.

An LLM's knowledge is frozen in time: it is a static snapshot of the world as it appeared in the training data.

This example covers the steps to integrate Pinecone, a high-performance vector database, with LangChain, a framework for building applications powered by large language models (LLMs).

Pinecone enables developers to build scalable, real-time recommendation and search systems based on vector similarity search. LangChain, on the other hand, provides modules for managing and optimizing the use of language models in applications.
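
To make "vector similarity search" concrete, here is a minimal pure-Python sketch of a brute-force top-k lookup by cosine similarity. The names `cosine_similarity`, `top_k`, and `toy_vectors` are hypothetical and the 3-dimensional vectors are made up for illustration; this is not how Pinecone is implemented, only the idea that a vector database serves at scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), vec_id)
              for vec_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [vec_id for _, vec_id in scored[:k]]

# Toy 3-dimensional "embeddings" (hypothetical values, illustration only)
toy_vectors = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}

nearest = top_k([1.0, 0.05, 0.0], toy_vectors, k=2)
print(nearest)  # ['doc-a', 'doc-b']
```

A real index avoids scoring every stored vector for each query, but the ranking criterion is the same as the `metric='cosine'` setting used when the index is created later in this notebook.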

##Install Environment Packages

In [None]:
!pip install -qU langchain==0.0.162
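# langchain 0.0.162 and the pinecone.init() call used below predate openai 1.x and pinecone-client 3.x
!pip install "openai<1" tiktoken "pinecone-client[grpc]<3" datasets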
!pip install apache_beam mwparserfromhell

In [None]:
!pip install multiprocess==0.70.15

#Building Knowledge Base
Every record contains a lot of text. It is useful to split these articles into more concise chunks that can later be embedded and stored in our Pinecone vector database.

We use LangChain's RecursiveCharacterTextSplitter to split our text into chunks of a specified maximum length.

In [9]:
from datasets import load_dataset
data = load_dataset("wikipedia", "20220301.simple", split='train[:10000]')
data
data[6]
# Look up the tokenizer encoding used by the target model
import tiktoken
tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [13]:
# Pass the encoding name reported above ('cl100k_base') to get_encoding below

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

26

##Divide the text into chunks

In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# chunk_size and chunk_overlap are measured in tokens, as counted by tiktoken_len
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
# The text_splitter gives us much more evenly sized chunks of text.
# We use the same splitter during the indexing process too.
chunks = text_splitter.split_text(data[6]['text'])[:4]
chunks
tiktoken_len(chunks[0]), tiktoken_len(chunks[1]), tiktoken_len(chunks[2])


(470, 469, 222)
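
The `separators` list is tried from coarsest to finest: paragraphs first, then lines, then words, then raw characters. The sketch below illustrates that idea in pure Python. It is a simplified stand-in and not LangChain's implementation: `recursive_split` is a hypothetical helper, it measures length in characters rather than tokens, and it ignores `chunk_overlap`.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most max_len characters,
    preferring the coarsest separator that keeps pieces small enough."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # last resort: hard cut every max_len characters
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # the piece alone is too long: retry with finer separators
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = ("First paragraph here.\n\n"
          "Second paragraph is a bit longer than the first.\n\n"
          "Third.")
pieces = recursive_split(sample, max_len=30)
for p in pieces:
    print(len(p), repr(p))  # every piece is at most 30 characters long
```

Short paragraphs survive intact, while the long middle paragraph is broken at word boundaries instead of mid-word, which is the behaviour the separator hierarchy is meant to give.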

##Creating Embeddings
We build embeddings with LangChain's OpenAI embedding wrapper; another embedding model could be used instead. First, provide an OpenAI API key by running the next cell:

In [None]:
from getpass import getpass
OPENAI_API_KEY = getpass("OpenAI API Key: ")  # platform.openai.com
from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'
embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed.embed_documents(texts)
len(res), len(res[0])



##Vector Database
To create our vector database we need a free API key from Pinecone.

Initialize like so:

In [None]:
import pinecone

# find API key in console at app.pinecone.io
YOUR_API_KEY = getpass("Pinecone API Key: ")
# find ENV (cloud region) next to API key in console
YOUR_ENV = input("Pinecone environment: ")

index_name = 'langchain-retrieval-augmentation'
pinecone.init(
    api_key=YOUR_API_KEY,
    environment=YOUR_ENV
)

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=len(res[0])  # 1536 dim of text-embedding-ada-002
    )
index = pinecone.GRPCIndex(index_name)

index.describe_index_stats()
# This new Pinecone index has a total_vector_count of 0, as we haven't added any vectors yet.

from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100

texts = []
metadatas = []

for record in tqdm(data):
    # first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source': record['url'],
        'title': record['title']
    }
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []

# upsert any remaining texts that did not fill a final batch
if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))
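
The indexing loop above follows a common accumulate-and-flush pattern: collect items, send them once the batch is large enough, and send whatever is left at the end. A minimal sketch of that pattern, with no Pinecone or OpenAI calls, is shown below. `flush_in_batches` and `sink` are hypothetical names; in the notebook the sink's role is played by the embed-then-upsert step.

```python
def flush_in_batches(items, batch_limit, sink):
    """Pass items to sink in lists of at most batch_limit,
    including the final partial batch."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_limit:
            sink(batch)
            batch = []
    if batch:
        # without this final flush the last partial batch would be lost
        sink(batch)

sent_sizes = []
flush_in_batches(range(250), batch_limit=100,
                 sink=lambda batch: sent_sizes.append(len(batch)))
print(sent_sizes)  # [100, 100, 50]
```

One difference from this sketch: the notebook's loop checks the limit only after all chunks of a record have been appended, so its batches can exceed `batch_limit` by the number of chunks in the last record.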
