# LangChain: Q&A over Documents

An example might be a tool that would allow you to query a product catalog for items of interest.

In this notebook we create a chain that can answer questions about a document.

The general idea is to use an LLM together with all our documents, but there is a key issue: an LLM can only inspect a few thousand words at a time.

So large documents are a problem, and this is where embeddings and vector databases come into play.

### EMBEDDINGS

Embeddings are numerical representations of objects (here, pieces of text) that capture their semantic meaning. These representations are typically high-dimensional vectors.

Embeddings enable similarity searches: objects that are semantically similar are close to each other in the vector space, so pieces of text with similar content have similar vectors. This is useful in tasks like recommendation systems and semantic search.

Imagine we have three sentences: one about a dog's life, another about a cat's life, and a last one about how Ferrari is the best car brand.

The embedding vectors of the cat and dog sentences would be more similar to each other than either is to the vector of the car sentence.
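
To make the dog, cat and car comparison concrete, here is a minimal sketch that scores similarity with cosine similarity, one common choice for comparing embeddings. The three-dimensional vectors are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", hand-written for this example only.
dog_sentence = [0.9, 0.8, 0.1]
cat_sentence = [0.8, 0.9, 0.2]
car_sentence = [0.1, 0.2, 0.9]

dog_cat = cosine_similarity(dog_sentence, cat_sentence)
dog_car = cosine_similarity(dog_sentence, car_sentence)

print(f"dog vs cat: {dog_cat:.3f}")
print(f"dog vs car: {dog_car:.3f}")
```

With these made-up numbers the dog and cat vectors point in almost the same direction, so their score is much higher than the dog and car score.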

### VECTOR DATABASE

It's a way to store the vector representations created by the embedding step.

We populate the database with chunks of text from the document.

First, we divide the document into smaller chunks of text. This is useful because we may not be able to pass the whole document to the LLM at once.

Second, we create an embedding for each chunk.

Third, we store the chunks, together with their embeddings, in the vector database.


When we want to do retrieval (the "R" in RAG, retrieval-augmented generation), we create an embedding of the query.

Then we compare the query embedding with the stored chunk embeddings, and the database returns the chunks whose embeddings are most similar.

We can then pass those chunks into the LLM prompt to get the final answer.
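
The steps above can be sketched without any external service. In this toy version the "embedding" is just a word count, which is an assumption made so the example runs offline; a real embedding model returns dense vectors instead. The function names, the sample text and the in-memory list used as a "database" are all hypothetical.

```python
import math
from collections import Counter


def split_into_chunks(text, words_per_chunk):
    """Step 1: cut the document into fixed-size chunks of words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def toy_embed(text):
    """Step 2 (stand-in): represent a text by its lowercase word counts."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


document = (
    "This shirt blocks the sun with UPF fabric. "
    "These boots keep feet dry on wet trails. "
    "This tent sleeps four people in great comfort."
)

# Step 3: store every chunk next to its vector.
store = [(chunk, toy_embed(chunk)) for chunk in split_into_chunks(document, 8)]


def retrieve(query, store, k=1):
    """Embed the query and return the k most similar stored chunks."""
    query_vector = toy_embed(query)
    ranked = sorted(
        store,
        key=lambda item: similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]


top_chunks = retrieve("shirt for the sun", store)
print(top_chunks)
```

The retrieved chunks are what would be placed in the LLM prompt in the last step.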


In [None]:
#pip install --upgrade langchain

A vector database is a specialized type of database designed to store and manage high-dimensional vectors, which are numerical representations of data. These vectors can represent various types of data, such as text, images, audio, and more. Here’s a detailed look at what vector databases are and how they work:

#### Key Features of Vector Databases

- **Storage of vectors:** Vector databases store data as vectors, which are fixed-length lists of numbers. Each vector represents the features of the data in a high-dimensional space.
- **Similarity search:** One of the primary functions of vector databases is to perform similarity searches. This involves finding vectors that are closest to a given query vector, which is useful for tasks like semantic search, recommendation systems, and image retrieval.
- **Approximate Nearest Neighbor (ANN) algorithms:** Vector databases often implement ANN algorithms to efficiently search for similar vectors. Techniques like Hierarchical Navigable Small World (HNSW) graphs, Locality-sensitive Hashing (LSH), and Product Quantization (PQ) are commonly used.
- **Applications:**
  - Semantic search: finding documents or data that are semantically similar to a query.
  - Recommendation systems: suggesting items similar to user preferences.
  - Image and audio retrieval: searching for images or audio files that are similar to a given example.
  - Large Language Models (LLMs): enhancing the performance of LLMs by retrieving relevant context.

#### How Vector Databases Work

- **Vectorization:** Data is transformed into vectors using machine learning models, embeddings, or feature extraction techniques. For example, a sentence can be converted into a vector using word embeddings like Word2Vec or BERT.
- **Storage and indexing:** The vectors are stored in the database, and indexing techniques are applied to facilitate efficient similarity searches.
- **Query processing:** When a query vector is provided, the database uses ANN algorithms to quickly find the nearest vectors in the high-dimensional space.

#### Example Use Case

Imagine you have a collection of images and you want to find images similar to a given one. You would:

- Convert all images into vectors using a feature extraction model.
- Store these vectors in a vector database.
- When a query image is provided, convert it into a vector and use the database to find the most similar image vectors.

#### Popular Vector Databases

Some well-known vector databases include:

- Pinecone
- Milvus
- Weaviate
- FAISS (Facebook AI Similarity Search)

These databases are optimized for handling high-dimensional data and performing fast similarity searches, making them essential tools in modern AI and machine learning applications.
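
As a rough illustration of the approximate search idea, here is a simplified sketch of one locality-sensitive hashing scheme, based on random hyperplanes. Each vector gets one bit per hyperplane (which side of the plane it falls on), and vectors that share a signature land in the same bucket, so a query only needs to be compared against its own bucket. The function names, the seed and the toy vectors are assumptions made for this example; production vector databases use far more elaborate indexes.

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals from a Gaussian distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def build_buckets(vectors, hyperplanes):
    """Group vector names by their signature."""
    buckets = {}
    for name, vector in vectors.items():
        buckets.setdefault(lsh_signature(vector, hyperplanes), []).append(name)
    return buckets


hyperplanes = make_hyperplanes(num_planes=8, dim=3)

# Hypothetical toy vectors.
vectors = {
    "a": [1.0, 2.0, 3.0],
    "a_scaled": [2.0, 4.0, 6.0],       # same direction as "a"
    "a_flipped": [-1.0, -2.0, -3.0],   # opposite direction
}

buckets = build_buckets(vectors, hyperplanes)

# Candidates for a query are the vectors stored in the query's bucket.
query_signature = lsh_signature([3.0, 6.0, 9.0], hyperplanes)
candidates = buckets.get(query_signature, [])
print(candidates)
```

Vectors pointing in the same direction always share a bucket here, while vectors that are merely close share one only with some probability. That is why real systems combine several hash tables or use graph-based indexes such as HNSW.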


In [3]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown
from langchain.llms import OpenAI


import os
import openai
from openai import AzureOpenAI
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv()) # read local .env file
AZURE_OPENAI_API_KEY = os.getenv('AZURE_OPENAI_API_KEY')
AZURE_OPENAI_ENDPOINT_GPT = os.getenv('AZURE_OPENAI_ENDPOINT_GPT') # Endpoint for GPT
MODEL = os.getenv('OPENAI_MODEL_NAME')
AZURE_OPENAI_API_VERSION = '2024-02-15-preview'
AZURE_OPENAI_ENDPOINT_EMBEDDING = os.getenv('AZURE_OPENAI_ENDPOINT_EMBEDDING') # Endpoint for the creation of embeddings

import warnings
warnings.filterwarnings('ignore')

from langchain_openai import AzureChatOpenAI
from langchain.indexes import VectorstoreIndexCreator
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain

#### ATTENTION

If you have an endpoint for a GPT model and you want to create embeddings, that endpoint is not going to work.
You need to request a separate endpoint for creating embeddings.

**However, we can use the same API key for both endpoints.**

At the moment I don't know whether there are endpoints available that work for both (GPT and embeddings).
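
One way to catch a missing endpoint early is to check the settings before creating any client. The sketch below reuses the environment variable names from this notebook; the helper `load_azure_settings` and the placeholder values are hypothetical and only show the idea of one shared key with two distinct endpoints.

```python
import os

# Variable names taken from this notebook's .env file.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT_GPT",
    "AZURE_OPENAI_ENDPOINT_EMBEDDING",
)


def load_azure_settings(env=None):
    """Return the required settings, or raise if any of them is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}


# Placeholder values for illustration; real values come from the .env file.
example_env = {
    "AZURE_OPENAI_API_KEY": "placeholder-key",
    "AZURE_OPENAI_ENDPOINT_GPT": "https://example-gpt.invalid/",
    "AZURE_OPENAI_ENDPOINT_EMBEDDING": "https://example-embedding.invalid/",
}

settings = load_azure_settings(example_env)
print(sorted(settings))
```

Calling `load_azure_settings()` with no argument reads from the real environment instead of the example dictionary.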

In [4]:
from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-ada-002",
    # dimensions: Optional[int] = None, # Can specify dimensions with new text-embedding-3 models
    azure_endpoint=AZURE_OPENAI_ENDPOINT_EMBEDDING, # If not provided, will read env variable AZURE_OPENAI_ENDPOINT
    api_key= AZURE_OPENAI_API_KEY, # Can provide an API key directly. If missing read env variable AZURE_OPENAI_API_KEY
    openai_api_version= AZURE_OPENAI_API_VERSION,
    openai_api_type="azure",
      # If not provided, will read env variable AZURE_OPENAI_API_VERSION
)

# Chunk of text we are going to use to show how embeddings work
text = 'LangChain is the framework for building context-aware reasoning applications'

single_vector = embeddings.embed_query(text)
print(str(single_vector)[:100])

[-0.0011866469867527485, 0.007133987732231617, -0.014754624105989933, -0.03413593769073486, 0.011390


In [5]:
# When creating an LLM, we must use the GPT endpoint; otherwise we will receive an error

llm_gpt_endpoint = AzureChatOpenAI(temperature=0.9, 
                      api_key=AZURE_OPENAI_API_KEY, 
                      api_version=AZURE_OPENAI_API_VERSION,
                      azure_endpoint=AZURE_OPENAI_ENDPOINT_GPT)

# This 'llm_embeddings_endpoint' is not going to work
#llm_embeddings_endpoint = AzureChatOpenAI(temperature=0.9, 
#                      api_key=AZURE_OPENAI_API_KEY, 
#                      api_version=AZURE_OPENAI_API_VERSION,
#                      azure_endpoint=AZURE_OPENAI_ENDPOINT_EMBEDDING)


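# Then we initialize the prompt, which takes a variable `product`.
# Adjacent string literals avoid the stray indentation that backslash
# continuations would leave inside the prompt text.
prompt = ChatPromptTemplate.from_template(
    "What is the best name to describe "
    "a company that makes {product}? "
    "JUST RETURN THE NAME UP TO 6 WORDS"
)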


chain = LLMChain(llm=llm_gpt_endpoint, prompt=prompt)

product = "Queen Size Sheet Set"

# Then we run the chain object we created above
chain.run(product)

'Royal Rest Bedding Co.'

# LET'S CONTINUE WITH THE WORKBOOK

In [None]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding='utf-8')
len(loader.load())

1000

In [13]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("data/001_ADZYNMA.pdf")
pages = loader.load()
len(pages)

48

In [17]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 150,
    chunk_overlap = 10
)

splits = text_splitter.split_documents(pages)
len(splits)

647
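
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately naive sketch that cuts a string into fixed-size windows that share a few characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and spaces); the function name and sample string are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)

for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears partly in both neighbours.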

In [9]:
#!pip install docarray

In [18]:
#from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorstoreIndexCreator 
from langchain_openai import AzureOpenAIEmbeddings

embedding_model = AzureOpenAIEmbeddings(
                azure_deployment="text-embedding-ada-002",
                # dimensions: Optional[int] = None, # Can specify dimensions with new text-embedding-3 models
                azure_endpoint=AZURE_OPENAI_ENDPOINT_EMBEDDING, # If not provided, will read env variable AZURE_OPENAI_ENDPOINT
                api_key= AZURE_OPENAI_API_KEY, # Can provide an API key directly. If missing read env variable AZURE_OPENAI_API_KEY
                openai_api_version= AZURE_OPENAI_API_VERSION, #Another used version '2023-05-15'
                openai_api_type="azure",
)

In [9]:
index = VectorstoreIndexCreator(
    embedding=embedding_model,
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

RateLimitError: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the Embeddings_Create Operation under Azure OpenAI API version 2024-02-15-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 86400 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}
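
A 429 like the one above means the embedding quota was exhausted. For short-lived rate limits, a common mitigation is to retry with exponential backoff, as in the sketch below. The helper, the stand-in exception class and the fake call are hypothetical and only demonstrate the pattern. Note that the error above asks to wait 86400 seconds (a full day), which no short backoff can cover; in that case the options are waiting, embedding fewer chunks, or requesting a higher quota.

```python
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep, retry_on=(Exception,)):
    """Call func(), doubling the wait after each failure listed in retry_on."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for a library's rate-limit exception."""


attempts = {"count": 0}


def flaky_embedding_call():
    """Fails twice, then succeeds, to imitate a temporary rate limit."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("429")
    return "ok"


# Record the delays instead of really sleeping, so the demo runs instantly.
delays = []
result = call_with_backoff(
    flaky_embedding_call,
    sleep=delays.append,
    retry_on=(FakeRateLimitError,),
)
print(result, delays)
```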

In [26]:
query ="Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

**Note**:
- The notebook uses `langchain==0.0.179` and `openai==0.27.7`
- For these library versions, `VectorstoreIndexCreator` uses `text-davinci-003` as the base model, which has been deprecated since 1 January 2024.
- The replacement model, `gpt-3.5-turbo-instruct` will be used instead for the `query`.
- The `response` format might be different than the video because of this replacement model.

In [27]:
# The snippet below does not work because `OpenAI` belongs to an older version. We use `AzureChatOpenAI` instead.

#llm_replacement_model = OpenAI(temperature=0, 
#                              model='gpt-3.5-turbo-instruct')

llm_replacement_model = AzureChatOpenAI(temperature=0, 
                      api_key=AZURE_OPENAI_API_KEY, 
                      api_version=AZURE_OPENAI_API_VERSION,
                      azure_endpoint=AZURE_OPENAI_ENDPOINT_GPT)

response = index.query(query, 
                       llm = llm_replacement_model)

NameError: name 'index' is not defined

In [None]:
display(Markdown(response))

## Step By Step

In [None]:
from langchain.document_loaders import CSVLoader
loader = CSVLoader(file_path=file)

In [None]:
docs = loader.load()

In [None]:
docs[0]

In [None]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
embed = embeddings.embed_query("Hi my name is Harrison")

In [None]:
print(len(embed))

In [None]:
print(embed[:5])

In [None]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

In [None]:
query = "Please suggest a shirt with sunblocking"

In [None]:
docs = db.similarity_search(query)

In [None]:
len(docs)

In [None]:
docs[0]

In [None]:
retriever = db.as_retriever()

In [None]:
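# NOTE: `llm_model` is not defined earlier in this notebook; set it to a chat model name before running this cell.
llm = ChatOpenAI(temperature = 0.0, model=llm_model)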

In [None]:
qdocs = "".join(doc.page_content for doc in docs)


In [None]:
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.") 

In [None]:
display(Markdown(response))

In [None]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

In [None]:
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

In [None]:
response = qa_stuff.run(query)

In [None]:
display(Markdown(response))
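
The `stuff` chain type places all retrieved chunks into a single prompt, which is essentially what the manual `qdocs` step above does by hand. A minimal sketch of that idea, with a prompt wording that is hypothetical and not LangChain's actual template:

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved chunks, written for this example only.
retrieved_chunks = [
    "Sun Shield Shirt: UPF 50+ fabric, lightweight.",
    "Tropical Breeze Shirt: UPF 50+, long sleeves.",
]

prompt_text = build_stuff_prompt(
    retrieved_chunks,
    "Please list all your shirts with sun protection.",
)
print(prompt_text)
```

This is simple and uses a single LLM call, but the prompt grows with every retrieved chunk, so it only works while everything fits in the model's context window.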

In [None]:
# Build the index first, so that `index` exists before it is queried
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

In [None]:
response = index.query(query, llm=llm)