# LangChain: Q&A over Documents

This notebook uses LangChain to answer natural-language questions over a set of documents. The running example is a tool that lets you query a product catalog for items of interest.

## EITHER: use your [OpenAI API Key](https://platform.openai.com/account/api-keys)

In [2]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [None]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"

## OR: use [LocalAI as an OpenAI replacement](https://localai.io/howtos/easy-request-openai/)

In [5]:
import os
import openai

# Specify the port your LocalAI docker container runs on
# openai.api_base = "http://localhost:8080/v1"  # default
openai.api_base = "http://localhost:9095/v1"  # for lunademo
openai.api_key = "sx-xxx"  # not needed for LocalAI (dummy)
OPENAI_API_KEY = "sx-xxx"
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

In [6]:
# Specify the model you are using
# llm_model = ""
llm_model = "lunademo"  # for lunademo

## Get Started

**Components we need:**
- **RetrievalQA chain** - for retrieval over some documents
- **ChatOpenAI** --> *could we swap this with a local Llama?*
- **CSVLoader** - one of many different document loaders
- **DocArrayInMemorySearch** - one of many different types of **vector stores**; this one is an *in-memory* vector store and does not connect to an external database

In [None]:
#pip install --upgrade langchain

In [7]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown  # to display the response in Jupyter Notebook

In [8]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding='utf-8')

- **VectorstoreIndexCreator** - to create the vector store

In [None]:
#pip install docarray

In [9]:
from langchain.indexes import VectorstoreIndexCreator

**Specify the vector store class (we use the *DocArrayInMemorySearch* imported earlier) and pass a list of loaders (here only one)**
- **FixMe:** embeddings do not work with LocalAI `lunademo`

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

**This is the query we ask our loaded document**

In [None]:
query = "Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

**Create the response**

In [None]:
response = index.query(query)

**Display the response in markdown**

In [None]:
display(Markdown(response))  # will display a table with different product names and their descriptions, as well as a short text summary

## Step By Step
- a more in-depth look at how the chain is created

In [11]:
from langchain.document_loaders import CSVLoader
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding='utf-8')

In [12]:
docs = loader.load()

**If we look at the individual documents, we see that each document corresponds to one product (one row) in the original CSV**

In [13]:
docs[0]

Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})
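The output above suggests that each CSV row becomes one document whose text is a list of `column: value` lines, with the file name and row number kept as metadata. The following is a minimal sketch of that idea using only the standard library. It is not CSVLoader's actual implementation; `SAMPLE_CSV` and `rows_to_documents` are hypothetical names, and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical two-row catalog standing in for the real CSV file.
# The first column has an empty header, like the index column in the output above.
SAMPLE_CSV = """,name,description
0,Women's Campside Oxfords,Soft canvas lace-to-toe Oxford.
1,Sun Shield Shirt,UPF 50+ rated shirt.
"""


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, joined with newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_number},
        })
    return documents


documents = rows_to_documents(SAMPLE_CSV, "sample.csv")
print(documents[0]["page_content"])
print(documents[0]["metadata"])
```

Because the index column has no header, its line starts with `: `, which matches the leading `: 0` in the real document shown above.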

In [14]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

**We use OpenAI embeddings** (but there are alternatives)
- because the documents are so small, we don't need to do any *chunking* --> for larger documents this is necessary!
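For larger documents, chunking could look roughly like the sketch below: fixed-size character windows that overlap so that text cut at a boundary still appears whole in one chunk. This is an illustration only, not one of LangChain's text splitters; `split_text`, `chunk_size`, and `chunk_overlap` are names chosen here, and real splitters usually prefer to break on separators such as paragraphs or sentences.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk would then be embedded and stored on its own, just like the one-product documents in this notebook.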

**Embedding of an example string**

In [None]:
embed = embeddings.embed_query("Hi my name is Max")

**This gives us an embedding vector with 1536 elements --> the overall numerical representation of this piece of text**

In [None]:
print(len(embed))

In [None]:
print(embed[:5])

**Store the embeddings in the vector store**
- using the *from_documents()* method
- takes a list of documents + an embeddings object

In [None]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

**We write a new query to check it against the vector store to find similar pieces of text**

In [None]:
query = "Please suggest a shirt with sunblocking"

**Will return a list of documents (here: 4)**

In [None]:
docs = db.similarity_search(query)  

In [None]:
len(docs)

In [None]:
docs[0]
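Conceptually, a similarity search embeds the query, compares that vector with every stored vector, and returns the closest documents. The sketch below illustrates this with cosine similarity, a common choice. It is an assumption that this mirrors what the vector store does internally; the three-dimensional vectors and product names are toy data, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_search(query_vector, store, k=4):
    """Return the texts of the k stored items most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,  # highest similarity first
    )
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs; real embeddings would have 1536 elements.
toy_store = [
    ("Sun Shield Shirt", [0.9, 0.1, 0.0]),
    ("Campside Oxfords", [0.0, 0.2, 0.9]),
    ("Tropical Breeze Shirt", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]  # pretend embedding of a query about sun-blocking shirts

top_two = similarity_search(toy_query, toy_store, k=2)
print(top_two)  # ['Sun Shield Shirt', 'Tropical Breeze Shirt']
```

The default of `k=4` in the sketch matches the four documents returned above.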

### Now we use this to do Q&A over our documents

**We first need a retriever from our vector store**

In [None]:
retriever = db.as_retriever()

**We want to do text generation for a natural language response (here we use OpenAI again)**

In [None]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)

**If we did this by hand, we would combine all documents into a single piece of text**

In [None]:
qdocs = "".join(doc.page_content for doc in docs)


**And ask the LLM to format it to a summarized table for us**

In [None]:
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.") 


**Display the response**

In [None]:
display(Markdown(response))

## We can do all this with a Chain

In [None]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

**We create a retrieval QA chain**
- does Q&A over the retrieved documents
- we pass an LLM (for text generation in the end)
- a chain type, here *stuff*
- a retriever (the interface for fetching documents)


**Stuff method:**
- simplest and most popular method: *stuffs* all documents into one prompt and makes a single call to the LLM
    - the LLM has access to all data at once
    - but keep the context length of the LLM in mind: this will not work for large documents or many documents at once
 
**Other methods**
- *map_reduce* calls the LLM for every chunk, then combines all responses in another LLM call to get the final answer
    - treats all documents as independent, so the calls can run in parallel --> fast
    - but context across documents might be lost
    - most often used for **summarization**
- *refine* iteratively calls the LLM for every chunk
    - builds upon the answer from the previous document; really powerful for combining information and building up an answer over time
    - leads to longer answers, and the calls cannot run in parallel --> slow
- *map_rerank* calls the LLM for every chunk and has it return a score, then selects the answer with the highest score

**We create a new query**

In [None]:
query = "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

**And run our created chain on the query**

In [None]:
response = qa_stuff.run(query)

**Display the response**

In [None]:
display(Markdown(response))

**In one line**

In [None]:
response = index.query(query, llm=llm)

**We can also customize the index when we're creating it**
- e.g., specify the embeddings, swap out the vector store for a different type


In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

### Source: https://learn.deeplearning.ai/langchain/lesson/5/question-and-answer