# LangChain: Q&A over Documents

Question answering over documents lets an LLM answer questions about data it was not trained on. One example is a tool that lets you query a product catalog for items of interest. The slides below give an overview of the flow: a document is loaded into a vector database so that an LLM can query it.

![image1.png](..\Img\qna_1.png)
![image2.png](..\Img\qna_2.png)
![image3.png](..\Img\qna_3.png)

## 1. Download required libraries

In [1]:
pip install --upgrade langchain

Note: you may need to restart the kernel to use updated packages.


## 2. Import required libraries

In [2]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
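
The `.env` file is expected to supply the credentials that the OpenAI client reads from the environment. Below is a minimal sketch of a sanity check. It assumes the key is stored under the name `OPENAI_API_KEY` (the notebook does not state the variable name), and `has_api_key` is a hypothetical helper, not part of `python-dotenv` or LangChain:

```python
import os


def has_api_key(env=None, name="OPENAI_API_KEY"):
    """Return True if `env` holds a non-empty value for `name`.

    The variable name is an assumption; adjust it to match your .env file.
    """
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; check your .env file.")
```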

Note: LLMs do not always produce the same results. When you run the code in your notebook, you may get slightly different answers than those in the video.

In [3]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"
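
The same date check can be wrapped in a small function, which makes the cutoff behaviour easy to verify. This is only a sketch: `pick_model` is a hypothetical helper, not part of the notebook or of LangChain.

```python
import datetime


def pick_model(today, cutoff=datetime.date(2024, 6, 12)):
    """Mirror the cell above: keep the dated snapshot up to and including the cutoff."""
    return "gpt-3.5-turbo" if today > cutoff else "gpt-3.5-turbo-0301"


llm_model = pick_model(datetime.date.today())
```

Because the comparison is strict (`>`), the dated snapshot is still selected on the cutoff day itself.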

`RetrievalQA` carries out retrieval over the specified documents.  <br /> 
`CSVLoader` is a document loader used to import the proprietary data.  <br /> 
`DocArrayInMemorySearch` is an in-memory vector store, so it does not require a connection to an external database.

In [5]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown
from langchain.llms import OpenAI

## 3. Load Data
Here, we load our simulated proprietary data (CSV dataset). We will combine this with our language model.

In [18]:
file = r"..\Data\OutdoorClothingCatalog_1000.csv"

# Initialize loader
loader = CSVLoader(file_path=file)

## 4. Create Vector Store

Here, we import `VectorstoreIndexCreator`, a helper that builds an index on top of a vector store.

In [10]:
from langchain.indexes import VectorstoreIndexCreator

In [9]:
!pip install docarray

Collecting docarray
  Downloading docarray-0.40.0-py3-none-any.whl.metadata (36 kB)
Collecting rich>=13.1.0 (from docarray)
  Using cached rich-13.9.4-py3-none-any.whl.metadata (18 kB)
Collecting types-requests>=2.28.11.6 (from docarray)
  Downloading types_requests-2.32.0.20241016-py3-none-any.whl.metadata (1.9 kB)
Collecting markdown-it-py>=2.2.0 (from rich>=13.1.0->docarray)
  Using cached markdown_it_py-3.0.0-py3-none-any.whl.metadata (6.9 kB)
Collecting mdurl~=0.1 (from markdown-it-py>=2.2.0->rich>=13.1.0->docarray)
  Using cached mdurl-0.1.2-py3-none-any.whl.metadata (1.6 kB)
Downloading docarray-0.40.0-py3-none-any.whl (270 kB)
Using cached rich-13.9.4-py3-none-any.whl (242 kB)
Downloading types_requests-2.32.0.20241016-py3-none-any.whl (15 kB)
Using cached markdown_it_py-3.0.0-py3-none-any.whl (87 kB)
Using cached mdurl-0.1.2-py3-none-any.whl (10.0 kB)
Installing collected packages: types-requests, mdurl, markdown-it-py, rich, docarray
Successfully installed docarray-0.40.0 mar

For the index to work, we need to pass an `OpenAIEmbeddings` instance in the `embedding` parameter.

As seen below, we have specified `DocArrayInMemorySearch` as our selected vector store, loaded the data and created the vector store. 

In [20]:
from langchain.embeddings import OpenAIEmbeddings

# Instantiating embeddings model 
embeddings = OpenAIEmbeddings()

index = VectorstoreIndexCreator(
    embedding=embeddings,
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])


Test the connectivity to the Vector Store

In [28]:
query ="Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

**Note**:
- The notebook uses `langchain==0.0.179` and `openai==0.27.7`
- For these library versions, `VectorstoreIndexCreator` uses `text-davinci-003` as the base model, which has been deprecated since 1 January 2024.
- The replacement model, `gpt-3.5-turbo-instruct` will be used instead for the `query`.
- The `response` format might differ from the one in the video because of this replacement model.

In [29]:
llm_replacement_model = OpenAI(temperature=0, 
                               model='gpt-3.5-turbo-instruct')

response = index.query(query, 
                       llm = llm_replacement_model)

In [30]:
display(Markdown(response))



| Name | Description | Sun Protection Rating |
| --- | --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | Made of 100% polyester, UPF 50+ rating, wrinkle-resistant, front and back cape venting, two front bellows pockets, imported | SPF 50+, blocks 98% of harmful UV rays |
| Men's Plaid Tropic Shirt, Short-Sleeve | Made of 52% polyester and 48% nylon, UPF 50+ rating, SunSmart technology, wrinkle-free, front and back cape venting, two front bellows pockets, imported | SPF 50+, blocks 98% of harmful UV rays |
| Men's TropicVibe Shirt, Short-Sleeve | Made of 71% nylon and 29% polyester, UPF 50+ rating, wrinkle-resistant, front and back cape venting, two front bellows pockets, imported | SPF 50+, blocks 98% of harmful UV rays |
| Sun Shield Shirt | Made of 78% nylon and 22% Lycra Xtra Life fiber, UPF 50+ rating, moisture-wicking, fits over swimsuit, abrasion-resistant, imported | SPF 

## 5. Breaking Down LangChain 

### 5A. Query the CSV with a vector database using similarity search

The following code breaks down what is going on under the hood of the functions above.

First we start with the data load.

In [31]:
from langchain.document_loaders import CSVLoader
loader = CSVLoader(file_path=file)

Here, we can start loading documents. When we load the document, we can see that each document corresponds to one of the products in the CSV. 

In [32]:
docs = loader.load()
docs[0]

Document(metadata={'source': '..\\Data\\OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")
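
To make the row-to-document mapping concrete, here is a rough standard-library sketch of the idea. It is not LangChain's implementation: `rows_to_documents` and the sample CSV text are made up for illustration, and plain dictionaries stand in for `Document` objects.

```python
import csv
import io


def rows_to_documents(csv_text, source="catalog.csv"):
    """Turn each CSV row into one document-like dict.

    The content is a block of "column: value" lines, and the metadata
    records the source name and the zero-based row number.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_number},
            }
        )
    return documents


# Hypothetical two-row catalog, used only for this illustration.
sample_csv = (
    "name,description\n"
    "Sun Shield Shirt,UPF 50+ rated\n"
    "Campside Oxfords,Soft canvas shoe\n"
)
toy_docs = rows_to_documents(sample_csv)
print(toy_docs[0]["page_content"])
```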

Because our documents (product descriptions) are small, we do not need to chunk the text, so we go straight to creating embeddings. As before, `OpenAIEmbeddings` is the OpenAI embedding class used to create them.

In [33]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

Here, we can see what happens when we embed a piece of text. The result is a vector of 1536 numbers which, taken together, form a numerical representation of the text's meaning.

In [38]:
embed = embeddings.embed_query("Hi my name is Harrison")

print(len(embed))
print(embed[:5])

1536
[-0.02196465528695117, 0.006758838256223806, -0.018249490165056663, -0.03923515029463157, -0.014007174091135742]
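
Embedding vectors are useful because texts with similar meaning tend to map to nearby vectors. A common way to compare two vectors is cosine similarity. The sketch below uses tiny hand-written 3-element vectors as stand-ins for real 1536-element embeddings, so the numbers are purely illustrative:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Made-up vectors: the two shirts point in a similar direction, the boots do not.
shirt = [0.9, 0.1, 0.0]
sun_shirt = [0.8, 0.2, 0.1]
boots = [0.0, 0.1, 0.9]

print(cosine_similarity(shirt, sun_shirt) > cosine_similarity(shirt, boots))
```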


Here, we create the embeddings for all of the text in the CSV document and store them in a vector store.

In [39]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

In [40]:
query = "Please suggest a shirt with sunblocking"

We can now leverage this vector store to find pieces of text similar to an incoming query using the `similarity_search` function.

In [46]:
docs = db.similarity_search(query)

print("Number of matching products: ", len(docs))
print("\nSample matching products:\n ", docs[0])


Number of matching products:  4

Sample matching products:
  page_content=': 255
name: Sun Shield Shirt by
description: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. 

Size & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.

Fabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.

Additional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.

Sun Protection That Won't Wear Off
Our high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.' metadata={'source': '..\\Data\\OutdoorClothingCatalog_1000.csv', 'row': 255}
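
Conceptually, a similarity search embeds the query, scores every stored vector against it, and returns the top `k` matches. The following is a toy sketch of that loop, not the `DocArrayInMemorySearch` implementation: word counts stand in for real embeddings, and `toy_embed`, `toy_similarity_search`, and the catalog strings are all hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: a sparse word-count vector (real embeddings are dense)."""
    return Counter(text.lower().split())


def sparse_cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_similarity_search(query, texts, k=4):
    """Return the k texts whose toy embeddings are closest to the query's."""
    query_vector = toy_embed(query)
    ranked = sorted(
        texts,
        key=lambda text: sparse_cosine(query_vector, toy_embed(text)),
        reverse=True,
    )
    return ranked[:k]


catalog = [
    "sun shield shirt with upf 50 sun protection",
    "waterproof hiking boots",
    "cotton shirt for casual wear",
    "insulated winter jacket",
    "wool camp socks",
]
results = toy_similarity_search("shirt with sun protection", catalog, k=2)
print(results)
```

The default of `k=4` here is chosen to match the four results returned in the output above.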


### 5B. Query the CSV with an LLM

To use this for Q&A over the documents, we first need to initialize a retriever. A retriever is a generic interface that takes in a query and returns documents. There are many ways to implement one; here, it is backed by the vector store.

In [47]:
retriever = db.as_retriever()
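
The value of the retriever abstraction is that calling code only needs one method that maps a query to documents, regardless of how the search is done. The sketch below illustrates the pattern with hypothetical classes (`ToyKeywordStore`, `ToyRetriever`) that rank by shared words instead of embeddings. The method name `get_relevant_documents` follows the retriever interface in the LangChain version used here, but everything else is an assumption for illustration.

```python
class ToyRetriever:
    """Thin wrapper exposing a single query -> documents method."""

    def __init__(self, store, k):
        self.store = store
        self.k = k

    def get_relevant_documents(self, query):
        return self.store.similarity_search(query, k=self.k)


class ToyKeywordStore:
    """Stand-in for a vector store: ranks texts by words shared with the query."""

    def __init__(self, texts):
        self.texts = list(texts)

    def similarity_search(self, query, k=4):
        terms = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: len(terms & set(text.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def as_retriever(self, k=4):
        return ToyRetriever(self, k)


store = ToyKeywordStore(["sun shirt upf 50", "hiking boots", "wool socks"])
toy_retriever = store.as_retriever(k=1)
print(toy_retriever.get_relevant_documents("sun shirt"))
```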

Next, we instantiate our language model so that we have a natural-language interface to the functionality above.

In [48]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)


Now, we combine the documents into a single piece of text. Here, I have limited it to the first 50 documents so that it fits within the context window.

In [62]:
loader = CSVLoader(file_path=file)
docs = loader.load()
docs = docs[:50]  # keep only the first 50 documents

# Concatenate the page contents into a single string
qdocs = "".join(doc.page_content for doc in docs)


Pass the text (documents) into the prompt for the LLM to retrieve the necessary products. 

In [63]:
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.") 
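
Passing all the text in one prompt only works while it fits in the context window, which is why the documents were capped at 50. A rough sketch of enforcing such a limit explicitly is shown below. It uses a character budget as a crude proxy for tokens, and `build_prompt` and `max_chars` are hypothetical names, not LangChain APIs.

```python
def build_prompt(documents, question, max_chars=2000):
    """Join documents until the character budget is reached, then append the question.

    Characters are only a rough stand-in for tokens, and the separators
    between documents are not counted against the budget.
    """
    selected = []
    used = 0
    for text in documents:
        if used + len(text) > max_chars:
            break  # the next document would exceed the budget
        selected.append(text)
        used += len(text)
    context = "\n\n".join(selected)
    return f"{context}\n\nQuestion: {question}"


prompt = build_prompt(["a" * 10, "b" * 10, "c" * 10], "Which shirts?", max_chars=25)
print(prompt)
```

Anything that does not fit is silently dropped, so the model cannot answer about those products. Retrieval addresses this by sending only the most relevant documents.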


In [64]:
display(Markdown(response))

| Name                                | Sun Protection | Summary                                                                                                                                                                                                                   |
|-------------------------------------|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Refresh Swimwear, V-Neck Tankini Contrasts | UPF 50+ rated | Watersport-ready tankini top made from recycled nylon with Lycra® spandex for stretch. Features lightweight racerback straps, V-neck silhouette, and offers SPF 50+ sun protection. |
| Performance Plus Woven Shirt         | SPF 50+        | Breathable summer shirt with quick-dry fabric, moisture-wicking, and abrasion-resistant construction. Provides SPF 50+ sun protection and is ideal for trail or travel.           |
| Angler's Athletic Shorts             | UPF 50+ rated | High-performance fly-fishing shorts with quick-drying lightweight fabric, four-way stretch, and active range of motion. Offers SPF 50+ sun protection.                        |

In [None]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

In [None]:
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

In [None]:
response = qa_stuff.run(query)

In [None]:
display(Markdown(response))
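
With `chain_type="stuff"`, the retrieved documents are all placed ("stuffed") into a single prompt, which is then sent to the LLM in one call. The sketch below shows that control flow with stand-in functions. The prompt wording is an assumption rather than the template `RetrievalQA` actually uses, and `stuff_chain`, `fake_retrieve`, and `echo_llm` are hypothetical.

```python
def stuff_chain(question, retrieve, llm):
    """Retrieve documents, stuff them into one prompt, and make one LLM call."""
    documents = retrieve(question)
    context = "\n\n".join(documents)
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return llm(prompt)


def fake_retrieve(question):
    """Stand-in retriever that always returns the same two product snippets."""
    return ["Sun Shield Shirt: UPF 50+", "Tropic Shirt: UPF 50+"]


def echo_llm(prompt):
    """Stand-in LLM that returns the prompt, so we can inspect what it received."""
    return prompt


answer = stuff_chain("Which shirts have sun protection?", fake_retrieve, echo_llm)
print(answer)
```

The trade-off is simplicity against size: one call sees all retrieved context at once, but the stuffed prompt must still fit in the model's context window.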

In [None]:
response = index.query(query, llm=llm)

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

Reminder: Download your notebook to your local computer to save your work.