# LangChain: Q&A over Documents

A typical use is a tool that lets you query a product catalog for items of interest. The script loads a CSV, builds an in-memory vector index from it, asks an LLM a question constrained to that data, and renders the answer as Markdown.

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
#from rich import pretty; pretty.install()
#from pprint import pprint as print
#%pprint on
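
The cell above only loads the `.env` file; it does not check that a key was actually found. A minimal sketch of such a check follows. `require_env` is a hypothetical helper (not part of the notebook or of `python-dotenv`), and the variable name `OPENAI_API_KEY` is assumed to be the one your OpenAI client reads.

```python
import os

def require_env(name, env=os.environ):
    """Return the value of an environment variable or fail with a clear message."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value

# Demonstrated with a stand-in mapping so the sketch runs without a real key.
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-placeholder"}))
```

In the notebook you would call `require_env("OPENAI_API_KEY")` right after `load_dotenv(...)` so that a missing key fails early rather than at the first API call.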

Note: LLMs do not always produce the same results. When executing the code in your notebook, you may get slightly different answers than those in the video.

In [2]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"
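
The date check above can also be written as a small function, which makes the cutoff behaviour easy to verify for specific dates. This is only a sketch: `pick_model` and `CUTOFF` are hypothetical names, while the model names and the date are copied from the cell above.

```python
import datetime

# Cutoff date and model names copied from the cell above.
CUTOFF = datetime.date(2024, 6, 12)

def pick_model(today: datetime.date) -> str:
    """Return the unpinned model name once the cutoff date has passed."""
    return "gpt-3.5-turbo" if today > CUTOFF else "gpt-3.5-turbo-0301"

llm_model = pick_model(datetime.date.today())
print(llm_model)
```

Note that the later cells pass `model="gpt-4o-mini"` directly to `ChatOpenAI`, so `llm_model` is not consumed by them.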

In [3]:
from langchain_community.document_loaders import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_core.prompts import ChatPromptTemplate
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_classic.chains import create_retrieval_chain
from IPython.display import display, Markdown

In [4]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding="utf-8")

In [5]:
docs = loader.load()

splits = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=150
).split_documents(docs)
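
To see what `chunk_size=1000` and `chunk_overlap=150` mean, here is a simplified fixed-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut at separators such as paragraph and line breaks, so its boundaries will differ); `split_with_overlap` is a hypothetical helper used only to illustrate the two parameters.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=150):
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks

# 2500 characters of dummy text standing in for a long product description.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.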

In [6]:
query = "Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

**Note**:
- The original version of this notebook used `langchain==0.0.179` and `openai==0.27.7`.
- With those library versions, `VectorstoreIndexCreator` used `text-davinci-003` as the base model, which has been deprecated since 1 January 2024.
- In that setup, the replacement model `gpt-3.5-turbo-instruct` was used for the `query`.
- The cells below do not use `VectorstoreIndexCreator`; they build the index directly and query `gpt-4o-mini`, so the `response` format might differ from the video.

In [7]:
# 2) Embed & index in memory (no legacy VectorstoreIndexCreator)
embeddings = OpenAIEmbeddings()
vectorstore = DocArrayInMemorySearch.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})

In [8]:
# run the search and retrieve the k documents most relevant to the query
docs = retriever.invoke(query)

In [9]:
len(docs)

6

In [10]:
print(docs[0])

page_content=': 618
name: Men's Tropical Plaid Short-Sleeve Shirt
description: Our lightest hot-weather shirt is rated UPF 50+ for superior protection from the sun's UV rays. With a traditional fit that is relaxed through the chest, sleeve, and waist, this fabric is made of 100% polyester and is wrinkle-resistant. With front and back cape venting that lets in cool breezes and two front bellows pockets, this shirt is imported and provides the highest rated sun protection possible. 

Sun Protection That Won't Wear Off. Our high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays.' metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 618}


In [11]:
# 3) LLM + prompt and retrieval chain (no legacy RetrievalQA)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer ONLY from the provided context. If unsure, say you don't know.\n\n{context}"),
    ("human", "{input}")
])
doc_chain = create_stuff_documents_chain(llm, prompt)
qa_chain = create_retrieval_chain(retriever, doc_chain)
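
The "stuff" step is conceptually simple: all retrieved documents are joined into one string and placed in the `{context}` slot of the prompt. The sketch below imitates that with plain string formatting. `stuff_prompt` is a hypothetical helper, and the blank-line separator is an assumption about the chain's default, not something shown in this notebook.

```python
SYSTEM_TEMPLATE = (
    "Answer ONLY from the provided context. "
    "If unsure, say you don't know.\n\n{context}"
)

def stuff_prompt(doc_texts, question, separator="\n\n"):
    """Join all retrieved texts into one context string and fill the two prompt slots."""
    context = separator.join(doc_texts)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]

messages = stuff_prompt(
    ["name: Sun Shield Shirt", "name: Men's TropicVibe Shirt, Short-Sleeve"],
    "Which shirts offer sun protection?",
)
for role, text in messages:
    print(f"[{role}] {text}")
```

Because every retrieved document goes into a single prompt, this approach only works while the combined text fits in the model's context window, which is one reason to keep `k` small.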

In [12]:
# 4) Ask
query = "Please list all your shirts with sun protection in a table in markdown and summarize each one."
result = qa_chain.invoke({"input": query})
answer = result["answer"]

In [13]:
display(Markdown(answer))

| Name                                      | Description Summary                                                                                          |
|-------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| Men's Tropical Plaid Short-Sleeve Shirt   | Lightweight, UPF 50+ sun protection, relaxed fit, 100% polyester, wrinkle-resistant, with cape venting and pockets. |
| Men's Plaid Tropic Shirt, Short-Sleeve    | Designed for fishing, UPF 50+ coverage, 52% polyester and 48% nylon, wrinkle-free, quick-drying, with cape venting and pockets. |
| Men's TropicVibe Shirt, Short-Sleeve      | Lightweight, UPF 50+ rated, traditional fit, 71% nylon and 29% polyester, wrinkle-resistant, with cape venting and pockets. |
| Sun Shield Shirt                          | Slightly fitted, UPF 50+ rated, 78% nylon and 22% Lycra, moisture-wicking, abrasion-resistant, handwash recommended. |
| Girls' Ocean Breeze Long-Sleeve Stripe Shirt | Long-sleeve rash guard, UPF 50+ rated, nylon Lycra-elastane blend, quick-drying, fade-resistant, machine washable. |
| Girls' Beachside Breeze Shirt, Half-Sleeve | Swim shirt with UPF 50+ protection, 80% nylon and 20% Lycra, snag and fade-resistant, durable fabric, machine washable. |

## Step By Step

In [14]:
from langchain_community.document_loaders import CSVLoader
loader = CSVLoader(file_path=file, encoding="utf-8")

In [15]:
docs = loader.load()

In [16]:
docs[0]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")
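
The output above shows the shape of each loaded document: one `column: value` line per CSV column, plus `source` and `row` metadata. The leading `: 0` appears because the first column of the file has an empty header. The following standard-library sketch reproduces that shape on a two-row stand-in file; `rows_to_documents` is a hypothetical helper, plain dicts stand in for `Document` objects, and the sample descriptions are shortened placeholders rather than real catalog text.

```python
import csv
import io

# Two-row stand-in for the catalog; the first header cell is empty,
# which is why page_content starts with ": 0".
raw = io.StringIO(
    ',name,description\n'
    '0,"Women\'s Campside Oxfords","Soft canvas Oxford."\n'
    '1,"Sun Shield Shirt","UPF 50+ rated sun shirt."\n'
)

def rows_to_documents(handle, source):
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(handle)):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents

documents = rows_to_documents(raw, "OutdoorClothingCatalog_1000.csv")
print(documents[0]["page_content"])
```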

In [17]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [18]:
embed = embeddings.embed_query("Hi my name is Harrison")

In [19]:
print(len(embed))

1536


In [20]:
print(embed[:5])

[-0.021954253315925598, 0.006774455308914185, -0.018215758726000786, -0.03919148072600365, -0.014013086445629597]
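
Embeddings are useful because texts with similar meaning get vectors that point in similar directions. A common way to measure that is cosine similarity, sketched below in plain Python. The three-dimensional vectors are made up for illustration (real embeddings here have 1536 dimensions), and treating cosine similarity as the metric the vector store uses is an assumption, since the metric is a store setting.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors standing in for 1536-dimensional embeddings.
shirt = [0.9, 0.1, 0.0]
sun_shirt = [0.8, 0.2, 0.1]
boots = [0.0, 0.2, 0.9]

print(cosine_similarity(shirt, sun_shirt) > cosine_similarity(shirt, boots))
```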


In [21]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

In [22]:
query = "Please suggest a shirt with sunblocking"

In [23]:
docs = db.similarity_search(query, k=4)

In [24]:
len(docs)

4

In [25]:
print(docs[0])

page_content=': 255
name: Sun Shield Shirt by
description: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. 

Size & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.

Fabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.

Additional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.

Sun Protection That Won't Wear Off
Our high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.' metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255}
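
Conceptually, `similarity_search(query, k=4)` embeds the query, scores every stored vector against it, and returns the `k` best matches. The sketch below shows that ranking step on a tiny hand-made index. The two-dimensional vectors and the `top_k` helper are hypothetical, and a real vector store may use a different metric or an approximate search.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, indexed, k=4):
    """Return the k (score, name) pairs whose vectors are closest to query_vec."""
    scored = ((cosine(query_vec, vec), name) for name, vec in indexed.items())
    return heapq.nlargest(k, scored)

# Made-up 2-dimensional vectors: axis 0 ~ "sun protection", axis 1 ~ "footwear".
index = {
    "Sun Shield Shirt": [0.95, 0.05],
    "Men's TropicVibe Shirt, Short-Sleeve": [0.85, 0.10],
    "Women's Campside Oxfords": [0.05, 0.95],
}
query_vec = [1.0, 0.0]
for score, name in top_k(query_vec, index, k=2):
    print(f"{score:.3f}  {name}")
```

Asking for more results than the index holds simply returns everything, ranked.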


In [26]:
retriever = db.as_retriever(search_kwargs={"k": 6})

In [27]:
llm = ChatOpenAI(temperature=0.0, model="gpt-4o-mini")

In [28]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer ONLY using the provided context. If unsure, say you don't know.\n\n{context}"),
    ("human", "{input}")
])

In [29]:
doc_chain = create_stuff_documents_chain(llm, prompt)
qa_chain = create_retrieval_chain(retriever, doc_chain)

In [30]:
# 5) Ask the final question (replaces call_as_llm / RetrievalQA.run / index.query)
final_q = "Please list all your shirts with sun protection in a table in markdown and summarize each one."
result = qa_chain.invoke({"input": final_q})
answer = result["answer"]

In [32]:
print(answer)

| Name                                      | Description Summary                                                                                          |
|-------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| Men's Tropical Plaid Short-Sleeve Shirt   | Lightweight, UPF 50+ sun protection, traditional fit, 100% polyester, wrinkle-resistant, with cape venting.   |
| Men's Plaid Tropic Shirt, Short-Sleeve    | Designed for fishing, UPF 50+ coverage, 52% polyester and 48% nylon, wrinkle-free, quick-drying, with cape venting. |
| Men's TropicVibe Shirt, Short-Sleeve      | Lightweight, UPF 50+ protection, traditional fit, 71% nylon and 29% polyester, wrinkle-resistant, with cape venting. |
| Sun Shield Shirt                          | Slightly fitted, UPF 50+ protection, 78% nylon and 22% Lycra, moisture-wicking, abrasion resistant, handwash.  |
| Girls' Ocean Breeze Long-Sleeve Stripe

The first part runs a vector similarity search to fetch the most relevant CSV rows/chunks. The second part passes those chunks through a retrieval chain and an LLM, which uses the prompt to turn them into a coherent answer (e.g., a Markdown table). In short: the first part retrieves; the second retrieves and generates.
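
The difference between the two stages can be sketched without any API calls. In the toy version below, retrieval is a word-overlap ranking and generation is a stub that formats the retrieved rows as a table; a real chain would use vector search and send the context to an LLM. All function names are hypothetical, and the `"input"` and `"context"` keys in the result are assumed for illustration (the notebook itself only reads `result["answer"]`).

```python
def retrieve(question, rows, k=2):
    """Stage 1: rank rows by shared words with the question (toy stand-in for vector search)."""
    words = set(question.lower().split())
    ranked = sorted(rows, key=lambda row: len(words & set(row.lower().split())), reverse=True)
    return ranked[:k]

def generate(question, context_rows):
    """Stage 2 stand-in: a real chain would send the question and context to an LLM."""
    lines = ["| Name |", "|------|"] + [f"| {row} |" for row in context_rows]
    return "\n".join(lines)

def retrieve_and_generate(question, rows):
    """Run both stages and return the question, the retrieved context, and the answer."""
    context = retrieve(question, rows)
    return {"input": question, "context": context, "answer": generate(question, context)}

catalog = ["sun shield shirt", "tropical plaid sun shirt", "campside oxfords"]
result = retrieve_and_generate("shirt with sun protection", catalog)
print(result["context"])
print(result["answer"])
```

Calling `retrieve` alone gives you raw rows to inspect; calling `retrieve_and_generate` gives you a formatted answer built only from those rows.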

Reminder: Download your notebook to your local computer to save your work.