<a href="https://colab.research.google.com/github/GiX007/agent-labs/blob/main/03_langchain/03_qna_over_docs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain: Q&A over Documents

This notebook uses LangChain to answer questions over a set of documents. One example of such an application is a tool that lets you query a product catalog for items of interest.

In [None]:
#pip install --upgrade langchain

In [None]:
import os

from dotenv import load_dotenv, find_dotenv
dotenv_path = find_dotenv() or '/content/OPENAI_API_KEY.env' # read local .env file
load_dotenv(dotenv_path)

import warnings
warnings.filterwarnings('ignore')

Note: LLMs do not always produce the same results. When you run the code in this notebook, you may get slightly different answers.

In [None]:
# Set the model variable based on the best and cheapest available choice at the current date
llm_model = "gpt-4o-mini"

In [None]:
!pip install langchain langchain-openai langchain-community



In [None]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAI
# Loaders and vector stores live in langchain-community (installed above)
from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

In [None]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)

In [None]:
from langchain.indexes import VectorstoreIndexCreator
from langchain_openai import OpenAIEmbeddings

In [None]:
!pip install docarray

Collecting docarray
  Downloading docarray-0.41.0-py3-none-any.whl.metadata (36 kB)
Collecting types-requests>=2.28.11.6 (from docarray)
  Downloading types_requests-2.32.4.20250913-py3-none-any.whl.metadata (2.0 kB)
Downloading docarray-0.41.0-py3-none-any.whl (302 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.8/302.8 kB 21.2 MB/s eta 0:00:00
Downloading types_requests-2.32.4.20250913-py3-none-any.whl (20 kB)
Installing collected packages: types-requests, docarray
Successfully installed docarray-0.41.0 types-requests-2.32.4.20250913


In [None]:
# Create a vector store index from documents using embeddings, so we can perform semantic search over the loaded data
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=OpenAIEmbeddings()
).from_loaders([loader])

In [None]:
query = "Please list all your shirts with sun protection in a table in markdown and summarize each one."

Although `llm_model` is defined for general use, it is often worth choosing a different model for querying the index, because that model may be cheaper, faster, or otherwise better suited to the task. Here the index query uses `llm_replacement_model`, a `gpt-3.5-turbo-instruct` completion model with `temperature=0` for more repeatable output, instead of `llm_model`. This separation gives us flexibility: we can reserve a more capable but slower and more expensive model for creative output, and use a cheaper, faster model for retrieval-backed or simpler tasks.

In [None]:
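# Completion-style model used only for this index query; temperature=0 keeps the output as repeatable as possible
llm_replacement_model = OpenAI(temperature=0, model='gpt-3.5-turbo-instruct')

# Retrieve the most relevant rows from the index and let the model answer from them
response = index.query(query, llm=llm_replacement_model)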

In [None]:
display(Markdown(response))



| Name | Description | Sun Protection Rating |
| --- | --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | Made of 100% polyester, UPF 50+ rating, wrinkle-resistant, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Men's Plaid Tropic Shirt, Short-Sleeve | Made of 52% polyester and 48% nylon, UPF 50+ rating, SunSmart technology, wrinkle-free, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Men's TropicVibe Shirt, Short-Sleeve | Made of 71% nylon and 29% polyester, UPF 50+ rating, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Sun Shield Shirt | Made of 78% nylon and 22% Lycra Xtra Life fiber, UPF 50+ rating, moisture-wicking, abrasion-resistant, fits over swimsuit | SPF 50+, blocks 98% of harmful UV rays |

## Step By Step

In [None]:
from langchain_community.document_loaders import CSVLoader
loader = CSVLoader(file_path=file)

In [None]:
docs = loader.load()

In [None]:
docs[0]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")
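The output above suggests that each CSV row becomes one document whose text is a list of `column: value` lines, with the source file and row number kept as metadata. The sketch below imitates that behaviour with the standard library only. It is an approximation for illustration, not `CSVLoader`'s actual implementation, and `rows_to_documents` and the sample row are hypothetical.

```python
import csv
import io


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with page_content and metadata (illustrative only)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, mirroring the page_content shown above
        page_content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {
                "page_content": page_content,
                "metadata": {"source": source, "row": row_number},
            }
        )
    return documents


# Made-up data: the unnamed first column plays the role of the catalog's index column
sample_csv = ",name,description\n0,Sample Shirt,A made-up row used only for illustration.\n"
sample_docs = rows_to_documents(sample_csv, "sample.csv")
print(sample_docs[0]["page_content"])
```

The unnamed first column explains why the real `page_content` starts with `: 0`: the column name is empty, so only the separator and the value remain.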

In [None]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
embed = embeddings.embed_query("Hi my name is Harrison")

In [None]:
print(len(embed))

1536


In [None]:
print(embed[:5])

[-0.02196465528695117, 0.006758838256223806, -0.018249490165056663, -0.03923515029463157, -0.014007174091135742]


In [None]:
# Create an in-memory vector store from documents using embeddings for fast semantic search
db = DocArrayInMemorySearch.from_documents(
    docs,
    embeddings
)

In [None]:
query = "Please suggest a shirt with sunblocking"

In [None]:
# Perform a semantic search in the vector store to find documents most similar to the query
docs = db.similarity_search(query)

# `db` is a vector store holding the embeddings of the documents.
# `db.similarity_search(query)` embeds the query and returns the most similar documents in the store; no LLM is involved in this step.
# An LLM is needed only later, to generate an answer or summary from the retrieved documents.
# In the earlier `index.query(...)` example, retrieval and answer generation both happened inside that single call.

In [None]:
len(docs)

4

In [None]:
docs[0]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255}, page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.')

In [None]:
# Convert the vector store into a retriever object that can be used to fetch relevant documents for a query
retriever = db.as_retriever()

In [None]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)

In [None]:
# Combine the text content of all retrieved documents into a single string
qdocs = "".join(doc.page_content for doc in docs)

In [None]:
# Use the LLM to process the combined documents and generate a markdown table summarizing shirts with sun protection
response = llm.invoke(f"{qdocs} Question: Please list all your shirts with sun protection in a table in markdown and summarize each one.")

In [None]:
display(Markdown(response.content))

Here's a table summarizing the shirts with sun protection in markdown format:

| Name                                   | Description Summary                                                                                                                                                                                                                     |
|----------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Sun Shield Shirt**                  | High-performance sun shirt with UPF 50+ protection, blocking 98% of UV rays. Slightly fitted, made of 78% nylon and 22% Lycra Xtra Life fiber. Moisture-wicking, abrasion-resistant, and comfortable over swimsuits. Handwash recommended. |
| **Men's Plaid Tropic Shirt, Short-Sleeve** | Lightweight, UPF 50+ rated shirt originally designed for fishing. Made of 52% polyester and 48% nylon, it features wrinkle-free fabric, evaporates perspiration quickly, and includes front and back cape venting with two bellows pockets.  |
| **Men's TropicVibe Shirt, Short-Sleeve** | Lightweight sun-protection shirt with UPF 50+ rating. Traditional fit, made of 71% nylon and 29% polyester with a 100% polyester knit mesh lining. Features include wrinkle resistance, cape venting, and two front bellows pockets.          |
| **Men's Tropical Plaid Short-Sleeve Shirt** | Lightest hot-weather shirt with UPF 50+ protection, made of 100% polyester. Traditional fit, wrinkle-resistant, with front and back cape venting and two front bellows pockets. Provides high sun protection by blocking 98% of UV rays.      |

This table provides a concise overview of each shirt's features and sun protection capabilities.

In [None]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

In [None]:
query = "Please list all your shirts with sun protection in a table in markdown and summarize each one."

In [None]:
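# `run` is deprecated in recent LangChain releases; `invoke` returns a dict whose "result" key holds the answer
response = qa_stuff.invoke({"query": query})["result"]



> Entering new RetrievalQA chain...

> Finished chain.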


In [None]:
display(Markdown(response))

Here is a table listing all the shirts with sun protection along with a summary of each:

| Name                                      | Description Summary                                                                                     |
|-------------------------------------------|--------------------------------------------------------------------------------------------------------|
| Men's Tropical Plaid Short-Sleeve Shirt   | Lightweight, UPF 50+ sun protection, relaxed fit, 100% polyester, wrinkle-resistant, with cape venting and two front pockets. |
| Men's Plaid Tropic Shirt, Short-Sleeve    | Designed for fishing, UPF 50+ coverage, made of 52% polyester and 48% nylon, wrinkle-free, evaporates perspiration, with cape venting and two front pockets. |
| Men's TropicVibe Shirt, Short-Sleeve      | Lightweight, UPF 50+ rated, traditional fit, made of 71% nylon and 29% polyester, wrinkle resistant, with cape venting and two front pockets. |
| Sun Shield Shirt                          | Slightly fitted, UPF 50+ rated, made of 78% nylon and 22% Lycra Xtra Life, moisture-wicking, abrasion resistant, recommended by The Skin Cancer Foundation. |

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

In [None]:
response = index.query(query, llm=llm)

In [None]:
display(Markdown(response))

Here is a table listing all the shirts with sun protection along with a summary of each:

| Name                                      | Description Summary                                                                                          |
|-------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| Men's Tropical Plaid Short-Sleeve Shirt   | Lightweight, UPF 50+ sun protection, relaxed fit, 100% polyester, wrinkle-resistant, with cape venting and two front pockets. |
| Men's Plaid Tropic Shirt, Short-Sleeve    | Designed for fishing, UPF 50+ coverage, made of 52% polyester and 48% nylon, wrinkle-free, evaporates perspiration, with cape venting and two front pockets. |
| Men's TropicVibe Shirt, Short-Sleeve      | Lightweight, UPF 50+ protection, traditional fit, made of 71% nylon and 29% polyester, wrinkle-resistant, with cape venting and two front pockets. |
| Sun Shield Shirt                          | Slightly fitted, UPF 50+ protection, made of 78% nylon and 22% Lycra Xtra Life, moisture-wicking, abrasion resistant, recommended by The Skin Cancer Foundation. |

In this notebook, we retrieve and summarize shirts with sun protection in three ways, and all three return essentially the same results. First, we pass the combined text of the retrieved documents (`qdocs`) directly to the LLM to generate a markdown table. Second, we use a `RetrievalQA` chain, in which the retriever fetches the relevant documents and the LLM summarizes them, automating both retrieval and answer generation. Third, we use `VectorstoreIndexCreator` to build an index from the documents and query it with the same LLM. Whichever method is used (manual concatenation, a retrieval QA chain, or an index query), the underlying information and the resulting table stay consistent apart from minor wording differences, which shows that LangChain offers several flexible ways to combine embeddings, retrieval, and LLMs for the same semantic search and summarization task.
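The retrieve-then-summarize pattern shared by all three approaches can be sketched in a few lines of plain Python. This is a simplified illustration and not LangChain's actual implementation: the real `RetrievalQA` chain uses its own prompt template, and `answer_with_stuffing`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins for the chain, the retriever, and the LLM.

```python
def answer_with_stuffing(question, retrieve, generate):
    """Fetch documents, stuff them all into one prompt, and ask the model once."""
    documents = retrieve(question)
    context = "\n\n".join(documents)
    prompt = f"{context}\n\nQuestion: {question}"
    return generate(prompt)


def fake_retrieve(question):
    """Stand-in for a retriever: always returns the same two made-up documents."""
    return [
        "name: Sample Sun Shirt\ndescription: UPF 50+ rated.",
        "name: Sample Trail Tee\ndescription: Lightweight cotton.",
    ]


def fake_generate(prompt):
    """Stand-in for an LLM call: reports how many documents reached the prompt."""
    return f"Prompt contained {prompt.count('name:')} documents"


answer = answer_with_stuffing("Which shirts offer sun protection?", fake_retrieve, fake_generate)
print(answer)
```

Because every retrieved document goes into a single prompt, this approach needs only one model call, but it works only while the combined text fits in the model's context window.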