<a href="https://colab.research.google.com/github/SomeiLam/langchain-example/blob/main/LangChain_q%26a_over_documents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain
!pip install openai
!pip install langchain_community
!pip install -U langchain-openai
!pip install docarray

In [3]:
import os
import openai
from google.colab import userdata
api_key = userdata.get("OPENAI_API_KEY")
os.environ["OPENAI_API_KEY"] = api_key
openai.api_key = os.environ['OPENAI_API_KEY']
llm_model = "gpt-3.5-turbo-0301"

In [4]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAI
# Loaders and vector stores live in langchain_community (installed above);
# the old langchain.* paths are deprecated re-exports.
from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

In [5]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)

In [13]:
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings()

In [14]:
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.vectorstores import DocArrayInMemorySearch

index_creator = VectorstoreIndexCreator(
    embedding=embedding_model,
    vectorstore_cls=DocArrayInMemorySearch,
)
index = index_creator.from_loaders([loader])




In [15]:
query = "Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

In [18]:
llm_replacement_model = OpenAI(temperature=0,
                               model='gpt-3.5-turbo-instruct')

response = index.query(query,
                       llm = llm_replacement_model)

In [19]:
display(Markdown(response))



| Name | Description | Sun Protection Rating |
| --- | --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | Made of 100% polyester, UPF 50+ rating, wrinkle-resistant, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Men's Plaid Tropic Shirt, Short-Sleeve | Made of 52% polyester and 48% nylon, UPF 50+ rating, SunSmart technology, wrinkle-free, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Men's TropicVibe Shirt, Short-Sleeve | Made of 71% nylon and 29% polyester, UPF 50+ rating, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Sun Shield Shirt | Made of 78% nylon and 22% Lycra Xtra Life fiber, UPF 50+ rating, moisture-wicking, abrasion-resistant, fits over swimsuit | SPF 50+, blocks 98% of harmful UV rays |

## Step by Step

In [20]:
from langchain_community.document_loaders import CSVLoader
loader = CSVLoader(file_path=file)

In [21]:
docs = loader.load()

In [22]:
docs[0]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

In [23]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [24]:
embed = embeddings.embed_query("Hi my name is Harrison")

In [25]:
print(len(embed))

1536


In [26]:
print(embed[:5])

[-0.02196465528695117, 0.006758838256223806, -0.018249490165056663, -0.03923515029463157, -0.014007174091135742]


In [27]:
db = DocArrayInMemorySearch.from_documents(
    docs,
    embeddings
)

In [28]:
query = "Please suggest a shirt with sunblocking"

In [29]:
docs = db.similarity_search(query)

In [30]:
len(docs)

4

In [31]:
docs[0]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255}, page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.')
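
The lookup above can be pictured as a nearest-neighbour search over embedding vectors. The sketch below is only an illustration of that idea, not how `DocArrayInMemorySearch` is implemented: the three-dimensional vectors, the `toy_store` contents, and the `top_k` helper are made up for the example, and cosine similarity is assumed as the metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=4):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings"; the real ones above have 1536 dimensions.
toy_store = [
    ("Sun Shield Shirt", [0.9, 0.1, 0.0]),
    ("Women's Campside Oxfords", [0.0, 0.2, 0.9]),
    ("Men's TropicVibe Shirt", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]  # stands in for the embedded query text

results = top_k(toy_query, toy_store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The two shirt entries rank first because their vectors point in nearly the same direction as the query vector, while the shoe entry is orthogonal to it.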

In [46]:
retriever = db.as_retriever()

In [41]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.9)

In [34]:
# Separate the documents with blank lines so adjacent records do not run together.
qdocs = "\n\n".join(doc.page_content for doc in docs)

In [42]:
response = llm.invoke(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.")

In [44]:
display(Markdown(response.content))

| Name                                | Description                                                                                                                                                                                                                                                                      |
|-------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Sun Shield Shirt by                 | This high-performance sun shirt is guaranteed to protect from harmful UV rays, with UPF 50+ rated sun protection. Made of 78% nylon and 22% Lycra Xtra Life fiber, it wicks moisture, fits comfortably over swimsuits, and is abrasion resistant.                                                   |
| Men's Plaid Tropic Shirt, Short-Sleeve | Ultracomfortable sun protection rated UPF 50+, great for fishing and travel. Made of 52% polyester and 48% nylon, this wrinkle-free, moisture-wicking shirt features front and back venting, bellows pockets, and blocks 98% of harmful UV rays.                                   |
| Men's TropicVibe Shirt, Short-Sleeve   | This sun-protection shirt for men has UPF 50+ coverage, with a traditional fit made of 71% Nylon and 29% Polyester. It is wrinkle-resistant, features front and back venting, bellows pockets, and blocks 98% of harmful UV rays.                                                   |
| Men's Tropical Plaid Short-Sleeve Shirt | This hot-weather shirt is rated UPF 50+ for superior sun protection, with a traditional fit and made of 100% polyester. It is wrinkle-resistant, features front and back venting, bellows pockets, and blocks 98% of harmful UV rays.                                   |

In [47]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

In [49]:
query = "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

In [52]:
response = qa_stuff.invoke(query)



> Entering new RetrievalQA chain...

> Finished chain.


In [51]:
# invoke() returns a dict; the answer text is stored under "result".
display(Markdown(response["result"]))

Sure! Here is a table in markdown format listing the shirts with sun protection along with a summarized description for each:

| Shirt Name                                | Description                                                                                                                                                                                                         |
|-------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Men's Tropical Plaid Short-Sleeve Shirt   | Made of 100% polyester, this shirt offers UPF 50+ protection, blocking 98% of the sun's harmful rays. It has front and back venting, two front bellows pockets, and is wrinkle-resistant.                                 |
| Men's Plaid Tropic Shirt, Short-Sleeve    | This shirt is made of 52% polyester and 48% nylon, providing UPF 50+ protection. It features front and back cape venting, two front bellows pockets, and is wrinkle-free with quick perspiration evaporation.           |
| Men's TropicVibe Shirt, Short-Sleeve      | With a shell of 71% nylon and 29% polyester, this shirt offers UPF 50+ protection. It is wrinkle-resistant, has front and back cape venting, two front bellows pockets, and is machine washable and dryable.        |
| Sun Shield Shirt                          | This shirt is made of 78% nylon and 22% Lycra Xtra Life fiber, providing UPF 50+ protection. It is quick-drying, abrasion-resistant, fits comfortably over swimsuits, and is recommended by The Skin Cancer Foundation.   |

Each shirt provides UPF 50+ sun protection, blocks 98% of harmful UV rays, and offers additional features like wrinkle-resistance, venting, and quick-drying comfort.

In [53]:
response = index.query(query, llm=llm)

In [54]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

In [55]:
index

VectorStoreIndexWrapper(vectorstore=<langchain_community.vectorstores.docarray.in_memory.DocArrayInMemorySearch object at 0x785bef1710d0>)

## Chain Types
### 1. **"stuff"**
Retrieves the top k documents, concatenates all of them (plus the question) into a single prompt, and sends that in one go to the LLM.
```python
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)
# Under the hood it does:
# prompt = f"Context: {doc1}\n{doc2}\n... Question: {query}"
# llm.invoke(prompt)
```
 - Pros:
    - Simplicity: only one LLM call.
    - Low orchestration overhead: easiest to debug and monitor.
    - Cost-effective when your total context fits comfortably under token limits.
 - Cons:
    - Token limits: the prompt can easily exceed the model's context window if you retrieve many or large docs.
    - No parallelism: everything goes through a single call, so there is nothing to batch or split across workers.
 - Best for:
    - Cases where the retrieved documents are few and short enough to fit in one prompt with room left for the answer.
    - Quick prototypes where ease of implementation outweighs scale concerns.
    - Low-latency scenarios where one LLM call is preferable.
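
As a rough illustration of the "stuff" strategy and its token-limit trade-off, the sketch below builds one prompt from all documents and checks it against a budget before making a single call. The helper names, the `fake_llm` stand-in, and the four-characters-per-token heuristic are assumptions made for the example; a real tokenizer and model would replace them.

```python
def build_stuff_prompt(docs, question):
    """Concatenate every retrieved document and the question into one prompt."""
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}"

def fits_budget(prompt, max_tokens, chars_per_token=4):
    """Rough size check; ~4 characters per token is a heuristic, not a tokenizer."""
    return len(prompt) / chars_per_token <= max_tokens

def fake_llm(prompt):
    """Stand-in for a real model call; just reports the prompt size."""
    return f"(answer based on {len(prompt)} characters of prompt)"

docs = [
    "name: Sun Shield Shirt\nUPF 50+ rated.",
    "name: Men's TropicVibe Shirt\nUPF 50+ coverage.",
]
question = "Which shirts offer sun protection?"

prompt = build_stuff_prompt(docs, question)
if fits_budget(prompt, max_tokens=4096):
    answer = fake_llm(prompt)  # exactly one call, regardless of len(docs)
else:
    answer = None  # too large: retrieve fewer docs or switch chain type
print(answer)
```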

### 2. **"map_reduce"**
**Map step**: for each retrieved doc, call the LLM separately with a “map prompt” (question + single chunk) to produce a partial answer.

**Reduce step**: concatenate all partial answers into one “combine prompt” and call the LLM again to synthesize a final answer.

```python
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

# Define the “map” prompt: run once per document chunk
map_prompt = PromptTemplate(
    input_variables=["question", "context"],
    template=(
        "You are an expert assistant.  Use ONLY the information in the provided context to answer the question.\n\n"
        "Question: {question}\n\n"
        "Context:\n"
        "{context}\n\n"
        "Provide a concise answer (1–2 sentences)."
    )
)

# Define the “combine” prompt: run once over all partial answers.
# By default the chain injects the partial answers under the name "summaries".
combine_prompt = PromptTemplate(
    input_variables=["question", "summaries"],
    template=(
        "You are an expert assistant.  The user asked:\n\n"
        "{question}\n\n"
        "You have been given several partial answers, each based on different sources:\n\n"
        "{summaries}\n\n"
        "Synthesize these into one final, comprehensive answer.  "
        "Do not introduce new information; only use what’s in the partial answers."
    )
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever,
    chain_type_kwargs={
      "question_prompt": map_prompt,      # the per-chunk (“map”) prompt
      "combine_prompt":  combine_prompt,
    },
)
```
 - Pros:
    - Handles large corpora: each chunk is processed individually, so no single map call has to hold all retrieved text.
    - Parallelizable: the map calls can be issued in parallel.
    - Fine-grained control: you can customize both map and reduce prompts independently.
 - Cons:
    - One call per chunk plus one combine call: higher latency and cost than a single "stuff" call.
    - Complexity: need to manage two prompt templates and ensure consistency.
 - Best for:
    - Large documents split into many chunks (e.g. books, long reports).
    - High-throughput scenarios where you can parallelize the map phase.
    - Use cases requiring both local chunk reasoning and global synthesis.
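
The control flow can be sketched without any LangChain machinery. In the example below, `fake_llm` is a hypothetical stand-in that records each prompt and echoes its last line, and the thread pool is one assumed way to run the map calls concurrently; the point is only the shape of the calls.

```python
from concurrent.futures import ThreadPoolExecutor

calls = []  # records every prompt sent to the stand-in model

def fake_llm(prompt):
    """Stand-in for a real model call; echoes the prompt's last line."""
    calls.append(prompt)
    return prompt.splitlines()[-1]

def map_step(question, chunk):
    """Answer the question from a single chunk."""
    return fake_llm(f"Question: {question}\nContext:\n{chunk}")

def map_reduce(question, chunks, max_workers=4):
    # Map: one independent call per chunk, so the calls can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda c: map_step(question, c), chunks))
    # Reduce: one more call that sees only the partial answers.
    summaries = "\n".join(partials)
    return fake_llm(f"Question: {question}\nPartial answers:\n{summaries}")

chunks = [
    "Sun Shield Shirt: UPF 50+",
    "TropicVibe Shirt: UPF 50+",
    "Campside Oxfords: canvas shoe",
]
final = map_reduce("Which items have sun protection?", chunks)
print(len(calls), "LLM calls for", len(chunks), "chunks")
```

With three chunks the stand-in is called four times: three map calls and one combine call.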


### 3. **"refine"**
**Initial draft**: run a first pass on the first retrieved document to get a draft answer.

**Refinement passes**: feed each remaining document, one at a time, together with the current draft back to the LLM to refine the answer.

```python
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

# INITIAL PROMPT: run on the first chunk → create the first draft
initial_prompt = PromptTemplate(
    input_variables=["question", "context"],
    template=(
        "You are an expert assistant.  Answer the user’s question using ONLY the provided context.\n\n"
        "Question: {question}\n\n"
        "Context:\n"
        "{context}\n\n"
        "===\n"
        "Give a complete, concise answer in one or two paragraphs."
    )
)

# REFINE PROMPT: run for each subsequent chunk → refine the draft
refine_prompt = PromptTemplate(
    input_variables=["question", "existing_answer", "context"],
    template=(
        "You are an expert assistant.  The user asked:\n\n"
        "{question}\n\n"
        "We already have this draft answer:\n\n"
        "{existing_answer}\n\n"
        "Here is an additional piece of context:\n\n"
        "{context}\n\n"
        "===\n"
        "Please update or expand the draft answer to incorporate any new relevant information from this context. "
        "Do NOT remove any correct information from the existing answer, and do NOT introduce anything not supported by the context."
    )
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",
    retriever=retriever,
    chain_type_kwargs={
      "question_prompt": initial_prompt,  # prompt used for the first chunk
      "refine_prompt":   refine_prompt,
      # Both templates use {context}; the chain's default name is "context_str".
      "document_variable_name": "context",
    },
)
```
 - Pros:
    - Progressive improvement: drafts get more detailed or accurate with each pass.
    - Context control: each pass sees only the current draft plus one new document, so no single prompt has to hold every retrieved document.
    - Better quality for complex questions that benefit from iterative context injection.

 - Cons:
    - Multiple sequential calls: each pass depends on the previous draft, so the calls cannot run in parallel and latency grows with the number of documents.
    - Prompt engineering: you need at least two well-tuned prompts (initial vs refine).

 - Best for:
    - Deep reasoning tasks where initial answers need targeted correction or expansion.
    - Conversational or interactive flows where you can refine based on user feedback.
    - Medium-sized corpora (e.g. 5–10 chunks) where you want iterative focus rather than full map/reduce.
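
The sequential dependency is easiest to see in a stripped-down loop. This is a minimal sketch with a hypothetical `fake_llm` that returns numbered drafts; the prompt wording is illustrative and not LangChain's default.

```python
def refine(question, chunks, llm):
    """Draft from the first chunk, then revise once per remaining chunk."""
    answer = llm(f"Question: {question}\nContext:\n{chunks[0]}")
    # Each later call sees the current draft plus one new chunk,
    # so call i cannot start until call i-1 has finished.
    for chunk in chunks[1:]:
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer:\n{answer}\n"
            f"New context:\n{chunk}"
        )
    return answer

prompts = []  # every prompt the stand-in model receives, in order

def fake_llm(prompt):
    """Stand-in model: records the prompt and returns a numbered draft."""
    prompts.append(prompt)
    return f"draft {len(prompts)}"

chunks = ["chunk A", "chunk B", "chunk C"]
final = refine("Which shirts have sun protection?", chunks, fake_llm)
print(final)
```

Three chunks produce three calls, and every prompt after the first carries the previous draft forward.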

### 4. **"map_rerank"**
**Map**: like “map_reduce,” generate a candidate answer for each chunk, but have the prompt also ask the model for a score indicating how fully that chunk answers the question.

**Rerank**: parse the score out of each reply and return the answer with the highest score. In LangChain the ranking is done by comparing the parsed scores, not by an additional LLM call, so the chain takes a single prompt whose output parser extracts the answer and the score.
```python
from langchain.chains import RetrievalQA
from langchain.output_parsers import RegexParser
from langchain_core.prompts import PromptTemplate

# Parser that splits each reply into the answer text and its numeric score
output_parser = RegexParser(
    regex=r"(.*?)\nScore: (\d+)",
    output_keys=["answer", "score"],
)

# MAP PROMPT: generate one scored candidate answer per chunk
map_rerank_prompt = PromptTemplate(
    input_variables=["question", "context"],
    template=(
        "You are a helpful assistant.  Answer the user’s question using ONLY the information below, "
        "then rate how fully the context answers it.\n\n"
        "Question: {question}\n\n"
        "Context:\n"
        "{context}\n\n"
        "===\n"
        "Reply in exactly this format:\n"
        "<concise answer, one or two sentences>\n"
        "Score: <integer from 0 to 100>"
    ),
    output_parser=output_parser,
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_rerank",
    retriever=retriever,
    chain_type_kwargs={
      "prompt": map_rerank_prompt,
    },
)
```
 - Pros:
    - Best-of-N selection: you get the single strongest candidate instead of a synthesized merge.
    - Reduced hallucination: the returned answer comes from one chunk, so potentially conflicting info is never merged.

 - Cons:
    - N calls: one scored map call per doc, with the ranking done on the parsed scores.
    - Candidate loss: doesn’t produce a cohesive final answer, it just picks one chunk’s response.

 - Best for:
    - Factoid QA where one chunk likely contains the correct answer (e.g. definition lookup).
    - Ensemble scenarios where you want the model to choose the best chunk-level answer.
    - Token-sensitive cases where merging would inflate context unnecessarily.
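
One way to pick the best candidate without a further model call is to have each per-chunk reply carry its own score and compare the parsed numbers. The sketch below assumes replies end with a `Score: N` line; `fake_llm`, the regex, and the helper names are hypothetical and only illustrate the selection step.

```python
import re

# Assumed reply format: answer text, a newline, then "Score: <integer>"
SCORE_PATTERN = re.compile(r"(.*?)\nScore: (\d+)", re.DOTALL)

def parse_scored_answer(text):
    """Return (answer, score); replies without a score line get score 0."""
    match = SCORE_PATTERN.search(text)
    if match is None:
        return text.strip(), 0
    return match.group(1).strip(), int(match.group(2))

def map_rerank(question, chunks, llm):
    # One call per chunk; each reply carries its own score.
    candidates = [parse_scored_answer(llm(question, chunk)) for chunk in chunks]
    # Selection is a plain comparison of the parsed scores.
    return max(candidates, key=lambda pair: pair[1])

def fake_llm(question, chunk):
    """Stand-in model: confident only when the chunk mentions UPF."""
    if "UPF" in chunk:
        return f"{chunk} protects against the sun.\nScore: 90"
    return "This document does not answer the question.\nScore: 0"

chunks = ["Campside Oxfords", "Sun Shield Shirt (UPF 50+)"]
best_answer, best_score = map_rerank(
    "Which item has sun protection?", chunks, fake_llm
)
print(best_score, best_answer)
```

Only the winning candidate is returned; the other replies are discarded rather than merged.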