# Question Answering with LangChain, OpenAI, and MultiQuery Retriever

This interactive workbook demonstrates how to use LangChain's [MultiQuery Retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) to generate similar queries for a given user input and apply all of them to retrieve a larger set of relevant documents from a vector store.

Before we begin, we split the fictional workplace documents into passages with `langchain`, use OpenAI to transform these passages into embeddings, and store them in Elasticsearch.

We will then ask a question, generate similar questions using langchain and OpenAI, retrieve relevant passages from the vector store, and use langchain and OpenAI again to provide a summary for the questions.
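The core idea can be sketched without any external services. The example below is a toy illustration, not LangChain's implementation: the passages, the keyword-based `toy_retrieve` function, and the hard-coded query variants are all hypothetical stand-ins for the Elasticsearch index, the vector search, and the LLM-generated queries.

```python
# Toy corpus standing in for the passages stored in Elasticsearch (hypothetical data).
PASSAGES = {
    "p1": "employees may work from home full time with supervisor approval",
    "p2": "vacation time accrues monthly and requires supervisor approval",
    "p3": "the sales organization is split into regional teams",
}


def toy_retrieve(query: str) -> list[str]:
    """Return ids of passages sharing at least one word with the query."""
    words = set(query.lower().split())
    return [pid for pid, text in PASSAGES.items() if words & set(text.split())]


def multi_query_retrieve(queries: list[str]) -> list[str]:
    """Run every query variant and return the de-duplicated union of hits."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for pid in toy_retrieve(query):
            if pid not in seen:
                seen.add(pid)
                merged.append(pid)
    return merged


# In the notebook the variants come from the LLM; here they are hard-coded.
variants = ["remote work policy", "can employees work from home", "vacation rules"]
result = multi_query_retrieve(variants)
print(result)
```

A single query such as "remote work policy" only reaches one passage in this toy corpus, while the union over all variants reaches more, which is the effect the retriever aims for.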

## Install packages and import modules

In [64]:
!python3 -m pip install -qU jq lark langchain langchain-elasticsearch langchain_openai tiktoken




In [65]:
!pip install elasticsearch



In [66]:
!pip install -q python-dotenv langchain langchain-elasticsearch langchain-openai tiktoken

In [67]:
# quote the requirement so the shell does not treat "<2" as input redirection
!pip install "numpy<2"


In [68]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_elasticsearch import ElasticsearchStore
from langchain_openai.llms import OpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
from getpass import getpass
import os

## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. 

Because we are using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.

We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our Elastic Cloud deployment, which makes it easy to create the index and add data. We will later pass it the list of documents prepared in the steps below.

In [69]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# https://platform.openai.com/api-keys
OPENAI_API_KEY = getpass("OpenAI API key: ")

embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

vectorstore = ElasticsearchStore(
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
    index_name= "qa_workplace",
    embedding=embeddings,
)

## Data PreProcessing
## Indexing Data into Elasticsearch
Let's download the sample dataset and deserialize the document.

In [70]:
from urllib.request import urlopen # this just lets me grab stuff from the internet
import json

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/chatbot-rag-app/data/data.json"

# open the link and read what's inside
response = urlopen(url) 


# turn that downloaded JSON file into actual Python data 
data = json.load(response)

# save that data into a file locally so we don’t have to redownload it every time
with open("temp.json", "w") as json_file:         
    json.dump(data, json_file)

In [71]:

# just peek at the first item to see the structure
print(json.dumps(data[0], indent=2))


{
  "content": "Effective: March 2020\nPurpose\n\nThe purpose of this full-time work-from-home policy is to provide guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations during the COVID-19 pandemic and beyond.\nScope\n\nThis policy applies to all employees who are eligible for remote work as determined by their role and responsibilities. It is designed to allow employees to work from home full time while maintaining the same level of performance and collaboration as they would in the office.\nEligibility\n\nEmployees who can perform their work duties remotely and have received approval from their direct supervisor and the HR department are eligible for this work-from-home arrangement.\nEquipment and Resources\n\nThe necessary equipment and resources will be provided to employees for remote work, including a company-issued laptop, software licenses, and access to secure communication tools. Employees are res

In [72]:
print(data[0].keys())

dict_keys(['content', 'summary', 'name', 'url', 'created_on', 'updated_at', 'category', '_run_ml_inference', 'rolePermissions'])


`JSONLoader` depends on the `jq` package. Install it here if loading the JSON list raises an error.

In [73]:
!pip install jq



### Split Documents into Passages

We’ll chunk documents into passages in order to improve the retrieval specificity and to ensure that we can provide multiple passages within the context window of the final question answering prompt.

Here we are chunking documents into 800 token passages with an overlap of 400 tokens.

Here we are using a simple splitter, but LangChain offers more advanced splitters that reduce the chance of context being lost.
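To make the effect of overlap concrete, here is a minimal sliding-window sketch. It is an assumption-laden simplification: it works on a ready-made token list and cuts fixed windows, whereas the real splitter prefers natural boundaries such as paragraph breaks. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Small numbers keep the output readable; the notebook uses 800 and 400.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

With an overlap of half the chunk size, every token away from the edges appears in two passages, so a sentence cut at one boundary is still intact in the neighbouring chunk.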

In [74]:
from langchain.document_loaders import JSONLoader

# helps us break long text into chunks (so the AI doesn't get overwhelmed lol)
from langchain.text_splitter import RecursiveCharacterTextSplitter


# this function is to pull some extra info from the data - here so the loader doesn't cry lol
def metadata_func(record: dict, metadata: dict) -> dict:
    # automatically grab all relevant metadata from each record
    metadata["name"] = record.get("name")
    metadata["summary"] = record.get("summary")
    metadata["url"] = record.get("url")
    metadata["category"] = record.get("category")
    metadata["created_on"] = record.get("created_on")
    metadata["updated_at"] = record.get("updated_at")

    return metadata

# For more loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/
# And 3rd party loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/#third-party-loaders
loader = JSONLoader(
    file_path="temp.json",  # our data file from earlier
    jq_schema=".[]",        # treat it like a list of records
    content_key="content",   # this is the actual text we're chunking
    metadata_func=metadata_func,  # func we just defined to pull out extra info
)

In [75]:
#text splitter to chunk the docs into small parts
# helps with better question answering because AI likes bite-sized pieces
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=800, chunk_overlap=400 
)


docs = loader.load_and_split(text_splitter=text_splitter)

In [76]:
print(f"Loaded {len(docs)} documents from {loader.file_path}")



Loaded 15 documents from C:\Users\44758\Desktop\IronHack\week7\day4\lab-chatbot-with-multi-query-retriever\temp.json


In [77]:
# show how many total chunks were created after splitting
print(f"📦 Total chunks after splitting: {len(docs)}")


📦 Total chunks after splitting: 15


### Bulk Import Passages

Now that we have split each document into passages of up to 800 tokens, we will index the data into Elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).

We will use the Cloud ID, API key, and index name values set in the `Connect to Elasticsearch` step.
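Conceptually, indexing stores one embedding vector per passage, and a search ranks passages by how close their vectors are to the query vector. The sketch below illustrates that idea in memory with cosine similarity. The three-dimensional vectors and the `index` list are made up for illustration, and the similarity actually used by Elasticsearch depends on how the index is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"; real ones have far more dimensions.
index = [
    ("work-from-home policy", [0.9, 0.1, 0.0]),
    ("vacation policy", [0.1, 0.9, 0.0]),
    ("sales organization", [0.0, 0.1, 0.9]),
]


def top_k(query_vector: list[float], k: int = 2) -> list[str]:
    """Return the k passage labels whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


nearest = top_k([0.8, 0.3, 0.0])
print(nearest)
```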

In [78]:
#This pushes all my chunks (docs) into Elasticsearch to be used for smart search later
documents = vectorstore.from_documents(
    docs,                                 # the list of text chunks made earlier
    embeddings,                            # convert text → numbers (OpenAIEmbeddings)
    index_name="qa_workplace",
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
)

# 🔮Setting up OpenAI to be our LLM brain

In [79]:
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)  # temperature=0 = most deterministic, least creative output; a good fit for Q&A

# 🧠 MultiQueryRetriever = Smart Searcher

In [80]:
retriever = MultiQueryRetriever.from_llm(
    vectorstore.as_retriever(), # turn our Elastic vectorstore into a retriever
    llm                         # to generate multiple query variations
)

# Question Answering with MultiQuery Retriever

Now that we have the passages stored in Elasticsearch, we can ask a question to retrieve the relevant passages.

- FINAL BOSS LEVEL

In [81]:
# RunnableParallel: get me the docs AND the question at the same time
# RunnablePassthrough: literally just passes the question along unchanged
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import format_document

# turn on logging so we can see the multiple search queries GPT creates when you ask a question
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)


# 💬 Set Up the Prompt That Talks to GPT

In [82]:
LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Be as verbose and educational in your response as possible. 
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(

    #So the AI sees:
    """  
---
SOURCE: {name}
{page_content}
---
"""
)


# This function takes the documents and formats them into a string that the LLM can understand.
# like 'Here GPT, read this before you answer!'
def _combine_documents(
    docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)

# Build the Chain That Powers Q&A
_context = RunnableParallel(
    context=retriever | _combine_documents,  # Step 1: Get relevant chunks & format them
    question=RunnablePassthrough(),          # Step 2: Just pass the user's question through
)
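To see roughly what the `context` string looks like by the time it reaches the prompt, here is a plain-Python sketch of the same formatting step. It assumes documents are simple dictionaries rather than LangChain `Document` objects, and the sample documents and the name `combine_documents_sketch` are hypothetical.

```python
# Roughly the same layout as LLM_DOCUMENT_PROMPT, written as a plain format string.
DOCUMENT_TEMPLATE = "\n---\nSOURCE: {name}\n{page_content}\n---\n"


def combine_documents_sketch(docs: list[dict], separator: str = "\n\n") -> str:
    """Render each document with the template and join the results."""
    return separator.join(
        DOCUMENT_TEMPLATE.format(name=doc["name"], page_content=doc["page_content"])
        for doc in docs
    )


# Hypothetical documents; in the notebook these come from the retriever.
sample_docs = [
    {"name": "Work From Home Policy", "page_content": "Effective: March 2020"},
    {"name": "Company Vacation Policy", "page_content": "Vacation time accrues monthly."},
]
context = combine_documents_sketch(sample_docs)
print(context)
```

Labelling each passage with its source name lets the model refer to the policy it is drawing from in its answer.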

# 🤖 Hook It Up to OpenAI

In [83]:
chain = _context | LLM_CONTEXT_PROMPT | llm
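The `|` operator reads as "feed the output of the left step into the right step". The sketch below shows one way such piping could be built in plain Python. It is not LangChain's implementation: the `Step` class and the three toy stages are hypothetical, and the fake LLM just reports the prompt length.

```python
class Step:
    """Hypothetical stand-in for a chain component; wraps a plain function."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Three toy stages mirroring context building, prompt formatting, and the LLM call.
build_context = Step(lambda q: {"context": "<retrieved passages>", "question": q})
format_prompt = Step(
    lambda d: f"context: {d['context']}\nQuestion: \"{d['question']}\"\nAnswer:"
)
fake_llm = Step(lambda prompt: f"(model reply to {len(prompt)} prompt characters)")

toy_chain = build_context | format_prompt | fake_llm
reply = toy_chain.invoke("what is the sales team?")
print(reply)
```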

# ❓ Finally Ask a Question!

In [84]:
ans = chain.invoke("what is the nasa sales team?")

print("---- Answer ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide information on the sales team at NASA?', '2. How does the sales team operate within NASA?', '3. What are the responsibilities of the NASA sales team?']


---- Answer ----
The NASA sales team is a part of the Americas region in the sales organization of the company. It is led by two Area Vice-Presidents, Laura Martinez for North America and Gary Johnson for South America. The team is responsible for promoting and selling the company's products and services in the North and South American markets. They work closely with other departments, such as marketing, product development, and customer support, to ensure the company's success in these regions.


**Generate at least two new iterations of the previous cells - Be creative.** Did you master Multi-Query Retriever concepts through this lab?

In [89]:
ans_1 = chain.invoke("How did the company support employees during the pandemic?")
print("---- Answer 1 ----")
print(ans_1)


INFO:langchain.retrievers.multi_query:Generated queries: ['1. What measures did the company take to assist employees during the pandemic?', '2. In what ways did the company provide support for its employees during the pandemic?', '3. How was the company able to aid its employees during the pandemic?']


---- Answer 1 ----

The company supported employees during the pandemic by implementing a full-time work-from-home policy, effective March 2020. This policy provided guidelines and support for employees to conduct their work remotely, ensuring the continuity and productivity of business operations. The policy applied to all eligible employees and allowed them to work from home while maintaining the same level of performance and collaboration as they would in the office. The company also provided necessary equipment and resources for remote work, including a company-issued laptop, software licenses, and access to secure communication tools. Additionally, the company encouraged employees to prioritize their health and well-being while working from home and provided support through regular communication with supervisors, maintaining regular work hours, and taking breaks when needed. The policy was periodically reviewed and updated as necessary to ensure it aligned with public health guida

In [88]:
ans_2 = chain.invoke("Can I take time off if I’m not feeling well?")
print("---- Answer 2 ----")
print(ans_2)


INFO:langchain.retrievers.multi_query:Generated queries: ['1. How can I request time off if I am feeling unwell?', '2. Is it possible for me to take a break if I am not feeling well?', '3. What are my options for taking time off if I am not feeling well?']


---- Answer 2 ----

Yes, you can take time off if you are not feeling well. According to the Company Vacation Policy, in the event of an unplanned absence due to illness, employees may use their accrued vacation time with supervisor approval. It is important to inform your supervisor as soon as possible and provide any required documentation upon your return to work. Additionally, if your employment is terminated, you will be paid out for any unused vacation time. If you have any further questions or concerns about taking time off, you can direct them to your supervisor or the HR department.


# NASA-style Q&A 🛰️

In [85]:
ans_1 = chain.invoke("What does the Mars mission engineering team do?")
print("---- Answer 1 ----")
print(ans_1)


INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the responsibilities of the Mars mission engineering team?', '2. How does the Mars mission engineering team contribute to the overall mission?', '3. Can you explain the role of the Mars mission engineering team in detail?']


---- Answer 1 ----
I'm sorry, I cannot answer this question as it is not mentioned in the provided context. The context only discusses the sales organization and their responsibilities.


In [None]:
ans_2 = chain.invoke("What happens if an astronaut gets sick on a mission?")
print("---- Answer 2 ----")
print(ans_2)


INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the protocols for handling a sick astronaut during a mission?', '2. How does the space agency handle medical emergencies during a space mission?', '3. In the event of an astronaut falling ill during a mission, what steps are taken to ensure their safety and the success of the mission?']


---- Answer ----

If an astronaut gets sick on a mission, they will receive medical care from the designated medical officer on board the spacecraft. If the illness is severe or requires specialized treatment, the astronaut may be evacuated to a medical facility on Earth. In some cases, the mission may be altered or shortened to ensure the safety and well-being of the sick astronaut. Additionally, NASA has protocols in place for emergency medical situations, such as a medical emergency on the International Space Station, where the astronaut may be transported back to Earth for treatment.
