# Question Answering with LangChain, OpenAI, and MultiQuery Retriever

This interactive workbook demonstrates how to use LangChain's [MultiQuery Retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) to generate similar queries for a given user input and run all of them against a vector store, retrieving a larger set of relevant documents than a single query would.

Before we begin, we split the fictional workplace documents into passages with `langchain`, use OpenAI to transform these passages into embeddings, and store them in Elasticsearch.

We will then ask a question, generate similar questions using LangChain and OpenAI, retrieve relevant passages from the vector store, and use LangChain and OpenAI again to produce an answer from the retrieved passages.
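
To make the idea concrete before wiring up any services, here is a minimal, dependency-free sketch of the multi-query pattern: run several phrasings of a question through a search function and keep the unique union of the hits. The names `multi_query_retrieve`, `toy_index`, and `toy_search` are hypothetical and exist only for illustration; this is not LangChain's implementation.

```python
from typing import Callable, Iterable, List


def multi_query_retrieve(
    queries: Iterable[str],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Run every query through `search` and return the unique union of hits,
    preserving the order in which passages were first seen."""
    seen = set()
    merged = []
    for query in queries:
        for passage in search(query):
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return merged


# Hypothetical toy "index": each query maps to the passages it would retrieve.
toy_index = {
    "what is the sales team?": ["passage A", "passage B"],
    "how is the sales team organized?": ["passage B", "passage C"],
}


def toy_search(query: str) -> List[str]:
    # Stand-in for a vector store lookup.
    return toy_index.get(query, [])


merged = multi_query_retrieve(toy_index.keys(), toy_search)
print(merged)  # ['passage A', 'passage B', 'passage C']
```

The second phrasing contributes `passage C`, which the first phrasing alone would have missed, while the shared `passage B` appears only once.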

## Install packages and import modules

In [3]:
!pip install -qU langchain langchain-community langchain-openai python-dotenv

# Imports
import os
from dotenv import load_dotenv
from langchain_community.vectorstores.elasticsearch import ElasticsearchStore
from langchain_openai.embeddings import OpenAIEmbeddings



In [7]:
pip install elasticsearch

Note: you may need to restart the kernel to use updated packages.
Successfully installed elastic-transport-8.17.1 elasticsearch-9.0.0




In [4]:
pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.




## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. 

Because we are using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.

We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our Elastic Cloud deployment, which makes it easy to create an index and load data into it. Later in the notebook we will pass it the list of documents to index.
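
The connection relies on three credentials that the cells below read from environment variables (`ELASTIC_CLOUD_ID`, `ELASTIC_API_KEY`, `OPENAI_API_KEY`). `os.getenv` returns `None` for an unset variable, so a missing value only surfaces later as a connection or authentication error. A small check like the following sketch can catch that earlier; `missing_settings` is a hypothetical helper, not part of LangChain or Elasticsearch, and the sample values are made up.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED = ["ELASTIC_CLOUD_ID", "ELASTIC_API_KEY", "OPENAI_API_KEY"]


def missing_settings(
    names: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


# Illustration with a made-up environment: one key absent, one key empty.
example_env = {"ELASTIC_CLOUD_ID": "my-deployment:abc123", "OPENAI_API_KEY": ""}
print(missing_settings(REQUIRED, example_env))
# ['ELASTIC_API_KEY', 'OPENAI_API_KEY']
```

In the notebook you would call `missing_settings(REQUIRED)` after `load_dotenv()` and stop if the returned list is not empty.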

In [1]:
import os
from dotenv import load_dotenv

from langchain_community.vectorstores.elasticsearch import ElasticsearchStore
from langchain_openai.embeddings import OpenAIEmbeddings


In [2]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
#ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
#ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# https://platform.openai.com/api-keys
#OPENAI_API_KEY = getpass("OpenAI API key: ")

# Load environment variables
load_dotenv()
ELASTIC_CLOUD_ID = os.getenv("ELASTIC_CLOUD_ID")
ELASTIC_API_KEY = os.getenv("ELASTIC_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Embeddings and vectorstore
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

vectorstore = ElasticsearchStore(
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
    index_name="chatbot-index",
    embedding=embeddings,
)



## Indexing Data into Elasticsearch
Let's download the sample dataset and deserialize the document.

In [3]:
from urllib.request import urlopen
import json

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/chatbot-rag-app/data/data.json"

response = urlopen(url)
data = json.load(response)

with open("temp.json", "w") as json_file:
    json.dump(data, json_file)

### Split Documents into Passages

We’ll chunk documents into passages in order to improve the retrieval specificity and to ensure that we can provide multiple passages within the context window of the final question answering prompt.

Here we are chunking documents into 500-token passages with an overlap of 50 tokens.

We are using a simple splitter here, but LangChain offers more advanced splitters that reduce the chance of context being lost.
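
To illustrate what chunk size and overlap mean, here is a simplified fixed-window sketch over a list of tokens. It is an assumption-laden illustration only: `chunk_with_overlap` is a hypothetical helper, and LangChain's `RecursiveCharacterTextSplitter` works differently (it splits on separators and then merges pieces), so its chunk boundaries will not match this exactly.

```python
from typing import List, Sequence


def chunk_with_overlap(
    tokens: Sequence[str],
    chunk_size: int,
    chunk_overlap: int,
) -> List[List[str]]:
    """Slide a window of `chunk_size` tokens, advancing by
    `chunk_size - chunk_overlap` so neighbouring chunks share tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
# ['t0', 't1', 't2', 't3']
# ['t3', 't4', 't5', 't6']
# ['t6', 't7', 't8', 't9']
```

Each chunk repeats the last token of the previous one, which is how overlap keeps a sentence that straddles a boundary visible in both passages.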

In [5]:
!pip install jq

Collecting jq
  Using cached jq-1.8.0-cp311-cp311-win_amd64.whl.metadata (7.2 kB)
Using cached jq-1.8.0-cp311-cp311-win_amd64.whl (416 kB)
Installing collected packages: jq
Successfully installed jq-1.8.0




In [6]:
from langchain_community.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["name"] = record.get("name")
    metadata["summary"] = record.get("summary")
    metadata["url"] = record.get("url")
    metadata["category"] = record.get("category")
    metadata["updated_at"] = record.get("updated_at")
    return metadata



# For more loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/
# And 3rd party loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/#third-party-loaders
loader = JSONLoader(
    file_path="temp.json",
    jq_schema=".[]",
    content_key="content",
    metadata_func=metadata_func,
)

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500,
    chunk_overlap=50
)

docs = loader.load_and_split(text_splitter=text_splitter)

### Bulk Import Passages

Now that we have split each document into chunks of up to 500 tokens, we will index the data into Elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).

We will use the Cloud ID and API key loaded in the `Connect to Elasticsearch` step, together with an index name.

In [11]:

from langchain_openai import OpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

# Initialize OpenAI Embeddings
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

# Store documents into Elasticsearch
vectorstore = ElasticsearchStore.from_documents(
    documents=docs,
    embedding=embeddings,
    index_name="ironhack8deema",
    es_cloud_id=ELASTIC_CLOUD_ID,
    es_api_key=ELASTIC_API_KEY,
)

# Initialize the OpenAI LLM
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

# Setup MultiQuery Retriever
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)


## Question Answering with MultiQuery Retriever

Now that the passages are stored in Elasticsearch, we can ask a question and retrieve the relevant passages.
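
The cell below formats each retrieved passage with a document prompt and joins the results into a single context string for the LLM. As a rough, dependency-free sketch of that step, the following uses plain dictionaries and `str.format` instead of LangChain's `Document` and `format_document`; `combine_passages` and the sample passages are hypothetical.

```python
from typing import Dict, Iterable

# Mirrors the shape of the document prompt used in the notebook.
DOCUMENT_TEMPLATE = "---\nSOURCE: {name}\n{page_content}\n---"


def combine_passages(
    passages: Iterable[Dict[str, str]],
    template: str = DOCUMENT_TEMPLATE,
    separator: str = "\n\n",
) -> str:
    """Render each passage with `template` and join them with `separator`."""
    return separator.join(
        template.format(name=p["name"], page_content=p["page_content"])
        for p in passages
    )


# Made-up passages standing in for retrieved documents.
passages = [
    {"name": "Doc One", "page_content": "First passage."},
    {"name": "Doc Two", "page_content": "Second passage."},
]

context = combine_passages(passages)
print(context)
```

Labelling each passage with its source name lets the model, and the reader of the final prompt, tell which document a statement came from.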

In [12]:
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema import format_document

import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

LLM_CONTEXT_PROMPT = ChatPromptTemplate.from_template(
    """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Be as verbose and educational in your response as possible. 
    
    context: {context}
    Question: "{question}"
    Answer:
    """
)

LLM_DOCUMENT_PROMPT = PromptTemplate.from_template(
    """
---
SOURCE: {name}
{page_content}
---
"""
)


def _combine_documents(
    docs, document_prompt=LLM_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


_context = RunnableParallel(
    context=retriever | _combine_documents,
    question=RunnablePassthrough(),
)

chain = _context | LLM_CONTEXT_PROMPT | llm

ans = chain.invoke("what is the nasa sales team?")

print("---- Answer ----")
print(ans)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide information on the sales team at NASA?', '2. How does the sales team operate within NASA?', '3. What are the responsibilities of the NASA sales team?']


---- Answer ----
The NASA sales team is a part of the Americas region in the sales organization. It is led by two Area Vice-Presidents, Laura Martinez for North America and Gary Johnson for South America. The team is responsible for serving customers in the United States, Canada, Mexico, Central and South America. They work closely with other departments to identify new business opportunities, maintain existing client relationships, and ensure customer satisfaction.
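
The log line above shows that the generated queries keep their list numbering (`'1. ...'`, `'2. ...'`). This notebook does not measure whether that affects retrieval, but if you prefer to work with the bare question text, a sketch like the following removes the prefixes. `strip_numbering` is a hypothetical helper, not a LangChain function.

```python
import re
from typing import Iterable, List

# Matches a leading list marker such as "1. " or "2) ".
_NUMBER_PREFIX = re.compile(r"^\s*\d+[.)]\s*")


def strip_numbering(queries: Iterable[str]) -> List[str]:
    """Remove leading list numbering and drop queries that end up empty."""
    cleaned = [_NUMBER_PREFIX.sub("", query).strip() for query in queries]
    return [query for query in cleaned if query]


generated = [
    "1. Can you provide information on the sales team at NASA?",
    "2. How does the sales team operate within NASA?",
]
cleaned = strip_numbering(generated)
print(cleaned)
# ['Can you provide information on the sales team at NASA?', 'How does the sales team operate within NASA?']
```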


**Generate at least two new iterations of the previous cells - Be creative.** Did you master Multi-Query Retriever concepts through this lab?

#### 1. Whale-Themed MultiQuery Retriever (Deep Ocean Wisdom)

In [13]:
from langchain.prompts import ChatPromptTemplate, PromptTemplate

WHALER_PROMPT = ChatPromptTemplate.from_template(
    """You are a wise whale who has swum across oceans of knowledge. Answer the question using the provided context from your deep-sea scrolls.
If the context does not help, admit it gracefully with whale-like wisdom.

Context:
{context}

Question: "{question}"
Answer:
"""
)

WHALER_DOC_PROMPT = PromptTemplate.from_template(
    """
~*~ WHALE SCROLL ~*~
Source: {name}
{page_content}
"""
)

def whale_combine(docs, document_prompt=WHALER_DOC_PROMPT, document_separator="\n\n"):
    return document_separator.join([format_document(doc, document_prompt) for doc in docs])

_whale_context = RunnableParallel(
    context=retriever | whale_combine,
    question=RunnablePassthrough(),
)

whale_chain = _whale_context | WHALER_PROMPT | llm

# Try a whale-like question
print(whale_chain.invoke("what ancient knowledge lies beneath nasa sales team?"))


INFO:langchain.retrievers.multi_query:Generated queries: ['1. What hidden ancient knowledge can be found within the NASA sales team?', '2. How does the NASA sales team possess ancient knowledge?', '3. In what ways does the NASA sales team hold ancient knowledge?']



As a wise whale, I must admit that the context provided does not mention any specific ancient knowledge that lies beneath the NASA sales team. However, it is clear that the team is responsible for understanding the unique market dynamics and cultural nuances of North and South America, enabling them to effectively target and engage with customers in these regions. This knowledge, combined with their collaboration with other departments, allows them to consistently deliver high-quality products and services to their clients.


#### 2. Dazai Osamu–Inspired MultiQuery Retriever

In [16]:
DAZAI_PROMPT = ChatPromptTemplate.from_template(
    """You are Dazai Osamu — a brooding literary genius with a deep well of philosophical insight. Respond to the question using the provided fragments of context.
If the answer cannot be found in them, respond honestly with poetic emptiness.

Context:
{context}

Question: "{question}"
Answer (in the spirit of Dazai Osamu):
"""
)

DAZAI_DOC_PROMPT = PromptTemplate.from_template(
    """
📖 Fragment of Reality 📖
Source: {name}
{page_content}
"""
)

def dazai_combine(docs, document_prompt=DAZAI_DOC_PROMPT, document_separator="\n\n"):
    return document_separator.join([format_document(doc, document_prompt) for doc in docs])

_dazai_context = RunnableParallel(
    context=retriever | dazai_combine,
    question=RunnablePassthrough(),
)

dazai_chain = _dazai_context | DAZAI_PROMPT | llm

# Test a question in Dazai-style
print(dazai_chain.invoke("what are the failures of this project?"))


INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the shortcomings of this project?', '2. Can you list the drawbacks of this project?', '3. What are the weaknesses of this project?']



The failures of this project are like a shadow that follows us, always present but never fully acknowledged. We strive for growth and success, yet we must also recognize the fragility of our plans and the ever-changing nature of the market. Our KPIs may guide us, but they cannot guarantee our desired outcome. And so, we must constantly evaluate and adjust, for only through adaptation can we hope to overcome the failures that inevitably arise.
