# Using Langchain to Query Any Document With an LLM in Less Than 30 Simple Lines Of Code

You've learned linear regression as part of the basics for machine learning.  Now, learn question answering with documents as part of the basics of LLM's!

With Large Language Models (LLM's) becoming more ubiquitous for ML-driven product design, there are elementary building blocks which any Machine Learning scientist should be familiar with for building out modern, A.I. powered solutions.

In particular, interacting with a document is a fundamental component of many types of LLM-driven solution.  Search, CoPilots, Virtual Support Agents etc can be implemented as systems which reference specific documents or information and enable interaction with this information via chat or prompt.

Langchain provides a great interface for building on top of Large Langage Models' capabilities.  In particular, it's a piece of cake to use langchain to put together a question answering service for any document, regardless of its length.   

The pattern we're discussing today is based on a divide and conquer strategy for documents.  The goal is to be able to ask a question about a document of arbitrary length and to get an answer.  

Here's what it looks like:
- select a document you'd like to query
- break it up into chunks 
- use a similarity metric to choose relevant chunks which may contain answers to your question
- pick the most relevant chunks, and use them to form a prompt which the LLM can directly reference to answer questions
- ask the question - receive an answer based on the provided text chunks

One might wonder why we're breaking the document into chunks.  If the document was short enough, we could feed it into the prompt and directly ask questions.  Assuming we're working with a longer document, we won't be able to fit it into the prompt.  The chunking strategy works by only filling the prompt with text that is likely to contain the answer to the question


# Getting started

You'll need to install openai, langchain, tiktoken, faiss and perhaps a few more packages.

You'll also need your own OpenAPI key.

In [1]:
import openai
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage, AIMessage
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# Set your OpenAI API key
openai_api_key = 'YourAPIKey'

For this demo, I copy / pasted some documentation from azure automl into a text file. Feel free to use whatever text file you'd like to do this with. 

In [2]:
loader = TextLoader('azure_automl_documentation.txt')
document_text = loader.load()

The strategy we're going to take is to divide the document up into chunks.

We'll split the document into overlapping chunks.  I chose a chunk size of 500 and a chunk overlap of 100, but these are params you can play with.  


In [3]:
# Break document into chunks using text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
documents = text_splitter.split_documents(document_text)

Something to keep in mind when choosing chunk size is: the text that indicates relevance to the question needs to be within the same chunk as the answer to the question.  

For example, if you ask "what is the list of supported ML models", and the chunk size is 110, you might retrieve a chunk that says "the list of supported ML models is comprehensive and represents the current state of the art in A.I. included are {text truncated}" which doesn't actually include the answer!  It'd be hard for the model to answer the question without the full list of ML models being included in the chunk.

On the other hand, we don't want chunks so large that that their relevance score gets diluted by irrelevant text.  So - this is a param that needs to be optimized depending on the structure of the text.

Now, lets compute embeddings for each of the document chunks.

In [4]:
# Embed all document chunks using OpenAI Embeddings API
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
chunk_embeddings = [embeddings.embed_query(chunk.page_content) for chunk in documents]

Now, convert these into a vector store.  The vector store allows us to retrieve documents similar to a query document we provide.  In this case, we'll be looking for document chunks that are similar to our question.

In [5]:
# Create the FAISS vector store from the documents and embeddings
vector_store = FAISS.from_documents(documents, embeddings)
retriever = vector_store.as_retriever()

Here's the main function for querying the document store.  



In [6]:
# Function to retrieve top relevant chunks and launch chat session
def query_document(question, print_prompt=False):
    # Retrieve top relevant chunks using FAISS vector store
    retrieved_documents = retriever.get_relevant_documents(question)
    relevant_chunks = [document.page_content for document in retrieved_documents]

    # Update the prompt with the top relevant chunks
    prompt = ('Directly below I will provide documentation for answering questions. ' + 
              'Then, I will ask a question.  Only use information provided in the documentation.  The question will begin after "User: " ' +
              'If the answer is not in the documentation provided, respond with "answer not found"')
    prompt = '\n\n'.join(relevant_chunks)
    prompt += f'\n\nUser: {question}\n'

    # Launch chat session with the updated prompt
    '''
    Note that we supply a temperature of 0.  
    Temperature increases creativity but it also increases randomness.  
    We don't want random, creative answers - we just want the cold, hard facts! 
    '''
    chat = ChatOpenAI(temperature=0.0, openai_api_key=openai_api_key, model="gpt-4")
    response = chat([HumanMessage(content=prompt)])
    
    if print_prompt: print('**** PROMPT ****\n\n', prompt)
    print('**** RESPONSE *****\n\n',response.content)
    

It uses the "get_relevant_documents" function to ... get relevant documents!  
Then, it concatenates all of those document chunks together and inserts them into a prompt.


We could just ask questions on top of the chunks without additional context.  However, in order to avoid having the LLM attempt to invent answers which are not actually in the document, I give it the following setup: 


> 'Directly below I will provide documentation for answering questions. Then, I will ask a question.  Only use information provided in the documentation. The question will begin after "User: " If the answer is not in the documentation provided, respond with "answer not found"'


Note that I specify gpt-4 here.  The quality of the response varies greatly dependending on which model you choose.  For embeddings / chunk retrieval, it appears we can get away with the default model gpt-3.5-turbo.  

More complex gpt models are slower and more expensive, but of course generally yield better results.  

In this case, it made sense to use a less expensive model for embedding the numerous chunks, and a more expensive model for the few queries on those chunks.

# I'll Ask The Questions Here

Let's try asking our document a few questions.

In [7]:
query_document(question="Does this support Bert?")

**** RESPONSE *****

 Yes, the NLP capability supports end-to-end deep neural network NLP training with the latest pre-trained BERT models. This allows you to leverage the power of BERT for various natural language processing tasks in your automated ML experiments.


In [8]:
query_document(question="Does this support ONNX?")

**** RESPONSE *****

 Yes, AutoML supports ONNX (Open Neural Network Exchange) format. With Azure Machine Learning, you can use automated ML to build a Python model and have it converted to the ONNX format. Once the models are in the ONNX format, they can be run on a variety of platforms and devices. This allows for better interoperability and performance optimization across different machine learning frameworks.


In this case, the answers to both questions appear correct.

# Debugging

This function also supports printing the prompt for debugging and gaining insight into the LLM's behavior.  

For example, you may find that the chunks being returned aren't very relevant.  You may also find that the LLM is hallucinating and isn't basing its responses off the document chunks at all - in which case you'll have to think about doing some prompt engineering and perhaps setting up an evaluation method to quantify the accuracy of the question answering service.   

In [9]:
query_document(question="Does this support ONNX?", print_prompt=True)

**** PROMPT ****

 The ONNX runtime also supports C#, so you can use the model built automatically in your C# apps without any need for recoding or any of the network latencies that REST endpoints introduce. Learn more about using an AutoML ONNX model in a .NET application with ML.NET and inferencing ONNX models with the ONNX runtime C# API.

Next steps
There are multiple resources to get you up and running with AutoML.

Tutorials/ how-tos
Tutorials are end-to-end introductory examples of AutoML scenarios.

See the AutoML package for changing default ensemble settings in automated machine learning.


AutoML & ONNX
With Azure Machine Learning, you can use automated ML to build a Python model and have it converted to the ONNX format. Once the models are in the ONNX format, they can be run on a variety of platforms and devices. Learn more about accelerating ML models with ONNX.

See how to convert to ONNX format in this Jupyter notebook example. Learn which algorithms are supported in ONN

For this question, it looks like the answer is correct and is mostly supported by the documentation.

# Conclusion

LLM's are a whole new paradigm for machine learning-based solutions.  There are plenty of amazing tools being developed all the time, and capabilities that weren't accessible with so little effort even a year ago.

Make sure to stay up-to-date with patterns and recipes for building amazing apps with A.I.