# DocChat - Chat with your documents

This Jupyter notebook is a walkthrough for a machine learning and chatbot assignment built with the LangChain framework and large language models (LLMs). It covers each stage in turn: importing dependencies, loading PDF documents, splitting them into chunks, creating embeddings, running a similarity search, and setting up a conversational retriever backed by an LLM. Along the way it shows how to answer user queries from the text of your own documents and assemble a working chatbot.

This cell imports the libraries used throughout the notebook: a text splitter, vector stores, the conversational retrieval chain, a chat model, and a document loader.

In [1]:
import os
import getpass
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone, Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import DirectoryLoader

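This cell prompts for the OpenAI API key with getpass, which hides the typed characters, and stores the key in the OPENAI_API_KEY environment variable so that later cells can reach OpenAI's services and models.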

In [2]:
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

OpenAI API Key: ········
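
If the notebook is re-run often, or run where no one can type at a prompt, it can help to reuse a key that is already exported in the environment and prompt only when it is missing. The following is a minimal sketch of that idea; `ensure_api_key` is a hypothetical helper and not part of LangChain or the OpenAI client.

```python
import getpass
import os


def ensure_api_key(var_name="OPENAI_API_KEY", prompt=getpass.getpass):
    """Return the key stored in os.environ[var_name], prompting only if it is missing.

    `ensure_api_key` is a hypothetical helper written for this notebook.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass hides the typed characters, so the key never shows up in cell output.
        key = prompt(f"{var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} must not be empty")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` in place of the assignment above should behave the same on a first run and skip the prompt on later runs in the same session.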


In this cell, you set up a pdf_loader that loads PDF documents from a directory. The DirectoryLoader looks for PDF files in the '../documents/' directory and its subdirectories (as indicated by the '**/*.pdf' pattern). The cell then reports how many documents were loaded and how many characters they contain in total.

In [3]:
pdf_loader = DirectoryLoader('../documents/', glob="**/*.pdf")

documents = []
documents.extend(pdf_loader.load())

num_docs = len(documents)
num_chars = sum([len(document.page_content) for document in documents])

print(f'Found {num_docs} document(s) in the provided directory')
print(f'Found {num_chars} characters in your document(s)')

Found 2 document(s) in the provided directory
Found 94878 characters in your document(s)
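
To check which files a recursive pattern such as '**/*.pdf' would pick up before loading anything, the same pattern can be tried with the standard library. This is a small sketch, assuming the loader resolves the pattern the way `pathlib.Path.glob` does; `list_pdfs` is a hypothetical helper, and the directory tree is created only for the demonstration.

```python
import tempfile
from pathlib import Path


def list_pdfs(root):
    """Return every *.pdf file under root, including subdirectories, sorted by path."""
    # "**" matches root itself and every nested directory below it.
    return sorted(p for p in Path(root).glob("**/*.pdf") if p.is_file())


# Demo on a throwaway directory tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "papers" / "2023").mkdir(parents=True)
    (root / "top.pdf").touch()
    (root / "papers" / "2023" / "nested.pdf").touch()
    (root / "papers" / "notes.txt").touch()
    found = [p.relative_to(root).as_posix() for p in list_pdfs(root)]
    print(found)
```

The non-PDF file is skipped, and both the top-level and the nested PDF are found.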


In this cell, you use the CharacterTextSplitter to split the text of your PDF documents into smaller chunks. Chunking matters when working with large language models (LLMs) such as GPT-3 or GPT-4, for the following reasons:

1. Managing Context Length: An LLM can process only a limited amount of text in a single request. This limit is known as the "context length". Input that exceeds it must be truncated or split into smaller portions.
2. Preserving Contextual Understanding: Each chunk must carry enough context for the model to generate a coherent response. The chunk_size parameter sets the target length of a chunk in characters, and chunk_overlap sets how many characters adjacent chunks share, so that text near a boundary keeps some of its surroundings.
3. Enhancing Efficiency: Smaller chunks need less memory and compute to embed and search, which matters when dealing with large document collections.
4. Optimizing Model Usage: Only the chunks relevant to a question are sent to the model. This keeps each request within the context length and keeps the answer focused on the relevant text.
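
The roles of chunk_size and chunk_overlap are easiest to see on a short string. The sketch below uses a plain sliding window over characters. It is a simplified stand-in for illustration, not how CharacterTextSplitter is implemented, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=40):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each printed chunk starts with the last three characters of the previous one, which is the continuity that the overlap provides.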

In [4]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=40) #chunk overlap seems to work better
documents = text_splitter.split_documents(documents)
print(len(documents))

Created a chunk of size 1161, which is longer than the specified 1000
Created a chunk of size 5132, which is longer than the specified 1000
Created a chunk of size 1031, which is longer than the specified 1000
Created a chunk of size 1172, which is longer than the specified 1000
Created a chunk of size 1010, which is longer than the specified 1000
Created a chunk of size 1398, which is longer than the specified 1000
Created a chunk of size 1017, which is longer than the specified 1000
Created a chunk of size 1186, which is longer than the specified 1000
Created a chunk of size 1232, which is longer than the specified 1000
Created a chunk of size 1318, which is longer than the specified 1000
Created a chunk of size 1158, which is longer than the specified 1000
Created a chunk of size 1399, which is longer than the specified 1000
Created a chunk of size 1111, which is longer than the specified 1000
Created a chunk of size 1338, which is longer than the specified 1000
Created a chunk of s

111
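
The warnings above are consistent with a splitter that cuts only at a separator (a blank line, "\n\n", is the default for CharacterTextSplitter) and then merges neighbouring pieces up to chunk_size. A paragraph that is longer than chunk_size on its own has no cut point inside it, so it is kept whole. The sketch below imitates that behaviour in plain Python under those assumptions. It ignores chunk_overlap for brevity, and `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size cannot be split further and is
    emitted as-is, which is the situation the warnings report.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = piece
    if current:
        chunks.append(current)
    return chunks


text = "short one\n\nshort two\n\n" + "x" * 30 + "\n\ntail"
for chunk in split_on_separator(text, chunk_size=25):
    flag = "  <- longer than chunk_size" if len(chunk) > 25 else ""
    print(len(chunk), flag)
```

If oversized chunks turn out to be a problem, LangChain's RecursiveCharacterTextSplitter, which falls back to finer separators, may be worth trying.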


This cell creates embeddings and a vector store for the text of your documents. In summary:

1. Import the Chroma and OpenAIEmbeddings classes.
2. Create an embeddings instance, which generates text embeddings through the OpenAI API.
3. Build a vectorstore with Chroma from your chunked documents. Each chunk is embedded and stored as a vector, so that chunks can later be searched by meaning rather than by exact wording.

In [5]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
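
Conceptually, a vector store keeps each chunk next to its embedding. The sketch below shows that pairing with a toy embedding that hashes words into a fixed number of buckets. It is an illustration only: `toy_embed` and `ToyVectorStore` are hypothetical names, the toy embedding captures word overlap rather than meaning, and none of this reflects the Chroma or OpenAIEmbeddings APIs.

```python
import math
import re
import zlib


def toy_embed(text, dim=64):
    """Map text to a unit-length vector by hashing its words into dim buckets."""
    vector = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        vector[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory."""

    def __init__(self, embed=toy_embed):
        self.embed = embed
        self.texts = []
        self.vectors = []

    @classmethod
    def from_texts(cls, texts, embed=toy_embed):
        store = cls(embed)
        for text in texts:
            store.texts.append(text)
            store.vectors.append(embed(text))
        return store


store = ToyVectorStore.from_texts(
    ["RetNet has O(1) inference", "Transformers use attention"]
)
print(len(store.vectors), len(store.vectors[0]))
```

A real embedding model replaces `toy_embed`, and a real store adds persistence and fast nearest-neighbour lookup, but the data layout is the same idea.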


In this cell, you're conducting a similarity search using a provided text query:

1. The variable text contains the query you want to search for, which is "What makes Retentive Networks good?"
2. You call similarity_search on the vectorstore created earlier. The query is embedded and compared against the stored chunk vectors, and the chunks closest in meaning or context to the query are returned, best match first.

This process allows you to retrieve documents or chunks of text that are semantically related to the query, which can be valuable for tasks such as information retrieval or recommendation systems.

In [6]:
text = "What makes Retentive Networks good?"
docs = vectorstore.similarity_search(text)
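
A similarity search boils down to scoring every stored vector against the query vector and keeping the best few. The sketch below does this with cosine similarity, which is one common choice. The metric that Chroma actually applies depends on its configuration, so treat this as an illustration of the idea. The vectors are hand-made, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vector, indexed, k=4):
    """Return the k (text, score) pairs whose vectors score highest against query_vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-made 3-dimensional vectors; real embeddings have far more dimensions.
indexed = [
    ("chunk about inference cost", [0.9, 0.1, 0.0]),
    ("chunk about training data", [0.1, 0.9, 0.1]),
    ("chunk about edge devices", [0.6, 0.0, 0.8]),
]
query_vector = [1.0, 0.0, 0.1]
for text, score in top_k(query_vector, indexed, k=2):
    print(f"{score:.3f}  {text}")
```

The first result plays the role of docs[0] in the notebook: the stored chunk whose vector points most nearly in the same direction as the query.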

Next, print the content of the most similar chunk returned by the similarity search in the previous cell, so that you can inspect the text that best matches your query.

In [7]:
print(docs[0].page_content)

4 Conclusion

In this work, we propose retentive networks (RetNet) for sequence modeling, which enables various representations, i.e., parallel, recurrent, and chunkwise recurrent. RetNet achieves significantly better inference efficiency (in terms of memory, speed, and latency), favorable training parallelization, and competitive performance compared with Transformers. The above advantages make RetNet an ideal successor to Transformers for large language models, especially considering the deployment benefits brought by the O(1) inference complexity. In the future, we would like to scale up RetNet in terms of model size [CDH+22] and training steps. Moreover, retention can efficiently work with structured prompting [HSD+22b] by compressing long-term memory. We will also use RetNet as the backbone architecture to train multimodal large language models [HSD+22a, HDW+23, PWD+23]. In addition, we are interested in deploying RetNet models on various edge devices, such as mobile phones.


In this cell, you are configuring a conversational retriever using the OpenAI language model (LLM) and the vector store for similarity-based retrieval:

1. You import the OpenAI class from 'langchain.llms' to work with the language model.
2. You create a retriever called retriever from the vectorstore with a specified search type ("similarity") and search parameters. With k set to 2, it returns the two chunks whose vectors are most similar to the question.
3. You establish a conversational retrieval chain, called chatbot, using the OpenAI LLM with a temperature of 0, which keeps the output as repeatable as possible. The LLM answers questions based on the retrieved content.

This setup is fundamental for creating a chatbot or a system that can answer questions and engage in conversations by leveraging similarity-based content retrieval and the capabilities of the language model.

In [8]:
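from langchain.llms import OpenAI

# Retrieve the 2 chunks most similar to each question.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})
# temperature=0 keeps the answers as repeatable as possible.
chatbot = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), retriever)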
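
Conceptually, a conversational retrieval chain does three things on every turn: it rewrites a follow-up question into a standalone question using the chat history, retrieves the chunks most similar to that question, and asks the LLM to answer from those chunks. The sketch below mirrors that flow with plain callables standing in for the LLM and the retriever. It is not LangChain's implementation, and every name in it is hypothetical.

```python
def conversational_retrieval(question, chat_history, condense, retrieve, answer):
    """Run one turn: condense -> retrieve -> answer.

    condense(question, chat_history)    -> standalone question (str)
    retrieve(standalone_question)       -> list of text chunks
    answer(standalone_question, chunks) -> answer text (str)
    """
    # With no history there is nothing to resolve, so the question is used as-is.
    standalone = condense(question, chat_history) if chat_history else question
    chunks = retrieve(standalone)
    return {
        "question": question,
        "answer": answer(standalone, chunks),
        "source_chunks": chunks,
    }


# Stub components so the sketch runs without an API key.
def stub_condense(question, chat_history):
    last_question, _ = chat_history[-1]
    return f"{question} (follow-up to: {last_question})"


def stub_retrieve(standalone_question, k=2):
    corpus = [
        "RetNet has O(1) inference complexity.",
        "RetNet trains in parallel.",
        "Unrelated text.",
    ]
    return corpus[:k]


def stub_answer(standalone_question, chunks):
    return f"Based on {len(chunks)} chunk(s): " + " ".join(chunks)


result = conversational_retrieval(
    "What makes RetNet good?", [], stub_condense, stub_retrieve, stub_answer
)
print(result["answer"])
```

Keeping the three steps as separate callables makes it clear where the chat history enters: only in the condensing step, before retrieval.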


In this cell, you are simulating a chatbot interaction. Here's a succinct summary:

1. chat_history starts as an empty list that will hold the conversation so far.
2. You define a query, "What makes Retentive Networks good?", which represents a user's question.
3. Using the chatbot conversational retrieval chain, you provide the question and chat history as input to obtain a response.
4. The response is stored in the result variable, and result["answer"] holds the answer text. The (query, answer) pair is appended to chat_history so that a follow-up question can refer back to it.

This code represents the interaction between a user's query and the chatbot, demonstrating how the chatbot generates responses based on the provided question and historical conversation context.

In [9]:
chat_history = []
query = "What makes Retentive Networks good?"
result = chatbot({"question": query, "chat_history": chat_history})
chat_history.append((query, result["answer"]))
result["answer"]

' Retentive Networks (RetNet) are good because they enable various representations, have significantly better inference efficiency, have favorable training parallelization, and have competitive performance compared to Transformers.'
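
The chat history only starts to matter on the second turn. The sketch below shows how it grows as a list of (question, answer) tuples across two questions. It uses a stub in place of the real chain so that it runs without an API key. `fake_chatbot` and `ask` are hypothetical names, and the second question is made up for illustration.

```python
def fake_chatbot(inputs):
    """Stand-in for the chain: reports the question and how many prior turns it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"(answer to {inputs['question']!r} given {turns} earlier turn(s))"}


def ask(bot, question, chat_history):
    """Send one question, record the exchange in chat_history, and return the answer."""
    # Pass a copy so the bot sees only the turns that preceded this question.
    result = bot({"question": question, "chat_history": list(chat_history)})
    chat_history.append((question, result["answer"]))
    return result["answer"]


chat_history = []
questions = [
    "What makes Retentive Networks good?",
    "How do they compare with Transformers?",
]
for question in questions:
    print(ask(fake_chatbot, question, chat_history))
```

With the real chain, passing chatbot instead of `fake_chatbot` should let the second question resolve "they" from the first exchange.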