A tutorial for building a chatbot which answers questions about PDF files

LuotoCompany/basic-bot-tutorial

A guide for building a basic chatbot

This guide shows one example of how to make a chatbot which can answer questions about PDF files.

In Part 1, we will build a command-line bot application, and in Part 2 we will add a UI to the bot.

But first, here is an overview of the components needed to build such a bot.

Building blocks

[Figure: img/bot_structure.jpg]

LLM

Large Language Model. For this guide, we'll use GPT-4o via OpenAI APIs.

In addition to GPTs, there are many other models, such as PaLM, Claude, Llama, Falcon, Cohere and Mistral.

Vector database

A database that stores vectors. Data is converted into a vector space, where similar items get clustered closer to each other. When searching for data, the search term is converted to the same vector space, and results are then fetched using a similarity metric such as cosine similarity.
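To make the similarity search concrete, here is a minimal sketch of ranking stored items by cosine similarity. The three-dimensional vectors and their labels are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, and a vector database uses an index instead of comparing against every item.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stored vectors (illustrative values, not real embeddings).
stored = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.6, 0.4, 0.1],
    "invoices": [0.0, 0.1, 0.9],
}

# Hypothetical vector for the search term, in the same vector space.
query = [0.8, 0.2, 0.0]

# Rank stored items from most to least similar to the query.
ranked = sorted(
    stored,
    key=lambda name: cosine_similarity(query, stored[name]),
    reverse=True,
)
print(ranked)
```

The items whose vectors point in nearly the same direction as the query come first, which is what "clustered closer to each other" means in practice.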

We'll use ChromaDB, which is an open source vector database. It uses SQLite as the default backend, which is nice and easy for simple prototyping.

Here are a few other vector databases:

Embedding model

An embedding model is a neural network that takes data as input and outputs numerical vectors.

ChromaDB ships with a default embedding model, so let's use that for simplicity. In principle, any embedding model can be used to vectorize and then search for data, as long as the same model is used for both the stored data and the search query.
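The sketch below illustrates why the data and the query must go through the same model. It uses a toy word-count "embedding" over a tiny, made-up vocabulary as a stand-in for a real neural network; `toy_embed` and `VOCABULARY` are hypothetical names, not part of ChromaDB.

```python
import math
from collections import Counter

# Hypothetical fixed vocabulary; each word is one dimension of the vector.
VOCABULARY = ["pdf", "chatbot", "vector", "database", "invoice"]


def toy_embed(text):
    """Toy stand-in for an embedding model: count vocabulary words."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCABULARY]


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


documents = [
    "a vector database stores vectors",
    "the chatbot reads a pdf",
    "invoice number and invoice date",
]

# The documents and the query are embedded with the SAME function,
# so their vectors share dimensions and can be compared.
doc_vectors = [toy_embed(doc) for doc in documents]
query_vector = toy_embed("which database holds each vector")

best = max(
    range(len(documents)),
    key=lambda i: cosine_similarity(query_vector, doc_vectors[i]),
)
print(documents[best])
```

If the query were embedded with a different model, its dimensions would not correspond to those of the stored vectors and the similarity scores would be meaningless.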

More about embeddings:

Data processing

Preparing data to be stored in a vector database is a large topic. Its complexity depends on how varied the data is (e.g. a simple text file vs. images containing text in a PDF) and on the intended use case.

The basic idea is to chop data into short pieces that are then converted into vectors. The data can be anything, such as a website, txt file, csv, pdf, transcribed audio, output from an LLM, and so on.

For this guide we'll go with the document loader API of LangChain and PyPDFLoader, which reads text from PDFs.
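As a rough illustration of the chopping step, here is a minimal character-based chunker with overlap between neighbouring chunks. The function name `chunk_text` and the sizes are assumptions for this sketch; real pipelines often split on sentences, paragraphs or tokens instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears in full in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Hypothetical input: 500 characters of placeholder text.
sample = "abcdefghij" * 50
pieces = chunk_text(sample, chunk_size=200, overlap=50)
print(len(pieces), [len(p) for p in pieces])
```

Each resulting piece would then be passed to the embedding model and stored in the vector database together with an ID.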

More about data processing:

Chat user interface

A chat UI can naturally be built with any tech, but instead of focusing on React hooks or Vue templates, we'll take an off-the-shelf solution and use a library called Chainlit.

Another library for LLM/data focused UI creation is Streamlit.

Middleware / LLM orchestration

LLM orchestration libraries wrap the above concepts (data processing, vector databases, LLMs) into a package that is easier to use. They let you build pipelines in which components such as LLMs and vector databases can be swapped easily. They also implement more complex use cases, such as different kinds of agents.

While an orchestration library is very useful in practice, it's not mandatory. For the sake of learning, we'll go without a library in this guide.
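Without an orchestration library, the wiring between the components is done by hand. The sketch below shows the general shape of such a pipeline with swappable parts; `answer_question`, `fake_retrieve` and `fake_llm` are hypothetical names, and the stubs stand in for a real vector database query and a real LLM call.

```python
from typing import Callable, List


def answer_question(
    question: str,
    retrieve: Callable[[str], List[str]],
    complete: Callable[[str], str],
) -> str:
    """Fetch relevant chunks, build a prompt, and ask the LLM."""
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer the question using only this context:\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
    return complete(prompt)


def fake_retrieve(question: str) -> List[str]:
    """Stub for a vector database search (returns a fixed chunk)."""
    return ["The PDF describes how to build a chatbot."]


def fake_llm(prompt: str) -> str:
    """Stub for an LLM call: echoes the last line of the prompt."""
    return "STUB: " + prompt.splitlines()[-1]


result = answer_question("What is in the PDF?", fake_retrieve, fake_llm)
print(result)
```

Because the retriever and the LLM are passed in as plain functions, either one can be replaced without touching the rest of the pipeline, which is the main convenience an orchestration library packages up.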

Here are a few libraries for LLM orchestration:

Let's begin

This guide was made with Python 3.11 on a Mac. In addition to Python and a shell, you'll need an OpenAI API key.

Setup

First, create a Python virtual env and activate it:

python -m venv .venv
source .venv/bin/activate

(To leave the virtual env, run deactivate.)

Then install the dependencies:

pip install -r requirements.txt

(requirements.txt contains essentially the packages installed by this command:)
pip install chromadb openai langchain pypdf chainlit

Part 1: Building a command-line version

Head over to PART1_CMDLINE.md to build the first part.
