- Installing libraries from requirements.txt
pip install -r requirements.txt
- Extracting the PDF text section by section, guided by its table of contents, using the PyPDF2 library
- Chunking the extracted text
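
The extraction and chunking steps above could look roughly like the sketch below. It is not the project's actual code: the function names, the word-based chunk size, and the overlap are assumptions made for illustration. `extract_sections` relies on the `outline` property and `get_destination_page_number` method of PyPDF2 3.x, and it only reads top-level table-of-contents entries.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def extract_sections(pdf_path):
    """Return {section title: text} using the PDF's top-level outline entries."""
    # Imported lazily so chunk_text stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # Nested lists in the outline are sub-sections; this sketch skips them.
    entries = [e for e in reader.outline if not isinstance(e, list)]
    starts = [(e.title, reader.get_destination_page_number(e)) for e in entries]
    sections = {}
    for i, (title, first) in enumerate(starts):
        nxt = starts[i + 1][1] if i + 1 < len(starts) else len(reader.pages)
        # Sections that share a start page still get that page's text.
        last = max(nxt, first + 1)
        pages = (reader.pages[p].extract_text() or "" for p in range(first, last))
        sections[title] = "\n".join(pages)
    return sections
```

Each section returned by `extract_sections` can then be passed through `chunk_text`, so that no chunk crosses a section boundary.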
- Connecting to Chroma DB Hosted on Docker
- Starting docker on WSL-Ubuntu with
sudo service docker start
- Pulling chromadb image with
docker pull chromadb/chroma
- Creating container and opening port with
docker run -d -p 8080:8000 --name chromadb chromadb/chroma
- Embedding the text chunks with Hugging Face models through the Sentence Transformers library
- Creating a collection in Chroma DB and adding the embeddings along with their metadata
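
As a rough illustration of the embedding and storage steps above, the sketch below embeds chunks and adds them to a Chroma collection over HTTP. The collection name `pdf_chunks`, the model `all-MiniLM-L6-v2`, the ID scheme, and the metadata fields are all assumptions, not values taken from this project. Port 8080 mirrors the host port published by the `docker run` command above.

```python
def build_records(chunks, source):
    """Build ids, documents, and metadata lists for a batch of chunks."""
    ids = [f"{source}-{i}" for i in range(len(chunks))]
    metadatas = [{"source": source, "chunk_index": i} for i in range(len(chunks))]
    return ids, list(chunks), metadatas


def index_chunks(chunks, source, host="localhost", port=8080,
                 collection_name="pdf_chunks",      # hypothetical name
                 model_name="all-MiniLM-L6-v2"):    # assumed model choice
    """Embed chunks and add them to a Chroma collection on a running server."""
    # Imported lazily so build_records works without these packages installed.
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    embeddings = model.encode(list(chunks)).tolist()  # numpy array -> lists

    client = chromadb.HttpClient(host=host, port=port)
    collection = client.get_or_create_collection(name=collection_name)

    ids, documents, metadatas = build_records(chunks, source)
    collection.add(ids=ids, documents=documents,
                   embeddings=embeddings, metadatas=metadatas)
    return collection
```

Keeping the record-building step separate from the network call makes it possible to inspect the IDs and metadata before anything is written to the database.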
- Retrieval
- Query-based document extraction: retrieving the relevant documents from the vector database based on the user's query
- Ranking the retrieved results based on similarity
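
To make the ranking step concrete, here is a minimal sketch that scores candidate documents against a query embedding with cosine similarity and keeps the best matches. The function names and the choice of cosine similarity are assumptions. A vector database such as Chroma reports distances with its query results, so an explicit re-ranking like this may not be needed in practice.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, docs, top_k=3):
    """Return the top_k (score, text) pairs, most similar first.

    `docs` is an iterable of (text, embedding) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```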
- Chains
- Maintaining conversation flow with LangChain's RetrievalQA, create_retrieval_chain and create_history_aware_retriever
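
One way `create_history_aware_retriever` and `create_retrieval_chain` can be combined is sketched below. The prompt wording and the helper names `build_conversational_chain` and `record_turn` are assumptions, and the import paths shown match LangChain 0.1 to 0.3 and may differ in other versions. The caller supplies the `llm` and `retriever` objects.

```python
def build_conversational_chain(llm, retriever):
    """Wire a history-aware retriever into a retrieval chain."""
    # Imported lazily; paths are those of LangChain 0.1-0.3 (an assumption).
    from langchain.chains import (create_history_aware_retriever,
                                  create_retrieval_chain)
    from langchain.chains.combine_documents import create_stuff_documents_chain
    from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

    # Step 1: turn a follow-up question into a standalone search query.
    rephrase_prompt = ChatPromptTemplate.from_messages([
        ("system", "Rewrite the latest user question as a standalone "
                   "question, using the chat history for context."),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ])
    history_aware = create_history_aware_retriever(llm, retriever, rephrase_prompt)

    # Step 2: answer from the retrieved documents, injected as {context}.
    qa_prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer using only the context below.\n\n{context}"),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ])
    combine_docs = create_stuff_documents_chain(llm, qa_prompt)
    return create_retrieval_chain(history_aware, combine_docs)


def record_turn(history, question, answer):
    """Return a new history list with one question/answer turn appended."""
    return history + [("human", question), ("ai", answer)]
```

A turn would then run as `result = chain.invoke({"input": question, "chat_history": history})`, with the reply in `result["answer"]` and `record_turn` producing the history for the next call.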
- Chains with Memory
- Maintaining conversation flow using memory with LangChain's ConversationChain() and VectorStoreRetrieverMemory
- Query Processing
- Preprocessing incoming user queries with named-entity recognition (NER), spelling correction and lemmatization
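
The preprocessing steps above might be combined as in the following sketch. It is an illustration under assumptions rather than the project's implementation: spelling correction is done with the standard library's `difflib` against a caller-supplied vocabulary, NER and lemmatization use spaCy's `en_core_web_sm` model, and tokens that belong to named entities are left untouched.

```python
import difflib


def correct_spelling(tokens, vocabulary, protected=frozenset(), cutoff=0.8):
    """Replace unknown tokens with their closest vocabulary word, if any."""
    known = {w.lower() for w in vocabulary}
    candidates = sorted(known)
    corrected = []
    for tok in tokens:
        low = tok.lower()
        if tok in protected or low in known:
            corrected.append(tok)  # named entity or already spelled correctly
            continue
        match = difflib.get_close_matches(low, candidates, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else tok)
    return corrected


def preprocess_query(query, vocabulary, model="en_core_web_sm"):
    """Run NER, spelling correction, and lemmatization over a user query."""
    # Imported lazily so correct_spelling works without spaCy installed.
    import spacy

    nlp = spacy.load(model)
    doc = nlp(query)
    # Entity tokens are protected from spelling correction.
    protected = {tok.text for ent in doc.ents for tok in ent}
    tokens = [tok.text for tok in doc if not tok.is_punct]
    corrected = correct_spelling(tokens, vocabulary, protected)
    # Lemmatize the corrected text, keeping entity tokens as written.
    lemmas = [tok.text if tok.ent_type_ else tok.lemma_
              for tok in nlp(" ".join(corrected))]
    return " ".join(lemmas)
```

A natural source for `vocabulary` is the set of words found in the PDF itself, so that misspelled domain terms are mapped back to the forms used in the document.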
- Chat Pipeline
- Creating a chat pipeline class that integrates all of the components above
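
A pipeline class of this kind could be structured as in the sketch below. The class layout, attribute names, and the `ask` method are hypothetical. Each stage is passed in as a plain callable, so the preprocessing, retrieval, and answering components can be swapped or stubbed out independently.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (user query, bot reply) pairs


@dataclass
class ChatPipeline:
    """Hypothetical wrapper that runs each stage of the chatbot in order."""
    preprocess: Callable[[str], str]
    retrieve: Callable[[str], List[str]]
    answer: Callable[[str, List[str], History], str]
    history: History = field(default_factory=list)

    def ask(self, query: str) -> str:
        cleaned = self.preprocess(query)                  # query processing
        documents = self.retrieve(cleaned)                # vector DB lookup
        reply = self.answer(cleaned, documents, self.history)  # chain / LLM
        self.history.append((query, reply))               # conversation memory
        return reply
```

For example, a pipeline built with stub stages:

```python
bot = ChatPipeline(
    preprocess=str.lower,
    retrieve=lambda q: [f"doc about {q}"],
    answer=lambda q, docs, hist: f"{len(docs)} document(s) found for '{q}'",
)
print(bot.ask("Chunking"))
```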
Every folder in this project contains a Jupyter notebook (.ipynb) file to display intermediate outputs.
P.S. This project builds a PDF chatbot focused on a single PDF file. It can be scaled to multiple PDFs, provided they share the same format so that they can be chunked properly.