# Personal Chatbot Template

This template builds a personal chatbot that retrieves the most relevant of your personal documents and passes them as context to a language model (in this case OpenAI's gpt-3.5).

This template follows the steps of LangChain's [retrieval-augmented generation (RAG) guideline](https://python.langchain.com/docs/use_cases/question_answering.html). The steps are:

1. Document Loading: Loading all the desired documents and sources into the right format via specific document loaders (loaders for each file type exist, e.g. txt or pdf).
2. Splitting: Splitting the loaded documents into smaller, manageable chunks that fit into a language model's context window.
3. Storage: Storing the split and embedded documents in a vectorstore (database).
4. Retrieval: Retrieving the most relevant document splits based on a similarity measure.
5. Output: Feeding the language model the relevant context and obtaining the answer.

![Image of the Retrieval Augmented Generation](https://python.langchain.com/assets/images/qa_flow-9fbd91de9282eb806bda1c6db501ecec.jpeg)

## Preparations

The following code snippet sets the paths to the files that should be included in the context.

In [None]:
#get the paths to all relevant documents
import os
path_data = "path to your data"
#os.path.join works whether or not path_data ends with a path separator
doc_paths = [os.path.join(path_data, x) for x in os.listdir(path_data)]

## 1. Document Loading

To keep the process as simple as possible, we use only one file type (Word documents). If you want to use other file types, you need additional loaders for those document types.
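
Because the loader used below only handles Word documents, it can help to filter the collected paths by file extension first. The following is a minimal sketch of such a filter; the helper name `select_by_extension` and the example file names are hypothetical and not part of the template.

```python
from pathlib import Path


def select_by_extension(paths, extensions=(".docx",)):
    """Return only the paths whose suffix matches one of the given extensions."""
    wanted = {ext.lower() for ext in extensions}
    # Compare suffixes case-insensitively so that "CV.DOCX" is kept as well
    return [p for p in paths if Path(p).suffix.lower() in wanted]


# Hypothetical file names, for illustration only
example_paths = ["data/notes.docx", "data/report.PDF", "data/todo.txt", "data/CV.DOCX"]
docx_paths = select_by_extension(example_paths)
print(docx_paths)
```

Applied to the template, the same call on `doc_paths` would keep only the files that `Docx2txtLoader` is meant to read.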

In [None]:
%%capture
#install the necessary modules (%%capture must be the first line of the cell)
!pip install docx2txt
!pip install langchain

In [None]:
#import the document loader for word documents
from langchain.document_loaders import Docx2txtLoader

#a for loop is used for readability
#alternatively, a list comprehension does the same in one line:
#docs = [doc for path in doc_paths for doc in Docx2txtLoader(path).load()]
docs = []
for path in doc_paths:
    docs.extend(Docx2txtLoader(path).load())

## 2. Splitting

In [None]:
#import a text splitter that converts the documents into manageable size
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500, #each chunk is limited to 1500 characters
    chunk_overlap = 150 #each chunk repeats up to the last 150 characters of the previous chunk
)

splits = text_splitter.split_documents(docs)
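
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts a string into fixed-size character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break at separators such as paragraphs and sentences); the function `split_with_overlap` is a hypothetical helper, and the small sizes are chosen only to keep the output readable.

```python
def split_with_overlap(text, chunk_size=1500, chunk_overlap=150):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 25 characters of toy text: "0123456789012345678901234"
sample = "".join(str(i % 10) for i in range(25))
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each printed chunk starts with the last three characters of the chunk before it, which is the effect the overlap is meant to have: text near a chunk boundary appears in both neighbouring chunks.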

## 3. Storage

In [None]:
%%capture
#install the necessary modules (%%capture must be the first line of the cell)
!pip install openai
!pip install chromadb
!pip install tiktoken

You have to create an account with [OpenAI](https://openai.com/product) and create an API key to use their models within Langchain.

To create an API key, follow these steps:
1. Log in to your account
2. Select API
3. Click on "Personal" (top right corner)
4. Click on "Manage API keys"
5. Click on "Create new secret key"

In [None]:
#a better option is to store your API key in an environment variable instead of hard-coding it
openai_api_key = "provide your OpenAI API key here"
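
As a sketch of the environment-variable option mentioned in the comment above, the key can be read with the standard library instead of being written into the notebook. The helper name `get_openai_api_key` is hypothetical, and the example assumes that you have set a variable called `OPENAI_API_KEY` in your system beforehand.

```python
import os


def get_openai_api_key(var_name="OPENAI_API_KEY"):
    """Read the API key from an environment variable and fail clearly if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key
```

With the variable set, `openai_api_key = get_openai_api_key()` could replace the hard-coded string, which keeps the key out of the saved notebook.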

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

#create a vector database that stores the documents and their respective embeddings
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(openai_api_key=openai_api_key),
    persist_directory='./chroma/'
)

## 4. Retrieval

In [None]:
#set vector database as retriever for the RetrievalQA chain
retriever = vectordb.as_retriever()
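
The retriever ranks the stored splits by how similar their embeddings are to the embedding of the question. The sketch below illustrates the idea with cosine similarity on made-up three-dimensional vectors. The names `cosine_similarity`, `top_k` and `toy_embeddings` are hypothetical, real embeddings have far more dimensions, and the similarity measure actually used depends on how the vector store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, split_vecs, k=2):
    """Return the names of the k splits whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in split_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Made-up vectors standing in for the embeddings of three document splits
toy_embeddings = {
    "split_a": [1.0, 0.0, 0.0],
    "split_b": [0.9, 0.1, 0.0],
    "split_c": [0.0, 1.0, 0.0],
}
best = top_k([1.0, 0.05, 0.0], toy_embeddings, k=2)
print(best)
```

The query vector points almost in the same direction as `split_a` and `split_b`, so those two are returned and `split_c` is left out.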

## 5. Output

In [None]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

#create a RetrievalQA chain
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(openai_api_key=openai_api_key),
    chain_type="stuff",
    retriever=retriever
)

In [None]:
qa.run("Ask any question about your documents")
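
The `"stuff"` chain type places all retrieved splits into a single prompt together with the question. The following sketch shows that idea in plain Python. The function `build_stuff_prompt`, the prompt wording and the example texts are hypothetical and do not reproduce LangChain's actual prompt template.

```python
def build_stuff_prompt(question, context_chunks):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up retrieved splits and question, for illustration only
prompt = build_stuff_prompt(
    "When does the lease end?",
    ["The lease started in March.", "The lease runs for twelve months."],
)
print(prompt)
```

Because every retrieved split ends up in one prompt, the number and size of the splits must stay within the context window of the language model.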

### Attention

The process above works fine if you only need these documents once.
If you want to come back later and ask more questions about the information in your documents, you do not need to run everything above again. Doing so would be expensive in the long run, because embedding your documents during the creation of the vector database is not free of charge. The code above creates a directory where all the document splits and their embeddings are saved, so you can skip the earlier steps and just load the saved embeddings. Use the following code instead, right before step 5:

In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

#load the vector database from the saved directory
#(openai_api_key must be defined as in step 3)
db = Chroma(persist_directory="./chroma", embedding_function=OpenAIEmbeddings(openai_api_key=openai_api_key))
retriever = db.as_retriever()
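
To decide automatically whether the saved embeddings can be loaded or the documents have to be embedded first, you can check whether the persist directory exists and is not empty. This is only a heuristic sketch: the helper `has_persisted_store` is hypothetical, and which files Chroma writes into the directory depends on the installed version, so the check looks for any content rather than for specific file names.

```python
import tempfile
from pathlib import Path


def has_persisted_store(persist_directory="./chroma"):
    """Return True if the directory exists and contains at least one entry."""
    directory = Path(persist_directory)
    return directory.is_dir() and any(directory.iterdir())


# Demonstration with a temporary directory instead of a real Chroma store
with tempfile.TemporaryDirectory() as tmp:
    empty_result = has_persisted_store(tmp)
    (Path(tmp) / "placeholder.bin").write_bytes(b"")
    filled_result = has_persisted_store(tmp)
print(empty_result, filled_result)
```

If `has_persisted_store("./chroma")` returns `True`, the loading code above can be used; otherwise steps 1 to 3 need to be run once to create the directory.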

Step 5 is exactly the same as before.

In [None]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

#create a RetrievalQA chain
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(openai_api_key=openai_api_key),
    chain_type="stuff",
    retriever=retriever
)

qa.run("Ask any question about your documents")