Hi, this notebook is part of the article "Building Gen AI chatbot using documents as knowledge base". Link to the article: https://medium.com/@rahulmili277/building-genai-chatbot-for-querying-documents-38f80de5796e


Please follow the instructions below to run this notebook. You will need to create a directory and upload the documents that will be used as the knowledge base for answering queries.

1. Create a directory called 'documents' in '/content'
2. Upload the PDF or text documents into this directory
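
The directory setup can also be done from a code cell. The sketch below is a minimal example, not part of the original notebook: `prepare_directories` is a hypothetical helper name, and it also creates the 'chunks' directory that `create_context` writes to later in this notebook.

```python
import os


def prepare_directories(base_dir="/content"):
    """Create the folders the notebook reads from and writes to.

    Hypothetical helper: 'documents' holds the uploaded files and
    'chunks' receives the chunk files written by create_context.
    """
    documents_dir = os.path.join(base_dir, "documents")
    chunks_dir = os.path.join(base_dir, "chunks")
    for path in (documents_dir, chunks_dir):
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
    return documents_dir, chunks_dir
```

In Colab, calling `prepare_directories()` with the default argument should create '/content/documents' and '/content/chunks'. The documents still have to be uploaded manually.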

In [None]:
# Let's install the dependencies
!pip install google-cloud-aiplatform
!pip install pypdf
!pip install langchain
!pip install shapely==2.0.0
!pip install chromadb

Restart the runtime after running the above cell

Authenticate your Google account

In [None]:
from google.colab import auth

auth.authenticate_user()

text_utils -> extracts the text from a text or PDF file. Please refer to the article to read about this function

In [None]:
import os
from langchain.document_loaders import PyPDFLoader


def context_extractor(file_path):
    text_extensions = ['.txt', '.csv']  # Add more text extensions if needed
    pdf_extensions = ['.pdf']  # Add more PDF extensions if needed
    file_extension = os.path.splitext(file_path)[1].lower()

    if file_extension in text_extensions:
        # Plain-text files are read as-is
        with open(file_path, 'r') as file:
            return file.read()
    elif file_extension in pdf_extensions:
        # Load the PDF and join the text of the pieces returned by the loader
        pdf_loader = PyPDFLoader(file_path)
        pages = pdf_loader.load_and_split()
        return "\n\n".join(page.page_content for page in pages)
    else:
        # Return an empty string so unsupported files add nothing to the knowledge base
        print(f"Skipping unsupported file type: {file_path}")
        return ''

files_util -> walks your directory and collects the file paths, which are then passed as arguments to the above function

In [None]:
import os

def get_file_paths(directory):
    file_paths = []
    for root, _, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            file_paths.append(file_path)
    return file_paths
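
`get_file_paths` returns every file it finds, including files that `context_extractor` cannot read. As a hedged variation, the sketch below keeps only the extensions handled in this notebook and sorts the result so the order is repeatable. The names `SUPPORTED_EXTENSIONS` and `get_supported_file_paths` are hypothetical and not part of the article.

```python
import os

# Extensions handled by context_extractor in this notebook
SUPPORTED_EXTENSIONS = {".txt", ".csv", ".pdf"}


def get_supported_file_paths(directory, extensions=SUPPORTED_EXTENSIONS):
    """Walk the directory and keep only files with a supported extension."""
    paths = []
    for root, _, files in os.walk(directory):
        for name in files:
            # Compare extensions case-insensitively, e.g. '.PDF' matches '.pdf'
            if os.path.splitext(name)[1].lower() in extensions:
                paths.append(os.path.join(root, name))
    # Sort so repeated runs process the files in the same order
    return sorted(paths)
```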

create_context -> splits the extracted text into chunks and saves each chunk as a text file. Please refer to the article to read about this function

In [None]:
import os
from langchain.text_splitter import CharacterTextSplitter

def create_context(dir_path, chunks_dir='/content/chunks'):
    # Make sure the output directory exists before writing chunks
    os.makedirs(chunks_dir, exist_ok=True)

    # Split the text into chunks of about 3000 characters with a 100-character overlap
    text_splitter = CharacterTextSplitter(chunk_size=3000, chunk_overlap=100)
    texts_list = []
    for file_path in get_file_paths(dir_path):
        print(file_path)
        text = context_extractor(file_path)
        texts_list.extend(text_splitter.split_text(text))

    for chunk_index, chunk in enumerate(texts_list):
        # Create a unique file name for each chunk
        chunk_file_name = os.path.join(chunks_dir, f"chunk_{chunk_index}.txt")

        # Save the chunk as a text file
        with open(chunk_file_name, 'w') as file:
            file.write(chunk)

        # Print the file path of the saved chunk
        print(f"Saved chunk {chunk_index} as {chunk_file_name}")
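
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of splitting text into overlapping character windows. It is not how `CharacterTextSplitter` works internally: that class first splits on a separator and then merges the pieces. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=3000, chunk_overlap=100):
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Small values make the overlap visible: 'd' and 'g' appear in two chunks
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

The overlap helps when a sentence straddles a chunk boundary, because part of it appears in both neighbouring chunks.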

read_chunks -> reads the saved chunks from the chunks directory and loads them into memory

In [None]:
def read_chunks(dir_path):
    context_paths = get_file_paths(dir_path)
    chunks = []
    for context_path in context_paths:
        with open(context_path, 'r') as file:
            chunk = file.read()
            chunks.append(chunk)
            print(chunk)
    return chunks
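
`os.walk` does not guarantee any particular file order, so the chunks may be loaded in a different order than they were written. Retrieval does not depend on the order, but if you want to inspect the chunks in sequence, a sketch like the one below can help. It assumes the `chunk_<index>.txt` naming used by `create_context`. The function name `read_chunks_in_order` is hypothetical.

```python
import os
import re


def read_chunks_in_order(dir_path):
    """Read chunk_<index>.txt files sorted by their numeric index."""
    pattern = re.compile(r"chunk_(\d+)\.txt$")
    indexed = []
    for name in os.listdir(dir_path):
        match = pattern.match(name)
        if match:
            # Keep the index as an int so chunk_10 sorts after chunk_2
            indexed.append((int(match.group(1)), name))
    chunks = []
    for _, name in sorted(indexed):
        with open(os.path.join(dir_path, name), "r") as file:
            chunks.append(file.read())
    return chunks
```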

Run this cell to set up the chatbot. It extracts the text, creates the chunks, loads them into memory, creates the embeddings, and makes the chunks ready for querying.

In [None]:
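import vertexai
import warnings
# TextGenerationModel is used below; older SDK versions expose it under vertexai.preview.language_models
from vertexai.language_models import TextGenerationModel
from langchain.embeddings import VertexAIEmbeddings
from langchain.llms import VertexAI
from langchain.vectorstores import Chroma
import os
warnings.filterwarnings("ignore")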

PROJECT_ID = "YOUR_PROJECT_ID"
REGION = "PROJECT_REGION"

vertexai.init(project=PROJECT_ID, location=REGION)

# parameters for model
parameters = {
    "temperature": 0.2,
    "max_output_tokens": 380,
    "top_p": 0.8,
    "top_k": 40
}


# load the text generation model and the embedding model
model = TextGenerationModel.from_pretrained("text-bison@001")
vertex_embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@001")

# pass the path to the directory containing the documents
create_context('/content/documents')

# load chunks
chunks = read_chunks('/content/chunks')

# embed the chunks, index them in Chroma, and expose the index as a retriever
vector_index = Chroma.from_texts(chunks, vertex_embeddings).as_retriever()
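
To give an intuition for what the retriever does, the sketch below ranks chunks by cosine similarity to the question. It is a toy stand-in, not the Vertex AI or Chroma implementation: word counts replace the learned embeddings, and the names `embed`, `cosine_similarity` and `retrieve`, as well as the default `k=4`, are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, k=4):
    """Return the k chunks most similar to the question."""
    question_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(question_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]
```

A real embedding model maps text to dense vectors that capture meaning rather than exact word overlap, but the ranking step follows the same idea.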


This is the driver code, which queries the LLM and generates the responses. You can modify the prompts as needed for your use case.

In [None]:
# main driver function
def run_query(question):
    # retrieve the chunks most likely to contain the answer
    docs = vector_index.get_relevant_documents(question)
    answers = ''
    for i in docs:
        # query the LLM
        response = model.predict(
          f"Answer the question as precisely as possible using the provided context.\
           If the answer is not contained in the context, say 'answer not available \
           in context'. \n\n Context: \n {i.page_content} \nQuestion: \n {question} \n",
          **parameters
        )
        # append the answers
        answers = answers + "\n\n" + response.text

    final_response = model.predict(
        f"Given the extracted content and the question, create a conversational \
        final answer. If the answer is not contained in the context or if the context\
         is empty then, say 'answer not available in context'. \n\n Context: \n \
         {answers} \nQuestion: \n {question} \n",
        **parameters
        )
    return final_response.text
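
The driver follows a two-stage pattern: first answer the question against each retrieved chunk, then combine the partial answers into one final answer. The sketch below shows that control flow with the model call abstracted as a plain function `llm(prompt) -> str`. The names `build_prompt`, `answer_in_two_stages` and `llm`, and the shortened instruction strings, are hypothetical and used only for illustration.

```python
MAP_INSTRUCTION = (
    "Answer the question as precisely as possible using the provided context. "
    "If the answer is not contained in the context, say "
    "'answer not available in context'."
)
REDUCE_INSTRUCTION = (
    "Given the extracted content and the question, create a conversational "
    "final answer. If the answer is not contained in the context or if the "
    "context is empty, say 'answer not available in context'."
)


def build_prompt(instruction, context, question):
    """Assemble a prompt without the stray spaces of backslash continuations."""
    return f"{instruction}\n\nContext:\n{context}\n\nQuestion:\n{question}\n"


def answer_in_two_stages(question, contexts, llm):
    """Query the model once per context, then once more to merge the answers."""
    # Stage 1: one partial answer per retrieved chunk
    partial_answers = [
        llm(build_prompt(MAP_INSTRUCTION, context, question))
        for context in contexts
    ]
    # Stage 2: the partial answers become the context of the final prompt
    combined = "\n\n".join(partial_answers)
    return llm(build_prompt(REDUCE_INSTRUCTION, combined, question))
```

With the objects defined in this notebook, `llm` could be `lambda prompt: model.predict(prompt, **parameters).text` and `contexts` could be `[doc.page_content for doc in docs]`. Note that this pattern makes one model call per retrieved chunk plus one final call.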

Run this cell and pass the question you want to ask about the documents

In [None]:
print(run_query("Ask your question here"))