# Build a chatbot that answers from its own ChromaDB knowledge base with LangChain

In this notebook I build a question-answering (QA) chatbot that answers from our own knowledge base, stored in ChromaDB.

### The function to ingest documents from a directory

In [1]:
# import helper modules (dotenv, os, typing) and the LangChain embeddings class
from dotenv import load_dotenv, find_dotenv
from langchain.embeddings import OpenAIEmbeddings


import argparse
from typing import Union, Optional
import os

### Implementing chatbot

In [2]:

from langchain.schema import AIMessage, SystemMessage, HumanMessage
from langchain.chat_models import ChatOpenAI

In [3]:

# ChatOpenAI and OpenAIEmbeddings are already imported in the cells above
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import RetrievalQA, LLMChain, ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Chroma

In [4]:
# load the .env file
load_dotenv(find_dotenv())
open_ai_api_key = os.getenv('OPENAI_API_KEY')
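
`os.getenv` returns `None` when the variable is missing, so a missing key only surfaces later, at the first API call. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (it is not part of this notebook or of `python-dotenv`):

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Demo with a throwaway variable name so no real key is touched.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the notebook this could be used as `open_ai_api_key = require_env('OPENAI_API_KEY')` right after `load_dotenv(...)`.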

In [5]:
embeddings = OpenAIEmbeddings(openai_api_key=open_ai_api_key)

In [6]:
# Swap in pysqlite3 for the built-in sqlite3 on Linux-based systems.
# This is not needed on Windows; comment these lines out there.
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

# load the persisted knowledge base from the 'db' directory
vector_store = Chroma(persist_directory='db', embedding_function=embeddings)
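
The `sqlite3` swap in the cell above has to be commented out by hand on systems where `pysqlite3` is not installed. A hedged alternative, assuming the only goal is to prefer `pysqlite3` when it is available, is to guard the import so the same cell runs on every platform:

```python
import sys

try:
    # Prefer pysqlite3 when it is installed.
    __import__("pysqlite3")
    sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
except ImportError:
    # Otherwise keep the standard-library sqlite3 module.
    pass

import sqlite3

# Show which SQLite library version is actually in use.
print(sqlite3.sqlite_version)
```

For the swap to take effect, this has to run before anything else imports `sqlite3`, which in this notebook means before the `Chroma` store is created.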


In [7]:
# build a retriever on top of the vector store

retriever = vector_store.as_retriever()
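
Conceptually, a retriever embeds the query and returns the stored chunks whose vectors are most similar to it. The toy sketch below illustrates that idea with hand-made 3-dimensional vectors and cosine similarity. The documents, the vectors, and the `top_k` helper are invented for illustration; this is not how Chroma is implemented internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, store, k=2):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# (text, embedding) pairs; the numbers are made up.
toy_store = [
    ("Smith-Waterman performs local alignment.", [0.9, 0.1, 0.0]),
    ("Suffix trees index a text for fast search.", [0.1, 0.9, 0.0]),
    ("BLAST searches sequence databases.", [0.7, 0.2, 0.1]),
]

print(top_k([1.0, 0.0, 0.0], toy_store, k=2))
```

In LangChain, the number of returned chunks can be set when building the retriever, for example `vector_store.as_retriever(search_kwargs={"k": 4})`.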

In [8]:
prompt_template = '''
You are a Bioinformatics expert with immense knowledge and experience in the field.
Answer my questions based on your knowledge and our older conversation. Do not make up answers.
If you do not know the answer to a question, just say "I don't know".

{context}

question: {question}
'''

PROMPT = PromptTemplate(
            template=prompt_template, input_variables=["context", "question"]
        )
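
The two placeholders are filled at query time: the retrieved chunks go into `{context}` and the user's question into `{question}`. A rough illustration with plain `str.format`, assuming the chunks are joined with blank lines; the template is shortened and the chunk texts are invented:

```python
# Shortened stand-in for the notebook's prompt template.
template = """Answer using the context below.

{context}

question: {question}
"""

# Invented chunk texts standing in for retrieved documents.
chunks = ["Chunk A about alignment.", "Chunk B about BLAST."]

filled = template.format(context="\n\n".join(chunks), question="What is BLAST?")
print(filled)
```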

In [9]:
chain_type_kwargs = {"prompt": PROMPT}

In [10]:
# define the chat model

chat_model = ChatOpenAI(
        model='gpt-4-0314',
        temperature=0.7,
        model_kwargs = {'top_p':0.5,
        'presence_penalty':0,
        'frequency_penalty':0,},
        n=1,
        streaming=True,
        callbacks=[StreamingStdOutCallbackHandler()],
    )

In [11]:
memory = ConversationBufferMemory(
                                    memory_key="chat_history",
                                    max_len=50,
                                    return_messages=True,
                                )

In [12]:
chain = RetrievalQA.from_chain_type(
                                llm=chat_model,
                                chain_type="stuff",
                                retriever=retriever,
                                chain_type_kwargs=chain_type_kwargs,
                                memory=memory,
                                callbacks = [StreamingStdOutCallbackHandler()],
                            )

In [13]:
q = 'What are some famous algorithms in bioinformatics? give me references too.'

In [14]:
answer = chain.run(q)

Some famous algorithms in bioinformatics include:

1. Needleman-Wunsch Algorithm: This algorithm is used for global sequence alignment, which finds the best alignment between two sequences by maximizing the similarity score. Reference: Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.

2. Smith-Waterman Algorithm: This algorithm is used for local sequence alignment, which finds the most similar subsequences between two sequences. Reference: Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.

3. BLAST (Basic Local Alignment Search Tool): This algorithm is used for searching sequence databases for local alignments with a query sequence. Reference: Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal 

In [15]:
import pprint

In [16]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(answer)

('Some famous algorithms in bioinformatics include:\n'
 '\n'
 '1. Needleman-Wunsch Algorithm: This algorithm is used for global sequence '
 'alignment, which finds the best alignment between two sequences by '
 'maximizing the similarity score. Reference: Needleman, S. B., & Wunsch, C. '
 'D. (1970). A general method applicable to the search for similarities in the '
 'amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), '
 '443-453.\n'
 '\n'
 '2. Smith-Waterman Algorithm: This algorithm is used for local sequence '
 'alignment, which finds the most similar subsequences between two sequences. '
 'Reference: Smith, T. F., & Waterman, M. S. (1981). Identification of common '
 'molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.\n'
 '\n'
 '3. BLAST (Basic Local Alignment Search Tool): This algorithm is used for '
 'searching sequence databases for local alignments with a query sequence. '
 'Reference: Altschul, S. F., Gish, W., Miller, W., Myers, E.

In [17]:
# let's check if there is something in memory

memory.chat_memory.messages

[HumanMessage(content='What are some famous algorithms in bioinformatics? give me references too.', additional_kwargs={}, example=False),
 AIMessage(content='Some famous algorithms in bioinformatics include:\n\n1. Needleman-Wunsch Algorithm: This algorithm is used for global sequence alignment, which finds the best alignment between two sequences by maximizing the similarity score. Reference: Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.\n\n2. Smith-Waterman Algorithm: This algorithm is used for local sequence alignment, which finds the most similar subsequences between two sequences. Reference: Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.\n\n3. BLAST (Basic Local Alignment Search Tool): This algorithm is used for searching sequence databases for local a

In [18]:
pp.pprint(chain.run('Would you please explain the 2nd algo'))

The second algorithm mentioned in the text is the Z-algorithm, which is a string matching algorithm that runs in linear time with respect to the length of the text. The Z-algorithm is used to find all occurrences of a pattern in a given text. It works by constructing an auxiliary array, called the Z-array, which stores the length of the longest common prefix between the pattern and the corresponding substring of the text.

The Z-algorithm starts by concatenating the pattern and the text, separated by a special character (e.g., $) that does not appear in either the pattern or the text. Then, it computes the Z-array for this concatenated string. The Z-array is an array of integers, where the value at position i represents the length of the longest common prefix between the substring starting at position i and the concatenated string itself.

The algorithm iterates through the concatenated string, calculating the Z-values for each position. When a Z-value is equal to the length of the pat

In [19]:
# let's check if there is something in memory

memory.chat_memory.messages

[HumanMessage(content='What are some famous algorithms in bioinformatics? give me references too.', additional_kwargs={}, example=False),
 AIMessage(content='Some famous algorithms in bioinformatics include:\n\n1. Needleman-Wunsch Algorithm: This algorithm is used for global sequence alignment, which finds the best alignment between two sequences by maximizing the similarity score. Reference: Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.\n\n2. Smith-Waterman Algorithm: This algorithm is used for local sequence alignment, which finds the most similar subsequences between two sequences. Reference: Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.\n\n3. BLAST (Basic Local Alignment Search Tool): This algorithm is used for searching sequence databases for local a

In [20]:
pp.pprint(chain.run('why is it important in bioinformatics?'))

It is important in bioinformatics because bioinformatics is an interdisciplinary field that combines biology, computer science, mathematics, and statistics to analyze and interpret biological data, particularly genomic and molecular data. This field plays a crucial role in understanding the molecular mechanisms of living organisms, discovering new genes and their functions, identifying potential drug targets, and developing personalized medicine. By using computational tools and techniques, bioinformatics experts can efficiently analyze large-scale biological data, identify patterns and relationships, and make predictions that can help advance our understanding of various biological processes and diseases. Additionally, bioinformatics has become an essential component of modern biological research, as it enables researchers to manage, analyze, and interpret the vast amounts of data generated by high-throughput technologies such as next-generation sequencing, microarrays, and proteomics

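**Chat history (remembering earlier turns of the conversation) did not work with `RetrievalQA`: the messages are stored in memory, but the follow-up answers above ignore the first answer. Let's try `ConversationalRetrievalChain` instead.**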
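
A likely reason, assuming the chain only renders the variables its prompt declares: the first prompt contains `{context}` and `{question}` but no `{chat_history}`, so the memory is filled after each call yet never shown to the model. The standard-library sketch below lists a template's placeholders, which makes this kind of omission easy to spot. `placeholders` is a hypothetical helper, and the template is a shortened copy of the one used earlier.

```python
from string import Formatter


def placeholders(template: str) -> set:
    """Return the set of {field} names that appear in a format string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


first_prompt = "{context}\n\nquestion: {question}\n"

print(sorted(placeholders(first_prompt)))
print("chat_history" in placeholders(first_prompt))
```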

In [21]:
from langchain.chains import ConversationalRetrievalChain

In [22]:
# build memory
memory = ConversationBufferMemory(
                                    memory_key="chat_history",
                                    max_len=50,
                                    return_messages=True,
                                )

In [23]:
prompt_template = '''
You are a Bioinformatics expert with immense knowledge and experience in the field. Your name is Dr. Fanni.
Answer my questions based on your knowledge and our older conversation. Do not make up answers.
If you do not know the answer to a question, just say "I don't know".

Given the following conversation and a follow up question, answer the question.

{chat_history}

question: {question}
'''

PROMPT = PromptTemplate.from_template(
            template=prompt_template
        )

In [24]:
chain = ConversationalRetrievalChain.from_llm(
                                                chat_model,
                                                retriever,
                                                memory=memory,
                                                condense_question_prompt=PROMPT
                                            )
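
For context, `ConversationalRetrievalChain` works in steps: it first condenses the chat history and the follow-up into a standalone question, then retrieves documents for that question, then answers from them. The sketch below mimics that flow with plain functions. `fake_condense`, `fake_retrieve`, and `fake_answer` are stand-ins invented here, not LangChain APIs.

```python
def fake_condense(history, follow_up):
    # Stand-in for the LLM call that rewrites a follow-up as a standalone question.
    if not history:
        return follow_up
    last_answer = history[-1][1]
    return f"{follow_up} (referring to: {last_answer})"


def fake_retrieve(question):
    # Stand-in for the vector-store lookup.
    return [f"doc matching '{question}'"]


def fake_answer(question, docs):
    # Stand-in for the LLM call that answers from the retrieved documents.
    return f"Answer to '{question}' using {len(docs)} document(s)."


def conversational_step(history, follow_up):
    """Run condense -> retrieve -> answer, then record the turn in history."""
    standalone = fake_condense(history, follow_up)
    docs = fake_retrieve(standalone)
    answer = fake_answer(standalone, docs)
    history.append((follow_up, answer))
    return standalone, answer


history = []
print(conversational_step(history, "Name a local alignment tool."))
print(conversational_step(history, "Why is it important?"))
```

Because `condense_question_prompt` controls the first step, a prompt passed there shapes the question rewrite rather than the final answer. If the aim is to customise the answering prompt, `from_llm` also accepts `combine_docs_chain_kwargs={"prompt": ...}`; whether that fits this notebook better is left as an open design choice.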

In [25]:
q1 = 'What are some famous algorithms in bioinformatics? give me references too.'
q2 = 'Would you please explain the 3rd algorithm mentioned.'
q3 = 'What is its importance in bioinformatics?'


In [26]:
pp.pprint(chain({'question': q1, 'chat_history': memory.chat_memory.messages}))

1. Needleman-Wunsch Algorithm: This algorithm is used for global sequence alignment, finding the best alignment between two sequences by maximizing the similarity score. It was introduced by Needleman and Wunsch in 1970.

Reference: Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.

2. Smith-Waterman Algorithm: This algorithm is used for local sequence alignment, finding the most similar subsequences between two sequences. It was developed by Smith and Waterman in 1981.

Reference: Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.

3. BLAST (Basic Local Alignment Search Tool): BLAST is a widely used algorithm for searching sequence databases for similar sequences. It was developed by Altschul, Gish, Miller, Myers, and Lipman in 1990.

Reference: Altschul, S. F., 

In [27]:
pp.pprint(chain({'question': q2, 'chat_history': memory.chat_memory.messages}))

Certainly! The 3rd algorithm mentioned is BLAST, which stands for Basic Local Alignment Search Tool. BLAST is a widely used algorithm in bioinformatics for searching sequence databases to find similar sequences. It was developed by Altschul, Gish, Miller, Myers, and Lipman in 1990.

BLAST works by comparing a query sequence (such as a DNA or protein sequence) to a database of known sequences. It identifies regions of similarity between the query sequence and sequences in the database, allowing researchers to find sequences that are evolutionarily related or share functional similarities. The algorithm is designed to be fast and efficient, making it suitable for searching large databases.

BLAST uses a heuristic approach to speed up the search process. It first identifies short, highly conserved regions (called "words" or "seeds") between the query sequence and database sequences. Then, it extends these initial matches to find longer alignments with higher similarity scores. The algorit

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIError: The server had an error while processing your request. Sorry about that! {
  "error": {
    "message": "The server had an error while processing your request. Sorry about that!",
    "type": "server_error",
    "param": null,
    "code": null
  }
}
 500 {'error': {'message': 'The server had an error while processing your request. Sorry about that!', 'type': 'server_error', 'param': None, 'code': None}} {'Date': 'Tue, 17 Oct 2023 19:45:52 GMT', 'Content-Type': 'application/json', 'Content-Length': '176', 'Connection': 'keep-alive', 'access-control-allow-origin': '*', 'openai-model': 'gpt-4-0314', 'openai-organization': 'shwra', 'openai-processing-ms': '61', 'openai-version': '2020-10-01', 'strict-transport-security': 'max-age=15724800; includeSubDomains', 'x-ratelimit-limit-requests': '200', 'x-ratelimit-limit-tokens': '10000', 'x-ratelimit-remaini

In summary, BLAST is a popular algorithm used in bioinformatics for searching sequence databases to find similar sequences. It compares a query sequence to a database of known sequences and identifies regions of similarity, allowing researchers to find sequences that are evolutionarily related or share functional similarities. The algorithm is fast and efficient, making it suitable for searching large databases. BLAST uses a heuristic approach to speed up the search process by identifying short, highly conserved regions and extending these initial matches to find longer alignments with higher similarity scores.

{   'answer': 'In summary, BLAST is a popular algorithm used in bioinformatics '
              'for searching sequence databases to find similar sequences. It '
              'compares a query sequence to a database of known sequences and '
              'identifies regions of similarity, allowing researchers to find '
              'sequences that are evolutionarily related or s

In [28]:
pp.pprint(chain({'question': q3, 'chat_history': memory.chat_memory.messages}))

BLAST's importance in bioinformatics lies in its ability to quickly and efficiently search large sequence databases for similar sequences. This is crucial for various applications, such as identifying homologous genes or proteins, annotating newly sequenced genomes, studying evolutionary relationships, and predicting protein functions based on sequence similarity. By finding sequences that are evolutionarily related or share functional similarities, researchers can gain insights into the biological roles and molecular mechanisms of genes and proteins. Overall, BLAST has become an indispensable tool in the field of bioinformatics due to its speed, accuracy, and wide range of applications.

Additionally, BLAST incorporates statistical theory, allowing the direct computation of the significance of a match, which helps researchers determine whether a given local alignment might be due to chance alone. This feature is particularly important in assessing the significance of alignments in real-

It worked by passing chat history as an argument to the prompt.
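
Passing chat history to the prompt amounts to rendering the earlier turns as text and substituting them for `{chat_history}`. A minimal sketch, assuming turns are stored as `(role, text)` pairs; the `render_history` helper, the role labels, and the shortened template are illustrative choices:

```python
def render_history(turns):
    """Render (role, text) pairs as one 'role: text' line per turn."""
    return "\n".join(f"{role}: {text}" for role, text in turns)


template = "{chat_history}\n\nquestion: {question}\n"

turns = [
    ("Human", "What are some famous algorithms in bioinformatics?"),
    ("AI", "Needleman-Wunsch, Smith-Waterman, and BLAST."),
]

prompt = template.format(
    chat_history=render_history(turns),
    question="Would you please explain the 3rd algorithm mentioned.",
)
print(prompt)
```

With the earlier turns in the prompt text, a reference such as "the 3rd algorithm" can be resolved against the previous answer.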

Let's test the same approach with `RetrievalQA` as well.

### RetrievalQA with chat history

In [118]:
prompt_template = '''
You are a Bioinformatics expert with immense knowledge and experience in the field.
Answer my questions based on your knowledge and our older conversation. Do not make up answers.
If you do not know the answer to a question, just say "I don't know".

{context}

Given the following conversation and a follow up question, answer the question.

{chat_history}

question: {question}
'''

PROMPT = PromptTemplate(
            template=prompt_template, input_variables=["context", "chat_history", "question"]
        )

In [119]:
chain_type_kwargs = {"prompt": PROMPT}

In [120]:
memory = ConversationBufferMemory(
                                    memory_key="chat_history",
                                    max_len=50,
                                    return_messages=True,
                                )

In [128]:
from langchain.chains.question_answering import load_qa_chain
qa_chain = load_qa_chain(chat_model, chain_type="stuff")


In [141]:
from langchain.chains import RetrievalQAWithSourcesChain


In [146]:
chain = RetrievalQAWithSourcesChain(
                        combine_documents_chain=qa_chain,
                        retriever=retriever,
                        memory=memory,
                        callbacks = [StreamingStdOutCallbackHandler()],
                    )

ValidationError: 1 validation error for RetrievalQAWithSourcesChain
prompt
  extra fields not permitted (type=value_error.extra)

In [145]:
inputs = {
    "chat_history": memory.chat_memory.messages, 
}

In [137]:
inputs['query'] = q1
pp.pprint(chain(inputs))

There are several famous algorithms in bioinformatics that have significantly contributed to the field. Here are a few notable ones along with their references:

1. Needleman-Wunsch Algorithm: This algorithm is used for global sequence alignment of two sequences. It is based on dynamic programming and was one of the first algorithms developed for this purpose.
Reference: Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.

2. Smith-Waterman Algorithm: This algorithm is used for local sequence alignment of two sequences. It is also based on dynamic programming and is an extension of the Needleman-Wunsch algorithm.
Reference: Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.

3. BLAST (Basic Local Alignment Search Tool): BLAST is a widely used algorithm for searching

In [140]:
inputs['query'] = q2
pp.pprint(chain(inputs))

The third algorithm mentioned in the text is the Suffix Tree algorithm, which is a data structure for indexing a text. Suffix trees are used for efficient string matching and searching operations in computational biology and computer science.

Once the suffix tree is built for a given text, it has a remarkable property: a text can be searched for a pattern in time proportional to the length of the pattern, rather than the length of the text. Building the suffix tree itself takes time proportional to the length of the text, but the factor of proportionality is relatively large.

In the context of computational biology, suffix trees are useful for searching sequence patterns in molecular genetic data more efficiently than other string matching algorithms. This is particularly helpful when the text is stable and needs to be searched repeatedly, as it allows for faster search times and more efficient use of computational resources.

{   'chat_history': [   HumanMessage(content='What are some