# PDF Knowledge-base QA using Chroma DB, LangChain and Mistral LLM

### Import a set of PDFs, split them into chunks and store their embeddings in a vector database. Then query the database with a question, retrieve the most relevant chunks and pass them to a Mistral LLM through LangChain to generate an answer.

# Imports

In [1]:
%load_ext dotenv
%dotenv

import os

# Loaders, embeddings and vector stores live in langchain_community;
# the langchain.* paths for them are deprecated aliases.
from langchain_community.document_loaders import DirectoryLoader, PyPDFDirectoryLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

from langchain_mistralai.chat_models import ChatMistralAI
from langchain.chains.question_answering import load_qa_chain

# Loading documents

In [2]:
# loading txt documents for testing purposes
txt_directory = 'documents/txt/'

def load_txt_docs(directory):
    loader = DirectoryLoader(directory)
    documents = loader.load()
    return documents

txt_documents = load_txt_docs(txt_directory)
txt_documents

[Document(page_content='sjdfbwdf wfgjkergjher wgerg ewrg ergg ergewrwgerrg r erg gewrg erwg reg wgerg erg erge\\n erg wreg \\n', metadata={'source': 'documents/txt/toto.txt'})]

In [3]:
pdf_directory = 'documents/pdf/'

def load_pdf_docs(directory):
    loader = PyPDFDirectoryLoader(directory)
    documents = loader.load()
    return documents

pdf_documents = load_pdf_docs(pdf_directory)
pdf_documents

[Document(page_content='SUPERVISED CHORUS DETECTION FOR POPULAR MUSIC USING CONVOLUTIONAL\nNEURAL NETWORK AND MULTI-TASK LEARNING\nJu-Chiang Wang, Jordan B.L. Smith, Jitong Chen, Xuchen Song, and Yuxuan Wang\nByteDance\n{ju-chiang.wang, jordan.smith, chenjitong.1, xuchen.song, wangyuxuan.11 }@bytedance.com\nABSTRACT\nThis paper presents a novel supervised approach to de-\ntecting the chorus segments in popular music. Traditional ap-\nproaches to this task are mostly unsupervised, with pipelines\ndesigned to target some quality that is assumed to deﬁne\n“chorusness,” which usually means seeking the loudest or\nmost frequently repeated sections. We propose to use a\nconvolutional neural network with a multi-task learning ob-\njective, which simultaneously ﬁts two temporal activation\ncurves: one indicating “chorusness” as a function of time,\nand the other the location of the boundaries. We also propose\na post-processing method that jointly takes into account the\nchorus and boundary pr

29 documents for 3 PDFs: `PyPDFDirectoryLoader` returns one document per page.

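# Splitting the text

`RecursiveCharacterTextSplitter` tries the separators `["\n\n", "\n", " ", ""]` in sequential order: it splits on paragraph breaks first and only falls back to the next separator for pieces that are still longer than `chunk_size`.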

In [4]:
def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    # chunk_size and chunk_overlap are measured in characters
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs

docs = split_docs(pdf_documents)
docs


[Document(page_content='SUPERVISED CHORUS DETECTION FOR POPULAR MUSIC USING CONVOLUTIONAL\nNEURAL NETWORK AND MULTI-TASK LEARNING\nJu-Chiang Wang, Jordan B.L. Smith, Jitong Chen, Xuchen Song, and Yuxuan Wang\nByteDance\n{ju-chiang.wang, jordan.smith, chenjitong.1, xuchen.song, wangyuxuan.11 }@bytedance.com\nABSTRACT\nThis paper presents a novel supervised approach to de-\ntecting the chorus segments in popular music. Traditional ap-\nproaches to this task are mostly unsupervised, with pipelines\ndesigned to target some quality that is assumed to deﬁne\n“chorusness,” which usually means seeking the loudest or\nmost frequently repeated sections. We propose to use a\nconvolutional neural network with a multi-task learning ob-\njective, which simultaneously ﬁts two temporal activation\ncurves: one indicating “chorusness” as a function of time,\nand the other the location of the boundaries. We also propose\na post-processing method that jointly takes into account the\nchorus and boundary pr

144 chunks from 29 documents
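
The separator cascade can be illustrated with a small pure-Python sketch. This is not LangChain's implementation: the function name `recursive_split` is hypothetical, and chunk overlap is left out to keep the idea visible.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort (empty separator or none left): cut at fixed offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb cccc\n\ndddd eeee"
pieces = recursive_split(sample, chunk_size=10)
```

Here the first paragraph is too long for one chunk, so it is split again on spaces, while the second paragraph fits and stays whole.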

# Embedding Text

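`SentenceTransformerEmbeddings` wraps the sentence-transformers library, which originates from Sentence-BERT. The `all-MiniLM-L6-v2` model used here produces 384-dimensional embeddings.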

In [5]:
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
embeddings

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)
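
The printed module list shows three stages: a Transformer that yields one vector per token, a Pooling layer with `pooling_mode_mean_tokens` enabled, and a Normalize layer. A minimal sketch of the last two stages, using toy 2-dimensional token vectors and assuming no padding tokens (the helper names are hypothetical, not part of sentence-transformers):

```python
import math


def mean_pool(token_vectors):
    """Average the token vectors dimension by dimension."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]


def l2_normalize(vector):
    """Scale the vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Two toy "token" vectors stand in for the Transformer output.
sentence_vector = l2_normalize(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

Because the result has unit length, ranking neighbours by Euclidean distance or by cosine similarity gives the same order.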

# Chroma DB embedding store


In [6]:
db = Chroma.from_documents(docs, embeddings)
db

<langchain_community.vectorstores.chroma.Chroma at 0x2b8f19150>

# Query vector DB

In [7]:
query = "How to go from a time/time similarity matrix to a time/lag surface matrix ?"
matching_docs = db.similarity_search(query)

matching_docs[0]

Document(page_content='Figure 8. Similarity matrix using spectral features from the bridge of "Day Tripper" by the Beatles. The Time-Lag Matrix When the goal is to find repeating sequence patterns, it is sometimes simpler to change coordinate systems so that patterns appear as horizontal or vertical lines. The time-lag matrix r is defined by:   r(t, l) = S(t, t−l), where t−l ≥ 0 (8) Thus, if there is repetition, there will be a sequence of similar frames with a constant lag. Since lag is represented by the vertical axis, a constant lag implies a horizontal line. The time-lag version of Figure 7 is shown in Figure 9. Only the lines representing similar sequences are shown, and the grayscale has been reversed, so that similarity is indicated by black lines.', metadata={'page': 9, 'source': 'documents/pdf/Music_Structure_Analysis_from_Acoustic_Signals.pdf'})
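
Conceptually, `similarity_search` embeds the query with the same model and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force scan with cosine similarity. It is an illustration only: the helper names are hypothetical, the vectors are toy values, and Chroma's actual index and distance metric are configured elsewhere and not shown in this notebook.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=4):
    """Return the texts of the k entries most similar to query_vector.

    store is a list of (text, vector) pairs.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


store = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
closest = top_k([1.0, 0.0], store, k=2)
```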

# Connecting the LLM to Chroma DB through LangChain

In [9]:
MISTRAL_KEY = os.environ.get('MISTRAL_KEY')

In [10]:
llm = ChatMistralAI(model="mistral-small", temperature=0, mistral_api_key=MISTRAL_KEY)
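
`os.environ.get` returns `None` when the variable is missing, so a typo in the `.env` file may only show up later as a failed API call. One possible guard is a small helper that fails early; `require_env` is a hypothetical name, not part of dotenv or LangChain.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check the .env file")
    return value


# Usage in this notebook would look like:
# MISTRAL_KEY = require_env("MISTRAL_KEY")
```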

# Extracting the answer from the document

In [11]:
# "stuff" puts all retrieved chunks into a single prompt; verbose=True prints that prompt
chain = load_qa_chain(llm, chain_type="stuff", verbose=True)

query = "How to go from a time/time similarity matrix to a time/lag surface matrix ?"
matching_docs = db.similarity_search(query)
answer = chain.run(input_documents=matching_docs, question=query)
answer


> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
Figure 8. Similarity matrix using spectral features from the bridge of "Day Tripper" by the Beatles. The Time-Lag Matrix When the goal is to find repeating sequence patterns, it is sometimes simpler to change coordinate systems so that patterns appear as horizontal or vertical lines. The time-lag matrix r is defined by:   r(t, l) = S(t, t−l), where t−l ≥ 0 (8) Thus, if there is repetition, there will be a sequence of similar frames with a constant lag. Since lag is represented by the vertical axis, a constant lag implies a horizontal line. The time-lag version of Figure 7 is shown in Figure 9. Only the lines representing similar sequences are shown, and the grayscale has 

"To go from a time/time similarity matrix to a time/lag surface matrix, you can use the time-lag matrix concept. The time-lag matrix, as defined by r(t, l) = S(t, t−l), where t−l ≥ 0, represents similarity in a coordinate system where repetition appears as horizontal lines. This is because lag is represented by the vertical axis, and a constant lag implies a horizontal line.\n\nIn the context of the given text, the time-lag matrix is used to find alignment paths that maximize the average similarity of the aligned features. To create a time/lag surface matrix from a time/time similarity matrix, you would need to transform the coordinate system of the similarity matrix to have lines of constant lag oriented along the diagonals. This would result in an extended area of high correlation along one of the diagonals indicating an extended region of similarity between two portions of a song.\n\nAfter this transformation, you can filter along the diagonals of the similarity matrix to compute si
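
The verbose log shows what the "stuff" chain does: the retrieved chunks are concatenated and placed under a fixed system instruction, and the question follows as the user message. A rough sketch of that assembly is given below. The system text is copied from the log above; the function name `build_stuff_prompt`, the `(role, text)` message format and the blank-line separator between chunks are assumptions made for illustration.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the user's question. \n"
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n"
    "----------------\n"
    "{context}"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Put every retrieved chunk into one system message, then add the question."""
    context = separator.join(chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


messages = build_stuff_prompt(["chunk A", "chunk B"], "What is a time-lag matrix?")
```

Since every chunk goes into a single prompt, this approach only works while the combined chunks fit in the model's context window.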

In [12]:
from langchain.chains import RetrievalQA
retrieval_chain = RetrievalQA.from_chain_type(llm, chain_type="stuff", retriever=db.as_retriever())
retrieval_chain.run(query)


"To go from a time/time similarity matrix to a time/lag surface matrix, you can use the time-lag matrix concept. The time-lag matrix, as defined by r(t, l) = S(t, t−l), where t−l ≥ 0, represents similarity in a coordinate system where repetition appears as horizontal lines. This is because lag is represented by the vertical axis, and a constant lag implies a horizontal line.\n\nIn the context of the given text, the time-lag matrix is used to find alignment paths that maximize the average similarity of the aligned features. To create a time/lag surface matrix from a time/time similarity matrix, you would need to transform the coordinate system of the similarity matrix to have lines of constant lag oriented along the diagonals. This would result in an extended area of high correlation along one of the diagonals indicating an extended region of similarity between two portions of a song.\n\nAfter this transformation, you can filter along the diagonals of the similarity matrix to compute si
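
`RetrievalQA` bundles the two steps of the previous cell, retrieving the matching chunks and then running the "stuff" chain on them, which is consistent with the identical answer. The control flow can be sketched with plain callables. Everything here is a stand-in: `answer_question`, `fake_retrieve` and `fake_generate` are hypothetical, and the prompt layout is simplified rather than LangChain's.

```python
def answer_question(question, retrieve, generate, k=4):
    """Retrieve k chunks for the question, build one prompt, ask the model."""
    chunks = retrieve(question, k)
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)


def fake_retrieve(question, k):
    # Stand-in for a vector store retriever: ignores the question.
    corpus = ["The time-lag matrix is r(t, l) = S(t, t-l).", "Unrelated text."]
    return corpus[:k]


def fake_generate(prompt):
    # Stand-in for the LLM call.
    return f"(stub answer based on a prompt of {len(prompt)} characters)"


result = answer_question("What is the time-lag matrix?", fake_retrieve, fake_generate)
```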