**Coursebook: Understanding Embedding in LLM**

- Part 3 of Understanding Embedding in LLM
- Course Length: 9 hours
- Last Updated: July 2023
---

Developed by Algoritma's Research and Development division

## Background

The coursebook is part of the **Large Language Models Specialization** developed by [Algoritma](https://algorit.ma/). The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission. Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference, and more.

# Understanding Embedding in LLM 

## Training Objectives

- **Understanding Embeddings in Large Language Models (LLM) for Natural Language Processing**
   - Basic concepts of embeddings in LLM
   - Usage of embeddings in natural language processing
   - Demonstration of embeddings usage in text analysis


- **Advanced Embeddings in Large Language Models (LLM) for Text Processing**
   - In-depth understanding of embeddings in LLM
   - Implementation of embeddings in text processing using Python
   - Demonstration of embedding techniques in text processing tasks


- **Advanced Applications of Embeddings in Text Processing with Large Language Models (LLM)**
   - Application of embeddings in advanced text processing
   - Usage of embeddings for text classification and contextual understanding
   - Demonstration of embeddings usage in advanced tasks

## Understanding Embeddings in Large Language Models (LLM) for Natural Language Processing

We have created a GPT-based question-answering system using Large Language Models (LLMs) that can generate answers based on our data. Now, let's delve deeper into how LLMs understand natural language by exploring the concept of embeddings.

In natural language processing (NLP), **embeddings** are representations of **words or text as numerical vectors**. These vectors capture the **semantic and contextual** information of the text, allowing the model to understand the meaning and relationships between words. In simpler terms, embeddings help the chatbot understand the meaning of words and how they relate to each other.

Imagine you have a chatbot designed to assist customers with their inquiries. When a customer asks a question, the chatbot needs to understand the meaning of the question and provide a relevant answer. This is where embeddings come into play. The chatbot is trained on a large amount of text data and learns to associate words with their respective embeddings. These embeddings encode the information about the words' meaning and context.

For example, if a customer asks, "What are the payment options available?" the chatbot uses the embeddings to understand the meaning of the words "payment" and "options" and their relationship within the sentence. It can then provide an appropriate response by retrieving information from its knowledge base.

Embeddings enable the chatbot to **make sense** of the customer's input and generate accurate and contextually relevant responses. By capturing the meaning and relationships between words, embeddings enhance the chatbot's understanding of natural language and improve its ability to generate meaningful and coherent answers.

### Basic Concept of Embedding (Vector)

A vector (or embedding) is an **array of numbers**. That on its own is exciting, but what is even more exciting is that these arrays can represent more complex data like text, images, audio or even video. In the case of text, these representations are designed to capture **semantic and syntactic** relationships between words, allowing algorithms to understand and process language more effectively.

Word embeddings, specifically, are dense vector representations that encode the meaning of a word based on its context in a large corpus of text. In simpler terms, they map words to numerical vectors in a high-dimensional space, where similar words are located closer to each other. These vectors are typically stored and searched in a vector database (we will talk about this later).
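
To make "similar words are located closer to each other" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-picked assumptions for illustration only; a real embedding model learns hundreds of dimensions from data.

```python
import math

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat_dog = euclidean_distance(toy_vectors["cat"], toy_vectors["dog"])
cat_car = euclidean_distance(toy_vectors["cat"], toy_vectors["car"])

# The two animals sit close together; the vehicle sits further away.
print(f"cat-dog: {cat_dog:.3f}")
print(f"cat-car: {cat_car:.3f}")
```

In this toy space, "cat" is much closer to "dog" than to "car", which is the property a trained embedding model aims to provide for real text.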

These embeddings are created by an embedding model, and there are many models to choose from. OpenAI also provides embedding models, but we will use a free, open-source model so that we don't run out of credit: the **`"all-MiniLM-L6-v2"`** embedding model.

Making embeddings can be visualised in the following way:

![embedding](assets/embedding.gif)

This embedding process applies to many LLM implementations, such as a QnA system or a GPT chatbot. The question asked to the chatbot is embedded as well, and, on the basis of a similarity search, the retriever returns the stored data whose embeddings are closest to the question. After this, the LLM returns a coherent and well-structured answer.

But let's go through the concepts one by one, starting with how to turn raw text into vector form.

In [2]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")


This embedding function is based on an open-source sentence transformer model, which converts sentences or text into numerical embeddings that capture the semantic meaning and contextual information of the text. Let's create some example sentences:

In [None]:
sentences = [
    "This is document about cat",
    "This is document about car",
    "Example of the long sentences: China increased its coal-fired power capacity by 42.9 GW, or 4.5%, in the 18 months to June 2019, according to a report by Global Energy Monitor. The study also found that another 121.3 GW of coal-fired power plants are under construction in China, which has pledged to reduce its coal usage. However, the country’s absolute coal consumption has still increased in line with rising energy demand. China accounts for more than 40% of the world's total coal generation capacity."
]

In [None]:
# Perform embedding using embed_documents()
embedded_sentences = embedding_function.embed_documents(sentences)

# show embedded result
embedded_sentences[0][:10]

The output of this code is the result of performing embedding using the `embed_documents()` method of the `embedding_function`. The `sentences` variable represents a list of sentences or text that we want to embed. The `embed_documents()` method takes these sentences as input and generates their corresponding embeddings. 

The variable `embedded_sentences` stores the embedded representations of the input sentences. It is a numerical representation of the sentences that captures their semantic meaning and context.

In [None]:
# See len of embedded sentences
len(embedded_sentences)

We have three sentences and have embedded all of them. Let's look at the length of each embedding.

In [None]:
# Shape of embedded sentences
for text in embedded_sentences:
    print(len(text))

Each embedded sentence is represented as a numerical vector or array of length 384. This length is determined by the embedding model used, and it indicates the **dimensionality or the number of features in the embedding space.** Each element of the vector captures specific information about the corresponding sentence's semantic meaning and context.

In [None]:
# show the first 10 values of the first sentence's vector
embedded_sentences[0][:10]

This shows that the embedding model, here [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), generates a fixed-size vector of 384 dimensions for any given sentence, regardless of its length. This model is designed to map sentences and paragraphs to a dense vector space with 384 dimensions.

The purpose of this embedding is to capture the semantic meaning of sentences and enable tasks such as semantic search, where similarity between sentences can be measured in this vector space. By representing sentences in a fixed-size vector format, the model facilitates efficient comparison and retrieval of semantically similar sentences.

In summary, regardless of the length of the input sentence, the embedding model consistently produces a 384-dimensional vector representation that captures the semantic information of the sentence. This representation can be used for various NLP tasks, including semantic search, where similarity between sentences is important.
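
One common way for a sentence-embedding model to produce a fixed-size vector is to average the vectors of the individual tokens (mean pooling). The sketch below assumes mean pooling and uses hypothetical 4-dimensional token vectors; it only illustrates why the output length does not depend on the input length, and is not the model's actual code.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Hypothetical 4-dimensional token vectors (the real model uses 384 dimensions).
short_sentence = [[1.0, 0.0, 2.0, 4.0], [3.0, 2.0, 0.0, 0.0]]
long_sentence = short_sentence * 10  # twenty tokens instead of two

short_vec = mean_pool(short_sentence)
long_vec = mean_pool(long_sentence)

# Both sentences end up with a vector of the same length.
print(len(short_vec), len(long_vec))
```

Whether the input has two tokens or twenty, the pooled vector has as many entries as a single token vector.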

### Find Similarity Between Documents

We have embedded three sentences (documents) above.

> - What if we want to know which document contains information relevant to our question?
> > We can solve this by embedding our question into the same vector space and **computing the similarity** between our question and our documents.

Similarity is measured here with `cosine distance`: the lower the distance, the more similar the two vectors are.
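
As a minimal sketch of what cosine distance computes, the function below implements it in plain Python as one minus the cosine similarity. The function name and the example vectors are our own illustration.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)

# Same direction -> distance near 0; right angle -> 1; opposite direction -> 2.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Because only the angle between the vectors matters, two vectors pointing the same way have a distance of (approximately) zero even when their lengths differ.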

For example, suppose we want to know which document contains information about China and coal.

In [None]:
# Embed the question/query
embed_query = embedding_function.embed_documents(['China and coal'])

In [None]:
# Import cosine_distances
from sklearn.metrics.pairwise import cosine_distances

# Compute the cosine distance between query and documents
cosine_distances(embed_query, embedded_sentences)

Based on the results above, we can see that the third document has the lowest distance, indicating that it is the most relevant to the query "China and coal". This is not surprising, as the third document is a summary news article specifically about China's coal-powered plants. The lower distance suggests a higher similarity between the document and the query, indicating that it contains information closely related to the topic of interest.

Let's see another example. What if we want to know which document is about vehicles?

In [None]:
query = ["show document about vehicle"]
embed_query = embedding_function.embed_documents(query)

cosine_distances(embed_query, embedded_sentences)

Since the second document has the lowest distance, let's look at the second document.

In [None]:
# show second document
sentences[1]

Notice that the second document doesn't contain the word "vehicle" but does contain "car", and semantically we know that a car is a vehicle.

Also notice that the first and second documents differ by just one letter ("cat" versus "car"), yet their distances to the query are far apart.

This is because the embedding model encodes a sentence based on its semantic meaning rather than on its letters or exact words, as a purely lexical method would.

Let's try another example.

In [None]:
query = ["show document about animals"]
embed_query = embedding_function.embed_documents(query)

cosine_distances(embed_query, embedded_sentences)

In [None]:
sentences[0]

Even though the documents do not explicitly contain the word "animals," by computing the cosine distance of the embedding vectors, we can identify the document that represents the semantic meaning of our query. The cosine distance captures the similarity between the vectors, allowing us to find documents that capture the context and concept related to "animals," even if the exact word is not present. This demonstrates the power of embedding models in capturing the semantic meaning and enabling effective information retrieval and search tasks.

## Advanced Embedding in Large Language Models (LLM) for Text Processing

The ability of embedding to handle text data is crucial for addressing the demands of today's industries. Embedding allows us to process and understand textual information effectively, enabling a wide range of text processing tasks.

One significant application of embedding is in large language models (LLMs), which leverage advanced embedding techniques to comprehend and generate natural language. By utilizing embedding in LLMs, we can tackle various industrial challenges more efficiently.

In this section, we will explore the use of **Chroma DB**, a database built around embeddings, to enhance text processing capabilities. By embedding and storing relevant data, we can leverage the embedded representations of text to perform tasks like question-answering. This approach enables us to extract meaningful information from the data and provide accurate responses to user queries.

By harnessing the power of embedding in LLMs and leveraging technologies like Chroma DB, we can significantly improve the **efficiency and effectiveness** of text processing in industries. This opens up new opportunities for automating tasks, gaining insights from textual data, and enhancing decision-making processes.

### Vector Database (CHROMA DB)

When working with Large Language Models (LLMs) like GPT-4 or Google's PaLM 2, we will often be working with large amounts of unstructured, textual data. Structured data can simply be stored in a SQL database, but that is much harder with unstructured data. When, for instance, we have many text files like the example above, each with information on a certain topic, it helps to store this information in a way that lets us retrieve the desired data as efficiently as possible. The answer to this: **Vector Databases**.

The specific vector database that we will use is the **ChromaDB** vector database.

[Chroma Website](https://docs.trychroma.com/getting-started#:~:text=Chroma%20is%20a%20database%20for,hosted%20version%20is%20coming%20soon!):

> Chroma is a database for building AI applications with embeddings. It comes with everything you need to get started built in, and runs on your machine.

By using `Chroma`, we can streamline the process of embedding and computing cosine distances, as it provides built-in functionality for these tasks.

Both `Chroma` and `LangChain` are integrated, allowing for seamless usage. To take advantage of this integration, we need to import the necessary functions from the respective libraries. This integration simplifies the implementation of text processing tasks by providing convenient methods for embedding text, computing cosine distances, and utilizing these functionalities within the broader context of `LangChain`.

To utilize `Chroma DB` and `LangChain`, we need to import the necessary libraries and modules. 

- `SentenceTransformerEmbeddings` from `langchain.embeddings.sentence_transformer`: to generate embeddings for sentences using a pre-trained Sentence Transformer model.
- `CharacterTextSplitter` from `langchain.text_splitter`: split text documents into smaller chunks or segments, which can be useful for efficient processing and analysis.
- `Chroma` from `langchain.vectorstores`: Chroma is a vector store that enables us to store and query embedded text data efficiently.
- `TextLoader` from `langchain.document_loaders`: provides functionality to load text documents from various sources, such as files or directories.

In [None]:
# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

In [None]:
# load the document (it is split into chunks in a later step)
loader = TextLoader("data_input/state_of_the_union.txt")
document = loader.load()

In [None]:
document

Once the document has been loaded, we may find that it consists of multiple paragraphs. To facilitate our search for the most relevant paragraph, we can use the `CharacterTextSplitter` module. By specifying the `chunk_size` parameter as 1000 characters and setting `chunk_overlap` to 0, the document will be divided into smaller, non-overlapping chunks or segments, each containing roughly 1000 characters at most.

This splitting process allows us to analyze each paragraph individually and determine which one is most similar to our query. It simplifies the task of finding relevant information within the document and enables more focused analysis.
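
To illustrate the idea of chunking, here is a simplified sketch that packs whole paragraphs into chunks of limited size. It is our own illustration and not LangChain's implementation; the function name `split_on_paragraphs` and the blank-line separator are assumptions.

```python
def split_on_paragraphs(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is kept whole, so a chunk
    can exceed the limit in that case.
    """
    chunks, current = [], ""
    for paragraph in text.split(separator):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty pieces produced by repeated separators
        candidate = paragraph if not current else current + separator + paragraph
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph is here."
chunks = split_on_paragraphs(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

With a limit of 40 characters, the first two paragraphs fit together in one chunk and the third starts a new one.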

In [None]:
# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
text = text_splitter.split_documents(document)

text[:3]

Once we have split each paragraph in the document, we can proceed to embed the sentences and store them in Chroma for efficient retrieval.

- We will create an open-source embedding function using `SentenceTransformerEmbeddings`. In this example, we use the model "all-MiniLM-L6-v2" to perform the sentence embedding.

- We load the embedded sentences into Chroma using the `from_documents` method. We pass in the `text` as the input and the `embedding_function` to perform the embedding process. Chroma will store the embedded vectors along with their corresponding sentences, enabling fast and accurate retrieval based on semantic similarity.


In [None]:
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(text, embedding_function)

By leveraging Chroma, we can easily search for the most relevant sentences or paragraphs in the document by comparing their embedded vectors, providing a powerful tool for text processing and information retrieval.

Chroma will store the extracted ids, embeddings, documents, and metadata in a collection. This collection acts as a repository where the information is organized and indexed for efficient retrieval.

In [None]:
db._collection.get().keys()

- The `ids` represent unique identifiers associated with each document or sentence in the collection. These ids serve as references to access specific entries in the collection.

- The `embeddings` are the vector representations of the documents or sentences. These embeddings capture the semantic information and enable similarity-based searches within the collection.

- The `documents` refer to the original text content that has been split and processed. These documents can be paragraphs, sentences, or any other meaningful textual units.

- The `metadatas` include any additional information associated with the documents, such as timestamps, author names, or any other relevant attributes.
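
The relationship between these four fields can be sketched as parallel lists, where the same position in each list describes one entry. The toy dictionary below is a hand-written assumption for illustration and is not produced by Chroma; the ids, vectors, and `source` values are hypothetical.

```python
# A toy stand-in for a collection: four parallel lists of equal length.
collection = {
    "ids": ["doc-0", "doc-1"],
    "embeddings": [[0.1, 0.9], [0.8, 0.2]],  # hypothetical 2-d vectors
    "documents": ["This is document about cat", "This is document about car"],
    "metadatas": [{"source": "example.txt"}, {"source": "example.txt"}],
}

def get_entry(collection, entry_id):
    """Look up one entry by id and return its fields as a single dict."""
    position = collection["ids"].index(entry_id)
    return {
        "id": entry_id,
        "embedding": collection["embeddings"][position],
        "document": collection["documents"][position],
        "metadata": collection["metadatas"][position],
    }

entry = get_entry(collection, "doc-1")
print(entry["document"], entry["metadata"])
```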

In [None]:
db._collection.get()['documents'][:3]

In [None]:
db._collection.get()['ids'][:3]

In [None]:
db._collection.get()['metadatas'][:3]

After exploring the Chroma collection and embedding the query, we can evaluate the performance of our model by finding similar documents that best match the given question.

We will define the `query` as "What did the president say about Ketanji Brown Jackson" and use the `similarity_search_with_score` function from Chroma to find the most similar documents to the query. We specify `k=3` to retrieve the top 3 matching documents.


In [None]:
# Embed query and find similar document
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search_with_score(query, k=3)

In [None]:
docs

The result of this operation is a list of three `(document, score)` pairs. For Chroma, the score is a distance, so the lower the score, the more similar the document is to the query. This allows us to assess the performance of our model in retrieving relevant information based on the input query.
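
Conceptually, a top-`k` similarity search scores every stored vector against the query vector and keeps the `k` closest. The sketch below is our own brute-force illustration and assumes Euclidean (L2) distance; the metric used by a real collection depends on its configuration, and a vector database uses an index rather than scanning every vector.

```python
import heapq
import math

def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_with_score(query_vec, doc_vecs, docs, k=3):
    """Return the k (document, distance) pairs with the smallest distance."""
    scored = [(doc, l2_distance(query_vec, vec)) for doc, vec in zip(docs, doc_vecs)]
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Hypothetical chunks and 2-d vectors, for illustration only.
docs = ["chunk A", "chunk B", "chunk C", "chunk D"]
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [3.0, 4.0]]

results = top_k_with_score([0.0, 0.0], doc_vecs, docs, k=3)
for doc, score in results:
    print(doc, round(score, 3))
```

The pairs come back ordered from the smallest distance to the largest, so the first result is the best match.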


What if we have a collection of text files stored in a folder and we want to embed each of them? We can achieve this by using the `DirectoryLoader` to load all the files in the directory as documents. Then, we can split each paragraph within the documents using the `RecursiveCharacterTextSplitter` to prepare the text for embedding.

The `DirectoryLoader` allows us to conveniently load all the files in a directory, while the `RecursiveCharacterTextSplitter` enables us to split the text into smaller chunks, such as paragraphs, for further processing. This combination of loaders and text splitters helps us prepare the data for embedding and subsequent analysis or retrieval tasks.

In [None]:
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

We will use `DirectoryLoader` to load text files from a specific directory (`data_input/new_articles/`) using the `TextLoader` class. It will load all the files with the `.txt` extension in the specified directory.

In [None]:
# Load all the text files in new_articles
loader = DirectoryLoader('data_input/new_articles/', glob="./*.txt", loader_cls=TextLoader)

documents = loader.load()

To handle the multiple text files, we use the `RecursiveCharacterTextSplitter` to split the text into smaller chunks. This allows us to process each file efficiently and perform further analysis or tasks on the split text.

In [None]:
# split the text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
text = text_splitter.split_documents(documents)

In this case, it will split the text into chunks of 1000 characters with an overlap of 200 characters between consecutive chunks. 
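
The effect of `chunk_overlap` can be sketched with a plain sliding window over the characters. This is a deliberate simplification and an assumption on our part: the real splitter also tries to break at natural boundaries such as paragraphs and sentences, which this sketch ignores.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-width windows; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Small numbers make the overlap easy to see.
chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The last four characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still partly visible in the following chunk.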

In [None]:
text[:3]

We have split all our text. The following steps are the same as in the code before.

In [None]:
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
vectordb = Chroma.from_documents(text, embedding_function)

# query it
query = "What is the news about Pando?"
docs = vectordb.similarity_search_with_score(query)

docs

The `docs` variable contains the result of the similarity search performed on the `vectordb` using the query "What is the news about Pando?". It provides a ranked list of the documents most similar to the query, along with their scores.

By examining the `docs` list, we can see which documents are considered most relevant to the query. For Chroma, the score is a distance: the **lower** the score, the **more similar** the document is to the query.

### Saving to Disk

Embedding text can be a time-consuming and resource-intensive process, especially when dealing with a large amount of text data.

To address this issue, we have the option to save our vector database, including the vectorized sentences, to disk. This allows us to load the saved vector database instead of re-embedding the text every time we need to use it.

To save the vector database to disk, we simply initialize the Chroma client and specify the directory where we want the data to be saved.

> **Caution**: Chroma makes a best-effort to automatically save data to disk, however multiple in-memory clients can stomp each other's work. As a best practice, only have one client per path running at any given time.

> **Protip**: Sometimes we can use `db.persist()` to force a save if needed

In [None]:
# embed the chunks, load them into Chroma, and save to disk
vectordb = Chroma.from_documents(text, embedding_function, persist_directory="./chroma_db")
vectordb.persist()

The vector database will be saved in the `"./chroma_db"` folder. Whenever we want to use the embedded vectors, we can simply load them from the `"./chroma_db"` folder.

To achieve this, we create the vector database using `Chroma.from_documents(text, embedding_function, persist_directory="./chroma_db")`. The `persist_directory` parameter specifies the directory where the vector database will be saved.

To ensure that the vector database is saved to disk, we call `vectordb.persist()`. This command forces the vector database to be saved in the specified directory.

By saving the vector database to disk, we can easily load it whenever we need to use the embedded vectors, eliminating the need for repetitive embedding computations.
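
The idea of "embed once, save, and load later" can be sketched with nothing more than a JSON file. This is only an illustration of the principle: the helper names and the `store.json` file name are hypothetical, and Chroma's actual on-disk format is different.

```python
import json
import tempfile
from pathlib import Path

def save_store(directory, documents, embeddings):
    """Write documents and their vectors to a JSON file inside directory."""
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    with open(path / "store.json", "w", encoding="utf-8") as f:
        json.dump({"documents": documents, "embeddings": embeddings}, f)

def load_store(directory):
    """Read the documents and vectors back without re-embedding anything."""
    with open(Path(directory) / "store.json", encoding="utf-8") as f:
        data = json.load(f)
    return data["documents"], data["embeddings"]

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    save_store(tmp_dir, ["chunk A", "chunk B"], [[0.1, 0.2], [0.3, 0.4]])
    loaded_docs, loaded_vecs = load_store(tmp_dir)

print(loaded_docs, loaded_vecs)
```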

Let's see how to load the saved vector database in Chroma.

In [None]:
vectordb_load = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)

# query it
query = "What is the news about Pando?"
docs_load = vectordb_load.similarity_search_with_score(query)

docs_load

The code is loading the vector database from the "./chroma_db" directory using the `Chroma` class with the `persist_directory` parameter set to the same directory where the vector database was previously saved. The `embedding_function` is also provided to ensure consistency in embedding.

After loading the vector database, a query is performed by specifying the query sentence as "What is the news about Pando?". The `similarity_search_with_score` method is used to find the most similar documents in the vector database based on the query. 

The output, `docs_load`, contains the results of the query as a list of `(document, score)` pairs, ranked by their similarity to the query. For Chroma, the score is a distance, so lower scores indicate greater similarity. Each document carries its text content and any metadata associated with it.

## Workflow: Applications of Embedding in Text Processing with Large Language Models (LLM)

## Create QnA System from Vector Database

In this section, we will explain the workflow to create a Q&A system by leveraging a vector database and the power of LangChain and Chroma DB. This workflow will serve as a guide for building similar projects and can be adapted for various text processing tasks, including summarized text analysis, document retrieval, and more.

### Embedding and Vector Database Creation

In this section, we will utilize the previously embedded data that was stored on disk. The Chroma database is loaded from the disk, and this is indicated by the `persist_directory` parameter set to "./chroma_db". Loading the Chroma database from the specified directory allows us to efficiently reuse the embeddings that were computed earlier. This avoids the need to recompute the embeddings, which saves time and computational resources, making our text processing workflow more efficient and faster.

In [None]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

vectordb_load = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)

### Query Processing


We will perform query processing by defining a question, and then we will find the most relevant answers using `vectordb_load.similarity_search`. 

In [None]:
# Define a question and search the vector database for similar chunks
question = "What is Pando vision?"
similar_docs = vectordb_load.similarity_search(question)

In [None]:
similar_docs

This process allows us to search for the documents that are semantically similar to the query question within the loaded vector database.

### Q&A System Implementation

After identifying the similar documents from the query, we can now advance the implementation by creating a Question-Answering (Q&A) system. This system will utilize the relevant information retrieved from the vector database to provide answers to specific questions asked by the users.

All we need to do is import the library `dotenv` to load the environment variables, `load_qa_chain` to set up the Question-Answering chain, and the specific model we want to use from HuggingFaceHub.

In [None]:
from dotenv import load_dotenv
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub

In [None]:
# Load HuggingFace token from env
load_dotenv()

In [None]:
# Load the LLM
llm = HuggingFaceHub(repo_id="declare-lab/flan-alpaca-large",
                     model_kwargs={"temperature":0.9, "max_length":512})

After loading the LLM model "declare-lab/flan-alpaca-large" from Hugging Face, we can create a Question-Answering (QA) chain using the `load_qa_chain` function provided by LangChain. This QA chain allows us to interact with the model and generate responses to questions based on the input provided.

In [None]:
# Create qa chain
qa_chain = load_qa_chain(llm, chain_type="stuff")

With `chain_type="stuff"`, the chain takes every document we pass to it and "stuffs" them all into a single prompt, together with the question. The chain itself does not search the database; it only works with the documents it is given.

> In the overall approach, relevant documents or passages are first retrieved from a database (in this case, the Chroma vector database) based on the input question; this is the similarity search we ran above. Then, the model generates the answer from the retrieved information. This method is useful for efficiently obtaining contextually relevant answers from a large corpus of data without relying solely on the knowledge stored in the model.
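
The "stuff" strategy can be sketched as simple string concatenation: every retrieved chunk is placed into one prompt ahead of the question. The prompt wording and the function name below are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every document into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder chunks standing in for the results of a similarity search.
retrieved = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt(retrieved, "What do the chunks say?")
print(prompt)
```

Because everything goes into a single prompt, this strategy works best when the retrieved chunks are few and short enough to fit within the model's input limit.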

### Display the Results


Using `.run`, we can execute the Q&A system and provide the input document and the question as parameters.

In [None]:
# Generate answer from related document (from similarity search result)
response = qa_chain.run(input_documents=similar_docs, question=question)

response

### Chaining

**Retriever**

In order to retrieve the relevant data from the database, we need to create a retriever using `as_retriever()`. This retriever returns the documents (or chunks) most similar to the question asked. It uses the data stored in the vector database to identify the most relevant documents that match the user's question, allowing the Q&A system to extract the necessary information efficiently.

In [None]:
# Create retriever
retriever = vectordb_load.as_retriever()

**RetrievalQA**

After creating the retriever, we can implement the question-answering functionality using `RetrievalQA`. This allows us to match the user's question with the relevant documents retrieved by the retriever and generate contextually relevant answers based on the information stored in the vector database. `RetrievalQA` integrates the retrieved documents with the language model, enabling the system to provide accurate responses to the user's queries.

In [None]:
from langchain.chains import RetrievalQA

# create the chain to answer questions:
# retrieval and answer generation are combined into a single call
qa_chain = RetrievalQA.from_chain_type(llm = llm,
                                  chain_type = "stuff",
                                  retriever = retriever,
                                  return_source_documents = True,
                                  verbose = True)

In [None]:
qa_chain("what is Pando Vision?")

We observe the same results as our previously built QnA system, but with a more straightforward and streamlined process due to **"chaining"** the various components using LangChain's Chain feature. This chaining functionality simplifies the implementation and improves the overall efficiency of the system, making it a powerful tool for creating question-answering applications.
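
What "chaining" means here can be sketched in plain Python as composing a retriever and an LLM into one callable. Everything in this sketch is a hypothetical stand-in (the fake retriever, the fake LLM, and the output keys, which were chosen to resemble the chain's output), so that it runs without a model or a database.

```python
def make_qa_chain(retriever, llm):
    """Compose a retriever and an LLM callable into one question -> answer function."""
    def run(question):
        source_documents = retriever(question)
        prompt = "\n\n".join(source_documents) + f"\n\nQuestion: {question}\nAnswer:"
        return {
            "query": question,
            "result": llm(prompt),
            "source_documents": source_documents,
        }
    return run

# Hypothetical stand-ins for the real retriever and LLM.
def fake_retriever(question):
    return ["chunk one", "chunk two"]

def fake_llm(prompt):
    return f"(answer based on {len(prompt)} prompt characters)"

qa = make_qa_chain(fake_retriever, fake_llm)
output = qa("an example question")
print(output["result"])
```

The caller only supplies a question; retrieval, prompt building, and generation all happen inside the composed function.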

## Create QnA System From PDF Source

After successfully creating the QnA System from the Vector Database using Chroma, in this section, we will move on to the next step, which is creating a QnA System from a PDF source. 

This new approach will demonstrate how to handle text data in PDF format and utilize **LangChain** to build a QnA system that can answer questions based on the content of the **PDF documents**.

### Business Problem

In important meetings, minutes are often kept to record crucial information. However, we rarely need all of that information at once, only the specific details that matter for the task at hand. Such documents, which are often confidential, can be embedded and used to build a Q&A system, enabling us to quickly retrieve relevant information and answer specific questions without having to go through the entire document manually. This Q&A system improves efficiency and access to the essential information in these documents.

### Data Preparation

In this example, we will use the [**"Copy of minutes of meeting" document from the Australian Securities & Investments Commission**](https://www2.deloitte.com/content/dam/Deloitte/au/Documents/finance/insolvency/virgin/deloitte-au-fa-virgin-australia-minutes-to-6th-coI-meeting-10-august-2020-010920.pdf) in PDF format. This data contains important details about the minutes of a meeting. To facilitate embedding, we can effortlessly load the PDF data using the `UnstructuredPDFLoader` provided by Langchain. This loader allows us to process unstructured data, such as PDF files, and prepare them for embedding in the Q&A system. By using this method, we can efficiently access and analyze essential information from the PDF documents.

In [None]:
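from langchain.document_loaders import UnstructuredPDFLoader

def loadPDFFromLocal(pdf_file_path="data_input/Eurovision_Song_Contest_2023.pdf"):
    # the default path is only a fallback; pass the path of the PDF to load
    loader = UnstructuredPDFLoader(pdf_file_path)
    # load() returns a list of Document objects (text plus metadata)
    loaded_docs = loader.load()
    return loaded_docs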

After loading the PDF file using `UnstructuredPDFLoader`, we need to split the document into smaller sections. Embedding and retrieval work on one piece of text at a time, so smaller chunks let the system **retrieve only the passages relevant to a question** instead of the whole document.

To achieve this, we can utilize the `CharacterTextSplitter`. It breaks the original document into chunks bounded by `chunk_size` characters, and `chunk_overlap` sets how many characters neighboring chunks may share so that text near a boundary keeps some of its context. The resulting chunks are what we embed and later search in the Q&A system.

In [None]:
from langchain.text_splitter import CharacterTextSplitter

# Example of minutes of meeting
document = loadPDFFromLocal("data_input/deloitte-au-fa-virgin-australia-minutes.pdf")
# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=10)
text = text_splitter.split_documents(document)

In [None]:
text[:5]
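
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch using plain Python only. It is a simplified sliding-window illustration, not LangChain's implementation: `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its real chunks tend to follow paragraph boundaries. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighboring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the kind of shared context that `chunk_overlap` is meant to preserve.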

### Embedding and Vector Database Creation

Before creating the Q&A system, we use Chroma to embed our text chunks and store the resulting vectors. To do this, we call the `Chroma.from_documents` method with three essential parameters:

1. **Text data:** The documents we want to embed, passed as a list. Here it is the list of chunks produced by the text splitter.

2. **Embedding function:** We should define an embedding function, such as `SentenceTransformerEmbeddings`, which will transform the text into meaningful embeddings.

3. **Persist directory:** This parameter specifies the directory where the Chroma database, along with the embedded vectors, will be saved. By setting the `persist_directory`, we can efficiently store and reuse the already computed embeddings, saving time and resources in subsequent queries and tasks.


In [None]:
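# imports needed by this cell
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# embed the chunks and save them into Chroma
vectordb_pdf = Chroma.from_documents(text, embedding_function, persist_directory="./chroma_db_pdf")
# write the database to the persist directory
vectordb_pdf.persist()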

After saving the Chroma database to disk with `persist_directory="./chroma_db_pdf"`, we can reload the saved data later by constructing `Chroma` with the same directory and embedding function.

In [None]:
# reload the saved database from disk when needed
vectordb_pdf = Chroma(persist_directory="./chroma_db_pdf", embedding_function=embedding_function)

By loading the database from the disk, we can efficiently reuse the previously computed embeddings, which can be crucial when dealing with large amounts of text data. This capability allows us to access the saved embeddings whenever needed, without the need to recompute them, thereby **improving the efficiency and speed of text processing** tasks like question-answering systems.
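
The idea of "compute once, reload later" can be sketched with the standard library alone. This is an illustration of the principle, not of how Chroma stores data on disk: the JSON file layout, the helper names `persist_store` and `load_store`, and the letter-count `toy_embed` function are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def toy_embed(text):
    """Stand-in for a real embedding model: a 26-dim letter-count vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def persist_store(chunks, directory):
    """Embed every chunk once and write text + vector to disk."""
    records = [{"text": c, "embedding": toy_embed(c)} for c in chunks]
    path = Path(directory) / "store.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records))
    return path


def load_store(directory):
    """Read the saved vectors back without calling the embedding function."""
    return json.loads((Path(directory) / "store.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    persist_store(["meeting opened at 10am", "resolution passed"], tmp)
    reloaded = load_store(tmp)

print(len(reloaded), len(reloaded[0]["embedding"]))
```

Reloading only reads the file, so the cost of embedding is paid once no matter how many times the store is reopened.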

### Q&A System Implementation

To implement the Q&A system, the workflow involves creating a **retriever**, which helps to retrieve relevant documents or chunks based on the given questions. Then, we proceed to **create the chain**, which uses the retriever's output to answer the questions. 

By doing so, we can **streamline the process and simplify the code** required to generate answers from related documents, making the Q&A system efficient and effective in handling various queries.

In [None]:
# Create retriever
retriever = vectordb_pdf.as_retriever()

The retriever is responsible for finding relevant documents or chunks in the vector database based on the given input question. It returns the search results that can be used to generate an answer to the question. The retriever plays a crucial role in efficiently retrieving important information from the vector database, which is then used by the Q&A system to provide accurate answers.
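
Conceptually, a retriever ranks the stored chunks by how close their vectors are to the question's vector and returns the top few. The sketch below shows that idea with cosine similarity in plain Python. The three-dimensional vectors and chunk labels are made-up placeholders, and the actual distance metric and number of results depend on how the vector store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, k=2):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item["embedding"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:k]]


# placeholder store: in practice the vectors come from the embedding model
store = [
    {"text": "chunk about the chairperson", "embedding": [0.9, 0.1, 0.0]},
    {"text": "chunk about remuneration", "embedding": [0.0, 0.2, 0.9]},
    {"text": "chunk about the agenda", "embedding": [0.6, 0.7, 0.1]},
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the question

top_chunks = retrieve(query_vec, store, k=2)
print(top_chunks)
# ['chunk about the chairperson', 'chunk about the agenda']
```

With the LangChain retriever, the number of returned chunks can be adjusted in a similar spirit, for example `vectordb_pdf.as_retriever(search_kwargs={"k": 2})`.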

In [None]:
from langchain.chains import RetrievalQA

# create the chain that answers questions:
# it retrieves the related chunks and passes them to the LLM in a single call
qa_chain_pdf = RetrievalQA.from_chain_type(llm = llm,
                                  chain_type = "stuff",
                                  retriever = retriever,
                                  return_source_documents = True,
                                  verbose = True)
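
The `"stuff"` chain type places all retrieved chunks into a single prompt and sends it to the LLM in one call. The sketch below illustrates that step in plain Python. The function name `build_stuff_prompt` and the prompt wording are assumptions for illustration, not the exact template LangChain uses.

```python
def build_stuff_prompt(question, chunks):
    """Join every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# placeholder chunks standing in for the retriever's output
retrieved = ["(text of retrieved chunk 1)", "(text of retrieved chunk 2)"]
prompt = build_stuff_prompt("Who is the chairperson of the meeting?", retrieved)
print(prompt)
```

Because everything goes into one prompt, the retrieved chunks together with the question must fit within the model's context window, which is one reason the chunk size chosen during splitting matters.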

### Display the Results

We have reached the end of building the Q&A system, and now we can display the results by inputting the questions to the system. The system will then process the questions and provide relevant answers based on the retrieved information from the vector database.

In [None]:
qa_chain_pdf("Who is the chairperson of the meeting?")

In [None]:
answer = qa_chain_pdf("What is the agenda of the meeting?")

answer

In [None]:
answer['result']
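
Since the chain was created with `return_source_documents = True`, the returned dictionary also carries the retrieved chunks under `source_documents`, alongside `query` and `result`. The helper below is a small sketch for printing an answer together with its sources. The function name `summarize_answer` is hypothetical, and the mock object only imitates the shape of the chain's output so the example runs on its own.

```python
from types import SimpleNamespace


def summarize_answer(answer, preview_chars=60):
    """Format the question, the answer, and a short preview of each source chunk."""
    lines = [f"Q: {answer['query']}", f"A: {answer['result']}"]
    for i, doc in enumerate(answer.get("source_documents", []), start=1):
        source = doc.metadata.get("source", "unknown")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"[{i}] {source}: {preview}")
    return "\n".join(lines)


# mock output with placeholder values, shaped like the chain's result dictionary
mock_answer = {
    "query": "What is the agenda of the meeting?",
    "result": "(model answer would appear here)",
    "source_documents": [
        SimpleNamespace(
            page_content="(retrieved chunk text)",
            metadata={"source": "data_input/example.pdf"},
        ),
    ],
}

report = summarize_answer(mock_answer)
print(report)
```

With the real chain, `summarize_answer(answer)` can be called on the `answer` variable from the cell above to check which chunks the response was based on.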

In [None]:
qa_chain_pdf("What is the resolution of the meeting?")

In [None]:
qa_chain_pdf("What's the action plan to fix or determine the remuneration of the administrators?")

## Dive Deeper

We've covered the entire workflow, starting from implementing embedding to creating a Q&A system that assists us in finding relevant information. Next, we'll introduce another dataset that you can use to practice and deepen your understanding of the workflow.

The dataset is the document ["KINERJA DAN PROSPEK EKONOMI NASIONAL: OPTIMIS DAN WASPADA"](https://www.bi.go.id/id/publikasi/laporan/Documents/4_LPI2022_BAB2.pdf) from Bank Indonesia. Your task is to create a Q&A system that can answer questions based on the information provided in the document.

In [None]:
# your code here


# Summary

In this module, we have learned about the workflow to create a Q&A system using unstructured data. By understanding embedding, we can process and represent text data in a meaningful way. We also explored Chroma, which allows us to efficiently store and retrieve embedded data. With these tools, we can build powerful Q&A systems that can answer questions based on the information in unstructured documents.