# CHAABI ASSIGNMENT: BUILDING A QUERY ENGINE FROM A GIVEN DATA SOURCE

##### By Abhranil Das, d.abhranil@iitg.ac.in
##### Roll no. 200108002, Electronics and Electrical Engineering, IIT Guwahati


# Introduction
ChatGPT and similar models struggle to generate factual statements when no context is provided. They have some general knowledge but cannot be relied on to produce a valid answer consistently. It is therefore better to supply facts we know to be true, so that the model only has to select the relevant parts of the provided context and compose a comprehensive answer. Vector databases such as FAISS, Qdrant, and Pinecone are of great help here: their ability to perform semantic search over a huge knowledge base lets us preselect documents that are likely to be relevant and pass them to the LLM, which then answers the query based on the context in those documents.

## What do we need?
We need two models to set up a query engine using any LLM. First, we need an embedding model that converts the set of facts into vectors, which are stored in a vector database such as Qdrant or FAISS. We are going to use one of the Hugging Face Instruct Embedding models, since it can be hosted locally. The embeddings created by that model will be put into the vector database and used to retrieve the documents most similar to a given query. Second, we need the LLM itself, which generates the answer from the retrieved documents.

When we receive a query, two steps are involved. First, we ask the vector database for the most relevant documents and combine all of them into a single text. Then we build a prompt for the LLM that includes those documents as context, together with the question asked. The input to the LLM looks like the following:



> Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. \
\
> ...  
It's as certain as 2 + 2 = 4  
...
\
\
> Question: How much is 2 + 2? \
Helpful Answer:



Several context documents might be combined, and it is solely up to the LLM to choose the right piece of content. Our expectation, however, is that the model responds with just 4.
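
The prompt above can be assembled with plain string formatting. The following is a minimal sketch, not LangChain's own implementation: the names `PROMPT_TEMPLATE` and `build_stuff_prompt` are hypothetical, and the second context document is made up purely for illustration.

```python
# Template modelled on the prompt shown above; {context} and {question} are filled in later.
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(documents, question):
    """Join the retrieved documents into one context and fill the template."""
    context = "\n\n".join(documents)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Illustrative documents (the second one is a made-up distractor).
prompt = build_stuff_prompt(
    ["It's as certain as 2 + 2 = 4", "The sky is blue."],
    "How much is 2 + 2?",
)
print(prompt)
```

Every retrieved document ends up in the same prompt, so the amount of context is limited by how much text the LLM accepts in a single input.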

Why do we need two different models? They solve different tasks. The first model performs feature extraction by converting text into vectors, while the second one handles text generation or summarization. Disclaimer: this is not the only way to solve the task with LangChain. Such a chain is called **stuff** in the library nomenclature.

This is called **Retrieval Augmented Generation (RAG)**. The pipeline for RAG looks like the following:

<a href="https://ibb.co/7yYLxsS"><img src="https://i.ibb.co/x5CdKx2/Screenshot-217.png" alt="Screenshot-217" border="0" /></a>

We will be using **LangChain** to build the query engine, from generating the embeddings and building the vector database to implementing the LLM and creating a query-answer interface. Every step is documented and explained in detail in this notebook.

## LangChain

LangChain provides unified interfaces to different libraries, so one can avoid writing boilerplate code and focus on the value one wants to bring. LangChain supports various pre-trained models for generating vector embeddings, and also supports popular vector databases such as Qdrant, Pinecone, and FAISS, which can be used for storing the embeddings. LangChain also lets us use pre-trained models and build even complex pipelines with a few lines of code, making the process of building applications with LLMs (Large Language Models) efficient and simple.

<img src = "https://miro.medium.com/v2/resize:fit:1400/format:webp/1*-PlFCd_VBcALKReO3ZaOEg.png"/>

In this notebook, we list all the steps needed to create a query engine, given any data source/database. We will be using the Hugging Face Instruct Embeddings (hkunlp/instructor-base) model to generate the embeddings, FAISS to create the vector database, a model hosted on Hugging Face Hub as the LLM, and RetrievalQA to build the engine.

## INITIAL SANITY CHECK - INSTALLING DEPENDENCIES AND IMPORTING NECESSARY LIBRARIES

Run the following cell to install all the dependencies that are needed to run this notebook

In [None]:
!pip install langchain InstructorEmbedding sentence_transformers faiss-gpu huggingface_hub

Run the following cell to import the necessary modules

In [None]:
from google.colab import files
import pandas as pd

from langchain.document_loaders.csv_loader import CSVLoader
from langchain.embeddings import HuggingFaceInstructEmbeddings
import sentence_transformers
from InstructorEmbedding import INSTRUCTOR
from langchain.vectorstores import FAISS
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import HuggingFaceHub
from langchain.chains import RetrievalQA

import os, sys, warnings
from getpass import getpass

warnings.filterwarnings('ignore')

## UPLOADING THE DATA SOURCE

Run the following cell to input the data source to be used for building the query engine. After running the cell, click the **Choose files** button and upload the dataset file from your computer. After uploading, wait for the file to be fully uploaded, then run the next cell.

In [None]:
uploaded = files.upload()

Saving bigBasketProducts.csv to bigBasketProducts (1).csv


In [None]:
filename = list(uploaded.keys())[0]

## PREPARING THE DATALOADER AND VISUALIZING THE DATA

We will be using the CSVLoader class from LangChain to build the database from the CSV file provided as input. We also use Pandas to load the first few rows and display them to see what our database looks like. Run the following two cells to prepare the data to be used as the database for the engine.

In [None]:
loader = CSVLoader(file_path = filename, encoding = 'utf-8')
data = loader.load()

In [None]:
df = pd.read_csv(filename, nrows = 20, encoding = 'utf-8', encoding_errors = 'ignore', index_col = [0])
df.head()

| index | product | category | sub_category | brand | sale_price | market_price | type | rating | description |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Garlic Oil - Vegetarian Capsule 500 mg | Beauty & Hygiene | Hair Care | Sri Sri Ayurveda | 220.0 | 220 | Hair Oil & Serum | 4.1 | This Product contains Garlic Oil that is known... |
| 2 | Water Bottle - Orange | Kitchen, Garden & Pets | Storage & Accessories | Mastercook | 180.0 | 180 | Water & Fridge Bottles | 2.3 | Each product is microwave safe (without lid), ... |
| 3 | Brass Angle Deep - Plain, No.2 | Cleaning & Household | Pooja Needs | Trm | 119.0 | 250 | Lamp & Lamp Oil | 3.4 | A perfect gift for all occasions, be it your m... |
| 4 | Cereal Flip Lid Container/Storage Jar - Assort... | Cleaning & Household | Bins & Bathroom Ware | Nakoda | 149.0 | 176 | Laundry, Storage Baskets | 3.7 | Multipurpose container with an attractive desi... |
| 5 | Creme Soft Soap - For Hands & Body | Beauty & Hygiene | Bath & Hand Wash | Nivea | 162.0 | 162 | Bathing Bars & Soaps | 4.4 | Nivea Creme Soft Soap gives your skin the best... |
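
To get a feel for what the loader produces, the sketch below turns each CSV row into one text block of `column: value` lines, which is the layout of the `page_content` printed by the similarity search later in this notebook. It uses only the standard library and is an approximation of that layout, not the CSVLoader source; `rows_to_documents` is a hypothetical name and the sample keeps only a few of the columns.

```python
import csv
import io


def rows_to_documents(csv_text):
    """Turn each CSV row into one text block of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


# Trimmed sample based on the first two rows shown above (subset of columns).
sample = (
    "index,product,brand,rating\n"
    "1,Garlic Oil - Vegetarian Capsule 500 mg,Sri Sri Ayurveda,4.1\n"
    "2,Water Bottle - Orange,Mastercook,2.3\n"
)

documents = rows_to_documents(sample)
print(documents[0])
```

Keeping the column names inside the text means the embedding model and the LLM both see which value is the brand, the price, or the rating.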


## CREATING VECTOR EMBEDDINGS

After creating the database, the next step is to generate vector embeddings from it. For this we will be using the **hku-nlp/instructor-base** model. Model details and implementation can be found here: [*hku-nlp/instructor-base*](https://huggingface.co/hku-nlp/instructor-base).

In [None]:
embeddings = HuggingFaceInstructEmbeddings(model_name = 'hku-nlp/instructor-base', model_kwargs = {"device":  "cuda"})

load INSTRUCTOR_Transformer
max_seq_length  512


## STORING THE EMBEDDINGS IN A VECTOR DATABASE

Now we need to define a vector database which will store the embeddings generated from the dataset. For this we will be using **FAISS**.

**Facebook AI Similarity Search (FAISS)**  is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

Uncomment the following cell and run it to create the vector store that contains the embeddings generated from the database. This takes a lot of time to run, so we store the vector store in a local folder so that we can reuse the embeddings later without having to generate them again.

In [None]:
# vectorstore = FAISS.from_documents(data, embeddings)
# vectorstore.save_local('FAISS_Index')

After saving the vector store, run the following cell to load the vector store from the saved folder.

In [None]:
vectorstore = FAISS.load_local('FAISS_Index', embeddings)
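
Instead of commenting and uncommenting cells, the build-once-then-reuse pattern can be wrapped in a small helper. This is a sketch under assumptions: `load_or_build` is a hypothetical helper, and the demo uses stand-in functions in a temporary folder rather than a real FAISS index.

```python
import os
import tempfile


def load_or_build(path, build, save, load):
    """Load a saved index from `path` if the folder exists; otherwise build and save it."""
    if os.path.isdir(path):
        return load(path)
    index = build()
    save(index, path)
    return index


# In this notebook the call could look like this (assumes `data` and `embeddings` exist):
# vectorstore = load_or_build(
#     'FAISS_Index',
#     build=lambda: FAISS.from_documents(data, embeddings),
#     save=lambda store, path: store.save_local(path),
#     load=lambda path: FAISS.load_local(path, embeddings),
# )


def demo_save(index, path):
    """Stand-in for saving: only creates the folder."""
    os.makedirs(path)


# Demo with stand-ins: the first call builds, the second call loads.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "FAISS_Index")
    first = load_or_build(folder, lambda: "built", demo_save, lambda p: "loaded")
    second = load_or_build(folder, lambda: "built", demo_save, lambda p: "loaded")

print(first, second)
```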

The following cell demonstrates one of the core functionalities of FAISS (and, for that matter, of any vector database): **similarity search**. Given any query, similarity search compares the query vector with the vectors stored in the database and finds the ones that are most similar to it. This is important, as similarity search extracts the documents that are most relevant to the user's query (these documents are later fed to the LLM).

We illustrate it with a simple example.

In [None]:
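docs = vectorstore.similarity_search_with_score("Best oil for cooking")
# Note: with LangChain's default FAISS index the score is an L2 distance (lower = more similar),
# so reverse = True places the least similar of the returned hits first.
docs = sorted(docs, key = lambda x: x[1], reverse = True)
docs[0]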

(Document(page_content='index: 26556\nproduct: Extra Light Olive Oil\ncategory: Foodgrains, Oil & Masala\nsub_category: Edible Oils & Ghee\nbrand: Jivo\nsale_price: 1714\nmarket_price: 3900\ntype: Olive & Canola Oils\nrating: 3.3\ndescription: Extra Light Olive Oil is suitable for all types of Indian cuisine and deep-frying. It may help in lowering bad cholesterol, prevents strokes, protects heart disease, and fights Alzheimer disease. Also, it benefits to heart, brain, joints and more.', metadata={'source': 'bigBasketProducts.csv', 'row': 26555}),
 0.20270832)
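
Conceptually, a flat index compares the query vector with every stored vector and returns the closest ones. The sketch below shows that idea with plain Euclidean distance on made-up two-dimensional vectors; `nearest_documents` is a hypothetical name, and a real FAISS index is far more optimized and may report a different distance variant.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_documents(query_vec, doc_vecs, docs, k=2):
    """Return the k (document, distance) pairs closest to the query, nearest first."""
    scored = [(doc, l2_distance(query_vec, vec)) for doc, vec in zip(docs, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up 2-D "embeddings" purely for illustration.
docs = ["olive oil", "shampoo", "sunflower oil"]
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
query_vec = [1.0, 0.0]

hits = nearest_documents(query_vec, doc_vecs, docs, k=2)
for doc, distance in hits:
    print(doc, round(distance, 4))
```

With a distance score, smaller values mean closer matches, so the best hit is the one with the lowest score.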

## IMPLEMENTING THE LARGE LANGUAGE MODEL (LLM)

The next step, and possibly the most important one in building the query engine, is implementing the Large Language Model (LLM). The LLM plays the pivotal role in the query engine: it receives the documents relevant to the query from the vector database and tries to answer the query on the basis of the context in those documents. In short, the LLM acts as the interface between the user and the database: it receives the query from the user and provides the answer from the database.

Here we will be using the **Hugging Face Hub** to implement our LLM. The Hugging Face Hub is a platform with over 120k models, 20k datasets, and 50k demo apps (Spaces), all open source and publicly available, in an online platform where people can easily collaborate and build ML together.

To implement the LLM, we first need an API token from Hugging Face. We can obtain the API token by going to [Get Your API Token](https://huggingface.co/docs/api-inference/quicktour#get-your-api-token) and following the steps mentioned there:
1. Log in to Hugging Face or create a new account
2. Go to **Hugging Face profile settings** and get the API token/User access token from there

After obtaining the Hugging Face API token, run the following cell and enter the token, which will be used by the Hugging Face Hub LLM.

In [None]:
HUGGINGFACEHUB_API_TOKEN = getpass()
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

··········


Hugging Face Hub offers many models that can serve as the LLM. We will be using **Flan-T5** (`google/flan-t5-xxl`) by Google as our LLM.

For other options, visit this website https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads

In [None]:
callbacks = [StreamingStdOutCallbackHandler()]

llm = HuggingFaceHub(repo_id = "google/flan-t5-xxl", model_kwargs = {"temperature": 0.5, "max_length": 64})



## BUILDING THE QUERY-ANSWER INTERFACE USING RETRIEVAL-AUGMENTED GENERATION

Now that we have defined the model for generating embeddings, the vector store/database, and the LLM, the final step is to combine all three into a RAG-based query-answer chain. The chain takes the query from the user, generates its embedding, retrieves from the vector store the documents whose vectors are most similar to the query vector, passes those documents to the LLM, obtains the answer from the LLM, and returns it to the user.
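
The flow of such a chain can be sketched in a few lines. Everything here is a stand-in chosen for illustration: `answer_query`, `toy_embed`, `toy_search`, and `toy_llm` are hypothetical names, the "embedding" is just a set of words, and the "LLM" simply echoes the first line of its prompt.

```python
def answer_query(question, embed, search, llm, k=4):
    """Embed the query, retrieve k documents, build a prompt, and ask the LLM."""
    query_vector = embed(question)
    documents = search(query_vector, k)
    context = "\n\n".join(documents)
    prompt = f"{context}\n\nQuestion: {question}\nHelpful Answer:"
    return llm(prompt)


# Tiny made-up corpus for the demo.
corpus = ["Nivea makes Creme Soft Soap", "Mastercook makes a Water Bottle"]


def toy_embed(text):
    """Stand-in embedding: the set of lowercase words in the text."""
    return set(text.lower().split())


def toy_search(query_words, k):
    """Stand-in retriever: rank documents by the number of words shared with the query."""
    ranked = sorted(corpus, key=lambda doc: len(query_words & toy_embed(doc)), reverse=True)
    return ranked[:k]


def toy_llm(prompt):
    """Stand-in LLM: echo the first context line instead of generating text."""
    return prompt.splitlines()[0]


result = answer_query("Who makes a water bottle", toy_embed, toy_search, toy_llm, k=1)
print(result)
```

Swapping the three stand-ins for a real embedding model, a vector store retriever, and a hosted LLM gives the same overall shape as the chain built in this section.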

We use the **RetrievalQA** chain of LangChain to build our query-answer chain. We pass the LLM that we implemented in the previous step and the vector store retriever (the interface of the vector store that retrieves the documents most similar to the query) as arguments to its `from_chain_type` method. LangChain offers several chain types for combining the retrieved documents into an answer; `chain_type = 'stuff'` selects the one called **stuff** in library nomenclature, which places all retrieved documents into a single prompt.

Once the question-answer chain is created, we pass a few sample questions as queries to the chain, and print the answers. Run the following cells to see the output for the same.



In [None]:
qa = RetrievalQA.from_chain_type(llm = llm, chain_type = 'stuff', retriever = vectorstore.as_retriever(), callbacks = callbacks)

In [None]:
questions = ['Which company claims to have the best canola oil?',
             'What is the best type of cooking oil?',
             'Which brand has the best rated shampoo?',
             'Which brand has the best cooking oil under 500 Rs?',
             'Name a product made by Sri Sri Ayurveda']

In [None]:
for question in questions:
  print(f"Question: {question}")
  print(f"Answer: {qa.run(question)}", end = "\n\n")

Question: Which company claims to have the best canola oil?
Answer: Disano

Question: What is the best type of cooking oil?
Answer: Canola Oil

Question: Which brand has the best rated shampoo?
Answer: Nyle

Question: Which brand has the best cooking oil under 500 Rs?
Answer: Earthon

Question: Name a product made by Sri Sri Ayurveda
Answer: Pradara Shamaka Syrup

