## **Development of a QnA Chatbot for Interaction with a Tabular Dataset**

### **Objective:**

To create a question and answer (QnA) chatbot capable of interacting with a given pollution dataset in .csv format.

### **Task:**

1. Use any suitable retrieval-augmented generation (RAG) method to extract relevant information from the data.
2. Develop two versions of the chatbot:
  * List Template Based Chatbot
  * Text Template Based Chatbot
  
  Both versions of the chatbot should be able to interpret and respond to user queries using the
data from the pollution dataset.
3. Use any Large Language Model to generate responses based on the data extracted from the
table.
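
To make the difference between the two templates concrete, here is a small sketch. The row values and the helper names `to_list_template` and `to_text_template` are invented for illustration; only the column names follow the pollution dataset, and the notebook itself builds both templates inline rather than through helpers like these.

```python
# Hypothetical row: the column names match the pollution dataset,
# but the values are made up for this illustration.
row = {
    "Country": "Exampleland",
    "City": "Sampleville",
    "AQI Value": 42,
    "AQI Category": "Good",
}


def to_list_template(row: dict) -> str:
    """List template: the row is passed on as the string form of its dictionary."""
    return str(row)


def to_text_template(row: dict) -> str:
    """Text template: the row is rendered as a natural-language sentence."""
    return (
        f"In {row['City']}, {row['Country']}, the AQI Value is "
        f"{row['AQI Value']} and falls under the category {row['AQI Category']}."
    )


print(to_list_template(row))
print(to_text_template(row))
```

The list template keeps the raw key/value structure, while the text template gives the language model a sentence it can read directly.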

### **User Instructions:**
1.	The code is provided as a notebook (`.ipynb`) that can be run in `Google Colab`.
2.	Open Colab, click `Upload notebook`, and upload the notebook file.
3.	Upload the CSV file that the answers should be generated from to the `/content/` folder. The notebook reads it as `pollution_dataset.csv`.
4.	Set the `query` variable to the question you want to ask, and replace the `GROQ_API_KEY` placeholder with your own key.
5.	Run the cells by clicking `Runtime -> Run all`. This runs every cell and shows the output of each one.

### **Implementation Details**:

1.	**Import necessary libraries:** The first step is to install and import the libraries we need: `pandas`, `sklearn`, and `groq`.
2.	**Load and Vectorize CSV data:** Load the CSV data with `pandas` and vectorize it for both the list and text templates using the `TfidfVectorizer` class.
3.	**Perform Similarity Search:** Run a cosine-similarity search for the given query against both the list and text templates and keep the best-matching rows.
4.	**Generating Answer:** The retrieved content and the user query are passed, separately for the list and text templates, to the `llama2-70b-4096` model through the `Groq` API, which produces one answer per template.
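
The vectorize-and-search steps can be illustrated without any library. The following is a simplified pure-Python sketch of TF-IDF weighting with cosine similarity, not the notebook's implementation (which relies on scikit-learn). It assumes smoothed inverse document frequency and unit-length vectors, omits stop-word removal, and uses invented documents and helper names (`tokenize`, `weigh`, `tfidf_vectors`, `top_k`).

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def weigh(counts: Counter, idf: dict) -> dict:
    """Weight raw term counts by idf, drop unknown terms, scale to unit length."""
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vectors(docs: list[str]):
    """Return one sparse TF-IDF vector per document plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [weigh(Counter(toks), idf) for toks in tokenized], idf


def cosine(a: dict, b: dict) -> float:
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def top_k(query: str, docs: list[str], k: int = 2):
    """Rank documents by similarity to the query and return the best k."""
    vectors, idf = tfidf_vectors(docs)
    q = weigh(Counter(tokenize(query)), idf)
    scores = [cosine(q, v) for v in vectors]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(docs[i], scores[i]) for i in order[:k]]


# Invented rows, each flattened to one string as in the list template.
docs = [
    "Sampleville Exampleland 42 Good",
    "Testburg Exampleland 160 Unhealthy",
    "Mockton Otherland 30 Good",
]
for doc, score in top_k("Which city in Exampleland has good AQI value?", docs):
    print(f"{score:.3f}  {doc}")
```

The first document ranks highest because it shares two terms with the query ("Exampleland" and "Good"), while the others share only one each.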

## **Code Implementation**

### **Installing Dependencies**

In [None]:
!pip install pandas
!pip install scikit-learn
!pip install groq

### **Importing all necessary packages**

In [16]:
import os
import pandas as pd

from groq import Groq
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### **Defining the query to run against the dataset**

In [78]:
query = "Which city in Myanmar has good AQI value?"

### **Vectorizing and performing Similarity Search over list template**

In [80]:
# Reading and converting the dataset into pandas DataFrame
df = pd.read_csv("pollution_dataset.csv")

# converting DF to List of dictionaries
list_template = df.to_dict(orient='records')

# creating a new column named `text` that concatenates all the row values
df['text'] = df.apply(lambda row: ' '.join(map(str, row.values)), axis=1)

# Vectorizing the DF
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])

# Function that performs the similarity search and returns the k rows that best match the query
def find_best_matches(query, k=5):
    query_vec = vectorizer.transform([query])
    similarity_scores = cosine_similarity(query_vec, X).flatten()
    best_match_indices = similarity_scores.argsort()[-k:][::-1]
    return [list_template[i] for i in best_match_indices]

# storing the matches for the list template as a single string
best_matches = find_best_matches(query)
best_matches_list_template = ' '.join(map(str, best_matches))
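
One property of `argsort()[-k:][::-1]` is that it always returns `k` indices when at least `k` rows exist, so rows with a similarity of zero can still reach the prompt when fewer than `k` rows share a term with the query. A possible refinement, sketched here in plain Python with invented scores and a hypothetical helper name `top_k_nonzero`, is to drop zero-score rows after ranking. This is an optional variation, not part of the notebook.

```python
def top_k_nonzero(scores, k=5):
    """Return indices of the k highest scores, best first, skipping zero scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if scores[i] > 0]


# Invented similarity scores for five rows; only two rows match the query.
scores = [0.0, 0.31, 0.0, 0.62, 0.0]
print(top_k_nonzero(scores, k=3))  # → [3, 1]
```

With `k=3`, a plain top-k selection would also include one unrelated row; the filter keeps only the two rows that actually overlap with the query.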

### **Vectorizing and performing Similarity Search over text template**

In [81]:
# Iterating over the list template and converting the dictionary into text
text_template = []
for row_dict in list_template:
    text_template.append(f"In {row_dict['City']}, {row_dict['Country']}, the AQI Value is {row_dict['AQI Value']} and falls under the category {row_dict['AQI Category']}. The CO AQI Value is {row_dict['CO AQI Value']} ({row_dict['CO AQI Category']}), Ozone AQI Value is {row_dict['Ozone AQI Value']} ({row_dict['Ozone AQI Category']}), NO2 AQI Value is {row_dict['NO2 AQI Value']} ({row_dict['NO2 AQI Category']}), and PM2.5 AQI Value is {row_dict['PM2.5 AQI Value']} ({row_dict['PM2.5 AQI Category']}).")

# Same process as for the list template (this rebinds `vectorizer`, `X`, and `find_best_matches`)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(text_template)

def find_best_matches(query, k=5):
    query_vec = vectorizer.transform([query])
    similarity_scores = cosine_similarity(query_vec, X).flatten()
    best_match_indices = similarity_scores.argsort()[-k:][::-1]
    return [text_template[i] for i in best_match_indices]

# storing the matches for the text template as a single string
best_matches = find_best_matches(query)
best_matches_text_template = ' '.join(best_matches)

### **Generating response for data returned from text template**

In [None]:
# setting environment variable for groq
os.environ['GROQ_API_KEY'] = 'your Groq API Key Here'

prompt = f'''You are a chatbot you must generate a good summarised answer from the given context.
    Use the following provided context to answer the query enclosed within triple backticks.
    Context: {best_matches_text_template}
    User Query: ```{query}```
    Answer:
'''

# creating groq client to use the model `llama2-70b-4096`
client = Groq()

# generating answer from the context and query using 'llama2-70b-4096' model:
completion = client.chat.completions.create(
    model="llama2-70b-4096",
    messages=[
        {
            "role": "user",
            "content": f""" {prompt}
            """
        }
    ],
    temperature=0.5,
    max_tokens=1024,
    top_p=1,
    stream=True,
    stop=None,
)

# printing the result
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

Sure, I can help you with that!

Based on the given information, all the cities mentioned - Taunggyi, Ye, Labutta, Pathein, and Pyapon - have good AQI values, with the PM2.5 AQI value ranging from 30 to 39, which falls under the "Good" category.

Therefore, all five cities in Myanmar mentioned in the context have good AQI values.

### **Generating response for data returned from list template**

In [83]:
prompt = f'''You are a chatbot you must generate a good summarised answer from the given context.
    Use the following provided context to answer the query enclosed within triple backticks.
    Context: {best_matches_list_template}
    User Query: ```{query}```
    Answer:
'''

# creating groq client to use the model `llama2-70b-4096`
client = Groq()

# generating answer from the context and query using 'llama2-70b-4096' model:
completion = client.chat.completions.create(
    model="llama2-70b-4096",
    messages=[
        {
            "role": "user",
            "content": f""" {prompt}
            """
        }
    ],
    temperature=0.5,
    max_tokens=1024,
    top_p=1,
    stream=True,
    stop=None,
)

# printing the result
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

Based on the given context, both Taunggyi and Ye in Myanmar have good AQI values. The AQI value for Taunggyi is 39, which falls under the "Good" category, while Ye has an AQI value of 34, also categorized as "Good". Therefore, both cities can be considered as having good air quality. However, it's important to note that air quality can change over time and may vary depending on different locations within a city. It's always a good idea to check the current AQI value before planning outdoor activities.
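
The two generation cells differ only in the context they pass to the model, so the shared logic could be factored into helpers. The sketch below is one way to do that. The names `build_prompt` and `generate_answer` are hypothetical, the prompt layout is only roughly the same as in the cells above, and the code assumes the client exposes the `chat.completions.create` method used there and that a non-streaming response carries its text in `choices[0].message.content`.

```python
TICKS = "`" * 3  # three backticks, used to delimit the user query in the prompt


def build_prompt(context: str, query: str) -> str:
    """Assemble a prompt with the same wording as the generation cells."""
    return (
        "You are a chatbot you must generate a good summarised answer "
        "from the given context.\n"
        "Use the following provided context to answer the query enclosed "
        "within triple backticks.\n"
        f"Context: {context}\n"
        f"User Query: {TICKS}{query}{TICKS}\n"
        "Answer:\n"
    )


def generate_answer(client, context: str, query: str,
                    model: str = "llama2-70b-4096") -> str:
    """Send one non-streaming chat completion and return the answer text."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(context, query)}],
        temperature=0.5,
        max_tokens=1024,
    )
    return completion.choices[0].message.content


# Possible usage inside the notebook (requires a valid GROQ_API_KEY):
# client = Groq()
# print(generate_answer(client, best_matches_text_template, query))
# print(generate_answer(client, best_matches_list_template, query))
```

Passing the client in as an argument keeps the helper independent of any global state and makes it straightforward to try a different model name through the `model` parameter.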