# End-to-End Project: Building a WhatsApp-Based Q&A Chatbot with a Vector Database and RAG for a Data Engineering Community

# Overview:
The goal is to build a Question and Answer chatbot using WhatsApp group chat data. This will include cleaning the data, vectorizing it, and implementing a Retrieval-Augmented Generation (RAG) framework using LangChain and a Large Language Model (LLM).

# Steps Outline:
1. Data Preprocessing and Cleaning (already done and loaded into a CSV).
2. Data Preparation
3. Embedding Generation and Storage in a Vector Database.
4. Implementation of the RAG Workflow with LangChain.

To begin, we must install the prerequisite libraries that we will be using in this notebook.

In [None]:
!pip install -q \
pandas==2.2.3 \
pinecone==5.4.2 \
langchain_pinecone==0.2.0 \
langchain-huggingface==0.1.2 \
langchain-community==0.3.13 \
langchain-openai==0.2.14

# Data Preparation
For this project, we will use a pre-cleaned dataset of conversations from the data engineering community WhatsApp group. It contains about 20,000 group messages. With your own dataset you may need to perform the cleaning step first; this dataset has already been cleaned, so we can jump right to the action.
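For readers starting from a raw export rather than a cleaned CSV, the cleaning step might look roughly like the sketch below. The line layout it parses (`date, time - sender: message`) is an assumption, since exports vary by locale and app version, and `parse_export` is a hypothetical helper, not the code used to prepare this dataset.

```python
import re

# Assumed line prefix: "1/2/24, 9:05 AM - ". Real exports may differ.
PREFIX = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?:\s?[AaPp][Mm])? - ")


def parse_export(lines):
    """Turn raw export lines into {'sender', 'message'} records."""
    records = []
    current = None  # record that continuation lines should be appended to
    for raw in lines:
        line = raw.rstrip("\n")
        prefix = PREFIX.match(line)
        if prefix is None:
            # No timestamp: treat the line as a continuation of the last message.
            if current is not None:
                current["message"] += "\n" + line
            continue
        sender, sep, message = line[prefix.end():].partition(": ")
        if sep:
            current = {"sender": sender, "message": message}
            records.append(current)
        else:
            # Timestamped line without "sender: " is a system notice; skip it.
            current = None
    return records


sample_lines = [
    "1/2/24, 9:05 AM - Ada: Hello all",
    "second line",
    "1/2/24, 9:06 AM - Ada joined using this group's invite link",
    "1/2/24, 9:07 AM - Bo: Hi: there",
]
records = parse_export(sample_lines)
```

The resulting records carry the same `sender` and `message` fields that the CSV loaded in the next cell is expected to have.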

In [None]:
import pandas as pd

# Load cleaned WhatsApp data from a CSV file.
data = pd.read_csv("/content/messages.csv")
# Let's format the dataset to extract and concatenate the sender and message.
data['content'] = data['sender'].astype(str) + ': ' + data['message'].astype(str)
data = data['content'].tolist()
# Using a subset of the full dataset.
data = data[4:6]
data
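One thing to watch in the cell above: `astype(str)` turns a missing value into the literal string `nan`, so a row with an empty message would be embedded as `sender: nan`. The sketch below shows one way to build the same `sender: message` strings while skipping empty messages. It uses the standard-library `csv` module on an in-memory sample so it runs on its own; the sample rows and the `build_contents` helper are hypothetical.

```python
import csv
import io


def build_contents(csv_text):
    """Return 'sender: message' strings, skipping rows with an empty message."""
    reader = csv.DictReader(io.StringIO(csv_text))
    contents = []
    for row in reader:
        sender = (row.get("sender") or "").strip()
        message = (row.get("message") or "").strip()
        if not message:
            continue  # nothing useful to embed
        contents.append(f"{sender}: {message}")
    return contents


# Hypothetical rows with the same column names as messages.csv.
sample = (
    "sender,message\n"
    "Ada,Welcome to the community\n"
    "Bo,\n"
    "Cy,Has anyone tried Airflow?\n"
)
contents = build_contents(sample)
```

With pandas, a comparable effect could be had by dropping rows whose `message` is missing before concatenating.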

# Generate embeddings
Now, let's generate a sample embedding for one message from our subset of the dataset. We use the open-source embedding model `all-MiniLM-L6-v2` from HuggingFace, which is free to use, unlike the OpenAI embedding models, which charge for usage.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
# Instantiate a model object
model = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')
# Use the model object to generate embeddings
embeddings = model.embed_query(data[0])
print("Generated Embeddings:", pd.DataFrame(embeddings))

# Initializing the Index in a Vector Database
An index is a vector store for embeddings that enables efficient search over those vectors. To persist the vectors, we use a vector database. We will use Pinecone, one of the most popular vector databases, which offers a generous free tier. Interacting with Pinecone requires an API key; we can get a [free API key](https://app.pinecone.io) and use it to initialize a connection to Pinecone and create an index.

In [66]:
from google.colab import userdata
from pinecone import ServerlessSpec, Pinecone

# get the API key stored in Colab secrets
api_key = userdata.get('PINECONE_API_KEY')

# initialize connection to pinecone
pc = Pinecone(api_key=api_key)
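The cell above only works inside Colab, because `google.colab` is not importable elsewhere. A minimal sketch of a fallback, assuming the same secret names are exported as environment variables when running outside Colab; `get_secret` is a hypothetical helper, not part of any library.

```python
import os


def get_secret(name):
    """Read a secret from Colab's secret store, or from the environment elsewhere."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None
    if userdata is not None:
        return userdata.get(name)
    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} is not set")
    return value
```

With this in place, `get_secret('PINECONE_API_KEY')` could stand in for `userdata.get('PINECONE_API_KEY')` in either environment.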

We need to set up our index specification by defining the cloud provider and region where we want to deploy our serverless index.

In [68]:
spec = ServerlessSpec(cloud='aws', region='us-east-1')

index_name = 'dec-chat-index'

In [69]:
# check for and delete index if already exists
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

In [70]:
# Check if the index exists, if not, create it
if index_name not in pc.list_indexes().names():
    pc.create_index(
        index_name,
        dimension=384, # all-MiniLM-L6-v2 produces 384-dimensional embeddings
        metric='cosine', # Vectors similarity search metric
        spec=spec)
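The `metric='cosine'` setting means vectors are compared by the angle between them rather than by their length. As a rough illustration of that metric (plain Python, not Pinecone's implementation), cosine similarity can be computed like this:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-d vectors for illustration; real embeddings here have 384 dimensions.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

The dimension check mirrors why `dimension=384` must match the embedding model: vectors of a different length cannot be compared against the index.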

In [None]:
# connect to index
index = pc.Index(index_name)
# view index stats
index.describe_index_stats()

We should see that the new Pinecone index has been created and has a `total_vector_count` of 0, as we haven't added any vectors yet.

# Add data to the vector store
Let's add data to the vector store using LangChain's `PineconeVectorStore` class. Once we have initialized a `PineconeVectorStore` object, we can add records to the underlying Pinecone index (and thus also the linked LangChain object) using either the `add_documents` or `add_texts` methods.
Both of these methods also handle the embedding of the provided text data and the creation of records in the Pinecone index.

In [None]:
# Sample iteration showing how we will loop over the data in order to embed and store each text.
for i, text in enumerate(data):
  print(i, text)

In [73]:
from langchain_pinecone import PineconeVectorStore
# Add data to the vector store
vector_store = PineconeVectorStore(index, embedding=model, text_key="text")
for i, text in enumerate(data):
    vector_store.add_texts([text], metadatas=[{"id": i}])
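Adding one text per call is fine for a two-message subset, but for the full dataset it may be worth sending texts in batches. The sketch below only prepares the batches; `iter_batches` and the batch size of 100 are assumptions for illustration, and the metadata keeps the same `{"id": position}` shape used above.

```python
def iter_batches(texts, batch_size=100):
    """Yield (texts, metadatas) pairs, with metadata ids matching list positions."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        metadatas = [{"id": start + offset} for offset in range(len(chunk))]
        yield chunk, metadatas


# Hypothetical texts; in the notebook this would be the `data` list.
batches = list(iter_batches(["a", "b", "c"], batch_size=2))

# Each batch could then be passed to the same method the notebook uses:
# for texts, metadatas in iter_batches(data):
#     vector_store.add_texts(texts, metadatas=metadatas)
```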

Now, let's check the number of vectors in our index.

In [None]:
index.describe_index_stats()

Let's play with some similarity search on our vector index.

In [None]:
# Example queries; each assignment overwrites the previous one, so only the last is searched.
query = "What did Najeeb say"
query = "Which community was everyone welcomed to?"
query = "How many percent response do we have for the survey?"
query = "What is data engineering?"

vector_store.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)
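Conceptually, a similarity search scores every stored vector against the query vector and keeps the `k` best. The brute-force sketch below illustrates the idea with toy 3-d vectors standing in for real embeddings; it is not how Pinecone searches internally, and `top_k` and the sample records are hypothetical.

```python
import heapq
import math


def top_k(query_vec, records, k=3):
    """Return up to k (score, text) pairs ranked by cosine similarity."""
    def score(vec):
        dot = sum(q * v for q, v in zip(query_vec, vec))
        norms = math.hypot(*query_vec) * math.hypot(*vec)
        return dot / norms if norms else 0.0

    scored = ((score(vec), text) for text, vec in records)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Hypothetical stored messages with made-up vectors.
records = [
    ("Ada: Welcome to the community", [0.9, 0.1, 0.0]),
    ("Bo: The survey response is low", [0.1, 0.9, 0.2]),
]
results = top_k([0.0, 1.0, 0.1], records, k=3)
```

Note that asking for `k=3` from a store holding two records yields only two results. The notebook's subset also holds two messages, so the search above should likewise return at most two documents.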

# RAG Workflow Implementation
To create our Q&A chatbot, we need to build a Retrieval-Augmented Generation chain using LangChain. We will integrate several LLMs to show how each one performs at text generation.

Let's start with the `tiiuae/falcon-7b-instruct` LLM from HuggingFace. Falcon-7B-Instruct is a 7-billion-parameter causal decoder-only model built by TII, based on Falcon-7B and fine-tuned on a mixture of chat/instruct datasets.

In [90]:
from langchain.chains import RetrievalQA
from langchain_huggingface import HuggingFaceEndpoint

# Create a Retrieval-Augmented Generation chain using LangChain.
llm = HuggingFaceEndpoint(
    repo_id="tiiuae/falcon-7b-instruct",
    temperature= 0.01,
    max_new_tokens=250,
    huggingfacehub_api_token=userdata.get('HF_TOKEN'))
retriever = vector_store.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                       chain_type="stuff",
                       retriever=retriever)
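The `chain_type="stuff"` option means the retrieved documents are all placed ("stuffed") into a single prompt together with the question. The sketch below illustrates that assembly step only; the template wording is an assumption and differs from LangChain's actual prompt, and `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(question, documents):
    """Join all retrieved documents into one context block ahead of the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved messages.
docs = [
    "Ada: Welcome to the community",
    "Bo: The survey response is low",
]
prompt = build_stuff_prompt("What did Bo say about the survey?", docs)
```

Because every retrieved document lands in one prompt, this approach suits short chat messages but can exceed a model's context window when documents are long or `k` is large.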

In [94]:
# Run a query against the RAG pipeline
query = "How many percent response do we have for the survey?"
qa_chain.invoke(query)

{'query': 'How many percent response do we have for the survey?',
 'result': 'less than 20%'}

Let's test performance with the `google/flan-t5-large` LLM from HuggingFace. It is a popular instruction-tuned model with about 780 million parameters.

In [91]:
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFaceHub

# Create a Retrieval-Augmented Generation chain using LangChain.
llm = HuggingFaceHub(
    repo_id="google/flan-t5-large",
    model_kwargs={"temperature": 0.01, "max_new_tokens": 250},
    huggingfacehub_api_token=userdata.get('HF_TOKEN'))
retriever = vector_store.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                       chain_type="stuff",
                       retriever=retriever)

In [None]:
# Run a query against the RAG pipeline
query = "How many percent response do we have for the survey?"
qa_chain.invoke(query)

Lastly, we will test performance with OpenAI's popular `gpt-4o-mini` model. OpenAI has not published its parameter count.

In [93]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# chatbot language model
llm = ChatOpenAI(
    openai_api_key=userdata.get('OPENAI_API_KEY'),
    model_name='gpt-4o-mini',
    temperature=0.0
)
# retrieval augmented pipeline for chatbot
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever()
)

In [None]:
# Run a query against the RAG pipeline
query = "How many percent response do we have for the survey?"
qa.invoke(query)
# qa.run(query)  # legacy call that returns only the answer string
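To compare the models side by side, the same query can be run through each chain and the answers collected in one place. The sketch below assumes only that each chain has an `invoke(query)` method returning a dict with a `result` key, as in the output shown earlier. `compare_chains` and `EchoChain` are hypothetical; `EchoChain` is a stand-in so the example runs without API keys.

```python
def compare_chains(chains, query):
    """Run one query through several chains and collect each 'result' string."""
    answers = {}
    for name, chain in chains.items():
        try:
            answers[name] = chain.invoke(query)["result"]
        except Exception as exc:  # keep going if one provider fails
            answers[name] = f"error: {exc}"
    return answers


class EchoChain:
    """Stand-in with the same invoke() return shape as the notebook's chains."""

    def __init__(self, answer):
        self.answer = answer

    def invoke(self, query):
        return {"query": query, "result": self.answer}


answers = compare_chains(
    {"model-a": EchoChain("less than 20%"), "model-b": EchoChain("20%")},
    "How many percent response do we have for the survey?",
)
```

In the notebook, `qa_chain` is reassigned for the second model, so each chain would need its own variable name before being passed in together.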