<a href="https://colab.research.google.com/github/Sahanave/found_it_using_gemma/blob/main/Found_it_using_Gemma%2C_Langchain_and_ChromaDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
memocan_data_science_interview_q_and_a_treasury_path = kagglehub.dataset_download('memocan/data-science-interview-q-and-a-treasury')

print('Data source import complete.')


<center><h1>RAG using Gemma, Langchain and ChromaDB</h1></center>
<center><img src="https://res.infoq.com/news/2024/02/google-gemma-open-model/en/headerimage/generatedHeaderImage-1708977571481.jpg" width="400"></center>


# Introduction

This notebook demonstrates how to build a retrieval-augmented generation (RAG) system using Gemma as the large language model (LLM), Langchain for the tools that load and process input files, and ChromaDB as the vector database.

## What is RAG?

Retrieval-augmented generation (RAG) is a technique that improves the response generated by an LLM in two ways:
- First, information is retrieved from a dataset stored in a vector database; the query is used to perform a similarity search over the stored documents.
- Second, by restricting the context provided to the LLM to stored content that is similar to the initial query, we can significantly reduce (and in some cases eliminate) the LLM's hallucinations, since the answer is drawn from the context of the stored documents.

An important advantage of this approach is that we do not need to fine-tune the LLM with our custom data; instead, the data is ingested (cleaned, transformed, chunked, and indexed in the vector database).
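
The similarity search step can be illustrated with a minimal sketch. It assumes a toy word-count "embedding" in place of a real sentence-embedding model and a plain list in place of a vector database; the function names `embed`, `cosine`, and `retrieve` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a stand-in for a real sentence-embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# Made-up documents, used only for this illustration
documents = [
    "supervised learning trains a model on labeled examples",
    "dropout randomly disables neurons during training",
    "the normal distribution is a bell shaped curve",
]
top = retrieve("what is supervised learning", documents, k=1)
print(top[0])
```

A real system replaces `embed` with a learned embedding model, so that texts with similar meaning score highly even when they share few words.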

## Procedure

We create two classes:
* AIAgent - an AI agent that queries the Gemma LLM using a custom prompt. The prompt instructs Gemma to answer the query by referring only to the context provided with it. The agent's `generate` function returns the prompt and the answer.
* RAGSystem - initialized with an AIAgent object; it loads the dataset of Data Science questions and answers. In the init function of this class, we ingest the data from the dataset into the vector database. The class also has a `query` member function. In this function, we first perform a similarity search in the vector database using the query. Then, we call the `generate` function of the AI agent object. Before returning the answer, we use a predefined template to compose the overall response from the question, the prompt, the answer, and the retrieved context.
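
How the two classes cooperate can be sketched without loading any model. `StubAgent` and `TinyRAG` below are hypothetical stand-ins, not the classes defined later in this notebook: the stub returns a canned answer instead of calling Gemma, retrieval is a simple shared-word lookup rather than a vector search, and the template omits the prompt for brevity.

```python
class StubAgent:
    """Hypothetical stand-in for AIAgent: returns a canned answer instead of calling Gemma."""
    def generate(self, query, retrieved_info):
        prompt = f"Question: {query}\nContext: {retrieved_info}\nAnswer:"
        answer = f"(answer based on {len(retrieved_info)} characters of context)"
        return prompt, answer

class TinyRAG:
    """Hypothetical stand-in for RAGSystem, using a keyword lookup instead of a vector database."""
    template = "\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n\nContext:\n{context}"

    def __init__(self, agent, documents):
        self.agent = agent
        self.documents = documents

    def retrieve(self, query):
        # Keep the documents that share at least one word with the query
        words = set(query.lower().split())
        return [d for d in self.documents if words & set(d.lower().split())]

    def query(self, query):
        # Step 1: retrieve the context for the query
        context = " ".join(self.retrieve(query))
        # Step 2: let the agent generate an answer from the query and the context
        prompt, answer = self.agent.generate(query, context)
        # Step 3: compose the overall response with the template
        return self.template.format(question=query, answer=answer, context=context)

rag = TinyRAG(StubAgent(), ["Dropout disables random neurons.", "Bias is systematic error."])
result = rag.query("explain dropout")
print(result)
```

Keeping the agent behind a single `generate` method means the retrieval side can be tested with a stub like this one before the real model is loaded.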


# Package installation and configuration

In [None]:
# install required libraries
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install langchain
!pip install sentence-transformers
!pip install chromadb

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Loaders and vector stores live in langchain_community, like the embeddings below
from langchain_community.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

from IPython.display import display, Markdown


# AI Agent class

In [None]:
class AIAgent:
    """
    Gemma 2b-it assistant.
    It uses Gemma transformers 2b-it/2.
    """
    def __init__(self, max_length=256):
        self.max_length = max_length
        self.tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/gemma/transformers/2b-it/2")
        self.gemma_lm = AutoModelForCausalLM.from_pretrained("/kaggle/input/gemma/transformers/2b-it/2")

    def create_prompt(self, query, context):
        # prompt template
        prompt = f"""
        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: {query}
        Context: {context}
        Answer:
        """
        return prompt

    def generate(self, query, retrieved_info):
        prompt = self.create_prompt(query, retrieved_info)
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        # Answer generation: max_new_tokens caps the answer length only,
        # whereas max_length would also count the prompt tokens
        output_ids = self.gemma_lm.generate(
            input_ids,
            max_new_tokens=self.max_length
        )
        # generate() returns the prompt tokens followed by the new tokens;
        # decode() has no skip_prompt option, so slice the prompt off first
        new_tokens = output_ids[0][input_ids.shape[-1]:]
        answer = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
        return prompt, answer
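
For a decoder-only model such as Gemma, `generate` returns the prompt tokens followed by the newly generated tokens, so decoding the full sequence repeats the prompt inside the answer. The sketch below shows how the answer can be isolated; it uses plain Python lists with made-up token ids in place of tensors, and `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]

# Made-up token ids: the first three stand for the prompt, the rest for the answer
prompt_ids = [101, 7, 42]
output_ids = prompt_ids + [9, 15, 3]

new_tokens = strip_prompt(output_ids, len(prompt_ids))
print(new_tokens)  # [9, 15, 3]
```

With real tensors, the same idea is a slice along the sequence dimension, using the prompt length taken from `input_ids.shape[-1]`.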

## Test the AIAgent

In [None]:
ai_agent = AIAgent()

Let's use the context from the Data Science interview Q&A treasury.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 1000)
data_df = pd.read_csv("/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv")
data_df.head(3)

In [None]:
context = data_df.iloc[0].answer
print("Context: ", context)
prompt, answer = ai_agent.generate(query="What is supervised learning?", retrieved_info=context)
print("LLM Answer: ", answer)

In [None]:
class RAGSystem:
    """Sentence-embedding-based retrieval-augmented generation.
    Given a CSV dataset of questions and answers, the retriever finds the
    num_retrieved_docs most relevant chunks."""
    def __init__(self, ai_agent, num_retrieved_docs=2):
        # load the data
        self.num_docs = num_retrieved_docs
        self.ai_agent = ai_agent
        loader = CSVLoader("/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv")
        documents = loader.load()
        self.template = "\n\nQuestion:\n{question}\n\nPrompt:\n{prompt}\n\nAnswer:\n{answer}\n\nContext:\n{context}"

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=800,
            chunk_overlap=100)
        all_splits = text_splitter.split_documents(documents)
        # create a vectorstore database
        embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
        self.vector_db = Chroma.from_documents(documents=all_splits,
                                               embedding=embeddings,
                                               persist_directory="chroma_db")
        # pass k explicitly; otherwise num_retrieved_docs is never used
        self.retriever = self.vector_db.as_retriever(search_kwargs={"k": self.num_docs})

    def retrieve(self, query):
        # retrieve top k similar documents to query
        docs = self.retriever.get_relevant_documents(query)
        return docs

    def query(self, query):
        # generate the answer
        context = self.retrieve(query)
        data = ""
        for item in list(context):
            data += item.page_content

        data = data[:500]

        prompt, answer = self.ai_agent.generate(query, data)

        return self.template.format(question=query,
                                    prompt=prompt,
                                    answer=answer,
                                    context=context)
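
The `chunk_size` and `chunk_overlap` settings used in `RAGSystem` control how documents are cut before indexing: consecutive chunks share some characters so that a sentence falling on a boundary is not lost. The sketch below is a deliberately simplified fixed-width splitter that only illustrates the overlap idea. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks), and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character chunks where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one, which is the role `chunk_overlap=100` plays for the 800-character chunks in the class.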



In [None]:
def colorize_text(text):
    for word, color in zip(["Question", "Prompt", "Answer", "Context"], ["blue", "magenta", "red", "green"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

# Test the RAG system

In [None]:
rag_system = RAGSystem(ai_agent)

Let's first try a few of the questions from the data we used for the retrieval system.

In [None]:
answer = rag_system.query(data_df.iloc[0].question)
display(Markdown(colorize_text(answer)))

In [None]:
answer = rag_system.query(data_df.iloc[3].question)
display(Markdown(colorize_text(answer)))

In [None]:
answer = rag_system.query("What’s the normal distribution? Why do we care about it?")
display(Markdown(colorize_text(answer)))

Let's also try some "fresh" questions.

In [None]:
answer = rag_system.query("Please explain bias and variance?")
display(Markdown(colorize_text(answer)))

In [None]:
answer = rag_system.query("What is a Dropout?")
display(Markdown(colorize_text(answer)))

# Conclusions

We tested a RAG system built with Gemma as the LLM, Langchain for data loading utilities, and ChromaDB as the vector database.
The RAG system is initialized with a dataset, which is used to populate the vector database, and with an AI agent, which queries Gemma with the initial query and the retrieved context.
To verify that the result is based on the context provided, we also include the context in the returned result.
