<center><h1>RAG using Gemma, LangChain and ChromaDB</h1></center>
<center><img src="https://res.infoq.com/news/2024/02/google-gemma-open-model/en/headerimage/generatedHeaderImage-1708977571481.jpg" width="400"></center>


# Introduction

This notebook demonstrates how to build a retrieval-augmented generation (RAG) system using Gemma as the large language model (LLM), LangChain for tools to process input files, and ChromaDB as the vector database.

## What is RAG?

Retrieval-augmented generation (RAG) is a technique that improves the response generated by an LLM in two ways:
- First, relevant information is retrieved from a dataset stored in a vector database; the query is used to run a similarity search over the stored documents.
- Second, by restricting the context provided to the LLM to stored content that is similar to the initial query, we can significantly reduce (though not necessarily eliminate) the LLM's hallucinations, since the answer is grounded in the stored documents.

An important advantage of this approach is that we do not need to fine-tune the LLM with our custom data; instead, the data is ingested (cleaned, transformed, chunked, and indexed in the vector database).
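The two steps above can be sketched with a toy example. This is a minimal illustration, not the pipeline built later in this notebook: it replaces the embedding model and the vector database with a bag-of-words cosine similarity over an in-memory list, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Step 1: return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_docs):
    """Step 2: restrict the LLM to the retrieved context."""
    context = "\n".join(context_docs)
    return f"Use only the context to answer.\nQuestion: {query}\nContext: {context}\nAnswer:"


documents = [
    "Supervised learning trains algorithms on labeled data.",
    "Unsupervised learning detects patterns in data without labels.",
    "The normal distribution is a continuous probability distribution.",
]
question = "What is the normal distribution?"
top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In a real system the prompt would then be passed to the LLM; here it is only printed so the retrieved context can be inspected.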

## Procedure

We create two classes:
* AIAgent - an AI agent that queries the Gemma LLM using a custom prompt, which instructs Gemma to generate an answer to the query by referring only to the context that is also provided; the agent's generate function then returns the prompt together with the answer.
* RAGSystem - initialized with the dataset of Data Science questions and answers and with an AIAgent object. In the init function of this class, we ingest the data from the dataset into the vector database. The class also has a query member function. In this function, we first run a similarity search for the query against the vector database. Then we call the generate function of the AI agent object. Before returning the answer, we use a predefined template to compose the overall response from the question, the prompt, the answer, and the retrieved context.


# Package installation and configuration

In [1]:
# install required libraries
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install langchain
!pip install sentence-transformers
!pip install chromadb

Looking in indexes: https://pypi.org/simple/
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl (69.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.45.0
Collecting langchain
  Downloading langchain-0.3.13-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.26 (from langchain)
  Downloading langchain_core-0.3.28-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.3 (from langchain)
  Downloading langchain_text_splitters-0.3.4-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.3,>=0.1.17 (from langchain)
  Downloading langsmith-0.2.4-py3-none-any.whl.metadata (14 kB)
Collecting pydantic<3.0.0,>=2.7.4 (from langchain)
  Downl

In [9]:
! pip install transformers -U


Collecting tokenizers<0.22,>=0.21 (from transformers)
  Using cached tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Using cached tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: tokenizers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.20.3
    Uninstalling tokenizers-0.20.3:
      Successfully uninstalled tokenizers-0.20.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chromadb 0.5.23 requires tokenizers<=0.20.3,>=0.13.2, but you have tokenizers 0.21.0 which is incompatible.
Successfully installed tokenizers-0.21.0

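The pip output above reports a conflict: chromadb 0.5.23 requires `tokenizers<=0.20.3`, while the upgrade installed tokenizers 0.21.0. A small sketch for checking which versions ended up in the environment, using only the standard library; `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages named in the pip output above.
for name, ver in installed_versions(["chromadb", "tokenizers", "transformers"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

The printed versions depend on the environment, so no expected output is shown.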

In [12]:
!pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.13-py3-none-any.whl.metadata (2.9 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.7.0-py3-none-any.whl.metadata (3.5 kB)
Downloading langchain_community-0.3.13-py3-none-any.whl (2.5 MB)
Downloading httpx_sse-0.4.0-py3-none-any.whl (7.8 kB)
Downloading pydantic_settings-2.7.0-py3-none-any.whl (29 kB)
Installing collected packages: httpx-sse, pydantic-settings, langchain-community
Successfully installed httpx-sse-0.4.0 langchain-community-0.3.13 pydantic-settings-2.7.0


In [13]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain_community.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

from IPython.display import display, Markdown


# AI Agent class

In [14]:
class AIAgent:
    """
    Gemma 2b-it assistant.
    It uses Gemma transformers 2b-it/2.
    """
    def __init__(self, max_length=256):
        self.max_length = max_length
        self.tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/gemma/transformers/2b-it/2")
        self.gemma_lm = AutoModelForCausalLM.from_pretrained("/kaggle/input/gemma/transformers/2b-it/2")

    def create_prompt(self, query, context):
        # prompt template
        prompt = f"""
        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: {query}
        Context: {context}
        Answer:
        """
        return prompt
    
    def generate(self, query, retrieved_info):
        prompt = self.create_prompt(query, retrieved_info)
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        # Answer generation
        answer = self.gemma_lm.generate(
            input_ids,
            #max_length=self.max_length, # limit the answer to max_length
            max_new_tokens=self.max_length
        )
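        # Decode and return the answer.
        # Note: `skip_prompt` is not a `decode` option (it belongs to the streamer
        # classes), so it has no effect here and the decoded text still starts
        # with the prompt.
        answer = self.tokenizer.decode(answer[0], skip_special_tokens=True, skip_prompt=True)
        return prompt, answer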
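In `generate` above, the whole generated sequence is decoded. For a decoder-only model such as Gemma, that sequence consists of the prompt ids followed by the newly generated ids, so one common way to keep only the answer is to drop the first `prompt_length` ids before decoding. A minimal sketch, with plain lists standing in for tensors; `new_tokens_only` is a hypothetical helper.

```python
def new_tokens_only(output_ids, prompt_length):
    """Return the ids generated after the prompt.

    Assumes the sequence starts with the prompt ids, as is the case for
    decoder-only (causal) models.
    """
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length exceeds the length of the generated sequence")
    return output_ids[prompt_length:]


# Stand-in values: 4 prompt ids followed by 3 generated ids.
prompt_ids = [2, 101, 102, 103]
output_ids = prompt_ids + [201, 202, 203]
print(new_tokens_only(output_ids, len(prompt_ids)))  # → [201, 202, 203]
```

Inside `AIAgent.generate`, this would correspond to decoding `answer[0][input_ids.shape[-1]:]` instead of `answer[0]`.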

## Test the AIAgent

In [15]:
ai_agent = AIAgent()


Let's use the context from the Data Science interview Q&A treasury.

In [16]:
import pandas as pd
pd.set_option('display.max_colwidth', 1000)
data_df = pd.read_csv("/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv")
data_df.head(3)

Unnamed: 0,question,answer
0,What is supervised machine learning? 👶,"Supervised learning is a type of machine learning in which our algorithms are trained using well-labeled training data, and machines predict the output based on that data. Labeled data indicates that the input data has already been tagged with the appropriate output. Basically, it is the task of learning a function that maps the input set and returns an output. Some of its examples are: Linear Regression, Logistic Regression, KNN, etc."
1,What is regression? Which models can you use to solve a regression problem? 👶,Regression is a part of supervised ML. Regression models investigate the relationship between a dependent (target) and independent variable (s) (predictor).\nHere are some common regression models\n\n- *Linear Regression* establishes a linear relationship between target and predictor (s). It predicts a numeric value and has a shape of a straight line.\n- *Polynomial Regression* has a regression equation with the power of independent variable more than 1. It is a curve that fits into the data points.\n- *Ridge Regression* helps when predictors are highly correlated (multicollinearity problem). It penalizes the squares of regression coefficients but doesn’t allow the coefficients to reach zeros (uses L2 regularization).\n- *Lasso Regression* penalizes the absolute values of regression coefficients and allows some of the coefficients to reach absolute zero (thereby allowing feature selection).
2,What is linear regression? When do we use it? 👶,"Linear regression is a model that assumes a linear relationship between the input variables (X) and the single output variable (y).\n\nWith a simple equation:\n\n```\ny = B0 + B1*x1 + ... + Bn * xN\n```\n\nB is regression coefficients, x values are the independent (explanatory) variables and y is dependent variable.\n\nThe case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.\n\nSimple linear regression:\n\n```\ny = B0 + B1*x1\n```\n\nMultiple linear regression:\n\n```\ny = B0 + B1*x1 + ... + Bn * xN\n```"


In [17]:
context = data_df.iloc[0].answer
print("Context: ", context)
prompt, answer = ai_agent.generate(query="What is supervised learning?", retrieved_info=context)
print("LLM Answer: ", answer)

Context:  Supervised learning is a type of machine learning in which our algorithms are trained using well-labeled training data, and machines predict the output based on that data. Labeled data indicates that the input data has already been tagged with the appropriate output. Basically, it is the task of learning a function that maps the input set and returns an output. Some of its examples are: Linear Regression, Logistic Regression, KNN, etc.
LLM Answer:  
        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: What is supervised learning?
        Context: Supervised learning is a type of machine learning in which our algorithms are trained us

In [18]:
class RAGSystem:
    """Retrieval-augmented generation based on sentence embeddings.
    Given a CSV dataset of questions and answers, the retriever finds the
    documents most similar to the query."""
    def __init__(self, ai_agent, num_retrieved_docs=2):
        # num_docs is stored but not passed to the retriever, which therefore
        # returns its default number of documents (four in the outputs below)
        self.num_docs = num_retrieved_docs
        self.ai_agent = ai_agent
        # load the data
        loader = CSVLoader("/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv")
        documents = loader.load()
        self.template = "\n\nQuestion:\n{question}\n\nPrompt:\n{prompt}\n\nAnswer:\n{answer}\n\nContext:\n{context}"
        
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=800, 
            chunk_overlap=100)
        all_splits = text_splitter.split_documents(documents)
        # create a vectorstore database
        embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
        self.vector_db = Chroma.from_documents(documents=all_splits, 
                                               embedding=embeddings, 
                                               persist_directory="chroma_db")
        self.retriever = self.vector_db.as_retriever()

    def retrieve(self, query):
        # retrieve top k similar documents to query
        docs = self.retriever.get_relevant_documents(query)
        return docs
    
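    def query(self, query):
        # retrieve the context and generate the answer
        context = self.retrieve(query)
        # concatenate the retrieved chunks (no separator between them)
        data = ""
        for item in list(context):
            data += item.page_content

        # keep only the first 500 characters of the context for the prompt
        data = data[:500]

        prompt, answer = self.ai_agent.generate(query, data)

        return self.template.format(question=query,
                                    prompt=prompt,
                                    answer=answer,
                                    context=context)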
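`RecursiveCharacterTextSplitter` is configured above with `chunk_size=800` and `chunk_overlap=100`. The sketch below illustrates only the idea of overlapping chunks, using a fixed sliding window over characters. It is a simplification and not LangChain's algorithm, which also tries to split on separators such as paragraph and line breaks; `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=100):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = "x" * 2000
chunks = sliding_chunks(text)
print([len(c) for c in chunks])  # → [800, 800, 600]
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one of the two neighbouring chunks, provided it is shorter than the overlap.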
        
        

In [19]:
def colorize_text(text):
    for word, color in zip(["Question", "Prompt", "Answer", "Context"], ["blue", "magenta", "red", "green"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

# Test the RAG system

In [20]:
rag_system = RAGSystem(ai_agent)

Let's first try a few of the questions from the data we used for the retrieval system.

In [21]:
answer = rag_system.query(data_df.iloc[0].question)
display(Markdown(colorize_text(answer)))





**<font color='blue'>Question:</font>**
What is supervised machine learning? 👶

**<font color='magenta'>Prompt:</font>**

        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: What is supervised machine learning? 👶
        Context: question: What is supervised machine learning? 👶
answer: Supervised learning is a type of machine learning in which our algorithms are trained using well-labeled training data, and machines predict the output based on that data. Labeled data indicates that the input data has already been tagged with the appropriate output. Basically, it is the task of learning a function that maps the input set and returns an output. Some of its examples are: Linear Regression, Logistic Regression, KNN, etc.ques
        Answer:
        

**<font color='red'>Answer:</font>**

        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: What is supervised machine learning? 👶
        Context: question: What is supervised machine learning? 👶
answer: Supervised learning is a type of machine learning in which our algorithms are trained using well-labeled training data, and machines predict the output based on that data. Labeled data indicates that the input data has already been tagged with the appropriate output. Basically, it is the task of learning a function that maps the input set and returns an output. Some of its examples are: Linear Regression, Logistic Regression, KNN, etc.ques
        Answer:
        Supervised machine learning is a type of machine learning where algorithms are trained using well-labeled training data.

**<font color='green'>Context:</font>**
[Document(metadata={'row': 0, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What is supervised machine learning? 👶\nanswer: Supervised learning is a type of machine learning in which our algorithms are trained using well-labeled training data, and machines predict the output based on that data. Labeled data indicates that the\xa0input data has already been tagged with the appropriate output. Basically, it is the task of learning a function that maps the input set and returns an output. Some of its examples are: Linear Regression, Logistic Regression, KNN, etc.'), Document(metadata={'row': 147, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: How can we use machine learning for search? \u200d⭐️\nanswer: Answer here'), Document(metadata={'row': 132, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What is unsupervised learning? 👶\nanswer: Unsupervised learning aims to detect patterns in data where no labels are given.'), Document(metadata={'row': 20, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What is classification? Which models would you use to solve a classification problem? 👶\nanswer: Classification problems are problems in which our prediction space is discrete, i.e. there is a finite number of values the output variable can be. Some models which can be used to solve classification problems are: logistic regression, decision tree, random forests, multi-layer perceptron, one-vs-all, amongst others.')]
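In the output above, the context inside the prompt ends with `etc.ques`: `query` concatenates the retrieved `page_content` strings without a separator and keeps the first 500 characters, so the second document is cut off after four characters. The list also contains four documents, which suggests that `num_retrieved_docs=2` is not applied to the retriever. A hedged sketch of one alternative way to assemble the context from whole documents; `build_context` and the `Doc` stand-in class are hypothetical and not part of LangChain.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document with a page_content attribute."""
    page_content: str


def build_context(docs, num_docs=2, max_chars=500, separator="\n\n"):
    """Join up to num_docs documents, adding whole documents only.

    Stops at the first document that would push the context past max_chars.
    If even the first document is too long, it is truncated to max_chars.
    """
    selected = docs[:num_docs]
    parts, used = [], 0
    for doc in selected:
        extra = len(doc.page_content) + (len(separator) if parts else 0)
        if used + extra > max_chars:
            break
        parts.append(doc.page_content)
        used += extra
    if not parts and selected:
        return selected[0].page_content[:max_chars]
    return separator.join(parts)


docs = [Doc("a" * 300), Doc("b" * 150), Doc("c" * 10)]
context = build_context(docs)
print(len(context))  # → 452
```

The number of retrieved documents can also be limited at the source: LangChain vector stores accept `search_kwargs={"k": ...}` in `as_retriever`.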

In [22]:
answer = rag_system.query(data_df.iloc[3].question)
display(Markdown(colorize_text(answer)))



**<font color='blue'>Question:</font>**
What are the main assumptions of linear regression? ⭐

**<font color='magenta'>Prompt:</font>**

        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: What are the main assumptions of linear regression? ⭐
        Context: question: What are the main assumptions of linear regression? ⭐
answer: There are several assumptions of linear regression. If any of them is violated, model predictions and interpretation may be worthless or misleading.question: What is linear regression? When do we use it? 👶
answer: Linear regression is a model that assumes a linear relationship between the input variables (X) and the single output variable (y).

With a simple equation:

```
y = B0 + B1*x1 + ... + Bn * xN
```

B is regression 
        Answer:
        

**<font color='red'>Answer:</font>**

        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: What are the main assumptions of linear regression? ⭐
        Context: question: What are the main assumptions of linear regression? ⭐
answer: There are several assumptions of linear regression. If any of them is violated, model predictions and interpretation may be worthless or misleading.question: What is linear regression? When do we use it? 👶
answer: Linear regression is a model that assumes a linear relationship between the input variables (X) and the single output variable (y).

With a simple equation:

```
y = B0 + B1*x1 + ... + Bn * xN
```

B is regression 
        Answer:
        Linear regression is used when we have a dependent variable that is linearly related to one or more independent variables. It is commonly used in various fields, including business, science, and social sciences.

**<font color='green'>Context:</font>**
[Document(metadata={'row': 3, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What are the main assumptions of linear regression? ⭐\nanswer: There are several assumptions of linear regression. If any of them is violated, model predictions and interpretation may be worthless or misleading.'), Document(metadata={'row': 2, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What is linear regression? When do we use it? 👶\nanswer: Linear regression is a model that assumes a linear relationship between the input variables (X) and the single output variable (y).\n\nWith a simple equation:\n\n```\ny = B0 + B1*x1 + ... + Bn * xN\n```\n\nB is regression coefficients, x values are the independent (explanatory) variables  and y is dependent variable.\n\nThe case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.\n\nSimple linear regression:\n\n```\ny = B0 + B1*x1\n```\n\nMultiple linear regression:\n\n```\ny = B0 + B1*x1 + ... + Bn * xN\n```'), Document(metadata={'row': 39, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What happens to our linear regression model if we have three columns in our data: x, y, z \u200a—\u200a and z is a sum of x and y? \u200d⭐️\nanswer: We would not be able to perform the regression. Because z is linearly dependent on x and y so when performing the regression <img src="https://render.githubusercontent.com/render/math?math={X}^{T}{X}"> would be a singular (not invertible) matrix.'), Document(metadata={'row': 1, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What is regression? Which models can you use to solve a regression problem? 👶\nanswer: Regression is a part of supervised ML. 
Regression models investigate the relationship between a dependent (target) and independent variable (s) (predictor).\nHere are some common regression models')]

In [23]:
answer = rag_system.query("What’s the normal distribution? Why do we care about it?")
display(Markdown(colorize_text(answer)))



**<font color='blue'>Question:</font>**
What’s the normal distribution? Why do we care about it?

**<font color='magenta'>Prompt:</font>**

        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: What’s the normal distribution? Why do we care about it?
        Context: question: What’s the normal distribution? Why do we care about it? 👶
answer: The normal distribution is a continuous probability distribution whose probability density function takes the following formula:

![formula](https://mathworld.wolfram.com/images/equations/NormalDistribution/NumberedEquation1.gif)

where μ is the mean and σ is the standard deviation of the distribution.

The normal distribution derives its importance from the **Central Limit Theorem**, which states that if we draw a larg
        Answer:
        

**<font color='red'>Answer:</font>**

        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: What’s the normal distribution? Why do we care about it?
        Context: question: What’s the normal distribution? Why do we care about it? 👶
answer: The normal distribution is a continuous probability distribution whose probability density function takes the following formula:

![formula](https://mathworld.wolfram.com/images/equations/NormalDistribution/NumberedEquation1.gif)

where μ is the mean and σ is the standard deviation of the distribution.

The normal distribution derives its importance from the **Central Limit Theorem**, which states that if we draw a larg
        Answer:
        Sure, here's an explanation of the normal distribution:

The normal distribution is a bell-shaped curve that is commonly used to model real-world phenomena. It is often used in statistics, probability theory, and machine learning to model continuous data.

The normal distribution is important because it has a number of properties that make it a useful tool for modeling real-world data. These properties include:

* The mean and standard deviation of the normal distribution are the same for any set of data. This means that the mean and standard deviation can be used to represent the entire distribution of data.
* The normal distribution is symmetric, meaning that it is symmetrical about its mean. This means that the probability density is the same on both sides of the mean.
* The normal distribution is bell-shaped, meaning that it is symmetric about its mean. This means that the probability density is higher in the center of the distribution and lower on the sides.

These properties make the normal distribution a useful tool for modeling real-world data. It can be used to fit a wide variety of data sets, and it can provide valuable insights into the underlying structure of the data.

**<font color='green'>Context:</font>**
[Document(metadata={'row': 4, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What’s the normal distribution? Why do we care about it? 👶\nanswer: The normal distribution is a continuous probability distribution whose probability density function takes the following formula:\n\n![formula](https://mathworld.wolfram.com/images/equations/NormalDistribution/NumberedEquation1.gif)\n\nwhere μ is the mean and σ is the standard deviation of the distribution.\n\nThe normal distribution derives its importance from the **Central Limit Theorem**, which states that if we draw a large enough number of samples, their mean will follow a normal distribution regardless of the initial distribution of the sample, i.e **the distribution of the mean of the samples is normal**. It is important that each sample is independent from the other.'), Document(metadata={'row': 4, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='This is powerful because it helps us study processes whose population distribution is unknown to us.'), Document(metadata={'row': 5, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: How do we check if a variable follows the normal distribution? \u200d⭐️\nanswer: 1. Plot a histogram out of the sampled data. If you can fit the bell-shaped "normal" curve to the histogram, then the hypothesis that the underlying random variable follows the normal distribution can not be rejected.\n2. Check Skewness and Kurtosis of the sampled data. Skewness = 0 and kurtosis = 3 are typical for a normal distribution, so the farther away they are from these values, the more non-normal the distribution.\n3. Use Kolmogorov-Smirnov or/and Shapiro-Wilk tests for normality. 
They take into account both Skewness and Kurtosis simultaneously.'), Document(metadata={'row': 9, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What is the normal equation? \u200d⭐️\nanswer: Normal equations are equations obtained by setting equal to zero the partial derivatives of the sum of squared errors (least squares); normal equations allow one to estimate the parameters of a multiple linear regression.')]

Let's also try some "fresh" questions.

In [24]:
answer = rag_system.query("Please explain bias and variance?")
display(Markdown(colorize_text(answer)))



**<font color='blue'>Question:</font>**
Please explain bias and variance?

**<font color='magenta'>Prompt:</font>**

        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: Please explain bias and variance?
        Context: question: What’s the interpretation of the bias term in linear models? ‍⭐️
answer: Bias is simply, a difference between predicted value and actual/true value. It can be interpreted as the distance from the average prediction and true value i.e. true value minus mean(predictions). But dont get confused between accuracy and bias.question: What is the bias-variance trade-off? 👶
answer: **Bias** is the error introduced by approximating the true underlying function, which can be quite complex, by a s
        Answer:
        

**<font color='red'>Answer:</font>**

        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: Please explain bias and variance?
        Context: question: What’s the interpretation of the bias term in linear models? ‍⭐️
answer: Bias is simply, a difference between predicted value and actual/true value. It can be interpreted as the distance from the average prediction and true value i.e. true value minus mean(predictions). But dont get confused between accuracy and bias.question: What is the bias-variance trade-off? 👶
answer: **Bias** is the error introduced by approximating the true underlying function, which can be quite complex, by a s
        Answer:
        Sure. Here's a breakdown of bias and variance:

**Bias:** Bias is the difference between the predicted value and the actual/true value. It can be interpreted as the distance from the average prediction and true value.

**Variance:** Variance is a measure of how much the predicted value varies from one sample to another. It can be interpreted as the average of the squared differences between the predicted value and the actual/true value.

The bias-variance trade-off is a trade-off between bias and variance. The goal of any machine learning algorithm is to achieve a good balance between bias and variance. A low bias means that the model is too simple and does not capture the underlying patterns in the data. A high bias means that the model is too complex and overfits the data. A high variance means that the model is more sensitive to noise in the data. A low variance means that the model is very robust and does not change much with noise in the data.

**<font color='green'>Context:</font>**
[Document(metadata={'row': 50, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What’s the interpretation of the bias term in linear models? \u200d⭐️\nanswer: Bias is simply, a difference between predicted value and actual/true value. It can be interpreted as the distance from the average prediction and true value i.e. true value minus mean(predictions). But dont get confused between accuracy and bias.'), Document(metadata={'row': 13, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What is the bias-variance trade-off? 👶\nanswer: **Bias** is the error introduced by approximating the true underlying function, which can be quite complex, by a simpler model. **Variance** is a model sensitivity to changes in the training dataset.\n\n**Bias-variance trade-off** is a relationship between the expected test error and the variance and the bias - both contribute to the level of the test error and ideally should be as small as possible:\n\n```\nExpectedTestError = Variance + Bias² + IrreducibleError\n```\n\nBut as a model complexity increases, the bias decreases and the variance increases which leads to *overfitting*. And vice versa, model simplification helps to decrease the variance but it increases the bias which leads to *underfitting*.'), Document(metadata={'row': 4, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What’s the normal distribution? Why do we care about it? 👶\nanswer: The normal distribution is a continuous probability distribution whose probability density function takes the following formula:\n\n![formula](https://mathworld.wolfram.com/images/equations/NormalDistribution/NumberedEquation1.gif)\n\nwhere μ is the mean and σ is the standard deviation of the distribution.\n\nThe normal distribution derives its importance from the **Central Limit Theorem**, which states that if we draw a large enough number of samples, their mean will follow a normal distribution regardless of the initial distribution of the sample, i.e **the distribution of the mean of the samples is normal**. It is important that each sample is independent from the other.'), Document(metadata={'row': 89, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='Assigning random values to weights is better than just 0 assignment. \n* a) If weights are initialized with very high values the term np.dot(W,X)+b becomes significantly higher and if an activation function like sigmoid() is applied, the function maps its value near to 1 where the slope of gradient changes slowly and learning takes a lot of time.\n* b) If weights are initialized with low values it gets mapped to 0, where the case is the same as above. This problem is often referred to as the vanishing gradient.')]

In [25]:
answer = rag_system.query("What is a Dropout?")
display(Markdown(colorize_text(answer)))



**<font color='blue'>Question:</font>**
What is a Dropout?

**<font color='magenta'>Prompt:</font>**

        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: What is a Dropout?
        Context: question: What is dropout? Why is it useful? How does it work? ‍⭐️
answer: Dropout is a technique that at each training step turns off each neuron with a certain probability of *p*. This way at each iteration we train only *1-p* of neurons, which forces the network not to rely only on the subset of neurons for feature representation. This leads to regularizing effects that are controlled by the hyperparameter *p*.question: What’s pooling in CNN? Why do we need it? ‍⭐️
answer: Pooling is a techni
        Answer:
        

**<font color='red'>Answer:</font>**

        You are an AI Agent specialized to answer to questions about Data Science.
        Explain the concept or answer the question about Data Science.
        In order to create the answer, please only use the information from the
        context provided (Context). Do not include other information.
        Answer with simple words.
        If needed, include also explanations.
        Question: What is a Dropout?
        Context: question: What is dropout? Why is it useful? How does it work? ‍⭐️
answer: Dropout is a technique that at each training step turns off each neuron with a certain probability of *p*. This way at each iteration we train only *1-p* of neurons, which forces the network not to rely only on the subset of neurons for feature representation. This leads to regularizing effects that are controlled by the hyperparameter *p*.question: What’s pooling in CNN? Why do we need it? ‍⭐️
answer: Pooling is a techni
        Answer:
        Sure, here's a simplified explanation of the concept of Dropout and its importance in Data Science:

**Dropout**:

* Dropout is a technique used in machine learning algorithms to prevent overfitting by randomly dropping out neurons during training.
* It involves setting a probability, usually *p*, for each neuron to be dropped out at each training step.
* This means that only *1-p* neurons are trained at any given iteration.
* By doing this, the network is forced to rely on a different subset of neurons for feature representation, which helps to reduce overfitting.

**Importance of Dropout**:

* Dropout is important because it helps to:
    * Reduce overfitting by preventing the network from over-learning from the training data.
    * Improve generalization performance by forcing the network to learn from a more diverse set of features.
    * Control the complexity of the model by adjusting the value of *p*.

**Pooling in CNN**:

* Pooling is a technique used in convolutional neural networks (CNNs) to reduce the dimensionality of feature maps by taking a subset of the input features.
* It involves computing a summary statistic (mean, median, or min-max) of the selected features.
* Pooling

**<font color='green'>Context:</font>**
[Document(metadata={'row': 92, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What is dropout? Why is it useful? How does it work? \u200d⭐️\nanswer: Dropout is a technique that at each training step turns off each neuron with a certain probability of *p*. This way at each iteration we train only *1-p* of neurons, which forces the network not to rely only on the subset of neurons for feature representation. This leads to regularizing effects that are controlled by the hyperparameter *p*.'), Document(metadata={'row': 108, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What’s pooling in CNN? Why do we need it? \u200d⭐️\nanswer: Pooling is a technique to downsample the feature map. It allows layers which receive relatively undistorted versions of the input to learn low level features such as lines, while layers deeper in the model can learn more abstract features such as texture.'), Document(metadata={'row': 21, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What is logistic regression? When do we need to use it? 👶\nanswer: Logistic regression is a Machine Learning algorithm that is used for binary classification. You should use logistic regression when your Y variable takes only two values, e.g. True and False, "spam" and "not spam", "churn" and "not churn" and so on. The variable is said to be a "binary" or "dichotomous".'), Document(metadata={'row': 54, 'source': '/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv'}, page_content='question: What is feature selection? Why do we need it? 👶\nanswer: Feature Selection is a method used to select the relevant features for the model to train on. We need feature selection to remove the irrelevant features which leads the model to under-perform.')]

# Conclusions

We tested a RAG system built with Gemma as the LLM, Langchain for data-loading utilities, and ChromaDB as the vector database.
The RAG system is initialized with a dataset, which is used to populate the vector database, and with an AI Agent, which queries Gemma with the initial query and the retrieved context.
To verify that the answer is based on the provided context, we also include the retrieved context in the returned result.
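
The flow summarized above (index the dataset, retrieve similar documents, prompt the LLM, and return the question, answer and context together) can be sketched in plain Python. This is only an illustrative toy, not the notebook's implementation: `ToyAgent`, `ToyRAG`, `embed`, `cosine`, `stub_llm` and `PROMPT_TEMPLATE` are hypothetical names, the bag-of-words similarity stands in for the embeddings stored in ChromaDB, and `stub_llm` stands in for the call to Gemma.

```python
import math
import re
from collections import Counter

# Hypothetical prompt template, loosely modeled on the prompt shown in the outputs.
PROMPT_TEMPLATE = "Question: {query}\nContext: {context}\nAnswer:"


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def stub_llm(prompt):
    """Stand-in for the LLM call; a real agent would send the prompt to Gemma."""
    return f"stub answer generated from a prompt of {len(prompt)} characters"


class ToyAgent:
    """Builds the prompt from query and context, then delegates to the LLM callable."""

    def __init__(self, llm):
        self.llm = llm

    def generate(self, query, context):
        prompt = PROMPT_TEMPLATE.format(query=query, context=context)
        return self.llm(prompt)


class ToyRAG:
    """Indexes documents at construction time and answers queries with retrieved context."""

    def __init__(self, documents, agent):
        self.agent = agent
        # "Ingestion": store each document next to its toy embedding.
        self.index = [(doc, embed(doc)) for doc in documents]

    def retrieve(self, query, k=2):
        """Return the k documents most similar to the query."""
        q = embed(query)
        ranked = sorted(self.index, key=lambda item: cosine(q, item[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]

    def query(self, question, k=2):
        """Retrieve context, generate an answer, and compose the overall response."""
        docs = self.retrieve(question, k)
        answer = self.agent.generate(question, " ".join(docs))
        return f"Question: {question}\n\nAnswer: {answer}\n\nContext: {docs}"


documents = [
    "Dropout turns off each neuron with probability p during training.",
    "Pooling downsamples the feature map in a CNN.",
    "Logistic regression handles binary classification.",
]
rag = ToyRAG(documents, ToyAgent(stub_llm))
print(rag.query("What is dropout?", k=1))
```

Returning the retrieved documents alongside the answer, as in `ToyRAG.query`, makes it possible to check by eye whether the answer is grounded in the context, which is the purpose of the green "Context" block in the outputs above.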
