# Evaluating a plain LLM against an LLM with RAG

In [4]:
!pip install langchain
!pip install langchain-core
!pip install langchain-community
!pip install langchain-pinecone
!pip install sentence-transformers
!pip install groq

Collecting groq
  Downloading groq-0.11.0-py3-none-any.whl.metadata (13 kB)
Downloading groq-0.11.0-py3-none-any.whl (106 kB)
Installing collected packages: groq
Successfully installed groq-0.11.0


### Setting up questions to evaluate

In [1]:
questions = [
    "Can you explain what a Bidirectional LSTM layer is in Keras?",
    "How do I use the Keras Sequential model for building neural networks?",
    "What are the key differences between the Dense and Conv2D layers in Keras?",
    "How do you implement dropout in a Keras model to prevent overfitting?",
    "Explain me why and how to use the EarlyStopping callback in Keras",
    "Explain me how to build a LSTM model in Keras",
    "How to use the ResNet model from keras applications for image classification?",
    "Explain me the parameters used in the Conv2D layer in Keras",
    "Explain me about the MNIST digits classification dataset",
    "What is the purpose of using the Embedding layer in Keras?",
    "Give me code for loading images using ImageDataGenerator in Keras",
    "How do you perform transfer learning using a pre-trained model in Keras?",
    "Can you explain how to save and load a Keras model?",
    "How do I install Keras on my system?",
    "What are the key features of Keras?",
    "What is a Keras model and how do I create one?",
    "How do I add a Dense layer to a Keras model?",
    "What is the purpose of the Dropout layer in Keras?",
    "How do I create a custom layer in Keras?",
    "What is the use of Conv2D in Keras for building CNNs?",
    "How do I use LSTM layers in a Sequential model?",
    "How do I compile a Keras model?",
    "How do I implement early stopping during model training in Keras?",
    "Give me a code example of creating a simple Keras neural network.",
    "Show me how to create a CNN using Keras.",
    "How can I implement an RNN using Keras?",
    "Provide code for training a Keras model with a custom dataset.",
    "How can I use transfer learning with Keras?",
    "What is the difference between model.fit() and model.fit_generator() in Keras?",
    "What optimizers can I use in Keras, and which one is best for my problem?",
    "How do I preprocess image data for training with Keras?",
    "How can I use data augmentation in Keras to improve model performance?",
    "Show me a code example of using Keras ImageDataGenerator.",
]


### Gemini and vector store setup

In [5]:
#setting up api keys
import google.generativeai as genai
from google.colab import userdata
GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
pinecone_api_key = userdata.get('pinecone_api_key')
hf_token = userdata.get('HF_TOKEN')
groq_api_key = userdata.get('groq_api_key')
genai.configure(api_key=GEMINI_API_KEY)

In [6]:
from pinecone import Pinecone, ServerlessSpec
import time
# from config import VECTOR_DIMENSION
VECTOR_DIMENSION = 384  # all-MiniLM-L6-v2 produces 384-dimensional vectors; use 768 for nomic embeddings
class PineconeManager:
    def __init__(self, api_key: str, index_name: str):
        self.pc = Pinecone(api_key=api_key)
        self.index_name = index_name
        self.index = None
        self.initialize_index()  # create the index if needed, then connect to it

    def initialize_index(self):
        # list_indexes().names() returns the names of all existing indexes
        if self.index_name not in self.pc.list_indexes().names():
            print(f"Creating index: {self.index_name}")
            self.pc.create_index(
                name=self.index_name,
                dimension=VECTOR_DIMENSION,
                metric="cosine",
                spec=ServerlessSpec(
                    cloud="aws",
                    region="us-east-1"
                )
            )
        else:
            print(f"Index {self.index_name} already exists")

        while not self.pc.describe_index(self.index_name).status['ready']:
            time.sleep(1)

        self.index = self.pc.Index(self.index_name)
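
The readiness loop in `initialize_index` polls with no upper bound, so it would spin forever if the index never became ready. A minimal sketch of a bounded wait follows; `wait_until_ready` and its parameters are hypothetical names, not part of the Pinecone client.

```python
import time


def wait_until_ready(is_ready, timeout_s=120.0, poll_interval_s=1.0):
    """Poll is_ready() until it returns True, or raise after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s} seconds")
        time.sleep(poll_interval_s)


# Assumed use inside PineconeManager.initialize_index, replacing the while loop:
# wait_until_ready(lambda: self.pc.describe_index(self.index_name).status['ready'])
```

The callable keeps the helper independent of Pinecone, so the same function can guard any other polling step in the notebook.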


In [7]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
INDEX_NAME = 'dlprojectchecknomic'
# PineconeManager.__init__ already calls initialize_index(), so no second call is needed
pinecone_manager = PineconeManager(pinecone_api_key, INDEX_NAME)

Index dlprojectchecknomic already exists


In [8]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = PineconeVectorStore(index=pinecone_manager.index, embedding=embeddings)
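
Because the index dimension is fixed when the index is created, it may be worth confirming that the embedding model's output length matches `VECTOR_DIMENSION` before querying. This is a sketch only: `check_embedding_dimension` is a hypothetical helper, and it assumes it is handed a function that maps a string to a vector, such as `embeddings.embed_query`.

```python
def check_embedding_dimension(embed_query, expected_dim):
    """Embed a short probe string and compare the vector length to expected_dim."""
    actual_dim = len(embed_query("dimension probe"))
    if actual_dim != expected_dim:
        raise ValueError(
            f"embedding dimension {actual_dim} does not match index dimension {expected_dim}"
        )
    return actual_dim


# Assumed use in this notebook:
# check_embedding_dimension(embeddings.embed_query, VECTOR_DIMENSION)
```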


### Evaluating the fine-tuned model with RAG vs. Gemini, with verification by Mixtral

In [9]:
from groq import Groq
client = Groq(
    api_key = groq_api_key
)
def groq_mixtral_answer_generate(prompt):
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="mixtral-8x7b-32768",
    )
    return chat_completion.choices[0].message.content
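
The evaluation loop in this notebook spaces out API calls with random sleeps, but a single rate-limited call still drops the whole question. One option is to retry with exponential backoff. The sketch below is an assumption rather than what the notebook does; `call_with_backoff` is a hypothetical helper, and it catches every exception for brevity where real code would catch only the client's rate-limit error.

```python
import random
import time


def call_with_backoff(fn, *args, max_attempts=4, base_delay_s=2.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Re-raise once the final attempt has failed
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt: 2s, 4s, 8s, ... plus up to 1s of jitter
            sleep(base_delay_s * (2 ** attempt) + random.uniform(0, 1))


# Assumed use with the function defined above:
# verify_response = call_with_backoff(groq_mixtral_answer_generate, prompt)
```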

In [10]:
#main function to get and store the results
import time
from random import uniform


model = genai.GenerativeModel('gemini-pro')
tuned_model = genai.get_tuned_model('tunedModels/finetuninggemmafordl1-xxcubsl6ftaf')
fine_tuned_model = genai.GenerativeModel(model_name=tuned_model.name)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

rag_prompt = """You are an intelligent assistant designed to provide accurate and relevant information from Keras documentation.

        Here is the retrieved context, which may contain both explanatory text and meaningful code snippets:

        {context}

        Carefully analyze the above context, considering both the text and any provided code for clarity.

        Now, review the user's query:

        {question}

        Generate a detailed response that accurately addresses the query using the provided context. If the context includes relevant code, incorporate it into your response. Ensure that your answer is both clear and grounded in the provided content.

        Response:
        """

best_answer_prompt = """
You are an expert in evaluating responses based on technical documentation, specifically Keras. Your task is to compare two responses generated for a user's query regarding Keras.

Evaluate the two responses provided below using the following criteria:
1. **Relevance to Keras Documentation (0 to 1):** How closely does the response align with information from Keras documentation?
2. **Accuracy (0 to 1):** How correct is the response in addressing the user's query with factual information?
3. **Clarity (0 to 1):** How clear and easy to understand is the response?

Based on these criteria, calculate a total score for each response (maximum score of 3). Also, consider if the response uses any relevant code examples from Keras documentation.

**RAG Response:**
{rag_response}

**Normal Model Response:**
{normal_response}

Now, analyze and score each response. Provide reasoning for the scores, specifically noting if the response used Keras-specific information or included irrelevant details.

- **RAG Response Total Score (out of 3):**
- **Normal Model Response Total Score (out of 3):**
- **Better Response:** (Specify which one is better, RAG or Normal, and explain why)

Provide detailed reasoning for your choice, highlighting which response was more grounded in the Keras documentation and which one provided a clearer, more accurate answer to the user's query.

Reasoning:
"""


def get_and_store_results(questions,delay_range=(3, 10)):

    results = []

    for question in questions:
        try:
            #answer from the rag
            print(f"Processing question: {question} \n")

            print("Retrieving context... \n")
            # invoke() replaces the deprecated get_relevant_documents()
            context_from_pinecone = retriever.invoke(question)
            context = "\n\n".join(doc.page_content for doc in context_from_pinecone)

            time.sleep(uniform(delay_range[0], delay_range[1]))

            print("Generating rag response... \n")
            rag_prompt_formatted = rag_prompt.format(context=context, question=question)
            rag_result = fine_tuned_model.generate_content(rag_prompt_formatted)
            rag_response = rag_result.text

            time.sleep(uniform(delay_range[0], delay_range[1]))

            print("Generating normal response... \n")
            #getting answer from the normal model
            normal_result = model.generate_content(question)
            normal_response = normal_result.text

            time.sleep(uniform(delay_range[0], delay_range[1]))

            print("Verifying the best answer... \n")
            #verifying the best answer
            best_answer_prompt_formatted = best_answer_prompt.format(rag_response=rag_response, normal_response=normal_response)
            verify_response = groq_mixtral_answer_generate(best_answer_prompt_formatted)

            print("Done! for question : " + question + " appending results \n")
            #appending results
            results.append({
                'question': question,
                'context retrived': context,
                'rag_response': rag_response,
                'normal_response': normal_response,
                'verify_response': verify_response
            })

            time.sleep(uniform(delay_range[0], delay_range[1]))
        except Exception as e:
            print(f"Error processing question: {question}")
            print(f"Error message: {str(e)}")
            results.append(None)

    return results
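
The judge's reply is stored as free text in `verify_response`, so comparing the two systems across all questions requires pulling the totals out of it. The sketch below assumes the judge repeats the labels from `best_answer_prompt` and writes the number on the same line, as in `Total Score (out of 3): 2.8`; replies in any other layout return `None` and need manual review. `parse_scores` and the keys `rag` and `normal` are hypothetical names.

```python
import re

# Labels copied from best_answer_prompt; the short keys are illustrative
SCORE_LABELS = {"rag": "RAG Response", "normal": "Normal Model Response"}


def parse_scores(verify_response):
    """Extract the two total scores from a judge reply, or None where absent."""
    scores = {}
    for key, label in SCORE_LABELS.items():
        # Match e.g. "RAG Response Total Score (out of 3): 2.8" on a single line
        pattern = re.escape(label) + r" Total Score \(out of 3\):\**[ \t]*(\d+(?:\.\d+)?)"
        match = re.search(pattern, verify_response)
        scores[key] = float(match.group(1)) if match else None
    return scores
```

Returning `None` instead of guessing keeps unparsed replies visible, so they can be counted separately rather than silently skewing an average.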


In [11]:
answers = get_and_store_results(questions,delay_range=(3, 10))

Processing question: Can you explain what a Bidirectional LSTM layer is in Keras? 

Retrieving context... 





Generating rag response... 

Generating normal response... 

Verifying the best answer... 

Done! for question : Can you explain what a Bidirectional LSTM layer is in Keras? appending results 

Processing question: How do I use the Keras Sequential model for building neural networks? 

Retrieving context... 

Generating rag response... 

Generating normal response... 

Verifying the best answer... 

Done! for question : How do I use the Keras Sequential model for building neural networks? appending results 

Processing question: What are the key differences between the Dense and Conv2D layers in Keras? 

Retrieving context... 

Generating rag response... 

Generating normal response... 

Verifying the best answer... 

Done! for question : What are the key differences between the Dense and Conv2D layers in Keras? appending results 

Processing question: How do you implement dropout in a Keras model to prevent overfitting? 

Retrieving context... 

Generating rag response... 

Generating

In [13]:
import pandas as pd

# Function to filter out None values and save the results to a CSV file
def save_results_to_csv(answers, file_name="results.csv"):
    # Filter out None values from the list
    filtered_answers = [answer for answer in answers if answer is not None]

    # Create a DataFrame from the filtered list of dictionaries
    df = pd.DataFrame(filtered_answers)

    # Save the DataFrame to a CSV file
    df.to_csv(file_name, index=False)

    print(f"Results successfully saved to {file_name}")

# Call the function to save your results
save_results_to_csv(answers, file_name="answers_results.csv")


Results successfully saved to answers_results.csv


In [14]:
df = pd.read_csv('answers_results.csv')
df.head(10)

Unnamed: 0,question,context retrived,rag_response,normal_response,verify_response
0,Can you explain what a Bidirectional LSTM laye...,3D Convolutional LSTM.\nSimilar to an LSTM lay...,A Bidirectional LSTM layer in Keras is a speci...,**Bidirectional Long Short-Term Memory (BiLSTM...,- **RAG Response Total Score (out of 3):**\n\n...
1,How do I use the Keras Sequential model for bu...,You can create a Sequential model by passing a...,The Keras Sequential model is a straightforwar...,**Importing the Necessary Libraries:**\n\n```p...,- **RAG Response Total Score (out of 3):**\n\n...
2,What are the key differences between the Dense...,2D transposed convolution layer.\nThe need for...,"The Dense layer is a fully connected layer, me...",**Dense Layer (Fully Connected Layer):**\n\n* ...,- **RAG Response Total Score (out of 3):**\n\n...
3,How do you implement dropout in a Keras model ...,Applies dropout to the input.\nTheDropoutlayer...,"To implement dropout in a Keras model, you can...",```python\nimport tensorflow as tf\n\n# Create...,- **RAG Response Total Score (out of 3):**\n\n...
4,Explain me why and how to use the EarlyStoppin...,"Authors:Rick Chao, Francois CholletDate create...",The EarlyStopping callback in Keras is used to...,**Why Use EarlyStopping?**\n\nEarlyStopping is...,- **RAG Response Total Score (out of 3): 2.8**...
5,Explain me how to build a LSTM model in Keras,There are three ways to create Keras models:\n...,"To build an LSTM model in Keras, you can use t...",**Step 1: Import necessary libraries**\n\n```p...,- **RAG Response Total Score (out of 3):**\n\n...
6,How to use the ResNet model from keras applica...,Instantiates the ResNet50V2 architecture.\nRef...,For image classification using the ResNet mode...,**1. Import the necessary libraries:**\n\n```p...,- **RAG Response Total Score (out of 3):**\n\n...
7,Explain me the parameters used in the Conv2D l...,2D Convolutional LSTM.\nSimilar to an LSTM lay...,The Conv2D layer in Keras allows you to perfor...,**Parameters of the Conv2D Layer in Keras**\n\...,- **RAG Response Total Score (out of 3):**\n\n...
8,Explain me about the MNIST digits classificati...,Loads the MNIST dataset.\nThis is a dataset of...,"The MNIST dataset is a collection of 70,000 gr...",**MNIST (Modified National Institute of Standa...,- **RAG Response Total Score (out of 3):**\n\n...
9,What is the purpose of using the Embedding lay...,"In the mixed dimension embedding technique, we...",The Embedding layer in Keras is used to conver...,The purpose of using the Embedding layer in Ke...,- **RAG Response Total Score (out of 3):**\n\n...


In [15]:
# Print the first 10 rows in full
for _, row in df.head(10).iterrows():
    print(f"Question: {row['question']} \n")
    print(f"Context Retrieved: {row['context retrived']} \n")
    print(f"RAG Response: {row['rag_response']} \n")
    print(f"Normal Response: {row['normal_response']} \n")
    print(f"Verify Response: {row['verify_response']} \n")
    print("\n")

Question: Can you explain what a Bidirectional LSTM layer is in Keras? 

Context Retrieved: 3D Convolutional LSTM.
Similar to an LSTM layer, but the input transformations
and recurrent transformations are both convolutional.
Arguments
Call arguments
Input shape
Output shape
References


Code:
keras_core.layers.ConvLSTM3D(filters,kernel_size,strides=1,padding="valid",data_format=None,dilation_rate=1,activation="tanh",recurrent_activation="sigmoid",use_bias=True,kernel_initializer="glorot_uniform",recurrent_initializer="orthogonal",bias_initializer="zeros",unit_forget_bias=True,kernel_regularizer=None,recurrent_regularizer=None,bias_regularizer=None,activity_regularizer=None,kernel_constraint=None,recurrent_constraint=None,bias_constraint=None,dropout=0.0,recurrent_dropout=0.0,seed=None,return_sequences=False,return_state=False,go_backwards=False,stateful=False,**kwargs)



Bidirectional wrapper for RNNs.
Arguments
Call arguments
The call arguments for this layer are the same as those of

# Conclusion

### 1. **Fine-tuning an LLM vs. Fine-tuning Other Models:**
   For the KerasInsight project, fine-tuning the LLM itself is the better focus. The core task is extracting relevant information from a large documentation corpus and answering complex queries about it, which is the kind of text understanding and generation LLMs are suited for. Fine-tuning other models would matter only if a specific component, such as search ranking, needed optimization; in this case the LLM should remain the primary focus.

### 2. **Evaluation Metrics:**
   The **RAG triad** (Answer Relevance, Context Relevance, Groundedness) is the most suitable choice for this task. Metrics such as ROUGE and BLEU score word overlap with a reference text and are better suited to summarization and translation. The RAG triad instead asks whether the retrieved context is relevant to the query, whether the answer is supported by that context, and whether the answer addresses the query. Since the goal is to provide answers grounded in the Keras documentation, these metrics are the best fit.

   Additionally, comparing **RAG + LLM** against the LLM alone shows whether retrieval actually improves response quality. Having a different LLM judge the responses (here, Mixtral scoring the Gemini outputs) adds robustness to the evaluation process.
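
As an illustration of how the three triad dimensions differ, each one can be posed to a judge model as a separate prompt that sees only the inputs it needs. This is a sketch under stated assumptions: the template wording, `TRIAD_TEMPLATES`, and `build_triad_prompts` are hypothetical and are not the prompts used in this notebook.

```python
# One judge prompt per triad dimension; each sees only the fields it compares
TRIAD_TEMPLATES = {
    "context_relevance": (
        "Question:\n{question}\n\nRetrieved context:\n{context}\n\n"
        "On a scale from 0 to 1, how relevant is the context to the question?"
    ),
    "groundedness": (
        "Retrieved context:\n{context}\n\nAnswer:\n{answer}\n\n"
        "On a scale from 0 to 1, how well is each claim in the answer supported by the context?"
    ),
    "answer_relevance": (
        "Question:\n{question}\n\nAnswer:\n{answer}\n\n"
        "On a scale from 0 to 1, how directly does the answer address the question?"
    ),
}


def build_triad_prompts(question, context, answer):
    """Return one formatted judge prompt per triad dimension."""
    return {
        name: template.format(question=question, context=context, answer=answer)
        for name, template in TRIAD_TEMPLATES.items()
    }
```

Each returned prompt could then be sent to the judge, for example through `groq_mixtral_answer_generate`, giving three separate scores per question instead of one combined total.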

### 3. **Use Case and Flexibility for Documentation Updates:**
   KerasInsight is designed for websites whose documentation changes frequently, which is a key strength. An LLM without retrieval is trained once and can miss content published afterwards, whereas this system scrapes the latest pages and builds a vector store from them for similarity search. Users therefore get up-to-date information, a significant advantage over models that rely on outdated datasets.

### 4. **Potential as a Chatbot for Documentation:**
   The project can easily evolve into a chatbot for Keras or other similar documentation platforms. Given that it dynamically fetches and embeds content, the system can be extended to a wide range of documentation-heavy use cases such as developer platforms, APIs, etc. The ability to adapt to changing documentation environments makes this project highly valuable for real-time information retrieval.

### 5. **Scope for Improvement:**
   While response generation has been good, retrieval and chunking can be improved. In the first example printed above, the top retrieved chunk for the Bidirectional LSTM question was the `ConvLSTM3D` reference rather than the Bidirectional wrapper. Using more sophisticated embedding models and experimenting with different chunking strategies can improve the accuracy and relevance of the retrieved information. Future work should focus on these aspects to handle more complex queries and improve response precision.
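
As a baseline for such chunking experiments, a fixed-size character chunker with overlap is a common place to start. The sketch below is not the chunking used to build the existing index, which this notebook does not show; `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters, each sharing
    `overlap` characters with the previous chunk."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk
    last_start_bound = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start_bound, step)]
```

Varying `chunk_size` and `overlap` and re-running the same question set gives a simple way to compare strategies on retrieval quality.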

### Final Points:
   - **Fine-tuning the LLM** was the right choice for the KerasInsight project.
   - The **RAG triad metrics** are a well-suited and effective way to evaluate the model's performance.
   - Testing the effectiveness of **RAG + LLM** versus LLM alone, and verifying groundedness using another LLM, adds value to the evaluation process.
   - The **flexibility to handle dynamic documentation** makes the system superior to static LLMs for real-time tasks.
   - The project can be scaled into a chatbot solution for other documentation-heavy fields as well.
   - Further improvements in embedding models and chunking strategies will enhance retrieval accuracy and system robustness.