The dataset is available at: https://www.kaggle.com/datasets/hserdaraltan/1000-data-science-concepts/data
It can be downloaded with the Kaggle CLI: kaggle datasets download -d hserdaraltan/1000-data-science-concepts
For simplicity, I downloaded it manually and saved it as ./data/data.csv (read below as ../data/data.csv), because downloading with the kaggle command requires an API key.

In [1]:
import pandas as pd
from tqdm import tqdm
import minsearch

## Ingestion

In [2]:
df = pd.read_csv('../data/data.csv')

In [3]:
documents = df.to_dict(orient='records')

In [4]:
# For the project we limit the dataset to its first 200 records. For each of those 200 records we
# generated 3 questions (600 in total), saved in ground_truth_data.csv
documents = documents[:200]

In [6]:
def display_head_and_tail(dataset, n=1):
    """Display the first and last `n` entries of the dataset."""
    # Display first `n` entries (head)
    print(f"--- First {n} entries ---")
    for item in dataset[:n]:  # Slicing to get the first `n` items
        print(item)
    
    print(f"\n--- Last {n} entries ---")
    for item in dataset[-n:]:  # Slicing to get the last `n` items
        print(item)

display_head_and_tail(documents, 1)

--- First 1 entries ---
{'id': 1, 'Question': 'What is under-fitting and overfitting in machine learning?', 'Answer': "Underfitting is when a model is too simple, and overfitting is when it's too complex, making it perform poorly on new data."}

--- Last 1 entries ---
{'id': 200, 'Question': 'What is the difference between a generative and discriminative model?', 'Answer': 'Generative models learn data categories, while discriminative models learn category distinctions. Discriminative models generally outperform generative models in classification tasks.'}


# Not needed here: data.csv already includes an id column
def add_id_to_dataset(dataset):
    """Add 'id' to each item in the dataset, with 'id' being the first key."""
    for idx, item in enumerate(dataset, start=1):
        # Create a new dictionary with 'id' first, followed by the rest of the keys
        item = {'id': idx, **item}
        dataset[idx - 1] = item  # Update the dataset with the new order
    return dataset

# Add 'id' to each dictionary, with 'id' as the first key
documents = add_id_to_dataset(documents)

In [16]:
df

Unnamed: 0,id,Question,Answer
0,1,What is under-fitting and overfitting in machi...,"Underfitting is when a model is too simple, an..."
1,2,Can you explain what a false positive and a fa...,A false positive incorrectly indicates a condi...
2,3,Clarify the concept of Phase IV.,"Phase IV studies, also known as post-marketing..."
3,4,What is semi-supervised learning described in ...,Semi-supervised learning integrates both label...
4,5,Discuss the parallelization of training in gra...,Parallelizing training of a gradient boosting ...
...,...,...,...
1065,1066,Define the ACID property in SQL and its signif...,ACID principles maintain database integrity by...
1066,1067,What are the different types of data warehouses?,"Data warehouses vary by scope and function, wi..."
1067,1068,What are the key stages in a data mining project?,A data mining project starts with understandin...
1068,1069,What is information extraction?,Information extraction systematically identifi...


In [17]:
import minsearch

In [18]:
index = minsearch.Index(
    text_fields=['Question', 'Answer'],  # Fields where text-based searches will be performed
    keyword_fields=['id']  # If you have an 'id' or unique identifier for each entry
)


In [19]:
index.fit(documents)

<minsearch.Index at 0x73a294f0ba60>
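
To make the indexing step less of a black box, here is a minimal sketch of a text index built on TF-IDF weights and cosine similarity. It only illustrates the general technique, under the assumption that the search library ranks documents in a broadly similar way; `TinyIndex`, `tokenize` and the toy documents are hypothetical and not part of minsearch.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


class TinyIndex:
    """Hypothetical toy index: TF-IDF vectors compared with cosine similarity."""

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.docs = []
        self.idf = {}
        self.vectors = []

    def _vector(self, tokens):
        counts = Counter(tokens)
        return {t: c * self.idf.get(t, 0.0) for t, c in counts.items()}

    def fit(self, docs):
        self.docs = docs
        tokenized = [
            tokenize(" ".join(str(d[f]) for f in self.text_fields)) for d in docs
        ]
        doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
        n = len(docs)
        # Smoothed inverse document frequency
        self.idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
        self.vectors = [self._vector(tokens) for tokens in tokenized]
        return self

    def search(self, query, num_results=10):
        q = self._vector(tokenize(query))
        q_norm = math.sqrt(sum(v * v for v in q.values()))
        scored = []
        for doc, vec in zip(self.docs, self.vectors):
            dot = sum(w * vec.get(t, 0.0) for t, w in q.items())
            norm = math.sqrt(sum(v * v for v in vec.values())) * q_norm
            score = dot / norm if norm else 0.0
            if score > 0:
                scored.append((score, doc))
        # Sort on the score only, so ties never compare the documents themselves
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:num_results]]


toy_docs = [
    {"id": 1, "Question": "What is overfitting?", "Answer": "A model that is too complex."},
    {"id": 2, "Question": "What is a false positive?", "Answer": "A wrong positive result."},
]
toy_index = TinyIndex(text_fields=["Question", "Answer"]).fit(toy_docs)
print(toy_index.search("overfitting model"))
```

Documents that share no term with the query get a score of zero and are dropped, which is why a query made only of unknown words returns an empty list in this sketch.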

## RAG flow

In [20]:
from dotenv import load_dotenv
import os
# Activate the virtual environment first: run `source ./bin/activate` from the project folder
load_dotenv()  # Loads the variables from .env file

# Now you can access the API_KEY
api_key = os.getenv("OPENAI_API_KEY")
#print(api_key)


In [21]:
from openai import OpenAI

#client = OpenAI(api_key=api_key)

client = OpenAI()

In [22]:
def search(query):
    boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [23]:
prompt_template = """
Your role is to answer questions about data science based on a database of questions and answers about data science.
Use only the facts from the CONTEXT when answering the QUESTION, where the CONTEXT is our question and answer database, shown below.
If the context is not sufficient for giving a response, your response will be "Not enough context in questions and answers database".
Even if you have more information outside the context, do not use it and base your answer only on the context.
Write only the answer itself without any further comments.
Be precise and concise, using as few words as required to give a meaningful answer.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

entry_template = """
Question: {Question}
Answer: {Answer}

""".strip()

def build_prompt(query, search_results):
    context = ""
    
    for doc in search_results:
        context = context + entry_template.format(**doc) + "\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt
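
As a quick check of how the context is assembled, the sketch below formats two toy search results into a prompt. The `demo_` templates are shortened stand-ins that only assume the same placeholders as the notebook's templates, and the toy documents are made up for illustration.

```python
# Shortened stand-ins for the notebook's templates (assumption: same placeholders)
demo_prompt_template = """
QUESTION: {question}

CONTEXT:
{context}
""".strip()

demo_entry_template = """
Question: {Question}
Answer: {Answer}
""".strip()


def demo_build_prompt(query, search_results):
    # Format every search result and separate the entries with a blank line
    context = "\n\n".join(demo_entry_template.format(**doc) for doc in search_results)
    return demo_prompt_template.format(question=query, context=context)


demo_results = [
    {"id": 1, "Question": "What is overfitting?", "Answer": "A model that is too complex."},
    {"id": 2, "Question": "What is underfitting?", "Answer": "A model that is too simple."},
]
demo_prompt = demo_build_prompt("What is overfitting?", demo_results)
print(demo_prompt)
```

Extra keys such as `id` are ignored by `str.format`, so the search results can be passed to the entry template unchanged.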

In [24]:
model='llama3.1:8b'
model_ollama = 'llama3.1:8b'
model_openai = 'gpt-4o-mini'

In [25]:
def llm(prompt, model):
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

In [27]:
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',
)

In [28]:
def rag(query, model):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    #print(prompt)
    answer = llm(prompt, model=model)
    return answer

In [29]:
query = 'What tools are used in datascience?'

busqueda = search(query)
busqueda

[{'id': 183,
  'Question': 'What tools are used for training NLP models?',
  'Answer': 'Tools for training NLP models include NLTK for language processing tasks, spaCy for advanced NLP, and PyTorch-NLP for deep learning in NLP.'},
 {'id': 153,
  'Question': 'What are the tools of AI?',
  'Answer': 'AI tools range from libraries like Scikit Learn for machine learning to TensorFlow and Keras for deep learning, providing environments for developing AI models.'},
 {'id': 126,
  'Question': 'What are the common ETL tools used during data warehousing activities?',
  'Answer': 'In data warehousing, popular ETL (Extract, Transform, Load) tools include Informatica for enterprise data integration, Talend for data management and integration, Ab Initio for handling large data volumes, Oracle Data Integrator for combining with Oracle databases, Skyvia for cloud data integration, SSIS for SQL Server integration, Pentaho for business analytics, and Xplenty for ETL processes in the cloud.'},
 {'id': 5

In [31]:
resultado = rag(query,model_ollama)

In [32]:
resultado

'NLTK, spaCy, PyTorch-NLP, Scikit Learn, TensorFlow, Keras, Informatica, Talend, Ab Initio, Oracle Data Integrator, Skyvia, SSIS, Pentaho, Xplenty, Pandas, NumPy, SciPy, Sklearn, and TensorFlow.'

In [None]:
# Already generated in the notebook for evaluation with llama3.1; kept commented out for reference
#def generate_questions(doc):
#    prompt = prompt_template.format(**doc)
#
#    response = client.chat.completions.create(
#        model='gpt-4o-mini',
#        messages=[{"role": "user", "content": prompt}]
#    )
#
#    json_response = response.choices[0].message.content
#    return json_response

## Retrieval evaluation

In [33]:
#Data already in a csv, created in notebook evaluation_data_generation

In [34]:
df_question = pd.read_csv('../data/ground-truth-retrieval.csv')

In [35]:
df_question.head()

Unnamed: 0,id,question
0,1,What happens when a machine learning model is ...
1,1,How does overfitting affect a model's performa...
2,1,"Can a model be too complex, and if so, what ar..."
3,2,What are the implications of a test result tha...
4,2,How does a false indication of a condition's a...


In [36]:
ground_truth = df_question.to_dict(orient='records')

In [37]:
ground_truth[0]

{'id': 1,
 'question': 'What happens when a machine learning model is too simple?'}

In [38]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                # Only the first relevant result counts towards the reciprocal rank
                total_score = total_score + 1 / (rank + 1)
                break

    return total_score / len(relevance_total)
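
A small worked example may help to read the two metrics. The functions are restated compactly so the block runs on its own (assuming each query has at most one relevant document), and `toy_relevance` is made-up data.

```python
def hit_rate(relevance_total):
    # Fraction of queries with at least one relevant result
    return sum(1 for line in relevance_total if True in line) / len(relevance_total)


def mrr(relevance_total):
    # Mean of 1 / rank of the first relevant result (0 when there is none)
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                total += 1 / (rank + 1)
                break
    return total / len(relevance_total)


toy_relevance = [
    [False, False, True],   # relevant document at rank 3 -> 1/3
    [False, False, False],  # no relevant document        -> 0
    [True, False, False],   # relevant document at rank 1 -> 1
]
print(hit_rate(toy_relevance))  # 2 of the 3 queries have a hit
print(mrr(toy_relevance))       # (1/3 + 0 + 1) / 3
```

Hit rate only asks whether the right document appears anywhere in the results, while MRR also rewards placing it near the top.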

In [39]:
def minsearch_search(query):
    boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [40]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['id']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
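
The relevance lists that feed the metrics can be illustrated without a real index. This sketch uses a hypothetical `fake_search` with canned results in place of the actual search function; the questions and ids are invented.

```python
toy_ground_truth = [
    {"id": 1, "question": "What happens when a model is too simple?"},
    {"id": 2, "question": "What is a false alarm in a test?"},
]

# Hypothetical canned results standing in for a real search function
canned_results = {
    "What happens when a model is too simple?": [{"id": 7}, {"id": 1}],
    "What is a false alarm in a test?": [{"id": 3}, {"id": 9}],
}


def fake_search(q):
    return canned_results[q["question"]]


def relevance_lists(ground_truth, search_function):
    # One list of booleans per question: True where the expected document was returned
    relevance_total = []
    for q in ground_truth:
        results = search_function(q)
        relevance_total.append([d["id"] == q["id"] for d in results])
    return relevance_total


toy_relevance_total = relevance_lists(toy_ground_truth, fake_search)
print(toy_relevance_total)
```

The first question finds its document at rank 2 and the second one misses it entirely, so the lists contain a single `True`.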

In [41]:
from tqdm.auto import tqdm

In [42]:
evaluate(ground_truth, lambda q: minsearch_search(q['question']))

100%|███████████████████████████████████████| 600/600 [00:00<00:00, 1168.06it/s]


{'hit_rate': 0.95, 'mrr': 0.8056335978835977}

## Finding the best parameters

In [43]:
df_validation = df_question[:100]
df_test = df_question[100:]

In [44]:
df_validation

Unnamed: 0,id,question
0,1,What happens when a machine learning model is ...
1,1,How does overfitting affect a model's performa...
2,1,"Can a model be too complex, and if so, what ar..."
3,2,What are the implications of a test result tha...
4,2,How does a false indication of a condition's a...
...,...,...
95,32,How do stakeholders determine the order of ful...
96,33,What techniques can reduce overfitting in SVM ...
97,33,How does dimensionality reduction impact compu...
98,33,Can high-dimensional datasets benefit from pre...


In [45]:
import random

def simple_optimize(param_ranges, objective_function, n_iterations=10):
    best_params = None
    best_score = float('-inf')  # We are maximizing the score, so start from -inf.

    for _ in range(n_iterations):
        # Generate random parameters
        current_params = {}
        for param, (min_val, max_val) in param_ranges.items():
            if isinstance(min_val, int) and isinstance(max_val, int):
                current_params[param] = random.randint(min_val, max_val)
            else:
                current_params[param] = random.uniform(min_val, max_val)
        
        # Evaluate the objective function
        current_score = objective_function(current_params)
        
        # Update best if current is better
        if current_score > best_score:  # Higher is better because we are maximizing
            best_score = current_score
            best_params = current_params
    
    return best_params, best_score
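
To see random search at work without running the retrieval evaluation, the sketch below maximizes a toy objective whose optimum is known. `random_search` is a hypothetical compact variant (float ranges only, with a `seed` argument added for repeatability), and `toy_objective` is invented for the example.

```python
import random


def random_search(param_ranges, objective_function, n_iterations=10, seed=None):
    """Hypothetical compact variant of random search that maximizes the objective."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iterations):
        # Draw every parameter uniformly from its (min, max) range
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in param_ranges.items()}
        score = objective_function(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Peaks at Question=1.0, Answer=2.0, where its value is 0
    return -((params["Question"] - 1.0) ** 2) - (params["Answer"] - 2.0) ** 2


toy_ranges = {"Question": (0.0, 3.0), "Answer": (0.0, 3.0)}
toy_best_params, toy_best_score = random_search(
    toy_ranges, toy_objective, n_iterations=200, seed=0
)
print(toy_best_params, toy_best_score)
```

Because the samples are random, two runs without a fixed seed can return different parameters with similar scores.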

In [46]:
gt_val = df_validation.to_dict(orient='records')

In [47]:
def minsearch_search(query, boost=None):
    if boost is None:
        boost = {}

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

In [48]:
param_ranges = {
    'Question': (0.0, 3.0),
    'Answer': (0.0, 3.0),
}

def objective(boost_params):
    def search_function(q):
        return minsearch_search(q['question'], boost_params)

    results = evaluate(gt_val, search_function)
    return results['mrr']

In [49]:
simple_optimize(param_ranges, objective, n_iterations=20)

100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1093.63it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1146.83it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1147.29it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1103.09it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1199.06it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1204.16it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1209.46it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1175.53it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1179.29it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1182.28it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1186.64it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1183.05it/s]
100%|███████████████████████

({'Question': 0.9712610571824632, 'Answer': 2.4119285092818314},
 0.9043452380952379)

In [50]:
best_params, best_score = simple_optimize(param_ranges, objective, n_iterations=20)

100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1131.42it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1176.52it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1145.97it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1128.54it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1038.12it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1170.14it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1198.11it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1205.03it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1187.02it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1192.51it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1191.73it/s]
100%|███████████████████████████████████████| 100/100 [00:00<00:00, 1189.23it/s]
100%|███████████████████████

In [51]:
print(best_params, best_score)

{'Question': 0.5579461443001708, 'Answer': 1.4578273881638246} 0.9043452380952379


In [52]:
def minsearch_improved(query, best_params):
    boost = {
        'Question': best_params['Question'],
        'Answer': best_params['Answer']
    }

    results = index.search(
        query=query,
        filter_dict={},
        boost_dict=boost,
        num_results=10
    )

    return results

# Note: the evaluation below uses the full ground truth, which includes the 100 validation questions used for tuning

In [53]:
evaluate(ground_truth, lambda q: minsearch_improved(q['question'],best_params))

100%|███████████████████████████████████████| 600/600 [00:00<00:00, 1189.82it/s]


{'hit_rate': 0.985, 'mrr': 0.9059900793650794}

## RAG evaluation

In [54]:
prompt2_template = """
You have to evaluate a Retrieval Augmented Generation System (RAG).
Your task consists in analyzing the relevance of the answer generated by a large language model (llm) to a given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks with this structure:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}

 The final answer must be just the json object without any further comments before or after.

""".strip()

In [55]:
len(ground_truth)

600

In [57]:
ground_truth[2]

{'id': 1,
 'question': 'Can a model be too complex, and if so, what are the consequences?'}

In [61]:
question=ground_truth[2]['question']

In [63]:
question

'Can a model be too complex, and if so, what are the consequences?'

In [68]:
answer_llm=rag(question,model_ollama)

In [69]:
answer_llm

'Yes, a model can be too complex, leading to overfitting. Consequences include poor performance on new data.'

In [70]:
prompt = prompt2_template.format(question=question, answer_llm=answer_llm)
print(prompt)

You have to evaluate a Retrieval Augmented Generation System (RAG).
Your task consists in analyzing the relevance of the answer generated by a large language model (llm) to a given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: Can a model be too complex, and if so, what are the consequences?
Generated Answer: Yes, a model can be too complex, leading to overfitting. Consequences include poor performance on new data.

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks with this structure:

{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}

 The final answer must be just the json object without any further comments before or after.
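
The printed prompt shows single braces because `str.format` turns the doubled `{{` and `}}` of the template into literal braces. The sketch below reproduces that on a shortened, hypothetical template and parses a made-up, well-formed judge reply with `json.loads`.

```python
import json

# Shortened stand-in for the judge template (assumption: same placeholders)
demo_judge_template = """
Question: {question}
Generated Answer: {answer_llm}

Reply with JSON in this structure:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[brief explanation]"
}}
""".strip()

demo_judge_prompt = demo_judge_template.format(
    question="What is overfitting?",
    answer_llm="A model that is too complex.",
)
print(demo_judge_prompt)

# A well-formed reply from the judge parses directly with json.loads
demo_reply = '{"Relevance": "RELEVANT", "Explanation": "Addresses the question."}'
demo_evaluation = json.loads(demo_reply)
print(demo_evaluation["Relevance"])
```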


In [71]:
import json


In [97]:
df_sample = df_question.sample(n=100, random_state=1)

In [98]:
sample = df_sample.to_dict(orient='records')

In [99]:
sample

[{'id': 149, 'question': 'List common NLP tasks supported by Apache OpenNLP.'},
 {'id': 135,
  'question': 'What happens when you use the break statement in a loop?'},
 {'id': 170,
  'question': 'Can you describe a scenario where patterns and relationships are discovered without prior knowledge of the outcome?'},
 {'id': 152,
  'question': 'How is an S-curve used for trend analysis and forecasting?'},
 {'id': 68,
  'question': 'What is a range of values that likely contains the population parameter?'},
 {'id': 8,
  'question': 'Can you explain how mutable and immutable data affect argument passing in Python?'},
 {'id': 139,
  'question': 'How does econometrics help understand economic relationships?'},
 {'id': 183,
  'question': 'List the primary tools utilized in deep learning for natural language processing.'},
 {'id': 23,
  'question': 'What is the average difference between predicted and actual values in a dataset?'},
 {'id': 56, 'question': 'What is a concise way to create lists i

In [100]:
# Function to parse LLM outputs that are not consistently pure JSON (usually from open-source models).
# After many iterations, this is the version that works best when the LLM output is not correctly formatted.
# It also gives more control over which outputs failed: it returns an error message instead of raising,
# so the program continues even if some outputs fail.

import json
import re

def robust_json_loads(s):
    """
    Attempts to parse a JSON string, fixing common errors if parsing fails.

    Parameters:
    s (str): The JSON string to parse.

    Returns:
    tuple: (success (bool), data (dict or None), error_message (str or None))
    """
    original_s = s  # Keep a copy of the original string

    # First, attempt to parse the input string directly
    try:
        data = json.loads(s)
        return (True, data, None)
    except json.JSONDecodeError:
        pass  # Proceed to cleaning steps if parsing fails

    # Step 1: Strip wrapping backticks and language specifiers
    s = s.strip()
    s = re.sub(r'^```[a-zA-Z]*\s*', '', s)  # Remove starting triple backticks and optional language
    s = re.sub(r'```$', '', s)              # Remove ending triple backticks
    s = s.strip('`')                        # Remove any remaining backticks

    # Step 2: Remove any text before the first '{' or '['
    start_idx = re.search(r'[\{\[]', s)
    if not start_idx:
        error_message = "No JSON object could be detected in the input."
        return (False, None, error_message)
    s = s[start_idx.start():]

    # Step 3: Remove any text after the last '}' or ']'
    end_idx = max(s.rfind('}'), s.rfind(']'))
    if end_idx == -1:
        error_message = "No JSON object could be detected in the input."
        return (False, None, error_message)
    s = s[:end_idx+1]

    # Step 4: Remove extraneous characters after the JSON content
    # Remove any characters after the last closing brace/bracket
    s = re.sub(r'([\}\]])[\s\S]*$', r'\1', s)

    # Step 5: Remove extraneous characters before the JSON content
    # Remove any characters before the first opening brace/bracket
    s = re.sub(r'^[\s\S]*?([\{\[])', r'\1', s)

    # Step 6: Replace single quotes with double quotes for keys and values
    # Avoid changing single quotes inside double-quoted strings
    s = re.sub(
        r'(?<=[:\{\[,])\s*\'([^\']*)\'\s*(?=[:,\}\]])',
        r'"\1"',
        s
    )

    # Step 7: Remove trailing commas before closing braces/brackets
    s = re.sub(r',\s*(\}|\])', r'\1', s)

    # Step 8: Balance brackets and braces if necessary
    def balance_characters(s, open_char, close_char):
        opens = s.count(open_char)
        closes = s.count(close_char)
        if opens > closes:
            s += close_char * (opens - closes)
        elif closes > opens:
            s = open_char * (closes - opens) + s
        return s

    s = balance_characters(s, '{', '}')
    s = balance_characters(s, '[', ']')

    # Step 9: Remove unescaped control characters
    # Control characters are not allowed in JSON strings
    s = re.sub(r'[\x00-\x1F]+', '', s)

    # Step 10: Final check to remove extra double quotes at the end of strings
    # (a quoted string followed by stray quotes before a comma or closing bracket/brace)
    s = re.sub(r'(".*?")"+(?=\s*[\],}])', r'\1', s)

    # Step 11: Attempt to parse the cleaned string
    try:
        data = json.loads(s)
    except json.JSONDecodeError as e:
        error_message = f"Error parsing JSON after cleaning: {e}"
        return (False, None, error_message)

    # Validate that the parsed data is the evaluation object requested in prompt2_template
    if not isinstance(data, dict) or 'Relevance' not in data:
        error_message = "Parsed JSON does not contain 'Relevance' key."
        return (False, None, error_message)

    # Parsing and validation successful
    return (True, data, None)
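
As a lighter alternative for the most common failure (valid JSON wrapped in extra text or a code fence), the standard library can scan for the first decodable object. This is only a sketch: `extract_first_json_object` is a hypothetical helper, it does not repair malformed JSON the way the function above tries to, and the wrapped reply is invented.

```python
import json


def extract_first_json_object(text):
    """Hypothetical helper: return the first JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None


fence = "`" * 3  # built programmatically to keep this example easy to embed
wrapped_reply = (
    "Here is my evaluation:\n"
    + fence + "json\n"
    + '{"Relevance": "PARTLY_RELEVANT", "Explanation": "Covers part of the question."}\n'
    + fence + "\n"
    + "Let me know if you need more."
)
print(extract_first_json_object(wrapped_reply))
print(extract_first_json_object("no json here"))
```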


In [101]:
question

'What is a range of values that likely contains the population parameter?'

In [102]:
answer_llm

'A confidence interval is a range of values that likely contains the population parameter with a certain level of confidence, accounting for sample variability.'

# The answers to evaluate are generated with Ollama models, for cost and efficiency.
llama3.1:8b       		4.7 GB	26  	
qwen2-math:7b     		4.4 GB	2   	
wizard-math:latest		4.1 GB	5   	
phi3:latest       		2.2 GB	5   

In [105]:
# After trying several options on small samples, we chose llama3.1 as both the answer generator and the judge
# for the first round; below, the same answers are evaluated by gpt-4o-mini
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',
)

In [106]:
model_1='llama3.1:8b'
model_2='llama3.1:8b'

In [107]:
evaluations = []
failed = []  # records whose judge output could not be parsed

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag(question, model_1)  # model_1 generates the answer
    prompt = prompt2_template.format(
        question=question,
        answer_llm=answer_llm
    )

    evaluation = llm(prompt, model_2)  # model_2 acts as judge and rates the relevance of the answer
    # Parse the judge output; keep going even if it is not valid JSON
    success, evaluation, error_message = robust_json_loads(evaluation)
    if not success:
        failed.append((record, error_message))
        evaluation = {'Relevance': None, 'Explanation': error_message}

    evaluations.append((record, answer_llm, evaluation))

100%|█████████████████████████████████████████| 100/100 [06:03<00:00,  3.64s/it]


In [108]:
evaluations

[({'id': 149,
   'question': 'List common NLP tasks supported by Apache OpenNLP.'},
  'Tokenization, parsing, named entity recognition.',
  {'Relevance': 'RELEVANT',
   'Explanation': 'The generated answer directly addresses the question by listing specific NLP tasks supported by Apache OpenNLP, making it highly relevant.'}),
 ({'id': 135,
   'question': 'What happens when you use the break statement in a loop?'},
  'Break exits loop prematurely.',
  {'Relevance': 'RELEVANT',
   'Explanation': 'The generated answer directly addresses the question about what happens when using the break statement in a loop.'}),
 ({'id': 170,
   'question': 'Can you describe a scenario where patterns and relationships are discovered without prior knowledge of the outcome?'},
  'Patterns and relationships are discovered without prior knowledge of the outcome when using techniques like transfer learning in computer vision, where a model developed for one task is repurposed on another related task.',
  {'Re

In [109]:
df_eval = pd.DataFrame(evaluations, columns=['record', 'answer', 'evaluation'])

df_eval['id'] = df_eval.record.apply(lambda d: d['id'])
df_eval['question'] = df_eval.record.apply(lambda d: d['question'])

df_eval['relevance'] = df_eval.evaluation.apply(lambda d: d['Relevance'])
df_eval['explanation'] = df_eval.evaluation.apply(lambda d: d['Explanation'])

del df_eval['record']
del df_eval['evaluation']

In [110]:
df_eval

Unnamed: 0,answer,id,question,relevance,explanation
0,"Tokenization, parsing, named entity recognition.",149,List common NLP tasks supported by Apache Open...,RELEVANT,The generated answer directly addresses the qu...
1,Break exits loop prematurely.,135,What happens when you use the break statement ...,RELEVANT,The generated answer directly addresses the qu...
2,Patterns and relationships are discovered with...,170,Can you describe a scenario where patterns and...,PARTLY_RELEVANT,The generated answer partially addresses the q...
3,The S-curve serves as a visual tool for analyz...,152,How is an S-curve used for trend analysis and ...,RELEVANT,The generated answer directly addresses the us...
4,A confidence interval is a range of values tha...,68,What is a range of values that likely contains...,RELEVANT,The generated answer directly addresses the qu...
...,...,...,...,...,...
95,Generalized linear models differ from traditio...,147,How do generalized linear models differ from t...,RELEVANT,The generated answer directly addresses the di...
96,Semi-supervised learning integrates both label...,4,What is semi-supervised learning?,RELEVANT,The generated answer directly addresses the qu...
97,The factors influencing chart creation are the...,64,What factors influence chart creation?,RELEVANT,The generated answer directly addresses the fa...
98,"Grid Search, Random Search, Bayesian Optimizat...",125,List different methods to find optimal hyperpa...,RELEVANT,The generated answer directly addresses the qu...


In [111]:
df_eval.relevance.value_counts(normalize=True)

relevance
RELEVANT           0.79
PARTLY_RELEVANT    0.20
NON_RELEVANT       0.01
Name: proportion, dtype: float64

In [112]:
model

'llama3.1:8b'

In [113]:
#df_eval.to_csv('../data/rag-eval-{gpt-4o-mini}.csv', index=False)
df_eval.to_csv(f'../data/rag-eval-{model}.csv', index=False)


In [114]:
df_eval[df_eval.relevance == 'NON_RELEVANT']

Unnamed: 0,answer,id,question,relevance,explanation
67,Not enough context in questions and answers da...,135,What are the three main loop control statement...,NON_RELEVANT,The generated answer does not address the ques...


In [115]:
# The evaluation above used llama3.1:8b as judge; below we use gpt-4o-mini.
# Since we already have the answers from the first model, we take them from the dataframe instead of repeating
# all the LLM calls.

In [116]:
client=OpenAI()
model_2='gpt-4o-mini'

In [117]:
def llm(prompt, model):
    # This version also returns token usage, so callers now receive a (content, tokens) tuple
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    tokens = {
        'prompt_tokens': response.usage.prompt_tokens,
        'completion_tokens': response.usage.completion_tokens,
        'total_tokens': response.usage.total_tokens
    }

    return response.choices[0].message.content, tokens

In [118]:
import pandas as pd
from tqdm import tqdm

# Lists to hold evaluation results and token counts for each type of token
evaluations = []
prompt_tokens_list = []
completion_tokens_list = []
total_tokens_list = []

# Assuming df_eval is your DataFrame with the relevant structure
for record in tqdm(sample):
    question = record['question']

    # Lookup the answer in df_eval based on the question
    answer_row = df_eval[df_eval['question'] == question]

    if not answer_row.empty:
        answer_llm = answer_row['answer'].values[0]  # Get the answer from the DataFrame
    else:
        answer_llm = "No answer found"  # Handle case where question is not in DataFrame

    prompt = prompt2_template.format(
        question=question,
        answer_llm=answer_llm
    )

    # Control of tokens used for the query to see cost of API calls
    evaluation, tokens = llm(prompt, model_2)  # Get evaluation and token usage

    # Call the parsing function robust_json_loads
    success, evaluation, error_message = robust_json_loads(evaluation)

    # Append the evaluation
    evaluations.append((record, answer_llm, evaluation))

    # Unpack tokens into separate lists
    prompt_tokens_list.append(tokens['prompt_tokens'])
    completion_tokens_list.append(tokens['completion_tokens'])
    total_tokens_list.append(tokens['total_tokens'])



100%|█████████████████████████████████████████| 100/100 [02:47<00:00,  1.68s/it]


In [119]:
# Create a DataFrame for tokens only
df_tokens = pd.DataFrame({
    'prompt_tokens': prompt_tokens_list,
    'completion_tokens': completion_tokens_list,
    'total_tokens': total_tokens_list
})


In [120]:
df_tokens

Unnamed: 0,prompt_tokens,completion_tokens,total_tokens
0,201,60,261
1,198,47,245
2,236,59,295
3,217,71,288
4,220,43,263
...,...,...,...
95,216,46,262
96,222,49,271
97,217,55,272
98,231,56,287


In [121]:
#calculation of costs
import pandas as pd

def calculate_token_costs(df_tokens, price_per_input, price_per_output):
    """
    Calculate total costs based on token usage.

    Parameters:
    - df_tokens: DataFrame containing token counts with columns 'prompt_tokens' and 'completion_tokens'.
    - price_per_input: Cost per 1 million input tokens (prompt tokens).
    - price_per_output: Cost per 1 million output tokens (completion tokens).

    Returns:
    - A DataFrame summarizing costs per token type and the total cost.
    """
    # Calculate costs for input tokens
    df_tokens['input_cost'] = (df_tokens['prompt_tokens'] / 1_000_000) * price_per_input
    
    # Calculate costs for output tokens
    df_tokens['output_cost'] = (df_tokens['completion_tokens'] / 1_000_000) * price_per_output
    
    # Calculate total cost
    df_tokens['total_cost'] = df_tokens['input_cost'] + df_tokens['output_cost']
    
    # Summarize costs
    total_input_cost = df_tokens['input_cost'].sum()
    total_output_cost = df_tokens['output_cost'].sum()
    total_cost = df_tokens['total_cost'].sum()

    # Create a summary DataFrame
    cost_summary = pd.DataFrame({
        'Total Input Cost': [total_input_cost],
        'Total Output Cost': [total_output_cost],
        'Total Cost': [total_cost]
    })

    return df_tokens, cost_summary




In [123]:
#calculation of costs
price_per_input = 0.150  # price per 1M input tokens
price_per_output = 0.600  # price per 1M output tokens

# Calculate costs
df_tokens_with_costs, cost_summary = calculate_token_costs(df_tokens, price_per_input, price_per_output)

# Display results
print("Tokens with Costs:")
print(df_tokens_with_costs)

print("\nCost Summary:")
print(cost_summary)

Tokens with Costs:
    prompt_tokens  completion_tokens  total_tokens  input_cost  output_cost  \
0             201                 60           261    0.000030     0.000036   
1             198                 47           245    0.000030     0.000028   
2             236                 59           295    0.000035     0.000035   
3             217                 71           288    0.000033     0.000043   
4             220                 43           263    0.000033     0.000026   
..            ...                ...           ...         ...          ...   
95            216                 46           262    0.000032     0.000028   
96            222                 49           271    0.000033     0.000029   
97            217                 55           272    0.000033     0.000033   
98            231                 56           287    0.000035     0.000034   
99            220                 57           277    0.000033     0.000034   

    total_cost  
0     0.000066 
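
As a quick sanity check on the figures above, the per-call cost can be recomputed by hand for the first five rows of the token table. This is a minimal sketch in plain Python, assuming the same prices per 1M tokens used in this notebook (0.150 for input, 0.600 for output); the names `token_rows` and `call_cost` are illustrative and not part of the notebook.

```python
# Prices per 1M tokens, as set in this notebook
PRICE_PER_INPUT = 0.150
PRICE_PER_OUTPUT = 0.600

# (prompt_tokens, completion_tokens) for rows 0-4 of the token table
token_rows = [(201, 60), (198, 47), (236, 59), (217, 71), (220, 43)]

def call_cost(prompt_tokens, completion_tokens):
    """Cost of one API call: tokens / 1M * price, summed over input and output."""
    input_cost = prompt_tokens / 1_000_000 * PRICE_PER_INPUT
    output_cost = completion_tokens / 1_000_000 * PRICE_PER_OUTPUT
    return input_cost + output_cost

costs = [call_cost(p, c) for p, c in token_rows]
for idx, cost in enumerate(costs):
    print(f"row {idx}: {cost:.6f}")
print(f"total for {len(costs)} calls: {sum(costs):.6f}")
```

Row 0 works out to 201 * 0.150 / 1M + 60 * 0.600 / 1M, roughly 0.000066, which agrees with the `total_cost` value printed for that row.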

evaluations = []

# Assuming df_eval is your DataFrame with the relevant structure
for record in tqdm(sample):
    question = record['question']

    # Lookup the answer in df_eval based on the question
    answer_row = df_eval[df_eval['question'] == question]

    if not answer_row.empty:
        answer_llm = answer_row['answer'].values[0]  # Get the answer from the DataFrame
    else:
        answer_llm = "No answer found"  # Handle case where question is not in DataFrame

    prompt = prompt2_template.format(
        question=question,
        answer_llm=answer_llm
    )
    # Track the tokens used by the query to estimate the cost of the API calls
    evaluation, tokens = llm(prompt, model_2)  # model_2 acts as the judge that rates the relevance of the answer

    # Call the parsing function robust_json_loads
    success, evaluation, error_message = robust_json_loads(evaluation)

    evaluations.append((record, answer_llm, evaluation))


#evaluations_gpt4o = []
#evaluations_llama31_8b = []
#now with 4o-mini
evaluations = []

for record in tqdm(sample):
    question = record['question']
    answer_llm = rag(question, model_1)  # model_1 generates the answer
    #answer_llm = robust_json_loads(answer_llm)
    prompt = prompt2_template.format(
        question=question,
        answer_llm=answer_llm
    )

    # llm returns the response together with the token usage, so unpack both
    evaluation, tokens = llm(prompt, model_2)  # model_2 acts as the judge that rates the relevance of the answer
    #evaluation = json.loads(evaluation)
    #evaluation = robust_json_loads(evaluation)
    # Call the parsing function robust_json_loads
    success, evaluation, error_message = robust_json_loads(evaluation)

    evaluations.append((record, answer_llm, evaluation))
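
The loops above rely on a parser that returns a `(success, evaluation, error_message)` triple. The notebook's own `robust_json_loads` is not shown in this section, so the following is only a sketch of one way such a contract could be met, under the assumption that the judge sometimes wraps its JSON in extra prose. The name `parse_judge_json` is hypothetical and chosen to avoid confusion with the real helper.

```python
import json
import re

def parse_judge_json(text):
    """Hypothetical helper: return (success, parsed, error_message)."""
    try:
        # Happy path: the judge returned clean JSON
        return True, json.loads(text), None
    except (json.JSONDecodeError, TypeError) as first_error:
        # Fallback: try the outermost {...} span, e.g. when JSON is wrapped in prose
        match = re.search(r"\{.*\}", text, flags=re.DOTALL) if isinstance(text, str) else None
        if match:
            try:
                return True, json.loads(match.group(0)), None
            except json.JSONDecodeError as second_error:
                return False, None, str(second_error)
        return False, None, str(first_error)

ok, parsed, err = parse_judge_json('Sure! {"Relevance": "RELEVANT", "Explanation": "ok"}')
print(ok, parsed, err)
```

With a helper of this shape, a failed parse yields `None` as the evaluation, so it may be worth checking `success` before appending to `evaluations`; otherwise later lookups such as `d['Relevance']` would fail on that row.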

In [124]:
evaluations_gpt4o = evaluations

In [125]:
df_eval = pd.DataFrame(evaluations_gpt4o, columns=['record', 'answer', 'evaluation'])

df_eval['id'] = df_eval.record.apply(lambda d: d['id'])
df_eval['question'] = df_eval.record.apply(lambda d: d['question'])

df_eval['relevance'] = df_eval.evaluation.apply(lambda d: d['Relevance'])
df_eval['explanation'] = df_eval.evaluation.apply(lambda d: d['Explanation'])

del df_eval['record']
del df_eval['evaluation']

In [126]:
df_eval

Unnamed: 0,answer,id,question,relevance,explanation
0,"Tokenization, parsing, named entity recognition.",149,List common NLP tasks supported by Apache Open...,PARTLY_RELEVANT,The generated answer lists some common NLP tas...
1,Break exits loop prematurely.,135,What happens when you use the break statement ...,RELEVANT,The generated answer accurately describes the ...
2,Patterns and relationships are discovered with...,170,Can you describe a scenario where patterns and...,PARTLY_RELEVANT,The answer mentions transfer learning in compu...
3,The S-curve serves as a visual tool for analyz...,152,How is an S-curve used for trend analysis and ...,PARTLY_RELEVANT,The generated answer mentions the S-curve as a...
4,A confidence interval is a range of values tha...,68,What is a range of values that likely contains...,RELEVANT,The generated answer accurately defines a conf...
...,...,...,...,...,...
95,Generalized linear models differ from traditio...,147,How do generalized linear models differ from t...,RELEVANT,The generated answer accurately describes a ke...
96,Semi-supervised learning integrates both label...,4,What is semi-supervised learning?,RELEVANT,The generated answer accurately defines semi-s...
97,The factors influencing chart creation are the...,64,What factors influence chart creation?,RELEVANT,The generated answer directly addresses the qu...
98,"Grid Search, Random Search, Bayesian Optimizat...",125,List different methods to find optimal hyperpa...,RELEVANT,The generated answer lists different methods f...


In [127]:
df_eval.relevance.value_counts()

relevance
RELEVANT           75
PARTLY_RELEVANT    24
NON_RELEVANT        1
Name: count, dtype: int64

In [128]:
df_eval.relevance.value_counts(normalize=True)

relevance
RELEVANT           0.75
PARTLY_RELEVANT    0.24
NON_RELEVANT       0.01
Name: proportion, dtype: float64

In [129]:
df_eval.to_csv('../data/rag-eval-gpt-4o.csv', index=False)

In [None]:
# Observations:
# With the small sample of only 5 examples, neither judge (llama3.1:8b or gpt-4o-mini) found a
# non-relevant answer; the relevant / partly relevant split was 80%-20% and 60%-40% respectively.
# With the sample of 100:
# - llama3.1:8b as generator and judge: 80%-20%, with only 1 non-relevant answer
# - llama3.1:8b as generator and gpt-4o-mini as judge: approx. 75%-25%, with only 1 non-relevant answer
# The non-relevant answer is the only one where the generator did not answer because it lacked enough context.
