In this notebook you will find:

- chunk process of the data
- RAG with elasticsearch and llama2
- Retrieval evaluation
- RAG Evaluation

In [1]:
import pandas as pd
from tqdm import tqdm
import warnings

warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('../../data/data.csv')
documents = df.to_dict(orient='records')

In [3]:
documents[0]

{'chapter': 'CHAPTER 1',
 'title': 'Machine Learning Roles and the Interview Process',
 'section': 'Overview of This Book',
 'text': 'In the first part of this chapter, I’ll walk through the structure of this book. Then, I’ll discuss the various job titles and roles that use ML skills in industry. 1 I’ll also clarify the responsibilities of various job titles, such as data scientist, machine learning engineer, and so on, as this is a common point of confusion for job seekers. These will be illustrated with an ML skills matrix and ML lifecycle that will be referenced throughout the book. The second part of this chapter walks through the interview process, from beginning to end. I’ve mentored candidates who appreciated this overview since online resources often focus on specific pieces of the interview but not how they all connect together and result in an offer. Especially for new graduates 2 and readers coming from different industries, this chapter helps get everyone on the same page 

## Checking the number of tokens 

llama2 from Ollama has an embedding length of 4096. This means that the maximum number of tokens that can be introduced will be 4096 tokens. Let's check if the pre-chunking using the natural structure of the book is enough or not. 

In [4]:
import spacy

nlp = spacy.load("es_core_news_sm") 

In [5]:
token_limit = 4096
results = []
big_docs = []

for i, doc in enumerate(documents):
    spacy_doc = nlp(doc['text'])
    num_tokens = len(spacy_doc)

    result = {
        "chapter": doc["chapter"],
        "title": doc["title"],
        "section": doc["section"],
        "num_tokens": num_tokens
    }
    
    results.append(result)
    
for res in results:
    if res["num_tokens"] > token_limit: 
        big_docs.append(res)
        
print(big_docs)


[{'chapter': 'CHAPTER 3', 'title': 'Technical Interview: Machine Learning Algorithms', 'section': 'Statistical and Foundational Techniques', 'num_tokens': 4151}, {'chapter': 'CHAPTER 6', 'title': 'Technical Interview: Model Deployment and End-to-End ML', 'section': 'Model Deployment', 'num_tokens': 6350}]


## Chunking the problematic parts of the book

I'm going to use the structure of the book to chunk both chapters into two parts. 

Since the initial parsing didn't consider subsections, I will chunk them into 1 chunk of 4000 + the following words until the end of the sentence and another chunk with the second part.

In [6]:
token_limit = 4000
updated_documents = []

for doc in documents:
    spacy_doc = nlp(doc['text'])
    num_tokens = len(spacy_doc)

    if (doc['chapter'] in ["CHAPTER 3", "CHAPTER 6"]) and (doc['section'] in ["Statistical and Foundational Techniques", "Model Deployment"]):
        text = doc['text']
        tokens = nlp(text)  
        current_chunk = []
        current_tokens = 0
        
        for sentence in tokens.sents:
            sentence_tokens = len(sentence)
            
            if current_tokens + sentence_tokens > token_limit:
                updated_documents.append({
                    "chapter": doc["chapter"],
                    "title": doc["title"],
                    "section": doc["section"],
                    "text": " ".join([token.text for token in current_chunk]), 
                    "num_tokens": current_tokens,
                    "id": f"{doc['id']}_chunk_{len(updated_documents) + 1}"  
                })
                current_chunk = [sentence] 
                current_tokens = sentence_tokens  
            else:
                current_chunk.append(sentence)
                current_tokens += sentence_tokens

        if current_chunk:
            updated_documents.append({
                "chapter": doc["chapter"],
                "title": doc["title"],
                "section": doc["section"],
                "text": " ".join([token.text for token in current_chunk]), 
                "num_tokens": current_tokens,
                "id": f"{doc['id']}_chunk_{len(updated_documents) + 1}"  
            })
    else:
        updated_documents.append(doc)

# for updated_doc in updated_documents:
#     print(updated_doc)


# Setup Elasticsearch connection

#TODO -  docker config

### run on the console (linux)

sudo docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3

# Create mappings and Index


In [7]:
from elasticsearch import Elasticsearch
es_client = Elasticsearch('http://localhost:9200') 

es_client.info()

ObjectApiResponse({'name': '10e9358c75eb', 'cluster_name': 'docker-cluster', 'cluster_uuid': '9cbyo7m3TIqww90LbpPAcQ', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [8]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "standard_analyzer": {
                "type": "standard"
                }
            }
        }
    },
    "mappings": {
    "properties": {        
        "chapter": {
            "type": "text",
        },
        "title": {
            "type": "text",
        },
        "section": {
            "type": "text",
        },
        "text": {
            "type": "text",
            "analyzer": "standard_analyzer"  
        },
        "id":{
            "type": "keyword",
        },
        
    }
}

}

In [9]:
index_name = "ds-interview-questions"

# it is better to delete the index every time when experimenting
es_client.indices.delete(index=index_name, ignore_unavailable=True) 
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'ds-interview-questions'})

### Add documents to the index

In [10]:
for doc in tqdm(updated_documents):
    try:
        es_client.index(index=index_name, document=doc)
    except Exception as e:
        print(f"Error when indexing the document: {e}")

100%|██████████| 50/50 [00:00<00:00, 191.60it/s]


### Create user query

In [11]:
query = 'what is the scope of a data scientist?'

### Create search function

In [12]:
def execute_search(query, index=index_name):
    """
    Execute a search query on the specified index.

    Parameters:
        query (dict): The search query to execute.
        index (str): The name of the index to search.

    Returns:
        None: Prints the search results.
    """
    try:
        response = es_client.search(index=index, body=query)
        return response
    except Exception as e:
        print(f"Error during search: {e}")

In [13]:
def full_text_search(query):
    full_text_query = {
        "size": 15,
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["text^3", "section", "title"],
                "type": "best_fields"
            }
        }
    }
    
    full_text_results = execute_search(full_text_query)

    return full_text_results

# RAG

In [14]:
import ollama
client = ollama.Client()

# to initiate ollama on console for the first time
# ollama serve
# ollama pull llama2

## Prompt


In [15]:
prompt_template = """
<<SYS>>
You are an assistant preparing a candidate for a data science job interview. 
Based on the provided context, please provide a concise and accurate answer to the following question in plain text format without any additional formatting. 
<</SYS>>

QUESTION: {question}

CONTEXT:
{context}

[INST] The answer has to be plain text
 [/INST]
"""


In [16]:
query

'what is the scope of a data scientist?'

In [17]:
def build_prompt(query, full_text_results):
    context = ""

    for hit in full_text_results['hits']['hits']: 
        text = hit['_source']['text']  
        context += f"Text: {text}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt


# prompt = build_prompt(query, search_results)
# print(prompt)

In [18]:
def generate_answer(query, full_text_results):
    message_content = build_prompt(query, full_text_results)
    
    response = client.chat(model="llama2", messages=[{"role": "user", "content": message_content}])
    
    if 'message' in response and 'content' in response['message']:
        content = response['message']['content']
        
        return content.strip()  
    return ""  


In [19]:
# response = generate_answer(query, search_results)
# print(response)

In [20]:
def rag(query):
    full_text_results = full_text_search(query)
    response = generate_answer(query, full_text_results)
    return response

    

In [21]:
query = 'Which skills are important for a data scientist?'
print(rag(query))

Sure, here's the answer in plain text format:

Data scientists require a range of skills, including:

1. Programming skills (e.g., Python, R, SQL)
2. Statistical knowledge (e.g., hypothesis testing, regression analysis)
3. Data visualization and communication
4. Machine learning and deep learning techniques
5. Data wrangling and preprocessing
6. Understanding of data mining and big data technologies
7. Familiarity with cloud computing platforms (e.g., AWS, Azure)
8. Experience with data management and storage solutions (e.g., relational databases, NoSQL)
9. Ability to work with large datasets and perform complex analysis
10. Strong analytical and problem-solving skills.


## Retrieval evaluation

In [22]:
import pandas as pd

df = pd.read_csv('../../data/ground_truth_data.csv')
df

Unnamed: 0,question,text_id,chapter,title,section
0,How do you approach data preprocessing for mac...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
1,Can you explain the difference between supervi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
2,How do you evaluate the performance of a machi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
3,What are some common pitfalls to avoid when wo...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
4,How do you handle missing values in a dataset ...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
...,...,...,...,...,...
235,In what ways do you tailor your resume for dif...,1026686599,CHAPTER 9,Post-Interview and Follow-up,What to Do Between Interviews
236,Can you share an instance when you received an...,1026686599,CHAPTER 9,Post-Interview and Follow-up,What to Do Between Interviews
237,How do you handle rejection in the job search ...,1026686599,CHAPTER 9,Post-Interview and Follow-up,What to Do Between Interviews
238,How long should you wait before following up w...,22eb7b9b30,CHAPTER 9,Post-Interview and Follow-up,Post-Interview Steps


In [23]:
df_questions = df[['question', 'text_id']]

In [24]:
ground_truth = df_questions.to_dict(orient = 'records')
ground_truth[0]

{'question': 'How do you approach data preprocessing for machine learning models?',
 'text_id': '86fd49a66d'}

In [25]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if isinstance(line, (list, tuple)) and True in line:
            cnt += 1

    return cnt / len(relevance_total)


In [26]:
def mrr(relevance_total):
    total_score = 0.0
    num_queries = len(relevance_total)

    for line in relevance_total:
        query_score = 0.0
        for rank in range(len(line)):
            if line[rank] == True:
                query_score = 1 / (rank + 1)
                break  

        total_score += query_score

    return total_score / num_queries if num_queries > 0 else 0.0


In [27]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['text_id']
    text_results = full_text_search(q['question'])
    hits = text_results.get('hits', {}).get('hits', [])
    relevance = [doc['_source']['id'] == doc_id for doc in hits]
    relevance_total.append(relevance)

100%|██████████| 240/240 [00:01<00:00, 163.43it/s]


In [28]:
hit_rate(relevance_total), mrr(relevance_total)

(0.775, 0.5180606893106892)

## Hyperparams Optimization


In [29]:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
import numpy as np
from sklearn.model_selection import ParameterGrid


In [30]:
df_val = df_questions[:100]
df_test = df_questions[100:]

In [31]:
params = {
    'temperature': hp.uniform('temperature', 0.1, 1.0),
    'top_p': hp.uniform('top_p', 0.5, 1.0),
    'top_k': hp.quniform('top_k', 10, 100, 1),
    'frequency_penalty': hp.uniform('frequency_penalty', 0.0, 1.0),
    'presence_penalty': hp.uniform('presence_penalty', 0.0, 1.0),
    'max_tokens': hp.quniform('max_tokens', 50, 200, 1),
    'repetition_penalty': hp.uniform('repetition_penalty', 1.0, 1.5)
}


In [32]:
def objective(params):
    # Extrae los parámetros del diccionario `params`
    temperature = params['temperature']
    top_p = params['top_p']
    top_k = params['top_k']
    frequency_penalty = params['frequency_penalty']
    presence_penalty = params['presence_penalty']
    max_tokens = params['max_tokens']
    repetition_penalty = params['repetition_penalty']
    
    response = rag(query)

    hit_rate_value = hit_rate(response)  
    mrr_value = mrr(response)
    
    loss = - (hit_rate_value + mrr_value)
    
    return {'loss': loss, 'status': STATUS_OK}

trials = Trials()

best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=10,  
    trials=trials
)

print(f"Best hyperparameters: {best}")


NameError: name 'space' is not defined

In [None]:
best_hyperparams, best_score = hyperparameter_optimization(df_questions)

# Output the best hyperparameters
print("Best Hyperparameters:", best_hyperparams)
print("Best Score:", best_score)


for _, row in df_val.iterrows():
    query = row['question']
    val_results = full_text_search(query, best_hyperparams)
    print("Validation Results:", val_results)

# Example of applying the best parameters on test set
for _, row in df_test.iterrows():
    query = row['question']
    test_results = full_text_search(query, best_hyperparams)
    print("Test Results:", test_results)


In [None]:
test_scores = []
for _, row in df_test.iterrows():
    query = row['question']  # Assuming your DataFrame has a 'question' column
    relevant_ids = row['relevant_ids']  # Replace with your actual relevant ID retrieval logic
    results = full_text_search(query, best_params["boost_title"], best_params["boost_content"], best_params["min_score"])
    score = evaluate_results(results, relevant_ids)
    test_scores.append(score)

print("Test set scores:", test_scores)
print("Average test score:", np.mean(test_scores))

In [None]:
# with open("../../data/best_hyperparameters_elasticsearch.json", "w") as json_file:
#     json.dump(best_params, json_file)

# print("Best hyperparameters saved to 'best_hyperparameters.json':", best_params)


In [None]:
gt_val = df_val.to_dict(orient='records')

In [None]:
def minsearch_search_optimized(query, boost):
    # boost = {'text': 3.0, 'section': 0.5}
    
    results = index.search(
        query=query,
        filter_dict = {},
        boost_dict=boost,
        num_results=5)

    return results


In [None]:
boost = {'text': best['boost']}
         
evaluate(gt_val, lambda q: minsearch_search_optimized(q['question'], boost))
# para mirar cuanto da con los mejores hyperparam 

A little bit better :)

## RAG Evaluation

In [None]:
prompt1_template = """
You are an expert evaluator for a RAG system.
Your task is to analyze the relevance of the generated answer compared to the original answer provided.
Based on the relevance and similarity of the generated answer to the original answer, you will classify
it as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Original Answer: {answer_orig}
Generated Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the original
answer and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()

prompt2_template = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()


In [None]:
len(ground_truth) 

In [None]:
ground_truth[0]

In [None]:
record = ground_truth[0]
question = record['question']
answer_llm = rag(question)

In [None]:
print(answer_llm)

In [None]:
prompt = prompt2_template.format(question = question , answer_llm = answer_llm)
print(prompt)

In [None]:
search_results = minsearch_search_optimized(query, boost)
relevance = generate_answer(prompt, search_results)

print(relevance)

In [None]:
for record in tqdm(ground_truth):
    print(record)

In [None]:
evaluations = []

for record in tqdm(ground_truth):
    question = record['question']
    answer_llm = rag(question)
    
    prompt = prompt2_template.format(question = question , answer_llm = answer_llm)
    search_results = minsearch_search_optimized(query, boost)
    relevance = generate_answer(prompt, search_results)
    evaluations.append((record['question'], answer_llm, relevance))

In [None]:
evaluations[0]

In [None]:
df_eval = pd.DataFrame(evaluations, columns=['Question', 'Response', 'Evaluation'])

In [None]:
df_eval

In [None]:
import re 

def categorize_evaluation(text):
    if re.search(r'"NON_RELEVANT"', text):
        return "NON_RELEVANT"
    elif re.search(r'"PARTLY_RELEVANT"', text):
        return "PARTLY_RELEVANT"
    elif re.search(r'"RELEVANT"', text):
        return "RELEVANT"
    else:
        return "UNKNOWN"

df_eval['Category'] = df_eval['Evaluation'].apply(categorize_evaluation)

category_counts = df_eval['Category'].value_counts()

In [None]:
category_counts

In [None]:
normalized_counts = df_eval['Category'].value_counts(normalize= True)
normalized_counts