In this notebook you will find:

- Chunking of the data
- RAG with Elasticsearch and llama2
- Retrieval evaluation
- RAG evaluation

In [1]:
import pandas as pd
from tqdm import tqdm
import warnings

warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('../../data/data.csv')
documents = df.to_dict(orient='records')

In [3]:
documents[0]

{'chapter': 'CHAPTER 1',
 'title': 'Machine Learning Roles and the Interview Process',
 'section': 'Overview of This Book',
 'text': 'In the first part of this chapter, I’ll walk through the structure of this book. Then, I’ll discuss the various job titles and roles that use ML skills in industry. 1 I’ll also clarify the responsibilities of various job titles, such as data scientist, machine learning engineer, and so on, as this is a common point of confusion for job seekers. These will be illustrated with an ML skills matrix and ML lifecycle that will be referenced throughout the book. The second part of this chapter walks through the interview process, from beginning to end. I’ve mentored candidates who appreciated this overview since online resources often focus on specific pieces of the interview but not how they all connect together and result in an offer. Especially for new graduates 2 and readers coming from different industries, this chapter helps get everyone on the same page 

## Checking the number of tokens 

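llama2 from Ollama has a context length of 4096 tokens, so the prompt (question plus retrieved context) cannot exceed 4096 tokens. Let's check whether the pre-chunking based on the natural structure of the book keeps every section under that limit. The counts below come from spaCy's tokenizer, so they only approximate the model's own token counts.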

In [4]:
import spacy

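# Note: "es_core_news_sm" is spaCy's Spanish pipeline. The book text is English,
# so "en_core_web_sm" would be the more natural choice (token counts may differ slightly).
nlp = spacy.load("es_core_news_sm")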

In [5]:
token_limit = 4096
results = []
big_docs = []

for i, doc in enumerate(documents):
    spacy_doc = nlp(doc['text'])
    num_tokens = len(spacy_doc)

    result = {
        "chapter": doc["chapter"],
        "title": doc["title"],
        "section": doc["section"],
        "num_tokens": num_tokens
    }
    
    results.append(result)
    
for res in results:
    if res["num_tokens"] > token_limit: 
        big_docs.append(res)
        
print(big_docs)


[{'chapter': 'CHAPTER 3', 'title': 'Technical Interview: Machine Learning Algorithms', 'section': 'Statistical and Foundational Techniques', 'num_tokens': 4151}, {'chapter': 'CHAPTER 6', 'title': 'Technical Interview: Model Deployment and End-to-End ML', 'section': 'Model Deployment', 'num_tokens': 6350}]


## Chunking the problematic parts of the book

Two sections exceed the limit, so I'm going to split each of them into smaller chunks.

Since the initial parsing didn't consider subsections, the split is done by size: sentences are accumulated until adding the next one would exceed 4000 tokens, at which point a new chunk starts. Each chunk therefore ends on a sentence boundary.
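
As a rough illustration of sentence-boundary chunking, here is a minimal sketch that does not depend on spaCy. It assumes a naive regex sentence splitter and approximates tokens by whitespace-separated words; the function name `chunk_by_sentences` is hypothetical and not part of the notebook.

```python
import re


def chunk_by_sentences(text, token_limit):
    """Greedily pack whole sentences into chunks of at most token_limit tokens.

    Assumption: tokens are approximated by whitespace-separated words,
    whereas the notebook counts spaCy tokens.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Flush the current chunk if this sentence would push it over the limit.
        if current and current_tokens + n > token_limit:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six seven. Eight nine. Ten."
print(chunk_by_sentences(sample, token_limit=5))
# ['One two three.', 'Four five six seven.', 'Eight nine. Ten.']
```

Note that a single sentence longer than the limit still ends up as its own oversized chunk in this sketch.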

In [6]:
token_limit = 4000
updated_documents = []

for doc in documents:

    if (doc['chapter'] in ["CHAPTER 3", "CHAPTER 6"]) and (doc['section'] in ["Statistical and Foundational Techniques", "Model Deployment"]):
        text = doc['text']
        tokens = nlp(text)  
        current_chunk = []
        current_tokens = 0
        
        for sentence in tokens.sents:
            sentence_tokens = len(sentence)
            
            if current_tokens + sentence_tokens > token_limit:
                updated_documents.append({
                    "chapter": doc["chapter"],
                    "title": doc["title"],
                    "section": doc["section"],
                    "text": " ".join(sent.text for sent in current_chunk),
                    "num_tokens": current_tokens,
                    "id": f"{doc['id']}_chunk_{len(updated_documents) + 1}"  
                })
                current_chunk = [sentence] 
                current_tokens = sentence_tokens  
            else:
                current_chunk.append(sentence)
                current_tokens += sentence_tokens

        if current_chunk:
            updated_documents.append({
                "chapter": doc["chapter"],
                "title": doc["title"],
                "section": doc["section"],
                "text": " ".join(sent.text for sent in current_chunk),
                "num_tokens": current_tokens,
                "id": f"{doc['id']}_chunk_{len(updated_documents) + 1}"  
            })
    else:
        updated_documents.append(doc)

# for updated_doc in updated_documents:
#     print(updated_doc)


# Set up the Elasticsearch connection

TODO: Docker config

### Run on the console (Linux)

sudo docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
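
The container can take a while to boot, so it may help to wait until the node answers before creating the index. The sketch below is one possible way to do that. The helper name `wait_for_elasticsearch` and its defaults are assumptions; it only relies on the client exposing a `ping()` method, which the official Python client does.

```python
import time


def wait_for_elasticsearch(client, retries=30, delay=2.0):
    """Poll client.ping() until it returns True or the retries run out."""
    for _ in range(retries):
        try:
            if client.ping():
                return True
        except Exception:
            # The container may still be booting; treat errors as "not ready yet".
            pass
        time.sleep(delay)
    return False


# Usage (assumes the container above is running):
# from elasticsearch import Elasticsearch
# wait_for_elasticsearch(Elasticsearch("http://localhost:9200"))
```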

# Create mappings and Index


In [9]:
from elasticsearch import Elasticsearch
es_client = Elasticsearch('http://localhost:9200') 

es_client.info()

ObjectApiResponse({'name': 'd6cc5366abe6', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'v41YPv22T0Cz5Ot3oDqStA', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [10]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "standard_analyzer": {
                    "type": "standard"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "chapter": {"type": "text"},
            "title": {"type": "text"},
            "section": {"type": "text"},
            "text": {
                "type": "text",
                "analyzer": "standard_analyzer"
            },
            "id": {"type": "keyword"}
        }
    }
}

In [11]:
index_name = "ds-interview-questions"

# it is better to delete the index every time when experimenting
es_client.indices.delete(index=index_name, ignore_unavailable=True) 
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'ds-interview-questions'})

### Add documents to the index

In [12]:
for doc in tqdm(updated_documents):
    try:
        es_client.index(index=index_name, document=doc)
    except Exception as e:
        print(f"Error when indexing the document: {e}")

100%|██████████| 50/50 [00:00<00:00, 143.43it/s]
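
Indexing one document per request is fine for 50 documents. For larger collections, the bulk helper of the Python client is a common alternative. The sketch below only builds the actions; the generator name `build_bulk_actions` is hypothetical, and reusing the document's `id` as the Elasticsearch `_id` is an assumption rather than something the notebook does.

```python
def build_bulk_actions(docs, index_name):
    """Yield one bulk action per document, in the dict shape used by elasticsearch.helpers.bulk."""
    for doc in docs:
        action = {"_index": index_name, "_source": doc}
        # Assumption: reuse the document's own id, when present, so that
        # re-indexing overwrites the document instead of duplicating it.
        if "id" in doc:
            action["_id"] = doc["id"]
        yield action


# Usage with the client from this notebook:
# from elasticsearch.helpers import bulk
# bulk(es_client, build_bulk_actions(updated_documents, index_name))
```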


### Create user query

In [13]:
query = 'what is the scope of a data scientist?'

### Create search function

In [14]:
def execute_search(query, index=index_name):
    """
    Execute a search query on the specified index.

    Parameters:
        query (dict): The Elasticsearch query body to execute.
        index (str): The name of the index to search.

    Returns:
        The Elasticsearch response, or None if the search fails
        (the error is printed).
    """
    try:
        response = es_client.search(index=index, body=query)
        return response
    except Exception as e:
        print(f"Error during search: {e}")

In [15]:
def full_text_search(query):
    full_text_query = {
        "size": 15,
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["text^3", "section", "title"],
                "type": "best_fields"
            }
        }
    }
    
    full_text_results = execute_search(full_text_query)

    return full_text_results
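
The field boosts are hard-coded in the query above (`text^3`). To experiment with other weights, the query body could be built from a dictionary of boosts. This is a minimal sketch; the helper name `build_multi_match_query` and the `boosts` parameter are hypothetical.

```python
def build_multi_match_query(query, boosts, size=15):
    """Build a multi_match body; boosts maps a field name to its boost factor."""
    # Elasticsearch encodes per-field boosts as "field^factor";
    # a boost of 1 can be left implicit.
    fields = [f if b == 1 else f"{f}^{b}" for f, b in boosts.items()]
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": query,
                "fields": fields,
                "type": "best_fields",
            }
        },
    }


body = build_multi_match_query("what is MRR?", {"text": 3, "section": 1, "title": 1})
print(body["query"]["multi_match"]["fields"])
# ['text^3', 'section', 'title']
```

The resulting dictionary can be passed to `execute_search` in place of the hard-coded query.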

# RAG

In [16]:
import ollama
client = ollama.Client()

# to initiate ollama on console for the first time
# ollama serve
# ollama pull llama2

## Prompt


In [17]:
prompt_template = """
<<SYS>>
You are an assistant preparing a candidate for a data science job interview. 
Based on the provided context, please provide a concise and accurate answer to the following question in plain text format without any additional formatting. 
<</SYS>>

QUESTION: {question}

CONTEXT:
{context}

[INST] The answer has to be plain text
 [/INST]
"""


In [18]:
query

'what is the scope of a data scientist?'

In [19]:
def build_prompt(query, full_text_results):
    context = ""

    for hit in full_text_results['hits']['hits']: 
        text = hit['_source']['text']  
        context += f"Text: {text}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt


# prompt = build_prompt(query, search_results)
# print(prompt)
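
The search returns up to 15 hits and a single section can be close to 4000 tokens, so the concatenated context may exceed the model's window. One way to guard against that is to cap the context at a token budget. This is a sketch under assumptions: `build_context` and `max_tokens` are hypothetical names, and tokens are approximated by whitespace-separated words.

```python
def build_context(texts, max_tokens=3000):
    """Concatenate retrieved texts in rank order until the token budget is used up.

    Assumption: token counts are approximated by whitespace-separated words.
    """
    parts, used = [], 0
    for text in texts:
        n = len(text.split())
        if used + n > max_tokens:
            break  # stop at the first hit that no longer fits
        parts.append(f"Text: {text}")
        used += n
    return "\n\n".join(parts)


print(build_context(["a b c", "d e", "f g h i"], max_tokens=5))
```

With the search results from this notebook, `texts` would be `[hit['_source']['text'] for hit in full_text_results['hits']['hits']]`.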

In [20]:
def generate_answer(query, full_text_results):
    message_content = build_prompt(query, full_text_results)
    
    response = client.chat(model="llama2", messages=[{"role": "user", "content": message_content}])
    
    if 'message' in response and 'content' in response['message']:
        content = response['message']['content']
        
        return content.strip()  
    return ""  


In [21]:
# response = generate_answer(query, search_results)
# print(response)

In [22]:
def rag(query):
    full_text_results = full_text_search(query)
    response = generate_answer(query, full_text_results)
    return response

    

In [23]:
query = 'Which skills are important for a data scientist?'
print(rag(query))

Some of the key skills that are important for a data scientist include:

1. Programming skills: Proficiency in programming languages such as Python, R, or Julia is essential for data science tasks such as data cleaning, manipulation, and analysis.
2. Statistical knowledge: A strong understanding of statistical concepts and methods, including hypothesis testing, regression analysis, and time series analysis, is critical for interpreting and analyzing data.
3. Data visualization: The ability to effectively communicate insights and findings through data visualizations is crucial for presenting results to stakeholders.
4. Machine learning: Knowledge of machine learning algorithms and techniques, such as supervised and unsupervised learning, is important for developing predictive models and classifying data.
5. Data wrangling: The ability to clean, transform, and restructure data into a format suitable for analysis is essential for working with large datasets.
6. Communication skills: Data 

## Retrieval evaluation

In [24]:
import pandas as pd

df = pd.read_csv('../../data/ground_truth_data.csv')
df

Unnamed: 0,question,text_id,chapter,title,section
0,How do you approach data preprocessing for mac...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
1,Can you explain the difference between supervi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
2,How do you evaluate the performance of a machi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
3,What are some common pitfalls to avoid when wo...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
4,How do you handle missing values in a dataset ...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
...,...,...,...,...,...
235,In what ways do you tailor your resume for dif...,1026686599,CHAPTER 9,Post-Interview and Follow-up,What to Do Between Interviews
236,Can you share an instance when you received an...,1026686599,CHAPTER 9,Post-Interview and Follow-up,What to Do Between Interviews
237,How do you handle rejection in the job search ...,1026686599,CHAPTER 9,Post-Interview and Follow-up,What to Do Between Interviews
238,How long should you wait before following up w...,22eb7b9b30,CHAPTER 9,Post-Interview and Follow-up,Post-Interview Steps


In [25]:
df_questions = df[['question', 'text_id']]

In [26]:
ground_truth = df_questions.to_dict(orient = 'records')
ground_truth[0]

{'question': 'How do you approach data preprocessing for machine learning models?',
 'text_id': '86fd49a66d'}

In [27]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if isinstance(line, (list, tuple)) and True in line:
            cnt += 1

    return cnt / len(relevance_total)


In [28]:
def mrr(relevance_total):
    total_score = 0.0
    num_queries = len(relevance_total)

    for line in relevance_total:
        query_score = 0.0
        for rank in range(len(line)):
            if line[rank] == True:
                query_score = 1 / (rank + 1)
                break  

        total_score += query_score

    return total_score / num_queries if num_queries > 0 else 0.0
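
To make the two metrics concrete, here is a toy example with four queries and three results each. The functions are compact restatements of the ones above, renamed `hit_rate_example` and `mrr_example` (hypothetical names) so they don't shadow the notebook's definitions.

```python
def hit_rate_example(relevance_total):
    # Fraction of queries with at least one relevant result.
    return sum(any(line) for line in relevance_total) / len(relevance_total)


def mrr_example(relevance_total):
    # Mean of 1 / rank of the first relevant result (0 if there is none).
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)


toy = [
    [True, False, False],   # relevant at rank 1 -> 1
    [False, False, True],   # relevant at rank 3 -> 1/3
    [False, False, False],  # no relevant result -> 0
    [False, True, False],   # relevant at rank 2 -> 1/2
]

print(hit_rate_example(toy))  # 0.75
print(mrr_example(toy))       # (1 + 1/3 + 0 + 1/2) / 4, about 0.458
```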


In [29]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['text_id']
    text_results = full_text_search(q['question'])
    hits = text_results.get('hits', {}).get('hits', [])
    relevance = [doc['_source']['id'] == doc_id for doc in hits]
    relevance_total.append(relevance)

100%|██████████| 240/240 [00:01<00:00, 148.43it/s]


In [30]:
hit_rate(relevance_total), mrr(relevance_total)

(0.775, 0.5180606893106892)

## Hyperparameter Optimization


In [31]:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from typing import List, Dict
import numpy as np
from sklearn.model_selection import ParameterGrid


In [32]:
df_val = df_questions[:100]
df_test = df_questions[100:]

In [33]:
search_space = {
    'title_boost': hp.uniform('title_boost', 1.0, 5.0),
    'content_boost': hp.uniform('content_boost', 1.0, 3.0),
    'minimum_should_match': hp.choice('minimum_should_match', ["50%", "75%", "90%"])
}

In [34]:
def search_with_params(query, params):
    # Same multi_match query as full_text_search, with the tunable
    # options (fuzziness, operator, size) taken from params
    search_query = {
        "size": params["size"],
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["text^3", "section", "title"],
                "type": "best_fields",
                "fuzziness": params["fuzziness"],
                "operator": params["operator"]
            }
        }
    }
    return execute_search(search_query)


def evaluate_results(results, doc_id):
    # Returns 1 if the expected document is among the hits, else 0
    hits = results["hits"]["hits"] if results is not None else []
    return int(any(hit["_source"].get("id") == doc_id for hit in hits))


def mean_hit_rate(df, params):
    # Average hit rate over all questions in df for one parameter set
    scores = [
        evaluate_results(search_with_params(row["question"], params), row["text_id"])
        for _, row in df.iterrows()
    ]
    return np.mean(scores)


# Hyperparameter search function (exhaustive grid search)
def hyperparameter_optimization(df_questions):
    # Define hyperparameters to optimize
    param_grid = {
        'fuzziness': ['AUTO', '1', '2'],
        'operator': ['AND', 'OR'],
        'size': [5, 10, 20]
    }

    best_score = 0
    best_params = {}

    for params in ParameterGrid(param_grid):
        avg_score = mean_hit_rate(df_questions, params)

        if avg_score > best_score:
            best_score = avg_score
            best_params = params

    return best_params, best_score

In [35]:
# Tune on the validation split only, so the test split stays unseen
best_hyperparams, best_score = hyperparameter_optimization(df_val)

# Output the best hyperparameters
print("Best Hyperparameters:", best_hyperparams)
print("Best Score:", best_score)

In [49]:
# Apply the best parameters to the test set
print("Test hit rate:", mean_hit_rate(df_test, best_hyperparams))

In [50]:
# with open("../../data/best_hyperparameters_elasticsearch.json", "w") as json_file:
#     json.dump(best_params, json_file)

# print("Best hyperparameters saved to 'best_hyperparameters.json':", best_params)


In [35]:
gt_val = df_val.to_dict(orient='records')

In [36]:
def minsearch_search_optimized(query, boost):
    # boost = {'text': 3.0, 'section': 0.5}
    
    results = index.search(
        query=query,
        filter_dict = {},
        boost_dict=boost,
        num_results=5)

    return results


In [None]:
boost = {'text': best['boost']}
         
evaluate(gt_val, lambda q: minsearch_search_optimized(q['question'], boost))
# check the score obtained with the best hyperparameters

A little bit better :)

## RAG Evaluation

In [38]:
prompt1_template = """
You are an expert evaluator for a RAG system.
Your task is to analyze the relevance of the generated answer compared to the original answer provided.
Based on the relevance and similarity of the generated answer to the original answer, you will classify
it as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Original Answer: {answer_orig}
Generated Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the original
answer and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()

prompt2_template = """
You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
Your task is to analyze the relevance of the generated answer to the given question.
Based on the relevance of the generated answer, you will classify it
as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".

Here is the data for evaluation:

Question: {question}
Generated Answer: {answer_llm}

Please analyze the content and context of the generated answer in relation to the question
and provide your evaluation in parsable JSON without using code blocks:

{{
  "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
  "Explanation": "[Provide a brief explanation for your evaluation]"
}}
""".strip()


In [None]:
len(ground_truth) 

In [None]:
ground_truth[0]

In [41]:
record = ground_truth[0]
question = record['question']
answer_llm = rag(question)

In [None]:
print(answer_llm)

In [None]:
prompt = prompt2_template.format(question = question , answer_llm = answer_llm)
print(prompt)

In [None]:
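# Send the evaluation prompt straight to the model: it is not a RAG query,
# so no retrieval step is needed here
response = client.chat(model="llama2", messages=[{"role": "user", "content": prompt}])
relevance = response['message']['content'].strip()

print(relevance)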

In [None]:
for record in tqdm(ground_truth):
    print(record)

In [None]:
evaluations = []

for record in tqdm(ground_truth):
    question = record['question']
    answer_llm = rag(question)
    
    prompt = prompt2_template.format(question=question, answer_llm=answer_llm)
    # The judge prompt goes directly to the model, without a retrieval step
    response = client.chat(model="llama2", messages=[{"role": "user", "content": prompt}])
    relevance = response['message']['content'].strip()
    evaluations.append((record['question'], answer_llm, relevance))

In [None]:
evaluations[0]

In [48]:
df_eval = pd.DataFrame(evaluations, columns=['Question', 'Response', 'Evaluation'])

In [None]:
df_eval

In [50]:
import re 

def categorize_evaluation(text):
    if re.search(r'"NON_RELEVANT"', text):
        return "NON_RELEVANT"
    elif re.search(r'"PARTLY_RELEVANT"', text):
        return "PARTLY_RELEVANT"
    elif re.search(r'"RELEVANT"', text):
        return "RELEVANT"
    else:
        return "UNKNOWN"

df_eval['Category'] = df_eval['Evaluation'].apply(categorize_evaluation)

category_counts = df_eval['Category'].value_counts()
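
The regex approach above can misfire when a label is quoted inside the explanation text. Since the prompt asks for parsable JSON, an alternative is to parse the response and read the "Relevance" field directly. This is a minimal sketch; `parse_evaluation` is a hypothetical helper, and it assumes the model returns a single JSON object, possibly surrounded by extra prose.

```python
import json
import re

VALID_LABELS = {"NON_RELEVANT", "PARTLY_RELEVANT", "RELEVANT"}


def parse_evaluation(text):
    """Extract the "Relevance" label from a judge response; return "UNKNOWN" on failure."""
    # Grab the outermost {...} span, since the model may wrap the JSON in extra prose.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        try:
            label = json.loads(match.group(0)).get("Relevance")
            if label in VALID_LABELS:
                return label
        except (json.JSONDecodeError, AttributeError, TypeError):
            # Malformed JSON or an unexpected structure: fall through to UNKNOWN.
            pass
    return "UNKNOWN"


print(parse_evaluation('Sure! {"Relevance": "PARTLY_RELEVANT", "Explanation": "ok"}'))
# PARTLY_RELEVANT
```

It could be applied with `df_eval['Evaluation'].apply(parse_evaluation)`, falling back to the regex version for responses that come back as "UNKNOWN".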

In [None]:
category_counts

In [None]:
normalized_counts = df_eval['Category'].value_counts(normalize=True)
normalized_counts