The purpose of this notebook is to create a ground-truth dataset of questions that will later be used to compute evaluation metrics for the LLM.

# Import libraries and load the book in JSON format

In [1]:
import json
import hashlib
import ollama
import pandas as pd
from tqdm.auto import tqdm
import pickle

# To start Ollama and download the model, run in a console:
# ollama serve
# ollama pull llama2

In [2]:
with open('../data/parsed_book.json', 'r') as f_in:
    book_raw = json.load(f_in)

In [3]:
book_raw[0]

{'chapter': 'CHAPTER 1',
 'title': 'Machine Learning Roles and the Interview Process',
 'content': [{'section': 'Overview of This Book',
   'text': 'In the first part of this chapter, I’ll walk through the structure of this book. Then, I’ll discuss the various job titles and roles that use ML skills in industry. 1 I’ll also clarify the responsibilities of various job titles, such as data scientist, machine learning engineer, and so on, as this is a common point of confusion for job seekers. These will be illustrated with an ML skills matrix and ML lifecycle that will be referenced throughout the book. The second part of this chapter walks through the interview process, from beginning to end. I’ve mentored candidates who appreciated this overview since online resources often focus on specific pieces of the interview but not how they all connect together and result in an offer. Especially for new graduates 2 and readers coming from different industries, this chapter helps get everyone on

Flatten the JSON into a list of dicts, one per section, carrying the chapter and title on each record.

In [4]:
documents = []

for chapter in book_raw:
    chapter_name = chapter['chapter']
    title = chapter['title']

    for doc in chapter['content']:
        new_doc = {
            'chapter': chapter_name,
            'title': title,
            'section': doc['section'],
            'text': doc['text']
        }
        documents.append(new_doc) 


In [5]:
documents[0:2]

[{'chapter': 'CHAPTER 1',
  'title': 'Machine Learning Roles and the Interview Process',
  'section': 'Overview of This Book',
  'text': 'In the first part of this chapter, I’ll walk through the structure of this book. Then, I’ll discuss the various job titles and roles that use ML skills in industry. 1 I’ll also clarify the responsibilities of various job titles, such as data scientist, machine learning engineer, and so on, as this is a common point of confusion for job seekers. These will be illustrated with an ML skills matrix and ML lifecycle that will be referenced throughout the book. The second part of this chapter walks through the interview process, from beginning to end. I’ve mentored candidates who appreciated this overview since online resources often focus on specific pieces of the interview but not how they all connect together and result in an offer. Especially for new graduates 2 and readers coming from different industries, this chapter helps get everyone on the same p

In [6]:
# Optional: restrict the run to a subset of the documents

# documents = [d for d in documents if d['chapter'] == 'CHAPTER 1']

# Generate ids


In [7]:
generated_ids = {}

def generate_document_id(doc):
    # Hash the chapter, the title, and the first 10 characters of the text
    combined = f"{doc['chapter']}-{doc['title']}-{doc['text'][:10]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    base_id = hash_hex[:10]

    if base_id in generated_ids:
        # Collision: append a counter and advance it under the base id,
        # so a third duplicate gets "-2" instead of repeating "-1"
        counter = generated_ids[base_id]
        generated_ids[base_id] = counter + 1
        return f"{base_id}-{counter}"

    generated_ids[base_id] = 1
    return base_id

for doc in documents:
    doc['id'] = generate_document_id(doc)
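
The hash only covers the chapter, the title, and the first ten characters of the text, so two sections of the same chapter that open with the same words share a digest and need a suffix. A minimal sketch of that situation on made-up records (the `make_id` helper and the sample texts are hypothetical, not part of this notebook):

```python
import hashlib

def make_id(doc, seen):
    """Return a 10-char MD5-based id, adding a numeric suffix on collisions."""
    combined = f"{doc['chapter']}-{doc['title']}-{doc['text'][:10]}"
    base_id = hashlib.md5(combined.encode()).hexdigest()[:10]
    count = seen.get(base_id, 0)
    seen[base_id] = count + 1
    # The first occurrence keeps the bare hash; later ones get -1, -2, ...
    return base_id if count == 0 else f"{base_id}-{count}"

seen = {}
docs = [
    {'chapter': 'CHAPTER 1', 'title': 'T', 'text': 'In this section we cover A.'},
    {'chapter': 'CHAPTER 1', 'title': 'T', 'text': 'In this section we cover B.'},
    {'chapter': 'CHAPTER 1', 'title': 'T', 'text': 'In this section we cover C.'},
]
ids = [make_id(d, seen) for d in docs]
print(ids)
```

All three sample texts start with the same ten characters, so they share one base hash and only the suffix tells them apart.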


In [8]:
documents[0:5]

[{'chapter': 'CHAPTER 1',
  'title': 'Machine Learning Roles and the Interview Process',
  'section': 'Overview of This Book',
  'text': 'In the first part of this chapter, I’ll walk through the structure of this book. Then, I’ll discuss the various job titles and roles that use ML skills in industry. 1 I’ll also clarify the responsibilities of various job titles, such as data scientist, machine learning engineer, and so on, as this is a common point of confusion for job seekers. These will be illustrated with an ML skills matrix and ML lifecycle that will be referenced throughout the book. The second part of this chapter walks through the interview process, from beginning to end. I’ve mentored candidates who appreciated this overview since online resources often focus on specific pieces of the interview but not how they all connect together and result in an offer. Especially for new graduates 2 and readers coming from different industries, this chapter helps get everyone on the same p

In [9]:
with open('../data/documents_with_ids.json', 'wt') as f_out:
    json.dump(documents, f_out, indent=2, ensure_ascii=False)

# Ollama

In [10]:
client = ollama.Client()

Prompt parameters used:
- <<SYS>>...<</SYS>>: Defines the behavioural rules of the model.

- [INST]...[/INST]: Defines the instruction or the task that the model has to follow.
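
As a rough illustration of how the two markers fit together, the sketch below assembles a single-turn Llama 2 chat prompt by hand. The `build_llama2_prompt` helper is hypothetical, and the layout follows the commonly documented Llama 2 chat template rather than anything specified in this notebook.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Wrap a system message and a user instruction in Llama 2 chat markers.

    The beginning-of-sequence token is left out on purpose, on the
    assumption that the serving layer adds it.
    """
    return f"[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n{user.strip()} [/INST]"

example = build_llama2_prompt(
    system="You are an interviewer for a data scientist position.",
    user="Generate five brief questions about the record.",
)
print(example)
```

When the request goes through Ollama's chat API, the server applies the model's own template to the messages, so another option is to send the behavioural rules as a separate message with the role `system` instead of embedding the markers in the text.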

In [11]:
prompt = """
<<SYS>>
You are an interviewer preparing for technical interviews for a data scientist position.
Make sure that the questions are relevant to the data scientist role, avoid any question related to any website, figure, image, table, or reference to other chapters in the book. 
Do not include any form of enumeration or numbering.
Do not include any form of introduction to the questions. Return ONLY the questions.
Ensure each question is no longer than 15 words each.
Don't ask for examples of things. 
<</SYS>>

The record:

chapter: {chapter}
title: {title}
text: {text}

[INST] Generate only FIVE brief questions in a free-form style without any introduction before: [/INST]
"""


Observations:

There is a lot of room for improvement in the prompt, but I left it as is: the focus of this project is the performance of the chatbot, and the model already produces a decent set of questions.

In [12]:
def generate_questions(doc):
    message_content = prompt.format(chapter=doc['chapter'], title=doc['title'], text=doc['text'])
    
    response = client.chat(model="llama2", messages=[{"role": "user", "content": message_content}])
    
    if 'message' in response and 'content' in response['message']:
        content = response['message']['content']
        
        return content.strip()  
    return ""

In [13]:
results = {}

for doc in tqdm(documents): 
    doc_id = doc['id']
    questions = generate_questions(doc) 
    results[doc_id] = questions 


  0%|          | 0/48 [00:00<?, ?it/s]

In [14]:
print(f'results = {results}')

results = {'86fd49a66d': '1. How do you evaluate the performance of a machine learning model?\n2. What is your approach to handling missing data in a dataset?\n3. Can you explain the difference between supervised and unsupervised learning?\n4. How do you ensure interpretability of a complex machine learning model?\n5. Can you describe your experience with distributed computing technologies like Hadoop or Spark?', '9a2356679c': '1. How have advances in distributed and parallel computing impacted the availability and accessibility of large datasets for machine learning research?\n2. What are some common job titles in the field of machine learning, and how have they evolved over time?\n3. How has the increasing recognition of machine learning and related topics from the broader population affected the job market for these roles?\n4. Can you describe a specific example of how a generative AI model, such as ChatGPT, was able to perform a task that was previously thought to be the exclusive 

# Parse questions

In this version of the project, the Llama 2 model served by Ollama is used and its output is parsed into the desired JSON format, but the results are not optimal.

The problems I faced are:
- The model enumerates the questions anyway.
- For some iterations the model produces an intro that is later parsed as a separate question.


In future versions of the project, results could be improved by adjusting model parameters such as the temperature, the maximum number of generated tokens (max_tokens), the frequency penalty, and the presence penalty. Further tuning of the prompt may also lead to more accurate outputs.
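
A sketch of how such sampling parameters could be passed, assuming the `options` argument of the Ollama Python client's `chat` method. In Ollama the token limit is called `num_predict` rather than `max_tokens`. The values below are illustrative placeholders, not tuned settings, and `generate_with_options` is a hypothetical helper.

```python
# Illustrative values only; none of these were tuned in this project.
GENERATION_OPTIONS = {
    "temperature": 0.2,        # lower values make the output more deterministic
    "num_predict": 200,        # Ollama's name for the maximum number of generated tokens
    "frequency_penalty": 0.5,  # penalize tokens in proportion to how often they appeared
    "presence_penalty": 0.5,   # penalize tokens that have appeared at all
}

def generate_with_options(client, prompt_text, model="llama2", options=None):
    """Send one user message and return the stripped reply text."""
    response = client.chat(
        model=model,
        messages=[{"role": "user", "content": prompt_text}],
        options=options or GENERATION_OPTIONS,
    )
    return response["message"]["content"].strip()
```

Taking the client as an argument keeps the helper easy to test with a stand-in object before pointing it at a running Ollama server.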


In [15]:
data = []

for text_id, questions in results.items():
    # Split by new lines or bullet points
    questions_list = questions.replace('•', '\n').split('\n')
    
    cleaned_questions = []
    for question in questions_list:
        # Clean up numbers, dots, and any extra spaces
        cleaned_question = question.lstrip('0123456789. ').strip()
        if cleaned_question:  
            cleaned_questions.append(cleaned_question)
    
    # Add the cleaned questions to your data list
    for question in cleaned_questions:
        data.append({
            'question': question,
            'text_id': text_id
        })


df = pd.DataFrame(data, columns=['question', 'text_id'])
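
One way to make the parsing less sensitive to both problems listed above is to strip the enumeration with a regular expression and keep only lines that end in a question mark, which drops introductory sentences. A minimal sketch, assuming every real question ends with `?` (the `parse_questions` helper and the sample output are hypothetical):

```python
import re

# Optional leading "1.", "2)", "-", "*" or "•", with surrounding whitespace.
ENUM_PREFIX = re.compile(r'^\s*(?:\d+[.)]|[-*•])?\s*')

def parse_questions(raw: str) -> list[str]:
    """Split model output into questions, dropping numbering and intro lines."""
    questions = []
    for line in raw.replace('•', '\n').split('\n'):
        cleaned = ENUM_PREFIX.sub('', line, count=1).strip()
        # Intro sentences and empty lines do not end with a question mark.
        if cleaned.endswith('?'):
            questions.append(cleaned)
    return questions

sample = (
    "Sure! Here are five questions:\n"
    "1. How do you handle missing data?\n"
    "2) What is regularization?\n"
    "• Why use cross-validation?"
)
print(parse_questions(sample))
```

The trade-off is that a genuine question the model forgot to punctuate would be dropped as well, so the filter is worth checking against the per-id counts.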


In [16]:
# Save cleaned questions
with open('../data/ground_truth_data.bin', 'wb') as f_out:
    pickle.dump(data, f_out)

In [17]:
for record in documents:
    df.loc[df['text_id'] == record['id'], 'chapter'] = record['chapter']
    df.loc[df['text_id'] == record['id'], 'title'] = record['title']
    df.loc[df['text_id'] == record['id'], 'section'] = record['section']
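
The same metadata lookup can also be expressed as a single join instead of a loop. A sketch on toy data, assuming each document `id` is unique (the sample frames are made up):

```python
import pandas as pd

questions = pd.DataFrame({
    'question': ['Q1?', 'Q2?', 'Q3?'],
    'text_id': ['aaa', 'aaa', 'bbb'],
})
docs = [
    {'id': 'aaa', 'chapter': 'CHAPTER 1', 'title': 'T1', 'section': 'S1', 'text': '...'},
    {'id': 'bbb', 'chapter': 'CHAPTER 2', 'title': 'T2', 'section': 'S2', 'text': '...'},
]

# Keep only the metadata columns and rename the key to match the questions frame.
meta = (
    pd.DataFrame(docs)[['id', 'chapter', 'title', 'section']]
    .rename(columns={'id': 'text_id'})
)

# A left join keeps the row order of the questions frame;
# validate raises if an id appears more than once in the metadata.
enriched = questions.merge(meta, on='text_id', how='left', validate='many_to_one')
print(enriched)
```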


In [18]:
df.head(10)

Unnamed: 0,question,text_id,chapter,title,section
0,How do you evaluate the performance of a machi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
1,What is your approach to handling missing data...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
2,Can you explain the difference between supervi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
3,How do you ensure interpretability of a comple...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
4,Can you describe your experience with distribu...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
5,How have advances in distributed and parallel ...,9a2356679c,CHAPTER 1,Machine Learning Roles and the Interview Process,A Brief History of Machine Learning and Data S...
6,What are some common job titles in the field o...,9a2356679c,CHAPTER 1,Machine Learning Roles and the Interview Process,A Brief History of Machine Learning and Data S...
7,How has the increasing recognition of machine ...,9a2356679c,CHAPTER 1,Machine Learning Roles and the Interview Process,A Brief History of Machine Learning and Data S...
8,Can you describe a specific example of how a g...,9a2356679c,CHAPTER 1,Machine Learning Roles and the Interview Process,A Brief History of Machine Learning and Data S...
9,How have advances in Kubernetes and other ML i...,9a2356679c,CHAPTER 1,Machine Learning Roles and the Interview Process,A Brief History of Machine Learning and Data S...


In [19]:
id_counts = df['text_id'].value_counts()
id_counts 

text_id
195ee8d52d      6
114486e1a1      6
914e36400a      6
86fd49a66d      5
535fe561d2      5
f0d1a5d7be      5
6d22154063      5
9a2356679c      5
2337809957      5
6e80eb677b      5
e3868ef048      5
59be39571f      5
c80e9a7eaf      5
072c02694c      5
331444d407      5
61153e0cc4      5
d0f8681a67      5
0ce98a8968      5
a6426309ff      5
a89864b52f      5
4ef8a420bc      5
005a577345      5
99eacbbf97      5
0720f39c5b      5
0720f39c5b-1    5
e5aff76113      5
6d7e81991e      5
b00775576b      5
16c46692ba      5
ae2aa1ca31      5
dbd8460895      5
f45e93a39b      5
2e54c275a1      5
95d66e0b67      5
acad0ef561      5
9ebc3f2b8d      5
f6f741906a      5
f6620b37bc      5
052c470148      5
322f155e7d      5
e123c4c971      5
995cfd588e      5
561ddf5f64      5
6dadcd84bd      5
22eb7b9b30      5
1026686599      5
dee0126444      5
2ca59d8bf2      5
Name: count, dtype: int64

In [20]:
df_cleaned = df.copy()

In [21]:
repeated_ids = df_cleaned[df_cleaned['text_id'] == 'f6f741906a']
repeated_ids

Unnamed: 0,question,text_id,chapter,title,section
183,How do you explain technical and nontechnical ...,f6f741906a,CHAPTER 7,Behavioral Interviews,Behavioral Interview Questions and Responses
184,What strategies have you used to connect with ...,f6f741906a,CHAPTER 7,Behavioral Interviews,Behavioral Interview Questions and Responses
185,"How do you provide context for acronyms, techn...",f6f741906a,CHAPTER 7,Behavioral Interviews,Behavioral Interview Questions and Responses
186,In what situations might you adjust the level ...,f6f741906a,CHAPTER 7,Behavioral Interviews,Behavioral Interview Questions and Responses
187,Can you give an example of how you provided co...,f6f741906a,CHAPTER 7,Behavioral Interviews,Behavioral Interview Questions and Responses


For the ids with six rows, the first row is just the introduction the model wrote before the questions, even though the prompt explicitly asks it not to include one.

Luckily, this only happens in a few iterations.

In [22]:
ids_with_6_rows = id_counts[id_counts == 6].index

rows_to_keep = []

for id_ in ids_with_6_rows:
    rows_for_id = df_cleaned[df_cleaned['text_id'] == id_]
    rows_to_keep.append(rows_for_id.iloc[1:])

remaining_rows = pd.concat(rows_to_keep)

df_remaining = df_cleaned[~df_cleaned['text_id'].isin(ids_with_6_rows)]

df_final = pd.concat([df_remaining, remaining_rows], ignore_index=True)
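
Because the unwanted introduction is the first row of an affected id, a shorter way to get the same cleanup is to keep the last five rows per id. A sketch on toy data, assuming the intro always comes first and exactly five questions follow it (the sample frame is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'question': ['Here are five questions:', 'a?', 'b?', 'c?', 'd?', 'e?',
                 'f?', 'g?', 'h?', 'i?', 'j?'],
    'text_id': ['x'] * 6 + ['y'] * 5,
})

# tail(5) keeps the last five rows of each group and returns them
# in their original row order, so no re-sorting is needed afterwards.
trimmed = toy.groupby('text_id', sort=False).tail(5).reset_index(drop=True)
print(trimmed['text_id'].value_counts())
```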

In [23]:
df_final

Unnamed: 0,question,text_id,chapter,title,section
0,How do you evaluate the performance of a machi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
1,What is your approach to handling missing data...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
2,Can you explain the difference between supervi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
3,How do you ensure interpretability of a comple...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
4,Can you describe your experience with distribu...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
...,...,...,...,...,...
235,How do you handle class imbalance in image rec...,914e36400a,CHAPTER 3,Technical Interview: Machine Learning Algorithms,Computer Vision Algorithms
236,Can you explain the difference between early s...,914e36400a,CHAPTER 3,Technical Interview: Machine Learning Algorithms,Computer Vision Algorithms
237,How do you ensure that the thumbnails used for...,914e36400a,CHAPTER 3,Technical Interview: Machine Learning Algorithms,Computer Vision Algorithms
238,What are some common techniques used in data a...,914e36400a,CHAPTER 3,Technical Interview: Machine Learning Algorithms,Computer Vision Algorithms


Sort the rows by chapter number again, since the concatenation above moved the six-row ids to the end.

In [24]:
df_final['chapter_number'] = df_final['chapter'].str.extract(r'CHAPTER (\d+)').astype(int)

df_final_sorted = df_final.sort_values(by='chapter_number').drop(columns=['chapter_number']).reset_index(drop=True)

df_final_sorted

Unnamed: 0,question,text_id,chapter,title,section
0,How do you evaluate the performance of a machi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
1,What is your approach to handling missing data...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
2,Can you explain the difference between supervi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
3,How do you ensure interpretability of a comple...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
4,Can you describe your experience with distribu...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
...,...,...,...,...,...
235,Can you share an instance where you tailored y...,1026686599,CHAPTER 9,Post-Interview and Follow-up,What to Do Between Interviews
236,What is your approach to networking and buildi...,1026686599,CHAPTER 9,Post-Interview and Follow-up,What to Do Between Interviews
237,How do you handle rejection during the job sea...,1026686599,CHAPTER 9,Post-Interview and Follow-up,What to Do Between Interviews
238,Can you share an example of a thank-you note y...,22eb7b9b30,CHAPTER 9,Post-Interview and Follow-up,Post-Interview Steps
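
By default `sort_values` does not guarantee a stable order, so rows that share a chapter number may end up reordered relative to each other. If keeping each id's questions together in their original order matters, a stable sort is one option. A sketch on toy data (the frame is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'question': ['q1?', 'q2?', 'q3?', 'q4?'],
    'chapter': ['CHAPTER 10', 'CHAPTER 2', 'CHAPTER 10', 'CHAPTER 2'],
})

# expand=False returns a Series, which converts cleanly to integers.
toy['chapter_number'] = (
    toy['chapter'].str.extract(r'CHAPTER (\d+)', expand=False).astype(int)
)

# kind='stable' keeps rows with equal chapter numbers in their original relative order.
ordered = (
    toy.sort_values(by='chapter_number', kind='stable')
       .drop(columns=['chapter_number'])
       .reset_index(drop=True)
)
print(ordered)
```

Sorting on the extracted integer rather than on the `chapter` string also keeps `CHAPTER 10` after `CHAPTER 2`.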


In [25]:
df_final_sorted['text_id'].value_counts()

text_id
86fd49a66d      5
9a2356679c      5
6d22154063      5
61153e0cc4      5
535fe561d2      5
f0d1a5d7be      5
6e80eb677b      5
2337809957      5
e3868ef048      5
59be39571f      5
c80e9a7eaf      5
072c02694c      5
195ee8d52d      5
114486e1a1      5
d0f8681a67      5
331444d407      5
a89864b52f      5
a6426309ff      5
4ef8a420bc      5
0ce98a8968      5
914e36400a      5
99eacbbf97      5
0720f39c5b-1    5
005a577345      5
0720f39c5b      5
e5aff76113      5
16c46692ba      5
b00775576b      5
6d7e81991e      5
dbd8460895      5
ae2aa1ca31      5
f45e93a39b      5
9ebc3f2b8d      5
2e54c275a1      5
95d66e0b67      5
acad0ef561      5
052c470148      5
f6f741906a      5
f6620b37bc      5
322f155e7d      5
995cfd588e      5
e123c4c971      5
6dadcd84bd      5
561ddf5f64      5
22eb7b9b30      5
2ca59d8bf2      5
dee0126444      5
1026686599      5
Name: count, dtype: int64

In [26]:
pd.set_option('display.max_colwidth', None)

print(df_final_sorted['question'].head())

0                                  How do you evaluate the performance of a machine learning model?
1                                      What is your approach to handling missing data in a dataset?
2                      Can you explain the difference between supervised and unsupervised learning?
3                           How do you ensure interpretability of a complex machine learning model?
4    Can you describe your experience with distributed computing technologies like Hadoop or Spark?
Name: question, dtype: object


In [27]:
df_final_sorted.to_csv('../data/ground_truth_data.csv', index=False)