The purpose of this notebook is to create a dataset of ground truth questions that will later be used to compute evaluation metrics for the LLM.

# Import libraries and load the book in JSON format

In [1]:
import json
import hashlib
import ollama
import pandas as pd
from tqdm.auto import tqdm
import pickle

# To start Ollama and download the model, run in a console:
# ollama serve
# ollama pull llama2

In [4]:
with open('../../data/parsed_book.json', 'r', encoding='utf-8') as f_in:
    book_raw = json.load(f_in)

In [5]:
book_raw[0]

{'chapter': 'CHAPTER 1',
 'title': 'Machine Learning Roles and the Interview Process',
 'content': [{'section': 'Overview of This Book',
   'text': 'In the first part of this chapter, I’ll walk through the structure of this book. Then, I’ll discuss the various job titles and roles that use ML skills in industry. 1 I’ll also clarify the responsibilities of various job titles, such as data scientist, machine learning engineer, and so on, as this is a common point of confusion for job seekers. These will be illustrated with an ML skills matrix and ML lifecycle that will be referenced throughout the book. The second part of this chapter walks through the interview process, from beginning to end. I’ve mentored candidates who appreciated this overview since online resources often focus on specific pieces of the interview but not how they all connect together and result in an offer. Especially for new graduates 2 and readers coming from different industries, this chapter helps get everyone on

Flatten the JSON into a list of dicts, one per section, copying the chapter and title into each one.

In [6]:
documents = []

for chapter in book_raw:
    chapter_name = chapter['chapter']
    title = chapter['title']

    for doc in chapter['content']:
        new_doc = {
            'chapter': chapter_name,
            'title': title,
            'section': doc['section'],
            'text': doc['text']
        }
        documents.append(new_doc) 


In [7]:
documents[0:2]

[{'chapter': 'CHAPTER 1',
  'title': 'Machine Learning Roles and the Interview Process',
  'section': 'Overview of This Book',
  'text': 'In the first part of this chapter, I’ll walk through the structure of this book. Then, I’ll discuss the various job titles and roles that use ML skills in industry. 1 I’ll also clarify the responsibilities of various job titles, such as data scientist, machine learning engineer, and so on, as this is a common point of confusion for job seekers. These will be illustrated with an ML skills matrix and ML lifecycle that will be referenced throughout the book. The second part of this chapter walks through the interview process, from beginning to end. I’ve mentored candidates who appreciated this overview since online resources often focus on specific pieces of the interview but not how they all connect together and result in an offer. Especially for new graduates 2 and readers coming from different industries, this chapter helps get everyone on the same p

In [8]:
# Optional: keep only a subset of the documents (here, CHAPTER 1)

# documents = [d for d in documents if d['chapter'] == 'CHAPTER 1']

# Generate ids


In [9]:
generated_ids = {}

def generate_document_id(doc):
    combined = f"{doc['chapter']}-{doc['title']}-{doc['text'][:10]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    base_id = hash_hex[:10]

    if base_id in generated_ids:
        # Collision: append a suffix, and track the counter under the base id
        # so that a third or later collision gets a fresh suffix.
        counter = generated_ids[base_id]
        generated_ids[base_id] = counter + 1
        return f"{base_id}-{counter}"

    generated_ids[base_id] = 1
    return base_id

for doc in documents:
    doc['id'] = generate_document_id(doc)
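
The suffix logic matters because the hash input uses only the chapter, the title, and the first ten characters of the text, so two sections of the same chapter that open with the same words produce the same hash. A minimal, self-contained sketch of that behaviour on made-up records (the `toy_` names and the `make_id_generator` helper are hypothetical and not part of the notebook):

```python
import hashlib

def make_id_generator():
    seen = {}  # base id -> how many times it has been produced so far

    def generate(doc):
        combined = f"{doc['chapter']}-{doc['title']}-{doc['text'][:10]}"
        base_id = hashlib.md5(combined.encode()).hexdigest()[:10]
        count = seen.get(base_id, 0)
        seen[base_id] = count + 1
        # First occurrence keeps the bare hash, later ones get -1, -2, ...
        return base_id if count == 0 else f"{base_id}-{count}"

    return generate

# Three made-up sections whose texts share the same first ten characters.
toy_docs = [
    {'chapter': 'CHAPTER X', 'title': 'Toy', 'text': 'Same start, first section.'},
    {'chapter': 'CHAPTER X', 'title': 'Toy', 'text': 'Same start, second section.'},
    {'chapter': 'CHAPTER X', 'title': 'Toy', 'text': 'Same start, third section.'},
]

toy_generate = make_id_generator()
toy_ids = [toy_generate(d) for d in toy_docs]
print(toy_ids)
```

The first record keeps the bare hash and the later ones receive `-1` and `-2`, so all three ids stay distinct.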


In [10]:
documents[0:5]

[{'chapter': 'CHAPTER 1',
  'title': 'Machine Learning Roles and the Interview Process',
  'section': 'Overview of This Book',
  'text': 'In the first part of this chapter, I’ll walk through the structure of this book. Then, I’ll discuss the various job titles and roles that use ML skills in industry. 1 I’ll also clarify the responsibilities of various job titles, such as data scientist, machine learning engineer, and so on, as this is a common point of confusion for job seekers. These will be illustrated with an ML skills matrix and ML lifecycle that will be referenced throughout the book. The second part of this chapter walks through the interview process, from beginning to end. I’ve mentored candidates who appreciated this overview since online resources often focus on specific pieces of the interview but not how they all connect together and result in an offer. Especially for new graduates 2 and readers coming from different industries, this chapter helps get everyone on the same p

In [11]:
with open('../../data/documents_with_ids.json', 'wt', encoding='utf-8') as f_out:
    json.dump(documents, f_out, indent=2, ensure_ascii=False)

# Ollama

In [12]:
client = ollama.Client()

Prompt parameters used:
- <<SYS>>...<</SYS>>: Defines the behavioural rules of the model.

- [INST]...[/INST]: Defines the instruction or the task that the model has to follow.
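
Ollama's chat API also accepts a `system` role and applies the model's own prompt template to the messages, so one alternative to writing the tags by hand is to pass the behavioural rules as a system message. The following is only a sketch of that alternative, not what this notebook does; `build_messages` and its arguments are hypothetical names, and whether it improves the generated questions is untested.

```python
def build_messages(system_rules, record, instruction):
    """Split the prompt into a system message and a user message."""
    record_block = (
        f"chapter: {record['chapter']}\n"
        f"title: {record['title']}\n"
        f"text: {record['text']}"
    )
    return [
        {"role": "system", "content": system_rules},
        {"role": "user", "content": f"The record:\n\n{record_block}\n\n{instruction}"},
    ]

# Made-up record, only to show the resulting structure.
toy_record = {"chapter": "CHAPTER 1", "title": "Toy title", "text": "Toy text."}
toy_messages = build_messages(
    "You are an interviewer preparing technical interview questions.",
    toy_record,
    "Generate only FIVE brief questions.",
)

# The list could then be passed as client.chat(model="llama2", messages=toy_messages).
for message in toy_messages:
    print(message["role"], "->", message["content"][:40])
```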

In [13]:
prompt = """
<<SYS>>
You are an interviewer preparing for technical interviews for a data scientist position.
Make sure that the questions are relevant to the data scientist role, avoid any question related to any website, figure, image, table, or reference to other chapters in the book. 
Do not include any form of enumeration or numbering.
Do not include any form of introduction to the questions. Return ONLY the questions.
Ensure each question is no longer than 15 words each.
Don't ask for examples of things. 
<</SYS>>

The record:

chapter: {chapter}
title: {title}
text: {text}

[INST] Generate only FIVE brief questions in a free-form style without any introduction before: [/INST]
"""


Observations:

The prompt leaves a lot of room for improvement, but I left it as is: the project focuses on the performance of the chatbot, and the model already produces a decent set of questions.

In [14]:
def generate_questions(doc):
    message_content = prompt.format(chapter=doc['chapter'], title=doc['title'], text=doc['text'])
    
    response = client.chat(model="llama2", messages=[{"role": "user", "content": message_content}])
    
    if 'message' in response and 'content' in response['message']:
        content = response['message']['content']
        
        return content.strip()  
    return ""

In [15]:
results = {}

for doc in tqdm(documents): 
    doc_id = doc['id']
    questions = generate_questions(doc) 
    results[doc_id] = questions 


  0%|          | 0/48 [00:00<?, ?it/s]

In [16]:
print(f'results = {results}')

results = {'86fd49a66d': '1. How do you evaluate the performance of a machine learning model?\n2. Can you explain the difference between supervised and unsupervised learning?\n3. How do you handle missing data in a machine learning project?\n4. What is your approach to debugging a machine learning issue?\n5. Can you describe a time when you had to communicate complex machine learning concepts to a non-technical audience?', '9a2356679c': '1. What are some common challenges that data scientists face when working with large datasets?\n2. How have advances in distributed and parallel computing impacted the machine learning field?\n3. Can you explain the difference between a machine learning engineer and a product data scientist?\n4. How has the rise of generative AI applications like ChatGPT affected the job market for data scientists?\n5. What are some emerging trends in the field of machine learning that could potentially lead to new job titles or roles?', '6d22154063': '1. How do you en

# Parse questions

In this version of the project, the Llama2 model served by Ollama generates the questions and its output is parsed into the desired JSON format, but the results are not optimal.

The problems I faced are:
- The model enumerates the questions anyway, despite the prompt.
- In some iterations the model produces an introduction that is later parsed as a separate question.


In future versions of the project, results could be improved by tuning model parameters such as the temperature, the maximum number of generated tokens (max_tokens), the frequency penalty, and the presence penalty. Further refinement of the prompt may also lead to more accurate outputs.
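
With the Ollama Python client, such parameters are passed through the `options` argument of `client.chat`; the cap on generated tokens is called `num_predict` there rather than `max_tokens`. A minimal sketch that only builds the keyword arguments (the helper name and the values 0.2 and 200 are illustrative assumptions, not tuned settings):

```python
def build_chat_kwargs(message_content, temperature=0.2, max_new_tokens=200):
    """Collect the arguments for client.chat, including sampling options."""
    return {
        "model": "llama2",
        "messages": [{"role": "user", "content": message_content}],
        "options": {
            "temperature": temperature,     # lower values give less random output
            "num_predict": max_new_tokens,  # upper bound on generated tokens
        },
    }

toy_kwargs = build_chat_kwargs("Generate only FIVE brief questions.")
print(toy_kwargs["options"])

# With a running Ollama server this could be used as:
# response = client.chat(**toy_kwargs)
```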


In [17]:
data = []

for text_id, questions in results.items():
    questions_list = questions.replace('•', '\n').split('\n')
    
    cleaned_questions = []
    for question in questions_list:
        cleaned_question = question.lstrip('0123456789. ').strip()
        if cleaned_question:  
            cleaned_questions.append(cleaned_question)
    
    for question in cleaned_questions:
        data.append({
            'question': question,
            'text_id': text_id
        })


df = pd.DataFrame(data, columns=['question', 'text_id'])


In [19]:
with open('../../data/ground_truth_data.bin', 'wb') as f_out:
    pickle.dump(data, f_out)

In [20]:
for record in documents:
    df.loc[df['text_id'] == record['id'], 'chapter'] = record['chapter']
    df.loc[df['text_id'] == record['id'], 'title'] = record['title']
    df.loc[df['text_id'] == record['id'], 'section'] = record['section']
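
The loop above rescans the whole DataFrame three times per document. An equivalent approach is to look the metadata up by id before the DataFrame is built. A minimal sketch on made-up records (all `toy_` names are hypothetical), assuming every `text_id` matches exactly one document id:

```python
toy_documents = [
    {'id': 'aaa', 'chapter': 'CHAPTER 1', 'title': 'Title A', 'section': 'Sec 1', 'text': '...'},
    {'id': 'bbb', 'chapter': 'CHAPTER 2', 'title': 'Title B', 'section': 'Sec 2', 'text': '...'},
]
toy_data = [
    {'question': 'First question?', 'text_id': 'aaa'},
    {'question': 'Second question?', 'text_id': 'bbb'},
]

# Index the documents by id once, so each lookup is a dictionary access.
toy_meta_by_id = {d['id']: d for d in toy_documents}

toy_enriched = []
for row in toy_data:
    meta = toy_meta_by_id[row['text_id']]
    toy_enriched.append({
        **row,
        'chapter': meta['chapter'],
        'title': meta['title'],
        'section': meta['section'],
    })

print(toy_enriched[0])
```

Passing `toy_enriched` to `pd.DataFrame` would then give the same five columns as the loop.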


In [21]:
df.head(10)

,question,text_id,chapter,title,section
0,How do you evaluate the performance of a machi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
1,Can you explain the difference between supervi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
2,How do you handle missing data in a machine le...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
3,What is your approach to debugging a machine l...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
4,Can you describe a time when you had to commun...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
5,What are some common challenges that data scie...,9a2356679c,CHAPTER 1,Machine Learning Roles and the Interview Process,A Brief History of Machine Learning and Data S...
6,How have advances in distributed and parallel ...,9a2356679c,CHAPTER 1,Machine Learning Roles and the Interview Process,A Brief History of Machine Learning and Data S...
7,Can you explain the difference between a machi...,9a2356679c,CHAPTER 1,Machine Learning Roles and the Interview Process,A Brief History of Machine Learning and Data S...
8,How has the rise of generative AI applications...,9a2356679c,CHAPTER 1,Machine Learning Roles and the Interview Process,A Brief History of Machine Learning and Data S...
9,What are some emerging trends in the field of ...,9a2356679c,CHAPTER 1,Machine Learning Roles and the Interview Process,A Brief History of Machine Learning and Data S...


In [22]:
id_counts = df['text_id'].value_counts()
id_counts 

text_id
c80e9a7eaf      6
86fd49a66d      5
6d22154063      5
9a2356679c      5
535fe561d2      5
f0d1a5d7be      5
6e80eb677b      5
61153e0cc4      5
2337809957      5
e3868ef048      5
195ee8d52d      5
114486e1a1      5
59be39571f      5
072c02694c      5
331444d407      5
d0f8681a67      5
0ce98a8968      5
a89864b52f      5
a6426309ff      5
4ef8a420bc      5
914e36400a      5
005a577345      5
99eacbbf97      5
0720f39c5b      5
0720f39c5b-1    5
e5aff76113      5
6d7e81991e      5
b00775576b      5
16c46692ba      5
ae2aa1ca31      5
dbd8460895      5
f45e93a39b      5
2e54c275a1      5
95d66e0b67      5
acad0ef561      5
9ebc3f2b8d      5
f6f741906a      5
f6620b37bc      5
052c470148      5
322f155e7d      5
e123c4c971      5
995cfd588e      5
561ddf5f64      5
6dadcd84bd      5
22eb7b9b30      5
1026686599      5
dee0126444      5
2ca59d8bf2      5
Name: count, dtype: int64

In [23]:
df_cleaned = df.copy()

In [24]:
repeated_ids = df_cleaned[df_cleaned['text_id'] == 'f6f741906a']
repeated_ids

,question,text_id,chapter,title,section
181,What are some common pitfalls you see in job c...,f6f741906a,CHAPTER 7,Behavioral Interviews,Behavioral Interview Questions and Responses
182,How do you determine the appropriate level of ...,f6f741906a,CHAPTER 7,Behavioral Interviews,Behavioral Interview Questions and Responses
183,Can you share an example of how you used conte...,f6f741906a,CHAPTER 7,Behavioral Interviews,Behavioral Interview Questions and Responses
184,How do you ensure that you are using terminolo...,f6f741906a,CHAPTER 7,Behavioral Interviews,Behavioral Interview Questions and Responses
185,What is the most effective way to adjust the l...,f6f741906a,CHAPTER 7,Behavioral Interviews,Behavioral Interview Questions and Responses


When a document ends up with six rows instead of five, the first row is just the introduction the model added to its answer, even though the prompt explicitly asks for no introduction.

Luckily, this happens in only a few iterations.

In [25]:
ids_with_6_rows = id_counts[id_counts == 6].index

rows_to_keep = []

for id_ in ids_with_6_rows:
    rows_for_id = df_cleaned[df_cleaned['text_id'] == id_]
    rows_to_keep.append(rows_for_id.iloc[1:])

remaining_rows = pd.concat(rows_to_keep)

df_remaining = df_cleaned[~df_cleaned['text_id'].isin(ids_with_6_rows)]

df_final = pd.concat([df_remaining, remaining_rows], ignore_index=True)
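
Dropping the first row works when the introduction is the first line, but it relies on the row count. An alternative is to filter at parse time and keep only lines that end with a question mark. A minimal sketch, assuming every genuine question ends with `?` and no introduction does; `parse_questions` is a hypothetical helper and the sample string is made up:

```python
import re

def parse_questions(raw):
    """Return the lines of a model answer that look like questions."""
    questions = []
    for line in raw.replace('•', '\n').split('\n'):
        # Remove a leading "1.", "2)", "-" or "*" marker, if present.
        cleaned = re.sub(r'^\s*(?:\d+[.)]|[-*])\s*', '', line).strip()
        if cleaned.endswith('?'):
            questions.append(cleaned)
    return questions

toy_raw = (
    "Sure! Here are five questions:\n"
    "1. How do you handle missing data?\n"
    "2) What is overfitting?\n"
    "\n"
    "- Why use cross-validation?"
)

print(parse_questions(toy_raw))
```

The trade-off is that a question the model writes without a question mark would be lost, so the result is still worth checking against the expected count of five.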

In [26]:
df_final

,question,text_id,chapter,title,section
0,How do you evaluate the performance of a machi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
1,Can you explain the difference between supervi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
2,How do you handle missing data in a machine le...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
3,What is your approach to debugging a machine l...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
4,Can you describe a time when you had to commun...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
...,...,...,...,...,...
235,What is the most important information you sho...,c80e9a7eaf,CHAPTER 2,Machine Learning Job Application and Resume,"Additional Job Application Materials, Credenti..."
236,"Should you use a one-page or two-page resume, ...",c80e9a7eaf,CHAPTER 2,Machine Learning Job Application and Resume,"Additional Job Application Materials, Credenti..."
237,How can you make sure your resume stands out f...,c80e9a7eaf,CHAPTER 2,Machine Learning Job Application and Resume,"Additional Job Application Materials, Credenti..."
238,What are some common mistakes people make when...,c80e9a7eaf,CHAPTER 2,Machine Learning Job Application and Resume,"Additional Job Application Materials, Credenti..."


Sort by chapter number again, since the concatenation above moved the corrected rows to the end.

In [27]:
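# expand=False returns a Series, which can be cast and assigned directly
df_final['chapter_number'] = df_final['chapter'].str.extract(r'CHAPTER (\d+)', expand=False).astype(int)

# kind='stable' keeps the existing row order within each chapter
df_final_sorted = df_final.sort_values(by='chapter_number', kind='stable').drop(columns=['chapter_number']).reset_index(drop=True)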

df_final_sorted

,question,text_id,chapter,title,section
0,How do you evaluate the performance of a machi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
1,Can you explain the difference between supervi...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
2,How do you handle missing data in a machine le...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
3,What is your approach to debugging a machine l...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
4,Can you describe a time when you had to commun...,86fd49a66d,CHAPTER 1,Machine Learning Roles and the Interview Process,Overview of This Book
...,...,...,...,...,...
235,How do benefits like health and dental impact ...,dee0126444,CHAPTER 9,Post-Interview and Follow-up,Steps of the Offer Stage
236,Can you provide examples of non-base pay optio...,dee0126444,CHAPTER 9,Post-Interview and Follow-up,Steps of the Offer Stage
237,How can data scientists ensure they are contri...,2ca59d8bf2,CHAPTER 9,Post-Interview and Follow-up,First 30/60/90 Days of Your New ML Job
238,Can you share an experience where reaching out...,2ca59d8bf2,CHAPTER 9,Post-Interview and Follow-up,First 30/60/90 Days of Your New ML Job
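
The number is extracted before sorting because sorting the `chapter` strings directly would compare them character by character, so a label such as `CHAPTER 10` would sort before `CHAPTER 2`. A small sketch of the difference on made-up labels (the list and the helper name are hypothetical):

```python
import re

toy_chapters = ['CHAPTER 2', 'CHAPTER 10', 'CHAPTER 1']

def chapter_number(label):
    """Extract the integer that follows the word CHAPTER."""
    return int(re.search(r'CHAPTER (\d+)', label).group(1))

as_text = sorted(toy_chapters)
as_number = sorted(toy_chapters, key=chapter_number)

print(as_text)    # ['CHAPTER 1', 'CHAPTER 10', 'CHAPTER 2']
print(as_number)  # ['CHAPTER 1', 'CHAPTER 2', 'CHAPTER 10']
```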


In [28]:
df_final_sorted['text_id'].value_counts()

text_id
86fd49a66d      5
9a2356679c      5
6d22154063      5
61153e0cc4      5
535fe561d2      5
f0d1a5d7be      5
6e80eb677b      5
2337809957      5
e3868ef048      5
114486e1a1      5
195ee8d52d      5
59be39571f      5
072c02694c      5
c80e9a7eaf      5
331444d407      5
d0f8681a67      5
a89864b52f      5
a6426309ff      5
4ef8a420bc      5
0ce98a8968      5
914e36400a      5
005a577345      5
99eacbbf97      5
0720f39c5b      5
0720f39c5b-1    5
e5aff76113      5
6d7e81991e      5
b00775576b      5
16c46692ba      5
ae2aa1ca31      5
dbd8460895      5
f45e93a39b      5
95d66e0b67      5
acad0ef561      5
9ebc3f2b8d      5
2e54c275a1      5
f6f741906a      5
f6620b37bc      5
052c470148      5
322f155e7d      5
995cfd588e      5
e123c4c971      5
561ddf5f64      5
6dadcd84bd      5
22eb7b9b30      5
1026686599      5
dee0126444      5
2ca59d8bf2      5
Name: count, dtype: int64
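
The same check can be done programmatically, so that a run with stray rows is flagged without scanning the counts by eye. A minimal sketch on made-up rows (the `toy_` names are hypothetical; five questions per document is the number requested in the prompt):

```python
from collections import Counter

# Two made-up documents: 'aaa' has five questions, 'bbb' has one too many.
toy_rows = [{'question': f'Question {i}?', 'text_id': 'aaa'} for i in range(5)]
toy_rows += [{'question': f'Question {i}?', 'text_id': 'bbb'} for i in range(6)]

toy_counts = Counter(row['text_id'] for row in toy_rows)
toy_offenders = {text_id: n for text_id, n in toy_counts.items() if n != 5}

print(toy_offenders)  # {'bbb': 6}
```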

In [29]:
pd.set_option('display.max_colwidth', None)

print(df_final_sorted['question'].head())

0                                                      How do you evaluate the performance of a machine learning model?
1                                          Can you explain the difference between supervised and unsupervised learning?
2                                                         How do you handle missing data in a machine learning project?
3                                                          What is your approach to debugging a machine learning issue?
4    Can you describe a time when you had to communicate complex machine learning concepts to a non-technical audience?
Name: question, dtype: object


In [31]:
df_final_sorted.insert(0, 'id', df_final_sorted.index)

In [33]:
df_final_sorted.to_csv('../../data/ground_truth_data.csv', index=False)