In [2]:
!wget https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/refs/heads/main/01-intro/documents.json

--2025-06-08 20:17:34--  https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/refs/heads/main/01-intro/documents.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 658332 (643K) [text/plain]
Saving to: ‘documents.json’


2025-06-08 20:17:34 (245 MB/s) - ‘documents.json’ saved [658332/658332]



In [9]:
import json
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [10]:
documents = []
for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

In [11]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

# Set up IDs

In [12]:
import hashlib
# Hashing (e.g., MD5) is better suited to building document IDs that stay stable over time and consistent across systems.
def generate_document_id(doc):
    combined = f"{doc['course']} - {doc['question']} - {doc['text'][:10]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    document_id = hash_hex[:8]
    return document_id

In [13]:
for doc in documents:
    doc['id'] = generate_document_id(doc)

In [14]:
#n = len(documents)
#for i in range(n):
    #documents[i]['id'] = i

In [15]:
documents[3]

{'text': "You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.",
 'section': 'General course-related questions',
 'question': 'Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?',
 'course': 'data-engineering-zoomcamp',
 'id': '31d3f3a3'}

In [16]:
from collections import defaultdict

In [17]:
hashes = defaultdict(list)
for doc in documents:
    doc_id = doc['id']
    hashes[doc_id].append(doc)

In [18]:
len(hashes), len(documents)

(947, 948)

In [19]:
for k, values in hashes.items():
    if len(values) > 1:
        print(k, len(values))

2cc7916e 2


In [20]:
hashes['2cc7916e']

[{'text': "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '2cc7916e'},
 {'text': "They both do the same, it's just less typing from the script.",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '2cc7916e'}]
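
The two records above share the same course, question, and first 10 characters of `text`, so `generate_document_id` gives them the same ID. A minimal sketch of one possible variation, assuming it is acceptable for the IDs to change, hashes the full text instead. The name `generate_document_id_full` is hypothetical and not part of the original notebook.

```python
import hashlib

def generate_document_id_full(doc):
    # Hypothetical variant: hash the entire text, not only its first 10 characters
    combined = f"{doc['course']} - {doc['question']} - {doc['text']}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]

# The two colliding records from the output above (section field omitted)
question = 'Does it matter if we let the Python file create the server or if we run gunicorn directly?'
doc_a = {
    'course': 'machine-learning-zoomcamp',
    'question': question,
    'text': "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu",
}
doc_b = {
    'course': 'machine-learning-zoomcamp',
    'question': question,
    'text': "They both do the same, it's just less typing from the script.",
}

id_a = generate_document_id_full(doc_a)
id_b = generate_document_id_full(doc_b)
print(id_a, id_b)
```

An 8-character prefix can still collide by chance on a larger corpus, so the `hashes` check above remains worth running after any change to the ID scheme.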

In [21]:
with open('documents-with-ids.json', 'wt') as f_out:
    json.dump(documents, f_out, indent=2)

In [22]:
!head documents-with-ids.json

[
  {
    "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "course": "data-engineering-zoomcamp",
    "id": "911e9749"
  },
  {
    "text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites",


# Use LLM - Ground Truth Dataset Generation

In [23]:
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [24]:
import os
from groq import Groq

In [35]:
client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),  # read the key from the environment instead of hard-coding it
)

In [36]:
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Explain the importance of fast language models",
        }
    ],
    model="llama-3.3-70b-versatile",
)

print(chat_completion.choices[0].message.content)

Fast language models are crucial in the field of natural language processing (NLP) and have numerous applications in various industries. The importance of fast language models can be summarized as follows:

1. **Real-time Processing**: Fast language models enable real-time processing of language inputs, such as speech, text, or gestures. This is essential for applications like voice assistants, live captioning, and real-time language translation.
2. **Improved User Experience**: Quick response times provided by fast language models enhance the user experience in applications like chatbots, voice assistants, and language translation software. Users expect instant responses, and slow models can lead to frustration and disengagement.
3. **Scalability**: Fast language models can handle large volumes of data and process multiple requests concurrently, making them ideal for applications with high traffic or large user bases.
4. **Low Latency**: Fast language models minimize latency, which is

In [37]:
doc = documents[2]
prompt = prompt_template.format(**doc)  # ** unpacks the dict so its keys fill the {section}, {question}, and {text} placeholders

In [38]:
print(prompt)

You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.

The record:

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]


In [39]:
response = client.chat.completions.create(
    
    messages=[{"role": "user", "content": prompt}],
    model="llama-3.3-70b-versatile",
)
json_response = response.choices[0].message.content
json_response

'["What happens if I enroll in the course after it has already begun", "Can I still complete the coursework if I miss the initial registration period", "How does late registration affect my ability to participate in the course", "Will I be penalized for joining the course after the official start date", "What are the implications of delayed enrollment on my coursework submission"]'

In [41]:
json.loads(json_response)

['What happens if I enroll in the course after it has already begun',
 'Can I still complete the coursework if I miss the initial registration period',
 'How does late registration affect my ability to participate in the course',
 'Will I be penalized for joining the course after the official start date',
 'What are the implications of delayed enrollment on my coursework submission']
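
`json.loads` works here because the model returned a clean JSON list. Replies are not guaranteed to follow the prompt, so a more defensive parser can help when processing many documents. The sketch below is one possible approach, not part of the original notebook; `parse_questions` and its `expected` parameter are hypothetical names.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker that some models add despite the prompt

def parse_questions(raw, expected=5):
    """Parse a model reply into a list of question strings, or raise ValueError."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    try:
        questions = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError("reply is not a JSON list of strings")
    if len(questions) != expected:
        raise ValueError(f"expected {expected} questions, got {len(questions)}")
    return questions

clean = parse_questions('["q1", "q2", "q3", "q4", "q5"]')
fenced = parse_questions(FENCE + 'json\n["q1", "q2", "q3", "q4", "q5"]\n' + FENCE)
print(clean == fenced)
```

Raising `ValueError` lets the caller decide whether to skip the document or request a new reply.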

In [42]:
def generate_questions(doc):
    prompt = prompt_template.format(**doc)
    
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="llama-3.3-70b-versatile",
    )
    json_response = response.choices[0].message.content
    return json_response

In [43]:
from tqdm.auto import tqdm

In [44]:
results = {}
for doc in tqdm(documents):
    doc_id = doc['id']
    if doc_id in results:
        continue
        
    questions = generate_questions(doc)
    results[doc_id] = questions

  0%|          | 0/948 [00:00<?, ?it/s]

RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama-3.3-70b-versatile` in organization `org_01j0ky1yste1j98fbfv00k3k6y` service tier `on_demand` on tokens per day (TPD): Limit 100000, Used 100147, Requested 287. Please try again in 6m15.798s. Need more tokens? Upgrade to Dev Tier today at https://console.groq.com/settings/billing', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}
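
The loop above stops at the first `RateLimitError`. A minimal sketch of a retry wrapper with exponential backoff is shown below. `call_with_retry` and all of its parameters are hypothetical names, and the demo uses a stand-in exception (`FakeRateLimit`) so the block runs without an API key.

```python
import time

def call_with_retry(fn, is_retryable, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a retryable exception wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise errors that are not retryable, or once attempts are exhausted
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

class FakeRateLimit(Exception):
    """Stand-in for a rate-limit error, used only in this demo."""

calls = {"n": 0}

def flaky():
    # Fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimit()
    return "ok"

delays = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky, lambda e: isinstance(e, FakeRateLimit), sleep=delays.append)
print(result, delays)
```

In the notebook this might wrap the call as `call_with_retry(lambda: generate_questions(doc), ...)` with a check for the client's rate-limit exception. Note that the error above reports a tokens-per-day limit with a wait of about six minutes, so short backoff delays alone would likely not be enough; saving `results` and resuming later, as the `if doc_id in results` check allows, is the other half of the workaround.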

In [45]:
print(results[list(results.keys())[0]])

["What is the scheduled start date and time of our course and what is the initial activity", 
"How can I stay updated on course events and deadlines through a calendar", 
"Where do I need to register before the course begins and what is the registration link", 
"What is the best way to receive important course announcements and communications", 
"Are there any additional platforms I need to join for course discussions and interactions beyond the main course channel"]


In [46]:
import pickle

In [47]:
with open('results.bin', 'wb') as f_out:
    pickle.dump(results, f_out)

In [None]:
#with open('results.bin', 'rb') as f_in:
    #results = pickle.load(f_in)
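
Since `results` maps string IDs to string replies, it can also be stored as JSON, which is human-readable and avoids loading pickle files from untrusted sources. This is a sketch of an alternative, not what the notebook does; `save_results`, `load_results`, and the `results.json` filename are hypothetical.

```python
import json
import os
import tempfile

def save_results(results, path):
    # Write the id -> raw reply mapping as indented JSON
    with open(path, 'wt') as f_out:
        json.dump(results, f_out, indent=2)

def load_results(path):
    # Read the mapping back; keys and values come back as plain strings
    with open(path, 'rt') as f_in:
        return json.load(f_in)

sample = {'911e9749': '["q1", "q2", "q3", "q4", "q5"]'}
path = os.path.join(tempfile.mkdtemp(), 'results.json')
save_results(sample, path)
restored = load_results(path)
print(restored == sample)
```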

In [50]:
parsed_resulst = {}
for doc_id, json_questions in results.items():
    parsed_resulst[doc_id] = json.loads(json_questions)

In [51]:
parsed_resulst

{'911e9749': ['What is the scheduled start date and time of our course and what is the initial activity',
  'How can I stay updated on course events and deadlines through a calendar',
  'Where do I need to register before the course begins and what is the registration link',
  'What is the best way to receive important course announcements and communications',
  'Are there any additional platforms I need to join for course discussions and interactions beyond the main course channel'],
 '0fce8dfe': ['What do I need to know before enrolling in this course',
  'Are there any specific requirements to join this class',
  'Do I need prior knowledge or experience to take this course',
  'What are the necessary skills or background to succeed in this course',
  'Are there any particular prerequisites that I must fulfill before starting this course'],
 '4a142fd9': ['What happens if I want to enroll in the course after it has already begun?',
  'Is it possible to join the course late and still b

In [52]:
doc_index = {d['id']: d for d in documents} 
# Build doc_index: a dict that maps each document's id (d['id']) to the document d itself

final_results = []
for doc_id, questions in parsed_resulst.items():
    course = doc_index[doc_id]['course']
    for q in questions:
        final_results.append((q, course, doc_id))

In [53]:
final_results[:10]

[('What is the scheduled start date and time of our course and what is the initial activity',
  'data-engineering-zoomcamp',
  '911e9749'),
 ('How can I stay updated on course events and deadlines through a calendar',
  'data-engineering-zoomcamp',
  '911e9749'),
 ('Where do I need to register before the course begins and what is the registration link',
  'data-engineering-zoomcamp',
  '911e9749'),
 ('What is the best way to receive important course announcements and communications',
  'data-engineering-zoomcamp',
  '911e9749'),
 ('Are there any additional platforms I need to join for course discussions and interactions beyond the main course channel',
  'data-engineering-zoomcamp',
  '911e9749'),
 ('What do I need to know before enrolling in this course',
  'data-engineering-zoomcamp',
  '0fce8dfe'),
 ('Are there any specific requirements to join this class',
  'data-engineering-zoomcamp',
  '0fce8dfe'),
 ('Do I need prior knowledge or experience to take this course',
  'data-engineer

In [54]:
import pandas as pd

In [57]:
df = pd.DataFrame(final_results, columns=['question', 'course', 'document'])

In [58]:
df

                                               question                     course  document
0     What is the scheduled start date and time of o...  data-engineering-zoomcamp  911e9749
1     How can I stay updated on course events and de...  data-engineering-zoomcamp  911e9749
2     Where do I need to register before the course ...  data-engineering-zoomcamp  911e9749
3     What is the best way to receive important cour...  data-engineering-zoomcamp  911e9749
4     Are there any additional platforms I need to j...  data-engineering-zoomcamp  911e9749
...                                                 ...                        ...       ...
1250  What steps can I take to resolve the error thr...  data-engineering-zoomcamp  a392082e
1251  How do I fix the CSV parse error that occurs w...  data-engineering-zoomcamp  a392082e
1252  What is the cause of the pyarrow.lib.ArrowInva...  data-engineering-zoomcamp  a392082e
1253  How can I delete random line breaks in a CSV f...  data-engineering-zoomcamp  a392082e


In [59]:
df.to_csv('ground-truth-data.csv', index=False)
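
Generated questions can contain commas, so the file should be read back with a CSV parser and not by splitting lines on `,`. The sketch below illustrates the round trip with the standard `csv` module on an in-memory buffer. The second sample row is made up for illustration and does not come from the dataset.

```python
import csv
import io

rows = [
    ('What do I need to know before enrolling in this course', 'data-engineering-zoomcamp', '0fce8dfe'),
    # Hypothetical row with a comma inside the question text
    ('If I join late, can I still submit homework', 'data-engineering-zoomcamp', '4a142fd9'),
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # quotes fields that contain commas by default
writer.writerow(['question', 'course', 'document'])
writer.writerows(rows)

buffer.seek(0)
records = list(csv.DictReader(buffer))  # one dict per row, keyed by the header
print(records[1]['question'])
```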

In [60]:
!head ground-truth-data.csv

question,course,document
What is the scheduled start date and time of our course and what is the initial activity,data-engineering-zoomcamp,911e9749
How can I stay updated on course events and deadlines through a calendar,data-engineering-zoomcamp,911e9749
Where do I need to register before the course begins and what is the registration link,data-engineering-zoomcamp,911e9749
What is the best way to receive important course announcements and communications,data-engineering-zoomcamp,911e9749
Are there any additional platforms I need to join for course discussions and interactions beyond the main course channel,data-engineering-zoomcamp,911e9749
What do I need to know before enrolling in this course,data-engineering-zoomcamp,0fce8dfe
Are there any specific requirements to join this class,data-engineering-zoomcamp,0fce8dfe
Do I need prior knowledge or experience to take this course,data-engineering-zoomcamp,0fce8dfe
What are the necessary skills or background to succeed in this course,dat