%%capture
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

In [29]:
import minsearch
from openai import OpenAI
import json

In [4]:
with open('../documents.json', 'rt') as f:
    docs_raw = json.load(f)

In [5]:
documents = []

# Flatten the per-course lists into one list, tagging each entry with its course
for course in docs_raw:
    for doc in course['documents']:
        doc['course'] = course['course']
        documents.append(doc)

len(documents)

948
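
The loop above adds the `course` key to each entry in place. A minimal sketch of the same flattening step as a function that leaves its input untouched, run on made-up toy data (the names `flatten_documents` and `docs_raw_example` are illustrative and do not appear in the notebook):

```python
# Toy data in the same shape the loop expects; the values are invented.
docs_raw_example = [
    {
        "course": "course-a",
        "documents": [
            {"text": "t1", "section": "s1", "question": "q1"},
            {"text": "t2", "section": "s1", "question": "q2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "t3", "section": "s2", "question": "q3"},
        ],
    },
]


def flatten_documents(docs_raw):
    """Return one flat list of entries, each copied and tagged with its course."""
    return [
        {**doc, "course": course["course"]}  # copy, so the input is not mutated
        for course in docs_raw
        for doc in course["documents"]
    ]


flat = flatten_documents(docs_raw_example)
print(len(flat))
print(flat[0])
```

Copying each entry is a design choice: re-running the cell then never depends on state left behind by an earlier run.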

In [6]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [7]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],  # free-text fields matched against the query
    keyword_fields=["course"]  # exact-match field, used for filtering
)

In [8]:
index.fit(documents)

<minsearch.Index at 0x79b34b9db8f0>

In [14]:
q = "How to understand evaluation metrics?"

In [15]:
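# Weight matches in 'question' more heavily and matches in 'section' less
boost = {'question': 3.0, 'section': 0.5}
data = index.search(
    query=q,
    boost_dict=boost,
    num_results=5,
    filter_dict={'course': 'machine-learning-zoomcamp'}  # restrict results to one course
)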
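
To make the roles of `boost_dict` and `filter_dict` concrete, here is a rough toy sketch of boosted, filtered search. It is an assumption-laden illustration only: it counts shared words per field, which is not how minsearch actually scores documents, and `toy_search` and the sample entries are hypothetical.

```python
def toy_search(docs, query, text_fields, boost=None, filters=None, num_results=5):
    """Rank docs by boosted word overlap with the query (illustration only)."""
    boost = boost or {}
    filters = filters or {}
    query_terms = set(query.lower().split())

    scored = []
    for doc in docs:
        # Keyword filter: drop documents whose field does not match exactly.
        if any(doc.get(key) != value for key, value in filters.items()):
            continue
        score = 0.0
        for field in text_fields:
            field_terms = set(str(doc.get(field, "")).lower().split())
            # Each field's overlap count is scaled by its boost (default 1.0).
            score += boost.get(field, 1.0) * len(query_terms & field_terms)
        if score > 0:
            scored.append((score, doc))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


toy_docs = [
    {"question": "evaluation metrics", "text": "x", "course": "ml"},
    {"question": "other", "text": "metrics here", "course": "ml"},
    {"question": "metrics", "text": "y", "course": "de"},
]

toy_results = toy_search(
    toy_docs,
    "evaluation metrics",
    text_fields=["question", "text"],
    boost={"question": 3.0},
    filters={"course": "ml"},
)
print([d["question"] for d in toy_results])
```

The entry from the other course is excluded by the filter, and the entry whose question matches ranks above the one that only matches in its text.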

In [16]:
data

[{'text': 'How to get classification metrics - precision, recall, f1 score, accuracy simultaneously\nUse classification_report from sklearn. For more info check here.\nAbhishek N',
  'section': '4. Evaluation Metrics for Classification',
  'question': 'How to get all classification metrics?',
  'course': 'machine-learning-zoomcamp'},
 {'text': "It's a complex and abstract topic and it requires some time to understand. You can move on without fully understanding the concept.\nNonetheless, it might be useful for you to rewatch the video, or even watch videos/lectures/notes by other people on this topic, as the ROC AUC is one of the most important metrics used in Binary Classification models.",
  'section': '4. Evaluation Metrics for Classification',
  'question': 'I didn’t fully understand the ROC curve. Can I move on?',
  'course': 'machine-learning-zoomcamp'},
 {'text': 'You must use the `dt_val` dataset to compute the metrics asked in Question 3 and onwards, as you did in Question 2.\

In [11]:
prompt_template = """
You're a course teacher assistant. Answer the question based on the context.
Use only the facts from the context. If relevant information is missing, say you do not know.
Question:
{question}
Context:
{context}
""".strip()

In [21]:
context = ""
for doc in data:
    context += f"section: {doc['section']}\nquestion: {doc['question']}\nanswer:{doc['text']}\n\n"
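
The same context string can be assembled with a small helper. This is a sketch, not the notebook's code: `build_context` is a hypothetical name, and unlike the loop above it puts a space after `answer:` and leaves no trailing blank lines.

```python
def build_context(results):
    """Join search results into one context string, entries separated by a blank line."""
    parts = []
    for doc in results:
        parts.append(
            f"section: {doc['section']}\n"
            f"question: {doc['question']}\n"
            f"answer: {doc['text']}"
        )
    return "\n\n".join(parts)


# Made-up entries with the same keys as the search results.
sample_results = [
    {"section": "S1", "question": "Q1", "text": "T1"},
    {"section": "S2", "question": "Q2", "text": "T2"},
]
print(build_context(sample_results))
```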

In [26]:
prompt = prompt_template.format(question=q, context=context).strip()

In [27]:
print(prompt)

You're a course teacher assistant. Answer the question based on the context.
Use only the facts from the context. If relevant information is missing, say you do not know.
Question:
How to understand evaluation metrics?
Context:
section: 4. Evaluation Metrics for Classification
question: How to get all classification metrics?
answer:How to get classification metrics - precision, recall, f1 score, accuracy simultaneously
Use classification_report from sklearn. For more info check here.
Abhishek N

section: 4. Evaluation Metrics for Classification
question: I didn’t fully understand the ROC curve. Can I move on?
answer:It's a complex and abstract topic and it requires some time to understand. You can move on without fully understanding the concept.
Nonetheless, it might be useful for you to rewatch the video, or even watch videos/lectures/notes by other people on this topic, as the ROC AUC is one of the most important metrics used in Binary Classification models.

section: 4. Evaluation M

In [30]:
# With no arguments, the client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

In [31]:
response = client.chat.completions.create(
    model='gpt-4.1-mini',
    messages=[{'role': 'user', 'content': prompt}]
)

In [35]:
response.choices[0].message.content

To understand evaluation metrics for classification, you can compute several key metrics such as precision, recall, F1 score, accuracy, and ROC AUC using scikit-learn's built-in functions. This approach is more accurate and efficient than manual calculations. For example, you can use functions like `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `roc_auc_score` on your validation dataset predictions.

Additionally, to get all classification metrics together (precision, recall, f1 score, accuracy) simultaneously, you can use the `classification_report` function from scikit-learn, which summarizes these metrics in one output.

It is important to use the validation dataset (e.g., `dt_val`) to compute these metrics after model training.

If you are dealing with complex metrics like the ROC curve, it might take time to fully understand, and it's okay to move on initially and revisit the topic later through different resources.

In summary:
- Use scikit-learn functions to calculate metrics.
- Use `classification_report` for a combined metric summary.
- Use your validation dataset for evaluation.
- Take time to understand complex metrics like ROC curve if needed.
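
The search, prompt, and completion steps above can be tied together in one function. A minimal sketch, assuming the search and the model call are passed in as plain callables so it runs without an index or an API key; `build_prompt`, `rag`, `fake_search`, and `fake_llm` are hypothetical names.

```python
PROMPT_TEMPLATE = """
You're a course teacher assistant. Answer the question based on the context.
Use only the facts from the context. If relevant information is missing, say you do not know.
Question:
{question}
Context:
{context}
""".strip()


def build_prompt(question, results):
    """Format the retrieved entries and the question into a single prompt."""
    context = ""
    for doc in results:
        context += f"section: {doc['section']}\nquestion: {doc['question']}\nanswer:{doc['text']}\n\n"
    return PROMPT_TEMPLATE.format(question=question, context=context).strip()


def rag(question, search_fn, llm_fn):
    """Retrieve entries, build the prompt, and return the model's answer."""
    results = search_fn(question)
    prompt = build_prompt(question, results)
    return llm_fn(prompt)


# Stand-ins for the index and the model (assumptions for this sketch).
def fake_search(question):
    return [{"section": "s", "question": "q", "text": "t"}]


def fake_llm(prompt):
    return f"(model reply to a {len(prompt)}-character prompt)"


answer = rag("How to understand evaluation metrics?", fake_search, fake_llm)
print(answer)

# In the notebook, the callables could wrap the objects defined above, e.g.:
# search_fn = lambda q: index.search(query=q, boost_dict=boost, num_results=5,
#                                    filter_dict={'course': 'machine-learning-zoomcamp'})
# llm_fn = lambda p: client.chat.completions.create(
#     model='gpt-4.1-mini', messages=[{'role': 'user', 'content': p}]
# ).choices[0].message.content
```

Passing the two steps in as functions keeps the pipeline easy to test and lets either the search backend or the model be swapped independently.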