# Dataset Exploration

This notebook explores the Q&A database used for the LLM evaluator chatbot.

## Overview

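The dataset contains question-answer pairs covering general machine learning concepts. As loaded in this notebook, each entry has:
- `id`: An integer identifier for the entry
- `question`: The question text, in practice a short concept name such as "Activation Function"
- `answer`: The standard/reference answer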

This dataset serves as the knowledge base for evaluating student answers.
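
To make the record shape concrete, here is a minimal sketch. It assumes the JSON file holds a list of objects with `question` and `answer` keys and that an `id` is assigned from each record's position; neither assumption is confirmed by this notebook, and the project's `load_qa_dataset` may work differently. The two records and their placeholder answers are invented for illustration and are not taken from the dataset.

```python
import json

# Hypothetical file content for illustration only; the real layout is an assumption.
raw = """
[
  {"question": "Activation Function", "answer": "Placeholder reference answer A."},
  {"question": "Anomaly Detection", "answer": "Placeholder reference answer B."}
]
"""

# Parse the records and attach a positional id to each one (assumed behavior).
records = [{"id": i, **entry} for i, entry in enumerate(json.loads(raw))]

for record in records:
    print(record["id"], record["question"], len(record["answer"]))
```

A list of dictionaries in this shape can be passed directly to `pandas.DataFrame` to obtain the three columns explored in the cells below.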


In [None]:
# Load environment variables from .env file (if it exists)
# This allows notebooks to use the same configuration as the main app
try:
    from dotenv import load_dotenv
    load_dotenv()  # Loads variables from .env file in project root
    print("Environment variables loaded from .env file")
except ImportError:
    print("python-dotenv not installed. Using system environment variables only.")
except Exception as e:
    print(f"Note: Could not load .env file: {e}")
    print("Using system environment variables only.")


python-dotenv not installed. Using system environment variables only.


In [2]:
import sys
from pathlib import Path
import os

# Find project root by looking for src/ directory
current = Path.cwd()
project_root = None

# Check if we're in notebooks/ directory
if current.name == 'notebooks':
    project_root = current.parent
else:
    # Walk up the directory tree looking for src/ folder
    for parent in [current] + list(current.parents):
        if (parent / 'src').exists() and (parent / 'src' / '__init__.py').exists():
            project_root = parent
            break
    
    # Fallback: assume current directory is project root if src/ exists here
    if project_root is None and (current / 'src').exists():
        project_root = current

# If still not found, fall back to the current directory
# (the notebooks/ case is already handled above)
if project_root is None:
    project_root = current

# Change to project root directory so relative paths work correctly
os.chdir(project_root)

# Add project root to path to import src modules
sys.path.insert(0, str(project_root))

import pandas as pd
from src.data_loader import load_qa_dataset
from src.config import DATA_PATH

print(f"Current working directory: {os.getcwd()}")
print(f"Project root: {project_root}")
print(f"Loading dataset from: {DATA_PATH}")
# Resolve the configured path relative to project root
data_path_absolute = (project_root / DATA_PATH).resolve()
print(f"Absolute path: {data_path_absolute}")
print(f"File exists: {data_path_absolute.exists()}")


Current working directory: C:\Users\Levin\OneDrive\Desktop\DAI Assignment Part 2
Project root: C:\Users\Levin\OneDrive\Desktop\DAI Assignment Part 2
Loading dataset from: data\Q&A_db_practice.json
Absolute path: C:\Users\Levin\OneDrive\Desktop\DAI Assignment Part 2\data\Q&A_db_practice.json
File exists: True


In [None]:
# Load the dataset (use absolute path to ensure it works)
data_file = project_root / "data" / "Q&A_db_practice.json"
df = load_qa_dataset(path=data_file)
print(f"Dataset loaded: {len(df)} questions")
print(f"\nDataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

INFO:src.data_loader:Successfully loaded 150 question-answer pairs from c:\Users\Levin\OneDrive\Desktop\DAI Assignment Part 2\data\Q&A_db_practice.json


Dataset loaded: 150 questions

Dataset shape: (150, 3)

Columns: ['id', 'question', 'answer']


In [None]:
# Display basic statistics
print("Dataset Statistics:")
print(f"Total questions: {len(df)}")
print("\nQuestion length statistics (characters):")
print(df['question'].str.len().describe())
print("\nAnswer length statistics (characters):")
print(df['answer'].str.len().describe())


Dataset Statistics:
Total questions: 150

Question length statistics (characters):
count    150.000000
mean      16.833333
std        6.726105
min        4.000000
25%       12.250000
50%       16.000000
75%       20.000000
max       38.000000
Name: question, dtype: float64

Answer length statistics (characters):
count    150.000000
mean     397.900000
std      192.832916
min      174.000000
25%      274.000000
50%      342.000000
75%      425.000000
max      927.000000
Name: answer, dtype: float64


In [None]:
# Display sample entries
print("Sample Questions (first 3):")
print("=" * 80)
for _, row in df.head(3).iterrows():
    print(f"\nQuestion ID: {row['id']}")
    print(f"Question: {row['question']}")
    print(f"Answer (preview): {row['answer'][:200]}...")
    print("-" * 80)


Sample Questions (first 3):
================================================================================

Question ID: 0
Question: Activation Function
Answer (preview): An activation function is a mathematical function that transforms each neuron’s aggregated input (pre‑activation) into its output signal by applying a non‑linear, usually differentiable mapping that ...
--------------------------------------------------------------------------------

Question ID: 1
Question: Anomaly Detection
Answer (preview): Anomaly detection is a subfield of unsupervised learning that identifies data points whose feature patterns deviate significantly from the statistical regularities of a reference dataset, assuming ano...
--------------------------------------------------------------------------------

Question ID: 2
Question: Area Under the Curve (AUC)
Answer (preview): The Area Under the Curve (AUC) is a scalar performance metric for binary classifiers that is defined as the definite integral of the receiver operating characteristic (ROC) curve with respect to the f...
--------------------------------------------------------------------------------

## Dataset Usage

This dataset is used in the Streamlit chatbot as follows:

1. **Question Selection**: Questions are randomly selected from the dataset
2. **Reference Answer**: The `answer` field serves as the "gold standard" for comparison
3. **Evaluation**: Student answers are compared against the reference using:
   - LLM-based evaluation (explanation + score)
   - Automatic metrics (ROUGE-1, ROUGE-L)
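
The selection and automatic-metric steps above can be sketched in plain Python. This is a simplified illustration, not the chatbot's actual implementation: it assumes whitespace tokenization, reports F1 variants of ROUGE-1 and ROUGE-L, and uses hypothetical entries and a hypothetical student answer. The app may well rely on a dedicated ROUGE library with stemming and different tokenization, so scores from this sketch are not expected to match it exactly.

```python
import random
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplification of real ROUGE tokenizers)."""
    return text.lower().split()


def f1(matches, n_candidate, n_reference):
    """Combine precision and recall into F1, returning 0.0 when there is no overlap."""
    if matches == 0 or n_candidate == 0 or n_reference == 0:
        return 0.0
    precision = matches / n_candidate
    recall = matches / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate, reference):
    """ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand, ref = tokenize(candidate), tokenize(reference)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f1(overlap, len(cand), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    cand, ref = tokenize(candidate), tokenize(reference)
    return f1(lcs_length(cand, ref), len(cand), len(ref))


# Hypothetical entries and student answer, for illustration only.
qa_pairs = [
    {"question": "Concept A", "answer": "a function applied to the input of a neuron"},
    {"question": "Concept B", "answer": "finding data points that deviate from the norm"},
]
rng = random.Random(0)        # seeded so the example is reproducible
entry = rng.choice(qa_pairs)  # step 1: random question selection
student_answer = "a function applied to data points"

# Steps 2-3: compare the student answer against the reference answer.
print(f"Question: {entry['question']}")
print(f"ROUGE-1 F1: {rouge_1(student_answer, entry['answer']):.3f}")
print(f"ROUGE-L F1: {rouge_l(student_answer, entry['answer']):.3f}")
```

ROUGE-1 rewards shared vocabulary regardless of order, while ROUGE-L also rewards preserving word order, which is why the two are often reported together alongside the LLM's qualitative score.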

For the evaluation to be representative, the dataset should cover a diverse range of ML concepts.
