# Auto-Grader ML System Demonstration

This notebook demonstrates the machine learning capabilities of the Auto-Grader system. It includes model training, evaluation, and prediction examples to show how the system grades student submissions automatically.

## 1. Setup and Imports

First, let's import the necessary modules and set up our environment.

In [None]:
import os
import sys
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import datetime

# Add the parent directory to sys.path
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our ML modules
from utils.train_models import DataProcessor, ModelTrainer
from utils.grade_engine import GradingEngine
from utils.feedback_generator import FeedbackGenerator, FeedbackFormatter
# ModelManager comes from utils.model_manager, the module Section 6 imports it from
from utils.model_manager import ModelManager, ModelVersion, ABTestManager

# Set up directories
MODEL_DIR = '../models'
ASSIGNMENTS_DIR = '../../storage/nbgrader_assignments'
FEEDBACK_DIR = '../../storage/nbgrader_feedback'

# Create directories if they don't exist
os.makedirs(MODEL_DIR, exist_ok=True)
os.makedirs(ASSIGNMENTS_DIR, exist_ok=True)
os.makedirs(FEEDBACK_DIR, exist_ok=True)

print("Setup complete!")

## 2. Generate Sample Training Data

For demonstration purposes, we'll create some sample training data to simulate NBGrader assignments.
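
Each question record defined in the next cell carries a `grading_rubric` dict keyed by score strings (`"10"`, `"8"`, and so on). As a rough illustration of how such a rubric could be read, the sketch below maps a numeric score to the description of the highest rubric level it reaches. The helper name `rubric_description` is hypothetical and not part of the Auto-Grader codebase.

```python
def rubric_description(score, rubric):
    """Return the description of the highest rubric level that `score` reaches.

    Hypothetical helper for illustration; rubric keys are score strings.
    """
    # Convert the string keys to numbers and sort from highest to lowest
    levels = sorted(((float(k), text) for k, text in rubric.items()), reverse=True)
    for threshold, text in levels:
        if score >= threshold:
            return text
    # Score is below every level (e.g. negative); fall back to the lowest one
    return levels[-1][1]


example_rubric = {
    "10": "Complete explanation",
    "6": "Basic explanation",
    "0": "No answer or completely incorrect",
}
print(rubric_description(7.5, example_rubric))  # Basic explanation
```

A score of 7.5 falls between the "10" and "6" levels, so the lookup returns the description for "6".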

In [None]:
# Create a function to generate sample data
def create_sample_training_data():
    """
    Create sample training data for demonstration
    """
    # Define questions and answers
    questions = [
        {
            "id": "q1",
            "text": "What is the difference between supervised and unsupervised learning?",
            "sample_answer": "Supervised learning uses labeled data where the algorithm learns to map inputs to known outputs. It requires a training dataset with correct answers (labels). Examples include classification and regression problems. Unsupervised learning, on the other hand, works with unlabeled data and tries to find patterns or structure without predefined outputs. Common unsupervised learning techniques include clustering, dimensionality reduction, and association.",
            "keywords": ["labeled", "unlabeled", "classification", "clustering", "regression", "training"],
            "max_score": 10.0,
            "grading_rubric": {
                "10": "Complete explanation with clear distinction between both types and examples",
                "8": "Good explanation with basic distinction and some examples",
                "6": "Basic distinction between supervised and unsupervised learning",
                "4": "Partial explanation with some confusion",
                "2": "Very limited explanation with major misconceptions",
                "0": "No answer or completely incorrect"
            }
        },
        {
            "id": "q2",
            "text": "Explain how the backpropagation algorithm works in neural networks.",
            "sample_answer": "Backpropagation is an algorithm used to train neural networks by calculating gradients. It works in two phases: forward pass and backward pass. In the forward pass, input data is fed through the network to generate predictions. The prediction error is calculated using a loss function. In the backward pass, the algorithm calculates the gradient of the loss function with respect to each weight by applying the chain rule of calculus, propagating the error backward through the network. The weights are then updated using gradient descent to minimize the loss. This process is repeated iteratively until the model converges to an optimal solution.",
            "keywords": ["gradient", "forward pass", "backward pass", "chain rule", "weights", "loss function"],
            "max_score": 10.0,
            "grading_rubric": {
                "10": "Comprehensive explanation including forward pass, backward pass, chain rule, and gradient descent",
                "8": "Good explanation with most key concepts mentioned",
                "6": "Basic explanation of backpropagation process",
                "4": "Partial explanation with some key concepts missing",
                "2": "Very limited explanation with major misconceptions",
                "0": "No answer or completely incorrect"
            }
        },
        {
            "id": "q3",
            "text": "What is the bias-variance tradeoff in machine learning?",
            "sample_answer": "The bias-variance tradeoff is a fundamental concept in machine learning that describes the balancing act between creating models that are too simple (high bias) versus too complex (high variance). High bias models underfit the data, making strong assumptions and simplifying the target function, resulting in high error on both training and test data. High variance models overfit the data, capturing noise along with the underlying pattern, performing well on training data but poorly on test data. The goal is to find the sweet spot that minimizes total error. Techniques to manage this tradeoff include regularization, cross-validation, and ensemble methods. The ideal model complexity depends on the amount and quality of training data available.",
            "keywords": ["underfitting", "overfitting", "complexity", "regularization", "error", "model selection"],
            "max_score": 10.0,
            "grading_rubric": {
                "10": "Complete explanation including definition, consequences of high bias/variance, and methods to balance the tradeoff",
                "8": "Good explanation with clear understanding of overfitting and underfitting",
                "6": "Basic explanation of bias-variance tradeoff concept",
                "4": "Partial explanation with some misconceptions",
                "2": "Very limited explanation",
                "0": "No answer or completely incorrect"
            }
        }
    ]
    
    # Create a DataFrame
    df = pd.DataFrame(questions)
    
    return df

# Generate and display sample data
training_data = create_sample_training_data()
print(f"Generated {len(training_data)} sample questions")
training_data[['id', 'text']]

## 3. Train ML Models

Now we'll train a similarity-based model on the sample data: we generate training pairs, split them into training and test sets, then train, save, and evaluate the model. This notebook does not train a transformer-based model.
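
The internals of `train_similarity_model` are not shown in this notebook. To give a feel for what a similarity-based grader might compute, here is a minimal sketch that scores an answer by the cosine similarity between word-count vectors of the student answer and the sample answer, scaled to the maximum score. The function names and the bag-of-words representation are assumptions for illustration, not the system's actual method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


def similarity_score(student_answer, sample_answer, max_score=10.0):
    """Scale the similarity (0 to 1) to the question's score range."""
    return round(cosine_similarity(student_answer, sample_answer) * max_score, 2)


sample = "Supervised learning uses labeled data"
print(similarity_score(sample, sample))  # identical texts get the full score
print(similarity_score("", sample))      # an empty answer gets zero
```

Plain word counts ignore synonyms and word order, which is one reason a production system would likely use richer representations.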

In [None]:
# Initialize components
data_processor = DataProcessor()
model_trainer = ModelTrainer(MODEL_DIR)

# Generate training pairs from the sample data
X, y = data_processor.generate_training_pairs(training_data)
print(f"Generated {len(X)} training pairs from sample data")

# Display a few examples
for i in range(min(3, len(X))):
    print(f"\nExample {i+1}:")
    print(f"Question: {X[i][0][:100]}...")
    print(f"Answer: {X[i][1][:100]}...")
    print(f"Score: {y[i]}")

In [None]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

In [None]:
# Train similarity model
print("Training similarity model...")
similarity_model = model_trainer.train_similarity_model(X_train, y_train, X_test, y_test)

# Save model
model_path = model_trainer.save_model(similarity_model, 'similarity')
print(f"Saved similarity model to {model_path}")

# Print metrics
print("\nModel performance:")
for metric, value in similarity_model['metrics'].items():
    print(f"{metric}: {value:.4f}")

## 4. Grade Student Submissions

Let's use our trained models to grade some sample student submissions.
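
Each grading result below reports a score and a confidence. How `GradingEngine` derives them is not shown here. One plausible approach is to take a weighted average of per-method scores and to lower the confidence as the methods disagree. The sketch below assumes hypothetical method names and weights.

```python
def combine_scores(method_scores, weights, max_score=10.0):
    """Combine per-method scores into one score and a rough confidence.

    method_scores and weights are dicts keyed by method name (hypothetical
    names); confidence falls as the spread between methods grows.
    """
    total_weight = sum(weights[m] for m in method_scores)
    score = sum(method_scores[m] * weights[m] for m in method_scores) / total_weight
    # Spread between the most and least generous method, as a fraction of max_score
    spread = (max(method_scores.values()) - min(method_scores.values())) / max_score
    confidence = max(0.0, 1.0 - spread)
    return round(score, 2), round(confidence, 2)


scores = {"similarity": 8.0, "keyword_matching": 6.0}
weights = {"similarity": 0.75, "keyword_matching": 0.25}
print(combine_scores(scores, weights))  # (7.5, 0.8)
```

Tying confidence to disagreement between methods gives a simple signal for routing uncertain submissions to a human grader.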

In [None]:
# Initialize grading engine
grading_engine = GradingEngine(MODEL_DIR)

# Sample student submissions
student_submissions = [
    {
        "id": "submission1",
        "question_id": "q1",
        "answer": "Supervised learning uses labeled data with inputs and outputs. Unsupervised learning uses unlabeled data to find patterns. Examples of supervised learning are classification and regression. Examples of unsupervised learning include clustering."
    },
    {
        "id": "submission2",
        "question_id": "q1",
        "answer": "Supervised learning is when the computer is given examples to learn from. Unsupervised learning is when it learns on its own."
    },
    {
        "id": "submission3",
        "question_id": "q2",
        "answer": "Backpropagation works by calculating gradients of the loss function with respect to weights. It uses the chain rule to propagate errors backward through the network, updating weights to minimize the loss function. The process includes a forward pass where predictions are made and a backward pass where errors are propagated back."
    }
]

# Prepare questions dictionary
questions = {q['id']: q for q in training_data.to_dict('records')}

# Grade submissions
grading_results = grading_engine.batch_grade_submissions(student_submissions, questions)

# Display results
for i, result in enumerate(grading_results):
    print(f"\nSubmission {i+1} (ID: {result['submission_id']})")
    print(f"Question: {questions[result['question_id']]['text'][:100]}...")
    print(f"Score: {result['score']} / {result['max_score']} (Confidence: {result['confidence']:.2f})")
    print("Feedback:")
    for feedback in result['feedback']:
        print(f"- {feedback}")

## 5. Generate Detailed Feedback

Now let's generate more detailed feedback for one of the submissions using our feedback generator.
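
The next cell lists the missing concepts and keywords by hand. As a sketch of how the keyword part could be automated, the helper below checks which of a question's keywords never appear in the answer. It uses whole-word matching so that "labeled" is not counted as present just because "unlabeled" is. The helper name `find_missing_keywords` is hypothetical.

```python
import re


def find_missing_keywords(answer, keywords):
    """Return the keywords that do not occur as whole words or phrases in the answer."""
    missing = []
    for keyword in keywords:
        # \b anchors prevent matches inside longer words (e.g. "labeled" in "unlabeled")
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if not re.search(pattern, answer, flags=re.IGNORECASE):
            missing.append(keyword)
    return missing


answer = (
    "Supervised learning uses labeled data with inputs and outputs. "
    "Unsupervised learning uses unlabeled data to find patterns. "
    "Examples of supervised learning are classification and regression. "
    "Examples of unsupervised learning include clustering."
)
keywords = ["labeled", "unlabeled", "classification", "clustering", "regression", "training"]
print(find_missing_keywords(answer, keywords))  # ['training']
```

Exact matching is strict: an answer that says "trained" instead of "training" would still be flagged, so a real system might add stemming or synonym handling.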

In [None]:
# Create feedback generator
feedback_generator = FeedbackGenerator()

# Select a submission
submission_idx = 0  # First submission
submission = student_submissions[submission_idx]
result = grading_results[submission_idx]
question = questions[submission['question_id']]

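# Generate feedback (missing concepts and keywords are hand-picked for this demo)
missing_concepts = [
    "The need for labeled training data in supervised learning",
    "How supervised learning maps inputs to known outputs",
]
# Of q1's keywords, only "training" is absent from this answer
missing_keywords = ["training"]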

feedback = feedback_generator.generate_feedback(
    student_answer=submission['answer'],
    expected_answer=question['sample_answer'],
    score=result['score'],
    max_score=result['max_score'],
    missing_concepts=missing_concepts,
    missing_keywords=missing_keywords
)

# Format feedback as HTML
html_feedback = FeedbackFormatter.format_as_html(feedback, result['score'], result['max_score'])

# Display feedback (display is imported explicitly rather than relying on the notebook built-in)
from IPython.display import HTML, display
display(HTML(html_feedback))

## 6. Model Versioning and A/B Testing

Let's demonstrate the model versioning and A/B testing capabilities of our system.
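
The `traffic_split=0.5` argument used in the A/B test below suggests that each submission is routed to one of the two versions. How `ABTestManager` does this is not shown. A common approach is to hash a stable identifier so the same submission always lands in the same arm. The following sketch illustrates that idea, along with a naive winner check. Both function names are hypothetical.

```python
import hashlib


def assign_variant(submission_id, traffic_split=0.5):
    """Deterministically assign a submission to 'version_a' or 'version_b'.

    traffic_split is the fraction of traffic sent to version_a.
    """
    digest = hashlib.sha256(submission_id.encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "version_a" if bucket < traffic_split else "version_b"


def pick_winner(metrics_a, metrics_b, metric="accuracy"):
    """Return the arm with the higher value for `metric`, or None on a tie."""
    if metrics_a[metric] == metrics_b[metric]:
        return None
    return "version_a" if metrics_a[metric] > metrics_b[metric] else "version_b"


print(assign_variant("submission1") == assign_variant("submission1"))  # True
print(pick_winner({"accuracy": 0.82}, {"accuracy": 0.87}))             # version_b
```

With only a handful of samples per arm, a raw comparison like this is unlikely to be statistically meaningful. A real test would typically collect more samples or apply a significance check before promoting a version.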

In [None]:
# Initialize model manager
from utils.model_manager import ModelManager
model_manager = ModelManager(MODEL_DIR)

# List model versions
versions = model_manager.list_versions()
print(f"Found {len(versions)} model versions:")
for v in versions:
    print(f"- {v['model_type']} {v['version']} {'(active)' if v.get('is_active') else ''}")

In [None]:
# If we have at least one similarity model, let's train another version for comparison
similarity_versions = [v for v in versions if v['model_type'] == 'similarity']
if similarity_versions:
    print("Training a new version of the similarity model...")
    
    # Train with different parameters
    # For demonstration, we'll use the same data but pretend it's a new version
    similarity_model_v2 = model_trainer.train_similarity_model(X_train, y_train, X_test, y_test)
    
    # Artificially inflate the metrics so the two versions differ (demo only);
    # cap at 1.0 so the values remain valid proportions
    similarity_model_v2['metrics']['accuracy'] = min(1.0, similarity_model_v2['metrics']['accuracy'] * 1.05)
    similarity_model_v2['metrics']['f1_score'] = min(1.0, similarity_model_v2['metrics']['f1_score'] * 1.08)
    
    # Save as new version
    model_path = model_trainer.save_model(similarity_model_v2, 'similarity')
    print(f"Saved new similarity model to {model_path}")
    
    # Refresh model list
    model_manager.load_model_versions()
    versions = model_manager.list_versions('similarity')
    print("\nUpdated similarity model versions:")
    for v in versions:
        print(f"- {v['model_type']} {v['version']} {'(active)' if v.get('is_active') else ''}")

In [None]:
# Set up A/B testing
similarity_versions = [v for v in versions if v['model_type'] == 'similarity']
if len(similarity_versions) >= 2:
    # Initialize A/B test manager
    ab_manager = ABTestManager(model_manager)
    
    # Create A/B test
    version_a = similarity_versions[0]['version']
    version_b = similarity_versions[1]['version']
    
    test_config = ab_manager.create_test(
        test_name="Similarity Model Comparison",
        model_type="similarity",
        version_a=version_a,
        version_b=version_b,
        traffic_split=0.5,
        metrics=["accuracy", "f1_score"]
    )
    
    print(f"Created A/B test: {test_config['name']} (ID: {test_config['id']})")
    
    # Simulate collecting metrics
    test_id = test_config['id']
    
    # Update metrics for version A
    version_a_metrics = {
        "accuracy": 0.82,
        "f1_score": 0.79
    }
    ab_manager.update_test_metrics(test_id, version_a, version_a_metrics, sample_count=10)
    
    # Update metrics for version B
    version_b_metrics = {
        "accuracy": 0.87,
        "f1_score": 0.85
    }
    ab_manager.update_test_metrics(test_id, version_b, version_b_metrics, sample_count=10)
    
    # Get results
    results = ab_manager.get_test_results(test_id)
    print("\nA/B Test Results:")
    print(json.dumps(results, indent=2))
    
    # End test
    final_results = ab_manager.end_test(test_id)
    
    # Promote winner
    if 'final_metrics' in final_results and 'winner' in final_results['final_metrics']:
        winner = final_results['final_metrics']['winner']
        if winner:
            version = final_results['version_a'] if winner == 'version_a' else final_results['version_b']
            print(f"\nPromoting {version} as the active version")
            ab_manager.promote_winner(test_id)
            
            # Refresh model list
            model_manager.load_model_versions()
            versions = model_manager.list_versions('similarity')
            print("\nUpdated similarity model versions:")
            for v in versions:
                print(f"- {v['model_type']} {v['version']} {'(active)' if v.get('is_active') else ''}")

## 7. Visualize Model Performance

Let's plot how the similarity model's accuracy and F1 score change from one saved version to the next.

In [None]:
# Get model performance history
similarity_history = model_manager.get_model_performance_history('similarity')

if similarity_history:
    # Convert to DataFrame
    history_df = pd.DataFrame(similarity_history)
    
    # Add created_at as datetime
    history_df['created_at'] = pd.to_datetime(history_df['created_at'])
    
    # Sort by creation date
    history_df = history_df.sort_values('created_at')
    
    # Plot each metric in its own panel
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    panels = [('accuracy', 'Accuracy', 'blue'), ('f1_score', 'F1 Score', 'green')]
    for ax, (column, label, color) in zip(axes, panels):
        ax.plot(history_df['created_at'], history_df[column], marker='o', linestyle='-', color=color)
        ax.set_title(f'{label} Over Time')
        ax.set_xlabel('Version Creation Date')
        ax.set_ylabel(label)
        ax.grid(True, linestyle='--', alpha=0.7)
        ax.tick_params(axis='x', labelrotation=45)

    fig.tight_layout()
    plt.show()
else:
    print("No performance history available for similarity models.")

## 8. End-to-End Grading Example

Let's put everything together and demonstrate the end-to-end grading process for a new submission.

In [None]:
# Define a new student submission
new_submission = {
    "id": "new_submission",
    "question_id": "q3",  # Bias-variance tradeoff question
    "answer": """The bias-variance tradeoff is an important concept in machine learning. Bias is when a model makes strong assumptions about the data, which can lead to underfitting where it doesn't capture the true patterns. Variance is when a model is too sensitive to the training data, leading to overfitting where it captures noise rather than just the underlying pattern.
    
    High bias models are too simple and perform poorly on both training and test data. High variance models are too complex and perform well on training data but poorly on test data. The goal is to find the right balance that minimizes total error.
    
    Some ways to manage this tradeoff include regularization techniques, cross-validation, and using ensemble methods. The right model complexity depends on how much training data you have."""
}

# Grade the submission
print("Grading new submission...")
result = grading_engine.grade_submission(
    student_answer=new_submission['answer'],
    question_text=questions[new_submission['question_id']]['text'],
    expected_answer=questions[new_submission['question_id']]['sample_answer'],
    max_score=questions[new_submission['question_id']]['max_score'],
    rubric=questions[new_submission['question_id']]['grading_rubric'],
    keywords=questions[new_submission['question_id']]['keywords']
)

print(f"\nGrading Result:\nScore: {result['score']} / {result['max_score']} (Confidence: {result['confidence']:.2f})")
print(f"Methods used: {', '.join(result['methods_used'])}")
print("\nDetailed metrics:")
for metric, value in result.items():
    if isinstance(value, (int, float)) and metric not in ['score', 'max_score', 'confidence']:
        print(f"- {metric}: {value:.4f}")

print("\nFeedback:")
for item in result['feedback']:
    print(f"- {item}")

In [None]:
# Generate detailed feedback with our feedback generator
print("Generating detailed feedback...")

# Extract missing concepts and keywords from the detailed feedback,
# falling back to empty lists when a section is absent
detailed = result.get('detailed_feedback', {})
missing_concepts = detailed.get('semantic_similarity', {}).get('missing_concepts', [])
missing_keywords = detailed.get('keyword_matching', {}).get('missing_keywords', [])

# Generate enhanced feedback
enhanced_feedback = feedback_generator.generate_feedback(
    student_answer=new_submission['answer'],
    expected_answer=questions[new_submission['question_id']]['sample_answer'],
    score=result['score'],
    max_score=result['max_score'],
    missing_concepts=missing_concepts,
    missing_keywords=missing_keywords
)

# Format as HTML and display
html_feedback = FeedbackFormatter.format_as_html(enhanced_feedback, result['score'], result['max_score'])
display(HTML(html_feedback))

## 9. Evaluate Human vs. ML Grading

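Let's compare ML scores against human scores. The values below are illustrative: the human scores are placeholders, and only the q3 entry uses the ML score computed in the previous section. The other ML scores and confidences are hard-coded.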
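
Beyond the absolute differences computed in the next cell, two other summaries are often useful when comparing graders. The signed mean difference shows whether the ML grader scores systematically higher or lower, and the Pearson correlation shows whether the two graders rank submissions similarly. A minimal sketch with hypothetical helper names, using illustrative scores rather than real grading data:

```python
import math


def mean_signed_difference(human, ml):
    """Average of (ml - human); negative means the ML grader is stricter."""
    return sum(m - h for h, m in zip(human, ml)) / len(human)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative scores only, not real grading data
human_scores = [8.5, 6.0, 9.0]
ml_scores = [7.5, 5.5, 8.5]

print(f"Mean signed difference: {mean_signed_difference(human_scores, ml_scores):.2f}")
print(f"Pearson correlation: {pearson_correlation(human_scores, ml_scores):.2f}")
```

With only three submissions these numbers say little. They become informative once a larger set of human-graded answers is available.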

In [None]:
# Define some sample data with both ML and human grades
comparison_data = [
    {
        "submission_id": "s1",
        "question_id": "q1",
        "human_score": 8.5,
        "ml_score": result['score'] if new_submission['question_id'] == 'q1' else 7.5,
        "confidence": result['confidence'] if new_submission['question_id'] == 'q1' else 0.85
    },
    {
        "submission_id": "s2",
        "question_id": "q2",
        "human_score": 6.0,
        "ml_score": result['score'] if new_submission['question_id'] == 'q2' else 5.5,
        "confidence": result['confidence'] if new_submission['question_id'] == 'q2' else 0.75
    },
    {
        "submission_id": "s3",
        "question_id": "q3",
        "human_score": 9.0,
        "ml_score": result['score'] if new_submission['question_id'] == 'q3' else 8.5,
        "confidence": result['confidence'] if new_submission['question_id'] == 'q3' else 0.90
    }
]

# Convert to DataFrame
comparison_df = pd.DataFrame(comparison_data)

# Calculate absolute difference
comparison_df['diff'] = abs(comparison_df['human_score'] - comparison_df['ml_score'])

# Display comparison
print("Human vs. ML Grading Comparison:")
print(comparison_df)

# Calculate overall metrics
mean_diff = comparison_df['diff'].mean()
max_diff = comparison_df['diff'].max()
within_1_point = (comparison_df['diff'] <= 1.0).mean() * 100

print(f"\nAverage difference: {mean_diff:.2f} points")
print(f"Maximum difference: {max_diff:.2f} points")
print(f"Percentage within 1 point: {within_1_point:.1f}%")

# Visualize comparison
plt.figure(figsize=(10, 6))

# Bar chart comparing human and ML scores
bar_width = 0.35
index = np.arange(len(comparison_df))

plt.bar(index, comparison_df['human_score'], bar_width, label='Human Score', color='blue', alpha=0.7)
plt.bar(index + bar_width, comparison_df['ml_score'], bar_width, label='ML Score', color='green', alpha=0.7)

plt.xlabel('Submission')
plt.ylabel('Score')
plt.title('Human vs. ML Grading Comparison')
plt.xticks(index + bar_width / 2, comparison_df['submission_id'])
plt.legend()
plt.ylim(0, 10)
plt.grid(True, linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

## 10. Conclusion and Next Steps

In this notebook, we've demonstrated the core ML capabilities of the Auto-Grader system:

1. Training a similarity-based ML model for grading student submissions
2. Grading submissions with the grading engine, which reports the methods used for each score (such as semantic similarity and keyword matching)
3. Generating detailed, actionable feedback for students
4. Managing model versions and conducting A/B tests
5. Evaluating ML grading against human grading

**Next steps for the Auto-Grader ML system:**

1. **Expand training data**: Incorporate more real student submissions and human-graded examples
2. **Support more question types**: Extend to handle code submissions, mathematical equations, diagrams
3. **Improve feedback generation**: Create more personalized and educationally valuable feedback
4. **Implement active learning**: Allow the system to learn from teacher corrections
5. **Develop bias detection**: Ensure fair grading across different student populations
6. **Build explainability tools**: Help teachers understand why the system assigned specific grades

These improvements will help make the Auto-Grader system more accurate, fair, and educationally effective.