# Math Error Detection Datasets - Download Guide

This notebook provides step-by-step instructions for downloading and setting up three key datasets for math error detection research:

1. **Math Misconceptions Dataset** - 220 algebra misconception examples
2. **FERMAT Dataset** - 2,244 handwritten math problems with images
3. **EGE Math Dataset** - 122 Russian EGE exam solutions with 3 image types

Each dataset is useful for different aspects of math error detection and OCR research.

## Prerequisites

Install required packages:

In [None]:
# Install required packages
!pip install datasets requests pandas numpy matplotlib pillow

In [None]:
import json
import requests
import pandas as pd
from pathlib import Path
from datasets import load_dataset
import numpy as np
from collections import Counter

## Setup Directory Structure

Create organized folder structure for all datasets:

In [None]:
# Create directory structure
base_dir = Path("math_datasets")
processed_dir = base_dir / "processed"
raw_dir = base_dir / "raw"
scripts_dir = base_dir / "scripts"

# Create all directories
for dir_path in [processed_dir, raw_dir, scripts_dir]:
    dir_path.mkdir(parents=True, exist_ok=True)

print(f"Created directory structure:")
print(f"{base_dir}/")
print(f"├── processed/    # Clean, ready-to-use datasets")
print(f"├── raw/          # Original downloaded data")
print(f"└── scripts/      # Processing utilities")

# Dataset 1: Math Misconceptions Dataset

**Source**: [MarkingCopilot Research Benchmarks](https://github.com/MarkingCopilot/markingResearch/blob/main/docs/benchmarks.md)  
**Size**: 220 examples  
**Language**: English  
**Content**: Common algebra misconceptions with incorrect/correct answer pairs  
**Use Cases**: Testing error detection on student misconceptions

## Download Math Misconceptions

In [None]:
def download_math_misconceptions():
    """Write a placeholder sample of the math misconceptions dataset.

    The source URL below is a placeholder and is not fetched yet. Replace the
    sample records with a real download once the actual URL is available.
    """
    
    print("Preparing Math Misconceptions Dataset (sample data)...")
    
    # URL to the math misconceptions data
    # Note: placeholder; replace with the actual URL when available.
    # This function does not fetch it yet.
    misconceptions_url = "https://raw.githubusercontent.com/MarkingCopilot/markingResearch/main/data/math_misconceptions.json"
    
    try:
        # For demonstration: build a sample record showing the expected structure.
        # In practice, you would download the full data from the actual URL.
        sample_misconceptions = [
            {
                "Example Number": 1,
                "Question": "Solve for x: 2x + 3 = 7",
                "Incorrect Answer": "x = 5",
                "Correct Answer": "x = 2",
                "Misconception": "Adding instead of subtracting when isolating variable",
                "Misconception ID": "ISOL_ADD_INSTEAD_SUB",
                "Topic": "Linear equations"
            },
            # Add more examples here...
        ]
        
        # Create output directories
        misconceptions_dir = processed_dir / "math_misconceptions"
        misconceptions_dir.mkdir(parents=True, exist_ok=True)
        
        # Save the dataset
        output_file = misconceptions_dir / "math_misconceptions.json"
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(sample_misconceptions, f, indent=2, ensure_ascii=False)
        
        # Create dataset info
        dataset_info = {
            "dataset_name": "Math Misconceptions Dataset",
            "source_url": "https://github.com/MarkingCopilot/markingResearch/blob/main/docs/benchmarks.md",
            "description": "Collection of common algebra misconceptions with structured incorrect/correct answer pairs",
            "language": "English",
            "total_examples": len(sample_misconceptions),
            "format": {
                "Question": "Problem statement",
                "Incorrect Answer": "Common student mistake",
                "Correct Answer": "Expected correct response",
                "Misconception": "Description of the underlying misconception",
                "Misconception ID": "Unique identifier for misconception type",
                "Topic": "Mathematical topic area"
            },
            "use_cases": [
                "Testing error detection algorithms",
                "Understanding common student mistakes",
                "Training misconception identification models"
            ]
        }
        
        with open(misconceptions_dir / "dataset_info.json", 'w', encoding='utf-8') as f:
            json.dump(dataset_info, f, indent=2, ensure_ascii=False)
        
        print(f"Math Misconceptions sample saved: {len(sample_misconceptions)} example(s)")
        print(f"Saved to: {output_file}")
        
        return sample_misconceptions
        
    except Exception as e:
        print(f"Error downloading misconceptions: {e}")
        return None

# Download the dataset
misconceptions_data = download_math_misconceptions()
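
Once real records are available, each incorrect/correct answer pair gives two test cases for an error detector: the incorrect answer should be flagged and the correct answer should pass. The following is a minimal sketch under that assumption. `evaluate_detector` and the toy detector are hypothetical names for illustration and are not part of the benchmark.

```python
from typing import Callable, Dict, List


def evaluate_detector(
    records: List[Dict], detector: Callable[[str, str], bool]
) -> Dict[str, float]:
    """Score a detector that returns True when it judges an answer wrong."""
    tp = fp = fn = tn = 0
    for rec in records:
        question = rec["Question"]
        # The incorrect answer should be flagged as an error
        if detector(question, rec["Incorrect Answer"]):
            tp += 1
        else:
            fn += 1
        # The correct answer should not be flagged
        if detector(question, rec["Correct Answer"]):
            fp += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Toy detector (hypothetical): flags every answer other than "x = 2"
sample = [
    {
        "Question": "Solve for x: 2x + 3 = 7",
        "Incorrect Answer": "x = 5",
        "Correct Answer": "x = 2",
    }
]
scores = evaluate_detector(sample, lambda question, answer: answer != "x = 2")
print(scores)
```

Counting both halves of each pair keeps a detector that flags everything from looking perfect: it would reach full recall but only 0.5 precision on this data.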

# Dataset 2: FERMAT Dataset

**Source**: [HuggingFace - ai4bharat/fermat](https://huggingface.co/datasets/ai4bharat/fermat)  
**Paper**: "Can Vision-Language Models Evaluate Handwritten Math?" (arXiv:2501.07244)  
**Size**: 2,244 handwritten math problems  
**Language**: English  
**Content**: Handwritten math solutions with error annotations and images  
**Use Cases**: Testing complete OCR + error detection pipeline

## Download FERMAT Dataset

In [None]:
def download_fermat_dataset(sample_size=100, save_images=True):
    """Download FERMAT dataset from HuggingFace.
    
    Args:
        sample_size: Number of examples to process (default 100, set to None for all 2244)
        save_images: Whether to save the handwritten images
    """
    
    print(f"Downloading FERMAT Dataset (sample_size: {sample_size})...")
    
    try:
        # Load dataset from HuggingFace
        print("Loading from HuggingFace...")
        dataset = load_dataset("ai4bharat/fermat")
        train_data = dataset['train']
        
        print(f"Total examples available: {len(train_data)}")
        
        # Create output directories
        fermat_dir = processed_dir / "fermat"
        fermat_images_dir = fermat_dir / "images"
        fermat_dir.mkdir(parents=True, exist_ok=True)
        
        if save_images:
            fermat_images_dir.mkdir(parents=True, exist_ok=True)
        
        # Process examples
        processed_data = []
        # Cap the count at the dataset size so the progress messages stay accurate
        num_to_process = len(train_data) if sample_size is None else min(sample_size, len(train_data))
        
        for idx in range(num_to_process):
            item = train_data[idx]
            
            if idx % 50 == 0:
                print(f"Processing example {idx + 1}/{num_to_process}...")
            
            # Extract metadata
            processed_item = {
                'index': idx,
                'problem': item.get('problem', ''),
                'correct_answer': item.get('correct_answer', ''),
                'student_answer': item.get('student_answer', ''),
                'has_error': item.get('has_error', False),
                'error_reasoning': item.get('error_reasoning', ''),
                'domain': item.get('domain', ''),
                'grade': item.get('grade', ''),
                'handwriting_legible': item.get('handwriting_legible', True),
                'good_image_quality': item.get('good_image_quality', True),
                'dataset': 'fermat',
                'language': 'english'
            }
            
            # Handle image
            if save_images and item.get('image') is not None:
                image_filename = f"fermat_{idx:04d}.png"
                image_path = fermat_images_dir / image_filename
                
                try:
                    # Save PIL Image
                    item['image'].save(image_path, 'PNG')
                    processed_item['image_filename'] = image_filename
                    processed_item['has_image'] = True
                except Exception as e:
                    print(f"Failed to save image {idx}: {e}")
                    processed_item['has_image'] = False
            else:
                processed_item['has_image'] = False
            
            processed_data.append(processed_item)
        
        # Save processed dataset
        output_file = fermat_dir / "fermat_processed.json"
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(processed_data, f, indent=2, ensure_ascii=False)
        
        # Create dataset info
        error_count = sum(1 for item in processed_data if item.get('has_error', False))
        domains = Counter([item.get('domain', 'unknown') for item in processed_data])
        
        dataset_info = {
            "dataset_name": "FERMAT: Can Vision-Language Models Evaluate Handwritten Math?",
            "source_url": "https://huggingface.co/datasets/ai4bharat/fermat",
            "paper_url": "https://arxiv.org/abs/2501.07244",
            "description": "Benchmark for assessing VLMs' ability to detect, localize and correct errors in handwritten mathematical content",
            "language": "English",
            "total_examples_available": len(train_data),
            "processed_examples": len(processed_data),
            "examples_with_errors": error_count,
            "examples_without_errors": len(processed_data) - error_count,
            "domains": dict(domains.most_common()),
            "format": {
                "problem": "Original math question in LaTeX",
                "correct_answer": "Correct solution in LaTeX",
                "student_answer": "Erroneous solution in LaTeX",
                "has_error": "Boolean indicating actual error vs superficial change",
                "error_reasoning": "Explanation of introduced error",
                "domain": "Mathematical domain code",
                "grade": "Grade level",
                "image_filename": "Corresponding handwritten image file"
            },
            "use_cases": [
                "Testing OCR accuracy on handwritten math",
                "Evaluating VLM error detection capabilities",
                "Training handwritten math recognition models"
            ]
        }
        
        with open(fermat_dir / "dataset_info.json", 'w', encoding='utf-8') as f:
            json.dump(dataset_info, f, indent=2, ensure_ascii=False)
        
        print(f"FERMAT downloaded: {len(processed_data)} examples")
        print(f"Saved to: {output_file}")
        if save_images:
            image_count = sum(1 for item in processed_data if item.get('has_image', False))
            print(f"Images saved: {image_count} in {fermat_images_dir}")
        
        return processed_data
        
    except Exception as e:
        print(f"Error downloading FERMAT: {e}")
        return None

# Download the dataset (start with smaller sample)
fermat_data = download_fermat_dataset(sample_size=100, save_images=True)
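
The function above reads every field with `item.get(...)`, which silently substitutes a default when a column name does not exist. A minimal sketch for spotting that situation is shown below. The expected names are the ones used in the code above and are assumptions about the HuggingFace schema, so it is worth comparing them with `train_data.column_names` before trusting the processed output. `count_missing_fields` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Field names read by download_fermat_dataset (assumed, not verified)
EXPECTED_FERMAT_FIELDS = [
    "problem", "correct_answer", "student_answer", "has_error",
    "error_reasoning", "domain", "grade", "image",
]


def count_missing_fields(records: Iterable[Dict], expected: List[str]) -> Dict[str, int]:
    """Return how many records lack each expected key."""
    missing = Counter()
    for rec in records:
        for name in expected:
            if name not in rec:
                missing[name] += 1
    return dict(missing)


# Demo on in-memory records; real use would pass a few raw dataset rows
demo = [
    {"problem": "1+1", "has_error": True},
    {"problem": "2+2", "grade": "5"},
]
report = count_missing_fields(demo, ["problem", "has_error", "grade"])
print(report)
```

A key that is missing from every record usually means the column is named differently in the source dataset, not that the data is absent.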

## Increase FERMAT Sample Size (Optional)

If you want to download more examples:

In [None]:
# Uncomment to download more examples (this may take longer)
# fermat_data_large = download_fermat_dataset(sample_size=500, save_images=True)

# Uncomment to download ALL examples (this will take significant time and space)
# fermat_data_full = download_fermat_dataset(sample_size=None, save_images=True)

# Dataset 3: EGE Math Solutions Assessment Benchmark

**Source**: [HuggingFace - Karifannaa/EGE_Math_Solutions_Assessment_Benchmark](https://huggingface.co/datasets/Karifannaa/EGE_Math_Solutions_Assessment_Benchmark)  
**Paper**: "EGE Math Solutions Assessment Benchmark" (arXiv:2507.22958)  
**Size**: 122 examples  
**Language**: Russian  
**Content**: Russian EGE exam solutions with 3 types of images  
**Use Cases**: Cross-language error detection, Russian math education research

## Download EGE Dataset with All Image Types

In [None]:
def download_ege_dataset():
    """Download EGE Math dataset with all three image types."""
    
    print("Downloading EGE Math Dataset...")
    print("This dataset contains THREE types of images:")
    print("1. images_with_answer (student solutions)")
    print("2. images_without_answer (problem statements)")
    print("3. images_with_true_solution (correct solutions)")
    
    try:
        # Load dataset from HuggingFace
        dataset = load_dataset("Karifannaa/EGE_Math_Solutions_Assessment_Benchmark")
        train_data = dataset['train']
        
        print(f"Total examples: {len(train_data)}")
        
        # Create output directories
        ege_dir = processed_dir / "ege_math"
        images_dir = ege_dir / "images"
        student_answers_dir = images_dir / "student_answers"
        problems_dir = images_dir / "problems"
        correct_solutions_dir = images_dir / "correct_solutions"
        
        for dir_path in [ege_dir, student_answers_dir, problems_dir, correct_solutions_dir]:
            dir_path.mkdir(parents=True, exist_ok=True)
        
        # Process examples
        processed_data = []
        all_images_info = []
        
        image_type_mapping = {
            'images_with_answer': (student_answers_dir, 'Student solution'),
            'images_without_answer': (problems_dir, 'Problem statement'),
            'images_with_true_solution': (correct_solutions_dir, 'Correct solution')
        }
        
        for idx, item in enumerate(train_data):
            if idx % 25 == 0:
                print(f"Processing example {idx + 1}/{len(train_data)}...")
            
            # Create metadata entry
            processed_item = {
                'index': idx,
                'solution_id': item.get('solution_id'),
                'task_id': item.get('task_id'),
                'example_id': item.get('example_id'),
                'task_type': item.get('task_type'),
                'score': item.get('score'),
                'parts_count': item.get('parts_count'),
                'dataset': 'ege_math',
                'language': 'russian',
                'problem': '',  # To be filled by OCR
                'solution': '',  # To be filled by OCR
                'student_answer': '',  # To be filled by OCR
                'has_images': False,
                'image_counts': {}
            }
            
            # Process each image type
            for field_name, (output_dir, description) in image_type_mapping.items():
                images = item.get(field_name, [])
                if images:
                    processed_item['has_images'] = True
                    processed_item['image_counts'][field_name] = len(images)
                    
                    # Save each image
                    for img_idx, image_obj in enumerate(images):
                        # Resolve the ID outside the try block so the except
                        # handler can always reference it
                        solution_id = item.get('solution_id', f'unknown_{idx}')
                        try:
                            image_filename = f"ege_{solution_id}_{field_name}_{img_idx}.png"
                            image_path = output_dir / image_filename
                            
                            # Save PIL Image
                            image_obj.save(image_path, 'PNG')
                            
                            # Track image info
                            image_info = {
                                'solution_id': solution_id,
                                'task_type': item.get('task_type'),
                                'score': item.get('score'),
                                'image_filename': image_filename,
                                'image_type': field_name,
                                'image_description': description,
                                'relative_path': f'images/{output_dir.name}/{image_filename}',
                                'original_index': idx,
                                'image_index': img_idx
                            }
                            all_images_info.append(image_info)
                            
                        except Exception as e:
                            print(f"Failed to save {field_name} image {img_idx} for {solution_id}: {e}")
                else:
                    processed_item['image_counts'][field_name] = 0
            
            processed_data.append(processed_item)
        
        # Save processed dataset
        output_file = ege_dir / "ege_processed.json"
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(processed_data, f, indent=2, ensure_ascii=False)
        
        # Save image mapping
        mapping_file = ege_dir / "complete_image_mapping.json"
        with open(mapping_file, 'w', encoding='utf-8') as f:
            json.dump(all_images_info, f, indent=2, ensure_ascii=False)
        
        # Create statistics
        stats = {
            'total_examples': len(processed_data),
            'total_images': len(all_images_info),
            'image_type_counts': {}
        }
        
        for field_name in image_type_mapping.keys():
            count = len([img for img in all_images_info if img['image_type'] == field_name])
            stats['image_type_counts'][field_name] = count
        
        # Create dataset info
        dataset_info = {
            "dataset_name": "EGE Math Solutions Assessment Benchmark",
            "source_url": "https://huggingface.co/datasets/Karifannaa/EGE_Math_Solutions_Assessment_Benchmark",
            "paper_url": "https://arxiv.org/abs/2507.22958",
            "description": "Russian high school math exam solutions with quality assessments and three types of images",
            "language": "Russian",
            "total_examples": len(processed_data),
            "total_images": len(all_images_info),
            "image_types": {
                "images_with_answer": "Student handwritten solutions",
                "images_without_answer": "Problem statements (no solutions shown)",
                "images_with_true_solution": "Correct reference solutions"
            },
            "image_organization": {
                "student_answers/": "Student solution images",
                "problems/": "Problem statement images",
                "correct_solutions/": "Reference solution images"
            },
            "format": {
                "solution_id": "Unique solution identifier",
                "task_type": "Mathematical topic/domain",
                "score": "Quality assessment score (0-4)",
                "image_counts": "Count of each image type for this example"
            },
            "use_cases": [
                "Russian language math OCR testing",
                "Cross-language error detection",
                "Comparing student vs correct solutions",
                "Problem statement extraction"
            ],
            "statistics": stats
        }
        
        with open(ege_dir / "dataset_info.json", 'w', encoding='utf-8') as f:
            json.dump(dataset_info, f, indent=2, ensure_ascii=False)
        
        print(f"\nEGE Dataset download complete!")
        print(f"Main data: {output_file}")
        print(f"Total images: {len(all_images_info)}")
        print(f"\nImage breakdown:")
        for field_name, (_, description) in image_type_mapping.items():
            count = stats['image_type_counts'][field_name]
            print(f"  {description}: {count} images")
        
        return processed_data
        
    except Exception as e:
        print(f"Error downloading EGE: {e}")
        return None

# Download the dataset
ege_data = download_ege_dataset()
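
To compare a student solution with its reference solution, the entries in `complete_image_mapping.json` can be grouped by `solution_id` and image type. This is a minimal sketch that assumes the mapping entries carry the `solution_id`, `image_type` and `relative_path` keys written above. `group_images_by_solution` is a hypothetical helper, and the demo entries are made up.

```python
from collections import defaultdict
from typing import Dict, List


def group_images_by_solution(image_mapping: List[Dict]) -> Dict[str, Dict[str, List[str]]]:
    """Map solution_id -> image_type -> list of relative image paths."""
    grouped = defaultdict(lambda: defaultdict(list))
    for info in image_mapping:
        grouped[info["solution_id"]][info["image_type"]].append(info["relative_path"])
    # Convert nested defaultdicts to plain dicts for predictable lookups
    return {sid: dict(types) for sid, types in grouped.items()}


# Demo entries shaped like those in complete_image_mapping.json
demo_mapping = [
    {
        "solution_id": "s1",
        "image_type": "images_with_answer",
        "relative_path": "images/student_answers/ege_s1_images_with_answer_0.png",
    },
    {
        "solution_id": "s1",
        "image_type": "images_with_true_solution",
        "relative_path": "images/correct_solutions/ege_s1_images_with_true_solution_0.png",
    },
]
pairs = group_images_by_solution(demo_mapping)
print(sorted(pairs["s1"]))
```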

# Summary and Dataset Overview

Let's summarize what we've downloaded:

In [None]:
def summarize_datasets():
    """Provide a summary of all downloaded datasets."""
    
    print("DATASET DOWNLOAD SUMMARY")
    print("=" * 50)
    
    datasets_info = [
        {
            'name': 'Math Misconceptions',
            'path': processed_dir / 'math_misconceptions' / 'math_misconceptions.json',
            'language': 'English',
            'content': 'Algebra misconceptions',
            'use_case': 'Error detection testing'
        },
        {
            'name': 'FERMAT',
            'path': processed_dir / 'fermat' / 'fermat_processed.json',
            'language': 'English',
            'content': 'Handwritten math + images',
            'use_case': 'OCR + error detection'
        },
        {
            'name': 'EGE Math',
            'path': processed_dir / 'ege_math' / 'ege_processed.json',
            'language': 'Russian',
            'content': 'EGE exam solutions + 3 image types',
            'use_case': 'Cross-language detection'
        }
    ]
    
    for dataset in datasets_info:
        if dataset['path'].exists():
            try:
                with open(dataset['path'], 'r', encoding='utf-8') as f:
                    data = json.load(f)
                    count = len(data)
                    status = f"Downloaded: {count} examples"
            except (OSError, json.JSONDecodeError):
                status = "File exists but couldn't read"
        else:
            status = "Not downloaded"
        
        print(f"\n{dataset['name']}")
        print(f"   Language: {dataset['language']}")
        print(f"   Content: {dataset['content']}")
        print(f"   Use Case: {dataset['use_case']}")
        print(f"   Status: {status}")
    
    print(f"\nAll datasets saved to: {base_dir.absolute()}")
    
    # Show directory structure
    print(f"\nDirectory Structure:")
    print(f"{base_dir}/")
    print(f"├── processed/")
    for dataset_dir in processed_dir.iterdir():
        if dataset_dir.is_dir():
            print(f"│   ├── {dataset_dir.name}/")
            for file in dataset_dir.iterdir():
                if file.is_file():
                    print(f"│   │   ├── {file.name}")
                elif file.is_dir():
                    print(f"│   │   └── {file.name}/ (images)")
    print(f"├── raw/")
    print(f"└── scripts/")

# Show summary
summarize_datasets()

# Next Steps

## Using the Datasets

1. **Math Misconceptions**: Use for testing error detection algorithms on common student mistakes
2. **FERMAT**: Use for testing OCR + error detection on handwritten math images
3. **EGE Math**: Use for cross-language research and comparing different image types
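
As a starting point for any of these uses, the processed JSON files can be read back with one small loader. The sketch below assumes the file names written by the download functions in this notebook. `load_processed` is a hypothetical helper, and the demo writes a tiny file to a temporary directory instead of touching real data.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Main data files, relative to the processed directory
DATASET_FILES = {
    "math_misconceptions": "math_misconceptions/math_misconceptions.json",
    "fermat": "fermat/fermat_processed.json",
    "ege_math": "ege_math/ege_processed.json",
}


def load_processed(processed_root: Path, name: str) -> List[Dict]:
    """Load one processed dataset as a list of dicts."""
    path = Path(processed_root) / DATASET_FILES[name]
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Demo: write a one-record file, then load it back
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / DATASET_FILES["fermat"]
    target.parent.mkdir(parents=True)
    target.write_text(json.dumps([{"index": 0, "has_error": True}]), encoding="utf-8")
    demo_records = load_processed(Path(tmp), "fermat")

print(len(demo_records))
```

In the notebook itself, the call would look like `load_processed(processed_dir, "fermat")`.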

## Integration with Error Detection Tools

These datasets can be integrated with:
- OCR systems and vision-language models (e.g., Mathpix, GPT-4V)
- Error detection models
- Math education research tools
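
One way to connect an OCR system is to pass it in as a plain function from image path to text, and use it to fill the empty `problem`, `solution` and `student_answer` fields of the EGE records. The sketch below is only an illustration of that idea: `fill_text_fields` and `ocr_fn` are hypothetical, no real OCR API is called, and the mapping from image type to text field is an assumption based on the image descriptions in this notebook.

```python
from typing import Callable, Dict, List

# Assumed correspondence between image types and text fields
FIELD_FOR_IMAGE_TYPE = {
    "images_without_answer": "problem",
    "images_with_true_solution": "solution",
    "images_with_answer": "student_answer",
}


def fill_text_fields(
    records: List[Dict], image_mapping: List[Dict], ocr_fn: Callable[[str], str]
) -> List[Dict]:
    """Return copies of the records with text fields filled from OCR output."""
    by_index = {rec["index"]: dict(rec) for rec in records}
    # Sort so that multi-page solutions are concatenated in page order
    ordered = sorted(image_mapping, key=lambda i: (i["original_index"], i["image_index"]))
    for info in ordered:
        rec = by_index.get(info["original_index"])
        field = FIELD_FOR_IMAGE_TYPE.get(info["image_type"])
        if rec is None or field is None:
            continue
        text = ocr_fn(info["relative_path"])
        rec[field] = (rec.get(field, "") + "\n" + text).strip()
    return list(by_index.values())


# Demo with a stand-in OCR function that just echoes the path
demo_records = [{"index": 0, "problem": "", "solution": "", "student_answer": ""}]
demo_images = [
    {"original_index": 0, "image_index": 1, "image_type": "images_with_answer", "relative_path": "b.png"},
    {"original_index": 0, "image_index": 0, "image_type": "images_with_answer", "relative_path": "a.png"},
]
filled = fill_text_fields(demo_records, demo_images, lambda path: f"<text of {path}>")
print(filled[0]["student_answer"])
```

Keeping the OCR call behind a function argument means the same loop works with any backend, and it can be tested without network access.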

## File Organization

Each dataset includes:
- **Main data file**: JSON with all examples
- **dataset_info.json**: Complete metadata and documentation
- **Images folder**: Organized image files (where applicable)
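
A quick way to confirm this layout after a download is to check each dataset folder for its main data file and `dataset_info.json`. The sketch below assumes the folder and file names used in this notebook. `check_dataset_layout` is a hypothetical helper, and the demo runs against a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def check_dataset_layout(processed_root: Path, main_files: Dict[str, str]) -> Dict[str, List[str]]:
    """Return, per dataset folder, the expected files that are missing."""
    problems = {}
    for name, main_file in main_files.items():
        dataset_dir = Path(processed_root) / name
        missing = [
            filename
            for filename in (main_file, "dataset_info.json")
            if not (dataset_dir / filename).is_file()
        ]
        if missing:
            problems[name] = missing
    return problems


# Demo: a folder that has its main file but no dataset_info.json
with tempfile.TemporaryDirectory() as tmp:
    fermat_demo_dir = Path(tmp) / "fermat"
    fermat_demo_dir.mkdir()
    (fermat_demo_dir / "fermat_processed.json").write_text("[]", encoding="utf-8")
    layout_report = check_dataset_layout(Path(tmp), {"fermat": "fermat_processed.json"})

print(layout_report)
```

An empty result means every listed dataset has both files in place.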

## Contributing Back

Consider contributing:
- Improved processing scripts
- Additional dataset formats
- Error detection results and benchmarks

## Repository Integration

This notebook can be added to the [MarkingCopilot/trainingData](https://github.com/MarkingCopilot/trainingData) repository to help others download and set up these datasets for math error detection research.