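# AMP (Advanced Mathematical Problems) Dataset Analysis

## Introduction

AMP is a large dataset of mathematics problems drawn from Khan Academy and Mathematica. It is useful for training and evaluating AI systems that solve math problems.

## Dataset Structure

The dataset has two main parts:
1. **Khan Academy problems**: stored as JSON, with detailed solution steps
2. **Mathematica problems**: stored as text files, each containing a problem and its answer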
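
To make the layout concrete, the sketch below builds a tiny throwaway tree with the same two top-level directories the loaders in this notebook look for (`khan` and `mathematica`) and counts files by suffix. The file contents, the category directory names (`1`, `algebra`), and the helper names `build_toy_dataset` and `count_suffixes` are made up for illustration and are not part of the real dataset.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def build_toy_dataset(root: Path) -> None:
    """Create a tiny, invented tree that mimics the two-part layout."""
    khan_dir = root / "khan" / "1"
    khan_dir.mkdir(parents=True)
    record = {"problem": "What is 2 + 3?", "hints": ["Add the ones.", "2 + 3 = 5"]}
    (khan_dir / "1.json").write_text(json.dumps(record), encoding="utf-8")

    math_dir = root / "mathematica" / "algebra"
    math_dir.mkdir(parents=True)
    (math_dir / "1.txt").write_text(
        "Problem:\nSolve x + 1 = 3.\nAnswer:\n2\n", encoding="utf-8"
    )


def count_suffixes(root: Path) -> Counter:
    """Count regular files under root by lower-cased suffix."""
    return Counter(p.suffix.lower() for p in root.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    toy_root = Path(tmp)
    build_toy_dataset(toy_root)
    toy_counts = count_suffixes(toy_root)
    print(dict(toy_counts))
```

Running the same `count_suffixes` helper on the real dataset root gives a quick sanity check before loading millions of files.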

## Environment Setup

Before running this notebook, make sure the required dependencies are installed:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install pandas numpy soundfile
```

In [None]:
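# Import the required libraries
from pathlib import Path
import pandas as pd
import json
import warnings
import os
from collections import Counter

# soundfile is only needed for audio files, so treat it as optional
try:
    import soundfile as sf
except (ImportError, OSError):
    sf = None

# Path to the dataset
DATA_DIR = Path("/Users/jia/datasets/amps").expanduser()

print(f"Dataset path: {DATA_DIR}")
print(f"Path exists: {DATA_DIR.exists()}")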

## Data Loading Functions

The following functions load and analyze the different file types in the AMP dataset.

In [None]:
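# General-purpose data loading function
def load_amp_datasets(directory=DATA_DIR, recursive=False):
    """
    Load the files in a directory into a dict keyed by file name stem.
    Supports CSV, TSV, JSON (ndjson or standard), parquet, feather, pickle,
    Excel, and text files, plus common audio files (wav/mp3/flac) when
    soundfile is available.
    """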
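    directory = Path(directory)
    if not directory.exists():
        raise FileNotFoundError(f"{directory} does not exist")
    files = directory.rglob("*") if recursive else directory.iterdir()
    datasets = {}

    # Optional audio loader: `sf` is the module-level soundfile import.
    # Do not assign to `sf` inside this function; that would make the name
    # local and raise UnboundLocalError on the first lookup, which silently
    # disabled audio loading in the earlier version of this cell.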

    for p in sorted(files):
        if p.is_dir():
            continue
        key = p.stem
        suf = p.suffix.lower()
        try:
            if suf == ".csv":
                obj = pd.read_csv(p)
            elif suf in (".tsv", ".tab"):
                obj = pd.read_csv(p, sep="\t")
            elif suf in (".parquet",):
                try:
                    obj = pd.read_parquet(p)
                except Exception as e:
                    warnings.warn(f"Failed to load parquet {p}: {e}")
                    obj = p
            elif suf in (".feather",):
                try:
                    obj = pd.read_feather(p)
                except Exception as e:
                    warnings.warn(f"Failed to load feather {p}: {e}")
                    obj = p
            elif suf in (".json",):
                # Try line-delimited JSON first, then standard JSON
                try:
                    obj = pd.read_json(p, lines=True)
                except Exception:
                    try:
                        obj = pd.read_json(p)
                    except Exception:
                        with p.open("r", encoding="utf-8") as f:
                            obj = json.load(f)
            elif suf in (".ndjson", ".jsonl"):
                obj = pd.read_json(p, lines=True)
            elif suf in (".pkl", ".pickle"):
                try:
                    obj = pd.read_pickle(p)
                except Exception as e:
                    warnings.warn(f"Failed to load pickle {p}: {e}")
                    obj = p
            elif suf in (".xls", ".xlsx"):
                try:
                    obj = pd.read_excel(p)
                except Exception as e:
                    warnings.warn(f"Failed to load Excel file {p}: {e}")
                    obj = p
            elif suf in (".txt", ".log"):
                obj = p.read_text(encoding="utf-8", errors="replace")
            elif suf in (".wav", ".mp3", ".flac") and sf is not None:
                try:
                    data, sr = sf.read(str(p))
                    obj = {"audio": data, "samplerate": sr}
                except Exception as e:
                    warnings.warn(f"Failed to load audio {p}: {e}")
                    obj = p
            else:
                # Unknown type: store the Path for later handling
                obj = p
        except Exception as e:
            warnings.warn(f"Failed to load {p}: {e}")
            obj = p

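        # Handle duplicate stems: fall back to the full file name, then to
        # the path relative to `directory` (needed when recursive=True,
        # where the same file name can appear in several subdirectories)
        if key in datasets:
            key = p.name
            if key in datasets:
                key = str(p.relative_to(directory))
        datasets[key] = obj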

    # Print a brief summary
    summary_lines = []
    for name, obj in datasets.items():
        if isinstance(obj, pd.DataFrame):
            summary = f"DataFrame {obj.shape}"
        elif isinstance(obj, pd.Series):
            summary = f"Series {obj.shape}"
        elif isinstance(obj, dict) and "audio" in obj:
            summary = f"Audio {obj['audio'].shape} @ {obj['samplerate']}Hz"
        elif isinstance(obj, (list, tuple, dict)):
            try:
                summary = f"{type(obj).__name__} len={len(obj)}"
            except Exception:
                summary = type(obj).__name__
        elif isinstance(obj, Path):
            summary = f"Path ({obj.suffix})"
        elif isinstance(obj, str):
            summary = f"text len={len(obj)}"
        else:
            summary = type(obj).__name__
        summary_lines.append(f"{name}: {summary}")
    print(f"Loaded {len(datasets)} items from {directory}:")
    for line in summary_lines:
        print(" -", line)

    return datasets

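# Loaders written specifically for the AMP dataset
def load_khan_problems(base_path):
    """
    Load the Khan Academy problems from the JSON files under `khan/`.
    """
    khan_path = Path(base_path) / "khan"
    problems = []

    if not khan_path.exists():
        print(f"Khan Academy directory not found at {khan_path}")
        return problems

    # Walk all subdirectories and collect the JSON files
    for json_file in khan_path.rglob("*.json"):
        try:
            with open(json_file, "r", encoding="utf-8") as f:
                data = json.load(f)
            if not isinstance(data, dict):
                # Only dict records can carry a 'source_file' key
                print(f"Skipping {json_file}: expected a JSON object")
                continue
            data["source_file"] = str(json_file)
            problems.append(data)
        except Exception as e:
            print(f"Error loading {json_file}: {e}")

    return problems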

def load_mathematica_problems(base_path):
    """
    Load the Mathematica problems from the text files under `mathematica/`.
    """
    mathematica_path = Path(base_path) / "mathematica"
    problems = []

    if not mathematica_path.exists():
        print(f"Mathematica directory not found at {mathematica_path}")
        return problems

    # Process the .txt files, each holding a problem/answer pair
    for txt_file in mathematica_path.rglob("*.txt"):
        try:
            with open(txt_file, "r", encoding="utf-8") as f:
                content = f.read().strip()
            # Parse the simple problem/answer format. Files that do not
            # contain exactly one 'Answer:' marker are skipped silently.
            parts = content.split("Answer:")
            if len(parts) == 2:
                problem = parts[0].replace("Problem:", "").strip()
                answer = parts[1].strip()
                problems.append({
                    "problem": problem,
                    "answer": answer,
                    "source_file": str(txt_file)
                })
        except Exception as e:
            print(f"Error loading {txt_file}: {e}")

    return problems

# Analysis functions
def _count_files_by_category(category_root, pattern):
    """
    Count the files matching `pattern` under each immediate subdirectory of
    `category_root`, sorted by count in descending order.
    """
    categories = []

    if not category_root.exists():
        return categories

    for subdir in category_root.iterdir():
        if subdir.is_dir():
            # Summing over the generator avoids building a list of paths
            file_count = sum(1 for _ in subdir.rglob(pattern))
            categories.append({"name": subdir.name, "count": file_count})

    # Sort by count, largest first
    categories.sort(key=lambda x: x["count"], reverse=True)
    return categories

def analyze_khan_categories(base_path):
    """
    Count the JSON files in each Khan Academy category directory.
    """
    return _count_files_by_category(Path(base_path) / "khan", "*.json")

def analyze_mathematica_categories(base_path):
    """
    Count the TXT files in each Mathematica category directory.
    """
    return _count_files_by_category(Path(base_path) / "mathematica", "*.txt")

## Dataset Analysis

Next, load the data and analyze it in detail.

In [None]:
# Load the dataset (reuse DATA_DIR rather than repeating the hard-coded path)
amp_dataset_path = DATA_DIR
khan_problems = load_khan_problems(amp_dataset_path)
mathematica_problems = load_mathematica_problems(amp_dataset_path)

print("=== AMP Dataset Analysis ===\n")
print(f"Total Khan Academy problems: {len(khan_problems):,}")
print(f"Total Mathematica problems: {len(mathematica_problems):,}")
print(f"Total problems: {len(khan_problems) + len(mathematica_problems):,}\n")

# Show one sample problem from each source
if khan_problems:
    print("=== Khan Academy Sample Problem ===")
    print(f"Problem: {khan_problems[0].get('problem', 'N/A')}")
    print(f"Number of solution steps: {len(khan_problems[0].get('hints', []))}")
    print(f"Source file: {khan_problems[0].get('source_file', 'N/A')}\n")

if mathematica_problems:
    print("=== Mathematica Sample Problem ===")
    print(f"Problem: {mathematica_problems[0].get('problem', 'N/A')}")
    print(f"Answer: {mathematica_problems[0].get('answer', 'N/A')}")
    print(f"Source file: {mathematica_problems[0].get('source_file', 'N/A')}\n")

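## Detailed Analysis

This section looks more closely at how problems are distributed across categories and at the file types present in the dataset.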

In [None]:
# Show more sample problems
print("=== More Khan Academy Sample Problems ===")
for i, problem in enumerate(khan_problems[:3], 1):
    print(f"{i}. Problem: {problem.get('problem', 'N/A')[:100]}...")
    hints = problem.get('hints', [])
    print(f"   Number of solution steps: {len(hints)}")
    print(f"   Source: {problem.get('source_file', 'N/A')}\n")

print("=== More Mathematica Sample Problems ===")
for i, problem in enumerate(mathematica_problems[:3], 1):
    print(f"{i}. Problem: {problem.get('problem', 'N/A')[:100]}...")
    print(f"   Answer: {problem.get('answer', 'N/A')}")
    print(f"   Source: {problem.get('source_file', 'N/A')}\n")

# Analyze the Khan Academy categories
print("=== Khan Academy Categories (Top 10) ===")
khan_categories = analyze_khan_categories(amp_dataset_path)
for i, category in enumerate(khan_categories[:10], 1):
    print(f"{i}. {category['name']}: {category['count']:,} problems")

# Analyze the Mathematica categories (counts are .txt files, not parsed problems)
print("\n=== Mathematica Categories ===")
mathematica_categories = analyze_mathematica_categories(amp_dataset_path)
for i, category in enumerate(mathematica_categories, 1):
    print(f"{i}. {category['name']}: {category['count']:,} .txt files")

# Count files by extension
print("\n=== File Extension Analysis ===")
file_extensions = Counter()
for root, dirs, files in os.walk(amp_dataset_path):
    for file in files:
        ext = os.path.splitext(file)[1].lower()
        if ext:
            file_extensions[ext] += 1

print("File type distribution:")
for ext, count in file_extensions.most_common():
    print(f"  {ext}: {count:,} files")

## Detailed Problem Samples

This section prints a few problems in full, including their complete solution steps.

In [None]:
def show_khan_samples(problems, count=2):
    """
    Print detailed Khan Academy sample problems.
    """
    print("=== Detailed Khan Academy Sample Problems ===")
    for i, problem in enumerate(problems[:count]):
        print(f"\nSample {i+1}:")
        print(f"Problem: {problem.get('problem', 'N/A')}")

        hints = problem.get('hints', [])
        print(f"\nSolution steps ({len(hints)} total):")
        # Show the first 5 and last 2 steps, or all of them if there are 7 or fewer
        if len(hints) <= 7:
            for j, hint in enumerate(hints, 1):
                print(f"  {j}. {hint}")
        else:
            # Show the first 5 steps
            for j, hint in enumerate(hints[:5], 1):
                print(f"  {j}. {hint}")
            print("  ...")
            # Show the last 2 steps, keeping their original numbering
            for j, hint in enumerate(hints[-2:], len(hints) - 1):
                print(f"  {j}. {hint}")

        print(f"\nSource: {problem.get('source_file', 'N/A')}")

def show_mathematica_samples(problems, count=2):
    """
    Print detailed Mathematica sample problems.
    """
    print("\n=== Detailed Mathematica Sample Problems ===")
    for i, problem in enumerate(problems[:count]):
        print(f"\nSample {i+1}:")
        print(f"Problem: {problem.get('problem', 'N/A')}")
        print(f"Answer: {problem.get('answer', 'N/A')}")
        print(f"Source: {problem.get('source_file', 'N/A')}")

# Print the detailed samples
show_khan_samples(khan_problems)
show_mathematica_samples(mathematica_problems)

# AMP Dataset Analysis: Summary Report

## Dataset Overview

AMP (Advanced Mathematical Problems) is a large dataset of mathematics problems drawn from Khan Academy and Mathematica:

- **Total problems**: 4,715,721
  - Khan Academy problems: 103,059 (2.2%)
  - Mathematica problems: 4,612,662 (97.8%)
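
The percentages follow directly from the two counts. The short check below reproduces them from the figures reported above; the variable names are illustrative only.

```python
# Counts as reported in the overview above
khan_total = 103_059
mathematica_total = 4_612_662

total = khan_total + mathematica_total

# Share of each source, as a percentage rounded to one decimal place
shares = {
    "khan": round(100 * khan_total / total, 1),
    "mathematica": round(100 * mathematica_total / total, 1),
}

print(f"total = {total:,}")
print(shares)
```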

## Detailed Statistics

### File Type Distribution
- `.txt` files: 4,830,501 (mostly Mathematica problems)
- `.json` files: 103,059 (Khan Academy problems)
- `.nb` files: 137 (Mathematica notebooks)
- `.swp` files: 1 (temporary editor swap file)

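### Khan Academy Problem Details
- **Number of categories**: 721
- **Total problems**: 103,059
- **Largest categories** (top 5 shown):
  1. Category 441: 1,521 problems
  2. Category 363: 1,402 problems
  3. Category 518: 1,319 problems
  4. Category 401: 1,137 problems
  5. Category 184: 1,114 problems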

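### Mathematica Problem Details
- **Number of categories**: 6
- **Total problems**: 4,612,662 (parsed problem/answer pairs)
- **Files per category** (`.txt` file counts from `analyze_mathematica_categories`; they sum to 4,830,500, more than the parsed total, presumably because `load_mathematica_problems` skips files that lack exactly one `Answer:` marker):
  1. Linear algebra: 1,295,000 files
  2. Algebra: 1,240,000 files
  3. Number theory: 750,500 files
  4. Counting and statistics: 705,000 files
  5. Calculus: 540,000 files
  6. Geometry: 300,000 files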
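
A quick arithmetic check helps interpret the category figures above. The sketch below sums them and compares the result with the two totals reported elsewhere in this summary. The dictionary keys are descriptive labels, not the actual directory names. The sum lands within one file of the reported `.txt` file count rather than on the reported problem total, which suggests the per-category figures count files; that reading is an inference from the numbers, not something the notebook states.

```python
# Per-category figures as listed above (keys are descriptive labels only)
category_counts = {
    "linear algebra": 1_295_000,
    "algebra": 1_240_000,
    "number theory": 750_500,
    "counting and statistics": 705_000,
    "calculus": 540_000,
    "geometry": 300_000,
}

# Totals reported elsewhere in this summary
reported_problem_total = 4_612_662
reported_txt_files = 4_830_501

category_sum = sum(category_counts.values())
gap_to_problems = category_sum - reported_problem_total
gap_to_txt_files = reported_txt_files - category_sum

print(f"sum of categories:       {category_sum:,}")
print(f"excess over problems:    {gap_to_problems:,}")
print(f"shortfall vs .txt files: {gap_to_txt_files:,}")
```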

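## Data Formats

### Khan Academy Format
- Stored as JSON files
- Each record contains the problem statement (`problem`) and step-by-step solution hints (`hints`)
- Problems typically have 10-30 solution steps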
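
The step-count range can be checked against the loaded records rather than taken on trust. The sketch below is one way to do that, assuming each record is a dict with an optional `hints` list, as returned by this notebook's `load_khan_problems`. The helper name `hint_count_summary` and the toy records are hypothetical.

```python
from statistics import median


def hint_count_summary(problems):
    """Return min/median/max of the hint counts, or None if there are no records."""
    counts = [len(p.get("hints", [])) for p in problems if isinstance(p, dict)]
    if not counts:
        return None
    return {"min": min(counts), "median": median(counts), "max": max(counts)}


# Toy records standing in for loaded Khan Academy problems
toy_problems = [
    {"problem": "a", "hints": ["step"] * 3},
    {"problem": "b", "hints": ["step"] * 5},
    {"problem": "c"},  # a record without hints counts as 0 steps
]

toy_summary = hint_count_summary(toy_problems)
print(toy_summary)
```

Calling `hint_count_summary(khan_problems)` after the loading cell would report the actual range for the full dataset.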

### Mathematica Format
- Mostly plain text files (`.txt`)
- A simple `Problem: ... Answer: ...` layout
- Also includes a small number of Mathematica notebook files (`.nb`)
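
The text layout can be parsed with a single split on the `Answer:` marker. The sketch below is a small variation on the notebook's loader, not a copy of it: it uses `str.partition`, which splits at the first marker, whereas `load_mathematica_problems` keeps only files with exactly one marker. The function name `parse_problem_answer` and the sample string are hypothetical.

```python
def parse_problem_answer(text):
    """Split 'Problem: ... Answer: ...' text into a dict, or return None."""
    head, marker, tail = text.partition("Answer:")
    if not marker:
        # No 'Answer:' marker, so this is not a problem/answer file
        return None
    problem = head.replace("Problem:", "", 1).strip()
    return {"problem": problem, "answer": tail.strip()}


sample_text = "Problem:\nSolve x + 1 = 3.\nAnswer:\n2\n"
parsed = parse_problem_answer(sample_text)
print(parsed)
```

Returning `None` for unparseable input lets a caller count skipped files explicitly instead of dropping them silently.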

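## Dataset Characteristics

1. **Large scale**: more than 4.7 million math problems
2. **Broad coverage**: spans algebra, geometry, calculus, linear algebra, number theory, and other areas of mathematics
3. **Multiple file formats**: includes JSON, TXT, and NB files
4. **High educational value**: the Khan Academy portion provides detailed solution steps, which suits teaching and learning

The dataset is well suited to training and evaluating systems that solve math problems, especially AI models that need to reason step by step.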