# AlignJuice - Data Exploration and Pipeline Demo

This notebook demonstrates how to use the AlignJuice framework to manage, process, and evaluate high-quality alignment data.

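## Contents
1. Installation and Imports
2. Loading and Exploring Data
3. Using Operators Individually
4. Running the Full Pipeline
5. Quality Evaluation and Reporting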

## 1. Installation and Imports

In [None]:
# Install AlignJuice (if it is not already installed)
# !pip install -e ..

# Import the core modules
from alignjuice import AlignJuice, DataContainer, AlignmentSample
from alignjuice.operators import (
    SemanticDedup, ExactDedup,
    QualityFilter, KnowledgeFilter, DiversityFilter,
    LLMSynthesis, ReasoningChainEnhancer
)
from alignjuice.metrics import KnowledgeDensity, Diversity

print("AlignJuice imported successfully!")

## 2. Loading and Exploring Data

In [None]:
# Create sample data
sample_data = [
    {
        "id": "sample_001",
        "instruction": "What is the capital of France?",
        "input": "",
        "output": "The capital of France is Paris. Paris is not only the capital but also the largest city in France, known for landmarks like the Eiffel Tower and the Louvre Museum.",
        "category": "factual"
    },
    {
        "id": "sample_002",
        "instruction": "Explain how photosynthesis works.",
        "input": "",
        "output": "Photosynthesis is the process by which plants convert sunlight, water, and carbon dioxide into glucose and oxygen. It occurs in the chloroplasts of plant cells.",
        "category": "factual"
    },
    {
        "id": "sample_003",
        "instruction": "If a train travels at 60 mph for 2.5 hours, how far does it travel?",
        "input": "",
        "output": "To find the distance, multiply speed by time: 60 mph × 2.5 hours = 150 miles. The train travels 150 miles.",
        "category": "reasoning"
    },
    {
        "id": "sample_004",
        "instruction": "Write a haiku about autumn.",
        "input": "",
        "output": "Crimson leaves falling\nCrisp wind whispers through bare trees\nNature's last warm breath",
        "category": "creative"
    },
    {
        "id": "sample_005",
        "instruction": "How do I make a cup of tea?",
        "input": "",
        "output": "To make tea: 1) Boil water, 2) Place tea bag in cup, 3) Pour hot water over tea bag, 4) Steep for 3-5 minutes, 5) Remove tea bag and enjoy. Add milk or honey if desired.",
        "category": "daily"
    },
    {
        "id": "sample_006",
        "instruction": "What is the capital of France?",  # duplicate of sample_001's instruction
        "input": "",
        "output": "Paris is the capital of France.",
        "category": "factual"
    },
]

# Create a DataContainer from the list
data = DataContainer.from_list(sample_data)
print(f"Loaded {len(data)} samples")

In [None]:
# Data overview
data.describe()

In [None]:
# Display the data
data.show()

In [None]:
# Random sample of 3 records
data.sample(3).show()

In [None]:
# Visualize the category distribution
data.plot_distribution("category")
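
To make clear what the category plot summarizes, here is a minimal plain-Python sketch that tallies the `category` values of the six samples above. It assumes the plot is based on simple per-category counts; it does not use or describe AlignJuice's own implementation.

```python
from collections import Counter

# Category labels of sample_001 .. sample_006, copied from the sample data above.
categories = ["factual", "factual", "reasoning", "creative", "daily", "factual"]

# Tally how many samples fall into each category.
counts = Counter(categories)

for category, count in counts.most_common():
    share = count / len(categories)
    print(f"{category:<10} {count}  ({share:.0%})")
```

With only six samples the distribution is dominated by `factual`, which is expected here because one factual question is deliberately duplicated.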

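## 3. Using Operators Individually

AlignJuice operators can be called on their own, outside a pipeline, which is convenient for interactive exploration and debugging.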
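
As a point of reference for the cells that follow, this is a minimal plain-Python sketch of what exact deduplication on a single field amounts to. The `exact_dedup` function is hypothetical and is not AlignJuice's `ExactDedup`; the dedup rate is assumed here to mean the fraction of records removed.

```python
# Plain-Python sketch of exact deduplication; not AlignJuice's implementation.
def exact_dedup(records, field="instruction"):
    """Keep the first record seen for each distinct value of `field`."""
    seen = set()
    kept = []
    for record in records:
        key = record[field]
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


records = [
    {"id": "sample_001", "instruction": "What is the capital of France?"},
    {"id": "sample_003", "instruction": "If a train travels at 60 mph for 2.5 hours, how far does it travel?"},
    {"id": "sample_006", "instruction": "What is the capital of France?"},
]

kept = exact_dedup(records)

# Assumed definition: share of the input that was dropped as duplicates.
dedup_rate = 1 - len(kept) / len(records)
print([r["id"] for r in kept], f"{dedup_rate:.1%}")
```

Matching is on the exact string, so paraphrased questions are not caught by this approach; a semantic method would be needed for those.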

In [None]:
# Exact deduplication on the instruction field
exact_dedup = ExactDedup(field="instruction")
deduped = exact_dedup(data)

print(f"Exact dedup: {len(data)} -> {len(deduped)} samples")
print(f"Dedup rate: {exact_dedup.metrics['dedup_rate']:.1%}")

In [None]:
# Inspect the samples that were removed
removed = data.diff(deduped)
print(f"{len(removed)} samples removed:")
removed.show()

In [None]:
# Quality filtering
quality_filter = QualityFilter(threshold=0.7)
filtered = quality_filter(deduped)

print(f"Quality filter: {len(deduped)} -> {len(filtered)} samples")
print(f"Average quality score: {quality_filter.metrics['avg_score']:.2f}")

In [None]:
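# Show the quality score of each sample
for sample in filtered:
    score = sample.metadata.get('quality_score')
    # Apply ':.2f' only to a real score; it would raise on a string fallback such as 'N/A'
    score_text = f"{score:.2f}" if score is not None else "N/A"
    print(f"{sample.id}: {score_text} - {sample.instruction[:50]}...")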

## 4. Running the Full Pipeline
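
Conceptually, a pipeline applies its stages in order, each one consuming the previous stage's output. The sketch below illustrates that idea in plain Python. `run_stages` and both stage functions are hypothetical names for illustration and are not part of AlignJuice's `Pipeline` API.

```python
# Conceptual sketch only; none of these names come from AlignJuice.
def drop_duplicate_instructions(records):
    """Stage 1: keep the first record for each distinct instruction."""
    seen = set()
    kept = []
    for record in records:
        if record["instruction"] not in seen:
            seen.add(record["instruction"])
            kept.append(record)
    return kept


def keep_first_two(records):
    """Stage 2: truncate to a target count of two records."""
    return records[:2]


def run_stages(records, stages):
    """Apply each stage in order and log (stage name, size before, size after)."""
    history = []
    for stage in stages:
        before = len(records)
        records = stage(records)
        history.append((stage.__name__, before, len(records)))
    return records, history


records = [
    {"id": "a", "instruction": "What is the capital of France?"},
    {"id": "b", "instruction": "Explain how photosynthesis works."},
    {"id": "c", "instruction": "What is the capital of France?"},
    {"id": "d", "instruction": "Write a haiku about autumn."},
]

final, history = run_stages(records, [drop_duplicate_instructions, keep_first_two])
for name, before, after in history:
    print(f"{name}: {before} -> {after}")
```

Recording the size before and after each stage gives a simple processing history, similar in spirit to the provenance shown in section 5.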

In [None]:
# Initialize AlignJuice (with the default configuration)
# aj = AlignJuice(config="../configs/default.yaml")

# Alternatively, build the pipeline by hand
from alignjuice.core.pipeline import Pipeline
from alignjuice.stages import DataJuicerStage, KnowledgeFilterStage, SandboxEvalStage

# Build a simplified pipeline (no LLM required)
pipeline = Pipeline()
pipeline.add_stage(DataJuicerStage(target_count=5))
pipeline.add_stage(SandboxEvalStage(
    report_path="../reports/demo_report.html",
    metrics_path="../reports/demo_metrics.json"
))

print(f"The pipeline has {len(pipeline.stages)} stages")

In [None]:
# Run the pipeline
result = pipeline.run(data)

# Display the result report
result.report()

In [None]:
# Inspect the final data
result.data.show()

## 5. Quality Evaluation and Reporting

In [None]:
# Inspect the processing history (provenance)
print("Processing history:")
for i, step in enumerate(result.data.provenance, 1):
    print(f"  {i}. {step}")

In [None]:
# Save the final data
result.data.to_jsonl("../output/demo_output.jsonl")
print("Data saved to ../output/demo_output.jsonl")

## Next Steps

- See `02_pipeline_demo.ipynb` for the full pipeline configuration
- See `03_quality_analysis.ipynb` for detailed quality analysis
- See `04_custom_operators.ipynb` to learn how to create custom operators