# Maths Reasoning Data Distillation from DeepSeek R1

You can also open this cookbook in Colab [here](https://colab.research.google.com/drive/1BnV4iyWlXdizzpRQPYjmwIt70oVKziBw?usp=sharing).

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://www.camel-ai.org/"><img src="https://i.postimg.cc/KzQ5rfBC/button.png" width="150"></a>
  <a href="https://discord.camel-ai.org"><img src="https://i.postimg.cc/L4wPdG9N/join-2.png" width="150"></a>

⭐ <i>Star us on [*GitHub*](https://github.com/camel-ai/camel), join our [*Discord*](https://discord.camel-ai.org) or follow our [*X*](https://x.com/camelaiorg)</i> ⭐
</div>




This notebook demonstrates how to set up and use CAMEL's **data distillation pipeline** to distill high-quality maths reasoning data with thought process (long CoT data) from DeepSeek R1, and how to upload the results to Hugging Face.

In this notebook, you'll explore:

- **CAMEL**: A powerful multi-agent framework that enables long CoT data generation, distillation, and multi-agent role-playing scenarios, allowing for sophisticated AI-driven tasks.
- **Data distillation pipeline**: Distilling high-quality maths reasoning data with thought process (long CoT data) from DeepSeek R1 and similar models.
- **Hugging Face Integration**: Uploading datasets to the Hugging Face platform for sharing.



## 📦 Installation

First, install the `camel-ai` package, which provides the data generation pipeline.

In [2]:
%%capture
!pip install "git+https://github.com/camel-ai/camel.git@17bfab03cf3e1b2c0982b07649f2eab392d97d10#egg=camel-ai[all]"

## 🔑 Setting Up API Keys

Next, set the `DEEPSEEK_API_KEY`, which is used to call DeepSeek R1 when distilling the maths reasoning data with thought process.

In [None]:
from getpass import getpass
import os

In [None]:
DEEPSEEK_API_KEY = getpass('Enter your DEEPSEEK_API_KEY: ')
os.environ["DEEPSEEK_API_KEY"] = DEEPSEEK_API_KEY

In [None]:
# To make DeepSeek R1 return its thought process (reasoning content), set the following environment variable
os.environ["GET_REASONING_CONTENT"]="True"
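Before moving on, it can help to confirm that both variables are actually visible to the process. The helper below is a small sketch of our own (the name `missing_env_vars` is hypothetical and not part of CAMEL) that reports which of the required variables are unset or empty.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variables this notebook sets in the cells above
required = ["DEEPSEEK_API_KEY", "GET_REASONING_CONTENT"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.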

## Download GSM8K from Hugging Face and Convert to the Desired Format


Now let's prepare the original maths data from Hugging Face. Each item has two important keys: `question` and `answer`.

After downloading the dataset, we convert it to the format expected by the **CAMEL data distillation pipeline**.

This example uses one dataset: openai/gsm8k.

In [3]:
import json
from pathlib import Path
import uuid
from datasets import load_dataset

openai/gsm8k

In [5]:

import re

# GSM8K solutions end with "#### <number>"; capture the number, including an
# optional sign, thousands separators and decimals.
FINAL_ANSWER_PATTERN = re.compile(r'####\s*(-?\d[\d,]*(?:\.\d+)?)')


def download_gsm8k_dataset():
    try:
        # Load the dataset using the datasets library
        dataset = load_dataset("openai/gsm8k", "main")

        # Get the first 10 items from train split
        data = dataset['train'].select(range(10))

        # Convert to the desired format
        formatted_data = []
        for item in data:
            # Rewrite the trailing "#### number" as "\boxed{number}".
            # A function replacement keeps the backslash literal in re.sub.
            solution = FINAL_ANSWER_PATTERN.sub(
                lambda m: f"\\boxed{{{m.group(1)}}}", item['answer']
            )

            formatted_item = {
                "id": str(uuid.uuid4()),  # GSM8K doesn't provide IDs
                "problem": item['question'],
                "type": "openai/gsm8k",  # All problems are from GSM8K
                "solution": solution,  # Use the modified solution with \boxed
            }
            formatted_data.append(formatted_item)

        # Create output directory if it doesn't exist
        output_dir = Path("data")
        output_dir.mkdir(exist_ok=True)

        # Save the formatted problems so later cells can load them from disk
        # (the file name gsm8k_dataset.json is our choice)
        output_file = output_dir / "gsm8k_dataset.json"
        with open(output_file, "w") as f:
            json.dump(formatted_data, f, indent=2)
        print(f"Saved {len(formatted_data)} problems to {output_file}")

        return formatted_data

    except Exception as e:
        print(f"Error downloading GSM8K dataset: {e}")
        return None

if __name__ == "__main__":
    download_gsm8k_dataset()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
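The key transformation in the cell above is rewriting GSM8K's trailing `#### number` marker as `\boxed{number}`. The snippet below isolates that step on a made-up solution string so it can be checked without downloading anything. The helper name `to_boxed` and the sample text are illustrative assumptions; accepting a sign, thousands separators, and decimals is a choice made for this sketch.

```python
import re

# "#### <number>" with optional sign, thousands separators and decimals
FINAL_ANSWER = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)")


def to_boxed(solution: str) -> str:
    """Replace the GSM8K final-answer marker with a LaTeX \\boxed{} answer."""
    # A function replacement keeps the backslash literal in the output.
    return FINAL_ANSWER.sub(lambda m: "\\boxed{" + m.group(1) + "}", solution)


sample = "She sold 48 clips in April and 24 in May.\n48 + 24 = 72\n#### 72"
print(to_boxed(sample))
```

Text without a `####` marker is returned unchanged, so the helper is safe to apply to every item.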



Cool, you now have example data in the desired format. Let's move on to distilling maths reasoning data with thought process.

## Start Distilling Maths Reasoning Data with Thought Process (Long CoT Data)

In [None]:
# Allow nested event loops so async calls can run inside a notebook (e.g. in Colab)
import nest_asyncio
nest_asyncio.apply()

As you can see below, a reasoning model distills the maths reasoning data with thought process (long CoT data), and an evaluation model judges whether the final response is consistent with the original reference answer.

Importantly, we use `STaRPipeline`, which can be imported with **from camel.datagen import STaRPipeline**.
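To make the evaluator's job concrete: one simple consistency signal is whether the final `\boxed{...}` answer in the model's response equals the one in the reference solution. The sketch below implements only that string comparison. It is an illustration under our own assumptions (the helper names are hypothetical), not how `STaRPipeline` or the evaluate agent works internally, and it does not handle nested braces.

```python
import re
from typing import Optional

# Matches \boxed{...} where the content contains no nested braces
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def last_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None."""
    matches = BOXED.findall(text)
    return matches[-1].strip() if matches else None


def answers_match(response: str, reference: str) -> bool:
    """True when both texts have a boxed answer and the answers agree."""
    got, expected = last_boxed(response), last_boxed(reference)
    if got is None or expected is None:
        return False
    # Ignore thousands separators so "1,000" and "1000" compare equal
    return got.replace(",", "") == expected.replace(",", "")


response = "First add 48 and 24. The total is \\boxed{72}."
reference = "48 + 24 = 72\n\\boxed{72}"
print(answers_match(response, reference))
```

A model-based evaluator can judge more than the final number, for example whether the reasoning itself is sound, which is why the notebook uses an agent rather than a string check.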

In [None]:
import json
import os
import time

from camel.agents import ChatAgent
from camel.datagen import STaRPipeline
from camel.models import ModelFactory
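from camel.types import ModelPlatformType, ModelType

# Required environment variables (set in the cells above):
#   DEEPSEEK_API_KEY
#   GET_REASONING_CONTENT

# Evaluation model: judges the distilled answers.
# Note: the DEFAULT platform and model need their own credentials
# (usually OPENAI_API_KEY), which this notebook does not set.
evaluate_model = ModelFactory.create(
    model_platform=ModelPlatformType.DEFAULT,
    model_type=ModelType.DEFAULT,
)

# Reasoning model served by DeepSeek (uses DEEPSEEK_API_KEY)
reason_model_1 = ModelFactory.create(
    model_platform=ModelPlatformType.DEEPSEEK,
    model_type=ModelType.DEEPSEEK_REASONER,
)

# Optional alternative providers for DeepSeek R1. Each needs its own API key;
# creating one without its key may fail, so remove the ones you do not use.

# Fireworks (FIREWORKS_API_KEY)
reason_model_2 = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="accounts/fireworks/models/deepseek-r1",
    api_key=os.getenv("FIREWORKS_API_KEY"),
    url="https://api.fireworks.ai/inference/v1",
    model_config_dict={"max_tokens": 4096},
)

# Hyperbolic (HYPERBOLIC_API_KEY)
reason_model_3 = ModelFactory.create(
    model_platform=ModelPlatformType.OPENAI_COMPATIBLE_MODEL,
    model_type="deepseek-ai/DeepSeek-R1",
    api_key=os.getenv("HYPERBOLIC_API_KEY"),
    url="https://api.hyperbolic.xyz/v1",
)

# Together AI (TOGETHER_API_KEY)
reason_model_4 = ModelFactory.create(
    model_platform=ModelPlatformType.TOGETHER,
    model_type="deepseek-ai/DeepSeek-R1",
    api_key=os.getenv("TOGETHER_API_KEY"),
)

# Optional: import needed by the commented-out reward model in main()
# from camel.models.reward import NemotronRewardModel
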

Now we run the main function that drives `STaRPipeline`. Pay attention to parameters such as `problems_path`, `output_path`, `max_iterations`, and `rationalization`.
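For reference, `problems_path` should point at a JSON file holding a list of problem dictionaries with the keys built in the download step (`id`, `problem`, `type`, `solution`). The sketch below writes and reloads such a file in a temporary directory. The helper names, the file name, and the sample problem are made up for illustration, and whether the pipeline strictly requires all four keys is an assumption not verified here.

```python
import json
import tempfile
import uuid
from pathlib import Path

# Keys produced by the download step in this notebook
REQUIRED_KEYS = {"id", "problem", "type", "solution"}


def save_problems(problems, path):
    """Write a list of problem dicts to a JSON file, creating parent dirs."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(problems, indent=2), encoding="utf-8")
    return path


def load_problems(path):
    """Load problems and check each one has the keys the notebook uses."""
    problems = json.loads(Path(path).read_text(encoding="utf-8"))
    for i, problem in enumerate(problems):
        missing = REQUIRED_KEYS - problem.keys()
        if missing:
            raise ValueError(f"problem {i} is missing keys: {sorted(missing)}")
    return problems


example = [{
    "id": str(uuid.uuid4()),
    "problem": "What is 48 + 24?",
    "type": "openai/gsm8k",
    "solution": "48 + 24 = 72\n\\boxed{72}",
}]

with tempfile.TemporaryDirectory() as tmp:
    path = save_problems(example, Path(tmp) / "data" / "example_problems.json")
    print(len(load_problems(path)), "problem(s) loaded")
```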

In [None]:


def main():
    start_time = time.time()

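    current_dir = os.getcwd()
    # Input: JSON list of problems in the format built in the download step
    problems_path = os.path.join(current_dir, "data", "gsm8k_dataset.json")
    # Output: the same file the upload step below reads
    output_path = os.path.join(current_dir, "data", "star_r1_output.json")
    os.makedirs(os.path.dirname(output_path), exist_ok=True)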

    # Load problems from JSON file
    with open(problems_path, 'r') as f:
        problems = json.load(f)

    # Initialize agent
    reason_agent_system_message = """Answer my question and give your
    final answer within \\boxed{}."""
    evaluate_agent_system_message = """You are a highly critical teacher who
    evaluates the student's answers with a meticulous and demanding approach.
    """
    reason_agent = ChatAgent(
        system_message=reason_agent_system_message,
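        model=[
            # Uses the DEEPSEEK_API_KEY set earlier in this notebook
            reason_model_1,
            # Alternative providers; each needs its own API key
            # reason_model_2,
            # reason_model_3,
            # reason_model_4,
        ],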
    )
    evaluate_agent = ChatAgent(
        system_message=evaluate_agent_system_message, model=evaluate_model
    )

    # Initialize reward model (optional)
    # reward_model = NemotronRewardModel(
    #     model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
    #     url="https://integrate.api.nvidia.com/v1",
    #     api_key=os.environ.get("NVIDIA_API_KEY"),
    # )

    # Set score thresholds for different dimensions (optional)
    score_threshold = {
        "correctness": 1.0,
        "clarity": 0.0,
        "completeness": 0.0,
    }
    # Or use a single threshold for all dimensions:
    # score_threshold = 0.9

    # Create and run pipeline
    pipeline = STaRPipeline(
        reason_agent=reason_agent,
        evaluate_agent=evaluate_agent,
        problems=problems,  # Pass problems list directly
        output_path=output_path,
        max_iterations=0,
        score_threshold=score_threshold,
        # reward_model=reward_model,  # To use a reward model (optional)
    )

    results = pipeline.generate(rationalization=False)

    end_time = time.time()
    execution_time = end_time - start_time

    print(f"\nProcessed {len(results)} problems")
    print(f"Results saved to: {output_path}")
    print(f"Total execution time: {execution_time:.2f} seconds")


if __name__ == "__main__":
    main()
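The `score_threshold` argument above can be either a per-dimension dictionary or a single number applied to every dimension. One way such a check could work is sketched below. This is our own illustration with a hypothetical function name and made-up scores; the pipeline's actual scoring logic may differ.

```python
from typing import Dict, Union


def passes_threshold(
    scores: Dict[str, float],
    threshold: Union[float, Dict[str, float]],
) -> bool:
    """Check evaluation scores against a single or per-dimension threshold."""
    if isinstance(threshold, dict):
        # Dimensions without an explicit threshold default to 0.0 (always pass)
        return all(
            score >= threshold.get(name, 0.0) for name, score in scores.items()
        )
    return all(score >= threshold for score in scores.values())


scores = {"correctness": 1.0, "clarity": 0.6, "completeness": 0.8}
strict_on_correctness = {"correctness": 1.0, "clarity": 0.0, "completeness": 0.0}

print(passes_threshold(scores, strict_on_correctness))
print(passes_threshold(scores, 0.9))
```

With the dictionary used in `main()`, only correctness gates acceptance, whereas a single threshold such as `0.9` would also reject answers that score low on clarity or completeness.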

## Upload the Data to Huggingface

After distilling the desired data, why not upload it to Hugging Face and share it with others? Let's do it to help people who need more maths reasoning data, for example for training a model.

### Define the Dataset Upload Pipeline: Records, Dataset Card, and More

In [None]:
# Import necessary modules and classes
from camel.datahubs.huggingface import HuggingFaceDatasetManager  # Manages interactions with Hugging Face datasets
from camel.datahubs.models import Record  # Represents a single record in the dataset
from datetime import datetime  # Handles date and time operations
import json  # For reading JSON files

def load_star_output(file_path):
    r"""Load and parse the star output JSON file.

    Args:
        file_path (str): Path to the star_output.json file.

    Returns:
        list: List of traces from the JSON file.
    """
    with open(file_path, 'r') as f:
        data = json.load(f)
    return data['traces']

# Main function: Upload dataset to Hugging Face
def upload_to_huggingface(transformed_data, username, dataset_name=None):
    r"""Uploads transformed data to the Hugging Face dataset platform.

    Args:
        transformed_data (list): Transformed data, typically a list of dictionaries.
        username (str): Hugging Face username.
        dataset_name (str, optional): Custom dataset name.

    Returns:
        str: URL of the uploaded dataset.
    """
    # Initialize HuggingFaceDatasetManager to interact with Hugging Face datasets
    manager = HuggingFaceDatasetManager()

    # Generate or validate the dataset name
    dataset_name = generate_or_validate_dataset_name(username, dataset_name)

    # Create the dataset on Hugging Face and get the dataset URL
    dataset_url = create_dataset(manager, dataset_name)

    # Create a dataset card to add metadata
    create_dataset_card(manager, dataset_name, username)

    # Convert the transformed data into a list of Record objects
    records = create_records(transformed_data)

    # Add the Record objects to the dataset
    add_records_to_dataset(manager, dataset_name, records)

    # Return the dataset URL
    return dataset_url

# Generate or validate the dataset name
def generate_or_validate_dataset_name(username, dataset_name):
    r"""Generates a default dataset name or validates and formats a user-provided name.

    Args:
        username (str): Hugging Face username.
        dataset_name (str, optional): User-provided custom dataset name.

    Returns:
        str: Formatted dataset name.
    """
    if not dataset_name:
        # If no dataset name is provided (None or empty string), generate a default name from the current date
        current_date = datetime.now().strftime("%Y%m%d")
        dataset_name = f"star_traces_{current_date}"

    # Format the dataset name to include the username
    return f"{username}/{dataset_name}"

# Create a dataset on Hugging Face
def create_dataset(manager, dataset_name):
    r"""Creates a new dataset on Hugging Face and returns the dataset URL.

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.

    Returns:
        str: URL of the created dataset.
    """
    dataset_url = manager.create_dataset(dataset_name)
    return dataset_url

# Create a dataset card with metadata
def create_dataset_card(manager, dataset_name, username):
    r"""Creates a dataset card to add metadata

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        username (str): Hugging Face username.
    """
    manager.create_dataset_card(
        dataset_name=dataset_name,
        description="A dataset containing mathematical problem-solving traces with step-by-step solutions and improvement history. Each record includes a mathematical problem, its final solution, and the iterative improvement process.",
        license="mit",  # Using lowercase 'mit' as required by HuggingFace
        tags=["math", "problem-solving", "step-by-step", "traces"],
        authors=[username],
        language=["en"],
        task_categories=["text-generation"],
        content="This dataset contains mathematical problem-solving traces generated using the CAMEL framework. Each entry includes:\n\n"
                "- A mathematical problem statement\n"
                "- A detailed step-by-step solution\n"
                "- An improvement history showing how the solution was iteratively refined"
    )

# Convert transformed data into Record objects
def create_records(transformed_data):
    r"""Converts transformed data into a list of Record objects.

    Args:
        transformed_data (list): List of trace dictionaries from star_output.json.

    Returns:
        list: List of Record objects.
    """
    records = []
    for trace in transformed_data:
        record = Record(
            source_type=trace['type'],
            problem=trace['problem'],
            solution=trace['final_trace'],
        )
        records.append(record)
    return records

# Add Record objects to the dataset
def add_records_to_dataset(manager, dataset_name, records):
    r"""Adds a list of Record objects to the dataset.

    Args:
        manager (HuggingFaceDatasetManager): Instance of HuggingFaceDatasetManager.
        dataset_name (str): Name of the dataset.
        records (list): List of Record objects.
    """
    manager.add_records(dataset_name, records)
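`create_records` indexes each trace with `type`, `problem`, and `final_trace`, so a trace missing any of these raises a `KeyError` partway through the upload. A small pre-check like the one below can filter such traces first. The function name and sample traces are hypothetical, and the sketch assumes traces are plain dictionaries as returned by `load_star_output`.

```python
REQUIRED_TRACE_KEYS = ("type", "problem", "final_trace")


def split_valid_traces(traces):
    """Split traces into (valid, invalid) by non-empty required keys."""
    valid, invalid = [], []
    for trace in traces:
        if all(trace.get(key) for key in REQUIRED_TRACE_KEYS):
            valid.append(trace)
        else:
            invalid.append(trace)
    return valid, invalid


sample_traces = [
    {
        "type": "openai/gsm8k",
        "problem": "What is 48 + 24?",
        "final_trace": "48 + 24 = 72, so the answer is \\boxed{72}.",
    },
    {"type": "openai/gsm8k", "problem": "What is 2 + 2?"},  # no final_trace
]

valid, invalid = split_valid_traces(sample_traces)
print(f"{len(valid)} valid, {len(invalid)} skipped")
```

Passing only the `valid` list to `upload_to_huggingface` avoids a half-finished upload when a few traces are incomplete.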


### Configure the Hugging Face Access Token and Upload

You can create a write-access token on the Hugging Face [token settings page](https://huggingface.co/settings/tokens/new?tokenType=write).

In [None]:
import os
from getpass import getpass

# Get HuggingFace token and username
HUGGING_FACE_TOKEN = getpass('Enter your HUGGING_FACE_TOKEN: ')
os.environ["HUGGING_FACE_TOKEN"] = HUGGING_FACE_TOKEN
username = input("Enter your HuggingFace username: ").strip()
# An empty answer becomes None so that the default dataset name is generated
dataset_name = input("Enter dataset name (press Enter to use default): ").strip() or None

# Load the star output data
current_dir = os.getcwd()
star_output_path = os.path.join(current_dir, 'data', 'star_r1_output.json')
traces = load_star_output(star_output_path)

# Upload the data to HuggingFace
dataset_url = upload_to_huggingface(traces, username, dataset_name)
print("\nDataset uploaded successfully!")
print(f"You can view your dataset at: {dataset_url}")


## 🌟 Highlights

As you can see, this is a short, easy-to-run notebook for fast data distillation from DeepSeek R1 models, ideal for distilling mathematical thought-process data. We look forward to your use and feedback!

That's everything: Got questions about 🐫 CAMEL-AI? Join us on [Discord](https://discord.camel-ai.org)! Whether you want to share feedback, explore the latest in multi-agent systems, get support, or connect with others on exciting projects, we’d love to have you in the community! 🤝

Check out some of our other work:
1. 🐫 Creating Your First CAMEL Agent [free Colab](https://docs.camel-ai.org/cookbooks/create_your_first_agent.html)
2. Graph RAG Cookbook [free Colab](https://colab.research.google.com/drive/1uZKQSuu0qW6ukkuSv9TukLB9bVaS1H0U?usp=sharing)
3. 🧑‍⚖️ Create A Hackathon Judge Committee with Workforce [free Colab](https://colab.research.google.com/drive/18ajYUMfwDx3WyrjHow3EvUMpKQDcrLtr?usp=sharing)
4. 🔥 3 ways to ingest data from websites with Firecrawl & CAMEL [free Colab](https://colab.research.google.com/drive/1lOmM3VmgR1hLwDKdeLGFve_75RFW0R9I?usp=sharing)
5. 🦥 Agentic SFT Data Generation with CAMEL and Mistral Models, Fine-Tuned with Unsloth [free Colab](https://colab.research.google.com/drive/1lYgArBw7ARVPSpdwgKLYnp_NEXiNDOd-?usp=sharingg)

Thanks from everyone at 🐫 CAMEL-AI


