6chHenry/DataRecipe

Data Recipe for Causal Data Generation

This project constructs causal graphs from object attributes and affordances, then generates task, emergency, and recovery text from those graphs. It supports automatic object discovery, multiple LLM-assisted causal graph construction strategies, and optional visualization.


1. Quick Start

Environment Setup

conda create -n data-recipe python=3.11
conda activate data-recipe
pip install -r requirements.txt

Common dependencies include:

  • requests
  • networkx
  • matplotlib (optional, for visualization)

2. End-to-End Agent (Recommended)

The end-to-end agent automatically performs:

Object discovery → causal graph construction → text generation

This is the recommended workflow if you do not want to manually specify objects.

2.1 Configure LLM API (Qwen / DashScope)

If you use a cloud LLM, configure the following environment variables:

macOS / Linux (bash)

export QWEN_API_KEY="<your-key>"
export QWEN_API_BASE="https://dashscope.aliyuncs.com/compatible-mode/v1"  # optional
export QWEN_MODEL="qwen2.5-7b-instruct"                                   # optional

Windows (PowerShell)

$env:QWEN_API_KEY="<your-key>"
$env:QWEN_API_BASE="https://dashscope.aliyuncs.com/compatible-mode/v1"    # optional
$env:QWEN_MODEL="qwen2.5-7b-instruct"                                     # optional

Important

  • run_agent.sh and pipeline.py read API credentials only from environment variables.
  • Do NOT hardcode API keys in scripts or source code.
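The environment-only rule above can be illustrated with a small Python sketch. The helper name and error message are hypothetical; only the variable names and defaults come from this README:

```python
import os

def load_qwen_config():
    """Read Qwen/DashScope credentials from the environment only.

    Illustrative helper: keys are never hardcoded in scripts or source,
    so a missing QWEN_API_KEY is treated as a hard error.
    """
    api_key = os.environ.get("QWEN_API_KEY")
    if not api_key:
        raise RuntimeError("QWEN_API_KEY is not set; export it before running.")
    return {
        "api_key": api_key,
        "api_base": os.environ.get(
            "QWEN_API_BASE",
            "https://dashscope.aliyuncs.com/compatible-mode/v1",
        ),
        "model": os.environ.get("QWEN_MODEL", "qwen2.5-7b-instruct"),
    }
```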

2.2 OpenAI-Compatible API Mapping (Automatic)

If USE_API=true is enabled, the system automatically maps available credentials to a unified interface:

Unified variable → source priority (first match wins):

  • CAUSAL_LLM_API_KEY: OPENAI_API_KEY / AZURE_OPENAI_API_KEY / CAUSAL_LLM_API_KEY / QWEN_API_KEY / DASHSCOPE_API_KEY
  • CAUSAL_LLM_API_BASE: CAUSAL_LLM_API_BASE / QWEN_API_BASE / default DashScope endpoint
  • CAUSAL_LLM_MODEL: CAUSAL_LLM_MODEL / QWEN_MODEL / qwen-max-2025-01-25

➡️ Setting only QWEN_API_KEY is sufficient. The pipeline falls back to local or non-API modes only if no valid key is found.
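A minimal sketch of the first-match-wins mapping, using the priority chains from the table above (the helper names are hypothetical, not the pipeline's actual internals):

```python
import os

# Priority chains as listed in the mapping table above.
KEY_SOURCES = ["OPENAI_API_KEY", "AZURE_OPENAI_API_KEY", "CAUSAL_LLM_API_KEY",
               "QWEN_API_KEY", "DASHSCOPE_API_KEY"]
BASE_SOURCES = ["CAUSAL_LLM_API_BASE", "QWEN_API_BASE"]
MODEL_SOURCES = ["CAUSAL_LLM_MODEL", "QWEN_MODEL"]

def first_set(names, env, default=None):
    """Return the value of the first non-empty variable in `names`."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return default

def unified_llm_env(env=os.environ):
    """Map whatever credentials are available to the unified variables."""
    return {
        "CAUSAL_LLM_API_KEY": first_set(KEY_SOURCES, env),
        "CAUSAL_LLM_API_BASE": first_set(
            BASE_SOURCES, env,
            default="https://dashscope.aliyuncs.com/compatible-mode/v1"),
        "CAUSAL_LLM_MODEL": first_set(
            MODEL_SOURCES, env, default="qwen-max-2025-01-25"),
    }
```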


2.3 Configure run_agent.sh

Key variables (auto-discovery enabled by default):

  • Object Discovery

    • AUTO_DISCOVER=true
    • OBJECT_PROMPT="your task or scenario description"
    • DISCOVER_NUM_OBJECTS
    • DISCOVER_TEMPERATURE
    • DISCOVER_TOP_P
  • Optional Manual Inputs

    • OBJECTS_TEXT_FILE — one object hint per line
    • OBJECTS_JSON — fully structured object definitions
  • Causal Graph Strategy

    • USE_QWEN_API_GRAPH — one-shot graph generation
    • MICRO_FIRST — micro-graph incremental stitching
    • USE_API — use OpenAI-compatible API interface

2.4 Run the Agent

bash run_agent.sh

The pipeline performs:

  1. Automatic object discovery
  2. Quality filtering and deduplication
  3. Causal graph construction (Qwen one-shot / micro-graph stitching / legacy mode)
  4. Task case generation

Output (default):

output/task_case.json

3. Visualization

You can visualize the generated causal graph as a DOT or PNG file:

python visualize_logic_graph.py \
  --task_json output/task_case.json \
  --dot output/logic_graph.dot \
  --png output/logic_graph.png

4. Batch Run Multiple Scenarios (New)

Use batch_run.py to process a scenarios file (JSONL or JSON array) and generate a TaskCase JSON per scenario. Filenames are auto-generated from the scenario/task name for clarity.

Scenario file format (recommended: JSONL, one scenario per line)

{"task_id":"T001","task_name":"Kitchen prep","object_prompt":"Prepare a sandwich and drink","discover_num_objects":4,"use_qwen_api_graph":true,"micro_first":true,"use_api":true}
{"task_id":"T002","task_name":"Office cleaning","object_prompt":"Clean a small office space","discover_num_objects":3,"micro_first":true,"use_api":true}

Also supported: JSON array [{...},{...}].

Field hints (all optional unless noted):

  • task_id / task_name: used in TaskCase metadata and output filename slug.
  • object_prompt: description for LLM discovery (if objects/objects_text are absent).
  • discover_num_objects, discover_temperature, discover_top_p: sampling for object discovery.
  • use_qwen_api_graph, micro_first, use_api: graph strategy and API toggle (Qwen keys auto-mapped to OpenAI-compatible envs).
  • objects: structured objects list (category/name/attributes/affordances/logic_graph) to bypass LLM discovery.
  • objects_text: plain text list; each entry becomes an object category/name.

Run example

python batch_run.py \
  --scenarios_file scenarios.jsonl \
  --output_dir output/batch \
  --use_api \
  --micro_first

Output: one JSON per scenario, named slug(task_name)_taskId.json (e.g., kitchen_prep_T001.json) in output/batch (override with --output_dir).
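The slug rule can be approximated as "lowercase, runs of non-alphanumerics collapsed to underscores"; this reproduces the kitchen_prep_T001.json example above, though the exact rule in batch_run.py may differ:

```python
import re

def output_filename(task_name, task_id):
    """Illustrative slug: lowercase, non-alphanumeric runs -> underscore."""
    slug = re.sub(r"[^a-z0-9]+", "_", task_name.lower()).strip("_")
    return f"{slug}_{task_id}.json"
```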


5. One-Shot Causal Graph via Qwen API

You can directly let Qwen generate the entire causal graph in a single API call, then run downstream text generation.

Required Environment Variables

  • QWEN_API_KEY or DASHSCOPE_API_KEY
  • QWEN_API_BASE (optional)
  • QWEN_MODEL (optional)

Run

python pipeline.py \
  --use_qwen_api_graph \
  --qwen_temperature 0.4 \
  --qwen_top_p 0.9 \
  --task_id T001 \
  --task_name DemoTask

Notes

  • Existing edges defined in each CausalObject are preserved.
  • New edges are merged and filtered to known nodes only.
  • If the API call fails or returns an empty graph, the system automatically falls back to other enabled strategies.
  • Optional post-verification by Qwen can add cross-object edges for improved global consistency.
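The merge-and-filter rule in the first two notes can be sketched as follows (the function is hypothetical, but the policy matches the notes: existing edges are kept, and new edges are admitted only between known nodes):

```python
def merge_edges(existing, proposed, known_nodes):
    """Keep existing edges, then add proposed edges whose endpoints are known.

    `existing` and `proposed` are lists of (src, dst) pairs; duplicates
    are dropped so existing edges are never overwritten.
    """
    merged = list(existing)
    seen = set(existing)
    for src, dst in proposed:
        if src in known_nodes and dst in known_nodes and (src, dst) not in seen:
            merged.append((src, dst))
            seen.add((src, dst))
    return merged
```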

6. Micro-Graph Stitching (Multi-Step LLM Assistance)

This mode builds a global DAG by stitching together small, high-confidence causal subgraphs.

python pipeline.py \
  --micro_first \
  --micro_group_size 3 \
  --micro_groups_per_object 6 \
  --micro_min_confidence 0.6

  • Each micro-graph contains 2–3 nodes
  • Micro-graphs are cached at database/micro_graphs.json

7. Legacy Full Causal Discovery

To use the original full discovery algorithm:

  • Do not enable --micro_first
  • Do not enable --use_qwen_api_graph

python pipeline.py --use_api

This mode is driven by either a local LLM or an API backend, depending on configuration.


8. Output Summary

  • task_case.json: final task/emergency/recovery representation
  • logic_graph.dot: causal graph in DOT format
  • logic_graph.png: rendered causal graph
