# This Notebook is forked from the 1st Place Winner of ARC Prize 2024

## Dark AGI's ARC-2025 Open Source Commitment: All submissions in this competition will be open-sourced to help the community explore AGI capable of achieving 85%+ scores.

**Competition Link:** [ARC Prize 2024](https://www.kaggle.com/competitions/arc-prize-2024)  
**Original Notebook:** [arc-prize-2024-solution-by-the-architects](https://www.kaggle.com/code/dfranzen/arc-prize-2024-solution-by-the-architects?scriptVersionId=211637468)

---

## Modifications Implemented

1. **4-GPU Support**
   - Extended multi-GPU implementation from 2 to 4 GPUs
   - Modified dataset splitting logic in `prepare_dataset` to evenly distribute work
   - Added training and inference processes for GPU 2 and GPU 3
   - Updated subprocess monitoring to wait for all 8 processes (4 training + 4 inference)
   - Improved resource utilization and inference throughput by 2x

2. **Enhanced Reproducibility**
   - Added global seed control (GLOBAL_SEED = 42) with per-GPU deterministic seeding
   - Applied consistent seed values to all randomized operations for reproducible results
   - Disabled non-deterministic algorithms to ensure consistent outputs across runs
   - Implemented seed-based task distribution for consistent GPU workloads

3. **Comprehensive Visualization**
   - Implemented data visualization for both training and inference phases:
     - Color-coded grid displays for ARC tasks with intuitive color mapping
     - Side-by-side comparisons of inputs, ground truth, and model predictions
   - Added multi-GPU result comparison showing prediction quality across all GPUs
   - Created task-specific visualizations showing training examples, test inputs, and prediction attempts
   - Calculated detailed accuracy metrics with statistical breakdowns:
     - Per-attempt success rates for first and second predictions
     - Overall accuracy percentages for both individual attempts
     - Combined success rate for either prediction attempt
     - Shape and value distribution analysis for predictions vs ground truth
     - Non-zero prediction completion rate and zero-prediction filtering

In [None]:
# Copyright 2024 Daniel Franzen and Jan Disselhoff
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [None]:
# This notebook contains our winning submission to the ARC Prize 2024 Kaggle competition,
# scoring 53.5 points on the private evaluation set.
# the ARChitects (Daniel Franzen and Jan Disselhoff)

# Model Runner Module Explanation

This Python module (`model_runner.py`) is a comprehensive toolkit for working with language models, focusing on efficiency and performance optimization. The code is designed to handle various aspects of model management including loading, training, inference, and optimization.

## Key Components

### Tokenizer Optimization
The module provides several functions for optimizing tokenizers:
- `indices_required_for_merges` identifies necessary token IDs for BPE merges
- `remove_unused_merges` cleans up unused merge rules
- `shrink_tokenizer_vocab` reduces vocabulary size while maintaining model functionality
- `remove_tokenizer_normalizer` removes normalizer components when not needed

### Model Size Reduction
The code includes specialized functionality for reducing model size:
- `shrink_model_embeddings` resizes embedding tables to match reduced vocabularies
- `shrink_embeddings` orchestrates the entire embedding reduction process
- Support for 4-bit quantization through `transformers_4bit` and `unsloth_4bit` loading modes

### Model Management
Several utilities handle model loading and manipulation:
- `prepare_model` provides a unified interface for model loading with various options
- `merge_peft_into_base` merges fine-tuned adapters into base models
- `fix_dtypes` ensures consistent data types across model components
- `save_model` handles model saving with options for merging adapters

### Training Functions
The module supports efficient fine-tuning:
- `training_run` handles the training process with support for both standard and optimized trainers
- The `Retrainer` class enables retraining with data augmentation
- Support for gradient accumulation fixes and packing optimizations

### Inference
Comprehensive inference capabilities:
- `inference_run_v2` orchestrates inference runs across datasets
- `inference_turbo_dfs` implements a depth-first search approach for higher quality outputs
- `inference_step` handles token generation with various decoding strategies

### Result Processing
The `Decoder` class provides extensive functionality:
- Tracks and evaluates generated outputs against reference solutions
- Calculates accuracy metrics based on exact matches
- Supports probability tracking for solution ranking
- Benchmarks different selection algorithms

### Utilities
Helpful utilities include:
- Compressed storage for inference results via `inference_save`/`inference_load` 
- PEFT weight management functions
- GPU memory tracking via `mem_info`

This module appears to be built for competitive or research applications where optimizing model efficiency and generation quality is critical.

In [1]:
# %%writefile model_runner.py

Writing model_runner.py


# ARC Dataset Processing and Formatting Library

This code defines a comprehensive library for working with the Abstraction and Reasoning Corpus (ARC) dataset, providing utilities for data loading, manipulation, augmentation, and formatting for machine learning models. Here's an explanation of the main components:

## Core Functionality

1. **Data Manipulation Utilities**
   - `cut_at_token`: Truncates arrays at specified token positions
   - `shuffled`: Randomizes array elements with NumPy's permutation
   - `permute_mod`: Applies number permutations to arrays with inversion control
   - Various permutation strategies (random, frequency-based) for data augmentation

2. **ArcDataset Class**
   - Handles loading and processing of ARC challenge datasets
   - Provides comprehensive dataset manipulation methods:
     - Data augmentation (rotation, transposition, permutation)
     - Task filtering and sorting
     - Example shuffling and selection
     - Dataset splitting and concatenation
   - Includes utilities for submission creation and validation

3. **ArcFormatter Class**
   - Converts grid-based ARC tasks into text format for language models
   - Configurable formatting with options for prefixes, separators, and tokenization
   - Handles decoding model outputs back into valid grid solutions
   - Supports scoring and evaluation of predictions
   - Includes methods for formatting train-test examples and queries

4. **Custom Data Collator**
   - Implements special handling for training language models on ARC tasks
   - Supports advanced techniques like output masking and controlled fault injection
   - Configurable through options like `fault_freq` and `mask_first_output`

## Key Features

- **Data Augmentation**: Extensive options for task transformation to increase training data variety
- **Formatting Flexibility**: Customizable text representations for different model preferences
- **Length Management**: Methods to filter and truncate tasks to fit model context windows
- **Submission Handling**: Tools for generating and validating competition submissions
- **Predefined Formatters**: Ready-to-use configurations like `ArcFormatter_pretext2` with different masking strategies

This library provides the infrastructure needed to process ARC tasks for machine learning models, handling the conversion between grid-based puzzle representations and the text formats needed by language models.

In [2]:
# %%writefile arc_loader.py

Writing arc_loader.py


# Selection Algorithms for ARC Competition Solutions

The code defines a collection of selection algorithms designed to choose optimal solutions from multiple model predictions for the Abstraction and Reasoning Corpus (ARC) competition. This is a critical component in the submission pipeline, as it determines which predictions will be submitted as final answers.

At its core, the selection module offers several strategies for filtering and ranking candidate solutions based on different criteria. The simplest approach is `first_only`, which simply takes the first prediction, reflecting a high confidence in the model's initial guess. The `keep_order` algorithm preserves all predictions in their original sequence, useful when the ordering already reflects confidence levels. For eliminating redundancy, `keep_order_unique` builds on this by removing duplicate solutions.

The more sophisticated selection strategies leverage scoring mechanisms. `get_best_shape_by_score` groups predictions by their output shape (dimensions) and identifies the most promising shape based on a scoring function. This is particularly valuable in ARC problems where correct solutions often share consistent dimensions. The `score_sum` function extends this concept by accumulating scores for unique outputs while optionally preferring answers that match the most common output shape.

Two notable scoring implementations are provided: `score_all_probsum`, which converts log probabilities to probabilities and sums them to rank solutions, and `score_full_probmul_3`, which incorporates both inference scores and augmented scores with a baseline offset of 3. This combined approach aims to balance the model's direct confidence (inference scores) with additional evaluation metrics (augmented scores) for more robust selection.

The code includes utility functions like `hashable` and `make_unique` to handle the array-based outputs, ensuring proper comparison and deduplication. All these algorithms are collected in `selection_algorithms`, allowing for benchmarking different strategies against each other to determine the optimal approach for the final submission.

In [3]:
# %%writefile selection.py

Writing selection.py


# Asynchronous Subprocess Handling with Streaming Output

This code provides a set of asynchronous utilities for executing and monitoring subprocesses in Python. Built on top of Python's `asyncio` library, these functions enable efficient parallel execution of external processes while capturing their output streams in real-time.

The `stream_reader` function serves as the core component, continuously reading from a subprocess's output stream (either stdout or stderr) in manageable chunks of 4KB. It implements a clever buffering mechanism to ensure complete lines are processed properly. By appending a sentinel character ('X') and using Python's unpacking syntax, it elegantly separates complete lines from partial data that might be cut off mid-line. Each complete line is optionally prefixed with an identifier and directed to the specified output stream.

The `wait_for_subprocess` function builds on this foundation by simultaneously monitoring both the stdout and stderr streams of a single subprocess. It uses `asyncio.gather` to concurrently process both streams until completion, then waits for the subprocess to terminate and returns its exit code. The `print_output` parameter provides control over whether the subprocess output should be displayed, while the `id` parameter helps distinguish between outputs from different processes when multiple are running.

Finally, `wait_for_subprocesses` extends this capability to handle multiple subprocesses concurrently. It automatically assigns sequential numeric identifiers to each process when more than one is being monitored, making it easier to distinguish their outputs in a multiplexed console display.

This asynchronous approach is particularly valuable in data processing pipelines that might involve multiple external tools or long-running computations. Rather than blocking while waiting for each process to complete sequentially, these functions allow the Python program to efficiently manage multiple concurrent tasks, potentially improving overall throughput while maintaining organized output capture.

In [4]:
# %%writefile async_tools.py

Writing async_tools.py


# Comprehensive Framework for ARC Challenge Solution Pipeline

This code defines a robust configuration and execution framework for tackling the Abstraction and Reasoning Corpus (ARC) challenge. It serves as the central orchestration module for the entire solution pipeline, connecting data loading, model training, inference, and result submission generation.

The file begins by establishing essential paths and configuration settings, including locations for the ARC challenge data and temporary storage directories for model weights and inference outputs. It then loads the test dataset and conditionally loads solution data if working with a "fake" test set (likely for validation purposes). The configuration specifies the use of a pre-trained NeMo Mini model with the specialized ArcFormatter_premix_3 for data processing.

At the heart of the implementation are two key preparation functions: `prepare_run` and `prepare_dataset`. The `prepare_run` function configures the model environment, including GPU assignment and model initialization with LoRA (Low-Rank Adaptation) fine-tuning parameters. It uses Unsloth's 4-bit quantization for efficient memory usage, applies parameter-efficient fine-tuning to various model components with carefully chosen hyperparameters (including rank-stabilized LoRA), and optimizes for long contexts with gradient checkpointing.

The `prepare_dataset` function handles dataset preparation with sophisticated pre-processing techniques. For multi-GPU training, it supports both random splitting and length-based distribution of tasks. The function applies different augmentation strategies depending on whether the dataset is being prepared for training or inference. Training data undergoes rotation, transformation, permutation, and sequence shuffling, while ensuring proper length constraints. Inference data is sorted by input length, augmented, and interleaved to optimize processing.

The pipeline continues with two execution functions: `start_training` and `start_inference`. The training function fine-tunes the model using a parameter-efficient approach with 8-bit optimization, cosine learning rate scheduling, and carefully tuned hyperparameters. The inference function applies the trained model to generate solutions, with optional task-specific fine-tuning through the Retrainer class and augmented scoring to enhance solution quality.

A notable protective mechanism is the `RemapCudaOOM` context manager, which gracefully handles CUDA out-of-memory errors by creating a placeholder submission file rather than failing catastrophically. This ensures that even under resource constraints, the system can produce a valid competition entry.

This comprehensive framework represents a sophisticated approach to the ARC challenge, incorporating advanced techniques in efficient model fine-tuning, data augmentation, and robust error handling to maximize performance within Kaggle's computational constraints.

In [5]:
# %%writefile common_stuff.py

Writing common_stuff.py


# Environment Setup for ARC Challenge Pipeline

This code snippet represents the initialization phase of the ARC challenge solution pipeline. It performs critical setup tasks before the main training and inference processes begin. Let me walk through what's happening:

## Environment Configuration and Unsloth Installation

The code begins by importing all components from the `common_stuff` module, which contains the core functionality for the ARC solution pipeline as seen in previous sections. It then disables Weights & Biases logging by setting the `WANDB_DISABLED` environment variable to `"true"`.

A key component is the custom installation and patching of the Unsloth library. Unsloth provides efficient optimization techniques for large language models, but requires specific modifications to work properly in this environment:

1. The code checks if Unsloth is already installed by looking for a marker file
2. If not installed, it:
   - Uninstalls existing PyTorch and Accelerate packages to avoid conflicts
   - Installs Unsloth from a local wheel file (avoiding internet downloads, which is important in Kaggle environments)
   - Applies several patches to the Unsloth source code:
     - Disables the `get_statistics()` function to fix a delay bug
     - Removes multi-GPU detection restrictions to enable distributed training
   - Creates a marker file to indicate successful installation

This custom installation approach ensures compatibility with the Kaggle environment while enabling optimized training performance.

## Training Preparation and Cleanup

After setting up the environment, the code prepares for training by:

1. Removing any "done" signal files from previous runs for both GPUs (0 and 1)
   - This ensures that the training and inference processes won't mistakenly think a previous run completed successfully

2. For debugging scenarios (when using a "fake" test set), there are commented-out commands to remove previous model outputs and temporary files
   - These cleanup commands are disabled (commented out) but could be enabled for debugging purposes

This initialization routine establishes a clean, optimized environment for the subsequent model training and inference phases of the ARC challenge solution pipeline. The modified Unsloth library will enable efficient fine-tuning of the large language model with quantization and other optimizations tailored to this specific task.

In [None]:
from common_stuff import *
import os
os.environ["WANDB_DISABLED"] = "true"

In [5]:
if not os.path.exists(os.path.join(tmp_dir, 'unsloth_installed')):  # unsloth offline install - https://stackoverflow.com/a/51646354
    !pip uninstall --yes torch accelerate
    !pip install --no-index --find-links=./input/unsloth-2024-9-post4/wheelhouse unsloth
    #!pip uninstall --yes accelerate fastai torch torchaudio transformers
    #!pip install --no-index --find-links=/kaggle/input/unsloth-2024-10-7/wheelhouse unsloth  # do not use grad_acc_fix - trains very slow
    #!sed -i 's/if ((post_check - pre_check) >= 1).sum() > 1:/if False:/g' /opt/conda/lib/python3.10/site-packages/unsloth/models/llama.py
    # fix delay bug in get_statistics()
    # !sed -i 's/^def get_statistics():/def get_statistics():\n if False:/g' /opt/conda/lib/python3.10/site-packages/unsloth/models/_utils.py
    # fix faulty unsloth multi-gpu detection
    # !sed -i "s/raise RuntimeError('Unsloth currently does not support multi GPU setups - but we are working on it!')/pass/g" /opt/conda/lib/python3.10/site-packages/unsloth/tokenizer_utils.py /opt/conda/lib/python3.10/site-packages/unsloth/models/llama.py /opt/conda/lib/python3.10/site-packages/unsloth/models/vision.py
    os.makedirs(os.path.join(tmp_dir, 'unsloth_installed'), exist_ok=True)
    print('Unsloth installed & patched.')



Looking in links: ./input/unsloth-2024-9-post4/wheelhouse
Processing c:\fun\ml\arc-2025\input\unsloth-2024-9-post4\wheelhouse\unsloth-2024.9.post4-py3-none-any.whl
INFO: pip is looking at multiple versions of unsloth to determine which version is compatible with other requirements. This could take a while.
Unsloth installed & patched.


ERROR: Could not find a version that satisfies the requirement torch>=2.4.0 (from unsloth) (from versions: none)
ERROR: No matching distribution found for torch>=2.4.0


In [None]:
for gpu in [0, 1]: 
    signal_path = f'{model_temp_storage}_gpu{gpu}_done'
    if os.path.exists(signal_path): os.rmdir(signal_path)

if arc_test_set.is_fake:  # cleanup? (for debugging)
    #!rm -R /kaggle/temp/finetuned_model*
    #!rm -R /kaggle/temp/inference_outputs
    #!rm -R /kaggle/temp/inference_scoring
    #!ls /kaggle/temp
    pass

ModuleNotFoundError: No module named 'torch'

# Asynchronous Training Process Initialization

This code cell initiates a background training process for the ARC challenge solution pipeline. Let me explain the critical aspects of what's happening here:

The cell uses a special Jupyter cell magic `%%python --bg --proc train_proc0` which instructs the notebook to run the Python code in a separate background process rather than within the main notebook execution thread. The `--proc train_proc0` parameter assigns a specific name to this process, making it identifiable for monitoring and management purposes.

Within this background process, the code imports all components from the `common_stuff` module, which contains the comprehensive framework for model configuration, dataset preparation, and training execution that we examined earlier. This module provides access to the model architecture, training parameters, data augmentation strategies, and other essential components of the solution pipeline.

The core action is the call to `start_training(gpu=0)`, which initiates the model training process on GPU 0. As detailed in the `common_stuff` module, this function will:

1. Establish a unique storage path for this GPU's model weights
2. Configure the base model with LoRA fine-tuning parameters
3. Prepare the dataset with appropriate augmentations for training
4. Execute the training process with carefully tuned hyperparameters
5. Create a signal file upon completion to indicate that training has finished

Running this process in the background allows the notebook to remain responsive while the computationally intensive training occurs. This approach is particularly valuable in a multi-GPU setup, as it enables the initiation of parallel training processes across available GPUs, potentially accelerating the overall solution development. The named process also facilitates monitoring or termination if needed during the potentially lengthy training process.

This background execution model is a sophisticated approach to managing computational resources in Jupyter environments, especially for the resource-intensive tasks involved in state-of-the-art AI competition solutions.

In [15]:
# Simplified ARC data visualization script (English version)
from arc_loader import *
import matplotlib.pyplot as plt
from matplotlib import colors
import numpy as np
import json
import os

# Create ARC color map
cmap = colors.ListedColormap(
    ['#000000', '#0074D9', '#FF4136', '#2ECC40', '#FFDC00',
     '#AAAAAA', '#F012BE', '#FF851B', '#7FDBFF', '#870C25'])
norm = colors.Normalize(vmin=0, vmax=9)

# Load data directly from file
arc_challenge_file = './input/arc-prize-2025/arc-agi_test_challenges.json'

# Load original data
with open(arc_challenge_file, 'r') as f:
    arc_data = json.load(f)

# Set random seeds
np.random.seed(42)
random.seed(42)

def visualize_arc_example(train_data, test_data, task_id):
    """Visualize training and test data for an ARC task"""
    # Get number of training and test examples
    n_train = len(train_data)
    n_test = len(test_data)
    
    # Create figure large enough for all examples
    fig, axes = plt.subplots(2, max(n_train, n_test), figsize=(4*max(n_train, n_test), 8))
    fig.suptitle(f"Task ID: {task_id}", fontsize=16)
    
    # Visualize training data
    for i in range(n_train):
        # Input
        axes[0, i].imshow(train_data[i]['input'], cmap=cmap, norm=norm)
        axes[0, i].grid(True, which='both', color='lightgrey', linewidth=0.5)
        axes[0, i].set_title(f"Training #{i+1} - Input")
        axes[0, i].set_xticks([])
        axes[0, i].set_yticks([])
        
        # Output
        axes[1, i].imshow(train_data[i]['output'], cmap=cmap, norm=norm)
        axes[1, i].grid(True, which='both', color='lightgrey', linewidth=0.5)
        axes[1, i].set_title(f"Training #{i+1} - Output")
        axes[1, i].set_xticks([])
        axes[1, i].set_yticks([])
    
    # Handle test data visualization
    for i in range(n_test):
        if i < n_train:
            # Already have training data in this column
            pass
        else:
            # Hide unused training cells
            if i >= n_train:
                axes[0, i].axis('off')
                axes[1, i].axis('off')
    
    # Show first test input
    if n_test > 0:
        # Create separate figure for test input
        plt.figure(figsize=(5, 5))
        plt.imshow(test_data[0]['input'], cmap=cmap, norm=norm)
        plt.grid(True, which='both', color='lightgrey', linewidth=0.5)
        plt.title(f"Test Input - {task_id}")
        plt.xticks([])
        plt.yticks([])
        plt.show()
        
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()

# Simulate 4 GPU data splitting
task_ids = list(arc_data.keys())
random.shuffle(task_ids)  # Shuffle task order

# Assign tasks to each GPU
gpu_tasks = {}
for gpu_id in range(4):
    # Simple equal division - each GPU gets 1/4 of tasks
    start_idx = gpu_id * len(task_ids) // 4
    end_idx = (gpu_id + 1) * len(task_ids) // 4
    gpu_tasks[gpu_id] = task_ids[start_idx:end_idx]

# Display training data samples for each GPU
for gpu_id in range(4):
    assigned_tasks = gpu_tasks[gpu_id]
    print(f"\n{'='*40}\nGPU {gpu_id} Training Data Samples\n{'='*40}")
    print(f"GPU {gpu_id} assigned {len(assigned_tasks)} training tasks")
    
    # Show only first 3 examples
    samples = assigned_tasks[:1]
    
    for task_id in samples:
        print(f"\nTask: {task_id}")
        
        # Get training and test data for this task
        train_data = arc_data[task_id]['train']
        test_data = arc_data[task_id]['test']
        
        # Visualize
        # visualize_arc_example(train_data, test_data, task_id)
        
        # Print data matrices
        print("Training Input (first example):")
        print(np.array(train_data[0]['input']))
        print("\nTraining Output (first example):")
        print(np.array(train_data[0]['output']))
        print("-" * 40)


GPU 0 Training Data Samples
GPU 0 assigned 60 training tasks

Task: 12eac192
Training Input (first example):
[[1 7 7 1 0 8 0 5]
 [1 7 7 1 1 0 1 0]
 [8 8 0 0 7 7 7 7]
 [0 1 0 0 0 0 1 1]
 [5 0 8 0 1 0 1 1]]

Training Output (first example):
[[3 7 7 1 0 3 0 3]
 [3 7 7 1 1 0 3 0]
 [3 3 0 0 7 7 7 7]
 [0 3 0 0 0 0 1 1]
 [3 0 3 0 3 0 1 1]]
----------------------------------------

GPU 1 Training Data Samples
GPU 1 assigned 60 training tasks

Task: 281123b4
Training Input (first example):
[[0 0 8 8 3 5 0 0 5 3 9 0 0 9 3 4 0 0 4]
 [0 8 8 0 3 5 5 0 5 3 9 9 0 9 3 0 0 4 4]
 [8 8 8 0 3 0 5 5 0 3 9 9 0 0 3 4 0 0 0]
 [8 8 0 0 3 0 0 0 0 3 0 0 0 0 3 4 4 4 0]]

Training Output (first example):
[[9 0 8 9]
 [9 9 4 9]
 [9 9 8 0]
 [4 4 4 0]]
----------------------------------------

GPU 2 Training Data Samples
GPU 2 assigned 60 training tasks

Task: 13713586
Training Input (first example):
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5]
 [0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 5]
 [0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 5]
 [0 0 0

In [2]:
# %%python --bg --proc train_proc0
from common_stuff import *
start_training(gpu=0)

ModuleNotFoundError: No module named 'torch'

In [None]:
%%python --bg --proc train_proc1
from common_stuff import *
start_training(gpu=1)

In [None]:
%%python --bg --proc train_proc2
from common_stuff import *
start_training(gpu=2)

In [None]:
%%python --bg --proc train_proc3
from common_stuff import *
start_training(gpu=3)

In [None]:
%%python --bg --proc infer_proc0
from common_stuff import *
start_inference(gpu=0)

In [None]:
%%python --bg --proc infer_proc1
from common_stuff import *
start_inference(gpu=1)

In [None]:
%%python --bg --proc infer_proc2
from common_stuff import *
start_inference(gpu=2)

In [None]:
%%python --bg --proc infer_proc3
from common_stuff import *
start_inference(gpu=3)

In [None]:
proc_exit_codes = await wait_for_subprocesses(
    train_proc0, train_proc1, train_proc2, train_proc3,
    infer_proc0, infer_proc1, infer_proc2, infer_proc3,
    print_output=True or arc_test_set.is_fake
)
print(f'*** Subprocesses exit codes: {proc_exit_codes}')
assert all(x==0 for x in proc_exit_codes)

In [None]:
# write submission
from common_stuff import *
with RemapCudaOOM():
    model, formatter, dataset = None, MyFormatter(), None
    decoder = Decoder(formatter, arc_test_set.split_multi_replies(), n_guesses=2, frac_score=True).from_store(infer_params['store'])
    if use_aug_score or arc_test_set.is_fake: decoder.calc_augmented_scores(model=model, store=score_temp_storage, **aug_score_params)
    submission = arc_test_set.get_submission(decoder.run_selection_algo(submission_select_algo))
    with open('submission.json', 'w') as f: json.dump(submission, f)
    if arc_test_set.is_fake:
        decoder.benchmark_selection_algos(selection_algorithms)
        with open('submission.json') as f: reload_submission = json.load(f)
        print('*** Reload score:', arc_test_set.validate_submission(reload_submission))

In [None]:
# Visualization for inference results from submission.json
if arc_test_set.is_fake:
    from common_stuff import *
    import matplotlib.pyplot as plt
    from matplotlib import colors
    import json
    import os
    import numpy as np
    
    print("\n" + "="*80)
    print("VISUALIZING RESULTS FROM SUBMISSION.JSON")
    print("="*80)
    
    # Check if submission file exists
    submission_path = 'submission.json'
    if not os.path.exists(submission_path):
        print(f"Submission file not found at {submission_path}")
    else:
        print(f"Found submission file: {submission_path}")
        
        # Load submission data
        with open(submission_path, 'r') as f:
            submission_data = json.load(f)
        
        print(f"Loaded submission with {len(submission_data)} tasks")
        
        # ARC color map
        cmap = colors.ListedColormap(
            ['#000000', '#0074D9', '#FF4136', '#2ECC40', '#FFDC00',
             '#AAAAAA', '#F012BE', '#FF851B', '#7FDBFF', '#870C25'])
        norm = colors.Normalize(vmin=0, vmax=9)
        
        # Function to check if prediction is non-trivial (not just zeros)
        def is_non_trivial_prediction(pred_array):
            # Check if the prediction contains any non-zero values
            return np.any(np.array(pred_array) > 0)
        
        # Function to visualize a single task result
        def visualize_submission_result(task_id, task_data, submission_output, test_idx):
            # Skip visualization if both predictions are just zeros
            pred_1 = np.array(submission_output['attempt_1'])
            pred_2 = np.array(submission_output['attempt_2'])
            
            if not is_non_trivial_prediction(pred_1) and not is_non_trivial_prediction(pred_2):
                print(f"  Skipping visualization for Task {task_id} - Test #{test_idx+1} (all predictions are zeros)")
                return False
            
            # Create visualization
            fig = plt.figure(figsize=(15, 8))
            grid_spec = plt.GridSpec(2, 3, width_ratios=[1, 1, 1])
            
            # Training examples (first one only for simplicity)
            if task_data['train']:
                # Train Input
                ax1 = fig.add_subplot(grid_spec[0, 0])
                ax1.imshow(task_data['train'][0]['input'], cmap=cmap, norm=norm)
                ax1.grid(True, which='both', color='lightgrey', linewidth=0.5)
                ax1.set_title("Training Input")
                ax1.set_xticks([])
                ax1.set_yticks([])
                
                # Train Output
                ax2 = fig.add_subplot(grid_spec[1, 0])
                ax2.imshow(task_data['train'][0]['output'], cmap=cmap, norm=norm)
                ax2.grid(True, which='both', color='lightgrey', linewidth=0.5)
                ax2.set_title("Training Output")
                ax2.set_xticks([])
                ax2.set_yticks([])
            
            # Test Input
            if test_idx < len(task_data['test']):
                ax3 = fig.add_subplot(grid_spec[0, 1])
                ax3.imshow(task_data['test'][test_idx]['input'], cmap=cmap, norm=norm)
                ax3.grid(True, which='both', color='lightgrey', linewidth=0.5)
                ax3.set_title(f"Test Input (Test #{test_idx+1})")
                ax3.set_xticks([])
                ax3.set_yticks([])
                
                # Ground Truth (if available)
                if 'output' in task_data['test'][test_idx]:
                    ax4 = fig.add_subplot(grid_spec[1, 1])
                    ax4.imshow(task_data['test'][test_idx]['output'], cmap=cmap, norm=norm)
                    ax4.grid(True, which='both', color='lightgrey', linewidth=0.5)
                    ax4.set_title("Ground Truth")
                    ax4.set_xticks([])
                    ax4.set_yticks([])
            
            # Model Predictions
            # Attempt 1
            ax5 = fig.add_subplot(grid_spec[0, 2])
            ax5.imshow(pred_1, cmap=cmap, norm=norm)
            ax5.grid(True, which='both', color='lightgrey', linewidth=0.5)
            ax5.set_title("Model Prediction (Attempt 1)")
            ax5.set_xticks([])
            ax5.set_yticks([])
            
            # Attempt 2
            ax6 = fig.add_subplot(grid_spec[1, 2])
            ax6.imshow(pred_2, cmap=cmap, norm=norm)
            ax6.grid(True, which='both', color='lightgrey', linewidth=0.5)
            ax6.set_title("Model Prediction (Attempt 2)")
            ax6.set_xticks([])
            ax6.set_yticks([])
            
            plt.suptitle(f"Task {task_id} - Test Example #{test_idx+1}", fontsize=16)
            plt.tight_layout()
            plt.subplots_adjust(top=0.9)
            plt.show()
            
            # Calculate accuracy if ground truth is available
            if 'output' in task_data['test'][test_idx]:
                ground_truth = np.array(task_data['test'][test_idx]['output'])
                
                # Check accuracy of both attempts
                results = []
                match_1 = np.array_equal(pred_1, ground_truth) if is_non_trivial_prediction(pred_1) else False
                results.append(f"Attempt 1: {'✓' if match_1 else '✗'}{' (zeros)' if not is_non_trivial_prediction(pred_1) else ''}")
                
                match_2 = np.array_equal(pred_2, ground_truth) if is_non_trivial_prediction(pred_2) else False
                results.append(f"Attempt 2: {'✓' if match_2 else '✗'}{' (zeros)' if not is_non_trivial_prediction(pred_2) else ''}")
                
                print(f"  Results: {', '.join(results)}")
                
                # Display task statistics
                print(f"  Shape - Ground Truth: {ground_truth.shape}, Prediction 1: {pred_1.shape}, Prediction 2: {pred_2.shape}")
                print(f"  Values - Ground Truth unique values: {np.unique(ground_truth)}")
                print(f"          Prediction 1 unique values: {np.unique(pred_1)}")
                print(f"          Prediction 2 unique values: {np.unique(pred_2)}")
            print()
            return True
        
        # Process ALL results from submission (no limit)
        visualized_count = 0
        skipped_count = 0
        
        # Get a list of tasks in the submission
        task_ids = list(submission_data.keys())
        
        # Collect all task/test combinations
        all_predictions = []
        for task_id in task_ids:
            if task_id in arc_test_set.queries:
                task_data = arc_test_set.queries[task_id]
                for test_idx, test_prediction in enumerate(submission_data[task_id]):
                    # Check if we have ground truth available
                    has_ground_truth = (task_id in arc_test_set.replies and 
                                        test_idx < len(arc_test_set.replies[task_id]))
                    
                    # Check if predictions are non-trivial
                    pred_1 = np.array(test_prediction['attempt_1'])
                    pred_2 = np.array(test_prediction['attempt_2'])
                    has_non_zero_pred = is_non_trivial_prediction(pred_1) or is_non_trivial_prediction(pred_2)
                    
                    # Score based on correctness if ground truth is available
                    score = 0
                    if has_ground_truth and has_non_zero_pred:
                        ground_truth = np.array(arc_test_set.replies[task_id][test_idx])
                        
                        match_1 = np.array_equal(pred_1, ground_truth) if is_non_trivial_prediction(pred_1) else False
                        match_2 = np.array_equal(pred_2, ground_truth) if is_non_trivial_prediction(pred_2) else False
                        score = match_1 + match_2
                        
                    all_predictions.append((task_id, test_idx, score, has_ground_truth, has_non_zero_pred))
        
        # Sort by whether they have ground truth first, then by score
        all_predictions.sort(key=lambda x: (-int(x[3]), -x[2]))
        
        # Print summary before visualization
        print(f"\nFound {len(all_predictions)} total predictions to visualize")
        
        # Visualize all tasks
        for task_id, test_idx, score, has_ground_truth, has_non_zero_pred in all_predictions:
            # Get task data and predictions
            task_data = arc_test_set.queries[task_id]
            submission_output = submission_data[task_id][test_idx]
            
            # Visualize this task
            score_info = f" (Score: {score}/2)" if has_ground_truth and has_non_zero_pred else " (no ground truth)" if not has_ground_truth else " (all zeros - no score)"
            print(f"\nTask: {task_id} - Test #{test_idx+1}{score_info}")
            
            # Only increment visualized_count if actually visualized
            if visualize_submission_result(task_id, task_data, submission_output, test_idx):
                visualized_count += 1
            else:
                skipped_count += 1
        
        print(f"\nVisualized {visualized_count} inference results (skipped {skipped_count} with all-zero predictions)")
        
        # Calculate overall accuracy statistics
        if arc_test_set.is_fake:
            total_tests = 0
            total_scored_tests = 0
            correct_attempt1 = 0
            correct_attempt2 = 0
            correct_any = 0
            zero_predictions = 0
            
            for task_id, test_predictions in submission_data.items():
                if task_id in arc_test_set.replies:
                    for test_idx, test_prediction in enumerate(test_predictions):
                        if test_idx < len(arc_test_set.replies[task_id]):
                            total_tests += 1
                            
                            ground_truth = np.array(arc_test_set.replies[task_id][test_idx])
                            pred_1 = np.array(test_prediction['attempt_1'])
                            pred_2 = np.array(test_prediction['attempt_2'])
                            
                            # Check if both predictions are all zeros
                            if not is_non_trivial_prediction(pred_1) and not is_non_trivial_prediction(pred_2):
                                zero_predictions += 1
                                continue
                            
                            # Only count tests with at least one non-zero prediction
                            total_scored_tests += 1
                            
                            match_1 = np.array_equal(pred_1, ground_truth) if is_non_trivial_prediction(pred_1) else False
                            match_2 = np.array_equal(pred_2, ground_truth) if is_non_trivial_prediction(pred_2) else False
                            
                            if match_1: correct_attempt1 += 1
                            if match_2: correct_attempt2 += 1
                            if match_1 or match_2: correct_any += 1
            
            if total_tests > 0:
                print("\n" + "="*80)
                print("OVERALL ACCURACY STATISTICS")
                print("="*80)
                print(f"Total test examples: {total_tests}")
                print(f"Test examples with zero predictions (excluded from accuracy): {zero_predictions}")
                print(f"Test examples included in accuracy calculation: {total_scored_tests}")
                
                if total_scored_tests > 0:
                    print(f"Correct on attempt 1: {correct_attempt1}/{total_scored_tests} ({correct_attempt1/total_scored_tests:.2%})")
                    print(f"Correct on attempt 2: {correct_attempt2}/{total_scored_tests} ({correct_attempt2/total_scored_tests:.2%})")
                    print(f"Correct on either attempt: {correct_any}/{total_scored_tests} ({correct_any/total_scored_tests:.2%})")
                else:
                    print("No non-zero predictions to calculate accuracy")
                    
                print(f"Overall completion rate: {total_scored_tests/total_tests:.2%} of tests have non-zero predictions")
                print("="*80)
else:
    print("Skipping inference visualization - not in fake test mode")