![banner.webp](attachment:banner.webp)

# Evolutionary Merging of LLMs for Long-to-Short (L2S) Reasoning

### 🛠️ Environment Setup 


Before running the library, you need to configure a Python virtual environment.
> ⚠️ Important: Use the provided `~/DLAI-project/mergenetic` folder (do not clone the repository again). \
> ⚠️ Python 3.11 is required to ensure all dependencies are satisfied.

🐍 Create a new virtual environment 
```bash
python3.11 -m venv ~/mergenetic/.venv
source ~/mergenetic/.venv/bin/activate
```
📦 Upgrade pip, install the dependencies listed in each `requirements.txt`, and install the local packages in editable mode
```bash
pip install --upgrade pip
cd mergenetic
pip install -r requirements.txt
pip install -e .
cd ../Qwen2.5-Math/evaluation
pip install -r requirements.txt
cd latex2sympy  
pip install -e . 
```

In [3]:
# ==== Imports ====
import os
import random

import numpy as np
import torch

# pymoo components
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.algorithms.soo.nonconvex.ga import GA
from pymoo.operators.crossover.sbx import SBX
from pymoo.operators.mutation.pm import PM
from pymoo.operators.sampling.rnd import IntegerRandomSampling

# Mergekit and Mergenetic
import mergekit
import mergenetic
from mergenetic import PROJECT_ROOT
from mergenetic.merging.taskarithmetic_merger import TaskArithmeticMerger
from mergenetic.merging.ties_merger import TiesMerger
from mergenetic.optimization.predefined_problems import ConfigPE, MathReasoningProblem
from mergenetic.searcher import Searcher
from mergenetic.utils import ConfigLmEval

# lm_eval
from lm_eval.tasks import TaskManager

# Hugging Face
from huggingface_hub import notebook_login, snapshot_download, whoami



In [4]:
import sys, importlib, pathlib

# 0) make sure the real code exists
real_pkg_dir = pathlib.Path(f"{PROJECT_ROOT}/mergenetic/src/mergenetic")
if not real_pkg_dir.exists():
    raise RuntimeError("mergenetic/src/mergenetic not found – is the repo cloned?")

# 1) purge every cached mergenetic module
for name in list(sys.modules):
    if name == "mergenetic" or name.startswith("mergenetic."):
        del sys.modules[name]

# 2) put src/ directory *first* on sys.path
src_root = str(real_pkg_dir.parent)  # <PROJECT_ROOT>/mergenetic/src
if src_root not in sys.path:
    sys.path.insert(0, src_root)

# 3) also remove the current-directory entry ("") and /content from sys.path if present
for bad in ("", "/content"):
    if bad in sys.path:
        sys.path.remove(bad)

# 4) reload
importlib.invalidate_caches()
import mergenetic, inspect, textwrap

print("Now using:", inspect.getfile(mergenetic))
print("Public names:", textwrap.shorten(", ".join(dir(mergenetic)), 100))



Now using: /Users/iacobelli/Downloads/DLAI-project/mergenetic/src/mergenetic/__init__.py
Public names: Any, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, CACHE_DIR, Config, ConfigLmEval, [...]


In [3]:
# ==== Set the seeds ====
SEED = 42

# Python
random.seed(SEED)
# NumPy
np.random.seed(SEED)
# PyTorch
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)

# Environment variables
os.environ["PYTHONHASHSEED"] = str(SEED)

print(f"All seeds set to {SEED}")

All seeds set to 42


### 🤗 Download Models to Merge from HF Hub

Before we begin, we'll need access to the `HuggingFace Hub` using an authentication token.

If you haven’t done this before, follow these steps:

1. Create a **read access token** here:  
   👉 [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

2. Use the token in your code or environment to authenticate:

In [4]:
notebook_login()

try:
    user_info = whoami()
    print("✅ Logged in as:", user_info["name"])
except Exception as e:
    print("❌ Not logged in:", str(e))

✅ Logged in as: marioiac


In this notebook we demonstrate a **Mergenetic** merge of two models:
- a “slow-reasoning” (System 2) model: `DeepSeek-R1-Distill-Qwen-1.5B`
- a “fast-reasoning” (System 1) model: `Qwen2.5-Math-1.5B`

We first download the model snapshots locally, then show how to configure and run the merge with Mergenetic.

In [None]:
# directory where to store base and merged models
model_dir = "models"
os.makedirs(model_dir, exist_ok=True)

# DeepSeek-R1-Distill-Qwen-1.5B (Base model)
deepseek_repo = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
snapshot_download(
    repo_id=deepseek_repo,
    local_dir=os.path.join(model_dir, "DeepSeek-R1-Distill-Qwen-1.5B")
)

# Qwen2.5-Math-1.5B (Target model)
qwen_math_repo = "Qwen/Qwen2.5-Math-1.5B"
snapshot_download(
    repo_id=qwen_math_repo,
    local_dir=os.path.join(model_dir, "Qwen2.5-Math-1.5B")
)

'/Users/iacobelli/Downloads/DLAI-project/mergenetic/mergenetic/models/Qwen2.5-Math-1.5B'

### 📊 Fitness Definition for Evolutionary Merging: Task Setup & Subdataset Selection

Now we're ready to define the **task manager** that will drive our fitness evaluation.

In `Mergenetic`, the fitness of a candidate merge can be computed using either:

- A built-in estimator class based on user-defined estimation methods (which lets you implement more creative fitness functions)
- Tasks from `lm-eval-harness` (the plug-and-play solution)

We will use `lm-eval-harness` and create a small **custom task manager** that evaluates candidates on **GSM8K**, a widely used dataset of grade-school math word problems. We treat the task as a reasoning benchmark and use the harness's accuracy (exact match of the final answer) as the fitness signal.

#### Task Path Setup & Configuration Wrapper

We now set the path to the folder where the lm-eval-harness task files are stored; our custom task file goes in this folder.  
In the mergenetic repo, this is typically:
```python
TASKS_PATH = "mergenetic/lm_tasks"
```
Next, we define a new task configuration file, `gsm8k-cot-zeroshot-new.yaml`, which specifies:
- Dataset: dataset name/path and the train/validation/test splits to use.
- Prompt template: how to format each example into a model input (e.g., CoT style, zero-shot).
- Evaluation metric: which metric to compute (e.g., accuracy) and the regex used to extract the final numeric answer from model outputs.
- Generation parameters: decoding settings such as max output length, temperature, top-p/top-k (if applicable), and stop tokens.
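
To make the evaluation-metric entry concrete, here is a minimal sketch of how a final numeric answer can be pulled out of a model response and compared by exact match. The pattern and the helper names (`NUMBER`, `extract_final_number`, `exact_match`) are assumptions for illustration; the regex actually used is the one written in the task's YAML file and may differ.

```python
import re
from typing import Optional

# Hypothetical pattern: the task's YAML defines its own regex, which may differ.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[str]:
    """Return the last number found in the text, without thousands separators."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def exact_match(prediction: str, reference: str) -> bool:
    """Compare the extracted final numbers as strings."""
    pred = extract_final_number(prediction)
    return pred is not None and pred == extract_final_number(reference)


output = "She sells 16 - 3 - 4 = 9 eggs, earning 9 * 2 = 18 dollars. The answer is 18."
print(extract_final_number(output))
print(exact_match(output, "#### 18"))  # GSM8K references end with "#### <answer>"
```

Taking the last number in the response is a common heuristic for chain-of-thought outputs, since intermediate steps usually contain several numbers before the final answer.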

In [11]:
# === INITIALIZE CONFIG OBJECT ===
config = ConfigLmEval()

# Set the absolute path for custom templates
config.additional_templates_folder = os.path.join(PROJECT_ROOT, "mergenetic", "lm_tasks")
config.bench = "gsm8k"

# Define the full path to the new task file
storing_name_task = "gsm8k-cot-zeroshot-new.yaml"  
path_new_task = os.path.join(config.additional_templates_folder, config.bench, storing_name_task)
print("📁 Task path:", path_new_task)

# Check whether the task file exists
if os.path.exists(path_new_task):
    print("✅ File exists. Proceeding with task setup...")
else:
    print("❌ File does not exist. Please check the path or filename.")

📁 Task path: /Users/iacobelli/Downloads/DLAI-project/mergenetic/lm_tasks/gsm8k/gsm8k-cot-zeroshot-new.yaml
✅ File exists. Proceeding with task setup...


#### Initialize the Task Manager

Now we initialize the Task Manager, which is responsible for coordinating evaluation during the evolutionary search.

Its role includes:
- Loading and validating the evaluation task(s) defined in the lm-eval-harness YAML files
- Handling dataset loading and preprocessing according to the task configuration
- Providing fitness scores that guide the evolutionary search toward better candidate merges

In [12]:
# Path to the custom lm-eval-harness task configurations
path_templates = config.additional_templates_folder

# Initialize the TaskManager (handles loading and validation of tasks)
task_manager = TaskManager(include_path=path_templates)

# Task name (should match the YAML file created earlier)
task_name = "gsm8k_cot_zeroshot_new"
lang_id = "en"

# Load the specified task from the TaskManager
task = task_manager.load_task_or_group(task_name)[task_name]
print("✅ Task setup complete.")

✅ Task setup complete.


#### Selection of Anchors

One of the most useful features of mergenetic is the ability to evaluate candidate merges on a subset of the task dataset, rather than the full set.
This is particularly valuable because, during the search, you often want to estimate model performance quickly without running the full evaluation.

In our case, we select 30 examples from the GSM8K test set as anchors, each assigned an equal weight in the fitness function.
The final evaluation is then performed on the remaining test examples.
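
The idea can be sketched in plain Python: sample anchor indices without replacement, give each the same weight, score a candidate as the weighted sum of its per-anchor correctness, and keep the remaining indices for the final evaluation. The correctness values below are made up for illustration, and the variable names are not part of mergenetic.

```python
import random

rng = random.Random(42)       # local RNG so the sketch is reproducible
num_test_samples = 1319       # size of the GSM8K test split
n_anchors = 30

# Sample anchor indices without replacement and weight them uniformly.
anchor_ids = sorted(rng.sample(range(num_test_samples), n_anchors))
weights = [1.0 / n_anchors] * n_anchors

# Hypothetical per-anchor correctness (1 = exact match, 0 = miss) for one candidate.
correct = [rng.randint(0, 1) for _ in anchor_ids]

# Weighted fitness; with uniform weights this is plain accuracy on the anchors.
fitness = sum(w * c for w, c in zip(weights, correct))

# Indices left over for the final evaluation.
anchor_set = set(anchor_ids)
held_out_ids = [i for i in range(num_test_samples) if i not in anchor_set]

print(f"anchor accuracy estimate: {fitness:.3f}")
print(f"held-out examples: {len(held_out_ids)}")
```

With only 30 anchors the estimate is noisy, which is the price paid for evaluating each candidate merge quickly during the search.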

In [13]:
# Randomly select 30 anchor samples from the GSM8K test set 
config.n_samples = 30  
num_test_samples = len(task.dataset["test"]) 

# Sample anchor indices without replacement
anchors = np.random.choice(range(num_test_samples), config.n_samples, replace=False)  

# Assign equal weight to each anchor
anchors_weights = np.ones(len(anchors)) / len(anchors)  

print("Anchor indices:", anchors)
print("\nAnchor weights:", anchors_weights)

Anchor indices: [ 677 1046  610   49 1284  486  548  939   78  506  210  184  485  451
 1158  886  889 1149 1274  286  764  704 1004 1096 1221 1089  852 1237
  221  420]

Anchor weights: [0.03333333 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333
 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333
 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333
 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333
 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333]


### 🧾 Configuration Details

In [14]:
# Reproducibility
config.seed = SEED  

# Device for evaluation ("cuda" if available, else "cpu")
config.device = "cuda" if torch.cuda.is_available() else "cpu"

# Run identifier (used for logs, checkpoints, results)
config.run_id = "Mergenetic-TA"  

# Define evaluation tasks:
#   - "search": used during evolutionary search
#   - "test":   used for final evaluation
config.tasks = {
    "search": {"en": task_name},
    "test": {"en": task_name},
}

# Metric to evaluate correctness
config.metric = "exact_match"  

# Task type (here: focused on math reasoning, e.g. GSM8K)
config.task_type = "FG_MATH"  

# Paths for saving configs, logs, and merged models
config.path_to_store_config = f"{PROJECT_ROOT}/experiments/evolutionary-merging-lm-harness"
config.path_to_store_merged_model = f"{model_dir}/merged"

# Non-canonical setup:
#   - Deepseek-R1-Distill-Qwen → distillation of DeepSeek-R1 into a Qwen architecture
#   - Qwen2.5-Math-1.5B        → fine-tuned Qwen2.5 on math reasoning
config.base_model = f"{model_dir}/DeepSeek-R1-Distill-Qwen-1.5B"
config.models = {"en": f"{model_dir}/Qwen2.5-Math-1.5B"}

# Fitness evaluation mode ("mean" = average across anchors)
config.mode = "mean"  

# Languages involved (English only in this case)
config.langs = ["en"]  

# Batch size for evaluation
config.eval_batch_size = 8  


In [15]:
# Define the estimation parameters for evolutionary search
est_parameters = ConfigPE(
    thetas=[None, None],               
    weights=anchors_weights,           # Weights assigned to each anchor sample (uniform here)
    sample_ids=anchors,                # Indices of the dataset samples used as anchors
    bench=config.bench,                # Benchmark dataset (e.g., "gsm8k")
    mode=config.mode,                  # Estimator type (e.g., "random" / "mean" / IRT-based)
    correct_metric=config.metric,      # Metric for correctness (e.g., "exact_match")
)

print("✅ Estimation parameters configured.")

✅ Estimation parameters configured.


### 🧪 Define the Merger

The **merger** is a core component of mergenetic.
It fuses two or more models using a chosen strategy (e.g. Task Arithmetic, SLERP, TIES, DARE) guided by the candidate solution proposed during the evolutionary search.
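
As a conceptual illustration of the Task Arithmetic strategy, the sketch below applies it to toy parameter vectors: the difference between a model and the base (the task vector) is scaled by a weight and added back to the base. With a single model this is a linear step from the base toward the other model. The function and variable names are hypothetical, and this is not MergeKit's implementation, which operates tensor by tensor on full checkpoints.

```python
def task_arithmetic(base, finetuned, weight):
    """Return base + weight * (finetuned - base), applied element-wise."""
    return [b + weight * (f - b) for b, f in zip(base, finetuned)]


# Toy stand-ins for two models' parameters.
base_params = [0.10, -0.40, 0.25]
other_params = [0.30, -0.10, 0.05]

for weight in (0.0, 0.5, 1.0):
    merged = task_arithmetic(base_params, other_params, weight)
    print(weight, [round(p, 3) for p in merged])
```

The weight is the quantity the evolutionary search tunes in this notebook: 0 keeps the base model unchanged, and larger values move the merge further toward the second model.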

In [18]:
path_to_store_yaml = f"{config.path_to_store_config}/{config.run_id}"
lang_id = "en"

# Alternative: TIES merging (requires n_var=2 in the problem: weight and density)
# merger = TiesMerger(
#     run_id=config.run_id,
#     path_to_base_model=config.base_model,
#     model_paths=[config.models[lang_id]],
#     path_to_store_yaml=path_to_store_yaml,
#     path_to_store_merged_model=config.path_to_store_merged_model,
#     dtype=config.dtype,
# )

merger = TaskArithmeticMerger(
    run_id=config.run_id,
    path_to_base_model=config.base_model,
    model_paths=[config.models[lang_id]],
    path_to_store_yaml=path_to_store_yaml,
    path_to_store_merged_model=config.path_to_store_merged_model,
    dtype=config.dtype,
)

print("✅ Merger configured.")

✅ Merger configured.


### 🧠 Define the Optimization Problem



To guide the evolutionary search towards the **best merged model**, we need to define a `MergingProblem`.  
This class tells the optimizer **how to evaluate each candidate merge**, using a fitness function (e.g., model accuracy on a task).

Parameters for `MathReasoningProblem`:
- **`merger`**: The merger used to produce merged models.
- **`test_df`/`search_df`**: Optional datasets (must be None when using LM-Eval tasks).
- **`lm_eval_tasks`**: The task setup, as defined in your config (config.tasks).
- **`lang_id`**: Language we're targeting for evaluation (e.g., "en").
- **`conf_pe`**: Estimation config (e.g., sampling strategy, anchors, metric).
- **`device`**: Evaluation device ("cuda" or "cpu").
- **`n_var`**: Number of decision variables (e.g., number of interpolation weights).
- **`n_obj`**: Number of objectives: 1 for single-objective optimization (e.g., accuracy), 2 in this notebook (negative accuracy and response length, both minimized).
- **`n_eq_constr`/`n_ieq_constr`**: Number of equality and inequality constraints (set to 0 if unused).
- **`discrete`**: Whether to treat the search space as discrete (e.g., selecting from fixed layers/weights).
- **`eval_batch_size`**: Batch size for evaluation.
- **`additional_templates_folder`**: Where task templates are stored (e.g., "lm_tasks").
- **`tokenizer`**: Tokenizer used to count the tokens in each model response, which gives the length objective.

In [19]:
# Define the tokenizer to compute the length of the model response (number of consumed tokens)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", trust_remote_code=True)

# Optimization Problem 
problem = MathReasoningProblem(
    merger=merger,
    test_df=None,  
    search_df=None,
    lm_eval_tasks=config.tasks,
    lang_id=lang_id,
    conf_pe=est_parameters,
    device=config.device,
    n_var=1,   # 1 for TA (weight), 2 for TIES (weight and density)
    n_obj=2,   # -acc and length to minimize 
    n_eq_constr=0,
    n_ieq_constr=0,
    discrete=True,
    eval_batch_size=config.eval_batch_size,
    additional_templates_folder=config.additional_templates_folder,
    tokenizer=tokenizer
)
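
With two objectives to minimize (negative accuracy and response length), there is usually no single best merge but a set of trade-offs. The sketch below shows what "non-dominated" means for such pairs. The candidate values are invented for illustration, and the helpers `dominates` and `pareto_front` are not part of mergenetic or pymoo.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (negative accuracy, mean response length in tokens) per candidate merge.
candidates = [(-0.70, 900.0), (-0.65, 400.0), (-0.60, 650.0), (-0.72, 1500.0)]

for neg_acc, length in pareto_front(candidates):
    print(f"accuracy={-neg_acc:.2f}, length={length:.0f}")
```

Here the candidate with accuracy 0.60 and length 650 drops out because another candidate is both more accurate and shorter; the other three each represent a different accuracy/length trade-off.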

### 🧬 Define the Evolutionary Algorithm

To optimize our merging problem, we now define an **evolutionary algorithm** using [`pymoo`](https://pymoo.org/).  
We use a basic **Genetic Algorithm (GA)** for single-objective optimization and **NSGA2** for multi-objective optimization, with the following components:

- `IntegerRandomSampling` – Initializes the population with random integers (suitable for discrete search spaces).
- `SBX` (Simulated Binary Crossover) – Recombines parent solutions to explore the space.
- `PM` (Polynomial Mutation) – Introduces diversity into the population via random mutations.
- Duplicate elimination (`eliminate_duplicates=True`) – Prevents repeated individuals in the population.

Configuration Parameters:
- `config.pop_size` – Population size (number of candidate merges per generation).
- `config.n_iter` – Number of generations (iterations of the search loop).
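
To show how population size and number of generations interact, here is a deliberately simplified single-objective loop in plain Python. It is a stand-in, not pymoo's NSGA2: sampling is uniform over integers, mutation is a random step of -1, 0 or +1 instead of SBX and PM, and `toy_fitness` is a hypothetical objective replacing the expensive merge-and-evaluate step.

```python
import random


def toy_fitness(x):
    """Hypothetical objective to minimize; stands in for evaluating a merged model."""
    return (x - 7) ** 2


def evolve(pop_size=6, n_iter=10, low=0, high=20, seed=42):
    rng = random.Random(seed)
    # Integer random sampling, without duplicates.
    population = rng.sample(range(low, high + 1), pop_size)
    for _ in range(n_iter):
        # Mutation: shift each parent by -1, 0 or +1, clipped to the bounds.
        children = [min(high, max(low, p + rng.choice((-1, 0, 1)))) for p in population]
        # Duplicate elimination, then keep the pop_size fittest individuals.
        pool = sorted(set(population + children), key=toy_fitness)
        population = pool[:pop_size]
    return population


final_population = evolve()
print("best individual:", final_population[0])
```

Each generation costs `pop_size` fitness evaluations, so the total budget is roughly `pop_size * n_iter` merges; this is why small values are convenient for a quick test.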

In [20]:
# For a real run, use e.g. pop_size = 20 and n_iter = 10; the small values below are only a quick test
config.pop_size = 2
config.n_iter = 2

print("Population Size:", config.pop_size)
print("Number of Iterations:", config.n_iter)

algorithm = NSGA2(
    pop_size=config.pop_size,
    sampling=IntegerRandomSampling(),    
    crossover=SBX(),
    mutation=PM(),
    eliminate_duplicates=True,
)

print("✅ Genetic Algorithm configured.")


Population Size: 2
Number of Iterations: 2
✅ Genetic Algorithm configured.


### 🚀 Run the Search & Test the Results



Now that we’ve defined the merging problem and evolutionary algorithm, it’s time to **launch the search process** and evaluate the best merged model.

To do this, we use the `Searcher` class: the high-level orchestrator in mergenetic.

The Searcher wraps everything needed to run an evolutionary merging experiment:

- `search()` 
  Runs the optimization loop for `n_iter` iterations, evolving candidate merges and evaluating their fitness.

- `test()` 
  Evaluates the **best merged model(s)** found during search on the designated test task(s). Saves performance metrics and logs.

- `visualize_results()` *(optional)*  
  Allows you to plot how fitness scores and model parameters evolved across generations (if available in results_df).


In [None]:
results_path = f"{config.path_to_store_config}/{config.run_id}/"

# Initialize the Searcher
searcher = Searcher(
    problem=problem,                # optimization problem instance (MathReasoningProblem)
    n_iter=config.n_iter,           # number of iterations (generations) to run the evolutionary algorithm
    algorithm=algorithm,            # pymoo optimization algorithm (GA/NSGA2)
    results_path=results_path,      # where to store the output files (config.yaml and CSVs)
    run_id=config.run_id,           
    seed=config.seed,               # seed for reproducibility of the search
    verbose=False,
)

searcher.search()
print("✅ Evolutionary search completed.")


#### Evaluation

After running `search()`, the evolutionary process returns one or more candidate solutions.

To actually build those solutions into model checkpoints, run `test()`. 
This step:
- takes the genotypes (merge parameters) of the selected solution(s),
- uses MergeKit to materialize the merged model(s), and
- evaluates them on the full test set (excluding the anchor samples that were used during search).

In this work we don’t evaluate only on the benchmark that guided the merge.
We also assess the merged models on several math reasoning benchmarks to measure accuracy and output length.
For convenience and storage efficiency, we use a helper script in scripts/ to build models on demand:
- The search step writes a CSV named **"<run_id>_solutions.csv"** containing, for each solution:
	- the objective scores, and
	- the genotype (merge parameters).
- To create a concrete checkpoint from any solution in that CSV:
	- Prepare the MergeKit config.yaml (you can find it in the folder specified before) and fill in the merge parameters from the CSV.
	- Run `run_mergekit.py` script to build the model and store it in the path specified by **output_path**
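
The selection step can be sketched as follows. Everything specific here is an assumption: the column names (`objective_1`, `objective_2`, `genotype_1`), the CSV rows, and the idea that the genotype column already holds the real-valued merge weight (if the search stores integers, they must first be converted the way the library does). The dictionary mirrors a MergeKit-style task-arithmetic layout only for illustration; the values should be copied into the actual `config.yaml` written by the search.

```python
import csv
import io

# Hypothetical CSV content; the real columns in "<run_id>_solutions.csv" may differ.
solutions_csv = """objective_1,objective_2,genotype_1
-0.70,900.0,0.6
-0.65,400.0,0.3
"""

rows = list(csv.DictReader(io.StringIO(solutions_csv)))

# Pick the shortest-output solution that still reaches a minimum accuracy.
MIN_ACCURACY = 0.60
eligible = [r for r in rows if -float(r["objective_1"]) >= MIN_ACCURACY]
chosen = min(eligible, key=lambda r: float(r["objective_2"]))
weight = float(chosen["genotype_1"])

# Assumed MergeKit-style layout for a task-arithmetic merge (illustration only).
merge_config = {
    "merge_method": "task_arithmetic",
    "base_model": "models/DeepSeek-R1-Distill-Qwen-1.5B",
    "models": [
        {"model": "models/Qwen2.5-Math-1.5B", "parameters": {"weight": weight}},
    ],
}
print(merge_config)
```

Choosing by an accuracy floor and then minimizing length is just one possible policy; any point on the Pareto front returned by the search is a valid candidate.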

--- 


The final evaluation of both the base models and the merged models is performed in a separate notebook, `evaluation_notebook`, located in the `Qwen2.5-Math` folder.
There, we rely on the specialized evaluation library provided by Qwen to test models across multiple math reasoning benchmarks, including GSM8K, MATH500, Minerva Math, College Math, OlympiadBench, and AIME24.