# Minimal: Dense → Masked → CSR (similarity and structure)

This notebook provides a small, readable baseline comparison between three variants of the same model:

- **Dense**
- **Masked 30%** (unstructured magnitude pruning, still executed with dense kernels)
- **CSR 30%** (same pruning, but converted to CSR and executed with sparse kernels)

We compare:

- **Structure:** model size (MB), global sparsity, and relative compute (`nnz / dense`, where `nnz` is the number of non-zero weights).
- **Behaviour:** *top-1 agreement* with the dense model (fraction of token positions for which the predicted token is identical).



## What do the metrics mean in this notebook?

The three variants are defined as follows:

- **Dense** – original model, no pruning.
- **Masked 30%** – global magnitude pruning: ~30% of weights in prunable linear layers are set to zero, but tensors stay dense.
- **CSR 30%** – same pruning level, but selected linear layers are converted to a **CSR (Compressed Sparse Row)** representation and executed with a sparse kernel (`LinearCSRForward`).

For each variant we log four key metrics:

### 1. `size_mb`

Approximate **model size in megabytes**:

- Computed from the number of parameters and their data type (fp32, fp16, …).
- Tells you how much **memory** the model occupies on disk / in RAM / on the GPU.
- In this notebook:
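  - **Dense** and **Masked30** have the same `size_mb` (477.75 MB in this run), because masking does **not** physically remove zeros.
  - **CSR30** reports a smaller `size_mb` (about 153 MB in this run). The reported `total` drops from 125,239,296 to 40,221,696 parameters, which suggests that the weights held by the converted `LinearCSRForward` layers are not counted by `params_size_and_sparsity`. The figure is therefore best read as the size of the parameters that remain dense, not as the full CSR footprint (non-zero values plus sparse indices).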
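As a rough check on these sizes, the sketch below reproduces the logged dense `size_mb` from the parameter count and estimates what CSR storage would cost for a single weight matrix. It assumes 1 MB = `2**20` bytes, fp32 values, and 32-bit or 64-bit indices. The helper names (`params_mb`, `csr_bytes`) and the 768 x 3072 layer shape are illustrative assumptions, not taken from the repository, and the index dtype actually used by the converter is not shown in this notebook.

```python
def params_mb(n_params, bytes_per_param=4):
    """Size of n_params stored densely, in MB (assuming 1 MB = 2**20 bytes)."""
    return n_params * bytes_per_param / 2**20


def csr_bytes(rows, nnz, value_bytes=4, index_bytes=4):
    """Bytes for a CSR matrix: one value and one column index per non-zero,
    plus rows + 1 row pointers."""
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes


# Dense model: total parameter count logged in this notebook, stored as fp32.
print(params_mb(125_239_296))  # 477.75

# One hypothetical 768 x 3072 fp32 weight matrix at several sparsity levels.
rows, cols = 768, 3072
dense = rows * cols * 4
for sparsity in (0.3, 0.5, 0.9):
    nnz = round(rows * cols * (1 - sparsity))
    ratio32 = csr_bytes(rows, nnz, index_bytes=4) / dense
    ratio64 = csr_bytes(rows, nnz, index_bytes=8) / dense
    print(f"sparsity={sparsity:.1f}  CSR/dense: int32={ratio32:.2f}  int64={ratio64:.2f}")
```

Under these assumptions, CSR storage only beats dense storage once roughly half of the weights are zero (32-bit indices) or two thirds of them (64-bit indices), so at 30% sparsity a CSR layer would be expected to take more memory than its dense counterpart.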

### 2. `sparsity`

Global **fraction of zero weights** in the entire model:

\[
\text{sparsity} = 1 - \frac{\text{nonzero}}{\text{total}}
\]

- `nonzero`: number of parameters that are not equal to zero.
- `total`: total number of parameters.
- A higher `sparsity` means more zeros.
- Here you can see that:
  - **Dense** has almost zero sparsity (baseline).
  - **Masked30** has ~20% sparsity at the model level (only some layers are pruned).
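  - **CSR30** appears almost dense again (sparsity ≈ 2e-05): the statistic seems to cover only the parameters that remain dense (embeddings and other non-converted parts), while the zeros of the converted layers are no longer stored at all.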
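The formula can be checked directly against the `nonzero` and `total` counts logged by the cells of this notebook. The sketch below is plain arithmetic; `global_sparsity` and `pruned_share` are illustrative names, and the last step assumes that exactly 30% of the weights in the pruned layers are zero, which is only approximately true.

```python
def global_sparsity(nonzero, total):
    """Fraction of parameters that are exactly zero."""
    return 1.0 - nonzero / total


# (nonzero, total) pairs as logged by the cells in this notebook.
logged = {
    "Dense": (125_238_422, 125_239_296),
    "Masked30": (99_743_851, 125_239_296),
    "CSR30": (40_220_908, 40_221_696),
}
for setup, (nonzero, total) in logged.items():
    print(f"{setup:9s} sparsity = {global_sparsity(nonzero, total):.6f}")

# If a share of all parameters sits in pruned layers and 30% of those are
# zeroed, the global sparsity is roughly 0.30 * share. Inverting that gives
# the share implied by the Masked30 measurement.
masked_sparsity = global_sparsity(*logged["Masked30"])
pruned_share = masked_sparsity / 0.30
print(f"implied share of parameters in pruned layers: {pruned_share:.2f}")
```

This is why 30% pruning shows up as about 20% at the model level: only about two thirds of the parameters appear to sit in layers that are pruned.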

### 3. `compute_ratio`

Approximate **relative compute cost** assuming that each non-zero weight contributes one unit of work:

\[
\text{compute\_ratio} = \frac{\text{nonzero (variant)}}{\text{nonzero (Dense)}}
\]

- This is a very simple *theoretical* proxy:
  - If `compute_ratio = 0.8`, we expect ~20% fewer FLOPs than the dense baseline (if kernels were perfectly sparse-aware).
- In practice:
  - **Masked30**: fewer non-zero weights, but tensors are still dense → most BLAS kernels will still perform *dense* computation.
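  - **CSR30**: non-zero weights are stored in CSR and used in `LinearCSRForward`, so compute cost is much closer to the number of non-zeros. Since Masked30 and CSR30 apply the same pruning, their pruned layers hold the same non-zero weights; a lower logged `compute_ratio` for CSR30 may simply reflect that weights held by `LinearCSRForward` are no longer counted in `stats['nonzero']`.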
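To make the "one unit of work per non-zero weight" assumption concrete, here is a minimal pure-Python sketch of a CSR matrix-vector product that counts the multiply-adds it performs. It only illustrates the storage format and is not the implementation of `LinearCSRForward`; the function names and the toy matrix are assumptions made for this example.

```python
def dense_to_csr(matrix):
    """Convert a list-of-lists matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends in `values`.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = W @ x and count the multiply-adds actually performed."""
    y, macs = [], 0
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
            macs += 1
        y.append(acc)
    return y, macs


W = [
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],
]
x = [1.0, 2.0, 3.0, 4.0]

values, col_idx, row_ptr = dense_to_csr(W)
y, macs = csr_matvec(values, col_idx, row_ptr, x)
dense_macs = len(W) * len(W[0])  # a dense kernel touches every entry

print(y)                  # [4.0, 13.0, 0.0]
print(macs / dense_macs)  # 0.25
```

A dense kernel would perform all 12 multiply-adds regardless of how many entries are zero, which is the situation of the masked variant.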

### 4. `top1_match`

**Top‑1 agreement** between a variant and the dense reference on the same evaluation texts:

1. We take `SAMPLE_TEXTS` (a small, fixed corpus of sentences).
2. For each model, we compute logits over all positions.
3. For each token position, we compare the *argmax* (most probable token) of:
   - the dense baseline `logits_ref`, and
   - the tested variant `logits_test`.
4. We only count positions where the attention mask is 1 (i.e., real tokens, not padding).
5. `top1_match` is the fraction of positions where the two models predict the **same token**:

$\text{top1\_match} = \mathbb{E}[\mathbf{1}\{ \arg\max \text{Dense} = \arg\max \text{Variant} \}]$

Interpretation:

- `top1_match = 1.0` for **Dense** (same model vs itself).
- A high `top1_match` for **Masked30** / **CSR30** means that pruning and CSR conversion preserve the behaviour of the dense model on this small corpus.
- A drop in `top1_match` quantifies how much pruning changes the predictions, which complements other metrics like perplexity.
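The five steps above can be traced by hand on a toy example. The sketch below uses nested Python lists instead of tensors so that each comparison is visible; the logits are made-up values, and `top1_match` here is an illustrative re-implementation of the idea, not the notebook's `top1_agreement` function.

```python
def argmax(scores):
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def top1_match(logits_ref, logits_test, mask):
    """Fraction of unmasked positions where both models pick the same token.

    logits_*: [batch][position][vocab] nested lists, mask: [batch][position].
    """
    same = total = 0
    for ref_seq, test_seq, mask_seq in zip(logits_ref, logits_test, mask):
        for ref, test, keep in zip(ref_seq, test_seq, mask_seq):
            if keep:
                total += 1
                same += argmax(ref) == argmax(test)
    return same / total


# Two sequences, three positions, vocabulary of three tokens (toy values).
ref = [
    [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 0.9]],
    [[0.7, 0.2, 0.1], [0.1, 0.2, 3.0], [9.0, 0.0, 0.0]],
]
test = [
    [[0.2, 1.8, 0.1], [0.3, 1.1, 0.2], [0.1, 0.0, 0.8]],
    [[0.6, 0.3, 0.2], [0.2, 0.1, 2.5], [0.0, 9.0, 0.0]],
]
# The last position of the second sequence is padding and must be ignored.
mask = [[1, 1, 1], [1, 1, 0]]

print(top1_match(ref, test, mask))  # 0.8
```

Four of the five real positions agree. Without the mask, the padded position (where the two models disagree) would pull the score down to 4/6, which is why step 4 matters.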

This notebook is therefore mainly about:

- **Structure and memory**: `size_mb`, `sparsity`, `compute_ratio`.
- **Behavioural similarity**: `top1_match` vs Dense.


In [1]:
# For Google Colab (comment out if you are running locally)
# %cd /content/Edu-Sparsify-LLMs/notebooks


In [2]:
import os, sys, warnings, pandas as pd, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

sys.path.append('..'); sys.path.append('../src')

from src.eval.metrics import params_size_and_sparsity
from src.eval.csvlog import append_row
from src.eval.plotting import bar_plot
from src.pruning.policies import apply_global_magnitude_pruning_cpu_safe, select_prunable_linears
from src.pruning.pipeline import freeze_pruning_, convert_linear_weights_to_csr_
from src.wrappers.linear_csr import LinearCSRForward

warnings.filterwarnings('ignore', message='.*Sparse CSR tensor support is in beta state.*')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device)

RESULTS_DIR = os.path.join('..', 'results')
CSV_PATH = os.path.join(RESULTS_DIR, 'S1_minimal_similarity.csv')
os.makedirs(RESULTS_DIR, exist_ok=True)

# We log: setup, size, sparsity, compute ratio, top-1 similarity
pd.DataFrame(columns=[
    "setup",
    "size_mb",
    "sparsity",
    "compute_ratio",
    "top1_match"
]).to_csv(CSV_PATH, index=False)

def load_fresh():
    """Load a small model depending on the device.

    - CUDA -> EleutherAI/pythia-410m (fp16)
    - CPU  -> facebook/opt-125m     (fp32)
    """
    if device == "cuda":
        model_name = "EleutherAI/pythia-410m"
        torch_dtype = torch.float16
    else:
        model_name = "facebook/opt-125m"
        torch_dtype = None  # default fp32

    tok = AutoTokenizer.from_pretrained(model_name)
    tok.pad_token = tok.eos_token

    kwargs = {}
    if torch_dtype is not None:
        kwargs["torch_dtype"] = torch_dtype

    mdl = AutoModelForCausalLM.from_pretrained(model_name, **kwargs).to(device).eval()
    print(f"Loaded: {model_name}")
    return mdl, tok, model_name

def collect_logits(model, tokenizer, texts, device):
    """Return (logits, attention_mask) for a list of texts.

    logits: [B, L, V], mask: [B, L]
    """
    enc = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.inference_mode():
        out = model(**enc)
    logits = out.logits.cpu()
    mask = enc["attention_mask"].cpu()
    return logits, mask

def top1_agreement(logits_ref, logits_test, att_mask):
    """Fraction (0 to 1) of non-padding token positions with identical
    top-1 predictions between a reference model and a tested model.
    """
    ref = logits_ref.argmax(dim=-1)   # [B, L]
    test = logits_test.argmax(dim=-1) # [B, L]
    mask = att_mask.bool()
    match = (ref[mask] == test[mask]).float().mean().item()
    return match

SAMPLE_TEXTS = [
    "In a quiet valley, the river bends slowly around the last farm before the hills.",
    "Sparse pruning zeroes weights but needs a sparse kernel to speed up compute.",
    "A small batch size can distort latency because of cache and warmup effects.",
    "Causal LM perplexity is averaged per token over sliding blocks.",
    "Version 1.2.0 fixes: stability on CPU, deterministic seeds, better logging.",
    "\"Hello?\" — \"Hi; can you hear me?\" — \"Loud and clear.\"",
    "HTTP 429 means rate limiting; use exponential backoff with jitter.",
    "Compute follows memory: fewer bytes moved often means fewer milliseconds.",
    "Numbers: 3.14159, 2.71828, 0.57721 show up in odd places.",
    "Keep the same corpus when comparing Dense vs Masked vs CSR.",
    "If latency jumps, check power limits, thermal throttling, governors.",
    "We log mean, median, and p95 latency because tails matter.",
    "One batch isn’t enough: run multiple iterations with warmup.",
    "Tiny masking mistakes can create NaNs; clamp logits if needed.",
    "When in doubt, profile with both synthetic and real inputs."
]
# We can virtually increase the corpus size if needed
# SAMPLE_TEXTS = SAMPLE_TEXTS * 20

# Global variables for the dense reference
DENSE_NONZERO = None
DENSE_LOGITS = None
DENSE_MASK = None


Device: cpu


## 1) Dense baseline

In this first step we:

- measure the model size and (near-zero) global sparsity;
- store:
  - the number of non-zero parameters (`DENSE_NONZERO`),
  - the reference logits on `SAMPLE_TEXTS` (`DENSE_LOGITS`, `DENSE_MASK`).

By definition for the dense baseline:

- `compute_ratio = 1.0` (reference compute), and  
- `top1_match = 1.0` (dense vs. dense).


In [3]:
# DENSE_NONZERO, DENSE_LOGITS and DENSE_MASK are module-level names, so they can be assigned here directly.

model, tok, name = load_fresh()
stats = params_size_and_sparsity(model)

# Reference logits on the small corpus
DENSE_LOGITS, DENSE_MASK = collect_logits(model, tok, SAMPLE_TEXTS, device)
DENSE_NONZERO = stats['nonzero']

compute_ratio = 1.0
top1 = 1.0

append_row(
    CSV_PATH,
    setup='Dense',
    size_mb=stats['size_mb'],
    sparsity=stats['sparsity'],
    compute_ratio=compute_ratio,
    top1_match=top1
)

stats, compute_ratio, top1


Loaded: facebook/opt-125m


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


({'nonzero': 125238422,
  'total': 125239296,
  'sparsity': 6.978640314292406e-06,
  'size_mb': 477.75},
 1.0,
 1.0)

## 2) Masked pruning (30%) — dense execution

Here we apply **global magnitude pruning at 30%** on the selected linear layers.

- The linear layers remain in **dense** format (weights are masked, but kernels are dense).
- We measure:
  - global sparsity,
  - effective model size (should be ~identical to Dense),
  - `compute_ratio = stats['nonzero'] / DENSE_NONZERO`,
  - `top1_match` vs. the dense model on `SAMPLE_TEXTS`.
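For readers unfamiliar with *global* magnitude pruning, the sketch below shows the idea on two tiny layers: all weights are ranked together by absolute value and the smallest 30% are zeroed, wherever they happen to sit. This is a pure-Python illustration under simplifying assumptions (flat lists, hypothetical layer names `fc1` and `fc2`); it is not the implementation of `apply_global_magnitude_pruning_cpu_safe`.

```python
def global_magnitude_prune(layers, amount):
    """Zero the `amount` fraction of weights with the smallest magnitude,
    ranked across all layers together. Returns new layers as flat lists.

    Weights tied with the threshold are pruned too, so ties can push the
    result slightly above `amount`.
    """
    magnitudes = sorted(abs(w) for layer in layers.values() for w in layer)
    n_prune = round(amount * len(magnitudes))
    if n_prune == 0:
        return {name: list(layer) for name, layer in layers.items()}
    threshold = magnitudes[n_prune - 1]
    return {
        name: [0.0 if abs(w) <= threshold else w for w in layer]
        for name, layer in layers.items()
    }


layers = {
    "fc1": [0.05, -0.9, 0.4, -0.1, 0.7],
    "fc2": [1.2, -0.02, 0.3, 0.8, -0.6],
}
pruned = global_magnitude_prune(layers, amount=0.30)

for name, weights in pruned.items():
    zeros = sum(w == 0.0 for w in weights)
    print(name, weights, f"layer sparsity = {zeros / len(weights):.0%}")
```

In this toy case `fc1` ends up 40% sparse and `fc2` 20% sparse, while the two layers together are exactly 30% sparse: a global threshold does not guarantee the same sparsity in every layer.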


In [4]:
assert DENSE_NONZERO is not None, "Run the Dense baseline cell first."

SP_MASK = 0.30
model, tok, name = load_fresh()

layers = select_prunable_linears(model, blacklist=("lm_head",))
apply_global_magnitude_pruning_cpu_safe(layers, amount=SP_MASK)
freeze_pruning_(layers)

stats = params_size_and_sparsity(model)
logits_masked, _ = collect_logits(model, tok, SAMPLE_TEXTS, device)

compute_ratio = stats['nonzero'] / DENSE_NONZERO
top1 = top1_agreement(DENSE_LOGITS, logits_masked, DENSE_MASK)

append_row(
    CSV_PATH,
    setup=f'Masked{int(SP_MASK*100)}',
    size_mb=stats['size_mb'],
    sparsity=stats['sparsity'],
    compute_ratio=compute_ratio,
    top1_match=top1
)

stats, compute_ratio, top1


Loaded: facebook/opt-125m


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


({'nonzero': 99743851,
  'total': 125239296,
  'sparsity': 0.20357384474598128,
  'size_mb': 477.75},
 0.7964317132644805,
 0.6613546013832092)

## 3) CSR execution (30%) — truly sparse kernels

We repeat the same global 30% pruning, then:

1. freeze the pruning masks,
2. convert the pruned linear layers to CSR,
3. replace the corresponding modules with `LinearCSRForward`.

We again measure:

- model size and global sparsity,
- `compute_ratio` (now computed from the parameters that `params_size_and_sparsity` still reports after the conversion),
- `top1_match` vs. the dense model on `SAMPLE_TEXTS`.


In [5]:
assert DENSE_NONZERO is not None, "Run the Dense baseline cell first."

SP_CSR = 0.30
model, tok, name = load_fresh()

layers = select_prunable_linears(model, blacklist=("lm_head",))
apply_global_magnitude_pruning_cpu_safe(layers, amount=SP_CSR)
freeze_pruning_(layers)
convert_linear_weights_to_csr_(layers)

# Replace all pruned linear layers by LinearCSRForward
swapped = 0
def find_parent(root, child):
    for _, mod in root.named_modules():
        for cn, cc in mod.named_children():
            if cc is child:
                return mod, cn
    raise RuntimeError('Parent not found')

for lin in layers:
    # Uncomment to restrict to a subset of layers for quick demos
    # if swapped >= 4:
    #     break
    parent, attr = find_parent(model, lin)
    csr_module = LinearCSRForward(
        lin.weight.detach(),
        lin.bias.detach() if lin.bias is not None else None
    ).to(device)
    setattr(parent, attr, csr_module)
    swapped += 1

stats = params_size_and_sparsity(model)
logits_csr, _ = collect_logits(model, tok, SAMPLE_TEXTS, device)

compute_ratio = stats['nonzero'] / DENSE_NONZERO
top1 = top1_agreement(DENSE_LOGITS, logits_csr, DENSE_MASK)

append_row(
    CSV_PATH,
    setup=f'CSR{int(SP_CSR*100)}',
    size_mb=stats['size_mb'],
    sparsity=stats['sparsity'],
    compute_ratio=compute_ratio,
    top1_match=top1
)

stats, compute_ratio, top1


Loaded: facebook/opt-125m


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


({'nonzero': 40220908,
  'total': 40221696,
  'sparsity': 1.959141653296026e-05,
  'size_mb': 153.43359375},
 0.3211547012305856,
 0.6613546013832092)

## 4) Plots

We visualise the three variants (**Dense**, **Masked30**, **CSR30**) with simple bar plots:

- **Model size (MB)** per setup.
- **Relative compute** (`nnz / dense`) per setup.
- **Top-1 agreement** with the dense model per setup.

These plots give a compact view of the trade-off between sparsity, model size, effective compute and predictive similarity.


In [6]:
df = pd.read_csv(CSV_PATH)
display(df)

bar_plot(df, 'setup', 'size_mb', 'Model size (MB)', 'size_vs_sparsity.png', RESULTS_DIR, y_min=None)
bar_plot(df, 'setup', 'compute_ratio', 'Relative compute (nnz / dense)', 'compute_vs_sparsity.png', RESULTS_DIR)
bar_plot(df, 'setup', 'top1_match', 'Top-1 agreement vs Dense', 'top1_vs_sparsity.png', RESULTS_DIR, y_min=0.0)


| | setup | size_mb | sparsity | compute_ratio | top1_match |
|---|---|---|---|---|---|
| 0 | Dense | 477.75 | 7e-06 | 1.0 | 1.0 |
| 1 | Masked30 | 477.75 | 0.203574 | 0.796432 | 0.661355 |
| 2 | CSR30 | 153.433594 | 2e-05 | 0.321155 | 0.661355 |


Saved: ..\results\size_vs_sparsity.png
Saved: ..\results\compute_vs_sparsity.png
Saved: ..\results\top1_vs_sparsity.png
