# Mega similarity check (V1) #

### Abstract ###

- The distance value itself is not meaningful, **but identity (zero distance) is.**
- However I still prefer L2 distance ~~because it looks like I'm doing a ML task~~.
- Objective: Try to explain *subjective* experience through model differences, especially whether any components have been **changed**, ignoring how much they have changed.
- Thanks ["CC"](https://github.com/crosstyan), ["RC"](https://github.com/CCRcmcpe) and ["AO"](https://github.com/AdjointOperator) for providing the initial script (and the idea).

### Input ### 
- See the next cell: the paths of the models and any abbreviations you like.

### Output ###
- TONS of JSON, showing `(layer_name, distance_between_2_models)`
- TONS of IMG, showing `(pair_of_model, distance_for_each_type_of_diffusion_layer)`
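
A minimal sketch of how one of these JSON reports could be read back, assuming the structure written by `cmp_json()` in this notebook (`kv`, `da`, `db`, `err`). The helper name `changed_layers` and the sample values are made up for illustration.

```python
import json

def changed_layers(report: dict, tol: float = 0.0) -> list[str]:
    """Return the layer names whose distance is above tol, largest distance first."""
    kv = report["kv"]
    changed = [k for k, d in kv.items() if d > tol]
    return sorted(changed, key=lambda k: kv[k], reverse=True)

# Hypothetical report; a real one would come from json.load() on a file in the JSON output folder.
sample_report = json.loads("""
{
  "kv": {
    "first_stage_model.decoder.conv_in.weight": 0.0,
    "model.diffusion_model.out.2.weight": 1.25,
    "cond_stage_model.transformer.text_model.final_layer_norm.bias": 0.5
  },
  "da": [], "db": [], "err": []
}
""")

print(changed_layers(sample_report))
```

Layers with distance `0.0` are identical in both models, which is the part of the report that matters most here.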

### Special case or comparison ###
- Text encoder for model `nai`: `"cond_stage_model.transformer", "cond_stage_model.transformer.text_model"`
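
The renaming can be sketched as a small key-mapping helper. This assumes, as the special case in `cmp_c()` does, that `nai` uses the short prefix and the other checkpoints use the `text_model` one. The function name `nai_to_sd_key` and the sample keys are hypothetical.

```python
NAI_TE_PREFIX = "cond_stage_model.transformer"
SD_TE_PREFIX = "cond_stage_model.transformer.text_model"

def nai_to_sd_key(key: str) -> str:
    """Map a NAI text-encoder key to the name used by the other checkpoints."""
    # Keys that already carry the long prefix, or belong to another component, are left alone.
    if key.startswith(NAI_TE_PREFIX) and not key.startswith(SD_TE_PREFIX):
        return SD_TE_PREFIX + key[len(NAI_TE_PREFIX):]
    return key

# Illustrative key only; real keys come from the checkpoint's state dict.
print(nai_to_sd_key("cond_stage_model.transformer.encoder.layers.0.mlp.fc1.weight"))
```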

### Some layer names to interpret ###
- `first_stage_model`: VAE
- `cond_stage_model`: Text Encoder
- `model.diffusion_model`: Diffusion model
- `model_ema`: EMA model for training
- `cumprod`, `betas`, `alphas`: Noise schedule buffers of the diffusion process (constants, not trained weights)
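
As a rough sketch, the prefixes above can be used to count how many keys of a checkpoint belong to each component. The helper `component_of` and the sample key list are assumptions for illustration, not part of the original script.

```python
from collections import Counter

COMPONENT_PREFIXES = {
    "first_stage_model.": "VAE",
    "cond_stage_model.": "Text Encoder",
    "model.diffusion_model.": "Diffusion model",
    "model_ema.": "EMA model",
}

def component_of(key: str) -> str:
    """Return the component a state-dict key belongs to, judged by its prefix."""
    for prefix, name in COMPONENT_PREFIXES.items():
        if key.startswith(prefix):
            return name
    return "other"  # anything else, e.g. `betas`

# Illustrative keys; a real list would be `state_dict.keys()`.
keys = [
    "first_stage_model.decoder.conv_in.weight",
    "cond_stage_model.transformer.text_model.final_layer_norm.bias",
    "model.diffusion_model.out.2.weight",
    "model.diffusion_model.out.2.bias",
    "model_ema.decay",
    "betas",
]
counts = Counter(component_of(k) for k in keys)
print(counts)
```

A component with a count of zero is simply missing from the checkpoint.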

### Some notation (Useful in the bin chart) ###
- `attn1`: `sattn` = *Self attention*
- `attn2`: `xattn` = *Cross attention*
- `ff`: *Feed forward*
- `norm`: [Normalisation layer](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html). `elementwise_affine=True` introduces trainable `bias` and `weight`. 
- `proj`: *Projection*
- `emb_layers`: *Embedding layers*
- `others`: `ff` + `norm` + `proj` + `emb_layers`
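
A possible way to turn this notation into code, shown only as a sketch: `layer_type` gives the fine-grained label and `chart_bucket` folds it into the three groups used in the chart. Both names are hypothetical, and the matching is done on dot-separated name parts, which is an assumption about how the keys are laid out.

```python
def layer_type(key: str) -> str:
    """Label a UNET key with the notation above."""
    parts = key.split(".")
    # Order matters: "ff.net.0.proj" belongs to the feed-forward block, not to "proj".
    for tag in ("attn1", "attn2", "ff", "emb_layers"):
        if tag in parts:
            return tag
    if any(p.startswith("norm") for p in parts):
        return "norm"
    if any(p.startswith("proj") for p in parts):
        return "proj"
    return "unlabelled"

def chart_bucket(key: str) -> str:
    """Fold the label into the three chart groups: attn1, attn2, others."""
    t = layer_type(key)
    return t if t in ("attn1", "attn2") else "others"

key = "model.diffusion_model.input_blocks.8.1.transformer_blocks.0.attn2.to_v.weight"
print(layer_type(key), chart_bucket(key))
```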

Path configuration. See `./model_map_v1.json` (set in `model_map_f` below) for the model paths.

TODO: Single folder, static path generators.

In [13]:
# Set the paths here.
ofp_folder = {
    "json": "./json_v1/",
    "img": "./img_v1/"
}
model_map_f = "./model_map_v1.json"

Load libraries.

In [14]:
import time
import os
import json
from pathlib import Path

import numpy as np
import matplotlib as mpl
import torch
from safetensors.torch import load_file #safe_open

from matplotlib import pyplot as plt
from tqdm import tqdm

Some operations.

In [15]:
# TODO: Support 'cuda', but 'cpu' is already fast.
g_device = "cpu"

In [16]:
# Create output folder
for v in ofp_folder.values():
    os.makedirs(os.path.dirname(v), exist_ok=True)  

In [17]:
# For "micro_cmp", go to the cell with cmp_json()
cmp_mapping = []
try:
    with open(model_map_f, "r") as mmf:
        cmp_mapping = json.load(mmf)
except (OSError, json.JSONDecodeError) as e:
    # Missing or malformed map: keep the empty list so the compare loop is a no-op.
    print("Error when loading model map {}: {}. There won't be mass scale comparison.".format(model_map_f, e))
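
The layout of the model map is not spelled out, but the compare loop later in this notebook indexes it as `[group, [[[abbr_a, path_a], [abbr_b, path_b]], ...]]`. Under that assumption, a tiny map and the output file names it would produce look like this. The group name, abbreviations and paths are placeholders, and `planned_outputs` is a hypothetical helper.

```python
import json

# Placeholder content for a model map file; the paths do not exist.
model_map_text = """
[
  ["vae", [
    [["kl-f8", "/hypothetical/path/kl-f8.ckpt"], ["wd1", "/hypothetical/path/wd1.ckpt"]]
  ]]
]
"""
cmp_mapping_example = json.loads(model_map_text)

def planned_outputs(mapping, json_folder="./json_v1/", img_folder="./img_v1/"):
    """List the (JSON, PNG) output paths the compare loop would write."""
    out = []
    for group, pairs in mapping:
        for (abbr_a, _path_a), (abbr_b, _path_b) in pairs:
            stem = f"{group}_{abbr_a}_{abbr_b}"
            out.append((f"{json_folder}{stem}.json", f"{img_folder}{stem}.png"))
    return out

print(planned_outputs(cmp_mapping_example))
```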

Functions inside the compare loop.

In [18]:
def load_model(path: Path, device: str, print_ptl_info=False) -> dict[str, torch.Tensor]:
    if ".safetensors" in path.suffixes:
        return load_file(path, device=device)
    else:
        ckpt = torch.load(path, map_location=device)
        if print_ptl_info and "epoch" in ckpt and "global_step" in ckpt:
            print(f"[I] {path.name}: epoch {ckpt['epoch']}, step {ckpt['global_step']}")
        return ckpt["state_dict"] if "state_dict" in ckpt else ckpt

# Reminder: Dodge different shape!
def check_equal_shape(a: torch.Tensor, b: torch.Tensor, fn):
    if a.shape != b.shape:
        raise Exception("DIFFERENT SHAPE")
        #print("DIFFERENT SHAPE: return -1.0")
        #return -1.0
    return fn(a.type(torch.float),b.type(torch.float))

TENSOR_METRIC_MAP = {
    #"equal": torch.equal,
    "l0": lambda a, b: check_equal_shape(a, b, lambda a, b: torch.dist(a, b, p=0)),    
    "l1": lambda a, b: check_equal_shape(a, b, lambda a, b: torch.dist(a, b, p=1)),
    "l2": lambda a, b: check_equal_shape(a, b, lambda a, b: torch.dist(a, b, p=2)),
    "cossim": lambda a, b: check_equal_shape(a, b, lambda a, b: torch.mean(torch.cosine_similarity(a, b, dim=0)))
}

FIG_METRIC_MAP = {
    #"equal": lambda v: np.linalg.norm(v, 0), 
    "l0": lambda v: np.linalg.norm(v, 0),    
    "l1": lambda v: np.linalg.norm(v, 1),
    "l2": lambda v: np.linalg.norm(v, 2),
    #I don't know how to make this meaningful...
    "cossim": lambda v: np.linalg.norm(v, None)
}
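
To make the three p-norm metrics concrete, here is a dependency-free sketch on plain lists. It mirrors what `torch.dist(a, b, p=...)` returns for `p` = 0, 1, 2 on flattened tensors: `l0` counts the elements that differ, `l1` sums the absolute differences, and `l2` is the Euclidean distance. The function name `dist` and the toy vectors are assumptions.

```python
def dist(a, b, p):
    """p-norm of (a - b) for p in {0, 1, 2}, on equally sized lists of floats."""
    diff = [x - y for x, y in zip(a, b)]
    if p == 0:
        # "L0": number of elements that are not identical.
        return float(sum(1 for d in diff if d != 0))
    return sum(abs(d) ** p for d in diff) ** (1.0 / p)

a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 2.0, 0.0, 0.0]

for p in (0, 1, 2):
    print(f"l{p}: {dist(a, b, p)}")

# Identical inputs give 0 for every p, which is the "identity" check.
print(dist(a, a, 2))
```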

Read a pair of models, extract the key paths, compare for difference, and return all the intermediate data (useful for next step).

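As a Venn diagram: `da` is the part only in model A, `db` the part only in model B, `kv` the overlap of both, and `err` the overlapping keys that could not be compared.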

In [19]:
def cmp_c(a_path, b_path, device, metric, no_ptl_info):
    metric_fn = TENSOR_METRIC_MAP[metric]
    
    # Accept bytes paths as well; str / Path objects are used as-is.
    if isinstance(a_path, bytes):
        a_path = a_path.decode('UTF-8')
    if isinstance(b_path, bytes):
        b_path = b_path.decode('UTF-8')

    a = load_model(Path(a_path), device, not no_ptl_info)
    b = load_model(Path(b_path), device, not no_ptl_info)

    ak = set(a.keys())
    bk = set(b.keys())
    
    keys_inter = ak.intersection(bk)
    da = list(ak.difference(bk))
    db = list(bk.difference(ak))
    kv = {}
    err = []
    for k in keys_inter:
        try:
            rt = metric_fn(a[k], b[k])
            rt = rt.numpy().tolist()
            kv[k] = rt
        except Exception:
            # check_equal_shape() raises when the two tensors differ in shape.
            print("DIFFERENT SHAPE at key {}. Ignored.".format(k))
            err.append(k)

    #Special case: NAI renamed the TE (claimed using GPT-2)
    if not (("animefull" in str(a_path)) and ("animefull" in str(b_path))):
        if "animefull" in str(a_path):
            for dak in da:
                if "cond_stage_model.transformer" in dak:
                    kv["nai." + dak] = metric_fn(a[dak], b[dak.replace("cond_stage_model.transformer", "cond_stage_model.transformer.text_model")]).numpy().tolist()
        elif "animefull" in str(b_path):
            for dbk in db:
                if "cond_stage_model.transformer" in dbk:
                    kv["nai." + dbk] = metric_fn(b[dbk], a[dbk.replace("cond_stage_model.transformer", "cond_stage_model.transformer.text_model")]).numpy().tolist()

    return kv, da, db, err, a, b
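
The key bookkeeping in `cmp_c()` boils down to set operations. A toy sketch, using shape tuples instead of tensors so it runs without any checkpoint: `split_keys` is a hypothetical helper, and the shapes are invented for the example.

```python
def split_keys(a_shapes: dict, b_shapes: dict):
    """Split keys into comparable, A-only, B-only and shape-mismatched groups."""
    ak, bk = set(a_shapes), set(b_shapes)
    inter = ak & bk
    kv_keys = sorted(k for k in inter if a_shapes[k] == b_shapes[k])
    err = sorted(k for k in inter if a_shapes[k] != b_shapes[k])
    da = sorted(ak - bk)
    db = sorted(bk - ak)
    return kv_keys, da, db, err

# Invented shapes; only the key names follow the checkpoint naming style.
a_shapes = {
    "model.diffusion_model.out.2.weight": (4, 320, 3, 3),
    "model.diffusion_model.input_blocks.7.1.proj_in.weight": (1280, 1280, 1, 1),
    "model_ema.decay": (),
}
b_shapes = {
    "model.diffusion_model.out.2.weight": (4, 320, 3, 3),
    "model.diffusion_model.input_blocks.7.1.proj_in.weight": (1280, 1280),
    "first_stage_model.decoder.conv_in.weight": (512, 4, 3, 3),
}

kv_keys, da, db, err = split_keys(a_shapes, b_shapes)
print(kv_keys, da, db, err)
```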

Plot graph from the results above.

In [20]:
def cmp_attn(kv, a, b, ofi, d):
    tmfn = TENSOR_METRIC_MAP[d]
    fmfn = FIG_METRIC_MAP[d]
    diffs = {}
    dlabel = d.upper() #L2

    no_unet = True

    # Ensure there is UNET.
    for k in kv.keys():
        if k.startswith('model.diffusion_model'):
            no_unet = False
            break

    if no_unet:
        return

    for k in kv.keys():
        #TODO: Not only the UNET, do for any components.
        if not k.startswith('model.diffusion_model'):
            continue
        delta = tmfn(a[k], b[k]).numpy().tolist()
        if 'attn1' in k:
            c = 'attn1' #'attn'
        elif 'attn2' in k:
            c = 'attn2' #'xattn'
        else:
            c = 'other'
        diffs.setdefault(c, []).append(delta)

    for k in diffs:
        diffs[k] = np.concatenate([diffs[k]], axis=0)

    fig, axs = plt.subplots(3, 1, figsize=(10, 10), sharex=False)
    fig.tight_layout(pad=5.0)
    for i, (k, v) in enumerate(diffs.items()):
        #bins=len(v) for finding outliers
        #v: numpy.array. 80 layers for attn, 526 for others.
        dval = fmfn(v)

        axs[i].hist(v, bins=len(v), density=False)
        axs[i].set(xlabel=dlabel, ylabel='a.u.')
        axs[i].xaxis.labelpad = 20
        axs[i].set_yscale('log')
        axs[i].set_title(f'{k}: ${dlabel}={dval:.4f}$')
    plt.savefig(ofi, bbox_inches='tight')
    # pyplot keeps every figure open until it is closed explicitly, so close it here.
    plt.close(fig)

Procedure of a comparison. The original scripts have [custom garbage collection](https://docs.python.org/3/library/gc.html), but the default setting is fine for me. Also, an $O(N^2)$ comparison (every model against every other) is harsh.

Variables explanation for "nice guys":

|var|text|
|---|---|
|pa|Path of model A.|
|pb|Path of model B.|
|ofp|File path of the output JSON report.|
|ofi|File path of the output PNG plot.|
|d|Distance method: [p-norm](https://en.wikipedia.org/wiki/Norm_(mathematics)) or [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity). See [pytorch](https://pytorch.org/docs/stable/generated/torch.dist.html), [numpy](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html).|
|npi|`no_ptl_info`. When true, the `epoch` / `global_step` info stored in a `.ckpt` file (presumably PyTorch Lightning metadata) is not printed.|
|kv|[Key-Value Pairs](https://www.w3schools.com/js/js_json_objects.asp) of the intersection: layer name to distance.|
|da|Keys that exist only in model A.|
|db|Keys that exist only in model B.|
|err|Intersecting layers which throw errors. Usually they have different shapes.|
|a|State dict of model A.|
|b|State dict of model B.|
|dj|Data for the output JSON file.|
|fj|File handle of the output JSON file.|

In [21]:
def cmp_json(pa, pb, ofp, ofi, d, npi):
    kv, da, db, err, a, b = cmp_c(Path(pa), Path(pb), g_device, d, npi)
    dj = {'kv':kv, 'da':da, 'db':db, 'err': err}
    with open(ofp, "w") as fj:
        json.dump(dj, fj, indent=4, sort_keys=True)
    cmp_attn(kv, a, b, ofi, d)

Test / Manual operation for a single comparison.

~~Also as example for the above variables.~~

In [22]:
# Testing: Obvious result
cmp_json(
    "../../stable-diffusion-webui/tmp/SD1/aobp/ABPModel-ep59.safetensors", 
    "../../stable-diffusion-webui/models/Stable-diffusion/sample-nd-epoch59.safetensors",
    "./json/test.json",
    "./img/test.png",
    "l2",
    True
)

The compare loop. `tqdm` may not work, at least for me.

In [23]:
cmp_count = 0
ts = time.time()
for cm0 in tqdm(cmp_mapping, desc=" model group", position=0):
    ofp0 = cm0[0]
    for pab in tqdm(cm0[1], desc=" model pairs", position=1):
        pak = pab[0][0]
        pav = pab[0][1]
        pbk = pab[1][0]
        pbv = pab[1][1]
        ofjp = "{}{}_{}_{}.json".format(ofp_folder['json'], ofp0, pak, pbk)
        ofip = "{}{}_{}_{}.png".format(ofp_folder['img'], ofp0, pak, pbk)
        #print(ofjp)
        cmp_count = cmp_count + 1
        cmp_json(pav, pbv, ofjp, ofip, "l2", True)

 model pairs: 100%|██████████| 6/6 [00:07<00:00,  1.26s/it]
 model pairs: 100%|██████████| 1/1 [00:16<00:00, 16.50s/it]
 model pairs: 100%|██████████| 4/4 [00:45<00:00, 11.25s/it]
 model pairs: 100%|██████████| 5/5 [00:44<00:00,  8.94s/it]
 model pairs: 100%|██████████| 3/3 [00:29<00:00,  9.84s/it]
 model pairs: 100%|██████████| 2/2 [00:15<00:00,  7.57s/it]
 model pairs: 100%|██████████| 5/5 [00:39<00:00,  7.81s/it]
 model pairs: 100%|██████████| 2/2 [00:19<00:00,  9.72s/it]
 model pairs: 100%|██████████| 2/2 [00:16<00:00,  8.06s/it]
 model group:  90%|█████████ | 9/10 [03:53<00:23, 23.69s/it]

DIFFERENT SHAPE at key model.diffusion_model.input_blocks.7.1.proj_in.weight. Ignored.
DIFFERENT SHAPE at key model.diffusion_model.input_blocks.8.1.transformer_blocks.0.attn2.to_v.weight. Ignored.
DIFFERENT SHAPE at key model.diffusion_model.output_blocks.3.1.transformer_blocks.0.attn2.to_k.weight. Ignored.
DIFFERENT SHAPE at key model.diffusion_model.input_blocks.2.1.proj_in.weight. Ignored.
DIFFERENT SHAPE at key model.diffusion_model.output_blocks.3.1.proj_in.weight. Ignored.
DIFFERENT SHAPE at key model.diffusion_model.output_blocks.6.1.proj_out.weight. Ignored.
DIFFERENT SHAPE at key model.diffusion_model.output_blocks.11.1.transformer_blocks.0.attn2.to_k.weight. Ignored.
DIFFERENT SHAPE at key model.diffusion_model.middle_block.1.proj_out.weight. Ignored.
DIFFERENT SHAPE at key model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn2.to_v.weight. Ignored.
DIFFERENT SHAPE at key model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn2.to_k.weight. Ignored.
DIF

 model pairs: 100%|██████████| 7/7 [01:18<00:00, 11.19s/it]
 model group: 100%|██████████| 10/10 [05:11<00:00, 31.14s/it]


End of the comparison loop.

In [24]:
te = time.time()
print("Compare: {}, time: {} sec".format(cmp_count, int(te - ts)))

Compare: 37, time: 311 sec


### Findings (VAE) ###
- `kl-f8` vs SD prune: Some layers are pruned.
- `kl-f8` vs WD1: Both `encoder` and `decoder` are trained.
- WD1 vs WD2: Only `decoder` is trained.
- `kl-f8` vs NAI: Only `decoder` is trained. **However, it is the same as the one bundled with SD v1.4. See below.**

### Findings (NAI) ###
- SD 7G vs SD 4G: EMA pruned. *Applies to all models*.
- SD 7G vs NAI 7G: **Same "text encoder" (renamed layer) and "VAE".**
- ACertainty: "Seriously fine-tuned from SD with tons of (NAI) AIGC." **confirmed.** However VAE Decoder is different.

### Findings (SD Variant) ###
- momoko-e: **Dreambooth**. Text encoder is **partially changed.**
- Anything v3 / BasilMix / Anything v4 etc.: **Merged model. All layers are changed.**
- NAI: Some `cumprod` layers dropped
- ANY3: Same as NAI (merged)
- AC: Same as NAI (???)

### Findings (NMFSAN) ###
- NMFSAN: No `cond_stage_model` = Loads the "last text encoder" or `None` (will generate glitched images). No `first_stage_model` = Must load a VAE (`same_model_name.vae.pt`).
- Currently called "negative textual inversion". Freeze TE, train UNET > Make TI > Freeze TE, train UNET again. i.e. no TE and no VAE in the checkpoint.

### Findings (BPModel) ###
- I have internal versions started from "Stupid Diffusion", another internal version of ACertainty.
- This is not a strict and formal proof, but I expect the L2 distances to align with an **almost flat plane**, which would give such a comparison some meaning.
- This **must not be the exact same plane** because the BPModel was trained with a changing dataset and configuration (for example, ARB setting / adding subsets of datasets / "negative-TI" trick). However, the iteration should show a somewhat "clear way of improvement".

|Model A|Model B|others|attn1|attn2|
|---|---|---|---|---|
|AC|mk0|96.0301|23.2504|26.0299|
|mk0|mk3|51.6394|16.5754|13.6055|
|mk3|mk5|20.2971|8.6033|4.6885|
|mk5|nman|22.4227|7.5967|5.7737|
|L1 of the 4 steps above|-|190.3893|56.0258|50.0976|
|L2 of the 4 steps above|-|**113.1510**|**30.7742**|**30.2982**|
|AC|nman (direct)|**113.7759**|**35.8125**|**30.5798**|

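- Somehow the L2 of the step distances (root of the sum of squares) nearly matches the direct AC vs nman distance, so L2 can reflect the "direction between the models".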
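
The two summary rows of the table can be reproduced from the four step distances. One way to read them, offered as an interpretation rather than a proof: if every training step moved the weights in the same direction, the direct distance would equal the plain sum (L1) of the steps. If the steps were mutually orthogonal, Pythagoras would give the root of the sum of squares (L2). The helper name `path_norms` is made up; the numbers are copied from the table.

```python
import math

# Step distances AC > mk0 > mk3 > mk5 > nman, copied from the table.
steps = {
    "others": [96.0301, 51.6394, 20.2971, 22.4227],
    "attn1": [23.2504, 16.5754, 8.6033, 7.5967],
    "attn2": [26.0299, 13.6055, 4.6885, 5.7737],
}
# Direct AC vs nman distances, also from the table.
direct = {"others": 113.7759, "attn1": 35.8125, "attn2": 30.5798}

def path_norms(d):
    """Return (sum of steps, root of the sum of squared steps)."""
    return sum(d), math.sqrt(sum(x * x for x in d))

for name, d in steps.items():
    l1, l2 = path_norms(d)
    print(f"{name}: L1={l1:.4f} L2={l2:.4f} direct={direct[name]:.4f}")
```

In all three columns the direct distance sits far below the L1 sum and close to the L2 value.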

### Findings (SD 2.x) ###
- WD v1.4: Text encoder and VAE encoder are changed.
- CJD v2.1.1: VAE encoder is changed.
- J's RD: Text encoder and VAE are **unchanged**.
- P1at's merge: VAE is unchanged, but everything else is changed.
- SD 1.x vs 2.x: Text encoder is entirely switched. Some layers' dimensions are changed.

## Discussion ##
- "Ignoring prompts" (where is `astolfo`? No human!) is caused by **bias in text encoder?**
- "Missing details" (given some element of the entity is present, e.g. `astolfo` has `pink_hair` but `1girl`) is caused by **bias in UNET?**
- Is that why Anything v3 (20 / 20 momoco style with minimal negative prompts) is popular: the task (waifu AIGC) is a narrow objective which favours bias?
- BPModel is commented as "hard to use" because it relies on the original SD text encoder and VAE? However, diversity is at its maximum, with most art styles successfully trained? Original SD has such "artist prompts", e.g. Vincent Van Gogh.
- Why does ACertainty look like NAI? AIGC dataset as informal Reinforcement Learning?
- Is SD 2.x so broken because of the CLIP? Or the UNET? Why does WD 1.4 E1 ignore prompts (where's `astolfo`? No `1boy`!) but start listening to prompts in E2 (must include `quality:0` but `pink_hair`, `1boy` is OK)?
- Does J's RD fail just because the original SD 2.x text encoder is so bad? Does this apply to CJD also (where's `astolfo`? No `1boy`!)?
- Why does BasilMix work (nice merge with chosen hyperparameters)?

# Further work #
- Compare a set of models **with a clear relation**. For example, [merging ratio](https://huggingface.co/ThePioneer/CoolerWaifuDiffusion) and [training epochs](https://huggingface.co/AnnihilationOperator/ABPModel).
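
For the merging-ratio case there is a simple expectation to test against, assuming the merge is a plain weighted sum `M = (1 - t) * A + t * B` (other merge methods would not follow this). Then the L2 distance from `M` to `A` is exactly `t` times the distance from `A` to `B`. A toy sketch with made-up vectors and hypothetical helper names:

```python
import math

def l2(a, b):
    """Euclidean distance between two equally sized lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_sum(a, b, t):
    """Plain weighted-sum merge: (1 - t) * a + t * b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

# Toy "models" as flat weight lists.
a = [0.0, 1.0, 2.0]
b = [3.0, 5.0, 2.0]

for t in (0.0, 0.25, 0.5, 1.0):
    m = weighted_sum(a, b, t)
    print(f"t={t}: dist to A={l2(m, a):.4f}, dist to B={l2(m, b):.4f}")
```

A series of merged checkpoints whose distances do not scale this way would hint at a different merge method or extra training.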