# Functional Similarity (FS): End-to-End Pipeline

This notebook runs the **Functional Similarity (FS)** pipeline for a single dataset (e.g., `Cowrie.csv`), **building on top of Functional Identicality (FI)**.

FS operates on **FI-unique archetypes** (`fi_hash`), not raw sessions. The notebook therefore:
1. builds (or loads) an FI table, then
2. compares FI archetypes pairwise to form FS families.

## Pipeline overview

1. **Bootstrap**: set `ROOT`, add `src/` to `sys.path`, import `fi_fs`.
2. **FI bootstrap**: load aggregated sessions → parse → normalise → build `fi_df` with `fi_hash`.
3. **Archetypes**: build `fi_hash → canonical sequence` (`arche_map`) for FI-unique behaviour.
4. **FS matrix**: compute pairwise FS (structural Levenshtein) over archetypes.
5. **Persist FS**: save the FS matrix (`.npy`) for reuse.
6. **Clustering**: cluster archetypes into FS families (Agglomerative @ `tau`) and summarise them.
7. **Reports & inspection**: export summary/report artifacts and inspect families visually/textually.

**Goal:** show what FS does from start to finish, starting from FI-unique archetypes as input.

In [1]:
# Bootstrap project root + Python path
import sys, subprocess
from pathlib import Path
import json
import numpy as np
import pandas as pd

ROOT = Path(subprocess.check_output(
    ["git", "rev-parse", "--show-toplevel"],
    text=True,
).strip())

sys.path[:0] = [str(ROOT / "src")]
print("ROOT project:", ROOT.name)

# Master import for FI+FS
import fi_fs

PYTHON: cpython 3.12.0
BASHLEX: 0.18
ROOT project: PhD


## Step 1 - FI bootstrap for FS (parse → normalise → FI table)

FS works **on top of** the FI-unique archetypes, so we first (re)run the FI
pipeline for the chosen dataset:

1. **Load aggregated sessions** from `projects/fi_fs/data/processed/<dataset_name>/<dataset>.csv`.
2. **Parse** each session’s command string with `bashlex` into ordered
   `(op, args, conn)` triplets.
3. **Normalise** the sequences:
   - apply the alias map (canonical operators),
   - replace literals with typed placeholders,
   - α-renumber placeholders for deterministic naming.
4. **Build the FI table**, adding a canonical JSON representation and
   an `fi_hash` (class ID) for each session.

The resulting `fi_df` dataframe is the starting point for FS: it tells us how
many sessions there are in total and how many **FI-unique** behavioural
archetypes we will compare with FS.
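
To make the normalisation and hashing steps concrete, here is a minimal, self-contained sketch. It does not reproduce the `fi_fs` implementation: the helper names, the placeholder pattern (inferred from names such as `PH_URL_1`), and the use of a truncated SHA-256 digest are all assumptions made for illustration.

```python
import hashlib
import json
import re
from collections import defaultdict

# Assumed placeholder shape, inferred from names such as PH_URL_1.
PLACEHOLDER = re.compile(r"^(PH_[A-Z]+)_(\d+)$")


def alpha_renumber_sketch(seq):
    """Renumber placeholders per type, in order of first appearance."""
    mapping = {}                  # old placeholder -> new placeholder
    counters = defaultdict(int)   # placeholder type -> last index used

    def rename(token):
        match = PLACEHOLDER.match(token)
        if match is None:
            return token          # not a placeholder: leave untouched
        if token not in mapping:
            kind = match.group(1)
            counters[kind] += 1
            mapping[token] = f"{kind}_{counters[kind]}"
        return mapping[token]

    return [(rename(op), tuple(rename(a) for a in args), conn)
            for op, args, conn in seq]


def fi_hash_sketch(seq):
    """Hash a compact JSON form of the sequence (hash choice is an assumption)."""
    canonical = json.dumps([[op, list(args), conn] for op, args, conn in seq],
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Two sessions that differ only in how their placeholders were numbered.
a = [("wget", ("PH_URL_7",), ";"), ("chmod", ("+x", "PH_PATH_3"), "EOS")]
b = [("wget", ("PH_URL_2",), ";"), ("chmod", ("+x", "PH_PATH_9"), "EOS")]

norm_a = alpha_renumber_sketch(a)
norm_b = alpha_renumber_sketch(b)
print(norm_a)
print(fi_hash_sketch(norm_a) == fi_hash_sketch(norm_b))
```

After renumbering, both toy sessions serialise to the same canonical JSON and therefore receive the same class ID, which is the property the FI table relies on.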


In [2]:
# FI bootstrap for FS: parse → normalise → FI table

import csv

# Needed for CyberLab dataset (long commands)
csv.field_size_limit(100000000)

DATASET = "Cowrie.csv"
INPUT = ROOT / "projects" / "fi_fs" / "data" / "processed" / Path(DATASET).stem / DATASET
print(f"Input path: {INPUT.relative_to(ROOT)}")

agg, stats = fi_fs.load_aggregated_csv(str(INPUT))
print("\n--- FI bootstrap for FS ---")
print(f"Dataset: {INPUT.relative_to(ROOT)}")
print(f"Aggregated sessions: {stats['n_sessions']}")

# 1) bashlex → triplets
seqs, parsed_parts, parse_df, problems = fi_fs.parse_dataframe_to_triplets(
    agg,
    progress=True,
    with_diagnostics=True,
    strict=False,
)
print(f"Parsed sequences: {len(seqs)}")

# 1b) restrict agg to successfully parsed sessions only
parsed_sids = set(seqs.keys())
agg_ok = agg[agg["session"].isin(parsed_sids)].copy()

print("\n--- Filter parsed sessions ---")
print("Original agg sessions:   ", len(agg))
print("Sessions with sequences:", len(agg_ok))

# Optional safety checks
missing_from_seqs = set(agg_ok["session"]) - parsed_sids
missing_from_agg  = parsed_sids - set(agg_ok["session"])
assert not missing_from_seqs, f"In agg_ok but not seqs: {list(missing_from_seqs)[:5]}"
assert not missing_from_agg,  f"In seqs but not agg_ok: {list(missing_from_agg)[:5]}"

# 2) normalisation: aliases → placeholders → α-renumber
alias_map_path = ROOT / "src" / "fi_fs" / "resources" / "alias_map.yaml"
alias_map = fi_fs.load_alias_map_yaml(alias_map_path)
seqs_alias, alias_changes = fi_fs.apply_aliases(seqs, alias_map)
print("Alias changes:", 0 if alias_changes is None else len(alias_changes))

seqs_ph, dbg = fi_fs.apply_placeholders_args_only(
    seqs_alias,
    debug=True,
    preview_changed_first_n_sessions=5,
    sample_per_reason=5,
)
fi_fs.assert_connectors_preserved(seqs_alias, seqs_ph)
seqs_alpha = fi_fs.alpha_renumber(seqs_ph, check_idempotent=True)
print("α-renumbered sessions:", len(seqs_alpha))

# 3) FI table
fi_fs.assert_serialisation_deterministic(seqs_alpha)
fi_df = fi_fs.build_fi_dataframe(agg_ok, seqs_alpha, commands_col="commands_joined")
print(f"Sessions total (parsed only): {len(fi_df)} | FI-unique: {fi_df['fi_hash'].nunique()}")

fi_df.head(5)


Input path: projects/fi_fs/data/processed/Cowrie/Cowrie.csv

--- FI bootstrap for FS ---
Dataset: projects/fi_fs/data/processed/Cowrie/Cowrie.csv
Aggregated sessions: 98513


Parsing sessions (bashlex): 100%|██████████| 98513/98513 [01:05<00:00, 1492.86it/s]


Parsed sequences: 98425

--- Filter parsed sessions ---
Original agg sessions:    98513
Sessions with sequences: 98425
Alias changes: 488
α-renumbered sessions: 98425
Sessions total (parsed only): 98425 | FI-unique: 607


Unnamed: 0,session,n_rows,fi_hash,commands_clean,canonical_json
0,00031aeff1a6,5,9c07a2ac9b0db760,sh\nshell\nenable\necho 'nameserver 95.214.27....,"[[""sh"",[],"";""],[""shell"",[],"";""],[""enable"",[],""..."
1,0003e7887230,5,9c07a2ac9b0db760,sh\nshell\nenable\necho 'nameserver 95.214.27....,"[[""sh"",[],"";""],[""shell"",[],"";""],[""enable"",[],""..."
2,0003f10a2103,4,ec2785b8610be5a7,sh\nshell\nenable\ncat /bin/echo || while read...,"[[""sh"",[],"";""],[""shell"",[],"";""],[""enable"",[],""..."
3,0004025aa9c2,15,5e1318289a0eede4,sh\nshell\nenable\ncd ~ && rm -rf .ssh && mkdi...,"[[""sh"",[],"";""],[""shell"",[],"";""],[""enable"",[],""..."
4,000537a9f749,1,c1f4de5103d6ebc1,uname -s -v -n -r -m,"[[""uname"",[""-s"",""-v"",""-n"",""-r"",""-m""],""EOS""]]"


## Step 2 - Build `fi_hash → canonical sequence` map for FS

Next, we construct a mapping from each **FI class** to its canonical
execution sequence:

- `build_archetypes(fi_df)` groups sessions by `fi_hash`,
- selects the canonical sequence for each FI class, and
- returns a dictionary: `fi_hash → [(op, args, conn), ...]`.

This `arche_map` is the input to FS: it tells us *which* behaviour each
FI-unique archetype represents, in a form that the FS comparator can work on.
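
As a rough illustration of the mapping, the sketch below builds a `fi_hash → triplets` dictionary from plain dictionaries standing in for `fi_df` rows. The function name and toy rows are hypothetical, and it assumes that every session in an FI class carries the same `canonical_json`, so the first row seen per class is enough.

```python
import json

# Toy stand-in for fi_df rows; the real table has more columns.
rows = [
    {"session": "s1", "fi_hash": "aaaa",
     "canonical_json": '[["uname",["-a"],"EOS"]]'},
    {"session": "s2", "fi_hash": "aaaa",
     "canonical_json": '[["uname",["-a"],"EOS"]]'},
    {"session": "s3", "fi_hash": "bbbb",
     "canonical_json": '[["cd",["PH_PATH_1"],";"],["ls",[],"EOS"]]'},
]


def build_archetypes_sketch(rows):
    """Map each fi_hash to its decoded (op, args, conn) sequence."""
    arche = {}
    for row in rows:
        if row["fi_hash"] in arche:
            continue  # class already seen; assumed identical canonical form
        decoded = json.loads(row["canonical_json"])
        arche[row["fi_hash"]] = [(op, tuple(args), conn)
                                 for op, args, conn in decoded]
    return arche


arche_map_toy = build_archetypes_sketch(rows)
print(len(arche_map_toy), "archetypes")
print(arche_map_toy["bbbb"])
```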


In [3]:
# Build fi_hash -> triplets map for FS

# Map fi_hash -> triplets (canonical sequences)
arche_map = fi_fs.build_archetypes(fi_df)
print("FI-unique archetypes (by fi_hash):", len(arche_map))


FI-unique archetypes (by fi_hash): 607


## Step 3 - Compute and save the FS matrix (structural Levenshtein)

Using the FI-unique archetypes (`arche_map`), we now:

1. **Compute pairwise FS scores** with a structural Levenshtein-style
   comparator:
   - operates on canonical `(op, args, conn)` triplets,
   - is order-aware and can optionally include connectors in the comparison.

2. **Inspect basic sanity stats** (mean/median/min/max of off-diagonal FS
   values) to check the spread of similarities.

3. **Save the FS matrix** to disk as:

   `projects/fi_fs/data/output/<dataset_name>/FS_eval/NumPy_Arrays/<dataset_name>_FS_Lev_opconn_N{N}.npy`

   where `N` is the number of FI-unique archetypes.
   This `.npy` file can then be reused by clustering/evaluation notebooks
   without recomputing FS.
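
The comparator idea can be sketched in a few lines. This is not the `fs_levenshtein_structural` implementation: it assumes unit edit costs, structural tokens of the form `(op, conn)` (or just `op` when connectors are excluded), and a similarity of `1 - distance / max(len_a, len_b)`. All function names here are hypothetical.

```python
def structural_tokens(seq, include_connectors=True):
    """Reduce triplets to structural tokens: (op, conn), or just op."""
    return [(op, conn) if include_connectors else op
            for op, _args, conn in seq]


def levenshtein(a, b):
    """Classic edit distance over token lists with unit costs."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fs_sketch(seq_a, seq_b, include_connectors=True):
    """Similarity in [0, 1]; 1.0 means structurally identical."""
    a = structural_tokens(seq_a, include_connectors)
    b = structural_tokens(seq_b, include_connectors)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


x = [("cd", ("PH_PATH_1",), ";"), ("wget", ("PH_URL_1",), ";"),
     ("chmod", ("+x", "PH_PATH_2"), ";"), ("sh", ("PH_PATH_2",), "EOS")]
# y swaps wget for curl; z keeps the operators but chains them with &&.
y = [("cd", ("PH_PATH_1",), ";"), ("curl", ("-O", "PH_URL_1"), ";"),
     ("chmod", ("+x", "PH_PATH_2"), ";"), ("sh", ("PH_PATH_2",), "EOS")]
z = [("cd", ("PH_PATH_1",), "&&"), ("wget", ("PH_URL_1",), "&&"),
     ("chmod", ("+x", "PH_PATH_2"), "&&"), ("sh", ("PH_PATH_2",), "EOS")]

print(fs_sketch(x, y))                            # one substitution in four tokens
print(fs_sketch(x, z))                            # connectors count as differences
print(fs_sketch(x, z, include_connectors=False))  # operators alone match
```

Under these assumptions, arguments never affect the score; only the order of operators (and, optionally, connectors) does.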


In [4]:
from pathlib import Path as _Path

N = len(arche_map)
print(f"Computing structural Levenshtein FS over N={N} FI-unique archetypes...")

labels_lev, FS_lev = fi_fs.fs_levenshtein_structural(
    arche_map,
    include_connectors=True,
    progress=True,
)

# Sanity stats (off-diagonal only)
tri = FS_lev[np.triu_indices_from(FS_lev, 1)]
print(
    f"FS (off-diagonal): n={tri.size}, "
    f"mean={tri.mean():.3f}, median={np.median(tri):.3f}, "
    f"min={tri.min():.3f}, max={tri.max():.3f}"
)

# Output path: data/output/<dataset_name>/FS_eval/NumPy_Arrays/<dataset_name>_FS_Lev_opconn_N{N}.npy
DATASET_NAME = _Path(DATASET).stem
fs_dir = ROOT / "projects" / "fi_fs" / "data" / "output" / DATASET_NAME / "FS_eval" / "NumPy_Arrays"
fs_dir.mkdir(parents=True, exist_ok=True)

fs_filename = f"{DATASET_NAME}_FS_Lev_opconn_N{FS_lev.shape[0]}.npy"
fs_path = fs_dir / fs_filename
np.save(fs_path, FS_lev)

print("Saved FS array to:", fs_path.relative_to(ROOT))


Computing structural Levenshtein FS over N=607 FI-unique archetypes...


Levenshtein-struct (rows):   0%|          | 0/607 [00:00<?, ?it/s]

Levenshtein-struct (pairs):   0%|          | 0/183921 [00:00<?, ?it/s]

FS (off-diagonal): n=183921, mean=0.185, median=0.139, min=0.002, max=1.000
Saved FS array to: projects/fi_fs/data/output/Cowrie/FS_eval/NumPy_Arrays/Cowrie_FS_Lev_opconn_N607.npy


## Step 4 - Align archetypes to FS matrix order

The FS matrix `FS_lev` is indexed in the order returned by the FS
comparator (`labels_lev`, a list of `fi_hash` values).

To make later analysis easier, we:

1. Rebuild the FI-unique representatives (one row per `fi_hash`), choosing
   for each class the session with the largest `n_rows` (ties broken by
   `session`), as before.
2. **Reindex** this table using `labels_lev` so that rows line up exactly
   with the rows/columns of `FS_lev`.
3. Decode `canonical_json` into explicit `(op, args, conn)` sequences and
   store them in a `seq` column.

The resulting `archetypes` dataframe (plus `seqs_unique` and `fi_hashes`)
is now perfectly aligned to the FS matrix and ready for clustering and
family-level analysis.


In [5]:
# Align FI-unique archetypes to FS_lev order (labels_lev)

def decode_canonical(canon_json: str):
    rows = json.loads(canon_json)
    return [(op, tuple(args), conn) for op, args, conn in rows]


# Build FI-unique representatives as before, then reindex by labels_lev
archetypes = (
    fi_df.sort_values(["n_rows", "session"], ascending=[False, True])
         .drop_duplicates("fi_hash", keep="first")
         .set_index("fi_hash")
         .loc[labels_lev]                     # align rows to FS_lev order
         .reset_index()                       # fi_hash back as a column
         [["fi_hash", "session", "n_rows", "canonical_json"]]
)

archetypes["seq"] = archetypes["canonical_json"].map(decode_canonical)

# Convenience globals in FS_lev row order
seqs_unique = archetypes["seq"].tolist()
fi_hashes   = archetypes["fi_hash"].tolist()

print(f"Aligned archetypes to FS matrix order. Rows: {len(archetypes)}")
display(archetypes.head(5))


Aligned archetypes to FS matrix order. Rows: 607


Unnamed: 0,fi_hash,session,n_rows,canonical_json,seq
0,0014c5294a1b8182,b898ce35a477,5,"[[""curl"",[""PH_URL_1""],""|""],[""sudo"",[""python3"",...","[(curl, (PH_URL_1,), |), (sudo, (python3, -, -..."
1,0129d0d0e783bb89,60a7ef53c929,1,"[[""cd"",[""PH_PATH_1""],"";""],[""wget"",[""PH_URL_1""]...","[(cd, (PH_PATH_1,), ;), (wget, (PH_URL_1,), ;)..."
2,015931310783b74b,88f6d0afbaf6,9,"[[""free"",[],"";""],[""lscpu"",[],"";""],[""top"",[],"";...","[(free, (), ;), (lscpu, (), ;), (top, (), ;), ..."
3,02038519ce400bdc,13166095803e,6,"[[""linuxshell"",[],"";""],[""sh"",[],"";""],[""enable""...","[(linuxshell, (), ;), (sh, (), ;), (enable, ()..."
4,021cda85adade07c,1474808cc7d5,1,"[[""rm"",[""-rf"",""PH_PATH_1""],"";""],[""wget"",[""PH_U...","[(rm, (-rf, PH_PATH_1), ;), (wget, (PH_URL_1, ..."


## Step 5 - Cluster archetypes into FS families (Agglomerative)

With the FS matrix `FS_lev` in hand, we now cluster the FI-unique
archetypes into **FS families** using Agglomerative clustering:

- The threshold `tau` sets how similar archetypes must be to join a family:
  raising it yields more, tighter families and more singletons.
- `agglomerative_from_fs` returns a cluster label per archetype.
- `evaluate_fs_clustering` reports basic summary stats (number of families,
  singletons, etc.).

The resulting labels are stored in `archetypes["family_agg"]` and will be
used for family-level summaries and visualisations.
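
To show how a similarity threshold drives the merging, the following toy sketch performs greedy agglomerative clustering directly on an FS matrix. The linkage used by `agglomerative_from_fs` is not stated in this notebook, so average linkage is an assumption here, and the function name is hypothetical.

```python
def agglomerative_sketch(fs, tau):
    """Merge the most similar pair of clusters while its mean FS >= tau."""
    clusters = [[i] for i in range(len(fs))]

    def mean_fs(c1, c2):
        # Average linkage: mean pairwise FS between the two clusters.
        return sum(fs[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, (p, q) = max(
            (mean_fs(clusters[p], clusters[q]), (p, q))
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if best < tau:
            break  # no remaining pair is similar enough
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(fs)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


fs_toy = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.70, 0.20],
    [0.85, 0.70, 1.00, 0.10],
    [0.10, 0.20, 0.10, 1.00],
]

labels_toy = agglomerative_sketch(fs_toy, tau=0.75)
print(labels_toy)
print(agglomerative_sketch(fs_toy, tau=0.95))
```

With average linkage, items 1 and 2 end up in the same family even though their pairwise FS (0.70) is below `tau`, because the cluster-level mean is what gets compared against the threshold. A stricter linkage would behave differently.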


In [6]:
# Cluster FS matrix into families (Agglomerative @ tau)

# FS threshold pre-set from FS_Eval notebook experiments
# Adjust accordingly if needed
TAU = 0.75

labels_agg = fi_fs.agglomerative_from_fs(FS_lev, tau=TAU)
archetypes["family_agg"] = labels_agg

stats = fi_fs.evaluate_fs_clustering(
    FS_lev,
    labels_agg,
    tau=TAU,
)

print(f"FS Agglomerative clustering @ tau={TAU}")
print("Families discovered:", stats["n_clusters"])
print("Singleton families:", stats["n_singletons"])

print("\nFamily size distribution (top 10):")
display(pd.Series(labels_agg).value_counts().head(10))

print("\nInternal metrics:")
display(pd.DataFrame([stats]))


FS Agglomerative clustering @ tau=0.75
Families discovered: 298
Singleton families: 204

Family size distribution (top 10):


45     49
69     22
3      14
2      13
44     12
1      11
73     10
8      10
136     9
61      8
Name: count, dtype: int64


Internal metrics:


Unnamed: 0,config,tau,N,n_clusters,n_singletons,max_cluster_size,median_cluster_size,cohesion_min_FS,cohesion_mean_FS,silhouette,calinski_harabasz,davies_bouldin,dunn
0,,0.75,607,298,204,49,1.0,0.633333,0.961235,0.454071,150.145207,0.412721,0.30303


In [7]:
# Optional: explicit groups + medoids from cluster labels

groups = fi_fs.group_indices_from_labels(labels_agg)
medoids = fi_fs.medoid_indices(FS_lev, groups)

print(f"Clusters: {len(groups)}")
print("Example cluster → members (first 5 clusters):")
for gid in sorted(groups.keys())[:5]:
    print(f"  Cluster {gid}: indices {groups[gid][:10]}{' ...' if len(groups[gid]) > 10 else ''}")

print("\nMedoid indices per cluster (first 5):")
for gid in sorted(medoids.keys())[:5]:
    print(f"  Cluster {gid}: medoid index {medoids[gid]}")


Clusters: 298
Example cluster → members (first 5 clusters):
  Cluster 0: indices [415, 456, 520]
  Cluster 1: indices [35, 48, 55, 57, 122, 274, 299, 317, 410, 469] ...
  Cluster 2: indices [39, 41, 147, 168, 246, 321, 324, 372, 397, 455] ...
  Cluster 3: indices [19, 37, 124, 180, 231, 252, 288, 338, 381, 417] ...
  Cluster 4: indices [83, 198, 509]

Medoid indices per cluster (first 5):
  Cluster 0: medoid index 520
  Cluster 1: medoid index 35
  Cluster 2: medoid index 321
  Cluster 3: medoid index 37
  Cluster 4: medoid index 83


## Step 6 - Summarise FS families and members

With cluster labels assigned (`labels_agg`), we now build:

1. **Family-level summaries** via `summarise_families`:
   - one row per FS family,
   - size and basic FS statistics (mean / sd),
   - the medoid index and its `fi_hash`,
   - simple descriptive features such as `top_ops` and
     `consensus_skeleton_pairs`.

2. A **family members table** via `build_family_members_df`:
   - one row per FI-unique archetype,
   - includes its `family_id`, `fi_hash`, and basic metadata.

These two tables (`summ_df` and `family_members_df`) support downstream
inspection, reporting, and visualisation of FS-based behavioural families.
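
A family summary of this kind can be approximated with a few small helpers. The sketch below is illustrative only: it assumes the medoid is the member with the highest mean FS to the other members and that `top_ops` counts operator occurrences across all member sequences. The helper names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def group_indices(labels):
    """Map each family label to the list of archetype indices it contains."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return dict(groups)


def medoid_index(fs, members):
    """Member with the highest mean FS to the other members."""
    if len(members) == 1:
        return members[0]

    def mean_to_others(i):
        return sum(fs[i][j] for j in members if j != i) / (len(members) - 1)

    return max(members, key=mean_to_others)


def top_ops(seqs, members, n=3):
    """Most frequent operators across the member sequences."""
    counts = Counter(op for i in members for op, _args, _conn in seqs[i])
    return counts.most_common(n)


fs_toy = [
    [1.00, 0.90, 0.85, 0.10],
    [0.90, 1.00, 0.70, 0.20],
    [0.85, 0.70, 1.00, 0.10],
    [0.10, 0.20, 0.10, 1.00],
]
labels_toy = [0, 0, 0, 1]
seqs_toy = [
    [("cd", (), ";"), ("wget", (), "EOS")],
    [("cd", (), ";"), ("curl", (), "EOS")],
    [("cd", (), ";"), ("wget", (), ";"), ("sh", (), "EOS")],
    [("uname", ("-a",), "EOS")],
]

groups_toy = group_indices(labels_toy)
for fam, members in groups_toy.items():
    print(fam, len(members), medoid_index(fs_toy, members),
          top_ops(seqs_toy, members, n=2))
```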


In [8]:
# Sanity checks (optional but nice when running standalone)
for name in ["archetypes", "seqs_unique", "FS_lev", "labels_agg"]:
    if name not in globals():
        raise RuntimeError(f"Expected '{name}' to be defined before this cell.")

fi_hashes = archetypes["fi_hash"].tolist()  # aligned with FS_lev rows

# 1) Per-family summaries
summ_df, family_groups, medoids = fi_fs.summarise_families(
    FS=FS_lev,
    seqs_unique=seqs_unique,
    labels=labels_agg,
    fi_hashes=fi_hashes,
)

cols = [
    "family_id",
    "size",
    "mean_FS",
    "sd_FS",
    "medoid_idx",
    "medoid_fi_hash",
    "top_ops",
    "consensus_skeleton_pairs",
]

n_singletons = int((summ_df["size"] == 1).sum())
print(f"\nFamilies total: {len(summ_df)} | Singleton families: {n_singletons}")
print("\nFamily summaries (top 15 by size, then mean_FS):")
display(summ_df[cols].head(15))

# 2) Family members table (one row per archetype with its family label)
family_members_df = fi_fs.build_family_members_df(
    archetypes=archetypes,
    labels=labels_agg,
)

print(
    f"\nfamily_members_df: {len(family_members_df)} rows across "
    f"{family_members_df['family_id'].nunique()} families"
)
display(family_members_df.head(5))



Families total: 298 | Singleton families: 204

Family summaries (top 15 by size, then mean_FS):


Unnamed: 0,family_id,size,mean_FS,sd_FS,medoid_idx,medoid_fi_hash,top_ops,consensus_skeleton_pairs
0,45,49,1.0,0.0,16,0575fce921790578,"[(busybox, 49)]","[(busybox, EOS)]"
1,69,22,1.0,0.0,50,153fd69fc12292c2,"[(echo, 22)]","[(echo, EOS)]"
2,3,14,0.9202,0.0819,37,0e2f33f329730941,"[(cd, 70), (sh, 57), (chmod, 42), (rm, 29), (t...","[(cd, ||), (cd, ||), (cd, ||), (cd, ||), (cd, ..."
3,2,13,0.7998,0.0834,321,8e6a4487862fa88a,"[(ps, 22), (grep, 22), (cat, 21), (echo, 13), ...","[(ps, |), (grep, ;), (echo, |), (cat, EOS)]"
4,44,12,0.9186,0.037,40,0fbbb73a992a1549,"[(cp, 297), (chmod, 209), (cd, 122), (rm, 111)...","[(cd, &&), (rm, &&), (mkdir, &&), (echo, &&), ..."
5,1,11,0.8149,0.0908,35,0ca225e81884a110,"[(cd, 49), (sh, 32), (chmod, 31), (tftp, 20), ...","[(cd, ||), (cd, ||), (cd, ;), (wget, ;), (chmo..."
6,73,10,0.9492,0.0346,73,1dcdd0685e28acdb,"[(rm, 35), (tftp, 20), (enable, 10), (system, ...","[(enable, ;), (system, ;), (shell, ;), (sh, ;)..."
7,8,10,0.8545,0.0809,106,2b5150c23e9a475b,"[(cd, 50), (cat, 18), (chmod, 14), (wget, 10),...","[(cd, ||), (cd, ||), (cd, ||), (cd, ||), (cd, ..."
8,136,9,0.9655,0.0278,58,16584cd6f1989339,"[(echo, 575), (rm, 63), (chmod, 27), (tftp, 18...","[(enable, ;), (system, ;), (shell, ;), (sh, ;)..."
9,61,8,0.9152,0.0751,54,15c5aa567680fe3f,"[(chmod, 8), (PH_EXEC_1, 8), (wget, 7), (cd, 6...","[(chmod, ;), (PH_EXEC_1, EOS)]"



family_members_df: 607 rows across 298 families


Unnamed: 0,family_id,archetype_idx,fi_hash,session,n_rows,struct_tokens
0,222,0,0014c5294a1b8182,b898ce35a477,5,"[[""curl"",[""PH_URL_1""],""|""],[""sudo"",[""python3"",..."
1,16,1,0129d0d0e783bb89,60a7ef53c929,1,"[[""cd"",[""PH_PATH_1""],"";""],[""wget"",[""PH_URL_1""]..."
2,168,2,015931310783b74b,88f6d0afbaf6,9,"[[""free"",[],"";""],[""lscpu"",[],"";""],[""top"",[],"";..."
3,79,3,02038519ce400bdc,13166095803e,6,"[[""linuxshell"",[],"";""],[""sh"",[],"";""],[""enable""..."
4,56,4,021cda85adade07c,1474808cc7d5,1,"[[""rm"",[""-rf"",""PH_PATH_1""],"";""],[""wget"",[""PH_U..."


In [9]:
summary_path = fi_fs.write_fs_summary(
    root=ROOT,
    dataset=DATASET,
    input_path=Path(INPUT) if "INPUT" in globals() else None,
    stats=stats if "stats" in globals() else None,
    agg=agg if "agg" in globals() else None,
    fi_df=fi_df,
    FS_lev=FS_lev,
    summ_df=summ_df,
    family_groups=family_groups,
    archetypes=archetypes,
    alias_changes=alias_changes if "alias_changes" in globals() else None,
    tau=globals().get("TAU", None),
    fs_path=globals().get("fs_path", None),
)

print(f"Wrote summary to: {summary_path.relative_to(ROOT)}")
print(summary_path.read_text(encoding="utf-8"))


Wrote summary to: projects/fi_fs/data/output/Cowrie/FS_eval/Cowrie_FS_Summary.txt
FS Pipeline Summary: Cowrie
Input File:               Cowrie.csv
Input Path:               projects/fi_fs/data/processed/Cowrie/Cowrie.csv

FI Bootstrap Context (for FS)
-----------------------------
Total Raw Sessions:        98513
Successfully Parsed:       98425 (99.91%)
Parse Failures:            88

Unique Archetypes (FI):    607
FI Singleton Archetypes:   241
FI Non-singletons:         366
Deduplication Factor:      162.15x
Avg Sessions / Archetype:  162.15

Normalisation Metadata
----------------------
Alias Changes applied:     488

Functional Similarity (FS)
---------------------------
FS Comparator:             structural Levenshtein (op+conn)
FS Matrix Size (N x N):    607 x 607

FS (off-diagonal) stats:
  mean=0.185, median=0.139, min=0.002, max=1.000

FS Matrix Saved:           projects/fi_fs/data/output/Cowrie/FS_eval/NumPy_Arrays/Cowrie_FS_Lev_opconn_N607.npy

FS Clustering
-------------
Me

## Step 7 - Export an FS families report (Markdown)

This step generates a Markdown report with **one section per FS family**, written to:

`projects/fi_fs/data/output/<dataset_name>/FS_eval/<dataset_name>_FS_families_report.md`

### Ordering
Families are ranked by:
1. **Total session volume** (sum of raw sessions across the family’s archetypes), then
2. **Family size** (number of FI-unique archetypes)

### Per-family contents
Each family section includes:
- **Total session volume**
- **FI-unique archetypes** (family size)
- **Mean FS** (± sd_FS)
- **Medoid archetype** (`fi_hash`, `session`, `n_rows`)
- A short **medoid command snippet** (truncated)
- The **consensus (op, conn) skeleton**
- **Top operators** (if available)
- A table of the **top member archetypes by volume** (default: top 10), showing:
  `fi_hash`, `session`, `n_rows`, and `session_count`

This makes the clustering output easy to browse outside the notebook and convenient to attach as documentation.
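
The ordering rule can be illustrated with plain dictionaries. This is a sketch of the ranking logic only, not of `write_fs_families_report`; the function name and the toy inputs (`session_hashes`, `family_of`) are hypothetical.

```python
from collections import Counter


def rank_families(session_hashes, family_of):
    """Rank families by total session volume, then by number of archetypes."""
    sessions_per_hash = Counter(session_hashes)
    volume = Counter()  # family -> raw sessions across its archetypes
    size = Counter()    # family -> number of FI-unique archetypes
    for fi_hash, fam in family_of.items():
        volume[fam] += sessions_per_hash[fi_hash]
        size[fam] += 1
    ranked = sorted(volume, key=lambda fam: (-volume[fam], -size[fam]))
    return ranked, dict(volume), dict(size)


# One fi_hash per raw session, and the family each archetype belongs to.
session_hashes = ["a", "a", "a", "b", "c", "c", "d", "e"]
family_of = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 2}

ranked, volume, size = rank_families(session_hashes, family_of)
print(ranked)
print(volume)
print(size)
```

Families 1 and 2 both cover two sessions in this toy data, so the tie is broken by family size and family 2 (two archetypes) is listed before family 1 (one archetype).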


In [10]:
from fi_fs.utils import write_fs_families_report

report_path = write_fs_families_report(
    dataset=DATASET,
    root=ROOT,
    fi_df=fi_df,
    summ_df=summ_df,
    archetypes=archetypes,
    family_groups=family_groups,
)

print("Wrote FS families report to:", report_path.relative_to(ROOT))


Wrote FS families report to: projects/fi_fs/data/output/Cowrie/FS_eval/Cowrie_FS_families_report.md


## Step 8 - Sankey visualisation for a chosen FS family

To get an intuitive view of **how** commands vary within a single FS family,
we can draw a Sankey diagram for one `family_id`:

1. Select a family (`FAM`) and collect its `struct_tokens` from
   `family_members_df`.
2. Decode these into structured sequences for all member archetypes.
3. Use `sankey_from_family_enhanced` to build a flow diagram where:
   - the backbone path reflects a consensus sequence (WLCS backbone),
   - branches represent common variants at different positions,
   - edge widths indicate how often each variant occurs.

The printed `variant_detail` data gives a textual breakdown of the same
information (variants per “gap”), complementing the visualisation.
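
The edge-width idea can be illustrated without any plotting. The sketch below counts, for each operator-to-operator transition, how many member sequences contain it, and keeps only transitions above a support threshold. It is a deliberate simplification: it does not compute a WLCS backbone or per-gap variants, and the function names and `min_support` parameter are hypothetical.

```python
from collections import Counter


def transition_counts(seqs):
    """Count in how many sequences each op -> op transition occurs."""
    counts = Counter()
    for seq in seqs:
        ops = ["START"] + [op for op, _args, _conn in seq] + ["END"]
        # set() so a transition repeated within one sequence counts once.
        counts.update(set(zip(ops, ops[1:])))
    return counts


def frequent_links(seqs, min_support=0.5):
    """Keep transitions present in at least min_support of the sequences."""
    counts = transition_counts(seqs)
    return {edge: n for edge, n in counts.items()
            if n / len(seqs) >= min_support}


family_toy = [
    [("cd", (), ";"), ("wget", (), ";"), ("sh", (), "EOS")],
    [("cd", (), ";"), ("wget", (), ";"), ("chmod", (), ";"), ("sh", (), "EOS")],
    [("cd", (), ";"), ("curl", (), ";"), ("sh", (), "EOS")],
]

links = frequent_links(family_toy, min_support=0.5)
for (src, dst), n in sorted(links.items()):
    print(f"{src} -> {dst}: {n}")
```

In a flow diagram, each surviving pair would become a link whose width is proportional to its count; rarer transitions (such as `cd -> curl` here) are the ones a renderer might bucket into an "other" branch.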


In [25]:
FAM = 1

fig, backbone, sankey_stats = fi_fs.sankey_for_family_id(
    FAM,
    family_members_df=family_members_df,
    summ_df=summ_df,
    archetypes=archetypes,

    # main knobs
    min_variant_support=0.05,
    topN_variants_per_gap=10,
    collapse_runs=True,

    # useful extras
    bucket_other_variants=True,
    variant_label_mode="first",
    normalise_widths=False,
    show_link_counts=True,

    # caption + sizing
    show_caption=True,
    fig_width=1000,
    fig_height=600,
    node_pad=18,
    node_thickness=6,
)

fig.show()


In [26]:
out_dir_fi, out_dir_sess = fi_fs.dfg.write_family_dfgs(
    dataset=DATASET,
    root=ROOT,
    family_members_df=family_members_df,
    fi_df=fi_df,
    render=True,
    render_format="png",
    dpi=300,
    fail_fast=False,
    caption=False,
)

print("FI-weighted DFGs:        ", out_dir_fi.relative_to(ROOT))
print("Prevalence-weighted DFGs:", out_dir_sess.relative_to(ROOT))


Rendering DFGs (FI + Prevalence):   0%|          | 0/298 [00:00<?, ?family/s]

Done.
FI-weighted output:         projects/fi_fs/data/output/Cowrie/FS_Families/DFG-FI-Weighted
Prevalence-weighted output: projects/fi_fs/data/output/Cowrie/FS_Families/DFG-Prevalence-Weighted
FI-weighted DFGs:         projects/fi_fs/data/output/Cowrie/FS_Families/DFG-FI-Weighted
Prevalence-weighted DFGs: projects/fi_fs/data/output/Cowrie/FS_Families/DFG-Prevalence-Weighted


## Step 9 - Inspect sessions inside an FS family

FS families are defined over **FI-unique archetypes** (`fi_hash`), but each archetype can map to **many original sessions**. This step expands a chosen family back to raw sessions and prints the commands.

For a chosen `FAM`:

- **Find archetypes in the family**
  Select all `fi_hash` where `archetypes["family_agg"] == FAM`.

- **Pull original sessions**
  Filter `fi_df` to rows whose `fi_hash` is in that set, auto-picking a commands column (`commands_clean` → `commands_joined` → `commands`).
  Sort by `fi_hash`, then `n_rows` (desc), then `session`.

- **Show a coloured legend**
  Build a stable `fi_hash → colour` map and print session counts per FI class.

- **Print commands grouped by `fi_hash`**
  Print either **one representative per FI class** (`PRINT_MODE="ONE"`) or **all sessions** (`"ALL"`), colour-coded.
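
The expansion from a family back to raw sessions can be sketched with plain Python structures. This is not the `inspect_family_commands` implementation; the function name, the `family_of` mapping, and the toy rows are hypothetical, and colouring is left out.

```python
from collections import defaultdict


def sessions_for_family(fam_id, family_of, sessions, print_mode="ONE"):
    """Group a family's raw sessions by fi_hash; keep one or all per class."""
    by_hash = defaultdict(list)
    for row in sessions:
        if family_of.get(row["fi_hash"]) == fam_id:
            by_hash[row["fi_hash"]].append(row)

    result = {}
    for fi_hash in sorted(by_hash):
        # Largest n_rows first, then session id for a stable order.
        rows = sorted(by_hash[fi_hash],
                      key=lambda r: (-r["n_rows"], r["session"]))
        result[fi_hash] = rows[:1] if print_mode == "ONE" else rows
    return result


family_of = {"h1": 2, "h2": 2, "h3": 5}  # fi_hash -> family id
sessions = [
    {"session": "s3", "fi_hash": "h1", "n_rows": 2},
    {"session": "s2", "fi_hash": "h1", "n_rows": 7},
    {"session": "s5", "fi_hash": "h3", "n_rows": 4},
    {"session": "s1", "fi_hash": "h1", "n_rows": 7},
    {"session": "s4", "fi_hash": "h2", "n_rows": 1},
]

one = sessions_for_family(2, family_of, sessions, print_mode="ONE")
everything = sessions_for_family(2, family_of, sessions, print_mode="ALL")
for fi_hash, rows in one.items():
    total = len(everything[fi_hash])
    print(f"FI class {fi_hash}: {total} sessions (showing {len(rows)})")
```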


In [28]:
fi_fs.inspect_family_commands(
    fam_id=2,
    archetypes=archetypes,
    fi_df=fi_df,
    print_mode="ONE",
    rep_strategy="first",
)


Family 2: 11668 original sessions
Distinct FI classes in this family: 13

FI-class legend (colour-coded):
   0f7af8c894564cd3 (sessions=10329)
   1026c7528e21944e (sessions=1)
   3e9b89db5956d28d (sessions=3)
   49e31ff9d9b6abcd (sessions=1)
   6ac9c9eb00c87d30 (sessions=5)
   8e6a4487862fa88a (sessions=1)
   8ff6c737ee29dac3 (sessions=4)
   a146bfed2b401308 (sessions=1309)
   ab463dfe3204f8d0 (sessions=7)
   c760bf9694d9bda0 (sessions=1)
   d0803523114bcd81 (sessions=2)
   ee2a3ed75cc6a79c (sessions=1)
   ef0229b5be2bf157 (sessions=4)

Printing mode: ONE | sessions printed: 13

=== Per-session commands_clean (grouped by fi_hash) ===

### FI class 0f7af8c894564cd3 — 10329 sessions (showing 1)
--- session 000e70d92f9d | n_rows=7 | fi_hash=0f7af8c894564cd3 ---
ifc