# Comparison of Gene Prioritization Methods / Runs

## Objectives

This notebook compares multiple gene prioritization runs (tools, HPO versions, extraction strategies)
on the same cohort of patients with known causal genes.

The goals are to:
- assess global prioritization performance across runs
- compare robustness and failure modes
- evaluate the impact of phenotype quantity and quality
- identify strengths and weaknesses of each method

All analyses are performed at the **patient level**, using the rank of the known causal gene.


## Imports and global parameters

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go

pd.set_option("display.max_columns", 200)
pd.set_option("display.max_colwidth", 120)


## Loading data

In [None]:
# === PARAMETER ===
ANALYSIS_TABLE_PATH = "analysis_table.csv"  # or .parquet

# Load the analysis table
if ANALYSIS_TABLE_PATH.endswith(".parquet"):
    df = pd.read_parquet(ANALYSIS_TABLE_PATH)
else:
    df = pd.read_csv(ANALYSIS_TABLE_PATH)

print("Total rows:", len(df))
print("Runs:", df["run_id"].nunique())
print("Patients:", df["ID_PAT_ETUDE"].nunique())

df.head()


## Verification of run comparability

Check that the runs rest on a comparable basis: the same patients, with similar counts of reports found, read errors, and causal genes not found.

In [None]:
run_summary = (
    df.groupby("run_id")
    .agg(
        n_rows=("ID_PAT_ETUDE", "count"),
        n_patients=("ID_PAT_ETUDE", "nunique"),
        reports_found=("report_found", "sum"),
        read_errors=("report_read_error", "sum"),
        gene_not_found=("gene_not_found_flag", "sum"),
    )
    .sort_values("n_patients", ascending=False)
)

run_summary
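
The summary above compares counts only, so two runs with the same `n_patients` could still cover different patients. A minimal sketch of a set-based check follows; the helper name `patients_missing_per_run` and the toy data are hypothetical, and only the `run_id` and `ID_PAT_ETUDE` columns come from the notebook.

```python
import pandas as pd

def patients_missing_per_run(df, run_col="run_id", patient_col="ID_PAT_ETUDE"):
    """For each run, list the cohort patients that have no row in that run."""
    # The reference cohort is the union of patients seen in any run.
    all_patients = set(df[patient_col].unique())
    missing = {}
    for run_id, sub in df.groupby(run_col):
        missing[run_id] = sorted(all_patients - set(sub[patient_col].unique()))
    return missing

# Toy data (hypothetical): run "B" has no row for patient "P3".
toy = pd.DataFrame({
    "run_id": ["A", "A", "A", "B", "B"],
    "ID_PAT_ETUDE": ["P1", "P2", "P3", "P1", "P2"],
})
missing = patients_missing_per_run(toy)
print(missing)  # {'A': [], 'B': ['P3']}
```

If any list is non-empty, one option is to restrict the comparison to the patients shared by all runs before computing Top-N metrics.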


## Filtering usable rows

In [None]:
# Rows usable for rank evaluation
eval_df = df[
    df["report_found"] &
    (~df["report_read_error"])
].copy()

print("Rows usable for evaluation:", len(eval_df))
print("Patients usable:", eval_df["ID_PAT_ETUDE"].nunique())


## Construction of the mean rank per patient and per run

In [None]:
rank_eval = (
    eval_df
    .groupby(["run_id", "ID_PAT_ETUDE"])
    .agg(
        avg_rank=("rank", "mean"),      # or "min" if you prefer
        # number of rows with a non-null hpo_implicated value
        phenotype_length=("hpo_implicated", lambda x: x.notna().sum())
    )
    .reset_index()
)

rank_eval.head()
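
The choice of aggregation matters whenever a patient has several rows in the same run. The sketch below compares `mean`, `median`, and `min` on hypothetical toy data. It assumes only the `run_id`, `ID_PAT_ETUDE`, and `rank` columns used above, and does not say which aggregation is the right one for this cohort.

```python
import pandas as pd

# Toy data (hypothetical): patient P1 has two rows in run A, P2 has one.
toy = pd.DataFrame({
    "run_id": ["A", "A", "A"],
    "ID_PAT_ETUDE": ["P1", "P1", "P2"],
    "rank": [2, 10, 4],
})

# Compute the three candidate aggregations side by side.
agg = (
    toy.groupby(["run_id", "ID_PAT_ETUDE"])["rank"]
    .agg(["mean", "median", "min"])
    .reset_index()
)
print(agg)
```

For P1, the mean and median are both 6 while the minimum is 2, so with `min` the patient counts as a Top-5 hit and with `mean` it does not.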


## Global performance: Top-N per run

In [None]:
def hit_at_k(series, k):
    return (series.notna() & (series <= k)).mean() * 100

topN_summary = (
    rank_eval
    .groupby("run_id")["avg_rank"]
    .apply(lambda s: pd.Series({
        "Top1_%": hit_at_k(s, 1),
        "Top5_%": hit_at_k(s, 5),
        "Top10_%": hit_at_k(s, 10),
        "Top20_%": hit_at_k(s, 20),
        "Top50_%": hit_at_k(s, 50),
    }))
    .unstack()  # apply returns a long Series; pivot to one column per metric
    .reset_index()
)

topN_summary
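
As a quick sanity check on `hit_at_k`, the sketch below applies the same definition to a small hypothetical series of ranks. Note that a missing rank (`NaN`) stays in the denominator, so it counts as a miss rather than being dropped.

```python
import numpy as np
import pandas as pd

def hit_at_k(series, k):
    # Same definition as in the cell above: NaN ranks count as misses.
    return (series.notna() & (series <= k)).mean() * 100

# Toy ranks (hypothetical): four patients, one without a rank.
ranks = pd.Series([1, 3, 12, np.nan])

top1 = hit_at_k(ranks, 1)    # only rank 1 qualifies: 1 of 4
top5 = hit_at_k(ranks, 5)    # ranks 1 and 3 qualify: 2 of 4
top20 = hit_at_k(ranks, 20)  # ranks 1, 3, 12 qualify: 3 of 4
print(top1, top5, top20)     # 25.0 50.0 75.0
```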


## Top-N comparative bar chart

In [None]:
topN_long = topN_summary.melt(
    id_vars="run_id",
    var_name="metric",
    value_name="percentage"
)

fig = px.bar(
    topN_long,
    x="run_id",
    y="percentage",
    color="metric",
    barmode="group",
    title="Top-N performance comparison across runs",
)

fig.update_layout(
    template="simple_white",
    xaxis_title="Run / Method",
    yaxis_title="Percentage of patients (%)",
    legend_title="Metric",
    yaxis=dict(range=[0, 100]),
)

fig.show()


## Per-run rank distribution (boxplot)

In [None]:
fig = px.box(
    rank_eval,
    x="run_id",
    y="avg_rank",
    points="all",
    title="Distribution of causal gene ranks across runs",
)

fig.update_layout(
    template="simple_white",
    xaxis_title="Run / Method",
    yaxis_title="Average rank of the causal gene",
    yaxis=dict(autorange="reversed"),
)

fig.show()


## Comparative CDF of ranks (Recall curves)

In [None]:
k_max = 50
cdf_data = []

for run_id in rank_eval["run_id"].unique():
    sub = rank_eval[rank_eval["run_id"] == run_id]
    total = len(sub)
    if total == 0:
        continue
    for k in range(1, k_max + 1):
        pct = (sub["avg_rank"].notna() & (sub["avg_rank"] <= k)).sum() / total
        cdf_data.append({
            "run_id": run_id,
            "k": k,
            "percentage": pct * 100
        })

cdf_df = pd.DataFrame(cdf_data)

fig = px.line(
    cdf_df,
    x="k",
    y="percentage",
    color="run_id",
    title="CDF of causal gene ranks across runs",
)

fig.update_layout(
    template="simple_white",
    xaxis_title="Rank k",
    yaxis_title="Percentage of patients (%)",
    xaxis=dict(tickvals=[1,10,20,30,40,50]),
    yaxis=dict(range=[0, 100]),
)

fig.show()


## To add:

- Comparison at equal phenotype size (phenotype size also depends on the HPO version)
- Effect of phenotype specificity across runs
- Failure mode analysis per run
- ClinVar / ACMG stratified comparison
- Gene-centric difficulty across runs
- Inter-run concordance & stability
- Multivariate comparative model
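
As a possible starting point for the first item, here is a minimal sketch of a Top-k comparison stratified by phenotype size. It assumes a table shaped like `rank_eval` (columns `run_id`, `avg_rank`, `phenotype_length`). The function name, the bin edges, and the toy data are hypothetical choices, not part of the existing analysis.

```python
import pandas as pd

def top_k_by_phenotype_bin(rank_eval, k=10, bins=(0, 5, 10, float("inf")),
                           labels=("1-5", "6-10", ">10")):
    """Top-k hit rate (%) per run and per phenotype-size bin."""
    out = rank_eval.copy()
    # Right-closed bins: (0, 5], (5, 10], (10, inf). A length of 0 falls
    # outside every bin and is therefore excluded from the result.
    out["phenotype_bin"] = pd.cut(
        out["phenotype_length"], bins=list(bins), labels=list(labels)
    )
    # Missing ranks count as misses, as in hit_at_k.
    out["hit"] = out["avg_rank"].notna() & (out["avg_rank"] <= k)
    return (
        out.groupby(["run_id", "phenotype_bin"], observed=True)["hit"]
        .mean()
        .mul(100)
        .rename(f"Top{k}_%")
        .reset_index()
    )

# Toy data (hypothetical): two runs, three patients each.
toy = pd.DataFrame({
    "run_id": ["A", "A", "A", "B", "B", "B"],
    "phenotype_length": [3, 8, 12, 3, 8, 12],
    "avg_rank": [1, 15, 4, 2, 7, None],
})
strat = top_k_by_phenotype_bin(toy, k=10)
print(strat)
```

Comparing runs within the same bin limits the confounding effect of phenotype size, although the bins themselves may need adjusting per HPO version.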