## Quickstart

Install dependencies:

```shell
pip install jupyter altair altair_saver polars pyarrow anywidget ipywidgets
```

Then, as per the [CAIS tutorial](https://cluster.safe.ai/#jupyter-notebooks-on-the-cluster), start a new interactive node:

```shell
srun --partition=single --pty bash
```

On the interactive node, start Jupyter and note the port number that is printed:

```shell
unset XDG_RUNTIME_DIR
export NODEPORT=$(( $RANDOM + 1024 ))
echo $NODEPORT
jupyter notebook --no-browser --port=$NODEPORT
```

Then, on your local machine, run the following (filling in the port from above):

```shell
export NODEPORT=####
ssh -t -t [your ssh alias for cais cluster] -L ${NODEPORT}:localhost:${NODEPORT} ssh -N compute-permanent-node-990 -L ${NODEPORT}:localhost:${NODEPORT}
```

Now open your local browser and enter the URL printed by Jupyter (`http://localhost:19303/?token=cb...`), or open VS Code and, under Select Kernel, choose "Existing Jupyter Server" and paste the URL there.

In [1]:
from pathlib import Path

import altair as alt
import polars as pl
import sklearn.metrics as skm
from IPython.display import clear_output, display
from ipywidgets import HTML, Dropdown, HBox, Label, Output, VBox, Video

Below are some helper functions; their details are not important for following the rest of the notebook.

In [39]:
def _metadata_cols(df: pl.DataFrame):
    tasks = df.get_column("task").unique().to_list()
    default = [
        "path",
        "task",
        "model",
        "reward",
        "label",
        "true_probability",
        "probability",
    ]
    return [c for c in df.columns if c not in tasks + default]


def _max_prob(df: pl.DataFrame, col: str):
    # One output row per (path, task, model, reward) group
    return df.group_by(["path", "task", "model", "reward"]).agg(
        # Extract the label with the highest value of `col`
        pl.col("label").sort_by(col).last(),
        # Metadata is assumed constant within a group, so keep its first value
        pl.col(_metadata_cols(df)).first(),
    )
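
To make the `sort_by(col).last()` idiom in `_max_prob` concrete, here is a minimal plain-Python sketch of the same per-group argmax. The toy rows and the `max_prob_label` helper are hypothetical and only illustrate the logic; the notebook itself does this in polars.

```python
from collections import defaultdict

# Hypothetical toy rows: one row per (video, candidate label) pair
rows = [
    {"path": "a.mp4", "task": "room_detection", "label": "kitchen", "probability": 0.2},
    {"path": "a.mp4", "task": "room_detection", "label": "bathroom", "probability": 0.7},
    {"path": "a.mp4", "task": "room_detection", "label": "hall", "probability": 0.1},
    {"path": "b.mp4", "task": "room_detection", "label": "kitchen", "probability": 0.6},
    {"path": "b.mp4", "task": "room_detection", "label": "bathroom", "probability": 0.4},
]


def max_prob_label(rows, col, keys=("path", "task")):
    """Return {group key: label whose `col` value is highest in that group}."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    # Sorting by `col` and taking the last element mirrors sort_by(col).last()
    return {
        key: sorted(group, key=lambda r: r[col])[-1]["label"]
        for key, group in groups.items()
    }


predicted = max_prob_label(rows, "probability")
print(predicted)
```

Calling the same helper with a different column name (as the notebook does with `true_probability`) would yield the true label per group instead of the predicted one.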

By default, the latest experiment in `out` is loaded. You can change this manually.

In [47]:
# Load the latest experiment
experiments = Path("../../out").iterdir()
latest_experiment = sorted(experiments, key=lambda d: d.stat().st_mtime)[-1]
experiment_dir = latest_experiment

df = pl.read_csv(experiment_dir / "results.csv")

predicted_labels = _max_prob(df, "probability").rename({"label": "predicted_label"})
true_labels = _max_prob(df, "true_probability").rename({"label": "true_label"})
predictions = predicted_labels.join(
    true_labels, on=["path", "task", "model", "reward"] + _metadata_cols(df)
)
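
The "latest experiment" selection above relies on directory modification times. As a small self-contained sketch of that idea, the snippet below builds a temporary directory with made-up experiment folders (`exp_old`, `exp_mid`, `exp_new` are hypothetical names) and picks the most recently modified one.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Hypothetical experiment directories with explicit modification times
    for name, mtime in [("exp_old", 1_000), ("exp_new", 3_000), ("exp_mid", 2_000)]:
        d = out / name
        d.mkdir()
        os.utime(d, (mtime, mtime))

    # max() with the same key picks the same directory as sorted(...)[-1]
    # when modification times are distinct, without sorting the whole list
    latest = max(out.iterdir(), key=lambda d: d.stat().st_mtime)
    latest_name = latest.name
    print(latest_name)
```

To pin a specific run instead, assign `experiment_dir` directly to that run's directory rather than to `latest_experiment`.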

We now have the table `predictions`, which contains the predictions of every model+reward combination for all tasks and videos. Most visualizations can use this table as their base.

In [48]:
predictions.head(3)

| path | task | model | reward | predicted_label | object_detection | is_photorealistic | true_label |
| --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | str | str | str | str | bool | str |
| "/data/datasets…" | "room_detection…" | "gpt4" | "default" | "bathroom" | "toilet" | True | "bathroom" |
| "/data/datasets…" | "room_detection…" | "gpt4" | "default" | "bathroom" | "toilet" | True | "bathroom" |
| "/data/datasets…" | "room_detection…" | "gpt4" | "default" | "bathroom" | "toilet" | True | "bathroom" |


Based on `predictions`, we can calculate metrics for each task, model, reward combination. Feel free to add more here!

In [57]:
# A wrapper around a sklearn metric function
def f1(group: pl.Series):
    # group is a pl.Series object with two named fields, true_label and predicted_label
    # we can access those fields using group.struct.field
    return skm.f1_score(
        y_true=group.struct.field("true_label").to_numpy(),
        y_pred=group.struct.field("predicted_label").to_numpy(),
        average="macro",
    )


# A helper function used to extract label columns from the dataframe,
# package them as structs, and then map metric_fun over them
def compute_metric(metric_fun):
    return pl.struct("true_label", "predicted_label").map_batches(metric_fun).first()


metrics = predictions.group_by("task", "model", "reward").agg(
    f1=compute_metric(f1),
    # ...add more here!
)

metrics.head(3)

| task | model | reward | f1 |
| --- | --- | --- | --- |
| str | str | str | f64 |
| "room_detection…" | "gpt4" | "default" | 1.0 |
| "clip_through_d…" | "gpt4" | "default" | 0.653333 |
| "room_detection…" | "clip" | "logit" | 0.699662 |

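For intuition about the `f1` column: macro-averaged F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it by hand in plain Python; `macro_f1` and the example labels are hypothetical and serve only as an illustration, while the notebook delegates the real computation to `skm.f1_score(..., average="macro")`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # Every label occurs at least once, so the denominator is never zero
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Hypothetical labels, not taken from the experiment
y_true = ["bathroom", "bathroom", "kitchen", "kitchen"]
y_pred = ["bathroom", "kitchen", "kitchen", "kitchen"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Here "bathroom" scores 2/3 and "kitchen" scores 4/5, so the macro average is their plain mean. Because every class counts equally, a rare class that is predicted poorly pulls the score down as much as a common one.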

Now that we have a table with metrics, we can plot it. We use `altair` because it allows for the type of interactivity we need later.

In [56]:
def plot_metric(metric_name):
    return (
        alt.Chart(metrics.to_pandas())
        .mark_bar(width=10)
        .encode(
            x="reward",
            y=metric_name,
            color="model",
            tooltip=[metric_name, "model", "reward"],
        )
        .facet(column="model", row="task")
        .properties(title=metric_name)
    )


plot_metric("f1")

We can also recreate the plot above with the metrics computed separately for each value of a metadata column, e.g. `is_photorealistic`. For brevity, we only plot the best-performing reward for each task and model.

In [91]:
metrics_per_photorealistic = predictions.group_by(
    "task", "model", "reward", "is_photorealistic"
).agg(
    f1=compute_metric(f1),
    # ...add more here!
)

best_models = metrics.group_by("task", "model").agg(
    pl.col("reward").sort_by("f1", descending=True).first()
)

# The metrics_per_photorealistic table filtered to only contain the best models
# i.e. exactly one reward per task+model+is_photorealistic
metrics_per_photorealistic = metrics_per_photorealistic.join(
    best_models, on=["task", "model", "reward"], how="semi"
).with_columns(evaluator=pl.concat_str("model", "reward", separator=" + "))

(
    alt.Chart(metrics_per_photorealistic.to_pandas())
    .mark_bar(width=10)
    .encode(
        x="f1",
        y="evaluator",
        color="model",
        tooltip=["f1", "model", "reward"],
    )
    .properties(title="F1 score")
    .facet(row="task", column="is_photorealistic")
    .resolve_scale(y="independent")
)
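
The filtering in the cell above has two steps: pick the best reward per task and model, then keep only the matching rows via a semi-join. A plain-Python sketch of that logic follows; all rows and values are made up, and the reward name `other` is hypothetical.

```python
# Hypothetical metrics rows: one per (task, model, reward); values are made up
metrics_rows = [
    {"task": "room_detection", "model": "clip", "reward": "logit", "f1": 0.70},
    {"task": "room_detection", "model": "clip", "reward": "other", "f1": 0.55},
    {"task": "room_detection", "model": "gpt4", "reward": "default", "f1": 1.00},
]

# Hypothetical per-split rows: the same combinations, split by is_photorealistic
split_rows = [
    {
        "task": row["task"],
        "model": row["model"],
        "reward": row["reward"],
        "is_photorealistic": flag,
    }
    for row in metrics_rows
    for flag in (True, False)
]

# Step 1: for each (task, model), remember the reward with the highest f1
best = {}
for row in metrics_rows:
    key = (row["task"], row["model"])
    if key not in best or row["f1"] > best[key]["f1"]:
        best[key] = row
best_keys = {(r["task"], r["model"], r["reward"]) for r in best.values()}

# Step 2: semi-join, i.e. keep rows whose key is in best_keys without adding columns
filtered = [
    r for r in split_rows if (r["task"], r["model"], r["reward"]) in best_keys
]

# Step 3: build the "evaluator" label that is shown on the y axis
for r in filtered:
    r["evaluator"] = f'{r["model"]} + {r["reward"]}'

print(sorted({r["evaluator"] for r in filtered}))
```

Note that the best reward is chosen from the overall `metrics` table, not per split, so the same reward is shown for both values of `is_photorealistic`.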

Most plotting needs will probably be covered by the above, or by small variations of it. Below is the interactive confusion matrix; you don't need to understand its code in detail to use it.

In [7]:
def chart(task, model):
    # A table of the best model+reward combination for each model and task
    best_models = metrics.group_by("task", "model").agg(
        pl.col("reward").sort_by("f1", descending=True).first()
    )

    # The predictions table filtered to only contain the best models
    # i.e. exactly one model+reward per task
    best_model_predicitons = predictions.filter(
        pl.col("task") == task, pl.col("model") == model
    ).join(best_models, on=["task", "model", "reward"], how="semi")

    true_label_size = best_model_predicitons.group_by(
        "model", "reward", "true_label"
    ).agg(pl.len().alias("true_label_size"))

    # A normal confusion matrix
    confusion_matrix = (
        best_model_predicitons.join(
            true_label_size, on=["model", "reward", "true_label"]
        )
        .group_by("model", "reward", "true_label", "predicted_label")
        .agg(count=pl.len(), ratio=pl.len() / pl.col("true_label_size").first())
    )

    # Needed to register the click events
    selection = alt.selection_point(
        fields=["true_label", "predicted_label"], name="selection"
    )

    # Base chart to which we'll add layers later
    base = (
        alt.Chart(confusion_matrix.to_pandas())
        .encode(
            x="predicted_label",
            y="true_label",
        )
        .properties(title=f"{model}, {task}")
    )

    # Heatmap layer
    heatmap = base.mark_rect().encode(
        color=alt.Color("ratio").scale(scheme="blues"),
        tooltip=["true_label", "predicted_label", "count", "ratio"],
    )

    # Diagonal frames layer
    labels = confusion_matrix["true_label"].unique()
    diag_df = pl.DataFrame({"predicted_label": labels, "true_label": labels})
    diagonal = (
        alt.Chart(diag_df.to_pandas())
        .mark_rect(stroke="black", strokeWidth=1, fillOpacity=0)
        .encode(x="predicted_label", y="true_label")
    )

    # Text labels in cells
    text = base.mark_text(baseline="middle").encode(
        alt.Text("ratio", format=".1~f"),
        color=alt.condition(
            alt.datum.ratio < 0.5, alt.value("black"), alt.value("white")
        ),
    )

    # Add the layers together and also add the click-selector from earlier
    # Returning this would give us a normal chart, like the one above
    chart = (heatmap + diagonal + text).add_params(selection)

    # Wrap the chart in a Jupyter widget
    jchart = alt.JupyterChart(chart)

    # This is the vertical box the videos will live in
    videos_widget = VBox()

    # Click callback
    def on_select(change):
        if change.new.value is None:
            return

        paths = []

        for sel in change.new.value:
            # Get a list of videos that correspond to the cell that was clicked on
            paths.extend(
                best_model_predicitons.filter(
                    pl.col("model") == model,
                    pl.col("task") == task,
                    pl.col("true_label") == sel["true_label"],
                    pl.col("predicted_label") == sel["predicted_label"],
                )
                .get_column("path")
                .to_list()
            )
        # Load the videos based on the paths above, and put them into a flexbox
        videos = []
        for path in paths:
            video = Video.from_file(path)
            video.autoplay = True
            video.loop = True
            videos.append(VBox([video, Label("👆 " + path)]))

        videos_widget.children = videos

    # Whenever the selection in the chart changes, call the callback above
    jchart.selections.observe(on_select, ["selection"])

    return HBox([jchart, videos_widget])
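
The `ratio` shown in each cell of the confusion matrix is the cell count divided by the number of examples with that true label. A minimal plain-Python sketch of this row normalization, using made-up label pairs, could look like this:

```python
from collections import Counter

# Hypothetical (true_label, predicted_label) pairs for one model and task
pairs = [
    ("bathroom", "bathroom"),
    ("bathroom", "bathroom"),
    ("bathroom", "kitchen"),
    ("kitchen", "kitchen"),
]

# Number of examples per cell and per true label
counts = Counter(pairs)
true_label_size = Counter(true for true, _ in pairs)

# ratio = cell count / number of examples with that true label,
# so the ratios within each true_label row sum to 1
confusion = {
    (true, pred): {"count": n, "ratio": n / true_label_size[true]}
    for (true, pred), n in counts.items()
}

for cell, values in sorted(confusion.items()):
    print(cell, values)
```

With this normalization, the diagonal cells read as per-class recall, which is why the chart frames them.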

When you run the cell below, you can pick the model and task combination and also click the cells in the matrix to see which videos ended up in them.

In [11]:
# Create a task selection dropdown
tasks = predictions.get_column("task").unique().sort()
task_dropdown = Dropdown(
    options=tasks,
    value=tasks[0],
    description="Task:",
)

# Create a model selection dropdown
models = predictions.get_column("model").unique().sort()
model_dropdown = Dropdown(
    options=models,
    value=models[0],
    description="Model:",
)

# Create an "output", a sort of a canvas that we can render things into
# This is needed for the live updates whenever the dropdowns change
output = Output()


def on_change(_change):
    with output:
        # Clear the canvas and render the new plot
        clear_output()
        display(chart(task_dropdown.value, model_dropdown.value))


model_dropdown.observe(on_change, names=["value"])
task_dropdown.observe(on_change, names=["value"])

# Render the dropdowns in Jupyter
display(VBox([task_dropdown, model_dropdown]))

with output:
    # Render the chart into the output
    display(chart(task_dropdown.value, model_dropdown.value))

# Render the output in Jupyter
output

In [12]:
EDIT:
/data/datasets/habitat_recordings/2024-03-06/dining_room/dining_room_2.mp4
/data/datasets/habitat_recordings/2024-03-06/kitchen/kitchen_5.mp4
/data/datasets/habitat_recordings/2024-03-06/hall/hall_5.mp4
/data/datasets/habitat_recordings/2024-03-11/2024-03-11_8.mp4
/data/datasets/habitat_recordings/2024-03-06/office/office_2.mp4
/data/datasets/habitat_recordings/2024-03-06/stairs/stairs_7.mp4
/data/datasets/habitat_recordings/2024-03-11/2024-03-11_40.mp4
