
# Keyword spotting with DTW prototypes

This notebook loads the pre-computed training and validation features and picks
a Dynamic Time Warping (DTW) prototype for each keyword. The best-matching
validation instance for every keyword is exported to `submission.csv` so the
results can be submitted to Kaggle.


In [None]:
import pickle
from pathlib import Path
import sys

import numpy as np
import pandas as pd

# Make sure local helpers are on the path regardless of launch directory
if '__file__' in globals():
    NOTEBOOK_DIR = Path(__file__).resolve().parent
else:
    NOTEBOOK_DIR = Path.cwd()
sys.path.append(str(NOTEBOOK_DIR))

from keyword_search import (
    build_submission,
    keyword_medoid,
    keyword_instances,
    resample_features,
)

DATA_DIR = NOTEBOOK_DIR
with open(DATA_DIR / "train_features" / "train_db.pkl", "rb") as f:
    train_db = pickle.load(f)
with open(DATA_DIR / "validation_features" / "validation_db.pkl", "rb") as f:
    validation_db = pickle.load(f)

# One keyword per line; skip any blank lines
keywords = [line.strip() for line in (DATA_DIR / "keywords.tsv").read_text().splitlines() if line.strip()]
print(f"Loaded {len(train_db)} train entries, {len(validation_db)} validation entries, {len(keywords)} keywords")



## Prototype selection

For each keyword we pick the *medoid*: the training instance whose average DTW
distance to all other instances of the same keyword is smallest. A narrow
Sakoe–Chiba band (set by `win_size`) limits how far the warping path may stray
from the diagonal, which cuts computation and discourages implausible
alignments.
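
To make the medoid idea concrete, here is a minimal pure-Python sketch. It is not the code behind `keyword_medoid`: the names `dtw_distance` and `medoid_index`, the Euclidean frame cost, the length normalisation, and the reading of `win_size` as a fraction of the longer sequence are all assumptions made for illustration.

```python
import math


def dtw_distance(a, b, win_size=0.1):
    """DTW between two sequences of feature vectors inside a Sakoe-Chiba band.

    Assumption: win_size is the band half-width as a fraction of the longer
    sequence. The real helper may define it differently.
    """
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide, otherwise the end cell is unreachable.
    radius = max(int(win_size * max(n, m)), abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only visit cells with |i - j| <= radius.
        for j in range(max(1, i - radius), min(m, i + radius) + 1):
            frame_cost = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = frame_cost + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    # Normalise by total length so long and short words are comparable (assumption).
    return cost[n][m] / (n + m)


def medoid_index(instances, win_size=0.1):
    """Return (index of the medoid, full pairwise distance matrix)."""
    k = len(instances)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(instances[i], instances[j], win_size)
    totals = [sum(row) for row in dist]
    # Smallest summed distance is equivalent to smallest average distance.
    return min(range(k), key=totals.__getitem__), dist


# Toy data: three "utterances" of one-dimensional frames (made up for the example).
toy = [
    [(0.0,), (1.0,), (2.0,)],
    [(0.0,), (1.0,), (3.0,)],
    [(9.0,), (9.0,), (9.0,)],
]
best, dist = medoid_index(toy)
print(f"medoid is instance {best}")
```

Widening the band lets the path absorb larger tempo differences at the cost of more computation, so `win_size` is worth tuning on held-out data.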


In [None]:
win_size = 0.1
prototypes = {}
medoid_stats = []

for word in keywords:
    prototype, dist_matrix = keyword_medoid(train_db, word, win_size=win_size)
    prototypes[word] = prototype
    medoid_stats.append({
        "keyword": word,
        "support": prototype.support,
        "mean_intra_distance": float(dist_matrix.mean()) if dist_matrix.size > 1 else 0.0,
    })

medoid_df = pd.DataFrame(medoid_stats).sort_values("support", ascending=False)
medoid_df.head()



## Search the validation set

With the prototypes prepared, we search the validation set and keep the best
match for each keyword. The helper returns the ranked results, but only the top
match is needed for the submission file.
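
The ranking step can be sketched roughly as follows. This is an illustration only, not the body of `build_submission`: `rank_matches`, `toy_distance`, and the location identifiers are hypothetical, and the toy distance merely stands in for DTW.

```python
import heapq


def rank_matches(prototype, candidates, distance, top_k=3):
    """Return up to top_k (distance, location) pairs, smallest distance first."""
    scored = ((distance(prototype, feats), loc) for loc, feats in candidates.items())
    return heapq.nsmallest(top_k, scored)


def toy_distance(a, b):
    # Stand-in for DTW: mean absolute difference over paired frames.
    return sum(abs(x - y) for x, y in zip(a, b)) / min(len(a), len(b))


# Hypothetical validation entries keyed by location identifier.
candidates = {
    "loc-a": [1.0, 2.0, 3.0],
    "loc-b": [1.0, 2.0, 4.0],
    "loc-c": [8.0, 8.0, 8.0],
}
ranked = rank_matches([1.0, 2.0, 3.2], candidates, toy_distance)
best_distance, best_location = ranked[0]  # only the top match goes into the submission
print(best_location, round(best_distance, 3))
```

Keeping the full ranked list around is useful for inspecting near misses, even though only the first entry is exported.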


In [None]:
# Note: this rebinds `prototypes` to the value returned by the helper.
prototypes, predictions = build_submission(train_db, validation_db, keywords, win_size=win_size)
print(f"Collected {len(predictions)} predictions")



## Quick accuracy check

This coarse check reports how many of the best matches have the correct keyword
label in the validation annotations. It is not the Kaggle score, but it is a
useful sanity check while iterating locally.


In [None]:
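# Compare each best match's label with the keyword it was retrieved for.
# Assumes `predictions` is in the same order as `keywords`.
correct = sum(
    1 for (dist, loc, word, _), keyword in zip(predictions, keywords) if word == keyword
)
accuracy = correct / len(predictions) if predictions else 0.0
print(f"Top-1 keyword match rate: {accuracy:.3f}")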



## Create `submission.csv`

The competition expects three columns: `location`, `word`, and `distance`. The
`distance` column contains the DTW score for the best match of that keyword in
the validation set.
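
Before writing the file, a quick structural check can catch obvious mistakes. The sketch below is an optional sanity check, not part of the competition tooling: `check_submission` is a hypothetical helper, the sample rows are made up, and the one-row-per-keyword rule is an assumption based on the description above.

```python
import math

REQUIRED_COLUMNS = ("location", "word", "distance")


def check_submission(rows, keywords):
    """Return a list of problems found; an empty list means the rows look fine."""
    problems = []
    for i, row in enumerate(rows):
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
            continue
        if not math.isfinite(row["distance"]):
            problems.append(f"row {i}: non-finite distance for {row['word']!r}")
    # Assumption: exactly one row is expected per keyword.
    words = [row.get("word") for row in rows]
    for keyword in keywords:
        count = words.count(keyword)
        if count != 1:
            problems.append(f"keyword {keyword!r} appears {count} times")
    return problems


# Made-up rows, purely to show the expected shape.
sample_rows = [
    {"location": "loc-a", "word": "alpha", "distance": 0.12},
    {"location": "loc-b", "word": "beta", "distance": 0.34},
]
print(check_submission(sample_rows, ["alpha", "beta"]))
```

In the notebook, the same function could be called on `submission_rows` and `keywords` right before `to_csv`.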


In [None]:
submission_rows = []
for (dist, loc, word, _), keyword in zip(predictions, keywords):
    submission_rows.append({
        "location": loc,
        "word": keyword,
        "distance": float(dist),
    })

submission = pd.DataFrame(submission_rows)
output_path = DATA_DIR / "submission.csv"
submission.to_csv(output_path, index=False)
output_path
