# KWS MODEL

## **STEP 1 — DATA INGESTION**

**PURPOSE:** Load audio + annotation pairs that share the same base filename.

**Input:**

* `data/raw/audio/*.wav`
* `data/raw/annotations/*_Annotated.txt`

**Process:**

* Match `sample_id.wav` ↔ `sample_id_Annotated.txt`
* Build sample list from `data/splits/{train,val,test}.txt`

**Code written in:**

* `src/data/audio_loader.py`
* `src/data/annotation_loader.py`

**Output (artifact):**

* In-memory sample index (IDs ready for preprocessing)

**Validation check:**

* For every ID in the split file → both the audio file and the annotation file exist
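
The matching and the existence check above can be sketched as follows. This is a minimal illustration, assuming each split file lists one sample ID per line; the function name `build_sample_index` and the returned dict layout are hypothetical, not taken from `audio_loader.py` or `annotation_loader.py`.

```python
from pathlib import Path


def build_sample_index(split_file, audio_dir, annotation_dir):
    """Return (samples, missing) for the IDs listed in one split file.

    samples: list of dicts holding the sample ID and its two file paths.
    missing: IDs whose audio file or annotation file was not found.
    """
    audio_dir, annotation_dir = Path(audio_dir), Path(annotation_dir)
    samples, missing = [], []
    for line in Path(split_file).read_text().splitlines():
        sample_id = line.strip()
        if not sample_id:
            continue  # skip blank lines in the split file
        wav = audio_dir / f"{sample_id}.wav"
        txt = annotation_dir / f"{sample_id}_Annotated.txt"
        if wav.is_file() and txt.is_file():
            samples.append({"id": sample_id, "audio": wav, "annotation": txt})
        else:
            missing.append(sample_id)
    return samples, missing
```

An empty `missing` list corresponds to the validation check passing for that split.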

---

## **STEP 2 — DATA VALIDATION**

**PURPOSE:** Remove invalid samples using strict annotation rule.

**Input:**

* `data/raw/annotations/*_Annotated.txt`

**Process:**

* If the annotation contains `#` → discard the sample

**Code written in:**

* `src/data/clean.py`

**Output (artifact):**

* Clean sample list (valid IDs only)

**Validation check:**

* Counts of valid vs. invalid samples are printed in the logs
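
A minimal sketch of the `#` rule, assuming the check is a plain substring test over the whole annotation file. The function names are hypothetical and not taken from `clean.py`.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def is_valid_annotation(text):
    """A sample is kept only if its annotation text contains no '#'."""
    return "#" not in text


def filter_valid_samples(sample_ids, annotation_dir):
    """Split sample IDs into (valid, invalid) and log both counts."""
    annotation_dir = Path(annotation_dir)
    valid, invalid = [], []
    for sample_id in sample_ids:
        text = (annotation_dir / f"{sample_id}_Annotated.txt").read_text()
        if is_valid_annotation(text):
            valid.append(sample_id)
        else:
            invalid.append(sample_id)
    logger.info("valid samples: %d, invalid samples: %d", len(valid), len(invalid))
    return valid, invalid
```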

---

## **STEP 3 — FEATURE EXTRACTION**

**PURPOSE:** Convert audio into log-mel features for Transformer input.

**Input:**

* `data/raw/audio/*.wav`

**Process:**

* Load waveform
* Extract **log-mel spectrogram**
* Normalize features

**Code written in:**

* `src/data/feature_extraction.py`

**Output (artifact):**

* `data/processed/features/{sample_id}.npy`  *(shape: T×80 float32)*

**Validation check:**

* `.npy` loads successfully and has expected shape `(T, 80)`
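
This document does not name a feature library or fix the analysis parameters, so the following is only a NumPy sketch of log-mel extraction. The 16 kHz sample rate, 400-sample window, 160-sample hop, HTK-style mel scale, and per-utterance mean/variance normalization are all assumptions; an audio library would normally replace most of this code.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(sample_rate, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank


def log_mel(waveform, sample_rate=16000, n_fft=400, hop_length=160, n_mels=80):
    """Return normalized log-mel features, shape (T, n_mels), float32."""
    waveform = np.asarray(waveform, dtype=np.float64)
    if len(waveform) < n_fft:
        waveform = np.pad(waveform, (0, n_fft - len(waveform)))
    n_frames = 1 + (len(waveform) - n_fft) // hop_length
    # Row t of idx selects the samples of frame t.
    idx = np.arange(n_fft)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = waveform[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2             # (T, n_fft // 2 + 1)
    mel = power @ mel_filterbank(sample_rate, n_fft, n_mels).T   # (T, n_mels)
    feats = np.log(mel + 1e-10)
    # Per-utterance normalization of each mel bin (an assumed scheme).
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-5)
    return feats.astype(np.float32)
```

The result can be written with `np.save` to `data/processed/features/{sample_id}.npy`, and the validation check reduces to `np.load(path).shape[1] == 80`.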

---

## **STEP 4 — FRAME LABEL GENERATION (FROM ANNOTATIONS)**

**PURPOSE:** Convert word timestamps into **frame-wise labels** for supervision.

**Input:**

* `data/raw/annotations/{sample_id}_Annotated.txt` *(Audacity format: start, end, word)*
* `data/processed/features/{sample_id}.npy` *(to get total frames T)*

**Process:**

* Parse each line: `start_time  end_time  WORD`
* Convert seconds → integer frame index (rounded down) using:
  `frame = floor(time * sample_rate / hop_length)`
* Create label array length `T`:

  * `BLANK` for non-word frames
  * `WORD_ID` for frames inside word segments

**Code written in:**

* `src/data/label_builder.py`

**Output (artifacts):**

* `data/processed/frame_labels/{sample_id}.npy` *(shape: T, int64)*
* `data/processed/frame_labels/label_map.json` *(word → id mapping)*

**Validation check:**

* `len(frame_labels) == T` (matches feature frames exactly)
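
The parsing and frame conversion can be sketched as below. Several details are assumptions rather than facts from `label_builder.py`: id `0` is reserved for `BLANK`, word ids are assigned in sorted order starting at 1, the end boundary is rounded up so a partially covered last frame still receives the word label, and the defaults `sample_rate=16000` and `hop_length=160` are placeholders.

```python
import numpy as np

BLANK_ID = 0  # assumption: id 0 is reserved for BLANK


def parse_annotation(text):
    """Parse Audacity label lines 'start end WORD' into (start, end, word) tuples."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        segments.append((float(parts[0]), float(parts[1]), " ".join(parts[2:])))
    return segments


def build_label_map(words):
    """Map each distinct word to an integer id, starting at 1."""
    return {word: i for i, word in enumerate(sorted(set(words)), start=1)}


def build_frame_labels(segments, num_frames, label_map,
                       sample_rate=16000, hop_length=160):
    """Return an int64 array of length num_frames with one label per frame."""
    labels = np.full(num_frames, BLANK_ID, dtype=np.int64)
    for start, end, word in segments:
        first = int(start * sample_rate / hop_length)          # round down
        last = int(np.ceil(end * sample_rate / hop_length))    # round up, exclusive
        first, last = max(first, 0), min(last, num_frames)     # clamp to [0, T]
        labels[first:last] = label_map[word]
    return labels
```

Passing `num_frames = features.shape[0]` from the Step 3 array makes `len(frame_labels) == T` hold by construction.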

---

## **STEP 5 — DATASET PREPARATION**

**PURPOSE:** Load features + frame labels in a training-ready PyTorch format.

**Input:**

* `data/processed/features/*.npy`
* `data/processed/frame_labels/*.npy`
* `data/splits/{train,val,test}.txt`

**Process:**

* Load `(features, frame_labels)` per sample
* Pad sequences in batch
* Create masks and lengths

**Code written in:**

* `src/data/dataset.py`

**Output (artifact):**

* Dataloader batches:

  * `x: (B, T, 80)`
  * `y: (B, T)`
  * `lengths: (B,)`

**Validation check:**

* One batch prints correct shapes and no mismatch errors
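
A possible shape for the dataset and collate function is sketched here. The class and function names are hypothetical, and padding labels with `-100` is an assumption chosen because it matches the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
from pathlib import Path

import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset

PAD_LABEL = -100  # assumption: label value written into padded frames


class FrameLabelDataset(Dataset):
    """Loads (features, frame_labels) pairs for the IDs in one split file."""

    def __init__(self, split_file, feature_dir, label_dir):
        lines = Path(split_file).read_text().splitlines()
        self.ids = [line.strip() for line in lines if line.strip()]
        self.feature_dir, self.label_dir = Path(feature_dir), Path(label_dir)

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, index):
        sample_id = self.ids[index]
        x = torch.from_numpy(np.load(self.feature_dir / f"{sample_id}.npy"))  # (T, 80)
        y = torch.from_numpy(np.load(self.label_dir / f"{sample_id}.npy"))    # (T,)
        return x, y


def collate_batch(batch):
    """Pad a list of (x, y) pairs to the longest sequence in the batch."""
    feats, labels = zip(*batch)
    lengths = torch.tensor([f.shape[0] for f in feats], dtype=torch.long)
    x = pad_sequence(feats, batch_first=True)                            # (B, T, 80)
    y = pad_sequence(labels, batch_first=True, padding_value=PAD_LABEL)  # (B, T)
    return x, y, lengths


def make_mask(lengths, max_len):
    """Boolean mask of shape (B, max_len), True on real (unpadded) frames."""
    return torch.arange(max_len)[None, :] < lengths[:, None]
```

Passing `collate_fn=collate_batch` to a `torch.utils.data.DataLoader` yields batches with the shapes listed above.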

---

## **STEP 6 — MODEL ARCHITECTURE**

**PURPOSE:** Build model that predicts a class for every frame.

**Input:**

* `x: (B, T, 80)` features
* `num_classes = 1 + word_vocab_size` *(the extra class is `BLANK`)*

**Process:**

* Transformer Encoder processes frames
* Linear head outputs logits per frame

**Code written in:**

* `src/model/transformer_encoder.py`
* `src/model/frame_classifier.py`
* `src/model/model.py`

**Output (artifact):**

* Logits: `logits: (B, T, num_classes)`

**Validation check:**

* Forward pass works on one batch without crashing
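
One way the encoder and the linear head could fit together is sketched below using PyTorch's built-in Transformer encoder. The layer sizes (`d_model`, `nhead`, `num_layers`) and the class name are hypothetical, and positional encoding is left out for brevity even though a real model would normally add it.

```python
import torch
from torch import nn


class FrameClassifier(nn.Module):
    """Transformer encoder followed by a per-frame linear classifier."""

    def __init__(self, num_classes, feat_dim=80, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x, lengths=None):
        # x: (B, T, feat_dim); lengths: (B,) number of real frames per sample.
        padding_mask = None
        if lengths is not None:
            # True marks padded positions that attention should ignore.
            positions = torch.arange(x.shape[1], device=x.device)
            padding_mask = positions[None, :] >= lengths[:, None]
        hidden = self.encoder(self.input_proj(x), src_key_padding_mask=padding_mask)
        return self.head(hidden)  # (B, T, num_classes)
```

The validation check then amounts to calling the model on one batch and confirming that the output shape is `(B, T, num_classes)`.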

---

## **STEP 7 — TRAINING (FRAME SUPERVISED)**

**PURPOSE:** Train Transformer to classify frame labels using annotations.

**Input:**

* Training DataLoader from Step 5
* Model from Step 6

**Process:**

* Compute logits
* Compute **CrossEntropyLoss** (frame-wise, skipping padded frames)
* Backprop + optimizer step
* Save checkpoint per epoch

**Code written in:**

* `src/training/train.py`
* `src/training/optimizer.py`
* `src/training/scheduler.py`

**Output (artifacts):**

* `outputs/checkpoints/model_epoch_*.pt`
* `outputs/checkpoints/logs/train.log`

**Validation check:**

* Training loss decreases across epochs
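
The training loop can be sketched as follows. It assumes that each batch is a tuple `(x, y, lengths)`, that padded label positions hold `-100` so the loss can skip them, and that `model(x)` returns logits of shape `(B, T, num_classes)`. The checkpoint dictionary keys are hypothetical.

```python
from pathlib import Path

import torch
from torch import nn


def train_one_epoch(model, loader, optimizer, pad_label=-100):
    """Run one epoch and return the mean loss per real (unpadded) frame."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_label)
    model.train()
    total_loss, total_frames = 0.0, 0
    for x, y, lengths in loader:
        logits = model(x)  # (B, T, num_classes)
        # Flatten batch and time so every frame is one classification example.
        loss = criterion(logits.reshape(-1, logits.shape[-1]), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        n_frames = int((y != pad_label).sum())
        total_loss += loss.item() * n_frames
        total_frames += n_frames
    return total_loss / max(total_frames, 1)


def save_checkpoint(model, optimizer, epoch, checkpoint_dir="outputs/checkpoints"):
    """Write model and optimizer state to model_epoch_{epoch}.pt."""
    path = Path(checkpoint_dir) / f"model_epoch_{epoch}.pt"
    path.parent.mkdir(parents=True, exist_ok=True)
    torch.save(
        {"epoch": epoch, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        path,
    )
    return path
```

Logging the value returned by `train_one_epoch` after every epoch gives the loss curve that the validation check refers to.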

---

## **STEP 8 — INFERENCE (FRAME PREDICTION)**

**PURPOSE:** Predict frame labels from a trained model.

**Input:**

* `data/processed/features/{sample_id}.npy`
* `outputs/checkpoints/model_epoch_*.pt`

**Process:**

* Run model forward
* Get `argmax` class per frame

**Code written in:**

* `src/inference/predict_frames.py`

**Output (artifact):**

* `outputs/predictions/frame_preds_{sample_id}.npy` *(shape: T)*

**Validation check:**

* Prediction length matches feature frames `T`
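
Frame prediction reduces to a forward pass followed by an `argmax` over the class axis. The sketch below assumes `model` is any `torch.nn.Module` that maps `(1, T, 80)` features to `(1, T, num_classes)` logits; the function name is hypothetical.

```python
import numpy as np
import torch


@torch.no_grad()
def predict_frames(model, features):
    """features: (T, 80) float32 array. Returns a (T,) int64 array of class ids."""
    model.eval()
    x = torch.from_numpy(features).unsqueeze(0)  # add batch axis: (1, T, 80)
    logits = model(x)                            # (1, T, num_classes)
    return logits.argmax(dim=-1).squeeze(0).cpu().numpy()
```

The model weights would first be restored from one of the `model_epoch_*.pt` checkpoints; the checkpoint layout is not specified in this document, so that step is omitted. The returned array can be saved with `np.save`, and the validation check is `len(preds) == features.shape[0]`.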

---

## **STEP 9 — WORD TIMESTAMP EXTRACTION**

**PURPOSE:** Convert frame predictions into word start/end timestamps.

**Input:**

* `frame_preds_{sample_id}.npy`
* hop_length, sample_rate

**Process:**

* Merge consecutive same-label frames
* Ignore BLANK segments
* Convert frames → seconds

**Code written in:**

* `src/inference/word_timestamp_extractor.py`
* `src/inference/timestamp_extractor.py`

**Output (artifact):**

* `outputs/predictions/aligned_words.json`
  Format:

  ```json
  [{"word":"CLOSE","start_time":0.28,"end_time":0.76}]
  ```

**Validation check:**

* Start/end times are within audio duration
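
The merge step is a run-length grouping over the predicted labels. The sketch below assumes `BLANK` has id `0`, uses placeholder defaults for `sample_rate` and `hop_length`, and rounds times to milliseconds; `id_to_word` is the inverse of the Step 4 label map.

```python
def frames_to_words(frame_preds, id_to_word, sample_rate=16000,
                    hop_length=160, blank_id=0):
    """Merge runs of identical non-BLANK labels into word segments."""
    seconds_per_frame = hop_length / sample_rate
    preds = [int(p) for p in frame_preds]
    words = []
    run_start = 0
    for i in range(1, len(preds) + 1):
        # A run ends at the end of the sequence or when the label changes.
        if i == len(preds) or preds[i] != preds[run_start]:
            label = preds[run_start]
            if label != blank_id:
                words.append({
                    "word": id_to_word[label],
                    "start_time": round(run_start * seconds_per_frame, 3),
                    "end_time": round(i * seconds_per_frame, 3),
                })
            run_start = i
    return words
```

Serializing the returned list with `json.dump` produces the format shown above. Because the last possible end time is `T * hop_length / sample_rate`, segments stay within the span covered by the feature frames.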

---

## **STEP 10 — EVALUATION**

**PURPOSE:** Measure timestamp accuracy against ground-truth annotations.

**Input:**

* `outputs/predictions/aligned_words.json`
* `data/raw/annotations/{sample_id}_Annotated.txt`

**Process:**

* Compare predicted vs true word timestamps
* Compute mean/median deviation

**Code written in:**

* `src/evaluation/alignment_metrics.py`

**Output (artifact):**

* `outputs/predictions/eval_report.json` *(or printed report)*

**Validation check:**

* Metrics run without errors and produce numeric results
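
A minimal sketch of the deviation metrics follows. It assumes the reference annotations have already been parsed into the same dict format as `aligned_words.json`, and that predicted and reference words are paired in order, which is a simplification: pairs whose words differ are skipped rather than re-aligned. The function names are hypothetical.

```python
from statistics import mean, median


def boundary_deviations(predicted, reference):
    """Return absolute start and end errors (seconds) for in-order word pairs."""
    start_err, end_err = [], []
    for pred, ref in zip(predicted, reference):
        if pred["word"] != ref["word"]:
            continue  # naive pairing: skip mismatched words
        start_err.append(abs(pred["start_time"] - ref["start_time"]))
        end_err.append(abs(pred["end_time"] - ref["end_time"]))
    return start_err, end_err


def summarize(errors):
    """Mean and median of a list of errors, or None values if it is empty."""
    if not errors:
        return {"count": 0, "mean": None, "median": None}
    return {"count": len(errors), "mean": mean(errors), "median": median(errors)}


def evaluate_alignment(predicted, reference):
    start_err, end_err = boundary_deviations(predicted, reference)
    return {"start": summarize(start_err), "end": summarize(end_err)}
```

The returned dictionary is JSON-serializable, so it can be written directly to `eval_report.json`.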

---

## ✅ END-TO-END EXECUTION (Scripts)

* `scripts/preprocess.py` → runs Steps **1–4**
* `scripts/train.py` → runs Steps **5–7**
* `scripts/infer_alignment.py` → runs Steps **8–9**
* `scripts/evaluate.py` → runs Step **10**

---
