# Assignment 1: Dynamic Time Warping

---

## Task 4) Isolated Word Recognition

Due to the relatively high sampling rate (e.g. 8 kHz), performing [DTW](https://en.wikipedia.org/wiki/Dynamic_time_warping) on the raw audio signal is not advised (feel free to try!).
A better solution is to compute a set of features; here we will extract [mel-frequency cepstral coefficients](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum) (MFCCs) over windows of 25 ms length, shifted by 10 ms.
The recommended implementation is [librosa](https://librosa.org/doc/main/generated/librosa.feature.mfcc.html).
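
As a rough sketch of what these window settings mean in samples: the helper names `frame_params` and `num_frames` below are illustrative and not part of librosa. Assuming the recordings are kept at their 8 kHz rate (for instance by loading with `sr=None`), the resulting values could be passed to `librosa.feature.mfcc` as `n_fft` and `hop_length`. The frame count assumes no padding, whereas librosa pads the signal by default.

```python
def frame_params(sr: int, win_ms: float = 25.0, hop_ms: float = 10.0) -> tuple[int, int]:
    """Convert window length and shift from milliseconds to samples."""
    win = int(round(sr * win_ms / 1000))
    hop = int(round(sr * hop_ms / 1000))
    return win, hop


def num_frames(n_samples: int, win: int, hop: int) -> int:
    """Number of full windows that fit into the signal (no padding assumed)."""
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


win, hop = frame_params(8000)
print(win, hop)                     # 200 80
print(num_frames(8000, win, hop))   # 98 frames for one second of audio
```

One second of 8 kHz audio thus shrinks from 8000 raw samples to roughly a hundred feature vectors, which is what makes the quadratic DTW table affordable.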

### Data

Download Zohar Jackson's [free spoken digit dataset](https://github.com/Jakobovski/free-spoken-digit-dataset).
There's no need to clone; feel free to use a tagged release such as [v1.0.10](https://github.com/Jakobovski/free-spoken-digit-dataset/archive/refs/tags/v1.0.10.tar.gz).
The file naming convention is simple (`{digitLabel}_{speakerName}_{index}.wav`); let's restrict to two speakers, e.g. `jackson` and `george`.
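
A minimal sketch of parsing that naming convention with the standard library is shown here. The names `RecordingName` and `parse_name` are hypothetical helpers, and the pattern assumes that speaker names contain no underscore.

```python
import re
from typing import NamedTuple, Optional

# {digitLabel}_{speakerName}_{index}.wav, with the dot escaped
FILENAME_RE = re.compile(r"^(\d+)_([^_]+)_(\d+)\.wav$")


class RecordingName(NamedTuple):
    digit: int
    speaker: str
    index: int


def parse_name(filename: str) -> Optional[RecordingName]:
    """Return the parsed fields, or None if the name does not follow the convention."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    return RecordingName(int(m.group(1)), m.group(2), int(m.group(3)))


print(parse_name("7_jackson_32.wav"))
print(parse_name("README.md"))
```

Filtering on the parsed speaker before loading any audio avoids computing features for recordings that are not used.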

### Dynamic Time Warping

[DTW](https://en.wikipedia.org/wiki/Dynamic_time_warping) is closely related to the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) and the [Needleman-Wunsch algorithm](https://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm).
The main rationale behind DTW is that the two sequences can be aligned, but their speed and exact realization may vary.
As a consequence, the cost does not depend on an edit operation but on the difference between the aligned observations.
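
To make this concrete, here is a toy sketch on plain number sequences, using the absolute difference as local cost. The function name `dtw_align` and the tie-breaking rule in the backtracking step (diagonal first) are choices made for this illustration only; it is not the solution expected for the task below.

```python
def dtw_align(a: list[float], b: list[float]) -> tuple[float, list[tuple[int, int]]]:
    """Return the accumulated DTW cost and one optimal alignment path."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j]: cheapest cost of aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # difference in observations
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (D[i - 1][j - 1], i - 1, j - 1),  # both sequences advance
            (D[i - 1][j], i - 1, j),          # only a advances
            (D[i][j - 1], i, j - 1),          # only b advances
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return D[n][m], path


cost, path = dtw_align([1, 2, 3], [1, 2, 2, 3])
print(cost)  # 0.0
print(path)  # [(0, 0), (1, 1), (1, 2), (2, 3)]
```

The second sequence dwells on the value 2 for one extra step; the path simply maps both of those steps to the same element of the first sequence, so the slower realization costs nothing.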

---

### Preparation

In [97]:
import numpy as np
import librosa as lr
from typing import Tuple, TypedDict
import tarfile
import os
import requests
import re
from concurrent.futures import ProcessPoolExecutor

In [98]:
### TODO: Read in files, compute MFCC, and organize
### Notice: You can restrict the number to a few files for each speaker-digit

class Audio(TypedDict):
    digitLabel: int
    speakerName: str
    index: int
    mfccs: np.ndarray

# audios: List[Audio] = []

speakers = ["george", "jackson", "yweweler"]

### YOUR CODE HERE

DIGITS_URL = "https://github.com/Jakobovski/free-spoken-digit-dataset/archive/refs/tags/v1.0.10.tar.gz"
DATA_PATH = "data"
DIGITS_TARBALL_PATH = DATA_PATH + "/free-spoken-digit-dataset-1.0.10.tar.gz"
DIGITS_PATH = DATA_PATH + "/free-spoken-digit-dataset-1.0.10"
RECORDINGS_PATH = DIGITS_PATH + "/recordings"

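os.makedirs(DATA_PATH, exist_ok=True)  # open() below fails if the directory is missing
if not os.path.exists(DIGITS_TARBALL_PATH):
    response = requests.get(DIGITS_URL)
    response.raise_for_status()  # fail early instead of saving an error page
    with open(DIGITS_TARBALL_PATH, "wb") as fp:
        fp.write(response.content)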
if not os.path.exists(DIGITS_PATH):
    with tarfile.open(DIGITS_TARBALL_PATH) as tar:
        tar.extractall(DATA_PATH)

audios: dict[str, list[list[Audio]]] = { s: [[] for _ in range(10)] for s in speakers }
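for file in os.listdir(RECORDINGS_PATH):
    # escape the dot so that only real ".wav" files match
    match = re.match(r"^(\d+)_(\w+)_(\d+)\.wav$", file)
    if match is None:
        continue
    digit = int(match.group(1))
    speaker = match.group(2)
    index = int(match.group(3))
    if speaker not in speakers:
        continue  # skip before the expensive load and feature extraction
    # note: librosa defaults are used here (resampling to 22050 Hz,
    # n_fft=2048, hop_length=512), not the 25 ms / 10 ms windows from the task
    signal, sr = lr.load(f"{RECORDINGS_PATH}/{file}")
    mfccs = lr.feature.mfcc(y=signal, sr=sr)
    audios[speaker][digit].append(Audio(
        digitLabel=digit,
        speakerName=speaker,
        index=index,
        mfccs=mfccs
    ))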

for ll in audios.values():
    for l in ll:
        l.sort(key=lambda a: a["index"])

### END YOUR CODE

### Implement Dynamic Time Warping

In [99]:
from math import sqrt
from typing import Callable
import numpy.linalg as la

def dist(x: np.ndarray, y: np.ndarray) -> float:
    """
    Compute the distance between two samples.

    Arguments:
    x: MFCCs of first sample.
    y: MFCCs of second sample.

    Returns the distance as float
    """
    ### YOUR CODE HERE

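    # angular distance for level invariance; clipping guards against rounding
    # errors that push the cosine slightly outside [-1, 1] (arccos would give NaN)
    cos = x @ y.T / (la.norm(x) * la.norm(y))
    return np.arccos(np.clip(cos, -1.0, 1.0)).item()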
    
    ### END YOUR CODE


def dtw(obs1: np.ndarray, obs2: np.ndarray, dist_fn: Callable[[np.ndarray, np.ndarray], float]) -> float:
    """
    Compute the dynamic time warping score between two observations.
    
    Arguments:
    obs1: MFCCs of the first recording, shape (n_mfcc, n_frames).
    obs2: MFCCs of the second recording, shape (n_mfcc, n_frames).
    dist_fn: Distance function applied to a pair of frames.

    Returns the score as float.
    """
    ### YOUR CODE HERE
    
    D = np.full((obs1.shape[1] + 1, obs2.shape[1] + 1), np.inf)
    D[0, 0] = 0
    for i in range(obs1.shape[1]):
        for j in range(obs2.shape[1]):
            D[i+1, j+1] = dist_fn(obs1[:, i], obs2[:, j]) + min(
                D[i, j],
                D[i+1, j],
                D[i, j+1]
            )
    return D[-1, -1] / sqrt(obs1.shape[1] ** 2 + obs2.shape[1] ** 2)

    ### END YOUR CODE

### Experiment 1: DTW scores

For each speaker and digit, select one recording as an observation (obs1) and the others as tests (obs2). How do scores change across speakers and across digits?
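
One way to approach that question is to average the scores separately for the four combinations of same/different digit and same/different speaker. The sketch below assumes results are stored as `(test, reference, score)` tuples whose first two entries carry `digitLabel` and `speakerName` keys; `bucket_scores` is a hypothetical helper and the scores in the toy data are made up.

```python
from collections import defaultdict
from statistics import mean


def bucket_scores(results):
    """Average scores per (same_digit, same_speaker) combination."""
    buckets = defaultdict(list)
    for test_item, ref_item, score in results:
        key = (
            test_item["digitLabel"] == ref_item["digitLabel"],
            test_item["speakerName"] == ref_item["speakerName"],
        )
        buckets[key].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Toy data with invented scores, only to show the shape of the output.
a1 = {"digitLabel": 1, "speakerName": "a"}
a2 = {"digitLabel": 2, "speakerName": "a"}
b1 = {"digitLabel": 1, "speakerName": "b"}
toy = [(a1, a1, 0.1), (a1, b1, 0.2), (a2, a1, 0.3), (a2, a1, 0.5)]

for (same_digit, same_speaker), avg in bucket_scores(toy).items():
    print(f"same digit: {same_digit}, same speaker: {same_speaker}, mean: {avg:.3f}")
```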

In [100]:
### YOUR CODE HERE

reference = [l[0] for ll in audios.values() for l in ll]
test = [a for ll in audios.values() for l in ll for a in l[1:]]

def _process(args):
    t, r, dist_fn = args
    return t, r, dtw(r["mfccs"], t["mfccs"], dist_fn)

results = []
with ProcessPoolExecutor() as executor:  # renamed: `exec` shadows a builtin
    task_args = ((t, r, dist) for t in test for r in reference)
    # chunksize hands tasks to the workers in batches of 256,
    # which replaces the manual batching loop
    results.extend(executor.map(_process, task_args, chunksize=256))

### END YOUR CODE

In [101]:
# analysis in separate cell because of long runtime

sum_same_digit_same_speaker = 0
sum_same_digit_different_speaker = 0
sum_different_digit_same_speaker = 0
sum_different_digit_different_speaker = 0
count_same_digit_same_speaker = 0
count_same_digit_different_speaker = 0
count_different_digit_same_speaker = 0
count_different_digit_different_speaker = 0

highest_same_digit_same_speaker = None
highest_same_digit_different_speaker = None
lowest_different_digit_same_speaker = None
lowest_different_digit_different_speaker = None

for result in results:
    if result[0]["digitLabel"] == result[1]["digitLabel"]:
        if result[0]["speakerName"] == result[1]["speakerName"]:
            sum_same_digit_same_speaker += result[2]
            count_same_digit_same_speaker += 1
            if highest_same_digit_same_speaker is None or highest_same_digit_same_speaker[2] < result[2]:
                highest_same_digit_same_speaker = result
        else:
            sum_same_digit_different_speaker += result[2]
            count_same_digit_different_speaker += 1
            if highest_same_digit_different_speaker is None or highest_same_digit_different_speaker[2] < result[2]:
                highest_same_digit_different_speaker = result
    else:
        if result[0]["speakerName"] == result[1]["speakerName"]:
            sum_different_digit_same_speaker += result[2]
            count_different_digit_same_speaker += 1
            if lowest_different_digit_same_speaker is None or lowest_different_digit_same_speaker[2] > result[2]:
                lowest_different_digit_same_speaker = result
        else:
            sum_different_digit_different_speaker += result[2]
            count_different_digit_different_speaker += 1
            if lowest_different_digit_different_speaker is None or lowest_different_digit_different_speaker[2] > result[2]:
                lowest_different_digit_different_speaker = result

average_same_digit_same_speaker = sum_same_digit_same_speaker / count_same_digit_same_speaker
average_same_digit_different_speaker = sum_same_digit_different_speaker / count_same_digit_different_speaker
average_different_digit_same_speaker = sum_different_digit_same_speaker / count_different_digit_same_speaker
average_different_digit_different_speaker = sum_different_digit_different_speaker / count_different_digit_different_speaker

print(f"average distance with same digit and same speaker: {average_same_digit_same_speaker}")
print(f"average distance with same digit and different speaker: {average_same_digit_different_speaker}")
print(f"average distance with different digit and same speaker: {average_different_digit_same_speaker}")
print(f"average distance with different digit and different speaker: {average_different_digit_different_speaker}\n")

if highest_same_digit_same_speaker is not None:
    t, r, d = highest_same_digit_same_speaker
    print("worst example with same digit, same speaker:")
    print(f"  test:      [speaker: {t['speakerName']}, digit: {t['digitLabel']}, index: {t['index']}]")
    print(f"  reference: [speaker: {r['speakerName']}, digit: {r['digitLabel']}, index: {r['index']}]")
    print(f"  distance: {d}\n")
if highest_same_digit_different_speaker is not None:
    t, r, d = highest_same_digit_different_speaker
    print("worst example with same digit, different speaker:")
    print(f"  test:      [speaker: {t['speakerName']}, digit: {t['digitLabel']}, index: {t['index']}]")
    print(f"  reference: [speaker: {r['speakerName']}, digit: {r['digitLabel']}, index: {r['index']}]")
    print(f"  distance: {d}\n")
if lowest_different_digit_same_speaker is not None:
    t, r, d = lowest_different_digit_same_speaker
    print("worst example with different digit, same speaker:")
    print(f"  test:      [speaker: {t['speakerName']}, digit: {t['digitLabel']}, index: {t['index']}]")
    print(f"  reference: [speaker: {r['speakerName']}, digit: {r['digitLabel']}, index: {r['index']}]")
    print(f"  distance: {d}\n")
if lowest_different_digit_different_speaker is not None:
    t, r, d = lowest_different_digit_different_speaker
    print("worst example with different digit, different speaker:")
    print(f"  test:      [speaker: {t['speakerName']}, digit: {t['digitLabel']}, index: {t['index']}]")
    print(f"  reference: [speaker: {r['speakerName']}, digit: {r['digitLabel']}, index: {r['index']}]")
    print(f"  distance: {d}\n")

average distance with same digit and same speaker: 0.09327153051405934
average distance with same digit and different speaker: 0.1891775557728557
average distance with different digit and same speaker: 0.1668764452006134
average distance with different digit and different speaker: 0.21872847214120383

worst example with same digit, same speaker:
  test:      [speaker: george, digit: 0, index: 10]
  reference: [speaker: george, digit: 0, index: 0]
  distance: 0.24624775914287667

worst example with same digit, different speaker:
  test:      [speaker: george, digit: 6, index: 19]
  reference: [speaker: jackson, digit: 6, index: 0]
  distance: 0.3302148956630585

worst example with different digit, same speaker:
  test:      [speaker: yweweler, digit: 8, index: 8]
  reference: [speaker: yweweler, digit: 6, index: 0]
  distance: 0.06485905584723005

worst example with different digit, different speaker:
  test:      [speaker: george, digit: 6, index: 11]
  reference: [speaker: yweweler, d

### Implement a DTW-based Isolated Word Recognizer

In [None]:
### TODO: Classify recording into digit label based on reference audio recordings

def recognize(obs: list[np.ndarray], refs: list[np.ndarray]) -> str:
    """
    Classify the input based on a reference list (train recordings).
    
    Arguments:
    obs: List of input observations (MFCCs).
    refs: List of reference MFCCs (train recordings); refs[d] is assumed to be the reference for digit d.
    
    Returns the class names (one digit per observation) for which the DTW distance is minimum.
    """
    ### YOUR CODE HERE
    
    result = ""
    for o in obs:
        # DTW distance from this observation to every reference
        dists = [dtw(o, r, dist) for r in refs]
        # the index of the closest reference is the predicted digit
        x = 0
        for i in range(1, len(dists)):
            if dists[i] < dists[x]:
                x = i
        result += str(x)
    return result
    
    ### END YOUR CODE

### Experiment 2: Speaker-Dependent IWR

Select training recordings from one speaker $S_i$ and disjoint test recordings from the same speaker $S_i$. Compute the Precision, Recall, and F1 metrics, and plot the confusion matrix.
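
For reference, the metrics can be derived from the confusion matrix alone. The following is a plain-Python sketch with hypothetical helper names and a toy three-class example; library routines such as `sklearn.metrics.confusion_matrix` and `sklearn.metrics.precision_recall_fscore_support` could be used instead.

```python
def confusion_matrix(y_true: list[int], y_pred: list[int], n_classes: int = 10) -> list[list[int]]:
    """cm[t][p] counts recordings of true class t that were predicted as p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_metrics(cm: list[list[int]]) -> list[tuple[float, float, float]]:
    """Return (precision, recall, F1) for every class."""
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, but wrong
        fn = sum(cm[c]) - tp                       # true c, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Toy labels, not actual recognition results.
cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], n_classes=3)
for label, (p, r, f1) in enumerate(per_class_metrics(cm)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Rows of the matrix are true labels and columns are predictions, so the diagonal holds the correct decisions and off-diagonal cells show which digits get confused.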

In [None]:
### YOUR CODE HERE



### END YOUR CODE

### Experiment 3: Speaker-Independent IWR

Select training recordings from one speaker $S_i$ and test recordings from another speaker $S_j$. Compute the Precision, Recall, and F1 metrics, and plot the confusion matrix.

In [None]:
### YOUR CODE HERE



### END YOUR CODE

### Food for Thought

- What are inherent issues of this approach?
- How does this algorithm scale with a larger vocabulary, and how can it be improved?
- How can you extend this idea to continuous speech, ie. ?