# Assignment 1: Dynamic Time Warping

---

## Task 4) Isolated Word Recognition

Because raw audio has a high sampling rate (e.g. 8 kHz, i.e. 8000 samples per second), performing [DTW](https://en.wikipedia.org/wiki/Dynamic_time_warping) on the raw audio signal is not advised (feel free to try!).
A better solution is to compute a set of features; here we will extract [mel-frequency cepstral coefficients](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum) (MFCCs) over windows of 25 ms length, shifted by 10 ms.
The recommended implementation is [librosa](https://librosa.org/doc/main/generated/librosa.feature.mfcc.html).
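
As a rough sketch, the 25 ms window and 10 ms shift can be converted into sample counts as follows. The helper name `frame_params` is hypothetical, and passing the results to librosa as `n_fft` and `hop_length` is an assumption about how you choose to call it, not a requirement of the task.

```python
def frame_params(sr: int, win_ms: float = 25.0, hop_ms: float = 10.0) -> tuple[int, int]:
    """Convert window length and shift from milliseconds to samples."""
    win_length = int(round(sr * win_ms / 1000))
    hop_length = int(round(sr * hop_ms / 1000))
    return win_length, hop_length


win, hop = frame_params(8000)
print(win, hop)  # 200 samples per window and 80 samples per shift at 8 kHz

# Assumed usage with librosa (not executed here):
# mfccs = librosa.feature.mfcc(y=signal, sr=8000, n_fft=win, hop_length=hop)
```

At 8 kHz a one-second recording then yields roughly 100 feature frames instead of 8000 raw samples, which is what makes DTW on features tractable.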

### Data

Download Zohar Jackson's [free spoken digit dataset](https://github.com/Jakobovski/free-spoken-digit-dataset).
There's no need to clone the repository; feel free to download a tagged revision, like [v1.0.10](https://github.com/Jakobovski/free-spoken-digit-dataset/archive/refs/tags/v1.0.10.tar.gz).
The file naming convention is simple (`{digitLabel}_{speakerName}_{index}.wav`); let's restrict ourselves to two speakers, e.g. `jackson` and `george`.
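
The naming convention above can be parsed with a regular expression. This is a minimal sketch: the helper name `parse_name` is hypothetical, and the pattern assumes speaker names consist of letters only.

```python
import re
from typing import Optional

# Pattern for `{digitLabel}_{speakerName}_{index}.wav`; the dot is escaped on purpose.
# Assumption: speaker names contain letters only.
FILENAME_RE = re.compile(r"^(\d+)_([A-Za-z]+)_(\d+)\.wav$")


def parse_name(filename: str) -> Optional[tuple[int, str, int]]:
    """Return (digit, speaker, index), or None if the name does not match."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    return int(m.group(1)), m.group(2), int(m.group(3))


print(parse_name("7_jackson_32.wav"))  # (7, 'jackson', 32)
print(parse_name("README.md"))         # None
```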

### Dynamic Time Warping

[DTW](https://en.wikipedia.org/wiki/Dynamic_time_warping) is closely related to the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) and the [Needleman-Wunsch algorithm](https://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm).
The main rationale behind DTW is that the two sequences can be aligned, but their speed and exact realization may vary.
Consequently, the cost does not depend on an edit operation but on the difference between aligned observations.
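
To make the contrast with edit distance concrete, here is a toy sketch on 1-D sequences. The function name `dtw_cost` and the absolute difference as local cost are choices made for this illustration only; the assignment itself works on MFCC frames.

```python
import math


def dtw_cost(a: list[float], b: list[float]) -> float:
    """Unnormalized DTW cost between two 1-D sequences, using |x - y| as local cost."""
    n, m = len(a), len(b)
    # D[i][j] holds the best cost of aligning a[:i] with b[:j]
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            D[i][j] = local + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]


print(dtw_cost([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_cost([1, 2, 3], [1, 2, 4]))     # 1.0
```

In the first call the second sequence is merely a stretched version of the first, so the cost is zero, whereas an edit distance would count one insertion. In the second call the cost comes from the difference between the observations 3 and 4.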

---

### Preparation

In [1]:
import numpy as np
import librosa as lr
from typing import TypedDict
import tarfile
import os
import requests
import re
from concurrent.futures import ProcessPoolExecutor
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [2]:
### TODO: Read in files, compute MFCC, and organize
### Notice: You can restrict the number to a few files for each speaker-digit

class Audio(TypedDict):
    digitLabel: int
    speakerName: str
    index: int
    mfccs: np.ndarray

# audios: List[Audio] = []

speakers = ["george", "jackson", "yweweler"]

### YOUR CODE HERE

DIGITS_URL = "https://github.com/Jakobovski/free-spoken-digit-dataset/archive/refs/tags/v1.0.10.tar.gz"
DATA_PATH = "data"
DIGITS_TARBALL_PATH = DATA_PATH + "/free-spoken-digit-dataset-1.0.10.tar.gz"
DIGITS_PATH = DATA_PATH + "/free-spoken-digit-dataset-1.0.10"
RECORDINGS_PATH = DIGITS_PATH + "/recordings"

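os.makedirs(DATA_PATH, exist_ok=True)  # make sure the download directory exists
if not os.path.exists(DIGITS_TARBALL_PATH):
    response = requests.get(DIGITS_URL)
    response.raise_for_status()  # fail loudly instead of saving an error page
    with open(DIGITS_TARBALL_PATH, "wb") as fp:
        fp.write(response.content)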
if not os.path.exists(DIGITS_PATH):
    with tarfile.open(DIGITS_TARBALL_PATH) as tar:
        tar.extractall(DATA_PATH)

audios: dict[str, list[list[Audio]]] = { s: [[] for _ in range(10)] for s in speakers }
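for file in os.listdir(RECORDINGS_PATH):
    match = re.match(r"^(\d+)_(\w+)_(\d+)\.wav$", file)
    if match is None:
        continue
    digit = int(match.group(1))
    speaker = match.group(2)
    index = int(match.group(3))
    if speaker not in speakers:
        continue  # skip other speakers before doing any expensive work
    # NOTE: librosa defaults are used here (resampling to 22050 Hz, n_fft=2048,
    # hop_length=512), not the 25 ms / 10 ms framing suggested in the task text.
    signal, sr = lr.load(f"{RECORDINGS_PATH}/{file}")
    mfccs = lr.feature.mfcc(y=signal, sr=sr)
    audios[speaker][digit].append(Audio(
        digitLabel=digit,
        speakerName=speaker,
        index=index,
        mfccs=mfccs
    ))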

for ll in audios.values():
    for l in ll:
        l.sort(key=lambda a: a["index"])

### END YOUR CODE

### Implement Dynamic Time Warping

In [3]:
from math import sqrt
from typing import Callable
import numpy.linalg as la

def dist(x: np.ndarray, y: np.ndarray) -> float:
    """
    Compute the distance between two samples.

    Arguments:
    x: MFCCs of first sample.
    y: MFCCs of second sample.

    Returns the distance as float
    """
    ### YOUR CODE HERE
   
    # Of Euclidean, angular, and cosine distance, cosine seemed to work best.
    # return la.norm(y - x).item()
    # return np.arccos(x @ y.T / (la.norm(x) * la.norm(y))).item()
    return 1 - (x @ y.T / (la.norm(x) * la.norm(y))).item()
    
    ### END YOUR CODE


def dtw(obs1: np.ndarray, obs2: np.ndarray, dist_fn: Callable[[np.ndarray, np.ndarray], float]) -> float:
    """
    Compute the dynamic time warping score between two observations.
    
    Arguments:
    obs1: MFCCs of the first recording, shape (n_mfcc, n_frames).
    obs2: MFCCs of the second recording, shape (n_mfcc, n_frames).
    dist_fn: Distance function applied to pairs of frames.

    Returns the score as float.
    """
    ### YOUR CODE HERE
    
    D = np.full((obs1.shape[1] + 1, obs2.shape[1] + 1), np.inf)
    D[0, 0] = 0
    for i in range(obs1.shape[1]):
        for j in range(obs2.shape[1]):
            D[i+1, j+1] = dist_fn(obs1[:, i], obs2[:, j]) + min(
                D[i, j],
                D[i+1, j],
                D[i, j+1]
            )
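    # normalize by the diagonal length of the cost matrix so that scores of
    # recordings with different durations remain comparable
    return D[-1, -1] / sqrt(obs1.shape[1] ** 2 + obs2.shape[1] ** 2)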

    ### END YOUR CODE

### Experiment 1: DTW scores

For each speaker and digit, select one recording as an observation (obs1) and the others as tests (obs2). How do scores change across speakers and across digits?

In [4]:
### YOUR CODE HERE

from math import ceil

reference = [l[0] for ll in audios.values() for l in ll]
test = []
for ll in audios.values():
    for l in ll:
        for i in range(1, len(l)):
            test.append(l[i])

def _process(args):
    return args[0], args[1], dtw(args[1]["mfccs"], args[0]["mfccs"], args[2])

results = []
with ProcessPoolExecutor() as executor:  # avoid shadowing the builtin `exec`
    task_args = ((t, r, dist) for t in test for r in reference)
    count = len(reference) * len(test)
    batch_size = 256
    # submit the pairs in batches so only a bounded number of tasks is queued at once
    for i in range(ceil(count / batch_size)):
        for result in executor.map(
            _process,
            (next(task_args) for _ in range(min(batch_size, count - i * batch_size)))
        ):
            results.append(result)
                

### END YOUR CODE

In [5]:
# analysis in separate cell because of long runtime

sum_same_digit_same_speaker = 0
sum_same_digit_different_speaker = 0
sum_different_digit_same_speaker = 0
sum_different_digit_different_speaker = 0
count_same_digit_same_speaker = 0
count_same_digit_different_speaker = 0
count_different_digit_same_speaker = 0
count_different_digit_different_speaker = 0

highest_same_digit_same_speaker = None
highest_same_digit_different_speaker = None
lowest_different_digit_same_speaker = None
lowest_different_digit_different_speaker = None

for result in results:
    if result[0]["digitLabel"] == result[1]["digitLabel"]:
        if result[0]["speakerName"] == result[1]["speakerName"]:
            sum_same_digit_same_speaker += result[2]
            count_same_digit_same_speaker += 1
            if highest_same_digit_same_speaker is None or highest_same_digit_same_speaker[2] < result[2]:
                highest_same_digit_same_speaker = result
        else:
            sum_same_digit_different_speaker += result[2]
            count_same_digit_different_speaker += 1
            if highest_same_digit_different_speaker is None or highest_same_digit_different_speaker[2] < result[2]:
                highest_same_digit_different_speaker = result
    else:
        if result[0]["speakerName"] == result[1]["speakerName"]:
            sum_different_digit_same_speaker += result[2]
            count_different_digit_same_speaker += 1
            if lowest_different_digit_same_speaker is None or lowest_different_digit_same_speaker[2] > result[2]:
                lowest_different_digit_same_speaker = result
        else:
            sum_different_digit_different_speaker += result[2]
            count_different_digit_different_speaker += 1
            if lowest_different_digit_different_speaker is None or lowest_different_digit_different_speaker[2] > result[2]:
                lowest_different_digit_different_speaker = result

average_same_digit_same_speaker = sum_same_digit_same_speaker / count_same_digit_same_speaker
average_same_digit_different_speaker = sum_same_digit_different_speaker / count_same_digit_different_speaker
average_different_digit_same_speaker = sum_different_digit_same_speaker / count_different_digit_same_speaker
average_different_digit_different_speaker = sum_different_digit_different_speaker / count_different_digit_different_speaker

print(f"average distance with same digit and same speaker: {average_same_digit_same_speaker}")
print(f"average distance with same digit and different speaker: {average_same_digit_different_speaker}")
print(f"average distance with different digit and same speaker: {average_different_digit_same_speaker}")
print(f"average distance with different digit and different speaker: {average_different_digit_different_speaker}\n")

if highest_same_digit_same_speaker is not None:
    t, r, d = highest_same_digit_same_speaker
    print("worst example with same digit, same speaker:")
    print(f"  test:      [speaker: {t['speakerName']}, digit: {t['digitLabel']}, index: {t['index']}]")
    print(f"  reference: [speaker: {r['speakerName']}, digit: {r['digitLabel']}, index: {r['index']}]")
    print(f"  distance: {d}\n")
if highest_same_digit_different_speaker is not None:
    t, r, d = highest_same_digit_different_speaker
    print("worst example with same digit, different speaker:")
    print(f"  test:      [speaker: {t['speakerName']}, digit: {t['digitLabel']}, index: {t['index']}]")
    print(f"  reference: [speaker: {r['speakerName']}, digit: {r['digitLabel']}, index: {r['index']}]")
    print(f"  distance: {d}\n")
if lowest_different_digit_same_speaker is not None:
    t, r, d = lowest_different_digit_same_speaker
    print("worst example with different digit, same speaker:")
    print(f"  test:      [speaker: {t['speakerName']}, digit: {t['digitLabel']}, index: {t['index']}]")
    print(f"  reference: [speaker: {r['speakerName']}, digit: {r['digitLabel']}, index: {r['index']}]")
    print(f"  distance: {d}\n")
if lowest_different_digit_different_speaker is not None:
    t, r, d = lowest_different_digit_different_speaker
    print("worst example with different digit, different speaker:")
    print(f"  test:      [speaker: {t['speakerName']}, digit: {t['digitLabel']}, index: {t['index']}]")
    print(f"  reference: [speaker: {r['speakerName']}, digit: {r['digitLabel']}, index: {r['index']}]")
    print(f"  distance: {d}\n")

average distance with same digit and same speaker: 0.007229644998848718
average distance with same digit and different speaker: 0.023731967534672053
average distance with different digit and same speaker: 0.019942347745633464
average distance with different digit and different speaker: 0.032013864430892325

worst example with same digit, same speaker:
  test:      [speaker: george, digit: 0, index: 10]
  reference: [speaker: george, digit: 0, index: 0]
  distance: 0.042647927332794

worst example with same digit, different speaker:
  test:      [speaker: george, digit: 6, index: 31]
  reference: [speaker: jackson, digit: 6, index: 0]
  distance: 0.06696736893808748

worst example with different digit, same speaker:
  test:      [speaker: yweweler, digit: 0, index: 44]
  reference: [speaker: yweweler, digit: 2, index: 0]
  distance: 0.0029469973038272223

worst example with different digit, different speaker:
  test:      [speaker: george, digit: 6, index: 11]
  reference: [speaker: ywe

### Implement a DTW-based Isolated Word Recognizer

In [6]:
### TODO: Classify recording into digit label based on reference audio recordings

def recognize(obs: list[Audio], refs: list[Audio]) -> list[int]:
    """
    Classify the input based on a reference list (train recordings).
    
    Arguments:
    obs: List of audio items to classify (test recordings).
    refs: List of audio items (train recordings).
    
    Returns, for each observation, the digit label of the reference with minimum DTW distance.
    """
    ### YOUR CODE HERE
    
    results = []
    with ProcessPoolExecutor() as executor:  # avoid shadowing the builtin `exec`
        task_args = ((t, r, dist) for t in obs for r in refs)
        count = len(refs) * len(obs)
        batch_size = 256
        # submit the pairs in batches so only a bounded number of tasks is queued at once
        for i in range(ceil(count / batch_size)):
            for result in executor.map(
                _process,
                (next(task_args) for _ in range(min(batch_size, count - i * batch_size)))
            ):
                results.append(result)
    pred = []
    for i in range(len(obs)):
        start = i * len(refs)
        closest = start
        for j in range(start + 1, start + len(refs)):
            if results[j][2] < results[closest][2]:
                closest = j
        pred.append(results[closest][1]["digitLabel"])    
    return pred
    
    ### END YOUR CODE

### Experiment 2: Speaker-Dependent IWR

Select training recordings from one speaker $S_i$ and disjoint test recordings from the same speaker $S_i$. Compute the Precision, Recall, and F1 metrics, and plot the confusion matrix.
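
As a reminder of how the requested metrics follow from a confusion matrix, here is a small dependency-free sketch. The helper name `per_class_metrics` and the toy matrix are made up for illustration; rows are assumed to be true labels and columns predicted labels, as in scikit-learn's `confusion_matrix`.

```python
def per_class_metrics(conf: list[list[int]]) -> list[tuple[float, float, float]]:
    """Per-class (precision, recall, F1) from a confusion matrix.

    Rows are true labels, columns are predicted labels.
    Undefined ratios (zero denominator) are reported as 0.0.
    """
    n = len(conf)
    metrics = []
    for k in range(n):
        tp = conf[k][k]
        predicted_k = sum(conf[r][k] for r in range(n))  # column sum
        actual_k = sum(conf[k])                          # row sum
        prec = tp / predicted_k if predicted_k else 0.0
        rec = tp / actual_k if actual_k else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        metrics.append((prec, rec, f1))
    return metrics


toy = [[3, 1],
       [0, 4]]
for label, (p, r, f) in enumerate(per_class_metrics(toy)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Handling the zero-denominator case explicitly matters as soon as a class is never predicted, which can easily happen when training and test speakers differ.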

In [7]:
### YOUR CODE HERE

RANDOM_SEED = 42

train_ratio = 0.2
train_s1 = []
test_s1 = []
for i in range(10):
    tr, te = train_test_split(
        audios[speakers[0]][i],
        train_size=train_ratio,
        shuffle=True,
        random_state=RANDOM_SEED
    )
    train_s1 += tr
    test_s1 += te

actual = [a["digitLabel"] for a in test_s1]
pred = recognize(test_s1, train_s1)

conf_mat = confusion_matrix(actual, pred)
acc = (conf_mat.diagonal().sum() / len(test_s1)).item()
prec = conf_mat.diagonal() / conf_mat.sum(0)
rec = conf_mat.diagonal() / conf_mat.sum(1)
f1 = 2 / ((1 / prec) + (1 / rec))

print(f"accuracy: {acc}")
print(f"precision scores: {prec[0].item():.2f}", end="")
for i in range(1, prec.shape[0]):
    print(f", {prec[i].item():.2f}", end="")
print()
print(f"recall scores: {rec[0].item():.2f}", end="")
for i in range(1, rec.shape[0]):
    print(f", {rec[i].item():.2f}", end="")
print()
print(f"f1 scores: {f1[0].item():.2f}", end="")
for i in range(1, f1.shape[0]):
    print(f", {f1[i].item():.2f}", end="")
print()
print(f"confusion matrix:\n{conf_mat}")

### END YOUR CODE

accuracy: 0.98
precision scores: 0.97, 1.00, 0.97, 0.89, 1.00, 0.98, 1.00, 1.00, 1.00, 1.00
recall scores: 0.97, 0.97, 0.97, 1.00, 1.00, 1.00, 0.93, 0.95, 1.00, 1.00
f1 scores: 0.97, 0.99, 0.97, 0.94, 1.00, 0.99, 0.96, 0.97, 1.00, 1.00
confusion matrix:
[[39  0  1  0  0  0  0  0  0  0]
 [ 0 39  0  0  0  1  0  0  0  0]
 [ 1  0 39  0  0  0  0  0  0  0]
 [ 0  0  0 40  0  0  0  0  0  0]
 [ 0  0  0  0 40  0  0  0  0  0]
 [ 0  0  0  0  0 40  0  0  0  0]
 [ 0  0  0  3  0  0 37  0  0  0]
 [ 0  0  0  2  0  0  0 38  0  0]
 [ 0  0  0  0  0  0  0  0 40  0]
 [ 0  0  0  0  0  0  0  0  0 40]]


### Experiment 3: Speaker-Independent IWR

Select training recordings from one speaker $S_i$ and test recordings from another speaker $S_j$. Compute the Precision, Recall, and F1 metrics, and plot the confusion matrix.

In [8]:
### YOUR CODE HERE

from functools import reduce


test_s2 = reduce(lambda a, b: a + b, audios[speakers[1]])

actual = [a["digitLabel"] for a in test_s2]
pred = recognize(test_s2, train_s1)

conf_mat = confusion_matrix(actual, pred)
acc = (conf_mat.diagonal().sum() / len(test_s2)).item()  # normalize by the size of the evaluated test set
prec = conf_mat.diagonal() / (conf_mat.sum(0) + 1e-100)
rec = conf_mat.diagonal() / (conf_mat.sum(1) + 1e-100)
f1 = 2 / ((1 / (prec + 1e-100)) + (1 / (rec + 1e-100)))

print(f"accuracy: {acc}")
print(f"precision scores: {prec[0].item():.2f}", end="")
for i in range(1, prec.shape[0]):
    print(f", {prec[i].item():.2f}", end="")
print()
print(f"recall scores: {rec[0].item():.2f}", end="")
for i in range(1, rec.shape[0]):
    print(f", {rec[i].item():.2f}", end="")
print()
print(f"f1 scores: {f1[0].item():.2f}", end="")
for i in range(1, f1.shape[0]):
    print(f", {f1[i].item():.2f}", end="")
print()
print(f"confusion matrix:\n{conf_mat}")

### END YOUR CODE

accuracy: 0.524
precision scores: 0.56, 0.52, 0.11, 0.97, 0.45, 1.00, 0.00, 0.05, 1.00, 0.54
recall scores: 0.98, 0.96, 0.10, 0.64, 0.58, 0.60, 0.00, 0.06, 0.74, 0.58
f1 scores: 0.72, 0.67, 0.11, 0.77, 0.50, 0.75, 0.00, 0.06, 0.85, 0.56
confusion matrix:
[[49  0  0  0  0  0  0  0  0  1]
 [ 0 48  2  0  0  0  0  0  0  0]
 [38  0  5  0  1  0  0  6  0  0]
 [ 0  0 10 32  0  0  0  0  0  8]
 [ 0 11 10  0 29  0  0  0  0  0]
 [ 0  6  0  0 12 30  0  0  0  2]
 [ 0  0  0  0  2  0  0 47  0  1]
 [ 0  7  6  0 21  0  0  3  0 13]
 [ 0  0 11  1  0  0  1  0 37  0]
 [ 0 21  0  0  0  0  0  0  0 29]]


### Food for Thought

- What are inherent issues of this approach?
- How does this algorithm scale with a larger vocabulary, how can it be improved?
- How can you extend this idea to continuous speech, ie. ?
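
One commonly discussed way to reduce the cost of each comparison is to restrict the warping path to a band around the diagonal (a Sakoe-Chiba style window). The following is only a sketch under stated assumptions: the name `dtw_banded`, the 1-D inputs, and the absolute difference as local cost are illustrative, and for simplicity the full matrix is still allocated even though only cells inside the band are evaluated.

```python
import math


def dtw_banded(a: list[float], b: list[float], band: int) -> float:
    """DTW cost restricted to cells with |i - j| <= band."""
    n, m = len(a), len(b)
    band = max(band, abs(n - m))  # the band must at least reach the end cell
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # only visit columns inside the band around the diagonal
        for j in range(max(1, i - band), min(m, i + band) + 1):
            local = abs(a[i - 1] - b[j - 1])
            D[i][j] = local + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]


print(dtw_banded([1, 2, 3], [1, 2, 2, 3], band=1))  # 0.0
print(dtw_banded([1, 2, 3], [1, 3, 3], band=0))     # 1.0
```

With a band of width `band`, roughly `n * (2 * band + 1)` cells are evaluated instead of `n * m`. The number of reference comparisons still grows with the vocabulary, so this addresses only part of the scaling question.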