This notebook walks through running the [DataPerf Speech](https://www.dataperf.org/training-set-selection-speech) challenge evaluation with a [baseline selection algorithm](https://github.com/harvard-edge/dataperf-speech-example/blob/main/selection/implementations/baseline_selection.py).

We start by cloning our example selection algorithm repository and installing some additional dependencies not preinstalled in Colab environments:

In [None]:
!git clone https://github.com/harvard-edge/dataperf-speech-example/

In [5]:
import os
import sys

# Make the cloned repository importable and run later commands from its root.
sys.path.append("/content/dataperf-speech-example/")
os.chdir("dataperf-speech-example/")

Next, we download the spoken word embeddings, which we use for both training coreset selection and evaluation.

In [None]:
!python utils/download_data.py --output_path workspace/data 1> /dev/null
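
To confirm the download finished before moving on, a quick check like the sketch below can help. The directory layout is inferred from the paths passed to the selection and evaluation commands in this notebook. `expected_paths` and `missing_paths` are hypothetical helper names, not part of the repository.

```python
from pathlib import Path

LANGUAGES = ["en", "id", "pt"]


def expected_paths(lang, data_root="workspace/data"):
    """Paths that this notebook later passes to the selection and eval commands."""
    base = Path(data_root) / f"dataperf_{lang}_data"
    return [
        base / "allowed_training_set.yaml",
        base / "train_embeddings",
        base / "eval_embeddings",
        base / "eval.yaml",
    ]


def missing_paths(data_root="workspace/data", languages=LANGUAGES):
    """Return every expected path that does not exist on disk."""
    return [
        path
        for lang in languages
        for path in expected_paths(lang, data_root)
        if not path.exists()
    ]
```

Calling `missing_paths()` from the repository root should return an empty list once the download has completed, assuming the layout above is right.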

Below, we select a set of 25 training samples from the available embeddings for each language using our default selection algorithm, which simply performs k-fold cross-validation. The evaluation strategy can be changed by editing `dataperf-speech-example/workspace/dataperf_speech_config.yaml`.

The goal of this challenge is to add your own selection algorithm and outperform the provided baselines' macro F1 scores.
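
For reference, macro F1 is the unweighted mean of the per-class F1 scores, so a rare target word counts as much as a frequent class. The sketch below computes it from scratch on made-up labels. It illustrates the metric only and is not the repository's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels: two target words and a catch-all "other" class.
y_true = ["yes", "yes", "no", "no", "other", "other"]
y_pred = ["yes", "no", "no", "no", "other", "yes"]
print(macro_f1(y_true, y_pred))
```

Here the per-class F1 scores are 0.8 for `no`, 2/3 for `other`, and 0.5 for `yes`, and the macro F1 is their plain average.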

The selection algorithm will output a training file for each language, `en_25_train.json`, `id_25_train.json`, and `pt_25_train.json`.
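
Before uploading or evaluating a selection file, it can be worth checking that it stays within the size budget. The sketch below assumes only that sample IDs are stored as string values somewhere in the JSON structure; it does not rely on the actual schema. `count_ids` and `within_budget` are hypothetical helpers, and the example dictionary is invented for illustration.

```python
import json


def count_ids(node):
    """Count string leaves in a nested structure of dicts and lists."""
    if isinstance(node, str):
        return 1
    if isinstance(node, dict):
        return sum(count_ids(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_ids(item) for item in node)
    return 0


def within_budget(path, train_size):
    """True if the JSON file at `path` holds at most `train_size` string leaves."""
    with open(path) as f:
        return count_ids(json.load(f)) <= train_size


# Invented layout for illustration; the real file's schema may differ.
example = {"group_a": ["id_1", "id_2"], "group_b": ["id_3"]}
print(count_ids(example))  # 3
```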

These are the files you would upload to Dynabench for official evaluation. Later in this notebook, we run a local, unofficial evaluation on them using our provided evaluation data.

In [6]:
TRAIN_SIZE = 25  # coreset size; use 60 for the larger setting
for lang in ["en", "id", "pt"]:
  !python -m selection.main \
     --language "{lang}" \
     --allowed_training_set "workspace/data/dataperf_{lang}_data/allowed_training_set.yaml" \
     --train_embeddings_dir "workspace/data/dataperf_{lang}_data/train_embeddings/" \
     --train_size {TRAIN_SIZE} \
     --outdir "/drive1/nammt/dataperf-speech-example/dataperf-speech-example"

Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 18.12it/s]
Loading nontargets: 100%|█████████████████████| 100/100 [00:06<00:00, 15.89it/s]
num_targets: 15
per_target_class_size: 3
nontarget_class_size: 10
Using original method to select data subset ...
k-fold cross validation: 100%|██████████████████| 10/10 [01:42<00:00, 10.22s/it]
final best_score=0.6630933625692349
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 50.97it/s]
Loading nontargets: 100%|█████████████████████| 100/100 [00:01<00:00, 84.28it/s]
num_targets: 15
per_target_class_size: 3
nontarget_class_size: 10
Using original method to select data subset ...
k-fold cross validation: 100%|██████████████████| 10/10 [00:26<00:00,  2.64s/it]
final best_score=0.6547101271438757
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 22.94it/s]
Loading nontargets: 100%|█████████████████████| 100/100 [00:03<00:00, 30.55it/s]
num_targets: 15
per_target_class_size: 3
nontarget_c

Finally, let's run a local unofficial evaluation on the results of the training set selection algorithm (`en_25_train.json`, `id_25_train.json`, and `pt_25_train.json`). 

For each language, we load the coreset training samples specified in the JSON file, along with the evaluation samples specified in `eval.yaml`. We then train an ensemble classifier, [average the macro F1 score across ten random seeds](https://github.com/harvard-edge/dataperf-speech-example/blob/main/eval.py#L139-L154), and display the result, which should match the scores on the Dynabench leaderboard for coreset sizes of 25 and 60.
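
The seed-averaging step can be pictured as follows. This is a minimal sketch, not the code in `eval.py`: `train_and_score` is a hypothetical callable standing in for training the classifier with a given seed and returning its macro F1, and the seed values 0 to 9 are an assumption.

```python
import random
from statistics import mean


def average_over_seeds(train_and_score, seeds=range(10)):
    """Call train_and_score(seed) once per seed and return the mean score."""
    return mean(train_and_score(seed) for seed in seeds)


def noisy_score(seed):
    """Toy stand-in for training: a fixed base score plus seed-dependent noise."""
    rng = random.Random(seed)
    return 0.35 + rng.uniform(-0.02, 0.02)


print(average_over_seeds(noisy_score))
```

Averaging over several seeds reduces the effect of any single lucky or unlucky classifier initialization, which matters when the training set contains only a few dozen samples.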

Here is the expected output for English with a coreset size of 25, using the input of `en_25_train.json` produced by the previous cell:

```
validating selected IDs
loading selected training data
Loading targets: 100% 5/5 [00:00<00:00, 17.97it/s]
Loading nontargets: 100% 9/9 [00:00<00:00, 140.54it/s]
loading eval data
Loading targets: 100% 5/5 [00:00<00:00, 119.50it/s]
Loading nontargets: 100% 200/200 [00:12<00:00, 16.11it/s]

Score:  0.3524448610675314
```


In [7]:
for lang in ["en", "id", "pt"]:
  !python eval.py \
    --language "{lang}" \
    --eval_embeddings_dir "workspace/data/dataperf_{lang}_data/eval_embeddings/" \
    --train_embeddings_dir "workspace/data/dataperf_{lang}_data/train_embeddings/" \
    --allowed_training_set "workspace/data/dataperf_{lang}_data/allowed_training_set.yaml" \
    --eval_file "workspace/data/dataperf_{lang}_data/eval.yaml" \
    --train_file "{lang}_{TRAIN_SIZE}_train.json" \
    --train_size {TRAIN_SIZE} \
    --config_file workspace/dataperf_speech_config.yaml

validating selected IDs
loading selected training data
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 18.79it/s]
Loading nontargets: 100%|████████████████████████| 9/9 [00:00<00:00, 102.61it/s]
loading eval data
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 75.85it/s]
Loading nontargets: 100%|█████████████████████| 200/200 [00:18<00:00, 10.56it/s]
Score:  0.31644245062823384
validating selected IDs
loading selected training data
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 51.58it/s]
Loading nontargets: 100%|██████████████████████| 10/10 [00:00<00:00, 195.39it/s]
loading eval data
Loading targets: 100%|███████████████████████████| 5/5 [00:00<00:00, 143.13it/s]
Loading nontargets: 100%|█████████████████████| 200/200 [00:03<00:00, 62.52it/s]
Score:  0.36194208552859847
validating selected IDs
loading selected training data
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 23.42it/s]
Loading nontar