This notebook works through running the [DataPerf Speech](https://www.dataperf.org/training-set-selection-speech) challenge evaluation with a [baseline selection algorithm](https://github.com/harvard-edge/dataperf-speech-example/blob/main/selection/implementations/baseline_selection.py).

We start by cloning our example selection algorithm repository and installing some additional dependencies not preinstalled in Colab environments:

In [None]:
!pip install -q fire wget
!git clone https://github.com/harvard-edge/dataperf-speech-example/
import sys
sys.path.append("/content/dataperf-speech-example/")
import os


Cloning into 'dataperf-speech-example'...
remote: Enumerating objects: 426, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (29/29), done.
remote: Total 426 (delta 11), reused 8 (delta 8), pack-reused 389 (from 1)
Receiving objects: 100% (426/426), 120.71 KiB | 889.00 KiB/s, done.
Resolving deltas: 100% (230/230), done.


FileNotFoundError: [Errno 2] No such file or directory: '/content/dataperf-speech-example/'

Next, we download the spoken word embeddings which we will use for training coreset selection and evaluation.

In [3]:
os.chdir("dataperf-speech-example/")

In [4]:
!python utils/download_data.py --output_path workspace/data 1> /dev/null

Below, we generate a set of 25 training samples for each language from the available embeddings, using our default selection algorithm (which simply performs cross-fold validation). The evaluation strategy can be changed by editing `dataperf-speech-example/workspace/dataperf_speech_config.yaml`.

The goal of this challenge is to add your own selection algorithm and outperform the provided baselines' macro F1 scores.
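
As a starting point for your own algorithm, here is a minimal sketch of the simplest possible strategy: random sampling under a fixed per-class budget. The function name, its arguments, and the returned `targets`/`nontargets` layout are all assumptions made for illustration, not the repository's interface; check the baseline implementation linked at the top for the class your code actually needs to provide.

```python
import random


def random_selection(target_pool, nontarget_pool, per_target_class_size,
                     nontarget_class_size, seed=0):
    """Pick a fixed number of sample IDs per target word plus a nontarget set.

    Hypothetical helper: the names and the returned layout are assumptions,
    not the interface required by the dataperf-speech-example repository.
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible
    selected_targets = {
        word: rng.sample(ids, min(per_target_class_size, len(ids)))
        for word, ids in sorted(target_pool.items())
    }
    selected_nontargets = rng.sample(
        nontarget_pool, min(nontarget_class_size, len(nontarget_pool))
    )
    return {"targets": selected_targets, "nontargets": selected_nontargets}


# Toy pool (made-up IDs): 5 target words with 20 candidates each, 100 nontargets.
target_pool = {
    f"word{i}": [f"word{i}/clip{j}" for j in range(20)] for i in range(5)
}
nontarget_pool = [f"other/clip{j}" for j in range(100)]

selection = random_selection(target_pool, nontarget_pool, 3, 10)
total = sum(len(ids) for ids in selection["targets"].values())
total += len(selection["nontargets"])
print(total)  # 5 words * 3 + 10 nontargets = 25
```

The budget used here (3 samples per target word, 10 nontarget samples) mirrors the `per_target_class_size` and `nontarget_class_size` values the baseline logs for a training size of 25.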

The selection algorithm will output a training file for each language, `en_25_train.json`, `id_25_train.json`, and `pt_25_train.json`.

These are the files you would upload to Dynabench for official evaluation. The next cell runs the selection; afterwards, we run an unofficial local evaluation using our provided evaluation data.
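
Once the selection has run, it can be worth checking that each file holds the number of samples you expect before uploading it. The sketch below counts the leaf entries under each top-level key of a JSON file without assuming a particular schema; `summarize_selection` and `count_ids` are hypothetical helpers, not part of the repository.

```python
import json
from pathlib import Path


def count_ids(node):
    """Recursively count leaf entries in nested dicts and lists."""
    if isinstance(node, dict):
        return sum(count_ids(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_ids(value) for value in node)
    return 1  # any scalar (for example a sample ID string) counts once


def summarize_selection(path):
    """Return {top_level_key: leaf_count}, or None if the file is missing."""
    path = Path(path)
    if not path.exists():
        return None
    with path.open() as f:
        data = json.load(f)
    if not isinstance(data, dict):
        return {"<root>": count_ids(data)}
    return {key: count_ids(value) for key, value in data.items()}


# File names follow the pattern described above, for a training size of 25.
for lang in ["en", "id", "pt"]:
    summary = summarize_selection(f"{lang}_25_train.json")
    print(lang, summary if summary is not None else "file not found yet")
```

If the counts do not add up to the training size you requested, the selection is worth inspecting before you evaluate it.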

In [6]:
TRAIN_SIZE = 25 # or 60
for lang in ["en", "id", "pt"]:
  !python -m selection.main \
     --language "{lang}" \
     --allowed_training_set "workspace/data/dataperf_{lang}_data/allowed_training_set.yaml" \
     --train_embeddings_dir "workspace/data/dataperf_{lang}_data/train_embeddings/" \
     --train_size {TRAIN_SIZE} \
     --outdir ""

Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 19.52it/s]
Loading nontargets: 100%|█████████████████████| 100/100 [00:06<00:00, 15.42it/s]
num_targets: 15
per_target_class_size: 3
nontarget_class_size: 10
Using original method to select data subset ...
k-fold cross validation: 100%|██████████████████| 10/10 [01:41<00:00, 10.19s/it]
final best_score=0.6537003019500972
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 50.66it/s]
Loading nontargets: 100%|█████████████████████| 100/100 [00:01<00:00, 84.34it/s]
num_targets: 15
per_target_class_size: 3
nontarget_class_size: 10
Using original method to select data subset ...
k-fold cross validation: 100%|██████████████████| 10/10 [00:27<00:00,  2.74s/it]
final best_score=0.6481170711274378
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 23.32it/s]
Loading nontargets: 100%|█████████████████████| 100/100 [00:03<00:00, 30.19it/s]
num_targets: 15
per_target_class_size: 3
nontarget_c

Finally, let's run a local unofficial evaluation on the results of the training set selection algorithm (`en_25_train.json`, `id_25_train.json`, and `pt_25_train.json`). 

For each language, we load the coreset training samples specified in the JSON file, along with the evaluation samples specified in `eval.yaml`. We then train an ensemble classifier, [average the macro F1 score across ten random seeds](https://github.com/harvard-edge/dataperf-speech-example/blob/main/eval.py#L139-L154), and display the result, which should match the scores on the Dynabench leaderboard for coreset sizes of 25 and 60.
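
To make the metric concrete, here is a dependency-free sketch of macro F1 (the unweighted mean of per-class F1 scores) averaged over several runs. It illustrates the metric only; the linked `eval.py` remains the authoritative implementation, and the labels and per-seed predictions below are made up.

```python
from statistics import mean


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


# Made-up ground truth and predictions from two hypothetical seeds.
y_true = ["a", "a", "b", "b", "c", "c"]
predictions_per_seed = [
    ["a", "a", "b", "c", "c", "c"],  # one "b" sample misclassified as "c"
    ["a", "a", "b", "b", "c", "c"],  # perfect predictions
]

per_seed_scores = [macro_f1(y_true, y_pred) for y_pred in predictions_per_seed]
final_score = mean(per_seed_scores)
print(f"Score: {final_score:.4f}")
```

Because every class contributes equally to the mean, a rare target word that is always misclassified lowers the score as much as a common one would.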

Here is sample output for English with a coreset size of 25, using the `en_25_train.json` file produced by the selection cell above (exact scores may vary from run to run):

```
validating selected IDs
loading selected training data
Loading targets: 100% 5/5 [00:00<00:00, 17.97it/s]
Loading nontargets: 100% 9/9 [00:00<00:00, 140.54it/s]
loading eval data
Loading targets: 100% 5/5 [00:00<00:00, 119.50it/s]
Loading nontargets: 100% 200/200 [00:12<00:00, 16.11it/s]

Score:  0.3524448610675314
```


In [11]:
for lang in ["en", "id", "pt"]:
  !python eval.py \
    --language "{lang}" \
    --eval_embeddings_dir "workspace/data/dataperf_{lang}_data/eval_embeddings/" \
    --train_embeddings_dir "workspace/data/dataperf_{lang}_data/train_embeddings/" \
    --allowed_training_set "workspace/data/dataperf_{lang}_data/allowed_training_set.yaml" \
    --eval_file "workspace/data/dataperf_{lang}_data/eval.yaml" \
    --train_file "{lang}_{TRAIN_SIZE}_train.json" \
    --train_size {TRAIN_SIZE} \
    --config_file workspace/dataperf_speech_config.yaml




validating selected IDs
loading selected training data
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 18.83it/s]
Loading nontargets: 100%|████████████████████████| 9/9 [00:00<00:00, 106.17it/s]
loading eval data
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 70.26it/s]
Loading nontargets: 100%|█████████████████████| 200/200 [00:18<00:00, 10.55it/s]
Score:  0.31644245062823384
validating selected IDs
loading selected training data
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 52.14it/s]
Loading nontargets: 100%|██████████████████████| 10/10 [00:00<00:00, 182.26it/s]
loading eval data
Loading targets: 100%|███████████████████████████| 5/5 [00:00<00:00, 139.98it/s]
Loading nontargets: 100%|█████████████████████| 200/200 [00:03<00:00, 62.03it/s]
Score:  0.3815048781660937
validating selected IDs
loading selected training data
Loading targets: 100%|████████████████████████████| 5/5 [00:00<00:00, 24.01it/s]
Loading nontarg