## Sample demo for using ESPnet-Easy!
In this notebook, we will train an ASR model on the Librispeech-100 dataset. The notebook follows the same data preparation process as a Kaldi-style dataset. If you want to fine-tune a pretrained model, please refer to the `libri100_finetune.ipynb` file.

This notebook assumes that you have already downloaded the Librispeech-100 dataset from [OpenSLR](https://www.openslr.org/12) and placed it under `/hdd/database/librispeech-100/`, which is the path used in the code cells.
Please replace this path with your own.

### Data Preparation
This notebook follows the data preparation steps written in `asr.sh`. First, we will create dump files that store the data ID, the audio path, and the transcription.

ESPnet-Easy accepts several types of datasets, including:
- Dictionary-based dataset with the following structure:
  ```python
  {
    "data_id": {
        "speech": path_to_speech_file,
        "text": transcription
    },
  }
  ```
- List-based dataset with the following structure:
  ```python
  [
    {
        "speech": path_to_speech_file,
        "text": transcription
    },
  ]
  ```

If you want to use a dictionary-based dataset, each `data_id` must be unique.
ESPnet-Easy also accepts dump files already created by `asr.sh`, but in this notebook we will create the dump files from scratch.
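
The two layouts carry the same information, so a list-based dataset can be turned into a dictionary-based one by assigning an ID to each entry. The sketch below is illustrative only: `list_to_dict_dataset` and its `id_fn` argument are hypothetical helpers, not part of ESPnet-Easy, and the file paths are made up. It derives each ID from the audio file name and raises an error when an ID repeats.

```python
import os


def list_to_dict_dataset(items, id_fn=None):
    """Convert a list-based dataset into a dictionary-based one.

    Hypothetical helper for illustration; not an ESPnet-Easy function.
    """
    if id_fn is None:
        # Assumption: the audio file name (without extension) is a usable ID.
        def id_fn(item):
            return os.path.splitext(os.path.basename(item["speech"]))[0]

    dataset = {}
    for item in items:
        data_id = id_fn(item)
        if data_id in dataset:
            # Dictionary-based datasets need a unique ID per entry.
            raise ValueError(f"Duplicate data_id: {data_id}")
        dataset[data_id] = {"speech": item["speech"], "text": item["text"]}
    return dataset


example = [
    {"speech": "/path/to/1255-138279-0008.flac", "text": "TWO THREE"},
    {"speech": "/path/to/1255-138279-0022.flac", "text": "IF I SAID SO OF COURSE I WILL"},
]
converted = list_to_dict_dataset(example)
print(sorted(converted))
```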

In [None]:
# Need to install espnet if you don't have it
!pwd
!pip install -U ../../
!pip install torchaudio

In [None]:
# First, we define a function to create a dataset in dictionary format.

import os
import glob


def create_dataset(data_dir):
    dataset = {}
    for chapter in glob.glob(os.path.join(data_dir, "*/*")):
        text_file = glob.glob(os.path.join(chapter, "*.txt"))[0]

        with open(text_file, "r") as f:
            lines = f.readlines()

        ids_text = {
            line.split(" ")[0]: line.split(" ", maxsplit=1)[1].replace("\n", "")
            for line in lines
        }
        # LibriSpeech distributes its audio as FLAC files.
        audio_files = glob.glob(os.path.join(chapter, "*.flac"))
        for audio_file in audio_files:
            audio_id = os.path.splitext(os.path.basename(audio_file))[0]
            dataset[audio_id] = {"speech": audio_file, "text": ids_text[audio_id]}
    return dataset

Then, let's create the dump files!  
Note that you need to provide a dictionary that specifies the dump file for each data entry.
The dictionary should have the following format:

```python
{
    "data_name": ["dump_file_name", "dump_format"]
}
```

Here `dump_file_name` is the file name of the dump file and `dump_format` is the type of the input.
The type must be one of the data types listed in [DATA_TYPES](https://github.com/espnet/espnet/blob/1409d89d1ca33417a7f57e4cfa77925a4f00cc3f/espnet2/train/dataset.py#L208).
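
To make the role of this dictionary concrete, the sketch below writes Kaldi-style files from a dictionary-based dataset: one file per data name, with one `<data_id> <value>` line per entry. It is a simplified stand-in written for this explanation, not the ESPnet-Easy implementation. `write_dump_files` is a hypothetical name, the toy paths are made up, and the format entry is carried along without being interpreted.

```python
import os
import tempfile


def write_dump_files(dump_dir, dataset, data_info):
    """Write one '<data_id> <value>' file per entry in data_info.

    Hypothetical stand-in for illustration only.
    """
    os.makedirs(dump_dir, exist_ok=True)
    for data_name, (file_name, _dump_format) in data_info.items():
        with open(os.path.join(dump_dir, file_name), "w", encoding="utf-8") as f:
            # Sort by ID so every dump file lists entries in the same order.
            for data_id in sorted(dataset):
                f.write(f"{data_id} {dataset[data_id][data_name]}\n")


toy_dataset = {
    "utt2": {"speech": "/audio/utt2.flac", "text": "IF I SAID SO"},
    "utt1": {"speech": "/audio/utt1.flac", "text": "TWO THREE"},
}
toy_info = {"speech": ["wav.scp", "sound"], "text": ["text", "text"]}

with tempfile.TemporaryDirectory() as tmp:
    write_dump_files(tmp, toy_dataset, toy_info)
    with open(os.path.join(tmp, "wav.scp"), encoding="utf-8") as f:
        wav_scp_lines = f.read().splitlines()
    with open(os.path.join(tmp, "text"), encoding="utf-8") as f:
        text_lines = f.read().splitlines()

print(wav_scp_lines)
print(text_lines)
```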

In [None]:
import espnetez as ez

# Then create the dump files
DUMP_DIR = "./dump/libri100"
LIBRI_100_DIRS = [
    ["/hdd/database/librispeech-100/LibriSpeech/train-clean-100", "train"],
    ["/hdd/database/librispeech-100/LibriSpeech/dev-clean", "dev-clean"],
    ["/hdd/database/librispeech-100/LibriSpeech/dev-other", "dev-other"],
]
data_info = {
    "speech": ["wav.scp", "sound"],
    "text": ["text", "text"],
}

for d, n in LIBRI_100_DIRS:
    dump_dir = os.path.join(DUMP_DIR, n)
    if not os.path.exists(dump_dir):
        os.makedirs(dump_dir)

    dataset = create_dataset(d)
    ez.data.create_dump_file(dump_dir, dataset, data_info)

For the dev set, we have the `dev-clean` and `dev-other` directories.
We can join them into a single dev dataset with the `ez.data.join_dumps` function.

In [None]:
ez.data.join_dumps(
    ["./dump/libri100/dev-clean", "./dump/libri100/dev-other"], "./dump/libri100/dev"
)
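
Conceptually, joining dump directories amounts to concatenating the files that share a name. The sketch below shows that idea with the standard library only. `concat_dump_dirs` is a hypothetical helper and may differ from what `ez.data.join_dumps` does internally; the toy utterance IDs are made up.

```python
import os
import tempfile


def concat_dump_dirs(src_dirs, dst_dir):
    """Concatenate same-named files from src_dirs into dst_dir (hypothetical helper)."""
    os.makedirs(dst_dir, exist_ok=True)
    # Assumption: every source directory contains the same set of file names.
    for file_name in sorted(os.listdir(src_dirs[0])):
        with open(os.path.join(dst_dir, file_name), "w", encoding="utf-8") as out:
            for src in src_dirs:
                with open(os.path.join(src, file_name), encoding="utf-8") as f:
                    for line in f:
                        # Normalize line endings so entries never run together.
                        out.write(line.rstrip("\n") + "\n")


with tempfile.TemporaryDirectory() as tmp:
    for name, utt in [("dev-clean", "a-0001"), ("dev-other", "b-0001")]:
        src_dir = os.path.join(tmp, name)
        os.makedirs(src_dir)
        with open(os.path.join(src_dir, "text"), "w", encoding="utf-8") as f:
            f.write(f"{utt} HELLO\n")

    concat_dump_dirs(
        [os.path.join(tmp, "dev-clean"), os.path.join(tmp, "dev-other")],
        os.path.join(tmp, "dev"),
    )
    with open(os.path.join(tmp, "dev", "text"), encoding="utf-8") as f:
        joined_lines = f.read().splitlines()

print(joined_lines)
```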

Now the dataset files are in the `dump` directory.
They look like this:

wav.scp
```
1255-138279-0008 /hdd/database/librispeech-100/LibriSpeech/dev-other/1255/138279/1255-138279-0008.flac
1255-138279-0022 /hdd/database/librispeech-100/LibriSpeech/dev-other/1255/138279/1255-138279-0022.flac
```

text
```
1255-138279-0008 TWO THREE
1255-138279-0022 IF I SAID SO OF COURSE I WILL
```
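
Because both files key every line by the same utterance ID, a quick consistency check is easy to write. The following sketch parses `<id> <value>` lines and confirms that `wav.scp` and `text` cover the same IDs. The `read_kaldi_lines` helper is hypothetical and introduced only for illustration, and the audio paths are shortened placeholders.

```python
def read_kaldi_lines(lines):
    """Parse '<id> <value>' lines into a dict (hypothetical helper)."""
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first space only, so transcriptions keep their spaces.
        utt_id, value = line.split(" ", maxsplit=1)
        entries[utt_id] = value
    return entries


wav_scp = read_kaldi_lines([
    "1255-138279-0008 /path/to/1255-138279-0008.flac\n",
    "1255-138279-0022 /path/to/1255-138279-0022.flac\n",
])
text = read_kaldi_lines([
    "1255-138279-0008 TWO THREE\n",
    "1255-138279-0022 IF I SAID SO OF COURSE I WILL\n",
])

# IDs that appear in only one of the two files indicate a broken dump.
mismatched = set(wav_scp) ^ set(text)
print(mismatched)
```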


### Train sentencepiece model

Next, we will train a SentencePiece model. This requires a text file for training, so let's create one first.

In [None]:
# generate training texts from the training data
# you can select several datasets to train sentencepiece.
ez.preprocess.prepare_sentences(["dump/libri100/train/text"], "dump/spm")

ez.preprocess.train_sentencepiece(
    "dump/spm/train.txt",
    "data/bpemodel",
    vocab_size=5000,
)
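
The tokenizer is trained on raw sentences, so the utterance IDs in the `text` dump files have to be stripped first. The sketch below shows one way this could be done by hand. `strip_utterance_ids` is a hypothetical helper and is not claimed to match `ez.preprocess.prepare_sentences` line for line.

```python
def strip_utterance_ids(lines):
    """Drop the leading utterance ID from each '<id> <transcription>' line."""
    sentences = []
    for line in lines:
        parts = line.strip().split(" ", maxsplit=1)
        if len(parts) == 2:  # skip lines that have an ID but no transcription
            sentences.append(parts[1])
    return sentences


sentences = strip_utterance_ids([
    "1255-138279-0008 TWO THREE\n",
    "1255-138279-0022 IF I SAID SO OF COURSE I WILL\n",
    "1255-138279-0099\n",
])
print(sentences)
```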

Okay, we have finished the data preparation. Now we will configure the training process. We can use the configuration files already created by the ESPnet contributors.

To use a configuration file, we need the YAML file on the local machine. For example, I will use this [e-branchformer config](train_asr_e_branchformer_size256_mlp1024_linear1024_e12_mactrue_edrop0.0_ddrop0.0.yaml).

I changed the `batch_bins` parameter from `16000000` to `1600000` so that training fits on my GPU (RTX 2080 Ti).

One thing that differs from the original ESPnet in this notebook is how the token list is defined.

The original ESPnet configuration lists every token in the YAML file, whereas in ESPnet-Easy we just give the path to the vocab file.
In the code below, this path is read from the `token_list` entry of the preprocessing configuration, `preprocess.yaml`.

Before training, we need to prepare the stats file.
We can do this by running the `collect_stats` method.

In [None]:
import espnetez as ez

EXP_DIR = "exp/train_asr_branchformer_e24_amp"
STATS_DIR = "exp/stats_all"
training_config = ez.config.from_yaml(
    "asr",
    "train_asr_e_branchformer_size256_mlp1024_linear1024_e12_mactrue_edrop0.0_ddrop0.0.yaml",
)
preprocessor_config = ez.utils.load_yaml("preprocess.yaml")
training_config.update(preprocessor_config)

# replace token list if required.
with open(preprocessor_config["token_list"], "r") as f:
    training_config["token_list"] = [t.replace("\n", "") for t in f.readlines()]

trainer = ez.Trainer(
    task='asr',
    train_config=training_config,
    train_dump_dir="dump/libri100/train",
    valid_dump_dir="dump/libri100/dev",
    data_info=data_info,
    output_dir=EXP_DIR,
    stats_dir=STATS_DIR,
    ngpu=1,
)
trainer.collect_stats()
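
The token-list step in the cell above can be tried in isolation. The sketch below assumes a vocab file with one token per line; the file name `tokens.txt` and the sample tokens are made up for illustration. It uses plain dictionaries in place of the real configuration objects to show that `update` overrides the existing key with the path, which is then replaced by the list of tokens.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Assumption: the vocab file stores one token per line.
    vocab_path = os.path.join(tmp, "tokens.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("<blank>\n<unk>\nTHE\n<sos/eos>\n")

    # Plain dicts standing in for the training and preprocessing configs.
    toy_training_config = {"batch_bins": 1600000, "token_list": []}
    toy_preprocessor_config = {"token_list": vocab_path}

    # update() overwrites keys that already exist and keeps the others.
    toy_training_config.update(toy_preprocessor_config)

    # Replace the path with the actual list of tokens.
    with open(toy_training_config["token_list"], encoding="utf-8") as f:
        toy_training_config["token_list"] = [t.rstrip("\n") for t in f]

print(toy_training_config["token_list"])
```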

Finally, let's run the training!

In [None]:
trainer.train()