Notebook for testing SALT functions on a local machine.

In [None]:
import sys
sys.path.append('../../..')
import salt.dataset
import salt.utils
import yaml

## One-to-many translation: English text to Luganda and Acholi text

In [None]:
yaml_config = '''
huggingface_load:
  path: Sunbird/salt
  split: train
  name: text-all
source:
  type: text
  language: eng
  preprocessing:
    - prefix_target_language
target:
  type: text
  language: [lug, ach]
'''

config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)
salt.utils.show_dataset(ds, N=5)
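
With more than one target language, each source sentence has to indicate which language it should be translated into, which is presumably what the `prefix_target_language` step is for. The sketch below illustrates the idea only: the `>>lug<<` tag format and the helper names are assumptions made for illustration, not taken from the SALT code.

```python
def prefix_with_target_language(source_text, target_language):
    """Hypothetical helper: prepend a target-language tag to the source text."""
    return f">>{target_language}<< {source_text}"


def expand_one_to_many(source_text, target_languages):
    """Create one tagged source sentence per target language."""
    return [prefix_with_target_language(source_text, lang)
            for lang in target_languages]


examples = expand_one_to_many("Good morning.", ["lug", "ach"])
for example in examples:
    print(example)
```

Each English sentence therefore yields one training example per target language, and the tag is the only thing that tells them apart on the source side.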

## ASR: Luganda speech to text

In [None]:
yaml_config = '''
huggingface_load:
    path: Sunbird/salt
    split: train
    name: multispeaker-lug
source:
  type: speech
  language: lug
target:
  type: text
  language: lug
'''

config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)

salt.utils.show_dataset(ds, audio_features=['source'], N=10)

The next example adds preprocessing to both the audio (normalization, speed augmentation and noise augmentation) and the text (ensuring that each sentence ends with punctuation).

In [None]:
yaml_config = '''
huggingface_load:
    path: Sunbird/salt
    split: train
    name: multispeaker-lug
source:
  type: speech
  language: lug
  preprocessing:
    - normalize_audio
    - augment_audio_speed:
        low: 0.95
        high: 1.15
    - augment_audio_noise:
        max_relative_amplitude: 0.5
        noise_audio_repo:
            path: Sunbird/urban-noise
            name: small
            split: train
target:
  type: text
  language: lug
  preprocessing:
    - ensure_text_ends_with_punctuation
'''

config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)

salt.utils.show_dataset(ds, audio_features=['source'], N=10)
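
To make the audio options above more concrete, here is a minimal pure-Python sketch of peak normalization followed by additive noise. It assumes that `max_relative_amplitude` bounds the noise peak relative to the speech peak and that the actual ratio is drawn uniformly at random; both are guesses about the semantics, and the function names are hypothetical rather than SALT's own.

```python
import random


def normalize_peak(samples):
    """Scale samples so that the largest absolute value is 1.0."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    return [s / peak for s in samples]


def add_scaled_noise(samples, noise, max_relative_amplitude, rng):
    """Mix in noise whose peak is at most max_relative_amplitude * speech peak."""
    signal_peak = max(abs(s) for s in samples)
    noise_peak = max(abs(n) for n in noise)
    if signal_peak == 0 or noise_peak == 0:
        return list(samples)
    relative = rng.uniform(0.0, max_relative_amplitude)
    scale = relative * signal_peak / noise_peak
    # Loop the noise clip if it is shorter than the speech clip.
    return [s + scale * noise[i % len(noise)] for i, s in enumerate(samples)]


rng = random.Random(0)          # seeded so the sketch is repeatable
speech = [0.0, 0.2, -0.4, 0.1]  # toy waveform standing in for real audio
noise = [0.05, -0.05]           # toy noise clip

clean = normalize_peak(speech)
noisy = add_scaled_noise(clean, noise, max_relative_amplitude=0.5, rng=rng)
```

Real pipelines operate on NumPy arrays or tensors rather than lists; plain lists are used here only to keep the sketch free of dependencies.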

## ASR: Combine Common Voice and SALT

This example combines the first 5,000 training examples of Common Voice Luganda with SALT English speech, giving multilingual ASR data that mixes Luganda and English.

In [None]:
%%time
yaml_config = '''
huggingface_load:
  - path: mozilla-foundation/common_voice_13_0
    split: train[:5000]
    name: lg
  - path: Sunbird/salt
    name: multispeaker-eng
    split: train
source:
  type: speech
  language: [lug,eng]
  preprocessing:
    - set_sample_rate:
        rate: 16_000
target:
  type: text
  language: [lug,eng]
  preprocessing:
    - clean_and_remove_punctuation
    - lower_case
shuffle: True
'''

config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)

salt.utils.show_dataset(ds, audio_features=['source'], N=5)
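
The target-side steps in this config run in the order listed. A rough sketch of such a text pipeline is shown below; the two functions only mimic the names `clean_and_remove_punctuation` and `lower_case`, and their bodies are assumptions (for example, this version strips every ASCII punctuation mark, including apostrophes, which the real SALT function may treat differently).

```python
import string


def clean_and_remove_punctuation(text):
    """Assumed behaviour: drop ASCII punctuation and collapse whitespace."""
    without_punctuation = text.translate(
        str.maketrans("", "", string.punctuation))
    return " ".join(without_punctuation.split())


def lower_case(text):
    """Assumed behaviour: convert the text to lower case."""
    return text.lower()


def apply_pipeline(text, steps):
    """Apply each preprocessing step in order, as a config list would."""
    for step in steps:
        text = step(text)
    return text


result = apply_pipeline("  Hello,   World! ",
                        [clean_and_remove_punctuation, lower_case])
print(result)
```

Normalizing transcripts in this way keeps Common Voice and SALT text in a consistent form when the two sources are mixed.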

## Speech translation: Acholi speech to English text

In [None]:
yaml_config = '''
huggingface_load:
  join:
    - path: Sunbird/salt
      split: train
      name: text-all
    - path: Sunbird/salt
      split: train
      name: multispeaker-ach
source:
  type: speech
  language: ach
target:
  type: text
  language: eng
  preprocessing:
    - clean_and_remove_punctuation
    - lower_case
'''

config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)

salt.utils.show_dataset(ds.take(5), audio_features=['source'])
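
The `join` entry pairs Acholi audio from one subset with English text from another. A plausible reading is an inner join on a shared example identifier; the sketch below shows that idea with plain dictionaries. The `id`, `eng_text` and `ach_audio` column names and the `inner_join` helper are hypothetical and not taken from the SALT schema.

```python
def inner_join(left_rows, right_rows, key):
    """Keep rows whose key appears on both sides and merge their columns."""
    right_by_key = {row[key]: row for row in right_rows}
    return [{**left, **right_by_key[left[key]]}
            for left in left_rows
            if left[key] in right_by_key]


# Toy stand-ins for the text and speech subsets.
text_rows = [
    {"id": 1, "eng_text": "Good morning."},
    {"id": 2, "eng_text": "Thank you."},
    {"id": 3, "eng_text": "See you tomorrow."},
]
audio_rows = [
    {"id": 1, "ach_audio": [0.0, 0.1, -0.1]},
    {"id": 3, "ach_audio": [0.2, 0.0, -0.2]},
]

joined = inner_join(text_rows, audio_rows, key="id")
for row in joined:
    print(row["id"], row["eng_text"], len(row["ach_audio"]))
```

Under this reading, sentences without a matching recording (id 2 here) drop out, so the joined dataset is only as large as the overlap between the two subsets.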

## Random augmentation

In [None]:
yaml_config = '''
huggingface_load:
  path: Sunbird/salt
  split: train
  name: text-all
source:
  type: text
  language: eng
  preprocessing:
    - augment_characters:
          action: swap
          spec_char: None  # note: YAML reads this as the string 'None', not null
          include_numeric: False
          aug_word_p: 0.1
          aug_word_min: 0
    - prefix_target_language
target:
  type: text
  language: [lug, ach, teo, ibo]
'''

config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)

In [None]:
salt.utils.show_dataset(ds, N=10)
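
For intuition about the `swap` action, the sketch below swaps two adjacent characters in randomly chosen words. It treats `aug_word_p` as an independent per-word probability, which is a simplifying assumption and may not match how the underlying augmenter interprets it; `swap_augment` is a hypothetical name.

```python
import random


def swap_adjacent_chars(word, rng):
    """Swap one randomly chosen pair of neighbouring characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def swap_augment(text, aug_word_p, rng):
    """Augment each word with probability aug_word_p (assumed semantics)."""
    words = text.split()
    return " ".join(
        swap_adjacent_chars(word, rng) if rng.random() < aug_word_p else word
        for word in words)


sentence = "the quick brown fox jumps over the lazy dog"
rng = random.Random(0)  # seeded so the sketch is repeatable
print(swap_augment(sentence, aug_word_p=0.1, rng=rng))
```

Because only the source side is perturbed, the model sees noisy English paired with clean targets, which is the usual motivation for this kind of augmentation.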

## Reloading the module for debugging

After changing the library code, reload the affected modules and delete the cached HuggingFace generator datasets. Otherwise stale module references or cached data can cause unexpected behaviour.

In [None]:
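from importlib import reload
# Re-execute the edited modules so that the notebook picks up code changes.
reload(salt.dataset)
reload(salt.utils)
reload(salt.dataset.preprocessing)
# Remove cached generator datasets so that they are rebuilt with the new code.
!rm -rf ~/.cache/huggingface/datasets/generator/*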
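
The reason stale references matter can be seen in a small standalone experiment. The sketch below writes a throwaway module (`demo_module`, a hypothetical name), imports it, edits it on disk and reloads it: attribute lookups on the module see the new code, while a name imported with `from ... import ...` before the reload still points at the old function.

```python
import importlib
import pathlib
import sys
import tempfile

# Avoid bytecode caches interfering with this small demonstration.
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp:
    module_path = pathlib.Path(tmp) / "demo_module.py"
    module_path.write_text("def greet():\n    return 'old'\n")
    sys.path.insert(0, tmp)

    import demo_module
    from demo_module import greet as stale_greet  # direct function reference

    # Simulate editing the source file, then reload the module in place.
    module_path.write_text("def greet():\n    return 'new version'\n")
    importlib.reload(demo_module)

    fresh = demo_module.greet()  # looked up on the reloaded module
    stale = stale_greet()        # still bound to the pre-reload function

    # Clean up so the temporary module does not linger in the session.
    sys.path.remove(tmp)
    del sys.modules["demo_module"]

print(fresh, stale)
```

In a notebook, the practical consequence is to call functions through the module (`salt.dataset.create`) or to re-run any `from ... import ...` lines after reloading.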