In [2]:
import sys
sys.path.append('../..')
import salt.dataset
import salt.utils
import yaml

## One-to-many translation: English text to Luganda and Acholi text

In [5]:
yaml_config = '''
huggingface_load:
  path: Sunbird/salt
  split: train
  name: text-all
source:
  type: text
  language: eng
  preprocessing:
    - prefix_target_language
target:
  type: text
  language: [lug, ach]
'''

config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)
salt.utils.show_dataset(ds, N=5)

Unnamed: 0,source,target,source.language,target.language
0,>>lug<< Eggplants always grow best under warm conditions.,Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu,eng,lug
1,>>ach<< Eggplants always grow best under warm conditions.,Bilinyanya pol kare dongo maber ka lyeto tye,eng,ach
2,>>lug<< Farmland is sometimes a challenge to farmers.,Ettaka ly'okulimirako n'okulundirako ebiseera ebimu kisoomooza abalimi,eng,lug
3,>>ach<< Farmland is sometimes a challenge to farmers.,Ngom me pur i kare mukene obedo peko madit bot lupur,eng,ach
4,>>lug<< Farmers should be encouraged to grow more coffee.,Abalimi balina okukubirizibwa okwongera okulima emmwanyi,eng,lug
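
The output above shows each English sentence repeated once per target language, with a `>>lang<<` tag prepended to the source text. The following is a minimal sketch of that expansion in plain Python, not the SALT implementation: the record layout (one dictionary keyed by language code) and the helper `expand_to_targets` are assumptions made for illustration.

```python
def prefix_target_language(text: str, target_language: str) -> str:
    """Prepend a >>lang<< tag, mirroring the format seen in the output above."""
    return f">>{target_language}<< {text}"


def expand_to_targets(record: dict, source_language: str, target_languages: list) -> list:
    """Build one source/target pair per target language (hypothetical helper)."""
    pairs = []
    for lang in target_languages:
        if lang not in record:  # skip languages that have no translation
            continue
        pairs.append({
            "source": prefix_target_language(record[source_language], lang),
            "target": record[lang],
            "source.language": source_language,
            "target.language": lang,
        })
    return pairs


# Assumed record layout: one parallel sentence keyed by language code.
record = {
    "eng": "Eggplants always grow best under warm conditions.",
    "lug": "Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu",
    "ach": "Bilinyanya pol kare dongo maber ka lyeto tye",
}

pairs = expand_to_targets(record, "eng", ["lug", "ach"])
for pair in pairs:
    print(pair["source"], "->", pair["target"])
```

The tag tells a single multilingual model which language to produce, so one model can serve several translation directions.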


## ASR: Luganda speech to text

In [6]:
yaml_config = '''
huggingface_load:
  path: Sunbird/salt
  split: train
  name: multispeaker-lug
source:
  type: speech
  language: lug
target:
  type: text
  language: lug
  preprocessing:
    - clean_and_remove_punctuation
    - lower_case
'''

config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)

salt.utils.show_dataset(ds, audio_features=['source'], N=5)

Unnamed: 0,source,target,source.language,target.language
0,(audio),bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu,lug,lug
1,(audio),bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu,lug,lug
2,(audio),bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu,lug,lug
3,(audio),bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu,lug,lug
4,(audio),bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu,lug,lug
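
The target text in this configuration passes through `clean_and_remove_punctuation` and then `lower_case`. The sketch below shows one plausible reading of those two steps using only the standard library. The function bodies are assumptions, not the SALT source, and the real implementation may treat apostrophes or non-ASCII punctuation differently.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_and_remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation and collapse runs of whitespace (assumed behaviour)."""
    return " ".join(text.translate(_PUNCTUATION_TABLE).split())


def lower_case(text: str) -> str:
    """Lower-case the whole string."""
    return text.lower()


def normalise_transcript(text: str) -> str:
    """Apply the two steps in the order listed in the config."""
    return lower_case(clean_and_remove_punctuation(text))


print(normalise_transcript("Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu."))
# bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu
```

Normalising transcripts this way keeps the ASR target vocabulary small, at the cost of losing casing and punctuation in the model output.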


## ASR: Combine Common Voice and SALT

This example also shows multilingual ASR data, with a mixture of Luganda and English.

In [12]:
%%time
yaml_config = '''
huggingface_load:
  - path: mozilla-foundation/common_voice_13_0
    split: train[:5000]
    name: lg
  - path: Sunbird/salt
    name: multispeaker-eng
    split: train
source:
  type: speech
  language: [lug,eng]
  preprocessing:
    - set_sample_rate:
        rate: 16_000
target:
  type: text
  language: [lug,eng]
  preprocessing:
    - clean_and_remove_punctuation
    - lower_case
shuffle: True
'''

config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)

salt.utils.show_dataset(ds, audio_features=['source'], N=5)

Unnamed: 0,source,target,source.language,target.language
0,(audio),emmere bwetokota nga tonnaganaaba engalo tezitukula,lug,lug
1,(audio),endwadde zebisolo zikosa nnyo amakungula gabalimi,lug,lug
2,(audio),abakuumaddembe babeera mu nnyumba eyonoonese,lug,lug
3,(audio),nina omukisa nti ku ttaka lyobwakabaka ninako ebibanja bitaano,lug,lug
4,(audio),munnamateeka wa famire alina ekiraamo kya taata wange,lug,lug


CPU times: user 3min 36s, sys: 3.43 s, total: 3min 40s
Wall time: 1min 40s
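
Conceptually, listing several entries under `huggingface_load` and setting `shuffle: True` amounts to pooling the records of all sources and shuffling them. The sketch below illustrates that idea with plain Python lists. The helper `combine_and_shuffle` and the placeholder records are made up for illustration, and SALT may combine or shuffle its sources differently in practice.

```python
import random


def combine_and_shuffle(sources, seed=0):
    """Concatenate the records of several sources, then shuffle them reproducibly."""
    combined = [record for source in sources for record in source]
    random.Random(seed).shuffle(combined)
    return combined


# Placeholder records standing in for the two datasets in the config.
common_voice_lug = [{"text": f"lug sentence {i}", "language": "lug"} for i in range(3)]
salt_eng = [{"text": f"eng sentence {i}", "language": "eng"} for i in range(3)]

mixed = combine_and_shuffle([common_voice_lug, salt_eng], seed=42)
print([record["language"] for record in mixed])
```

Fixing the seed makes the mixture reproducible between runs, which helps when comparing training configurations.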


## Speech translation: Acholi speech to English text

In [13]:
yaml_config = '''
huggingface_load:
  join:
    - path: Sunbird/salt
      split: train
      name: text-all
    - path: Sunbird/salt
      split: train
      name: multispeaker-ach
source:
  type: speech
  language: ach
target:
  type: text
  language: eng
  preprocessing:
    - clean_and_remove_punctuation
    - lower_case
'''

config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)

salt.utils.show_dataset(ds.take(5), audio_features=['source'])

Unnamed: 0,source,target,source.language,target.language
0,(audio),eggplants always grow best under warm conditions,ach,eng
1,(audio),eggplants always grow best under warm conditions,ach,eng
2,(audio),farmland is sometimes a challenge to farmers,ach,eng
3,(audio),farmers should be encouraged to grow more coffee,ach,eng
4,(audio),uganda is focusing on farming,ach,eng
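
The `join` entry pairs Acholi audio from one dataset configuration with English text from another. A rough sketch of such a join follows, assuming both sides share a sentence identifier. The field names `id` and `audio`, the file names, and the helper `join_on_id` are hypothetical and are not taken from the SALT source.

```python
def join_on_id(text_rows, speech_rows, target_language):
    """Pair each recording with the translation that shares its sentence id."""
    text_by_id = {row["id"]: row for row in text_rows}
    pairs = []
    for speech in speech_rows:
        text = text_by_id.get(speech["id"])
        if text is None or target_language not in text:
            continue  # no matching translation for this recording
        pairs.append({"source": speech["audio"], "target": text[target_language]})
    return pairs


# Hypothetical rows: one sentence, recorded by two different speakers.
text_rows = [{"id": 0, "eng": "Eggplants always grow best under warm conditions."}]
speech_rows = [
    {"id": 0, "audio": "speaker_a_0.wav"},
    {"id": 0, "audio": "speaker_b_0.wav"},
    {"id": 7, "audio": "speaker_a_7.wav"},  # no text row with this id
]

pairs = join_on_id(text_rows, speech_rows, "eng")
for pair in pairs:
    print(pair["source"], "->", pair["target"])
```

Because several speakers can record the same sentence, the same English target can appear more than once, as in rows 0 and 1 of the output above.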


## Random augmentation

In [15]:
yaml_config = '''
huggingface_load:
  path: Sunbird/salt
  split: train
  name: text-all
source:
  type: text
  language: eng
  preprocessing:
    - augment_characters:
        action: swap
        spec_char: None  # note: YAML loads this as the string 'None', not a null
        include_numeric: False
        aug_word_p: 0.1
        aug_word_min: 0
    - prefix_target_language
target:
  type: text
  language: [lug, ach, teo, ibo]
'''

config = yaml.safe_load(yaml_config)
ds = salt.dataset.create(config)

In [16]:
salt.utils.show_dataset(ds, N=10)

Unnamed: 0,source,target,source.language,target.language
0,>>lug<< Eggplants always grow best under warm cnodiitnos.,Bbiringanya lubeerera asinga kukulira mu mbeera ya bugumu,eng,lug
1,>>ach<< Eggplants always grow best under arwm conditions.,Bilinyanya pol kare dongo maber ka lyeto tye,eng,ach
2,>>teo<< Egglpnast always grow best under warm conditions.,Epoloi ebirinyanyi ojok apakio nu emwanar akwap.,eng,teo
3,>>ibo<< Eggplants lawasy grow best under warm conditions.,A na-eto eggplants mgbe nile n'ọnọdụ okpomọkụ.,eng,ibo
4,>>lug<< Farmland is sometimes a challenge to afmrres.,Ettaka ly'okulimirako n'okulundirako ebiseera ebimu kisoomooza abalimi,eng,lug
5,>>ach<< Famraldn is sometimes a challenge to farmers.,Ngom me pur i kare mukene obedo peko madit bot lupur,eng,ach
6,>>teo<< Farmland is sometimse a challenge to farmers.,Akiro nu alupok nes erai ationis kanejaas akoriok,eng,teo
7,>>ibo<< Farmland is sometimes a cahllegne to farmers.,"Mgbe ụfọdụ, ihe ịma aka na-abịara ndị ọrụ ugbo bụ ala ha na-akọ ugbo.",eng,ibo
8,>>lug<< Farmers should be encourgaed to grow more coffee.,Abalimi balina okukubirizibwa okwongera okulima emmwanyi,eng,lug
9,>>ach<< Farmers should be encouraged to grow more ocffee.,Lupur omyero ki konygi wek nong miti me puru mwanyi,eng,ach
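
In the output above, roughly one word per sentence has its characters swapped while the `>>lang<<` tag stays intact, presumably because the config applies the augmentation before the prefix. The sketch below is a simplified stand-in for that behaviour written with the standard library only. It is not the SALT code, and the way it turns `aug_word_p` and `aug_word_min` into a word count is an assumption. Unlike the output above, it may also move trailing punctuation, since it treats each whitespace-separated token as a word.

```python
import math
import random


def swap_adjacent_characters(word, rng, n_swaps=2):
    """Swap randomly chosen pairs of neighbouring characters inside a word."""
    chars = list(word)
    if len(chars) < 2:
        return word
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment_characters(text, aug_word_p=0.1, aug_word_min=0, seed=None):
    """Perturb a fraction of the words in `text` (simplified sketch)."""
    rng = random.Random(seed)
    words = text.split()
    # Assumed rule: augment ceil(p * n) words, but at least aug_word_min.
    n_augment = max(aug_word_min, math.ceil(aug_word_p * len(words)))
    n_augment = min(n_augment, len(words))
    for index in rng.sample(range(len(words)), n_augment):
        words[index] = swap_adjacent_characters(words[index], rng)
    return " ".join(words)


sentence = "Farmers should be encouraged to grow more coffee."
print(augment_characters(sentence, aug_word_p=0.1, seed=0))
```

Noise of this kind on the source side is a common way to make a translation model more tolerant of typos in user input.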


## Reloading the module for debugging

After changing the library code, reload the affected modules and delete the cached HuggingFace generator datasets as well. Otherwise, stale module references and cached results can cause unexpected behaviour.

In [63]:
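from importlib import reload
# Re-execute the edited modules in place.
reload(salt.dataset)
reload(salt.dataset.preprocessing)
# Remove cached generator-based datasets so they are rebuilt with the new code.
!rm -rf ~/.cache/huggingface/datasets/generator/*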
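
When more than a couple of modules change, reloading them one by one becomes tedious. The helper below is a sketch of a more general approach: it reloads every already-imported module of a package, submodules first. `reload_package` is a hypothetical helper, not part of SALT. It is demonstrated on the standard-library package `wsgiref` so that the snippet runs without SALT installed; in this notebook the call would be `reload_package('salt')`.

```python
import importlib
import sys


def reload_package(package_name):
    """Reload all imported modules of a package, deepest submodules first."""
    prefix = package_name + "."
    names = [
        name
        for name, module in sys.modules.items()
        if module is not None and (name == package_name or name.startswith(prefix))
    ]
    # Deeper modules first, so parents are reloaded after their submodules.
    names.sort(key=lambda name: (-name.count("."), name))
    for name in names:
        importlib.reload(sys.modules[name])
    return names


# Demonstration on a small standard-library package.
import wsgiref.headers
import wsgiref.util

reloaded = reload_package("wsgiref")
print(reloaded)
```

Note that `importlib.reload` updates a module object in place, but names bound earlier with `from module import name` keep pointing at the old objects, so restarting the kernel remains the safest option after larger changes.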