Recall: the main task is speech translation, i.e. translating audio in one language into text in another language.

## Initial experiments on one pair of languages: English --> German

In [None]:
import os

# Change the working directory to the root of the project
os.chdir(r'C:\Users\TuAhnDinh\Desktop\MediaanProjects\BachelorThesisST')

In [None]:
from covost_data_preparation import read_tsv_split

COVOST_DIR = 'data/CoVoST2'
PREPROCESSED_DIR = 'preprocessed/full'

src_lang = 'en'
en_X_dir = f'{COVOST_DIR}/{PREPROCESSED_DIR}/en-X'
SRC_AUDIO_DIR = COVOST_DIR + '/' + src_lang
audiodir = SRC_AUDIO_DIR + '/clips'
tgt_lang = 'de'
TRANSLATIONS_DIR = COVOST_DIR + '/covost2' + f'/{src_lang}_{tgt_lang}'

test_df = read_tsv_split(TRANSLATIONS_DIR, src_lang=src_lang, tgt_lang=tgt_lang, split='test', audiodir=audiodir)
test_audios_list = [audiodir + '/' + path for path in test_df['path']]

### Data: CoVoST2

English audio - English transcription - German translation

In [None]:
import pandas as pd

pd.read_csv('data/CoVoST2/en_X_stat.csv', index_col=0, header=[0,1])

### Metrics explanation

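- For speech recognition tasks: word error rate (WER) <br>
    WER = (substituted + deleted + inserted words) / (number of words in the reference transcription) <br>
    WER is reported as a percentage: 0 is a perfect transcription, and values above 100 are possible when the output contains many inserted words. The smaller the better.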
    
    
- For translation tasks: BLEU score <br>
    BLEU ranges from 0 to 100, the bigger the better. A rough guide to interpreting it: <br>
    30 - 40	Understandable to good translations<br>
    40 - 50	High quality translations<br>
    50 - 60	Very high quality, adequate, and fluent translations<br>
    \> 60	Quality often better than human
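
WER is conventionally computed from the word-level edit distance between the model output and the reference. The following is a minimal sketch for a single sentence pair. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration; the evaluation behind the scores reported in this notebook may normalize text differently, and corpus-level WER is usually obtained by summing errors and reference lengths over all sentences rather than averaging per-sentence values.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length, in percent."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic dynamic-programming table,
    # kept one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("the cat sat", "the dog sat"))
# Three inserted words against a one-word reference: WER goes above 100.
print(word_error_rate("hello", "hello there big world"))
```

Because insertions are counted in the numerator but not in the denominator, a hypothesis that is much longer than the reference can push the value past 100.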

### Cascaded approach

#### Automatic Speech Recognition (ASR) model: English audio --> English text 

WER score on the test set: 29.7 <br>
(the baseline in the CoVoST paper gives 25.6)

Result on a test sample:

In [None]:
# `display` is available implicitly in notebooks; importing it keeps the cell explicit
from IPython.display import Audio, display

with open("experiments/audio_en_text_en/raw_text_translation.txt", 'r', encoding="utf-8") as f:
    output_texts = f.readlines()
    
sample_index = 100
sample_audio = test_audios_list[sample_index]
print('English audio')
display(Audio(filename=sample_audio, autoplay=False))
print('English text output by the model:')
print(output_texts[sample_index])
print('Human-labeled English transcription:')
print(test_df.loc[sample_index]['sentence'])

#### Machine Translation (MT) model: English text --> German text

BLEU score on the test set: 33.0 <br>
(the baseline in the CoVoST paper gives 29.0)

Result on a test sample:

In [None]:
import sacrebleu
with open("experiments/text_en_text_de/raw_text_translation.txt", 'r', encoding="utf-8") as f:
    output_texts = f.readlines()
    
sample_index = 100
sample_audio = test_audios_list[sample_index]
output_text = output_texts[sample_index]
reference_text = test_df.loc[sample_index]['translation']

print('English text')
print(test_df.loc[sample_index]['sentence'])
print()
print('German text output by the model:')
print(output_text)
print('Human-labeled German translation:')
print(reference_text)
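
To give a feel for what a BLEU score measures, here is a simplified single-sentence sketch: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. This is not the implementation that produced the score above. The names `ngram_counts` and `simple_bleu` are made up for illustration, tokenization is plain whitespace splitting, there is a single reference and no smoothing, whereas a corpus-level tool such as `sacrebleu` aggregates n-gram counts over the whole test set and applies its own tokenization.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU in the range 0-100 (one reference, no smoothing)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not hyp:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each hypothesis n-gram count by how often it occurs in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precision_sum += math.log(matched / total)

    # Penalize hypotheses that are shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    return 100.0 * brevity_penalty * math.exp(log_precision_sum / max_n)


print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))
print(simple_bleu("the cat sat on the mat", "the cat sat on the"))
```

The second call shows the brevity penalty at work: every n-gram in the shorter hypothesis is correct, yet the score drops below 100 because the output omits part of the reference.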


#### Cascaded Speech Translation (ST) using the above 2 models: English audio --> English text --> German text

BLEU score on the test set: 20.6 <br>
(the baseline in the CoVoST paper gives 18.3)

Result on a test sample:

In [None]:
with open("experiments/cascaded_ST_en_de/raw_text_output.txt", 'r', encoding="utf-8") as f:
    output_texts = f.readlines()
    
sample_index = 100
sample_audio = test_audios_list[sample_index]
output_text = output_texts[sample_index]
reference_text = test_df.loc[sample_index]['translation']

print('English audio')
display(Audio(filename=sample_audio, autoplay=False))
print('German text output by the model:')
print(output_text)
print('Human-labeled German translation:')
print(reference_text)
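
Conceptually, the cascade is a composition of two functions: the ASR model turns audio into English text, and the MT model turns that text into German. The sketch below illustrates only this structure. `cascade`, `toy_asr`, and `toy_mt` are hypothetical names, and the two toy functions are lookup tables standing in for the trained models, not the models used in these experiments.

```python
from typing import Callable


def cascade(asr: Callable[[str], str], mt: Callable[[str], str]) -> Callable[[str], str]:
    """Compose an ASR step and an MT step into one speech-translation function."""
    def speech_translate(audio_path: str) -> str:
        transcript = asr(audio_path)  # English audio -> English text
        return mt(transcript)         # English text  -> German text
    return speech_translate


# Hypothetical stand-ins for the trained models.
def toy_asr(audio_path: str) -> str:
    return {"clip_0.mp3": "good morning"}[audio_path]


def toy_mt(english_text: str) -> str:
    return {"good morning": "guten Morgen"}[english_text]


speech_translate = cascade(toy_asr, toy_mt)
print(speech_translate("clip_0.mp3"))
```

Since the MT step only ever sees the ASR output, any transcription error is passed on to the translation, which is one plausible reason the cascaded BLEU is lower than the BLEU of the MT model on human transcriptions.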

### End-to-end approach

#### Speech Translation (ST) model: English audio --> German text

BLEU score on the test set: 14.9 <br>
(the baseline in the CoVoST paper gives 13.6)

Result on a test sample: 

In [None]:
with open("experiments/audio_en_text_de/raw_text_output.txt", 'r', encoding="utf-8") as f:
    output_texts = f.readlines()
    
sample_index = 100
sample_audio = test_audios_list[sample_index]
output_text = output_texts[sample_index]
reference_text = test_df.loc[sample_index]['translation']

print('English audio')
display(Audio(filename=sample_audio, autoplay=False))
print('German text output by the model:')
print(output_text)
print('Human-labeled German translation:')
print(reference_text)

### Next step

Combine MT and ASR into one model that handles both tasks: <br>
English audio --> English text <br>
English text --> German text <br>

Then run it zero-shot on the direction it has not been trained on: <br>
English audio --> German text