# QuizBowl ASR Transcription Generation Notebook

This notebook generates text-to-speech MP3s, converts those MP3s into a Kaldi-compatible .wav format, then decodes the .wav files with Kaldi.  Its dependencies are imported below.  It also reads QuizBowl data from a pre-generated JSON file.

In [1]:
#audio
from gtts import gTTS
import subprocess

#storage and displays
import json
from tqdm import tqdm_notebook 
import glob
import time

## Data

The first block loads the JSON file.  The second block keeps only the questions in the buzzertrain fold.  The last block maps each sentence of a full QuizBowl question (generally 4-5 per question) to that question's answer.  The resulting data set has ~111k sentences representing ~26k questions.
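
As a minimal sketch of the filtering and flattening described above, the snippet below applies the same logic to two made-up records.  The record shape (fold, sentences, page, qnum) is inferred from the fields this notebook reads; the sample values and the `flatten_questions` helper name are illustrative assumptions, not part of the real data set.

```python
def flatten_questions(questions, fold="buzzertrain"):
    """Return one [qnum, sentence_index, sentence, answer_page] row per sentence."""
    rows = []
    for question in questions:
        # Skip questions that belong to any other fold.
        if question["fold"] != fold:
            continue
        for sent_count, sentence in enumerate(question["sentences"]):
            rows.append([question["qnum"], sent_count, sentence, question["page"]])
    return rows


# Made-up sample records, for illustration only.
sample_questions = [
    {"fold": "buzzertrain", "qnum": 7, "page": "Answer_A",
     "sentences": ["First clue.", "Second clue."]},
    {"fold": "dev", "qnum": 8, "page": "Answer_B",
     "sentences": ["Ignored clue."]},
]

rows = flatten_questions(sample_questions)
print(rows)
```

Each row carries the question number and the sentence index, which together become the file name used in the later sections.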

In [None]:
%%time
with open('quiz-bowl.all.json') as fp:
    data = json.load(fp)

buzzer_data = []
for question in data['questions']:
    if question['fold'] == 'buzzertrain':
        buzzer_data.append([question['sentences'], question['page'], question['qnum']])

data = []
for sentences, page, qnum in buzzer_data:
    for sent_count, sentence in enumerate(sentences):
        data.append([qnum, sent_count, sentence, page])

## Generate Text to Speech

This code generates one mp3 file per sentence from the data built above.  It takes ~18 hours to process all ~110k sentences.
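
Since a full run takes many hours, one option is to skip sentences whose mp3 already exists so that an interrupted run can resume.  The sketch below is an assumption about how that could be done, not part of the original notebook: `mp3_path` and `pending_rows` are hypothetical helper names, and each row is assumed to have the [qnum, sentence index, sentence, answer] layout built in the Data section.

```python
import os


def mp3_path(qnum, sent_count, out_dir="buzzer"):
    """Build the mp3 path for one sentence, e.g. buzzer/12_3.mp3."""
    return os.path.join(out_dir, "{}_{}.mp3".format(qnum, sent_count))


def pending_rows(rows, out_dir="buzzer"):
    """Keep only the rows whose mp3 has not been written to out_dir yet."""
    return [row for row in rows
            if not os.path.exists(mp3_path(row[0], row[1], out_dir))]
```

Looping over `pending_rows(data)` instead of `data` would leave the gTTS calls themselves unchanged.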

In [None]:
%%time
for sentence in tqdm_notebook(data, total = len(data)):
    file_name = str(sentence[0]) + "_" + str(sentence[1])
    text = sentence[2]
    #convert into audio with gTTS and save it as an mp3 (the WAV conversion happens in the Kaldi section)
    sentTTS = gTTS(text, lang='en', slow=False)
    sentTTS.save('buzzer/'+file_name+".mp3")

## Kaldi

The first block sorts the files numerically by question number (1_0, 1_1, 2_0, 2_1) rather than lexicographically, which would place 10_0 before 2_0.  The second block loops through the files, converts each mp3 into a .wav file for Kaldi, and then runs online2-wav-nnet3-latgen-faster on it, recording both the transcription and the lattice.
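
To illustrate the difference between the two orderings, here is a minimal sketch of a numeric sort key.  Unlike the cell below, which sorts on the question number only, this variant also sorts on the sentence index; `numeric_sort_key` is a hypothetical helper name and the paths are made up.

```python
import os


def numeric_sort_key(path):
    """Map 'buzzer/12_3.mp3' to (12, 3) so files sort by question, then sentence."""
    stem = os.path.splitext(os.path.basename(path))[0]
    qnum, sent_count = stem.split("_")
    return int(qnum), int(sent_count)


# Made-up file names, for illustration only.
paths = ["buzzer/10_0.mp3", "buzzer/2_1.mp3", "buzzer/1_1.mp3",
         "buzzer/2_0.mp3", "buzzer/1_0.mp3"]

print(sorted(paths))                        # plain string comparison
print(sorted(paths, key=numeric_sort_key))  # question 10 now comes last
```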

In [None]:
%%time
#sort files numerically by question number rather than lexicographically
file_list = glob.glob('buzzer/*.mp3')
file_list.sort(key = lambda file: int(file[file.find('/')+1:file.find('_')]))


transcriptions = {}
error_files = []
for each_file in tqdm_notebook(file_list, total=len(file_list)):
    file_name = each_file[each_file.find('/')+1:each_file.find('.')]
    file_name_mp3 = file_name + ".mp3"
    file_name_wav = "wav/"+file_name+".wav"
    #file_name_edit = "wav_edit/"+file_name+"_edit.wav"
    file_name_lat = "lattices/"+file_name+".lat"
    
    #convert mp3 to a specific Kaldi-friendly wav
    hide_output_variable = !ffmpeg -y -i {each_file} -acodec pcm_s16le -ac 1 -ar 8000 {file_name_wav}
   
    #Kaldi
    try:
        output_kaldi = !online2-wav-nnet3-latgen-faster \
          --online=false \
          --do-endpointing=false \
          --frame-subsampling-factor=3 \
          --config=exp/tdnn_7b_chain_online/conf/online.conf \
          --max-active=7000 \
          --beam=15.0 \
          --lattice-beam=6.0 \
          --acoustic-scale=1.0 \
          --word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt \
          exp/tdnn_7b_chain_online/final.mdl \
          exp/tdnn_7b_chain_online/graph_pp/HCLG.fst \
          'ark:echo utterance-id1 utterance-id1|' \
          'scp:echo utterance-id1 {file_name_wav}|' \
          'ark,t:{file_name_lat}'
        
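        #index 7 of the decoder output is expected to hold the transcript line; [14:] strips the "utterance-id1 " prefix
        transcriptions[file_name] = output_kaldi[7][14:]
        
    except Exception:
        #a failed decode leaves the output too short to index; catching Exception rather than everything keeps Ctrl-C working
        error_files.append(each_file)
        print("Kaldi error for " + each_file + ".  Did you run cmd.sh and path.sh?")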

with open('0_5000Questions.json', 'w') as fp:
    json.dump(transcriptions, fp)
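
The decoding loop above reads the transcript from a fixed position in the captured output.  As a hedged alternative, the sketch below looks for the line that begins with the utterance id instead.  It assumes the decoder prints the transcript as a line of the form "utterance-id1 words ...", which is what the slicing above implies; `extract_transcript` is a hypothetical helper name, and the sample lines are placeholders, not real Kaldi output.

```python
def extract_transcript(output_lines, utterance_id="utterance-id1"):
    """Return the text after the utterance id, or None if no such line exists."""
    prefix = utterance_id + " "
    for line in output_lines:
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


# Placeholder lines standing in for captured decoder output.
sample_output = [
    "placeholder log line",
    "utterance-id1 hello world",
    "another placeholder log line",
]
print(extract_transcript(sample_output))
```

A return value of None could then be recorded in error_files rather than relying on an exception.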