## `Importing Libraries`

In [1]:
import csv
import json
import re
import os
import pydub
import hazm
from IPython import display

## `1) Creating the json file, normalizing text and converting to mp3`

### First five lines of data

The `tsv` file is read with Python's built-in `csv` library.

In [2]:
counter = 0

with open('data/Senatelecom.tsv' , encoding='utf-8') as file:
    data = csv.reader(file , delimiter='\t')

    for line in data:
        print(line)
        counter += 1

        if counter == 5:
            break

['client_id', 'path', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment']
['016099f3eca09d2769a4bcbef9d90cb9742b9b2bf2b6ad1d9ad937cfc05950da21463035a8f66e02d1ce21135adb47b4cea46a110e65c5d4c8f6e1a31a07b3f0', 'senatelecom_fa_18325365.mp3', 'از مهمونداری کنار بکشم', '2', '1', '', '', '', 'fa', '']
['08c6e3ae4acdced8e01b6cdd184ab85238211761114f124eb3da44a63eb2edb9676810d7da72283bb5612c215366c1b5c4f960e7accbb1402d4d806f94cc9e01', 'senatelecom_fa_18960256.mp3', 'خب ، تو چیكار می كنی؟', '2', '0', '', '', '', 'fa', '']
['0b8b07282e9b4f9b6495362396d6fd92e35443f3567b4d4274c21298f38ca5de039b9062af42f324f694c9357049d371f02c34166b031c943a66740af4517954', 'senatelecom_fa_21928528.mp3', 'مسقط پایتخت عمان در عربی به معنای محل سقوط است', '2', '0', '', '', '', 'fa', '']
['0d02373503174ec83834e30c332811c63bd929c74fbc65391724f8277aaf7ff5fa7fadac6eed8f03649533f629c793c1eb5805cc629aec45543fb5142deff831', 'senatelecom_fa_19446941.mp3', 'آه، نه اصلاُ!', '2', '0', '', '', '', 

### Preprocessing `wav` file names

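Before the `wav` files can be converted and used, inconsistencies in their names need to be fixed. The loop below skips names whose fourth character is already `a`, drops the first dot from names that contain two dots, and then inserts `a` as the fourth character.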

In [3]:
data_dir = 'data/'
formatted_dir = 'formatted/'
file_paths = os.listdir(data_dir)

In [4]:
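for file_path in file_paths:

    if file_path.endswith('.wav'):

        # Work on a list of characters so single characters can be inserted or removed
        new_path = list(file_path)

        # Names whose fourth character is already 'a' need no fix
        if new_path[3] == 'a':
            continue

        # Drop the first dot when the name contains two dots
        if new_path.count('.') == 2:
            new_path.remove('.')

        # Insert the missing 'a' as the fourth character
        new_path.insert(3 , 'a')
        new_path = ''.join(new_path)

        os.rename(data_dir + file_path , data_dir + new_path)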

file_paths = os.listdir(data_dir)
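
The renaming rule can also be written as a pure function and checked on a few names before any file on disk is touched. This is a minimal sketch: the helper name `fix_name` and the sample file names are hypothetical, and names shorter than four characters are not handled.

```python
def fix_name(file_name):
    """Return the corrected file name, following the same rule as the loop above."""
    chars = list(file_name)

    # Names whose fourth character is already 'a' are left unchanged
    if chars[3] == 'a':
        return file_name

    # Drop the first dot when the name contains two dots
    if chars.count('.') == 2:
        chars.remove('.')

    # Insert the missing 'a' as the fourth character
    chars.insert(3, 'a')
    return ''.join(chars)


# Hypothetical file names, used only to exercise the rule
for name in ['senatelecom_fa_1.wav', 'sentelecom_fa_2.wav', 'sentelecom_fa_3..wav']:
    print(name, '->', fix_name(name))
```

Running the function over `os.listdir(data_dir)` and printing the result gives a dry run of the rename step.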

### Converting `wav` files to `mp3`

The `pydub` library is used to convert the `wav` files to `mp3`. It can read `wav` files and export them to any `non-wav` format if `ffmpeg` is installed.

`IPython.display` is used to play audio files inside the Jupyter notebook.

In [5]:
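# Make sure the output directory exists before exporting
os.makedirs(formatted_dir , exist_ok=True)

for file_path in file_paths:

    if file_path.endswith('wav'):

        # Read the wav file and export it as mp3 under the same base name
        audio = pydub.AudioSegment.from_wav(data_dir + file_path)
        audio.export(formatted_dir + file_path.split('.')[0] + '.mp3' , format='mp3')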

In [6]:
for path in os.listdir(formatted_dir):

    if path.endswith('mp3'):
        display.display(display.Audio(formatted_dir + path))

### Text Normalization

`Hazm` is the library used for text normalization. Since it leaves a few Arabic characters and the Arabic digits untouched, a custom function is applied alongside it.

In [7]:
def normalize(input_text):

    """
    Return normalized Persian text.

        input_text : [string] Input text
    """
    patterns = [
        ('ي' , 'ی'),
        ('ئ' , 'ی'),
        ('ك' , 'ک'),
        ('ؤ' , 'و'),
        ('ة' , 'ه'),
        ('ۀ' , 'ه'),
        ('ـ' , ''),
        ('[إأآ]' , 'ا'),
        ('[ءًٌٍَُِّ]' , ''),
    ]

    per = '۱۲۳۴۵۶۷۸۹۰'
    eng = '1234567890'
    arb = '١٢٣٤٥٦٧٨٩٠'

    text = input_text

    for to_replace , with_char in patterns:
        text = re.sub(to_replace,with_char,text)

    table_eng = text.maketrans(eng , per)
    table_arb = text.maketrans(arb , per)

    text = text.translate(table_eng)
    text = text.translate(table_arb)

    return text

In [8]:
sample_text = 'آ ئ ء أ إ ؤ ي ة َ ُ ِ ّ ً ٌ ٍ ۀ ـ ك 1234567890 ١٢٣٤٥٦٧٨٩٠'
hazm_normalizer = hazm.Normalizer()
print(hazm_normalizer.normalize(sample_text))
print(normalize(sample_text))
print(normalize(hazm_normalizer.normalize(sample_text)))

آ ئ ء أ إ ؤ ی ة        ۀ  ک ۱۲۳۴۵۶۷۸۹۰ ١٢٣٤٥٦٧٨٩٠
ا ی  ا ا و ی ه        ه  ک ۱۲۳۴۵۶۷۸۹۰ ۱۲۳۴۵۶۷۸۹۰
ا ی  ا ا و ی ه        ه  ک ۱۲۳۴۵۶۷۸۹۰ ۱۲۳۴۵۶۷۸۹۰


### Creating the template dictionary for `json` file

Finally, the built-in `json` library is used to create a `json` file from a dictionary.

In [9]:
data_dict = {
    'audio' : [],
    'duration' : [],
    'text' : []
}

with open('data/Senatelecom.tsv' , encoding='utf-8') as file:
    data = csv.reader(file , delimiter='\t')

    for line in data:
        if (line[1].split('.')[0] + '.wav') in file_paths:

            name = line[1].split('.')[0] + '.wav'

            audio_obj = pydub.AudioSegment.from_wav(data_dir + name)
            duration = audio_obj.duration_seconds

            text = hazm_normalizer.normalize(line[2])
            text = normalize(text)

            data_dict['audio'].append(name)
            data_dict['duration'].append(duration)
            data_dict['text'].append(text)

In [10]:
for name , duration , text in zip(data_dict['audio'] , data_dict['duration'] , data_dict['text']):
    print(f'Name : {name} , Duration : {duration} , Text : {text}')
    display.display(display.Audio(data_dir + name))

Name : senatelecom_fa_18325365.wav , Duration : 2.712 , Text : از مهمونداری کنار بکشم


Name : senatelecom_fa_21928528.wav , Duration : 8.64 , Text : مسقط پایتخت عمان در عربی به معنای محل سقوط است


Name : senatelecom_fa_18557643.wav , Duration : 3.792 , Text : دو استایل متفاوت دارین


Name : senatelecom_fa_19209861.wav , Duration : 6.672 , Text : دو روز قبل از کریسمس؟


Name : senatelecom_fa_19887398.wav , Duration : 2.496 , Text : ساعت‌های کاری چیست؟


Name : senatelecom_fa_19208010.wav , Duration : 5.856 , Text : اعصابم اون شب خورد بود


Name : senatelecom_fa_20323483.wav , Duration : 10.488 , Text : هیچ گل مینا دارید؟


Name : senatelecom_fa_21807667.wav , Duration : 4.68 , Text : من در هفته چهاردهم بارداری هستم


Name : senatelecom_fa_18636711.wav , Duration : 9.12 , Text : ما با هم خیلی مچ هستیم


Name : senatelecom_fa_21915769.wav , Duration : 4.32 , Text : اجرای احجام ساده مثل کره


### Creating `json` file

In [11]:
with open('audio_data.json' , mode='w') as f:
    json.dump(data_dict , f)
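
To check that a file written this way round-trips, it can be loaded back and compared with the dictionary. The sketch below is self-contained, so it uses a hypothetical one-row stand-in for `data_dict` and a temporary directory. Passing `encoding='utf-8'` and `ensure_ascii=False` is an optional variation, not what the cell above does; it stores the Persian text as readable characters instead of `\uXXXX` escapes.

```python
import json
import os
import tempfile

# Hypothetical stand-in with the same three keys as data_dict
sample_dict = {
    'audio': ['senatelecom_fa_18325365.wav'],
    'duration': [2.712],
    'text': ['از مهمونداری کنار بکشم'],
}

# Write to a temporary location so no real file is overwritten
json_path = os.path.join(tempfile.mkdtemp(), 'audio_data.json')

with open(json_path, mode='w', encoding='utf-8') as f:
    json.dump(sample_dict, f, ensure_ascii=False, indent=2)

with open(json_path, encoding='utf-8') as f:
    loaded = json.load(f)

# All three lists should have one entry per audio file
lengths = {key: len(values) for key, values in loaded.items()}
print(lengths)
```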

## `2) Preliminary model for ASR`

#### Step 1 : Data Preprocessing and numerical representation
1. The first step is to load the audio files into NumPy arrays, which are 1D for mono audio or 2D for multi-channel audio. Each array holds a sequence of numbers, each one measuring the intensity (amplitude) of the audio at a particular moment in time.
2. Since the model needs uniform inputs, the second step is to clean the data by making sure every audio file has the same sampling rate, number of channels and duration.
3. Data augmentation is applied to the audio files by speeding them up or slowing them down, changing the volume, and shifting pitch or time, so that the model generalizes better.
4. The audio files are converted into spectrograms, which capture the intensity of each frequency over time in an image-like format.
5. The spectrograms are converted into MFCCs (Mel-frequency cepstral coefficients), a compact set of features that emphasizes the frequency ranges relevant to human speech.
6. Data augmentation can also be applied to the spectrograms by masking vertical (time) or horizontal (frequency) bands of information.
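
Some of the steps above can be sketched with NumPy alone. The code below is an illustrative sketch, not the pipeline used for this dataset: the function names, the 16 kHz sampling rate, the frame and hop sizes, and the mask widths are all assumptions, and a synthetic tone stands in for a real recording. It covers fixing the duration (step 2), a time shift (step 3), a magnitude spectrogram (step 4) and band masking (step 6). Resampling and MFCC extraction are left out.

```python
import numpy as np


def fix_length(samples, target_len):
    """Truncate or zero-pad a mono signal to exactly target_len samples."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return np.pad(samples, (0, target_len - len(samples)))


def time_shift(samples, max_shift, rng):
    """Circularly shift the signal by a random number of samples."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(samples, shift)


def spectrogram(samples, frame_len=400, hop=160):
    """Magnitude spectrogram with shape (frequency bins, time frames)."""
    n_frames = 1 + (len(samples) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        samples[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)).T


def mask_bands(spec, rng, max_freq_width=20, max_time_width=10):
    """Zero out one random horizontal (frequency) and one vertical (time) band."""
    masked = spec.copy()
    n_freq, n_time = masked.shape
    f_width = int(rng.integers(0, max_freq_width + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    masked[f_start:f_start + f_width, :] = 0.0
    t_width = int(rng.integers(0, max_time_width + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    masked[:, t_start:t_start + t_width] = 0.0
    return masked


rng = np.random.default_rng(0)
sample_rate = 16000  # assumed sampling rate

# Synthetic 0.5 s, 1 kHz tone standing in for a loaded audio file
tone = np.sin(2 * np.pi * 1000 * np.arange(8000) / sample_rate)

fixed = fix_length(tone, sample_rate)            # pad to exactly 1 s
shifted = time_shift(fixed, max_shift=1600, rng=rng)
spec = spectrogram(fixed)
masked = mask_bands(spec, rng)
print(spec.shape)  # (201, 98)
```

With a 400-sample frame at 16 kHz each frequency bin spans 40 Hz, so the 1 kHz tone shows up around bin 25 of the spectrogram.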

#### Step 2 : Model Architecture

1. Feed the spectrograms into a CNN to create feature maps.
2. Feed the feature maps into an RNN to map the continuous representation of the audio to a discrete representation.
3. Feed the output of the previous step into a linear layer that outputs character probabilities for the transcript.
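
The three stages above can be wired together as in the following sketch. It assumes PyTorch, which is not otherwise used in this notebook, and every size in it (8 convolution channels, 40 input features, a hidden size of 64, 40 output characters) is a placeholder rather than a tuned value. The class name `TinyASR` is hypothetical, and the training loss is not covered here.

```python
import torch
from torch import nn


class TinyASR(nn.Module):
    """CNN -> RNN -> linear layer, producing per-frame character log-probabilities."""

    def __init__(self, n_features=40, n_chars=40, hidden=64):
        super().__init__()
        # 1. CNN that turns the spectrogram into feature maps
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # 2. RNN that reads the feature maps frame by frame
        self.rnn = nn.GRU(8 * n_features, hidden, batch_first=True, bidirectional=True)
        # 3. Linear layer that scores every character at every frame
        self.out = nn.Linear(2 * hidden, n_chars)

    def forward(self, spec):
        # spec: (batch, 1, n_features, time)
        x = self.conv(spec)             # (batch, 8, n_features, time)
        x = x.permute(0, 3, 1, 2)       # (batch, time, 8, n_features)
        x = x.flatten(2)                # (batch, time, 8 * n_features)
        x, _ = self.rnn(x)              # (batch, time, 2 * hidden)
        return self.out(x).log_softmax(dim=-1)


model = TinyASR()
dummy_batch = torch.randn(2, 1, 40, 100)  # two random "spectrograms" with 100 frames
log_probs = model(dummy_batch)
print(log_probs.shape)  # torch.Size([2, 100, 40])
```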

#### Audio model

In a similar way to transformer language models, we can first train a network to understand the basic units of audio and speech itself using self-supervised methods, in which parts of the raw audio are masked and then fed through a network that tries to reconstruct them. The audio representation learned by this network can then be used as a starting point for fine-tuning other networks on the downstream task.
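
The masking side of this idea can be illustrated without a network. In the sketch below, the span length, the masking probability, the function names and the mean-squared-error objective are all assumptions chosen for illustration, and published self-supervised audio models may mask and score their inputs differently.

```python
import numpy as np


def mask_spans(samples, rng, span_len=400, mask_prob=0.5):
    """Zero out random fixed-length spans; return the masked signal and a boolean mask."""
    masked = samples.copy()
    is_masked = np.zeros(len(samples), dtype=bool)
    for i in range(len(samples) // span_len):
        if rng.random() < mask_prob:
            is_masked[i * span_len:(i + 1) * span_len] = True
    masked[is_masked] = 0.0
    return masked, is_masked


def reconstruction_loss(predicted, original, is_masked):
    """Mean squared error measured only on the masked positions."""
    if not is_masked.any():
        return 0.0
    return float(np.mean((predicted[is_masked] - original[is_masked]) ** 2))


rng = np.random.default_rng(0)

# Synthetic 1 s signal standing in for raw audio
samples = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
masked, is_masked = mask_spans(samples, rng)

# A network would receive `masked` and be trained to lower this loss;
# here the untouched masked input serves as a trivial baseline prediction
baseline = reconstruction_loss(masked, samples, is_masked)
print(f'masked fraction: {is_masked.mean():.2f}, baseline loss: {baseline:.3f}')
```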