# Description of the Project
**The main purpose of this project is to recognise Bengali speech/audio and return transcripts/text. I worked on it during a Kaggle ML competition.**

**We perform automatic speech recognition (ASR) using a pre-trained model with a language model (LM) decoder, and return the predicted text from the speech.**

**Data:** A clean, labelled dataset that is publicly available for a Kaggle competition.

**Model Used :** *Wav2Vec2ProcessorWithLM* - an implementation by HuggingFace


In [None]:
!cp -r ../input/python-packages2 ./

In [None]:
!tar xvfz ./python-packages2/jiwer.tgz
!pip install ./jiwer/jiwer-2.3.0-py3-none-any.whl -f ./ --no-index
!tar xvfz ./python-packages2/normalizer.tgz
!pip install ./normalizer/bnunicodenormalizer-0.0.24.tar.gz -f ./ --no-index
!tar xvfz ./python-packages2/pyctcdecode.tgz
!pip install ./pyctcdecode/attrs-22.1.0-py2.py3-none-any.whl -f ./ --no-index --no-deps
!pip install ./pyctcdecode/exceptiongroup-1.0.0rc9-py3-none-any.whl -f ./ --no-index --no-deps
!pip install ./pyctcdecode/hypothesis-6.54.4-py3-none-any.whl -f ./ --no-index --no-deps
!pip install ./pyctcdecode/numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl -f ./ --no-index --no-deps
!pip install ./pyctcdecode/pygtrie-2.5.0.tar.gz -f ./ --no-index --no-deps
!pip install ./pyctcdecode/sortedcontainers-2.4.0-py2.py3-none-any.whl -f ./ --no-index --no-deps
!pip install ./pyctcdecode/pyctcdecode-0.4.0-py2.py3-none-any.whl -f ./ --no-index --no-deps

!tar xvfz ./python-packages2/pypikenlm.tgz
!pip install ./pypikenlm/pypi-kenlm-0.1.20220713.tar.gz -f ./ --no-index --no-deps
!pip install pyctcdecode


In [None]:
import os
import numpy as np
from tqdm.auto import tqdm
from glob import glob
from transformers import AutoFeatureExtractor, pipeline
import pandas as pd
import librosa
import IPython
from datasets import load_metric
from torch.utils.data import Dataset, DataLoader
import torch
import gc
import wave
from scipy.io import wavfile
import scipy.signal as sps
import pyctcdecode

tqdm.pandas()
import warnings
warnings.filterwarnings("ignore")



In [None]:
# CHANGE ACCORDINGLY
BATCH_SIZE = 1
TEST_DIRECTORY = '/kaggle/input/bengaliai-speech/test_mp3s'

In [None]:

class CFG:
    my_model_name = '../input/yellowking-dlsprint-model/YellowKing_model'
    processor_name = '../input/yellowking-dlsprint-model/YellowKing_processor'

In [None]:
from transformers import Wav2Vec2ProcessorWithLM

processor = Wav2Vec2ProcessorWithLM.from_pretrained(CFG.processor_name)


In [None]:
my_asrLM = pipeline(
    "automatic-speech-recognition",
    model=CFG.my_model_name,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=processor.decoder,
    device=0,  # run on the first GPU
)


**Following Sample Submission:**

In [None]:
def infer(audio_path):
    speech, sr = librosa.load(audio_path, sr=processor.feature_extractor.sampling_rate)

    my_LM_prediction = my_asrLM(
                speech
            )

    return my_LM_prediction['text']
"""
The `infer` function takes an audio file path, performs automatic speech recognition (ASR) with the pipeline built in the previous cell, and returns the predicted text. Step by step:

1. `librosa.load(audio_path, sr=processor.feature_extractor.sampling_rate)`: This line uses the librosa library to load the audio file specified by `audio_path` and resample it to the sampling rate the feature extractor expects. It returns the audio waveform `speech` and the sample rate `sr`.

2. `my_asrLM(speech)`: `my_asrLM` is the Hugging Face `pipeline("automatic-speech-recognition", ...)` object created earlier. It runs the model on the waveform `speech` and decodes the output with the language-model decoder taken from `Wav2Vec2ProcessorWithLM`.

3. `my_LM_prediction['text']`: The pipeline returns a dictionary; this line retrieves the recognized text from it.

""";

In [None]:
def batch_infer(audio_paths, batch_size=BATCH_SIZE):
    '''
    infers on a batch of audio
    args:
      audio_paths  : list of path to audio files <list of string>
    returns:
      bangla predicted texts <list of string>
    '''
    results = []
    for path in audio_paths:
        pred = ""
        try:
            pred = infer(path)
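        except Exception:
            # fall back to a placeholder so one bad file does not stop the run
            pred = "এ"
        if len(pred)==0:
            # replace an empty prediction with the same placeholder
            pred = "এ"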
        results.append(pred)
    
    return results

"""
The `batch_infer` function runs inference on a list of audio files by calling the `infer` function, defined in the previous cell, on each file. It handles exceptions and returns the predicted text for each audio file in the input list `audio_paths`.

Let's break down the function step-by-step:

1. `def batch_infer(audio_paths, batch_size=BATCH_SIZE):`: The function `batch_infer` takes two parameters: `audio_paths`, which is a list of file paths to audio files, and `batch_size`, which defaults to the `BATCH_SIZE` constant set in the configuration cell. The body never uses `batch_size`; files are always processed one at a time.

2. `results = []`: Initializes an empty list named `results`, which will be used to store the predicted texts for each audio file.

3. `for path in audio_paths:`: This loop iterates through each audio file path in the input list `audio_paths`.

4. `pred = ""`: Initializes an empty string variable `pred`, which will be used to store the predicted text for the current audio file.

5. `try:`: This block tries to execute the `infer` function with the current audio file path `path`.

6. `pred = infer(path)`: Calls the `infer` function with the current audio file path `path` to get the predicted text for the audio file.

7. `except` clause: If `infer` raises an exception (for example, because an audio file cannot be read), this block is executed.

8. `pred = "এ"`: In case of an exception, the variable `pred` is set to the Bengali character "এ".

9. `if len(pred) == 0:`: This checks if the length of the predicted text `pred` is zero, which means no text was predicted for the audio file.

10. `pred = "এ"`: If no text was predicted (i.e., the length is zero), the variable `pred` is set to the Bengali character "এ".

11. `results.append(pred)`: The predicted text for the current audio file is added to the `results` list.

12. `return results`: After processing all audio files in the `audio_paths` list, the function returns the `results` list, which contains the predicted texts for each audio file.

The `infer` function, defined in the previous cell, processes a single audio file and returns the predicted text. `batch_infer` calls it once per file and collects the results in the `results` list.
""";
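
The fallback logic described above replaces failed predictions with a placeholder but does not record which files failed. A minimal sketch of the same logic that also reports failures is shown below. The names `batch_infer_with_report` and `infer_fn` are hypothetical and not part of the original notebook; `infer_fn` stands in for `infer` so the sketch runs without a model.

```python
from typing import Callable, List, Tuple

FALLBACK = "এ"  # same placeholder the notebook uses


def batch_infer_with_report(
    audio_paths: List[str],
    infer_fn: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Return (predictions, failed_paths) for the given audio files."""
    results: List[str] = []
    failed: List[str] = []
    for path in audio_paths:
        try:
            pred = infer_fn(path)
        except Exception:
            # remember the file so the failure can be inspected later
            failed.append(path)
            pred = FALLBACK
        # an empty prediction is also replaced by the placeholder
        results.append(pred if pred else FALLBACK)
    return results, failed
```

In the notebook this could be called as `batch_infer_with_report(batch_paths, infer)`; inspecting the second return value shows whether the placeholder came from a real failure.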

In [None]:
from bnunicodenormalizer import Normalizer 


bnorm = Normalizer()
def normalize(sen):
    _words = [bnorm(word)['normalized']  for word in sen.split()]
    return " ".join([word for word in _words if word is not None])

def dari(sentence):
    try:
        if sentence[-1]!="।":
            sentence+="।"
    except:
        print(sentence)
    return sentence

"""
The code above defines two helper functions, `normalize` and `dari`, for post-processing Bengali text:

1. `normalize(sen)`: This function takes a Bengali sentence (`sen`), splits it on whitespace, normalizes each word with the `bnunicodenormalizer` library, and joins the results back into a sentence. The library is designed to fix inconsistent Unicode encodings of Bengali words, so that the same word is always written with the same character sequence.

   Every word is passed through the normalizer, whether or not it is Bengali. When the normalizer returns `None` for a word, the final list comprehension drops it, so such words are silently removed from the output rather than raising an error.

2. `dari(sentence)`: This function takes a Bengali sentence (`sentence`) and checks whether it ends with the Bengali full stop "।" (the danda, U+0964, called dari in Bengali). If it does not, the function appends one. Its call in `directory_infer` is commented out, so it is currently unused.

   The `try`/`except` block guards against an empty string: `sentence[-1]` raises an `IndexError` when `sentence` is empty, in which case the function prints the input and returns it unchanged. Because a bare `except` also hides any other error, an explicit emptiness check would be clearer.

The code relies on the external library `bnunicodenormalizer`, which is installed from the offline package archive at the top of the notebook.
""";
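
As an illustration of the point about `dari`, here is a minimal sketch that appends the danda without a `try`/`except`. The name `ensure_danda` is hypothetical. Unlike `dari`, which returns an empty string unchanged, this version turns an empty string into a lone danda; that choice is an assumption made here and mirrors what the later `check` function does.

```python
DANDA = "\u0964"  # Bengali full stop (danda)


def ensure_danda(sentence: str) -> str:
    """Append a danda unless the sentence already ends with one."""
    # str.endswith is safe on an empty string, so no exception handling is needed
    if sentence.endswith(DANDA):
        return sentence
    return sentence + DANDA
```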

In [None]:
def post_process_keys(str):
    return str.replace("../input/test-wav-files-dl-sprint/test_files_wav/","").replace(".wav",".mp3")

"""
The function `post_process_keys(str)` appears to be a post-processing function designed to modify and clean up file paths or keys (strings) related to audio files.

Let's break down the function:

1. `str.replace("../input/test-wav-files-dl-sprint/test_files_wav/", "")`: This line of code replaces the substring `../input/test-wav-files-dl-sprint/test_files_wav/` with an empty string in the input `str`. This is essentially removing the specified prefix from the string.

2. `.replace(".wav", ".mp3")`: After removing the prefix in the previous step, this line replaces the substring `.wav` with `.mp3` in the remaining string. This is essentially changing the file extension of the audio file from WAV to MP3.

The function converts paths of WAV files under `../input/test-wav-files-dl-sprint/test_files_wav/` into the corresponding MP3 file names. In this notebook the ids passed to it are file names with the directory and extension already removed (see `directory_infer`), so neither replacement matches and the ids are returned unchanged.

String replacement on paths is fragile: it only works when the input has exactly the expected prefix and extension. Note also that the parameter name `str` shadows Python's built-in `str` type inside the function; a name such as `key` would avoid that.
""";
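
A less fragile way to turn a file path into an id is to use `os.path` instead of string replacement. The sketch below assumes the id is the file name without its directory and extension, which is how `directory_infer` builds its ids; the name `audio_id` is hypothetical.

```python
import os


def audio_id(path: str) -> str:
    """Return the file name without directory or extension."""
    # basename drops the directory part, splitext drops the last extension
    return os.path.splitext(os.path.basename(path))[0]
```

This works for any directory prefix and any extension, so it does not need to be updated when the dataset location changes.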

In [None]:
def directory_infer(audio_dir):
    '''
    infers on a directory that contains audio files
    args:
      audio_dir  : directory that contains some audio files <string>
    returns:
      a dataframe that contains 2 columns:
        * id <string>
        * sentence <string>
    '''
    # list all audio files

    audio_paths=[audio_path for audio_path in tqdm(glob(os.path.join(audio_dir,"*.*")))]
    files = os.listdir("/kaggle/input/bengaliai-speech/test_mp3s")
    paths = []
    for i in files:
        paths.append(i.split(".")[0])
    sentences=[]
    for idx in tqdm(range(0,len(audio_paths),BATCH_SIZE)):
        batch_paths=audio_paths[idx:idx+BATCH_SIZE]
        sentences+=batch_infer(batch_paths)
        
    df= pd.DataFrame({"id":paths,"sentence":sentences})
    df.sentence= df.sentence.apply(lambda x:normalize(x))
    #df.sentence= df.sentence.apply(lambda x:dari(x))
    df['id'] = df['id'].apply(lambda x: post_process_keys(x))
    
    return df 
"""
The function `directory_infer(audio_dir)` performs inference on a directory that contains audio files. It processes the audio files in batches using the `batch_infer` function and returns the results as a DataFrame with two columns: "id" and "sentence".

Let's break down the function step-by-step:

1. `audio_paths = [audio_path for audio_path in tqdm(glob(os.path.join(audio_dir, "*.*")))]`: This line lists all the files in the specified `audio_dir` directory using the `glob` function. The pattern `*.*` matches files with any extension, and their paths are stored in the `audio_paths` list.

2. `files = os.listdir("/kaggle/input/bengaliai-speech/test_mp3s")`: This line lists the files in the hardcoded directory "/kaggle/input/bengaliai-speech/test_mp3s" instead of using the `audio_dir` parameter. The list is not redundant: it supplies the "id" values built in steps 4 and 5. The ids match the predictions row by row only if `audio_dir` is this same directory and `glob` and `os.listdir` return the files in the same order.

3. `paths = []`: Initializes an empty list named `paths` to store the extracted "id" values from the filenames.

4. `for i in files:`: This loop iterates through each filename in the `files` list (from step 2).

5. `paths.append(i.split(".")[0])`: It splits each filename by the dot (.) and takes the first part to get the "id" value. The "id" values are then appended to the `paths` list.

6. `sentences = []`: Initializes an empty list named `sentences` to store the predicted sentences from the ASR (Automatic Speech Recognition) model.

7. `for idx in tqdm(range(0, len(audio_paths), BATCH_SIZE)):`: This loop iterates through the `audio_paths` list in batches of size `BATCH_SIZE`, which is set in the configuration cell near the top of the notebook.

8. `batch_paths = audio_paths[idx:idx + BATCH_SIZE]`: Extracts a batch of audio file paths from `audio_paths`.

9. `sentences += batch_infer(batch_paths)`: Calls the `batch_infer` function with the current batch of audio file paths and adds the predicted sentences to the `sentences` list.

10. `df = pd.DataFrame({"id": paths, "sentence": sentences})`: Creates a DataFrame (`df`) using the "id" values from step 5 and the predicted sentences obtained from ASR in step 9.

11. `df.sentence = df.sentence.apply(lambda x: normalize(x))`: Applies the `normalize` function to each sentence in the "sentence" column of the DataFrame. This function normalizes Bengali sentences, as explained above.

12. `df['id'] = df['id'].apply(lambda x: post_process_keys(x))`: Applies the `post_process_keys` function to each "id" value in the DataFrame. Because the ids are already file names without directory or extension, the function leaves them unchanged here.

13. `return df`: The function returns the DataFrame `df`, which contains the "id" and "sentence" columns with the processed data.

`BATCH_SIZE` is defined in the configuration cell near the top of the notebook. The hardcoded path in step 2 is specific to the Kaggle environment and needs to be changed if the notebook is run elsewhere.
""";
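
Since the ids and the audio paths come from two separate directory listings, one way to guarantee that they line up is to build both from a single sorted listing. The sketch below is one possible approach, not the original author's method; the name `list_audio` is hypothetical.

```python
import os
from glob import glob
from typing import List, Tuple


def list_audio(audio_dir: str) -> Tuple[List[str], List[str]]:
    """Return (ids, paths) built from one sorted listing so they stay aligned."""
    paths = sorted(glob(os.path.join(audio_dir, "*.*")))
    # same id rule as the notebook: the file name up to the first dot
    ids = [os.path.basename(p).split(".")[0] for p in paths]
    return ids, paths
```

With this helper, `directory_infer` could use the returned `paths` for inference and the returned `ids` for the "id" column, and the hardcoded directory would no longer be needed.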

In [None]:
submission = directory_infer(TEST_DIRECTORY)
submission.head()

In [None]:
def check(sentence):
    if len(sentence)==0:
        return '।'
    return sentence

In [None]:
submission.sentence = submission.sentence.apply(lambda x:check(x))

In [None]:
submission.to_csv("submission.csv", index=False)

##### This notebook is a fork of [this notebook](https://www.kaggle.com/code/mbmmurad/lb-0-506-inference-w-previous-comp-winner-s-model/notebook).

💗 This notebook adds annotations to the original code. If you find it useful, please click like. Thank you! 💗