# Introduction

This notebook continues the pseudolabeling pipeline described in https://www.kaggle.com/code/reasat/pseudolabeling-step-1-download-speech-audio

The model weights and inference code are copied from: https://www.kaggle.com/competitions/bengaliai-speech/discussion/447970

## STT Model:

* OpenAI whisper-medium
* Huggingface trainer
* Trained on 8x 48GB RTX A6000
* bs=8 and lr=1e-5
* Train steps 50k
* Spectrogram dithering
* Spectrogram time and frequency masking
* Resampling 16 kHz -> 8 kHz -> 16 kHz as augmentation
* Inference with max_length=260, num_beams=4 and chunk_length_s=20.1s
* Libsonic based speed/pitch augmentation
* Datasets: OpenSLR 37, OpenSLR 53, MadASR, Shrutilipi, Macro, Kathbath, GoogleTTS generated audios and pseudo labeled YouTube videos
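
The time and frequency masking item above can be illustrated with a small, dependency-free sketch. It is not the training code used for this model: the function name `mask_spectrogram`, the mask widths, and the choice of zero as the fill value are all assumptions made for illustration, and the spectrogram is represented as a plain list of frequency rows.

```python
import random


def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Return a copy of `spec` with one frequency band and one time span zeroed.

    `spec` is a list of frequency rows, each a list of frame values.
    Widths and the zero fill value are illustrative assumptions.
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency masking: zero a band of consecutive frequency bins.
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [0.0] * n_time

    # Time masking: zero a span of consecutive frames in every bin.
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for row in out:
        row[t_start:t_start + t_width] = [0.0] * t_width
    return out


# Example: an 80-bin, 50-frame spectrogram filled with ones.
spec = [[1.0] * 50 for _ in range(80)]
masked = mask_spectrogram(spec, rng=random.Random(0))
```

During training, a fresh random mask would normally be drawn for every sample so that the model cannot rely on any single band or time span.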

## Punctuation Model:

* AutoModelForTokenClassification google/muril-base-cased
* Huggingface trainer
* Labels: period, comma and question mark
* bs=64, lr=2e-4 and max_seq_length=512
* Ensemble of 4 models (using 6, 8, 11 and 12 layers of google/muril-base-cased)
* Normalized IndicCorp v2 Bangla dataset
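
As a rough sketch of how an ensemble like this can pick a label for one token, the snippet below averages per-model class probabilities, scales them by per-class weights, and takes the argmax. This mirrors the structure of the `punctuate` function later in this notebook, but it is a simplified pure-Python illustration: the helper names `softmax` and `ensemble_label` and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_label(per_model_logits, class_weights):
    """Pick a class for one token from several models' logits.

    per_model_logits: one list of class logits per model.
    class_weights: one multiplier per class (not per model).
    """
    n_classes = len(class_weights)
    avg = [0.0] * n_classes
    for logits in per_model_logits:
        for i, p in enumerate(softmax(logits)):
            avg[i] += p / len(per_model_logits)
    scores = [p * w for p, w in zip(avg, class_weights)]
    return max(range(n_classes), key=scores.__getitem__)


# Class order assumed here: none, period, comma, question mark.
logits = [[2.0, 1.9, 0.0, 0.0], [2.0, 1.9, 0.0, 0.0]]
label_weighted = ensemble_label(logits, [1.0, 1.4, 1.0, 0.8])
label_uniform = ensemble_label(logits, [1.0, 1.0, 1.0, 1.0])
```

With these made-up logits, the period class narrowly loses under uniform weights but wins once its weight is raised to 1.4, which is the kind of bias a per-class weight vector introduces.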



In [1]:
!cp /kaggle/input/bengali-eval-data/predict.py .

!cp -r ../input/python-packages2 ./
!tar xvfz ./python-packages2/jiwer.tgz
!pip install ./jiwer/python-Levenshtein-0.12.2.tar.gz -f ./ --no-index
!pip install ./jiwer/jiwer-2.3.0-py3-none-any.whl -f ./ --no-index

cp: cannot stat '/kaggle/input/bengali-eval-data/predict.py': No such file or directory
jiwer/
jiwer/jiwer-2.3.0-py3-none-any.whl
jiwer/python-Levenshtein-0.12.2.tar.gz
jiwer/setuptools-65.3.0-py3-none-any.whl
Looking in links: ./
Processing ./jiwer/python-Levenshtein-0.12.2.tar.gz
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.2-cp310-cp310-linux_x86_64.whl size=79809 sha256=674f26f4a2b389ec89da380c7fbd375a2c1de249be6505abc4547e45e368f5fd
  Stored in directory: /root/.cache/pip/wheels/f1/ab/b1/90d2068d73d15e52c1a65676d269a9f043b61221a29f7298e7
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
  Attempting uninstall: python-Levenshtein
    Found existing installation: python-Levenshtein 0.25.0
    Uninstalling python

In [2]:
import os
import csv
import time
import glob

MODEL = '/kaggle/input/bengali-ai-asr-submission/bengali-whisper-medium/'
PUNCT_MODELS = [
    '/kaggle/input/bengali-ai-asr-submission/punct-model-6layers/',
    '/kaggle/input/bengali-ai-asr-submission/punct-model-8layers/',
    '/kaggle/input/bengali-ai-asr-submission/punct-model-11layers/',
    '/kaggle/input/bengali-ai-asr-submission/punct-model-12layers/'
]
CHUNK_LENGTH_S = 20.1
ENABLE_BEAM = True


PUNCT_WEIGHTS = [[1.0, 1.4, 1.0, 0.8]]

if ENABLE_BEAM:
    BATCH_SIZE = 4
else:
    BATCH_SIZE = 8


DATASET_PATH = '/kaggle/input/yt-speech-chunks/chunks'
    
# Remaining imports (csv and glob are already imported above).
import shutil
import librosa
import argparse
import warnings
from pathlib import Path

import transformers
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

print(transformers.__version__)

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

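# Collect every audio chunk (WAV and MP3) in the dataset directory.
files = glob.glob(os.path.join(DATASET_PATH, '*.wav'))
files += glob.glob(os.path.join(DATASET_PATH, '*.mp3'))

# NOTE: running on a few samples for demonstration. The slice is taken before
# sorting, so which 10 files are picked depends on glob's ordering.
files = files[:10]

files.sort()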

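# Load the fine-tuned Whisper checkpoint. Audio longer than CHUNK_LENGTH_S
# seconds is split into chunks and transcribed in batches on GPU 0.
pipe = pipeline(task="automatic-speech-recognition",
                model=MODEL,
                tokenizer=MODEL,
                chunk_length_s=CHUNK_LENGTH_S, device=0, batch_size=BATCH_SIZE)
# Force Bengali transcription instead of language detection or translation.
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="bn", task="transcribe")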

print("model loaded!")

4.38.2


2024-04-07 03:57:34.692018: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-07 03:57:34.692164: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-07 03:57:34.795374: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


model loaded!
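
To make the effect of `chunk_length_s` more concrete, here is a simplified sketch of how long audio can be cut into overlapping windows. It is not the Hugging Face pipeline's implementation: the function name `chunk_windows` is hypothetical, and the default overlap of one sixth of the chunk length on each side is an assumption, so the library's real defaults and boundary handling may differ.

```python
def chunk_windows(total_s, chunk_s=20.1, stride_s=None):
    """Return (start, end) times in seconds of overlapping windows.

    stride_s is the overlap kept on each side of a window; the default of
    chunk_s / 6 is an assumption made for this sketch.
    """
    if stride_s is None:
        stride_s = chunk_s / 6
    step = chunk_s - 2 * stride_s  # how far each new window advances
    if step <= 0:
        raise ValueError("stride_s must be smaller than chunk_s / 2")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((round(start, 3), round(end, 3)))
        if end >= total_s:
            break
        start += step
    return windows


# A clip shorter than one chunk yields a single window.
short = chunk_windows(10.0)
# A 30 s clip with 20 s chunks and 5 s of overlap per side yields two windows.
longer = chunk_windows(30.0, chunk_s=20.0, stride_s=5.0)
```

The overlap gives the model context on both sides of each cut, so words that straddle a boundary can be resolved when the chunk transcripts are merged.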


In [3]:
def fix_repetition(text, max_count):
    """Remove every occurrence of any word that appears more than max_count times."""
    words = text.split()

    # Count how often each distinct word occurs.
    uniq_word_counter = {}
    for word in words:
        uniq_word_counter[word] = uniq_word_counter.get(word, 0) + 1

    # Drop words that repeat too often (e.g. a decoder repetition loop).
    for word, count in uniq_word_counter.items():
        if count > max_count:
            words = [w for w in words if w != word]
    return " ".join(words)
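
A short worked example may help show what this filter does and what it costs. The snippet uses a compact `collections.Counter` variant, named `fix_repetition_compact` here purely for illustration, which should behave the same as the function above; the sample strings are made up.

```python
from collections import Counter


def fix_repetition_compact(text, max_count):
    """Drop all occurrences of words appearing more than max_count times."""
    words = text.split()
    counts = Counter(words)
    return " ".join(w for w in words if counts[w] <= max_count)


# "la" appears 4 times, which exceeds max_count=3, so every "la" is removed.
looped = fix_repetition_compact("la la la la ok then", max_count=3)

# Below the threshold, the text is returned unchanged.
clean = fix_repetition_compact("la la ok", max_count=3)
```

Note that the filter removes all occurrences of an over-frequent word, not just the surplus ones, so a legitimately common word in a long transcript would also disappear once it crosses the threshold.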

In [4]:
if ENABLE_BEAM:
    texts = pipe(files, generate_kwargs={"max_length": 260, "num_beams": 4})
else:
    texts = pipe(files)

In [5]:
# Drop the ASR pipeline before loading the punctuation models.
del pipe
import torch

# One token-classification model per ensemble member; all share one tokenizer.
models = [
    AutoModelForTokenClassification.from_pretrained(f).eval().cuda() for f in PUNCT_MODELS
]
tokenizer = AutoTokenizer.from_pretrained(PUNCT_MODELS[0])

# Class order of the model outputs: none, period (danda), comma, question mark.
PUNCT_MARKS = ['', '।', ',', '?']


def punctuate(text):
    input_ids = tokenizer(text).input_ids
    batch = torch.LongTensor([input_ids]).cuda()
    with torch.no_grad():
        # Average per-token class probabilities over the ensemble,
        # dropping the [CLS] and [SEP] positions.
        probs = sum(
            torch.nn.functional.softmax(model(input_ids=batch).logits[0, 1:-1], dim=1).cpu()
            for model in models
        ) / len(models)
    # Re-weight each class (not each model) before taking the argmax.
    probs = probs * torch.FloatTensor(PUNCT_WEIGHTS)
    label_ids = torch.argmax(probs, dim=-1)

    tokens = tokenizer(text, add_special_tokens=False).input_ids
    punct_text = ""
    for index, token in enumerate(tokens):
        token_str = tokenizer.decode(token)
        if token_str.startswith('##'):
            # WordPiece continuation: glue onto the previous piece.
            punct_text += token_str[2:]
        else:
            punct_text += " " + token_str
        punct_text += PUNCT_MARKS[label_ids[index].item()]

    return punct_text.strip()
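
The last loop of `punctuate` rebuilds text from WordPiece tokens while attaching one predicted mark per token. The standalone sketch below isolates that step with hand-written pieces and labels so it runs without a model; `merge_wordpieces` and the English sample pieces are hypothetical and only meant to illustrate the idea.

```python
PUNCT = ["", "।", ",", "?"]  # label order used by punctuate() in this notebook


def merge_wordpieces(pieces, label_ids):
    """Rebuild text from WordPiece strings, appending each token's mark."""
    text = ""
    for piece, label in zip(pieces, label_ids):
        if piece.startswith("##"):
            text += piece[2:]    # continuation piece: glue to the previous one
        else:
            text += " " + piece  # start of a new word
        text += PUNCT[label]
    return text.strip()


# The comma is predicted on the last piece of "playing", the danda on "now".
good = merge_wordpieces(["play", "##ing", "now"], [0, 2, 1])

# A mark predicted on a non-final piece ends up inside the word.
awkward = merge_wordpieces(["play", "##ing"], [2, 0])
```

The second call shows a limitation of per-token labels: nothing in this merging step prevents punctuation from landing in the middle of a word, so the approach relies on the models predicting marks only on word-final pieces.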

In [6]:
predictions = []
with open("submission.csv", 'wt', encoding="utf8", newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['id', 'sentence'])
    for f, text in zip(files, texts):
        file_id = Path(f).stem
        pred = text['text'].strip()
        pred = fix_repetition(pred, max_count=8)
        pred = punctuate(pred)
        # Ensure a non-empty prediction ends with punctuation; the emptiness
        # check avoids an IndexError when nothing was transcribed.
        if pred and pred[-1] not in ['।', '?', ',']:
            pred = pred + '।'
        prediction = [file_id, pred]
        writer.writerow(prediction)
        predictions.append(prediction)
print("inference finished!")

inference finished!


In [7]:
import pandas as pd
df = pd.read_csv("/kaggle/working/submission.csv")


In [8]:
df

|   | id | sentence |
|---|---|---|
| 0 | 000000_000121 | হামার ও তারুণ্যের কিতাবার্তা ভাইসাহাব আমি হলাম... |
| 1 | 000000_000167 | এক লাখ টাকায় বিপদের সাথে সাহায্য করছে। |
| 2 | 000000_000205 | কিন্তু আমার না হলে বড় তালায় ফাইনা দিব কোথায় আম... |
| 3 | 000000_000206 | যাইহোক আজকের লাগে ডিউটি শেষ আমার বেশ বুঝে একটু... |
| 4 | 000000_000229 | তাই আমার ব্যবসার লিয়ামাতে কত বড় তুই তো আমার সি... |
| 5 | 000000_000242 | ওহ বাবা খুচরে যার জীবনীত সব ছাইবার করিয়া থেকে ... |
| 6 | 000000_000285 | ওই যাও তুমি। কুটিপতি দেওয়া বাবা ফকিডর গেছে। ফক... |
| 7 | 000001_000127 | বাদ দিলেন না। |
| 8 | 000001_000203 | আর তোমার এত ফলা ফৈসল যে পায় আমারে দোকা তুমি না... |
| 9 | 000001_000246 | পিতারে আমার দেয়টা তার বাপ লাগে। |


In [9]:
print(len(df))

10
