## The training was done on my personal GPU

#### Training is performed in two steps. In the first step, we fine-tuned the pretrained byt5-small model on the dataset provided in the DU-iitverse competition for 5 epochs. In the second step, we trained the model obtained from the first step on the new Bhashamul dataset for 20 epochs. Inference is done with the model produced by the second step.

# Training

## First step

## Initialization

In [1]:
! pip install jiwer
! pip install seaborn
! pip install datasets
! pip install -U scikit-learn scipy matplotlib
# re and os ship with the Python standard library, so they need no installation
! pip install torchvision
! pip install transformers
! pip install accelerate -U
! pip install sentencepiece

Collecting jiwer
  Downloading jiwer-3.0.3-py3-none-any.whl.metadata (2.6 kB)
Downloading jiwer-3.0.3-py3-none-any.whl (21 kB)
Installing collected packages: jiwer
Successfully installed jiwer-3.0.3
Collecting scikit-learn
  Downloading scikit_learn-1.4.1.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting scipy
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting matplotlib
  Downloading matplotlib-3.8.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Downloading scikit_learn-1.4.1.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Downloading scipy-1.12.0-cp310-cp310-manyli

In [2]:
import pandas as pd
import seaborn as sns
from datasets import Dataset
from datasets import load_metric
import jiwer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tqdm import tqdm # tqdm is used to show progress bar
import re # re is used for regular expressions
import os # os is used for operating system related functions
import torch # torch is used for building deep learning models
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

2024-02-29 08:41:22.389674: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-29 08:41:22.389777: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-29 08:41:22.542571: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [3]:
import os
import random

import numpy as np
import torch

SEED = 3000

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

seed_everything(SEED)
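
As a small illustration of why the seed is fixed, the sketch below uses only the standard-library `random` module; `draw` is a hypothetical helper and not part of this notebook. Seeding with the same value twice should reproduce the same sequence. One caveat worth keeping in mind: setting `PYTHONHASHSEED` inside an already running process does not change hash randomization for that process, it only affects subprocesses started afterwards.

```python
import random


def draw(seed, n=5):
    """Seed the stdlib generator, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = draw(3000)
second = draw(3000)   # same seed as above
other = draw(3001)    # different seed

print(first == second)  # the two seeded runs are compared here
print(first == other)
```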

## Data pre-processing

In [None]:
train_df = pd.read_csv("/kaggle/input/uiu-dataset/trainIPAdb_u.csv")

In [None]:
# Matches ASCII letters and digits, which are removed from the Bengali text
alpha_pat = "[a-zA-Z0-9]"

train_df["text"] = train_df["text"].str.replace(alpha_pat, "", regex=True)
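
To make the effect of this cleaning step concrete, here is a minimal sketch with the standard-library `re` module, assuming the intent is to strip ASCII letters and digits. The sample sentence is hypothetical. The second half shows a common pitfall with character ranges: a range written as `A-z` spans code points 65 to 122 and therefore also matches the six punctuation characters that sit between `Z` and `a`.

```python
import re

# Character class for ASCII letters and digits
ascii_alnum = re.compile(r"[a-zA-Z0-9]")

# Hypothetical sample mixing Bengali text with Latin letters and a digit
sample = "আমি 2 টা PDF পাইছি"
cleaned = ascii_alnum.sub("", sample)
print(repr(cleaned))

# Pitfall: the range A-z also covers [ \ ] ^ _ and the backtick
wide = re.compile(r"[A-z]")
extra = wide.findall("[\\]^_`")
print(extra)
```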

In [None]:
train_df, val_df = train_test_split(train_df, test_size=0.00012, shuffle=True, random_state=3000)
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)
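
The `test_size` above is a very small fraction, so the validation set holds only a handful of rows. The sketch below estimates how many, assuming scikit-learn rounds the held-out count up for a float `test_size`. The dataset sizes are hypothetical, since the real row count is not stated in this notebook, and `holdout_size` is an illustrative helper.

```python
import math


def holdout_size(n_samples, test_size):
    """Rows in the held-out split for a float test_size, rounded up."""
    return math.ceil(test_size * n_samples)


# Hypothetical dataset sizes
for n in (10_000, 30_000, 64_000):
    print(n, holdout_size(n, 0.00012))
```

With so few validation rows, the reported WER is a coarse signal, and most of the data stays available for training.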

## Dataset

In [None]:
ds_train = Dataset.from_pandas(train_df)
ds_eval = Dataset.from_pandas(val_df)

## Model

In [None]:
model_id = "google/byt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
data_collator = DataCollatorForSeq2Seq(tokenizer)

In [None]:
def prepare_dataset(sample):
    output = tokenizer(sample["text"])
    output["labels"] = tokenizer(sample["ipa"])['input_ids']
    output["length"] = len(output["labels"])
    return output


ds_train = ds_train.map(prepare_dataset, remove_columns=ds_train.column_names)
ds_eval = ds_eval.map(prepare_dataset, remove_columns=ds_eval.column_names)
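
ByT5 works on raw UTF-8 bytes rather than subword pieces, which is why the `length` column can be much larger than the character count for Bengali text. The sketch below only mimics that behaviour with the standard library; it is not the real tokenizer. It assumes an offset of 3 for the special ids and an end-of-sequence id of 1, and `byt5_like_ids` is a hypothetical helper.

```python
def byt5_like_ids(text, offset=3, eos_id=1):
    """One id per UTF-8 byte, shifted past the special ids, plus a final EOS id."""
    return [b + offset for b in text.encode("utf-8")] + [eos_id]


latin = "hello"
bengali = "আগের দিন"  # short Bengali phrase

# Characters versus number of ids for each string
print(len(latin), len(byt5_like_ids(latin)))
print(len(bengali), len(byt5_like_ids(bengali)))
```

Each Bengali character takes three bytes in UTF-8, so a sentence of about 170 characters already approaches the 512-token generation limit used later.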

## Metric

In [None]:
wer_metric = load_metric("wer")


def compute_metrics(eval_preds):
    preds, labels = eval_preds

    # Some models return a tuple; the generated token ids come first
    if isinstance(preds, tuple):
        preds = preds[0]

    # -100 marks positions ignored by the loss; swap it for the pad id so decoding works
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Word error rate between decoded predictions and references (lower is better)
    result = wer_metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"wer": result}
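
For readers unfamiliar with the metric, the following is a minimal standard-library sketch of corpus-level word error rate: total word edits (substitutions, insertions, deletions) divided by the total number of reference words. It is an illustration under that definition, not the implementation used by the `wer` metric loaded above, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        r, h = ref.split(), hyp.split()
        # Levenshtein distance over words, computed one row at a time
        prev = list(range(len(h) + 1))
        for i, ref_word in enumerate(r, start=1):
            curr = [i]
            for j, hyp_word in enumerate(h, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # match or substitution
            prev = curr
        total_edits += prev[-1]
        total_words += len(r)
    return total_edits / total_words


one_substitution = word_error_rate(["a b c d"], ["a x c d"])
one_deletion = word_error_rate(["a b c d"], ["a b c"])
print(one_substitution, one_deletion)
```

Because IPA transcriptions are compared word by word, a single wrong symbol makes the whole word count as an error.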

## Training

In [None]:
model_id = "uiu/iit-pretrained-one/"

training_args = Seq2SeqTrainingArguments(
    output_dir=model_id,
    group_by_length=True,
    length_column_name="length",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    evaluation_strategy="steps",
    metric_for_best_model="wer",
    greater_is_better=False,
    load_best_model_at_end=True,
    num_train_epochs=5,
    save_steps=1000 // 2,
    eval_steps=1000 // 2,
    logging_steps=1000 // 2,
    learning_rate=5e-4,
    weight_decay=1e-2,
    warmup_steps=500 // 2,
    save_total_limit=2,
    predict_with_generate=True,
    generation_max_length=512,
    push_to_hub=False,
    report_to="none",
)
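
The batch settings above combine into an effective batch size, which in turn determines how often the step-based evaluation and checkpointing fire. The arithmetic can be sketched as follows, assuming a single GPU. The training-set size is hypothetical, and the Trainer's exact rounding of steps per epoch may differ slightly from this estimate.

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 1  # assumption: one GPU

# Samples consumed per optimizer update
effective_batch = (per_device_train_batch_size
                   * gradient_accumulation_steps
                   * num_devices)


def approx_optimizer_steps(n_samples, epochs):
    """Rough number of optimizer updates over the whole run."""
    return (n_samples // effective_batch) * epochs


n_train = 20_000  # hypothetical training-set size
samples_between_evals = 500 * effective_batch  # evaluation runs every 500 steps

print(effective_batch)
print(approx_optimizer_steps(n_train, 5))
print(samples_between_evals)
```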

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_eval,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
trainer.save_model(model_id)

## Second step

In [None]:
train_df = pd.read_csv("/kaggle/input/regipa/train_regipa.csv")
train_df = train_df.dropna().reset_index(drop=True)

In [None]:
train_df, val_df = train_test_split(train_df, test_size=0.00017, shuffle=True, random_state=3000, stratify=train_df["District"])
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)

In [None]:
ds_train = Dataset.from_pandas(train_df)
ds_eval = Dataset.from_pandas(val_df)

In [None]:
model_id = "uiu/iit-pretrained-one/"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
data_collator = DataCollatorForSeq2Seq(tokenizer)

In [None]:
def prepare_dataset(sample):
    output = tokenizer(sample["Contents"])
    output["labels"] = tokenizer(sample["IPA"])['input_ids']
    output["length"] = len(output["labels"])
    return output


ds_train = ds_train.map(prepare_dataset, remove_columns=ds_train.column_names)
ds_eval = ds_eval.map(prepare_dataset, remove_columns=ds_eval.column_names)

In [None]:
model_id = "uiu/uiu-one"

training_args = Seq2SeqTrainingArguments(
    output_dir=model_id,
    group_by_length=True,
    length_column_name="length",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    evaluation_strategy="steps",
    metric_for_best_model="wer",
    greater_is_better=False,
    load_best_model_at_end=True,
    num_train_epochs=20,
    save_steps=1000 // 2,
    eval_steps=1000 // 2,
    logging_steps=1000 // 2,
    learning_rate=5e-4,
    weight_decay=1e-2,
    warmup_steps=500 // 2,
    save_total_limit=2,
    predict_with_generate=True,
    generation_max_length=512,
    push_to_hub=False,
    report_to="none",
)

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_eval,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
trainer.save_model(model_id)

# Inference

In [4]:
import pandas as pd
test_df = pd.read_csv("/kaggle/input/regipa/test_regipa.csv")

In [5]:
# Sort by length
index = test_df["Contents"].str.len().sort_values(ascending=False).index
test_df = test_df.reindex(index)
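
Sorting by length before batched generation groups sentences of similar size, so each batch needs less padding. The sketch below illustrates the idea with plain lists, assuming every batch is padded to its longest item; the lengths and the `padded_cost` helper are hypothetical. It also shows how the original order can be recovered afterwards, which the notebook does later by sorting on the index.

```python
def padded_cost(lengths, batch_size):
    """Total padded positions when each batch is padded to its longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical sentence lengths
lengths = [5, 120, 8, 95, 12, 110, 7, 101]

# Positions of the items in descending length order
order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
sorted_lengths = [lengths[i] for i in order]

unsorted_cost = padded_cost(lengths, batch_size=2)
sorted_cost = padded_cost(sorted_lengths, batch_size=2)
print(unsorted_cost, sorted_cost)

# Stand-in outputs produced in sorted order, then mapped back to input order
outputs_sorted = [f"out-{n}" for n in sorted_lengths]
restored = [None] * len(lengths)
for pos, original_index in enumerate(order):
    restored[original_index] = outputs_sorted[pos]
```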

In [6]:
from transformers import pipeline
model_id = "/kaggle/input/uiu-dataset/checkpoint-15000"
pipe = pipeline("text2text-generation", model=model_id, device=0)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
%%time
texts = test_df["Contents"].tolist()
ipas = pipe(texts, max_length=512, batch_size=64)
ipas = [ipa["generated_text"] for ipa in ipas]

CPU times: user 5min 10s, sys: 128 ms, total: 5min 10s
Wall time: 5min 11s


In [8]:
test_df["string"] = ipas
test_df = test_df.sort_index()
test_df.head(10)

Unnamed: 0,Index,District,Contents,string
0,0,Rangpur,এলা সবায় সবার হাতোত <> অসহায় মানুষ আচে?,elɐ ʃɔbɐe̯ ʃɔbɐɾ hɐt̪ot̪ <> ɔʃɔhɐe̯ mɐnuʃ ɐce?
1,1,Rangpur,কেউ কারো ইয়া নাই।,ke͡u̯ kɐɾo ɪʲɐ nɐ͡ɪ̯।
2,2,Rangpur,"এলা ওই যে, কাইলকা ব্যাটায় ইপতারি আনচে, খাইচোং,...","elɐ o͡ɪ̯ ɟe, kɐ͡ɪ̯lkɐ bɛtɐe̯ ɪpot̪ɐɾɪ ɐnce, kʰ..."
3,3,Rangpur,আর মুই আগোত কী করচিনু?,ɐɾ mu͡ɪ̯ ɐgot̪ kɪ koɾcɪnu?
4,4,Rangpur,"<> আগের কতা বাদ দেও, এলা নাই।","<> ɐgeɾ kɔt̪ɐ bɐd̪ d̪e͡o̯, elɐ nɐ͡ɪ̯।"
5,5,Rangpur,আগের দিন ভুলি যাও।,ɐgeɾ d̪ɪn bʱulɪ ɟɐ͡o̯।
6,6,Rangpur,"আগেত কেমন ইনকাম আছিল, আগেত যেমন হাতোত ট্যাকা আ...","ɐget̪ kɛmon ɪnkɐm ɐcʰɪlo, ɐget̪ ɟɛmon hɐt̪ot̪ ..."
7,7,Rangpur,আগে তোমার জিনিস-পাতির দামও কম আছিলো।,ɐge t̪omɐɾ -pɐt̪ɪɾ d̪ɐmo kɔm ɐcʰɪlo।
8,8,Rangpur,"একন মনে করো, ট্যাকাআলা শাহিনোক, শাহিনের মায়োক,...","ɛkon mone koɾo, tɛkɐ͡ɐ̯lɐ ʃɐhɪnok, ʃɐhɪneɾ mɐʲ..."
9,9,Rangpur,"আর আগে টেকা আলা আসিল খাল, কায়?","ɐɾ ɐge tekɐ ɐlɐ ɐʃɪl kʰɐlo, kɐe̯?"


In [9]:
test_df = test_df.drop(columns=["Contents", "District"])
test_df = test_df.rename(columns={"Index": "id"})

In [10]:
# Manually set the transcriptions for the two samples that contain English text
test_df.at[6975, 'string'] = 'ɾɐt̪e 10 tɐ bɐ͡u̯ɟɟe।'
test_df.at[2418, 'string'] = 'ɐmɾɐɾ mɔtoɾ d̪u͡ɪ̯ gɔntɐ, 3 gɔntɐ pʰɔɾe, pʰɔɾe cʰɐɽun lɐge।'

In [11]:
test_df.to_csv("final.csv", index=False)