# Inference Notebook
Model: /kaggle/input/schadenfreude-bhashabhrom/banglat5 <br>
Model Source: Finetuned https://huggingface.co/csebuetnlp/banglat5_small for 120 epochs on DataSetFold1.csv (see training notebook ) <br>

Dataset : DataSetFold1_u.csv, DataSetFold2.csv, test.csv <br>

### Input
* /kaggle/input/bangla-ged : Contains the datasets provided by the hosts of EEEDay Datathon
* /kaggle/input/bangla-sadhu-verbs/sadhu.txt : Contains about 300 Bangla Verbs (Produced by me using Github Copilot)
* /kaggle/input/schadenfreude-bhashabhrom/banglat5 : Finetuned banglat5_small from the notebook Schadenfreude_EEEDay_Training

### Output
This notebook outputs:
* submission.csv
* Miscellaneous intermediate files


### Credit
Dataset provided by the Hosts of EEEDAY 2023 Datathon



In [1]:
!pip install -q datasets
!pip install -q git+https://github.com/csebuetnlp/normalizer
!pip install -q transformers
!pip install -q levenshtein
!pip install -q gdown

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
[0m/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
[0m/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
[0m/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
[0m/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
[0m

# Changing the Test File (/Prediction File)

By default the test file is /kaggle/input/bangla-ged/test.csv <br>
To replace it, change the following code block <br>
The format has to match the original test.csv <br>
For some post-processing, the test.csv is read directly (not with pandas). Due to Windows using \r\n and Linux using \n, there might be some problem if the test.csv was made in Windows, although I think I've handled it <br>

In [2]:
test_file_or_path = "/kaggle/input/bangla-ged/test.csv"

### Example of how to change the test file

In [3]:
# import gdown
# # url = "https://drive.google.com/uc?id=1iXsDI8yFTnJ27k80iPSBJ4YsjJVcOmgD"

# url = "https://drive.google.com/uc?id=10WrOwETFFsU4-0xiwtm7tGgWUDWCZ-aP"
# output = "/kaggle/working/test.csv"
# gdown.download(url,output, quiet=False)
# test_file_or_path = "/kaggle/working/test.csv"

# Creating files for Preprocessing and Postprocessing

## Making train.json, test.json and validation.json in kaggle/working/mt5_input

In [4]:

!rm -r /kaggle/working/mt5_input
!mkdir /kaggle/working/mt5_input

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
rm: cannot remove '/kaggle/working/mt5_input': No such file or directory
/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)


In [5]:
import pandas as pd
import codecs
import sys
import re


# DataSetFold1_u.csv has a single sentence with a newline inside it. 
# The JSONL input required by the model can not handle sentences with embedded new lines
# Must omit the newline pseudo-manually
train_file_or_path = "/kaggle/input/bangla-ged/DataSetFold1_u.csv"

newline_removed = ""
with codecs.open(train_file_or_path,encoding = 'utf8', mode="r") as f:
    whole_content = f.read()
    newline_removed = re.sub(r"কপোল\n", r"কপোল", whole_content)
    newline_removed = re.sub(r"\$কপোল\$\n", r"$কপোল$", newline_removed)

train_file_or_path = "/kaggle/working/DataSetFold1_u.csv"
with codecs.open(train_file_or_path,encoding = 'utf8', mode="w") as f:
    f.write(newline_removed)


#making train.json


df = pd.read_csv(train_file_or_path, encoding = 'utf8')

with codecs.open("/kaggle/working/mt5_input/train.json", encoding = 'utf8', mode='w') as f:
    for idx, row in df.iterrows():
        sentence = row["sentence"]
        gt = row["gt"]
        sentence = sentence.replace("\"", "\\\"")
        gt = gt.replace("\"", "\\\"")
        f.write("{\"source\": \"" + sentence + "\", \"target\": \"" + gt + "\"}\n")



#making test.json

df = pd.read_csv(test_file_or_path, encoding = 'utf8')
with codecs.open("/kaggle/working/mt5_input/test.json", encoding = 'utf8', mode='w') as f:
    for idx, row in df.iterrows():
        sentence = row["text"]
        sentence = sentence.replace("\"", "\\\"")
        gt = ""
        # gt = row["gt"]
        # gt = gt.replace("\"", "\\\"")
        f.write("{\"source\": \"" + sentence + "\", \"target\": \"" + gt + "\"}\n")


In [6]:
!tail -n 1000 /kaggle/working/mt5_input/train.json > /kaggle/working/mt5_input/validation.json

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)


In [7]:
!ls /kaggle/working/mt5_input

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
test.json  train.json  validation.json


## Making files required for postprocessing in kaggle/working/post_process
* test_sentences.txt (test.csv in .txt format without column name)
* combined.csv (DataSetFold1_u.csv + DataSetFold2.csv)
* error_words.txt (List of words from combined.csv that are commonly flagged as errors)


In [8]:
!rm -r /kaggle/working/post_process
!mkdir /kaggle/working/post_process

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
rm: cannot remove '/kaggle/working/post_process': No such file or directory
/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)


In [9]:
!tail -n +2 $test_file_or_path > /kaggle/working/post_process/test_sentences.txt

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)


In [10]:
!cp /kaggle/working/DataSetFold1_u.csv /kaggle/working/post_process/combined.csv

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)


In [11]:
!tail -n +2 /kaggle/input/bangla-ged/DataSetFold2.csv >> /kaggle/working/post_process/combined.csv

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)


## error_words.txt
This file contains a list of (word, occurence_count, error_count, ratio) tuples created from combined.csv
This is mainly used to handle sentences which the main t5 model cannot correctly predict. Above a certain ratio (error_count/occurence_count), words are surrounded by \$ using regex

NB: The regex used to create this txt file is a word in progress and fails to find all occurences of words. If the occurence count is zero, the words aren't included in error_words.txt

In [12]:
import pandas as pd
import re
import codecs


def create_error_words_file(df_path, expected_output_file):
    df = pd.read_csv(df_path)

    error_words = {}
    base_sentences = []

    for index, row in df.iterrows():
        base_line = row['sentence']
        base_sentences.append(base_line)
        row_data = row['gt']
        parts = re.split("(\$[^$]*\$)", row_data)
         
        for part in parts:
            if part.startswith("$") and part.endswith("$"):
                if len(part) > 3:
                    if part in error_words.keys():
                        error_words[part] +=1
                    else:
                        error_words[part] = 1
                    # error_words.add(part)

    sorted_error_words = []

    with codecs.open(expected_output_file, encoding = 'utf8', mode='w') as f:
        for item in error_words.keys():
            occ_count = 0
            for sentence in base_sentences:
                if item[1:-1] in sentence:
                    occ_count+=len(re.findall(r"(?<![^\u0980-\u09FF])"+re.escape(item[1:-1]) + r"(?=[^\u0980-\u09FF])",sentence))
                    occ_count+=len(re.findall(r"(?<=[^\u0980-\u09FF])"+re.escape(item[1:-1]) + r"(?=[^\u0980-\u09FF])",sentence))
            
            if occ_count > 0:
                sorted_error_words.append((error_words[item]/occ_count, item, occ_count, error_words[item]))
            else: 
                print("Skipping Regex Error: " + item, occ_count, error_words[item])
                #found the $word$ in [gt] but couldn't find word in [senetence]
        sorted_error_words.sort(reverse=True)

        sep = "-:-:-"
        for item in sorted_error_words:
            f.write(item[1] +  sep + str(item[0]) + sep + str(item[2]) + sep + str(item[3]) + "\n")

            
            
def file_level_levenshtein(file1, file2):
    with codecs.open(filename=file1,encoding='utf8') as f1:
        with codecs.open(filename=file2,encoding='utf8') as f2:
            total1 = 0
            total2 = 0
            count = 0
            for line1,line2 in zip(f1,f2):
                # total1 += my_levenstein(line1,line2)
                total1 += distance(line1,line2)
                count += 1
                


create_error_words_file("/kaggle/working/post_process/combined.csv", "/kaggle/working/post_process/error_words.txt")


Skipping Regex Error: $মাছ টাই$ 0 1
Skipping Regex Error: $□ $ 0 1
Skipping Regex Error: $বানাইছে$ 0 1
Skipping Regex Error: $বেসসা$ 0 1
Skipping Regex Error: $বাংগালী$ 0 1
Skipping Regex Error: $ পুলিশের কাঁধে ভর করে যখন সরকার চলে তখন $ 0 1
Skipping Regex Error: $ দেশের আইন শৃঙ্খলা $ 0 1
Skipping Regex Error: $ স্বাভাবিক $ 0 1
Skipping Regex Error: $...।$ 0 2
Skipping Regex Error: $মেডাম$ 0 1
Skipping Regex Error: $মাইরালা$ 0 1
Skipping Regex Error: $দোকান গুলোতে$ 0 1
Skipping Regex Error: $ব্লগ  $ 0 1
Skipping Regex Error: $সদেগাপ$ 0 1
Skipping Regex Error: $,,,,।$ 0 1
Skipping Regex Error: $ :-D$ 0 1
Skipping Regex Error: $হালারপুয়েরা$ 0 1
Skipping Regex Error: $৮) $ 0 1
Skipping Regex Error: $ক্যামনে$ 0 1
Skipping Regex Error: $*;;;;;;$ 0 1
Skipping Regex Error: $তোর বাপ তো $ 0 1
Skipping Regex Error: $আজাড়ে$ 0 1
Skipping Regex Error: $সোদাও$ 0 1
Skipping Regex Error: $জ্ঞান-$ 0 1
Skipping Regex Error: $।}$ 0 1
Skipping Regex Error: $ভালবা‌সি$ 0 1
Skipping Regex Error: $লড়ে$ 0 1


# Inference Using Finetuned banglat5_small

In [13]:
# This section was adapted from https://github.com/csebuetnlp/BanglaNLG/blob/main/seq2seq/run_seq2seq.py
    
import logging
import os
import sys
import glob
import json
from dataclasses import dataclass, field
from typing import Optional
from Levenshtein import distance

import datasets
import numpy as np
from datasets import load_dataset, load_metric
from datasets.io.json import JsonDatasetReader
from datasets.io.csv import CsvDatasetReader

import transformers
from transformers import (
    AutoConfig,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    HfArgumentParser,
    M2M100Tokenizer,
    MBart50Tokenizer,
    MBart50TokenizerFast,
    MBartTokenizer,
    MBartTokenizerFast,
    MBartForConditionalGeneration,
    AlbertTokenizer,
    AlbertTokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    default_data_collator,
    set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import check_min_version
from transformers.utils.versions import require_version
from normalizer import normalize

EXT2CONFIG = {
    "csv" : (CsvDatasetReader, {}),
    "tsv" : (CsvDatasetReader, {"sep": "\t"}),
    "jsonl": (JsonDatasetReader, {}),
    "json": (JsonDatasetReader, {})
}


logger = logging.getLogger(__name__)


@dataclass
class ModelArguments:
    
    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
    )


@dataclass
class DataTrainingArguments:
    dataset_dir: Optional[str] = field(
        default=None, metadata={
            "help": "Path to the directory containing the data files. (.csv / .tsv / .jsonl)"
            "File datatypes will be identified with their prefix names as follows: "
            "`train`- Training file(s) e.g. `train.csv`/ `train_part1.csv` etc. "
            "`validation`- Evaluation file(s) e.g. `validation.csv`/ `validation_part1.csv` etc. "
            "`test`- Test file(s) e.g. `test.csv`/ `test_part1.csv` etc. "
            "All files for must have the same extension."
        }
    )
    
    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."}
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of training examples to this "
            "value if set."
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
            "value if set."
        },
    )
    max_predict_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of prediction examples to this "
            "value if set."
        },
    )
    train_file: Optional[str] = field(
        default=None, metadata={"help": "A csv / tsv / jsonl file containing the training data."}
    )
    validation_file: Optional[str] = field(
        default=None, metadata={"help": "A csv / tsv / jsonl file containing the validation data."}
    )
    test_file: Optional[str] = field(default=None, metadata={"help": "A csv / tsv / jsonl file containing the test data."})
    do_normalize: Optional[bool] = field(default=False, metadata={"help": "Normalize text before feeding to the model."})
    unicode_norm: Optional[str] = field(default="NFKC", metadata={"help": "Type of unicode normalization"})
    remove_punct: Optional[bool] = field(
        default=False, metadata={
            "help": "Remove punctuation during normalization. To replace with custom token / selective replacement you should "
            "use this repo (https://github.com/abhik1505040/normalizer) before feeding the data to the script."
    })
    remove_emoji: Optional[bool] = field(
        default=False, metadata={
            "help": "Remove emojis during normalization. To replace with custom token / selective replacement you should "
            "use this repo (https://github.com/abhik1505040/normalizer) before feeding the data to the script."
    })
    remove_urls: Optional[bool] = field(
        default=False, metadata={
            "help": "Remove urls during normalization. To replace with custom token / selective replacement you should "
            "use this repo (https://github.com/abhik1505040/normalizer) before feeding the data to the script."
    })
    source_key: Optional[str] = field(
        default="source", metadata={"help": "Key / column name in the input file corresponding to the source data"}
    )
    target_key: Optional[str] = field(
        default="target", metadata={"help": "Key / column name in the input file corresponding to the target data"}
    )

    source_lang: Optional[str] = field(default=None, metadata={"help": "Source language id."})
    target_lang: Optional[str] = field(default=None, metadata={"help": "Target language id."})

    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    max_source_length: Optional[int] = field(
        default=1024,
        metadata={
            "help": "The maximum total input sequence length after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )
    max_target_length: Optional[int] = field(
        default=128,
        metadata={
            "help": "The maximum total sequence length for target text after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )
    val_max_target_length: Optional[int] = field(
        default=128,
        metadata={
            "help": "The maximum total sequence length for validation target text after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`."
            "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used "
            "during ``evaluate`` and ``predict``."
        },
    )
    
    num_beams: Optional[int] = field(
        default=5,
        metadata={
            "help": "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, "
            "which is used during ``evaluate`` and ``predict``."
        },
    )
    
    source_prefix: Optional[str] = field(
        default=None, metadata={"help": "A prefix to add before every source text."}
    )
    
    rouge_lang: Optional[str] = field(
        default=None,
        metadata={
            "help": "Target language for rouge",
        }    
    )

    def __post_init__(self):
        if self.train_file is not None and self.validation_file is not None:
            train_extension = self.train_file.split(".")[-1]
            assert train_extension in ["csv", "jsonl", "tsv", "json"], "`train_file` should be a csv / tsv / jsonl file."
            validation_extension = self.validation_file.split(".")[-1]
            assert (
                validation_extension == train_extension
            ), "`validation_file` should have the same extension csv / tsv / jsonl as `train_file`."


def main( model_args, data_args, training_args):
   
 
    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )

    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log on each process the small summary:
    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
    logger.info(f"Training/evaluation parameters {training_args}")

    set_seed(training_args.seed)
    has_ext = lambda path: len(os.path.basename(path).split(".")) > 1
    get_ext = lambda path: os.path.basename(path).split(".")[-1]


    if data_args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        raw_datasets = load_dataset(
            data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir
        )

    elif data_args.dataset_dir is not None:
        data_files = {}
        all_files = glob.glob(
            os.path.join(
                data_args.dataset_dir,
                "*"
            )
        )
        all_exts = [get_ext(k) for k in all_files if has_ext(k)]
        if not all_exts:
            raise ValueError("The `dataset_dir` doesnt have any valid file.")
            
        selected_ext = max(set(all_exts), key=all_exts.count)
        for search_prefix in ["train", "validation", "test"]:
            found_files = glob.glob(
                os.path.join(
                    data_args.dataset_dir,
                    search_prefix + "*" + selected_ext
                )
            )
            if not found_files:
                continue

            data_files[search_prefix] = found_files

        dataset_configs = EXT2CONFIG[selected_ext]
        raw_datasets = dataset_configs[0](
            data_files, 
            **dataset_configs[1]
        ).read()
        
    else:
        data_files = {
            "train": data_args.train_file, 
            "validation": data_args.validation_file,
            "test": data_args.test_file
        }

        data_files = {k: v for k, v in data_files.items() if v is not None}
        
        if not data_files:
            raise ValueError("No valid input file found.")

        selected_ext = get_ext(list(data_files.values())[0])

        dataset_configs = EXT2CONFIG[selected_ext]
        raw_datasets = dataset_configs[0](
            data_files, 
            **dataset_configs[1]
        ).read()

    config = AutoConfig.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,     
    )

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
        use_fast=False
    )
    
    model = AutoModelForSeq2SeqLM.from_pretrained(
        model_args.model_name_or_path,
        config=config,
        cache_dir=model_args.cache_dir,
    )

   
    model.resize_token_embeddings(len(tokenizer))

    if data_args.source_lang is not None and data_args.target_lang is not None:
        tokenizer.src_lang = data_args.source_lang
        tokenizer.tgt_lang = data_args.target_lang
    
        if isinstance(tokenizer, (MBartTokenizer, MBartTokenizerFast)):
            if isinstance(tokenizer, MBartTokenizer):
                model.config.decoder_start_token_id = tokenizer.lang_code_to_id[data_args.target_lang]
            else:
                model.config.decoder_start_token_id = tokenizer.convert_tokens_to_ids(data_args.target_lang)
        elif isinstance(tokenizer, AlbertTokenizer):
            model.config.decoder_start_token_id = tokenizer._convert_token_to_id_with_added_voc(tokenizer.tgt_lang)


    prefix = data_args.source_prefix if data_args.source_prefix is not None else ""

    for data_type, ds in raw_datasets.items():
        assert data_args.source_key in ds.features, f"Input files doesnt have the `{data_args.source_key}` key"
        if data_type != "test":
            assert data_args.target_key in ds.features, f"Input files doesnt have the `{data_args.target_key}` key"
        
        ignored_columns = set(ds.column_names) - set([data_args.source_key, data_args.target_key])
        raw_datasets[data_type] = ds.remove_columns(ignored_columns)

    max_target_length = data_args.max_target_length
    
    def preprocess_function(examples):
        normalization_kwargs = {
            "unicode_norm": data_args.unicode_norm,
            "punct_replacement": " " if data_args.remove_punct else None,
            "url_replacement": " " if data_args.remove_urls else None,
            "emoji_replacement": " " if data_args.remove_emoji else None
        }
        
        
        inputs = [normalize(ex, **normalization_kwargs) if data_args.do_normalize else ex 
                    for ex in examples[data_args.source_key]]
        inputs = [prefix + inp for inp in inputs]

        tokenizer_kwargs = {
            "max_length": data_args.max_source_length, 
            "padding": False,
            "truncation": True,
            "return_tensors": "np"
        }
        
   
        model_inputs = tokenizer(inputs, **tokenizer_kwargs)

   
        if data_args.target_key in examples:
            targets = [normalize(ex, **normalization_kwargs) if data_args.do_normalize else ex
                        for ex in examples[data_args.target_key]]

            tokenizer_kwargs.update({"max_length": max_target_length})
            

            with tokenizer.as_target_tokenizer():
                labels = tokenizer(targets, **tokenizer_kwargs)

                
            model_inputs["labels"] = labels["input_ids"]

        return model_inputs

    if training_args.do_train:
        if "train" not in raw_datasets:
            raise ValueError("--do_train requires a train dataset")
        train_dataset = raw_datasets["train"]
        if data_args.max_train_samples is not None:
            train_dataset = train_dataset.select(range(data_args.max_train_samples))
        
        with training_args.main_process_first(desc="train dataset map pre-processing"):
            train_dataset = train_dataset.map(
                preprocess_function,
                batched=True,
                batch_size= training_args.train_batch_size,
                num_proc=data_args.preprocessing_num_workers,
                remove_columns=train_dataset.column_names,
                load_from_cache_file=not data_args.overwrite_cache,
                desc="Running tokenizer on train dataset",
            )

    if training_args.do_eval:
        max_target_length = data_args.val_max_target_length
        if "validation" not in raw_datasets:
            raise ValueError("--do_eval requires a validation dataset")
        eval_dataset = raw_datasets["validation"]
        if data_args.max_eval_samples is not None:
            eval_dataset = eval_dataset.select(range(data_args.max_eval_samples))
        with training_args.main_process_first(desc="validation dataset map pre-processing"):
            eval_dataset = eval_dataset.map(
                preprocess_function,
                batched=True,
                batch_size= training_args.train_batch_size,
                num_proc=data_args.preprocessing_num_workers,
                remove_columns=eval_dataset.column_names,
                load_from_cache_file=not data_args.overwrite_cache,
                desc="Running tokenizer on validation dataset",
            )

    if training_args.do_predict:
        max_target_length = data_args.val_max_target_length
        if "test" not in raw_datasets:
            raise ValueError("--do_predict requires a test dataset")
        predict_dataset = raw_datasets["test"]
        if data_args.max_predict_samples is not None:
            predict_dataset = predict_dataset.select(range(data_args.max_predict_samples))
        with training_args.main_process_first(desc="prediction dataset map pre-processing"):
            predict_dataset = predict_dataset.map(
                preprocess_function,
                batched=True,
                batch_size=training_args.train_batch_size,
                num_proc=data_args.preprocessing_num_workers,
                remove_columns=predict_dataset.column_names,
                load_from_cache_file=not data_args.overwrite_cache,
                desc="Running tokenizer on prediction dataset",
            )

    data_collator = DataCollatorForSeq2Seq(
        tokenizer,
        model=model,
        padding=True,
        label_pad_token_id=tokenizer.pad_token_id,
        pad_to_multiple_of=8 if training_args.fp16 else None,
    )



    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        if isinstance(preds, tuple):
            preds = preds[0]

        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        
        decoded_preds2 = [x.replace("$","") for x in decoded_preds]
        decoded_labels2 = [x.replace("$","") for x in decoded_labels]
        
        sum_distance = distance(decoded_preds, decoded_labels)
        
        sum_distance2 = distance(decoded_preds2, decoded_labels2)
        
        result = {}
        result["lev"] = sum_distance
        result["lev2"] = sum_distance2

        prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
        result["gen_len"] = np.mean(prediction_lens)
        result = {k: round(v, 4) for k, v in result.items()}
        
        return result

    # Initialize our Trainer
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset if training_args.do_train else None,
        eval_dataset=eval_dataset if training_args.do_eval else None,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics if training_args.predict_with_generate else None,
    )

    # Training
    if training_args.do_train:
        checkpoint = None
        if training_args.resume_from_checkpoint is not None:
            checkpoint = get_last_checkpoint(training_args.output_dir)
        
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
        trainer.save_model() 

        metrics = train_result.metrics
        max_train_samples = (
            data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
        )
        metrics["train_samples"] = min(max_train_samples, len(train_dataset))

        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()

    # Evaluation
    results = {}
    max_length = (
        training_args.generation_max_length
        if training_args.generation_max_length is not None
        else data_args.val_max_target_length
    )
    num_beams = data_args.num_beams if data_args.num_beams is not None else training_args.generation_num_beams

    if training_args.do_predict:
        logger.info("*** Predict ***")

        predict_results = trainer.predict(
            predict_dataset, metric_key_prefix="predict", max_length=max_length, num_beams=num_beams
        )
        print("predictions done")
        metrics = predict_results.metrics
        max_predict_samples = (
            data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset)
        )
        metrics["predict_samples"] = min(max_predict_samples, len(predict_dataset))

        results.update(metrics)
        trainer.log_metrics("predict", metrics)
        trainer.save_metrics("predict", metrics)
        
        print("will write")
        if trainer.is_world_process_zero():
            if training_args.predict_with_generate:
                predictions = tokenizer.batch_decode(
                        predict_results.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True
                    
                )
                predictions = [pred.strip() for pred in predictions]
                output_prediction_file = os.path.join(training_args.output_dir, "generated_predictions.txt")
                with open(output_prediction_file, "w", encoding="utf-8") as writer:
                    writer.write("\n".join(predictions))
                print("write done")
    all_results_path = os.path.join(training_args.output_dir, "all_results.json")
    with open(all_results_path, 'w') as f:
        json.dump(results, f, indent=4, ensure_ascii=False)

    



In [14]:
# python ./run_seq2seq.py \
#     --model_name_or_path "csebuetnlp/banglat5" \
#     --dataset_dir "sample_inputs/" \
#     --output_dir "outputs/" \
#     --learning_rate=5e-4 \
#     --warmup_steps 5000 \
#     --label_smoothing_factor 0.1 \
#     --gradient_accumulation_steps 4 \
#     --weight_decay 0.1 \
#     --lr_scheduler_type "linear"  \
#     --per_device_train_batch_size=8 \
#     --per_device_eval_batch_size=8 \
#     --max_source_length 256 \
#     --max_target_length 256 \
#     --logging_strategy "epoch" \
#     --save_strategy "epoch" \
#     --evaluation_strategy "epoch" \
#     --source_key bn --target_key en \
#     --greater_is_better true --load_best_model_at_end \
#     --num_train_epochs 20 \
#     --do_train --do_eval --do_predict \
#     --predict_with_generate

In [15]:
# !rm -rf /kaggle/working/banglat5

In [16]:
# model_args = ModelArguments("csebuetnlp/banglat5_small")
model_args = ModelArguments("/kaggle/input/schadenfreude-bhashabhrom/banglat5")


data_args = DataTrainingArguments(  dataset_dir = "/kaggle/working/mt5_input", 
                                    max_source_length= 256 ,
                                    max_target_length= 256 ,
                                 )
training_args = Seq2SeqTrainingArguments(
    output_dir="banglat5/",
    learning_rate=5e-4,
    warmup_steps = 5000,
    label_smoothing_factor= 0.1 ,
    weight_decay =0.1 ,
    lr_scheduler_type ="linear" ,
    per_device_train_batch_size=32 ,
    per_device_eval_batch_size=32,
    logging_strategy= "epoch", 
    save_strategy ="epoch" ,
    evaluation_strategy ="epoch",
#     greater_is_better= False,
#     load_best_model_at_end = True,
#     metric_for_best_model = "lev" ,
    report_to = [],
    save_total_limit=2,
    gradient_accumulation_steps=4,
    resume_from_checkpoint = True,
    num_train_epochs =  120,
    do_train = False,
    do_predict = True,
    do_eval = False,
    predict_with_generate = True,
)


In [17]:
main(model_args, data_args, training_args)

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-b9a3157292396ae9/0.0.0...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-b9a3157292396ae9/0.0.0. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

[INFO|configuration_utils.py:657] 2023-03-10 09:15:26,873 >> loading configuration file /kaggle/input/schadenfreude-bhashabhrom/banglat5/config.json
[INFO|configuration_utils.py:708] 2023-03-10 09:15:26,878 >> Model config T5Config {
  "_name_or_path": "/kaggle/input/schadenfreude-bhashabhrom/banglat5",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "gradient_checkpointing": false,
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "torch_dtype": "float32",
  "transformers_version": "4.20.1",
  "use_c

Running tokenizer on validation dataset:   0%|          | 0/32 [00:00<?, ?ba/s]

  tensor = as_tensor(value)


Running tokenizer on prediction dataset:   0%|          | 0/157 [00:00<?, ?ba/s]

[INFO|trainer.py:2753] 2023-03-10 09:15:35,150 >> ***** Running Prediction *****
[INFO|trainer.py:2755] 2023-03-10 09:15:35,151 >>   Num examples = 5000
[INFO|trainer.py:2758] 2023-03-10 09:15:35,154 >>   Batch size = 32


predictions done
***** predict metrics *****
  predict_gen_len            =    14.7906
  predict_lev                =       5000
  predict_lev2               =       5000
  predict_loss               =    13.2731
  predict_runtime            = 0:08:53.15
  predict_samples            =       5000
  predict_samples_per_second =      9.378
  predict_steps_per_second   =      0.294
will write
write done


# Postprocessing

The raw output of the model is located in banglat5/generated_predictions.txt <br>
But as it is, the raw output has a number of issues that increase the F1 score a lot. <br>
A naive submission of the raw t5 output results in a F1 score of in the range 3.3-3.7 <br>

### Some Issues in the raw model output
* Missing double quotes
* Producing different unicode combinations of the same character (য় vs. য়)
* Missing letters (কোনভাবে vs. কোনভাবেই)
* Different spellings 
* Whole word errors (রেস্টুরেন্ট vs রেস্তোরাঁ)
* Missing punctuation (বাতাস বন্ধু vs. ’‘ বাতাস বন্ধু)
* Cannot handle long sentences
* others.

These need to be corrected algorithmically.


Out of 5000 test sentences:
* 253 were found in the provided training sets
* 40 could not be reconciled with the t5 raw output and were handled corrected using regex
* 4707 were predicted by the model (+ correction)

The effect of each pathway is discussed in more detail in the report paper.

In [18]:
import pandas as pd
import re
import codecs

from Levenshtein import distance


def get_bad_words(file_name):
    '''
    loads the words from sadhu.txt
    Not specific to "sadhu" words. Should also work with common misspellings etc.
    '''
    bad_words = []
    with codecs.open(file_name, encoding = 'utf8', mode='r') as f:
        for line in f.readlines():
            bad_words.append(line[:-2])
    return bad_words


def get_sorted_error_words(file_path):
    ''' load the words from error_words.txt generated in Preporcessing '''
    
    sorted_error_words = []
    with codecs.open(file_path, encoding = 'utf8', mode='r') as f:
        for line in f.readlines():
            
            parts = re.split("-:-:-", line)
        
            sorted_error_words.append((float(parts[1]), parts[0], int(parts[2]), int(parts[3])))
    return sorted_error_words


def regex_one_sentence(input_sentence, sorted_error_words, bad_words, threshold=0.45):
    '''Correct a single sentence using regex. Threshold value controls which ratio of in error_words.txt to correct upto'''
    
    return_sentence = input_sentence
    
    for word in bad_words:
        if word in return_sentence: 
            return_sentence = re.sub(re.escape(word) + r"(?=[ !?।,])", "$"+word+"$", return_sentence)
    
    for ratio,word,occ_c,err_c in sorted_error_words:
        if word.startswith("$\""):
            continue
        if word[1:-1] in bad_words:
            # print("Skipping bad word: " + word)
            continue

        if re.fullmatch(r"\$[0-9]+\$", word):
            # print("Skipping: " + word)
            continue
        
        if re.fullmatch(r"\$[,.\";*?-]+\$", word):
            # print("Skipping: " + word)
            continue
        

        if ratio < threshold:
            break
        
        if word[1:-1] in return_sentence:
            # all_line[idx] = line.replace(word[1:-1], word)
            return_sentence = re.sub(re.escape(word[1:-1]) + r"(?=[ !?।,])", word, return_sentence)
        # print(word)
    
    file_contents = return_sentence

    # file_contents = re.sub(r'-', r'$-$', file_contents)

    file_contents = re.sub(r'[0-9]+', r'$\g<0>$', file_contents)

    file_contents = re.sub(r'[,.;?][,.;?]+', r'$\g<0>$', file_contents)

    file_contents = re.sub(r'\$\$[^ \$]+\$ ', lambda x: x.group(0)[1:-2]+' ', file_contents)

    file_contents = re.sub(r' (?=[।?,!])', r'$ $', file_contents)

    # file_contents = re.sub(r'  ', r' $ $', file_contents)

    file_contents = re.sub(r'(?<=[^?!।"])\n', r'$$\n', file_contents)
    
    file_contents = re.sub(r'(?<=[^?!।])"\n', r'$$"\n', file_contents)
    
    file_contents = re.sub(r' "\n', r'$$ "\n', file_contents)

    # emoji_pattern = re.compile("[\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U00002702-\U000027B0\U0001F170-\U0001F251\U0001F004]+")

    # file_contents = emoji_pattern.sub(r'$\g<0>$', file_contents)
    
    file_contents = re.sub(r'[#|৵_♥✌]+', r'$\g<0>$', file_contents)
    
    file_contents = re.sub(r' ত ', r' $ত$ ', file_contents)

    return file_contents




### Adding quotes to the generated_predictions.txt by comparing it to test_sentences.txt

In [19]:
sorted_error_words = get_sorted_error_words("/kaggle/working/post_process/error_words.txt")
bad_words = get_bad_words("/kaggle/input/bangla-sadhu-verbs/sadhu.txt")

t5_output = "banglat5/generated_predictions.txt"
train_file = "/kaggle/working/post_process/combined.csv"
test_file = "/kaggle/working/post_process/test_sentences.txt"

quoted_mt5_output = []
quotes_added = 0

with codecs.open(test_file, encoding="utf8", mode='r') as f , codecs.open(t5_output, encoding="utf8", mode='r') as f2:
    
    for line1, line2 in zip(f.readlines(), f2.readlines()):
        line2 = line2.replace("\"", "\"\"")
        if line1[0]=="\"" and (line1[-2]=="\"" or  line1[-3]=="\""):
            quotes_added += 1
            quoted_mt5_output.append("\""+line2[:-1]+"\"")
        else:
            quoted_mt5_output.append(line2[:-1])

print("Quotes added to: ", quotes_added)

Quotes added to:  1312


### Making a dictionary of  test sentences that are in the train datasets. These will be put into the final submission file as is.

In [20]:

train_df = pd.read_csv(train_file, encoding = 'utf8')

test_lines = []
with codecs.open(test_file, encoding = 'utf8', mode='r') as f:
    for line in f.readlines():
        line = line.strip()
        test_lines.append(line)

        
test_lines_set = set(test_lines)

exact_match_dict = {}

with codecs.open(train_file, encoding = 'utf8', mode='r') as f:
    for tuple,codec_reading in zip(train_df.iterrows(), f.readlines()[1:]):
        index, row = tuple
        sentence = row["sentence"]
        gt = row["gt"]

        if codec_reading[0]=='"' and codec_reading[-3]=='"':
            if sentence[0]!='"' and sentence[-1]!='"':
                sentence = '"'+sentence+'"'
                gt = '"'+gt+'"'
                
                
        if sentence in test_lines_set:
            exact_match_dict[sentence] = gt


print("exact_match_size: ", len(exact_match_dict.keys()) )



exact_match_size:  253


In [21]:
## Uncomment this to essentially turn of exact matching
# exact_match_dict={}

In [22]:
print("size of dict in bytes: ", sys.getsizeof(exact_match_dict))

size of dict in bytes:  9328


### Correcting t5 output

NB: This is arguably the most important step in our submission.
The steps are:
* Read in one sentence from t5 output and corrresponding base text from test.csv
* If the sentence is in exact_match_dict, append it to submissions list as is
* Otherwise, step through the sentence character by character, making corrections when necessary
* If it fails, try whole word replacement correction
* If word replacement correction fails as well, handle that sentence with regex_one_sentence()

In [23]:
import pandas as pd
import codecs
import re
import sys
from Levenshtein import distance

utostio1_ = "য়"
utostio2 = "য়"
dor1_ = "ড়"
dor2 = "ড়"
dhor1_ = "ঢ়"
dhor2 = "ঢ়"
akar1 = "া"
roshikar1 = "ি"
dirghikar1 = "ী"
roshaukar1 ="ু"
dirghuukar1 = "ূ"
eekar1 = "ে"
oikar1 = "ৈ"
okar1_ = "ো"
okar2 = "ো"
aukar1_ = "ৌ"
aukar2 = "ৌ"
dontonaw1 = "ন"
moddhonno = "ণ"

REGEX_ERROR_THRESHOLD = 0.5
# used in regex_one_sentence(). if err_count/occ_count < 0.5, dont correct it


first_step_output = []

def t5_tendency(line,line2):
    ''' This function is called when all character level corrects fail. It attempts a word level correction and tries again'''

    line_t = line.replace("$", "")
    line_t = line_t.strip()
    line_t = line_t.replace("য়", "য়")
    line_t = line_t.replace("ো", "ো")
    line_t = line_t.replace("ড়", "ড়")
    line_t = line_t.replace("ঢ়", "ঢ়")

    line_t2 = line2.replace("$", "")
    line_t2 = line_t2.replace("য়", "য়")
    line_t2 = line_t2.replace("ো", "ো")
    line_t2 = line_t2.replace("ড়", "ড়")
    line_t2 = line_t2.replace("ঢ়", "ঢ়")

    words1 = line_t.split()
    words2 = line_t2.split()

    indexes = []
    count = 0
    for  word1, word2 in zip(words1, words2):
        if word1 != word2:
            indexes.append(count)
        count += 1
    
    for index in indexes:
        line = line.replace(words1[index], words2[index])
    
    return line


line_count = 0
error_count = 0
exact_match_count = 0



for line1, line2 in zip(quoted_mt5_output, test_lines):
    line1+="\n"
    line2+="\n"
    line_count += 1
       
    #modest improvement: private 1.072->1.0224 public 1.0988->1.0588
    if line2.strip() in exact_match_dict.keys():
        first_step_output.append(exact_match_dict[line2.strip()]+"\n")
#         print(exact_match_dict[line2.strip()])
        exact_match_count+=1
        continue

    index1 = 0
    index2 = 0
    error_flag = False
    new_line1 = ''
    chance = 1
    while True:
        if index1 >= len(line1) or index2 >= len(line2):
            break

        if line1[index1] == '$':
            #print("--1--")
            new_line1 += "$"
            index1 += 1
            continue

        if line1[index1] == line2[index2]:
            #print("--2--")
            new_line1 += line1[index1]
            index1 += 1
            index2 += 1
            continue
        
        
        
        if line1[index1] == "ু".strip() and line2[index2] == "ূ".strip():
            new_line1 += "ূ"
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == "ূ".strip() and line2[index2] == "ু".strip():
            new_line1 += "ু"
            index1 += 1
            index2 += 1
            continue
        if line1[index1] == roshaukar1 and line2[index2] == okar1_:
            new_line1 += okar1_
            index1 += 1
            index2 += 1
            continue
        if line1[index1] == roshaukar1 and line2[index2] == okar2:
            new_line1 += okar2
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == akar1 and line2[index2] == roshikar1:
            new_line1 += roshikar1
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == roshikar1 and line2[index2] == akar1:
            new_line1 += akar1
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == roshikar1 and line2[index2] == dirghikar1:
            new_line1 += dirghikar1
            index1 += 1
            index2 += 1
            continue
            
        if line1[index1] == dirghikar1 and line2[index2] == roshikar1:
            new_line1 += roshikar1
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == dontonaw1 and line2[index2] == moddhonno:
            new_line1 += moddhonno
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == moddhonno and line2[index2] == dontonaw1:
            new_line1 += dontonaw1
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == roshikar1 and line2[index2] == eekar1:
            new_line1 += eekar1
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == eekar1 and line2[index2] == roshikar1:
            new_line1 += roshikar1
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == eekar1 and line2[index2] == akar1:
            new_line1 += akar1
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == akar1 and line2[index2] == eekar1:
            new_line1 += eekar1
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == eekar1 and line2[index2] == dirghikar1:
            new_line1 += dirghikar1
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == dirghikar1 and line2[index2] == eekar1:
            new_line1 += eekar1
            index1 += 1
            index2 += 1
            continue

        if (line1[index1] == okar1_ or line1[index1] == okar2) and line2[index2] == akar1:
            new_line1 += akar1
            index1 += 1
            index2 += 1
            continue

        if line1[index1] == akar1 and (line1[index1] == okar1_ or line1[index1] == okar2):
            new_line1 += line1[index1]
            index1 += 1
            index2 += 1
            continue

        allowed = ["?", "…","!","*","ু"," ূ","ঁ", "।", "\n",",","-","\"",".", " ","!","?","।", "“", "—", "”", ".", "‘", "–", "’", "*", ]

        if line2[index2] in allowed:
            new_line1 += line2[index2]
            index2 += 1
            continue

        problematic = ["♥","✌","✔","♣"]

        if line2[index2] in problematic:
            new_line1 += ("$"+line2[index2]+"$")
            index2 += 1
            continue
        

        if line2[index2] == " ":
            new_line1 += line2[index2]
            index2 += 1
            # index2 += 1
            continue
        
        if line1[index1] in [" ", "ঁ","?","’","।", "✔","'","\"", "-",",","."]:
            index1 += 1
            continue

        if line1[index1] == "\n":
            new_line1 += line2[index2:]
            break

        if line1[index1:index1+2] == utostio1_ and line2[index2] == utostio2:
            #print("--3--")
            new_line1 += utostio2
            index1 += 2
            index2 += 1
            continue
            
        if line1[index1] == utostio2 and line2[index2:index2+2] == utostio1_:
            #print("--4--")
            new_line1 += utostio1_
            index1 += 1
            index2 += 2
            continue

        if line1[index1:index1+2] == okar1_ and line2[index2] == okar2:
            #print("--5--")
            new_line1 += okar2
            index1 += 2
            index2 += 1
            continue
        
        if line1[index1] == okar2 and line2[index2:index2+2] == okar1_:
            #print("--6--")
            new_line1 += okar1_
            index1 += 1
            index2 += 2
            continue

        if line1[index1:index1+2] == aukar1_ and line2[index2] == aukar2:
            #print("--7--")
            new_line1 += aukar2
            index1 += 2
            index2 += 1
            continue

        if line1[index1] == aukar2 and line2[index2:index2+2] == aukar1_:
            #print("--8--")
            new_line1 += aukar1_
            index1 += 1
            index2 += 2
            continue

        if line1[index1:index1+2] == dor1_ and line2[index2] == dor2:
            #print("--7--")
            new_line1 += dor2
            index1 += 2
            index2 += 1
            continue

        if line1[index1] == dor2 and line2[index2:index2+2] == dor1_:
            #print("--8--")
            new_line1 += dor1_
            index1 += 1
            index2 += 2
            continue
            
        if line1[index1:index1+2] == dhor1_ and line2[index2] == dhor2:
            #print("--9--")
            new_line1 += dhor2
            index1 += 2
            index2 += 1
            continue

        if line1[index1] == dhor2 and line2[index2:index2+2] == dhor1_:
            #print("--10--")
            new_line1 += dhor1_
            index1 += 1
            index2 += 2
            continue
        
        print("couldn't correct, line_count: ", line_count)
#         print("error", line1[index1],"  &   ", line2[index2])
#         print("error", line1[index1:index1+2],"  &   ", line2[index2])
#         print("error", line1[index1],"  &   ", line2[index2:index2+2])
        
#         print(line_count)

        #trying to repair
        if chance == 1:
            line1 = t5_tendency(line1,line2)
            index1 = 0
            index2 = 0
            new_line1 = ''
            chance = 0    #try word level correction only once
            continue
            
        error_flag = True
        error_count += 1
        break
   
    if error_flag:
        first_step_output.append(regex_one_sentence(line2, sorted_error_words, bad_words, threshold=REGEX_ERROR_THRESHOLD))   
       
    else:
        first_step_output.append(new_line1)   

print("Regexed Sentences #", error_count)  
print("Exact Match Sentences #", exact_match_count)  


couldn't correct, line_count:  23
couldn't correct, line_count:  23
couldn't correct, line_count:  70
couldn't correct, line_count:  112
couldn't correct, line_count:  112
couldn't correct, line_count:  135
couldn't correct, line_count:  211
couldn't correct, line_count:  211
couldn't correct, line_count:  318
couldn't correct, line_count:  318
couldn't correct, line_count:  319
couldn't correct, line_count:  325
couldn't correct, line_count:  385
couldn't correct, line_count:  385
couldn't correct, line_count:  421
couldn't correct, line_count:  421
couldn't correct, line_count:  431
couldn't correct, line_count:  473
couldn't correct, line_count:  548
couldn't correct, line_count:  557
couldn't correct, line_count:  599
couldn't correct, line_count:  599
couldn't correct, line_count:  678
couldn't correct, line_count:  678
couldn't correct, line_count:  685
couldn't correct, line_count:  731
couldn't correct, line_count:  772
couldn't correct, line_count:  784
couldn't correct, line_

In [24]:
print(len(first_step_output))

5000


### Alogorithmic Improvements on T5 Predictions
Corrections against sadhu.txt were made only for sentences handled with regex. T5 on its own misses a few of these words which can optionally be corrected. Not a significant improvement in any case.

In [25]:
#very slight improvement in public score 1.0588 -> 1.0564

from tqdm import tqdm

second_step_output = []

line_count = 0
corrections_made = 0

for line in tqdm(first_step_output):
    to_append = line
    if line.strip().replace("$","") in exact_match_dict.keys():
        second_step_output.append(to_append)  
        line_count += 1
        continue

    for word in bad_words:
        if re.search(r"(?<=[ !?।,\"])"+re.escape(word) + r"(?=[ !?।,\"])", line):
            if "$"+word+"$" not in line:
                to_append = line.replace(word, "$"+word+"$")
                corrections_made +=1

# #   This actually degrades F1 score and is slow to as well.
#     for ratio,word,occ_c,err_c in sorted_error_words:
        
#         if occ_c == 1 and err_c!=1:
#             continue
            
#         if word == "$\"আমি$":
#             continue
#         if word[1:-1] in bad_words:
#             continue

#         if re.fullmatch(r"\$[0-9]+\$", word):
#             continue

#         if re.fullmatch(r"\$[,.;*?-]+\$", word):
#             continue

#         if ratio < 1:
#             break

#         allowed = [" " , "!", "?", ",", "।", '"']
#         for x in allowed:
#             for y in allowed:
#                 if x+word[1:-1]+y in to_append and word not in to_append:
#                     to_append = to_append.replace(" "+word[1:-1]+" "," "+word+" ")
#                     corrections_made +=1
#                     break

    second_step_output.append(to_append)
    line_count += 1

print("2nd step correction count: ", corrections_made)     


100%|██████████| 5000/5000 [00:05<00:00, 893.05it/s]

2nd step correction count:  14





In [26]:
print(len(second_step_output))

5000


In [27]:
second_step_output[:5]

['ব্যক্তি থেকেই শুরু যার সমাপ্তি হবে বিশ্বে মুসলিমদের মহত্ত্বের চর্চায় যেমনটি আগে ছিল রাসূলের আদর্শে আছে।\n',
 '"ইউনিটগণ, পিছিয়ে যান।"\n',
 'ভিক্টোরিয়ার ক্লেটন ক্যাম্পাসে ছয়টি আবাসিক হল রয়েছে।\n',
 'তেমনি আবির একটা বইপোঁকা সেটা বললে ভুল $হবেনা$।\n',
 'আমরা চাই গরিব চাচা যেন $তারনেয্য$ টাকা $পিরে$ পায় এবং সারা $বাংলাদেসে$ যমুনা টিবির সুনাম বয়ে $জায়$$$\n']

# Writing to file submission.csv

In [28]:

output_file = "post_processed_predictions.txt"
with codecs.open(output_file, mode='w', encoding="utf8") as f:
    for line in second_step_output:
        f.write(line)

In [29]:
!ls /kaggle/working/banglat5

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
all_results.json  generated_predictions.txt  predict_results.json


In [30]:
from IPython.display import FileLink
FileLink(r'banglat5/generated_predictions.txt')

In [31]:
FileLink(r'post_process/test_sentences.txt')

In [32]:
FileLink(r'post_processed_predictions.txt')

In [33]:

with codecs.open(output_file, encoding="utf8") as f:
    with codecs.open("submission.csv", mode='w', encoding="utf8") as f1:
        index = 1
        f1.write("Id,Expected\n")
        for line in f.readlines():
            f1.write(str(index) + "," + line)
            index += 1


df = pd.read_csv("submission.csv")

In [34]:
from IPython.display import FileLink
FileLink(r'submission.csv')