# Try on arabic-words code
### challenges:
1. Lack of a parallel corpus for English-to-Arabic DS content

2. Large dataset needed for finetuning

Let's start with Coursera and gradually build on it.

# MT5 background
We'll be using MT5.
I chose this model because it has relatively few parameters compared to other pre-trained seq2seq models such as M2M, so it can be trained on Kaggle.

mT5 was pre-trained only on mC4, without any supervised training. Therefore, the model has to be fine-tuned before it is usable on a downstream task.
The mT5 model was introduced in 2020 as the multilingual successor to T5. The m stands for multilingual.

Both mT5 and T5 were trained in a similar fashion. The only differences are that mT5 was trained on multilingual data and has a far larger vocabulary (about 250k token embeddings). Both were initially trained on the span-corruption objective: “consecutive spans of input tokens are replaced with a mask token and the model is trained to reconstruct the masked-out token”.
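To make the span-corruption objective concrete, here is a minimal sketch in plain Python. It is a simplification, not the actual pre-training code: the span positions are passed in by hand (real pre-training samples them randomly), the helper name `span_corrupt` is made up for this example, and the closing sentinel that T5 appends to the target is left out. The `<extra_id_N>` markers follow the sentinel naming used by the T5 tokenizers.

```python
def span_corrupt(tokens, span_starts, span_len=2):
    """Replace each chosen span with a sentinel and return (inputs, targets)."""
    inputs, targets = [], []
    starts = set(span_starts)
    i, sentinel = 0, 0
    while i < len(tokens):
        if i in starts:
            marker = f"<extra_id_{sentinel}>"
            inputs.append(marker)                    # the span is hidden in the input
            targets.append(marker)                   # the target names the sentinel...
            targets.extend(tokens[i:i + span_len])   # ...followed by the hidden tokens
            i += span_len
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, span_starts=[2, 8])
print(" ".join(inputs))
print(" ".join(targets))
```

The model sees the first printed line as input and is trained to produce the second one.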

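The dataset used for training the model has about 6.3 trillion tokens and, per the mT5 paper, covers 101 languages (the figure of 107 that is sometimes quoted likely counts romanized variants separately).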

# Getting libraries

Transformers and simpletransformers: Hugging Face transformers is the most popular NLP library to date. It takes little effort to fine-tune state-of-the-art transformer-based models on tasks such as classification, text generation and summarization. Simpletransformers is a small library built on top of it to speed up prototyping and testing.

In [None]:
# !pip install transformers
!pip install simpletransformers

# #For tokenization
# !pip install sentencepiece 

In [None]:
!ls

In [None]:
import torch
import numpy as np
import pandas as pd
import os
# from google.colab import drive
import logging
#The T5Model class is used for any NLP task performed with a T5 model or a mT5 model.
from simpletransformers.t5 import T5Model, T5Args

In [None]:
!ls '/kaggle/input'

In [None]:
train = pd.read_csv('/kaggle/input/mt5finetuning/train_1.csv')

In [None]:
train.head()

In [None]:
train.shape

In [None]:
train.isnull().sum() / train.shape[0] *100

### drop na

In [None]:
train.dropna(inplace=True)

In [None]:
train.isnull().sum() / train.shape[0] *100

In [None]:

train = train.drop_duplicates(subset=['AR', 'EN'])
# train['Arabic_transcript'] = train.apply(lambda row: (row.Arabic_transcript).strip().lower(), axis=1)
# Lowercase the English side with the vectorized string accessor.
train['EN'] = train['EN'].str.lower()
train = train[["AR", "EN"]]


Fine-tuning mT5 was computationally challenging. I had to reduce the sequence length to 128. The limit is counted in tokens rather than words, so each input sentence should stay under 128 words (and ideally well under) to avoid truncation and the performance degradation that comes with it.

# Dataset format

The library requires the dataset to be a Pandas dataframe with three columns: input_text, target_text, and prefix. The prefix column is used during mT5 training to specify the task the model should perform (summarize, classify, ...). We don't need it here, so we create it and leave it blank ("").

Note that we are casting all the data in the Dataframe as strings. This is because mT5 is a sequence-to-sequence model which expects all inputs and outputs to be text sequences. If we have numeric values (or any other non-string values), we’ll run into errors during training.
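As a small illustration of that layout, the sketch below builds a toy dataframe with the three expected columns and casts everything to strings. The two rows are invented for this example and are not taken from the real training data; the second row carries a numeric target on purpose to show what the cast does.

```python
import pandas as pd

# Hypothetical toy pairs, only to show the expected layout.
pairs = [
    ("deep learning is a subset of machine learning.", "التعلم العميق هو فرع من تعلم الآلة."),
    ("step 3", 3),  # a non-string value that would cause trouble during training
]

toy_df = pd.DataFrame(pairs, columns=["input_text", "target_text"])
toy_df["prefix"] = ""  # no task prefix is needed for this use case
toy_df = toy_df[["prefix", "input_text", "target_text"]].astype(str)

print(toy_df)
```

After the cast, the numeric `3` becomes the string `"3"`, so every cell is a text sequence.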

In [None]:
train.columns = ['target_text', 'input_text']

### get number of sentences

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk

In [None]:
nltk.download('punkt')

In [None]:
# train['input_text_sent'] = train['input_text'].apply(sent_tokenize).tolist()

In [None]:
# train['input_text_sent_ar'] = train['target_text'].apply(sent_tokenize).tolist()

In [None]:
# train['input_text_sent_ar'][0]

In [None]:
# train['input_text_sent'][0]

## Count the number of sentences in the entire text corpus.

In [None]:
# Rough sentence count for the English side: number of '.'-separated pieces per row.
nsentences = train['input_text'].str.split('.').map(len).sum()
# nsentences = train['input_text'].count()

In [None]:
nsentences

In [None]:
# Same rough sentence count for the Arabic side.
nsentences = train['target_text'].str.split('.').map(len).sum()
# nsentences = train['input_text'].count()

In [None]:
nsentences
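Splitting on `'.'` is only an approximation: a text that ends with a period yields an extra empty piece, so the count runs high. The sketch below shows the effect and one possible alternative that ignores empty pieces and also treats `!`, `?` and the Arabic question mark as sentence ends. The helper name `count_sentences` is made up here, and decimals such as "3.5" would still be split, so this is a rough estimate too.

```python
import re


def count_sentences(text):
    """Count non-empty segments between sentence-ending marks (. ! ? and the Arabic question mark)."""
    segments = re.split(r"[.!?؟]+", text)
    return sum(1 for segment in segments if segment.strip())


sample = "Gradient descent updates the weights. It repeats until convergence."
naive = len(sample.split("."))  # counts the empty piece after the final period
print(naive, count_sentences(sample))
```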

In [None]:
train['input_text'].head(1)

## count the number of words in each example

In [None]:
train['count_words_ar'] = train['target_text'].apply(lambda row: len(word_tokenize(row)))
# train['article_len'] = train['target_text'].apply(lambda row: len(word_tokenize(row)))

In [None]:
train['count_words_en'] = train['input_text'].apply(lambda row: len(word_tokenize(row)))

In [None]:
train['count_words_en'].describe()

In [None]:
train['count_words_en'].shape

In [None]:
14955 - 14325

In [None]:
train[train['count_words_en'] <=128]

In [None]:
train[train['count_words_ar'] <=128]

In [None]:
# Keep only pairs where BOTH sides fit within 128 words.
train_128 = train[(train['count_words_ar'] <= 128) & (train['count_words_en'] <= 128)]

In [None]:
train_128.shape

I can first try training on examples of at most 128 words.

### create val. data

In [None]:
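# Hold out 5% of the rows for validation (random_state is an added assumption, for a reproducible split).
val = train_128.sample(frac=0.05, random_state=42).astype(str)
train_128 = train_128.drop(index=val.index).astype(str)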

In [None]:
val.info()

In [None]:
# train.drop(columns = ['input_text_sent', 'input_text_sent_ar'], inplace=True)

In [None]:
# train['INPUT_ len'] = train['target_text'].apply(lambda row: len(word_tokenize(row)))

In [None]:
# train['INPUT_ len'].describe()

### split into sentences according to the max length= 100

## Training the model

## 1. mt5-small - 300M parameters (small xD)

51M parameters

### Sidenote on GPU memory usage
The amount of GPU memory required to train a Transformer model depends on many different factors (maximum sequence length, number of layers, number of attention heads, size of the hidden dimensions, size of the vocabulary, etc.). Out of these, the maximum sequence length of the model is one of the most significant.
Also, mT5 has a much larger vocabulary than T5 (~250,000 tokens to ~32,000 tokens), contributing to mT5 being quite punishing in terms of GPU memory required.
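A quick back-of-the-envelope calculation shows how much the vocabulary alone contributes. The figures are assumptions taken from the published model configurations rather than from this notebook: a hidden size of 512 for both "small" models, and vocabularies of 250,112 (mT5) and 32,128 (T5) entries.

```python
# Assumed figures from the published model configs (not from this notebook).
D_MODEL = 512
VOCAB = {"mt5-small": 250_112, "t5-small": 32_128}


def embedding_params(vocab_size, d_model=D_MODEL):
    """Number of weights in one vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


for name, vocab_size in VOCAB.items():
    print(f"{name}: {embedding_params(vocab_size) / 1e6:.1f}M embedding weights")
```

Under these assumptions a single embedding matrix costs roughly 128M weights for mT5-small versus about 16M for T5-small. As far as I know, mT5 also keeps separate input and output embedding matrices, which would account for most of the roughly 300M total.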

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
import gc
# del variables
gc.collect()

One option is to train with a max length of 1024 and only then split the text into sentences.
But such a big seq_length could cause out-of-memory issues.

Another approach is to split each article into sentences before training. However, doing this isn't straightforward, so the splitting needs to be automated.
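One possible way to automate the splitting is sketched below: cut the text at sentence-ending punctuation and greedily pack whole sentences into chunks that stay within a word budget. The function name `split_into_chunks` and the regex are assumptions made for this example. It does not solve the harder part of the problem, which is keeping the English and Arabic chunks aligned with each other, and a single sentence longer than the budget is kept as its own oversized chunk.

```python
import re


def split_into_chunks(text, max_words=128):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    # Split after ., !, ? or the Arabic question mark when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?؟])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


article = "First sentence here. Second one is a bit longer than that. Third."
print(split_into_chunks(article, max_words=8))
```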

In [None]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

train_df = train_128
eval_df = val

train_df["prefix"] = ""
eval_df["prefix"] = ""

model_args = T5Args()

# A maximum sequence length of 128 tokens lets the model handle a few sentences
# at a time while keeping training time and GPU memory practical.
# model_args.max_seq_length = 1024
model_args.max_seq_length = 128
# Generally, larger batch sizes mean better GPU utilization and therefore shorter training times.
model_args.train_batch_size = 8
model_args.eval_batch_size = 8
model_args.num_train_epochs = 5
model_args.scheduler = "cosine_schedule_with_warmup"
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 10000
model_args.learning_rate = 0.0001
model_args.optimizer = 'Adafactor'
model_args.use_multiprocessing = False
model_args.fp16 = False
model_args.save_steps = -1
model_args.save_eval_checkpoints = False
model_args.no_cache = True
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.save_model_every_epoch = False
model_args.preprocess_inputs = False
model_args.use_early_stopping = True
model_args.num_return_sequences = 1
model_args.do_lower_case = True
model_args.output_dir = "/kaggle/output/kaggle/working/mt5/"
model_args.best_model_dir = "/kaggle/output/kaggle/working/mt5/best_model"

#If you are using wandb add: wandb.login(key="API KEY")
model_args.wandb_project = "Yoruba mT5"

model = T5Model("mt5", "google/mt5-small", args=model_args)

# Train the model
model.train_model(train_df, eval_data=eval_df)
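For intuition about the `cosine_schedule_with_warmup` setting used above, here is a minimal sketch of the usual shape of such a schedule: the learning rate rises linearly during warmup and then follows half a cosine wave down to zero. The formula mirrors the common Hugging Face-style definition as I understand it. The total of 8880 steps comes from the training log in this notebook, while the warmup length of 500 steps is a hypothetical value chosen only for illustration.

```python
import math


def cosine_with_warmup(step, total_steps, warmup_steps, base_lr=1e-4):
    """Learning rate at `step`: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


TOTAL_STEPS = 8880   # 5 epochs x 1776 steps, as in the training log
WARMUP_STEPS = 500   # hypothetical value, for illustration only

for step in (0, 250, 500, 4690, 8880):
    print(step, f"{cosine_with_warmup(step, TOTAL_STEPS, WARMUP_STEPS):.2e}")
```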

### Trial: seq_length=128 >> started 4:50 

Epochs 0/5. Running Loss: 4.3811

Epochs 1/5. Running Loss: 3.5221

Epochs 2/5. Running Loss: 3.6496

Epochs 3/5. Running Loss: 3.0159: 
{'global_step': [1776, 3552, 5328, 7104, 8880],
  'eval_loss': [3.452780764153663,
   2.907095891364077,
   2.6020818492199513,
   2.4616720942740744,
   2.4358180659882565],
  'train_loss': [4.381147861480713,
   3.5221028327941895,
   3.6496052742004395,
   3.0159411430358887,
   3.2644360065460205]})
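A quick way to read these numbers is to look at which checkpoint had the lowest eval loss and how much each epoch improved on the previous one. The values below are copied from the output above; nothing else is assumed.

```python
# Values copied from the training output above.
history = {
    "global_step": [1776, 3552, 5328, 7104, 8880],
    "eval_loss": [3.452780764153663, 2.907095891364077, 2.6020818492199513,
                  2.4616720942740744, 2.4358180659882565],
}

eval_loss = history["eval_loss"]
best = min(range(len(eval_loss)), key=eval_loss.__getitem__)
# Drop in eval loss from one evaluation to the next.
improvements = [round(prev - cur, 3) for prev, cur in zip(eval_loss, eval_loss[1:])]

print("best step:", history["global_step"][best])
print("per-epoch improvement:", improvements)
```

The eval loss is still falling at the last step, but the gain in the final epoch is under 0.03, so the curve is flattening.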

# Inference
For inference, we first need to load the fine-tuned model from the output directory specified earlier (in model_args.best_model_dir).

In [None]:
model_args = T5Args()
model_args.max_length = 128 #should match the max_seq_length that the model was trained on
model_args.length_penalty = 2.5 #Exponential penalty to the length. Default to 2
model_args.repetition_penalty = 1.5 #The parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Default to 1.0.
model_args.num_beams = 10

model1 = T5Model("mt5","/kaggle/output/kaggle/working/mt5/best_model" , args = model_args)
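To see what `length_penalty` does during beam search, the sketch below ranks two candidate translations the way Hugging Face-style beam search is commonly described: the summed log-probability of a hypothesis is divided by its length raised to `length_penalty`. The two candidates and their scores are made-up numbers, and the helper `beam_score` is written for this example rather than taken from the library.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalised score used to rank finished beam hypotheses."""
    return sum_logprobs / (length ** length_penalty)


# Hypothetical candidates: a short one and a longer one with lower total log-probability.
short_hyp = {"sum_logprobs": -4.0, "length": 5}
long_hyp = {"sum_logprobs": -9.0, "length": 20}

winners = {}
for penalty in (0.0, 1.0, 2.5):
    short_score = beam_score(short_hyp["sum_logprobs"], short_hyp["length"], penalty)
    long_score = beam_score(long_hyp["sum_logprobs"], long_hyp["length"], penalty)
    winners[penalty] = "short" if short_score > long_score else "long"
    print(penalty, winners[penalty])
```

Because log-probabilities are negative, a larger `length_penalty` shrinks the penalty on long outputs, so values such as 2.5 push the model toward longer translations.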

In [None]:
model1

1 epoch in around half an hour

1.20.. 1.50

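The train loss is larger than eval_loss, which is weird. One possible reason: the logged train loss looks like the running loss of a single batch with dropout active, whereas the eval loss is averaged over the whole validation set.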

**Trial 1**: Coursera DL >> garbage, took like 10 mins, 6k sentences.

**Trial 2**: 2 Coursera courses (DL and Michigan), springer file, DS_codata_org, yt_stanford, split into sentences.

# Evaluation

In [None]:
trans = model1.predict("In the last few years the Recurrent Neural Network-based architectures have shown the best performance in machine translation problems, but still they have some problems that had to be solved. First, they have a difficulty to cope with long-range dependencies (also LSTM when it has to deal with really long sentences). Secondly, each hidden state depends on the previous one")

In [None]:
len(trans)

In [None]:
trans

In [None]:
val.head()

In [None]:
val_list = val.input_text.values.tolist()

In [None]:
val_list[4]

In [None]:
model1.predict(val.input_text.values.tolist())

In [None]:
# Truncated model output: '، هناك بعض النقاط التي يمكن أن ت' (roughly: ", there are some points that can ...")

In [None]:
val['input_text'].head(1)

In [None]:
import os
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args
from rouge import Rouge

# Assumption: validation.csv sits in the same input folder as the training file; adjust as needed.
PATH_TO_DATA = "/kaggle/input/mt5finetuning"

# Load validation set
validation = pd.read_csv(os.path.join(PATH_TO_DATA, "validation.csv"))

model_args = T5Args()
model_args.max_length = 100
model_args.length_penalty = 2.5
model_args.repetition_penalty = 1.5
model_args.num_beams = 5

# Load model
model = T5Model("mt5", "mT5/best_model", args=model_args)

# Perform the inference
validation["preds"] = model.predict(validation.input_text.values.tolist())

# Compute ROUGE score (hypotheses first, then references)
rouge = Rouge()
scores = rouge.get_scores(
    validation["preds"].values.tolist(),
    validation["target_text"].values.tolist(),
    avg=True,
)
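To make the ROUGE numbers easier to interpret, here is a minimal sketch of what a ROUGE-1 F-score measures: the overlap of single words between a prediction and its reference. It is a simplified stand-in written for this example (whitespace tokenisation, no stemming), so its values may differ in detail from those of the `rouge` package.

```python
from collections import Counter


def rouge1_f(hypothesis, reference):
    """Unigram-overlap F1 between a hypothesis and a reference string."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    overlap = sum((hyp_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f("the cat", "the cat sat on"))
```

A prediction that copies only half of the reference words gets full precision but half the recall, which is why the score lands between the two.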

# Inference

In [None]:
model.predict("In the last few years the Recurrent Neural Network-based architectures have shown the best performance in machine translation problems, but still they have some problems that had to be solved. ")

1. 6k sentences; started 1:50 --8 mins

# Resources 
- https://huggingface.co/transformers/notebooks.html
- https://github.com/huggingface/transformers/issues/8704
- https://towardsdatascience.com/how-to-train-an-mt5-model-for-translation-with-simple-transformers-30ba5fa66c5f
- https://simpletransformers.ai/docs/usage/
- https://simpletransformers.ai/docs/t5-model/
- https://github.com/maroxtn/mt5-M2M-comparison/blob/main/mt5_test.ipynb