## This notebook is about training a RoBERTa (or any other transformer) model using the Hugging Face library
Things you need to know when using the Hugging Face transformers library:
- There are 4 steps for fine-tuning a transformer model
    1. Import a __tokenizer__ to tokenize the given text into a format the model understands
    2. Feed the tokenized data to the __model__
    3. __Define training parameters__ for _fine-tuning_ the model
    4. __Train__ the model
    
    


# Imports

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


from transformers import RobertaTokenizer, RobertaForSequenceClassification,Trainer, TrainingArguments
import torch

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
        


#
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Make some directories for Outputs, logs and model
> 📌 The __./model__ directory stores the files of both the tokenizer and the model weights. In other words, the tokenizer files are saved to the same path as the model parameters

In [None]:
# Create the directories if they do not already exist
for folder in ("output", "logs", "model"):
    os.makedirs(folder, exist_ok=True)
  

## Install the wandb library. wandb (Weights & Biases) is an experiment-tracking tool that the Trainer uses to log training metrics

In [None]:
!pip install wandb

## Get wandb API key
To get a wandb \<API KEY> you need to sign up on the [wandb](https://wandb.ai/site) website. There are two ways to find the API key:
- As soon as you sign up at [wandb](https://wandb.ai/site), a key is shown on the web page
- Otherwise, go to [account settings](https://wandb.ai/settings), where you can find the API key

Get the key and place it in the code below

> 📌 I have used my own key below. However, I will be deleting that key after making this notebook public. Please follow the steps above to get your own key
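
A key pasted into a public notebook is visible to everyone. One possible alternative, sketched here, is to read the key from an environment variable. `WANDB_API_KEY` is the variable that wandb itself looks for; the helper name `get_wandb_key` is hypothetical and not part of wandb.

```python
import os


def get_wandb_key(env_var: str = "WANDB_API_KEY") -> str:
    """Return the wandb API key stored in an environment variable.

    `get_wandb_key` is a hypothetical helper written for this notebook,
    not a wandb function.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before logging in to wandb")
    return key


# Usage in the login cell (assumes the variable has been set beforehand):
# wandb.login(key=get_wandb_key())
```

This keeps the key out of the notebook source, so nothing has to be deleted before publishing.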

In [None]:
import wandb
wandb.login(key = "a0f553a701b1c86e18b067324c61cdf1adcd410b") ## Use your api key here

## Load Dataset

In [None]:
data = pd.read_csv("/kaggle/input/commonlitreadabilityprize/train.csv")
data.head()

## Split the data into training and validation sets
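
The mask-based split in the cell below draws new random numbers on every run, so the training and validation rows change between runs. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable. `SEED` and `make_split_mask` are hypothetical names, and the row count of 1000 is an illustrative value, not the real dataset size.

```python
import numpy as np

SEED = 42  # hypothetical value; any fixed integer gives a repeatable split


def make_split_mask(n_rows: int, train_frac: float = 0.8, seed: int = SEED) -> np.ndarray:
    """Return a boolean mask that is True for rows assigned to the training set."""
    rng = np.random.default_rng(seed)
    return rng.random(n_rows) < train_frac


msk = make_split_mask(1000)
print("train rows:", int(msk.sum()), "validation rows:", int((~msk).sum()))
```

The mask can be used exactly like `msk` in the cell below, for example `data[cols][msk]` and `data[cols][~msk]`.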

In [None]:
cols = ["excerpt","target"]
msk = np.random.rand(len(data)) < 0.8

train_data = data[cols][msk]
val_data = data[cols][~msk]

## Tokenize data
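
As a rough picture of what `truncation=True` and `padding=True` do in the cell below, here is a simplified pure-Python illustration. It is not the tokenizer's implementation: the token ids are made up, `pad_and_truncate` and `max_len` are hypothetical names, and real truncation also takes care of the special tokens. The default `pad_id=1` is the pad token id in the `roberta-base` vocabulary.

```python
from typing import Dict, List


def pad_and_truncate(
    batch: List[List[int]], max_len: int, pad_id: int = 1
) -> Dict[str, List[List[int]]]:
    """Cut every sequence to max_len, then pad all of them to the longest one."""
    truncated = [ids[:max_len] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


enc = pad_and_truncate([[0, 11, 12, 2], [0, 21, 22, 23, 24, 25, 2]], max_len=6)
print(enc["input_ids"])
print(enc["attention_mask"])
```

The two keys mirror what `train_encodings.keys()` shows further down: `input_ids` and `attention_mask`.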

In [None]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
train_encodings = tokenizer(list(train_data["excerpt"]), truncation=True, padding=True, return_tensors="pt")
val_encodings = tokenizer(list(val_data["excerpt"]), truncation=True, padding=True, return_tensors="pt")


In [None]:
tokenizer.save_pretrained("./model")

In [None]:
train_encodings.keys()

## PyTorch dataset class
- Prepares the dataset for feeding into the model

In [None]:
import torch

class ReadabilityDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

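    def __getitem__(self, idx):
        # The encodings are already tensors (return_tensors="pt"), so index them
        # directly; wrapping them in torch.tensor again triggers a copy warning
        item = {key: val[idx] for key, val in self.encodings.items()}
        # Regression target, stored explicitly as a float tensor
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.float)

        return item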

    def __len__(self):
        return len(self.labels)

train_dataset = ReadabilityDataset(train_encodings, list(train_data["target"]))
val_dataset = ReadabilityDataset(val_encodings, list(val_data["target"]))

## Define training arguments for finetuning
- We use a one-cycle learning rate for fine-tuning the model, via the [cosine scheduler with warmup](https://huggingface.co/transformers/main_classes/optimizer_schedules.html#transformers.get_cosine_schedule_with_warmup) function
- You can read [this](https://medium.com/dsnet/the-1-cycle-policy-an-experiment-that-vanished-the-struggle-in-training-neural-nets-184417de23b9) article to learn more about the one-cycle LR policy
- The change in learning rate looks something like this
    - The initial climb in learning rate is the warmup phase; its length is set by _warmup_steps_
    
![](https://i.ibb.co/FD6fXFr/warmup-cosine-schedule.png)

> 📌 I tried a few learning rates manually to find a good one for training the model, since Hugging Face has not yet implemented an LR finder function
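
The shape in the plot can also be written down directly. The sketch below follows the usual formula of a linear warmup followed by half a cosine wave. It is a standalone illustration, not the library's code: `cosine_lr_with_warmup` is a hypothetical name and `total_steps = 1500` is an illustrative value, while the peak learning rate of `1e-5` and the 300 warmup steps match the training arguments used in this notebook.

```python
import math


def cosine_lr_with_warmup(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear climb from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed so far, between 0 and 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps = 1500  # illustrative; really (batches per epoch) * (number of epochs)
for step in (0, 150, 300, 900, 1500):
    print(step, cosine_lr_with_warmup(step, 1e-5, 300, total_steps))
```

The rate starts at zero, reaches its peak exactly when warmup ends, and falls back to zero at the last step.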

In [None]:
import transformers
training_args = TrainingArguments(
    output_dir='./output',          # output directory
    num_train_epochs=8,              # total number of training epochs
    per_device_train_batch_size=12,  # batch size per device during training
    per_device_eval_batch_size=12,   # batch size for evaluation
    warmup_steps=300,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
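    logging_steps=10,                # log training metrics every 10 steps
    load_best_model_at_end = True,   # reload the best checkpoint when training ends
    do_eval = True,                  # enable evaluation on the validation set
    learning_rate = 1e-5,            # peak learning rate, reached after warmup
    lr_scheduler_type = "cosine"     # cosine decay after the warmup phase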
)

In [None]:
# Tuning note: 1e-5 with 180 was the best so far; trying 300

## Load Model

In [None]:
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels = 1)


# Train the model

In [None]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

# Save model
- Save the model so that it can be used later in offline settings

In [None]:
trainer.save_model("./model")

# Validation scores
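
The score computed in the cell below is the root mean squared error (RMSE), which is the square root of the mean squared error. A small standard-library sketch of the same calculation, using made-up numbers rather than real predictions; `rmse` is a hypothetical helper name.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up targets and predictions, for illustration only
print(rmse([0.0, -1.0, -2.0], [0.5, -1.0, -3.0]))
```

Lower is better, and a value of 0 would mean every prediction matches its target exactly.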

In [None]:
from sklearn.metrics import mean_squared_error
val_preds = trainer.predict(val_dataset)
# Root mean squared error (RMSE) on the validation set
mean_squared_error(val_data["target"], val_preds.predictions.ravel()) ** 0.5

# Predicting on test data

In [None]:
test_data = pd.read_csv("/kaggle/input/commonlitreadabilityprize/test.csv")
test_encodings = tokenizer(list(test_data["excerpt"]), truncation=True, padding=True, return_tensors="pt")
# The test set has no targets, so pass dummy float labels as placeholders
test_dataset = ReadabilityDataset(test_encodings, [0.0] * len(test_data))
preds = trainer.predict(test_dataset)

# Making a submission file

In [None]:
submit = pd.read_csv("/kaggle/input/commonlitreadabilityprize/sample_submission.csv")
submit["target"] = list(preds.predictions.reshape(1,-1)[0])
submit["id"] = test_data["id"]

In [None]:
submit

In [None]:
submit.to_csv("submission.csv", index=False)

# How to submit this model

- In this notebook we downloaded the pretrained model from the internet
- But for submission the model must be available offline
- Once you have fine-tuned and saved a model, commit the notebook
- Then you will see something like the image below in the output section of the commit
- Click the button (circled in yellow) to add the model as a dataset, and use it for inference
- You can find the inference code [here](https://www.kaggle.com/abhilashreddyy/inference-transformer-model-using-hugging-face) (it can be used for submitting the code)

![](https://i.ibb.co/4MSSMQT/Whats-App-Image-2021-05-18-at-13-42-10.jpg)
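
For reference, loading the saved files without internet access might look roughly like the sketch below. It assumes the `./model` directory written earlier in this notebook (or a copy of it attached as a dataset) contains both the tokenizer files and the model weights. `load_finetuned` is a hypothetical helper name, and the dataset path in the usage comment is a placeholder.

```python
def load_finetuned(model_dir: str = "./model"):
    """Load the tokenizer and fine-tuned model from a local directory only."""
    # Imported here so that defining the helper does not itself need the library
    from transformers import RobertaForSequenceClassification, RobertaTokenizer

    # local_files_only=True prevents any attempt to download from the internet
    tokenizer = RobertaTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = RobertaForSequenceClassification.from_pretrained(model_dir, local_files_only=True)
    model.eval()  # switch off dropout for inference
    return tokenizer, model


# Usage in an offline notebook (placeholder path, replace with your own dataset):
# tokenizer, model = load_finetuned("/kaggle/input/<your-dataset-name>")
```

Because the tokenizer and the model were saved to the same directory, a single path is enough to restore both.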

## Now it's your turn. Upvote if you find this notebook helpful :-)