## **Utilize the GPU of Colab**
In this session, we will run experiments that require a GPU. To run them on the GPU provided by Colab, do the following:

1. Go to Menu > Runtime > Change runtime type.

2. Set the hardware accelerator to GPU.

Then run the following cell to confirm that the GPU is detected.

In [1]:
import tensorflow as tf
import torch
# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
# else:
#     raise SystemError('GPU device not found')

# Choose GPU as device to run the experiments on
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

Found GPU at: /device:GPU:0
cuda:0


## **Hugging Face**
[Hugging Face](https://huggingface.co/) is an NLP-focused startup with a large open-source community, in particular around the Transformers library. 🤗 Transformers is a Python library that exposes an API for many well-known transformer architectures, such as BERT, RoBERTa, GPT-2, and DistilBERT, which obtain state-of-the-art results on a variety of NLP tasks like text classification, information extraction, question answering, and text generation. These architectures come pre-trained with several sets of weights. Getting started with Transformers only requires installing the pip package:

In [2]:
# install the Transformers library
!pip install transformers



In [3]:
import pandas as pd
#to display the full text on the notebook without truncation
pd.set_option('display.max_colwidth', 150)
import re
from snowballstemmer import stemmer
from tqdm import tqdm


In [4]:
# import needed libraries
import math
import numpy as np
import pandas as pd
import time
import datetime
import torch
from torch import nn, optim
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.utils import class_weight
from sklearn.model_selection import train_test_split
from transformers import AdamW, get_linear_schedule_with_warmup
import gc
import sys

The helper functions below implement the preprocessing steps: cleaning the text and normalizing Arabic characters.

In [5]:
def clean(text):
  '''
  Clean input text from URLs, special characters, tabs, line breaks, and extra white space
  '''
  text = re.sub(r"http\S+", " ", text)  # remove urls
  text = re.sub(r"[\.\,\#_\|\:\?\?\/\=]", " ", text)  # remove special characters
  text = re.sub(r"\t", " ", text)  # remove tabs
  text = re.sub(r"\n", " ", text)  # remove line jump
  text = re.sub(r"\s+", " ", text)  # remove extra white space
  text = text.strip()
  return text

#a function to normalize the tweets
def normalize(text):
  text = re.sub("[إأٱآا]", "ا", text)
  text = re.sub("ى", "ي", text)
  text = re.sub("ؤ", "ء", text)
  text = re.sub("ئ", "ء", text)
  text = re.sub("ة", "ه", text)
  return text

def preprocess(sentence):
  # apply preprocessing steps on the given sentence
  sentence = clean(sentence)
  sentence = normalize(sentence)
  return sentence
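
To make the effect of these helpers concrete, here is a minimal, self-contained sketch that restates the normalization rules and applies them to a sample phrase. The function names `normalize_arabic` and `collapse_whitespace` and the sample string are illustrative assumptions, not part of the notebook.

```python
import re

# Illustrative restatement of the normalization rules used above.
def normalize_arabic(text):
    text = re.sub("[إأٱآا]", "ا", text)  # unify alef variants
    text = re.sub("ى", "ي", text)        # alef maqsura -> ya
    text = re.sub("[ؤئ]", "ء", text)     # hamza carriers -> bare hamza
    text = re.sub("ة", "ه", text)        # ta marbuta -> ha
    return text

def collapse_whitespace(text):
    # Replace any run of spaces, tabs, or newlines with one space.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample with an alef variant, a ta marbuta, and irregular spacing.
sample = "الذكاء   الإصطناعي\tفي المدرسة"
result = collapse_whitespace(normalize_arabic(sample))
print(result)
```

Normalizing spelling variants this way means that the same word written in different ways maps to the same tokens.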

## **Building the Relevance classifier**

Here, we build a model that takes a query-document pair as input, feeds it to BERT, and passes the pooled [CLS] embedding to a classification layer. The layer has five output units, and a softmax turns them into a probability distribution over five relevance classes, so each score lies between 0 and 1. These scores measure how relevant the document is to the query.

In [6]:
from transformers import AutoTokenizer, AutoModel
from torch import nn

class RelevanceClassifier(nn.Module):

    """
    Create a RelevanceClassifier model that classifies text pairs by relevance.
    This class adds a classification layer on top of a BERT model.
    The input is a sequence pair (query & document).
    The output is a probability distribution over five relevance classes; each score lies between 0 and 1.

    Parameters
    ----------
    :param model_name: The name of the BERT model that will be used to get the pair embedding
    :param freeze_bert: If True, the weights of the BERT layers are not updated during training
    """

    def __init__(self, model_name, freeze_bert=False):

        super(RelevanceClassifier, self).__init__()

        # load the bert model by its name
        self.bert = AutoModel.from_pretrained(model_name)

        # relu activation function
        self.relu = nn.ReLU()

        # dense layer 1
        self.fc1 = nn.Linear(768, 5)

        # use softmax activation function to give probability distribution
        self.softmax = nn.Softmax(dim=1)

        # Freeze bert layers
        if freeze_bert:
            for p in self.bert.parameters():
                p.requires_grad = False  # turn off the weight changing

    def forward(self, input_ids, attention_mask, token_type_ids):
        """
        Define the forward pass for the model

        Parameters
        ----------
        :param input_ids: token ids of the encoded query-document pair
        :param attention_mask: mask that marks real tokens (1) versus padding (0)
        :param token_type_ids: segment ids that separate the query from the document

        Returns
        -------
        probability distribution over the output classes
        """

        # pass the inputs to the BERT model
        _, cls_emb = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            return_dict=False,
        )

        # pass the embedding to the classification layer
        x = self.fc1(cls_emb)

        # apply softmax activation and get the probabilities for each class
        x = self.softmax(x)

        return x
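
The following pure-Python sketch illustrates what the softmax layer does and how it interacts with a cross-entropy loss. The logits are hypothetical values, and the helper names are assumptions for illustration. PyTorch's `nn.CrossEntropyLoss` applies log-softmax internally, so the sketch compares the loss computed on raw scores with the loss computed on scores that were already passed through softmax.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_scores(scores, target):
    # Negative log of the softmax probability of the target class.
    return -math.log(softmax(scores)[target])

logits = [2.0, 0.5, -1.0, 0.0, 1.0]  # hypothetical raw scores for 5 classes
probs = softmax(logits)              # sums to 1
predicted_class = max(range(len(probs)), key=probs.__getitem__)

# Loss on raw scores (what a cross-entropy loss expects).
loss_on_logits = cross_entropy_from_scores(logits, target=0)
# Loss on probabilities: softmax is applied a second time, which flattens
# the distribution. In this example the loss comes out larger.
loss_on_probs = cross_entropy_from_scores(probs, target=0)

print(predicted_class, round(loss_on_logits, 3), round(loss_on_probs, 3))
```

With five classes, a loss computed on probabilities in this way cannot fall below log(1 + 4/e), roughly 0.90, even for a perfectly confident correct prediction. If the training loss plateaus, returning raw scores from `forward` and applying softmax only at prediction time is one option to consider.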


## **Create the dataset class**

To encode the data efficiently in batches, we define a class that encodes a given sequence pair and returns its input_ids, attention_mask, and token_type_ids, along with other needed information such as the document id (docno) and query id.

In [7]:
import torch
from torch.utils.data import Dataset, DataLoader


class EvetarDataset(Dataset):
    """
    Dataset class for the Evetar data: for a given query-document pair, it encodes the pair and returns
    the encoding along with the needed ids.

    Parameters
    ----------
    :param queries: texts of the queries
    :param query_ids: ids of the queries
    :param documents: texts of the documents
    :param document_ids: ids of the documents
    :param labels: integer relevance label of each pair
    :param tokenizer: the BERT tokenizer that performs the encoding
    :param max_len: maximum allowed input length for the BERT model
    """
    def __init__(self, queries, query_ids, documents, document_ids, labels, tokenizer, max_len):
        self.labels = labels
        self.queries = queries
        self.documents = documents
        self.query_ids = query_ids
        self.document_ids = document_ids
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.queries)

    # return the encoding for a given dataset item, i.e., encode the query-document pair
    def __getitem__(self, item):
        query = str(self.queries[item])
        document = str(self.documents[item])
        query_id = str(self.query_ids[item])
        document_id = str(self.document_ids[item])
        label = int(self.labels[item])

        encoding = self.tokenizer.encode_plus(
            query,
            document,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=True,
            # pad_to_max_length=True,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors="pt",
        )

        # here we keep the text and ids of the query and documents to be used later in re-ranking
        return {
            "query": query,
            "query_id": query_id,
            "document": document,
            "document_id": document_id,
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
            "token_type_ids": encoding["token_type_ids"].flatten(),
            "label": torch.tensor(label, dtype=torch.long),
        }


# create data loader that will split the dataset into batches
def create_data_loader(
    queries, query_ids, documents, document_ids, labels, tokenizer, max_len, batch_size
):
    ds = EvetarDataset(queries, query_ids, documents, document_ids, labels, tokenizer, max_len)

    return DataLoader(ds, batch_size=batch_size, num_workers=2)
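
As a rough illustration of what the tokenizer produces for a sequence pair, the sketch below builds the three outputs by hand. It assumes a hypothetical whitespace tokenizer and a query short enough to fit; a real BERT tokenizer splits text into subwords, maps them to vocabulary ids, and supports several truncation strategies.

```python
def encode_pair(query, document, max_len):
    # Hypothetical whitespace tokenizer, for illustration only.
    q_tokens = query.split()
    d_tokens = document.split()

    # Truncate the document so the pair plus 3 special tokens fits max_len.
    # Assumes the query itself is shorter than max_len - 3.
    room = max(max_len - len(q_tokens) - 3, 0)
    d_tokens = d_tokens[:room]

    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + d_tokens + ["[SEP]"]
    # Segment 0 covers [CLS] + query + [SEP]; segment 1 covers document + [SEP].
    token_type_ids = [0] * (len(q_tokens) + 2) + [1] * (len(d_tokens) + 1)
    attention_mask = [1] * len(tokens)

    # Pad to max_len; padded positions get mask 0 so attention ignores them.
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, token_type_ids, attention_mask

tokens, token_type_ids, attention_mask = encode_pair(
    "what is AI", "AI mimics human intelligence", max_len=12
)
print(tokens)
print(token_type_ids)
print(attention_mask)
```

Padding every pair to the same length is what lets the `DataLoader` stack the items of a batch into one tensor.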


In [8]:
def format_time(elapsed):
    """
    Takes a time in seconds and returns a string hh:mm:ss
    """
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))

    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
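
A quick usage check of the time formatter, restated here so the snippet runs on its own. Note that `datetime.timedelta` does not zero-pad the hour, so the result has the form h:mm:ss.

```python
import datetime

def format_time(elapsed):
    # Round to the nearest second, then let timedelta build the string.
    return str(datetime.timedelta(seconds=int(round(elapsed))))

print(format_time(3725.4))  # 1:02:05
print(format_time(59.6))    # 0:01:00
```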

## **Design the training function**

Here we implement the training function to fine-tune bert for our task.

In [9]:
def train_epoch(model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
    """
    Performs one training epoch using the BERT model and the provided dataloader
    :param model: the model to train (classification layer on top of BERT)
    :param data_loader: data loader to get the data in batches
    :param loss_fn: loss function that compares the model outputs with the labels
    :param optimizer: optimizer that updates the model weights to reduce the loss
    :param device: GPU or CPU
    :param scheduler: scheduler to adjust the learning rate during training
    :param n_examples: total number of training examples
    """

    # to compute execution time
    t0 = time.time()

    # put the model in training mode, i.e., layers such as dropout are active
    model = model.train()

    losses = []
    correct_predictions = 0

    y_test = np.array([], dtype=int)  # the real output
    y_pred = np.array([], dtype=int) # the predicted output

    for step, batch in enumerate(data_loader):
        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed = format_time(time.time() - t0)

            # Report progress.
            print("  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.".format(step, len(data_loader), elapsed))

        # Unpack this training batch from dataloader.
        label = batch["label"].to(device)
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        token_type_ids = batch["token_type_ids"].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because
        # accumulating the gradients is "convenient while training RNNs".
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()

        # the output for each training example is a vector of class probabilities,
        # one value per output unit of the classification layer
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

        _, preds = torch.max(outputs, dim=1)

        # compute the loss between the predicted output and the real output
        loss = loss_fn(outputs, label)

        correct_predictions += torch.sum(preds == label)

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value
        # from the tensor.
        losses.append(loss.item())

        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()
        # Update the learning rate.
        scheduler.step()

        y_test = np.append(y_test, label.cpu().numpy())
        y_pred = np.append(y_pred, preds.cpu().numpy())

    print("")
    print("  correct_predictions: {0:.2f}".format(correct_predictions.double()))
    print("  number of examples: {0:.2f}".format(n_examples))
    print("  Accuracy : {0:.2f}".format(correct_predictions.double() / n_examples))
    print("  Average training loss: {0:.2f}".format(np.mean(losses)))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
    y_test = np.array(y_test)
    y_pred = np.array(y_pred)
    accuracy = accuracy_score(y_test, y_pred)

    return accuracy, np.mean(losses)
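
The gradient clipping step in the loop above can be sketched in plain Python. This is a simplified illustration on a flat list of gradient values, with a hypothetical helper name; `clip_grad_norm_` applies the same idea across all parameter tensors of the model.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # Norm of all gradient values taken together.
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm exceeds max_norm; never scale up.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]  # hypothetical gradient values with norm 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, clipped, norm_after)

# Gradients that are already small pass through unchanged.
small, _ = clip_by_global_norm([0.1, 0.2], max_norm=1.0)
```

Clipping rescales the whole gradient vector, so its direction is preserved and only its length is limited.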


Here, we provide an evaluation function that is useful when train, dev, and test splits are available. Since we have only training and test data in our case, we will not use this function; it is provided as a reference for future use.

In [10]:

def eval_model(model, data_loader, loss_fn, device, n_examples):
    print("Running Evaluation...")
    t0 = time.time()
    # Put the model in evaluation mode--the dropout layers behave differently
    model = model.eval()

    losses = []
    correct_predictions = 0

    with torch.no_grad():
        for batch in data_loader:
            label = batch["label"].to(device)
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            token_type_ids = batch["token_type_ids"].to(device)

            outputs = model(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids
            )

            _, preds = torch.max(outputs, dim=1)

            loss = loss_fn(outputs, label)

            correct_predictions += torch.sum(preds == label)
            losses.append(loss.item())

    print("  correct_predictions: {0:.2f}".format(correct_predictions.double()))
    print("  n_examples: {0:.2f}".format(n_examples))
    print("  Accuracy: {0:.2f}".format(correct_predictions.double() / n_examples))
    print("  Average Validation loss: {0:.2f}".format(np.mean(losses)))
    print("  Evaluation took: {:}".format(format_time(time.time() - t0)))
    print("")

    return correct_predictions.double() / n_examples, np.mean(losses)


## **Design the test function**


This function gives the predictions of our fine-tuned model. More specifically, it gives the id of each query and document along with the relevance score.

In [11]:
def get_predictions(model, data_loader, device):
    """
    Feeds the test dataloader to the fine-tuned BERT model and returns the prediction results.
    :param model: the fine-tuned model to use for inference (classification layer on top of BERT)
    :param data_loader: test data loader to get the data in batches
    :param device: GPU or CPU
    """
    # Put the model in evaluation mode--the dropout layers behave differently
    model = model.eval()

    predictions = []  # predicted class indices
    prediction_probs = []  # probability assigned to class index 1 (see `indices` below)
    labels = []
    query_ids = []
    document_ids = []
    queries = []
    documents = []
    indices = torch.tensor([1]).to(device)

    with torch.no_grad(): # don't update gradients
        for batch in data_loader:

            # encoding = batch["encoding"]
            query = batch["query"]
            query_id = batch["query_id"]
            document = batch["document"]
            document_id = batch["document_id"]
            label = batch["label"].to(device)

            # outputs = model(**encoding)
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            token_type_ids = batch["token_type_ids"].to(device)

            outputs = model(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids
            )

            _, preds = torch.max(outputs, dim=1)

            # select the output column given by `indices` (class index 1) and use it as the relevance score, reference: https://pytorch.org/docs/stable/generated/torch.index_select.html
            probs = torch.index_select(outputs, dim=1, index=indices)


            labels.extend(label)
            prediction_probs.extend(probs.flatten().tolist())
            predictions.extend(preds)
            queries.extend(query)
            documents.extend(document)
            document_ids.extend(document_id)
            query_ids.extend(query_id)

    predictions = torch.stack(predictions).cpu()
    labels = torch.stack(labels).cpu()
    return queries, query_ids, documents, document_ids, predictions, prediction_probs, labels
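
Once the scores are available, re-ranking amounts to grouping the documents by query and sorting each group by score. The sketch below shows this on hypothetical ids and scores; the function name `rerank` is an assumption for illustration.

```python
from collections import defaultdict

def rerank(query_ids, document_ids, scores):
    # Collect (document_id, score) pairs per query.
    by_query = defaultdict(list)
    for qid, did, score in zip(query_ids, document_ids, scores):
        by_query[qid].append((did, score))
    # Sort each query's documents from highest to lowest score.
    return {
        qid: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for qid, pairs in by_query.items()
    }

# Hypothetical outputs in the shape returned by a prediction function.
ranking = rerank(
    query_ids=["q1", "q1", "q2", "q1"],
    document_ids=["d1", "d2", "d3", "d4"],
    scores=[0.2, 0.9, 0.5, 0.4],
)
print(ranking["q1"])
```

The lists returned by `get_predictions` (`query_ids`, `document_ids`, `prediction_probs`) have matching order, so they could be passed to a function like this directly.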


## **Load the training set**

In this step, we load the training set from the CSV file `mydata.csv`. Each row pairs a query with a document and carries an id and a relevance label; in the rows previewed below, the labels range from 0 to 4. Positive examples were chosen from the qrels file, while negative examples were chosen from the top documents retrieved by BM25 that are not already marked as positive.

In [12]:
data = pd.read_csv('mydata.csv', encoding='utf-8')
data = pd.DataFrame({'qid': data['qid'],
                     'document': data['document'],
                     'query': data['query'],
                     'label': data['label']})
data.head()

Unnamed: 0,qid,document,query,label
0,1,هو علم محاكاة الذكاء البشري من خلال تصنيف إما خوارزميات برمجية software أو hardware نمذجة للذكاء االانساني، وهو علم واسع جداً أساسه رياضي يعتمد عل...,ما هو الذكاء الصنعي Artificial inteligence,4
1,2,إن ميزة الانسان الاساسية هي القدرة على التعلم والتطور، ومن هنا بدأت فكرة الذكاء االصطناعي، وذلك عن طريق الشبكات العصبونية، حيث تحاكي هذه الشبكات ن...,ما هو الذكاء الصنعي Artificial inteligence,3
2,3,يتم التعلم بدون وجود معلم حيث يقوم بتجميع الداتا إلى عدة أصناف,ما هو الذكاء الصنعي Artificial inteligence,0
3,4,محاكاة الدماغ هي البحث عن بنية شبيهة ببنية الدماغ أي نأخذ الاية التي يتعلم بها الانسان ونطبقها على الالة، أن نفكر كيف يعالج الانسان المعلومات ويتط...,ما هو الذكاء الصنعي Artificial inteligence,2
4,5,1. processing language Natural) NLP) معالجة اللغات الطبيعية: وهي مجال علوم الحاسوب واللغويات المعنية بالتفاعلات بين الحاسوب واللغات الطبيعية، أبسط...,ما مجالات الذكاء الصنعي,4


In [13]:
data["query"] = data["query"].apply(preprocess)
data["document"] = data["document"].apply(preprocess)


data["docno"] = data["qid"].astype(str)

df_train, df_test = train_test_split(data, test_size=0.2,random_state=42)

df_train = df_train.sample(frac=1) # Just shuffle the data
x_train_query = df_train["query"].values
x_train_query_id = df_train["qid"].values
x_train_documents = df_train["document"].values
x_train_document_ids = df_train["docno"].values
y_train_label = df_train["label"].values

TRAIN_LENGTH = len(x_train_query)
print("train size ", TRAIN_LENGTH)
df_train

train size  120


Unnamed: 0,qid,document,query,label,docno
119,120,يحدد فيما اذا كان سيتم تفعيل (تنشيط) العصبون من خلال شرط او حد معين، وله عده اشكال ابسط شكل هو تابع العتبه اذ يتنشط فوق حد معين ويخمد ما دونه,ما هو تابع التنشيط في ال Perceptron,4,120
28,29,المرحله االاولي تطوير نماذج احصاءيه ننطلق منها التخاذ قرار المرحله الثانيه بناء خوارزميات عن طريق جعل الكومبيوتر ينفذ مهام دون تعليمات خارجيه، فهي...,كيف تكون عمليه التعلم,3,29
148,149,العمليه تشمل مراحل تحديد الاوزان المثلي التي تقلل من الخطا بين الاخراج المتوقع والاخراج الفعلي تتكرر هذه العمليه عده مرات حتي يصل النموذج الي مستو...,كيف يتم تدريب الشبكه العصبونيه فعليا,2,149
22,23,المرحله االاولي تطوير نماذج احصاءيه ننطلق منها التخاذ قرار المرحله الثانيه بناء خوارزميات عن طريق جعل الكومبيوتر ينفذ مهام دون تعليمات خارجيه، فهي...,ما مراحل تعلم الاله Machine Learning,4,23
123,124,تستخدم مصفوفه الدخل لتدريب النموذج، حيث يُحسب المجموع لكل دفعه بتطبيق الاوزان وتابع التنشيط اذا تماثلت النتيجه مع القيمه المطلوبه، يتم الانتقال ال...,ما شرح عمليه التاكد من الاوزان في البوابه AND,3,124
...,...,...,...,...,...
116,117,يحدد عتبه نشاط العصبون,ما هو الجامع في ال Perceptron,0,117
3,4,محاكاه الدماغ هي البحث عن بنيه شبيهه ببنيه الدماغ اي ناخذ الايه التي يتعلم بها الانسان ونطبقها علي الاله، ان نفكر كيف يعالج الانسان المعلومات ويتط...,ما هو الذكاء الصنعي Artificial inteligence,2,4
111,112,تقوم بتمرير البيانات اي ال hidden lyares,ما هي طبقه الدخل في ال Perceptron,3,112
75,76,التعلم الالي هو بناء الخوارزميات لحل مشكله التعرف علي الانماط، فهو يقوم بتحليل المعلومات الموجوده ومحاوله ايجاد حل للمشكله وايجاد طريق للتعرف علي ...,ما هو الفرق بين Pattern recognition و Machine Learning,4,76


## **Load the test data**

Here, the test set is the 20% of the data held out by `train_test_split` above. The goal is to re-rank the documents so that the most relevant ones come first. The test set has the same attributes as the training set; the labels passed to the test data loader are placeholder zeros, since `get_predictions` does not use them for scoring.

In [14]:
x_test_query = df_test["query"].values
x_test_query_id = df_test["qid"].values
x_test_document = df_test["document"].values
x_test_document_ids = df_test["docno"].values
y_test_labels = [0] * len(x_test_query)


TEST_LENGTH = len(x_test_query)

print("test size ", TEST_LENGTH)
df_test


test size  30


Unnamed: 0,qid,document,query,label,docno
73,74,التنبء بسعر منزل وفقاً لمساحته وسعره,ما هي مجالات التعلم الالي,3,74
18,19,ان ميزه الانسان الاساسيه هي القدره علي التعلم والتطور، ومن هنا بدات فكره الذكاء االصطناعي، وذلك عن طريق الشبكات العصبونيه، حيث تحاكي هذه الشبكات ن...,ما اهميه الشبكات العصبونيه,4,19
118,119,يحدد عتبه نشاط العصبون,ما هو تابع التنشيط في ال Perceptron,2,119
78,79,يتميز التعلم الالي عن تمييز الانماط بكونه يستطيع التعامل مع المعلومات الجديده و التكيف معها و تحسين اداءه بناءً علي المدخلات الجديده و استنتاجاً م...,ماذا يتميز التعلم االالي عن تمييز الالنماط,4,79
76,77,يتميز التعلم الالي عن تمييز الانماط بكونه يستطيع التعامل مع المعلومات الجديده و التكيف معها و تحسين اداءه بناءً علي المدخلات الجديده و استنتاجاً م...,ما هو الفرق بين Pattern recognition و Machine Learning,3,77
31,32,Regressionو Classification و Dimensionality Reduction و Clustering,ما الهدف الرءيسي من تعلم الاله ML,2,32
64,65,شبكه عصبونيه,ما اقسام الداتاست,0,65
141,142,توابع التنشيط التي تعطي قيمًا مستمره (بدلاً من القيم الثناءيه) تستخدم عاده في مشكلات التنبء الرقمي او التي تتطلب اخراجًا يتراوح في نطاق مستمر,ما هي توابع التنشيط التي تعطي قيم مستمره,3,142
68,69,وكلما كانت الصفات الواصفه للنمط قابله للقياس بدقه كانت دقه التوقع اكبر،فلو استخدمنا صفات قابله للقياس مثل الوزن والطول واللون و الخ فان النتيجه ست...,ما هو Pattern Recognition,4,69
82,83,وذلك لان الشبكه العصبونيه بطبقات قليله لم تكن قادره علي معالجه البيانات واستخراج المعلومات اللازمه والمرغوب بها,لماذا نلجا لل learning Deep,4,83


## **Setting the hyper parameters**

In this section, we set the hyperparameters for the BERT model, such as the batch size, learning rate, maximum sequence length, and number of epochs. We also choose which model to use, define the optimizer and scheduler, and create the train and test data loaders.

In [15]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.cuda.empty_cache()

# Here, I am listing some Arabic bert models that you can try
MARBERT = "UBC-NLP/MARBERT"
ARBERT = "UBC-NLP/ARBERT"
CAMeLBERT_mix = "CAMeL-Lab/bert-base-camelbert-mix"
Arabic_BERT = "asafaya/bert-base-arabic"
QARiB = "qarib/bert-base-qarib"
AraBERT = "aubmindlab/bert-base-arabertv02"

model_name = Arabic_BERT
tokenizer = AutoTokenizer.from_pretrained(model_name)


# set the random seed to replicate results
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)


batch_size = 16  # try other batch sizes like 32, or 8
num_epochs = 3  # you can try 2, or 5
MAX_LEN = 180
learning_rate = 2e-5  # you can try 3e-5, 5e-5


# Create train and test dataloaders

train_data_loader = create_data_loader(
    queries=x_train_query,
    query_ids=x_train_query_id,
    documents=x_train_documents,
    document_ids=x_train_document_ids,
    labels=y_train_label,
    tokenizer=tokenizer,
    max_len=MAX_LEN,
    batch_size=batch_size,
)

test_data_loader = create_data_loader(
    queries=x_test_query,
    query_ids=x_test_query_id,
    documents=x_test_document,
    document_ids=x_test_document_ids,
    labels=y_test_labels,
    tokenizer=tokenizer,
    max_len=MAX_LEN,
    batch_size=batch_size,
)

# initialize the model
model = RelevanceClassifier(model_name=model_name, freeze_bert=False)
model = model.to(device)

# create the optimizer
optimizer = AdamW(model.parameters(), lr=learning_rate, correct_bias=False)

total_steps = len(train_data_loader) * num_epochs

# Use 5% of the total training steps for warm-up
warmup_steps = math.ceil(len(train_data_loader) * num_epochs * 0.05)
# Warm-up steps are a few updates with a low learning rate at the beginning of training.
# After the warm-up, the regular learning rate (schedule) is used to train the model to convergence.
# Intuitively, this helps the network adapt slowly to the data.
# Theoretically, the main reason for warm-up is to let adaptive optimizers (e.g. Adam, RMSProp, ...)
# compute correct statistics of the gradients. Check this: https://datascience.stackexchange.com/questions/55991/in-the-context-of-deep-learning-what-is-training-warmup-steps

scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

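# Note: nn.CrossEntropyLoss applies log-softmax internally and expects raw (unnormalized) scores.
# RelevanceClassifier.forward already applies softmax, so the loss below is computed on probabilities.
loss_fn = nn.CrossEntropyLoss().to(device)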
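
The warm-up schedule configured above can be sketched in plain Python. The multiplier function is a simplified restatement of a linear warm-up followed by linear decay, written as an illustration; the sizes (120 training pairs, batch size 16, 3 epochs, base learning rate 2e-5) are taken from this notebook.

```python
import math

def linear_warmup_decay(step, warmup_steps, total_steps):
    # Ramp up linearly from 0 to 1 during warm-up...
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # ...then decay linearly from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

batches_per_epoch = math.ceil(120 / 16)        # last batch is smaller
total_steps = batches_per_epoch * 3            # 3 epochs
warmup_steps = math.ceil(total_steps * 0.05)   # 5% of all steps
base_lr = 2e-5

lrs = [
    base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    for step in range(total_steps + 1)
]
print(total_steps, warmup_steps)
print(lrs[:4])
```

With so few steps, the warm-up lasts only two updates; on a larger dataset the same 5% fraction gives a longer ramp.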



## **Training the model**
The last step is to call the `train_epoch` function once per epoch and pass it the required parameters.

In [16]:
# start training
for epoch in tqdm(range(num_epochs)):

    print(f"Epoch {epoch + 1}/{num_epochs}")
    print("-" * 10)

    train_acc, train_loss = train_epoch(
        model, train_data_loader, loss_fn, optimizer, device, scheduler, TRAIN_LENGTH,
    )
print("Training complete!")

Epoch 1/3
----------

  correct_predictions: 34.00
  number of examples: 120.00
  Accuracy : 0.28
  Average training loss: 1.57
  Training epoch took: 0:00:07
Epoch 2/3
----------

  correct_predictions: 49.00
  number of examples: 120.00
  Accuracy : 0.41
  Average training loss: 1.48
  Training epoch took: 0:00:04
Epoch 3/3
----------

  correct_predictions: 57.00
  number of examples: 120.00
  Accuracy : 0.47
  Average training loss: 1.42
  Training epoch took: 0:00:04
Training complete!





## **Save and load the fine-tuned model**

After fine-tuning the model, we save it for future use, which saves training time later. We only need to provide the save path to `torch.save`.

In [17]:
# save the trained model
model_file_path = "trained_model.bin"
torch.save(model.state_dict(), model_file_path)

For loading, we just need to initialize the model architecture and then load the weights with the `load_state_dict` method.

In [18]:
import os
from getpass import getpass

# Set your Hugging Face token as an environment variable.
# Do not hard-code the token in the notebook; enter it at the prompt instead.
os.environ['HF_TOKEN'] = getpass('Hugging Face token: ')

In [19]:
# Load the saved model
model_name = Arabic_BERT
model = RelevanceClassifier(model_name=model_name, freeze_bert=False)
model.load_state_dict(torch.load(model_file_path, map_location=torch.device(device)))
model = model.to(device)
