# Creating language models for text-classification

<a href="https://colab.research.google.com/drive/1vH0YJE0GfdvN33TB-mYMQo1vFtHOi-pe" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

In the vast realm of artificial intelligence, language models are the linchpin connecting humans and machines through the intricate tapestry of communication. **[Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing)** (NLP), a burgeoning field within AI, seeks to unravel the complexities of human language and enable machines to comprehend, interpret, and respond to text in a way that mirrors human understanding.

At the heart of NLP lies the remarkable evolution of language models, i.e., algorithms designed to process and generate human-like text. These models are instrumental in tasks ranging from sentiment analysis to language translation, offering unprecedented capabilities in understanding and developing linguistic expressions.

Among language models, **[Long Short-Term Memory](https://arxiv.org/abs/1909.09586)** (LSTM) networks were early pioneers in capturing contextual dependencies within sequential data. LSTMs, a type of **[recurrent neural network](https://en.wikipedia.org/wiki/Recurrent_neural_network)** (RNN), excel at preserving long-range dependencies, making them particularly adept at modeling the relationships between words and phrases over extended text passages. This ability to capture context is paramount in tasks like language translation and sentiment analysis, where the meaning often hinges on the broader context in which words are used.

With the advent of **[transformer-based models](https://arxiv.org/abs/1706.03762)**, a new era of NLP capabilities has been unleashed. Models like BERT (**[Bidirectional Encoder Representations from Transformers](https://arxiv.org/abs/1810.04805)**), a groundbreaking model Google developed, form the basis of many modern NLP applications through their bidirectional context analysis and attention mechanism.

One of the most common tasks in NLP is [text-classification](https://huggingface.co/docs/transformers/tasks/sequence_classification), which involves assigning predefined categories or labels to textual data. This task has wide-ranging implications, from spam detection in emails to sentiment analysis in social media. In this tutorial, we will provide a simple NLP introduction using text-classification as our subject.

![sentiment-analysis](https://miro.medium.com/proxy/1*_JW1JaMpK_fVGld8pd1_JQ.gif)

[Source](https://medium.com/analytics-vidhya/sentiment-analysis-on-ellens-degeneres-tweets-using-textblob-ff525ea7c30f).

In this notebook, we will create two language models for sentiment analysis: one with the `Keras` API and one with the `Transformers` library. To train our models, we will use a custom dataset that we put together by combining several datasets for sentiment classification. More information can be found on our [dataset card](https://huggingface.co/datasets/AiresPucrs/sentiment-analysis), available on the Hub (also in [Portuguese](https://huggingface.co/datasets/AiresPucrs/sentiment-analysis-pt)!). 🤗

First, let us download our dataset and take a look at it.

In [1]:
!pip install datasets -q

from datasets import load_dataset

dataset = load_dataset('AiresPucrs/sentiment-analysis', split='train')

display(dataset.to_pandas())


| Unnamed: 0 | text | label |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | 1 |
| 1 | a wonderful little production the filming tech... | 1 |
| 2 | i thought this was a wonderful way to spend ti... | 1 |
| 3 | basically there's a family where a little boy ... | 0 |
| 4 | petter mattei's love in the time of money is a... | 1 |
| ... | ... | ... |
| 85084 | yaaa cool use last weeks give good response | 1 |
| 85085 | years daughter love alexa enjoy alexa | 1 |
| 85086 | yes popular but doesnt use except listen songs... | 1 |
| 85087 | yo alexa love | 1 |


As you can see, our dataset is a list of sentences, like movie reviews, tweets, etc., and labels. Our dataset is configured as a binary classification problem. The two classes are `positive sentiment` (1) and `negative sentiment` (0).

The first model we will be training is a **Bidirectional LSTM** (Bi-LSTM). LSTMs are networks used in various tasks involving time series and other sequential data. Thanks to their recurrent connections, through which the output of the LSTM at one step is fed back into it at the next step, LSTMs can use context when predicting the next sample in a sequence. Traditionally, LSTMs have been one-way (unidirectional) models. In other words, sequences of tokens are read either left-to-right or right-to-left. Meanwhile, transformer-based approaches like [BERT](https://huggingface.co/docs/transformers/model_doc/bert) process the input bidirectionally, using context from both the left and the right of each token. To mimic this behavior, we will create an LSTM that also reads the sequence in both directions (i.e., a Bidirectional LSTM), which, very conveniently, is already implemented in TensorFlow and Keras via the [`tf.keras.layers.Bidirectional`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional) layer.

![LSTM](https://raw.githubusercontent.com/christianversloot/machine-learning-articles/main/images/2560px-Recurrent_neural_network_unfold.svg_.png)

[Source](https://machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm).

> **Note: To learn more about the LSTM architecture, we recommend "_[A gentle introduction to Long Short-Term Memory Networks (LSTM)](https://machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm)_", by Christian Versloot.**

Here is a simple example of how to initiate a bidirectional layer in a sequential model:

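```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

# `num_distinct_words`, `embedding_output_dims`, and `max_sequence_length` are placeholders you must define
model = Sequential()
model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length))
model.add(Bidirectional(LSTM(10), merge_mode='sum'))
model.add(Dense(1, activation='sigmoid'))
```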
This network has three main components: the `Embedding`, `Bidirectional`, and `LSTM` layers:

- The `Embedding` layer converts input text data into numerical vectors. It maps each word in the text to a fixed-size vector of real numbers, which can be learned during training or pre-trained on a large corpus of text. The purpose of the embedding layer is to capture the semantic meaning of words and represent them in a dense vector space, which can be used as input to the subsequent layers of the network.

- The `Bidirectional` layer is used to improve the performance of `RNNs` by processing the input sequence in both directions, forward and backward. It consists of two separate `RNNs` that process the input sequence in opposite directions, and it then merges their outputs. By default the outputs are concatenated, but other options, such as summation, can be selected through the `merge_mode` argument (the example above uses `'sum'`). This allows the network to capture information from both past and future contexts, which can be particularly useful for tasks such as text classification and named entity recognition.

- The `LSTM` layer is a type of `RNN` designed to overcome the limitations of traditional `RNNs`, such as the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). Its more complex architecture allows it to selectively forget or remember information from previous time steps, making it particularly effective for tasks involving long-term dependencies, such as language modeling and machine translation. The `LSTM` layer consists of memory cells that store information over time, input gates that regulate the flow of new information into the memory cells, forget gates that decide which stored information to discard, and output gates that control the layer's output.
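To make the idea behind the `Bidirectional` wrapper more concrete, here is a minimal pure-Python sketch. It is not the Keras implementation: the "recurrent layer" is a hypothetical toy (`run_rnn`, a running sum), and the function names are assumptions made for illustration. It only shows how a forward pass and a backward pass over the same sequence are aligned and merged.

```python
def run_rnn(inputs):
    """Toy recurrent layer: the state is a running sum of the inputs seen so far."""
    state, outputs = 0, []
    for x in inputs:
        state = state + x  # the new state depends on the previous state
        outputs.append(state)
    return outputs


def bidirectional(inputs, merge_mode="concat"):
    """Run the toy layer in both directions and merge the per-step outputs."""
    forward = run_rnn(inputs)
    # Read the sequence right-to-left, then flip the outputs back so that
    # position t of both passes refers to the same input token.
    backward = run_rnn(inputs[::-1])[::-1]
    if merge_mode == "concat":
        return [(f, b) for f, b in zip(forward, backward)]
    if merge_mode == "sum":
        return [f + b for f, b in zip(forward, backward)]
    raise ValueError(f"unsupported merge_mode: {merge_mode}")


sequence = [1, 2, 3]
concat_out = bidirectional(sequence, "concat")
sum_out = bidirectional(sequence, "sum")
print(concat_out)  # [(1, 6), (3, 5), (6, 3)]
print(sum_out)     # [7, 8, 9]
```

At every position, the merged output contains information from the tokens to its left (forward pass) and to its right (backward pass).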

> **Note:** For more information, read the original proposal for this architecture, "_[Long Short-Term Memory](https://dl.acm.org/doi/10.1162/neco.1997.9.8.1735)_."
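The gating mechanism can also be sketched for a single time step. The code below is a simplified illustration, assuming scalar inputs and states and the commonly used formulation with a forget gate. The weight names (`wi`, `ui`, `bi`, and so on) are hypothetical and chosen only for readability; real layers use weight matrices and vectors.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step with scalar input, hidden state, and cell state."""
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate memory
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the output
    return h, c


# With all weights at zero, every gate is exactly half open (sigmoid(0) = 0.5)
# and the candidate memory is zero, so the cell state is simply halved.
zero_weights = {name: 0.0 for name in
                ["wi", "ui", "bi", "wf", "uf", "bf",
                 "wo", "uo", "bo", "wg", "ug", "bg"]}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, w=zero_weights)
print(h, c)
```

During training, the network learns weights that open or close these gates depending on the input, which is how it decides what to remember and what to forget.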

Before training our Bi-LSTM, we first need to create our tokenizer. Since we are using Keras and TensorFlow, we will use the [`tf.keras.preprocessing.text.Tokenizer`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer), which vectorizes a text corpus by turning each text into a sequence of integers.

In [2]:
!pip install keras_preprocessing -q

import io
import json
import tensorflow as tf

# Define the length of your tokenizer (how many words will be stored)
vocab_size = 5000

# Initiate the Tokenizer class
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=vocab_size,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', # Filter punctuation
    lower=True, # Lowercase all letters
    split=" ",
    oov_token="<OOV>", # Out-Of-Vocabulary token
  )

# Fit the tokenizer into our text corpus
tokenizer.fit_on_texts(dataset['text'])

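# Serialize the tokenizer to a JSON string for later loading and reuse
tokenizer_json = tokenizer.to_json()

# Save the tokenizer! The `with` block closes the file automatically.
# Note: `to_json()` already returns a string, so `json.dumps` wraps it a second
# time; the loading code later in this notebook undoes this with `json.load`.
with io.open('./tokenizer-BiLSTM-sentiment-classifier.json', 'w', encoding='utf-8') as fp:
    fp.write(json.dumps(tokenizer_json, ensure_ascii=False))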
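
To illustrate what "turning each text into a sequence of integers" means, here is a simplified pure-Python stand-in for a word-level tokenizer. It is a sketch under assumptions, not the Keras implementation: the function names `fit_vocab` and `texts_to_sequences` are hypothetical, punctuation filtering is omitted, and the only behaviors mirrored are frequency-ranked indices, a reserved out-of-vocabulary index, and a vocabulary size cap.

```python
from collections import Counter


def fit_vocab(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for unknown words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def texts_to_sequences(texts, word_index, num_words=None, oov_token="<OOV>"):
    """Map each word to its index; unknown or too-rare words become OOV."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word, oov_id)
            if num_words is not None and idx >= num_words:
                idx = oov_id  # outside the `num_words` most frequent entries
            ids.append(idx)
        sequences.append(ids)
    return sequences


corpus = ["the cat sat", "the cat ran"]
word_index = fit_vocab(corpus)
print(word_index)
print(texts_to_sequences(["the dog sat"], word_index))               # [[2, 1, 4]]
print(texts_to_sequences(["the dog sat"], word_index, num_words=4))  # [[2, 1, 1]]
```

The word `dog` never appeared in the corpus, so it maps to the OOV index. With `num_words=4`, the less frequent word `sat` is also treated as out-of-vocabulary.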



Now that we have a tokenizer, we can tokenize our dataset, i.e., turn all text sequences into sequences of integer tokens. But first, let us split our dataset with the good old train-test split.

In [3]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    dataset['text'],
    dataset['label'],
    test_size=0.2,
    random_state=42
  )

print("Training samples: ", len(x_train))
print("Test samples: ", len(x_test))

Training samples:  68071
Test samples:  17018


One thing that will improve our training, and something you will see quite a lot when working with language models, is making sure all samples have the same shape, so that every batch has the same shape too. Since our samples have different lengths, we can use [`tf.keras.utils.pad_sequences`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences) to create these homogeneous training tensors.

In [4]:
import numpy as np

# The maximum sequence length we will consider. Shorter sequences are padded,
# and longer sequences are truncated

sequence_length = 250

x_train = tf.keras.utils.pad_sequences(
    tokenizer.texts_to_sequences(x_train),
    maxlen=sequence_length,
    truncating='post'
  )

x_test = tf.keras.utils.pad_sequences(
    tokenizer.texts_to_sequences(x_test),
    maxlen=sequence_length,
    truncating='post'
  )

# Turn labels into numpy arrays
y_train = np.array(y_train).astype(float)
y_test = np.array(y_test).astype(float)

# Shapes of our tensors
print("Training samples: ", x_train.shape)
print("Test samples: ", x_test.shape)

Training samples:  (68071, 250)
Test samples:  (17018, 250)
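
The effect of padding and truncation on a single sequence can be sketched in a few lines of plain Python. This is only an illustration of the settings used above, assuming zeros are added at the front (the `padding='pre'` default of `pad_sequences`) and that overly long sequences lose their tail (`truncating='post'`). The helper name `pad_sequence` is hypothetical.

```python
def pad_sequence(seq, maxlen, pad_value=0):
    """Force a list of token ids to have exactly `maxlen` entries."""
    truncated = seq[:maxlen]                           # 'post' truncation: drop the tail
    padding = [pad_value] * (maxlen - len(truncated))  # how many zeros are missing
    return padding + truncated                         # 'pre' padding: zeros go first


print(pad_sequence([1, 2, 3], maxlen=5))           # [0, 0, 1, 2, 3]
print(pad_sequence([1, 2, 3, 4, 5, 6], maxlen=5))  # [1, 2, 3, 4, 5]
```

Applying this to every sample yields a rectangular array, which is why the tensors above have shape `(number_of_samples, 250)`.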


Now, we can instantiate our Bi-LSTM model. This model consists of an embedding layer, two Bi-LSTM layers, and an output layer with two neurons (one per class) followed by a softmax activation. Our model uses the `tf.keras.Model` class and stacks all layers on top of each other via function calls. Our loss function is sparse categorical cross-entropy, and the optimizer is [Adam](https://paperswithcode.com/method/adam).

In [6]:
# Dimensionality of the Embedding layer, which is the look-up table
# that converts tokens into dense vectors.

embed_size = 128

# Input Layer
inputs = tf.keras.Input(shape=(None,), dtype="int32")

# The Embedding layer
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=sequence_length)(inputs)

# First BiLSTM layer
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True))(x)

# Second BiLSTM layer
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)

# Output Layer
outputs = tf.keras.layers.Dense(2, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()

Version:  2.15.0
Eager mode:  True
GPU is available
Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_1 (Embedding)     (None, None, 128)         640000    
                                                                 
 bidirectional_2 (Bidirecti  (None, None, 128)         98816     
 onal)                                                           
                                                                 
 bidirectional_3 (Bidirecti  (None, 128)               98816     
 onal)                                                           
                                                                 
 dense_1 (Dense)             (None, 2)                 258       
                                                                 
Total p
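
The per-layer parameter counts in the summary can be reproduced by hand. The sketch below uses the standard parameter formulas for embedding, LSTM, and dense layers. The helper names are hypothetical, and the layer sizes are the ones defined in the cell above.

```python
def embedding_params(vocab_size, embed_size):
    # One dense vector of size `embed_size` per vocabulary entry
    return vocab_size * embed_size


def lstm_params(input_dim, units):
    # Four weight sets (input, forget, and output gates plus the candidate memory),
    # each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)


def bilstm_params(input_dim, units):
    # Two independent LSTMs, one per reading direction
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    return input_dim * units + units


vocab_size, embed_size, units = 5000, 128, 64

counts = {
    "embedding": embedding_params(vocab_size, embed_size),
    "bilstm_1": bilstm_params(embed_size, units),  # input: 128-d embeddings
    "bilstm_2": bilstm_params(2 * units, units),   # input: concatenated 2 x 64
    "dense": dense_params(2 * units, 2),           # two output classes
}

for name, value in counts.items():
    print(f"{name}: {value}")
```

Both Bi-LSTM layers have the same number of parameters because the embedding size (128) happens to equal the size of the concatenated forward and backward outputs (2 x 64).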

Now, we are ready to train our model! We will use two different callbacks, `ModelCheckpoint` and `EarlyStopping`, to control the duration of our training run, i.e., we will save the best model according to validation loss and stop training early if the validation loss does not improve for three consecutive epochs.

> **Note: If you don't want to train this from scratch, you can download it from the Hub! 🤗**

```bash
!git lfs install
!git clone https://huggingface.co/AiresPucrs/BiLSTM-sentiment-classifier
```
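
The early-stopping rule used in the training cell can be illustrated with a small, framework-free sketch. It assumes a simple patience counter that resets whenever the validation loss improves. The function name and the list of validation losses are hypothetical and do not come from an actual training run.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0  # improvement: reset
        else:
            wait += 1                                     # no improvement
            if wait >= patience:
                return epoch, best_epoch                  # stop training here
    return len(val_losses), best_epoch                    # ran all epochs


# Hypothetical validation losses, one per epoch
losses = [0.50, 0.40, 0.42, 0.41, 0.45, 0.39]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

In this made-up run, training stops after epoch 5, and the weights from epoch 2 are the ones worth keeping. This is what `restore_best_weights=True` and `save_best_only=True` take care of in the cell below.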

In [None]:
callbacks = [
    tf.keras.callbacks.ModelCheckpoint("./BiLSTM-sentiment-classifier.h5",
      save_best_only=True
    ),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss",
      patience=3,
      verbose=1,
      mode="auto",
      baseline=None,
      restore_best_weights=True
    )
]

model.fit(x_train,
          y_train,
          epochs=20,
          validation_split=0.2,
          callbacks=callbacks,
          verbose=1)

test_loss_score, test_acc_score = model.evaluate(x_test, y_test)

print(f'Final Loss: {round(test_loss_score, 2)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')
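
The two numbers reported by `model.evaluate` can be understood through a small sketch. It assumes the model outputs one probability per class for each sample, and it computes the mean sparse categorical cross-entropy and the accuracy in plain Python. The probabilities and labels below are hypothetical.

```python
import math


def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the correct class."""
    return sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def accuracy(probs, labels):
    """Fraction of samples whose most probable class is the correct one."""
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return sum(int(pred == y) for pred, y in zip(preds, labels)) / len(labels)


# Hypothetical softmax outputs: [P(negative), P(positive)] per sample
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]  # integer class ids ("sparse" labels), not one-hot vectors

loss = sparse_categorical_crossentropy(probs, labels)
acc = accuracy(probs, labels)
print(f"loss={loss:.4f} accuracy={acc:.4f}")
```

The third sample is misclassified, so it lowers the accuracy and contributes the largest term to the loss.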


Congratulations, you have trained your own Bi-LSTM! Below, we test our model on some strings, and it appears it can perform basic sentiment analysis.


In [8]:
# Download model from huggingface.co
!git lfs install
!git clone https://huggingface.co/AiresPucrs/BiLSTM-sentiment-classifier

import json
import tensorflow as tf

model_path = './BiLSTM-sentiment-classifier/BiLSTM-sentiment-classifier.h5'
tokenizer_path = './BiLSTM-sentiment-classifier/tokenizer-BiLSTM-sentiment-classifier.json'

model = tf.keras.models.load_model(model_path)

# The `with` block closes the file automatically
with open(tokenizer_path) as fp:
    data = json.load(fp)
    tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(data)
    word_index = tokenizer.word_index


strings = [
    'this explanation is really bad',
    'i did not like this tutorial 2/10',
    'this tutorial is garbage i wont my money back',
    'is nice to see philosophers doing machine learning',
    'this is a great and wonderful example of nlp',
    'this tutorial is great one of the best tutorials ever made'
]

preds = model.predict(
    tf.keras.preprocessing.sequence.pad_sequences(
        tokenizer.texts_to_sequences(strings),
        maxlen=250,
        truncating='post'
    ), verbose=0)

for i, string in enumerate(strings):
    print(f'Review: "{string}"\n(Negative 😔 \
    {preds[i][0] * 100:.2f}% | Positive 😊 \
    {preds[i][1] * 100:.2f}%)\n')

Review: "this explanation is really bad"
(Negative 😔     95% | Positive 😊     5%)

Review: "i did not like this tutorial 2/10"
(Negative 😔     88% | Positive 😊     12%)

Review: "this tutorial is garbage i wont my money back"
(Negative 😔     89% | Positive 😊     11%)

Review: "is nice to see philosophers doing machine learning"
(Negative 😔     4% | Positive 😊     96%)

Review: "this is a great and wonderful example of nlp"
(Negative 😔     0% | Positive 😊     100%)

Review: "this tutorial is great one of the best tutorials ever made"
(Negative 😔     0% | Positive 😊     100%)



## The transformer-based approach

The second model in our training lineup is a **DistilBERT** model, which stands for **[Distilled Bidirectional Encoder Representations](https://arxiv.org/abs/1910.01108)** from Transformers. DistilBERT is a smaller, distilled version of the original BERT model, designed to retain BERT's bidirectional analysis while being more resource-efficient and faster to train. Like BERT, DistilBERT processes input sequences bidirectionally, combining information from both the left and the right of each token. This bidirectionality comes from the transformer architecture and its attention mechanism. DistilBERT offers a practical and efficient solution for various NLP tasks, and its implementation is readily available through libraries like [Transformers](https://huggingface.co/docs/transformers/model_doc/distilbert).
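
To give an intuition of why attention is naturally bidirectional, here is a minimal sketch of unmasked scaled dot-product self-attention in plain Python. It is heavily simplified and is not DistilBERT's implementation: there are no learned query, key, or value projections and no multiple heads, and the toy token vectors are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Unmasked self-attention where queries, keys, and values are the inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this token with EVERY token, to its left and to its right
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted average of all token vectors
        output = [sum(w * v[d] for w, v in zip(weights, vectors))
                  for d in range(dim)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of attention weights is strictly positive over all positions, so even the first token receives information from the tokens that come after it. A left-to-right model would instead mask those positions out.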

Our dataset is already prepared, and [DistilBERT](https://huggingface.co/distilbert/distilbert-base-cased) is readily available on the Hugging Face Hub, tokenizer and all. Hence, we only need to download our base model and tokenizer and set up a training pipeline.

In [9]:
# Here is a sample of our dataset (IMDB sample)

dataset[0]


{'text': "one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked they are right as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to many aryans muslims gangstas latinos christians italians irish and more so scuffles death stares dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare forg

First, we download our tokenizer, and like before, we will tokenize our text sequences up to a maximum length. However, unlike before, we don't need to train this tokenizer since it is already pre-trained!

In [2]:
!pip install transformers accelerate -q

from transformers import AutoTokenizer

# Download the `distilbert-base-cased` tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

# Preprocess function
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=256)

# Preprocess the dataset!
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Split the dataset into `train` and `test`
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.2)


We are creating a `DataCollator` to feed batches of samples to our model during training. We are also creating an evaluation function to compute accuracy scores while validating our model during training via the [Evaluate](https://huggingface.co/docs/evaluate/index) library.

> **Note: We will use these scores to choose which model we should save as "the best" (a.k.a. the best model is the one that achieves the best accuracy in validation).**

In [3]:
!pip install evaluate -q

import evaluate
import numpy as np
from transformers import DataCollatorWithPadding

# Create a simple data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Use accuracy as an evaluation metric
accuracy = evaluate.load("accuracy")

# Function to compute accuracy
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
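
The idea behind a padding data collator (often called dynamic padding) can be sketched without any library. The sketch assumes that each batch is padded on the right only up to the length of its longest sample, and that an attention mask marks which positions hold real tokens. The helper name, the token ids, and the padding id of `0` are assumptions for illustration.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad a list of token-id lists to the longest sequence in this batch."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)      # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two sentences of different lengths
batch = [[101, 1142, 1110, 102], [101, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[101, 1142, 1110, 102], [101, 102, 0, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Compared with padding every sample to a fixed maximum length, padding per batch avoids wasting computation on batches that contain only short texts.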



Now, we can finally download our pre-trained model. DistilBERT is a model pre-trained on a masked language modeling task. However, we are downloading an instance of DistilBERT with a sequence classification head attached. This head is newly initialized, which is why the model must be fine-tuned before it can make useful predictions. We can set this head's labels and number of classes through the `num_labels`, `id2label`, and `label2id` arguments.

In [4]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To fine-tune our model, we only need to define the `TrainingArguments`. Thanks to the `Trainer` API, we just pass these arguments, together with our model, tokenizer, data collator, and evaluation function, and `Transformers` does the rest.

The `TrainingArguments` object defines various settings and parameters for training a machine learning model. Here's an explanation for each argument in the provided code:

-  `output_dir`: Specifies the directory where model checkpoints and other outputs will be saved during training.
    
-  `learning_rate`: Sets the initial learning rate for the optimizer. It controls the step size during optimization and influences how quickly the model adapts to the training data.
    
-  `per_device_train_batch_size`: Defines the number of training examples per batch processed on each device (GPU or CPU). Larger batch sizes can lead to faster training but may require more memory.
    
-  `per_device_eval_batch_size`: Similar to `per_device_train_batch_size`, but for evaluation (validation and test) datasets.
    
-  `num_train_epochs`: Specifies the number of times the entire training dataset is passed through the model during training.
    
-  `weight_decay`: Adds a regularization term to the loss function to prevent overfitting. It penalizes large weights in the model.
    
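-  `evaluation_strategy`: Determines when the model is evaluated during training. In this case, "epoch" means evaluation is done at the end of each training epoch. Note that newer releases of `Transformers` rename this argument to `eval_strategy`.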
    
-  `save_strategy`: Determines when to save model checkpoints. Here, "epoch" means saving checkpoints at the end of each training epoch.
    
-  `load_best_model_at_end`: If set to True, the best checkpoint found during training is loaded at the end of training. "Best" is determined by the `metric_for_best_model` argument, which defaults to the evaluation loss when it is not set.
    
-  `push_to_hub`: If True, it enables pushing the trained model and associated files to the Hugging Face Model Hub, making it accessible to the community.
    
-  `hub_token`: A token required for authentication when pushing the model to the Hugging Face Model Hub.
    
-  `hub_model_id`: Specifies the unique identifier for the model on the Hugging Face Model Hub that the user has permission to push the model to.

To learn more about the `TrainingArguments` and `Trainer` classes, read the [documentation](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments).
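
To get a feel for what these settings imply, the sketch below estimates the number of optimizer steps and the learning rate along the run. It rests on several assumptions: a single device, no gradient accumulation, roughly 68,071 training examples (80% of the 85,089 samples, as in the earlier split), and a linear decay to zero without warmup, which, as far as we know, is the default schedule of the `Trainer`. The helper names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


def linear_decay_lr(step, total_steps, initial_lr):
    """Learning rate after `step` updates under a linear decay to zero."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)


train_examples = 68071  # assumed size of the training split
total = total_training_steps(train_examples, batch_size=64, epochs=3)
print(total)  # 3192

for step in (0, total // 2, total):
    print(step, linear_decay_lr(step, total, initial_lr=4e-5))
```

Under these assumptions, the learning rate starts at `4e-5`, is halved midway through training, and reaches zero at the last step.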

> **Note: If you don't want to train this from scratch, you can download it from the Hub! 🤗**

In [None]:
from transformers import TrainingArguments, Trainer

# Define training arguments
training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=4e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
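    load_best_model_at_end=True,
    metric_for_best_model="accuracy",  # pick the best checkpoint by validation accuracy (default: loss)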
    push_to_hub=True,
    hub_token="your-token-here",
    hub_model_id="AiresPucrs/distilbert-base-cased-sentiment-classifier"
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train!
trainer.train()

Congratulations, you just fine-tuned a transformer model. Below, we test our model on some strings, and it can also perform basic sentiment analysis.

In [2]:
from transformers import pipeline

texts = [
    'This explanation is really Bad!',
    'Is to complicated and boring.',
    'Garbage, not worth my time and money!',
    'Is nice to see philosophers doing machine learning.',
    'This is a great and wonderful example of nlp.',
    'This tutorial is great and I have learned a lot.'
]

# Create a text classification pipeline
# (you can also pass the folder where your model is saved locally)
classifier = pipeline("text-classification",
                      model="AiresPucrs/distilbert-base-cased-sentiment-classifier")


for text in texts:
  preds = classifier(text)

  print(f"""Review: '{text}'\n(Label: {preds[0]['label']} | Confidence: {preds[0]['score'] * 100:.2f}%)\n""")


Review: 'This explanation is really Bad!'
(Label: NEGATIVE | Confidence: 88.42%)

Review: 'Is to complicated and boring.'
(Label: NEGATIVE | Confidence: 94.99%)

Review: 'Garbage, not worth my time and money!'
(Label: NEGATIVE | Confidence: 99.48%)

Review: 'Is nice to see philosophers doing machine learning.'
(Label: POSITIVE | Confidence: 99.69%)

Review: 'This is a great and wonderful example of nlp.'
(Label: POSITIVE | Confidence: 99.93%)

Review: 'This tutorial is great and I have learned a lot.'
(Label: POSITIVE | Confidence: 99.90%)
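
The label and confidence printed above are derived from the raw scores (logits) of the classification head. The sketch below assumes the usual post-processing for a single-label classifier: a softmax over the logits followed by picking the most probable class. The logits are hypothetical, and the helper name `classify` is not part of the `Transformers` API.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, id2label):
    """Turn raw class scores into a label and a confidence score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
result = classify([-2.0, 3.0], id2label)  # hypothetical logits
print(f"Label: {result['label']} | Confidence: {result['score'] * 100:.2f}%")
```

The confidence is the probability of the winning class only, which is why it never drops below 50% in a two-class problem.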



We will use these models in some of our other tutorials in the Explainability folder.

> **Note: For more tutorials on text classification, visit the [Hub](https://huggingface.co/docs/transformers/tasks/sequence_classification). If you are interested in more complex types of language modeling, like text generation, check the [source code for the training of our TeenyTinyLlamas](https://github.com/Nkluge-correa/TeenyTinyLlama).**

---

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).
