In [None]:
# Transformers installation
! pip install transformers datasets

# Text classification


Text classification is a common NLP task that assigns a label or class to text. Here we analyze emotions: the model assigns one of six basic emotions to a sequence of text.

We fine-tune [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) on an emotion dataset to determine which of the six emotions a journal entry expresses.


Before we begin, we make sure we have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate
```

We log in to our Hugging Face account so that we can upload and share our model, entering our token when prompted:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Load Our Dataset

Start by loading the emotion dataset from the 🤗 Datasets library:

In [None]:
from datasets import load_dataset

dataset = load_dataset("AdamCodd/emotion-balanced")

Then take a look at an example:

In [None]:
dataset["test"][0]

There are two fields in this dataset:

- `text`: the text of the entry.
- `label`: an integer value that represents the emotion.

## Preprocess

The next step is to load a BERT tokenizer to preprocess the `text` field:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

Here's a preprocessing function that tokenizes the text field in the examples and truncates sequences to be no longer than BERT's maximum input length:

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, we use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. Setting `batched=True` can significantly speed up preprocessing by processing multiple elements of the dataset at once.

In [None]:
tokenized_data = dataset.map(preprocess_function, batched=True)
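
To make the effect of `batched=True` concrete, the sketch below mimics what a batched mapping function receives: a dictionary of columns, each holding a list of values, rather than a single example. The `toy_tokenize`, `toy_preprocess`, and `batched_map` names are hypothetical stand-ins and not part of 🤗 Datasets, and whitespace splitting replaces the real tokenizer.

```python
# Hypothetical stand-in for a tokenizer: splits on whitespace instead of using WordPiece.
def toy_tokenize(texts):
    return {"tokens": [text.split() for text in texts]}


def toy_preprocess(examples):
    # With batching, examples["text"] is a list of strings, not a single string.
    return toy_tokenize(examples["text"])


def batched_map(rows, fn, batch_size=2):
    """Apply fn to slices of the data, one dict of columns per slice (illustrative only)."""
    output = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        columns = {"text": [row["text"] for row in batch]}
        result = fn(columns)
        # Merge the new column back into each original row.
        for i, row in enumerate(batch):
            output.append({**row, "tokens": result["tokens"][i]})
    return output


rows = [{"text": "i feel great"}, {"text": "so sad today"}, {"text": "what a surprise"}]
mapped = batched_map(rows, toy_preprocess)
print(mapped[0])
```

Because the function is called once per slice instead of once per row, a tokenizer that can process lists efficiently does less per-call work overall.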

Initialize `DataCollatorWithPadding` to dynamically pad sentences to the longest length in each batch:


In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
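
As a rough illustration of dynamic padding, the sketch below pads each batch only to the length of its own longest sequence. The `pad_batch` helper and the token IDs are made up for this example, and a padding ID of 0 is assumed; the real collator also handles tensor conversion.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest length in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs below are made up for illustration.
short_batch = pad_batch([[101, 7592, 102], [101, 2026, 3899, 2003, 102]])
long_batch = pad_batch([[101] + [2000] * 20 + [102], [101, 102]])

print(len(short_batch["input_ids"][0]))  # 5
print(len(long_batch["input_ids"][1]))   # 22
```

A batch of short sentences stays short, which is why padding per batch is cheaper than padding the whole dataset to one fixed length.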

## Evaluate

Here is how we compute evaluation metrics (accuracy, precision, recall, and F1 score) using scikit-learn. These metrics are essential for assessing how well a model performs on a classification task.

We define a function named `compute_metrics` that takes predictions and labels as input and computes these metrics with scikit-learn's functions. They show where the model is strong and where it is weak. We first load an evaluation metric with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library.

In [None]:
! pip install evaluate

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

We can now compute evaluation metrics for our classification model and see how it performs:

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Define a function to compute metrics
def compute_metrics(eval_pred):
    # Unpack predictions and labels
    predictions, labels = eval_pred
    # Convert predictions (logits) to class labels by selecting the class with the highest score
    predictions = np.argmax(predictions, axis=1)
    # Compute accuracy
    accuracy = accuracy_score(labels, predictions)
    # Compute precision
    precision = precision_score(labels, predictions, average='weighted')
    # Compute recall
    recall = recall_score(labels, predictions, average='weighted')
    # Compute F1 score
    f1 = f1_score(labels, predictions, average='weighted')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
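
To show what `average='weighted'` means, here is a minimal pure-Python sketch of weighted precision: the precision of each class is averaged with weights proportional to how often that class appears in the true labels. The `weighted_precision` helper and the toy labels are illustrative assumptions, not the scikit-learn implementation.

```python
from collections import Counter


def weighted_precision(labels, predictions):
    """Per-class precision, averaged with weights equal to each class's true count."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, count in support.items():
        predicted_as_cls = sum(1 for p in predictions if p == cls)
        correct = sum(1 for y, p in zip(labels, predictions) if y == p == cls)
        # A class that is never predicted contributes a precision of 0.
        precision = correct / predicted_as_cls if predicted_as_cls else 0.0
        score += precision * count / total
    return score


# Toy data: three examples of class 0, two of class 1, one of class 2.
labels = [0, 0, 0, 1, 1, 2]
predictions = [0, 0, 1, 1, 1, 0]
print(weighted_precision(labels, predictions))
```

Here classes 0 and 1 each have a precision of 2/3 and class 2 has a precision of 0, so the weighted result is 5/9.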

## Train

<Tip>
Here we prepare the training setup for our model using the 🤗 Transformers library with TensorFlow. First, we define the key training parameters: the batch size and the number of epochs. Then we compute the total number of training steps from the number of batches per epoch and the number of epochs. Finally, we initialize the optimizer with the `create_optimizer` function from Transformers, passing the initial learning rate, the number of warm-up steps, and the total number of training steps. This gives the model an optimizer and a learning rate schedule for training.
</Tip>


In [None]:
from transformers import create_optimizer
import tensorflow as tf

batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_data["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
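
The step arithmetic can be sketched in plain Python. The example below assumes that the schedule decays the learning rate linearly to zero with no warm-up (the behavior we expect from `create_optimizer` with its default settings), and the dataset size of 16,000 is a made-up figure for illustration only. `total_steps` and `linear_decay_lr` are hypothetical helper names.

```python
def total_steps(num_examples, batch_size, num_epochs):
    # Same arithmetic as above: whole batches per epoch times number of epochs.
    return (num_examples // batch_size) * num_epochs


def linear_decay_lr(step, init_lr, num_train_steps):
    """Learning rate at a given step under linear decay to zero (assumed schedule)."""
    remaining = max(0.0, 1.0 - step / num_train_steps)
    return init_lr * remaining


# 16,000 is a made-up dataset size used only for this illustration.
steps = total_steps(num_examples=16_000, batch_size=16, num_epochs=5)
print(steps)  # 5000
print(linear_decay_lr(0, 2e-5, steps))           # starts at the initial learning rate
print(linear_decay_lr(steps // 2, 2e-5, steps))  # half of the initial rate at the midpoint
print(linear_decay_lr(steps, 2e-5, steps))       # reaches zero at the final step
```

Under this assumption, training for fewer epochs than the schedule was computed for means the learning rate never reaches the end of its decay.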

We use the `TFAutoModelForSequenceClassification` class to load the pretrained BERT model (`bert-base-uncased`) from Google. The `num_labels` parameter is set to 6 for the six emotion classes: sadness, joy, love, anger, fear, and surprise. The `label2int` dictionary maps class labels to numerical IDs, and the `id2label` dictionary is created by reversing its key-value pairs. Both mappings are passed to the model so that it can translate between labels and IDs in either direction. Lastly, the `from_pt=True` argument indicates that the model weights are loaded from a PyTorch checkpoint.

In [None]:
from transformers import TFAutoModelForSequenceClassification

label2int = {
    "sadness": 0,
    "joy": 1,
    "love": 2,
    "anger": 3,
    "fear": 4,
    "surprise": 5
}

# Create id2label dictionary by reversing key-value pairs of label2int
id2label = {v: k for k, v in label2int.items()}

# Create label2id dictionary from label2int
label2id = label2int

model = TFAutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-uncased",
    num_labels=6,
    id2label=id2label,
    label2id=label2id,
    from_pt=True,
)
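
The role of the two mappings can be sketched without loading a model. The example below turns one row of logits into an emotion name; the logits are made up, and the `softmax` helper is written out by hand purely for illustration.

```python
import math

label2id = {"sadness": 0, "joy": 1, "love": 2, "anger": 3, "fear": 4, "surprise": 5}
id2label = {v: k for k, v in label2id.items()}


def softmax(logits):
    # Subtract the maximum for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one input; a real model would produce these.
logits = [0.1, 2.3, 0.4, -1.0, -0.5, 0.0]
probs = softmax(logits)

# The predicted class is the index with the highest probability.
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_id])  # joy
```

Storing the mappings in the model configuration means predictions can be reported as emotion names rather than as bare integers.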

Next, we prepare TensorFlow datasets for training, validation, and testing with the model's `prepare_tf_dataset` method:
 

In [None]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_data["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_data["validation"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
tf_test_set = model.prepare_tf_dataset(
    tokenized_data["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)


Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). Note that Transformers models all have a default task-relevant loss function, so we don't need to specify one:

In [None]:
import tensorflow as tf

model.compile(optimizer=optimizer, metrics='accuracy')  # No loss argument!

We instantiate a `KerasMetricCallback` object, which lets us compute custom evaluation metrics during training with TensorFlow/Keras.

We pass our `compute_metrics` function to [KerasMetricCallback](https://huggingface.co/docs/transformers/main/en/main_classes/keras_callbacks#transformers.KerasMetricCallback):

In [None]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

Specify where to push our model and tokenizer in the [PushToHubCallback](https://huggingface.co/docs/transformers/main/en/main_classes/keras_callbacks#transformers.PushToHubCallback):

In [None]:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="./Bert",
    tokenizer=tokenizer,
)

Then we combine our callbacks:

In [None]:
callbacks = [metric_callback, push_to_hub_callback]

Finally, we're ready to start training our model! We call the [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) method with our training and validation datasets, the number of epochs, and our callbacks to fine-tune the model:

In [None]:
# Use num_epochs so training matches the learning rate schedule computed earlier
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs, callbacks=callbacks)

Now, we perform evaluation of the trained model using the test dataset.

In [None]:
# Evaluate your model on the test set
evaluation_result = model.evaluate(tf_test_set)

Once training is completed, our model is automatically uploaded to the Hub!
