# Fine-tune CANINE for binary text classification

In this notebook, we are going to fine-tune Google's character-level [CANINE](https://arxiv.org/abs/2103.06874) model to classify movie reviews as either positive or negative. We will do so using HuggingFace Transformers (I contributed CANINE in PyTorch to it!). The dataset we are going to use is the [IMDB dataset](https://huggingface.co/datasets/imdb), a large collection of movie reviews labeled as positive or negative.

For training, we will use the [HuggingFace Trainer](https://huggingface.co/transformers/main_classes/trainer.html) (note that you could also use alternative solutions such as native PyTorch, [PyTorch Lightning](https://www.pytorchlightning.ai/), [HuggingFace Accelerate](https://github.com/huggingface/accelerate), etc.). The Trainer reports the training and validation loss during training.

Note that this notebook is very similar to how we would fine-tune a BERT model for binary text classification. The only difference is that BERT uses word pieces (subword tokenization), whereas CANINE works at the character level.

To give an example, if you provide the sentence "hello world" to BERT, it is first tokenized into the word pieces ["hello", "world"]. BERT then converts each word piece into a vector (also referred to as a hidden state); for BERT-base, this vector has size 768. CANINE, on the other hand, "tokenizes" the sentence into ["h", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"], i.e. it splits the sentence into its individual characters, and then converts each character into a vector (for CANINE, also of size 768). Sequence classification works the same way for BERT and CANINE: one simply places a linear layer on top of the final hidden state of the special [CLS] token.
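As a rough illustration of the character-level view, the snippet below splits the example sentence into characters and looks up each character's Unicode code point with Python's built-in `ord`. It does not load the actual tokenizer; it only mimics the idea that CANINE sees one position per character, identified by its code point.

```python
# Illustration only: one position per character, each identified by its
# Unicode code point. This does not call the real CanineTokenizer.
text = "hello world"

characters = list(text)
code_points = [ord(ch) for ch in characters]

print(characters)       # the individual characters, including the space
print(code_points)      # one integer per character
print(len(characters))  # sequence length grows with the number of characters
```

Because every character takes up one position, a review of a few hundred words quickly becomes a sequence of a few thousand positions, which is why the maximum sequence length matters more here than for BERT.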

* CANINE paper: https://arxiv.org/abs/2103.06874
* CANINE documentation: https://huggingface.co/transformers/model_doc/canine.html

## Install dependencies

In [1]:
!pip install transformers datasets torch scikit-learn



In [2]:
from transformers import CanineForSequenceClassification, CanineTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import torch


In [4]:
from datasets import load_dataset, Dataset
from sklearn.model_selection import train_test_split

# Load your dataset from CSV
dataset = load_dataset("csv", data_files="/content/IMDB Dataset.csv")

# Convert the Hugging Face dataset to a Pandas DataFrame
df = dataset['train'].to_pandas()

# Split the DataFrame into training and testing sets
train_df, test_df = train_test_split(df, train_size=800, test_size=200, random_state=42)

# Convert the DataFrames back to Hugging Face datasets
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Print a summary of each split (column names and number of rows)
print(train_dataset)
print(test_dataset)

Dataset({
    features: ['review', 'label', '__index_level_0__'],
    num_rows: 800
})
Dataset({
    features: ['review', 'label', '__index_level_0__'],
    num_rows: 200
})


## Prepare data

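In the cell above, we loaded a small portion of the IMDb dataset (800 training and 200 test reviews, sampled from a local CSV file) for demonstration purposes. Next, we tokenize the reviews with the CANINE tokenizer and convert the result to PyTorch tensors.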
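The tokenizer call in the next cell uses `truncation=True`, `padding='max_length'` and `max_length=512`. The sketch below is a simplified, hypothetical stand-in (`encode_chars` is not part of Transformers) that shows what those settings mean at the character level: every sequence ends up with exactly `max_length` positions, including the special [CLS] and [SEP] tokens, and padded positions get an attention mask of 0. The special-token ids are the private-use code points that, to my knowledge, the CANINE tokenizer reserves; treat them as an assumption.

```python
# Simplified stand-in for character-level truncation and padding.
# Assumed ids: [CLS] = 0xE000, [SEP] = 0xE001, padding = 0.
CLS_ID = 0xE000
SEP_ID = 0xE001
PAD_ID = 0


def encode_chars(text, max_length):
    """Encode text as code points, truncated and padded to max_length positions."""
    # Leave room for the two special tokens when truncating.
    body = [ord(ch) for ch in text][: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


short = encode_chars("great film", max_length=16)   # padded up to 16 positions
long = encode_chars("x" * 1000, max_length=16)      # truncated down to 16 positions

print(sum(short["attention_mask"]), len(short["input_ids"]))
print(sum(long["attention_mask"]), len(long["input_ids"]))
```

With `max_length=512`, only roughly the first 510 characters of each review are kept, so long reviews are cut off early.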

In [5]:
# Load Canine tokenizer
tokenizer = CanineTokenizer.from_pretrained("google/canine-s")

# Tokenize the data
def tokenize_function(example):
    # The review text lives in the 'review' column of the CSV.
    # For CANINE, max_length counts characters, so each review is cut to
    # (or padded up to) 512 positions.
    return tokenizer(example['review'], truncation=True, padding='max_length', max_length=512)

# Apply the tokenizer to the train and test datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Set the format for PyTorch tensors, keeping only the columns the model needs
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])


In [6]:
# Load the Canine model for sequence classification
model = CanineForSequenceClassification.from_pretrained("google/canine-s", num_labels=2)  # 2 labels: positive and negative


Some weights of CanineForSequenceClassification were not initialized from the model checkpoint at google/canine-s and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next, we define the training arguments:

In [12]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    # Evaluate on the test split after every epoch. Newer Transformers
    # releases rename this argument to `eval_strategy`.
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=500,
)
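To get a feel for these settings, the small calculation below works out how many optimizer steps the run takes. It assumes a single device and no gradient accumulation (neither is set in the arguments above), with the 800 training examples from the split made earlier.

```python
import math

train_examples = 800   # size of the training split
batch_size = 8         # per_device_train_batch_size
epochs = 3             # num_train_epochs
logging_steps = 10
save_steps = 500

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"loss logged about {total_steps // logging_steps} times")
print(f"save_steps reached during the run: {total_steps >= save_steps}")
```

Under these assumptions the run is shorter than `save_steps`, so the step-based checkpoint interval is never reached; the model is saved explicitly after training instead.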




In [13]:
# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
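The Trainer defined above only reports the loss. If accuracy is wanted as well, one option is to pass a `compute_metrics` function when constructing the Trainer. The function below is a minimal sketch written without extra dependencies; it assumes the evaluation output unpacks into `(logits, labels)`, and the toy logits are made up purely to exercise it.

```python
def compute_metrics(eval_pred):
    """Return accuracy given (logits, labels) from an evaluation pass."""
    logits, labels = eval_pred
    # Index of the largest logit per row, i.e. the predicted class.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Toy check with made-up values (not results from this notebook).
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [0.9, 0.4]]
toy_labels = [1, 0, 0, 0]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

In the notebook this would be wired in as `Trainer(..., compute_metrics=compute_metrics)`, after which an accuracy column appears next to the validation loss.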


In [14]:
# Train the model
trainer.train()


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.6957 | 0.696067 |
| 2 | 0.6762 | 0.693092 |
| 3 | 0.6352 | 0.699055 |


TrainOutput(global_step=300, training_loss=0.6836899820963541, metrics={'train_runtime': 121.5196, 'train_samples_per_second': 19.75, 'train_steps_per_second': 2.469, 'total_flos': 788199284736000.0, 'train_loss': 0.6836899820963541, 'epoch': 3.0})
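A useful reference point when reading these numbers: a binary classifier that always predicts 50/50 has a cross-entropy loss of ln 2, about 0.693. The short calculation below compares the reported validation losses with that value. It is only a sanity check on the figures printed above, not an additional evaluation.

```python
import math

# Cross-entropy of a classifier that assigns probability 0.5 to each class.
chance_loss = -math.log(0.5)

# Validation losses copied from the training output above.
reported_validation_losses = [0.696067, 0.693092, 0.699055]

print(f"chance-level loss: {chance_loss:.4f}")
for epoch, loss in enumerate(reported_validation_losses, start=1):
    print(f"epoch {epoch}: loss {loss:.4f}, gap to chance {loss - chance_loss:+.4f}")
```

All three validation losses sit within about 0.01 of the chance level, which suggests that three epochs on 800 truncated reviews were not enough for the model to learn the task.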

In [15]:
trainer.save_model("./results")


In [18]:
import torch
from transformers import CanineForSequenceClassification, CanineTokenizer

# Load the fine-tuned model and tokenizer
model = CanineForSequenceClassification.from_pretrained("./results")  # Path to the fine-tuned model
tokenizer = CanineTokenizer.from_pretrained("google/canine-s")

# Set the model to evaluation mode
model.eval()

# Define the function for sentiment prediction
def predict_sentiment(review):
    # Tokenize the input review
    inputs = tokenizer(review, truncation=True, padding='max_length', max_length=512, return_tensors="pt")

    # Run inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted class
    logits = outputs.logits
    predicted_class_id = torch.argmax(logits, dim=-1).item()

    # Convert class ID to sentiment label (0: negative, 1: positive)
    if predicted_class_id == 0:
        return "Negative"
    else:
        return "Positive"

# Example usage with a movie review
review = "i love this movie."
prediction = predict_sentiment(review)
print(f"Sentiment: {prediction}")


Sentiment: Negative
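The prediction above only reports the winning class. To see how confident the model is, the logits can be turned into probabilities with a softmax; in the notebook, `torch.softmax(logits, dim=-1)` would do this. The dependency-free sketch below shows the same computation on made-up logits (the values are hypothetical, not taken from the model).

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtract the maximum for numerical stability before exponentiating.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["Negative", "Positive"]   # class 0 and class 1
example_logits = [0.05, -0.02]      # made-up, nearly tied logits

probs = softmax(example_logits)
predicted = labels[probs.index(max(probs))]

print(predicted)
print([round(p, 3) for p in probs])
```

When the two probabilities are close to 0.5, as in this made-up case, the predicted label is little better than a coin flip, which matches a model whose loss is still near chance level.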
