# Text classification: Comparison between DistilBERT and CANINE

## Imports

In [1]:
import os
import numpy as np
import warnings

warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

In [None]:
! pip install transformers
! pip install datasets

In [3]:
from datasets import load_dataset, load_metric
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

In [4]:
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import CanineTokenizer, CanineForSequenceClassification
from transformers import pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score

# IMDB dataset

## Loading the IMDB dataset

We chose this dataset because it consists of user-written movie reviews: informal text that often contains typos, spelling variations, transliterations, or emoji. CANINE operates directly on characters and, according to the CANINE paper, is designed to handle this kind of noise, so we expect it to perform better on this type of data.
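
To make this motivation concrete, here is a minimal sketch of how a typo looks to a character-level model. It assumes, as described in the CANINE paper, that the input is the sequence of Unicode code points; the helper `to_code_points` and the example strings are illustrative and not part of the notebook.

```python
def to_code_points(text):
    """Map a string to its Unicode code points, one integer per character."""
    return [ord(ch) for ch in text]


clean = "a great movie"
typo = "a graet movie"  # two adjacent characters swapped

clean_ids = to_code_points(clean)
typo_ids = to_code_points(typo)

# Same length, and only the swapped positions differ.
changed = [i for i, (a, b) in enumerate(zip(clean_ids, typo_ids)) if a != b]
print(len(clean_ids), len(typo_ids), changed)
```

The typo changes only two positions of the input sequence. A subword tokenizer, by contrast, may split the misspelled word into different and more numerous pieces.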

In [None]:
dataset = load_dataset("imdb")

In [6]:
def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    # Draw num_examples distinct indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [7]:
show_random_elements(dataset["train"])

Unnamed: 0,text,label
0,"This is mostly a story about the growing relationship between Jeff Webster(Jimmy Stewart) and Ronda Castle(Ruth Roman). She takes an instant liking to Jeff in a brief encounter on the deck of the steamer to Skagway, and a longer look when he hides in her cabin while authorities seek him on a charge of murder. They find out they have some things in common besides an animal attraction. Neither trusts a member of the opposite sex, apparently because both have been married to spouses who cheated on them. Gradually, they learn to trust each other, as they journey from Skagway to Dawson. But Ronda clearly has close dealings with corrupt sheriff Gannon and engages in some shady practices in her Castle saloon in Skagway. She eventually has to decide between Gannon and Jeff. Meanwhile, Rene, a young naive French woman also takes an immediate liking to Jeff, but only gets insulting brush offs in return. Yet, she sticks with him in his travels from Skagway to Dawson and his activities around Dawson. Along with Ronda, she nurses him back to health after Jeff is left for dead by Gannon's gunslingers at his gold claim. Walter Brennan, as Ben, serves as Jeff's long time sidekick. He doesn't have a meaty role, but serves to soften Jeff's hard edges. His demise symbolically opens the door for a woman companion replacement for Jeff.<br /><br />John McIntire(as Sheriff Gannon) makes probably the most charismatic evil town boss you will ever see on film, oozing charm and humor to go along with his bullying. Evidently, he sees something of himself in Jeff, repeatedly declaring that he's going to like him. He makes a believable incarnation of the infamous Soapy Smith, who spent his last years in Skagway, as one of the premier con men of his times.<br /><br />Jeff is the quintessential antihero, a loner(except for companion Ben), who doesn't want to stick his neck out for others, even when he knows he is the one right man for the job. 
In this respect, he closely resemble's Burt Lancaster's character in ""Vera Cruz"", for example. Thus, Jeff not only turns down the job of marshall of Dawson, he is convinced to leave Dawson after Gannon's gang move in with clear intentions of taking over everyone's insufficiently legal gold claims, while disposing of some miners and suggesting that the rest make a hurried exit from Dawson. Even Ronda suggests that she and Jeff make a hurried exit from Dawson while they are still alive. Then, Jeff has a sudden change of heart, apparently still nursing desire for revenge for the shooting of Ben and himself. He changes from anti-hero to hero in leading the expulsion of Gannon's gang from Dawson. In this respect, he differs from Lancaster's character, who never reforms(But is Jeff truly changed, or just handing out revenge for wrongs committed against his own interests?)<br /><br />The main problem I see with the plot is the 2 principle women. Clearly, Ronda is groomed as the right woman to tame Jeff. Although she is clearly characterized as a ""bad"" girl, Jeff has a checkered recent past himself, having shot at least 5 men in the US or Yukon, and having stolen his cattle back from Gannon. Ironically, soon after Jeff changes from anti-hero to hero, Rhonda makes a similar change in running into the street to warn Jeff of Gannon's impending ambush. She dies as a result and Jeff asks her why she didn't just look out for herself(his supposedly just abandoned creed!).<br /><br />It's clear that Corine Calvert, as Renee, just doesn't make a credible substitute for the dead Ronda, in Jeff's mind. Yet, the apparent implication of the parting scene is that they get together, even though Jeff never visibly gives her a kiss or hug. Her image as a good, if naive, young woman is somewhat compromised by her job in Rhonda's saloon of bumping miners weighting their gold dust, pushing the spilled dust on the floor and recovering it later. 
I'm also very unclear about her relationship with Rube Morris, a middle aged miner who followers her around and works a claim with her.(He's not her father).<br /><br />Another problem is the amateurish handling of the gun fight between Jeff and Gannon's gang. If Gannon had any skill at all with a pistol, he should have killed or seriously wounded Jeff under that boardwalk, before Jeff did the same to him. And how did Jeff's badly shot up right hand suddenly become well enough to shoot a pistol with apparent ease? I also wonder what Jeff and friends did to help save the avalanche victims. They were much too far away to pull them out alive from under the snow. And why weren't most of Ronda's pack horses and mules also buried by the avalanche?<br /><br />You will see a host of probably nameless but familiar faces among the miners and Gannon's gang. The sequences shot in the Canadian Rockies provide a breathtaking backdrop to the action. All-in-all, a very entertaining western, with most of the major flaws concentrated at the end. No doubt, this film takes some great liberties with history and geography, especially, the part taking place in the Canadian Yukon, which was in fact much tamer than the US Skagway.",pos
1,"And I repeat, please do not see this movie! This is more than a review. This is a warning. This sets the record for the worst, most effortless comedy ever made. At least with most of the recent comedies nowadays, the gags are crude and flat, but the writers and directors put in at least some sort of effort into making them funny. I never get tired of repeating one of my favorite mottos: Everyone thinks they can do comedy, and only 10 percent of them are right. Comedy is hard! This is not some genre any fool can play around with. I think it's atrocious that the filmmakers are comparing this piece of garbage to ""Kentucky Fried Movie."" Basically, these bozos are comparing their so-called comic talents to those of the brilliant Jim Abrahams and the Zucker Brothers. Come on, I've seen Pauly Shore movies that are 10 times funnier than ""The Underground Comedy Movie."" Here's a sample of the comedy for those curious about seeing this movie: One sketch involves a superhero dressed like a penis named D**kman. The whole joke is that he defeats his enemies by squirting them with semen. That's it. That's the whole joke. Wow. This is enough to make Carrot Top roll his eyes. Another sketch involves a man having sex with a dead person in a porn movie. And in another sketch, there's a bag lady beauty contest, in which we're exposed to the horrible sights of bikini-clad middle-aged women with beer guts and stretch marks. Plus, making fun of the homeless is more sad than funny. It's a step away from mocking the mentally handicapped. The whole movie is supposed to be a satire. I think the filmmakers forgot that a key element of satire...is TRUTH!!! For anybody who actually enjoyed this crap, explain to me what is truthful about ANY of these gags! Some of the sketches might've sounded funny on paper, but anybody who's taken any screen writing classes knows that if a sight gag sounds too funny on paper, it probably won't be funny on screen. 
If I tell someone about a big, black, muscular gay virgin, who's saving himself for the right man, he or she would probably laugh. But watching the premise played out on screen for about 10 minutes is a complete drag. I hate how whenever people criticize a low-brow comedy like this for not being funny, they're regarded as stuck-up squares. I just saw ""White Chicks"" recently. That's another low-brow, politically incorrect comedy, but I laughed my head off. The most offensive thing about ""The Underground Comedy Movie"" is it's not funny! What the writers and directors don't understand is that merely being filthy and tasteless doesn't work. There has to be more! Just think of the famous scene from ""There's Something About Mary"" (ironically, enough the bozo filmmakers put the Farrellys on their special thanks list). The joke about the semen wasn't just funny because it involved bodily fluids. There was a buildup. Ben Stiller was masturbating in the bathroom to make sure he didn't go out on a date with a ""loaded gun."" Then he looked around to see where all the semen went after it was released. A knock is on the door, and he has to answer it. His date, Mary, is at the door and that's when it's revealed that the semen is hanging off Ben's ear. In this movie, there are multiple gags involving characters squirting loads of semen at people, with no buildup whatsoever. As Jay Leno always says, ""This comedy thing's not so easy, is it?"" Keep that in mind, Vince Offer, 'cause you weren't cut out for this genre!! The only reason people might laugh at these gags is because they want to feel hip. Let's face it, nowadays it's hip to laugh at anything politically incorrect. I know comedy is subjective...but this movie shouldn't be funny to anybody, except maybe the filmmakers themselves. 
As a side note, the movie had to have been made before Michael Clarke Duncan's fame in movies like ""Armageddon"" and ""The Green Mile."" There can't be any other reason why an actor of his caliber would volunteer to be part of this amateurish freak show. All the others in the cast are either non-actors, has-been actors or B-movie stars. Karen Black made a good impression in ""Five Easy Pieces,"" but I don't think she's done anything of value ever since. Slash was probably drugged into being in this film. Gina Lee Nolin is nothing without ""Baywatch."" Angelyne is the film's biggest star (keeping in mind Duncan wasn't famous at the time), and there are still probably a ton of people who haven't heard of her--for good reason. Usually, I'm in support of extremely low-budget flicks, but this one deserves to drift into obscurity. I hope to Lord this doesn't become a cult classic! Shouldn't there be a law against distributing crap like this?",neg
2,"Wow, what a great cast! Julia Roberts, John Cusack, Christopher Walken, Catherine Zeta-Jones, Hank Azaria...what's that? A script, you say? Now you're just being greedy! Surely such a charismatic bunch of thespians will weave such fetching tapestries of cinematic wonder that a script will be unnecessary? You'd think so, but no. America's Sweethearts is one missed opportunity after another. It's like everyone involved woke up before each day's writing/shooting/editing and though ""You know what? I've been working pretty hard lately, and this is guaranteed to be a hit with all these big names, right? I'm just gonna cruise along and let somebody else carry the can."" So much potential, yet so painful to sit through. There isn't a single aspect of this thing that doesn't suck. Even Julia's fat suit is lame.",neg
3,"I saw this movie with my dad. I must have been pretty young, around 15. It was on Star Movies one afternoon.The movie started a bit vaguely, but you could tell those robbers were gathering up for a score. It really caught pace after the first half hour.<br /><br />All the actors are great, especially Blades and Lou Diamond. I Guess it's the ensemble, they just play so well together. I can watch this film anytime.I think it is the relative stupidity of the plot and the characters trying to deal with a very weird score. The jokes are not corny but they are subtle and extreme at the same time that make them so hilarious.<br /><br />A perfect comedy for a lazy afternoon.",pos
4,"How this film could miss so many of the fascinating, complex and mysterious aspects of the original story or the original movie is truly remarkable. An unbelievably thin and unengaging plot, ankle-deep characterisation/motivation and a really awful soundtrack (replacing tension with vast swathes of noise, replacing the arcane musical references of the original for digitised crashes and roars. Then there are the specific references to the original which are merely ""plastered on"" over the cracks... Dreadful. In a world where gormless, brain-dead Amerikan remakes of The Italian Job (a tear appears), Get Carter (sobs uncontrollably) and Alfie have desecrated our screens recently, this one takes the proverbial biscuit. Execrable nonsense. How Ellen Burstyn ever got involved is a wonder... Rubbish.",neg


In [8]:
sum(dataset["train"]["label"]) / len(dataset["train"]["label"])

0.5

In [9]:
sum(dataset["test"]["label"]) / len(dataset["test"]["label"])

0.5

## Metrics

In [None]:
from datasets import load_metric

metric1 = load_metric("accuracy")
metric2 = load_metric("precision")
metric3 = load_metric("recall")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    accuracy = metric1.compute(predictions=predictions, references=labels)["accuracy"]
    precision = metric2.compute(predictions=predictions, references=labels)["precision"]
    recall = metric3.compute(predictions=predictions, references=labels)["recall"]

    return {"accuracy": accuracy, "precision": precision, "recall": recall}
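
As a sanity check on what `compute_metrics` reports, the sketch below computes the same three quantities by hand on a toy batch. The logits and labels are made up for illustration, and plain Python is used instead of the `datasets` metrics.

```python
def binary_metrics(logits, labels):
    """Accuracy, precision and recall for the positive class (label 1)."""
    # argmax over the logits of each example
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))
    return {
        "accuracy": correct / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.3], [1.2, 0.4]]
toy_labels = [0, 1, 0, 1]
toy_result = binary_metrics(toy_logits, toy_labels)
print(toy_result)
```

Precision and recall are both computed for the positive class, which is why a model that predicts "positive" for nearly every review can show a very high recall together with a precision close to 0.5.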

## distilbert-base-uncased-finetuned-sst-2-english for text classification on IMDB

This model is a checkpoint of DistilBERT-base-uncased fine-tuned on SST-2. It reaches an accuracy of 91.3 on the SST-2 dev set (for comparison, bert-base-uncased reaches 92.7).

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", num_labels=2
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

In [None]:
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

In [None]:
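predictions = []
scores = []
# Read the text column once rather than on every iteration.
test_texts = dataset["test"]["text"]
for text in test_texts:
    # [:512] keeps the first 512 characters (not tokens) of each review.
    single_pred = classifier(text[:512])
    if single_pred[0]["label"] == "NEGATIVE":
        predictions.append(0)
    else:
        predictions.append(1)
    scores.append(single_pred[0]["score"])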
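
Note that `[:512]` slices each review to its first 512 characters, which is not the same as truncating to 512 tokens. The sketch below illustrates the gap on a synthetic review, using whitespace splitting as a crude stand-in for a real tokenizer (an assumption made only to keep the example self-contained).

```python
review = "This movie was surprisingly good and I enjoyed every minute of it. " * 40

char_truncated = review[:512]  # what the loop above passes to the pipeline
n_words_kept = len(char_truncated.split())
n_words_total = len(review.split())

print(f"characters kept: {len(char_truncated)} of {len(review)}")
print(f"whitespace tokens kept: {n_words_kept} of {n_words_total}")
```

On this toy text the slice keeps far fewer than 512 whitespace tokens, so for reviews longer than 512 characters the models evaluated this way see only the beginning of the text.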

In [None]:
accuracy_score(dataset["test"]["label"], predictions)

0.82832

In [None]:
precision_score(dataset["test"]["label"], predictions)

0.8483870967741935

In [None]:
recall_score(dataset["test"]["label"], predictions)

0.79952

## canine-s-finetuned-sst2 for text classification on IMDB

The canine-s-finetuned-sst2 model is a fine-tuned version of google/canine-s on the GLUE dataset. It achieves the following results on its evaluation set:
* Loss: 0.5259
* Accuracy: 0.8578

In [None]:
model = CanineForSequenceClassification.from_pretrained(
    "celine98/canine-s-finetuned-sst2", num_labels=2
)
tokenizer = CanineTokenizer.from_pretrained("celine98/canine-s-finetuned-sst2")

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

In [None]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [None]:
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

In [None]:
print(model)

CanineForSequenceClassification(
  (canine): CanineModel(
    (char_embeddings): CanineEmbeddings(
      (HashBucketCodepointEmbedder_0): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_1): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_2): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_3): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_4): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_5): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_6): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_7): Embedding(16384, 96)
      (char_position_embeddings): Embedding(16384, 768)
      (token_type_embeddings): Embedding(16, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (initial_char_encoder): CanineEncoder(
      (layer): ModuleList(
        (0): CanineLayer(
          (attention): CanineAttention(
            (self): CanineSelfAttention(
            

In [None]:
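predictions = []
scores = []
test_texts = dataset["test"]["text"]  # read the column once
for text in test_texts:
    single_pred = classifier(text[:512])  # first 512 characters only
    # The label string ends with the class index, so its last character gives 0 or 1.
    predictions.append(int(single_pred[0]["label"][-1]))
    scores.append(single_pred[0]["score"])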
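
The two pipelines used so far return different label formats: the DistilBERT SST-2 checkpoint returns `NEGATIVE`/`POSITIVE`, while the CANINE loop relies on the label string ending with the class index. A small helper can handle both. The name `label_to_int` is hypothetical, and the `LABEL_<k>` format is assumed from the use of `[-1]` in the loop.

```python
def label_to_int(label):
    """Convert a pipeline label string to 0 (negative) or 1 (positive)."""
    named = {"NEGATIVE": 0, "POSITIVE": 1}
    if label in named:
        return named[label]
    # Assumed format "LABEL_<k>": take the integer after the last underscore.
    return int(label.rsplit("_", 1)[-1])


converted = [label_to_int(x) for x in ["NEGATIVE", "POSITIVE", "LABEL_0", "LABEL_1"]]
print(converted)
```

Parsing the integer after the underscore, rather than the last character, would also keep working with more than ten classes.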

In [None]:
accuracy_score(dataset["test"]["label"], predictions)

0.74556

In [None]:
precision_score(dataset["test"]["label"], predictions)

0.846483801783497

In [None]:
recall_score(dataset["test"]["label"], predictions)

0.59992

## canine-s for text classification on IMDB

In [None]:
tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineForSequenceClassification.from_pretrained("google/canine-s", num_labels=2)


def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)


# tokenize dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# data collator that pads each batch to its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

# The classification head of google/canine-s is newly initialized and has not been
# trained yet, so this cell measures an untrained baseline.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

# predictions and scores on the test set
predictions = []
scores = []
test_texts = dataset["test"]["text"]  # read the column once
for text in test_texts:
    single_pred = classifier(text[:512])  # first 512 characters only
    predictions.append(int(single_pred[0]["label"][-1]))
    scores.append(single_pred[0]["score"])

In [None]:
accuracy_score(dataset["test"]["label"], predictions)

0.50004

In [None]:
precision_score(dataset["test"]["label"], predictions)

0.5000200056015685

In [None]:
recall_score(dataset["test"]["label"], predictions)

0.99976
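
These numbers are what one would expect from a classifier that predicts the positive class almost always, which is plausible here because the classification head on top of `google/canine-s` has not been trained. The sketch below reconstructs a confusion matrix consistent with the reported metrics. The counts are inferred from the metrics and from the balanced test set (12,500 positives out of 25,000), not read from the notebook.

```python
n_pos, n_neg = 12500, 12500  # balanced IMDB test set
recall = 0.99976
precision = 0.5000200056015685

tp = round(recall * n_pos)       # true positives
fn = n_pos - tp                  # false negatives
fp = round(tp / precision) - tp  # false positives
tn = n_neg - fp                  # true negatives

accuracy = (tp + tn) / (n_pos + n_neg)
print(tp, fp, tn, fn, accuracy)
```

The inferred counts reproduce the reported accuracy of 0.50004: only 7 of the 25,000 test reviews are assigned to the negative class.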

## Fine-tuning all of DistilBERT on IMDB (text classification task)

In [None]:
# Loading the DistilBERT tokenizer to process the text field

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
# tokenizing text and truncating sequences to be no longer than DistilBERT’s maximum input length


def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

In [None]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [None]:
# data collator that pads each batch to its longest sequence

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

In [None]:
# distilbert-base-uncased model

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

In [None]:
# Architecture

print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

**Training the distilbert-base-uncased model on the IMDB dataset**

In [None]:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
)
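
With these arguments, the number of optimization steps follows directly from the dataset size and the batch size. The sketch below assumes a single device and no gradient accumulation, so that the effective batch size equals `per_device_train_batch_size`.

```python
import math

num_examples = 25000
batch_size = 16
num_epochs = 5

steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

This gives 1563 steps per epoch and 7815 steps in total, which is why checkpoints saved once per epoch are named `checkpoint-1563`, and so on up to `checkpoint-7815`.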

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 7815


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall
1,0.2343,0.215394,0.91488,0.959345,0.86648
2,0.1441,0.221827,0.92832,0.908017,0.9532
3,0.0879,0.32391,0.92856,0.945155,0.90992
4,0.0454,0.35553,0.92864,0.917355,0.94216
5,0.025,0.385541,0.93076,0.927511,0.93456


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 25000
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-1563
Configuration saved in ./results/checkpoint-1563/config.json
Model weights saved in ./results/checkpoint-1563/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1563/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1563/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples =

TrainOutput(global_step=7815, training_loss=0.11321147417152683, metrics={'train_runtime': 4612.0295, 'train_samples_per_second': 27.103, 'train_steps_per_second': 1.694, 'total_flos': 1.64028791287344e+16, 'train_loss': 0.11321147417152683, 'epoch': 5.0})
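
The epoch table shows the training loss falling steadily while the validation loss rises after the first epoch, a typical sign of overfitting, even though accuracy stays roughly flat. The sketch below uses the values copied from that table to pick the epoch that a loss-based and an accuracy-based criterion would each select. To our knowledge, `load_best_model_at_end=True` compares checkpoints by evaluation loss unless `metric_for_best_model` is set, so treat the link to the `Trainer` behaviour as an assumption.

```python
# (epoch, training loss, validation loss, accuracy) copied from the epoch table
history = [
    (1, 0.2343, 0.215394, 0.91488),
    (2, 0.1441, 0.221827, 0.92832),
    (3, 0.0879, 0.323910, 0.92856),
    (4, 0.0454, 0.355530, 0.92864),
    (5, 0.0250, 0.385541, 0.93076),
]

best_by_loss = min(history, key=lambda row: row[2])[0]
best_by_accuracy = max(history, key=lambda row: row[3])[0]
print(best_by_loss, best_by_accuracy)
```

The two criteria disagree here (epoch 1 by loss, epoch 5 by accuracy), so it matters which checkpoint is used for the final evaluation.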

In [None]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 25000
  Batch size = 16


{'epoch': 5.0,
 'eval_loss': 0.19566546380519867,
 'eval_runtime': 444.1365,
 'eval_samples_per_second': 56.289,
 'eval_steps_per_second': 3.519}

**Evaluation**

In [None]:
# the test set is balanced, so accuracy is a meaningful metric:

print(sum(dataset["test"]["label"]))
print(len(dataset["test"]["label"]))

12500
25000


In [None]:
! cp -r /content/Drive/MyDrive/NLPTraining/checkpoint-7815 ./checkpoint-7815

In [None]:
# Load the fine-tuned checkpoint copied in the previous cell.
model = AutoModelForSequenceClassification.from_pretrained(
    "/content/checkpoint-7815", num_labels=2
)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

In [None]:
predictions = []
scores = []
test_texts = dataset["test"]["text"]  # read the column once
for text in test_texts:
    single_pred = classifier(text[:512])  # first 512 characters only
    predictions.append(int(single_pred[0]["label"][-1]))
    scores.append(single_pred[0]["score"])

In [None]:
accuracy_score(dataset["test"]["label"], predictions)

0.86784

In [None]:
precision_score(dataset["test"]["label"], predictions)

0.8606840288672732

In [None]:
recall_score(dataset["test"]["label"], predictions)

0.87776

## Fine-tuning part of CANINE (final layers) on IMDB (text classification task)

In [None]:
list_para = [
    "classifier.weight",
    "classifier.bias",
    "canine.projection.conv.weight",
    "canine.projection.conv.bias",
    "canine.projection.LayerNorm.weight",
    "canine.projection.LayerNorm.bias",
    "canine.final_char_encoder.layer.0.attention.self.query.weight",
    "canine.final_char_encoder.layer.0.attention.self.query.bias",
    "canine.final_char_encoder.layer.0.attention.self.key.weight",
    "canine.final_char_encoder.layer.0.attention.self.key.bias",
    "canine.final_char_encoder.layer.0.attention.self.value.weight",
    "canine.final_char_encoder.layer.0.attention.self.value.bias",
    "canine.final_char_encoder.layer.0.attention.output.dense.weight",
    "canine.final_char_encoder.layer.0.attention.output.dense.bias",
    "canine.final_char_encoder.layer.0.attention.output.LayerNorm.weight",
    "canine.final_char_encoder.layer.0.attention.output.LayerNorm.bias",
    "canine.final_char_encoder.layer.0.intermediate.dense.weight",
    "canine.final_char_encoder.layer.0.intermediate.dense.bias",
    "canine.final_char_encoder.layer.0.output.dense.weight",
    "canine.final_char_encoder.layer.0.output.dense.bias",
    "canine.final_char_encoder.layer.0.output.LayerNorm.weight",
    "canine.final_char_encoder.layer.0.output.LayerNorm.bias",
    "canine.pooler.dense.weight",
    "canine.pooler.dense.bias",
]

In [None]:
tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineForSequenceClassification.from_pretrained("google/canine-s", num_labels=2)

for name, param in model.named_parameters():
    if name not in list_para:
        print(name)
        param.requires_grad = False


def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)


tokenized_dataset = dataset.map(preprocess_function, batched=True)

# data collator that pads each batch to its longest sequence

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
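
The explicit `list_para` keeps four groups of parameters trainable: the classification head, the projection, the final character encoder, and the pooler. Prefix matching should select the same parameters more compactly, as sketched here on a few names. `is_trainable` and `TRAINABLE_PREFIXES` are hypothetical names, and the sample names are taken from `list_para` and from the printed architecture.

```python
TRAINABLE_PREFIXES = (
    "classifier.",
    "canine.projection.",
    "canine.final_char_encoder.",
    "canine.pooler.",
)


def is_trainable(param_name):
    """True if the parameter belongs to one of the groups left unfrozen."""
    return param_name.startswith(TRAINABLE_PREFIXES)


sample_names = [
    "classifier.weight",
    "canine.projection.conv.bias",
    "canine.final_char_encoder.layer.0.output.dense.weight",
    "canine.pooler.dense.bias",
    "canine.char_embeddings.LayerNorm.weight",  # embeddings: frozen
]
for name in sample_names:
    print(f"{name}: {'trainable' if is_trainable(name) else 'frozen'}")
```

In the freezing loop, `if name not in list_para` would then become `if not is_trainable(name)`.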

In [None]:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `CanineForSequenceClassification.forward` and have been ignored: text. If text are not expected by `CanineForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 7815


Epoch,Training Loss,Validation Loss
1,0.6894,0.684168
2,0.6833,0.679061
3,0.679,0.674761
4,0.6787,0.672913
5,0.6764,0.672415


The following columns in the evaluation set  don't have a corresponding argument in `CanineForSequenceClassification.forward` and have been ignored: text. If text are not expected by `CanineForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 25000
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-1563
Configuration saved in ./results/checkpoint-1563/config.json
Model weights saved in ./results/checkpoint-1563/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1563/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1563/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `CanineForSequenceClassification.forward` and have been ignored: text. If text are not expected by `CanineForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 25000
  Batch s

TrainOutput(global_step=7815, training_loss=0.6834106289112484, metrics={'train_runtime': 9685.1513, 'train_samples_per_second': 12.906, 'train_steps_per_second': 0.807, 'total_flos': 1.6315300305358272e+17, 'train_loss': 0.6834106289112484, 'epoch': 5.0})
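
The validation loss barely moves over five epochs, from about 0.684 to 0.672. For reference, a binary classifier that always outputs a probability of 0.5 has a cross-entropy of ln 2. The sketch below computes that baseline and compares the training runtimes, using numbers copied from the two `TrainOutput` lines in this notebook.

```python
import math

chance_loss = math.log(2)  # cross-entropy of a 50/50 guess
final_val_loss = 0.672415  # CANINE, epoch 5
print(f"chance-level loss: {chance_loss:.4f}")
print(f"improvement over chance: {chance_loss - final_val_loss:.4f}")

canine_runtime = 9685.1513      # seconds, partial fine-tuning of CANINE
distilbert_runtime = 4612.0295  # seconds, full fine-tuning of DistilBERT
runtime_ratio = canine_runtime / distilbert_runtime
print(f"runtime ratio: {runtime_ratio:.2f}x")
```

The partially fine-tuned CANINE therefore ends only slightly below the chance-level loss while taking roughly twice as long to train as the fully fine-tuned DistilBERT.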

In [None]:
# Evaluate the model fine-tuned in the cells above.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

# predictions and scores on the test set
predictions = []
scores = []
test_texts = dataset["test"]["text"]  # read the column once
for text in test_texts:
    single_pred = classifier(text[:512])  # first 512 characters only
    predictions.append(int(single_pred[0]["label"][-1]))
    scores.append(single_pred[0]["score"])

In [None]:
accuracy_score(dataset["test"]["label"], predictions)

0.57688

In [None]:
precision_score(dataset["test"]["label"], predictions)

0.560100062539087

In [None]:
recall_score(dataset["test"]["label"], predictions)

0.71648
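
To compare the runs so far at a glance, the sketch below gathers the precision and recall values reported in the evaluation cells and derives an F1 score from them. F1 is not computed in the notebook; it is added here only as a convenient single-number summary, and the run labels are our own shorthand.

```python
# (precision, recall) on the IMDB test set, copied from the evaluation cells
runs = {
    "DistilBERT SST-2 checkpoint": (0.8483870967741935, 0.79952),
    "CANINE SST-2 checkpoint": (0.846483801783497, 0.59992),
    "CANINE, untrained head": (0.5000200056015685, 0.99976),
    "DistilBERT fine-tuned (all)": (0.8606840288672732, 0.87776),
    "CANINE fine-tuned (partial)": (0.560100062539087, 0.71648),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, (p, r) in runs.items():
    print(f"{name:30s} P={p:.3f} R={r:.3f} F1={f1(p, r):.3f}")
```

All of these figures come from the pipeline evaluation on the first 512 characters of each review, so they are comparable with each other but not with the per-epoch metrics reported by the `Trainer`.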

## Fine-tuning DistilBERT (classifier only) on IMDB (text classification task)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)


tokenized_dataset = dataset.map(preprocess_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

In [None]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [None]:
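# Freeze every parameter except the classification head
# (pre_classifier and classifier); the frozen names are printed.
for name, param in model.named_parameters():
    if name not in [
        "pre_classifier.weight",
        "pre_classifier.bias",
        "classifier.weight",
        "classifier.bias",
    ]:
        print(name)
        param.requires_grad = False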
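
To get a feel for how little is actually trained in the classifier-only setting, the size of the unfrozen heads can be worked out by hand. The sketch below assumes a hidden size of 768 and 2 labels, that DistilBERT's `pre_classifier` is a 768 to 768 linear layer, and that the CANINE runs in this notebook leave only a 768 to 2 `classifier` layer trainable; `linear_params` is an illustrative helper.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features


hidden, num_labels = 768, 2  # assumed sizes, see the lead-in

# DistilBERT head: pre_classifier (768 -> 768) followed by classifier (768 -> 2)
distilbert_head = linear_params(hidden, hidden) + linear_params(hidden, num_labels)
# CANINE head as frozen in this notebook: classifier (768 -> 2) only
canine_head = linear_params(hidden, num_labels)

print(distilbert_head, canine_head)
```

Under these assumptions the DistilBERT head has 592,130 trainable parameters and the CANINE head only 1,538, which is worth keeping in mind when comparing the two classifier-only runs.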

In [None]:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 7815


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall
1,0.4527,0.411329,0.82928,0.835343,0.82024
2,0.3875,0.368,0.8414,0.830379,0.85808
3,0.3759,0.352435,0.84708,0.847664,0.84624
4,0.3686,0.34828,0.84968,0.845916,0.85512
5,0.3598,0.346832,0.85048,0.846488,0.85624


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 25000
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-1563
Configuration saved in ./results/checkpoint-1563/config.json
Model weights saved in ./results/checkpoint-1563/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1563/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1563/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples =

TrainOutput(global_step=7815, training_loss=0.4057229557635307, metrics={'train_runtime': 2492.2056, 'train_samples_per_second': 50.156, 'train_steps_per_second': 3.136, 'total_flos': 1.64028791287344e+16, 'train_loss': 0.4057229557635307, 'epoch': 5.0})
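
The step counts and throughput figures in the log follow directly from the dataset size, the batch size and the runtime. A quick check, assuming a single device, no gradient accumulation (as the log states) and that the last partial batch of each epoch is kept:

```python
import math

num_examples = 25000        # "Num examples" in the training log
batch_size = 16             # per_device_train_batch_size on one device
num_epochs = 5
train_runtime = 2492.2056   # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

This gives 1,563 steps per epoch, which is why the first checkpoint is named `checkpoint-1563`, and 7,815 steps in total.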

In [None]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 25000
  Batch size = 16


{'epoch': 5.0,
 'eval_accuracy': 0.85048,
 'eval_loss': 0.3468317687511444,
 'eval_precision': 0.8464884530211958,
 'eval_recall': 0.85624,
 'eval_runtime': 240.1053,
 'eval_samples_per_second': 104.121,
 'eval_steps_per_second': 6.51}

## Fine-tuning CANINE (classifier only) on IMDB (text classification task)

In [None]:
tokenizer = CanineTokenizer.from_pretrained("google/canine-s")
model = CanineForSequenceClassification.from_pretrained("google/canine-s", num_labels=2)

In [None]:
# Architecture

print(model)

CanineForSequenceClassification(
  (canine): CanineModel(
    (char_embeddings): CanineEmbeddings(
      (HashBucketCodepointEmbedder_0): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_1): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_2): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_3): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_4): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_5): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_6): Embedding(16384, 96)
      (HashBucketCodepointEmbedder_7): Embedding(16384, 96)
      (char_position_embeddings): Embedding(16384, 768)
      (token_type_embeddings): Embedding(16, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (initial_char_encoder): CanineEncoder(
      (layer): ModuleList(
        (0): CanineLayer(
          (attention): CanineAttention(
            (self): CanineSelfAttention(
            

In [None]:
# Freeze the whole CANINE encoder; only the final classifier layer stays trainable.
for name, param in model.named_parameters():
    if name not in ["classifier.weight", "classifier.bias"]:
        print(name)
        param.requires_grad = False

canine.char_embeddings.HashBucketCodepointEmbedder_0.weight
canine.char_embeddings.HashBucketCodepointEmbedder_1.weight
canine.char_embeddings.HashBucketCodepointEmbedder_2.weight
canine.char_embeddings.HashBucketCodepointEmbedder_3.weight
canine.char_embeddings.HashBucketCodepointEmbedder_4.weight
canine.char_embeddings.HashBucketCodepointEmbedder_5.weight
canine.char_embeddings.HashBucketCodepointEmbedder_6.weight
canine.char_embeddings.HashBucketCodepointEmbedder_7.weight
canine.char_embeddings.char_position_embeddings.weight
canine.char_embeddings.token_type_embeddings.weight
canine.char_embeddings.LayerNorm.weight
canine.char_embeddings.LayerNorm.bias
canine.initial_char_encoder.layer.0.attention.self.query.weight
canine.initial_char_encoder.layer.0.attention.self.query.bias
canine.initial_char_encoder.layer.0.attention.self.key.weight
canine.initial_char_encoder.layer.0.attention.self.key.bias
canine.initial_char_encoder.layer.0.attention.self.value.weight
canine.initial_char_enc

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

In [None]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [None]:
# creating a batch of examples

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

In [None]:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `CanineForSequenceClassification.forward` and have been ignored: text. If text are not expected by `CanineForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 25000
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 7815


Epoch,Training Loss,Validation Loss
1,0.6966,0.694117
2,0.6965,0.693941
3,0.6953,0.692565
4,0.6935,0.692273
5,0.6946,0.692236


The following columns in the evaluation set  don't have a corresponding argument in `CanineForSequenceClassification.forward` and have been ignored: text. If text are not expected by `CanineForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 25000
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-1563
Configuration saved in ./results/checkpoint-1563/config.json
Model weights saved in ./results/checkpoint-1563/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1563/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1563/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `CanineForSequenceClassification.forward` and have been ignored: text. If text are not expected by `CanineForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 25000
  Batch s

TrainOutput(global_step=7815, training_loss=0.6959521703055023, metrics={'train_runtime': 9700.3192, 'train_samples_per_second': 12.886, 'train_steps_per_second': 0.806, 'total_flos': 1.6315300305358272e+17, 'train_loss': 0.6959521703055023, 'epoch': 5.0})
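
One way to read these losses: for a two-class problem, a model that always outputs a probability of 0.5 has a cross-entropy of ln 2, roughly 0.693. The snippet below compares the epoch 5 validation loss from the table above with that reference value. It is only a back-of-the-envelope check, not part of the original experiment.

```python
import math

chance_loss = math.log(2)    # cross-entropy of predicting 0.5 for both classes
final_val_loss = 0.692236    # epoch 5 validation loss from the table above

print(round(chance_loss, 6))
print(round(chance_loss - final_val_loss, 6))
```

The gap is below 0.001, which suggests that the frozen-encoder CANINE classifier has barely moved away from chance level after five epochs.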

In [None]:
# Model was saved using *save_pretrained('./test/saved_model/')* (for example purposes, not runnable).
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

In [None]:
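# Predicted labels and confidence scores on the test set
test_texts = dataset["test"]["text"]  # read the column once, not on every iteration
predictions = []
scores = []
for text in test_texts:
    # Keep only the first 512 characters of each review
    single_pred = classifier(text[:512])
    # Labels are expected to look like "LABEL_0" / "LABEL_1"; keep the trailing digit
    predictions.append(int(single_pred[0]["label"][-1]))
    scores.append(single_pred[0]["score"])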

In [None]:
accuracy_score(dataset["test"]["label"], predictions)

0.52728

In [None]:
precision_score(dataset["test"]["label"], predictions)

0.547812675266405

In [None]:
recall_score(dataset["test"]["label"], predictions)

0.31256

## Fine-tuning canine-s-finetuned-sst2 on IMDB (text classification task)

In [None]:
model = CanineForSequenceClassification.from_pretrained(
    "celine98/canine-s-finetuned-sst2", num_labels=2
)
tokenizer = CanineTokenizer.from_pretrained("celine98/canine-s-finetuned-sst2")

In [12]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)

In [19]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

In [20]:
# Freeze everything except the final classifier layer.
for name, param in model.named_parameters():
    if name not in ["classifier.weight", "classifier.bias"]:
        print(name)
        param.requires_grad = False

canine.char_embeddings.HashBucketCodepointEmbedder_0.weight
canine.char_embeddings.HashBucketCodepointEmbedder_1.weight
canine.char_embeddings.HashBucketCodepointEmbedder_2.weight
canine.char_embeddings.HashBucketCodepointEmbedder_3.weight
canine.char_embeddings.HashBucketCodepointEmbedder_4.weight
canine.char_embeddings.HashBucketCodepointEmbedder_5.weight
canine.char_embeddings.HashBucketCodepointEmbedder_6.weight
canine.char_embeddings.HashBucketCodepointEmbedder_7.weight
canine.char_embeddings.char_position_embeddings.weight
canine.char_embeddings.token_type_embeddings.weight
canine.char_embeddings.LayerNorm.weight
canine.char_embeddings.LayerNorm.bias
canine.initial_char_encoder.layer.0.attention.self.query.weight
canine.initial_char_encoder.layer.0.attention.self.query.bias
canine.initial_char_encoder.layer.0.attention.self.key.weight
canine.initial_char_encoder.layer.0.attention.self.key.bias
canine.initial_char_encoder.layer.0.attention.self.value.weight
canine.initial_char_enc

In [16]:
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
)

In [17]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()