If you're opening this notebook on Colab, you will probably need to install the most recent versions of 🤗 Transformers and 🤗 Datasets. We will also need `scipy` and `scikit-learn` for some of the metrics. Run the following cell to install them.

In [1]:
! pip --quiet install git+https://github.com/huggingface/transformers.git
! pip --quiet install git+https://github.com/huggingface/datasets.git
! pip --quiet install scipy scikit-learn

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done


Make sure your version of Transformers is at least 4.8.1 since the functionality was introduced in that version:

In [2]:
import transformers

print(transformers.__version__)

4.16.0.dev0
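
The version check can also be done programmatically. The sketch below is one possible approach and not part of the original notebook: `version_tuple` and `is_at_least` are hypothetical helper names, and the parsing simply keeps the leading numeric components, so a development build such as `4.16.0.dev0` is treated like `4.16.0`. Comparing the raw strings would give the wrong answer here, because `"4.16"` sorts before `"4.8"` lexicographically.

```python
import re


def version_tuple(version):
    """Return the leading numeric components, e.g. '4.16.0.dev0' -> (4, 16, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric piece such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed, minimum):
    """Compare versions numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


print(is_at_least("4.16.0.dev0", "4.8.1"))  # True
print(is_at_least("4.8.0", "4.8.1"))        # False
```

In the notebook itself, the first argument would be `transformers.__version__`.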


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).

# Fine-tuning a model on a text classification task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a text classification task of the [GLUE Benchmark](https://gluebenchmark.com/).

The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences which are:

- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.)
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs) Determine if two questions are semantically equivalent or not.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 0 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.)

We will see how to easily load the dataset for each one of those tasks and use the `Trainer` API to fine-tune a model on it. Each task is named by its acronym, with `mnli-mm` standing for the mismatched version of MNLI (so same training set as `mnli` but different validation and test sets):

In [3]:
GLUE_TASKS = [
    "cola",
    "mnli",
    "mnli-mm",
    "mrpc",
    "qnli",
    "qqp",
    "rte",
    "sst2",
    "stsb",
    "wnli",
]

This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [4]:
task = "sst2"
# model_checkpoint = "distilbert-base-uncased"
model_checkpoint = "bert-base-cased"  # name from Hugging Face repository
batch_size = 8 # 16 too large

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [5]:
from datasets import load_dataset, load_metric

Apart from `mnli-mm` being a special code, we can directly pass our task name to those functions. `load_dataset` will cache the dataset to avoid downloading it again the next time you run this cell.

In [6]:
# actual_task = "mnli" if task == "mnli-mm" else task
# dataset = load_dataset("glue", actual_task)
# metric = load_metric("glue", actual_task)

# dataset = load_dataset("glue", task)
metric = load_metric("glue", task)
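
The special-casing of `mnli-mm` affects both the configuration name and the evaluation split. Here is a minimal sketch, assuming the split names used by the GLUE `mnli` configuration in 🤗 Datasets (`validation_matched` and `validation_mismatched`); the helper name `glue_config_and_validation_split` is hypothetical and not part of either library:

```python
def glue_config_and_validation_split(task):
    """Map a notebook task name to (GLUE config name, validation split name)."""
    if task == "mnli-mm":
        # Same training data as "mnli", evaluated on out-of-domain data.
        return "mnli", "validation_mismatched"
    if task == "mnli":
        return "mnli", "validation_matched"
    return task, "validation"


for name in ("sst2", "mnli", "mnli-mm"):
    print(name, glue_config_and_validation_split(name))
```

The first element is what would be passed as the second argument of `load_dataset("glue", ...)`, and the second element is the key to look up in the resulting `DatasetDict`.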

In [7]:
# dataset = datasets.load_dataset("imdb", split="test")
dataset = load_dataset("imdb")

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


  0%|          | 0/3 [00:00<?, ?it/s]

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation and test sets (with more keys for the mismatched validation and test sets in the special case of `mnli`).

In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

To access an actual element, you need to select a split first, then give an index:

In [9]:
dataset["train"][0]

{'label': 0,
 'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are f

In [10]:
import numpy as np
num_labels = len(np.unique(dataset["train"]["label"]))  # number of classes
print(np.unique(dataset["train"]["label"]))
num_labels

[0 1]


2

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [11]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [12]:
show_random_elements(dataset["train"])

Unnamed: 0,text,label
0,"Rob Roy is and underrated epic of passion and action!SOME MILD SPOILERS WITHIN. Liam Neeson gives a towering performance as Rob Roy MacGregor,one of the best in his career.Jessica Lange is letter-perfect as his wife Mary.They have the most passion and chemistry I've seen in a screen couple.John Hurt gives his best snotty aristocrat performance.Tim Roth portrays one of the great screen villains.His rape of Mary is repugnant and harrowing.He really is a magnificent bastard in this movie.The final duel between Rob and Cunningham is one of the best swordfights ever.Well scripted ans scored,and Michael Caton-Jones direction is flawless. 10 out of 10.",pos
1,"Like last year, I didn't manage to sit through the whole thing. Okay, so Chris Rock as a host was a good choice because he was vaguely engaging. Or rather, out of all the total bores packed into the theatre, he at least wasn't in the Top 10 Most Boring. A lot of the presenters, on the other hand, were in this coveted Top 10. I hadn't known that the whole thing had been done by autocue (although I knew it was scripted) but it was really terrible to see these supposedly good actors unable to insert expression, look away from the cue and stumble over simple words (Natalie Portmanif there's no director, she's gone). The Night of Fancy Dresses and Boring Speeches was long and tedious, Beyonce Knowles butchered some good songs and there were very few decent acceptance speeches and clips. Adam Sandler wins the Worst Presenter award.<br /><br />For helping me write this review I'd like to thank my Mum, my Dad, my lawyers and my pedicurist for all believing in me, and I'd like to point out that I have a high metabolism and of course I haven't been starving myself for a month. I'm not going to cry...thank you.",neg
2,"The quote I used for my summary occurs about halfway through THE GOOD EARTH, as a captain of a Chinese revolutionary army (played by Philip Ahn) apologizes to a mob for not having time to shoot MORE of the looters among them, as his unit has just been called back to the front lines. Of course, the next looter about to be found out and shot is the main character of the film, the former kitchen slave girl O-Lan (for whose portrayal Luise Rainer, now 99-years-old, won her second consecutive best actress Oscar).<br /><br />The next scene finds O-Lan dutifully delivering her bag of looted jewels to her under-appreciative husband, farmer Wang Lung (Paul Muni), setting in motion that classic dichotomy of a man's upward financial mobility being the direct inverse of his moral decline.<br /><br />For a movie dealing with subject matter including slavery, false accusations, misogyny, starvation, home invasion, eating family pets, mental retardation, infanticide, exploited refugees, riots, civil war, summary mass street executions, bigamy, child-beating, adultery, incest, and insect plagues of biblical proportions, THE GOOD EARTH is a surprisingly heart-warming movie.<br /><br />My parting thought is in the form of another classic quote, from O-Lan herself (while putting the precious soup bone her son has just admitted stealing from an old woman back into the cooking pot after husband Wang Lung had angrily tossed it to the dirt floor on the other side of their hut): ""Meat is meat.""",pos
3,"A still famous but decadent actor (Morgan Freeman) has not filmed for four years. When he is invited to participate in a new project, he asks the clumsy cousin of the director to drop him in a poor Latin neighborhood in Carlson to research the work of the manager of a small supermarket. He sees the gorgeous Spanish cashier Scarlet (Paz Vega) and he becomes attracted with her ability. His driver never returns to catch him and Scarlet gives a ride to the actor. But first she has a job interview for the position of secretary in a construction company and the actor helps her to be prepared; then they spend the afternoon together having a pleasant time.<br /><br />I am a big fan of Morgan Freeman and Paz Vega. However, the pointless ""10 Items or Less"" is absolutely disappointing. This low-budget movie does not seem to have a storyline, and is supported by the chemistry and improvisations of Morgan Freeman and Paz Vega and actually nothing happens along 82 minutes. The ambiguous open conclusion is simply ridiculous, with the character of Morgan Freeman returning to his silver spoon world and telling the simple worker that they would never see each other again. Was he afraid to have a love affair with her and destroy his perfect world with his family? Or was a clash of classes, and he realizes that his fancy neighborhood would not be adequate to a simple worker from the lower classes? My vote is four.<br /><br />Title (Brazil): ""Um Astro em Minha Vida"" (""A Star in My Life"")",neg
4,"Just saw this tonight at a seminar on digital projection (shot on 35mm, and first feature film fully scanned in 6k mastered in 4k, and projected with 2k projector at ETC/USC theater in Hwd)..so much for tech stuff. 18 directors (including Alexander Payne, Wes Cravens, Joel and Ethan Coen, Gus Van Sant, Walter Salles and Gerard Depardieu, among several good French/ international directors) were each given 5 minutes to make a love story. They come in all shapes and forms, with known actors(Elijah Wood, Natalie Portman, Steve Buscemi ..totally hilarious..., Maggie Glyllenhall, Nick Nolte, Geena Rowlands ..soo good..and she actually wrote the piece she was in, Msr Depardieu and many good international actors as well. The stories vary from all out romance to quirky comedy to Alex Payne's touching study of a woman discovering herself to Van Sant and one of those things that happens anywhere..maybe? Nothing really off putting by having French spoken in most sequences (with English subtitles) and a small amount of actual English spoken, though that will probably relegate it to art houses (a la Diva.) Also only one piece that might be considered ""experimental"" but colorful and funny as well, the rest simple studies of sometimes complex relationships. All easy to follow (unless the ""experimental"" one irritates your desire for a formulaic story. Several brought up some emotions for me...I admit I am affected by love in cinema...when it is presented in something other than sentimentality. I even laughed at a mime piece, like no other I have seen (thank you for that!) The film hit its peak, for me, somewhere around a little more than half way through, then the last two sequences picked up again. 
Some beautiful shots of Paris at night, lush romantic kind of music, usually used to good effect, not just schmaltz for ""emotions"" in sound, generally good cinematography, though some shots seemed soft focus when it couldn't have meant to have been (main character in shot/scene). Pacing of each film was good, and overall structure, though a bit long (they left out two of what was to be 20 films, but said all would be on the DVD) seemed to vary between tones of the films to keep a good balance. Not sure when it comes out, but a good study of how to make a 5 min film work..and sometimes, what doesn't work (if it covers too much time, emotionally, for a short film.) Should be in region one when released, but they didn't know when.",pos
5,"My comments on this movie have been deleted twice, which i find pretty offending, since i am making an effort to judge this movie for other people. Please be tolerant of other people's opinion. Obviously writing in the spirit of Nietzsches works is not understood, so ill change my comment completely.<br /><br />I think this is a really bad movie for several reasons.<br /><br />Subject: one should be very careful in making a movie about a philosopher that is even today not understood by the masses and amongst peers brings out passionate discussions. One thing philosophers do agree on is that Nietzsche was a great thinker. So making a movie about his life, which obviously includes his 'ideas' is a thing one should be extremely careful with, or preferably, don't do at all. Wisdom starts with knowing what you don't know. One might think this is not a review of the movie itself, but the movie is not about an imaginary character, it is about the life of someone who actually lived and had/has great influence on the world of yesterday, today and tomorrow. If someone tells a story about a tomato, i can express my thoughts about the story itself, but also about the chosen subject, the tomato. There is a responsibility for producers when they make a movie about actual facts. Specially in a case like this and this responsibility was not taken.<br /><br />Screenplay: One of the first things i noticed were the ridiculous accents. Why? It distracts from what it should be about; Nietzsche and the truths he found. It doesn't help putting things in a right geographical perspective or time! Come on, make it proper English or better yet; German! Even Mel Gibson got that part right... letting his characters speak some gibberish Aramaic in the Passion.<br /><br />Secondly, it is well over-acted.<br /><br />3d, Assante is not an actor to depict Nietzsche. Bad casting.<br /><br />4th, facts are way off.<br /><br />And so on. Its a waste of celluloid.",neg
6,"I think its time for Seagal to go quietly into the night. What I have just seen makes all his direct to video releases in the last few years look like his early 90's smash hits in comparison.<br /><br />A secret bio lab is making a new kind of drug that jacks up a human's adrenaline system to the point where they become psychopathic killers or something. Somehow Seagal is supposed to stop the infection or its the end of the world...or something. Seagal also went through hit squads like jellybeans, every time I look up he was commanding a new face so it kinda got hard to follow character development as well I know Steven's athsma prevent him from yelling at the top of his lungs but even so why is he constantly being dubbed by people who sound nothing like him? Usually the films plot and action sequences can save it from being a total waste of time but this was not even close. Like I said, it was more of a horror movie with a lot of blood and shank stabbing rather than straight up fighting. The problem was it wasn't really scary and Seagal looked completely out of place because the infected people were supposed to have speed of light movement yet the 40 year old 280 lb Seagal killed them all singlehandedly? I guess the lone highlight of the movie was the first 20 minutes where the new recruits ask Seagal to come to the strip club with them.<br /><br />2 out of 10",neg
7,"Sequels have a nasty habit of being disappointing, and the best credit I can give this is that it maintains that old tradition. These three tales aren't anything as good as any from the original Creepshow.<br /><br />By far the best of the trio involves a wooden idol which comes to life to take revenge on the thugs who killed its owners. The second story is about a lake monster which seems to be nothing more than a lot of floating slop, makes you wonder how anybody could possibly be scared of it. The third story includes a cameo from Stephen King as a truck driver, but other than that is a pretty unmemorable tale concerning the victim of a road traffic accident who comes back from the dead for the person who knocked him down.<br /><br />Watch the original Creepshow instead, or if you already have done then be happy with that.",neg
8,"I caught this movie on my local movie channel, and i rather enjoyed watching the film. It has all the elements of a good teen film, and more - this film, aside from dealing with boys-girls relationships and sex and the like, also deals with the issue of steroid use by young people.<br /><br />The film has that real-life feel to it - no loud music, no special effects and no outrageous scenes - which, for this movie, was right. That feel makes it easy to relate to the characters in the film - some of which we probably know from where we live.<br /><br />Overall, a good movie, fun to watch.<br /><br />8/10",pos
9,"I was expecting a documentary that focused on the tobacco industry in North Carolina. Instead I watched a man who rues the fact that his great grandfather lost his tobacco empire to the Duke family. And this went on and on. If Mr. McElwee's family had prevailed over the Dukes I doubt that Mr. McElwee would have any problems with the death toll caused by tobacco-related diseases. I grew up near the area where Mr. McElwee's family began it tobacco business ; I expected more than McEwee's continual focus on his family. I learned very little about the history of tobacco in the NC economy and the ramifications to the state's economy by tighter regulation of tobacco. The countless references to the movie ""Bright Leaves"" are out of place - So what if Gary Cooper played Mr. McElwee's great grandfather? Does the viewer gain any understanding of the role of tobacco in the North Carolina economy by the showing of old film clips of a fictionalized film? I didn't.",neg


In [13]:
dataset["train"] = dataset["train"].filter(lambda example, idx: idx % 10 == 0, with_indices=True)
# Keep only the examples whose index is divisible by 10 (every 10th example)

len(dataset["train"])

  0%|          | 0/25 [00:00<?, ?ba/s]

2500

In [14]:
dataset["validation"] = dataset["test"].filter(lambda example, idx: idx % 2 == 0, with_indices=True).filter(lambda example, idx: idx % 20 == 0, with_indices=True)
dataset["test"] = dataset["test"].filter(lambda example, idx: idx % 2 != 0, with_indices=True).filter(lambda example, idx: idx % 20 == 0, with_indices=True)

len(dataset["validation"]), len(dataset["test"])

Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-cf0a4bebe61b82c7.arrow


  0%|          | 0/13 [00:00<?, ?ba/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-2d010b75dc07ab97.arrow


  0%|          | 0/13 [00:00<?, ?ba/s]

(625, 625)
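
The two chained `filter` calls above rely on the indices being renumbered after the first filter, which is consistent with the `(625, 625)` output. The following plain-Python sketch mimics that behaviour on a list of row numbers to show which original rows end up in each split. The `keep` helper is hypothetical and only stands in for `Dataset.filter(..., with_indices=True)`:

```python
def keep(items, predicate):
    """Keep items whose position in the *current* list satisfies the predicate."""
    return [item for idx, item in enumerate(items) if predicate(idx)]


original_rows = list(range(25000))  # row numbers of the original test split

# Even rows, then every 20th of those: original rows 0, 40, 80, ...
validation_rows = keep(keep(original_rows, lambda i: i % 2 == 0), lambda i: i % 20 == 0)
# Odd rows, then every 20th of those: original rows 1, 41, 81, ...
test_rows = keep(keep(original_rows, lambda i: i % 2 != 0), lambda i: i % 20 == 0)

print(len(validation_rows), len(test_rows))        # 625 625
print(validation_rows[:3], test_rows[:3])          # [0, 40, 80] [1, 41, 81]
print(set(validation_rows).isdisjoint(test_rows))  # True
```

The two splits therefore never share a review, since one is drawn from the even rows and the other from the odd rows.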

In [15]:
del dataset["unsupervised"]
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2500
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 625
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 625
    })
})

In [None]:
dataset["train"].features

## Preparing the metrics

In [None]:
# Partly adapted from UTU Textual Data Analysis course material:

def compute_metrics(pred):
    # pred.predictions holds one score per class for each example;
    # argmax selects the highest-scoring class
    y_pred = pred.predictions.argmax(axis=1)
    y_true = pred.label_ids
    # Confusion-matrix counts, treating label 1 ('pos') as the positive class
    TP = sum(1 for a, b in zip(y_pred, y_true) if a == 1 and b == 1)
    TN = sum(1 for a, b in zip(y_pred, y_true) if a == 0 and b == 0)
    FN = sum(1 for a, b in zip(y_pred, y_true) if a == 0 and b == 1)
    FP = sum(1 for a, b in zip(y_pred, y_true) if a == 1 and b == 0)

    ACC = (TP + TN) / (TP + FP + FN + TN)  # Overall accuracy
    PRE = TP / (TP + FP) if TP + FP else 0.0  # Precision: share of predicted positives that are correct
    REC = TP / (TP + FN) if TP + FN else 0.0  # Recall: share of actual positives that were found
    F1 = 2 * PRE * REC / (PRE + REC) if PRE + REC else 0.0  # Harmonic mean of precision and recall
    return {
        "accuracy": ACC,
        "precision": PRE,
        "recall": REC,
        "F1-score": F1,
    }
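
To sanity-check the arithmetic in `compute_metrics`, here is a small worked example. It is a sketch under stated assumptions: `SimpleNamespace` stands in for the prediction object that is passed to `compute_metrics` (only its `predictions` and `label_ids` attributes are used), the scores are made up, and `confusion_counts` is a hypothetical helper that repeats the counting logic with NumPy boolean masks. The guards against division by zero are a defensive choice of this sketch.

```python
import numpy as np
from types import SimpleNamespace


def confusion_counts(y_pred, y_true):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp, tn, fp, fn


# Made-up scores for four examples (columns: class 0, class 1).
pred = SimpleNamespace(
    predictions=np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 2.0], [4.0, 0.0]]),
    label_ids=np.array([0, 1, 0, 1]),
)
y_pred = pred.predictions.argmax(axis=1)  # predicted classes: [0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_pred, pred.label_ids)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(tp, tn, fp, fn)                   # 1 1 1 1
print(accuracy, precision, recall, f1)  # 0.5 0.5 0.5 0.5
```

With one example in each cell of the confusion matrix, all four metrics come out at 0.5, which makes the example easy to verify by hand.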

The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [16]:
metric
# TODO: F1-score

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

You can call its `compute` method with your predictions and labels directly and it will return a dictionary with the metric(s) value:

In [17]:
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'accuracy': 0.546875}

Note that `load_metric` has loaded the proper metric associated with your task, which is:

- for CoLA: [Matthews Correlation Coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient)
- for MNLI (matched or mismatched): Accuracy
- for MRPC: Accuracy and [F1 score](https://en.wikipedia.org/wiki/F1_score)
- for QNLI: Accuracy
- for QQP: Accuracy and [F1 score](https://en.wikipedia.org/wiki/F1_score)
- for RTE: Accuracy
- for SST-2: Accuracy
- for STS-B: [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and [Spearman's Rank Correlation Coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
- for WNLI: Accuracy

so the metric object only computes the one(s) needed for your task.
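
The list above can be written down as a lookup table, which is handy when choosing a single metric to monitor during training. This is only a sketch: the key names are the ones listed in the metric's usage text (`accuracy`, `f1`, `pearson`, `spearmanr`, `matthews_correlation`), while `GLUE_TASK_METRICS` and `main_metric` are hypothetical names, and treating the first entry as the "main" metric is an assumption of this example.

```python
GLUE_TASK_METRICS = {
    "cola": ["matthews_correlation"],
    "mnli": ["accuracy"],
    "mnli-mm": ["accuracy"],
    "mrpc": ["accuracy", "f1"],
    "qnli": ["accuracy"],
    "qqp": ["accuracy", "f1"],
    "rte": ["accuracy"],
    "sst2": ["accuracy"],
    "stsb": ["pearson", "spearmanr"],
    "wnli": ["accuracy"],
}


def main_metric(task):
    """Return the first metric listed for a task (an arbitrary but simple choice)."""
    return GLUE_TASK_METRICS[task][0]


for name in ("sst2", "cola", "stsb"):
    print(name, main_metric(name))
```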

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [18]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

You can directly call this tokenizer on one sentence or a pair of sentences:

In [19]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 8667, 117, 1142, 1141, 5650, 106, 102, 1262, 1142, 5650, 2947, 1114, 1122, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.
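
For the BERT checkpoint used here, the three keys can be read off the output above: `101` and `102` are the IDs of the `[CLS]` and `[SEP]` special tokens, `token_type_ids` marks which of the two sentences each token belongs to, and `attention_mask` is all ones because nothing was padded. The sketch below rebuilds `token_type_ids` from `input_ids` in plain Python. It illustrates the layout only and is not how the tokenizer computes it; `segment_ids` is a hypothetical helper name.

```python
SEP_ID = 102  # ID of [SEP] in the output above (bert-base-cased)


def segment_ids(input_ids, sep_id=SEP_ID):
    """Tokens up to and including the first [SEP] get 0, the rest get 1."""
    first_sep = input_ids.index(sep_id)
    return [0 if position <= first_sep else 1 for position in range(len(input_ids))]


# input_ids copied from the tokenizer output above
input_ids = [101, 8667, 117, 1142, 1141, 5650, 106, 102,
             1262, 1142, 5650, 2947, 1114, 1122, 119, 102]
print(segment_ids(input_ids))
```

The result matches the `token_type_ids` printed by the tokenizer: eight zeros for the first sentence (including `[CLS]` and its `[SEP]`) followed by eight ones for the second.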

To preprocess our dataset, we will thus need the names of the columns containing the sentence(s). The following dictionary maps each task to its column names:

In [20]:
# task_to_keys = {
#     "cola": ("sentence", None),
#     "mnli": ("premise", "hypothesis"),
#     "mnli-mm": ("premise", "hypothesis"),
#     "mrpc": ("sentence1", "sentence2"),
#     "qnli": ("question", "sentence"),
#     "qqp": ("question1", "question2"),
#     "rte": ("sentence1", "sentence2"),
#     "sst2": ("sentence", None),
#     "stsb": ("sentence1", "sentence2"),
#     "wnli": ("sentence1", "sentence2"),
# }

We can double check it does work on our current dataset:

In [21]:
# sentence1_key, sentence2_key = task_to_keys[task]
# if sentence2_key is None:
#     print(f"Sentence: {dataset['train'][0][sentence1_key]}")
# else:
#     print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
#     print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

In [22]:
print(f"Sentence: {dataset['train'][0]['text']}")

Sentence: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`, which ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. Passing `padding='longest'` as well would pad all inputs to the maximum input length and give us a single input array; the function below leaves padding out. A more performant method that reduces the number of padding tokens is to pad each *batch* only to the maximum length in that batch (for example with a generator or a `tf.data.Dataset`), but most GLUE tasks are relatively quick on modern GPUs either way.

In [23]:
# def preprocess_function(examples):
#     if sentence2_key is None:
#         return tokenizer(examples[sentence1_key], truncation=True)
#     return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [24]:
preprocess_function(dataset["train"][:5])

{'input_ids': [[101, 146, 12765, 146, 6586, 140, 19556, 19368, 13329, 118, 162, 21678, 2162, 17056, 1121, 1139, 1888, 2984, 1272, 1104, 1155, 1103, 6392, 1115, 4405, 1122, 1165, 1122, 1108, 1148, 1308, 1107, 2573, 119, 146, 1145, 1767, 1115, 1120, 1148, 1122, 1108, 7842, 1118, 158, 119, 156, 119, 10148, 1191, 1122, 1518, 1793, 1106, 3873, 1142, 1583, 117, 3335, 1217, 170, 5442, 1104, 2441, 1737, 107, 6241, 107, 146, 1541, 1125, 1106, 1267, 1142, 1111, 1991, 119, 133, 9304, 120, 135, 133, 9304, 120, 135, 1109, 4928, 1110, 8663, 1213, 170, 1685, 3619, 3362, 2377, 1417, 14960, 1150, 3349, 1106, 3858, 1917, 1131, 1169, 1164, 1297, 119, 1130, 2440, 1131, 3349, 1106, 2817, 1123, 2209, 1116, 1106, 1543, 1199, 3271, 1104, 4148, 1113, 1184, 1103, 1903, 156, 11547, 1162, 1354, 1164, 2218, 1741, 2492, 1216, 1112, 1103, 4357, 1414, 1105, 1886, 2492, 1107, 1103, 1244, 1311, 119, 1130, 1206, 4107, 8673, 1105, 6655, 10552, 3708, 2316, 1104, 8583, 1164, 1147, 11089, 1113, 4039, 117, 1131, 1144, 2673, 
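
Because `preprocess_function` does not pad, the lists it returns can have different lengths and still need to be padded before they can be stacked into a batch. The per-batch padding idea can be sketched in plain Python as follows. This is an illustration only: `pad_batch` is a hypothetical helper, and `0` is assumed as the padding ID (it is the `[PAD]` ID in BERT vocabularies, but other tokenizers may differ). In practice, 🤗 Transformers provides `DataCollatorWithPadding` for this purpose.

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Pad every sequence in one batch to the length of the longest one in that batch."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_input_ids]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 8667, 102], [101, 8667, 117, 1142, 102]])
print(batch["input_ids"])       # [[101, 8667, 102, 0, 0], [101, 8667, 117, 1142, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch rather than to a global maximum means that a batch of short reviews carries few padding tokens, which is where the speed-up comes from.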

To apply this function on all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of our `dataset` object we created earlier. This will apply the function on all the elements of all the splits in `dataset`, so our training, validation and testing data will be preprocessed in one single command.

In [25]:
pre_tokenizer_columns = set(dataset["train"].features)
encoded_dataset = dataset.map(preprocess_function, batched=True)
tokenizer_columns = list(set(encoded_dataset["train"].features) - pre_tokenizer_columns)
print("Columns added by tokenizer:", tokenizer_columns)

Columns added by tokenizer: ['input_ids', 'token_type_ids', 'attention_mask']


In [26]:
encoded_dataset["train"].features["label"]

ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)

In [27]:
encoded_dataset["train"].shape

(2500, 5)

In [28]:
for k, v in encoded_dataset["train"][0].items():
  print(k, v)

attention_mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and the cached data therefore must not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
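To make the batched behaviour concrete, here is a minimal sketch, with no 🤗 library involved, of what a batched preprocessing function receives (a dictionary of lists) and what it returns (new columns, one entry per input row). The whitespace `toy_tokenize` function and the `toy_map` helper are hypothetical stand-ins for illustration only; they are not the real tokenizer or the real `map` implementation.

```python
def toy_tokenize(texts):
    """Hypothetical stand-in for a tokenizer: one list of tokens per input text."""
    return {"tokens": [text.split() for text in texts]}


def toy_map(columns, function, batch_size=2):
    """Hypothetical stand-in for a batched map over a column-oriented dataset."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # Each call sees a dict of lists holding up to `batch_size` rows.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept and the new ones are added alongside them.
    return {**columns, **new_columns}


toy_data = {"text": ["a great film", "not good", "fine"], "label": [1, 0, 1]}
toy_encoded = toy_map(toy_data, lambda batch: toy_tokenize(batch["text"]))
print(toy_encoded["tokens"])
```

The function is called twice here (rows 0 to 1, then row 2), and the per-batch results are concatenated back into full-length columns.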

Finally, we convert our datasets to `tf.data.Dataset`. There's a built-in method for this, so all you need to do is specify the columns you want (both for the inputs and the labels), whether the data should be shuffled, the batch size, and an optional collation function, that controls how a batch of samples is combined.

We'll need to supply a `DataCollator` for this. The `DataCollator` handles grouping each batch of samples together, and different tasks will require different data collators. In this case, we will use the `DataCollatorWithPadding`, because our samples need to be padded to the same length to form a batch. Remember to supply the `return_tensors` argument too - our data collators can handle multiple frameworks, so you need to be clear that you want TensorFlow tensors back.

In [29]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["labels"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
tf_validation_dataset = encoded_dataset["validation"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["labels"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)
tf_test_dataset = encoded_dataset["test"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["labels"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)
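As a rough illustration of why per-batch padding helps, the sketch below compares padding every batch to the longest sample in the whole dataset with padding each batch only to its own longest sample. It is plain Python and does not call the data collator; the token ids, the batch size of 2, and the padding id of 0 are assumptions made for the example.

```python
def pad_batch(sequences, pad_id=0, length=None):
    """Right-pad each sequence to `length` (default: the longest one in this batch)."""
    target = length if length is not None else max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (target - len(seq)) for seq in sequences]


def num_pad_tokens(padded_batches, pad_id=0):
    """Count padding tokens across all batches."""
    return sum(token == pad_id for batch in padded_batches for seq in batch for token in seq)


# Made-up token id sequences of lengths 3, 6, 2 and 4 (none of them contains the pad id).
toy_samples = [[101, 7, 102], [101, 7, 8, 9, 10, 102], [101, 102], [101, 5, 6, 102]]
toy_batch_size = 2
toy_batches = [toy_samples[i:i + toy_batch_size] for i in range(0, len(toy_samples), toy_batch_size)]

longest_overall = max(len(seq) for seq in toy_samples)
global_padded = [pad_batch(batch, length=longest_overall) for batch in toy_batches]
dynamic_padded = [pad_batch(batch) for batch in toy_batches]

print("padding tokens, global: ", num_pad_tokens(global_padded))
print("padding tokens, dynamic:", num_pad_tokens(dynamic_padded))
```

With these toy inputs the second batch is padded to length 4 instead of 6, so fewer padding tokens are fed to the model.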

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since all our tasks are about sentence classification, we use the `TFAutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem. For the GLUE tasks this is always 2, except for STS-B, which is a regression problem, and MNLI, which has 3 labels; the dataset used here has two labels (`neg` and `pos`), so we set `num_labels = 2`. We also need the appropriate loss function: `SparseCategoricalCrossentropy` for every task except STS-B, which as a regression problem requires `MeanSquaredError`.

Note that all models in `transformers` compute loss internally too, and you can train on this loss value. This can be very helpful when the loss is not easy to specify yourself. To use this, pass the labels as a `labels` key in the input dictionary, and then compile the model without specifying a loss. You can see examples of this approach in several of the other TensorFlow notebooks.
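For readers who want to see what a sparse categorical cross-entropy on raw logits computes, here is a minimal NumPy sketch. It is not the Keras implementation: the function name and the example logits are made up, and averaging over the batch is an assumption about the reduction being applied.

```python
import numpy as np


def sparse_cross_entropy_from_logits(logits, labels):
    """Mean negative log-probability of the true class, computed from raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row-wise maximum for numerical stability (log-softmax is unchanged).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the integer ("sparse") label for each example.
    per_example = -log_probs[np.arange(len(labels)), labels]
    return per_example.mean()


toy_logits = np.array([[2.0, 0.0], [0.0, 0.0]])  # made-up model outputs for 2 examples
toy_labels = np.array([0, 1])                    # integer class ids, not one-hot vectors
toy_loss = sparse_cross_entropy_from_logits(toy_logits, toy_labels)
print(toy_loss)
```

The first example is classified confidently and correctly, so it contributes a small loss; the second has equal logits, so it contributes log(2).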

In [30]:
from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf

# num_labels = 3 if task.startswith("mnli") else 1 if task == "stsb" else 2
# if task == "stsb":
#     loss = tf.keras.losses.MeanSquaredError()
#     num_labels = 1
# elif task.startswith("mnli"):
#     loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
#     num_labels = 3
# else:
#     loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
#     num_labels = 2

# from_logits=True: the model outputs raw logits rather than probabilities.
# By default, Keras assumes y_pred encodes a probability distribution (from_logits=False).
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
num_labels = 2

model = TFAutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=num_labels
)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The first message tells us that every weight in the checkpoint was loaded; the warning that follows tells us that one layer, the `classifier` head, was not found in the checkpoint and has been randomly initialized. This is absolutely normal in this case: `bert-base-cased` was pretrained without a sequence classification head, so the model gets a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

In [31]:
from transformers import create_optimizer

num_epochs = 2 #5
batches_per_epoch = len(encoded_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)

optimizer, schedule = create_optimizer(
    init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps
)
model.compile(optimizer=optimizer, loss=loss)

The `create_optimizer` function in the Transformers library creates a very useful `AdamW` optimizer with weight and learning rate decay. This performs very well for training most transformer networks - we recommend using it as your default unless you have a good reason not to! Note, however, that because it decays the learning rate over the course of training, it needs to know how many batches it will see during training.
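The reason the schedule needs the total number of steps can be seen in a small sketch. It assumes a plain linear decay from the initial learning rate to zero with no warmup, which is our reading of the settings above rather than a copy of the library's code. The batch size of 16 is an assumed value (the notebook defines its own `batch_size`), and 2500 is the size of the training split shown earlier.

```python
def linear_decay_lr(step, init_lr, total_steps):
    """Learning rate after `step` updates under a linear decay from init_lr to 0."""
    progress = min(step, total_steps) / total_steps
    return init_lr * (1.0 - progress)


example_num_rows = 2500    # rows in the training split
example_batch_size = 16    # assumption for illustration only
example_epochs = 2
example_total_steps = (example_num_rows // example_batch_size) * example_epochs

for step in (0, example_total_steps // 2, example_total_steps):
    print(step, linear_decay_lr(step, init_lr=2e-5, total_steps=example_total_steps))
```

If training ran for more steps than the schedule was built for, the learning rate would already be at zero for the extra steps, which is why the epoch count passed to `fit` should match the one used here.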

In [32]:
metric_name = "accuracy"

With Keras, evaluation on the validation data runs at the end of each epoch. The learning rate and the number of epochs were set when we created the optimizer above, and the `batch_size` defined at the top of the notebook was applied when we built the `tf.data.Dataset` objects. The best model might not be the one at the end of training, but this notebook does not yet save or restore the best checkpoint according to `metric_name`.

Pushing the model to the [Hub](https://huggingface.co/models) at the end of training is optional and is disabled in this notebook (the callback in the training cell is commented out). If you enable it, you can change the value of `push_to_hub_model_id` to something you would prefer.

The last thing to define is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits (or just squeeze the last axis in the case of STS-B):

In [33]:
# def compute_metrics(predictions, labels):
#     if task != "stsb":
#         predictions = np.argmax(predictions, axis=1)
#     else:
#         predictions = predictions[:, 0]
#     return metric.compute(predictions=predictions, references=labels)

def compute_metrics(predictions, labels):
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
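As a small worked example of the argmax step, the sketch below turns made-up logits into class predictions and computes the accuracy by hand with NumPy. It assumes the metric is plain accuracy, as `metric_name` suggests, and does not call the `metric` object itself.

```python
import numpy as np

# Made-up logits for 4 examples and 2 classes.
example_logits = np.array([[2.1, -1.0], [0.3, 0.9], [-0.5, 0.2], [1.5, 1.4]])
example_labels = np.array([0, 1, 0, 0])

# The predicted class is the index of the largest logit in each row.
predicted = np.argmax(example_logits, axis=1)

# Accuracy is the fraction of predictions that match the reference labels.
accuracy = float((predicted == example_labels).mean())
print(predicted, accuracy)
```

No softmax is needed before the argmax, because softmax preserves the ordering of the logits.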

We can now finetune our model by just calling the `fit` method. Be sure to pass the TF datasets, and not the original datasets! We can also add a callback to sync up our model with the Hub - this allows us to resume training from other machines and even test the model's inference quality midway through training! Make sure to change the `username` if you do. If you don't want to do this, simply remove the callbacks argument in the call to `fit()`.

In [34]:
# from transformers.keras_callbacks import PushToHubCallback

# model_name = model_checkpoint.split("/")[-1]
# push_to_hub_model_id = f"{model_name}-finetuned-{task}"

# callback = PushToHubCallback(
#     output_dir="./tc_model_save",
#     tokenizer=tokenizer,
#     hub_model_id=push_to_hub_model_id,
# )

# TODO: early stopping, save best

model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs,  # same value that was used to size the learning-rate schedule
    # callbacks=[callback],
)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f3bfe7f9590>

We can add Keras metrics during compilation above if we want to get live readouts during training, or we can use the `compute_metrics` function after training to compute the metrics specified for each task.

In [38]:
predictions = model.predict(tf_test_dataset)["logits"]

In [39]:
compute_metrics(predictions, np.array(encoded_dataset["test"]["label"]))
# compute_metrics(predictions, np.array(encoded_dataset[validation_key]["label"]))

{'accuracy': 0.904}

## SHAP

## New

In [40]:
! pip --quiet install shap


In [41]:
import shap

# Pipeline documentation for `return_all_scores`:
# https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html?highlight=return_all_scores

# classifier = transformers.pipeline('sentiment-analysis', return_all_scores=True)
# classifier(short_data[:2])

classifier = transformers.pipeline(task='sentiment-analysis', model=model, tokenizer=tokenizer, return_all_scores=True)
short_data = [v[:500] for v in dataset["test"]["text"][:20]]  # first 500 characters of each of the first 20 test texts
classifier(short_data[:2])  # pipeline predictions for the first 2 samples in `short_data`
# Labels are integers for the model, hence the generic names 'LABEL_0' and 'LABEL_1'
# TODO: fix labels

[[{'label': 'LABEL_0', 'score': 0.43242624402046204},
  {'label': 'LABEL_1', 'score': 0.5675737261772156}],
 [{'label': 'LABEL_0', 'score': 0.9710727334022522},
  {'label': 'LABEL_1', 'score': 0.028927218168973923}]]
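The two scores for each sample sum to 1, which is consistent with a softmax over the model's two logits. The sketch below shows that conversion with NumPy; the logits are made up, and treating the pipeline scores as a softmax of the logits is an assumption based on the output above, not something checked against the pipeline code.

```python
import numpy as np


def softmax(logits):
    """Convert raw logits into scores that are positive and sum to 1 per row."""
    logits = np.asarray(logits, dtype=np.float64)
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))  # shift for stability
    return exp / exp.sum(axis=-1, keepdims=True)


example_pair_logits = np.array([[-0.1, 0.2], [2.0, -1.5]])  # made-up logits for 2 texts
example_scores = softmax(example_pair_logits)

for row in example_scores:
    print([{"label": f"LABEL_{i}", "score": float(s)} for i, s in enumerate(row)])
```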

In [48]:
type(short_data)

list

In [49]:
# define the explainer
explainer = shap.Explainer(classifier)

In [50]:
# explain the predictions of the pipeline on the first two samples
shap_values = explainer(short_data[:2])

Partition explainer: 3it [06:19, 189.78s/it]
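To give some intuition for what the attributions mean, here is a brute-force computation of exact Shapley values for a tiny, hypothetical scoring function over three tokens. This is only a sketch of the underlying idea: the `toy_score` function is invented for the example, and the explainer used above relies on approximations rather than enumerating every ordering, which would be infeasible for real texts.

```python
from itertools import permutations


def toy_score(present_tokens):
    """Hypothetical sentiment score for a set of tokens (not a real model)."""
    score = 0.0
    if "great" in present_tokens:
        score += 2.0
    if "boring" in present_tokens:
        score -= 1.0
    if "not" in present_tokens and "great" in present_tokens:
        score -= 3.0  # interaction term: "not" reverses the effect of "great"
    return score


def exact_shapley(tokens, value_fn):
    """Average each token's marginal contribution over all orderings of the tokens."""
    orderings = list(permutations(tokens))
    totals = {token: 0.0 for token in tokens}
    for order in orderings:
        present = set()
        for token in order:
            before = value_fn(present)
            present.add(token)
            totals[token] += value_fn(present) - before
    return {token: total / len(orderings) for token, total in totals.items()}


toy_tokens = ["not", "great", "boring"]
attributions = exact_shapley(toy_tokens, toy_score)
print(attributions)
```

The attributions add up to the difference between the score with all tokens present and the score with none, and the interaction between "not" and "great" is split evenly between the two tokens.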


In [54]:
shap.plots.text(shap_values[:,:,"LABEL_0"])

Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
