<a href="https://colab.research.google.com/github/HannaKi/kandi/blob/master/sentiment_analysis_explainability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you're opening this notebook on Colab, you will probably need to install the most recent versions of 🤗 Transformers and 🤗 Datasets. We will also need `scipy` and `scikit-learn` for some of the metrics. Run the following cell to install them.

In [41]:
! pip --quiet install git+https://github.com/huggingface/transformers.git
! pip --quiet install git+https://github.com/huggingface/datasets.git
! pip --quiet install scipy scikit-learn

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done


Make sure your version of Transformers is at least 4.8.1 since the functionality was introduced in that version:

In [42]:
import transformers

print(transformers.__version__)

4.17.0.dev0
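
The version check above can also be automated. The following is a minimal sketch, assuming version strings of the form printed above (`4.17.0.dev0`); `parse_release` and `MIN_VERSION` are hypothetical names introduced here, and only the three leading numeric components are compared.

```python
import re

MIN_VERSION = (4, 8, 1)  # minimum version stated in the text above


def parse_release(version):
    """Return the three leading numeric components of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


# In the notebook this string would come from transformers.__version__.
installed = "4.17.0.dev0"
assert parse_release(installed) >= MIN_VERSION, "Please upgrade transformers."
print(parse_release(installed))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexicographically, where `"4.17"` would sort before `"4.8"`.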


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/text-classification).

# Fine-tuning a model on a text classification task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a text classification task of the [GLUE Benchmark](https://gluebenchmark.com/).

The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences which are:

- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.)
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs) Determine if two questions are semantically equivalent or not.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 0 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun and a sentence with this pronoun replaced are entailed or not. (This dataset is built from the Winograd Schema Challenge dataset.)

We will see how to easily load the dataset for each one of those tasks and use the `Trainer` API to fine-tune a model on it. Each task is named by its acronym, with `mnli-mm` standing for the mismatched version of MNLI (so same training set as `mnli` but different validation and test sets):

In [76]:
GLUE_TASKS = [
    "cola",
    "mnli",
    "mnli-mm",
    "mrpc",
    "qnli",
    "qqp",
    "rte",
    "sst2",
    "stsb",
    "wnli",
]

This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [77]:
task = "sst2"
# model_checkpoint = "distilbert-base-uncased"
model_checkpoint = "bert-base-cased"  # name from Hugging Face repository
batch_size = 6  # 32  # 16

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [78]:
from datasets import load_dataset, load_metric

Apart from `mnli-mm` being a special code, we can directly pass our task name to those functions. `load_dataset` will cache the dataset to avoid downloading it again the next time you run this cell.
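
The special case can be captured in a small helper. This is a sketch only: `resolve_glue_task` is a hypothetical name, and it simply maps `mnli-mm` to the `mnli` configuration as described above while rejecting names that are not in the task list.

```python
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]


def resolve_glue_task(task):
    """Map a notebook task name to the name passed to load_dataset / load_metric."""
    if task not in GLUE_TASKS:
        raise ValueError(f"Unknown GLUE task: {task!r}")
    # "mnli-mm" shares its training set with "mnli", so it loads the same data.
    return "mnli" if task == "mnli-mm" else task


print(resolve_glue_task("mnli-mm"))  # mnli
print(resolve_glue_task("sst2"))     # sst2
```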

In [79]:
# actual_task = "mnli" if task == "mnli-mm" else task
# dataset = load_dataset("glue", actual_task)
# metric = load_metric("glue", actual_task)

dataset = load_dataset("glue", task)
metric = load_metric("glue", task)

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

In [80]:
# dataset = datasets.load_dataset("imdb", split="test")
dataset = load_dataset("imdb")

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


  0%|          | 0/3 [00:00<?, ?it/s]

The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split. For GLUE tasks these are the training, validation and test sets (with more keys for the mismatched validation and test sets in the special case of `mnli`); the IMDb dataset loaded above instead provides `train`, `test` and `unsupervised` splits.

In [81]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

To access an actual element, you need to select a split first, then give an index:

In [82]:
dataset["train"][0]

{'label': 0,
 'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are f

In [83]:
import numpy as np
num_labels = len(np.unique(dataset["train"]["label"]))  # number of classes
print(np.unique(dataset["train"]["label"]))
num_labels

[0 1]


2

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [84]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
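
The rejection loop in `show_random_elements` keeps drawing indices until it has `num_examples` distinct ones. A shorter way to get the same kind of result, sketched here on a plain index range rather than a `Dataset`, is `random.sample`, which samples without replacement; `pick_unique_indices` is a hypothetical helper name.

```python
import random


def pick_unique_indices(dataset_length, num_examples=10):
    """Return num_examples distinct indices in [0, dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so no duplicate check is needed.
    return random.sample(range(dataset_length), num_examples)


picks = pick_unique_indices(25000)
print(len(picks), len(set(picks)))
```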

In [85]:
show_random_elements(dataset["train"])

Unnamed: 0,text,label
0,"Didn't know anything about the movie before watching and I think it was the ""no expectation"" factor that helped me endure at first and later like it more than I anticipated.<br /><br />The setting was interesting, strange but interesting. The storyline had gaps/jumps that I think throws the audience off a bit. There's no great soundtrack playing in the background, creating the ""romantic"" ambiance. BUT they all didn't matter.<br /><br />The chemistry between Emma and Luis was simply exquisite. There was some inexplicable strange chemistry that I couldn't resist; I fell in love with it and here I am, writing this review. The subtle love portrayal by the two actors was superb, and I believe, that is the core of this movie.<br /><br />This movie is not an everyday romantic comedy; in fact, not all of us will appreciate it. I had to sit a while and then slowly began to comprehend the little things I didn't catch at first. I cannot guarantee everyone will like it, but I hope YOU do.",pos
1,"Perhaps Disney was hoping for another Mary Poppins but this is a very different story and while Angela is delightful she was a very different performer to the great Julie Andrews. Having said that Lansbury is perfectly cast and delivers a magical performance. There is something deliciously dotty about her character and she is given wonderful support by David Tomlinson. Tomilinson can carry a tune but he is certainly not much chop as a singer. It does not matter he was such a gifted actor you hardly notice. There are some great cameos from much loved stars of another time like Roddy McDowel who gives a winning performance and the much loved Tessie OShea who does very little but its nice to see the old gal again. Its also lovely to see Sam Jaffe and the king of English television Bruce Forsythe in small roles. The score has a couple of beautiful songs especially The briny sea and The age of Not Believing. The big number Portabello Road is stretched to the limit but it has plenty of theatricality. The effects look a bit cliché today but the scene with the German invaders being attacked by the wildest army in film is pretty impressive. The kids are not as annoying as other movies but one does struggle to understand what the youngest boy is saying. I loved the marching song of the home army. The home guard were very important to Britain and this is a warm tribute. The animation is delightful, much better than Pixar which I find grotesque. A warm happy film and its a wonder its not done on stage.",pos
2,"Nick Cage is Randall Raines, a retired car thief who is forced out of retirement when he's forced to save his the life of his brother Kip (Giovanni Ribisi) when he screws up on a job, by completing his brothers job of stealing 50 cars in one night. He has to get together his old crew that he can trust to help him pull it off and get his bro out of dutch. But the cops are onto him, so can he pull it off? This was one of the great candidates of a film to re-make as the Original was far from a classic. And if you don't go into it expecting much, and turn the thinking portion of your brain off so you can ignore the plot hole ans just take the movie for what it is. You'll end up enjoying the ride. Watch it on a double-bill with ""The Fast and the Furious"" for a night of high-speed hijinks, just don't take the car out for a spin right afterwards.<br /><br />My Grade: B- <br /><br />DVD Extras: 7 minute Jerry Bruckheimer Interview; Bruckheimer Bio/Filmography; Action Overload: Highlight Reel; The Big Chase; ""0 To 60"" featurette; ""Wild Rides"" featurette; Stars On The Move; The Cult ""Painted On The Heart"" music video; Theatrical Trailer, and Trailers for ""Shanghai Noon"", ""Mission to Mars"" and ""Coyote Ugly""",pos
3,"Let me be the first non Australian to comment on this :) I got the movie for Hugo Weaving and I watched it to the end. It's one of those ""drama of life"" films, as my mother used to call a movie that depicts a real life story with no extraordinary events and that is mostly descriptive.<br /><br />I liked the light and the girls. The rest was without too much fault, but without too much merit either. I yearned for something like The Interview, or at least some matrix villain element here and there, but nothing out of the ordinary. The story does teach one about facing one's own destiny and break free from the environment others build for you, but this happens when the life giving peach factory in the area is about to close, so not much of an effort to change things is required.<br /><br />The ""smart"" American Beauty sound-alike song in the background could have been part of a larger soundtrack, but just that one playing over and over again became annoying after 100 minutes of film.<br /><br />In the end, I guess it did his job of presenting a part of Australian life, but to me it didn't seem specifically Australian (it could have been placed anywhere) and it didn't seem attractive as a story.<br /><br />I guess one must be in a certain mood to like the movie.",pos
4,"I had to write a review of this film after reading another comment saying that this is Sidney Poitier's best movie. Poitier had just returned from over a decade's break in film acting and he is clearly creaky here. 11 of his films are mentioned in Wikipedia and they don't include this. 5 of his films are on the AFI's list of top 100 inspiring movies, again, not including this. Berenger and Poitier, rube and city slicker set out to hunt down a dangerous psychopath before he crosses the border to Canada. Some of the attempts at comedy in this film clearly fail and Berenger and Poitier's bonding was cringeworthy and awkward (not helped by a completely bland script). Kirstie Alley (as the hostage) was underused, and almost entirely ignored when she was on screen. Some attempt at suspense is made, for example when you're meant to try and guess which of 5 men on a fishing trip is the murderer (all of them are type-cast villains). I understand that this is the entire appeal to most fans out there. I guessed who it was and I wasn't really trying hard.<br /><br />If you're a Berenger fan, watch the Sniper (1993), you even get to see Billy Zane strutting his stuff. It's much better. All in all I'd give Shoot to Kill 3/10. It's not daring, and it's just too straightforward for me.",neg
5,"Of course, the story line for this movie isn't the best, but the dances are wonderful. This story line is different from other Astaire-Rogers movies in that neither one is ""chasing"" the other. The dancing of Fred and Ginger is what makes this movie.",pos
6,"SPOILERS AHEAD<br /><br />This is one of the worst movies ever made - it's that simple. There is not one redeeming quality about this movie. The first 10 minutes are quite tricky - they actually lead you to believe that this film will be shocking and will have you on the edge of your seat. Instead, you will spend 83 minutes punching yourself while watching stolen and poorly made scenes run without any organization. The lake was ridiculous, looked like an aquarium, and had the same plant in different parts of the lake bed. Characters show their advanced teleportation powers, for example Alex Thomas who falls into the lake (drunk), and then ends up on his boat in an impossible position. Angie Harmon put up a pitiful performance as Kate, made worse by the space-time continuum rupturing dialog that appears to have been written at the last minute by a fifth grader. An example of this would be when she said, ""Flashlight!"" in such a stupid manner that it shows the threshold of how much a human body can cringe before it snaps in half. Finally, the editing of this movie was by far the most bizarre and horrific that I have ever seen. It was like the cameramen were a bunch of chimps who had been given camcorders by scientists. An example of this would be when we suddenly get a closeup of the headlight on Alex's car. I would bet that there was little to no time spent editing this movie. The ending was absolutely pathetic. The writers were obviously trying to create some sort of mysterious plot line that made the viewer say, ""oh yeah!"" Instead, we're left to view some dumb painting of a spider that somehow fits into the story line. Unfortunately, there is not one perspective in the millions out there that could save this movie from being a festering piece of crap.<br /><br />I give this a .5 out of 10, the .5 being from the fact that this movie was recorded on film instead of becoming a picture book.",neg
7,"Writer/Director/Co-Star Adam Jones is headed for great things. That is the thought I had after seeing his feature film ""Cross Eyed"". Rarely does an independent film leave me feeling as good as his did. Cleverly written and masterfully directed, ""Cross Eyed"" keeps you involved from beginning to end. Adam Jones may not be a well known name yet, but he will be. If this movie had one or two ""Named Actors"" it would be a Box Office sensation. I think it still has a chance to get seen by a main stream audience if just one film distributor takes the time to work this movie. Regardless of where it ends up, if you get a chance to see it you won't be disappointed.",pos
8,"The first of two films by Johnny To, this film won many awards, but none so prestigious as a Cannes Golden Palm nomination.<br /><br />The Triad elects their leader, but it is far from democratic with the behind the scenes machinations.<br /><br />Tony Leung Ka Fai (Zhou Yu's Train, Ashes of Time Redux) is Big D, who plans to take the baton no matter what it takes, even if it means a war. Well, war is not going to happen as that is bad for business. Big D will change his tune or...<br /><br />Good performances by Simon Yam, Louis Koo and Ka Tung Lam (Infernal Affairs I & III), along with Tony Leung Ka Fai.<br /><br />Whether Masons, made men in the Mafia, or members of the Wo Sing Society, the ceremonies are the same; fascinating to watch.<br /><br />To be continued...",pos
9,"Now, admittedly, I'm no ardent student of the genre. As a matter of fact, I've tended always to shy away from Westerns because, in spite of all their critical cachet as America's primal stories (or whatever), they seem to me to forever devolve into tiresome retreads of either ""shoot up the Injuns,"" ""the big gunfight,"" or ""Hey, let's form a posse!"" In other words, it always seemed to me a genre so rooted in and tied to convention, that it left precious little room for surprise or originality. (And yes, I HAVE seen at least some of the so-called ""greats"", and unapologetically lump them into this negative assessment - including Stagecoach, Rio Bravo, My Darling Clementine, and of course the infamous [but profoundly dull] Clint Eastwood-Sergio Leone teamups in the '60s.)<br /><br />But when I saw this movie on TV - as part of a commemorative Jimmy Stewart weekend upon his death - I finally GOT IT: I understood, at least in theory, what the Western mythos has to offer as a serious thematic preoccupation (aside from just action and thrills). It is the push-pull between lawlessness and order; the American West represented freedom, but also the prospect of the wild, the untamed. Respectable folk could get hurt out there. Which, of course, meant that perhaps - just perhaps - it wasn't meant for respectable folk, and that the only residents should be the amoral and the shifty, those who dispensed justice strictly from the barrel of their revolvers, and where kill or be killed would ever be the law of the land. In such an environment, of course, the true heroes are the ones who are ornery and free-spirited enough to be out there in the first place (and so reject ""society,"" at least as it manifested itself on the Eastern seaboard), and yet have enough sense of justice to believe that a society based on chaos and fear just IS NOT RIGHT. 
Catching and examining that disparity between law and disorder IN THE MAIN CHARACTER HIMSELF is, I believe (after seeing this movie), the highest and truest goal of any Western. Sadly, it is so often not the case, as the white hats are completely white, the black ones completely black (and let's not even get started talking about the Indians, ok) and there is precious little shades of gray in between.<br /><br />Not in this one. Jimmy Stewart plays a blatant fortune hunter who follows the trail of miners before him into the Alaskan wilderness to prospect for gold. He is joined in this by his lifelong buddy, played by Walter Brennan (perhaps the Western cliché character to end them all - but nevertheless enjoyable here, as always) - and no one else. Pointedly, they are out for themselves, and while Stewart displays his patented charm (come on, we could never really dislike the guy, now could we?), we are left with little doubt that his is basically a self-centered, self-interested character: none of his ""Gosh"" or ""Oh golly gee"" humanism is allowed to come through. Or, rather, it has to be EARNED, by the end of the picture, in the way I described above. He must confront the lawlessness in himself, and weigh it against the need for order and justice which are so blatantly lacking in the border town which serves as the miners' starting point on their gold dust trail. This town is ruled tightly by its wicked sheriff, Mr. Gannon, played by John McIntire in one of the best ""bad guy"" performances I've ever seen. He comes on with so much charm and humor, and has such a relaxed and interesting rapport with Stewart, that it actually takes awhile to recognize that he *is* the bad guy - so that when it finally sinks in, it does so with double force. 
Further, by establishing a type of breezy (if necessarily guarded) camaraderie between McIntire and Stewart, the film plays up the notion of how close in temperament they really are - and so how far a moral distance Stewart must walk by the end of the film.<br /><br />I won't go through all the twists and turns the plot takes - see those for yourself (as well as the rugged and gorgeous Alaskan scenery - filmed on location, mind you, not cheap painted stills that the studio made up). What's key here is how much this story focuses upon character, with great dialogue and character interaction substituting for gunplay much of the time - although the film has just enough action and adventure to prevent it from ever being static (read: ""talky""). Definitely one of the greatest performances I've seen from Stewart, showing he could play the renegade, the ""man's man"" just as convincingly as the decent and upright guy next door. If anything, in fact, his ""everyman"" qualities lend greater strength to his characterization, making him seem less mythic or overblown - -like, say, Eastwood or John Wayne - and more a three-dimensional personage. His relationship with Brennan is well-played: understated, but nevertheless touching (with a faint suggestion of George and Lenny from ""Of Mice and Men"" - an altogether different type of ""western"").<br /><br />I certainly have more Westerns to see, but this is for now my favorite, and the yardstick by which I will necessarily judge all the others. It deserves to be much better known and appreciated than it is.",pos


In [86]:
dataset["train"] = dataset["train"].filter(lambda example, idx: idx % 10 == 0, with_indices=True)
# keep only the examples whose index is divisible by 10

# len(dataset["train"])

Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-2bb51c6f00c9430b.arrow


In [87]:
dataset["validation"] = dataset["test"].filter(lambda example, idx: idx % 2 == 0, with_indices=True).filter(lambda example, idx: idx % 20 == 0, with_indices=True)
dataset["test"] = dataset["test"].filter(lambda example, idx: idx % 2 != 0, with_indices=True).filter(lambda example, idx: idx % 20 == 0, with_indices=True)

len(dataset["validation"]), len(dataset["test"])

Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-cf0a4bebe61b82c7.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-f5cee140a9f911b7.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-2d010b75dc07ab97.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-2beba353f71da13f.arrow


(625, 625)
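
The split sizes follow from the index arithmetic in the filters. As a sketch on plain index lists (no `Dataset` involved; `keep_every` is a hypothetical helper), note that each `filter` call renumbers the surviving rows from zero, so the second filter in each chain acts on positions within the already-halved split:

```python
def keep_every(items, step, offset=0):
    """Keep the items whose position p in the list satisfies p % step == offset."""
    return [item for position, item in enumerate(items) if position % step == offset]


original_train = list(range(25000))  # the IMDb train split has 25000 rows
original_test = list(range(25000))   # so does the test split

train = keep_every(original_train, 10)
validation = keep_every(keep_every(original_test, 2, offset=0), 20)
test = keep_every(keep_every(original_test, 2, offset=1), 20)

print(len(train), len(validation), len(test))  # 2500 625 625
print(validation[:3], test[:3])                # [0, 40, 80] [1, 41, 81]

# The two held-out subsets never share an original row.
assert set(validation).isdisjoint(test)
```

Because the validation subset is drawn from even original indices and the test subset from odd ones, the two stay disjoint even though both come from the original `test` split.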

In [88]:
del dataset["unsupervised"]

In [89]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2500
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 625
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 625
    })
})

In [90]:
len(dataset["train"])

2500

The metric is an instance of [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric):

In [91]:
metric

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

You can call its `compute` method with your predictions and labels directly and it will return a dictionary with the metric(s) value:

In [92]:
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'accuracy': 0.46875}
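
For a binary task such as this one, the `accuracy` value is the fraction of predictions that equal their reference label, which is why random predictions land near 0.5. A minimal sketch in plain Python (the `accuracy` helper is hypothetical and is not the library's implementation):

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)


print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```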

Note that `load_metric` has loaded the proper metric associated with your task, which is:

- for CoLA: [Matthews Correlation Coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient)
- for MNLI (matched or mismatched): Accuracy
- for MRPC: Accuracy and [F1 score](https://en.wikipedia.org/wiki/F1_score)
- for QNLI: Accuracy
- for QQP: Accuracy and [F1 score](https://en.wikipedia.org/wiki/F1_score)
- for RTE: Accuracy
- for SST-2: Accuracy
- for STS-B: [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and [Spearman's Rank Correlation Coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)
- for WNLI: Accuracy

so the metric object only computes the one(s) needed for your task.
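
The list above can be written down as a lookup table, which is handy when deciding which key of the returned dictionary to monitor. This is a sketch: `TASK_TO_METRICS` and `primary_metric` are hypothetical names, the key strings are the ones listed in the metric's usage text shown earlier, and treating the first listed metric as the one to monitor is an assumption of this example.

```python
TASK_TO_METRICS = {
    "cola": ["matthews_correlation"],
    "mnli": ["accuracy"],
    "mnli-mm": ["accuracy"],
    "mrpc": ["accuracy", "f1"],
    "qnli": ["accuracy"],
    "qqp": ["accuracy", "f1"],
    "rte": ["accuracy"],
    "sst2": ["accuracy"],
    "stsb": ["pearson", "spearmanr"],
    "wnli": ["accuracy"],
}


def primary_metric(task):
    """Return the first metric listed for a task (an assumption of this sketch)."""
    return TASK_TO_METRICS[task][0]


print(primary_metric("sst2"))  # accuracy
```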

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in a format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [93]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

You can directly call this tokenizer on one sentence or a pair of sentences:

In [94]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 8667, 117, 1142, 1141, 5650, 106, 102, 1262, 1142, 5650, 2947, 1114, 1122, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.
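
To make the three keys concrete, here is a sketch that rebuilds the output shown above from the two sentences' token IDs. It assumes the BERT-style layout visible in that output (ID 101 at the start, ID 102 after each sentence); `encode_pair` is a hypothetical helper, not the tokenizer's implementation, and other checkpoints may use a different layout or return different keys.

```python
def encode_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble a BERT-style sentence-pair encoding from two lists of token IDs."""
    first = [cls_id] + ids_a + [sep_id]
    second = ids_b + [sep_id]
    return {
        "input_ids": first + second,
        # 0 marks positions belonging to the first sentence, 1 the second.
        "token_type_ids": [0] * len(first) + [1] * len(second),
        # 1 marks real tokens; padding positions would be 0.
        "attention_mask": [1] * (len(first) + len(second)),
    }


# Token IDs copied from the tokenizer output above, without the special tokens.
sentence_a = [8667, 117, 1142, 1141, 5650, 106]
sentence_b = [1262, 1142, 5650, 2947, 1114, 1122, 119]

encoded = encode_pair(sentence_a, sentence_b)
print(encoded["token_type_ids"])
```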

To preprocess our dataset, we will thus need the names of the columns containing the sentence(s). The following dictionary maps each task to its column names:

In [95]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("text", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

We can double check it does work on our current dataset:

In [96]:
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`, which ensures that an input longer than what the selected model can handle is truncated to the maximum length accepted by the model. Adding `padding='longest'` would also pad all inputs to the maximum input length to give a single input array; the function below does not pass it, so padding is left to batching time. Padding each *batch* only to the maximum length in that batch (for example with a generator or `tf.data.Dataset`) is the more performant option because it reduces the number of padding tokens, but most GLUE tasks are relatively quick on modern GPUs either way.
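
The difference between padding everything to one length and padding each batch separately can be sketched in plain Python. The helper names (`pad_to`, `pad_all`, `pad_per_batch`) and the padding ID 0 are assumptions made for this illustration; this is not what the tokenizer does internally.

```python
def pad_to(sequence, length, pad_id=0):
    """Right-pad a list of token IDs with pad_id up to the given length."""
    return sequence + [pad_id] * (length - len(sequence))


def pad_all(sequences, pad_id=0):
    """Pad every sequence to the longest one in the whole collection."""
    longest = max(len(s) for s in sequences)
    return [pad_to(s, longest, pad_id) for s in sequences]


def pad_per_batch(sequences, batch_size, pad_id=0):
    """Pad each batch only to the longest sequence inside that batch."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        batches.append(pad_all(sequences[start:start + batch_size], pad_id))
    return batches


lengths = [3, 4, 12, 11]
sequences = [[1] * n for n in lengths]

global_tokens = sum(len(s) for s in pad_all(sequences))
batch_tokens = sum(len(s) for batch in pad_per_batch(sequences, 2) for s in batch)
print(global_tokens, batch_tokens)  # 48 32
```

In this toy case the per-batch variant stores 32 token positions instead of 48, because the two short sequences are never padded up to the length of the long ones.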

In [97]:
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [98]:
preprocess_function(dataset["train"][:5])

{'input_ids': [[101, 146, 12765, 146, 6586, 140, 19556, 19368, 13329, 118, 162, 21678, 2162, 17056, 1121, 1139, 1888, 2984, 1272, 1104, 1155, 1103, 6392, 1115, 4405, 1122, 1165, 1122, 1108, 1148, 1308, 1107, 2573, 119, 146, 1145, 1767, 1115, 1120, 1148, 1122, 1108, 7842, 1118, 158, 119, 156, 119, 10148, 1191, 1122, 1518, 1793, 1106, 3873, 1142, 1583, 117, 3335, 1217, 170, 5442, 1104, 2441, 1737, 107, 6241, 107, 146, 1541, 1125, 1106, 1267, 1142, 1111, 1991, 119, 133, 9304, 120, 135, 133, 9304, 120, 135, 1109, 4928, 1110, 8663, 1213, 170, 1685, 3619, 3362, 2377, 1417, 14960, 1150, 3349, 1106, 3858, 1917, 1131, 1169, 1164, 1297, 119, 1130, 2440, 1131, 3349, 1106, 2817, 1123, 2209, 1116, 1106, 1543, 1199, 3271, 1104, 4148, 1113, 1184, 1103, 1903, 156, 11547, 1162, 1354, 1164, 2218, 1741, 2492, 1216, 1112, 1103, 4357, 1414, 1105, 1886, 2492, 1107, 1103, 1244, 1311, 119, 1130, 1206, 4107, 8673, 1105, 6655, 10552, 3708, 2316, 1104, 8583, 1164, 1147, 11089, 1113, 4039, 117, 1131, 1144, 2673, 

To apply this function on all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of our `dataset` object we created earlier. This will apply the function on all the elements of all the splits in `dataset`, so our training, validation and testing data will be preprocessed in one single command.

In [99]:
pre_tokenizer_columns = set(dataset["train"].features)
encoded_dataset = dataset.map(preprocess_function, batched=True)
tokenizer_columns = list(set(encoded_dataset["train"].features) - pre_tokenizer_columns)
print("Columns added by tokenizer:", tokenizer_columns)

Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-5408db77831bf4c4.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-4c8ec4fb03f085dc.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1/cache-cebf94b3f1cd41d3.arrow


Columns added by tokenizer: ['attention_mask', 'token_type_ids', 'input_ids']


In [100]:
encoded_dataset["train"].features["label"]

ClassLabel(num_classes=2, names=['neg', 'pos'], id=None)

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data must not be reused). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
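To make the batched calling convention concrete, here is a rough sketch in plain Python of what a batched `map` does conceptually: slice the columns into chunks, pass each chunk to the function as a dict of lists, and collect the returned columns. The `toy_batched_map` and `toy_tokenize` functions are hypothetical stand-ins written for this illustration, not the 🤗 Datasets or tokenizer implementations.

```python
def toy_tokenize(examples):
    """Hypothetical stand-in for a tokenizer: each text becomes a list of word lengths."""
    return {
        "input_ids": [[len(word) for word in text.split()] for text in examples["text"]]
    }


def toy_batched_map(columns, function, batch_size=2):
    """Apply `function` to dict-of-lists chunks and add the new columns it returns."""
    num_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict mapping column name -> list of values.
        batch = {
            name: values[start:start + batch_size] for name, values in columns.items()
        }
        # This sketch assumes the function only returns columns that do not exist yet.
        for name, values in function(batch).items():
            result.setdefault(name, []).extend(values)
    return result


toy_columns = {"text": ["a fine film", "dull", "not bad at all"], "label": [1, 0, 1]}
toy_encoded = toy_batched_map(toy_columns, toy_tokenize)
print(toy_encoded["input_ids"])
```

Because the function receives whole lists, it can process a batch in one call, which is what lets a fast tokenizer work on several texts at once.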

Finally, we convert our datasets to `tf.data.Dataset`. There's a built-in method for this, so all you need to do is specify the columns you want (both for the inputs and the labels), whether the data should be shuffled, the batch size, and an optional collation function, that controls how a batch of samples is combined.

We'll need to supply a `DataCollator` for this. The `DataCollator` handles grouping each batch of samples together, and different tasks will require different data collators. In this case, we will use the `DataCollatorWithPadding`, because our samples need to be padded to the same length to form a batch. Remember to supply the `return_tensors` argument too - our data collators can handle multiple frameworks, so you need to be clear that you want TensorFlow tensors back.

In [101]:
'''

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

validation_key = (
    "validation_mismatched"
    if task == "mnli-mm"
    else "validation_matched"
    if task == "mnli"
    else "validation"
)
tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

'''


In [102]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_dataset = encoded_dataset["train"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["labels"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
tf_validation_dataset = encoded_dataset["validation"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["labels"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)
tf_test_dataset = encoded_dataset["test"].to_tf_dataset(
    columns=tokenizer_columns,
    label_cols=["labels"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since all our tasks are about sentence classification, we use the `TFAutoModelForSequenceClassification` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (which is always 2, except for STS-B which is a regression problem and MNLI where we have 3 labels). We also need to get the appropriate loss function (SparseCategoricalCrossentropy for every task except STSB, which as a regression problem requires MeanSquaredError).

Note that all models in `transformers` compute loss internally too, and you can train on this loss value. This can be very helpful when the loss is not easy to specify yourself. To use this, pass the labels as a `labels` key in the input dictionary, and then compile the model without specifying a loss. You can see examples of this approach in several of the other TensorFlow notebooks.
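For intuition about the two losses mentioned above, the following plain-Python sketch computes sparse categorical cross-entropy from logits for a single example, and mean squared error for a small regression batch. The function names and the numbers are made up for illustration; Keras additionally averages the per-example losses over the batch and handles tensors, which this sketch does not attempt to reproduce.

```python
import math


def toy_sparse_crossentropy_from_logits(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


def toy_mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


toy_example_logits = [2.0, 0.0]  # made-up logits for a two-class problem
toy_loss_if_class_0 = toy_sparse_crossentropy_from_logits(toy_example_logits, 0)
toy_loss_if_class_1 = toy_sparse_crossentropy_from_logits(toy_example_logits, 1)
print("loss when the true label is 0:", toy_loss_if_class_0)
print("loss when the true label is 1:", toy_loss_if_class_1)

# Regression case (as for STS-B): made-up similarity scores.
print("MSE:", toy_mean_squared_error([2.5, 4.0], [3.0, 4.0]))
```

The loss is small when the largest logit belongs to the true label and grows as the model puts more weight on the wrong class, which is why the labels can stay as plain integers rather than one-hot vectors.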

In [103]:
from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf

if task == "stsb":
    # STS-B is a regression task: a single output trained with mean squared error.
    loss = tf.keras.losses.MeanSquaredError()
    num_labels = 1
else:
    # Classification tasks: MNLI has 3 labels, every other task has 2.
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    num_labels = 3 if task.startswith("mnli") else 2
model = TFAutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=num_labels
)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning is telling us that the `classifier` layer was not found in the checkpoint and has been randomly initialized, while all the layers stored in the checkpoint were used. This is absolutely normal in this case, because the pretrained `bert-base-cased` checkpoint has no sequence classification head, so we are adding a new head for which we don't have pretrained weights. The library therefore warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

In [104]:
from transformers import create_optimizer

num_epochs = 2  # 5
batches_per_epoch = len(encoded_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)

optimizer, schedule = create_optimizer(
    init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps
)
model.compile(optimizer=optimizer, loss=loss)

In [105]:
print(model.outputs)

None


The `create_optimizer` function in the Transformers library creates a very useful `AdamW` optimizer with weight and learning rate decay. This performs very well for training most transformer networks - we recommend using it as your default unless you have a good reason not to! Note, however, that because it decays the learning rate over the course of training, it needs to know how many batches it will see during training.
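As a rough picture of why the optimizer needs the total number of steps, the sketch below implements a linear decay from the initial learning rate down to zero in plain Python. It assumes a linear schedule with no warmup, which matches `num_warmup_steps=0` above but is otherwise a simplification of what `create_optimizer` builds. The `toy_linear_decay` function and the dataset size and batch size used here are hypothetical values chosen for illustration, not read from this notebook.

```python
def toy_linear_decay(step, init_lr, total_steps):
    """Learning rate after `step` updates, decaying linearly to 0 at `total_steps`."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return init_lr * remaining


toy_num_examples = 25_000  # hypothetical training set size
toy_batch_size = 16        # hypothetical batch size
toy_num_epochs = 2

toy_steps_per_epoch = toy_num_examples // toy_batch_size
toy_total_steps = toy_steps_per_epoch * toy_num_epochs

# Learning rate at the start, halfway through, and at the end of training.
for step in (0, toy_total_steps // 2, toy_total_steps):
    print(step, toy_linear_decay(step, init_lr=2e-5, total_steps=toy_total_steps))
```

If training runs for more steps than the schedule was built for, the learning rate in this sketch simply stays at zero, so the number of epochs passed to `fit` should agree with the one used to compute the total step count.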

In [106]:
metric_name = (
    "pearson"
    if task == "stsb"
    else "matthews_correlation"
    if task == "cola"
    else "accuracy"
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay. Since the best model might not be the one at the end of training, we ask the `Trainer` to load the best model it saved (according to `metric_name`) at the end of training.

The last two arguments set everything up so we can push the model to the [Hub](https://huggingface.co/models) at the end of training. Remove both of them if you didn't follow the installation steps at the top of the notebook; otherwise you can change the value of `push_to_hub_model_id` to something you would prefer.

The last thing to define is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits (or just squeeze the last axis in the case of STS-B):

In [107]:
def compute_metrics(predictions, labels):
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
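The argmax step can be checked by hand on a few made-up logits. The sketch below is a plain-Python illustration under stated assumptions: `toy_argmax`, `toy_accuracy`, and the numbers are invented for this example, and it does not call the `metric` object used above.

```python
def toy_argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def toy_accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    predictions = [toy_argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# One row of logits per example, one column per class (made-up values).
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
toy_labels = [0, 1, 1, 1]
print(toy_accuracy(toy_logits, toy_labels))
```

No softmax is needed before the argmax, because softmax preserves the ordering of the logits.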

We can now finetune our model by just calling the `fit` method. Be sure to pass the TF datasets, and not the original datasets! We can also add a callback to sync up our model with the Hub - this allows us to resume training from other machines and even test the model's inference quality midway through training! Make sure to change the `username` if you do. If you don't want to do this, simply remove the callbacks argument in the call to `fit()`.

In [108]:
# from transformers.keras_callbacks import PushToHubCallback

# model_name = model_checkpoint.split("/")[-1]
# push_to_hub_model_id = f"{model_name}-finetuned-{task}"

# callback = PushToHubCallback(
#     output_dir="./tc_model_save",
#     tokenizer=tokenizer,
#     hub_model_id=push_to_hub_model_id,
# )

model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs,  # keep in sync with the learning rate schedule built above
    # callbacks=[callback],
)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fd298f3d7d0>

We can add Keras metrics during compilation above if we want to get live readouts during training, or we can use the `compute_metrics` function after training to compute the metrics specified for each task.

In [109]:
predictions = model.predict(tf_validation_dataset)["logits"]

In [110]:
# Labels come from the same "validation" split that tf_validation_dataset was built from
compute_metrics(predictions, np.array(encoded_dataset["validation"]["label"]))

{'accuracy': 0.904}

## SHAP

In [111]:
! pip --quiet install shap

[?25l[K     |▋                               | 10 kB 26.8 MB/s eta 0:00:01[K     |█▏                              | 20 kB 27.9 MB/s eta 0:00:01[K     |█▊                              | 30 kB 11.7 MB/s eta 0:00:01[K     |██▎                             | 40 kB 9.1 MB/s eta 0:00:01[K     |███                             | 51 kB 4.9 MB/s eta 0:00:01[K     |███▌                            | 61 kB 5.7 MB/s eta 0:00:01[K     |████                            | 71 kB 5.8 MB/s eta 0:00:01[K     |████▋                           | 81 kB 5.7 MB/s eta 0:00:01[K     |█████▏                          | 92 kB 6.3 MB/s eta 0:00:01[K     |█████▉                          | 102 kB 5.2 MB/s eta 0:00:01[K     |██████▍                         | 112 kB 5.2 MB/s eta 0:00:01[K     |███████                         | 122 kB 5.2 MB/s eta 0:00:01[K     |███████▌                        | 133 kB 5.2 MB/s eta 0:00:01[K     |████████▏                       | 143 kB 5.2 MB/s eta 0:00:01[K  

In [112]:
# len(model.outputs)
isinstance(model, tf.keras.Model)

True

In [113]:
print(model.outputs)

None


In [114]:
print(model.layers)  # all layers
print(len(model.layers))
model.layers[0]  # Bert
model.layers[1]  # Dropout
model.layers[2]  # Dense; the last layer, i.e. the same as model.layers[-1]

[<transformers.models.bert.modeling_tf_bert.TFBertMainLayer object at 0x7fd310427590>, <keras.layers.core.dropout.Dropout object at 0x7fd2a3dfef90>, <keras.layers.core.dense.Dense object at 0x7fd2a3e05450>]
3


<keras.layers.core.dense.Dense at 0x7fd2a3e05450>

In [116]:
print(type(dataset["train"][:100]))  # a dict will not work, so we pick out the "text" field

for k, v in dataset["train"][:100].items():
    print(k, v)

print(type(dataset["train"][:100]['text']))

testi = [list(x) for x in dataset["train"][:100]['text']]  # nested list: each text becomes a list of characters
print(type(testi))
print(type(testi[0]))
print(type(testi))
print(type(testi[0]))

<class 'dict'>
label [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
<class 'list'>
<class 'list'>
<class 'list'>


In [117]:

model.outputs = np.array([1, 2])  # Set the model output: here just some vector whose length is
# XXXX (because the model produces for each input )
# For details, see the shap library source code, file tf_utils.py, line 85

# we use the first 100 training examples as our background dataset to integrate over
# explainer = shap.DeepExplainer(model, x_train[:100])
# explainer = shap.DeepExplainer(model, dataset["train"][:100]['sentence'])  # list of input texts
explainer = shap.DeepExplainer(model, testi)  # list of input texts


# SHAP requires tensor outputs from the classifier, and explanations work best in additive spaces, so we transform the probabilities into logit values (information values instead of probabilities).


# explain the first 10 predictions
# explaining each prediction requires 2 * background dataset size runs
# shap_values = explainer.shap_values(x_test[:10])
shap_values = explainer.shap_values(dataset["test"][:10]['sentence'])



NameError: ignored

In [118]:
import shap
# https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html?highlight=return_all_scores

# classifier = transformers.pipeline('sentiment-analysis', return_all_scores=True)
# classifier(short_data[:2])

classifier = transformers.pipeline(task='sentiment-analysis', model=model, tokenizer=tokenizer, return_all_scores=True)
short_data = [v[:500] for v in dataset["test"]["text"][:20]]  # first 500 characters of each of the first 20 test-split texts
classifier(short_data[:2])  # pipeline predictions for the first 2 samples in `short_data`
# The model uses integer labels, hence the names 'LABEL_0' and 'LABEL_1'
# TODO: fix labels

[[{'label': 'LABEL_0', 'score': 0.02706652693450451},
  {'label': 'LABEL_1', 'score': 0.9729334712028503}],
 [{'label': 'LABEL_0', 'score': 0.9921069145202637},
  {'label': 'LABEL_1', 'score': 0.007893082685768604}]]
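A comment earlier in this section notes that explanations work best in additive spaces, which means working with logits rather than probabilities. As a small worked example of that transform, the sketch below converts the two scores of the first sample above into log-odds. The `toy_logit` helper is written for this illustration and is not a call into the `shap` library.

```python
import math


def toy_logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Scores copied from the pipeline output shown above for the first sample.
toy_first_sample = {"LABEL_0": 0.02706652693450451, "LABEL_1": 0.9729334712028503}
for label, score in toy_first_sample.items():
    print(label, round(toy_logit(score), 3))
```

For a two-class model the two log-odds are (up to rounding) negatives of each other, so explaining one label is enough to read off the other.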

In [119]:
type(short_data)

list

In [120]:
# define the explainer
explainer = shap.Explainer(classifier)

In [121]:
# explain the predictions of the pipeline on the first two samples
shap_values = explainer(short_data[:2])

Partition explainer: 3it [06:02, 181.22s/it]


In [122]:
shap.plots.text(shap_values[:,:,"LABEL_0"])

Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
