# Sentiment analysis pipeline with the transformers library

In [28]:
#for colab
!pip install transformers datasets >> /dev/null

In [2]:
from transformers import pipeline
import torch
from pprint import pprint


## Natural language processing tasks

### An example of previous generation of language model GPT-2

In [3]:
# Dummy model
from transformers import set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("I am a unicorn in a financial office,", max_length=60, num_return_sequences=5)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I am a unicorn in a financial office, but what I do in that office is different—I\'m dealing with people with a lot of stress and the fact that I\'m afraid to let myself get down on my guard when I\'m not working," Kelly said. "How much can I lose from'},
 {'generated_text': 'I am a unicorn in a financial office, and a huge one!" and he got the best of them, and they didn\'t understand why.\n\n"We still have work to do!" the girls cheered, and they knew that when they could get away with it. Everyone took it as a'},
 {'generated_text': 'I am a unicorn in a financial office, and I will do anything in my power to help you."\n\nTo her credit, she was kind and supportive after a bit, but she knew that there were lots of young women out there she couldn\'t count on to care about anything. Then she'},
 {'generated_text': "I am a unicorn in a financial office, not one of my clients.\n\nIf my manager's name doesn't look right on the invoice, I might end up getting sued bec

In [4]:
generator("To bake cookies I need,", max_length=60, num_return_sequences=5)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'To bake cookies I need, you can always leave mine in the fridge.\n\nIngredients\n\n1 cup coconut oil\n\n3 oz chocolate chips\n\n1/4 cup coconut sugar\n\n2 oz chocolate chips and optional toppings\n\n\nPreheat the oven to 225ºF (150'},
 {'generated_text': 'To bake cookies I need, you have to find the right place first so you can have that perfect, perfect idea."\n\nShe did a bit of a mini-batch last night by putting a piece of butter to the top of her cake and, with the help of her assistant who was also'},
 {'generated_text': 'To bake cookies I need, my dough comes together as a ball so you can scoop everything into that shape. Here I will show you how to prepare your new dough so you can make the perfect one.\n\nMaking your own cookies\n\n2. Preheat oven to 375 degrees F* with'},
 {'generated_text': "To bake cookies I need, which is just a matter of taking them out with a fork and using them to remove the baking oil and other toppings, which I'd rather not cook, s

In [5]:
generator("I want to kill a kitten,", max_length=60, num_return_sequences=5)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I want to kill a kitten, because I want to kill the kitten. That\'s what I\'m up against in life."\n\nJared was clearly shocked when the teen asked for a piece of paper and asked if there were any other options in the world. "A kitten, or a person'},
 {'generated_text': 'I want to kill a kitten, I want to kill a baby…"'},
 {'generated_text': 'I want to kill a kitten, too. I need to get rid of that awful cat in my garden. Why aren\'t you helping me out? Why aren\'t you killing a kitten that I\'m really sick of?"\n\nPerez watched the cat through a slit from a small window with'},
 {'generated_text': "I want to kill a kitten, or a dog. We believe in human rights, and I respect that. But right now, we're trying for a more humane approach to dealing with people who commit crimes, the most serious crimes. Why does my party oppose the use of drones? Because it's"},
 {'generated_text': "I want to kill a kitten, so I kill the kitten and turn it into a meat eater. I can also

### Text classification

In [6]:
from transformers import pipeline

# This model is a `zero-shot-classification` model.
# It will classify text, except you are free to choose any label you might imagine
classifier = pipeline(model="facebook/bart-large-mnli")
classifier(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)

{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!',
 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'],
 'scores': [0.503635585308075,
  0.47879981994628906,
  0.012600085698068142,
  0.002655789954587817,
  0.0023087512236088514]}

## Build a sentiment analysis classifier

### Instantiate a pipeline

A pipeline is composed of a tokenizer and a model.

In [7]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


We start by creating a "Sentiment Analysis" **classifier** using the pipeline function provided by the Hugging Face Transformers library. This function allows us to easily use pre-trained models for various natural language processing (NLP) tasks, like sentiment analysis.

### Run the classifier

In [8]:
results = classifier("This is cool")
results

[{'label': 'POSITIVE', 'score': 0.9998584985733032}]

The model takes this text as input and predicts the sentiment associated with it.

Pipeline on Huggingface [documentation](https://huggingface.co/docs/transformers/main_classes/pipelines)

Your turn: Try to find replace the send to have a score the closest to 50 you can.

### Multiple input

In [9]:
# We give a list to the classifier now
results = classifier(["NLP is nice", "I don't like NLP"])
results

[{'label': 'POSITIVE', 'score': 0.9997960925102234},
 {'label': 'NEGATIVE', 'score': 0.9968776702880859}]

**Exercise:**

Add different text inputs with varying sentiments, run it, check the model's sentiment predictions, and explore how it assigns labels.

### Use a specific model

By default transformers library uses a distilbert model for the pipelines we have created. Let's change this and work with another model.

In [10]:
# we create another generation pipeline

completion = pipeline("text-generation", model="distilgpt2")
generator(
    "I travel by plane I",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "I travel by plane I've got all this baggage you need as an excuse to be on the trip out to see the country! And when I go"},
 {'generated_text': "I travel by plane I have no idea they used to use such a thing. It's just an archaic term and I only call it the MASS"}]

### Find more models

**Exercise:**

Find more model on Huggingface [hub](https://huggingface.co/models?sort=trending).

### Models cards


Models cards provide information about the model, code examples, demos and most of the time information about how the models has been trained.
[Mistral model card](https://huggingface.co/mistralai/Mistral-7B-v0.3)

### Get information about the model

In [11]:
model = "distilbert-base-uncased-finetuned-sst-2-english"

In [12]:
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model)
print(config)

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.41.0",
  "vocab_size": 30522
}



The model_name variable holds the name of the pre-trained model. In this case, it's "distilbert-base-uncased-finetuned-sst-2-english"

Let's have a look at the model [card on Hugginface.co](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english)

**Exercise:**

Look at different models and pick up one that is the most adapted to your use case and language.

## Tokenizer

### What is a tokenizer

- Tokenization is the process of breaking down text into smaller **units** called **tokens**. In order to process text the computer needs first to transform it into numbers.

- Tokens are the basic building blocks used by Transformers models to understand and process text.

- Tokens can represent **words, subwords, or even individual characters**, depending on the model's vocabulary.

![Pipeline](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

Source [image](https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt)

### Instanciate a tokenizer

In [13]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")


When using from_pretrained, we are loading a pre-trained model and tokenizer specified by the model_name.

We added our tokenizer to our pipeline:


In [14]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)


## Tokenization

A token is a value extracted from a **vocabulary list**.

A vocabulary list is a set words.

## Create tokens

## Split method

In [15]:
tokenized_text = "NLP is great".split()
print(tokenized_text)

['NLP', 'is', 'great']


### Use a tokenizer

In [16]:
sequence = "NLP is great!"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['nl', '##p', 'is', 'great', '!']


### Another BERT tokenizer

In [17]:
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

tokenizer.tokenize("NLP is great!")

['nl', '##p', 'is', 'great', '!']

### XLNet tokenizer

In [18]:
from transformers import XLNetTokenizer


tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased")

tokens = tokenizer.tokenize(sequence)


print(f"Tokens: {tokens}\n")

Tokens: ['▁N', 'LP', '▁is', '▁great', '!']



## Input IDs

In [19]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[578, 7286, 27, 312, 136]


More on [tokenizers.](https://huggingface.co/docs/transformers/en/tokenizer_summary)

## Padding and truncation

Language models work with **tensors**, we need them to be **the same length**.

```
padding=True and truncation=True
```

In [20]:
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")

sequences = ["NLP is great!",
           "All I need is two sentences."]

print(f"Tokens: {tokens}\n")

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

batch = tokenizer(sequences, padding=True, truncation=True, max_length=512, return_tensors="pt") #pt for pyTorch

Tokens: ['▁N', 'LP', '▁is', '▁great', '!']

[100, 100, 100, 100, 999]


**Question**:
What are the ```'101'``` and ```'102'``` in the token list?

In [21]:
pprint(batch)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[  101, 17953,  2361,  2003,  2307,   999,   102,     0,     0],
        [  101,  2035,  1045,  2342,  2003,  2048, 11746,  1012,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]])}


Returns a dictionary with keys ```'input_ids'``` and ```'attention_mask'```, with two tensors the 'input ids' tensor and the 'attention_mask' tensor.
input_ids are unique ids.

# Dataset

## Load a dataset from the hub

In [44]:
from datasets import load_dataset

dataset = load_dataset("carblacac/twitter-sentiment-analysis", split="train")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


TypeError: BuilderConfig.__init__() got an unexpected keyword argument 'train'

In [35]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'feeling'],
        num_rows: 119988
    })
    validation: Dataset({
        features: ['text', 'feeling'],
        num_rows: 29997
    })
    test: Dataset({
        features: ['text', 'feeling'],
        num_rows: 61998
    })
})

The labels here are ```'feeling'```


In [47]:
dataset[0]

{'text': '@fa6ami86 so happy that salman won.  btw the 14sec clip is truely a teaser',
 'feeling': 0}

References:

[More about pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines)

[Huggingface Model hub](https://huggingface.co/models)

[Datasets](https://huggingface.co/docs/datasets/en/index)



In [48]:
dataset["text"]

['@fa6ami86 so happy that salman won.  btw the 14sec clip is truely a teaser',
 "@phantompoptart .......oops.... I guess I'm kinda out of it.... Blonde moment -blushes- epic fail",
 "@bradleyjp decidedly undecided. Depends on the situation. When I'm out with the people I'll be in Chicago with? Maybe.",
 '@Mountgrace lol i know! its so frustrating isnt it?!',
 "@kathystover Didn't go much of any where - Life took over for a while",
 '@TashaWilson like questions she asks me the date etc..i say that i have been to birmingham lol its weird o well  u ok?',
 "@lisastarlynn I haven't heard anything. I'll tweet you as soon as I hear. I'm really worried actually",
 '@SusanCosmos @speakgirl Thx 4 sharing!',
 '@lamere thank you so much, looking at these pics makes me want to have one more',
 'not it teh best form today, dont no why, just having a pissy day, i am all ways happy but to day ah not really   annoyed, bored, angry',
 'About to go to bed. Sleeping really late tomorrow!  I am so glad the

In [49]:
dataset[0]["text"]

'@fa6ami86 so happy that salman won.  btw the 14sec clip is truely a teaser'

In [50]:
tokenizer(dataset[0]["text"])


{'input_ids': [101, 1030, 6904, 2575, 10631, 20842, 2061, 3407, 2008, 28542, 2180, 1012, 18411, 2860, 1996, 2403, 3366, 2278, 12528, 2003, 2995, 2135, 1037, 27071, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [51]:
def tokenization(example):
    return tokenizer(example["text"])

dataset = dataset.map(tokenization, batched=True)

Map:   0%|          | 0/119988 [00:00<?, ? examples/s]

In [68]:
dataset

Dataset({
    features: ['text', 'feeling', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 119988
})

In [54]:
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"])
dataset.format['type']

ValueError: Columns ['label'] not in the dataset. Current columns in the dataset: ['text', 'feeling', 'input_ids', 'token_type_ids', 'attention_mask']

In [57]:
from transformers import DataCollatorWithPadding

#data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_dataset = dataset.to_tf_dataset(
    columns=["input_ids", "token_type_ids", "attention_mask"],
    label_cols=["feeling"],
    batch_size=2,
    collate_fn=data_collator,
    shuffle=True
)

In [58]:
tf_dataset

<_PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'token_type_ids': TensorSpec(shape=(None, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

In [62]:
from datasets import list_metrics
metrics_list = list_metrics()
len(metrics_list)

250

In [63]:
from datasets import load_metric
metric = load_metric('accuracy')

  metric = load_metric('accuracy')
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

In [72]:
model_predictions = model(dataset[0]["input_ids"])
final_score = metric.compute(predictions=model_predictions, references=gold_references)

IndexError: too many indices for tensor of dimension 1

In [74]:
dataset[0]["input_ids"]

tensor([  101,  1030,  6904,  2575, 10631, 20842,  2061,  3407,  2008, 28542,
         2180,  1012, 18411,  2860,  1996,  2403,  3366,  2278, 12528,  2003,
         2995,  2135,  1037, 27071,   102])