# Build a sentiment analysis pipeline with Hugging Face

In [33]:
#for colab
!pip install transformers



In [34]:
from transformers import pipeline
import torch
from pprint import pprint

In [35]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


We start by creating a "Sentiment Analysis" **classifier** using the pipeline function provided by the Hugging Face Transformers library. This function allows us to easily use pre-trained models for various natural language processing (NLP) tasks, like sentiment analysis.
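The warning printed above recommends naming the model and revision explicitly. A minimal sketch of that, assuming the `model` and `revision` arguments of `pipeline` and reusing the model name and revision hash reported in the warning. The helper name `build_pinned_classifier` is hypothetical, and the import sits inside the function so that nothing is downloaded until the function is called:

```python
# Model name and revision copied from the warning message above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_pinned_classifier():
    """Create the sentiment classifier with an explicit model and revision.

    Hypothetical helper for illustration; calling it downloads the model.
    """
    from transformers import pipeline  # imported lazily, only when needed

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=MODEL_REVISION)
```

Calling `build_pinned_classifier()("This is cool")` should then behave like the default classifier, without the warning, as long as the defaults have not changed in your installed version.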

In [36]:
results = classifier("This is cool")
results

[{'label': 'POSITIVE', 'score': 0.9998584985733032}]

The model takes this text as input and predicts the sentiment associated with it. 

pipeline doc: https://huggingface.co/docs/transformers/main_classes/pipelines
pipeline tasks: 

### More than one sentence

In [37]:
# We give a list to the classifier now
results = classifier(["NLP is nice", "It's a lot of work"])
results

[{'label': 'POSITIVE', 'score': 0.9997960925102234},
 {'label': 'POSITIVE', 'score': 0.9995623230934143}]
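The classifier returns one dictionary per input sentence, in the same order as the inputs. A small sketch of pairing each sentence with its prediction, using the values printed above as plain Python data so that it runs without loading the model (the name `paired` is only illustrative):

```python
# Inputs and outputs copied from the cell above.
sentences = ["NLP is nice", "It's a lot of work"]
results = [
    {"label": "POSITIVE", "score": 0.9997960925102234},
    {"label": "POSITIVE", "score": 0.9995623230934143},
]

# One result per input, in input order, so zip() lines them up.
paired = [
    (sentence, result["label"], round(result["score"], 4))
    for sentence, result in zip(sentences, results)
]

for sentence, label, score in paired:
    print(f"{sentence!r}: {label} ({score})")
```

Seeing the text next to its label makes surprising predictions easier to spot, such as "It's a lot of work" being labeled POSITIVE.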

### Exercise:

Add different text inputs with varying sentiments, run it, check the model's sentiment predictions, and explore how it assigns labels.

## Now select a specific model for your pipeline

In [38]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

The model_name variable holds the name of the pre-trained model. In this case, it's "distilbert-base-uncased-finetuned-sst-2-english", the same model the pipeline selected by default above.

Let's have a look at the model card: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english

In [39]:
classifier = pipeline("sentiment-analysis", model=model_name)

## Tokenizer

- Tokenization is the process of breaking down text into smaller **units** called **tokens**.

- Tokens are the basic building blocks used by Transformers models to understand and process text.

- Tokens can represent **words, subwords, or even individual characters**, depending on the model's vocabulary.
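To make the idea of subword tokens concrete, here is a toy sketch of greedy longest-match splitting, in the style of the WordPiece scheme used by BERT-family models. The vocabulary `TOY_VOCAB` and the function names are made up for this illustration; a real tokenizer has a vocabulary of tens of thousands of entries and handles punctuation and normalization as well.

```python
# Toy vocabulary, made up for this illustration (not a real model vocabulary).
TOY_VOCAB = {"token", "##ization", "##s", "is", "fun", "[UNK]"}


def toy_wordpiece(word, vocab=TOY_VOCAB):
    """Split one lowercase word into the longest matching pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text):
    """Lowercase, split on whitespace, then split each word into subwords."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(toy_wordpiece(word))
    return tokens


print(toy_tokenize("Tokenization is fun"))  # ['token', '##ization', 'is', 'fun']
```

A word that is missing from the vocabulary is not lost: it is rebuilt from smaller pieces that are present, and only falls back to `[UNK]` when no piece matches.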

![Pipeline](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

Source image: https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt

In [40]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

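"AutoModelForSequenceClassification" reads the configuration of the checkpoint and loads the matching model architecture with a sequence classification head. "AutoTokenizer" does the same for the tokenizer.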

In [41]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

When using from_pretrained, we are loading a pre-trained model and tokenizer specified by the model_name.

In [42]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

We create our sentiment analysis classifier again, this time passing the model and tokenizer objects we loaded ourselves.

## Tokens to input IDs

In [43]:
tokens = tokenizer.tokenize("Another cool sentence to demonstrate something.")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = tokenizer("Another cool sentence to demonstrate something.")

In [44]:
print(f' Tokens:{tokens}')
print(f' Token IDs: {token_ids}')
print(f' input_ids:{input_ids}')

 Tokens:['another', 'cool', 'sentence', 'to', 'demonstrate', 'something', '.']
 Token IDs: [2178, 4658, 6251, 2000, 10580, 2242, 1012]
 input_ids:{'input_ids': [101, 2178, 4658, 6251, 2000, 10580, 2242, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
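The input_ids list is the token ID list with two extra IDs around it: 101 at the start and 102 at the end. In the BERT uncased vocabulary used by this model these are the special tokens [CLS] and [SEP]. A sketch that rebuilds the tokenizer output by hand from the values printed above (the constant names `CLS_ID` and `SEP_ID` are illustrative):

```python
# Token IDs copied from the output above.
token_ids = [2178, 4658, 6251, 2000, 10580, 2242, 1012]

# Special-token IDs in the BERT uncased vocabulary used by this model.
CLS_ID = 101  # [CLS], added at the start of the sequence
SEP_ID = 102  # [SEP], added at the end of the sequence

input_ids = [CLS_ID] + token_ids + [SEP_ID]
attention_mask = [1] * len(input_ids)  # every position holds a real token

encoded = {"input_ids": input_ids, "attention_mask": attention_mask}
print(encoded)
```

This is why calling the tokenizer directly is the usual route: it adds the special tokens the model expects, which `tokenize` followed by `convert_tokens_to_ids` does not.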


### Exercise: 
Test different tokenizers, select models from the hub.

Some:

https://huggingface.co/SamLowe/roberta-base-go_emotions

https://huggingface.co/bert-base-uncased

Some more... 


In [45]:
# Uncomment this to answer the exercise, replacing [model] with a model name from the hub
#tokenizer = AutoTokenizer.from_pretrained("[model]")
#tokens = tokenizer.tokenize("Woaou another sentence!")
#token_ids = tokenizer.convert_tokens_to_ids(tokens)
#input_ids = tokenizer("Woaou another sentence!")

In [46]:
print(f' Tokens:{tokens}')
print(f' Token IDs: {token_ids}')
print(f' input_ids:{input_ids}')

 Tokens:['another', 'cool', 'sentence', 'to', 'demonstrate', 'something', '.']
 Token IDs: [2178, 4658, 6251, 2000, 10580, 2242, 1012]
 input_ids:{'input_ids': [101, 2178, 4658, 6251, 2000, 10580, 2242, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


## Batching

In [47]:
sentences = ["Another cool sentence to demonstrate something.",
             "All I need is two sentences."]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")  # "pt" returns PyTorch tensors

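### Note:
All samples in a batch must have the same length, because the tensors passed to the model must have a uniform shape. Two arguments take care of this: padding=True pads shorter sequences up to the longest one in the batch, and truncation=True cuts sequences that are longer than max_length.
```
padding=True and truncation=True
```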

In [48]:
pprint(batch)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[  101,  2178,  4658,  6251,  2000, 10580,  2242,  1012,   102],
        [  101,  2035,  1045,  2342,  2003,  2048, 11746,  1012,   102]])}


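The tokenizer returns a dictionary with two keys, 'input_ids' and 'attention_mask', each holding one tensor with one row per sentence.
The input_ids are the vocabulary indices of the tokens. The attention_mask is 1 for real tokens and 0 for padding. Here both sentences are 9 tokens long, so no padding was added and the mask is all ones.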
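Both sentences in the batch above encode to nine IDs, so no padding is visible in the output. The following sketch shows what padding and truncation do when lengths differ, using plain Python lists. The function `pad_batch` is hypothetical, pad ID 0 is the [PAD] token of the BERT uncased vocabulary, and truncation is simplified to a plain slice (the real tokenizer keeps the closing [SEP] token).

```python
PAD_ID = 0  # [PAD] ID in the BERT uncased vocabulary


def pad_batch(sequences, max_length=512, pad_id=PAD_ID):
    """Truncate to max_length, then right-pad every sequence to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]  # simplified truncation
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different lengths (4 and 3 IDs).
toy_batch = pad_batch([[101, 2178, 4658, 102], [101, 2035, 102]])
print(toy_batch["input_ids"])       # [[101, 2178, 4658, 102], [101, 2035, 102, 0]]
print(toy_batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.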

## Predictions

In [49]:
# Prevent gradient computation (no need to compute gradients during inference)

with torch.no_grad():
    outputs = model(**batch) 
    print(outputs)
    print('')
    predictions = torch.softmax(outputs.logits, dim=1)      # Apply softmax to convert model logits to probabilities
    pprint(predictions)
    print('')
    labels = torch.argmax(predictions, dim=1)              # Find the index of the class with the highest probability for each example
    pprint(labels)
    labels = [model.config.id2label[label_id] for label_id in labels.tolist()]
    pprint(labels)

SequenceClassifierOutput(loss=None, logits=tensor([[-3.9261,  4.2183],
        [ 2.8756, -2.4102]]), hidden_states=None, attentions=None)

tensor([[2.9026e-04, 9.9971e-01],
        [9.9496e-01, 5.0377e-03]])

tensor([1, 0])
['POSITIVE', 'NEGATIVE']
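To see where these numbers come from, the same steps can be reproduced with plain Python on the logits printed above. This is a sketch: the `id2label` dictionary below is written out by hand to match the labels in the output, and the `softmax` helper is a hand-rolled stand-in for `torch.softmax`.

```python
import math

# Logits copied from the output above, one row per sentence.
logits = [[-3.9261, 4.2183], [2.8756, -2.4102]]

# Label mapping matching the labels printed above (index 0 = NEGATIVE, 1 = POSITIVE).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Turn a list of logits into probabilities that sum to 1."""
    highest = max(row)
    exps = [math.exp(x - highest) for x in row]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


probabilities = [softmax(row) for row in logits]
predicted_ids = [row.index(max(row)) for row in probabilities]  # argmax per row
predicted_labels = [id2label[i] for i in predicted_ids]

print(probabilities)
print(predicted_labels)  # ['POSITIVE', 'NEGATIVE']
```

The larger logit in each row dominates after the exponential, which is why both predictions end up close to 0 or 1.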


In [50]:
# Define the number of decimal places to round to
decimal_places = 2
# Round the probabilities
rounded_probabilities = torch.round(predictions * 10**decimal_places) / (10**decimal_places)
# Print the rounded probabilities
print('')
pprint(rounded_probabilities)


tensor([[0.0000, 1.0000],
        [0.9900, 0.0100]])


### Saving

In [51]:
save_directory = "your_dir"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForSequenceClassification.from_pretrained(save_directory)
