# Build a sentiment analysis pipeline with HuggingFace

In [None]:
# For Colab
!pip install transformers

In [None]:
from transformers import pipeline
import torch
from pprint import pprint

In [None]:
classifier = pipeline("sentiment-analysis")

We start by creating a "Sentiment Analysis" **classifier** using the pipeline function provided by the Hugging Face Transformers library. This function allows us to easily use pre-trained models for various natural language processing (NLP) tasks, like sentiment analysis.

In [None]:
results = classifier("This is cool")
results

The model takes this text as input and returns the predicted sentiment label together with a confidence score.

See the Hugging Face [pipeline documentation](https://huggingface.co/docs/transformers/main_classes/pipelines) for more details.

### More than one sentence

In [None]:
# We give a list to the classifier now
results = classifier(["NLP is nice", "It's a lot of work"])
results
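
The pipeline returns one dictionary per input text, in input order, each with a `label` and a `score` key. A minimal sketch of pairing inputs with their predictions is shown here; the values in `example_results` are made up for illustration and are not real model outputs, and `summarize` is a hypothetical helper, not part of the library.

```python
# Hypothetical values for illustration only; these are NOT real model outputs.
example_texts = ["first example text", "second example text"]
example_results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
]


def summarize(texts, results):
    """Pair each input text with its predicted label and rounded score."""
    return [
        (text, result["label"], round(result["score"], 2))
        for text, result in zip(texts, results)
    ]


summary = summarize(example_texts, example_results)
for text, label, score in summary:
    print(f"{label:<8} {score:.2f}  {text}")
```

The same helper should work on the `results` list above, since it only relies on the `label` and `score` keys.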

### Exercise:

Add different text inputs with varying sentiments, run it, check the model's sentiment predictions, and explore how it assigns labels.

## Now select a specific model for your pipeline

In [None]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

The `model_name` variable holds the name of the pre-trained model. In this case, it's `distilbert-base-uncased-finetuned-sst-2-english`.

Let's have a look at the [model card on Hugging Face](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).

In [None]:
classifier = pipeline("sentiment-analysis", model=model_name)

## Tokenizer

- Tokenization is the process of breaking down text into smaller **units** called **tokens**.

- Tokens are the basic building blocks used by Transformers models to understand and process text.

- Tokens can represent **words, subwords, or even individual characters**, depending on the model's vocabulary.
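
To make subword tokens concrete, here is a toy sketch of greedy longest-match splitting in the style of WordPiece, the scheme used by BERT-family models. The vocabulary `toy_vocab` and the function `wordpiece_tokenize` are invented for illustration; a real tokenizer has a vocabulary of tens of thousands of entries and handles normalization and punctuation as well.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are marked with "##".
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, not taken from any real model.
toy_vocab = {"token", "##ization", "##izer", "un", "##cool", "cool"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("uncool", toy_vocab))        # ['un', '##cool']
```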

![Pipeline](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

Source [image](https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt)

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

`AutoModelForSequenceClassification` selects the right model architecture automatically, based on the configuration stored with the checkpoint.
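
As a rough mental model (not the library's actual implementation), an "auto" class can be thought of as a lookup from the `model_type` recorded in a checkpoint's configuration to a concrete model class. All class and function names below are hypothetical.

```python
class ToyBertClassifier:
    model_type = "bert"


class ToyDistilBertClassifier:
    model_type = "distilbert"


# Registry mapping a configuration's model_type to a concrete class.
TOY_REGISTRY = {
    cls.model_type: cls for cls in (ToyBertClassifier, ToyDistilBertClassifier)
}


def toy_auto_from_config(config):
    """Instantiate the class registered for config["model_type"]."""
    model_type = config["model_type"]
    if model_type not in TOY_REGISTRY:
        raise ValueError(f"Unsupported model_type: {model_type!r}")
    return TOY_REGISTRY[model_type]()


toy_model = toy_auto_from_config({"model_type": "distilbert"})
print(type(toy_model).__name__)
```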

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

With `from_pretrained`, we load the pre-trained model and the tokenizer identified by `model_name`.

In [None]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

We create our sentiment analysis classifier again, this time passing the model and tokenizer objects explicitly.

## Tokens to input IDs

In [None]:
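# Split the text into tokens
tokens = tokenizer.tokenize("Another cool sentence to demonstrate something.")
# Map each token to its vocabulary ID (no special tokens are added here)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
# Calling the tokenizer directly returns input_ids (with special tokens) and an attention_mask
input_ids = tokenizer("Another cool sentence to demonstrate something.")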

In [None]:
print(f' Tokens:{tokens}')
print(f' Token IDs: {token_ids}')
print(f' input_ids:{input_ids}')
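
The two ID lists printed above differ: calling the tokenizer directly also adds the model's special tokens (`[CLS]` at the start and `[SEP]` at the end for BERT-style models) and returns an attention mask. The sketch below mimics this with a made-up vocabulary; the IDs in `toy_vocab_ids` and the `toy_*` functions are hypothetical and do not match any real model.

```python
# Hypothetical toy vocabulary; real models use different IDs.
toy_vocab_ids = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "another": 4, "cool": 5, "sentence": 6,
}


def toy_convert_tokens_to_ids(tokens):
    """Look up each token, falling back to the unknown-token ID."""
    return [toy_vocab_ids.get(t, toy_vocab_ids["[UNK]"]) for t in tokens]


def toy_encode(tokens):
    """Wrap the tokens in special tokens and build a matching attention mask."""
    ids = toy_convert_tokens_to_ids(["[CLS]"] + tokens + ["[SEP]"])
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


toy_tokens = ["another", "cool", "sentence"]
print(toy_convert_tokens_to_ids(toy_tokens))  # [4, 5, 6]
print(toy_encode(toy_tokens))
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens(input_ids["input_ids"])` should show the added special tokens.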

### Exercise:
Test different tokenizers by selecting other models from the Hub.

In [None]:
# You can use this code (replace [model] with a model name from the Hub)
#tokenizer = AutoTokenizer.from_pretrained("[model]")
#tokens = tokenizer.tokenize("Woaou another sentence!")
#token_ids = tokenizer.convert_tokens_to_ids(tokens)
#input_ids = tokenizer("Another cool sentence to demonstrate something.")

In [None]:
print(f' Tokens:{tokens}')
print(f' Token IDs: {token_ids}')
print(f' input_ids:{input_ids}')

## Batching

In [None]:
sentences = ["Another cool sentence to demonstrate something.",
             "All I need is two sentences."]
# return_tensors="pt" returns PyTorch tensors
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")

### Note:
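All our samples will have the same length. The model requires this because the tensors in a batch must have the same shape. Shorter sentences are padded and sentences longer than `max_length` are truncated: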
```
padding=True and truncation=True
```
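
A minimal pure-Python sketch of what padding and truncation do to a batch. It assumes a padding ID of 0 and uses made-up token IDs; `pad_and_truncate` is a hypothetical helper, and unlike a real tokenizer it does not preserve special tokens when truncating.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Cut sequences to max_length, then pad all to the longest remaining one."""
    # Target length: the longest sequence in the batch, capped at max_length.
    target = min(max(len(ids) for ids in batch_ids), max_length)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:target]                    # truncation
        n_pad = target - len(ids)             # number of padding positions
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs: one sequence of length 5, one of length 2.
toy_batch = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(toy_batch["input_ids"])       # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(toy_batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Every row now has the same length, so the batch can be stacked into a single rectangular tensor.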

In [None]:
pprint(batch)

The tokenizer returns a dictionary-like object with two keys, `input_ids` and `attention_mask`, each holding one tensor. `input_ids` contains the vocabulary ID of every token, and `attention_mask` marks real tokens with 1 and padding with 0.

## Predictions

In [None]:
# Disable gradient computation: we only run inference
with torch.no_grad():
    outputs = model(**batch)
    # Apply softmax to convert model logits to probabilities
    predictions = torch.softmax(outputs.logits, dim=1)
    # Find the index of the class with the highest probability for each example
    labels = torch.argmax(predictions, dim=1)
    # Map each class index to its label name
    labels = [model.config.id2label[label_id] for label_id in labels.tolist()]
    pprint(outputs)
    pprint(predictions)
    pprint(labels)
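
The softmax and argmax steps can be reproduced by hand. The sketch below uses made-up logits and a label mapping assumed to mirror the model's `id2label`; the names prefixed with `toy_` are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed mapping for illustration; check model.config.id2label for the real one.
toy_id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Made-up logits for two examples, one row per example.
toy_logits = [[-2.0, 3.0], [1.5, -0.5]]

toy_probs = [softmax(row) for row in toy_logits]
# argmax: index of the highest probability in each row.
toy_labels = [toy_id2label[row.index(max(row))] for row in toy_probs]
print(toy_labels)  # ['POSITIVE', 'NEGATIVE']
```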

In [None]:
# Define the number of decimal places to round to
decimal_places = 2
# Round the probabilities
rounded_probabilities = torch.round(predictions * 10**decimal_places) / (10**decimal_places)
# Print the rounded probabilities
print(rounded_probabilities)

### Saving

In [None]:
# Save the tokenizer and the model to a local directory
save_directory = "your_dir"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# Reload both from that directory
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForSequenceClassification.from_pretrained(save_directory)
