# Sentiment analysis pipeline with the transformers library

### What is sentiment analysis?

Picture this: you've just finished writing an online review about your latest meal at a restaurant, and then someone else reads it. Sentiment analysis is the technology that would help them understand not just what you said, but how you really feel about it, whether it was delicious or disappointing, thrilling or tedious.

Sentiment analysis, also called opinion mining or emotion AI, involves using computers to dig through all sorts of text, like comments on websites, tweets on social media, and customer feedback forms. The main objective is to figure out the underlying sentiment: what people feel about certain topics, products, services, or events.

### With LLMs?

Today's language models help computers understand human language in context instead of treating words as isolated pieces of text. When we extract meaning from a text, the model can take the whole sentence into account rather than looking at individual words one by one.

#### Let's dive in

In [None]:
'''
On Colab, we install the transformers and datasets libraries

'''
%pip install transformers datasets >> /dev/null

In [None]:
'''
We import the transformers pipeline function, torch and pprint
'''

from transformers import pipeline
import torch
from pprint import pprint


## Natural language processing tasks

### An example from a previous generation of language models: GPT-2

In [None]:
'''
Here we create our first pipeline with the transformers library.
We import set_seed and set a seed for reproducibility.
We create the pipeline and name it [generator]; it uses the model [gpt2] and the task [text-generation].
We then call the pipeline with the text "I am a unicorn in a financial office," and ask for 5 different sequences.

'''

from transformers import set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("I am a unicorn in a financial office,", max_length=60, num_return_sequences=5)


In [None]:
generator("To bake cookies I need,", max_length=60, num_return_sequences=5)


In [None]:
generator("I want to kill a kitten,", max_length=60, num_return_sequences=5)


### Text classification

In [None]:
from transformers import pipeline

# This model is a `zero-shot-classification` model.
# It will classify text, except you are free to choose any label you might imagine
classifier = pipeline(model="facebook/bart-large-mnli")
classifier(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)

### Token classification

An example of a token classification (named entity recognition) model is [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER).

## Build a sentiment analysis classifier

### Instantiate a pipeline

A pipeline is composed of a tokenizer and a model.

In [None]:
classifier = pipeline("sentiment-analysis")

We start by creating a "Sentiment Analysis" **classifier** using the pipeline function provided by the Hugging Face Transformers library. This function allows us to easily use pre-trained models for various natural language processing (NLP) tasks, like sentiment analysis.

### Run the classifier

In [None]:
results = classifier("This is cool")
results

The model takes this text as input and predicts the sentiment associated with it.
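
Under the hood, a classification model outputs one raw score (a logit) per label, and the pipeline typically turns these into probabilities with a softmax before reporting the best label. The sketch below illustrates only that last step. The logit values and the helper name `softmax` are made up for illustration; the label names are those of the default English sentiment model.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["NEGATIVE", "POSITIVE"]
logits = [-2.0, 2.0]  # hypothetical raw scores, not real model output

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])

# Same shape as one entry of the pipeline output
result = {"label": labels[best], "score": probs[best]}
print(result)
```

A score near 0.5 therefore means the two logits are almost equal, that is, the model is undecided.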

See the pipeline [documentation](https://huggingface.co/docs/transformers/main_classes/pipelines) on Hugging Face.

Your turn: change the sentence so that the score gets as close to 0.5 (50%) as you can.

### Multiple inputs

In [None]:
# We give a list to the classifier now
results = classifier(["NLP is nice", "I don't like NLP"])
results
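
The pipeline returns one dictionary per input, in the same order as the input list. Here is a small sketch of how texts and predictions can be paired up for display. The `example_results` values are placeholders with the same shape as the pipeline output, not real predictions.

```python
texts = ["NLP is nice", "I don't like NLP"]

# Placeholder output: one dict with "label" and "score" per input.
# The scores are invented for illustration.
example_results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.98},
]

# zip keeps each text next to its own prediction
paired = [
    {"text": text, "label": res["label"], "score": round(res["score"], 2)}
    for text, res in zip(texts, example_results)
]

for row in paired:
    print(f'{row["label"]:<9} {row["score"]:.2f}  {row["text"]}')
```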

**Exercise:**

Add different text inputs with varying sentiments, run it, check the model's sentiment predictions, and explore how it assigns labels.

### Use a specific model

By default, the transformers library uses a DistilBERT model for the sentiment-analysis pipeline we created. Let's change this and work with another model.

In [None]:
second_classifier = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

In [None]:
second_classifier("I am happy")
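
Unlike the default model, this one predicts a star rating: its labels range from `1 star` to `5 stars`. If a coarse negative/neutral/positive reading is wanted, one possible mapping is sketched below. The helper `stars_to_sentiment`, the thresholds and the sample prediction are assumptions for illustration only.

```python
def stars_to_sentiment(label):
    """Map a label such as '4 stars' to a coarse sentiment (illustrative thresholds)."""
    stars = int(label.split()[0])  # '4 stars' -> 4
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

# Hypothetical prediction with the same structure as the pipeline output
prediction = {"label": "5 stars", "score": 0.75}
print(stars_to_sentiment(prediction["label"]))
```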

**Exercise:**

Find more models on the Hugging Face [hub](https://huggingface.co/models?sort=trending).

### Tip: model cards


Model cards provide information about a model: code examples, demos and, most of the time, details about how the model was trained.
Here is the model card [for our second classifier](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment).

## Tokenizer

### What is a tokenizer

- Tokenization is the process of breaking down text into smaller **units** called **tokens**. To process text, the computer first needs to transform it into numbers.

- Tokens are the basic building blocks used by Transformers models to understand and process text.

- Tokens can represent **words, subwords, or even individual characters**, depending on the model's vocabulary.
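
The three granularities can be compared on a single word. The sketch below uses a tiny made-up vocabulary and a greedy longest-match rule in the spirit of WordPiece. It is a simplified illustration, not the tokenizer of any particular model, and `subword_tokenize` and `toy_vocab` are hypothetical names.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match split of one word; continuation pieces start with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining piece first, then shrink it
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no known piece covers this position
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"token", "##ization", "##s", "great"}  # hypothetical vocabulary

text = "tokenization"
word_tokens = text.split()          # one token for the whole word
char_tokens = list(text)            # one token per character
subword_tokens = subword_tokenize(text, toy_vocab)

print(word_tokens, subword_tokens, len(char_tokens))
```

Subword tokens sit between the two extremes: the vocabulary stays small, yet rare words can still be represented as a sequence of known pieces.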

![Pipeline](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

Image [source](https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt)

### Instantiate a tokenizer

In [None]:
from transformers import BertTokenizer

model = "nlptown/bert-base-multilingual-uncased-sentiment"

tokenizer = BertTokenizer.from_pretrained(model)


With `from_pretrained`, we load the pre-trained tokenizer that matches the checkpoint name stored in `model`.

We now pass our tokenizer to the pipeline:


In [None]:
new_classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

In [None]:
new_classifier("I am happy")


## Tokenization

A token is a value taken from a **vocabulary list**.

A vocabulary list is the set of words (or word pieces) that the tokenizer knows.
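
In practice a vocabulary maps each token to an integer id, and tokens missing from it fall back to a special unknown token. Here is a toy sketch with an invented five-entry vocabulary; real vocabularies hold tens of thousands of entries, and `tokens_to_ids` is a hypothetical helper.

```python
# Hypothetical vocabulary: token -> id
toy_vocab = {"[UNK]": 0, "nlp": 1, "is": 2, "great": 3, "!": 4}

def tokens_to_ids(tokens, vocab):
    """Look up each token, using the [UNK] id for anything not in the vocabulary."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(tok, unk_id) for tok in tokens]

ids_known = tokens_to_ids(["nlp", "is", "great", "!"], toy_vocab)
ids_unknown = tokens_to_ids(["nlp", "is", "fantastic"], toy_vocab)  # "fantastic" is unknown
print(ids_known, ids_unknown)
```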

## Create tokens

### Split method

In [None]:
tokenized_text = "NLP is great".split()
print(tokenized_text)

### Use a tokenizer

In [None]:
sequence = "NLP is great!"
tokens = tokenizer.tokenize(sequence)

print(tokens)

### Try another tokenizer

In [None]:
from transformers import XLNetTokenizer


another_tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased")
new_tokens = another_tokenizer.tokenize(sequence)


print(f"Tokens: {new_tokens}\n")

## Input IDs

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

More on [tokenizers.](https://huggingface.co/docs/transformers/en/tokenizer_summary)

## Padding and truncation

Language models work with **tensors**, so all sequences in a batch need to be **the same length**. Shorter sequences are padded and overly long ones are truncated:

```
padding=True and truncation=True
```
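
To make the two options concrete, here is a plain-Python sketch of what padding and truncation do to a batch of token ids. It ignores special tokens, and the function name `pad_and_truncate`, the ids and the `pad_id` value are assumptions for illustration; the real tokenizer handles all of this internally.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three sequences of different lengths
toy_batch = [[5, 6, 7], [8, 9], [10, 11, 12, 13, 14, 15]]
padded = pad_and_truncate(toy_batch, max_length=4)
print(padded)
```

The attention mask lets the model ignore the padded positions, so padding does not change the prediction.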

In [None]:
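sequences = ["NLP is great!",
             "All I need is two sentences."]

# Tokens and ids of the first sequence only, without special tokens
tokens = tokenizer.tokenize(sequences[0])
print(f"Tokens: {tokens}\n")

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# Calling the tokenizer directly adds special tokens, pads and truncates the whole batch
batch = tokenizer(sequences, padding=True, truncation=True, max_length=512, return_tensors="pt")  # "pt" for PyTorch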

**Question**:
What are the ```'101'``` and ```'102'``` in the token list?

In [None]:
pprint(batch)

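The tokenizer returns a dictionary-like object. For this BERT tokenizer it contains the keys ```'input_ids'```, ```'token_type_ids'``` and ```'attention_mask'```, each holding one tensor.
The input ids are the vocabulary ids of the tokens, and the attention mask marks real tokens with 1 and padding with 0.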

# Dataset

### What does a dataset look like?

## Load a dataset from the hub

In [None]:
from datasets import load_dataset

dataset = load_dataset("carblacac/twitter-sentiment-analysis", split="train")

In [None]:
dataset

The labels here are stored in the ```'feeling'``` column.


In [None]:
dataset[0]

In [None]:
dataset["text"]

In [None]:
dataset.info

In [None]:
tokenizer(dataset[0]["text"])


In [None]:
def tokenization(example):
    return tokenizer(example["text"])

m_dataset = dataset.map(tokenization, batched=True)
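
With `batched=True`, `map` calls the function on a dictionary of lists (one list per column) rather than on a single example, which lets the tokenizer process many texts at once. Below is a plain-Python sketch of that calling convention. `toy_map`, `toy_tokenization` and the sample rows are invented for illustration and only stand in for the real `datasets` machinery.

```python
def toy_tokenization(batch):
    """Receives {'text': [str, ...]} and returns new columns of the same length."""
    return {"n_words": [len(text.split()) for text in batch["text"]]}

def toy_map(columns, function, batch_size=2):
    """Apply `function` to slices of the columns and merge the new columns in.

    Simplification: assumes `function` only returns column names that do not exist yet.
    """
    n_rows = len(next(iter(columns.values())))
    output = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            output.setdefault(name, []).extend(values)
    return output

toy_columns = {"text": ["NLP is great", "so tired today", "hello"], "feeling": [1, 0, 1]}
mapped = toy_map(toy_columns, toy_tokenization)
print(mapped["n_words"])
```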

In [None]:
m_dataset

In [None]:
m_dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "feeling"])
m_dataset.format['type']

In [None]:
import pandas as pd

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(m_dataset)

# Display the columns of the DataFrame
print(df.columns)

### Bonus: Token classification code

In [None]:
from datasets import load_dataset

wnut = load_dataset("wnut_17")

In [None]:
label_list = wnut["train"].features["ner_tags"].feature.names
label_list

In [None]:
id2label = {
    0: "O",
    1: "B-corporation",
    2: "I-corporation",
    3: "B-creative-work",
    4: "I-creative-work",
    5: "B-group",
    6: "I-group",
    7: "B-location",
    8: "I-location",
    9: "B-person",
    10: "I-person",
    11: "B-product",
    12: "I-product",
}
label2id = {
    "O": 0,
    "B-corporation": 1,
    "I-corporation": 2,
    "B-creative-work": 3,
    "I-creative-work": 4,
    "B-group": 5,
    "I-group": 6,
    "B-location": 7,
    "I-location": 8,
    "B-person": 9,
    "I-person": 10,
    "B-product": 11,
    "I-product": 12,
}
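
The two dictionaries can also be derived from the list of label names instead of being typed by hand. The sketch below hard-codes the same 13 names so that it runs on its own; in the notebook, `label_list` from the earlier cell could be used in place of `label_names`. The `_auto` suffix is only there to avoid overwriting the dictionaries defined above.

```python
label_names = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

# The position of a name in the list is its label id
id2label_auto = dict(enumerate(label_names))
label2id_auto = {name: idx for idx, name in id2label_auto.items()}

print(id2label_auto[9], label2id_auto["B-product"])
```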

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id
)

In [None]:
text = "The Golden State Warriors are an American professional basketball team based in San Francisco."
classifier = pipeline("ner", model="stevhliu/my_awesome_wnut_model")
classifier(text)