# Sentiment analysis pipeline with the transformers library

## Program

- What is natural language processing, and what are classification tasks? 😀
- NLP tasks

### Build a sentiment classifier
- Load and explore your data
- Text preprocessing
- Load a model in the code environment
- Build a classifier step by step with a pre-trained model
- Run a classification task: sentiment analysis on a data sample


In [None]:
!pip install transformers datasets sentencepiece -q

In [None]:
'''
We import the transformers pipeline, torch and pprint
'''

from transformers import pipeline
import torch
from pprint import pprint


## Natural language processing tasks

### An example of a previous-generation language model: GPT-2

In [None]:
'''
Here we create our first pipeline with the transformers library
'''

from transformers import set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)

In [None]:
generator("I am a unicorn in a financial office,", max_length=20, num_return_sequences=5)

In [None]:
generator("To bake cookies I need,", max_length=25, num_return_sequences=2)

In [None]:
generator("I don't like cats,", max_length=20, num_return_sequences=5)

In [None]:
#generator("...", max_length=.., num_return_sequences=..)

In [None]:
#generator()

### Labeling

In [None]:
from transformers import pipeline

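# Name the task explicitly: this model is used for zero-shot classification
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")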

In [None]:
results = classifier(
    "I have a problem with my iphone that needs to be resolved asap!!",
    candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"],
)

for i, score in enumerate(results['scores']):
  results['scores'][i] = round(score, 2)

In [None]:
#show results:

### Sentiment classification

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
results = classifier("This conference is amazing!")

In [None]:
#print results:

In [None]:
#change the sentence and print the results:

## Build a sentiment analysis classifier

### Instantiate a pipeline

### Run the classifier

In [None]:
results = classifier("This is cool")
results

### Multiple input

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
results = classifier(["I am happy", "I am sad"])

for result in results:
  result['score'] = round(result['score'], 2)


In [None]:
#rewrite the code to instantiate a pipeline and classify two new sentences:

### Use a specific model

By default, the transformers library uses a DistilBERT model for the sentiment-analysis pipelines we have created. Let's change this and work with another model.

In [None]:
#First, explore this model card page and try some sentences: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment

Next, let's import this model into our code:

In [None]:
model = "nlptown/bert-base-multilingual-uncased-sentiment"
second_classifier = pipeline("sentiment-analysis", model=model)

In [None]:
second_classifier("I am happy")

In [None]:
#what is different?

In [None]:
#import a new model
#model = ...

### Model cards


Model cards provide information about a model, code examples, demos and, most of the time, details about how the model was trained.

## Tokenizer

### What is a tokenizer

- Tokenization is the process of breaking down text into smaller **units** called **tokens**. To process text, a computer first needs to transform it into numbers.

- Tokens are the basic building blocks used by Transformers models to understand and process text.

- Tokens can represent **words, subwords, or even individual characters**, depending on the model's vocabulary.
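To make the three granularities concrete, here is a minimal sketch in plain Python. It is not the transformers implementation: `TOY_VOCAB`, `wordpiece_tokenize` and `toy_tokenize` are hypothetical names, and the greedy longest-match rule only imitates the WordPiece style used by BERT tokenizers, where `##` marks a piece that continues a word.

```python
# Hypothetical toy vocabulary; real models ship vocabularies with tens of thousands of entries.
TOY_VOCAB = {"we", "are", "at", "a", "hack", "##ing", "conference", ".", "[UNK]"}


def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subword tokens by greedy longest match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, put a space before each period, then split words into subwords."""
    words = text.lower().replace(".", " .").split()
    tokens = []
    for word in words:
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


word_level = "We are at a hacking conference.".split()
subword_level = toy_tokenize("We are at a hacking conference.")
char_level = list("hacking")

print(word_level)     # whole words, punctuation still attached
print(subword_level)  # "hacking" is split into "hack" + "##ing"
print(char_level)     # one token per character
```

With this toy vocabulary, a word the vocabulary does not cover at all collapses to `[UNK]`, which is one reason real vocabularies include many small subword pieces.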

### Instantiate a tokenizer

In [None]:
from transformers import BertTokenizer

model = "nlptown/bert-base-multilingual-uncased-sentiment"

tokenizer = BertTokenizer.from_pretrained(model)


We add our tokenizer to our pipeline:


In [None]:
new_classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

In [None]:
new_classifier("I am happy")


## Tokenization

A token is a value extracted from a **vocabulary list**.

A vocabulary list is the set of words, subwords or characters that the model knows.
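As an illustration, a vocabulary can be pictured as a lookup table from tokens to integer IDs. The sketch below uses a hypothetical eight-entry vocabulary and a hypothetical helper `tokens_to_ids`; real tokenizers load their vocabulary from the model files, and their IDs will differ.

```python
# Hypothetical toy vocabulary: the position of each token in the list is its ID.
vocab_list = ["[PAD]", "[UNK]", "we", "are", "at", "a", "conference", "."]
token_to_id = {token: index for index, token in enumerate(vocab_list)}


def tokens_to_ids(tokens, mapping, unk_token="[UNK]"):
    """Look up each token; tokens missing from the vocabulary map to the unknown ID."""
    unk_id = mapping[unk_token]
    return [mapping.get(token, unk_id) for token in tokens]


# "party" is not in the toy vocabulary, so it receives the ID of "[UNK]".
toy_ids = tokens_to_ids(["we", "are", "at", "a", "party", "."], token_to_id)
print(toy_ids)
```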

## Create tokens

### Split method

In [None]:
tokenized_text = "We are at a hacking conference.".split()
print(tokenized_text)

In [None]:
#tokenize another sentence with split method.

### Use a tokenizer

In [None]:
sequence = "We are at a hacking conference."
tokens = tokenizer.tokenize(sequence)

print(tokens)

In [None]:
#What is different?

In [None]:
#create a new sequence and generate tokens for this sequence

### Try another tokenizer

In [None]:
from transformers import XLNetTokenizer


another_tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased")
new_tokens = another_tokenizer.tokenize(sequence)


In [None]:
print(f"Tokens: {new_tokens}\n")

In [None]:
#What is different?

## Input IDs

Remember our sentence: "We are at a hacking conference." Let's look at the token IDs for this sentence.

In [None]:
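'''
`tokens` still holds the output of the BERT tokenizer from "Use a tokenizer".
The tokens with the '▁' marker come from the XLNet tokenizer and are stored in `new_tokens`:
['▁We', '▁are', '▁at', '▁a', '▁hacking', '▁conference', '.']
'''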

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

In [None]:
# Tokenize another sentence and see the token ids:

# your_sentence = ""
# your_tokens =
# your_ids =

In [None]:
# @title
your_sentence = "We are at a hackers party."
your_tokens = tokenizer.tokenize(your_sentence)
your_ids = tokenizer.convert_tokens_to_ids(your_tokens)

In [None]:
# @title
print(your_tokens)
print(your_ids)

## Padding and truncation

Language models work with **tensors**, so all sequences in a batch need to be **the same length**.

```
padding=True and truncation=True
```
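Conceptually, these two options shorten sequences that are too long and fill up the shorter ones. The sketch below shows the idea in plain Python. It is a simplification, not the tokenizer's code: `pad_and_truncate` is a hypothetical helper, the token IDs are made up, `pad_id=0` is an assumption, and special tokens are ignored.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each ID list to max_length, then pad all lists to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids = []
    attention_mask = []
    for seq in truncated:
        padding = target - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for three sentences of different lengths.
toy_batch = pad_and_truncate([[5, 8, 3], [5, 8, 9, 4, 7, 3], [5, 3]], max_length=5)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

After this step every row has the same length, so the batch can be stored as one rectangular tensor.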

In [None]:
sentences = ["A white pony.", "A white pony in the garden."]

batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")  # "pt" returns PyTorch tensors

In [None]:
pprint(batch)

In [None]:
#What are the ```'101'``` and ```'102'``` in the token list?


In [None]:
#what are the zeros?

In [None]:
#Try out with two new sentences

Note that it returns a dictionary-like object whose keys include ```'input_ids'``` and ```'attention_mask'``` (this BERT tokenizer also adds ```'token_type_ids'```), each holding one tensor.
Each value in ```'input_ids'``` is the ID of a token in the tokenizer's vocabulary.

# Dataset

### What does a dataset look like?

## Load a dataset from the hub

In [None]:
from datasets import load_dataset

dataset = load_dataset("carblacac/twitter-sentiment-analysis", split="train")

In [None]:
dataset

The labels here are stored in the ```'feeling'``` column.


In [None]:
#find and load another dataset from huggingface hub: https://huggingface.co/
#dataset2 =

In [None]:
#What are the labels in this one?

In [None]:
#back to our original dataset
dataset[0]

In [None]:
sample = dataset["text"][:10]

In [None]:
#print the 22nd tweet

In [None]:
dataset.info

In [None]:
#How many tweets are in this dataset?
#What is the title of the dataset?
#When was it published?

In [None]:
import pandas as pd

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(dataset)

# Display the columns of the DataFrame
print(df.columns)

## Tokenize the dataset

In [None]:
tokenizer(dataset[0]["text"])


In [None]:
#what's new?

In [None]:
def tokenization(example):
    return tokenizer(example["text"])

tokenized_dataset = dataset.map(tokenization, batched=True)

In [None]:
tokenized_dataset

In [None]:
#what can I say about this new dataset?

In [None]:
tokenized_dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "feeling"])
tokenized_dataset.format['type']

In [None]:
#What did I just do? :)

Now your dataset is ready for training!

## Create sample of the dataset

In [None]:
import pandas as pd

In [None]:
df = dataset.to_pandas()
# Copy the slice so that adding a column later does not trigger a SettingWithCopyWarning
sample = df.head(10).copy()

In [None]:
sample

In [None]:
df

In [None]:
#Show the first 4 tweets of this dataframe

In [None]:
# @title
df[:4]

In [None]:
#Show only the labels for the first 4 tweets

In [None]:
# @title
df[:4]['feeling']

## Classify sentiment

In [None]:
classifier = pipeline("sentiment-analysis")

def predict_sentiment(text):
  result = classifier(text)[0]
  return result['label']

sentiment = []
for text in sample['text']:
  sentiment.append(predict_sentiment(text))

In [None]:
sample['predicted_sentiment'] = sentiment
pprint(sentiment)

In [None]:
#let's compare the predictions with the actual labels
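One possible way to do this comparison is to compute the accuracy, that is, the share of predictions that match the reference labels. The sketch below rests on two assumptions that should be checked against the data: the pipeline returns the strings `'POSITIVE'` and `'NEGATIVE'`, and the `feeling` column uses 1 for positive and 0 for negative. The names `LABEL_TO_INT`, `accuracy`, `toy_predictions` and `toy_labels` are hypothetical.

```python
# Assumed mapping from pipeline label strings to the dataset's integer labels.
LABEL_TO_INT = {"NEGATIVE": 0, "POSITIVE": 1}


def accuracy(predicted_labels, true_labels, mapping=LABEL_TO_INT):
    """Return the fraction of predictions that match the reference labels."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("Both lists must have the same length.")
    matches = sum(
        mapping[pred] == true for pred, true in zip(predicted_labels, true_labels)
    )
    return matches / len(true_labels)


# Hypothetical values standing in for `sentiment` and `sample['feeling']`.
toy_predictions = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
toy_labels = [1, 0, 1, 1]
print(accuracy(toy_predictions, toy_labels))
```

In the notebook, the same function could be called as `accuracy(sentiment, sample['feeling'].tolist())`, provided the label assumptions above hold.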

![Pipeline](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

# Thanks!

## Questions?