<a href="https://colab.research.google.com/github/RafaelVieira13/Hugging_Face_Tutorial/blob/main/HuggingFace_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face Crash Course
https://www.youtube.com/watch?v=GSt00_-0ncQ

In [None]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

## 1. Pipelines
* https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html

In [9]:
# Sentiment Analysis using the default model
classifier = pipeline("sentiment-analysis")
results = classifier(["Today I'm feeling sleepy",
                 "Today I'm feeling great"])

for result in results:
  print(result['label'], result['score'])

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


NEGATIVE 0.9985528588294983
POSITIVE 0.9998780488967896


## 2. Model and Tokenization

In [15]:
# Sentiment Analysis using a specific model
model_name = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
results = classifier(["Today I'm feeling sleepy",
                 "Today I'm feeling great"])

for result in results:
  print(result['label'], result['score'])

NEGATIVE 0.9985528588294983
POSITIVE 0.9998780488967896


In [16]:
# Taking a look at the tokens
tokens = tokenizer.tokenize("Today I'm feeling sleepy")
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = tokenizer("Today I'm feeling sleepy")

print(f' Tokens: {tokens}')
print(f'Token IDs: {tokens_ids}')
print(f'Input IDs: {input_ids}')

 Tokens: ['today', 'i', "'", 'm', 'feeling', 'sleepy']
Token IDs: [2651, 1045, 1005, 1049, 3110, 17056]
Input IDs: {'input_ids': [101, 2651, 1045, 1005, 1049, 3110, 17056, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
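
The only difference between `Token IDs` and `Input IDs` in the output above is the pair of special tokens that the tokenizer adds around the sequence. Here is a minimal sketch in plain Python, assuming the BERT uncased vocabulary convention where 101 is `[CLS]` and 102 is `[SEP]`. The constant names `CLS_ID` and `SEP_ID` are illustrative and not part of the library.

```python
# Token IDs copied from the output above (no special tokens yet)
token_ids = [2651, 1045, 1005, 1049, 3110, 17056]

# Assumed special-token IDs for the BERT uncased vocabulary
CLS_ID = 101  # [CLS], marks the start of the sequence
SEP_ID = 102  # [SEP], marks the end of the sequence

# Calling tokenizer(text) wraps the token IDs in the special tokens
input_ids = [CLS_ID] + token_ids + [SEP_ID]

# With a single unpadded sentence, every position is a real token
attention_mask = [1] * len(input_ids)

print(input_ids)
print(attention_mask)
```

With a loaded tokenizer, `tokenizer.cls_token_id` and `tokenizer.sep_token_id` should report the actual values for the model in use.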


In [20]:
X_train = ["Today I'm feeling sleepy",
           "Today I'm feeling great"]

# Creating the batch
batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors="pt")

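# PyTorch classification
with torch.no_grad():
  # labels is optional; when given, the output also includes the cross-entropy loss.
  # Here [1, 0] is the opposite of the predicted classes, hence the large loss.
  outputs = model(**batch, labels=torch.tensor([1,0]))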
  print(outputs)
  predictions = F.softmax(outputs.logits, dim=1)
  print(predictions)
  labels = torch.argmax(predictions, dim=1)
  print(labels)
  labels = [model.config.id2label[label_id] for label_id in labels.tolist()]
  print(labels)

SequenceClassifierOutput(loss=tensor(7.7752), logits=tensor([[ 3.5992, -2.9375],
        [-4.3180,  4.6940]]), hidden_states=None, attentions=None)
tensor([[9.9855e-01, 1.4472e-03],
        [1.2192e-04, 9.9988e-01]])
tensor([0, 1])
['NEGATIVE', 'POSITIVE']
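
To see where the numbers above come from, the softmax, argmax and loss can be reproduced by hand from the printed logits. This is a rough sketch using only the standard library. It assumes the loss is the mean cross-entropy over the batch, and the `softmax` helper is written here for illustration and is not the PyTorch function.

```python
import math

# Logits copied from the output above (rounded to four decimals)
logits = [[3.5992, -2.9375],
          [-4.3180, 4.6940]]
labels = [1, 0]  # the labels passed to the model in the cell above

def softmax(row):
    """Turn a list of logits into probabilities that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

probs = [softmax(row) for row in logits]

# Cross-entropy: negative log-probability of the labelled class, averaged
losses = [-math.log(p[y]) for p, y in zip(probs, labels)]
loss = sum(losses) / len(losses)

# argmax over each row gives the predicted class id
predicted = [row.index(max(row)) for row in probs]

print(probs)
print(predicted)
print(round(loss, 3))
```

The loss is large because the labels `[1, 0]` point at the class the model considers very unlikely in both rows. Swapping them to `[0, 1]` would give a loss close to zero.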


## 3. Saving and Loading

In [21]:
# Define the directory and save the tokenizer and model to it
save_directory = "saved"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# Load the tokenizer and model back from the directory
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForSequenceClassification.from_pretrained(save_directory)

## 4. Model Hub

* https://huggingface.co/models

In [28]:
model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

texts = ["Lack of support",
         "Amazing team",
         "Project delivered on time",
         "Project delivered with a small delay",
         "Project delayed",
         "No opinion from my side"]

batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
print(batch)
with torch.no_grad():
  outputs = model(**batch)
  labels_ids = torch.argmax(outputs.logits, dim=1)
  print(labels_ids)
  labels = [model.config.id2label[label_id] for label_id in labels_ids.tolist()]
  print(labels)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


{'input_ids': tensor([[    0,   574,  2990,     9,   323,     2,     1,     1],
        [    0, 41710,   165,     2,     1,     1,     1,     1],
        [    0, 33347,  2781,    15,    86,     2,     1,     1],
        [    0, 33347,  2781,    19,    10,   650,  4646,     2],
        [    0, 33347,  5943,     2,     1,     1,     1,     1],
        [    0,  3084,  2979,    31,   127,   526,     2,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 0]])}
tensor([0, 2, 2, 2, 0, 1])
['negative', 'positive', 'positive', 'positive', 'negative', 'neutral']
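
The trailing `1`s in each `input_ids` row and the `0`s in `attention_mask` come from `padding=True`, which pads every sequence to the length of the longest one in the batch. Here is a rough sketch of that idea in plain Python, assuming a pad token ID of 1 as seen in the output above. `pad_batch` is a hypothetical helper and not a library function.

```python
def pad_batch(sequences, pad_id):
    """Pad each ID list to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Unpadded IDs for three of the texts, copied from the output above
sequences = [
    [0, 41710, 165, 2],                      # "Amazing team"
    [0, 33347, 2781, 19, 10, 650, 4646, 2],  # "Project delivered with a small delay"
    [0, 3084, 2979, 31, 127, 526, 2],        # "No opinion from my side"
]

input_ids, attention_mask = pad_batch(sequences, pad_id=1)
for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The longest sentence needs no padding, so its mask is all ones, while the shorter ones are filled up to eight positions.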


## 5. Fine-Tuning with Custom Datasets