# Using HuggingFace for Sentiment Analysis: An Introduction
---

## Introduction
For this tutorial, I'll be using Hugging Face to perform sentiment analysis. I will explore the model hub and then fine-tune the chosen model to boost performance.

In [21]:
# Importing the needed modules
from transformers import pipeline
import torch
import torch.nn.functional as F

The pipeline is given a task. In this case we use the 'sentiment-analysis' task, which returns a text classification pipeline.

Sentiment analysis is a form of text classification, as documented on the Hugging Face website: [huggingface.co/transformers/main_classes/pipelines.html]

In [3]:
# Specifying the sentiment analysis task to be performed with the pipeline
sentiment_classifier_1 = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


**Let's try some samples!**

In [5]:
test_1 = sentiment_classifier_1("Oh, if it isn't the thief of the century.")
print(test_1)

[{'label': 'NEGATIVE', 'score': 0.9808171987533569}]


**INSIGHT:**
As you can see, the result is a list containing one dictionary with two keys, 'label' and 'score'. Here the sentence is classified as negative with a score of 0.9808171987533569.
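
To make the structure of that result concrete, here is a minimal sketch of how the label and score could be read out of it. The result is copied from the output above rather than recomputed, and the names `first_result` and `summary` are only illustrative.

```python
# Result copied from the output above, so no model download is needed here
test_1 = [{'label': 'NEGATIVE', 'score': 0.9808171987533569}]

# The pipeline returns a list with one dictionary per input text
first_result = test_1[0]
label = first_result["label"]
score = first_result["score"]

# Formatting the score as a percentage for readability
summary = f"{label} ({score:.2%})"
print(summary)
```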

**Let's use the classifier for a list of sentences.**

In [8]:
test_2 = sentiment_classifier_1([
    "Oh, if it isn't the thief of the century",
    "Fly! You fools!",
    "I can do this all day"
])

# Printing the result for each sentence
for res in test_2:
    print(res)

{'label': 'NEGATIVE', 'score': 0.9774008393287659}
{'label': 'NEGATIVE', 'score': 0.9614507555961609}
{'label': 'POSITIVE', 'score': 0.9981259703636169}
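
Since the results come back in the same order as the inputs, they can be paired with their sentences. The sketch below assumes the outputs shown above (copied in by hand) and uses a hypothetical confidence cut-off of 0.97, which is not part of the library, to flag less certain predictions.

```python
sentences = [
    "Oh, if it isn't the thief of the century",
    "Fly! You fools!",
    "I can do this all day",
]

# Results copied from the output above
results = [
    {'label': 'NEGATIVE', 'score': 0.9774008393287659},
    {'label': 'NEGATIVE', 'score': 0.9614507555961609},
    {'label': 'POSITIVE', 'score': 0.9981259703636169},
]

# Hypothetical cut-off chosen only for illustration
threshold = 0.97

rows = []
uncertain = []
for sentence, res in zip(sentences, results):
    # Pairing each sentence with its predicted label and score
    rows.append((sentence, res["label"], round(res["score"], 4)))
    if res["score"] < threshold:
        uncertain.append(sentence)

for row in rows:
    print(row)
print("Below threshold:", uncertain)
```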


## Using a specific model and tokeniser

In [9]:
# This model name was chosen from the Hugging Face model hub
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

In [11]:
# Specifying the model name for the classifier
sentiment_classifier_2 = pipeline("sentiment-analysis", model=model_name)

# Testing the new classifier on the list of sentences (test_2)
test_2 = sentiment_classifier_2([
    "Oh, if it isn't the thief of the century",
    "Fly! You fools!",
    "I can do this all day"
])

for res in test_2:
    print(res)

{'label': 'NEGATIVE', 'score': 0.9774008393287659}
{'label': 'NEGATIVE', 'score': 0.9614507555961609}
{'label': 'POSITIVE', 'score': 0.9981259703636169}


**A different approach to specifying a model and tokeniser**

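Here we use generic classes, one for the tokeniser and one for sequence classification, which load the right tokeniser and model for a given model name. Loading them yourself gives you some more functionality than the pipeline alone.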

In [12]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [13]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokeniser = AutoTokenizer.from_pretrained(model_name)

In [14]:
# Using the specified model and tokeniser
sentiment_classifier_3 = pipeline("sentiment-analysis", model=model, tokenizer=tokeniser)

# Testing this classifier on the list of examples
test_2 = sentiment_classifier_3([
    "Oh, if it isn't the thief of the century",
    "Fly! You fools!",
    "I can do this all day"
])

for res in test_2:
    print(res)

{'label': 'NEGATIVE', 'score': 0.9774008393287659}
{'label': 'NEGATIVE', 'score': 0.9614507555961609}
{'label': 'POSITIVE', 'score': 0.9981259703636169}


**Exploring the tokeniser**

In [16]:
tokens = tokeniser.tokenize("Oh, if it isn't the thief of the century")
token_ids = tokeniser.convert_tokens_to_ids(tokens)
input_ids = tokeniser("Oh, if it isn't the thief of the century")

print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")
print(f"Input IDs: {input_ids}")

Tokens: ['oh', ',', 'if', 'it', 'isn', "'", 't', 'the', 'thief', 'of', 'the', 'century']
Token IDs: [2821, 1010, 2065, 2009, 3475, 1005, 1056, 1996, 12383, 1997, 1996, 2301]
Input IDs: {'input_ids': [101, 2821, 1010, 2065, 2009, 3475, 1005, 1056, 1996, 12383, 1997, 1996, 2301, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


**INSIGHTS:** 

We get a list of the tokens in the sentence. These are mostly the lowercased words of the sentence, but a word can be split into several tokens (for example, "isn't" becomes 'isn', "'", 't'). **Note:** Punctuation marks in the sentence are also given their own tokens.

The token ids are the numerical representation of the tokens: each id is the index of a token in the tokeniser's vocabulary, which is what the model can work with.

When you look at the input ids, you'll find that they match the token ids, except that there is a _101_ at the beginning and a _102_ at the end. These are the ids of the special [CLS] and [SEP] tokens that the tokeniser adds to mark the start and end of the sequence.

The input ids are what can be passed to the model to do the prediction manually.
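
The relationship between the token ids and the input ids can be checked directly. This is a small sketch using the values printed above, assuming 101 and 102 are the ids of the special start and end tokens ([CLS] and [SEP] in BERT-style vocabularies). The constant names are illustrative.

```python
# Values copied from the tokeniser output above
token_ids = [2821, 1010, 2065, 2009, 3475, 1005, 1056, 1996, 12383, 1997, 1996, 2301]
input_ids = [101, 2821, 1010, 2065, 2009, 3475, 1005, 1056, 1996, 12383, 1997, 1996, 2301, 102]

# Assumed ids of the special start and end tokens
CLS_ID = 101
SEP_ID = 102

# Wrapping the plain token ids in the special tokens rebuilds the input ids
rebuilt = [CLS_ID] + token_ids + [SEP_ID]
print(rebuilt == input_ids)

# Dropping the first and last id recovers the plain token ids
stripped = input_ids[1:-1]
print(stripped == token_ids)
```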

In [17]:
# Checking how convert_ids_to_tokens works
rev_tokens = tokeniser.convert_ids_to_tokens(token_ids)
print(rev_tokens)

['oh', ',', 'if', 'it', 'isn', "'", 't', 'the', 'thief', 'of', 'the', 'century']


**Let's try some manual predictions**

In [32]:
# Providing the input sentences
X_train = [
    "Oh, if it isn't the thief of the century", 
    "Fly! You fools!", 
    "I can do this all day"
]

In [33]:
batch = tokeniser(
    X_train,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
print(batch)

{'input_ids': tensor([[  101,  2821,  1010,  2065,  2009,  3475,  1005,  1056,  1996, 12383,
          1997,  1996,  2301,   102],
        [  101,  4875,   999,  2017, 18656,   999,   102,     0,     0,     0,
             0,     0,     0,     0],
        [  101,  1045,  2064,  2079,  2023,  2035,  2154,   102,     0,     0,
             0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])}
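
The zeros in the batch above come from padding: shorter sentences are filled up to the length of the longest one, and the attention mask marks which positions hold real tokens. The following sketch imitates that behaviour in plain Python. It is not the tokeniser's actual implementation. `pad_batch` is a hypothetical helper, and a padding id of 0 is assumed from the output above.

```python
# Unpadded ids for each sentence, read off the batch output above
encoded = [
    [101, 2821, 1010, 2065, 2009, 3475, 1005, 1056, 1996, 12383, 1997, 1996, 2301, 102],
    [101, 4875, 999, 2017, 18656, 999, 102],
    [101, 1045, 2064, 2079, 2023, 2035, 2154, 102],
]

# Assumed padding id, as seen in the output above
PAD_ID = 0


def pad_batch(sequences, pad_id=PAD_ID):
    """Hypothetical helper: pad every sequence to the longest one and build the mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = []
    attention_mask = []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(encoded)
for ids, mask in zip(input_ids, attention_mask):
    print(ids)
    print(mask)
```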


In [38]:
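# Passing the tokenised batch into our model manually
with torch.no_grad():
    # Supplying labels makes the model also return the loss
    outputs = model(**batch, labels=torch.tensor([0, 0, 1]))
    print(outputs)
    # Softmax turns the logits into class probabilities
    predictions = F.softmax(outputs.logits, dim=1)
    print(predictions)
    # The index of the highest probability is the predicted class id
    label_ids = torch.argmax(predictions, dim=1)
    print(label_ids)
    # Mapping the class ids to their label names
    labels = [model.config.id2label[label_id] for label_id in label_ids.tolist()]
    print(labels)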

SequenceClassifierOutput(loss=tensor(0.0213), logits=tensor([[ 2.1371, -1.6299],
        [ 1.7681, -1.4484],
        [-3.0584,  3.2194]]), hidden_states=None, attentions=None)
tensor([[0.9774, 0.0226],
        [0.9615, 0.0385],
        [0.0019, 0.9981]])
tensor([0, 0, 1])
['NEGATIVE', 'NEGATIVE', 'POSITIVE']
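
To see where these numbers come from, the softmax, argmax and loss can be reproduced by hand from the printed logits. This is a rough sketch in plain Python rather than the model's own code. The logits are the rounded values shown above, so the results only match to about three decimal places, and the `id2label` mapping is assumed from the printed labels.

```python
import math

# Rounded logits copied from the output above
logits = [[2.1371, -1.6299], [1.7681, -1.4484], [-3.0584, 3.2194]]
# The same labels that were passed to the model
true_labels = [0, 0, 1]
# Assumed to mirror model.config.id2label, based on the printed labels
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Turn a row of logits into probabilities that sum to 1."""
    highest = max(row)
    # Subtracting the maximum keeps the exponentials numerically stable
    exps = [math.exp(x - highest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = [softmax(row) for row in logits]

# The predicted class is the index of the largest probability
predicted_ids = [row.index(max(row)) for row in probabilities]
predicted_labels = [id2label[i] for i in predicted_ids]

# Mean cross-entropy: average negative log-probability of the true class
loss = -sum(math.log(p[t]) for p, t in zip(probabilities, true_labels)) / len(true_labels)

print([[round(p, 4) for p in row] for row in probabilities])
print(predicted_labels)
print(round(loss, 4))
```

The first-column probabilities line up with the scores the pipeline returned earlier for the two negative sentences, and the averaged loss is close to the 0.0213 reported in the model output.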
