# Quick tour of the Hugging Face Transformers library

Tutorial link: https://huggingface.co/transformers/quicktour.html

## Simple sentiment classifiers

In [1]:
from transformers import pipeline

Create a classifier for sentiment analysis. Since no model is named, `pipeline()` downloads a default pretrained model and its tokenizer.

In [2]:
clas = pipeline('sentiment-analysis')

Classify texts by sentiment.

In [3]:
clas('We are very happy to welcome you here')

[{'label': 'POSITIVE', 'score': 0.9998410940170288}]

In [4]:
clas('We are very sorry to lose you')

[{'label': 'NEGATIVE', 'score': 0.9889100193977356}]

In [5]:
clas('We are very happy get the f*** out of here')

[{'label': 'POSITIVE', 'score': 0.9998367428779602}]

In [6]:
clas('Get the f*** out of here')

[{'label': 'NEGATIVE', 'score': 0.8999497294425964}]

We can also pass a model name directly to `pipeline()`.

The following classifier handles English, French, Dutch, German, Italian, and Spanish, and predicts a rating from 1 to 5 stars.

In [7]:
clas = pipeline('sentiment-analysis', 
   model = 'nlptown/bert-base-multilingual-uncased-sentiment'
)

In [8]:
clas('Tout va bien')

[{'label': '5 stars', 'score': 0.42690587043762207}]

In [9]:
clas('Je m\'en fous')

[{'label': '3 stars', 'score': 0.2321930080652237}]

In [10]:
clas('Va te faire foutre')

[{'label': '1 star', 'score': 0.32393142580986023}]

## Model object and associated tokenizer 

In [11]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

Use the `from_pretrained()` method to download the model and its tokenizer.

In [12]:
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Instantiate a classifier from the loaded model and tokenizer, then run a simple classification.

In [35]:
clas = pipeline(
    'sentiment-analysis', 
    model = pt_model, 
    tokenizer = tokenizer
)

clas('All he had is money')

[{'label': 'NEGATIVE', 'score': 0.9981979131698608}]

<span style="color:red;">Attention!</span> If the pretrained model was trained on data that differs substantially from yours, you will likely need to <b>fine-tune</b> it on your own data.

Now let us look at what happens under the hood when the tokenizer and the model are applied.

In [13]:
inputs = tokenizer('We are very happy to welcome you here')
inputs

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 6160, 2017, 2182, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

We can also pass a list of texts to the tokenizer. To batch them into a single tensor, we pad the texts to the same length and truncate them to at most `max_length` tokens, which must not exceed the maximum length the model accepts.

In [14]:
texts = ['We are very happy to welcome you here', 'Get the f*** outta here']

pt_batch = tokenizer(
    texts, 
    padding = True, 
    truncation = True, 
    max_length = 300, 
    return_tensors = 'pt'
)

for key, val in pt_batch.items(): 
    print(f'{key}: {val.numpy().tolist()}')

input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 6160, 2017, 2182, 102], [101, 2131, 1996, 1042, 1008, 1008, 1008, 24955, 2182, 102]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
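
In the batch above both sentences happen to tokenize to ten ids, so no padding is visible. The following plain-Python sketch illustrates what padding and truncation do conceptually. The helper `pad_and_truncate` is hypothetical, the short id lists are toy inputs, and the pad id of 0 is an assumption; a real tokenizer supplies its own pad id and also preserves the closing special token when truncating, which this sketch ignores.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Hypothetical helper: mimic padding to the longest sequence and truncation.

    pad_id=0 is an assumption for illustration; a real tokenizer supplies its own.
    """
    # Truncate every sequence to at most max_length ids.
    truncated = [seq[:max_length] for seq in sequences]
    # Pad to the longest sequence in the batch.
    target = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in truncated]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


batch = pad_and_truncate([[101, 2057, 2024, 102], [101, 2182, 102]], max_length=300)
print(batch['input_ids'])
print(batch['attention_mask'])
```

The shorter sequence gains one pad id at the end, and its attention mask has a 0 in that position so the model can skip it.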


Once the tokenizer has produced the numerical inputs, we pass them directly to the model.

In PyTorch, the inputs are a dictionary, so we unpack it into keyword arguments with `**`.

In [16]:
# All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model
# *before* the final activation function (like SoftMax), since this final activation
# function is often fused with the loss.

pt_outputs = pt_model(**pt_batch)
pt_outputs

(tensor([[-4.2170,  4.5300],
         [ 1.7232, -1.5170]], grad_fn=<AddmmBackward>),)
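
As a reminder of what the `**` in `pt_model(**pt_batch)` does, here is a minimal sketch in plain Python. The function `toy_forward` is a hypothetical stand-in for a model's forward method, not part of the library; it only reports which arguments it received.

```python
def toy_forward(input_ids, attention_mask=None, labels=None):
    """Hypothetical stand-in for a model's forward method; it only reports its arguments."""
    return {
        'n_sequences': len(input_ids),
        'has_mask': attention_mask is not None,
        'has_labels': labels is not None,
    }


batch = {
    'input_ids': [[101, 2057, 102], [101, 2182, 102]],
    'attention_mask': [[1, 1, 1], [1, 1, 1]],
}

# These two calls are equivalent: ** turns each dict key into a keyword argument.
unpacked = toy_forward(**batch)
explicit = toy_forward(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
print(unpacked == explicit)
```

Because the tokenizer's keys match the model's parameter names, the whole batch can be forwarded in one call, and extra keyword arguments such as `labels` can be added alongside it.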

In [18]:
# Now we run the final activation on the previous outputs 
import torch.nn.functional as F
pt_predictions = F.softmax(pt_outputs[0], dim = -1)
pt_predictions

tensor([[1.5890e-04, 9.9984e-01],
        [9.6232e-01, 3.7682e-02]], grad_fn=<SoftmaxBackward>)
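
To make the softmax step concrete, the sketch below recomputes the probabilities in plain Python from the logits printed earlier. The logits are copied by hand from the rounded output, so the results should agree with `pt_predictions` only up to rounding.

```python
import math


def softmax(logits):
    """Softmax over one row of logits, shifted by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above (rounded to four decimals).
logits = [[-4.2170, 4.5300], [1.7232, -1.5170]]
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the larger logit in a row gets the larger probability: index 1 (POSITIVE) for the first sentence and index 0 (NEGATIVE) for the second.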

We can also provide labels to the model. It then returns a tuple containing the loss and the logits, that is, the same pre-softmax outputs as before.

In [20]:
import torch 
pt_outputs = pt_model(**pt_batch, labels = torch.tensor([1,0]))
pt_outputs

(tensor(0.0193, grad_fn=<NllLossBackward>),
 tensor([[-4.2170,  4.5300],
         [ 1.7232, -1.5170]], grad_fn=<AddmmBackward>))
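
The reported loss can be checked by hand. The sketch below assumes the model uses a mean-reduced cross-entropy loss over the logits, which the `NllLossBackward` in the output suggests but which is not confirmed here. The logits are copied from the rounded output above.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true labels, computed from raw logits."""
    losses = []
    for row, label in zip(logits, labels):
        # log-sum-exp with a max shift for numerical stability
        shift = max(row)
        log_norm = shift + math.log(sum(math.exp(x - shift) for x in row))
        # -log softmax(row)[label] = log_norm - row[label]
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


logits = [[-4.2170, 4.5300], [1.7232, -1.5170]]
labels = [1, 0]  # same labels as passed to the model above
loss = cross_entropy(logits, labels)
print(round(loss, 4))
```

Under that assumption the result agrees with the `0.0193` reported by the model up to rounding. The loss is small because both labels match the class the model already prefers.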

Once a model and its tokenizer are fine-tuned, we can <b>save</b> them to disk.

In [23]:
tokenizer.save_pretrained('./')
pt_model.save_pretrained('./')

# Once the tokenizer and model are saved,
# we can reload them next time with `from_pretrained()`

## Accessing the code

(Skip this part)

## Customizing the model 

(Skip this part)