### STARTING WITH HUGGINGFACE 🤗

In [None]:
%pip install -r requirement.txt

The [pipeline()](https://huggingface.co/docs/transformers/v4.37.0/en/main_classes/pipelines#transformers.pipeline) is the easiest and fastest way to use a pretrained model for inference. You can use the pipeline() out-of-the-box for many tasks across different modalities, some of which are shown in the table below:




|Task|Description|Modality|Pipeline identifier|
|----------|:-------------:|:------:|---------:|
|Text classification|assign a label to a given sequence of text|NLP|`pipeline(task="sentiment-analysis")`|
|Text generation|generate text given a prompt|NLP|`pipeline(task="text-generation")`|
|Summarization|generate a summary of a sequence of text or document|NLP|`pipeline(task="summarization")`|
|Image classification|assign a label to an image|Computer vision|`pipeline(task="image-classification")`|
|Image segmentation|assign a label to each individual pixel of an image (supports semantic, panoptic, and instance segmentation)|Computer vision|`pipeline(task="image-segmentation")`|
|Object detection|predict the bounding boxes and classes of objects in an image|Computer vision|`pipeline(task="object-detection")`|
|Audio classification|assign a label to some audio data|Audio|`pipeline(task="audio-classification")`|
|Automatic speech recognition|transcribe speech into text|Audio|`pipeline(task="automatic-speech-recognition")`|
|Visual question answering|answer a question about the image, given an image and a question|Multimodal|`pipeline(task="vqa")`|
|Document question answering|answer a question about the document, given a document and a question|Multimodal|`pipeline(task="document-question-answering")`|
|Image captioning|generate a caption for a given image|Multimodal|`pipeline(task="image-to-text")`|

Start by creating an instance of pipeline() and specifying a task you want to use it for. We’ll use the pipeline() for sentiment analysis as an example:

In [1]:
from transformers import pipeline

In [2]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


The pipeline() downloads and caches a default pretrained model and tokenizer for sentiment analysis. Now you can use the classifier on your target text:

In [3]:
classifier(["Me gusta aprender conceptos nuevos", "No quiero levantarme temprano los fines de semana"])


[{'label': 'POSITIVE', 'score': 0.9815183281898499},
 {'label': 'NEGATIVE', 'score': 0.9779280424118042}]

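As shown above, the model classifies both sentences correctly, even though this default checkpoint was fine-tuned on English text (SST-2) and the inputs are Spanish.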
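The warning printed when the classifier was created recommends naming a model and revision explicitly. A minimal sketch of that, reusing the model name and revision from the warning. The `build_classifier` helper and its constant names are illustrative and not part of 🤗 Transformers, and the import sits inside the function only so the snippet can be loaded without the library installed:

```python
# Model name and revision copied from the warning message above.
DEFAULT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
DEFAULT_REVISION = "af0f99b"


def build_classifier(model_name=DEFAULT_MODEL, revision=DEFAULT_REVISION):
    """Create a sentiment-analysis pipeline pinned to a model and revision.

    Hypothetical helper for illustration, not a transformers function.
    """
    # Imported here so that defining the helper does not require the library.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_name, revision=revision)
```

Calling `build_classifier()` should give a classifier that behaves like the one above, without relying on whatever default the library picks.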

#### Use another model and tokenizer in the pipeline

The pipeline() can accommodate any model from the [Hub](https://huggingface.co/models), making it easy to adapt the pipeline() for other use-cases. For example, if you’d like a model capable of handling Spanish text, use the tags on the Hub to filter for an appropriate model. The top filtered result returns a multilingual BERT model finetuned for sentiment analysis you can use for Spanish text:

In [4]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

Use [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/v4.37.0/en/model_doc/auto#transformers.AutoModelForSequenceClassification) and [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.0/en/model_doc/auto#transformers.AutoTokenizer) to load the pretrained model and its associated tokenizer:

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Let’s look at the model configuration:

In [6]:
model.config

BertConfig {
  "_name_or_path": "nlptown/bert-base-multilingual-uncased-sentiment",
  "_num_labels": 5,
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "finetuning_task": "sentiment-analysis",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "1 star",
    "1": "2 stars",
    "2": "3 stars",
    "3": "4 stars",
    "4": "5 stars"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "1 star": 0,
    "2 stars": 1,
    "3 stars": 2,
    "4 stars": 3,
    "5 stars": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_to

As we can see, the model has 5 labels, one per star rating from 1 to 5.

Specify the model and tokenizer in the pipeline(), and now you can apply the classifier to Spanish text:

In [9]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier(["Me apasiona viajar por la Patagonia Argentina!", "No me atraen las peliculas de terror"])

[{'label': '5 stars', 'score': 0.5985925793647766},
 {'label': '2 stars', 'score': 0.575228750705719}]
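The star labels are plain strings, so downstream code usually needs to turn them into numbers. A small sketch of one way to do that, using the output above. Both helper functions are hypothetical, and the cut-offs for "positive" and "negative" are our own assumption rather than anything defined by the model:

```python
def stars_to_rating(label: str) -> int:
    """Turn a label such as '4 stars' or '1 star' into the integer 4 or 1."""
    return int(label.split()[0])


def rating_to_polarity(rating: int) -> str:
    """Collapse a 1-5 rating into three classes (cut-offs are an assumption)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"


# Pipeline output copied from the cell above.
results = [
    {"label": "5 stars", "score": 0.5985925793647766},
    {"label": "2 stars", "score": 0.575228750705719},
]

for result in results:
    rating = stars_to_rating(result["label"])
    print(rating, rating_to_polarity(rating), round(result["score"], 3))
```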

Under the hood, the AutoModelForSequenceClassification and AutoTokenizer classes work together to power the pipeline() you used above. 

**An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path.** You only need to select the appropriate AutoClass for your task and its associated preprocessing class.

Let’s return to the example from the previous section and see how you can use the AutoClass to replicate the results of the pipeline().

#### AutoTokenizer

**A tokenizer is responsible for preprocessing text into an array of numbers as inputs to a model.**
There are multiple rules that govern the tokenization process, including how to split a word and at what level words should be split. 

*The most important thing to remember is you need to instantiate a tokenizer with the same model name to ensure you’re using the same tokenization rules a model was pretrained with.*

In [12]:
from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [14]:
encoding = tokenizer("Estoy cansado de esperar")
display(encoding)

{'input_ids': [101, 15195, 10158, 10743, 63102, 10102, 71435, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

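The tokenizer returns a dictionary containing:
- [input_ids](https://huggingface.co/docs/transformers/glossary#input-ids): numerical representations of your tokens.
- token_type_ids: identifies which sequence each token belongs to when the input is a pair of sequences (all 0 here because there is only one sequence).
- attention_mask: indicates which tokens should be attended to.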

A tokenizer can also accept a list of inputs, and pad and truncate the text to return a batch with uniform length:

In [16]:
batch_example = ["Me encanta jugar al tenis los dias calidos", "Mis amigos siempre me aconsejan"]

# Pad to the longest sentence, truncate at 512 tokens, and return PyTorch tensors
encoding = tokenizer(batch_example, truncation=True, padding=True, max_length=512, return_tensors="pt")
encoding

{'input_ids': tensor([[  101, 10525, 10109, 36183, 10112, 31931, 10161, 41139, 10175, 14347,
         56003, 11241,   102],
        [  101, 12751, 24963, 21279, 10525, 12181, 41037, 13554,   102,     0,
             0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

Because the second sentence is shorter than the first, it is padded with the pad token ID 0 up to the length of the longest sentence. The attention_mask marks those padding positions with 0 so the model ignores them.
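To make the padding step concrete, here is a plain-Python sketch of what happens to the token IDs. It is not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and it assumes right-side padding with pad token ID 0, as seen in the output above:

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad token ID lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the batch output above, with the padding removed.
first = [101, 10525, 10109, 36183, 10112, 31931, 10161, 41139, 10175, 14347, 56003, 11241, 102]
second = [101, 12751, 24963, 21279, 10525, 12181, 41037, 13554, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The second row comes back with four trailing zeros in both lists, matching the tensors printed by the tokenizer.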

#### AutoModel

🤗 Transformers provides a simple and unified way to load pretrained instances!

This means you can load an AutoModel like you would load an AutoTokenizer. The only difference is selecting the correct AutoModel for the task. For text (or sequence) classification, you should load AutoModelForSequenceClassification:

In [28]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Now pass your preprocessed batch of inputs directly to the model. You just have to unpack the dictionary by adding `**`:

In [32]:
pt_outputs = model(**encoding)
pt_outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-2.3056, -2.4165, -0.3753,  1.6093,  2.7765],
        [-2.4242, -2.3214, -0.0322,  1.7321,  2.4318]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

The model outputs the final activations in the logits attribute. Apply the softmax function to the logits to retrieve the probabilities:

In [34]:
from torch import nn
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
pt_predictions


tensor([[0.0045, 0.0041, 0.0313, 0.2279, 0.7322],
        [0.0049, 0.0054, 0.0532, 0.3108, 0.6257]], grad_fn=<SoftmaxBackward0>)
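The softmax step can be reproduced by hand to see where the probabilities come from. The sketch below uses only the standard library and the rounded logits of the first sentence, so the result should agree with the tensor above to about three decimals. The `softmax` function here is our own illustration, not the PyTorch one:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits for the first sentence, copied from the model output above.
logits = [-2.3056, -2.4165, -0.3753, 1.6093, 2.7765]
# Label names taken from id2label in the model configuration.
id2label = {0: "1 star", 1: "2 stars", 2: "3 stars", 3: "4 stars", 4: "5 stars"}

probs = softmax(logits)
# Index of the largest probability, i.e. the predicted class.
best = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 4) for p in probs])
print(id2label[best])
```

Taking the index of the largest probability and looking it up in id2label is essentially what the pipeline does to produce its label and score.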

All 🤗 Transformers models (PyTorch or TensorFlow) output the tensors before the final activation function (like softmax) because the final activation function is often fused with the loss. Model outputs are special dataclasses so their attributes are autocompleted in an IDE. The model outputs behave like a tuple or a dictionary (you can index with an integer, a slice or a string) in which case, attributes that are None are ignored.

#### Save a model

Once your model is fine-tuned, you can save it with its tokenizer using PreTrainedModel.save_pretrained():

In [35]:
pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
model.save_pretrained(pt_save_directory)

In [36]:
# When you are ready to use the model again, reload it with PreTrainedModel.from_pretrained():
# pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")

#### Custom model builds

You can modify the model’s configuration class to change how a model is built. The configuration specifies a model’s attributes, such as the number of hidden layers or attention heads. You start from scratch when you initialize a model from a custom configuration class. The model attributes are randomly initialized, and you’ll need to train the model before you can use it to get meaningful results.

Start by importing AutoConfig, and then load the pretrained model you want to modify. Within AutoConfig.from_pretrained(), you can specify the attribute you want to change, such as the number of attention heads:

In [37]:
from transformers import AutoConfig

my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12, activation="relu", n_layers=8)


Create a model from your custom configuration with [AutoModel.from_config()](https://huggingface.co/docs/transformers/v4.37.0/en/model_doc/auto#transformers.FlaxAutoModelForVision2Seq.from_config):

In [38]:
from transformers import AutoModel

my_model = AutoModel.from_config(my_config)
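One constraint to keep in mind when changing the number of attention heads: as far as we know, DistilBERT requires the hidden size to be divisible by the number of heads, since each head works on an equal slice of it. A quick arithmetic check, assuming the hidden size of 768 used by distilbert-base-uncased. The `head_dim` helper is hypothetical:

```python
def head_dim(hidden_size: int, n_heads: int) -> int:
    """Return the per-head dimension, or fail if the split is uneven."""
    if hidden_size % n_heads != 0:
        raise ValueError(
            f"hidden size {hidden_size} is not divisible by {n_heads} heads"
        )
    return hidden_size // n_heads


# 768 is the hidden size assumed for distilbert-base-uncased; 12 heads as in my_config.
print(head_dim(768, 12))
```

A value such as `n_heads=5` would fail this check, so it is worth validating a custom setting before building the model.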

#### Trainer - a PyTorch optimized training loop

All models are a standard [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) so you can use them in any typical training loop. While you can write your own training loop, 🤗 Transformers provides a Trainer class for PyTorch, which contains the basic training loop and adds additional functionality for features like distributed training, mixed precision, and more.

Depending on your task, you’ll typically pass the following parameters to Trainer:

1. You’ll start with a PreTrainedModel or a torch.nn.Module:

In [25]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


2. TrainingArguments contains the training hyperparameters you can change, such as the learning rate, batch size, and number of epochs to train for.

The default values are used if you don’t specify any training arguments:

In [1]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./distilbert-base-uncased-model",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
)

3. Load a preprocessing class like a tokenizer, image processor, feature extractor, or processor:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

4. Load a dataset:

In [None]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")  # doctest: +IGNORE_RESULT

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

By default, load_dataset() downloads every split of the dataset: train, validation, and test.

In [19]:
dataset['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

The label feature has 2 classes: neg (index 0) and pos (index 1).

Let’s print the third row:

In [18]:
dataset['train'][2]

{'text': 'effective but too-tepid biopic', 'label': 1}

5. Create a function to tokenize the dataset:

In [None]:
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])

#Then apply it over the entire dataset with map:
dataset = dataset.map(tokenize_dataset, batched=True)

If we print the same row again, we can see the new input_ids and attention_mask keys:

In [23]:
dataset['train'][2]

{'text': 'effective but too-tepid biopic',
 'label': 1,
 'input_ids': [101, 4621, 2021, 2205, 1011, 8915, 23267, 16012, 24330, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

6. A [DataCollatorWithPadding](https://huggingface.co/docs/transformers/v4.37.0/en/main_classes/data_collator#transformers.DataCollatorWithPadding) to create a batch of examples from your dataset:

In [24]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Now gather all these classes in Trainer:

In [26]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)  # doctest: +SKIP

When you’re ready, call train() to start training:

In [27]:
trainer.train()


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 0.4611, 'learning_rate': 1.0627928772258671e-05, 'epoch': 0.47}
{'loss': 0.3775, 'learning_rate': 1.2558575445173386e-06, 'epoch': 0.94}
{'train_runtime': 1039.761, 'train_samples_per_second': 8.204, 'train_steps_per_second': 1.026, 'train_loss': 0.4154262435469766, 'epoch': 1.0}


TrainOutput(global_step=1067, training_loss=0.4154262435469766, metrics={'train_runtime': 1039.761, 'train_samples_per_second': 8.204, 'train_steps_per_second': 1.026, 'train_loss': 0.4154262435469766, 'epoch': 1.0})
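The numbers in this log can be checked against the training arguments. The sketch below assumes a single device, no gradient accumulation, and a learning rate that decays linearly to zero without warmup, which we believe are the TrainingArguments defaults. Under those assumptions it reproduces the step count and the logged learning rates:

```python
import math

num_train_rows = 8530  # size of the train split shown earlier
batch_size = 8         # per_device_train_batch_size
num_epochs = 1         # num_train_epochs
initial_lr = 2e-5      # learning_rate

# The last, partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step):
    """Learning rate after `step` steps with linear decay to 0 (no warmup assumed)."""
    return initial_lr * (total_steps - step) / total_steps


print(total_steps)
for step in (500, 1000):
    # step, fraction of the epoch completed, learning rate at that step
    print(step, round(step / steps_per_epoch, 2), linear_lr(step))
```

The two log lines therefore correspond to steps 500 and 1000, which is consistent with logging every 500 steps.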

In [None]:
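# Run the fine-tuned model on the test split; returns predictions (logits), label_ids, and metrics
pred = trainer.predict(dataset["test"])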

In [37]:
dataset['test'][2]['text']

'it\'s like a " big chill " reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists .'

In [31]:
pred.label_ids

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)

In [40]:
pred.predictions[2]

array([ 0.7101512, -0.651096 ], dtype=float32)
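The predictions are raw logits, one per class, so the predicted class is the index of the largest value. A minimal sketch for the third test example, using only the values printed above and the class order from the dataset features:

```python
# Logits for the third test example, copied from pred.predictions[2] above.
logits = [0.7101512, -0.651096]
# Class names in index order, from dataset['train'].features above.
label_names = ["neg", "pos"]
# Third entry of pred.label_ids, as shown above.
true_label_id = 1

# Index of the largest logit is the predicted class.
predicted_id = max(range(len(logits)), key=logits.__getitem__)
print("predicted:", label_names[predicted_id])
print("actual:   ", label_names[true_label_id])
print("correct:  ", predicted_id == true_label_id)
```

For this particular review the model predicts neg while the label is pos. On the full arrays, `np.argmax(pred.predictions, axis=-1)` gives every predicted class at once, and comparing the result with `pred.label_ids` and taking the mean gives the test accuracy.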

You can customize the training loop behavior by subclassing Trainer and overriding its methods. This allows you to customize features such as the loss function, optimizer, and scheduler.

The other way to customize the training loop is by using Callbacks. You can use callbacks to integrate with other libraries and inspect the training loop to report on progress or stop the training early. Callbacks do not modify anything in the training loop itself. To customize something like the loss function, you need to subclass the Trainer instead.