# Setup
For this project we need the torch, transformers and sentencepiece Python packages in order to load and use pre-trained models. We will also need the huggingface_hub package to log in to the Hugging Face Hub programmatically and manage repositories, and the datasets package to download datasets from the Hub.

The sentencepiece package is required by transformers to perform inference with some of the pre-trained open source models on Hugging Face Hub and does not need to be explicitly imported. Import the remaining packages as follows.

Uncomment and run the provided %pip commands to install the necessary packages, then restart your kernel.
Import torch.
Import huggingface_hub using the alias hf_hub.
Import datasets.
Import transformers.

In [1]:
# %pip install datasets
# %pip install huggingface_hub
# %pip install pyarrow
# %pip install transformers
# %pip install hf_xet

In [6]:
import os # Import os
import torch # Import torch
import huggingface_hub as hf_hub # Import huggingface_hub using the alias hf_hub
import datasets # Import datasets 
import transformers # Import transformers
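
After restarting the kernel, it can be useful to confirm that the packages named above are actually installed. The following is a minimal sketch using only the standard library; the helper name installed_versions and the REQUIRED list are assumptions made for illustration and are not part of the original notebook.

```python
from importlib import metadata

# Distribution names of the packages this section relies on.
# sentencepiece is used indirectly by transformers and is never imported directly.
REQUIRED = ["torch", "transformers", "sentencepiece", "huggingface_hub", "datasets"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name:<16} {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed with the corresponding %pip command before continuing.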

# Downloading pre-trained models
Hugging Face Hub as a Git Platform

The Hugging Face website (also known as the Hub) is essentially a Git platform designed to store pretrained models and datasets as Git repositories. Similar to GitHub, it allows users to explore, create, clone, push repositories and so much more. Each pretrained checkpoint has its own repository and in most cases a descriptive README with code snippets to load and run the model. See the bert-base-cased model repository as an example.

How to Use Pretrained Models

While the Hub is a great place to explore different tasks and pretrained models, we need the transformers or diffusers library to load and make predictions with pretrained models. These two libraries reimplement the code from state-of-the-art ML research so that vastly different models can be downloaded, loaded into memory, and used in a unified way with a few lines of code.

In this task, you will learn how to use the Auto classes of transformers and the from_pretrained method to download and load any model on the Hugging Face Hub. For a full list of supported models, refer to the GitHub README.

What is the Auto Class?

The Auto classes of transformers are simply tools to load models and their data preprocessors in a unified way. Remember, the library reimplements each model so that each has its own class (BertModel, RobertaModel, T5Model, etc.) with a mostly uniform input and output data format across all models. transformers has the following Auto class types to load models and their data preprocessors:

AutoModel
AutoModelFor<TASK> (more on this below)
AutoTokenizer
AutoFeatureExtractor
AutoImageProcessor
AutoProcessor
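
To make the idea of an Auto class more concrete, here is a toy sketch of the dispatch pattern: a registry maps a model type to a concrete class, and a single entry point picks the right class. Every name in it (ToyBertModel, ToyRobertaModel, MODEL_REGISTRY, ToyAutoModel) is hypothetical, and this is not the transformers implementation, only an illustration of the concept under that assumption.

```python
# Toy stand-ins for model classes; these are NOT the transformers classes.
class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyRobertaModel:
    def __init__(self, config):
        self.config = config


# Hypothetical registry mapping a "model_type" value to a concrete class.
MODEL_REGISTRY = {"bert": ToyBertModel, "roberta": ToyRobertaModel}


class ToyAutoModel:
    """Single entry point that chooses the concrete class from the config."""

    @classmethod
    def from_config(cls, config):
        model_type = config["model_type"]
        try:
            model_class = MODEL_REGISTRY[model_type]
        except KeyError:
            raise ValueError(f"Unsupported model_type: {model_type!r}") from None
        return model_class(config)


model = ToyAutoModel.from_config({"model_type": "roberta", "hidden_size": 768})
print(type(model).__name__)  # ToyRobertaModel
```

The caller never names ToyRobertaModel directly, which is the convenience the Auto classes aim to provide across very different architectures.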

Loading Models into Memory with from_pretrained

For the first task, you will download the pretrained "cardiffnlp/twitter-roberta-base-emoji" model and load the model and its data preprocessor into memory with the from_pretrained(<REPO_NAME_OR_PATH>) method. "cardiffnlp/twitter-roberta-base-emoji" is a text classification model that is trained to predict the emoji class ID of a given tweet.

cardiffnlp/twitter-roberta-base-emoji is a valid Git repository on the Hub, and the from_pretrained() method downloads and uses the tokenizer-specific files from the model repository.

In [3]:
# Import the AutoTokenizer and AutoModel classes from transformers
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained tokenizer of the "cardiffnlp/twitter-roberta-base-emoji" model
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-emoji")



What from_pretrained() does

The from_pretrained() method accepts the name of a model repository on the Hugging Face Hub, but it also accepts a local path or a URL with the expected folder structure. For example, you can git clone the repository and load it from the local path.

In [4]:
# Print the tokenizer to see the data preprocessing configuration for this model 
print(tokenizer)

RobertaTokenizerFast(name_or_path='cardiffnlp/twitter-roberta-base-emoji', vocab_size=50265, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}


Tokenizers

NLP models can't process raw text; the text first needs to be converted to a fixed-length numerical format. The Tokenizer classes preprocess the text such that each word, subword, or punctuation mark is mapped to a token with a unique ID, short sentences are padded, and long sentences are truncated to create fixed-size input vectors. They also provide additional special tokens such as bos, eos, unk, sep, and more to mark the start and end of sentences and to assign token IDs to unknown words that are not in the tokenizer vocabulary.

The printed tokenizer tells us that this pretrained model uses a tokenizer with a vocabulary of 50265 unique token IDs that applies padding and truncation to the end (right side) of each input text. The lstrip=True setting on the <mask> token means that this token absorbs any whitespace immediately to its left. You can always refer to the corresponding Tokenizer documentation to learn more about each preprocessing step.
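
The padding, truncation, and unknown-token behaviour described above can be illustrated with a toy tokenizer. This is only a sketch under simplifying assumptions: it splits on whitespace rather than using the subword algorithm of the real tokenizer, the function toy_encode and the word IDs are made up, and only the special-token IDs (<s>=0, <pad>=1, </s>=2, <unk>=3) follow the printed tokenizer above.

```python
# Toy vocabulary: special-token IDs follow the printed tokenizer, word IDs are made up.
VOCAB = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "hello": 4, "world": 5, "!": 6}


def toy_encode(text, max_length):
    """Encode text into exactly max_length token IDs (max_length must be >= 2)."""
    # Naive whitespace tokenization; "!" is split off as its own token.
    words = text.lower().replace("!", " !").split()
    # Words missing from the vocabulary are mapped to the <unk> ID.
    ids = [VOCAB.get(word, VOCAB["<unk>"]) for word in words]
    # Truncate on the right, leaving room for the <s> and </s> tokens.
    ids = ids[: max_length - 2]
    ids = [VOCAB["<s>"]] + ids + [VOCAB["</s>"]]
    attention_mask = [1] * len(ids)
    # Pad on the right so every output has the same length.
    padding = max_length - len(ids)
    ids += [VOCAB["<pad>"]] * padding
    attention_mask += [0] * padding
    return {"input_ids": ids, "attention_mask": attention_mask}


print(toy_encode("Hello world!", max_length=8))
# {'input_ids': [0, 4, 5, 6, 2, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 0, 0, 0]}
print(toy_encode("Hello there world!", max_length=4))
# {'input_ids': [0, 4, 3, 2], 'attention_mask': [1, 1, 1, 1]}
```

In the second call, "there" is not in the toy vocabulary and becomes <unk> (ID 3), while "world" and "!" are cut off by truncation. The attention mask marks which positions hold real tokens (1) and which hold padding (0).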

In [5]:
# Load the pre-trained model of the "cardiffnlp/twitter-roberta-base-emoji" model
model = AutoModel.from_pretrained("cardiffnlp/twitter-roberta-base-emoji")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Some weights of RobertaModel were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-emoji and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Notes on Model Loading

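Take a look at the warning above. With the AutoModel class, we were only able to load part of the model: the pooler weights were not found in the checkpoint and were newly initialized, and the task-specific parameters included in the checkpoint were not loaded.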

This is because all transformers models are designed to have a single base class and multiple task-specific prediction classes built on top of it. The AutoModel class is designed to load only the base model class, such as RobertaModel, and not task-specific classes such as RobertaForSequenceClassification.
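
One way to picture the warning above is as a comparison between the parameter names stored in the checkpoint and the parameter names the base model expects. The sketch below uses plain Python sets; the two pooler names are taken from the warning, while all other names are illustrative assumptions and far fewer than a real checkpoint would contain.

```python
# Illustrative parameter names only; a real checkpoint contains many more entries.
checkpoint_keys = {
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.0.attention.self.query.weight",
    "classifier.dense.weight",     # assumed task-specific head parameter
    "classifier.out_proj.weight",  # assumed task-specific head parameter
}
base_model_keys = {
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.0.attention.self.query.weight",
    "roberta.pooler.dense.weight",  # named in the warning
    "roberta.pooler.dense.bias",    # named in the warning
}

# Parameters present on both sides are loaded from the checkpoint.
loaded = checkpoint_keys & base_model_keys
# Parameters the model expects but the checkpoint lacks are newly initialized.
newly_initialized = base_model_keys - checkpoint_keys
# Parameters the checkpoint has but the base model does not use are left out.
unused = checkpoint_keys - base_model_keys

print("loaded:           ", sorted(loaded))
print("newly initialized:", sorted(newly_initialized))
print("unused:           ", sorted(unused))
```

Under these assumptions, the pooler parameters land in the newly initialized group, matching the warning, and the classification head parameters are the ones the base model leaves behind.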

In order to identify the exact class name and task of your target checkpoint, you can simply refer to the model configuration.
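
As a rough sketch of what reading the configuration involves, the example below parses a small JSON document shaped like a model configuration file and extracts the class name and task. The JSON text is an illustrative excerpt written for this example, not a copy of the repository's file, and splitting the class name on "For" is a naive assumption about the naming convention rather than a library feature.

```python
import json

# Illustrative excerpt in the shape of a model configuration; values are assumptions.
config_text = """
{
  "architectures": ["RobertaForSequenceClassification"],
  "model_type": "roberta"
}
"""

config = json.loads(config_text)

# The first entry of "architectures" is the class the checkpoint was saved with.
architecture = config["architectures"][0]

# Naive split of the class name into the model family and the task.
family, _, task = architecture.partition("For")

print("class name:", architecture)           # RobertaForSequenceClassification
print("model family:", family)               # Roberta
print("task:", task)                         # SequenceClassification
print("Auto class:", "AutoModelFor" + task)  # AutoModelForSequenceClassification
```

With the task identified, the matching task-specific Auto class can be used in place of AutoModel so that the task-specific parameters in the checkpoint are loaded as well.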