# Models

In this section, we'll take a closer look at creating and using models. We'll use the AutoModel class, which is handy when you want to instantiate any model from a checkpoint.

## Creating a Transformer
Let's begin by examining what happens when we instantiate an AutoModel:

In [1]:
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

Similar to the tokenizer, the from_pretrained() method downloads and caches the model data from the Hugging Face Hub. As mentioned previously, the checkpoint name corresponds to a specific model architecture and set of weights, in this case a BERT model with a basic architecture (12 layers, 768 hidden size, 12 attention heads) and cased inputs (meaning that the uppercase/lowercase distinction matters). There are many checkpoints available on the Hub for you to explore.

The AutoModel class and its associates are simple wrappers designed to fetch the appropriate model architecture for a given checkpoint. It's an "auto" class, meaning it will guess the appropriate model architecture for you and instantiate the correct model class. However, if you know the type of model you want to use, you can use the class that defines its architecture directly:

In [2]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")
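To make the "auto" idea concrete, here is a toy sketch of the dispatch step. It is not the real transformers implementation: the classes ToyBertModel and ToyGPT2Model, the MODEL_REGISTRY dictionary, and the toy_auto_model() function are all hypothetical names invented for illustration. The only thing borrowed from real checkpoints is that their configuration carries a model_type entry that an auto class can look up.

```python
# Toy illustration only: these classes and the registry are hypothetical,
# not the real transformers implementation.
class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyGPT2Model:
    def __init__(self, config):
        self.config = config


# Maps the "model_type" value found in a checkpoint's config to a class.
MODEL_REGISTRY = {
    "bert": ToyBertModel,
    "gpt2": ToyGPT2Model,
}


def toy_auto_model(config):
    """Pick the model class that matches config["model_type"]."""
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model type: {model_type}")
    return MODEL_REGISTRY[model_type](config)


toy_model = toy_auto_model({"model_type": "bert", "hidden_size": 768})
print(type(toy_model).__name__)  # ToyBertModel
```

Calling the architecture class directly, as BertModel does above, simply skips the lookup.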

## Loading and saving
Saving a model is as simple as saving a tokenizer. In fact, models have the same save_pretrained() method, which saves the model's weights and architecture configuration:

In [3]:
model.save_pretrained("lesson3")

This will save two files to your disk:


    ls lesson3

    config.json model.safetensors

If you look inside the config.json file, you'll see all the attributes needed to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.

The model.safetensors file is known as the state dictionary; it contains all your model's weights. The two files work together: the configuration file describes the model architecture, while the weights file holds the model's parameters.
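Since config.json is plain JSON, it can be inspected with the standard library alone. The sketch below writes a small hand-written stand-in file to a temporary directory and reads it back. The values mirror the architecture described earlier (12 layers, 768 hidden size, 12 attention heads). A real saved config.json contains many more keys than this example.

```python
import json
import tempfile
from pathlib import Path

# Hand-written stand-in for a saved config.json; a real file has many more keys.
example_config = {
    "model_type": "bert",
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "num_attention_heads": 12,
}

with tempfile.TemporaryDirectory() as save_dir:
    config_path = Path(save_dir) / "config.json"
    config_path.write_text(json.dumps(example_config, indent=2))

    # Reading the file back is ordinary JSON parsing.
    loaded_config = json.loads(config_path.read_text())

print(loaded_config["num_hidden_layers"], loaded_config["hidden_size"])
```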

To reuse a saved model, use the from_pretrained() method again:


In [4]:
from transformers import AutoModel

# Load from the directory passed to save_pretrained() above
model = AutoModel.from_pretrained("lesson3")

A wonderful feature of the 🤗 Transformers library is the ability to easily share models and tokenizers with the community. To do this, make sure you have an account on Hugging Face. If you're using a notebook, you can easily log in with this:

In [5]:
# Only run this if ipywidgets is not already installed
#!pip install ipywidgets

In [15]:
# This widget did not load for me, so I used the alternate method below
from huggingface_hub import notebook_login

notebook_login()


In [10]:
# Alternate method: log in with an access token
from huggingface_hub import login

# Paste your token between the quotes, then delete it after running
# and before pushing the notebook anywhere
login(token='')


Otherwise, run this in your terminal:

    huggingface-cli login

Then you can push the model to the Hub with the push_to_hub() method:

In [15]:
model.push_to_hub("example_my-awesome-model")



CommitInfo(commit_url='https://huggingface.co/a-reusche92/example_my-awesome-model/commit/b13d8c0dc0a459377d4dedc17867a23529e95283', commit_message='Upload model', commit_description='', oid='b13d8c0dc0a459377d4dedc17867a23529e95283', pr_url=None, repo_url=RepoUrl('https://huggingface.co/a-reusche92/example_my-awesome-model', endpoint='https://huggingface.co', repo_type='model', repo_id='a-reusche92/example_my-awesome-model'), pr_revision=None, pr_num=None)

In [18]:
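# Attempt to fix widget rendering for future sessions
!pip install -U ipywidgets
# The nbextension subcommand was not available in this environment (see the error below)
!jupyter nbextension enable --py widgetsnbextension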


Jupyter command `jupyter-nbextension` not found.


The push_to_hub() call above uploads the model files to the Hub, in a repository under your namespace named example_my-awesome-model. Anyone can then load your model with the from_pretrained() method!

In [18]:
from transformers import AutoModel

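# Paste your token between the quotes if the repository is private, and delete it
# after the download (token replaces the deprecated use_auth_token argument)
model = AutoModel.from_pretrained("a-reusche92/example_my-awesome-model", token="")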

You can do a lot more with the Hub API:

    Push a model from a local repository
    Update specific files without re-uploading everything
    Add model cards to document the model's abilities, limitations, known biases, etc.

See the documentation for a complete tutorial on this, or check out the advanced Chapter 4.

## Encoding text
Transformer models handle text by turning the inputs into numbers. Here we will look at exactly what happens when your text is processed by the tokenizer. We've already seen in Chapter 1 that tokenizers split the text into tokens and then convert these tokens into numbers. We can see this conversion through a simple tokenizer:

In [19]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


We get a dictionary with the following fields:

    input_ids: numerical representations of your tokens
    token_type_ids: these tell the model which part of the input is sentence A and which is sentence B (discussed more in the next section)
    attention_mask: this indicates which tokens should be attended to and which should not (discussed more in a bit)

We can decode the input IDs to get back the original text:

In [20]:
tokenizer.decode(encoded_input["input_ids"])

"[CLS] Hello, I ' m a single sentence! [SEP]"

You'll notice that the tokenizer has added the special tokens [CLS] and [SEP], which the model requires. Not all models need special tokens; they're used when a model was pretrained with them, in which case the tokenizer needs to add them because the model expects them.

You can encode multiple sentences at once, either by batching them together (we'll discuss this soon) or by passing a list:

In [23]:
encoded_input = tokenizer(["How are you?", "I'm fine, thank you!"])
print(encoded_input)

{'input_ids': [[101, 1731, 1132, 1128, 136, 102], [101, 146, 112, 182, 2503, 117, 6243, 1128, 106, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


Note that when passing multiple sentences, each dictionary value becomes a list with one entry per sentence. We can also ask the tokenizer to return PyTorch tensors directly:

In [26]:
# The call did not work without padding, so padding and truncation are
# enabled here. As a result, this output already shows the padded tensors
# discussed in the next sections.

encoded_input = tokenizer(["How are you?", "I'm fine, thank you!"],
                          return_tensors="pt",
                          padding= True,
                          truncation= True
                         )
print(encoded_input)


{'input_ids': tensor([[ 101, 1731, 1132, 1128,  136,  102,    0,    0,    0,    0],
        [ 101,  146,  112,  182, 2503,  117, 6243, 1128,  106,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


There is a catch: the two sentences tokenize to different lengths. Arrays and tensors need to be rectangular, so lists of unequal length can't simply be converted to a PyTorch tensor (or NumPy array). That is why the call above passes padding=True, an option the tokenizer provides for exactly this case.

## Padding inputs
If we ask the tokenizer to pad the inputs, it will make all sentences the same length by adding a special padding token to the sentences that are shorter than the longest one:

In [27]:
# Padding was already applied in the cell above (see the padding=True argument)

Now we have rectangular tensors! Note that the padding tokens have been encoded into input IDs with ID 0, and they have an attention mask value of 0 as well. This is because those padding tokens shouldn't be analyzed by the model: they're not part of the actual sentence.
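As a rough mental model of what padding does, here is a minimal pure-Python sketch. The pad_batch() helper is a hypothetical name, not a tokenizer method, and it assumes right-padding with ID 0 as seen in the output above. It reuses the token IDs printed earlier for the two example sentences.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs for "How are you?" and "I'm fine, thank you!" from the output above.
batch = [
    [101, 1731, 1132, 1128, 136, 102],
    [101, 146, 112, 182, 2503, 117, 6243, 1128, 106, 102],
]
padded = pad_batch(batch)
print(padded["input_ids"][0])       # [101, 1731, 1132, 1128, 136, 102, 0, 0, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The shorter sentence gains four padding IDs and four zeros in its mask, matching the tensors printed by the tokenizer.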

## Truncating inputs
The tensors might get too big to be processed by the model. For instance, BERT was only pretrained with sequences up to 512 tokens, so it cannot process longer sequences. If you have sequences longer than the model can handle, you'll need to truncate them with the truncation parameter:

In [28]:
encoded_input = tokenizer(
    "This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
    truncation=True,
)
print(encoded_input["input_ids"])

[101, 1188, 1110, 170, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1263, 5650, 119, 102]


By combining the padding and truncation arguments, you can make sure your tensors have the exact size you need:

In [29]:
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt",
)
print(encoded_input)

{'input_ids': tensor([[ 101, 1731, 1132, 1128,  102],
        [ 101,  146,  112,  182,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])}
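Notice that both truncated rows still end with 102 ([SEP]): the cut removes tokens from the middle of the encoded sequence, not the closing special token. The sketch below reproduces that behavior in plain Python. The truncate_keep_specials() function is a hypothetical helper written for this note, and it assumes a single sequence wrapped in exactly one leading and one trailing special token, with max_length of at least 2.

```python
def truncate_keep_specials(ids, max_length):
    """Shorten a sequence to max_length while keeping the first and last IDs."""
    if len(ids) <= max_length:
        return list(ids)
    first, last = ids[0], ids[-1]
    # Keep as many ordinary tokens as fit between the two special tokens.
    middle = ids[1:-1][: max_length - 2]
    return [first] + middle + [last]


batch = [
    [101, 1731, 1132, 1128, 136, 102],
    [101, 146, 112, 182, 2503, 117, 6243, 1128, 106, 102],
]
truncated = [truncate_keep_specials(ids, max_length=5) for ids in batch]
print(truncated[0])  # [101, 1731, 1132, 1128, 102]
print(truncated[1])  # [101, 146, 112, 182, 102]
```

After truncation both rows have length 5, so no padding is needed to make the batch rectangular.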


## Adding special tokens
Special tokens (or at least the concept of them) are particularly important to BERT and derived models. These tokens are added to better represent sentence boundaries, such as the beginning of a sentence ([CLS]) or the separator between sentences ([SEP]). Let's look at a simple example:

In [30]:
encoded_input = tokenizer("How are you?")
print(encoded_input["input_ids"])
tokenizer.decode(encoded_input["input_ids"])

[101, 1731, 1132, 1128, 136, 102]


'[CLS] How are you? [SEP]'

These special tokens are automatically added by the tokenizer. Not all models need special tokens; they are primarily used when a model was pretrained with them, in which case the tokenizer will add them since the model expects them.
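A small sketch can show where these IDs end up. The add_special_tokens() helper below is a hypothetical function, not part of the tokenizer API. The IDs 101 and 102 are taken from the outputs above. The sentence-pair layout ([CLS] A [SEP] B [SEP], with token type 0 for the first segment and 1 for the second) follows BERT's usual convention, which this section does not demonstrate directly.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] in the outputs above


def add_special_tokens(ids_a, ids_b=None):
    """Wrap one sequence, or a pair, in BERT-style special tokens."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # The second sentence and its closing [SEP] get token type 1.
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


how_are_you = [1731, 1132, 1128, 136]  # "How are you?" without special tokens
single = add_special_tokens(how_are_you)
print(single["input_ids"])  # [101, 1731, 1132, 1128, 136, 102]

im_fine = [146, 112, 182, 2503, 117, 6243, 1128, 106]  # "I'm fine, thank you!"
pair = add_special_tokens(how_are_you, im_fine)
print(pair["token_type_ids"])
```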

## Why is all of this necessary?
Here's a concrete example. Consider these sequences:


Note: the website's code for this example did not run as written, so the cells below use a fixed version.

In [31]:
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

In [42]:
encoded_input = tokenizer(sequences,
                        padding=True,
                        truncation=True,
                        return_tensors="pt"
                         )

In [43]:
print(encoded_input["input_ids"])

tensor([[  101,   146,   112,  1396,  1151,  2613,  1111,   170, 20164, 10932,
          2271,  7954,  1736,  1139,  2006,  1297,   119,   102],
        [  101,   146,  4819,  1142,  1177,  1277,   106,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0]])


Because we passed padding=True and return_tensors="pt", the input IDs are already a rectangular PyTorch tensor. Tensors only accept rectangular shapes (think matrices), and the shorter sequence was padded with 0s to match the longer one, so we can use the result directly:

In [44]:
encoded_sequences = encoded_input["input_ids"]

## Using the tensors as inputs to the model
Using the tensors with the model is extremely simple; we just call the model with the inputs:

In [48]:
model(encoded_sequences)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.4222,  0.4443, -0.0659,  ..., -0.1958,  0.3611,  0.1284],
         [ 0.5728, -0.1593,  0.6014,  ..., -0.1134,  0.1791,  0.1787],
         [ 0.4699,  0.4214,  0.1695,  ...,  0.2386,  0.9851, -0.1236],
         ...,
         [ 0.5847,  0.2552,  0.0266,  ...,  0.7203,  0.0650,  0.4277],
         [ 0.5573,  0.4506,  0.0353,  ..., -0.0607,  0.4209, -0.2525],
         [ 0.7136,  1.2932, -0.2937,  ...,  0.2917,  0.4270, -0.3874]],

        [[ 0.4281,  0.5940,  0.0704,  ..., -0.2846,  0.2549,  0.0384],
         [ 0.6010, -0.0288,  0.4811,  ..., -0.1394,  0.1924,  0.2027],
         [ 0.2108,  0.2552, -0.4616,  ...,  0.3663, -0.0610,  0.3810],
         ...,
         [-0.2789, -0.3556,  0.0440,  ..., -0.4985,  0.5493,  0.3444],
         [-0.2533, -0.3874,  0.0834,  ..., -0.5131,  0.5479,  0.2973],
         [-0.1941, -0.4243,  0.1305,  ..., -0.5704,  0.5147,  0.2279]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_ou

While the model accepts a lot of different arguments, only the input IDs are necessary. We'll explain what the other arguments do and when they are required later, but first we need to take a closer look at the tokenizers that build the inputs that a Transformer model can understand.

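Note: the website's original version of this example follows. It fails as written because the two sequences have different lengths.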

In [50]:
encoded_sequences = [
    [
        101,
        1045,
        1005,
        2310,
        2042,
        3403,
        2005,
        1037,
        17662,
        12172,
        2607,
        2026,
        2878,
        2166,
        1012,
        102,
    ],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

This is a list of encoded sequences: a list of lists. Tensors only accept rectangular shapes (think matrices), and these two lists have different lengths (16 and 8), so converting them to a tensor fails:

In [51]:
import torch

model_inputs = torch.tensor(encoded_sequences)

ValueError: expected sequence of length 16 at dim 1 (got 8)
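The error can be diagnosed without PyTorch by comparing the row lengths. The sketch below uses a hypothetical is_rectangular() helper and then pads the shorter row. The padding ID of 0 is an assumption here, since these IDs come from the website's example and not from the bert-base-cased tokenizer used earlier.

```python
def is_rectangular(rows):
    """True when every inner list has the same length."""
    return len({len(row) for row in rows}) <= 1


encoded_sequences = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]
print([len(row) for row in encoded_sequences])  # [16, 8]
print(is_rectangular(encoded_sequences))        # False

PAD_ID = 0  # assumed padding ID for this vocabulary
longest = max(len(row) for row in encoded_sequences)
padded_sequences = [
    row + [PAD_ID] * (longest - len(row)) for row in encoded_sequences
]
print(is_rectangular(padded_sequences))         # True
```

Once the rows share a length, a call such as torch.tensor(padded_sequences) no longer hits the length mismatch.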

The website version then calls the model with these inputs. Since the conversion above raised an error, model_inputs is undefined, and the following cell only runs once the sequences are padded to the same length:


In [52]:
output = model(model_inputs)