## Creating a Transformer

The AutoModel class is handy when you want to instantiate any model from a checkpoint.

The AutoModel class and all of its relatives are simple wrappers over the wide variety of models available in the transformers library. They are clever wrappers: they automatically guess the appropriate model architecture for your checkpoint and then instantiate a model with that architecture.
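
To make the idea of "guessing the architecture" concrete, here is a toy sketch in plain Python. It is not the library's implementation: the registry, the Toy classes, and the toy_auto_model function are all hypothetical names invented for illustration. The only assumption carried over from the library is that a checkpoint's configuration carries a model_type field (visible in the config output later in this notebook) that can be used to pick a class.

```python
from dataclasses import dataclass


@dataclass
class ToyBertModel:
    """Hypothetical stand-in for a BERT architecture class."""
    config: dict


@dataclass
class ToyGPT2Model:
    """Hypothetical stand-in for a GPT-2 architecture class."""
    config: dict


# Hypothetical registry: maps the model_type string to an architecture class.
MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


def toy_auto_model(config: dict):
    """Pick the architecture class from the config, then instantiate it."""
    model_type = config["model_type"]
    try:
        model_cls = MODEL_REGISTRY[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model_type: {model_type!r}") from None
    return model_cls(config)


model = toy_auto_model({"model_type": "bert", "num_hidden_layers": 12})
print(type(model).__name__)
```

The caller never names an architecture; the configuration alone decides which class is built. That is the property that makes the Auto classes convenient.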

However, if you know the type of model you want to use, you can use the class that defines its architecture directly. Let’s take a look at how this works with a BERT model.

### Different loading methods
Creating a model from the default configuration initializes it with random values.

In [1]:
from transformers import BertConfig, BertModel

In [4]:
# BertConfig()

In [6]:
# Building the config
config = BertConfig()
# config

In [40]:
type(config)

transformers.models.bert.configuration_bert.BertConfig

In [10]:
# Building the model from the config
# Model is randomly initialized!
model = BertModel(config)
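
One way to see what random initialization means is to build two models from the same configuration and compare their weights. The sketch below assumes a deliberately tiny configuration (the sizes are arbitrary choices, not values from this notebook) so that it runs quickly. It compares only the word embedding matrix.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, arbitrary sizes chosen only to keep the sketch fast (an assumption).
tiny_config = BertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)

# Two models built from the same config share an architecture...
model_a = BertModel(tiny_config)
model_b = BertModel(tiny_config)

weights_a = model_a.embeddings.word_embeddings.weight
weights_b = model_b.embeddings.word_embeddings.weight

# ...but each one draws its own random weights.
same_shape = weights_a.shape == weights_b.shape
same_values = torch.equal(weights_a, weights_b)
print(same_shape, same_values)
```

The shapes match because the configuration fixes the architecture, while the values differ because nothing has been learned yet.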

In [9]:
# model

In [42]:
checkpoint = "google-bert/bert-base-cased"

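We can override any part of the configuration with keyword arguments. The configuration of this checkpoint has 12 hidden layers; here we load it with 10 instead. Note that BertConfig.from_pretrained() loads only the configuration, so a model built from it still starts with random weights.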

In [43]:
bert_config_10 = BertConfig.from_pretrained(checkpoint, num_hidden_layers=10)

In [49]:
bert_config_10.num_hidden_layers

10

In [44]:
bert_model_10 = BertModel(bert_config_10)

In [50]:
bert_model_10.save_pretrained('./saved_models/bert_model_10/')

In [51]:
load_bert_10 = BertModel.from_pretrained('./saved_models/bert_model_10/')

In [54]:
# load_bert_10.config

The model can be used in this state, but it will output gibberish; it needs to be trained first. We could train the model from scratch on the task at hand, but this would require a long time and a lot of data, and it would have a non-negligible environmental impact. 

To avoid unnecessary and duplicated effort, it’s imperative to be able to share and reuse models that have already been trained.

Loading a Transformer model that is already trained is simple — we can do this using the from_pretrained() method:

In [11]:
model = BertModel.from_pretrained("google-bert/bert-base-cased")

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

We could replace BertModel with the equivalent AutoModel class, which makes the code checkpoint-agnostic.

If your code works for one checkpoint, it should work seamlessly with another. This applies even if the architecture is different, as long as the checkpoint was trained for a similar task (for example, a sentiment analysis task).
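
The sketch below illustrates architecture-agnostic code without downloading anything. It uses AutoModel.from_config() with two tiny, randomly initialized configurations as a stand-in for real checkpoints. The sizes and the helper name hidden_states_shape are assumptions made for this example. With real checkpoints you would call AutoModel.from_pretrained(checkpoint) instead.

```python
import torch
from transformers import AutoModel, BertConfig, DistilBertConfig


def hidden_states_shape(config, input_ids):
    """Build whichever architecture the config describes and run one forward pass."""
    model = AutoModel.from_config(config)  # random weights, no download
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids)
    return type(model).__name__, tuple(outputs.last_hidden_state.shape)


input_ids = torch.tensor([[101, 8667, 106, 102]])

# Tiny, arbitrary sizes (assumptions) so both models build quickly.
bert_config = BertConfig(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=2, intermediate_size=64
)
distil_config = DistilBertConfig(dim=32, n_layers=2, n_heads=2, hidden_dim=64)

bert_result = hidden_states_shape(bert_config, input_ids)
distil_result = hidden_states_shape(distil_config, input_ids)
print(bert_result)
print(distil_result)
```

The helper never mentions BERT or DistilBERT by name, yet it works for both because AutoModel selects the class from the configuration it receives.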

In the code sample above we didn’t use BertConfig, and instead loaded a pretrained model via the google-bert/bert-base-cased identifier. This is a model checkpoint that was trained by the authors of BERT themselves.

This model is now initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

### Auto Config

In [37]:
from transformers import AutoConfig

In [38]:
bert_config = AutoConfig.from_pretrained("google-bert/bert-base-cased")

In [39]:
type(bert_config)

transformers.models.bert.configuration_bert.BertConfig

In [41]:
bert_config

BertConfig {
  "_name_or_path": "google-bert/bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.48.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

### Saving methods
Saving a model is as easy as loading one — we use the save_pretrained() method, which is analogous to the from_pretrained() method:

In [12]:
model.save_pretrained('saved_models')

In [13]:
!ls saved_models/

config.json       model.safetensors


If you take a look at the config.json file, you’ll recognize the attributes necessary to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.
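
As a rough sketch of the save and load round trip, the snippet below saves a tiny, randomly initialized model to a temporary directory, reads config.json back with the standard json module, and reloads the model from disk. The tiny sizes and the temporary directory are assumptions chosen so that the example runs quickly and leaves no files behind.

```python
import json
import os
import tempfile

import torch
from transformers import BertConfig, BertModel

# Tiny, arbitrary sizes (assumptions) to keep saving and loading fast.
tiny_config = BertConfig(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=2, intermediate_size=64
)
model = BertModel(tiny_config)

with tempfile.TemporaryDirectory() as save_dir:
    model.save_pretrained(save_dir)
    saved_files = sorted(os.listdir(save_dir))

    # config.json is plain JSON, so it can be inspected directly.
    with open(os.path.join(save_dir, "config.json")) as f:
        saved_config = json.load(f)

    # Reload from the local directory and compare one weight matrix.
    reloaded = BertModel.from_pretrained(save_dir)
    weights_match = torch.equal(
        model.embeddings.word_embeddings.weight,
        reloaded.embeddings.word_embeddings.weight,
    )

print(saved_files)
print(saved_config["model_type"], saved_config["num_hidden_layers"])
print(saved_config.get("transformers_version"))
print(weights_match)
```

The configuration file records the architecture settings, and the weights file restores the exact parameter values, so the reloaded model matches the one that was saved.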

### Loading the pretrained model

#### What is SafeTensors?

SafeTensors is a file format for storing tensors, introduced by Hugging Face as an alternative to pickle-based .pt or .bin files. It provides:

- Faster loading: Uses memory mapping for efficient tensor loading.
- Security: Prevents arbitrary code execution (unlike pickle-based .pt or .bin formats).
- Portability: The format is not tied to a single framework, and Hugging Face libraries read and write it directly.

#### Usage in Hugging Face Models
If a checkpoint provides SafeTensors weights, you will see a model.safetensors file when loading or saving the model. Passing use_safetensors=True to from_pretrained() requests those weights explicitly:

In [23]:
from transformers import AutoModel, AutoTokenizer

In [17]:
load_model = AutoModel.from_pretrained("google-bert/bert-base-cased", use_safetensors=True)

### Using a Transformer model for inference

In [22]:
sequences = ["Hello!", "Cool.", "Nice!"]

In [28]:
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [30]:
encoded_sequences = tokenizer(sequences)['input_ids']
encoded_sequences

[[101, 8667, 106, 102], [101, 13297, 119, 102], [101, 8835, 106, 102]]

In [31]:
import torch

In [32]:
model_inputs = torch.tensor(encoded_sequences)
model_inputs

tensor([[  101,  8667,   106,   102],
        [  101, 13297,   119,   102],
        [  101,  8835,   106,   102]])

In [33]:
output = load_model(model_inputs)
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.6283,  0.2166,  0.5605,  ...,  0.0136,  0.6158, -0.1712],
         [ 0.6108, -0.2253,  0.9263,  ..., -0.3028,  0.4500, -0.0714],
         [ 0.8040,  0.1809,  0.7076,  ..., -0.0685,  0.4837, -0.0774],
         [ 1.3290,  0.2360,  0.4567,  ...,  0.1509,  0.9621, -0.4841]],

        [[ 0.3128,  0.1718,  0.2099,  ..., -0.0721,  0.4919, -0.1383],
         [ 0.1545, -0.3757,  0.7187,  ..., -0.3130,  0.2822,  0.1883],
         [ 0.4123,  0.3721,  0.5484,  ...,  0.0788,  0.5681, -0.2757],
         [ 0.8356,  0.3964, -0.4121,  ...,  0.1838,  1.6365, -0.4806]],

        [[ 0.5399,  0.2564,  0.2511,  ..., -0.1760,  0.6063, -0.1803],
         [ 0.2609, -0.3164,  0.5548,  ..., -0.3439,  0.3909,  0.0900],
         [ 0.5161,  0.0721,  0.5606,  ...,  0.0077,  0.3685, -0.2272],
         [ 0.6560,  0.8475, -0.1606,  ..., -0.0468,  1.6309, -0.5047]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.7105,  0.

In [35]:
output.last_hidden_state.shape

torch.Size([3, 4, 768])
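
The shape [3, 4, 768] reads as (batch size, sequence length, hidden size): 3 sequences, 4 tokens each, and one 768-dimensional vector per token, matching hidden_size in the configuration. The sketch below reproduces that relationship offline with a tiny, randomly initialized model. The hidden size of 32 and the other sizes are assumptions chosen for speed, so the last dimension is 32 rather than 768.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, arbitrary sizes (assumptions); only the shapes matter here.
tiny_config = BertConfig(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=2, intermediate_size=64
)
model = BertModel(tiny_config)
model.eval()

# Same token IDs as above: 3 sequences of 4 tokens each.
input_ids = torch.tensor(
    [[101, 8667, 106, 102], [101, 13297, 119, 102], [101, 8835, 106, 102]]
)

# no_grad skips gradient tracking, which is not needed for inference.
with torch.no_grad():
    output = model(input_ids)

hidden_shape = tuple(output.last_hidden_state.shape)  # (batch, tokens, hidden)
pooled_shape = tuple(output.pooler_output.shape)      # (batch, hidden)
print(hidden_shape, pooled_shape)
```

Running under torch.no_grad() also means the output tensors carry no grad_fn, unlike the output printed above.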