We’ll use the `AutoModel` class, which is handy when you want to instantiate any model from a checkpoint.

The `AutoModel` class and all of its relatives are thin wrappers over the wide variety of models available in the library. They are clever wrappers: each one automatically infers the appropriate model architecture for your checkpoint and then instantiates a model with that architecture.
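To make that dispatch concrete, here is a minimal offline sketch. It assumes `transformers` and PyTorch are installed, and the configuration sizes are deliberately tiny, hypothetical values chosen so the model builds instantly; they do not correspond to any real checkpoint.

```python
from transformers import AutoModel, BertConfig, BertModel

# Hypothetical, deliberately tiny sizes (not those of a real checkpoint).
tiny_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)

# AutoModel looks at the configuration's model_type ("bert")
# and instantiates the matching architecture class.
auto_model = AutoModel.from_config(tiny_config)

print(tiny_config.model_type)     # bert
print(type(auto_model).__name__)  # BertModel
```

Note that `from_config()` only builds the architecture with random weights; it does not load any trained parameters.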

However, **if you know the type of model you want to use, you can use the class that defines its architecture directly**. Let’s take a look at how this works with a BERT model.

# Creating a Transformer
The first thing we’ll need to do to initialize a BERT model is import the `BertConfig` and `BertModel` classes, so that we can load a configuration object:

In [1]:
from transformers import BertConfig, BertModel

## Model from scratch

Creating a model from the default configuration initializes it with random values.

The configuration contains many attributes that are used to build the model:

In [5]:
# Building the config
config = BertConfig(
    # Number of hidden layers in the Transformer encoder
    num_hidden_layers=10,  # Default is 12
)
config

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 10,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.43.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

While you haven’t seen what all of these attributes do yet, you should recognize some of them: the `hidden_size` attribute defines the size of the `hidden_states` vector, and `num_hidden_layers` defines the number of layers the Transformer model has.

In [6]:
# Building the model from the config
model = BertModel(config)
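As a rough check that the configuration really drives the architecture, the sketch below builds a much smaller model and inspects it. The sizes and the names `small_config` and `small_model` are illustrative assumptions, not values used elsewhere in this chapter.

```python
from transformers import BertConfig, BertModel

# Hypothetical small sizes so that the model is cheap to build.
small_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=3,
    num_attention_heads=2,
    intermediate_size=64,
)
small_model = BertModel(small_config)

# The encoder holds one block per hidden layer requested in the config.
num_layers = len(small_model.encoder.layer)
# The token embeddings have the width given by hidden_size.
embedding_dim = small_model.embeddings.word_embeddings.embedding_dim
# Total number of (randomly initialized) parameters.
num_params = sum(p.numel() for p in small_model.parameters())

print(num_layers, embedding_dim)  # 3 32
print(num_params)
```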

The model can be used in this state, but it will output gibberish; it **needs to be trained first**. We could train the model from scratch on the task at hand, but as you saw in Chapter 1, this would require a long time and a lot of data, and it would have a non-negligible environmental impact. To avoid unnecessary and duplicated effort, it’s imperative to be able to share and reuse models that have already been trained.
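To see what "usable but untrained" means in practice, we can run a forward pass through a randomly initialized model. This is only a sketch: the tiny configuration and the random token IDs are assumptions made for illustration, and the output values carry no meaning.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)

# Hypothetical tiny configuration; weights are random, not trained.
demo_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
untrained_model = BertModel(demo_config)
untrained_model.eval()  # turn off dropout for a deterministic forward pass

# One fake "sentence" of 8 random token IDs.
input_ids = torch.randint(0, demo_config.vocab_size, (1, 8))

with torch.no_grad():
    outputs = untrained_model(input_ids)

# One vector of size hidden_size per input token.
print(tuple(outputs.last_hidden_state.shape))  # (1, 8, 32)
```

The shapes are already correct, which is why the model "can be used", but the vectors themselves only become useful after training.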

## Model from pretrained

Loading a Transformer model that is already trained is simple — we can do this using the `from_pretrained()` method.

In [8]:
from transformers import BertModel

In [None]:
model = BertModel.from_pretrained("bert-base-cased")

As you saw earlier, we could replace `BertModel` with the equivalent `AutoModel` class. **We’ll do this from now on**, as it produces checkpoint-agnostic code: if your code works for one checkpoint, it should work seamlessly with another. This applies even if the architecture is different, as long as the checkpoint was trained for a similar task (for example, a sentiment analysis task).
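The following sketch illustrates what checkpoint-agnostic means without downloading anything. It saves two tiny, randomly initialized models of different architectures to temporary folders and then loads both with the same line of code. The helper `load_encoder` and all sizes are hypothetical; with a real Hub identifier the loading call would look the same.

```python
import tempfile

from transformers import (
    AutoModel,
    BertConfig,
    BertModel,
    DistilBertConfig,
    DistilBertModel,
)


def load_encoder(checkpoint):
    """Hypothetical helper: identical code for every supported architecture."""
    return AutoModel.from_pretrained(checkpoint)


# Two stand-in "checkpoints" with hypothetical tiny sizes.
bert = BertModel(
    BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
               num_attention_heads=2, intermediate_size=64)
)
distilbert = DistilBertModel(
    DistilBertConfig(vocab_size=100, dim=32, hidden_dim=64, n_layers=1, n_heads=2)
)

with tempfile.TemporaryDirectory() as bert_dir, tempfile.TemporaryDirectory() as distil_dir:
    bert.save_pretrained(bert_dir)
    distilbert.save_pretrained(distil_dir)
    # AutoModel reads each folder's config.json to choose the right class.
    loaded_types = [type(load_encoder(path)).__name__ for path in (bert_dir, distil_dir)]

print(loaded_types)  # ['BertModel', 'DistilBertModel']
```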

In the code sample above we didn’t use `BertConfig`, and instead loaded a pretrained model via the `bert-base-cased` identifier. This is a model checkpoint that was trained by the authors of BERT themselves; you can find more details about it in its model card.

This model is now initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

The weights have been downloaded and cached (so future calls to the `from_pretrained()` method won’t re-download them) in the cache folder, which defaults to `~/.cache/huggingface/hub` in recent versions of the library (older releases used `~/.cache/huggingface/transformers`). You can move the cache by setting the `HF_HOME` environment variable, which relocates the parent `~/.cache/huggingface` directory.
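As a rough illustration of how the cache location is resolved, the sketch below mimics the lookup order that, to our understanding, recent `huggingface_hub` releases document: `HF_HUB_CACHE` first, then a `hub` folder under `HF_HOME`, then the default under `~/.cache/huggingface`. The function `guess_hub_cache_dir` is hypothetical and is not part of any library; treat it as an approximation rather than the library's actual logic.

```python
import os


def guess_hub_cache_dir(env=None):
    """Hypothetical approximation of where downloaded checkpoints are cached."""
    env = os.environ if env is None else env

    # An explicit cache location wins over everything else.
    if env.get("HF_HUB_CACHE"):
        return env["HF_HUB_CACHE"]

    # Otherwise the cache is the "hub" folder inside the Hugging Face home.
    base_cache = env.get("XDG_CACHE_HOME") or os.path.expanduser("~/.cache")
    hf_home = env.get("HF_HOME") or os.path.join(base_cache, "huggingface")
    return os.path.join(hf_home, "hub")


print(guess_hub_cache_dir())                         # depends on your machine
print(guess_hub_cache_dir({"HF_HOME": "/data/hf"}))  # a path ending in "hub"
```

If you only want to redirect a single download, `from_pretrained()` also accepts a `cache_dir` argument.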

The identifier used to load the model can be the identifier of any model on the Model Hub, as long as it is compatible with the BERT architecture. The entire list of available BERT checkpoints can be found [here](https://huggingface.co/models?filter=bert).

In [12]:
# we could use this to create a new model with the same configuration
# but random weights
model.config

BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.43.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}
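The comment in the cell above can be sketched as follows. To keep the example offline, a tiny randomly initialized model stands in for the pretrained one; the names `source_model` and `fresh_model` and the sizes are assumptions for illustration only.

```python
import torch
from transformers import BertConfig, BertModel

# Stand-in for a pretrained model (hypothetical tiny sizes).
source_model = BertModel(
    BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
               num_attention_heads=2, intermediate_size=64)
)

# Passing an existing config to the constructor builds the same architecture,
# but the weights are freshly and randomly initialized.
fresh_model = BertModel(source_model.config)

same_shapes = all(
    p.shape == q.shape
    for p, q in zip(source_model.parameters(), fresh_model.parameters())
)
same_weights = torch.equal(
    source_model.embeddings.word_embeddings.weight,
    fresh_model.embeddings.word_embeddings.weight,
)

print(same_shapes, same_weights)  # True False
```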

# Saving methods

Saving a model is as easy as loading one — we use the `save_pretrained()` method, which is analogous to the `from_pretrained()` method.

In [13]:
model.save_pretrained("tmp")
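A small round trip shows what `save_pretrained()` writes and that `from_pretrained()` can read it back from a local folder. This is a sketch using a tiny, randomly initialized model with hypothetical sizes; the exact name of the weights file depends on the library version (typically `model.safetensors` in recent releases, `pytorch_model.bin` in older ones).

```python
import os
import tempfile

import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny model so that saving and loading are instantaneous.
tiny_model = BertModel(
    BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
               num_attention_heads=2, intermediate_size=64)
)

with tempfile.TemporaryDirectory() as save_dir:
    tiny_model.save_pretrained(save_dir)
    # Expect config.json (the architecture) plus a weights file.
    saved_files = sorted(os.listdir(save_dir))
    # A local folder can be passed wherever a Hub identifier is accepted.
    reloaded_model = BertModel.from_pretrained(save_dir)

print(saved_files)

# Every tensor of the reloaded model matches the one we saved.
reloaded_state = reloaded_model.state_dict()
weights_match = all(
    torch.equal(tensor, reloaded_state[name])
    for name, tensor in tiny_model.state_dict().items()
)
print(weights_match)  # True
```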