# Creating a Transformer model "from scratch"

This notebook covers: 

- Steps for creating and using a model with the Hugging Face `TFAutoModel` class.
  - The `TFAutoModel` class and all of its relatives are simple wrappers over the wide variety of models available in the library.
  - They can automatically "guess" the appropriate model architecture for your checkpoint and then instantiate a model with that architecture.
  - However, if you already know the type of model you want to use, you can use the class that defines its architecture directly.
- BERT is used as the running example.
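
The "guessing" described above can be pictured as a lookup from the checkpoint's configuration to an architecture class. The following is only a conceptual sketch with made-up names (`ToyBertModel`, `ToyDistilBertModel`, `TOY_MODEL_REGISTRY`, `toy_auto_model`); it is not how the library is implemented, and it assumes that the `model_type` field of the configuration is what identifies the architecture.

```python
# Conceptual sketch only: a toy registry, not the transformers implementation.
class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyDistilBertModel:
    def __init__(self, config):
        self.config = config


# Hypothetical mapping from the "model_type" config field to an architecture class.
TOY_MODEL_REGISTRY = {
    "bert": ToyBertModel,
    "distilbert": ToyDistilBertModel,
}


def toy_auto_model(config):
    """Pick the class named by config["model_type"] and instantiate it."""
    model_type = config["model_type"]
    if model_type not in TOY_MODEL_REGISTRY:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return TOY_MODEL_REGISTRY[model_type](config)


toy_model = toy_auto_model({"model_type": "bert", "hidden_size": 768})
print(type(toy_model).__name__)  # prints ToyBertModel
```

Calling `ToyBertModel(config)` directly corresponds to the last point in the list: when the architecture is already known, the lookup step can be skipped.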



In [2]:
print("Step 1: Initialize a BERT model with configuration object.")

from transformers import TFBertModel, BertConfig

print("Building the config:...")
config = BertConfig()
print(config)

print("Building the model (from config):...")
model = TFBertModel(config)

Step 1: Initialize a BERT model with configuration object.
Building the config:...
BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.11.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

Building the model (from config):...


2021-11-16 13:23:36.837423: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-16 13:23:36.865010: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7feeffb7bbc0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-11-16 13:23:36.865021: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
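
The configuration printed above fully determines the size of the network. As a rough check, the sketch below counts the trainable parameters implied by those values. It assumes the standard BERT encoder layout (three embedding tables, query/key/value/output projections with biases, a two-layer feed-forward block, layer normalization with scale and bias, and a pooler); the helper name `count_bert_parameters` is made up for this illustration.

```python
# Values copied from the BertConfig printed above.
config_values = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 30522,
}


def count_bert_parameters(cfg):
    """Count parameters, assuming the standard BERT encoder layout."""
    h = cfg["hidden_size"]
    i = cfg["intermediate_size"]
    layer_norm = 2 * h  # scale and bias

    # Word, position and token-type embedding tables, then a layer norm.
    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm

    # Query, key, value and output projections (weights + biases), then a layer norm.
    attention = 4 * (h * h + h) + layer_norm

    # Two dense layers (h -> i -> h) with biases, then a layer norm.
    feed_forward = (h * i + i) + (i * h + h) + layer_norm

    # Dense pooler applied to the first token.
    pooler = h * h + h

    return embeddings + cfg["num_hidden_layers"] * (attention + feed_forward) + pooler


total_parameters = count_bert_parameters(config_values)
print(f"{total_parameters:,}")  # prints 109,482,240
```

All of these roughly 109 million values start out random when the model is built from a configuration alone.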


- The model can be used in this state, but it will output gibberish: a model built from a configuration alone has **randomly initialized weights, so it needs to be trained first**.
- We could train the model from scratch on the task at hand, but this would require a long time and a lot of data, and it would have a non-negligible environmental impact. 

- To avoid unnecessary and duplicated effort, it’s imperative to be able to **share and reuse models that have already been trained**. We can do this using the `from_pretrained()` method:

In [6]:
model = TFBertModel.from_pretrained("bert-base-cased")

Downloading: 100%|██████████| 570/570 [00:00<00:00, 207kB/s]
Downloading: 100%|██████████| 502M/502M [00:17<00:00, 29.9MB/s]
Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


- We could replace `TFBertModel` with the equivalent `TFAutoModel` class. This produces **checkpoint-agnostic code**, i.e. if your code works for one checkpoint, it should work seamlessly with another. 
- This applies even if the architecture is different, as long as the checkpoint was trained for a **similar task** (for example, a sentiment analysis task).
- In the code sample above we didn’t use `BertConfig`, and instead loaded a pretrained model via the `bert-base-cased` identifier.
- This is a **model checkpoint** that was trained by the authors of BERT themselves; you can find more details about it in its [model card](https://huggingface.co/bert-base-cased).


---

This model is now initialized with all the weights of the checkpoint. It 

- **can now be used directly for inference on the tasks it was trained on** OR 
- **fine-tuned on a new task**. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

The weights have been downloaded and cached (so future calls to the `from_pretrained()` method won’t re-download them) in the **cache folder** (default: `~/.cache/huggingface/transformers`). You can customize your cache folder by setting the `HF_HOME` environment variable.
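
A minimal sketch of setting `HF_HOME` from Python follows. The directory `~/hf_cache` is a hypothetical location chosen for illustration, and the sketch assumes the variable has to be set before `transformers` is imported, since the library is expected to read it at import time.

```python
import os

# Hypothetical cache location; any writable directory should work.
os.environ["HF_HOME"] = os.path.expanduser("~/hf_cache")

# Import transformers only after the variable is set, for example:
# from transformers import TFBertModel
# model = TFBertModel.from_pretrained("bert-base-cased")

print(os.environ["HF_HOME"])
```

Setting the variable in the shell before starting the notebook achieves the same thing and avoids depending on import order.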

The identifier used to load the model can be the identifier of any model on the Model Hub, as long as it is **compatible with the BERT architecture**. The entire list of available BERT checkpoints can be found [here](https://huggingface.co/models?filter=bert).



## Saving methods

Saving a model is as easy as loading one. We use the `save_pretrained()` method, which is analogous to the `from_pretrained()` method.

In [11]:
myMacDirectory = "models/bertBaseCased"
model.save_pretrained(myMacDirectory)

This will save a **config.json** file with the **attributes** necessary to build the model **architecture**. This file also contains some **metadata**, such as where the checkpoint originated and what HF Transformers version you were using when you last saved the checkpoint.

The **tf_model.h5** file is known as the **state dictionary**; it contains all your **model’s weights**. 

The two files go hand in hand; the configuration is necessary to know your **model’s architecture**, while the model weights are your **model’s parameters**.



In [13]:
# Use a context manager so the file is closed after reading
with open("models/bertBaseCased/config.json") as config_file:
    print(config_file.read())

{
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.11.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}
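
Because `config.json` is plain JSON, it can also be inspected with the standard library alone. The sketch below parses an excerpt of the file shown above (the remaining fields are left out for brevity) and compares its `vocab_size` with the value from the default `BertConfig` printed earlier. The variable names are chosen for this illustration.

```python
import json

# Excerpt of the config.json printed above; other fields omitted for brevity.
config_text = """
{
  "_name_or_path": "bert-base-cased",
  "architectures": ["BertModel"],
  "hidden_size": 768,
  "model_type": "bert",
  "num_hidden_layers": 12,
  "vocab_size": 28996
}
"""

saved_config = json.loads(config_text)
default_vocab_size = 30522  # value from the default BertConfig printed earlier

print(saved_config["model_type"], saved_config["architectures"])
print("vocab_size differs from default:", saved_config["vocab_size"] != default_vocab_size)
```

The difference in `vocab_size` shows that the saved configuration describes the `bert-base-cased` checkpoint rather than the library defaults.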



# Using a Transformer model for inference (predictions)

We have now **loaded and saved** a BERT model. Next, **let’s use it to make some predictions**.

- Model inputs need to be numeric.
- The tokenizer converts the input text to vocabulary indices, which are typically called **input IDs**.
- Each sequence is then a list of numbers.
- Tokenizers can also take care of casting these inputs to the appropriate framework’s tensors.
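
To make the text-to-IDs step concrete, here is a toy encoder. It is only a sketch: the vocabulary contains nothing but the IDs that appear in this notebook's example sequences, the regular-expression splitting stands in for a real subword tokenizer, and the names `TOY_VOCAB` and `toy_encode` are made up for this illustration.

```python
import re

# Toy vocabulary limited to the IDs used in this notebook's example.
TOY_VOCAB = {
    "[CLS]": 101,
    "[SEP]": 102,
    "hello": 7592,
    "cool": 4658,
    "nice": 3835,
    "!": 999,
    ".": 1012,
}


def toy_encode(text):
    """Lowercase, split into words and punctuation, and map each token to its ID."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Unknown tokens raise KeyError here; a real tokenizer falls back to subwords.
    token_ids = [TOY_VOCAB[token] for token in tokens]
    # Wrap the sequence in the special start and end tokens.
    return [TOY_VOCAB["[CLS]"]] + token_ids + [TOY_VOCAB["[SEP]"]]


toy_encoded = [toy_encode(text) for text in ["Hello!", "Cool.", "Nice!"]]
print(toy_encoded)
# prints [[101, 7592, 999, 102], [101, 4658, 1012, 102], [101, 3835, 999, 102]]
```

In practice the checkpoint's own tokenizer should be used, since each checkpoint has its own vocabulary.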



In [16]:
# Example: Let’s say we have a couple of sequences:
sequences = ["Hello!", "Cool.", "Nice!"]

encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

# Convert to tensor
import tensorflow as tf
model_inputs = tf.constant(encoded_sequences)
print(model_inputs)

tf.Tensor(
[[ 101 7592  999  102]
 [ 101 4658 1012  102]
 [ 101 3835  999  102]], shape=(3, 4), dtype=int32)


This tensor can now be used for the model: 

In [19]:
output = model(model_inputs)

In [20]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")
output = model(**tokens)

2021-11-16 13:58:27.958642: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-

In [25]:
predictions = tf.math.softmax(output.logits, axis=-1)
print("Prediction: ", predictions)
print(model.config.id2label)


Prediction:  tf.Tensor(
[[4.0195312e-02 9.5980465e-01]
 [5.3534308e-04 9.9946469e-01]], shape=(2, 2), dtype=float32)
{0: 'NEGATIVE', 1: 'POSITIVE'}
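
The last step, turning scores into labels, can be sketched without TensorFlow. The probabilities below are rounded copies of the output above and `id2label` is the mapping printed by the model configuration; the logits passed to `softmax` are hypothetical values chosen only to show the normalization, since the raw logits are not printed in this notebook.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Rounded probabilities copied from the prediction output above.
probabilities = [[0.0402, 0.9598], [0.0005, 0.9995]]

# For each sequence, pick the index of the highest probability and look up its label.
labels = [id2label[max(range(len(row)), key=row.__getitem__)] for row in probabilities]
print(labels)  # prints ['POSITIVE', 'POSITIVE']

# Hypothetical logits, only to illustrate what softmax does.
example_probabilities = softmax([-1.5, 1.5])
print(example_probabilities)
```

Both example sentences are therefore classified as `POSITIVE`, the second one with higher confidence.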
