### Use TFAutoModel to instantiate any model from a checkpoint.

#### First, create a BERT Transformer model

In [1]:
from transformers import BertConfig, TFBertModel

# Building the config
config = BertConfig()

# Building the model from the config (weights are randomly initialized, not pretrained)
model = TFBertModel(config)

In [2]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
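A few of the printed values are related by simple arithmetic. The sketch below copies them into a plain dictionary (the names `config_values`, `head_dim` and `ffn_ratio` are illustrative, not part of the library) and derives the size of each attention head and the widening factor of the feed-forward layer.

```python
# Values copied from the BertConfig printed above.
config_values = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
}

# Each attention head works on an equal slice of the hidden vector.
head_dim = config_values["hidden_size"] // config_values["num_attention_heads"]

# The feed-forward layer is wider than the hidden size by this factor.
ffn_ratio = config_values["intermediate_size"] // config_values["hidden_size"]

print(head_dim)   # 64
print(ffn_ratio)  # 4
```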



### Loading a Transformer model that is already trained

In [3]:
from transformers import TFBertModel
# Load the pretrained weights and config for this checkpoint
model = TFBertModel.from_pretrained("bert-base-cased")

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [4]:
# Save the model's config and weights to a local directory
model.save_pretrained("directory_on_my_computer")

In [5]:
sequences = ["Hello!", "Cool.", "Nice!"]

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [9]:
# Tokenize the text
inputs = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")

In [10]:
inputs

{'input_ids': <tf.Tensor: shape=(3, 4), dtype=int32, numpy=
array([[  101,  8667,   106,   102],
       [  101, 13297,   119,   102],
       [  101,  8835,   106,   102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(3, 4), dtype=int32, numpy=
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(3, 4), dtype=int32, numpy=
array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]], dtype=int32)>}
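All three sequences above happen to have the same length, so `padding=True` had nothing to do and every attention mask entry is 1. The sketch below shows, in plain Python, roughly what padding does when lengths differ. The helper `pad_batch` and the second, shorter row are made up for illustration; the real tokenizer handles this internally.

```python
PAD_TOKEN_ID = 0  # matches "pad_token_id": 0 in the config printed earlier

def pad_batch(batch, pad_id=PAD_TOKEN_ID):
    """Right-pad every ID list to the longest length and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# First row is taken from the output above; the second row is a hypothetical
# shorter sequence that contains only the two special tokens.
batch = [[101, 8667, 106, 102], [101, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[101, 8667, 106, 102], [101, 102, 0, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```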

In [11]:
import tensorflow as tf

# `inputs` comes from the tokenizer call above; pass its token IDs to the model
model_inputs = tf.constant(inputs["input_ids"])

In [12]:
output = model(model_inputs)

In [13]:
output

TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=<tf.Tensor: shape=(3, 4, 768), dtype=float32, numpy=
array([[[ 4.4495794e-01,  4.8276469e-01,  2.7797124e-01, ...,
         -5.4033011e-02,  3.9393547e-01, -9.4769903e-02],
        [ 2.4942909e-01, -4.4092974e-01,  8.1772232e-01, ...,
         -3.1916600e-01,  2.2992243e-01, -4.1171879e-02],
        [ 1.3667616e-01,  2.2517926e-01,  1.4502074e-01, ...,
         -4.6914838e-02,  2.8224120e-01,  7.5565144e-02],
        [ 1.1788865e+00,  1.6738632e-01, -1.8187091e-01, ...,
          2.4671289e-01,  1.0440778e+00, -6.1962316e-03]],

       [[ 3.6435887e-01,  3.2464162e-02,  2.0257710e-01, ...,
          6.0111675e-02,  3.2451254e-01, -2.0995861e-02],
        [ 7.1866012e-01, -4.8725188e-01,  5.1740503e-01, ...,
         -4.4012001e-01,  1.4553067e-01, -3.7545212e-02],
        [ 3.3223301e-01, -2.3270953e-01,  9.4875395e-02, ...,
         -2.5268134e-01,  3.2171938e-01,  8.1036706e-04],
        [ 1.2523220e+00,  3.5754532e-01,

# Before moving to the next step, we need to know what tokenizers are.

### There are also variations of word tokenizers that have extra rules for punctuation. With this kind of tokenizer, we can end up with some pretty large “vocabularies”.

#### Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word.

In [14]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
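To make the word-to-ID mapping concrete, here is a minimal sketch that builds a vocabulary from the split words and encodes the sentence with it. The name `vocab` and the one-sentence corpus are assumptions for illustration; a real word-based tokenizer would build its vocabulary from a large corpus.

```python
tokenized_text = "Jim Henson was a puppeteer".split()

# Assign each new word the next free ID, starting from 0.
vocab = {}
for word in tokenized_text:
    if word not in vocab:
        vocab[word] = len(vocab)

print(vocab)
# {'Jim': 0, 'Henson': 1, 'was': 2, 'a': 3, 'puppeteer': 4}

# Encoding a sentence is then a dictionary lookup per word.
word_ids = [vocab[word] for word in "Jim was a puppeteer".split()]
print(word_ids)  # [0, 2, 3, 4]
```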


### But splitting into words gives a vocabulary that is too large, so we need to use subword tokenization.

### Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:
#### The vocabulary is much smaller.
#### There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

### The drawback is that our model ends up dealing with a large number of tokens: while a word is just one token with a word-based tokenizer, it can easily become 10 or more tokens when converted to characters.
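The difference in sequence length is easy to check on the example sentence. This is a naive sketch: it treats every character, including spaces, as one token, which real character-based tokenizers may handle differently.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()  # one token per whitespace-separated word
char_tokens = list(text)    # one token per character, spaces included

print(len(word_tokens))  # 5
print(len(char_tokens))  # 26
```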
### We can use a third technique that combines the two approaches: subword tokenization.

#### This simply means splitting rare or complex words into smaller meaningful pieces, e.g., "tokenization" might be considered a rare word that can be broken down into "token" and "ization". Both of these pieces are likely to occur more frequently as separate subwords.
#### This allows us to have relatively good coverage with small vocabularies, and close to no unknown tokens.
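One common way to perform this split is greedy longest-match against a fixed subword vocabulary. The sketch below is a simplified illustration of that idea, not the library's implementation: the function `wordpiece` and the tiny `toy_vocab` are made up, and the `##` prefix follows the BERT convention of marking a piece that continues a word.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary, just large enough for this example.
toy_vocab = {"token", "##ization", "simple"}

print(wordpiece("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece("simple", toy_vocab))        # ['simple']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```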

In [15]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [16]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [17]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json')

In [18]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


In [19]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
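These IDs are the same as the `input_ids` returned by calling the tokenizer directly, except that the direct call also wraps them in the special tokens 101 and 102 (`[CLS]` and `[SEP]` for this checkpoint). The sketch below rebuilds that dictionary by hand for a single unpadded sentence; the constant names are illustrative.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the bert-base-cased vocabulary

# IDs printed by convert_tokens_to_ids above.
ids = [7993, 170, 13809, 23763, 2443, 1110, 3014]

input_ids = [CLS_ID] + ids + [SEP_ID]
encoding = {
    "input_ids": input_ids,
    "token_type_ids": [0] * len(input_ids),  # single sentence: all segment 0
    "attention_mask": [1] * len(input_ids),  # no padding: attend to everything
}

print(encoding["input_ids"])
# [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102]
```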


In [20]:
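# Note: these IDs differ from `ids` above in the third and fourth positions,
# which is why the decoded text below reads "transformer" in lowercase.
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])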
print(decoded_string)

Using a transformer network is simple
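Decoding does more than map IDs back to tokens: it also merges subword pieces into whole words. A rough sketch of that merging step, using the tokens printed earlier, could look like this. The helper `join_wordpieces` is hypothetical; the real `decode` also converts IDs to tokens first and cleans up spacing around punctuation.

```python
def join_wordpieces(tokens):
    """Merge '##' continuation pieces into the preceding word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue onto the previous word
        else:
            words.append(token)
    return " ".join(words)

tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
print(join_wordpieces(tokens))  # Using a Transformer network is simple
```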
