## Intro
Let us take a closer look at the two steps that make up tokenization: splitting the text into subword tokens and encoding those tokens as ids.

In [5]:
from transformers import AutoModel, AutoTokenizer

# Load pre-trained model
model = AutoModel.from_pretrained("bert-base-cased")

# Some data to work with
raw_data = [
    "There may be too many of them, Mobin!",
    "You will not believe this",
    "Hey man! That is not cool at all."
]

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Usual tokenization
inputs = tokenizer(raw_data, padding=True, truncation=True, return_tensors="pt")
inputs

{'input_ids': tensor([[  101,  1247,  1336,  1129,  1315,  1242,  1104,  1172,   117, 12556,
          7939,   106,   102],
        [  101,  1192,  1209,  1136,  2059,  1142,   102,     0,     0,     0,
             0,     0,     0],
        [  101,  4403,  1299,   106,  1337,  1110,  1136,  4348,  1120,  1155,
           119,   102,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}
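
To make the padded output above easier to read, here is a minimal sketch in plain Python of what `padding=True` appears to do for this batch. The unpadded id lists are copied from the output above. The names `PAD_ID`, `sequences` and `max_len` are illustrative and not part of the library, and this is not how the tokenizer implements padding internally.

```python
PAD_ID = 0  # id used for padding in the rows above (the [PAD] token)

# Unpadded id sequences, copied from the tokenizer output above
sequences = [
    [101, 1247, 1336, 1129, 1315, 1242, 1104, 1172, 117, 12556, 7939, 106, 102],
    [101, 1192, 1209, 1136, 2059, 1142, 102],
    [101, 4403, 1299, 106, 1337, 1110, 1136, 4348, 1120, 1155, 119, 102],
]

# Pad every sequence on the right up to the length of the longest one
max_len = max(len(seq) for seq in sequences)
input_ids = [seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences]

# The attention mask is 1 for real tokens and 0 for padding positions
attention_mask = [
    [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
]

for row, mask in zip(input_ids, attention_mask):
    print(row)
    print(mask)
```

The mask tells the model which positions are padding so that they can be ignored by attention.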

Now imagine that we have fine-tuned the model and want to save it.

In [6]:
# Now we will dissect this into tokenization and encoding
# Tokenize into subwords
tokenized_data = tokenizer.tokenize(raw_data[0])
tokenized_data

['There', 'may', 'be', 'too', 'many', 'of', 'them', ',', 'Mo', '##bin', '!']
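
The split of `Mobin` into `Mo` and `##bin` comes from WordPiece, which at inference time matches the longest vocabulary entry greedily from left to right and marks word-internal pieces with `##`. The sketch below illustrates that idea on a single word. The function `wordpiece` and the tiny `toy_vocab` are hypothetical and only for illustration; the real tokenizer uses its full vocabulary and also handles casing, punctuation and whitespace.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real bert-base-cased vocabulary
toy_vocab = {"There", "may", "M", "Mo", "##b", "##bin"}

print(wordpiece("Mobin", toy_vocab))
print(wordpiece("There", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

With this toy vocabulary, `Mobin` is not an entry, so the longest matching prefix `Mo` is taken first and the remainder is matched as `##bin`.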

In [7]:
# Encoding
ids = tokenizer.convert_tokens_to_ids(tokenized_data)

# The other way around would be:
# tokens = tokenizer.convert_ids_to_tokens(ids)

ids

[1247, 1336, 1129, 1315, 1242, 1104, 1172, 117, 12556, 7939, 106]

The only difference here is that the special tokens marking the start and end of the sentence are missing: the ids 101 (`[CLS]`) and 102 (`[SEP]`) that appear in `input_ids` above.
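
This relationship can be checked directly with the numbers printed above. The following is a small sketch in plain Python; the id lists are copied from the two outputs, and `CLS_ID` and `SEP_ID` are illustrative names. With a loaded tokenizer the same values should be available as `tokenizer.cls_token_id` and `tokenizer.sep_token_id`.

```python
# Special token ids as they appear in the first output above
CLS_ID = 101  # start-of-sequence token, [CLS]
SEP_ID = 102  # end-of-sequence token, [SEP]

# Ids produced by convert_tokens_to_ids for the first sentence
ids = [1247, 1336, 1129, 1315, 1242, 1104, 1172, 117, 12556, 7939, 106]

# First row of input_ids produced by calling the tokenizer directly
full_row = [101, 1247, 1336, 1129, 1315, 1242, 1104, 1172, 117, 12556, 7939, 106, 102]

# Wrapping the plain ids with the special tokens reproduces the full row
with_special = [CLS_ID] + ids + [SEP_ID]
print(with_special == full_row)
```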

Now let us decode:

In [8]:
decoded_tokens = tokenizer.decode(ids)
decoded_tokens

'There may be too many of them, Mobin!'
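
Decoding maps the ids back to tokens and then joins the tokens into a string. A rough sketch of the joining step, assuming the usual WordPiece convention that `##` marks a continuation of the previous piece, is shown below. The punctuation clean-up here is a simplified stand-in written for this example, not the library's own routine.

```python
# Tokens for the first sentence, copied from the output above
tokens = ['There', 'may', 'be', 'too', 'many', 'of', 'them', ',', 'Mo', '##bin', '!']

# Join with spaces, then glue continuation pieces onto the previous piece
text = " ".join(tokens).replace(" ##", "")

# Simplified clean-up: remove the space before common punctuation marks
for mark in [",", "!", ".", "?"]:
    text = text.replace(" " + mark, mark)

print(text)
```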

Here, the config file holds the model's configuration, while the model.safetensors file holds the model's parameters (its weights).