
In [None]:
# Last amended: 7th June, 2024
# You may need a Hugging Face access token

# Ref:
#     1. https://huggingface.co/learn/nlp-course/chapter2/3?fw=pt
#

## About Models

The `AutoModel` class and its relatives are simple wrappers over the wide variety of models available in the library. Given a checkpoint, `AutoModel` automatically infers the appropriate model architecture and then instantiates a model with that architecture.  
In Hugging Face Transformers, a *checkpoint* typically refers to a saved set of model weights (for example, `bert-base-cased`), often captured during or at the end of training.
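
To make the idea of "guessing the architecture from the checkpoint" concrete, here is a minimal toy sketch of config-driven dispatch. It is not the `transformers` implementation; `ToyConfig`, `ToyBertModel`, `ToyGPT2Model`, `MODEL_REGISTRY` and `ToyAutoModel` are hypothetical names invented for illustration. The only thing borrowed from the library is the idea that a config carries a `model_type` field.

```python
# Toy sketch of config-driven dispatch; NOT the transformers implementation.
from dataclasses import dataclass


@dataclass
class ToyConfig:
    model_type: str   # plays the role of the "model_type" field of a config
    hidden_size: int


class ToyBertModel:
    def __init__(self, config: ToyConfig):
        self.config = config


class ToyGPT2Model:
    def __init__(self, config: ToyConfig):
        self.config = config


# Hypothetical registry: maps a config's model_type to a model class.
MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


class ToyAutoModel:
    @staticmethod
    def from_config(config: ToyConfig):
        """Pick the class that matches config.model_type and instantiate it."""
        try:
            model_cls = MODEL_REGISTRY[config.model_type]
        except KeyError:
            raise ValueError(f"Unknown model_type: {config.model_type!r}") from None
        return model_cls(config)


model = ToyAutoModel.from_config(ToyConfig(model_type="bert", hidden_size=768))
print(type(model).__name__)
```

The caller never names the concrete class; the lookup on `model_type` does that, which is why the same `Auto*` call can serve checkpoints of different architectures.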



Here we use the model-specific class `BertModel` directly:

In [2]:
# 1.0
from transformers import BertConfig, BertModel

# 1.0.1 Building the config
# It describes the architecture of the NN
# (default BERT hyperparameters; nothing is downloaded)

config = BertConfig()

# 1.0.2 Use the config to build the model
#       The model is initialized with random weights:

model = BertModel(config)

In [2]:
# 1.0.3 Here is the architecture:

print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.41.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
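
As a rough worked example of what these numbers imply, the sketch below estimates the parameter count from the config values printed above. It assumes the standard BERT layout (word, position and token-type embeddings with a LayerNorm; per layer, query/key/value/output projections, a two-layer feed-forward block and two LayerNorms; a final pooler). The helper names `linear` and `layer_norm` are made up for this illustration.

```python
# Values copied from the printed BertConfig above.
hidden_size = 768
intermediate_size = 3072
num_hidden_layers = 12
vocab_size = 30522
max_position_embeddings = 512
type_vocab_size = 2


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (
    vocab_size * hidden_size
    + max_position_embeddings * hidden_size
    + type_vocab_size * hidden_size
    + layer_norm(hidden_size)
)

# One encoder layer under the assumed layout.
encoder_layer = (
    4 * linear(hidden_size, hidden_size)        # query, key, value, attention output
    + layer_norm(hidden_size)
    + linear(hidden_size, intermediate_size)    # feed-forward, first layer
    + linear(intermediate_size, hidden_size)    # feed-forward, second layer
    + layer_norm(hidden_size)
)

pooler = linear(hidden_size, hidden_size)

total_params = embeddings + num_hidden_layers * encoder_layer + pooler
print(f"{total_params:,}")
```

If the layout assumption holds, this figure can be checked against `sum(p.numel() for p in model.parameters())` on the `model` built earlier.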



In [4]:
# 1.0.4 Alternatively, load BertModel directly
#       from a checkpoint: bert-base-cased
#       This builds the model and also loads its pretrained weights:

from transformers import BertModel

model_b = BertModel.from_pretrained("bert-base-cased")

Alternatively, use AutoModel class:

In [5]:
from transformers import AutoModel

model_a = AutoModel.from_pretrained("bert-base-cased")

Experiment with some sequences:

In [6]:
sequences = ["Hello!", "Cool.", "Nice!"]

In [7]:
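# Hard-coded token IDs, as in the referenced course
# (not produced by a tokenizer in this notebook)
encoded_sequences = [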
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

In [8]:
import torch

# Convert to a PyTorch tensor:
model_inputs = torch.tensor(encoded_sequences)

In [12]:
# Pass the tensor through the model.
# output_a[0] is the last hidden state; [:1] keeps only the first sequence:

output_a = model_a(model_inputs)
output_a[0][:1]

tensor([[[ 0.4450,  0.4828,  0.2780,  ..., -0.0540,  0.3939, -0.0948],
         [ 0.2494, -0.4409,  0.8177,  ..., -0.3192,  0.2299, -0.0412],
         [ 0.1367,  0.2252,  0.1450,  ..., -0.0469,  0.2822,  0.0756],
         [ 1.1789,  0.1674, -0.1819,  ...,  0.2467,  1.0441, -0.0062]]],
       grad_fn=<SliceBackward0>)

In [13]:
# Pass the same tensor through the BertModel instance;
# the output matches that of the AutoModel instance above:
output_b = model_b(model_inputs)
output_b[0][:1]

tensor([[[ 0.4450,  0.4828,  0.2780,  ..., -0.0540,  0.3939, -0.0948],
         [ 0.2494, -0.4409,  0.8177,  ..., -0.3192,  0.2299, -0.0412],
         [ 0.1367,  0.2252,  0.1450,  ..., -0.0469,  0.2822,  0.0756],
         [ 1.1789,  0.1674, -0.1819,  ...,  0.2467,  1.0441, -0.0062]]],
       grad_fn=<SliceBackward0>)

## [Tokenizers](https://huggingface.co/learn/nlp-course/chapter2/4?fw=pt)

In [24]:
from transformers import BertTokenizer

tokenizer_a = BertTokenizer.from_pretrained("bert-base-cased")

In [25]:
from transformers import AutoTokenizer

tokenizer_b = AutoTokenizer.from_pretrained("bert-base-cased")

In [27]:
tokenizer_a("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [26]:
tokenizer_b("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [31]:
tokens = tokenizer_a.tokenize("Using a Transformer network is simple")
tokens

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']

In [32]:
tokens = tokenizer_b.tokenize("Using a Transformer network is simple")
tokens

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']

In [34]:
ids = tokenizer_a.convert_tokens_to_ids(tokens)
ids

[7993, 170, 13809, 23763, 2443, 1110, 3014]

In [35]:
ids = tokenizer_b.convert_tokens_to_ids(tokens)
ids

[7993, 170, 13809, 23763, 2443, 1110, 3014]
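
The steps above (split into subword tokens, map tokens to IDs, add special tokens) can be sketched with a toy greedy longest-match-first WordPiece splitter. This is an illustration only, not the `BertTokenizer` implementation: `TOY_VOCAB`, `wordpiece`, `toy_tokenize` and `toy_encode` are hypothetical names, the vocabulary holds just the tokens and IDs visible in the outputs above, and the ID 100 for `[UNK]` is an assumption.

```python
# Toy vocabulary: tokens and IDs copied from the outputs shown above.
TOY_VOCAB = {
    "[UNK]": 100,   # assumed ID for the unknown token
    "[CLS]": 101,
    "[SEP]": 102,
    "Using": 7993,
    "a": 170,
    "Trans": 13809,
    "##former": 23763,
    "network": 2443,
    "is": 1110,
    "simple": 3014,
}


def wordpiece(word: str, vocab: dict) -> list:
    """Split one word greedily, always taking the longest piece found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]                   # no piece matched: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text: str) -> list:
    """Whitespace split followed by WordPiece splitting of each word."""
    return [p for word in text.split() for p in wordpiece(word, TOY_VOCAB)]


def toy_encode(text: str) -> list:
    """Token IDs wrapped in [CLS] ... [SEP], mirroring the input_ids shown above."""
    ids = [TOY_VOCAB[t] for t in toy_tokenize(text)]
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


toy_tokens = toy_tokenize("Using a Transformer network is simple")
toy_input_ids = toy_encode("Using a Transformer network is simple")
print(toy_tokens)
print(toy_input_ids)
```

Calling the tokenizer object directly adds the special tokens `[CLS]` and `[SEP]` (IDs 101 and 102), which is why `input_ids` is two entries longer than the list returned by `convert_tokens_to_ids`.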

In [36]:
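# Decode a list of IDs back to text.
# Note: these IDs differ from `ids` above, hence the lowercase "transformer" in the output.
decoded_string = tokenizer_a.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)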

Using a transformer network is simple
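
Decoding does more than look up each ID: pieces that start with `##` are glued back onto the preceding piece. Here is a minimal sketch of that merging step, using the IDs returned by `convert_tokens_to_ids` earlier. `ID_TO_TOKEN` and `toy_decode` are hypothetical names for illustration, not the library's decoder.

```python
# ID-to-token map copied from the tokenizer outputs shown earlier.
ID_TO_TOKEN = {
    7993: "Using",
    170: "a",
    13809: "Trans",
    23763: "##former",
    2443: "network",
    1110: "is",
    3014: "simple",
}


def toy_decode(ids: list) -> str:
    """Map IDs to tokens and merge ## continuation pieces into the previous word."""
    words = []
    for token in (ID_TO_TOKEN[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]   # strip the ## marker and append to the last word
        else:
            words.append(token)
    return " ".join(words)


toy_text = toy_decode([7993, 170, 13809, 23763, 2443, 1110, 3014])
print(toy_text)
```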


In [None]:
# Further:
# https://huggingface.co/learn/nlp-course/chapter2/5?fw=pt
