Tokenization with Hugging Face Transformers

In [2]:
import transformers

In [3]:
from transformers import BertTokenizer

In [11]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [13]:
text = "BERT preprocessing is essential."

In [15]:
tokens = tokenizer.tokenize(text)

In [17]:
print(tokens)

['bert', 'prep', '##ro', '##ces', '##sing', 'is', 'essential', '.']
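
The split of "preprocessing" into 'prep', '##ro', '##ces', '##sing' comes from WordPiece's greedy longest-match-first lookup. The sketch below reproduces that lookup in plain Python. The toy vocabulary and the helper names `wordpiece` and `tokenize_text` are assumptions made for illustration; the real bert-base-uncased vocabulary has roughly 30,000 entries and the real tokenizer applies more normalization.

```python
import re

# Toy vocabulary (assumption): just enough entries to reproduce the output above.
TOY_VOCAB = {"bert", "prep", "##ro", "##ces", "##sing", "is", "essential", ".", "[UNK]"}


def wordpiece(word, vocab):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize_text(text, vocab):
    """Lowercase, split on whitespace and punctuation, then apply WordPiece."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    return [piece for word in words for piece in wordpiece(word, vocab)]


toy_tokens = tokenize_text("BERT preprocessing is essential.", TOY_VOCAB)
print(toy_tokens)
```

Because the longest candidate is tried first, "prep" wins over any shorter prefix, and each later piece is looked up with the ## prefix.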


Text Classification with BERT

In [19]:
from transformers import BertForSequenceClassification, BertTokenizer

In [21]:
import torch

In [23]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



In [25]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
text = "Artificial Intelligence is the simulation of human intelligence process by Machines"

In [31]:
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)

In [33]:
predictions = torch.argmax(outputs.logits, dim=1)

In [35]:
print(predictions)

tensor([1])
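
torch.argmax only reports which logit is larger. Turning the logits into probabilities with a softmax shows how confident the prediction is; with a newly initialized classifier head (see the warning above), the two probabilities are typically close to 0.5, so the predicted class is essentially arbitrary. The sketch below uses plain Python and hypothetical logit values, not the values from the run above.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence and two classes (assumption).
example_logits = [0.12, 0.31]
probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)

print(probs, predicted)
```

In the notebook itself, torch.softmax(outputs.logits, dim=1) should give the same kind of result on the real tensor.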


Visualizing Attention Weights

In [37]:
import torch
from transformers import BertModel, BertTokenizer

In [39]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [41]:
text = "Machine Learning is a branch of Artificial Intelligence"

In [43]:
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs, output_attentions=True)



In [45]:
attention_weights = outputs.attentions

In [47]:
print(attention_weights)

(tensor([[[[6.3404e-02, 7.4812e-02, 5.0517e-02,  ..., 5.8008e-02,
           3.8826e-02, 3.2250e-01],
          [3.3223e-02, 8.5746e-02, 1.7019e-01,  ..., 1.3015e-01,
           1.6430e-01, 1.6220e-01],
          [7.1231e-02, 1.3327e-01, 6.6658e-02,  ..., 1.9794e-01,
           1.3782e-01, 8.5812e-02],
          ...,
          [7.2923e-02, 1.8865e-01, 2.3325e-01,  ..., 1.0993e-01,
           1.5193e-01, 5.1124e-02],
          [4.9920e-02, 1.7651e-01, 1.0689e-01,  ..., 2.6957e-01,
           6.5506e-02, 6.6842e-02],
          [1.0438e-01, 7.2058e-02, 7.4685e-02,  ..., 8.7407e-02,
           4.5061e-02, 2.0507e-01]],

         [[5.2830e-01, 3.6457e-03, 1.7123e-03,  ..., 3.4617e-03,
           4.2933e-03, 5.6448e-03],
          [5.4541e-03, 6.7088e-02, 1.7184e-01,  ..., 6.8875e-02,
           4.2888e-01, 1.3295e-01],
          [6.5564e-03, 2.1092e-01, 8.8582e-02,  ..., 4.2514e-01,
           1.3918e-01, 3.3715e-02],
          ...,
          [2.7423e-02, 5.5378e-02, 1.0394e-01,  ..., 2.719
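
outputs.attentions is a tuple with one tensor per encoder layer (12 for bert-base), each shaped (batch, heads, seq_len, seq_len). Every row is a softmax over the keys, so it sums to 1. The sketch below shows how one such matrix is computed for a single head, in plain Python; the 2-dimensional query and key vectors and the function name `compute_attention_weights` are assumptions for illustration.

```python
import math


def compute_attention_weights(queries, keys):
    """Scaled dot-product attention weights: softmax(q . k / sqrt(d)) per query."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy query and key vectors for a 3-token sequence (assumption).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

toy_weights = compute_attention_weights(queries, keys)
for row in toy_weights:
    print([round(w, 3) for w in row])
```

For the real output, printing len(attention_weights) and attention_weights[0].shape is usually more readable than printing the full tensors.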

Pretraining and MLM

In [49]:
from transformers import BertForMaskedLM, BertTokenizer
import torch

In [51]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [53]:
text = "Deep Learning is subset of Machine Learning."

In [55]:
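inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, add_special_tokens=True)
# No token is replaced by [MASK] here, so the loss is computed over every
# position of the unmasked input rather than over hidden tokens only.
outputs = model(**inputs, labels=inputs['input_ids'])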

In [57]:
loss = outputs.loss

In [59]:
print(loss)

tensor(3.7838, grad_fn=<NllLossBackward0>)
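
The cell above passes the unmasked input_ids as both input and labels. In BERT pretraining, roughly 15% of the tokens are selected; of those, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged, while every unselected position gets the label -100 so the loss ignores it. The sketch below applies that rule to a plain list of ids. The function name `mask_tokens` and the sample ids are assumptions; 101, 102 and 103 are the [CLS], [SEP] and [MASK] ids of bert-base-uncased.

```python
import random

CLS_ID, SEP_ID, MASK_ID = 101, 102, 103  # bert-base-uncased special token ids
VOCAB_SIZE = 30522
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_tokens(input_ids, mask_prob=0.15, seed=0):
    """Return (masked_ids, labels) following the 80/10/10 masking rule."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if token_id in (CLS_ID, SEP_ID) or rng.random() >= mask_prob:
            continue  # never mask special tokens; skip unselected positions
        labels[i] = token_id  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


# Hypothetical token ids standing in for an encoded sentence.
sample_ids = [CLS_ID] + list(range(2000, 2020)) + [SEP_ID]
masked_ids, mlm_labels = mask_tokens(sample_ids)
print(masked_ids)
print(mlm_labels)
```

In the transformers library, DataCollatorForLanguageModeling provides this kind of masking, so a hand-written version like this is mainly useful for understanding the rule.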


Extracting Word Embeddings with Hugging Face Transformers

In [61]:
from transformers import BertTokenizer, BertModel
import torch

In [63]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [65]:
text = "R.L is a feedback-based Machine Learning."

In [67]:
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, add_special_tokens=True)
outputs = model(**inputs)

In [69]:
word_embeddings = outputs.last_hidden_state

In [71]:
print(word_embeddings)

tensor([[[-0.5229, -0.1897, -0.0321,  ..., -0.2837, -0.0803,  0.7297],
         [-0.0878, -0.3480,  0.3289,  ...,  0.1971,  0.7202,  0.1397],
         [-0.9098, -0.3596,  0.5022,  ...,  0.6466,  0.4512,  0.0088],
         ...,
         [-0.7197, -0.2314, -0.1856,  ..., -0.9756, -0.3295,  0.5237],
         [ 0.5293,  0.1158, -0.4517,  ...,  0.2678, -0.7594, -0.1698],
         [ 0.2906,  0.2216, -0.2134,  ...,  0.3586, -0.9532, -0.0962]]],
       grad_fn=<NativeLayerNormBackward0>)
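
last_hidden_state holds one vector per token, including [CLS], [SEP] and every WordPiece, so a word that was split into several pieces has several vectors. One common option (among several) for getting a single vector per word is to average the vectors of its pieces. The sketch below does this in plain Python; the function name `merge_subword_vectors` and the 3-dimensional toy vectors are assumptions, whereas bert-base vectors have 768 dimensions.

```python
def merge_subword_vectors(tokens, vectors):
    """Average the vectors of ##-continuation pieces into one vector per word."""
    words, grouped = [], []
    for token, vector in zip(tokens, vectors):
        if token.startswith("##") and words:
            words[-1] += token[2:]      # glue the piece onto the current word
            grouped[-1].append(vector)
        else:
            words.append(token)         # a new word starts here
            grouped.append([vector])
    averaged = [
        [sum(column) / len(group) for column in zip(*group)] for group in grouped
    ]
    return words, averaged


# Toy tokens and vectors (assumption).
toy_pieces = ["prep", "##ro", "is"]
toy_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [0.5, 0.5, 0.5]]

words, word_vectors = merge_subword_vectors(toy_pieces, toy_vectors)
print(words)
print(word_vectors)
```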


Fine-Tuning Intermediate Layers with Hugging Face Transformers

In [73]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

In [75]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [77]:
text = "Advanced fine-tuning with BERT."

In [79]:
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs, output_hidden_states=True)

In [81]:
intermediate_layer = outputs.hidden_states[6]  # output of encoder layer 6 (index 0 is the embedding output)

In [83]:
print(intermediate_layer)

tensor([[[-0.3527, -1.1815, -0.3027,  ..., -0.1794,  0.0872,  0.5573],
         [-0.3768,  0.1754,  0.3395,  ..., -0.0359,  0.2162, -1.2251],
         [ 1.8406, -0.6458,  0.5841,  ..., -0.7345,  0.7542, -0.1614],
         ...,
         [ 1.0414, -0.7009,  1.0362,  ...,  1.0581, -0.3068, -1.4171],
         [-0.8934, -0.8139, -0.3154,  ..., -0.3933, -0.6383,  0.0522],
         [ 0.0143, -0.0423, -0.0131,  ...,  0.0044, -0.0140, -0.0394]]],
       grad_fn=<NativeLayerNormBackward0>)


Using RoBERTa with Hugging Face Transformers

In [85]:
from transformers import RobertaTokenizer, RobertaModel
import torch

In [87]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [89]:
text = "RoBERTa is an advanced variant of BERT."

In [91]:
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs)

In [93]:
embeddings = outputs.last_hidden_state

In [95]:
print(embeddings)

tensor([[[-0.0640,  0.1073, -0.0181,  ..., -0.0383, -0.0555, -0.0151],
         [-0.0647,  0.0450, -0.0528,  ...,  0.0814, -0.1633, -0.0284],
         [ 0.0398,  0.0657, -0.1046,  ..., -0.1364, -0.0650,  0.0560],
         ...,
         [ 0.0273,  0.0271, -0.0042,  ..., -0.1554, -0.0037,  0.1105],
         [-0.0560,  0.1078, -0.0385,  ..., -0.0619, -0.0614, -0.0363],
         [ 0.0064,  0.1383,  0.0011,  ...,  0.0994, -0.0593,  0.0169]]],
       grad_fn=<NativeLayerNormBackward0>)


Text Summarization using BERT with Hugging Face Transformers

In [98]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

In [100]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [102]:
original_text = "Long text for summarization..."

In [104]:
inputs = tokenizer(original_text, return_tensors='pt', padding=True, truncation=True)

In [106]:
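summary_logits = model(**inputs).logits
# The argmax of classification logits is a class index (0 or 1), not a token id.
# Decoding it only looks up that vocabulary entry (id 0 is [PAD]), so the result is not a summary.
summary = tokenizer.decode(torch.argmax(summary_logits, dim=1))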

In [108]:
print("Summary:", summary)

Summary: [PAD]
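
A sequence-classification head produces one score per class, so it cannot generate summary text; the [PAD] above is simply vocabulary entry 0. One way encoder models such as BERT are used for summarization is extractive: embed each sentence, then keep the sentences closest to the document's mean vector. The sketch below shows only the selection step in plain Python. The sentence vectors are toy values and the function names are assumptions; in practice the vectors might come from mean-pooled last_hidden_state outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def extractive_summary(sentences, vectors, k=1):
    """Keep the k sentences whose vectors are most similar to the mean vector."""
    centroid = [sum(column) / len(vectors) for column in zip(*vectors)]
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vectors[i], centroid),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]  # restore document order


# Toy sentences and 2-dimensional sentence vectors (assumption).
toy_sentences = ["Sentence A.", "Sentence B.", "Sentence C."]
toy_sentence_vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]

selected = extractive_summary(toy_sentences, toy_sentence_vectors, k=1)
print(selected)
```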


Handling Long Texts with BERT

In [110]:
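max_seq_length = 512  # Max token limit for BERT, including [CLS] and [SEP]
text = "Random Forests is a supervised Machine Learning algorithm. Used in classification and regression problems"
# The limit applies to tokens, not characters, so chunk the token ids.
token_ids = tokenizer.encode(text, add_special_tokens=False)
chunk_size = max_seq_length - 2  # leave room for [CLS] and [SEP]
text_chunks = [tokenizer.decode(token_ids[i:i + chunk_size]) for i in range(0, len(token_ids), chunk_size)]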

In [114]:
for chunk in text_chunks:
    inputs = tokenizer(chunk, return_tensors='pt', padding=True, truncation=True)
    outputs = model(**inputs)
    # Process outputs for each chunk
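
BERT's 512 limit counts tokens, including [CLS] and [SEP], so long inputs are usually split on token ids rather than characters, and overlapping windows keep some context across chunk boundaries. The sketch below works on a plain list of ids; the function name `sliding_windows` and the `window` and `stride` defaults are assumptions chosen for illustration.

```python
def sliding_windows(token_ids, window=510, stride=128):
    """Split token ids into windows of at most `window` ids that overlap by `stride`."""
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than window")
    step = window - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end
    return chunks


# Hypothetical ids standing in for a tokenized long document.
long_ids = list(range(1200))
windows = sliding_windows(long_ids)
print([(w[0], w[-1], len(w)) for w in windows])
```

A window of 510 leaves room for [CLS] and [SEP]. The Hugging Face tokenizers can also produce overlapping chunks directly through the return_overflowing_tokens and stride arguments.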

Mixed-Precision Training with BERT

In [116]:
from torch.cuda.amp import autocast, GradScaler

In [118]:
scaler = GradScaler()


In [120]:
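optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
labels = torch.tensor([1])  # example label; without labels, outputs.loss is None
with autocast():
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss

In [124]:
optimizer.zero_grad()
scaler.scale(loss).backward()  # scale the loss so small gradients survive float16
scaler.step(optimizer)         # unscales the gradients, then steps the optimizer
scaler.update()                # adjusts the scale factor for the next iteration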
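
GradScaler multiplies the loss by a large factor before the backward pass so that very small gradients do not underflow to zero in float16, then divides the gradients by the same factor before the optimizer step. The sketch below shows the effect on a single number using Python's half-precision struct format. The gradient value is hypothetical; 2**16 is GradScaler's default initial scale.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8    # hypothetical gradient value, below the float16 range
scale = 2.0 ** 16   # GradScaler's default initial scale

unscaled = to_fp16(tiny_grad)                   # underflows to 0.0 in float16
recovered = to_fp16(tiny_grad * scale) / scale  # survives float16, then is unscaled in full precision

print(unscaled, recovered)
```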

Domain Adaptation with BERT

In [126]:
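# Hypothetical (text, label) pairs stand in for a real domain-specific dataset.
domain_data = [("The patient was prescribed antibiotics.", 1), ("The invoice is overdue.", 0)]
domain_model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
batch = tokenizer([t for t, _ in domain_data], return_tensors='pt', padding=True, truncation=True)
loss = domain_model(**batch, labels=torch.tensor([y for _, y in domain_data])).loss
loss.backward()  # one gradient computation; a real run loops over batches with an optimizer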

Multilingual BERT with Hugging Face Transformers

In [128]:
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')

text = "BERT understands multiple languages!"
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs)

embeddings = outputs.last_hidden_state
print(embeddings)

tensor([[[-0.1740, -0.3340, -0.6609,  ...,  0.4222,  0.3596, -0.0141],
         [ 0.4543,  0.1575, -0.2935,  ..., -0.0887,  0.3577,  0.3759],
         [ 0.7044,  0.1681, -0.1217,  ...,  0.9337,  0.4014, -0.2957],
         ...,
         [-0.2753, -0.4430, -0.5777,  ...,  0.4702,  0.0326, -0.1780],
         [-0.4562, -0.2416, -0.9867,  ...,  0.1909,  0.5182, -0.1369],
         [-0.4752, -0.3440, -0.5101,  ...,  0.0674,  0.5783, -0.1292]]],
       grad_fn=<NativeLayerNormBackward0>)


Lifelong Learning with BERT

In [130]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

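new_data = [("New example text.", 1), ("Another recent example.", 0)]  # hypothetical stand-in; load your updated dataset here
epochs = 2  # assumed number of passes over the new data
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(epochs):
    for sample_text, sample_label in new_data:
        batch = tokenizer(sample_text, return_tensors='pt', truncation=True)
        loss = model(**batch, labels=torch.tensor([sample_label])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
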

Implementing BERT with Hugging Face Transformers Library

In [132]:
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [134]:
text = "BERT is amazing!"
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

In [136]:
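outputs = model(**inputs)
# The classification head is newly initialized (see the warning above),
# so this class is arbitrary until the model is fine-tuned.
predicted_class = torch.argmax(outputs.logits).item()
print("Predicted Sentiment Class:", predicted_class)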

Predicted Sentiment Class: 1


In [138]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "Sample text for training."
label = 1  # Assuming positive sentiment

# torch.optim.AdamW replaces the deprecated AdamW class from transformers
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()  # from_pretrained returns the model in eval mode

inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs, labels=torch.tensor([label]))

loss = outputs.loss
optimizer.zero_grad()
loss.backward()
optimizer.step()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
