<a href="https://www.kaggle.com/code/neesham/transformers-for-beginners-p1?scriptVersionId=124720060" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## But what is a Transformer?

The Transformer is a deep learning architecture that has revolutionized the field of natural language processing (NLP) since its introduction in 2017. Unlike traditional sequence-to-sequence models, which rely on recurrent neural networks (RNNs) to process sequential data, the Transformer is based on self-attention mechanisms, allowing it to process sequences in parallel and capture long-range dependencies more effectively.
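To make the idea of self-attention concrete, here is a minimal sketch of scaled dot-product attention for a single head. It uses plain Python lists so it runs without extra dependencies. The function names and the toy vectors are made up for illustration; a real Transformer derives the queries, keys, and values from learned projections and uses several heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    queries, keys: one d_k-dimensional vector per token.
    values: one vector per token.
    Returns (outputs, weights).
    """
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # Similarity of this token's query with every token's key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights.append(softmax(scores))
    # Each output is a weighted average of all value vectors.
    width = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
        for row in weights
    ]
    return outputs, weights


# Three toy "token" vectors (hypothetical values), used as Q, K, and V alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Every token attends to every other token in one step, with no recurrence. This is why the computation can run in parallel and why distant tokens can influence each other directly.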

The Transformer has achieved state-of-the-art performance on a range of NLP tasks, including machine translation, question-answering, and text classification. It has also paved the way for the development of large pre-trained language models such as BERT, GPT-2, and RoBERTa, which have further pushed the boundaries of NLP.

**In this series of notebooks, we will dive deep into the details of transformers and we will perform some experiments with them. So upvote this notebook and join this journey.**

In this notebook, we will experiment with masked language modeling (MLM): hiding some of the tokens in a text and training the model to predict them.

# Setup

In [1]:
import numpy as np 
import pandas as pd 

## Installing Hugging Face libraries

In [2]:
!pip install datasets evaluate transformers[sentencepiece]
!apt install git-lfs

Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.0



git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 132 not upgraded.


## Initializing the base BERT Model and Tokenizer

In [3]:
from transformers import TFAutoModelForMaskedLM
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

All model checkpoint layers were used when initializing TFBertForMaskedLM.

All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


## Processing the Data

In [4]:
df = pd.read_csv("/kaggle/input/imdb-dataset-of-top-1000-movies-and-tv-shows/imdb_top_1000.csv")

df.head()

| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


In [5]:
# We will use Series_Title and Overview for our model because these fields contain rich text.

df = df[['Series_Title', 'Overview']]

df.head()

| | Series_Title | Overview |
|---|---|---|
| 0 | The Shawshank Redemption | Two imprisoned men bond over a number of years... |
| 1 | The Godfather | An organized crime dynasty's aging patriarch t... |
| 2 | The Dark Knight | When the menace known as the Joker wreaks havo... |
| 3 | The Godfather: Part II | The early life and career of Vito Corleone in ... |
| 4 | 12 Angry Men | A jury holdout attempts to prevent a miscarria... |


## Hugging Face Dataset

In [6]:
from datasets import Dataset

# Convert to a Hugging Face Dataset for easier preprocessing.

movie_dataset = Dataset.from_pandas(df)
movie_dataset

Dataset({
    features: ['Series_Title', 'Overview'],
    num_rows: 1000
})

## Concatenating the text

In [7]:
# Making a single column for the text
def concatenate_text(data):
    return {"text": data['Series_Title'] + " " + data['Overview']}


movie_dataset = movie_dataset.map(concatenate_text)

## Debugging the Data

In [8]:
for data_elem in movie_dataset:
    print(data_elem['text'])
    break

The Shawshank Redemption Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.


## Tokenizing the Data

In [9]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = movie_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "Series_Title", "Overview"]
)
tokenized_datasets

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
    num_rows: 1000
})
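The `word_ids` column added in `tokenize_function` records, for each token, the index of the word it came from, with `None` for special tokens. The sketch below mimics that bookkeeping with a greedy WordPiece-style splitter. It is a simplified illustration, not the tokenizer's actual implementation: the vocabulary is hypothetical and the splits it produces are not claimed to match what `bert-base-uncased` does.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize_with_word_ids(text, vocab):
    """Return (tokens, word_ids); special tokens get a word id of None."""
    tokens, word_ids = ["[CLS]"], [None]
    for index, word in enumerate(text.lower().split()):
        for piece in wordpiece(word, vocab):
            tokens.append(piece)
            word_ids.append(index)
    tokens.append("[SEP]")
    word_ids.append(None)
    return tokens, word_ids


# Hypothetical toy vocabulary, chosen so that one word splits into pieces.
toy_vocab = {"the", "shaw", "##shan", "##k", "redemption"}
tokens, word_ids = tokenize_with_word_ids("The Shawshank Redemption", toy_vocab)
print(tokens)
print(word_ids)
```

Tokens that share a word id belong to the same word, which is what makes techniques such as whole-word masking possible.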

## Breaking the text into equal length

In natural language processing, we usually make every input the same length so that examples can be stacked into fixed-shape tensors and processed in batches.

For example, suppose we are building a sentiment analysis model that determines whether a tweet is positive or negative. Tweets have different lengths, so their token sequences cannot be batched directly; the usual remedies are to pad short sequences and truncate long ones. For masked language modeling we take a different route here: we concatenate all the tokenized texts and cut the result into chunks of `chunk_size` tokens, dropping the final chunk if it is shorter. Every chunk then has the same length, and no text is lost to truncation apart from that short tail.
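The concatenate-and-chunk idea behind the `group_texts` function in this section is easier to see on a toy input. This is a minimal sketch rather than the notebook's code: the name `concat_and_chunk` and the token IDs are made up for illustration.

```python
def concat_and_chunk(sequences, chunk_size):
    """Concatenate token-id lists and cut the result into equal-length chunks."""
    flat = [token for seq in sequences for token in seq]
    # Keep only as many tokens as fill whole chunks; the short tail is dropped.
    usable = (len(flat) // chunk_size) * chunk_size
    return [flat[i : i + chunk_size] for i in range(0, usable, chunk_size)]


toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]  # lengths 3, 5, and 2
chunks = concat_and_chunk(toy_sequences, chunk_size=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Ten tokens with a chunk size of 4 give two full chunks, and the last two tokens are discarded. Note that a chunk can span the boundary between two original texts.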

In [10]:
# Before (Unequal length data)

for i in tokenized_datasets:
    for key in i.keys():
        print(key, len(i[key]))

    break

input_ids 30
token_type_ids 30
attention_mask 30
word_ids 30


In [11]:
chunk_size = 128

def group_texts(examples):
    
    # Concatenate all texts
    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'])
    
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    
    # Debugging
    # print(concatenated_examples['input_ids'][0:100])
    
    # Compute length of concatenated texts
    total_length = len(concatenated_examples['input_ids'])

    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    
    # Save a copy of the original input_ids as labels, because random tokens in input_ids will be masked for MLM.
    result["labels"] = result["input_ids"].copy()
    
    # Now, dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'])
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
    num_rows: 288
})

In [12]:
# After (equal length data)

for i in lm_datasets:
    
    for k in i.keys():
        print(k, len(i[k]))
        
    break

input_ids 128
token_type_ids 128
attention_mask 128
word_ids 128
labels 128


## Preprocessed Dataset

We mask random input tokens with the help of `DataCollatorForLanguageModeling`, then split the data into small training and test sets.
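As a rough picture of what the collator does to each example, here is a simplified sketch of BERT-style masking, which is the scheme the collator follows by default: each non-special token is selected with probability `mlm_probability`; of the selected tokens, about 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. Unselected positions get the label `-100` so the loss ignores them. The function name and the example token IDs are made up for illustration, and the special-token IDs and vocabulary size are written in as assumed constants rather than read from the tokenizer.

```python
import random

# Assumed constants for a BERT-style vocabulary (not read from the tokenizer).
CLS_ID, SEP_ID, MASK_ID = 101, 102, 103
VOCAB_SIZE = 30522


def mask_tokens(token_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) after BERT-style random masking."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions the loss ignores
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select the rest with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


example = [CLS_ID, 2009, 2003, 1037, 2307, 3185, SEP_ID]  # illustrative IDs
inputs, labels = mask_tokens(
    example, MASK_ID, VOCAB_SIZE, {CLS_ID, SEP_ID}, rng=random.Random(42)
)
print(inputs)
print(labels)
```

Because the masking is random, the same text is masked differently each time it is batched, so the model sees varied prediction targets across epochs.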

In [13]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

train_size = 200
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets.train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

tf_train_dataset = model.prepare_tf_dataset(
    downsampled_dataset["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    downsampled_dataset["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

## Log in to the Hugging Face Hub

In [14]:
# Make sure to create a write token, else it won't work
from huggingface_hub import notebook_login

notebook_login()

## Fine-tuning the BERT Model

Enter your token above and push the model to the Hugging Face Hub so that you can use it later. I have already pushed the model to the Hub, so I don't need to push it again.

In [15]:
# Uncomment the PushToHubCallback line at the end of this cell to upload to the Hugging Face Hub


from transformers import create_optimizer
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf

num_train_steps = len(tf_train_dataset)

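# Note: with 200 training examples and a batch size of 32, one epoch has only a
# handful of steps, far fewer than the 1,000 warmup steps below, so the learning
# rate stays well under init_lr for the whole run. Lower num_warmup_steps if you
# want the model to train at the full rate.
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=1_000,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)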

model.compile(optimizer=optimizer)

# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")


model_name = model_checkpoint.split("/")[-1]

# callback = PushToHubCallback('Neesham/first-model', tokenizer=tokenizer)


No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
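To see how the warmup setting interacts with the number of training steps, here is a simplified model of a schedule with linear warmup followed by linear decay to zero, which is the general shape of the schedule `create_optimizer` builds. It is a sketch under assumptions, not the library's implementation: the function name is made up, and the value of six steps per epoch assumes 200 examples at a batch size of 32 with the incomplete last batch dropped.

```python
def warmup_then_linear_decay(step, init_lr, num_warmup_steps, num_train_steps):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 toward init_lr.
        return init_lr * step / num_warmup_steps
    decay_steps = max(1, num_train_steps - num_warmup_steps)
    progress = min(1.0, (step - num_warmup_steps) / decay_steps)
    return init_lr * (1.0 - progress)


# init_lr and warmup steps as in the cell above; steps_per_epoch is assumed.
steps_per_epoch = 6
for step in (0, 3, 6):
    lr = warmup_then_linear_decay(step, 2e-5, 1_000, steps_per_epoch)
    print(f"step {step}: lr = {lr:.2e}")
```

Under these assumptions the learning rate after one epoch is still a small fraction of `init_lr`, so a single short epoch changes the weights very little. A warmup of a few steps, or none, would suit a dataset this small better.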


In [16]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset)



<keras.callbacks.History at 0x7fdb66efd450>

## Testing

In [17]:
# Encode the input text
text = "It is a great [MASK]."
encoded_input = tokenizer(text, return_tensors="tf")

# Get the model's predictions
predictions = model(encoded_input["input_ids"], token_type_ids=encoded_input["token_type_ids"], attention_mask=encoded_input["attention_mask"])

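# Pick the highest-scoring token at every position (predictions[0] holds the logits)
predicted_index = tf.argmax(predictions[0], axis=-1)

predicted_token = tokenizer.convert_ids_to_tokens(predicted_index.numpy()[0])

# Drop the [CLS] and [SEP] positions at the start and end before printing
ans = predicted_token[1:-1]
print(" ".join(ans))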


it is a great place .
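The cell above takes the best token at every position, so most of the output simply repeats the input. Often we only care about the masked position and would like to see several candidates. The sketch below shows that selection step on toy data. The function name, the four-word vocabulary, and the scores are all hypothetical; with the real model, the scores would come from the logits row at the `[MASK]` position.

```python
def top_k_at_mask(tokens, logits, vocab, k=3, mask_token="[MASK]"):
    """Return the k highest-scoring (word, score) pairs at the masked position."""
    mask_index = tokens.index(mask_token)
    scores = logits[mask_index]
    # Rank vocabulary indices by score, highest first.
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [(vocab[i], scores[i]) for i in ranked[:k]]


# Hypothetical toy vocabulary and scores, one row of scores per token.
toy_vocab = ["movie", "place", "day", "."]
tokens = ["[CLS]", "it", "is", "a", "great", "[MASK]", ".", "[SEP]"]
logits = [[0.0, 0.0, 0.0, 0.0] for _ in tokens]
logits[tokens.index("[MASK]")] = [1.5, 2.0, 0.5, -1.0]

for word, score in top_k_at_mask(tokens, logits, toy_vocab, k=3):
    print(word, score)
```

Looking at several candidates gives a better sense of what the model has learned than a single best guess.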


## Conclusion
In this notebook, we fine-tuned a pre-trained transformer model for masked language modeling. We started with a short introduction to the Transformer and the pre-trained language models built on it, and then used masked language modeling to continue training a pre-trained model on text from a specific domain, here IMDb movie titles and overviews.

We then walked step by step through fine-tuning the pre-trained BERT model: loading and preprocessing the data, masking random tokens, training the model, and testing it on a masked sentence.

I will make a couple more notebooks on transformers, and we will do some amazing experiments with them.