## Architectural Overview

> ALBERT

ALBERT, short for "A Lite BERT", is a variant of the BERT (Bidirectional Encoder Representations from Transformers) model. It is designed to be more parameter efficient and scalable than BERT while maintaining similar, and in some cases improved, performance.

> The architecture of ALBERT can be broken down into the following components

1. Embeddings layer - Like other transformer-based models, ALBERT starts with an embeddings layer. This layer maps input tokens to continuous vector representations called token embeddings. These embeddings capture the meaning of individual tokens; contextual information is added by the encoder layers that follow.

2. Transformer encoder - This is the core building block of ALBERT. It consists of multiple stacked layers, and each layer has two sub-layers: a self-attention mechanism and a feed-forward neural network.

- Self-attention mechanism - It allows each token in the input sequence to attend to the other tokens in the same sequence, capturing their relationships and dependencies. It calculates attention scores between tokens and uses them to combine information from different positions.

- Feed-forward neural network - The feed-forward neural network applies a non-linear transformation to the outputs of the self-attention mechanism, enriching the representation of each token. It consists of two linear layers with a non-linear activation function in between.

3. Parameter sharing - ALBERT introduces parameter sharing. In BERT, every layer has its own parameters, which leads to a large parameter count. In ALBERT, the layers are grouped and share parameters within each group. This reduces the model's overall size and makes it more memory efficient.

4. Cross-layer parameter sharing - ALBERT goes a step further with cross-layer parameter sharing. In addition to sharing parameters within a layer group, ALBERT shares parameters across different layers; in the default configuration, all encoder layers reuse the same set of weights. This further reduces the number of parameters and improves parameter efficiency.

5. Sentence order prediction - ALBERT introduces a pre-training task called "sentence order prediction". In this task, the model is given two consecutive text segments and learns to predict whether they appear in their original order or have been swapped. This additional objective helps ALBERT to better model the relationships between sentences.
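
To make the effect of items 3 and 4 concrete, here is a back-of-the-envelope parameter count for a stack of encoder layers with and without sharing. This is only a sketch: the sizes (hidden size 768, feed-forward size 3072, 12 layers) are BERT-base-style values assumed for illustration, the helper names are hypothetical, and the embeddings and classification head are left out.

```python
def encoder_layer_params(hidden: int, ffn: int) -> int:
    """Count weights and biases in one standard transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)  # two linear layers
    layer_norms = 2 * (2 * hidden)               # scale and shift for two LayerNorms
    return attention + feed_forward + layer_norms


def stack_params(num_layers: int, hidden: int, ffn: int, shared: bool) -> int:
    """Total encoder parameters; with full sharing, one layer is reused at every depth."""
    per_layer = encoder_layer_params(hidden, ffn)
    return per_layer if shared else num_layers * per_layer


# Assumed BERT-base-style sizes, for illustration only
unshared = stack_params(12, 768, 3072, shared=False)
shared = stack_params(12, 768, 3072, shared=True)
print(f"unshared: {unshared:,}  shared: {shared:,}  ratio: {unshared // shared}x")
```

Note that sharing reduces the number of stored parameters, not the amount of computation: the forward pass still applies the layer once per depth level.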

> Architectural differences

- ALBERT vs. BERT:
ALBERT improves upon BERT by introducing parameter sharing, both within layer groups and across layers. This reduces the model size and improves parameter efficiency while keeping performance comparable.

- ALBERT vs. RoBERTa:
RoBERTa is a variant of BERT that focuses on pre-training with larger batch sizes and more data. While ALBERT and RoBERTa achieve similar performance, ALBERT is more parameter efficient due to its parameter sharing techniques.

- ALBERT vs. XLNet:
XLNet uses permutation-based training, which allows it to capture dependencies between all positions in a sequence. ALBERT, on the other hand, uses a masked language modelling objective and parameter sharing techniques. The main difference lies in the training objectives and in how dependencies are modelled.

- ALBERT vs. T5:
T5 (Text-to-Text Transfer Transformer) is a versatile model that can be applied to various NLP tasks by casting them into a text-to-text format. ALBERT, like BERT, focuses on masked language modelling. T5 is more flexible in handling different tasks, while ALBERT is specifically designed for language representation learning (i.e. understanding the structure and meaning of language).

In [2]:
#importing modules
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import AlbertTokenizer, AlbertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
import pandas as pd
import csv




In this snippet, the necessary modules and libraries are imported, including PyTorch, the ALBERT model and tokenizer from the transformers library, DataLoader for creating data loaders, Dataset for creating custom datasets, and pandas for reading and manipulating data.



In [3]:
# Load train.csv and test.csv using pandas
train_df = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test_df = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')


The training and testing data files, "train.csv" and "test.csv," are loaded using pandas.



In [4]:
# Extract the text and target columns from the train and test data
train_texts = train_df['text'].tolist()
train_labels = train_df['target'].tolist()
test_texts = test_df['text'].tolist()


The "text" and "target" columns are extracted from the training and testing data and stored as lists.



In [5]:
# Create train_data and test_data dictionaries
train_data = [{'text': text, 'target': label} for text, label in zip(train_texts, train_labels)]
test_data = [{'text': text} for text in test_texts]

The training and testing data are transformed into dictionaries, where each dictionary entry contains the text and target (if available) for each data instance.



In [6]:
# Set up device (GPU if available)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


The code sets the device to use GPU if available; otherwise, it uses the CPU.



In [7]:
# Load the pre-trained ALBERT tokenizer used to preprocess train_data and test_data
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')



The ALBERT tokenizer is instantiated using the "albert-base-v2" pre-trained model.



In [21]:
# Custom dataset class
class DisasterTweetsDataset(Dataset):
    def __init__(self, data, mode="train"):
        self.data = data  # list of dicts with 'text' and, in train mode, 'target'
        self.mode = mode

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        tweet = self.data[index]['text']
        # Tokenize, then pad or truncate every tweet to 128 tokens
        encoding = tokenizer.encode_plus(tweet, add_special_tokens=True, padding='max_length', max_length=128, truncation=True, return_tensors='pt')
        # Drop the batch dimension added by return_tensors='pt'
        item = {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
        }
        # Labels exist only for the training data
        if self.mode == "train":
            item['label'] = self.data[index]['target']
        return item


A custom dataset class, DisasterTweetsDataset, is defined. It takes the data as input and implements the `__len__` and `__getitem__` methods required for a dataset class. The `__getitem__` method performs tokenization using the ALBERT tokenizer and returns the input IDs, attention mask, and label (if available) for a specific index.
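
To show what `padding='max_length'` and `truncation=True` do to a single tweet, here is a small pure-Python sketch. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are made up, `pad_id=0` is an assumed placeholder, and the real tokenizer also takes care of special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]             # truncation
    mask = [1] * len(ids)                          # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding   # 0 marks padding


# Made-up token IDs: one sequence shorter and one longer than max_length
short_ids, short_mask = pad_or_truncate([2, 45, 67, 3], max_length=6)
long_ids, long_mask = pad_or_truncate(range(10), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every item ends up with the same length, the default collation in `DataLoader` can stack the items into one tensor per batch, and the attention mask tells the model which positions to ignore.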



In [22]:
train_dataset = DisasterTweetsDataset(train_data)
test_dataset = DisasterTweetsDataset(test_data,mode='test')


In [23]:
train_dataloader = DataLoader(train_dataset, batch_size=16*4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=16*4, shuffle=False)


The training and testing datasets are created using the custom dataset class, and dataloaders are initialized for both datasets. The train_dataloader shuffles the data during training, while the test_dataloader does not shuffle the data during testing.



In [11]:
# Define the ALBERT model
model = AlbertForSequenceClassification.from_pretrained('albert-base-v2', num_labels=2).to(device)



Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForSequenceClassification: ['predictions.dense.weight', 'predictions.dense.bias', 'predictions.LayerNorm.weight', 'predictions.decoder.bias', 'predictions.bias', 'predictions.LayerNorm.bias']
- This IS expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model 

The ALBERT model for sequence classification is instantiated using the "albert-base-v2" pre-trained model. The number of labels is set to 2 (binary classification), and the model is moved to the specified device.



In [12]:
# Define the optimizer and loss function
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()


The optimizer (AdamW) is defined, which will update the model parameters during training. The learning rate is set to 2e-5. The loss function (CrossEntropyLoss) is also defined.
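
As a rough illustration of what the cross-entropy loss measures for one example, the sketch below computes it by hand from a pair of logits. The function name and the logit values are hypothetical, and `nn.CrossEntropyLoss` additionally averages over the batch by default.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy loss for one example: -log(softmax(logits)[label])."""
    shift = max(logits)                                   # for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    prob_of_label = exps[label] / sum(exps)
    return -math.log(prob_of_label)


# Made-up logits for a binary classifier: [score for class 0, score for class 1]
confident_right = cross_entropy([-2.0, 3.0], label=1)
confident_wrong = cross_entropy([-2.0, 3.0], label=0)
undecided = cross_entropy([0.0, 0.0], label=1)
print(f"{confident_right:.4f} {confident_wrong:.4f} {undecided:.4f}")
```

A confident correct prediction gives a loss close to zero, an undecided one gives log 2, and a confident wrong one is penalised heavily. In the training loop in this notebook the loss is read from `outputs.loss`, so `criterion` is defined but not called directly.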



In [13]:
len(train_dataloader),len(test_dataloader)

(119, 51)

The lengths of the dataloaders are checked to estimate the number of steps per epoch.
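
The relationship between dataset size, batch size, and the batch counts printed above can be sketched as follows. The helper names are hypothetical; the only inputs taken from the notebook are the batch size of 64 and the counts 119 and 51.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a dataloader yields per epoch."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


def sample_range(batches, batch_size):
    """Smallest and largest dataset sizes that give `batches` batches (drop_last=False)."""
    return (batches - 1) * batch_size + 1, batches * batch_size


print(sample_range(119, 64))   # training set size range implied by 119 batches
print(sample_range(51, 64))    # test set size range implied by 51 batches
```

With 119 steps per epoch, the 10 epochs requested in the training loop would amount to 1190 optimizer steps if the run completes.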

# Training

In [46]:
from tqdm import tqdm

# Training loop
num_epochs = 10
best_loss = float('inf')
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0

    for step,batch in tqdm(enumerate(train_dataloader)):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        loss.backward()
        optimizer.step()

        running_loss += loss.item()
    epoch_loss = running_loss / len(train_dataloader)
    print(f'Epoch {epoch+1}/{num_epochs} - Loss: {epoch_loss:.4f}')

    # Save the best model based on the lowest loss achieved during training
    if epoch_loss < best_loss:
        best_loss = epoch_loss
        torch.save(model.state_dict(), 'best_model.pt')


119it [01:35,  1.25it/s]


Epoch 1/10 - Loss: 0.0304


119it [01:35,  1.24it/s]


Epoch 2/10 - Loss: 0.0339


119it [01:35,  1.24it/s]


Epoch 3/10 - Loss: 0.0148


119it [01:35,  1.24it/s]


Epoch 4/10 - Loss: 0.0313


119it [01:35,  1.24it/s]


Epoch 5/10 - Loss: 0.0472


119it [01:35,  1.25it/s]


Epoch 6/10 - Loss: 0.0245


119it [01:35,  1.24it/s]


Epoch 7/10 - Loss: 0.0163


119it [01:35,  1.24it/s]


Epoch 8/10 - Loss: 0.0168


44it [00:36,  1.22it/s]


KeyboardInterrupt: 

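The training loop iterates over the specified number of epochs. Within each epoch, the model is set to train mode and the running loss is reset to zero. The loop then iterates through the training dataloader in batches, moving the input IDs, attention mask, and labels to the specified device. For each batch, the optimizer gradients are zeroed, the model is called with the inputs and labels, the loss is backpropagated, and the optimizer updates the model parameters. The running loss accumulates the batch losses, and after each epoch the average loss is calculated and printed. Whenever the epoch loss improves on the best value seen so far, the model weights are saved to "best_model.pt". In the run shown above, training was interrupted manually during the ninth epoch, so the checkpoint on disk is the one saved after epoch 3 (loss 0.0148), the lowest loss among the eight completed epochs.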
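
The checkpointing rule can be replayed on the losses reported in the training output above to see at which epochs `torch.save` would have been called. This is a sketch with a hypothetical helper name; it only repeats the comparison from the loop and does not touch the model.

```python
def epochs_that_save(epoch_losses):
    """Return the 1-based epochs at which a 'save if loss improved' rule fires."""
    best_loss = float("inf")
    saved = []
    for epoch, loss in enumerate(epoch_losses, start=1):
        if loss < best_loss:          # same comparison as in the training loop
            best_loss = loss
            saved.append(epoch)
    return saved


# Epoch losses copied from the training output above (epochs 1 to 8)
reported = [0.0304, 0.0339, 0.0148, 0.0313, 0.0472, 0.0245, 0.0163, 0.0168]
print(epochs_that_save(reported))
```

Since the rule tracks the training loss, it selects the checkpoint that fits the training data best. Selecting for generalization would likely require a held-out validation split, which this notebook does not create.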



# Inference

In [47]:
#loading the best trained model
model.load_state_dict(torch.load('best_model.pt'))


<All keys matched successfully>

In [48]:
# Run the model on the test set and collect the predictions
model.eval()
predictions = []

with torch.no_grad():
    for batch in test_dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

        batch_predictions = torch.argmax(logits, dim=1)
        predictions.extend(batch_predictions.cpu().tolist())



After loading the best checkpoint, the model is set to evaluation mode and predictions are generated for the test dataset with gradient tracking disabled. For each batch, the input IDs and attention mask are moved to the specified device and passed to the model. The predicted class is obtained by taking the argmax of the logits along dimension 1 (the class dimension), and the batch predictions are appended to the predictions list.
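
The argmax step can be illustrated without a model. In this sketch the logits are hypothetical values for a batch of three tweets, and `argmax_per_row` is a pure-Python stand-in for `torch.argmax(logits, dim=1)`, not the PyTorch implementation.

```python
def argmax_per_row(logits):
    """Index of the largest value in each row (ties go to the first index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Hypothetical logits: one row per tweet, [score for class 0, score for class 1]
batch_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]
toy_predictions = []
toy_predictions.extend(argmax_per_row(batch_logits))   # one class index per tweet
print(toy_predictions)
```

Each row of logits collapses to a single class index, so the predictions list grows by one entry per tweet in the batch.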



In [49]:
# Generate submission.csv
submission_file = 'submission.csv'

with open(submission_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['id', 'target'])

    for i, prediction in enumerate(predictions):
        id_value = test_df['id'].iloc[i]  
        writer.writerow([id_value, prediction])

print(f'Submission file "{submission_file}" generated successfully.')

Submission file "submission.csv" generated successfully.


A submission file named "submission.csv" is created. The file is opened in write mode, and a CSV writer is created. The header row is written, and the id and predicted target of each test instance are then written as one row each.
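
One possible variation, sketched below, pairs the ids and predictions with `zip` and checks that both columns have the same length before writing anything. The function name and the sample values are hypothetical, and an in-memory buffer stands in for the file.

```python
import csv
import io


def write_submission(ids, predictions, handle):
    """Write an id,target CSV; refuse to write if the two columns differ in length."""
    ids, predictions = list(ids), list(predictions)
    if len(ids) != len(predictions):
        raise ValueError(f"{len(ids)} ids but {len(predictions)} predictions")
    writer = csv.writer(handle)
    writer.writerow(["id", "target"])
    writer.writerows(zip(ids, predictions))


# Hypothetical ids and predictions, written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_submission([0, 2, 3], [1, 0, 1], buffer)
print(buffer.getvalue())
```

The length check guards against a silent mismatch, for example if inference stopped early and produced fewer predictions than there are test rows.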




In [50]:
df = pd.read_csv('submission.csv')
df

| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 1 |
| 3 | 9 | 1 |
| 4 | 11 | 1 |
| ... | ... | ... |
| 3258 | 10861 | 1 |
| 3259 | 10865 | 1 |
| 3260 | 10868 | 1 |
| 3261 | 10874 | 1 |
