## Architecture Overview

## Electra

It is a neural network architecture that belongs to the hierarchy of transformers family.

The building blocks of Electra is explained below

1. Transformer architecture - Similar to other models like BERT, GPT2, It is built upon the transformer architecture which utilize self attention mechanisms to capture the relationships b/w words in a sentence and enable parallel procesing.

2. Pre-training and Fine tuning : Like BERT and GPT2, Electra also follows a two step process. It undergoes pre-training and fine-tuning. In the pre-training phase, the model is trained over a large corpus of unlabeled text to learn the language patterns and representations. In the fine tuning phase, the model is further trained on a smaller labeled dataset for specific downstream tasks like text classification or NER.

3. Masked Language Modelling vs. Discriminative Training : In BERT, a technique known as MLM is used where random words are masked and the model is asked to predict them. In contrast, Electra employs Discriminative training. Instead of masking words, It replaces some of them with plausible alternatives and tasks the model with distinguishing the original word from the replacement. This way, Electra learns to discriminate b/w real and generated tokens resulting in more efficient training.

4. Generator and Discriminator: Electra consists of two components, The generator and the discriminator. The generator takes in a sentence as input and tries to predict the replaced words whereas the discriminator aims to determine whether the replaced words are real or generated. These two components work in synchronicity during training with the Discriminator providing feedback to improve generator's performance.

5. Model size and training efficiency- Electra is designed to be computationally more efficient compared to some other models. For instance, models like BERT and GPT2 are known for their large size, which makes training and deployment challenging. Electra whreas achieves similar performance with smaller memory footprints.

> Key differences in the architecture of Electra and other models.

- With respect to ALBERT : ALBERT uses parameter sharing to reduce model size and training time whreas Electra although smaller, focuses on the Discriminative Training approach for efficiency.

- With respect to ROBERTA- ROBERTA uses large batch size and corpus to train but Electra achieves similar performance with smaller resources.

- With respect to T5 - T5 is a versatile model for various NLP tasks while Electra's fine tuning also makes it capable for a wide number of NLP tasks.

In [66]:
!7z x /kaggle/input/mercari-price-suggestion-challenge/test.tsv.7z
!7z x /kaggle/input/mercari-price-suggestion-challenge/train.tsv.7z

Extracting the train and test sets using 7z command line tool

In [94]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from transformers import ElectraTokenizer, ElectraForSequenceClassification


This code snippet imports the necessary libraries and modules for deep learning model training using Electra.

In [95]:
# Set the random seed for reproducibility
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)


These lines of code set the random seed for reproducibility.

In [96]:
# Load the training and test data
train_df = pd.read_csv('/kaggle/working/train.tsv', delimiter='\t')
test_df = pd.read_csv('/kaggle/working/test.tsv', delimiter='\t')


These lines of code load the train and test sets

In [97]:
# Remove rows with missing values
train_df = train_df.dropna()


In [98]:
# Split the data into training and validation sets
train_data, val_data = train_test_split(train_df, test_size=0.2, random_state=seed)


In [99]:
# Define the tokenizer
tokenizer = ElectraTokenizer.from_pretrained('google/electra-base-discriminator')


In [100]:
# Tokenize the input data
def tokenize_data(text):
    return tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=256,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )

In [101]:
# Create PyTorch DataLoader for training and validation sets
class MercariDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        text = row['name'] + ' ' + row['item_description']
        inputs = tokenize_data(text)
        label = row['price']
        return {
            'input_ids': inputs['input_ids'].squeeze(),
            'attention_mask': inputs['attention_mask'].squeeze(),
            'label': torch.tensor(label, dtype=torch.float32)
        }


In [102]:
# Define batch size and create data loaders
batch_size = 32
train_dataset = MercariDataset(train_data)
val_dataset = MercariDataset(val_data)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size)

In [103]:
for batch in train_loader:
    print(batch)
    break

{'input_ids': tensor([[  101,  2417,  2152,  ...,     0,     0,     0],
        [  101,  6954,  2072,  ...,     0,     0,     0],
        [  101, 21994, 19457,  ...,     0,     0,     0],
        ...,
        [  101,  2047,  2141,  ...,     0,     0,     0],
        [  101,  1038,  2989,  ...,     0,     0,     0],
        [  101,  9212,  2884,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'label': tensor([ 46.,   3.,  30.,  37.,  20.,  17.,  10.,  31.,  10., 895.,  19.,  61.,
         14.,  96.,  90.,  23.,  13.,  74.,  19.,  26., 525.,  41.,   8.,  11.,
         24.,  16.,  30.,  46.,  20.,  20.,  26.,  50.])}


In [104]:
# Define the ELECTRA model
model = ElectraForSequenceClassification.from_pretrained('google/electra-base-discriminator', num_labels=1)

Some weights of the model checkpoint at google/electra-base-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.o

In [105]:
# Define the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-5)

In [108]:
len(train_loader), len(val_loader)

(21175, 5294)

# Train

In [114]:
from tqdm import tqdm

# Train the model
num_epochs = 1
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    
    for step,batch in tqdm(enumerate(train_loader)):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        optimizer.zero_grad()
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        print(step)
        if step%500==0:
            print("Step-{}, Loss-{}".format(step,loss.item()))
            break # breaking the training at 500th step since 1 iteration may take around 5 hrs, Uncomment this during full training
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item() * input_ids.size(0)
    
    train_loss /= len(train_dataset)
    
    # Evaluate on the validation set
    model.eval()
    val_loss = 0.0
    
    with torch.no_grad():
        for step,batch in tqdm(enumerate(val_loader)):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            
            val_loss += loss.item() * input_ids.size(0)
    
    val_loss /= len(val_dataset)
    
    print(f'Epoch {epoch+1}/{num_epochs} - Train Loss: {train_loss:.4f} - Val Loss: {val_loss:.4f}')


0it [00:00, ?it/s]

0


0it [00:00, ?it/s]


Step-0, Loss-2053.11083984375


123it [00:37,  3.27it/s]


KeyboardInterrupt: 

This code block trains a model using a loop that iterates over a specified number of epochs. It uses the tqdm library to display a progress bar during training. The model is trained on a GPU if available; otherwise, it falls back to CPU.

Within each epoch, the code iterates over the train_loader to process batches of training data. It moves the input data and labels to the appropriate device (GPU or CPU) and performs forward and backward passes through the model. The optimizer is used to update the model's parameters based on the computed gradients. The training loss is accumulated and averaged over the entire training dataset.

After each epoch's training, the code enters the evaluation phase on the validation set (val_loader). It iterates over the validation batches, performs forward passes through the model, and calculates the loss. The validation loss is accumulated and averaged over the entire validation dataset.

At the end of each epoch, the code prints the epoch number, the training loss, and the validation loss.

# Inference

In [None]:
# Load the test data and create a DataLoader
test_dataset = MercariDataset(test_df)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=128*8)

# Generate predictions on the test set
model.eval()
predictions = []

with torch.no_grad():
    for batch in tqdm(test_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        
        predictions.extend(logits.flatten().cpu().numpy())

# Create the submission file
submission_df = pd.DataFrame({'test_id': test_df['test_id'], 'price': predictions})
submission_df.to_csv('submission.csv', index=False)


  7%|▋         | 49/678 [07:33<1:36:29,  9.20s/it]

The code provided is used to generate predictions on the test set using a pre-trained model and create a submission file for a competition. 


The model is then put into evaluation mode using `model.eval()`, and predictions are generated for each batch in the test loader. The input tensors (`input_ids` and `attention_mask`) are moved to the appropriate device using `.to(device)`.

The model's forward pass is executed with the input tensors, and the output logits are obtained from `outputs.logits`. The logits are then flattened, converted to NumPy arrays on the CPU, and appended to the `predictions` list.

Once all the predictions are generated, a submission DataFrame is created with columns for `test_id` (assuming it's available in `test_df`) and `price`, using the `predictions` list. Finally, the submission DataFrame is saved as a CSV file called 'submission.csv' without including an index.

