## Architecture Overview

> UniLM (Unified Language Model)

UniLM is a Transformer-based architecture designed for a range of NLP tasks, including language generation, summarization, machine translation and more. Its building blocks let it handle both bidirectional and unidirectional language modelling, which sets it apart from unidirectional Transformer models such as GPT.

1. Bidirectional Language Model (BLM): The BLM component of UniLM learns the bidirectional context of the input sequence. Every token attends to the tokens on both its left and its right, as in a Transformer encoder such as BERT, so the model can capture dependencies and relationships between words in either direction.

2. Unidirectional Language Model (ULM): The ULM component handles text generation and other unidirectional language modelling tasks. Each token attends only to the tokens before it, as in a Transformer decoder, and the model predicts the next token in the sequence. This objective trains the model to generate coherent and fluent text from the preceding context.

3. Encoder-Decoder Framework: UniLM combines the BLM and ULM behaviours in a sequence-to-sequence setup. The input (source) sequence is encoded with full bidirectional context, while the output (target) sequence is generated left to right, with each output token attending to the whole input and to the output tokens already produced. In the original UniLM paper both roles are played by a single shared Transformer whose self-attention masks decide what each token may attend to, rather than by separate encoder and decoder stacks, so the two are trained jointly.

4. Masked Language Model (MLM): Like BERT-style Transformer models, UniLM uses the MLM objective during pre-training. A portion of the input tokens is randomly masked, and the model is trained to predict the original tokens from the surrounding context. This encourages the model to learn meaningful representations and improves its ability to understand and generate text.
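
The masking step of the MLM objective can be sketched in plain Python. This is a minimal illustration, not UniLM's actual pre-training code: the helper name `mask_tokens`, the masking probability, and the use of `None` to mark positions that would not contribute to the loss are all assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"   # placeholder token, as in BERT-style vocabularies
IGNORE_LABEL = None     # hypothetical marker: this position is not scored

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with [MASK]; return (masked_tokens, labels).

    labels[i] holds the original token wherever a mask was applied and
    IGNORE_LABEL elsewhere, so a loss would only cover masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(IGNORE_LABEL)
    return masked, labels

tokens = "the model learns to predict masked words from context".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

The model would receive `masked` as input and be trained to recover the non-`None` entries of `labels`.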

The key difference between UniLM and architectures like GPT lies in this combination of bidirectional and unidirectional modelling. GPT focuses on unidirectional language modelling and generates text from left to right, whereas UniLM supports both modes. It can draw on bidirectional context when encoding its input and on unidirectional context when generating text, which makes it applicable to both understanding and generation tasks.
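
The difference between the bidirectional, unidirectional and sequence-to-sequence modes comes down to which positions each token may attend to. The sketch below builds the corresponding attention masks as nested lists. The function name `build_attention_mask` and the split into `src_len` and `tgt_len` are assumptions made for illustration, and a real implementation would build these masks as tensors.

```python
def build_attention_mask(src_len, tgt_len, mode):
    """Return an n x n matrix where mask[i][j] == 1 means position i may attend to j.

    Positions 0..src_len-1 form the source segment, the rest the target segment.
    mode is "bidirectional", "unidirectional" (left to right) or "seq2seq".
    """
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if mode == "bidirectional":
                allowed = True                      # everything sees everything
            elif mode == "unidirectional":
                allowed = j <= i                    # only itself and the past
            elif mode == "seq2seq":
                # The whole source is visible to every position; target
                # positions additionally see themselves and earlier targets.
                allowed = j < src_len or (i >= src_len and j <= i)
            else:
                raise ValueError(f"unknown mode: {mode}")
            mask[i][j] = int(allowed)
    return mask

for row in build_attention_mask(src_len=2, tgt_len=2, mode="seq2seq"):
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In the seq2seq case the two source rows never see the target, which is what lets one network act as both "encoder" and "decoder".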

In [1]:
# Load the data
import pandas as pd

# Proof of concept: fine-tune on the first 100 samples only.
# Remove the [:100] slice to train on the complete dataset.
data = pd.read_csv('/kaggle/input/news-summarization/data.csv')[:100]


In [2]:
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel, BertConfig

class UniLMTokenizer(BertTokenizer):
    def __init__(self, vocab_file, do_lower_case=True):
        super().__init__(vocab_file=vocab_file, do_lower_case=do_lower_case)

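# Simplified stand-in for UniLM: a pre-trained BERT encoder with a linear LM head on top.
# It does not implement UniLM's self-attention masks; feel free to modify it as per the original paper.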
class UniLMModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased', config=config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state
        logits = self.lm_head(sequence_output)
        return logits






In [3]:
!wget https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt

--2023-05-30 23:18:39--  https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
Resolving huggingface.co (huggingface.co)... 13.227.219.63, 13.227.219.125, 13.227.219.41, ...
Connecting to huggingface.co (huggingface.co)|13.227.219.63|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231508 (226K) [text/plain]
Saving to: ‘vocab.txt’


2023-05-30 23:18:40 (2.53 MB/s) - ‘vocab.txt’ saved [231508/231508]



In [4]:
tokenizer = UniLMTokenizer(vocab_file='/kaggle/working/vocab.txt', do_lower_case=True)


In [5]:
# Tokenize the content and summary columns
tokenized_inputs = data['Content'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
tokenized_summaries = data['Summary'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
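
The cell above tokenizes articles and summaries as two separate sequences. For sequence-to-sequence fine-tuning, UniLM instead packs the source and target segments into a single input, separated by special tokens. The sketch below shows one way to do that packing. The helper `pack_source_and_target` is hypothetical, it expects ids produced with `add_special_tokens=False`, and the ids 101 and 102 are the `[CLS]` and `[SEP]` ids of the `bert-base-uncased` vocabulary.

```python
def pack_source_and_target(source_ids, target_ids, cls_id, sep_id, max_len):
    """Pack one example as [CLS] source [SEP] target [SEP].

    Returns (input_ids, token_type_ids, src_len). token_type_ids is 0 for the
    source segment and 1 for the target segment; src_len counts the source
    segment including [CLS] and the first [SEP]. The source is truncated so
    that the full target always fits into max_len.
    """
    budget = max_len - len(target_ids) - 3   # 3 special tokens in total
    if budget < 0:
        raise ValueError("target is too long for max_len")
    src = [cls_id] + source_ids[:budget] + [sep_id]
    tgt = target_ids + [sep_id]
    return src + tgt, [0] * len(src) + [1] * len(tgt), len(src)

ids, segments, src_len = pack_source_and_target(
    source_ids=[7, 8, 9, 10], target_ids=[20, 21],
    cls_id=101, sep_id=102, max_len=8,
)
print(ids)       # [101, 7, 8, 9, 102, 20, 21, 102]
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1]
print(src_len)   # 5
```

The returned `src_len` is what a seq2seq attention mask would need to tell source positions from target positions.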


In [12]:
# Define the maximum sequence length
max_seq_length = 512

# Truncate and pad the tokenized inputs and summaries
def pad_or_truncate(tokens, max_len, pad_id=0):
    """Cut tokens to max_len, then right-pad with pad_id (0 is [PAD] in the BERT vocab)."""
    tokens = tokens[:max_len]
    return tokens + [pad_id] * (max_len - len(tokens))

input_ids = [pad_or_truncate(tokens, max_seq_length) for tokens in tokenized_inputs]
summary_ids = [pad_or_truncate(tokens, max_seq_length) for tokens in tokenized_summaries]

# Convert the tokenized inputs and summaries to tensors
input_ids = torch.tensor(input_ids, dtype=torch.long)
summary_ids = torch.tensor(summary_ids, dtype=torch.long)


In [14]:
from torch.utils.data import TensorDataset, DataLoader

# Create TensorDataset
dataset = TensorDataset(input_ids, summary_ids)

# Define batch size
batch_size = 1

# Create DataLoader
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)


In [15]:
# Define the configuration for UniLM
config = BertConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    vocab_size=tokenizer.vocab_size
)



In [16]:
model = UniLMModel(config)  # Initialize the UniLM model with the desired configuration
loss_fn = nn.CrossEntropyLoss()  # Define the loss function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # Define the optimizer


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
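
The summaries in this notebook are padded with token id 0, and `nn.CrossEntropyLoss()` with default arguments scores those padded positions like any other label. PyTorch's `nn.CrossEntropyLoss` accepts an `ignore_index` argument to skip such positions. The plain-Python sketch below illustrates the effect on a toy example. The function `cross_entropy` is a hypothetical re-implementation for illustration, not PyTorch's code, and the logits are made up.

```python
import math

def cross_entropy(logits, labels, ignore_index=None):
    """Mean token-level cross-entropy over the positions that are not ignored.

    logits: one list of scores per position; labels: one class id per position.
    Returns 0.0 if every position is ignored (an assumption of this sketch).
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0

PAD_ID = 0
# Two real tokens the model is unsure about, two padded positions where it
# confidently predicts [PAD].
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
labels = [2, 1, PAD_ID, PAD_ID]

print(cross_entropy(logits, labels))                       # padding included
print(cross_entropy(logits, labels, ignore_index=PAD_ID))  # padding ignored
```

With padding included, the easy padded positions pull the average down even though the model has learned nothing about the real tokens.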


In [18]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Use GPU if available

model.to(device)

# Training loop
num_epochs = 1

for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for step, batch in enumerate(dataloader):
        input_batch, summary_batch = batch
        input_batch = input_batch.to(device)
        summary_batch = summary_batch.to(device)

        # Clear gradients
        optimizer.zero_grad()

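        # Forward pass: logits of shape (batch_size, seq_len, vocab_size).
        # Token id 0 is [PAD], so the attention mask hides padded positions.
        outputs = model(input_batch, attention_mask=input_batch.ne(0))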

        # Compute loss
        loss = loss_fn(outputs.view(-1, outputs.shape[-1]), summary_batch.view(-1))
        total_loss += loss.item()

        # Backward pass
        loss.backward()

        # Update weights
        optimizer.step()
        if step % 100 == 0:
            print("Step-{}, Loss-{}".format(step, loss.item()))

    # Calculate average loss for the epoch
    avg_loss = total_loss / len(dataloader)

    # Print the average loss
    print(f"Epoch {epoch+1}/{num_epochs} - Average Loss: {avg_loss:.4f}")


Step-0, Loss-1.8001112937927246
Epoch 1/1 - Average Loss: 1.7692


## Inference


In [49]:
# Sample input text
input_text = "This is a sample input text for summarization. We are going to evaluate the trained model. Let's see how it performs ! Do you have any idea ?"


In [50]:
# Tokenize the input text
tokenized_input = tokenizer.encode_plus(input_text, max_length=max_seq_length, truncation=True, padding='max_length', return_tensors='pt')


In [51]:

# Move the tokenized input to the device
input_ids = tokenized_input['input_ids'].to(device)
attention_mask = tokenized_input['attention_mask'].to(device)


In [52]:
# Set the model to evaluation mode
model.eval()


UniLMModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine

In [53]:
# Perform forward pass
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)


In [54]:
# Pick the highest-scoring vocabulary id at every position.
# outputs has shape (batch_size, sequence_length, vocab_size).
predicted_summary_ids = torch.argmax(outputs, dim=-1)

# Decode the ids of the first (and only) example back to text
predicted_summary = tokenizer.decode(predicted_summary_ids[0], skip_special_tokens=True)

print("Predicted Summary:", predicted_summary)


Predicted Summary: 


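The predicted summary is empty. Training the model for longer may give a better summary, and the sample input text is also very short. Another likely cause is that the summaries are padded to 512 tokens with id 0 and those padded positions are included in the loss, so the model is mostly rewarded for predicting [PAD], which skip_special_tokens=True then strips from the decoded text.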
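
The inference cells above take the argmax at every position of a single forward pass. Generation as described in the Architecture Overview instead proceeds one token at a time, feeding each prediction back in as context. The sketch below shows that loop with greedy selection. Both `greedy_decode` and the `next_token_scores` callable are hypothetical stand-ins: in practice the scores would come from the model's logits at the last position.

```python
def greedy_decode(next_token_scores, prompt_ids, eos_id, max_new_tokens=20):
    """Generate ids one at a time, always taking the highest-scoring next id.

    next_token_scores(ids) is a hypothetical callable that returns one score
    per vocabulary id for the token that should follow `ids`.
    """
    ids = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:       # stop at the end-of-sequence id
            break
        ids.append(next_id)
        generated.append(next_id)
    return generated

def toy_scores(ids, vocab_size=6):
    """Toy scorer: prefers (last id + 1), then end-of-sequence (id 0) after id 4."""
    target = 0 if ids[-1] >= 4 else ids[-1] + 1
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

print(greedy_decode(toy_scores, prompt_ids=[1], eos_id=0))  # [2, 3, 4]
```

Greedy selection is the simplest choice; beam search or sampling could replace the `max` step without changing the rest of the loop.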