In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/quora-insincere-questions-classification/sample_submission.csv
/kaggle/input/quora-insincere-questions-classification/embeddings.zip
/kaggle/input/quora-insincere-questions-classification/train.csv
/kaggle/input/quora-insincere-questions-classification/test.csv


## Architectural Overview

> XLNet (named after Transformer-XL, the architecture it builds on)

- It is a pretrained language model developed by researchers at CMU and Google AI, and it was state of the art on many NLP benchmarks when it was released in 2019. It is based on the Transformer architecture and was designed to overcome limitations of earlier pretraining approaches, such as the masked-token objective of BERT and the strictly left-to-right objective of GPT-style models.
- The architecture of XLNet is similar to BERT's, with some key differences. The main one is that XLNet is trained with a permutation-based objective, which lets it model dependencies between all tokens in a sequence rather than only between a token and the tokens that come before it.

>> The block diagram of XLNet consists of three main components.

- Input embedding layer - This layer converts the input text into vectors that the model can process. The text is first split into subword pieces by a SentencePiece tokenizer, and each piece is then looked up in an embedding table (32,000 entries of size 768 for 'xlnet-base-cased', as the model printout further down shows) that is learned together with the rest of the network.
- Transformer encoder layers - The transformer layers are the core building blocks of the model. They use a self-attention mechanism to process the input and produce a contextualised representation of each token. The layers are stacked on top of each other (12 of them in 'xlnet-base-cased') to form a deep neural network.
- Permutation-based training - XLNet is trained with a permutation language modeling objective. The input sequence itself stays in its original order. Instead, a factorization order (a permutation of the positions) is sampled for each training sequence, and attention masks let each token be predicted only from the tokens that precede it in that sampled order. Averaged over many sampled orders, every token learns to use context from both sides. Enumerating all permutations would be intractable, so the loss is computed on sampled orders only.
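To make the permutation idea more concrete, here is a minimal pure-Python sketch of how a sampled factorization order can be turned into an attention mask. The function name `permutation_attention_mask` is made up for this illustration, and the real XLNet implementation is more involved (it uses two attention streams and only predicts a subset of positions), so treat this as a simplified picture rather than the library's code.

```python
import random

def permutation_attention_mask(seq_len, rng):
    """Sample a factorization order and build the matching attention mask.

    mask[i][j] is True when position i is allowed to look at position j,
    i.e. when j comes strictly before i in the sampled order.
    """
    order = list(range(seq_len))
    rng.shuffle(order)  # sampled factorization order, a permutation of positions
    # rank[pos] = step at which position `pos` is predicted in the sampled order
    rank = {pos: step for step, pos in enumerate(order)}
    mask = [[rank[j] < rank[i] for j in range(seq_len)] for i in range(seq_len)]
    return order, mask

rng = random.Random(0)
order, mask = permutation_attention_mask(4, rng)
print("factorization order:", order)
for i, row in enumerate(mask):
    # 'x' marks the positions that position i may attend to
    print(i, "".join("x" if allowed else "." for allowed in row))
```

The tokens are never reordered; only the mask changes from one sampled order to the next. The first position in the order sees no context, and the last one sees every other position, including tokens to its right in the original sentence.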

>> The advantages of XLNet are

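1. Improved modeling of dependencies - Because it is trained over many factorization orders, it can use context on both sides of a token rather than only the tokens that come before it, while keeping an autoregressive objective and avoiding the artificial [MASK] tokens that BERT sees only during pretraining.
2. Better downstream results - The XLNet authors report that it outperforms BERT on question answering, sentiment analysis, and text classification benchmarks.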

In [2]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import XLNetForSequenceClassification, XLNetTokenizer
from sklearn.metrics import f1_score
import pandas as pd




We import the required libraries: PyTorch and its DataLoader, XLNetForSequenceClassification and XLNetTokenizer from the transformers package, f1_score from scikit-learn, and pandas for data handling.



In [3]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


We check if a GPU is available and set the device accordingly. This enables GPU acceleration if available.

In [4]:
# Define XLNet model and tokenizer
model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=2)
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')


Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.bias', 'lm_loss.weight']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.bias', 'logits_proj.bias', 'sequence_summary.summary.weight', 'logits_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

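We initialize the XLNet model for sequence classification and its corresponding tokenizer, both from the 'xlnet-base-cased' pre-trained checkpoint. The warning above is expected: the language-modeling head of the checkpoint is dropped, and the classification head (sequence_summary and logits_proj) is newly initialized, so the model has to be fine-tuned before its predictions mean anything.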

In [5]:
# Load Quora Insincere Question Classification dataset
train_df = pd.read_csv('/kaggle/input/quora-insincere-questions-classification/train.csv')
test_df = pd.read_csv('/kaggle/input/quora-insincere-questions-classification/test.csv')


We read the Quora Insincere Question Classification dataset from CSV files using pandas.

In [6]:
# Tokenize and encode sequences
train_encoded = tokenizer.batch_encode_plus(train_df['question_text'].tolist(),
                                            add_special_tokens=True,
                                            padding='longest',
                                            truncation=True,
                                            return_tensors='pt', max_length=64) #increase max length to 512 if there are no memory restrictions

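We tokenize and encode the question texts using the XLNet tokenizer. The encoded output contains the input IDs and attention masks, with every question truncated to at most 64 tokens and padded to the longest sequence in the set. The target labels are not part of the tokenizer output; they are taken from the 'target' column of train_df when the datasets are built below.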

In [7]:
test_encoded = tokenizer.batch_encode_plus(test_df['question_text'].tolist(),
                                           add_special_tokens=True,
                                           padding='longest',
                                           truncation=True,
                                           return_tensors='pt',max_length=64)#increase max length to 512 if there are no memory restrictions


In [8]:
# Prepare data loaders
train_dataset = torch.utils.data.TensorDataset(train_encoded['input_ids'],
                                               train_encoded['attention_mask'],
                                               torch.tensor(train_df['target'].tolist()))
test_dataset = torch.utils.data.TensorDataset(test_encoded['input_ids'], test_encoded['attention_mask'])
train_loader = DataLoader(train_dataset, batch_size=16*4, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16*4, shuffle=False)


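We create data loaders for both the training and test datasets using the encoded sequences. Both use a batch size of 64; the training loader also carries the target labels and is shuffled, while the test loader keeps the original row order so that predictions line up with the qid column.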
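As a quick check on the loader sizes, the number of batches per epoch follows directly from the row count and the batch size. The sketch below works backwards from the batch totals shown by the progress bars in this notebook (20409 for training, 5872 for prediction) to the range of row counts they imply. The helper names are made up for this illustration, and it assumes the DataLoader default of keeping the last partial batch.

```python
import math

batch_size = 16 * 4       # same value as passed to DataLoader
n_train_batches = 20409   # total shown by the training progress bar
n_test_batches = 5872     # total shown by the prediction progress bar

def batches_per_epoch(n_rows, batch_size):
    # The last, possibly smaller, batch is kept, hence the ceiling.
    return math.ceil(n_rows / batch_size)

def row_count_bounds(n_batches, batch_size):
    # Smallest and largest dataset sizes that produce exactly n_batches.
    return (n_batches - 1) * batch_size + 1, n_batches * batch_size

print(row_count_bounds(n_train_batches, batch_size))  # (1306113, 1306176)
print(row_count_bounds(n_test_batches, batch_size))   # (375745, 375808)
```

With roughly 1.3 million training rows, one epoch is about twenty thousand optimizer steps, which is worth keeping in mind when choosing the number of epochs and the logging interval.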

In [9]:
for batch in train_loader:
    print(batch)
    break

[tensor([[ 5,  5,  5,  ..., 82,  4,  3],
        [ 5,  5,  5,  ..., 82,  4,  3],
        [ 5,  5,  5,  ..., 82,  4,  3],
        ...,
        [ 5,  5,  5,  ..., 82,  4,  3],
        [ 5,  5,  5,  ..., 82,  4,  3],
        [ 5,  5,  5,  ..., 82,  4,  3]]), tensor([[0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        ...,
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1]]), tensor([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]


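We run a sanity check on the data loaders by printing one batch. Each batch holds three tensors: the input IDs, the attention masks, and the target labels. The XLNet tokenizer pads on the left (token ID 5, with 0 in the attention mask) and ends every sequence with the separator and classification tokens (IDs 4 and 3).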
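The pattern in the printed batch (5s on the left, 4 and 3 at the end, zeros on the left of the mask) can be reproduced with a small stand-alone sketch. This is not the tokenizer's code: `encode_fixed_length` is a hypothetical helper, the special token IDs are taken from the printout, and it pads to `max_length` rather than to the longest sequence in the set.

```python
PAD_ID, SEP_ID, CLS_ID = 5, 4, 3  # IDs read off the batch printout

def encode_fixed_length(token_ids, max_length):
    """Truncate, append <sep> and <cls>, then pad on the left to max_length."""
    body = token_ids[: max_length - 2]   # leave room for the two special tokens
    ids = body + [SEP_ID, CLS_ID]
    n_pad = max_length - len(ids)
    input_ids = [PAD_ID] * n_pad + ids
    attention_mask = [0] * n_pad + [1] * len(ids)  # 0 = padding, 1 = real token
    return input_ids, attention_mask

ids, mask = encode_fixed_length([17, 42, 82], max_length=8)
print(ids)   # [5, 5, 5, 17, 42, 82, 4, 3]
print(mask)  # [0, 0, 0, 1, 1, 1, 1, 1]
```

The attention mask is what tells the model to ignore the padded positions, which is why it is passed alongside the input IDs in the training and prediction loops.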

In [10]:
# Set hyperparameters
num_epochs = 1
learning_rate = 2e-5


We define the number of training epochs and the learning rate.

In [11]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)


We specify the loss function (cross-entropy loss) and the optimizer (AdamW) for model training.
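For intuition about the numbers this loss produces, here is a small pure-Python version of cross-entropy on raw logits for a single example. It is a sketch of the formula only; nn.CrossEntropyLoss computes the same quantity per example and, by default, averages it over the batch.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

print(cross_entropy([0.0, 0.0], 0))   # uninformative logits: ln(2), about 0.693
print(cross_entropy([2.0, -1.0], 0))  # confident and correct: about 0.049
print(cross_entropy([2.0, -1.0], 1))  # confident and wrong: about 3.049
```

A two-class model that has not learned anything yet tends to sit near ln(2), which is consistent with the first logged training loss of about 0.71 in this notebook, where the classification head is freshly initialized.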

In [12]:
# Training loop
model.to(device)
model.train()


XLNetForSequenceClassification(
  (transformer): XLNetModel(
    (word_embedding): Embedding(32000, 768)
    (layer): ModuleList(
      (0-11): 12 x XLNetLayer(
        (rel_attn): XLNetRelativeAttention(
          (layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): XLNetFeedForward(
          (layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (layer_1): Linear(in_features=768, out_features=3072, bias=True)
          (layer_2): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (activation_function): GELUActivation()
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (sequence_summary): SequenceSummary(
    (summary): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
    (first_dropout): Identity()
    (last

In [13]:
from tqdm import tqdm

for epoch in range(num_epochs):
    total_loss = 0
    for step, (inputs, masks, targets) in enumerate(tqdm(train_loader)):
        inputs, masks, targets = inputs.to(device), masks.to(device), targets.to(device)

        optimizer.zero_grad()

        outputs = model(input_ids=inputs, attention_mask=masks)[0]
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
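        if step%1000==0:
            print("Step - {}, Loss - {}".format(step, loss.item()))
            # Proof of concept only: this break fires at step 0, so training stops after a single batch.
            # Remove it to train on the full epoch.
            break

        # Not reached when the break above fires, which is why the epoch loss prints as 0
        total_loss += loss.item()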

    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {total_loss}")


  0%|          | 0/20409 [00:02<?, ?it/s]

Step - 0, Loss - 0.7137416005134583
Epoch 1/1, Loss: 0





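We train the XLNet model by iterating over the training data loader. With the model in train mode, each step clears the gradients left over from the previous step, runs a forward pass, computes the loss, backpropagates it, and updates the weights.

This run is only a proof of concept: the break stops training after the first batch, so the reported epoch loss is 0 and the model is effectively untrained. Remove the break and train for at least a full epoch before relying on the predictions.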

In [14]:
# Evaluation
model.eval()
predictions = []


We switch the model to evaluation mode and iterate over the test data loader to make predictions on the test dataset.

In [None]:
with torch.no_grad():
    for inputs, masks in tqdm(test_loader):
        inputs, masks = inputs.to(device), masks.to(device)
        outputs = model(input_ids=inputs, attention_mask=masks)[0]
        _, predicted_labels = torch.max(outputs, 1)
        predictions.extend(predicted_labels.tolist())


 30%|██▉       | 1748/5872 [04:22<10:19,  6.66it/s]

In [None]:
submission_df = pd.DataFrame({'qid': test_df['qid'], 'prediction': predictions})
submission_df.to_csv('submission.csv', index=False)

submission.csv generated
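The test set has no labels, so the quality of the model cannot be measured from the submission alone. One option is to hold out part of train.csv and score the predictions on it with F1, which is the metric this notebook imports (f1_score) but never uses. The sketch below shows the two steps involved, argmax over the logits and binary F1, in pure Python. The validation logits and targets are made-up toy values, and both helper names are hypothetical.

```python
def argmax_labels(logits):
    """Pick the index of the largest logit in each row (same idea as torch.max(outputs, 1))."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy values standing in for model outputs on a held-out slice of train.csv
val_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9], [0.4, 0.1]]
val_targets = [0, 1, 1, 1, 0]

val_preds = argmax_labels(val_logits)
print(val_preds)                          # [0, 1, 0, 1, 0]
print(f1_binary(val_targets, val_preds))  # about 0.8
```

On real validation data, sklearn.metrics.f1_score(val_targets, val_preds) should give the same number. Because insincere questions are the minority class, F1 is more informative here than plain accuracy.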