## Goal:

* The immediate goal is to setup this notebook for training, Inference and submission for LB evaluation.

We'll set more goals once we achieve this one.

Let's Create a Controller variable for Training/Submission Mode.

* Set **True** for Submission Mode.
* Set **False** for Training Mode.

In [1]:
# True if you want to submit the Notebook
isSubmit = True

if isSubmit:
    print(f'This notebook is on submission Mode...')
    print(f'Make sure to turn off the Internet...')
else:
    print(f'This notebook is on Training Mode...')

This notebook is on submission Mode...
Make sure to turn off the Internet...


### Step-1: Read and Understand Data.

Here we have two type of files: 
1. **<span style="color:red">Prompts:</span>** It contains the Question, a title about the text, and the text that needs to be summarized.
2. **<span style="color:red">Summaries:</span>** This includes the text summarized by the students, along with the corresponding prompt id and target scores for both content and wording.

In [2]:
# Read Dataset
import pandas as pd

if not isSubmit:
    # Train
    prompt_df = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/prompts_train.csv')
    summary_df = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/summaries_train.csv')

    print(f'\nLength of Train Prompt df: {len(prompt_df)}')
    print(f'Length of Train Summary df: {len(summary_df)}\n')

    # Display
    summary_df.sample(5)

In [3]:
if not isSubmit:
    # Distribution of the Prompt Questions.
    print(summary_df['prompt_id'].value_counts())

In [4]:
if not isSubmit:
    # Number of Unique Students
    print(len(summary_df['student_id'].unique()))

**So the Training data contains only 4 Prompt Questions that is being asked to 7165 students.**

Now Let's understand the Labels.

In [5]:
if not isSubmit: 
    # Describe
    describe_content_df = summary_df['content'].describe()
    describe_wording_df = summary_df['wording'].describe()

    print(pd.concat([describe_content_df, describe_wording_df], axis=1))

* The value of **Content** ranging from -1.72 to 3.90.
* Value of the **Wording** is ranging from -1.96 to 4.31.

It surely has two classes (Content and Wording) and each class has a continuous output. This problem can be put into the category of **Two-Class Regression.**

Let's have an overall idea about the number of words (or tokens) in summary text.

In [6]:
if not isSubmit:
    text_length = summary_df['text'].apply(lambda x: len(x.split(' ')))
    print(text_length.describe())
    print('\n')

    # Lets Visualize the same.
    from matplotlib import pyplot as plt

    # Create a histogram using Matplotlib
    plt.figure(figsize = (3, 3))
    plt.hist(text_length, bins=10, edgecolor='k')
    plt.xlabel('Text Length')
    plt.ylabel('Frequency')
    plt.title(f'Distribution of the Text Length')

The average text summary length is approximately 76 tokens, with a standard deviation of 54 tokens. Most of the text summaries have a length of less than 200 tokens. That is still in the limit of **BERT** and related models, which accept a **maximum of 512 tokens**. In Future, we might use some of the contextual information from the Prompt text to utilize this token gap properly.

### Step-2: Dataset and Dataloaders

Now Let's define the Dataset Class.

The Dataset class in Natural Language Processing (NLP) serves as a fundamental data structure that helps manage and handle textual data for training and evaluation purposes.

In [7]:
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer

class CommonLitSummaryDataset(Dataset):
    def __init__(self, 
                 summary_df,
                 prompt_df, 
                 model_name, 
                 max_length = 256,
                 isTest = False
                ):
        self.summary_df = summary_df
        self.prompt_df = prompt_df
        self.max_length = max_length
        self.tokz = AutoTokenizer.from_pretrained('/kaggle/input/commonlit-summaries-all-tokenizers/bert_base_cased_tokenizer')
        self.isTest = isTest
        
    def __len__(self):
        return len(self.summary_df)
    
    def __getitem__(self, idx):
        
        # Get the Summary and It's Corresponding Question
        txt_summary = self.summary_df['text'].iloc[idx]
        prompt_id = self.summary_df['prompt_id'].iloc[idx]
        txt_question = self.prompt_df[self.prompt_df['prompt_id'] == prompt_id]['prompt_question'].iloc[0]
        
        # Concat the Question and Summary.
        input_text = 'QUESTION: ' + txt_question + 'SUMMARY: ' + txt_summary
        
        # Convert the text data into Corresponding Numerical Embeddings.
        encodings = self.tokz.encode_plus(input_text, 
                                          add_special_tokens=True, 
                                          max_length = self.max_length, 
                                          padding = 'max_length', 
                                          truncation = True, 
                                          return_tensors = 'pt'
                                         )
        input_ids = encodings['input_ids'].squeeze()
        attention_mask = encodings['attention_mask'].squeeze()
        
        # For Test set, No labels will be available
        if self.isTest:
            return {
            'input_ids': input_ids,
            'attention_mask': attention_mask
        }
            
        # Labels
        label = torch.tensor(self.summary_df.iloc[idx][-2:].tolist())
        
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': label
        }

Before Initialising the Class, We have to split the dataset into two categories. 1) Train 2) Valid

In [8]:
if not isSubmit:
    # Split the dataframe into train and validation sets
    from sklearn.model_selection import train_test_split

    train_df, valid_df = train_test_split(summary_df, test_size=0.2, random_state=42)

In [9]:
if not isSubmit:

    # Initialize Dataset Classes
    commonlit_summary_train_ds = CommonLitSummaryDataset(train_df,
                                                         prompt_df, 
                                                         model_name = 'bert-base-cased', 
                                                         max_length = 256
                                                        )
    commonlit_summary_valid_ds = CommonLitSummaryDataset(valid_df, 
                                                         prompt_df,  
                                                         model_name = 'bert-base-cased', 
                                                         max_length = 256
                                                        )
    print(f'Train - {len(commonlit_summary_train_ds)}, Test - {len(commonlit_summary_valid_ds)}')

Let's Visualize one Sample to see the working of our Dataset.

In [10]:
if not isSubmit:

    # Tokenizer
    tokz =  AutoTokenizer.from_pretrained('/kaggle/input/commonlit-summaries-all-tokenizers/bert_base_cased_tokenizer')

    print(f'------ Input ----------\n')
    sample = commonlit_summary_train_ds[0]
    print(tokz.decode(sample['input_ids']))

    print(f'\n------ Labels ----------\n')
    labels = sample['labels']
    print(labels)

**Dataloader:**

Dataloader allows us to group multiple samples into batches, enabling parallel processing and more efficient GPU utilization during training.

In [11]:
# Dataloader
from torch.utils.data import DataLoader

if not isSubmit:
    # Create a data loader for the dataset
    batch_size = 16
    train_dataloader = DataLoader(commonlit_summary_train_ds, batch_size=batch_size, shuffle=True)
    eval_dataloader = DataLoader(commonlit_summary_valid_ds, batch_size=batch_size, shuffle=False)

### Step-3: Model Training

In [12]:
from transformers import AutoModelForSequenceClassification, AdamW

model_nm = 'bert-base-cased'
num_labels = 2

if not isSubmit:
    model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=num_labels)

    total_params = sum(p.numel() for p in model.parameters())
    print("\nTotal number of parameters: ", total_params)

    total_size = sum(p.numel() * p.element_size() for p in model.parameters())
    print("Total size (bytes) of the model: ", total_size)
    print("Total size (MB) of the model: ", total_size / (1024 * 1024))

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']


we need to save the best model during Training based on the **MCRMSE** (mean columnwise root mean squared error) which is the Cost Function/ Metric for this Competition.

Let's implement this.

In [13]:
# Utility: mean columnwise root mean squared error
def mcrmse_loss(predictions, targets):
    rmse_columnwise = torch.sqrt(torch.mean((predictions - targets)**2, dim=0))
    return torch.mean(rmse_columnwise)

**Training Loop**

In order to have **faster Training**, we will do the followings:

1. **Enable mixed precision training** (using half-precision floating-point format or fp16). It utilizes Tensor Cores on supported NVIDIA GPUs to speed up the training process with reduced memory usage.

2.  To utilize Kaggle T4/X2 GPU, we will use **DataParallel**. DataParallel allows you to train your model on multiple GPUs simultaneously, distributing the workload across all available GPUs.

In [14]:
if not isSubmit:
    
    # Import Dataparallel and mixed precision modules
    from torch.cuda.amp import autocast, GradScaler
    from torch.nn.parallel import DataParallel

    # Check if GPU is available.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f'DEVICE: {device}\n')

    # Used for Mixed Precision Training
    scaler = GradScaler()

    # Move your model to the GPU and wrap it with DataParallel (To utilize T4X2)
    model = model.to(device)
    model = DataParallel(model) 

Let's Train the Model.

In [15]:
# Visualize progress bar
from tqdm import tqdm

if not isSubmit:
    # Prepare optimizer
    optimizer = AdamW(model.parameters(), lr=1e-5)
    NUM_EPOCHS = 15

    best_eval_loss = float('inf')  # Initialize the best evaluation loss to infinity

    for epoch in range(NUM_EPOCHS):
        total_loss = 0
        model.train() # Set the model to Training mode

        for batch in tqdm(train_dataloader):

            with autocast():
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                # Forward loop
                optimizer.zero_grad() # Ensures Gradient doesn't accumulate.
                predictions = model(input_ids=input_ids,
                                    attention_mask=attention_mask, 
                                    labels=labels
                                   ).logits

                # Compute MCRMSE loss
                loss = mcrmse_loss(predictions, labels)

            # BackProp
            scaler.scale(loss).backward()  # Scale the loss value
            scaler.step(optimizer)
            scaler.update()

            # Accumulate the Loss
            total_loss += loss.item()

        # Calculate epoch-level metrics
        epoch_loss = total_loss / len(train_dataloader)

        # Print epoch-level metrics
        print(f"Epoch {epoch+1}/{NUM_EPOCHS}")
        print(f"Train Loss: {epoch_loss:.4f}")

         # Evaluation
        model.eval()  # Set model to evaluation mode
        eval_loss = 0

        with torch.no_grad():
            for batch in eval_dataloader:
                with autocast():
                    input_ids = batch['input_ids'].to(device)
                    attention_mask = batch['attention_mask'].to(device)
                    labels = batch['labels'].to(device)

                    predictions = model(input_ids=input_ids,
                                        attention_mask=attention_mask, 
                                        labels=labels
                                       ).logits
                    loss = mcrmse_loss(predictions, labels)

                    eval_loss += loss.item()

        # Calculate evaluation metrics
        eval_epoch_loss = eval_loss / len(eval_dataloader)

        # Print evaluation metrics
        print(f"Eval Loss: {eval_epoch_loss:.4f}")

        # Save the Best Model
        if eval_epoch_loss < best_eval_loss:
            print(f'--------------------------------------')
            print(f'Found the best model at Epoch {epoch+1}')
            print(f'Validation Loss reduced from {best_eval_loss:.4f} to {eval_epoch_loss:.4f}')
            best_eval_loss = eval_epoch_loss
            print(f'Saving the best model.')
            print(f'--------------------------------------\n')
            
            # Access the actual model from the DataParallel object
            actual_model = model.module
            # Save
            actual_model.save_pretrained("vanilla_bert_base_cased")
            #torch.save(model.state_dict(), "vanilla_bert_base_cased.pt")

### Step-4: Model Inference

Load the Best model.

In [16]:
model_path = '/kaggle/input/commonlit-summaries-all-models/vanilla_bert_base_cased'
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Check if GPU is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'DEVICE: {device}\n')

# Model to the Device
model = model.to(device)

DEVICE: cuda



Prepare Test Data Loader.

In [17]:
# Read Test DF's.
prompt_test_df = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/prompts_test.csv')
summary_test_df = pd.read_csv('/kaggle/input/commonlit-evaluate-student-summaries/summaries_test.csv')

# Initialize Dataset Classes
commonlit_summary_test_ds = CommonLitSummaryDataset(summary_test_df,
                                                    prompt_test_df, 
                                                    model_name = 'bert-base-cased', 
                                                    max_length = 256,
                                                    isTest = True
                                                    )
# Test Dataloader
batch_size = 16
test_loader = DataLoader(commonlit_summary_test_ds, batch_size=batch_size, shuffle=False)

test_loader

<torch.utils.data.dataloader.DataLoader at 0x7c9e13ce3550>

**Prediction**

In [18]:
# Eval model
model.eval()

predictions = []
with torch.no_grad():  # Disable gradient computation during inference
    for batch in tqdm(test_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        
        # Predictions
        predictions_batch = model(input_ids=input_ids,
                                  attention_mask=attention_mask
                                 ).logits
        
        # Collect predictions from this batch
        predictions.extend(predictions_batch.cpu().tolist())
        
        
# Convert to numpy
import numpy as np
predictions = np.array(predictions)

100%|██████████| 1/1 [00:02<00:00,  2.56s/it]


**Submission**

In [19]:
# Create a DataFrame
data = {
    'student_id': summary_test_df['student_id'].tolist(),
    'content': predictions[:,0],
    'wording': predictions[:,1]
}
submission_df = pd.DataFrame(data)

# Display
display(submission_df.head())

# Save it for Submission
submission_df.to_csv('submission.csv', index=False)

Unnamed: 0,student_id,content,wording
0,000000ffffff,-1.450631,-1.514663
1,111111eeeeee,-1.450113,-1.521089
2,222222cccccc,-1.447816,-1.523844
3,333333dddddd,-1.456676,-1.529542


**TODO:**

* FP16 Training <font color="green">&#10004;</font>
* Utilize Both of the GPU's <font color="green">&#10004;</font>
* Stratified Split 
* Data Augmentations
* Model Architecture Tweaking

**Model Performance Tracker**

<style>
table {
    width: 100%;
    border-collapse: collapse;
}

th, td {
    padding: 8px;
    text-align: left;
}

th {
    background-color: #FF0000; /* Red color */
    color: white;
}
</style>

<table>
    <tr>
        <th><span style="color:red">S.No.</span></th>
        <th><span style="color:red">Seed</span></th>
        <th><span style="color:red">Split</span></th>
        <th><span style="color:red">Model_name</span></th>
        <th><span style="color:red">DA</span></th>
        <th><span style="color:red">Others</span></th>
        <th><span style="color:red">CV</span></th>
        <th><span style="color:red">LB</span></th>
    </tr>
    <tr>
        <td>1</td>
        <td>42</td>
        <td>Random</td>
        <td>vanilla_bert_base_case</td>
        <td>None</td>
        <td>NA</td>
        <td>0.4801</td>
        <td>NA</td>
    </tr>
    <tr>
        <td>2</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
    </tr>
    <tr>
        <td>3</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
        <td>NA</td>
    </tr>
</table>
