## Architecture Overview

> DeBERTa

DeBERTa (Decoding-enhanced BERT with disentangled attention) is a transformer-based language model that builds on BERT and RoBERTa and introduces several new building blocks. These blocks improve the model's ability to understand and generate language by addressing specific limitations of earlier transformer models.

It differs from other architectures in the following ways:

1. Disentangled Attention:
In BERT, each token is represented by a single vector that sums its content and position embeddings. DeBERTa instead represents each token with two vectors, one for its content and one for its relative position, and computes attention weights from separate content-to-content, content-to-position and position-to-content terms, each with its own learnable projection matrices. Modelling content and position separately helps the model capture fine-grained interactions between tokens.

2. Enhanced Mask Decoder (EMD):
Like BERT, DeBERTa is pretrained with Masked Language Modeling (MLM): tokens are randomly masked and the model learns to predict the original tokens. Because disentangled attention only uses relative positions, DeBERTa adds absolute position information in the decoding layers, just before the softmax layer that predicts the masked tokens.

3. Shared Relative Position Embeddings:
The relative position embedding table is shared across all layers of the model, so the extra position information adds few parameters.

4. Enhanced Training Techniques:
For fine-tuning, the DeBERTa paper proposes a virtual adversarial training method called Scale-invariant Fine-Tuning (SiFT). SiFT applies small perturbations to normalized word embeddings, which improves the model's generalization on downstream tasks.


These improvements help the model capture long-range dependencies, context and fine-grained interactions between tokens. At release, DeBERTa reported state-of-the-art results on natural language understanding benchmarks such as SuperGLUE.
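To make the attention mechanism concrete, the sketch below computes DeBERTa-style disentangled attention scores for a single head in NumPy. It follows the formulation in the DeBERTa paper (content-to-content, content-to-position and position-to-content terms, scaled by the square root of 3d). The function names, toy sizes and random weights are assumptions made for illustration; this is not the Transformers library's implementation.

```python
import numpy as np

def relative_position_index(seq_len, k):
    """Bucketed relative distance delta(i, j), clipped to the range [0, 2k - 1]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.clip(i - j + k, 0, 2 * k - 1)

def disentangled_attention_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, k):
    """Unnormalized attention scores for one head.

    H: (n, d) content vectors, P: (2k, d) relative position embeddings.
    """
    n, d = H.shape
    Qc, Kc = H @ Wq_c, H @ Wk_c  # content queries and keys
    Qr, Kr = P @ Wq_r, P @ Wk_r  # relative-position queries and keys
    delta = relative_position_index(n, k)

    c2c = Qc @ Kc.T                                       # content-to-content
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)    # Qc_i . Kr_delta(i, j)
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T  # Kc_j . Qr_delta(j, i)
    return (c2c + c2p + p2c) / np.sqrt(3 * d)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy sizes and random weights, for illustration only
rng = np.random.default_rng(0)
n, d, k = 5, 8, 2
H = rng.normal(size=(n, d))
P = rng.normal(size=(2 * k, d))
Wq_c, Wk_c, Wq_r, Wk_r = (rng.normal(size=(d, d)) for _ in range(4))

scores = disentangled_attention_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, k)
weights = softmax(scores)
print(weights.shape)
```

Each row of `weights` sums to 1, as in standard attention; the only difference is how the raw scores are assembled.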

The following code is only a proof of concept. For better results, train for longer.

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from transformers import TFAutoModel, AutoTokenizer
from sklearn.model_selection import train_test_split


The cell above imports the modules needed for training and inference.

In [None]:
# Load the data
train_df = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/train.csv')
test_df = pd.read_csv('/kaggle/input/contradictory-my-dear-watson/test.csv')


Train and test dataframes are read from their location on the disk.

In [None]:
# Split the data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(train_df['premise'].tolist(), train_df['label'].tolist(), test_size=0.2, random_state=42)


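The code splits the data into training and validation sets using the `train_test_split` function. The `train_texts` and `train_labels` lists contain the training data, while the `val_texts` and `val_labels` lists contain the validation data. The data is divided in an 80:20 ratio, with 80% used for training and 20% for validation, and `random_state=42` makes the split reproducible. Note that only the `premise` column is used as input text in this proof of concept, although natural language inference normally depends on both the premise and the hypothesis.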
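For readers unfamiliar with `train_test_split`, the sketch below reproduces the idea of a random 80:20 split with NumPy on toy data. The helper name `simple_split` and the toy sentences are made up for illustration. scikit-learn's implementation differs in its details, so the rows it selects will not match this sketch.

```python
import numpy as np

def simple_split(texts, labels, test_size=0.2, seed=42):
    """Shuffle the indices once, then cut them into a validation and a training part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(texts))
    n_val = int(np.ceil(len(texts) * test_size))
    val_idx, train_idx = order[:n_val], order[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(texts, train_idx), pick(texts, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Toy data: ten sentences with labels 0, 1, 2
texts = [f"sentence {i}" for i in range(10)]
labels = [i % 3 for i in range(10)]

toy_train_texts, toy_val_texts, toy_train_labels, toy_val_labels = simple_split(texts, labels)
print(len(toy_train_texts), len(toy_val_texts))  # 8 2
```

Texts and labels are indexed with the same shuffled indices, which is what keeps each label attached to its sentence.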

In [None]:
# Load the Deberta model and tokenizer
model_name = 'microsoft/deberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name)


The code loads the DeBERTa model and tokenizer using the `AutoTokenizer` and `TFAutoModel` classes from the Transformers library. The `model_name` variable holds the identifier of the checkpoint to load, here 'microsoft/deberta-base'. The `tokenizer` object converts input text into token ids. The `model` object is the pretrained DeBERTa encoder without a task-specific head; a classification head is added on top of it in a later cell.

In [None]:
# Tokenize the input texts
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)


The code tokenizes the input texts using the tokenizer object created in the previous step. The `tokenizer` is applied to the `train_texts` and `val_texts` lists, which contain the training and validation texts respectively. The `truncation=True` argument cuts texts that exceed the model's maximum input length. The `padding=True` argument pads every sequence to the length of the longest sequence in the same call, so that all sequences in `train_encodings` share one length and all sequences in `val_encodings` share another. The encodings also include an attention mask that marks which positions hold real tokens and which hold padding.
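The effect of truncation and padding can be illustrated on plain lists of token ids. The sketch below is a toy imitation, not the tokenizer's implementation: the helper name `pad_and_truncate`, the ids and the `pad_id` value are assumptions, and special tokens are ignored.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest remaining sequence."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[5, 8, 13], [7], [2, 4, 6, 8, 10, 12]]
encoded = pad_and_truncate(batch, max_length=4)
print(encoded["input_ids"])       # [[5, 8, 13, 0], [7, 0, 0, 0], [2, 4, 6, 8]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
```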

In [None]:
# Create TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels))
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), val_labels))


The code creates TensorFlow datasets from the tokenized encodings and labels. It uses the `tf.data.Dataset.from_tensor_slices()` function, which slices its inputs along the first axis: `train_encodings` and `train_labels` form the training dataset, and `val_encodings` and `val_labels` form the validation dataset. Each element of a dataset is a pair made of a dictionary with the tokenized inputs for one example and the corresponding label.
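The element structure described above can be imitated without TensorFlow. The sketch below is a plain Python stand-in for the slicing behaviour, with a made-up helper name (`slice_examples`) and toy arrays; it is only meant to show what one dataset element looks like.

```python
import numpy as np

def slice_examples(features, labels):
    """Yield (feature_dict, label) pairs by slicing every array along its first axis."""
    for i in range(len(labels)):
        yield {name: values[i] for name, values in features.items()}, labels[i]

# Toy encodings for two examples of length three
features = {
    "input_ids": np.array([[5, 8, 0], [7, 9, 11]]),
    "attention_mask": np.array([[1, 1, 0], [1, 1, 1]]),
}
toy_labels = np.array([2, 0])

examples = list(slice_examples(features, toy_labels))
first_features, first_label = examples[0]
print(first_features["input_ids"], first_label)  # [5 8 0] 2
```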

In [None]:
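# Define the model architecture
# Variable-length inputs: token ids and the mask marking real (1) vs padding (0) tokens
input_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name='attention_mask')
# [0] selects the last hidden state, shape (batch, seq_len, hidden_size)
outputs = model({'input_ids': input_ids, 'attention_mask': attention_mask})[0]
# Pool over the sequence dimension to get one vector per example
outputs = tf.keras.layers.GlobalMaxPool1D()(outputs)
outputs = tf.keras.layers.Dropout(0.2)(outputs)
# Classification head: one probability per class (3 classes)
outputs = tf.keras.layers.Dense(3, activation='softmax')(outputs)
# Rebind `model` to the full classifier (encoder + head)
model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)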


The code defines the architecture of the model using TensorFlow's Keras API. It starts by defining two input layers, `input_ids` and `attention_mask`, with the specified shapes and data types. The `input_ids` layer represents the input tokenized text, and the `attention_mask` layer represents the attention mask for the input.

Next, the code passes the input layers to the pre-trained DeBERTa model loaded earlier (the `model` variable) and takes the first element of its output, the last hidden state, which holds one vector per token.

The code then applies a global max pooling layer (`GlobalMaxPool1D`), which takes the maximum over the sequence (time) dimension. This reduces the tensor from shape (batch, sequence length, hidden size) to (batch, hidden size), giving one fixed-size vector per example.

A dropout layer with a dropout rate of 0.2 is applied to prevent overfitting.

Finally, a dense layer with 3 units and a softmax activation function is added to produce the final output probabilities for each class. The `model` variable is updated to represent the new model architecture with the defined input and output layers.
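The shape changes in the classification head can be checked with a small NumPy sketch. The random array standing in for the encoder output, the toy sizes and the dense-layer weights are all assumptions for illustration. Dropout is left out because it is inactive at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden, num_classes = 2, 4, 6, 3

# Stand-in for the encoder's last hidden state
hidden_states = rng.normal(size=(batch, seq_len, hidden))

# GlobalMaxPool1D: maximum over the sequence axis
pooled = hidden_states.max(axis=1)

# Dense layer with softmax (hypothetical weights)
W = rng.normal(size=(hidden, num_classes))
b = np.zeros(num_classes)
logits = pooled @ W + b
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

print(pooled.shape, probs.shape)  # (2, 6) (2, 3)
```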

In [None]:
# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])


The code compiles the model by specifying the optimizer, loss function, and metrics to be used during training.

The optimizer chosen is Adam with a learning rate of 5e-5. Adam is an optimization algorithm commonly used for training neural networks.

The loss function is specified as "sparse_categorical_crossentropy", which is suitable for multi-class classification problems with integer labels. This loss function calculates the cross-entropy between the predicted probabilities and the true labels.

The chosen metric for evaluation during training is "accuracy", which measures the proportion of correctly classified samples.

By calling `model.compile()`, the model is prepared for training with the specified optimizer, loss function, and metrics.
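As a worked example of the loss and metric described above, the sketch below computes sparse categorical cross-entropy and accuracy by hand for two toy predictions. The probabilities and labels are made up, and the function is a simplified stand-in: Keras additionally clips probabilities for numerical stability.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true (integer) class."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

toy_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
toy_true = [0, 1]

loss = sparse_categorical_crossentropy(toy_probs, toy_true)
accuracy = float((np.argmax(toy_probs, axis=-1) == np.array(toy_true)).mean())
print(round(loss, 4), accuracy)  # 0.2899 1.0
```

Here the loss is the average of -ln(0.7) and -ln(0.8), and both predictions are correct, so the accuracy is 1.0.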

# Training

The commented-out cell below shows how to load a previously saved checkpoint in order to resume training from it.

In [None]:
# from transformers import TFDebertaModel

# # Define the custom layer within the custom_object_scope
# custom_objects = {'TFDebertaModel': TFDebertaModel}

# # Load the model using the custom_object_scope
# with tf.keras.utils.custom_object_scope(custom_objects):
#     model = tf.keras.models.load_model('deberta-trained-model.h5')

The `custom_object_scope` lets Keras rebuild the `TFDebertaModel` layer when the saved checkpoint is loaded.


In [None]:

# Train the model
history = model.fit(train_dataset.shuffle(1000).batch(8), epochs=10, validation_data=val_dataset.shuffle(1000).batch(16))


The code trains the model using the Keras `fit()` method. The arguments are:

- `train_dataset.shuffle(1000).batch(8)`: The training dataset is shuffled with a buffer size of 1000 and batched into batches of size 8. The `shuffle()` function draws examples at random from a rolling buffer of 1000 elements, so the order changes from epoch to epoch.

- `epochs=10`: The number of times the entire training dataset will be iterated over during training.

- `validation_data=val_dataset.shuffle(1000).batch(16)`: The validation dataset is shuffled with a buffer size of 1000 and batched into batches of size 16. The model's performance on this dataset will be evaluated at the end of each training epoch.

The batch size is set on the dataset itself with `.batch()`, so `batch_size` must not be passed to `fit()` when the input is a `tf.data.Dataset`.
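The shuffle-then-batch pipeline used for training can be imitated in plain Python. The sketch below is a conceptual stand-in for buffer-based shuffling and batching, with made-up helper names (`buffered_shuffle`, `make_batches`) and toy sizes; it is not how `tf.data` is implemented internally.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in random order, sampling from a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # drain what is left in random order
    yield from buffer

def make_batches(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
batches = list(make_batches(shuffled, batch_size=8))
print([len(b) for b in batches])  # [8, 8, 4]
```

Because the first emitted item can only come from the first `buffer_size` inputs, a buffer smaller than the dataset gives an approximate rather than a fully uniform shuffle.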


# Inference

In [None]:
# Save the trained model
model.save('deberta-trained-model.h5')

This cell saves the trained model so that it can be reloaded later for inference.

In [None]:
# Make predictions on the test set using the saved model
#model = tf.keras.models.load_model('deberta-trained-model.h5')
test_encodings = tokenizer(test_df['premise'].tolist(), truncation=True, padding=True)
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings)))
test_predictions = model.predict(test_dataset.batch(16)).argmax(axis=-1)

# Save the predictions to a CSV file
submission_df = pd.DataFrame({'id': test_df['id'], 'prediction': test_predictions})
submission_df.to_csv('submission.csv', index=False)

The code makes predictions on the test set. The steps are:


1. `#model = tf.keras.models.load_model('deberta-trained-model.h5')`: This line is commented out, so predictions are made with the trained model that is still in memory. Uncommenting it would reload the saved model from "deberta-trained-model.h5" with TensorFlow's `load_model()` function, which may need the `custom_object_scope` shown in the Training section.

2. `test_encodings = tokenizer(test_df['premise'].tolist(), truncation=True, padding=True)`: The test set is tokenized using the same tokenizer that was used during training. The tokenizer is applied to the "premise" column of the test data, and truncation and padding are applied to ensure consistent input shapes.

3. `test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings)))`: The tokenized test data is converted into a TensorFlow dataset using the `from_tensor_slices()` function. The input features are passed as a dictionary, without labels.

4. `test_predictions = model.predict(test_dataset.batch(16)).argmax(axis=-1)`: The model is used to make predictions on the test dataset, batched into batches of size 16. `predict()` returns one row of class probabilities per example, and `argmax(axis=-1)` picks the index of the most probable class for each example.

5. `submission_df = pd.DataFrame({'id': test_df['id'], 'prediction': test_predictions})`: The predicted class indices and corresponding IDs from the test dataset are combined into a pandas DataFrame.

6. `submission_df.to_csv('submission.csv', index=False)`: The predictions are saved to a CSV file named "submission.csv" without including the index column.
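The step from class probabilities to submitted labels can be shown on a few made-up rows. In the sketch below the probabilities and ids are hypothetical, and the label names assume the competition's coding (0 = entailment, 1 = neutral, 2 = contradiction), which is not stated in this notebook.

```python
import numpy as np

# Hypothetical softmax outputs: three test examples (rows), three classes (columns)
probabilities = np.array([
    [0.10, 0.20, 0.70],
    [0.80, 0.15, 0.05],
    [0.30, 0.40, 0.30],
])
example_ids = ["a1", "b2", "c3"]  # made-up example ids

# Index of the most probable class for each row
predicted = probabilities.argmax(axis=-1)

# Assumed label coding, for readability only
label_names = {0: "entailment", 1: "neutral", 2: "contradiction"}
rows = [
    {"id": i, "prediction": int(p), "name": label_names[int(p)]}
    for i, p in zip(example_ids, predicted)
]
print(predicted.tolist())  # [2, 0, 1]
```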



In [None]:
submission_df