# Merged Multilingual NLI Notebook (Enhanced)

This notebook merges the previous two notebooks into one clean pipeline for the **Contradictory, My Dear Watson** task.

## What is improved
- Single unified workflow (EDA + training + inference)
- Language-aware stratified train/validation split
- Text normalization
- Masked mean pooling (padding tokens are excluded from the average)
- Dropout regularization
- Class weights
- Two-stage fine-tuning (train the head on a frozen backbone, then unfreeze the last 6 encoder layers)
- AdamW + LR scheduling + EarlyStopping
- Per-language validation diagnostics


## 1. Install / Imports


In [1]:
import os
# Keras 3 compatibility fix for Hugging Face TF models
os.environ['TF_USE_LEGACY_KERAS'] = '1'

!pip install -q -U tf-keras "transformers==4.48.3" safetensors
print('Installed compatibility packages. Restart runtime once, then continue.')



In [2]:

import os
os.environ['TF_USE_LEGACY_KERAS'] = '1'

import re
import random
import unicodedata
import numpy as np
import pandas as pd
import tensorflow as tf
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score, classification_report

from transformers import AutoTokenizer, TFAutoModel

pio.templates.default = "plotly_dark"
print('TensorFlow:', tf.__version__)


TensorFlow: 2.20.0


## 2. Configuration


In [3]:
# Paths
PATH_TRAIN = '../input/contradictory-my-dear-watson/train.csv'
PATH_TEST = '../input/contradictory-my-dear-watson/test.csv'
PATH_SUB = '../input/contradictory-my-dear-watson/sample_submission.csv'

# Model
MODEL_NAME = 'joeddav/xlm-roberta-large-xnli'  # Change to xlm-roberta-base for faster training
MAX_LEN = 128
NUM_CLASSES = 3
DROPOUT = 0.30
SEED = 42

# Training
VAL_SIZE = 0.20
BATCH_SIZE_PER_REPLICA = 16
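EPOCHS_STAGE1 = 1   # train the classification head with the backbone frozen
EPOCHS_STAGE2 = 8   # fine-tune the last 6 encoder layers (upper bound; EarlyStopping may end sooner)
BASE_LR = 1.5e-5    # fallback for make_optimizer; both stages below pass their own base_lr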
WEIGHT_DECAY = 1e-4
PATIENCE = 2

LABEL_MAP = {0: 'Entailment', 1: 'Neutral', 2: 'Contradiction'}


## 3. Reproducibility + Hardware Strategy


In [4]:
def seed_everything(seed=42):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

seed_everything(SEED)

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy()

GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
print('Replicas:', strategy.num_replicas_in_sync)
print('Global batch size:', GLOBAL_BATCH_SIZE)


Replicas: 1
Global batch size: 16


## 4. Load Data + Basic EDA


In [5]:
train_df = pd.read_csv(PATH_TRAIN)
test_df = pd.read_csv(PATH_TEST)
sub_df = pd.read_csv(PATH_SUB)

print('Train shape:', train_df.shape)
print('Test shape :', test_df.shape)
print('Missing values in train:', train_df.isna().sum().sum())

lang_counts = train_df['language'].value_counts().reset_index()
lang_counts.columns = ['language', 'count']
fig = px.bar(lang_counts, x='language', y='count', title='Training Samples per Language')
fig.show()

label_counts = train_df['label'].map(LABEL_MAP).value_counts().reset_index()
label_counts.columns = ['label', 'count']
fig = px.pie(label_counts, names='label', values='count', title='Label Distribution', hole=0.35)
fig.show()


Train shape: (12120, 6)
Test shape : (5195, 5)
Missing values in train: 0


## 5. NLP Preprocessing: Text Normalization


In [6]:
def normalize_text(text):
    text = str(text)
    text = unicodedata.normalize('NFKC', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

for col in ['premise', 'hypothesis']:
    train_df[col] = train_df[col].map(normalize_text)
    test_df[col] = test_df[col].map(normalize_text)

train_df[['premise', 'hypothesis']].head()


| | premise | hypothesis |
|---|---|---|
| 0 | and these comments were considered in formulat... | The rules developed in the interim were put to... |
| 1 | These are issues that we wrestle with in pract... | Practice groups are not permitted to work on t... |
| 2 | Des petites choses comme celles-là font une di... | J'essayais d'accomplir quelque chose. |
| 3 | you know they can't really defend themselves l... | They can't defend themselves because of their ... |
| 4 | ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด... | เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร |
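
To make the effect of `normalize_text` concrete, here is a small standalone check. The sample strings are made up for illustration and are not taken from the dataset.

```python
import re
import unicodedata

def normalize_text(text):
    # Same steps as the cell above: NFKC folding, then whitespace collapsing
    text = unicodedata.normalize('NFKC', str(text))
    return re.sub(r'\s+', ' ', text).strip()

# Illustrative inputs (hypothetical, not rows from train.csv)
samples = [
    '  a \n\t b ',                                                   # mixed whitespace
    '\uff46\uff55\uff4c\uff4c\u3000\uff57\uff49\uff44\uff54\uff48',  # full-width "full width"
    '\ufb01ne',                                                      # "fi" ligature + "ne"
]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)  # ['a b', 'full width', 'fine']
```

One caveat: because of `str(text)`, a missing value would become the literal string `'nan'`. The train set reports zero missing values, so this has no effect there.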


## 6. Language-Aware Stratified Split

We stratify by `language + label` to keep both language distribution and class balance stable between train/validation.


In [7]:
train_df['strat_key'] = train_df['language'].astype(str) + '_' + train_df['label'].astype(str)

train_part, val_part = train_test_split(
    train_df,
    test_size=VAL_SIZE,
    random_state=SEED,
    stratify=train_df['strat_key']
)

train_part = train_part.reset_index(drop=True)
val_part = val_part.reset_index(drop=True)

print('Train split:', train_part.shape)
print('Val split  :', val_part.shape)


Train split: (9696, 7)
Val split  : (2424, 7)
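
A quick way to confirm that the split kept the `language + label` mix stable is to compare the relative frequency of each key in both parts. The helper below is a sketch with a hypothetical name (`max_share_gap`) and toy keys; it is not part of the original notebook.

```python
from collections import Counter

def max_share_gap(train_keys, val_keys):
    """Largest absolute difference in relative frequency of any key."""
    train_counts, val_counts = Counter(train_keys), Counter(val_keys)
    n_train, n_val = len(train_keys), len(val_keys)
    keys = set(train_counts) | set(val_counts)
    return max(
        abs(train_counts[k] / n_train - val_counts[k] / n_val) for k in keys
    )

# Hypothetical toy keys in the same "language_label" format as strat_key
toy_train = ['English_0'] * 8 + ['French_1'] * 4 + ['Thai_2'] * 4
toy_val = ['English_0'] * 2 + ['French_1'] * 1 + ['Thai_2'] * 1
gap = max_share_gap(toy_train, toy_val)
print(f'max share gap: {gap:.4f}')
```

On the real data this could be called as `max_share_gap(train_part['strat_key'].tolist(), val_part['strat_key'].tolist())`; a value close to zero suggests the stratification worked. Note that `train_test_split` raises a `ValueError` when a stratification key occurs only once, so very rare language/label combinations would need to be merged first.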


## 7. Tokenization


In [8]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_pairs(df):
    enc = tokenizer(
        df['premise'].tolist(),
        df['hypothesis'].tolist(),
        padding='max_length',
        truncation=True,
        max_length=MAX_LEN,
        return_tensors='np'
    )
    return enc['input_ids'], enc['attention_mask']

x_train_ids, x_train_mask = tokenize_pairs(train_part)
x_val_ids, x_val_mask = tokenize_pairs(val_part)
x_test_ids, x_test_mask = tokenize_pairs(test_df)

y_train_sparse = train_part['label'].values
y_val_sparse = val_part['label'].values

y_train = tf.keras.utils.to_categorical(y_train_sparse, num_classes=NUM_CLASSES)
y_val = tf.keras.utils.to_categorical(y_val_sparse, num_classes=NUM_CLASSES)

print('Tokenized train:', x_train_ids.shape)


Tokenized train: (9696, 128)
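
Because the tokenizer pads to `max_length` and truncates, a row whose attention mask contains no padding has reached `MAX_LEN` and may have been cut off. The sketch below (hypothetical helper name, toy masks) measures how often that happens, which can help decide whether a longer `MAX_LEN` is worth testing.

```python
import numpy as np

def full_length_fraction(attention_mask):
    """Share of rows whose attention mask has no padding at all."""
    mask = np.asarray(attention_mask)
    real_tokens = mask.sum(axis=1)
    return float(np.mean(real_tokens == mask.shape[1]))

# Hypothetical masks with max_length=4: rows 0 and 2 use every position
toy_mask = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
])
print(full_length_fraction(toy_mask))  # 0.5
```

In the notebook this would be `full_length_fraction(x_train_mask)`. A full-length row is only a candidate for truncation, since a pair can also fit the limit exactly.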


## 8. Class Weights


In [9]:
classes = np.array([0, 1, 2])
weights = compute_class_weight(
    class_weight='balanced',
    classes=classes,
    y=y_train_sparse
)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print('Class weights:', class_weight)


Class weights: {0: 0.9676646706586827, 1: 1.040901771336554, 2: 0.9941556444171025}
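
The `'balanced'` heuristic computes each weight as `n_samples / (n_classes * count_of_class)`. The sketch below reproduces the printed weights with plain NumPy. The per-class counts are an assumption: they are inferred from the printed weights and the 9696-row train split, not printed by the notebook itself.

```python
import numpy as np

def balanced_class_weights(y, num_classes):
    """n_samples / (n_classes * count_per_class), the 'balanced' heuristic."""
    counts = np.bincount(y, minlength=num_classes)
    return len(y) / (num_classes * counts)

# Class counts inferred from the printed weights (assumption, see lead-in)
inferred_counts = [3340, 3105, 3251]
y_demo = np.repeat(np.arange(3), inferred_counts)
weights_demo = balanced_class_weights(y_demo, num_classes=3)
print(weights_demo)
```

All three weights are within about 4% of 1.0, so the classes are nearly balanced and the weighting has only a mild effect on the loss.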


## 9. Enhanced Model: XLM-R + Masked Mean Pooling + Dropout


In [10]:
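class MaskedMeanPooling(tf.keras.layers.Layer):
    """Average token embeddings over real tokens only, ignoring padding."""
    def call(self, token_embeddings, attention_mask):
        # (batch, seq_len) -> (batch, seq_len, 1), cast to the embedding dtype
        mask = tf.cast(tf.expand_dims(attention_mask, axis=-1), token_embeddings.dtype)
        masked = token_embeddings * mask
        summed = tf.reduce_sum(masked, axis=1)
        # Small epsilon guards against division by zero for an all-padding row
        denom = tf.reduce_sum(mask, axis=1) + 1e-9
        return summed / denom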

class EnhancedNLIModel(tf.keras.Model):
    def __init__(self, backbone, num_classes=3, dropout=0.3, **kwargs):
        super().__init__(**kwargs)
        self.backbone = backbone
        self.pool = MaskedMeanPooling()
        self.dropout = tf.keras.layers.Dropout(dropout)
        self.classifier = tf.keras.layers.Dense(num_classes, activation='softmax')

    def call(self, inputs, training=False):
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask, training=training)
        x = self.pool(out.last_hidden_state, attention_mask)
        x = self.dropout(x, training=training)
        return self.classifier(x)
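
To see why the mask matters, the following NumPy sketch compares masked mean pooling with a plain mean over the sequence axis. The embedding values are hypothetical; the `9.0` entries stand in for whatever vectors the backbone emits at padding positions.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """NumPy version of masked mean pooling over the sequence axis."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    return (token_embeddings * mask).sum(axis=1) / (mask.sum(axis=1) + 1e-9)

# One toy sequence: two real tokens followed by two padding positions
emb = np.array([[[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0, 0]])

pooled_masked = masked_mean(emb, mask)   # averages the two real tokens only
pooled_naive = emb.mean(axis=1)          # padding positions leak into the average
print(pooled_masked)  # approximately [[2. 4.]]
print(pooled_naive)   # [[5.5 6.5]]
```

With `padding='max_length'`, short sentence pairs consist mostly of padding, so the plain mean would be dominated by positions that carry no content.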


## 10. Compile Helpers


In [11]:
def make_optimizer(total_steps, base_lr=None):
    if base_lr is None:
        base_lr = BASE_LR

    lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
        initial_learning_rate=base_lr,
        decay_steps=max(1, total_steps)
    )

    return tf.keras.optimizers.AdamW(
        learning_rate=lr_schedule,
        weight_decay=WEIGHT_DECAY,
        epsilon=1e-8,
        clipnorm=1.0
    )
    

loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.05)
metrics = [tf.keras.metrics.CategoricalAccuracy(name='accuracy')]
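
For intuition about the schedule, here is a pure-Python sketch of cosine decay without warmup and with a final learning rate of zero, which is how `CosineDecay` is configured above. The step counts assume the 9696-row train split and a global batch size of 16 reported earlier; the function name is hypothetical.

```python
import math

def cosine_decay_lr(step, base_lr, decay_steps):
    """Cosine decay from base_lr to 0 (no warmup, alpha = 0)."""
    progress = min(step, decay_steps) / decay_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Stage 2 numbers from this notebook: 9696 training rows, batch 16, 8 epochs
steps_per_epoch = math.ceil(9696 / 16)      # 606
total_steps = steps_per_epoch * 8           # 4848
for step in (0, total_steps // 2, total_steps):
    print(step, cosine_decay_lr(step, 1e-5, total_steps))
```

The rate starts at `base_lr`, is halved at the midpoint, and reaches zero at `decay_steps`. If EarlyStopping ends training early, the schedule is simply cut off before it reaches zero.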




## 11. Build Datasets


In [12]:
train_inputs = {'input_ids': x_train_ids, 'attention_mask': x_train_mask}
val_inputs = {'input_ids': x_val_ids, 'attention_mask': x_val_mask}
test_inputs = {'input_ids': x_test_ids, 'attention_mask': x_test_mask}

train_ds = tf.data.Dataset.from_tensor_slices((train_inputs, y_train)).shuffle(4096, seed=SEED).batch(GLOBAL_BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
val_ds = tf.data.Dataset.from_tensor_slices((val_inputs, y_val)).batch(GLOBAL_BATCH_SIZE).prefetch(tf.data.AUTOTUNE)


## 12. Two-Stage Fine-Tuning

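Stage 1: freeze the transformer and train only the classification head, so the randomly initialized head stabilizes first.  
Stage 2: unfreeze the last 6 encoder layers (embeddings and earlier layers stay frozen) and continue at a lower learning rate for better task adaptation.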


In [13]:
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor='val_accuracy',
        patience=PATIENCE,
        mode='max',
        restore_best_weights=True
    )
]

with strategy.scope():
    backbone = TFAutoModel.from_pretrained(MODEL_NAME, from_pt=True)
    model = EnhancedNLIModel(
        backbone=backbone,
        num_classes=NUM_CLASSES,
        dropout=DROPOUT
    )

    # -------------------------
    # STAGE 1 — Train Head Only
    # -------------------------
    model.backbone.trainable = False

    steps_stage1 = int(np.ceil(len(train_part) / GLOBAL_BATCH_SIZE)) * max(1, EPOCHS_STAGE1)

    model.compile(
        optimizer=make_optimizer(steps_stage1, base_lr=3e-5),
        loss=loss_fn,
        metrics=metrics
    )

history_stage1 = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS_STAGE1,
    class_weight=class_weight,
    callbacks=callbacks,
    verbose=1
)

# -------------------------
# STAGE 2 — Partial Unfreeze
# -------------------------
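with strategy.scope():

    # Stage 1 set backbone.trainable = False, and Keras applies that flag
    # recursively. Sub-layers of a non-trainable parent contribute no trainable
    # weights, so re-enable the whole backbone before freezing selectively.
    model.backbone.trainable = True

    # Freeze all but the last 6 encoder layers
    for layer in model.backbone.roberta.encoder.layer[:-6]:
        layer.trainable = False

    for layer in model.backbone.roberta.encoder.layer[-6:]:
        layer.trainable = True

    # Keep embeddings frozen (saves memory + stabilizes training)
    model.backbone.roberta.embeddings.trainable = False

    # The pooler head is not used by masked mean pooling, so keep it frozen too
    if getattr(model.backbone.roberta, 'pooler', None) is not None:
        model.backbone.roberta.pooler.trainable = False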

    steps_stage2 = int(np.ceil(len(train_part) / GLOBAL_BATCH_SIZE)) * max(1, EPOCHS_STAGE2)

    # Lower LR for fine-tuning
    model.compile(
        optimizer=make_optimizer(steps_stage2, base_lr=1e-5),
        loss=loss_fn,
        metrics=metrics
    )

history_stage2 = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS_STAGE2,
    class_weight=class_weight,
    callbacks=callbacks,
    verbose=1
)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFXLMRobertaModel: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'roberta.embeddings.position_ids', 'classifier.dense.weight']
- This IS expected if you are initializing TFXLMRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFXLMRobertaModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFXLMRobertaModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaModel for predictions without further training.



Epoch 1/8
Epoch 2/8
Epoch 3/8
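
With `PATIENCE = 2` and `restore_best_weights=True`, training stops after two consecutive epochs without a new best `val_accuracy`, and the weights from the best epoch are restored. The function below is a simplified sketch of that stopping rule, not the Keras source. Its name and the accuracy values are hypothetical and are not results from this run.

```python
def early_stopping_trace(val_scores, patience):
    """Return (stop_epoch, best_epoch), 0-indexed, for a 'max' metric."""
    best, best_epoch, wait = float('-inf'), None, 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            # New best: remember it and reset the patience counter
            best, best_epoch, wait = score, epoch, 0
            continue
        wait += 1
        if wait >= patience:
            return epoch, best_epoch
    return len(val_scores) - 1, best_epoch

# Hypothetical validation accuracies, not results from this run
demo_scores = [0.90, 0.92, 0.91, 0.915, 0.93]
stop_epoch, best_epoch = early_stopping_trace(demo_scores, patience=2)
print(stop_epoch, best_epoch)  # 3 1
```

In this toy trace the fifth epoch would have been the best one, which illustrates the trade-off: a small patience saves compute but can stop before a late improvement.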


## 13. Validation Audit (Overall + Per Language)


In [14]:
val_probs = model.predict(val_inputs, batch_size=GLOBAL_BATCH_SIZE, verbose=1)
val_pred = np.argmax(val_probs, axis=1)
val_true = y_val_sparse

acc = accuracy_score(val_true, val_pred)
print(f'Validation accuracy: {acc:.4f}')
print('Classification report:')
print(classification_report(val_true, val_pred, target_names=[LABEL_MAP[i] for i in range(NUM_CLASSES)]))

val_eval = val_part.copy()
val_eval['pred'] = val_pred
val_eval['correct'] = (val_eval['pred'] == val_eval['label']).astype(int)

lang_perf = val_eval.groupby('language')['correct'].mean().sort_values(ascending=False).reset_index()
lang_perf.columns = ['language', 'accuracy']
fig = px.bar(lang_perf, x='language', y='accuracy', title='Per-language Validation Accuracy')
fig.show()

errors = val_eval[val_eval['correct'] == 0][['language', 'premise', 'hypothesis', 'label', 'pred']]
errors.head(10)


Validation accuracy: 0.9295
Classification report:
               precision    recall  f1-score   support

   Entailment       0.92      0.94      0.93       836
      Neutral       0.90      0.91      0.91       775
Contradiction       0.96      0.94      0.95       813

     accuracy                           0.93      2424
    macro avg       0.93      0.93      0.93      2424
 weighted avg       0.93      0.93      0.93      2424



| | language | premise | hypothesis | label | pred |
|---|---|---|---|---|---|
| 16 | English | The two programs are currently housed in build... | The two buildings are on opposite sides of the... | 2 | 1 |
| 25 | English | more than anything else in this day and age th... | In your decisions age is a big factor | 0 | 1 |
| 26 | English | It is really a matter of waiting." | It is a matter of not having nay patients. | 2 | 1 |
| 38 | English | Also, disappointing earnings reports from Inte... | Intel has had many disappointing earning reports. | 1 | 0 |
| 54 | English | The burden of his spiritual functions as high ... | People looked down on the emperor for abandoni... | 1 | 2 |
| 60 | English | You wonder whether he could win a general elec... | He might run in a general election while he is... | 1 | 0 |
| 61 | English | Search out the House of Dionysos and the House... | The House of Dolphins and the House of Masks a... | 0 | 1 |
| 90 | English | This provides insight into the important Japan... | Katachi means, it's not how you do something; ... | 2 | 0 |
| 104 | English | Professor Rogers began her career by clerking ... | Professor Rogers has always been a clerk to him. | 2 | 1 |
| 121 | English | There may be a small savings at the factory sh... | The factory show rooms are cheaper. | 1 | 0 |
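
A confusion matrix shows which label pairs get mixed up, which the per-language bar chart cannot. The sketch below builds one with NumPy, using only the ten `(label, pred)` pairs from the error rows displayed above as input. The helper name is hypothetical.

```python
import numpy as np

def confusion_matrix_counts(y_true, y_pred, num_classes):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly when the same (true, pred) cell repeats
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

# (label, pred) pairs of the ten error rows displayed above
shown_true = [2, 0, 2, 1, 1, 1, 0, 2, 2, 1]
shown_pred = [1, 1, 1, 0, 2, 0, 1, 0, 1, 0]
cm_shown = confusion_matrix_counts(shown_true, shown_pred, num_classes=3)
print(cm_shown)
```

Nine of these ten errors involve Neutral (label 1) on one side, but this is only the first ten error rows. Calling `confusion_matrix_counts(val_true, val_pred, NUM_CLASSES)` on the full validation set would give the complete picture.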


## 14. Inference + Submission


In [15]:
test_probs = model.predict(test_inputs, batch_size=GLOBAL_BATCH_SIZE, verbose=1)
test_pred = np.argmax(test_probs, axis=1)

submission = pd.DataFrame({'id': test_df['id'], 'prediction': test_pred})
submission.to_csv('submission.csv', index=False)

print('Saved submission.csv')
submission.head()


Saved submission.csv


| | id | prediction |
|---|---|---|
| 0 | c6d58c3f69 | 2 |
| 1 | cefcc82292 | 1 |
| 2 | e98005252c | 0 |
| 3 | 58518c10ba | 1 |
| 4 | c32b0d16df | 1 |
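
Before uploading, a few cheap checks on the submission can catch avoidable mistakes. The sketch below uses a hypothetical helper (`validate_submission`). It assumes that the submission should list the test ids in their original order and that predictions must be one of the three class indices.

```python
def validate_submission(sub_ids, sub_preds, expected_ids, num_classes=3):
    """Return a list of problems found; an empty list means the checks passed."""
    sub_ids, expected_ids = list(sub_ids), list(expected_ids)
    problems = []
    if sub_ids != expected_ids:
        problems.append('ids differ from the test set (order or content)')
    if len(set(sub_ids)) != len(sub_ids):
        problems.append('duplicate ids')
    bad = [p for p in sub_preds if int(p) not in range(num_classes)]
    if bad:
        problems.append(f'{len(bad)} predictions outside 0..{num_classes - 1}')
    return problems

# First three rows of the submission preview above
ids = ['c6d58c3f69', 'cefcc82292', 'e98005252c']
print(validate_submission(ids, [2, 1, 0], ids))  # []
```

In the notebook this could be called as `validate_submission(submission['id'], submission['prediction'], test_df['id'])`.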


## 15. Notes

If you want even better performance, test these next:
1. K-fold CV and prediction averaging
2. Sequence length sweep (`MAX_LEN`: 96, 128, 160)
3. Backbone sweep (`xlm-roberta-base`, `xlm-roberta-large`, `mdeberta-v3-base`)
4. R-Drop / adversarial training (advanced)
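
As a sketch of the prediction averaging mentioned in item 1, the snippet below averages class probabilities from several folds and then takes the argmax. The probabilities are hypothetical, and the helper name is not part of the notebook. In practice each array would come from `model.predict(test_inputs)` of one fold's model.

```python
import numpy as np

def average_fold_predictions(fold_probs):
    """Average per-fold class probabilities, then take the argmax."""
    mean_probs = np.mean(np.stack(fold_probs, axis=0), axis=0)
    return mean_probs, mean_probs.argmax(axis=1)

# Hypothetical softmax outputs from two folds for two test rows
fold_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
fold_b = np.array([[0.4, 0.5, 0.1], [0.1, 0.3, 0.6]])
mean_probs, labels = average_fold_predictions([fold_a, fold_b])
print(mean_probs)
print(labels)  # [0 2]
```

Averaging probabilities rather than voting on hard labels lets a confident fold outweigh an uncertain one. In the toy data the two folds disagree on both rows, and the average follows the more confident fold each time.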
