KISS RoBERTa
============

In case you don't know it already, KISS stands for "Keep It Simple, Stupid".

*This notebook's goal is not to be competitive, but rather to quickly get started with RoBERTa training.*

This notebook intentionally leaves out K-Fold and similar techniques. It trains on 80% of the data and validates on the remaining 20%. It also uses `distilroberta-base`, a smaller model that trades a little accuracy for faster training. I personally like it because it makes experimentation faster, which is the goal here. Moreover, overfitting will be your enemy, so it is not even certain that the bigger model would do better anyway.
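
A note on how that 80/20 split is made: it comes from `validation_split=0.2` in the call to `fit` in the Train section. Keras holds out the last fraction of the samples, taken before any shuffling, rather than a random subset. The sketch below mimics that behaviour in plain Python; `split_last_fraction` is a hypothetical helper, not a Keras function, and its rounding is an assumption.

```python
def split_last_fraction(samples, validation_split=0.2):
    """Mimic Keras's validation_split: hold out the LAST fraction, no shuffling."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]


rows = list(range(10))  # stand-in for 10 training rows, in file order
train_rows, val_rows = split_last_fraction(rows, 0.2)
print(train_rows)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_rows)    # [8, 9]
```

If the rows of `train.csv` happen to be ordered in some meaningful way, shuffling the DataFrame before training may be worth considering.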

In [None]:
BASE_MODEL = '/kaggle/input/huggingface-roberta-variants/distilroberta-base/distilroberta-base'

Prepare data
============

In [None]:
import numpy as np
import pandas as pd

train = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
train

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

In [None]:
def get_data(df):
    # These 2 lines took me days to figure out!!! 😖😖😖
    tokenized = tokenizer(df['excerpt'].tolist(), padding=True, return_tensors="np") # the "np" means it will return numpy arrays
    return {feat: tokenized[feat] for feat in tokenizer.model_input_names}
    
X = get_data(train)
X
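
To make the `padding=True` behaviour concrete: the tokenizer pads every excerpt to the length of the longest one in the call, and returns an `attention_mask` that marks real tokens with 1 and padding with 0. The sketch below reproduces that idea in plain Python. The token ids are made up, `pad_batch` is a hypothetical helper, and the pad id of 1 is the value RoBERTa tokenizers normally use (check `tokenizer.pad_token_id` to be sure).

```python
def pad_batch(sequences, pad_id=1):
    """Pad lists of token ids to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two excerpts of different lengths
batch = pad_batch([[0, 713, 16, 2], [0, 713, 16, 10, 1296, 2]])
print(batch["input_ids"])       # [[0, 713, 16, 2, 1, 1], [0, 713, 16, 10, 1296, 2]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

The dictionary returned by `get_data` should have the same layout, only with NumPy arrays instead of lists.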

In [None]:
y = train[["target"]].values # note that this is a 2-D NumPy array of shape (n_samples, 1)
y

Helper functions
================

In [None]:
import matplotlib.pyplot as plt

from tensorflow import keras  # use tf.keras consistently with the backend import below
import tensorflow.keras.backend as K


early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=2,
    verbose=1,
    restore_best_weights=True,
)


# Custom loss: root mean squared error (RMSE) instead of the built-in 'mse'
def rmse(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true)))


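def plot_hist(history):
    # Plot training and validation loss per epoch
    plt.plot(history['loss'], label='loss')
    plt.plot(history['val_loss'], label='val_loss')
    plt.xlabel('epoch')
    plt.legend()
    plt.show()
    best_val_loss = min(history['val_loss'])
    print(f'Best validation loss: {best_val_loss:.3f}')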
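
One caveat about using `rmse` as a loss: Keras computes the loss per batch and reports the average over batches, so the number printed for an epoch is a mean of per-batch RMSE values, not the RMSE over the whole set. The sketch below shows the difference in plain Python with made-up numbers; `rmse_py` is a hypothetical stand-in for the `rmse` function above.

```python
import math


def rmse_py(y_true, y_pred):
    """Plain-Python equivalent of the Keras `rmse` loss for flat lists."""
    squared = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up targets and predictions, processed as two batches of two samples
y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 3.0, 3.0]

overall = rmse_py(y_true, y_pred)
per_batch = [rmse_py(y_true[i:i + 2], y_pred[i:i + 2]) for i in (0, 2)]
mean_of_batches = sum(per_batch) / len(per_batch)

print(round(overall, 3))  # 2.236
print(mean_of_batches)    # 2.0
```

With equally sized batches, the mean of the per-batch values cannot exceed the overall RMSE, so the reported loss may look slightly better than an RMSE computed over all samples at once.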

Train
=====

In [None]:
from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam

model = TFAutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=1) # num_labels=1 results in a regression

model.compile(optimizer=Adam(1e-5), loss=rmse) # small learning rates are necessary!
model.summary()

In [None]:
hist = model.fit(X, y, validation_split=0.2, epochs=5, batch_size=8, callbacks=[early_stop], verbose=2)

plot_hist(hist.history)
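
For intuition about what `early_stop` does during `fit`: with `patience=2`, training stops once `val_loss` has failed to improve for two consecutive epochs, and `restore_best_weights=True` rolls the model back to the best epoch. The sketch below simulates that rule on made-up loss values; `simulate_early_stopping` is a hypothetical helper and ignores details of the real callback such as `min_delta`.

```python
def simulate_early_stopping(val_losses, patience=2):
    """Return (epochs_run, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch + 1, best_epoch  # stopped early
    return len(val_losses), best_epoch  # ran every epoch


# Made-up validation losses for 5 epochs
epochs_run, best_epoch = simulate_early_stopping([0.70, 0.62, 0.65, 0.66, 0.60])
print(epochs_run)  # 4 (training stops after the 4th epoch)
print(best_epoch)  # 1 (0-based: the weights of the 2nd epoch would be restored)
```

In this made-up run the fifth epoch would have been the best one, which is the trade-off of a small patience.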

Submission
==========

In [None]:
test = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')
test

In [None]:
X_test = get_data(test)
y_test = model.predict(X_test)
test['target'] = y_test.logits[:,0]
test

In [None]:
test.to_csv('submission.csv', columns=['id','target'], index=False)

Conclusion
==========

1. ***Do not use this code to submit directly. It will needlessly re-train the model during submission. This was done solely for demonstration purposes.***
2. This does not train on the whole dataset, since 20% is held out for validation. It's up to you how to use it all, whether through K-Fold models or something else.
3. Tweaking the model/training parameters can lead to huge differences! Here are the most obvious ones:
    - optimizer
    - learning rate
    - batch size
    - dropout layers
    - model config parameters
    - ...or even simply trying another huggingface model
    
You will quickly notice that the impact of these "hyperparameters" is huge. But they have one thing in common: you will have to fight overfitting! With fewer than 3000 training samples, it's your enemy. If you find a "simple" way to improve this baseline a lot, I would be glad to hear about it.
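
As one possible starting point for point 2, here is a minimal K-Fold index split in plain Python. `kfold_indices` is a hypothetical helper (scikit-learn's `KFold` does the same job more thoroughly); each fold would train a fresh model on `train_idx` and validate on `val_idx`.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=42):
    """Yield (train_idx, val_idx) pairs; every sample is used once for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


for train_idx, val_idx in kfold_indices(10, n_folds=5):
    print(len(train_idx), len(val_idx))  # 8 2 on every fold
```

Since `X` is a dictionary of NumPy arrays, each fold's inputs could presumably be selected with something like `{name: arr[train_idx] for name, arr in X.items()}`, together with `y[train_idx]`.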