# 1- Learning 🤗  - Out-of-the-box BERT [LB: 0.81029]

Hi, and welcome! This is the first kernel of the series `Learning 🤗`, a personal project I'm currently working on. I am an experienced data scientist diving into the Hugging Face `transformers` library, and this series of kernels is a "working diary" that I write as I go. The approach I'm taking is the following:

1. Explore various out-of-the-box models, without digging into their technical details. 
2. After that, I'll go over the best-ranked public kernels, understand their ideas, and reproduce them myself.

You are invited to follow me on this journey. In this short kernel (~80 lines) we fine-tune an out-of-the-box cased BERT, with just the minimal setup required for it to run in this competition, obtaining a leaderboard score of `0.81029`.

This is an ongoing project, so expect more notebooks to be added to the series soon. Actually, we are currently working on the following ones:

1. [Learning 🤗  - Out-of-the-box BERT [LB: 0.81029]](1-learning-out-of-the-box-bert-lb-0-8102) (this notebook)
2. Learning 🤗 - Out-of-the-box RoBERTa _WIP_
3. Learning 🤗 - Out-of-the-box Electra _WIP_
4. Learning 🤗 - BERT Large Uncased _WIP_

### Please remember to upvote if you found the series useful for your research!


## Using the [`transformers`](https://huggingface.co/transformers/) library

We are using a very high-level API of the library, following this quick guide, which we recommend reading:
[Fine-tuning a pretrained model](https://huggingface.co/transformers/training.html)

We use only four objects from the library: `Trainer`, `TrainingArguments`, `AutoModelForSequenceClassification`, and `AutoTokenizer`. In fact, the full list of imports is quite small, as you can see below:

In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from transformers import Trainer, TrainingArguments, AutoTokenizer,\
                         AutoModelForSequenceClassification

## The code

We have split the code into several simple functions to separate the wheat from the chaff and focus on the parts that are new to us.

Documentation about the `Trainer` can be found [here](https://huggingface.co/transformers/main_classes/trainer.html) and about the `TrainingArguments` [here](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments).

In [None]:
# This is just 2 calls to pd.read_csv()
# Loading the df_train and df_test
def load_dfs():
    df_train = pd.read_csv('../input/nlp-getting-started/train.csv')[['text', 'target']]\
                 .rename(columns={'target': 'label'})
    df_test = pd.read_csv('../input/nlp-getting-started/test.csv')[['id', 'text']]
    return df_train, df_test


# This function is used by the Trainer to compute the metrics in the evaluation steps
# It's just computing accuracy and f1
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc,
            'f1': f1}


# This function applies the tokenizer to the three dataframes
# The result is ready for inputting it to the model
def tokenize(tokenizer, df_train, df_val, df_test):

    def tokenize_df(tokenizer, df, has_label=True):
        # Tokenize texts (Returns dictionary with keys: input_ids, token_type_ids, attention_mask)
        ds = tokenizer(df['text'].tolist(), padding="max_length", truncation=True)
        # Add key 'label'
        if has_label: ds['label'] = df['label'].tolist()
        # Turn dictionary of lists into list of dictionaries
        return [dict(zip(ds, t)) for t in zip(*ds.values())]

    ds_train = tokenize_df(tokenizer, df_train)
    ds_val = tokenize_df(tokenizer, df_val)
    ds_test = tokenize_df(tokenizer, df_test, has_label=False)

    return ds_train, ds_val, ds_test

# Gets tokenizer and model from the modelhub, given its id
def get_tokenizer_and_model(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    return tokenizer, model

# Gets the predictions for ds_test from the Trainer object
def get_predictions(trainer, ds_test):
    preds = trainer.predict(ds_test)
    preds = F.softmax(torch.from_numpy(preds.predictions), dim=-1)
    binary_preds = (preds[:, 1] > 0.5).numpy().astype(int)
    return binary_preds
    
# Gets the predictions for the test set and saves them to submission.csv
def submit(trainer, ds_test):
    df_res = pd.read_csv('../input/nlp-getting-started/sample_submission.csv')
    df_res['target'] = get_predictions(trainer, ds_test)
    df_res.to_csv('submission.csv', index=False)
    return df_res
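
The last line of `tokenize_df` turns a dictionary of lists into a list of dictionaries, one per example. Here is a minimal sketch of that step on a toy dictionary. The ids below are made up for illustration and are not real tokenizer output.

```python
# Toy stand-in for the tokenizer output: one list per key, one entry per text.
# The ids are invented for illustration only.
ds = {
    "input_ids": [[101, 7, 102], [101, 8, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
    "label": [1, 0],
}

# zip(*ds.values()) walks the lists in parallel, yielding one tuple per example;
# zip(ds, t) then pairs each key with that example's value.
rows = [dict(zip(ds, t)) for t in zip(*ds.values())]

print(len(rows))  # one dictionary per example
print(rows[0])
```

Each element of `rows` is then a single training example with the keys the model expects.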

In [None]:
from transformers.trainer_utils import set_seed; set_seed(2021) # Set seed for reproducibility

# This is the model we will use, from the modelhub:
# https://huggingface.co/models
MODEL_NAME = "bert-base-cased"

# Get tokenizer and model
tokenizer, model = get_tokenizer_and_model(MODEL_NAME)

# Load dataframes
df_base, df_test = load_dfs()

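# Split train and validation sets (no random_state is passed, so the split
# relies on the global NumPy seed that set_seed() fixed above)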
df_train, df_val = train_test_split(df_base, test_size=0.1)

# Tokenize train, validation, and test sets
ds_train, ds_val, ds_test = tokenize(tokenizer, df_train, df_val, df_test)
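
To get a feeling for what `padding="max_length"` and `truncation=True` do, here is a deliberately simplified sketch on plain lists of ids. It is not the library's implementation: the helper `pad_or_truncate` and the value of `PAD_ID` are hypothetical, and the real tokenizer also takes care of special tokens, which this toy version ignores.

```python
PAD_ID = 0  # assumed padding id, for this toy example only


def pad_or_truncate(ids, max_length):
    """Simplified illustration: cut to max_length, then right-pad with PAD_ID."""
    ids = ids[:max_length]                 # truncation
    n_pad = max_length - len(ids)          # how many padding positions are needed
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


short = pad_or_truncate([5, 6, 7], max_length=5)
long = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=5)
print(short)
print(long)
```

With `padding="max_length"` and no explicit `max_length`, every text is padded to the model's maximum length, which I believe is 512 tokens for `bert-base-cased`. Since tweets are short, passing a smaller `max_length` to the tokenizer is probably worth trying in a later notebook.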

In [None]:
# Fine-tune for just 1 epoch
EPOCHS = 1

# Prepare the TrainingArguments
args = TrainingArguments("/kaggle/working/model/", 
                         num_train_epochs=EPOCHS, 
                         report_to="none", # Disable "wandb", I don't know what it is yet
                         evaluation_strategy="steps", 
                         eval_steps=100, # Evaluate and log metrics to screen every 100 steps
                         )

# Instantiate the Trainer
trainer = Trainer(model=model, 
                  args=args, 
                  train_dataset=ds_train, 
                  eval_dataset=ds_val, 
                  compute_metrics=compute_metrics)

# Train the model
trainer.train()
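
As a rough sanity check on how often the `Trainer` evaluates with `eval_steps=100`, the arithmetic can be sketched as follows. The helper `eval_schedule` is hypothetical, and the numbers in the example call are made up. The sketch assumes a single device and no gradient accumulation. As far as I know, the default batch size in `TrainingArguments` is 8 per device.

```python
import math


def eval_schedule(n_train, batch_size, epochs, eval_steps):
    """Rough step arithmetic, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # last batch may be smaller
    total_steps = steps_per_epoch * epochs
    n_evals = total_steps // eval_steps                # one evaluation every eval_steps
    return steps_per_epoch, total_steps, n_evals


# Hypothetical numbers, for illustration only
steps_per_epoch, total_steps, n_evals = eval_schedule(
    n_train=6000, batch_size=8, epochs=1, eval_steps=100
)
print(steps_per_epoch, total_steps, n_evals)
```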

In [None]:
# Evaluate on the validation set, to check how the validation scheme correlates with the LB
res = trainer.evaluate()
print(f"Validation F1 : {res['eval_f1']:.2f}")
print(f"Validation Acc: {res['eval_accuracy']:.2f}")
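
The F1 reported here comes from `compute_metrics`, which uses `average='binary'`, i.e. the F1 of the positive class (label 1). To make that concrete, here is a dependency-free sketch of the same quantity computed by hand. The function `binary_f1` and the toy labels are only for illustration.

```python
def binary_f1(labels, preds):
    """F1 of the positive class (label 1), computed from raw counts."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: 2 true positives, 1 false positive, 1 false negative
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(f"F1: {binary_f1(labels, preds):.3f}  Acc: {accuracy:.3f}")
```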

In [None]:
# Generate predictions and create submission file
submit(trainer, ds_test);
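
A side note on `get_predictions`: it applies a softmax and keeps the examples where the probability of class 1 is above 0.5. With two classes, this should give the same labels as comparing the two logits directly, as `compute_metrics` does with `argmax`. Here is a small dependency-free check on made-up logits:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for four examples: [logit_class_0, logit_class_1]
logit_rows = [[2.0, -1.0], [-0.5, 0.3], [0.1, 1.7], [3.0, 2.9]]

by_threshold = [int(softmax(row)[1] > 0.5) for row in logit_rows]
by_argmax = [int(row[1] > row[0]) for row in logit_rows]
print(by_threshold, by_argmax)
```

Keeping the softmax is still handy if we later want to move the threshold away from 0.5.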

# 🤗🤗 Thanks for reading this notebook! Remember to upvote if you found it useful, and stay tuned for the next deliveries! 🤗🤗