Welcome to the [Feedback Prize - Predicting Effective Arguments](https://www.kaggle.com/competitions/feedback-prize-effectiveness) competition by [Georgia State University](https://www.gsu.edu/) on Kaggle.

 <img alt="An example of inference on an essay" src="https://i.imgur.com/uZIST9f.png" width="750px"/>

In this notebook, I give an overview of the competition and explore the dataset before training a baseline model. I also perform additional analysis using the trained model.

The notebook includes:
* An example of using pseudo-labels on the larger [2021 Feedback Prize Competition](https://www.kaggle.com/competitions/feedback-prize-2021) dataset.
* A custom head for the [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) framework.
* An implementation of [multi-sample dropout](https://arxiv.org/abs/1905.09788).

If you find this notebook helpful, I would appreciate an upvote! ❤️

* For inference on the test set, see [Feedback Prize - DeBERTa-v3 Inference](https://www.kaggle.com/code/lextoumbourou/feedback-prize-deberta-v3-inference).
* For inference on the 2021 data, see [Feedback Prize - Inference on 2021 Dataset](https://www.kaggle.com/code/lextoumbourou/feedback-prize-inference-on-2021-dataset).

# Changelog

* **2022-06-25**

  Reduce Dropout and validate 2x each epoch. Also some changes for faster training.
 
* **2022-06-19**

  Include essay text.
 
* **2022-06-17**

  Pseudo-labels appear to be making results a bit worse. Reducing the number. Soft labels or inference with a more accurate model will likely help here.
  
  Also tuned the multi-sample dropout params.
 
* **2022-06-11**

  Add Mean Pooling taken from [this](https://www.kaggle.com/code/debarshichanda/pytorch-feedback-deberta-v3-baseline) notebook by [@debarshichanda](https://www.kaggle.com/debarshichanda).
 
* **2022-06-07**

  Clean up.

* **2022-06-05**

  Add topics to improve EDA.
  
  Revert label smoothing.

* **2022-06-02**

  Revert to Deberta Base for iteration speed.
  
  Add label smoothing.
  
  Also, update Model Interpretation.
 
* **2022-05-31**

  Try Deberta Large.
 
* **2022-05-29**

  Add 2021 competition data experiment.

# Competition Overview

From the [competition description](https://www.kaggle.com/competitions/feedback-prize-effectiveness/overview/description):

*Writing is crucial for success.  In particular, argumentative writing fosters critical thinking and civic engagement skills, and can be strengthened by practice. However, only 13 percent of eighth-grade teachers ask their students to write persuasively each week. Additionally, resource constraints disproportionately impact Black and Hispanic students, so they are more likely to write at the “below basic” level as compared to their white peers. An automated feedback tool is one way to make it easier for teachers to grade writing tasks assigned to their students that will also improve their writing skills.*

Our goal in this competition is to classify sections (or "*elements*") of students' essays into three categories: `Ineffective`, `Adequate`, or `Effective`.

The essays were written by 6th-12th grade students located in the USA.

The competition is a Code Competition, which means that you must submit your kernel to get a leaderboard score. The unseen test set contains around 3000 essays.

# Related Competitions

## Feedback Prize (2021)

The original [Feedback Prize competition](https://www.kaggle.com/competitions/feedback-prize-2021) finished earlier this year. Though a [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) competition as opposed to classification, the dataset contains 11,403 additional essays and 70,763 additional essay sections.

I have done some analysis on that dataset [here](https://www.kaggle.com/code/lextoumbourou/feedback-prize-inference-on-2021-dataset).

* [1st solution with code(cv:0.748 lb:0.742)](https://www.kaggle.com/c/feedback-prize-2021/discussion/313177)
* [2nd Place - Weighted Box Fusion and Post Process](https://www.kaggle.com/competitions/feedback-prize-2021/discussion/313389)
* [3rd Place Solution w code and notebook](https://www.kaggle.com/competitions/feedback-prize-2021/discussion/313235)
* [4th place solution - 🎖️ my first gold medal 🎖️ (+source code available!)](https://www.kaggle.com/competitions/feedback-prize-2021/discussion/313330)
* [5'th place : simultaneous span segmentation and classification + WBF](https://www.kaggle.com/competitions/feedback-prize-2021/discussion/313478)
* [6th place solution. A YOLO-like text span detector.](https://www.kaggle.com/c/feedback-prize-2021/discussion/313424)
* [7th place solution](https://www.kaggle.com/competitions/feedback-prize-2021/discussion/315887)
* [9th solution, deberta is the king, pure ensemble of bert models](https://www.kaggle.com/competitions/feedback-prize-2021/discussion/313201)
* [10th solution](https://www.kaggle.com/c/feedback-prize-2021/discussion/313718)

## U.S. Patent Phrase to Phrase Matching

https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching

An NLP competition that finished very recently on June 20.

* [1st place solution](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/332243)
* [2nd Place Solution](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/332234)
* [3rd place solution](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/332420)
* [5th solution: prompt is all you need](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/332418)
* [7th place solution - the power of randomness](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/332928)
* [8th place solution: Predicting Targets at Once Led Us to Gold](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/332492)
* [10th place Solution : Single model public lb 0.8562, private lb 0.8717](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/332273)
* [12th Place Solution](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/332567)

## NBME - Score Clinical Patient Notes

https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes

A recent NLP competition that finished in May 2022.

* [1st solution](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/323095)
* [#2 solution](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/323085)
* [3rd Place Solution: Meta Pseudo Labels + Knowledge Distillation](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/322832)
* [4th place solution: Deberta models & postprocess](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/322799)
* [5th place solution](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/322875)
* [6th place solution](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/323237)
* [7th place solution: Get 0.892 in just 10 minutes](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/322829)
* [8th place solution](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/322962)
* [9th Weight search and threshold modification](https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes/discussion/322891)

## Jigsaw Rate Severity of Toxic Comments

https://www.kaggle.com/competitions/jigsaw-toxic-severity-rating

Another recent NLP competition with a similar problem statement that finished in February 2022. 

* [1st place solution with code](https://www.kaggle.com/competitions/jigsaw-toxic-severity-rating/discussion/306274)
* [Toxic Solution and Review (2nd Place)](https://www.kaggle.com/competitions/jigsaw-toxic-severity-rating/discussion/308938)
* [4th - This is Great! - Shared Solution](https://www.kaggle.com/competitions/jigsaw-toxic-severity-rating/discussion/306084)
* [5th place solution](https://www.kaggle.com/competitions/jigsaw-toxic-severity-rating/discussion/306390)
* [7th Place Solution](https://www.kaggle.com/competitions/jigsaw-toxic-severity-rating/discussion/306366)

See also: [Every single Jigsaw competition solution write up in history](https://www.kaggle.com/competitions/jigsaw-toxic-severity-rating/discussion/286333)



# Competition Metric

Solutions are evaluated using multi-class log loss (also known as negative log-likelihood).

$$\text{log loss} = -\frac{1}{\color{magenta}{N}} \sum\limits_{i=1}^{\color{magenta}{N}} \sum\limits_{j=1}^{\color{purple}{M}} \color{olive}{y_{ij}} \color{orange}{\log}(\color{teal}{p_{ij}})$$

Where:
* $\color{magenta}{N}$ = number of rows.
* $\color{purple}{M}$ = number of class labels.
* $\color{olive}{y_{ij}}$ = 1 if $i$ is in class $j$, else 0.
* $\color{orange}{\log}$ = natural logarithm.
* $\color{teal}{p_{ij}}$ is the predicted probability that $i$ belongs to class $j$.


You can think of log loss as the negative log of the probability you predicted for the correct class, averaged over all rows.

On the interval (0, 1], `-log(x)` falls from ∞ as `x` approaches 0 down to `-log(1) = 0`.

This means confidently wrong answers are heavily penalised.
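
To make the penalty concrete, here is a small worked sketch. The three probability rows are made-up values over (`Adequate`, `Effective`, `Ineffective`), not model output, and `multiclass_log_loss` is a helper name I'm introducing for illustration only.

```python
import math

def multiclass_log_loss(y_true, probs):
    """Mean negative log of the probability assigned to the correct class."""
    return -sum(math.log(p[label]) for label, p in zip(y_true, probs)) / len(y_true)

# Hypothetical predictions; the true class is index 0 in every case.
confident_right = [0.90, 0.05, 0.05]
unsure = [0.34, 0.33, 0.33]
confident_wrong = [0.01, 0.98, 0.01]

for name, p in [
    ("confident and right", confident_right),
    ("unsure", unsure),
    ("confident and wrong", confident_wrong),
]:
    print(f"{name:>20}: {multiclass_log_loss([0], [p]):.3f}")
```

The losses come out at roughly 0.105, 1.079 and 4.605: a single confidently wrong row costs about as much as forty confidently right ones.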

Here I plot the range of the `-log(x)` function.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0.001, 1.0, 0.001)
y = -np.log(x)

fig,ax = plt.subplots(figsize=(6,4))
ax.plot(x,y)
plt.ylabel('-log(x)')
plt.xlabel('x')
plt.title('Range of negative log-likelihood function')
plt.show()

# Imports

On top of the standard data science libraries (Pandas, NumPy and Matplotlib), I'm importing these additional libraries for training neural networks:

* [HuggingFace Transformers](https://github.com/huggingface/transformers) library for pre-trained models and training framework.
* [PyTorch](https://pytorch.org/) for GPU computation.
* [Weights and Biases](https://wandb.ai) for experiment tracking.

In [None]:
import os

run_type = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

import logging
from types import SimpleNamespace
from pathlib import Path
from datetime import datetime
import math

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
from transformers import TrainingArguments, Trainer
from tqdm import tqdm
from scipy.special import softmax
from IPython.display import display, HTML

from transformers import DataCollatorWithPadding
from datasets import Dataset, load_metric

import wandb

# From this Gist: https://gist.github.com/ihoromi4/b681a9088f348942b01711f251e5f964
def seed_everything(seed: int):
    import random, os
    import numpy as np
    import torch
    
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # benchmark mode may pick different algorithms between runs

# Load Data

The competition data includes a training CSV file which includes metadata for each essay element, and a folder containing the full essay texts.

Let's start by loading these and looking at an example from the train and test set.

In [None]:
train_df = pd.read_csv('../input/feedback-prize-effectiveness/train.csv')
test_df = pd.read_csv('../input/feedback-prize-effectiveness/test.csv')

In [None]:
train_df.head(1)

In [None]:
test_df.head(1)

In [None]:
len(train_df), len(test_df)

The train set contains **36,756** essay elements.

The test set CSV only contains **10** elements, as the notebook needs to be submitted to run against the entire test set.

# Essay Texts

Let's see the first 200 characters of a few essay examples.

In [None]:
essays = train_df.essay_id.unique()

texts = []
for essay_id in essays[:10]:
    texts.append(open(f'../input/feedback-prize-effectiveness/train/{essay_id}.txt').read())

inner_html = ""
for text in texts:
    inner_html += f'<td style="vertical-align:top; border-right: 1px solid #7accd8">{text[:200]}</td>'
display(HTML(f"""
<table style="font-family: monospace;">
    <tr>
         {inner_html}
    </tr>
</table>
"""))

Let's count the number of unique essays in the folder.

In [None]:
from pathlib import Path

essay_ids_in_folder = set()
for path in Path('../input/feedback-prize-effectiveness/train').iterdir():
    essay_ids_in_folder.add(path.name[:-4])
len(essay_ids_in_folder)

There are **4,191** unique essays. Not a huge dataset at all!

How does that compare with the number of essays in the CSV?

In [None]:
train_df.essay_id.nunique()

In [None]:
essay_ids_in_folder - set(train_df.essay_id.unique()), set(train_df.essay_id.unique()) - essay_ids_in_folder

So every essay in the CSV is represented in the folder. At least for the train set.

# Topics/Prompts

[@jdoesv](https://www.kaggle.com/jdoesv) put together a really useful [notebook](https://www.kaggle.com/code/jdoesv/topics-identification), which runs [BERTopic](https://maartengr.github.io/BERTopic/index.html) across each of the training examples. This uncovers the essay prompts used for each of the training examples.

jdoesv determines that there are [15 essay prompts used in the dataset](https://www.kaggle.com/competitions/feedback-prize-effectiveness/discussion/327514). The topic information is useful for data analysis, and will potentially be useful in the final model, so I'm joining it with the competition dataset.

In [None]:
topic_pred_df = pd.read_csv('../input/feedback-topics-identification-with-bertopic/topic_model_feedback.csv')
topic_pred_df = topic_pred_df.drop(columns=['prob'])
topic_pred_df = topic_pred_df.rename(columns={'id': 'essay_id'})

topic_meta_df = pd.read_csv('../input/feedback-topics-identification-with-bertopic/topic_model_metadata.csv')
topic_meta_df = topic_meta_df.rename(columns={'Topic': 'topic', 'Name': 'topic_name'}).drop(columns=['Count'])
topic_meta_df.topic_name = topic_meta_df.topic_name.apply(lambda n: ' '.join(n.split('_')[1:]))

topic_pred_df = topic_pred_df.merge(topic_meta_df, on='topic', how='left')

train_df = train_df.merge(topic_pred_df, on='essay_id', how='left')

In [None]:
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot()
 
sns.countplot(y="topic_name", data=train_df, linewidth=1.25, alpha=1, ax=ax, zorder=2)
ax.set_title("Topic distribution")

for tick in ax.get_xticklabels():
    tick.set_rotation(90)
fig.show()

Next, let's take a look at the competition metadata and start speculating on how it could be useful.

## Discourse Type

Each essay element contains discourse type metadata. There are 7 `discourse_type` values, with explanations taken from the [data page](https://www.kaggle.com/competitions/feedback-prize-effectiveness/data).

* `Lead` - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis
* `Position` - an opinion or conclusion on the main question
* `Claim` - a claim that supports the position
* `Counterclaim` - a claim that refutes another claim or gives an opposing reason to the position
* `Rebuttal` - a claim that refutes a counterclaim
* `Evidence` - ideas or examples that support claims, counterclaims, or rebuttals.
* `Concluding Statement` - a concluding statement that restates the claims.

Let's look at the distribution across the whole dataset.

In [None]:
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot()
sns.countplot(x="discourse_type", data=train_df, linewidth=1.25, alpha=1, ax=ax, zorder=2)
ax.set_title("Discourse type distribution")
fig.show()

In the next section, we'll explore the Discourse Effectiveness field. For now, it's sufficient to know there are three possible values: `Adequate`, `Effective`, and `Ineffective`. Let's see the distribution of Discourse Type across Discourse Effectiveness.

In [None]:
labels = ['Adequate', 'Effective', 'Ineffective']

In [None]:
discourse_types = train_df.discourse_type.unique()

fig, axes = plt.subplots(1, len(discourse_types), sharex='col', sharey='row', figsize=(25, 3))
for i, discourse_type in enumerate(discourse_types):
    ax = axes[i]
    filtered_df = train_df[train_df.discourse_type == discourse_type]
    sns.countplot(x="discourse_effectiveness", data=filtered_df, linewidth=1.25, alpha=1, ax=ax, zorder=2, order=labels)
    ax.set_title(discourse_type)
    ax.set(xlabel=None, ylabel=None)
    
fig.suptitle('Discourse Effectiveness distribution per Discourse Type', y=1.08)
plt.show()

It seems that a section is most likely to be marked `Ineffective` when its Discourse Type is `Evidence`.

That makes sense, as the quality of evidence seems easier to judge objectively than that of the other discourse types.

## Discourse Effectiveness (label)

Each essay section is labelled with one of three labels: `Adequate`, `Effective`, and `Ineffective`.

In [None]:
labels

Let's explore the distribution.

In [None]:
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot()
 
sns.countplot(x="discourse_effectiveness", data=train_df, linewidth=1.25, alpha=1, ax=ax, zorder=2)
ax.set_title("Discourse effectiveness distribution")
fig.show()

As we can see, this is quite an imbalanced dataset. We may want to use some kind of class weighting, or perhaps up- or downsampling, within the solution.
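
As a rough sketch of what class weighting could look like, the snippet below computes inverse-frequency weights with the common `n_samples / (n_classes * class_count)` heuristic. The label list is a toy example rather than the real competition counts, and `balanced_class_weights` is a hypothetical helper, not something used elsewhere in this notebook.

```python
from collections import Counter

def balanced_class_weights(label_seq):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(label_seq)
    n_samples, n_classes = len(label_seq), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy label list, not the real class counts.
toy_labels = ["Adequate"] * 6 + ["Effective"] * 3 + ["Ineffective"] * 1
class_weights = balanced_class_weights(toy_labels)
print(class_weights)
```

Rare classes get weights above 1 and common ones below 1. Since the competition metric is unweighted log loss, a weighted loss may shift the predicted probabilities away from the true class frequencies, so it would need to be checked against CV.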

In [None]:
from IPython.display import display, HTML

def show_examples_for_discourse_type(discourse_type, topic):
    filt = train_df.query(f'discourse_type == "{discourse_type}"').query(f'topic == {topic}').sample(frac=1, random_state=420)
    display(HTML(
        f"""
        <h4><code>{discourse_type}</code> examples</h4>
        <table>
            <tr>
              <th width=33%>Ineffective</th>
              <th width=33%>Adequate</th>
              <th width=33%>Effective</th>
            </tr>
            <tr>
              <td>{filt.query("discourse_effectiveness == 'Ineffective'").iloc[0].discourse_text}</td>
              <td>{filt.query("discourse_effectiveness == 'Adequate'").iloc[0].discourse_text}</td>
              <td>{filt.query("discourse_effectiveness == 'Effective'").iloc[0].discourse_text}</td>
            </tr>
        </table>
        """
    ))

## Examples

Let's see an example of each effectiveness label for every discourse type, taken from the topic `face mars landform aliens`:

In [None]:
for dt in discourse_types: show_examples_for_discourse_type(dt, 10)

# Word Count

Let's look at the word count distribution across the dataset. Word count is a rough proxy for token count, which will inform settings for our model, like max sequence length, and the types of model architectures we can use. Only some are suitable for very long sequences.

In [None]:
fig = plt.figure(figsize=(15, 5))
train_df['word_count'] = train_df.discourse_text.apply(lambda x: len(x.split()))
sns.histplot(data=train_df, x="word_count")
plt.show()

The mean word count is 44.65 words:

In [None]:
train_df['word_count'].mean()

The max word count is 836.

In [None]:
train_df['word_count'].max()

Let's see the first 1000 characters:

In [None]:
train_df.loc[train_df['word_count'].idxmax()].discourse_text[:1000]  # idxmax returns an index label, so use .loc

Let's see the word count per Discourse Type

In [None]:
discourse_types = train_df.discourse_type.unique()

fig, axes = plt.subplots(1, len(discourse_types), sharex='col', sharey='row', figsize=(25, 5))
for i, discourse_type in enumerate(discourse_types):
    filtered_df = train_df[train_df.discourse_type == discourse_type]
    sns.histplot(data=filtered_df, x="word_count", ax=axes[i])
    axes[i].set_title(discourse_type)
    
fig.suptitle('Word count distribution per discourse_type', y=1.08)
plt.show()

So `Claim` and `Evidence` appear to have the largest word count.

# 2021 Data

In [this](https://www.kaggle.com/code/lextoumbourou/feedback-prize-inference-on-2021-dataset) notebook, I made predictions on the full 2021 set from the original Feedback competition.

Let's load them here. I'll exclude any that are in the 2022 subset.

In [None]:
train_2021_preds_df = pd.read_csv('../input/feedback-prize-inference-on-2021-dataset/train_2021_preds.csv')
train_2021_preds_df = train_2021_preds_df[train_2021_preds_df.in_2022 == False]

In [None]:
train_2021_preds_df.head()

In [None]:
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot()
 
sns.countplot(x="discourse_effectiveness", data=train_2021_preds_df, linewidth=1.25, alpha=1, ax=ax, zorder=2)
ax.set_title("Discourse Effectiveness distribution")
fig.show()

I'm going to get essays that contain the most confident predictions.

That should maintain the same distribution of discourse types.

In [None]:
num_essays = 2

In [None]:
train_2021_preds_df['label_prob'] = train_2021_preds_df[labels].max(axis=1)

In [None]:
# train_2021_preds_df = train_2021_preds_df.merge(topic_pred_df, on='essay_id', how='left')

In [None]:
confident_essays = train_2021_preds_df[['essay_id', 'label_prob']].groupby('essay_id').mean().sort_values('label_prob', ascending=False)[:num_essays]

In [None]:
essay_ids = set(confident_essays.index)

In [None]:
train_2021_filt_df = train_2021_preds_df[train_2021_preds_df.essay_id.isin(essay_ids)].reset_index(drop=True)

In [None]:
train_2021_filt_df.shape
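
The selection above can be sketched on toy data without pandas: score each essay by the mean of its elements' top-class probabilities and keep the `k` highest. The rows and the helper name `most_confident_essays` are made up for illustration.

```python
from collections import defaultdict

def most_confident_essays(rows, k):
    """rows: iterable of (essay_id, class_probabilities).

    Returns the k essay ids with the highest mean top-class probability.
    """
    per_essay = defaultdict(list)
    for essay_id, probs in rows:
        per_essay[essay_id].append(max(probs))
    mean_conf = {e: sum(v) / len(v) for e, v in per_essay.items()}
    return sorted(mean_conf, key=mean_conf.get, reverse=True)[:k]

# Toy predictions: essay "A" averages 0.85, "B" 0.40 and "C" 0.675.
toy_rows = [
    ("A", [0.90, 0.05, 0.05]),
    ("A", [0.80, 0.10, 0.10]),
    ("B", [0.40, 0.35, 0.25]),
    ("C", [0.10, 0.85, 0.05]),
    ("C", [0.50, 0.30, 0.20]),
]
selected = most_confident_essays(toy_rows, k=2)
print(selected)
```

Because whole essays are selected, every element of a chosen essay is kept, including its less confident ones.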

Okay, let's get to training a model!

# Config

The config exposes the backbone's dropout probability and a label smoothing factor, as the model tends to overfit. Label smoothing is currently switched off (`label_smoothing_factor = 0.`), following the revert noted in the changelog.

In [None]:
config = SimpleNamespace()

config.seed = 420
config.model_name = 'microsoft/deberta-v3-base'
config.output_path = Path('./')
config.input_path = Path('../input/feedback-prize-effectiveness')

config.n_folds = 4
config.lr = 1e-5
config.weight_decay = 0.01
config.epochs = 4
config.batch_size = 16
config.gradient_accumulation_steps = 1
config.warm_up_ratio = 0.1
config.max_len = 384
config.hidden_dropout_prob = 0.1
config.label_smoothing_factor = 0.
config.eval_per_epoch = 2

logging.disable(logging.WARNING)

seed_everything(config.seed)
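
For reference, here is a minimal sketch of one common formulation of label smoothing: the one-hot target is replaced by `(1 - eps) * one_hot + eps / K`. I'm assuming this matches what `label_smoothing_factor` does, which I haven't verified line by line; the function name and the probabilities are illustrative only.

```python
import math

def smoothed_cross_entropy(probs, true_idx, eps):
    """Cross-entropy against a target of (1 - eps) * one_hot + eps / K."""
    k = len(probs)
    target = [eps / k + (1.0 - eps) * (i == true_idx) for i in range(k)]
    return -sum(t * math.log(p) for t, p in zip(target, probs))

# A confident, correct prediction (hypothetical values).
toy_probs = [0.98, 0.01, 0.01]
plain = smoothed_cross_entropy(toy_probs, true_idx=0, eps=0.0)
smoothed = smoothed_cross_entropy(toy_probs, true_idx=0, eps=0.1)
print(plain, smoothed)
```

With `eps = 0`, as in the current config, this reduces to the plain negative log of the true-class probability. With `eps > 0`, very confident predictions are penalised, which discourages over-confident logits.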

# Weights & Biases

In [None]:
if run_type == 'Interactive':
    print('Wandb in offline mode.')
    os.environ['WANDB_MODE'] = 'offline'

The following lines of code assume that you have a [User Secret](https://www.kaggle.com/product-feedback/114053) set up called `wandb` containing your wandb API key.

In [None]:
print('Authenticating with wandb.')
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
wandb_creds = user_secrets.get_secret("wandb")

!wandb login {wandb_creds}

In [None]:
wandb.config = config.__dict__

In [None]:
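# Pass the config to init so the hyperparameters are recorded with the run (Path values are stringified).
wandb.init(project="feedback-prize-effectiveness", config={k: str(v) if isinstance(v, Path) else v for k, v in vars(config).items()})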

# Setup CV

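Note that I am using `StratifiedKFold` instead of `StratifiedGroupKFold` here, as it performs better on the LB. `StratifiedKFold` ignores the `groups` argument, so elements from the same essay can land in both the training and validation folds. This comes at the expense of an accurate CV score, which is likely to be optimistic.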

In [None]:
cv = StratifiedKFold(n_splits=config.n_folds, shuffle=True, random_state=config.seed)

In [None]:
train_df['fold'] = -1
for fold_num, (train_idxs, test_idxs) in enumerate(cv.split(train_df.index, train_df.discourse_effectiveness, train_df.essay_id)):
    train_df.loc[test_idxs, ['fold']] = fold_num

In [None]:
train_df.head()

In [None]:
train_df.to_csv(config.output_path / 'train_folds.csv', index=False)
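
To see how much a non-grouped split lets essays straddle folds, a small check like the one below can help. It is a sketch on toy data, and `essays_spanning_folds` is a hypothetical helper rather than part of the pipeline.

```python
from collections import defaultdict

def essays_spanning_folds(essay_ids, folds):
    """Return the essay ids whose elements were assigned to more than one fold."""
    folds_per_essay = defaultdict(set)
    for essay_id, fold in zip(essay_ids, folds):
        folds_per_essay[essay_id].add(fold)
    return sorted(e for e, f in folds_per_essay.items() if len(f) > 1)

# Toy assignment: essay "A" has elements in folds 0 and 1.
toy_essay_ids = ["A", "A", "B", "B", "C"]
toy_folds = [0, 1, 2, 2, 3]
leaky = essays_spanning_folds(toy_essay_ids, toy_folds)
print(leaky)
```

On the real data this would be called as `essays_spanning_folds(train_df.essay_id, train_df.fold)`. A grouped splitter should return an empty list here.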

# Tokenizer

In [None]:
config.model_name

In [None]:
tokenizer = AutoTokenizer.from_pretrained(config.model_name, use_fast=True)
tokenizer.model_max_length = config.max_len

In [None]:
tokenizer

In [None]:
def get_essay(essay_fns):
    essay_cache = {}

    output = []
    for essay_fn in essay_fns:
        if essay_fn not in essay_cache:
            essay_txt = open(essay_fn).read().strip().lower()
            essay_cache[essay_fn] = essay_txt
        output.append(essay_cache[essay_fn])

    return output

The essay string is passed as the `text_pair` argument to the tokenisation function. I got this idea from [this](https://www.kaggle.com/code/abhishek/tez-for-feedback-v2-0) kernel. I can't tell you exactly why it helps to pass the essay as `text_pair` instead of concatenating it onto the sequence, but it seems to help a little.

In [None]:
def tokenizer_func(x):
    return tokenizer(x["inputs"], get_essay(x['essay_fn']), truncation=True, max_length=config.max_len)
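
As a rough picture of what a pair encoding looks like, here is a toy sketch. It splits on whitespace instead of using the real subword tokenizer, assumes a BERT-style `[CLS] first [SEP] second [SEP]` layout with segment ids 0 and 1, and mimics longest-first truncation. `encode_pair` and the example strings are made up for illustration.

```python
def encode_pair(first, second, max_length, cls="[CLS]", sep="[SEP]"):
    """Toy whitespace 'tokenizer' that lays a pair out as [CLS] first [SEP] second [SEP]."""
    a, b = first.split(), second.split()
    budget = max(max_length - 3, 0)  # room left after the three special tokens
    # Longest-first truncation: drop one token at a time from whichever side is longer.
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    tokens = [cls] + a + [sep] + b + [sep]
    token_type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, token_type_ids

element = "claim [SEP] driverless cars are safer"
essay = "driverless cars are coming and they will change how we travel every day"
tokens, token_type_ids = encode_pair(element, essay, max_length=12)
print(tokens)
print(token_type_ids)
```

The point of the sketch is that the element and the essay end up in separate segments, and that under this truncation strategy the much longer essay gives up most of the tokens when the pair exceeds the length budget.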

Since the `discourse_type` will be potentially valuable information, I'll concatenate it to the essay elements.

I'm also concatenating the topic information.

Lastly, I'm converting the text to lowercase, as it performs better on CV and LB.

In [None]:
def add_inputs(df, basepath):
    df['essay_fn'] = basepath + '/' + df.essay_id + '.txt'
    df['inputs'] = df.discourse_type.str.lower() + ' ' + tokenizer.sep_token + ' ' + df.topic_name + ' ' + tokenizer.sep_token + ' ' + df.discourse_text.str.lower()
    return df

In [None]:
train_df = add_inputs(train_df, str(config.input_path / 'train'))
train_2021_filt_df = add_inputs(train_2021_filt_df, '../input/feedback-prize-2021/train')

# Model

For maximum experimentation flexibility, I've set up a custom head.

I've included an implementation of [Multi-Sample Dropout](https://arxiv.org/abs/1905.09788) as the model is overfitting quite quickly.

When using HuggingFace Transformers, the [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class has a few rules you need to follow when creating custom models:

* your model always return tuples or subclasses of ModelOutput.
* your model can compute the loss if a labels argument is provided and that loss is returned as the first element of the tuple (if your model returns tuples)
* your model can accept multiple label arguments (use the label_names in your TrainingArguments to indicate their name to the Trainer) but none of them should be named "label".

I've also replaced the `ContextPooler` with a mean pooling layer, as it works better in the tests I've run outside of Kaggle.
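
Masked mean pooling can be checked by hand on a tiny example. The sketch below mirrors the idea in NumPy rather than PyTorch so it runs without a GPU stack; `masked_mean` and the toy arrays are illustrative only.

```python
import numpy as np

def masked_mean(last_hidden_state, attention_mask):
    """Average token embeddings over the sequence axis, ignoring padded positions."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# One sequence of 4 positions with hidden size 2; the last two positions are padding.
toy_hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0, 0]])
toy_pooled = masked_mean(toy_hidden, toy_mask)
print(toy_pooled)
```

Only the two real tokens contribute, so the pooled vector is `[2.0, 3.0]` and the large padding values have no effect.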

In [None]:
import torch
from torch import nn
from transformers import AutoConfig, AutoModelForSequenceClassification
from transformers.models.deberta_v2.modeling_deberta_v2 import ContextPooler
from transformers.models.deberta_v2.modeling_deberta_v2 import StableDropout
from transformers.modeling_outputs import TokenClassifierOutput

def get_dropouts(num, start_prob, increment):
    return [StableDropout(start_prob + (increment * i)) for i in range(num)]  

class MeanPooling(nn.Module):
    def __init__(self):
        super(MeanPooling, self).__init__()
        
    def forward(self, last_hidden_state, attention_mask):
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
        sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
        sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=1e-9)
        mean_embeddings = sum_embeddings / sum_mask
        return mean_embeddings

class CustomModel(nn.Module):
    def __init__(self, backbone):
        super(CustomModel, self).__init__()
        
        self.model = backbone
        self.config = self.model.config
        self.num_labels = self.config.num_labels

        # self.pooler = ContextPooler(self.config)
        self.pooler = MeanPooling()
        
        self.classifier = nn.Linear(self.config.hidden_size, self.num_labels)
    
        # Register the dropouts as submodules so model.train() / model.eval() reach them.
        self.dropouts = nn.ModuleList(get_dropouts(num=5, start_prob=config.hidden_dropout_prob - 0.02, increment=0.01))
    
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None
    ):
        outputs = self.model.deberta(
            input_ids,
            token_type_ids=token_type_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        
        encoder_layer = outputs[0]
        pooled_output = self.pooler(encoder_layer, attention_mask)
                      
        # Multi-sample dropout.
        num_dps = float(len(self.dropouts))
        for ii, drop in enumerate(self.dropouts):
            if ii == 0:
                logits = (self.classifier(drop(pooled_output)) / num_dps)
            else:
                logits += (self.classifier(drop(pooled_output)) / num_dps)

        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            logits = logits.view(-1, self.num_labels)
            loss = loss_fn(logits, labels.view(-1))

        output = (logits,) + outputs[1:]

        return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions)
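
The multi-sample dropout step in `forward` can be sketched in NumPy as follows. Plain inverted dropout stands in for `StableDropout`, the weights are random, and `multi_sample_logits` is a hypothetical name. As in the class above, the logits of one shared classifier are averaged over several dropout masks.

```python
import numpy as np

def multi_sample_logits(pooled, weight, bias, drop_probs, rng, training=True):
    """Average the logits of one shared linear layer over several dropout masks."""
    logits = np.zeros((pooled.shape[0], weight.shape[1]))
    for p in drop_probs:
        if training and p > 0:
            keep = rng.random(pooled.shape) >= p
            dropped = pooled * keep / (1.0 - p)  # inverted dropout keeps the expected value
        else:
            dropped = pooled
        logits += (dropped @ weight + bias) / len(drop_probs)
    return logits

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(2, 8))   # (batch, hidden)
toy_weight = rng.normal(size=(8, 3))     # shared classifier weights
toy_bias = np.zeros(3)
drop_probs = [0.08, 0.09, 0.10, 0.11, 0.12]  # same schedule as get_dropouts(5, 0.1 - 0.02, 0.01)

train_logits = multi_sample_logits(toy_features, toy_weight, toy_bias, drop_probs, rng)
eval_logits = multi_sample_logits(toy_features, toy_weight, toy_bias, drop_probs, rng, training=False)
print(train_logits.shape, eval_logits.shape)
```

With dropout disabled, the average collapses to a single pass through the classifier, which is why the dropout modules need to respect the model's train/eval mode at inference time.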

In [None]:
def get_backbone_config():
    model_config = AutoConfig.from_pretrained(config.model_name, num_labels=3)
    model_config.hidden_dropout_prob = config.hidden_dropout_prob
    return model_config

In [None]:
def get_model():
    model_config = get_backbone_config()

    model = AutoModelForSequenceClassification.from_pretrained(
        config.model_name,
        config=model_config,
    )
    return CustomModel(model)

The backbone config needs to be saved so the backbone can be rebuilt offline in the inference notebook.

In [None]:
backbone_config = get_backbone_config()
backbone_config.save_pretrained('./backbone_config')

In [None]:
model = get_model()

# Training

The cross-entropy loss, which applies softmax to the model's logits internally, is identical to the competition metric computed on the softmax probabilities (with label smoothing off).
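
This equivalence is easy to verify numerically. The sketch below computes cross-entropy directly from logits via log-sum-exp and compares it with log loss on the softmax probabilities. The logits and targets are toy values and the helper names are my own.

```python
import numpy as np

def softmax_rows(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy_from_logits(logits, targets):
    """Mean of logsumexp(logits) - logit[target]."""
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(lse - logits[np.arange(len(targets)), targets]))

def log_loss_from_probs(probs, targets):
    """Mean negative log of the probability assigned to the true class."""
    return float(-np.mean(np.log(probs[np.arange(len(targets)), targets])))

toy_logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
toy_targets = np.array([0, 2])

ce = cross_entropy_from_logits(toy_logits, toy_targets)
ll = log_loss_from_probs(softmax_rows(toy_logits), toy_targets)
print(ce, ll)
```

The two values agree up to floating-point error, so minimising the training loss directly targets the leaderboard metric.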

I'll include accuracy as an additional metric as it tends to be human interpretable.

In [None]:
metric = load_metric('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
# train_df = train_df.sample(n=150)
# config.epochs = 1

In [None]:
def do_fold(fold_num):
    train_data  = train_df.query(f'fold != {fold_num}').reset_index(drop=True)

    val_data  = train_df.query(f'fold == {fold_num}').reset_index(drop=True)
    
    # Add 2021 to train data.
    train_data = pd.concat([train_data, train_2021_filt_df[['inputs', 'essay_fn', 'discourse_effectiveness']]]).sample(frac=1., random_state=config.seed).reset_index(drop=True)
    print(f'Train data size: {train_data.shape}')

    train_dataset = Dataset.from_pandas(train_data[['inputs', 'essay_fn', 'discourse_effectiveness']]).rename_column('discourse_effectiveness', 'label').class_encode_column("label")
    val_dataset = Dataset.from_pandas(val_data[['inputs', 'essay_fn', 'discourse_effectiveness']]).rename_column('discourse_effectiveness', 'label').class_encode_column("label")

    train_tok_dataset = train_dataset.map(tokenizer_func, batched=True, remove_columns=('inputs', 'essay_fn'))
    val_tok_dataset = val_dataset.map(tokenizer_func, batched=True, remove_columns=('inputs', 'essay_fn'))

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='longest')

    num_steps = math.ceil(len(train_data) / config.batch_size / config.gradient_accumulation_steps)
    eval_steps = max(1, num_steps // config.eval_per_epoch)  # integer step count for TrainingArguments
    print(f'Num steps: {num_steps}, eval steps: {eval_steps}')

    args = TrainingArguments(
        output_dir=config.output_path,
        learning_rate=config.lr,
        warmup_ratio=config.warm_up_ratio,
        lr_scheduler_type='cosine',
        fp16=True,
        per_device_train_batch_size=config.batch_size,
        per_device_eval_batch_size=config.batch_size * 2,
        num_train_epochs=config.epochs,
        weight_decay=config.weight_decay,
        report_to="wandb",

        evaluation_strategy='steps',
        eval_steps=eval_steps, 
        save_strategy='steps',
        save_steps=eval_steps,
        
        load_best_model_at_end=True,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        label_smoothing_factor=config.label_smoothing_factor,
        save_total_limit=3  # Prevents running out of disk space.
    )

    model = get_model()

    trainer = Trainer(
        model,
        args,
        train_dataset=train_tok_dataset,
        eval_dataset=val_tok_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    trainer.train()
    
    trainer.save_model(config.output_path / f'fold_{fold_num}')
    
    outputs = trainer.predict(val_tok_dataset)

    val_data[labels] = softmax(outputs.predictions, axis=1)
    
    !rm -rf {config.output_path / 'checkpoint'}*
    
    return val_data
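
The evaluation interval used in `do_fold` can be sketched on its own. This is a rough approximation that assumes a single device and ignores the last partial batch; `eval_interval` and the numbers passed to it are hypothetical and not part of the notebook's config.

```python
def eval_interval(n_examples, batch_size, grad_accum_steps, evals_per_epoch):
    """Approximate optimizer steps per epoch and the step gap between evaluations."""
    # One optimizer step consumes batch_size * grad_accum_steps examples.
    steps_per_epoch = n_examples // (batch_size * grad_accum_steps)
    # Never return 0, otherwise evaluation would never be triggered.
    eval_steps = max(1, steps_per_epoch // evals_per_epoch)
    return steps_per_epoch, eval_steps


# Hypothetical sizes, for illustration only.
steps_per_epoch, eval_steps = eval_interval(
    n_examples=40_000, batch_size=8, grad_accum_steps=2, evals_per_epoch=2
)
print(steps_per_epoch, eval_steps)
```

Because `save_steps` is set to the same value as `eval_steps`, every evaluation point also has a checkpoint, which is what `load_best_model_at_end=True` relies on.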

In [None]:
val_preds_df = pd.DataFrame()

val_data = do_fold(0)

val_preds_df = pd.concat([val_preds_df, val_data])

## All Folds

In [None]:
for fold in range(1, config.n_folds):
    val_data = do_fold(fold)
    val_preds_df = pd.concat([val_preds_df, val_data])

In [None]:
val_preds_df.drop(columns=['inputs']).to_csv(config.output_path / 'val_preds.csv', index=False)

In [None]:
val_preds_df = pd.read_csv(config.output_path / 'val_preds.csv')

In [None]:
val_preds_df.head(1)

# CV Score

In [None]:
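# log_loss orders classes alphabetically, so pass the probability columns in that order.
sorted_labels = sorted(labels)
cv = log_loss(val_preds_df['discourse_effectiveness'], val_preds_df[sorted_labels], labels=sorted_labels)
cv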

In [None]:
wandb.log({"cv": cv})

# Model Interpretation

I'm going to explore some of the model's predictions. I hope to learn more about the dataset and its limitations by doing this.

First, I'll add a column with the negative log-likelihood of the true class (`-log p`) for each example.

In [None]:
def compute_loss(row):
    return -math.log(row[row.discourse_effectiveness])

val_preds_df['loss'] = val_preds_df.apply(compute_loss, axis=1)
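
Since multi-class log loss is the average of `-log p(true class)`, the mean of this column should match the CV score. A minimal sketch of that relationship in plain Python, using made-up probabilities (the rows below are hypothetical and not taken from `val_preds_df`):

```python
import math

# Hypothetical predictions: true label plus one probability per class.
rows = [
    {"label": "Adequate", "Ineffective": 0.1, "Adequate": 0.7, "Effective": 0.2},
    {"label": "Effective", "Ineffective": 0.2, "Adequate": 0.5, "Effective": 0.3},
    {"label": "Ineffective", "Ineffective": 0.6, "Adequate": 0.3, "Effective": 0.1},
]


def example_loss(row, eps=1e-15):
    """Negative log of the probability assigned to the true class."""
    p_true = min(max(row[row["label"]], eps), 1 - eps)  # clip to avoid log(0)
    return -math.log(p_true)


losses = [example_loss(r) for r in rows]
mean_loss = sum(losses) / len(losses)  # this average is the log loss
print([round(v, 3) for v in losses], round(mean_loss, 3))
```

The second row is predicted incorrectly and contributes the largest loss, which is why sorting by this column is a convenient way to surface the model's worst mistakes.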

I'll also add a boolean column describing whether the prediction was correct.

In [None]:
val_preds_df['predicted'] = val_preds_df[labels].idxmax(axis=1)
val_preds_df['is_correct'] = val_preds_df.discourse_effectiveness == val_preds_df.predicted

## Average error per class

In [None]:
loss_per_class = val_preds_df.groupby('discourse_effectiveness')[['loss']].mean()
loss_per_class

In [None]:
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot()

sns.barplot(x=loss_per_class.index, y=loss_per_class.loss, ax=ax)
plt.show()

Clearly, the `Ineffective` class is the most difficult to classify.

## Confusion Matrix

A confusion matrix is a useful way of viewing which classes the model has the most difficulty with.

A common way to create a nice-looking confusion matrix is to pass a scikit-learn confusion matrix into a Seaborn heatmap.

In [None]:
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
import seaborn as sns

# Display order for the matrix axes. A separate name avoids overwriting the
# global `labels` list that names the prediction columns elsewhere.
cm_labels = ['Ineffective', 'Adequate', 'Effective']

def do_conf_matrix(y_true, y_pred, ax, title='Confusion Matrix'):
    cm = confusion_matrix(y_true, y_pred, labels=cm_labels)

    sns.heatmap(cm, annot=True, fmt='g', ax=ax, cmap='Blues')

    # Labels, title and ticks.
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title(title)

    ax.xaxis.set_ticklabels(cm_labels)
    ax.yaxis.set_ticklabels(cm_labels)


y_true = val_preds_df.discourse_effectiveness.values
y_pred = val_preds_df[labels].idxmax(axis=1).values
ax = plt.subplot()
do_conf_matrix(y_true, y_pred, ax=ax)

We can see again that the model is having a lot of difficulty with examples labeled `Ineffective`.
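
Raw counts can hide how a rare class is doing, because the larger classes dominate the colour scale. One option is to normalise each row by the number of true examples in that class. The sketch below does this by hand in plain Python; the labels are made up for illustration and `row_normalised_confusion` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

classes = ["Ineffective", "Adequate", "Effective"]

# Hypothetical true and predicted labels, for illustration only.
y_true_toy = ["Ineffective", "Ineffective", "Adequate", "Adequate", "Adequate", "Effective"]
y_pred_toy = ["Adequate", "Ineffective", "Adequate", "Adequate", "Effective", "Effective"]


def row_normalised_confusion(y_true, y_pred, classes):
    """Rows are true classes, columns are predictions; each non-empty row sums to 1."""
    counts = Counter(zip(y_true, y_pred))
    matrix = []
    for t in classes:
        row_total = sum(counts[(t, p)] for p in classes)
        matrix.append([counts[(t, p)] / row_total if row_total else 0.0 for p in classes])
    return matrix


cm_norm = row_normalised_confusion(y_true_toy, y_pred_toy, classes)
for name, row in zip(classes, cm_norm):
    print(f"{name:>12}: {[round(v, 2) for v in row]}")
```

The diagonal of the normalised matrix is the per-class recall. scikit-learn's `confusion_matrix` accepts `normalize='true'` to produce the same thing directly.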

## Confusion Matrix Per Discourse Type

In [None]:
fig, axes = plt.subplots(1, len(discourse_types), figsize=(30, 3))
for i, discourse_type in enumerate(discourse_types):
    filtered_df = val_preds_df[val_preds_df.discourse_type == discourse_type]
    y_true = filtered_df.discourse_effectiveness.values
    y_pred = filtered_df[labels].idxmax(axis=1).values
    ax = axes[i]
    do_conf_matrix(y_true, y_pred, ax=ax)
    axes[i].set_title(discourse_type)
    
fig.suptitle('Confusion matrix per discourse_type', y=1.08)
plt.show()

## Confident wrong examples

Let's see 5 examples where the model was very confident but wrong.

In [None]:
inner_html = ""
# Among wrong predictions, the highest loss means the least probability on the true class.
for idx, row in val_preds_df[~val_preds_df.is_correct].sort_values('loss', ascending=False).head(5).iterrows():
    inner_html += f'''
    <td width="20%" style="vertical-align:top; border-right: 1px solid #7accd8"><p><b>Actual</b>: {row.discourse_effectiveness} <br><b>Predicted</b>: {row.predicted} ({row[row.predicted]:.2f})</p><p><i>{row.discourse_text}</i></p></td>
    '''
    
display(HTML(f"""
<table style="font-family: monospace;">
    <tr>
         {inner_html}
    </tr>
</table>
"""))

## Confident right examples

Let's see 5 examples where the model was very confident and right.

In [None]:
inner_html = ""
for idx, row in val_preds_df[val_preds_df.is_correct].sort_values('loss', ascending=True).head(5).iterrows():
    inner_html += f'''
        <td width="20%" style="vertical-align:top; border-right: 1px solid #7accd8"><p><b>Actual</b>: {row.discourse_effectiveness} <br><b>Predicted</b>: {row.predicted} ({row[row.predicted]:.2f})</p><p><i>{row.discourse_text}</i></p></td>
    '''
    
display(HTML(f"""
<table style="font-family: monospace;">
    <tr>
         {inner_html}
    </tr>
</table>
"""))

## Whole Essays

In [None]:
def _get_label_color(label):
    return {
        'Adequate': '#777',
        'Ineffective': '#d9534f',
        'Effective': '#5cb85c'
    }[label]

def display_essay(essay_id):
    table_header = """<table style="line-height: 25px; border-collapse:collapse;" width=100%>
        <tr style="font-size: 1.2em; padding-top: 15px; font-family: monospace; background-image: repeating-linear-gradient(white 0px, white 24px, #7accd8 25px);">
            <th style="padding-bottom: 10px;" width="10%" align="left">Prediction</th>
            <th style="padding-bottom: 10px;" width="5%" align="left">Conf</th>
            <th style="padding-bottom: 10px;"  width="75%" align="left">Text</th>
            <th style="padding-bottom: 10px;"  width="10%" align="left">Type</th>
        </tr>"""


    for idx, row in val_preds_df[val_preds_df.essay_id == essay_id].iterrows():
        table_header += f"""
        <tr  style="padding: 0px; vertical-align: top; align: left; background-image: repeating-linear-gradient(white 0px, white 24px, #7accd8 25px);">
            <td style="vertical-align: top; line-height: 25px;">
                <div style="line-height: 20px; width: 100%;  text-align:center; border-radius: 0.25em; background-color: {_get_label_color(row.predicted)}; color: #fff; font-family: monospace">{row.predicted}</div>
                
            </td>
            <td style="vertical-align: top; align: left; line-height: 25px;">
                <span style="border-radius: 0.25em; color: #000; font-family: monospace"> {round(row[row.predicted], 2)}</span>
            </td>
            <td style="vertical-align: top; align: left; font-family: monospace; line-height: 25px;">
                {row.discourse_text}
            </td>
            <td style="vertical-align: top; align: left; line-height: 25px;">
                <span style="border-radius: 0.25em; color: #000; font-family: monospace"><b>{row.discourse_type}</b></span>
            </td>
        </tr>
        """

    table_footer = """</table>"""
    display(HTML(table_header + table_footer))

In [None]:
# Essays behind the lowest-loss correct predictions; unique() avoids picking the same essay twice.
essay1, essay2 = val_preds_df[val_preds_df.is_correct].sort_values('loss', ascending=True).essay_id.unique()[:2]

In [None]:
display_essay(essay1)

In [None]:
display_essay(essay2)

# Inference

For inference results see [Feedback Prize - DeBERTa-v3 Inference](https://www.kaggle.com/code/lextoumbourou/feedback-prize-deberta-v3-inference).

I include the inference code here for completeness.

In [None]:
import sys
import glob
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
sys.path.append('../input/feedback-topics-identification-with-bertopic/site-packages')
from bertopic import BERTopic

topic_model = BERTopic.load("../input/feedback-topics-identification-with-bertopic/feedback_2021_topic_model")

sws = set(stopwords.words("english") + ["n't", "'s", "'ve"])  # set for fast membership tests
fls = glob.glob("../input/feedback-prize-effectiveness/test/*.txt")
docs = []
for fl in tqdm(fls):
    with open(fl) as f:
        txt = f.read()
    # Drop stop words before topic inference.
    word_tokens = word_tokenize(txt)
    txt = " ".join(w for w in word_tokens if w.lower() not in sws)
    docs.append(txt)

topics, probs = topic_model.transform(docs)

# The file name without its extension is the essay id.
dids = [fl.split("/")[-1].split(".")[0] for fl in fls]
pred_topics = pd.DataFrame({"essay_id": dids, "topic": topics})
pred_topics = pred_topics.merge(topic_meta_df, on='topic', how='left')
pred_topics

In [None]:
test_df = test_df.merge(pred_topics, on='essay_id', how='left')
test_df = add_inputs(test_df, str(config.input_path / 'test'))

In [None]:
import torch

all_test_data = np.zeros((config.n_folds, len(test_df), len(labels)))

for fold_num in range(config.n_folds):
    print(f'Do fold {fold_num}')

    # Load from the directory that do_fold() saved to.
    fold_dir = config.output_path / f'fold_{fold_num}'
    tokenizer = AutoTokenizer.from_pretrained(fold_dir)
    tokenizer.model_max_length = config.max_len

    model = get_model()
    state_dict = torch.load(fold_dir / 'pytorch_model.bin', map_location='cpu')
    model.load_state_dict(state_dict)
    
    test_dataset = Dataset.from_pandas(test_df[['inputs', 'essay_fn']])
    test_tok_dataset = test_dataset.map(tokenizer_func, batched=True, remove_columns=('inputs', 'essay_fn'))
    
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='longest')

    # Only prediction-related arguments are needed here; nothing is trained or evaluated.
    args = TrainingArguments(
        output_dir=config.output_path,
        fp16=True,
        per_device_eval_batch_size=config.batch_size * 2,
        report_to="none",
        save_strategy='no'
    )
    
    trainer = Trainer(
        model,
        args,
        tokenizer=tokenizer,
        data_collator=data_collator
    )
    
    outputs = trainer.predict(test_tok_dataset) 
    softmax_outputs = softmax(outputs.predictions, axis=1)
    
    all_test_data[fold_num] = softmax_outputs

# Submission

In [None]:
preds = np.mean(all_test_data, axis=0)
output_df = pd.concat([test_df[['discourse_id']], pd.DataFrame(preds, columns=labels)], axis=1)
output_df.to_csv('submission.csv', index=False)
pd.read_csv('submission.csv')
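
Averaging the per-fold softmax outputs keeps every row a valid probability distribution, which is worth checking before submitting. A minimal sketch in plain Python, assuming made-up fold outputs (`fold_probs` and `average_folds` are hypothetical and stand in for `all_test_data` and `np.mean(..., axis=0)`):

```python
# Hypothetical softmax outputs: 2 folds x 2 test rows x 3 classes.
fold_probs = [
    [[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]],
    [[0.4, 0.3, 0.3], [0.1, 0.8, 0.1]],
]


def average_folds(fold_probs):
    """Element-wise mean over the fold axis."""
    n_folds = len(fold_probs)
    return [
        [sum(vals) / n_folds for vals in zip(*rows)]  # average each class over folds
        for rows in zip(*fold_probs)                  # group the same test row across folds
    ]


avg = average_folds(fold_probs)

# Each averaged row should still sum to 1 (up to floating point error).
assert all(abs(sum(row) - 1.0) < 1e-9 for row in avg)
print([[round(v, 2) for v in row] for row in avg])
```

The same check on the real array would compare `preds.sum(axis=1)` against 1 within a small tolerance.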