## About Competition:

* The goal of this competition is to classify argumentative elements in student writing as "effective," "adequate," or "ineffective."

* The dataset contains argumentative essays written by U.S. students in grades 6-12.

* These essays were annotated by expert raters for discourse elements commonly found in argumentative writing, such as Lead, Claim, and Evidence.

* The task is to predict the quality rating of each discourse element.

* Submissions for this track are evaluated using multi-class logarithmic loss.
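
To make the metric concrete, here is a minimal sketch of multi-class logarithmic loss in plain Python. The probabilities, the labels, and the class order (Adequate, Ineffective, Effective) are made up for illustration; the function name `multiclass_log_loss` is hypothetical and this is not the competition's scoring code.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip the probability so that log(0) can never occur
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Hypothetical predictions for three discourse elements.
# Assumed column order: Adequate, Ineffective, Effective.
y_true = [0, 2, 1]
probs = [
    [0.7, 0.2, 0.1],  # confident and correct
    [0.1, 0.1, 0.8],  # confident and correct
    [0.5, 0.3, 0.2],  # true class only gets 0.3, so this row is penalised most
]
loss = multiclass_log_loss(y_true, probs)
print(round(loss, 4))
```

A useful reference point: a model that always predicts 1/3 for every class scores ln(3), roughly 1.0986, so a useful model should land below that.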

## Learning Goal:

* How to use the Hugging Face libraries to train a model for an NLP task.

**References**

1. [Getting started with NLP for absolute beginners by J Howard](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners)

## Setup

In [1]:
# Imports
import os

# Directories
DIR = '/kaggle/input/feedback-prize-effectiveness'
TRAIN_DIR = os.path.join(DIR, 'train')
TEST_DIR = os.path.join(DIR, 'test')

In [2]:
# Imports
import pandas as pd

# Read the train and test CSV files.
df = pd.read_csv(os.path.join(DIR, 'train.csv'))
test_df = pd.read_csv(os.path.join(DIR, 'test.csv'))

print(f'# of training samples: {len(df)}')
print(f'# of test samples: {len(test_df)}\n')
df.sample(5)

# of training samples: 36765
# of test samples: 10



| | discourse_id | essay_id | discourse_text | discourse_type | discourse_effectiveness |
|---|---|---|---|---|---|
| 13440 | 81aafc967519 | F2A3BE5019F0 | A shadow could have been reflecting off the vi... | Evidence | Ineffective |
| 34990 | 1950f537095f | 80889A79B329 | People who hear only one side of the story wil... | Evidence | Adequate |
| 20408 | 8698e6ba5051 | 515E8B741A54 | Another reason is that this system helps choos... | Counterclaim | Ineffective |
| 2654 | 19617f23c0dc | 2FF9836001F4 | I disagree with this decision because you go t... | Position | Effective |
| 3579 | da37345c71fa | 3F949C10F639 | So that is why I think that the author wants u... | Concluding Statement | Adequate |


## Preprocessing

* Analyse the distribution of the number of words in **discourse_text**.
* Concatenate discourse_text and discourse_type to form the model input.

In [3]:
df['text_length'] = df['discourse_text'].apply(lambda x: len(x.split(' ')))
df['text_length'].describe()

count    36765.000000
mean        45.721637
std         46.641451
min          2.000000
25%         17.000000
50%         29.000000
75%         58.000000
max        831.000000
Name: text_length, dtype: float64

**SURPRISING!!!**

* The minimum length of discourse_text is 2, and the maximum is 831.
* However, the mean is about 46 words with a standard deviation of about 47, and the 75th percentile is 58, which suggests that most discourse_text entries are shorter than 100 words.
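
One thing worth checking about the minimum of 2: `split(' ')` splits on every single space, so a one-word text with a trailing space is counted as two "words". The sketch below compares it with `split()`, which splits on runs of whitespace. The sample string is hypothetical, and the idea that the dataset's short texts carry trailing whitespace is an assumption, not something verified here.

```python
def count_split_space(text):
    # Splits on each individual space, so empty strings are counted too
    return len(text.split(' '))

def count_split_whitespace(text):
    # Splits on runs of whitespace and drops leading/trailing whitespace
    return len(text.split())

sample = "stress. "  # hypothetical: one word followed by a trailing space
print(count_split_space(sample))       # 2
print(count_split_whitespace(sample))  # 1
```

If the counts matter for a later decision (for example choosing a maximum sequence length), `split()` is probably the safer choice.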

In [4]:
# Discourse text with length = 2
df[df['text_length']==2]

| | discourse_id | essay_id | discourse_text | discourse_type | discourse_effectiveness | text_length |
|---|---|---|---|---|---|---|
| 11 | cc921c5cfda4 | 00944C693682 | stress. | Claim | Adequate | 2 |
| 452 | 1ab1030c639a | 0A5B8761B187 | Disagree | Position | Ineffective | 2 |
| 1397 | 210f8f088aa4 | 1B4E66B0BE0A | pollution. | Claim | Adequate | 2 |
| 1571 | e18b753a740a | 1DC6485ABFF6 | interest, | Claim | Ineffective | 2 |
| 1572 | 91b5849cdbed | 1DC6485ABFF6 | funds/workers. | Claim | Ineffective | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 35968 | 4a76afecac31 | C8FB2508978A | choices | Claim | Ineffective | 2 |
| 35969 | d9c17f7d8b7a | C8FB2508978A | opinions, | Claim | Adequate | 2 |
| 35973 | 9b72380e4fc2 | C8FB2508978A | opinions, | Claim | Ineffective | 2 |
| 35975 | 247d1c922753 | C8FB2508978A | choices, | Claim | Ineffective | 2 |


In [5]:
thresholds = [5, 10, 20, 40, 60, 80, 100, 1000]

for thr in thresholds:
    num_of_samples = len(df[df['text_length'] <= thr])
    print(f'# of samples with text length {thr} or less: {num_of_samples}')

# of samples with text length 5 or less: 618
# of samples with text length 10 or less: 3680
# of samples with text length 20 or less: 12620
# of samples with text length 40 or less: 23090
# of samples with text length 60 or less: 27990
# of samples with text length 80 or less: 30973
# of samples with text length 100 or less: 32940
# of samples with text length 1000 or less: 36765


In [6]:
# Concat discourse_text + discourse_type to form the input.
df['input'] = 'TEXT1: ' + df.discourse_text + '; TEXT2: ' + df.discourse_type
df.input.head()

0    TEXT1: Hi, i'm Isaac, i'm going to be writing ...
1    TEXT1: On my perspective, I think that the fac...
2    TEXT1: I think that the face is a natural land...
3    TEXT1: If life was on Mars, we would know by n...
4    TEXT1: People thought that the face was formed...
Name: input, dtype: object

In [7]:
# Create maps between label ids and label names with id2label and label2id:
id2label = {0: "Adequate", 1: "Ineffective", 2: "Effective"}
label2id = {"Adequate": 0, "Ineffective": 1, "Effective": 2}

df['labels'] = df['discourse_effectiveness'].apply(lambda x: label2id[x])
df.sample(5)

| | discourse_id | essay_id | discourse_text | discourse_type | discourse_effectiveness | text_length | input | labels |
|---|---|---|---|---|---|---|---|---|
| 2364 | 00821551d0ca | 2B7A8D15B50C | because of the pictures we took we could not s... | Evidence | Adequate | 42 | TEXT1: because of the pictures we took we coul... | 0 |
| 7600 | 0344a0b41b00 | 88FD7FAAFA90 | the students will enjoy doing the project. | Claim | Adequate | 8 | TEXT1: the students will enjoy doing the proje... | 0 |
| 31070 | fee94770bbee | E48B9182B257 | This writer is implying that the small state v... | Evidence | Effective | 66 | TEXT1: This writer is implying that the small ... | 2 |
| 8692 | a814a90ca57b | 9CE197B2A9F9 | people shuld participate in the program. \n | Position | Adequate | 7 | TEXT1: people shuld participate in the program... | 0 |
| 9897 | 6c5aae0257b1 | B2426E4674E7 | The author also states, "Some simplified elect... | Evidence | Effective | 97 | TEXT1: The author also states, "Some simplifie... | 2 |


## Tokenization

But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:

* **Tokenization**: Split each text up into words (or actually, as we'll see, into tokens)
* **Numericalization**: Convert each word (or token) into a number.
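
The two steps can be sketched with a toy example before bringing in a real tokenizer. Everything here is illustrative: the helper names (`toy_tokenize`, `build_vocab`, `numericalize`) are hypothetical, the vocabulary is built from two made-up sentences, and real tokenizers such as BERT's use a fixed, pretrained subword vocabulary instead.

```python
import re

def toy_tokenize(text):
    """Tokenization: lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(texts):
    """Give each distinct token an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"[UNK]": 0}
    for text in texts:
        for tok in toy_tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab):
    """Numericalization: map each token to its id, falling back to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in toy_tokenize(text)]

corpus = ["I think the face is natural.", "The face was formed by aliens."]
vocab = build_vocab(corpus)

print(toy_tokenize(corpus[0]))
# ['i', 'think', 'the', 'face', 'is', 'natural', '.']
print(numericalize("The face is natural, I think!", vocab))
# [3, 4, 5, 6, 0, 1, 2, 0]  (',' and '!' were never seen, so they map to [UNK])
```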

In [8]:
# The Hugging Face `datasets` library provides the Dataset object used to store the data.
from datasets import Dataset,DatasetDict

ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['discourse_id', 'essay_id', 'discourse_text', 'discourse_type', 'discourse_effectiveness', 'text_length', 'input', 'labels'],
    num_rows: 36765
})

In [9]:
# Choose a model
model_nm = 'bert-base-uncased'

In [10]:
# AutoTokenizer will create a tokenizer appropriate for a given model:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

In [11]:
# bert-base-uncased lowercases the text and splits it into WordPiece tokens.
# Uncommon words are split into pieces; a piece that continues a word is prefixed with ##:
tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")

['g',
 "'",
 'day',
 'folks',
 ',',
 'i',
 "'",
 'm',
 'jeremy',
 'from',
 'fast',
 '.',
 'ai',
 '!']

In [12]:
# Here's a simple function which tokenizes our inputs
def tok_func(x): return tokz(x["input"], truncation=True)

In [13]:
# To run this quickly on every row in our dataset, use map with batched=True, which tokenizes many rows per call:
tok_ds = ds.map(tok_func, batched=True)

In [14]:
tok_ds

Dataset({
    features: ['discourse_id', 'essay_id', 'discourse_text', 'discourse_type', 'discourse_effectiveness', 'text_length', 'input', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36765
})

In [15]:
row = tok_ds[0]
print(row['input'])
print(row['input_ids'])

TEXT1: Hi, i'm Isaac, i'm going to be writing about how this face on Mars is a natural landform or if there is life on Mars that made it. The story is about how NASA took a picture of Mars and a face was seen on the planet. NASA doesn't know if the landform was created by life on Mars, or if it is just a natural landform. ; TEXT2: Lead
[101, 3793, 2487, 1024, 7632, 1010, 1045, 1005, 1049, 7527, 1010, 1045, 1005, 1049, 2183, 2000, 2022, 3015, 2055, 2129, 2023, 2227, 2006, 7733, 2003, 1037, 3019, 2455, 14192, 2030, 2065, 2045, 2003, 2166, 2006, 7733, 2008, 2081, 2009, 1012, 1996, 2466, 2003, 2055, 2129, 9274, 2165, 1037, 3861, 1997, 7733, 1998, 1037, 2227, 2001, 2464, 2006, 1996, 4774, 1012, 9274, 2987, 1005, 1056, 2113, 2065, 1996, 2455, 14192, 2001, 2580, 2011, 2166, 2006, 7733, 1010, 2030, 2065, 2009, 2003, 2074, 1037, 3019, 2455, 14192, 1012, 1025, 3793, 2475, 1024, 2599, 102]


In [16]:
tokz.convert_ids_to_tokens(101)

'[CLS]'

* The CLS token, short for "classification token," is a special token used in Transformer-based models such as BERT.
* It is the first token in the input sequence, and its representation is meant to summarise the whole sequence for classification tasks.
* The final hidden state corresponding to the CLS token is often used as the input to a classifier layer or for other downstream tasks.
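
As a rough illustration of the last point, the sketch below feeds only the first token's vector into a linear layer followed by softmax. All numbers, shapes, and the `classify_from_cls` helper are made up for illustration. It is a simplification too: in Hugging Face's BERT classifier the CLS state first passes through a small pooling layer (dense plus tanh) and dropout, which is omitted here.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_from_cls(hidden_states, weights, bias):
    """Apply a linear layer to the first token's vector and return class probabilities."""
    cls_vector = hidden_states[0]  # position 0 holds the [CLS] token
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Made-up final hidden states: 3 tokens, hidden size 4
hidden_states = [
    [0.5, -1.0, 0.25, 2.0],  # [CLS]
    [0.1, 0.3, -0.2, 0.0],
    [0.9, 0.9, 0.9, 0.9],
]
# Made-up classifier parameters: 3 classes x hidden size 4
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
bias = [0.0, 0.0, 0.0]

probs = classify_from_cls(hidden_states, weights, bias)
print(probs.index(max(probs)))  # 2
```

The other token vectors do not enter the classifier directly; they influence the prediction only through the attention layers that produced the CLS state.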

In [17]:
# Train/validation split: hold out 25% of the rows (train_test_split names the held-out part 'test').
dds = tok_ds.train_test_split(0.25, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['discourse_id', 'essay_id', 'discourse_text', 'discourse_type', 'discourse_effectiveness', 'text_length', 'input', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27573
    })
    test: Dataset({
        features: ['discourse_id', 'essay_id', 'discourse_text', 'discourse_type', 'discourse_effectiveness', 'text_length', 'input', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9192
    })
})

In [18]:
# It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, 
# instead of padding the whole dataset to the maximum length.
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokz)
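
To see what dynamic padding means, here is a minimal plain-Python sketch of padding a batch to its own longest sequence. The `pad_batch` helper and the token ids are illustrative and not how `DataCollatorWithPadding` is implemented; `pad_id=0` is assumed to correspond to `[PAD]` in the bert-base-uncased vocabulary.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of its longest member."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Illustrative token ids for two sequences of different lengths
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]
ids, mask = pad_batch(batch)
print(ids)   # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1037, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

With a median text length of 29 words and a maximum of 831, padding each batch to its own longest sequence should waste far less computation than padding every row to the global maximum.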

## Define Metric (multi-class logarithmic loss)

In [19]:
import numpy as np
from sklearn.metrics import log_loss, accuracy_score

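def compute_metrics(eval_pred):
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    logits = np.asarray(logits, dtype=np.float64)

    # Apply a numerically stable softmax to obtain class probabilities
    # (subtracting the row maximum avoids overflow in np.exp)
    exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
    probabilities = exp_logits / exp_logits.sum(axis=1, keepdims=True)

    # Compute logarithmic loss; passing the label ids explicitly keeps the
    # metric working even if a class is missing from the evaluation split
    loss = log_loss(labels, probabilities, labels=[0, 1, 2])

    # Compute accuracy
    predicted_labels = np.argmax(probabilities, axis=1)
    accuracy = accuracy_score(labels, predicted_labels)

    return {'log_loss': round(float(loss), 4), 'accuracy': round(float(accuracy), 4)}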
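
A note on the softmax step: softmax is unchanged when the same constant is subtracted from every logit in a row, which is why the row maximum is often subtracted before exponentiating. The sketch below shows the difference in plain Python with made-up logits. In plain Python the naive version raises `OverflowError` for large logits; with NumPy the same situation typically shows up as `inf` and `nan` values instead.

```python
import math

def naive_softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    m = max(logits)  # shifting by the max leaves the result unchanged
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# For ordinary logits both versions agree
small = [2.0, 1.0, 0.1]
print(all(abs(a - b) < 1e-12 for a, b in zip(naive_softmax(small), stable_softmax(small))))

# For very large logits the naive version overflows
large = [1000.0, 999.0, 998.0]
try:
    naive_softmax(large)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)
print(stable_softmax(large))
```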

## Training

In [20]:
from transformers import TrainingArguments,Trainer

In [21]:
# Hyperparameters
bs = 16
epochs = 3
lr = 8e-5

In [22]:
args = TrainingArguments('outputs', 
                         learning_rate=lr, 
                         warmup_ratio=0.1, 
                         lr_scheduler_type='cosine', 
                         fp16=True,
                         evaluation_strategy="epoch", 
                         per_device_train_batch_size=bs, 
                         per_device_eval_batch_size=bs*2,
                         num_train_epochs=epochs, 
                         weight_decay=0.01,
                         report_to='none'
                        )
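
To get a feel for what `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` do to the learning rate, here is a small sketch of a linear warmup followed by a cosine decay to zero. It is an approximation of the schedule under stated assumptions, not the library's code: the `lr_at_step` helper is hypothetical, the warmup length is assumed to be the total step count times the ratio, rounded up, and the total of 2586 steps is taken from the training log of this run.

```python
import math

def lr_at_step(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero (a sketch)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: the learning rate grows linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: progress runs from 0 to 1 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 2586  # as reported in the training log
for step in (0, 130, 259, 1293, 2586):
    print(step, lr_at_step(step, total_steps))
```

The learning rate starts at zero, reaches its peak of 8e-5 at the end of warmup, and then falls smoothly back to zero by the last step.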

We can now create our model and a Trainer, a class that combines the data and the model (just like Learner in fastai):

In [23]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=3, id2label=id2label, label2id=label2id)
trainer = Trainer(model, 
                  args, 
                  train_dataset=dds['train'], 
                  eval_dataset=dds['test'],
                  tokenizer=tokz,
                  data_collator=data_collator, 
                  compute_metrics=compute_metrics
                 )

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [24]:
trainer.train();

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text_length, discourse_type, discourse_id, discourse_text, input, discourse_effectiveness, essay_id.
***** Running training *****
  Num examples = 27573
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 2586


| Epoch | Training Loss | Validation Loss | Loss | Accuracy | Runtime | Samples Per Second | Steps Per Second |
|---|---|---|---|---|---|---|---|
| 1 | 0.8091 | 0.8038 | 0.803885 | 0.6574 | 44.0753 | 208.552 | 3.267 |
| 2 | 0.6288 | 0.7131 | 0.713139 | 0.692 | 43.9335 | 209.225 | 3.278 |
| 3 | 0.3577 | 0.8869 | 0.886987 | 0.6781 | 43.921 | 209.285 | 3.279 |


Saving model checkpoint to outputs/checkpoint-500
Configuration saved in outputs/checkpoint-500/config.json
Model weights saved in outputs/checkpoint-500/pytorch_model.bin
tokenizer config file saved in outputs/checkpoint-500/tokenizer_config.json
Special tokens file saved in outputs/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text_length, discourse_type, discourse_id, discourse_text, input, discourse_effectiveness, essay_id.
***** Running Evaluation *****
  Num examples = 9192
  Batch size = 64
Saving model checkpoint to outputs/checkpoint-1000
Configuration saved in outputs/checkpoint-1000/config.json
Model weights saved in outputs/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in outputs/checkpoint-1000/tokenizer_config.json
Special tokens file saved in outputs/checkpoint-1000/special_tokens_map.json
Saving