# 📖 Feedback Prize - BERT 🤗 Sentence Classifier

![](https://storage.googleapis.com/kaggle-competitions/kaggle/31779/logos/header.png)

### Starter notebook for the competition [Feedback Prize - Evaluating Student Writing](https://www.kaggle.com/c/feedback-prize-2021/), framing the task as a sentence classification problem, using HuggingFace, BERT and the Trainer API.


#### A lot of competitions going on, right my friends? ;)

### The task in this competition does not map trivially onto an "orthodox" NLP problem, at least as far as I can tell. Is it token classification, like NER/POS? Is it a... Hindi bilingual question answering lol?

<h2> In this notebook I will try to present one of the possible approaches, that is: <span style="color:blue"> Sentence Classification</span>.</h2>

---

The agenda is as follows:
1. A very quick EDA (there are several EDA notebooks already, to which I have little to add)
2. Preprocess to obtain a sentence classification dataset
3. Fine-tune a BERT model on that sentence classification dataset
4. Submit

---

## Please, _DO_ upvote if you find it useful or interesting!! 


# Imports

In [21]:
import os
import nltk
import pandas as pd
from tqdm.auto import tqdm

from datasets import Dataset
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

# 10,000 foot view of the data

Let's take a quick glance at the `train.csv` file:

In [22]:
# Constants
TRAIN_CSV = "../input/feedback-prize-2021/train.csv"
SUB_CSV = "../input/feedback-prize-2021/sample_submission.csv"
TRAIN_PATH = "../input/feedback-prize-2021/train"
TEST_PATH = "../input/feedback-prize-2021/test"

# Load DF
df = pd.read_csv(TRAIN_CSV, dtype={'discourse_id': int, 'discourse_start': int, 'discourse_end': int})
df.head()

Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring
0,423A1CA112E2,1622627660524,8,229,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...
1,423A1CA112E2,1622627653021,230,312,They are some really bad consequences when stu...,Position,Position 1,45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
2,423A1CA112E2,1622627671020,313,401,Some certain areas in the United States ban ph...,Evidence,Evidence 1,60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
3,423A1CA112E2,1622627696365,402,758,"When people have phones, they know about certa...",Evidence,Evidence 2,76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 9...
4,423A1CA112E2,1622627759780,759,886,Driving is one of the way how to get around. P...,Claim,Claim 1,139 140 141 142 143 144 145 146 147 148 149 15...


From the [Data tab](https://www.kaggle.com/c/feedback-prize-2021/data):

* id - ID code for essay response
* discourse_id - ID code for discourse element
* discourse_start - character position where discourse element begins in the essay response
* discourse_end - character position where discourse element ends in the essay response
* discourse_text - text of discourse element
* discourse_type - classification of discourse element
* discourse_type_num - enumerated class label of discourse element
* predictionstring - the word indices of the training sample, as required for predictions
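
To make the link between the character offsets and `predictionstring` more concrete, here is a minimal sketch. It assumes that the word indices come from a plain whitespace split of the essay (`str.split()`); the helper name `char_span_to_predictionstring` is mine and not part of the competition data, and spans that begin in the middle of a word would need extra care.

```python
def char_span_to_predictionstring(text, start, end):
    """Map the character span text[start:end] to space-separated word indices.

    Assumption: "words" are whatever str.split() produces on the essay text.
    """
    # Number of whitespace-separated words before `start` = index of the span's first word
    first_word = len(text[:start].split())
    # Number of whitespace-separated words inside the span
    n_words = len(text[start:end].split())
    return " ".join(str(i) for i in range(first_word, first_word + n_words))


# Toy essay: a one-word title, a blank line, then a single sentence
essay = "Phones\n\nModern humans today are always on their phone."
print(char_span_to_predictionstring(essay, 8, len(essay)))
```

The title occupies word index 0, so a span starting at character 8 begins at word index 1, which is consistent with the first row of the table above.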


In [23]:
# No nulls
df.isnull().sum()

id                    0
discourse_id          0
discourse_start       0
discourse_end         0
discourse_text        0
discourse_type        0
discourse_type_num    0
predictionstring      0
dtype: int64

## Let's see the first example in some more detail

In [24]:
a_id = "423A1CA112E2"

In [25]:
def get_text(a_id):
    a_file = f"{TRAIN_PATH}/{a_id}.txt"
    with open(a_file, "r") as fp:
        txt = fp.read()
    return txt

txt = get_text(a_id)
print(txt)

Phones

Modern humans today are always on their phone. They are always on their phone more than 5 hours a day no stop .All they do is text back and forward and just have group Chats on social media. They even do it while driving. They are some really bad consequences when stuff happens when it comes to a phone. Some certain areas in the United States ban phones from class rooms just because of it.

When people have phones, they know about certain apps that they have .Apps like Facebook Twitter Instagram and Snapchat. So like if a friend moves away and you want to be in contact you can still be in contact by posting videos or text messages. People always have different ways how to communicate with a phone. Phones have changed due to our generation.

Driving is one of the way how to get around. People always be on their phones while doing it. Which can cause serious Problems. That's why there's a thing that's called no texting while driving. That's a really important thing to remember. S

In [26]:
df_example = df[df['id'] == a_id]
df_example

Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring
0,423A1CA112E2,1622627660524,8,229,Modern humans today are always on their phone....,Lead,Lead 1,1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1...
1,423A1CA112E2,1622627653021,230,312,They are some really bad consequences when stu...,Position,Position 1,45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
2,423A1CA112E2,1622627671020,313,401,Some certain areas in the United States ban ph...,Evidence,Evidence 1,60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
3,423A1CA112E2,1622627696365,402,758,"When people have phones, they know about certa...",Evidence,Evidence 2,76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 9...
4,423A1CA112E2,1622627759780,759,886,Driving is one of the way how to get around. P...,Claim,Claim 1,139 140 141 142 143 144 145 146 147 148 149 15...
5,423A1CA112E2,1622627780655,887,1150,That's why there's a thing that's called no te...,Evidence,Evidence 3,163 164 165 166 167 168 169 170 171 172 173 17...
6,423A1CA112E2,1622627811787,1151,1533,Sometimes on the news there is either an accid...,Evidence,Evidence 4,211 212 213 214 215 216 217 218 219 220 221 22...
7,423A1CA112E2,1622627585180,1534,1602,Phones are fine to use and it's also the best ...,Claim,Claim 2,282 283 284 285 286 287 288 289 290 291 292 29...
8,423A1CA112E2,1622627895668,1603,1890,If you go through a problem and you can't find...,Evidence,Evidence 5,297 298 299 300 301 302 303 304 305 306 307 30...
9,423A1CA112E2,1622627628524,1891,2027,The news always updated when people do somethi...,Concluding Statement,Concluding Statement 1,355 356 357 358 359 360 361 362 363 364 365 36...


In [27]:
# `ls -l` prints a leading "total" line, so 15595 lines means 15594 files in the train path
!ls -l {TRAIN_PATH} | wc -l

15595


In [28]:
# `ls -l` prints a leading "total" line, so 6 lines means 5 files in the test path
!ls -l {TEST_PATH} | wc -l

6


I will stop the EDA here because there are already several very good notebooks around. I recommend the following ones:

* [Feedback Prize EDA with displacy](https://www.kaggle.com/thedrcat/feedback-prize-eda-with-displacy) by [thedrcat](https://www.kaggle.com/thedrcat/)
* [[Feedback prize] Simple EDA](https://www.kaggle.com/ilialar/feedback-prize-simple-eda) by [ilialar](https://www.kaggle.com/ilialar)
* [🔥📊 Feedback Prize - EDA 📊🔥](https://www.kaggle.com/odins0n/feedback-prize-eda) by [odins0n](https://www.kaggle.com/odins0n/)
* [Feedback Prize - EDA](https://www.kaggle.com/yamqwe/feedback-prize-eda) by [yamqwe](https://www.kaggle.com/yamqwe/)


Let's get into the sentence classification idea!


# Sentence Classifier with HuggingFace 🤗

# Create a sentence classification dataset

As far as I know, this problem does not map trivially onto one of the "typical" NLP tasks.
It might be close to NER / POS, but the fact that the entities are so large makes me doubt it.

I'm looking forward to the community discussion about the different possible approaches to this problem.


Although I might be missing something very obvious, this notebook proposes the following approach, which is a multiclass classifier:

1. Split the texts into sentences (x)
2. Assign each sentence a class (y).
3. Train a normal sequence classifier on those sentences

There are 7 classes, and the labeled sections (sometimes) span more than one sentence. We will preprocess them so that each sample is a single sentence. That way, we avoid, for now, the problem of detecting where an element starts and where it ends.
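
The three steps can be sketched on a toy example. This is only an illustration under stated assumptions: the regex splitter `split_sentences` is a naive stand-in for a proper sentence tokenizer, and the text and spans are made up.

```python
import re


def split_sentences(chunk):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", chunk.strip()) if s]


def label_sentences(text, elements):
    """Turn (start, end, class) character spans into (sentence, class) pairs."""
    samples = []
    for start, end, class_ in elements:
        for sentence in split_sentences(text[start:end]):
            samples.append((sentence, class_))
    return samples


# Made-up essay with two labeled spans; the first span covers two sentences
text = "Phones are useful. They keep us in touch. But they distract drivers."
elements = [(0, 41, "Claim"), (42, len(text), "Counterclaim")]

for sentence, class_ in label_sentences(text, elements):
    print(f"{class_:>12} | {sentence}")
```

Each sentence inherits the class of the span it came from, so one labeled span can yield several training samples.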


In [29]:
# There are 7 classes:
df['discourse_type'].value_counts(normalize=True)

Claim                   0.347959
Evidence                0.316731
Position                0.106859
Concluding Statement    0.093594
Lead                    0.064487
Counterclaim            0.040314
Rebuttal                0.030057
Name: discourse_type, dtype: float64

## Encode classes as ints
Some sections don't belong to any class. We will label them as `No Class` so we can discard those sections and avoid false positives.

In [30]:
ID2CLASS = dict(enumerate(df['discourse_type'].unique().tolist() + ['No Class']))
CLASS2ID = {v: k for k, v in ID2CLASS.items()}
print(ID2CLASS)
CLASS2ID

{0: 'Lead', 1: 'Position', 2: 'Evidence', 3: 'Claim', 4: 'Concluding Statement', 5: 'Counterclaim', 6: 'Rebuttal', 7: 'No Class'}


{'Lead': 0,
 'Position': 1,
 'Evidence': 2,
 'Claim': 3,
 'Concluding Statement': 4,
 'Counterclaim': 5,
 'Rebuttal': 6,
 'No Class': 7}
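
Note that `unique()` returns the classes in order of first appearance, so the ids above depend on the row order of `train.csv`. If you want the mapping to stay the same even when the dataframe is shuffled or filtered, one option is to hard-code the list. A small sketch, using the class names printed above (the `*_FIXED` names are mine):

```python
# Class names copied from the output above, in the same order
CLASSES = [
    "Lead", "Position", "Evidence", "Claim",
    "Concluding Statement", "Counterclaim", "Rebuttal", "No Class",
]

ID2CLASS_FIXED = dict(enumerate(CLASSES))
CLASS2ID_FIXED = {name: idx for idx, name in ID2CLASS_FIXED.items()}

print(CLASS2ID_FIXED["No Class"])
```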



## Dataset functions: `fill_gaps()`, `get_elements()`, and `get_x_samples()`

Here we write the function `fill_gaps`, which does just that: it fills the unlabeled gaps with `No Class` elements. I leave the code I used during development first; the condensed function comes below.

In [31]:
text_ids = df['id'].unique().tolist()

In [32]:
text_id = text_ids[5]
text = get_text(text_id)
print(text)

Operating a motor vehicle while on your cell phone

Being on your device and driving could be an overly dangerous choice in life. Many people around the world are injured by this situation every day. It could lead to accidents and altercations. In addition it would even cost you your licences. The most detrimental outcome is death. There are far more outcomes to operating a motor vehicle while being on a cell phone. Drivers should not be able to use cell phones in any capacity while operating a motor vehicle.

One leading cause to motor vehicle accidents is being on your cell phone. It could lead to accidents and altercations. Yourself and the passengers are more at risk to bodily injuries and harm. These bodily injuries and harm can range anywhere from mild to critical condition. In an motor vehicle accident there is more than one involved. There is another person or group of people that could have the exact same conditions than the driver on the cell phone. In an example of an alterc

In [33]:
# Extract element boundaries and classes with to_records

df_text = df[df['id'] == text_id]
elements = df_text[['discourse_start', 'discourse_end', 'discourse_type']].to_records(index=False).tolist()
elements

[(52, 200, 'Lead'),
 (200, 244, 'Claim'),
 (245, 294, 'Claim'),
 (295, 334, 'Claim'),
 (418, 515, 'Position'),
 (516, 590, 'Claim'),
 (590, 1152, 'Evidence'),
 (1162, 1219, 'Claim'),
 (1220, 1706, 'Evidence'),
 (1718, 1842, 'Claim'),
 (1843, 2443, 'Evidence'),
 (2444, 2922, 'Concluding Statement')]

In [34]:
# Fill "No class" chunks: beginning and end

initial_idx = 0
final_idx = len(text)

# Add an element at the beginning if the first element doesn't start at index 0
new_elements = []
if elements[0][0] != initial_idx:
    starting_element = (0, elements[0][0]-1, 'No Class')
    new_elements.append(starting_element)

    
# Add an element at the end if the last element doesn't end at the final index
if elements[-1][1] != final_idx:
    closing_element = (elements[-1][1]+1, final_idx, 'No Class')
    new_elements.append(closing_element)
    
elements += new_elements
elements = sorted(elements, key=lambda x: x[0])
# See first element (new)
elements

[(0, 51, 'No Class'),
 (52, 200, 'Lead'),
 (200, 244, 'Claim'),
 (245, 294, 'Claim'),
 (295, 334, 'Claim'),
 (418, 515, 'Position'),
 (516, 590, 'Claim'),
 (590, 1152, 'Evidence'),
 (1162, 1219, 'Claim'),
 (1220, 1706, 'Evidence'),
 (1718, 1842, 'Claim'),
 (1843, 2443, 'Evidence'),
 (2444, 2922, 'Concluding Statement')]

In [35]:
# Add "No class" elements in between separated elements
new_elements = []
# Go through the last element too, so that every adjacent pair is checked
for i in range(1, len(elements)):
    if elements[i][0] != elements[i-1][1] + 1 and elements[i][0] != elements[i-1][1]:
        new_element = (elements[i-1][1] + 1, elements[i][0]-1, 'No Class')
        new_elements.append(new_element)

elements += new_elements
elements = sorted(elements, key=lambda x: x[0])
elements

[(0, 51, 'No Class'),
 (52, 200, 'Lead'),
 (200, 244, 'Claim'),
 (245, 294, 'Claim'),
 (295, 334, 'Claim'),
 (335, 417, 'No Class'),
 (418, 515, 'Position'),
 (516, 590, 'Claim'),
 (590, 1152, 'Evidence'),
 (1153, 1161, 'No Class'),
 (1162, 1219, 'Claim'),
 (1220, 1706, 'Evidence'),
 (1707, 1717, 'No Class'),
 (1718, 1842, 'Claim'),
 (1843, 2443, 'Evidence'),
 (2444, 2922, 'Concluding Statement')]

In [36]:
# Final "fill_gaps" function, wrapping up the above cells

def fill_gaps(elements, text):
    """Add "No Class" elements to a list of elements (see get_elements) """
    initial_idx = 0
    final_idx = len(text)

    # Add an element at the beginning if the first element doesn't start at index 0
    new_elements = []
    if elements[0][0] != initial_idx:
        starting_element = (0, elements[0][0]-1, 'No Class')
        new_elements.append(starting_element)


    # Add an element at the end if the last element doesn't end at the final index
    if elements[-1][1] != final_idx:
        closing_element = (elements[-1][1]+1, final_idx, 'No Class')
        new_elements.append(closing_element)

    elements += new_elements
    elements = sorted(elements, key=lambda x: x[0])

    # Add "No class" elements in between separated elements
    new_elements = []
    # Go through the last element too, so that every adjacent pair is checked
    for i in range(1, len(elements)):
        if elements[i][0] != elements[i-1][1] + 1 and elements[i][0] != elements[i-1][1]:
            new_element = (elements[i-1][1] + 1, elements[i][0]-1, 'No Class')
            new_elements.append(new_element)

    elements += new_elements
    elements = sorted(elements, key=lambda x: x[0])
    return elements


def get_elements(df, text_id, do_fill_gaps=True, text=None):
    """Get a list of (start, end, class) elements for a given text_id"""
    text = get_text(text_id) if text is None else text
    df_text = df[df['id'] == text_id]
    elements = df_text[['discourse_start', 'discourse_end', 'discourse_type']].to_records(index=False).tolist()
    if do_fill_gaps:
        elements = fill_gaps(elements, text)
    return elements
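
For comparison, the same idea can be written as a single pass with a moving cursor. This is a sketch, not a drop-in replacement: it uses half-open `(start, end)` spans (so a gap ends exactly where the next element starts, unlike the `start - 1` ends above), and it skips one-character gaps, mirroring the `+ 1` check in the function above. The name `fill_gaps_single_pass` and the `gap_label` parameter are mine.

```python
def fill_gaps_single_pass(elements, text_len, gap_label="No Class"):
    """Insert gap elements between (start, end, class) spans in one pass.

    Assumption: spans are half-open, i.e. they cover text[start:end].
    """
    filled = []
    cursor = 0  # end of the region covered so far
    for start, end, label in sorted(elements):
        # Only gaps longer than one character become their own element
        if start - cursor > 1:
            filled.append((cursor, start, gap_label))
        filled.append((start, end, label))
        cursor = max(cursor, end)
    # Trailing gap after the last labeled element
    if text_len - cursor > 1:
        filled.append((cursor, text_len, gap_label))
    return filled


toy_elements = [(52, 200, "Lead"), (200, 244, "Claim"), (245, 294, "Claim"), (418, 515, "Position")]
for element in fill_gaps_single_pass(toy_elements, text_len=600):
    print(element)
```

Because it builds a new list, this version also leaves the input list untouched.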

In [37]:
def get_x_samples(df, text_id, do_fill_gaps=True):
    """Create a dataframe of the sentences of the text_id, with columns text, label """
    text = get_text(text_id)
    elements = get_elements(df, text_id, do_fill_gaps, text)
    sentences = []
    for start, end, class_ in elements:
        elem_sentences = nltk.sent_tokenize(text[start:end])
        sentences += [(sentence, class_) for sentence in elem_sentences]
    df = pd.DataFrame(sentences, columns=['text', 'label'])
    df['label'] = df['label'].map(CLASS2ID)
    return df

get_x_samples(df, text_ids[1])

Unnamed: 0,text,label
0,Phones & Driving,7
1,Drivers should not be able to use phones while...,1
2,Drivers who used their phone while operating a...,3
3,According to an article by the Edgar Snyder Fi...,2
4,"According to the same article, 35% know the ri...",2
5,This shows that its beyond dangerous and irres...,2
6,Drivers should be able to concentrate without ...,2
7,"According to another article, ""Distracted Driv...",2
8,The article states that teen get too distracte...,2
9,Accidents that can be easily avoided by focusi...,2


## Build the full dataframe for sentence classification

In [40]:
# This takes a while. I created a dataset with the output here: https://www.kaggle.com/julian3833/feedback-df-sentences
x = []
for text_id in tqdm(text_ids):
    x.append(get_x_samples(df, text_id))

df_sentences = pd.concat(x)

  0%|          | 0/15594 [00:00<?, ?it/s]

In [42]:
df_sentences.to_csv('../input/feedback-prize-2021/df_sentences.csv', index=False)

In [39]:
# df_sentences = pd.read_csv("../input/feedback-df-sentences/df_sentences.csv")


In [None]:
df_sentences = df_sentences[df_sentences.text.str.split().str.len() >= 3]
df_sentences.head()

In [None]:
df_sentences.to_csv("df_sentences.csv", index=False)

In [None]:
len(df_sentences)

# Modeling!!!

We will use `BERT` and the `Trainer` API from Hugging Face.

We load the model from a Kaggle dataset to avoid needing internet access (a restriction of the competition for submission notebooks).

References:
* https://huggingface.co/docs/transformers/training
* https://huggingface.co/docs/transformers/custom_datasets

In [None]:
MODEL_CHK = "../input/huggingface-bert/bert-base-cased"

NUM_LABELS = 8

NUM_EPOCHS = 2

## HuggingFace Dataset

In [None]:
ds_train = Dataset.from_pandas(df_sentences.iloc[:340000])
ds_val = Dataset.from_pandas(df_sentences.iloc[340000:])
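
The split above is positional, so sentences of the essay sitting at row 340000 can end up on both sides. If you prefer a split where every essay lands entirely in train or entirely in validation, a possible sketch follows. It assumes you keep the essay id next to each sentence, which the dataframe built above does not do, so `essay_ids`, `split_ids`, `val_fraction` and `seed` are all hypothetical names.

```python
import random


def split_ids(essay_ids, val_fraction=0.1, seed=42):
    """Randomly assign whole essays to train or validation.

    Assumption: `essay_ids` holds one essay id per sentence row.
    """
    unique_ids = sorted(set(essay_ids))
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(unique_ids)
    n_val = max(1, int(len(unique_ids) * val_fraction))
    val_ids = set(unique_ids[:n_val])
    train_ids = set(unique_ids[n_val:])
    return train_ids, val_ids


# Toy example: 10 essays with 3 sentences each
essay_ids = [f"essay_{i}" for i in range(10) for _ in range(3)]
train_ids, val_ids = split_ids(essay_ids, val_fraction=0.2)
print(len(train_ids), len(val_ids))
```

The sentence rows would then be selected with something like `df_sentences['id'].isin(train_ids)`, again assuming such an `id` column exists.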

## Tokenize

In [None]:
transformers.logging.set_verbosity_warning() # Silence some annoying logging of HF

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHK)

def preprocess_function(examples):    
    return tokenizer(examples["text"], truncation=True, max_length=256)


# Tokenize datasets
ds_train_tokenized = ds_train.map(preprocess_function, batched=True)
ds_val_tokenized = ds_val.map(preprocess_function, batched=True)

In [None]:
# Load model
model = AutoModelForSequenceClassification.from_pretrained(MODEL_CHK, num_labels=NUM_LABELS)

### Prepare trainer

In [None]:
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
os.environ['WANDB_DISABLED'] = 'true'

training_args = TrainingArguments(
    output_dir='feeeback-classifier',
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=NUM_EPOCHS,
    weight_decay=0.01,
    report_to="none",
    evaluation_strategy="epoch",
    save_strategy="epoch",

)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train_tokenized,
    eval_dataset=ds_val_tokenized,
    tokenizer=tokenizer,
    #data_collator=data_collator,
)
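
With the setup above, the per-epoch evaluation only reports the loss. If you also want a metric, `Trainer` accepts a `compute_metrics` function. Here is a minimal sketch, assuming plain sentence-level accuracy is good enough for monitoring (it is not the competition metric); it would be passed as `compute_metrics=compute_metrics` when building the `Trainer`.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Sentence-level accuracy from the (logits, labels) pair given by the Trainer."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Toy check with made-up logits for 3 sentences and 2 classes
toy_logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
toy_labels = np.array([1, 0, 0])
print(compute_metrics((toy_logits, toy_labels)))
```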

## Train

In [None]:
trainer.train()

In [None]:
trainer.save_model("feedback-bert-trained")

# Submit

We will apply a process similar to the one we applied to the original train data, splitting each text into its sentences.

See the [Evaluation tab](https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation) for details about the `predictionstring` column
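
As far as I understand the Evaluation tab, a prediction counts as a match for a ground-truth element of the same class when the two sets of word indices overlap by at least 0.5 in both directions. The sketch below only illustrates that overlap check, under that assumption; the function name `is_match` is mine, and the Evaluation tab remains the authoritative definition.

```python
def is_match(pred_string, gt_string, threshold=0.5):
    """Check the two-way overlap between two predictionstrings.

    Assumption: both strings refer to elements of the same class.
    """
    pred = set(pred_string.split())
    gt = set(gt_string.split())
    if not pred or not gt:
        return False
    common = len(pred & gt)
    # The overlap must be large enough relative to both sets
    return common / len(gt) >= threshold and common / len(pred) >= threshold


print(is_match("1 2 3 4", "3 4 5 6"))
print(is_match("1 2 3 4 5 6 7 8", "7 8 9"))
```

The second call fails the check because the prediction is much longer than the ground truth, so very long predicted sentences can miss short elements.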

In [None]:
df_sub = pd.read_csv(SUB_CSV)
df_sub

## Prepare test dataset

In [None]:
def get_test_text(a_id):
    a_file = f"{TEST_PATH}/{a_id}.txt"
    with open(a_file, "r") as fp:
        txt = fp.read()
    return txt

def create_df_test():
    test_ids = [f[:-4] for f in os.listdir(TEST_PATH)]
    test_data = []
    for test_id in test_ids:
        text = get_test_text(test_id)
        sentences = nltk.sent_tokenize(text)
        id_sentences = []
        idx = 0 
        for sentence in sentences:
            id_sentence = []
            words = sentence.split()
            # I created this heuristic for mapping words in sentences to "word indexes"
            # This is not definitive and might have strong drawbacks and problems
            for w in words:
                id_sentence.append(idx)
                idx+=1
            id_sentences.append(id_sentence)
        test_data += list(zip([test_id] * len(sentences), sentences, id_sentences))
    df_test = pd.DataFrame(test_data, columns=['id', 'text', 'ids'])
    return df_test
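
The word-index heuristic inside `create_df_test` can be isolated and checked on a toy input. In this sketch the helper name `sentences_to_word_ids` is mine; it assumes the sentence splitter never cuts inside a whitespace-separated word, in which case the running index lines up with `text.split()`.

```python
def sentences_to_word_ids(sentences):
    """Give every whitespace-separated word a running index across sentences."""
    word_ids = []
    next_idx = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        word_ids.append(list(range(next_idx, next_idx + n_words)))
        next_idx += n_words
    return word_ids


toy_text = "Phones are useful. They distract drivers."
toy_sentences = ["Phones are useful.", "They distract drivers."]
toy_ids = sentences_to_word_ids(toy_sentences)
print(toy_ids)

# Sanity check: the indices cover exactly the words of the full text
assert sum(len(ids) for ids in toy_ids) == len(toy_text.split())
```

Running the same sanity check on real essays is a cheap way to spot texts where the heuristic drifts.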

In [None]:
df_test = create_df_test()
df_test.head()

In [None]:
ds_test = Dataset.from_pandas(df_test)
ds_test_tokenized = ds_test.map(preprocess_function, batched=True)

## Predict

In [None]:
# Get the predictions!!
test_predictions = trainer.predict(ds_test_tokenized)

In [None]:
# Turn logits into classes
df_test['predictions'] = test_predictions.predictions.argmax(axis=1)

# Turn class ids into class labels
df_test['class'] = df_test['predictions'].map(ID2CLASS)
df_test.head()

For now, we are submitting one row per sentence and not "elements". 

How to convert sentences into "elements" (blocks of sentences) is not clear, since consecutive sentences with the same class are sometimes annotated as independent "elements".
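
One naive option, shown here only as a sketch, is to merge consecutive sentences that share the same essay and the same predicted class. It has exactly the drawback mentioned above: adjacent but independent elements of the same class get glued together. The helper name `merge_consecutive` and the toy rows are mine.

```python
from itertools import groupby


def merge_consecutive(rows):
    """Merge runs of consecutive sentences with the same (essay id, class).

    `rows` is a list of (essay_id, class_name, word_ids) in reading order.
    """
    merged = []
    for (essay_id, class_name), group in groupby(rows, key=lambda r: (r[0], r[1])):
        ids = [i for _, _, word_ids in group for i in word_ids]
        merged.append((essay_id, class_name, " ".join(map(str, ids))))
    return merged


toy_rows = [
    ("A", "Claim", [0, 1]),
    ("A", "Claim", [2, 3]),
    ("A", "Evidence", [4]),
    ("B", "Evidence", [0, 1]),
]
for row in merge_consecutive(toy_rows):
    print(row)
```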

In [None]:
# Turn the word ids into the (somewhat weird) predictionstring format required
df_test['predictionstring'] = df_test['ids'].apply(lambda x: ' '.join([str(i) for i in x]))
df_test.head()

In [None]:
# Drop "No class" sentences
df_test = df_test[df_test['class'] != 'No Class']
df_test.head()

In [None]:
# And submit!! 🤞🤞 
df_test[['id', 'class', 'predictionstring']].to_csv("submission.csv", index=False)

## Please, _DO_ upvote if you find it useful or interesting!! 
#### I'm very close to becoming a grandmaster and very excited about it 😇🙏