# 📖 Feedback - Baseline🤗 Sentence Classifier [0.226]

![](https://storage.googleapis.com/kaggle-competitions/kaggle/31779/logos/header.png)

### Starter notebook for the competition [Feedback Prize - Evaluating Student Writing](https://www.kaggle.com/c/feedback-prize-2021/) framing the task as a sentence classification problem, using HuggingFace, BERT, and the Trainer API, with an LB score of `0.226`.


#### A lot of competitions going on, right my friends? ;)

### This task does not map trivially onto an "orthodox" NLP problem, at least as far as I can tell. Is it token classification, like NER/POS? Is it a... Hindi bilingual Question Answering (lol)?

<h2> In this notebook I will try to present one of the possible approaches, namely <span style="color:blue"> Sentence Classification</span>.</h2>

---

The agenda is as follows:
1. A very quick EDA
2. Preprocess to obtain a sentence classification dataset
3. Fine-tune a BERT model on that sentence classification dataset
4. Submit

---

## Please, _DO_ upvote if you find it useful or interesting!! 


# Imports

In [None]:
import os
import nltk
import pandas as pd
from tqdm.auto import tqdm

from datasets import Dataset
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

# 10,000-foot view of the data

Let's take a quick glance at the `train.csv` file:

In [None]:
# Constants
TRAIN_CSV = "../input/feedback-prize-2021/train.csv"
SUB_CSV = "../input/feedback-prize-2021/sample_submission.csv"
TRAIN_PATH = "../input/feedback-prize-2021/train"
TEST_PATH = "../input/feedback-prize-2021/test"

# Load DF
df = pd.read_csv(TRAIN_CSV, dtype={'discourse_id': int, 'discourse_start': int, 'discourse_end': int})
df.head()

From the [Data tab](https://www.kaggle.com/c/feedback-prize-2021/data):

* id - ID code for essay response
* discourse_id - ID code for discourse element
* discourse_start - character position where discourse element begins in the essay response
* discourse_end - character position where discourse element ends in the essay response
* discourse_text - text of discourse element
* discourse_type - classification of discourse element
* discourse_type_num - enumerated class label of discourse element
* predictionstring - the word indices of the training sample, as required for predictions
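
To make the relationship between the character offsets and `predictionstring` concrete, here is a minimal sketch. It assumes that word indices come from splitting the essay on whitespace and that `discourse_end` is an exclusive offset; the function name `span_to_predictionstring` and the toy text are hypothetical, and the competition's exact derivation may differ in edge cases.

```python
import re

def span_to_predictionstring(text, start, end):
    """Return the space-separated indices of the whitespace-delimited words
    that overlap the character span [start, end)."""
    indices = []
    for word_idx, match in enumerate(re.finditer(r"\S+", text)):
        # A word overlaps the span if it starts before the span ends
        # and ends after the span starts.
        if match.start() < end and match.end() > start:
            indices.append(str(word_idx))
    return " ".join(indices)

# Toy essay (hypothetical, not taken from the competition data)
toy_text = "Phones are great. Drivers should not use them."
start = toy_text.index("Drivers")
print(span_to_predictionstring(toy_text, start, len(toy_text)))
```

On this toy text the second sentence covers the fourth to eighth words, so the printed string lists the zero-based indices 3 to 7.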


In [None]:
# No nulls
df.isnull().sum()

## Let's see the first example in some more detail

In [None]:
a_id = "423A1CA112E2"

In [None]:
def get_text(a_id):
    a_file = f"{TRAIN_PATH}/{a_id}.txt"
    with open(a_file, "r") as fp:
        txt = fp.read()
    return txt

txt = get_text(a_id)
print(txt)

In [None]:
df_example = df[df['id'] == a_id]
df_example

In [None]:
# Files in train path: 15594 (plain `ls`; `ls -l` would also count its "total" line)
!ls {TRAIN_PATH} | wc -l

In [None]:
# Files in test path: 5 (plain `ls`; `ls -l` would also count its "total" line)
!ls {TEST_PATH} | wc -l

I will stop the EDA here, since there are already several very good EDA notebooks around and I have little to add to them. I recommend the following:

* [Feedback Prize EDA with displacy](https://www.kaggle.com/thedrcat/feedback-prize-eda-with-displacy) by [thedrcat](https://www.kaggle.com/thedrcat/)
* [[Feedback prize] Simple EDA](https://www.kaggle.com/ilialar/feedback-prize-simple-eda) by [ilialar](https://www.kaggle.com/ilialar)
* [🔥📊 Feedback Prize - EDA 📊🔥](https://www.kaggle.com/odins0n/feedback-prize-eda) by [odins0n](https://www.kaggle.com/odins0n/)
* [Feedback Prize - EDA](https://www.kaggle.com/yamqwe/feedback-prize-eda) by [yamqwe](https://www.kaggle.com/yamqwe/)


Let's get into the sentence classification idea!


# Sentence Classifier with HuggingFace 🤗

# Create a sentence classification dataset

As far as I know, this problem does not map trivially onto one of the "typical" NLP tasks. 
It might be close to NER / POS, but the fact that the entities are so long makes me doubt it.

I'm looking forward to the community discussion about the different possible approaches to this problem.

Although I might be missing something very obvious, this notebook proposes the following approach, which is a multiclass classifier:

1. Split the texts into sentences (x)
2. Assign each sentence a class (y).
3. Train a normal sequence classifier on those sentences

There are 7 classes, and the labeled sections sometimes span more than one sentence. We will preprocess them so that each sample is a single sentence. That way, we avoid the problem of detecting where an element starts and where it ends (for now).
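
The three steps above can be sketched on a toy example. This is only an illustration: `naive_sent_split` is a hypothetical regex stand-in for `nltk.sent_tokenize` (used so the snippet has no dependencies), and the text, offsets, and labels are made up.

```python
import re

def naive_sent_split(text):
    """Very rough sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def label_sentences(text, elements):
    """Turn (start, end, label) character spans into (sentence, label) pairs."""
    samples = []
    for start, end, label in elements:
        for sentence in naive_sent_split(text[start:end]):
            samples.append((sentence, label))
    return samples

# Toy essay with two labeled spans (hypothetical data)
toy_text = "Cars are useful. They pollute. We should limit them."
split_at = toy_text.index("We")
toy_elements = [(0, split_at, "Evidence"), (split_at, len(toy_text), "Position")]

for sentence, label in label_sentences(toy_text, toy_elements):
    print(f"{label:10s} | {sentence}")
```

Each sentence inherits the label of the element it came from, which is what turns the span-labeling task into plain sequence classification.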


In [None]:
# There are 7 classes:
df['discourse_type'].value_counts(normalize=True)

## Encode classes as ints
Some sections of the texts don't belong to any class. We will label them as `No Class` so that we can discard those sections at prediction time and avoid false positives.

In [None]:
ID2CLASS = dict(enumerate(df['discourse_type'].unique().tolist() + ['No Class']))
CLASS2ID = {v: k for k, v in ID2CLASS.items()}
print(ID2CLASS)
CLASS2ID

## Dataset functions: `fill_gaps()`, `get_elements()`, and `get_x_samples()`

Here we write the following functions:
* `fill_gaps()`,  which will label the "No Class" parts of the texts. 
* `get_elements()`, which uses `fill_gaps` and creates a list of text sections for a given text id 
* `get_x_samples()`, which maps these elements into sentences with labels.

First you'll find the step-by-step code used during development (because it illustrates the thought process); below it you'll find the condensed functions.
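
Before the actual code, here is a compact sketch of the gap-filling idea on its own. It is a simplified variant, not the implementation used in this notebook: it assumes non-overlapping elements and treats every `(start, end)` pair as a half-open interval, so no `+1`/`-1` adjustments appear. The name `fill_gaps_sketch` and the toy offsets are hypothetical.

```python
def fill_gaps_sketch(elements, text_len, filler="No Class"):
    """Return the elements plus filler spans so that [0, text_len) is fully covered."""
    filled = []
    cursor = 0  # first character position not yet covered by any span
    for start, end, label in sorted(elements):
        if start > cursor:
            # Uncovered stretch before this element: label it with the filler class
            filled.append((cursor, start, filler))
        filled.append((start, end, label))
        cursor = max(cursor, end)
    if cursor < text_len:
        # Uncovered tail after the last element
        filled.append((cursor, text_len, filler))
    return filled

# Toy spans (hypothetical), deliberately given out of order
toy_elements = [(12, 20, "Claim"), (5, 10, "Lead")]
for element in fill_gaps_sketch(toy_elements, text_len=25):
    print(element)
```

A single pass with a cursor handles the leading gap, the gaps between elements, and the trailing gap in the same way, which avoids treating the first and last elements as special cases.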

In [None]:
text_ids = df['id'].unique().tolist()

In [None]:
text_id = text_ids[5]
text = get_text(text_id)
print(text)

In [None]:
# Extract element boundaries and classes with to_records

df_text = df[df['id'] == text_id]
elements = df_text[['discourse_start', 'discourse_end', 'discourse_type']].to_records(index=False).tolist()
elements

In [None]:
# Fill "No class" chunks: beginning and end

initial_idx = 0
final_idx = len(text)

# Add a "No Class" element at the beginning if the first element doesn't start at index 0
new_elements = []
if elements[0][0] != initial_idx:
    starting_element = (0, elements[0][0]-1, 'No Class')
    new_elements.append(starting_element)

    
# Add a "No Class" element at the end if the last element doesn't reach the end of the text
if elements[-1][1] != final_idx:
    closing_element = (elements[-1][1]+1, final_idx, 'No Class')
    new_elements.append(closing_element)
    
elements += new_elements
elements = sorted(elements, key=lambda x: x[0])

# See the first element (it's new, labeled "No Class")
elements

In [None]:
# Add "No Class" elements in between separated elements
new_elements = []
# Compare every element with its predecessor, including the last one
for i in range(1, len(elements)):
    if elements[i][0] != elements[i-1][1] + 1 and elements[i][0] != elements[i-1][1]:
        new_element = (elements[i-1][1] + 1, elements[i][0]-1, 'No Class')
        new_elements.append(new_element)

elements += new_elements
elements = sorted(elements, key=lambda x: x[0])
elements

In [None]:
# Final "fill_gaps" function, wrapping up the above cells

def fill_gaps(elements, text):
    """Add "No Class" elements to a list of elements (see get_elements) """
    initial_idx = 0
    final_idx = len(text)

    # Add a "No Class" element at the beginning if the first element doesn't start at index 0
    new_elements = []
    if elements[0][0] != initial_idx:
        starting_element = (0, elements[0][0]-1, 'No Class')
        new_elements.append(starting_element)


    # Add a "No Class" element at the end if the last element doesn't reach the end of the text
    if elements[-1][1] != final_idx:
        closing_element = (elements[-1][1]+1, final_idx, 'No Class')
        new_elements.append(closing_element)

    elements += new_elements
    elements = sorted(elements, key=lambda x: x[0])

    # Add "No Class" elements in between separated elements
    new_elements = []
    # Compare every element with its predecessor, including the last one
    for i in range(1, len(elements)):
        if elements[i][0] != elements[i-1][1] + 1 and elements[i][0] != elements[i-1][1]:
            new_element = (elements[i-1][1] + 1, elements[i][0]-1, 'No Class')
            new_elements.append(new_element)

    elements += new_elements
    elements = sorted(elements, key=lambda x: x[0])
    return elements


def get_elements(df, text_id, do_fill_gaps=True, text=None):
    """Get a list of (start, end, class) elements for a given text_id"""
    text = get_text(text_id) if text is None else text
    df_text = df[df['id'] == text_id]
    elements = df_text[['discourse_start', 'discourse_end', 'discourse_type']].to_records(index=False).tolist()
    if do_fill_gaps:
        elements = fill_gaps(elements, text)
    return elements

In [None]:
def get_x_samples(df, text_id, do_fill_gaps=True):
    """Create a dataframe of the sentences of the text_id, with columns text, label """
    text = get_text(text_id)
    elements = get_elements(df, text_id, do_fill_gaps, text)
    sentences = []
    for start, end, class_ in elements:
        elem_sentences = nltk.sent_tokenize(text[start:end])
        sentences += [(sentence, class_) for sentence in elem_sentences]
    df = pd.DataFrame(sentences, columns=['text', 'label'])
    df['label'] = df['label'].map(CLASS2ID)
    return df

get_x_samples(df, text_ids[1])

## Build the full dataframe for sentence classification

In [None]:
# This takes a while. I created a dataset with the output here: https://www.kaggle.com/julian3833/feedback-df-sentences
#x = []
#for text_id in tqdm(text_ids):
#    x.append(get_x_samples(df, text_id))

#df_sentences = pd.concat(x)

In [None]:
df_sentences = pd.read_csv("../input/feedback-df-sentences/df_sentences.csv")

In [None]:
df_sentences = df_sentences[df_sentences.text.str.split().str.len() >= 3]
df_sentences.head()

In [None]:
df_sentences.to_csv("df_sentences.csv", index=False)

In [None]:
len(df_sentences)

# Modeling!!!

We will use a `BERT` model and the `Trainer` API from Hugging Face. 

We load the pretrained model from a Kaggle dataset instead of downloading it, because submission notebooks must not have Internet access (a competition restriction).

References:
* https://huggingface.co/docs/transformers/training
* https://huggingface.co/docs/transformers/custom_datasets

In [None]:
MODEL_CHK = "../input/huggingface-bert/bert-base-cased"

NUM_LABELS = 8

NUM_EPOCHS = 3

## HuggingFace Dataset

In [None]:
ds_train = Dataset.from_pandas(df_sentences.iloc[:340000])
ds_val = Dataset.from_pandas(df_sentences.iloc[340000:])

## Tokenize

In [None]:
transformers.logging.set_verbosity_warning() # Silence some annoying logging of HF

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHK)

def preprocess_function(examples):    
    return tokenizer(examples["text"], truncation=True, max_length=326)


# Tokenize the datasets
ds_train_tokenized = ds_train.map(preprocess_function, batched=True)
ds_val_tokenized = ds_val.map(preprocess_function, batched=True)

In [None]:
# Load model
model = AutoModelForSequenceClassification.from_pretrained(MODEL_CHK, num_labels=NUM_LABELS)

### Prepare trainer

In [None]:
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
os.environ['WANDB_DISABLED'] = 'true'

training_args = TrainingArguments(
    output_dir='feedback-classifier',
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=NUM_EPOCHS,
    weight_decay=0.01,
    report_to="none",
    evaluation_strategy="epoch",
    save_strategy="epoch",

)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train_tokenized,
    eval_dataset=ds_val_tokenized,
    tokenizer=tokenizer,
    #data_collator=data_collator,
)

## Train

In [None]:
trainer.train()

In [None]:
trainer.save_model("feedback-bert-trained")

# Submit

We will apply to the test data a process similar to the one we applied to the original training data: we split each text into its sentences.

See the [Evaluation tab](https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation) for details about the `predictionstring` column.
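
As a toy illustration of how sentence-level predictions can be tied to word indices, the sketch below keeps a running word counter across sentences. It assumes that `predictionstring` indexes the words produced by `text.split()`, and it uses pre-split, made-up sentences; `sentence_word_ids` is a hypothetical helper name.

```python
def sentence_word_ids(sentences):
    """Assign consecutive word indices to each sentence, in reading order."""
    all_ids = []
    next_idx = 0  # index of the next unassigned word in the whole text
    for sentence in sentences:
        n_words = len(sentence.split())
        all_ids.append(list(range(next_idx, next_idx + n_words)))
        next_idx += n_words
    return all_ids

# Toy, pre-split sentences (hypothetical data)
toy_sentences = ["Cars are useful.", "They pollute."]
for sentence, ids in zip(toy_sentences, sentence_word_ids(toy_sentences)):
    print(" ".join(map(str, ids)), "<-", sentence)
```

The counter only stays aligned with the essay if the sentence splitter neither drops nor merges whitespace-delimited words, which is worth checking on real essays.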

In [None]:
df_sub = pd.read_csv(SUB_CSV)
df_sub

## Prepare test dataset

In [None]:
def get_test_text(a_id):
    a_file = f"{TEST_PATH}/{a_id}.txt"
    with open(a_file, "r") as fp:
        txt = fp.read()
    return txt

def create_df_test():
    test_ids = [f[:-4] for f in os.listdir(TEST_PATH)]
    test_data = []
    for test_id in test_ids:
        text = get_test_text(test_id)
        sentences = nltk.sent_tokenize(text)
        id_sentences = []
        idx = 0 
        for sentence in sentences:
            id_sentence = []
            words = sentence.split()
            # I created this heuristic for mapping words in sentences to "word indexes"
            # This is not definitive and might have strong drawbacks and problems
            for _ in words:
                id_sentence.append(idx)
                idx += 1
            id_sentences.append(id_sentence)
        test_data += list(zip([test_id] * len(sentences), sentences, id_sentences))
    df_test = pd.DataFrame(test_data, columns=['id', 'text', 'ids'])
    return df_test

In [None]:
df_test = create_df_test()
df_test.head()

In [None]:
ds_test = Dataset.from_pandas(df_test)
ds_test_tokenized = ds_test.map(preprocess_function, batched=True)

## Predict

In [None]:
# Get the predictions!!
test_predictions = trainer.predict(ds_test_tokenized)

In [None]:
# Turn logits into classes
df_test['predictions'] = test_predictions.predictions.argmax(axis=1)

# Turn class ids into class labels
df_test['class'] = df_test['predictions'].map(ID2CLASS)
df_test.head()

For now, we are submitting one row per sentence and not per "element". 

How to convert sentences into "elements" (blocks of sentences) is not clear, since sometimes several consecutive sentences with the same class are flagged as independent "elements".
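
One naive option, shown here only as a sketch, is to merge runs of consecutive sentences that share the same essay id and predicted class into a single row. As noted above, this can be wrong when adjacent same-class sentences are really separate elements, so it is a heuristic and not what this notebook submits. The function name `merge_consecutive` and the toy rows are hypothetical.

```python
from itertools import groupby

def merge_consecutive(rows):
    """Merge adjacent (essay_id, class_name, word_ids) rows with equal id and class."""
    merged = []
    # groupby only groups *adjacent* rows, which is exactly the run-merging we want
    for (essay_id, class_name), group in groupby(rows, key=lambda r: (r[0], r[1])):
        word_ids = [i for _, _, ids in group for i in ids]
        merged.append((essay_id, class_name, " ".join(map(str, word_ids))))
    return merged

# Toy sentence-level predictions (hypothetical data)
toy_rows = [
    ("A", "Lead", [0, 1]),
    ("A", "Lead", [2, 3]),
    ("A", "Claim", [4]),
    ("B", "Claim", [0, 1]),
]
for row in merge_consecutive(toy_rows):
    print(row)
```

Whether merging helps or hurts the score would have to be measured on a validation split; the sketch makes no claim either way.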

In [None]:
# Turn the word ids into the (somewhat weird) `predictionstring` format required
df_test['predictionstring'] = df_test['ids'].apply(lambda x: ' '.join([str(i) for i in x]))
df_test.head()

In [None]:
# Drop "No class" sentences
df_test = df_test[df_test['class'] != 'No Class']
df_test.head()

In [None]:
# And submit!! 🤞🤞 
df_test[['id', 'class', 'predictionstring']].to_csv("submission.csv", index=False)

## Please, _DO_ upvote if you find it useful or interesting!! 
#### I'm very close to becoming a grandmaster and very excited about it 😇🙏