# 📖 Feedback Prize - BERT 🤗 Sentence Classifier

![](https://storage.googleapis.com/kaggle-competitions/kaggle/31779/logos/header.png)
## Please, _DO_ upvote if you find it useful or interesting!! 


# Imports

In [None]:
import os
import nltk
import pandas as pd
from tqdm.auto import tqdm

from datasets import Dataset
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

# A Little Look Into The Data
The dataset folder contains a `train.csv` file with the labels for the text files in the `train` folder. Let's take a look at `train.csv`.

In [None]:
# Constants
train_csv = "../input/feedback-prize-2021/train.csv"
submission_csv = "../input/feedback-prize-2021/sample_submission.csv"
train_text_path = "../input/feedback-prize-2021/train"
test_text_path = "../input/feedback-prize-2021/test"

# Load DF
df = pd.read_csv(train_csv, dtype={'discourse_id': int, 'discourse_start': int, 'discourse_end': int})
df.head()

## From the [Data tab](https://www.kaggle.com/c/feedback-prize-2021/data):

> *  id -                 ID code for essay response
> *  discourse_id -       ID code for discourse element
> *  discourse_start -    character position where discourse element begins in the essay response
> *  discourse_end -      character position where discourse element ends in the essay response
> *  discourse_text -     text of discourse element
> *  discourse_type -     classification of discourse element
> *  discourse_type_num - enumerated class label of discourse element
> *  predictionstring -   the word indices of the training sample, as required for predictions
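
To make the relation between the character offsets and `predictionstring` concrete, here is a minimal sketch on a made-up text. It assumes that word indices come from a plain whitespace `split()` of the essay and that the span starts at a word boundary; the helper name `span_to_word_ids` is hypothetical and not part of the competition data or code.

```python
def span_to_word_ids(text, start, end):
    """Map a character span [start, end) to whitespace-split word indices.

    Assumption: `start` falls on a word boundary, so every word before it
    is fully contained in text[:start].
    """
    n_words_before = len(text[:start].split())
    n_words_inside = len(text[start:end].split())
    return list(range(n_words_before, n_words_before + n_words_inside))


toy_text = "Hello world. This is a test."
# Characters 13..28 cover "This is a test."
word_ids = span_to_word_ids(toy_text, 13, len(toy_text))
predictionstring = " ".join(str(i) for i in word_ids)
print(predictionstring)
```

If a span starts in the middle of a word, this sketch counts that word as "before" the span, so real annotations may need extra care.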


## Check For Null Values 
The check below shows that there are no null values in the dataset.


In [None]:
# No nulls
df.isnull().sum()

## Let's see the first example in some more detail

In [None]:
text_id = df['id'][0]

In [None]:
def get_text(file_id):
    a_file = f"{train_text_path}/{file_id}.txt"
    with open(a_file, "r") as fp:
        txt = fp.read()
    return txt

txt = get_text(text_id)
print(txt)

In [None]:
df_example = df[df['id'] == text_id]
df_example

# Some Visual Things 

In [None]:
# Credits for this part of the visualisation -> https://www.kaggle.com/thedrcat
import spacy
from spacy import displacy
from pylab import cm, matplotlib

colors = {
            'Lead': '#8000ff',
            'Position': '#2b7ff6',
            'Evidence': '#2adddd',
            'Claim': '#80ffb4',
            'Concluding Statement': '#d4dd80',
            'Counterclaim': '#ff8042',
            'Rebuttal': '#ff0000'
         }

def visualize(example, df):
    ents = []
    for i, row in df[df['id'] == example].iterrows():
        ents.append({
                        'start': int(row['discourse_start']), 
                         'end': int(row['discourse_end']), 
                         'label': row['discourse_type']
                    })
        
    with open(f'{train_text_path}/{example}.txt', 'r') as file:
        data = file.read()
    doc2 = {
                "text": data,
                "ents": ents,
                "title": example
            }

    options = {"ents": df.discourse_type.unique().tolist(), "colors": colors}
    displacy.render(doc2, style="ent", options=options, manual=True, jupyter=True)

In [None]:
examples = df['id'].sample(n=3, random_state=42).values.tolist()

for ex in examples:
    visualize(ex,df)
    print('\n')

# Data Distributions
## Discourse Type Distribution

In [None]:
import plotly.express as px
values = df['discourse_type'].value_counts().to_dict()
fig = px.bar(x=list(values.keys()), y=list(values.values()))
fig.update_xaxes(title="Classes")
fig.update_yaxes(title = "Number of Rows")
fig.update_layout(showlegend = True,
    title = {
        'text': 'Discourse Type Distribution ',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

## Discourse Type Num Distribution

In [None]:
import plotly.express as px
import numpy as np
values = df['discourse_type_num'].value_counts().to_dict()
# Use the keys in the same order as x so that each bar gets the color of its own class
fig = px.bar(x=list(values.keys()), y=list(values.values()), color=list(values.keys()))
fig.update_xaxes(title="Classes")
fig.update_yaxes(title = "Number of Rows")
fig.update_layout(showlegend = True,
    title = {
        'text': 'Discourse Type Num Distribution ',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

## Insights

* For some examples, the entire text is densely split into spans of different categories. In some other examples, the annotators omit some words and the splits look very subjective. It's an indicator that annotations may be noisy.
* Order seems to be important: start with the lead, mix claims and evidence, finish with a concluding statement. We may need to incorporate this into our models.
* There may be 2 spans of the same class next to each other - it will be important to separate them!
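
The last point can be checked with a few lines of pandas. This is only a sketch on a made-up toy frame that reuses the `id`, `discourse_start` and `discourse_type` column names from `train.csv`; the values and the `same_as_prev` column are invented for illustration.

```python
import pandas as pd

# Toy data: essay "A" has two consecutive "Claim" spans, essay "B" has none
toy = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "discourse_start": [0, 50, 120, 0, 40],
    "discourse_type": ["Lead", "Claim", "Claim", "Position", "Evidence"],
})

# Order spans inside each essay, then compare each span with the previous one
toy = toy.sort_values(["id", "discourse_start"])
prev_type = toy.groupby("id")["discourse_type"].shift()
toy["same_as_prev"] = toy["discourse_type"] == prev_type

n_adjacent = int(toy["same_as_prev"].sum())
print(n_adjacent)
```

Running the same lines on the real `df` should give a rough idea of how often two spans of the same class sit next to each other.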

# Sentence Classifier with HuggingFace 🤗

# Create a sentence classification dataset

The steps for solving the problem are:

1. Split the texts into sentences (x)
2. Assign each sentence a class (y).
3. Train a normal sequence classifier on those sentences

There are 7 classes, and the labeled sections sometimes span more than one sentence. We will preprocess them so that every sample is a single sentence. That way, for now, we avoid the problem of detecting where an element starts and where it ends.


## Encode classes as ints
Some sections don't belong to any class. We will label them as `No Class` so we can discard those sections and avoid false positives.

In [None]:
id_to_class = dict(enumerate(df['discourse_type'].unique().tolist() + ['No Class']))
class_to_id = {v: k for k, v in id_to_class.items()}
print(id_to_class)
class_to_id



## Dataset functions: `fill_gaps()`, `get_elements()`, and `get_x_samples()`

Here we write the function `fill_gaps`, which adds a `No Class` element wherever the text is not covered by a labeled element. I am leaving the code I used while developing it in the next cells; the condensed functions come after them.
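
Before going through the real cells, here is a compact toy sketch of the gap-filling idea. It is not the notebook's `fill_gaps`: the name `fill_gaps_sketch` is hypothetical, it takes the text length instead of the text, and it assumes half-open `[start, end)` spans, which is a slightly different convention from the one used in the cells that follow.

```python
def fill_gaps_sketch(elements, text_len, filler="No Class"):
    """Insert filler elements wherever [0, text_len) is not covered.

    `elements` is a list of (start, end, label) tuples with half-open spans.
    """
    filled = []
    cursor = 0  # first character not yet covered by any element
    for start, end, label in sorted(elements):
        if start > cursor:
            filled.append((cursor, start, filler))
        filled.append((start, end, label))
        cursor = max(cursor, end)
    if cursor < text_len:
        filled.append((cursor, text_len, filler))
    return filled


toy_elements = [(5, 10, "Lead"), (10, 20, "Claim"), (25, 30, "Evidence")]
filled = fill_gaps_sketch(toy_elements, text_len=35)
for element in filled:
    print(element)
```

The toy input has a gap at the beginning, one in the middle and one at the end, so three filler elements are added.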

In [None]:
text_ids = df['id'].unique().tolist()
text_id = text_ids[5]
text = get_text(text_id)
print(text)

In [None]:
# Extract element boundaries and classes as a list of tuples with to_records
df_text = df[df['id'] == text_id]
elements = df_text[['discourse_start', 'discourse_end', 'discourse_type']].to_records(index=False).tolist()
elements

In [None]:
# Fill "No class" chunks: beginning and end
initial_idx = 0
final_idx = len(text)
# Add an element at the beginning if the first one doesn't start at index 0
new_elements = []
if elements[0][0] != initial_idx:
    starting_element = (0, elements[0][0]-1, 'No Class')
    new_elements.append(starting_element)
# Add an element at the end if the last one doesn't end at the last index
if elements[-1][1] != final_idx:
    closing_element = (elements[-1][1]+1, final_idx, 'No Class')
    new_elements.append(closing_element)
    
elements += new_elements
elements = sorted(elements, key=lambda x: x[0])
# See first element (new)
elements

In [None]:
# Add "No Class" elements in between separated elements
new_elements = []
for i in range(1, len(elements)):  # include the last pair of elements too
    if elements[i][0] != elements[i-1][1] + 1 and elements[i][0] != elements[i-1][1]:
        new_element = (elements[i-1][1] + 1, elements[i][0]-1, 'No Class')
        new_elements.append(new_element)

elements += new_elements
elements = sorted(elements, key=lambda x: x[0])
elements

In [None]:
# Final "fill_gaps" function, wrapping up the above cells
def fill_gaps(elements, text):
    """Add "No Class" elements to a list of elements (see get_elements) """
    initial_idx = 0
    final_idx = len(text)

    # Add an element at the beginning if the first one doesn't start at index 0
    new_elements = []
    if elements[0][0] != initial_idx:
        starting_element = (0, elements[0][0]-1, 'No Class')
        new_elements.append(starting_element)


    # Add an element at the end if the last one doesn't end at the last index
    if elements[-1][1] != final_idx:
        closing_element = (elements[-1][1]+1, final_idx, 'No Class')
        new_elements.append(closing_element)

    elements += new_elements
    elements = sorted(elements, key=lambda x: x[0])

    # Add "No Class" elements in between separated elements
    new_elements = []
    for i in range(1, len(elements)):  # include the last pair of elements too
        if elements[i][0] != elements[i-1][1] + 1 and elements[i][0] != elements[i-1][1]:
            new_element = (elements[i-1][1] + 1, elements[i][0]-1, 'No Class')
            new_elements.append(new_element)

    elements += new_elements
    elements = sorted(elements, key=lambda x: x[0])
    return elements


def get_elements(df, text_id, do_fill_gaps=True, text=None):
    """Get a list of (start, end, class) elements for a given text_id"""
    text = get_text(text_id) if text is None else text
    df_text = df[df['id'] == text_id]
    elements = df_text[['discourse_start', 'discourse_end', 'discourse_type']].to_records(index=False).tolist()
    if do_fill_gaps:
        elements = fill_gaps(elements, text)
    return elements

In [None]:
def get_x_samples(df, text_id, do_fill_gaps=True):
    """Create a dataframe of the sentences of the text_id, with columns text, label """
    text = get_text(text_id)
    elements = get_elements(df, text_id, do_fill_gaps, text)
    sentences = []
    for start, end, class_ in elements:
        elem_sentences = nltk.sent_tokenize(text[start:end])
        sentences += [(sentence, class_) for sentence in elem_sentences]
    df = pd.DataFrame(sentences, columns=['text', 'label'])
    df['label'] = df['label'].map(class_to_id)
    return df

get_x_samples(df, text_ids[1])

## Build the full dataframe for sentence classification

In [None]:
# # This takes a while. I created a dataset with the output here: https://www.kaggle.com/julian3833/feedback-df-sentences
# x = []
# for text_id in tqdm(text_ids):
#    x.append(get_x_samples(df, text_id))

# df_sentences = pd.concat(x)

In [None]:
df_sentences = pd.read_csv("../input/feedback-df-sentences/df_sentences.csv")
df_sentences = df_sentences[df_sentences.text.str.split().str.len() >= 3]
df_sentences.head()

In [None]:
df_sentences.to_csv("df_sentences.csv", index=False)

In [None]:
len(df_sentences)

# Modeling!!!

We will use `BERT` and the `Trainer` API from Hugging Face for this notebook.

The pretrained weights are loaded from a Kaggle dataset instead of being downloaded, because submission notebooks in this competition cannot use the internet.

References:
* https://huggingface.co/docs/transformers/training
* https://huggingface.co/docs/transformers/custom_datasets

In [None]:
MODEL_CHK = "../input/huggingface-bert/bert-base-cased"
NUM_LABELS = 8
NUM_EPOCHS = 2

## HuggingFace Dataset

In [None]:
ds_train = Dataset.from_pandas(df_sentences.iloc[:340000])
ds_val = Dataset.from_pandas(df_sentences.iloc[340000:])

## Tokenize

In [None]:
transformers.logging.set_verbosity_warning() # Silence some annoying logging of HF

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHK)
def preprocess_function(examples):    
    return tokenizer(examples["text"], truncation=True, max_length=256)

# Tokenize the datasets
ds_train_tokenized = ds_train.map(preprocess_function, batched=True)
ds_val_tokenized = ds_val.map(preprocess_function, batched=True)

## Load Model

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_CHK, num_labels=NUM_LABELS)

## Prepare trainer

In [None]:
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
os.environ['WANDB_DISABLED'] = 'true'
training_args = TrainingArguments(
    output_dir='feedback-classifier',
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=NUM_EPOCHS,
    weight_decay=0.01,
    report_to="none",
    evaluation_strategy="epoch",
    save_strategy="epoch",

)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train_tokenized,
    eval_dataset=ds_val_tokenized,
    tokenizer=tokenizer,
    #data_collator=data_collator,
)

## Train

In [None]:
trainer.train()

In [None]:
trainer.save_model("feedback-bert-trained")

# Submit

We will apply a process similar to the one we applied to the original train data, splitting each text into its sentences.

See the [Evaluation tab](https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation) for details about the `predictionstring` column
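
As a small illustration of the idea, the sketch below assigns running word indices to a list of already split sentences and turns them into `predictionstring` values. The sentences are made up, the helper name `running_word_ids` is hypothetical, and it assumes the sentence splitter never cuts a whitespace-delimited word in two, so that the indices line up with a plain `split()` of the full text.

```python
def running_word_ids(sentences):
    """Give every whitespace-split word a running index across sentences."""
    ids_per_sentence = []
    idx = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        ids_per_sentence.append(list(range(idx, idx + n_words)))
        idx += n_words
    return ids_per_sentence


toy_sentences = ["Phones are useful.", "They can also distract drivers."]
toy_ids = running_word_ids(toy_sentences)
toy_predictionstrings = [" ".join(map(str, ids)) for ids in toy_ids]
print(toy_predictionstrings)
```

When the splitter does break a token differently from `split()`, the indices drift, which is one of the drawbacks of this heuristic.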

## Prepare test dataset

In [None]:
TEST_PATH='../input/feedback-prize-2021/test'
def get_test_text(a_id):
    a_file = f"{TEST_PATH}/{a_id}.txt"
    with open(a_file, "r") as fp:
        txt = fp.read()
    return txt

def create_df_test():
    test_ids = [f[:-4] for f in os.listdir(TEST_PATH)]
    test_data = []
    for test_id in test_ids:
        text = get_test_text(test_id)
        sentences = nltk.sent_tokenize(text)
        id_sentences = []
        idx = 0 
        for sentence in sentences:
            id_sentence = []
            words = sentence.split()
            # I created this heuristic for mapping words in sentences to "word indexes"
            # It is not definitive and might have serious drawbacks and problems
            for w in words:
                id_sentence.append(idx)
                idx+=1
            id_sentences.append(id_sentence)
        test_data += list(zip([test_id] * len(sentences), sentences, id_sentences))
    df_test = pd.DataFrame(test_data, columns=['id', 'text', 'ids'])
    return df_test

In [None]:
df_test = create_df_test()
df_test.head()

In [None]:
ds_test = Dataset.from_pandas(df_test)
ds_test_tokenized = ds_test.map(preprocess_function, batched=True)

## Predict

In [None]:
# Get the predictions!!
test_predictions = trainer.predict(ds_test_tokenized)

In [None]:
# Turn logits into classes
df_test['predictions'] = test_predictions.predictions.argmax(axis=1)

# Turn class ids into class labels
df_test['class'] = df_test['predictions'].map(id_to_class)
df_test.head()

For now, we are submitting one row per sentence and not "elements".

How to convert sentences into "elements" (blocks of sentences) is not clear, since sometimes several consecutive sentences with the same class are annotated as independent "elements".
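
Just to show what one option could look like, here is a naive sketch that merges consecutive sentences with the same predicted class into a single row. It is not what this notebook submits, and it assumes that adjacent sentences of the same class always belong to the same element, which, as noted above, is not always true. The toy frame and the function name `merge_adjacent` are made up; the column names follow `df_test`.

```python
import pandas as pd


def merge_adjacent(df_pred):
    """Merge consecutive rows of the same text that share a predicted class."""
    rows = []
    for text_id, group in df_pred.groupby("id", sort=False):
        current_class, current_ids = None, []
        for cls, ids in zip(group["class"], group["ids"]):
            if cls == current_class:
                current_ids.extend(ids)  # same class: grow the current block
            else:
                if current_ids:
                    rows.append((text_id, current_class, " ".join(map(str, current_ids))))
                current_class, current_ids = cls, list(ids)
        if current_ids:
            rows.append((text_id, current_class, " ".join(map(str, current_ids))))
    return pd.DataFrame(rows, columns=["id", "class", "predictionstring"])


toy_pred = pd.DataFrame({
    "id": ["A", "A", "A", "A"],
    "class": ["Lead", "Claim", "Claim", "Evidence"],
    "ids": [[0, 1], [2, 3], [4], [5, 6]],
})
merged = merge_adjacent(toy_pred)
print(merged)
```

Whether merging like this helps or hurts the score would have to be checked against the validation data.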

In [None]:
# Turn the word ids into the space-separated predictionstring the submission requires
df_test['predictionstring'] = df_test['ids'].apply(lambda x: ' '.join([str(i) for i in x]))
df_test.head()

In [None]:
# Drop "No class" sentences
df_test = df_test[df_test['class'] != 'No Class']
df_test.head()

In [None]:
# And submit!! 🤞🤞 
df_test[['id', 'class', 'predictionstring']].to_csv("submission.csv", index=False)
