# Feedback Prize - Evaluating Student Writing
Writing is a critical skill for success. However, less than a third of high school seniors are proficient writers, according to the National Assessment of Educational Progress. Unfortunately, low-income, Black, and Hispanic students fare even worse, with less than 15 percent demonstrating writing proficiency. One way to help students improve their writing is via automated feedback tools, which evaluate student writing and provide personalized feedback.
This notebook identifies elements in student writing. More specifically, it uses NLP to automatically segment texts and classify argumentative and rhetorical elements in essays written by 6th-12th grade students. The essays contain the following elements:
* Lead - an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader’s attention and point toward the thesis
* Position - an opinion or conclusion on the main question
* Claim - a claim that supports the position
* Counterclaim - a claim that refutes another claim or gives an opposing reason to the position
* Rebuttal - a claim that refutes a counterclaim
* Evidence - ideas or examples that support claims, counterclaims, or rebuttals.
* Concluding Statement - a concluding statement that restates the claims

The essays were annotated by expert raters for elements commonly found in argumentative writing. The data were provided by Georgia State University and the Learning Agency Lab.

Description of the data for model training:
* train.zip - folder of individual .txt files, with each file containing the full text of an essay response in the training set
* train.csv - a .csv file containing the annotated version of all essays in the training set
    * id - ID code for essay response
    * discourse_id - ID code for discourse element
    * discourse_start - character position where discourse element begins in the essay response
    * discourse_end - character position where discourse element ends in the essay response
    * discourse_text - text of discourse element
    * discourse_type - classification of discourse element
    * discourse_type_num - enumerated class label of discourse element
    * predictionstring - the word indices of the training sample, as required for predictions
* test.zip - folder of individual .txt files, with each file containing the full text of an essay response in the test set
* sample_submission.csv - file in the required format for making predictions - note that if you are making multiple predictions for a document, submit multiple rows
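
The character offsets (`discourse_start`, `discourse_end`) and the word indices in `predictionstring` describe the same span in two different ways. The sketch below illustrates the relationship under the assumption that words are obtained by splitting the essay on whitespace. The helper `char_span_to_word_indices` and the sample sentence are hypothetical and are not part of the competition data.

```python
def char_span_to_word_indices(text, start, end):
    """Return the indices of the whitespace-separated words overlapping text[start:end]."""
    indices = []
    pos = 0
    for idx, word in enumerate(text.split()):
        # locate this word in the original text, searching from the end of the previous word
        word_start = text.index(word, pos)
        word_end = word_start + len(word)
        if word_start < end and word_end > start:
            indices.append(idx)
        pos = word_end
    return indices


# hypothetical essay and discourse span (assumption, for illustration only)
essay = "Phones are useful. Drivers should not text while driving."
start = essay.index("Drivers")
end = len(essay)

word_indices = char_span_to_word_indices(essay, start, end)
print(" ".join(str(i) for i in word_indices))  # 3 4 5 6 7 8
```

The printed string has the same shape as a `predictionstring` value: word indices separated by spaces.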


In [None]:
%%capture
!pip install umap-learn
!pip install seqeval
!pip install transformers

In [None]:
#import libraries
import os
import pandas as pd
import numpy as np
import random
import math

from IPython.display import display
from matplotlib import pyplot as plt
import seaborn as sns

import spacy
import umap

from tqdm.auto import tqdm

import gc
import copy
pd.set_option('display.max_columns', None)

from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay

import torch
from torch.utils.data import Dataset, DataLoader

import transformers
from transformers import  BertConfig, BertForTokenClassification
from transformers import AutoTokenizer
from torch import cuda

#set random seed
random.seed(9999)

In [None]:
import transformers
from transformers import AutoModel, AutoTokenizer
transformers.__version__

In [None]:
MODEL_PATH = '../input/huggingface-bert/bert-base-cased'

In [None]:
#get the bert tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [None]:
os.listdir('/kaggle/working')

In [None]:
#config cuda device
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

In [None]:

#folder containing essays
train_essays = '/kaggle/input/feedback-prize-2021/train'
test_essays = '/kaggle/input/feedback-prize-2021/test'

#read the training data frame
df = pd.read_csv('/kaggle/input/feedback-prize-2021/train.csv')

submission_df = pd.read_csv('/kaggle/input/feedback-prize-2021/sample_submission.csv')

df.info()

In [None]:
#convert the discourse_start and discourse_end to int 
df['discourse_start'] = df['discourse_start'].astype('int')
df['discourse_end'] = df['discourse_end'].astype('int')

#View first 5 rows of the train df
df.head(5)

In [None]:
print(f"Number of essays in training folder: {len(os.listdir(train_essays))}")
print(f"Number of essays in test folder: {len(os.listdir(test_essays))}")

# **Exploratory Data Analysis**

The training data frame has **144,293** records, while the training folder contains **15,594** essays. Together with the first 5 rows of the data frame, this shows that an essay file (identified by the id column) can have one or more entries in the training data frame. The code below plots a histogram of the number of entries per essay.

In [None]:
ax = df['id'].value_counts().hist()
ax.set_ylabel('count')
ax.set_xlabel('# of times essay in dataframe')

From the above, most essays appear between 4 and 14 times in the data frame.

In [None]:
#code plot distribution of the discourse type
ax = sns.countplot(y='discourse_type', data=df, order = df['discourse_type'].value_counts().index)
ax.set_xlabel("Count")
ax.set_ylabel('')
ax.set_title('Distribution of discourse types')

#list of discourse types
discourse_type_list = df['discourse_type'].unique().tolist()

In [None]:
#pivot table counting the entries for each discourse_type and discourse_type_num
df_ = pd.pivot_table(df, index=['discourse_type', 'discourse_type_num'], values=['id'], aggfunc='count')
df_ = df_.rename(columns= {'id': 'num_of_essays'})
df_.sort_values(['discourse_type', 'num_of_essays'], ascending=[True, False])

From the above, it can be observed that "Claim" is the most frequent discourse type and that, within this group, "Claim 1" has the most entries. Next we will explore the distribution of the discourse length and compare how it varies by discourse type.

In [None]:
df['discourse_len'] = df['discourse_end'] - df['discourse_start'] 
#plot the length computed above (not the end position)
ax = df['discourse_len'].hist()
ax.set_xlabel('Length of Discourse')
ax.set_ylabel('count')

In [None]:
#pivot table with the mean and median discourse length for each discourse_type
df_ = pd.pivot_table(df, index=['discourse_type'], values=['discourse_len'], aggfunc=['mean', 'median'])
df_.reset_index(inplace=True)
df_.columns = [c0 + ('' if c1.strip() == '' else '_') +c1 for c0, c1 in df_.columns]
df_.sort_values('mean_discourse_len', ascending=False).reset_index(drop=True)

From the data it can be observed that the "Claim" discourse type has the shortest discourse length even though it has the most entries. The longest discourse elements are of the "Evidence" type.

We now visualise one randomly sampled essay for each discourse type.

In [None]:
#labels and colors to visualize the essays
color_map = {
                'Evidence': '#40E0D0',
                'Concluding Statement': '#9FE2BF',
                'Lead': '#6495ED',
                'Rebuttal': '#CCCCFF',
                'Counterclaim': '#DFFF00',
                'Position': '#FFBF00',
                'Claim': '#FF7F50'        
             }

def get_essay_text(id_, folder_=train_essays):
    """
    params id_ - essay id
    returns text for this essay id
    """
    
    txt = ""
    #read the essay from the text file
    with open(os.path.join(folder_, id_ + '.txt'), 'r') as file:
        txt = file.read()
    return(txt)
    
def visualize_essay(essay_ids_, df_):
    """
    params essay_ids_ - list of essay id
    params df_ - data frame containing meta data for the essays
    """
   

    for id in essay_ids_:
        #reset the entities for every essay so that spans from previous essays are not carried over
        ents = []

        for i, row in df_[df_['id']==id].iterrows():
            ents.append({
                        'start': row['discourse_start'], 
                         'end': row['discourse_end'],
                        'label': row['discourse_type']}
                        )

        
        doc_ = {
                'text': get_essay_text(id, train_essays),
                'ents': ents,
               }
        
        print(f'Essay ID: {id}')
        print('=========================================================================================')
        options = {"ents": discourse_type_list, "colors": color_map}
        spacy.displacy.render(doc_, style="ent", options=options, manual=True, jupyter=True);
        print('   ')
   

In [None]:
#get sample essay ids to preview  
essay_ids = df.groupby('discourse_type')['id'].sample(n=1, random_state=2).to_list()

visualize_essay(essay_ids, df)


**Visualize embeddings**

This section visualises the essay embeddings using UMAP.

In [None]:
# Load a large spaCy model and disable the pipeline components that are not needed for our task
nlp = spacy.load('en_core_web_lg', disable=["parser", "tagger", "ner"])

In [None]:
#visualize embeddings for a sample of 1000 entries per discourse type
essay_ids =  df.groupby('discourse_type')['id'].sample(n=1000, random_state=2).to_list()


df_ =  df[ df['id'].isin(essay_ids)][['id', 'discourse_type']]

In [None]:

df_['essay'] = df_['id'].apply(get_essay_text)
df_.head(5)

In [None]:
# Get the vector for each essay
spacy_emb = df_['essay'].apply(lambda x: nlp(x).vector)
embeddings = np.vstack(spacy_emb)
embeddings.shape

In [None]:
#use UMAP to reduce the dimensions
model = umap.UMAP()
data = model.fit_transform(embeddings)

#convert to DataFrame
data_ = pd.DataFrame(data, columns=['Dim_1', 'Dim_2'])
data_['label'] = df_['discourse_type'].to_list()

In [None]:
fig = plt.figure(figsize=(16, 10))
groups = data_.groupby('label')
for lbl, group in groups:
    plt.scatter(group['Dim_1'], group['Dim_2'], label=lbl, c=color_map.get(lbl))
#one legend entry per discourse type; a colorbar is not meaningful for categorical colors
plt.legend()
plt.title('Visualization of the Essay Embeddings')

**Define hyper-parameters**

In [None]:

MAX_PASSAGE = 250 #approx half page
BATCH_SIZE = 32
NUM_EPOCHS = 15
LEARNING_RATE = 1e-05
MAX_GRAD_NORM = 10
LABEL_ALL_TOKENS = False
print(f'Max passage : {MAX_PASSAGE}')

**Prepare Data**

Prepare data for training and validation.
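
The data preparation relies on IOB tagging: the first word of a discourse element gets a `B-` tag, the remaining words get an `I-` tag, and words outside any element get `O`. A minimal, self-contained sketch of that idea follows. The helper `spans_to_iob` and the sample spans are hypothetical and only illustrate the scheme.

```python
def spans_to_iob(num_words, spans):
    """Build one IOB tag per word from (label, word_indices) pairs."""
    tags = ["O"] * num_words
    for label, word_indices in spans:
        for n, idx in enumerate(word_indices):
            # first word of the element is tagged B-, the rest I-
            tags[idx] = ("B-" if n == 0 else "I-") + label
    return tags


# hypothetical essay of 6 words with two annotated elements
tags = spans_to_iob(6, [("Position", [0, 1, 2]), ("Claim", [4, 5])])
print(tags)
# ['B-Position', 'I-Position', 'I-Position', 'O', 'B-Claim', 'I-Claim']
```

With 7 discourse types this scheme yields 15 labels in total: one `O` tag plus a `B-` and an `I-` tag per type.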

**Prepare the datasets and dataloaders**

In [None]:
B_TAG = 'B-'
I_TAG = 'I-'
O_TAG = 'O'

#create list of output labels
output_labels  = []

output_labels.append(O_TAG)
for k in color_map.keys():
    output_labels.append(B_TAG + k)
    output_labels.append(I_TAG + k)

#2 Dictionaries: one that maps individual tags to indices, and one that maps indices to their individual tags
labels_to_ids = {v:k for k,v in enumerate(output_labels)}
ids_to_labels = {k:v for k,v in enumerate(output_labels)}

print(labels_to_ids)
print(ids_to_labels)

In [None]:
def prepare_data(df_):
    """
    creates dataframe with essay and IOB tags
    """
    data = None
    prev_id = None
    len_essay = 0

    for i, row in tqdm(df_.iterrows()):
        id_ = row['id']
    
        #get essay if id_ != prev_id
        if id_ != prev_id:
            essay = get_essay_text(id_, train_essays)
            len_essay = len(essay.strip().split())
            
        discourse_ = row['discourse_type']
        predictionstring_ = row['predictionstring'].split()
   
        #initialize the Tag
        ents_ = [ I_TAG + discourse_]* len(predictionstring_)
    
        #the first word of the discourse element gets the beginning tag
        ents_[0] = B_TAG + discourse_
      
        
        assert len(ents_) == len(predictionstring_ )   
    
        data = pd.concat(
                        [
                            data, pd.DataFrame({'id': [id_],
                                            'essay': [essay],
                                             'len_essay' : [len_essay],
                                            'predictionstring0': [','.join(predictionstring_)],
                                            'ents': [','.join(ents_)],
                                           })
                        ]
                        )
    
        prev_id = id_
    
    #clean up the output data frame
    data['discourse_tags'] = data.groupby(['id'])['ents'].transform(lambda x: ','.join(x))
    data['predictionstring'] = data.groupby(['id'])['predictionstring0'].transform(lambda x: ','.join(x))

    #drop the duplicate rows
    data = data[['id', 'essay', 'len_essay', 'discourse_tags', 'predictionstring']].drop_duplicates().reset_index(drop=True)
    
    #add the O tag
    labels = []
    print('adding O tags ........')
    for i, row in tqdm(data.iterrows()):
        label = [O_TAG] * int(row['len_essay'])
    
        for BI_tag, pos in zip(row['discourse_tags'].split(','), row['predictionstring'].split(',')):
            label[int(pos)] = BI_tag
        labels.append(','.join(label))
    
    data['labels'] = labels
    
    return data

In [None]:
data = prepare_data(df)

print(f'Essay Length - max: {np.max(data["len_essay"])}, mean: {np.mean(data["len_essay"]):.2f}')
display(data.head(5))



In [None]:
def split_list_into_chunks(list_, chunk_len_=250):
    """
    splits a list into consecutive chunks of at most chunk_len_ items
    """
    return [list_[i:i+chunk_len_] for i in range(0, len(list_), chunk_len_)]
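
The words and the labels of an essay are chunked with the same chunk length, which keeps them aligned position by position. The sketch below shows this on a tiny example; the function name `split_into_chunks`, the sample sentence, and the chunk length of 4 are assumptions chosen for illustration.

```python
def split_into_chunks(items, chunk_len):
    """Split a list into consecutive chunks of at most chunk_len items."""
    return [items[i:i + chunk_len] for i in range(0, len(items), chunk_len)]


# hypothetical 9-word essay with one tag per word
words = "Schools should start later because students need more sleep".split()
labels = ["B-Position", "I-Position", "I-Position", "I-Position",
          "B-Claim", "I-Claim", "I-Claim", "I-Claim", "I-Claim"]

word_chunks = split_into_chunks(words, 4)
label_chunks = split_into_chunks(labels, 4)

for w, l in zip(word_chunks, label_chunks):
    # every chunk of words has a chunk of labels of the same length
    print(len(w), len(l), w[0], l[0])
```

Note that a chunk boundary can fall inside a discourse element: the last chunk here starts with an `I-` tag, because the `B-` tag of the same element ended up in the previous chunk.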

In [None]:
def split_essay_into_chucks(df_, max_passage_=250):
    """
    splits each essay and its IOB tags into chunks of at most max_passage_ words,
    so that the tokenized passages stay within the 512-token BERT limit
    """
    data = None
    for i, row in tqdm(df_.iterrows()):
        id_ = row['id']
        essay_ = row['essay']
        labels_ = row['labels']
        
        essay_ = essay_.strip().split()
        labels_ = labels_.split(",")
        
        assert len(labels_) == len(essay_)
        
        #split the essay and tags into chunks so that each is shorter than 512 (the BERT maximum)
        passages = split_list_into_chunks(essay_, max_passage_)
        sub_labels_ = split_list_into_chunks(labels_, max_passage_)
        
        #get passage length
        passage_len = [len(passage) for passage in passages ]
        
        passages = [' '.join(passage) for passage in passages ]
        sub_labels_ = [','.join(labs_) for labs_ in  sub_labels_]
        
        
        df_ = pd.DataFrame({ 'essay': passages,
                             'labels': sub_labels_,
                             
                          })
        
        df_['id'] = id_
        
        data = pd.concat([data, df_])
    
    
    #drop the duplicate rows
    if data is not None and not data.empty:
        data = data.drop_duplicates().reset_index(drop=True)
    
    return data

In [None]:
def train_test_split_df(df, frac=0.2):
    # get random sample 
    test = df.sample(frac=frac, axis=0, random_state=1999)

    # get everything but the test sample
    train = df.drop(index=test.index).reset_index(drop=True)
    test = test.reset_index(drop=True)
    
      
    return train, test

In [None]:
train_df, val_df = train_test_split_df(data[['id', 'essay', 'labels']])

print(f'Source data shape: {data.shape}, train_df shape: {train_df.shape}, val_df shape: {val_df.shape} ')
#split the essays into chunks

train_df = split_essay_into_chucks(train_df, MAX_PASSAGE)
val_df = split_essay_into_chucks(val_df, MAX_PASSAGE)

print(f'After splitting in chunks: train_df shape: {train_df.shape}, val_df shape: {val_df.shape} ')

In [None]:
def tokenize_inputs(text_, max_passage_):
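  # add two for the special tokens [CLS] and [SEP]
  # note: max_length counts sub-word tokens, so passages whose words are split
  # into several tokens are truncated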
  input_tokens_ = tokenizer(
                                  text_, 
                                  is_split_into_words=True,
                                  padding='max_length', 
                                  truncation=True, 
                                  max_length= max_passage_ + 2
                                )
  
  return input_tokens_

In [None]:
#define dataset
class passageDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len  
    
  
    def __getitem__(self, index):
        # step 1: get the sentence and word labels 
        essay_ = self.data.essay[index].strip().split()
        discourse_tags_ = self.data.labels[index].split(",")
              
        #step 2: tokenize the passage 
        input_tokens_ = tokenize_inputs(essay_, self.max_len)
      
        #step 3: align tokens to labels
        previous_word_idx = None
        label_ids = []


        for word_idx_ in input_tokens_.word_ids():
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.

            if word_idx_ is None:
                label_ids.append(-100)
            elif word_idx_ != previous_word_idx:
                # the first token of a word gets the label of the word
                label_ids.append(labels_to_ids[discourse_tags_[word_idx_]])
            else:
                # previous word id is the same as this one: the word was split by the tokenizer,
                # so its remaining tokens are ignored (the LABEL_ALL_TOKENS flag is not used here)
                label_ids.append(-100) 

        
            previous_word_idx = word_idx_
        
        # step 4: turn everything into PyTorch tensors
        item_ = {key: torch.as_tensor(val) for key, val in input_tokens_.items()}

        #append the labels_ids to dict    
        item_["labels"] = torch.as_tensor(label_ids)


        return item_
    
    def __len__(self):
        return self.len
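
The label alignment in `__getitem__` can be reproduced without a tokenizer. The sketch below takes a hand-written list of word ids, similar in shape to what a fast tokenizer's `word_ids()` returns (`None` for special and padding tokens, a repeated id for the extra pieces of a split word), and assigns `-100` to every position that should be ignored by the loss. The function name `align_labels` and the sample values are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Return one label id per token; only the first token of each word keeps its label."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # special/padding token, or a further piece of the same word
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# hypothetical sequence: [CLS], word 0, word 1 (two pieces), word 2, [SEP], [PAD]
word_ids = [None, 0, 1, 1, 2, None, None]
aligned = align_labels(word_ids, [3, 4, 4])
print(aligned)
# [-100, 3, 4, -100, 4, -100, -100]
```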

In [None]:
train_dataset = passageDataset(train_df, tokenizer, MAX_PASSAGE)

train_loader = DataLoader(train_dataset,
                              batch_size= BATCH_SIZE,
                              shuffle=True)

val_dataset = passageDataset(val_df, tokenizer, MAX_PASSAGE)

val_loader = DataLoader(val_dataset,
                        batch_size=BATCH_SIZE,
                        shuffle=False)

In [None]:
#preview a passage and its tags
examples = iter(train_loader)
example_data = next(examples)


max_preview = 1
SPECIAL_TOK = [-100] #label id ignored by the loss (special, padding and sub-word tokens)

for i, tokens, lbls in zip(range(max_preview), example_data['input_ids'], example_data['labels']):
  
    print(f'Passage: {tokenizer.decode(tokens)}') 
    print(f'Tokenized Passage: {[tk for tk in tokenizer.convert_ids_to_tokens(tokens)]}')
    print(f'Passage label_ids : {[lb.item() for lb in lbls]}')
    print(f'Passage labels : { [ids_to_labels[lb.item()] for lb in lbls if lb.item() not in SPECIAL_TOK ]}')
    print('------------------------------------------------------------------------------------------------------------------')
    print(f'Length of: tokens {len(tokens)}, labels with special tokens {len(lbls)}')
    print('==================================================================================================================') 
  
    if i > max_preview:
        break

**Define Model**

BERT will be used to predict each word's discourse type.

In [None]:
model = BertForTokenClassification.from_pretrained(MODEL_PATH, num_labels=len(labels_to_ids))
model.to(device)

**Training the model**

In [None]:
#Sanity check of model:  initial loss of your model should be close to -ln(1/number of classes)
inputs = example_data
input_ids = inputs["input_ids"][0].unsqueeze(0)
attention_mask = inputs["attention_mask"][0].unsqueeze(0)
labels = inputs["labels"][0].unsqueeze(0)

input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)
labels = labels.to(device)

print(f'input ids shape: {input_ids.shape}, attention mask shape: {attention_mask.shape}, labels shape: {labels.shape}')

In [None]:
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
initial_loss = outputs.loss
logits = outputs.logits
print(f'Initial loss: {initial_loss:.4f}, logits shape: {logits.shape}, -Log(1/num_output) = {-math.log(1/logits.shape[-1]):.4f}')
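
The sanity check rests on a simple fact: a classifier that assigns the same score to every class has a cross-entropy loss of -ln(1/number of classes). The pure-Python sketch below verifies this for 15 classes, which is the number of labels used here (the `O` tag plus a `B-` and an `I-` tag for each of the 7 discourse types). The `cross_entropy` helper is written out for illustration and is not the loss implementation used by the model.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a single prediction: log-sum-exp of the logits minus the target logit."""
    log_sum_exp = math.log(sum(math.exp(z) for z in logits))
    return log_sum_exp - logits[target]


num_labels = 15
uniform_logits = [0.0] * num_labels  # every class gets the same score

loss = cross_entropy(uniform_logits, target=3)
print(round(loss, 4))                       # 2.7081
print(round(-math.log(1 / num_labels), 4))  # 2.7081
```

An initial loss far above this value would suggest a problem with the labels or the model head rather than with training itself.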


In [None]:
#define optimizer
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

In [None]:
# Define the training function used to fine-tune the BERT model
def train_model(model_, epoch_):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_preds, tr_labels = [], []
    # put model in training mode
    model_.train()
    
    for idx, batch in enumerate(train_loader):
        
        ids = batch['input_ids'].to(device, dtype = torch.long)
        mask = batch['attention_mask'].to(device, dtype = torch.long)
        labels = batch['labels'].to(device, dtype = torch.long)

        outputs = model_(input_ids=ids, attention_mask=mask, labels=labels)
       
        loss = outputs.loss
        tr_logits = outputs.logits
     
        tr_loss += loss.item()

        nb_tr_steps += 1
        nb_tr_examples += labels.size(0)
        
        if idx % 1000 ==0:
            loss_step = tr_loss/nb_tr_steps
            #print(type(ids))
            print(f"Training - Epoch: {epoch_} Step: {nb_tr_steps}/{len(train_loader)} Batch #: {idx} loss: {loss_step}")
           
        # compute training accuracy
        flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model_.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        
        # only compute accuracy at active labels
        active_accuracy = labels.view(-1) != -100 # shape (batch_size * seq_len,)
        
        labels = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)
        
        tr_labels.extend(labels)
        tr_preds.extend(predictions)

        tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
    
        # backward pass
        optimizer.zero_grad()
        loss.backward()

        # gradient clipping (after backward, so that the freshly computed gradients are clipped)
        torch.nn.utils.clip_grad_norm_(
            parameters=model_.parameters(), max_norm=MAX_GRAD_NORM
        )

        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")
    
    return model_
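
Gradient clipping by norm rescales all gradients together when their combined norm exceeds `MAX_GRAD_NORM`. It only has an effect when it is applied to the gradients of the current step, that is after `loss.backward()` and before `optimizer.step()`. The pure-Python sketch below shows the rescaling itself. The function `clip_by_global_norm` is hypothetical, and the small `eps` constant is an assumption added for numerical stability.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale the gradients so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # never scale up: the factor is capped at 1.0
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# norm 50 exceeds the limit of 10, so the gradients are scaled by about 0.2
clipped, norm = clip_by_global_norm([30.0, 40.0], max_norm=10)
print(norm, [round(g, 3) for g in clipped])

# norm 5 is below the limit, so the gradients are unchanged
unchanged, _ = clip_by_global_norm([3.0, 4.0], max_norm=10)
print(unchanged)
```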

In [None]:

model_ = model

for epoch_ in range(NUM_EPOCHS):
    model_ = train_model(model_, epoch_)

    if epoch_ % 5 == 0:
        torch.cuda.empty_cache()
    
model = model_

**Evaluate model**

In [None]:
def validate_model(model, val_loader):
    # put model in evaluation mode
    model.eval()
    
    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []
    
    with torch.no_grad():
        for idx, batch in enumerate(val_loader):
            
            ids = batch['input_ids'].to(device, dtype = torch.long)
            mask = batch['attention_mask'].to(device, dtype = torch.long)
            labels = batch['labels'].to(device, dtype = torch.long)
            
            outputs = model(input_ids=ids, attention_mask=mask, labels=labels)
           
            loss = outputs.loss
            eval_logits = outputs.logits
            
            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += labels.size(0)
        
            if idx % 100==0:
                loss_step = eval_loss/nb_eval_steps
                print(f"Validation loss per 100 evaluation steps: {loss_step}")
              
            # compute evaluation accuracy
            flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
            active_logits = eval_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
            
            # only compute accuracy at active labels
            active_accuracy = labels.view(-1) != SPECIAL_TOK[0] # shape (batch_size * seq_len,)
        
            labels = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)
            
            eval_labels.extend(labels)
            eval_preds.extend(predictions)
            
            tmp_eval_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy

    labels = [ids_to_labels[id.item()] for id in eval_labels]
    predictions = [ids_to_labels[id.item()] for id in eval_preds]
    
    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f"Validation Loss: {eval_loss}")
    print(f"Validation Accuracy: {eval_accuracy}")

    return labels, predictions

In [None]:
labels, predictions = validate_model(model, val_loader)
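
The accuracy reported during validation only counts positions whose label is not `-100`, so special tokens, padding, and the extra pieces of split words do not affect the score. A minimal sketch of that masking on plain Python lists follows. The function `masked_accuracy` and the sample values are hypothetical.

```python
def masked_accuracy(labels, predictions, ignore_index=-100):
    """Share of correct predictions among the positions that carry a real label."""
    pairs = [(l, p) for l, p in zip(labels, predictions) if l != ignore_index]
    correct = sum(1 for l, p in pairs if l == p)
    return correct / len(pairs)


# hypothetical flattened batch: three positions are ignored, four are scored
labels = [-100, 1, 2, -100, 2, 0, -100]
predictions = [5, 1, 2, 7, 3, 0, 0]

accuracy = masked_accuracy(labels, predictions)
print(accuracy)  # 0.75
```

The predictions at the ignored positions (5, 7 and 0 in this example) have no influence on the result.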

**Test model and prepare submission file**

In [None]:
def prepare_passage(passage_, max_passage_=MAX_PASSAGE):
    """
    tokenizes a passage and prepares it for model input
    """
 
    #tokenize the passage 
    input_tokens_ = tokenize_inputs(passage_, max_passage_)
   
    #turn everything into PyTorch tensors
    item_ = {key: torch.as_tensor(val) for key, val in input_tokens_.items()}
   
    #add batch dimension
    input_ids = item_['input_ids'].unsqueeze(0)
    att_mask = item_['attention_mask'].unsqueeze(0)

    return input_ids, att_mask 

In [None]:
def get_prediction_word_pos(id_, predicted_labels_):
    word_pos = []
    pos = []
    dis_course = []

    prev_dis = None

    for i, lb in enumerate(predicted_labels_):
        #
        if lb != O_TAG:
            #print(lb)
            dis_ = lb.split('-')[1]
        
            if prev_dis and prev_dis != dis_:
                #change discourse
                dis_course.append(prev_dis)
                word_pos.append(pos)
                pos = []
        
            pos.append(i)  
            prev_dis = dis_

    if prev_dis: 
        #last discourse
        dis_course.append(prev_dis)
        word_pos.append(pos)

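    #start from an empty frame so that essays without any predicted discourse do not fail
    data = pd.DataFrame(columns=['id', 'class', 'predictionstring', 'discourse_start', 'discourse_end'])

    for dis, w_pos in zip (dis_course, word_pos):
        #predictionstring holds space-separated word indices, as in train.csv
        data = pd.concat(
                    [data, 
                     pd.DataFrame([{'id': id_,
                                    'class': dis, 'predictionstring': ' '.join(str(e) for e in w_pos),
                                    'discourse_start':w_pos[0], 'discourse_end':w_pos[-1]}])
                    ])
    data.reset_index(drop=True, inplace=True)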

    return data
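
Grouping per-word tags back into discourse elements can also use the `B-` prefix as a boundary, which separates two adjacent elements of the same type, and can close an element whenever an `O` tag is reached. The sketch below is a variant written for illustration under those assumptions; it is not the function used in this notebook, and the name `tags_to_spans` is hypothetical.

```python
def tags_to_spans(tags):
    """Group consecutive IOB tags into (label, word_indices) spans."""
    spans = []
    current_label, current_idx = None, []
    for idx, tag in enumerate(tags):
        if tag == "O":
            label, starts_new = None, False
        else:
            prefix, label = tag.split("-", 1)
            # a B- tag or a change of type opens a new element
            starts_new = prefix == "B" or label != current_label
        if current_label is not None and (label is None or starts_new):
            spans.append((current_label, current_idx))
            current_label, current_idx = None, []
        if label is not None:
            if current_label is None:
                current_label = label
            current_idx.append(idx)
    if current_label is not None:
        spans.append((current_label, current_idx))
    return spans


tags = ["B-Claim", "I-Claim", "B-Claim", "I-Claim", "O", "B-Evidence", "I-Evidence"]
spans = tags_to_spans(tags)
for label, word_indices in spans:
    # word indices are joined with spaces, the same shape as predictionstring
    print(label, " ".join(str(i) for i in word_indices))
```

In this example the two adjacent claims are kept as separate elements because the second one starts with a `B-` tag.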

In [None]:
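def predict_passage(model_, passage_):
    """
    predicts one label id per word of the passage (a list of words)
    """
    #tokenize here so that the word ids are available for the alignment below
    input_tokens_ = tokenize_inputs(passage_, MAX_PASSAGE)

    input_id = torch.as_tensor(input_tokens_['input_ids']).unsqueeze(0).to(device, dtype = torch.long)
    att_mask = torch.as_tensor(input_tokens_['attention_mask']).unsqueeze(0).to(device, dtype = torch.long)

    model_.eval()
    with torch.no_grad():
        outputs = model_(input_ids=input_id, attention_mask=att_mask)

    logits = outputs.logits.view(-1, model_.num_labels) # shape (seq_len, num_labels)
    preds = torch.argmax(logits, dim=1).tolist() # one label id per token

    #keep the prediction of the first token of each word;
    #special tokens, padding and the remaining sub-word tokens are skipped
    predictions = []
    previous_word_idx = None
    for pred_, word_idx_ in zip(preds, input_tokens_.word_ids()):
        if word_idx_ is not None and word_idx_ != previous_word_idx:
            predictions.append(pred_)
        previous_word_idx = word_idx_

    #words cut off by truncation get the O tag so that the output has one label per word
    predictions.extend([labels_to_ids[O_TAG]] * (len(passage_) - len(predictions)))

    return predictions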

In [None]:
def run_inference(essay_):
    
    predictions = []
    
    passages = split_list_into_chunks(essay_.strip().split(), MAX_PASSAGE)


    for passage_ in passages:
        preds_ = predict_passage(model_, passage_)

        predictions.extend(preds_)
    
    return predictions


In [None]:
df_test = None

for i, row in submission_df.iterrows():
    id_ = row['id']
    essay = get_essay_text(id_, test_essays)

    #make predictions
    predictions = run_inference(essay) 

    preds = [ids_to_labels[id] for id in predictions]

    #put predictions in data frame
    pds = get_prediction_word_pos(id_, preds)

    df_test = pd.concat([df_test, pds])

df_test.head()

In [None]:
#prepare submission file
df_test[['id', 'class', 'predictionstring']].to_csv('submission.csv', index=False)