# Evaluating Student Writing

## Business Understanding

Writing is a critical skill for success. However, less than a third of high school seniors are proficient writers, according to the National Assessment of Educational Progress. One way to help students improve their writing is via automated feedback tools, which evaluate student writing and provide personalized feedback.

In this task, you need to identify elements in student writing. More specifically, you will automatically segment texts and classify argumentative and rhetorical elements in essays written by 6th-12th grade students.

If successful, you'll make it easier for students to receive feedback on their writing and increase opportunities to improve writing outcomes. Virtual writing tutors and automated writing systems can leverage these algorithms while teachers may use them to reduce grading time. The open-sourced algorithms you come up with will allow any educational organization to better help young writers develop.

## Analytic Approach

This is not a standard classification problem: we cannot classify whole essays directly. First, we need to segment each essay into discrete rhetorical and argumentative elements, and then classify those elements.

We can treat this as a named entity recognition (NER) problem. We will convert the training dataset into a NER token array that we can use to train a NER transformer.
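
To make the idea concrete, here is a minimal sketch of turning character-span annotations into B/I/O labels. It assumes whitespace-separated words stand in for the transformer tokens used later in the notebook; the helper names `word_offsets` and `bio_labels` and the toy essay are hypothetical.

```python
def word_offsets(text):
    """Return (start, end) character offsets of whitespace-separated words."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets


def bio_labels(text, annotations):
    """annotations: list of (char_start, char_end, discourse_type) tuples."""
    offsets = word_offsets(text)
    labels = ["O"] * len(offsets)  # "O" marks words outside every element
    for span_start, span_end, kind in annotations:
        first = True
        for i, (start, end) in enumerate(offsets):
            # A word belongs to the element when it lies fully inside the span.
            if start >= span_start and end <= span_end:
                labels[i] = ("B-" if first else "I-") + kind
                first = False
    return labels


essay = "Phones are useful. They distract drivers."
spans = [(0, 18, "Position"), (19, 41, "Claim")]
labels = bio_labels(essay, spans)
print(list(zip(essay.split(), labels)))
```

The first word of each element gets a B (beginning) label and the remaining words get I (inside) labels, which is what lets a token classifier recover element boundaries.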

## Data Understanding

The data is provided in two formats:

* A train.csv with annotation for essays
* A train folder with individual .txt files for each essay.

In [None]:
#import packages
import numpy as np
import pandas as pd 
import os
import matplotlib.pyplot as plt
#from nltk.corpus import stopwords
import tensorflow as tf
from transformers import AutoTokenizer, AutoConfig, TFAutoModel
### Inspired by Chris Deotte

### Data overview

Let's look at the training data table.

In [None]:
train = pd.read_csv('../input/feedback-prize-2021/train.csv')
train.head()

There are 8 fields in the table:

* id - ID code for essay response
* discourse_id - ID code for discourse element
* discourse_start - character position where discourse element begins in the essay response
* discourse_end - character position where discourse element ends in the essay response
* discourse_text - text of discourse element
* discourse_type - classification of discourse element
* discourse_type_num - enumerated class label of discourse element
* predictionstring - the word indices of the training sample, as required for predictions
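
As a rough illustration of how `predictionstring` relates to the character offsets, the sketch below derives word indices from a character span. It assumes word indices count whitespace-separated words from the start of the essay; the helper `span_to_predictionstring` and the toy essay are hypothetical, and the competition files may differ in edge cases such as spans that start mid-word.

```python
def span_to_predictionstring(text, char_start, char_end):
    """Word indices (whitespace split) covered by text[char_start:char_end]."""
    first_word = len(text[:char_start].split())      # words before the span
    n_words = len(text[char_start:char_end].split())  # words inside the span
    return " ".join(str(i) for i in range(first_word, first_word + n_words))


essay = "Phones are useful. They distract drivers."
ps = span_to_predictionstring(essay, 19, 41)
print(ps)
```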

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
train.isnull().sum()

There are no missing values in the training set.

Let's see how many raw text files and annotations are in the training set.

In [None]:
raw_text_files = os.listdir('/kaggle/input/feedback-prize-2021/train')
print(f'Training data consists of {len(raw_text_files)} texts')
print(f'Training data consists of {train.shape[0]} annotations')
print(f'Each essay contains on average {round(train.shape[0]/len(raw_text_files), 1)} annotations.')

Let's look at the first essay in the data frame.

In [None]:
with open('../input/feedback-prize-2021/train/423A1CA112E2.txt', 'r') as file:
    first_txt = file.read()
print(first_txt)

In [None]:
train[train['id'] == "423A1CA112E2"]

### Texts (essay) overview

In [None]:
text_df = pd.DataFrame(columns = ['id', 'text'])

In [None]:
%%time
texts = []
for file in raw_text_files:
    with open(f'/kaggle/input/feedback-prize-2021/train/{file}') as f:
        texts.append({'id': file[:-4], 'text': f.read()})
        #text_df.append(pd.Series({'id': file[:-4], 'text': f.read()}), ignore_index = True)
texts_df = pd.DataFrame(texts)

In [None]:
texts_df.head()

In [None]:
#count the number of characters and words in each essay
texts_df['len'] = texts_df['text'].apply(len)
texts_df['word_num'] = texts_df['text'].apply(lambda x: len(x.split()))

In [None]:
texts_df['len'].hist(bins = 50, figsize = (12, 8))
plt.title('Number of characters of each essay', fontsize = 15)
plt.xlabel('Number of characters')
plt.ylabel('Frequency');

From the histogram, we can see that most texts have fewer than 5,000 characters, with a few outliers as long as about 17,500 characters.

In [None]:
texts_df['word_num'].hist(bins = 50, figsize = (12, 8))
plt.title('Number of words of each essay', fontsize = 15)
plt.xlabel('Number of words')
plt.ylabel('Frequency');

In [None]:
(texts_df['word_num'] <= 512).sum()/len(texts_df['word_num'])

In [None]:
(texts_df['word_num'] <= 1024).sum()/len(texts_df['word_num'])

From the histogram, we can see that most texts have fewer than 1,000 words. About 73.9% of the texts have at most 512 words, and 99% have at most 1,024 words.

### Length and frequency of each discourse type

In [None]:
train['discourse_type'].value_counts(ascending = True).plot(kind = 'barh')
plt.title('Number of each Discourse Type', fontsize = 16, pad = 15)
plt.ylabel('')
plt.xlabel('Frequency');

The most frequent discourse type is Claim, and the least frequent is Rebuttal.

In [None]:
#add a column with the length (in words) of each discourse element
train["discourse_len"] = train["discourse_text"].apply(lambda x: len(x.split()))

In [None]:
train.groupby('discourse_type')["discourse_len"].mean().sort_values().plot(kind = 'barh')
plt.title('Average length of each Discourse Type', fontsize = 16, pad = 15)
plt.xlabel('Number of words')
plt.ylabel('');

The Evidence type has the highest average word count, and Claim has the lowest. This seems reasonable.

## Modelling

In [None]:
# Token file
LOAD_TOKENS_FROM = '../input/longformerbase4096'

# Pretrained model
DOWNLOADED_MODEL_PATH = '../input/longformerbase4096'

# Model name
#MODEL_NAME = "../input/huggingface-bert/bert-base-cased"

# NER target file
TARGET = '../input/ner-target-for-feedback-prize-competition'

## First, I will make a baseline using BERT

BERT can process at most 512 input tokens, so longer essays would be truncated.

### Tokenize Training set

First, we need to convert the training dataset into a NER token array that we can use to train a NER transformer.

In [None]:
# Make an array of training texts name
IDS = train.id.unique()

In [None]:
MAX_LEN = 1024 # sequence length in tokens; above BERT's 512-token limit, within Longformer's 4096

# THE TOKENS AND ATTENTION ARRAYS
tokenizer = AutoTokenizer.from_pretrained(DOWNLOADED_MODEL_PATH)
train_tokens = np.zeros((len(IDS),MAX_LEN), dtype='int32')
train_attention = np.zeros((len(IDS),MAX_LEN), dtype='int32')

# THE 14 CLASSES FOR NER
lead_b = np.zeros((len(IDS),MAX_LEN))
lead_i = np.zeros((len(IDS),MAX_LEN))

position_b = np.zeros((len(IDS),MAX_LEN))
position_i = np.zeros((len(IDS),MAX_LEN))

evidence_b = np.zeros((len(IDS),MAX_LEN))
evidence_i = np.zeros((len(IDS),MAX_LEN))

claim_b = np.zeros((len(IDS),MAX_LEN))
claim_i = np.zeros((len(IDS),MAX_LEN))

conclusion_b = np.zeros((len(IDS),MAX_LEN))
conclusion_i = np.zeros((len(IDS),MAX_LEN))

counterclaim_b = np.zeros((len(IDS),MAX_LEN))
counterclaim_i = np.zeros((len(IDS),MAX_LEN))

rebuttal_b = np.zeros((len(IDS),MAX_LEN))
rebuttal_i = np.zeros((len(IDS),MAX_LEN))

# HELPER VARIABLES
train_lens = []
targets_b = [lead_b, position_b, evidence_b, claim_b, conclusion_b, counterclaim_b, rebuttal_b]
targets_i = [lead_i, position_i, evidence_i, claim_i, conclusion_i, counterclaim_i, rebuttal_i]
target_map = {'Lead':0, 'Position':1, 'Evidence':2, 'Claim':3, 'Concluding Statement':4,
             'Counterclaim':5, 'Rebuttal':6}

In [None]:
txt = open('../input/feedback-prize-2021/train/423A1CA112E2.txt', 'r').read()
txt

In [None]:
tokens = tokenizer(txt, max_length = MAX_LEN, padding = 'max_length',
                        truncation = True, return_offsets_mapping = True)
offsets = tokens['offset_mapping']


In [None]:
s = 0
for i in offsets:
    if i != (0,0):
        s += 1
s

In [None]:
targets_b[0].shape

In [None]:
# FOR LOOP THROUGH EACH TRAIN TEXT
for id_num in range(len(IDS)):
    
    # Skip target creation when precomputed targets are available in TARGET
    if TARGET:
        break
        
    # READ TRAIN TEXT, TOKENIZE, AND SAVE IN TOKEN ARRAYS    
    
    # loop through each text, store its length to the train_lens
    n = IDS[id_num]
    name = f'../input/feedback-prize-2021/train/{n}.txt'
    txt = open(name, 'r').read()
    #train_lens.append(len(txt.split()))
    
    #tokenize the text
    tokens = tokenizer(txt, max_length = MAX_LEN, padding = 'max_length',
                        truncation = True, return_offsets_mapping = True)
    #save token of the text to the train_tokens array
    train_tokens[id_num,] = tokens['input_ids']
    #save attention mask to the train_attention array
    train_attention[id_num,] = tokens['attention_mask']
    
    # FIND TARGETS IN TEXT AND SAVE IN TARGET ARRAYS
    
    #loop through offset_mapping to assign each token to a class
    offsets = tokens['offset_mapping']
    offset_index = 0
    df = train.loc[train.id == n]
    for index,row in df.iterrows():
        a = row.discourse_start
        b = row.discourse_end
        if offset_index > MAX_LEN - 1:
            break
        c = offsets[offset_index][0] 
        d = offsets[offset_index][1]
        beginning = True
        while b > c:
            if (c >= a) & (b >= d):
                k = target_map[row.discourse_type]
                if beginning:
                    targets_b[k][id_num][offset_index] = 1
                    beginning = False
                else:
                    targets_i[k][id_num][offset_index] = 1
            offset_index += 1
            if offset_index > MAX_LEN - 1:
                break
            c = offsets[offset_index][0]
            d = offsets[offset_index][1]

In [None]:
# build and save the NER targets on the first run; otherwise load the precomputed ones
if not TARGET:
    targets = np.zeros((len(IDS),MAX_LEN,15), dtype='int32')
    for k in range(7):
        targets[:,:,2*k] = targets_b[k]
        targets[:,:,2*k+1] = targets_i[k]
    targets[:,:,14] = 1 - np.max(targets,axis = -1)
    np.save(f'targets_{MAX_LEN}', targets)
    np.save(f'tokens_{MAX_LEN}', train_tokens)
    np.save(f'attention_{MAX_LEN}', train_attention)
    print('Saved NER tokens')
else:
    targets = np.load(f'{TARGET}/targets_{MAX_LEN}.npy')
    train_tokens = np.load(f'{TARGET}/tokens_{MAX_LEN}.npy')
    train_attention = np.load(f'{TARGET}/attention_{MAX_LEN}.npy')
    print('Loaded NER tokens')

## Build Model
We will use a Longformer backbone and add our own NER head, consisting of one hidden layer of size 256 and a final softmax layer. We use 15 classes: a B class and an I class for each of the 7 labels, plus one additional class (called the O class) for tokens that do not belong to any of the other 14 classes.
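
The class layout can be sketched in a few lines. This assumes the same label order as `target_map` and the interleaved B/I layout used when the targets are assembled (B class at `2*k`, I class at `2*k+1`, O class last); the helper names `class_index` and `one_hot` are hypothetical.

```python
LABELS = ["Lead", "Position", "Evidence", "Claim",
          "Concluding Statement", "Counterclaim", "Rebuttal"]
O_CLASS = 2 * len(LABELS)  # index 14: token outside every element


def class_index(label, beginning):
    """B class of label k is 2*k; I class is 2*k + 1."""
    k = LABELS.index(label)
    return 2 * k if beginning else 2 * k + 1


def one_hot(index, n_classes=O_CLASS + 1):
    """One-hot target vector of length 15 for a single token."""
    return [1 if i == index else 0 for i in range(n_classes)]


print(class_index("Evidence", beginning=True))
print(class_index("Evidence", beginning=False))
print(one_hot(O_CLASS))
```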

In [None]:
DOWNLOADED_MODEL_PATH

In [None]:
tokens = tf.keras.layers.Input(shape=(MAX_LEN,), name = 'tokens', dtype=tf.int32)
attention = tf.keras.layers.Input(shape=(MAX_LEN,), name = 'attention', dtype=tf.int32)

config = AutoConfig.from_pretrained(DOWNLOADED_MODEL_PATH+'/config.json') 
backbone = TFAutoModel.from_pretrained(DOWNLOADED_MODEL_PATH+'/tf_model.h5', config=config)

In [None]:

x = backbone(tokens, attention_mask=attention)
x = tf.keras.layers.Dense(256, activation='relu')(x[0])
x = tf.keras.layers.Dense(15, activation='softmax', dtype='float32')(x)

model = tf.keras.Model(inputs=[tokens,attention], outputs=x)
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-4),
              loss = ['categorical_crossentropy'],
              metrics = ['categorical_accuracy'])

In [None]:
del model

In [None]:
model = tf.keras.Model(inputs=[tokens,attention], outputs=x)
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.5e-4),
              loss = ['categorical_crossentropy'],
              metrics = ['categorical_accuracy'])

In [None]:
def build_model():
    
    tokens = tf.keras.layers.Input(shape=(MAX_LEN,), name = 'tokens', dtype=tf.int32)
    attention = tf.keras.layers.Input(shape=(MAX_LEN,), name = 'attention', dtype=tf.int32)
    
    config = AutoConfig.from_pretrained(DOWNLOADED_MODEL_PATH+'/config.json') 
    backbone = TFAutoModel.from_pretrained(DOWNLOADED_MODEL_PATH+'/tf_model.h5', config=config)
    
    x = backbone(tokens, attention_mask=attention)
    x = tf.keras.layers.Dense(256, activation='relu')(x[0])
    x = tf.keras.layers.Dense(15, activation='softmax', dtype='float32')(x)
    
    model = tf.keras.Model(inputs=[tokens,attention], outputs=x)
    model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 1e-4),
                  loss = ['categorical_crossentropy'],
                  metrics = ['categorical_accuracy'])
    
    return model

In [None]:
model = build_model()

In [None]:
# LEARNING RATE SCHEDULE AND MODEL CHECKPOINT
EPOCHS = 4
BATCH_SIZE = 4
LRS = [0.25e-4, 0.1e-4, 0.75e-4, 0.5e-5] 
def lrfn(epoch):
    return LRS[epoch]
lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose = True)
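
Note that `lrfn` indexes `LRS` directly, so training for more epochs than there are entries in the list would raise an `IndexError`. A minimal sketch of a clamped variant is shown below; the name `lrfn_clamped` is hypothetical, and holding the last rate for any extra epochs is an assumption rather than part of the original schedule.

```python
LRS = [0.25e-4, 0.1e-4, 0.75e-4, 0.5e-5]


def lrfn_clamped(epoch, lr=None):
    """Return the rate for `epoch`, reusing the last entry past the end.

    The optional `lr` argument is ignored; it is accepted so the function
    works whether the scheduler passes the current rate or not.
    """
    return LRS[min(epoch, len(LRS) - 1)]


print([lrfn_clamped(e) for e in range(6)])
```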

In [None]:
# TRAIN VALID SPLIT 90% 10%
np.random.seed(6)
train_idx = np.random.choice(np.arange(len(IDS)),int(0.9*len(IDS)),replace=False)
valid_idx = np.setdiff1d(np.arange(len(IDS)),train_idx)
np.random.seed(None)
print('Train size',len(train_idx),', Valid size',len(valid_idx))

In [None]:
train[train['id'].isin(IDS[valid_idx])].shape[0]

In [None]:
train[train['id'].isin(IDS[valid_idx])]

In [None]:
train[train['id'].isin(IDS[valid_idx])]['discourse_type'].value_counts(ascending = True)

In [None]:
train[~train['id'].isin(IDS[train_idx])]['discourse_type'].value_counts(ascending = True)

In [None]:
train[train['id'].isin(IDS[train_idx])]['discourse_type'].value_counts(ascending = True).plot(kind = 'barh')
plt.title('Number of each Discourse Type in Training Set', fontsize = 16, pad = 15)
plt.ylabel('')
plt.xlabel('Frequency');

In [None]:
train[~train['id'].isin(IDS[train_idx])]['discourse_type'].value_counts(ascending = True).plot(kind = 'barh')
plt.title('Number of each Discourse Type in Validation Set', fontsize = 16, pad = 15)
plt.ylabel('')
plt.xlabel('Frequency');

In [None]:
train['discourse_type'].value_counts(ascending = True, normalize = True)/train[train['id'].isin(IDS[train_idx])]['discourse_type'].value_counts(ascending = True, normalize = True)

In [None]:
train[train['id'].isin(IDS[train_idx])]['discourse_type'].value_counts(ascending = True, normalize = True)/train[~train['id'].isin(IDS[train_idx])]['discourse_type'].value_counts(ascending = True, normalize = True)

In [None]:
model.get_config()

In [None]:
model.summary()

In [None]:
history = model.fit(x = [train_tokens[train_idx,], train_attention[train_idx,]],
          y = targets[train_idx,],
          validation_data = ([train_tokens[valid_idx,], train_attention[valid_idx,]],
                         targets[valid_idx,]),
          epochs = 5,
          batch_size = 4,
          verbose = 1)

In [None]:
df = pd.DataFrame(history.history)

In [None]:
df.to_csv('Accuracy 4th training')

In [None]:
df.plot(y = ['categorical_accuracy', 'val_categorical_accuracy'], figsize = (12, 7))
plt.xlabel("Epochs")
plt.ylabel('Accuracy')
plt.title('Accuracy vs. epochs');

In [None]:
df.plot(y = ['loss', 'val_loss'], figsize = (12, 7))
plt.xlabel("Epochs")
plt.ylabel('Loss')
plt.title('Loss vs. epochs');

In [None]:
model.save_weights('long_v4.h5')

In [None]:
LOAD_MODEL_FROM = '../input/trained-model-for-feedback-prize-competition'
VER = 1

In [None]:
if LOAD_MODEL_FROM:
    model.load_weights(f'{LOAD_MODEL_FROM}/long_v{VER}.h5')
    
# OR TRAIN MODEL
else:
    history = model.fit(x = [train_tokens[train_idx,], train_attention[train_idx,]],
              y = targets[train_idx,],
              validation_data = ([train_tokens[valid_idx,], train_attention[valid_idx,]],
                             targets[valid_idx,]),
              callbacks = [lr_callback],
              epochs = EPOCHS,
              batch_size = BATCH_SIZE,
              verbose = 1)

    # SAVE MODEL WEIGHTS
    model.save_weights(f'long_v{VER}.h5')

## Validate Model

We will now make predictions on the validation texts. Our model predicts a label for each token, so we need to convert these predictions into a list of word indices for each label. Note that tokens and words are not the same: a single word may be broken into multiple tokens. Therefore, we first need to create a map from token indices to word indices.
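
The token-to-word map can be illustrated without a tokenizer. The sketch below assumes token offsets are `(char_start, char_end)` pairs, with `(0, 0)` marking special or padding tokens; the sub-word split shown is made up, the helper names are hypothetical, and `str.isspace()` is a broader whitespace test than the four characters checked in the notebook's own code.

```python
def word_starts(text):
    """Character index where each whitespace-separated word begins."""
    starts, in_blank = [], True
    for i, ch in enumerate(text):
        if ch.isspace():
            in_blank = True
        elif in_blank:
            starts.append(i)
            in_blank = False
    return starts


def token_to_word(text, token_offsets):
    """Map each token offset to a word index; -1 for special/padding tokens."""
    starts = word_starts(text) + [float("inf")]  # sentinel closes the last word
    mapping, w = [], 0
    for tok_start, tok_end in token_offsets:
        if tok_end == 0:  # special or padding token
            mapping.append(-1)
            continue
        # Advance to the word whose start is the latest one not after the token.
        while tok_start >= starts[w + 1]:
            w += 1
        mapping.append(w)
    return mapping


text = "unbelievable results"
# Made-up sub-word split: "un" "believ" "able" "results", then one padding token.
offsets = [(0, 2), (2, 8), (8, 12), (13, 20), (0, 0)]
mapping = token_to_word(text, offsets)
print(mapping)
```

Three tokens map to word 0 and one to word 1, so token-level predictions for "unbelievable" collapse to a single word index.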

In [None]:
p = model.predict([train_tokens[valid_idx,], train_attention[valid_idx,]], 
                  batch_size=16, verbose=1)
print('Validation predictions shape:',p.shape)
oof_preds = np.argmax(p,axis=-1)

In [None]:
#turn from class number to class name
target_map_rev = {0:'Lead', 1:'Position', 2:'Evidence', 3:'Claim', 4:'Concluding Statement',
             5:'Counterclaim', 6:'Rebuttal', 7:'blank'}

In [None]:
def get_preds(dataset='train', verbose=True, text_ids = IDS[valid_idx], preds = oof_preds):
    all_predictions = []

    for id_num in range(len(preds)):

        n = text_ids[id_num]
    
        # GET TOKEN POSITIONS IN CHARS
        name = f'../input/feedback-prize-2021/{dataset}/{n}.txt'
        txt = open(name, 'r').read()
        tokens = tokenizer.encode_plus(txt, max_length = MAX_LEN, padding = 'max_length',
                                   truncation = True, return_offsets_mapping = True)
        off = tokens['offset_mapping']
    
        # GET WORD POSITIONS IN CHARS
        w = []
        blank = True
        for i in range(len(txt)):
            if (txt[i]!=' ')&(txt[i]!='\n')&(txt[i]!='\xa0')&(txt[i]!='\x85')&(blank==True):
                w.append(i)
                blank = False
            elif (txt[i]==' ')|(txt[i]=='\n')|(txt[i]=='\xa0')|(txt[i]=='\x85'):
                blank = True
        w.append(1e6)
            
        # MAPPING FROM TOKENS TO WORDS
        word_map = -1 * np.ones(MAX_LEN,dtype='int32')
        w_i = 0
        for i in range(len(off)):
            if off[i][1]==0:
                continue
            while off[i][0]>=w[w_i+1]:
                w_i += 1
            word_map[i] = int(w_i)
        
        # CONVERT TOKEN PREDICTIONS INTO WORD LABELS
        ### KEY: ###
        # 0: LEAD_B, 1: LEAD_I
        # 2: POSITION_B, 3: POSITION_I
        # 4: EVIDENCE_B, 5: EVIDENCE_I
        # 6: CLAIM_B, 7: CLAIM_I
        # 8: CONCLUSION_B, 9: CONCLUSION_I
        # 10: COUNTERCLAIM_B, 11: COUNTERCLAIM_I
        # 12: REBUTTAL_B, 13: REBUTTAL_I
        # 14: NOTHING i.e. O
        ### NOTE THESE VALUES ARE DIVIDED BY 2 IN NEXT CODE LINE
        pred = preds[id_num,]/2.0
    
        i = 0
        while i < MAX_LEN:
            prediction = []
            start = pred[i]
            if start in [0,1,2,3,4,5,6,7]:
                prediction.append(word_map[i])
                i += 1
                if i >= MAX_LEN:
                    break
                while pred[i] == start + 0.5:
                    if not word_map[i] in prediction:
                        prediction.append(word_map[i])
                    i += 1
                    if i >= MAX_LEN:
                        break
            else:
                i += 1
            prediction = [x for x in prediction if x!=-1]
            if len(prediction) > 4:
                all_predictions.append((n, target_map_rev[int(start)], 
                                ' '.join([str(x) for x in prediction])))
                
    # MAKE DATAFRAME
    df = pd.DataFrame(all_predictions)
    df.columns = ['id','class','predictionstring']
    
    return df
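
The decoding step inside `get_preds` can be read in isolation as the sketch below. It assumes class ids follow the key in the comments (even ids are B classes, the next odd id is the matching I class, 14 is O) and keeps only spans of at least five words, mirroring the `len(prediction) > 4` filter. The helper `decode_spans` is hypothetical, and it skips the O class explicitly instead of relying on the length filter.

```python
def decode_spans(class_ids, word_map, min_words=5):
    """Group per-token class ids (0..14) into (label_index, word_indices) spans."""
    spans, i, n = [], 0, len(class_ids)
    while i < n:
        cid = class_ids[i]
        if cid % 2 == 0 and cid < 14:  # a B class opens a span
            words = [word_map[i]]
            i += 1
            # Extend the span while the matching I class follows.
            while i < n and class_ids[i] == cid + 1:
                if word_map[i] not in words:
                    words.append(word_map[i])
                i += 1
            words = [w for w in words if w != -1]  # drop padding positions
            if len(words) >= min_words:
                spans.append((cid // 2, words))
        else:
            i += 1  # O class, or an I class with no preceding B
    return spans


class_ids = [14, 6, 7, 7, 7, 7, 7, 14, 2, 3]
word_map = [-1, 0, 1, 1, 2, 3, 4, 5, 6, 7]
decoded = decode_spans(class_ids, word_map)
print(decoded)
```

Here the Claim span (label index 3) survives with five words, while the two-word Position span is discarded as too short.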

In [None]:
oof = get_preds(dataset = 'train', verbose = True, text_ids = IDS[valid_idx])
oof.head()

## Compute Validation Metric

In [None]:
# CODE FROM : Rob Mulla @robikscube
# https://www.kaggle.com/robikscube/student-writing-competition-twitch
def calc_overlap(row):
    """
    Calculates the overlap between prediction and
    ground truth and overlap percentages used for determining
    true positives.
    """
    set_pred = set(row.predictionstring_pred.split(' '))
    set_gt = set(row.predictionstring_gt.split(' '))
    # Length of each and intersection
    len_gt = len(set_gt)
    len_pred = len(set_pred)
    inter = len(set_gt.intersection(set_pred))
    overlap_1 = inter / len_gt
    overlap_2 = inter/ len_pred
    return [overlap_1, overlap_2]


def score_feedback_comp(pred_df, gt_df):
    """
    A function that scores for the kaggle
        Student Writing Competition
        
    Uses the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = gt_df[['id','discourse_type','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df = pred_df[['id','class','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df['pred_id'] = pred_df.index
    gt_df['gt_id'] = gt_df.index
    # Step 1. all ground truths and predictions for a given class are compared.
    joined = pred_df.merge(gt_df,
                           left_on=['id','class'],
                           right_on=['id','discourse_type'],
                           how='outer',
                           suffixes=('_pred','_gt')
                          )
    joined['predictionstring_gt'] = joined['predictionstring_gt'].fillna(' ')
    joined['predictionstring_pred'] = joined['predictionstring_pred'].fillna(' ')

    joined['overlaps'] = joined.apply(calc_overlap, axis=1)

    # 2. If the overlap between the ground truth and prediction is >= 0.5, 
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    joined['overlap1'] = joined['overlaps'].apply(lambda x: x[0])
    joined['overlap2'] = joined['overlaps'].apply(lambda x: x[1])


    joined['potential_TP'] = (joined['overlap1'] >= 0.5) & (joined['overlap2'] >= 0.5)
    joined['max_overlap'] = joined[['overlap1','overlap2']].max(axis=1)
    tp_pred_ids = joined.query('potential_TP') \
        .sort_values('max_overlap', ascending=False) \
        .groupby(['id','predictionstring_gt']).first()['pred_id'].values

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    fp_pred_ids = [p for p in joined['pred_id'].unique() if p not in tp_pred_ids]

    matched_gt_ids = joined.query('potential_TP')['gt_id'].unique()
    unmatched_gt_ids = [c for c in joined['gt_id'].unique() if c not in matched_gt_ids]

    # Get numbers of each type
    TP = len(tp_pred_ids)
    FP = len(fp_pred_ids)
    FN = len(unmatched_gt_ids)
    #calc microf1
    my_f1_score = TP / (TP + 0.5*(FP+FN))
    return my_f1_score

In [None]:
# VALID DATAFRAME
valid = train.loc[train['id'].isin(IDS[valid_idx])]

In [None]:
f1s = []
CLASSES = oof['class'].unique()

print('Validation F1_score:')
for c in CLASSES:
    pred_df = oof.loc[oof['class'] == c].copy()
    gt_df = valid.loc[valid['discourse_type'] == c].copy()
    f1 = score_feedback_comp(pred_df, gt_df)
    print(c + ':', round(f1, 3))
    f1s.append(f1)
print()
print('Overall',round(np.mean(f1s), 3))
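
The matching rule and the F1 formula used by the scoring function can be checked by hand on a toy pair. The sketch below assumes non-empty prediction strings; the helper names `overlaps`, `is_match`, and `f1` and the word indices are made up for illustration.

```python
def overlaps(pred_string, gt_string):
    """Return (share of ground truth covered, share of prediction covered)."""
    pred, gt = set(pred_string.split()), set(gt_string.split())
    inter = len(pred & gt)
    return inter / len(gt), inter / len(pred)


def is_match(pred_string, gt_string, threshold=0.5):
    """A prediction matches when both overlap ratios reach the threshold."""
    o_gt, o_pred = overlaps(pred_string, gt_string)
    return o_gt >= threshold and o_pred >= threshold


def f1(tp, fp, fn):
    """F1 expressed with counts: TP / (TP + 0.5 * (FP + FN))."""
    return tp / (tp + 0.5 * (fp + fn))


pair = overlaps("3 4 5 6", "4 5 6 7 8 9")
print(pair, is_match("3 4 5 6", "4 5 6 7 8 9"))
print(f1(tp=1, fp=1, fn=1))
```

The prediction shares three words with the ground truth, giving overlaps of 3/6 and 3/4, so it counts as a true positive. With one true positive, one false positive, and one false negative, the score is 0.5.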

## Infer Test Data
We will now run inference on the test data and create a submission.

In [None]:
# GET TEST TEXT IDS
files = os.listdir('../input/feedback-prize-2021/test')
TEST_IDS = [f.replace('.txt','') for f in files if 'txt' in f]
print('There are',len(TEST_IDS),'test texts.')

In [None]:
# CONVERT TEST TEXT TO TOKENS
test_tokens = np.zeros((len(TEST_IDS), MAX_LEN), dtype='int32')
test_attention = np.zeros((len(TEST_IDS), MAX_LEN), dtype='int32')

for id_num in range(len(TEST_IDS)):
        
    # READ TEST TEXT, TOKENIZE, AND SAVE IN TOKEN ARRAYS    
    n = TEST_IDS[id_num]
    name = f'../input/feedback-prize-2021/test/{n}.txt'
    txt = open(name, 'r').read()
    tokens = tokenizer.encode_plus(txt, max_length=MAX_LEN, padding='max_length',
                                   truncation=True, return_offsets_mapping=True)
    test_tokens[id_num,] = tokens['input_ids']
    test_attention[id_num,] = tokens['attention_mask']

In [None]:
# INFER TEST TEXTS
p = model.predict([test_tokens, test_attention], 
                  batch_size=16, verbose=2)
print('Test predictions shape:',p.shape)
test_preds = np.argmax(p,axis=-1)

## Write Submission CSV

In [None]:
# GET TEST PREDICTIONS
sub = get_preds( dataset='test', verbose=False, text_ids=TEST_IDS, preds=test_preds )
sub.head()

In [None]:
sub

In [None]:
# WRITE SUBMISSION CSV
sub.to_csv('submission.csv',index=False)