# LEFTMOSTSEG MODEL BASELINE

Title: Neural Sequence Segmentation as Determining the Leftmost Segments <br>
Author: Li, Yangming and Liu, Lemao and Yao, Kaisheng <br>
Booktitle: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics <br>
Publisher: Association for Computational Linguistics

### SETUP

In [1]:
# LeftmostSeg setup
!cp -R ../input/leftmostseg/LeftmostSeg ./

In [2]:
import os
import random
import json
from pathlib import Path
from glob import glob

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import torch
from torch.optim import Adam

# LeftmostSeg
from LeftmostSeg.misc import fix_random_seed
from LeftmostSeg.utils import (
    setUpVocab, getSegmentBounds, getDataLoader, Procedure, FeedbackPrizeDataset
)
from LeftmostSeg.model import LeftmostSeg

In [3]:
# Logging
import logging
# Set logger
logging.basicConfig(format='[%(asctime)s %(levelname)s] %(message)s', datefmt='%d-%b-%Y %H:%M')
log = logging.getLogger(__name__)
# Only the last setLevel call takes effect, so set the level once.
log.setLevel(logging.DEBUG)

### UTILITIES

### DATA

In [4]:
# Fix random state.
randomState = 0
fix_random_seed(randomState)

In [5]:
# Paths
baseDirectory = Path('../input/feedback-prize-2021')
dfBase = pd.read_csv(
    baseDirectory / 'train.csv', 
    dtype={'discourse_start': int, 'discourse_end': int}
)
# Broken labels.
brokenLabels = ['DBBF3EF47E93', '96948C0AFC15', '15F434699355', '92C09304882D']
# Remove broken label instances.
dfBase = dfBase[~dfBase.id.isin(brokenLabels)]
# Get segment labels.
dfBase['segment_bounds'] = dfBase['predictionstring'].apply(getSegmentBounds)
# Add paths.
dfBase['text_path'] = dfBase['id'].apply(lambda id_: baseDirectory / f'train/{id_}.txt')
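
The `segment_bounds` column above comes from `getSegmentBounds`, whose implementation is not shown in this notebook. As a minimal sketch, assuming `predictionstring` is a space-separated list of word indices and that the bounds are simply the first and last index, a hypothetical stand-in (`segment_bounds_sketch` is an illustrative name, not part of LeftmostSeg) might look like this:

```python
from typing import Tuple


def segment_bounds_sketch(prediction_string: str) -> Tuple[int, int]:
    """Return (first, last) word index of a segment.

    Assumption: the input looks like "3 4 5 6", one word index per token.
    The real getSegmentBounds may use a different convention (for example
    an exclusive end index).
    """
    indices = [int(token) for token in prediction_string.split()]
    if not indices:
        raise ValueError("predictionstring is empty")
    return min(indices), max(indices)


print(segment_bounds_sketch("3 4 5 6"))  # (3, 6)
```

Whether the end index is inclusive matters for the downstream span labels, so it is worth checking against the library code before relying on this sketch.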

#### What is 20% of the current dataset?

In [6]:
# Hold out 20% of the essays (by id) for validation.
nVal = int(len(dfBase.id.unique()) * 0.2)
print(f"20% of dataset is: {nVal}")
# The validation essays.
valIds = np.random.choice(dfBase.id.unique(), nVal, replace=False)
valIds

20% of dataset is: 3118


array(['A143BE759AD2', '16364DA86C3D', '77351280D0B0', ...,
       'FF01B32BAC3E', 'FBB35C3EF339', '24B659FCEC04'], dtype=object)

In [7]:
# Train columns
trainColumns = ['id', 'text_path', 'discourse_id', 'discourse_type', 'predictionstring', 'segment_bounds']
# Create the trainSet, valSet
trainSet = dfBase[~dfBase.id.isin(valIds)][trainColumns].copy(deep=True).reset_index(drop=True)
valSet = dfBase[dfBase.id.isin(valIds)][trainColumns].copy(deep=True).reset_index(drop=True)

In [8]:
print(f"Shapes: {trainSet.shape}, {valSet.shape}")
print(f"No of unique Ids: {len(trainSet.id.unique())}, {len(valSet.id.unique())}")

Shapes: (115315, 6), (28926, 6)
No of unique Ids: 12472, 3118
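
Because the split is drawn over essay ids rather than rows, every segment of an essay lands on the same side of the split. A minimal sketch of the same idea using only the standard library, with a local seeded generator instead of NumPy's global state, is given here. `split_ids` and its parameters are hypothetical names, not part of LeftmostSeg or of this notebook:

```python
import random
from typing import Iterable, List, Tuple


def split_ids(ids: Iterable[str], val_fraction: float = 0.2,
              seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split unique ids into disjoint train and validation lists."""
    # Sort so the result does not depend on the input order.
    unique_ids = sorted(set(ids))
    n_val = int(len(unique_ids) * val_fraction)
    # A local generator keeps the split independent of other random calls.
    rng = random.Random(seed)
    val_ids = set(rng.sample(unique_ids, n_val))
    train_ids = [i for i in unique_ids if i not in val_ids]
    return train_ids, sorted(val_ids)


# Toy example: each essay id appears twice, as it would with several rows.
toy_ids = [f"ESSAY{i}" for i in range(10)] * 2
train_ids, val_ids = split_ids(toy_ids)
assert not set(train_ids) & set(val_ids)  # no essay is in both sets
print(len(train_ids), len(val_ids))  # 8 2
```

The disjointness assertion is the useful part: the same check on `trainSet.id` and `valSet.id` would confirm that no essay leaks across the split.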


### MODELING PRE-STEPS

In [9]:
checkpointDir = './checkpoints'

In [10]:
# Parameters
epochs = 10
batchSize = 1
wordEmbeddingDim = 128
labelEmbeddingDim = 16
encHiddenDim = 128
decHiddenDim = 512
dropoutRate = 0.3

In [11]:
# Set up the vocabulary.
lexicalVocab, labelVocab = setUpVocab(dfBase)

100%|██████████| 15590/15590 [00:55<00:00, 279.02it/s]


In [12]:
# Data loaders. 
trainLoader = getDataLoader(trainSet, batchSize, True)
valLoader = getDataLoader(valSet, batchSize, False)

### MODEL

In [13]:
model = LeftmostSeg(
    lexicalVocab, 
    labelVocab, 
    wordEmbeddingDim,
    labelEmbeddingDim, 
    encHiddenDim,
    decHiddenDim, 
    dropoutRate
)

In [14]:
if torch.cuda.is_available():
    model = model.cuda()
optimizer = Adam(model.parameters(), weight_decay=1e-6)

In [15]:
best_val = 0.0
savePath = os.path.join(checkpointDir, "model.pt")
# Create the checkpoint directory if it does not exist yet.
os.makedirs(checkpointDir, exist_ok=True)

In [16]:
# Note: range(0, epochs + 1) runs epochs + 1 passes (epochs 0 through 10 here).
for epoch_idx in range(0, epochs + 1):
    train_loss, train_time = Procedure.train(model, trainLoader, optimizer)
    print("[Epoch {:3d}] loss on train set is {:.5f} using {:.3f} secs".format(epoch_idx, train_loss, train_time))

    val_score, val_time = Procedure.evaluate(model, valLoader)
    print("(Epoch {:3d}) f1 score on dev set is {:.5f} using {:.3f} secs".format(epoch_idx, val_score, val_time))
        
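    if val_score > best_val:
        best_val = val_score
        # The score logged here is the validation (dev) f1, despite the "test score" wording.
        print("\n<Epoch {:3d}> save the model with test score, {:.5f}, in terms of dev".format(epoch_idx, val_score))
        # torch.save(model, ...) serializes the entire model object, not just its weights.
        torch.save(model, savePath)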
    print(end="\n\n")

100%|███████| 12472/12472 [17:11<00:00, 12.09it/s]


[Epoch   0] loss on train set is 38968.67180 using 1031.892 secs


100%|█████████| 3118/3118 [03:49<00:00, 13.59it/s]


(Epoch   0) f1 score on dev set is 0.41043 using 230.398 secs

<Epoch   0> save the model with test score, 0.41043, in terms of dev




100%|███████| 12472/12472 [18:11<00:00, 11.42it/s]


[Epoch   1] loss on train set is 27920.19113 using 1091.915 secs


100%|█████████| 3118/3118 [03:51<00:00, 13.46it/s]


(Epoch   1) f1 score on dev set is 0.45478 using 232.849 secs

<Epoch   1> save the model with test score, 0.45478, in terms of dev




100%|███████| 12472/12472 [17:38<00:00, 11.79it/s]


[Epoch   2] loss on train set is 25260.12666 using 1058.210 secs


100%|█████████| 3118/3118 [03:47<00:00, 13.72it/s]


(Epoch   2) f1 score on dev set is 0.47955 using 228.181 secs

<Epoch   2> save the model with test score, 0.47955, in terms of dev




100%|███████| 12472/12472 [17:43<00:00, 11.73it/s]


[Epoch   3] loss on train set is 23823.04415 using 1063.573 secs


100%|█████████| 3118/3118 [03:49<00:00, 13.61it/s]


(Epoch   3) f1 score on dev set is 0.48277 using 230.167 secs

<Epoch   3> save the model with test score, 0.48277, in terms of dev




100%|███████| 12472/12472 [17:32<00:00, 11.85it/s]


[Epoch   4] loss on train set is 22793.26798 using 1052.069 secs


100%|█████████| 3118/3118 [03:52<00:00, 13.40it/s]


(Epoch   4) f1 score on dev set is 0.48954 using 233.665 secs

<Epoch   4> save the model with test score, 0.48954, in terms of dev




100%|███████| 12472/12472 [17:36<00:00, 11.81it/s]


[Epoch   5] loss on train set is 21979.98522 using 1056.209 secs


100%|█████████| 3118/3118 [03:58<00:00, 13.08it/s]


(Epoch   5) f1 score on dev set is 0.50143 using 239.523 secs

<Epoch   5> save the model with test score, 0.50143, in terms of dev




100%|███████| 12472/12472 [17:46<00:00, 11.69it/s]


[Epoch   6] loss on train set is 21157.05986 using 1066.692 secs


100%|█████████| 3118/3118 [03:48<00:00, 13.67it/s]


(Epoch   6) f1 score on dev set is 0.49546 using 229.054 secs




100%|███████| 12472/12472 [17:29<00:00, 11.88it/s]


[Epoch   7] loss on train set is 20582.78415 using 1049.559 secs


100%|█████████| 3118/3118 [03:47<00:00, 13.71it/s]


(Epoch   7) f1 score on dev set is 0.50066 using 228.385 secs




100%|███████| 12472/12472 [17:39<00:00, 11.78it/s]


[Epoch   8] loss on train set is 20037.11506 using 1059.007 secs


100%|█████████| 3118/3118 [03:37<00:00, 14.35it/s]


(Epoch   8) f1 score on dev set is 0.49978 using 218.421 secs




100%|███████| 12472/12472 [18:04<00:00, 11.50it/s]


[Epoch   9] loss on train set is 19418.43376 using 1084.065 secs


100%|█████████| 3118/3118 [03:58<00:00, 13.06it/s]


(Epoch   9) f1 score on dev set is 0.50621 using 239.752 secs

<Epoch   9> save the model with test score, 0.50621, in terms of dev




100%|███████| 12472/12472 [18:07<00:00, 11.47it/s]


[Epoch  10] loss on train set is 18964.49135 using 1087.554 secs


100%|█████████| 3118/3118 [03:57<00:00, 13.13it/s]


(Epoch  10) f1 score on dev set is 0.50319 using 238.721 secs
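
The log above can be summarized programmatically. As a small sketch, assuming the dev lines keep the exact format printed by the training loop, the following parses them and returns the best epoch. `dev_scores` and `best_epoch` are hypothetical helper names, and the sample text is copied from three lines of the log:

```python
import re
from typing import List, Tuple

# Matches lines such as "(Epoch   9) f1 score on dev set is 0.50621 using ..."
DEV_LINE = re.compile(r"\(Epoch\s+(\d+)\) f1 score on dev set is ([0-9.]+)")


def dev_scores(log_text: str) -> List[Tuple[int, float]]:
    """Extract (epoch, dev f1) pairs from the captured training log."""
    return [(int(epoch), float(score))
            for epoch, score in DEV_LINE.findall(log_text)]


def best_epoch(log_text: str) -> Tuple[int, float]:
    """Return the (epoch, dev f1) pair with the highest dev f1."""
    return max(dev_scores(log_text), key=lambda pair: pair[1])


sample_log = """
(Epoch   5) f1 score on dev set is 0.50143 using 239.523 secs
(Epoch   9) f1 score on dev set is 0.50621 using 239.752 secs
(Epoch  10) f1 score on dev set is 0.50319 using 238.721 secs
"""
print(best_epoch(sample_log))  # (9, 0.50621)
```

On the full log this agrees with the last checkpoint message: the saved model corresponds to epoch 9, with a dev f1 of 0.50621.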


