# Identifying PII in Student Essays
## Project Summary
The Kaggle competition we are participating in is [PII Data Detection, hosted by The Learning Agency Lab](https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/overview). The goal of the competition is to develop a model that detects sensitive personally identifiable information (PII) in student writing. This is needed to screen and clean educational data so that the risk to students is mitigated when the data is released to the public for analysis and archival.

## Cloning Repo
Because one of the files is larger than 100 MiB, it could not be uploaded directly to the GitHub repo. The solution was to use Git Large File Storage (Git LFS) to hold the file and to commit the Git LFS pointer file in place of the JSON.

Git Bash Code:
```
# install git lfs
git lfs install

# start file tracking for git lfs in the repo
git lfs track "*.json"

# stage/commit/push training json
git add train.json
git commit -m "add train.json"
git push
```
A fresh local clone of the repo contains the Git LFS pointer file, not the data file itself.

Git Bash Code:
```
# pull file from git lfs system into local repo using any pointer files
git lfs pull
```
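Before loading `train.json`, it can help to confirm that `git lfs pull` really replaced the pointer with the data. The following is a minimal sketch, assuming the standard Git LFS pointer format (a small text file whose first line begins with `version https://git-lfs.github.com/spec/v1`). The helper name `is_lfs_pointer` is hypothetical and not part of this repo.

```python
from pathlib import Path

# First bytes of a Git LFS pointer file (assumed from the Git LFS pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` still holds an LFS pointer instead of real data."""
    with Path(path).open("rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

If `is_lfs_pointer("../Datasets/Official/train.json")` returns `True`, the data has not been pulled yet and `pd.read_json` would fail on the pointer text.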


## External Data Sources

* [Persuade PII Dataset](https://www.kaggle.com/datasets/thedrcat/persuade-pii-dataset?rvi=1)
  * Essays from the Persuade corpus, modified with synthetic PII data and corresponding labels. The set was filtered for essays that contain tokens relevant to the competition.

* [PII | External Dataset](https://www.kaggle.com/datasets/alejopaullier/pii-external-dataset?rvi=1)
  * This is an LLM-generated external dataset that contains generated texts with their corresponding annotated labels in the required competition format.

* [NEW DATASET PII Data Detection](https://www.kaggle.com/datasets/cristaliss/new-dataset-pii-data-detection?rvi=1)
  * This dataset is a modified version of the official training data with the following changes: revamped labels, token transformation, and token indexing.

* [PII Detection Dataset (GPT)](https://www.kaggle.com/datasets/pjmathematician/pii-detection-dataset-gpt)
  * Personal data was created with the Python Faker package and then fed into an LLM to write an essay around it. Overall, it contains 2000 GPT-generated essays and the corresponding competition entities used in each essay.

* [AI4privacy-PII](https://www.kaggle.com/datasets/verracodeguacas/ai4privacy-pii)
  * The dataset is crafted using proprietary algorithms, ensuring the creation of synthetic data that avoids privacy violations. The data is meticulously curated with human-in-the-loop validation, ensuring both relevance and quality. It serves a crucial role in addressing the growing concerns around personal data security in AI applications.



## Python Libraries


In [20]:
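import warnings

# silence DeprecationWarning messages emitted while importing these libraries
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    import pandas as pd
    import numpy as np
    import spacy as sp
    import re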

## Loading Datasets


### Official training data

Only load into notebook after pulling from git LFS (see above)

In [2]:
# df_train = pd.read_json("../Datasets/Official/train.json")
# df_train

### Official testing data


In [3]:
df_test = pd.read_json("../Datasets/Official/test.json")
df_test

Unnamed: 0,document,full_text,tokens,trailing_whitespace
0,7,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,...","[True, True, True, True, False, False, True, F..."
1,10,Diego Estrada\n\nDesign Thinking Assignment\n\...,"[Diego, Estrada, \n\n, Design, Thinking, Assig...","[True, False, False, True, True, False, False,..."
2,16,Reporting process\n\nby Gilberto Gamboa\n\nCha...,"[Reporting, process, \n\n, by, Gilberto, Gambo...","[True, False, False, True, True, False, False,..."
3,20,Design Thinking for Innovation\n\nSindy Samaca...,"[Design, Thinking, for, Innovation, \n\n, Sind...","[True, True, True, False, False, True, False, ..."
4,56,Assignment: Visualization Reflection Submitt...,"[Assignment, :, , Visualization, , Reflecti...","[False, False, False, False, False, False, Fal..."
5,86,Cheese Startup - Learning Launch ​by Eladio Am...,"[Cheese, Startup, -, Learning, Launch, ​by, El...","[True, True, True, True, True, True, True, Fal..."
6,93,Silvia Villalobos\n\nChallenge:\n\nThere is a ...,"[Silvia, Villalobos, \n\n, Challenge, :, \n\n,...","[True, False, False, False, False, False, True..."
7,104,Storytelling The Path to Innovation\n\nDr Sak...,"[Storytelling, , The, Path, to, Innovation, \...","[True, False, True, True, True, False, False, ..."
8,112,Reflection – Learning Launch\n\nFrancisco Ferr...,"[Reflection, –, Learning, Launch, \n\n, Franci...","[True, True, True, False, False, True, False, ..."
9,123,Gandhi Institute of Technology and Management ...,"[Gandhi, Institute, of, Technology, and, Manag...","[True, True, True, True, True, True, False, Tr..."


## Cleaning
To give the model a uniform input, each source dataframe needs a list of tokens from the source text in each row.

### Official training dataset
Verify that there are no rows with any null values

In [9]:
# df_train[df_train.isnull().any(axis = 1)]

### Official test dataset
Verify that there are no rows with any null values

In [10]:
df_test[df_test.isnull().any(axis = 1)]

Unnamed: 0,document,full_text,tokens,trailing_whitespace


# Building the model framework

### Loading and Cleaning Datasets

In [None]:
import pandas as pd
import numpy as np
import spacy as sp
import re
import os
import transformers
import torch
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


import torch.nn as nn
from torch import cuda
from torch.utils.data import Dataset, DataLoader

from transformers import BertTokenizerFast, BertConfig, BertForTokenClassification, AutoTokenizer



import warnings
warnings.filterwarnings('ignore')

### Config

This section does the initial configuration: it loads the pretrained BERT model that we use and sets hyperparameters such as `EPOCHS` and `LEARNING_RATE`.

In [29]:
class Config():
    def __init__(self, platform, model_name, pretrained_model_name):
        # platform = 'Kaggle'# 
        if platform == 'Kaggle':
            # on Kaggle, the pretrained weights are read from an attached dataset
            pretrained_model = '../input/huggingface-bert/' + pretrained_model_name + '/'
            train_path = 'path_TBD/train.json'
            # test_path = ''
            model_path = 'path_TBD/' + model_name
        elif platform == 'local':
            # locally, the pretrained model is looked up by name
            pretrained_model = pretrained_model_name
            model_path = '../models/bert_models/' + model_name
        else:
            raise ValueError(f"Unknown platform: {platform!r}")
        
        self.config = {
            'MAX_LEN': 128,
            'TRAIN_BATCH_SIZE': 4,
            'VALID_BATCH_SIZE': 2,
            'EPOCHS': 1,
            'LEARNING_RATE':1e-5,
            'MAX_GRAD_NORM': 10,
            'device': 'cuda' if torch.cuda.is_available() else 'cpu',
            # 'device': 'cpu',
            'model_path': model_path,
            'pretrained_model': pretrained_model,
            'tokenizer': BertTokenizerFast.from_pretrained(pretrained_model)
        }

In [None]:
platform = 'local'
pretrained_model_name = 'bert-large-cased'
model_num = 1
# model_num is an int, so format it into the name rather than concatenating it
model_name = f'model{model_num}_{pretrained_model_name}.bin'

config = Config(platform, model_name, pretrained_model_name).config

Requires:
```git lfs track "*.safetensors"```
to clone the model from GitHub.

If using an IDE, run
```
python -m spacy download en_core_web_sm
```
in Bash to install the English spaCy pipeline.

## Official Datasets

### Data and Exploration

In [None]:
df = pd.read_json("../Datasets/Official/train.json")
df

Unnamed: 0,document,full_text,tokens,trailing_whitespace,labels
0,7,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,...","[True, True, True, True, False, False, True, F...","[O, O, O, O, O, O, O, O, O, B-NAME_STUDENT, I-..."
1,10,Diego Estrada\n\nDesign Thinking Assignment\n\...,"[Diego, Estrada, \n\n, Design, Thinking, Assig...","[True, False, False, True, True, False, False,...","[B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O, O..."
2,16,Reporting process\n\nby Gilberto Gamboa\n\nCha...,"[Reporting, process, \n\n, by, Gilberto, Gambo...","[True, False, False, True, True, False, False,...","[O, O, O, O, B-NAME_STUDENT, I-NAME_STUDENT, O..."
3,20,Design Thinking for Innovation\n\nSindy Samaca...,"[Design, Thinking, for, Innovation, \n\n, Sind...","[True, True, True, False, False, True, False, ...","[O, O, O, O, O, B-NAME_STUDENT, I-NAME_STUDENT..."
4,56,Assignment: Visualization Reflection Submitt...,"[Assignment, :, , Visualization, , Reflecti...","[False, False, False, False, False, False, Fal...","[O, O, O, O, O, O, O, O, O, O, O, O, B-NAME_ST..."
...,...,...,...,...,...
6802,22678,EXAMPLE – JOURNEY MAP\n\nTHE CHALLENGE My w...,"[EXAMPLE, –, JOURNEY, MAP, \n\n, THE, CHALLENG...","[True, True, True, False, False, True, True, F...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
6803,22679,Why Mind Mapping?\n\nMind maps are graphical r...,"[Why, Mind, Mapping, ?, \n\n, Mind, maps, are,...","[True, True, False, False, False, True, True, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
6804,22681,"Challenge\n\nSo, a few months back, I had chos...","[Challenge, \n\n, So, ,, a, few, months, back,...","[False, False, False, True, True, True, True, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
6805,22684,Brainstorming\n\nChallenge & Selection\n\nBrai...,"[Brainstorming, \n\n, Challenge, &, Selection,...","[False, False, True, True, False, False, True,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


In [None]:
# df[df.isnull().any(axis = 1)]

In [2]:
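from collections import Counter

# count how often each label occurs across all documents
c = Counter()
for doc_labels in df.labels:
    c.update(doc_labels)
# drop the most common label ('O') so that only PII labels remain
c_pii = c.most_common()[1:]
c_key, c_val = zip(*c_pii)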

NameError: name 'df' is not defined

In [0]:
c_pii

[('B-NAME_STUDENT', 1365),
 ('I-NAME_STUDENT', 1096),
 ('B-URL_PERSONAL', 110),
 ('B-ID_NUM', 78),
 ('B-EMAIL', 39),
 ('I-STREET_ADDRESS', 20),
 ('I-PHONE_NUM', 15),
 ('B-USERNAME', 6),
 ('B-PHONE_NUM', 6),
 ('B-STREET_ADDRESS', 2),
 ('I-URL_PERSONAL', 1),
 ('I-ID_NUM', 1)]

Showing frequency of each label

In [ ]:
plt.barh(c_key, c_val)
plt.show()

In [None]:
labels_to_ids = {k: v for v, k in enumerate(c.keys())}
ids_to_labels = {v: k for v, k in enumerate(c.keys())}
labels_to_ids

{'O': 0,
 'B-NAME_STUDENT': 1,
 'I-NAME_STUDENT': 2,
 'B-URL_PERSONAL': 3,
 'B-EMAIL': 4,
 'B-ID_NUM': 5,
 'I-URL_PERSONAL': 6,
 'B-USERNAME': 7,
 'B-PHONE_NUM': 8,
 'I-PHONE_NUM': 9,
 'B-STREET_ADDRESS': 10,
 'I-STREET_ADDRESS': 11,
 'I-ID_NUM': 12}
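The label names follow the BIO convention: a `B-` label opens an entity, `I-` labels continue it, and `O` marks tokens outside any entity. As a minimal sketch of how token-level labels map back to whole entities, the hypothetical helper `bio_spans` below (not part of the notebook) groups consecutive tokens. It joins tokens with single spaces and ignores `trailing_whitespace`, which is a simplification.

```python
def bio_spans(tokens, labels):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            # continuation of the entity that is currently open
            current_tokens.append(token)
            continue
        # any other label closes the open entity, if there is one
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):
            # "B-" starts an entity; a stray "I-" is treated as a start too
            current_type, current_tokens = entity_type, [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["by", "Gilberto", "Gamboa", "\n\n", "Challenge"]
labels = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O"]
print(bio_spans(tokens, labels))  # [('NAME_STUDENT', 'Gilberto Gamboa')]
```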

### Preprocessing



First, we make sure that the token list and the label list have the same length for each document. The second line reports the fraction of documents that fail this check and are dropped.

In [ ]:
df_usable = df[df.tokens.apply(len) == df.labels.apply(len)]
1 - len(df_usable) / len(df)

0.06184809754664311

Replace special characters (non-breaking spaces, zero-width spaces, and a private-use bullet glyph) in the document text with regular spaces, and drop the tokens that contain them

In [None]:
pattern = re.compile('\xa0|\uf0b7|\u200b')
df.loc[:,'full_text'] = df.loc[:,'full_text'].replace(pattern, ' ')
df.loc[:,'tokens'] = df.loc[:,'tokens'].apply(lambda line: [tok for tok in line if not re.search(pattern,tok)])
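The token filter above removes entries from `tokens` only. If the labels at the same positions are not removed as well, the two lists end up with different lengths, which may be what the length check in this section is catching. A minimal sketch of a filter that keeps both lists aligned, assuming the same pattern as the notebook; the helper name `drop_special_tokens` is hypothetical:

```python
import re

# same special characters as the notebook's pattern
PATTERN = re.compile("\xa0|\uf0b7|\u200b")


def drop_special_tokens(tokens, labels):
    """Drop tokens that match PATTERN together with their labels."""
    kept = [(tok, lab) for tok, lab in zip(tokens, labels)
            if not PATTERN.search(tok)]
    if not kept:
        return [], []
    kept_tokens, kept_labels = zip(*kept)
    return list(kept_tokens), list(kept_labels)


tokens, labels = drop_special_tokens(
    ["Name", "\xa0", "Diego"], ["O", "O", "B-NAME_STUDENT"]
)
print(tokens, labels)  # ['Name', 'Diego'] ['O', 'B-NAME_STUDENT']
```

Applied row by row to the dataframe, this would leave every document with equally long `tokens` and `labels` lists, so fewer documents would need to be discarded.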

Here we chunk the token and label arrays of each text into smaller arrays of at most `MAX_LEN` items

In [None]:
def make_smaller_inputs(dataframe):
    max_len = config['MAX_LEN']
    rows = []
    
    for _,line in dataframe.iterrows():
        tokens = line.tokens
        labels = line.labels
        
        # slice each document into consecutive windows of at most max_len items
        for i in range(0,len(tokens),max_len):
            rows.append({'tokens': tokens[i:i+max_len],
                         'labels': labels[i:i+max_len]})
            
    return pd.DataFrame(rows, columns = ['tokens','labels'])

In [None]:
df_model_input = make_smaller_inputs(df_usable)
df_model_input.head()

Unnamed: 0,tokens,labels
0,"[Design, Thinking, for, innovation, reflexion,...","[O, O, O, O, O, O, O, O, O, B-NAME_STUDENT, I-..."
1,"[significant, material, investment, and, can, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,"[starting, point, , generates, ideas, /, work...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,"[the, series, , of, questions, according, to,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,"[images, and, interconnections, ., This, secon...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


In [None]:
len(df_model_input.index)

39146
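Fixed-size chunking can cut an entity in half when a `B-` token lands at the end of one chunk and its `I-` tokens at the start of the next. The sketch below is a variation that is not used in this notebook: it adds an optional `overlap` so that neighbouring chunks share some tokens. With `overlap=0` it produces the same windows as `make_smaller_inputs`. The function name and the `overlap` parameter are assumptions made for illustration.

```python
def chunk_with_overlap(tokens, labels, max_len, overlap=0):
    """Split parallel token/label lists into windows of at most max_len items."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append((tokens[start:start + max_len],
                       labels[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


toy_tokens = list("abcdefg")
toy_labels = ["O"] * 7
print([c[0] for c in chunk_with_overlap(toy_tokens, toy_labels, max_len=3)])
print([c[0] for c in chunk_with_overlap(toy_tokens, toy_labels, max_len=3, overlap=1)])
```

With overlap, tokens in the shared region receive more than one prediction, so a rule for reconciling them would be needed at inference time.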

### Creating the training and testing split

In [None]:
df_train, df_test = train_test_split(df_model_input.sample(n=10000), test_size=0.2)

### Formatting split

In [None]:
df_train.reset_index(drop = True, inplace=True)
df_train

Unnamed: 0,tokens,labels
0,"[on, the, proper, to, identify, Oversight, Org...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,"[how, important, we, must, be, when, choosing,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,"[2019, \n\n, •, , The, 4500, managers, who, n...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,"[very, important, to, reinforce, your, underst...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,"[broke, up, with, her, and, it, still, pains, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
...,...,...
7995,"[step, and, help, us, to, publish, our, produc...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
7996,"[Challenge, \n\n, I, am, part, of, the, produc...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
7997,"[of, their, ideas, ., , Application, :, , ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
7998,"[,, since, here, we, could, simply, just, summ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


In [None]:
df_test.reset_index(drop = True, inplace=True)
df_test

Unnamed: 0,tokens,labels
0,"[ , what, woww, and, what, works, ., \n\n, Ins...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,"[Reflection, -, Storytelling, \n\n, Challenge,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,"[cover, and, content, it, ’s, a, most, challen...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,"[test, new, ideas, drive, from, information, g...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,"[is, also, fact, ., Therefore, ,, I, believe, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
...,...,...
1995,"[clear, picture, of, what, we, have, and, what...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1996,"[had, find, out, what, was, the, big, , probl...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1997,"[a, more, intimate, way, ., We, are, not, just...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1998,"[here, was, a, digital, shared, platform, to, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


# Model


adapted from 
https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb#scrollTo=Eh3ckSO0YMZW

In [None]:
class dataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, index):
        # step 1: get the sentence and word labels 
        # (positional lookup, so the dataframe index does not need to be 0..n-1)
        tokens = self.data.tokens.iloc[index]
        word_labels = self.data.labels.iloc[index]

        # step 2: use tokenizer to encode sentence (includes padding/truncation up to max length)
        # BertTokenizerFast provides a handy "return_offsets_mapping" functionality for individual tokens
        encoding = self.tokenizer(tokens,
                                  is_split_into_words=True,
                                  return_offsets_mapping=True,
                                  padding='max_length',
                                  truncation=True,
                                  max_length=self.max_len)

        # step 3: create token labels only for first word pieces of each tokenized word
        labels = [ labels_to_ids[label] for label in word_labels]
        # code based on https://huggingface.co/transformers/custom_datasets.html#tok-ner
        # create an empty array of -100 of length max_length
        encoded_labels = np.ones(len(encoding["offset_mapping"]), dtype=int) * -100

        # set only labels whose first offset position is 0 and the second is not 0
        i = 0
        for idx, mapping in enumerate(encoding["offset_mapping"]):
            if mapping[0] == 0 and mapping[1] != 0:
                # overwrite label
                encoded_labels[idx] = labels[i]
                i += 1

        # step 4: turn everything into PyTorch tensors
        item = {key: torch.as_tensor(val) for key, val in encoding.items()}
        item['labels'] = torch.as_tensor(encoded_labels)

        return item

    def __len__(self):
        return self.len
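The offset-based loop in `__getitem__` advances its label counter only when a word produces a first word piece. If a word yields no word pieces at all (which may happen for whitespace-only tokens such as `\n\n`), every later label would shift by one position. A minimal sketch of an alternative alignment, assuming a `word_ids` list like the one that fast tokenizers return from `encoding.word_ids()`; the helper name `align_labels` is hypothetical:

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def align_labels(word_ids, word_label_ids):
    """Label the first sub-token of each word and mark everything else as ignored.

    word_ids: one entry per sub-token with the index of its source word,
              or None for special and padding tokens.
    word_label_ids: one integer label per source word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None or word_id == previous_word:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_label_ids[word_id])
        previous_word = word_id
    return aligned


# [CLS], "Over", "##sight", (word 1 produced no sub-tokens), "Diego", [SEP], [PAD]
word_ids = [None, 0, 0, 2, None, None]
print(align_labels(word_ids, [0, 0, 1]))  # [-100, 0, -100, 1, -100, -100]
```

Because each label is looked up by word index rather than by a running counter, a skipped word cannot push the remaining labels out of position.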

In [None]:
training_set = dataset(df_train, config['tokenizer'], config['MAX_LEN'])
testing_set = dataset(df_test, config['tokenizer'], config['MAX_LEN'])

In [None]:
training_set[2]["input_ids"].unsqueeze(0)

tensor([[  101, 10351,   794,  1109, 10181,  1568, 11493,  1150,  1444,  3972,
          1132,  1155,  1166,  1103,  1362,  1105,  2936,  1483,   794,  1247,
          1110,  1136,  4788,  1111,   786,  1339,  1106,  1339,   787,  2013,
           794, 23070,  1116,  1132,  3600,   117,  1105,  1412,  1159, 27135,
          1106,  4821,  1147,  2209,  1110,  2609,   124,   119, 10997, 27258,
          9741, 13821, 24805,   131,  1752,   117,  1606,  7934,  2713,   113,
         18012,   125,   114,   117,   146,  1899,  1114,  1317,  2501, 26027,
           117, 12859, 16811, 21773, 16409, 17786,  1116,   113, 19293,  2036,
           787,   188,   114,  1105, 11493,  1150,  1138,  1640,  1151,  3972,
          1113,  1103,  1671,  1449,   119,  2397, 13032, 11646,  1114,  1103,
         19293,  2036,   787,   188,  1105, 13942,  1104,  1103,  1933,   117,
          1195,  1138,   124,  3209, 19129,  1106,  9474,  1412,  4506,   119,
           146,  2063,  1103,  9681, 26738,  1902,  

Here is one training example, with its input IDs converted back into word-piece tokens and printed next to the label ID assigned to each one (-100 marks positions that the loss ignores)

In [None]:
for token, label in zip(config['tokenizer'].convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
    print('{0:10}  {1}'.format(token, label))

[CLS]       -100
on          0
the         0
proper      0
to          0
identify    0
Over        0
##sight     -100
Organizations  0
,           0
resembling  0
Congress    0
,           0
GA          0
##O         -100
,           0
labour      0
unions      0
,           0
support     0
groups      0
,           0
and         0
alternative  0
entities    0
that        0
will        0
add         0
constraints  0
to          0
however     0
the         0
organization  0
operates    0
.           0
The         0
straw       0
man         0
graphic     0
is          0
easy        0
,           0
victim      0
##ization   -100
simple      0
shapes      0
and         0
icons       0
obtain      0
##able      -100
in          0
V           0
##isi       -100
##o         -100
or          0
on          0
the         0
Internet    0
.           0
If          0
the         0
front       0
-           0
stage       0
/           0
back        0
-           0
stage       0
read        0
does  

Setting training and testing parameters

In [None]:
train_params = {'batch_size': config["TRAIN_BATCH_SIZE"],
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': config["VALID_BATCH_SIZE"],
               'shuffle': True,
               'num_workers': 0
               }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

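Below we load the pretrained BERT weights into a token classification model. The classification head on top (`classifier.weight` and `classifier.bias`) is not part of the checkpoint, so it starts with random weights and biases.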

In [None]:
model = BertForTokenClassification.from_pretrained(config['pretrained_model'], num_labels=len(labels_to_ids))
model.to(config['device'])

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-large-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), 

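The expected loss of the initial model is -ln(1/N), where N is the number of classes, because an untrained classifier assigns roughly equal probability to every class.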

In [None]:
-np.log(1/len(labels_to_ids))

2.5649493574615367
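The same number can be reproduced without the model. As a small sketch in plain Python (the function name `cross_entropy` is introduced here for illustration), the cross-entropy of a prediction that gives every class the same logit is exactly ln(N):

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[target]


num_classes = 13  # number of entries in labels_to_ids
uniform_logits = [0.0] * num_classes
print(cross_entropy(uniform_logits, target=0))  # equals ln(13)
```

An initial loss close to this value suggests the freshly initialized classification head is not yet favouring any label.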

In [None]:
inputs = training_set[2]
input_ids = inputs["input_ids"].unsqueeze(0)
attention_mask = inputs["attention_mask"].unsqueeze(0)
labels = inputs["labels"].unsqueeze(0)

input_ids = input_ids.to(config['device'])
attention_mask = attention_mask.to(config['device'])
labels = labels.to(config['device'])

outputs = model(input_ids.long(), attention_mask=attention_mask.long(), labels=labels.long())
initial_loss = outputs[0]
initial_loss

tensor(2.5633, grad_fn=<NllLossBackward0>)

Limit of my comprehension as of 3/7

### Building training loop

Below we define the optimizer (Adam). The loss itself is computed inside the model whenever `labels` are passed to it. Then we define the training loop function.

In [None]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=config['LEARNING_RATE'])

In [None]:
# Defining the training function on the 80% of the dataset for tuning the bert model
def train(epoch):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_preds, tr_labels = [], []
    # put model in training mode
    model.train()

    for idx, batch in enumerate(training_loader):

        ids = batch['input_ids'].to(config['device'], dtype = torch.long)
        mask = batch['attention_mask'].to(config['device'], dtype = torch.long)
        labels = batch['labels'].to(config['device'], dtype = torch.long)

        outputs = model(input_ids=ids.long(), attention_mask=mask.long(), labels=labels.long())
        loss = outputs[0]
        tr_logits = outputs[1]
        tr_loss += loss.item()

        nb_tr_steps += 1
        nb_tr_examples += labels.size(0)

        if idx % 100==0:
            loss_step = tr_loss/nb_tr_steps
            print(f"Training loss per 100 training steps: {loss_step}")

        # compute training accuracy
        flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        
        # only compute accuracy at active labels
        active_accuracy = labels.view(-1) != -100 # shape (batch_size * seq_len,)
        #active_labels = torch.where(active_accuracy, labels.view(-1), torch.tensor(-100).type_as(labels))

        labels = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)

        tr_labels.extend(labels)
        tr_preds.extend(predictions)

        tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy

        # backward pass
        optimizer.zero_grad()
        loss.backward()

        # gradient clipping (after backward(), so it acts on this step's gradients)
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=config['MAX_GRAD_NORM']
        )

        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")
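Gradient clipping rescales the gradients produced by `loss.backward()`, so it only has an effect when it runs after the backward pass and before `optimizer.step()`. The sketch below shows, in plain Python on a flat list of numbers, roughly what norm-based clipping computes. It is an illustration under that assumption and not the PyTorch implementation; the name `clip_by_global_norm` is hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # gradients that are already small enough keep a scale of 1.0
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([30.0, 40.0], max_norm=10)
print(norm_before)  # 50.0
print(clipped)      # approximately [6.0, 8.0]
```

The direction of the gradient is preserved; only its length is capped at `MAX_GRAD_NORM`.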

### Determine whether to run training

In [None]:
run = False

In [None]:
if run:
    for epoch in range(config['EPOCHS']):
        print(f"Training epoch: {epoch + 1}")
        train(epoch)
    run = False

KeyError: 9637
KeyError: 2326
KeyError: 10126

This creates a directory to save and store trained models

In [None]:
import os

directory = config['model_path']

# create the output directory if it does not exist yet
os.makedirs(directory, exist_ok=True)

# save vocabulary of the tokenizer
config['tokenizer'].save_vocabulary(directory)
# save the model weights and its configuration file
save_model = False
if save_model:  
    model.save_pretrained(directory)
    print('All files saved')

In [None]:
model1 = BertForTokenClassification.from_pretrained(config['model_path'], num_labels=len(labels_to_ids))
model1.to(config['device'])

OSError: ../models/bert_models/model2bert-large-cased.bin does not appear to have a file named config.json. Checkout 'https://huggingface.co/../models/bert_models/model2bert-large-cased.bin/main' for available files.

### Validation

Below we validate the model by testing it with the test data

In [None]:
def valid(model, testing_loader):
    # put model in evaluation mode
    model.eval()
    
    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []
    
    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):
            
            ids = batch['input_ids'].to(config['device'], dtype = torch.long)
            mask = batch['attention_mask'].to(config['device'], dtype = torch.long)
            labels = batch['labels'].to(config['device'], dtype = torch.long)

            outputs = model(input_ids=ids.long(), attention_mask=mask.long(), labels=labels.long())
            
            loss = outputs[0]
            eval_logits = outputs[1]
            eval_loss += loss.item()
            
            nb_eval_steps += 1
            nb_eval_examples += labels.size(0)
        
            if idx % 100==0:
                loss_step = eval_loss/nb_eval_steps
                print(f"Validation loss per 100 evaluation steps: {loss_step}")
            
            # compute evaluation accuracy
            flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
            active_logits = eval_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
            
            # only compute accuracy at active labels
            active_accuracy = labels.view(-1) != -100 # shape (batch_size * seq_len,)
        
            labels = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)
            
            eval_labels.extend(labels)
            eval_preds.extend(predictions)
            
            tmp_eval_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy

    labels = [ids_to_labels[id.item()] for id in eval_labels]
    predictions = [ids_to_labels[id.item()] for id in eval_preds]
    
    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f"Validation Loss: {eval_loss}")
    print(f"Validation Accuracy: {eval_accuracy}")

    return labels, predictions

In [None]:
labels, predictions = valid(model1, testing_loader)

In [None]:
print(classification_report(labels, predictions))

Result: about 50% for getting student names correct, and 0% for all of the other PII labels.
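Because almost every token is labelled `O`, plain accuracy says little about PII detection. The sketch below computes micro-averaged precision, recall, and F-beta over the non-`O` labels only. The default `beta=5`, which weights recall far more than precision, is an assumption about the competition metric and should be checked against the competition's evaluation page. The function name `micro_fbeta` is hypothetical.

```python
def micro_fbeta(true_labels, pred_labels, beta=5.0, outside="O"):
    """Micro-averaged precision, recall and F-beta over all non-'O' labels."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        if pred != outside and pred == truth:
            tp += 1
        else:
            if pred != outside:
                fp += 1  # predicted PII that is wrong
            if truth != outside:
                fn += 1  # real PII that was missed or mislabelled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


truth = ["O"] * 97 + ["B-NAME_STUDENT", "I-NAME_STUDENT", "B-EMAIL"]
all_outside = ["O"] * 100
accuracy = sum(t == p for t, p in zip(truth, all_outside)) / len(truth)
print(accuracy)                         # 0.97
print(micro_fbeta(truth, all_outside))  # (0.0, 0.0, 0.0)
```

In this toy example a model that never predicts PII reaches 97% accuracy while finding none of the sensitive tokens, which is why the per-label rows of `classification_report` are more informative than the accuracy printed by `valid`.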