# V3: Re-Fine-Tuning the Model with an Adjusted Tokenizer

This notebook reimplements the fine-tuning of DistilBERT for NER, using a tokenizer trained on both CoNLL-2003 and ONLY the toxic comments from the Wikipedia toxic comments corpus. Restricting the Wikipedia data to toxic comments is what distinguishes the V3 NER model from V2.

The expected benefit of training the tokenizer on CoNLL-2003 plus only the toxic comments from the Wikipedia set is a much higher ratio of toxic tokens to CoNLL-2003 tokens. The tokenizer is therefore more likely to add toxic words to its vocabulary as whole tokens rather than splitting them into subwords. Since this model is being trained to label toxic entities, it is important that toxic entities remain whole tokens as often as possible.
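To make the whole-token versus subword distinction concrete, here is a minimal sketch of greedy longest-match-first splitting, the scheme WordPiece tokenizers use at inference time. It assumes the trained tokenizer is WordPiece-style (as DistilBERT's is); the two toy vocabularies and the function name `wordpiece_tokenize` are hypothetical and are not taken from the trained tokenizer.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies, for illustration only.
general_vocab = {"the", "id", "##iot"}
toxic_aware_vocab = general_vocab | {"idiot"}

print(wordpiece_tokenize("idiot", general_vocab))      # ['id', '##iot']
print(wordpiece_tokenize("idiot", toxic_aware_vocab))  # ['idiot']
```

With the second vocabulary the word survives as one token, so a single NER label maps to a single token instead of having to be aligned across several subword pieces.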

## Understanding this Implementation

For a more detailed explanation of how the model, task, and training metric work, see: ./NER_standard_model/V1_model_base_tokenizer/sagemaker_training_env.ipynb

The tokenizer was trained in ./NER_standard_model/tokenizer_training/labeler_tokenizer_training.ipynb

The tokenizer vocab file is at ./NER_standard_model/tokenizer_training/tokenizer_files/conll_wiki_tokenizer-vocab.txt
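A quick way to check whether a given word made it into the vocabulary as a whole token is to load the vocab file into a dictionary. The sketch below assumes the file follows the usual BERT-style `vocab.txt` layout (one token per line, with the token id equal to the zero-based line number); that layout was not verified against this file. The helper name `load_vocab` and the demo file contents are hypothetical.

```python
import tempfile
from pathlib import Path


def load_vocab(path):
    """Map each token to its id, assuming one token per line (id = line index)."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for index, line in enumerate(f):
            vocab[line.rstrip("\n")] = index
    return vocab


# Self-contained demo on a tiny made-up vocab file (not the real tokenizer file).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo-vocab.txt"
    demo_path.write_text("[PAD]\n[UNK]\n[CLS]\n[SEP]\nidiot\n##s\n", encoding="utf-8")
    demo_vocab = load_vocab(demo_path)

print("idiot" in demo_vocab)  # True: stored as a whole token
print("fool" in demo_vocab)   # False: would be split or mapped to [UNK]
```

To inspect the real tokenizer, pass the vocab path listed above to `load_vocab` and test membership for the words of interest.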


# SageMaker Env


In [1]:
import sagemaker
sess = sagemaker.Session()  # create a SageMaker session
role = sagemaker.get_execution_role()  # get the IAM execution role from the environment where this runs
                                       # (here, a SageMaker notebook instance)

# HuggingFace Estimator
##### The Hugging Face estimator launches a SageMaker training job in a Hugging Face container that matches the versions specified below, passes our hyperparameters to it, and runs the training defined in the train.py training script.

In [2]:
from sagemaker.huggingface import HuggingFace


# hyperparameters passed to the training job
hyperparameters = {
    'epochs': 3,
    'per_device_train_batch_size': 16,
    'per_device_eval_batch_size': 16,
}

# create the Estimator
huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='/home/ec2-user/SageMaker/NER_training_sagemaker/tox-wiki-addition-training', #This is the only change made to the doc
        instance_type='ml.g4dn.xlarge',
        instance_count=1,
        role=role,
        transformers_version='4.28.1',
        pytorch_version='2.0.0',
        py_version='py310',
        hyperparameters = hyperparameters
)
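For context on how these settings reach the script: SageMaker passes each hyperparameter to the entry point as a `--key value` command-line argument, and exposes each `fit` channel through an `SM_CHANNEL_<NAME>` environment variable (with `SM_MODEL_DIR` for the output model). The sketch below shows how a training script could read them. It is not the actual `train.py` from this project; the argument names simply mirror the hyperparameter keys above, and the local fallback paths are made up.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters from the CLI and data locations from SageMaker env vars."""
    parser = argparse.ArgumentParser()
    # Argument names must match the hyperparameter keys exactly.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=16)
    parser.add_argument("--per_device_eval_batch_size", type=int, default=16)
    # Set by SageMaker inside the container; the fallbacks are hypothetical local paths.
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--train_dir", default=os.environ.get("SM_CHANNEL_TRAIN", "./train"))
    parser.add_argument("--test_dir", default=os.environ.get("SM_CHANNEL_TEST", "./test"))
    # parse_known_args ignores unrecognized arguments instead of raising an error.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args([
    "--epochs", "3",
    "--per_device_train_batch_size", "16",
    "--per_device_eval_batch_size", "16",
])
print(args.epochs, args.per_device_train_batch_size, args.per_device_eval_batch_size)
```

One consequence of using `parse_known_args`: if a hyperparameter key does not match any argument name in the script, the value is silently dropped and the script's default is used, so it is worth confirming that the keys and argument names agree.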

In [4]:
huggingface_estimator.fit({
    'train': "s3://conll2003-task/tokenized_conll2003/train/", 
    'test': "s3://conll2003-task/tokenized_conll2003/validation/"
    })

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-09-02-15-01-27-400


Using provided s3_resource
2023-09-02 15:01:27 Starting - Starting the training job...
2023-09-02 15:01:41 Starting - Preparing the instances for training......
2023-09-02 15:02:56 Downloading - Downloading input data
2023-09-02 15:02:56 Training - Downloading the training image.............................................
2023-09-02 15:10:29 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-09-02 15:10:36,640 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-09-02 15:10:36,658 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-02 15:10:36,667 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-09-02 15:10:36,674 sagemaker_pytorch_container.training INFO     Invoking user

0%|          | 0/1317 [00:00<?, ?it/s]
33%|███▎      | 439/1317 [00:59<01:59,  7.37it/s]
0%|          | 0/51 [00:00<?, ?it/s]
91%|█████████▏| 1202/1317 [03:01<00:16,  6.84it/s]

2023-09-02 15:14:30 Uploading - Uploading generated training model
100%|██████████| 51/51 [00:06<00:00,  7.51it/s]
***** Eval results *****
2023-09-02 15:14:23,748 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2023-09-02 15:14:23,748 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 0 from exiting process.
2023-09-02 15:14:23,749 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2023-09-02 15:16:10 Completed - Training job completed
Training seconds: 814
Billable seconds: 814
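
The step counts in the log can be sanity-checked against the configured batch sizes. The sketch below assumes the S3 data holds the standard CoNLL-2003 splits (14,041 training and 3,250 validation sentences) and that training ran on a single GPU; neither assumption was verified against the actual S3 contents or `train.py`. The helper name `steps_per_epoch` is hypothetical.

```python
import math

# Assumed standard CoNLL-2003 split sizes (not read from the S3 buckets).
TRAIN_EXAMPLES = 14_041
VALIDATION_EXAMPLES = 3_250
EPOCHS = 3


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to cover the dataset once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


for batch_size in (16, 32):
    total = EPOCHS * steps_per_epoch(TRAIN_EXAMPLES, batch_size)
    print(f"train batch size {batch_size}: {total} total steps")

for batch_size in (16, 64):
    print(f"eval batch size {batch_size}: {steps_per_epoch(VALIDATION_EXAMPLES, batch_size)} steps")
```

Under these assumptions, the 1317 training steps and 51 evaluation steps reported in the log correspond to batch sizes of 32 and 64 rather than the configured 16. If the dataset sizes are as assumed, this may indicate that the script did not pick up the `per_device_*_batch_size` hyperparameters and fell back to its own defaults, which is worth checking in `train.py`.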
