# Re-Fine-Tuning the Model with an Adjusted Tokenizer

This notebook reimplements the fine-tuning of DistilBERT for NER, this time with a tokenizer that has been trained on both the CoNLL-2003 and Wikipedia toxic comments corpora. This should benefit the labeling job: when the tokenizer's vocabulary already contains the words the model will be used to label, the tokenizer is less likely to split named toxic entities into subwords and disturb the structural integrity of those statements for NER. The training script train.py contains these modifications. This notebook contains only the setup of a SageMaker environment for that train.py and the training logs.
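To make the subword issue concrete, here is a minimal sketch of greedy longest-match-first splitting in the WordPiece style. The two vocabularies are tiny hypothetical stand-ins, not the real DistilBERT or conll_wiki vocabularies, and `wordpiece_split` is an illustrative helper rather than the tokenizer library's implementation.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split one word greedily, longest match first, WordPiece style."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies (assumptions for illustration only).
base_vocab = {"mo", "##ron", "london"}
extended_vocab = base_vocab | {"moron"}

print(wordpiece_split("moron", base_vocab))      # split into two pieces
print(wordpiece_split("moron", extended_vocab))  # kept as one token
```

With the word present in the vocabulary, it maps to a single token, so a single NER label lines up with a single token instead of being spread across several pieces.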

## Understanding this Implementation

For a more detailed explanation of how the model, task, and training metric work, visit: ./NER_standard_model/V1_model_base_tokenizer/sagemaker_training_env.ipynb

The tokenizer was trained in ./NER_standard_model/tokenizer_training/labeler_tokenizer_training.ipynb

The tokenizer vocab file is at ./NER_standard_model/tokenizer_training/tokenizer_files/conll_wiki_tokenizer-vocab.txt
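As a rough sketch of how such a vocab file can be inspected, the snippet below assumes the usual WordPiece layout of one token per line, where a token's id is its zero-based line index. The file written here is a toy stand-in created in a temporary directory, and `load_vocab` is a hypothetical helper, not part of any library.

```python
import tempfile
from pathlib import Path


def load_vocab(path):
    """Map each token to its id, taken as its zero-based line index."""
    with open(path, encoding="utf-8") as handle:
        return {line.rstrip("\n"): idx for idx, line in enumerate(handle)}


# Toy stand-in for a vocab file (assumption: one token per line).
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy-vocab.txt"
    toy_path.write_text("[PAD]\n[UNK]\n[CLS]\n[SEP]\nmoron\n##s\n", encoding="utf-8")
    vocab = load_vocab(toy_path)

print(len(vocab))        # number of tokens in the toy file
print("moron" in vocab)  # whether a word is stored whole
```

Pointing `load_vocab` at the real vocab file would allow a quick check of whether a given word is stored whole or would be split into pieces.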


# Sagemaker Env


In [2]:
import sagemaker
sess = sagemaker.Session()  # create a SageMaker session
role = sagemaker.get_execution_role()  # get the IAM role from the environment this runs in;
                                       # here, a SageMaker notebook instance

# HuggingFace Estimator
##### The Hugging Face estimator launches a SageMaker training job in a prebuilt Hugging Face container that matches the versions specified below, passes our hyperparameters to it, and runs the training defined in the train.py training script.

In [6]:
from sagemaker.huggingface import HuggingFace


# hyperparameters which are passed to the training job
hyperparameters={'epochs': 3,
                 'per_device_train_batch_size': 16,
                 'per_device_eval_batch_size': 16
                 }

# create the Estimator
huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='/home/ec2-user/SageMaker/NER_training_sagemaker/wiki-addition-training', #This is the only change made to the doc
        instance_type='ml.g4dn.xlarge',
        instance_count=1,
        role=role,
        transformers_version='4.28.1',
        pytorch_version='2.0.0',
        py_version='py310',
        hyperparameters = hyperparameters
)
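SageMaker hands the `hyperparameters` dict to the entry point as command-line flags (`--epochs 3` and so on) and exposes the input channels and output directory through environment variables such as `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`, and `SM_MODEL_DIR`. The sketch below shows one way a training script could read them. train.py itself is not shown in this notebook, so the argument names, defaults, and the `parse_args` helper here are assumptions for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters passed as CLI flags, plus SageMaker paths."""
    parser = argparse.ArgumentParser()
    # Flag names must match the keys of the hyperparameters dict exactly.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=16)
    parser.add_argument("--per_device_eval_batch_size", type=int, default=16)
    # Channel and model directories come from the container's environment.
    parser.add_argument("--training_dir", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR"))
    # parse_known_args tolerates extra flags instead of exiting with an error.
    return parser.parse_known_args(argv)


args, unknown = parse_args(
    ["--epochs", "3",
     "--per_device_train_batch_size", "16",
     "--per_device_eval_batch_size", "16"]
)
print(args.epochs, args.per_device_train_batch_size, unknown)
```

One design point worth keeping in mind: with `parse_known_args`, a hyperparameter whose name the script does not declare lands in `unknown` and is silently ignored, so the script's defaults apply instead.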

In [7]:
huggingface_estimator.fit({
    'train': "s3://conll2003-task/tokenized_conll2003/train/", 
    'test': "s3://conll2003-task/tokenized_conll2003/validation/"
    })

Using provided s3_resource


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-09-02-01-42-29-022


2023-09-02 01:42:29 Starting - Starting the training job.........
2023-09-02 01:43:51 Starting - Preparing the instances for training......
2023-09-02 01:44:52 Downloading - Downloading input data...
2023-09-02 01:45:13 Training - Downloading the training image.............................................
2023-09-02 01:52:50 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-09-02 01:53:02,136 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-09-02 01:53:02,153 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-09-02 01:53:02,163 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-09-02 01:53:02,169 sagemaker_pytorch_container.training INFO     Invoking user training script.

Building wheel for seqeval (setup.py): finished with status 'done'
Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16165 sha256=999265168ddfe7abd91119b8cff634b3e100bee02cff1613d076d2d2d87324a3
Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: pip install --upgrade pip
  metric = datasets.load_metric("seqeval") #load in seqeval metric after install
Downloading builder script:   0%|          | 0.00/2.47k [00:00<?, ?B/s]
Downloading builder script: 6.33kB [00:00, 6.85MB/s]
0%|          | 0/1317 [00:00<?, ?it/s]
0%|          | 1/1317 [00:00<07:24,  2.96it/s]
0%|          | 2/1317 [00:00<04:39,  4.71it/s]

14%|█▍        | 182/1317 [00:24<02:30,  7.56it/s]
14%|█▍        | 183/1317 [00:24<02:45,  6.85it/s]
14%|█▍        | 184/1317 [00:24<02:37,  7.19it/s]
14%|█▍        | 185/1317 [00:24<02:50,  6.64it/s]
14%|█▍        | 186/1317 [00:24<02:47,  6.74it/s]
14%|█▍        | 187/1317 [00:24<02:54,  6.49it/s]
14%|█▍        | 188/1317 [00:25<02:39,  7.08it/s]
14%|█▍        | 189/1317 [00:25<02:33,  7.34it/s]
14%|█▍        | 190/1317 [00:25<02:32,  7.39it/s]
15%|█▍        | 191/1317 [00:25<02:27,  7.63it/s]
15%|█▍        | 192/1317 [00:25<02:23,  7.83it/s]
15%|█▍        | 193/1317 [00:25<02:27,  7.61it/s]
15%|█▍        | 194/1317 [00:25<02:25,  7.73it/s]
15%|█▍        | 196/1317 [00:26<02:22,  7.89it/s]
15%|█▍        | 197/1317 [00:26<02:21,  7.93it/s]
15%|█▌        | 198/1317 [00:26<02:16,  8.18it/s]
15%|█▌        | 199/1317 [00:26<02:21,  7.90it/s]

25%|██▍       | 329/1317 [00:44<02:13,  7.41it/s]
25%|██▌       | 330/1317 [00:44<02:13,  7.38it/s]
25%|██▌       | 331/1317 [00:44<02:13,  7.40it/s]
25%|██▌       | 332/1317 [00:44<02:13,  7.40it/s]
25%|██▌       | 333/1317 [00:44<02:13,  7.39it/s]
25%|██▌       | 334/1317 [00:44<02:09,  7.57it/s]
25%|██▌       | 335/1317 [00:44<02:19,  7.03it/s]
26%|██▌       | 336/1317 [00:45<02:22,  6.88it/s]
26%|██▌       | 337/1317 [00:45<02:15,  7.22it/s]
26%|██▌       | 338/1317 [00:45<02:09,  7.59it/s]
26%|██▌       | 339/1317 [00:45<02:06,  7.74it/s]
26%|██▌       | 340/1317 [00:45<02:05,  7.81it/s]
26%|██▌       | 341/1317 [00:45<01:59,  8.15it/s]
26%|██▌       | 342/1317 [00:45<01:57,  8.29it/s]
26%|██▌       | 343/1317 [00:45<02:07,  7.65it/s]
26%|██▌       | 344/1317 [00:46<02:10,  7.45it/s]
26%|██▌       | 345/1317 [00:46<02:08,  7.59it/s]

38%|███▊      | 501/1317 [01:14<06:12,  2.19it/s]
38%|███▊      | 502/1317 [01:14<04:53,  2.78it/s]
38%|███▊      | 503/1317 [01:14<03:58,  3.42it/s]
38%|███▊      | 504/1317 [01:14<03:23,  3.99it/s]
38%|███▊      | 505/1317 [01:14<02:51,  4.72it/s]
38%|███▊      | 506/1317 [01:15<02:33,  5.29it/s]
38%|███▊      | 507/1317 [01:15<02:23,  5.63it/s]
39%|███▊      | 508/1317 [01:15<02:13,  6.07it/s]
39%|███▊      | 509/1317 [01:15<02:02,  6.58it/s]
39%|███▊      | 510/1317 [01:15<02:02,  6.58it/s]
39%|███▉      | 511/1317 [01:15<01:56,  6.91it/s]
39%|███▉      | 512/1317 [01:15<01:52,  7.15it/s]
39%|███▉      | 513/1317 [01:16<01:52,  7.16it/s]
39%|███▉      | 514/1317 [01:16<01:52,  7.12it/s]
39%|███▉      | 515/1317 [01:16<01:46,  7.55it/s]
39%|███▉      | 516/1317 [01:16<01:51,  7.19it/s]
39%|███▉      | 517/1317 [01:16<01:54,  6.97it/s]

49%|████▉     | 646/1317 [01:34<01:29,  7.50it/s]
49%|████▉     | 647/1317 [01:34<01:28,  7.60it/s]
49%|████▉     | 648/1317 [01:34<02:22,  4.69it/s]
49%|████▉     | 649/1317 [01:34<02:07,  5.24it/s]
49%|████▉     | 650/1317 [01:35<01:56,  5.71it/s]
49%|████▉     | 651/1317 [01:35<01:46,  6.27it/s]
50%|████▉     | 652/1317 [01:35<01:45,  6.31it/s]
50%|████▉     | 653/1317 [01:35<01:45,  6.27it/s]
50%|████▉     | 654/1317 [01:35<01:51,  5.96it/s]
50%|████▉     | 655/1317 [01:35<01:41,  6.49it/s]
50%|████▉     | 656/1317 [01:35<01:39,  6.67it/s]
50%|████▉     | 657/1317 [01:36<01:36,  6.84it/s]
50%|████▉     | 658/1317 [01:36<01:32,  7.09it/s]
50%|█████     | 659/1317 [01:36<01:28,  7.40it/s]
50%|█████     | 660/1317 [01:36<01:30,  7.28it/s]
50%|█████     | 661/1317 [01:36<01:28,  7.44it/s]
50%|█████     | 662/1317 [01:36<01:30,  7.22it/s]

60%|█████▉    | 787/1317 [01:54<01:05,  8.03it/s]
60%|█████▉    | 788/1317 [01:54<01:10,  7.54it/s]
60%|█████▉    | 789/1317 [01:54<01:11,  7.36it/s]
60%|█████▉    | 790/1317 [01:54<01:07,  7.79it/s]
60%|██████    | 791/1317 [01:54<01:10,  7.45it/s]
60%|██████    | 792/1317 [01:54<01:08,  7.62it/s]
60%|██████    | 793/1317 [01:54<01:07,  7.72it/s]
60%|██████    | 794/1317 [01:55<01:04,  8.12it/s]
60%|██████    | 795/1317 [01:55<01:09,  7.46it/s]
60%|██████    | 796/1317 [01:55<01:16,  6.82it/s]
61%|██████    | 797/1317 [01:55<01:13,  7.11it/s]
61%|██████    | 798/1317 [01:55<01:14,  7.01it/s]
61%|██████    | 799/1317 [01:55<01:08,  7.60it/s]
61%|██████    | 800/1317 [01:55<01:15,  6.89it/s]
61%|██████    | 801/1317 [01:56<01:14,  6.95it/s]
61%|██████    | 802/1317 [01:56<01:09,  7.38it/s]
61%|██████    | 803/1317 [01:56<01:06,  7.69it/s]

75%|███████▌  | 992/1317 [02:29<00:45,  7.14it/s]
75%|███████▌  | 993/1317 [02:29<00:43,  7.37it/s]
75%|███████▌  | 994/1317 [02:29<00:41,  7.71it/s]
76%|███████▌  | 995/1317 [02:29<00:45,  7.03it/s]
76%|███████▌  | 996/1317 [02:29<00:46,  6.89it/s]
76%|███████▌  | 997/1317 [02:29<00:45,  7.10it/s]
76%|███████▌  | 998/1317 [02:30<00:46,  6.92it/s]
76%|███████▌  | 999/1317 [02:30<00:44,  7.09it/s]
76%|███████▌  | 1000/1317 [02:30<00:46,  6.83it/s]
{'loss': 0.3056, 'learning_rate': 1.9400244798041617e-05, 'epoch': 2.28}
76%|███████▌  | 1000/1317 [02:30<00:46,  6.83it/s]
76%|███████▌  | 1001/1317 [02:31<02:30,  2.10it/s]
76%|███████▌  | 1002/1317 [02:31<01:56,  2.70it/s]
76%|███████▌  | 1003/1317 [02:31<01:33,  3.34it/s]
76%|███████▌  | 1004/1317 [02:31<01:17,  4.03it/s]
76%|███████▋  | 1005/1317 [02:32<01:05,  4.79it/s]
76%|███████▋  | 1006/

91%|█████████ | 1193/1317 [02:59<00:17,  7.11it/s]
91%|█████████ | 1194/1317 [02:59<00:17,  6.99it/s]
91%|█████████ | 1195/1317 [02:59<00:16,  7.41it/s]
91%|█████████ | 1196/1317 [02:59<00:16,  7.54it/s]
91%|█████████ | 1197/1317 [02:59<00:16,  7.14it/s]
91%|█████████ | 1198/1317 [03:00<00:17,  6.65it/s]
91%|█████████ | 1199/1317 [03:00<00:16,  7.01it/s]
91%|█████████ | 1200/1317 [03:00<00:16,  6.95it/s]
91%|█████████ | 1201/1317 [03:00<00:17,  6.48it/s]
91%|█████████▏| 1202/1317 [03:00<00:17,  6.74it/s]
91%|█████████▏| 1203/1317 [03:00<00:16,  6.94it/s]
91%|█████████▏| 1204/1317 [03:00<00:17,  6.64it/s]
92%|█████████▏| 1206/1317 [03:01<00:14,  7.49it/s]
92%|█████████▏| 1207/1317 [03:01<00:15,  7.19it/s]
92%|█████████▏| 1208/1317 [03:01<00:15,  7.24it/s]
92%|█████████▏| 1209/1317 [03:01<00:14,  7.29it/s]
92%|█████████▏| 1210/1317 [03:01<00

92%|█████████▏| 47/51 [00:04<00:00,  8.72it/s]
94%|█████████▍| 48/51 [00:04<00:00,  8.78it/s]
96%|█████████▌| 49/51 [00:05<00:00,  8.64it/s]
98%|█████████▊| 50/51 [00:05<00:00,  8.56it/s]
100%|██████████| 51/51 [00:06<00:00,  7.46it/s]
***** Eval results *****
2023-09-02 01:56:48,732 sagemaker-training-toolkit INFO     Waiting for the process to finish and give a return code.
2023-09-02 01:56:48,732 sagemaker-training-toolkit INFO     Done waiting for a return code. Received 0 from exiting process.
2023-09-02 01:56:48,732 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2023-09-02 01:56:51 Uploading - Uploading generated training model
2023-09-02 01:58:37 Completed - Training job completed
Training seconds: 824
Billable seconds: 824
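
The step counts in these logs can be sanity-checked with a little arithmetic. The sketch below takes the totals from the progress bars (1317 training steps, 51 evaluation steps) and the logged `epoch` value at step 1000. The dataset sizes are assumptions: 14,041 training and 3,250 validation examples, the sizes of the standard CoNLL-2003 splits, which may not match the tokenized data in S3. It also assumes a single GPU and no gradient accumulation.

```python
import math

# Values read from the logs above.
TOTAL_TRAIN_STEPS = 1317
TOTAL_EVAL_STEPS = 51
EPOCHS = 3

steps_per_epoch = TOTAL_TRAIN_STEPS // EPOCHS
# The loss line at step 1000 reports its position as a fractional epoch.
epoch_at_step_1000 = round(1000 / steps_per_epoch, 2)

# Assumed dataset sizes (standard CoNLL-2003 splits; not confirmed by this notebook).
ASSUMED_TRAIN_EXAMPLES = 14041
ASSUMED_EVAL_EXAMPLES = 3250


def expected_steps(num_examples, batch_size, epochs=1):
    """Steps needed to cover num_examples once per epoch at a given batch size."""
    return math.ceil(num_examples / batch_size) * epochs


print(steps_per_epoch, epoch_at_step_1000)
for batch_size in (16, 32):
    print("train", batch_size, expected_steps(ASSUMED_TRAIN_EXAMPLES, batch_size, EPOCHS))
for batch_size in (16, 64):
    print("eval", batch_size, expected_steps(ASSUMED_EVAL_EXAMPLES, batch_size))
```

If those assumptions hold, the logged totals line up with a training batch size of 32 and an evaluation batch size of 64 rather than the 16 requested in the hyperparameters, which would be worth checking against how train.py reads its batch-size arguments.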
