# NER Evaluation of Augmented Data

* This evaluation runs in Google Colab because:
    * The augmented dataset is very large
    * The transformer-based architecture requires a GPU


## Install spaCy and download the English models

In [None]:
# !pip install cupy-cuda112
!pip install spacy==3.0.6

In [None]:
# Download the small and the transformer-based English spaCy models
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_trf

In [None]:
!nvidia-smi

Sun Jun 27 00:07:10 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Install torch

* Install the torch build that matches Google Colab's CUDA version
* The CUDA 11.1 build (`cu111`) works

In [None]:
!pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

## Extract Project files

In [None]:
!unzip /content/project.zip

Archive:  /content/project.zip
 extracting: .gitignore              
   creating: configs/
  inflating: configs/config.cfg      
  inflating: project.lock            
  inflating: project.yml             
  inflating: README.md               
 extracting: requirements.txt        
   creating: scripts/
  inflating: scripts/preprocess.py   
  inflating: scripts/visualize_data.py  
  inflating: scripts/visualize_model.py  
  inflating: test_project_ner_fashion_brands.py  


## Pre-process and save to JSONL

### Extract the augmented dataset

In [None]:
!unzip /content/augmented_dataset_2021-06-21.zip

Archive:  /content/augmented_dataset_2021-06-21.zip
   creating: augmented_dataset_2021-06-21/
  inflating: augmented_dataset_2021-06-21/keyword_ids.csv  
  inflating: augmented_dataset_2021-06-21/pattern_ids.csv  
  inflating: augmented_dataset_2021-06-21/test_content.csv  
  inflating: augmented_dataset_2021-06-21/test_context.csv  
  inflating: augmented_dataset_2021-06-21/test_unseen.csv  
  inflating: augmented_dataset_2021-06-21/train.csv  


### Loader function

In [None]:
import pandas as pd
import os
import re
import numpy
from numpy.core.defchararray import find

TRAIN_DATA_PATH = "./augmented_dataset_2021-06-21/train.csv"
TEST_CONTENT_DATA_PATH = "./augmented_dataset_2021-06-21/test_content.csv"
TEST_CONTEXT_DATA_PATH = "./augmented_dataset_2021-06-21/test_context.csv"
TEST_UNSEEN = "./augmented_dataset_2021-06-21/test_unseen.csv"

def load_cleaned_data(data_path):
    """
    Load a CSV of sentences with (word, tag) pairs and compute character offsets.

    For every word-tag pair in a sentence, find the word's start and end index.
    If a candidate (start, end) pair collides with one already recorded
    (its start index, end index, or both are already in the list), or if it is
    not delimited by blank spaces, discard it and search again from just past
    the discarded match.
    Every word is stored as a token; only words tagged other than "NONE" are
    also stored as spans.
    :return: DATA, a list of dicts with "text", "tokens", "spans" and "answer" keys
    """
    col_names = ['text', 'entities']

    data = pd.read_csv(data_path, names=col_names, usecols=[0, 1])
    entity_list = data.entities.to_list()

    DATA = []

    for index, ent in enumerate(entity_list):
        if ent == "tokens":
            continue

        ent = ent.split("), (")
        ent[0] = re.sub("[([]", "", ent[0])
        ent[-1] = re.sub("[)]]", "", ent[-1])

        # Initialize index list, to store pairs of (start, end) indices
        indices_list = [(-1, -1), (-1, -1)]

        tokens_list = []
        spans_list = []

        start_index = 0
        end_index = 0

        # Process every (word, tag) pair of the current sentence
        for index_ent, word_pair in enumerate(ent):
            # Split the word and its pair
            word_pair_list = word_pair.split("'")[1::2]

            # Remove any leading or trailing whitespace
            word_pair_list[0] = word_pair_list[0].strip()

            start_index = find(data['text'][index].lower(), word_pair_list[0]).astype(numpy.int64)
            start_index = int(start_index + 0)
            end_index = int(start_index + len(word_pair_list[0]))

            # In case the word is not found in the sentence
            if start_index == -1:
                print("-1 error")
                print(data['text'][index])
                break

            both_present = lambda: (start_index, end_index) in indices_list
            start_present = lambda: start_index in [i[0] for i in indices_list]
            end_present = lambda: end_index in [i[1] for i in indices_list]
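            # A match at position 0 has no left neighbour; without this guard,
            # index -1 would wrap around to the last character of the sentence
            left_blank = lambda: start_index > 0 and data['text'][index][start_index - 1] != " "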

            def right_blank():
                # Return True if the character right after end_index is not a blank space.
                # A match that ends the sentence has no right neighbour, so return False.
                if len(data['text'][index]) != end_index:
                    return data['text'][index][end_index] != " "
                return False
            
            # Check if this start_index and/or end_index is already in the list:
            # (To prevent overlapping with already tagged words)
            flag = 0
            while True:
                if (start_index == -1 or end_index == -1):
                    flag = 1
                    break
                if (both_present()) or (start_present()) or (end_present()) or (left_blank()) or (right_blank()):
                
                    start_index = find(data['text'][index].lower(), word_pair_list[0],
                                        start=end_index + 1).astype(numpy.int64)
                    start_index = int(start_index + 0)
                    end_index = int(start_index + len(word_pair_list[0]))

                else:
                    indices_list.append((start_index, end_index))
                    break
            
            if (flag == 1):
                # Don't bother checking rest of the current sentence
                break
            
            # Add ALL the words and their positions to a "tokens" list
            tokens_list.append({"text": word_pair_list[0], "start": start_index, "end": end_index})

            # Add the specially tagged words to a "spans" list
            if word_pair_list[1] != "NONE":
                spans_list.append({"start": start_index, "end": end_index, "label": word_pair_list[1]})

        DATA.append({"text": data['text'][index].lower(), "tokens": tokens_list, "spans": spans_list, "answer": "accept"})
        
    return DATA


# TRAIN_DATA = load_cleaned_data(TRAIN_DATA_PATH)
# TEST_CONTENT = load_cleaned_data(TEST_CONTENT_DATA_PATH)
TEST_CONTEXT = load_cleaned_data(TEST_CONTEXT_DATA_PATH)
# UNSEEN_DATA = load_cleaned_data(TEST_UNSEEN)
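
The search-and-retry logic inside `load_cleaned_data` can be hard to follow inline. The sketch below isolates the idea under a few assumptions: it uses plain `str.find`, it treats only a blank space (or the sentence boundary) as a word delimiter, and it rejects any overlap with an already taken span, which is slightly stricter than the start/end membership test in the loader. The name `find_free_span` is hypothetical and does not appear in the project.

```python
def find_free_span(text, word, taken, start=0):
    """Return (start, end) of the first whole-word occurrence of `word` in
    `text` that does not overlap any span in `taken`, or None if there is none."""
    pos = text.find(word, start)
    while pos != -1:
        end = pos + len(word)
        # The match must be delimited by a blank space or a sentence boundary
        left_ok = pos == 0 or text[pos - 1] == " "
        right_ok = end == len(text) or text[end] == " "
        # Two half-open intervals overlap when each starts before the other ends
        overlaps = any(pos < t_end and t_start < end for t_start, t_end in taken)
        if left_ok and right_ok and not overlaps:
            return pos, end
        pos = text.find(word, pos + 1)
    return None


if __name__ == "__main__":
    sentence = "the gap between gap stores"
    first = find_free_span(sentence, "gap", taken=[])
    print(first)                                            # (4, 7)
    print(find_free_span(sentence, "gap", taken=[first]))   # (16, 19)
```

Returning `None` instead of `-1` keeps a failed search from being mistaken for a valid index.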



### Save to JSONL

In [None]:
import json
os.makedirs("assets", exist_ok=True)

# with open('assets/TRAIN_DATA.jsonl', 'w') as f:
#     for entry in TRAIN_DATA:
#         json.dump(entry, f)
#         f.write('\n')

# with open('assets/TEST_CONTENT.jsonl', 'w') as f:
#     for entry in TEST_CONTENT:
#         json.dump(entry, f)
#         f.write('\n')

with open('assets/TEST_CONTEXT.jsonl', 'w') as f:
    for entry in TEST_CONTEXT:
        json.dump(entry, f)
        f.write('\n')

# with open('assets/UNSEEN_DATA.jsonl', 'w') as f:
#     for entry in UNSEEN_DATA:
#         json.dump(entry, f)
#         f.write('\n')
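
The four near-identical `with open(...)` blocks could be folded into one helper. The following is a minimal sketch, assuming the record layout produced by `load_cleaned_data` (`text`, `tokens`, `spans`, `answer`). The helper names `write_jsonl`, `read_jsonl` and `bad_offsets` are hypothetical. `bad_offsets` is an optional sanity check that each token's text equals the slice of the sentence its offsets point at.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def bad_offsets(record):
    """Return the tokens whose text differs from text[start:end]."""
    text = record["text"]
    return [t for t in record["tokens"] if text[t["start"]:t["end"]] != t["text"]]
```

With these helpers, saving a split would be a single call such as `write_jsonl(TEST_CONTEXT, "assets/TEST_CONTEXT.jsonl")`, and `sum(bool(bad_offsets(r)) for r in TEST_CONTEXT)` would count the records with inconsistent offsets.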


### Zip the JSONL files

In [None]:
!zip -r /content/assets.zip /content/assets

  adding: content/assets/ (stored 0%)
  adding: content/assets/TEST_CONTEXT.jsonl (deflated 94%)
  adding: content/assets/TEST_CONTENT.jsonl (deflated 96%)
  adding: content/assets/.ipynb_checkpoints/ (stored 0%)


## Extract assets

In [None]:
!unzip /content/assets.zip

Archive:  /content/assets.zip
   creating: assets/
  inflating: assets/TEST_CONTENT.jsonl  
  inflating: assets/TEST_CONTEXT.jsonl  


## Convert the data to spaCy's binary format

In [None]:
!python -m spacy project run preprocess

2021-06-27 00:08:27.911511: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Running command: /usr/bin/python3 scripts/preprocess.py assets/TEST_CONTEXT.jsonl corpus/TEST_CONTEXT.spacy
2021-06-27 00:08:32.140911: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Processed 207134 documents: TEST_CONTEXT.spacy
Running command: /usr/bin/python3 scripts/preprocess.py assets/TEST_CONTENT.jsonl corpus/TEST_CONTENT.spacy
2021-06-27 00:10:47.654556: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Processed 19608 documents: TEST_CONTENT.spacy


## Check the config file

* `debug data` cannot finish on a dataset this large because of memory issues, so the run below was interrupted

In [None]:
!python -m spacy debug data configs/config.cfg

2021-06-27 00:11:14.299376: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
^C
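
One way around the memory limit is to run the check on a random subset of the data. The sketch below uses reservoir sampling so the full JSONL file never has to be held in memory. The function name `sample_jsonl` and the `_sample` file names in the note that follows are hypothetical and not part of the project.

```python
import random


def sample_jsonl(src, dst, k, seed=0):
    """Copy k uniformly sampled lines from src to dst (reservoir sampling).

    Returns the number of lines written, which is min(k, lines in src).
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    reservoir = []
    with open(src, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                # Fill the reservoir with the first k lines
                reservoir.append(line)
            else:
                # Keep the new line with probability k / (i + 1)
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = line
    with open(dst, "w", encoding="utf-8") as f:
        f.writelines(reservoir)
    return len(reservoir)
```

A sampled file such as `assets/TEST_CONTEXT_sample.jsonl` would still need converting with `scripts/preprocess.py`, in the same way as the full files. Assuming the config reads its corpora from `paths.train` and `paths.dev`, `debug data` can then be pointed at the smaller corpus using the same `--paths.train` and `--paths.dev` overrides that the training command uses.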


## Train

In [None]:
# !python -m spacy project run train
!python -m spacy train configs/config.cfg --output training/ --paths.train corpus/TEST_CONTEXT.spacy --paths.dev corpus/TEST_CONTENT.spacy --gpu-id 0

2021-06-27 00:51:22.690213: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
ℹ Using GPU: 0
[2021-06-27 00:51:28,547] [INFO] Set up nlp object from config
[2021-06-27 00:51:28,557] [INFO] Pipeline: ['transformer', 'ner']
[2021-06-27 00:51:28,561] [INFO] Created vocabulary
[2021-06-27 00:51:28,561] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you

## Evaluate

In [None]:
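# !python -m spacy project run evaluate
# NOTE: the preprocess step above only reports writing corpus/TEST_CONTEXT.spacy and
# corpus/TEST_CONTENT.spacy; make sure the evaluation corpus below exists before running
!python -m spacy evaluate training/model-best corpus/fashion_brands_eval.spacy --output training/metrics.json --gpu-id 0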
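
The evaluate command writes its scores to `training/metrics.json`. A minimal sketch for reading the NER scores back is shown below. It assumes the key names spaCy v3 normally emits for an `ner` component (`ents_p`, `ents_r`, `ents_f`, and `ents_per_type` with `p`/`r`/`f` entries), and it skips any overall score that is missing. The function name `summarize_metrics` is hypothetical.

```python
import json


def summarize_metrics(path="training/metrics.json"):
    """Return printable lines with overall and per-label NER scores."""
    with open(path, encoding="utf-8") as f:
        metrics = json.load(f)

    lines = []
    # Overall precision, recall and F-score for entities
    for key in ("ents_p", "ents_r", "ents_f"):
        value = metrics.get(key)
        if value is not None:
            lines.append(f"{key}: {value:.4f}")

    # Per-label breakdown, sorted by label name
    per_type = metrics.get("ents_per_type") or {}
    for label, scores in sorted(per_type.items()):
        lines.append(
            f"{label}: p={scores['p']:.4f} r={scores['r']:.4f} f={scores['f']:.4f}"
        )
    return lines
```

After the evaluation cell has run, `print("\n".join(summarize_metrics()))` would list the scores in the notebook.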

## Archive the generated model/data/images

In [None]:
# !unzip /content/data.zip
# !unzip /content/saved_model.zip
# !zip -r /content/data.zip /content/data
# !zip -r /content/img.zip /content/img
# !zip -r /content/saved_model.zip /content/saved_model