# **Project Description**

If you are here, you have probably already looked at the GitHub repo of this project, so you know enough about Named Entity Recognition and what we did in the project. If you came across this ner_bert.ipynb file by chance, you can visit the [repo](https://github.com/NamazovMN/NER-BERT) to check the source code and find a bit more information.

This file shows how you can easily use the repository on your own local machine or in any notebook-supported environment!

Hope you will enjoy it!

## **Initial Steps**

### Mount Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Create Directory
To keep the data in your drive organized, we will create a directory into which we will clone the repository:

In [2]:
import os
folder_name = 'projects_clone'  # change it as you wish
path = os.path.join('drive/MyDrive/Colab Notebooks/', folder_name)
# Create the folder if it does not exist yet, then move into it
os.makedirs(path, exist_ok=True)
%cd -q $path

### Clone the project into the given path
We are now in 'drive/MyDrive/Colab Notebooks/[folder_name]', where folder_name is whatever you set in the cell above. Let's clone the project:

In [3]:
if 'NER-BERT' not in os.listdir():
  !git clone https://github.com/NamazovMN/NER-BERT.git
project_path = 'NER-BERT'
%cd -q $project_path

Cloning into 'NER-BERT'...
remote: Enumerating objects: 68, done.
remote: Counting objects: 100% (68/68), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 68 (delta 17), reused 63 (delta 12), pack-reused 0
Receiving objects: 100% (68/68), 1.31 MiB | 11.90 MiB/s, done.
Resolving deltas: 100% (17/17), done.


### Let's install dependencies
Now that we are set up, let's install the required dependencies from [requirements.txt](https://github.com/NamazovMN/NER-BERT/blob/main/requirements.txt).


In [4]:
!pip install -r requirements.txt



## **How To Use?**
To run the model as a script, we need to know some details, which this section provides.

### Project Parameters
We need to know what each parameter does. The project parameters are listed below:

        "experiment_num": 6 => Specifies the experiment number (used for folder organization)
        "epochs": 3 => Number of epochs to run the training (or fine-tuning)
        "learning_rate": 0.0001 => Learning rate of the training
        "batch_size": 16 => Batch size
        "weight_decay": 0.0001 => Weight decay, used as a regularization parameter
        "train": True => Whether to train the model (must be specified)
        "infer": True => Whether to activate the inference session (must be specified)
        "resume_training": False => Continue the training phase of the experiment given by the experiment number (must be specified)
        "epoch_choice": 1 => Specifies which epoch should be chosen (in resume or inference scenarios)
        "load_best": False => Specifies whether to load the model based on the best metric results
        "load_choice": 'f1_macro' => Metric used to pick the best epoch. The default is 'f1_macro'; you can also choose dev_loss or dev_accuracy
        "dropout": 0.3 => Dropout rate
        "max_length": 180 => Maximum length that will be considered by the model
        "model_checkpoint": 'bert-base-cased' => Model checkpoint that you want to use
        "stats": True => Set it to True if you want to see statistics (must be specified)
        "statistics_data_choice": 'test' => Dataset on which the results are computed (can be 'test' or 'validation')
Note: Parameters marked *(must be specified)* are set to False by default. They are defined with the argparse *store_true* action, so to make them True you must pass the flag on the command line.
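
As a rough illustration of how *store_true* flags behave, here is a minimal sketch. The parser below is hypothetical: it only mirrors a few of the parameters listed above, and the actual definitions in the repository may differ.

```python
import argparse

# Hypothetical parser mirroring a few of the parameters listed above;
# the real definitions live in the repository and may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--experiment_num", type=int, default=6)
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--train", action="store_true")  # False unless the flag is passed
parser.add_argument("--infer", action="store_true")  # False unless the flag is passed
parser.add_argument("--load_choice", type=str, default="f1_macro",
                    choices=["f1_macro", "dev_loss", "dev_accuracy"])

# Same arguments as: python main.py --train --experiment_num 1
args = parser.parse_args(["--train", "--experiment_num", "1"])
print(args.train, args.infer, args.experiment_num, args.epochs)
# prints: True False 1 3
```

Passing `--train` switches the flag to True, while `--infer` stays False because it was not passed. Parameters with values, such as `--epochs`, keep their defaults.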

### Training
To train the model, the following parameters are enough to start: experiment_num and train. The rest of the parameters take their default values, but you can modify them if you want. For simplicity, we accept the defaults here and train the model for 3 epochs (the default).

In [5]:
!python main.py --train --experiment_num 1

2023-08-21 11:51:48.886662: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Found cached dataset conllpp (/root/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2)
100% 3/3 [00:00<00:00, 543.94it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2/cache-092064d366e0730c.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2/cache-685e8d854b3e67e6.arrow
Some weights of th

### Inference
If you want to check the model results on text that you enter yourself, you need to activate the inference phase. As mentioned above, it is set to False by default, so we need to pass the parameter to activate it.

We must emphasize that inference will not be performed if you have no checkpoint from which to load the model (in other words, you do not have an experiment_# folder, where # is the experiment number you want to run).
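
If you want to verify this before starting inference, a check along the following lines may help. It is only a sketch: `has_experiment` and `base_dir` are hypothetical names, and this notebook does not state where the repository stores its experiment folders, so adjust the directory to your setup.

```python
import os
import tempfile

def has_experiment(base_dir, experiment_num):
    """Return True if base_dir contains an experiment_<num> folder."""
    return os.path.isdir(os.path.join(base_dir, f"experiment_{experiment_num}"))

# Demonstration in a temporary directory, so nothing in your drive is touched.
# base_dir is an assumption; point it at the folder that holds your experiments.
with tempfile.TemporaryDirectory() as base_dir:
    os.makedirs(os.path.join(base_dir, "experiment_1"))
    found = has_experiment(base_dir, 1)
    missing = has_experiment(base_dir, 2)

print(found, missing)
# prints: True False
```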

Let's consider the scenario in which you choose experiment_num 1 and you have the required output results. The following rules then apply:


*   If you do not specify any parameters, the results of the first epoch are loaded (epoch_choice is 1 by default).
*   If you specify epoch_choice and also set load_best to True, load_best takes precedence.
*   If you set load_best to True, you might want to specify load_choice as well. When you choose dev_loss, the epoch with the minimum loss is chosen. When you choose either f1_macro or dev_accuracy, the epoch with the maximum value of that metric is chosen.
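
The rules above can be sketched as a small function. This is an illustration only, not the repository's implementation: `choose_epoch` and `history` are hypothetical names, and the metric values are made up for the example rather than taken from a real run.

```python
def choose_epoch(history, epoch_choice=1, load_best=False, load_choice="f1_macro"):
    """Pick an epoch following the rules above.

    history maps an epoch number to a dict of metric values for that epoch.
    """
    if not load_best:
        # No load_best: fall back to the explicitly chosen (or default) epoch
        return epoch_choice
    if load_choice == "dev_loss":
        # Lower loss is better
        return min(history, key=lambda epoch: history[epoch][load_choice])
    # f1_macro and dev_accuracy: higher is better
    return max(history, key=lambda epoch: history[epoch][load_choice])

# Made-up values for illustration only
history = {
    1: {"dev_loss": 0.21, "dev_accuracy": 0.95, "f1_macro": 0.88},
    2: {"dev_loss": 0.15, "dev_accuracy": 0.97, "f1_macro": 0.91},
    3: {"dev_loss": 0.17, "dev_accuracy": 0.96, "f1_macro": 0.92},
}

print(choose_epoch(history))                                           # 1
print(choose_epoch(history, epoch_choice=2, load_best=True))           # 3
print(choose_epoch(history, load_best=True, load_choice="dev_loss"))   # 2
```

Note how the second call ignores epoch_choice=2 because load_best is set.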

For instance, in the following example we choose experiment_num 1 (we already trained the model above), activate inference by passing infer, set load_best to True, and choose f1_macro as load_choice.

For now, that is all! Have fun!

In [8]:
!python main.py --infer --experiment_num 1 --load_best --load_choice f1_macro

2023-08-21 12:34:24.090815: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Found cached dataset conllpp (/root/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2)
100% 3/3 [00:00<00:00, 575.80it/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']


## **Statistics**

In the Named Entity Recognition task, labels are not uniformly distributed: most of the labels are Other (O). For this reason we report statistics in two ways: with and without the O label. That is why, when you run it, you will see 2 graphs per chosen metric.
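
To see why the O label can dominate the picture, consider the toy sketch below. It counts (gold, predicted) pairs with and without tokens whose gold label is O. Both the labels and the filtering rule are assumptions made for illustration; the repository may build its confusion matrices differently.

```python
from collections import Counter

def confusion_counts(gold, pred, drop_label=None):
    """Count (gold, predicted) pairs, optionally skipping one gold label."""
    pairs = zip(gold, pred)
    if drop_label is not None:
        pairs = [(g, p) for g, p in pairs if g != drop_label]
    return Counter(pairs)

def accuracy(counts):
    """Share of pairs in which the prediction matches the gold label."""
    total = sum(counts.values())
    correct = sum(n for (g, p), n in counts.items() if g == p)
    return correct / total if total else 0.0

# Toy labels, made up for illustration
gold = ["O", "O", "O", "O", "B-PER", "B-LOC"]
pred = ["O", "O", "O", "O", "B-PER", "O"]

with_o = confusion_counts(gold, pred)
without_o = confusion_counts(gold, pred, drop_label="O")

print(round(accuracy(with_o), 3))     # 0.833
print(round(accuracy(without_o), 3))  # 0.5
```

With O included, the many correct O predictions hide the fact that one of the two entities was missed; without O, the score reflects only the entity tokens.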

The statistics are also computed for the epochs chosen by the 3 load_choice metrics, but this time the metrics are selected automatically.

Last but not least, you choose the dataset with the *statistics_data_choice* parameter, which can be either test or validation.

We have to emphasize that the first time you request statistics, predictions are run for the epoch chosen by each metric and saved to the corresponding experiment folder. Thus, you will not need to predict again next time.

Let's step into the example then:

We pass the stats parameter and set statistics_data_choice to 'validation'. Afterwards you can go to the experiment folder in your drive to check the confusion matrices.


In [11]:
!python main.py --stats --statistics_data_choice validation --experiment_num 1

2023-08-21 12:44:04.148252: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Found cached dataset conllpp (/root/.cache/huggingface/datasets/conllpp/conllpp/1.0.0/04f15f257dff3fe0fb36e049b73d51ecdf382698682f5e590b7fb13898206ba2)
100% 3/3 [00:00<00:00, 579.43it/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
