ADAPTER TRAINING GUIDE
===========================================================================
THIS FILE EXPLAINS THE PROCESS OF TRAINING AND EVALUATING AN ADAPTER.
PREREQUISITES
- How an adapter setup works. Reference: [MAD-X Paper](https://arxiv.org/abs/2005.00052)
- NER dataset preprocessing. Reference: https://www.youtube.com/watch?v=dzyDHMycx_c
- Adapters. Reference: [Adapter Introduction](https://docs.adapterhub.ml/quickstart.html)
DATASET PREPROCESSING
Preprocessing is divided into 3 parts:
- Language Adapter Preprocessing
- Task Adapter Preprocessing
- Evaluation Dataset preprocessing
Set the labels for NER Tags: {"O": 0, "B-per": 1, "I-per": 2, "B-org": 3, "I-org": 4, "B-loc": 5, "I-loc": 6}
This is the default tag list for the BIO tagging format.
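As a minimal sketch of how the tag mapping above can be used, the snippet below converts BIO tag strings to integer ids and back. The helper names (`encode_tags`, `decode_ids`) and the example sentence are hypothetical and only serve as illustration; the notebooks may do this differently.

```python
# Tag-to-id mapping copied from the guide above.
LABEL2ID = {"O": 0, "B-per": 1, "I-per": 2, "B-org": 3, "I-org": 4, "B-loc": 5, "I-loc": 6}
# Reverse mapping, useful for turning model predictions back into tag strings.
ID2LABEL = {idx: tag for tag, idx in LABEL2ID.items()}


def encode_tags(tags):
    """Convert a list of BIO tag strings to integer ids."""
    return [LABEL2ID[tag] for tag in tags]


def decode_ids(ids):
    """Convert a list of integer ids back to BIO tag strings."""
    return [ID2LABEL[i] for i in ids]


# Hypothetical example sentence, for illustration only.
tokens = ["Alice", "Smith", "works", "at", "Acme", "Corp", "in", "Paris"]
tags = ["B-per", "I-per", "O", "O", "B-org", "I-org", "O", "B-loc"]
tag_ids = encode_tags(tags)
```

In the BIO scheme, `B-` marks the first token of an entity, `I-` marks a continuation token, and `O` marks tokens outside any entity.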
Language Adapter Training:
- We start by training the language adapter. These adapters are trained on unlabeled data.
- In `language.ipynb`, give the directory of the unlabeled dataset and run the notebook. This will automatically save the language adapter in your folder.
- You can change the adapter name and output directory accordingly.
NOTE: THE SAME PIPELINE CAN BE USED FOR TRAINING THE TARGET LANGUAGE ADAPTER.
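This guide does not state which objective `language.ipynb` uses on the unlabeled data. The MAD-X paper listed in the prerequisites trains language adapters with masked language modeling, so the sketch below assumes that objective and shows only the masking step in a simplified form. `MASK_TOKEN`, `mask_tokens`, and the 15% masking rate are illustrative assumptions, not values taken from the notebook.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol; real tokenizers define their own


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets) for one masked-language-modeling example.

    targets holds the original token at masked positions and None elsewhere,
    so a loss would only be computed where a token was hidden.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(token)
        else:
            masked.append(token)
            targets.append(None)
    return masked, targets


# Hypothetical unlabeled sentence, for illustration only.
sentence = "the adapter is trained on plain text without any labels".split()
masked_sentence, targets = mask_tokens(sentence)
```

Because the targets come from the text itself, no annotation is needed, which is why the language adapter can be trained on an unlabeled corpus.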
Task Adapter Training:
- Run `task.ipynb` for task adapter training.
- This adapter is trained on a labeled dataset. For this project, an NER dataset was used.
- Set the path for the NER dataset. The preprocessing pipeline converts the dataset into a Hugging Face DatasetDict, i.e. the format used for tokenizing the text. After the first preprocessing step, your output should be:
DatasetDict({ train: Dataset({ features: ['LABEL-1', 'LABEL-2'] }) })
NOTE: You can change the names of the labels. The NER dataset used had ['token','ner_tags'] as labels.
- The second preprocessing step tokenizes the text and maps it to the corresponding input_ids. Your output should be:
DatasetDict({ train: Dataset({ features: ['LABEL-1', 'LABEL-2', 'input_ids', 'attention_mask', 'labels'], }) })
- Remove the LABEL columns ('LABEL-1', 'LABEL-2') from the dataset before training.
- Import a pretrained adapter model for training. Add your adapter with `model.add_adapter("Your_Task_Adapter")`. This adapter is then stacked with the language adapter.
NOTE: The language adapter should match the language of the data used for training the task adapter.
- Set your parameters for training.
- Save your task adapter for later evaluation.
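The tokenization step above has to keep word-level NER tags aligned with subword tokens. How `task.ipynb` does this is not described here; the sketch below shows one common convention, assuming that only the first piece of each word keeps its tag and the remaining pieces receive -100, the value PyTorch's cross-entropy loss ignores by default. `toy_subword_tokenize` is a hypothetical stand-in for a real tokenizer.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def toy_subword_tokenize(word, piece_len=4):
    """Hypothetical stand-in for a real subword tokenizer: fixed-size pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def tokenize_and_align(words, tag_ids):
    """Split words into pieces and align word-level tag ids to those pieces.

    Only the first piece of each word keeps its tag id; the remaining
    pieces get IGNORE_INDEX so they do not contribute to the loss.
    """
    pieces, labels = [], []
    for word, tag_id in zip(words, tag_ids):
        word_pieces = toy_subword_tokenize(word)
        if not word_pieces:  # skip empty strings, which produce no pieces
            continue
        pieces.extend(word_pieces)
        labels.extend([tag_id] + [IGNORE_INDEX] * (len(word_pieces) - 1))
    return pieces, labels


# Hypothetical example, using ids from the tag list in this guide.
words = ["Alice", "visited", "Paris"]
tag_ids = [1, 0, 5]  # B-per, O, B-loc
pieces, labels = tokenize_and_align(words, tag_ids)
```

After this step, `pieces` and `labels` have the same length, which is what a token classification model expects for `input_ids` and `labels`.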
Evaluation:
- We evaluate on the target language, using the target language adapter. Evaluation is done on a labeled dataset of that language.
- Preprocess the data the same way as for the task adapter dataset. Repeat the process up to and including the removal of the LABELS.
- Then load the task adapter and replace the language adapter with the target language adapter.
- Then evaluate on the test split of the dataset.
SINCE WE USED AN NER DATASET, WE SAW A HUGE CLASS IMBALANCE IN THE TAGS, SO DURING EVALUATION WE IGNORE THE DOMINANT TAG FOR BETTER RESULTS.
YOU CAN TRY ANOTHER APPROACH TO EVALUATE.
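One way to ignore the dominant tag is sketched below, assuming the dominant tag is "O" and that a token-level micro F1 is an acceptable metric. The guide does not specify the metric actually used, so the function name and the example data are hypothetical.

```python
def f1_ignoring_tag(true_tags, pred_tags, ignored="O"):
    """Token-level micro precision, recall and F1, skipping the ignored tag.

    A prediction counts only if it is not the ignored tag, and a gold tag
    counts only if it is not the ignored tag.
    """
    tp = fp = fn = 0
    for true, pred in zip(true_tags, pred_tags):
        if pred != ignored:
            if pred == true:
                tp += 1
            else:
                fp += 1
        if true != ignored and pred != true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted tags, for illustration only.
true_tags = ["B-per", "I-per", "O", "O", "O", "O", "B-loc", "O"]
pred_tags = ["B-per", "O", "O", "O", "O", "O", "B-loc", "B-org"]
precision, recall, f1 = f1_ignoring_tag(true_tags, pred_tags)
accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
```

In this example, plain accuracy is 0.75 because most tokens are "O", while the F1 over entity tags is about 0.67, which reflects the two entity errors more directly.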
INSTALL THE REQUIRED PACKAGES FROM 'requirements.txt'