feature: end-to-end NER pipeline #664

JulesBelveze · 2023-07-24T08:52:59Z

Description

This PR provides an end-to-end pipeline that performs the following workflow:

- train a model on a given dataset
- evaluate the model on a given test dataset
- test the trained model on a set of tests
- augment the training set based on the tests outcome
- retrain the model on the freshly generated augmented training set
- evaluate the retrained model on the test dataset
- compare the performance of the two models

This way, the user can train a model and test the behaviours that matter using langtest. Based on the outcome of those tests, langtest augments the original training set with samples on which the model failed. The model is then retrained on this augmented dataset and compared to the original on the generated set of tests.
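The control flow of the steps above can be sketched as follows. This is only an illustration of the train / test / augment / retrain / compare loop, not the code added in this PR: `run_pipeline`, `PipelineResult`, and the `toy_*` helpers are hypothetical names, and the "model" is a simple token lookup standing in for a transformers model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Sample = Tuple[List[str], List[str]]      # (tokens, NER tags)
Model = Callable[[List[str]], List[str]]  # tokens -> predicted tags


@dataclass
class PipelineResult:
    baseline_score: float
    retrained_score: float


def run_pipeline(
    train: Callable[[List[Sample]], Model],
    evaluate: Callable[[Model, List[Sample]], float],
    find_failures: Callable[[Model, List[Sample]], List[Sample]],
    train_set: List[Sample],
    test_set: List[Sample],
    behaviour_tests: List[Sample],
) -> PipelineResult:
    # Steps 1-2: train a baseline model and evaluate it on the test set.
    model = train(train_set)
    baseline_score = evaluate(model, test_set)
    # Step 3: run the behavioural tests and keep the samples the model fails.
    failures = find_failures(model, behaviour_tests)
    # Step 4: augment the training set with the failing samples.
    augmented = train_set + failures
    # Steps 5-6: retrain on the augmented set and evaluate again.
    retrained = train(augmented)
    retrained_score = evaluate(retrained, test_set)
    # Step 7: return both scores so they can be compared.
    return PipelineResult(baseline_score, retrained_score)


# Hypothetical stand-ins: a lookup-table "model" and token-level accuracy.
def toy_train(samples: List[Sample]) -> Model:
    lookup = {tok: tag for tokens, tags in samples for tok, tag in zip(tokens, tags)}
    return lambda tokens: [lookup.get(tok, "O") for tok in tokens]


def toy_evaluate(model: Model, samples: List[Sample]) -> float:
    pairs = [(p, g) for tokens, tags in samples for p, g in zip(model(tokens), tags)]
    return sum(p == g for p, g in pairs) / len(pairs)


def toy_failures(model: Model, samples: List[Sample]) -> List[Sample]:
    return [(tokens, tags) for tokens, tags in samples if model(tokens) != tags]


train_set = [(["Paris", "is", "nice"], ["B-LOC", "O", "O"])]
test_set = [(["PARIS", "is", "big"], ["B-LOC", "O", "O"])]
# An uppercase perturbation of the training sentence, used as a behavioural test.
behaviour_tests = [(["PARIS", "is", "nice"], ["B-LOC", "O", "O"])]

result = run_pipeline(
    toy_train, toy_evaluate, toy_failures, train_set, test_set, behaviour_tests
)
print(result)
```

In this toy run the baseline misses the uppercased entity, the failing test sample is added to the training set, and the retrained lookup then tags it correctly.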

For now, it supports the transformers library and the NER task. Datasets can be passed in CoNLL or CSV format.
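As an illustration of what a CoNLL input looks like once loaded, here is a minimal reader sketch. It assumes a CoNLL-2003 style layout (one token per line, the NER tag in the last column, blank lines between sentences); the exact column layout the pipeline expects is not stated in this PR, and `read_conll` is a hypothetical helper, not a langtest function.

```python
from typing import List, Tuple


def read_conll(text: str) -> List[Tuple[List[str], List[str]]]:
    """Split CoNLL-style text into (tokens, ner_tags) pairs, one per sentence."""
    sentences = []
    tokens: List[str] = []
    tags: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        fields = line.split()
        tokens.append(fields[0])   # first column: the token
        tags.append(fields[-1])    # last column: the NER tag (assumed)
    if tokens:
        sentences.append((tokens, tags))
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

sentences = read_conll(sample)
for sentence_tokens, sentence_tags in sentences:
    print(sentence_tokens, sentence_tags)
```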

Usage

To use the end-to-end pipeline, run the following one-liner with your own parameters:

python langtest/pipelines/transformers_pipelines.py run \
    --model-name=MODEL_NAME \
    --train-data=TRAIN_FILE \
    --eval-data=EVAL_FILE \
    --training-args=ARGS_DICT \
    --feature-col=NAME_OF_FEATURE_COL \
    --target-col=NAME_OF_TARGET_COL

For example:

python langtest/pipelines/transformers_pipelines.py run \
    --model-name="bert-base-uncased" \
    --train-data=train.csv \
    --eval-data=test.csv \
    --training-args='{"per_device_train_batch_size": 4}' \
    --feature-col="tokens" \
    --target-col="ner_tags"
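The `--training-args` value in the example is a JSON object passed as a single shell-quoted string. The sketch below shows how such a string could be decoded into keyword arguments; how the pipeline parses it internally is not shown in this PR, and `parse_training_args` is a hypothetical helper. The key used here, `per_device_train_batch_size`, matches a field of Hugging Face's `TrainingArguments`.

```python
import json
from typing import Any, Dict


def parse_training_args(raw: str) -> Dict[str, Any]:
    """Decode a JSON object string into a dict of training arguments."""
    args = json.loads(raw)
    if not isinstance(args, dict):
        raise ValueError("--training-args must be a JSON object")
    return args


# Same value as in the command-line example, without the shell quoting.
training_args = parse_training_args('{"per_device_train_batch_size": 4}')
print(training_args)
```

The single quotes around the JSON in the shell command keep the inner double quotes intact, which `json.loads` requires.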

Checklist:

I've added Google style docstrings to my code.
I've used pydantic for typing when/where necessary.
I have linted my code.
I have added tests to cover my changes.

JulesBelveze · 2023-07-28T09:01:56Z

I am stuck trying to update the poetry.lock file. For some reason, poetry hangs while resolving the dependencies.

JulesBelveze added 5 commits July 21, 2023 17:07

fix(formatter): wrong assertion

77d4848

chore(pipelines): add NER dataset

ec12ea0

chore(pipelines): NER metrics computation

0c46294

feature(pipelines): add NER pipeline transformers

c63cefb

docs(pipelines): improve NER HF pipeline docstring

a07949e

JulesBelveze self-assigned this Jul 24, 2023

JulesBelveze requested a review from chakravarthik27 July 24, 2023 08:53

chakravarthik27 approved these changes Jul 24, 2023

JulesBelveze requested review from chakravarthik27 and ArshaanNazir July 24, 2023 09:25

tests(pipelines): add NER HF tests for pipelines

3b01654

JulesBelveze linked an issue Jul 24, 2023 that may be closed by this pull request

Provide users with NER HF end-to-end pipeline #595

Closed

dependency: add metaflow

bf9ef16

ArshaanNazir approved these changes Jul 24, 2023

JulesBelveze and others added 10 commits July 28, 2023 14:02

Update poetry.lock

f496e53

Update pyproject.toml

2010850

tests: set Metaflow env var

4d330c4

fix(test): missing import in conftest

06c1872

fix(tests): attempt to run pipeline tests

2787804

tests: fix metaflow test config

42fcd87

dependency: missing evaluate dependency for hf trainer

266812a

dependency: missing evaluate dependency for hf trainer

ed59cf6

dependency: missing seqeval dependency for hf trainer

72786bb

fix(pipeline): wrong column names

8793cdc

JulesBelveze merged commit 1c120c9 into release/1.2.0 Aug 1, 2023

JulesBelveze deleted the feature/end-to-end-pipelines branch August 1, 2023 15:45
