Feature/hf dataset augmentation #653

RakshitKhajuria · 2023-07-20T09:20:43Z

Description

Added support for loading HuggingFace Datasets for Augmentation tasks.

Notebook ➤ Demo

➤ Fixes #621

Type of change

New feature (non-breaking change which adds functionality)
This change requires a documentation update

Usage

custom_proportions = {
    "add_ocr_typo": 0.3,
}

data_kwargs = {
    "data_source": "glue",
    "subset": "sst2",
    "feature_column": "sentence",
    "target_column": "label",
    "split": "train",
}

harness.augment(
    training_data=data_kwargs,
    augmented_data="augmented_glue.csv",
    custom_proportions=custom_proportions,
    export_mode="add",
)
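For readers unfamiliar with the keys, here is a minimal sketch of how such a dictionary might map onto Hugging Face's `datasets.load_dataset(path, name, split=...)`. The helper names `to_load_dataset_args` and `load_feature_and_target` are hypothetical and are not taken from this PR, and the default split of "train" is an assumption.

```python
from typing import Any, Dict, Tuple


def to_load_dataset_args(training_data: Dict[str, Any]) -> Tuple[tuple, dict]:
    """Translate the dictionary keys into positional/keyword arguments.

    Hypothetical helper: the real langtest code may do this differently.
    """
    path = training_data["data_source"]          # e.g. "glue"
    name = training_data.get("subset")           # e.g. "sst2" (optional)
    split = training_data.get("split", "train")  # default is an assumption
    args = (path,) if name is None else (path, name)
    return args, {"split": split}


def load_feature_and_target(training_data: Dict[str, Any]):
    """Fetch the two columns of interest (needs `datasets` and network access)."""
    # Imported lazily so the mapping above works without `datasets` installed.
    from datasets import load_dataset

    args, kwargs = to_load_dataset_args(training_data)
    dataset = load_dataset(*args, **kwargs)
    features = dataset[training_data["feature_column"]]
    targets = dataset[training_data["target_column"]]
    return features, targets
```

Only `to_load_dataset_args` is pure; `load_feature_and_target` downloads data when called.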

Checklist:

I've added Google style docstrings to my code.
I have linted my code.
I have added tests to cover my changes.

Screenshots (if appropriate):

JulesBelveze

@Prikshit7766 Why not do something like this?

harness.augment(
    input_path="glue",
    output_path="augmented.csv",
    custom_proportions=custom_proportions,
    export_mode="transformed",
    data_kwargs={
        "subset": "sst2",
        "feature_column": "sentence",
        "target_column": "label",
        "split": "train"
    }
)

Then you simply have to do:

self.df.load_data(**data_kwargs)
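A minimal sketch of what the `**data_kwargs` unpacking suggested here would do. `LoaderSketch` is a hypothetical stand-in for whatever object sits behind `self.df`; its signature and defaults are assumptions, not the library's actual API.

```python
from typing import Any, Dict, Optional


class LoaderSketch:
    """Hypothetical stand-in for the object behind `self.df`."""

    def __init__(self, input_path: str) -> None:
        self.input_path = input_path

    def load_data(
        self,
        feature_column: str = "text",
        target_column: str = "label",
        split: str = "train",
        subset: Optional[str] = None,
    ) -> Dict[str, Any]:
        # A real loader would read the dataset here; this sketch only
        # records which options it received.
        return {
            "input_path": self.input_path,
            "subset": subset,
            "feature_column": feature_column,
            "target_column": target_column,
            "split": split,
        }


data_kwargs = {
    "subset": "sst2",
    "feature_column": "sentence",
    "target_column": "label",
    "split": "train",
}

loader = LoaderSketch("glue")
# Each dictionary key becomes a keyword argument of load_data.
config = loader.load_data(**data_kwargs)
```

One consequence of unpacking into named parameters is that a misspelled key raises a `TypeError` at call time instead of being silently ignored.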

RakshitKhajuria · 2023-07-20T15:27:13Z

Do we have to change the loading of HF datasets in the Harness as well?
If so, we will be adding one more parameter to the Harness class. @JulesBelveze

JulesBelveze · 2023-07-20T15:42:12Z

@RakshitKhajuria Yes, I don't think it's too bad to have an optional parameter called data_kwargs in the Harness constructor. Also, I don't see any other way around it.
@ArshaanNazir what's your opinion?

ArshaanNazir · 2023-07-20T16:37:39Z

@JulesBelveze Adding an additional param was the initial thought. However, David insisted on not adding more params to the Harness class.

JulesBelveze · 2023-07-20T16:42:49Z

@ArshaanNazir Then I guess the only way is to delete the input_path parameter and have a data_kwargs parameter of type Dict that requires an input_path key, with all other keys optional?
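A minimal sketch of the contract proposed here: a single dictionary in which `input_path` is mandatory and everything else is optional. The function name `split_data_kwargs` and the error message are hypothetical and do not come from the PR.

```python
from typing import Any, Dict, Tuple

REQUIRED_KEY = "input_path"


def split_data_kwargs(data_kwargs: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return the required path and the remaining optional settings."""
    if REQUIRED_KEY not in data_kwargs:
        raise ValueError(f"data_kwargs must contain the key '{REQUIRED_KEY}'")
    # Everything except the required key is treated as optional.
    options = {k: v for k, v in data_kwargs.items() if k != REQUIRED_KEY}
    return data_kwargs[REQUIRED_KEY], options


path, options = split_data_kwargs({"input_path": "glue", "subset": "sst2"})
```

The trade-off is that the required key is enforced at run time rather than by the function signature.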

ArshaanNazir · 2023-07-20T17:15:22Z

But our main use case was to retrain our own models (with the data on which they have been trained).

h.augment(input_path="train.conll", ...) looks more appealing in that way. But yes, the Dict approach would make it more generic.

What do you think @JulesBelveze ?

ArshaanNazir · 2023-07-20T17:27:28Z

data_kwargs

In this case, we can name it training_data (a parameter of type Dict) with a data_source key and other optional keys, and output_path can become augmented_data.
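To illustrate the renamed parameter, here is a sketch of the two cases discussed in this thread: a local training file and a Hugging Face dataset, both expressed as a `training_data` dictionary. The suffix-based check in `source_kind` is purely an assumption made for illustration; the PR does not say how the library tells the two apart.

```python
from pathlib import Path
from typing import Any, Dict

# Assumption: a data_source ending in one of these suffixes is a local file.
LOCAL_SUFFIXES = {".conll", ".csv"}


def source_kind(training_data: Dict[str, Any]) -> str:
    """Classify a training_data dictionary as a local file or a hub dataset."""
    suffix = Path(training_data["data_source"]).suffix.lower()
    return "local_file" if suffix in LOCAL_SUFFIXES else "huggingface"


# Case 1: retraining on a local file; only data_source is given.
local_case = {"data_source": "train.conll"}

# Case 2: a Hugging Face dataset; the optional keys pick subset, columns and split.
hf_case = {
    "data_source": "glue",
    "subset": "sst2",
    "feature_column": "sentence",
    "target_column": "label",
    "split": "train",
}
```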

Prikshit7766 · 2023-07-20T17:40:46Z

@ArshaanNazir
Can you provide a sample example for both cases?

JulesBelveze

LGTM

RakshitKhajuria · 2023-08-01T12:43:58Z

@JulesBelveze don't merge it yet. We will be updating this PR with the notebook 😊

Prikshit7766 and others added 2 commits July 20, 2023 14:36

Task: support hf dataset for augmentation

de0df1a

fix(augmentation/__init__.py): Bug fix in export_mode = transformed

7dcd8cb

RakshitKhajuria added the ⭐ Feature and v2.1.0 labels Jul 20, 2023

RakshitKhajuria assigned RakshitKhajuria and Prikshit7766 Jul 20, 2023

RakshitKhajuria marked this pull request as draft July 20, 2023 09:20

Prikshit7766 linked an issue Jul 20, 2023 that may be closed by this pull request

Add support for HF datasets augmentations #621

Closed

Prikshit7766 and others added 2 commits July 20, 2023 19:48

Test(test/test_augmentation.py): added test for coverage

56128c9

Test(test_augmentation.py): Added more tests

20347a7

JulesBelveze reviewed Jul 20, 2023

Test(test/test_augmentation.py): updated path

d269f08

task(test_augmentation.py): Updated the config path for text-classification

fe566c4

task(test_augmentation.py): Updated the config path for text-classification

795e639

Task(test_augmentation): Added custom proportions

81c9835

RakshitKhajuria and others added 5 commits July 21, 2023 00:55

task(langtest.py): Updated Args

ffeeb3f

task(augmentation/__init__.py): Updated Args

631bd4a

update: test augmentation

ef2bd6c

task(test_augmentation.py): added data_source

eed8778

updated augmentation/__init__.py

f331a23

RakshitKhajuria marked this pull request as ready for review July 30, 2023 05:38

Prikshit7766 requested a review from JulesBelveze August 1, 2023 10:45

JulesBelveze approved these changes Aug 1, 2023

Prikshit7766 and others added 3 commits August 1, 2023 23:49

website and notebook updated

4e3be6c

Docs(generate_aug.md): Updated For hf dataset

62b349c

Merge branch 'release/1.2.0' of https://github.com/JohnSnowLabs/langtest into feature/hf-dataset-augmentation

dcbca56

RakshitKhajuria mentioned this pull request Aug 2, 2023

Update All Data Augmentation Functionalities on Website #688

Closed

RakshitKhajuria linked an issue Aug 2, 2023 that may be closed by this pull request

Update All Data Augmentation Functionalities on Website #688

Closed

RakshitKhajuria and others added 8 commits August 2, 2023 13:41

Updated website for templatic augmentations

c94ca9f

Augmentation notebook updated

9ebaa5c

fix(langtest): renamed parameter in augment method

c9e0287

Merge branch 'feature/hf-dataset-augmentation' of https://github.com/JohnSnowLabs/nlptest into feature/hf-dataset-augmentation

4bb70a0

updated parameters in website

b787679

Merge branch 'feature/hf-dataset-augmentation' of https://github.com/JohnSnowLabs/langtest into feature/hf-dataset-augmentation

2f5b5e9

param updated in notebook

ccc5052

param updated test_augmentation.py

c9dfa98

ArshaanNazir merged commit 185ecf0 into release/1.2.0 Aug 2, 2023

ArshaanNazir deleted the feature/hf-dataset-augmentation branch August 7, 2023 05:59

ArshaanNazir removed the v2.1.0 label Aug 17, 2023
