Feature/hf dataset augmentation #653

Merged
merged 24 commits into from
Aug 2, 2023

Conversation

RakshitKhajuria
Contributor

@RakshitKhajuria RakshitKhajuria commented Jul 20, 2023

Description

Added support for loading HuggingFace Datasets for Augmentation tasks.

Notebook ➤ Demo

Fixes #621

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Usage

# `harness` is an already-constructed Harness instance.
custom_proportions = {
    "add_ocr_typo": 0.3
}

data_kwargs = {
    "data_source": "glue",
    "subset": "sst2",
    "feature_column": "sentence",
    "target_column": "label",
    "split": "train",
}

harness.augment(
    training_data=data_kwargs,
    augmented_data="augmented_glue.csv",
    custom_proportions=custom_proportions,
    export_mode="add",
)
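
For readers unfamiliar with the keys in data_kwargs, the following is a minimal sketch of how such a dictionary could map onto a HuggingFace `datasets.load_dataset` call. The helper name `load_pairs` and its `loader` parameter are assumptions made for illustration; this is not the code added by the PR.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def load_pairs(
    data_kwargs: Dict[str, str],
    loader: Optional[Callable[..., Any]] = None,
) -> List[Tuple[Any, Any]]:
    """Hypothetical helper: fetch a dataset and return (feature, target) pairs."""
    if loader is None:
        # Imported lazily so the sketch can be exercised without the package
        # (and without network access) by passing a stand-in loader.
        from datasets import load_dataset

        loader = load_dataset

    dataset = loader(
        data_kwargs["data_source"],               # dataset name, e.g. "glue"
        data_kwargs.get("subset"),                # configuration name, e.g. "sst2"
        split=data_kwargs.get("split", "train"),  # which split to read
    )
    feature = data_kwargs["feature_column"]
    target = data_kwargs["target_column"]
    # Iterating a `datasets.Dataset` yields one dict per row, keyed by column name.
    return [(row[feature], row[target]) for row in dataset]
```

Calling `load_pairs(data_kwargs)` with the dictionary above would download the dataset on first use; passing a custom `loader` avoids that in tests.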

Checklist:

  • I've added Google style docstrings to my code.
  • I have linted my code
  • I have added tests to cover my changes.

Screenshots (if appropriate):

@RakshitKhajuria RakshitKhajuria added the ⭐ Feature and v2.1.0 labels Jul 20, 2023
@RakshitKhajuria RakshitKhajuria marked this pull request as draft July 20, 2023 09:20
@Prikshit7766 Prikshit7766 linked an issue Jul 20, 2023 that may be closed by this pull request
Contributor

@JulesBelveze JulesBelveze left a comment

@Prikshit7766 Why not do something like this?

harness.augment(
    input_path="glue",
    output_path="augmented.csv",
    custom_proportions=custom_proportions,
    export_mode="transformed",
    data_kwargs={
        "subset": "sst2",
        "feature_column": "sentence",
        "target_column": "label",
        "split": "train"
    }
)

then you simply have to do:

self.df.load_data(**data_kwargs)
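
To illustrate the suggestion, here is a small sketch of how `**data_kwargs` unpacks a dictionary into keyword arguments. `ToyDataset` and the signature of its `load_data` method are hypothetical stand-ins, not the library's actual class.

```python
from typing import Dict, Optional


class ToyDataset:
    """Hypothetical stand-in for whatever object exposes load_data."""

    def load_data(
        self,
        subset: Optional[str] = None,
        feature_column: str = "text",
        target_column: str = "label",
        split: str = "train",
    ) -> Dict[str, Optional[str]]:
        # A real loader would read data here; the sketch only echoes its arguments.
        return {
            "subset": subset,
            "feature_column": feature_column,
            "target_column": target_column,
            "split": split,
        }


data_kwargs = {
    "subset": "sst2",
    "feature_column": "sentence",
    "target_column": "label",
    "split": "train",
}

# Each dictionary key becomes a keyword argument of the same name.
loaded = ToyDataset().load_data(**data_kwargs)

# A misspelled key fails loudly instead of being silently ignored.
typo_rejected = False
try:
    ToyDataset().load_data(**{"splt": "train"})
except TypeError:
    typo_rejected = True
```

One consequence of this design is that the set of accepted keys is defined by the method signature, so no separate validation step is needed.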

@RakshitKhajuria
Contributor Author

Do we have to change this for loading HF datasets in the Harness as well?
If yes, then we will be adding one more parameter to the Harness class. @JulesBelveze

@JulesBelveze
Contributor

JulesBelveze commented Jul 20, 2023

@RakshitKhajuria Yes, I don't think it's too bad to have an optional parameter called data_kwargs in the Harness constructor. Also, I don't see any other way around it.
@ArshaanNazir what's your opinion?

@ArshaanNazir
Collaborator

@JulesBelveze Adding an additional param was the initial thought. However, David insisted on not adding more params to the Harness class.

@JulesBelveze
Contributor

@ArshaanNazir Then I guess the only way is to delete the input_path parameter and have a data_kwargs parameter of type Dict that requires an input_path key, with the other keys optional?
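
One way to express "a Dict with one required key and several optional ones" is a `TypedDict`. The sketch below assumes Python 3.9 or later (for `__required_keys__` and `__optional_keys__`); the class names, the `check_data_kwargs` helper, and the choice of optional keys (taken from the example earlier in this thread) are illustrative assumptions.

```python
from typing import TypedDict


class _Required(TypedDict):
    input_path: str  # the only key that must always be present


class DataKwargs(_Required, total=False):
    # total=False makes the keys declared in this class optional.
    subset: str
    feature_column: str
    target_column: str
    split: str


def check_data_kwargs(data_kwargs: dict) -> DataKwargs:
    """Raise ValueError if the required key is missing or a key is unknown."""
    keys = set(data_kwargs)
    missing = DataKwargs.__required_keys__ - keys
    unknown = keys - DataKwargs.__required_keys__ - DataKwargs.__optional_keys__
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    # At runtime a TypedDict instance is a plain dict.
    return DataKwargs(**data_kwargs)
```

A static type checker can flag a missing input_path at call sites, while the runtime check covers dictionaries built dynamically.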

@ArshaanNazir
Collaborator

ArshaanNazir commented Jul 20, 2023

But our main use case was to retrain our own models (with the data on which they have been trained).

h.augment(input_path="train.conll", ****) looks more appealing for that. But yes, the Dict approach will make it more generic.

What do you think, @JulesBelveze?

@ArshaanNazir
Collaborator

ArshaanNazir commented Jul 20, 2023

Regarding data_kwargs: in this case, we could name it training_data (a Dict parameter) with a data_source key plus other, optional keys, and output_path could become augmented_data.

@Prikshit7766
Contributor

@ArshaanNazir
Can you provide a sample example for both cases?
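
As one reading of the proposal, the two cases (a local file and a HuggingFace dataset) might both be written as a training_data dictionary whose only mandatory key is data_source. The helper `split_training_data` below is hypothetical and only shows how the mandatory key could be separated from the optional loader settings.

```python
from typing import Dict, Tuple


def split_training_data(training_data: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Hypothetical helper: return (data_source, remaining optional settings)."""
    options = dict(training_data)  # copy so the caller's dict is not mutated
    try:
        data_source = options.pop("data_source")
    except KeyError:
        raise ValueError("training_data needs a 'data_source' key") from None
    return data_source, options


# Case 1: retraining on a local file, no extra settings needed.
local_case = {"data_source": "train.conll"}

# Case 2: a HuggingFace dataset, which needs the extra settings.
hub_case = {
    "data_source": "glue",
    "subset": "sst2",
    "feature_column": "sentence",
    "target_column": "label",
    "split": "train",
}

local_source, local_options = split_training_data(local_case)
hub_source, hub_options = split_training_data(hub_case)
```

In the local-file case the options dictionary is empty, so the call stays almost as short as the original input_path form.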

@RakshitKhajuria RakshitKhajuria marked this pull request as ready for review July 30, 2023 05:38
Contributor

@JulesBelveze JulesBelveze left a comment

LGTM

@RakshitKhajuria
Contributor Author

@JulesBelveze Don't merge it yet. We will be updating this PR with the notebook 😊

@ArshaanNazir ArshaanNazir merged commit 185ecf0 into release/1.2.0 Aug 2, 2023
@ArshaanNazir ArshaanNazir deleted the feature/hf-dataset-augmentation branch August 7, 2023 05:59
@ArshaanNazir ArshaanNazir removed the v2.1.0 label Aug 17, 2023
Labels
⭐ Feature Indicates new feature requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update All Data Augmentation Functionalities on Website
Add support for HF datasets augmentations
4 participants