
Using GPT to create synthetic data for any NLP finetuning task #144

Closed
GautamR-Samagra opened this issue Jun 20, 2023 · 3 comments

GautamR-Samagra commented Jun 20, 2023

This is to create a block that does the following for pre-decided model types (modelling tasks):

  • Take as input a prompt to create synthetic data using GPT
  • Clean the output received from GPT and create/update files for training/validation/testing
  • Fine-tune/retrain a HF model based on the created data (be able to do this in Azure/Colab)
  • Maintain a repository of trained models in HF, validate their accuracy, and push the best to {latest_model} to be used for any use case
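
A rough sketch of the first two steps (generation and cleaning), assuming the openai v1 Python client and a prompt that asks GPT to emit one example per line as "Input: ... Output: ..."; the helper names, file format, and parsing here are assumptions, not a spec:

```python
import csv
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_examples(prompt: str, n: int = 20) -> list[dict]:
    """Ask GPT for n synthetic examples and keep only well-formed lines."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"{prompt}\nGenerate {n} examples, one per line, "
                       "formatted as 'Input: ... Output: ...'",
        }],
    )
    examples = []
    for line in resp.choices[0].message.content.splitlines():
        if "Input:" in line and "Output:" in line:  # minor validation
            inp, out = line.split("Output:", 1)
            examples.append({"input": inp.replace("Input:", "").strip(),
                             "output": out.strip()})
    return examples

def write_split(examples: list[dict], path: str) -> None:
    """Write one of the train/validation/test files as CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["input", "output"])
        writer.writeheader()
        writer.writerows(examples)
```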

We should be able to carry out the following tasks for now:

  • Neural coreference
  • Sentence classification
GautamR-Samagra commented:

Please update the ticket with the plan, @ksgr5566.

ksgr5566 commented:

Based on my understanding of the minimal viable scope of the project:

  • No authentication is needed; users access the functions by providing their own keys.
  • In the UI, per session, the user would need to provide their OpenAI key and Hugging Face key as environment variables.
  • Generated data and fine-tuned models are uploaded to the user's Hugging Face repo.

I am starting by building the API for this.
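
A minimal sketch of how those per-session keys could be accepted, assuming FastAPI with the keys passed as request headers (the header names and endpoint are my assumptions, not part of the plan):

```python
from fastapi import Depends, FastAPI, Header

app = FastAPI()

def api_keys(x_openai_key: str = Header(...),
             x_hf_key: str = Header(...)) -> dict:
    # FastAPI maps these parameters to the x-openai-key and x-hf-key
    # request headers and rejects requests that omit them.
    return {"openai": x_openai_key, "hf": x_hf_key}

@app.get("/ping")
def ping(keys: dict = Depends(api_keys)):
    # Any endpoint can pull the caller's keys through this dependency.
    return {"ok": True}
```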

Core functions:

  • Data Gen:
    auth header: OpenAI key, Hugging Face key
    data: POST - usage: generates examples given a prompt and does minor validation, e.g. checking that "Input" and "Output" are present in the generated content; params: prompt, number of examples, path (Hugging Face repo id).

    data: PUT - usage: same as above, except that "path" is required; loads the data from that path and adds more examples.

    data: DELETE - usage: same as above, "path" is required; loads the data from that path and deletes the specified number of examples.

    data: GET - usage: returns the data generated from the call to GPT. The amount of data generated is specified by the number-of-examples param.

  • Model FineTune:
    auth header: Hugging Face key
    model: POST - usage: fine-tunes a model given the task and a Hugging Face model path, and stores it in the Hugging Face repo the user provides (a rough sketch follows this list); params: task (for now, classification or seq2seq), model name (if not provided, a default from the config file is used), path (Hugging Face repo to store the model), data_file_path (either an uploaded file path or a Hugging Face repo file), a train/test split size, and extra params covering everything a Hugging Face Trainer takes.

    model: PUT - usage: fine-tunes an already existing model in the user's HF repo again; same as above, except that model_name is required and must be a model present in the user's HF repo. That model is fine-tuned again and updated.

  • Inference
    eval: GET - usage: inference; given an HF model name, loads it, runs inference on the input (a file, an HF test file path, or an uploaded file path), and reports the results.
    eval: POST - usage: saves the results in the HF path provided.
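
To make the classification branch of model: POST concrete, here is a hedged sketch built on the Hugging Face Trainer; the column names, default model, and label handling are assumptions, and it presumes the user's HF token is already configured for the push:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune_classifier(data_file: str, repo_id: str,
                        model_name: str = "distilbert-base-uncased",
                        test_size: float = 0.2, num_labels: int = 2):
    # Load the generated CSV and apply the train/test split param.
    ds = load_dataset("csv", data_files=data_file)["train"]
    ds = ds.train_test_split(test_size=test_size)

    tok = AutoTokenizer.from_pretrained(model_name)
    ds = ds.map(lambda b: tok(b["input"], truncation=True,
                              padding="max_length"), batched=True)
    # Assumes the "output" column already holds integer class labels.
    ds = ds.rename_column("output", "labels")

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)
    args = TrainingArguments(output_dir="out", num_train_epochs=3,
                             push_to_hub=True, hub_model_id=repo_id)
    trainer = Trainer(model=model, args=args,
                      train_dataset=ds["train"], eval_dataset=ds["test"])
    trainer.train()
    trainer.push_to_hub()  # stores the model in the user's HF repo
```

The seq2seq branch would swap in AutoModelForSeq2SeqLM with a Seq2SeqTrainer, and eval: GET could reuse the same loading path through a transformers pipeline.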

For the above, default params would be stored in a config file; if a user doesn't provide an optional param, the default is used.
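
For illustration, the config of defaults might look like the following; every key and value here is an assumption:

```python
# Hypothetical defaults; real values would live in the config file.
DEFAULTS = {
    "data": {"num_examples": 20, "gpt_model": "gpt-3.5-turbo"},
    "model": {"classification": "distilbert-base-uncased",
              "seq2seq": "t5-small"},
    "trainer": {"num_train_epochs": 3,
                "per_device_train_batch_size": 8,
                "train_test_split": 0.2},
}

def resolve(user_value, section: str, key: str):
    """Use the caller's value when given, else fall back to the default."""
    return user_value if user_value is not None else DEFAULTS[section][key]
```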

Next revision would include:

  • Show progress on the model fine-tuning and data generation process. (I still need to figure out how to do this; any suggestions would be helpful. One idea is sketched after this list.)
  • Stricter validation of the generated data, taking "task" as a param and adding task-specific constraints.
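
On the progress question in the first item, one possible approach (only a sketch; the job store, ids, and endpoints are all assumptions) is to run the long job as a FastAPI background task and poll an in-memory status store:

```python
import uuid
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
JOBS: dict[str, dict] = {}  # job_id -> {"status": ..., "progress": ...}

def long_job(job_id: str) -> None:
    # Stand-in for data generation or fine-tuning work.
    for step in range(10):
        JOBS[job_id]["progress"] = (step + 1) / 10
    JOBS[job_id]["status"] = "done"

@app.post("/jobs")
def start_job(background: BackgroundTasks):
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "running", "progress": 0.0}
    background.add_task(long_job, job_id)
    return {"job_id": job_id}

@app.get("/jobs/{job_id}")
def job_status(job_id: str):
    return JOBS.get(job_id, {"status": "unknown"})
```

For training specifically, a transformers TrainerCallback (for example its on_log hook) could update the same store with step counts and loss.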

The above should cover the base required to construct everything. If anything else is required, please specify.

ksgr5566 commented Jul 21, 2023

I am working on this here: AutoTuneNLP

Demo video covering data generation and model fine-tuning: https://drive.google.com/file/d/1s_gXqFMDXiVhOdCegZD32Vgvm9AISaIe/view?usp=sharing
