
Using GPT to create synthetic data for any NLP finetuning task #144

Closed
GautamR-Samagra opened this issue Jun 20, 2023 · 3 comments

GautamR-Samagra commented Jun 20, 2023

This is to create a block that does the following for pre-decided model types (modelling tasks):

  • Take as input a prompt to create synthetic data using GPT
  • Clean the output received from GPT and create/update files for training/validation/testing
  • Fine-tune/retrain a HF model based on the created data (be able to do this in Azure/Colab)
  • Maintain a repository of trained models in HF, validate their accuracy, and push the best to {latest_model} to be used for any use case
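
A rough sketch of the first two steps (generation and cleaning), assuming the openai v1 Python client and a prompt that asks GPT to emit one example per line as "Input: ... Output: ..."; the helper names, file format, and parsing here are assumptions, not a spec:

```python
import csv
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_examples(prompt: str, n: int = 20) -> list[dict]:
    """Ask GPT for n synthetic examples and keep only well-formed lines."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"{prompt}\nGenerate {n} examples, one per line, "
                       "formatted as 'Input: ... Output: ...'",
        }],
    )
    examples = []
    for line in resp.choices[0].message.content.splitlines():
        if "Input:" in line and "Output:" in line:  # minor validation
            inp, out = line.split("Output:", 1)
            examples.append({"input": inp.replace("Input:", "").strip(),
                             "output": out.strip()})
    return examples

def write_split(examples: list[dict], path: str) -> None:
    """Write one of the train/validation/test files as CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["input", "output"])
        writer.writeheader()
        writer.writerows(examples)
```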

We should be able to carry out the following tasks for now:

  • Neural coreference
  • Sentence classification
GautamR-Samagra commented:

Please update the ticket with the plan, @ksgr5566.

ksgr5566 commented:

Based on my understanding of the minimal viable scope of the project:

  • No authentication is needed; users access the functions by providing their own keys.
  • In the UI, per session, the user would need to provide their OpenAI key and Hugging Face key as environment variables.
  • Generated data and fine-tuned models are uploaded to the user's Hugging Face repo.

I am starting by building the API for this.
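
A minimal sketch of how those per-session keys could be accepted, assuming FastAPI with the keys passed as request headers (the header names and endpoint are my assumptions, not part of the plan):

```python
from fastapi import Depends, FastAPI, Header

app = FastAPI()

def api_keys(x_openai_key: str = Header(...),
             x_hf_key: str = Header(...)) -> dict:
    # FastAPI maps these parameters to the x-openai-key and x-hf-key
    # request headers and rejects requests that omit them.
    return {"openai": x_openai_key, "hf": x_hf_key}

@app.get("/ping")
def ping(keys: dict = Depends(api_keys)):
    # Any endpoint can pull the caller's keys through this dependency.
    return {"ok": True}
```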

Core functions:

  • Data Gen:
    auth header: OpenAI key, Hugging Face key
    data: POST - usage: generates examples given a prompt and does minor validation, e.g. checking that "Input" and "Output" are present in the generated content; params: prompt, number of examples, path (Hugging Face repo id).

    data: PUT - usage: same as above, except that "path" is required; loads the data from that path and adds more examples.

    data: DELETE - usage: same as above, "path" is required; loads the data from that path and deletes the specified number of examples.

    data: GET - usage: returns the data generated from the call to GPT. The amount of data generated is specified by the number-of-examples param.

  • Model FineTune:
    auth header: Hugging Face key
    model: POST - usage: fine-tunes a model given the task and a Hugging Face model path, and stores it in the Hugging Face repo the user provides (a rough sketch follows this list); params: task (for now, classification or seq2seq), model name (if not provided, a default from the config file is used), path (Hugging Face repo to store the model), data_file_path (either an uploaded file path or a Hugging Face repo file), a train/test split size, and extra params covering everything a Hugging Face Trainer takes.

    model: PUT - usage: fine-tunes an already existing model in the user's HF repo again; same as above, except that model_name is required and must be a model present in the user's HF repo. That model is fine-tuned again and updated.

  • Inference
    eval: GET - usage: inference; given an HF model name, loads it, runs inference on the input (a file, an HF test file path, or an uploaded file path), and reports the results.
    eval: POST - usage: saves the results in the HF path provided.
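
To make the classification branch of model: POST concrete, here is a hedged sketch built on the Hugging Face Trainer; the column names, default model, and label handling are assumptions, and it presumes the user's HF token is already configured for the push:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune_classifier(data_file: str, repo_id: str,
                        model_name: str = "distilbert-base-uncased",
                        test_size: float = 0.2, num_labels: int = 2):
    # Load the generated CSV and apply the train/test split param.
    ds = load_dataset("csv", data_files=data_file)["train"]
    ds = ds.train_test_split(test_size=test_size)

    tok = AutoTokenizer.from_pretrained(model_name)
    ds = ds.map(lambda b: tok(b["input"], truncation=True,
                              padding="max_length"), batched=True)
    # Assumes the "output" column already holds integer class labels.
    ds = ds.rename_column("output", "labels")

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)
    args = TrainingArguments(output_dir="out", num_train_epochs=3,
                             push_to_hub=True, hub_model_id=repo_id)
    trainer = Trainer(model=model, args=args,
                      train_dataset=ds["train"], eval_dataset=ds["test"])
    trainer.train()
    trainer.push_to_hub()  # stores the model in the user's HF repo
```

The seq2seq branch would swap in AutoModelForSeq2SeqLM with a Seq2SeqTrainer, and eval: GET could reuse the same loading path through a transformers pipeline.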

For the above, default params would be stored in a config file; if a user doesn't provide an optional param, the default is used.
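
For illustration, the config of defaults might look like the following; every key and value here is an assumption:

```python
# Hypothetical defaults; real values would live in the config file.
DEFAULTS = {
    "data": {"num_examples": 20, "gpt_model": "gpt-3.5-turbo"},
    "model": {"classification": "distilbert-base-uncased",
              "seq2seq": "t5-small"},
    "trainer": {"num_train_epochs": 3,
                "per_device_train_batch_size": 8,
                "train_test_split": 0.2},
}

def resolve(user_value, section: str, key: str):
    """Use the caller's value when given, else fall back to the default."""
    return user_value if user_value is not None else DEFAULTS[section][key]
```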

Next revision would include:

  • Show progress on the model fine-tuning and data generation process. (I still need to figure out how to do this; any suggestions would be helpful. One idea is sketched after this list.)
  • Stricter validation of the generated data, taking "task" as a param and adding task-specific constraints.
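
On the progress question in the first item, one possible approach (only a sketch; the job store, ids, and endpoints are all assumptions) is to run the long job as a FastAPI background task and poll an in-memory status store:

```python
import uuid
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
JOBS: dict[str, dict] = {}  # job_id -> {"status": ..., "progress": ...}

def long_job(job_id: str) -> None:
    # Stand-in for data generation or fine-tuning work.
    for step in range(10):
        JOBS[job_id]["progress"] = (step + 1) / 10
    JOBS[job_id]["status"] = "done"

@app.post("/jobs")
def start_job(background: BackgroundTasks):
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "running", "progress": 0.0}
    background.add_task(long_job, job_id)
    return {"job_id": job_id}

@app.get("/jobs/{job_id}")
def job_status(job_id: str):
    return JOBS.get(job_id, {"status": "unknown"})
```

For training specifically, a transformers TrainerCallback (for example its on_log hook) could update the same store with step counts and loss.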

The above should cover the base required to construct everything. If anything else is required, please specify.

ksgr5566 commented Jul 21, 2023

I am working on this here: AutoTuneNLP

Demo video covering data generation and model fine-tuning: https://drive.google.com/file/d/1s_gXqFMDXiVhOdCegZD32Vgvm9AISaIe/view?usp=sharing
