Add script to prepare dataset from csv #462

Anindyadeep · 2023-08-24T17:41:40Z

This commit adds a simple script to prepare a dataset from a CSV file. The script assumes that the CSV contains these three columns: "instruction", "input", and "output" (all case-sensitive).

Fixes #329
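
The column assumption described above could be checked roughly as follows. This is only a minimal sketch using the standard-library `csv` module, not the script added in this PR (the discussion below indicates the actual script relies on pandas). The names `REQUIRED_COLUMNS` and `load_rows` are hypothetical and chosen for illustration.

```python
import csv
import io

# Case-sensitive column names taken from the PR description.
REQUIRED_COLUMNS = ("instruction", "input", "output")


def load_rows(csv_file):
    """Read rows from an open CSV file, failing early if a required column is missing."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    # Keep only the three expected fields for each row.
    return [{col: row[col] for col in REQUIRED_COLUMNS} for row in reader]


# In-memory sample standing in for a real file such as test_data.csv.
sample = io.StringIO(
    "instruction,input,output\n"
    "Add the numbers,1 2,3\n"
)
rows = load_rows(sample)
```

Because the check is case-sensitive, a header such as `Instruction` would be rejected with a `ValueError` in this sketch.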

Anindyadeep · 2023-09-07T13:17:03Z

Hi @aniketmaurya, could I please know the status of this PR? I see it has been on hold for quite a few days.

aniketmaurya · 2023-09-07T13:18:26Z

Hi @Anindyadeep, thanks for the PR. Sorry for the late response, I will review it this week!

Anindyadeep · 2023-09-07T13:27:54Z

Thanks

Anindyadeep · 2023-09-11T16:27:23Z

@aniketmaurya It seems like the tests are failing. Does this mean I need to change the requirements file, or should I do the operations without pandas?

scripts/prepare_csv.py

aniketmaurya

Overall it looks good! I have added a few comments. If you don't mind, can I push directly to your PR and finish it?

Anindyadeep · 2023-09-11T18:17:20Z

Hi @aniketmaurya, I am cool with it; you can take it and finish it. Thanks for reviewing.

aniketmaurya

Hi @lantiga @carmocca, please have a look at this PR.

rasbt · 2023-09-12T12:27:12Z

> I can take this up additionally for the tutorial if you guys are okay with it.
> cc: @aniketmaurya @carmocca

Sounds great, @Anindyadeep!

I think we should include this new CSV approach above the "custom script" approach since this is potentially easier for some people. I modified the tutorial file accordingly and left a "TODO" section for you to fill in. Thanks for the great contribution in this PR here!

Also, if you have a draft, please let me know, I am happy to help and can then go over it as well.

Anindyadeep · 2023-09-12T15:49:02Z

Thanks @rasbt, I created the documentation. I hope this is good to go.

rasbt

Thanks for adding the documentation, this looks great. Here are a few small suggestions:

tutorials/prepare_dataset.md

rasbt · 2023-09-13T17:58:04Z

Just tested the script and it works great. The only nit I have is, shouldn't it be

python scripts/prepare_csv.py --csv_path test_data.csv

instead of

python scripts/prepare_csv.py test_data.csv

for consistency, since we use --args in all other scripts?

Any thoughts @carmocca ?

Anindyadeep · 2023-09-13T18:10:13Z

> Just tested the script and it works great. The only nit I have is, shouldn't it be
> python scripts/prepare_csv.py --csv_path test_data.csv

Yeah, I agree with this.

Since it is a positional argument, I was not able to keep it consistent. However, that can be changed if it is okay.

rasbt · 2023-09-13T18:19:51Z

Sorry, I actually meant changing it in the script itself

Anindyadeep · 2023-09-13T18:23:27Z

Please correct me if we are not on the same page. So I was talking about changing the script so that we can have --csv_path argument introduced 😅

rasbt · 2023-09-13T19:12:04Z

Oh, sorry for causing any confusion. So

> So I was talking about changing the script so that we can have --csv_path argument introduced 😅

was what I had originally in mind. But then my brain thought you were thinking I meant the documentation (since I just reviewed that and provided feedback in the previous round of comments).

Long story short, what I have in mind is that it'd be nice to have --csv_path in both the script and the documentation.

I think the problem with not having --csv_path is that it would not show up when someone uses prepare_csv.py --help to figure out the usage.

Maybe the CLI class doesn't list it there because it's a positional argument. In this case, maybe we can set it to an empty string or None and raise a ValueError if it's not specified. Not sure.

Any thoughts there?

(CC @carmocca who is on top of all the stylistic choices in Lit-GPT)
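
The idea floated above (a keyword `--csv_path` that defaults to `None` and raises a `ValueError` when omitted) might look roughly like this. It is a sketch using the standard-library `argparse`, not the jsonargparse-based CLI that Lit-GPT uses, so it makes no claim about how jsonargparse renders its help text. The function names `prepare` and `build_parser` are hypothetical.

```python
import argparse
from pathlib import Path
from typing import Optional


def prepare(csv_path: Optional[Path] = None) -> Path:
    """Hypothetical entry point: csv_path defaults to None so it can be a keyword option."""
    if csv_path is None:
        raise ValueError("Please provide the CSV file via --csv_path")
    return Path(csv_path)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="prepare_csv.py")
    # A keyword option is listed by name in the --help output.
    parser.add_argument("--csv_path", type=Path, default=None, help="Path to the input CSV file")
    return parser


# Equivalent to: python scripts/prepare_csv.py --csv_path test_data.csv
args = build_parser().parse_args(["--csv_path", "test_data.csv"])
result = prepare(args.csv_path)
```

With this layout the option appears by name in the generated help, and omitting it fails with an explicit error rather than a generic parser message.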

Anindyadeep · 2023-09-13T19:17:48Z

Yeah, I have a question: why are we using jsonargparse instead of fire, which can probably handle positional arguments as optional ones?

carmocca

We use jsonargparse because it's easy to use, simple to understand, and the LightningCLI uses it too

scripts/prepare_csv.py

tutorials/prepare_dataset.md

changed position argument type of `--csv_path` to keyword Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Fixes typo of path Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

removed unusual repetitions Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Additional enhancements Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Anindyadeep · 2023-09-14T08:16:08Z

Thanks, @carmocca, I have committed the suggestions and I guess we are good to go.

tutorials/prepare_dataset.md

Anindyadeep · 2023-09-14T19:00:36Z

Thanks, @aniketmaurya @rasbt @carmocca for the merge. It has been incredible. I am excited to contribute more.

rasbt · 2023-09-14T19:05:56Z

Thanks for the awesome contribution @Anindyadeep. This is super valuable! (PS: I am currently working on an article on datasets and would be excited to mention your awesome PR.) Thanks again!

Anindyadeep · 2023-09-14T19:20:43Z

Thanks again and it would be awesome to get mentioned. PS: The LoRA Article is one of my best reads from your amazing set of articles. Excited to read the upcoming article on datasets.

Add script to prepare dataset from csv

213b084

Anindyadeep requested review from awaelchli, carmocca and lantiga as code owners August 24, 2023 17:41

Anindyadeep mentioned this pull request Aug 24, 2023

Script for data preparation from CSVs #329

Closed

aniketmaurya self-requested a review August 25, 2023 11:29

Merge branch 'main' into anindya/add_csv_script

83e9aa3

Merge branch 'main' into anindya/add_csv_script

0e8a604

aniketmaurya reviewed Sep 11, 2023

scripts/prepare_csv.py (outdated, resolved)

aniketmaurya reviewed Sep 11, 2023

scripts/prepare_csv.py (outdated, resolved)

aniketmaurya reviewed Sep 11, 2023

scripts/prepare_csv.py (outdated, resolved)

aniketmaurya requested changes Sep 11, 2023

aniketmaurya added 8 commits September 11, 2023 20:04

Merge branch 'main' into anindya/add_csv_script

ad0a7a4

update

ce2ff3e

Merge branch 'main' into anindya/add_csv_script

6a58378

fixes

10c4cf9

Merge branch 'main' into anindya/add_csv_script

2f13cb0

fix

69e4f49

formatting

f2611cb

format

a723a8a

aniketmaurya added the "enhancement" and "good first issue" labels Sep 11, 2023

aniketmaurya approved these changes Sep 11, 2023

aniketmaurya added 2 commits September 11, 2023 20:45

update requirements file

1c716ee

Merge branch 'main' into anindya/add_csv_script

41bed21

add tutorial section

1b753e7

rasbt and others added 2 commits September 12, 2023 07:27

Merge branch 'main' into anindya/add_csv_script

6e6dd25

Add documentation to prepare dataset from csv.

853cfad

aniketmaurya requested a review from rasbt September 12, 2023 15:49

Merge branch 'main' into anindya/add_csv_script

2f2043b

rasbt requested changes Sep 13, 2023

tutorials/prepare_dataset.md (outdated, resolved)

tutorials/prepare_dataset.md (outdated, resolved)

tutorials/prepare_dataset.md (outdated, resolved)

Fix: Small changes in documentation with rewordings

59b2af1

carmocca approved these changes Sep 13, 2023

scripts/prepare_csv.py (outdated, resolved)

tutorials/prepare_dataset.md (outdated, resolved)

tutorials/prepare_dataset.md (outdated, resolved)

tutorials/prepare_dataset.md (outdated, resolved)

Anindyadeep and others added 4 commits September 14, 2023 13:40

Update scripts/prepare_csv.py

a0182f0

changed position argument type of `--csv_path` to keyword Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Update tutorials/prepare_dataset.md

8cd00c9

Fixes typo of path Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Update tutorials/prepare_dataset.md

1113cad

removed unusual repetitions Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Update tutorials/prepare_dataset.md

2aa6b1e

Additional enhancements Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

Merge branch 'main' into anindya/add_csv_script

fc73906

rasbt approved these changes Sep 14, 2023

carmocca reviewed Sep 14, 2023

tutorials/prepare_dataset.md (outdated, resolved)

tutorials/prepare_dataset.md (outdated, resolved)

Apply suggestions from code review

2edc7b1

carmocca merged commit d38fa3a into Lightning-AI:main Sep 14, 2023
5 checks passed

carmocca mentioned this pull request Mar 7, 2024

Enable positional arguments for some CLI commands #1028

Closed
