Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add script to prepare dataset from csv #462

Merged
merged 28 commits into from Sep 14, 2023

Conversation

Anindyadeep
Copy link
Contributor

@Anindyadeep Anindyadeep commented Aug 24, 2023

This commit aims to add a simple script to prepare a dataset from csv An assumption that the script make is the csv must contain these three columns: "instruction", "input", "output" (all case sensitive)

Fixes #329

This commit aims to add a simple script to prepare a dataset from csv
An assumption that the script make is the csv must contain these
three columns: "instruction", "input", "output" (all case sensitive)
@Anindyadeep
Copy link
Contributor Author

Hi @aniketmaurya , can I please know the status of this PR, I see it is getting in hold for quite some days.

@aniketmaurya
Copy link
Contributor

Hi @aniketmaurya , can I please know the status of this PR, I see it is getting in hold for quite some days.

hi @Anindyadeep, thanks for the PR. sorry for late response, I will review it this week!

@Anindyadeep
Copy link
Contributor Author

Hi @aniketmaurya , can I please know the status of this PR, I see it is getting in hold for quite some days.

hi @Anindyadeep, thanks for the PR. sorry for late response, I will review it this week!

Thanks

@Anindyadeep
Copy link
Contributor Author

@aniketmaurya Seems like tests are failing, this means should I need to change the requirements file or do the operations without pandas?

Copy link
Contributor

@aniketmaurya aniketmaurya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good! I have added few comments. If you don't mind I can directly push to your PR and finish it?

@Anindyadeep
Copy link
Contributor Author

Hi @aniketmaurya, I am cool with it you can take it and finish it. Thanks for reviewing

@aniketmaurya aniketmaurya added enhancement New feature or request good first issue Good for newcomers labels Sep 11, 2023
Copy link
Contributor

@aniketmaurya aniketmaurya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lantiga @carmocca, please have a look at this PR.

@rasbt
Copy link
Collaborator

rasbt commented Sep 12, 2023

I can take this up additionally for the tutorial if you guys are okay with it.
cc: @aniketmaurya @carmocca

Sounds great, @Anindyadeep !

I think we should include this new CSV approach above the "custom script" approach since this is potentially easier for some people. I modified the tutorial file accordingly and left a "TODO" section for you to fill in. Thanks for the great contribution in this PR here!

Also, if you have a draft, please let me know, I am happy to help and can then go over it as well.

@Anindyadeep
Copy link
Contributor Author

Thanks @rasbt, I created the documentation, hope this should be good to go.

Copy link
Collaborator

@rasbt rasbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the documentation, this looks great. Here are a few small suggestions:

tutorials/prepare_dataset.md Outdated Show resolved Hide resolved
tutorials/prepare_dataset.md Outdated Show resolved Hide resolved
tutorials/prepare_dataset.md Outdated Show resolved Hide resolved
@rasbt
Copy link
Collaborator

rasbt commented Sep 13, 2023

Just tested the script and it works great. The only nit I have is, shouldn't it be

python scripts/prepare_csv.py --csv_path test_data.csv

instead of

python scripts/prepare_csv.py test_data.csv

for consistency, since we use --args in all other scripts?

Any thoughts @carmocca ?

@Anindyadeep
Copy link
Contributor Author

Just tested the script and it works great. The only nit I have is, shouldn't it be

python scripts/prepare_csv.py --csv_path test_data.csv

Yeah, I agree with this.

Since it is a positional argument, I was not able to provide consistency, however, that can be changed if it is okay.

@rasbt
Copy link
Collaborator

rasbt commented Sep 13, 2023

Sorry, I actually meant changing it in the script itself

@Anindyadeep
Copy link
Contributor Author

Anindyadeep commented Sep 13, 2023

Sorry, I actually meant changing it in the script itself

Please correct me if we are on the same page or not. So I was talking about changing the script so that we can have --csv_path argument introduced 😅

@rasbt
Copy link
Collaborator

rasbt commented Sep 13, 2023

Oh, sorry for causing any confusion. So

So I was talking about changing the script so that we can have --csv_path argument introduced 😅

was what I had originally in mind. But then my brain thought you were thinking I meant the documentation (since I just reviewed that and provided feedback in the previous round of comments).

Long story short, what I have in mind is that it'd be nice to have the --csv_path in both the script (and then in the documentation).

I think the problem of not having the --csv_path is that it would show up when someone uses prepare_csv.py --help
to figure out the usage.

Maybe the CLI class doesn't list it there because it's a positional argument. In this case, maybe we can set it to an empty string or None and raise an ValueError if it's not specified. Not sure.

Any thoughts there?

(CC @carmocca who is on top of all the stylistic choices in Lit-GPT)

@Anindyadeep
Copy link
Contributor Author

Oh, sorry for causing any confusion. So

So I was talking about changing the script so that we can have --csv_path argument introduced 😅

was what I had originally in mind. But then my brain thought you were thinking I meant the documentation (since I just reviewed that and provided feedback in the previous round of comments).

Long story short, what I have in mind is that it'd be nice to have the --csv_path in both the script (and then in the documentation).

I think the problem of not having the --csv_path is that it would show up when someone uses prepare_csv.py --help to figure out the usage.

Maybe the CLI class doesn't list it there because it's a positional argument. In this case, maybe we can set it to an empty string or None and raise an ValueError if it's not specified. Not sure.

Any thoughts there?

(CC @carmocca who is on top of all the stylistic choices in Lit-GPT)

Yeah, I have a question, why we are using jsonargparse instead of fire , Which probably can handle positional arguments as the optional one.

Copy link
Member

@carmocca carmocca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use jsonargparse because it's easy to use, simple to understand, and the LightningCLI uses it too

scripts/prepare_csv.py Outdated Show resolved Hide resolved
tutorials/prepare_dataset.md Outdated Show resolved Hide resolved
tutorials/prepare_dataset.md Outdated Show resolved Hide resolved
tutorials/prepare_dataset.md Outdated Show resolved Hide resolved
Anindyadeep and others added 4 commits September 14, 2023 13:40
changed position argument type of `--csv_path` to keyword

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Fixes typo of path

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
removed unusual repetitions

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Additional enhancements

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
@Anindyadeep
Copy link
Contributor Author

Thanks, @carmocca, I have committed the suggestions and guess we are good to go.

tutorials/prepare_dataset.md Outdated Show resolved Hide resolved
tutorials/prepare_dataset.md Outdated Show resolved Hide resolved
@carmocca carmocca merged commit d38fa3a into Lightning-AI:main Sep 14, 2023
5 checks passed
@Anindyadeep
Copy link
Contributor Author

Thanks, @aniketmaurya @rasbt @carmocca for the merge. It has been incredible. I am excited to contribute more.

@rasbt
Copy link
Collaborator

rasbt commented Sep 14, 2023

Thanks for the awesome contribution @Anindyadeep. This is super valuable! (PS: I am currently working on an article on datasets and be excited to mention your awesome PR) Thanks again!

@Anindyadeep
Copy link
Contributor Author

Thanks for the awesome contribution @Anindyadeep. This is super valuable! (PS: I am currently working on an article on datasets and be excited to mention your awesome PR) Thanks again!

Thanks again and it would be awesome to get mentioned. PS: The LoRA Article is one of my best reads from your amazing set of articles. Excited to read the upcoming article on datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Script for data preparation from CSVs
4 participants