Add script to prepare dataset from csv #462
This commit adds a simple script to prepare a dataset from a CSV file. The script assumes that the CSV contains these three columns: "instruction", "input", "output" (all case-sensitive).
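To illustrate the column assumption described above, here is a minimal sketch of a header check. It is not the code from this PR: the helper name `missing_columns` and the sample rows are hypothetical, and it uses the standard-library `csv` module rather than pandas.

```python
import csv
import io

# Column names the description says the CSV must contain (case-sensitive).
REQUIRED_COLUMNS = ("instruction", "input", "output")


def missing_columns(csv_file):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    # Plain string comparison, so "Instruction" does not satisfy "instruction".
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Hypothetical in-memory files standing in for real CSVs on disk.
good = io.StringIO("instruction,input,output\nAdd the numbers,1 2,3\n")
bad = io.StringIO("Instruction,input,output\nAdd the numbers,1 2,3\n")

print(missing_columns(good))  # []
print(missing_columns(bad))   # ['instruction']
```

A script could run a check like this before any further processing and stop with a clear error message when the returned list is not empty.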
Hi @aniketmaurya, may I know the status of this PR? It has been on hold for quite some days.
Hi @Anindyadeep, thanks for the PR. Sorry for the late response, I will review it this week!
Thanks
@aniketmaurya It seems the tests are failing. Does this mean I should change the requirements file, or do the operations without pandas?
Overall it looks good! I have added a few comments. If you don't mind, I can push directly to your PR and finish it?
Hi @aniketmaurya, I am cool with it; you can take it and finish it. Thanks for reviewing.
Sounds great, @Anindyadeep! I think we should include this new CSV approach above the "custom script" approach, since it is potentially easier for some people. I modified the tutorial file accordingly and left a "TODO" section for you to fill in. Thanks for the great contribution in this PR! Also, if you have a draft, please let me know; I am happy to help and can then go over it as well.
Thanks @rasbt, I created the documentation; I hope this is good to go.
Thanks for adding the documentation, this looks great. Here are a few small suggestions:
Just tested the script and it works great. The only nit I have is: shouldn't it be `python scripts/prepare_csv.py --csv_path test_data.csv` instead of `python scripts/prepare_csv.py test_data.csv` for consistency, since we use Any thoughts @carmocca?
Yeah, I agree with this. Since it is a positional argument, I was not able to keep it consistent; however, that can be changed if that is okay.
Sorry, I actually meant changing it in the script itself.
Please correct me if we are not on the same page. I was talking about changing the script so that we can have
Oh, sorry for causing any confusion. So
was what I had originally in mind. But then my brain thought you were thinking I meant the documentation (since I just reviewed that and provided feedback in the previous round of comments). Long story short, what I have in mind is that it'd be nice to have the I think the problem of not having the Maybe the CLI class doesn't list it there because it's a positional argument. In this case, maybe we can set it to an empty string or None and raise a ValueError if it's not specified. Not sure. Any thoughts there? (CC @carmocca, who is on top of all the stylistic choices in Lit-GPT)
Yeah, I have a question: why are we using
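The "default to None and raise a ValueError" idea from the comment above could look roughly like the sketch below. This is an assumption-laden illustration, not the merged script: the function name `prepare`, its return value, and the error message are hypothetical. The underlying idea is that signature-based CLI tools typically expose a parameter with a default value as a keyword option such as `--csv_path`, while a parameter without a default becomes positional.

```python
from pathlib import Path
from typing import Optional


def prepare(csv_path: Optional[Path] = None) -> Path:
    """Hypothetical entry point: csv_path has a default, so it is not positional."""
    if csv_path is None:
        # The argument is still mandatory in practice, so fail with a clear message.
        raise ValueError("Please specify the CSV file, e.g. --csv_path test_data.csv")
    return Path(csv_path)


print(prepare(Path("test_data.csv")))
```

Calling `prepare()` without an argument raises the `ValueError`, which keeps the argument required even though the signature gives it a default.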
We use jsonargparse because it's easy to use, simple to understand, and the LightningCLI uses it too.
changed position argument type of `--csv_path` to keyword Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Fixes typo of path Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
removed unusual repetitions Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Additional enhancements Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Thanks, @carmocca, I have committed the suggestions and I guess we are good to go.
Thanks, @aniketmaurya @rasbt @carmocca, for the merge. It has been incredible. I am excited to contribute more.
Thanks for the awesome contribution, @Anindyadeep. This is super valuable! (PS: I am currently working on an article on datasets and would be excited to mention your awesome PR.) Thanks again!
Thanks again, and it would be awesome to get mentioned. PS: The LoRA article is one of my best reads from your amazing set of articles. Excited to read the upcoming article on datasets.
Fixes #329