Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small fixes for generate_textcorpus script #566

Closed
5 tasks done
jerielizabeth opened this issue Dec 20, 2023 · 7 comments
Closed
5 tasks done

Small fixes for generate_textcorpus script #566

jerielizabeth opened this issue Dec 20, 2023 · 7 comments
Assignees
Labels

Comments

@jerielizabeth
Copy link
Contributor

jerielizabeth commented Dec 20, 2023

During testing we found a few issues related to the help text and options of the generate_textcorpus.py that would improve functionality and make the use clearer:

Help Text for generate_textcorpus

  • Missing options for turning off progress bar
  • Default batch size needs to be documented
  • Typo in no-gzip option (missing space)

Explanation at the head of the script file

  • Update the examples in head of script. One example with default expected use is good.

Additional functionality

  • It would be useful to have labels on progress bar, so that user knows what the counts are for / related to.
@mnaydan
Copy link
Contributor

mnaydan commented Jan 5, 2024

One additional task:

  • Rename page field names to "order" (for digital page sequence) and "label" (for physical page number) to maintain consistency with database

@quadrismegistus
Copy link
Contributor

Missing options for turning off progress bar

Right now the progress bar is used if verbosity > 0, and logged statements are used if verbosity > 1. Does that sound like a good solution, and we just need to document it, or should we add a separate progressbar option?

@mnaydan
Copy link
Contributor

mnaydan commented Jan 9, 2024

@rlskoeser what do you think ^here?

@rlskoeser
Copy link
Contributor

Missing options for turning off progress bar

Right now the progress bar is used if verbosity > 0, and logged statements are used if verbosity > 1. Does that sound like a good solution, and we just need to document it, or should we add a separate progressbar option?

It's possible we'll want to run this as a cron job, in which case we would definitely not want a progress bar but we might want some output to go to a log (not verbosity=0, which is I imagine what you meant?). The other option besides a separate flag would be detecting if it's not a terminal. I'm fine with whichever is easier.

@jerielizabeth
Copy link
Contributor Author

  • Ran the script using the updated examples (thank you!). Worked as expected.
  • confirmed the typo fix in the help text
  • checked metadata - field name updated as expected.

One requested change with the examples - the path used in the examples bypasses the default timestamped folder and introduced some user challenges for me with rsync (since I ran the script as conan but am connecting via rsync as pulsys.) Can there be documentation of the default path (with the default batch size, perhaps), and not use that argument in the examples?

@quadrismegistus
Copy link
Contributor

Hm, the default batch size is already documented, and I'm not sure it's worth putting in the examples because there seems to be no benefit (in fact it slows it down) to use a smaller or larger number than the default. I'll push a tiny PR for the path change

@jerielizabeth
Copy link
Contributor Author

Thank you! I merged the PR and declare this issue closed :D

@rlskoeser rlskoeser changed the title Small fixes for generate_textcorpus.py Small fixes for generate_textcorpus script Jan 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants