As a researcher, I want to be able to get the text corpus for a subset of record IDs so that I can conduct textual analysis within particular groups of texts. #5

Closed
7 tasks done
mnaydan opened this issue Mar 19, 2024 · 5 comments
@mnaydan
Collaborator

mnaydan commented Mar 19, 2024

acceptance criteria

  • filter script successfully filters the corpus and generates a subset corpus that includes all pages for the specified source ids and only pages for those source ids
    • for simple idfile (one source_id per line; non-excerpt sources)
    • for simple idfile with leading whitespace
    • for simple idfile with trailing whitespace
    • for idfiles with sources with 1 or more excerpts
    • for idfiles with a mix of existing and non-existing sources
  • filter script has reasonable error handling (missing/empty id file, output file already exists)
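The criteria above could be sketched roughly as follows. This is a minimal illustration, not the project's actual filter script: it assumes the corpus is a sequence of page records (dicts) with a hypothetical `source_id` field, that output is written as JSON lines, and the function names `load_source_ids`, `filter_pages`, and `save_filtered` are made up for the example. How excerpts are matched to source ids is not specified here and is left out.

```python
import json
from pathlib import Path


def load_source_ids(idfile):
    """Read one source id per line, ignoring surrounding whitespace and blank lines."""
    path = Path(idfile)
    if not path.is_file():
        raise FileNotFoundError(f"id file not found: {path}")
    lines = path.read_text(encoding="utf-8").splitlines()
    # strip() handles both leading and trailing whitespace
    ids = {line.strip() for line in lines if line.strip()}
    if not ids:
        raise ValueError(f"id file is empty: {path}")
    return ids


def filter_pages(pages, source_ids):
    """Yield only pages whose (hypothetical) source_id field is in source_ids."""
    for page in pages:
        if page.get("source_id") in source_ids:
            yield page


def save_filtered(pages, source_ids, output_file):
    """Write matching pages as JSON lines and return how many were written."""
    count = 0
    # mode "x" raises FileExistsError instead of overwriting an existing file
    with open(output_file, "x", encoding="utf-8") as out:
        for page in filter_pages(pages, source_ids):
            out.write(json.dumps(page) + "\n")
            count += 1
    return count
```

Ids listed in the id file that do not occur in the corpus simply match nothing, so a mix of existing and non-existing sources yields only the pages for the existing ones.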
@mnaydan changed the title from "As a researcher, I want to be able to get the text corpus for a subset for record IDs so that I can conduct textual analysis within particular groups of texts." to "As a researcher, I want to be able to get the text corpus for a subset of record IDs so that I can conduct textual analysis within particular groups of texts." on Mar 19, 2024
@rlskoeser
Collaborator

Here's a screen recording of my terminal showing the corpus subset functionality:

Screen.Recording.2024-04-02.at.2.34.46.PM.mov

@rlskoeser added the "🗜️ awaiting testing" label (Implemented and ready to be tested) on Apr 3, 2024
@laurejt
Contributor

laurejt commented Apr 4, 2024

Should an output file be created if there are no matches?

@rlskoeser
Collaborator

> Should an output file be created if there are no matches?

Ideally not, I guess, but I don't think there's an easy way to detect this until after consuming the generator. We could check afterward and remove the file if it's zero size.
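For illustration, the check-and-remove idea might look something like this. It is only a sketch under assumptions: `write_or_discard` is a hypothetical helper, pages are assumed to be JSON-serializable dicts coming from a generator, and the output is assumed to be JSON lines.

```python
import json
from pathlib import Path


def write_or_discard(pages, output_file):
    """Write pages as JSON lines; delete the file again if nothing was written.

    Returns True if the output file was kept, False if it was removed.
    """
    path = Path(output_file)
    with path.open("w", encoding="utf-8") as out:
        # the generator is only consumed here, so emptiness is unknown beforehand
        for page in pages:
            out.write(json.dumps(page) + "\n")
    # the file is closed and flushed at this point, so its size is reliable
    if path.stat().st_size == 0:
        path.unlink()
        return False
    return True
```

Counting written pages inside the loop would work as well and avoids the extra `stat` call; the size check just mirrors the approach described above.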

@laurejt
Contributor

laurejt commented Apr 4, 2024

Testing acceptance complete. The only outstanding issue is whether the script should produce empty output files when no matches are found (either way seems reasonable).

@rlskoeser
Collaborator

Thanks for testing. Let's leave the empty file behavior as it is for now; we've both put enough time into this feature already. We can always revisit and tweak it later if we find it's causing problems.

@rlskoeser removed the "🗜️ awaiting testing" label (Implemented and ready to be tested) on Apr 4, 2024