As a researcher, I want to be able to get the text corpus for a subset of record IDs so that I can conduct textual analysis within particular groups of texts. #5

mnaydan · 2024-03-19T20:17:53Z

acceptance criteria

filter script successfully filters the corpus and generates a subset corpus that includes all pages for the specified source ids and only pages for those source ids
- for simple idfile (one source_id per line; non-excerpt sources)
- for simple idfile with leading whitespace
- for simple idfile with trailing whitespace
- for idfiles with sources with 1 or more excerpts
- for idfiles with a mix of existing and non-existing sources
filter script has reasonable error handling (missing/empty id file, output file already exists)
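A minimal sketch of how these criteria could be met, not the project's actual implementation. It assumes the corpus is a stream of page records (dicts) that each carry a `source_id` field, and that every excerpt of a source shares that source's id. The field name and the helper names `load_source_ids`, `filter_pages`, and `write_subset` are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def load_source_ids(idfile: Path) -> set[str]:
    """Read one source id per line, ignoring surrounding whitespace and blank lines."""
    if not idfile.is_file():
        raise FileNotFoundError(f"id file not found: {idfile}")
    text = idfile.read_text(encoding="utf-8")
    ids = {line.strip() for line in text.splitlines() if line.strip()}
    if not ids:
        raise ValueError(f"id file is empty: {idfile}")
    return ids


def filter_pages(pages: Iterable[dict], source_ids: set[str]) -> Iterator[dict]:
    """Yield only pages whose (assumed) 'source_id' field is in source_ids."""
    for page in pages:
        if page.get("source_id") in source_ids:
            yield page


def write_subset(pages: Iterable[dict], output: Path) -> int:
    """Write pages as JSON lines and return the number of pages written."""
    count = 0
    # Mode "x" raises FileExistsError instead of overwriting an existing file.
    with output.open("x", encoding="utf-8") as out:
        for page in pages:
            out.write(json.dumps(page) + "\n")
            count += 1
    return count
```

Ids listed in the id file that match no pages are simply never yielded, so a mix of existing and non-existing sources needs no special handling in this sketch.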

rlskoeser · 2024-04-02T20:11:08Z

Here's a screen recording of my terminal showing the corpus subset functionality:

Screen.Recording.2024-04-02.at.2.34.46.PM.mov

laurejt · 2024-04-04T14:24:13Z

Should an output file be created if there are no matches?

rlskoeser · 2024-04-04T14:27:27Z

> Should an output file be created if there are no matches?

I guess ideally not, but I don't think there's an easy way to detect this until after the generator has been consumed. We could check afterward and remove the output file if it's zero size.
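A rough sketch of what that after-the-fact check could look like, assuming the output is written line by line from a generator. The helper name `write_lines_or_remove` is hypothetical and not part of the existing script.

```python
from pathlib import Path
from typing import Iterable


def write_lines_or_remove(lines: Iterable[str], output: Path) -> bool:
    """Write lines to output; delete the file and return False if nothing was written."""
    # The generator has to be fully consumed before we know whether anything matched.
    with output.open("x", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
    # Only after the file is closed can its size be checked reliably.
    if output.stat().st_size == 0:
        output.unlink()
        return False
    return True
```

The check runs only once the file is closed, so the size reflects everything that was written.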

laurejt · 2024-04-04T14:42:19Z

Acceptance testing complete. The only outstanding issue is whether the script should produce empty output files when no matches are found (either way seems reasonable).

rlskoeser · 2024-04-04T14:45:18Z

Thanks for testing. Let's leave the empty file behavior as it is for now; we've both put enough time into this feature already. We can always revisit and tweak it later if we find it's causing problems.

mnaydan assigned rlskoeser Mar 19, 2024

rlskoeser added a commit that referenced this issue Mar 27, 2024

Utility method & script to filter PPA full-text corpus by source id

d9d4c6e

resolves #5

rlskoeser mentioned this issue Mar 28, 2024

Functionality to filter/subset PPA full-text corpus by source id #8

Merged

rlskoeser added the 🗜️ awaiting testing Implemented and ready to be tested label Apr 3, 2024

rlskoeser removed the 🗜️ awaiting testing Implemented and ready to be tested label Apr 4, 2024

rlskoeser closed this as completed Apr 4, 2024

mnaydan commented Mar 19, 2024 • edited by laurejt
