As a researcher, I want a set of 20-25 representative texts so that I can conduct controlled experiments on a sample corpus. #6

mnaydan · 2024-03-19T20:30:11Z

mnaydan · 2024-03-28T13:10:45Z

@rlskoeser @laurejt @WHaverals tagging for review. Here is the link to the Google Sheet

Acceptance criteria:

sufficient metadata for use with Rebecca's script to get a subset from the text corpus
sufficient representation across the metadata categories represented in the PPA: Gale, HathiTrust, old, new, excerpts/articles/full works
sufficient number of straightforward, challenging, control (prose-only), and edge cases (see "reason" column)

rlskoeser · 2024-03-28T19:34:17Z

Here are my review notes:

sufficient metadata for use with Rebecca's script to get a subset from the text corpus
sufficient representation across the metadata categories represented in the PPA: Gale, HathiTrust, old, new, excerpts/articles/full works
sufficient number of straightforward, challenging, control (prose-only), and edge cases (see "reason" column)

I used the list of source ids from the spreadsheet to create a text file of ids, which I successfully used with my filter script.
Six Gale records, mix of CB and CW; 7ish excerpt/articles; nice range of publication dates.
Reason column is so informative and helpful! Lots of interesting cases that will be useful to test against and think about. My only worry from reading through the spreadsheet is that there won't be enough straightforward/easy examples, but I'm not sure how many we actually need and I see that there are a few. Happy to defer to Mary's wisdom on this.

FYI / question:

My corpus filter utility as currently written takes a text file with one id per line and only uses source id. It is extensible for other filtering options if/when we want them. Is that sufficient for now? Do you imagine we might sometime want one excerpt from a source but not another, and if so what would you use to filter? (unique id with source id + p# ?)
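
For reference, the behavior described above (one id per line, matching on source id only) could be sketched roughly as follows. This is a minimal illustration, not the actual filter script: the function names, the `source_id` field name, and the assumption that corpus records are dictionaries are all hypothetical.

```python
from typing import Iterable, Iterator


def load_ids(lines: Iterable[str]) -> set[str]:
    """Collect ids from lines of text, one id per line.

    Surrounding whitespace is stripped and blank lines are skipped.
    """
    return {line.strip() for line in lines if line.strip()}


def filter_by_source_id(
    records: Iterable[dict],
    source_ids: set[str],
    key: str = "source_id",  # hypothetical field name, not confirmed by the thread
) -> Iterator[dict]:
    """Yield only the records whose source id is in the requested set."""
    for record in records:
        if record.get(key) in source_ids:
            yield record
```

In use, `load_ids` would be given an open text file (for example `load_ids(open("ids.txt"))`, where `ids.txt` is a placeholder name), and the resulting set passed to `filter_by_source_id` along with the corpus records. Because matching is on source id alone, every excerpt from a listed source is kept.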

mnaydan · 2024-03-28T19:40:20Z

@rlskoeser thank you for these helpful notes! There is in fact one excerpt from a source but not the other currently in the test set (hvd.32044050827351). I would probably filter on unique id (source id + p#) but that's tricky since we are changing that for stability. Is it easier if I swap that record out for a different one?

rlskoeser · 2024-03-28T19:43:27Z

@mnaydan I think let's keep it in! That's a good edge case to have in mind; the filter script won't support it properly as currently written, but there are a couple of ways to handle that - do you want that kind of filtering supported this round or as a second pass? My preference would be to use unique ids once we fix them.

mnaydan · 2024-03-28T19:45:53Z

Let's support it in the second pass! Once we fix the unique ids.
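
One possible shape for that second pass, sketched under stated assumptions: each record carries both a unique `id` and a `source_id` (hypothetical field names), and an id listed in the text file may be either one. The ids in the example are made up, and the "source id + p#" format shown is only a placeholder, since the unique id scheme is still being changed.

```python
from typing import Iterable, Iterator


def filter_by_any_id(records: Iterable[dict], wanted_ids: set[str]) -> Iterator[dict]:
    """Yield records matched by unique id or by source id.

    Listing a source id keeps every record from that source;
    listing a unique id keeps only that single excerpt.
    """
    for record in records:
        if record.get("id") in wanted_ids or record.get("source_id") in wanted_ids:
            yield record


# Made-up records: two excerpts from one source, plus one full work.
records = [
    {"id": "src.001-p10", "source_id": "src.001"},
    {"id": "src.001-p200", "source_id": "src.001"},
    {"id": "src.002", "source_id": "src.002"},
]

# Request one excerpt from src.001 (but not the other) and all of src.002.
selected = list(filter_by_any_id(records, {"src.001-p10", "src.002"}))
```

The trade-off in this design is that the same id list can mix whole sources and individual excerpts, at the cost of requiring that unique ids never collide with source ids of other works.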

mnaydan · 2024-04-10T19:16:47Z

I'm going to close this since we agreed during standup that the set is likely good enough and is already quite big in terms of number of pages.

mnaydan self-assigned this Mar 19, 2024

mnaydan added the "🗜️ awaiting testing" label (Implemented and ready to be tested) Apr 1, 2024

mnaydan removed the "🗜️ awaiting testing" label (Implemented and ready to be tested) Apr 10, 2024

mnaydan closed this as completed Apr 10, 2024
