As a researcher, I want a set of 20-25 representative texts so that I can conduct controlled experiments on a sample corpus. #6

Closed
11 tasks done
mnaydan opened this issue Mar 19, 2024 · 6 comments

@mnaydan
Collaborator

mnaydan commented Mar 19, 2024

  • List of record IDs with basic metadata
  • Short description of why these texts

Include:

  • Gale
  • HathiTrust
  • Old
  • New
  • Excerpts
  • Full
  • Prose only
  • Includes poetry
  • Music
@mnaydan mnaydan self-assigned this Mar 19, 2024
@mnaydan
Collaborator Author

mnaydan commented Mar 28, 2024

@rlskoeser @laurejt @WHaverals tagging for review. Here is the link to the Google Sheet

Acceptance criteria:

  • sufficient metadata for use with Rebecca's script to get a subset from the text corpus
  • sufficient representation across the metadata categories represented in the PPA: Gale, HathiTrust, old, new, excerpts/articles/full works
  • sufficient number of straightforward, challenging, control (prose-only), and edge cases (see "reason" column)

@rlskoeser
Collaborator

rlskoeser commented Mar 28, 2024

Here are my review notes:

  • sufficient metadata for use with Rebecca's script to get a subset from the text corpus
  • sufficient representation across the metadata categories represented in the PPA: Gale, HathiTrust, old, new, excerpts/articles/full works
  • sufficient number of straightforward, challenging, control (prose-only), and edge cases (see "reason" column)
  1. I used the list of source ids from the spreadsheet to create a text file of ids, which I successfully used with my filter script.
  2. Six Gale records, mix of CB and CW; 7ish excerpt/articles; nice range of publication dates.
  3. Reason column is so informative and helpful! Lots of interesting cases that will be useful to test against and think about. My only worry from reading through the spreadsheet is that there won't be enough straightforward/easy examples, but I'm not sure how many we actually need and I see that there are a few. Happy to defer to Mary's wisdom on this.

FYI / question:

  • My corpus filter utility as currently written takes a text file with one id per line and only uses source id. It is extensible for other filtering options if/when we want them. Is that sufficient for now? Do you imagine we might sometimes want one excerpt from a source but not another, and if so what would you use to filter? (unique id with source id + p# ?)
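
For anyone following along, filtering on source id from a one-id-per-line file could look roughly like the sketch below. This is not the actual filter script: the function names, the `source_id` field name, and the shape of the page records are all assumptions made for illustration, and `example.0001` / `example.0002` are made-up ids.

```python
from typing import Iterable, Iterator


def load_ids(lines: Iterable[str]) -> set[str]:
    """Collect one id per line, ignoring blank lines and surrounding whitespace."""
    return {line.strip() for line in lines if line.strip()}


def filter_records(
    records: Iterable[dict], source_ids: set[str], key: str = "source_id"
) -> Iterator[dict]:
    """Yield only the records whose source id is in the requested set.

    `key` is a hypothetical field name; the real corpus may use another.
    """
    for record in records:
        if record.get(key) in source_ids:
            yield record


# In practice the lines would come from an open text file of ids.
wanted = load_ids(["hvd.32044050827351\n", "\n", "example.0002\n"])

# Hypothetical page-level records standing in for the text corpus.
pages = [
    {"source_id": "hvd.32044050827351", "text": "page one"},
    {"source_id": "example.0001", "text": "not selected"},
    {"source_id": "example.0002", "text": "page two"},
]

subset = list(filter_records(pages, wanted))
```

Because this matches on source id alone, every page from a matching source is kept, so it cannot select one excerpt from a source while excluding another; that would need a finer-grained key.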

@mnaydan
Collaborator Author

mnaydan commented Mar 28, 2024

@rlskoeser thank you for these helpful notes! There is in fact one excerpt from a source but not the other currently in the test set (hvd.32044050827351). I would probably filter on unique id (source id + p#) but that's tricky since we are changing that for stability. Is it easier if I swap that record out for a different one?

@rlskoeser
Collaborator

@mnaydan I think let's keep it in! That's a good edge case to have in mind; the filter script won't support it properly as currently written, but there are a couple of ways to handle that - do you want that kind of filtering supported this round or as a second pass? My preference would be to use unique ids once we fix them.

@mnaydan
Collaborator Author

mnaydan commented Mar 28, 2024

Let's support it in the second pass! Once we fix the unique ids.

@mnaydan mnaydan added the 🗜️ awaiting testing Implemented and ready to be tested label Apr 1, 2024
@mnaydan mnaydan removed the 🗜️ awaiting testing Implemented and ready to be tested label Apr 10, 2024
@mnaydan
Collaborator Author

mnaydan commented Apr 10, 2024

I'm going to close this since we discussed during standup that it is likely good enough and is already quite big in terms of number of pages.

@mnaydan mnaydan closed this as completed Apr 10, 2024