generate a report of duplicates in event data export to aid review and cleaning #681

Closed
4 tasks done
rlskoeser opened this issue Jul 9, 2020 · 3 comments

rlskoeser commented Jul 9, 2020

goodtables validation revealed a fair number of duplicate rows in the events data export, and we'd like to check them in a format that's easier to use than going back and forth between the goodtables output by line number and a spreadsheet.

Please work from this export in google drive (latest revisions to export logic, generated from a fresh copy of production data): https://drive.google.com/drive/u/0/folders/1aPDGBhT9CE0aIozbaelcPrHb6_hDDTkC

Josh would like them sorted by event type:

If we could have the duplicated subscription events first (subs, renewals, supplements), and book events (borrows, purchases, generics, etc.) next, that would be ideal. Ian could do the sub events and I could do the books.

If we include some specific site urls for different event types, I think we can make this even easier for them to review quickly. (I'll add notes in a moment).

No need to use goodtables for this; use pandas or whatever else is easiest. (If you use something else, please sanity check the total number of duplicates against the goodtables report.)
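One way to do that sanity check, sketched here with only the Python standard library so it runs without extra dependencies. The sample data and its column names are made up for illustration, and which of the two numbers lines up with the goodtables total depends on how that report tallies duplicates, so treat that as something to confirm rather than a given.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for events.csv; columns and values are illustrative only.
SAMPLE = """event_type,start_date,membership_urls
Subscription,1921-01-01,https://example.com/members/a/
Subscription,1921-01-01,https://example.com/members/a/
Borrow,1922-03-05,https://example.com/members/b/
Subscription,1921-01-01,https://example.com/members/a/
"""

def duplicate_counts(csv_file):
    """Return (extra_copies, rows_in_duplicate_groups) for an open CSV file."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    counts = Counter(tuple(row) for row in reader if row)
    repeated = [n for n in counts.values() if n > 1]
    extra_copies = sum(n - 1 for n in repeated)  # rows that repeat an earlier row
    rows_in_groups = sum(repeated)               # every row that has an exact twin
    return extra_copies, rows_in_groups

extra, total = duplicate_counts(io.StringIO(SAMPLE))
print(extra, total)  # 2 3
```

With pandas, `df.duplicated().sum()` and `df.duplicated(keep=False).sum()` should give the same two numbers, assuming the file is read without type coercion changing any values.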

  • identify duplicate rows in the events.csv
  • sort by / segment out membership events (Subscription, Supplement, Other, Reimbursement) from all other events; possibly also useful to separate remaining events that are associated with books by the presence of item_uri
  • for membership events, please link to the individual member activities page on the site; generate this by adding membership/ to the first url in membership_urls (in the case of joint accounts, member info is in one field delimited by ";"; using the first one should be sufficient)
  • for non-membership events, please link to the borrowing activities page on the site; generate this by adding borrowing/ to the first url in membership_urls
  • for any event with a source_image, provide the link to the image so reviewers can quickly look at the image as they review the events (mostly book activity but there are some exceptions, I don't know if they will show up in the duplicates or not)
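
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the `event_type` column name, the `;` delimiter, and the sample data are assumptions, while `membership_urls` and `source_image` are taken from the task list. It uses only the standard library; pandas would work just as well.

```python
import csv
import io
from collections import Counter

# Event types treated as membership events, per the task list.
MEMBERSHIP_TYPES = {"Subscription", "Supplement", "Other", "Reimbursement"}

# Hypothetical sample standing in for events.csv.
SAMPLE = """event_type,membership_urls,source_image
Borrow,https://example.com/members/b/,https://example.com/iiif/img2
Subscription,https://example.com/members/a/;https://example.com/members/c/,
Borrow,https://example.com/members/b/,https://example.com/iiif/img2
Purchase,https://example.com/members/d/,
Subscription,https://example.com/members/a/;https://example.com/members/c/,
"""

def first_url(value, delimiter=";"):
    """Return the first URL of a delimited field (assumed delimiter: ';')."""
    return value.split(delimiter)[0].strip()

def review_rows(csv_file):
    """List every row that has an exact duplicate, membership events first."""
    reader = csv.reader(csv_file)
    header = next(reader)
    raw_rows = [tuple(r) for r in reader if r]
    counts = Counter(raw_rows)
    report = []
    # Row 1 is the header; this matches file line numbers only when no
    # field contains an embedded newline.
    for row_no, raw in enumerate(raw_rows, start=2):
        if counts[raw] < 2:
            continue
        row = dict(zip(header, raw))
        is_membership = row["event_type"] in MEMBERSHIP_TYPES
        base = first_url(row["membership_urls"])
        page = "membership/" if is_membership else "borrowing/"
        report.append({
            "row": row_no,
            "group": "membership" if is_membership else "other",
            "event_type": row["event_type"],
            "review_url": base.rstrip("/") + "/" + page if base else "",
            "source_image": row.get("source_image", ""),
        })
    # Membership events first, then grouped by event type, then by row number.
    report.sort(key=lambda r: (r["group"] != "membership", r["event_type"], r["row"]))
    return report

for entry in review_rows(io.StringIO(SAMPLE)):
    print(entry["row"], entry["event_type"], entry["review_url"], entry["source_image"])
```

Keeping both copies of each duplicate in the output, rather than only the later one, lets reviewers compare the rows side by side; the list can then be written out with `csv.DictWriter` and imported into a spreadsheet.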
@rlskoeser

@kmcelwee Don't worry about making this reusable, it's a one-off to help with data cleaning so we can publish the data! At this point we prefer something quick; I don't think it requires code review.

@rlskoeser

@kmcelwee when I was writing up the instructions for this and looking at the data in the export, it occurred to me that it's not easy to generate a link to events on the card detail page based on the information currently included in the export, because I'm using the IIIF image id but the data export only includes the IIIF manifest and the IIIF image. (You could probably figure out the image id from the combination of those two, but I bet it would be a pain.)

The URLs I'm talking about look like this: https://shakespeareandco.princeton.edu/members/aldington-richard/cards/f2a52cae-7c90-4dda-be32-cc03653f8270/
It's also possible to highlight a particular event on that page with an anchor link, like this: https://shakespeareandco.princeton.edu/members/aldington-richard/cards/f2a52cae-7c90-4dda-be32-cc03653f8270/#e33472

Do you have an opinion on how useful/important it would be to include either the source image id or (probably better) some kind of event in context url? (e.g. card url with event highlighted if the event is linked to an image; membership or book activity page otherwise)
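
To make the "event in context" idea concrete, here is a rough sketch of how such a URL could be assembled. It assumes the export gained two hypothetical extra fields, a card image id and an event id, neither of which is in the current export. It also assumes the anchor is the letter "e" followed by the event id, which is inferred only from the single example URL above.

```python
# Event types treated as membership events in this issue.
MEMBERSHIP_TYPES = {"Subscription", "Supplement", "Other", "Reimbursement"}

def event_context_url(member_url, event_type, card_id=None, event_id=None):
    """Build a review link for one event.

    card_id and event_id are hypothetical inputs: the current export
    does not include them.
    """
    base = member_url.rstrip("/") + "/"
    if card_id:
        # Event is linked to a card image: point at the card detail page.
        url = f"{base}cards/{card_id}/"
        if event_id is not None:
            url += f"#e{event_id}"  # assumed anchor format: "e" + event id
        return url
    # No image: fall back to the membership or borrowing activity page.
    page = "membership/" if event_type in MEMBERSHIP_TYPES else "borrowing/"
    return base + page

print(event_context_url(
    "https://shakespeareandco.princeton.edu/members/aldington-richard/",
    "Borrow",
    card_id="f2a52cae-7c90-4dda-be32-cc03653f8270",
    event_id=33472,
))
```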


kmcelwee commented Jul 9, 2020

The Google doc "2020-07-09 Event export duplicates" has been created. There were 113 duplicate rows. The doc can be found under S&Co -> Data Work -> Reports in the Drive.
