Finalize Datasets for Deposit #673

Closed
jkotin opened this issue Jun 27, 2020 · 56 comments

@jkotin

jkotin commented Jun 27, 2020

Comments on books:

  1. Exclude OCLC links. I worry many are incorrect—when I was correcting books, I tried to check them, but I’m not confident that I deleted all the incorrect ones.

  2. Exclude column “identified.” I worry that we’re using the uncertainty icon for too broad a range of issues. Might be better to refer researchers to the notes and the use of “unidentified.”

  3. I'm going to try to make some headway on the blank format. The format should only be blank for some items with an uncertainty icon.

I'll post on the members and events ASAP.

@jkotin
Author

jkotin commented Jun 28, 2020

Comments on members:

  1. This looks great. My only concern/regret is the use of titles. We decided to include titles in the "name" field for display issues. We might want to consider deleting all titles from that field for the data exports.

  2. I have been deleting some of the titles from the "title" field for display issues. If a member is also an author, their title appears with their name in the search results—e.g. Hemingway, Mr. Ernest. The easy way to solve the problem is to delete the member's title in the database. That's a minor but still regrettable loss of data.

@jkotin
Author

jkotin commented Jun 28, 2020

Additional comments on members:

  1. Would it make sense, at a later date, to have a separate export for addresses? We would list members associated with each address. Not addresses associated with each member.
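A rough sketch of what that inversion could look like, assuming each member record carries a list of addresses. The `name` and `addresses` keys and the sample values are hypothetical, not the actual export fields.

```python
from collections import defaultdict


def members_by_address(members):
    """Invert a member -> addresses listing into address -> sorted member names."""
    index = defaultdict(set)
    for member in members:
        for address in member["addresses"]:
            index[address].add(member["name"])
    return {address: sorted(names) for address, names in index.items()}


# Hypothetical sample data, for illustration only.
sample = [
    {"name": "Member A", "addresses": ["1 rue Exemple", "2 rue Exemple"]},
    {"name": "Member B", "addresses": ["1 rue Exemple"]},
]
by_address = members_by_address(sample)
```

Each address then maps to every member associated with it, which is the orientation described above.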

@jkotin
Author

jkotin commented Jun 28, 2020

Comments on events:

  1. FWIW, I've been downloading CSVs and looking at them. For members would it be worth having the member URIs in the first column to match the other two datasets?

  2. I don't think we need the OCLC work URI—for the reasons I explain in my notes to books above.

  3. There are ten event types called "other" -- do you know what they are? Should I ask Ian to fix this?

  4. Should we include author and date of publication fields? I think we should.

@rlskoeser
Contributor

@jkotin thanks for the feedback and review. Responses:

books

  1. Regarding OCLC: do you mean work uri, edition uri, or both? It would be nice to at least keep the work uri; is there any way I could sanity check it against our data to give you more confidence? (Or to convince me it's junk data we should omit)
  2. Can we rename the "identified" column to "uncertain"? Having a boolean field to filter on is really powerful, e.g. if someone wants to investigate the uncertain records or do analysis on the books but not include the uncertain ones.
  3. Great on the blank format. Would it be helpful to run the reconcile_oclc script for items without OCLC URIs and not flagged as no match, or is that more likely to introduce new errors?
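To illustrate the filtering point in item 2, here is a small sketch that splits a books CSV on a boolean column. The column name `uncertain`, the `True`/`False` string encoding, and the sample rows are assumptions for illustration; the real export may differ.

```python
import csv
import io

# Hypothetical CSV content standing in for a books export.
SAMPLE_CSV = """title,uncertain
Example Title One,False
Example Title Two,True
Example Title Three,False
"""


def split_by_uncertainty(csv_text):
    """Return (certain_rows, uncertain_rows) from CSV text with an 'uncertain' column."""
    certain, uncertain = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # CSV values are strings, so compare against the text "true".
        is_uncertain = row["uncertain"].strip().lower() == "true"
        (uncertain if is_uncertain else certain).append(row)
    return certain, uncertain


certain, uncertain = split_by_uncertainty(SAMPLE_CSV)
```

A researcher could analyze `certain` alone, or go straight to `uncertain` to investigate the flagged records.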

members

  1. I remember some of the conversations about this decision; too bad we didn't think of all the implications! (Not that we could have). Are you suggesting I do something in the code that generates the data exports or suggesting the team will do data work here?
  2. Yes, I think I remember seeing this also. We do have titles in a separate field in the database, and they're included in the data export.
  3. Yes — I already created an issue for this after the email conversation with Moacir, see Create a separate data export with member addresses so fields are broken out separately and easier to work with #612. I think it could make sense to work on this when/if we circle back to the map.

events

  1. Events don't have URIs which is why they don't match field order; I do notice that the member fields here don't match the order in the member export, might be nice to make the order match within the list of members fields included here.
  2. I'm fine with omitting OCLC work URI here independent of the decision about the book exports, since we have S&co URIs
  3. I can get a list of these — it would be great to get them fixed. I'll follow up with specifics. FYI, this is a "subtype" option for Subscription events (Subscription, Renewal, Supplement, Other) — I'm wondering if we can remove that "other" option after this is fixed? It's confusing and I'm not sure it's meaningful.
  4. My intent is that people who want to look at books & events should use the two exports together, but I can see including author and publication year to make the titles more identifiable with just the data in the events dataset.

@rlskoeser
Contributor

@jkotin @i-davis here are the 10 subscription events that are showing up as event type "other" :

Account #3267: Mr. Donald Culver  /  Subscription for account #3267 ??/??
Account #198: L. Michaelides  /  Subscription for account #198 1920-11-16/1921-12-16
Account #6279: Mr. Moynahan  /  Subscription for account #6279 1926-04-07/1926-05-07
Account #3356: Mr. C. T. Chang  /  Subscription for account #3356 1932-10-01/??
Account #5005: Phyllis Price  /  Subscription for account #5005 1933-06-24/??
Account #3805: Heinemann  /  Subscription for account #3805 1938-10-24/??
Account #5691: Bell  /  Subscription for account #5691 1938-12-19/??
Account #5774: Creswick  /  Subscription for account #5774 1939-03-08/??
Account #3920: Burrow  /  Subscription for account #3920 1939-04-19/??
Account #5951: de Girodon  /  Subscription for account #5951 1941-10-20/??

@rlskoeser
Contributor

@jkotin I ran the reconcile_oclc script in report mode against current data. Output attached so you can see if it would be useful. It looks like we'd need to update it to ignore items with UNCERTAINTYICON in the notes since it looks like the existing codes it excludes (generic, problem, obscure, zero) aren't sufficient.

Here's the report output (named as .txt because GitHub doesn't allow CSV attachments; rename after you download).
works-oclc.csv.txt

@rlskoeser rlskoeser self-assigned this Jun 29, 2020
@jkotin
Author

jkotin commented Jun 29, 2020

@rlskoeser Re: books:

  1. Re: OCLC, I would like to delete both work and edition from the export. If there's a way to check whether either is accurate (or accurate enough) I'd definitely be happy to reconsider. I can't think of an obvious way to check. I worry that even a 10% error rate would undermine the value of the data. Maybe we could not include the URIs now and think about ways to verify and include in an update?

  2. I hear you about keeping the "identified" column. I'm not sure relabelling as "uncertain" would address the issue. In truth, every item is either more or less uncertain. The issue is that we currently use "false" for both books we can't identify and books that we can almost identify. Would it make sense to generate the column based on whether "unidentified" is in the public notes? That would limit "false" to books in the former category. Alternatively, we could label the column "blue icon" or "icon" and refer to the FAQ.

  3. Would the following be possible: make every item with a blank format "book," iff the notes field does not include the word "uncertaintyicon"? In other words, I want the format field to remain blank (if it is currently blank) for items with an uncertainty icon. The remainder of the blank format fields should be "book."
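The rule in item 3 could be sketched roughly as follows. The function name and arguments are hypothetical, and the case-insensitive match is an assumption (the marker appears in this thread both as "uncertaintyicon" and "UNCERTAINTYICON").

```python
def export_format(item_format, notes):
    """Return the format value to write to the data export for one item."""
    if item_format:
        # An existing format is always kept as-is.
        return item_format
    if "uncertaintyicon" in (notes or "").lower():
        # Blank format stays blank for items flagged with the uncertainty icon.
        return ""
    # Every other blank format defaults to "book".
    return "book"
```

For example, `export_format("", "UNCERTAINTYICON see note")` stays blank, while `export_format("", "")` becomes "book".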

Re: members:

  1. I think we have two options. Leave things as they are. (That's fine with me.) Or delete the titles from the name fields -- there are only six possible titles: Mr., Mrs., Miss, Mme, Mlle, and M? Would deleting cause more problems than it solves? I can't tell.

  2. Upon consideration, my point here is less about this export than about future exports. Currently, I am in the process of deleting titles of member-authors, so the titles don't show up on the front end. If we change how we capture author names to use on the front, I can stop deleting and we can keep the titles.

  3. Sounds good.
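For the second option in members item 1, a minimal sketch of stripping a leading title from a name, using the six titles listed there. It only handles a title at the very start of the string, and it cannot tell the title "M." from a first initial "M.", which is one way deleting could cause new problems.

```python
import re

# The six titles named above; "Mrs." is listed before "Mr." only for readability.
TITLE_PREFIX = re.compile(r"^(?:Mrs\.|Mr\.|Miss|Mme|Mlle|M\.)\s+")


def strip_title(name):
    """Remove one leading title from a name, if present."""
    return TITLE_PREFIX.sub("", name, count=1)
```

With this, "Mr. Donald Culver" becomes "Donald Culver" and "Mme Oguz" becomes "Oguz", while a bare surname like "Mayne" is unchanged.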

Re: events:

  1. Sounds good.

  2. Sounds good.

  3. Sounds good. I'll ask Ian to investigate ASAP.

  4. Great about including pub. date and author/editor/etc.

@rlskoeser
Contributor

@jkotin thanks for responses; seems like we should be choosing the simplest option at this point in favor of getting the datasets published and not introducing new problems. That means:

books

  1. Remove OCLC work and edition URIs. (I'm interested in potentially revisiting later, but not sure what it would entail.)
  2. Remove identified/uncertain boolean (unless we can come up with another label you're satisfied with — "unidentifiable"? I don't want to create a new field based on the text, and I don't think blue icon is meaningful)
  3. Set default format to book except for items marked with uncertainty icon (this will be for the data export only, not the public site)

members

  1. No changes for the data exports (Export the names as they are.)

events

  1. Reorder included member fields to match member export order
  2. Remove OCLC URIs
  3. Add pub date and author only — I should have been clearer before, I do not want to include any other fields beyond this ("editor/etc"). Bibliographic analysis of events should be done with events and books datasets together; events export should include minimum information required to recognize the books, which author/year helps with. Including year & author allows for some high level analysis of book events without using the book export, which is nice but not essential.

We should remember these decisions and think about how to incorporate into the dataset essay (especially the things we're not including and why).

@jkotin
Author

jkotin commented Jun 30, 2020

@rlskoeser I agree with all this, except I'd like to think a little more about books 2/ the uncertain column. I agree about not generating new data. Let me think about a label today. Sound OK?

@rlskoeser
Contributor

@jkotin yes, that sounds fine. Thanks.

@jkotin
Author

jkotin commented Jun 30, 2020

@i-davis @rlskoeser :

We have ten "other" event types -- all listed above, earlier in this issue. We would like to get rid of the event type "other."

Solution 1: change them all to either "generic" or "supplement."
Solution 2: create new event type "deposit" and change them all to "generic," "supplement," or "deposit."

Are there any other options? Which would you both prefer? My worry about solution 2 is that there are other events that should be ID'd as "deposit" that we don't know about. Perhaps they are currently ID'd as generic. There are also all the subscription events that include deposits.

@rlskoeser
Contributor

Solution 1: I don't think generic events work here because they don't allow you to document the deposit. Does supplement make sense to you two?
Solution 2: in case it wasn't clear, I was proposing not a brand new event type but a subtype of subscription (akin to supplement and renewal); my preference would be to get rid of "other" and replace it with "deposit" if that works because I don't know how "other" can ever be meaningful. I think it's fine that subscriptions sometimes include deposits — it's clearly included as a separate field in the data exports, and based on these outliers, it seems clear that subscription + deposit was the way that things operated the vast majority of the time.

It's a small enough number of records that maybe it doesn't matter that much, as long as we document what "other" means. And Ian already corrected some of them, right? So now less than 10?

Documenting Ian's comment from Slack so we know status of these "other" events based on his analysis:

Account #3267: Mr. Donald Culver  /  Subscription for account #3267 ??/?? [Culver paid in November for his June-July 1935 subscription. I turned this one into a renewal.]
Account #198: L. Michaelides  /  Subscription for account #198 1920-11-16/1921-12-16 [late payment for a sub]
Account #6279: Mr. Moynahan  /  Subscription for account #6279 1926-04-07/1926-05-07 ["Moynahan paid up": looks like a belated payment. Could make a supplement.]
Account #3356: Mr. C. T. Chang  /  Subscription for account #3356 1932-10-01/?? [deposit on a periodical]
Account #5005: Phyllis Price  /  Subscription for account #5005 1933-06-24/?? [deposit on a periodical]
Account #3805: Heinemann  /  Subscription for account #3805 1938-10-24/?? [deposit]
Account #5691: Bell  /  Subscription for account #5691 1938-12-19/?? [deposit]
Account #5774: Creswick  /  Subscription for account #5774 1939-03-08/?? [deposit]
Account #3920: Burrow  /  Subscription for account #3920 1939-04-19/?? ["Burrow re-deposit"]
Account #5951: de Girodon  /  Subscription for account #5951 1941-10-20/?? [deposit]

@i-davis

i-davis commented Jul 1, 2020

Solution 1: Agreed, re generic. I don't think supplement works: I think we should keep that restricted to the particular situation where a member pays for an extra vol or a duration extension after the initial subscription.

Solution 2: I could see "deposit" working. There are now only 8 "other" subscription events. I corrected Culver and Michaelides. So technically every "other" except for Moynahan is a deposit, either for periodical subscription or library subscription. I think that Moynahan event is a sui generis instance of Sylvia sass about Moynahan owing her for something unspecified.

@jkotin's questions on slack: "1/ are there events that fit one of the three categories above that are currently labeled as something else? and 2/ there were ten “other” events in the list — do the above capture them all? I’m a little confused because earlier you mentioned seven."

  1. I can imagine that there are events that might need to be recategorized as deposit, if we made that a category. I imagine we could find them fairly quickly, if we searched in the right way--for events with no duration and no fee paid (I think?).

  2. I mentioned seven in my notes from several weeks ago because we only had seven "other" subs then; three were added in the meantime, I'm not sure by whom.

@i-davis

i-davis commented Jul 1, 2020

I just found another record for Moynahan, and it looks like it should just be a regular subscription. So we're back down to 7 "other" events, and all of them would fit under "deposit."

@rlskoeser
Contributor

@i-davis good thought. I did a quick query and there are 197 subscription events with no price paid but a deposit. Here's a handful of them if you want to look and see if this matches what you expect and would make sense to convert to a "subscription deposit" event:

Account #1075: Allison Nienaber / Subscription / Subscription for account #1075 1923-10-17/??
Account #1105: Jane Stafford / Subscription / Subscription for account #1105 1924-05-03/??
Account #1111: Mary Fenner / Subscription / Subscription for account #1111 1925-04-01/??
Account #1361: Mr. Delaney / Supplement / Subscription for account #1361 1925-08-03/??
Account #2499: Denby / Subscription / Subscription for account #2499 1925-09-14/1926-03-14
Account #6156: Miss Knox / Supplement / Subscription for account #6156 1925-09-21/??
Account #1603: Raymond Scudder / Subscription / Subscription for account #1603 1925-11-25/??
Account #1722: Mrs. P. M. Camfferman / Subscription / Subscription for account #1722 1925-11-26/??
Account #1525: Mme Oguz / Subscription / Subscription for account #1525 1926-02-20/??
Account #1702: Miss Young / Subscription / Subscription for account #1702 1926-06-04/1926-07-04
Account #1702: Miss Young / Subscription / Subscription for account #1702 1926-07-04/??
Account #2200: John R. Crawford / Subscription / Subscription for account #2200 1926-09-29/1926-10-29
Account #475: Ruth Wise / Subscription / Subscription for account #475 1926-10-18/1926-11-18
Account #2284: Mrs. Howlaw / Subscription / Subscription for account #2284 1926-12-09/1927-01-09
Account #2363: Mrs. Charles Hughes / Supplement / Subscription for account #2363 1927-03-29/??
Account #2411: Charles Decamp / Subscription / Subscription for account #2411 1927-04-23/1927-05-23
Account #7412: Mr. Reynolds / Subscription / Subscription for account #7412 1927-04-27/1927-05-27
Account #405: Mrs. Jackson / Subscription / Subscription for account #405 1927-09-16/1928-09-16
Account #2750: Mayne / Subscription / Subscription for account #2750 1928-05-09/??
Account #2642: Joy Andrews / Subscription / Subscription for account #2642 1928-05-12/1928-06-12
Account #5927: Stone / Subscription / Subscription for account #5927 1929-01-05/1929-02-05
Account #5899: Hauson / Subscription / Subscription for account #5899 1929-01-08/1929-04-08
Account #2977: Mr. R. L. Cook / Subscription / Subscription for account #2977 1929-01-09/1929-02-09

@rlskoeser
Contributor

@jkotin did you come to any decisions about the uncertainty column for the books export? I thought of a couple more possible labels: problematic, ambiguous

@jkotin
Author

jkotin commented Jul 1, 2020

@rlskoeser I think "uncertain" is good. Not perfect, but better than not including the column. Sorry it took me so long to come around to your original suggestion.

@rlskoeser
Contributor

@jkotin thanks. I think it was worth discussing anyway!

@jkotin
Author

jkotin commented Jul 1, 2020

I'm still a little confused re: "other" -- sorry @i-davis and @rlskoeser. What's the current proposal?

Re: "subscription deposit" event -- it seems odd to me to separate out IFF there is no subscription fee. Wouldn't a subscription deposit event be a subscription deposit event even if a subscription fee was paid as well?

Another question: do we know the deposit amount for the 7 remaining events?

@rlskoeser
Contributor

@jkotin did Beach treat them as separate events when people joined and paid their deposit & subscription fee at the same time?

@jkotin
Author

jkotin commented Jul 1, 2020

@rlskoeser : @i-davis can confirm -- he knows the material better than I do -- but Beach usually recorded new memberships like this:

D. 50 S. John Smith 1m 1v 25—

That means a deposit of 50f for new subscriber John Smith for 1 month, 1 volume at a time, for a 25f membership fee.
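To make the notation concrete, here is a hedged sketch that parses that one example line into its parts. The pattern is an assumption built from this single example; real logbook entries surely vary, and nothing here reflects how the project actually transcribed them.

```python
import re

# Pattern for entries shaped like "D. 50 S. John Smith 1m 1v 25—" (assumed layout).
ENTRY = re.compile(
    r"^D\.\s*(?P<deposit>\d+)\s+S\.\s+(?P<name>.+?)\s+"
    r"(?P<months>\d+)m\s+(?P<volumes>\d+)v\s+(?P<fee>\d+)"
)


def parse_entry(line):
    """Parse a new-membership entry; return None if the line has another shape."""
    match = ENTRY.match(line)
    if match is None:
        return None
    return {
        "name": match["name"],
        "deposit": int(match["deposit"]),  # francs, not counted as revenue
        "months": int(match["months"]),
        "volumes": int(match["volumes"]),
        "fee": int(match["fee"]),          # membership fee in francs
    }


parsed = parse_entry("D. 50 S. John Smith 1m 1v 25—")
```

A margin note like "dep. [name] 50f" would not match this pattern and would return `None`.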

Beach would sum up the membership fees as part of her revenue, but not the deposits. I realize now that our sample logbook page doesn't have any deposits!!

@i-davis

i-davis commented Jul 2, 2020

@jkotin @rlskoeser : Yep, confirmed, that's what most sub events look like!

We do know the deposit amount for each of the remaining 7.

I'm not sure if this answers your question about whether they should be denoted differently, Josh -- I think that's a good question, I'm not sure they should. But the Others do seem like a different sort of event from a sub. They're mostly off in the left margin, and all they say is: "dep. [name] 50f." See attached screenshot. Heinemann's deposit isn't a refund, like Walher, nor a deposit attached to a sub, like Milne.

Screen Shot 2020-07-02 at 10 28 53 AM

@jkotin
Author

jkotin commented Jul 2, 2020

@i-davis would you post screen shots of the remaining 7 "other" events? Include links to the logbooks and the dates with years of the events. Also include versos, if relevant.

We'll tackle them one by one. I worry that Heinemann's is a deposit on a book that Beach has ordered for him, and thus doesn't involve the lending library.

When a patron wanted an expensive English-language book, Beach would likely have asked for a deposit.

@jkotin
Author

jkotin commented Jul 2, 2020

@rlskoeser Is this "other" challenge the last thing we have to decide for the exports? I manually fixed a lot of the format blanks.

If you could generate a CSV with all the remaining format blanks without uncertainty notes, I'll fix them all.

@i-davis

i-davis commented Jul 2, 2020

@jkotin : mm I was just wondering that myself. Will collect the screen shots now!

@i-davis

i-davis commented Jul 2, 2020

Chang: 1932-10-01

Screen Shot 2020-07-02 at 10 48 21 AM

Price: 1933-06-09

Screen Shot 2020-07-02 at 10 50 28 AM

Bell: 1938-12-19

Screen Shot 2020-07-02 at 10 51 31 AM

Creswick: 1939-03-08

Screen Shot 2020-07-02 at 10 52 42 AM

Burrow: 1939-04-19

Screen Shot 2020-07-02 at 10 54 06 AM

de Girodon: 1941-10-20

Screen Shot 2020-07-02 at 10 55 59 AM

@jkotin
Author

jkotin commented Jul 2, 2020

@i-davis thank you! Have you checked to see how these events fit the membership timelines of the individual members? If not, would you? For example, does de Girodon not give a deposit for their membership on 10/22/41, and hence this deposit?

I suspect that these are legit membership/subscription deposit-events. If that suspicion is correct, I'm not sure how to categorize them. Ideally, we would have a separate deposit-event type for ALL deposits, but barring that, I don't know. We could categorize them as subscriptions with no fee and only a deposit, but that will look strange on the activities/membership page. Alternatively, we could label the events "deposit," and create a FAQ that indicates that when given on the same day subscription = fee + deposit, but when given on separate days subscription = fee and deposit = deposit. Ugh.

@i-davis

i-davis commented Jul 3, 2020

@jkotin: Some of them do seem to fit as subscription-like or -adjacent events in membership timelines (Bell, Creswick, Burrow, de Girodon). Some of them don't quite seem to fit (Price, Heinemann). Chang seems clearly to have been a magazine subscription. Here are the timelines: does that analysis seem right to you?

Price: the deposit comes in the middle of another subscription, 15 days before a renewal would be needed.

Heinemann's subscription activity begins with two unusual events, including this "other": Screen Shot 2020-07-03 at 10 09 19 AM

Bell looks like their other could be a renewal: Screen Shot 2020-07-03 at 10 10 56 AM

Creswick looks like theirs could be a downpayment on a subscription that is recorded 14 days later: Screen Shot 2020-07-03 at 10 11 57 AM

Burrow looks like they could be renewing by re-depositing a couple days after they got a reimbursement: Screen Shot 2020-07-03 at 10 13 23 AM

de Girodon looks like they're depositing two days in advance of actually subscribing: Screen Shot 2020-07-03 at 10 14 52 AM

@i-davis

i-davis commented Jul 3, 2020

@rlskoeser @jkotin : re generic events: yes, totally, that description seems right to me, Josh. As always with the database, I'm not entirely sure, there are so many events that have been entered by a variety of people--but I can say with certainty that the vast majority of generic events are definitely about books.

@jkotin
Author

jkotin commented Jul 3, 2020

@i-davis Thank you. Let me think on this. Re: Chang -- it seems like we should treat that as a subscription to borrow periodicals. What about categorizing it as a "subscription" event, 10/1/32, with a 15f deposit, no fee, and adding a note that it's for periodical privileges?

Now we have 6!

@i-davis

i-davis commented Jul 3, 2020

@jkotin: sounds good to me! I'll make the change now.

@jkotin
Author

jkotin commented Jul 3, 2020

@i-davis: I'm still working on Price. But here's what I want to do with the others:

Keep them all as other. I worry that if we change the event to "deposit" it will imply that these are the only deposits. At a later date, I think we should separate out all the deposits and make them their own events. That way, the site can give a clear portrait of the Shakespeare and Co. finances and whether members were or were not reimbursed. But until then, let's keep these 5 as "other."

The one extra issue: do you think the Bell "other" could be for a different Bell? Would it make sense to separate it out?

Price in a second.

@jkotin
Author

jkotin commented Jul 3, 2020

@i-davis Re: Price -- would you do a deep dive here and look at the Price cards? I worry that we might be conflating two different Prices? Let me know what you think.

@i-davis

i-davis commented Jul 4, 2020

@jkotin : I could imagine the Bell other being for a different Bell. I mean, they are fairly close to each other, calendrically, and we don't have many Bells. But yeah, it's definitely a possibility! Want me to separate?

Re Price: good question. It does seem to me like we could separate:

  1. The first card (s1 & s2) and the first two events to a separate account for P. M. Price.
  2. The "other" event and the 1933-06-09 reimbursement to an account for "Price."
  3. The fourth and fifth events (July to September) and the 1933-07-08 and 1933-09-01 reimbursement to an account for "Price."
  4. Keep the remaining events in Phyllis Price's account; the cards clearly marked Phyllis Price verify them.

Arguments for keeping them together:

  1. We can read the 1933-07-08 reimbursement (30f) as a reimbursement for the 1933-01-20 subscription (dep. 30f). This would link the first two events with the fourth and fifth events.
  2. P. M. Price's address is the Foyer International des Étudiantes, at 93 boulevard Saint-Michel. Phyllis Price's address on s3 is 43 boulevard Saint-Michel, and she subscribes through the British Institute. It would make sense that she was a student who stayed at the Foyer at first and then just moved a couple blocks down to an apartment as she settled in Paris. Hence also the switch from "P. M." to "Phyllis" as she became more of a regular at S&Co.
  3. If we concede both 1 and 2, that would mean all the events should stay together--although maybe we could still separate out the "other".

Screen Shot 2020-07-04 at 1 41 10 PM

@jkotin
Author

jkotin commented Jul 5, 2020

@i-davis -- re: Bell: leave it the way it is. I think it's a deposit refund, but Beach just didn't note it as such. I looked at the logbook: there's very little differentiation between deposits and refunds. I suspect some of these questions will be resolved in the fall when you review the logbooks work. But let's leave it alone for now.

This makes me realize that we should make all the logbooks available as PDFs at some point. We have them. People will be interested. Maybe we can plan to do this in the new year -- it could just be links from the logbooks source essay.

@jkotin
Author

jkotin commented Jul 5, 2020

@i-davis -- thank you for the research for Price. I'm not convinced that separating would make things any more accurate. The "other" is likely a supplement deposit for periodicals. My vote is to leave it as "other" until we create separate events for "deposits" if that happens. Does this seem OK to you?

@i-davis

i-davis commented Jul 5, 2020

@jkotin: re Bell: sounds good!

Excellent! The logbooks feel absolutely vital to me, and fascinating, esp full of little oddities that the current iteration of the site can't capture. I think it'd be great to make them open access!

re Price: Yeah, that seems right to me, keeping all these events together as Phyllis Price, and keeping the other as other!

@jkotin
Author

jkotin commented Jul 6, 2020

@rlskoeser I think we are set for the exports. I need today to finish going through the books one last time -- I'm 80% through. But otherwise, I don't think there are any unresolved issues. Is there anything I need to do for the export pages? I'll revise the export page on the site and save a draft on Wagtail.

@rlskoeser
Contributor

@jkotin just to confirm, you are signing off on the revised data exports without any additional software changes (i.e. we'll leave "other" as is for now and document)? Please close this issue if that's the case.

We'll probably want to revise the export page (and it would probably help to see what the dataspace page looks like! I can work on that soon); that can be done independently of this task to revise the data set.

@jkotin
Author

jkotin commented Jul 7, 2020

@rlskoeser I've been going through all the books and fixing mistakes. I'm wondering if it would be possible to do two queries that would further clean the data:

  1. A list of generic events without work titles. For example, I've found some generic events without work titles for strikethrus, as well as some generic events without work titles that should be supplements or "other." For example:

https://shakespeareandco.princeton.edu/admin/accounts/event/4703/change/?_changelist_filters=p%3D2%26o%3D5.3%26start_date__year%3D1926

This is an event for an overdue notice that is likely connected to the incorrect account.

  2. A list of book events (borrow, purchase, generic) without footnotes. I've found a few of these, which I've fixed manually. CORRECTION: often events have footnotes, but not a location in the footnotes—I'm especially interested in these cases.
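In rough terms, the two queries amount to the filters sketched below. The keys `event_type`, `item_title`, `footnotes`, and `location` are hypothetical, as is the idea that footnotes arrive as a list per event; this only shows the logic, not how the data is actually stored.

```python
BOOK_EVENT_TYPES = {"Borrow", "Purchase", "Generic"}


def generic_without_title(events):
    """Query 1: generic events that have no work title."""
    return [
        e for e in events
        if e["event_type"] == "Generic" and not e.get("item_title")
    ]


def book_events_missing_location(events):
    """Query 2: book events with no footnotes, or only footnotes lacking a location."""
    return [
        e for e in events
        if e["event_type"] in BOOK_EVENT_TYPES
        and not any(fn.get("location") for fn in e.get("footnotes", []))
    ]


# Hypothetical sample rows, for illustration only.
sample_events = [
    {"id": 1, "event_type": "Generic", "item_title": "", "footnotes": [{"location": "p. 3"}]},
    {"id": 2, "event_type": "Borrow", "item_title": "Example Title", "footnotes": []},
    {"id": 3, "event_type": "Purchase", "item_title": "Example Title", "footnotes": [{"location": ""}]},
    {"id": 4, "event_type": "Borrow", "item_title": "Example Title", "footnotes": [{"location": "p. 7"}]},
    {"id": 5, "event_type": "Subscription", "item_title": "", "footnotes": []},
]
untitled = generic_without_title(sample_events)
unlocated = book_events_missing_location(sample_events)
```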

Sorry for the delay identifying these queries. It's been helpful to go through all the titles and fix errors that I would have asked Cate to fix.

@jkotin
Author

jkotin commented Jul 7, 2020

@i-davis There are 12 "overdue" subscription events currently labelled "generic." See:

https://shakespeareandco.princeton.edu/admin/accounts/event/?q=overdue

These should be changed to "other" or deleted, or made into a new kind of event "overdue." As it stands, they lead to weird information on the site. For example, Mrs. P. F. Dunne has 1923 and 1926 as membership years, but only visible events in 1923: https://shakespeareandco.princeton.edu/members/dunne/.

What's your opinion on how to handle this @rlskoeser ?

@rlskoeser
Contributor

@jkotin if it's an overdue subscription, I can add another subscription "subtype" analogous to supplements and renewals — would that make sense? That would make it into something the code recognizes as a "membership activity" so it would be listed in that table.

@rlskoeser
Contributor

"I've been going through all the books and fixing mistakes. I'm wondering if it would be possible to do two queries that would further clean the data:"

@jkotin I think the queries you're interested in can be done with OpenRefine. I'll generate and share a fresh event export from production and provide guidance on how to find the events you're interested in. I'm glad you've identified these problems and will be able to fix more of them before we publish the data!

@jkotin
Author

jkotin commented Jul 7, 2020

@rlskoeser excellent, thank you, re: OpenRefine and fresh event export.

Re: overdue -- @i-davis what do you think? Have we been recording overdue notes consistently, or are these 12 a remnant of a practice that was abandoned? If the latter, we should probably just delete them. If these are all the overdue notices (or most of them), then we should follow Rebecca's suggestion. Have we been recording fines?

@jkotin
Author

jkotin commented Jul 7, 2020

@rlskoeser one other query that occurred to me -- but it's not vital to do it before publishing the data: identify people that are not attached to any events or any works. I think there are a lot of people (members and creators) we created by mistake.
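One way such a check could work, sketched under the assumption that people have an `id` and that the ids referenced by events and by works can be collected separately. All names here are hypothetical.

```python
def unattached_people(people, event_member_ids, work_creator_ids):
    """Return people whose id appears in neither the event ids nor the work ids."""
    attached = set(event_member_ids) | set(work_creator_ids)
    return [person for person in people if person["id"] not in attached]


# Hypothetical sample data, for illustration only.
people = [
    {"id": 1, "name": "Person A"},
    {"id": 2, "name": "Person B"},
    {"id": 3, "name": "Person C"},
]
orphans = unattached_people(people, event_member_ids=[1], work_creator_ids=[3])
```

Anything returned is only a candidate for review, since a person record might be intentional even with no events or works attached.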

@i-davis

i-davis commented Jul 7, 2020

@jkotin : I'm not sure about this: the overdue pre-date me! They clearly came, that is, from XML transcriptions of the logbooks before the database existed. They have no event history, and the notes attached to them are weirdly standardized, produced by the database as it metabolized the XML transcriptions.

Screen Shot 2020-07-07 at 2 47 24 PM

They're clearly representing events in the logbook:

Screen Shot 2020-07-07 at 2 51 44 PM

I don't know: should we save them? I'd be happy to turn them all into the sub events @rlskoeser suggests!

@rlskoeser
Contributor

Deleting them is fine with me, especially if we think we haven't captured them systematically (as seems to be the case)!

@jkotin
Author

jkotin commented Jul 7, 2020

I deleted them!

@rlskoeser
Contributor

@jkotin are there any software changes needed at this point? It seems like it's all data cleanup now, which won't require any additional software changes.

If you've reviewed the changes we agreed on (removing OCLC urls, switching uncertainty, re-ordering member fields in the events export) then I think this issue can be closed.

@jkotin
Author

jkotin commented Jul 8, 2020

@rlskoeser the events export that I'm working with still has an "item_work_uri" field -- should that be deleted?

@rlskoeser
Contributor

@jkotin I was concerned about that when I saw it until I remembered that I gave you fresh production exports to make sure you're looking at the latest data.

Did you ever review the revised qa/test data exports for the changes we agreed on? (including removing OCLC URIs, re-ordering member fields in the event export, and the revision to the uncertainty flag). All I saw was your #673 (comment) that it "looks great"

@jkotin
Author

jkotin commented Jul 8, 2020

Oh, good, that makes sense. I was just worried that I overlooked a field in my earlier review. I'll close now. My hesitation (psychological) is that the datasets aren't really finalized yet. But all the software changes are done. I don't think we need to autofill the formats anymore. I manually corrected them, FWIW.

@jkotin jkotin closed this as completed Jul 8, 2020
@rlskoeser
Contributor

@jkotin should I remove the auto-fill format logic? Great to have them corrected in the database.

@jkotin
Author

jkotin commented Jul 8, 2020

Yes please.
