This is a quick demo of the sort of data I'm pulling out with the Hathi-specific features of the Bookworm-MARC library.


First, just some basic imports.

In [6]:
import pymarc
import random
import json
from bookwormMARC.bookwormMARC import parse_record
from bookwormMARC.hathi_methods import hathi_record_yielder

import bookwormMARC
import sys
import os
from collections import defaultdict

In [2]:
%load_ext autoreload
%autoreload 2
 

# Example output

Here is an example of the output of this script on Hathi books: 5 randomly selected records from the first 500 or so in the DPLA dump. Note that this is usually more than 5 *items*: Hathi groups multiple items into a single record.

Note that we're using a custom subclass of the pymarc.Record class called `BRecord`. This adds a number of methods that make it easier, for instance, to pull out a dictionary with the categories that may be useful for Bookworms in a variety of ways.

Each of the keys here is something that might make sense to chart or analyze. We want to know the scanner so that we can see if there are OCR effects or something that might be relevant. We want the library so we can see how shifting library composition affects time series. It might make sense to build up miniature bookworms for particular authors, or publishers, etc.

To start with, I just print five random records from the first 500 or so.

In [3]:
from bookwormMARC.bookwormMARC import BRecord

# Start a fresh generator over the Hathi records.
all_files = hathi_record_yielder()

n = 0
for rec in all_files:
    # Keep roughly one record in a hundred.
    if random.random() > .01:
        continue
    for entry in rec.hathi_bookworm_dicts():
        # Pretty print the dictionary entry.
        print(json.dumps(entry, sort_keys=True, indent=2, separators=(',', ': ')))
        print("")
    n += 1
    if n > 4:
        break


{
  "cataloging_source": " ",
  "cntry": "miu",
  "contributing_library": "University of Michigan",
  "date": 1971,
  "filename": "mdp.39015051304940",
  "first_author_birth": 1910,
  "first_author_death": 1980,
  "first_author_name": "Campbell, Angus, 1910-1980.",
  "first_place": "Ann Arbor,",
  "first_publisher": "Institute for Social Research, University of Michigan",
  "government_document": " ",
  "item_date": 1971,
  "language": "eng",
  "lc0": "E",
  "lc1": "E",
  "lc2": "185.615",
  "lc_class_from_lc": true,
  "marc_record_created": "1988-07-15",
  "permalink": "https://babel.hathitrust.org/cgi/pt?id=mdp.39015051304940",
  "record_date": 1971,
  "resource_type": "book",
  "rights_changed_date": "2013-08-01",
  "scanner": "google",
  "searchstring": "<a href=https://babel.hathitrust.org/cgi/pt?id=mdp.39015051304940><em>White attitudes toward black people.</em> (1971)",
  "serial_killer_guess": "book",
  "subject_places": [
    "n-us---"
  ],
  "title": "White attitudes toward b

Experimental: testing the goodness of record 043 codes.

In [None]:
if False:
    all_files = hathi_record_yielder()
    knowledge = open("/drobo/knowledge_directions.tsv","w")

    for record in all_files:
        if record['043'] is not None:
            try:
                dicto = record.bookworm_dict()
                subjects = dicto['subject_places']
                p1 = record.first_place()
                cntry = dicto['cntry']
                year = dicto['date']
                for subject in subjects:
                    knowledge.write("\t".join(map(str,[subject,p1,year,cntry,record['001'].value(),dicto['title'].encode("utf-8")]))+ "\n")
            except Exception:
                # Skip records with missing or malformed fields.
                pass
    knowledge.close()

### The world of available fields

These are the fields that appear in more than 10% of a randomly selected subset of records. They include the control fields, author and title information, and some more esoteric things, including county of study.

In [None]:
from collections import defaultdict

n = 0
global_counts = defaultdict(int)

for record in all_files:
    # Sample roughly one record in five.
    if random.random() > .2:
        continue
    # Count each (field, subfield) pair at most once per record.
    already_seen = set()
    n += 1
    for dicto in record.as_dict()['fields']:
        name = list(dicto.keys())[0]
        if 'subfields' in dicto[name]:
            # Collect every subfield code, not just the last one.
            tupos = [(name, list(subfield.keys())[0])
                     for subfield in dicto[name]['subfields']]
        else:
            # Control fields have no subfields.
            tupos = [(name, None)]
        for tupo in tupos:
            if tupo not in already_seen:
                global_counts[tupo] += 1
                already_seen.add(tupo)
    if n > 10000:
        break


In [None]:
a = [((k,v),count) for ((k,v),count) in global_counts.iteritems()]
a.sort()
for elem in a:
    if elem[1] > 1000:
        print elem
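
To make the counting logic concrete without the Hathi dump, here is a small self-contained sketch. The two hand-written records are hypothetical; they only mimic the shape of pymarc's `Record.as_dict()` output, where each entry in `fields` maps a tag either to a string (control fields) or to a dict with a `subfields` list. `count_field_pairs` is an illustrative helper name, not part of Bookworm-MARC.

```python
from collections import defaultdict

# Hypothetical records shaped like pymarc's Record.as_dict() output.
sample_records = [
    {"fields": [
        {"001": "000000001"},
        {"245": {"ind1": "1", "ind2": "0",
                 "subfields": [{"a": "A title /"}, {"c": "An author."}]}},
        {"650": {"ind1": " ", "ind2": "0", "subfields": [{"a": "Topic one"}]}},
        {"650": {"ind1": " ", "ind2": "0", "subfields": [{"a": "Topic two"}]}},
    ]},
    {"fields": [
        {"001": "000000002"},
        {"245": {"ind1": "1", "ind2": "0",
                 "subfields": [{"a": "Another title."}]}},
    ]},
]


def count_field_pairs(records):
    """Count how many records contain each (tag, subfield code) pair."""
    counts = defaultdict(int)
    for record in records:
        seen = set()  # pairs already counted for this record
        for field in record["fields"]:
            tag, value = list(field.items())[0]
            if isinstance(value, dict) and "subfields" in value:
                pairs = [(tag, list(sub.keys())[0]) for sub in value["subfields"]]
            else:
                # Control fields carry a plain string and no subfields.
                pairs = [(tag, None)]
            for pair in pairs:
                if pair not in seen:
                    counts[pair] += 1
                    seen.add(pair)
    return dict(counts)


pair_counts = count_field_pairs(sample_records)
# Sort on the string form so that None subfield codes compare cleanly.
for pair in sorted(pair_counts, key=str):
    print(pair, pair_counts[pair])
```

The repeated 650 field in the first record is counted once, since the tally is per record rather than per occurrence.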

### Better years

One of the big things I've noticed is that the 974 field, which describes an individual item, has better year information than the record-level date fields.

The following block shows that something like 1 in 3 items, in about one in ten records, have a different entry in the 974y field from the native date field. That suggests huge possibilities for improving dates if we're not already using the 974y fields. I suspect we are not, based on the serial volumes with 974y fields that I see in the online browser.



In [None]:
import collections

records = 0
diff_records = 0
items = 0
diff_items = 0
date_diffs = collections.defaultdict(int)
for rec in all_files:
    if random.random() > .01:
        # Sample just one record in one hundred to keep this fast.
        continue
    records += 1
    record_counted = False
    for dicto in rec.hathi_bookworm_dicts():
        items += 1
        try:
            if dicto["item_date"] != dicto["record_date"]:
                date_diffs[(dicto["item_date"],dicto["record_date"])] += 1
                diff_items += 1
                if not record_counted:
                    # Count each record once, however many of its items differ.
                    diff_records += 1
                    record_counted = True
        except KeyError:
            pass
    if records>1000:
        break
print("%i out of %i records and %i out of %i items have differing dates" % (diff_records,records,diff_items,items))
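
The record-level versus item-level tally can be tried out on toy data. This is only a sketch: the nested lists below are hypothetical stand-ins for what `rec.hathi_bookworm_dicts()` yields per record, and `tally_date_differences` is an illustrative name rather than anything in the library.

```python
from collections import Counter

# Hypothetical data: one list of item dicts per record.
sample_records = [
    [{"item_date": 1971, "record_date": 1971}],
    [{"item_date": 1905, "record_date": 1904},
     {"item_date": 1906, "record_date": 1904},
     {"item_date": 1904, "record_date": 1904}],
    [{"item_date": 1850, "record_date": None}],
    [{"record_date": 1900}],  # no item_date key at all
]


def tally_date_differences(records):
    """Return (records, diff_records, items, diff_items, date_diffs)."""
    records_seen = diff_records = items = diff_items = 0
    date_diffs = Counter()
    for record_items in records:
        records_seen += 1
        record_differs = False
        for item in record_items:
            items += 1
            if "item_date" not in item or "record_date" not in item:
                continue  # nothing to compare
            if item["item_date"] != item["record_date"]:
                date_diffs[(item["item_date"], item["record_date"])] += 1
                diff_items += 1
                record_differs = True
        if record_differs:
            # A record counts once even if several of its items differ.
            diff_records += 1
    return records_seen, diff_records, items, diff_items, date_diffs


result = tally_date_differences(sample_records)
print("%i out of %i records and %i out of %i items have differing dates"
      % (result[1], result[0], result[3], result[2]))
```

Keeping the two counters separate matters because a single serial record can contribute many differing items.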

## Assessing differences in dates between 974 and the main MARC record

The most common pattern is that I'm replacing a "None" value with an actual year, or vice versa. It would be wise to see if there isn't sometimes a better solution than the Nones for the original fields. (E.g., am I overrelying on F008?)


In [None]:
flattened = sorted([(-val,f008,f974,val) for ((f974,f008),val) in date_diffs.iteritems()])
flattened[:20]

Looking only at places where we have years on both sides, most differences are within the realm of reasonableness.
(With just 1000 examples, I'm certainly getting a lot of repeat entries.)

There are, though, a number of places where f974 gives an earlier date than the native date field does.

In [None]:
flattened = sorted([(-val,f008,f974,val) for ((f974,f008),val) in date_diffs.iteritems() 
                    if f974 is not None and f008 is not None])
flattened[:20]
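
As an alternative to sorting on a negated count, `collections.Counter` can do the ranking directly. The counts below are made up for illustration; the keys follow the same `(item_date, record_date)` order used in `date_diffs`.

```python
from collections import Counter

# Hypothetical counts keyed by (item_date, record_date).
date_diffs = Counter({
    (1905, None): 12,
    (None, 1880): 7,
    (1904, 1919): 3,
    (1921, 1920): 5,
    (1850, 1851): 1,
})

# Keep only the pairs where both dates are known.
both_known = Counter({pair: n for pair, n in date_diffs.items()
                      if pair[0] is not None and pair[1] is not None})

# most_common() orders by descending count.
for (item_date, record_date), n in both_known.most_common():
    print(item_date, record_date, n)

# Pairs where the item-level date is earlier than the record-level date.
earlier = {pair: n for pair, n in both_known.items() if pair[0] < pair[1]}
print(earlier)
```

Filtering out the `None` values first also avoids comparing `None` with integers, which Python 3 refuses to do.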

Let's look to see what those are. Here are thirty.

In [None]:
records = 0
for rec in all_files:
    if random.random() > .5:
        # Sample just one record in two
        continue
    rec.__class__ = BRecord
    for field in rec.get_fields('974'):
        items += 1
        if str(field['y']) != str(rec.date()) and field['y'] is not None and rec.date() is not None:
            if int(field['y']) < int(rec.date()):
                records += 1
                print rec
    if records>30:
        break


The basic problem in all of these seems to be that in the original record, field 260c and field 008 disagree on the date. Pymarc prefers 260 in these cases; Zephir prefers field 008. Fair enough.

In [24]:
for rec in all_files:
    dicto =  rec.bookworm_dict()
    try:
        if "Wien" in dicto['first_place']:
            if dicto['cntry'] != 'au ':
                break
    except KeyError:
        continue
print (dicto['cntry'],dicto['first_place'])
a = bookwormMARC.bookwormMARC.F008(rec)
print a.data

(u'au#', u'Wien :')
880719d19041919au#uu#########0####0ger#u
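
Since the date disagreement involves field 008, it helps to see how that fixed-width string breaks down. The sketch below slices the 008 value printed above using the standard MARC 21 positions for books (00-05 date entered, 06 type of date, 07-10 Date 1, 11-14 Date 2, 15-17 place of publication, 35-37 language). `parse_008` is a hypothetical helper written for this note, not the library's `F008` class, and the `#` characters appear to stand in for blanks in this output.

```python
def parse_008(data):
    """Split a 40-character MARC 008 string into a few fixed-position elements."""
    return {
        "date_entered": data[0:6],   # yymmdd the record was created
        "date_type": data[6],        # type of date / publication status
        "date1": data[7:11],
        "date2": data[11:15],
        "place": data[15:18],        # place of publication code
        "language": data[35:38],
    }


# The 008 string from the record found above.
f008 = "880719d19041919au#uu#########0####0ger#u"
parsed = parse_008(f008)
for key in sorted(parsed):
    print(key, parsed[key])
```

Here Date 1 is 1904 and Date 2 is 1919, and the place code comes out as `au#` rather than `au ` with a trailing space, which is why the comparison against `'au '` in the cell above stops on this record.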


In [26]:
rec.hathi_bookworm_dicts

<bound method BRecord.hathi_bookworm_dicts of <bookwormMARC.bookwormMARC.BRecord object at 0x1066c7350>>


In [23]:
n = 0

dictee = defaultdict(int)

for rec in all_files:
    n+=1
    if n > 10:
        break
        
    for item in rec.hathi_bookworm_dicts():
        if item['serial_killer_guess'] != item['resource_type']:
            print item['title']
        try:
            dictee[(item['serial_killer_guess'],item['resource_type'])] += 1
        except KeyError:
            pass
        break
for (k,v) in dictee.iteritems():
    print (k,v)

Catalog of publications /
Impact of the use of microorganisms on the aquatic environment : proceedings /
Housing in the seventies : a report of the National Housing Policy Review.
Transportation and the rural community : report on the first workshop on national transportation problems.
(('serial', 'book'), 4)
(('book', 'book'), 6)


In [25]:
jsoncatalog = open("/drobo/hathi_metadata/jsoncatalog.txt","w")      


all_files = hathi_record_yielder()
