This is a quick demo of the sort of data I'm pulling out with the Hathi-specific features of the Bookworm-MARC library.


First, just some basic imports.

In [1]:
import pymarc
import random
import json
from bookwormMARC.bookwormMARC import parse_record
from bookwormMARC.hathi_methods import hathi_record_yielder

import bookwormMARC
import sys
import os


In [2]:
%load_ext autoreload
%autoreload 2

I'm using a generator that makes it possible to parse the dumps one entry at a time.

In [None]:
all_files = hathi_record_yielder()

# Write one tab-separated row per (subject place, record) pair.
with open("/drobo/knowledge_directions.tsv", "w") as knowledge:
    for record in all_files:
        # Only records with an 043 (geographic area code) field are of interest.
        if record['043'] is None:
            continue
        try:
            dicto = record.bookworm_dict()
            p1 = record.first_place()
            for subject in dicto['subject_places']:
                fields = [subject, p1, dicto['date'], dicto['cntry'],
                          record['001'].value(), dicto['title'].encode("utf-8")]
                knowledge.write("\t".join(map(str, fields)) + "\n")
        except Exception:
            # Skip records that lack any of the expected keys or fields.
            pass

Parsing new XML file dpla_full_20160501_01.xml
Parsing new XML file dpla_full_20160501_02.xml
Parsing new XML file dpla_full_20160501_03.xml
Parsing new XML file dpla_full_20160501_04.xml
Parsing new XML file dpla_full_20160501_05.xml
Parsing new XML file dpla_full_20160501_06.xml
Parsing new XML file dpla_full_20160501_07.xml
Parsing new XML file dpla_full_20160501_08.xml
Parsing new XML file dpla_full_20160501_09.xml
Parsing new XML file dpla_full_20160501_10.xml

In [4]:
record.bookworm_dict()

{'cataloging_source': u' ',
 'cntry': u'dcu',
 'date': 1970,
 'first_place': u'Washington,',
 'first_publisher': u'U.S. Govt. Print. Off.,',
 'government_document': u'f',
 'language': u'eng',
 'lc0': 'K',
 'lc1': 'KF',
 'lc2': '9219',
 'lc_class_from_lc': True,
 'marc_record_created': u'1984-01-23',
 'subject_places': [u'n-us---'],
 'title': u'Working papers of the National Commission on Reform of Federal Criminal Laws relating to the Study draft of the new Federal criminal code.'}

In [None]:
from collections import defaultdict

n = 0
global_counts = defaultdict(int)

for record in all_files:
    # Keep roughly one record in five.
    if random.random() > .2:
        continue
    already_seen = set()
    n += 1
    for dicto in record.as_dict()['fields']:
        name = dicto.keys()[0]
        if 'subfields' in dicto[name]:
            tupos = [(name, subfield.keys()[0]) for subfield in dicto[name]['subfields']]
        else:
            # Control fields have no subfields.
            tupos = [(name, None)]
        # Count each (field, subfield) pair at most once per record.
        for tupo in tupos:
            if tupo not in already_seen:
                global_counts[tupo] += 1
                already_seen.add(tupo)
    if n > 10000:
        break


In [None]:
a = [((k,v),count) for ((k,v),count) in global_counts.iteritems()]
a.sort()
for elem in a:
    if elem[1] > 1000:
        print elem

Above are the fields that appear in more than 10% of a randomly selected subset of records. They include the control fields, author and title information, and some more esoteric things, including country of study.
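
The counting rule behind that list can be sketched on toy data. This is a minimal illustration, assuming the nested layout that the loop above reads from `as_dict()`; the helper `field_subfield_pairs` and the two records are made up for the example and are not part of Bookworm-MARC. It is written so that it also runs under Python 3.

```python
from collections import Counter

def field_subfield_pairs(record_dict):
    """Return the set of (tag, subfield code) pairs present in one record.

    Control fields, which have no subfields, are reported as (tag, None).
    """
    pairs = set()
    for field in record_dict["fields"]:
        for tag, value in field.items():
            if isinstance(value, dict) and "subfields" in value:
                for subfield in value["subfields"]:
                    for code in subfield:
                        pairs.add((tag, code))
            else:
                pairs.add((tag, None))
    return pairs

# Two hand-made records (toy data, not taken from the Hathi dump).
toy_records = [
    {"fields": [{"001": "000001"},
                {"245": {"ind1": "1", "ind2": "0",
                         "subfields": [{"a": "Title one"}, {"c": "Author"}]}}]},
    {"fields": [{"001": "000002"},
                {"245": {"ind1": "1", "ind2": "0",
                         "subfields": [{"a": "Title two"}]}},
                {"043": {"ind1": " ", "ind2": " ",
                         "subfields": [{"a": "n-us---"}]}}]},
]

# Each pair is counted at most once per record, because a set is used.
counts = Counter()
for rec in toy_records:
    counts.update(field_subfield_pairs(rec))

# Share of records in which each pair appears at least once.
shares = {pair: n / float(len(toy_records)) for pair, n in counts.items()}
```

Using a set per record means a field that repeats within one record still counts once, so the shares can be read directly as "fraction of records containing this pair".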

In [None]:
for record in all_files:
    try:
        print record['043']['a']
        print record.title()
        break
    except:
        continue

In [None]:
# This would be easier to demo if this could be loaded over the web. But that apparently will take Python 3.
#import urllib2
import tarfile

hathi_records = tarfile.open("dpla_full_20160501.tar.gz")

def record_yielder(files = hathi_records):
    # Walk the archive member by member, yielding one MARC record at a time.
    for member in files:
        if member.name.endswith(".xml"):
            sys.stderr.write("Parsing new XML file " + member.name + "\n")
            # parse_xml_to_array loads every record in the file into memory at once.
            records = pymarc.parse_xml_to_array(files.extractfile(member))
            for record in records:
                yield record

all_files = record_yielder()

Note that it requires a fair amount of dedicated RAM, since it loads a huge block of MARC records at a time.
It also takes a while to load.
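
The memory cost comes from parsing each XML file in full before yielding anything. One possible way around it, sketched here with the standard library's `xml.etree.ElementTree.iterparse` rather than the pymarc loader used above, is to stream the file and hand back one `<record>` element at a time. The function name `iter_marcxml_records` and the sample document are assumptions for illustration, and the yielded objects are plain ElementTree elements, not pymarc records.

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by MARCXML documents.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"

def iter_marcxml_records(fileobj):
    """Yield one <record> element at a time from a MARCXML stream."""
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == MARC_NS + "record":
            yield elem
            # Once the caller asks for the next record, drop this one's children.
            elem.clear()

# A tiny hand-written collection standing in for one file of the dump.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record><controlfield tag="001">000001</controlfield></record>
  <record><controlfield tag="001">000002</controlfield></record>
</collection>"""

ids = []
for record in iter_marcxml_records(io.BytesIO(sample)):
    ids.append(record.find(MARC_NS + "controlfield").text)
```

The trade-off is that anything needed from a record has to be read before moving on to the next one, since the element is emptied afterwards.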

Here, for example, is an early record in the file.

In [None]:
for record in all_files:
    print record
    break

# Example output

Here is an example of the output of this script on Hathi books: 5 randomly selected records from the first 50000 or so in the DPLA dump. Note that this is usually more than 5 *items*, because Hathi groups multiple items into a single record.

Note that we're using a custom subclass of the pymarc.Record class called `BRecord`. This adds a number of functions that make it easier, for instance, to pull out a dictionary with the categories that may be useful for Bookworms in a variety of ways.
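
The reclassing trick used with `BRecord` can be shown in isolation. This is a toy sketch: `Record` and `ExtendedRecord` are stand-ins invented for the example, not the real pymarc or Bookworm-MARC classes, and `summary_dict` is a hypothetical method.

```python
class Record(object):
    """Stand-in for pymarc.Record: holds a title and nothing else."""
    def __init__(self, title):
        self._title = title

    def title(self):
        return self._title


class ExtendedRecord(Record):
    """Stand-in for BRecord: same data, plus an extra convenience method."""
    def summary_dict(self):
        return {"title": self.title(), "title_length": len(self.title())}


rec = Record("Working papers")
# Reassigning __class__ changes method lookup without copying any data.
rec.__class__ = ExtendedRecord
summary = rec.summary_dict()
```

This only works cleanly when the subclass adds methods and no new instance state, since `__init__` of the subclass never runs on the existing object.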

Each of the keys here is something that might make sense to chart or analyze. We want to know the scanner so that we can see if there are OCR effects or something that might be relevant. We want the library so we can see how shifting library composition affects time series. It might make sense to build up miniature bookworms for particular authors, or publishers, etc.

In [None]:
from bookwormMARC.bookwormMARC import BRecord

n=0
for rec in all_files:
    # Reclass so we can use the extended methods.
    rec.__class__ = BRecord
    for entry in rec.hathi_bookworm_dicts():
        # Pretty print the dictionary entry.
        print json.dumps(entry,sort_keys=True, indent=2, separators=(',', ': ') )
        print ""
    n+=1
    if n>4:
        break


### Better years

One of the big things I've noticed is that the 974 field, which describes an individual item, often has better year information than the record-level date.

The following block shows that something like 1 in 3 items, in about one in ten records, have a different entry in the 974y field from the native date field. That suggests huge possibilities for improving dates if we're not already using the 974y fields. I suspect we are not, judging by the serial volumes with 974y fields that I see in the online browser.
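
If the 974y values were to be used, one possible rule is to prefer the item-level year whenever it looks like a plain four-digit year and fall back to the record-level date otherwise. The sketch below is only that, a hypothetical rule; `preferred_year` is not a Bookworm-MARC function, and the sample pairs are invented.

```python
import re

YEAR_PATTERN = re.compile(r"^\d{4}$")

def preferred_year(item_year, record_year):
    """Pick the item-level year when it is a four-digit year,
    otherwise fall back to the record-level year (which may be None)."""
    if item_year is not None and YEAR_PATTERN.match(str(item_year)):
        return int(item_year)
    if record_year is not None and YEAR_PATTERN.match(str(record_year)):
        return int(record_year)
    return None

# Toy (974y, record date) pairs; these values are invented for illustration.
pairs = [("1923", 1920), (None, 1885), ("18uu", 1850), (None, None)]
years = [preferred_year(item, rec) for item, rec in pairs]
```

A real rule would probably also need to handle placeholder years, which this sketch does not attempt.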



In [None]:
records = 0
diff_records = 0
items = 0
diff_items = 0
date_diffs = dict()
for rec in all_files:
    if random.random() > .01:
        # Sample roughly one record in a hundred to keep debugging runs short
        continue
    rec.__class__ = BRecord
    records += 1
    line_counted = False
    for field in rec.get_fields('974'):
        items += 1
        if str(field['y']) != str(rec.date()):
            key = (str(field['y']), str(rec.date()))
            date_diffs[key] = date_diffs.get(key, 0) + 1
            if not line_counted:
                diff_records += 1
                line_counted = True
            diff_items +=1
    if records>1000:
        break
print "%i out of %i records and %i out of %i items have differing dates" %(diff_records,records,diff_items,items)

## Assessing differences in dates between 974 and the main MARC record

The most common pattern is that I'm replacing a "None" value with an actual year, or vice versa. It would be wise to see if there isn't sometimes a better solution than the Nones for the original fields. (E.g., am I over-relying on F008?)


In [None]:
flattened = sorted([(-val,f008,f974,val) for ((f974,f008),val) in date_diffs.iteritems()])
flattened[:20]

Looking only at places where we have years, most are within the realm of reasonableness here.
(With just 1000 examples, I'm certainly getting a lot of repeat entries.)

There are, though, a number of places where f974 gives an earlier year than the native date field does.

In [None]:
flattened = sorted([(-val,f008,f974,val) for ((f974,f008),val) in date_diffs.iteritems() 
                    if f974 != "None" and f008 != "None"])
flattened[:20]

Let's look to see what those are. Here are thirty.

In [None]:
shown = 0
for rec in all_files:
    # Sample roughly one record in two.
    if random.random() > .5:
        continue
    rec.__class__ = BRecord
    for field in rec.get_fields('974'):
        if field['y'] is None or rec.date() is None:
            continue
        try:
            earlier = int(field['y']) < int(rec.date())
        except ValueError:
            # Skip dates that are not plain integers.
            continue
        if earlier:
            shown += 1
            print rec
    if shown >= 30:
        break


The basic problem in all of these seems to be that in the original record, field 260c and field 008 disagree on the date. Pymarc prefers 260 in these cases; Zephir prefers field 008. Fair enough.
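
The disagreement is easy to reproduce on bare strings. The sketch below assumes only the MARC 21 layout, in which Date 1 occupies character positions 07-10 of the 008 field, and treats 260$c as free text. The two helper functions and the sample values are invented for illustration and are not how pymarc or Zephir actually extract dates.

```python
import re

def year_from_008(field_008):
    """Date 1 lives in character positions 07-10 of the 008 control field."""
    candidate = field_008[7:11]
    return int(candidate) if candidate.isdigit() else None

def year_from_260c(subfield_c):
    """Take the first four-digit run in a free-text 260$c such as '[1971]'."""
    match = re.search(r"\d{4}", subfield_c or "")
    return int(match.group(0)) if match else None

# Invented example values, shaped like real fields.
# 008: entry date, date type, Date 1, Date 2 (blank), place, then the rest.
field_008 = "840123" + "s" + "1970" + " " * 4 + "dcu" + " " * 10 + "f000 0 eng d"
field_260c = "[1971]"

y008 = year_from_008(field_008)
y260 = year_from_260c(field_260c)
disagree = y008 is not None and y260 is not None and y008 != y260
```

Checking both sources side by side like this makes it possible to list the conflicting records instead of silently taking one value.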

In [None]:
while True:
    rec = parse_record(hathi_records.readline().strip(","))
    dicto =  rec.bookworm_dict()
    
    try:
        if "Wien" in dicto['first_place']:
            if dicto['cntry'] != 'au ':
                break
    except KeyError:
        continue
print (dicto['cntry'],dicto['first_place'])
a = bookwormMARC.bookwormMARC.F008(rec)
print a.data

In [None]:
while True:
    rec = parse_record(hathi_records.readline().strip(","))
    if '045' in rec:
        print rec['045'].value()
        break


In [None]:
print rec