This is a quick demo of the sort of data I'm pulling out with the Hathi-specific features of the Bookworm-MARC library.


First, just some basic imports.

In [1]:
import pymarc
import tarfile
import random
import json
from bookwormMARC.bookwormMARC import parse_record
import bookwormMARC
import sys
import os


In [2]:
%load_ext autoreload
%autoreload 2

We're reading the full DPLA dump out of the tarfile, and then defining a generator object from a yielding function.

Each object returned by the generator is a record from the great pymarc utility.

In [3]:
# This would be easier to demo if this could be loaded over the web. But that apparently will take Python 3.
#import urllib2

hathi_records = tarfile.open("dpla_full_20160501.tar.gz")

def record_yielder(files = hathi_records):
    for file in files:
        if file.name.endswith(".xml"):
            sys.stderr.write("Parsing new XML file " + file.name)
            records = pymarc.parse_xml_to_array(files.extractfile(file))
            for record in records:
                yield record
                
all_files = record_yielder()

Note that it requires a fair amount of dedicated RAM, since `parse_xml_to_array` loads every MARC record in an XML file into memory before the first one is yielded.
It also takes a while to load.
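
If memory becomes a problem, one option is to stream records instead of materializing a whole file's worth at once. The sketch below is a stdlib-only illustration of that idea, not what Bookworm-MARC does: it walks each `.xml` member with `xml.etree.ElementTree.iterparse` and yields raw `<record>` elements (not `pymarc.Record` objects). The helper name `stream_record_elements` and the tiny in-memory archive are assumptions made for the demo. It is written to run under both Python 2.7 and Python 3.

```python
import io
import tarfile
import xml.etree.ElementTree as ET

# Namespace used by MARCXML documents.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"

def stream_record_elements(tar):
    """Yield one <record> element at a time from every .xml member of an open tarfile."""
    for member in tar:
        if not member.name.endswith(".xml"):
            continue
        handle = tar.extractfile(member)
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == MARC_NS + "record":
                yield elem
                # Drop the record's children once the caller has moved on.
                elem.clear()

# Tiny in-memory archive standing in for the real dump (hypothetical data).
xml = (b'<collection xmlns="http://www.loc.gov/MARC21/slim">'
       b'<record><controlfield tag="001">000000033</controlfield></record>'
       b'<record><controlfield tag="001">000696021</controlfield></record>'
       b'</collection>')
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as out:
    info = tarfile.TarInfo("demo_01.xml")
    info.size = len(xml)
    out.addfile(info, io.BytesIO(xml))
buf.seek(0)

demo_tar = tarfile.open(fileobj=buf, mode="r:gz")
record_ids = [rec.find(MARC_NS + "controlfield").text
              for rec in stream_record_elements(demo_tar)]
print(record_ids)
```

The trade-off is that each yielded element would still need converting into a record object before the `BRecord` methods used later could be applied.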

Here, for example, is an early record in the file.

In [4]:
for record in all_files:
    print record
    break

=LDR  01015nam a22003011  4500
=001  000000033
=003  MiAaHDL
=005  20130211000000.0
=006  m\\\\\\\\d\\\\\\\\
=007  cr\bn\---auaua
=008  880715r19681908nyua\\\\\\\\\\00110\eng\\
=010  \\$a68024915
=035  \\$a(MiU)000000033
=035  \\$asdr-miu000000033
=035  \\$asdr-ucd001191777
=035  \\$a(OCoLC)1425
=035  \\$a(CaOTULAS)159818037
=035  \\$a(RLIN)MIUG0001425-B
=040  \\$aDLC$cDLC$dMiU
=050  0\$aPR5238$b.A3 1968
=082  \\$a826.8
=100  1\$aRossetti, Christina Georgina,$d1830-1894.
=245  14$aThe family letters of Christina Georgina Rossetti;$bwith some supplementary letters and appendices.$cEdited by William Michael Rossetti.
=260  \\$aNew York,$bHaskell House,$c1968.
=300  \\$axxii, 242 p.$billus.$c24 cm.
=500  \\$aReprint of the 1908 ed.
=500  \\$a"Haskell House catalogue item #237."
=538  \\$aMode of access: Internet.
=600  10$aRossetti, Christina Georgina,$d1830-1894.
=700  1\$aRossetti, William Michael,$d1829-1919.$eed.
=970  \\$aBK
=974  \\$bUC$cUCD$d20140729$sgoogle$uuc1.31175000101330$y19

Parsing new XML file dpla_full_20160501_01.xml

# Example output

Here is an example of the output of this script on Hathi books: 5 randomly selected records from the first 50000 or so in the DPLA dump. Note that this is usually more than 5 *items*: Hathi groups multiple items into a single record.

Note that we're using a custom subclass of the pymarc.Record class called `BRecord`. This adds a number of methods that make it easier, for instance, to pull out a dictionary with the categories that may be useful for Bookworms in a variety of ways.

Each of the keys here is something that might make sense to chart or analyze. We want to know the scanner so that we can see if there are OCR effects or something that might be relevant. We want the library so we can see how shifting library composition affects time series. It might make sense to build up miniature bookworms for particular authors, or publishers, etc.
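
As a rough sketch of the kind of tallying these keys allow, the block below counts items per scanner and per (date, library) pair. The `entries` list is hand-written stand-in data shaped like the dictionaries printed by the next cell, and `count_by` is a hypothetical helper, not part of Bookworm-MARC.

```python
from collections import Counter

# Hand-written stand-ins for the dictionaries that hathi_bookworm_dicts() yields.
# The values are invented for illustration; only the key names come from real output.
entries = [
    {"scanner": "ia", "contributing_library": "ipst", "date": 1967},
    {"scanner": "google", "contributing_library": "University of Michigan", "date": 1967},
    {"scanner": "google", "contributing_library": "University of Michigan", "date": 1968},
]

def count_by(entries, *keys):
    """Count entries by the tuple of values found under `keys` (None if a key is absent)."""
    return Counter(tuple(entry.get(key) for key in keys) for entry in entries)

by_scanner = count_by(entries, "scanner")
by_year_and_library = count_by(entries, "date", "contributing_library")

print(by_scanner.most_common())
print(by_year_and_library.most_common())
```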

In [12]:
from bookwormMARC.bookwormMARC import BRecord

n=0
for rec in all_files:
    # Reclass so we can use the extended methods.
    rec.__class__=bookwormMARC.bookwormMARC.BRecord
    for entry in rec.hathi_bookworm_dicts():
        # Pretty print the dictionary entry.
        print json.dumps(entry,sort_keys=True, indent=2, separators=(',', ': ') )
        print ""
    n+=1
    if n>4:
        break


{
  "cataloging_source": " ",
  "cntry": "dcu",
  "contributing_library": "ipst",
  "date": 1967,
  "date_source": "008",
  "filename": "psia.ark:/13960/t5z623168",
  "first_place": "Washington :",
  "first_publisher": "U.S. Dept. of Commerce, National Bureau of Standards :",
  "government_document": "f",
  "language": "eng",
  "lc0": "T",
  "lc1": "T",
  "lc2": "6",
  "lc_class_from_lc": false,
  "marc_record_created": "1970-05-18",
  "permalink": "https://babel.hathitrust.org/cgi/pt?id=psia.ark:/13960/t5z623168",
  "rights_changed_date": "2013-08-10",
  "scanner": "ia",
  "searchstring": "<a href=https://babel.hathitrust.org/cgi/pt?id=psia.ark:/13960/t5z623168><em>Technology and world trade : proceedings /</em> (1967)",
  "title": "Technology and world trade : proceedings /"
}

{
  "cataloging_source": " ",
  "cntry": "dcu",
  "contributing_library": "University of Michigan",
  "date": 1967,
  "date_source": "008",
  "filename": "mdp.39015082414155",
  "first_place": "Washington :",


### Better years

One of the big things I've noticed is that the 974 field, which is recorded per item, has better year information than the record-level date fields.

The following block shows that in a random sample of about a thousand records, roughly one in four records (276 of 1001) and nearly three in four items (4998 of 6887) have a different entry in the 974y field from the native date field. That suggests huge possibilities for improving dates if we're not already using the 974y fields. I suspect we are not, based on the serial volumes with 974y fields that I see in the online browser.



In [14]:
records = 0
diff_records = 0
items = 0
diff_items = 0
date_diffs = dict()
for rec in all_files:
    if random.random() > .01:
        # Sample roughly one record in a hundred (for debugging).
        continue
    rec.__class__ = BRecord
    records += 1
    line_counted = False
    for field in rec.get_fields('974'):
        items += 1
        if str(field['y']) != str(rec.date()):
            # Count each (974y, record date) pair without a bare except.
            key = (str(field['y']), str(rec.date()))
            date_diffs[key] = date_diffs.get(key, 0) + 1
            if not line_counted:
                diff_records += 1
                line_counted = True
            diff_items +=1
    if records>1000:
        break
print "%i out of %i records and %i out of %i items have differing dates" %(diff_records,records,diff_items,items)

Parsing new XML file dpla_full_20160501_02.xmlParsing new XML file dpla_full_20160501_03.xml

276 out of 1001 records and 4998 out of 6887 items have differing dates
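
To put those counts in proportion, the arithmetic is simply each count divided by its total. The figures are copied from the output line above; they come from a random 1% sample, so a rerun will give somewhat different numbers.

```python
# Counts copied from the output line above.
diff_records, records = 276, 1001
diff_items, items = 4998, 6887

# Multiply by 100.0 first so the division is floating point under Python 2 as well.
record_share = 100.0 * diff_records / records
item_share = 100.0 * diff_items / items

print("%.1f%% of records and %.1f%% of items differ" % (record_share, item_share))
```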


## Assessing differences in dates between 974 and the main MARC record

The most common pattern is that I'm replacing a "None" value with an actual year, or vice versa. It would be wise to see whether there is sometimes a better solution than None for the original fields. (E.g., am I over-relying on F008?)


In [37]:
flattened = sorted([(-val,f008,f974,val) for ((f974,f008),val) in date_diffs.iteritems()])
flattened[:20]

[(-88, '1902', '1905', 88),
 (-80, 'None', '1900', 80),
 (-79, '1972', 'None', 79),
 (-78, '1886', 'None', 78),
 (-76, 'None', '1921', 76),
 (-73, 'None', '1901', 73),
 (-68, 'None', '1911', 68),
 (-64, 'None', '1910', 64),
 (-63, 'None', '1906', 63),
 (-60, 'None', '1909', 60),
 (-59, 'None', '1916', 59),
 (-57, 'None', '1904', 57),
 (-57, 'None', '1914', 57),
 (-56, 'None', '1913', 56),
 (-55, 'None', '1908', 55),
 (-55, 'None', '1922', 55),
 (-54, 'None', '1917', 54),
 (-54, 'None', '1920', 54),
 (-52, 'None', '1902', 52),
 (-52, 'None', '1907', 52)]
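
One way to quantify that pattern is to bucket each pair by which side is missing. The sketch below does this for a hand-copied subset of the rows above, re-keyed in the `(f974, f008)` order that `date_diffs` uses. The `classify` helper and the bucket names are assumptions of the example; in the notebook the same loop could run over the full `date_diffs` dictionary.

```python
from collections import Counter

# A few rows copied from the listing above, re-keyed as (f974, f008) -> count.
sample_diffs = {
    ("1905", "1902"): 88,
    ("1900", "None"): 80,
    ("None", "1972"): 79,
    ("None", "1886"): 78,
    ("1921", "None"): 76,
}

def classify(f974, f008):
    """Name the kind of disagreement between the item year and the record year."""
    if f008 == "None":
        return "974y fills a missing record date"
    if f974 == "None":
        return "record has a date but 974y is empty"
    return "both present, different years"

# Sum the item counts that fall into each kind of disagreement.
buckets = Counter()
for (f974, f008), count in sample_diffs.items():
    buckets[classify(f974, f008)] += count

for label, count in buckets.most_common():
    print("%4d  %s" % (count, label))
```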

Looking only at places where both sources have a year, most differences are within the realm of reasonableness.
(With just 1000 examples, I'm certainly getting a lot of repeat entries.)

There are, though, a number of places where f974 gives an earlier year than the native date field does.

In [38]:
flattened = sorted([(-val,f008,f974,val) for ((f974,f008),val) in date_diffs.iteritems() 
                    if f974 != "None" and f008 != "None"])
flattened[:20]

[(-88, '1902', '1905', 88),
 (-32, '1878', '1890', 32),
 (-31, '1972', '1973', 31),
 (-24, '1903', '1906', 24),
 (-22, '1903', '1907', 22),
 (-22, '1903', '1908', 22),
 (-21, '1903', '1905', 21),
 (-19, '1870', '1877', 19),
 (-18, '1890', '1903', 18),
 (-18, '1900', '1908', 18),
 (-18, '1903', '1921', 18),
 (-16, '2003', '1996', 16),
 (-15, '1903', '1904', 15),
 (-15, '2003', '1997', 15),
 (-14, '1921', '1920', 14),
 (-13, '1834', '1835', 13),
 (-13, '1871', '1878', 13),
 (-13, '1903', '1909', 13),
 (-13, '2003', '1995', 13),
 (-12, '1899', '1913', 12)]
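
To pick out the rows where the 974y year is the earlier one, compare the two columns as integers. The sketch uses a few rows copied from the listing above, as `(f008, f974, count)`; the variable names are just for the example.

```python
# (f008, f974, count) rows copied from the listing above.
rows = [
    ("1902", "1905", 88),
    ("1878", "1890", 32),
    ("2003", "1996", 16),
    ("2003", "1997", 15),
    ("1921", "1920", 14),
    ("2003", "1995", 13),
]

# Keep rows where the item-level year (974y) precedes the record-level year,
# and note how many years earlier it is.
earlier = [(f008, f974, int(f008) - int(f974), count)
           for (f008, f974, count) in rows
           if int(f974) < int(f008)]

for f008, f974, gap, count in earlier:
    print("record %s, 974y %s: %d year(s) earlier, %d items" % (f008, f974, gap, count))
```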

Let's look to see what those are. Here are thirty.

In [40]:
records = 0
for rec in all_files:
    if random.random() > .5:
        # Skip about half the records at random (for debugging).
        continue
    rec.__class__ = BRecord
    for field in rec.get_fields('974'):
        items += 1
        if str(field['y']) != str(rec.date()) and field['y'] is not None and rec.date() is not None:
            if int(field['y']) < int(rec.date()):
                records += 1
                print rec
    if records >= 30:
        break


=LDR  00953nam a22002531i 4500
=001  000696021
=003  MiAaHDL
=005  20130211000000.0
=006  m\\\\\\\\d\\\\\\\\
=007  cr\bn\---auaua
=008  880718s1975\\\\dcu\\\\\\\\\\f00000\eng\d
=035  \\$a(MiU)000696021
=035  \\$asdr-miu000696021
=035  \\$a(OCoLC)14383927
=035  \\$a(CaOTULAS)159880473
=035  \\$a(RLIN)MIUG2218677-B
=040  \\$aTNTU$cTNTU$dMiU$dCStRLIN
=110  2\$aNational Center on Child Abuse and Neglect.
=245  10$aFederally funded child abuse and neglect projects, 1975.
=260  \\$a[Washington] :$bU. S. Dept. of Health, Education, and Welfare, Office of Human Development, Office of Child Development, Children's Bureau, National Center on Child Abuse and Neglect,$c[1976?]
=300  \\$aiii, 56 p. ; 26 cm.
=490  0\$aDHEW publication no. (OHD) ; 76-30076.
=538  \\$aMode of access: Internet.
=650  \0$aChild welfare$zUnited States.
=650  \0$aChild abuse$zUnited States.
=970  \\$aBK
=974  \\$bMIU$cMIU$d20150323$sgoogle$umdp.39015016219290$y1975$rpd

=LDR  01396nam a2200313   4500
=001  000701480
=003 

The basic problem in all of these seems to be that in the original record, field 260c and field 008 disagree on the date. Pymarc prefers 260 in these cases; Zephir prefers field 008. Fair enough.
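
The disagreement is easy to see by pulling both years out by hand. The sketch below does so for the record printed above, reading Date 1 from character positions 07-10 of field 008 and the first four-digit run from 260$c. The helper names are hypothetical, and this only illustrates the two sources; it is not the actual logic that pymarc or Zephir uses.

```python
import re

def year_from_008(field_008):
    """Return Date 1 (character positions 07-10 of MARC field 008) as an int, or None."""
    date1 = field_008[7:11]
    return int(date1) if date1.isdigit() else None

def year_from_260c(subfield_c):
    """Return the first four-digit run in a 260$c string as an int, or None."""
    match = re.search(r"\d{4}", subfield_c or "")
    return int(match.group()) if match else None

# Values copied from the record printed above (blanks are shown there as backslashes).
f008 = "880718s1975    dcu          f00000 eng d"
f260c = "[1976?]"

year_008 = year_from_008(f008)
year_260c = year_from_260c(f260c)
print("008 says %s, 260c says %s" % (year_008, year_260c))
```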

In [None]:
while True:
    rec = parse_record(hathi_records.readline().strip(","))
    dicto =  rec.bookworm_dict()
    
    try:
        if "Wien" in dicto['first_place']:
            if dicto['cntry'] != 'au ':
                break
    except KeyError:
        continue
print (dicto['cntry'],dicto['first_place'])
a = bookwormMARC.bookwormMARC.F008(rec)
print a.data

In [None]:
while True:
    rec = parse_record(hathi_records.readline().strip(","))
    if '045' in rec:
        print rec['045'].value()
        break


In [None]:
print rec