Catalog of Copyright Entries Project

NYPL Project to transcribe and parse pages from the US Catalog of Copyright Entries

The New York Public Library (NYPL) is embarking on a pilot project to extract the data from a publication known as the Catalog of Copyright Entries, published annually by the United States Copyright Office. The volumes have already been digitized and are freely available through the Internet Archive; our project aims to extract and parse the data contained in the records in order to create a searchable database that will aid copyright research.

For more on the project, see "Unlocking the Record of American Creativity—with Your Help"

For more on the catalog, see the following:

Data Structure and Contents

All data files are in the ./xml directory, organized by year. The XML files conform to the project DTD, and each directory has an alto subdirectory with ALTO format files for the original OCR.
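As a rough sketch of how this layout might be walked, the helper below groups the transcribed XML files by year directory. The function name `files_by_year` is hypothetical, and the code assumes that each year directory holds its transcription files directly (ending in `.xml`), with the OCR files one level down in `alto`.

```python
from pathlib import Path


def files_by_year(xml_root):
    """Map each year directory name to the sorted names of its XML files.

    Assumption: transcriptions sit directly inside the year directory, so a
    non-recursive glob skips everything under the alto subdirectory.
    """
    result = {}
    for year_dir in sorted(Path(xml_root).iterdir()):
        if not year_dir.is_dir():
            continue
        result[year_dir.name] = sorted(p.name for p in year_dir.glob("*.xml"))
    return result
```

Calling `files_by_year("./xml")` from the repository root would then return a dictionary keyed by year.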

See TOC.md for details on the volumes transcribed so far.

CopyrightEntries.dtd

The main components of an XML file, within the root <copyrightEntries> element, are a mandatory <header> followed by <copyrightEntry>, <entryGroup>, <crossRef>, and <pgNum> elements in any order.

There are tags for identifying authors, titles, publishers, and claimants, as well as the various dates and id numbers that an entry can contain. Many entries have attributes for recording normalized versions of dates and numbers or for identifying where corrections have been made.

See the Guide for specifics of formatting entries.
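To make the top-level structure concrete, here is a minimal sketch that counts the children of the root element using Python's standard `xml.etree.ElementTree`. The sample document is invented for illustration: the empty `<header/>`, the placeholder `id` and `regnum` values, and the assumption that an `<entryGroup>` nests `<copyrightEntry>` elements are not taken from the DTD.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample; real files have a populated <header> and full entries.
SAMPLE = """<copyrightEntries>
  <header/>
  <pgNum>1</pgNum>
  <copyrightEntry id="a" regnum="A1"/>
  <entryGroup>
    <copyrightEntry id="b" regnum="A2"/>
  </entryGroup>
  <crossRef id="c"/>
</copyrightEntries>"""


def count_top_level(root):
    """Count the direct children of the root element by tag name."""
    return Counter(child.tag for child in root)


root = ET.fromstring(SAMPLE)
counts = count_top_level(root)
# Entries at any depth, including those nested inside an <entryGroup>.
all_entries = root.findall(".//copyrightEntry")
```

Searching with `.//copyrightEntry` rather than iterating over direct children is a deliberate choice here, so that grouped entries are not missed.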

Anatomy of a Registration

The format of entries in the Catalog varies widely over time, but they essentially contain simple bibliographic information plus a registration date and id number.

ADAMS, JAMES DONALD.

  Literary frontiers.  New
    York, Duell, Sloan and
    Pearce.  175 p. © J. Donald
    Adams; 6Jun51; A56505.

This is converted to XML:

<copyrightEntry 
     id="1D4D33CD-6E97-1014-8315-97D5E63C7536"
     regnum="A56505">
  <author>
    <authorName>ADAMS, JAMES DONALD</authorName>.
  </author> 
  <title>Literary frontiers.</title>
  <publisher>
    <pubPlace>New York</pubPlace>, 
    <pubName>Duell, Sloan and Pearce.</pubName> 
  </publisher>
  <desc>175 p.</desc> 
  &#xA9; <claimant>J. Donald Adams</claimant>;
  <regDate date="1951-06-06">6Jun51</regDate>; 
  <regNum>A56505</regNum>.
</copyrightEntry>
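The following is a minimal sketch of reading such an entry with Python's standard library. It parses the example above and pulls out the fields most relevant to renewal matching. The helper name `summarize_entry` and the choice of fields are illustrative, not part of the project's tooling.

```python
import xml.etree.ElementTree as ET

# The example entry from above.
ENTRY_XML = """<copyrightEntry
     id="1D4D33CD-6E97-1014-8315-97D5E63C7536"
     regnum="A56505">
  <author>
    <authorName>ADAMS, JAMES DONALD</authorName>.
  </author>
  <title>Literary frontiers.</title>
  <publisher>
    <pubPlace>New York</pubPlace>,
    <pubName>Duell, Sloan and Pearce.</pubName>
  </publisher>
  <desc>175 p.</desc>
  &#xA9; <claimant>J. Donald Adams</claimant>;
  <regDate date="1951-06-06">6Jun51</regDate>;
  <regNum>A56505</regNum>.
</copyrightEntry>"""


def summarize_entry(entry):
    """Return the key fields of one <copyrightEntry> element as a dict."""
    reg_date = entry.find("regDate")
    return {
        "id": entry.get("id"),
        "regnum": entry.get("regnum"),
        # Prefer the normalized date attribute over the printed form.
        "date": reg_date.get("date") if reg_date is not None else None,
        "author": entry.findtext("author/authorName"),
        "title": entry.findtext("title"),
        "claimant": entry.findtext("claimant"),
    }


summary = summarize_entry(ET.fromstring(ENTRY_XML))
```

The normalized `regnum` and `date` attributes are read in preference to the element text, since the printed forms (such as 6Jun51) vary across volumes.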

Our top priority is to correctly tag the registration numbers and dates, since these are required to match registrations to renewals. Next in priority are the authors and titles, although for practical purposes a full-text search is probably adequate to find an entry.

Identifiers

Every registration should have a registration number, such as A56505, but these are not unique. Numbering was restarted in the "Third Series" (1947–), so there is quite a bit of overlap between it and the "New Series." For example, the entry above shares a registration number with another book, Barton Warren Stone, pathfinder of Christian union; a story of his life and times, registered in 1932. Because of this, both the registration number and the date are always required, for example to distinguish A56505/1951-06-06 from A56505/1932-10-12.
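The collision described here can be illustrated with a short sketch. The `registration_key` helper is hypothetical; it simply joins the two values in the same `number/date` form used in the text.

```python
from collections import defaultdict


def registration_key(regnum, date):
    """Combine a registration number and ISO date into one lookup key."""
    return f"{regnum}/{date}"


# The two registrations mentioned above that share the number A56505.
registrations = [
    ("A56505", "1951-06-06", "Literary frontiers."),
    ("A56505", "1932-10-12", "Barton Warren Stone, pathfinder of Christian union"),
]

by_regnum = defaultdict(list)  # number alone: ambiguous
by_key = {}                    # number plus date: unambiguous
for regnum, date, title in registrations:
    by_regnum[regnum].append(title)
    by_key[registration_key(regnum, date)] = title
```

Looking up `by_regnum["A56505"]` returns both titles, while `by_key["A56505/1951-06-06"]` returns only the 1951 book.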

In addition, every <copyrightEntry> and <crossRef> is assigned a UUID so that it can be uniquely identified, even if the registration number or date is changed (for instance, to correct a typo).

Renewals

These volumes were chosen to transcribe first because they come from the period when a book may still be in copyright if its first 28-year copyright term was renewed, while it is otherwise in the public domain. Renewal data is available from the Stanford Copyright Renewals Database and from an NYPL version of essentially the same sources. The NYPL version is better formatted for matching renewal entries to the registrations in these XML files.
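One way the matching might work is sketched below, joining renewals to registrations on the pair of registration number and date. The field names `orig_regnum`, `orig_date`, and `renewal_id` are assumptions, not the actual column names of either renewal dataset. The renewal record and the second registration id are placeholders made up for illustration and do not describe a real renewal.

```python
def match_renewals(registrations, renewals):
    """Pair each renewal with the registration it renews, if one is found.

    Returns a list of (registration id, renewal id) tuples.
    """
    index = {(r["regnum"], r["date"]): r for r in registrations}
    matched = []
    for ren in renewals:
        reg = index.get((ren["orig_regnum"], ren["orig_date"]))
        if reg is not None:
            matched.append((reg["id"], ren["renewal_id"]))
    return matched


registrations = [
    {"id": "1D4D33CD-6E97-1014-8315-97D5E63C7536",
     "regnum": "A56505", "date": "1951-06-06"},
    {"id": "placeholder-uuid-for-1932-entry",  # hypothetical id
     "regnum": "A56505", "date": "1932-10-12"},
]

# Hypothetical renewal record, invented to exercise the lookup.
renewals = [
    {"renewal_id": "R-EXAMPLE", "orig_regnum": "A56505", "orig_date": "1932-10-12"},
]

matches = match_renewals(registrations, renewals)
```

Keying the index on both values means the renewal lands on the 1932 registration only, even though the 1951 entry carries the same number.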

By combining the two datasets we can determine how many books were registered for copyright in every year between 1923 and 1963, as well as how many were renewed:

Chart showing the number of books registered and renewed each year, 1923-1963

For this period we have about 642,000 books registered for copyright. Of these, about 162,000, or 25%, had their copyrights renewed. So the copyright has expired on the remaining 75% of the books published during these years, about 480,000, and they are now in the public domain.
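The arithmetic behind these rounded figures can be checked in a few lines, using only the approximate totals quoted above:

```python
registered = 642_000  # approximate books registered, 1923-1963
renewed = 162_000     # approximate books renewed

not_renewed = registered - renewed    # books whose copyright was not renewed
renewed_share = renewed / registered  # fraction renewed
not_renewed_share = 1 - renewed_share # fraction now in the public domain
```

The renewed share comes to just over a quarter, which rounds to the 25% and 75% split given in the text.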

Press Inquiries: Please contact Greg Cram or Sean Redmond
