Catalog of Copyright Entries Renewals
Tab-delimited Copyright Renewals, 1950–1991
These files contain, in tab-delimited format, copyright renewals from the US Copyright Office's Catalog of Copyright Entries for the years 1950–1977 (based on Project Gutenberg transcriptions) and data from the Copyright Office's database for 1978–1991, with a sprinkling of later years (based on data made available by Google). Renewals from 1950 are not strictly relevant since all books renewed that year are now (2019) public domain, but they are included here for completeness.
Though similar to Stanford's Copyright Renewals database, the primary aim of this dataset is to fully parse out all the registration and renewal ids and dates so that we can better (and automatically) match renewals to the registrations we have been transcribing for our Catalog of Copyright Entries project.
Anatomy of a Renewal
Most renewal entries simply contain a title and author followed by the date and id number of the copyright registration being renewed, the "claimant" or rights-holder making the renewal with a code such as (A) (author) or (C) (child or children), and finally the renewal date and id. For example:
Strangers on a train. By Mary Patricia Highsmith. © 15Mar50; A41904. Mary Patricia Highsmith (A); 6Jun77; R663598.
This renewal corresponds to a registration:
HIGHSMITH, PATRICIA. Strangers on a train. [1st ed.] New York, Harper. 299 p. © Mary Patricia Highsmith; 15Mar50; A41904.
Together, these two entries record that the copyright for this book was registered on March 15, 1950 and renewed by the author on June 6, 1977.
Registration ids like
A41904 are not unique because the numbering started over in 1947 with the beginning of the 3rd series of the Catalog, so in order to match a renewal with a registration it is necessary to match both the id number (
A41904) and date (
This is converted into a row of tab-delimited data (not all fields shown):
|Mary Patricia Highsmith||Strangers on a train.||A41904||1950-03-15||R663598||1977-06-06||Mary Patricia Highsmith|A|
Often more than one registration is involved, such as when a book is first published outside the United States and has an "interim" registration (class AI) before its final registration:
First crossing of the Polar Sea, by Roald Amundsen and Lincoln Ellsworth; with additional chapters by other members of the expedition. (Pub. abroad under title: The first flight across the Polar Sea) © 15Apr27; (pub. abroad 25Feb27, AI-9217); A972756. Mary-Louise Ellsworth (W); 19Apr54; R129296.
This renewal refers to both the interim registration,
AI-9217, and final registration
A972756. This illustrates the importance of the id numbers: the two original registrations have different titles, so it wouldn't otherwise be clear that they are the same book. This would be converted into two rows of data (again, not all fields shown):
|Roald Amundsen and Lincoln Ellsworth||First crossing of the Polar Sea||A972756||1927-04-15||R129296||1954-04-19||Mary-Louise Ellsworth|W|
|Roald Amundsen and Lincoln Ellsworth||First crossing of the Polar Sea||AI9217||1927-02-25||R129296||1954-04-19||Mary-Louise Ellsworth|W|
This allows us to easily match the two registrations found in the Catalog.
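The date conversion visible in the rows above (for example `15Apr27` becoming `1927-04-15`) can be sketched as follows. This is a minimal illustration, not the project's actual parsing code; the function name `cce_date_to_iso` is hypothetical, and the sketch assumes every two-digit year belongs to the 1900s, which holds for the printed 1950–1977 renewal volumes.

```python
import re
from datetime import date

# Month abbreviations as they appear in the printed entries.
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

# Day (1-2 digits), three-letter month, two-digit year, e.g. "6Jun77".
_CCE_DATE = re.compile(r"^(\d{1,2})([A-Z][a-z]{2})(\d{2})$")


def cce_date_to_iso(text, century=1900):
    """Convert a date such as '15Mar50' to 'YYYY-MM-DD'.

    Assumption: all two-digit years fall in the given century.
    """
    match = _CCE_DATE.match(text.strip())
    if match is None or match.group(2) not in MONTHS:
        raise ValueError("unrecognized date: {!r}".format(text))
    day, month, year = match.groups()
    # date() also rejects impossible days such as 31Feb.
    return date(century + int(year), MONTHS[month], int(day)).isoformat()


print(cce_date_to_iso("15Apr27"))
print(cce_date_to_iso("6Jun77"))
```

Raising on unrecognized input, rather than returning an empty string, makes transcription errors surface early.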
Note that dates are converted to
YYYY-MM-DD format and there is some regularization of id numbers (for instance
AI-9217 changed to
Differences from Stanford Copyright Renewals
In 2007 Stanford University Libraries and Academic Information Resources launched a copyright renewals database covering the same material. The Stanford database is intended to make it simple to find a book published before 1964 by author or title and see whether or not its copyright has been renewed. While transcribing and parsing the original book registration entries at NYPL we have relied extensively on this data, but our task is a little different since we would like to determine accurately, for every registered entry, whether or not it has a matching renewal. The data in this repository is organized to make it easier to accurately match a registration ID and date with a renewal if it exists and to reduce the number of false negatives so that we can be confident the lack of a match means the copyright wasn't renewed. It differs from the Stanford data in a few areas.
You can download a comma-delimited copy of Stanford's data (latest version) which contains 246,448 renewal records.
Stanford's data includes only "Book" or "Class A" registrations. In our CCE project we are transcribing the "Book" volumes of the Catalog (Part 1, Group 1, 1923–1946; Part 1A, 1947–1953; Part 1, 1953–1964) which include some registrations for classes other than "A" (about 2% of registration entries), though many of these would be considered books despite the classification. If any of these have renewals, they would be excluded from the Stanford data, creating false negatives.
This dataset contains all the Part 1, Group 1 and 2 (1923–1946), Part 1A and 1B (1947–1953) and Part 1 (1953–1964, two previous groups combined) renewals transcribed by Project Gutenberg as well as renewals for all classes from the Google dataset (over 445,000 renewals altogether) derived from the Copyright Office database.
Unrolling of batch renewals
Many renewal entries actually contain multiple renewals or registrations. These might be multiples of both registrations and renewals, or they might be multiple registrations renewed under a single renewal ID (the reverse is also possible). Special effort has been made to "unroll" these entries so that every row contains a unique combination of registration id, registration date and renewal id, and so that every id is accounted for. For example, this entry (1958, vol. 12.1.1 p. 764)
RULING CASE LAW. 1930 supplement, continuing Permanent supplement ed. Vol. 1-28. © 24May30; A23877-23904. Lawyers Co-operative Pub. Co. & Bancroft-Whitney Co. (PCW); 28Apr58; R213954-213981.
is converted into 28 rows, each with the proper registration and renewal id so that they can all be matched to our registration data entries. A unique ID is assigned to each entry before unrolling, so each of these 28 rows carries an ID tying it back to the original entry.
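The unrolling of that entry can be sketched as below. This is an illustration under stated assumptions rather than the project's implementation: `expand_range` and `unroll` are hypothetical names, the entry ID is a placeholder string, and the sketch handles only ranges written as a class prefix followed by two full serial numbers.

```python
import re

# A class prefix, a starting serial, a hyphen and an ending serial,
# e.g. "A23877-23904". Limitation of this sketch: an id whose class
# code ends in a digit would be misread as a range.
_RANGE = re.compile(r"^([A-Z]+)(\d+)-(\d+)$")


def expand_range(text):
    """Expand 'A23877-23904' into a list of individual ids."""
    match = _RANGE.match(text)
    if match is None:
        return [text]  # not a range; treat as a single id
    prefix = match.group(1)
    first, last = int(match.group(2)), int(match.group(3))
    if last < first:
        raise ValueError("descending range: {!r}".format(text))
    return ["{}{}".format(prefix, n) for n in range(first, last + 1)]


def unroll(entry_id, reg_text, reg_date, ren_text, ren_date):
    """Return one row per registration/renewal id combination."""
    regs = expand_range(reg_text)
    rens = expand_range(ren_text)
    if len(regs) == len(rens):
        pairs = list(zip(regs, rens))          # parallel ranges
    elif len(rens) == 1:
        pairs = [(reg, rens[0]) for reg in regs]  # many regs, one renewal
    elif len(regs) == 1:
        pairs = [(regs[0], ren) for ren in rens]  # one reg, many renewals
    else:
        raise ValueError("cannot pair {} registrations with {} renewals"
                         .format(len(regs), len(rens)))
    return [
        {"entry_id": entry_id, "oreg": reg, "odat": reg_date,
         "id": ren, "rdat": ren_date}
        for reg, ren in pairs
    ]


rows = unroll("example-entry", "A23877-23904", "1930-05-24",
              "R213954-213981", "1958-04-28")
print(len(rows), rows[0]["oreg"], rows[0]["id"])
```

Every row shares the same `entry_id`, which is what ties the unrolled rows back to the single source entry.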
In the Stanford data for this entry each renewal id is separate, but they are all assigned to the same registration id,
A23877, allowing one match but causing 27 false negatives.
Conversely, many records in this dataset do not have authors or titles parsed into the proper fields, so it is less useful for that kind of searching.
Registration numbers and dates
As with multiple registrations, when a renewal records an "interim" (class AI) or foreign (class AF) registration followed by a regular class A registration, the Stanford data usually has the date from the earlier registration with the id from the later one. This leads to false negatives because registrations must be matched by both registration id and date. For instance, this renewal (1962, vol 16.1.1, p. 914)
R294052 is linked in the Stanford data to the registration id
A79921 and date
THIRKELL, ANGELA. The demon in the house. © 29Oct34, AI-19786; 4Mar35, A79921. Lancelot George Thirkell, Colin McInnes & Graham Campbell McInnes (C); 30Mar62; R294052.
This is not an issue for the ways Stanford's database is likely to be used: looking up a work by author's name, by title, or by registration id if it is known.
However, in the CCE registration data there are two relevant entries and we need to be able to link them to this renewal by the correct number and date pairs:
A79921/1935-03-04. There is no CCE registration with the combination
A79921/1934-10-29 so the Stanford data gives a false negative in this case (actually two). In this dataset there are two rows for
R294052, one for each of the proper date/number pairs.
Registration entries in the Catalog don't always indicate that there was a previous interim registration, so it is a happy side-effect that we can link the two via their common renewal.
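Matching on id and date together can be sketched with a dictionary keyed on the pair. The function names are hypothetical, and the sample rows are derived from the Thirkell entry quoted above, with the interim id written without its hyphen on the assumption that it is normalized the same way as `AI-9217`.

```python
from collections import defaultdict


def index_renewals(rows):
    """Map (registration id, registration date) to renewal ids."""
    index = defaultdict(list)
    for row in rows:
        index[(row["oreg"], row["odat"])].append(row["id"])
    return index


def renewals_for(index, reg_id, reg_date):
    """Return renewal ids for a registration, or [] if none match."""
    return index.get((reg_id, reg_date), [])


# Two rows for the same renewal, one per proper id/date pair.
renewal_rows = [
    {"oreg": "AI19786", "odat": "1934-10-29", "id": "R294052"},
    {"oreg": "A79921", "odat": "1935-03-04", "id": "R294052"},
]
index = index_renewals(renewal_rows)

# Both registrations link to the renewal through their own pair.
print(renewals_for(index, "A79921", "1935-03-04"))
print(renewals_for(index, "AI19786", "1934-10-29"))
# The mixed pair (later id, earlier date) finds nothing.
print(renewals_for(index, "A79921", "1934-10-29"))
```

Using `index.get` rather than indexing directly avoids adding empty entries to the `defaultdict` on failed lookups.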
Tab-delimited files have the following structure:
|entry_id||A UUID for the entry from which the row was parsed||2e8b17ae-d4a9-52f3-8774-0ec7597cb93d|
|part||Source volume part||1|
|number||Source volume number||1|
|page||Source volume page|
|oreg||Original registration id number.||A972756|
|odat||Original registration date, in YYYY-MM-DD format||1927-04-15|
|id||Renewal id number||R129296|
|rdat||Renewal date, in YYYY-MM-DD format||1954-04-19|
|claimants||Copyright claimants of the renewal and class code. Claimants and code separated by a pipe (|), multiple claimants separated by a double pipe (||)||Mary-Louise Ellsworth|W|
|new_matter||Indicates material being renewed if it is not the text as a whole||pref. and revisions|
|see_also_ren||Related renewal ids|
|see_also_reg||Related registration ids|
|full_text||Full text of the source entry|
Every row is guaranteed to have
full_text fields. If all other fields are empty it means that the entry could not be parsed.
Except for entries that could not be parsed at all, every row will have
rdat fields, which are sufficient for matching to registration data.
Wherever possible, the remaining fields are parsed and populated.
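Reading one of these files and splitting the `claimants` field might look like the sketch below. It assumes each file begins with a header row naming the fields listed above, which is not confirmed here; the sample line is illustrative, `parse_claimants` is a hypothetical helper, and a claimant with an empty code is not handled.

```python
import csv
import io


def parse_claimants(field):
    """Split 'Name|CODE||Name|CODE' into (name, code) tuples."""
    claimants = []
    for part in field.split("||"):
        if not part:
            continue
        # The code follows the last single pipe in each part.
        name, sep, code = part.rpartition("|")
        if not sep:          # no pipe at all: name only, no code
            name, code = part, ""
        claimants.append((name, code))
    return claimants


# Illustrative data; a real file would be opened with open(path, newline="").
sample = (
    "entry_id\toreg\todat\tid\trdat\tclaimants\n"
    "2e8b17ae-d4a9-52f3-8774-0ec7597cb93d\tA972756\t1927-04-15\t"
    "R129296\t1954-04-19\tMary-Louise Ellsworth|W\n"
)

# QUOTE_NONE keeps any quotation marks in titles as literal text.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows[0]["oreg"], rows[0]["odat"], rows[0]["id"], rows[0]["rdat"])
print(parse_claimants(rows[0]["claimants"]))
```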
All dates are converted to
In registration ids, all hyphens are removed except those that separate a class ending with a digit from the serial number part of the id. For instance
AI-9217 is normalized to
AI9217, but the hyphen is kept in
The class code
RE in later renewal ids is normalized to
In the database derived data (1978-) ids are padded internally with 0's, which are removed. For instance
A00000366051 is normalized to
A366051 and, combined with the previous rule,
RE0000736790 is normalized to
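A minimal sketch of the hyphen and zero-padding rules for registration ids follows. It covers only those two rules, the function name `normalize_reg_id` is hypothetical, and `B5-1234` in the usage lines is a made-up id used purely to show the shape of a class code ending in a digit.

```python
import re

# Letters followed by digits, for ids written without a hyphen.
_PLAIN_ID = re.compile(r"^([A-Za-z]+)(\d+)$")


def normalize_reg_id(raw):
    """Drop hyphens and zero padding from a registration id.

    A hyphen is kept only when the class part ends with a digit,
    where it is needed to separate class from serial number.
    """
    raw = raw.strip()
    if "-" in raw:
        cls, serial = raw.split("-", 1)
        serial = serial.replace("-", "")
        sep = "-" if cls[-1:].isdigit() else ""
    else:
        match = _PLAIN_ID.match(raw)
        if match is None:
            return raw  # unrecognized shape; leave untouched
        cls, serial, sep = match.group(1), match.group(2), ""
    # Remove internal zero padding but never produce an empty serial.
    serial = serial.lstrip("0") or "0"
    return cls + sep + serial


print(normalize_reg_id("AI-9217"))
print(normalize_reg_id("A00000366051"))
print(normalize_reg_id("B5-1234"))  # hypothetical id, hyphen kept
```

Returning unrecognized input unchanged keeps the function safe to apply to every row, at the cost of silently passing through malformed ids.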
Some errors in the transcription become apparent either because they cause the parsing to fail or because they are easy to detect once the data can be queried. Where errors have been found in the Project Gutenberg transcriptions, these are fixed by applying patches (found in the
data/pre-patches directory) before processing the files. Using patches means there is a clear record of what is being changed and we don't have to maintain a separate, corrected copy of the files. It will be apparent if the source files are ever updated because applying the patches will fail.
Some rows are simply difficult to parse automatically. Rather than adding complexity to the parsing code to handle a small number of cases, another set of patches (from the
data/post-patches directory) is applied after processing to the generated TSV files. This is a very fidgety process, however, and it is kept to a minimum.
Building the dataset yourself
It should be possible to rebuild the dataset on any sufficiently Unix-like system, by cloning this repository, switching to the
data directory and running
make. This will:
- For Gutenberg files
  - Download the Gutenberg source files
  - Apply patches in the data/pre-patch directory to the Gutenberg files to correct errors and typos
  - "Unnest" the Gutenberg files (by running them through
  - Concatenate parts into one file for each volume
  - Parse each volume file (by running them through parse.py) to generate tab-delimited files
  - Apply patches in the data/post-patch directory to the tab-delimited files.
- For Google data
  - Download and unzip the Google data
  - Generate a tab-delimited file for each year, starting with 1978, from the XML (by running expl-google-cce once for each year)
This pipeline requires:
python3.5 or greater