# Python hands-on with the five things a computer can do
Carl provided a notebook with a high-level overview of the basic building blocks of pretty much any coding task. In this notebook, we'll put all of those into action to explore what seems like it should be a fairly simple question: in any given year, how many of the books printed by William Bowyer were by living authors, and how many were by dead ones?

We won't take up those basic tasks in the same order as Carl's discussion, but rather as they crop up in the course of our task—I'll try to highlight where each of the things we do fits under those broad headings.

If you haven't done much coding before, it may seem like there end up being an awful lot of steps involved, but what we're doing is breaking down that broad question into the smaller "computationally tractable" tasks it takes to answer it. If you spend more time doing this kind of work, a lot of what we'll do in this notebook will come to seem completely routine.

If you *have* done a fair amount of coding before, you'll probably have immediately started thinking through how you'd go about exploring the question (or could start coming up with a plan in a minute or two). If that's you, you can probably skim past most of the discussion and cut straight to the code cells to make sure everything there makes sense.

## Preliminary observations and caveats
The aim of this notebook is to illustrate basic coding patterns using what could be a real book historical question. As such, the notebook ends up offering a simplified view both of coding and of book historical questions.

The code here was written with an eye towards readability for people who may be relatively new to coding: there are probably more efficient and more elegant ways to handle lots of what we'll do here.

There's also a *long* discussion to be had about using something like ESTC records in this way. Suffice it to say that there are more steps we'd want to take if this were a real research question, steps that we'll just dispense with in the interests of getting right into the code.

## Let's just think about the problem for a second
If we were looking at these records one at a time in a browser window, we wouldn't have any trouble saying whether the author of the book was alive or dead at the time of publication.  

But it's worth pausing to consider *how* we answer this question so easily: what are the steps we actually go through to decide, even if we move through them so quickly that they don't even feel like "steps"?

It's worth pausing to think about that because, when you come down to it, we're going to have to make those steps explicit if we're going to make them happen in code.

Here's some information from the first five records in the file we're going to be working with—the information we'd pay the most attention to if we were looking at these records in a tab in our browser with our question about living and dead authors in mind.

>* Davila, Arrigo Caterino,(1576-1631). *The history of the civil wars of France*. (London:printed for D. Browne..., MDCCLVIII. \[1758\]).
>
>* Holland, Richard,(1688-1730). *Observations on the small pox: or, An essay to discover a more effectual method of cure*. (London:printed for John Brindley...,1728).
>
>* Hasledine, William,(1713 or 14-1773). *The beau and the academick*. (London: printed for J. Roberts,\[1733\]).
>
>* Clarke, John,(1687-1734). *A new grammar of the Latin tongue, comprising all necessary for grammar-schools*. (London: printed for L. Hawes, W. Clarke, and R. Collins..., M.DCC.LXVII. \[1767\]).
>
> * Spinckes, Nathaniel,(1654-1727). *The new pretenders to prophecy re-examined*. (London: printed for Richard Sare..., 1710).

Take a minute or two to think about the mental processes involved in deciding whether the author of the work was alive or dead in the year the book was published and jot a phrase/sentence or two about them in the cell below. Also take note of anything you see in these summaries that might cause you to have to spend a fraction of a second longer considering some of them.


In [None]:
#Jot down some notes on how you go about answering the simple question of
#whether the author of the record was living or dead in the year of publication.







## Getting the tools we need
In an example of what Carl termed "**distributed code reuse**," our first task is to get a Python package to help us read MARC records.

While editing this notebook for brevity, I've had to cut out a lovingly-crafted section on the MARC format. There's a kind of "crash course" handout in the resources folder if you want to read a little bit more. If you want to read a lot more, you could curl up with the Library of Congress's documentation for the [MARC 21 format for bibliographic data](https://www.loc.gov/marc/bibliographic/).

The long and short of it is that, while we probably could write code to read raw MARC records ourselves, it makes a lot more sense to take advantage of the fact that somebody else has already written code to do the job for us and is giving it away for free.

`Pymarc` is a specialized-enough tool that it's not included by default in an installation of Python. We'll download it using `pip` (Package Installer for Python).

In [None]:
#Code cell #1
#Download and install the Pymarc package so that it's available for use in our
#Python environment
!pip install pymarc

Collecting pymarc
  Downloading pymarc-5.1.0-py3-none-any.whl (157 kB)
Installing collected packages: pymarc
Successfully installed pymarc-5.1.0


We've now *installed* `Pymarc`, but to actually make use of it in our script, we need to `import` it. In this case, we just need the `MARCReader` class, so that's what we'll `import`.
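As a small illustration of the two common forms of `import`, here is a sketch that uses Python's built-in `math` module rather than `Pymarc` (the choice of `math` and `sqrt` is purely for demonstration and has nothing to do with MARC records):

```python
# Import the whole module: names must be prefixed with the module name.
import math
whole_module_result = math.sqrt(16)

# Import a single name from the module: it can be used without a prefix.
from math import sqrt
single_name_result = sqrt(16)

print(whole_module_result)  # 4.0
print(single_name_result)   # 4.0
```

Importing a single name keeps the later code a little shorter, which is why we take that route with `MARCReader`.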

In [None]:
#Code cell #2
#Import the MARCReader class from the Pymarc package.
from pymarc import MARCReader

## Opening a MARC file
With `Pymarc` installed and `MARCReader` available to our script, we're ready to start dealing with MARC records! Awesome. Now we need some MARC records. And those are in a file. So we have to open a file, which obviously brings us to another of Carl's five things (**input and output**), but also, perhaps somewhat less obviously, brings us to another of the five things: **variables**.

We want to open a file that you have in the `rbs_digital_approaches_2023/2023_data_class` folder that you've cloned from GitHub to Google Drive. To do that, we first have to import the code that allows Google Colaboratory to talk with Google Drive, as we did in the first session.

In [None]:
#Code cell #3
from google.colab import drive
drive.mount('/gdrive/')

Mounted at /gdrive/


To actually work with any of our files, we'll need to provide a full file path to each file we want. Since all of those files are going to be in the same place, let's just create a variable to hold the path to the directory we want.

Then, any time we need to get a file in this directory, we can just use the variable name instead of re-typing the full path.

In [None]:
#Code cell #4
directory_path = '/gdrive/MyDrive/rbs_digital_approaches_2023/2023_data_class/'

This next cell presents the most common way of opening a file in Python. There are a couple of different things going on here, which can make this convention a little confusing at first.

Before getting into the content of the lines, first note the structure: the first line begins with `with` and ends with a colon, which sets up a context (here, an open file) for the indented code that follows. This is one example of the kind of **control structures** that Carl's notebook mentions. A useful side effect is that Python closes the file for us when the indented block finishes.

Notice, especially, that the second line is indented. This is important because, in Python, *whitespace has meaning*. Python infers just from the whitespace that the code in line two happens "inside" the code on line one. If line two weren't indented properly, we'd end up with an error.

Now for the content of the lines. We're using the `open` command and giving it two "arguments" inside the parentheses:
1. The path to the file we want to open (which I've built up here by using the `directory_path` and adding the name of the specific file I want).
2. The way we want to open the file. In this case, `rb` stands for "readable" and "binary" (as opposed to text: `MARCReader` wants to read the MARC records as bytes). We'll see other variations this week.

Then—still in line 1—we're creating a new variable (`marc_file`) to represent the file we've just opened. This part can be a little confusing because it's not necessarily obvious that the word `as` would mean something like "Create a new variable and assign a value to it," but that's what's implicit in that formula. (Opening files is something you have to do a lot, so having a very terse, economical syntax for doing it is a real convenience.)

Line two creates *another* variable (`marc_reader`). We're giving our MARC file to `MARCReader` to handle (saying, in effect, "Uh, you know how to read this stuff, right?"). `MARCReader` will then allow us to deal with the contents of the MARC records in our file in ways that are more congenial for Python coding. From now on, we'll use `marc_reader` to refer to the representation of our file that the MARC-savvy `MARCReader` constructs for us.
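To see the `with open(...) as ...` pattern on its own, here is a minimal sketch that doesn't need our MARC file at all. It writes a few bytes to a temporary file (the file and the names `sample_path` and `sample_file` are made up for this illustration) and then reads them back in `'rb'` mode:

```python
import os
import tempfile

# Create a throwaway file containing a few bytes (a stand-in for a real data file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'hello, bytes')
    sample_path = tmp.name

# 'rb' opens the file for reading in binary mode;
# "as sample_file" assigns the open file to a new variable.
with open(sample_path, 'rb') as sample_file:
    contents = sample_file.read()
    closed_inside = sample_file.closed

print(contents)            # b'hello, bytes' (bytes, not ordinary text)
print(closed_inside)       # False: the file is open inside the block
print(sample_file.closed)  # True: Python closed it when the block ended

# Tidy up the throwaway file.
os.remove(sample_path)
```

The variable `sample_file` still exists after the block ends, but the file it refers to has been closed, so anything that reads from the file needs to be indented inside the `with` block.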

For now, we'll just have `MARCReader` print the records out for us.



In [None]:
#Code cell #5
#Open a file in readable binary mode and refer to it as marc_file
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  #Give marc_file to MARCReader and refer to whatever we get back as marc_reader
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    print(record)

=LDR  01493nam a2200229   4500
=001  N10049
=003  CU-RivES\
=005  20090206223027.0
=008  840710s1758\\\\enk||||\\\\\\\00|\||eng\c
=009  006000111\
=035  \\$a(Uk-ES)006000111 
=040  \\$aCU-RivES$cCU-RivES$dCU-RivES$dCStRLIN$dCU-RivES
=100  1\$aDavila, Arrigo Caterino,$d1576-1631.
=240  10$aIstoria delle guerre civilé di Francia.$lEnglish
=245  14$aThe history of the civil wars of France.$bIn which are related, the most remarkable transactions that happened during the reigns of Francis the Second, Charles the Ninth, Henry the Third, and, Henry the Fourth, surnamed the Great. A new translation from the Italian of Henrico Caterino Davila. By Ellis Farneworth, M. A. ... 
=260  \\$aLondon :$bprinted for D. Browne, without Temple-Bar A. Millar, in the Strand J. Whiston and B. White, in Fleet-Street R. and J. Dodsley, in Pall-Mall and W. Sandby, in Fleet-Street,$cMDCCLVIII. [1758] 
=300  \\$a2v. ;$c4⁰. 
=533  \\$aMicrofilm.$bWoodbridge, Conn.:$cPrimary Source Media,$d1999.$e1 reel ; 35 mm.$f(T

## Iterating through the records with MARCReader
The last two lines of the cell above (the line beginning with `for` and the `print` line beneath it) form another **control structure** known as a `for` loop.

Notice how the `for` line is indented to be "inside" the `with` statement, and that the `print` line is indented to be "inside" the `for` loop. We have a nested control structure here, and it's all signaled by white space.

Our `for` loop creates, in passing, yet another **variable** (`record`). It says, essentially:
* `marc_reader` contains an unknown number of things. We'll call each of those things `record`, in turn.
    * Print the contents of every `record`.
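The same loop pattern can be sketched without any MARC data. In the example below, `sample_records` is a made-up list of strings standing in for the records that `marc_reader` would hand us:

```python
# Made-up stand-ins for MARC records.
sample_records = ['first record', 'second record', 'third record']

printed = []
for record in sample_records:
    # On each pass through the loop, "record" refers to the next item in the list.
    print(record)
    printed.append(record)
```

The loop body runs once per item, and we never have to say how many items there are.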

(In this cell, we're starting with a very small subset of the records we'll work with eventually. The full file is large enough that Colaboratory would complain about the I/O rate and cut our notebook off temporarily if we tried to simply print all of the records like this.)

### Getting individual fields from each MARC record
Now we know that there are, in fact, MARC records, but just printing out the full record doesn't do us much good. Let's get just a single field from each record.

In [None]:
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    print(record['245'])

=245  14$aThe history of the civil wars of France.$bIn which are related, the most remarkable transactions that happened during the reigns of Francis the Second, Charles the Ninth, Henry the Third, and, Henry the Fourth, surnamed the Great. A new translation from the Italian of Henrico Caterino Davila. By Ellis Farneworth, M. A. ... 
=245  10$aObservations on the small pox: or, An essay to discover a more effectual method of cure.$bBy Richard Holland, M.D. Fellow of the College of Physicians and the Royal Society.
=245  14$aThe beau and the academick.$bA dialogue in imitation of Bellus homo and academicus, spoken at the late publick act at Oxford. Address'd to the ladies. 
=245  12$aA new grammar of the Latin tongue, comprising all necessary for grammar-schools.$bTo which is annexed a dissertation upon language. By John Clarke, Author of the Two Essays upon Education and Study, Introduction to the Making of Latin, &c.
=245  14$aThe new pretenders to prophecy re-examined:$band their pre

Okay, but we're still getting stuff we don't want, like the field codes, the MARC indicators, and subfield codes. Let's just get the text content of the subfield a of field 245, which holds the main title.

(Notice how, with `MARCReader`, we're getting specific sections of a field by adding information in square brackets—this is a common pattern we'll see again. In this case we're putting the names of the fields we want in quotation marks; in other cases we might use numbers, instead.)
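The square-bracket pattern can be tried out with ordinary Python dictionaries and lists. This is only an analogy: a Pymarc record is not literally a dictionary, and `sample_record` and `sample_titles` below are made-up stand-ins.

```python
# A made-up stand-in for one record: a dictionary of fields,
# each of which holds a dictionary of subfields.
sample_record = {
    '245': {
        'a': 'The beau and the academick.',
        'b': 'A dialogue in imitation of Bellus homo and academicus',
    }
}

# Names in quotation marks pick out a field, then a subfield.
main_title = sample_record['245']['a']
print(main_title)

# With a list, we use a number (a position) instead of a name.
sample_titles = ['The beau and the academick.', 'Laws relating to the poor,']
first_title = sample_titles[0]
print(first_title)
```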

In [None]:
#Code cell #6
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    print(record['245']['a'])

The history of the civil wars of France.
Observations on the small pox: or, An essay to discover a more effectual method of cure.
The beau and the academick.
A new grammar of the Latin tongue, comprising all necessary for grammar-schools.
The new pretenders to prophecy re-examined:
Laws relating to the poor,


While we're at it, let's construct a variable of our own that will hold both the main title and the continuation of the title as a text string. This is "basic math" of a kind, in that we're creating `full_title` by adding together the contents of `['245']['a']` and `['245']['b']`.

In [None]:
#Code cell #7
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    full_title = record['245']['a'] + record['245']['b']
    print(full_title)

The history of the civil wars of France.In which are related, the most remarkable transactions that happened during the reigns of Francis the Second, Charles the Ninth, Henry the Third, and, Henry the Fourth, surnamed the Great. A new translation from the Italian of Henrico Caterino Davila. By Ellis Farneworth, M. A. ... 
Observations on the small pox: or, An essay to discover a more effectual method of cure.By Richard Holland, M.D. Fellow of the College of Physicians and the Royal Society.
The beau and the academick.A dialogue in imitation of Bellus homo and academicus, spoken at the late publick act at Oxford. Address'd to the ladies. 
A new grammar of the Latin tongue, comprising all necessary for grammar-schools.To which is annexed a dissertation upon language. By John Clarke, Author of the Two Essays upon Education and Study, Introduction to the Making of Latin, &c.
The new pretenders to prophecy re-examined:and their pretences shewn to be groundless and false. And Sir R. Bulkeley

If you look closely, you'll notice a little problem with the `full_title` variable we created in the last cell:
>
```
The history of the civil wars of France.In which are related, ...
...more effectual method of cure.By Richard Holland...
The beau and the academick.A dialogue in...
...necessary for grammar schools.To which is annexed...
```
>
Our script did exactly what we said: it took the string of text in `['245']['a']` and added the string of text in `['245']['b']` to it. We never said anything about putting a space between those two strings of text.

Based on what we've seen about "adding" strings of text together in the last cell, try to construct the `full_title` variable to address the problem of the missing space.

In [None]:
#Write-your-own-code cell A
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    #Construct the value of the full_title variable in a way that includes a
    #space between the content of the two subfields
    full_title =
    print(full_title)

### Conditionals
Printing out the title for every record is a safe bet because every record *has* to have something in the 245 field—you can't have a book without a title.

But look what happens when we try to print out the contents of field 100, subfields a and d: the author's name and dates.

In [None]:
#Code cell #8
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    print(record['100']['a'] + ' ' + record['100']['d'])

Davila, Arrigo Caterino, 1576-1631.
Holland, Richard, 1688-1730.
Hasledine, William, 1713 or 14-1773.
Clarke, John, 1687-1734.
Spinckes, Nathaniel, 1654-1727.


KeyError: ignored

Whoops. Something's not working with the last record. Let's have a look.

In [None]:
#Code cell #9
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    print(record)

=LDR  01493nam a2200229   4500
=001  N10049
=003  CU-RivES\
=005  20090206223027.0
=008  840710s1758\\\\enk||||\\\\\\\00|\||eng\c
=009  006000111\
=035  \\$a(Uk-ES)006000111 
=040  \\$aCU-RivES$cCU-RivES$dCU-RivES$dCStRLIN$dCU-RivES
=100  1\$aDavila, Arrigo Caterino,$d1576-1631.
=240  10$aIstoria delle guerre civilé di Francia.$lEnglish
=245  14$aThe history of the civil wars of France.$bIn which are related, the most remarkable transactions that happened during the reigns of Francis the Second, Charles the Ninth, Henry the Third, and, Henry the Fourth, surnamed the Great. A new translation from the Italian of Henrico Caterino Davila. By Ellis Farneworth, M. A. ... 
=260  \\$aLondon :$bprinted for D. Browne, without Temple-Bar A. Millar, in the Strand J. Whiston and B. White, in Fleet-Street R. and J. Dodsley, in Pall-Mall and W. Sandby, in Fleet-Street,$cMDCCLVIII. [1758] 
=300  \\$a2v. ;$c4⁰. 
=533  \\$aMicrofilm.$bWoodbridge, Conn.:$cPrimary Source Media,$d1999.$e1 reel ; 35 mm.$f(T

It turns out not every record has a `100` field.

Let's put in a **conditional** statement to first see if there *is* a `100` field before trying to print the author's name.
>Note: In earlier versions of this notebook I used a different syntax that is fairly standard in Python, but that stopped working with a recent update to Pymarc. This code uses the record's `get` method to test whether a field exists: we pass the field code and a default value (in this case `None`). If the field exists, we'll get the field back; if it doesn't exist, we'll get back the default value we supplied. If we get back anything other than `None`, we know there's a 100 field and we can execute the conditional code. If there's not a 100 field, then we won't pass this conditional check and will move to the code that comes after `else`.

In [None]:
#Code cell #10
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      print(record['100']['a'] + ' ' + record['100']['d'])
    else :
      print('No 100 field')

Davila, Arrigo Caterino, 1576-1631.
Holland, Richard, 1688-1730.
Hasledine, William, 1713 or 14-1773.
Clarke, John, 1687-1734.
Spinckes, Nathaniel, 1654-1727.
No 100 field
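The "get it, or give me a default" pattern is not specific to Pymarc: ordinary Python dictionaries have a `get` method that behaves the same way. The sketch below uses two made-up dictionaries as stand-ins for records (real Pymarc records are not plain dictionaries), one with a `'100'` entry and one without:

```python
# Made-up stand-ins for two records.
record_with_author = {'100': 'Clarke, John, 1687-1734.',
                      '245': 'A new grammar of the Latin tongue'}
record_without_author = {'245': 'Laws relating to the poor,'}

messages = []
for sample_record in [record_with_author, record_without_author]:
    # get() returns the value if the key exists, and our default (None) if not.
    author = sample_record.get('100', None)
    if author is not None:
        messages.append(author)
    else:
        messages.append('No 100 field')

print(messages)
```

Asking for `record_without_author['100']` directly would raise a `KeyError`, which is the kind of failure the conditional check lets us avoid.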


### Try this for yourself
In the cell below, write code that will:
1. Open the MARC file we've been working with;
2. Pass that file to MARCReader
3. Iterate through the records and print:
    * The contents of field 100, subfield a and field 100, subfield d, **if that field exists**
    * The contents of field 260, subfield c, **if that field exists**

In [None]:
#Write-your-own-code cell B
#Your code goes here:


### One solution
If you're running into trouble, click to reveal a solution.

In [None]:
#Suggestion code cell 1
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      print(record['100']['a'] + ' ' + record['100']['d'])
    if record.get('260', None) is not None :
      print(record['260']['c'])

Davila, Arrigo Caterino, 1576-1631.
MDCCLVIII. [1758] 
Holland, Richard, 1688-1730.
1728.
Hasledine, William, 1713 or 14-1773.
[1733] 
Clarke, John, 1687-1734.
M.DCC.LXVII. [1767] 
Spinckes, Nathaniel, 1654-1727.
1710. 
MDCCXLIII. [1743]


## Getting the information we need
To figure out whether an author was alive or dead the year a work was published, we need to know the year the work was published. As we saw above, field 260|c gives the date *as it appears on the title page*.

As it happens, though, another part of the MARC record also has the publication date, and it's presented in such a way that we won't have to deal with things like Roman numerals. The MARC 008 field is a structured data field that tells us about things like country of publication, language, and date of publication. Let's have a look at that field. (Note that MARC "control fields," including the 008 field, don't have subfields, so we need to use `.data` to get the content of those fields.)

In [None]:
#Code cell #11
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    print(record['008'].data)

840710s1758    enk||||       00| ||eng c
840710s1728    enk||||       00| ||eng c
820917s1733    enk||||       00| ||eng c
840716s1767    enk||||       00||||eng c
840716s1710    enk||||       00| ||eng c
840719s1743    enk||||       00||||eng c


Looking at [the documentation for the 008 field](https://www.loc.gov/marc/bibliographic/bd008a.html), we can see what parts of this fixed-length field are actually helpful to us.
* Character 6 tells us what *kind* of date (or dates) the field provides: a single year, multiple years, etc. (There are only single years in this subset, but there are other combinations to be found in the full set.)
* If there's only one year, we'll find it in characters 7-10; if there are two years, characters 7-10 will give us the first year and characters 11-14 will give us the second year.

This is one of those cases where we'd really need to know our data better to know what would give us the *best* answer, but for demonstration purposes, let's just treat the first year (that is, the one in characters 7-10) as our publication year. (**Note:** This is just one factor in this quick demonstration that means what we're getting from this notebook should not be considered a real answer to our book historical question!)

### Slicing a substring from a longer string
We can get just the characters that interest us by indicating a starting and ending point in the string (a position like this is referred to as an "index"). The MARC documentation tells us exactly which characters, so let's try getting our string from character 7 to character 10.

Just like we used information in square brackets to get segments of MARC fields, we can add information in square brackets to get just a portion of MARC field 008. We can "slice" a string by indicating where we want to start and where we want to stop. So, if we're looking for characters 7-10, we might expect to use something like `record['008'].data[7:10]`.

There are two things we need to bear in mind here:
* In Python, we start counting at zero, rather than one. (The MARC documentation also starts counting at zero, so we're okay there.)
* In Python, when we're slicing something like a string of text, the starting point of a slice is inclusive, but the ending point is exclusive: we get everything up to *but not including* the ending.

This can be confusing at first, but it's actually kind of nice because it means you can know how many elements you're going to get back by subtracting the first number from the second: `[7:10]` would only get us three digits. We want `[7:11]`, instead, which will get us the four characters starting at character 7 and going up to (but not including) character 11.
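Here is a quick sketch of those slicing rules, using the 008 value from the first record in our sample typed in by hand as a plain string (so no MARC file is needed to try it):

```python
# The 008 field of the first sample record, copied in as an ordinary string.
field_008 = '840710s1758    enk||||       00| ||eng c'

# A single index gets a single character: position 6 is the type of date.
date_type = field_008[6]

# The end of a slice is exclusive, so [7:10] stops one character short.
too_short = field_008[7:10]

# [7:11] gets characters 7, 8, 9, and 10: four characters, since 11 - 7 = 4.
pub_year = field_008[7:11]

print(date_type)  # s
print(too_short)  # 175
print(pub_year)   # 1758
```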

In [None]:
#Code cell #12
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    print(record['008'].data[7:11])

1758
1728
1733
1767
1710
1743


### More ways of manipulating strings of text
We're in pretty good shape for the publication years. Now let's turn to the authors' dates.

Unlike the 008 field, which has a predictable, fixed length, authors' dates can come in lots of different forms, and we need to figure out how to get just the part of that field that interests us.

Python has a lot of different [methods for working with strings of text](https://docs.python.org/3/library/stdtypes.html?highlight=split#string-methods)—more than we can reasonably cover in an hour or so. We'll try to explain various methods that we use as they come up in the code we'll use this week, but feel free to ask about anything that seems unclear.

We'll stick with the subset of records we've been working with for the time being to figure out an approach that will work for most of our cases, but then we'll look at a different subset selected to bring certain problems into view.

Let's begin by just looking at the authors' dates again.

In [None]:
#Code cell #13
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      print(record['100']['d'])

1576-1631.
1688-1730.
1713 or 14-1773.
1687-1734.
1654-1727.


#### Finding a substring
We know we want the digits after the hyphen; it's just that we can't know ahead of time where the hyphen will be in every string of text we find in a 100|d subfield.

But if we can find the hyphen, we'd at least have a way of knowing our starting index. We can use Python's built-in `find()` method to locate the hyphen. The way this is written can be confusing at first: `find()` is a string method, so we start by giving the string we're interested in searching and then attaching the `find()` method to it. If the substring we're looking for is found, we'll get the position where it starts in the string we were searching (`'mystring'.find('ri')` would give us a result of `4`). If the substring isn't found in the string we're searching, the result will be `-1`.
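A few quick calls show how `find()` behaves. The date strings are copied by hand from the sample records; everything else is just for illustration:

```python
# find() returns the index where the substring starts.
position_in_word = 'mystring'.find('ri')

# The hyphen is not always in the same place in an author's dates.
simple_dates = '1576-1631.'
messy_dates = '1713 or 14-1773.'
simple_hyphen = simple_dates.find('-')
messy_hyphen = messy_dates.find('-')

# If the substring isn't there at all, find() returns -1 rather than raising an error.
not_found = simple_dates.find('?')

print(position_in_word)  # 4
print(simple_hyphen)     # 4
print(messy_hyphen)      # 10
print(not_found)         # -1
```

That `-1` is worth remembering: it is also a valid index in Python (it means "the last position"), so a failed search can quietly produce a slice we didn't intend.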

To make what's happening in the next cell clearer, I'm going to break things down into smaller steps and print the output one step at a time before putting it all together in one line.

For now, we'll just use the starting index and a colon with nothing after it to show a feature of slicing in Python: if we provide only a starting index and no ending index, we'll get a substring that begins where we designated and continues to the end of the string.

In [None]:
#Code cell #14
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      dates = record['100']['d']
      print("record['100']['d'] = " + dates)
      #Find the hyphen
      hyphen_position = dates.find('-')
      #What is hyphen_position, actually?
      print('hyphen_position = ' + str(hyphen_position))

      #Use hyphen position as the starting point of our slice. No ending position
      #means we'll get the remainder of the string
      print('dates[hyphen_position:] = ' + dates[hyphen_position:])

      #Opening indices are inclusive. Add 1 to get rid of the hyphen
      print('dates[hyphen_position+1:] = ' + dates[hyphen_position+1:])

      #One-liner
      print(dates[dates.find('-')+1:])
      print('---------')


record['100']['d'] = 1576-1631.
hyphen_position = 4
dates[hyphen_position:] = -1631.
dates[hyphen_position+1:] = 1631.
1631.
---------
record['100']['d'] = 1688-1730.
hyphen_position = 4
dates[hyphen_position:] = -1730.
dates[hyphen_position+1:] = 1730.
1730.
---------
record['100']['d'] = 1713 or 14-1773.
hyphen_position = 10
dates[hyphen_position:] = -1773.
dates[hyphen_position+1:] = 1773.
1773.
---------
record['100']['d'] = 1687-1734.
hyphen_position = 4
dates[hyphen_position:] = -1734.
dates[hyphen_position+1:] = 1734.
1734.
---------
record['100']['d'] = 1654-1727.
hyphen_position = 4
dates[hyphen_position:] = -1727.
dates[hyphen_position+1:] = 1727.
1727.
---------


We need to get rid of the period at the end of the string, and there are several different approaches we could take.
* We could use `find()` again to find the period and use its position as our ending index.
* We could use an ending index of -1 to go up to (but not including) the last character in the string
* We could use the `strip()` method to eliminate the period from the string.

For the records in this subset, any of those approaches will give us the same result.

In [None]:
#Code cell #15
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      dates = record['100']['d']
      print('Finding the period as an ending index = ' + \
            dates[dates.find('-')+1:dates.find('.')])
      print('Using -1 as an ending index = ' + dates[dates.find('-')+1:-1])
      print("Using strip('.') = " + dates[dates.find('-')+1:].strip('.'))
      print('----------')

Finding the period as an ending index = 1631
Using -1 as an ending index = 1631
Using strip('.') = 1631
----------
Finding the period as an ending index = 1730
Using -1 as an ending index = 1730
Using strip('.') = 1730
----------
Finding the period as an ending index = 1773
Using -1 as an ending index = 1773
Using strip('.') = 1773
----------
Finding the period as an ending index = 1734
Using -1 as an ending index = 1734
Using strip('.') = 1734
----------
Finding the period as an ending index = 1727
Using -1 as an ending index = 1727
Using strip('.') = 1727
----------
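The three approaches agree on this subset, but they are not interchangeable in general. As a sketch, consider a hypothetical dates string that has no period at the end (this value is invented for the example and does not come from our file):

```python
# A hypothetical value: the same kind of dates, but without the final period.
dates = '1688-1730'
start = dates.find('-') + 1

# find('.') returns -1 when there is no period, so this slice drops the last digit.
by_find = dates[start:dates.find('.')]

# An ending index of -1 always drops the last character, period or not.
by_minus_one = dates[start:-1]

# strip('.') only removes periods from the ends of the string, so nothing is lost.
by_strip = dates[start:].strip('.')

print(by_find)       # 173
print(by_minus_one)  # 173
print(by_strip)      # 1730
```

If we can't be sure that every value ends in a period, `strip('.')` (or `rstrip('.')`, which only looks at the right-hand end) is the more forgiving choice.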


#### Splitting a string to a list
So far, all of our approaches have involved finding the hyphen in our string and using its location to slice the string—we've just used a few different methods after that.

If there's a character we know serves as a separator between the parts of a string that we're interested in, we can use the `split()` method to split the string apart every time we encounter that separator and store the individual components as a list.

As you'll see, lists appear in square brackets with their individual items separated by commas. Because the items in our lists are all strings, each item will appear in single quotation marks. (Note, though, that a list can contain items of different types—strings, numbers, other lists, etc.)
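Here is a small sketch of `split()` on strings typed in by hand, along with a list that mixes types (the `mixed_list` value is invented for illustration):

```python
# Splitting on a hyphen: the separator itself is not kept.
date_parts = '1713 or 14-1773.'.split('-')

# Splitting on a comma: note the leading space that survives in the second item
# and the empty string produced by the trailing comma.
name_parts = 'Clarke, John,'.split(',')

# A list can hold items of different types, including other lists.
mixed_list = ['1687', 1734, ['Clarke', 'John']]

print(date_parts)  # ['1713 or 14', '1773.']
print(name_parts)  # ['Clarke', ' John', '']
print(mixed_list)
```

`split()` does exactly what it's told and nothing more, so stray spaces and empty strings are ours to clean up afterwards.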

In [None]:
#Code cell #16
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      date_string = record['100']['d']
      dates = date_string.split('-')
      print(dates)

['1576', '1631.']
['1688', '1730.']
['1713 or 14', '1773.']
['1687', '1734.']
['1654', '1727.']


Working with the elements of a list is going to look a lot like what we've been doing in getting substrings of longer strings, because items in lists also have indices, and they behave the same way:

* `my_list[1:5]` would get the second through fifth items in a list
* `my_list[2:]` would return everything from the third item of the list to the end
* `my_list[:-1]` would return all of the items except the last
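
To make those three slices concrete, here is a minimal sketch using a made-up list of letters (`my_list` and its contents are just for illustration, not part of our MARC data), followed by the same indexing applied to one of the date strings we split above.

```python
# A made-up list, used only for illustration
my_list = ['a', 'b', 'c', 'd', 'e', 'f']

second_through_fifth = my_list[1:5]  # items at indices 1, 2, 3, and 4
third_to_end = my_list[2:]           # everything from index 2 onwards
all_but_last = my_list[:-1]          # everything except the final item

print(second_through_fifth)
print(third_to_end)
print(all_but_last)

# The same kind of indexing works on a list produced by split()
dates = '1688-1730.'.split('-')
print(dates[0])    # the first item
print(dates[-1])   # the last item (negative indices count from the end)
```

Just as with strings, the ending index of a slice is *not* included in the result, which is why `my_list[1:5]` stops at the fifth item rather than the sixth.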

Let's have a look at the structure of a list using `enumerate()` (which explicitly returns list indices as well as values).

* First, we'll print the list, itself, again.
* Then we'll iterate through the enumerated list using a `for` loop to show the `index` and `value` of each item in the list.
* Finally, we'll print out each item in the list individually using its list index.



In [None]:
#Code cell #17
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      date_string = record['100']['d']
      dates = date_string.split('-')
      print(dates)
      #Note: another for loop
      for index, value in enumerate(dates) :
        print(str(index) + ': ' + value)
      print(dates[0])
      print(dates[1])
      print('----------')

['1576', '1631.']
0: 1576
1: 1631.
1576
1631.
----------
['1688', '1730.']
0: 1688
1: 1730.
1688
1730.
----------
['1713 or 14', '1773.']
0: 1713 or 14
1: 1773.
1713 or 14
1773.
----------
['1687', '1734.']
0: 1687
1: 1734.
1687
1734.
----------
['1654', '1727.']
0: 1654
1: 1727.
1654
1727.
----------


Our `dates` lists only have two items, of course. But we can get each item in the list by using its index. In the cell below, I've split the `date_string` to create a `dates` list. Fill in the code to assign items from the `dates` list to two new variables (`birth_year` and `death_year`).

In [None]:
#Code cell #18
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      author_name = record['100']['a']
      date_string = record['100']['d']
      dates = date_string.split('-')
      #Assign values to the variables birth_year and death_year, using the list index of
      #the appropriate item in the dates list
      birth_year =
      death_year =
      print(author_name)
      print('--Birth year: ' + birth_year)
      print('--Death year: ' + death_year)

#### Solution
If you're having trouble, click here to reveal a solution.

In [None]:
#Suggestion code cell 2
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      author_name = record['100']['a']
      date_string = record['100']['d']
      dates = date_string.split('-')
      #Assign values to the variables birth_year and death_year, using the list index of
      #the appropriate item in the dates list
      birth_year = dates[0]
      #Let's go ahead and remove the period from the end of the death year now,
      #rather than having to do it in a second step.
      death_year = dates[1].rstrip('.')
      print(author_name)
      print('--Birth year: ' + birth_year)
      print('--Death year: ' + death_year)

Davila, Arrigo Caterino,
--Birth year: 1576
--Death year: 1631
Holland, Richard,
--Birth year: 1688
--Death year: 1730
Hasledine, William,
--Birth year: 1713 or 14
--Death year: 1773
Clarke, John,
--Birth year: 1687
--Death year: 1734
Spinckes, Nathaniel,
--Birth year: 1654
--Death year: 1727


### You thought we were pretty much in the clear at this point, didn't you?
So, that's been taking shape nicely: we know how to get publication years from the 008 field without having to get into any messiness with Roman numerals or anything in the 260|c field, and we have several different ways to isolate death years in the 100|d field.

Let's have a look at a different subset of our MARC records, though, because there are some more wrinkles we're going to have to deal with.



In [None]:
#Code cell #19
with open(directory_path + '2023_d1_estc_bowyer_problem_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      author_name = record['100']['a']
      #A new wrinkle in this particular set, and a different bit of PyMarc...
      if record.get('100', None).get('c', None) is not None :
        author_name += ' ' + record['100']['c']
      date_string = record['100']['d']
      print(author_name + ' (' + date_string + ')')

Cowper, Ashley, (d. 1788.)
White, John, (ca. 1685-1755.)
De-La-Cour, James, (1709-ca. 1785.)
Wright, J. (fl. 1720-1744.)
Ogilby, William, (b. ca. 1713.)
White, Stephen, (b. 1696 or 7.)
Fauques, Marianne-Agnès Pillement, dame de, (ca. 1720-ca. 1777.)
Buckingham, John Sheffield, Duke of, (1648-1720 or 21.)


Oh boy. Some of these date fields are going to present problems for us. We know when Ashley Cowper died, at least, even if we don't know when he was born. We know *that* John Sheffield, Duke of Buckingham died, but there's some ambiguity about exactly when. But we only have birth years for William Ogilby and Stephen White. We have birth and death years for Marianne Agnès Pillement Fauques, but they're both approximate. And J. Wright only gets a "fl." How much confidence can we have in those dates as a guide to his/their lifespan?

### Introducing Regular Expressions
Regular expressions provide us with a way for searching for *patterns* of text, even if we don't know the specific form that the text will take.

Consider US phone numbers, for example, which often take the form of (xxx) xxx-xxxx. A phone number should only include numerical characters, and not letters. So if you wanted to detect telephone numbers in a document, you would look for any continuous string of 14 characters that followed the pattern:

* Open parenthesis
* Three digits (0-9)
* Close parenthesis
* A space
* Three digits (0-9)
* A hyphen
* Four digits (0-9)

Written as a regular expression, that might take the form:
> `'\([0-9]{3}\)\s[0-9]{3}\-[0-9]{4}'`

It looks like there's a lot going on there, but it's really not so bad.

This regular expression starts with a backslash because certain characters, including parentheses, have special meanings in regular expressions, so when we actually want to find those characters, we need to "escape" them to make sure that our regular expression evaluates them as text, and not as part of the regular expression. (A good example in this regular expression is the hyphen. Note how there's a hyphen each time we see `0-9`, for example. *Those* hyphens mean "in the range of zero to nine." When we come to the hyphen in the middle of the phone number, we "escape" it to make it explicit that our regular expression should look for a literal hyphen character. Strictly speaking, a hyphen only has its "range" meaning inside square brackets, so the escape here is a precaution rather than a requirement.)
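
Here is a minimal sketch of that phone number pattern in action. The sentence and the phone numbers in it are made up for illustration; the pattern is the one given above, written as a raw string (the `r` before the opening quotation mark) so that Python leaves the backslashes alone.

```python
import re

# The phone number pattern from above, written as a raw string
phone_pattern = re.compile(r'\([0-9]{3}\)\s[0-9]{3}\-[0-9]{4}')

# A made-up sentence containing two made-up phone numbers
text = 'Call (434) 555-0123 before noon, or (212) 555-0199 after.'

# findall() returns a list of every piece of text that matches the pattern
phone_numbers = phone_pattern.findall(text)
print(phone_numbers)
```

Note that a number written in a different style, like `434-555-0123`, would not match: the pattern only describes the one form with parentheses and a space.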

The square brackets enclose sets of characters that *could* appear in the pattern we're looking for. In this case, that's just `[0-9]`, but we could also see things like:
* `[A-Z]` (upper-case letters)
* `[a-z]` (lower-case letters)
* `[A-Za-z0-9]` (any mix of upper-case letters, lower-case letters, and numbers)
* `[cho]` (lower-case letters c, h, or o, specifically)
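
As a quick illustration of those character sets, the sketch below runs each one against a made-up string (`sample` is not from our data). Where I've added a `+` after the closing bracket, it means "one or more of these characters in a row," which is one of those other special patterns.

```python
import re

# A made-up string, used only for illustration
sample = 'Chapter 12: The Coach House'

upper_case = re.findall(r'[A-Z]', sample)         # single upper-case letters
lower_runs = re.findall(r'[a-z]+', sample)        # runs of lower-case letters
digit_runs = re.findall(r'[0-9]+', sample)        # runs of digits
mixed_runs = re.findall(r'[A-Za-z0-9]+', sample)  # runs of letters and digits
c_h_or_o = re.findall(r'[cho]', sample)           # only lower-case c, h, or o

print(upper_case)
print(lower_runs)
print(digit_runs)
print(mixed_runs)
print(c_h_or_o)
```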

There are other special patterns that we may encounter this week, but rather than getting into it all now, we'll just try to explain them as they come up.

Regular expressions can be *extremely* useful, but also kind of confusing. We've provided a handout with a crash course on regular expressions which you might find helpful, and there are also lots of resources available online to help you master them. (Especially helpful are interactive sites like [RegEx101](https://regex101.com) and [RegExr](https://regexr.com).)

Regular expressions can get pretty complex and confusing, but sometimes fairly simple ones will get the job done.

#### Dealing with unhelpful date fields
For our purposes, there are a few different patterns that are going to spell different kinds of trouble (roughly in descending order of severity):
* Any date field that begins with "b." (There's no death year. We can't use this at all.)
* Any date field that begins with "fl." (Those aren't necessarily real dates. We shouldn't use these.)
* Any date field that contains "ca."
  - If it's expressing approximation about the birth year (i.e., before the hyphen), we might be able to use it.
  - If it's expressing approximation about the death date (i.e., after the hyphen), we can't use it.
* Any date field that contains "or"
  - If it's expressing a choice between birth years, but the death year is solid, we're okay with it
  - If the birth year is solid, but the death year is uncertain, we're less okay with it
* Any date field beginning with "d." (At least we have a death year, though we should be on the lookout for other tricky bits like "d. ca. 1778"  or "d. 1750 or 51")

Rather than trying to explain the regular expressions and control structures here, I'll add comments to the code to explain things as we go.

Note that I'm using `re.compile()` to define my regular expressions. I could simply write the regular expressions in my control structures, but I prefer to compile them first for two reasons:
1. It strikes me as usually more readable to use a variable name as the first argument in `re.search()` or `re.findall()`, rather than a potentially gnarly regular expression.
2. It allows me to reuse the same regular expression in different places, if need be.
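
A minimal sketch of what that reuse looks like, using the same four-digit year pattern that appears in the next cell and one of the date strings from our first sample. A compiled pattern can be handed to `re.search()` or `re.findall()`, as I do in this notebook, or you can call `.search()` and `.findall()` on the compiled pattern itself; the two styles should give the same results.

```python
import re

# Compile the pattern once...
year_pattern = re.compile(r'\d{4}')

date_string = '1688-1730.'

# ...then pass it to the module-level functions (the style used in this notebook)
first_match = re.search(year_pattern, date_string)
all_years = re.findall(year_pattern, date_string)

# ...or call the equivalent methods on the compiled pattern itself
first_match_again = year_pattern.search(date_string)
all_years_again = year_pattern.findall(date_string)

# search() returns a match object; group() gives us the matched text
print(first_match.group())
print(all_years)
```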

In [None]:
#Code cell 20
#We need to import Python's re (Regular Expression) library to be able to work
#with regular expressions
import re

#Compile a regular expression to match a series of four digits
#(equivalent to [0-9]{4})
year_pattern = re.compile(r'\d{4}')

#Compile a regular expression to match strings starting with "b." or "fl."
#(any combination of one or two of the characters b, f, or l, optionally
#followed by a period); OR (the "|" signals alternatives) including a hyphen,
#optionally followed by a space, followed by the characters "ca", optionally
#followed by a period.
born_flourished_cadeath = re.compile(r'^[bfl]{1,2}\.?|\-\s?ca\.?')

#Compile a regular expression to match any string starting with "d", optionally
#followed by a period.
died = re.compile(r'^d\.?')

#Compile a regular expression to match any string with "or" appearing after a
#hyphen (that is, a hyphen followed by one or more characters of any kind, followed
#by a space, the characters "or" and another space).
ordeath = re.compile(r'\-.+\sor\s')

with open(directory_path + '2023_d1_estc_bowyer_problem_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      author_name = record['100']['a']
      date_string = record['100']['d']
      print(author_name + ' (' + date_string + ')')
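      #Look for occurrences of the year pattern (there should be either one or
      #two, but some records may not have valid years [e.g. "17th-century"]).
      #re.findall() returns an empty list (never None) when nothing matches, so
      #we test whether the list has anything in it.
      if re.findall(year_pattern, date_string) :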

        #If the born_flourished_cadeath pattern is found (or, I guess, isn't
        #not found), we can't use this record.
        if re.search(born_flourished_cadeath, date_string) is not None :
          print('--Cannot use this date field.')


        #If the died pattern is found...
        elif re.search(died, date_string) is not None :
          #Print the first (i.e. only) year
          print('--' + re.findall(year_pattern, date_string)[0])

        #If the ordeath pattern is found...
        elif re.search(ordeath, date_string) is not None :
          #Print the second (i.e., death) year
          print('--' + re.findall(year_pattern, date_string)[1])

        #In any other case, we should be dealing with a standard yyyy-yyyy pattern
        #So find all instances of the year_pattern and print the second one.
        else :
          print('--' + re.findall(year_pattern, date_string)[1])

      else :
        print('No valid dates')



Cowper, Ashley, (d. 1788.)
--1788
White, John, (ca. 1685-1755.)
--1755
De-La-Cour, James, (1709-ca. 1785.)
--Cannot use this date field.
Wright, J. (fl. 1720-1744.)
--Cannot use this date field.
Ogilby, William, (b. ca. 1713.)
--Cannot use this date field.
White, Stephen, (b. 1696 or 7.)
--Cannot use this date field.
Fauques, Marianne-Agnès Pillement, (ca. 1720-ca. 1777.)
--Cannot use this date field.
Buckingham, John Sheffield, (1648-1720 or 21.)
--1720
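
One detail about the two `re` functions used in this cell is easy to miss: when nothing matches, `re.search()` returns `None`, but `re.findall()` returns an empty list. The sketch below shows the difference using a made-up date string with no four-digit year in it.

```python
import re

year_pattern = re.compile(r'\d{4}')

# A made-up date string with no four-digit year in it
no_years = '17th century'

found = re.findall(year_pattern, no_years)  # an empty list
match = re.search(year_pattern, no_years)   # None

print(found)
print(match)

# An empty list is not None, so an "is not None" test can't tell us whether
# findall() found anything. Testing the list itself (empty lists count as
# False) or testing the result of search() both work.
findall_is_not_none = found is not None
findall_found_something = bool(found)
search_found_something = match is not None

print(findall_is_not_none)
print(findall_found_something)
print(search_found_something)
```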


#### Once more, with *functions*!
This kind of checking a string to see if it meets certain conditions is the *kind* of thing we might want to be able to do in other contexts, which makes it a candidate for what Carl called **local code reuse**. Perhaps we could use the same logic to search subject added entry fields to see which books were written *about* living vs. dead people in a given year, for instance. It would be a shame to have to type all of that code all over again just because we wanted to check a 600 field instead of a 100 field.

I'll write this as a function that accepts one argument (I'm expecting a string of text). Everything else in this cell should look familiar from the previous cell. What this cell does, though, is to take the procedures from the previous cell and give them a name that can be called, as we'll see in the next cell.

(**Note:** You need to run this cell for the `get_death_year` function to be available elsewhere in the notebook, but you won't see any output.)

In [None]:
#Code cell 21
def get_death_year(date_string) :
  year_pattern = re.compile(r'\d{4}')
  born_flourished_cadeath = re.compile(r'^[bfl]{1,2}\.?|\-\s?ca\.?')
  died = re.compile(r'^d\.?')
  ordeath = re.compile(r'\-.+\sor\s')

  if re.search(year_pattern, date_string) is None :
    #We don't just want to print results now, we want to process a string and
    #produce a result. If there are no valid years, we need a way to return
    #a result of None.
    result = None

  elif re.search(born_flourished_cadeath, date_string) is not None :
    #If we can't provide a *usable* year, we should also return a result of None
    result = None

  elif re.search(died, date_string) is not None :
    result = re.findall(year_pattern, date_string)[0]

  elif re.search(ordeath, date_string) is not None :
    result = re.findall(year_pattern, date_string)[1]

  else :
    result = re.findall(year_pattern, date_string)[1]

  return result

In this cell, I still have to read the MARC file and identify what field I want to check for a valid death year. But once I have that string, I pass it as an argument to the `get_death_year` function and receive a result.

In [None]:
#Code cell 22
with open(directory_path + '2023_d1_estc_bowyer_problem_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None:
      author_name = record['100']['a']
      date_string = record['100']['d']

      #Call the get_death_year function, passing date_string as the argument
      death_year = get_death_year(date_string)

      #Depending on the result that the get_death_year function returns, do
      #something
      if death_year is not None :
        print(author_name + ' died in ' + death_year)
      else :
        print('No death year for ' + author_name)

Cowper, Ashley, died in 1788
White, John, died in 1755
No death year for De-La-Cour, James,
No death year for Wright, J.
No death year for Ogilby, William,
No death year for White, Stephen,
No death year for Fauques, Marianne-Agnès Pillement,
Buckingham, John Sheffield, died in 1720


Now that `get_death_year` is available as a function, I can call it again when opening a different file without having to rewrite (or copy and paste) the code.

In [None]:
#Code cell 23
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    if record.get('100', None) is not None :
      #You know, let's rearrange this author name a little...
      author_name = record['100']['a'].rstrip(',')
      author_name_parts = author_name.split(', ')
      author_name = author_name_parts[1] + ' ' + author_name_parts[0]
      date_string = record['100']['d']
      death_year = get_death_year(date_string)
      if death_year is not None :
        print(author_name + ' died in ' + death_year)
      else :
        print('No death year for ' + author_name)

Arrigo Caterino Davila died in 1631
Richard Holland died in 1730
William Hasledine died in 1773
John Clarke died in 1734
Nathaniel Spinckes died in 1727
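
Because `get_death_year` only needs a string, we can also try it out without opening a MARC file at all, which is a handy way to check how it behaves. In the sketch below I've repeated the function from code cell 21 only so that the example can run on its own (in the notebook, you'd just call the function you've already defined). The test strings are the date fields from our "problem" sample.

```python
import re

# A copy of get_death_year from code cell 21, repeated here only so that this
# example is self-contained
def get_death_year(date_string) :
  year_pattern = re.compile(r'\d{4}')
  born_flourished_cadeath = re.compile(r'^[bfl]{1,2}\.?|\-\s?ca\.?')
  died = re.compile(r'^d\.?')
  ordeath = re.compile(r'\-.+\sor\s')

  if re.search(year_pattern, date_string) is None :
    result = None
  elif re.search(born_flourished_cadeath, date_string) is not None :
    result = None
  elif re.search(died, date_string) is not None :
    result = re.findall(year_pattern, date_string)[0]
  elif re.search(ordeath, date_string) is not None :
    result = re.findall(year_pattern, date_string)[1]
  else :
    result = re.findall(year_pattern, date_string)[1]
  return result

# The date fields from the problem sample, typed in as plain strings
test_strings = ['d. 1788.', 'ca. 1685-1755.', '1709-ca. 1785.',
                'fl. 1720-1744.', 'b. ca. 1713.', '1648-1720 or 21.']

# Store each result in a dictionary keyed by the original string
results = {}
for test_string in test_strings :
  results[test_string] = get_death_year(test_string)
  print(test_string + ' -> ' + str(results[test_string]))
```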


## Okay, we're ready to get going, but first a word about data types...
At this point, we're able to get publication years from the 008 field and we have a function for finding a valid death year, if there is one. There's just one thing that ought to give us pause: we've been treating all of these values simply as strings of characters, but while we recognize "1773" as a number, the computer doesn't necessarily know that we mean the integer 1,773, rather than just a sequence of the typographical characters "1", "7", "7", and "3".

In [None]:
#Code cell 24
with open(directory_path + '2023_d1_estc_bowyer_sample.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    pub_year = record['008'].data[7:11]
    print('Publication year: ' + pub_year)
    print(type(pub_year))

Publication year: 1758
<class 'str'>
Publication year: 1728
<class 'str'>
Publication year: 1733
<class 'str'>
Publication year: 1767
<class 'str'>
Publication year: 1710
<class 'str'>
Publication year: 1743
<class 'str'>


Up to now, this hasn't really been a problem for us because we've just been printing things to the screen. (Though you may have noticed me silently using `str()` to change integers into strings so that they could be printed, back in code cells #14 and #17.)

But if we want to perform a mathematical operation, like deciding whether one year is greater or less than another, we need to be working with actual numbers.
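
A small sketch of why this matters, using two of the years we've already seen as strings. With strings, `+` glues characters together and `>` compares character by character (more or less alphabetically); with integers, both behave the way we'd expect for numbers. The value `'999'` is made up just to show a case where the two kinds of comparison disagree.

```python
pub_year = '1758'
death_year = '1773'

# With strings, + joins the characters together
joined = pub_year + death_year
print(joined)

# With integers, we can do arithmetic
difference = int(death_year) - int(pub_year)
print(difference)

# Strings are compared one character at a time, so '999' counts as "greater"
# than '1773' because '9' comes after '1'
string_comparison = '999' > '1773'
print(string_comparison)

# Converting to integers gives the numerical answer
integer_comparison = int('999') > int('1773')
print(integer_comparison)
```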

In [None]:
#Code cell 25
with open(directory_path + '2023_d1_estc_bowyer_full.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    print(record['001'].data)
    #Get a substring from MARC field 008 and convert it to an integer
    pub_year = int(record['008'].data[7:11])
    if record.get('100', None) is not None :
      author_name = record['100']['a'].rstrip(',')
      #Just in case some records don't have a date for the author, at all,
      #as can happen with ancient authors as well as pseudonymous ones.
      if record.get('100', None).get('d', None) is not None :
        date_string = record['100']['d']
        death_year = get_death_year(date_string)
        if death_year is not None :
          #Convert the death_year to an integer
          death_year = int(death_year)
          if death_year > pub_year :
            print(author_name + ' died in ' + str(death_year) + ', and so was alive in ' + str(pub_year))
          elif death_year == pub_year :
            print(author_name + ' died in the same year the work was published. Need more information.')
          else :
            print(author_name + ' died in ' + str(death_year) + ', and so was already dead by ' + str(pub_year))
      else :
        print('No dates for ' + author_name)

Streaming output truncated to the last 5000 lines.
T32184
T32228
Girard, Jean-Baptiste died in 1733, and so was alive in 1731
T32229
Wynne, William died in 1765, and so was alive in 1723
T32261
Potter, John died in 1747, and so was alive in 1720
T32285
Powell, W. S. died in 1775, and so was alive in 1757
T32337
No dates for Perlin, Étienne.
T3248
Butts, Robert died in 1748, and so was alive in 1737
T32614
Mason, William died in 1797, and so was alive in 1752
T32616
Mason, William died in 1797, and so was alive in 1752
T32629
Mallet, David died in 1765, and so was alive in 1763
T3272
Monoux, Lewis died in 1771, and so was alive in 1751
T32728
Sharp, Thomas died in 1758, and so was alive in 1730
T32746
Whiston, William died in 1752, and so was alive in 1736
T32798
T32811
Keate, George died in 1797, and so was alive in 1762
T32887
Manningham, Thomas died in 1722, and so was alive in 1721
T33034
Wotton, William died in 1727, and so was already dead by 1734
T33053
Rock died in

## Construct a data structure to hold on to all this information
In Python, as in other languages, there are different kinds of data structures that have different properties that make them well-suited to different uses. So far, we've only dealt with lists. Let's explore a somewhat more involved data type for holding all the information we're getting about living vs. dead authors.

A dictionary stores information in key/value pairs. That is, for each value there's a corresponding label. Dictionaries can hold any kind of data, including not just strings or integers, but also lists or other dictionaries (as well as another data type—the tuple—that we may run into later). So dictionaries can end up being complex, nested data structures, if need be.

For our purposes, let's keep track of living vs. dead authors for each year. One way to organize our data might be:

```
{
  1710:
        {'total_works': <integer-value>,
         'no_author': <integer-value>,
         'living_authors': <integer-value>,
         'dead_authors': <integer-value>,
         'ambiguous': <integer-value>
        },
  1711:
        {'total_works': <integer-value>,
         'no_author': <integer-value>,
         'living_authors': <integer-value>,
         'dead_authors': <integer-value>,
         'ambiguous': <integer-value>
        },
...
  1778:
        {'total_works': <integer-value>,
         'no_author': <integer-value>,
         'living_authors': <integer-value>,
         'dead_authors': <integer-value>,
         'ambiguous': <integer-value>
        }
}
```


We'll need to iterate through our list of MARC records and update the number in each category for each year as we go. The way we work with dictionaries is a bit like how we work with lists, but instead of always having numeric indices, dictionaries have "keys" that can be strings, integers, or tuples. (For now, suffice it to say that tuples are [kind of like lists, but different](https://diveintopython3.net/native-datatypes.html#tuples).)

So to get the values for a given year, we would give the dictionary name followed by the key in square brackets (e.g. `author_count[1730]`). Because the value for each year is *another* dictionary, we would get those values by providing the key for the nested dictionary (e.g., `author_count[1730]['total_works']`—note that, while `1730` is an integer, `'total_works'` is a string, so `'total_works'` needs to be in quotation marks, while `1730` doesn't).
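
Here's a minimal sketch of that kind of nested lookup. The dictionary has the same shape as the one we're about to build, but it only has one year in it and the counts are made up for illustration.

```python
# A tiny dictionary in the shape described above (the counts are made up)
author_count = {
  1730: {'total_works': 3,
         'no_author': 1,
         'living_authors': 1,
         'dead_authors': 1,
         'ambiguous': 0}
}

# The value for the key 1730 is itself a dictionary
year_counts = author_count[1730]
print(year_counts)

# A second key in square brackets reaches into the nested dictionary
total = author_count[1730]['total_works']
print(total)

# The integer 1730 and the string '1730' are different keys
integer_key_present = 1730 in author_count
string_key_present = '1730' in author_count
print(integer_key_present)
print(string_key_present)
```

That last point is one more reason to convert our publication years to integers consistently: a year stored under an integer key won't be found if we go looking for it with a string.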

For each record we encounter, the first thing we'll get is the publication year. If it's the first record we've seen for that year, we'll create a new entry in the dictionary using `setdefault()`: this adds a new key/value pair with a default value, but only if the key isn't already in the dictionary (if it is, the existing value is left alone). Our default value will be a dictionary with all of the nested values set to 0.

Once there's a key for our `pub_year` in the dictionary (whether it was already there or we've just created it), we'll figure out which value in the nested dictionary we need to increase and then add one to that value.
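
The `setdefault()`-then-increment pattern can be sketched on its own with a made-up list of publication years standing in for our MARC records (and only one category, to keep things short).

```python
# Start with an empty dictionary
author_count = {}

# A made-up list of publication years standing in for our MARC records
pub_years = [1710, 1711, 1710]

for pub_year in pub_years :
  # If pub_year isn't a key yet, add it with a starting count of 0.
  # If it's already there, setdefault() leaves the existing value alone.
  author_count.setdefault(pub_year, {'total_works': 0})

  # Either way, the key exists now, so we can safely add one to the count
  author_count[pub_year]['total_works'] += 1

print(author_count)
```

Without the `setdefault()` line, the first attempt to add one to `author_count[1710]['total_works']` would fail with a `KeyError`, because there would be nothing stored under `1710` yet.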

In [None]:
#Code cell #26
#Create an empty dictionary to hold our information
author_count = {}

with open(directory_path + '2023_d1_estc_bowyer_full.mrc', 'rb') as marc_file :
  marc_reader = MARCReader(marc_file)
  for record in marc_reader :
    #Get a substring from MARC field 008 and convert it to an integer
    pub_year = int(record['008'].data[7:11])

    #If we don't already have an entry in the author_count dictionary for this
    #year, create one, with the year as a key. Give that key a value of a
    #dictionary starting with empty counts for all the categories
    author_count.setdefault(pub_year, {'total_works': 0,
                                       'no_author': 0,
                                       'living_authors': 0,
                                       'dead_authors': 0,
                                       'ambiguous': 0
                                       })
    #Immediately increase the number of total works—we don't know anything else
    #yet, but we do know that much.
    author_count[pub_year]['total_works'] += 1

    if record.get('100', None) is not None :
      author_name = record['100']['a'].rstrip(',')
      #Just in case some records don't have a date for the author, at all,
      #as can happen with ancient authors as well as pseudonymous ones.
      if record.get('100', None).get('d', None) is not None :
        date_string = record['100']['d']

        #Pass the date string to the get_death_year function
        death_year = get_death_year(date_string)

        #If we get back a result from get_death_year other than 'None'...
        if death_year is not None :

          #Turn that result into an integer
          death_year = int(death_year)

          #Compare the death year to the publication year, and increment
          #values in our dictionary accordingly
          if death_year > pub_year :
            author_count[pub_year]['living_authors'] += 1
          elif death_year == pub_year :
            author_count[pub_year]['ambiguous'] += 1
          else :
            author_count[pub_year]['dead_authors'] += 1

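      #If there's no |d subfield, count the record as ambiguous. (Note that a
      #record whose dates get_death_year can't use isn't added to any category
      #here, so the categories may not add up to total_works.)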
      else :
        author_count[pub_year]['ambiguous'] += 1
    #If there's no author at all
    else :
      author_count[pub_year]['no_author'] += 1

In [None]:
#Code cell #27
for k, v in author_count.items() :
  print(k)
  print(v)

1758
{'total_works': 46, 'no_author': 9, 'living_authors': 18, 'dead_authors': 8, 'ambiguous': 8}
1728
{'total_works': 50, 'no_author': 12, 'living_authors': 22, 'dead_authors': 6, 'ambiguous': 9}
1733
{'total_works': 66, 'no_author': 17, 'living_authors': 24, 'dead_authors': 15, 'ambiguous': 6}
1767
{'total_works': 43, 'no_author': 4, 'living_authors': 25, 'dead_authors': 9, 'ambiguous': 5}
1710
{'total_works': 45, 'no_author': 12, 'living_authors': 27, 'dead_authors': 3, 'ambiguous': 2}
1768
{'total_works': 37, 'no_author': 10, 'living_authors': 14, 'dead_authors': 10, 'ambiguous': 3}
1720
{'total_works': 104, 'no_author': 14, 'living_authors': 60, 'dead_authors': 11, 'ambiguous': 18}
1743
{'total_works': 31, 'no_author': 9, 'living_authors': 16, 'dead_authors': 1, 'ambiguous': 4}
1761
{'total_works': 51, 'no_author': 13, 'living_authors': 20, 'dead_authors': 15, 'ambiguous': 3}
1774
{'total_works': 47, 'no_author': 6, 'living_authors': 22, 'dead_authors': 14, 'ambiguous': 4}
1749
{'
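
It's worth doing a quick sanity check on numbers like these. The sketch below takes the first three years from the output above and checks whether the four categories add up to `total_works`. They don't quite. My reading of code cell 26 is that the difference comes from records that have a date field which `get_death_year` can't use: those get counted in `total_works` but aren't added to any of the other categories. (The variable name `uncategorized` is just something I've made up for this check.)

```python
# The first three years from the output above
author_count = {
  1758: {'total_works': 46, 'no_author': 9, 'living_authors': 18,
         'dead_authors': 8, 'ambiguous': 8},
  1728: {'total_works': 50, 'no_author': 12, 'living_authors': 22,
         'dead_authors': 6, 'ambiguous': 9},
  1733: {'total_works': 66, 'no_author': 17, 'living_authors': 24,
         'dead_authors': 15, 'ambiguous': 6}
}

# For each year, subtract the sum of the four categories from the total
uncategorized = {}
for year, counts in author_count.items() :
  categorized = (counts['no_author'] + counts['living_authors'] +
                 counts['dead_authors'] + counts['ambiguous'])
  uncategorized[year] = counts['total_works'] - categorized
  print(str(year) + ': ' + str(uncategorized[year]) + ' uncategorized')
```

If this were a real research question, we'd want to decide deliberately what to do with those records (counting them as ambiguous would be one option) rather than letting them drop out of the picture.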

## Do something with these data
Now that we have all our data in a dictionary, we'll want to do something with them so we can get a sense of the answer to our question. First we'll save the data to a spreadsheet, then we'll construct a (rudimentary) bar chart.

### Saving to a file
We'll use the `pandas` library to convert the data in our dictionary to a DataFrame and save it as an Excel spreadsheet.

In [None]:
#Code cell #28
#Import pandas and os library—in case we're trying to write a file to
#a folder that doesn't exist yet
import pandas as pd
import os

#Set a variable containing the path to the output directory. If this
#directory doesn't exist yet, create it (exist_ok=True means os.makedirs()
#won't complain if the directory is already there)
output_directory = '/gdrive/MyDrive/rbs_digital_approaches_2023/output/'
os.makedirs(output_directory, exist_ok=True)

df = pd.DataFrame.from_dict(author_count, orient='index')
df.sort_index(inplace=True)
df.to_excel(output_directory + 'living_vs_dead_authors.xlsx')
#Display the DataFrame (a notebook only shows a bare expression if it's the
#last line of the cell)
df

### Making a bar chart with plotly
It seems like we ought to at least have something to look at at the end of all this. Let's use the `plotly` package to draw a bar chart comparing the number of living and dead authors in each year.

In [None]:
#Code cell #29
#Import the module we need from the plotly package. (Plotly is installed by
#default in Google Colaboratory. In another environment, you might need to
#install it using pip)
import plotly.graph_objects as go

#Create empty lists for the years that we'll use as our x axis and for the counts
#that we'll chart on our y axis
years = []
living_counts = []
dead_counts = []

#Iterate through the keys and add values to our lists
for year, values in author_count.items() :
  years.append(year)
  living_counts.append(values['living_authors'])
  dead_counts.append(values['dead_authors'])

#There's a more compact syntax for creating those lists, but one that might be
#a little opaque if you're new to Python:
#years = [k for k, v in author_count.items()]
#living_counts = [v['living_authors'] for k, v in author_count.items()]
#dead_counts = [v['dead_authors'] for k, v in author_count.items()]

#Create the chart using the plotly graph objects module. The "data" is a list of
#two graph_objects "bars," each of which has a label ("name"), and each of which
#takes its values for the x and y axes from the lists we constructed above.
fig = go.Figure(data=[
    go.Bar(name='Living', x=years, y=living_counts),
    go.Bar(name='Dead', x=years, y=dead_counts)
])

# Change the bar mode to group the bars together
fig.update_layout(barmode='group')

#Display the chart
fig.show()
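
One thing to be aware of: a dictionary hands back its items in the order they were added, which for us is the order in which each year first turned up in the MARC file, not chronological order. If you wanted the three lists in chronological order, one possible approach (a sketch, using three years' worth of counts from the output of code cell 27) is to loop over the sorted keys. I've used the more compact list syntax mentioned in the comments above.

```python
# Three years from the output of code cell 27, in the order they appeared
author_count = {
  1758: {'living_authors': 18, 'dead_authors': 8},
  1728: {'living_authors': 22, 'dead_authors': 6},
  1710: {'living_authors': 27, 'dead_authors': 3}
}

# sorted() on a dictionary returns a sorted list of its keys
years = sorted(author_count)

# Build each list of counts by looking up the years in sorted order
living_counts = [author_count[year]['living_authors'] for year in years]
dead_counts = [author_count[year]['dead_authors'] for year in years]

print(years)
print(living_counts)
print(dead_counts)
```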

## And there we have it
This notebook has walked through a practical approach to a concrete question that we might want to ask about the books described in a set of MARC records. There are plenty of nuances we'd want to add to our approach to these data if we were really approaching this as a research question, but even this fairly simple example manages to touch on many of the fundamental coding tasks that you will run into again and again in this kind of work.

* We opened, read, and wrote files.
* We reused code—both by importing several different Python packages for different specialized tasks and by writing a function to make some of our own code more easily reusable.
* We worked in several different ways with different data types and structures (strings, integers, lists, and dictionaries).
* We iterated through various pieces of information, examining each in turn.
* We added conditions and control structures to deal with branching sets of options as we iterated through our information.

Learning to break questions down into these kinds of smaller steps that are suited to what computers do well is at the heart of digital work. Not all coding questions are simple, certainly, and neither is all code. But there's a lot of complexity that emerges from what are, at bottom, fairly simple operations.

