# Mapping MARC Records
In this notebook, we're going to use regularized place names to query a web API (in this case, the [GeoNames geographical database](https://www.geonames.org/)) and retrieve latitude and longitude coordinates so that we can put those places on a map.

In all of the screenshots, I'll be using a set of MARC records related to issues of John Bunyan's *The Pilgrim's Progress* prior to 1800 (which is available in the `data_class` folder). I chose those records, frankly, because I had them on hand and because I figured that there would be enough variety in the publication places to make it actually interesting to see on a map. If you have catalog records that you'd rather use, instead, please feel free. The *Pilgrim's Progress* records are MARC-21 records, but I've included brief asides about doing the same thing with MARCXML (which you could generalize to other kinds of XML) or with information in a .csv file.

Because Monday's Python hands-on notebook spent so much time using MARC records handled with `Pymarc`, this notebook won't include nearly as much discussion of that sort. I'll provide comments in the code to explain what's going on, and we can talk about the approach I took here (and alternatives that occur to you) in our discussion period.

We'll be using the Python `folium` package, which provides a wrapper for the popular [Leaflet JavaScript library](https://leafletjs.com/). This notebook is by no means a last word on this kind of mapping: 

* There are some limits to what we can do in a Colab notebook because of the way Colab does and does not work with some IPython extensions; 
* `folium` doesn't seem to expose everything that `Leaflet.js` can do; 
* Even if it did, my JavaScript knowledge is rusty enough that I'm not sure I could work out all the things we might want to do.

This workbook focuses on getting the geocoding information we'd need to map our catalog records. The details of what to do next would vary depending on the specific software you're using, so I haven't pushed to figure out all the possible niceties of Leaflet maps, for example. Leaflet is very widely used, however, and people have made it possible to integrate Leaflet maps with lots of different platforms: if you imagine yourself doing more work with maps, it would probably be worth your while to investigate Leaflet further.

## If you get stuck
All the code in this notebook is shaped by what I encountered in the particular records I was working with when I wrote it. You may well need to adapt things if you're using different data.

In the records I used in the section on MARCXML, for example, I had to build in a conditional to see if there was a publication city in the record; that never came up in the MARC-21 records for *The Pilgrim's Progress*. I also had to come up with some different ways of processing place names in those MARCXML records because most of the books were in languages other than English and handled place names differently.

If you run into problems with your data, try to think about what control structures you would need to add to your code to catch and deal with the problems you encounter in your data. But you should always feel free to check in on Slack or Zoom if you get stuck!

## Getting started
You'll use different packages depending on the format of your data, so I'll defer installing and importing packages until the relevant section. In the meantime, though, you'll need to connect to Google Drive and set a variable for the path to the directory that has your data.

In [None]:
from google.colab import drive
drive.mount('/gdrive')

### Create a variable for our source directory
If you want to use one of the sets of catalog records we've provided, you can find them in the `data_class` directory that you've cloned from GitHub. If you have other records you'd like to use, instead, add them to the `rbs_digital_approaches_2021/data_my/` directory in your Google Drive and use that path, instead: comment out line 1 and uncomment line 2.

In [None]:
source_directory = '/gdrive/MyDrive/rbs_digital_approaches_2021/s2_data_class/'
# source_directory = '/gdrive/MyDrive/rbs_digital_approaches_2021/data_my/'

## Getting the official country codes used in MARC records
As we saw yesterday, the 008 field holds some defined data about the resource described in the MARC record. Yesterday we took advantage of the publication year available at characters 7-10. Today, we'll use the place of publication information in characters 15-17. This is a two- or three-character code drawn from an official list that's linked from the [MARC documentation for field 008](https://www.loc.gov/marc/countries/).

I've used those codes as keys for a dictionary, with the corresponding country name as the value. We'll use this dictionary for decoding our country codes (`marc_country_codes['aa']` would return `'Albania'`, for example).
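To make those character positions concrete, here is a minimal sketch that slices a made-up 008 string. The sample value and the three-entry `sample_codes` dictionary are assumptions for illustration only; the real lookup uses the full `marc_country_codes` dictionary defined in the next cell.

```python
# A hypothetical 008 value, assembled piece by piece so the positions are easy
# to see (this is not taken from a real record)
sample_008 = (
    '850423'    # positions 0-5: date the record was entered
    + 's'       # position 6: type of date
    + '1678'    # positions 7-10: publication year
    + '    '    # positions 11-14: second date (blank here)
    + 'enk'     # positions 15-17: place of publication code
    + ' ' * 22  # remaining positions, left blank in this sketch
)

# A tiny stand-in for the full country-code dictionary
sample_codes = {'enk': 'England', 'fr': 'France', 'stk': 'Scotland'}

# Python slices exclude the end index, so [7:11] covers characters 7-10
pub_year = sample_008[7:11]
# ... and [15:18] covers characters 15-17; rstrip() trims the padding that
# follows a two-character code such as 'fr '
country_code = sample_008[15:18].rstrip()

print(pub_year, country_code, sample_codes[country_code])
```

The `rstrip()` call matters only for two-character codes, which arrive with a trailing space when you slice three characters.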

The text editor that I use (BBEdit) has very nice support for regular expressions in its find/replace dialogue. I created this dictionary by copying the text from the LOC's web site into a new text document in BBEdit, then doing a single find-and-replace. I looked for the following regular expression:

>`^([\-a-z]{2,4})\s+(.+)`

From left to right: 
* `^`: A string at the beginning of a line (this wasn't strictly necessary)
* The parentheses create a "capture group": I don't just want to match what's inside the parentheses, I want to be able to reuse the text that matches the pattern
* `[\-a-z]{2,4}`: Any combination of between two and four hyphens or lower-case letters. The closing parenthesis marks the end of my capture group.
* `\s+`: One or more whitespace characters
* `(.+)`: Another capture group, this time consisting of one or more characters of any kind.

I then replaced every instance of that regular expression using: 

>`'\1': '\2', `

That is: Capture group 1 inside single quotes, followed by a colon and a space; then capture group 2 inside single quotes, followed by a comma and a space. So, for example, 

>aa	Albania

became

>'aa': 'Albania',

Then I copied the resulting text and pasted it into my Colab notebook cell between curly braces, and there was my dictionary.
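If you'd rather stay in Python than switch to a text editor, the same find-and-replace can be sketched with the `re` module. This is an alternative to the BBEdit steps, not what I actually did; `sample_text` is a three-line stand-in for the text copied from the LOC page, and `sample_country_codes` is a name I've made up for the result.

```python
import re

# Three sample lines in the "code<TAB>name" layout described above
sample_text = "aa\tAlbania\nabc\tAlberta\n-ac\tAshmore and Cartier Islands"

# re.MULTILINE makes ^ match at the start of every line, not just the start
# of the whole string
code_line = re.compile(r'^([\-a-z]{2,4})\s+(.+)', re.MULTILINE)

# Option 1: reproduce the find-and-replace, yielding text to paste into a cell
as_text = code_line.sub(r"'\1': '\2', ", sample_text)
print(as_text)

# Option 2: skip the pasting and build the dictionary directly from the
# (code, name) pairs that findall() returns for the two capture groups
sample_country_codes = dict(code_line.findall(sample_text))
print(sample_country_codes['-ac'])
```

Option 2 also sidesteps a quoting problem: a name that contains an apostrophe would break the single-quoted text that Option 1 produces.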


In [None]:
#Information from https://www.loc.gov/marc/countries/ Turned into a Python
#dictionary with a regular expression replacement in BBEdit
marc_country_codes = {
    "aa": "Albania",
"abc": "Alberta",
"-ac": "Ashmore and Cartier Islands",
"aca": "Australian Capital Territory",
"ae": "Algeria",
"af": "Afghanistan",
"ag": "Argentina",
"-ai": "Anguilla",
"ai": "Armenia (Republic)",
"-air": "Armenian S.S.R.",
"aj": "Azerbaijan",
"-ajr": "Azerbaijan S.S.R.",
"aku": "Alaska",
"alu": "Alabama",
"am": "Anguilla",
"an": "Andorra",
"ao": "Angola",
"aq": "Antigua and Barbuda",
"aru": "Arkansas",
"as": "American Samoa",
"at": "Australia",
"au": "Austria",
"aw": "Aruba",
"ay": "Antarctica",
"azu": "Arizona",
"ba": "Bahrain",
"bb": "Barbados",
"bcc": "British Columbia",
"bd": "Burundi",
"be": "Belgium",
"bf": "Bahamas",
"bg": "Bangladesh",
"bh": "Belize",
"bi": "British Indian Ocean Territory",
"bl": "Brazil",
"bm": "Bermuda Islands",
"bn": "Bosnia and Herzegovina",
"bo": "Bolivia",
"bp": "Solomon Islands",
"br": "Burma",
"bs": "Botswana",
"bt": "Bhutan",
"bu": "Bulgaria",
"bv": "Bouvet Island",
"bw": "Belarus",
"-bwr": "Byelorussian S.S.R.",
"bx": "Brunei",
"ca": "Caribbean Netherlands",
"cau": "California",
"cb": "Cambodia",
"cc": "China",
"cd": "Chad",
"ce": "Sri Lanka",
"cf": "Congo (Brazzaville)",
"cg": "Congo (Democratic Republic)",
"ch": "China (Republic : 1949- )",
"ci": "Croatia",
"cj": "Cayman Islands",
"ck": "Colombia",
"cl": "Chile",
"cm": "Cameroon",
"-cn": "Canada",
"co": "Curaçao",
"cou": "Colorado",
"-cp": "Canton and Enderbury Islands",
"cq": "Comoros",
"cr": "Costa Rica",
"-cs": "Czechoslovakia",
"ctu": "Connecticut",
"cu": "Cuba",
"cv": "Cabo Verde",
"cw": "Cook Islands",
"cx": "Central African Republic",
"cy": "Cyprus",
"-cz": "Canal Zone",
"dcu": "District of Columbia",
"deu": "Delaware",
"dk": "Denmark",
"dm": "Benin",
"dq": "Dominica",
"dr": "Dominican Republic",
"ea": "Eritrea",
"ec": "Ecuador",
"eg": "Equatorial Guinea",
"em": "Timor-Leste",
"enk": "England",
"er": "Estonia",
"-err": "Estonia",
"es": "El Salvador",
"et": "Ethiopia",
"fa": "Faroe Islands",
"fg": "French Guiana",
"fi": "Finland",
"fj": "Fiji",
"fk": "Falkland Islands",
"flu": "Florida",
"fm": "Micronesia (Federated States)",
"fp": "French Polynesia",
"fr": "France",
"fs": "Terres australes et antarctiques françaises",
"ft": "Djibouti",
"gau": "Georgia",
"gb": "Kiribati",
"gd": "Grenada",
"-ge": "Germany (East)",
"gg": "Guernsey",
"gh": "Ghana",
"gi": "Gibraltar",
"gl": "Greenland",
"gm": "Gambia",
"-gn": "Gilbert and Ellice Islands",
"go": "Gabon",
"gp": "Guadeloupe",
"gr": "Greece",
"gs": "Georgia (Republic)",
"-gsr": "Georgian S.S.R.",
"gt": "Guatemala",
"gu": "Guam",
"gv": "Guinea",
"gw": "Germany",
"gy": "Guyana",
"gz": "Gaza Strip",
"hiu": "Hawaii",
"-hk": "Hong Kong",
"hm": "Heard and McDonald Islands",
"ho": "Honduras",
"ht": "Haiti",
"hu": "Hungary",
"iau": "Iowa",
"ic": "Iceland",
"idu": "Idaho",
"ie": "Ireland",
"ii": "India",
"ilu": "Illinois",
"im": "Isle of Man",
"inu": "Indiana",
"io": "Indonesia",
"iq": "Iraq",
"ir": "Iran",
"is": "Israel",
"it": "Italy",
"-iu": "Israel-Syria Demilitarized Zones",
"iv": "Côte d'Ivoire",
"-iw": "Israel-Jordan Demilitarized Zones",
"iy": "Iraq-Saudi Arabia Neutral Zone",
"ja": "Japan",
"je": "Jersey",
"ji": "Johnston Atoll",
"jm": "Jamaica",
"-jn": "Jan Mayen",
"jo": "Jordan",
"ke": "Kenya",
"kg": "Kyrgyzstan",
"-kgr": "Kirghiz S.S.R.",
"kn": "Korea (North)",
"ko": "Korea (South)",
"ksu": "Kansas",
"ku": "Kuwait",
"kv": "Kosovo",
"kyu": "Kentucky",
"kz": "Kazakhstan",
"-kzr": "Kazakh S.S.R.",
"lau": "Louisiana",
"lb": "Liberia",
"le": "Lebanon",
"lh": "Liechtenstein",
"li": "Lithuania",
"-lir": "Lithuania",
"-ln": "Central and Southern Line Islands",
"lo": "Lesotho",
"ls": "Laos",
"lu": "Luxembourg",
"lv": "Latvia",
"-lvr": "Latvia",
"ly": "Libya",
"mau": "Massachusetts",
"mbc": "Manitoba",
"mc": "Monaco",
"mdu": "Maryland",
"meu": "Maine",
"mf": "Mauritius",
"mg": "Madagascar",
"-mh": "Macao",
"miu": "Michigan",
"mj": "Montserrat",
"mk": "Oman",
"ml": "Mali",
"mm": "Malta",
"mnu": "Minnesota",
"mo": "Montenegro",
"mou": "Missouri",
"mp": "Mongolia",
"mq": "Martinique",
"mr": "Morocco",
"msu": "Mississippi",
"mtu": "Montana",
"mu": "Mauritania",
"mv": "Moldova",
"-mvr": "Moldavian S.S.R.",
"mw": "Malawi",
"mx": "Mexico",
"my": "Malaysia",
"mz": "Mozambique",
"-na": "Netherlands Antilles",
"nbu": "Nebraska",
"ncu": "North Carolina",
"ndu": "North Dakota",
"ne": "Netherlands",
"nfc": "Newfoundland and Labrador",
"ng": "Niger",
"nhu": "New Hampshire",
"nik": "Northern Ireland",
"nju": "New Jersey",
"nkc": "New Brunswick",
"nl": "New Caledonia",
"-nm": "Northern Mariana Islands",
"nmu": "New Mexico",
"nn": "Vanuatu",
"no": "Norway",
"np": "Nepal",
"nq": "Nicaragua",
"nr": "Nigeria",
"nsc": "Nova Scotia",
"ntc": "Northwest Territories",
"nu": "Nauru",
"nuc": "Nunavut",
"nvu": "Nevada",
"nw": "Northern Mariana Islands",
"nx": "Norfolk Island",
"nyu": "New York (State)",
"nz": "New Zealand",
"ohu": "Ohio",
"oku": "Oklahoma",
"onc": "Ontario",
"oru": "Oregon",
"ot": "Mayotte",
"pau": "Pennsylvania",
"pc": "Pitcairn Island",
"pe": "Peru",
"pf": "Paracel Islands",
"pg": "Guinea-Bissau",
"ph": "Philippines",
"pic": "Prince Edward Island",
"pk": "Pakistan",
"pl": "Poland",
"pn": "Panama",
"po": "Portugal",
"pp": "Papua New Guinea",
"pr": "Puerto Rico",
"-pt": "Portuguese Timor",
"pw": "Palau",
"py": "Paraguay",
"qa": "Qatar",
"qea": "Queensland",
"quc": "Québec (Province)",
"rb": "Serbia",
"re": "Réunion",
"rh": "Zimbabwe",
"riu": "Rhode Island",
"rm": "Romania",
"ru": "Russia (Federation)",
"-rur": "Russian S.F.S.R.",
"rw": "Rwanda",
"-ry": "Ryukyu Islands, Southern",
"sa": "South Africa",
"-sb": "Svalbard",
"sc": "Saint-Barthélemy",
"scu": "South Carolina",
"sd": "South Sudan",
"sdu": "South Dakota",
"se": "Seychelles",
"sf": "Sao Tome and Principe",
"sg": "Senegal",
"sh": "Spanish North Africa",
"si": "Singapore",
"sj": "Sudan",
"-sk": "Sikkim",
"sl": "Sierra Leone",
"sm": "San Marino",
"sn": "Sint Maarten",
"snc": "Saskatchewan",
"so": "Somalia",
"sp": "Spain",
"sq": "Eswatini",
"sr": "Surinam",
"ss": "Western Sahara",
"st": "Saint-Martin",
"stk": "Scotland",
"su": "Saudi Arabia",
"-sv": "Swan Islands",
"sw": "Sweden",
"sx": "Namibia",
"sy": "Syria",
"sz": "Switzerland",
"ta": "Tajikistan",
"-tar": "Tajik S.S.R.",
"tc": "Turks and Caicos Islands",
"tg": "Togo",
"th": "Thailand",
"ti": "Tunisia",
"tk": "Turkmenistan",
"-tkr": "Turkmen S.S.R.",
"tl": "Tokelau",
"tma": "Tasmania",
"tnu": "Tennessee",
"to": "Tonga",
"tr": "Trinidad and Tobago",
"ts": "United Arab Emirates",
"-tt": "Trust Territory of the Pacific Islands",
"tu": "Turkey",
"tv": "Tuvalu",
"txu": "Texas",
"tz": "Tanzania",
"ua": "Egypt",
"uc": "United States Misc. Caribbean Islands",
"ug": "Uganda",
"-ui": "United Kingdom Misc. Islands",
"-uik": "United Kingdom Misc. Islands",
"-uk": "United Kingdom",
"un": "Ukraine",
"-unr": "Ukraine",
"up": "United States Misc. Pacific Islands",
"-ur": "Soviet Union",
"-us": "United States",
"utu": "Utah",
"uv": "Burkina Faso",
"uy": "Uruguay",
"uz": "Uzbekistan",
"-uzr": "Uzbek S.S.R.",
"vau": "Virginia",
"vb": "British Virgin Islands",
"vc": "Vatican City",
"ve": "Venezuela",
"vi": "Virgin Islands of the United States",
"vm": "Vietnam",
"-vn": "Vietnam, North",
"vp": "Various places",
"vra": "Victoria",
"-vs": "Vietnam, South",
"vtu": "Vermont",
"wau": "Washington (State)",
"-wb": "West Berlin",
"wea": "Western Australia",
"wf": "Wallis and Futuna",
"wiu": "Wisconsin",
"wj": "West Bank of the Jordan River",
"wk": "Wake Island",
"wlk": "Wales",
"ws": "Samoa",
"wvu": "West Virginia",
"wyu": "Wyoming",
"xa": "Christmas Island (Indian Ocean)",
"xb": "Cocos (Keeling) Islands",
"xc": "Maldives",
"xd": "Saint Kitts-Nevis",
"xe": "Marshall Islands",
"xf": "Midway Islands",
"xga": "Coral Sea Islands Territory",
"xh": "Niue",
"-xi": "Saint Kitts-Nevis-Anguilla",
"xj": "Saint Helena",
"xk": "Saint Lucia",
"xl": "Saint Pierre and Miquelon",
"xm": "Saint Vincent and the Grenadines",
"xn": "North Macedonia",
"xna": "New South Wales",
"xo": "Slovakia",
"xoa": "Northern Territory",
"xp": "Spratly Island",
"xr": "Czech Republic",
"xra": "South Australia",
"xs": "South Georgia and the South Sandwich Islands",
"xv": "Slovenia",
"xx": "No place, unknown, or undetermined",
"xxc": "Canada",
"xxk": "United Kingdom",
"-xxr": "Soviet Union",
"xxu": "United States",
"ye": "Yemen",
"ykc": "Yukon Territory",
"-ys": "Yemen (People's Democratic Republic)",
"-yu": "Serbia and Montenegro",
"za": "Zambia"
}

## Creating Python data structures for the information from our records
Whatever format your records are in, we'll need structures for holding the information we extract from the records so that we can work with it later.

I'm creating a dictionary to hold some basic information about each of my ESTC records, as well as a list for keeping track of distinct places. (This information doesn't need labels, in particular, so I'm using the simpler list structure.)
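Here is a small sketch of the shape these two structures end up with once they're filled in. The record identifiers and places are invented, and I've used different variable names (`example_records`, `example_places`) so that nothing here collides with the real `bib_records` and `distinct_places`.

```python
# Hypothetical identifiers and places, only to show the shape of the structures
example_records = {}
example_places = []

for record_id, place in [('T0001', ('London', 'England')),
                         ('T0002', ('Boston', 'United States')),
                         ('T0003', ('London', 'England'))]:
    # setdefault() creates the nested dictionary only if the key is missing
    example_records.setdefault(record_id, {})
    example_records[record_id]['place'] = place
    # Keep each (city, country) pairing only once
    if place not in example_places:
        example_places.append(place)

print(example_records)
print(example_places)
```

Three records go in, but only two distinct places come out, because London appears twice.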

In [None]:
bib_records = {}
distinct_places = []

### Example 1: MARC-21 records
For records in MARC-21 format, I'll use `Pymarc`'s `MARCReader` class to read and parse the MARC records, just like we did yesterday. This will probably look pretty familiar, so I'll just provide some comments in the code itself.
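The string cleanup that the cell below applies to field 260|a can be tried on its own, without any MARC file. This sketch wraps the same three steps in a helper function; the name `clean_pub_city` and the sample inputs are my own inventions for illustration.

```python
import re

# The same two patterns used in the cell below
field_punctuation = re.compile(r'[\s\:]+$')
other_punctuation = re.compile(r'[\[\]\?\.,]')

def clean_pub_city(raw_260a):
    """Apply the notebook's cleanup steps to a single 260|a string."""
    # Drop trailing whitespace and colons added by cataloging rules
    city = field_punctuation.sub('', raw_260a)
    # Keep only the correction that follows "i.e. ", if there is one
    if 'i.e.' in city:
        city = city[city.find('i.e.') + 5:]
    # Remove brackets, question marks, periods, and commas
    return other_punctuation.sub('', city)

# Made-up inputs in the style of imprint statements
for raw in ['London :', '[Edinburgh?] :', 'Boston [i.e. London] :']:
    print(repr(raw), '->', repr(clean_pub_city(raw)))
```

Running a handful of your own 260|a values through a function like this is a quick way to spot the cases the patterns don't cover before you process the whole file.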

In [None]:
#We'll need Pymarc for reading MARC-21 records
!pip install pymarc

In [None]:
# Import the necessary components from the Pymarc library.
from pymarc import MARCReader
import re

#Define a couple of regular expressions for stripping away punctuation that's
#included in the MARC fields
field_punctuation = re.compile(r'[\s\:]+$')
other_punctuation = re.compile(r'[\[\]\?\.,]')
with open(source_directory + '2021_s2_d2_estc_pilgrims_progress.mrc', 'rb') as infile :
    reader = MARCReader(infile) 
    for record in reader :
      #Get the ESTC number from the 001 field
      estc_num = record['001'].data
      
      #If there is no key for that ESTC number in the records dictionary, create
      #one with an empty nested dictionary as the value
      bib_records.setdefault(estc_num, {})
      
      #Get the publication city from MARC field 260|a
      pub_city = record['260']['a']
      #Add it without modification to the nested dictionary for this record
      bib_records[estc_num]['original_260a'] = pub_city
      
      #Get rid of punctuation that's included in accord with cataloging rules
      #using the regular expression defined at line 7, above
      pub_city = re.sub(field_punctuation, '', pub_city)
      
      #If the publication city has "i.e." in it
      if pub_city.find('i.e.') != -1 :
        #Only keep the string starting 5 characters ahead of the i in "i.e."
        pub_city = pub_city[pub_city.find('i.e.')+5:]
      
      #Remove any other punctuation (like square brackets) from the publication 
      #city, using the regular expression defined at line 8, above
      pub_city = re.sub(other_punctuation, '', pub_city)
      
      #Get the country code from MARC field 008, stripping any white space from 
      #the right: some country codes are three characters long, others are only
      #two, and would bring white space with them
      country_code = record['008'].data[15:18].rstrip()
      
      #Get the value of the dictionary item whose key matches the country code
      country = marc_country_codes[country_code]
      
      #Add a tuple—an immutable list—consisting of the pub_city and full country
      #name to the nested dictionary for this record
      bib_records[estc_num]['place'] = (pub_city, country)

      #If the tuple for the pairing of pub_city and country is not in our list 
      #of distinct places, add it to that list
      if (pub_city, country) not in distinct_places :
        distinct_places.append((pub_city, country))

#Sort the list of distinct_places alphabetically
distinct_places.sort()    

### Example 2: MARCXML Records
I've downloaded a set of records related to Bartolomé de las Casas (author of the *Brevísima relación de la destrucción de las Indias*) from the [Catálogo Colectivo del Patrimonio Bibliográfico](http://catalogos.mecd.es/CCPB/cgi-ccpb/abnetopac/) in MARCXML format. Because MARCXML is just an XML implementation of MARC, all the MARC field codes we've used will still apply. 

`Pymarc` can handle MARCXML records, but in this example I'll use the `BeautifulSoup` package along with `lxml` to show some approaches that are generally applicable to any XML data. (If you have records in a different flavor of XML, like MODS, you'll need to adapt the steps below to match your source.)
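If you can't install `BeautifulSoup`, the same lookups can be sketched with the standard library's `xml.etree.ElementTree`. This swaps in a different parser from the one the cells below use. The one-record sample, its identifier, and its 008 value are all made up, and I'm assuming the records are in the standard MARCXML namespace (`http://www.loc.gov/MARC21/slim`), which your file may or may not declare.

```python
import xml.etree.ElementTree as ET

# A made-up 008 value with 'sp' (Spain) at positions 15-17
sample_008 = '850423' + 's' + '1552' + ' ' * 4 + 'sp' + ' ' * 23

# A hypothetical one-record MARCXML file; the identifier is invented
sample_marcxml = f"""<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">EXAMPLE0001</controlfield>
    <controlfield tag="008">{sample_008}</controlfield>
    <datafield tag="260" ind1=" " ind2=" ">
      <subfield code="a">[Sevilla] :</subfield>
    </datafield>
  </record>
</collection>"""

# ElementTree needs the namespace spelled out in every path
ns = {'marc': 'http://www.loc.gov/MARC21/slim'}
root = ET.fromstring(sample_marcxml)

for record in root.findall('marc:record', ns):
    field_001 = record.find("marc:controlfield[@tag='001']", ns).text
    field_008 = record.find("marc:controlfield[@tag='008']", ns).text
    # find() returns None when the subfield is absent, so test before using it
    subfield_a = record.find(
        "marc:datafield[@tag='260']/marc:subfield[@code='a']", ns)
    pub_city = subfield_a.text if subfield_a is not None else ''
    country_code = field_008[15:18].rstrip()
    print(field_001, repr(pub_city), country_code)
```

The structure is the same whichever parser you use: find the record, find the field by its `tag` attribute, find the subfield by its `code` attribute, and check for `None` before reading the text.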

In [None]:
#Import packages for working with XML, plus re for the regular expressions
#used in the next cell
import re
from bs4 import BeautifulSoup
import lxml

In [None]:
# Example permalink: http://catalogos.mecd.es/CCPB/cgi-ccpb/abnetopac?ACC=DOSEARCH&xsqf99=CCPB000413410-9

#Define a couple of regular expressions for stripping away punctuation that's
#included in the MARC fields
field_punctuation = re.compile(r'[\s\:]+$')
other_punctuation = re.compile(r'[\[\]\?\.,]')

with open(source_directory + '2021_s2_d2_CCPB_las_casas.xml', 'rb') as xml_file :
  xml_data = xml_file.read()
  soup = BeautifulSoup(xml_data, 'xml')
  records = soup.find_all('record')
  for record in records :
    field_001 = record.find('controlfield', tag = '001').get_text()
    bib_records.setdefault(field_001, {})
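    #Start with an empty city so that a record without a 260|a can't reuse the
    #city left over from the previous record
    pub_city = ''
    field_260 = record.find('datafield', tag='260')
    subfield_a = field_260.find('subfield', code='a') if field_260 is not None else None
    if subfield_a is not None :
      pub_city = subfield_a.get_text()
      pub_city = re.sub(field_punctuation, '', pub_city)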
      # print(pub_city)
      
      #I don't typically work with records on resources that weren't published
      #in English-speaking countries, so I am *not at all* confident that these 
      #next steps would hold up in all cases, but it seems to work for the records 
      #I have.
      
      #If a cataloger placed a publication city in square brackets as the actual
      #place of publication, we should believe them.
      #I'm constructing a regular expression on the fly here searching for any
      #characters inside square brackets
      if re.search(r'\[(.+)\]', pub_city) is not None :
        #Taking advantage of the regular expression capture groups() function:
        #group(0) is the complete matched regular expression (e.g., [Barcelona]), 
        #then there are as many groups as there are captured patterns. I only 
        #have one capture group, so group(1) seems to be working: it turns
        #"[Barcelona]" into "Barcelona"
        pub_city = re.search(r'\[(.+)\]', pub_city).group(1)

      #Get rid of everything but the last word, in hopes of excluding prepositions
      #("In", "A", "Tot", etc.) and language about printing ("Impresso en", 
      #"Fue impressa ...", etc.). This is potentially reckless on my part.
      pub_city = pub_city[pub_city.rfind(' ')+1:]
      # print(pub_city)

    #Get the country code from MARC field 008, stripping any white space from 
    #the right: some country codes are three characters long, others are only
    #two, and would bring white space with them
    field_008 = record.find('controlfield', tag='008').get_text()
    country_code = field_008[15:18].rstrip()

    country = marc_country_codes[country_code]

    bib_records[field_001]['place'] = (pub_city, country)

    if (pub_city, country) not in distinct_places :
        distinct_places.append((pub_city, country))

distinct_places.sort()

### Example 3: Excel spreadsheet or .csv file
Some library catalogs allow export to Microsoft Excel or .csv, or you may very well have bibliographical records translated into one of those formats from some other source. 

I've provided a selection in both .csv and .xlsx format of the first 148 records with a subject heading including "Abolition" published between 1740 and 1860 that are held by Harvard's Houghton Library.

There's less we can do with the Excel and .csv files that Harvard's catalog is giving us here. There's a fair amount of information in there, but not the kind of granular cataloging data we've seen in MARC and MARCXML files: the publication city, imprint statement, and imprint year are simply concatenated into a single column, for instance, and we don't have any structured data field like MARC 008 to consult for things like a definitive statement of the publication country. We'll get the information we can: just the publication city.

I generally work with data in .csv format using the built-in `csv` library for Python, but the `pandas` package can handle both .csv and Microsoft Excel files, so let's use that instead.
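Before running the cell below on a whole spreadsheet, it can help to see what its `city_separator` pattern does with a few strings. This sketch uses invented "Published" values and a helper name (`city_from_published`) of my own; it doesn't need `pandas` at all.

```python
import re

# The pattern used in the cell below: a run of word characters, spaces,
# hyphens, or opening brackets at the start of the string, ended by a period,
# colon, or comma
city_separator = re.compile(r'^([\[\w\s\-]+?)[\.\:,]')

def city_from_published(published):
    """Return the leading city from a 'Published' string, or '' if none is found."""
    match = city_separator.search(published)
    if match is None:
        return ''
    return match.group(1).strip().lstrip('[')

# Invented values in the style of a concatenated imprint column
samples = ['Boston : Printed for the author, 1836.',
           '[London : s.n., 1790]',
           'New-York: Printed for the society, 1816.',
           '[London] : Printed for the author, 1788.']
for sample in samples:
    print(repr(city_from_published(sample)))
```

The last sample comes back empty: the closing bracket arrives before the colon, and `]` isn't in the character class, so the pattern never matches. Rows like that would need a broader pattern or a fix by hand.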

In [None]:
# Example permalink: http://id.lib.harvard.edu/alma/990134930760203941/catalog
import pandas as pd
import re
data_file = '2021_s2_d2_houghton_abolition_1740_1860_selection.xlsx'
# data_file = '2021_s1_d2_houghton_abolition_1740_1860_selection.csv'
with open(source_directory + data_file, 'rb') as tabular_file :
  df_harvard_records = pd.read_excel(tabular_file)
  # df_harvard_records = pd.read_csv(tabular_file)
  
  #This seems to be how we iterate through rows in pandas...
  for index in df_harvard_records.index :
    #Get Harvard's catalog id to identify this record and construct a link
    #later
    record_id = df_harvard_records.loc[index]['HOLLIS number']
    bib_records.setdefault(record_id, {})
    
    #Get the "Published" column, which concatenates lots of information I wish
    #were separate
    publication_info = df_harvard_records.loc[index]['Published']

    #Get the part of the "Published" column up to the colon (minus 1). It's just
    #a city, but that's all we're getting
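    city_separator = re.compile(r'^([\[\w\s\-]+?)[\.\:,]')
    #Start from an empty string so a row that doesn't match the pattern can't
    #reuse the city left over from the previous row
    pub_city = ''
    if re.search(city_separator, publication_info) is not None :
      pub_city = re.findall(city_separator, publication_info)[0].strip().lstrip('[')

    #If we don't have this city yet, add it to our list of distinct_places as a
    #(city, '') tuple so that it matches the 'place' value stored below
    if (pub_city, '') not in distinct_places :
      distinct_places.append((pub_city, ''))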
    
    #Update the nested dictionary for this record with a tuple consisting of
    #the pub_city and an empty string. We'll have to sort out countries in the 
    #next step
    bib_records[record_id]['place'] = (pub_city, '')

## Let's see what we have
Whichever example you followed above, you should now have a dictionary that uses some kind of record identifier for keys; the value for each key is a nested dictionary with a `place` key and a tuple (consisting of the publication city and country, if you were able to get it) as the value.

In [None]:
for k, v in bib_records.items() :
  print(k, v)

## Regularizing our place names
For the purposes of this exercise, I went with a pretty low-tech approach to regularizing the place names: I printed out all of the distinct place names in the `distinct_places` list, then copied and pasted them into a new plain text document in my text editor (BBEdit, in my case). 

Let's print those out so you can see how many places you're dealing with.

In [None]:
#Print the list of distinct place names. Then copy them into a text editor
#and change them to a regularized form. You may have duplicate lines
print(len(distinct_places))
for distinct_place in distinct_places :
  print(distinct_place)

Copy the output of the cell above and paste the distinct place names in your text editor. Then simply go through line by line and change the place name to a form that would have a good chance of returning a result from the GeoNames server. (In the case of ESTC records, for example, I had to change the Welsh name for Shrewsbury, England to "Shrewsbury GB".) 

Note that you may end up with duplicate lines (in the examples below, `('Boston', 'United States')`, `('Boston NE', 'United States')`, and `('Boston in New England', 'United States')` all get changed to `('Boston MA', 'United States')`, for instance).

Also note that you shouldn't delete any lines: you want to have a replacement for every place name in your list.

When you've come up with regularized forms for all your place names, paste them into the cell below, replacing the ESTC values in the `regularized_places` list.

Note the structure you want to end up with: `regularized_places` is a list that contains a series of tuples. 

* The `regularized_places` list itself should open and close with square brackets.
* Each regularized place name in the `regularized_places` list is a tuple. For each tuple:
    - The tuple should be enclosed in parentheses
    - The name of each city should be in quotation marks
    - The name of each country should be in quotation marks
    - Each tuple should be separated by a comma from the one that follows


For the remainder of this walkthrough, I'm going to stick with just the ESTC records for *The Pilgrim's Progress* that I used in Example 1. If you're using different data, you just need to supply your own values for the `regularized_places` list, below, keeping in mind the formatting that's described here.

In [None]:
regularized_places = [('Shrewsbury', 'England'),
('Bath', 'England'),
('Birmingham', 'England'),
('Boston MA', 'United States'),
('Boston MA', 'United States'),
('Boston MA', 'United States'),
('Bristol', 'England'),
('Caerfyrddin', 'Wales'),
('Chester', 'England'),
('Coventry', 'England'),
('Dublin', 'Ireland'),
('Edinburgh', 'Scotland'),
('Edinburgh', 'Scotland'),
('Ephrata PA', 'United States'),
('Gainsborough', 'England'),
('Gainsborough', 'England'),
('Germantown PA', 'United States'),
('Germantown PA', 'United States'),
('Glasgow', 'Scotland'),
('Liverpool', 'England'),
('London', 'England'),
('Manchester', 'England'),
('New York', 'United States'),
('Newcastle upon Tyne', 'England'),
('Newcastle upon Tyne', 'England'),
('Nottingham', 'England'),
('Paisley', 'Scotland'),
('Philadelphia PA', 'United States'),
('Preston', 'England'),
('Shrewsbury', 'England'),
('Vepery', 'India'),
('Wolverhampton', 'England'),
('Worcester MA', 'United States'),
('Worcester MA', 'United States'),
('York', 'England')]

### Pairing up distinct place names from our records with their regularized forms
This next cell uses `zip` to create a pairwise relationship between the `distinct_places` list from our records and the `regularized_places` list that we created "by hand."

Our `regularization` variable creates a correspondence between each item in the `distinct_places` list and the item at the corresponding index in the `regularized_places` list (that's why it's important that there be one entry in `regularized_places` for every entry in `distinct_places`). 

The `for` loop here iterates through those pairings: `orig` represents the item in `distinct_places` and `reg` represents the corresponding item in `regularized_places`.

As we iterate through those pairings, we *also* iterate through the entries in our `bib_records` dictionary (`record_id` is the key in our `bib_records` dictionary and `nested_dict` is the value for that key). If the value of `place` in the nested dictionary for a given `record_id` matches `orig` (the non-regularized version of the place name), we want to change that value to `reg` (the regularized form), instead.

In the *Pilgrim's Progress* records, for example, on the first iteration through `regularization`, `orig` would equal `('Argraphwŷd yn y Mwŷthig', 'England')` (which is `distinct_places[0]`) and `reg` would equal `('Shrewsbury', 'England')` (which is `regularized_places[0]`).

As we iterate through the `bib_records` dictionary, it's not until we reach `bib_records['R37515']` that we'll find that `bib_records['R37515']['place']` is equal to `orig`. But when we *do* find that match, we'll change the value of
`bib_records['R37515']['place']` from `('Argraphwŷd yn y Mwŷthig', 'England')` to `('Shrewsbury', 'England')`. 
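The same pairing can also be turned into a lookup dictionary, which lets you update the records in a single pass instead of a loop inside a loop. This is an alternative sketch rather than the approach the next cell takes, and the shortened lists and record identifiers are invented.

```python
# Shortened, hypothetical versions of the two lists
distinct_sample = [('Boston', 'United States'),
                   ('Boston NE', 'United States'),
                   ('London', 'England')]
regularized_sample = [('Boston MA', 'United States'),
                      ('Boston MA', 'United States'),
                      ('London', 'England')]

# zip() stops silently at the end of the shorter list, so check the lengths first
assert len(distinct_sample) == len(regularized_sample)

# Pair the lists by position and keep the pairs in a dictionary for lookups
lookup = dict(zip(distinct_sample, regularized_sample))

sample_records = {'T0001': {'place': ('Boston NE', 'United States')},
                  'T0002': {'place': ('London', 'England')}}

# One pass over the records: swap each place for its regularized form,
# leaving it unchanged if it has no entry in the lookup
for nested_dict in sample_records.values():
    nested_dict['place'] = lookup.get(nested_dict['place'], nested_dict['place'])

print(sample_records)
```

The length check is worth keeping whichever approach you use: if you accidentally deleted a line while regularizing, `zip` won't complain, and every pairing after the missing line will be off by one.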

In [None]:
regularization = zip(distinct_places, regularized_places)
for orig, reg in regularization :
  for record_id, nested_dict in bib_records.items() :
    if nested_dict['place'] == orig :
      nested_dict['place'] = reg


We can see our place names have now been updated to the version from `regularized_places`.

In [None]:
for k, v in bib_records.items() :
  print(k, v)


### Getting ready to search for latitude/longitude coordinates
When we were regularizing our place names, we needed one regularized place name for every distinct place name. But, as we saw in the example of the *Pilgrim's Progress* records, that left us with three different instances of `('Boston MA', 'United States')`. We really only need to search the GeoNames server for Boston once. It's not like it's going anywhere.

To get the unique values of our `regularized_places` list, we can use `set()`. But because we actually do want our `search_places` to end up as a list, let's turn that set into a list, in turn.
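Here is that deduplication on a four-item sample list (the values are just for illustration). It works because tuples can be members of a set; a list of lists couldn't be deduplicated this way.

```python
regularized_sample = [('Boston MA', 'United States'),
                      ('Boston MA', 'United States'),
                      ('York', 'England'),
                      ('Bath', 'England')]

# set() keeps one copy of each tuple but does not preserve order ...
unique_places = set(regularized_sample)
# ... so sort when converting back to a list to get a predictable sequence
search_sample = sorted(unique_places)

print(len(regularized_sample), len(search_sample))
print(search_sample)
```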

In [None]:
search_places = list(set(regularized_places))
for search_place in sorted(search_places) :
  print(search_place)

## Actually searching for coordinates
Now that we have a list of unique place names that we want to search, we can finally construct URLs to send to the GeoNames API.

We'll need the `re` (Regular Expressions) package to replace the spaces in our place names with the appropriate characters for URL encoding. (**Note:** There's probably a function in the `requests` module to handle this for us automatically, but it didn't occur to me to check...)
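For what it's worth, Python's standard library can do the URL encoding for us. This sketch uses `urllib.parse` to build the same kind of query URL without any regular expressions; the username is a placeholder, not a real account, and nothing is actually sent to the server.

```python
from urllib.parse import quote, urlencode

search_place = ('Newcastle upon Tyne', 'England')

# quote() percent-encodes spaces and any non-ASCII characters
print(quote(search_place[0]))

# urlencode() builds the whole query string; quote_via=quote gives %20 for
# spaces rather than the default '+'
params = {'q': ' '.join(search_place),
          'maxRows': 1,
          'type': 'json',
          'username': 'YOUR_USERNAME'}  # placeholder, not a real account
query_url = 'http://api.geonames.org/search?' + urlencode(params, quote_via=quote)
print(query_url)
```

`requests.get()` also accepts a `params` dictionary and encodes it for you, so passing `params` along with the base URL would be another way to get the same result.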

We'll use the `requests` package for Python to handle sending and receiving our queries and the built-in `json` module for parsing the results. We'll also use the built-in `time` module to space out our requests by a couple of seconds, just to be polite.
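One thing to watch for: the cell below assumes that every search returns at least one result. If GeoNames finds nothing for a place name, reading the first item of the `geonames` list will raise an `IndexError`. Here is a sketch of a more cautious way to read the response, using hand-written stand-ins for the parsed JSON rather than real server output; the function name is my own.

```python
import json

def coordinates_from_response(response):
    """Return [lat, lng] as floats from a parsed GeoNames search response,
    or None when the response holds no results."""
    results = response.get('geonames', [])
    if not results:
        return None
    # The coordinates arrive as strings, so convert them to floats
    return [float(results[0]['lat']), float(results[0]['lng'])]

# Hand-written stand-ins for server responses; the values are illustrative only
found = json.loads('{"geonames": [{"name": "Example Town", "lat": "52.5", "lng": "-2.75"}]}')
not_found = json.loads('{"totalResultsCount": 0, "geonames": []}')

print(coordinates_from_response(found))
print(coordinates_from_response(not_found))
```

A `None` result is a signal to go back and try a different regularized form for that place name.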

In [None]:
import requests
import re
import json
import time

#Create a dictionary to store the coordinates we'll be getting back
geocoded_places = {}

#Iterate through our list of unique regularized place names
for search_place in sorted(search_places) :
  print(search_place)
  
  #Build up our query URL, starting with this base
  query_url = 'http://api.geonames.org/search?q='
  #Then adding the first item of our place name tuple (the city), replacing
  #any and all spaces with '%20', then putting another %20 ...
  query_url += re.sub(' ', '%20', search_place[0]) + '%20'
  #Adding the second part of our place name tuple, replacing spaces with '%20'
  query_url += re.sub(' ', '%20', search_place[1])
  #Finally adding the end of our URL: be sure to replace <username> with
  # the class username
  query_url += '&maxRows=1&type=json&username=RBSDigitalApproaches'
  
  #Get the resulting URL using the requests module
  r = requests.get(query_url)
  #Parse the response from the GeoNames server as json
  response = r.json()
  
  #Create an entry in our geocoded_places with search_place as the key and
  #a nested dictionary as the value. The nested dictionary contains 'coordinates'
  #(a list of the latitude and longitude that we get from the GeoNames json,
  #stored as floats, rather than strings), and a 'record_count' field set to 0
  geocoded_places.setdefault(search_place, 
                             {'coordinates': \
                              [float(response['geonames'][0]['lat']), 
                               float(response['geonames'][0]['lng'])],
                              'record_count': 0})
  #Wait two seconds before sending another request
  time.sleep(2)
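
On the question of encoding the URL automatically: `requests.get()` accepts a `params` dictionary and encodes it for us, and the standard library's `urllib.parse` can do the same job without sending anything over the network. Here is a minimal sketch of the latter. The function name `build_query_url` and the username `demo_user` are placeholders made up for this example.

```python
from urllib.parse import quote, urlencode

def build_query_url(search_place, username):
    """Build a GeoNames search URL for a (city, country) tuple."""
    params = {
        # City and country joined by a space, as in the cell above
        'q': search_place[0] + ' ' + search_place[1],
        'maxRows': 1,
        'type': 'json',
        'username': username,
    }
    # quote_via=quote encodes spaces as %20; the default would use '+' instead
    return 'http://api.geonames.org/search?' + urlencode(params, quote_via=quote)

# 'demo_user' is a placeholder, not a real GeoNames account
print(build_query_url(('Boston MA', 'United States'), 'demo_user'))
```

Unlike replacing only the spaces, this also encodes accented letters and punctuation, which matters for place names like the Welsh imprint we regularized earlier.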


Let's see what we've gotten back from GeoNames.

In [None]:
for k, v in geocoded_places.items() :
  print(k, v)
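
The search cell above assumes that GeoNames always returns at least one result. If a place name doesn't match anything, `response['geonames'][0]` will raise an `IndexError`. Here is a hedged sketch of one way to guard against that. The helper name `extract_coordinates` is made up, and the sample responses are hand-written dictionaries shaped like the JSON the cell above expects, not real server output.

```python
def extract_coordinates(response):
    """Return [lat, lng] as floats from a parsed GeoNames response,
    or None if the response contains no results."""
    # .get() avoids a KeyError if the 'geonames' key is missing altogether
    results = response.get('geonames') or []
    if not results:
        return None
    first = results[0]
    return [float(first['lat']), float(first['lng'])]

# Hand-written sample responses (assumed shape)
found = {'geonames': [{'lat': '42.35843', 'lng': '-71.05977'}]}
nothing_found = {'geonames': []}

print(extract_coordinates(found))
print(extract_coordinates(nothing_found))
```

In the search loop, we could then skip (or make a note of) any place where this returns `None`, rather than having the whole cell stop partway through.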

### Figuring out the number of records for each place
Let's go back through our `bib_records` and figure out how many records we have for each of these places. 

For each entry in `bib_records`, we get the value of `place` in the nested dictionary and use that as the key in `geocoded_places` (because we regularized our place names, these should all match). Every time we encounter a record whose `place` is the same as a key in our `geocoded_places` dictionary, we increase the value of `record_count` for that place by adding 1.

In [None]:
for k, v in bib_records.items() :
  geocoded_places[v['place']]['record_count'] += 1


In [None]:
for k, v in geocoded_places.items() :
  print(k, v['record_count'])
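
The standard library's `collections.Counter` offers another way to do this kind of tallying. Here is a small sketch with made-up records; the record IDs and the name `record_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical sample records, shaped like the notebook's bib_records
bib_records = {
    'rec1': {'place': ('London', 'England')},
    'rec2': {'place': ('London', 'England')},
    'rec3': {'place': ('Bath', 'England')},
}

# Count how many records share each place tuple
record_counts = Counter(record['place'] for record in bib_records.values())

for place, count in record_counts.items():
    print(place, count)
```

A `Counter` returns 0 for a place it has never seen instead of raising a `KeyError`, which can be handy if some records didn't make it into `geocoded_places`.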

## Putting these places on the map
We'll use the `folium` package to create markers on a map: one marker for every geocoded place, with information about all the records associated with that place. (**Note:** This is only one approach, and not necessarily the best one, depending on what you want to do; it seemed like the easiest way, for demonstration purposes.)

In [None]:
!pip install folium

### Making our map with `folium`
I don't really want to get into too much detail about `folium`, specifically, because you may well find yourself using different software. I'll just point out a couple of things that are going on here, and then provide comments in the code.

One slightly confusing thing that's happening in this cell (starting at line 5) is that I'm trying to synthesize coordinates that don't actually exist in my set of places so that I can figure out what rectangle will fit all of the coordinates I have into the initial view of the map: 

* I need a northeast corner that's as far north as the northernmost coordinate and as far east as the easternmost coordinate.

* I need a southwest corner that's as far south as the southernmost coordinate and as far west as the westernmost coordinate.

So I end up creating sorted lists of latitudes and longitudes at lines 6 and 7 by iterating through all of the latitudes and all of the longitudes (the latitudes are the first item in the list of coordinates for each place and the longitudes are the second item in that list). I'm using a more compact syntax than you've seen in any of our code before, one that plays to one of Python's strengths. This technique is called "[list comprehension](https://www.python.org/dev/peps/pep-0202/)."

Because each of those lists is sorted, I can use the first entry in each to create a southwest point and the last entry in each to create a northeast point, at lines 9 and 10, respectively. Those points end up being used as parameters for `fit_bounds()` at line 52. 
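
To make the corner-finding concrete, here is a small sketch using list comprehensions with `min()` and `max()` rather than sorting. The coordinates are rough, hand-typed approximations for illustration only, not values returned by GeoNames.

```python
# Approximate, hand-typed coordinates (illustration only)
geocoded_places = {
    ('Shrewsbury', 'England'): {'coordinates': [52.71, -2.75]},
    ('Boston MA', 'United States'): {'coordinates': [42.36, -71.06]},
    ('London', 'England'): {'coordinates': [51.51, -0.13]},
}

# List comprehensions: pull the first item (latitude) and the second item
# (longitude) out of each place's coordinates
lats = [v['coordinates'][0] for v in geocoded_places.values()]
lngs = [v['coordinates'][1] for v in geocoded_places.values()]

# Southwest corner: smallest latitude and smallest longitude
sw = [min(lats), min(lngs)]
# Northeast corner: largest latitude and largest longitude
ne = [max(lats), max(lngs)]

print(sw, ne)
```

Notice that the northeast corner here combines Shrewsbury's latitude with London's longitude: it is a synthesized point that doesn't correspond to any one place in the data.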

The other big thing that's happening here is building up a list of links back to the catalog to show in a popup when we click on any of our markers. This is a nice idea, but the implementation could use some work: it's fine for Bath (1 record), but the legend for London (with 166 records) ends up way the heck up in the North Sea somewhere.

In [None]:
import folium
#Create a map
m = folium.Map(prefer_canvas=True, min_zoom=3)

#Figure out northern-/southern-most and eastern-/western-most coordinates
lats = sorted([v['coordinates'][0] for k, v in geocoded_places.items()])
lngs = sorted([v['coordinates'][1] for k, v in geocoded_places.items()])
#Construct southwesternmost and northeasternmost points for fit_bounds()
sw = [lats[0], lngs[0]]
ne = [lats[-1], lngs[-1]]

#Create text to show with markers
url_prefix = 'http://estc.bl.uk/'
#url_prefix = 'http://catalogos.mecd.es/CCPB/cgi-ccpb/abnetopac?ACC=DOSEARCH&xsqf99='
#url_prefix = 'http://id.lib.harvard.edu/alma/'

#Iterate through unique places
for k, v in geocoded_places.items() :
  #Construct an empty list to hold links back to the catalog
  catalog_links = []

  #Iterate through the bib_records dictionary
  for record_id, record_contents in bib_records.items() :
    if record_contents['place'] == k :
      #Construct a link to the catalog record and add it to the list
      catalog_links.append('<a href="' + url_prefix + record_id + '">' + record_id + '</a>')
      #For Harvard:
      # catalog_links.append('<a href="' + url_prefix + record_id + '/catalog">' + record_id + '</a>')
    
  #If there's only one record, use singular "record" at line 38, below, if 
  #there are more, use plural "records"
  if v['record_count'] == 1 :
    record_form =  ' record:\n'
  else :
    record_form = ' records:\n'
  
  #Create a legend to display when we click on a marker
  legend = k[0] + '\n' + str(v['record_count']) + record_form
  #Join together all of our catalog links and add them to the legend
  legend += '\n'.join(catalog_links)
  
  #Construct a marker for each place. Locate it at the coordinates for that place
  #and use the legend created at lines 38 and 40 as a popup. Then add the marker
  #to the map object we created at line 3
  folium.Marker(
      location = v['coordinates'],
      popup = legend).add_to(m)

#Zoom the map so that all markers are visible within an initial zoom level 
#defined by the southwesternmost and northeasternmost corners we defined at
#lines 9 and 10
m.fit_bounds([sw, ne])
#Show the map
m
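
One possible way to tame the very tall legend for places like London is to cap the number of links we show. Here is a rough sketch of that idea. The function name `build_legend`, the `max_links` parameter, and the sample record IDs are all made up for this example, and it uses `<br>` for line breaks on the assumption that the popup text is displayed as HTML.

```python
def build_legend(place_name, record_ids, url_prefix, max_links=10):
    """Build popup text for one place, showing at most max_links links."""
    # Only build links for the first max_links records
    links = ['<a href="' + url_prefix + record_id + '">' + record_id + '</a>'
             for record_id in record_ids[:max_links]]

    # Singular "record" for one record, plural "records" otherwise
    record_form = ' record:' if len(record_ids) == 1 else ' records:'

    legend = place_name + '<br>' + str(len(record_ids)) + record_form + '<br>'
    legend += '<br>'.join(links)

    # Say how many records we left out, if any
    remaining = len(record_ids) - len(links)
    if remaining > 0:
        legend += '<br>... and ' + str(remaining) + ' more'
    return legend

# Hypothetical record IDs
print(build_legend('Bath', ['T0001'], 'http://estc.bl.uk/'))
print(build_legend('London', ['T1', 'T2', 'T3', 'T4', 'T5'],
                   'http://estc.bl.uk/', max_links=3))
```

The string this returns could be passed as the `popup` for a marker in place of `legend` in the cell above.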

## Conclusion
This is a fairly rough-and-ready example of mapping, but hopefully it's enough to introduce a few key ideas:

* When we repurpose data, there's almost always going to be cleaning and regularization that need to happen. As this notebook shows, that may involve a mix of code and "manual" work. 
* We can use data we have to get data that somebody else has using an API (here, the GeoNames API).
* Once we have our data in a form that allows us to draw in data from elsewhere, lots of interesting possibilities open up.

If we were using a different environment (and had more time), there are all sorts of things we might do differently with this map. Hopefully, this is enough to give us a jumping off point for discussing other kinds of things we might do along these lines.