# Scraping a web page without an API
This notebook provides code snippets for "scraping" information from a web site that doesn't offer an API--that is, a web site that was designed for a user to consult and interact with in their web browser. These snippets are for use in our class session. The `reference` folder includes a notebook with all of this code (and more) supported by a prose discussion of what's going on that may be useful for later review. This notebook doesn't include all the code examples in the longer notebook, but maintains the numbering of the code cells from that notebook so you can readily find your place in it.

## Connect to Google Drive and import some packages

In [14]:
#Code cell 1
#Connect to and mount Google Drive
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [13]:
#Code cell 2
#Import packages for working with the British Library Labs' metadata file
import numpy as np
import pandas as pd

## Have a look at the British Library dataset of metadata for digitized printed books

In [15]:
#Code cell 3
#Set data folder
data_directory = '/gdrive/MyDrive/rbs_digital_approaches_2023/2023_data_class/'

#Load BL metadata (source: https://data.bl.uk/bl_labs_datasets/#3)
bl_digitized = pd.read_csv(data_directory + 'MS_digitised_books_2021-01-09.csv')

#Inspect the DataFrame to get a list of columns, a count of how many rows have
#data in each column, and the datatype of the column
bl_digitized.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52695 entries, 0 to 52694
Data columns (total 24 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   BL record ID                        52695 non-null  int64 
 1   Type of resource                    52695 non-null  object
 2   Name                                47552 non-null  object
 3   Dates associated with name          10825 non-null  object
 4   Type of name                        47552 non-null  object
 5   Role                                1680 non-null   object
 6   All names                           49633 non-null  object
 7   Title                               52695 non-null  object
 8   Variant titles                      5867 non-null   object
 9   Series title                        260 non-null    object
 10  Number within series                111 non-null    object
 11  Country of publication              36460 non-null  ob

### Get usable publication dates

In [16]:
#Code cell 4
#Make a new dataframe of just the rows where the date column is not null (i.e.,
#books for which we have a publication date)
bl_digitized_w_dates = bl_digitized.loc[bl_digitized['Date of publication'].notnull()].copy()
bl_digitized_w_dates

Unnamed: 0,BL record ID,Type of resource,Name,Dates associated with name,Type of name,Role,All names,Title,Variant titles,Series title,...,Date of publication,Edition,Physical description,Dewey classification,BL shelfmark,Topics,Genre,Languages,Notes,BL record ID for physical resource
0,14602826,Monograph,"Yearsley, Ann",1753-1806,person,,"More, Hannah, 1745-1833 [person] ; Yearsley, A...",Poems on several occasions [With a prefatory l...,,,...,1786,Fourth edition MANUSCRIPT note,,,Digital Store 11644.d.32,,,English,,3996603
1,14602830,Monograph,"A, T.",,person,,"Oldham, John, 1653-1683 [person] ; A, T. [person]",A Satyr against Vertue. (A poem: supposed to b...,,,...,1679,,15 pages (4°),,Digital Store 11602.ee.10. (2.),,,English,,1143
2,14602831,Monograph,,,,,,"The Aeronaut, a poem; founded almost entirely,...",,,...,1816,,17 pages (8°),,Digital Store 992.i.12. (3.),Dublin (Ireland),,English,,22782
3,14602832,Monograph,"Albert, Prince Consort, consort of Victoria, Q...",1819-1861,person,,"Plimsoll, Joseph [person] ; Albert, Prince Con...","The Prince Albert, a poem [By Joseph Plimsoll.]",Appendix,,...,1868,,16 pages (8°),,Digital Store 11602.ee.17. (1.),,,English,,39775
4,14602833,Monograph,"Anslow, Robert",,person,,"Anslow, Robert [person]","The Defeat of the Spanish Armada, A.D. 1588. A...",,,...,1888,,40 pages (8°),,Digital Store 11602.ee.17. (7.),,,English,,92666
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52688,16289056,Monograph,"Eliot, George",1819-1880,person,,"Eliot, George, 1819-1880 [person]",The Mill on the Floss ... Illustrated by T. H....,The Mill on the Floss,,...,1936,Another edition,"377 pages, plates, 21 cm",,Digital Store 012604.l.3,,,English,,4117457
52689,16289057,Monograph,"Garstang, Walter, M.A., F.Z.S.",,person,,"Garstang, Walter, M.A., F.Z.S. [person] ; Shep...",Songs of the Birds ... With illustrations by J...,,,...,1922,,"101 pages, illustrations (8°)",598.259,Digital Store 011648.g.133,,,English,"Poems, with and introductory essay",4158005
52692,16289060,Monograph,"Wellesley, Dorothy",1889-1956,person,,"Wellesley, Dorothy, 1889-1956 [person]",Early Poems. By M. A [i.e. Dorothy Violet Well...,,,...,1913,,"vii, 90 pages (8°)",,Digital Store 011649.eee.17,,,English,,839
52693,16289061,Monograph,"A, T. H. E.",,person,,"A, T. H. E. [person]","Of Life and Love [Poems.] By T. H. E. A, write...",,,...,1924,,89 pages (8°),,Digital Store 011645.e.125,,,English,,1167


In [None]:
#Code cell 5
#Try to turn those dates into numbers. This is going to produce an error.
bl_digitized_w_dates['Date of publication'] = bl_digitized['Date of publication'].astype(int)

In [None]:
#Code cell 6
#Filter (using pandas' .loc[] indexer) to show the rows whose date values aren't
#entirely numeric
bl_digitized_w_dates.loc[bl_digitized_w_dates['Date of publication'].str.isnumeric() == False]

In [17]:
#Code cell 7
#Add a new column based on the 'Date of publication' column; populate that column
#with the first string of four digits we find; make all values integers.
bl_digitized_w_dates['use_date'] = bl_digitized_w_dates['Date of publication'].str.extract(r'([0-9]{4})').astype(int)
bl_digitized_w_dates

Unnamed: 0,BL record ID,Type of resource,Name,Dates associated with name,Type of name,Role,All names,Title,Variant titles,Series title,...,Edition,Physical description,Dewey classification,BL shelfmark,Topics,Genre,Languages,Notes,BL record ID for physical resource,use_date
0,14602826,Monograph,"Yearsley, Ann",1753-1806,person,,"More, Hannah, 1745-1833 [person] ; Yearsley, A...",Poems on several occasions [With a prefatory l...,,,...,Fourth edition MANUSCRIPT note,,,Digital Store 11644.d.32,,,English,,3996603,1786
1,14602830,Monograph,"A, T.",,person,,"Oldham, John, 1653-1683 [person] ; A, T. [person]",A Satyr against Vertue. (A poem: supposed to b...,,,...,,15 pages (4°),,Digital Store 11602.ee.10. (2.),,,English,,1143,1679
2,14602831,Monograph,,,,,,"The Aeronaut, a poem; founded almost entirely,...",,,...,,17 pages (8°),,Digital Store 992.i.12. (3.),Dublin (Ireland),,English,,22782,1816
3,14602832,Monograph,"Albert, Prince Consort, consort of Victoria, Q...",1819-1861,person,,"Plimsoll, Joseph [person] ; Albert, Prince Con...","The Prince Albert, a poem [By Joseph Plimsoll.]",Appendix,,...,,16 pages (8°),,Digital Store 11602.ee.17. (1.),,,English,,39775,1868
4,14602833,Monograph,"Anslow, Robert",,person,,"Anslow, Robert [person]","The Defeat of the Spanish Armada, A.D. 1588. A...",,,...,,40 pages (8°),,Digital Store 11602.ee.17. (7.),,,English,,92666,1888
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52688,16289056,Monograph,"Eliot, George",1819-1880,person,,"Eliot, George, 1819-1880 [person]",The Mill on the Floss ... Illustrated by T. H....,The Mill on the Floss,,...,Another edition,"377 pages, plates, 21 cm",,Digital Store 012604.l.3,,,English,,4117457,1936
52689,16289057,Monograph,"Garstang, Walter, M.A., F.Z.S.",,person,,"Garstang, Walter, M.A., F.Z.S. [person] ; Shep...",Songs of the Birds ... With illustrations by J...,,,...,,"101 pages, illustrations (8°)",598.259,Digital Store 011648.g.133,,,English,"Poems, with and introductory essay",4158005,1922
52692,16289060,Monograph,"Wellesley, Dorothy",1889-1956,person,,"Wellesley, Dorothy, 1889-1956 [person]",Early Poems. By M. A [i.e. Dorothy Violet Well...,,,...,,"vii, 90 pages (8°)",,Digital Store 011649.eee.17,,,English,,839,1913
52693,16289061,Monograph,"A, T. H. E.",,person,,"A, T. H. E. [person]","Of Life and Love [Poems.] By T. H. E. A, write...",,,...,,89 pages (8°),,Digital Store 011645.e.125,,,English,,1167,1924
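
To see what the regular expression in code cell 7 is doing, here is a minimal sketch that uses Python's standard `re` module instead of pandas. The sample date strings and the `first_year` helper are made up for illustration; they are not taken from the British Library file.

```python
import re

# Hypothetical date strings of the kind a 'Date of publication' column might hold
sample_dates = ['1786', '1679-81', '[1816?]', 'c. 1868', 'MDCCC']

# Same pattern as code cell 7: capture the first run of exactly four digits
four_digits = re.compile(r'([0-9]{4})')

def first_year(date_text):
    """Return the first run of four digits as an int, or None if there is none."""
    match = four_digits.search(date_text)
    return int(match.group(1)) if match else None

for date_text in sample_dates:
    # .isnumeric() is the same test code cell 6 applies to every row
    print(repr(date_text), date_text.isnumeric(), first_year(date_text))
```

Note the last sample: a string with no four-digit run has nothing to extract. In pandas, `.str.extract()` gives such a row a missing value, which `.astype(int)` cannot convert, so it is worth checking for those rows if code cell 7 ever raises an error on a different dataset.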


## Identify records of interest
Note that we're not throwing anything away: you can access the full set of records at any time using the `bl_digitized_w_dates` or `bl_digitized` DataFrames. See the full notebook in today's `reference` folder for more examples of ways to select a subset of records.

In [18]:
#Code cell 8
#Create another DataFrame for rows with a date before 1801
pre_1801 = bl_digitized_w_dates.loc[bl_digitized_w_dates['use_date'] < 1801].copy()
pre_1801.sort_values(by=['use_date', 'Name'])

Unnamed: 0,BL record ID,Type of resource,Name,Dates associated with name,Type of name,Role,All names,Title,Variant titles,Series title,...,Edition,Physical description,Dewey classification,BL shelfmark,Topics,Genre,Languages,Notes,BL record ID for physical resource,use_date
24487,14827520,Monograph,,,,,"Regius, Raphael [person]",Piutarchi Chaeronensis Regum  Imperatorum Apo...,Moralia. Latin,,...,,,,Digital Store 1077.f.39,,,English,,2938347,1510
1169,14610165,Monograph,"Cursius, Petrus",,person,,"Cursius, Petrus [person]",Ro. vrbis excidium [In verse.],,,...,,,,Digital Store 1077.i.56,,,English,Other edition: Ro. vrbis excidium [In verse.]....,840022,1528
13229,14811778,Monograph,"Gnapheus, Gulielmus",,person,,"Gnapheus, Gulielmus [person] ; Palsgrave, John...",J. Palsgravii ... Ecphrasis Anglica in Comœdia...,,,...,,,,Digital Store 644.e.11,,Drama,English,,1442148,1540
1175,14610171,Monograph,England,,organisation,,Germany [organisation] ; England [organisation...,Capitoli della Tregua conclusa ... tra la Maes...,"Treaties, etc.. II. Chronological Series. Phil...",,...,,,,Digital Store 1077.g.16,,,English,,1080196,1556
18684,14817484,Monograph,"Losa, Andres de la",,person,,"Losa, Andres de la [person]","Verdadero Entretenimiento del Christiano, en e...",,,...,,119 pages (4°),,Digital Store 1077.g.52,,,English,,2263347,1584
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24971,14828004,Monograph,,,,,,The Rape of the Faro-Bank: an heroi-comical po...,,,...,,,,Digital Store 992.h.24. (7.),,,English,,3040984,1800
28873,14831910,Monograph,,,,,,The West Briton; being a collection of poems o...,,,...,,,,Digital Store 11601.k.3,,,English,,3893474,1800
34894,14861019,Monograph,,,,,,"Serração da Velha, etc [In verse.]",,,...,,,,Digital Store 10360.ff.4. (1.),,,English,,3341790,1800
40203,14873210,Monograph,,,,,"White, Joseph, 1745-1814 [person] ; Pocock, Ed...",Abdollatiphi Historiæ Ægypti compendium [With ...,Appendix,,...,"Another edition, Abdollatiphi historiæ Ægypti ...","xxxii, 321 pages (4°)",,Digital Store 983.e.6,,,English,,4976,1800


In [20]:
#Code cell 14
#Create a DataFrame of works by Aphra Behn using str.startswith(). Not all rows
#have a value in the 'Name' column, so we need to ignore any rows where that
#column is 'nan'. (See the reference notebook for more examples.)
pre_1801_behn = pre_1801.loc[pre_1801['Name'].str.startswith('Behn', na=False)].copy().reset_index()
pre_1801_behn

Unnamed: 0,index,BL record ID,Type of resource,Name,Dates associated with name,Type of name,Role,All names,Title,Variant titles,...,Edition,Physical description,Dewey classification,BL shelfmark,Topics,Genre,Languages,Notes,BL record ID for physical resource,use_date
0,6533,14804962,Monograph,"Behn, Aphra",1640-1689,person,,"Behn, Aphra, 1640-1689 [person]","The City-Heiress: or, Sir Timothy Treat-all. A...",,...,,61 pages (4°),,Digital Store 644.g.13,,,English,"Other edition: The City-Heiress: or, Sir Timot...",252167,1682
1,6534,14804963,Monograph,"Behn, Aphra",1640-1689,person,,"Behn, Aphra, 1640-1689 [person]","The City-Heiress: or, Sir Timothy Treat-all. A...",,...,Another edition,61 pages (4°),,Digital Store 644.g.14,,Drama,English,,252168,1698
2,6535,14804964,Monograph,"Behn, Aphra",1640-1689,person,,"Behn, Aphra, 1640-1689 [person]","[The] Emperor of the Moon: a farce, etc",,...,Second edition,54 pages (4°),,Digital Store 644.g.17,,,English,,252175,1688
3,6536,14804965,Monograph,"Behn, Aphra",1640-1689,person,,"Behn, Aphra, 1640-1689 [person]","The Forc'd Marriage; or, the Jealous Bridegroo...",,...,,89 pages (4°),,Digital Store 644.g.10,,Poetry or verse,English,"Other edition: The Forc'd Marriage; or, the Je...",252181,1671
4,6537,14804966,Monograph,"Behn, Aphra",1640-1689,person,,"Behn, Aphra, 1640-1689 [person]","The Lucky Chance, or an Alderman's Bargain. A ...",,...,,69 pages (4°),,Digital Store 644.g.16,,Drama,English,,252185,1687
5,6538,14804967,Monograph,"Behn, Aphra",1640-1689,person,,"J, G. [person] ; Behn, Aphra, 1640-1689 [perso...","The Widdow Ranter, or, the History of Bacon in...","Works edited or adapted by Dryden, or containi...",...,,56 pages (4°),,Digital Store 644.g.18,,,English,'Prologue by Mr. Dryden'. The same prologue oc...,252228,1690
6,17470,14816270,Monograph,"Behn, Aphra",1640-1689,person,,"Behn, Aphra, 1640-1689 [person]",Lycidus: or the Lover in Fashion. Being an acc...,,...,,2 parts (8°),,Digital Store 1077.f.91,,Poetry or verse,English,,252187,1688
7,17471,14816271,Monograph,"Behn, Aphra",1640-1689,person,,"Behn, Aphra, 1640-1689 [person]","The Roundheads; or, the Good Old Cause, a comedy",,...,,56 pages (4°),,Digital Store 644.g.15,,Drama,English,"Other edition: The Roundheads; or, the Good Ol...",252208,1682
8,17472,14816272,Monograph,"Behn, Aphra",1640-1689,person,,"Behn, Aphra, 1640-1689 [person]","The Rover, or, the Banish't Cavaliers [A comed...",,...,,85 pages (4°),,Digital Store 644.g.12. (1.),,,English,"Other edition: The Rover, or, the Banish't Cav...",252210,1677
9,17473,14816273,Monograph,"Behn, Aphra",1640-1689,person,,"Behn, Aphra, 1640-1689 [person]",The second part of The Rover [A comedy.],,...,,85 pages (4°),,Digital Store 644.g.12. (2.),,Drama,English,,252222,1681


## A very quick introduction to HTML
This cell produces a *very* simple HTML document right here in our notebook.

In [None]:
#Code cell 15
%%html
<html>
  <head>
    <!--Information about the page goes here, normally along with links to scripts,
    stylesheets, etc. This simple HTML puts the styling information "inline" in the header.-->
    <style type="text/css">
      body { width: 40%;}
      h1 { color: #496fad;
         }
      div { margin-bottom: 1em; }
      .maintext { font-family: serif;
                  font-size: 13pt;
                }
      .blockquote { font-style: italic;
                    margin: 0em 3em 1em 3em;
                    padding: 0.5em;
                    background-color: #dedede;
                  }
      form { margin-top: 2em; }
      form label { font-weight: bold;
                   font-size: 11pt;
                 }
      #comment { width: 100%; }
      #submitbutton { float: right;
                      font-weight: bold;
                      font-size: 10pt;
                      background-color: #9dbaf5;
                      padding: 10px;
                      border: none;
                      border-radius: 10px;
                    }
    </style>
  </head>
  <body>
    <!--The actual content of the page that you end up seeing.-->
    <h1>A very simple HTML page</h1>
    <div class="maintext">A content <code>div</code>. This element has a
    <code>class</code> attribute that identifies it for one set of visual
    styling rules.</div>
    <div class="blockquote">This is another content <code>div</code>, with a
    different <code>class</code> attribute for very different styling. </div>
    <div class="maintext">Note that the elements in the form below have
      <code>id</code> attributes that <em>can</em> be used for visual styling,
      but also identify those elements for functional purposes.</div>
    <form id="feedback" action="/post_comment.php">
      <label for="comment">Tell us what you think!</label><br />
      <textarea rows="5" id="comment"
      placeholder="This form doesn't actually do anything..."></textarea>
      <input type="submit" id="submitbutton" value="Submit comment" />
    </form>
</body>
</html>
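
The `class` and `id` attributes in the page above are what a scraper uses to find its way around a document. As a rough sketch of that idea, the snippet below uses Python's built-in `html.parser` module (not BeautifulSoup, which we use later) to list which elements carry which `id` and `class` values. The `AttributeCollector` class and the shortened `sample_html` string are illustrative assumptions.

```python
from html.parser import HTMLParser

class AttributeCollector(HTMLParser):
    """Record the tag name for every id and class attribute encountered."""

    def __init__(self):
        super().__init__()
        self.ids = {}      # id value -> tag name
        self.classes = {}  # class value -> list of tag names

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue  # attribute written without a value
            if name == 'id':
                self.ids[value] = tag
            elif name == 'class':
                for class_name in value.split():
                    self.classes.setdefault(class_name, []).append(tag)

# A shortened stand-in for the body of the page in code cell 15
sample_html = '''
<div class="maintext">A content div.</div>
<div class="blockquote">Another content div.</div>
<form id="feedback">
  <textarea id="comment"></textarea>
  <input type="submit" id="submitbutton" value="Submit comment" />
</form>
'''

collector = AttributeCollector()
collector.feed(sample_html)
print(collector.ids)
print(collector.classes)
```

An `id` should identify a single element on a page, while a `class` can be shared by many, which is why the scraping function later in this notebook looks up a button by its `id`.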

## Have a look at a record in the British Library's online catalogue
We'll take a look at a catalogue record and the underlying HTML using Developer Tools in our web browser:
http://explore.bl.uk/BLVU1:LSCOP-ALL:BLL01014912206

## Scraping functions


In [1]:
#Code cell 17
#Import packages
import requests
import urllib3
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

#This function defines an http request with a retry strategy. It accepts a URL
#as an argument, requests the URL using our defined http connection, and returns
#the response to that request
def create_http(url) :
  retry_strategy = Retry(
      total=3,
      status_forcelist=[429, 500, 502, 503, 504],
      allowed_methods=["GET"]
  )
  adapter = HTTPAdapter(max_retries=retry_strategy)
  http = requests.Session()
  http.mount("https://", adapter)
  http.mount("http://", adapter)
  headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0'
  }

  r = http.get(url, headers=headers)

  return r

In [None]:
#Code cell 17a
from bs4 import BeautifulSoup
test_request = create_http('http://explore.bl.uk/primo_library/libweb/action/display.do?frbrVersion=4&tabs=moreTab&ct=display&fn=search&doc=BLL01014912206')
soup = BeautifulSoup(test_request.content, 'html.parser')
print(soup.prettify())
# important_bits = soup.find(class_='EXLSummary EXLResult')
# print(important_bits.prettify())


In [21]:
#Code cell 18
from bs4 import BeautifulSoup
import re
def retrieve_vdc(rec_id) :
  #Construct a URL incorporating the rec_id parameter
  rec_url = 'http://explore.bl.uk/primo_library/libweb/action/display.do?frbrVersion=4&tabs=moreTab&ct=display&fn=search&doc=BLL010' + \
  str(rec_id) + '&vid=BLVU1&lang=en_US&institution=BL'

  #pass the URL we just constructed to the create_http function we defined in
  #code cell 17: we are calling a function from inside another function
  rec_r = create_http(rec_url)

  #Pass the content of the response to BeautifulSoup for parsing
  rec_soup = BeautifulSoup(rec_r.content, 'html.parser')

  #See comments in code cell 16 in full notebook

  #Compile regular expressions
  viewer_pattern = re.compile(r'vdc_([A-Za-z0-9\.]+)')
  google_pattern = re.compile(r'books\.google\.com.+vid%253DBL%253A([A-Za-z0-9]+)%2520')

  #Locate the relevant "Go" button
  view_button = rec_soup.find_all('input', id='getit1_0')
  if len(view_button) == 0 :
    return 'None found'
  else :
    #Find the vdc_ number(s)
    vdc_list = re.findall(viewer_pattern, view_button[0]['value'])
    print(vdc_list)
    #Get rid of duplicates
    vdc_distinct_list = list(set(vdc_list))
    #If there's only one...
    if len(vdc_distinct_list) == 1 :
      vdc = 'vdc_' + vdc_distinct_list[0]
    else :
      #If there's more than one, create a string that delimits the VDC numbers
      #with a pipe character
      multi_vdc = ['vdc_' + vdc_item for vdc_item in vdc_distinct_list]
      vdc = '|'.join(multi_vdc)
    return vdc

In [22]:
#Code cell 18a
#Test out our function with a known record id
newtest = retrieve_vdc(14816272)
newtest

['00000002ABE8', '00000002ABE8', '100026274426.0x000001']


'vdc_00000002ABE8|vdc_100026274426.0x000001'
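
The find, de-duplicate, and join steps inside `retrieve_vdc` can be tried on their own without making a web request. In this sketch the `sample_value` string is invented (a real "Go" button's `value` attribute will look different), and `dict.fromkeys()` is swapped in for `set()` as one possible way to drop duplicates while keeping the order in which the numbers were found.

```python
import re

# Same pattern as in code cell 18
viewer_pattern = re.compile(r'vdc_([A-Za-z0-9\.]+)')

# A made-up attribute value containing one VDC number twice and another once
sample_value = ('http://example.org/a/vdc_00000002ABE8'
                '?x=vdc_00000002ABE8&y=vdc_100026274426.0x000001')

# findall returns only the captured group, i.e. the part after 'vdc_'
found = viewer_pattern.findall(sample_value)

# dict.fromkeys keeps the first occurrence of each item, in order
distinct = list(dict.fromkeys(found))

# Restore the 'vdc_' prefix and delimit with a pipe character
joined = '|'.join('vdc_' + item for item in distinct)
print(found)
print(joined)
```

Because a `set` has no guaranteed order, the version in code cell 18 may list multiple VDC numbers in a different order from run to run. That does not matter for our purposes, since each number ends up in its own row later on.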

## Actually scraping
We'll use the record ids from the dataset the British Library provided to get the VDC number from each record associated with Aphra Behn, then use that information in a further scraping run to gather more information about the scans. We'll determine which scans are marked as public domain and download the title pages of the public domain scans.

### Getting VDC numbers for records of interest

In [None]:
#Code cell 19
#A case in which several volumes were scanned twice: once by Google, then again by
#the British Library itself. We created this DataFrame earlier, in code cell 14.
pre_1801_behn['vdc'] = pre_1801_behn['BL record ID'].apply(retrieve_vdc)
pre_1801_behn

### Getting one row for each scan

In [None]:
#Code cell 25
pre_1801_behn = (
    pre_1801_behn.assign(vdc=pre_1801_behn['vdc'].str.split('|'))
      .explode('vdc')
      .reset_index(drop=True)
)
pre_1801_behn
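
If the combination of `.str.split()` and `.explode()` feels opaque, this plain-Python sketch shows the same reshaping on a list of dictionaries, with no pandas involved. The `records` list and its titles and VDC numbers are hypothetical.

```python
# Hypothetical records: the second one has two pipe-delimited VDC numbers
records = [
    {'title': 'Book A', 'vdc': 'vdc_0001'},
    {'title': 'Book B', 'vdc': 'vdc_0002|vdc_0003'},
]

# For each record, split the 'vdc' string and emit one copy of the record per
# VDC number; every other field is repeated unchanged
one_row_per_scan = [
    {**record, 'vdc': vdc}
    for record in records
    for vdc in record['vdc'].split('|')
]

for row in one_row_per_scan:
    print(row)
```

A value with no pipe character, including the `'None found'` placeholder, splits into a one-item list, so that record stays as a single row.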

### Looking at IIIF manifest (in JSON format)
Let's have a look at a IIIF manifest for a minute: https://api.bl.uk/metadata/iiif/ark:/81055/vdc_00000002C83E/manifest.json

In [26]:
#Code cell 26
import json
def check_manifest(vdc_num) :
  iiif_manifest_url = 'https://api.bl.uk/metadata/iiif/ark:/81055/' + vdc_num + '/manifest.json'
  print(iiif_manifest_url)
  manifest_r = create_http(iiif_manifest_url)
  try :
    manifest_json = json.loads(manifest_r.text)
    metadata = manifest_json['metadata']
    confirmed_shelfmark = ''
    for metadata_item in metadata :
      if metadata_item['label'] == 'Identifier' :
        confirmed_shelfmark = metadata_item['value']
    #Look for the license object in the JSON, too
    license_terms = manifest_json['license']
    #Check to see if the value of the license object includes "creativecommons",
    #"google", or... something else.
    if license_terms.find('creativecommons') != -1 :
      license = 'Public Domain'
    elif license_terms.find('google') != -1 :
      license = 'Google Books'
    else :
      license = license_terms
    #Find the link to the book's title page (used as a thumbnail)
    title_page_link = manifest_json['thumbnail']['@id']
    #Return a list including the confirmed shelfmark, the license, and the
    #title page link
    return [confirmed_shelfmark, license, title_page_link]
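  except Exception :
    #Return one placeholder per expected value so that code cell 27 can still
    #unpack the result into three columns
    return ['Not found', 'Not found', 'Not found']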
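
To experiment with the parsing logic in `check_manifest` without hitting the British Library's servers, you can feed it a small JSON string by hand. The manifest below is a trimmed, made-up example containing only the three keys the function reads; real manifests are much longer. The `classify_license` helper is a hypothetical name for the if/elif/else test in code cell 26.

```python
import json

# A made-up manifest with only the keys check_manifest looks at
sample_manifest_text = '''
{
  "metadata": [
    {"label": "Title", "value": "An example title"},
    {"label": "Identifier", "value": "Digital Store 000.x.0."}
  ],
  "license": "https://creativecommons.org/publicdomain/mark/1.0/",
  "thumbnail": {"@id": "https://example.org/iiif/thumb.jpg"}
}
'''

def classify_license(license_terms):
    """Map a license URL to a short label, as code cell 26 does."""
    if 'creativecommons' in license_terms:
        return 'Public Domain'
    if 'google' in license_terms:
        return 'Google Books'
    return license_terms

manifest = json.loads(sample_manifest_text)

# Walk the metadata list looking for the item labelled 'Identifier'
shelfmark = ''
for metadata_item in manifest['metadata']:
    if metadata_item['label'] == 'Identifier':
        shelfmark = metadata_item['value']

print(shelfmark)
print(classify_license(manifest['license']))
print(manifest['thumbnail']['@id'])
```

The `in` operator used here is equivalent to the `.find(...) != -1` test in code cell 26 and is usually considered easier to read.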

In [None]:
#Code cell 27
#This one's a little tricky: we're adding three columns to the DataFrame at once
#by passing a list of column names. We populate those columns using a list
#comprehension over the values returned by the check_manifest function
pre_1801_behn[['confirmed_shelfmark', 'license', 'title_page']] = [result for result in pre_1801_behn['vdc'].apply(check_manifest)]
pre_1801_behn

In [28]:
#Code cell 28
def add_links(vdc_num) :
  viewer_link = 'http://access.bl.uk/item/viewer/ark:/81055/' + vdc_num
  iiif_manifest = 'https://api.bl.uk/metadata/iiif/ark:/81055/' + vdc_num + \
    '/manifest.json'
  return([viewer_link, iiif_manifest])

pre_1801_behn[['book_viewer', 'iiif_manifest']] = [i for i in pre_1801_behn['vdc'].apply(add_links)]

In [29]:
#Code cell 30
#Create a subset of Behn's works with a Public Domain license and see what we have
pre_1801_behn_public_domain = pre_1801_behn.loc[pre_1801_behn['license'] == 'Public Domain'].copy()
pre_1801_behn_public_domain[['Title', 'confirmed_shelfmark', 'book_viewer', 'license', 'title_page']]

Unnamed: 0,Title,confirmed_shelfmark,book_viewer,license,title_page
1,"The City-Heiress: or, Sir Timothy Treat-all. A...",Digital Store 644.g.13.,http://access.bl.uk/item/viewer/ark:/81055/vdc...,Public Domain,https://api.bl.uk/image/iiif/ark:/81055/vdc_00...
2,"The City-Heiress: or, Sir Timothy Treat-all. A...",Digital Store 644.g.14.,http://access.bl.uk/item/viewer/ark:/81055/vdc...,Public Domain,https://api.bl.uk/image/iiif/ark:/81055/vdc_00...
4,"[The] Emperor of the Moon: a farce, etc",Digital Store 644.g.17.,http://access.bl.uk/item/viewer/ark:/81055/vdc...,Public Domain,https://api.bl.uk/image/iiif/ark:/81055/vdc_00...
5,"The Forc'd Marriage; or, the Jealous Bridegroo...",Digital Store 644.g.10.,http://access.bl.uk/item/viewer/ark:/81055/vdc...,Public Domain,https://api.bl.uk/image/iiif/ark:/81055/vdc_00...
8,"The Lucky Chance, or an Alderman's Bargain. A ...",Digital Store 644.g.16.,http://access.bl.uk/item/viewer/ark:/81055/vdc...,Public Domain,https://api.bl.uk/image/iiif/ark:/81055/vdc_00...
10,"The Widdow Ranter, or, the History of Bacon in...",Digital Store 644.g.18.,http://access.bl.uk/item/viewer/ark:/81055/vdc...,Public Domain,https://api.bl.uk/image/iiif/ark:/81055/vdc_00...
11,Lycidus: or the Lover in Fashion. Being an acc...,Digital Store 1077.f.91.,http://access.bl.uk/item/viewer/ark:/81055/vdc...,Public Domain,https://api.bl.uk/image/iiif/ark:/81055/vdc_00...
12,"The Roundheads; or, the Good Old Cause, a comedy",Digital Store 644.g.15.,http://access.bl.uk/item/viewer/ark:/81055/vdc...,Public Domain,https://api.bl.uk/image/iiif/ark:/81055/vdc_00...
14,"The Rover, or, the Banish't Cavaliers [A comed...",Digital Store 644.g.12.(1.),http://access.bl.uk/item/viewer/ark:/81055/vdc...,Public Domain,https://api.bl.uk/image/iiif/ark:/81055/vdc_00...
17,The second part of The Rover [A comedy.],Digital Store 644.g.12.(2.),http://access.bl.uk/item/viewer/ark:/81055/vdc...,Public Domain,https://api.bl.uk/image/iiif/ark:/81055/vdc_00...


In [31]:
#Code cell 31
#Import package
import os

#Check to see if the directory exists. If not, create it and set it as the output
#directory
if not os.path.exists('/gdrive/MyDrive/rbs_digital_approaches_2023/output/behn_titlepages/') :
  os.makedirs('/gdrive/MyDrive/rbs_digital_approaches_2023/output/behn_titlepages/')
output_dir = '/gdrive/MyDrive/rbs_digital_approaches_2023/output/behn_titlepages/'

#Iterate through the rows of the dataframe. Retrieve the title page files using
#the create_http function and save them to our output directory
for index, row in pre_1801_behn_public_domain.iterrows() :
  vdc = row['vdc']
  get_tp = create_http(row['title_page'])
  with open(output_dir + vdc + '_tp.jpg', 'wb') as file :
    print('Saving ' + vdc + '_tp.jpg...')
    file.write(get_tp.content)
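
The create-a-folder-then-write-bytes pattern in code cell 31 can be tested locally before you point it at Google Drive. In this sketch a temporary directory stands in for the Drive folder, the bytes are a placeholder rather than a real image, and `vdc_EXAMPLE` is an invented identifier. It also uses the `exist_ok=True` option of `os.makedirs()`, which is one way to skip the separate `os.path.exists()` check.

```python
import os
import tempfile

# A temporary directory stands in for the Google Drive output folder
base_dir = tempfile.mkdtemp()
output_dir = os.path.join(base_dir, 'behn_titlepages')

# exist_ok=True means no error is raised if the folder is already there,
# so calling this twice is harmless
os.makedirs(output_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)

# Placeholder bytes standing in for the content of an HTTP response
fake_image_bytes = b'\xff\xd8\xff\xe0 not a real JPEG'

# 'wb' opens the file for writing in binary mode, which images require
file_path = os.path.join(output_dir, 'vdc_EXAMPLE_tp.jpg')
with open(file_path, 'wb') as file:
    file.write(fake_image_bytes)

print(file_path, os.path.getsize(file_path))
```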
