### A routine to fetch PDFs

Author, title, and year are ANDed together in the query.  If you need an OR, call this routine once for each alternative.
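
As a rough sketch of what "ANDed together" means here, the helper below joins whichever criteria are supplied into a single CQL string. `build_cql_query` is a hypothetical name, not part of the routine; using `dc.creator` as the author index and quoting the search terms are my assumptions, and the second author is only an illustrative value.

```python
def build_cql_query(author=None, title=None, year=None):
    """Join the supplied criteria with 'and' into one CQL query string."""
    # Every query is restricted to PDF documents, as in the routine.
    parts = ['dc.format all "application/pdf"']
    if author is not None:
        # Assumption: dc.creator is the index to use for authors.
        parts.append('dc.creator all "' + author + '"')
    if title is not None:
        parts.append('dc.title all "' + title + '"')
    if year is not None:
        parts.append('dc.date any ' + str(year))
    return '(' + ' and '.join(parts) + ')'


# An OR over two authors becomes two separate AND queries, one call each.
queries = [build_cql_query(author=a, title='Toto')
           for a in ('Offenbach', 'Bizet')]
```

Each entry in `queries` would then be searched on its own, which is the same as calling the routine once per alternative.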

If you want data for only one year, provide a starting_year and ending_year with the same value.
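
A minimal sketch of how the year range expands, assuming both bounds are inclusive. `years_to_search` is a hypothetical helper, and it uses `None` to mean "no year filter" where the routine uses a `-1` placeholder.

```python
def years_to_search(starting_year=None, ending_year=None):
    """Return the list of years to query, or [None] when no range is given."""
    if starting_year is None or ending_year is None:
        # No usable range: run the search once, without a date clause.
        return [None]
    # Both ends are inclusive, so a single year is start == end.
    return list(range(int(starting_year), int(ending_year) + 1))
```

With this, `years_to_search(1875, 1875)` yields a single year, and `years_to_search(1870, 1872)` yields three.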

The Gallica CQL processor doesn't seem to handle dates the way you'd expect from the documentation available elsewhere.  Part of the problem is that their dc:date field is a free-text field.  I'm not sure what the rest of the problem is, since their API doesn't return error messages for CQL errors; instead, it simply fails.
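
Because dc:date is free text, one possible workaround is to check the year client-side after the search comes back. The sketch below pulls the first plausible four-digit year out of a date string. `extract_year` is a hypothetical helper, the 1500 to 2099 window is an arbitrary assumption, and the sample strings in the comments are made up for illustration rather than taken from Gallica.

```python
import re

# A four-digit year between 1500 and 2099, not embedded in a longer number.
YEAR_PATTERN = re.compile(r'(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)')


def extract_year(free_text_date):
    """Return the first plausible year in a free-text date, or None."""
    match = YEAR_PATTERN.search(free_text_date or '')
    return int(match.group(1)) if match else None


# Illustrative inputs only:
#   extract_year('1875-03-12')  finds 1875
#   extract_year('[ca 1880]')   finds 1880
#   extract_year('18..')        finds nothing and returns None
```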

This routine works out which years fall between starting_year and ending_year (inclusive), if both are supplied.  For each of those years (or just once, if no year is supplied):

1.  It searches for that year using the author and title passed to the routine;
2.  For each record it finds, it looks up the pagination;
3.  It uses the pagination to build the URL for the PDF (Gallica's doc doesn't say how to do this), downloads the PDF, and gives it a readable file name.
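
Steps 2 and 3 boil down to two URL patterns, sketched below with hypothetical helper names. The patterns are the ones this routine uses, which I worked out by trial rather than from Gallica's documentation, and the ark identifier in the usage line is a made-up placeholder.

```python
GALLICA = 'https://gallica.bnf.fr'


def build_pagination_url(ark_id):
    """URL of the Pagination service, which reports the number of page images."""
    return GALLICA + '/services/Pagination?ark=' + ark_id


def build_pdf_url(ark_id, n_images):
    """URL of a PDF covering n_images pages, starting at the first page (f1)."""
    return (GALLICA + '/ark:/12148/' + ark_id +
            '/f1n' + str(n_images) + '.pdf?download=1')


# Placeholder identifier and page count, for illustration only.
example_url = build_pdf_url('bpt6k0000000x', 42)
```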

The script is supposed to handle searches whose results run past one page, and I think it actually does (I tested that inadvertently, but not comprehensively).  I have not tested the date range business.
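
For the paging logic, here is a small sketch of reading the next start position out of an SRU response. It uses the standard library parser instead of lxml so it stands alone. `next_start_record` is a hypothetical helper, the sample XML is a hand-written stub rather than real Gallica output, and treating a missing `nextRecordPosition` as "last page" is my assumption about how the server behaves.

```python
import xml.etree.ElementTree as ET

SRW = '{http://www.loc.gov/zing/srw/}'


def next_start_record(sru_xml):
    """Return the startRecord for the next page, or None if there is none."""
    root = ET.fromstring(sru_xml)
    n_records = int(root.findtext('.//' + SRW + 'numberOfRecords', default='0'))
    next_pos = root.findtext('.//' + SRW + 'nextRecordPosition')
    # Assumption: the element is omitted (or points past the end) on the last page.
    if next_pos is None or int(next_pos) > n_records:
        return None
    return int(next_pos)


# Hand-written stub of a response, for illustration only.
sample_page = (
    '<srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/">'
    '<srw:numberOfRecords>21</srw:numberOfRecords>'
    '<srw:nextRecordPosition>16</srw:nextRecordPosition>'
    '</srw:searchRetrieveResponse>'
)
```

A caller would keep requesting pages until `next_start_record` returns `None`.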

The process seems quite slow.  It took me about 12 minutes to download 21 PDFs . . . 

Also, please note that I'm not using PyGallica because it doesn't really do anything useful.  Its search API simply dumps the XML results to the file system and does nothing to simplify the composition of CQL queries, and its document API doesn't do anything for PDFs.  It really is quite useless . . . 

In [1]:
import requests, subprocess, re
from lxml import etree

ns = {'srw': 'http://www.loc.gov/zing/srw/', 
      'dc': 'http://purl.org/dc/elements/1.1/'}

def get_pdfs_for_query(author=None, 
                     title=None, 
                     starting_year=None,
                     ending_year=None,
                     max_n_pdfs_to_fetch=100):
    
    starting_record = 1
    
    years = []
    if starting_year is not None and ending_year is not None:
        years = list(range(int(starting_year), int(ending_year) + 1))
            
    if len(years) == 0:
        # -1 MEANS "NO YEAR FILTER"
        years.append(-1)
        
    n_pdfs_fetched = 0
    
    continue_fetching = True
    
    while continue_fetching:
        
        for year in years:
                    
            # -------------------------------------------------------------------
            # SEARCH
            # -------------------------------------------------------------------

            query_parts = ['dc.format all "application/pdf"',]

            if author is not None:
                # AUTHORS ARE SEARCHED IN dc.creator, NOT dc.title
                query_parts.append('dc.creator all ' + author)

            if title is not None:
                query_parts.append('dc.title all ' + title)

            if year != -1:
                query_parts.append('dc.date any ' + str(year))

            results = []

            search_url = 'https://gallica.bnf.fr/SRU?operation=searchRetrieve&version=1.2&query=' + \
                            '(' + ' and '.join(query_parts) + ')' + \
                            '&startRecord=' + str(starting_record)

            search_req = requests.get(search_url)

            search_root = etree.fromstring(search_req.content)

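            n_records = search_root.xpath('//srw:numberOfRecords', namespaces=ns)[0].text
            
            # THE NEXT QUERY STARTING RECORD, IF THE SEARCH RUNS MORE THAN ONE PAGE.
            # IF THE ELEMENT IS MISSING (E.G. ON THE LAST PAGE), POINT PAST THE END
            # SO THE LOOP STOPS INSTEAD OF RAISING AN IndexError.
            next_positions = search_root.xpath('//srw:nextRecordPosition', namespaces=ns)
            if len(next_positions) > 0:
                starting_record = next_positions[0].text
            else:
                starting_record = int(n_records) + 1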
                    
            # -------------------------------------------------------------------
            # FOR EVERY RECORD SEARCH FOUND
            # -------------------------------------------------------------------

            for record in search_root.xpath('//srw:record', namespaces=ns):

                record_author = ''
                if len(record.xpath('descendant::dc:creator', namespaces=ns)) > 0:
                    record_author = record.xpath('descendant::dc:creator', namespaces=ns)[0].text

                record_title = ''
                if len(record.xpath('descendant::dc:title', namespaces=ns)) > 0:
                    record_title = record.xpath('descendant::dc:title', namespaces=ns)[0].text

                record_date = ''
                if len(record.xpath('descendant::dc:date', namespaces=ns)) > 0:
                    record_date = record.xpath('descendant::dc:date', namespaces=ns)[0].text

                ark_id = ''
                if len(record.xpath('descendant::uri', namespaces=ns)) > 0:
                    ark_id = record.xpath('descendant::uri', namespaces=ns)[0].text
                    
                # -------------------------------------------------------------------
                # HOW MANY PAGES?  WE NEED TO KNOW IN ORDER TO CONSTRUCT THE PDF URL
                # -------------------------------------------------------------------
                    
                pagination_url = 'https://gallica.bnf.fr/services/Pagination?ark=' + ark_id

                pagination_req = requests.get(pagination_url)

                pagination_root = etree.fromstring(pagination_req.content)

                n_images = -1
                if len(pagination_root.xpath('//nbVueImages')) > 0:
                    n_images = int(pagination_root.xpath('//nbVueImages')[0].text)
                    
                # -------------------------------------------------------------------
                # ACTUALLY FETCH THE PDF
                # -------------------------------------------------------------------
        
                pdf_url = 'https://gallica.bnf.fr/ark:/12148/' + ark_id + \
                            '/f1n' + str(n_images) + '.pdf?download=1'
                
                # I'M TRYING TO BUILD A SENSIBLE OUTPUT FILE NAME HERE

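                # SPLIT ON SPACES, COMMAS, PARENTHESES, HYPHENS, AND COLONS.
                # RAW STRINGS AVOID INVALID-ESCAPE WARNINGS.
                output_pdf_name = record_date + '_' + '_'.join(re.split(r'[ ,()\-:]+', record_author)[:2]) + \
                                    '_' + '_'.join(re.split(r'[ ,()\-:]+', record_title)[:3]) + '_' + \
                                    '_' + ark_id + '.pdf'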
                
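                # SOMETHING LIKE
                #     cmd = ['wget', pdf_url, '-O', 'resulting_pdfs/' + ark_id + '.pdf']
                # WILL PUT THE PDFS IN THE FOLDER resulting_pdfs.
                # AN ARGUMENT LIST (NO SHELL) KEEPS APOSTROPHES AND OTHER SHELL
                # CHARACTERS IN THE FILE NAME FROM BREAKING THE COMMAND.
                cmd = ['wget', pdf_url, '-O', output_pdf_name]

                subprocess.run(cmd)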

                n_pdfs_fetched += 1
                
            # -------------------------------------------------------------------
            # SO IT DOESN'T GO ON FOREVER . . . 
            # -------------------------------------------------------------------
            
            if n_pdfs_fetched >= max_n_pdfs_to_fetch:
                continue_fetching = False
                break
            if int(starting_record) > int(n_records):
                continue_fetching = False
                break

### Call the routine with a particular request

In [None]:
get_pdfs_for_query(title='Toto', author='Offenbach')