The code below builds the URL needed to run a search on canadiana.ca and then, from the page returned, creates a CSV of the results that includes the URLs.

You're going to need two libraries to make this possible:

1. _requests_.  This will let you make the URL request and grab the resulting content.  See https://requests.readthedocs.io/
2. _BeautifulSoup_.  This will let you parse the returned web content.  See https://www.crummy.com/software/BeautifulSoup/

We'll start by getting requests to work.  If you want date ranges you'll need to figure that out: do some different searches, note how the URL on the results page changes in each case, and then compare with what happens below.  Also note that the URL used below is simpler than what you'll be seeing.  You should come to some idea about why and then check in to make sure you understand why I dropped the other stuff (this will make it easier for you to understand how these searches work so you can add stuff in later).
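
One way to make that comparison easier is to split each results URL into its path and query parameters, so the differences stand out.  The sketch below uses only the standard library; `show_params` is a name made up for this example, and you would paste in the URLs you copy from your own browser.

```python
from urllib.parse import urlsplit, parse_qsl

def show_params(url):
    """Return the path and the (name, value) pairs found in a URL's query string."""
    parts = urlsplit(url)
    # parse_qsl ignores the empty chunk produced by the '?&' at the start of the query
    return parts.path, parse_qsl(parts.query)

path, params = show_params('http://eco.canadiana.ca/search/?&q0.0=pickles')
print(path)    # /search/
print(params)  # [('q0.0', 'pickles')]
```

Running it on two different searches and comparing the two lists shows which parameters changed and which ones stayed the same.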

In [15]:
import requests

r = requests.get('http://eco.canadiana.ca/search/?&q0.0=pickles')
r.status_code

#r.headers['content-type']

#r.encoding

#r.text

r.text

'<!doctype html>\n<html id="html" class="no-js eco" lang="en">\n    <head>\n        <title>Search results for pickles - Early Canadiana Online</title>\n        <meta charset="utf-8" />\n        <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" />\n        \n        \n        <meta name="author" content="Canadiana" />\n\n        <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Raleway:400,700" />\n        <link rel="stylesheet" href="http://eco.canadiana.ca/static/css/eco.min.css?cssr=8" />\n        <!--[if lt IE 8]><link rel="stylesheet" href="http://eco.canadiana.ca/static/css/ie7.min.css?cssr=4" /><![endif]-->\n        <script src="http://eco.canadiana.ca/static/js/modernizr.custom.js"></script>\n        <script src="http://eco.canadiana.ca/static/js/respond.min.js"></script>\n    </head>\n    <body>\n        <div id="wrapper">\n            <header class="navbar navbar-static-top">\n                <div class="navbar-inner">\n         

So, we can get all the HTML from the search.  Yes, I searched for "pickles".  In the future we could build in a piece of code that asks a user for input.  For now you'll just have to change pickles to something else.  Unless you like pickles.  Seems Canadiana does: there are over 4,000 results!
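
As a rough sketch of how the search term could be swapped without editing the URL by hand, the function below builds the same URL from any term.  `build_search_url` and `BASE_URL` are names invented for this example; the only thing taken from the search above is the `q0.0` parameter.

```python
from urllib.parse import urlencode

BASE_URL = 'http://eco.canadiana.ca/search/'  # same address as the request above

def build_search_url(term):
    """Build a search URL for one term, encoding spaces and other special characters."""
    # The '&' right after '?' simply mirrors the URL used above.
    return BASE_URL + '?&' + urlencode({'q0.0': term})

print(build_search_url('pickles'))        # http://eco.canadiana.ca/search/?&q0.0=pickles
print(build_search_url('pickled beets'))  # http://eco.canadiana.ca/search/?&q0.0=pickled+beets

# To ask the user instead of hard-coding the term:
# term = input('Search for: ')
# r = requests.get(build_search_url(term))
```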

Carrying out this search shows two things:

1. Each of the records you want is wrapped up in `<section class="search-item">` tags.  This will make it fairly straightforward to grab the content from.
2. It is possible that there are a lot of pages.  Seriously, I was not expecting 497 pages with content about pickles!
    
So, first we make sure we can grab the content from one section.  Then we make sure we can handle multiple pages.  I'll just grab a bit of content.  The rest is up to you.  So is what you do with it.

So, some content from each article on one page...  Soup anyone?

In [16]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

for section in soup.find_all("section", class_="search-item"):
    print(section.table)

<table class="table">
<thead>
<tr>
<th colspan="2">Document Record</th>
</tr>
</thead>
<tbody>
<tr>
<th>Title</th>
<td>
                            
                                
                                
                                
                                    
                                        Medical criticism [No. 8 (Oct. 21, 1882)]
                                    
                                    
                                
                            
                                </td>
</tr>
<tr>
<th>Published</th>
<td>
                            
                                
                                
                                
                                    
                                        Toronto : [s.n., 1882]
                                    
                                    
                                
                            
                                </td>
</tr>
<tr>
<th>Identif

The stuff you want is inside html tables inside each set of section tags with the class "search-item" on each page.  Let's create a list and add some of the content we want from each table to the list.  Actually, we are going to create a list of lists.  Each list in the list will be a row with all the content of the row.

In [17]:
searchResults = []  # create the empty list

searchResults.append(['Title',"Published"])  # create the first row, the headers

for section in soup.find_all("section", class_="search-item"):
    title=section.table.contents[3].contents[1].contents[3]
    print(title)
    published=section.table.contents[3].contents[3].contents[3]
    print(published)
    searchResults.append([title.get_text(),published.get_text()])
    #rowSoup = BeautifulSoup(section, 'html.parser')
    #print(rowSoup)
        
print(searchResults)

<td>
                            
                                
                                
                                
                                    
                                        Medical criticism [No. 8 (Oct. 21, 1882)]
                                    
                                    
                                
                            
                                </td>
<td>
                            
                                
                                
                                
                                    
                                        Toronto : [s.n., 1882]
                                    
                                    
                                
                            
                                </td>
<td>
                            
                                
                                
                                
                                  

So, it's mostly doing what it needs to do, but it's got a lot of line breaks and white space that we don't want.  So we'll try again and strip that stuff out.

In [18]:
searchResults = []  # create the empty list

searchResults.append(['Title',"Published"])  # create the first row, the headers

for section in soup.find_all("section", class_="search-item"):
    title=str(section.table.contents[3].contents[1].contents[3].get_text()).replace('\n','').strip()
    #print(title)
    published=str(section.table.contents[3].contents[3].contents[3].get_text()).replace('\n','').strip()
    print(published)
    searchResults.append([title,published])
    #rowSoup = BeautifulSoup(section, 'html.parser')
    #print(rowSoup)
        
print(searchResults)

Toronto : [s.n., 1882]
Toronto : Grip Print. & Pub. Co., [1890]
[Toronto : Fruit Growers' Association of Ontario, 1882]
Quebec : Printed for the proprietors at the New Printing-Office, [1819]
Quebec : Printed for the proprietors at the New Printing-Office, [1819]
Preservation of food : home canning preserving, jelly-making, pickling, drying                                                                                                                                                                                                                                                                                                                CIHM/ICMH microfiche series ; no. 84491                                                                                                                                                                                                                                                                                                                Bulletin / 

To write this to a file we'll use the csv library.

In [19]:
import csv

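with open('someOutputFile.csv','w', newline='') as myCSV:  # newline='' is what the csv docs recommend
    myCSVwriter = csv.writer(myCSV) # full details at https://docs.python.org/3/library/csv.html
    for row in searchResults:  # 'row' rather than 'list', which would shadow the built-in
        myCSVwriter.writerow(row)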

Running this and looking at the output file reveals that we have a problem: some entries have newline characters in them.  Seriously?  Seriously.  
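
To see what that looks like without opening the file, here is a small sketch that writes two made-up rows to an in-memory buffer instead of to disk.  The row contents are invented for the illustration; the behaviour shown is just the csv module's default quoting.

```python
import csv
import io

rows = [['Title', 'Published'],
        ['Preservation of food\nBulletin', 'Victoria, B.C.']]

buffer = io.StringIO()              # stands in for the output file
csv.writer(buffer).writerows(rows)
output = buffer.getvalue()

print(repr(output))
# 'Title,Published\r\n"Preservation of food\nBulletin","Victoria, B.C."\r\n'
print(len(output.splitlines()))     # 3 physical lines for only 2 rows
```

The file is still valid CSV, because the writer wraps the field in quotes, but each embedded newline makes one record spill over several lines when you look at it in a text editor.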

Didn't we do this already?  Yes, but the .replace() method on strings only gets rid of what we tell it to (the "\n" characters in this case) and the .strip() method only gets rid of _outer_ whitespace, not inner.  We need to do more: we need to collapse every run of repeated whitespace.
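
A quick illustration of the difference, using a made-up string shaped like the table cells above.  The last line shows one common way to collapse inner whitespace (split on any whitespace, then join with single spaces); it is only one option, not the approach this notebook ends up using.

```python
messy = "  Toronto :\n      [s.n.,   1882]  \n"

stripped = messy.strip()                      # outer whitespace only
replaced = messy.replace('\n', '').strip()    # newlines gone, inner runs of spaces remain
collapsed = ' '.join(messy.split())           # every run of whitespace becomes one space

print(repr(stripped))   # 'Toronto :\n      [s.n.,   1882]'
print(repr(replaced))   # 'Toronto :      [s.n.,   1882]'
print(repr(collapsed))  # 'Toronto : [s.n., 1882]'
```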

Actually, after trying this it becomes clear that _Title_ is not always the first item returned.  We can't use the square bracket indexing method.  Back to the drawing board (this took _way too much time to figure out_).  So, let's search smarter...

In [20]:
searchResults = []  # create the empty list

searchResults.append(['Title',"Published"])  # create the first row, the headers

for section in soup.find_all("section", class_="search-item"):
    title=section.table.find('th',text="Title")
    print(str(title.next_sibling.next_sibling.get_text()).strip())

Medical criticism [No. 8 (Oct. 21, 1882)]
Grip [Vol. 35, no. 22 (Nov. 29, 1890)]
The Canadian horticulturist [Vol. 5, no. 10 (Oct. 1882)]
The commercial list : Vol. 4, no. 19 (Sept. 9, 1819)
The commercial list : Vol. 4, no. 26 (Oct. 28, 1819)
Preservation of food : home canning preserving, jelly-making, pickling, drying
                                    
                                    
                                
                            
                                
                                
                                
                                    
                                        CIHM/ICMH microfiche series ; no. 84491
                                    
                                    
                                
                            
                                
                                
                                
                                    
                                        Bulletin / Br

Ok, we're back in business...

Now we need to drop the extra newline characters and white space from the titles...  And we're going to need to do this a lot (so it seems).  Time for a function...  And some regular expressions...

In [35]:
import re

def entryCleaner(what_you_pass_in):
    what_gets_passed_out = what_you_pass_in.strip()
    what_gets_passed_out = re.sub(' +', ' ', what_gets_passed_out)
    what_gets_passed_out = re.sub('(\n )+', '|', what_gets_passed_out)
    return(what_gets_passed_out)

searchResults = []  # create the empty list

searchResults.append(['Title',"Published"])  # create the first row, the headers

for section in soup.find_all("section", class_="search-item"):
    title=section.table.find('th',text="Title").next_sibling.next_sibling.get_text()
    title=entryCleaner(title)
    #print(title)
    
    published=section.table.find('th',text="Published").next_sibling.next_sibling.get_text()
    published=entryCleaner(published)
    #print(published)
    
    searchResults.append([title,published])

print(searchResults)

[['Title', 'Published'], ['Medical criticism [No. 8 (Oct. 21, 1882)]', 'Toronto : [s.n., 1882]'], ['Grip [Vol. 35, no. 22 (Nov. 29, 1890)]', 'Toronto : Grip Print. & Pub. Co., [1890]'], ['The Canadian horticulturist [Vol. 5, no. 10 (Oct. 1882)]', "[Toronto : Fruit Growers' Association of Ontario, 1882]"], ['The commercial list : Vol. 4, no. 19 (Sept. 9, 1819)', 'Quebec : Printed for the proprietors at the New Printing-Office, [1819]'], ['The commercial list : Vol. 4, no. 26 (Oct. 28, 1819)', 'Quebec : Printed for the proprietors at the New Printing-Office, [1819]'], ['Preservation of food : home canning preserving, jelly-making, pickling, drying|CIHM/ICMH microfiche series ; no. 84491|Bulletin / British Columbia. Household Science Branch ; no. 83.', 'Victoria, B.C. : W.H. Cullin, 1919.'], ['La vie canadienne [Vol. 1, no. 10 (Sporting no. [1918])]', 'Rouen, Frence : [s.n., 1918]'], ['Grip [Vol. 31, no. 790 (July 28, 1888)]', 'Toronto : Grip Print. & Pub. Co., [1888]'], ['The commercial 
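
Note that the second pattern in entryCleaner only matches a newline that is followed by a space, so it relies on the blank lines in the page being padded with spaces.  A line-based variant, offered purely as a sketch (`clean_entry` is a hypothetical name and not part of the cells above), avoids that assumption by cleaning each line separately and dropping the empty ones.

```python
def clean_entry(text):
    """Collapse whitespace inside each line and join the non-empty lines with '|'."""
    parts = [' '.join(line.split()) for line in text.splitlines()]
    return '|'.join(part for part in parts if part)

sample = "  Preservation of food\n        \n\n      CIHM/ICMH   microfiche series  \n "
cleaned = clean_entry(sample)
print(cleaned)  # Preservation of food|CIHM/ICMH microfiche series
```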

Finally working.  Now to finally write it to a CSV file...

In [36]:
import csv

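with open('someOutputFile.csv','w', newline='') as myCSV:  # newline='' is what the csv docs recommend
    myCSVwriter = csv.writer(myCSV) # full details at https://docs.python.org/3/library/csv.html
    for row in searchResults:  # 'row' rather than 'list', which would shadow the built-in
        myCSVwriter.writerow(row)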

Open the file to see that it now looks much better.  =)

Now to the last problem: the 400-odd pages to deal with.

We'll do this by performing an initial search, grabbing the total number of pages and then iterating through the pages.  We've already done the initial search (above!) so we just need to grab the total number of pages and search again.
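
The shape of that loop can be sketched without touching the network.  In the sketch below `page_url`, `collect_rows`, `fetch` and `parse` are all names invented for the illustration; the stand-in "site" is a dictionary, so the point is only the order of operations: build the URL for a page, download it, then parse the page that was just downloaded.

```python
def page_url(page, term='pickles'):
    """URL of one results page, following the pattern of the search URL above."""
    return 'http://eco.canadiana.ca/search/' + str(page) + '?&q0.0=' + term

def collect_rows(total_pages, fetch, parse):
    """Fetch every page in turn and gather the rows that parse() finds on each one."""
    rows = []
    for page in range(1, total_pages + 1):
        html = fetch(page_url(page))   # download this page
        rows.extend(parse(html))       # parse the page that was just downloaded
    return rows

# Stand-ins so the sketch runs offline.
fake_site = {page_url(n): 'page %d' % n for n in (1, 2, 3)}
rows = collect_rows(3, fetch=fake_site.get, parse=lambda html: [[html.upper()]])
print(rows)  # [['PAGE 1'], ['PAGE 2'], ['PAGE 3']]
```

With the real site, `fetch` could be something like `lambda url: requests.get(url).text` and `parse` a function that builds a new BeautifulSoup object from that text and returns the cleaned rows.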

At this point we might as well just wrap everything up into a single cell so you can see what that looks like. =)

Your task?  Understand what is going on well enough that you can do things like (in order of difficulty given what you've seen so far...):

1. Collect more fields.
2. Take input from a user.
3. Modify the file to work with cont

In [56]:
import requests
from bs4 import BeautifulSoup
import csv
import re

r = requests.get('http://eco.canadiana.ca/search/?&q0.0=pickles')

soup = BeautifulSoup(r.text, 'html.parser')

totalPages=int(soup.find('div', class_="pagination").find('li',text="...").next_sibling.next_sibling.get_text().strip())

def entryCleaner(what_you_pass_in):
    what_gets_passed_out = what_you_pass_in.strip()
    what_gets_passed_out = re.sub(' +', ' ', what_gets_passed_out)
    what_gets_passed_out = re.sub('(\n )+', '|', what_gets_passed_out)
    return(what_gets_passed_out)

searchResults = [['Title',"Published"]]

for page in range(totalPages):
    print("Now processing page: " + str(page+1))
    url = 'http://eco.canadiana.ca/search/'+ str(page+1) + '?&q0.0=pickles'
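    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')  # re-parse each page; otherwise every pass re-reads page 1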
    
    for section in soup.find_all("section", class_="search-item"):
        title=section.table.find('th',text="Title").next_sibling.next_sibling.get_text()
        title=entryCleaner(title)
        #print(title)

        published=section.table.find('th',text="Published").next_sibling.next_sibling.get_text()
        published=entryCleaner(published)
        #print(published)

        searchResults.append([title,published])

#print(searchResults)

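with open('someOutputFile.csv','w', newline='') as myCSV:  # newline='' is what the csv docs recommend
    myCSVwriter = csv.writer(myCSV) # full details at https://docs.python.org/3/library/csv.html
    for row in searchResults:  # 'row' rather than 'list', which would shadow the built-in
        myCSVwriter.writerow(row)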

[['Title', 'Published'], ['Medical criticism [No. 8 (Oct. 21, 1882)]', 'Toronto : [s.n., 1882]'], ['Grip [Vol. 35, no. 22 (Nov. 29, 1890)]', 'Toronto : Grip Print. & Pub. Co., [1890]'], ['The Canadian horticulturist [Vol. 5, no. 10 (Oct. 1882)]', "[Toronto : Fruit Growers' Association of Ontario, 1882]"], ['The commercial list : Vol. 4, no. 19 (Sept. 9, 1819)', 'Quebec : Printed for the proprietors at the New Printing-Office, [1819]'], ['The commercial list : Vol. 4, no. 26 (Oct. 28, 1819)', 'Quebec : Printed for the proprietors at the New Printing-Office, [1819]'], ['Preservation of food : home canning preserving, jelly-making, pickling, drying|CIHM/ICMH microfiche series ; no. 84491|Bulletin / British Columbia. Household Science Branch ; no. 83.', 'Victoria, B.C. : W.H. Cullin, 1919.'], ['La vie canadienne [Vol. 1, no. 10 (Sporting no. [1918])]', 'Rouen, Frence : [s.n., 1918]'], ['Grip [Vol. 31, no. 790 (July 28, 1888)]', 'Toronto : Grip Print. & Pub. Co., [1888]'], ['The commercial 

# The stuff below is junk

It's here so you can see some failed attempts.

In [None]:
# This is all junk given that title isn't always first.  So much for the indexing choice...  
# will leave this in at the bottom even though it's junk

searchResults = []  # create the empty list

searchResults.append(['Title',"Published"])  # create the first row, the headers

for section in soup.find_all("section", class_="search-item"):
    title=str(section.table.contents[3].contents[1].contents[3].get_text()).replace('\n\n','|').strip()
    print(title)
    published=str(section.table.contents[3].contents[3].contents[3].get_text()).strip()
    #print(published.splitlines())
    #published='|'.join(published.splitlines())
    #print(published)
    searchResults.append([title,published])
    #rowSoup = BeautifulSoup(section, 'html.parser')
    #print(rowSoup)
        
print(searchResults)

In [None]:
with open('someOutputFile.csv','w') as myCSV:
    myCSVwriter = csv.writer(myCSV) # full details at https://docs.python.org/3/library/csv.html
    for list in searchResults:
        myCSVwriter.writerow(list)

In [None]:
test = "     "
print(len(test.strip()))

In [9]:
"".splitlines()

[]