# Scraping Data From Canadiana Collections

You can build large data sets related to Ottawa GLAM institutions by searching the [Canadiana digital collections](https://www.canadiana.ca/). Because Canadiana's search engine lacks built-in filtering tools, you may have to do additional searching and cleaning after scraping to verify that the data meets your criteria, but that's much easier when all of the potentially relevant information is already in one place!

For this example, we're creating a small data set of documents about some of Ottawa's 19th-century sporting clubs!

In [None]:
%pip install beautifulsoup4
%pip install requests
%pip install wget

# the 'lxml' parser used below; you might need to install it separately (e.g. on macOS)
%pip install lxml

%pip install pandas

import pandas as pd
import requests
from bs4 import BeautifulSoup
import lxml
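
The search URL in the next cell is written out by hand, with the query already percent-encoded. As a minimal sketch of where that string comes from, the standard library's `urllib.parse.urlencode` can build it from the readable query. `build_search_url` is a hypothetical helper name, and the `q0.0` parameter and the path layout are simply copied from the URL used in this notebook, not taken from any Canadiana documentation.

```python
from urllib.parse import urlencode

BASE = "https://www.canadiana.ca/search/general/"

def build_search_url(page, query):
    """Return the search URL for one page of results (hypothetical helper)."""
    # safe="()" leaves the parentheses unescaped so the result matches the
    # hand-written URL in this notebook; spaces become "+" and quotes "%22"
    params = urlencode({"q0.0": query}, safe="()")
    return BASE + str(page) + "?" + params

example_url = build_search_url(1, 'su:"Ottawa (Ont.) -- Clubs."')
print(example_url)
```

Swapping in a different query string should then only require changing the second argument, though it's worth comparing the result against a URL copied from the browser.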

In [None]:
# search url, with {} as a placeholder for the page number
base_url = "https://www.canadiana.ca/search/general/{}?q0.0=su%3A%22Ottawa+(Ont.)+--+Clubs.%22"

# The number of the last page of your results
# on most websites 'if response.status_code != 200:' would tell you when to stop, but for some reason the Canadiana website will let you just... keep going to the next page infinitely even when there are no results
last_page = 2

data = []

for page in range(1, last_page + 1):
    response = requests.get(base_url.format(page))

    pghtml = response.text

    soup = BeautifulSoup(pghtml, 'lxml')

    all_cards = soup.find_all("div", attrs={"class": "card"})

    # because of how the page html is formatted, we first make a list of dictionaries, where each dictionary is one search result
    for card in all_cards:
        col = {}
        dt = card.find_all("dt")
        dd = card.find_all("dd")
        # zip pairs each label (dt) with its value (dd) and stops at the shorter list
        for label, value in zip(dt, dd):
            # strips extra whitespace before and after text
            dd_mod = value.text.strip()
            # replaces any "newline" characters that mess with formatting
            dd_mod = dd_mod.replace("\n", " ")
            # removes any tabs that mess with formatting
            dd_mod = dd_mod.replace("\t", "")
            col[label.text] = dd_mod
        data.append(col)

    # this just lets you know which page was just scraped (handy for big searches)
    print('\n' + str(page) + '\n')
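
Because Canadiana keeps serving pages past the end of the results, the cut-off in the cell above has to be set by hand. An alternative, sketched below, is to stop as soon as a page yields no records. This assumes that a page past the end contains no `card` elements, which is worth checking against your own search first. `scrape_until_empty`, `fetch_page`, and `parse_page` are hypothetical names: `fetch_page` would wrap `requests.get(...).text` and `parse_page` would wrap the card-parsing loop from the cell above.

```python
def scrape_until_empty(fetch_page, parse_page, max_pages=500):
    """Collect records page by page, stopping at the first empty page.

    fetch_page(page) -> HTML string; parse_page(html) -> list of dicts.
    max_pages is a safety cap in case the empty-page assumption is wrong.
    """
    records = []
    for page in range(1, max_pages + 1):
        page_records = parse_page(fetch_page(page))
        if not page_records:
            break
        records.extend(page_records)
    return records


# Offline stand-ins so the sketch runs without touching the network:
# two "pages" of results, then nothing.
fake_site = {1: "a,b", 2: "c"}

def fake_fetch(page):
    return fake_site.get(page, "")

def fake_parse(html):
    return [{"Title": t} for t in html.split(",") if t]

results = scrape_until_empty(fake_fetch, fake_parse)
print(len(results))
```

Passing the fetching and parsing steps in as functions keeps the stopping logic easy to try out on made-up pages before pointing it at the real site.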

In [None]:
# convert your results into a data frame...
df = pd.DataFrame(data)
df

In [None]:
# ...that you can now export as a .csv! we love pandas
df.to_csv('example.csv', index=False)
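
If the results still include items that don't fit your criteria, one way to narrow them down is to filter the list of dictionaries by keyword and then rebuild the data frame. This is only a minimal sketch: `keep_matching` is a hypothetical helper, and the sample records and their `Title` field are made up for illustration, not taken from Canadiana.

```python
def keep_matching(records, keywords):
    """Keep records where any field contains any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    kept = []
    for record in records:
        # join all of a record's values into one lowercase string to search
        text = " ".join(str(v) for v in record.values()).lower()
        if any(k in text for k in wanted):
            kept.append(record)
    return kept


sample = [
    {"Title": "Constitution of a curling club"},
    {"Title": "Annual report of a literary society"},
    {"Title": "Rules of a Rowing Club"},
]

sporting = keep_matching(sample, ["curling", "rowing"])
print([r["Title"] for r in sporting])
```

With the real results, `pd.DataFrame(keep_matching(data, ["curling", "rowing"]))` would give a filtered data frame to export instead.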

A key part of this data set is the column containing the URLs of all the items. If you want to [download them](https://github.com/ChantalMB/HIST4916-Workbook/blob/master/scrapers/canadiana-item-scraper.ipynb), you already have the list you need to do so in an automated way!
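
As a rough sketch of getting that list back out of the exported file, the URLs can be collected with the standard library's `csv` module. The name of the URL column isn't shown here, so this version simply gathers every value that looks like a web address. `collect_urls` is a hypothetical helper, and the sample CSV text and its addresses are placeholders.

```python
import csv
import io

def collect_urls(csv_file):
    """Return every cell value that starts with http:// or https://."""
    urls = []
    for row in csv.DictReader(csv_file):
        for value in row.values():
            # skip empty cells and anything that isn't plain text
            if isinstance(value, str) and value.startswith(("http://", "https://")):
                urls.append(value)
    return urls


# Placeholder data standing in for example.csv
sample_csv = io.StringIO(
    "Title,Link\n"
    "First item,https://example.org/item/1\n"
    "Second item,https://example.org/item/2\n"
)

item_urls = collect_urls(sample_csv)
print(item_urls)
```

For the real file, opening it with `open('example.csv', newline='')` and passing the file object to `collect_urls` should work the same way.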