We'll scrape this page [GW Schedule of Classes](https://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=202303).  to extract some course information. We'll use Python tools to do this.

### Step 1: Download the content

We'll use a Python library called `requests` to retrieve the HTML of this and other pages from the web.

In [2]:
import requests

To scrape page, we start with a URL. In this case, the URL contains **parameters**, which tell the server to return specific kinds of information depending on the values of the parameters (campus and semester). ?

In [3]:
depts_url = 'https://my.gwu.edu/mod/pws/subjects.cfm'
params = {'campus_id': '1', # Main Campus
            'term_id': '202303'}

With our URL and params in place, we can make our first request.

In [4]:
depts_page = requests.get(depts_url, params=params) #depts_page is an HTTP response object

In [5]:
depts_page

<Response [200]>

The content of the page -- the HTML -- is available under the `.text` property of the response object.

In [6]:
depts_page.text

'\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"\n  "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">\n<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" version="XHTML+RDFa 1.0" dir="ltr"\n  xmlns:og="http://ogp.me/ns#"\n  xmlns:fb="http://www.facebook.com/2008/fbml"\n  xmlns:content="http://purl.org/rss/1.0/modules/content/"\n  xmlns:dc="http://purl.org/dc/terms/"\n  xmlns:foaf="http://xmlns.com/foaf/0.1/"\n  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"\n  xmlns:sioc="http://rdfs.org/sioc/ns#"\n  xmlns:sioct="http://rdfs.org/sioc/types#"\n  xmlns:skos="http://www.w3.org/2004/02/skos/core#"\n  xmlns:xsd="http://www.w3.org/2001/XMLSchema#">\n\n<head profile="http://www.w3.org/1999/xhtml/vocab">\n  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n  <link rel="shortcut icon" href="/images/favicon.ico" type="image/vnd.microsoft.icon" />\n  <meta name="keywords" content="

We'll use a library call [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) to parse the HTML of this page. 

In [7]:
from bs4 import BeautifulSoup

In [8]:
soup = BeautifulSoup(depts_page.text, features="html.parser")

We can use the nested structure of HTML to target elements that are **children** (nested under) other elements. So if we want to retrieve all hyperlinks (`<a>` tags) inside the `<div>` with the `class` attribute value of `subjectsMain`, we can write the following:

In [9]:
soup.find("div", class_="subjectsMain").find_all("a")

[<a href="courses.cfm?campId=1&amp;termId=202401&amp;subjId=ACA">Academy for Classical Acting</a>,
 <a href="courses.cfm?campId=1&amp;termId=202401&amp;subjId=ACCY">Accountancy</a>,
 <a href="courses.cfm?campId=1&amp;termId=202401&amp;subjId=AFST">Africana Studies</a>,
 <a href="courses.cfm?campId=1&amp;termId=202401&amp;subjId=AMST">American Studies</a>,
 <a href="courses.cfm?campId=1&amp;termId=202401&amp;subjId=ANAT">Anatomy and Cell Biology</a>,
 <a href="courses.cfm?campId=1&amp;termId=202401&amp;subjId=ANTH">Anthropology</a>,
 <a href="courses.cfm?campId=1&amp;termId=202401&amp;subjId=APSC">Applied Science</a>,
 <a href="courses.cfm?campId=1&amp;termId=202401&amp;subjId=ARAB">Arabic</a>,
 <a href="courses.cfm?campId=1&amp;termId=202401&amp;subjId=ASTR">Astronomy</a>,
 <a href="courses.cfm?campId=1&amp;termId=202401&amp;subjId=BIOC">Biochemistry</a>,
 <a href="courses.cfm?campId=1&amp;termId=202401&amp;subjId=BISC">Biological Sciences</a>,
 <a href="courses.cfm?campId=1&amp;termId

The `find_all` method returns a Python list of BeautifulSoup objects, one for each HTML element that matches our query. 

In web scraping, we usually don't care about the HTML elements themselves; we want the text or data inside them. In this case, let's extract all of the `href` values, since each `href` is a link to a page listing the courses in a given department.

In [10]:
links = [l['href'] for l in soup.find("div", class_="subjectsMain").find_all("a")]

In [11]:
links = set(links)

Note that `links` is a list of Python strings, each of which corresponds to a single URL. 

We use the Python `set()` function to dedupe the list of links -- always a good idea when scraping data from the web, since you never know what might be duplicated. 

The URL's in `links` are not complete: note that they're missing the `gwu.edu` domain. But we can use them to reconstruct a complete URL to a course listings page by simply appending the string `'https://my.gwu.edu/mod/pws/'` to the beginning of each URL in `links`.

That way, we can automate scraping all the course schedule pages for the Fall 2023 semester, main campus. 

But before we proceed, it behooves us to look at the site's **robots.txt** file. Since the `mod/pws` directory is allowed, we can proceed to scrape knowing that we're not in violation of the website's policies.

In [12]:
course_listings = []
for link in links:
    course_page = requests.get('https://my.gwu.edu/mod/pws/' + link)
    course_listings.append(course_page.text)

Now that we have downloaded the HTML for all the course listings, we could use BeautifulSoup to extract the course schedules from each HTML document.

In [13]:
course_page = BeautifulSoup(course_listings[0], features="html.parser")

Parse the HTML for the page

In [14]:
tables = course_page.find_all('table', class_='courseListing')

Extract all the `<table>` elements that have the `courseListing` class attribute. (There should be one per course).

Now we want to extract the text from the relevant cells from each table. 

Within the `for` loop below, the `table.find('tr')` method call extracts the first row from each table, and the call to `find_all('td')` extracts all of the cells within that row. 

Then we use Python indexing to target the particular table cells we want, and access the `text` attribute to extract the text from those elements. (Remember that in web scraping, we usually want to extract the text from elements and attributes; we don't really care about the elements or attributes themselves, except insofar as they lead us to the desired text, which is ultimately the content of the web apge.)

We wrap each collection of course information in a Python dictionary and `append` that dictionary to a list.

In [15]:
courses = []
for table in tables:
    cells = table.find('tr').find_all('td')
    course = {'course_code': cells[2].text.split(),
            'section': cells[3].text,
            'title': cells[4].text,
            'times': cells[8].text.split('AND')}
    courses.append(course)

Now `courses` should contain the information from the course listings for the first department in our list.

In [16]:
courses

[{'course_code': ['EHS', '1002'],
  'section': '10',
  'title': 'CPR and First Aid',
  'times': ['S08:00AM - 05:00PM']},
 {'course_code': ['EHS', '1002'],
  'section': '11',
  'title': 'CPR and First Aid',
  'times': ['U08:00AM - 05:00PM']},
 {'course_code': ['EHS', '1002'],
  'section': '12',
  'title': 'CPR and First Aid',
  'times': ['S10:00AM - 05:00PM']},
 {'course_code': ['EHS', '1040'],
  'section': '10',
  'title': 'Emergency Medical Tech-Basic',
  'times': ['T06:30PM - 10:30PM']},
 {'course_code': ['EHS', '1041'],
  'section': '11',
  'title': 'EMT - Basic Lab',
  'times': ['W01:00PM - 05:00PM']},
 {'course_code': ['EHS', '1041'],
  'section': '12',
  'title': 'EMT - Basic Lab',
  'times': ['W06:30PM - 10:30PM']},
 {'course_code': ['EHS', '1041'],
  'section': '13',
  'title': 'EMT - Basic Lab',
  'times': ['R01:00PM - 05:00PM']},
 {'course_code': ['EHS', '1041'],
  'section': '14',
  'title': 'EMT - Basic Lab',
  'times': ['R06:30PM - 10:30PM']},
 {'course_code': ['EHS', '105

Now we can repeat this process for each page in the course_listings list.

It will be cleaner to refactor the code above into a Python function, so that we can simply call that function once per page.

In [17]:
def scrape_course_info(page):
    soup = BeautifulSoup(page, features="html.parser")
    tables = soup.find_all('table', class_='courseListing')
    courses = []
    for table in tables:
        cells = table.find('tr').find_all('td')
        course = {'course_code': cells[2].text.split(),
                'section': cells[3].text,
                'title': cells[4].text,
                'times': cells[8].text.split('AND')}
        courses.append(course)
    return courses

Now we can loop over `course_listings`, calling our function on each pass through the list. We use the `extend` method to add the results of each call to `scrape_course_info` to one big list, which will hold the course info for all course pages (across all departments).

In [18]:
all_courses = []
for listing in course_listings:
    courses = scrape_course_info(listing)
    all_courses.extend(courses)

In [19]:
len(all_courses)

1652

Now we have the information for 1,652 courses!

We already have the first page of each set of course schedules -- in our `course_listings` variable above. We also have the URL of each of these pages in our `links` variable. (The length and order of both lists should be the same, since we visited each link in order to obtain the pages in `course_listings`, and we haven't reordered or changed either list. 

So for each page in `course_listings`, we can do the following:

1. Extract any links on the page having for their destination `javascript:goToPage`. (That will be in the `href` attribute of the `<a>` tag.)
2. For all links on the page where the argument to `goToPage` is greater than `1` (we already have the first page, so we don't need to get it again), visit the page URL (stored in the `links` list), passing the page number in a `POST` request.
3. Scrape these pages and add them to our `courses` list, using our function defined above.

In [20]:
import re
# The Python zip() function is a handy tool for looping over multiple lists in parallel
for page, url in zip(course_listings, links):
    # Since course_listings contains the raw HTML of each page, we need to parse it first
    page_html = BeautifulSoup(page, features="html.parser")
    # Find all a tags that match a given string
    page_nums =  page_html.find_all('a', href=re.compile('javascript:goToPage')) 
    # Loop over each matching element
    for page_num in page_nums:
        # We don't need to get the first page (we already have it)
        if page_num.text != '1':
            page_data = requests.post('https://my.gwu.edu/mod/pws/' + url, 
                              headers={'Content-Type': 'application/x-www-form-urlencoded'}, # This special head tells the server to expect form data
                              data=f"pageNum={page_num.text}")  # Because it's form-urlencoded, we pass the data as a string
            next_page_courses = scrape_course_info(page_data.text)
            all_courses.extend(next_page_courses)

In [23]:
print(all_courses)

[{'course_code': ['EHS', '1002'], 'section': '10', 'title': 'CPR and First Aid', 'times': ['S08:00AM - 05:00PM']}, {'course_code': ['EHS', '1002'], 'section': '11', 'title': 'CPR and First Aid', 'times': ['U08:00AM - 05:00PM']}, {'course_code': ['EHS', '1002'], 'section': '12', 'title': 'CPR and First Aid', 'times': ['S10:00AM - 05:00PM']}, {'course_code': ['EHS', '1040'], 'section': '10', 'title': 'Emergency Medical Tech-Basic', 'times': ['T06:30PM - 10:30PM']}, {'course_code': ['EHS', '1041'], 'section': '11', 'title': 'EMT - Basic Lab', 'times': ['W01:00PM - 05:00PM']}, {'course_code': ['EHS', '1041'], 'section': '12', 'title': 'EMT - Basic Lab', 'times': ['W06:30PM - 10:30PM']}, {'course_code': ['EHS', '1041'], 'section': '13', 'title': 'EMT - Basic Lab', 'times': ['R01:00PM - 05:00PM']}, {'course_code': ['EHS', '1041'], 'section': '14', 'title': 'EMT - Basic Lab', 'times': ['R06:30PM - 10:30PM']}, {'course_code': ['EHS', '1058'], 'section': '10', 'title': 'EMT Instructor Developme