# In-Class Activity: Web Scraping with Beautiful Soup

Today we will learn how to:
* get HTML from a web page
* parse that data using BeautifulSoup
* navigate through the soup
* deal with some hairy complications
* put it all together to build a brand new dataset

A very big thank you to Brian Keegan (my advisor!) and the materials in his [Web Data Scraping course](https://github.com/CU-ITSS/Web-Data-Scraping-S2021). Check that out if you want to dig deeper ;) 

In [1]:
# Import packages we've been using all semester long :)
import numpy as np
import pandas as pd

We'll be using two new packages today, [requests](https://pypi.org/project/requests/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

Try running the two import lines below. They *should* have come bundled with your Anaconda installation. But if they don't work, you may need to first run:

    pip install requests
    pip install beautifulsoup4
    
If you can't install the packages, I've created a [Google Colab](https://colab.research.google.com/drive/1JCbFis4D7ZhNpTd3qN-rpf-Ee-Ut5yA9?usp=sharing) of this example you can use to follow along.

In [2]:
# Requests lets us grab info from a web page
import requests

# BeautifulSoup parses and searches that info
from bs4 import BeautifulSoup


We'll start with the [Whitman Economics Faculty page](https://www.whitman.edu/academics/majors-and-minors/economics/faculty). Our task will be to get a list of all the faculty and all the information we have about them.

Before we start, go ahead and open this page in your web browser.

### Part 1: Getting the HTML and turning it into "Soup"

In [3]:
# Create a variable called url
url = "https://www.whitman.edu/academics/majors-and-minors/economics/faculty"

# Use requests.get to grab the html
html = requests.get(url)

# Examine it -- what does it look like? What format is it?
print(html.text)

<!DOCTYPE html>
<html class="no-js" lang="en">
<head>
            <!-- Google Tag Manager -->
            <script>
                (function (w, d, s, l, i) {
                    w[l] = w[l] || []; w[l].push({
                        'gtm.start':
                            new Date().getTime(), event: 'gtm.js'
                    }); var f = d.getElementsByTagName(s)[0],
                        j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : ''; j.async = true; j.src =
                            'https://www.googletagmanager.com/gtm.js?id=' + i + dl; f.parentNode.insertBefore(j, f);
                })(window, document, 'script', 'dataLayer', 'GTM-PN449F');</script>
            <!-- End Google Tag Manager -->



            <meta name="APIKey" content="SnazzyMap:AIzaSyA8pm3DLjUEic2ds_HPBH_VcDJwT84D6uY" />
            <meta name="APIKey" content="GoogleMaps:AIzaSyBBuhapIg-2rzMPiPltgp9gGP4rOXSV0B4" />
            <meta name="APIKey" content="GoogleDrive:AI

In [4]:
# Now let's turn it into soup
# The first argument is our HTML, which we can access with .text
# The second argument is our parser -- 'html' lets bs4 pick an installed HTML parser; 'lxml' or 'html.parser' names one explicitly
soup = BeautifulSoup(html.text, 'html')
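
Before parsing, it can help to confirm that the request actually succeeded, since a failed request may still hand back some text (an error page, for instance). The sketch below is one possible way to do that. `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice rather than anything this page requires.

```python
import requests


def fetch_html(url, timeout=10):
    """Return a page's HTML as a string, or raise an error if the request failed.

    fetch_html is a hypothetical helper for this activity, not part of requests.
    """
    # timeout stops the call from waiting forever on a server that never answers
    response = requests.get(url, timeout=timeout)

    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()

    return response.text
```

For example, `html_text = fetch_html(url)` could stand in for the `requests.get(url)` step above, with `html_text` passed straight to `BeautifulSoup`.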

### Part 2: Doing some basic searching with BeautifulSoup

Raw HTML is not very fun for a human to read. This is where BeautifulSoup gets *really* powerful: now we can navigate through the soup programmatically, and very easily!

#### We can use .find to show the first instance of any tag

In [5]:
# The first link (a tag)
soup.find("a")

<a class="wc-skip-to-content-link" href="/academics/majors-and-minors/economics/faculty#wc-reader-content">Skip to main content</a>

In [6]:
# The first h1
soup.find("h1")

<h1 class="wc-page-title">Faculty</h1>

In [7]:
# The first image
soup.find("img")

<img alt="Clocktower Logo" class="wc-desktop-header-logo" onerror="this.src='/prebuilt/v9/images/misc/whitman-logo-header.png'" src="/prebuilt/v9/images/svg/whitman-logo-header.svg"/>

In [8]:
# The first unordered list
soup.find("ul")

<ul role="menubar">
<li>
<a class="wc-clickable" data-tracking-action="Click" data-tracking-category="Header Link" data-tracking-label="Constituent Link - Apply" href="/admission-and-aid/applying-to-whitman" title="Apply">
        Apply</a>
<span aria-hidden="true" class="wc-bullet">•</span>
</li>
<li>
<a class="wc-clickable" data-tracking-action="Click" data-tracking-category="Header Link" data-tracking-label="Constituent Link - Alumni" href="/alumni" title="Alumni">
        Alumni</a>
<span aria-hidden="true" class="wc-bullet">•</span>
</li>
<li>
<a class="wc-clickable" data-tracking-action="Click" data-tracking-category="Header Link" data-tracking-label="Constituent Link - Diversity" href="/campus-life/diversity-and-inclusion" title="Diversity">
        Diversity</a>
<span aria-hidden="true" class="wc-bullet">•</span>
</li>
<li>
<a class="wc-clickable" data-tracking-action="Click" data-tracking-category="Header Link" data-tracking-label="Constituent Link - Library" href="https://lib

#### We can use .find_all() to find all the instances of a tag
Note how it returns them as a list

In [9]:
# Find all the links:
links = soup.find_all("a")

In [10]:
# How many links (a tags) are there in this page?
print(len(links))

281


In [11]:
print(links)

[<a class="wc-skip-to-content-link" href="/academics/majors-and-minors/economics/faculty#wc-reader-content">Skip to main content</a>, <a class="wc-header-logo-link" href="/" title="Whitman College home">
<!-- Desktop Logo -->
<img alt="Clocktower Logo" class="wc-desktop-header-logo" onerror="this.src='/prebuilt/v9/images/misc/whitman-logo-header.png'" src="/prebuilt/v9/images/svg/whitman-logo-header.svg"/>
<!-- Mobile Logo-->
<svg class="wc-mobile-header-logo" viewbox="0 0 180 15">
<use xlink:href="/prebuilt/v9/dist/svg/svg-defs.svg?v=12#logo-whitman-nc-flat"></use>
</svg>
</a>, <a class="wc-clickable" data-tracking-action="Click" data-tracking-category="Header Link" data-tracking-label="Constituent Link - Apply" href="/admission-and-aid/applying-to-whitman" title="Apply">
        Apply</a>, <a class="wc-clickable" data-tracking-action="Click" data-tracking-category="Header Link" data-tracking-label="Constituent Link - Alumni" href="/alumni" title="Alumni">
        Alumni</a>, <a cla

#### We can use .text or .get() to extract information from a tag

In [12]:
# Print the text of the links (a tags)
for l in links:
    print(l.text)

Skip to main content









        Apply

        Alumni

        Diversity

        Library

        MyWhitman

        Families

        Make A Gift

        Bias Reporting

        Bookstore

        CARE Team

        Career Services

        Communications

        Employment Opportunities

        Giving

        Grievance Policy

        Newsroom

        Nondiscrimination Policy

        Right to Know

        Sexual Misconduct & Title IX

        Social Media 

        The Center for Writing and Speaking (COWS)

        Website Privacy Policy

        Welty Student Health Center





											A to Z Index

										






											Map

										






											Events Calendar

										






											Penrose Library

										






											myWhitman
													









												A to Z 

											






												Map 

											






												Events 

											






												Library 

			

In [13]:
# Get the URL of an a tag using .get('href')
for l in links:
    print(l.get('href'))

/academics/majors-and-minors/economics/faculty#wc-reader-content
/
/admission-and-aid/applying-to-whitman
/alumni
/campus-life/diversity-and-inclusion
https://library.whitman.edu/
https://my.whitman.edu/
/families
https://www.givecampus.com/campaigns/7878/donations/new
https://whitman-advocate.symplicity.com/public_report/index.php
http://bookstore.whitman.edu/home.aspx
/dean-of-students/care-team
/after-whitman/career-and-community-engagement-center/job-and-career-resources
/communications
/human-resources
/giving
/human-resources/grievance-policy
/newsroom
/campus-life/diversity-and-inclusion/nondiscrimination-policy
/dean-of-students/right-to-know
/campus-life/diversity-and-inclusion/title-ix-and-sexual-misconduct
/communications/social-media
/academics/the-center-for-writing-and-speaking
/website-privacy-policy
/welty-student-health-center
a-to-z-index
campus-map
http://calendar.whitman.edu
https://library.whitman.edu/
https://my.whitman.edu
a-to-z-index
campus-map
http://calendar.
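
The link text printed earlier is padded with a lot of whitespace, and not every `a` tag is guaranteed to have an `href`. The sketch below shows one way to handle both, using a small made-up HTML snippet (`sample_html`) rather than the live page, so the names and markup here are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A made-up snippet shaped like the header links on the page
sample_html = """
<ul>
  <li><a href="/alumni" title="Alumni">
        Alumni</a></li>
  <li><a name="top">No link target here</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for a in sample_soup.find_all("a"):
    # get_text(strip=True) trims the whitespace around the text
    label = a.get_text(strip=True)

    # .get() returns None (instead of raising an error) when the attribute is missing
    href = a.get("href")

    rows.append((label, href))

for label, href in rows:
    print(label, "->", href)
```

Because `.get()` quietly returns `None` for a missing attribute, any code that goes on to use the `href` should check for `None` first.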

What do you notice about the URLs above? Can you paste them all into a web browser? Can you use requests.get() to get their HTML?

No!

Some of these are _relative_ links. 

How would we turn them all into URLs of the format:
"https://whitman.edu/stuff"?

### Challenge: Take the list of links and use it to make a list of URLs in the format above


In [14]:
# YOUR CODE HERE



# Don't peek at the code below if you want to try on your own


# Hints: 
# Skip the ones that are empty or that are in the form of 'mailto'
# Think about the different cases, for example:
# /alumni
# http://bookstore.whitman.edu/home.aspx
# about





In [15]:
# One potential solution:

# This code turns all of the links into a list of URLs that we could follow:

# Create an empty list
list_of_URLs = []

# Iterate through all of the links
for l in links:

    # Look up the href once so we don't have to repeat the call
    href = l.get('href')

    # we want to ignore the ones that are empty or are 'mailto' links
    if (href is None) or (href == "") or ('mailto' in href):
        pass

    # these are the cases where they start with http
    elif href.startswith("http"):

        # append the link as is
        list_of_URLs.append(href)

    # these are the cases where the href starts with "/"
    elif href.startswith("/"):
        list_of_URLs.append("https://whitman.edu" + href)

    # these are the cases where the href doesn't start with '/'
    else:
        list_of_URLs.append("https://whitman.edu/" + href)

In [16]:
list_of_URLs

['https://whitman.edu/academics/majors-and-minors/economics/faculty#wc-reader-content',
 'https://whitman.edu/',
 'https://whitman.edu/admission-and-aid/applying-to-whitman',
 'https://whitman.edu/alumni',
 'https://whitman.edu/campus-life/diversity-and-inclusion',
 'https://library.whitman.edu/',
 'https://my.whitman.edu/',
 'https://whitman.edu/families',
 'https://www.givecampus.com/campaigns/7878/donations/new',
 'https://whitman-advocate.symplicity.com/public_report/index.php',
 'http://bookstore.whitman.edu/home.aspx',
 'https://whitman.edu/dean-of-students/care-team',
 'https://whitman.edu/after-whitman/career-and-community-engagement-center/job-and-career-resources',
 'https://whitman.edu/communications',
 'https://whitman.edu/human-resources',
 'https://whitman.edu/giving',
 'https://whitman.edu/human-resources/grievance-policy',
 'https://whitman.edu/newsroom',
 'https://whitman.edu/campus-life/diversity-and-inclusion/nondiscrimination-policy',
 'https://whitman.edu/dean-of-s
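
Python's standard library can also do this joining for us. The sketch below uses `urllib.parse.urljoin` on a handful of href values like the ones we saw above. `page_url`, `sample_hrefs`, and `absolute_urls` are names made up for this illustration.

```python
from urllib.parse import urljoin

# The page we scraped; relative links are resolved against it
page_url = "https://www.whitman.edu/academics/majors-and-minors/economics/faculty"

# A few href values of the kinds that appear in the list of links
sample_hrefs = [
    "/alumni",
    "http://bookstore.whitman.edu/home.aspx",
    "a-to-z-index",
    "mailto:hazlett@whitman.edu",
    None,
]

absolute_urls = []
for href in sample_hrefs:
    # Skip missing or empty hrefs and email links
    if not href or href.startswith("mailto:"):
        continue

    # urljoin leaves absolute URLs alone and resolves relative ones
    absolute_urls.append(urljoin(page_url, href))

for u in absolute_urls:
    print(u)
```

Note that `urljoin` resolves `a-to-z-index` relative to the folder of the current page, the way a browser normally would, so it gives a URL ending in `/economics/a-to-z-index` rather than the site-root URL built by the solution above. Which one is right depends on how the site is set up, so it is worth clicking the link in your browser to check.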

Why is this useful? We can use them to do more scraping -- we could programmatically visit every page on the Whitman.edu website this way.

(Side note: This is fundamentally how [web crawlers](https://en.wikipedia.org/wiki/Web_crawler) work -- which are the underlying technology behind search engines!)

### Part 3: Targeting our scraping
We are interested in the Faculty and their information. How would we get that?

First, let's inspect this page in our web browsers and see how the Faculty info is formatted.

In [17]:
# Start by finding all of the 'h2' tags
soup.find_all('h2')

[<h2 class="wc-off-screen">Section Navigation</h2>,
 <h2>Academic Requirements</h2>,
 <h2>Combined Majors</h2>,
 <h2>Related Links</h2>,
 <h2 class="wc-profile-name">Denise Hazlett</h2>,
 <h2 class="wc-profile-name">Halefom Belay</h2>,
 <h2 class="wc-profile-name">Jan Crouter</h2>,
 <h2 class="wc-profile-name">Ruoning Han</h2>,
 <h2 class="wc-profile-name">Sai Madhurika Mamunuru</h2>,
 <h2 class="wc-profile-name">Marian Manic</h2>,
 <h2 class="wc-profile-name">Rosie Mueller</h2>,
 <h2 class="wc-profile-name">Jason Ralston</h2>,
 <h2 class="wc-profile-name">Sied Hassen Mohamed</h2>,
 <h2 class="wc-sub-header">Retired</h2>]

Note that all of the faculty names have the same class, 'wc-profile-name'. We can use this to target our soup search.

In [18]:
# Select h2 tags that have the 'wc-profile-name' class
faculty = soup.find_all('h2', {"class": "wc-profile-name"})

In [19]:
# Let's print out the faculty names
for f in faculty:
    print(f.text)

Denise Hazlett
Halefom Belay
Jan Crouter
Ruoning Han
Sai Madhurika Mamunuru
Marian Manic
Rosie Mueller
Jason Ralston
Sied Hassen Mohamed
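
There is more than one way to write a class-based search. The sketch below runs three equivalent forms on a small made-up snippet (`sample_html`) that mimics the headings on the faculty page. It is an illustration of the options, not a claim about which one is best.

```python
from bs4 import BeautifulSoup

# A made-up snippet that mimics the h2 headings on the faculty page
sample_html = """
<h2>Related Links</h2>
<h2 class="wc-profile-name">Denise Hazlett</h2>
<h2 class="wc-profile-name">Halefom Belay</h2>
<h2 class="wc-sub-header">Retired</h2>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Dictionary form, as used in the cell above
by_dict = sample_soup.find_all("h2", {"class": "wc-profile-name"})

# Keyword form: class_ has a trailing underscore because class is a Python keyword
by_keyword = sample_soup.find_all("h2", class_="wc-profile-name")

# CSS selector form: the tag name, a dot, then the class name
by_selector = sample_soup.select("h2.wc-profile-name")

names = [h2.text for h2 in by_selector]
print(names)
```

All three return the same two headings here, so the choice mostly comes down to which one you find easiest to read.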


#### Using .find_next() to move around our HTML
Let's go back to our inspector. Notice how the Faculty info is stored... 

It's in an unordered list that comes after the h2 tags. How can we access that?

BeautifulSoup has other useful tools, like .find_next(), which moves forward to the next tag in the document. This works because the HTML is structured like a tree!

In [20]:
# From the first faculty tag, find the next tag:
faculty[0].find_next()

# What do you notice?

<h3 class="wc-profile-title">Hollon Parker Professor of Economics and Business, Chair</h3>

In [21]:
# And the one after that
faculty[0].find_next().find_next()

<div class="wc-profile-info-items">
<div class="wc-info-item">
<div class="wc-icon-box">
<svg viewbox="0 0 32 32">
<use xlink:href="/prebuilt/v9/dist/svg/svg-defs.svg?v=12#icon-email"></use>
</svg>
</div>
<div class="wc-text-box">
<a class="wc-text" href="mailto:hazlett@whitman.edu">hazlett@whitman.edu</a>
</div>
</div>
<div class="wc-info-item">
<div class="wc-icon-box">
<svg viewbox="0 0 32 32">
<use xlink:href="/prebuilt/v9/dist/svg/svg-defs.svg?v=12#icon-map-pin"></use>
</svg>
</div>
<div class="wc-text-box">
<span class="wc-text">Maxey Hall 224</span>
</div>
</div>
<div class="wc-info-item">
<div class="wc-icon-box">
<svg viewbox="0 0 32 32">
<use xlink:href="/prebuilt/v9/dist/svg/svg-defs.svg?v=12#icon-phone"></use>
</svg>
</div>
<div class="wc-text-box">
<span class="wc-text" itemprop="telephone">509-527-5155</span>
</div>
</div>
<div class="wc-info-item">
<div class="wc-icon-box">
<svg viewbox="0 0 32 32">
<use xlink:href="/prebuilt/v9/dist/svg/svg-defs.svg?v=12#icon-link-circl
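
Chaining `.find_next()` calls works, but `.find_next()` also accepts a tag name, which lets us jump straight to the next tag of a particular kind. The sketch below tries this on a made-up, simplified version of one faculty entry (`sample_html`); the real page has more nesting than this.

```python
from bs4 import BeautifulSoup

# A made-up, simplified version of one faculty entry
sample_html = """
<h2 class="wc-profile-name">Denise Hazlett</h2>
<h3 class="wc-profile-title">Hollon Parker Professor of Economics and Business, Chair</h3>
<div class="wc-profile-info-items">
  <div class="wc-text-box">
    <a class="wc-text" href="mailto:hazlett@whitman.edu">hazlett@whitman.edu</a>
  </div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
name_tag = sample_soup.find("h2", class_="wc-profile-name")

# Passing a tag name skips ahead to the next tag of that kind
title_tag = name_tag.find_next("h3")
email_tag = name_tag.find_next("a")

print(title_tag.text)
print(email_tag.text)
```

One caution: `.find_next("a")` keeps going until it finds *some* link, so if a faculty member had no email link it would return a link that belongs to whatever comes next on the page.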

### Challenge: Collect Data on the Economics Faculty
We now have everything we need to write some code that will:
* find all the faculty names
* find their titles
* find their email addresses
* store this info in a data frame

In [22]:
# YOUR CODE HERE

In [23]:
# Make a new dataframe
df_faculty = pd.DataFrame(columns=["Name", "Title", "Email"])

In [24]:
# Iterate over the faculty
for f in faculty:
    
    # Get the relevant info
    name = f.text
    title = f.find_next().text
    email = f.find_next().find_next().find('a').text
    
    # New row to add to the DataFrame
    new_row = {"Name": name, "Title": title, "Email":email}
    
    # Use .concat to add it
    df_faculty = pd.concat([df_faculty, pd.DataFrame([new_row])], ignore_index=True)

In [25]:
df_faculty

| | Name | Title | Email |
|---|---|---|---|
| 0 | Denise Hazlett | Hollon Parker Professor of Economics and Busin... | hazlett@whitman.edu |
| 1 | Halefom Belay | Associate Professor of Economics | belayh@whitman.edu |
| 2 | Jan Crouter | Associate Professor of Economics | crouter@whitman.edu |
| 3 | Ruoning Han | Assistant Professor of Economics | hanr@whitman.edu |
| 4 | Sai Madhurika Mamunuru | Assistant Professor of Economics (Sabbatical F... | sai@whitman.edu |
| 5 | Marian Manic | Associate Professor of Economics | manicm@whitman.edu |
| 6 | Rosie Mueller | Assistant Professor of Economics | muellerm@whitman.edu |
| 7 | Jason Ralston | Assistant Professor of Economics | ralstonj@whitman.edu |
| 8 | Sied Hassen Mohamed | Visiting Assistant Professor of Economics | hassenms@whitman.edu |
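
An alternative to growing the data frame one row at a time is to collect the rows in a plain list and build the data frame once at the end. The sketch below does that on a made-up snippet (`sample_html`) in which the second entry, "Example Person", is invented and deliberately has no email link, to show how a missing value could be handled.

```python
import pandas as pd
from bs4 import BeautifulSoup

# A made-up snippet: the second (invented) entry has no email link
sample_html = """
<div>
<h2 class="wc-profile-name">Denise Hazlett</h2>
<h3 class="wc-profile-title">Hollon Parker Professor of Economics and Business, Chair</h3>
<div class="wc-profile-info-items"><a href="mailto:hazlett@whitman.edu">hazlett@whitman.edu</a></div>
</div>
<div>
<h2 class="wc-profile-name">Example Person</h2>
<h3 class="wc-profile-title">Example Title</h3>
<div class="wc-profile-info-items"></div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for f in sample_soup.find_all("h2", class_="wc-profile-name"):
    title_tag = f.find_next("h3")
    info_box = f.find_next("div", class_="wc-profile-info-items")

    # Only search inside this person's info box, so we never grab someone else's link
    email_tag = info_box.find("a")

    rows.append({
        "Name": f.text,
        "Title": title_tag.text,
        # .find() returns None when nothing matches, so check before using .text
        "Email": email_tag.text if email_tag is not None else None,
    })

# Build the data frame once, from the full list of rows
df_sample = pd.DataFrame(rows, columns=["Name", "Title", "Email"])
print(df_sample)
```

Building the data frame once avoids copying it on every pass through the loop, which starts to matter when there are many rows.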


### Part 4: What if we wanted to write a web scraper to do this for all departments at Whitman?
What would our next steps be?

We would need to systematically visit all of the department pages and repeat this process.

How might we do that?

Let's start by visiting the Whitman Majors & Minors page. We can use the list of Majors and Minors to make a list of all of the department links -- as well as the department faculty pages.

In [26]:
# We'll start with the Whitman Majors & Minors page
majorminorURL = "https://www.whitman.edu/academics/majors-and-minors"

# Use requests.get to grab the html
majorminorHTML = requests.get(majorminorURL)

# Examine it -- what does it look like? What format is it?
print(majorminorHTML.text)

# Now let's turn it into soup
# The first argument is our HTML (remember, we need to do html.text to access it)
# The second argument is our parser -- this tells bs4 that it is reading HTML (and not some other format)
majorminorSoup = BeautifulSoup(majorminorHTML.text, 'html')

# Let's examine it... what do you notice?

<!DOCTYPE html>
<html class="no-js" lang="en">
<head>
            <!-- Google Tag Manager -->
            <script>
                (function (w, d, s, l, i) {
                    w[l] = w[l] || []; w[l].push({
                        'gtm.start':
                            new Date().getTime(), event: 'gtm.js'
                    }); var f = d.getElementsByTagName(s)[0],
                        j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : ''; j.async = true; j.src =
                            'https://www.googletagmanager.com/gtm.js?id=' + i + dl; f.parentNode.insertBefore(j, f);
                })(window, document, 'script', 'dataLayer', 'GTM-PN449F');</script>
            <!-- End Google Tag Manager -->



            <meta name="APIKey" content="SnazzyMap:AIzaSyA8pm3DLjUEic2ds_HPBH_VcDJwT84D6uY" />
            <meta name="APIKey" content="GoogleMaps:AIzaSyBBuhapIg-2rzMPiPltgp9gGP4rOXSV0B4" />
            <meta name="APIKey" content="GoogleDrive:AI

In [27]:
majorminor_links = majorminorSoup.find_all('a')

In [28]:
# Let's examine -- notice how the links to the majors and minors aren't in this list!!!
for l in majorminor_links:
    print(l.get('href'))

/academics/majors-and-minors#wc-reader-content
/
/admission-and-aid/applying-to-whitman
/alumni
/campus-life/diversity-and-inclusion
https://library.whitman.edu/
https://my.whitman.edu/
/families
https://www.givecampus.com/campaigns/7878/donations/new
https://whitman-advocate.symplicity.com/public_report/index.php
http://bookstore.whitman.edu/home.aspx
/dean-of-students/care-team
/after-whitman/career-and-community-engagement-center/job-and-career-resources
/communications
/human-resources
/giving
/human-resources/grievance-policy
/newsroom
/campus-life/diversity-and-inclusion/nondiscrimination-policy
/dean-of-students/right-to-know
/campus-life/diversity-and-inclusion/title-ix-and-sexual-misconduct
/communications/social-media
/academics/the-center-for-writing-and-speaking
/website-privacy-policy
/welty-student-health-center
a-to-z-index
campus-map
http://calendar.whitman.edu
https://library.whitman.edu/
https://my.whitman.edu
a-to-z-index
campus-map
http://calendar.whitman.edu
https:

The links to the majors and minors aren't in this list. What is going on?!?! Let's take a closer look at the HTML above.

Notice how the info we want is contained in JavaScript, which is another programming language that fuels the web.

Never fear!

JavaScript is contained inside of a `<script>` tag. So let's see if we can find it:

In [29]:
scripts = majorminorSoup.find_all("script")

In [30]:
# The particular chunk of JavaScript we are interested in is in the 3rd script tag:
print(scripts[2])

<script>
				(function(){
					var catList = [{id:'2642',name:'Majors'},{id:'373',name:'Minors'},{id:'818',name:'Other'}];
					var pageList = [{"xid":"x52057","title":"‘Forever Yuck,’ a Collaboration Rooted in Disgust","tags":"#2834,#2835,#370,#2418,#373,#2642,#658,#888,#252,#1045,#249,#250","url":"newsroom/armstrong-disgust-study"},{"xid":"x36361","title":"Anthropology","tags":"#339,#373,#2642","url":"academics/majors-and-minors/anthropology"},{"xid":"x40869","title":"Anthropology - Environmental Studies","tags":"#339,#2642","url":"academics/majors-and-minors/environmental-studies/environmental-studies-major/anthropology-environmental-studies"},{"xid":"x36373","title":"Art","tags":"#339,#373,#2642","url":"academics/majors-and-minors/art"},{"xid":"x49065","title":"Art - Environmental Studies","tags":"#339,#2642","url":"academics/majors-and-minors/environmental-studies/environmental-studies-major/art-environmental-studies"},{"xid":"x36397","title":"Art History","tags":"#339,#373,#26

What do you notice about the structure of it?

`pageList` looks like a list of dictionaries! We can work with that...

It is written in JSON (which is another data format we'll learn more about soon), so here is how we can pull it out of the script and parse it into Python objects:

In [31]:
# This is the piece we want!
good_stuff = scripts[2].text.split('[')[2].split(']')[0]

# Now turn it into JSON
import json

good_json = json.loads("[" + good_stuff + ']')

good_json

[{'xid': 'x52057',
  'title': '‘Forever Yuck,’ a Collaboration Rooted in Disgust',
  'tags': '#2834,#2835,#370,#2418,#373,#2642,#658,#888,#252,#1045,#249,#250',
  'url': 'newsroom/armstrong-disgust-study'},
 {'xid': 'x36361',
  'title': 'Anthropology',
  'tags': '#339,#373,#2642',
  'url': 'academics/majors-and-minors/anthropology'},
 {'xid': 'x40869',
  'title': 'Anthropology - Environmental Studies',
  'tags': '#339,#2642',
  'url': 'academics/majors-and-minors/environmental-studies/environmental-studies-major/anthropology-environmental-studies'},
 {'xid': 'x36373',
  'title': 'Art',
  'tags': '#339,#373,#2642',
  'url': 'academics/majors-and-minors/art'},
 {'xid': 'x49065',
  'title': 'Art - Environmental Studies',
  'tags': '#339,#2642',
  'url': 'academics/majors-and-minors/environmental-studies/environmental-studies-major/art-environmental-studies'},
 {'xid': 'x36397',
  'title': 'Art History',
  'tags': '#339,#373,#2642',
  'url': 'academics/majors-and-minors/art-history'},
 {'x
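
Splitting on `[` and `]` works here, but it would break if the script gained another list or if a title contained a square bracket. The sketch below shows one possible alternative using `json.JSONDecoder().raw_decode` from the standard library. `script_text` is a shortened stand-in for the real script, and its ending is made up, since the real output above is cut off.

```python
import json

# A shortened stand-in for the text of the script tag (the ending is made up)
script_text = """
(function(){
    var catList = [{id:'2642',name:'Majors'},{id:'373',name:'Minors'}];
    var pageList = [{"xid":"x36361","title":"Anthropology","tags":"#339,#373,#2642","url":"academics/majors-and-minors/anthropology"},{"xid":"x36373","title":"Art","tags":"#339,#373,#2642","url":"academics/majors-and-minors/art"}];
})();
"""

# Find where the list we want begins
marker = "var pageList = "
start = script_text.index(marker) + len(marker)

# raw_decode parses one JSON value beginning at `start` and ignores whatever follows it
pages, end = json.JSONDecoder().raw_decode(script_text, start)

print(pages[0]["title"], pages[0]["url"])
```

This only works for `pageList`. `catList` uses single quotes and unquoted keys, which is fine in JavaScript but is not valid JSON.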

In [32]:
good_json[0]["url"]

'newsroom/armstrong-disgust-study'

In [33]:
for item in good_json:
    print(item['url'])

newsroom/armstrong-disgust-study
academics/majors-and-minors/anthropology
academics/majors-and-minors/environmental-studies/environmental-studies-major/anthropology-environmental-studies
academics/majors-and-minors/art
academics/majors-and-minors/environmental-studies/environmental-studies-major/art-environmental-studies
academics/majors-and-minors/art-history
academics/majors-and-minors/astronomy
academics/majors-and-minors/biochemistry-biophysics-and-molecular-biology
academics/majors-and-minors/biology
academics/majors-and-minors/environmental-studies/environmental-studies-major/biology-environmental-studies
academics/majors-and-minors/interdisciplinary-majors/biology-geology
academics/majors-and-minors/chemistry
academics/majors-and-minors/environmental-studies/environmental-studies-major/chemistry-environmental-studies
academics/majors-and-minors/interdisciplinary-majors/chemistry-geology
academics/majors-and-minors/chinese
academics/majors-and-minors/classics-and-classical-studie

### Challenge
Now that you have this list of URLs, you could write a web scraper to:
* Visit each department's page
* Go to the faculty page
* Collect the information on each faculty member

How might you do this? Start figuring out what the process would be.
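
One possible shape for that process is sketched below. It rests on a big assumption: that every department's faculty page lives at the department URL plus `/faculty`, as the Economics one does. That may well not hold for every department. `collect_faculty_names` and `fake_fetch` are hypothetical names, and the demonstration uses a stand-in for the download step so that it runs without touching the website.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.whitman.edu/"


def collect_faculty_names(page_paths, fetch, delay_seconds=1.0):
    """Visit each department's faculty page and collect the names found there.

    Hypothetical helper. `fetch` is any function that takes a URL and
    returns that page's HTML as a string.
    """
    rows = []
    for path in page_paths:
        # Skip entries that are not department pages (e.g. newsroom stories)
        if not path.startswith("academics/majors-and-minors/"):
            continue

        # ASSUMPTION: the faculty page is at <department page>/faculty
        faculty_url = urljoin(BASE_URL, path.rstrip("/") + "/faculty")

        soup = BeautifulSoup(fetch(faculty_url), "html.parser")
        for h2 in soup.find_all("h2", class_="wc-profile-name"):
            rows.append({
                "Department": path.split("/")[-1],
                "Name": h2.get_text(strip=True),
            })

        # Pause between requests so we do not overload the server
        time.sleep(delay_seconds)
    return rows


# Offline demonstration with a stand-in for the real download step
def fake_fetch(url):
    return '<h2 class="wc-profile-name">Denise Hazlett</h2>'


demo = collect_faculty_names(
    ["newsroom/armstrong-disgust-study", "academics/majors-and-minors/economics"],
    fetch=fake_fetch,
    delay_seconds=0,
)
print(demo)
```

To run this for real, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`. You would still need to decide what to do about pages that don't exist or that are laid out differently.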

### Challenge
Think about another website you might be interested in scraping. Take a look at it, both with your browser's inspector and with BeautifulSoup, and see what sorts of challenges and opportunities exist for scraping that page's data.