---
toc: false
page-layout: full
---

# Week 6B<br>Web Scraping

- Section 401
- Wednesday, October 11, 2023

## Week 6 agenda: web scraping

**Last time:**
- Why web scraping? 
- Getting familiar with the Web
- Web scraping: extracting data from static sites

**Today:**
- Practice with web scraping
- How to deal with dynamic content

In [1]:
# Start with the usual imports
# We'll use these throughout
import pandas as pd
from bs4 import BeautifulSoup
import requests

## Part 1: Web scraping exercises

For each of the exercises, use the Web Inspector to inspect the structure of the relevant web page, and identify the HTML content you will need to scrape with Python.


### 1. The number of days until the General Election

The relevant URL is: [https://vote.phila.gov/](https://vote.phila.gov/)

**Hint:** Select the `<span>` element that holds the number of days using the class name for the tag.
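
A minimal sketch of the class-based selection described in the hint. The HTML snippet and the class name `days-until` are made up for illustration; use the Web Inspector to find the real class name on the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page;
# the class name "days-until" is an assumption, not the real one
html = '<p><span class="days-until">27</span> days until the General Election</p>'

soup = BeautifulSoup(html, "html.parser")

# "span.days-until" matches a <span> whose class list includes "days-until"
span = soup.select_one("span.days-until")

# Convert the tag's text to an integer
days = int(span.text.strip())
print(days)
```

On the real site, the `html` string would be replaced by the text of the response from `requests.get()`.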

### 2. How many millions of people are currently experiencing drought?

Relevant URL: [https://www.drought.gov/current-conditions](https://www.drought.gov/current-conditions)

**Hint:** We're interested in just a single HTML element, so you can inspect the website, identify the right element, and copy its selector.
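
A selector copied from the Web Inspector can be passed straight to `select_one()`, and the text it returns usually needs a little cleaning before it is a number. The sketch below assumes a made-up page structure and selector; the real ones will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; the ids, classes, and values are invented for this example
html = """
<div id="stats">
  <div class="stat"><span class="value">12.3 Million</span> acres of crops</div>
  <div class="stat"><span class="value">78.4 Million</span> people</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A selector in the style that "Copy selector" produces
selector = "#stats > div:nth-child(2) > span.value"
element = soup.select_one(selector)

# Keep only the leading number: "78.4 Million" -> 78.4
millions = float(element.text.split()[0])
print(millions)
```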

### 3. Scrape the Weitzman School directory

The Weitzman School lists its directory of people on this page: [https://www.design.upenn.edu/people/list](https://www.design.upenn.edu/people/list). From this site, let's extract each person's name, title, and associated department.

This example is similar to the Inquirer Clean Plates demo from last lecture. The info we want for each person is wrapped up in a `<div>` element. You can select all of those elements, loop over each one in a "for" loop, extract the three pieces of content we want from each `<div>`, and then save the result to a list.
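
The select, loop, extract, and save pattern described above might look like the following sketch. The HTML, the class names (`person`, `name`, `title`, `dept`), and the people listed are all hypothetical placeholders, not the directory's actual markup.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML: two made-up people, each wrapped in a <div>
html = """
<div class="person">
  <h3 class="name">Jane Doe</h3>
  <p class="title">Professor</p>
  <p class="dept">City and Regional Planning</p>
</div>
<div class="person">
  <h3 class="name">John Smith</h3>
  <p class="title">Lecturer</p>
  <p class="dept">Architecture</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

data = []
# Loop over every <div> that holds one person
for person in soup.select("div.person"):
    # Nested selections: search only inside this person's <div>
    name = person.select_one(".name").text.strip()
    title = person.select_one(".title").text.strip()
    dept = person.select_one(".dept").text.strip()

    # Save one record per person
    data.append({"name": name, "title": title, "department": dept})

people = pd.DataFrame(data)
print(people)
```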



## Part 2: What about dynamic content?

How do you scrape data that is loaded via JavaScript or only appears after user interaction?

<center>
    <img src="imgs/yoda.png" width=700>
</center>

### Note: web browser needed

You'll need a web browser installed to use `selenium`, e.g., Firefox, Google Chrome, Edge, etc.

### Selenium 

- Designed as a framework for testing webpages during development
- Provides an interface to interact with webpages just as a user would 
- Becoming increasingly popular for web scraping dynamic content from pages

### Two common use cases

Two cases when `requests.get()` won't work:

1. Many sites load content via JavaScript. `requests.get()` doesn't run any JavaScript on the page; it just returns the static HTML, so the content you want won't be in the response.
1. If the site requires user interaction via buttons, dropdowns, etc., then a GET request alone won't return that information.

Selenium to the rescue!


### Example #1: Scraping SEPTA

- URL: [https://beta-realtime.septa.org/alerts](https://beta-realtime.septa.org/alerts)
- Let's see if we can extract the current set of alerts across the SEPTA system

**Problem:** content is loaded via JavaScript!

In [149]:
url = "https://beta-realtime.septa.org/alerts"

In [150]:
requests.get(url).content



Ah, nothing useful was returned -- the content must be loaded via JavaScript!

When you see things like:

```
<app-root></app-root>
<script src="runtime.73a50d36fa41b9bf.js" type="module"></script>
<script src="polyfills.174bc340856cccb6.js" type="module"></script>
<script src="scripts.faf651b9a5f14fb2.js" defer></script>
<script src="main.9cc42a1ddb87056f.js" type="module"></script>
```

it means the website is loading content via JavaScript and replacing the content of `<app-root>` dynamically once the page is loaded.
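
As a rough illustration, this pattern can be checked for in code. The helper `looks_js_rendered` below is a hypothetical heuristic (script tags present, no visible text), not a reliable test for every site; the sample HTML reuses two of the script names shown above.

```python
from bs4 import BeautifulSoup

# Static HTML in the style returned by requests.get() for a JavaScript-driven page
static_html = """
<html><body>
<app-root></app-root>
<script src="runtime.73a50d36fa41b9bf.js" type="module"></script>
<script src="main.9cc42a1ddb87056f.js" type="module"></script>
</body></html>
"""


def looks_js_rendered(html):
    """Rough heuristic: the body has <script> tags but no visible text."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body if soup.body is not None else soup

    # Count scripts before removing them
    has_scripts = len(body.find_all("script")) > 0

    # Drop scripts and styles, then see whether any text remains
    for tag in body.find_all(["script", "style"]):
        tag.decompose()
    visible_text = body.get_text(strip=True)

    return has_scripts and visible_text == ""


print(looks_js_rendered(static_html))
```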

### Can we parse it with selenium? Yes!

Let's use selenium to open a web browser, load the page, and then extract the HTML content AFTER the JavaScript has loaded the content.

### Initialize the driver

The initialization steps will depend on which browser you want to use!

In [151]:
# Import the webdriver from selenium
from selenium import webdriver

::: {.callout-important}
#### Important: Working on Binder

If you are working on Binder, you'll need to use Firefox in "headless" mode, which prevents a browser window from opening.

If you are working locally, it's better to run with the default options — you'll be able to see the browser window open and change as we perform the web scraping.
:::

#### Using Google Chrome

In [152]:
# UNCOMMENT BELOW TO USE CHROME

driver = webdriver.Chrome()

#### Using Firefox

If you are working on Binder, use the code below!

In [153]:
# UNCOMMENT BELOW IF ON BINDER

options = webdriver.FirefoxOptions()

# IF ON BINDER, RUN IN "HEADLESS" MODE (NO BROWSER WINDOW IS OPENED)
# COMMENT THIS LINE IF WORKING LOCALLY
# options.add_argument("--headless")

# Initialize
#driver = webdriver.Firefox(options=options)

#### Using Microsoft Edge

In [154]:
# UNCOMMENT BELOW TO USE MICROSOFT EDGE

#driver = webdriver.Edge()

Navigate to the URL in our opened browser:

In [157]:
url = "https://beta-realtime.septa.org/alerts"
driver.get(url)

After it loads, extract the HTML page source and pass it to BeautifulSoup:

In [158]:
septaSoup = BeautifulSoup(driver.page_source, "html.parser")

Now, head to the web inspector to understand the structure of the website.

It looks like each route's name and alert info are held in an `<li>` element, so first select all of those `<li>` elements.

In [159]:
route_containers = septaSoup.select("ul.d-grid.gap-5 > li")

In [160]:
print(route_containers[0].prettify())

<li _ngcontent-gib-c36="">
 <app-route-display _ngcontent-gib-c36="" _nghost-gib-c32="" size="medium">
  <a _ngcontent-gib-c32="" class="align-items-center fw-bold m-0 metro medium bg-L" href="/schedules/L1" title="Market-Frankford Line" translate="no">
   Market-Frankford Line
  </a>
 </app-route-display>
 <ul _ngcontent-gib-c36="" class="mt-3 rounded border border-lightgray route-alert-list">
  <li _ngcontent-gib-c36="" class="border-bottom border-lightgray alert-card">
   <app-card-alerts _ngcontent-gib-c36="" _nghost-gib-c13="">
     <div _ngcontent-gib-c13="" class="d-flex gap-5 flex-row align-items-center">
      <div _ngcontent-gib-c13="" class="d-flex gap-3 flex-fill">
       <svg-icon _ngcontent-gib-c13="">
        <svg _ngcontent-gib-c13="" aria-hidden="true" fill="currentColor" height="33" style="width: 25px; fill: rgb(51, 51, 51);" viewbox="0 0 38 33" width="38" xmlns="http://www.w3.org/2000/svg">
         <path _ngcontent-gib-c13="" clip-rule="evenodd" d="M18.5641 0C17.521

In [161]:
data = []

for route_container in route_containers:
    # The route name
    route_name = route_container.select_one("app-route-display").text.strip()

    # Loop over list of alerts
    for alert_card in route_container.select(".alert-card"):
        # Type of alert
        alert_type = alert_card.select_one("h5").text.strip()

        # The status, e.g. current
        when = alert_card.select_one(".badge").text.strip()

        # Description
        desc = alert_card.select_one(
            "app-card-alerts > div > div > div > div > div:nth-child(2)"
        ).text.strip()

        # Save it
        data.append([route_name, alert_type, when, desc])


data = pd.DataFrame(data, columns=["route_name", "alert_type", "when", "description"])
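
Note that `select_one()` returns `None` when nothing matches, so calling `.text` on the result raises an `AttributeError` if a card happens to lack one of these elements. A small guard can help; `safe_text` below is a hypothetical helper, and the HTML is a made-up card with no `.badge` element.

```python
from bs4 import BeautifulSoup


def safe_text(parent, selector, default=None):
    """Return the stripped text of the first match, or `default` if nothing matches."""
    element = parent.select_one(selector)
    return element.text.strip() if element is not None else default


# Made-up alert card that is missing its badge
html = '<li class="alert-card"><h5> Advisory </h5></li>'
card = BeautifulSoup(html, "html.parser").select_one(".alert-card")

alert_type = safe_text(card, "h5")
when = safe_text(card, ".badge", default="Unknown")
print(alert_type, when)
```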

In [162]:
data.head(n=20)

| | route_name | alert_type | when | description |
|---|---|---|---|---|
| 0 | Market-Frankford Line | Alert | Current | Passengers must board trains on the westbound ... |
| 1 | Market-Frankford Line | Advisory | Current | Early Weekend Station Closures |
| 2 | Market-Frankford Line | Advisory | Current | Temporary Closure of 30th & Market Street Entr... |
| 3 | Market-Frankford Line | Advisory | Upcoming | Late Night Deep Cleaning |
| 4 | Route 10 | Advisory | Current | Temporary Closure of 30th & Market Street Entr... |
| 5 | Route 10 | Advisory | Current | Station Deep Cleaning |
| 6 | Route 10 | Advisory | Upcoming | Late Night Deep Cleaning |
| 7 | Route 11 | Advisory | Current | Temporary Closure of 30th & Market Street Entr... |
| 8 | Route 11 | Advisory | Current | Station Deep Cleaning |
| 9 | Route 11 | Advisory | Upcoming | Late Night Deep Cleaning |


Success!

### Example #2: Scraping the Philadelphia Municipal Courts portal

- URL: [https://ujsportal.pacourts.us/CaseSearch](https://ujsportal.pacourts.us/CaseSearch)
- Given a police incident number, we'll see if there is an associated court case with the incident

**Problem:** we'll need to click several buttons before we can see the info we want!

### Run the scraping analysis

Strategy:

- Rely on the Web Inspector to identify specific elements of the webpage
- Use Selenium to interact with the webpage
    - Change dropdown elements
    - Click buttons

#### 1. Open the URL

In [163]:
# Open the URL
url = "https://ujsportal.pacourts.us/CaseSearch"
driver.get(url)

#### 2. Create a dropdown "Select" element

We'll need to: 
- Select the dropdown element on the main page by its CSS selector
- Initialize a `selenium` `Select()` object

In [164]:
# Use the Web Inspector to get the css selector of the dropdown select element
dropdown_selector = "#SearchBy-Control > select"

In [165]:
from selenium.webdriver.common.by import By

# Select the dropdown by the element's CSS selector
dropdown = driver.find_element(By.CSS_SELECTOR, dropdown_selector)

In [166]:
from selenium.webdriver.support.ui import Select

# Initialize a Select object
dropdown_select = Select(dropdown)

#### 3. Change the selected text in the dropdown

Change the selected element: "Police Incident/Complaint Number" 

In [167]:
# Set the selected text in the dropdown element
dropdown_select.select_by_visible_text("Incident Number")
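
`select_by_visible_text()` raises a `NoSuchElementException` if no option has exactly that text, so it can help to list the available labels first. A minimal sketch, using the `options` property of selenium's `Select` class; `option_labels` is a hypothetical helper name.

```python
def option_labels(select):
    """Return the visible text of every <option> in a selenium Select object."""
    return [option.text for option in select.options]
```

Calling `print(option_labels(dropdown_select))` before making the selection shows the exact strings the dropdown accepts.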

#### 4. Set the incident number 

In [168]:
# Get the input element for the DC number
incident_input_selector = "#IncidentNumber-Control > input"
incident_input = driver.find_element(By.CSS_SELECTOR, incident_input_selector)

In [169]:
# Clear any existing entry
incident_input.clear()

# Input our example incident number
incident_input.send_keys("1725088232")

#### 5. Click the search button!

In [170]:
# Submit the search
search_button_id = "btnSearch"
driver.find_element(By.ID, search_button_id).click()

#### 6. Use BeautifulSoup to parse the results

- Use the `page_source` attribute to get the current HTML displayed on the page
- Initialize a "soup" object with the HTML

In [171]:
courtsSoup = BeautifulSoup(driver.page_source, "html.parser")

- Identify the element holding all of the results
- Within this container, find the `<table>` element and each `<tr>` element within the table

In [172]:
# Select the results container by its ID 
results_table = courtsSoup.select_one("#caseSearchResultGrid")

In [173]:
# Get all of the <tr> rows inside the tbody element 
# NOTE: we are using nested selections here!
results_rows = results_table.select("tbody > tr")

**Example:** The number of court cases

In [174]:
# Number of court cases
number_of_cases = len(results_rows)
print(f"Number of court cases: {number_of_cases}")

Number of court cases: 2


**Example:** Extract the text elements from the first row of the results

In [175]:
first_row = results_rows[0]

In [176]:
print(first_row.prettify())

<tr class="slide-active">
 <td class="display-none">
  1
 </td>
 <td class="display-none">
  0
 </td>
 <td>
  MC-51-CR-0030672-2017
 </td>
 <td>
  Common Pleas
 </td>
 <td>
  Comm. v. Velquez, Victor
 </td>
 <td>
  Closed
 </td>
 <td>
  10/13/2017
 </td>
 <td>
  Velquez, Victor
 </td>
 <td>
  09/05/1974
 </td>
 <td>
  Philadelphia
 </td>
 <td>
  MC-01-51-Crim
 </td>
 <td>
  U0981035
 </td>
 <td>
  1725088232-0030672
 </td>
 <td>
  1725088232
 </td>
 <td class="display-none">
 </td>
 <td class="display-none">
 </td>
 <td class="display-none">
 </td>
 <td class="display-none">
 </td>
 <td>
  <div class="grid inline-block">
   <div>
    <div class="inline-block">
     <a class="icon-wrapper" href="/Report/CpDocketSheet?docketNumber=MC-51-CR-0030672-2017&amp;dnh=%2FGgePQykMpAymRENgxLBzg%3D%3D" target="_blank">
      <img alt="Docket Sheet" class="icon-size" src="https://ujsportal.pacourts.us/resource/Images/svg-defs.svg?v=qJ77ypOpzSMFk7r1gsI6H0xjdteha_ZIjvGslGgQV2M#icon-document-letter-D" 

In [177]:
# Extract out all of the "<td>" cells from the first row
td_cells = first_row.select("td")

# Loop over each <td> cell
for cell in td_cells:
    
    # Extract out the text from the <td> element
    text = cell.text
    
    # Print out text
    if text != "":
        print(text)

1
0
MC-51-CR-0030672-2017
Common Pleas
Comm. v. Velquez, Victor
Closed
10/13/2017
Velquez, Victor
09/05/1974
Philadelphia
MC-01-51-Crim
U0981035
1725088232-0030672
1725088232
Docket SheetCourt Summary
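
The same loop extends to every row, and the cells hidden with the `display-none` class can be skipped along the way. The sketch below runs on a trimmed-down, partly made-up table: the second row and the column names `docket_number` and `status` are assumptions for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Simplified stand-in for the results table; the second row is invented
html = """
<table><tbody>
<tr><td class="display-none">1</td><td>MC-51-CR-0030672-2017</td><td>Closed</td></tr>
<tr><td class="display-none">2</td><td>EXAMPLE-DOCKET-2</td><td>Active</td></tr>
</tbody></table>
"""
rows = BeautifulSoup(html, "html.parser").select("tbody > tr")

records = []
for row in rows:
    # Keep the text of each visible cell, skipping hidden "display-none" cells
    cells = [
        td.text.strip()
        for td in row.select("td")
        if "display-none" not in td.get("class", [])
    ]
    records.append(cells)

# One row per court case; column names here are assumptions
cases = pd.DataFrame(records, columns=["docket_number", "status"])
print(cases)
```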


#### 7. Close the driver!

In [178]:
# Quit the driver: closes every browser window and ends the session
driver.quit()
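
If an error interrupts the notebook before this last cell runs, the browser stays open. One way to guard against that is a small context manager that always shuts the driver down. This is a sketch: `managed_driver` is a hypothetical helper, not part of selenium, and it relies on `driver.quit()`, selenium's call for ending the whole browser session.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver, hand it to the `with` block, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs even if the scraping code raises an exception
        driver.quit()
```

Usage might look like `with managed_driver(webdriver.Chrome) as driver:` followed by the scraping steps indented inside the block.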

## That's it!

- Next week: part 2 of "getting data" with APIs
- See you on Monday!