# Web Scraping

The web provides a rich source of data, so we often want to work with that information programmatically in order to make sense of it. Sometimes, that data is provided to us by website creators via CSV files or through an API (Application Programming Interface). Other times, we need to collect text from the web ourselves.

It is easy to pull HTML from a website but more difficult to extract the information we want from that HTML.  This activity focuses on parsing HTML for targeted information and then storing that information in a structured format so it's ready for downstream analysis.

### Learning Goals:
- Understand the purpose of web scraping and when it is appropriate to use
- Articulate the differences between web scraping and using an API
- Become familiar with the structure of HTML and CSS by inspecting source code
- Use the Requests and Beautiful Soup libraries to acquire and parse data from websites
- Describe the challenges of web scraping (messy, unstructured, inconsistent, site-dependent)
- Understand the ethics of web scraping (e.g., check the terms and conditions of a site before scraping)

This activity will make use of the Requests and Beautiful Soup Python packages.
Documentation can be found here: 
- https://docs.python-requests.org/en/latest/
- https://www.crummy.com/software/BeautifulSoup/

The Requests module lets us integrate Python programs with web services, and the Beautiful Soup module makes screen scraping an efficient task.  (The name Beautiful Soup comes from Alice's Adventures in Wonderland by Lewis Carroll.)

We will also use the lxml package for one or two functions when parsing because it has an advantage when dealing with messy HTML code.
- https://lxml.de


### Install packages 

In [1]:
conda install -c anaconda requests 

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [2]:
conda install -c anaconda beautifulsoup4 

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [15]:
conda install -c anaconda lxml 

Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


### Import packages


In [4]:
import requests
from bs4 import BeautifulSoup

We'll choose the Cal Poly Humboldt Data Science webpage to inspect, request it, and assign the server's response to the variable page.

Before we do that, let's take a moment to view the source code of the HTML document.  In Chrome, navigate to https://math.humboldt.edu/programs/data-science, then choose View -> Developer -> View Source.  This might be the first time you have viewed HTML code.  Pay special attention to the tags: a particular word or letter enclosed in angle brackets, $<$h1$>$ for example.  Tags tell a browser how to render a page, and examples include heading tags, paragraph tags, line break tags, and formatting tags (among others).
Here is a list of basic HTML tags and what they do:

- $<$a$>$ for link
- $<$b$>$ to make bold text
- $<$body$>$ for the main content of the HTML document
- $<$br$>$ for a line break
- $<$div$>$ for a division or section of an HTML document
- $<$h1$>$ for a top-level heading
- $<$i$>$ to make italic text
- $<$img$>$ for an image in the document
- $<$ol$>$ for an ordered list, $<$ul$>$ for an unordered list
- $<$li$>$ for a list item in an ordered or unordered list
- $<$p$>$ for paragraph
- $<$span$>$ to style part of text
- $<$table$>$ for a table
- $<$tr$>$ for table row
- $<$th$>$ for table header
- $<$td$>$ for table cell

Look at the source code again, and see if you recognize any of these tags.

In [7]:
# choose webpage
url = 'https://math.humboldt.edu/programs/data-science'
page = requests.get(url)
page.status_code

200

Status code 200 indicates the page was downloaded successfully.  Status code info can be found here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.  We want to access the text-based content of the server's response. We can access this with page.text, which will give the full text of the page, including all of the HTML tags.  It will be difficult to read, but we will use Beautiful Soup to make this textual data more usable.
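As a small aside, the meaning of a numeric status code can be looked up with Python's standard-library `http.HTTPStatus`.  The sketch below is only an illustration; the helper name `is_success` is our own and is not part of Requests.

```python
from http import HTTPStatus

# A few status codes that come up often when scraping
for code in (200, 403, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)

def is_success(code):
    """Return True for 2xx codes, which indicate a successful request."""
    return 200 <= code < 300

print(is_success(200), is_success(403))
```

Requests also offers `page.ok` and `page.raise_for_status()` if we prefer not to compare the number ourselves.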

In [8]:
page.text

'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" version="XHTML+RDFa 1.0" dir="ltr" >\n\n<head profile="http://www.w3.org/1999/xhtml/vocab">\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<meta name="Generator" content="Drupal 7 (http://drupal.org)" />\n<link rel="canonical" href="/programs/data-science" />\n<link rel="shortlink" href="/node/1486" />\n<script type="text/x-mathjax-config">\nMathJax.Hub.Config({\n  extensions: [\'tex2jax.js\'],\n  jax: [\'input/TeX\',\'output/HTML-CSS\'],\n  tex2jax: {\n    inlineMath: [ [\'$\',\'$\'], [\'\\\\(\',\'\\\\)\'] ],\n    processEscapes: true,\n    processClass: \'tex2jax\',\n    ignoreClass: \'html\'\n  },\n  showProcessingMessages: false,\n  messageStyle: \'none\'\n});\n</script><link rel="shortcut icon" href="https://math.humboldt.edu/profiles/openhsu/themes/hsu_kalatheme/favicon.ico" type="image/vnd.microsoft.icon" />\n<meta name="viewport" content="width=device-width, initial-scale=1.0" 

### HTML and CSS

HTML (HyperText Markup Language) has a tree structure.  Generally, an HTML element (e.g., a heading or a paragraph) has three components:
- tags (starting and ending the element)
- attributes (providing information about the element)
- text or content (text inside the element)

![alternative text](html_tree_structure.png)

Cascading Style Sheets (CSS) defines how the HTML elements will be displayed.  With CSS you can control the color, font, size of text, spacing between elements, displays for different devices, etc.  The entire look of a website can be changed just by changing the CSS file.

We want to understand a minimal amount about HTML and CSS because we'll use their tags and attributes to parse scraped content.


Now we will create a Beautiful Soup object by running Python's built-in html.parser over the HTML of the page.  This object is a parse tree that represents the HTML document as a nested data structure; we assign it to the variable soup.

In [9]:
soup = BeautifulSoup(page.text, 'html.parser')

The method prettify() will turn the Beautiful Soup parse tree into a nicely formatted string with each HTML tag on its own line.  The tags are nested, capturing the tree schema.

In [10]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" version="XHTML+RDFa 1.0" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Drupal 7 (http://drupal.org)" name="Generator"/>
  <link href="/programs/data-science" rel="canonical"/>
  <link href="/node/1486" rel="shortlink"/>
  <script type="text/x-mathjax-config">
   MathJax.Hub.Config({
  extensions: ['tex2jax.js'],
  jax: ['input/TeX','output/HTML-CSS'],
  tex2jax: {
    inlineMath: [ ['$','$'], ['\\(','\\)'] ],
    processEscapes: true,
    processClass: 'tex2jax',
    ignoreClass: 'html'
  },
  showProcessingMessages: false,
  messageStyle: 'none'
});
  </script>
  <link href="https://math.humboldt.edu/profiles/openhsu/themes/hsu_kalatheme/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Data Science, B.S. |
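The three parts of an element described earlier (tag, attributes, and text) are each exposed on a Beautiful Soup `Tag` object.  Here is a minimal sketch on a one-element document that we made up for illustration; it is not taken from the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element document, used only for illustration
html = '<p class="intro" id="first">Hello, <b>world</b>!</p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
print(p.name)        # the tag name
print(p.attrs)       # the attributes as a dictionary
print(p.get_text())  # the text inside the element, including nested tags
```

Note that `class` can hold several values, so Beautiful Soup stores it as a list.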

We can extract every instance of a given tag from a page using the find_all() method.  The result behaves like a Python list (notice the square brackets in the output below).
The tag "p" indicates a paragraph.

In [11]:
soup.find_all('p')

[<p>
 <a href="http://humboldt.edu">
 <svg role="img" viewbox="0 0 1420.2 97.21" xmlns="http://www.w3.org/2000/svg"> <g data-name="Layer 2"> <g data-name="Layer 1"> <g data-name="calpoly-humboldt-mark"> <title id="humboldtTitle">Cal Poly Humboldt</title>
 <path class="hsu-humboldt" d="M1420.2 24.2V2.3H1342v22.8l8.4-4.7h18.8v64.7l-3.8 9.7h31.4l-3.8-9.7V20.4h19l8.2 3.8zm-278.3 25.3c0-7.8-2.3-14.5-7-20a22.3 22.3 0 0 0-17.9-8.3c-7.8 0-14 2.7-18.8 8.1s-7.1 12.2-7.1 20.3 2.3 14.5 7.1 19.9 10.8 8.1 18.4 8.1 13.6-2.7 18.3-8.1 7-12 7-19.9m22.8-1.2c0 13.9-4.6 25.6-13.9 34.9s-20.8 13.9-34.6 13.9-24.8-4.6-34.1-13.7-13.9-20.4-13.9-33.8 4.7-25.7 14.1-34.8 21.1-13.6 35.1-13.6 24.8 4.3 33.7 13 13.6 20.3 13.6 34M847.4 12v57.5a23.7 23.7 0 0 1-4.1 13.8 31.2 31.2 0 0 1-10.8 9.2A40.5 40.5 0 0 1 822 96a57.6 57.6 0 0 1-11.8 1.2 59.2 59.2 0 0 1-11.9-1 38.1 38.1 0 0 1-10.5-3.3 29.5 29.5 0 0 1-11.7-9.1 23.4 23.4 0 0 1-4.4-14.3V12l-4.2-9.8H795v63.5c0 3.2 2.8 11.8 14.5 11.8S824 68.9 824 65.7V2.2h27.4Zm115.4 73.2 

Because this returns a list, we can index a particular element (for example, the third "p" element), and use the get_text() method to extract all the text from within that tag.
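The same pattern can be tried on a small document where the answer is easy to check by eye.  The HTML below is a made-up example, and `find()` is shown alongside `find_all()` for comparison.

```python
from bs4 import BeautifulSoup

# Hypothetical document with three paragraphs
html = """
<div>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p>Third <i>paragraph</i>.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")         # list-like collection of Tag objects
third_text = paragraphs[2].get_text()   # index 2 is the third element
first_p = soup.find("p")                # find() returns only the first match

print(len(paragraphs))
print(third_text)
```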

In [12]:
soup.find_all('p')[2].get_text()

'Coursework centers around hands-on, applied learning experiences, which include projects that are community-based. Students will analyze real case studies of ethics in data science relating to privacy issues and the biases that are often inherently built into data.\xa0 Courses are designed to help students build a portfolio of work and finished projects that they can show potential employers to demonstrate their career-readiness.'

Note, in Python, "\xa0" is a character escape sequence that represents a non-breaking space.  A non-breaking space is a space character that prevents line breaks and word wrapping between two words separated by it.  
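If the non-breaking space gets in the way of later analysis, it can be swapped for an ordinary space.  The sketch below shows two possible approaches on a fragment of the text above; which one is preferable depends on how much other normalization we want.

```python
import unicodedata

# Fragment of the paragraph text shown above
text = "built into data.\xa0 Courses are designed"

# Option 1: replace the non-breaking space explicitly
cleaned = text.replace("\xa0", " ")

# Option 2: NFKC normalization maps compatibility characters to plain equivalents
normalized = unicodedata.normalize("NFKC", text)

# Collapse the double space left behind into a single space
collapsed = " ".join(cleaned.split())
print(collapsed)
```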

### State Senators of Illinois 
- A current list of Illinois state senators can be found here: http://www.ilga.gov/senate.
Take a moment and inspect the structure of the HTML of the website.

In [13]:
# Import required libraries
from bs4 import BeautifulSoup
import requests


### Extracting and Parsing HTML

In order to successfully scrape and analyze HTML, we will

- Make a GET request to get a page's HTML
- Parse the page with Beautiful Soup (the soup object will be an HTML tree)
- Search for HTML elements
- Get attributes and text of these elements

In [56]:
# Make a GET request
req = requests.get('https://www.ilga.gov/senate/')
# Read the content of the server’s response
src = req.text
# View some output
print(src[:1000])

<html lang="en"> 
<!-- Trigger/Open The Modal -->
<div style="position: fixed; z-index: 999; top: 5; left: 600; background-color: navy; display: block">
<button id="myBtn" style="color: white; background-color: navy; display: block">Translate Website</button></div>
<!-- The Modal -->
<div id="myModal" class="modal" style="display: none">
  <!-- Modal content -->
  <div class="modal-content">
      <div class="modal-header"><h3>
    <span class="close">&times;</span></h3></div>    
    <p>The Illinois General Assembly offers the Google Translate service for visitor convenience. In no way should it be considered accurate as to the translation of any content herein.</p>
    <p>Visitors of the Illinois General Assembly website are encouraged to use other translation services available on the internet.</p>
    <p>The English language version is always the official and authoritative version of this website.</p>
    <p>NOTE: To return to the original English language version, selec

Now we will parse the page with Beautiful Soup.  Notice that the output looks very similar to what was above, but it is in a form that is a bit easier to read.

In [57]:
# Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')
# Take a look
print(soup.prettify()[:1000])

<html lang="en">
 <!-- Trigger/Open The Modal -->
 <body>
  <div style="position: fixed; z-index: 999; top: 5; left: 600; background-color: navy; display: block">
   <button id="myBtn" style="color: white; background-color: navy; display: block">
    Translate Website
   </button>
  </div>
  <!-- The Modal -->
  <div class="modal" id="myModal" style="display: none">
   <!-- Modal content -->
   <div class="modal-content">
    <div class="modal-header">
     <h3>
      <span class="close">
       ×
      </span>
     </h3>
    </div>
    <p>
     The Illinois General Assembly offers the Google Translate service for visitor convenience. In no way should it be considered accurate as to the translation of any content herein.
    </p>
    <p>
     Visitors of the Illinois General Assembly website are encouraged to use other translation services available on the internet.
    </p>
    <p>
     The English language version is always the official and authoritative version of this website.
   

Beautiful Soup lets us find elements by their
- HTML tags
- HTML Attributes
- CSS Selectors

We will now look for specific HTML tags.  The method find_all searches the soup tree and returns all the elements with a particular HTML tag.  The tag "a" defines a hyperlink.


In [58]:
# Find all elements with a certain tag
a_tags = soup.find_all("a")
print(a_tags[:10])

[<a class="goog-logo-link" href="https://translate.google.com" target="_blank"><img alt="Google Translate" height="14" src="https://www.gstatic.com/images/branding/googlelogo/1x/googlelogo_color_42x16dp.png" style="padding-right: 3px;" width="37"/>Translate</a>, <a href="/default.asp"><img alt="Illinois General Assembly" border="0" height="49" src="/images/logo_sm.gif" width="462"/></a>, <a class="mainmenu" href="/">Home</a>, <a class="mainmenu" href="/legislation/" onblur="HM_f_PopDown('elMenu1')" onfocus="HM_f_PopUp('elMenu1',event)" onmouseout="HM_f_PopDown('elMenu1')" onmouseover="HM_f_PopUp('elMenu1',event)">Legislation &amp; Laws</a>, <a class="mainmenu" href="/senate/" onblur="HM_f_PopDown('elMenu3')" onfocus="HM_f_PopUp('elMenu3',event)" onmouseout="HM_f_PopDown('elMenu3')" onmouseover="HM_f_PopUp('elMenu3',event)">Senate</a>, <a class="mainmenu" href="/house/" onblur="HM_f_PopDown('elMenu2')" onfocus="HM_f_PopUp('elMenu2',event)" onmouseout="HM_f_PopDown('elMenu2')" onmouseove

In [59]:
print(len(a_tags))

206


This means we found 206 links.  We might want to subset this and look just for hyperlinks with certain **attributes** (such as a particular CSS class).  For example, we can search for all "a" tags with the class sidemenu.
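Beautiful Soup offers several equivalent spellings for this kind of search, which can be confusing at first.  The sketch below compares them on a three-link snippet modeled on the menu links of the page; the snippet itself is a simplified stand-in, not the full page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page's menu links
html = """
<a class="mainmenu" href="/">Home</a>
<a class="sidemenu" href="/senate/default.asp">Members</a>
<a class="sidemenu" href="/senate/committees/default.asp">Committees</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("a", class_="sidemenu")          # class_ avoids Python's reserved word
by_call = soup("a", class_="sidemenu")                      # calling soup is shorthand for find_all
by_dict = soup.find_all("a", attrs={"class": "sidemenu"})   # attributes given as a dictionary
by_css = soup.select("a.sidemenu")                          # CSS selector syntax

print([a.get_text() for a in by_css])
```

All four calls return the same two links here, so the choice is mostly a matter of readability.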

In [60]:
# Get only the 'a' tags in 'sidemenu' class
side_menus = soup("a", class_="sidemenu")
side_menus[:5]

[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>]

### Task: 
Use BeautifulSoup to find all the a elements with class mainmenu.

### Solution

In [61]:
# Get only the 'a' tags in 'mainmenu' class
main_menus = soup("a", class_="mainmenu")
main_menus[:5]

[<a class="mainmenu" href="/">Home</a>,
 <a class="mainmenu" href="/legislation/" onblur="HM_f_PopDown('elMenu1')" onfocus="HM_f_PopUp('elMenu1',event)" onmouseout="HM_f_PopDown('elMenu1')" onmouseover="HM_f_PopUp('elMenu1',event)">Legislation &amp; Laws</a>,
 <a class="mainmenu" href="/senate/" onblur="HM_f_PopDown('elMenu3')" onfocus="HM_f_PopUp('elMenu3',event)" onmouseout="HM_f_PopDown('elMenu3')" onmouseover="HM_f_PopUp('elMenu3',event)">Senate</a>,
 <a class="mainmenu" href="/house/" onblur="HM_f_PopDown('elMenu2')" onfocus="HM_f_PopUp('elMenu2',event)" onmouseout="HM_f_PopDown('elMenu2')" onmouseover="HM_f_PopUp('elMenu2',event)">House</a>,
 <a class="mainmenu" href="/mylegislation/" onblur="HM_f_PopDown('elMenu4')" onfocus="HM_f_PopUp('elMenu4',event)" onmouseout="HM_f_PopDown('elMenu4')" onmouseover="HM_f_PopUp('elMenu4',event)">My Legislation</a>]

We want to access information within specific elements.  In practice, this means the text and/or attributes of those elements.
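As a rough sketch of what "text and attributes" looks like in code, the example below reads both from a single sidemenu link.  Resolving the relative link with the standard library's `urljoin` is our own addition, and the base URL is assumed to be the page the link came from.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<a class="sidemenu" href="/senate/default.asp">  Members  </a>'
link = BeautifulSoup(html, "html.parser").find("a")

label = link.get_text(strip=True)   # text with surrounding whitespace removed
href = link["href"]                 # dictionary-style access; raises KeyError if missing
target = link.get("target")         # .get() returns None when the attribute is absent

# Relative links can be resolved against the URL of the page they came from
full_url = urljoin("https://www.ilga.gov/senate/", href)
print(label, full_url)
```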

In [62]:
# Get attributes and text of these elements

# Get all sidemenu links as a list
side_menu_links = soup.select("a.sidemenu")

# Examine the first link
first_link = side_menu_links[0]
print(first_link)

# What class is this variable?
print('Class: ', type(first_link))

<a class="sidemenu" href="/senate/default.asp">  Members  </a>
Class:  <class 'bs4.element.Tag'>


In [63]:
print(first_link.text)

  Members  


### Scraping Information about the Illinois General Assembly
Our goal will be to scrape this information about each senator:
- their name
- their district
- their political party

#### Scrape and Soup the Government Webpage

In [69]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
###req = requests.get('https://www.ilga.gov/senate/')
# Read the content of the server’s response
src = req.text
# Soup it
soup = BeautifulSoup(src, "lxml")

#### Search for Table Elements
Rows are identified by the "tr" tag.  

In [70]:
# Get all table row elements
rows = soup.find_all("tr")
len(rows)

77

In [71]:
# Select every 'tr' element nested inside two other 'tr' elements (CSS descendant selector)
rows = soup.select('tr tr tr')

for row in rows[:5]:
    print(row, '\n')

<tr><td colspan="5">
<span class="heading">Current Senate Members</span>
<span class="italics">  103rd  General Assembly</span><br/>
<!-- 3/2/09 temp comment out until fixed for GA specific-->
<!-- add 97th ga currently no info -->
<a href="103rd_Senate_Leadership.pdf">Leadership</a> <a href="103rd_Senate_Officers.pdf">Officers</a> <a href="103rd_Senate_Seating_Chart.pdf">Senate Seating Chart</a>  <span class="content"><b>Democrats:</b> 40   <b>Republicans:</b> 19</span><br/>
</td></tr> 

<tr>
<td class="header" width="45%"><a class="filetab" href="javascript:Sort('LastName','',103);" title="Sort by Senator">Senator</a></td>
<td align="center" class="header" width="15%">Bills</td>
<td align="center" class="header" width="10%">Committees</td>
<td align="center" class="header" width="15%"><a class="filetab" href="javascript:Sort('DistrictNumber','',103);" title="Sort by District">District</a></td>
<td align="center" class="header" width="15%"><a class="filetab" href="javascript:Sort('Par

If we skim the above output, we see that we want everything after the first two rows (notice that "Neil Anderson" is the first member listed and that starts in the third row).  We will start by looking at just the single row for Neil Anderson. 
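The `'tr tr tr'` selector used above is a CSS descendant selector: it matches a row only if that row sits inside a row that itself sits inside another row, which is how we reach a table nested within layout tables.  A minimal sketch on a hypothetical three-level nested table:

```python
from bs4 import BeautifulSoup

# Hypothetical layout: a table inside a table inside a table
html = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td>inner 1</td></tr>
      <tr><td>inner 2</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

all_rows = soup.find_all("tr")       # rows at every nesting level
inner_rows = soup.select("tr tr tr") # only rows nested two levels deep

print(len(all_rows), len(inner_rows))
print([row.get_text(strip=True) for row in inner_rows])
```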

In [72]:
example_row = rows[2]
print(example_row.prettify())

<tr>
 <td bgcolor="white" class="detail" width="40%">
  <a class="notranslate" href="/senate/Senator.asp?GA=103&amp;MemberID=3092">
   Neil Anderson
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  <a href="SenatorBills.asp?MemberID=3092">
   Bills
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  <a href="SenCommittees.asp?MemberID=3092">
   Committees
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  47
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  R
 </td>
</tr>



The tag "td" means table data cell element in HTML.

In [73]:
for cell in example_row.select('td'):
    print(cell)
print()

<td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=103&amp;MemberID=3092">Neil Anderson</a></td>
<td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=3092">Bills</a></td>
<td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=3092">Committees</a></td>
<td align="center" bgcolor="white" class="detail" width="15%">47</td>
<td align="center" bgcolor="white" class="detail" width="15%">R</td>



Let's select only the "td" tags with class "detail."  (All of the above have this class, but in some instances this might not be true.)

In [74]:
# Select only those 'td' tags with class 'detail' 
detail_cells = example_row.select('td.detail')
detail_cells

[<td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=103&amp;MemberID=3092">Neil Anderson</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=3092">Bills</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=3092">Committees</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%">47</td>,
 <td align="center" bgcolor="white" class="detail" width="15%">R</td>]

We are interested in the text displayed on the website, not the tags around it.

In [75]:
# Keep only the text in each of those cells
row_data = [cell.text for cell in detail_cells]
print(row_data)

['Neil Anderson', 'Bills', 'Committees', '47', 'R']


This gives us a list of strings with 5 elements.  We are interested in the first element (name), the fourth element (district), and the last element (political party).
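Tuple unpacking is one alternative to indexing when every row has the same five cells.  This is only a sketch of the idea; it assumes the row has exactly five elements and would raise a `ValueError` otherwise.

```python
row_data = ['Neil Anderson', 'Bills', 'Committees', '47', 'R']

# Unpack the five cells, discarding the two link columns with underscores
name, _, _, district, party = row_data

# The district arrives as a string, so convert it to an integer
district = int(district)
senator = (name, district, party)
print(senator)
```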

In [76]:
print(row_data[0]) # Name
print(row_data[3]) # District
print(row_data[4]) # Party

Neil Anderson
47
R


#### Eliminating Junk Rows
Not all rows returned actually correspond to a senator.  We will want to clean things up before we use a for loop to loop through all of the senators.  For example, these rows will need to be filtered.  (Read what is returned and notice it does not correspond to a senator.)

In [77]:
print('Row 0:\n', rows[0], '\n')
print('Row 1:\n', rows[1], '\n')
print('Last Row:\n', rows[-1])

Row 0:
 <tr><td colspan="5">
<span class="heading">Current Senate Members</span>
<span class="italics">  103rd  General Assembly</span><br/>
<!-- 3/2/09 temp comment out until fixed for GA specific-->
<!-- add 97th ga currently no info -->
<a href="103rd_Senate_Leadership.pdf">Leadership</a> <a href="103rd_Senate_Officers.pdf">Officers</a> <a href="103rd_Senate_Seating_Chart.pdf">Senate Seating Chart</a>  <span class="content"><b>Democrats:</b> 40   <b>Republicans:</b> 19</span><br/>
</td></tr> 

Row 1:
 <tr>
<td class="header" width="45%"><a class="filetab" href="javascript:Sort('LastName','',103);" title="Sort by Senator">Senator</a></td>
<td align="center" class="header" width="15%">Bills</td>
<td align="center" class="header" width="10%">Committees</td>
<td align="center" class="header" width="15%"><a class="filetab" href="javascript:Sort('DistrictNumber','',103);" title="Sort by District">District</a></td>
<td align="center" class="header" width="15%"><a class="filetab" href="java

How we filter rows is somewhat dependent on the website, and we might have to try a few things to find a conditional that works for us.  Approaches to try:
- row length (perhaps the rows we want to keep all have the same length and the rows we want to filter have a different length)
- use a specific class, like "detail", which is in the rows we want and not in the rows we want to filter out

Let's look at the length of some of the rows we want to eliminate and some of the rows we know we want to keep.

In [78]:
# Bad rows
print(len(rows[0]))
print(len(rows[1]))

# Good rows
print(len(rows[2]))
print(len(rows[3]))

1
11
5
5
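It helps to know what `len()` measures here: for a Beautiful Soup tag it counts the tag's direct children, and whitespace between cells counts as a child too.  That would explain why the header row, whose cells are separated by line breaks, reports 11 rather than 5.  A minimal sketch on two hypothetical rows:

```python
from bs4 import BeautifulSoup

# The same two cells, written without and with line breaks between them
compact = BeautifulSoup("<tr><td>a</td><td>b</td></tr>", "html.parser").tr
spaced = BeautifulSoup("<tr>\n<td>a</td>\n<td>b</td>\n</tr>", "html.parser").tr

print(len(compact))                 # children: two td tags
print(len(spaced))                  # children: two td tags plus three newline strings
print(len(spaced.find_all("td")))   # counts only the cells
```

Counting cells with `find_all("td")` is less sensitive to how the HTML happens to be formatted.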


Is a length of 5 a sufficient condition for keeping a row?

In [79]:
good_rows = [row for row in rows if len(row) == 5]

# Let's check some rows
print(good_rows[0], '\n')
print(good_rows[-2], '\n')
print(good_rows[-1])

<tr><td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=103&amp;MemberID=3092">Neil Anderson</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=3092">Bills</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=3092">Committees</a></td><td align="center" bgcolor="white" class="detail" width="15%">47</td><td align="center" bgcolor="white" class="detail" width="15%">R</td></tr> 

<tr><td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=103&amp;MemberID=3123">Craig Wilcox</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=3123">Bills</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=3123">Committees</a></td><td align="center" bgcolor="white" class="detail" width="15%">32</td><td align="

Unfortunately, there is a footer at the end that we will need to avoid.  Notice that it has class = "footer", which differs from the class = "detail" of the rows we are interested in keeping.
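Filtering on a class relies on a small Python idiom: `select()` returns an empty list when nothing matches, and an empty list is treated as false in a condition.  The sketch below shows this on a simplified, hypothetical table with header, detail, and footer rows.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the senate table (the footer text is a placeholder)
html = """
<table>
  <tr><td class="header">Senator</td><td class="header">District</td></tr>
  <tr><td class="detail">Neil Anderson</td><td class="detail">47</td></tr>
  <tr><td class="detail">Omar Aquino</td><td class="detail">2</td></tr>
  <tr><td class="footer">Footer text</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Rows without detail cells yield an empty (falsy) list and are dropped
good_rows = [row for row in rows if row.select("td.detail")]
names = [row.select("td.detail")[0].get_text() for row in good_rows]
print(names)
```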

In [80]:
rows[2].select('td.detail') 

[<td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=103&amp;MemberID=3092">Neil Anderson</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=3092">Bills</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=3092">Committees</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%">47</td>,
 <td align="center" bgcolor="white" class="detail" width="15%">R</td>]

In [81]:
# Bad row-- note it does not have the class detail
print(rows[-1].select('td.detail'), '\n')

# Good row-- note it has the class detail
print(rows[5].select('td.detail'), '\n')

# Try this and check
good_rows = [row for row in rows if row.select('td.detail')]

print("Checking rows...\n")
print(good_rows[0], '\n') # check the first row
print(good_rows[-1]) # check the last row

[] 

[<td bgcolor="EBEBEB" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=103&amp;MemberID=3222">Tom Bennett</a></td>, <td align="center" bgcolor="EBEBEB" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=3222">Bills</a></td>, <td align="center" bgcolor="EBEBEB" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=3222">Committees</a></td>, <td align="center" bgcolor="EBEBEB" class="detail" width="15%">53</td>, <td align="center" bgcolor="EBEBEB" class="detail" width="15%">R</td>] 

Checking rows...

<tr><td bgcolor="white" class="detail" width="40%"><a class="notranslate" href="/senate/Senator.asp?GA=103&amp;MemberID=3092">Neil Anderson</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=3092">Bills</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=3092">Committees</a></td><td align="center" bgcolor="white" class="detail" wi

This worked!  

#### Loop
Now that we know how to get the data from one row and how to filter out the rows we don't want, we can put this into a loop.

In [82]:
# Define storage list, this will be a list of tuples
members = []

# Get rid of junk rows
valid_rows = [row for row in rows if row.select('td.detail')]

# Loop through all rows
for row in valid_rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    # Store in a tuple
    senator = (name, district, party)
    # Append to list
    members.append(senator)

As a check, the webpage lists 40 Democrats and 19 Republicans, so there should be 59 senators in total.

In [83]:
# Should be 59
len(members)

59

In [37]:
# print first few elements to check
print(members[:5])

[('Neil Anderson', 47, 'R'), ('Omar Aquino', 2, 'D'), ('Christopher Belt', 57, 'D'), ('Tom Bennett', 53, 'R'), ('Terri Bryant', 58, 'R')]


### Store in a DataFrame
It would be more convenient to view this information in a pandas DataFrame.

In [84]:
import pandas as pd
df = pd.DataFrame(members, columns =['Name', 'District', 'Political Party'])
df

Unnamed: 0,Name,District,Political Party
0,Neil Anderson,47,R
1,Omar Aquino,2,D
2,Christopher Belt,57,D
3,Tom Bennett,53,R
4,Terri Bryant,58,R
5,Cristina Castro,22,D
6,Javier L. Cervantes,1,D
7,Andrew S. Chesney,45,R
8,Bill Cunningham,18,D
9,John F. Curran,41,R
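Once the data is in a DataFrame, the usual pandas tools apply.  As a brief sketch, using only the first five senators shown above, we can count members per party, sort by district, and produce CSV text for downstream analysis.

```python
import pandas as pd

# First five senators from the output above
members = [('Neil Anderson', 47, 'R'), ('Omar Aquino', 2, 'D'),
           ('Christopher Belt', 57, 'D'), ('Tom Bennett', 53, 'R'),
           ('Terri Bryant', 58, 'R')]
df = pd.DataFrame(members, columns=['Name', 'District', 'Political Party'])

party_counts = df['Political Party'].value_counts()   # senators per party
by_district = df.sort_values('District')              # rows ordered by district number
csv_text = df.to_csv(index=False)                     # CSV text; pass a path to write a file

print(party_counts)
```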


### Task
Navigate to the Wikipedia page https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal),
extract the country and GDP information from the table, and store it in a Pandas dataframe. The resulting dataframe will have two columns: "Country" and "GDP (billions USD)".  

### Solution

In [121]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send a GET request to the website
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
response = requests.get(url)

# Use Beautiful Soup to parse the HTML content
soup = BeautifulSoup(response.content, "html.parser")

# Find the table that contains the GDP information
table = soup.find("table", {"class": "wikitable"})

# Create lists to store the data
countries = []
gdps = []


# Extract the country and GDP information from each row in the table
for row in table.find_all("tr")[1:]:
    cells = row.find_all("td")
    if cells:
        country = cells[0].get_text().strip()
        gdp = cells[2].get_text().strip()
       

        # Append the data to the lists
        countries.append(country)
        gdps.append(gdp)
    
# Create a Pandas dataframe from the lists
df = pd.DataFrame({"Country": countries, "GDP (billions USD)": gdps})

# Print the dataframe
df


Unnamed: 0,Country,GDP (billions USD)
0,World,105568776
1,United States,26854599
2,China,19373586
3,Japan,4409738
4,Germany,4308854
...,...,...
212,Anguilla,—
213,Kiribati,248
214,Nauru,151
215,Montserrat,—
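The GDP values are scraped as strings, so they likely need converting before any arithmetic.  The sketch below is one possible cleanup on three rows taken from the output above; the shortened column name `GDP` is our own, and the comma removal assumes the scraped text may contain thousands separators.

```python
import pandas as pd

# A few rows taken from the output above; "\u2014" is the dash used for missing values
df = pd.DataFrame({
    "Country": ["United States", "Anguilla", "Kiribati"],
    "GDP": ["26854599", "\u2014", "248"],
})

# Remove any thousands separators, then coerce non-numeric entries to NaN
cleaned = df["GDP"].str.replace(",", "", regex=False)
df["GDP"] = pd.to_numeric(cleaned, errors="coerce")

print(df)
```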


### Task
Some websites don't allow web scraping.  As an example, go to Zillow and search for homes for sale in Arcata.  What happens if you try to create a soup object from this website?  (Look at the contents of the soup object.)

### Solution
Access will be denied.
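Besides reading a site's terms and conditions, one common courtesy check is the site's robots.txt file, which Python's standard-library `urllib.robotparser` can interpret.  The rules and URLs below are hypothetical and are not the actual rules of Zillow or any other real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the actual rules of any real site
robots_txt = """\
User-agent: *
Disallow: /homes/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a generic crawler ("*") may fetch each path
allowed_about = parser.can_fetch("*", "https://example.com/about")
allowed_homes = parser.can_fetch("*", "https://example.com/homes/arcata")
print(allowed_about, allowed_homes)
```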

### References
- Web Scraping with Python: Collecting More Data From the Modern Web by Ryan Mitchell
- Digital Ocean tutorials: https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3
- Digital Ocean HTML tutorial: https://www.digitalocean.com/community/tutorials/how-to-use-and-understand-html-elements
- UCB's DLab: https://github.com/dlab-berkeley/Python-Web-Scraping/blob/main/lessons/02_web_scraping.ipynb
- EDUCBA Tutorial on Beautiful Soup: https://www.educba.com/beautifulsoup-find/?source=leftnav