## Lab 8: Web Scraping

Reference: https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/

### The Components of a Web Page
Before we start writing code, we need to understand a little bit about the structure of a web page. We'll use the site's structure to write code that gets us the data we want to scrape, so understanding that structure is an important first step for any web scraping project.
When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. These files will typically include:

1. HTML: the main part of the page.
2. CSS: used to add styling to make the page look nicer.
3. JS: JavaScript files add interactivity to web pages.
4. Images: image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us.
There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look primarily at the HTML.

HTML:

HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language, like Python, though. It’s a markup language that tells a browser how to display content. 

HTML consists of elements called tags. The most basic tag is the `<html>` tag, which tells the web browser that everything inside it is HTML. A simple HTML document can consist of just this tag: `<html></html>`.


Tags:

p - defines a paragraph of text; it usually sits inside the body tag.

a - defines a hyperlink; its href attribute holds the link target.

div - indicates a division, or area, of the page.

b - bolds any text inside.

i - italicizes any text inside.

table - creates a table.

form - creates an input form.
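
To make the nesting of tags concrete, here is a small sketch that walks a hand-written HTML document with Python's built-in `html.parser` module and prints each opening tag indented by its depth. The sample document and the `TagOutline` class are made up for illustration; they are not part of the lab page.

```python
from html.parser import HTMLParser

# A tiny hand-written page (hypothetical, for illustration only).
SAMPLE_HTML = """
<html>
  <head></head>
  <body>
    <div>
      <p>Here is a paragraph with a <a href="https://www.example.com">link</a>.</p>
      <p><b>Bold</b> and <i>italic</i> text.</p>
    </div>
  </body>
</html>
"""


class TagOutline(HTMLParser):
    """Record every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag name) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = TagOutline()
parser.feed(SAMPLE_HTML)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The printed outline shows `html` at the top, `head` and `body` one level down, and the `p`, `a`, `b`, and `i` tags nested inside the `div`.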
    

### Workflow
1. Request the content (source code) of a specific URL from the server
2. Download the content that is returned
3. Inspect the page structure and identify the elements of the page that are part of the table we want
4. Extract and (if necessary) reformat those elements into a dataset we can analyze or use in whatever way we require.
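
Steps 3 and 4 can be sketched on a small inline table, so the example runs without a network connection. The HTML snippet, its placeholder rows, and the function name `table_to_records` are hypothetical; in practice the HTML would come from a request such as `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (steps 1 and 2).
html = """
<table>
  <tr><th>course</th><th>units</th></tr>
  <tr><td>Course A</td><td>100</td></tr>
  <tr><td>Course B</td><td>100</td></tr>
</table>
"""


def table_to_records(html_text):
    """Turn the first <table> in html_text into a list of dicts."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table")
    rows = table.find_all("tr")
    # Step 3: the header cells name the columns.
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Step 4: pair each cell with its column name.
        records.append(dict(zip(header, cells)))
    return records


records = table_to_records(html)
print(records)
```

A list of dictionaries like this is easy to hand to `csv.DictWriter` or a data-frame library for later analysis.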

### Useful tools
Chrome Developer Tools: in Chrome, open the View menu, then Developer, then Developer Tools.

Alternatively, right-click any element on the page and choose Inspect.

In [56]:
import requests
import bs4
from bs4 import BeautifulSoup
import re


In [72]:
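# Request the page; the timeout keeps the call from hanging indefinitely
url = 'https://harris.uchicago.edu/academics/programs-degrees/degrees/master-public-policy-mpp'
page = requests.get(url, timeout=10)
page.raise_for_status()  # stop early on an HTTP error such as 404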

In [73]:
soup = BeautifulSoup(page.content, 'html.parser')

In [74]:
soup

<!DOCTYPE html>

<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
<head>
<!-- Google Tag Manager -->
<script>(function (w, d, s, l, i) {
            w[l] = w[l] || [];
            w[l].push({'gtm.start': new Date().getTime(), event: 'gtm.js'});
            var f = d.getElementsByTagName(s)[0], j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
            j.async = true;
            j.src = 'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
            f.parentNode.insertBefore(j, f);
          })(window, document, 'script', 'dataLayer', 'GTM-W7TVBBS');
        </script>
<!-- End Google Tag Manager -->
<meta charset="u

In [75]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
 <head>
  <!-- Google Tag Manager -->
  <script>
   (function (w, d, s, l, i) {
            w[l] = w[l] || [];
            w[l].push({'gtm.start': new Date().getTime(), event: 'gtm.js'});
            var f = d.getElementsByTagName(s)[0], j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
            j.async = true;
            j.src = 'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
            f.parentNode.insertBefore(j, f);
          })(window, document, 'script', 'dataLayer', 'GTM-W7TVBBS');
  </script>
  <!-- End Google Tag Manager -->
  <meta char

In [76]:
# find all text
print(soup.get_text())

Master of Public Policy (MPP) | The University of Chicago Harris School of Public Policy

  Skip to main content

Utility

Search

Log in

Directory

Hire Harris

Find Talent

Find Work

Info For

Prospective Students

Admitted Students (Full-Time)

New EMP Students

Current Students

Faculty and Staff

Alumni

Media

Support Harris

Ways to Give

Harris Society

Sort

Older first

Most recent

Less relevant

Most relevant

Page 1

The University of Chicago Harris School of Public Policy

AboutConsidering Applying?Get a jump start now by creating an account with our simple online tool.Get StartedStill have questions? Check out Admissions for details on requirements, deadlines, and financial aid.Who We AreConsidering Applying?Get a jump start now by creating an account with our simple online tool.Get StartedStill have quest

In [77]:
all_links = soup.find_all("a")
for link in all_links:
    print(link.get("href"))

#main-content
/search
/user/login
/directory
/hireharris
/findtalent
/student-life/career-development

/admissions
/admitted-students
/eveningprogram/admitted-students
/gateways/current-students
/gateways/faculty-staff
/alumni
/media
/support-harris
/support-harris/ways-to-give
/support-harris/harris-society
#
/
/about
https://apply-harris.uchicago.edu/apply/
/admissions
/about/who-we-are
https://apply-harris.uchicago.edu/apply/
/admissions
/about/who-we-are/our-principles
/about/who-we-are/harris-by-the-numbers
/about/who-we-are/career-outcomes-report
/about/keller-center
https://apply-harris.uchicago.edu/apply/
/admissions
/about/keller-center/the-future-of-policy
/about/keller-center/your-impact
/about/keller-center/design-sustainability
/about/keller-center/press-room
/about/leadership
https://apply-harris.uchicago.edu/apply/
/admissions
/about/leadership/meet-our-new-dean
/about/leadership/deans-office
/about/leadership/harris-council
/about/leadership/alumni-council
/about/leader
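
Most of the href values printed above are relative paths such as /admissions. A minimal sketch of turning them into full URLs with the standard library's `urljoin`, assuming the page URL from the earlier cell as the base; the short `hrefs` list is copied by hand from the output, and the name `absolute_links` is made up here.

```python
from urllib.parse import urljoin

base_url = 'https://harris.uchicago.edu/academics/programs-degrees/degrees/master-public-policy-mpp'

# A few href values copied from the output above.
hrefs = ['#main-content', '/search', '/admissions', 'https://apply-harris.uchicago.edu/apply/']

# urljoin resolves relative paths against the base and leaves full URLs unchanged.
absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```

Note that `link.get("href")` returns `None` for an `a` tag without an href attribute, so a real loop over `all_links` would likely need to skip those before calling `urljoin`.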

In [78]:
type(all_links)

bs4.element.ResultSet

In [79]:
all_divs = soup.find_all("div")
for div in all_divs:
    print(div.get_text())

Utility

Search

Log in

Directory

Hire Harris

Find Talent

Find Work

Info For

Prospective Students

Admitted Students (Full-Time)

New EMP Students

Current Students

Faculty and Staff

Alumni

Media

Support Harris

Ways to Give

Harris Society

Sort

Older first

Most recent

Less relevant

Most relevant

Page 1

The University of Chicago Harris School of Public Policy

AboutConsidering Applying?Get a jump start now by creating an account with our simple online tool.Get StartedStill have questions? Check out Admissions for details on requirements, deadlines, and financial aid.Who We AreConsidering Applying?Get a jump start now by creating an account with our simple online tool.Get StartedStill have questions? Check out Admissions for details on requirements, deadlines, and financial aid.Our PrinciplesHarris By The NumbersCareer Outcomes ReportThe Keller CenterConsidering Applying?Get a jump start now by

### Useful methods

1. soup.select()

BeautifulSoup has a .select() method which uses the SoupSieve package to run a CSS selector against a parsed document and return all the matching elements. Tag has a similar method which runs a CSS selector against the contents of a single tag.

(The SoupSieve integration was added in Beautiful Soup 4.7.0. Earlier versions also have the .select() method, but only the most commonly-used CSS selectors are supported. If you installed Beautiful Soup through pip, SoupSieve was installed at the same time, so you don’t have to do anything extra.)
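
A minimal sketch of `.select()` on a small inline document. The HTML and the class name `requirements` are made up for illustration and are not taken from the Harris page.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not taken from the Harris page.
html = """
<div class="requirements">
  <ul>
    <li>First requirement</li>
    <li>Second requirement</li>
  </ul>
</div>
<ul><li>Unrelated item</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.requirements li" matches every <li> nested inside
# a <div> whose class is "requirements".
items = soup.select("div.requirements li")
texts = [li.get_text() for li in items]
print(texts)

# select_one() returns only the first match (or None if nothing matches).
first = soup.select_one("div.requirements li")
print(first.get_text())
```

The selector skips the second list because its `li` is not inside the matching `div`, which is the main advantage over a plain `find_all("li")`.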

2. soup.get_text()

If you only want the human-readable text inside a document or tag, you can use the `get_text()` method. It returns all the text in a document or beneath a tag, as a single Unicode string.

In [85]:
#homework7 Q2 target

In [83]:
bullets

['18 graduate-level courses (1800 units of credit) with at least 12 Public Policy (PPHA) courses',
 "6 core courses that providea foundation in critical analysis, reflecting Harris's belief that mastering quantitative and analytical skills prepares students to be effective public policy leaders",
 'PPHA 30800 Analytical Politics I: Strategic Foundations or PPHA 41501 - PhD Game Theory(instructor approval required).',
 'PPHA 31610 Analytical Politics II: Political Institutions',
 'Statistics Sequence I.Choose one of the following:',
 'PPHA 31002 Statistics for Data Analysis I',
 'PPHA 31202 Advanced Statistics for Data Analysis I ',
 'Any course in the PhD econometrics sequence (instructor approval required):PPHA 42000 or PPHA42100, or PPHA 42200',
 'Statistics Sequence II.Choose one of the following:',
 'PPHA 31102 Statistics for Data Analysis II:Regressions',
 'PPHA 31302 Advanced Statistics for Data Analysis II',
 'Any course in the PhD econometrics sequence (instructor approval requ