## Why scrape a site?

- get needed info
- faster than manually recording information
- it's fun!

## Personal Examples 

- [Remember the Milk - online task list](https://rememberthemilk.com) 
- [Overcast podcast player website](https://overcast.fm)
- [Audible](https://audible.com)
- [Datacamp](https://datacamp.com)
- [O'Reilly Safari Online](https://learning.oreilly.com/videos/jupytercon-2017/9781491985311)

# How to get started?

## Tools
- Python (obviously)
- Beautiful Soup - https://www.crummy.com/software/BeautifulSoup/
- Requests - http://docs.python-requests.org/en/master/

## install with pip

```shell
pip install beautifulsoup4
pip install requests
```

## or just use anaconda...
- both are included by default

## start with Beautiful Soup first
- go to the site you want to scrape
- save page as
- open in your favorite text editor
- start exploring!

## Some simple HTML 
- combination of [W3 Schools](https://www.w3schools.com/html/html_intro.asp) and the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [1]:
html_stuff = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>
<h1>My Second Heading</h1>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>"""

In [2]:
from bs4 import BeautifulSoup

# create soup object - pass the html and which parser to use
soup = BeautifulSoup(html_stuff, 'html.parser')

- note that there are other parsers supported (check the documentation) but that is outside the scope of this talk
- the standard html parser works fine for our purposes

### Beautiful Soup will show us the nested structure of the soup object

In [3]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h1>
   My First Heading
  </h1>
  <p>
   My first paragraph.
  </p>
  <h1>
   My Second Heading
  </h1>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
 </body>
</html>


## We can pull out specific elements 

In [4]:
soup.title

<title>Page Title</title>

In [5]:
soup.title.name

'title'

In [6]:
soup.title.parent.name

'head'

# Common tasks - pulling text and URLs

### Use find_all() or find()

In [7]:
soup.find_all('h1')

[<h1>My First Heading</h1>, <h1>My Second Heading</h1>]

#### find() returns only the first match and stops searching there, which is useful when you expect just one of that tag

In [8]:
soup.find('h1')

<h1>My First Heading</h1>
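
One practical difference worth knowing: `find()` returns `None` when nothing matches, while `find_all()` returns an empty list. A minimal sketch, using a small made-up HTML snippet rather than the page above:

```python
from bs4 import BeautifulSoup

# Small stand-in document (hypothetical, for illustration only)
snippet = "<html><body><h1>Only Heading</h1><p>Some text.</p></body></html>"
snippet_soup = BeautifulSoup(snippet, 'html.parser')

first_h1 = snippet_soup.find('h1')      # first match, a Tag
missing = snippet_soup.find('h2')       # no match, so None
all_h2 = snippet_soup.find_all('h2')    # no match, so an empty list

# Guard against None before calling methods on the result
h2_text = missing.get_text() if missing is not None else ''

print(first_h1.get_text())
print(missing, all_h2, repr(h2_text))
```

Calling `.get_text()` directly on the result of a `find()` that matched nothing raises an `AttributeError`, so the `None` check is worth the extra line.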

### Return only the text

In [9]:
print(soup.get_text())




Page Title


My First Heading
My first paragraph.
My Second Heading
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.




### Just the URLs

In [10]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
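
Not every `<a>` tag has an `href`. As a sketch on a made-up snippet: `link.get('href')` returns `None` for such tags, and passing `href=True` to `find_all()` skips them.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second anchor has no href attribute
links_html = '<p><a href="http://example.com/elsie">Elsie</a> <a name="top">Top</a></p>'
links_soup = BeautifulSoup(links_html, 'html.parser')

# .get() returns None when the attribute is missing
all_hrefs = [a.get('href') for a in links_soup.find_all('a')]

# href=True keeps only the tags that actually have an href
real_hrefs = [a['href'] for a in links_soup.find_all('a', href=True)]

print(all_hrefs)
print(real_hrefs)
```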


### Just the text from a specific tag

In [11]:
for text in soup.find_all('a'):
    print(text.get_text())

Elsie
Lacie
Tillie


In [12]:
for text in soup.find_all('h1'):
    print(text.get_text())

My First Heading
My Second Heading


# Real World Example

In [13]:
with open('Think Stats, 2nd Edition.html', 'r') as f:
    safari = f.read()
    
safari_soup = BeautifulSoup(safari, 'html.parser')

In [14]:
print(safari_soup.prettify())

<!DOCTYPE html>
<!-- saved from url=(0072)https://learning.oreilly.com/library/view/think-stats-2nd/9781491907344/ -->
<html class=" js flexbox flexboxlegacy no-touch websqldatabase indexeddb history csscolumns csstransforms localstorage sessionstorage applicationcache svg inlinesvg" data-account-type="Paid" data-activated-trial-date="09/12/2017" data-archive="9781491907344" data-book-overview="true" data-csrf-cookie="csrfsafari" data-debug="0" data-highlight-privacy="" data-login-url="/accounts/login/" data-offline-url="/" data-publishers="O'Reilly Media, Inc." data-testing="0" data-url="/library/view/think-stats-2nd/9781491907344/" data-user-id="2002766" data-user-uuid="0f245ad9-c9d7-4f63-95c5-def519b24116" data-username="jballoonist" itemscope="" itemtype="http://schema.org/Book http://schema.org/CollectionPage" lang="en" prefix="og: http://ogp.me/ns/# og:book: http://ogp.me/ns/book# og:video: http://ogp.me/ns/video#" style="">
 <!--<![endif]-->
 <head>
  <meta content="text/html; c

## Find specific sections based on both tag and class

In [15]:
for i in safari_soup.find_all('a', class_="t-chapter"):
    print(i.get_text())

Preface
How I Wrote This Book
Using the Code
Contributor List
Safari® Books Online
How to Contact Us
1. Exploratory Data Analysis
A Statistical Approach
The National Survey of Family Growth
Importing the Data
DataFrames
Variables
Transformation
Validation
Interpretation
Exercises
Glossary
2. Distributions
Representing Histograms
Plotting Histograms
NSFG Variables
Outliers
First Babies
Summarizing Distributions
Variance
Effect Size
Reporting Results
Exercises
Glossary
3. Probability Mass Functions
Pmfs
Plotting PMFs
Other Visualizations
The Class Size Paradox
DataFrame Indexing
Exercises
Glossary
4. Cumulative Distribution Functions
The Limits of PMFs
Percentiles
CDFs
Representing CDFs
Comparing CDFs
Percentile-Based Statistics
Random Numbers
Comparing Percentile Ranks
Exercises
Glossary
5. Modeling Distributions
The Exponential Distribution
The Normal Distribution
Normal Probability Plot
The lognormal Distribution
The Pareto Distribution
Generating Random Numbers
Why Model?
Exercises
G

## Get only the parts I want

In [16]:
# build a tuple of chapter prefixes: '1.', '2.', ..., '19.'
chap_nums = tuple(str(i) + '.' for i in range(1, 20))

for i in safari_soup.find_all('a', class_="t-chapter"):
    
    # use startswith to check if it begins with a number
    if i.get_text().startswith(chap_nums):
        print(i.get_text())

1. Exploratory Data Analysis
2. Distributions
3. Probability Mass Functions
4. Cumulative Distribution Functions
5. Modeling Distributions
6. Probability Density Functions
7. Relationships Between Variables
8. Estimation
9. Hypothesis Testing
10. Linear Least Squares
11. Regression
12. Time Series Analysis
13. Survival Analysis
14. Analytic Methods
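
The tuple passed to `startswith()` only covers chapters 1 through 19. One alternative, sketched here on a hard-coded sample of the titles above instead of the saved page, is a regular expression that accepts any chapter number:

```python
import re

# Sample of the link text printed earlier (hard-coded for illustration)
titles = ['Preface', 'Using the Code', '1. Exploratory Data Analysis',
          'DataFrames', '2. Distributions', '14. Analytic Methods']

# One or more digits, a period, then whitespace at the start of the string
chapter_pattern = re.compile(r'\d+\.\s')

chapters = [t for t in titles if chapter_pattern.match(t)]
print(chapters)
```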


# Automate with Requests

### Here we are doing the same thing as above, but using requests to grab the html

In [17]:
import requests

# r = requests.get('https://learning.oreilly.com/library/view/python-machine-learning/9781786464477/')
r = requests.get('https://learning.oreilly.com/library/view/r-for-data/9781491910382/')
# r = requests.get('https://www.safaribooksonline.com/library/view/python-data-science/9781491912126/')

In [18]:
soup_requests = BeautifulSoup(r.text, 'html.parser')

for i in soup_requests.find_all('a', class_="t-chapter"):
    if i.get_text().startswith(chap_nums):
        print(i.get_text())

1. Data Visualization with ggplot2
2. Workflow: Basics
3. Data Transformation with dplyr
4. Workflow: Scripts
5. Exploratory Data Analysis
6. Workflow: Projects
7. Tibbles with tibble
8. Data Import with readr
9. Tidy Data with tidyr
10. Relational Data with dplyr
11. Strings with stringr
12. Factors with forcats
13. Dates and Times with lubridate
14. Pipes with magrittr
15. Functions
16. Vectors
17. Iteration with purrr
18. Model Basics with modelr
19. Model Building
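
A request can fail or hang, so it can help to set a timeout and check the status code before parsing. A minimal sketch; `fetch_html` is a hypothetical helper, not part of Requests, and the 10-second timeout is an arbitrary choice:

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx responses
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        print(f'Request failed: {err}')
        return None
    return response.text
```

The returned text can be passed to `BeautifulSoup(..., 'html.parser')` exactly as `r.text` is above, after checking that it is not `None`.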


# Logging into a site
- [Overcast podcast player website](https://overcast.fm)
- [Full Code example](https://www.pythonanywhere.com/user/JBalloonist/files/home/JBalloonist/RTM/getovercast.py?edit)


In [19]:
# import username and password
from const import payload

LOGIN_URL = 'https://overcast.fm/login'
PODCASTS = 'https://overcast.fm/podcasts'
OVERCAST = 'https://overcast.fm'

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post(LOGIN_URL, data=payload)
    r = s.get(PODCASTS)
    soup = BeautifulSoup(r.text, 'html.parser')
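
`const.py` is not shown here. One way it might look, reading the credentials from environment variables so they stay out of the code. The `email` and `password` field names and the `OVERCAST_*` variable names are assumptions, so check the site's login form for the names it actually expects:

```python
import os

def build_payload(env=os.environ):
    """Build the login form data from environment variables."""
    return {
        # Field names are assumed; inspect the site's login form to confirm
        'email': env.get('OVERCAST_EMAIL', ''),
        'password': env.get('OVERCAST_PASSWORD', ''),
    }

payload = build_payload()
```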

In [29]:
print(soup.prettify())

<!DOCTYPE html>
<html class="controller_podcasts" lang="en">
 <head>
  <title>
   Overcast
  </title>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="152x152"/>
  <link color="#fc7e0f" href="/img/logo-black.svg" rel="mask-icon"/>
  <link href="/favicon.ico" rel="icon"/>
  <link href="/pure-min-0.5.0.css" rel="stylesheet" type="text/css"/>
  <link href="/grids-responsive-min-0.5.0.css" rel="stylesheet" type="text/css"/>
  <link href="/assets/48/main.css" rel="stylesheet" type="text/css"/>
  <script src="/assets/48/jquery.min.js" type="text/javascript">
  </script>
  <script src="/assets/48/main.js" type="text/javascript">
  </script>
 </head>
 <body>
  <div class="nav">
   <a class="left navlink" href="/podcasts">
    <img alt="Overcast" class="narrow" id="tinyhomeimg" src="/img/logo.svg"/>
    <span class="notnarrow">
     Overcast
    </span>
   </a>
   <a class="left navlink" href="/uploads">

# Downloading audio
- use urllib instead of requests

In [22]:
# urllib.request must be imported explicitly; a bare "import urllib" does not load it
import urllib.request

url = 'http://archive-server.liveatc.net/kday/'
kday = 'KDAY'
month = 'Feb'
year = 2019

def download(url, station, month, day, year, time):
    variables = f'-{month}-{day}-{year}-{time}Z.mp3'
    link = url + station + variables
    print(link)
    response = urllib.request.urlopen(link)
    
    with open(f'{station}{variables}', 'wb') as f:
        f.write(response.read())
                
download(url, kday, month, '10', year, '1800')

http://archive-server.liveatc.net/kday/KDAY-Feb-10-2019-1800Z.mp3
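
To grab several recordings, the URLs can be generated first and then downloaded one at a time. A sketch that only builds the URLs; `archive_url` is a hypothetical helper, and the half-hour spacing of the recordings is an assumption:

```python
BASE_URL = 'http://archive-server.liveatc.net/kday/'

def archive_url(station, month, day, year, time):
    """Build the archive URL for one recording."""
    return f'{BASE_URL}{station}-{month}-{day}-{year}-{time}Z.mp3'

# Assumed half-hour slots between 1800Z and 1930Z
times = [f'{hour:02d}{minute:02d}' for hour in (18, 19) for minute in (0, 30)]
urls = [archive_url('KDAY', 'Feb', '10', 2019, t) for t in times]

for u in urls:
    print(u)
```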


# Live Coding?!
- if time allows...

In [23]:
r = requests.get('https://books.goalkicker.com/')

In [24]:
soup_program = BeautifulSoup(r.text, 'html.parser')
print(soup_program.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Free Programming Books – GoalKicker.com
  </title>
  <meta content="width=800" name="viewport"/>
  <meta charset="utf-8"/>
  <link href="favicon.ico" rel="icon" type="image/x-icon"/>
  <link href="https://books.goalkicker.com/" rel="canonical"/>
  <link href="https://fonts.googleapis.com/css?family=Quicksand" rel="stylesheet"/>
  <meta content="Free Programming Books on Android development, C, C#, CSS, HTML5, iOS development, Java, JavaScript, PowerShell, PHP, Python, SQL Sever and more" name="description">
   <meta content="Programming Books,Programming PDF Books,Programming Tutorials,Android development,CSS,HTML5,iOS development,Java,JavaScript,PowerShell,PHP,Python,SQL Sever" name="keywords">
    <meta content="Free Programming Books; HTML5, CSS3, JavaScript, PHP, Python..." property="og:title"/>
    <meta content="website" property="og:type"/>
    <meta content="https://books.goalkicker.com/goalkicker_books.png" property="og:ima

In [25]:
for i in soup_program.select('.bookContainer > a'):
    print(i.get('href'))

DotNETFrameworkBook/
AlgorithmsBook/
AndroidBook/
Angular2Book/
AngularJSBook/
BashBook/
CBook/
CPlusPlusBook/
CSharpBook/
CSSBook/
EntityFrameworkBook/
ExcelVBABook/
GitBook/
HaskellBook/
HibernateBook/
HTML5Book/
HTML5CanvasBook/
iOSBook/
JavaBook/
JavaScriptBook/
jQueryBook/
KotlinBook/
LaTeXBook/
LinuxBook/
MATLABBook/
MicrosoftSQLServerBook/
MongoDBBook/
MySQLBook/
NodeJSBook/
ObjectiveCBook/
OracleDatabaseBook/
PerlBook/
PHPBook/
PostgreSQLBook/
PowerShellBook/
PythonBook/
RBook/
ReactJSBook/
ReactNativeBook/
RubyBook/
RubyOnRailsBook/
SpringFrameworkBook/
SQLBook/
SwiftBook/
TypeScriptBook2/
VBABook/
VisualBasic_NETBook/
XamarinFormsBook/
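
These `href` values are relative to the page they came from. Before requesting them, they can be turned into full URLs with `urljoin` from the standard library. A sketch using three of the values above:

```python
from urllib.parse import urljoin

BASE = 'https://books.goalkicker.com/'

# A few of the relative links printed above
relative_links = ['AlgorithmsBook/', 'BashBook/', 'PythonBook/']

full_links = [urljoin(BASE, href) for href in relative_links]
for link in full_links:
    print(link)
```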


# Other Libraries

- Scrapy 
- Urllib (standard library)
- Selenium
- LXML (alternative to BeautifulSoup)