# Intro to Web Scraping

## Introduction
Web scraping is the task of gathering data from the web: text, numbers, images, videos, and so on.  
The process of web scraping is simple:
1. Download the web page.
2. Extract the data you need from the web page.

Yes, the process is simple, but web pages are messy. Under the hood of the pretty web sites we look at is a lot of structure put in place by markup and styling languages like HTML and CSS. This messiness, like in the image below, is typically hard to parse for someone who isn't familiar with building websites.

<img src="raw-html.png" alt="raw html" style="width: 70%; height: 70%"/>

But that's why we have Python packages :).

Web scraping is used a lot in the real world for many purposes: storing data in databases for later analysis, making on-the-fly decisions (like buying stocks), or building a bot that retrieves the news and tries to summarize it for you in a paragraph or less. We've personally used web scraping to help small to medium-sized businesses collect more data, make more data-driven decisions, and automate a lot of mundane processes.

## The Mechanics

### BeautifulSoup

To install: ```pip install beautifulsoup4```

#### Importing the modules

In [3]:
import requests
from bs4 import BeautifulSoup
import re
from pprint import pprint

* The [requests](http://docs.python-requests.org/en/master/) package makes HTTP requests to download web pages and the like; its tagline is "HTTP for Humans".
* ```re``` is for regular expressions
* ```pprint``` "pretty-prints" Python data structures such as lists and dictionaries

#### Get the HTML from a Web Page

In [13]:
url = 'https://twitter.com/realDonaldTrump'
web_page_html = requests.get(url).text
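
The one-line download above assumes the request succeeds. Below is a minimal sketch of a more defensive version; the helper name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising an error on failure.

    `fetch_html` and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)  # give up if the server is too slow
    response.raise_for_status()                    # raise on 4xx/5xx status codes
    return response.text
```

Calling `fetch_html(url)` in place of `requests.get(url).text` should surface a failed download as an exception instead of silently handing an error page to the parser.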

#### Parsing the HTML

In [14]:
soup = BeautifulSoup(web_page_html, "html.parser")

#### Printing the Title of the Web Page

In [78]:
print(soup.title)             # Prints the web page title including the tags
print(soup.title.text)        # Prints just the title text

<title>Donald J. Trump (@realDonaldTrump) | Twitter</title>
Donald J. Trump (@realDonaldTrump) | Twitter


#### We Can Also Print the Body of the Web Page (i.e. everything inside the ```<body>``` tag, excluding ```<head>```)

In [23]:
print(soup.body)             # Prints the body of the web page (including all tags)
print(soup.body.text)        # Prints the text in the body stripped of all tags

#### Printing Tagged Links

In HTML, a link is created by using the tag ```<a>``` with the attribute ```href```. To find all links that have this attribute, we can use the code below.

In [50]:
# print a few links
links = [link['href'] for link in soup.find_all('a', href=True, string=re.compile(r''))]
links[45:60]

['/help/verified',
 '/realDonaldTrump/likes',
 '/realDonaldTrump',
 '/help/verified',
 '/realDonaldTrump/media',
 '/realDonaldTrump/media',
 '/realDonaldTrump/with_replies',
 '/realDonaldTrump/media',
 'https://t.co/I4Vz1mRdBK',
 'https://t.co/o7YNUNwb8f',
 'https://t.co/48iZam5Fai',
 'https://t.co/okiZ8ZeDgz',
 '/search?q=place%3A07d9dafff4481002',
 '/realDonaldTrump/status/824083821889015809',
 '/realDonaldTrump/status/824080766288228352']
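
Many of the links in the output above are relative (they start with `/`), so they cannot be requested as-is. A small sketch of turning them into absolute URLs with the standard library's `urljoin` follows; the `hrefs` list is a hand-copied sample of the output, not the full result.

```python
from urllib.parse import urljoin

base_url = "https://twitter.com/realDonaldTrump"  # the page the links were scraped from

# A few hrefs copied by hand from the output above
hrefs = ["/help/verified", "/realDonaldTrump/likes", "https://t.co/I4Vz1mRdBK"]

# Relative paths are resolved against the base URL; absolute URLs pass through unchanged
absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```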

Just looking for links with "Trump" in them?

In [54]:
for link in soup.find_all('a', href=True):
    if 'Trump' in link['href']:
        print(link['href'])

/realDonaldTrump
/realDonaldTrump
/realDonaldTrump/following
/realDonaldTrump/followers
/realDonaldTrump/likes
/realDonaldTrump/likes
/realDonaldTrump
/realDonaldTrump
/realDonaldTrump/media
/realDonaldTrump/media
/realDonaldTrump/with_replies
/realDonaldTrump/media
/realDonaldTrump
/realDonaldTrump/status/824448880993509376
/realDonaldTrump
/realDonaldTrump/status/824448156935081984
/realDonaldTrump
/realDonaldTrump/status/824440456813707265
/realDonaldTrump
/realDonaldTrump/status/824407390674157568
/realDonaldTrump
/realDonaldTrump/status/824377804590563339
/realDonaldTrump
/realDonaldTrump/status/824229586091307008
/realDonaldTrump
/realDonaldTrump/status/824228768227217408
/realDonaldTrump
/realDonaldTrump/status/824227824903090176
/realDonaldTrump
/realDonaldTrump/status/824083821889015809
/realDonaldTrump
/realDonaldTrump/status/824080766288228352
/realDonaldTrump
/realDonaldTrump/status/824078417213747200
/realDonaldTrump
/realDonaldTrump/status/824055927200423936
/realDonaldTr

#### Parsing Text by Drilling into Tags

<img src="trump-twitter-descr.png" alt="trump description in twitter" style="width: 70%; height: 70%" />

In [55]:
print(soup.find('div',{"class":"ProfileHeaderCard"}).find('p').text)

45th President of the United States of America
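
Chained calls like the one above raise an `AttributeError` when any step finds nothing, because `find` returns `None` for a missing tag. Here is a hedged sketch of a safer version; the helper `first_paragraph_text` and the sample HTML are hypothetical stand-ins, not the real Twitter markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page
sample_html = '<div class="ProfileHeaderCard"><p>Example bio text</p></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")


def first_paragraph_text(soup, css_class):
    """Return the text of the first <p> in the first <div> with css_class, or None."""
    card = soup.find("div", class_=css_class)
    if card is None:          # no matching <div> on the page
        return None
    paragraph = card.find("p")
    if paragraph is None:     # the <div> exists but holds no <p>
        return None
    return paragraph.get_text(strip=True)


print(first_paragraph_text(sample_soup, "ProfileHeaderCard"))
print(first_paragraph_text(sample_soup, "NoSuchClass"))
```

Returning `None` lets the caller decide how to handle a page whose layout has changed, which happens often with scraped sites.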


We can use a similar query to gather all the tweets on the page by drilling into all ```p``` tags that have the class ```TweetTextSize TweetTextSize--26px js-tweet-text tweet-text```.

In [72]:
tweets = [tweet.text for tweet in soup.find_all('p', {"class": "TweetTextSize TweetTextSize--26px js-tweet-text tweet-text"})]
for index, tweet in enumerate(tweets):
    print(index, tweet, "\n")

0 "@romoabcnews: .@DavidMuir first @POTUS interview since taking office.  Tonight on @ABCWorldNews @ABC2020 tonight. pic.twitter.com/I4Vz1mRdBK" 

1 As your President, I have no higher duty than to protect the lives of the American people.pic.twitter.com/o7YNUNwb8f 

2 I will be interviewed by @DavidMuir tonight at 10 o'clock on @ABC. Will be my first interview from the White House. Enjoy!pic.twitter.com/okiZ8ZeDgz – at The White House 

3 Big day planned on NATIONAL SECURITY tomorrow. Among many other things, we will build the wall! 

4 If Chicago doesn't fix the horrible "carnage" going on, 228 shootings in 2017 with 42 killings (up 24% from 2016), I will send in the Feds! 

5 Signing orders to move forward with the construction of the Keystone XL and Dakota Access pipelines in the Oval Office.pic.twitter.com/OErGmbBvYK – at The Oval Office 

6 Peaceful protests are a hallmark of our democracy. Even if I don't always agree, I recognize the rights of people to express their views. 

7
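
Matching on the full class string is brittle: it breaks if the site adds, removes, or reorders one class. Beautiful Soup can also match a tag by any single one of its classes. The sketch below uses made-up HTML, and it is an assumption that `tweet-text` alone would be specific enough on the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the class names discussed above
sample_html = """
<p class="TweetTextSize TweetTextSize--26px js-tweet-text tweet-text">first tweet</p>
<p class="TweetTextSize js-tweet-text tweet-text">second tweet</p>
<p class="unrelated">not a tweet</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_="tweet-text" matches any <p> that has tweet-text among its classes
sample_tweets = [p.get_text(strip=True) for p in sample_soup.find_all("p", class_="tweet-text")]
for index, tweet in enumerate(sample_tweets):
    print(index, tweet)
```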

## Controlling the Web Browser

A lot of websites today use JavaScript to control the behavior of their sites (single-page applications, for example). Because of this, if you write a program to navigate a website and scrape data from it, making simple URL calls is often not enough: you need to control a web browser.

The [Selenium API for Python](https://selenium-python.readthedocs.io/getting-started.html#simple-usage) allows you to control a web browser like Chrome, Firefox, or Safari.
* Provides an easy way to handle JavaScript when trying to scrape data from a web page
* Great way to automate web testing

To install:
* ```pip install selenium```
* ```brew install chromedriver``` (for macOS)

In [2]:
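from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()                  # launch a Chrome window controlled by Selenium
driver.execute_script('window.focus()')
driver.get("http://www.python.org")
assert "Python" in driver.title
# Selenium 4 locates elements with find_element and a By locator
elem = driver.find_element(By.NAME, "q")     # the search box
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)                  # submit the search
assert "No results found." not in driver.page_source
#driver.close()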

## Newspaper

[Newspaper](https://newspaper.readthedocs.io/en/latest/) is a Python library specifically for extracting and curating news articles, and it should support any major news site.

Some of the things it can do:
* multi-language support
* news URL identification
* text extraction
* image, keyword, and summary extraction
* author extraction
* Google trending terms extraction

To Install:  
if you're using Python 3:  
```pip install newspaper3k```  
else:  
```pip install newspaper```

In [72]:
import newspaper
cnn_paper = newspaper.build('http://cnn.com')

In [None]:
# get article object
article = cnn_paper.articles[0]
# download article
article.download()
# parse article
article.parse()
# get article data
article.text
article.title
article.authors
# automatic summarization
article.nlp()
article.summary

## Ethics

On most web sites, there's typically a file in the root of the site called ```robots.txt``` ([check out Twitter's](https://twitter.com/robots.txt)). This plain-text file lists the paths that the web site doesn't want web bots (like the ones used by Google to index the internet) to access. Although there's not a law that says you can't access those "forbidden" locations, it's typically a courtesy not to.
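
Python's standard library can read these rules for you. The sketch below parses a small, made-up set of `robots.txt` lines with `urllib.robotparser`; the rules, the `example.com` URLs, and the user agent name `my-scraper` are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules, not taken from any real site
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # for a live site, use parser.set_url(...) and parser.read() instead

# can_fetch reports whether the given user agent may request the URL
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
allowed = parser.can_fetch("my-scraper", "https://example.com/public/page")
print(blocked, allowed)
```

Checking `can_fetch` before each request is a cheap way to honor the courtesy described above.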

Also, consider how often you request pages from a web site. If you do it too often, some websites might automatically ban your IP address from accessing the site. Or worse, you send the site so many requests that its servers can't handle the load and the site goes down. This is very much like a DoS (Denial of Service) attack, and you could get sued or go to jail, depending on the damage and your intentions.

So in short, when scraping a web page, don't do this:

```python
while True:
    scrape_page()  # hypothetical function; this loop requests pages nonstop with no pause
```

Be smart about how often you scrape data from a site. If you're really clever, you could collect user-behavior data from a website and use it to train your web scraper to request pages only as often as a human would.
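
A simple alternative to the tight loop above is to pause between requests. This is a minimal sketch, not a full rate limiter: the function name `scrape_politely`, the `fetch` callback, and the two-second default delay are assumptions chosen for illustration.

```python
import time


def scrape_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results


# Usage with a stand-in fetch function; a real one would download the page
pages = scrape_politely(["page-a", "page-b"], fetch=lambda url: url.upper(), delay_seconds=0)
print(pages)
```

A fixed delay is the simplest choice; a randomized delay would look more like human browsing, at the cost of less predictable run times.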

## Exercises

**SpaceX Missions**
* Scrape the SpaceX future missions table from [this](http://www.spacex.com/missions) website.

**House Listings**
* Create a CSV file with data on houses listed in a US city of your choice, using information from [LoopNet](http://www.loopnet.com/). The requirements of the data are as follows:
- At least 25 rows of data with the following attributes:
    - Street Address
    - City
    - State
    - Type of Listing
    - Square Footage
    - Broker Name

**Scraping News**
* Using the newspaper package in Python, choose an online news outlet and find all articles with titles related to Donald Trump. Save the titles, authors, publication dates, keywords, and summaries of these articles.