## Web Scraping with Python

[Web Scrape with Python](https://www.apmonitor.com/dde/index.php/Main/WebScraping) in the [Data-Driven Engineering](http://apmonitor.com/dde) online course.

<img align=left width=500px src='https://apmonitor.com/dde/uploads/Main/python_web_scrape.png'>

Internet data is a rich source of information. Most online information is designed for web browsers and is meant to be read by people. Web scraping is the retrieval and curation of online information by a computer program. Scraping automates tedious manual retrieval and can be used to watch for updates. The exercises in this section demonstrate how to retrieve data, such as an image and a table, from a website.

### 📷  Download Image

Download `python_web_scrape.png` from [Web Scraping with Python](http://apmonitor.com/dde/index.php/Main/WebScraping) using the `urllib` library.

In [None]:
import urllib.request

# download image
img = 'python_web_scrape.png'
url = 'http://apmonitor.com/dde/uploads/Main/'+img
urllib.request.urlretrieve(url, img)
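
Each run of the cell above downloads the file again. A minimal sketch that skips the request when the file is already on disk is shown below. The helper name `download_if_missing` is hypothetical and is not part of `urllib`.

```python
import os
import urllib.request

def download_if_missing(url, filename):
    """Download url to filename unless the file already exists.

    Hypothetical helper for illustration. Returns True when a
    download happened and False when the existing file was kept.
    """
    if os.path.exists(filename):
        return False
    urllib.request.urlretrieve(url, filename)
    return True
```

With the `url` and `img` values defined above, `download_if_missing(url, img)` returns `True` on the first run and `False` on later runs.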

Packages for image manipulation and computer vision include `Pillow` (Python Imaging Library), `Scikit-image`, `Matplotlib` (uses Pillow functions), and `OpenCV`. OpenCV is among the most capable computer vision packages and is supported in many development environments. Display the image with `Matplotlib`.

In [None]:
import matplotlib.pyplot as plt
im = plt.imread(img)
plt.imshow(im)

### 🔢  Read Table with Pandas

The [Pandas Time-Series](https://www.apmonitor.com/dde/index.php/Main/PandasTimeSeries) exercise demonstrates how to read a [text file](https://apmonitor.com/dde/uploads/Main/tclab.txt) from either a local directory or a URL. However, suppose the data is online as an HTML table, as shown in [Web Scrape with Python](https://www.apmonitor.com/dde/index.php/Main/WebScraping).

<html>
<table>
  <thead>
    <tr style="text-align: right;">
      <th>time</th>
      <th>Q1</th>
      <th>Q2</th>
      <th>T1</th>
      <th>T2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0.0</th>
      <td>0.0</td>
      <td>0.0</td>
      <td>20.9495</td>
      <td>20.9495</td>
    </tr>
    <tr>
      <th>5.0</th>
      <td>0.0</td>
      <td>0.0</td>
      <td>20.9495</td>
      <td>20.9495</td>
    </tr>
    <tr>
      <th>10.0</th>
      <td>70.0</td>
      <td>0.0</td>
      <td>20.9495</td>
      <td>20.9495</td>
    </tr>
    <tr>
      <th>15.0</th>
      <td>70.0</td>
      <td>0.0</td>
      <td>21.5941</td>
      <td>20.9495</td>
    </tr>
    <tr>
      <th>20.0</th>
      <td>70.0</td>
      <td>0.0</td>
      <td>22.2387</td>
      <td>20.9495</td>
    </tr>
    <tr>
      <th>25.0</th>
      <td>70.0</td>
      <td>0.0</td>
      <td>22.8833</td>
      <td>20.9495</td>
    </tr>
    <tr>
      <th>30.0</th>
      <td>70.0</td>
      <td>0.0</td>
      <td>23.8502</td>
      <td>20.9495</td>
    </tr>
    <tr>
      <th>35.0</th>
      <td>70.0</td>
      <td>0.0</td>
      <td>25.1394</td>
      <td>21.2718</td>
    </tr>
    <tr>
      <th>40.0</th>
      <td>70.0</td>
      <td>0.0</td>
      <td>26.1063</td>
      <td>21.2718</td>
    </tr>
    <tr>
      <th>45.0</th>
      <td>70.0</td>
      <td>0.0</td>
      <td>27.0732</td>
      <td>21.5941</td>
    </tr>
    <tr>
      <th>50.0</th>
      <td>70.0</td>
      <td>0.0</td>
      <td>28.3624</td>
      <td>21.5941</td>
    </tr>
    <tr>
      <th>55.0</th>
      <td>70.0</td>
      <td>0.0</td>
      <td>29.3293</td>
      <td>21.5941</td>
    </tr>
    <tr>
      <th>60.0</th>
      <td>70.0</td>
      <td>0.0</td>
      <td>30.6185</td>
      <td>21.9164</td>
    </tr>
  </tbody>
</table>
</html>

Read the table into Python with the Pandas `read_html()` function. This function returns all tables found on a webpage as a list of `DataFrame` objects. Use `[0]` to retrieve the first table.

In [None]:
import pandas as pd
url = 'http://apmonitor.com/dde/index.php/Main/WebScraping'
data = pd.read_html(url)[0]
data
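
To see what table extraction involves, the sketch below collects the cell text of each row with the standard library `html.parser` module. It is an illustration only and not how Pandas implements `read_html()`. The class name `TableRows` and the shortened `sample` table are assumptions made for this example.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of <th>/<td> cells, one list per <tr> (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row being read
        self._cell = None   # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('th', 'td') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# shortened version of the table shown above (assumed sample)
sample = """<table>
<thead><tr><th>time</th><th>Q1</th><th>T1</th></tr></thead>
<tbody>
<tr><th>0.0</th><td>0.0</td><td>20.9495</td></tr>
<tr><th>5.0</th><td>0.0</td><td>20.9495</td></tr>
</tbody></table>"""

parser = TableRows()
parser.feed(sample)
header, *body = parser.rows
```

Here `header` holds the column names and `body` holds the two data rows as lists of strings. `read_html()` goes further by converting the values to numeric types and building the `DataFrame`.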

The table can be modified as a `DataFrame`, such as setting the index as time.

In [None]:
data.set_index('time',inplace=True)
data

### 🔢  Read Table with requests

Many websites are designed to block program (bot) access to avoid Distributed Denial of Service (DDoS) attacks that can overwhelm a web service. Before sending many requests, check with the website owner so that the service is not overloaded. Read the table with `requests` and include a header that emulates a browser.

In [None]:
import requests
from io import StringIO
# look like a browser
header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) "+
                "AppleWebKit/537.36 (KHTML, like Gecko) "+
                "Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}
r = requests.get(url, headers=header)
# wrap the HTML text in StringIO; recent Pandas versions deprecate passing raw HTML strings
data = pd.read_html(StringIO(r.text))[0]
data.set_index('time',inplace=True)
data
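
Many sites publish a `robots.txt` file that states which paths automated clients may request. A minimal sketch of checking those rules with the standard library `urllib.robotparser` is shown below. The rules in `sample_rules` and the `example.com` URLs are made up for illustration, and the helper name `is_allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, url, agent='*'):
    """Return True if the robots.txt rules permit agent to fetch url.

    Hypothetical helper: robots_lines is a list of lines from a
    robots.txt file, supplied directly so no network access is needed.
    """
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, url)

# made-up rules for illustration only
sample_rules = ['User-agent: *', 'Disallow: /private/']
public_ok = is_allowed(sample_rules, 'http://example.com/dde/index.php')
private_ok = is_allowed(sample_rules, 'http://example.com/private/data.html')
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the rules from the site itself. A pause such as `time.sleep(1)` between requests also helps keep the load low.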

### 🥣 Beautiful Soup

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python package for extracting (scraping) information from web pages. It uses an HTML or XML parser and provides functions for iterating, searching, and modifying the parse tree. First, get the HTML source from a webpage such as this page.

In [None]:
import requests
url = 'http://apmonitor.com/dde/index.php/Main/WebScraping?action=print'
page = requests.get(url)

The attribute `page.content` contains the HTML source if `page.status_code` starts with a `2`, such as 200 (downloaded successfully). A status code that starts with `4` or `5` indicates an error. BeautifulSoup parses HTML or XML files.
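
The status code classes described above can be checked in code. The sketch below uses the standard library `http.HTTPStatus` to attach the standard reason phrase to a code. The helper name `status_summary` is hypothetical.

```python
from http import HTTPStatus

def status_summary(code):
    """Describe an HTTP status code by its first digit (hypothetical helper)."""
    kind = {2: 'success', 3: 'redirect', 4: 'client error', 5: 'server error'}
    # HTTPStatus raises ValueError for codes it does not know
    phrase = HTTPStatus(code).phrase
    return '{} {} ({})'.format(code, phrase, kind.get(code // 100, 'informational'))
```

With the `page` object above, `print(status_summary(page.status_code))` reports the result. The `requests` package also provides `page.raise_for_status()`, which raises an exception for `4xx` and `5xx` responses.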

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

Use `print(soup.prettify())` to view the parsed document as indented, structured output. Individual elements are available as attributes, such as the page title.

In [None]:
print(soup.title.text)

Extract all of the links with `find_all('a')`:

In [None]:
for link in soup.find_all('a'):
    print('Link Text: {}'.format(link.text))
    print('href: {}'.format(link.get('href')))
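
Some `href` values are relative to the page address, and `link.get('href')` returns `None` for anchors without an `href`. A minimal sketch that resolves links to absolute URLs with the standard library `urllib.parse.urljoin` is shown below. The helper name `absolute_links` is hypothetical and the `hrefs` list holds assumed sample values, not the actual links on the page.

```python
from urllib.parse import urljoin

def absolute_links(base_url, hrefs):
    """Resolve href values against base_url, skipping missing or empty values."""
    return [urljoin(base_url, h) for h in hrefs if h]

base = 'http://apmonitor.com/dde/index.php/Main/WebScraping'
# assumed sample values: site-relative, page-relative, missing, and absolute
hrefs = ['/dde/uploads/Main/tclab.txt',
         'PandasTimeSeries',
         None,
         'https://www.crummy.com/software/BeautifulSoup/bs4/doc/']
links = absolute_links(base, hrefs)
```

With the `soup` object above, the input list could be built as `[a.get('href') for a in soup.find_all('a')]`.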

The `Pandas` function `read_html()` relies on a parser such as `lxml` or `BeautifulSoup` to extract tables from webpages. Data scraping is particularly useful for getting information from webpages that are updated regularly, such as weather, stock data, and customer reviews. More advanced web scraping packages are `MechanicalSoup`, `Scrapy`, and `Selenium`.

### ✅ Activity

Practice web scraping to retrieve data from another website of interest that contains a table. Organize the content into a DataFrame and export the DataFrame to a CSV file. Below is an example of retrieval from the [Wikipedia article on Data Tables](https://en.wikipedia.org/wiki/Table_(information)) where the data table is saved as `test.csv`. Change the `url`, use `requests` with a browser header if necessary, and export the data file.

In [None]:
import pandas as pd
import requests
from io import StringIO
url = 'https://en.wikipedia.org/wiki/Table_(information)'
req = requests.get(url)
# wrap the HTML text in StringIO for read_html
data = pd.read_html(StringIO(req.text))[0]
data.to_csv('test.csv')
data
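
By default, `to_csv()` also writes the row index as the first column. The sketch below uses a small made-up `DataFrame` (values assumed for illustration) and an in-memory buffer in place of a file. It shows how `index=False` leaves the index out so that the data reads back unchanged.

```python
import io
import pandas as pd

# small made-up table for illustration
df = pd.DataFrame({'time': [0.0, 5.0, 10.0],
                   'T1': [20.9495, 20.9495, 20.9495]})

# write to an in-memory buffer without the row index
buf = io.StringIO()
df.to_csv(buf, index=False)
csv_text = buf.getvalue()

# read the CSV text back into a new DataFrame
df_back = pd.read_csv(io.StringIO(csv_text))
```

When a column such as `time` has been set as the index, keep the default `index=True` so that the column is included in the exported file.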