# Module 1 - Lab: Web scraping

Publicly available web data can be gathered in one of two ways:

1. **Via web scraping**: reading and parsing source code from a web page.
2. **Via application programming interfaces (APIs)**: sending HTTP requests using a website's predefined set of protocols for requesting data.

This lab provides a quick introduction along with some example code for the first method, web scraping.

## Web scraping
A typical web scraping workflow looks like this:
1. Read<sup>1</sup> the source code associated with a specified URL into an R environment. The source is HTML, which R parses into an Extensible Markup Language (XML) document. XML is similar to HTML, except that it is designed for storing rather than displaying data.
2. Parse the XML document via elements (e.g., `<p>...</p>`) and attributes (e.g., `screen-name` in `<h1 screen-name="internet_user">`). Elements, the nodes used to extract certain sections of source code, can be identified by tag (the name of the HTML tag, e.g., `p`), class (denoted with an initial period, e.g., `.post`), or id (denoted with an initial pound sign, e.g., `#main`). This notation is CSS selector syntax; XPath expressions are an alternative.
3. Organize parsed values into lists, data frames, etc.
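
The three steps above can be sketched with {rvest} as follows. This is a minimal illustration, not a complete scraper: the HTML is a made-up literal string standing in for a real URL so the code runs offline, and the object names (`page`, `posts`, `posts_df`) are hypothetical.

```r
library(rvest)

# Step 1: read the source code. A literal HTML string (invented for this
# example) is used in place of a URL.
page <- read_html('
<html><body>
  <div id="main">
    <h1 screen-name="internet_user">Posts</h1>
    <p class="post">First post</p>
    <p class="post">Second post</p>
    <p>Not a post</p>
  </div>
</body></html>')

# Step 2: parse by tag, class, and id using CSS selectors
all_p  <- html_nodes(page, "p")         # every <p> element
posts  <- html_nodes(page, ".post")     # elements with class "post"
main_h <- html_nodes(page, "#main h1")  # <h1> inside the element with id "main"
user   <- html_attr(main_h, "screen-name")

# Step 3: organize the parsed values into a data frame
posts_df <- data.frame(
  post   = html_text(posts, trim = TRUE),
  author = user,
  stringsAsFactors = FALSE
)
posts_df
```

Selecting by class rather than by tag alone is what keeps the unrelated third paragraph out of `posts_df`.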

<sup>1</sup>To capture dynamic content, users can also use a Selenium driver/headless browser (for an example, [see this Stack Overflow thread](https://stackoverflow.com/questions/29861117/r-rvest-scraping-a-dynamic-ecommerce-page); cf. [non-Selenium alternative version](https://gist.github.com/hrbrmstr/4cabe4af87bd2c5fe664b0b44a574366)). ***NOTE***: This course won't cover this web scraping method, though it may come in handy to know that it exists.

### Web scraping in R
The {rvest} package makes web scraping in R easy.

In [1]:
## load rvest
suppressPackageStartupMessages(library(rvest))

The two most important functions are `read_html()`, which is actually imported from the {xml2} package by {rvest}, and `html_nodes()`.

`read_html()` reads the content associated with a given URL into an R session and then stores it as an object of class `xml_document`.

In [2]:
## read a website's XML
(h <- read_html("https://www.tiobe.com/tiobe-index/r/"))

{xml_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n    <div id="slider" class="fullwidth cycle-slideshow" data-cycle ...
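
Because the live page may change or be unavailable, the same behavior can be checked offline. The sketch below assumes a small made-up HTML string in place of the URL and uses two {xml2} helpers, `xml_children()` and `xml_name()`, to look at the top-level structure of the returned object.

```r
library(rvest)

# A literal HTML string (invented for this example) instead of a URL
doc <- read_html(
  "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
)

# The result is an xml_document, just as when reading from a URL
class(doc)

# Names of the top-level child nodes, matching the [1] and [2] entries
# shown when an xml_document is printed
xml2::xml_name(xml2::xml_children(doc))
```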

`html_nodes()` filters an XML document by CSS selector(s) or XPath expressions.

In [3]:
## html_nodes
h %>%
  html_nodes("div p") 

{xml_nodeset (3)}
[1] <p>TIOBE releases TICS 9.0.0 with over 200 improvements, a.o., TQI Securi ...
[2] <p>Programming language C is the language of 2017 in the TIOBE index (mos ...
[3] <p>The NavKit project has the best TIOBE Quality Indicator (TQI) score of ...
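
The selector `"div p"` matches every `<p>` that sits anywhere inside a `<div>`. As a rough illustration of how the CSS and XPath routes compare, the sketch below runs both against a small made-up HTML string (not the TIOBE page); `//div//p` is assumed here as the XPath counterpart of `div p`.

```r
library(rvest)

# Made-up document: two paragraphs inside a <div>, one outside
doc <- read_html(
  "<html><body><div><p>Inside a div</p><p>Also inside</p></div><p>Outside</p></body></html>"
)

# CSS selector: <p> elements that are descendants of a <div>
css_hits <- html_nodes(doc, "div p")

# XPath expression selecting the same nodes
xpath_hits <- html_nodes(doc, xpath = "//div//p")

html_text(css_hits)
html_text(xpath_hits)
```

Both calls return the two paragraphs inside the `<div>` and skip the one outside it.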

Other useful functions include `html_text()` (for extracting text), `html_attr()` (for extracting XML attribute values), and `html_table()` (for extracting tables).

In [4]:
## return the paragraph text
h %>%
  html_nodes("div p") %>%
  html_text(trim = TRUE)

## return attribute value
h %>%
  html_nodes("img") %>%
  html_attr("width")
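
`html_table()` is mentioned above but not demonstrated, so here is a minimal sketch. The table contents, the `langs` id, and the image attributes are all invented for illustration and do not come from the TIOBE page.

```r
library(rvest)

# Made-up page with one table and one image
doc <- read_html('
<html><body>
  <table id="langs">
    <tr><th>Language</th><th>Rank</th></tr>
    <tr><td>R</td><td>8</td></tr>
    <tr><td>Python</td><td>4</td></tr>
  </table>
  <img src="logo.png" width="120">
</body></html>')

# Select a single table node by id and convert it to a data frame;
# the <th> cells become the column names
langs <- doc %>%
  html_node("#langs") %>%
  html_table()
langs

# Attribute values are always returned as character strings
widths <- doc %>%
  html_nodes("img") %>%
  html_attr("width")
widths
```

Using `html_node()` (singular) returns one table; calling `html_table()` on the result of `html_nodes("table")` would instead return a list with one entry per table.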

More in-depth examples of web scraping using the {httr} package can be found in this week's lab notebooks.