# Web Scraping

This activity introduces web scraping using:  

1. the Google Chrome Web Scraper plugin to automate the browser to crawl a site and extract relevant data;
2. Python code to request and parse an HTML page using BeautifulSoup.  


## Web scraping with webscraper.io

First, install the [Web Scraper](https://chrome.google.com/webstore/detail/webscraper/jnhgnonknehpejjnehehllkliplmbmhn?hl=en-US) extension for Google Chrome.

You should also open the Web Scraper website, which has [tutorials](http://webscraper.io/tutorials), good explanations, and links to further resources.

Web Scraper integrates with Chrome's Developer Tools so you can scrape web content while taking advantage of the other information available, all within your browser.

### Getting started


- Create a new Site Map, name it 'Wireless', and provide the url ```https://www.radionz.co.nz/news/the-wireless```.  

- Use the Inspector / Inspect Element function in Chrome Developer Tools to find a CSS selector that matches the story title links in the main list and those featured at the top of the page. Hint: it will be a 'class' attribute. The answer is in a comment hidden in the markdown just below; when you're ready, double-click this cell to view it.
<!-- Answer: class='faux-link' appears on all the titles. You may get a more complicated selector from the Web Scraper tool, however. -->

- Next, create a 'story_links' selector that collects the links to each news story on the first page. It should have the type 'Link', and the 'Multiple' box should be ticked.

- Use the 'Select' button in Web Scraper to graphically select one of the stories. Test the result by clicking 'Element Preview' and 'Data Preview'.

If you get stuck at this point, ask a neighbour, or your tutor can help.

Next, in the main browser window, click on one of the story links (any one will do) so that you are viewing a single news story page. In the Web Scraper add-on, click into ```story_links``` (i.e. click its name in the ID column). You are now ready to create some selectors for the content of each story.

Create the following selectors:

* title (Text)
* date (Text)
* story_text (Text) (select 'multiple')

Be careful when selecting the story_text to ensure you are getting the full article body. You can use the 'Element Preview' and 'Data Preview' to check this.

After you've created these, your sitemap graph should look something like this:

![](selector-graph-1.png)

Now, check you are at the page `https://www.radionz.co.nz/news/the-wireless` then choose 'Scrape' from the sitemap's dropdown menu to collect some test data.

### Dealing with paragraphs

Once you have scraped some data, you can view it in the browser or download it from the Sitemap menu ('Export data').

You'll see a problem, though: each paragraph is in a separate row. This could be useful in some cases, but here we want all of a story's paragraphs together, so change the type of the 'story_text' selector from Text to Grouped.

Make this change, and re-run your scraper. You should see that all paragraphs are collected, and organized in a JSON data structure. We will provide you with a script to extract this into plain text files.
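
As a rough idea of what such an extraction script might do, the sketch below turns one Grouped cell into plain text. It assumes the cell holds a JSON array of objects keyed by the selector name (`story_text`). The function name `grouped_to_text` and the sample string are made up for illustration, so check the shape of your own export before relying on it.

```python
import json

def grouped_to_text(cell, key="story_text"):
    """Join the paragraphs stored in one Grouped cell into a single string."""
    # Assumption: the cell is a JSON array such as [{"story_text": "..."}, ...]
    items = json.loads(cell)
    # Keep non-empty paragraphs and trim stray whitespace around each one
    paragraphs = [item[key].strip() for item in items if item.get(key)]
    # Separate paragraphs with a blank line
    return "\n\n".join(paragraphs)

# Hypothetical cell value, not real scraped data
sample = '[{"story_text": "First paragraph."}, {"story_text": " Second paragraph. "}]'
print(grouped_to_text(sample))
```

With pandas, a function like this could be applied to the whole column, for example `df['story_text'].apply(grouped_to_text)`, assuming the column is named after the selector.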

### Dealing with pagination

Automating the data collection from multiple pages will be important for collecting lots of data.

To scrape more than one page at a time, we create a `pagination` selector as a child of `_root`, and give it a type of Link. Visually select the 'Next Page' button at the bottom of the page as the target element. The CSS selector for this should be: `.next a`

Save your new `pagination` selector and then edit it again. Here you need to Control-Click to add `pagination` as a second parent selector of itself. This will ensure it follows all the pagination links.

Then edit the `story_links` selector. Add `pagination` as a second parent selector to `story_links` as well. This will ensure that every time the 'Next Page' button is 'clicked' by the scraper, it will then add all the stories it subsequently finds.

Your selector graph should now look like this, and clicking on pagination shows how the scraper will continue recursively until there are no more links:

![](selector-graph-2.png)

### Get some data

Run the scraper by going to Sitemap Wireless > Scrape. Let it run for a couple of minutes. If the scraper is still running after a long time and you want to move on, stop it by closing the popup window where it loads each story page.

Once you see clean data in the Browse view, you can export it to a CSV file that can be imported into Excel, or loaded into Python using the code below.

You can run this code to load data from a sample file. You should change this to the file created when you ran your scraper.

In [None]:
import pandas as pd
with open('sample-wireless-news.csv', encoding='utf-8') as f:
    df = pd.read_csv(f) # read csv into a pandas dataframe
df.head(5) # display the first five rows of the dataframe

### Extending your RNZ scraper

A key part of scraping is inspecting websites closely, scoping out what information is available and working out how best you can retrieve the information you want. 

You can also use other site functionality to reach older content on the RNZ site, for example by changing the page parameter of a URL to start further back in time (e.g. https://www.rnz.co.nz/tags/internet?page=40), or by starting from specific search results (e.g. https://www.rnz.co.nz/search/results?utf8=%E2%9C%93&q=climate+change&commit=Search). Take a look at those pages and then use the 'Element Preview' feature to confirm that the relevant links from your scraper are highlighted.
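
If you later script this in Python, the page parameter makes it easy to generate a list of start URLs. The sketch below only builds the URL strings and does not request anything. The helper name `page_urls` is hypothetical, and it assumes the site keeps accepting a `page` query parameter in the form shown in the example URL.

```python
from urllib.parse import urlencode

def page_urls(base_url, first_page, last_page):
    """Return one URL per page number, from first_page to last_page inclusive."""
    return [
        f"{base_url}?{urlencode({'page': n})}"
        for n in range(first_page, last_page + 1)
    ]

# Build (but do not fetch) the URLs for pages 40 to 42 of the 'internet' tag
for page_url in page_urls("https://www.rnz.co.nz/tags/internet", 40, 42):
    print(page_url)
```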

Look for different ways content is organised too. For example, see https://www.rnz.co.nz/topics/business-economy where you’ll find news stories with tags from a number of categories (look under the story title for tags like ‘Business’, ‘Life and Society’ etc.). You can search for these tags separately (e.g. https://www.rnz.co.nz/tags/business). However, readers may not encounter stories in this way on the website. For corpus building you may want to collect texts as they were seen by readers. You may also want only some categories, or only texts from a certain time period. In other words, you may wish to filter the texts you collect, and writing your own web scraper can help you to do this.

## Python for Web Scraping

It's worth exploring how we can use Python to write scripts for web scraping. Although writing your own web scraper may initially be slower than using a tool like webscraper.io, Python ultimately gives you the most flexibility: you can capture exactly the data you want, then process or export it however you like.

The main Python libraries that we'll use to do web scraping are:

* [Requests](http://docs.python-requests.org/en/master/) - for requesting web pages
* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - for parsing (reading) the HTML and selecting the elements we want

This code will request [The Wireless page](https://www.rnz.co.nz/news/the-wireless) we have been working with and print it out so you can inspect the HTML:

In [None]:
# import the requests module
import requests

# define the url we want as a string
url = 'https://www.rnz.co.nz/news/the-wireless'

# make the request and assign the result to a variable 'response'
response = requests.get(url)

# the HTML will be stored in response.text
# check that response.text is not empty, and print it if so
if response.text:
    print(response.text)
    
# Scroll through the resulting HTML, which will be pretty hard to read.

## BeautifulSoup to the rescue...

BeautifulSoup is a library for parsing HTML (and XML). It is incredibly useful and fairly easy to learn. The documentation pages (linked above) note that it was used to make [this artwork](http://www.nytimes.com/2007/10/25/arts/design/25vide.html), which is an interesting example for its use of both digital and analogue media.

Beautiful Soup provides methods for accessing HTML elements and their attributes, such as `id` and `class`.

Here is an example of retrieving the title element:

In [None]:
from bs4 import BeautifulSoup

# create a BeautifulSoup object using the html.parser
soup = BeautifulSoup(response.text, "html.parser") 
# find the html title tag
title = soup.title

print(title)

Or, to obtain the title without HTML tags ...

In [None]:
print(title.get_text())

### Find by class attribute

How would we find links to the list of stories? 

Using the `find_all` method you can return specific element types and target them more precisely using their class attribute. The code below will select every ```<li>``` whose class attribute contains ```o-digest```. You can check the HTML above to verify that these elements contain a story link. 

Note: `class_` with an underscore is used because `class` is a reserved word in the Python programming language.

In [None]:
stories = soup.find_all('li', class_='o-digest')

for story in stories:
    link = story.a  # we want the a element that is a child of the list item
    print(link['href'])  # we just want to see the URL
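
The `href` values on a page are often relative paths rather than full URLs. If that is the case here, `urljoin` from the standard library can combine each one with the address of the page it came from. This is a small sketch: the `href` value below is a hypothetical example, not one taken from the site.

```python
from urllib.parse import urljoin

# The page the links were found on
page_url = "https://www.rnz.co.nz/news/the-wireless"
# Hypothetical href value, in the relative form a site might use
href = "/news/the-wireless/375285/example-story"

# A relative href is resolved against the page URL
print(urljoin(page_url, href))
# An href that is already absolute is returned unchanged
print(urljoin(page_url, "https://example.com/other"))
```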

As with the Web Scraper for Chrome, we can use a CSS selector that identifies each story in the HTML. The code below uses `.select()` to target the elements we want using a CSS selector. So, if you can learn about CSS selectors you have a powerful tool to extract the data you want.

In [None]:
stories = soup.select('li.o-digest') # this is elegant and flexible!

for story in stories:
    link = story.a
    print(link['href'])
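
To give a feel for what CSS selectors can express, here is a small sketch run against a made-up HTML fragment rather than the live page. The markup, including the `o-digest--video` class, is hypothetical and only loosely modelled on a story list. The variable is called `demo` so that it does not overwrite the `soup` object above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page will differ
html = """
<ul>
  <li class="o-digest"><a class="faux-link" href="/story-1">Story one</a></li>
  <li class="o-digest o-digest--video"><a class="faux-link" href="/story-2">Story two</a></li>
  <li class="other"><a href="/about">About</a></li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

# Descendant selector: every <a> inside an <li> that has the class o-digest
hrefs = [a["href"] for a in demo.select("li.o-digest a")]
print(hrefs)

# select_one returns only the first match (or None if nothing matches)
first = demo.select_one("a.faux-link")
print(first.get_text())
```

Selecting `li.o-digest a` goes straight to the links, which removes the need for the separate `story.a` step when only the link is wanted.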

### Find by regular expression

We can use regular expressions to target specific content in the page. For example, if we just wanted all the "Wireless Docs" links we can select only links with that text.

In [None]:
import re

regex_results = soup.find_all(string=re.compile("Wireless Docs:"))

for result in regex_results:
    link = result.find_parent('a', class_='faux-link') # using the class attribute we found earlier
    if link is not None:
        print(link.get_text(), link['href'])
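
When `find_all` is given a compiled pattern, Beautiful Soup tests it with the pattern's `search()` method, so the pattern can match anywhere in the string, not just at the start. The sketch below shows the same behaviour on plain strings, along with a case-insensitive variant. The titles are invented for illustration.

```python
import re

# Hypothetical link texts, for illustration only
titles = [
    "Wireless Docs: A first example",
    "Watch: wireless docs: a second example",
    "An unrelated story",
]

# Case-sensitive pattern, as used in the cell above
exact = re.compile("Wireless Docs:")
# The same phrase, ignoring upper and lower case
any_case = re.compile("wireless docs:", re.IGNORECASE)

exact_matches = [t for t in titles if exact.search(t)]
any_case_matches = [t for t in titles if any_case.search(t)]

print(exact_matches)
print(any_case_matches)
```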

### Tasks

Continue investigating _The Wireless_ and try to copy and modify code from above to:

1. Modify the regular expression to list all the 'Someday Stories' articles from https://www.rnz.co.nz/news/the-wireless
2. Copy and modify the code to collect all the links in the text of this article: https://www.radionz.co.nz/news/the-wireless/375285/feature-artificial-affection-the-psychology-of-human-robot-interactions 
3. For the story https://www.radionz.co.nz/news/the-wireless/375285/feature-artificial-affection-the-psychology-of-human-robot-interactions, write code to extract the photo caption text using `.select()` and CSS selectors. 
4. On the page https://www.rnz.co.nz/news/the-wireless the articles with video are indicated with (VIDEO) after the description. Write some code to find all stories and test whether a story contains video. Your code should output the URL to each story that features video.

There is a notebook with solutions for each of the four tasks on Learn. Try your best to write the code using the examples above. You can use the solutions to check your answer or to help you if you get stuck.