# WEB SCRAPING: Day 2

The following contains coding examples used in the second day of the CALDISS workshop: "Web Scraping for the Social Sciences".

The material is meant to introduce basic web scraping in Python.

Copy the code to your own folder to run and edit the code yourself.

**CONTENT**
- Accessing the web with `requests`
- Working with HTML in `Selector` objects from `scrapy`
- Navigating HTML with XPATH
- Parsing HTML with `scrapy`
- Creating a "spider" with `scrapy`
- Understanding JSON formats

## Accessing the web

Accessing a website in python can be done using the `requests` library and creating a `response` object.

The HTML content can be extracted from the `response` object.

Parsing the HTML and extracting specific tags requires additional libraries.

In [1]:
import requests

caldiss_url = 'https://www.en.caldiss.aau.dk/about/'
caldiss_get = requests.get(caldiss_url)
print(str(caldiss_get.status_code) + ' ' + caldiss_get.reason)  #Status code 200 is "OK"

200 OK


From the `response` object we can extract the HTML.

In [None]:
caldiss_html = caldiss_get.content
print(caldiss_html)

In [None]:
invalid_suburl = 'https://www.en.caldiss.aau.dk/allaboutmeanandpastas/'
invalid_subget = requests.get(invalid_suburl)
print(str(invalid_subget.status_code) + ' ' + invalid_subget.reason)  #Status code 404 is "Not found" - Useful for error handling!

Most sites have a dedicated page for 404 errors (so that users know that they are on the right main site).

We can therefore still extract the HTML (the HTML of the 404 page).

In [None]:
invalid_subhtml = invalid_subget.content
print(invalid_subhtml)
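
Because a 404 page still returns HTML, a scraper should check the status code before parsing. The following is a minimal sketch of such a check. The helper name `html_if_ok` is hypothetical, and the two `SimpleNamespace` objects are stand-ins for `requests` responses so that the example runs without network access.

```python
from http import HTTPStatus
from types import SimpleNamespace


def html_if_ok(response):
    """Return the page HTML for a 200 response, otherwise None."""
    if response.status_code == HTTPStatus.OK:
        return response.text
    status = HTTPStatus(response.status_code)
    print(f"Skipping page: {status.value} {status.phrase}")
    return None


# Stand-ins for requests responses (assumption: only status_code and text are needed)
ok_page = SimpleNamespace(status_code=200, text="<html>About</html>")
missing_page = SimpleNamespace(status_code=404, text="<html>Not found</html>")

print(html_if_ok(ok_page))       # the HTML of the page
print(html_if_ok(missing_page))  # None, because the status code is 404
```

With a real response, the call would be `html_if_ok(requests.get(caldiss_url))`.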

## A brief look at HTML

- HTML consists of tags <>
- The tags are always in hierarchical structure - like a tree

Below is an HTML string example in Python:

In [None]:
html = '''
<html>
  <body>
    <div>
      <p class="kenobi">Hello There!</p>
      <div>
        <p>General Kenobi!</p>
      </div>
    </div>
    <div>
      <p>So Uncivilized!</p>
    </div>
  </body>
</html>
'''

### HTML Tags

- Tags are used to indicate elements
- Tags can contain attributes:
    - Classes: `<div class="class!">` (classes do not need to be unique)
    - IDs: `<div id="div2">` (IDs should be unique)
    - Other examples of attributes: href, rel, type
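
The tree structure and the attributes can be made visible with Python's built-in `html.parser` module. The sketch below prints each opening tag, indented by its depth in the tree, together with any attributes. The class name `TagTreePrinter` is hypothetical, and the approach assumes that every tag is explicitly closed (as in the example string).

```python
from html.parser import HTMLParser

html = '''
<html>
  <body>
    <div>
      <p class="kenobi">Hello There!</p>
      <div>
        <p>General Kenobi!</p>
      </div>
    </div>
    <div>
      <p>So Uncivilized!</p>
    </div>
  </body>
</html>
'''


class TagTreePrinter(HTMLParser):
    """Collect one line per opening tag, indented by depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        indent = "  " * self.depth
        # attrs is a list of (name, value) pairs, e.g. [("class", "kenobi")]
        line = f"{indent}{tag} {dict(attrs)}" if attrs else f"{indent}{tag}"
        self.lines.append(line)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


printer = TagTreePrinter()
printer.feed(html)
print("\n".join(printer.lines))
```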

## Navigating HTML: XPATH Notation

XPATH notation is one way of specifying the path to specific HTML element(s).

An XPATH consists of a path-string pointing to specific HTML elements.

Example: `/html/body/div[1]/p`

The above XPATH extracts the element `p` under the first `div` element under the `body` element under the `html` element.

### Basic vocabulary

- `/` (forward slash): move forward one generation (in the tree structure)
- `[]` (brackets): sibling index (after a tag)
    - An element can also be selected by attribute instead of index: `[@id="uid"]` (attribute id = "uid")
    - NOTE: Indices start at 1. Used after `//`, index 1 selects every element that is the first such sibling under its own parent, at all levels of the tree (e.g. `//p[1]`)
- `//` (double forward slash): all elements with a specific tag, at any level
    - Can also be used mid-path to select all matching descendants of a specific element
- `*` (wildcard)
    - all children: `/*` 
    - all descendants: `//*`
- `@`: specify attributes
    - Attributes can be navigated to by specifying `'/@href'` in the xpath
    - Can be combined with wildcards (find all elements with a specific attribute)
    - `contains`: function for substring matching (not regular expressions); finds all attributes containing the given string
        - `'//*[contains(@class, "class-1")]'`: finds all elements whose class attribute contains "class-1" (NOTE: "class-12" also meets this criterion)
- `text()`: extract the textual content (excluding html tags)

Given the html string above, the strings below are all xpaths leading to "Hello There!"

In [None]:
xpath1 = '/html/body/div[1]/p'  #Specific path directory
xpath2 = '//p[@class="kenobi"]'  #Specified by class
xpath3 = '/html/body//*[contains(@class, "kenobi")]'  #Wildcard - all elements with classes containing the word "kenobi"

### Using `scrapy` to extract HTML elements

The package `scrapy` is specifically for web scraping.

It works by creating a `Selector` object from a site's HTML. Elements from the object can be extracted using xpaths.

Using `.extract()` or `.extract_first()` returns all matches or the first match, respectively, as Python objects (a list of strings or a single string). Otherwise the result is kept as `Selector` objects.

#### Extracting text

The code below demonstrates how to extract the raw text of an HTML element

In [None]:
import scrapy
import requests
from scrapy import Selector

aau_url = "https://www.en.caldiss.aau.dk/"  #The URL we want to scrape
aau_html = requests.get(aau_url).text  #Extract the HTML as a string using requests (Selector expects text, not bytes)

aau_sel = Selector(text = aau_html)  #convert html to Selector object

title_xpath = "//title/text()"  #the xpath - find all "title" tags and extract the text

aau_title = aau_sel.xpath(title_xpath).extract_first()  #Extract the text of the first title element using xpath above

print(aau_title)  #Print the title

#### Extracting attributes

The code below demonstrates how to extract attributes (here links).

(the code uses the object `aau_sel` from above).

In [None]:
import scrapy
import requests
from scrapy import Selector

url_xpath = '//@href'  #xpath to all href attributes (links)

aau_urls = aau_sel.xpath(url_xpath).extract()  #Extract all links with xpath above

aau_urls  #Print the links

In the example above, we are extracting every link on the page. Often we are only interested in links appearing in a specific place on the page.

In [None]:
import scrapy
import requests
from scrapy import Selector

url_xpath = '//ul[@class="unstyled"]//@href'  #xpath to links appearing under ul elements of class "unstyled"

aau_urls = aau_sel.xpath(url_xpath).extract()  #Extract all links with xpath above

aau_urls  #Print the links

## EXERCISE: Parsing HTML

Reusing code from above, extract the text of all `h2` elements that are siblings of `div` elements of class "article" from the site: https://www.en.caldiss.aau.dk/about/

In [None]:
import scrapy
import requests
from scrapy import Selector

aau_url = #????#  #The URL we want to scrape

aau_html = requests.get(aau_url).text  

aau_sel = Selector(text = aau_html)  

h2_xpath = #????#  #The xpath to extract the elements

aau_text = aau_sel.xpath(h2_xpath).extract()  

print(aau_text)  #Print the text

## Creating a spider with `scrapy`

Because we can scrape all kinds of content from a site, we can also scrape the URLs a site is linking to.

This allows us to create scrapers that extract specific content from URLs and then jump on to the URLs found on those pages.

Scrapers that are meant to "crawl" from site to site are also called "spiders".

The `scrapy` package contains built-in functions and classes for setting up a spider. An example is provided below.

### Spider-example: Extracting paragraphs from all subpages of the CALDISS page

Using the URLs extracted above, we can create a simple spider to go through each of these URLs and extract specific information.

The scrape provided us with some irrelevant URLs. First we clean up a bit to only have the URLs we are interested in.

In [None]:
last_url = aau_urls.index("/contact/")

caldiss_urls = aau_urls[0:last_url]
caldiss_urls

The extracted URLs only contain the suffix (the path) of the URL. We can simply paste the correct prefix in front.

In [None]:
caldiss_urls = ["https://www.en.caldiss.aau.dk" + url for url in caldiss_urls]
caldiss_urls
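
Pasting the prefix works when every extracted URL is a suffix. As an alternative sketch, `urljoin` from Python's standard library also leaves URLs that are already complete untouched. The list `example_urls` is made up for illustration.

```python
from urllib.parse import urljoin

base_url = "https://www.en.caldiss.aau.dk"

# Hypothetical mix of suffixes and one URL that is already complete
example_urls = ["/about/", "/contact/", "https://www.aau.dk/"]

# urljoin adds the base to suffixes and keeps complete URLs as they are
full_urls = [urljoin(base_url, url) for url in example_urls]

for url in full_urls:
    print(url)
```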

We can then create the spider using the URLs.

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class CALDISS_spider(scrapy.Spider):
    name = "caldiss_spider"

    def start_requests(self):
        urls = caldiss_urls
        for url in urls:
            yield scrapy.Request(url = url, callback = self.parse)

    # Parsing method - called for each downloaded page (the callback used above)
    def parse(self, response):  # Function for parsing the html and storing the text in a dictionary
        print(response.url)
        page_url = response.url

        para_xpath = "//p/text()"  #xpath for extracting text of all paragraph elements

        #scrapy has already downloaded the page, so the response can be queried directly
        page_paras = response.xpath(para_xpath).extract()  #extract the text

        caldiss_dict[page_url] = ' '.join([para.lower() for para in page_paras])  #join the paragraphs and add to dictionary


# Create an empty dictionary
caldiss_dict = dict()

# Run the Spider
# NOTE: process.start() can only be run once per Python session (restart the kernel to run it again)
process = CrawlerProcess()
process.crawl(CALDISS_spider)
process.start()

In [None]:
caldiss_dict

## Common file formats on the web: JSON

A common format for data on the web is JSON (JavaScript Object Notation).

JSON is a hierarchical type of format where data is stored in attribute-value pairs.

An example of text formatted as JSON can be seen below:

In [None]:
json_example = '{"names": [{"value": "kenobi"}, {"value": "kylo"}, {"value": "windu"}], "rank": [{"title": "jedi master"}, {"title": "sith apprentice"}, {"title": "jedi master"}]}'

The `json` package can convert JSON to Python-readable formats. The most "direct" translation is a conversion to a Python dictionary (which is also an attribute-value pair format).

In [None]:
import json

json_obj = json.loads(json_example)  #Read the JSON-formatted data as a dictionary.
print(json.dumps(json_obj, indent = 1))  #Pretty print

The JSON can now be subset like a dictionary.

In [None]:
json_obj["names"]

In [None]:
for name in json_obj["names"]:
    print(name.get("value"))
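
The two lists in the example can also be combined. The sketch below pairs each name with a rank using `zip`, under the assumption (not stated in the data itself) that the two lists are ordered the same way.

```python
import json

json_example = '{"names": [{"value": "kenobi"}, {"value": "kylo"}, {"value": "windu"}], "rank": [{"title": "jedi master"}, {"title": "sith apprentice"}, {"title": "jedi master"}]}'

json_obj = json.loads(json_example)  #Read the JSON-formatted data as a dictionary

# Pair the entries by position (assumption: first name belongs to first rank, and so on)
pairs = [(name["value"], rank["title"]) for name, rank in zip(json_obj["names"], json_obj["rank"])]

for name, title in pairs:
    print(name + ": " + title)
```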

Each element in a JSON (an object) can contain sub-elements.

When using the Twitter API, each tweet is an object with a lot of elements (text, id, time, author etc.).
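
Sub-elements are reached by chaining keys and indices. The sketch below uses a made-up, simplified object for illustration. It is NOT the actual format returned by the Twitter API.

```python
import json

# Hypothetical, simplified object (not the real Twitter API format)
tweet_json = '{"id": 1, "text": "Hello There!", "author": {"name": "kenobi", "followers": 66}, "hashtags": ["jedi", "utapau"]}'

tweet = json.loads(tweet_json)

author_name = tweet["author"]["name"]  #an object inside an object
first_tag = tweet["hashtags"][0]  #a list inside an object
location = tweet.get("location")  #.get returns None when the key is missing

print(author_name)
print(first_tag)
print(location)
```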

## HTTP Headers

Custom HTTP headers can be set when accessing a URL via python. This makes it possible to leave contact information.

In [None]:
import requests

headers = {'user-agent' : 'Mozilla/5.0 (Windows NT 10.0 Win64 x64 rv:66.0); Kristian Gade Kjelmann/Aalborg University/kgk@adm.aau.dk'}

aau_url = "https://www.aau.dk"
aau_get = requests.get(aau_url, headers=headers)  #The response object - the custom headers are sent along with the request
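
To see which headers would be sent without actually accessing the site, the request can be prepared but not sent. This is a minimal sketch; the user-agent string and contact address are placeholders.

```python
import requests

# Placeholder contact information (hypothetical)
headers = {'user-agent': 'workshop-scraper (contact: name@example.org)'}

# Build the request without sending it
request = requests.Request('GET', 'https://www.aau.dk', headers=headers)
prepared = request.prepare()

# Header names are case-insensitive
print(prepared.headers['User-Agent'])
```

For a request that has been sent, the headers that were used can be inspected through the `.request.headers` attribute of the response.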