Data scraping overview

Objectives:

1.  Some places to look for data
2.  Downloading and parsing web pages
3.  A couple of other libraries for "scraping" . . . 

### Some places to look for data

This is not an exhaustive list.

* [Project Gutenberg](http://www.gutenberg.org/).  It should be possible to hunt up a library to search and download texts; for example, see https://github.com/c-w/gutenberg.

* [Internet Archive](https://archive.org/).  [This Python module](https://github.com/jjjake/internetarchive) works well.

* [Folger Digital Texts](http://www.folgerdigitaltexts.org/download/).

* [Oxford Text Archive](https://ota.ox.ac.uk/).

* [eBooks@Adelaide](https://ebooks.adelaide.edu.au/).

* We've made some additional corpora (Inaugural Addresses, State of the Union Addresses) available on Box.


### Huge data . . . *not* for the end-of-semester project

Neither is this.

* [Common Crawl](http://commoncrawl.org/)

* [Wikipedia](https://dumps.wikimedia.org/), of course.  But also [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page), and especially its much smaller (but still quite large) [database dumps](https://dumps.wikimedia.org/simplewiki/20171120/).


### Downloading one web page

Cribbing from [Programming Historian](https://programminghistorian.org/lessons/working-with-web-pages), again.  And from [the Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

Note that I'm not verifying the certificate ("create_unverified_context"), and that I'm calling a method that is not "supposed" to be used by code outside its module (the method name starts with an underscore).  Both of these things mark me as ethically challenged, since I've let expedience overrule broadly accepted norms around such things.

Lots of people use [requests](http://docs.python-requests.org/en/master/) instead of urllib2.  Feel free to try it.


In [None]:
import urllib2, ssl

url = 'https://ebooks.adelaide.edu.au/meta/collections'

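# Build an SSL context that skips certificate verification (see the note above).
context = ssl._create_unverified_context()
response = urllib2.urlopen(url, context=context)

# Read the whole response body into a string.
html = response.read()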

print html[:500]
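
The cell above turns certificate verification off. For comparison, here is a minimal sketch of what the two kinds of SSL context look like. It is written to run under either Python 2.7.9+ or Python 3, and it makes no network request. `ssl.create_default_context()` is the standard-library call that keeps verification on; the variable names are mine.

```python
import ssl

# The default context verifies the server certificate and the host name.
verified_context = ssl.create_default_context()

# The private helper used above switches both checks off.
unverified_context = ssl._create_unverified_context()

for label, ctx in [('verified', verified_context),
                   ('unverified', unverified_context)]:
    print('%s: verify_mode=%s, check_hostname=%s'
          % (label, ctx.verify_mode, ctx.check_hostname))
```

Passing `context=verified_context` to `urllib2.urlopen` should work whenever the site's certificate chain is valid; whether the Adelaide site passes that check is not something this sketch tests.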

### Parse the web page we just downloaded

The variable **html** contains the downloaded web page as a string (type `str`).

In [None]:
from bs4 import BeautifulSoup

parsed_html = BeautifulSoup(html, 'html.parser')

print 'type(html)', type(html)
print
print 'type(parsed_html)', type(parsed_html)
print
print parsed_html.prettify()[:500]

### Find the first and last 20 links

Links are "a" tags.  It helps to know something about HTML; [a quick tutorial](https://www.w3schools.com/html/) may help you along.  Also, see the docs for the [Chrome developer tools](https://developer.chrome.com/devtools) and the [Firefox web developer tools](https://developer.mozilla.org/en-US/docs/Tools).

In [None]:
all_links = parsed_html.find_all('a')

print
for a in all_links[:20]:
    print a
    
print
for a in all_links[-20:]:
    print a

### Make a dictionary for categories and links

In effect, we're extracting a crude catalog to the Adelaide ebooks.

In [None]:
all_links = parsed_html.find_all('a')

categories_links = {}
last_category = ''

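for a in all_links:
    # Beautiful Soup returns "class" as a list of names, or None if the tag has no class.
    link_classes = a.get('class')
    if link_classes is not None and link_classes[0].startswith('mdi-menu'):
        # A category heading: start a new, empty list of links.
        last_category = a.text
        categories_links[last_category] = []
    elif link_classes is not None and link_classes[0].startswith('mdi-'):
        # The first other "mdi-" link ends the scan.
        break
    elif last_category != '':
        # An ordinary link: file it under the most recent category.
        categories_links[last_category].append([a.get('href'), a.text])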
        
#for category in sorted(categories_links.keys()):
#    print
#    print category
#    for link in categories_links[category]:
#        print '\t', link

print
print sorted(categories_links.keys())
print
print categories_links['History--Ancient Greece']
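
To see why the loop above looks at `a.get('class')[0]`, here is a small sketch on a made-up fragment. The HTML and the href are hypothetical (only the `mdi-menu` class prefix comes from the page above); the point is just to show what Beautiful Soup hands back for `class`, `href`, and `text`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one category link followed by one text link.
sample_html = '''
<ul>
  <li><a class="mdi-menu-down category" href="#">History</a></li>
  <li><a href="/h/herodotus/">Herodotus</a></li>
</ul>
'''

sample_links = BeautifulSoup(sample_html, 'html.parser').find_all('a')
category_link, text_link = sample_links

# "class" can hold several names, so Beautiful Soup returns a list of them.
print(category_link.get('class'))

# A missing attribute comes back as None, which is why the loop tests for it.
print(text_link.get('class'))

# "href" is single-valued, so it comes back as a plain string.
print(text_link.get('href'))
print(text_link.text)
```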


### Grab a bunch of texts, etc

Use categories_links (i.e., our crude catalog to the Adelaide ebooks) to extract the text for one category. 

My assumption here is that I want just the text proper; I don't want title pages, front matter, back matter, standard link text, etc.
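
The cell below builds each URL by gluing strings together. One alternative is `urljoin` from the standard library, sketched here so that it runs under Python 2 or 3. The href is a made-up example of the form the catalog seems to hold, not one copied from the site.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

site = 'https://ebooks.adelaide.edu.au'
example_href = '/x/example-author/example-title/'   # hypothetical catalog href

# Resolve the site-relative href against the site root, then add the file name.
book_url = urljoin(site, example_href)
complete_url = urljoin(book_url, 'complete.html')

print(complete_url)
```

Note that `urljoin` replaces the last path segment when the href has no trailing slash, so it's worth printing a few results before downloading anything.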

In [None]:
import urllib2, ssl, codecs, re
from bs4 import BeautifulSoup

for link in categories_links['History--Ancient Greece']:
    
    print 'downloading and parsing', link[1]

    text_url = 'https://ebooks.adelaide.edu.au' + link[0] + 'complete.html'

    text_context = ssl._create_unverified_context()
    text_response = urllib2.urlopen(text_url, context=text_context)

    text_html = text_response.read()
    
    text_parsed_html = BeautifulSoup(text_html, 'html.parser')
    
    extracted_text = []
    
    for div in text_parsed_html.find_all('div'):
        
        if div.parent.name == 'body':
        
            if div.get('id') != None and div.get('id') in ['controls', 'contents']:
                pass
            elif div.get('class') != None and \
                div.get('class')[0] in ['titlepage', 
                                        'titleverso', 
                                        'contents', 
                                        'frontmatter', 
                                        'preface', 
                                        'colophon',
                                        'frontispiece',
                                        'map']:
                pass
            else:
                #print '\t', 'extracting div', div.get('id'), div.get('class')
                extracted_text.append(div.get_text())
            
    # Name the file after the title: upper-case it and turn everything outside A-Z into "_".
    file_name = 'picky_' + re.sub('[^A-Z]', '_', link[1].upper()) + '.txt'
    
    with codecs.open(file_name, 'w', encoding='utf-8') as f:
        f.write('\n'.join(extracted_text))
    
print 'Done!'
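
The file names come from the link text, so it's worth seeing what the `re.sub` call does to a title. A sketch, assuming the link text for the *Anabasis* reads as below (I've reconstructed it from the file name that shows up in the checks that follow, so treat it as a guess). `picky_file_name` is a hypothetical helper, not something the cell above defines.

```python
import re

def picky_file_name(title):
    """Upper-case the title and replace every character outside A-Z with "_"."""
    return 'picky_' + re.sub('[^A-Z]', '_', title.upper()) + '.txt'

# Assumed link text; spaces, punctuation, and digits all become underscores.
anabasis_name = picky_file_name('Anabasis / Xenophon; translated by H. G. Dakyns')
print(anabasis_name)

# Two different titles can collapse to the same name, in which case the later
# download would overwrite the earlier one.
print(picky_file_name('Book 1') == picky_file_name('Book 2'))
```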

### Some basic bash (i.e., "terminal" or "shell") commands to check the results

You might look at [our brief guide to the command line](https://talus.artsci.wustl.edu/command_line.html).

If you're reading ahead, or you're bored, you might have a look at [Data Science at the Command Line](https://www.datascienceatthecommandline.com/).  I haven't looked at it yet; I plan to in May.

**NOTE** that I extracted 0 characters for *The Cavalry General*.  The text of *The Cavalry General* is all in a div whose id and class are both frontmatter; there's no way, short of expanding the logic to handle *The Cavalry General* specially, to both keep front matter out of the other texts and include it for *The Cavalry General*.  I.e., **a general rule for extracting content fails, even when I'm taking just one kind of document from what appears to be a well-designed web site.**
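
One cheap defence is to flag any text whose extraction came back empty, or nearly so, rather than finding out from `wc` afterwards. A minimal sketch; `suspicious_extractions`, the threshold, and the sample data are all made up for illustration.

```python
def suspicious_extractions(texts_by_title, min_words=100):
    """Return the titles whose extracted text has fewer than min_words words."""
    return sorted(
        title
        for title, text in texts_by_title.items()
        if len(text.split()) < min_words
    )

# Made-up stand-ins for the extracted texts.
sample_texts = {
    'A long text': 'word ' * 5000,
    'An empty text': '',
}

flagged = suspicious_extractions(sample_texts)
print(flagged)
```

In the download loop, this would mean keeping `'\n'.join(extracted_text)` in a dictionary keyed by `link[1]` before writing it out.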

In [None]:
!ls -l *.txt

In [None]:
!wc -w *.txt

In [None]:
!head -n 70 picky_ANABASIS___XENOPHON__TRANSLATED_BY_H__G__DAKYNS.txt

### And the results include footnotes . . . 

. . . which, like the prefatory material, are something I would otherwise want to drop.

In [None]:
!grep 'a district and town in the south-west of Arcadia' picky_ANABASIS___XENOPHON__TRANSLATED_BY_H__G__DAKYNS.txt

###  Maybe I shouldn't be so fussy . . . 

. . . just take all the text in the body of the document.

In [None]:
import urllib2, ssl, codecs, re
from bs4 import BeautifulSoup

for link in categories_links['History--Ancient Greece']:
    
    print 'downloading and parsing', link[1]

    text_url = 'https://ebooks.adelaide.edu.au' + link[0] + 'complete.html'

    text_context = ssl._create_unverified_context()
    text_response = urllib2.urlopen(text_url, context=text_context)

    text_html = text_response.read()
    
    text_parsed_html = BeautifulSoup(text_html, 'html.parser')
            
    file_name = 'everything_' + re.sub('[^A-Z]', '_', link[1].upper()) + '.txt'
    
    # Write all of the text in <body>, front matter and footnotes included.
    with codecs.open(file_name, 'w', encoding='utf-8') as f:
        f.write(text_parsed_html.body.get_text())
    
print 'Done!'

### Check the results

Are the "everything" files good enough?  It depends on what I want to do with the data.

In [None]:
!ls -l *.txt

In [None]:
!wc -w *.txt

In [None]:
!head -n 100 everything_ANABASIS___XENOPHON__TRANSLATED_BY_H__G__DAKYNS.txt

### XML?

We use [lxml](http://lxml.de/index.html) instead of BeautifulSoup to parse XML (although lxml will also parse HTML, and can use BeautifulSoup to do it!).

Note that I'm printing out only the first speech for each of the first five speakers.

In [None]:
from lxml import etree

tree = etree.parse('http://www.folgerdigitaltexts.org/download/teisimple/Mac.xml')

speakers = []
for sp in tree.xpath('//tei:sp', namespaces={'tei': 'http://www.tei-c.org/ns/1.0'}):
    speakers.append(sp.get('who'))
    
speakers = sorted(list(set(speakers)))

print speakers

stylesheet = etree.XML("""<xsl:stylesheet version="1.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
    
    <xsl:output method="text" omit-xml-declaration="yes" indent="no"/>
    
    <xsl:template match="tei:speaker|tei:stage"/>
    
    <xsl:template match="tei:sp/text()|tei:p/text()|tei:l/text()|tei:q/text()"/>
    
    <xsl:template match="tei:w|tei:pc"><xsl:apply-templates/></xsl:template>
    <xsl:template match="tei:c"><xsl:text> </xsl:text></xsl:template>
    <xsl:template match="tei:lb"><xsl:text> </xsl:text></xsl:template>
    
    <xsl:template match="tei:l|tei:p">
        <xsl:apply-templates/>
        <xsl:text>&#xa;</xsl:text>
    </xsl:template>
    
    <xsl:template match="tei:q">
        <xsl:apply-templates/>
        <xsl:text>&#xa;</xsl:text>
        <xsl:text>&#xa;</xsl:text>
    </xsl:template>
    
    <xsl:template match="/">
        <xsl:apply-templates/>
    </xsl:template>
    
</xsl:stylesheet>""")

transform = etree.XSLT(stylesheet)

for speaker in speakers[:5]:
                 
    print
    print '\t', speaker
    print
    
    for speech in tree.xpath('//tei:sp[@who="' + speaker + '"]', namespaces={'tei': 'http://www.tei-c.org/ns/1.0'}):
                             
        result = transform(speech)
                             
        print unicode(result)
        print
        
        break
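
The `namespaces=` argument is the part that usually trips people up: TEI elements live in the `http://www.tei-c.org/ns/1.0` namespace, so an unprefixed `sp` matches nothing. Here is a sketch on a tiny made-up fragment, using the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without the download. The fragment and its `who` values are invented.

```python
import xml.etree.ElementTree as ET

# Invented TEI-like fragment; the real file is far larger.
sample_xml = '''<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="#SpeakerB"><l>Second speaker, first speech.</l></sp>
  <sp who="#SpeakerA"><l>First speaker, first speech.</l></sp>
  <sp who="#SpeakerB"><l>Second speaker, second speech.</l></sp>
</TEI>'''

root = ET.fromstring(sample_xml)
ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Without the prefix nothing matches, because the elements are namespaced.
unprefixed = root.findall('.//sp')
print(len(unprefixed))

# With the prefix mapped to the TEI namespace, all three speeches are found.
speeches = root.findall('.//tei:sp', ns)
sample_speakers = sorted(set(sp.get('who') for sp in speeches))
print(sample_speakers)
```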
                    

### What about [Scrapy](https://scrapy.org/)?

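Inconvenient in Jupyter notebooks: Scrapy runs on Twisted's reactor, which can't be restarted once it has stopped, so the cell below runs only once per kernel session.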

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.selector import Selector

class ScrapeAdelaideLinks(scrapy.Spider):
    
    name = "adelaide_links"
    start_urls = [
        'https://ebooks.adelaide.edu.au/meta/collections',
    ]

    def parse(self, response):
        
        all_links = response.selector.xpath('//a')

        categories_links = {}
        last_category = ''

        for a in all_links:
            
            node_class = None
            if len(a.xpath("@class").extract()) > 0:
                node_class = a.xpath("@class").extract()[0]
            
            node_href = None
            if len(a.xpath("@href").extract()) > 0:
                node_href = a.xpath("@href").extract()[0]
            
            if node_class != None and node_class.startswith('mdi-menu') == True:
                last_category = a.xpath("./text()").extract()[0]
                categories_links[last_category] = []
            elif node_class != None and node_class.startswith('mdi-') == True:
                break
            elif last_category > '':
                categories_links[last_category].append([node_href, a.xpath("./text()").extract()])
        
        print
        print 'len(categories_links.keys())', len(categories_links.keys())
        print
        print categories_links.keys()[:20]
        print
        print categories_links['History--Ancient Greece']
        

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(ScrapeAdelaideLinks)
process.start()