# Web Scraping

This notebook shows how to scrape web pages to build a text corpus. It draws on Ryan Mitchell's [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do) (Sebastopol, CA: O'Reilly, 2015).

## Importing Libraries

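We need libraries for fetching pages (`urllib`), parsing HTML (`BeautifulSoup`), reading CSV files (`csv`), and date-stamping the results (`time`).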

In [119]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import csv
import time

## Finding a List of Links

Next we need a list of links to scrape. Here the links are kept in a CSV file, which also lets us keep track of them. We load the second column of that file into a list.

In [1]:
ls

ExampleTable.csv         Second Notebook.ipynb    Web Scraping.ipynb
Hume Enquiry.txt         Test.txt
My First Notebook.ipynb  Third Notebook.ipynb


In [109]:
links = []
with open('ExampleTable.csv', 'r', newline='') as file:  # The with block closes the file after reading
    data = csv.reader(file)
    for row in data:
        links.append(row[1])  # Collect the second column of every row
urlLinks = links[1:]  # Skip the first item, which is the column label
urlLinks[:2]

['http://philosophi.ca', 'http://theoreti.ca']
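
The header row can also be skipped while reading, rather than sliced off afterwards. The following is a minimal sketch, assuming the links sit in the second column under a single header row. The helper name `read_link_column` and the sample rows are hypothetical, and an in-memory file stands in for `ExampleTable.csv`.

```python
import csv
import io

def read_link_column(handle, column=1):
    """Return the values of one column, skipping the header row."""
    reader = csv.reader(handle)
    next(reader, None)  # Discard the header row, if there is one
    # Ignore rows that are too short to contain the requested column
    return [row[column] for row in reader if len(row) > column]

# Hypothetical sample data standing in for ExampleTable.csv
sample = io.StringIO(
    "Name,Link\n"
    "Philosophi.ca,http://philosophi.ca\n"
    "Theoreti.ca,http://theoreti.ca\n"
)
sample_links = read_link_column(sample)
print(sample_links)
```

Skipping short rows means a blank or malformed line in the CSV does not raise an `IndexError` partway through the file.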

## Grabbing One Link

Here is a function that fetches one link and returns the text of the page, or an error message if the request fails.

In [98]:
def getURL(url):
    try:
        html = urlopen(url)
    except HTTPError as error:
        return "HTTPError: " + str(error)
    except URLError as error:
        return "URLError: " + str(error)
    else:
        bsObject = BeautifulSoup(html.read(), "lxml")
        return bsObject.get_text()
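
Because `getURL` returns error messages as ordinary strings, a failed request ends up in the corpus as if it were page text. One possible alternative, sketched here under assumptions, returns the page and the error separately and adds a timeout. The name `fetch_html`, the `opener` parameter, and the `example.invalid` addresses are hypothetical; the fake opener exists only so the example runs without a network connection.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_html(url, timeout=10, opener=urlopen):
    """Return (html_bytes, error_message); exactly one of the two is None."""
    try:
        # The timeout stops one unresponsive server from stalling the run
        with opener(url, timeout=timeout) as response:
            return response.read(), None
    except HTTPError as error:  # Must come first: HTTPError subclasses URLError
        return None, "HTTPError: " + str(error)
    except URLError as error:
        return None, "URLError: " + str(error)

# Hypothetical stand-in for urlopen, used only for this demonstration
def fake_opener(url, timeout=None):
    if "missing" in url:
        raise URLError("host not found")
    return io.BytesIO(b"<html><body>Hello</body></html>")

page, problem = fetch_html("http://example.invalid/page", opener=fake_opener)
missing_page, missing_problem = fetch_html("http://example.invalid/missing", opener=fake_opener)
print(problem, missing_problem)
```

A caller can then check whether the error value is `None` before passing the bytes to `BeautifulSoup`, and log the failures instead of storing them.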

#### Testing with One Link
Here we test the function on a single link. We won't use this cell for the large-scale scraping.

In [108]:
print(getURL("http://theoreti.ca"))




Theoreti.ca  



































































Theoreti.ca
Research notes taken on subjects around multimedia, electronic texts, and computer games.





My Digital Humanities
October 13th, 2016 


The folks at #dariah Teach have put together a first of a series of videos on My Digital Humanities. Despite appearing in it, the video seems very nicely produced and there is a nice mix of people. Stéfan Sinclair and I were interviewed together, something that isn’t clear in the first part, but will presumably become clear later.

Posted in Conference, Humanities Computing, Streaming Media |    Comments Off


FBI Game: What is Violent Extremism?
October 12th, 2016 


From Slashdot a story about an FBI game/interactive that is online and which aims at Countering Violent Extremism | What is Violent Extremism?. The subtitle is “Don’t Be A Puppet” and the game is part of a collection of interactive materials that try to teach about extremism in general and e

#### Cleaning Texts

Here is a function that cleans up the text by removing blank lines, that is, lines that are empty or contain only whitespace.

In [110]:
def cleanLines(theText):
    # Keep only the lines that contain something other than whitespace
    lineTokens = [line for line in theText.split('\n') if line.strip() != '']
    theNewText = "\n".join(lineTokens)
    return theNewText
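
A small worked example may make the effect of the filter concrete. The sketch below restates the same logic under the hypothetical name `remove_blank_lines` so that it runs on its own; the sample string is made up to resemble the output above.

```python
def remove_blank_lines(text):
    """Drop lines that are empty or contain only whitespace."""
    kept = [line for line in text.split('\n') if line.strip() != '']
    return "\n".join(kept)

raw = "\n\n\nTheoreti.ca  \n\n   \nResearch notes\n\n"
cleaned = remove_blank_lines(raw)
print(repr(cleaned))
```

Note that only whole blank lines are removed: whitespace at the start or end of a kept line, such as the two spaces after `Theoreti.ca`, is left in place.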

## Iterating Over a List of Links

This is the main loop. It goes through the list of links, fetches and cleans the text of each page, and appends it to a string of simple XML.

In [120]:
theWebTexts = "<scrapes date = '" + time.strftime("%d/%m/%Y") + "'>\n"
for linkUrl in urlLinks:
    theWebTexts += "<site link = '" + linkUrl + "'>\n"
    theWebTexts += cleanLines(getURL(linkUrl))  # Page text, or the error message if the fetch failed
    theWebTexts += "</site>\n"

theWebTexts += "</scrapes>"

print(theWebTexts[:300])

<scrapes date = '17/10/2016'>
<site link = 'http://philosophi.ca'>
philosophi.ca : Home Page 
<!--
  ul, ol, pre, dl, p { margin-top:0px; margin-bottom:0px; }
  code.escaped { white-space: nowrap; }
  .vspace { margin-top:1.33em; }
  .indent { margin-left:40px; }
  .outdent { margin-left:40px; text-
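
The output above shows stylesheet rules mixed in with the page text, because the contents of `<style>` (and `<script>`) elements are returned along with everything else. The following is a minimal sketch of one way to skip them, using the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages. The class name `VisibleTextParser`, the function `visible_text`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <style> and <script>."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # Greater than zero while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()  # Flush any text still buffered by the parser
    return "".join(parser.parts)

# Made-up page resembling the one scraped above
sample_html = (
    "<html><head><title>Home Page</title>"
    "<style><!-- p { margin-top:0px; } --></style></head>"
    "<body><p>Welcome</p><script>var x = 1;</script></body></html>"
)
page_text = visible_text(sample_html)
print(page_text)
```

With BeautifulSoup, a comparable result is usually obtained by calling `decompose()` on each `style` and `script` element before calling `get_text()`.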


## Writing Out Results

Finally we write out the results as an XML file.

In [113]:
nameOfResults = "ScrapeResults.xml"

with open(nameOfResults, "w", encoding="utf-8") as fileToWrite:  # Explicit encoding, since the pages contain non-ASCII characters
    fileToWrite.write(theWebTexts)
    
print("Done")

Done
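
A file built by string concatenation is only well-formed XML if the scraped text happens to contain no `<`, `&`, or stray comment markers, and scraped pages often do contain them. One possible way to guard against this, sketched here under assumptions, is to build the same `scrapes` and `site` structure with the standard library's `xml.etree.ElementTree`, which escapes text and attribute values when serializing. The function name `build_scrapes` and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_scrapes(pages, date):
    """Build a <scrapes> tree from (link, text) pairs."""
    root = ET.Element("scrapes", {"date": date})
    for link, text in pages:
        site = ET.SubElement(root, "site", {"link": link})
        site.text = text  # Escaped automatically on serialization
    return root

# Made-up page whose link and text both contain characters that need escaping
sample_pages = [
    ("http://example.invalid/?a=1&b=2", "Tom & Jerry <!-- not a comment -->"),
]
xml_text = ET.tostring(build_scrapes(sample_pages, "17/10/2016"), encoding="unicode")

# Parsing the string back confirms that it is well-formed
round_trip = ET.fromstring(xml_text)
print(xml_text)
```

The resulting string can be written out exactly as before, and any XML parser should then be able to read the file back.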

