# Accessing the web

A website can be accessed from Python with the `requests` library, which returns a `response` object for each request.

The HTML content can be extracted from the `response` object.

Parsing the HTML and extracting specific tags requires additional libraries to be set up.

In [1]:
import requests

caldiss_url = 'https://www.en.caldiss.aau.dk/about/'
caldiss_get = requests.get(caldiss_url)
caldiss_get  #Status code 200 is "OK"

<Response [200]>

From the `response` object we can extract the HTML.

In [2]:
caldiss_html = caldiss_get.content
print(caldiss_html)

b'<!DOCTYPE html>\r\n<html class="no-js" prefix="og: http://ogp.me/ns#">\r\n<head>\r\n<meta charset="utf-8" />\r\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">\r\n\r\n\r\n<meta name="description" content="about" />\r\n<title>About CALDISS</title>\r\n\r\n<!-- Remove no-js enable html5 elements -->\r\n<script type="text/javascript">\r\n    //Clear no-js\r\n    document.getElementsByTagName(\'html\')[0].className = document\r\n            .getElementsByTagName(\'html\')[0].className.replace(\'no-js\', \'\');\r\n    //Enable html5 elements in IE\r\n    \'article aside footer header nav section time\'.replace(/\\w+/g, function(n) {\r\n        document.createElement(n)\r\n    });\r\n</script>\r\n\r\n<!-- Google Tag Manager --> <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\': new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src= \'//www.
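The `b'...'` prefix in the output shows that `.content` holds raw bytes rather than a string, which is why line breaks appear as literal `\r\n`. A minimal sketch of decoding such bytes follows. It assumes the page is UTF-8 encoded, as its `<meta charset="utf-8" />` tag declares, and `raw_html` is a shortened, hand-made sample rather than the full page. With `requests`, the `.text` attribute of the `response` object gives a decoded string directly.

```python
# Shortened stand-in for the bytes printed above (not the full page)
raw_html = (
    b'<!DOCTYPE html>\r\n<html class="no-js">\r\n<head>\r\n'
    b'<meta charset="utf-8" />\r\n<title>About CALDISS</title>\r\n'
    b'</head>\r\n</html>'
)

# Decode the bytes into a regular string (assumes UTF-8)
html_text = raw_html.decode("utf-8")

print(type(raw_html).__name__, "->", type(html_text).__name__)
print(html_text)  # Line breaks are now rendered instead of shown as \r\n
```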

In [3]:
invalid_suburl = 'https://www.en.caldiss.aau.dk/allaboutmeanandpastas/'
invalid_subget = requests.get(invalid_suburl)
invalid_subget  #Status code 404 is "Not found" - Useful for error handling!

<Response [404]>

Most sites have a page set up for 404 errors (so that users know that they are still on the right main site).

We can therefore still extract HTML from the response (the HTML of the 404 page).

In [15]:
invalid_subhtml = invalid_subget.content
print(invalid_subhtml)

b'<!DOCTYPE html>\r\n<html class="no-js" prefix="og: http://ogp.me/ns#">\r\n<head>\r\n<meta charset="utf-8" />\r\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">\r\n\r\n\r\n<meta name="description" content="404 (page not found)" />\r\n<title>404 (page not found)</title>\r\n\r\n<!-- Remove no-js enable html5 elements -->\r\n<script type="text/javascript">\r\n    //Clear no-js\r\n    document.getElementsByTagName(\'html\')[0].className = document\r\n            .getElementsByTagName(\'html\')[0].className.replace(\'no-js\', \'\');\r\n    //Enable html5 elements in IE\r\n    \'article aside footer header nav section time\'.replace(/\\w+/g, function(n) {\r\n        document.createElement(n)\r\n    });\r\n</script>\r\n\r\n<!-- Google Tag Manager --> <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\': new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.asyn
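Because a missing page still returns HTML, it can be worth checking the status code before parsing anything. The sketch below shows one way to do that. `html_if_ok` is a hypothetical helper, and the two `SimpleNamespace` objects are stand-ins that only mimic the `status_code` and `content` attributes of a `requests` response, so the example runs without a network connection.

```python
from http import HTTPStatus
from types import SimpleNamespace


def html_if_ok(response):
    """Return the page content only when the status code is 200, else None."""
    if response.status_code == HTTPStatus.OK:
        return response.content
    phrase = HTTPStatus(response.status_code).phrase  # e.g. "Not Found"
    print(f"Skipping page: {response.status_code} {phrase}")
    return None


# Stand-ins for response objects (assumption: only these two attributes are needed)
ok_page = SimpleNamespace(status_code=200, content=b"<title>About CALDISS</title>")
missing_page = SimpleNamespace(status_code=404, content=b"<title>404 (page not found)</title>")

print(html_if_ok(ok_page))
print(html_if_ok(missing_page))
```

The same function should also work on `caldiss_get` and `invalid_subget`, since it only reads `status_code` and `content`.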

# Image Scraper

Using the output from the Digital Methods Initiative (DMI) image scraper, we can write a short script to download all the images on a website.

In [79]:
import os
import urllib.request
import pandas as pd

image_df = pd.read_csv('caldiss_imgurls.csv', sep = ",", skiprows = 1, header=None)  #Read the DMI image scraper output as a pandas dataframe
image_urls = list(image_df.loc[:,1])  #Extract the image URLs (second column) and store them in a list

os.makedirs("images", exist_ok = True)  #Creates the folder "images" if it does not already exist

for i, url in enumerate(image_urls, start = 1):  #Loops through each URL - i starts at 1 and is used as identifier
    img_path = "caldissimg" + str(i) + ".png"  #Creates the image filename - assumes all images are .png!
    urllib.request.urlretrieve(url, filename = "./images/" + img_path)  #Saves the image to the folder "images"
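The loop above names every file `.png`, which will mislabel images in other formats. One possible refinement is to take the extension from the URL itself. The sketch below does this with the standard library. `image_filename` is a hypothetical helper and the `example.com` URLs are made up for illustration. It falls back to `.png` when the URL has no extension.

```python
import os
from urllib.parse import urlparse


def image_filename(url, index, prefix="caldissimg", default_ext=".png"):
    """Build a filename such as caldissimg1.jpg from the URL's own extension."""
    path = urlparse(url).path  # Drops the query string, e.g. "?width=300"
    ext = os.path.splitext(path)[1].lower()  # "" when the path has no extension
    return f"{prefix}{index}{ext or default_ext}"


# Made-up URLs for illustration only
example_urls = [
    "https://example.com/media/logo.png",
    "https://example.com/media/photo.JPG?width=300",
    "https://example.com/media/banner",
]

for i, url in enumerate(example_urls, start=1):
    print(image_filename(url, i))
```

The returned name could replace `img_path` in the download loop. Note that the extension in a URL is only a hint and does not guarantee the actual file format.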

# Search Engine Scraper

The output of the search engine scraper from the Digital Methods Initiative can be used in a spider to extract content from the sites returned in Google's search results.

In [43]:
import pandas as pd
import scrapy
from scrapy import Selector
import time
import random

swed_df = pd.read_csv('swed_goog.csv', sep = '\t')
swed_urls = list(swed_df['article url'])
swed_urls

['https://podtail.com/da/podcasts/dk/education/',
 'https://podtail.com/da/podcasts/es/education/new/',
 'https://podtail.com/da/podcasts/education/',
 'https://podtail.com/da/podcasts/se/education/new/',
 'https://podtail.com/da/podcasts/all/education/new/',
 'https://podtail.com/da/podcasts/no/education/new/',
 'https://podtail.com/da/podcasts/cn/education/new/',
 'https://podtail.com/da/podcasts/au/education/',
 'https://interreg-oks.eu/webdav/files/gamla-projektbanken/se/Menu/Projektbank+2007-2013/Forprojekt-oresund/INDEX-+Education+Program.html',
 'https://www.sasgroup.net/en/about-the-education/']

After reading in the list of URLs, we can loop through them and extract specific elements from each.

In [52]:
swed_titles = []
for url in swed_urls:
    time_delay = random.uniform(1,3)
    time.sleep(time_delay)
    
    url_html = requests.get(url).content
    url_sel = Selector(text = url_html)
    
    title = url_sel.css('title::text').extract_first()
    
    swed_titles.append(title)
    

2019-04-01 21:50:13 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): podtail.com
2019-04-01 21:50:13 [urllib3.connectionpool] DEBUG: https://podtail.com:443 "GET /da/podcasts/dk/education/ HTTP/1.1" 200 None
2019-04-01 21:50:16 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): podtail.com
2019-04-01 21:50:16 [urllib3.connectionpool] DEBUG: https://podtail.com:443 "GET /da/podcasts/es/education/new/ HTTP/1.1" 200 None
2019-04-01 21:50:18 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): podtail.com
2019-04-01 21:50:19 [urllib3.connectionpool] DEBUG: https://podtail.com:443 "GET /da/podcasts/education/ HTTP/1.1" 200 None
2019-04-01 21:50:20 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): podtail.com
2019-04-01 21:50:21 [urllib3.connectionpool] DEBUG: https://podtail.com:443 "GET /da/podcasts/se/education/new/ HTTP/1.1" 200 None
2019-04-01 21:50:23 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): podt

The title elements of the sites are now stored in the list `swed_titles`.

In [53]:
swed_titles

['Education – Denmark – Anbefalede podcasts – Podtail',
 'Education – Spain – Nye podcasts – Podtail',
 'Education – Anbefalede podcasts – Podtail',
 'Education – Sweden – Nye podcasts – Podtail',
 'Education – Nye podcasts – Podtail',
 'Education – Norway – Nye podcasts – Podtail',
 'Education – China – Nye podcasts – Podtail',
 'Education – Australia – Anbefalede podcasts – Podtail',
 'Interreg IVA ÖKS - INDEX: Education Program',
 'About the education – SAS']
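The `title::text` selector used in the loop picks out the text of `<title>` elements, and `extract_first()` keeps only the first match. For comparison, here is a rough standard-library sketch of the same idea using `html.parser`. It is not how scrapy works internally, `TitleParser` and `first_title` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element only."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:  # Ignore any later <title> tags
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def first_title(html_bytes):
    """Return the first title in the HTML, or None when there is none."""
    parser = TitleParser()
    parser.feed(html_bytes.decode("utf-8", errors="replace"))
    parser.close()
    return parser.title


# Made-up page with two <title> elements (the second sits inside an SVG)
sample = (b"<html><head><title>About CALDISS</title></head>"
          b"<body><svg><title>Icon</title></svg></body></html>")
print(first_title(sample))
```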

In [54]:
venez_df = pd.read_csv('venez_goog.csv', sep = '\t')
venez_urls = list(venez_df['article url'])
venez_urls

['https://www.cyberwyoming.org/education-and-training/',
 'https://www.cyberwyoming.org/education-collaboration/',
 'https://www.cyberwyoming.org/',
 'https://ultimateaccesseducation.com/',
 'http://www.workingin-australia.com/education/system/overview',
 'http://www.workingin-australia.com/education/universities/tertiary-education',
 'https://misfitteachers.com/education-is-power-but-do-your-students-know-it/',
 'http://www.unesco.org.ve/dmdocuments/biblioteca/libros/national_report_dominica_erma_alfred.pdf',
 'https://www.iaoed.org/',
 'http://www.tylerhistory.org/education-racism/']

In [50]:
venez_titles = []
for url in venez_urls:
    time_delay = random.uniform(1,3)
    time.sleep(time_delay)
    
    url_html = requests.get(url).content
    url_sel = Selector(text = url_html)
    
    title = url_sel.css('title::text').extract_first()
    
    venez_titles.append(title)

2019-04-01 21:48:24 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.cyberwyoming.org
2019-04-01 21:48:28 [urllib3.connectionpool] DEBUG: https://www.cyberwyoming.org:443 "GET /education-and-training/ HTTP/1.1" 200 9820
2019-04-01 21:48:31 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.cyberwyoming.org
2019-04-01 21:48:32 [urllib3.connectionpool] DEBUG: https://www.cyberwyoming.org:443 "GET /education-collaboration/ HTTP/1.1" 200 10350
2019-04-01 21:48:34 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.cyberwyoming.org
2019-04-01 21:48:35 [urllib3.connectionpool] DEBUG: https://www.cyberwyoming.org:443 "GET / HTTP/1.1" 200 11251
2019-04-01 21:48:37 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): ultimateaccesseducation.com
2019-04-01 21:48:39 [urllib3.connectionpool] DEBUG: https://ultimateaccesseducation.com:443 "GET / HTTP/1.1" 200 None
2019-04-01 21:48:42 [urllib3.connectionpool] DEBUG: Starting 

ConnectionError: HTTPConnectionPool(host='www.unesco.org.ve', port=80): Max retries exceeded with url: /dmdocuments/biblioteca/libros/national_report_dominica_erma_alfred.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000253847285C0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))
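A single unreachable URL stops the whole loop, so the remaining sites are never visited. One way around this is to wrap each request in `try`/`except` and record a placeholder for the URLs that fail. The sketch below illustrates the pattern without a network connection. `collect_titles` and `fake_fetch_title` are hypothetical names, and the URLs are made up. It catches `OSError`, which covers Python's built-in `ConnectionError`. The exceptions raised by `requests` derive from `OSError` as well, although catching `requests.exceptions.RequestException` would be the more specific choice in the notebook's loop.

```python
def collect_titles(urls, fetch_title):
    """Call fetch_title on each URL and store None when the request fails."""
    titles = []
    failed = []
    for url in urls:
        try:
            titles.append(fetch_title(url))
        except OSError as err:  # Network-level failures, e.g. host not found
            print(f"Could not fetch {url}: {err}")
            titles.append(None)  # Keeps titles aligned with the list of URLs
            failed.append(url)
    return titles, failed


def fake_fetch_title(url):
    """Stand-in for the request + title extraction step (illustration only)."""
    if "unreachable" in url:
        raise ConnectionError("Failed to establish a new connection")
    return "Title of " + url


# Made-up URLs for illustration only
example_urls = [
    "https://example.com/a",
    "https://unreachable.example/b",
    "https://example.com/c",
]

titles, failed = collect_titles(example_urls, fake_fetch_title)
print(titles)
print(failed)
```

In the notebook, `fetch_title` would be a function containing the delay, the `requests.get` call and the `Selector` step from the loop above.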

In [51]:
venez_titles

[['Education and Training – CyberWyoming'],
 ['COLLABORATION WITH EDUCATION – CyberWyoming'],
 ['CyberWyoming – Cybersecurity outreach programs for business, education, and workforce development.'],
 ['Ultimate Access Education'],
 ['Education in Australia: an overview - Jobs in Australia | Immigration to Australia | Working In Australia',
  'Account Suspended'],
 ['University and tertiary education - Jobs in Australia | Immigration to Australia | Working In Australia',
  'Account Suspended'],
 ['Education is power — but do your students know it? - Misfit Teachers']]