# Scraping prison pages

We want to scrape the GOV.UK page for each prison in England and Wales, most of which were created at the start of the first lockdown in March 2020, and record when each page was updated (plus other data such as official Twitter accounts).

First let's import the libraries that we'll need.

In [None]:
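#Install the libraries that aren't already available
!pip install scraperwiki
!pip install cssselect
#Import the libraries
import scraperwiki
import lxml.html
#cssselect needs to be installed for lxml's .cssselect() method to work
import cssselect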


In [None]:
#import pandas
import pandas as pd

## Scrape the index of links

Links to all the prison pages are in sections on https://www.gov.uk/government/collections/prisons-in-england-and-wales

We need to fetch them and follow them.

In [None]:
indexurl = "https://www.gov.uk/government/collections/prisons-in-england-and-wales"
#use the scrape function on that url
html = scraperwiki.scrape(indexurl)
# turn that variable's contents into an lxml object, making it easier to drill into
root = lxml.html.fromstring(html) 
#We want all the a tags within <li> tags within a <div>
links = root.cssselect('div li a')
#How many matches
print(len(links))

177


[This Wikipedia page](https://en.wikipedia.org/wiki/List_of_prisons_in_the_United_Kingdom) says there are 150 prisons across the whole UK, so 177 is in the right range, but the selector has probably grabbed some links we don't need. Let's look at the first ten.

In [None]:
for link in links[:10]:
  print(link.text_content())


        Departments
      

        Worldwide
      

        How government works
      

        Get involved
      

        Consultations
      

        Statistics
      

        News and communications
      
Home
Prisons A - C
Prisons D - F


Most of those are navigation links, not prisons. Let's narrow the selector to links that have a particular attribute value.

In [None]:
#Store the selector (an <a> tag with a specific attribute value) as a separate string
liselector = 'a[data-track-category="navDocumentCollectionLinkClicked"]'
links = root.cssselect(liselector)
#How many matches
print(len(links))
for link in links[:10]:
  print(link.text_content())

120
Altcourse Prison
Ashfield Prison
Askham Grange Prison and Young Offender Institution
Aylesbury Young Offender Institution
Bedford Prison
Belmarsh Prison
Berwyn Prison
Birmingham Prison
Brinsford Prison
Bristol Prison
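
To see why the attribute filter cuts out the navigation links, here is a minimal sketch that uses only Python's standard library instead of lxml. The sample markup is a simplified assumption (the real page is far larger), and `LinkCollector` and `collect_links` are hypothetical names that are not part of the scraper.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup: one navigation link and two prison links.
SAMPLE_HTML = """
<div><ul>
  <li><a href="/government/organisations">Departments</a></li>
  <li><a href="/guidance/altcourse-prison" data-track-category="navDocumentCollectionLinkClicked">Altcourse Prison</a></li>
  <li><a href="/guidance/ashfield-prison" data-track-category="navDocumentCollectionLinkClicked">Ashfield Prison</a></li>
</ul></div>
"""

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, optionally requiring one attribute value."""

    def __init__(self, required_attr=None, required_value=None):
        super().__init__()
        self.required_attr = required_attr
        self.required_value = required_value
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_dict = dict(attrs)
        # Skip links that lack the required attribute value, if one was asked for
        if self.required_attr is not None and attr_dict.get(self.required_attr) != self.required_value:
            return
        if "href" in attr_dict:
            self.hrefs.append(attr_dict["href"])

def collect_links(html, required_attr=None, required_value=None):
    parser = LinkCollector(required_attr, required_value)
    parser.feed(html)
    return parser.hrefs

# Every link, like the broad 'div li a' selector
all_links = collect_links(SAMPLE_HTML)
# Only links carrying the tracking attribute, like the narrower selector
prison_links = collect_links(SAMPLE_HTML, "data-track-category", "navDocumentCollectionLinkClicked")
print(len(all_links), len(prison_links))
```

The broad version counts the navigation link as well, while the filtered version keeps only the two prison links.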


In [None]:
#Check the last 20
for link in links[-20:]:
  print(link.text_content())

Swansea Prison
Swinfen Hall Prison
Thameside Prison
The Mount Prison
The Verne Prison
Thorn Cross Prison
Usk Prison
Wakefield Prison
Wandsworth Prison
Warren Hill Prison
Wayland Prison
Wealstun Prison
Werrington Young Offender Institution
Wetherby Young Offender Institution
Whatton Prison
Whitemoor prison
Winchester Prison
Woodhill Prison
Wormwood Scrubs Prison
Wymott Prison


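Not quite 150, but the list runs alphabetically right to the end, so it looks like every prison on the page is included.

Now we need the URL each link points to, which is stored in its href attribute.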

In [None]:
#Check the last 10
for link in links[-10:]:
  print(link.attrib['href'])

/guidance/wayland-prison
/guidance/wealstun-prison
/guidance/werrington-yoi
/guidance/wetherby-yoi
/guidance/whatton-prison
/guidance/whitemoor-prison
/guidance/winchester-prison
/guidance/woodhill-prison
/guidance/wormwood-scrubs-prison
/guidance/wymott-prison


In [None]:
#Create an empty list
pagelinks = []
#fill with all the links
for link in links:
  pagelinks.append(link.attrib['href'])
#Check it worked
print(pagelinks)
print(len(pagelinks))

['/guidance/altcourse-prison', '/guidance/ashfield-prison', '/guidance/askham-grange-prison', '/guidance/aylesbury-yoi', '/guidance/bedford-prison', '/guidance/belmarsh-prison', '/guidance/berwyn-prison', '/guidance/birmingham-prison', '/guidance/brinsford-prison', '/guidance/bristol-prison', '/guidance/brixton-prison', '/guidance/bronzefield-prison', '/guidance/buckley-hall-prison', '/guidance/bullingdon-prison', '/guidance/bure-prison', '/guidance/cardiff-prison', '/guidance/channings-wood-prison', '/guidance/chelmsford-prison', '/guidance/coldingley-prison', '/guidance/cookham-wood-yoi', '/guidance/dartmoor-prison', '/guidance/deerbolt-prison', '/guidance/doncaster-prison', '/guidance/dovegate-prison', '/guidance/downview-prison', '/guidance/drake-hall-prison', '/guidance/durham-prison', '/guidance/east-sutton-park-prison', '/guidance/eastwood-park-prison', '/guidance/elmley-prison', '/guidance/erlestoke-prison', '/guidance/exeter-prison', '/guidance/featherstone-prison', '/guidance
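
These links are relative paths, so each one needs the site's address in front of it before it can be fetched. As a minimal sketch, `urljoin` from the standard library does this; `BASE_URL`, `sample_links` and `full_urls` are illustrative names, and plain string concatenation works just as well for paths like these.

```python
from urllib.parse import urljoin

# The domain the relative links belong to
BASE_URL = "https://www.gov.uk"

# Two of the relative links from the list above
sample_links = ['/guidance/altcourse-prison', '/guidance/ashfield-prison']

# Turn each relative path into a full URL
full_urls = [urljoin(BASE_URL, link) for link in sample_links]
print(full_urls)
```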

## Scraping the detail pages

Now to grab the details. First, we need to test it on one URL:

In [None]:
#Store a url to test
testdetailurl = "https://www.gov.uk/guidance/altcourse-prison"
html = scraperwiki.scrape(testdetailurl)
# turn that variable's contents into an lxml object, making it easier to drill into
root = lxml.html.fromstring(html) 
#We want the <p> tags inside the <li> tags within a <div id="full-history"> - first the text
lis = root.cssselect('div#full-history li p')
#But also the time tags
times = root.cssselect('div#full-history li time')
#How many matches - these should be the same
print(len(lis))
print(len(times))
#This should be True
print(len(lis) == len(times))
#Loop through a list of numbers that corresponds to the indices we need for both lists
for i in range(0,len(times)):
  #Show the text
  print(lis[i].text_content())
  #Show the timestamp
  print(times[i].attrib['datetime'])

4
4
Updated visiting information in line with coronavirus restrictions.
2020-10-14T12:01:24.000+01:00
Added survey link
2020-04-15T11:55:11.000+01:00
Prisons visits update
2020-03-25T17:55:37.000+00:00
First published.
2020-01-15T14:48:00.000+00:00


That matches the update history we can see on the page, and the datetime attribute gives us a more specific timestamp than the date shown in the text.
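
The timestamps are ISO 8601 strings with a UTC offset (+01:00 in British Summer Time, +00:00 otherwise). As a minimal sketch, assuming Python 3.7 or later, the standard library can parse them so that they can be compared or sorted. The variable names are illustrative and not part of the scraper.

```python
from datetime import datetime, timezone

# Two of the timestamps scraped from the test page
timestamps = ["2020-10-14T12:01:24.000+01:00", "2020-03-25T17:55:37.000+00:00"]

# Parse each string into a timezone-aware datetime
parsed = [datetime.fromisoformat(ts) for ts in timestamps]

# Convert to UTC so that times from summer and winter are directly comparable
as_utc = [dt.astimezone(timezone.utc) for dt in parsed]

# The offset of the first timestamp, in hours
offset_hours = parsed[0].utcoffset().total_seconds() / 3600

# The earliest update of the two
earliest = min(parsed)

for dt in as_utc:
  print(dt.isoformat())
```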

Let's now store that in a function.

In [None]:
#define the function, it takes one argument called 'detailurl'
def scrapedetail(detailurl):
  #scrape page
  html = scraperwiki.scrape(detailurl)
  # turn that variable's contents into an lxml object, making it easier to drill into
  root = lxml.html.fromstring(html)
  #We want the <p> tags inside the <li> tags within a <div id="full-history"> - first the text
  lis = root.cssselect('div#full-history li p')
  #But also the time tags
  times = root.cssselect('div#full-history li time')
  #The two lists should be the same length, so each change note pairs with one timestamp
  if len(lis) == len(times):
    #Loop through a list of numbers that corresponds to the indices we need for both lists
    for i in range(0,len(times)):
      #Start a fresh dict for each change so values never carry over from an earlier row
      record = {}
      #store url in dict
      record['url'] = detailurl
      #Store the change note
      record['changetext'] = lis[i].text_content()
      #Store the machine-readable timestamp
      timestamp = times[i].attrib['datetime']
      record['timestamp'] = timestamp
      #And the date as shown on the page
      record['timetext'] = times[i].text_content()
      #Combine url and timestamp to identify each change
      record['uniqueid'] = detailurl+timestamp
      print(record)
      #Store in datastore - unique_keys takes a list of column names
      scraperwiki.sql.save(unique_keys=['uniqueid'], data=record, table_name="testrun2")
  #If the counts differ we can't pair them up reliably, so the page is flagged and stored in a separate table
  else:
    record = {}
    record['url'] = detailurl
    record['changetext'] = "MISMATCH"
    record['timestamp'] = "MISMATCH"
    record['timetext'] = "MISMATCH"
    record['uniqueid'] = detailurl
    print(record)
    #Store in datastore
    scraperwiki.sql.save(unique_keys=['uniqueid'], data=record, table_name="testrun3")

#And test
#scrapedetail("https://www.gov.uk/guidance/ashfield-prison")
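
The unique key is what lets us re-run the scraper without piling up duplicate rows: a record whose key already exists is overwritten. The sketch below imitates that behaviour with the standard library's sqlite3 module. It is an assumption about the effect, not the library's own code, and the table name `changes` and function `save_row` are hypothetical.

```python
import sqlite3

# An in-memory database standing in for the datastore
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE changes (uniqueid TEXT, changetext TEXT)")
# A unique index on the key column means a repeated key cannot create a second row
conn.execute("CREATE UNIQUE INDEX changes_uniqueid ON changes (uniqueid)")

def save_row(row):
  #Insert the row, replacing any existing row with the same uniqueid
  conn.execute(
    "INSERT OR REPLACE INTO changes (uniqueid, changetext) VALUES (:uniqueid, :changetext)",
    row,
  )

row = {
  "uniqueid": "https://www.gov.uk/guidance/altcourse-prison2020-01-15T14:48:00.000+00:00",
  "changetext": "First published.",
}
#Save the same record twice, as would happen if the scraper ran twice
save_row(row)
save_row(row)

row_count = conn.execute("SELECT COUNT(*) FROM changes").fetchone()[0]
print(row_count)
```

Saving the same record twice leaves a single row in the table.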

## Looping through the URLs

We now test that on a subset of the URLs scraped earlier.

In [None]:
for link in pagelinks[10:20]:
  fullurl = "https://www.gov.uk"+link
  print("scraping",fullurl)
  scrapedetail(fullurl)

scraping https://www.gov.uk/guidance/brixton-prison
{'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated visit info', 'timestamp': '2020-12-21T14:33:26.000+00:00', 'timetext': '21 December 2020', 'uniqueid': 'https://www.gov.uk/guidance/brixton-prison2020-12-21T14:33:26.000+00:00'}
{'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated visiting information in line with new local restriction tiers.', 'timestamp': '2020-12-17T11:11:39.000+00:00', 'timetext': '17 December 2020', 'uniqueid': 'https://www.gov.uk/guidance/brixton-prison2020-12-17T11:11:39.000+00:00'}
{'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated visiting information in line with new local restriction tiers.', 'timestamp': '2020-12-04T10:29:29.000+00:00', 'timetext': '4 December 2020', 'uniqueid': 'https://www.gov.uk/guidance/brixton-prison2020-12-04T10:29:29.000+00:00'}
{'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated vis

## Checking data in the datastore

Now let's see what's been stored in the datastore.

In [None]:
testrunresults = scraperwiki.sql.select("* from testrun2")
print(testrunresults)
#Create a dataframe and assign the data to it
testrunresultsdf = pd.DataFrame(testrunresults)
#Print the first few rows (head)
print(testrunresultsdf.head())

[{'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated visit info', 'timestamp': '2020-12-21T14:33:26.000+00:00', 'timetext': '21 December 2020', 'uniqueid': 'https://www.gov.uk/guidance/brixton-prison2020-12-21T14:33:26.000+00:00'}, {'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated visiting information in line with new local restriction tiers.', 'timestamp': '2020-12-17T11:11:39.000+00:00', 'timetext': '17 December 2020', 'uniqueid': 'https://www.gov.uk/guidance/brixton-prison2020-12-17T11:11:39.000+00:00'}, {'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated visiting information in line with new local restriction tiers.', 'timestamp': '2020-12-04T10:29:29.000+00:00', 'timetext': '4 December 2020', 'uniqueid': 'https://www.gov.uk/guidance/brixton-prison2020-12-04T10:29:29.000+00:00'}, {'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated visiting information in line with new local restric

In [None]:
#Export as a CSV file using the .to_csv function https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
testrunresultsdf.to_csv('testrunresultsdf.csv')

## Run on full list

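That appears to be working. Now let's run it on the full list. Only the table for mismatched pages is renamed here (from testrun3 to fullrun1); matched rows are still saved to testrun2, where the unique key means pages already scraped in the test run are overwritten, not duplicated: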

In [None]:
#define the function, it takes one argument called 'detailurl'
def scrapedetail(detailurl):
  #scrape page
  html = scraperwiki.scrape(detailurl)
  # turn that variable's contents into an lxml object, making it easier to drill into
  root = lxml.html.fromstring(html)
  #We want the <p> tags inside the <li> tags within a <div id="full-history"> - first the text
  lis = root.cssselect('div#full-history li p')
  #But also the time tags
  times = root.cssselect('div#full-history li time')
  #The two lists should be the same length, so each change note pairs with one timestamp
  if len(lis) == len(times):
    #Loop through a list of numbers that corresponds to the indices we need for both lists
    for i in range(0,len(times)):
      #Start a fresh dict for each change so values never carry over from an earlier row
      record = {}
      #store url in dict
      record['url'] = detailurl
      #Store the change note
      record['changetext'] = lis[i].text_content()
      #Store the machine-readable timestamp
      timestamp = times[i].attrib['datetime']
      record['timestamp'] = timestamp
      #And the date as shown on the page
      record['timetext'] = times[i].text_content()
      #Combine url and timestamp to identify each change
      record['uniqueid'] = detailurl+timestamp
      print(record)
      #Store in datastore - matched rows still go to testrun2; only the mismatch table below is renamed
      scraperwiki.sql.save(unique_keys=['uniqueid'], data=record, table_name="testrun2")
  #If the counts differ we can't pair them up reliably, so the page is flagged and stored in a separate table
  else:
    record = {}
    record['url'] = detailurl
    record['changetext'] = "MISMATCH"
    record['timestamp'] = "MISMATCH"
    record['timetext'] = "MISMATCH"
    record['uniqueid'] = detailurl
    print(record)
    #Store in datastore
    scraperwiki.sql.save(unique_keys=['uniqueid'], data=record, table_name="fullrun1")

#And test
#scrapedetail("https://www.gov.uk/guidance/ashfield-prison")

In [None]:
for link in pagelinks:
  fullurl = "https://www.gov.uk"+link
  print("scraping",fullurl)
  scrapedetail(fullurl)

scraping https://www.gov.uk/guidance/altcourse-prison
{'url': 'https://www.gov.uk/guidance/altcourse-prison', 'changetext': 'Updated visiting information in line with new local restriction tiers.', 'timestamp': '2020-12-04T13:02:51.000+00:00', 'timetext': '4 December 2020', 'uniqueid': 'https://www.gov.uk/guidance/altcourse-prison2020-12-04T13:02:51.000+00:00'}
{'url': 'https://www.gov.uk/guidance/altcourse-prison', 'changetext': 'Updated visiting information in line with new local restriction tiers.', 'timestamp': '2020-12-02T20:00:24.000+00:00', 'timetext': '2 December 2020', 'uniqueid': 'https://www.gov.uk/guidance/altcourse-prison2020-12-02T20:00:24.000+00:00'}
{'url': 'https://www.gov.uk/guidance/altcourse-prison', 'changetext': 'Updated visiting information in line with coronavirus restrictions.', 'timestamp': '2020-10-14T12:01:24.000+01:00', 'timetext': '14 October 2020', 'uniqueid': 'https://www.gov.uk/guidance/altcourse-prison2020-10-14T12:01:24.000+01:00'}
{'url': 'https://ww
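
The loop above stops at the first page that raises an error, and it sends its requests back to back. One possible way to make a long run more forgiving is sketched below, using only the standard library. `scrape_all` and `fake_scrape` are hypothetical names, and the stand-in function replaces the real `scrapedetail` so the example runs without touching the network.

```python
import time

def scrape_all(urls, scrape_one, delay_seconds=1.0):
  """Call scrape_one on each URL, pausing between requests and noting any failures."""
  failed = []
  for url in urls:
    try:
      scrape_one(url)
    except Exception as error:
      #Record the problem and carry on with the next page
      failed.append((url, str(error)))
    #Pause so that requests are not sent back to back
    time.sleep(delay_seconds)
  return failed

#A stand-in for scrapedetail that fails on one made-up URL
def fake_scrape(url):
  if "broken" in url:
    raise ValueError("could not fetch page")

test_urls = [
  "https://www.gov.uk/guidance/altcourse-prison",
  "https://www.gov.uk/guidance/broken-page",
  "https://www.gov.uk/guidance/ashfield-prison",
]
#No delay here, to keep the example quick
failures = scrape_all(test_urls, fake_scrape, delay_seconds=0)
print(failures)
```

In the notebook this could be called with the list of full URLs and `scrapedetail`, and the returned list would show which pages need another attempt.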

In [None]:
testrunresults = scraperwiki.sql.select("* from testrun2")
print(testrunresults)
#Create a dataframe and assign the data to it
testrunresultsdf = pd.DataFrame(testrunresults)
#Print the first few rows (head)
print(testrunresultsdf.head())

[{'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated visit info', 'timestamp': '2020-12-21T14:33:26.000+00:00', 'timetext': '21 December 2020', 'uniqueid': 'https://www.gov.uk/guidance/brixton-prison2020-12-21T14:33:26.000+00:00'}, {'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated visiting information in line with new local restriction tiers.', 'timestamp': '2020-12-17T11:11:39.000+00:00', 'timetext': '17 December 2020', 'uniqueid': 'https://www.gov.uk/guidance/brixton-prison2020-12-17T11:11:39.000+00:00'}, {'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated visiting information in line with new local restriction tiers.', 'timestamp': '2020-12-04T10:29:29.000+00:00', 'timetext': '4 December 2020', 'uniqueid': 'https://www.gov.uk/guidance/brixton-prison2020-12-04T10:29:29.000+00:00'}, {'url': 'https://www.gov.uk/guidance/brixton-prison', 'changetext': 'Updated visiting information in line with new local restric

In [None]:
#Export as a CSV file using the .to_csv function https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
testrunresultsdf.to_csv('testrunresultsdf.csv')