# Scraping coroners' reports

We want to scrape coroners' reports at https://www.judiciary.uk/related-offices-and-bodies/office-chief-coroner/https-www-judiciary-uk-subject-community-health-care-and-emergency-services-related-deaths/

Breaking down the process, we need to do the following:

1. Scrape the category links on that page
2. Go to each category page and scrape the results (name, detail link, categories, etc.)
3. Go to the detail page and scrape the document details (response)
4. Find the 'next' page link and repeat

First we install and import the libraries we need:

In [None]:
#!/usr/bin/env python
!pip install scraperwiki
import scraperwiki
import lxml.html
!pip install cssselect
import cssselect
import json
import urllib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scraperwiki
  Downloading scraperwiki-0.5.1.tar.gz (7.7 kB)
  Preparing metadata (setup.py) ... done
Collecting alembic
  Downloading alembic-1.9.1-py3-none-any.whl (210 kB)
Collecting Mako
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
Building wheels for collected packages: scraperwiki
  Building wheel for scraperwiki (setup.py) ... done
  Created wheel for scraperwiki: filename=scraperwiki-0.5.1-py3-none-any.whl size=6543 sha256=78b2315f2713f35033dccefc6c0715ad85a0a69516b45c5b3aa7707379c18871
  Stored in directory: /root/.cache/pip/wheels/9b/18/cf/a5b4f9182c2f001b76662535e0a806c13e996193eaaf2c9169
Successfully b

## Scraping the links to category pages

We need to store the URL of the starting page and then scrape all the category pages linked from that:

In [None]:
#store the URL we want to start scraping from
starturl = 'https://www.judiciary.uk/related-offices-and-bodies/office-chief-coroner/https-www-judiciary-uk-subject-community-health-care-and-emergency-services-related-deaths/'

In [None]:
html = scraperwiki.scrape(starturl)
#convert from a string into an lxml object that can be parsed using lxml functions
root = lxml.html.fromstring(html)
#use the lxml function cssselect to grab any link inside a list inside an <article> tag, which is where the category links are
catlinks = root.cssselect('article ul li a')
#measure the length of that list of matches, and print it
print ('found ',len(catlinks), ' reports')

found  7  reports


In [None]:
#Loop through the matches and print the href="" attribute
for link in catlinks:
  print(link.attrib['href'])

https://www.judiciary.uk/courts-and-tribunals/coroners-courts/coroners/
https://www.judiciary.uk/courts-and-tribunals/coroners-courts/office-chief-coroner/
https://www.judiciary.uk/courts-and-tribunals/coroners-courts/coroners-legislation-guidance-and-advice/
https://www.judiciary.uk/courts-and-tribunals/coroners-courts/annual-reports/
https://www.judiciary.uk/courts-and-tribunals/coroners-courts/judge-led-inquests/
https://www.judiciary.uk/courts-and-tribunals/coroners-courts/coroners-appointments-contacts-and-areas/
https://www.judiciary.uk/courts-and-tribunals/coroners-courts/concerns-and-complaints-about-coroners/


That seems to have worked nicely. We need to store those so we can loop through and scrape each one:

In [None]:
#create an empty list to store the URLs
caturls = []
#Loop through the matches and append the href="" attribute of each to the list
for link in catlinks:
  caturls.append(link.attrib['href'])
print(caturls)

['https://www.judiciary.uk/subject/accident-at-work-and-health-and-safety-related-deaths/', 'https://www.judiciary.uk/subject/alcohol-drug-and-medication-related-deaths/', 'https://www.judiciary.uk/subject/care-home-health-related-deaths/', 'https://www.judiciary.uk/subject/child-death', 'https://www.judiciary.uk/subject/community-health-care-and-emergency-services-related-deaths/', 'https://www.judiciary.uk/subject/emergency-services-related-deaths-prevention-of-future-deaths/', 'https://www.judiciary.uk/subject/hospital-death-clinical-procedures-and-medical-management-related-deaths/', 'https://www.judiciary.uk/subject/mental-health-related-deaths/', 'https://www.judiciary.uk/subject/other-related-deaths/', 'https://www.judiciary.uk/subject/police-related-deaths/', 'https://www.judiciary.uk/subject/product-related-deaths/', 'https://www.judiciary.uk/subject/road-highways-safety-related-deaths/', 'https://www.judiciary.uk/subject/railway-related-deaths/', 'https://www.judiciary.uk/sub

We now need to create a function to scrape each category page, before letting it loose on all those URLs.

## Creating a scraper function for category pages

In [None]:
#create a function to scrape each results page
def scraperesultspage(url):
    #scrape webpage into a variable called 'html'
    html = scraperwiki.scrape(url)
    #convert from a string into an lxml object that can be parsed using lxml functions
    root = lxml.html.fromstring(html)
    #use the lxml function cssselect to grab anything inside an <article> tag, which is where the reports are
    articles = root.cssselect('article')
    #measure the length of that list of results, and print it
    print ('found ',len(articles), ' reports')

#test it
scraperesultspage(caturls[0])

found  10  reports


10 results is what we'd expect. Let's extend that scraper to start storing different data points from each report. We create an empty dictionary first to store that data.

In [None]:
#create an empty dictionary variable to hold our data later
record = {}

#create a function to scrape each results page
def scraperesultspage(url):
    #scrape webpage into a variable called 'html'
    html = scraperwiki.scrape(url)
    #convert from a string into an lxml object that can be parsed using lxml functions
    root = lxml.html.fromstring(html)
    #use the lxml function cssselect to grab anything inside an <article> tag, which is where the reports are
    articles = root.cssselect('article')
    #measure the length of that list of results, and print it
    print ('found ',len(articles), ' reports')
    #loop through each match in the list
    for article in articles:
        #grab the heading 5 tags with links
        links = article.cssselect('h5 a')
        #There should only be one, so we grab the text of that and store in the dictionary variable 'record'
        record['title'] = links[0].text_content()
        #print the data so far
        print(record)
        #grab the link attribute and store 
        record['reporturl'] = links[0].attrib['href']
        #grab the <time> tags
        times = article.cssselect('time')
        #there's only 1, grab the text content of that
        record['time'] = times[0].text_content()
        #grab the datetime= attribute, this is more specifically encoded
        record['timestamp'] = times[0].attrib['datetime']
        #grab all the <span> tags - there should be 4
        spans = article.cssselect('span a')
        #sometimes there are more than 4, where multiple categories are used, so we create an empty variable...
        categories = ""
        #...and loop through all the spans, adding them together with a semi-colon separator
        for span in spans:
            categories = categories+";"+span.text_content()
        #print categories
        #next we clean them up by replacing the redundant text content which doesn't refer to categories
        categoriescleaned = categories.replace('Prevention of Future Deaths','').replace('PFD Report','').replace('Coroner','').replace(';;','') #the double semi-colon is left over from when they were added
        #print categoriescleaned
        record['categories'] = categoriescleaned
        #This should always be 4, but we're storing it in case there are occasions when there are more, in which case we need to adapt the scraper
        record['spans'] = len(spans)
        print('test', record)
        #the summary is in <div class="entry-summary"> so this grabs any matches
        summaries = article.cssselect('div.entry-summary')
        #store the full text of the first (and only) match in a variable called 'summary'
        summary = summaries[0].text_content()
        print('summary: ', summary)
        print(len(summaries))
        #split it on key phrases and store results
        #at some point spaces are replaced with &nbsp; so we can't use spaces in the splits
        record['reportdate'] = summary.split('Date of report:')[1].split('Ref:')[0]
        #FORK TO GO HERE
        #Store the data
        scraperwiki.sql.save(['reporturl'],record, table_name = "test1") 

#test it
scraperesultspage(caturls[1])

found  10  reports
{'title': 'Daniel Coleman'}
test {'title': 'Daniel Coleman', 'reporturl': 'https://www.judiciary.uk/publications/daniel-coleman/', 'time': '26 October 2020', 'timestamp': '2020-10-26T17:33:42+00:00', 'categories': 'Alcohol, drug and medication related deaths;Other related deaths', 'spans': 5}
summary:  
    Date of report: 25 August 2020 Ref: 2020-0166 Deceased name: Daniel Coleman Coroner name: M E Hassell Coroner Area: Inner North London Category: Alcohol, drug and medication related deaths; Other related deaths This report is ...  
1
{'title': 'Toby Nieland', 'reporturl': 'https://www.judiciary.uk/publications/daniel-coleman/', 'time': '26 October 2020', 'timestamp': '2020-10-26T17:33:42+00:00', 'categories': 'Alcohol, drug and medication related deaths;Other related deaths', 'spans': 5, 'reportdate': ' 25 August 2020 '}
test {'title': 'Toby Nieland', 'reporturl': 'https://www.judiciary.uk/publications/toby-nieland/', 'time': '26 October 2020', 'timestamp': '2020-
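
The key-phrase split can be tried on its own, using part of the summary text printed above as a plain string. This is a minimal sketch: `extract_between` is a hypothetical helper (not part of the notebook), and `.strip()` is added to remove the padding spaces that are visible in the stored `reportdate` value.

```python
# Hypothetical helper: return the text between two labels,
# or None if either label is missing
def extract_between(text, start_label, end_label):
    if start_label not in text:
        return None
    # Keep everything after the first occurrence of the start label
    after_start = text.split(start_label, 1)[1]
    if end_label not in after_start:
        return None
    # Keep everything before the end label and trim surrounding whitespace
    return after_start.split(end_label, 1)[0].strip()

# Summary text copied from the output above (shortened)
summary = ("Date of report: 25 August 2020 Ref: 2020-0166 "
           "Deceased name: Daniel Coleman Coroner name: M E Hassell")

report_date = extract_between(summary, "Date of report:", "Ref:")
ref = extract_between(summary, "Ref:", "Deceased name:")
print(report_date)
print(ref)
```

Returning `None` for a missing label avoids the `IndexError` that a chained `split(...)[1]` raises when the phrase is not on the page.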

## RETHINK: Going through the 'coroners' page and focusing on detail pages

At this point I notice that there's a 'Coroners' link which takes you to a page which appears to have *all* coroner reports in all categories: https://www.judiciary.uk/publication-jurisdiction/coroner/

I also realise that if we're going to grab details on the responses then it will be simpler to skip the results pages and only scrape the detail pages instead.

So the process is now:

1. Write a scraper for the detail page
2. Write a scraper for the 'all reports' page, which also goes to each 'next' page, and grabs each detail link, then runs the first scraper on the page at that link

## Creating a new function to scrape the detail pages

We begin by scraping the detail page. Later we can create a scraper to fetch the URLs of all the detail pages and run this function on those.

In [None]:
def scrapedetail(detailurl):
  #scrape webpage into a variable called 'html'
  html = scraperwiki.scrape(detailurl)
  #convert from a string into an lxml object that can be parsed using lxml functions
  root = lxml.html.fromstring(html)
  #use the lxml function cssselect to grab anything inside an <article> tag, which is where the reports are
  articles = root.cssselect('article')
  #measure the length of that list of results, and print it
  print ('found ',len(articles), ' reports')

#Test
scrapedetail("https://www.judiciary.uk/publications/kerry-aldridge/")

found  1  reports


Now that we know it grabs the single `<article>` tag containing the information we need, we can expand the function further...

In [None]:
#Create empty record to store data
record = {}

#define the function
def scrapedetail(detailurl):
  #scrape webpage into a variable called 'html'
  html = scraperwiki.scrape(detailurl)
  #convert from a string into an lxml object that can be parsed using lxml functions
  root = lxml.html.fromstring(html)
  #use the lxml function cssselect to grab anything inside an <article> tag, which is where the reports are
  articles = root.cssselect('article')
  #there's only one, so store that
  article = articles[0]
  #grab the <h1> tag (the page heading)
  h1s = article.cssselect('h1')
  #There should only be one, so we grab the text of that and store in the dictionary variable 'record'
  record['title'] = h1s[0].text_content()
  #grab the <time> tags
  times = article.cssselect('time')
  #there's only 1, grab the text content of that
  record['time'] = times[0].text_content()
  #grab the datetime= attribute, this is more specifically encoded
  record['timestamp'] = times[0].attrib['datetime']
  print(record)
  #grab all the <span> tags
  spans = article.cssselect('span a')
  #sometimes there are more than 4, where multiple categories are used
  for span in spans:
    print(span.text_content())
  

#Test
scrapedetail("https://www.judiciary.uk/publications/kerry-aldridge/")

{'title': 'Kerry Aldridge', 'time': '18 March 2020', 'timestamp': '2020-03-18T12:51:46+00:00'}
Prevention of Future Deaths
Mental Health related deaths
Railway related deaths
Suicide (from 2015)
PFD Report
Coroner
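
The `datetime=` attribute holds an ISO 8601 timestamp, which can be parsed more reliably than the display text. A minimal sketch using the two values printed above; the variable names are illustrative, and parsing the month name with `%B` assumes an English locale.

```python
from datetime import datetime

# Values copied from the output above
timestamp = "2020-03-18T12:51:46+00:00"
display_date = "18 March 2020"

# fromisoformat() understands the date, time and UTC offset in one go
parsed_timestamp = datetime.fromisoformat(timestamp)
# The display text needs an explicit format: day, full month name, year
parsed_display = datetime.strptime(display_date, "%d %B %Y")

print(parsed_timestamp.isoformat())
print(parsed_display.date())
```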


### Handling multiple categories

How do we deal with this list of categories and other information, which can vary in both length and content from one report to the next?

Firstly, we can remove the PFD Report and Coroner entries because a) they're always there and b) we are only interested in the categories.

There are 2 ways we can do this: by position (they are always the last and penultimate items in the list) or by text match (they always use the same text).

What about the remaining categories? We can use the text as the *key* for the dictionary, and set the *value* to true, e.g. `'Prevention of Future Deaths': True`.
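
Both removal strategies, and the category-as-key idea, can be sketched on the list printed above without fetching the page. The names `by_position`, `by_text`, `unwanted` and `category_record` are illustrative only.

```python
# Link text copied from the output above
spantext = ['Prevention of Future Deaths', 'Mental Health related deaths',
            'Railway related deaths', 'Suicide (from 2015)',
            'PFD Report', 'Coroner']

# Option 1: by position, dropping the last two items
by_position = spantext[:-2]

# Option 2: by text match, keeping anything not in the unwanted set
unwanted = {'PFD Report', 'Coroner'}
by_text = [t for t in spantext if t not in unwanted]

# Use each remaining category as a key with the value True
category_record = {}
for category in by_text:
    category_record[category] = True

print(by_position == by_text)
print(category_record)
```

Matching by text keeps working if the order of the links changes, and unlike `list.remove()` the list comprehension does not raise an error when one of the entries is absent.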

In [None]:
#Create empty record to store data
record = {}

#define the function
def scrapedetail(detailurl):
  #scrape webpage into a variable called 'html'
  html = scraperwiki.scrape(detailurl)
  #convert from a string into an lxml object that can be parsed using lxml functions
  root = lxml.html.fromstring(html)
  #use the lxml function cssselect to grab anything inside an <article> tag, which is where the reports are
  articles = root.cssselect('article')
  #there's only one, so store that
  article = articles[0]
  #grab the <h1> tag (the page heading)
  h1s = article.cssselect('h1')
  #There should only be one, so we grab the text of that and store in the dictionary variable 'record'
  record['title'] = h1s[0].text_content()
  #grab the <time> tags
  times = article.cssselect('time')
  #there's only 1, grab the text content of that
  record['time'] = times[0].text_content()
  #grab the datetime= attribute, this is more specifically encoded
  record['timestamp'] = times[0].attrib['datetime']
  print(record)
  #grab all the <span> tags
  spans = article.cssselect('span a')
  #Create an empty list to store the text
  spantext = []
  #loop through the spans and store the text in that list
  for span in spans:
    spantext.append(span.text_content())
  #Remove the two values we don't want
  spantext.remove('PFD Report')
  spantext.remove('Coroner')
  print(spantext)
  for t in spantext:
    record[t] = True
  print(record)
  

#Test
scrapedetail("https://www.judiciary.uk/publications/kerry-aldridge/")

{'title': 'Kerry Aldridge', 'time': '18 March 2020', 'timestamp': '2020-03-18T12:51:46+00:00'}
['Prevention of Future Deaths', 'Mental Health related deaths', 'Railway related deaths', 'Suicide (from 2015)']
{'title': 'Kerry Aldridge', 'time': '18 March 2020', 'timestamp': '2020-03-18T12:51:46+00:00', 'Prevention of Future Deaths': True, 'Mental Health related deaths': True, 'Railway related deaths': True, 'Suicide (from 2015)': True}


### Handling the data labels stored as text

Using each category as a key is one approach. Another would be to create a dictionary of all the categories in advance, each set to `False` by default, and switch a category to `True` if it appears in the list captured by the code.
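
That alternative might look something like the sketch below. The category list here is a short illustrative subset taken from the outputs in this notebook, not the site's full list, and `categories_to_fields` is a hypothetical helper.

```python
# Illustrative subset of categories (the real list would be longer)
ALL_CATEGORIES = [
    'Mental Health related deaths',
    'Railway related deaths',
    'Suicide (from 2015)',
    'Care Home Health related deaths',
]

def categories_to_fields(found):
    # Start with every known category set to False
    fields = {category: False for category in ALL_CATEGORIES}
    for category in found:
        # Switch on the ones found; an unknown category becomes a new key
        fields[category] = True
    return fields

fields = categories_to_fields(['Railway related deaths', 'Suicide (from 2015)'])
print(fields)
```

Because every record then has the same keys, each row gets an explicit `True` or `False` rather than a missing value.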

Meanwhile, we expand the function further to grab the data on reference number, coroner name, recipients, etc.

In [None]:
#Create empty record to store data
record = {}

#define the function
def scrapedetail(detailurl):
  #scrape webpage into a variable called 'html'
  html = scraperwiki.scrape(detailurl)
  #convert from a string into an lxml object that can be parsed using lxml functions
  root = lxml.html.fromstring(html)
  #use the lxml function cssselect to grab anything inside an <article> tag, which is where the reports are
  articles = root.cssselect('article')
  #there's only one, so store that
  article = articles[0]
  #grab the <h1> tag (the page heading)
  h1s = article.cssselect('h1')
  #There should only be one, so we grab the text of that and store in the dictionary variable 'record'
  record['title'] = h1s[0].text_content()
  #grab the <time> tags
  times = article.cssselect('time')
  #there's only 1, grab the text content of that
  record['published'] = times[0].text_content()
  #grab the datetime= attribute, this is more specifically encoded
  record['timestamp'] = times[0].attrib['datetime']
  print(record)
  #grab all the <span> tags
  spans = article.cssselect('span a')
  #Create an empty list to store the text
  spantext = []
  #loop through the spans and store the text in that list
  for span in spans:
    spantext.append(span.text_content())
  #Remove the two values we don't want
  spantext.remove('PFD Report')
  spantext.remove('Coroner')
  print(spantext)
  for t in spantext:
    record[t] = True
  print(record)
  #grab the <p> tags
  ps = article.cssselect('div.entry-content p')
  #Create an empty list to store the text content
  ptexts = []
  #Loop through the p tags and append the text to that list
  for p in ps:
    ptexts.append(p.text_content())
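  for t in ptexts:
    #Split into the label (e.g. Ref) and the value (e.g. 2020-0055) and store in a list
    #Splitting only on the first ": " keeps a value that itself contains ": " whole
    tsplit = t.split(": ", 1)
    #Skip any paragraph that is not in "label: value" form
    if len(tsplit) < 2:
      continue
    #this line converts the text label to a key by removing spaces and making it lowercase
    record[tsplit[0].replace(" ","").lower()] = tsplit[1]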
    if "This report is being sent to: " in t:
      #Remove the label part and store the list of recipients
      recipients = t.replace("This report is being sent to: ","").split(";")
      #Use each as a key and store the True value against it
      for r in recipients:
        record[r] = True
  print(record)
  

#Test
scrapedetail("https://www.judiciary.uk/publications/kerry-aldridge/")

{'title': 'Kerry Aldridge', 'published': '18 March 2020', 'timestamp': '2020-03-18T12:51:46+00:00'}
['Prevention of Future Deaths', 'Mental Health related deaths', 'Railway related deaths', 'Suicide (from 2015)']
{'title': 'Kerry Aldridge', 'published': '18 March 2020', 'timestamp': '2020-03-18T12:51:46+00:00', 'Prevention of Future Deaths': True, 'Mental Health related deaths': True, 'Railway related deaths': True, 'Suicide (from 2015)': True}
['South London and Maudsley NHS Foundation', ' Metropolitan Police service']
{'title': 'Kerry Aldridge', 'published': '18 March 2020', 'timestamp': '2020-03-18T12:51:46+00:00', 'Prevention of Future Deaths': True, 'Mental Health related deaths': True, 'Railway related deaths': True, 'Suicide (from 2015)': True, 'dateofreport': '10 February 2020', 'ref': '2020-0055', 'deceasedname': 'Kerry Aldridge', 'coronersname': 'Andrew Harris', 'coronersarea': 'London Inner South', 'category': 'Railway related deaths; Suicide (from 2015); Mental Health relat

We could have used an if/elif block like this...

```python
    if "Date of report: " in t:
      record['reportdate'] = t.replace("Date of report: ","")
    elif "Ref: " in t:
      record['ref'] = t.replace("Ref: ","")
    elif "Deceased name: " in t:
      record['deceasedname'] = t.replace("Deceased name: ","")
    elif "Coroners name: " in t:
      record['coronername'] = t.replace("Coroners name: ","")
    elif "Coroners Area: " in t:
      record['coronerarea'] = t.replace("Coroners Area: ","")
    elif "This report is being sent to: " in t:
      record['sentto'] = t.replace("This report is being sent to: ","")
```

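...but the split-based approach used in the function does essentially the same thing more elegantly, and it picks up any new label without needing another `elif`.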
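
The split-based approach can be tried on a few paragraph strings reconstructed from the keys and values in the output above. In this sketch, `parse_labelled` is a hypothetical name; it skips any paragraph without a `": "` separator and splits only on the first separator, so a value containing `": "` stays whole.

```python
def parse_labelled(paragraphs):
    record = {}
    for text in paragraphs:
        # Split once into label and value
        parts = text.split(": ", 1)
        if len(parts) < 2:
            # No "label: value" structure, so there is nothing to store
            continue
        # Turn the label into a key: no spaces, all lowercase
        key = parts[0].replace(" ", "").lower()
        record[key] = parts[1]
    return record

paragraphs = [
    "Date of report: 10 February 2020",
    "Ref: 2020-0055",
    "Coroners name: Andrew Harris",
    "A paragraph with no label",
]
parsed = parse_labelled(paragraphs)
print(parsed)
```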

### Grab the PDF links and titles

Now to expand the function to fetch the PDFs.

In [None]:
#Create empty record to store data
record = {}

#define the function
def scrapedetail(detailurl):
  #store the url
  record['url'] = detailurl
  #scrape webpage into a variable called 'html'
  html = scraperwiki.scrape(detailurl)
  #convert from a string into an lxml object that can be parsed using lxml functions
  root = lxml.html.fromstring(html)
  #use the lxml function cssselect to grab anything inside an <article> tag, which is where the reports are
  articles = root.cssselect('article')
  #there's only one, so store that
  article = articles[0]
  #grab the <h1> tag (the page heading)
  h1s = article.cssselect('h1')
  #There should only be one, so we grab the text of that and store in the dictionary variable 'record'
  record['title'] = h1s[0].text_content()
  #grab the <time> tags
  times = article.cssselect('time')
  #there's only 1, grab the text content of that
  record['published'] = times[0].text_content()
  #grab the datetime= attribute, this is more specifically encoded
  record['timestamp'] = times[0].attrib['datetime']
  #print(record)
  #grab all the <span> tags
  spans = article.cssselect('span a')
  #Create an empty list to store the text
  spantext = []
  #loop through the spans and store the text in that list
  for span in spans:
    spantext.append(span.text_content())
  #Remove the two values we don't want
  spantext.remove('PFD Report')
  spantext.remove('Coroner')
  #Create a field for each category we find and store a True value against it
  for t in spantext:
    record[t] = True
  #print(record)
  #grab the <p> tags
  ps = article.cssselect('div.entry-content p')
  #Create an empty list to store the text content
  ptexts = []
  #Loop through the p tags and append the text to that list
  for p in ps:
    ptexts.append(p.text_content())
  #Loop through that list of paragraph text
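  for t in ptexts:
    #Split into the label (e.g. Ref) and the value (e.g. 2020-0055) and store in a list
    #Splitting only on the first ": " keeps a value that itself contains ": " whole
    tsplit = t.split(": ", 1)
    #Skip any paragraph that is not in "label: value" form
    if len(tsplit) < 2:
      continue
    #this line converts the text label to a key by removing spaces and making it lowercase
    record[tsplit[0].replace(" ","").lower()] = tsplit[1]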
    #For one in particular we want to repeat the process of 
    # creating fields based on values and setting to True
    if "This report is being sent to: " in t:
      #Remove the label part and store the list of recipients
      recipients = t.replace("This report is being sent to: ","").split(";")
      #Use each as a key and store the True value against it
      for r in recipients:
        record[r] = True
  #the link to the PDF report is in <div class="download-box"> - grab all...
  pdfs = article.cssselect('li.pdf div a')
  #How many did we get
  #print(len(pdfs))
  #Store that
  record['num_of_pdfs'] = len(pdfs)
  #Some pages have no link to a report, so we need an if test before we try to extract that
  if len(pdfs)>0:
    pdflink = pdfs[0].attrib['href']
    #print(pdflink)
  else:
    pdflink = "NO REPORT LINK"
  record['pdflink'] = pdflink
  #Most pages have a response document, so here's another test
  if len(pdfs)>1:
    #Record how many responses there are
    record['responses'] = len(pdfs)-1
    #Loop through them - if there's more than one then we create different fields
    for i in range(1,len(pdfs)):
      #Create some field names by combining the index with a string
      responseurlnum = "responseurl"+str(i)
      responsetitlenum = "responsetitle"+str(i)
      responseorgnum = "responseorg"+str(i)
      #Store attributes at that index under those field names
      record[responseurlnum] = pdfs[i].attrib['href']
      record[responsetitlenum] = pdfs[i].attrib['title']
      record[responseorgnum] = pdfs[i].attrib['title'].replace(" Response from ","").replace(record['ref'],"").replace(" Redacted","")
  #Return the data stored in 'record'
  return(record)
  

#Test the function, storing results in 'testresults'
testresults = scrapedetail("https://www.judiciary.uk/publications/kerry-aldridge/")
#Print
print(testresults)

{'url': 'https://www.judiciary.uk/publications/kerry-aldridge/', 'title': 'Kerry Aldridge', 'published': '18 March 2020', 'timestamp': '2020-03-18T12:51:46+00:00', 'Prevention of Future Deaths': True, 'Mental Health related deaths': True, 'Railway related deaths': True, 'Suicide (from 2015)': True, 'dateofreport': '10 February 2020', 'ref': '2020-0055', 'deceasedname': 'Kerry Aldridge', 'coronersname': 'Andrew Harris', 'coronersarea': 'London Inner South', 'category': 'Railway related deaths; Suicide (from 2015); Mental Health related deaths', 'thisreportisbeingsentto': 'South London and Maudsley NHS Foundation; Metropolitan Police service', 'South London and Maudsley NHS Foundation': True, ' Metropolitan Police service': True, 'num_of_pdfs': 2, 'pdflink': 'https://www.judiciary.uk/wp-content/uploads/2020/03/Kerry-Aldridge-2020-0055-Redacted.pdf', 'responses': 1, 'responseurl1': 'https://www.judiciary.uk/wp-content/uploads/2020/03/2020-0055-Response-from-South-London-and-Maudsley-NHS-F
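
The numbered response fields can be checked in isolation. This sketch uses the response title and reference number that appear in this notebook's stored data; `response_fields` is a hypothetical helper, and the URL is a placeholder rather than a real link.

```python
def response_fields(responses, ref):
    """Build numbered fields from a list of (url, title) pairs."""
    fields = {'responses': len(responses)}
    # Number from 1 so the field names match those used in the scraper
    for i, (url, title) in enumerate(responses, start=1):
        fields["responseurl" + str(i)] = url
        fields["responsetitle" + str(i)] = title
        # Strip the boilerplate, the reference number and the 'Redacted' suffix
        org = title.replace(" Response from ", "").replace(ref, "").replace(" Redacted", "")
        fields["responseorg" + str(i)] = org.strip()
    return fields

responses = [("https://example.org/response.pdf",
              "2020-0055 Response from South London and Maudsley NHS Foundation Redacted")]
fields = response_fields(responses, "2020-0055")
print(fields["responseorg1"])
```

Passing `ref` in as an argument, rather than reading it from the record, makes it explicit which reference number is being removed from the title.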

## Scraping links from the results pages to run our scraper on

Now that function is finished, we can return to the code that we wrote to scrape a category results page, and simplify it so it just grabs the links to the detail pages:

In [None]:
#create a function to scrape each results page
def scraperesultspage(url):
    #scrape webpage into a variable called 'html'
    html = scraperwiki.scrape(url)
    #convert from a string into an lxml object that can be parsed using lxml functions
    root = lxml.html.fromstring(html)
    #use the lxml function cssselect to grab the link inside each <h5> heading inside an <article> tag, which is where the links to the reports are
    detaillinks = root.cssselect('article h5 a')
    #That will be a list of lxml objects, so we need to create an empty list
    # to store the href attributes we extract from those
    detailurls = []
    #Now loop through the list of lxml objects
    for i in detaillinks:
      #Extract the href and append to that list
      detailurls.append(i.attrib['href'])
    #Return to whatever called this function
    return(detailurls)

#test it
testlist = scraperesultspage("https://www.judiciary.uk/publication-jurisdiction/coroner/")
print(testlist)

['https://www.judiciary.uk/publications/linda-phillipson/', 'https://www.judiciary.uk/publications/peter-howarth/', 'https://www.judiciary.uk/publications/laura-parsons/', 'https://www.judiciary.uk/publications/ellie-isaacs/', 'https://www.judiciary.uk/publications/zoe-knight/', 'https://www.judiciary.uk/publications/carlington-spencer/', 'https://www.judiciary.uk/publications/daniel-coleman/', 'https://www.judiciary.uk/publications/dereck-john-chapman/', 'https://www.judiciary.uk/publications/toby-nieland/', 'https://www.judiciary.uk/publications/viktor-scott-brown/']


### Generate a list of URLs to loop through

This will work for the first page, but we need to run this on multiple pages.

When we go to page 2 the URL changes to this: 

https://www.judiciary.uk/publication-jurisdiction/coroner/page/2/

So we can start from:

https://www.judiciary.uk/publication-jurisdiction/coroner/page/1

And then add 1 to that number until we get to 322, the last page shown on the navigation.

In [None]:
#Store the URL without that number - the bit that never changes
baseurl = "https://www.judiciary.uk/publication-jurisdiction/coroner/page/"
#Generate a range of numbers from 1 to 322 (range() stops before its second argument, so we pass 323)
pagenums = range(1,323)
#Create an empty list to store our urls in
pageurls = []
#Loop through the page numbers and add them to the base url
for i in pagenums:
  #The str() function is used to convert the number to a string 
  #so it can be combined with the other string
  fullurl = baseurl+str(i)
  #Append to the previously empty list
  pageurls.append(fullurl)

#Check the length of the list
print(len(pageurls))
#Check the first and last urls
print(pageurls[0])
print(pageurls[-1])

322
https://www.judiciary.uk/publication-jurisdiction/coroner/page/1
https://www.judiciary.uk/publication-jurisdiction/coroner/page/322
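
Hard-coding the last page number means the range has to be updated as new reports are published. One alternative is to keep requesting pages until one comes back with no detail links. This is only a sketch: it assumes a page past the end yields an empty list (the real site might instead return an error, which would need catching), and `get_links` stands in for whatever function fetches the links from one results page.

```python
def collect_detail_urls(baseurl, get_links, max_pages=1000):
    detailurls = []
    # max_pages is a safety limit so the loop cannot run forever
    for pagenum in range(1, max_pages + 1):
        links = get_links(baseurl + str(pagenum))
        if not links:
            # An empty page is treated as the end of the results
            break
        detailurls.extend(links)
    return detailurls

# Stand-in for a real scraper so the sketch runs offline
fake_site = {
    "https://example.org/page/1": ["a", "b"],
    "https://example.org/page/2": ["c"],
}

def fake_get_links(url):
    return fake_site.get(url, [])

all_urls = collect_detail_urls("https://example.org/page/", fake_get_links)
print(all_urls)
```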


## Loop through the URLs and apply the detail scraper function

Now we test the two functions in two nested loops: one that loops through the list of results URLs we generated, and fetches the 10 detail URLs from each; then another loop which goes through those 10 detail URLs and scrapes the data.

We limit the loops on the first run to just grab the first two items of each (2 results URLs, then 2 detail pages from each):

In [None]:
#Loop through those results urls
for pageurl in pageurls[:2]:
  #On each one, scrape the links to the detail pages
  detailurls = scraperesultspage(pageurl)
  #Loop through the list of links
  for detailurl in detailurls[:2]:
    testresults = scrapedetail(detailurl)
    #Print
    print(testresults)

{'url': 'https://www.judiciary.uk/publications/linda-phillipson/', 'title': 'Linda Phillipson', 'published': '10 November 2020', 'timestamp': '2020-11-10T16:08:08+00:00', 'Prevention of Future Deaths': True, 'Mental Health related deaths': True, 'Railway related deaths': True, 'Suicide (from 2015)': True, 'dateofreport': '8 September 2020', 'ref': '2020-0172', 'deceasedname': 'Linda Phillipson', 'coronersname': 'Andrew Harris', 'coronersarea': 'London Inner South', 'category': 'Hospital death (Clinical procedures and medical management) related deaths', 'thisreportisbeingsentto': 'Western Sussex Hospital Trust', 'South London and Maudsley NHS Foundation': True, ' Metropolitan Police service': True, 'num_of_pdfs': 1, 'pdflink': 'https://www.judiciary.uk/wp-content/uploads/2020/11/Linda-Phillipson-2020-0172_Redacted.pdf', 'responses': 1, 'responseurl1': 'https://www.judiciary.uk/wp-content/uploads/2020/03/2020-0055-Response-from-South-London-and-Maudsley-NHS-Foundation-Redacted.pdf', 're
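
Notice that the record for Linda Phillipson above includes `'Railway related deaths': True` and `'coronersname': 'Andrew Harris'`, values first seen in the Kerry Aldridge test. That is consistent with `record` being created once, outside the function, so that every call writes into the same dictionary and keys from earlier pages linger. A minimal sketch of the effect, and of one way to avoid it by creating the dictionary inside the function (the function names and data here are illustrative):

```python
# One dictionary created outside the function and reused by every call
shared_record = {}

def scrape_shared(fields):
    # Writes into the single shared dictionary
    shared_record.update(fields)
    return shared_record

def scrape_fresh(fields):
    # Creates a new, empty dictionary on every call
    record = {}
    record.update(fields)
    return record

first = {'title': 'Kerry Aldridge', 'Railway related deaths': True}
second = {'title': 'Linda Phillipson'}

scrape_shared(first)
leaky = scrape_shared(second)

scrape_fresh(first)
clean = scrape_fresh(second)

print(leaky)
print(clean)
```

Applied to `scrapedetail`, this would mean moving `record = {}` to the first line inside the function.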

## Storing the results in a datastore

Now let's see how easily we can deal with the resulting dictionaries. These will be of varying lengths so we try using the `sql.save()` function from the scraperwiki library first.

In [None]:
#Loop through those results urls
for pageurl in pageurls[:2]:
  #On each one, scrape the links to the detail pages
  detailurls = scraperesultspage(pageurl)
  #Loop through the list of links
  for detailurl in detailurls[:2]:
    testresults = scrapedetail(detailurl)
    #Print
    print(testresults)
    #Save to the 'testrun' table, using the url as the unique key
    scraperwiki.sql.save(['url'], testresults, table_name = "testrun")

{'url': 'https://www.judiciary.uk/publications/linda-phillipson/', 'title': 'Linda Phillipson', 'published': '10 November 2020', 'timestamp': '2020-11-10T16:08:08+00:00', 'Prevention of Future Deaths': True, 'Mental Health related deaths': True, 'Railway related deaths': True, 'Suicide (from 2015)': True, 'dateofreport': '8 September 2020', 'ref': '2020-0172', 'deceasedname': 'Linda Phillipson', 'coronersname': 'Andrew Harris', 'coronersarea': 'London Inner South', 'category': 'Hospital death (Clinical procedures and medical management) related deaths', 'thisreportisbeingsentto': 'Western Sussex Hospital Trust', 'South London and Maudsley NHS Foundation': True, ' Metropolitan Police service': True, 'num_of_pdfs': 1, 'pdflink': 'https://www.judiciary.uk/wp-content/uploads/2020/11/Linda-Phillipson-2020-0172_Redacted.pdf', 'responses': 1, 'responseurl1': 'https://www.judiciary.uk/wp-content/uploads/2020/03/2020-0055-Response-from-South-London-and-Maudsley-NHS-Foundation-Redacted.pdf', 're

In [None]:
print(scraperwiki.sql.select("* from testrun"))

[{'url': 'https://www.judiciary.uk/publications/linda-phillipson/', 'title': 'Linda Phillipson', 'published': '10 November 2020', 'timestamp': '2020-11-10T16:08:08+00:00', 'Prevention of Future Deaths': 1, 'Mental Health related deaths': 1, 'Railway related deaths': 1, 'Suicide (from 2015)': 1, 'dateofreport': '8 September 2020', 'ref': '2020-0172', 'deceasedname': 'Linda Phillipson', 'coronersname': 'Andrew Harris', 'coronersarea': 'London Inner South', 'category': 'Hospital death (Clinical procedures and medical management) related deaths', 'thisreportisbeingsentto': 'Western Sussex Hospital Trust', 'South London and Maudsley NHS Foundation': 1, ' Metropolitan Police service': 1, 'num_of_pdfs': 1, 'pdflink': 'https://www.judiciary.uk/wp-content/uploads/2020/11/Linda-Phillipson-2020-0172_Redacted.pdf', 'responses': 1, 'responseurl1': 'https://www.judiciary.uk/wp-content/uploads/2020/03/2020-0055-Response-from-South-London-and-Maudsley-NHS-Foundation-Redacted.pdf', 'responsetitle1': '2

Here's that data organised in a way that's easier to read:

```
[
  {'url': 'https://www.judiciary.uk/publications/linda-phillipson/', 'title': 'Linda Phillipson', 'published': '10 November 2020', 'timestamp': '2020-11-10T16:08:08+00:00', 'Prevention of Future Deaths': 1, 'Mental Health related deaths': 1, 'Railway related deaths': 1, 'Suicide (from 2015)': 1, 'dateofreport': '8 September 2020', 'ref': '2020-0172', 'deceasedname': 'Linda Phillipson', 'coronersname': 'Andrew Harris', 'coronersarea': 'London Inner South', 'category': 'Hospital death (Clinical procedures and medical management) related deaths', 'thisreportisbeingsentto': 'Western Sussex Hospital Trust', 'South London and Maudsley NHS Foundation': 1, ' Metropolitan Police service': 1, 'num_of_pdfs': 1, 'pdflink': 'https://www.judiciary.uk/wp-content/uploads/2020/11/Linda-Phillipson-2020-0172_Redacted.pdf', 'responses': 1, 'responseurl1': 'https://www.judiciary.uk/wp-content/uploads/2020/03/2020-0055-Response-from-South-London-and-Maudsley-NHS-Foundation-Redacted.pdf', 'responsetitle1': '2020-0055 Response from South London and Maudsley NHS Foundation Redacted', 'responseorg1': 'South London and Maudsley NHS Foundation', 'Hospital Death (Clinical Procedures and medical management) related deaths': 1, 'coronername': 'Veronica Hamilton-Deeley', 'coronerarea': 'Brighton and Hove', 'Western Sussex Hospital Trust': 1, 'Care Home Health related deaths': 1, 'Borough Care': 1, 'Royal Free Hospital': 1, 'Alcohol, drug and medication related deaths': 1, 'Birmingham and Solihull Mental Health Foundation Trust': 1, ' Department for Health and Social Care': 1}, 
  {'url': 'https://www.judiciary.uk/publications/peter-howarth/', 'title': 'Peter Howarth', 'published': '10 November 2020', 'timestamp': '2020-11-10T15:54:47+00:00', 'Prevention of Future Deaths': 1, 'Mental Health related deaths': 1, 'Railway related deaths': 1, 'Suicide (from 2015)': 1, 'dateofreport': '8 September 2020', 'ref': '2020-0171', 'deceasedname': 'Peter Howarth', 'coronersname': 'Andrew Harris', 'coronersarea': 'London Inner South', 'category': 'Care Home health related deaths; Hospital death (Clinical procedures and medical management) related deaths', 'thisreportisbeingsentto': 'Borough Care', 'South London and Maudsley NHS Foundation': 1, ' Metropolitan Police service': 1, 'num_of_pdfs': 1, 'pdflink': 'https://www.judiciary.uk/wp-content/uploads/2020/11/Peter-Howarth-2020-0171_Redacted.pdf', 'responses': 1, 'responseurl1': 'https://www.judiciary.uk/wp-content/uploads/2020/03/2020-0055-Response-from-South-London-and-Maudsley-NHS-Foundation-Redacted.pdf', 'responsetitle1': '2020-0055 Response from South London and Maudsley NHS Foundation Redacted', 'responseorg1': 'South London and Maudsley NHS Foundation', 'Hospital Death (Clinical Procedures and medical management) related deaths': 1, 'coronername': 'Chris Morris', 'coronerarea': 'Greater Manchester South', 'Western Sussex Hospital Trust': 1, 'Care Home Health related deaths': 1, 'Borough Care': 1, 'Royal Free Hospital': 1, 'Alcohol, drug and medication related deaths': 1, 'Birmingham and Solihull Mental Health Foundation Trust': 1, ' Department for Health and Social Care': 1}, 
  {'url': 'https://www.judiciary.uk/publications/malyun-karama/', 'title': 'Malyun Karama', 'published': '26 October 2020', 'timestamp': '2020-10-26T14:56:50+00:00', 'Prevention of Future Deaths': 1, 'Mental Health related deaths': 1, 'Railway related deaths': 1, 'Suicide (from 2015)': 1, 'dateofreport': '21 August 2020', 'ref': '2020-0162', 'deceasedname': 'Malyun Karama', 'coronersname': 'Andrew Harris', 'coronersarea': 'London Inner South', 'category': 'Hospital death (Clinical procedure and medical management) related deaths', 'thisreportisbeingsentto': 'Royal Free Hospital', 'South London and Maudsley NHS Foundation': 1, ' Metropolitan Police service': 1, 'num_of_pdfs': 1, 'pdflink': 'https://www.judiciary.uk/wp-content/uploads/2020/10/Malyun-Karama-2020-0162_Redacted.pdf', 'responses': 1, 'responseurl1': 'https://www.judiciary.uk/wp-content/uploads/2020/03/2020-0055-Response-from-South-London-and-Maudsley-NHS-Foundation-Redacted.pdf', 'responsetitle1': '2020-0055 Response from South London and Maudsley NHS Foundation Redacted', 'responseorg1': 'South London and Maudsley NHS Foundation', 'Hospital Death (Clinical Procedures and medical management) related deaths': 1, 'coronername': 'M E Hassell', 'coronerarea': 'Inner North London', 'Western Sussex Hospital Trust': 1, 'Care Home Health related deaths': 1, 'Borough Care': 1, 'Royal Free Hospital': 1, 'Alcohol, drug and medication related deaths': 1, 'Birmingham and Solihull Mental Health Foundation Trust': 1, ' Department for Health and Social Care': 1}, 
  {'url': 'https://www.judiciary.uk/publications/ian-allen/', 'title': 'Ian Allen', 'published': '26 October 2020', 'timestamp': '2020-10-26T14:32:57+00:00', 'Prevention of Future Deaths': 1, 'Mental Health related deaths': 1, 'Railway related deaths': 1, 'Suicide (from 2015)': 1, 'dateofreport': '17 August 2020', 'ref': '2020-0161', 'deceasedname': 'Ian Allen', 'coronersname': 'Andrew Harris', 'coronersarea': 'London Inner South', 'category': 'Care Home health deaths; Alcohol, drug and medication related deaths', 'thisreportisbeingsentto': 'Birmingham and Solihull Mental Health Foundation Trust; Department for Health and Social Care', 'South London and Maudsley NHS Foundation': 1, ' Metropolitan Police service': 1, 'num_of_pdfs': 1, 'pdflink': 'https://www.judiciary.uk/wp-content/uploads/2020/10/Ian-Allen-2020-0161.pdf', 'responses': 1, 'responseurl1': 'https://www.judiciary.uk/wp-content/uploads/2020/03/2020-0055-Response-from-South-London-and-Maudsley-NHS-Foundation-Redacted.pdf', 'responsetitle1': '2020-0055 Response from South London and Maudsley NHS Foundation Redacted', 'responseorg1': 'South London and Maudsley NHS Foundation', 'Hospital Death (Clinical Procedures and medical management) related deaths': 1, 'coronername': 'Louise Hunt', 'coronerarea': 'Birmingham and Solihull', 'Western Sussex Hospital Trust': 1, 'Care Home Health related deaths': 1, 'Borough Care': 1, 'Royal Free Hospital': 1, 'Alcohol, drug and medication related deaths': 1, 'Birmingham and Solihull Mental Health Foundation Trust': 1, ' Department for Health and Social Care': 1}
  ]
```

This part in the first entry is where we have a problem:

`'Prevention of Future Deaths': 1, 'Mental Health related deaths': 1,`

That doesn't match the [entry on the page](https://www.judiciary.uk/publications/linda-phillipson/) which only sits in one category.

The same problem applies to organisations:

`'thisreportisbeingsentto': 'Western Sussex Hospital Trust', 'South London and Maudsley NHS Foundation': 1, ' Metropolitan Police service': 1,`
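
One plausible explanation, assuming `record` is a single dictionary created outside the function and reused for every page, is that keys set for one report are still present when the next report is scraped. The sketch below reproduces that effect with made-up data; `fake_scrape` and the page contents are hypothetical stand-ins, not the real scraper.

```python
# A single dictionary shared by every call
# (assumption: this mirrors how 'record' is used in the notebook)
record = {}

def fake_scrape(url, categories):
    """Hypothetical stand-in for scrapedetail(): writes into the shared dict."""
    record['url'] = url
    for c in categories:
        record[c] = True
    return record

# dict(...) takes a snapshot of what was returned at that moment
first = dict(fake_scrape('page-1', ['Railway related deaths']))
second = dict(fake_scrape('page-2', ['Child Death']))

# The second page never mentioned railways, but the key is still there
print(second)
```

If this is what is happening, every row saved after the first inherits the categories and recipients of the rows before it.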

## Using pandas

We might try a different approach. [This Stack Overflow entry on *Creating dataframe from a dictionary where entries have different lengths*](https://stackoverflow.com/questions/19736080/creating-dataframe-from-a-dictionary-where-entries-have-different-lengths) outlines a possible solution with `pandas`.



In [None]:
#import the pandas library
import pandas as pd
import numpy as np

Let's try out some of the code at that page and see what we can do with it:

In [None]:
d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )

print(d)

pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))

{'A': array([1, 2]), 'B': array([1, 2, 3, 4])}


     A  B
0  1.0  1
1  2.0  2
2  NaN  3
3  NaN  4


The difference here is that there is only **one** dict, with each key representing a column, and the value attached to that key being an **array** of all the values in that column.

This won't work for us because even if we created a list as soon as a new category/tag was encountered, the lists wouldn't match up properly to ensure that as *rows* the data would all belong to the same page.
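
The misalignment can be seen with a small made-up example. This is only a sketch: the three pages are hypothetical, and `zip_longest` is used here to imitate, roughly, how the `pd.Series` approach lines lists up by position and pads the short ones.

```python
from itertools import zip_longest

# Hypothetical pages: each dict only has the categories found on that page
pages = [
    {'url': 'page-1', 'Suicide': True},
    {'url': 'page-2', 'Child Death': True},
    {'url': 'page-3', 'Suicide': True},
]

# Build one list per key, appending as each key is encountered
columns = {}
for page in pages:
    for key, value in page.items():
        columns.setdefault(key, []).append(value)

# Read across the lists position by position to form rows
header = list(columns)
rows = [dict(zip(header, values)) for values in zip_longest(*columns.values())]
for row in rows:
    print(row)
```

The first row pairs `page-1` with `Child Death`, even though that category was only found on `page-2`, because each list simply fills up from the top.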

## Adjusting the function to start with all categories

Our solution, then, would be to update the function to begin with *all* categories rather than taking them from the data itself (an option we identified earlier).

In the function we had these lines of code that fetched categories from the HTML itself:

```
#Create a field for each category we find and store a True value against it
  for t in spantext:
    record[t] = True
```

We need to replace that with:
1. a set of keys for *all* the possible categories, set to `False` by default
2. some lines that turn *some* of those to `True` if they are found in the HTML.

Let's create that set of category keys. We can do this by repeating some code we used earlier to scrape the category links, with a small tweak to print the text of the links rather than the value of the href attribute:

In [None]:
#store the URL we want to start scraping from
starturl = 'https://www.judiciary.uk/related-offices-and-bodies/office-chief-coroner/https-www-judiciary-uk-subject-community-health-care-and-emergency-services-related-deaths/'
html = scraperwiki.scrape(starturl)
#convert from a string into an lxml object that can be parsed using lxml functions
root = lxml.html.fromstring(html)
#use the lxml function cssselect to grab any link inside a list inside an <article> tag, which is where the category links are
catlinks = root.cssselect('article ul li a')
#measure the length of that list of matches, and print it
print ('found ',len(catlinks), ' reports')
#Loop through the matches and print the text of each link
for link in catlinks:
  print(link.text_content())

found  16  reports
Accident at Work and Health and Safety related deaths
Alcohol, drug and medication related deaths
Care Home Health related deaths
Child Death
Community health care and Emergency Services related deaths
Emergency services related deaths (2019 onwards)
Hospital Death (Clinical Procedures and medical management) related deaths
Mental Health related deaths
Other related deaths
Police related deaths
Product related deaths
Road (Highways Safety) related deaths
Railway related deaths
State Custody related deaths
Suicide
Wales prevention of future deaths reports (2019 onwards)


A couple of these include "(2019 onwards)" so we need to check if the categories are referred to in the same way in the relevant reports.

Looking at the reports on https://www.judiciary.uk/subject/emergency-services-related-deaths-prevention-of-future-deaths/ it does seem that the reports use the same wording.

### Seed the dictionary with categories set to False

Now we can use that list to seed the dictionary.

In [None]:
record = {}
for link in catlinks:
  record[link.text_content()] = False
print(record)

{'Accident at Work and Health and Safety related deaths': False, 'Alcohol, drug and medication related deaths': False, 'Care Home Health related deaths': False, 'Child Death': False, 'Community health care and Emergency Services related deaths': False, 'Emergency services related deaths (2019 onwards)': False, 'Hospital Death (Clinical Procedures and medical management) related deaths': False, 'Mental Health related deaths': False, 'Other related deaths': False, 'Police related deaths': False, 'Product related deaths': False, 'Road (Highways Safety) related deaths': False, 'Railway related deaths': False, 'State Custody related deaths': False, 'Suicide': False, 'Wales prevention of future deaths reports (2019 onwards)': False}


With those defaults set, the function now just *changes* them where a match is found:

```
  #Create a field for each category we find and store a True value against it
  for t in spantext:
    record[t] = True
```

We also remove the similar code for recipients: 

```
for r in recipients: 
  record[r] = True
``` 

and just add each new recipient to a list (that might be used later) instead, by first creating that list with `recipientlist = []` then in the function changing the loop to:

```
for r in recipients: 
  recipientlist.append(r)
``` 

Later we can use `set(recipientlist)` to convert the list to a set of unique entries - but the list will allow us to count frequency.
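
As a rough sketch of that later step, `collections.Counter` can do the frequency count. The sample list below is made up (the names are borrowed from the output earlier), and `sample_recipients` is a hypothetical name used so the real `recipientlist` is not overwritten.

```python
from collections import Counter

# Hypothetical sample; in the notebook the real list is filled by scrapedetail()
sample_recipients = ['Borough Care', ' Department for Health and Social Care',
                     'Royal Free Hospital', 'Borough Care ']

# split(";") can leave stray spaces, so strip each name before counting
cleaned = [r.strip() for r in sample_recipients]
counts = Counter(cleaned)

# Most frequent recipients first
for name, n in counts.most_common():
    print(n, name)

# Unique names only
unique_recipients = set(cleaned)
```

Stripping the whitespace first matters because `' Borough Care'` and `'Borough Care'` would otherwise be counted as two different organisations.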

In [None]:
#Create an empty list for recipients
recipientlist = []

#define the function
def scrapedetail(detailurl):
  #store the url
  record['url'] = detailurl
  #scrape webpage into a variable called 'html'
  html = scraperwiki.scrape(detailurl)
  #convert from a string into an lxml object that can be parsed using lxml functions
  root = lxml.html.fromstring(html)
  #use the lxml function cssselect to grab anything inside an <article> tag, which is where the reports are
  articles = root.cssselect('article')
  #there's only one, so store that
  article = articles[0]
  #grab the <h1> tags with links
  h1s = article.cssselect('h1')
  #There should only be one, so we grab the text of that and store in the dictionary variable 'record'
  record['title'] = h1s[0].text_content()
  #grab the <time> tags
  times = article.cssselect('time')
  #there's only 1, grab the text content of that
  record['published'] = times[0].text_content()
  #grab the datetime= attribute, this is more specifically encoded
  record['timestamp'] = times[0].attrib['datetime']
  #grab all the links inside <span> tags
  spans = article.cssselect('span a')
  #Create an empty list to store the text
  spantext = []
  #loop through the spans and store the text in that list
  for span in spans:
    spantext.append(span.text_content())
  #Remove the two values we don't want
  spantext.remove('PFD Report')
  spantext.remove('Coroner')
  #Create a field for each category we find and change the default False value to a True value
  for t in spantext:
    record[t] = True
  #grab the <p> tags
  ps = article.cssselect('div.entry-content p')
  #Create an empty list to store the text content
  ptexts = []
  #Loop through the p tags and append the text to that list
  for p in ps:
    ptexts.append(p.text_content())
  #Loop through that list of paragraph text
  for t in ptexts:
    #Split into the label (e.g. Ref:) and the value (e.g. Andrew Harris) and store in a list
    tsplit = t.split(": ")
    #this line converts the text label to a key by removing spaces and making it lowercase
    record[tsplit[0].replace(" ","").lower()] = tsplit[1]
    #For one in particular we want to collect the values
    # (the recipients) in a list rather than creating fields
    if "This report is being sent to: " in t:
      #Remove the label part and store the list of recipients
      recipients = t.replace("This report is being sent to: ","").split(";")
      #Add to the ongoing list of recipients
      for r in recipients:
        recipientlist.append(r)
  #the link to the PDF report is in <div class="download-box"> - grab all...
  pdfs = article.cssselect('li.pdf div a')
  #How many did we get
  #print(len(pdfs))
  #Store that
  record['num_of_pdfs'] = len(pdfs)
  #Some pages have no link to a report, so we need an if test before we try to extract that
  if len(pdfs)>0:
    pdflink = pdfs[0].attrib['href']
    #print(pdflink)
  else:
    pdflink = "NO REPORT LINK"
  record['pdflink'] = pdflink
  #Most pages have a response document, so here's another test
  if len(pdfs)>1:
    #Record how many responses there are
    record['responses'] = len(pdfs)-1
    #Loop through them - if there's more than one then we create different fields
    for i in range(1,len(pdfs)):
      #Create some field names by combining the index with a string
      responseurlnum = "responseurl"+str(i)
      responsetitlenum = "responsetitle"+str(i)
      responseorgnum = "responseorg"+str(i)
      #Store attributes at that index under those field names
      record[responseurlnum] = pdfs[i].attrib['href']
      record[responsetitlenum] = pdfs[i].attrib['title']
      record[responseorgnum] = pdfs[i].attrib['title'].replace(" Response from ","").replace(record['ref'],"").replace(" Redacted","")
  #Return the data stored in 'record'
  return(record)
  


In [None]:

#Test the function, storing results in 'testresults'
testresults = scrapedetail("https://www.judiciary.uk/publications/kerry-aldridge/")
#Print
print(testresults)

{'Accident at Work and Health and Safety related deaths': False, 'Alcohol, drug and medication related deaths': False, 'Care Home Health related deaths': False, 'Child Death': False, 'Community health care and Emergency Services related deaths': False, 'Emergency services related deaths (2019 onwards)': False, 'Hospital Death (Clinical Procedures and medical management) related deaths': False, 'Mental Health related deaths': True, 'Other related deaths': False, 'Police related deaths': False, 'Product related deaths': False, 'Road (Highways Safety) related deaths': False, 'Railway related deaths': True, 'State Custody related deaths': False, 'Suicide': False, 'Wales prevention of future deaths reports (2019 onwards)': False, 'url': 'https://www.judiciary.uk/publications/kerry-aldridge/', 'title': 'Kerry Aldridge', 'published': '18 March 2020', 'timestamp': '2020-03-18T12:51:46+00:00', 'Prevention of Future Deaths': True, 'Suicide (from 2015)': True, 'dateofreport': '10 February 2020', 

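Checking this we can see that one category - Suicide - isn't stored as `True`. This is because the page itself says "Suicide (from 2015)", so instead of updating the seeded `Suicide` key the function creates a new `Suicide (from 2015)` field.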

We can fix these mismatches manually - but a simpler option is to match on the URLs instead.

In [None]:
#store the URL we want to start scraping from
starturl = 'https://www.judiciary.uk/related-offices-and-bodies/office-chief-coroner/https-www-judiciary-uk-subject-community-health-care-and-emergency-services-related-deaths/'
html = scraperwiki.scrape(starturl)
#convert from a string into an lxml object that can be parsed using lxml functions
root = lxml.html.fromstring(html)
#use the lxml function cssselect to grab any link inside a list inside an <article> tag, which is where the category links are
catlinks = root.cssselect('article ul li a')
#measure the length of that list of matches, and print it
print ('found ',len(catlinks), ' reports')
#Create an empty list to store the category URLs
categoryurls = []
#Loop through the matches and store the href="" attribute of each
for link in catlinks:
  caturl = link.attrib['href']
  #print(caturl)
  categoryurls.append(caturl)

print(categoryurls)

found  16  reports
['https://www.judiciary.uk/subject/accident-at-work-and-health-and-safety-related-deaths/', 'https://www.judiciary.uk/subject/alcohol-drug-and-medication-related-deaths/', 'https://www.judiciary.uk/subject/care-home-health-related-deaths/', 'https://www.judiciary.uk/subject/child-death', 'https://www.judiciary.uk/subject/community-health-care-and-emergency-services-related-deaths/', 'https://www.judiciary.uk/subject/emergency-services-related-deaths-prevention-of-future-deaths/', 'https://www.judiciary.uk/subject/hospital-death-clinical-procedures-and-medical-management-related-deaths/', 'https://www.judiciary.uk/subject/mental-health-related-deaths/', 'https://www.judiciary.uk/subject/other-related-deaths/', 'https://www.judiciary.uk/subject/police-related-deaths/', 'https://www.judiciary.uk/subject/product-related-deaths/', 'https://www.judiciary.uk/subject/road-highways-safety-related-deaths/', 'https://www.judiciary.uk/subject/railway-related-deaths/', 'https://w

This is the code we now need to change in the function:

```
#Create an empty list to store the text
  spantext = []
  #loop through the spans and store the text in that list
  for span in spans:
    spantext.append(span.text_content())
  #Remove the two values we don't want
  spantext.remove('PFD Report')
  spantext.remove('Coroner')
  #Create a field for each category we find and change the default False value to a True value
  for t in spantext:
    record[t] = True
```

Instead of storing text we are going to store URLs:

```
#Create an empty list to store the links
  spanhrefs = []
  #loop through the spans and store the links in that list
  for span in spans:
    spanhrefs.append(span.attrib['href'])
  #Remove the two values we don't want
  spanhrefs.remove('https://www.judiciary.uk/publication-type/pfd-report/')
  spanhrefs.remove('https://www.judiciary.uk/publication-jurisdiction/coroner/')
  #Create a field for each category we find and change the default False value to a True value
  for t in spanhrefs:
    record[t] = True
```


We also need to change that default dictionary. Along the way, those URLs could be cleaned up so we don't have the redundant base part.

In [None]:
record = {}
for link in categoryurls:
  #This replaces the base URL and stops short of the / at the end of the URL
  record[link.replace("https://www.judiciary.uk/subject/","")[:-1]] = False
print(record)

{'accident-at-work-and-health-and-safety-related-deaths': False, 'alcohol-drug-and-medication-related-deaths': False, 'care-home-health-related-deaths': False, 'child-deat': False, 'community-health-care-and-emergency-services-related-deaths': False, 'emergency-services-related-deaths-prevention-of-future-deaths': False, 'hospital-death-clinical-procedures-and-medical-management-related-deaths': False, 'mental-health-related-deaths': False, 'other-related-deaths': False, 'police-related-deaths': False, 'product-related-deaths': False, 'road-highways-safety-related-deaths': False, 'railway-related-deaths': False, 'state-custody-related-deaths': False, 'suicide': False, 'wales-pfd-reports-related-deaths': False}
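
Note that one key in the output above is `'child-deat'`: the `child-death` URL in the list has no trailing slash, so `[:-1]` removes the final letter instead. A possible alternative, sketched here with a hypothetical helper called `category_key`, is `rstrip('/')`, which only removes a slash when one is present.

```python
BASE = 'https://www.judiciary.uk/subject/'

def category_key(url):
    """Hypothetical helper: turn a category URL into a short key."""
    # Drop the base part, then remove a trailing slash only if there is one
    return url.replace(BASE, '').rstrip('/')

print(category_key('https://www.judiciary.uk/subject/child-death'))
print(category_key('https://www.judiciary.uk/subject/suicide/'))
```

The same cleaning would need to be applied both when seeding the dictionary and inside the function, so that the keys keep matching.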


And we need to update our code to replicate that cleaning too, so it still matches up:

```
#Create an empty list to store the links
  spanhrefs = []
  #loop through the spans and store the links in that list
  for span in spans:
    spanhrefs.append(span.attrib['href'])
  #Remove the two values we don't want
  spanhrefs.remove('https://www.judiciary.uk/publication-type/pfd-report/')
  spanhrefs.remove('https://www.judiciary.uk/publication-jurisdiction/coroner/')
  #Create a field for each category we find and change the default False value to a True value
  for t in spanhrefs:
    record[t.replace("https://www.judiciary.uk/subject/","")[:-1]] = True
```

In [None]:
#Create that seeded dictionary
record = {}
for link in categoryurls:
  #This replaces the base URL and stops short of the / at the end of the URL
  record[link.replace("https://www.judiciary.uk/subject/","")[:-1]] = False
print(record)

#Create an empty list for recipients
recipientlist = []

#define the function
def scrapedetail(detailurl):
  #store the url
  record['url'] = detailurl
  #scrape webpage into a variable called 'html'
  html = scraperwiki.scrape(detailurl)
  #convert from a string into an lxml object that can be parsed using lxml functions
  root = lxml.html.fromstring(html)
  #use the lxml function cssselect to grab anything inside an <article> tag, which is where the reports are
  articles = root.cssselect('article')
  #there's only one, so store that
  #around 2135 reports in, there's a page that doesn't have one, so we add an if
  if len(articles) > 0:
    article = articles[0]
    #grab the <h1> tags with links
    h1s = article.cssselect('h1')
    #There should only be one, so we grab the text of that and store in the dictionary variable 'record'
    record['title'] = h1s[0].text_content()
    #grab the <time> tags
    times = article.cssselect('time')
    #there's only 1, grab the text content of that
    record['published'] = times[0].text_content()
    #grab the datetime= attribute, this is more specifically encoded
    record['timestamp'] = times[0].attrib['datetime']
    #grab all the links inside <span> tags
    spans = article.cssselect('span a')
    #Create an empty list to store the links
    spanhrefs = []
    #loop through the spans and store the links in that list
    for span in spans:
      spanhrefs.append(span.attrib['href'])
    #Remove the two values we don't want
    #if these aren't there, remove() raises "ValueError: list.remove(x): x not in list"
    #So we check with an if first:
    if 'https://www.judiciary.uk/publication-type/pfd-report/' in spanhrefs:
      spanhrefs.remove('https://www.judiciary.uk/publication-type/pfd-report/')
    if 'https://www.judiciary.uk/publication-jurisdiction/coroner/' in spanhrefs:
      spanhrefs.remove('https://www.judiciary.uk/publication-jurisdiction/coroner/')
    #Create a field for each category we find and change the default False value to a True value
    for t in spanhrefs:
      #Strip out the base URL and stop before the / so it matches the keys
      record[t.replace("https://www.judiciary.uk/subject/","")[:-1]] = True
    #grab the <p> tags
    ps = article.cssselect('div.entry-content p')
    #Create an empty list to store the text content
    ptexts = []
    #Loop through the p tags and append the text to that list
    for p in ps:
      ptexts.append(p.text_content())
    #Loop through that list of paragraph text
    for t in ptexts:
      #Split into the label (e.g. Ref:) and the value (e.g. Andrew Harris) and store in a list
      tsplit = t.split(": ")
      #We need an if test here because some pages don't have this
      if len(tsplit) > 1:
        #this line converts the text label to a key by removing spaces and making it lowercase
        record[tsplit[0].replace(" ","").lower()] = tsplit[1]
      #And an else to specify there is no ref, otherwise the line later that uses this will break
      else:
        record['ref'] = "noref"
      #For one in particular we want to collect the values
      # (the recipients) in a list rather than creating fields
      if "This report is being sent to: " in t:
        #Remove the label part and store the list of recipients
        recipients = t.replace("This report is being sent to: ","").split(";")
        #Add to the ongoing list of recipients
        for r in recipients:
          recipientlist.append(r)
    #the link to the PDF report is in <div class="download-box"> - grab all...
    pdfs = article.cssselect('li.pdf div a')
    #How many did we get
    #print(len(pdfs))
    #Store that
    record['num_of_pdfs'] = len(pdfs)
    #Some pages have no link to a report, so we need an if test before we try to extract that
    if len(pdfs)>0:
      pdflink = pdfs[0].attrib['href']
      #print(pdflink)
    else:
      pdflink = "NO REPORT LINK"
    record['pdflink'] = pdflink
    #Most pages have a response document, so here's another test
    if len(pdfs)>1:
      #Record how many responses there are
      record['responses'] = len(pdfs)-1
      #Loop through them - if there's more than one then we create different fields
      for i in range(1,len(pdfs)):
        #Create some field names by combining the index with a string
        responseurlnum = "responseurl"+str(i)
        responsetitlenum = "responsetitle"+str(i)
        responseorgnum = "responseorg"+str(i)
        #Store attributes at that index under those field names
        record[responseurlnum] = pdfs[i].attrib['href']
        record[responsetitlenum] = pdfs[i].attrib['title']
        record[responseorgnum] = pdfs[i].attrib['title'].replace(" Response from ","").replace(record['ref'],"").replace(" Redacted","")
  #Return the data stored in 'record'
  return(record)
    


{'accident-at-work-and-health-and-safety-related-deaths': False, 'alcohol-drug-and-medication-related-deaths': False, 'care-home-health-related-deaths': False, 'child-deat': False, 'community-health-care-and-emergency-services-related-deaths': False, 'emergency-services-related-deaths-prevention-of-future-deaths': False, 'hospital-death-clinical-procedures-and-medical-management-related-deaths': False, 'mental-health-related-deaths': False, 'other-related-deaths': False, 'police-related-deaths': False, 'product-related-deaths': False, 'road-highways-safety-related-deaths': False, 'railway-related-deaths': False, 'state-custody-related-deaths': False, 'suicide': False, 'wales-pfd-reports-related-deaths': False}


Then test again:

In [None]:

#Test the function, storing results in 'testresults'
testresults = scrapedetail("https://www.judiciary.uk/publications/kerry-aldridge/")
#Print
print(testresults)

{'accident-at-work-and-health-and-safety-related-deaths': False, 'alcohol-drug-and-medication-related-deaths': False, 'care-home-health-related-deaths': False, 'child-deat': False, 'community-health-care-and-emergency-services-related-deaths': False, 'emergency-services-related-deaths-prevention-of-future-deaths': False, 'hospital-death-clinical-procedures-and-medical-management-related-deaths': False, 'mental-health-related-deaths': True, 'other-related-deaths': False, 'police-related-deaths': False, 'product-related-deaths': False, 'road-highways-safety-related-deaths': False, 'railway-related-deaths': True, 'state-custody-related-deaths': False, 'suicide': True, 'wales-pfd-reports-related-deaths': False, 'url': 'https://www.judiciary.uk/publications/kerry-aldridge/', 'title': 'Kerry Aldridge', 'published': '18 March 2020', 'timestamp': '2020-03-18T12:51:46+00:00', 'training-and-support': True, 'https://www.judiciary.uk/publication-type/guidance': True, 'https://www.judiciary.uk/publ

This seems to match up to what we see. 

Now to test on a few pages:

In [None]:
#Loop through those results urls
for pageurl in pageurls[:2]:
  #On each one, scrape the links to the detail pages
  detailurls = scraperesultspage(pageurl)
  #Loop through the list of links (a different variable name avoids reusing the outer loop's)
  for detailurl in detailurls[:2]:
    testresults = scrapedetail(detailurl)
    #Print
    print(testresults)
    scraperwiki.sql.save(['url'], testresults, table_name = "testrun2")

{'accident-at-work-and-health-and-safety-related-deaths': False, 'alcohol-drug-and-medication-related-deaths': True, 'care-home-health-related-deaths': True, 'child-deat': False, 'community-health-care-and-emergency-services-related-deaths': False, 'emergency-services-related-deaths-prevention-of-future-deaths': False, 'hospital-death-clinical-procedures-and-medical-management-related-deaths': True, 'mental-health-related-deaths': True, 'other-related-deaths': False, 'police-related-deaths': False, 'product-related-deaths': False, 'road-highways-safety-related-deaths': False, 'railway-related-deaths': True, 'state-custody-related-deaths': False, 'suicide': True, 'wales-pfd-reports-related-deaths': False, 'url': 'https://www.judiciary.uk/publications/linda-phillipson/', 'title': 'Linda Phillipson', 'published': '10 November 2020', 'timestamp': '2020-11-10T16:08:08+00:00', 'prevention-of-future-deaths': True, 'dateofreport': '8 September 2020', 'ref': '2020-0172', 'deceasedname': 'Linda 

In [None]:
print(scraperwiki.sql.select("count(*) from testrun2"))

[{'count(*)': 4}]
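
The first record printed above has several categories set to `True`, which may again be values carried over from other pages, since the function still writes into one shared `record`. A possible variation, which is not what the notebook does, is to treat the seeded dictionary as a template and copy it at the start of each call. The names `defaults` and `fresh_record` are hypothetical, and the two categories are just a sample.

```python
# Assumption: 'defaults' is the seeded dictionary with every category set to False
defaults = {'suicide': False, 'railway-related-deaths': False}

def fresh_record(url):
    """Hypothetical helper: start each page from a clean copy of the defaults."""
    record = dict(defaults)  # shallow copy, so 'defaults' itself is never changed
    record['url'] = url
    return record

a = fresh_record('page-1')
a['suicide'] = True

# A second page starts from False again rather than inheriting page-1's value
b = fresh_record('page-2')
print(b['suicide'])
```

With this pattern each page's dictionary is independent, so a `True` found on one report cannot appear in the row for another.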


## Export test results as CSV

Let's export as a CSV just to see what it looks like outside the environment.

In [None]:
testresults = scraperwiki.sql.select("* from testrun2")
#Create a dataframe and assign the data to it
testresultsdf = pd.DataFrame(testresults)
#Print the first few rows (head)
print(testresultsdf.head())
#Export as a CSV file using the .to_csv function https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
testresultsdf.to_csv('testresultsdf.csv')

   accident-at-work-and-health-and-safety-related-deaths  ...               coronerarea
0                                                  0      ...         Brighton and Hove
1                                                  0      ...  Greater Manchester South
2                                                  0      ...        Inner North London
3                                                  0      ...   Birmingham and Solihull

[4 rows x 36 columns]


## Running on the full 3220 pages

Now that we've tested it on a few pages, we let it loose on all the pages. Chances are that older pages with different HTML may break it.
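
One way to stop a single unexpected page from halting a long run, sketched here under the assumption that it is acceptable to skip a page and revisit it later, is to wrap each call in `try`/`except` and keep a note of the failures. `scrape_safely`, `failed` and `broken_scraper` are hypothetical names for illustration.

```python
# URLs that raised an error, with the error message, for checking later
failed = []

def scrape_safely(scrape, url):
    """Hypothetical wrapper: call scrape(url), recording any failure."""
    try:
        return scrape(url)
    except Exception as err:
        # Keep the URL and the error so the page can be checked later
        failed.append((url, repr(err)))
        return None

# Stand-in for scrapedetail(), for illustration only
def broken_scraper(url):
    raise IndexError('list index out of range')

result = scrape_safely(broken_scraper, 'https://example.com/old-page')
print(result, failed)
```

In the loop, a `None` result would simply be skipped rather than saved, and the `failed` list shows which pages need a closer look.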

In [None]:
print(pageurls[213])

https://www.judiciary.uk/publication-jurisdiction/coroner/page/214


In [None]:
#Loop through those results urls
for pageurl in pageurls[213:]:
  #On each one, scrape the links to the detail pages
  detailurls = scraperesultspage(pageurl)
  #Loop through the list of links
  for detailurl in detailurls:
    #Print the URL first so we can see which page we are on if something breaks
    print(detailurl)
    testresults = scrapedetail(detailurl)
    #Print
    print(testresults)
    scraperwiki.sql.save(['url'], testresults, table_name = "fullrun1")

https://www.judiciary.uk/publications/harry-mellor/
{'accident-at-work-and-health-and-safety-related-deaths': False, 'alcohol-drug-and-medication-related-deaths': False, 'care-home-health-related-deaths': False, 'child-deat': False, 'community-health-care-and-emergency-services-related-deaths': False, 'emergency-services-related-deaths-prevention-of-future-deaths': False, 'hospital-death-clinical-procedures-and-medical-management-related-deaths': True, 'mental-health-related-deaths': False, 'other-related-deaths': False, 'police-related-deaths': False, 'product-related-deaths': False, 'road-highways-safety-related-deaths': False, 'railway-related-deaths': False, 'state-custody-related-deaths': False, 'suicide': False, 'wales-pfd-reports-related-deaths': False, 'url': 'https://www.judiciary.uk/publications/harry-mellor/', 'title': 'Harry Mellor', 'published': '22 October 2015', 'timestamp': '2015-10-22T11:33:37+01:00', 'prevention-of-future-deaths': True, 'child-death': True, 'dateofrep

## Export the results

Now export as a CSV.

In [None]:
fullresults = scraperwiki.sql.select("* from fullrun1")
#Create a dataframe and assign the data to it
fullresultsdf = pd.DataFrame(fullresults)
#Print the first few rows (head)
print(fullresultsdf.head())
#Export as a CSV file using the .to_csv function https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
fullresultsdf.to_csv('fullresultsdf.csv')

   accident-at-work-and-health-and-safety-related-deaths  ...  coroner’sarea
0                                                  0      ...           None
1                                                  0      ...           None
2                                                  0      ...           None
3                                                  0      ...           None
4                                                  0      ...           None

[5 rows x 87 columns]
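
The full run produces 87 columns where the test run produced 36, and the last column shown is `coroner’sarea` rather than `coronerarea`. That suggests (though the output above does not confirm it) that older pages label some fields differently, so the same information ends up in separate, mostly empty columns. A minimal sketch of one way to fold such variants together before saving follows. The `ALIASES` mapping and `merge_aliases` function are hypothetical, the mapping itself is an assumption that would need checking against the real column names, and the sample value is purely illustrative.

```python
# Hypothetical mapping from variant column names to the name we want to keep.
# These pairs are assumptions: check them against the actual column list.
ALIASES = {
    "coroner’sarea": "coronerarea",
    "coroner'sarea": "coronerarea",
}


def merge_aliases(record, aliases=ALIASES):
    """Return a copy of record with variant keys folded into one key."""
    merged = {}
    for key, value in record.items():
        target = aliases.get(key, key)
        # Keep the first non-empty value if both variants are present
        if merged.get(target) in (None, ""):
            merged[target] = value
    return merged


# Illustrative record only, not real scraped output
example = {"url": "https://example.org/report", "coroner’sarea": "Inner North London"}
print(merge_aliases(example))
```

Applied to each dictionary returned by `scrapedetail` before it is saved, this would keep the table narrower; it could equally be applied afterwards to the rows returned by `scraperwiki.sql.select`.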


In [None]:
scraperwiki.sql.select("count(*) from fullrun1")

[{'count(*)': 2136}]