# Scraping IOPC reference codes

This scraper builds on a previous notebook which scrapes IOPC reports. In that scraper we didn't grab the IOPC reference codes, so in this notebook we import the resulting CSV and go back through the URLs to grab those.

First, import the libraries we will need.

In [None]:
#install the libraries
#scraperwiki is a library for scraping webpages
!pip install scraperwiki
import scraperwiki
#We can also use requests instead
import requests
#lxml.html parses the HTML into a structured tree that can be searched
import lxml.html
#cssselect is used to drill down into that and find data in tags
!pip install cssselect
import cssselect
#the pandas library which is used to work with data - we call it 'pd' here so we have to type less!
import pandas as pd


In [None]:
#Just in case we get an IOPub Error
#https://stackoverflow.com/questions/48906481/iopub-error-on-google-colaboratory-in-jupyter-notebook
from pprint import pprint

## Import CSV

Now we import the CSV exported from the previous scraper.

In [None]:
recs = pd.read_csv("recs30june_wforces.csv")
#Show the url column
recs['url']

0       https://policeconduct.gov.uk/recommendations/r...
1       https://policeconduct.gov.uk/recommendations/w...
2       https://policeconduct.gov.uk/recommendations/n...
3       https://policeconduct.gov.uk/recommendations/i...
4       https://policeconduct.gov.uk/recommendations/p...
                              ...                        
1315    https://policeconduct.gov.uk/recommendations/r...
1316    https://policeconduct.gov.uk/recommendations/%...
1317    https://policeconduct.gov.uk/recommendations/c...
1318    https://policeconduct.gov.uk/recommendations/a...
1319    https://policeconduct.gov.uk/recommendations/a...
Name: url, Length: 1320, dtype: object

## Create scraper function for IOPC ref code

The reference code is not easy to grab. It's inside a chunk of HTML like this:

```{html}
  <div class="field-label">IOPC reference</div>
            2019/118999    
  <div class="field-label">Date of recommendation</div>
```

So it's not inside its own HTML tag. In fact it's inside the `<div class="clearfix">` that contains the whole recommendation.
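To illustrate what "not inside its own tag" means, here is a small sketch using the standard library's `html.parser` (not one of the libraries this notebook relies on). The `EventLogger` class and the `snippet` variable are names made up for this illustration; the sketch just lists the tags and text in the order a parser meets them.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record each tag and each non-blank piece of text, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # skip whitespace-only text between tags
        if data.strip():
            self.events.append(("text", data.strip()))

# the same chunk of HTML shown above
snippet = '''  <div class="field-label">IOPC reference</div>
            2019/118999    
  <div class="field-label">Date of recommendation</div>'''

logger = EventLogger()
logger.feed(snippet)
logger.close()

for event in logger.events:
    print(event)
```

The reference code turns up as bare text between the closing tag of one label and the opening tag of the next, so selecting it by its own tag or class is not an option.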

The best way to grab it, then, might be regex. For that we will need the regex library `re`.

In [None]:
import re

## Compiling a regular expression

Next we need to [*compile* the regular expression](https://docs.python.org/3/howto/regex.html).

In [None]:
#compile the regex - a raw string (r"...") stops Python warning about \s
p = re.compile(r"IOPC reference</div>\n\s+[0-9]{4}/[0-9]{6}")
#create some text to test it on
testtext = '''<div class="field-label">IOPC reference</div>
            2019/118999
  <div class="field-label">Date of recommendation</div>'''
#print it
print(testtext)
#test it
print(p.search(testtext))
#store it
matchtext = p.search(testtext)
#extract the match
print(matchtext[0])
#clean the match
print(matchtext[0].split('\n')[1].strip())

<div class="field-label">IOPC reference</div>
            2019/118999    
  <div class="field-label">Date of recommendation</div>
<re.Match object; span=(25, 69), match='IOPC reference</div>\n            2019/118999'>
IOPC reference</div>
            2019/118999
2019/118999
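The split-and-strip clean-up can also be folded into the pattern itself with a capture group. This is a minimal sketch of that alternative rather than the approach used in the rest of this notebook; `p_group` and `sample` are names introduced here for illustration.

```python
import re

# Same pattern as above, but with parentheses around the code itself
# so that group(1) returns only the reference
p_group = re.compile(r"IOPC reference</div>\n\s+([0-9]{4}/[0-9]{6})")

# Illustrative text in the same shape as the page HTML
sample = '''<div class="field-label">IOPC reference</div>
            2019/118999
  <div class="field-label">Date of recommendation</div>'''

match = p_group.search(sample)
# group(0) is the whole match; group(1) is just the captured code
ref = match.group(1)
print(ref)
```

The trade-off is a slightly busier pattern in exchange for not having to split on the newline afterwards.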


## Test on some pages

Now let's test that on one page. To do that we will need to scrape it, too.

In [None]:
#scrape the 3rd URL in that column
html = scraperwiki.scrape(recs['url'][2])
#print - it's a byte object (contained within b'')...
print(html)
#so will need to be decoded first
#see https://stackoverflow.com/questions/606191/convert-bytes-to-a-string
html = html.decode("utf-8")
#...before regex can be used
print(p.search(html)[0].split('\n')[1].strip())

2020/137342
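The decode step matters because a pattern compiled from a `str` cannot search a `bytes` object. A small offline sketch of that, using a made-up bytes value (`raw`) in place of a real scrape and a separately named `pattern` so the notebook's `p` is left alone:

```python
import re

pattern = re.compile(r"IOPC reference</div>\n\s+[0-9]{4}/[0-9]{6}")

# Illustrative bytes, standing in for what the scrape returns
raw = b'<div class="field-label">IOPC reference</div>\n            2020/137342    \n'

try:
    pattern.search(raw)
    bytes_ok = True
except TypeError:
    # a str pattern cannot be used on a bytes object
    bytes_ok = False

# decode to str first, then the search works
text = raw.decode("utf-8")
ref = pattern.search(text)[0].split("\n")[1].strip()
print(bytes_ok, ref)
```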


Some URLs won't have a ref. What happens then?

In [None]:
for i in recs['url'][:10]:
  print(i)
  #scrape the current URL in the loop
  html = scraperwiki.scrape(i)
  #the result is a byte object, so it needs to be decoded first
  #see https://stackoverflow.com/questions/606191/convert-bytes-to-a-string
  html = html.decode("utf-8")
  #...before regex can be used
  print(p.search(html)[0].split('\n')[1].strip())

https://policeconduct.gov.uk/recommendations/recommendations-sussex-police-april-2021
2020/135656
https://policeconduct.gov.uk/recommendations/woman-was-found-dead-after-welfare-concerns-reported-thames-valley-police-april-2020
2020/135205
https://policeconduct.gov.uk/recommendations/national-recommendation-college-policing-june-2021
2020/137342
https://policeconduct.gov.uk/recommendations/inappropriate-communications-member-public-%E2%80%93-cambridgeshire-police-and-crime-panel
2019/127658
https://policeconduct.gov.uk/recommendations/police-management-registered-terrorist-offender-following-his-release-prison-%E2%80%93
2019/128766
https://policeconduct.gov.uk/recommendations/fatal-police-shooting-terrorist-attacker-%E2%80%93-city-london-police-and-metropolitan
2019/128689
https://policeconduct.gov.uk/recommendations/death-following-police-pursuit-%E2%80%93-hampshire-constabulary-august-2020
2020/140719
https://policeconduct.gov.uk/recommendations/death-following-police-investigation-s

That loop doesn't tell us what happens when a page has no ref, so let's test the regex against some text where the code is missing.

In [None]:
#compile the regex - a raw string (r"...") stops Python warning about \s
p = re.compile(r"IOPC reference</div>\n\s+[0-9]{4}/[0-9]{6}")
#create some text to test it on - this time with no ref code
testtext = '''<div class="field-label">IOPC reference</div>
  <div class="field-label">Date of recommendation</div>'''
#print it
pprint(testtext)
#test it
print(p.search(testtext))
#store it
matchtext = p.search(testtext)
#check for a match - matchtext is a Match object (truthy) or None (falsy)
if matchtext:
  print("match")
else:
  print("No matches")
#extract the match - this raises a TypeError because matchtext is None
print(matchtext[0])
#clean the match
print(matchtext[0].split('\n')[1].strip())

('<div class="field-label">IOPC reference</div>\n'
 '  <div class="field-label">Date of recommendation</div>')
None
No matches


TypeError: ignored
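The `TypeError` comes from indexing `None`, which is what `search()` returns when nothing matches. A minimal offline sketch of guarding against that follows; `extract_ref`, its `default` parameter and the two sample strings are illustrative names, not part of the notebook.

```python
import re

pattern = re.compile(r"IOPC reference</div>\n\s+[0-9]{4}/[0-9]{6}")

def extract_ref(text, default="no ref found"):
    """Return the reference code found in text, or default if there is none."""
    match = pattern.search(text)
    if match is None:
        # nothing to index, so hand back the fallback value instead
        return default
    return match[0].split("\n")[1].strip()

# one sample with a code, one without
with_ref = '<div class="field-label">IOPC reference</div>\n            2019/118999\n'
without_ref = ('<div class="field-label">IOPC reference</div>\n'
               '  <div class="field-label">Date of recommendation</div>')

found = extract_ref(with_ref)
missing = extract_ref(without_ref)
print(found)
print(missing)
```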

## Create a function

With testing done, let's create a function that we can run on each URL to extract the IOPC ref if it's there, or an alternative if not. This needs to be stored alongside the URL for matching against the full record later.

In [None]:
#define a new function called getref, which takes 1 argument - a URL string
def getref(url):
  #scrape the webpage at that URL
  html = scraperwiki.scrape(url)
  #it will need to be decoded ...
  #see https://stackoverflow.com/questions/606191/convert-bytes-to-a-string
  html = html.decode("utf-8")
  #compile the regex (raw string, so \s is passed to re untouched)
  refregex = re.compile(r"IOPC reference</div>\n\s+[0-9]{4}/[0-9]{6}")
  #grab the first match of that, or None if there isn't one
  regexmatches = refregex.search(html)
  #check if there is a match
  if regexmatches:
    #clean and store the ref code, reusing the match we already have
    iopcref = regexmatches[0].split('\n')[1].strip()
  else:
    iopcref = "no ref found"
  return iopcref

A quick test:

In [None]:
testref = getref(recs['url'][0])
print(testref)

2020/135656


## Run the function on multiple URLs

With that done, we can run it on a list of URLs.

In [None]:
#create an empty list to add to
runninglist = []
#loop through urls
for i in recs['url'][:10]:
  print(i)
  #run function to grab ref from that page
  grabbedref = getref(i)
  #add to a dict along with the URL for matching
  runningdict = {'url':i, 'iopcref':grabbedref}
  #add to a list, too, which may be easier
  runninglist.append(grabbedref+i)
  print(runningdict)
  print(runninglist)

https://policeconduct.gov.uk/recommendations/recommendations-sussex-police-april-2021
{'url': 'https://policeconduct.gov.uk/recommendations/recommendations-sussex-police-april-2021', 'iopcref': '2020/135656'}
['2020/135656https://policeconduct.gov.uk/recommendations/recommendations-sussex-police-april-2021']
https://policeconduct.gov.uk/recommendations/woman-was-found-dead-after-welfare-concerns-reported-thames-valley-police-april-2020
{'url': 'https://policeconduct.gov.uk/recommendations/woman-was-found-dead-after-welfare-concerns-reported-thames-valley-police-april-2020', 'iopcref': '2020/135205'}
['2020/135656https://policeconduct.gov.uk/recommendations/recommendations-sussex-police-april-2021', '2020/135205https://policeconduct.gov.uk/recommendations/woman-was-found-dead-after-welfare-concerns-reported-thames-valley-police-april-2020']
https://policeconduct.gov.uk/recommendations/national-recommendation-college-policing-june-2021
{'url': 'https://policeconduct.gov.uk/recommendation

And the whole lot - this time we also collect each dict in a list and build a data frame from that list at the end:

In [None]:
#create an empty list to add to
runninglist = []
#create an empty list to collect one dict per URL
rows = []
#loop through urls
for i in recs['url']:
  print(i)
  #run function to grab ref from that page
  grabbedref = getref(i)
  #add to a dict along with the URL for matching
  runningdict = {'url':i, 'iopcref':grabbedref}
  #add to a list, too, which may be easier
  runninglist.append(grabbedref+i)
  #print(runningdict)
  #print(runninglist)
  #collect the dict - DataFrame.append() was removed in pandas 2.0
  rows.append(runningdict)
#build the data frame once, after the loop
df = pd.DataFrame(rows, columns=["url","iopcref"])

## Checking gaps

Let's check those URLs where a ref wasn't found.

In [None]:
#create a dataframe by filtering on those rows where iopcref is 'no ref found'
norefs = df[df['iopcref'] == "no ref found"]
#show the urls in that new dataframe
norefs['url']


850     https://policeconduct.gov.uk/recommendations/f...
852     https://policeconduct.gov.uk/recommendations/p...
856     https://policeconduct.gov.uk/recommendations/r...
982     https://policeconduct.gov.uk/recommendations/r...
983     https://policeconduct.gov.uk/recommendations/r...
                              ...                        
1089    https://policeconduct.gov.uk/recommendations/r...
1090    https://policeconduct.gov.uk/recommendations/r...
1091    https://policeconduct.gov.uk/recommendations/r...
1092    https://policeconduct.gov.uk/recommendations/r...
1197    https://policeconduct.gov.uk/recommendations/r...
Name: url, Length: 106, dtype: object

In [None]:
for i in norefs['url'][:5]:
  print(i)

https://policeconduct.gov.uk/recommendations/fatal-collision-derbyshire-constabulary-september-2017
https://policeconduct.gov.uk/recommendations/pedestrian-killed-collision-police-car-west-midlands-police-april-2012
https://policeconduct.gov.uk/recommendations/response-report-bad-driving-kent-police-august-2017
https://policeconduct.gov.uk/recommendations/recommendation-humberside-police-november-2014
https://policeconduct.gov.uk/recommendations/recommendation-northamptonshire-police-december-2014


[Looking at one page](https://policeconduct.gov.uk/recommendations/allegations-assault-during-custody-merseyside-police-october-2016) we can see the reference has an extra digit - the year is incorrectly entered as 22017:

```{html}
<div class="field-label">IOPC reference</div>
            22017/082348                        </div>
```

On [another](https://policeconduct.gov.uk/recommendations/fatal-collision-derbyshire-constabulary-september-2017) it's a dash instead of a slash:

`2017-091299`

And on [another](https://policeconduct.gov.uk/recommendations/response-report-bad-driving-kent-police-august-2017) there's no slash *or* dash:

`2017090856`

So we need to adjust the regex.

In [None]:
#define a new function called getref, which takes 1 argument - a URL string
def getref(url):
  #scrape the webpage at that URL
  html = scraperwiki.scrape(url)
  #it will need to be decoded ...
  #see https://stackoverflow.com/questions/606191/convert-bytes-to-a-string
  html = html.decode("utf-8")
  #compile the regex - this accepts 3-5 digits, an optional slash or dash, then 4-7 digits
  refregex = re.compile(r"IOPC reference</div>\n\s+[0-9]{3,5}[/-]?[0-9]{4,7}")
  #grab the first match of that, or None if there isn't one
  regexmatches = refregex.search(html)
  #check if there is a match
  if regexmatches:
    #clean and store the ref code
    iopcref = regexmatches[0].split('\n')[1].strip()
  else:
    iopcref = "no ref found"
  return iopcref

testurl = "https://policeconduct.gov.uk/recommendations/allegations-assault-during-custody-merseyside-police-october-2016"
print(getref(testurl))
testurl = "https://policeconduct.gov.uk/recommendations/fatal-collision-derbyshire-constabulary-september-2017"
print(getref(testurl))
testurl = "https://policeconduct.gov.uk/recommendations/response-report-bad-driving-kent-police-august-2017"
print(getref(testurl))

22017/082348
2017-091299
2017090856
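The loosened pattern can also be checked offline, without requesting any pages. This sketch wraps each of the formats seen so far in the same HTML shape and confirms the pattern returns them unchanged; `loose`, `samples` and `results` are names introduced for the illustration.

```python
import re

# the loosened pattern: 3-5 digits, optional slash or dash, 4-7 digits
loose = re.compile(r"IOPC reference</div>\n\s+[0-9]{3,5}[/-]?[0-9]{4,7}")

# the standard format plus the three variants found on the site
samples = ["2019/118999", "22017/082348", "2017-091299", "2017090856"]

results = []
for code in samples:
    # rebuild the surrounding HTML around each code
    html = ('<div class="field-label">IOPC reference</div>\n            '
            + code + '    </div>')
    match = loose.search(html)
    if match:
        results.append(match[0].split("\n")[1].strip())
    else:
        results.append("no ref found")

print(results)
```

If a new variant turns up later, adding it to `samples` gives a quick way to see whether the pattern needs loosening again.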


Then run it again.

In [None]:
#create an empty list to add to
runninglist = []
#create an empty list to collect one dict per URL
rows = []
#loop through urls
for i in recs['url']:
  print(i)
  #run function to grab ref from that page
  grabbedref = getref(i)
  #add to a dict along with the URL for matching
  runningdict = {'url':i, 'iopcref':grabbedref}
  #add to a list, too, which may be easier
  runninglist.append(grabbedref+i)
  #print(runningdict)
  #print(runninglist)
  #collect the dict - DataFrame.append() was removed in pandas 2.0
  rows.append(runningdict)
#build the data frame once, after the loop
df = pd.DataFrame(rows, columns=["url","iopcref"])

https://policeconduct.gov.uk/recommendations/recommendations-sussex-police-april-2021
https://policeconduct.gov.uk/recommendations/woman-was-found-dead-after-welfare-concerns-reported-thames-valley-police-april-2020
https://policeconduct.gov.uk/recommendations/national-recommendation-college-policing-june-2021
https://policeconduct.gov.uk/recommendations/inappropriate-communications-member-public-%E2%80%93-cambridgeshire-police-and-crime-panel
https://policeconduct.gov.uk/recommendations/police-management-registered-terrorist-offender-following-his-release-prison-%E2%80%93
https://policeconduct.gov.uk/recommendations/fatal-police-shooting-terrorist-attacker-%E2%80%93-city-london-police-and-metropolitan
https://policeconduct.gov.uk/recommendations/death-following-police-pursuit-%E2%80%93-hampshire-constabulary-august-2020
https://policeconduct.gov.uk/recommendations/death-following-police-investigation-sexual-offence-south-yorkshire-police-october
https://policeconduct.gov.uk/recommenda

## Export results

Now export the results, which can be added back into the original data using `VLOOKUP` in Excel, matching on the URL.

In [None]:
#remove duplicates based on the url column
df = df.drop_duplicates(subset="url")
#And we can export it
df.to_csv("scrapeddata.csv")
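For anyone who would rather stay in Python, the `VLOOKUP` step amounts to looking up each URL from the original data in the scraped results. Below is a plain-Python sketch of that lookup. The `force` column and its `placeholder` values are assumptions, because the other columns of the original CSV are not shown in this notebook, and the second URL is invented to show what happens when there is no match.

```python
# Scraped results (this pair is taken from the output earlier)
scraped = [
    {"url": "https://policeconduct.gov.uk/recommendations/recommendations-sussex-police-april-2021",
     "iopcref": "2020/135656"},
]

# Stand-in for rows of the original data; "force" is a placeholder column
original = [
    {"url": "https://policeconduct.gov.uk/recommendations/recommendations-sussex-police-april-2021",
     "force": "placeholder"},
    {"url": "https://example.org/not-scraped", "force": "placeholder"},
]

# Index the scraped refs by URL, then look each original row up in it
ref_by_url = {row["url"]: row["iopcref"] for row in scraped}
merged = [{**row, "iopcref": ref_by_url.get(row["url"])} for row in original]

for row in merged:
    print(row["url"], row["iopcref"])
```

Rows with no matching URL get `None` here. In pandas the same join could be written as a left merge on the `url` column, for example `recs.merge(df, on="url", how="left")`.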
