## Scraping web sites

These notes follow this tutorial series: https://www.youtube.com/watch?v=zXif_9RVadI

The code below fetches the page at the URL and stores the result in a response object.

In [4]:
import requests 
from bs4 import BeautifulSoup
r = requests.get('https://allpolicejobs.co.uk/all#jobs')
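
A slightly more defensive version of the fetch might look like the sketch below. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the tutorial; `timeout=` and `raise_for_status()` are standard requests features.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML at `url` as text, raising if the request fails.

    `fetch_html` and the default timeout are illustrative choices.
    """
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html('https://allpolicejobs.co.uk/all#jobs')
```

Failing early like this avoids quietly scraping an error page and ending up with zero results later on.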

Now that the HTML from the web page is inside the variable r, which is a response object, I can grab the text from it. The code below prints the first 500 characters:

In [49]:
print(r.text[0:500])

<!DOCTYPE html>
<html lang="en">
	<head>
		<meta charset="utf-8">
		<title>AllPoliceJobs</title>
		<meta name="description" content="">
		<meta name="viewport" content="width=device-width, initial-scale=1.0" />
		<meta name="author" content="Triden Elite Ltd." />
		<meta name="generator" content="MVCms" />
		<link href="/style/bootstrap.min.css" rel="stylesheet" />
		<link href="/style/style.css" rel="stylesheet" />
		<link href="/style/site-responsive.css" rel="stylesheet" />
		<lin


Next, r.text is parsed and stored in a special object called soup. Beautiful Soup reads the HTML here and makes sense of its structure. 'html.parser' is the parser included in Python's standard library.

In [8]:
soup = BeautifulSoup(r.text, 'html.parser')
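
To see what "making sense of the structure" means in practice, here is a small sketch that parses a hand-written HTML snippet and reads values out of the resulting tree. The snippet and the name `demo_soup` are made up for illustration, loosely modelled on the page head printed above.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for r.text, written by hand for this example.
html = """
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>AllPoliceJobs</title>
  </head>
  <body><p>Hello</p></body>
</html>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# Once parsed, tags can be reached by name and their contents read directly.
page_title = demo_soup.title.string
charset = demo_soup.find('meta')['charset']
print(page_title, charset)
```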

The code below searches the soup object for every div with the class 'filter-result' (because that is the class at the beginning of the HTML for every job). It returns a Beautiful Soup results object, in which each matching div counts as one item.

In [9]:
results = soup.find_all('div', attrs={'class':'filter-result'})
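
A quick sketch of how the class filter behaves, using a made-up three-div snippet rather than the real page. Because class is a multi-valued attribute, searching for 'filter-result' also matches a div whose class is 'filter-result featured', but not a div with an unrelated class.

```python
from bs4 import BeautifulSoup

# Hand-written sample: two job-like divs and one unrelated div.
sample = """
<div class="filter-result featured"><h3>Job A</h3></div>
<div class="filter-result"><h3>Job B</h3></div>
<div class="sidebar"><h3>Not a job</h3></div>
"""
demo_soup = BeautifulSoup(sample, 'html.parser')

# Both spellings select the same elements.
matches = demo_soup.find_all('div', attrs={'class': 'filter-result'})
same_matches = demo_soup.find_all('div', class_='filter-result')

titles = [m.find('h3').text for m in matches]
print(titles)
```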

In [10]:
print(results)

[<div class="filter-result featured">
<div class="col-lg-6 col-md-7 col-sm-9 col-xs-12 pull-left">
<div class="company-left-info pull-left">
<img alt="Cumbria" src="/asset/force/logo/cumbria.boxed-65x65-255-255-255-0.jpg"/>
</div>
<div class="desig">
<a href="/jobs/47445"><h3>Police Officer Recruitment</h3></a>
<h4>Cumbria</h4>
</div>
</div>
<div class="col-lg-6 col-md-5 col-sm-8 col-xs-12 pull-right">
<div class="pull-right location">
<p></p>
</div>
<div class="data-job">
<h3>Â£19,971 - Â£38,382 Per year</h3>
<a class="label job-type job-contract pull-right" href="/jobs/47445">Read more</a>
</div>
</div>
</div>, <div class="filter-result featured">
<div class="col-lg-6 col-md-7 col-sm-9 col-xs-12 pull-left">
<div class="company-left-info pull-left">
<img alt="Metropolitan Police Service" src="/asset/force/logo/met.boxed-65x65-255-255-255-0.jpg"/>
</div>
<div class="desig">
<a href="/jobs/3688"><h3>Police Constable</h3></a>
<h4>Metropolitan Police Service</h4>
</div>
</div>
<div class=

The Beautiful Soup results object (a ResultSet) behaves like a Python list, which means I can check its length with len. The result is 536, and there are 536 jobs on the website!

In [11]:
len(results)

536

In [12]:
results[0:3]

[<div class="filter-result featured">
 <div class="col-lg-6 col-md-7 col-sm-9 col-xs-12 pull-left">
 <div class="company-left-info pull-left">
 <img alt="Cumbria" src="/asset/force/logo/cumbria.boxed-65x65-255-255-255-0.jpg"/>
 </div>
 <div class="desig">
 <a href="/jobs/47445"><h3>Police Officer Recruitment</h3></a>
 <h4>Cumbria</h4>
 </div>
 </div>
 <div class="col-lg-6 col-md-5 col-sm-8 col-xs-12 pull-right">
 <div class="pull-right location">
 <p></p>
 </div>
 <div class="data-job">
 <h3>Â£19,971 - Â£38,382 Per year</h3>
 <a class="label job-type job-contract pull-right" href="/jobs/47445">Read more</a>
 </div>
 </div>
 </div>, <div class="filter-result featured">
 <div class="col-lg-6 col-md-7 col-sm-9 col-xs-12 pull-left">
 <div class="company-left-info pull-left">
 <img alt="Metropolitan Police Service" src="/asset/force/logo/met.boxed-65x65-255-255-255-0.jpg"/>
 </div>
 <div class="desig">
 <a href="/jobs/3688"><h3>Police Constable</h3></a>
 <h4>Metropolitan Police Service</h4>

Next, check that the last job on the page is the same as the last one I've scraped:

In [13]:
results[-1]

<div class="filter-result">
<div class="col-lg-6 col-md-7 col-sm-9 col-xs-12 pull-left">
<div class="company-left-info pull-left">
<img alt="City of London" src="/asset/force/logo/london.boxed-65x65-255-255-255-0.jpg"/>
</div>
<div class="desig">
<a href="/jobs/2885"><h3>Transferee and Rejoiner Uniform and Detective Constables</h3></a>
<h4>City of London</h4>
</div>
</div>
<div class="col-lg-6 col-md-5 col-sm-8 col-xs-12 pull-right">
<div class="pull-right location">
<p>Various</p>
</div>
<div class="data-job">
<h3>National rates</h3>
<a class="label job-type job-contract pull-right" href="/jobs/2885">Read more</a>
</div>
</div>
</div>

Below I am putting the first job inside a variable so that I can work out the exact code with just one job. 

In [15]:
first_result = results[0]
first_result

<div class="filter-result featured">
<div class="col-lg-6 col-md-7 col-sm-9 col-xs-12 pull-left">
<div class="company-left-info pull-left">
<img alt="Cumbria" src="/asset/force/logo/cumbria.boxed-65x65-255-255-255-0.jpg"/>
</div>
<div class="desig">
<a href="/jobs/47445"><h3>Police Officer Recruitment</h3></a>
<h4>Cumbria</h4>
</div>
</div>
<div class="col-lg-6 col-md-5 col-sm-8 col-xs-12 pull-right">
<div class="pull-right location">
<p></p>
</div>
<div class="data-job">
<h3>Â£19,971 - Â£38,382 Per year</h3>
<a class="label job-type job-contract pull-right" href="/jobs/47445">Read more</a>
</div>
</div>
</div>

The variable called first_result is a special kind of Beautiful Soup object called a Tag, which has methods and attributes we can use.

Calling find('h3') on first_result searches it for the first instance of the h3 tag and returns it as a Beautiful Soup Tag object. Adding .text asks for just the text inside that tag:

In [17]:
first_result.find('h3').text

'Police Officer Recruitment'

In [52]:
first_result.find('h4').text

'Cumbria'

In [63]:
first_result.find_all('h3')[1].text[1:]

'£19,971 - Â£38,382 Per year'
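
The stray 'Â' in the salary looks like a text-encoding mismatch: the page declares utf-8 in its meta tag, but the text appears to have been decoded as ISO-8859-1, which is what requests falls back to for text responses whose headers do not name a charset. The sketch below reproduces the effect with plain Python strings. The commented-out line shows the usual remedy; that it applies to this site is an assumption.

```python
original = '£19,971 - £38,382 Per year'

# '£' is two bytes in UTF-8; reading them as ISO-8859-1 gives 'Â' followed by '£'.
garbled = original.encode('utf-8').decode('iso-8859-1')
print(garbled)

# Reversing the wrong decode recovers the intended text.
repaired = garbled.encode('iso-8859-1').decode('utf-8')
print(repaired)

# With requests, the usual fix is to set the encoding before reading r.text:
# r.encoding = 'utf-8'
```

With the encoding set correctly, the `[1:]` slice would no longer be needed, and keeping it would cut off the leading £ sign instead.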

In [64]:
first_result.find('a')['href']

'/jobs/47445'
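
The lookups above assume every tag and attribute is present. A short sketch of what happens when something is missing, using a hand-made snippet based on the first job's HTML (the `snippet`, `tag`, `link` and `missing` names are illustrative):

```python
from bs4 import BeautifulSoup

snippet = (
    '<div class="desig">'
    '<a href="/jobs/47445"><h3>Police Officer Recruitment</h3></a>'
    '</div>'
)
tag = BeautifulSoup(snippet, 'html.parser').div

link = tag.find('a')
print(link['href'])       # square brackets raise KeyError if the attribute is missing
print(link.get('title'))  # .get() returns None instead of raising

# find() returns None when nothing matches, so .text on it would fail.
missing = tag.find('h4')
region = missing.text if missing is not None else ''
```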

Now that I have worked out how to get the job title, region, salary and link for one job, I need a loop that does the same for every job:

In [65]:
records = []
for result in results:
    job = result.find('h3').text
    region = result.find('h4').text
    salary = result.find_all('h3')[1].text[1:]
    link = result.find('a')['href']
    records.append((job, region, salary, link))

In [67]:
records[0:3]

[('Police Officer Recruitment',
  'Cumbria',
  '£19,971 - Â£38,382 Per year',
  '/jobs/47445'),
 ('Police Constable', 'Metropolitan Police Service', '', '/jobs/3688'),
 ('Temporary Insight and Evaluation Lead',
  'Scotland',
  '£30,291 - Â£34,575',
  '/jobs/155239')]
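
The loop above works because every result on this page has the same shape. A more defensive variant is sketched below as a helper that tolerates missing tags and turns the relative link into a full URL. The names `parse_job` and `BASE_URL` are hypothetical, the sample HTML is trimmed from the first result, and the sketch assumes the page text has been decoded as UTF-8 so that no leading character needs slicing off.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base for resolving relative links such as '/jobs/47445'.
BASE_URL = 'https://allpolicejobs.co.uk/all'


def parse_job(result):
    """Return (job, region, salary, link) for one 'filter-result' tag."""
    headings = result.find_all('h3')
    job = headings[0].get_text(strip=True) if headings else ''
    # The salary is the second h3; use '' when it is absent.
    salary = headings[1].get_text(strip=True) if len(headings) > 1 else ''

    region_tag = result.find('h4')
    region = region_tag.get_text(strip=True) if region_tag is not None else ''

    anchor = result.find('a')
    has_link = anchor is not None and anchor.has_attr('href')
    link = urljoin(BASE_URL, anchor['href']) if has_link else ''
    return (job, region, salary, link)


sample_html = """
<div class="filter-result featured">
  <div class="desig">
    <a href="/jobs/47445"><h3>Police Officer Recruitment</h3></a>
    <h4>Cumbria</h4>
  </div>
  <div class="data-job"><h3>£19,971 - £38,382 Per year</h3></div>
</div>
"""
sample_result = BeautifulSoup(sample_html, 'html.parser').div
print(parse_job(sample_result))
```

With a helper like this, the loop body could shrink to `records.append(parse_job(result))`.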

In [68]:
len(records)

536

In [69]:
import pandas as pd
df = pd.DataFrame(records, columns=['job','region','salary','link'])

In [70]:
df.head()

|   | job | region | salary | link |
|---|---|---|---|---|
| 0 | Police Officer Recruitment | Cumbria | £19,971 - Â£38,382 Per year | /jobs/47445 |
| 1 | Police Constable | Metropolitan Police Service |  | /jobs/3688 |
| 2 | Temporary Insight and Evaluation Lead | Scotland | £30,291 - Â£34,575 | /jobs/155239 |
| 3 | PS/303/BCH ICT Technical Architect | Bedfordshire |  | /jobs/155182 |
| 4 | PS/263/C Occupational Health Technician | Bedfordshire |  | /jobs/155138 |


In [71]:
df.to_csv('police_jobs.csv')
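
One thing to be aware of: by default to_csv also writes the DataFrame's index as an unnamed first column, which pandas labels 'Unnamed: 0' when the file is read back. The sketch below shows the difference on a two-row stand-in for the real data (the `demo_` names are illustrative). It writes to strings rather than files so that nothing on disk is touched.

```python
import io

import pandas as pd

demo_records = [
    ('Police Officer Recruitment', 'Cumbria', '£19,971 - £38,382 Per year', '/jobs/47445'),
    ('Police Constable', 'Metropolitan Police Service', '', '/jobs/3688'),
]
demo_df = pd.DataFrame(demo_records, columns=['job', 'region', 'salary', 'link'])

# Default: the index is written as an unnamed first column.
with_index = demo_df.to_csv()
# index=False leaves the index out.
without_index = demo_df.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the default output back shows where 'Unnamed: 0' comes from.
round_trip = pd.read_csv(io.StringIO(with_index))
print(list(round_trip.columns))
```

If the extra column is unwanted, `df.to_csv('police_jobs.csv', index=False)` would be the equivalent call for the real DataFrame.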