Web scraping the Indeed website for data science jobs, based on Greg Reda's [excellent tutorial](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/), Jesse Steinweg's [excellent analysis](https://jessesw.com/Data-Science-Skills/) and Sung Pil Moon's [awesome analysis](http://blog.nycdatascience.com/students-work/project-3-web-scraping-company-data-from-indeed-com-and-dice-com/).

## 1) Admin and Setup

I've already created a conda virtual environment and installed bs4 in it. Feel free to use my environment.yaml to create a similar environment; I'll update it as I go.

In [1]:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import pandas as pd
import re
import numpy as np

## 2) What do I want to achieve

I want the results of a search such as

http://au.indeed.com/jobs?as_and=&as_phr=&as_any=%22customer+analytics%22+%22data+analysis%22&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=50&l=&fromage=any&limit=10&sort=&psf=advsrch

OR

http://au.indeed.com/jobs?as_and=&as_phr=&as_any=%22customer+analytics%22+%22data+analysis%22&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=0&l=Sydney+NSW&fromage=last&limit=10&sort=&psf=advsrch

in a nice tabular format for data exploration

## 3) String together webpage url based on different parameters

Build the URL from separate search parameters (search query, city, salary, etc.): format each one for the URL, then string them together into one final query.

final_query = base_url + job_query_string + company_name + salary + location + fromage

#### Base URL

In [2]:
# start of the url - this does not change because the search query is matched
# anywhere in the job ad, not just in the job title.
base_url = 'http://au.indeed.com/jobs?as_and=&as_phr=&as_any='

#### Job query string: inputs are stored in job_query1, job_query2 and job_query3 (restricted to 3 parameters)

Step1 - collect 3 search queries<br/>
When refactoring this, convert these into function parameters

In [3]:
job_query1 = 'data scientist'
job_query2 = 'customer analytics'
job_query3 = 'data analysis'

Step2 - create query string in the required format

In [4]:
# 1) within each search query, spaces are replaced by '+' in the URL
# 2) each search query is preceded and followed by "%22" (a URL-encoded double quote)
# 3) the elements of the list are joined into one string separated by "+"
job_query_string = []
job_query_string.append("%22" + job_query1.replace(" ","+") + "%22")
job_query_string.append("%22" + job_query2.replace(" ","+") + "%22")
job_query_string.append("%22" + job_query3.replace(" ","+") + "%22")
job_query_string = "+".join(job_query_string)
job_query_string

'%22data+scientist%22+%22customer+analytics%22+%22data+analysis%22'
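The three `append` calls above can be folded into one function that accepts any number of search phrases, which is one way to do the "convert into parameters" refactor. This is a minimal sketch; the function name `build_job_query_string` is made up for illustration.

```python
def build_job_query_string(queries):
    """Wrap each phrase in %22, replace spaces with '+', and join with '+'."""
    quoted = ["%22" + query.replace(" ", "+") + "%22" for query in queries]
    return "+".join(quoted)


# Same three phrases as job_query1, job_query2 and job_query3 above.
job_query_string = build_job_query_string(
    ["data scientist", "customer analytics", "data analysis"])
print(job_query_string)
```

Passing a list instead of three separate variables also removes the need to restrict the search to exactly 3 phrases.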

#### Company name, salary, location, fromage - not working on these right now

In [5]:
company_name=''
salary=''
location=''
fromage='any'

#### Create final query

In [6]:
final_query = [base_url,job_query_string,'&as_not=&as_ttl=&as_cmp=',company_name,
               '&jt=all&st=&salary=',salary,'&radius=50&l=',location,
               '&fromage=',fromage,'&limit=10&sort=&psf=advsrch']
final_query = "".join(final_query)
final_query

'http://au.indeed.com/jobs?as_and=&as_phr=&as_any=%22data+scientist%22+%22customer+analytics%22+%22data+analysis%22&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=50&l=&fromage=any&limit=10&sort=&psf=advsrch'
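As an alternative to joining string fragments by hand, the same URL can be produced with the standard library's `urlencode`, which handles the `+` and `%22` escaping itself. This is a sketch written for Python 3 (in Python 2, `urlencode` lives in `urllib`); the function name `build_search_url` and its keyword arguments are my own naming, and the parameter order simply mirrors the URL above.

```python
from urllib.parse import urlencode


def build_search_url(queries, company_name="", salary="", location="",
                     fromage="any", radius=50, limit=10):
    """Build the advanced-search URL from plain-text parameters."""
    # Wrap each phrase in double quotes; urlencode turns '"' into %22
    # and ' ' into '+'.
    as_any = " ".join('"%s"' % query for query in queries)
    # A list of tuples keeps the parameters in the same order as the
    # hand-built URL.
    params = [
        ("as_and", ""), ("as_phr", ""), ("as_any", as_any),
        ("as_not", ""), ("as_ttl", ""), ("as_cmp", company_name),
        ("jt", "all"), ("st", ""), ("salary", salary),
        ("radius", radius), ("l", location), ("fromage", fromage),
        ("limit", limit), ("sort", ""), ("psf", "advsrch"),
    ]
    return "http://au.indeed.com/jobs?" + urlencode(params)


print(build_search_url(["data scientist", "customer analytics", "data analysis"]))
```

Because the escaping is delegated to `urlencode`, a location such as `'Sydney NSW'` can be passed as plain text and comes out as `l=Sydney+NSW`.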

## 4) Open website and read it

Step1: Open the first page <br/>
Step2: Get the html of the first page

In [7]:
html = urlopen(final_query).read()  
soup = BeautifulSoup(html, "lxml")  
soup

<!DOCTYPE html>\n<html>\n<head>\n<meta content="text/html;charset=unicode-escape" http-equiv="content-type"/>\n<!-- pll --><script src="/s/567da8a/en_AU.js" type="text/javascript"></script>\n<link href="/s/a6a334e/jobsearch_all.css" rel="stylesheet" type="text/css"/>\n<link href="http://au.indeed.com/rss?q=%28%22data+scientist%22+or+%22customer+analytics%22+or+%22data+analysis%22%29" rel="alternate" title="Data Scientist Customer Analytics Data Analysis Jobs" type="application/rss+xml"/>\n<link href="/m/jobs?q=%28%22data+scientist%22+or+%22customer+analytics%22+or+%22data+analysis%22%29" media="handheld" rel="alternate"/>\n<script type="text/javascript">\n    \n    window['closureReadyCallbacks'] = [];\n\n    function call_when_jsall_loaded(cb) {\n        if (window['closureReady']) {\n            cb();\n        } else {\n            window['closureReadyCallbacks'].push(cb);\n        }\n    }\n</script>\n<script src="/s/37c105f/jobsearch-all-compiled.js" type="text/javascript"></script

Step3: find out how many jobs the search query returned

In [8]:
number_of_jobs_page_area = soup.find(id="searchCount").string.encode('utf-8')
number_of_jobs_page_area

'Jobs 1 to 10 of 963'

In [9]:
number_of_jobs = re.findall(r'\d+', number_of_jobs_page_area)
number_of_jobs

['1', '10', '963']

In [10]:
total_number_of_jobs = int(number_of_jobs[2])
total_number_of_jobs

963
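The three cells above can be combined into one helper. The sketch below also tolerates thousands separators: I am assuming that totals of 1,000 or more might be rendered with a comma (as the review counts further down are), in which case a plain `\d+` would split the number in two. The function name `parse_total_jobs` is hypothetical.

```python
import re


def parse_total_jobs(search_count_text):
    """Return the last number in text like 'Jobs 1 to 10 of 963' as an int."""
    # Match runs of digits that may contain commas, e.g. '1,963'.
    numbers = re.findall(r"\d[\d,]*", search_count_text)
    if not numbers:
        raise ValueError("no number found in %r" % search_count_text)
    # The total is the last number in the string; drop any commas.
    return int(numbers[-1].replace(",", ""))


print(parse_total_jobs("Jobs 1 to 10 of 963"))
```

Taking the last match rather than index 2 means the helper does not depend on the string containing exactly three numbers.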

Step4: calculate how many pages to scroll

In [11]:
# number of pages = number of records / 10, rounded up so the last partial page is covered
number_of_pages_to_scroll = np.ceil(total_number_of_jobs/10.0)
number_of_pages_to_scroll

97.0
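`np.ceil` returns a float (`97.0`), which cannot be passed to `range` directly. A small sketch of an integer version follows, together with the offset of the first result on each page. The names `number_of_pages` and `page_offsets` are made up here, and how an offset is passed to the site is not covered in this notebook.

```python
import math

RESULTS_PER_PAGE = 10  # matches limit=10 in the search URL


def number_of_pages(total_number_of_jobs, per_page=RESULTS_PER_PAGE):
    """Round up so that a partially filled last page is still counted."""
    return int(math.ceil(total_number_of_jobs / float(per_page)))


def page_offsets(total_number_of_jobs, per_page=RESULTS_PER_PAGE):
    """Index of the first result on each page: 0, 10, 20, ..."""
    pages = number_of_pages(total_number_of_jobs, per_page)
    return [page * per_page for page in range(pages)]


print(number_of_pages(963))
print(page_offsets(963)[:3])
```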

## 5) Load all the deets into separate lists

In [12]:
# get all the deets of each row; one row corresponds to one job
# incidentally, I noticed that the class " row result" only picked up 9 results on the first page
# because the last row uses a different class, 'lastRow row result'

targetElements = soup.findAll('div', attrs = {'class' : ' row result'})
targetElements.extend(soup.findAll('div', attrs = {'class' : 'lastRow row result'}))
targetElements

[<div class=" row result" data-jk="ca8146b5d7b02d5f" data-tn-component="organicJob" id="p_ca8146b5d7b02d5f" itemscope="" itemtype="http://schema.org/JobPosting">\n<h2 class="jobtitle" id="jl_ca8146b5d7b02d5f">\n<a class="turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=ca8146b5d7b02d5f&amp;fccid=1818d10a60db56b4" itemprop="title" onclick="setRefineByCookie(['radius']); return rclk(this,jobmap[0],true,0);" onmousedown="return rclk(this,jobmap[0],0);" rel="nofollow" target="_blank" title="Program and Service Advisor">Program and Service Advisor</a>\n</h2>\n<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">\n<span itemprop="name">\n    Victorian Government</span>\n</span>\n\n - <span itemprop="jobLocation" itemscope="" itemtype="http://schema.org/Place"><span class="location" itemprop="address" itemscope="" itemtype="http://schema.org/Postaladdress"><span itemprop="addressLocality">Geelong VIC</span></span></span>\n<table border="0" ce

In [13]:
type(targetElements)

bs4.element.ResultSet

In [14]:
for elem in targetElements:
    print elem
    print "\n"
    print "*****************************************************"
    print "\n"

<div class=" row result" data-jk="ca8146b5d7b02d5f" data-tn-component="organicJob" id="p_ca8146b5d7b02d5f" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_ca8146b5d7b02d5f">
<a class="turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=ca8146b5d7b02d5f&amp;fccid=1818d10a60db56b4" itemprop="title" onclick="setRefineByCookie(['radius']); return rclk(this,jobmap[0],true,0);" onmousedown="return rclk(this,jobmap[0],0);" rel="nofollow" target="_blank" title="Program and Service Advisor">Program and Service Advisor</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
    Victorian Government</span>
</span>

 - <span itemprop="jobLocation" itemscope="" itemtype="http://schema.org/Place"><span class="location" itemprop="address" itemscope="" itemtype="http://schema.org/Postaladdress"><span itemprop="addressLocality">Geelong VIC</span></span></span>
<table border="0" cellpadding="

#### Job Title

In [15]:
jobtitle = []
for elem in targetElements:
    #print elem.find('a', attrs = {'class':'turnstileLink'}).attrs['title']
    jobtitle.append(elem.find('a', attrs = {'class':'turnstileLink'}).attrs['title'])

jobtitle

['Program and Service Advisor',
 'Reliability Engineer - Christmas Creek',
 'Data Analyst / Administrator',
 'Test Data Analyst',
 'Stock Coordinator - Support Centre QLD',
 'Scholarly Teaching Fellow',
 'Global Purchasing Procurement Analyst',
 'Insight Analytics Manager',
 'Production Reliabilty Engineer',
 'RTO Administration Officer']

#### Company Name

In [16]:
companyname = []
for elem in targetElements:
    companyname.append(elem.find('span', attrs = {'itemprop':'name'}).getText().strip().encode('utf-8'))
    
companyname

['Victorian Government',
 'Fortescue Metals Group',
 'Brookfield',
 'UniSuper',
 'Lorna Jane',
 'Macquarie University',
 'PepsiCo',
 'PwC',
 'GlaxoSmithKline',
 'Downer EDI']

#### Location

In [17]:
location = []
for elem in targetElements:
    location.append(elem.find('span', attrs = {'itemprop':'addressLocality'}).getText().strip().encode('utf-8'))

location

['Geelong VIC',
 'Pilbara WA',
 'Sydney NSW',
 'Melbourne VIC',
 'Queensland',
 'Macquarie University NSW',
 'Chatswood NSW',
 'Melbourne VIC',
 'Melbourne VIC',
 'Laverton WA']

#### Summary

In [18]:
summary = []
for elem in targetElements:
    summary.append(elem.find('span', attrs = {'class':'summary'}).getText().strip().encode('utf-8'))
    
summary

['In particular, a highly motivated individual with great data analysis and relationship management skills with health sector experience would suit this role....',
 'Analysis and reporting of fleet performance. The role of the reliability engineer is to provide technical support using initiatives and applying reliability...',
 'Undertake data analysis, cleansing, validation and updates. Management of data including; Data and system incident management;...',
 'Extensive data extract, analysis and reporting experience. As a Test Data Analyst you will perform complex data analysis testing to ensure that changes to new...',
 'Analysis of sales data and through reports to supply. Combining a love of LJ Active Wear and data analysis with....',
 'Experience in acoustic, articulatory and/or perceptual speech data analysis. Macquarie is the university of pioneering minds....',
 'You will be responsible for managing performance metrics, insightful reports, data analytics, and commodity risk comp

#### Company Rating

In [19]:
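# This returned None for every row (see the output below). A likely cause is that
# the check passed attrs wrapped in a list ([{...}]); bs4 treats a non-dict attrs
# value as a class-name search, so it never matched. A plain dict is used here.
company_rating = []
for elem in targetElements:
    if elem.find('span', attrs = {'class':'ratingNumber'}) is None: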
        company_rating.append(None)
    else:
        company_rating.append(elem.find('span', attrs = {'class':'ratingNumber'})
                                     .getText().strip().encode('utf-8'))
    
company_rating

[None, None, None, None, None, None, None, None, None, None]

#### Company rating - Number of Reviews

In [20]:
company_rating_counts = []
for elem in targetElements:
    if elem.find('span', attrs = {'class':'slNoUnderline'}) is None:
        company_rating_counts.append(None)
    else:
        company_rating_counts.append(elem.find('span', attrs = {'class':'slNoUnderline'})
                                     .getText().strip().encode('utf-8'))

company_rating_counts

[None,
 '23 reviews',
 '21 reviews',
 None,
 '19 reviews',
 '11 reviews',
 '4,712 reviews',
 '2,342 reviews',
 '1,581 reviews',
 '47 reviews']
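For data exploration the review counts are more useful as integers than as strings like `'4,712 reviews'`. A minimal sketch of the conversion, keeping `None` for rows without a rating; the function name `parse_review_count` is hypothetical.

```python
import re


def parse_review_count(text):
    """Convert e.g. '4,712 reviews' to 4712; pass None through unchanged."""
    if text is None:
        return None
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))


sample = [None, '23 reviews', '4,712 reviews']
print([parse_review_count(text) for text in sample])
```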

#### Advertised number of days ago

In [21]:
advertised_number_of_days_ago = []
for elem in targetElements:
    if elem.find('span', attrs = {'class':'date'}) is None:
        advertised_number_of_days_ago.append(None)
    else:
        advertised_number_of_days_ago.append(elem.find('span', attrs = {'class':'date'})
                                     .getText().strip().encode('utf-8'))

advertised_number_of_days_ago

['12 days ago',
 '3 days ago',
 '28 days ago',
 '11 days ago',
 '3 days ago',
 '13 days ago',
 '6 days ago',
 '30+ days ago',
 '4 days ago',
 '11 days ago']
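The "days ago" strings can be turned into numbers in the same way. The sketch below returns the day count together with a flag for open-ended values such as `'30+ days ago'`, so that "at least 30" is not confused with "exactly 30". Only the two formats shown in the output above are handled; anything else returns `None`. The function name `parse_days_ago` is made up for illustration.

```python
import re


def parse_days_ago(text):
    """Return (days, is_lower_bound), or None if the text has another form."""
    if text is None:
        return None
    match = re.search(r"(\d+)(\+?)\s+days?\s+ago", text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2) == "+"


print(parse_days_ago('12 days ago'))
print(parse_days_ago('30+ days ago'))
```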

#### Salary

In [22]:
salary = []
for elem in targetElements:
    if elem.find('nobr') is None:
        salary.append(None)
    else:
        salary.append(elem.find('nobr').getText().strip().encode('utf-8'))

salary

[None,
 None,
 None,
 None,
 None,
 '$68,324 - $92,000 a year',
 None,
 None,
 None,
 None]
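The one salary that was returned, `'$68,324 - $92,000 a year'`, suggests a "low - high a period" format. The sketch below splits such a string into numbers, assuming that other salaries follow the same pattern, which a single example cannot confirm. The function name `parse_salary` is hypothetical.

```python
import re


def parse_salary(text):
    """Return (low, high, period) from e.g. '$68,324 - $92,000 a year'."""
    if text is None:
        return None
    # Every dollar amount in the string, with thousands separators removed.
    amounts = [float(amount.replace(",", ""))
               for amount in re.findall(r"\$(\d[\d,]*(?:\.\d+)?)", text)]
    if not amounts:
        return None
    # The pay period is the word after the trailing 'a' or 'an'.
    period_match = re.search(r"an?\s+(\w+)\s*$", text)
    period = period_match.group(1) if period_match else None
    return min(amounts), max(amounts), period


print(parse_salary('$68,324 - $92,000 a year'))
```

If only one amount is present, the low and high values are the same.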

#### Job Link

In [23]:
joblink = []
home_url = 'http://www.indeed.com'

for elem in targetElements:
    joblink.append("%s%s" % (home_url, elem.find('a').get('href')))

joblink

['http://www.indeed.com/rc/clk?jk=ca8146b5d7b02d5f&fccid=1818d10a60db56b4',
 'http://www.indeed.com/rc/clk?jk=7a9fe7de1beb85f9&fccid=9605a3534a186df0',
 'http://www.indeed.com/rc/clk?jk=35211e3965486c66&fccid=b144006bbf2d95a5',
 'http://www.indeed.com/rc/clk?jk=90d5230fdc16cb28&fccid=7414fd5891b0ddd2',
 'http://www.indeed.com/rc/clk?jk=1c41fd61f33c7f7a&fccid=9f1632b30ffb46f4',
 'http://www.indeed.com/rc/clk?jk=e404043290252ddc&fccid=bca6f10d73dbbf6c',
 'http://www.indeed.com/rc/clk?jk=845365829ba37eda&fccid=2973259ddc967948',
 'http://www.indeed.com/rc/clk?jk=29209de19e74c885&fccid=5e964c4afc56b180',
 'http://www.indeed.com/rc/clk?jk=812254ce6d0994d7&fccid=4e42ec53f4b93e02',
 'http://www.indeed.com/rc/clk?jk=ca649839049927c4&fccid=44e7c30753d07f3b']
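The relative `href` values can also be resolved with the standard library's `urljoin` instead of string formatting. This is a sketch for Python 3 (in Python 2, `urljoin` lives in `urlparse`); the function name `absolute_job_link` is my own.

```python
from urllib.parse import urljoin


def absolute_job_link(href, home_url="http://www.indeed.com"):
    """Resolve a relative job link against the site's home URL."""
    return urljoin(home_url, href)


print(absolute_job_link("/rc/clk?jk=ca8146b5d7b02d5f&fccid=1818d10a60db56b4"))
```

The home URL is a parameter, so the links could be resolved against `http://au.indeed.com` instead if they should stay on the site that was searched.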

## 6) Create a dataframe based on the information collected

In [24]:
df_columns=['query_date','jobtitle','companyname','location',
             'advertised_number_of_days_ago','company_rating',
             'company_rating_counts','salary','summary',
             'joblink','job_query_string']

df_joblist = pd.DataFrame({'query_date':pd.to_datetime('today'),
                                'jobtitle':jobtitle,
                                'companyname':companyname,
                                'location':location,
                                'advertised_number_of_days_ago':advertised_number_of_days_ago,
                                'company_rating':company_rating,
                                'company_rating_counts':company_rating_counts,
                                'salary':salary,
                                'summary':summary,
                                'joblink':joblink,
                                'job_query_string':job_query_string},
                         columns = df_columns)

df_joblist.head()

| | query_date | jobtitle | companyname | location | advertised_number_of_days_ago | company_rating | company_rating_counts | salary | summary | joblink | job_query_string |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-06-13 | Program and Service Advisor | Victorian Government | Geelong VIC | 12 days ago | | | | In particular, a highly motivated individual w... | http://www.indeed.com/rc/clk?jk=ca8146b5d7b02d... | %22data+scientist%22+%22customer+analytics%22+... |
| 1 | 2016-06-13 | Reliability Engineer - Christmas Creek | Fortescue Metals Group | Pilbara WA | 3 days ago | | 23 reviews | | Analysis and reporting of fleet performance. T... | http://www.indeed.com/rc/clk?jk=7a9fe7de1beb85... | %22data+scientist%22+%22customer+analytics%22+... |
| 2 | 2016-06-13 | Data Analyst / Administrator | Brookfield | Sydney NSW | 28 days ago | | 21 reviews | | Undertake data analysis, cleansing, validation... | http://www.indeed.com/rc/clk?jk=35211e3965486c... | %22data+scientist%22+%22customer+analytics%22+... |
| 3 | 2016-06-13 | Test Data Analyst | UniSuper | Melbourne VIC | 11 days ago | | | | Extensive data extract, analysis and reporting... | http://www.indeed.com/rc/clk?jk=90d5230fdc16cb... | %22data+scientist%22+%22customer+analytics%22+... |
| 4 | 2016-06-13 | Stock Coordinator - Support Centre QLD | Lorna Jane | Queensland | 3 days ago | | 19 reviews | | Analysis of sales data and through reports to ... | http://www.indeed.com/rc/clk?jk=1c41fd61f33c7f... | %22data+scientist%22+%22customer+analytics%22+... |


Need to do:

1) Source<br/>
2) Contract or not<br/>
3) Get the full text from the job link. If the source of the job is Indeed itself, I imagine that is easier to retrieve. Otherwise, the harder option is to search for each row's short description on the parent website and pull the full text from there.<br/>
4) Start analysing the short text and job names already. Cluster jobs together.<br/>
5) Check how Jesse's and Sung's tutorials got the full text of the jobs in order to map out the skills.