## Project 4: Job Listing Data Acquisition

#### In this notebook, I acquire the job listing data from Indeed.com that I used to perform the analysis for Project 4. I obtained the data using Indeed's API.

#### See the notebook entitled indeed_job_listings_project_Diane for the analysis of this data set.

In [400]:
import pandas as pd
import numpy as np
import requests
import time
from indeed import IndeedClient
import pprint

pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

### API Parameters

Sources: https://ads.indeed.com/jobroll/xmlfeed, https://github.com/indeedlabs/indeed-python

#### Job Search

q - Query. By default terms are ANDed. To see what is possible, use Indeed's advanced search page to perform a search and then check the URL for the q value.

l - Location. Use a postal code or a "city, state/province/region" combination.

userip - The IP number of the end-user to whom the job results will be displayed. This field is required.

useragent - The User-Agent (browser) of the end-user to whom the job results will be displayed. This can be obtained from the "User-Agent" HTTP request header from the end-user. This field is required.

format - Format. Which output format of the API you wish to use. The options are "xml" and "json". Default is "json". The IndeedClient requests and parses a JSON response by default. If you wish to use the XML format, requests will be performed with the raw parameter set to True; see raw.

raw - A boolean. Receive the raw JSON/XML response from the Indeed API. Use together with format to specify which response format you would like. Default is False.

sort - Sort by relevance or date. Default is relevance.

radius - Distance from search location ("as the crow flies"). Default is 25.

start - Start results at this result number, beginning with 0. Default is 0.

limit - Maximum number of results returned per query. Default is 10; maximum is 25.

fromage - Number of days back to search.

highlight - Setting this value to 1 will bold terms in the snippet that are also present in q. Default is 0.

filter - Filter duplicate results. 0 turns off duplicate job filtering. Default is 1.

latlong - If latlong=1, returns latitude and longitude information for each job result. Default is 0.

co - Search within country specified. Default is us.
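As a minimal sketch of how the `start` and `limit` parameters combine for paging, the helpers below build the keyword arguments for one request and list the `start` offsets needed to cover an estimated number of listings. Both `build_search_params` and `page_starts` are hypothetical names of my own, not part of indeed-python, and the code assumes Python 3.

```python
# Hypothetical helpers (not part of indeed-python).
PAGE_SIZE = 25  # maximum value of `limit`, per the parameter list above


def build_search_params(query, location, userip, useragent, start=0):
    """Return the keyword arguments for a single search request."""
    return {
        "q": query,
        "l": location,
        "userip": userip,        # required by the API
        "useragent": useragent,  # required by the API
        "sort": "date",
        "limit": PAGE_SIZE,
        "start": start,          # zero-based offset of the first result
    }


def page_starts(expected_total, page_size=PAGE_SIZE):
    """Return the `start` offsets needed to cover `expected_total` results."""
    n_pages = (expected_total + page_size - 1) // page_size  # ceiling division
    return [page * page_size for page in range(n_pages)]


print(page_starts(60))  # [0, 25, 50]
```

Each offset can then be passed as `start` to `build_search_params`, and the resulting dict unpacked into `client.search(**params)`.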

#### Job Details

jobkeys - Job keys. A list of job keys specifying the jobs to look up. This parameter is required.

format - Format. Which output format of the API you wish to use. The options are "xml" and "json". Default is "json". The IndeedClient requests and parses a JSON response by default. If you wish to use the XML format, requests will be performed with the raw parameter set to True; see raw.

raw - A boolean. Receive the raw JSON/XML response from the Indeed API. Use together with format to specify which response format you would like. Default is False.
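I do not use the job details endpoint in this notebook, but a lookup by `jobkeys` might be sketched as follows. This assumes that `client.jobs(jobkeys=...)` returns a dict with a `'results'` list, like the search response does. The helper names, the batch size, and the `FakeClient` stand-in (used so the sketch runs offline) are all my own assumptions.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def fetch_job_details(client, jobkeys, batch_size=10):
    """Look up job details in batches and return the combined results.

    Assumes the response is a dict with a 'results' list; `batch_size`
    is an arbitrary choice, not a documented API limit.
    """
    details = []
    for batch in chunked(list(jobkeys), batch_size):
        response = client.jobs(jobkeys=tuple(batch))
        details.extend(response.get("results", []))
    return details


class FakeClient(object):
    """Offline stand-in for IndeedClient, used only to exercise the helper."""

    def jobs(self, jobkeys):
        return {"results": [{"jobkey": key} for key in jobkeys]}


found = fetch_job_details(
    FakeClient(), ["4144fdc8da99c5f6", "fc7ab33591f2435b"], batch_size=1
)
print([job["jobkey"] for job in found])
```

With a real client, `FakeClient()` would be replaced by the `IndeedClient` instance created in the next cell.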

In [2]:
publisher_id = 'redacted'

client = IndeedClient(publisher = publisher_id)

In [64]:
# Testing out the API to get a functioning search going

params = {
    'q' : "data or scientist or analyst or engineer or business or intelligence",
    'l' : "Seattle, WA",
    'userip' : "10.1.6.148",
    'useragent' : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
    'sort' : 'date',
    'limit' : '25'
}

search_response = client.search(**params)

search_response

{u'dupefilter': True,
 u'end': 25,
 u'highlight': True,
 u'location': u'Seattle, WA',
 u'pageNumber': 0,
 u'paginationPayload': u'',
 u'query': u'data or scientist or analyst or engineer or business or intelligence',
 u'radius': 25,
 u'results': [{u'city': u'Seattle',
   u'company': u'Parsons Corporation',
   u'country': u'US',
   u'date': u'Wed, 31 May 2017 16:29:39 GMT',
   u'expired': False,
   u'formattedLocation': u'Seattle, WA',
   u'formattedLocationFull': u'Seattle, WA 98101',
   u'formattedRelativeTime': u'2 hours ago',
   u'indeedApply': False,
   u'jobkey': u'4144fdc8da99c5f6',
   u'jobtitle': u'Project Engineer -Civil Road Highway',
   u'language': u'en',
   u'onmousedown': u"indeed_clk(this,'2746');",
   u'snippet': u'Civil Project Road &amp; Highway <b>Engineer</b>. Civil Project <b>Engineer</b> \u2013 Road &amp; Highway \u2013 Seattle, Washington. Parsons is now hiring a Civil Project <b>Engineer</b> for our...',
   u'source': u'Parsons Corporation',
   u'sponsored': Fal

In [65]:
type(search_response)

dict

In [66]:
test = pd.DataFrame(search_response['results'])

In [67]:
test.head()

Unnamed: 0,city,company,country,date,expired,formattedLocation,formattedLocationFull,formattedRelativeTime,indeedApply,jobkey,jobtitle,language,onmousedown,snippet,source,sponsored,state,stations,url
0,Seattle,Parsons Corporation,US,"Wed, 31 May 2017 16:29:39 GMT",False,"Seattle, WA","Seattle, WA 98101",2 hours ago,False,4144fdc8da99c5f6,Project Engineer -Civil Road Highway,en,"indeed_clk(this,'2746');",Civil Project Road &amp; Highway <b>Engineer</...,Parsons Corporation,False,WA,,http://www.indeed.com/viewjob?jk=4144fdc8da99c...
1,Redmond,E- Solutionsinc,US,"Wed, 31 May 2017 15:48:09 GMT",False,"Redmond, WA","Redmond, WA",3 hours ago,True,fc7ab33591f2435b,System Test Engineer,en,"indeed_clk(this,'2746');",System Testing <b>Engineer</b>*. Looking for L...,Indeed,False,WA,,http://www.indeed.com/viewjob?jk=fc7ab33591f24...
2,Redmond,E- Solutionsinc,US,"Wed, 31 May 2017 15:42:35 GMT",False,"Redmond, WA","Redmond, WA",3 hours ago,True,b019bdc1676fbbfa,System Testing Engineer,en,"indeed_clk(this,'2746');",System Testing <b>Engineer</b>*. Looking for L...,Indeed,False,WA,,http://www.indeed.com/viewjob?jk=b019bdc1676fb...
3,Redmond,HMG America LLC,US,"Wed, 31 May 2017 15:14:20 GMT",False,"Redmond, WA","Redmond, WA",4 hours ago,True,b76a9dca78417439,SCOM Engineer (Local candidates preferred),en,"indeed_clk(this,'2746');",The SCOM <b>Engineer</b> will be responsible f...,Indeed,False,WA,,http://www.indeed.com/viewjob?jk=b76a9dca78417...
4,Redmond,AvalonBay Communities,US,"Wed, 31 May 2017 13:31:26 GMT",False,"Redmond, WA","Redmond, WA",5 hours ago,False,d41a2a26149362f9,Project Engineer,en,"indeed_clk(this,'2746');","AvalonBay Communities, Inc., a national develo...",AvalonBay Communities,False,WA,,http://www.indeed.com/viewjob?jk=d41a2a2614936...


In [68]:
test.shape

(25, 19)

In [None]:
# From this first search, I notice that the listings returned by the keyword search above miss the mark.
# Instead, I make requests with more targeted search terms below, which together form my data set. The search terms
# are "data scientist," "data analyst," "data engineer," "business intelligence," and "machine learning."

In [226]:
df_ds = pd.DataFrame()

In [227]:
start_value = 0

# The range size was determined by performing a similar search manually on indeed.com to get an estimate of the number
# of listings each search should yield.

for i in range(12):
    params = {
        'q' : '"data scientist"',
        'l' : 'Seattle, WA',
        'userip' : '10.1.6.148',
        'useragent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
        'sort' : 'date',
        'limit' : '25',
        'start' : start_value
        }
    search_response = client.search(**params)
    df_ds = pd.concat([df_ds, pd.DataFrame(search_response['results'])])
    start_value += 25
    time.sleep(1) # sleep for 1 second between requests to slow down the search
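The same loop is repeated for each search term in the cells that follow. A reusable version could look like the sketch below. `fetch_listings` is a hypothetical helper of my own, and `FakeClient` is an offline stand-in for `IndeedClient` so the example runs without network access. Whether the real API returns an empty page once `start` passes the last result is an assumption; the early `break` only helps if it does.

```python
import time


def fetch_listings(client, query, n_pages, base_params, page_size=25, pause=1.0):
    """Collect up to `n_pages` pages of search results as a list of dicts."""
    rows = []
    for page in range(n_pages):
        params = dict(base_params, q=query, limit=page_size, start=page * page_size)
        results = client.search(**params).get("results", [])
        if not results:  # stop early once nothing more comes back
            break
        rows.extend(results)
        time.sleep(pause)  # pause between requests to slow down the search
    return rows


class FakeClient(object):
    """Offline stand-in for IndeedClient that serves 60 fake listings."""

    def search(self, **params):
        start, limit = params["start"], params["limit"]
        keys = range(start, min(start + limit, 60))
        return {"results": [{"jobkey": "job%03d" % k} for k in keys]}


base = {
    "l": "Seattle, WA",
    "userip": "10.1.6.148",
    "useragent": "Mozilla/5.0",
    "sort": "date",
}
rows = fetch_listings(FakeClient(), '"data scientist"', 12, base, pause=0)
print(len(rows))  # 60
```

Passing the collected list to `pd.DataFrame(rows)` once at the end also avoids calling `pd.concat` inside the loop.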

In [228]:
df_ds.shape

(252, 19)

In [229]:
# Checking for duplicates. I'll delete them later, once I have all of my data from the API.
len(df_ds['jobkey'].unique())

167
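The gap between the row count and the number of unique `jobkey` values is the number of repeated listings. A small sketch on made-up data shows how the three quantities relate; the jobkeys and titles below are invented for illustration.

```python
import pandas as pd

# Toy data standing in for search results; the jobkeys are made up.
df = pd.DataFrame({
    "jobkey": ["a1", "b2", "a1", "c3", "b2"],
    "jobtitle": ["Data Scientist", "Analyst", "Data Scientist", "Engineer", "Analyst"],
})

n_rows = len(df)
# Equivalent to len(df['jobkey'].unique()) when no jobkey is missing.
n_unique = df["jobkey"].nunique()
# Rows whose jobkey already appeared in an earlier row.
n_repeats = int(df["jobkey"].duplicated().sum())

print(n_rows, n_unique, n_repeats)  # 5 3 2
```

Here `n_rows - n_repeats` equals `n_unique`, which is the row count to expect after dropping duplicates on `jobkey`.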

In [230]:
df_da = pd.DataFrame()

In [231]:
start_value = 0

for i in range(8):
    params = {
        'q' : '"data analyst"',
        'l' : 'Seattle, WA',
        'userip' : '10.1.6.148',
        'useragent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
        'sort' : 'date',
        'limit' : '25',
        'start' : start_value
        }
    search_response = client.search(**params)
    df_da = pd.concat([df_da, pd.DataFrame(search_response['results'])])
    start_value += 25
    time.sleep(1) # sleep for 1 second between requests to slow down the search

In [232]:
df_da.shape

(167, 19)

In [233]:
len(df_da['jobkey'].unique())

139

In [235]:
df_de = pd.DataFrame()

In [236]:
start_value = 0

for i in range(12):
    params = {
        'q' : '"data engineer"',
        'l' : 'Seattle, WA',
        'userip' : '10.1.6.148',
        'useragent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
        'sort' : 'date',
        'limit' : '25',
        'start' : start_value
        }
    search_response = client.search(**params)
    df_de = pd.concat([df_de, pd.DataFrame(search_response['results'])])
    start_value += 25
    time.sleep(1) # sleep for 1 second between requests to slow down the search

In [237]:
df_de.shape

(252, 19)

In [238]:
len(df_de['jobkey'].unique())

119

In [243]:
df_bi = pd.DataFrame()

In [244]:
start_value = 0

for i in range(45):
    params = {
        'q' : '"business intelligence"',
        'l' : 'Seattle, WA',
        'userip' : '10.1.6.148',
        'useragent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
        'sort' : 'date',
        'limit' : '25',
        'start' : start_value
        }
    search_response = client.search(**params)
    df_bi = pd.concat([df_bi, pd.DataFrame(search_response['results'])])
    start_value += 25
    time.sleep(1) # sleep for 1 second between requests to slow down the search

In [245]:
df_bi.shape

(1099, 19)

In [246]:
len(df_bi['jobkey'].unique())

823

In [263]:
df_ml = pd.DataFrame()

In [264]:
start_value = 0

for i in range(45):
    params = {
        'q' : '"machine learning"',
        'l' : 'Seattle, WA',
        'userip' : '10.1.6.148',
        'useragent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36',
        'sort' : 'date',
        'limit' : '25',
        'start' : start_value
        }
    search_response = client.search(**params)
    df_ml = pd.concat([df_ml, pd.DataFrame(search_response['results'])])
    start_value += 25
    time.sleep(1) # sleep for 1 second between requests to slow down the search

In [265]:
df_ml.shape

(1125, 19)

In [266]:
len(df_ml['jobkey'].unique())

1025

In [247]:
# Adding a column to each DataFrame that records which search term retrieved each job listing.
# Note that listings from the "data scientist" search are labeled 'data science'.
df_ds['search_term'] = 'data science'
df_da['search_term'] = 'data analyst'
df_de['search_term'] = 'data engineer'
df_bi['search_term'] = 'business intelligence'
df_ml['search_term'] = 'machine learning'

In [249]:
# Removing duplicate rows and checking the number of rows against the unique jobkey counts calculated above.
df_ds = df_ds.drop_duplicates(['jobkey'], keep='first')

df_ds.shape

(167, 20)

In [250]:
df_da = df_da.drop_duplicates(['jobkey'], keep='first')
df_da.shape

(139, 20)

In [251]:
df_de = df_de.drop_duplicates(['jobkey'], keep='first')
df_de.shape

(119, 20)

In [252]:
df_bi = df_bi.drop_duplicates(['jobkey'], keep='first')
df_bi.shape

(823, 20)

In [268]:
df_ml = df_ml.drop_duplicates(['jobkey'], keep='first')
df_ml.shape

(1025, 20)

In [437]:
# Combining the job listings into a single DataFrame
jobs = pd.concat([df_ds, df_da, df_de, df_bi, df_ml])

jobs.shape

(2273, 20)

In [438]:
len(jobs['jobkey'].unique())

1960

In [439]:
jobs = jobs.drop_duplicates(['jobkey'], keep='first')

jobs.shape

(1960, 20)

In [440]:
jobs.describe()

Unnamed: 0,city,company,country,date,expired,formattedLocation,formattedLocationFull,formattedRelativeTime,indeedApply,jobkey,jobtitle,language,onmousedown,snippet,source,sponsored,state,stations,url,search_term
count,1960,1960,1960,1960,1960,1960,1960,1960,1960,1960,1960,1960,1960,1960,1960,1960,1960,1960.0,1960,1960
unique,16,439,1,1805,1,16,51,49,2,1960,1681,1,92,1618,331,1,1,1.0,1960,5
top,Seattle,Amazon Corporate LLC,US,"Fri, 19 May 2017 22:37:30 GMT",False,"Seattle, WA","Seattle, WA",30+ days ago,False,1472af44cec343dc,Data Scientist,en,"indeed_clk(this,'4341');","After a highly successful IPO in 2013, Tableau...",Amazon.com,False,WA,,http://www.indeed.com/viewjob?jk=28bb98f9d7dea...,machine learning
freq,1422,658,1960,5,1960,1422,1241,991,1699,1,26,1960,25,60,758,1960,1960,1960.0,1,830


In [441]:
jobs['search_term'].value_counts()

machine learning         830
business intelligence    725
data science             167
data analyst             137
data engineer            101
Name: search_term, dtype: int64
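Apart from 'data science', these counts are lower than the per-term unique counts found earlier (for example, 137 versus 139 for "data analyst"). That follows from `keep='first'`: a listing returned by several searches keeps only the label of the first DataFrame in the `pd.concat` order. The sketch below reproduces the effect on made-up jobkeys and shows one possible way to retain every matching term instead. The `groupby` variant is my own suggestion, not something used in this project.

```python
import pandas as pd

# Made-up jobkeys; "k2" is returned by both searches.
first = pd.DataFrame({"jobkey": ["k1", "k2"], "search_term": "data analyst"})
second = pd.DataFrame({"jobkey": ["k2", "k3"], "search_term": "machine learning"})

combined = pd.concat([first, second], ignore_index=True)

# keep='first' assigns the shared listing "k2" to the earlier search term.
deduped = combined.drop_duplicates(subset=["jobkey"], keep="first")
counts = deduped["search_term"].value_counts().to_dict()
print(counts)  # {'data analyst': 2, 'machine learning': 1}

# Alternative: record every term that matched each listing.
all_terms = combined.groupby("jobkey")["search_term"].agg(
    lambda terms: ", ".join(sorted(set(terms)))
)
print(all_terms["k2"])  # data analyst, machine learning
```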

In [442]:
jobs['search_term'].isnull().sum()

0

In [443]:
# Saving the data set to a CSV file

jobs.to_csv('indeed_job_listings.csv', sep=',', encoding='utf-8', index=False)
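As a quick sanity check on the saved file, the data can be read back and compared with the original. The sketch below does this round trip in memory on made-up rows, using `io.StringIO` in place of a file on disk. It assumes, as the `describe()` output above suggests, that `stations` holds only empty strings; pandas reads those back as missing values.

```python
import io

import pandas as pd

# Made-up rows; one title contains a comma to exercise CSV quoting.
jobs_demo = pd.DataFrame({
    "jobkey": ["k1", "k2"],
    "jobtitle": ["Data Scientist", "Data Engineer, Senior"],
    "stations": ["", ""],
})

buffer = io.StringIO()
jobs_demo.to_csv(buffer, sep=",", index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape)                       # (2, 3)
print(restored["jobtitle"].tolist())        # titles survive, comma included
print(restored["stations"].isnull().all())  # empty strings come back as NaN
```

The analysis notebook should therefore expect `stations` to load as an all-missing column rather than as empty strings.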