# EDF Project Notebook

## Overview and Motivation

The goal of this project was to analyze several major job posting boards and determine which are most useful. Rather than looking for a single best job posting board, we hoped to find the best board for different situations.

## Initial Question

Initially we hoped to analyze job descriptions. Unfortunately, we were unable to retrieve the descriptions, for two main reasons: some job posting boards deliberately hide this data, and others link back to the employer's own site for applications.

### Data Collection

We collected data from four job sites: LinkedIn, Monster, Indeed, and Glassdoor. For LinkedIn and Monster, we scraped data from their standard search interfaces. For Indeed, we used its API. The only data available from Glassdoor is aggregated data from the Glassdoor API, so it contains different information from the other data sources.
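
The collection cells below all page through search results until enough have been gathered. A minimal sketch of that loop, with a stop condition for an empty page, might look like the following. `collect_results`, `fetch_page`, and `fake_fetch` are hypothetical names used only for illustration and are not part of any job board's API.

```python
from typing import Callable, List


def collect_results(fetch_page: Callable[[int], List[dict]],
                    n_results: int,
                    max_pages: int = 100) -> List[dict]:
    """Gather up to n_results items by requesting pages 1, 2, 3, ...

    Stops early when a page comes back empty or max_pages is reached,
    so a search with too few hits cannot loop forever.
    """
    results: List[dict] = []
    for page in range(1, max_pages + 1):
        page_results = fetch_page(page)
        if not page_results:
            break  # no more results available
        results.extend(page_results)
        if len(results) >= n_results:
            break
    return results[:n_results]  # trim any overshoot from the last page


def fake_fetch(page: int) -> List[dict]:
    """Hypothetical fetcher: 3 pages of 25 results each, then nothing."""
    if page > 3:
        return []
    return [{'page': page, 'index': i} for i in range(25)]


print(len(collect_results(fake_fetch, 60)))    # 60
print(len(collect_results(fake_fetch, 1000)))  # 75
```

The second call asks for more results than the fake source has, and the loop ends after the first empty page instead of requesting pages indefinitely.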

In [1]:
%matplotlib inline

import requests
import pandas as pd
from bs4 import BeautifulSoup
import json
from geopy.geocoders import Nominatim
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import re
from bokeh.charts import Bar, output_file, show
from indeed import IndeedClient

In [None]:
QUERY = 'Developer'
LOCATION = '' # location should be full state name (e.g. Texas) or nothing for all states
N_RESULTS = 1000
DATA_FILE_NAME = 'monster_results_1000.json'

def get_html_page(page):
    request = 'https://www.monster.com/jobs/search/?q='
    query = QUERY.replace(' ', '-')
    request += query
    request += '&where='
    request += LOCATION
    request += '&page='
    request += str(page)
    
    return requests.get(request)

def get_page_results(page):
    r = get_html_page(page)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('script', {'type' : 'application/ld+json'}) # Results come in script tag
    return results[1:] # Drop the first tag (search info); safe even if the page is empty

def get_results():
    results = []
    current_page = 1
    # Add results page by page
    while len(results) < N_RESULTS:
        page_results = get_page_results(current_page)
        if not page_results:
            break # Stop when the search runs out of results
        results += page_results
        current_page += 1
    # Remove extra results from end
    return results[:N_RESULTS]

def make_json(results):
    cleaned = []
    for result in results:
        result_text = result.string # Get result string
        # Clean out new lines and collapse runs of spaces
        result_text = result_text.replace('\r\n', '')
        result_text = re.sub(' +', ' ', result_text)
        cleaned.append(result_text)
    # Join the results with ',' and wrap them in a json object
    json_string = '{ "results" : [' + ','.join(cleaned) + ']}'
    # Prettify
    json_string = json.loads(json_string)
    json_string = json.dumps(json_string, sort_keys=True, indent=4)
    return json_string
    
results = get_results()
json_string = make_json(results)

with open(DATA_FILE_NAME, 'w') as file:
    file.write(json_string)

In [None]:
QUERY = 'Developer'
LOCATION = '' # location should be lowercase two letter state abbreviation (e.g. tx) or nothing for all states
N_RESULTS = 1000
DATA_FILE_NAME = 'linkedin_results_1000.json'

def get_html_page(start):
    request = 'https://www.linkedin.com/jobs/search?keywords='
    query = QUERY.replace(' ', '%20') # Correct spaces
    request += query
    if len(LOCATION) > 0:
        request += '&locationId=STATES.us.'
        request += LOCATION
    request += '&orig=JSERP&start='
    request += str(start)
    request += '&count=25&trk=jobs_jserp_pagination_'
    request += str((start // 25) + 1) # Calculate page number
    
    return requests.get(request)

def get_html_page_results(start):
    results = []
    # Request page until it doesn't redirect
    while len(results) < 1:
        html = get_html_page(start)
        soup = BeautifulSoup(html.text, 'html.parser')
        results = soup.find_all('code', {'id' : 'decoratedJobPostingsModule'})
    
    result_json_string = results[0].string # all results come in one json object
    json_data = json.loads(result_json_string) # get json data
    
    return json_data['elements'] # get search results

def get_results():
    results = []
    current_start = 0
    while len(results) < N_RESULTS:
        results += get_html_page_results(current_start)
        current_start = len(results)
    return results
    
final_results = get_results()
final_json_string = json.dumps({'results' : final_results}, sort_keys=True, indent=4) 

with open(DATA_FILE_NAME, 'w') as file:
    file.write(final_json_string)

In [None]:
client = IndeedClient(publisher=1234567890) # Your key goes here.

results = []
# Request 40 pages of 25 results each (1000 results in total)
for start in range(0, 1000, 25):
    params = {
        'q': "developer",
        'userip': "", # your ip goes here
        'useragent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36",
        'start': start,
        'limit': 25,
        'radius': 25
    }
    search_response = client.search(**params)
    search_response = json.loads(json.dumps(search_response))

    results += search_response['results']

json_string = json.dumps({'results': results}, sort_keys=True, indent=4)

# Use the file name that the analysis cells read
with open('indeed_results_1000.json', 'w') as file:
    file.write(json_string)
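
Because each board returns a different JSON layout, it can help to map every record onto one small common shape before analysis. The sketch below uses the field names that the analysis cells of this notebook read from each source. The sample records and the `normalize_*` helper names are made up for illustration.

```python
def normalize_linkedin(record):
    """Map a LinkedIn search result onto the common shape."""
    posting = record['decoratedJobPosting']
    return {'board': 'LinkedIn',
            'title': posting['jobPosting']['title'],
            'location': posting['cityState']}


def normalize_monster(record):
    """Map a Monster search result onto the common shape."""
    address = record['jobLocation']['address']
    return {'board': 'Monster',
            'title': record['title'],
            'location': address['addressLocality'] + ', ' + address['addressRegion']}


def normalize_indeed(record):
    """Map an Indeed API result onto the common shape."""
    return {'board': 'Indeed',
            'title': record['jobtitle'],
            'location': record['formattedLocation']}


# Made-up sample records that only contain the fields used above
linkedin_sample = {'decoratedJobPosting': {'jobPosting': {'title': 'Java Developer'},
                                           'cityState': 'Dallas, TX'}}
monster_sample = {'title': '.Net Developer',
                  'jobLocation': {'address': {'addressLocality': 'Austin',
                                              'addressRegion': 'TX'}}}
indeed_sample = {'jobtitle': 'Front End Developer',
                 'formattedLocation': 'San Jose, CA'}

rows = [normalize_linkedin(linkedin_sample),
        normalize_monster(monster_sample),
        normalize_indeed(indeed_sample)]
for row in rows:
    print(row['board'], '|', row['title'], '|', row['location'])
```

With every record in the same shape, the location and title steps can loop over one list instead of repeating per-board lookups.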

## Exploratory Data Analysis

Due to the lack of descriptions, we had to find a different part of the data to explore. In the end we settled on the locations and the titles of jobs. To analyze locations, we took 1000 results from each job posting board, plotted them on a map, and examined where the clusters were and how dense they were. Additionally, we counted the number of times keywords appeared in the job titles. For example, we counted appearances of junior and senior to see the number of jobs available at each level.
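
Many postings share the same city, so the geocoding step can reuse earlier lookups instead of querying the geocoder once per posting. Below is a minimal sketch of that idea. `geocode_all` is a hypothetical helper, and `fake_geocode` is a stand-in for a real geocoder call that returns rough, illustrative `(latitude, longitude)` tuples. Results from geopy expose `.latitude` and `.longitude` attributes instead, so a real version would read those.

```python
def geocode_all(place_names, geocode):
    """Return (lons, lats) for the places that could be resolved.

    Each distinct name is looked up only once. Names that resolve to
    None are skipped, mirroring the None check in the Monster cell.
    """
    cache = {}
    lons, lats = [], []
    for name in place_names:
        if name not in cache:
            cache[name] = geocode(name)
        coords = cache[name]
        if coords is not None:
            lat, lon = coords
            lats.append(lat)
            lons.append(lon)
    return lons, lats


calls = []  # records every lookup the fake geocoder receives


def fake_geocode(name):
    """Hypothetical geocoder: knows two cities, returns None otherwise."""
    calls.append(name)
    known = {'Dallas, TX': (32.78, -96.80), 'Austin, TX': (30.27, -97.74)}
    return known.get(name)


places = ['Dallas, TX', 'Austin, TX', 'Dallas, TX', 'Nowhere, ZZ']
lons, lats = geocode_all(places, fake_geocode)
print(len(calls))  # 3 lookups for 4 postings
print(len(lons))   # 3 plotted points, the unresolved place is skipped
```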

In [None]:
with open('linkedin_results_1000.json') as data_file:    
    linkedin_json = json.load(data_file)
    
linkedin_results = linkedin_json['results']

geolocator = Nominatim()
linkedin_lons = []
linkedin_lats = []
for i in range(0, 1000):
    location = geolocator.geocode(linkedin_results[i]['decoratedJobPosting']['cityState'])
    if location is not None: # Skip locations the geocoder cannot resolve
        linkedin_lons.append(location.longitude)
        linkedin_lats.append(location.latitude)

In [None]:
with open('monster_results_1000.json') as data_file:
    monster_json = json.load(data_file)

monster_results = monster_json['results']

geolocator = Nominatim()
monster_lons = []
monster_lats = []
for i in range(0, 1000):
    result = monster_results[i]['jobLocation']['address']['addressLocality'] + ', ' + monster_results[i]['jobLocation']['address']['addressRegion']
    location = geolocator.geocode(result)
    if location is not None:
        monster_lons.append(location.longitude)
        monster_lats.append(location.latitude)

In [None]:
with open('indeed_results_1000.json') as data_file:
    indeed_json = json.load(data_file)

indeed_results = indeed_json['results']

geolocator = Nominatim()
indeed_lons = []
indeed_lats = []
for i in range(0, 1000):
    location = geolocator.geocode(indeed_results[i]['formattedLocation'])
    if location is not None: # Skip locations the geocoder cannot resolve
        indeed_lons.append(location.longitude)
        indeed_lats.append(location.latitude)

In [None]:
fig = plt.figure()

themap = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49, projection='lcc',lat_1=32,lat_2=45,lon_0=-95)
themap.drawcoastlines()
themap.drawcountries()
themap.drawstates()
themap.fillcontinents(color = 'gainsboro')
themap.drawmapboundary(fill_color='steelblue')

linkedin_x, linkedin_y = themap(linkedin_lons, linkedin_lats)
      
themap.plot(linkedin_x, linkedin_y, 
            'o',
            color='Blue',
            markersize=4
            )

plt.show()

fig = plt.figure()

themap = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49, projection='lcc',lat_1=32,lat_2=45,lon_0=-95)
themap.drawcoastlines()
themap.drawcountries()
themap.drawstates()
themap.fillcontinents(color = 'gainsboro')
themap.drawmapboundary(fill_color='steelblue')

monster_x, monster_y = themap(monster_lons, monster_lats)

themap.plot(monster_x, monster_y,
           'o',
            color = 'Purple',
            markersize=4
            )

plt.show()

fig = plt.figure()

themap = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49, projection='lcc',lat_1=32,lat_2=45,lon_0=-95, resolution='f')
themap.drawcoastlines()
themap.drawcountries()
themap.drawstates()
themap.fillcontinents(color = 'gainsboro')
themap.drawmapboundary(fill_color='steelblue')

indeed_x, indeed_y = themap(indeed_lons, indeed_lats)

themap.plot(indeed_x, indeed_y,
           'o',
            color = 'Green',
            markersize=4
            )

plt.show()

In [2]:
with open('linkedin_results_1000.json') as data_file:    
    linkedin_json = json.load(data_file)

linkedin_results = linkedin_json['results']

linkedin_titles = []
for i in range(0, len(linkedin_results)):
    linkedin_titles.append(linkedin_results[i]['decoratedJobPosting']['jobPosting']['title'])
    
with open('monster_results_1000.json') as data_file:
    monster_json = json.load(data_file)
    
monster_results = monster_json['results']

monster_titles = []
for i in range(0, len(monster_results)):
    monster_titles.append(monster_results[i]['title'])
    
with open('indeed_results_1000.json') as data_file:
    indeed_json = json.load(data_file)
    
indeed_results = indeed_json['results']

indeed_titles = []
for i in range(0, len(indeed_results)):
    indeed_titles.append(indeed_results[i]['jobtitle'])
    
names = [
    # Raw strings keep the regex backslashes from being read as string escapes
    (".Net", r"\.net"),
    ("C++", r"c\+\+"),
    ("C#", r"c\#"),
    ("Javascript", "javascript"),
    ("Java ", "java "),
    ("Android", "android"),
    ("Security", "security"),
    ("Full Stack", "full stack"),
    ("Front End", "front end"),
    ("Devops", "devops"),
    ("Web Developer", "web"),
    ("Architect", "architect"),
    ("Data", "data"),
    ("Applications", "applications|(app )"),
    ("Programmer Analyst", "programmer analyst"),
    ("Network", "network"),
    ("Mobile", "mobile"),
    ("IT", r"\b(it|information (technology|systems))\b"), # \b stops "it" from matching inside words such as "Security"
    ("Software Engineer", "software (engineer|developer|development)"),
    ("Systems Engineer", "systems engineer"),
    ("QA", "test|(quality assurance|qa)"),
    ("Intern", "Intern"),
    ("Entry Level", "(new grad)|(entry level)"),
    ("Associate", "associate"),
    ("Mid Level", "mid level"),
    ("Senior", "senior"),
    ("Principal", "principal"),
    ("Staff", "staff"),
    ("Lead", "lead"),
    ("Director", "director"),
]

def get_counts(titles):
    counts = []
    for name in names:
        count = 0
        for title in titles:
            if re.search(name[1], title, flags=re.IGNORECASE) is not None:
                count += 1
        counts.append(count)
    return counts
                
dataframe = pd.DataFrame()

formatted_names = []
for name in names:
    formatted_names.append(name[0])
    
dataframe['Title'] = pd.Series(formatted_names)
dataframe['LinkedIn'] = pd.Series(get_counts(linkedin_titles))
dataframe['Monster'] = pd.Series(get_counts(monster_titles))
dataframe['Indeed'] = pd.Series(get_counts(indeed_titles))

linkedin_graph = Bar(dataframe, 'Title', values='LinkedIn', legend=False)
output_file("linkedin_graph.html")
show(linkedin_graph)
monster_graph = Bar(dataframe, 'Title', values='Monster', legend=False)
output_file("monster_graph.html")
show(monster_graph)
indeed_graph = Bar(dataframe, 'Title', values='Indeed', legend=False)
output_file("indeed_graph.html")
show(indeed_graph)

dataframe

INFO:bokeh.core.state:Session output file 'monster_graph.html' already exists, will be overwritten.
INFO:bokeh.core.state:Session output file 'indeed_graph.html' already exists, will be overwritten.


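| | Title | LinkedIn | Monster | Indeed |
|---|---|---|---|---|
| 0 | .Net | 76 | 115 | 33 |
| 1 | C++ | 19 | 7 | 1 |
| 2 | C# | 27 | 30 | 9 |
| 3 | Javascript | 25 | 35 | 20 |
| 4 | Java | 100 | 149 | 28 |
| 5 | Android | 26 | 17 | 6 |
| 6 | Security | 1 | 1 | 0 |
| 7 | Full Stack | 32 | 43 | 24 |
| 8 | Front End | 43 | 6 | 107 |
| 9 | Devops | 0 | 2 | 1 |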
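
Counts like these depend on how each pattern is matched. `re.search` looks for the pattern anywhere in the title, so a short pattern such as `it` also matches inside longer words. The small illustration below uses made-up titles and a hypothetical `count_matches` helper to compare a bare pattern with one wrapped in `\b` word-boundary anchors.

```python
import re

# Made-up titles for illustration only
titles = ['Security Engineer', 'IT Support Developer', 'Software Architect']


def count_matches(pattern, titles):
    """Count the titles in which the pattern occurs, ignoring case."""
    return sum(1 for title in titles
               if re.search(pattern, title, flags=re.IGNORECASE))


print(count_matches(r'it', titles))      # 3: also hits "Security" and "Architect"
print(count_matches(r'\bit\b', titles))  # 1: only the standalone word "IT"
```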


## Final Analysis

### Location Analysis

The location analysis shows that LinkedIn covers the most locations. However, all three boards have similar clusters in the obvious hotspots (Silicon Valley, Dallas, Washington D.C., etc.). Overall, the location analysis did not reveal as much as we had hoped. Still, it suggests that if you are looking for jobs outside the usual hotspots, you may want to use LinkedIn as your primary search board.
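
One rough way to put a number on how many locations a board covers is to count its distinct geocoded points. The sketch below assumes coordinate lists shaped like `linkedin_lons` and `linkedin_lats` from the geocoding cells. The sample values and the `count_distinct_locations` helper are hypothetical.

```python
def count_distinct_locations(lons, lats, precision=1):
    """Count unique (lon, lat) pairs after rounding to `precision` decimals."""
    return len({(round(lon, precision), round(lat, precision))
                for lon, lat in zip(lons, lats)})


# Made-up coordinates: two postings in the same city plus two other cities
sample_lons = [-96.80, -96.80, -97.74, -122.42]
sample_lats = [32.78, 32.78, 30.27, 37.77]

print(count_distinct_locations(sample_lons, sample_lats))  # 3
```

Rounding groups points that fall within roughly the same area, so the `precision` argument controls how coarse the notion of a location is.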

### Title Analysis

Several trends become apparent from the analysis of the job titles:
1. Indeed is heavily weighted towards web development.
2. Monster has the most Java and .Net jobs.
3. LinkedIn has the most senior-level jobs, and Indeed has the most entry-level jobs.
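
The first two observations can be checked against the rows of the count table shown in the Exploratory Data Analysis section. The sketch below finds the board with the highest count for a few keywords. The counts are copied from that table, while the `leading_board` helper is only illustrative.

```python
# Counts copied from the table in the Exploratory Data Analysis section
counts = {
    '.Net':       {'LinkedIn': 76,  'Monster': 115, 'Indeed': 33},
    'Java':       {'LinkedIn': 100, 'Monster': 149, 'Indeed': 28},
    'Full Stack': {'LinkedIn': 32,  'Monster': 43,  'Indeed': 24},
    'Front End':  {'LinkedIn': 43,  'Monster': 6,   'Indeed': 107},
}


def leading_board(keyword):
    """Return the board with the highest count for the given keyword."""
    row = counts[keyword]
    return max(row, key=row.get)


for keyword in counts:
    print(keyword, '->', leading_board(keyword))
```

The rows needed for the third observation (senior and entry level) are not part of the displayed table, so they are not included here.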