# EMSI BG JOB BOARD SCRAPING PROJECT - Dated 3/7/22

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from seleniumwire import webdriver
import ast
import os

### Step 1: Check for access

I will first check whether I can easily access this website with a simple requests.get call.

In [2]:
url = 'https://www.economicmodeling.com/open-positions/'

requests.get(url).text

'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

Oh no! My request was blocked with a 403 Forbidden response. I'll have to change my approach to accessing this website, so I'll try again with a custom User-Agent header.

In [3]:
headers = {
    'User-Agent':'not-suspicious-bot'
}

job_board_html = requests.get(url, headers=headers)
job_board_html

<Response [200]>

Now that's more like it.

### Step 2: Check for an API

I could certainly do this by hand by inspecting the network tab in my browser, but I'm going to let Python handle it for me. Selenium Wire lets me inspect the requests a page makes, which is what I would do manually to check whether a website has an API. The following code collects the URL of every request that received a response; the cell after it keeps only the URLs containing 'api'.

In [4]:
#create a new instance of chrome driver with the appropriate options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--ignore-certificate-errors-spki-list')
chrome_options.add_argument('--ignore-ssl-errors')

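#have to put your local path to the driver!
#note: executable_path is deprecated in Selenium 4, where newer releases expect a Service object instead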
driver = webdriver.Chrome(executable_path=r"[INSERT PATH]chromedriver.exe", options=chrome_options)


#grab the emsi page
driver.get(url)

#create a blank array to store responses
requests_array = []

#collect the url of every captured request that received a response
for request in driver.requests:
    if request.response:
        requests_array.append(request.url)

In [5]:
#now check and see if there is anything with 'api'
requests_df = pd.DataFrame(requests_array)
api_df = requests_df.loc[requests_df[0].str.contains('api', case=False)].reset_index()
api_df

Unnamed: 0,index,0
0,40,https://www.economicmodeling.com/wp-content/up...
1,41,https://www.economicmodeling.com/wp-content/up...
2,128,https://api.lever.co/v0/postings/economicmodel...
3,149,https://api.hubspot.com/livechat-public/v1/mes...
4,169,https://api.hubspot.com/cartographer/v1/rhumb?...
5,172,https://api.hubspot.com/livechat-public/v1/bot...


I now have 6 different URLs that all contain the substring "api". I'm going to print each one in full to see if I can find some more information.

In [6]:
for api in api_df[0]:
    print(api)
    print()

https://www.economicmodeling.com/wp-content/uploads/2021/04/icon-api-cloud.svg

https://www.economicmodeling.com/wp-content/uploads/2021/04/icon-indigo-api-cloud.svg

https://api.lever.co/v0/postings/economicmodeling?group=team&mode=json

https://api.hubspot.com/livechat-public/v1/message/public?portalId=4906807&conversations-embed=static-1.9719&mobile=false&messagesUtk=c90a0dee98ee492fb0b8b23577cdbe35&traceId=c90a0dee98ee492fb0b8b23577cdbe35

https://api.hubspot.com/cartographer/v1/rhumb?hs_static_app=conversations-visitor-ui&hs_static_app_version=1.12180

https://api.hubspot.com/livechat-public/v1/bots/public/bot/261426/welcomeMessages?hs_static_app=conversations-visitor-ui&hs_static_app_version=1.12180&conversations-visitor-ui=static-1.12180&traceId=c90a0dee98ee492fb0b8b23577cdbe35&sessionId=AMOaWbLeaohdcuL9W3YBTFs_b7QhmxLJ1pSYjcaD3qYNICgERRW-JFKB_ZGKSCzmZ_Z6ZLztZbEm7nqdwF52nePbD2-hLAkzbiNZd8lGzqOj-w8ZRsFlqJx1oeHCSr1ckhGRdVsijO9WBrRb9r_RABfV0Pxwl4xbMLddz1CEQutTqdCs29gk9mA



I see that the first two URLs are .svg files, so I will ignore those, but the other URLs might be what I'm looking for.
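
Skipping the first two rows by position works here, but the same filtering can be sketched by looking at the structure of each URL instead. This is a minimal illustration rather than what the notebook runs: `looks_like_api` and `STATIC_SUFFIXES` are hypothetical names, and the heuristic itself (an `api` label in the hostname and no static-asset extension in the path) is an assumption that would need adjusting for other sites.

```python
from urllib.parse import urlparse

# Hypothetical list of extensions treated as static assets (an assumption).
STATIC_SUFFIXES = ('.svg', '.png', '.jpg', '.css', '.js', '.woff2')

def looks_like_api(url):
    """Heuristic: the hostname has an 'api' label and the path is not a static file."""
    parsed = urlparse(url)
    host_labels = (parsed.hostname or '').lower().split('.')
    is_static = parsed.path.lower().endswith(STATIC_SUFFIXES)
    return 'api' in host_labels and not is_static

# Three of the URLs printed above.
candidates = [
    'https://www.economicmodeling.com/wp-content/uploads/2021/04/icon-api-cloud.svg',
    'https://api.lever.co/v0/postings/economicmodeling?group=team&mode=json',
    'https://api.hubspot.com/cartographer/v1/rhumb?hs_static_app=conversations-visitor-ui&hs_static_app_version=1.12180',
]

api_urls = [u for u in candidates if looks_like_api(u)]
for u in api_urls:
    print(u)
```

The .svg icon is dropped because its hostname has no `api` label, while the Lever and HubSpot endpoints are kept.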

In [7]:
for api in api_df[0][2:]:
    try: 
        print(api)
        print(pd.DataFrame(requests.get(api, headers=headers).json()))
        print()
    except Exception as e:
        print(e)
        print()

https://api.lever.co/v0/postings/economicmodeling?group=team&mode=json
                    title                                           postings
0         Client Services  [{'additionalPlain': 'Emsi Burning Glass is an...
1      Data and Analytics  [{'additionalPlain': 'This position reports to...
2               Economics  [{'additionalPlain': 'At Emsi Burning Glass, w...
3             Engineering  [{'additionalPlain': 'Emsi Burning Glass is an...
4                 Finance  [{'additionalPlain': 'Emsi Burning Glass is an...
5                   Legal  [{'additionalPlain': 'Opportunities within Ems...
6               Marketing  [{'additionalPlain': 'Emsi Burning Glass is an...
7                 Product  [{'additionalPlain': 'Emsi Burning Glass is an...
8   Professional Services  [{'additionalPlain': 'Emsi Burning Glass is an...
9        Public Relations  [{'additionalPlain': 'Emsi Burning Glass is an...
10                  Sales  [{'additionalPlain': 'Emsi Burning Glass is an...
11   

Bingo! The third URL seems to be what I was looking for. I'll double-check to make sure.

In [18]:
pd.DataFrame(requests.get(api_df[0][2], headers=headers).json())

Unnamed: 0,title,postings
0,Client Services,[{'additionalPlain': 'Emsi Burning Glass is an...
1,Data and Analytics,[{'additionalPlain': 'This position reports to...
2,Economics,"[{'additionalPlain': 'At Emsi Burning Glass, w..."
3,Engineering,[{'additionalPlain': 'Emsi Burning Glass is an...
4,Finance,[{'additionalPlain': 'Emsi Burning Glass is an...
5,Legal,[{'additionalPlain': 'Opportunities within Ems...
6,Marketing,[{'additionalPlain': 'Emsi Burning Glass is an...
7,Product,[{'additionalPlain': 'Emsi Burning Glass is an...
8,Professional Services,[{'additionalPlain': 'Emsi Burning Glass is an...
9,Public Relations,[{'additionalPlain': 'Emsi Burning Glass is an...


### Step 3: Organize the data

There is a bit to unpack here: each category row holds a nested list of postings, so I'll start separating the data out.

In [61]:
job_api = pd.DataFrame(requests.get('https://api.lever.co/v0/postings/economicmodeling?group=team&mode=json', headers=headers).json())
job_api = job_api.rename(columns={'title':'category'}) #the titles didn't really seem like job titles, more like categories
category_array = []

for index, row in job_api.iterrows():
    path = f"{row['category']}.csv"
    pd.DataFrame(row['postings']).to_csv(path, index=False)
    category_array.append(path)


Now let's see what I'm working with.

In [56]:
pd.read_csv(category_array[0], index_col=None)

Unnamed: 0,additionalPlain,additional,categories,createdAt,descriptionPlain,description,id,lists,text,hostedUrl,applyUrl
0,Emsi Burning Glass is an equal opportunity emp...,<div><i>Emsi Burning Glass is an equal opportu...,"{'commitment': 'Full Time', 'department': 'Edu...",1641404323821,"We are seeking a collaborative, detail-oriente...","<div>We are seeking a collaborative, detail-or...",390d26a0-e45c-4f6d-9bfd-ad7c50da188c,"[{'text': 'In this role, you would: ', 'conten...",Education Success Team Lead,https://jobs.lever.co/economicmodeling/390d26a...,https://jobs.lever.co/economicmodeling/390d26a...


That's pretty good! Now I'm going to pick out what I want and leave out any fluff.

In [62]:
final_array = []
for category in category_array:
    #these are the columns I want from the json output
    final_df = pd.read_csv(category, index_col=None)[['text', 'descriptionPlain', 'categories', 'hostedUrl']]

    #after the csv round trip, categories is the string form of a python dict (single quotes, so not valid json);
    #ast.literal_eval parses it back into a dict so that the location can be pulled out
    final_df['categories'] = final_df['categories'].apply(lambda x: ast.literal_eval(x).get('location'))

    final_df.rename(columns={'categories':'location', 'text':'job_title', 'descriptionPlain':'description'}, inplace=True)
    final_df['department'] = category.replace('.csv', '')
    final_array.append(final_df)
    os.remove(category)
pd.concat(final_array).reset_index(drop=True).to_csv('job_listings.csv', index=False)
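
The ast.literal_eval call matters because writing a dictionary to CSV stores its Python string representation, which uses single quotes and is therefore not valid JSON. Here is a small sketch of the difference, using a shortened sample value modelled on the output shown earlier (an abbreviated illustration, not the full API payload):

```python
import ast
import json

# Illustrative sample: a dict as it looks after being written to and read back from CSV.
raw = "{'commitment': 'Full Time', 'location': 'Moscow, ID'}"

try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    # Single-quoted keys and values are not valid JSON.
    parsed_as_json = False

# literal_eval evaluates only Python literals (dicts, lists, strings, numbers, ...).
categories = ast.literal_eval(raw)
location = categories.get('location')

print(parsed_as_json)
print(location)
```

Unlike eval, ast.literal_eval refuses anything that is not a plain literal, which makes it a reasonable choice for data that came from an external source.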

And that should do it! The final product is a CSV file named job_listings.csv.

The column names in the CSV file are as follows: job_title, description, location, hostedUrl, department.
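
The temporary per-category CSV files are one way to unpack the nested response. As an alternative sketch (not the approach used above), the grouped payload can be flattened in memory, which also avoids the string round trip for the categories field. The function name `flatten_postings` and the two-record `sample` are hypothetical: the sample only mimics the shape of the response (a list of groups, each with a `title` and a list of `postings`), and its descriptions and URLs are placeholders.

```python
def flatten_postings(groups):
    """Turn [{'title': ..., 'postings': [...]}, ...] into one flat list of row dicts."""
    rows = []
    for group in groups:
        for posting in group.get('postings', []):
            # 'categories' is still a real dict here, so no string parsing is needed.
            categories = posting.get('categories') or {}
            rows.append({
                'job_title': posting.get('text'),
                'description': posting.get('descriptionPlain'),
                'location': categories.get('location'),
                'hostedUrl': posting.get('hostedUrl'),
                'department': group.get('title'),
            })
    return rows

# Hypothetical sample mimicking the response shape; values are placeholders.
sample = [
    {
        'title': 'Data and Analytics',
        'postings': [
            {
                'text': 'Data Curator',
                'descriptionPlain': 'Placeholder description.',
                'categories': {'location': 'Moscow, ID'},
                'hostedUrl': 'https://example.com/jobs/1',
            },
        ],
    },
    {
        'title': 'Economics',
        'postings': [
            {
                'text': 'Research Analyst',
                'descriptionPlain': 'Placeholder description.',
                'categories': {'location': 'Boston, Massachusetts'},
                'hostedUrl': 'https://example.com/jobs/2',
            },
        ],
    },
]

rows = flatten_postings(sample)
for row in rows:
    print(row['department'], '|', row['job_title'], '|', row['location'])
```

Passing `rows` to pd.DataFrame and then calling to_csv('job_listings.csv', index=False) should give the same five columns without creating or deleting intermediate files.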

In [63]:
pd.read_csv('job_listings.csv')

Unnamed: 0,job_title,description,location,hostedUrl,department
0,Education Success Team Lead,"We are seeking a collaborative, detail-oriente...","Moscow, ID",https://jobs.lever.co/economicmodeling/390d26a...,Client Services
1,Data Curator,A Data Curator at Emsi BG is responsible for e...,"Moscow, ID",https://jobs.lever.co/economicmodeling/d5ff570...,Data and Analytics
2,"Data Engineer, Web Scraping",Emsi Burning Glass is looking for a talented D...,Remote (Work From Home) US,https://jobs.lever.co/economicmodeling/1411c67...,Data and Analytics
3,Data Scientist,A Data Scientist at Emsi Burning Glass works w...,Remote (Work From Home) US,https://jobs.lever.co/economicmodeling/e846f31...,Data and Analytics
4,Director of Data Analytics,The Director of Data Analytics leads data scie...,"Boston, Massachusetts",https://jobs.lever.co/economicmodeling/932d576...,Data and Analytics
5,Machine Learning Engineer,"At Emsi Burning Glass, Machine Learning Engine...",Remote (Work From Home) US,https://jobs.lever.co/economicmodeling/f75b465...,Data and Analytics
6,Senior Data Scientist,The Senior Data Scientist at Emsi Burning Glas...,Remote (Work From Home) US,https://jobs.lever.co/economicmodeling/6c85559...,Data and Analytics
7,Software Engineer,Emsi Burning Glass is looking for a skilled so...,"Moscow, ID",https://jobs.lever.co/economicmodeling/e9fa059...,Data and Analytics
8,Thought Leadership Research Analyst,We seek a Research Analyst to join our Thought...,"Boston, Massachusetts",https://jobs.lever.co/economicmodeling/90e27c9...,Economics
9,API Developer,Emsi Burning Glass is a trusted advisor on lab...,"Moscow, ID",https://jobs.lever.co/economicmodeling/d2f21e2...,Engineering
