# Inside Higher Ed Job Scraper

Collects all tenured and tenure-track job advertisements at North American four-year institutions.

- **[Query](https://careers.insidehighered.com/jobs/tenured-and-tenure-track/four-year-institution/north-america/)**


Every time you scrape:

1. Load in previous job advertisements
2. Scrape all the *new job advertisements*
3. De-duplicate if necessary
4. Output to DB/CSV
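
The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual implementation: the helper names `load_jobs`, `merge_new_jobs`, and `save_jobs` are hypothetical, and it assumes that each advertisement's `link` value identifies it uniquely.

```python
import csv
from pathlib import Path


def load_jobs(path):
    """Step 1: load previously saved advertisements (no file means no history)."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def merge_new_jobs(previous_rows, new_rows, key="link"):
    """Steps 2-3: append only the rows whose key has not been seen before.

    Assumption: `key` (the job link) uniquely identifies an advertisement.
    """
    seen = {row[key] for row in previous_rows}
    merged = list(previous_rows)
    for row in new_rows:
        if row[key] not in seen:
            seen.add(row[key])
            merged.append(row)
    return merged


def save_jobs(path, rows, fieldnames):
    """Step 4: write every advertisement back out as CSV."""
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

In a notebook like this one, `new_rows` could plausibly come from `uni_data.to_dict("records")`, and the merged rows would be written to a file of your choosing.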


In [1]:
# Data manipulation libraries
import pandas as pd
import numpy as np
# Common web-scraping libraries
from bs4 import BeautifulSoup as bs
import requests

In [40]:
import xmltodict
from datetime import datetime

# Get RSS feed
RSS_FEED = "https://careers.insidehighered.com/jobsrss/?InstitutionType=132&LocationId=4&countrycode=US"
r = requests.get(RSS_FEED, timeout=30)
# Stop early on an HTTP error instead of parsing an error page
r.raise_for_status()

# Parse feed into xml
xml_data = bs(r.content, features='xml')

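# Get the date that the feed was published
try:
    pub_date = xml_data.find('pubDate').text
    pub_date = datetime.strptime(pub_date, '%a, %d %b %Y %H:%M:%S %z')
except (AttributeError, ValueError) as err:
    # find() returns None when the tag is missing (AttributeError on .text);
    # strptime raises ValueError when the date does not match the format.
    raise RuntimeError("Cannot find last RSS feed publication date, aborting") from err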

# Go through each xml item and convert it to a python dictionary
def parse_item(item):
    to_dict = xmltodict.parse(str(item))
    to_dict = dict(to_dict['item'])
    # Drop the guid field if present; it is not needed downstream
    to_dict.pop('guid', None)
    return to_dict

parsed_data = [parse_item(item) for item in xml_data.find_all('item')]

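# Convert the list of dictionaries into a dataframe.
uni_data = pd.DataFrame.from_records(parsed_data)
# Feed titles look like "Institution: Job title". Split on the first colon only (n=1),
# so a job title that itself contains a colon does not produce extra columns.
uni_data[['institution', 'title']] = uni_data['title'].str.split(":", n=1, expand=True)
uni_data['pubDate'] = pd.to_datetime(uni_data['pubDate'], format='%a, %d %b %Y %H:%M:%S %z')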
uni_data

| | title | description | link | pubDate | institution |
|---|---|---|---|---|---|
| 0 | President | Dakota Wesleyan University:\nDakota Wesleyan U... | https://careers.insidehighered.com/job/2163178... | 2021-06-08 16:49:00-05:00 | Dakota Wesleyan University |
| 1 | Associate Vice President, Inclusion and Belon... | Cornell University:\nCornell University seeks ... | https://careers.insidehighered.com/job/2156394... | 2021-05-27 09:35:00-05:00 | Cornell University |
| 2 | Metadata Librarian | Commensurate with experience:\n\nBridgewater C... | https://careers.insidehighered.com/job/2172564... | 2021-06-22 15:03:00-05:00 | Bridgewater College |
| 3 | Extension 4-H Program Coordinator | Michigan State University:\nJob no: 711226 Wor... | https://careers.insidehighered.com/job/2164790... | 2021-06-10 00:00:00-05:00 | Michigan State University |
| 4 | Technical Director/Designer | Lake Forest College:\nTechnical Director/Desig... | https://careers.insidehighered.com/job/2161469... | 2021-06-05 00:00:00-05:00 | Lake Forest College |
| 5 | Lecturer, Computer Science | Skidmore College:\nInstructor for CS106 Introd... | https://careers.insidehighered.com/job/2173754... | 2021-06-24 00:00:00-05:00 | Skidmore College |
| 6 | Customer Attendant Assistant | Skidmore College:\nThis is a part time, on cal... | https://careers.insidehighered.com/job/2173753... | 2021-06-24 00:00:00-05:00 | Skidmore College |
| 7 | Post-Doctoral Geospatial Data Scientist | Skidmore College:\nOrganization Overview MySO... | https://careers.insidehighered.com/job/2173752... | 2021-06-24 00:00:00-05:00 | Skidmore College |
| 8 | Tang Guide | Skidmore College:\nThe Tang Guides program is ... | https://careers.insidehighered.com/job/2173751... | 2021-06-24 00:00:00-05:00 | Skidmore College |
| 9 | Lecturer-Choral Director, Department of Music | Skidmore College:\nChoral Director: The Skidmo... | https://careers.insidehighered.com/job/2173750... | 2021-06-24 00:00:00-05:00 | Skidmore College |
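
The two field conversions used to build the table above (splitting an "Institution: Job title" string and parsing the RSS date format) can be tried in isolation. The sketch below uses only the standard library; `split_feed_title` and `parse_pub_date` are hypothetical helper names, and the sample strings are illustrative reconstructions rather than values copied from the feed.

```python
from datetime import datetime

# Same format string the notebook passes to strptime / pd.to_datetime
RSS_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def split_feed_title(raw_title):
    """Split 'Institution: Job title' on the first colon only.

    Returns (institution, title); institution is None when no colon is present.
    """
    institution, sep, title = raw_title.partition(":")
    if not sep:
        return None, raw_title.strip()
    return institution.strip(), title.strip()


def parse_pub_date(raw_date):
    """Parse an RSS-style date string into a timezone-aware datetime."""
    return datetime.strptime(raw_date, RSS_DATE_FORMAT)


# Illustrative inputs (assumed shapes, not taken verbatim from the feed)
print(split_feed_title("Skidmore College: Lecturer, Computer Science"))
print(split_feed_title("Example University: Director: Research Programs"))
print(parse_pub_date("Tue, 08 Jun 2021 16:49:00 -0500").isoformat())
```

Splitting on the first colon only keeps any later colons inside the job title, which matters because a plain split on every colon would yield a variable number of pieces.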
