# Chronicle of Higher Education job scraper

Collects all tenure-track job advertisements from North American four-year institutions.

- **[Query](https://jobs.chronicle.com/jobs/faculty-positions/north-america/tenured-tenured-track/)**


Every time you scrape:

1. Load in previous job advertisements
2. Scrape all the *new job advertisements*
3. De-duplicate if necessary
4. Output to DB/CSV
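
The four steps above can be sketched with pandas roughly as follows. This is a minimal illustration rather than the notebook's actual code: `combine_ads` is a hypothetical helper, and the two small frames are made-up stand-ins for the previously saved CSV and a fresh scrape.

```python
import pandas as pd


def combine_ads(previous, fresh):
    """Append newly scraped ads to the stored ones, keeping one row per Job ID."""
    combined = pd.concat([previous, fresh])
    # The Job ID is the index; keep the first occurrence so that ads which
    # were already stored are not replaced by a re-scraped copy.
    return combined[~combined.index.duplicated(keep="first")]


# Step 1: previously saved ads (hypothetical sample data).
previous = pd.DataFrame(
    {"Job Title": ["Assistant Professor of Economics"]},
    index=pd.Index([101], name="Job ID"),
)
# Step 2: a fresh scrape that overlaps with the stored ads.
fresh = pd.DataFrame(
    {"Job Title": ["Assistant Professor of Economics", "Assistant Professor of Biology"]},
    index=pd.Index([101, 102], name="Job ID"),
)
# Step 3: de-duplicate. Step 4 would then be combined.to_csv(...).
combined = combine_ads(previous, fresh)
print(combined)
```

Keeping the first occurrence is a design choice made for this sketch; keeping the last would instead prefer the most recently scraped copy of an ad.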


In [9]:
# Data manipulation libraries
import pandas as pd
import numpy as np

# Web scraping libraries and tools
import requests
from bs4 import BeautifulSoup as bs
import re
import time
from tqdm.notebook import tqdm
tqdm.pandas()

# Finding paths to data files
from glob import glob

In [10]:
def parse_list_page_item(list_item):
    """
        Takes the list item HTML and parses it into a list of four fields:
        job id, job title, job URL, and whether it is flagged as a diversity job.
    
    """
    title_tag = list_item.find("h3").find("a")
    job_title = title_tag.text
    job_url_suffix = title_tag['href'].strip()
    job_id = job_url_suffix.split("/")[2]
    job_url = "https://jobs.chronicle.com{}".format(job_url_suffix)
    # A "ribbon" paragraph marks the ad as a diversity job
    diversity_job = list_item.find("p",attrs={"class":"ribbon"}) is not None
    return [int(job_id),job_title,job_url,diversity_job]

def parse_list_page(url):
    """
        Returns the basic info from the jobs listing page
        
        || job id || job title || url || diversity job? 
    
    """
    time.sleep(1)
    r = requests.get(url,headers = {'User-Agent': 'Mozilla/5.0'})
    # The part of the webpage with the id tag "listing" contains all the job postings
    listing_page = bs(r.text, "html.parser").find("ul",attrs={"id":'listing'})
    # Parse out the ads
    list_items = listing_page.find_all("li",attrs={"id": re.compile("item-[0-9]+")})
    parsed_list_page = [parse_list_page_item(li) for li in list_items]
    return pd.DataFrame(parsed_list_page,columns=["Job ID","Job Title","Job URL","Diversity Job"]).set_index("Job ID")
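
`parse_list_page_item` reads the job ID from the third `/`-separated piece of the link, which raises an error if a link has a different shape. A slightly more defensive variant, sketched here with a regular expression, returns `None` instead. `extract_job_id` and the sample paths are hypothetical and only illustrate the idea.

```python
import re

# Matches the numeric ID in paths shaped like "/job/<id>/<slug>/".
JOB_ID_PATTERN = re.compile(r"/job/(\d+)/")


def extract_job_id(job_url_suffix):
    """Return the numeric job ID from a relative job URL, or None if absent."""
    match = JOB_ID_PATTERN.search(job_url_suffix)
    return int(match.group(1)) if match else None


# Hypothetical example paths.
print(extract_job_id("/job/12345/example-job-title/"))
print(extract_job_id("/jobs/faculty-positions/"))
```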



In [11]:
# Load the most recent CSV and collect the Job IDs that have already been scraped

list_of_csv_files = glob("../data/*")
most_recent_csv = sorted(list_of_csv_files, reverse=True)[0]
ls_df = pd.read_csv(most_recent_csv).sort_values("Posted Date",ascending=False)
already_scraped = set(ls_df['Job ID'])
ls_df

Unnamed: 0,Job ID,Job Title,Job URL,Diversity Job,Employer,Location,Salary,Posted Date,Description,Position Type 0.0,...,Position Type 0.2,Position Type 0.3,Position Type 0.4,Position Type 0.5,Position Type 0.6,Position Type 0.7,Position Type 0.8,Position Type 0.9,Position Type 1.0,Position Type 1.1
0,37302593,Assistant Professor of Human Development,https://jobs.chronicle.com/job/37302593/assist...,False,Eckerd College,"Saint Petersburg, Florida",Competitive Salary,2022-08-11,Human Development. Assistant Professor of Huma...,Faculty Positions,...,Human Development & Family Sciences,Psychology,,,,,,,,
14,37302383,Assistant Professor of Biology - Ecology,https://jobs.chronicle.com/job/37302383/assist...,False,Lewis & Clark College,"Portland, Oregon",Commensurate with experience,2022-08-11,"Description\r\nLocated in Portland, Oregon, Le...",Faculty Positions,...,,,,,,,,,,
1,37302381,Assistant Professor of Economics,https://jobs.chronicle.com/job/37302381/assist...,False,Lewis & Clark College,"Portland, Oregon",Commensurate with experience,2022-08-11,Description\r\nAssistant Professor of Economic...,Faculty Positions,...,,,,,,,,,,
25,37302403,Tenure Track Faculty - Behavior Health (Social...,https://jobs.chronicle.com/job/37302403/tenure...,False,"California State University, Sacramento","California, United States",Salary Not specified,2022-08-11,\r\n\r\nTenure Track Faculty - Behavior Health...,Faculty Positions,...,Other Health & Medicine,,,,,,,,,
24,37302404,"Assistant Professor, Tenure-Track, Department ...",https://jobs.chronicle.com/job/37302404/assist...,False,University of San Francisco,"California, United States",Salary Not specified,2022-08-11,"\r\n\r\nAssistant Professor, Tenure-Track, Dep...",Faculty Positions,...,Technology & Mathematics,Computer Sciences & Technology,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1178,373791,"Professor/Chair, Civil, Architectural, and Env...",https://jobs.chronicle.com/job/373791/professo...,False,North Carolina Agricultural and Technical Stat...,"North Carolina, United States",Salary Not specified,2021-08-25,"The Chair of the Civil, Architectural, and Env...",Faculty Positions,...,Technology & Mathematics,Engineering,Executive,Other Executive,,,,,,
1179,358463,"Clinical/Field Faculty, Educator Preparation",https://jobs.chronicle.com/job/358463/clinical...,False,North Carolina Agricultural and Technical Stat...,"North Carolina, United States",Salary Not specified,2021-07-14,The clinical/field faculty position will have ...,Faculty Positions,...,Curriculum & Instruction,Teacher Education,Health & Medicine,Dentistry,,,,,,
1180,358463,"Clinical/Field Faculty, Educator Preparation",https://jobs.chronicle.com/job/358463/clinical...,False,North Carolina Agricultural and Technical Stat...,"North Carolina, United States",Salary Not specified,2021-07-14,The clinical/field faculty position will have ...,Faculty Positions,...,Curriculum & Instruction,Teacher Education,Health & Medicine,Dentistry,,,,,,
1181,358463,"Clinical/Field Faculty, Educator Preparation",https://jobs.chronicle.com/job/358463/clinical...,False,North Carolina Agricultural and Technical Stat...,"North Carolina, United States",Salary Not specified,2021-07-14,The clinical/field faculty position will have ...,Faculty Positions,...,Curriculum & Instruction,Teacher Education,Health & Medicine,Dentistry,,,,,,
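
Sorting the file names picks the newest snapshot only because each name starts with a `YYYY-MM-DD` date, so alphabetical order matches chronological order. A minimal sketch of that selection, assuming every snapshot follows the naming pattern used when the CSV is written at the end of this notebook (`most_recent_snapshot` and the sample dates are hypothetical):

```python
def most_recent_snapshot(paths):
    """Return the CSV path that sorts last by name, or None if there is none."""
    # Ignore anything in the data folder that is not a CSV snapshot.
    csv_paths = [path for path in paths if path.endswith(".csv")]
    # ISO dates sort chronologically as plain strings.
    return max(csv_paths, default=None)


paths = [
    "../data/2022-08-11-chronicles_of_higher_ed.csv",
    "../data/2021-12-01-chronicles_of_higher_ed.csv",
    "../data/notes.txt",
]
latest = most_recent_snapshot(paths)
print(latest)
```

Filtering on the `.csv` suffix guards against stray files in the folder, which `glob("../data/*")` would otherwise include.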


In [12]:
url = "https://jobs.chronicle.com/jobs/faculty-positions/north-america/tenured-tenured-track/{}"

frames = []
job_ids = set()
page = 1
new_jobs = True
while new_jobs:
    print(page,end=" ")
    frame = parse_list_page(url.format(page))
    prev_job_ids = job_ids
    job_ids = set(frame.index)
    overlap = job_ids.intersection(already_scraped)
    # Stop once a page contains an already-scraped ad or repeats the previous page
    new_jobs = not (overlap or (prev_job_ids == job_ids))
    if overlap: print(overlap)
    frames.append(frame)
    page +=1

listing_df = pd.concat(frames)
listing_df = listing_df[~listing_df.index.isin(already_scraped)]
listing_df

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 {37293479}


Job ID,Job Title,Job URL,Diversity Job
37306957,Non-Tenure-Track Associate Professor of Profes...,https://jobs.chronicle.com/job/37306957/non-te...,True
37307127,UC Riverside Bourns College of Engineering Ope...,https://jobs.chronicle.com/job/37307127/uc-riv...,True
37308928,Director - William H. Gates Public Service Law...,https://jobs.chronicle.com/job/37308928/direct...,True
37308936,SENIOR CAREER COUNSELOR,https://jobs.chronicle.com/job/37308936/senior...,True
37307103,Assistant/Associate Professor in Social Psycho...,https://jobs.chronicle.com/job/37307103/assist...,True
...,...,...,...
37303932,"Instructional Aide A, Counseling, Full-Time",https://jobs.chronicle.com/job/37303932/instru...,False
37303931,"Director, Financial Aid",https://jobs.chronicle.com/job/37303931/direct...,False
37303925,Founding Faculty Positions in Integrated Sciences,https://jobs.chronicle.com/job/37303925/foundi...,False
37303922,Tenure-Track Assistant Professor of English in...,https://jobs.chronicle.com/job/37303922/tenure...,False
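
The loop above stops in two situations: a page contains an ad that was already scraped, or a page is identical to the one before it. The sketch below isolates that stopping rule so it can be tried without network access. `collect_new_ids`, `fetch_page_ids`, and the `max_pages` safety cap are hypothetical additions, and the page contents are made up.

```python
def collect_new_ids(fetch_page_ids, already_scraped, max_pages=100):
    """Walk listing pages in order and return the job IDs not seen before."""
    new_ids = set()
    previous_ids = set()
    for page in range(1, max_pages + 1):
        page_ids = set(fetch_page_ids(page))
        if page_ids == previous_ids:
            break  # the site repeated the previous page: no more results
        new_ids.update(page_ids - already_scraped)
        if page_ids & already_scraped:
            break  # reached ads that an earlier run already stored
        previous_ids = page_ids
    return new_ids


# Made-up listing pages, newest ads first.
pages = {1: [9, 8], 2: [7, 6], 3: [5, 4]}
new_ids = collect_new_ids(lambda page: pages.get(page, []), already_scraped={5, 4, 3})
print(sorted(new_ids))
```

Passing the page fetcher in as a function keeps the stopping rule separate from the HTTP request, which makes it easy to test.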


In [13]:
def get_description_of_page(soup_page):
    """
        Parses the beautiful-soup object of the page response for the job description.
        
        :param soup_page: The beautiful soup object that contains the desired page.
        :returns: The text of the job description.
    """
    description = soup_page.find("div",attrs={"class":"mds-edited-text mds-font-body-copy-bulk"}).get_text()
    return description
    

In [14]:
def get_details_block_of_page(soup_page):
    """
        Every page has a set of details that contains information like who the employer for a job is, location, etc.
        Parses the beautiful-soup object of the page for the summary of the details of the job.
        
        :param soup_page: The beautiful soup object that contains the desired page.
        :returns: The beautiful soup tag for the details. Gets parsed for the important details later.
    """
    details_block = soup_page.find_all("dl",attrs={"class":"mds-list mds-list--definition mds-list--border mds-margin-bottom-b0"})
    return details_block
    

In [15]:
def link_keys_and_values(list_of_keys_and_values):
    """
        Takes a list of alternating key and value elements and pairs them up. The current version of the website
        stores much of its information as one element with the class 'mds-list__key' followed by an element
        with the class 'mds-list__value'. This function matches each key with the value that follows it.
        
        :param list_of_keys_and_values: List of soup elements that have alternating key and value class attributes.
        :returns: A dictionary where the key and value correspond to the keys and value in the html. The keys and values are 
        just the text from the element.
    """
    dictionary_form = {}
    key = None
    value = None
    for element in list_of_keys_and_values:
        # Elements without a class attribute return None, so fall back to an empty list
        classes = element.get("class") or []
        if "mds-list__key" in classes:
            key = element
        if "mds-list__value" in classes:
            value = element
            if key is not None:
                dictionary_form[key.get_text().strip()] = value.get_text().strip()
            key = None
            value = None

    return dictionary_form
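
The pairing rule used by `link_keys_and_values` can be shown without any HTML. In this sketch each element is reduced to a hypothetical `(kind, text)` tuple, where `kind` stands in for the `mds-list__key` or `mds-list__value` class; `pair_keys_with_values` is an illustrative name, not part of the notebook.

```python
def pair_keys_with_values(items):
    """Pair each ("key", text) item with the ("value", text) item after it."""
    pairs = {}
    pending_key = None
    for kind, text in items:
        if kind == "key":
            pending_key = text.strip()
        elif kind == "value":
            if pending_key is not None:
                pairs[pending_key] = text.strip()
            # A value always closes the current pair, so a second value in a
            # row has no key to attach to and is skipped.
            pending_key = None
    return pairs


items = [
    ("key", "Employer"),
    ("value", " Eckerd College "),
    ("value", "orphan value"),
    ("key", "Location"),
    ("value", "Saint Petersburg, Florida"),
]
details = pair_keys_with_values(items)
print(details)
```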

In [16]:
def aggregate_children_of_elements(list_of_elements):
    """
        Takes a list of elements with children and gathers them together.
        
        :param list_of_elements: List of beautiful soup elements.
        :returns: A list of all the children of the elements in the input list.
    """
    children = []
    for element in list_of_elements:
        for child in element.findChildren(recursive=False):
            children.append(child)
    
    return children

In [17]:
def parse_details_page(url):
    """
        Parses the details page of a job posting
        
        || employer || location || salary || date posted || position_type (list) || description
    
    """
    time.sleep(0.25)
    r = requests.get(url,headers = {'User-Agent': 'Mozilla/5.0'})
    details_page = bs(r.text, "html.parser")
    
    description = get_description_of_page(details_page)
    
    details_block = get_details_block_of_page(details_page)
    list_of_keys_and_values = aggregate_children_of_elements(details_block)
    details_dict = link_keys_and_values(list_of_keys_and_values)
    
    # dict.get returns None for any key that is missing from the details block
    employer = details_dict.get("Employer")
    location = details_dict.get("Location")
    salary = details_dict.get("Salary")
    posted_date = details_dict.get("Date Posted") # not sure if the "start date" is the posted date
    position_type = details_dict.get("Position Type")
    if position_type is not None:
        # Split the comma-separated position types into a list
        position_type = [text.strip() for text in position_type.split(",")]
    
    return employer,location,salary,posted_date,position_type,description
    


In [18]:
listing_df

Job ID,Job Title,Job URL,Diversity Job
37306957,Non-Tenure-Track Associate Professor of Profes...,https://jobs.chronicle.com/job/37306957/non-te...,True
37307127,UC Riverside Bourns College of Engineering Ope...,https://jobs.chronicle.com/job/37307127/uc-riv...,True
37308928,Director - William H. Gates Public Service Law...,https://jobs.chronicle.com/job/37308928/direct...,True
37308936,SENIOR CAREER COUNSELOR,https://jobs.chronicle.com/job/37308936/senior...,True
37307103,Assistant/Associate Professor in Social Psycho...,https://jobs.chronicle.com/job/37307103/assist...,True
...,...,...,...
37303932,"Instructional Aide A, Counseling, Full-Time",https://jobs.chronicle.com/job/37303932/instru...,False
37303931,"Director, Financial Aid",https://jobs.chronicle.com/job/37303931/direct...,False
37303925,Founding Faculty Positions in Integrated Sciences,https://jobs.chronicle.com/job/37303925/foundi...,False
37303922,Tenure-Track Assistant Professor of English in...,https://jobs.chronicle.com/job/37303922/tenure...,False


In [19]:
listing_df[['Employer',
            'Location',
            'Salary',
            'Posted Date',
            'position_type',
            'Description']] = listing_df.progress_apply(lambda row: parse_details_page(row['Job URL']),
                                                        axis=1,
                                                        result_type='expand')
# infer_datetime_format is deprecated in pandas 2.0, where format inference is the default
listing_df["Posted Date"] = pd.to_datetime(listing_df["Posted Date"])

  0%|          | 0/324 [00:00<?, ?it/s]

In [20]:
listing_df = listing_df[listing_df['position_type'].notna()]

In [21]:
position_type = pd.DataFrame(listing_df['position_type'].values.tolist(),
                             index=listing_df.index).fillna(np.nan)
position_type = position_type.rename(columns = lambda x: (x/10)).add_prefix('Position Type ')
print("{}x{}".format(*listing_df.shape))

merged_df = pd.merge(listing_df,
                     position_type,
                     how="left",
                     left_index=True,
                     right_index=True)
print("{}x{}".format(*merged_df.shape))
merged_df = merged_df.drop("position_type",axis=1)
print("{}x{}".format(*merged_df.shape))
merged_df = merged_df.sort_values("Posted Date",ascending=False)
merged_df

324x9
328x24
328x23


Job ID,Job Title,Job URL,Diversity Job,Employer,Location,Salary,Posted Date,Description,Position Type 0.0,Position Type 0.1,...,Position Type 0.5,Position Type 0.6,Position Type 0.7,Position Type 0.8,Position Type 0.9,Position Type 1.0,Position Type 1.1,Position Type 1.2,Position Type 1.3,Position Type 1.4
37292397,Assistant or Associate Professor of Kinesiolog...,https://jobs.chronicle.com/job/37292397/assist...,False,University of Maryland Eastern Shore,"Princess Anne, Maryland",Commensurate with experience and qualifications,NaT,The appointment reports to the Chair of Kinesi...,Faculty Positions,Health & Medicine,...,,,,,,,,,,
37295246,Assistant Professor of Economics,https://jobs.chronicle.com/job/37295246/assist...,False,Brown University,"Providence, Rhode Island",Competitive,NaT,The Department of Economics seeks to recruit a...,Faculty Positions,Education,...,,,,,,,,,,
37295250,Assistant Professor of Economics,https://jobs.chronicle.com/job/37295250/assist...,False,Brown University,"Providence, Rhode Island",Competitive,NaT,Tenure Position\n \nAssistant Professor\n \nPo...,Faculty Positions,Education,...,,,,,,,,,,
37295258,Associate or Full Professor of Economics,https://jobs.chronicle.com/job/37295258/associ...,False,Brown University,"Providence, Rhode Island",Competitive,NaT,"Tenure Position\n \nProfessor, Associate Profe...",Faculty Positions,Education,...,,,,,,,,,,
37303683,Seeking Creative Scientists for Faculty Positi...,https://jobs.chronicle.com/job/37303683/seekin...,False,Rockefeller University,"New York, United States",Salary Not Specified,NaT,FACULTY POSITIONS\nTHE ROCKEFELLER UNIVERSITY\...,Faculty Positions,Science,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37309142,Assistant Professor or Lecturer of Forestry,https://jobs.chronicle.com/job/37309142/assist...,False,Abraham Baldwin Agriculture College,"31793, Tifton",Salary is commensurate with experience,NaT,Title: Assistant Professor or Lecturer of Fore...,Faculty Positions,Science,...,,,,,,,,,,
37309157,Professor (all ranks) in Engineering Resilient...,https://jobs.chronicle.com/job/37309157/profes...,False,Arizona State University,"Tempe, Arizona",Depending on experience.,NaT,Arizona State University: Ira A. Fulton School...,Faculty Positions,,...,,,,,,,,,,
37309193,Assistant Professor (tenure-track) of Business...,https://jobs.chronicle.com/job/37309193/assist...,False,Gabelli School of Business,"New York City, New York",Salary is competitive and commensurate with ex...,NaT,The Law & Ethics Area at Fordham University Ga...,Faculty Positions,Business & Management,...,,,,,,,,,,
37309194,Open Access Collection Strategist,https://jobs.chronicle.com/job/37309194/open-a...,False,"University of California, Santa Barbara","Santa Barbara, California","Up to $144,822/yr with benefits",NaT,Apply now: https://recruit.ap.ucsb.edu/JPF0225...,Faculty Positions,Professional Fields,...,,,,,,,,,,
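
The printed shapes show the row count growing from 324 to 328 during the merge. One likely explanation, which the output alone does not confirm, is that some Job IDs appear more than once in the index: merging on a non-unique index pairs every matching left row with every matching right row. The sketch below reproduces the effect on made-up data and shows one possible fix; `drop_duplicate_ids` is a hypothetical helper.

```python
import pandas as pd

# Made-up data: Job ID 1 appears twice, as if one ad were listed on two pages.
listing = pd.DataFrame(
    {"Job Title": ["A", "A", "B"]},
    index=pd.Index([1, 1, 2], name="Job ID"),
)
position_type = pd.DataFrame(
    {"Position Type 0.0": ["Faculty Positions"] * 3},
    index=listing.index,
)

# The duplicated ID yields 2 x 2 = 4 rows, so 3 input rows become 5.
inflated = pd.merge(listing, position_type, how="left",
                    left_index=True, right_index=True)


def drop_duplicate_ids(frame):
    """Keep only the first row for each index value."""
    return frame[~frame.index.duplicated(keep="first")]


# De-duplicating both sides first keeps exactly one row per Job ID.
clean = pd.merge(drop_duplicate_ids(listing), drop_duplicate_ids(position_type),
                 how="left", left_index=True, right_index=True)
print(len(listing), len(inflated), len(clean))
```

Since both frames in the notebook share the same index in the same order, `pd.concat([listing_df, position_type], axis=1)` may be another way to avoid the row multiplication, though that is not what the code above does.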


In [22]:
from datetime import datetime

timestamp = datetime.now().strftime("%Y-%m-%d")
merged_df.to_csv(f"../data/{timestamp}-chronicles_of_higher_ed.csv")