# Chronicle of Higher Education job scraper

Collects all tenured and tenure-track faculty job advertisements at North American four-year institutions.

- **[Query](https://jobs.chronicle.com/jobs/faculty-positions/north-america/tenured-tenured-track/)**


Every time you scrape:

1. Load in previous job advertisements
2. Scrape all the *new job advertisements*
3. De-duplicate if necessary
4. Output to DB/CSV
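
The four steps above can be sketched with plain dictionaries keyed by job ID. This is only an illustration of the workflow, not the notebook's implementation (which uses pandas): `merge_new_jobs` is a hypothetical helper, and the two sample records reuse titles from the output further down.

```python
def merge_new_jobs(previous, scraped):
    """Return (combined, added): previous ads plus any scraped ads not seen before.

    Both arguments are hypothetical dicts mapping job ID -> record.
    """
    combined = dict(previous)               # step 1: start from the previous ads
    added = []
    for job_id, record in scraped.items():  # step 2: walk the freshly scraped ads
        if job_id in combined:              # step 3: skip duplicates
            continue
        combined[job_id] = record
        added.append(job_id)
    return combined, added                  # step 4: the caller writes `combined` out


previous = {426147: {"Job Title": "Lecturer of Civil and Environmental Engineering"}}
scraped = {
    426147: {"Job Title": "Lecturer of Civil and Environmental Engineering"},
    427095: {"Job Title": "Assistant/Associate Professor of Psychology"},
}
combined, added = merge_new_jobs(previous, scraped)
print(added)
```

Keying on the job ID means an ad that shows up in both the previous file and the new scrape is stored only once.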


In [1]:
# Data manipulation libraries
import pandas as pd
import numpy as np

# Web scraping libraries and tools
import requests
from bs4 import BeautifulSoup as bs
import re
import time
from tqdm.notebook import tqdm
tqdm.pandas()

# Finding paths to data files
from glob import glob

In [2]:
def parse_list_page_item(list_item):
    """
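        Takes the list item HTML and parses out four fields into a list:
        job ID, job title, job URL, and whether the ad is flagged as a diversity job.
    
    """
    title_tag = list_item.find("h3").find("a")
    job_title = title_tag.text
    job_url_suffix = title_tag['href'].strip()
    # The suffix has the form "/job/<id>/...", so the ID is the third piece after splitting
    job_id = job_url_suffix.split("/")[2]
    job_url = "https://jobs.chronicle.com{}".format(job_url_suffix)
    # A <p class="ribbon"> element marks the item as a diversity job
    diversity_job = list_item.find("p",attrs={"class":"ribbon"}) is not None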
    return [int(job_id),job_title,job_url,diversity_job]

def parse_list_page(url):
    """
        Returns the basic info from the jobs listing page
        
        || job id || job title || url || diversity job? 
    
    """
    time.sleep(1)
    r = requests.get(url,headers = {'User-Agent': 'Mozilla/5.0'})
    # The part of the webpage with the id tag "listing" contains all the job postings
    listing_page = bs(r.text).find("ul",attrs={"id":'listing'})
    # Parse out the ads
    list_items = listing_page.findAll("li",attrs={"id": re.compile("item-[0-9]+")})
    parsed_list_page = [parse_list_page_item(li) for li in list_items]
    return pd.DataFrame(parsed_list_page,columns=["Job ID","Job Title","Job URL","Diversity Job"]).set_index("Job ID")
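
`parse_list_page_item` takes the job ID from a fixed position in the link (`split("/")[2]`), which raises an error if a link ever has a different shape. A slightly more defensive variant, sketched below, uses a regular expression and returns `None` when the link does not match. The helper name `extract_job_id` and the sample hrefs are hypothetical; the `/job/<id>/...` shape is inferred from the job URLs shown in the outputs.

```python
import re

# Matches relative links of the form "/job/<digits>" optionally followed by more path
JOB_HREF = re.compile(r"^/job/(\d+)(?:/|$)")

def extract_job_id(href):
    """Return the numeric job ID from a relative job URL, or None if it does not match."""
    match = JOB_HREF.match(href.strip())
    return int(match.group(1)) if match else None

print(extract_job_id("/job/426147/some-slug/"))    # hypothetical job link
print(extract_job_id("/jobs/faculty-positions/"))  # a listing link, not a job link
```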



In [3]:
# Load the most recent CSV and collect the Job IDs that have already been scraped

list_of_csv_files = glob("../data/*")
most_recent_csv = sorted(list_of_csv_files, reverse=True)[0]
ls_df = pd.read_csv(most_recent_csv).sort_values("Date Posted",ascending=False)
already_scraped = set(ls_df['Job ID'])
ls_df

,Job ID,Job Title,Job URL,Diversity Job,Employer,Location,Salary,Date Posted,Description,Position Type 0.0,Position Type 0.1,Position Type 0.2,Position Type 0.3,Position Type 0.4,Position Type 0.5
0,426147,Lecturer of Civil and Environmental Engineering,https://jobs.chronicle.com/job/426147/lecturer...,True,Kennesaw State University,"Georgia, United States",Salary Not specified,,Kennesaw State University is now accepting app...,Faculty Positions,Science,Technology & Mathematics,Engineering,,
1,427095,Assistant/Associate Professor of Psychology,https://jobs.chronicle.com/job/427095/assistan...,False,The University of Texas at Tyler,"Texas, United States",Salary Commensurate with experience,,Job Type\nFull-Time\n \nSalary\nCommensurate w...,Faculty Positions,Social & Behavioral Sciences,Psychology,,,
2,427089,Assistant Professor of Music (Low Brass) and D...,https://jobs.chronicle.com/job/427089/assistan...,False,The University of Texas at Tyler,"Texas, United States",Salary Commensurate with experience,,Job Type\nFull-Time\n \nSalary\nCommensurate w...,Faculty Positions,Arts,Music,,,
3,427087,Assistant Professor of Teacher Education,https://jobs.chronicle.com/job/427087/assistan...,False,The College of Saint Rose,"New York, United States",Salary Not specified,,Company Description:\nThe College of Saint Ros...,Faculty Positions,Education,Teacher Education,,,
4,427086,Assistant Professor of Asian History,https://jobs.chronicle.com/job/427086/assistan...,False,The College of Saint Rose,"New York, United States",Salary Not specified,,Company Description:\nThe College of Saint Ros...,Faculty Positions,Humanities,History,Other Humanities,,
5,427063,Assistant/Associate Professor of Museum Educat...,https://jobs.chronicle.com/job/427063/assistan...,False,The George Washington University,"District of Columbia, United States",Salary Commensurate with experience,,The George Washington\nUniversity\nGraduate Sc...,Faculty Positions,Arts,Other Arts,Education,Curriculum & Instruction,Other Education
6,427028,"Assistant Professor, Communication Studies",https://jobs.chronicle.com/job/427028/assistan...,False,"California State University, San Bernardino","California, United States",Salary Not specified,,"\n\nAssistant Professor, Communication Studies...",Faculty Positions,Communications,Public Relations & Advertising,,,
7,427027,"Assistant, Advanced Assistant or Associate Pro...",https://jobs.chronicle.com/job/427027/assistan...,False,"California State University, San Bernardino","California, United States",Salary Not specified,,"\n\nAssistant, Advanced Assistant or Associate...",Faculty Positions,Humanities,Ethnic & Multicultural Studies,,,
8,427019,"Assistant Professor, Theatre for Youth and Com...",https://jobs.chronicle.com/job/427019/assistan...,False,"Arizona State University - Tempe, AZ","Arizona, United States",Salary Commensurate with experience,,"Description:\nThe School of Music, Dance and T...",Faculty Positions,Arts,Other Arts,Performing Arts,,
9,427010,Open-Rank Tenure Track/Tenured Positions - Sch...,https://jobs.chronicle.com/job/427010/open-ran...,False,Florida International University,"Florida, United States",Salary Not specified,,Florida International University is Miami’s pu...,Faculty Positions,Science,Technology & Mathematics,Computer Sciences & Technology,,
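
Picking the newest file by sorting names works because the output files are written with a `YYYY-MM-DD` prefix, which sorts chronologically as text. It does assume that every file in `../data/` follows that naming scheme: any other file whose name sorts later (for example one starting with a letter) would be picked instead. The sketch below filters to `.csv` files first; `most_recent_csv` and the directory contents are hypothetical.

```python
from pathlib import Path

def most_recent_csv(paths):
    """Pick the newest CSV, relying on the YYYY-MM-DD filename prefix sorting as text."""
    csv_paths = [p for p in paths if Path(p).suffix == ".csv"]
    if not csv_paths:
        raise FileNotFoundError("no CSV files found")
    return max(csv_paths, key=lambda p: Path(p).name)

# Hypothetical directory contents
files = [
    "../data/2022-07-30-chronicles_of_higher_ed.csv",
    "../data/2022-08-11-chronicles_of_higher_ed.csv",
    "../data/README.md",
]
print(most_recent_csv(files))
```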


In [4]:
url = "https://jobs.chronicle.com/jobs/faculty-positions/north-america/tenured-tenured-track/{}"

frames = []
job_ids = set()
page = 1
new_jobs = True
while new_jobs:
    print(page,end=" ")
    frame = parse_list_page(url.format(page))
    prev_job_ids = job_ids
    job_ids = set(frame.index)
    overlap = job_ids.intersection(already_scraped)
    # Stop once a page contains an already-scraped job or simply repeats the previous page
    new_jobs = not (overlap or prev_job_ids == job_ids)
    if overlap: print(overlap)
    frames.append(frame)
    page +=1

listing_df = pd.concat(frames)
listing_df = listing_df[~listing_df.index.isin(already_scraped)]
listing_df

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 

Job ID,Job Title,Job URL,Diversity Job
37290605,"Faculty (Open-Rank), Department of History",https://jobs.chronicle.com/job/37290605/facult...,True
37293331,Dean of the School of Law,https://jobs.chronicle.com/job/37293331/dean-o...,True
37288106,Multiple Faculty Positions in Electrical Engin...,https://jobs.chronicle.com/job/37288106/multip...,True
37295002,Tenure-Track positions in Accounting,https://jobs.chronicle.com/job/37295002/tenure...,True
37295315,Assistant OR Associate Professor of Public Rel...,https://jobs.chronicle.com/job/37295315/assist...,True
...,...,...,...
380501,"Villa Maria School of Nursing, Assistant Profe...",https://jobs.chronicle.com/job/380501/villa-ma...,False
379484,Assistant/Associate Professor of Mathematics,https://jobs.chronicle.com/job/379484/assistan...,False
378707,Professor/Nanoengineering Dept. Chair,https://jobs.chronicle.com/job/378707/professo...,False
373791,"Professor/Chair, Civil, Architectural, and Env...",https://jobs.chronicle.com/job/373791/professo...,False
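
The stopping rule of the loop above can be isolated from the network code. The sketch below assumes a hypothetical `fetch_page` callable that returns the set of job IDs on a page, and adds a `max_pages` safety cap that the notebook does not have. As in the notebook, the page that triggers the stop is still kept, and already-scraped IDs are filtered out at the end.

```python
def collect_new_ids(fetch_page, already_scraped, max_pages=1000):
    """Fetch pages until one overlaps `already_scraped` or repeats the previous page."""
    pages = []
    prev_ids = set()
    for page in range(1, max_pages + 1):
        ids = fetch_page(page)
        pages.append(ids)  # the stopping page is kept too
        if ids & already_scraped or ids == prev_ids:
            break
        prev_ids = ids
    # Drop anything that was already scraped
    return set().union(*pages) - already_scraped


# Hypothetical site with three pages; page 3 reaches previously scraped ads
fake_site = {1: {9, 8}, 2: {7, 6}, 3: {5, 4}}
new_ids = collect_new_ids(lambda n: fake_site.get(n, set()), already_scraped={4, 3})
print(sorted(new_ids))
```

The second condition (`ids == prev_ids`) covers the first run, when nothing has been scraped yet: it assumes that requesting a page past the end returns the same listing again.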


In [5]:
def get_description_of_page(soup_page):
    """
        Parses the beautiful-soup object of the page response for the job description.
        
        :param soup_page: The beautiful soup object that contains the desired page.
        :returns: The text of the job description.
    """
    description = soup_page.find("div",attrs={"class":"mds-edited-text mds-font-body-copy-bulk"}).get_text()
    return description
    

In [6]:
def get_details_block_of_page(soup_page):
    """
        Every page has a set of details that contains information like who the employer for a job is, location, etc.
        Parses the beautiful-soup object of the page for the summary of the details of the job.
        
        :param soup_page: The beautiful soup object that contains the desired page.
        :returns: The beautiful soup tag for the details. Gets parsed for the important details later.
    """
    details_block = soup_page.find_all("dl",attrs={"class":"mds-list mds-list--definition mds-list--border mds-margin-bottom-b0"})
    return details_block
    

In [7]:
def link_keys_and_values(list_of_keys_and_values):
    """
        Takes a list of alternating elements with key and value class elements and pairs them up. The current version
        of the website stores much of its information in adjacent pairs: one element has the class 'mds-list__key'
        and the element below it has the class 'mds-list__value'. This function matches those two together.
        
        :param list_of_keys_and_values: List of soup elements that have alternating key and value class attributes.
        :returns: A dictionary where the key and value correspond to the keys and value in the html. The keys and values are 
        just the text from the element.
    """
    dictionary_form = {}
    key = None
    value = None
    for element in list_of_keys_and_values:
        # Elements without a class attribute return None, so fall back to an empty list
        classes = element.get("class") or []
        if "mds-list__key" in classes:
            key = element
        if "mds-list__value" in classes:
            value = element
            if key is not None and value is not None:
                dictionary_form[key.get_text().strip()] = value.get_text().strip()
            key = None
            value = None

    return dictionary_form
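
Stripped of BeautifulSoup, the pairing logic amounts to a single pass that remembers the last key seen. The sketch below uses hypothetical `(kind, text)` tuples in place of the real elements, so it shows the idea rather than the function above; the sample values are taken from the output further down.

```python
def pair_keys_and_values(items):
    """Pair each 'key' item with the 'value' item that follows it."""
    pairs = {}
    key = None
    for kind, text in items:
        if kind == "key":
            key = text.strip()
        elif kind == "value":
            if key is not None:   # ignore a value with no preceding key
                pairs[key] = text.strip()
            key = None            # each key is used at most once
    return pairs


items = [
    ("key", "Employer"), ("value", " Eckerd College "),
    ("value", "orphan value"),
    ("key", "Location"), ("value", "Saint Petersburg, Florida"),
]
print(pair_keys_and_values(items))
```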

In [8]:
def aggregate_children_of_elements(list_of_elements):
    """
        Takes a list of elements with children and gathers them together.
        
        :param list_of_elements: List of beautiful soup elements.
        :returns: A list of all the children of the elements in the input list.
    """
    children = []
    for element in list_of_elements:
        for child in element.findChildren(recursive=False):
            children.append(child)
    
    return children

In [9]:
def parse_details_page(url):
    """
        Parses the details page of a job advertisement
        
        || employer || location || salary || date posted || position_type (list) || description
    
    """
    time.sleep(0.25)
    r = requests.get(url,headers = {'User-Agent': 'Mozilla/5.0'})
    details_page = bs(r.text, "html.parser")
    
    description = get_description_of_page(details_page)
    
    details_block = get_details_block_of_page(details_page)
    list_of_keys_and_values = aggregate_children_of_elements(details_block)
    details_dict = link_keys_and_values(list_of_keys_and_values)
    
    # Missing fields come back as None
    employer = details_dict.get("Employer")
    location = details_dict.get("Location")
    salary = details_dict.get("Salary")
    posted_date = details_dict.get("Posted Date") # not sure if the "start date" is the posted date
    # Position types arrive as one comma-separated string; leave as None when the field is missing
    position_type = details_dict.get("Position Type")
    if position_type is not None:
        position_type = [text.strip() for text in position_type.split(",")]
    
    return employer,location,salary,posted_date,position_type,description
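
One side effect of splitting the position types on every comma is that a category whose own name contains a comma is broken apart: in the outputs, "Science" and "Technology & Mathematics" land in separate columns, which suggests the page lists a single "Science, Technology & Mathematics" category. The sketch below protects such names before splitting. It is an assumption-laden illustration: `split_position_types` is hypothetical, and `names_with_commas` would have to be maintained by hand (only one such name is visible in the sample data).

```python
def split_position_types(text, names_with_commas=("Science, Technology & Mathematics",)):
    """Split a comma-separated category string without breaking names that contain commas."""
    placeholders = {}
    for i, name in enumerate(names_with_commas):
        token = f"\x00{i}\x00"  # placeholder that should not occur in real text
        placeholders[token] = name
        text = text.replace(name, token)
    parts = [part.strip() for part in text.split(",")]
    # Swap placeholders back and drop empty pieces
    return [placeholders.get(part, part) for part in parts if part]

print(split_position_types("Faculty Positions, Science, Technology & Mathematics, Engineering"))
```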
    


In [10]:
listing_df

Job ID,Job Title,Job URL,Diversity Job
37290605,"Faculty (Open-Rank), Department of History",https://jobs.chronicle.com/job/37290605/facult...,True
37293331,Dean of the School of Law,https://jobs.chronicle.com/job/37293331/dean-o...,True
37288106,Multiple Faculty Positions in Electrical Engin...,https://jobs.chronicle.com/job/37288106/multip...,True
37295002,Tenure-Track positions in Accounting,https://jobs.chronicle.com/job/37295002/tenure...,True
37295315,Assistant OR Associate Professor of Public Rel...,https://jobs.chronicle.com/job/37295315/assist...,True
...,...,...,...
380501,"Villa Maria School of Nursing, Assistant Profe...",https://jobs.chronicle.com/job/380501/villa-ma...,False
379484,Assistant/Associate Professor of Mathematics,https://jobs.chronicle.com/job/379484/assistan...,False
378707,Professor/Nanoengineering Dept. Chair,https://jobs.chronicle.com/job/378707/professo...,False
373791,"Professor/Chair, Civil, Architectural, and Env...",https://jobs.chronicle.com/job/373791/professo...,False


In [11]:
listing_df[['Employer',
            'Location',
            'Salary',
            'Posted Date',
            'position_type',
            'Description']] = listing_df.progress_apply(lambda row: parse_details_page(row['Job URL']),
                                                        axis=1,
                                                        result_type='expand')
listing_df["Posted Date"] = pd.to_datetime(listing_df["Posted Date"])

  0%|          | 0/1149 [00:00<?, ?it/s]

In [12]:
listing_df = listing_df[listing_df['position_type'].notna()]

In [13]:
position_type = pd.DataFrame(listing_df['position_type'].values.tolist(),
                             index=listing_df.index).fillna(np.nan)
position_type = position_type.rename(columns = lambda x: (x/10)).add_prefix('Position Type ')
print("{}x{}".format(*listing_df.shape))

merged_df = pd.merge(listing_df,
                     position_type,
                     how="left",
                     left_index=True,
                     right_index=True)
print("{}x{}".format(*merged_df.shape))
merged_df = merged_df.drop("position_type",axis=1)
print("{}x{}".format(*merged_df.shape))
merged_df = merged_df.sort_values("Posted Date",ascending=False)
merged_df

1149x9
1183x21
1183x20


Job ID,Job Title,Job URL,Diversity Job,Employer,Location,Salary,Posted Date,Description,Position Type 0.0,Position Type 0.1,Position Type 0.2,Position Type 0.3,Position Type 0.4,Position Type 0.5,Position Type 0.6,Position Type 0.7,Position Type 0.8,Position Type 0.9,Position Type 1.0,Position Type 1.1
37302593,Assistant Professor of Human Development,https://jobs.chronicle.com/job/37302593/assist...,False,Eckerd College,"Saint Petersburg, Florida",Competitive Salary,2022-08-11,Human Development. Assistant Professor of Huma...,Faculty Positions,Social & Behavioral Sciences,Human Development & Family Sciences,Psychology,,,,,,,,
37302381,Assistant Professor of Economics,https://jobs.chronicle.com/job/37302381/assist...,False,Lewis & Clark College,"Portland, Oregon",Commensurate with experience,2022-08-11,Description\nAssistant Professor of Economics:...,Faculty Positions,,,,,,,,,,,
37302180,Office Administrative Assistant II,https://jobs.chronicle.com/job/37302180/office...,False,Community College of Philadelphia,"Pennsylvania, United States",Salary Not specified,2022-08-11,\nCommunity College of Philadelphia\n\n\nGener...,Faculty Positions,Professional Fields,Other Professional Fields,,,,,,,,,
37302207,"Assistant Professor, Management",https://jobs.chronicle.com/job/37302207/assist...,False,Murray State University,"Kentucky, United States",Salary Commensurate with experience,2022-08-11,"The Department of Management, Marketing and Bu...",Faculty Positions,Business & Management,Business Administration,Management,Marketing & Sales,Other Business & Management,Social & Behavioral Sciences,Recreation & Leisure Studies,,,,
37302211,Assistant/Associate Professor - Industrial and...,https://jobs.chronicle.com/job/37302211/assist...,False,North Carolina Agricultural and Technical Stat...,"North Carolina, United States",Salary Not specified,2022-08-11,The Industrial and Systems Engineering (ISE) D...,Faculty Positions,Science,Technology & Mathematics,Biotechnology & Bioengineering,Computer Sciences & Technology,Engineering,Other Science & Technology,Statistics,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
373791,"Professor/Chair, Civil, Architectural, and Env...",https://jobs.chronicle.com/job/373791/professo...,False,North Carolina Agricultural and Technical Stat...,"North Carolina, United States",Salary Not specified,2021-08-25,"The Chair of the Civil, Architectural, and Env...",Faculty Positions,Science,Technology & Mathematics,Engineering,Executive,Other Executive,,,,,,
358463,"Clinical/Field Faculty, Educator Preparation",https://jobs.chronicle.com/job/358463/clinical...,False,North Carolina Agricultural and Technical Stat...,"North Carolina, United States",Salary Not specified,2021-07-14,The clinical/field faculty position will have ...,Faculty Positions,Education,Curriculum & Instruction,Teacher Education,Health & Medicine,Dentistry,,,,,,
358463,"Clinical/Field Faculty, Educator Preparation",https://jobs.chronicle.com/job/358463/clinical...,False,North Carolina Agricultural and Technical Stat...,"North Carolina, United States",Salary Not specified,2021-07-14,The clinical/field faculty position will have ...,Faculty Positions,Education,Curriculum & Instruction,Teacher Education,Health & Medicine,Dentistry,,,,,,
358463,"Clinical/Field Faculty, Educator Preparation",https://jobs.chronicle.com/job/358463/clinical...,False,North Carolina Agricultural and Technical Stat...,"North Carolina, United States",Salary Not specified,2021-07-14,The clinical/field faculty position will have ...,Faculty Positions,Education,Curriculum & Instruction,Teacher Education,Health & Medicine,Dentistry,,,,,,
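
The printed shapes show the row count growing from 1149 to 1183 during the merge, and job 358463 appears several times at the bottom of the table. A likely cause (an assumption, not checked against the scraped data) is that some Job IDs occur more than once in `listing_df`, for example because an ad shifted between listing pages while scraping. When both sides of an index merge carry the same label several times, every copy on the left matches every copy on the right. The toy frames below are hypothetical and only demonstrate the effect, together with one way to drop repeated index labels beforehand.

```python
import pandas as pd

# Hypothetical listing in which job 358463 was picked up twice
listing = pd.DataFrame(
    {"Job Title": ["A", "B", "B"]},
    index=pd.Index([373791, 358463, 358463], name="Job ID"),
)
extra = pd.DataFrame({"Position Type 0.0": ["x", "y", "y"]}, index=listing.index)

merged = pd.merge(listing, extra, how="left", left_index=True, right_index=True)
print(len(listing), len(merged))  # the duplicated ID matches itself 2 x 2 times

# Dropping repeated index labels first keeps the merge one-to-one
deduped = listing[~listing.index.duplicated(keep="first")]
extra_deduped = extra[~extra.index.duplicated(keep="first")]
merged_ok = pd.merge(deduped, extra_deduped, how="left", left_index=True, right_index=True)
print(len(deduped), len(merged_ok))
```

In the notebook, the equivalent step would be to filter `listing_df` with `~listing_df.index.duplicated(keep="first")` right after the frames are concatenated, before the details pages are fetched.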


In [14]:
from datetime import datetime

timestamp = datetime.now().strftime("%Y-%m-%d")
merged_df.to_csv(f"../data/{timestamp}-chronicles_of_higher_ed.csv")