# Scraping an AI Job Board with Python
## ABB #3 - Session 1

Code authored by: Shaw Talebi

### imports

In [1]:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

### 1) extract job listing links

In [2]:
# URL of the website
job_board_url = "https://aijobs.net/"
query = "?reg=5" # north america jobs

# Send a GET request to the website
response = requests.get(job_board_url + query)

# Check if the request was successful
if response.status_code == 200:
    # Get the HTML content
    html_content = response.text
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

In [3]:
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

In [4]:
# Find all job links within the <ul> list
job_links = soup.select('ul#job-list a.col.py-2[href]')

# explanation from ChatGPT:
# This selects every <a> tag that has both classes "col" and "py-2" and an href attribute, inside the <ul> element with id="job-list"

In [5]:
# Extract href attributes and create full URLs
# (job_board_url ends with "/" and each href starts with "/", hence the "//" in the printed URLs)
job_url_list = [job_board_url + link['href'] for link in job_links]

for job_url in job_url_list:
    print(job_url)

https://aijobs.net//job/993083-software-engineer-ii/
https://aijobs.net//job/993082-signal-processing-software-engineer/
https://aijobs.net//job/993076-senior-virtual-desktop-engineer/
https://aijobs.net//job/993040-director-data-management-ndc/
https://aijobs.net//job/993039-senior-clinical-data-management-manager/
https://aijobs.net//job/993038-data-management-intern/
https://aijobs.net//job/993000-inventory-data-specialist/
https://aijobs.net//job/992994-sr-machine-learning-engineermachine-learning-engineer/
https://aijobs.net//job/992986-summer-2025-data-analyst-intern-santa-clara-ca/
https://aijobs.net//job/992982-lead-data-analyst/
https://aijobs.net//job/992977-director-of-innovation-information-and-data-sciences/
https://aijobs.net//job/992946-data-and-gift-specialist-i/
https://aijobs.net//job/992909-specialist-coding-standards-and-data-quality-p2bi/
https://aijobs.net//job/992869-senior-director-clinical-data-strategy-and-process-improvement/
https://aijobs.net//job/992855-sr
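
The printed URLs contain a double slash because the base URL ends with `/` and each href begins with `/`. One way to avoid this, sketched below, is `urllib.parse.urljoin` from the standard library. The two hrefs are copied from the output above; the rest of the cell is an illustration, not the notebook's original code.

```python
from urllib.parse import urljoin

job_board_url = "https://aijobs.net/"

# hrefs in the same shape as those scraped from the job list
hrefs = [
    "/job/993083-software-engineer-ii/",
    "/job/993082-signal-processing-software-engineer/",
]

# urljoin resolves each href against the base URL, so the slash is not doubled
job_url_list = [urljoin(job_board_url, href) for href in hrefs]

for job_url in job_url_list:
    print(job_url)
```

In the notebook, the same call would take `link['href']` in place of the hard-coded strings.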

### 2) extract info from one listing

In [6]:
# extract html from job listing (same as cell 2)
job_url = job_url_list[0]
response = requests.get(job_url)
html_content = response.text

#### way 1: screen scraping 

In [7]:
# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract company name
company_name = soup.find('h2', class_='h5').text.strip()

# Extract job title
job_title = soup.find('h1', class_='display-5').text.strip()

# Extract job description
job_description_section = soup.find('div', class_='job-description-text')
job_description = job_description_section.get_text(separator='\n').strip() if job_description_section else "N/A"

# Extract salary range (CSS path to the salary badge; select() returns a list)
salary_badge = soup.select('#content > section > div > div > div:nth-child(2) > div.col-6.col-sm-7 > h5 > span')
salary = salary_badge[0].text.strip() if salary_badge else "N/A"

# Print extracted details
print(f"Company Name: {company_name}")
print(f"Job Title: {job_title}")
print(f"Job Description: {job_description[:500]}...")  # Truncate for readability
print(f"Salary: {salary}")
# downside: the salary is free-form text (e.g. "USD 114K - 154K"), so it must be parsed before it can be used as a number

Company Name: The Walt Disney Company
Job Title: Software Engineer II
Job Description: Job Posting Title:
Software Engineer II
Req ID:
10110382
Job Description:
On any given day at Disney Entertainment & ESPN Technology, we’re reimagining ways to create magical viewing experiences for the world’s most beloved stories while also transforming our media business for the future. Whether that’s evolving our streaming and digital products in new and immersive ways, powering worldwide advertising and distribution to enhance flexibility and efficiency, or delivering Disney’s unmatched ent...
Salary: USD 114K - 154K
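
To make the scraped salary text usable as numbers, it has to be parsed. The sketch below assumes the badge always looks like `USD 114K - 154K` (a three-letter currency code, then two amounts with an optional `K` suffix). That pattern is inferred from the single example above, and `parse_salary_text` is a hypothetical helper, so other listings may well need extra cases.

```python
import re

def parse_salary_text(salary_text):
    """Parse text like 'USD 114K - 154K' into (currency, min, max).

    Returns None when the text does not match the assumed pattern.
    """
    pattern = r"^([A-Z]{3})\s+(\d+(?:\.\d+)?)(K?)\s*-\s*(\d+(?:\.\d+)?)(K?)$"
    match = re.match(pattern, salary_text.strip())
    if match is None:
        return None
    currency, low, low_k, high, high_k = match.groups()
    # A trailing "K" means thousands
    low_value = float(low) * (1000 if low_k else 1)
    high_value = float(high) * (1000 if high_k else 1)
    return currency, low_value, high_value

print(parse_salary_text("USD 114K - 154K"))  # ('USD', 114000.0, 154000.0)
print(parse_salary_text("N/A"))              # None
```

Note that the badge is rounded to the nearest thousand, so the parsed values are only approximate.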


#### way 2: pull json data

In [8]:
# Find the script tag containing JSON-LD
script_tag = soup.find('script', type='application/ld+json')

# Load the JSON content
if script_tag:
    job_data = json.loads(script_tag.string)

    # Extract relevant fields
    company_name = job_data['hiringOrganization']['name']
    job_title = job_data['title']
    job_description = job_data['description']
    salary_min = job_data['baseSalary']['value']['minValue']
    salary_max = job_data['baseSalary']['value']['maxValue']

    # Print extracted data
    print(f"Company Name: {company_name}")
    print(f"Job Title: {job_title}")
    print(f"Job Description: {job_description[:500]}...")
    print(f"Salary Range: {salary_min} - {salary_max} USD")

Company Name: The Walt Disney Company
Job Title: Software Engineer II
Job Description: Job Posting Title:Software Engineer IIReq ID:10110382Job Description:On any given day at Disney Entertainment &amp; ESPN Technology, we’re reimagining ways to create magical viewing experiences for the world’s most beloved stories while also transforming our media business for the future. Whether that’s evolving our streaming and digital products in new and immersive ways, powering worldwide advertising and distribution to enhance flexibility and efficiency, or delivering Disney’s unmatched ...
Salary Range: 114900 - 154100 USD
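
The JSON-LD description can still contain HTML entities, as the ampersand in the output above shows. A minimal cleanup sketch using only the standard library follows. `clean_description` is a hypothetical helper and the sample string is a shortened piece of the description printed above.

```python
import html

def clean_description(raw_description):
    """Decode HTML entities and collapse runs of whitespace."""
    decoded = html.unescape(raw_description)
    return " ".join(decoded.split())

sample = "Disney Entertainment &amp; ESPN Technology,   we’re reimagining ways"
print(clean_description(sample))
```

If the description also contains HTML tags, passing it through `BeautifulSoup(...).get_text()` first would be a reasonable extra step.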


### 3) extract info from all listings

In [9]:
# write function to implement way 2

def extract_job_info(url):
    """
    Extracts job information from a given job listing URL.

    Args:
        url (str): The URL of the job listing.

    Returns:
        dict: On success, a dictionary containing the following key-value pairs:
            - 'company_name' (str): Name of the hiring organization.
            - 'job_title' (str): Title of the job.
            - 'job_description' (str): Detailed description of the job.
            - 'salary_min' (float or str): Minimum salary offered for the job.
            - 'salary_max' (float or str): Maximum salary offered for the job.
               Both salary fields are 'N/A' if salary information is unavailable.
            On failure, a dictionary with a single 'error' key describing the problem.
    """
    try:
        # Fetch the HTML content of the job listing
        response = requests.get(url)
        response.raise_for_status()  # Raise an error for bad status codes
        html_content = response.text
        
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Find the script tag containing JSON-LD
        script_tag = soup.find('script', type='application/ld+json')
        
        if script_tag:
            job_data = json.loads(script_tag.string)
            
            # Extract relevant fields with default values if not present
            company_name = job_data.get('hiringOrganization', {}).get('name', 'N/A')
            job_title = job_data.get('title', 'N/A')
            job_description = job_data.get('description', 'N/A')
            salary_data = job_data.get('baseSalary', {}).get('value', {})
            salary_min = salary_data.get('minValue', 'N/A')
            salary_max = salary_data.get('maxValue', 'N/A')
            
            return {
                'company_name': company_name,
                'job_title': job_title,
                'job_description': job_description,
                'salary_min': salary_min,
                'salary_max': salary_max
            }
        else:
            return {'error': 'No JSON-LD script found in the page'}
    
    except requests.RequestException as e:
        return {'error': f"Request failed: {e}"}
    
    except json.JSONDecodeError:
        return {'error': 'Failed to parse JSON-LD content'}
    
    except Exception as e:
        return {'error': f"An unexpected error occurred: {e}"}

In [10]:
# extract job info from all job urls
job_info_list = []

for job_url in job_url_list:
    # extract job info
    job_info = extract_job_info(job_url)

    # store results in list if no errors occurred
    if "error" in job_info:
        print(f"Could not extract info from: {job_url}")
        continue
    print(job_info["job_title"])
    job_info_list.append(job_info)

Software Engineer II
Signal Processing Software Engineer
Senior Virtual Desktop Engineer
Director, Data Management NDC
Senior Clinical Data Management Manager
Data Management Intern
Inventory Data Specialist
Sr. Machine Learning Engineer/Machine Learning Engineer
Summer 2025 Data Analyst Intern (Santa Clara, CA)
Lead Data Analyst
Director of Innovation - Information and Data Sciences
Data and Gift Specialist I
Specialist Coding Standards and Data Quality P2BI
Senior Director, Clinical Data Strategy and Process Improvement
Sr. Business Intelligence Developer
Senior Associate Director, Data Science & Data Governance
Data Governance Analyst
Senior Software Engineer – APIs and Data Pipelines
CNO Developer
Associate Director - Catastrophe Analytics
Senior Lead Analytics Consultant
Credit Risk Management-Risk Analytics Model Intern
Senior Solution Consultant - USA
Jr Business Intelligence Analyst/Developer
Business Intelligence Analyst
Software Engineer
Director of Advanced Insights
Prin
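
The loop above requests every listing back to back. A gentler variant, sketched here, pauses between requests and keeps the failures alongside the successes so they can be inspected later. `scrape_all` and `fake_extract` are hypothetical names; `fake_extract` stands in for `extract_job_info` so the example runs without network access, and the one-second default delay is an arbitrary choice.

```python
import time

def scrape_all(urls, extract, delay_seconds=1.0):
    """Run `extract` on each URL, pausing between calls.

    Returns (results, failures): results holds dicts without an 'error' key,
    failures holds (url, error_message) pairs.
    """
    results, failures = [], []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # space out the requests
        info = extract(url)
        if "error" in info:
            failures.append((url, info["error"]))
        else:
            results.append(info)
    return results, failures

# Stand-in extractor for demonstration only
def fake_extract(url):
    if url.endswith("bad/"):
        return {"error": "No JSON-LD script found in the page"}
    return {"job_title": "Software Engineer II"}

results, failures = scrape_all(
    ["https://example.com/ok/", "https://example.com/bad/"],
    fake_extract,
    delay_seconds=0,
)
print(len(results), len(failures))  # 1 1
```

In the notebook this would be called as `scrape_all(job_url_list, extract_job_info)`.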

### 4) Store data in Pandas dataframe

In [11]:
df = pd.DataFrame(job_info_list)
df.head()

| | company_name | job_title | job_description | salary_min | salary_max |
|---|---|---|---|---|---|
| 0 | The Walt Disney Company | Software Engineer II | Job Posting Title:Software Engineer IIReq ID:1... | 114900.0 | 154100.0 |
| 1 | Leidos | Signal Processing Software Engineer | Do you want to join a high performing team tha... | 67600.0 | 122200.0 |
| 2 | Govcio LLC | Senior Virtual Desktop Engineer | Overview GovCIO is currently seeking a highly ... | NaN | NaN |
| 3 | Crinetics Pharmaceuticals | Director, Data Management NDC | Crinetics is a pharmaceutical company based in... | 189000.0 | 236000.0 |
| 4 | Crinetics Pharmaceuticals | Senior Clinical Data Management Manager | Crinetics is a pharmaceutical company based in... | 114000.0 | 143000.0 |


In [12]:
# save to file
# note: to_csv does not create missing folders, so the data/ directory must already exist
df.to_csv("data/ai_job_data.csv", index=False)
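
Writing to `data/ai_job_data.csv` fails if the `data/` folder is missing. A small sketch of one way to guard against that with `pathlib` follows. `ensure_parent_dir` is a hypothetical helper, not part of pandas.

```python
from pathlib import Path

def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return the Path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

output_path = ensure_parent_dir("data/ai_job_data.csv")
print(output_path)
```

The returned path can then be passed straight to `df.to_csv(output_path, index=False)`.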

#### Future directions
- extract other fields from job listings e.g. tags, key skills
- add filters to job search e.g. remote, Product, salary
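
As a starting point for the first item, the sketch below pulls a few more fields from a parsed JSON-LD dict. The keys (`datePosted`, `employmentType`, `skills`, `jobLocation`, `jobLocationType`) are properties defined by the schema.org JobPosting type. Whether a given listing on the site actually fills them in is an assumption, and both `extract_extra_fields` and the sample record are hypothetical.

```python
def extract_extra_fields(job_data):
    """Pull optional schema.org JobPosting fields from a parsed JSON-LD dict."""
    location = job_data.get("jobLocation") or {}
    # jobLocation may be a single object or a list of objects
    if isinstance(location, list):
        location = location[0] if location else {}
    address = location.get("address") if isinstance(location, dict) else None
    # address may be plain text rather than a PostalAddress object
    if not isinstance(address, dict):
        address = {}
    return {
        "date_posted": job_data.get("datePosted", "N/A"),
        "employment_type": job_data.get("employmentType", "N/A"),
        "skills": job_data.get("skills", "N/A"),
        "city": address.get("addressLocality", "N/A"),
        "is_remote": job_data.get("jobLocationType") == "TELECOMMUTE",
    }

# Hypothetical record for illustration, not taken from the site
sample_job_data = {
    "title": "Data Analyst",
    "datePosted": "2025-01-15",
    "employmentType": "FULL_TIME",
    "jobLocation": {"address": {"addressLocality": "Austin"}},
}

extra_fields = extract_extra_fields(sample_job_data)
print(extra_fields)
```

Missing fields fall back to `'N/A'`, matching the convention used in `extract_job_info`.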