# Scraping an AI Job Board with Python
## ABB #2 - Session 1

Code authored by: Shaw Talebi

### imports

In [1]:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

### 1) extract job listing links

In [2]:
# URL of the website
job_board_url = "https://aijobs.net"

# Send a GET request to the website
response = requests.get(job_board_url)

# Stop here if the request failed, so later cells never use an undefined html_content
response.raise_for_status()

# Get the HTML content
html_content = response.text

In [3]:
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

In [4]:
# Find all job links within the <ul> list
job_links = soup.select('ul#job-list a.col.py-2[href]')

# explanation from ChatGPT:
# This selects all <a> tags that have an href attribute and both classes "col" and "py-2", inside the <ul> element with id="job-list"

In [5]:
# Extract href attributes and create full URLs
job_url_list = [job_board_url + link['href'] for link in job_links]

for job_url in job_url_list:
    print(job_url)

https://aijobs.net/job/934495-staff-software-engineer/
https://aijobs.net/job/939954-data-science-intern-madrid-careerstartsas-2025/
https://aijobs.net/job/939953-midsenior-data-scientist/
https://aijobs.net/job/939952-data-scientist-artificial-intelligence/
https://aijobs.net/job/939951-senior-data-scientist/
https://aijobs.net/job/939949-senior-full-stack-software-engineer-ai-ml-model-development-ml-operations-and-applied-data-science/
https://aijobs.net/job/939948-mid-level-full-stack-software-engineer-aiml-model-development-ml-operations-and-applied-data-science/
https://aijobs.net/job/939943-java-and-spark-architect-tax-national-tax-other-ttt-development-blr-hyd-gurgaon/
https://aijobs.net/job/939942-lead-quantitative-analytics-specialist-genai-nlp-pyspark/
https://aijobs.net/job/939941-assistant-manager-legends-sparks-sparks-nv/
https://aijobs.net/job/939940-lead-data-engineer-python-spark-aws/
https://aijobs.net/job/939939-data-engineer-iii-java-aws-spark/
https://aijobs.net/job
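
Plain string concatenation works above because every `href` on the page is a root-relative path. As a hedged alternative, the sketch below resolves links with `urllib.parse.urljoin`, which also copes with an `href` that is already absolute. The `sample_hrefs` list is a hypothetical stand-in for the values pulled from `job_links`.

```python
from urllib.parse import urljoin

job_board_url = "https://aijobs.net"

# Hypothetical href values of the kinds a page might contain
sample_hrefs = [
    "/job/934495-staff-software-engineer/",                  # root-relative
    "https://aijobs.net/job/939951-senior-data-scientist/",  # already absolute
    "/job/934495-staff-software-engineer/",                  # duplicate link
]

# urljoin resolves each href against the base URL
job_url_list = [urljoin(job_board_url, href) for href in sample_hrefs]

# Drop duplicates while preserving the original order
job_url_list = list(dict.fromkeys(job_url_list))

for job_url in job_url_list:
    print(job_url)
```

Deduplicating is optional here; it only matters if the same listing is linked more than once on a page.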

### 2) extract info from one listing

In [6]:
# extract html from job listing (same as cell 2)
job_url = job_url_list[0]
response = requests.get(job_url)
html_content = response.text

#### way 1: screen scraping 

In [7]:
# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract company name
company_name = soup.find('h2', class_='h5').text.strip()

# Extract job title
job_title = soup.find('h1', class_='display-5').text.strip()

# Extract job description
job_description_section = soup.find('div', class_='job-description-text')
job_description = job_description_section.get_text(separator='\n').strip() if job_description_section else "N/A"

# Extract salary range (select returns a list, which is empty if nothing matches)
salary_badge = soup.select('#content > section > div > div > div:nth-child(2) > div.col-6.col-sm-7 > h5 > span')
salary = salary_badge[0].text.strip() if salary_badge else "N/A"

# Print extracted details
print(f"Company Name: {company_name}")
print(f"Job Title: {job_title}")
print(f"Job Description: {job_description[:500]}...")  # Truncate for readability
print(f"Salary: {salary}")

Company Name: murmuration
Job Title: Staff Software Engineer
Job Description: Who We Are


Murmuration is a nonprofit organization that amplifies the power of civic engagement by providing data, digital tools, and research-driven insights to community-focused organizations so that together we can create an America where everyone can lead healthy, free, and dignified lives.


Every day, people are trying to shape our future for the better. Fighting for water that’s safe to drink. Schools that serve students equitably. Gun laws that make sense. And rallying people who care ...
Salary: USD 135K - 165K
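
Screen scraping breaks easily: `soup.find(...)` returns `None` when an element is missing, and calling `.text` on it raises an `AttributeError`. One possible guard is a small helper that falls back to a default. The helper name `select_text` and the simplified `sample_html` are assumptions made for illustration, not part of the original notebook or the real page markup.

```python
from bs4 import BeautifulSoup


def select_text(soup, css_selector, default="N/A"):
    """Return the stripped text of the first match for css_selector, or default."""
    element = soup.select_one(css_selector)
    return element.get_text(strip=True) if element else default


# Hypothetical, simplified markup standing in for a job listing page
sample_html = """
<h1 class="display-5">Staff Software Engineer</h1>
<h2 class="h5">murmuration</h2>
"""
soup = BeautifulSoup(sample_html, "html.parser")

job_title = select_text(soup, "h1.display-5")
company_name = select_text(soup, "h2.h5")
salary = select_text(soup, "span.badge.text-bg-success")  # absent in the sample

print(job_title, company_name, salary)
```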


#### way 2: pull json data

In [8]:
# Find the script tag containing JSON-LD
script_tag = soup.find('script', type='application/ld+json')

# Load the JSON content
if script_tag:
    job_data = json.loads(script_tag.string)

    # Extract relevant fields
    company_name = job_data['hiringOrganization']['name']
    job_title = job_data['title']
    job_description = job_data['description']
    salary_min = job_data['baseSalary']['value']['minValue']
    salary_max = job_data['baseSalary']['value']['maxValue']

    # Print extracted data
    print(f"Company Name: {company_name}")
    print(f"Job Title: {job_title}")
    print(f"Job Description: {job_description[:500]}...")
    print(f"Salary Range: {salary_min} - {salary_max} USD")

Company Name: murmuration
Job Title: Staff Software Engineer
Job Description: Who We Are Murmuration is a nonprofit organization that amplifies the power of civic engagement by providing data, digital tools, and research-driven insights to community-focused organizations so that together we can create an America where everyone can lead healthy, free, and dignified lives. Every day, people are trying to shape our future for the better. Fighting for water that’s safe to drink. Schools that serve students equitably. Gun laws that make sense. And rallying people who care like...
Salary Range: 135000 - 165000 USD
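
`soup.find` returns only the first JSON-LD script on the page. If a page carried more than one such block, a sketch like the following could pick out the job posting. It assumes the listing is marked up as a schema.org `JobPosting`; the function name `find_job_posting` and the `sample_html` are hypothetical and only mirror the fields used above.

```python
import json
from bs4 import BeautifulSoup


def find_job_posting(soup):
    """Return the first JSON-LD object whose @type is JobPosting, else None."""
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # skip empty or malformed blocks
        # A JSON-LD script may hold a single object or a list of objects
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "JobPosting":
                return item
    return None


# Hypothetical page with two JSON-LD blocks
sample_html = """
<script type="application/ld+json">{"@type": "Organization", "name": "Example Board"}</script>
<script type="application/ld+json">
{"@type": "JobPosting", "title": "Staff Software Engineer",
 "hiringOrganization": {"name": "murmuration"},
 "baseSalary": {"value": {"minValue": 135000, "maxValue": 165000}}}
</script>
"""
job_data = find_job_posting(BeautifulSoup(sample_html, "html.parser"))
print(job_data["title"], job_data["baseSalary"]["value"]["minValue"])
```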


### 3) extract info from all listings

In [9]:
# write function to implement way 2

def extract_job_info(url):
    """
    Extracts job information from a given job listing URL.

    Args:
        url (str): The URL of the job listing.

    Returns:
        dict: A dictionary containing the following key-value pairs:
            - 'company_name' (str): Name of the hiring organization.
            - 'job_title' (str): Title of the job.
            - 'job_description' (str): Detailed description of the job.
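            - 'salary_min' (float or str): Minimum salary offered for the job,
               or 'N/A' if salary information is unavailable.
            - 'salary_max' (float or str): Maximum salary offered for the job,
               or 'N/A' if salary information is unavailable.
        If the page cannot be fetched or parsed, the dictionary instead
        contains a single 'error' key describing the failure.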
    """
    try:
        # Fetch the HTML content of the job listing
        response = requests.get(url)
        response.raise_for_status()  # Raise an error for bad status codes
        html_content = response.text
        
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Find the script tag containing JSON-LD
        script_tag = soup.find('script', type='application/ld+json')
        
        if script_tag:
            job_data = json.loads(script_tag.string)
            
            # Extract relevant fields with default values if not present
            company_name = job_data.get('hiringOrganization', {}).get('name', 'N/A')
            job_title = job_data.get('title', 'N/A')
            job_description = job_data.get('description', 'N/A')
            salary_data = job_data.get('baseSalary', {}).get('value', {})
            salary_min = salary_data.get('minValue', 'N/A')
            salary_max = salary_data.get('maxValue', 'N/A')
            
            return {
                'company_name': company_name,
                'job_title': job_title,
                'job_description': job_description,
                'salary_min': salary_min,
                'salary_max': salary_max
            }
        else:
            return {'error': 'No JSON-LD script found in the page'}
    
    except requests.RequestException as e:
        return {'error': f"Request failed: {e}"}
    
    except json.JSONDecodeError:
        return {'error': 'Failed to parse JSON-LD content'}
    
    except Exception as e:
        return {'error': f"An unexpected error occurred: {e}"}

In [10]:
# extract job info from all job urls
job_info_list = []

for job_url in job_url_list:
    # extract job info
    job_info = extract_job_info(job_url)

    # store results in list if no errors occurred
    if "error" in job_info:
        print(f"Could not extract info from: {job_url}")
        continue

    print(job_info["job_title"])
    job_info_list.append(job_info)

Staff Software Engineer
Data Science Intern- Madrid [CareerStart@SAS 2025]
Mid/Senior Data Scientist
Data Scientist-Artificial Intelligence
Senior Data Scientist
Senior Full Stack Software Engineer - AI  ML Model Development, ML Operations, and Applied Data Science
Mid-Level Full Stack Software Engineer -- AI/ML Model Development, ML Operations, and Applied Data Science
Java and spark architect - TAX - National - TAX - Other - TTT Development - BLR, HYD/ Gurgaon
Lead Quantitative Analytics Specialist - GenAI NLP PySpark
Assistant Manager, Legends @ Sparks, Sparks NV
Lead Data Engineer (Python, Spark, AWS)
Data Engineer III - JAVA, AWS, Spark
Data Engineer III - PySpark, Python, AWS
Data Engineer ( Pyspark / Python )
Could not extract info from: https://aijobs.net/job/939936-software-engineer-principal-model-implementation-platform-mip-python-pyspark-run-time-optimization-large-scale-processing/
Senior Software Engineer - Apache Spark Expert (Bangkok based, relocation provided)
Staff So
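
The loop above fires requests back to back and discards the reason a listing failed. A possible variation, sketched below, pauses between requests and keeps the error messages for later inspection. `scrape_all` and `fake_extract` are hypothetical names; `fake_extract` stands in for `extract_job_info` so the example runs without network access.

```python
import time


def scrape_all(urls, extract, delay_seconds=1.0):
    """Apply extract to each URL, pausing between calls; split successes from failures."""
    results, failures = [], []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause between requests to go easy on the server
        info = extract(url)
        if "error" in info:
            failures.append((url, info["error"]))
        else:
            results.append(info)
    return results, failures


# Hypothetical stand-in for extract_job_info so the sketch runs offline
def fake_extract(url):
    if "bad" in url:
        return {"error": "No JSON-LD script found in the page"}
    return {"job_title": f"Title for {url}"}


results, failures = scrape_all(
    ["https://example.com/job/1", "https://example.com/job/bad"],
    fake_extract,
    delay_seconds=0,
)
print(len(results), len(failures))
```

In the notebook this would be called as `scrape_all(job_url_list, extract_job_info)`, with a delay chosen to suit the site.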

### 4) Store data in Pandas dataframe

In [11]:
df = pd.DataFrame(job_info_list)
df.head()

| | company_name | job_title | job_description | salary_min | salary_max |
|---|---|---|---|---|---|
| 0 | murmuration | Staff Software Engineer | Who We Are Murmuration is a nonprofit organiza... | 135000.0 | 165000.0 |
| 1 | SAS | Data Science Intern- Madrid [CareerStart@SAS 2... | CareerStart@SAS Internship Program 2025 Da... | NaN | NaN |
| 2 | Endeavour Group | Mid/Senior Data Scientist | Company DescriptionLet’s create a more sociabl... | 91774.0 | 170438.0 |
| 3 | IBM | Data Scientist-Artificial Intelligence | IntroductionIn this role you will join IBM Con... | 74567.0 | 138482.0 |
| 4 | NielsenIQ | Senior Data Scientist | Company DescriptionGfk is seeking a Data Scien... | 57359.0 | 106524.0 |
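
Because `extract_job_info` falls back to the string `'N/A'` when a salary field is absent, the salary columns can end up with mixed types. A minimal sketch of one way to clean them, using `pd.to_numeric` with `errors="coerce"`; the `sample_rows` below are hypothetical and only mimic the shape of `job_info_list`.

```python
import pandas as pd

# Hypothetical rows mimicking the shape of job_info_list
sample_rows = [
    {"job_title": "Staff Software Engineer", "salary_min": 135000.0, "salary_max": 165000.0},
    {"job_title": "Example listing without salary", "salary_min": "N/A", "salary_max": "N/A"},
]
df = pd.DataFrame(sample_rows)

# Coerce non-numeric placeholders such as 'N/A' to NaN
for column in ["salary_min", "salary_max"]:
    df[column] = pd.to_numeric(df[column], errors="coerce")

# Midpoint of the range; stays NaN where the salary is missing
df["salary_mid"] = (df["salary_min"] + df["salary_max"]) / 2
print(df)
```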


In [12]:
# save to file (create the data/ folder first if it does not exist)
import os
os.makedirs("data", exist_ok=True)
df.to_csv("data/ai_job_data.csv", index=False)

#### Future directions
- add pagination (extract URLs from multiple pages)
- extract other fields from job listings e.g. tags, key skills
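
For the pagination idea, one possible shape is a loop that keeps requesting pages until a page contributes no new links. The sketch below makes no assumption about how the site numbers its pages: `fetch_links` is a hypothetical callable that takes a page number and returns the job links found on that page, and `fake_fetch_links` stands in for it so the example runs offline.

```python
def collect_job_urls(fetch_links, max_pages=5):
    """Gather links page by page until a page adds nothing new or max_pages is reached."""
    seen = []
    for page in range(1, max_pages + 1):
        links = fetch_links(page)
        # Keep only links not already collected from earlier pages
        new_links = [link for link in links if link not in seen]
        if not new_links:
            break
        seen.extend(new_links)
    return seen


# Hypothetical stand-in for a function that requests one page and returns its job links
fake_pages = {1: ["/job/1/", "/job/2/"], 2: ["/job/2/", "/job/3/"], 3: []}


def fake_fetch_links(page):
    return fake_pages.get(page, [])


all_links = collect_job_urls(fake_fetch_links)
print(all_links)
```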