# Scraping an AI Job Board with Python
## ABB #3 - Session 1

Code authored by: Shaw Talebi

### imports

In [1]:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

### 1) extract job listing links

In [2]:
# URL of the website
job_board_url = "https://aijobs.net"
query = "/?reg=5" # north america jobs

# Send a GET request to the website (the timeout avoids hanging on a stalled connection)
response = requests.get(job_board_url + query, timeout=10)

# Raise an error if the request failed, so later cells never run without HTML
response.raise_for_status()

# Get the HTML content
html_content = response.text

In [3]:
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

In [4]:
# Find all job links within the <ul> list
job_links = soup.select('ul#job-list a.col.py-2[href]')

# explanation from ChatGPT:
# This selects every <a> tag that has an href attribute and both classes "col" and "py-2",
# located inside the <ul> element with id="job-list"

In [5]:
# Extract href attributes and create full URLs
job_url_list = [job_board_url + link['href'] for link in job_links]

for job_url in job_url_list:
    print(job_url)

https://aijobs.net/job/998893-database-warehouse-analyst/
https://aijobs.net/job/998888-executive-compensation-and-stock-analyst/
https://aijobs.net/job/998878-senior-crm-and-loyalty-analyst/
https://aijobs.net/job/998877-quantitative-analytics-prime-services/
https://aijobs.net/job/998876-load-research-and-analysis-intern/
https://aijobs.net/job/998875-rc-pricing-staff-ey-gds/
https://aijobs.net/job/998874-senior-analyst-product-research-auto-product-design/
https://aijobs.net/job/998870-fx-derivatives-quant/
https://aijobs.net/job/998860-manager-ifrs-9-modelling-enterprise-stress-testing/
https://aijobs.net/job/998856-data-solutions-architect/
https://aijobs.net/job/998847-senior-nextjs-engineer/
https://aijobs.net/job/998845-devops-solutions-engineer/
https://aijobs.net/job/998843-application-developer/
https://aijobs.net/job/998835-sr-customer-success-specialist/
https://aijobs.net/job/998834-full-stack-developer-senior/
https://aijobs.net/job/998833-geospatial-analyst-ii-skillbrid
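
The string concatenation in cell 5 works because the hrefs on this page are site-relative paths that start with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with hrefs that are already absolute. The first href below is taken from the output above; the second is an assumed absolute form of a listed URL, included only to show that case.

```python
from urllib.parse import urljoin

job_board_url = "https://aijobs.net"

# Sample href values; the absolute one is an assumption for illustration
sample_hrefs = [
    "/job/998893-database-warehouse-analyst/",
    "https://aijobs.net/job/998888-executive-compensation-and-stock-analyst/",
]

# urljoin resolves relative paths against the base and leaves absolute URLs alone
full_urls = [urljoin(job_board_url, href) for href in sample_hrefs]

for url in full_urls:
    print(url)
```

In the notebook, the same call could replace `job_board_url + link['href']` inside the list comprehension.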

### 2) extract info from one listing

In [6]:
# extract html from job listing (same as cell 2)
job_url = job_url_list[0]
response = requests.get(job_url)
html_content = response.text

#### way 1: screen scraping 

In [7]:
# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract company name
company_name = soup.find('h2', class_='h5').text.strip()

# Extract job title
job_title = soup.find('h1', class_='display-5').text.strip()

# Extract job description
job_description_section = soup.find('div', class_='job-description-text')
job_description = job_description_section.get_text(separator='\n').strip() if job_description_section else "N/A"

# Extract salary range (select() returns a list, which is empty if nothing matches)
salary_badge = soup.select('#content > section > div > div > div:nth-child(2) > div.col-6.col-sm-7 > h5 > span')
salary = salary_badge[0].text.strip() if salary_badge else "N/A"

# Print extracted details
print(f"Company Name: {company_name}")
print(f"Job Title: {job_title}")
print(f"Job Description: {job_description[:500]}...")  # Truncate for readability
print(f"Salary: {salary}")
# downside: irregular salary format

Company Name: Taylor Corporation
Job Title: Database Warehouse Analyst
Job Description: Let Us Power Your Potential
Taylor Corporation is a dynamic, diversified company with big plans for the future ― and your career. We power our employees’ potential and strive to create opportunity and security for every member of the team. If you’re ready for something bigger ― more challenge, more variety, more pathways for professional growth ― we should talk. We’re passionate about our work, we believe there is always a better way, and we’re looking for people like you.
Ready to reach your po...
Salary: N/A
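
The comment in cell 7 notes that scraped salary text comes in an irregular format. A minimal sketch of one way to normalize such text with a regular expression is shown below. The helper name `parse_salary_text` and the input formats it handles (for example `"USD 120K - 150K"`) are assumptions for illustration and have not been checked against the site's actual badge text.

```python
import re

def parse_salary_text(text):
    """Return (min, max) as floats, or (None, None) if no numbers are found.

    Hypothetical helper: the salary formats handled here are assumptions.
    """
    # Match numbers such as 120K, 95,000 or 150.5k, with an optional K suffix
    matches = re.findall(r"(\d[\d,]*(?:\.\d+)?)\s*([kK]?)", text)
    values = []
    for number, suffix in matches:
        value = float(number.replace(",", ""))
        if suffix:
            value *= 1000  # a K suffix means thousands
        values.append(value)
    if not values:
        return (None, None)
    return (min(values), max(values))

print(parse_salary_text("USD 120K - 150K"))
print(parse_salary_text("N/A"))
```

A single number yields the same value for both ends of the range, and text without digits yields `(None, None)`.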


#### way 2: pull json data

In [8]:
# Find the script tag containing JSON-LD
script_tag = soup.find('script', type='application/ld+json')

# Load the JSON content
if script_tag:
    job_data = json.loads(script_tag.string)

    # Extract relevant fields
    company_name = job_data['hiringOrganization']['name']
    job_title = job_data['title']
    job_description = job_data['description']
    salary_min = job_data['baseSalary']['value']['minValue']
    salary_max = job_data['baseSalary']['value']['maxValue']

    # Print extracted data
    print(f"Company Name: {company_name}")
    print(f"Job Title: {job_title}")
    print(f"Job Description: {job_description[:500]}...")
    print(f"Salary Range: {salary_min} - {salary_max} USD")

Company Name: Taylor Corporation
Job Title: Database Warehouse Analyst
Job Description: Let Us Power Your PotentialTaylor Corporation is a dynamic, diversified company with big plans for the future ― and your career. We power our employees’ potential and strive to create opportunity and security for every member of the team. If you’re ready for something bigger ― more challenge, more variety, more pathways for professional growth ― we should talk. We’re passionate about our work, we believe there is always a better way, and we’re looking for people like you.Ready to reach your pote...
Salary Range: 54978 - 102102 USD
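
The keys read in cell 8 (`hiringOrganization`, `title`, `baseSalary`) match the schema.org `JobPosting` vocabulary. Direct indexing raises a `KeyError` when a listing omits `baseSalary`, so a tolerant lookup can help. The sketch below runs on inline sample payloads: the first reuses the values printed above, the second is a made-up listing without salary data, and `get_salary_range` is a hypothetical helper name.

```python
import json

# Sample payloads; the structure mirrors the fields read in cell 8
with_salary = json.loads("""
{
  "@type": "JobPosting",
  "title": "Database Warehouse Analyst",
  "hiringOrganization": {"name": "Taylor Corporation"},
  "baseSalary": {"value": {"minValue": 54978, "maxValue": 102102}}
}
""")
# Hypothetical listing with no salary information
without_salary = json.loads('{"@type": "JobPosting", "title": "Example Role"}')

def get_salary_range(job_data):
    """Return (min, max) from a JobPosting dict, or (None, None) if absent."""
    # "or {}" also covers keys that are present but set to null
    base_salary = job_data.get("baseSalary") or {}
    value = base_salary.get("value") or {}
    return (value.get("minValue"), value.get("maxValue"))

print(get_salary_range(with_salary))
print(get_salary_range(without_salary))
```

Returning `None` rather than a placeholder string keeps the salary columns numeric once the records are loaded into pandas.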


### 3) extract info from all listings

In [9]:
# write function to implement way 2

def extract_job_info(url):
    """
    Extracts job information from a given job listing URL.

    Args:
        url (str): The URL of the job listing.

    Returns:
        dict: On success, a dictionary containing the following key-value pairs:
            - 'company_name' (str): Name of the hiring organization.
            - 'job_title' (str): Title of the job.
            - 'job_description' (str): Detailed description of the job.
            - 'salary_min' (float or str): Minimum salary offered for the job.
            - 'salary_max' (float or str): Maximum salary offered for the job.
            Fields missing from the listing default to 'N/A'.
            On failure, a dictionary with a single 'error' key describing the problem.
    """
    try:
        # Fetch the HTML content of the job listing
        response = requests.get(url, timeout=10)  # timeout errors are caught below as RequestException
        response.raise_for_status()  # Raise an error for bad status codes
        html_content = response.text
        
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Find the script tag containing JSON-LD
        script_tag = soup.find('script', type='application/ld+json')
        
        if script_tag:
            job_data = json.loads(script_tag.string)
            
            # Extract relevant fields with default values if not present
            company_name = job_data.get('hiringOrganization', {}).get('name', 'N/A')
            job_title = job_data.get('title', 'N/A')
            job_description = job_data.get('description', 'N/A')
            salary_data = job_data.get('baseSalary', {}).get('value', {})
            salary_min = salary_data.get('minValue', 'N/A')
            salary_max = salary_data.get('maxValue', 'N/A')
            
            return {
                'company_name': company_name,
                'job_title': job_title,
                'job_description': job_description,
                'salary_min': salary_min,
                'salary_max': salary_max
            }
        else:
            return {'error': 'No JSON-LD script found in the page'}
    
    except requests.RequestException as e:
        return {'error': f"Request failed: {e}"}
    
    except json.JSONDecodeError:
        return {'error': 'Failed to parse JSON-LD content'}
    
    except Exception as e:
        return {'error': f"An unexpected error occurred: {e}"}

In [10]:
# extract job info from all job urls
job_info_list = []

for job_url in job_url_list:
    # extract job info
    job_info = extract_job_info(job_url)

    # store results in list if no errors occurred
    if "error" in job_info:
        print(f"Could not extract info from: {job_url}")
        continue

    print(job_info["job_title"])
    job_info_list.append(job_info)

Database Warehouse Analyst
Executive Compensation and Stock Analyst
Senior CRM and Loyalty Analyst
Quantitative Analytics Prime Services
Load Research and Analysis Intern
RC - Pricing - Staff - EY GDS
Senior Analyst, Product Research, Auto Product Design
FX Derivatives Quant
Manager, IFRS 9 Modelling - Enterprise Stress Testing
Data Solutions Architect
Senior Next.js Engineer
DevOps Solutions Engineer
Application developer
Sr. Customer Success Specialist
Full Stack Developer (Senior)
Geospatial Analyst II - SkillBridge Eligible Opportunity
Artificial Intelligence (AI) - (SME)
Software Engineer, Robotics
AI Product Manager: Gen AI
Marketing BI Analyst, Principal, Sr
Credit Strategy, Decision Sciences
IT Data Science
Could not extract info from: https://aijobs.net/job/998789-data-scientist/
Data Scientist - Biologist - Indianapolis, IN
Data Scientist
Senior Quantitative Analyst, Data Science
IT Advisor - Technology Con - AI and Data - Data Science - Manager - Multiple Positions - 1580124
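
One listing above could not be extracted, and the loop discards the reason. A minimal sketch of a loop that keeps the error message for each failed URL follows. `collect_job_info` and `fake_extract` are hypothetical names; `fake_extract` is a stand-in with the same return shape as `extract_job_info` so the example runs without network access, and the `example.com` URLs are placeholders.

```python
import time

def collect_job_info(urls, extract, delay_seconds=0.0):
    """Split results into successes and (url, error) failures.

    `extract` is any function with the same return shape as extract_job_info.
    """
    successes, failures = [], []
    for url in urls:
        info = extract(url)
        if "error" in info:
            failures.append((url, info["error"]))
        else:
            successes.append(info)
        time.sleep(delay_seconds)  # optional pause between requests
    return successes, failures

# Stand-in extractor so the sketch runs offline (hypothetical data)
def fake_extract(url):
    if url.endswith("bad/"):
        return {"error": "No JSON-LD script found in the page"}
    return {"job_title": "Example Role", "salary_min": "N/A", "salary_max": "N/A"}

ok, failed = collect_job_info(
    ["https://example.com/good/", "https://example.com/bad/"], fake_extract
)
print(len(ok), failed)
```

In the notebook, the call would be `collect_job_info(job_url_list, extract_job_info, delay_seconds=1.0)`, where the one-second delay is an arbitrary choice to avoid sending requests back to back.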

### 4) Store data in Pandas dataframe

In [11]:
df = pd.DataFrame(job_info_list)
df.head()

|   | company_name | job_title | job_description | salary_min | salary_max |
|---|---|---|---|---|---|
| 0 | Taylor Corporation | Database Warehouse Analyst | Let Us Power Your PotentialTaylor Corporation ... | 54978.0 | 102102.0 |
| 1 | 8x8 | Executive Compensation and Stock Analyst | 8x8, Inc. (NASDAQ: EGHT) believes that CX limi... | 138000.0 | 230000.0 |
| 2 | Cracker Barrel | Senior CRM and Loyalty Analyst | WHY CRACKER BARRELWhat is it like to work at C... | 82467.0 | 153153.0 |
| 3 | Barclays | Quantitative Analytics Prime Services | Job DescriptionPurpose of the roleTo provide q... | 125000.0 | 175000.0 |
| 4 | Xcel Energy | Load Research and Analysis Intern | Are you looking for an exciting job where you ... |  |  |


In [12]:
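# save to file (create the output folder first so to_csv does not fail)
from pathlib import Path
Path("data").mkdir(exist_ok=True)
df.to_csv("data/ai_job_data.csv", index=False)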

#### Future directions
- extract other fields from job listings e.g. tags, key skills
- add filters to job search e.g. remote, Product, salary
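
As a starting point for the salary filter mentioned above, here is a minimal sketch that filters the dictionaries returned by `extract_job_info` after scraping. `filter_by_min_salary` is a hypothetical helper. The sample records reuse company names and minimum salaries from the dataframe above, with the missing Xcel Energy salary written as `None`.

```python
def filter_by_min_salary(jobs, threshold):
    """Keep jobs whose salary_min is a number at or above threshold."""
    kept = []
    for job in jobs:
        salary_min = job.get("salary_min")
        # Skip missing values such as None or 'N/A'
        if isinstance(salary_min, (int, float)) and salary_min >= threshold:
            kept.append(job)
    return kept

jobs = [
    {"company_name": "Taylor Corporation", "salary_min": 54978},
    {"company_name": "8x8", "salary_min": 138000},
    {"company_name": "Xcel Energy", "salary_min": None},
]

high_paying = filter_by_min_salary(jobs, 100000)
print([job["company_name"] for job in high_paying])
```

Filtering on the site itself, for example through additional query parameters, would need the parameter names to be confirmed from the job board first.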