In this segment of my data acquisition, I designed a web scraping script to collect detailed job data from the Bureau of Labor Statistics (BLS) website. The script fills an important gap in my project by providing current job listings and wage data, which I need for analyzing employment trends. Here's how the process unfolds:

1. **Web Browser Automation Setup**: As in my previous scripts, I use Selenium with Chrome WebDriver. Each session picks one user agent at random from a predefined list, so requests do not always present the same browser signature. This setup helps in accessing the BLS site without triggering anti-bot mechanisms.

2. **Job Listings Scraping**: I navigate to a specific BLS page that lists various job categories and roles. Using BeautifulSoup, the script parses the HTML to extract job names, codes, and specific URLs for more detailed data. This ensures that I capture a broad spectrum of employment opportunities available in the database.

3. **Data Structuring and Storage**: The extracted job information, including names, codes, and URLs, is structured into a list of dictionaries, which is then saved into a JSON file. This file acts as a directory of jobs that can be referenced in subsequent analyses or for deeper data extraction.

4. **Detailed Wage Data Extraction**: In the extended part of the script, I revisit each job's URL to scrape detailed wage information. This includes navigating to individual job pages and extracting wage data for various percentiles (10th, 25th, 50th, 75th, and 90th). This step is crucial for understanding wage distributions across different jobs and sectors.

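5. **Error Handling and Robustness**: Each wage lookup is wrapped in a `try`/`except` block. When a cell is missing, the script logs the failure and records "Data not available" for that percentile, so a single gap does not stop the run. The WebDriver is closed in a `finally` block even if scraping fails partway through.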

6. **Final Data Compilation and Saving**: After collecting all the necessary details, the wage data for each job is compiled into a comprehensive JSON file. This file now serves as a rich source of current wage information by job type, which is essential for any employment trend analysis and predictive wage modeling in my project.
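
Steps 4 and 5 leave the wage fields as strings, and the sample output further down shows values such as `' 47100'` (with a leading space) and `'(5)'`, which appears to be a footnote marker rather than a dollar amount. A minimal sketch of how these fields could be normalized before analysis follows. The helper names `parse_wage` and `normalize_wages` are hypothetical and not part of the original script, and treating every non-numeric value as missing is an assumption.

```python
from typing import Optional


def parse_wage(raw: str) -> Optional[int]:
    """Convert a scraped wage string to an int, or None if it is not numeric.

    Assumption: anything that is not a plain number after removing '$' and ','
    (for example '(5)' or 'Data not available') is treated as missing.
    """
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def normalize_wages(record: dict) -> dict:
    """Return a copy of a job record with every 'Annual Wage N' field parsed."""
    return {
        key: parse_wage(value) if key.startswith("Annual Wage") else value
        for key, value in record.items()
    }


# Values copied from the first record in the notebook output
sample = {
    "name": "Agents and Business Managers of Artists, Performers, and Athletes",
    "code": "13-1011",
    "Annual Wage 10": " 47100",
    "Annual Wage 90": "(5)",
}
print(normalize_wages(sample))
```

Returning `None` instead of a placeholder string keeps the wage columns numeric once the records are loaded for modeling, at the cost of losing the distinction between a footnoted value and a failed lookup.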


In [18]:
import json
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# USER_AGENTS list as provided
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0",
    "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/604.5.6 (KHTML, like Gecko) Version/11.1 Safari/604.5.6",
    "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
    "Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; AS; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
]

def init_driver():
    # Configure Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")
    # Initialize Chrome WebDriver
    driver = webdriver.Chrome(options=chrome_options)
    return driver

def scrape_jobs(driver):
    url = "https://www.bls.gov/oes/current/oes_stru.htm"
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    jobs_list = []
    # Adjust the selector to be more inclusive to capture all <li> tags under the relevant <ul> structures
    job_elements = soup.select("#container > div > ul ul ul li")
    for job_element in job_elements:
        link_tag = job_element.find('a')
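        # The code is the text node just before the link (e.g. "13-1011"); skip items without one
        if link_tag and isinstance(link_tag.previous_sibling, str):
            # BeautifulSoup decodes &nbsp; to "\xa0", which str.strip() already removes
            job_code = link_tag.previous_sibling.strip()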
            job_name = " ".join(link_tag.get_text(strip=True).split())
            job_url = f"https://www.bls.gov/oes/current/{link_tag['href']}"
            jobs_list.append({"name": job_name, "code": job_code, "url": job_url})
    
    return jobs_list



def save_jobs(jobs):
    # Define the path where you want to save the JSON file
    file_path = "/Users/a1234/Desktop/workspace/Employment_Analysis_and_Recommendation_System_Based_on_NLP_and_Data_Modeling/data/Job_Directory.json"
    with open(file_path, 'w') as f:
        json.dump(jobs, f, indent=4)

def main():
    driver = init_driver()
    try:
        jobs = scrape_jobs(driver)
        save_jobs(jobs)
    finally:
        driver.quit()

if __name__ == "__main__":
    main()
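
Because the second pass depends entirely on the codes and URLs saved in `Job_Directory.json`, it can help to sanity-check the directory before scraping wages. The sketch below flags entries that break the pattern visible in the sample output, where code `13-1011` pairs with `oes131011.htm`. Both the code format and the URL pattern are assumptions inferred from those few records, and the helper names are hypothetical.

```python
import re

# Assumption: codes look like "13-1011" and detail pages follow the
# oes<digits>.htm pattern seen in the sample output.
SOC_CODE = re.compile(r"^\d{2}-\d{4}$")
BASE_URL = "https://www.bls.gov/oes/current/"


def expected_url(code: str) -> str:
    """Build the detail-page URL implied by an occupation code (hypothetical helper)."""
    return f"{BASE_URL}oes{code.replace('-', '')}.htm"


def suspicious_entries(jobs: list) -> list:
    """Return directory entries whose code or URL breaks the assumed pattern."""
    flagged = []
    for job in jobs:
        code = job.get("code", "")
        if not SOC_CODE.match(code) or job.get("url") != expected_url(code):
            flagged.append(job)
    return flagged


directory = [
    {"name": "Buyers and Purchasing Agents", "code": "13-1020",
     "url": "https://www.bls.gov/oes/current/oes131020.htm"},
    # Made-up entry with a stray non-breaking space left in the code
    {"name": "Malformed example", "code": "13-1020\xa0",
     "url": "https://www.bls.gov/oes/current/oes131020.htm"},
]
print(suspicious_entries(directory))
```

A flagged entry is not necessarily wrong, since the real page may use other link patterns; the check only narrows down which records to inspect by hand.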


In [20]:
import json
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import random


# Function to scrape job wage details
def scrape_job_details(driver, jobs):
    job_data = []
    for job in jobs:
        print(f"Processing {job['name']}...")
        driver.get(job['url'])
        driver.implicitly_wait(5)  # Element lookups poll for up to 5 seconds before giving up
        details = {'name': job['name'], 'code': job['code'], 'url': job['url']}
        
        # Use the presence of the 'tr:nth-child(3)' to determine the row for wage data
        row_selector = f"#oes{job['code'].lower()}_b > tbody > tr:nth-child(3)"
        if not driver.find_elements(By.CSS_SELECTOR, row_selector):
            # If 'tr:nth-child(3)' does not exist, use 'tr:nth-child(2)'
            row_selector = f"#oes{job['code'].lower()}_b > tbody > tr:nth-child(2)"
        
        wage_levels = ['10', '25', '50', '75', '90']
        for i, wage in enumerate(wage_levels, 2):
            try:
                wage_selector = f"{row_selector} > td:nth-child({i})"
                wage_value_element = driver.find_element(By.CSS_SELECTOR, wage_selector)
                # Remove '$' and ',' first, then strip, so no space is left behind after the '$'
                wage_value = wage_value_element.text.replace('$', '').replace(',', '').strip()
                details[f'Annual Wage {wage}'] = wage_value
            except Exception as e:
                print(f"Failed to retrieve Annual Wage {wage} for {job['name']}. Error: {str(e)}")
                details[f'Annual Wage {wage}'] = "Data not available"
        print(details)
        job_data.append(details)
    return job_data


# Function to save the scraped data to JSON
def save_job_data(job_data):
    with open("/Users/a1234/Desktop/workspace/Employment_Analysis_and_Recommendation_System_Based_on_NLP_and_Data_Modeling/data/Job_Data_Sum.json", 'w') as f:
        json.dump(job_data, f, indent=4)
    print("Data saved to Job_Data_Sum.json.")

def main():
    # Load job directory data
    with open("/Users/a1234/Desktop/workspace/Employment_Analysis_and_Recommendation_System_Based_on_NLP_and_Data_Modeling/data/Job_Directory.json", 'r') as file:
        jobs = json.load(file)
    print("Job directory loaded.")
    
    driver = init_driver()
    try:
        job_details = scrape_job_details(driver, jobs)
        save_job_data(job_details)
    finally:
        driver.quit()
        print("WebDriver closed.")

if __name__ == "__main__":
    main()


Job directory loaded.
Processing Agents and Business Managers of Artists, Performers, and Athletes...
{'name': 'Agents and Business Managers of Artists, Performers, and Athletes', 'code': '13-1011', 'url': 'https://www.bls.gov/oes/current/oes131011.htm', 'Annual Wage 10': ' 47100', 'Annual Wage 25': ' 62280', 'Annual Wage 50': ' 84900', 'Annual Wage 75': ' 129930', 'Annual Wage 90': '(5)'}
Processing Buyers and Purchasing Agents...
{'name': 'Buyers and Purchasing Agents', 'code': '13-1020', 'url': 'https://www.bls.gov/oes/current/oes131020.htm', 'Annual Wage 10': ' 43680', 'Annual Wage 25': ' 54910', 'Annual Wage 50': ' 71950', 'Annual Wage 75': ' 94910', 'Annual Wage 90': ' 121680'}
Processing Claims Adjusters, Examiners, and Investigators...
{'name': 'Claims Adjusters, Examiners, and Investigators', 'code': '13-1031', 'url': 'https://www.bls.gov/oes/current/oes131031.htm', 'Annual Wage 10': ' 47390', 'Annual Wage 25': ' 58770', 'Annual Wage 50': ' 75050', 'Annual Wage 75': ' 91100', 