In the first part, I set up a process to automatically scrape and collect the text of news releases from the Bureau of Labor Statistics (BLS) website. The process has five steps:

1. **Web Browser Automation**: Using Selenium and Chrome WebDriver, I configure the browser to mimic real user behavior by randomly selecting user agents. This helps in accessing the website without being blocked by potential security measures against bots.
2. **Web Scraping with BeautifulSoup**: Once the page is loaded, BeautifulSoup parses the HTML content to extract links to news releases, keeping only Employment Situation archive links (those whose URL contains `/news.release/archives/empsit` and ends with `.htm`).

3. **Data Collection and Storage**: Each link's text and corresponding URL are stored in a JSON file, ensuring that I have a structured dataset of URLs pointing to individual news releases.

4. **Further Content Scraping**: In the second script, I revisit the stored URLs to extract the full text of each news release. Each URL is opened, the page is given time to load, and the parsed text is saved under its release month (`YYYY-MM`).

5. **Error Handling and Data Integrity**: I implemented basic error handling during date extraction so that each text is associated with its release date. If the date cannot be parsed, the original link name is used as the key; if the content cannot be found, a placeholder is stored and a message is printed, which keeps gaps in the collected data visible.
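
The link filtering in step 2 can be sketched in isolation. The helper below is a hypothetical illustration (the name `normalize_release_url` and the sample href are not part of the scripts). It assumes the same path pattern and `.htm` suffix, and it uses the standard library's `urljoin` in place of string concatenation, so relative and absolute links are handled by one call.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bls.gov"
ARCHIVE_PATTERN = "/news.release/archives/empsit"


def normalize_release_url(href, base_url=BASE_URL):
    """Return an absolute URL for an archive link, or None if it should be skipped."""
    # Anchors without an href attribute yield None; skip them
    if not href:
        return None
    # Keep only archive links that end with .htm
    if ARCHIVE_PATTERN not in href or not href.endswith(".htm"):
        return None
    # urljoin leaves absolute URLs unchanged and resolves relative ones against base_url
    return urljoin(base_url, href)


# Hypothetical href, for illustration only
print(normalize_release_url("/news.release/archives/empsit_example.htm"))
```

Returning `None` for anything that is not an archive link lets the calling loop skip it with a single truthiness check.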

In [5]:
from bs4 import BeautifulSoup
import json
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import random

# USER_AGENTS is a list of user agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0",
    "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/604.5.6 (KHTML, like Gecko) Version/11.1 Safari/604.5.6",
    "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
    "Opera/9.60 (Windows NT 6.0; U; en) Presto/2.1.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; AS; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
]

def init_driver():
    # Configure Chrome options
    chrome_options = Options()
    # chrome_options.add_argument("--headless")  # Enable headless mode
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    
    # Randomly select a user agent
    chrome_options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")
    
    # Initialize Chrome WebDriver
    driver = webdriver.Chrome(options=chrome_options)
    return driver

# Initialize the driver using the init_driver function
driver = init_driver()

# Open the target webpage
url = 'https://www.bls.gov/bls/news-release/empsit.htm'
driver.get(url)

# Wait for the page to load
time.sleep(5)  # Adjust this time appropriately to ensure the webpage is fully loaded

# Get the page source
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Find all the desired links
base_url = 'https://www.bls.gov'
links = soup.select('div.bls--news-release div ul > li > a')

# Parse names and URLs
news_releases = {}
for link in links:
    # Get the text as name
    name = link.get_text().strip()
    # Get the href attribute and concatenate the complete URL
    partial_url = link.get('href')
    
    # Ensure the link is a news release link and ends with .htm
    # (link.get returns None for anchors without an href, so check that first)
    if partial_url and "/news.release/archives/empsit" in partial_url and partial_url.endswith('.htm'):
        # Make sure to add a prefix only if the URL doesn't start with http:// or https://
        if not partial_url.startswith(('http:', 'https:')):
            full_url = base_url + partial_url
        else:
            full_url = partial_url  # URL is already complete
        
        # Save to the dictionary
        if name and full_url:
            news_releases[name] = full_url

# Close the WebDriver
driver.quit()

# Save the obtained data as a JSON file
output_path = '/Users/a1234/Desktop/workspace/Employment_Analysis_and_Recommendation_System_Based_on_NLP_and_Data_Modeling/data/News_Releases_URL.json'
with open(output_path, 'w') as outfile:
    json.dump(news_releases, outfile, indent=4)

print(f"Data saved to {output_path}")


Data saved to /Users/a1234/Desktop/workspace/Employment_Analysis_and_Recommendation_System_Based_on_NLP_and_Data_Modeling/data/News_Releases_URL.json
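
The second script derives a `YYYY-MM` key from each release name. A rough sketch of that step, with the fallback made visible through logging, might look like the following. The helper name `release_key` and the sample names are assumptions for illustration. The sketch also assumes that names begin with a full English month name followed by a four-digit year, and that the active locale is English, since `%B` is locale-dependent.

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


def release_key(name):
    """Return 'YYYY-MM' for names starting with '<Month> <Year>', else the name itself."""
    parts = name.split()
    try:
        # Only the first two tokens are parsed; short or malformed names raise ValueError
        parsed = datetime.strptime(" ".join(parts[:2]), "%B %Y")
    except ValueError:
        logger.warning("Could not parse a date from %r; using the name as key", name)
        return name
    return parsed.strftime("%Y-%m")


# Invented sample names, for illustration only
print(release_key("March 2024 Employment Situation"))
print(release_key("Archived release"))
```

Falling back to the name keeps every release in the output, while the warning records which keys are not real dates.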


In [8]:
from selenium import webdriver
from bs4 import BeautifulSoup
import json
import time
from datetime import datetime

# Define a function to extract date from the news release name
def extract_date_from_name(name):
    try:
        # The name is expected to start with "<Month> <Year>";
        # unpacking raises ValueError if it has fewer than two words
        month, year = name.split()[:2]
        # Convert to datetime object
        return datetime.strptime(f"{month} {year}", '%B %Y')
    except ValueError:
        # Return None if the name is too short or is not a valid month and year
        return None

# Read the stored JSON file containing URLs
with open('/Users/a1234/Desktop/workspace/Employment_Analysis_and_Recommendation_System_Based_on_NLP_and_Data_Modeling/data/News_Releases_URL.json', 'r') as infile:
    urls = json.load(infile)

# Initialize the WebDriver (init_driver is defined in the first script above)
driver = init_driver()

# Dictionary to store text contents
text_contents = {}

# Iterate through URLs
for name, url in urls.items():
    # Extract date from the name
    date_obj = extract_date_from_name(name)
    if date_obj:
        # If date extraction successful, convert to string format
        date_str = date_obj.strftime('%Y-%m')
    else:
        # If date extraction fails, use the original name
        date_str = name
    
    print(f"Scraping data for {date_str}...")
    
    # Access the URL
    driver.get(url)
    print("Waiting for page to load...")
    time.sleep(2)  # Fixed pause; increase it if pages load slowly
    
    # Get and parse the page source
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    
    # Extract text content
    text_content = soup.select_one('div.normalnews > pre:nth-child(1)')
    if text_content:
        # Save text content along with date in the dictionary
        text_contents[date_str] = text_content.get_text()
        print("Data scraped successfully.")
    else:
        text_contents[date_str] = 'No content found'
        print("No content found for this date.")

# Close the WebDriver
driver.quit()

# Path of the output file for the scraped text contents
output_path = '/Users/a1234/Desktop/workspace/Employment_Analysis_and_Recommendation_System_Based_on_NLP_and_Data_Modeling/data/Text_Contents.json'

# Save the obtained text content and dates
with open(output_path, 'w') as outfile:
    json.dump(text_contents, outfile, indent=4)

print(f"All text contents have been saved to {output_path}.")


Scraping data for 2024-03...
Waiting for page to load...
Data scraped successfully.
Scraping data for 2024-02...
Waiting for page to load...
Data scraped successfully.
Scraping data for 2024-01...
Waiting for page to load...
Data scraped successfully.
Scraping data for 2023-12...
Waiting for page to load...
Data scraped successfully.
Scraping data for 2023-11...
Waiting for page to load...
Data scraped successfully.
Scraping data for 2023-10...
Waiting for page to load...
Data scraped successfully.
Scraping data for 2023-09...
Waiting for page to load...
Data scraped successfully.
Scraping data for 2023-08...
Waiting for page to load...
Data scraped successfully.
Scraping data for 2023-07...
Waiting for page to load...
Data scraped successfully.
Scraping data for 2023-06...
Waiting for page to load...
Data scraped successfully.
Scraping data for 2023-05...
Waiting for page to load...
Data scraped successfully.
Scraping data for 2023-04...
Waiting for page to load...
Data scraped succes