In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

In [4]:
import requests
from bs4 import BeautifulSoup as BS
import pandas as pd


1. Start by performing a GET request on the url above and converting the response into a BeautifulSoup object.

In [6]:
url = "https://realpython.github.io/fake-jobs/"
response = requests.get(url)
response.raise_for_status()  # Stop early if the request failed (4xx or 5xx)
soup = BS(response.text, 'html.parser')


 Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.  

In [9]:
job_titles = [job.text.strip() for job in soup.find_all('h2', class_='title')]
print(job_titles[:5])  # Print the first 5 to check


['Senior Python Developer', 'Energy engineer', 'Legal executive', 'Fitness centre manager', 'Product manager']
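
The cell above goes straight to `find_all`, while the prompt asks for `.find` on the first title only. A minimal sketch of that single step follows. It runs against a small inline HTML sample (copied from the card markup printed further down in this notebook) rather than the live page, and the names `sample_html`, `sample_soup`, and `first_title` are made up for illustration.

```python
from bs4 import BeautifulSoup as BS

# Inline sample mirroring the card markup of the jobs page (illustrative only)
sample_html = """
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""
sample_soup = BS(sample_html, "html.parser")

# .find returns the first matching tag, or None if nothing matches
first_title_tag = sample_soup.find("h2", class_="title")

# .text gives the tag's text; .strip() removes surrounding whitespace
first_title = first_title_tag.text.strip()
print(first_title)
```

Matching on `class_="title"` works even though the tag carries two classes (`title is-5`), because BeautifulSoup matches against each class individually.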


 Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.  

 Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  

In [12]:
companies = [company.text.strip() for company in soup.find_all('h3', class_='company')]
locations = [location.text.strip() for location in soup.find_all('p', class_='location')]
dates = [date.text.strip() for date in soup.find_all('time')]

print(companies[:5], locations[:5], dates[:5])  # Print the first 5 of each to check


['Payne, Roberts and Davis', 'Vasquez-Davidson', 'Jackson, Chambers and Levy', 'Savage-Bradley', 'Ramirez Inc'] ['Stewartbury, AA', 'Christopherville, AA', 'Port Ericaburgh, AA', 'East Seanview, AP', 'North Jamieview, AP'] ['2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08']
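
Building four independent lists works as long as every card contains every field. An alternative sketch, assuming each job sits in its own `div` with class `card`, walks the cards one by one so that the fields of a job always stay together. The inline sample below is hand-written for illustration (its `p` and `time` markup is an assumption based on the selectors used above), and `records` is a hypothetical name.

```python
from bs4 import BeautifulSoup as BS

# Hand-written sample with two cards; whitespace around the location is deliberate
sample_html = """
<div class="card">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <time>2021-04-08</time>
</div>
<div class="card">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">
    Christopherville, AA
  </p>
  <time>2021-04-08</time>
</div>
"""
sample_soup = BS(sample_html, "html.parser")

records = []
for card in sample_soup.find_all("div", class_="card"):
    # get_text(strip=True) returns the text with leading/trailing whitespace removed
    records.append({
        "Job Title": card.find("h2", class_="title").get_text(strip=True),
        "Company": card.find("h3", class_="company").get_text(strip=True),
        "Location": card.find("p", class_="location").get_text(strip=True),
        "Date Posted": card.find("time").get_text(strip=True),
    })

print(records[0])
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`.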


 Take the lists that you have created and combine them into a pandas DataFrame.

In [14]:
job_data = pd.DataFrame({
    'Job Title': job_titles,
    'Company': companies,
    'Location': locations,
    'Date Posted': dates
})

print(job_data.head())  # Display the first few rows to check



                 Job Title                     Company              Location  \
0  Senior Python Developer    Payne, Roberts and Davis       Stewartbury, AA   
1          Energy engineer            Vasquez-Davidson  Christopherville, AA   
2          Legal executive  Jackson, Chambers and Levy   Port Ericaburgh, AA   
3   Fitness centre manager              Savage-Bradley     East Seanview, AP   
4          Product manager                 Ramirez Inc   North Jamieview, AP   

  Date Posted  
0  2021-04-08  
1  2021-04-08  
2  2021-04-08  
3  2021-04-08  
4  2021-04-08  
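
`pd.DataFrame` raises a `ValueError` when the lists passed to it differ in length, which can happen if one selector matches more or fewer tags than another. The following sketch checks the lengths first and reports them by name. It uses two hand-typed rows instead of the scraped lists, and `columns` and `lengths` are illustrative names.

```python
import pandas as pd

# Two hand-typed rows standing in for the scraped lists
columns = {
    "Job Title": ["Senior Python Developer", "Energy engineer"],
    "Company": ["Payne, Roberts and Davis", "Vasquez-Davidson"],
    "Location": ["Stewartbury, AA", "Christopherville, AA"],
    "Date Posted": ["2021-04-08", "2021-04-08"],
}

# Compare the list lengths before building the DataFrame
lengths = {name: len(values) for name, values in columns.items()}
if len(set(lengths.values())) != 1:
    raise ValueError(f"Column lengths differ: {lengths}")

job_data_sample = pd.DataFrame(columns)
print(job_data_sample)
```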


Next, add a column that contains the url for the "Apply" button. Try this in two ways.
    a. First, use the BeautifulSoup find_all method to extract the urls.

In [16]:
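# Caution: the selector matches every 'card-footer-item' link, not only the "Apply" button,
# so the list is longer than the DataFrame and the slice below keeps a mix of link types.
apply_urls = [link['href'] for link in soup.find_all('a', class_='card-footer-item')]
job_data['Apply URL'] = apply_urls[:len(job_data)]  # Truncate so the list matches the DataFrame length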

print(job_data.head())  # Check the DataFrame


                 Job Title                     Company              Location  \
0  Senior Python Developer    Payne, Roberts and Davis       Stewartbury, AA   
1          Energy engineer            Vasquez-Davidson  Christopherville, AA   
2          Legal executive  Jackson, Chambers and Levy   Port Ericaburgh, AA   
3   Fitness centre manager              Savage-Bradley     East Seanview, AP   
4          Product manager                 Ramirez Inc   North Jamieview, AP   

  Date Posted                                          Apply URL  
0  2021-04-08                         https://www.realpython.com  
1  2021-04-08  https://realpython.github.io/fake-jobs/jobs/se...  
2  2021-04-08                         https://www.realpython.com  
3  2021-04-08  https://realpython.github.io/fake-jobs/jobs/en...  
4  2021-04-08                         https://www.realpython.com  
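
The 'Apply URL' column above alternates between `https://www.realpython.com` and job pages, which suggests that each card footer holds two links with the same class. One possible fix, sketched here under the assumption that the two links are labelled "Learn" and "Apply", is to filter on the link text. The footer markup below is a hand-written stand-in, not copied from the page.

```python
from bs4 import BeautifulSoup as BS

# Assumed footer markup: two links per card sharing the same class
sample_html = """
<footer class="card-footer">
<a class="card-footer-item" href="https://www.realpython.com">Learn</a>
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html">Apply</a>
</footer>
<footer class="card-footer">
<a class="card-footer-item" href="https://www.realpython.com">Learn</a>
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html">Apply</a>
</footer>
"""
sample_soup = BS(sample_html, "html.parser")

# Keep only the links whose visible text is exactly "Apply"
apply_urls = [
    link["href"]
    for link in sample_soup.find_all("a", class_="card-footer-item")
    if link.get_text(strip=True) == "Apply"
]
print(apply_urls)
```

With one URL per job, the list should already have the same length as the DataFrame, so no slicing is needed.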


In [17]:
base_url = "https://realpython.github.io/fake-jobs/jobs/"
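# Build a slug from each title: lowercase, spaces to hyphens, drop commas, parentheses, and slashes
job_slugs = [title.lower().replace(' ', '-').replace(',', '').replace('(', '').replace(')', '').replace('/', '') for title in job_titles]
# Caution: the numeric suffix is hard-coded to 0, but job pages end in different numbers
# (compare the -0.html and -8.html URLs used later in this notebook)
constructed_urls = [f"{base_url}{slug}-0.html" for slug in job_slugs]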

job_data['Constructed URL'] = constructed_urls[:len(job_data)]

print(job_data[['Apply URL', 'Constructed URL']].head())  # Compare both URL columns


                                           Apply URL  \
0                         https://www.realpython.com   
1  https://realpython.github.io/fake-jobs/jobs/se...   
2                         https://www.realpython.com   
3  https://realpython.github.io/fake-jobs/jobs/en...   
4                         https://www.realpython.com   

                                     Constructed URL  
0  https://realpython.github.io/fake-jobs/jobs/se...  
1  https://realpython.github.io/fake-jobs/jobs/en...  
2  https://realpython.github.io/fake-jobs/jobs/le...  
3  https://realpython.github.io/fake-jobs/jobs/fi...  
4  https://realpython.github.io/fake-jobs/jobs/pr...  
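
The constructed URLs above all end in `-0.html`, yet the notebook also uses a job page ending in `-8.html`, so the number evidently varies by job. The sketch below assumes the number is the job's zero-based position on the listing page. The two URLs that appear in this notebook are consistent with that, but it remains an assumption. `build_job_url` and `BASE_URL` are hypothetical names, and the cleaning rules are the same as in the cell above.

```python
BASE_URL = "https://realpython.github.io/fake-jobs/jobs/"

# Characters removed from titles, matching the cleaning in the cell above
REMOVE_CHARS = str.maketrans("", "", ",()/")


def build_job_url(title, position):
    """Build a job page URL from a title and its assumed zero-based position."""
    slug = title.lower().translate(REMOVE_CHARS).replace(" ", "-")
    return f"{BASE_URL}{slug}-{position}.html"


print(build_job_url("Senior Python Developer", 0))
print(build_job_url("Television floor manager", 8))
```

Applied to the scraped titles, this would look like `[build_job_url(title, i) for i, title in enumerate(job_titles)]`.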


 Finally, we want to get the job description text for each job.  
  a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.  

In [20]:
def get_job_description(url):
    response = requests.get(url)
    soup = BS(response.text, 'html.parser')
    content = soup.find('div', class_='content')
    if content:
        return content.text.strip()
    else:
        return "Description not found"



In [22]:
print(soup)



<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Fake Python</title>
<link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
</head>
<body>
<section class="section">
<div class="container mb-5">
<h1 class="title is-1">
        Fake Python
      </h1>
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
</div>
<div class="container">
<div class="columns is-multiline" id="ResultsContainer">
<div class="column is-half">
<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>

In [24]:
print(soup.find('div'))



<div class="container mb-5">
<h1 class="title is-1">
        Fake Python
      </h1>
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
</div>


In [32]:
import requests
from bs4 import BeautifulSoup as BS

# Step 1: Load the web page content
job_url = "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
job_response = requests.get(job_url)

# Step 2: Parse the HTML with BeautifulSoup
soup = BS(job_response.text, 'html.parser')

# Step 3: Find and extract the job description paragraph
job_description = soup.find('div', class_='content').text.strip()

# Print the job description
print(job_description)



Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.
Location: Stewartbury, AA
Posted: 2021-04-08


In [34]:

import requests
from bs4 import BeautifulSoup as BS

# Step 1: Load the web page content
job_url = "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
job_response = requests.get(job_url)

# Step 2: Parse the HTML with BeautifulSoup
soup = BS(job_response.text, 'html.parser')

# Step 3: Find and extract only the job description paragraph
job_description = soup.find('div', class_='content').find('p').text.strip()

# Print the job description
print(job_description)


Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.
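
The two cells above differ only in the extra `.find('p')`. The first output includes the "Location:" and "Posted:" lines because `.text` on the `div` concatenates the text of everything inside it, while `.find('p')` returns only the first paragraph. A small sketch of the difference, using a hand-written snippet whose paragraph layout is assumed from the output above:

```python
from bs4 import BeautifulSoup as BS

# Assumed layout: description first, then location and date paragraphs
sample_html = """
<div class="content">
<p>Professional asset web application environmentally friendly detail-oriented asset.</p>
<p>Location: Stewartbury, AA</p>
<p>Posted: 2021-04-08</p>
</div>
"""
sample_soup = BS(sample_html, "html.parser")
content = sample_soup.find("div", class_="content")

# Text of the whole div: every nested paragraph is included
whole_text = content.text.strip()

# Text of the first <p> only: the description on its own
first_paragraph = content.find("p").text.strip()

print(whole_text)
print("---")
print(first_paragraph)
```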


In [56]:
# Improved function to extract job description with error handling
def get_job_description(url):
    # Check if the input is a string
    if not isinstance(url, str):
        return "Error: The provided URL must be a string."
    
    try:
        # Perform the GET request to the provided URL
        response = requests.get(url, timeout=10)  # Fail after 10 seconds instead of hanging
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
        
        # Parse the HTML content using BeautifulSoup
        soup = BS(response.text, 'html.parser')
        
        # Extract the job description paragraph
        description = soup.find('div', class_='content').find('p').text.strip()
        
        return description
    
    except requests.exceptions.RequestException as e:
        # Handle network-related errors (e.g., connection issues, timeouts)
        return f"Error: Failed to retrieve data from {url}. Exception: {e}"
    
    except AttributeError:
        # Handle errors related to parsing (e.g., missing 'div' or 'p' tags)
        return f"Error: Could not find the job description on the page at {url}."

# Example usage of the function
example_url = "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html"
print(get_job_description(example_url))



At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.


In [58]:
# Apply the get_job_description function to each URL in the 'Apply URL' column
job_data['Description'] = job_data['Apply URL'].apply(get_job_description)

# Display the first few rows to check the results
print(job_data.head())


                 Job Title                     Company              Location  \
0  Senior Python Developer    Payne, Roberts and Davis       Stewartbury, AA   
1          Energy engineer            Vasquez-Davidson  Christopherville, AA   
2          Legal executive  Jackson, Chambers and Levy   Port Ericaburgh, AA   
3   Fitness centre manager              Savage-Bradley     East Seanview, AP   
4          Product manager                 Ramirez Inc   North Jamieview, AP   

  Date Posted                                          Apply URL  \
0  2021-04-08                         https://www.realpython.com   
1  2021-04-08  https://realpython.github.io/fake-jobs/jobs/se...   
2  2021-04-08                         https://www.realpython.com   
3  2021-04-08  https://realpython.github.io/fake-jobs/jobs/en...   
4  2021-04-08                         https://www.realpython.com   

                                     Constructed URL  \
0  https://realpython.github.io/fake-jobs/jobs/se...  

In [None]:
# Create a 'Description_Apply_URL' column using the 'Apply URL' column
job_data['Description_Apply_URL'] = job_data['Apply URL'].apply(get_job_description)

# Create a 'Description_Constructed_URL' column using the 'Constructed URL' column
job_data['Description_Constructed_URL'] = job_data['Constructed URL'].apply(get_job_description)

# Display the first few rows to check the results
print(job_data.head())

In [62]:
# Count the errors in the 'Description_Apply_URL' column (error messages start with "Error:")
apply_url_errors = job_data['Description_Apply_URL'].str.startswith('Error:').sum()

# Count the errors in the 'Description_Constructed_URL' column
constructed_url_errors = job_data['Description_Constructed_URL'].str.startswith('Error:').sum()

# Calculate the total number of job postings
total_jobs = len(job_data)

# Calculate the error ratio for each method
apply_url_error_ratio = apply_url_errors / total_jobs
constructed_url_error_ratio = constructed_url_errors / total_jobs

# Display the results
print(f"Errors using Apply URL: {apply_url_errors}/{total_jobs} ({apply_url_error_ratio:.2%})")
print(f"Errors using Constructed URL: {constructed_url_errors}/{total_jobs} ({constructed_url_error_ratio:.2%})")


Errors using Apply URL: 50/100 (50.00%)
Errors using Constructed URL: 99/100 (99.00%)
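
Both figures are consistent with the URL columns shown earlier. Half of the 'Apply URL' entries point at `https://www.realpython.com`, which presumably has no job description block, and the constructed URLs all carry a `-0.html` suffix that should only be right for the first job. A cheaper way to spot such problems, before making any requests, is to compare the two URL columns directly. The sketch below does this on a tiny hand-made DataFrame. The 'Apply URL' values in it are assumed to be correct, and `urls` and `mismatches` are illustrative names.

```python
import pandas as pd

# Hand-made sample: the second constructed URL has the wrong numeric suffix
urls = pd.DataFrame({
    "Apply URL": [
        "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html",
        "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html",
    ],
    "Constructed URL": [
        "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html",
        "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-0.html",
    ],
})

# Element-wise comparison gives one boolean per row
urls["Match"] = urls["Apply URL"] == urls["Constructed URL"]

# Rows where the two methods disagree
mismatches = urls.loc[~urls["Match"], ["Apply URL", "Constructed URL"]]

print(f"{urls['Match'].sum()} of {len(urls)} URLs match")
print(mismatches)
```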
