1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.
a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.
b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.
c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.
d. Take the lists that you have created and combine them into a pandas DataFrame.

In [2]:
import requests

In [3]:
import pandas as pd

In [4]:
from bs4 import BeautifulSoup as BS

In [5]:
URL = "https://realpython.github.io/fake-jobs/"

In [6]:
response = requests.get(URL)

In [7]:
print(response.status_code)

200


In [8]:
requests.get("https://realpython.github.io/fake-jobs/")

<Response [200]>

In [9]:
soup = BS(response.text, 'html.parser')

In [10]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Fake Python
  </title>
  <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
  <section class="section">
   <div class="container mb-5">
    <h1 class="title is-1">
     Fake Python
    </h1>
    <p class="subtitle is-3">
     Fake Jobs for Your Web Scraping Journey
    </p>
   </div>
   <div class="container">
    <div class="columns is-multiline" id="ResultsContainer">
     <div class="column is-half">
      <div class="card">
       <div class="card-content">
        <div class="media">
         <div class="media-left">
          <figure class="image is-48x48">
           <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
          </figure>
         </div>
         <div class="media-content">
          <h2 c

In [11]:
soup.find('h2')

<h2 class="title is-5">Senior Python Developer</h2>

In [12]:
type(soup.find('h2'))

bs4.element.Tag

In [13]:
soup.find('h2').text

'Senior Python Developer'

In [14]:
job_titles = soup.find_all('h2')
print(type(job_titles))


<class 'bs4.element.ResultSet'>


In [15]:
job_titles_text = [job.text.strip() for job in job_titles]
print(job_titles_text[:5])

['Senior Python Developer', 'Energy engineer', 'Legal executive', 'Fitness centre manager', 'Product manager']


This step extracts the text from every job title and then prints the first five.

In [17]:
job_titles = []
companies = []
locations = []
posting_dates = []

These empty lists will hold the job title, company, location, and posting date of each job.

In [19]:
job_entries = soup.find_all("div", class_="column is-half")

The find_all method returns every tag that matches the search. Here, each div with the class "column is-half" is one job card.
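As a small illustration of how find and find_all differ, the sketch below parses a made-up HTML snippet. The markup and the names `sample_html` and `sample_soup` are assumptions for demonstration only and are not taken from the fake-jobs page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics two job cards.
sample_html = """
<div class="column is-half"><h2 class="title is-5">Job A</h2></div>
<div class="column is-half"><h2 class="title is-5">Job B</h2></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find returns the first matching Tag, or None if nothing matches.
first_title = sample_soup.find("h2")

# find_all returns a ResultSet, which can be looped over like a list.
all_titles = sample_soup.find_all("h2")

# A multi-word class_ string matches the class attribute's exact value.
cards = sample_soup.find_all("div", class_="column is-half")

print(first_title.text)
print([title.text for title in all_titles])
print(len(cards))
```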

In [21]:
for job in job_entries:
    job_title = job.find("h2", class_="title is-5")
    if job_title:
        job_titles.append(job_title.text.strip())
    else:
        job_titles.append("N/A")

    
    company = job.find("h3", class_="subtitle is-6 company")
    if company:
        companies.append(company.text.strip())
    else:
        companies.append("N/A")

    
    location = job.find("p", class_="location")
    if location:
        locations.append(location.text.strip())
    else:
        locations.append("N/A")

    
    posting_date = job.find("time")
    if posting_date:
        posting_dates.append(posting_date.text.strip())
    else:
        posting_dates.append("N/A")
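The four find/if/else blocks in the loop above follow the same pattern, so they could be folded into one helper. This is only a sketch: `text_or_na` is a hypothetical name, and the card markup is a made-up example that leaves out the time tag on purpose.

```python
from bs4 import BeautifulSoup

def text_or_na(parent, name, **kwargs):
    """Return the stripped text of the first matching tag, or "N/A" if absent."""
    tag = parent.find(name, **kwargs)
    return tag.get_text(strip=True) if tag else "N/A"

# Hypothetical card with extra whitespace and no <time> tag.
card_html = """
<div class="column is-half">
  <h2 class="title is-5"> Senior Python Developer </h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
</div>
"""
card = BeautifulSoup(card_html, "html.parser").find("div", class_="column is-half")

row = {
    "Job Title": text_or_na(card, "h2", class_="title is-5"),
    "Company": text_or_na(card, "h3", class_="subtitle is-6 company"),
    "Location": text_or_na(card, "p", class_="location"),
    "Posting Date": text_or_na(card, "time"),  # missing tag falls back to "N/A"
}
print(row)
```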



In [22]:
print(len(job_titles))

100


In [23]:
fake_python = {
    "Job Title": job_titles,
    "Company": companies,
    "Location": locations,
    "Posting Date": posting_dates
}

fake_jobs_df = pd.DataFrame(fake_python)
fake_jobs_df.head()

Unnamed: 0,Job Title,Company,Location,Posting Date
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08


In [24]:
job_titles = pd.Series(job_titles)
companies = pd.Series(companies)
locations = pd.Series(locations)
posting_dates = pd.Series(posting_dates)

In [25]:
fake_jobs_df = pd.concat([job_titles, companies, locations, posting_dates], axis=1)

In [26]:
fake_jobs_df.columns = ["Job Title", "Company", "Location", "Posting Date"]

In [27]:
fake_jobs_df.head()

Unnamed: 0,Job Title,Company,Location,Posting Date
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08


2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.
a. First, use the BeautifulSoup find_all method to extract the urls.
b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.

In [29]:
apply_urls = []

In [30]:
for job in job_entries:
    apply_button = job.find("a", class_="card-footer-item")
    if apply_button and apply_button.get('href'):
        apply_urls.append(apply_button['href'])
    else:
        apply_urls.append("N/A")

print(f"Apply URL for first job: {apply_urls[0]}")

Apply URL for first job: https://www.realpython.com


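This finds the first a tag with the class "card-footer-item" in each job card, and get('href') reads the URL from its href attribute. Note that each card has two links with that class, and the first one is the "Learn" link, which is why the output shows https://www.realpython.com instead of the Apply url. The cells that follow select the Apply links by their text.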
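To make the difference concrete, here is a minimal sketch on a hand-written footer. The markup is an assumption modeled on the two links each card appears to contain, and `footer_html` is a hypothetical name. Matching on the link text picks out the Apply link, while matching on the class alone returns whichever link comes first.

```python
from bs4 import BeautifulSoup

# Hypothetical footer with a "Learn" link followed by an "Apply" link.
footer_html = """
<footer class="card-footer">
  <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" class="card-footer-item">Apply</a>
</footer>
"""
footer = BeautifulSoup(footer_html, "html.parser")

# Matching on the class alone returns the first link, which is "Learn".
first_link = footer.find("a", class_="card-footer-item")

# Matching on the link text returns the "Apply" link.
apply_link = footer.find("a", string="Apply")

print(first_link["href"])
print(apply_link["href"])
```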

In [32]:
fake_jobs_df['Apply URL (BeautifulSoup)'] = apply_urls

In [33]:
print(fake_jobs_df)

                             Job Title                     Company  \
0              Senior Python Developer    Payne, Roberts and Davis   
1                      Energy engineer            Vasquez-Davidson   
2                      Legal executive  Jackson, Chambers and Levy   
3               Fitness centre manager              Savage-Bradley   
4                      Product manager                 Ramirez Inc   
..                                 ...                         ...   
95  Museum/gallery exhibitions officer     Nguyen, Yoder and Petty   
96            Radiographer, diagnostic                  Holder LLC   
97              Database administrator              Yates-Ferguson   
98                  Furniture designer             Ortega-Lawrence   
99                         Ship broker   Fuentes, Walls and Castro   

                Location Posting Date   Apply URL (BeautifulSoup)  
0        Stewartbury, AA   2021-04-08  https://www.realpython.com  
1   Christopherville, A

In [34]:
all_links = soup.find_all("a")
for link in all_links:
    print(link.get("href"))

https://www.realpython.com
https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html
https://www.realpython.com
https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html
https://www.realpython.com
https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html
https://www.realpython.com
https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html
https://www.realpython.com
https://realpython.github.io/fake-jobs/jobs/product-manager-4.html
https://www.realpython.com
https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html
https://www.realpython.com
https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html
https://www.realpython.com
https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html
https://www.realpython.com
https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html
https://www.realpython.com
https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html
https://

In [35]:
links = [a['href'] for a in soup.find_all("a", string="Apply")]



In [36]:
fake_jobs_df['Job URL'] = links

In [37]:
print(fake_jobs_df.head())

                 Job Title                     Company              Location  \
0  Senior Python Developer    Payne, Roberts and Davis       Stewartbury, AA   
1          Energy engineer            Vasquez-Davidson  Christopherville, AA   
2          Legal executive  Jackson, Chambers and Levy   Port Ericaburgh, AA   
3   Fitness centre manager              Savage-Bradley     East Seanview, AP   
4          Product manager                 Ramirez Inc   North Jamieview, AP   

  Posting Date   Apply URL (BeautifulSoup)  \
0   2021-04-08  https://www.realpython.com   
1   2021-04-08  https://www.realpython.com   
2   2021-04-08  https://www.realpython.com   
3   2021-04-08  https://www.realpython.com   
4   2021-04-08  https://www.realpython.com   

                                             Job URL  
0  https://realpython.github.io/fake-jobs/jobs/se...  
1  https://realpython.github.io/fake-jobs/jobs/en...  
2  https://realpython.github.io/fake-jobs/jobs/le...  
3  https://realpython.

In [38]:
job_titles = pd.Series(job_titles)
companies = pd.Series(companies)
locations = pd.Series(locations)
posting_dates = pd.Series(posting_dates)
links = pd.Series(links)

In [39]:
fake_jobs_df = pd.concat([job_titles, companies, locations, posting_dates, links], axis=1)

In [40]:
fake_jobs_df.columns = ["Job Title", "Company", "Location", "Posting Date", "Job URL"]

In [41]:
fake_jobs_df.head()

Unnamed: 0,Job Title,Company,Location,Posting Date,Job URL
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/se...
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/en...
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/le...
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fi...
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/pr...
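For part b, the extracted urls suggest a pattern: the lowercased job title with hyphens between words, followed by the row index and ".html". The sketch below is one possible way to build them. `build_job_url` and `BASE` are hypothetical names, and the rule of collapsing every run of punctuation or spaces into a single hyphen is an assumption that should be checked against the urls extracted with BeautifulSoup.

```python
import re

BASE = "https://realpython.github.io/fake-jobs/jobs/"

def build_job_url(title, index):
    """Build a job url from a title and its row index (assumed pattern)."""
    # Lowercase, then replace each run of non-alphanumeric characters
    # (spaces, commas, parentheses, slashes) with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{BASE}{slug}-{index}.html"

print(build_job_url("Senior Python Developer", 0))
print(build_job_url("Energy engineer", 1))
```

To verify the pattern, the built urls can be compared with the extracted ones, for example `[build_job_url(t, i) for i, t in enumerate(job_titles)] == list(links)`. Titles such as "Software Engineer (Python)" and "Scientist, research (maths)" are the ones to look at first if the comparison fails.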


3. Finally, we want to get the job description text for each job.
a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.

In [43]:
print(soup)

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Fake Python</title>
<link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
</head>
<body>
<section class="section">
<div class="container mb-5">
<h1 class="title is-1">
        Fake Python
      </h1>
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
</div>
<div class="container">
<div class="columns is-multiline" id="ResultsContainer">
<div class="column is-half">
<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>


In [44]:
single = requests.get('https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html')

In [45]:
single_job = BS(single.text, 'html.parser')

In [46]:
print(single_job.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Fake Python
  </title>
  <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
  <section class="section">
   <div class="container mb-5">
    <h1 class="title is-1">
     Fake Python
    </h1>
    <p class="subtitle is-3">
     Fake Jobs for Your Web Scraping Journey
    </p>
   </div>
   <div class="container">
    <div class="columns is-multiline" id="ResultsContainer">
     <div class="box">
      <h1 class="title is-2">
       Senior Python Developer
      </h1>
      <h2 class="subtitle is-4 company">
       Payne, Roberts and Davis
      </h2>
      <div class="content">
       <p>
        Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational

In [47]:
job_description = single_job.find('div', class_ = 'content')
job_description

<div class="content">
<p>Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.</p>
<p id="location"><strong>Location:</strong> Stewartbury, AA</p>
<p id="date"><strong>Posted:</strong> 2021-04-08</p>
</div>

b. We want to be able to do this for all pages. Write a function which takes as input a url and returns the description text on that page.

In [49]:
base_url = 'https://realpython.github.io/fake-jobs/'
response = requests.get(base_url)

In [50]:
single = requests.get('https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html')
single_job = BS(single.text, 'html.parser')
job_description = single_job.find('div', class_='content')
job_description

<div class="content">
<p>Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.</p>
<p id="location"><strong>Location:</strong> Stewartbury, AA</p>
<p id="date"><strong>Posted:</strong> 2021-04-08</p>
</div>

In [94]:
job_description_text = job_description.get_text(strip=True)
print(job_description_text)

Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.Location:Stewartbury, AAPosted:2021-04-08


In [98]:
def get_job_description(url):
    response = requests.get(url)
    soup = BS(response.text, 'html.parser')
    job_description = soup.find('div', class_='content')
    if job_description:
        return job_description.get_text(strip=True)
    else:
        return "No description found"

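The get_job_description function takes a url, requests that job page, and parses the response with html.parser so that BeautifulSoup can navigate the HTML structure. Calling .get_text(strip=True) on the content div removes the tags and trims the whitespace around each piece of text. Because the pieces are joined with no separator, the location and posting date run straight into the end of the description, as the output below shows.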
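If only the description paragraph is wanted, one option is to take the first p tag inside the content div instead of the text of the whole div. The sketch below works on an HTML string so it can be tried without a request. `extract_description` is a hypothetical name, and `sample_page` is a shortened stand-in for a real job page.

```python
from bs4 import BeautifulSoup

def extract_description(html):
    """Return the text of the first <p> inside div.content, or None."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="content")
    if content is None:
        return None
    first_paragraph = content.find("p")
    return first_paragraph.get_text(strip=True) if first_paragraph else None

# Shortened, hypothetical stand-in for a job page's content div.
sample_page = """
<div class="content">
<p>Professional asset web application.</p>
<p id="location"><strong>Location:</strong> Stewartbury, AA</p>
<p id="date"><strong>Posted:</strong> 2021-04-08</p>
</div>
"""
description = extract_description(sample_page)
print(description)
```

This assumes the description is always the first paragraph in the div, which holds for the page shown above but has not been checked for every job.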

In [100]:
test_url = 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'
print(get_job_description(test_url))

Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.Location:Stewartbury, AAPosted:2021-04-08


In [102]:
job_descriptions = []

In [104]:
for url in fake_jobs_df['Job URL']:
    description = get_job_description(url)
    job_descriptions.append(description)

The for loop iterates over the Job URL column, calls get_job_description on each url, and appends the result to job_descriptions.
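Since the loop makes one request per job, a single failed request would stop it partway through. A hedged sketch of a more forgiving loop is shown below. `collect_descriptions` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for get_job_description so the example runs without network access.

```python
import time

def collect_descriptions(urls, fetch, delay=0.0):
    """Call fetch on each url, recording an error message instead of stopping."""
    descriptions = []
    for url in urls:
        try:
            descriptions.append(fetch(url))
        except Exception as error:  # broad on purpose, since this is only a sketch
            descriptions.append(f"Error: {error}")
        time.sleep(delay)  # optional pause between requests
    return descriptions

def fake_fetch(url):
    """Stand-in for get_job_description that fails on some urls."""
    if "bad" in url:
        raise ValueError("page not available")
    return f"description for {url}"

results = collect_descriptions(["job-0.html", "bad-1.html"], fake_fetch)
print(results)
```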

In [105]:
fake_jobs_df['Job Description'] = job_descriptions

Add the descriptions to the DataFrame as a new column.