# Fake Python Jobs Web Scraping

In [13]:
import requests
from bs4 import BeautifulSoup


Because our objective is to scrape a static HTML website, we can request the page content directly.

## Request and parse HTML

In [166]:
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

print(page.text[:1000])

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Fake Python</title>
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">
  </head>
  <body>
  <section class="section">
    <div class="container mb-5">
      <h1 class="title is-1">
        Fake Python
      </h1>
      <p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
    </div>
    <div class="container">
    <div id="ResultsContainer" class="columns is-multiline">
    <div class="column is-half">
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-left">
        <figure class="image is-48x48">
          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">
        </figure>
      </div>
      <div class="media-content">
        <h2 class="title is-

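A note on previewing long strings: Python slices are zero-based, so `text[:1000]` returns the first 1000 characters, whereas `text[1:1000]` skips the opening `<` of `<!DOCTYPE html>`. The sketch below illustrates the difference on a short hypothetical string (`sample`) that stands in for `page.text`.

```python
# `sample` is a hypothetical stand-in for page.text
sample = "<!DOCTYPE html>\n<html>\n  <head>"

from_zero = sample[:10]   # starts at index 0, keeps the leading "<"
from_one = sample[1:10]   # starts at index 1, drops the leading "<"

print(from_zero)
print(from_one)
```

Both slices stop at the same position; only the starting character differs.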

In [19]:
soup = BeautifulSoup(page.content, "html.parser")

We are looking for the `<div>` with `id="ResultsContainer"`.

In [167]:
results = soup.find(id="ResultsContainer")
print(results.prettify()[:1000])

<div class="columns is-multiline" id="ResultsContainer">
 <div class="column is-half">
  <div class="card">
   <div class="card-content">
    <div class="media">
     <div class="media-left">
      <figure class="image is-48x48">
       <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
      </figure>
     </div>
     <div class="media-content">
      <h2 class="title is-5">
       Senior Python Developer
      </h2>
      <h3 class="subtitle is-6 company">
       Payne, Roberts and Davis
      </h3>
     </div>
    </div>
    <div class="content">
     <p class="location">
      Stewartbury, AA
     </p>
     <p class="is-small has-text-grey">
      <time datetime="2021-04-08">
       2021-04-08
      </time>
     </p>
    </div>
    <footer class="card-footer">
     <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
      Learn
     </a>
     <a class="card-footer-item" href=

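`find()` returns `None` when nothing matches, so it can be worth guarding the lookup before calling methods on the result. A minimal sketch, assuming a small hypothetical HTML snippet (`html`) in place of the live page and a made-up id (`DoesNotExist`) to show the failing case:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page
html = """
<div class="container">
  <div id="ResultsContainer" class="columns is-multiline">
    <div class="card-content"><h2 class="title is-5">Senior Python Developer</h2></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = soup.find(id="ResultsContainer")
missing = soup.find(id="DoesNotExist")  # no element carries this id

if results is None:
    raise ValueError("ResultsContainer not found; the page layout may have changed")

print(results["class"])  # class is a multi-valued attribute, so this is a list
print(missing)           # None
```

Failing early with a clear message is usually easier to debug than an `AttributeError` raised further down the notebook.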

From this container we want an iterable holding the HTML of every job listing displayed on the page.

In [169]:
job_elements = results.find_all("div", class_="card-content")

print(job_elements[0].prettify())

<div class="card-content">
 <div class="media">
  <div class="media-left">
   <figure class="image is-48x48">
    <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
   </figure>
  </div>
  <div class="media-content">
   <h2 class="title is-5">
    Senior Python Developer
   </h2>
   <h3 class="subtitle is-6 company">
    Payne, Roberts and Davis
   </h3>
  </div>
 </div>
 <div class="content">
  <p class="location">
   Stewartbury, AA
  </p>
  <p class="is-small has-text-grey">
   <time datetime="2021-04-08">
    2021-04-08
   </time>
  </p>
 </div>
 <footer class="card-footer">
  <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
   Learn
  </a>
  <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">
   Apply
  </a>
 </footer>
</div>

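`find_all()` returns a list-like `ResultSet`, and `class_` matches an element when any one of its CSS classes equals the given value. The sketch below shows this on hypothetical, trimmed-down markup (`html`); note that the outer `card` divs are not matched by `class_="card-content"`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two simplified job cards
html = """
<div id="ResultsContainer">
  <div class="card"><div class="card-content"><h2 class="title is-5">Senior Python Developer</h2></div></div>
  <div class="card"><div class="card-content"><h2 class="title is-5">Energy engineer</h2></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find(id="ResultsContainer")

# Matches only the inner divs whose class list contains "card-content"
job_elements = results.find_all("div", class_="card-content")
print(len(job_elements))

# class_="title" matches "title is-5" because "title" is one of its classes
titles = [el.find("h2", class_="title").text.strip() for el in job_elements]
print(titles)
```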


Let's focus only on the 'title', 'company' and 'location' elements.

In [170]:
# Only print the first 10 listings to keep the output short
for job_element in job_elements[:10]:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()

Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA

Energy engineer
Vasquez-Davidson
Christopherville, AA

Legal executive
Jackson, Chambers and Levy
Port Ericaburgh, AA

Fitness centre manager
Savage-Bradley
East Seanview, AP

Product manager
Ramirez Inc
North Jamieview, AP

Medical technical officer
Rogers-Yates
Davidville, AP

Physiological scientist
Kramer-Klein
South Christopher, AE

Textile designer
Meyers-Johnson
Port Jonathan, AE

Television floor manager
Hughes-Williams
Osbornetown, AE

Waste management officer
Jones, Williams and Villa
Scotttown, AP

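The loop above assumes every card contains all three elements. If one were missing, `find()` would return `None` and accessing `.text` would raise an `AttributeError`. The following is a hedged sketch of a small helper for that case; `text_or_default` is a hypothetical name, not part of Beautiful Soup, and the markup is a simplified card with the location deliberately left out.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, class_, default="N/A"):
    """Return the stripped text of the first match, or `default` if there is none."""
    element = parent.find(name, class_=class_)
    return element.get_text(strip=True) if element is not None else default


# Hypothetical card without a location paragraph
html = """
<div class="card-content">
  <h2 class="title is-5">
    Senior Python Developer
  </h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""

card = BeautifulSoup(html, "html.parser").find("div", class_="card-content")

title = text_or_default(card, "h2", "title")
company = text_or_default(card, "h3", "company")
location = text_or_default(card, "p", "location")  # absent, falls back to the default

print(title, company, location, sep=" | ")
```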


## Filter results

In this exercise we only want to keep the Python jobs. **Thus we'll look for the jobs with the word "Python" included in the title.**

In [47]:
python_jobs = results.find_all(
    "h2",
    # Case-insensitive match; `text` is None for tags without a single string
    string=lambda text: text is not None and "python" in text.lower(),
)

for python_job in python_jobs:
    print(python_job.text.strip())

Senior Python Developer
Software Engineer (Python)
Python Programmer (Entry-Level)
Python Programmer (Entry-Level)
Software Developer (Python)
Python Developer
Back-End Web Developer (Python, Django)
Back-End Web Developer (Python, Django)
Python Programmer (Entry-Level)
Software Developer (Python)

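A function is needed here because passing a plain string to `string=` only matches elements whose text equals that string exactly. The sketch below contrasts the two approaches on hypothetical markup with three titles.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three job titles
html = """
<div>
  <h2 class="title is-5">Senior Python Developer</h2>
  <h2 class="title is-5">Energy engineer</h2>
  <h2 class="title is-5">Software Engineer (Python)</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Exact match: no <h2> has the text "Python" and nothing else
exact = soup.find_all("h2", string="Python")

# Substring match, case-insensitive, guarded against a missing string
contains = soup.find_all(
    "h2", string=lambda text: text is not None and "python" in text.lower()
)

print(len(exact))
print([h2.text.strip() for h2 in contains])
```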

## Access Parent elements

The `<div>` element with the `card-content` class contains all the information we want. It’s a third-level parent of the `<h2>` title element that we found using our filter.

In [172]:
python_job_elements = [
    h2_element.parent.parent.parent for h2_element in python_jobs
]

print(python_job_elements[0].prettify())


<div class="card-content">
 <div class="media">
  <div class="media-left">
   <figure class="image is-48x48">
    <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
   </figure>
  </div>
  <div class="media-content">
   <h2 class="title is-5">
    Senior Python Developer
   </h2>
   <h3 class="subtitle is-6 company">
    Payne, Roberts and Davis
   </h3>
  </div>
 </div>
 <div class="content">
  <p class="location">
   Stewartbury, AA
  </p>
  <p class="is-small has-text-grey">
   <time datetime="2021-04-08">
    2021-04-08
   </time>
  </p>
 </div>
 <footer class="card-footer">
  <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
   Learn
  </a>
  <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">
   Apply
  </a>
 </footer>
</div>

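Chaining `.parent` three times depends on the exact nesting depth. As an alternative sketch, Beautiful Soup's `find_parent()` searches the ancestors for the first match, which keeps working if an extra wrapper is added between the title and the card. The markup below is a hypothetical, simplified card.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified card with the same nesting as the real page
html = """
<div class="card-content">
  <div class="media">
    <div class="media-content">
      <h2 class="title is-5">Senior Python Developer</h2>
    </div>
  </div>
  <div class="content"><p class="location">Stewartbury, AA</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
h2_element = soup.find("h2")

# Fixed-depth navigation: media-content -> media -> card-content
by_chain = h2_element.parent.parent.parent

# Search upwards for the nearest ancestor with the card-content class
by_search = h2_element.find_parent("div", class_="card-content")

print(by_chain is by_search)
print(by_search.find("p", class_="location").text.strip())
```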


## Extract Attributes From HTML Elements

In [56]:
for job_element in python_job_elements:

    links = job_element.find_all("a")
    for link in links:
        link_url = link["href"]
        print(f"Link: {link_url}\n")

Link: https://www.realpython.com

Link: https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html

Link: https://www.realpython.com

Link: https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html

Link: https://www.realpython.com

Link: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-20.html

Link: https://www.realpython.com

Link: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-30.html

Link: https://www.realpython.com

Link: https://realpython.github.io/fake-jobs/jobs/software-developer-python-40.html

Link: https://www.realpython.com

Link: https://realpython.github.io/fake-jobs/jobs/python-developer-50.html

Link: https://www.realpython.com

Link: https://realpython.github.io/fake-jobs/jobs/back-end-web-developer-python-django-60.html

Link: https://www.realpython.com

Link: https://realpython.github.io/fake-jobs/jobs/back-end-web-developer-python-django-70.html

Link: https://www.realpython.co

We only want the **second** link on each card, since this is the one containing the application info.

In [57]:
for job_element in python_job_elements:
    # Index [1] keeps only the second link on each card (the "Apply" link)
    link_url = job_element.find_all("a")[1]["href"]
    print(f"Apply here: {link_url}\n")

Apply here: https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html

Apply here: https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html

Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-20.html

Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-30.html

Apply here: https://realpython.github.io/fake-jobs/jobs/software-developer-python-40.html

Apply here: https://realpython.github.io/fake-jobs/jobs/python-developer-50.html

Apply here: https://realpython.github.io/fake-jobs/jobs/back-end-web-developer-python-django-60.html

Apply here: https://realpython.github.io/fake-jobs/jobs/back-end-web-developer-python-django-70.html

Apply here: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-80.html

Apply here: https://realpython.github.io/fake-jobs/jobs/software-developer-python-90.html

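Selecting the link by position works here because every card has the same two links in the same order. A possible alternative, sketched below on a hypothetical footer snippet, is to select the link by its visible text, which does not depend on link order.

```python
from bs4 import BeautifulSoup

# Hypothetical footer copied in simplified form from one job card
html = """
<footer class="card-footer">
  <a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>
  <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>
</footer>
"""

job_element = BeautifulSoup(html, "html.parser")

# Pick the <a> whose text is "Apply", ignoring surrounding whitespace
apply_link = job_element.find(
    "a", string=lambda text: text is not None and text.strip() == "Apply"
)

# Guard against cards without an "Apply" link
link_url = apply_link["href"] if apply_link is not None else None
print(f"Apply here: {link_url}")
```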


## Save list of jobs in DataFrame

In [162]:
import pandas as pd
import re

Let's take another look at the structure of `python_job_elements`.

In [175]:
print(python_job_elements[0].prettify())

<div class="card-content">
 <div class="media">
  <div class="media-left">
   <figure class="image is-48x48">
    <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
   </figure>
  </div>
  <div class="media-content">
   <h2 class="title is-5">
    Senior Python Developer
   </h2>
   <h3 class="subtitle is-6 company">
    Payne, Roberts and Davis
   </h3>
  </div>
 </div>
 <div class="content">
  <p class="location">
   Stewartbury, AA
  </p>
  <p class="is-small has-text-grey">
   <time datetime="2021-04-08">
    2021-04-08
   </time>
  </p>
 </div>
 <footer class="card-footer">
  <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
   Learn
  </a>
  <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">
   Apply
  </a>
 </footer>
</div>



We'll iterate over the list of Python jobs and **request, parse and extract the information from each job's subpage**.

In [164]:
columns = ['Title', 'Company', 'Description', 'Location', 'Date posted']
python_jobs_list = []

for job_element in python_job_elements:
    link_url = job_element.find_all("a")[1]["href"]
    job_page = requests.get(link_url)
    job_soup = BeautifulSoup(job_page.content, "html.parser")
    job_results = job_soup.find(id="ResultsContainer")
    title = job_results.find("h1", class_="title").text.strip()
    company = job_results.find("h2", class_="company").text.strip()
    content_element = job_results.find("div", class_="content")
    paragraphs = content_element.find_all("p")  # description, location, date
    description = paragraphs[0].text.strip()
    # Location and date are labelled "Label: value"; keep only the value
    location = paragraphs[1].text.split(":", 1)[1].strip()
    date = paragraphs[2].text.split(":", 1)[1].strip()
    python_jobs_list.append([title, company, description, location, date])

python_jobs_df = pd.DataFrame(python_jobs_list, columns=columns) 
print(python_jobs_df)

                                     Title                   Company  \
0                  Senior Python Developer  Payne, Roberts and Davis   
1               Software Engineer (Python)                Garcia PLC   
2          Python Programmer (Entry-Level)    Moss, Duncan and Allen   
3          Python Programmer (Entry-Level)           Cooper and Sons   
4              Software Developer (Python)              Adams-Brewer   
5                         Python Developer           Rivera and Sons   
6  Back-End Web Developer (Python, Django)         Stewart-Alexander   
7  Back-End Web Developer (Python, Django)    Jackson, Ali and Mckee   
8          Python Programmer (Entry-Level)               Mathews Inc   
9              Software Developer (Python)          Moreno-Rodriguez   

                                         Description                Location  \
0  Professional asset web application environment...         Stewartbury, AA   
1  Collaborate discussions responsible tech gro