1. Start by performing a GET request on the URL above and converting the response into a BeautifulSoup object.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
url = 'https://realpython.github.io/fake-jobs/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')


a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"), then extract the text from that tag. Hint: look for a tag type and/or a class that identifies the title.

In [5]:
first_job_title = soup.find('h2', class_='title').text.strip()
print(f'First job title: {first_job_title}')

First job title: Senior Python Developer
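
The search in part a succeeds even though the heading on the listing page carries the class attribute `title is-5`, because BeautifulSoup treats `class` as a multi-valued attribute and `class_='title'` matches any one of its values. A minimal sketch on an inline snippet; the `html` string and the name `snippet` are illustrative stand-ins, trimmed from the page markup rather than fetched from the site:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page heading and one job card (not the full markup).
html = """
<h1 class="title is-1">
        Fake Python
      </h1>
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
"""
snippet = BeautifulSoup(html, 'html.parser')

# class_='title' matches any element whose class list contains "title",
# so the tag name 'h2' is what keeps the page heading (an <h1>) out.
title_tag = snippet.find('h2', class_='title')
print(title_tag['class'])        # the class attribute comes back as a list
print(title_tag.text.strip())

# A CSS selector expresses the same search in a single string.
print(snippet.select_one('h2.title').text.strip())
```

Searching for `class_='title'` without the tag name would return the `<h1>` page heading first, which is why the tag type matters here.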


b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.

In [7]:
job_titles = [job_title.text.strip() for job_title in soup.find_all('h2', class_='title')]

In [8]:
print(job_titles)

['Senior Python Developer', 'Energy engineer', 'Legal executive', 'Fitness centre manager', 'Product manager', 'Medical technical officer', 'Physiological scientist', 'Textile designer', 'Television floor manager', 'Waste management officer', 'Software Engineer (Python)', 'Interpreter', 'Architect', 'Meteorologist', 'Audiological scientist', 'English as a second language teacher', 'Surgeon', 'Equities trader', 'Newspaper journalist', 'Materials engineer', 'Python Programmer (Entry-Level)', 'Product/process development scientist', 'Scientist, research (maths)', 'Ecologist', 'Materials engineer', 'Historic buildings inspector/conservation officer', 'Data scientist', 'Psychiatrist', 'Structural engineer', 'Immigration officer', 'Python Programmer (Entry-Level)', 'Neurosurgeon', 'Broadcast engineer', 'Make', 'Nurse, adult', 'Air broker', 'Editor, film/video', 'Production assistant, radio', 'Engineer, communications', 'Sales executive', 'Software Developer (Python)', 'Futures trader', 'Tour

c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  


In [10]:
companies = [company.text.strip() for company in soup.find_all('h3', class_='company')]
locations = [location.text.strip() for location in soup.find_all('p', class_='location')]
dates = [date.text.strip() for date in soup.find_all('time')]
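
The three separate `find_all` calls line up only if every card contains exactly one company, one location, and one date. An alternative that keeps the fields of each job together is to loop over the cards and search inside each one. The sketch below assumes each job sits in a `div` with class `card-content`, as in the page source; the inline `html` is a simplified stand-in whose location and date markup is inferred from the selectors above, not copied from the site:

```python
from bs4 import BeautifulSoup

# Stand-in markup for two cards; tag names and classes mirror the
# selectors used above, the surrounding layout is simplified.
html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
        Stewartbury, AA
      </p>
  <p><time>2021-04-08</time></p>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">
        Christopherville, AA
      </p>
  <p><time>2021-04-08</time></p>
</div>
"""
cards = BeautifulSoup(html, 'html.parser').find_all('div', class_='card-content')

rows = []
for card in cards:
    # Searching within one card guarantees the four values belong together.
    rows.append({
        'Job Title': card.find('h2', class_='title').get_text(strip=True),
        'Company': card.find('h3', class_='company').get_text(strip=True),
        'Location': card.find('p', class_='location').get_text(strip=True),
        'Posting Date': card.find('time').get_text(strip=True),
    })

print(rows[0])
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame`, so this approach would also cover part d.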


d. Take the lists that you have created and combine them into a pandas DataFrame. 


In [12]:
jobs_df = pd.DataFrame({
    'Job Title': job_titles,
    'Company': companies,
    'Location': locations,
    'Posting Date': dates
})


In [13]:
print(jobs_df)

                             Job Title                     Company  \
0              Senior Python Developer    Payne, Roberts and Davis   
1                      Energy engineer            Vasquez-Davidson   
2                      Legal executive  Jackson, Chambers and Levy   
3               Fitness centre manager              Savage-Bradley   
4                      Product manager                 Ramirez Inc   
..                                 ...                         ...   
95  Museum/gallery exhibitions officer     Nguyen, Yoder and Petty   
96            Radiographer, diagnostic                  Holder LLC   
97              Database administrator              Yates-Ferguson   
98                  Furniture designer             Ortega-Lawrence   
99                         Ship broker   Fuentes, Walls and Castro   

                Location Posting Date  
0        Stewartbury, AA   2021-04-08  
1   Christopherville, AA   2021-04-08  
2    Port Ericaburgh, AA   2021-04-08  

2a. First, use the BeautifulSoup find_all method to extract the URLs.


In [15]:
apply_links = [a['href'] for a in soup.find_all('a', class_='card-footer-item') if 'Apply' in a.text]
jobs_df['Apply URL (Extracted)'] = apply_links

In [16]:
print(apply_links)

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html', 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html', 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html', 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html', 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html', 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html', 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html', 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html', 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html', 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html', 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html', 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html', 'https://realpython.github.io/fake-jobs/jobs/architect-12.html', 'https://realpython.github.io/fake-

2b. Next, get those same URLs in a different way. Examine the URLs and see if you can spot the pattern in how they are constructed. Then build each URL from the elements you have already extracted, and check that the URLs you construct match those you extracted using BeautifulSoup. Warning: constructing the URLs this way requires some string cleaning and prep. For example, look carefully at the URLs for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.


In [18]:
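import re

def to_slug(text):
    # Lowercase, then collapse every run of non-alphanumeric characters
    # into a single hyphen and trim hyphens from both ends.
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')

In [19]:
# Each URL is the slugified title plus the job's position on the page,
# e.g. .../jobs/software-engineer-python-10.html
constructed_urls = [
    f"https://realpython.github.io/fake-jobs/jobs/{to_slug(title)}-{index}.html"
    for index, title in enumerate(job_titles)
]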

In [20]:
jobs_df['Apply URL (Constructed)'] = constructed_urls


In [21]:
jobs_df['URLs Match'] = jobs_df['Apply URL (Extracted)'] == jobs_df['Apply URL (Constructed)']
print(jobs_df)

                             Job Title                     Company  \
0              Senior Python Developer    Payne, Roberts and Davis   
1                      Energy engineer            Vasquez-Davidson   
2                      Legal executive  Jackson, Chambers and Levy   
3               Fitness centre manager              Savage-Bradley   
4                      Product manager                 Ramirez Inc   
..                                 ...                         ...   
95  Museum/gallery exhibitions officer     Nguyen, Yoder and Petty   
96            Radiographer, diagnostic                  Holder LLC   
97              Database administrator              Yates-Ferguson   
98                  Furniture designer             Ortega-Lawrence   
99                         Ship broker   Fuentes, Walls and Castro   

                Location Posting Date  \
0        Stewartbury, AA   2021-04-08   
1   Christopherville, AA   2021-04-08   
2    Port Ericaburgh, AA   2021-04-0
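
Printing the whole frame makes it hard to see whether every URL matched. A short check on the boolean column gives a direct answer and lists any rows that still need cleaning. The sketch below uses a two-row frame named `check_df` (a hypothetical name) with URLs copied from the extracted list; the second constructed URL is deliberately wrong to show what a mismatch looks like:

```python
import pandas as pd

# Two rows copied from the extracted URLs above; the second constructed
# URL is intentionally malformed for illustration.
check_df = pd.DataFrame({
    'Apply URL (Extracted)': [
        'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
        'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html',
    ],
    'Apply URL (Constructed)': [
        'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
        'https://realpython.github.io/fake-jobs/jobs/energy-engineer/',
    ],
})
check_df['URLs Match'] = (
    check_df['Apply URL (Extracted)'] == check_df['Apply URL (Constructed)']
)

# True only if every row matches.
all_match = bool(check_df['URLs Match'].all())
# Rows whose constructed URL differs from the extracted one.
mismatches = check_df.loc[~check_df['URLs Match']]

print(all_match)
print(mismatches.index.tolist())
```

Applied to `jobs_df`, the same two expressions would show at a glance which titles the slug rule does not yet handle.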

3a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.

In [46]:
first_job_url = "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
response_first_job = requests.get(first_job_url)
soup_first_job = BeautifulSoup(response_first_job.text, 'lxml')


In [65]:
print(soup_first_job)

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Fake Python</title>
<link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
</head>
<body>
<section class="section">
<div class="container mb-5">
<h1 class="title is-1">
        Fake Python
      </h1>
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
</div>
<div class="container">
<div class="columns is-multiline" id="ResultsContainer">
<div class="box">
<h1 class="title is-2">Senior Python Developer</h1>
<h2 class="subtitle is-4 company">Payne, Roberts and Davis</h2>
<div class="content">
<p>Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inc

In [73]:
first_job_description = soup_first_job.find_all('p')[1]


In [75]:
print(first_job_description)

<p>Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.</p>


b. We want to be able to do this for all pages. Write a function that takes a URL as input and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.".


In [79]:
url = 'https://realpython.github.io/fake-jobs/'
response_job = requests.get(url)
soup_job = BeautifulSoup(response_job.text, 'lxml')

In [81]:
print(response_job)

<Response [200]>


In [83]:
print(soup_job)

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Fake Python</title>
<link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
</head>
<body>
<section class="section">
<div class="container mb-5">
<h1 class="title is-1">
        Fake Python
      </h1>
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
</div>
<div class="container">
<div class="columns is-multiline" id="ResultsContainer">
<div class="column is-half">
<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>


In [117]:
apply_links = [a['href'] for a in soup.find_all('a', class_='card-footer-item') if 'Apply' in a.text]

In [119]:
print(apply_links)

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html', 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html', 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html', 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html', 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html', 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html', 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html', 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html', 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html', 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html', 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html', 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html', 'https://realpython.github.io/fake-jobs/jobs/architect-12.html', 'https://realpython.github.io/fake-

In [139]:
x = apply_links[7]
url = str(x)
response_job = requests.get(url)
# Parse the job page just fetched (response_job), not the listing page (response).
soup_job = BeautifulSoup(response_job.text, 'lxml')

In [141]:
soup_job


In [158]:
apply_links[7].find_all('p')[1]

AttributeError: 'str' object has no attribute 'find_all'

In [162]:
job_descriptions = []

for x in apply_links: 
    url = str(x)
    response_job = requests.get(url)
    soup_job = BeautifulSoup(response_job.text, 'lxml')
    job_description_text = soup_job.find_all('p')[1].text.strip()
    job_descriptions.append(job_description_text)
    

In [164]:
print(job_descriptions)

['Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.', 'Party prevent live. Quickly candidate change although. Together type music hospital. Every speech support time operation wear often.', 'Administration even relate head color. Staff beyond chair recently and off. Own available buy country store build before. Already against which continue. Look road article quickly. International big employee determine positive go Congress. Level others record hospital employee t
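
Part b asks for a function rather than a loop. One way to package the loop body is sketched below. The names `extract_description` and `get_job_description` are illustrative, and the sketch uses the built-in `'html.parser'` so it runs without lxml. Splitting the parsing from the fetching lets the parsing step be checked offline on a trimmed copy of the first job page:

```python
import requests
from bs4 import BeautifulSoup


def extract_description(html):
    """Return the description text from a job page's HTML."""
    page = BeautifulSoup(html, 'html.parser')
    # Same rule as the loop above: the second <p> on the page is the
    # description (the first is the site subtitle).
    return page.find_all('p')[1].get_text(strip=True)


def get_job_description(url):
    """Fetch a job page and return its description text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on a 4xx or 5xx response
    return extract_description(response.text)


# Offline check on a trimmed copy of the first job page shown earlier
# (only the first sentence of the description is kept).
sample_html = """
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
<div class="content">
<p>Professional asset web application environmentally friendly detail-oriented asset.</p>
</div>
"""
print(extract_description(sample_html))
```

With this in place, the loop reduces to `[get_job_description(url) for url in apply_links]`.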

In [160]:
job_descriptions = []

for x in apply_links: 
    url = str(x)
    response_job = requests.get(url)
    soup_job = BeautifulSoup(response_job.text, 'lxml')
    job_descriptions.append(soup_job.find_all('p')[1])
    

In [145]:
print(job_descriptions)

[<p>Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.</p>, <p>Party prevent live. Quickly candidate change although. Together type music hospital. Every speech support time operation wear often.</p>, <p>Administration even relate head color. Staff beyond chair recently and off. Own available buy country store build before. Already against which continue. Look road article quickly. International big employee determine positive go Congress. Level others record hospita