Web Scraping
In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

In [2]:
# Import libraries
import pandas as pd
from bs4 import BeautifulSoup as BS
import requests

1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.

In [7]:
# Perform a GET request and pass in the desired url
url = "https://realpython.github.io/fake-jobs/"
response = requests.get(url)

# Convert the response into a BeautifulSoup object
soup = BS(response.text, "html.parser")

In [8]:
type(response)

requests.models.Response

In [9]:
response.status_code

200

1. a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: Can you find a tag type and/or a class that would help you extract this information? Extract the text from this title.

In [None]:
# Find the first job title. Use class_= when you know the tag's class and want to filter by it.
first_job = soup.find("h2", class_="title is-5")
print(first_job.text.strip())  # Should print "Senior Python Developer"

Senior Python Developer
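As a small illustration of the .find call above, the following sketch runs against a made-up HTML fragment. The markup is an assumption that imitates one job card and is not copied from the real page. It also shows that .find returns None when nothing matches, which is worth checking before asking for the text.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on one job card; the real page may differ.
sample_html = """
<div class="card-content">
  <h2 class="title is-5">  Senior Python Developer  </h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with a trailing underscore) avoids clashing with Python's class keyword.
title_tag = sample_soup.find("h2", class_="title is-5")
first_title = title_tag.get_text(strip=True) if title_tag is not None else None

# A tag that does not exist yields None instead of raising an error.
missing_tag = sample_soup.find("h4", class_="salary")

print(first_title)
print(missing_tag)
```

Guarding against None this way keeps a missing element from surfacing later as an AttributeError on .text.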


1. b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.

In [None]:
# Extract all jobs
titles = [job.text.strip() for job in soup.find_all("h2", class_="title is-5")]
print(titles)

['Senior Python Developer', 'Energy engineer', 'Legal executive', 'Fitness centre manager', 'Product manager', 'Medical technical officer', 'Physiological scientist', 'Textile designer', 'Television floor manager', 'Waste management officer', 'Software Engineer (Python)', 'Interpreter', 'Architect', 'Meteorologist', 'Audiological scientist', 'English as a second language teacher', 'Surgeon', 'Equities trader', 'Newspaper journalist', 'Materials engineer', 'Python Programmer (Entry-Level)', 'Product/process development scientist', 'Scientist, research (maths)', 'Ecologist', 'Materials engineer', 'Historic buildings inspector/conservation officer', 'Data scientist', 'Psychiatrist', 'Structural engineer', 'Immigration officer', 'Python Programmer (Entry-Level)', 'Neurosurgeon', 'Broadcast engineer', 'Make', 'Nurse, adult', 'Air broker', 'Editor, film/video', 'Production assistant, radio', 'Engineer, communications', 'Sales executive', 'Software Developer (Python)', 'Futures trader', 'Tour

1. c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.

In [16]:
# Extract companies, locations, and dates
companies = [company.text.strip() for company in soup.find_all("h3", class_="subtitle is-6 company")]
locations = [loc.text.strip() for loc in soup.find_all("p", class_="location")]
dates = [time.text.strip() for time in soup.find_all("time")]
print(companies)
print(locations)
print(dates)

['Payne, Roberts and Davis', 'Vasquez-Davidson', 'Jackson, Chambers and Levy', 'Savage-Bradley', 'Ramirez Inc', 'Rogers-Yates', 'Kramer-Klein', 'Meyers-Johnson', 'Hughes-Williams', 'Jones, Williams and Villa', 'Garcia PLC', 'Gregory and Sons', 'Clark, Garcia and Sosa', 'Bush PLC', 'Salazar-Meyers', 'Parker, Murphy and Brooks', 'Cruz-Brown', 'Macdonald-Ferguson', 'Williams, Peterson and Rojas', 'Smith and Sons', 'Moss, Duncan and Allen', 'Gomez-Carroll', 'Manning, Welch and Herring', 'Lee, Gutierrez and Brown', 'Davis, Serrano and Cook', 'Smith LLC', 'Thomas Group', 'Silva-King', 'Pierce-Long', 'Walker-Simpson', 'Cooper and Sons', 'Donovan, Gonzalez and Figueroa', 'Morgan, Butler and Bennett', 'Snyder-Lee', 'Harris PLC', 'Washington PLC', 'Brown, Price and Campbell', 'Mcgee PLC', 'Dixon Inc', 'Thompson, Sheppard and Ward', 'Adams-Brewer', 'Schneider-Brady', 'Gonzales-Frank', 'Smith-Wong', 'Pierce-Herrera', 'Aguilar, Rivera and Quinn', 'Lowe, Barnes and Thomas', 'Lewis, Gonzalez and Vasq
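Building one list per field works when every job card contains every field. If a card were missing one, the lists would silently fall out of step. A possible alternative, sketched here on invented markup, is to loop over the cards and pull all fields from each one. The card-content wrapper class and the second, incomplete card are assumptions made for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two cards, the second one deliberately lacks a <time> tag.
cards_html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location"> Stewartbury, AA </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location"> Christopherville, AA </p>
</div>
"""
cards_soup = BeautifulSoup(cards_html, "html.parser")


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


# One dictionary per card keeps each job's fields together.
records = []
for card in cards_soup.find_all("div", class_="card-content"):
    records.append({
        "Title": text_or_none(card.find("h2", class_="title")),
        "Company": text_or_none(card.find("h3", class_="company")),
        "Location": text_or_none(card.find("p", class_="location")),
        "Date": text_or_none(card.find("time")),
    })

print(records)
```

A list of dictionaries like this can be passed straight to pd.DataFrame, with missing values showing up as None.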

1. d. Take the lists that you have created and combine them into a pandas DataFrame.

In [17]:
# Put everything into a dataframe
jobs_df = pd.DataFrame({
    "Title": titles,
    "Company": companies,
    "Location": locations,
    "Date": dates
})

print(jobs_df.head())

                     Title                     Company              Location  \
0  Senior Python Developer    Payne, Roberts and Davis       Stewartbury, AA   
1          Energy engineer            Vasquez-Davidson  Christopherville, AA   
2          Legal executive  Jackson, Chambers and Levy   Port Ericaburgh, AA   
3   Fitness centre manager              Savage-Bradley     East Seanview, AP   
4          Product manager                 Ramirez Inc   North Jamieview, AP   

         Date  
0  2021-04-08  
1  2021-04-08  
2  2021-04-08  
3  2021-04-08  
4  2021-04-08  
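pd.DataFrame raises a ValueError when the lists passed in a dictionary have different lengths, so it can help to compare the lengths first and get a clearer message. The sketch below uses two sample rows taken from the output above. The sample_ variable names are introduced only for this illustration.

```python
import pandas as pd

# Two sample rows taken from the output shown above.
sample_columns = {
    "Title": ["Senior Python Developer", "Energy engineer"],
    "Company": ["Payne, Roberts and Davis", "Vasquez-Davidson"],
    "Location": ["Stewartbury, AA", "Christopherville, AA"],
    "Date": ["2021-04-08", "2021-04-08"],
}

# Every column needs the same number of entries before building the DataFrame.
lengths = {name: len(values) for name, values in sample_columns.items()}
if len(set(lengths.values())) != 1:
    raise ValueError(f"Column lengths differ: {lengths}")

sample_df = pd.DataFrame(sample_columns)
print(sample_df.shape)
```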


2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.
a. First, use the BeautifulSoup find_all method to extract the urls.

In [None]:
# Find all apply buttons. Use string= when you know the exact text inside the tag.
apply_links = [a["href"] for a in soup.find_all("a", string="Apply")]
print(apply_links)

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html', 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html', 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html', 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html', 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html', 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html', 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html', 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html', 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html', 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html', 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html', 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html', 'https://realpython.github.io/fake-jobs/jobs/architect-12.html', 'https://realpython.github.io/fake-
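The string= filter compares against the tag's full text, so a link whose text carries surrounding whitespace would not match "Apply". The sketch below demonstrates this on invented markup and shows a compiled regular expression as a more tolerant filter. The padded second link and the card-footer classes are assumptions made for the illustration and are not claims about the real page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: the second "Apply" link has padded text on purpose.
links_html = """
<footer class="card-footer">
  <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" class="card-footer-item">Apply</a>
</footer>
<footer class="card-footer">
  <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html" class="card-footer-item"> Apply </a>
</footer>
"""
links_soup = BeautifulSoup(links_html, "html.parser")

# Exact comparison: only the link whose text is precisely "Apply" matches.
exact_links = [a["href"] for a in links_soup.find_all("a", string="Apply")]

# Regular expression: optional whitespace around "Apply" is accepted.
apply_pattern = re.compile(r"^\s*Apply\s*$")
tolerant_links = [a["href"] for a in links_soup.find_all("a", string=apply_pattern)]

print(len(exact_links), len(tolerant_links))
```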

In [None]:
# Add apply url column to the dataframe
jobs_df["Apply URL"] = apply_links

# Check dataframe
print(jobs_df.head())

                     Title                     Company              Location  \
0  Senior Python Developer    Payne, Roberts and Davis       Stewartbury, AA   
1          Energy engineer            Vasquez-Davidson  Christopherville, AA   
2          Legal executive  Jackson, Chambers and Levy   Port Ericaburgh, AA   
3   Fitness centre manager              Savage-Bradley     East Seanview, AP   
4          Product manager                 Ramirez Inc   North Jamieview, AP   

         Date                                          Apply URL  
0  2021-04-08  https://realpython.github.io/fake-jobs/jobs/se...  
1  2021-04-08  https://realpython.github.io/fake-jobs/jobs/en...  
2  2021-04-08  https://realpython.github.io/fake-jobs/jobs/le...  
3  2021-04-08  https://realpython.github.io/fake-jobs/jobs/fi...  
4  2021-04-08  https://realpython.github.io/fake-jobs/jobs/pr...  


2. b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.
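One way to approach the string cleaning is with a regular expression. The scraped urls shown earlier sit under jobs/ and end in a lowercased, hyphenated title followed by the row index. The sketch below collapses every run of characters that are not letters or digits into a single hyphen. The helper names slugify_title and build_job_url are made up for this illustration. Treating a slash the same way as a space is an assumption that should be checked against the scraped urls.

```python
import re

# Job detail pages sit under "jobs/", as seen in the scraped urls.
JOBS_BASE_URL = "https://realpython.github.io/fake-jobs/jobs/"


def slugify_title(title):
    """Lowercase the title and turn each run of non-alphanumerics into one hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    # Drop hyphens left at the ends, e.g. from a closing parenthesis.
    return slug.strip("-")


def build_job_url(title, idx):
    """Combine the base url, the cleaned title, and the row index."""
    return f"{JOBS_BASE_URL}{slugify_title(title)}-{idx}.html"


print(build_job_url("Software Engineer (Python)", 10))
print(build_job_url("Scientist, research (maths)", 22))
```

Comparing the full list of constructed urls against the scraped ones, and not just the first few, is the safest way to confirm that the pattern holds for every title.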

In [None]:
# Set variable for base url (the job detail pages live under "jobs/")
base_url = "https://realpython.github.io/fake-jobs/jobs/"

# Build function to construct full url
def make_url(title, idx):
    # Lowercase
    url_title = title.lower()
    # Replace spaces with hyphens
    url_title = url_title.replace(" ", "-")
    # Remove commas and parentheses
    url_title = url_title.replace(",", "").replace("(", "").replace(")", "")
    # Add index at the end
    return f"{base_url}{url_title}-{idx}.html"

# Create a list of constructed urls
constructed_urls = [make_url(title, i) for i, title in enumerate(titles)]

# Check that they match the ones we scraped
for real, constructed in zip(apply_links[:10], constructed_urls[:10]):
    print(real, constructed)

https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html
https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html
https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html
https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html
https://realpython.github.io/fake-jobs/jobs/product-manager-4.html https://realpython.github.io/fake-jobs/jobs/product-manager-4.html
https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html
https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html
https://realpython.github.io/fa

In [36]:
# Add apply urls v2 to dataframe
jobs_df["Apply URL V2"] = constructed_urls

# Check dataframe
print(jobs_df.head())

                     Title                     Company              Location  \
0  Senior Python Developer    Payne, Roberts and Davis       Stewartbury, AA   
1          Energy engineer            Vasquez-Davidson  Christopherville, AA   
2          Legal executive  Jackson, Chambers and Levy   Port Ericaburgh, AA   
3   Fitness centre manager              Savage-Bradley     East Seanview, AP   
4          Product manager                 Ramirez Inc   North Jamieview, AP   

         Date                                          Apply URL  \
0  2021-04-08  https://realpython.github.io/fake-jobs/jobs/se...   
1  2021-04-08  https://realpython.github.io/fake-jobs/jobs/en...   
2  2021-04-08  https://realpython.github.io/fake-jobs/jobs/le...   
3  2021-04-08  https://realpython.github.io/fake-jobs/jobs/fi...   
4  2021-04-08  https://realpython.github.io/fake-jobs/jobs/pr...   

                                        Apply URL V2  
0  https://realpython.github.io/fake-jobs/jobs/se...  


3. Finally, we want to get the job description text for each job.
a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.
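A minimal sketch of this step follows, run on invented markup so that it works without a network connection. The structure (a div with class content whose first paragraph holds the description) is an assumption about the detail page and should be confirmed by inspecting the real page. The description text is a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical detail-page fragment; inspect the real page to confirm its layout.
detail_html = """
<div class="box">
  <h1 class="title is-2">Senior Python Developer</h1>
  <h2 class="subtitle is-4 company">Payne, Roberts and Davis</h2>
  <div class="content">
    <p> Example description text. </p>
    <p id="location"><strong>Location:</strong> Stewartbury, AA</p>
  </div>
</div>
"""
detail_soup = BeautifulSoup(detail_html, "html.parser")

# Narrow the search to the content block, then take its first paragraph.
content_div = detail_soup.find("div", class_="content")
description = content_div.find("p").get_text(strip=True)

print(description)
```

For the real page, the same two lookups would be applied to a soup built from the response text of a GET request on the url above.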

3. b. We want to be able to do this for all pages. Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.".
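One possible shape for such a function is sketched below. It separates parsing from downloading so that the parsing part can be tried on any HTML string. The names parse_description and get_description, the optional session parameter, and the assumed page layout (description in the first paragraph of a div with class content) are all choices made for this illustration.

```python
import requests
from bs4 import BeautifulSoup


def parse_description(html):
    """Return the first paragraph inside div.content, or None if it is absent.

    Assumes the description is the first <p> in that block.
    """
    soup = BeautifulSoup(html, "html.parser")
    content_div = soup.find("div", class_="content")
    if content_div is None:
        return None
    paragraph = content_div.find("p")
    return paragraph.get_text(strip=True) if paragraph is not None else None


def get_description(url, session=None, timeout=10):
    """Download the page at url and return its description text."""
    # A caller may pass in a requests.Session to reuse connections.
    http = session if session is not None else requests.Session()
    response = http.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return parse_description(response.text)


# Example call (performs a real HTTP request, so it is left commented out):
# get_description("https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html")
```

Passing a shared session is optional. It mainly helps when the function is called many times in a row.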

3. c. Use the .apply method on the url column you created above to retrieve the description text for all of the jobs.
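The .apply call itself is short. The sketch below shows the pattern on a two-row DataFrame with a stand-in function, so it runs without any network access. The function fake_get_description is hypothetical and only mimics the interface of a real description scraper (one url in, one string out).

```python
import pandas as pd


def fake_get_description(url):
    """Stand-in for a real scraper: builds a label from the last url segment."""
    return f"Description for {url.rsplit('/', 1)[-1]}"


urls_df = pd.DataFrame({
    "Apply URL": [
        "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html",
        "https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html",
    ]
})

# .apply calls the function once per value in the column and collects the results.
urls_df["Description"] = urls_df["Apply URL"].apply(fake_get_description)

print(urls_df["Description"].tolist())
```

With a function that performs real requests, this step issues one HTTP request per row, so it can take a while on the full list of jobs.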