<h1>Fake Python Web Scraping Practice</h1>

<h3>In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data for each job and convert it into a pandas DataFrame.</h3>

In [1]:
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

<h4>Start by performing a GET request on the URL above and converting the response into a BeautifulSoup object.</h4>

In [2]:
URL = 'https://realpython.github.io/fake-jobs/'

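response = requests.get(URL)

# name the parser explicitly so BeautifulSoup does not warn or guess one per machine
fake_python_soup = bs(response.text, 'html.parser')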
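A slightly more defensive version of this step might check the HTTP status before parsing. The sketch below is one way to do it; the helper names `make_soup` and `fetch_soup` and the 10-second timeout are illustrative assumptions, not part of the exercise.

```python
import requests
from bs4 import BeautifulSoup


def make_soup(html):
    """Parse an HTML string with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper).

    raise_for_status() raises requests.HTTPError on a 4xx/5xx response,
    so a failed request is not silently parsed as if it were the page.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return make_soup(response.text)
```

Calling `fetch_soup(URL)` would then return the same kind of object as `fake_python_soup` above.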

<h4>Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.</h4>

In [3]:
fake_python_soup.find('h2', attrs={'class' : 'title'}).text

'Senior Python Developer'
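The same tag can be reached in a few equivalent ways. The sketch below compares them on a small, hypothetical HTML snippet; the markup is a simplified assumption about what one job card looks like, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for one job card (an assumption for illustration).
html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Three ways to ask for the first <h2> whose class list includes "title".
by_attrs = soup.find("h2", attrs={"class": "title"})
by_keyword = soup.find("h2", class_="title")  # class_ avoids the reserved word `class`
by_selector = soup.select_one("h2.title")     # CSS selector syntax

first_title = by_attrs.text.strip()
print(first_title)
```

All three lookups match even though the tag carries a second class, because BeautifulSoup compares the requested class against each class on the tag individually.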

<h4>Now, apply the same approach to extract the job titles for all jobs on this page. Store the results in a list.</h4>

In [4]:
# find_all is the current name; findAll is a legacy alias
job_title_tags = fake_python_soup.find_all('h2', attrs={'class' : 'title'})

job_titles = [job_title_tag.text for job_title_tag in job_title_tags]

print(job_titles)

['Senior Python Developer', 'Energy engineer', 'Legal executive', 'Fitness centre manager', 'Product manager', 'Medical technical officer', 'Physiological scientist', 'Textile designer', 'Television floor manager', 'Waste management officer', 'Software Engineer (Python)', 'Interpreter', 'Architect', 'Meteorologist', 'Audiological scientist', 'English as a second language teacher', 'Surgeon', 'Equities trader', 'Newspaper journalist', 'Materials engineer', 'Python Programmer (Entry-Level)', 'Product/process development scientist', 'Scientist, research (maths)', 'Ecologist', 'Materials engineer', 'Historic buildings inspector/conservation officer', 'Data scientist', 'Psychiatrist', 'Structural engineer', 'Immigration officer', 'Python Programmer (Entry-Level)', 'Neurosurgeon', 'Broadcast engineer', 'Make', 'Nurse, adult', 'Air broker', 'Editor, film/video', 'Production assistant, radio', 'Engineer, communications', 'Sales executive', 'Software Developer (Python)', 'Futures trader', 'Tour
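Once the titles are in a plain list, ordinary Python is enough to inspect them. As a minimal sketch, the example below uses a handful of titles copied from the printed output above (not the full list) to pick out the Python-related jobs and to show that titles can repeat.

```python
# A few titles copied from the printed list above, not the full result.
sample_titles = [
    'Senior Python Developer',
    'Energy engineer',
    'Software Engineer (Python)',
    'Materials engineer',
    'Python Programmer (Entry-Level)',
    'Materials engineer',
]

# Case-insensitive match on the word "python".
python_titles = [title for title in sample_titles if 'python' in title.lower()]

# dict.fromkeys drops duplicates while keeping the original order.
unique_titles = list(dict.fromkeys(sample_titles))

print(python_titles)
print(len(sample_titles), len(unique_titles))
```

Because titles are not unique, a title on its own is not a safe key for identifying a job.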

<h4>Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.</h4>

In [5]:
company_tags = fake_python_soup.find_all('h3', attrs={'class' : 'company'})
location_tags = fake_python_soup.find_all('p', attrs={'class' : 'location'})
posting_date_tags = fake_python_soup.find_all('time')

companies = [company_tag.text.strip() for company_tag in company_tags]
locations = [location_tag.text.strip() for location_tag in location_tags]
posting_dates = [posting_date_tag.text.strip() for posting_date_tag in posting_date_tags]

print(companies)
print(locations)
print(posting_dates)

['Payne, Roberts and Davis', 'Vasquez-Davidson', 'Jackson, Chambers and Levy', 'Savage-Bradley', 'Ramirez Inc', 'Rogers-Yates', 'Kramer-Klein', 'Meyers-Johnson', 'Hughes-Williams', 'Jones, Williams and Villa', 'Garcia PLC', 'Gregory and Sons', 'Clark, Garcia and Sosa', 'Bush PLC', 'Salazar-Meyers', 'Parker, Murphy and Brooks', 'Cruz-Brown', 'Macdonald-Ferguson', 'Williams, Peterson and Rojas', 'Smith and Sons', 'Moss, Duncan and Allen', 'Gomez-Carroll', 'Manning, Welch and Herring', 'Lee, Gutierrez and Brown', 'Davis, Serrano and Cook', 'Smith LLC', 'Thomas Group', 'Silva-King', 'Pierce-Long', 'Walker-Simpson', 'Cooper and Sons', 'Donovan, Gonzalez and Figueroa', 'Morgan, Butler and Bennett', 'Snyder-Lee', 'Harris PLC', 'Washington PLC', 'Brown, Price and Campbell', 'Mcgee PLC', 'Dixon Inc', 'Thompson, Sheppard and Ward', 'Adams-Brewer', 'Schneider-Brady', 'Gonzales-Frank', 'Smith-Wong', 'Pierce-Herrera', 'Aguilar, Rivera and Quinn', 'Lowe, Barnes and Thomas', 'Lewis, Gonzalez and Vasq
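Collecting each field in its own list works when every job has every field, because the lists then line up by position. An alternative is to walk the page one job card at a time, so that a missing field cannot shift the others out of alignment. The sketch below shows the idea on hypothetical markup; the `card-content` class, the snippet itself, and the helper `text_or_none` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two job cards, the second has no location.
html = """
<div class="card-content">
  <h2 class="title">Senior Python Developer</h2>
  <h3 class="company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
<div class="card-content">
  <h2 class="title">Energy engineer</h2>
  <h3 class="company">Vasquez-Davidson</h3>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_or_none(parent, name, **kwargs):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, **kwargs)
    return tag.get_text(strip=True) if tag is not None else None


rows = []
for card in soup.find_all("div", class_="card-content"):
    rows.append({
        "Job Title": text_or_none(card, "h2", class_="title"),
        "Company Name": text_or_none(card, "h3", class_="company"),
        "Location": text_or_none(card, "p", class_="location"),
        "Posting Date": text_or_none(card, "time"),
    })

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, with a missing field appearing as an empty value instead of raising an error.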

<h4>Take the lists that you have created and combine them into a pandas DataFrame.</h4>

In [6]:
# take our lists and combine them into a pandas dataframe
fake_python_jobs_df = pd.DataFrame({
    'Job Title' : job_titles,
    'Company Name' : companies,
    'Location' : locations,
    'Posting Date' : posting_dates
})

fake_python_jobs_df

,Job Title,Company Name,Location,Posting Date
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08
...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08
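The scraped posting dates are plain strings. If date arithmetic or sorting is needed later, they can be converted with `pd.to_datetime`. The sketch below uses the first two rows from the output above; the name `sample_df` is an assumption, and the format string assumes every date looks like `2021-04-08`.

```python
import pandas as pd

# First two rows copied from the output above.
sample_df = pd.DataFrame({
    'Job Title': ['Senior Python Developer', 'Energy engineer'],
    'Company Name': ['Payne, Roberts and Davis', 'Vasquez-Davidson'],
    'Location': ['Stewartbury, AA', 'Christopherville, AA'],
    'Posting Date': ['2021-04-08', '2021-04-08'],
})

# Convert the string column to real datetimes (assumes YYYY-MM-DD throughout).
sample_df['Posting Date'] = pd.to_datetime(sample_df['Posting Date'], format='%Y-%m-%d')

print(sample_df.dtypes)
print(sample_df.shape)
```

Note that building a DataFrame from a dictionary of lists requires all lists to have the same length; pandas raises a `ValueError` otherwise.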
