## Webscraping project
In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

Use the requests library to retrieve the contents of this web page.

In [3]:
import pandas as pd 
import requests 

To perform a GET request, use `requests.get()` and pass in the URL.

In [4]:
URL = 'https://realpython.github.io/fake-jobs/'

response = requests.get(URL)

Checking on the kind of object we get:

In [5]:
type(response)

requests.models.Response

And checking the status using `status_code`:

In [6]:
response.status_code

200

That's great! We got 200: the standard response for a successful request.

Other common status codes are:
* 400: Bad Request
* 404: Not Found
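
The codes listed above can also be looked up programmatically. The sketch below uses the standard library's `http.HTTPStatus` enum; the helper name `describe_status` is made up for illustration and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the numeric code followed by its standard reason phrase."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    return f"{status.value} {status.phrase}"


# The three codes mentioned in this notebook.
for code in (200, 400, 404):
    print(describe_status(code))
# 200 OK
# 400 Bad Request
# 404 Not Found
```

In the notebook itself, calling `response.raise_for_status()` is one way to fail fast: requests raises `requests.exceptions.HTTPError` for 4xx and 5xx responses.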

Now let's see what the request returned.

In [7]:
response.text

'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Fake Python</title>\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">\n  </head>\n  <body>\n  <section class="section">\n    <div class="container mb-5">\n      <h1 class="title is-1">\n        Fake Python\n      </h1>\n      <p class="subtitle is-3">\n        Fake Jobs for Your Web Scraping Journey\n      </p>\n    </div>\n    <div class="container">\n    <div id="ResultsContainer" class="columns is-multiline">\n    <div class="column is-half">\n<div class="card">\n  <div class="card-content">\n    <div class="media">\n      <div class="media-left">\n        <figure class="image is-48x48">\n          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">\n        </figure>\n      </div>\n      <div class="media-content"

Importing and using [_Beautiful Soup_](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) will help us more easily decipher the text above.

In [8]:
from bs4 import BeautifulSoup as BS

Soupifying response text:

In [9]:
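# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BS(response.text, 'html.parser')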

In [10]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Fake Python
  </title>
  <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
  <section class="section">
   <div class="container mb-5">
    <h1 class="title is-1">
     Fake Python
    </h1>
    <p class="subtitle is-3">
     Fake Jobs for Your Web Scraping Journey
    </p>
   </div>
   <div class="container">
    <div class="columns is-multiline" id="ResultsContainer">
     <div class="column is-half">
      <div class="card">
       <div class="card-content">
        <div class="media">
         <div class="media-left">
          <figure class="image is-48x48">
           <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
          </figure>
         </div>
         <div class="media-content">
          <h2 c

1a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title. 

In [14]:
soup.find('h2')

<h2 class="title is-5">Senior Python Developer</h2>

In [15]:
type(soup.find('h2'))

bs4.element.Tag

In [16]:
soup.find('h2').text

'Senior Python Developer'
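
`.find()` also accepts a `class_` filter, which can help when a page contains more than one kind of `h2`. The sketch below runs on a small inline HTML string modeled on the markup shown above (`sample_html` is a stand-in for illustration, not the live page) and uses `get_text(strip=True)` to trim surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Stand-in for the page: heading markup copied from the output above.
sample_html = """
<div class="media-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Match on the tag name and on one of the tag's CSS classes.
title_tag = sample_soup.find("h2", class_="title")

# get_text(strip=True) returns the text without leading or trailing whitespace.
first_title = title_tag.get_text(strip=True)
print(first_title)
```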

1b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.

In [33]:
job_titles = soup.find_all('h2')
job_titles

[<h2 class="title is-5">Senior Python Developer</h2>,
 <h2 class="title is-5">Energy engineer</h2>,
 <h2 class="title is-5">Legal executive</h2>,
 <h2 class="title is-5">Fitness centre manager</h2>,
 <h2 class="title is-5">Product manager</h2>,
 <h2 class="title is-5">Medical technical officer</h2>,
 <h2 class="title is-5">Physiological scientist</h2>,
 <h2 class="title is-5">Textile designer</h2>,
 <h2 class="title is-5">Television floor manager</h2>,
 <h2 class="title is-5">Waste management officer</h2>,
 <h2 class="title is-5">Software Engineer (Python)</h2>,
 <h2 class="title is-5">Interpreter</h2>,
 <h2 class="title is-5">Architect</h2>,
 <h2 class="title is-5">Meteorologist</h2>,
 <h2 class="title is-5">Audiological scientist</h2>,
 <h2 class="title is-5">English as a second language teacher</h2>,
 <h2 class="title is-5">Surgeon</h2>,
 <h2 class="title is-5">Equities trader</h2>,
 <h2 class="title is-5">Newspaper journalist</h2>,
 <h2 class="title is-5">Materials engineer</h2>,
 

Extracting only the text from each tag in the list of job titles above, using a for loop.
Still need to extract ...

In [116]:
job_titles_text = []
for h2 in job_titles:
    job_titles_text.append(h2.text)

In [117]:
print(job_titles_text)

['Senior Python Developer', 'Energy engineer', 'Legal executive', 'Fitness centre manager', 'Product manager', 'Medical technical officer', 'Physiological scientist', 'Textile designer', 'Television floor manager', 'Waste management officer', 'Software Engineer (Python)', 'Interpreter', 'Architect', 'Meteorologist', 'Audiological scientist', 'English as a second language teacher', 'Surgeon', 'Equities trader', 'Newspaper journalist', 'Materials engineer', 'Python Programmer (Entry-Level)', 'Product/process development scientist', 'Scientist, research (maths)', 'Ecologist', 'Materials engineer', 'Historic buildings inspector/conservation officer', 'Data scientist', 'Psychiatrist', 'Structural engineer', 'Immigration officer', 'Python Programmer (Entry-Level)', 'Neurosurgeon', 'Broadcast engineer', 'Make', 'Nurse, adult', 'Air broker', 'Editor, film/video', 'Production assistant, radio', 'Engineer, communications', 'Sales executive', 'Software Developer (Python)', 'Futures trader', 'Tour
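
The same loop can be written as a list comprehension. This is a minimal sketch on a three-item inline sample (`sample_html` and `sample_titles` are illustrative names, not part of the exercise); against the real page, `soup.find_all('h2')` would take the place of `sample_soup.find_all("h2")`.

```python
from bs4 import BeautifulSoup

# Stand-in markup: three of the job titles shown above.
sample_html = """
<h2 class="title is-5">Senior Python Developer</h2>
<h2 class="title is-5">Energy engineer</h2>
<h2 class="title is-5">Legal executive</h2>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# One pass over the matching tags, keeping only the stripped text of each.
sample_titles = [h2.get_text(strip=True) for h2 in sample_soup.find_all("h2")]
print(sample_titles)
```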

1c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  

    #    <h3 class="subtitle is-6 company">
    #        Payne, Roberts and Davis
    #       </h3>
    #      </div>
    #     </div>
    #     <div class="content">
    #      <p class="location">
    #       Stewartbury, AA
    #      </p>
    #      <p class="is-small has-text-grey">
    #       <time datetime="2021-04-08">
    #        2021-04-08
    #       </time>
    #      </p>
    #     </div>

In [44]:
soup.find('h3')

<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>

In [82]:
companies = soup.find_all('h3')
companies

[<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>,
 <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>,
 <h3 class="subtitle is-6 company">Jackson, Chambers and Levy</h3>,
 <h3 class="subtitle is-6 company">Savage-Bradley</h3>,
 <h3 class="subtitle is-6 company">Ramirez Inc</h3>,
 <h3 class="subtitle is-6 company">Rogers-Yates</h3>,
 <h3 class="subtitle is-6 company">Kramer-Klein</h3>,
 <h3 class="subtitle is-6 company">Meyers-Johnson</h3>,
 <h3 class="subtitle is-6 company">Hughes-Williams</h3>,
 <h3 class="subtitle is-6 company">Jones, Williams and Villa</h3>,
 <h3 class="subtitle is-6 company">Garcia PLC</h3>,
 <h3 class="subtitle is-6 company">Gregory and Sons</h3>,
 <h3 class="subtitle is-6 company">Clark, Garcia and Sosa</h3>,
 <h3 class="subtitle is-6 company">Bush PLC</h3>,
 <h3 class="subtitle is-6 company">Salazar-Meyers</h3>,
 <h3 class="subtitle is-6 company">Parker, Murphy and Brooks</h3>,
 <h3 class="subtitle is-6 company">Cruz-Brown</h3>,
 <h3 class="

In [119]:
companies_text = []
for h3 in companies:
    companies_text.append(h3.text)

In [120]:
print(companies_text)

['Payne, Roberts and Davis', 'Vasquez-Davidson', 'Jackson, Chambers and Levy', 'Savage-Bradley', 'Ramirez Inc', 'Rogers-Yates', 'Kramer-Klein', 'Meyers-Johnson', 'Hughes-Williams', 'Jones, Williams and Villa', 'Garcia PLC', 'Gregory and Sons', 'Clark, Garcia and Sosa', 'Bush PLC', 'Salazar-Meyers', 'Parker, Murphy and Brooks', 'Cruz-Brown', 'Macdonald-Ferguson', 'Williams, Peterson and Rojas', 'Smith and Sons', 'Moss, Duncan and Allen', 'Gomez-Carroll', 'Manning, Welch and Herring', 'Lee, Gutierrez and Brown', 'Davis, Serrano and Cook', 'Smith LLC', 'Thomas Group', 'Silva-King', 'Pierce-Long', 'Walker-Simpson', 'Cooper and Sons', 'Donovan, Gonzalez and Figueroa', 'Morgan, Butler and Bennett', 'Snyder-Lee', 'Harris PLC', 'Washington PLC', 'Brown, Price and Campbell', 'Mcgee PLC', 'Dixon Inc', 'Thompson, Sheppard and Ward', 'Adams-Brewer', 'Schneider-Brady', 'Gonzales-Frank', 'Smith-Wong', 'Pierce-Herrera', 'Aguilar, Rivera and Quinn', 'Lowe, Barnes and Thomas', 'Lewis, Gonzalez and Vasq

In [102]:
soup.find('p')

<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>

In [103]:
all_p = soup.find_all('p')
print(type(all_p))
all_p

<class 'bs4.element.ResultSet'>


[<p class="subtitle is-3">
         Fake Jobs for Your Web Scraping Journey
       </p>,
 <p class="location">
         Stewartbury, AA
       </p>,
 <p class="is-small has-text-grey">
 <time datetime="2021-04-08">2021-04-08</time>
 </p>,
 <p class="location">
         Christopherville, AA
       </p>,
 <p class="is-small has-text-grey">
 <time datetime="2021-04-08">2021-04-08</time>
 </p>,
 <p class="location">
         Port Ericaburgh, AA
       </p>,
 <p class="is-small has-text-grey">
 <time datetime="2021-04-08">2021-04-08</time>
 </p>,
 <p class="location">
         East Seanview, AP
       </p>,
 <p class="is-small has-text-grey">
 <time datetime="2021-04-08">2021-04-08</time>
 </p>,
 <p class="location">
         North Jamieview, AP
       </p>,
 <p class="is-small has-text-grey">
 <time datetime="2021-04-08">2021-04-08</time>
 </p>,
 <p class="location">
         Davidville, AP
       </p>,
 <p class="is-small has-text-grey">
 <time datetime="2021-04-08">2021-04-08</time>
 </p

In [108]:
location_p = all_p[1]
print(type(location_p))
location_p

<class 'bs4.element.Tag'>


<p class="location">
        Stewartbury, AA
      </p>

In [109]:
location_p.get('class', default='No Class')

['location']

In [112]:
soup.find('p', attrs={'class' : 'location'})

<p class="location">
        Stewartbury, AA
      </p>
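
Filtering by class can be spelled in more than one way. The sketch below compares the `attrs` dictionary used above with the `class_` keyword and a CSS selector, on an inline sample modeled on the page's `p` tags (`sample_html` is a stand-in, not the live page). All three should select the same tags.

```python
from bs4 import BeautifulSoup

# Stand-in markup: a subtitle, two locations, and one date paragraph.
sample_html = """
<p class="subtitle is-3">Fake Jobs for Your Web Scraping Journey</p>
<p class="location">
  Stewartbury, AA
</p>
<p class="is-small has-text-grey"><time datetime="2021-04-08">2021-04-08</time></p>
<p class="location">
  Christopherville, AA
</p>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Three ways to ask for <p> tags that carry the "location" class.
by_attrs = sample_soup.find_all("p", attrs={"class": "location"})
by_keyword = sample_soup.find_all("p", class_="location")
by_selector = sample_soup.select("p.location")

# Strip the surrounding newlines and spaces while extracting the text.
sample_locations = [p.get_text(strip=True) for p in by_selector]
print(sample_locations)
```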

In [113]:
locations = soup.find_all('p', attrs={'class': 'location'})
locations

[<p class="location">
         Stewartbury, AA
       </p>,
 <p class="location">
         Christopherville, AA
       </p>,
 <p class="location">
         Port Ericaburgh, AA
       </p>,
 <p class="location">
         East Seanview, AP
       </p>,
 <p class="location">
         North Jamieview, AP
       </p>,
 <p class="location">
         Davidville, AP
       </p>,
 <p class="location">
         South Christopher, AE
       </p>,
 <p class="location">
         Port Jonathan, AE
       </p>,
 <p class="location">
         Osbornetown, AE
       </p>,
 <p class="location">
         Scotttown, AP
       </p>,
 <p class="location">
         Ericberg, AE
       </p>,
 <p class="location">
         Ramireztown, AE
       </p>,
 <p class="location">
         Figueroaview, AA
       </p>,
 <p class="location">
         Kelseystad, AA
       </p>,
 <p class="location">
         Williamsburgh, AE
       </p>,
 <p class="location">
         Mitchellburgh, AE
       </p>,
 <p class="location

In [121]:
locations_text = []
for place in locations:
    locations_text.append(place.text)

The raw text still carries the newlines and spaces that surround each location in the HTML:

In [122]:
print(locations_text)

['\n        Stewartbury, AA\n      ', '\n        Christopherville, AA\n      ', '\n        Port Ericaburgh, AA\n      ', '\n        East Seanview, AP\n      ', '\n        North Jamieview, AP\n      ', '\n        Davidville, AP\n      ', '\n        South Christopher, AE\n      ', '\n        Port Jonathan, AE\n      ', '\n        Osbornetown, AE\n      ', '\n        Scotttown, AP\n      ', '\n        Ericberg, AE\n      ', '\n        Ramireztown, AE\n      ', '\n        Figueroaview, AA\n      ', '\n        Kelseystad, AA\n      ', '\n        Williamsburgh, AE\n      ', '\n        Mitchellburgh, AE\n      ', '\n        West Jessicabury, AA\n      ', '\n        Maloneshire, AE\n      ', '\n        Johnsonton, AA\n      ', '\n        South Davidtown, AP\n      ', '\n        Port Sara, AE\n      ', '\n        Marktown, AA\n      ', '\n        Laurenland, AE\n      ', '\n        Lauraton, AP\n      ', '\n        South Tammyberg, AP\n      ', '\n        North Brandonville, AP\n      ', '\n   

Removing that whitespace with `.strip()`:

In [148]:
locations_text_clean = []
for place in locations_text:
    locations_text_clean.append(place.strip())

In [149]:
print(locations_text_clean)

['Stewartbury, AA', 'Christopherville, AA', 'Port Ericaburgh, AA', 'East Seanview, AP', 'North Jamieview, AP', 'Davidville, AP', 'South Christopher, AE', 'Port Jonathan, AE', 'Osbornetown, AE', 'Scotttown, AP', 'Ericberg, AE', 'Ramireztown, AE', 'Figueroaview, AA', 'Kelseystad, AA', 'Williamsburgh, AE', 'Mitchellburgh, AE', 'West Jessicabury, AA', 'Maloneshire, AE', 'Johnsonton, AA', 'South Davidtown, AP', 'Port Sara, AE', 'Marktown, AA', 'Laurenland, AE', 'Lauraton, AP', 'South Tammyberg, AP', 'North Brandonville, AP', 'Port Robertfurt, AA', 'Burnettbury, AE', 'Herbertside, AA', 'Christopherport, AP', 'West Victor, AE', 'Port Aaron, AP', 'Loribury, AA', 'Angelastad, AP', 'Larrytown, AE', 'West Colin, AP', 'West Stephanie, AP', 'Laurentown, AP', 'Wrightberg, AP', 'Alberttown, AE', 'Brockburgh, AE', 'North Jason, AE', 'Arnoldhaven, AE', 'Lake Destiny, AP', 'South Timothyburgh, AP', 'New Jimmyton, AE', 'New Lucasbury, AP', 'Port Cory, AE', 'Gileston, AA', 'Cindyshire, AA', 'East Michaelf

In [71]:
soup.find('time')

<time datetime="2021-04-08">2021-04-08</time>
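
The `time` tag carries the date twice: once as its text and once in its `datetime` attribute. The sketch below reads the attribute and parses it into a `datetime.date`, using an inline copy of the tag shown above (`sample_html` is a stand-in). It assumes the attribute is always an ISO `YYYY-MM-DD` string, which holds for the sample but has not been checked for every listing.

```python
from datetime import date

from bs4 import BeautifulSoup

# Stand-in markup: the date paragraph from the first job card.
sample_html = '<p class="is-small has-text-grey"><time datetime="2021-04-08">2021-04-08</time></p>'

sample_soup = BeautifulSoup(sample_html, "html.parser")
time_tag = sample_soup.find("time")

# Tag attributes are read with dictionary-style indexing.
raw_date = time_tag["datetime"]

# Parse the ISO-formatted string into a real date object.
posted_on = date.fromisoformat(raw_date)
print(posted_on.year, posted_on.month, posted_on.day)
```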

In [86]:
posting_dates = soup.find_all('time')
posting_dates

[<time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08</time>,
 <time datetime="2021-04-08">2021-04-08<

In [124]:
date_text = []
for date in posting_dates:
    date_text.append(date.text)

In [125]:
print(date_text)

['2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021

1d. Take the lists that you have created and combine them into a pandas DataFrame. 

In [150]:
fake_jobs_df = pd.DataFrame(
    {'job_title': job_titles_text,
     'company': companies_text,
     'location': locations_text_clean,
     'date_posted': date_text}
)

In [151]:
fake_jobs_df.head()

|   | job_title | company | location | date_posted |
|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 |
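
Building four parallel lists works only if every job has all four fields, since a single missing tag would shift the later rows. An alternative sketch is to walk the page one job card at a time and build one row per card. It assumes each job sits inside a `div` with class `card-content`, as the prettified output above suggests, and it runs on a two-card inline sample (`sample_html` and `sample_jobs_df` are illustrative names) rather than the live page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in markup: two simplified job cards modeled on the page structure.
sample_html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <p class="is-small has-text-grey"><time datetime="2021-04-08">2021-04-08</time></p>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">
    Christopherville, AA
  </p>
  <p class="is-small has-text-grey"><time datetime="2021-04-08">2021-04-08</time></p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# One dictionary per card keeps each job's fields together.
rows = []
for card in sample_soup.find_all("div", class_="card-content"):
    rows.append({
        "job_title": card.find("h2").get_text(strip=True),
        "company": card.find("h3").get_text(strip=True),
        "location": card.find("p", class_="location").get_text(strip=True),
        "date_posted": card.find("time").get_text(strip=True),
    })

sample_jobs_df = pd.DataFrame(rows)

# Convert the date strings to pandas datetimes so they sort and filter as dates.
sample_jobs_df["date_posted"] = pd.to_datetime(sample_jobs_df["date_posted"])
print(sample_jobs_df)
```

On the real page, a card with a missing field would make `card.find(...)` return `None`, so a production version would need to check for that before calling `get_text`.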
