## Web Scraping

>In this exercise, you'll practice using BeautifulSoup to parse the content of a web page.
>
>The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings.
>
>Your job is to extract the data on each job and convert it into a pandas DataFrame.

In [2]:
import requests
from bs4 import BeautifulSoup
from IPython.core.display import HTML
import pandas as pd
import io

### 1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.  

In [4]:
URL = 'https://realpython.github.io/fake-jobs/'

response = requests.get(URL)

In [5]:
response.status_code

200

In [6]:
response.text

'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Fake Python</title>\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">\n  </head>\n  <body>\n  <section class="section">\n    <div class="container mb-5">\n      <h1 class="title is-1">\n        Fake Python\n      </h1>\n      <p class="subtitle is-3">\n        Fake Jobs for Your Web Scraping Journey\n      </p>\n    </div>\n    <div class="container">\n    <div id="ResultsContainer" class="columns is-multiline">\n    <div class="column is-half">\n<div class="card">\n  <div class="card-content">\n    <div class="media">\n      <div class="media-left">\n        <figure class="image is-48x48">\n          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">\n        </figure>\n      </div>\n      <div class="media-content"

In [7]:
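#name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.text, 'html.parser')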

In [8]:
#alternative: soup = BeautifulSoup(response.content, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Fake Python
  </title>
  <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
  <section class="section">
   <div class="container mb-5">
    <h1 class="title is-1">
     Fake Python
    </h1>
    <p class="subtitle is-3">
     Fake Jobs for Your Web Scraping Journey
    </p>
   </div>
   <div class="container">
    <div class="columns is-multiline" id="ResultsContainer">
     <div class="column is-half">
      <div class="card">
       <div class="card-content">
        <div class="media">
         <div class="media-left">
          <figure class="image is-48x48">
           <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
          </figure>
         </div>
         <div class="media-content">
          <h2 c

In [9]:
#the container that holds all the job cards is the <div> with id="ResultsContainer"
results_container = soup.find(id='ResultsContainer')
results_container

<div class="columns is-multiline" id="ResultsContainer">
<div class="column is-half">
<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>
<div class="content">
<p class="location">
        Stewartbury, AA
      </p>
<p class="is-small has-text-grey">
<time datetime="2021-04-08">2021-04-08</time>
</p>
</div>
<footer class="card-footer">
<a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>
</footer>
</div>
</div>
</div>
<div class="column is-half">


#### a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.  


In [11]:
soup.find('h2').text

'Senior Python Developer'
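
The hint in the prompt mentions using a class as well as a tag type. As a minimal sketch of that idea, the snippet below runs `.find` with both a tag name and a class on a trimmed-down copy of one job card. The `card_html` string and the variable names are illustrative stand-ins, not part of the notebook.

```python
from bs4 import BeautifulSoup

# A trimmed-down copy of one job card from the listing page (illustrative)
card_html = """
<div class="card">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""

card_soup = BeautifulSoup(card_html, "html.parser")

# Match on the tag name and on one of its CSS classes;
# class_ matches if any class on the tag equals the given value
first_title = card_soup.find("h2", class_="title").get_text(strip=True)
print(first_title)
```

Filtering on the class as well as the tag makes the lookup less likely to break if another `<h2>` is ever added elsewhere on the page.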

#### b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list. 

In [13]:
job_titles = [job.text.strip() for job in soup.find_all('h2')]

#### c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  

In [15]:
#find_all returns every matching tag in page order; .strip() removes surrounding whitespace
job_titles = [job.text.strip() for job in soup.find_all('h2')]
company_names = [company.text.strip() for company in soup.find_all('h3')]
job_location = [location.text.strip() for location in soup.find_all(class_='location')]
posting_dates = [date.text.strip() for date in soup.find_all('time')]
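
The four separate list comprehensions rely on every card having exactly one title, company, location, and date, so that the lists line up. One possible alternative, sketched here on a two-card sample, is to loop over the cards and read all fields from the same card. The `sample_html` string and the `records` name are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# Two simplified job cards, modeled on the markup shown above (illustrative)
sample_html = """
<div id="ResultsContainer">
  <div class="card-content">
    <h2 class="title is-5">Senior Python Developer</h2>
    <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
    <p class="location">
        Stewartbury, AA
    </p>
    <time datetime="2021-04-08">2021-04-08</time>
  </div>
  <div class="card-content">
    <h2 class="title is-5">Energy engineer</h2>
    <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
    <p class="location">
        Christopherville, AA
    </p>
    <time datetime="2021-04-08">2021-04-08</time>
  </div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Read every field from the same card so the values cannot drift out of step
records = []
for card in sample_soup.find_all("div", class_="card-content"):
    records.append({
        "job_title": card.find("h2").get_text(strip=True),
        "company_name": card.find("h3").get_text(strip=True),
        "job_location": card.find(class_="location").get_text(strip=True),
        "posting_date": card.find("time").get_text(strip=True),
    })

print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame`.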

#### d. Take the lists that you have created and combine them into a pandas DataFrame. 

In [17]:
job_list = pd.DataFrame({'job_title': job_titles, 'company_name': company_names, 'job_location': job_location, 'posting_date': posting_dates})

In [18]:
job_list

Unnamed: 0,job_title,company_name,job_location,posting_date
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08
...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08


### 2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.  

In [None]:
#where the hyperlinks for the Apply button are located:
#<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>

#href = "Hypertext REFerence", the attribute that holds the link target
#<a> is the anchor tag, used to create a hyperlink

#### a. First, use the BeautifulSoup find_all method to extract the urls. 

In [61]:
#using find_all to locate all the <a> anchors that have class="card-footer-item"
test = soup.find_all('a', class_='card-footer-item')
test

[<a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>,
 <a class="card

In [63]:
#using find_all to locate all the <a> anchors that have class="card-footer-item",
#then reading the href attribute of each anchor so that only the hyperlinks are kept
apply_url = [a['href'] for a in soup.find_all('a', class_='card-footer-item')]
apply_url

['https://www.realpython.com',
 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
 'https://www.realpython.com',
 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html',
 'https://www.realpython.com',
 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html',
 'https://www.realpython.com',
 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html',
 'https://www.realpython.com',
 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html',
 'https://www.realpython.com',
 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html',
 'https://www.realpython.com',
 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html',
 'https://www.realpython.com',
 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html',
 'https://www.realpython.com',
 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html',
 'https://www.realpython.com',
 'https:

In [24]:
#Using dict.fromkeys()
#This creates a dictionary where each list item becomes a key.
#Keys can’t repeat, so duplicates are dropped while the original order is kept. Then we turn it back into a list.
#One 'https://www.realpython.com' entry (the "Learn" link) still remains after de-duplication.
apply_url_list = list(dict.fromkeys(apply_url))
#using .remove() to get rid of that remaining 'https://www.realpython.com'
apply_url_list.remove('https://www.realpython.com')
apply_url_list

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html',
 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html',
 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html',
 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html',
 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html',
 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html',
 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html',
 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html',
 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html',
 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html',
 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html',
 'https://realpython.github.io/fake-jobs/jobs/architect-12.html',
 'https://realpython.gi

In [25]:
#keep only the anchors whose visible text is "Apply", which skips the "Learn" links
apply_urls = [a['href'] for a in soup.find_all('a', class_='card-footer-item') if a.text.strip() == 'Apply']
apply_urls

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html',
 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html',
 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html',
 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html',
 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html',
 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html',
 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html',
 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html',
 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html',
 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html',
 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html',
 'https://realpython.github.io/fake-jobs/jobs/architect-12.html',
 'https://realpython.gi
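
Another way to skip the "Learn" links, shown here as a sketch on a single card footer, is to let `find_all` filter on the link text through its `string` argument. The `footer_html` sample and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# One card footer copied from the listing page output above
footer_html = """
<footer class="card-footer">
<a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>
</footer>
"""

footer_soup = BeautifulSoup(footer_html, "html.parser")

# string="Apply" keeps only the <a> tags whose text is exactly "Apply"
apply_links = [a["href"] for a in footer_soup.find_all("a", string="Apply")]
print(apply_links)
```

This relies on the link text being exactly "Apply" with no surrounding whitespace, which holds for the markup shown above.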

In [133]:
#creating a job list with the urls from the website
job_list_apply = pd.DataFrame({'job_title': job_titles, 
                         'company_name': company_names, 
                         'job_location': job_location, 
                         'posting_date': posting_dates, 
                         'apply_url' : apply_urls})
job_list_apply

Unnamed: 0,job_title,company_name,job_location,posting_date,apply_url
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/product-manager-4.html
...,...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/museum-gallery-exhibitions-officer-95.html
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/radiographer-diagnostic-96.html
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/database-administrator-97.html
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/furniture-designer-98.html


#### b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.

In [131]:
#max width to see the full column
pd.set_option('display.max_colwidth', None)

#build the url slug: lower-case the title, replace each run of non-alphanumeric
#characters (spaces, "/", ",", "(", ")") with a single "-", then trim stray hyphens
job_list['job_title_url'] = job_list['job_title'].str.lower().str.replace('[^a-z0-9]+', '-', regex=True).str.strip('-')

#create the URL from the slug and the row index
apply_link = 'https://realpython.github.io/fake-jobs/jobs/' + job_list['job_title_url'] + '-' + job_list.index.astype(str) + '.html'
apply_link

0                 https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html
1                         https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html
2                         https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html
3                  https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html
4                         https://realpython.github.io/fake-jobs/jobs/product-manager-4.html
                                               ...                                          
95    https://realpython.github.io/fake-jobs/jobs/museum-gallery-exhibitions-officer-95.html
96               https://realpython.github.io/fake-jobs/jobs/radiographer-diagnostic-96.html
97                https://realpython.github.io/fake-jobs/jobs/database-administrator-97.html
98                    https://realpython.github.io/fake-jobs/jobs/furniture-designer-98.html
99                           https://realpython.github.io/fake-jobs/jo

In [135]:
#create df with the built urls (apply_link), not the scraped ones (apply_urls)
job_list_built_url = pd.DataFrame({'job_title': job_titles, 
                         'company_name': company_names, 
                         'job_location': job_location, 
                         'posting_date': posting_dates, 
                         'apply_link' : apply_link})

job_list_built_url

Unnamed: 0,job_title,company_name,job_location,posting_date,apply_link
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/product-manager-4.html
...,...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/museum-gallery-exhibitions-officer-95.html
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/radiographer-diagnostic-96.html
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/database-administrator-97.html
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/furniture-designer-98.html
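
The exercise asks to confirm that the constructed URLs match the scraped ones. A small sketch of such a check follows, using four title and URL pairs taken from the outputs above. The cleaning rule (lower-case, then every run of non-alphanumeric characters becomes one hyphen) is an assumption inferred from those URLs, and the `sample`, `slug`, and `built` names are illustrative.

```python
import pandas as pd

BASE = "https://realpython.github.io/fake-jobs/jobs/"

# Four rows taken from the outputs above, keeping their original row numbers
sample = pd.DataFrame(
    {
        "job_title": [
            "Senior Python Developer",
            "Software Engineer (Python)",
            "Museum/gallery exhibitions officer",
            "Radiographer, diagnostic",
        ],
        "apply_url": [
            BASE + "senior-python-developer-0.html",
            BASE + "software-engineer-python-10.html",
            BASE + "museum-gallery-exhibitions-officer-95.html",
            BASE + "radiographer-diagnostic-96.html",
        ],
    },
    index=[0, 10, 95, 96],
)

# Assumed rule: lower-case, collapse non-alphanumeric runs to "-", trim hyphens
slug = (
    sample["job_title"]
    .str.lower()
    .str.replace(r"[^a-z0-9]+", "-", regex=True)
    .str.strip("-")
)

# The row number is the last part of the file name
row_number = sample.index.to_series().astype(str)
built = BASE + slug + "-" + row_number + ".html"

# True only if every constructed URL equals the scraped one
all_match = bool((built == sample["apply_url"]).all())
print(all_match)

# Rows that differ, if any, are the ones whose titles need more cleaning
mismatches = sample.loc[built != sample["apply_url"], "job_title"]
print(mismatches.tolist())
```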


### 3. Finally, we want to get the job description text for each job.  

#### a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.  

In [139]:
URL_job_des = 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'

response_job_des = requests.get(URL_job_des)

In [141]:
response_job_des.status_code

200

In [143]:
response_job_des.text

'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Fake Python</title>\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">\n  </head>\n  <body>\n  <section class="section">\n    <div class="container mb-5">\n      <h1 class="title is-1">\n        Fake Python\n      </h1>\n      <p class="subtitle is-3">\n        Fake Jobs for Your Web Scraping Journey\n      </p>\n    </div>\n    <div class="container">\n    <div id="ResultsContainer" class="columns is-multiline">\n    <div class="box">\n<h1 class="title is-2">Senior Python Developer</h1>\n<h2 class="subtitle is-4 company">Payne, Roberts and Davis</h2>\n<div class="content">\n    <p>Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web applicat

In [151]:
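#name the parser explicitly so the result does not depend on which parsers are installed
job_des_soup = BeautifulSoup(response_job_des.text, 'html.parser')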

print(job_des_soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Fake Python
  </title>
  <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
  <section class="section">
   <div class="container mb-5">
    <h1 class="title is-1">
     Fake Python
    </h1>
    <p class="subtitle is-3">
     Fake Jobs for Your Web Scraping Journey
    </p>
   </div>
   <div class="container">
    <div class="columns is-multiline" id="ResultsContainer">
     <div class="box">
      <h1 class="title is-2">
       Senior Python Developer
      </h1>
      <h2 class="subtitle is-4 company">
       Payne, Roberts and Davis
      </h2>
      <div class="content">
       <p>
        Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational

In [None]:
#The HTML division tag, called "div" for short, is a special element that lets you group similar sets of content together on a web page.
#<div class="content">
       #<p>
        #Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile 
        #employ growth opportunity. Company programs CSS explore role. Html educational grit web application. 
        #Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. 
        #Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. 
        #Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.
       #</p>
#The HTML <p> tag is a fundamental element used for creating paragraphs in web development.

In [183]:
#the job details sit inside the element with id="ResultsContainer"
results_container_des = job_des_soup.find(id='ResultsContainer')

#the description is the first <p> inside <div class="content">;
#taking the text of the whole div would also pull in the location and posting date
description = results_container_des.find('div', class_='content').find('p').get_text(strip=True)
description

'Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.'

#### b. We want to be able to do this for all pages. Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.". 
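
One possible shape for such a function is sketched below. It separates the parsing step from the download so the parsing can be tried without network access. The names `extract_description` and `get_description`, the 10-second timeout, and the simplified `sample_page` markup (in particular the location paragraph) are assumptions made for this illustration.

```python
import requests
from bs4 import BeautifulSoup


def extract_description(html):
    """Return the text of the first <p> inside <div class="content">."""
    page_soup = BeautifulSoup(html, "html.parser")
    content = page_soup.find("div", class_="content")
    return content.find("p").get_text(strip=True)


def get_description(url):
    """Download a job page and return its description paragraph."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on a 4xx or 5xx response
    return extract_description(response.text)


# Offline check on markup shaped like the detail page shown earlier;
# the description is shortened to its first sentence
sample_page = """
<div class="box">
<h1 class="title is-2">Senior Python Developer</h1>
<h2 class="subtitle is-4 company">Payne, Roberts and Davis</h2>
<div class="content">
    <p>Professional asset web application environmentally friendly detail-oriented asset.</p>
    <p id="location">Location: Stewartbury, AA</p>
</div>
</div>
"""

print(extract_description(sample_page))
```

Calling `get_description` with one of the job URLs performs a live request, so it needs network access.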

#### c. Use the [.apply method](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html) on the url column you created above to retrieve the description text for all of the jobs.
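
A minimal sketch of the `.apply` step follows. To keep it runnable offline, a dictionary lookup named `lookup_description` stands in for the scraping function from part b, and the two descriptions are shortened to their first sentence. The `jobs` DataFrame and all names here are illustrative.

```python
import pandas as pd

url_0 = "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
url_8 = "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html"

jobs = pd.DataFrame({"apply_url": [url_0, url_8]})

# Stand-in for the real scraping function so the sketch needs no network;
# the texts are the first sentences of the descriptions quoted above
first_sentences = {
    url_0: "Professional asset web application environmentally friendly detail-oriented asset.",
    url_8: "At be than always different American address.",
}


def lookup_description(url):
    return first_sentences[url]


# .apply calls the function once per URL and returns a Series of the results
jobs["description"] = jobs["apply_url"].apply(lookup_description)
print(jobs)
```

In the notebook, the function written for part b would take the place of `lookup_description`. It would then issue one HTTP request per row, so running it over all of the jobs may take a little while.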