## Webscraping - OpenSecrets

In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

In [1]:
import requests

In [2]:
URL = 'https://realpython.github.io/fake-jobs/'

response = requests.get(URL)

In [3]:
response.status_code

200

In [4]:
response.text

'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Fake Python</title>\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">\n  </head>\n  <body>\n  <section class="section">\n    <div class="container mb-5">\n      <h1 class="title is-1">\n        Fake Python\n      </h1>\n      <p class="subtitle is-3">\n        Fake Jobs for Your Web Scraping Journey\n      </p>\n    </div>\n    <div class="container">\n    <div id="ResultsContainer" class="columns is-multiline">\n    <div class="column is-half">\n<div class="card">\n  <div class="card-content">\n    <div class="media">\n      <div class="media-left">\n        <figure class="image is-48x48">\n          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">\n        </figure>\n      </div>\n      <div class="media-content"

In [5]:
from bs4 import BeautifulSoup

1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.  


In [6]:
soup = BeautifulSoup(response.text, "html.parser")

Now, we can print it out in a slightly more readable form.

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Fake Python
  </title>
  <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
  <section class="section">
   <div class="container mb-5">
    <h1 class="title is-1">
     Fake Python
    </h1>
    <p class="subtitle is-3">
     Fake Jobs for Your Web Scraping Journey
    </p>
   </div>
   <div class="container">
    <div class="columns is-multiline" id="ResultsContainer">
     <div class="column is-half">
      <div class="card">
       <div class="card-content">
        <div class="media">
         <div class="media-left">
          <figure class="image is-48x48">
           <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
          </figure>
         </div>
         <div class="media-content">
          <h2 c

a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title. 

In [8]:
soup.find('title')

<title>Fake Python</title>

Notice that this returns a bs4 Tag object.

In [9]:
type(soup.find('title'))

bs4.element.Tag

To extract the text, you can use the `.text` attribute.

In [10]:
soup.find('h2', attrs={'class' : 'title is-5'}).text

'Senior Python Developer'
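
The `attrs={'class' : 'title is-5'}` lookup above matches the full class string exactly. A looser alternative is to match on a single class with `class_` and to guard against `.find` returning `None` when nothing matches. The sketch below runs on a small hand-written HTML snippet, which is an assumption standing in for the real page.

```python
from bs4 import BeautifulSoup

# Hand-written snippet that mimics one job card; it is not fetched from the site.
html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""
snippet_soup = BeautifulSoup(html, "html.parser")

# class_="title" matches any <h2> whose list of classes contains "title".
title_tag = snippet_soup.find("h2", class_="title")

# .find returns None when there is no match, so check before using .text.
first_title = title_tag.text.strip() if title_tag is not None else None

# A class that does not occur in the snippet yields None rather than an error.
missing_tag = snippet_soup.find("h2", class_="does-not-exist")

print(first_title)
print(missing_tag)
```

Matching on one class tends to be less brittle than matching the whole class string, since it keeps working if the page adds or reorders styling classes.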


b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.  

In [11]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime as dt
import pandas as pd

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

results = soup.find(id="ResultsContainer")

job_elements = results.find_all("div", class_="card-content")

# for job_element in job_elements:
#    print(job_element, end="\n"*2)

list0 = []

for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    list0.append(title_element.text.strip())
         
list0

['Senior Python Developer',
 'Energy engineer',
 'Legal executive',
 'Fitness centre manager',
 'Product manager',
 'Medical technical officer',
 'Physiological scientist',
 'Textile designer',
 'Television floor manager',
 'Waste management officer',
 'Software Engineer (Python)',
 'Interpreter',
 'Architect',
 'Meteorologist',
 'Audiological scientist',
 'English as a second language teacher',
 'Surgeon',
 'Equities trader',
 'Newspaper journalist',
 'Materials engineer',
 'Python Programmer (Entry-Level)',
 'Product/process development scientist',
 'Scientist, research (maths)',
 'Ecologist',
 'Materials engineer',
 'Historic buildings inspector/conservation officer',
 'Data scientist',
 'Psychiatrist',
 'Structural engineer',
 'Immigration officer',
 'Python Programmer (Entry-Level)',
 'Neurosurgeon',
 'Broadcast engineer',
 'Make',
 'Nurse, adult',
 'Air broker',
 'Editor, film/video',
 'Production assistant, radio',
 'Engineer, communications',
 'Sales executive',
 'Software Deve

c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  

In [12]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime as dt
import pandas as pd

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

results = soup.find(id="ResultsContainer")

job_elements = results.find_all("div", class_="card-content")

# for job_element in job_elements:
#    print(job_element, end="\n"*2)


for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    # date_element = job_element.find("time")  # the posting date sits in the card's <time> tag
  
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    # print(date_element.text.strip())
    print()
    

Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA

Energy engineer
Vasquez-Davidson
Christopherville, AA

Legal executive
Jackson, Chambers and Levy
Port Ericaburgh, AA

Fitness centre manager
Savage-Bradley
East Seanview, AP

Product manager
Ramirez Inc
North Jamieview, AP

Medical technical officer
Rogers-Yates
Davidville, AP

Physiological scientist
Kramer-Klein
South Christopher, AE

Textile designer
Meyers-Johnson
Port Jonathan, AE

Television floor manager
Hughes-Williams
Osbornetown, AE

Waste management officer
Jones, Williams and Villa
Scotttown, AP

Software Engineer (Python)
Garcia PLC
Ericberg, AE

Interpreter
Gregory and Sons
Ramireztown, AE

Architect
Clark, Garcia and Sosa
Figueroaview, AA

Meteorologist
Bush PLC
Kelseystad, AA

Audiological scientist
Salazar-Meyers
Williamsburgh, AE

English as a second language teacher
Parker, Murphy and Brooks
Mitchellburgh, AE

Surgeon
Cruz-Brown
West Jessicabury, AA

Equities trader
Macdonald-Ferguson
Maloneshire, AE
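
The cell above leaves out the posting date. One way to get it is to look up the card's `<time>` tag by name; no date format string is needed in the search itself. The sketch below uses a hand-written card (the markup, including the `datetime` attribute, is an assumption about the page) and also shows an optional conversion to a `date` object with `datetime.strptime`.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hand-written card with deliberately messy whitespace around the location.
card_html = """
<div class="card-content">
  <p class="location">
        Stewartbury, AA
      </p>
  <p class="is-small has-text-grey">
    <time datetime="2021-04-08">2021-04-08</time>
  </p>
</div>
"""
card = BeautifulSoup(card_html, "html.parser")

# Look the tag up by name; .text gives the visible date string.
date_element = card.find("time")
date_text = date_element.text.strip()

# The same value is also available as an attribute of the tag.
date_attr = date_element.get("datetime")

# Optional: turn the string into a real date for sorting or filtering later.
posted = datetime.strptime(date_text, "%Y-%m-%d").date()

# .strip() removes the newlines and indentation around the location text.
location = card.find("p", class_="location").text.strip()
```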



d. Take the lists that you have created and combine them into a pandas DataFrame. 

In [13]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime as dt
import pandas as pd

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

results = soup.find(id="ResultsContainer")

job_elements = results.find_all("div", class_="card-content")

# for job_element in job_elements:
#    print(job_element, end="\n"*2)

list1 = []
list2 = []
list3 = []
list4 = []

for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    list1.append(title_element.text.strip())
    company_element = job_element.find("h3", class_="company")
    list2.append(company_element.text.strip())
    location_element = job_element.find("p", class_="location")
    list3.append(location_element.text.strip())
    date_element = job_element.find("time")  # each card has a single <time> tag
    list4.append(date_element.text.strip())
    print(date_element.text.strip())
    
list1

2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08
2021-04-08

['Senior Python Developer',
 'Energy engineer',
 'Legal executive',
 'Fitness centre manager',
 'Product manager',
 'Medical technical officer',
 'Physiological scientist',
 'Textile designer',
 'Television floor manager',
 'Waste management officer',
 'Software Engineer (Python)',
 'Interpreter',
 'Architect',
 'Meteorologist',
 'Audiological scientist',
 'English as a second language teacher',
 'Surgeon',
 'Equities trader',
 'Newspaper journalist',
 'Materials engineer',
 'Python Programmer (Entry-Level)',
 'Product/process development scientist',
 'Scientist, research (maths)',
 'Ecologist',
 'Materials engineer',
 'Historic buildings inspector/conservation officer',
 'Data scientist',
 'Psychiatrist',
 'Structural engineer',
 'Immigration officer',
 'Python Programmer (Entry-Level)',
 'Neurosurgeon',
 'Broadcast engineer',
 'Make',
 'Nurse, adult',
 'Air broker',
 'Editor, film/video',
 'Production assistant, radio',
 'Engineer, communications',
 'Sales executive',
 'Software Deve

In [14]:
list2

['Payne, Roberts and Davis',
 'Vasquez-Davidson',
 'Jackson, Chambers and Levy',
 'Savage-Bradley',
 'Ramirez Inc',
 'Rogers-Yates',
 'Kramer-Klein',
 'Meyers-Johnson',
 'Hughes-Williams',
 'Jones, Williams and Villa',
 'Garcia PLC',
 'Gregory and Sons',
 'Clark, Garcia and Sosa',
 'Bush PLC',
 'Salazar-Meyers',
 'Parker, Murphy and Brooks',
 'Cruz-Brown',
 'Macdonald-Ferguson',
 'Williams, Peterson and Rojas',
 'Smith and Sons',
 'Moss, Duncan and Allen',
 'Gomez-Carroll',
 'Manning, Welch and Herring',
 'Lee, Gutierrez and Brown',
 'Davis, Serrano and Cook',
 'Smith LLC',
 'Thomas Group',
 'Silva-King',
 'Pierce-Long',
 'Walker-Simpson',
 'Cooper and Sons',
 'Donovan, Gonzalez and Figueroa',
 'Morgan, Butler and Bennett',
 'Snyder-Lee',
 'Harris PLC',
 'Washington PLC',
 'Brown, Price and Campbell',
 'Mcgee PLC',
 'Dixon Inc',
 'Thompson, Sheppard and Ward',
 'Adams-Brewer',
 'Schneider-Brady',
 'Gonzales-Frank',
 'Smith-Wong',
 'Pierce-Herrera',
 'Aguilar, Rivera and Quinn',
 'Lowe,

In [15]:
list3

['Stewartbury, AA',
 'Christopherville, AA',
 'Port Ericaburgh, AA',
 'East Seanview, AP',
 'North Jamieview, AP',
 'Davidville, AP',
 'South Christopher, AE',
 'Port Jonathan, AE',
 'Osbornetown, AE',
 'Scotttown, AP',
 'Ericberg, AE',
 'Ramireztown, AE',
 'Figueroaview, AA',
 'Kelseystad, AA',
 'Williamsburgh, AE',
 'Mitchellburgh, AE',
 'West Jessicabury, AA',
 'Maloneshire, AE',
 'Johnsonton, AA',
 'South Davidtown, AP',
 'Port Sara, AE',
 'Marktown, AA',
 'Laurenland, AE',
 'Lauraton, AP',
 'South Tammyberg, AP',
 'North Brandonville, AP',
 'Port Robertfurt, AA',
 'Burnettbury, AE',
 'Herbertside, AA',
 'Christopherport, AP',
 'West Victor, AE',
 'Port Aaron, AP',
 'Loribury, AA',
 'Angelastad, AP',
 'Larrytown, AE',
 'West Colin, AP',
 'West Stephanie, AP',
 'Laurentown, AP',
 'Wrightberg, AP',
 'Alberttown, AE',
 'Brockburgh, AE',
 'North Jason, AE',
 'Arnoldhaven, AE',
 'Lake Destiny, AP',
 'South Timothyburgh, AP',
 'New Jimmyton, AE',
 'New Lucasbury, AP',
 'Port Cory, AE',
 

In [16]:
list4

['2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-

2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.   
    a. First, use the BeautifulSoup find_all method to extract the urls.  

In [17]:
html_page = requests.get(URL)
html_page.text     # the body of the response, as a string

'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Fake Python</title>\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">\n  </head>\n  <body>\n  <section class="section">\n    <div class="container mb-5">\n      <h1 class="title is-1">\n        Fake Python\n      </h1>\n      <p class="subtitle is-3">\n        Fake Jobs for Your Web Scraping Journey\n      </p>\n    </div>\n    <div class="container">\n    <div id="ResultsContainer" class="columns is-multiline">\n    <div class="column is-half">\n<div class="card">\n  <div class="card-content">\n    <div class="media">\n      <div class="media-left">\n        <figure class="image is-48x48">\n          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">\n        </figure>\n      </div>\n      <div class="media-content"

In [18]:
html_page = requests.get(URL)
soup = BeautifulSoup(html_page.text, "html.parser")
soup

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Fake Python</title>
<link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
</head>
<body>
<section class="section">
<div class="container mb-5">
<h1 class="title is-1">
        Fake Python
      </h1>
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
</div>
<div class="container">
<div class="columns is-multiline" id="ResultsContainer">
<div class="column is-half">
<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>


In [19]:
import re

import requests
from bs4 import BeautifulSoup

URL = "https://realpython.github.io/fake-jobs/"

def getLinks(URL):
    """Return the href of every link that points to a job detail page."""
    html_page = requests.get(URL)
    soup = BeautifulSoup(html_page.text, "html.parser")
    links = []

    # Job detail pages live under /fake-jobs/jobs and end in .html
    pattern = re.compile(r'https://realpython\.github\.io/fake-jobs/jobs.+\.html')
    for link in soup.find_all('a', attrs={'href': pattern}):
        links.append(link.get('href'))

    return links

links = getLinks(URL)

print(links)

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html', 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html', 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html', 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html', 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html', 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html', 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html', 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html', 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html', 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html', 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html', 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html', 'https://realpython.github.io/fake-jobs/jobs/architect-12.html', 'https://realpython.github.io/fake-
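
The regular expression above filters links by their address. Another option is to filter by the link text, since the links of interest belong to the "Apply" button. The sketch below assumes each card footer holds a "Learn" link and an "Apply" link; the footer markup is hand-written for illustration rather than copied from the page.

```python
from bs4 import BeautifulSoup

# Hand-written footer that mimics the two buttons on one job card.
footer_html = """
<footer class="card-footer">
  <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" class="card-footer-item">Apply</a>
</footer>
"""
footer_soup = BeautifulSoup(footer_html, "html.parser")

# string="Apply" keeps only <a> tags whose text is exactly "Apply".
apply_links = [a.get("href") for a in footer_soup.find_all("a", string="Apply")]

print(apply_links)
```

Filtering on the button text does not depend on how the job URLs are built, while the regular expression does not depend on the wording of the button. Which is more robust depends on which part of the page is more likely to change.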

3. Finally, we want to get the job description text for each job.  
    a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.  

In [20]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime as dt
import pandas as pd

URL = "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")


# Show every <p> tag; the description is the second one (index 1).
print(soup.find_all('p'))

print('\n\n')

print(soup.find_all('p')[1].get_text().strip())



        

[<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>, <p>Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.</p>, <p id="location"><strong>Location:</strong> Stewartbury, AA</p>, <p id="date"><strong>Posted:</strong> 2021-04-08</p>]



Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational gr

In [21]:
# applylinks = [s for s in URLs if s != 'https://www.realpython.com']
# applylinks

    b. We want to be able to do this for all pages. Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.".  

In [22]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime as dt
import pandas as pd

URL = "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")


# Show every <p> tag; the description is the second one (index 1).
print(soup.find_all('p'))

print('\n\n')

print(soup.find_all('p')[1].get_text().strip()) # the answer that we want

[<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>, <p>At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.</p>, <p id="location"><strong>Location:</strong> Osbornetown, AE</p>, <p id="date"><strong>Posted:</strong> 2021-04-08</p>]



At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.


In [23]:
def get_description(URL):
    """Fetch a job page and return its description paragraph as a clean string."""
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")

    # The description is the second <p> tag on the page (index 1);
    # the first one is the site subtitle.
    return soup.find_all('p')[1].get_text().strip()
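
`get_description` relies on the description being the second `<p>` tag, which breaks if the page layout shifts. A possible variant, sketched below, separates parsing from fetching so the parsing step can be checked without a network call. It assumes the description is the first paragraph inside a `<div class="content">`; that selector, the sample HTML, and the names `parse_description` and `get_description_checked` are all hypothetical and not part of the exercise.

```python
import requests
from bs4 import BeautifulSoup


def parse_description(html):
    """Return the description paragraph from a job page's HTML, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the description is the first <p> inside <div class="content">.
    paragraph = soup.select_one("div.content p")
    return paragraph.get_text().strip() if paragraph is not None else None


def get_description_checked(url, timeout=10):
    """Hypothetical variant of get_description with basic error handling."""
    response = requests.get(url, timeout=timeout)
    # Raise on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return parse_description(response.text)


# Hand-written stand-in for a job detail page.
sample_html = """
<p class="subtitle is-3">Fake Jobs for Your Web Scraping Journey</p>
<div class="content">
  <p>At be than always different American address.</p>
  <p id="location"><strong>Location:</strong> Osbornetown, AE</p>
</div>
"""
sample_description = parse_description(sample_html)
print(sample_description)
```

Returning `None` for a page without a description lets a later `.apply` call finish and leaves a missing value in that row, instead of stopping on the first unexpected page.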

In [24]:
data = {'Title': list1,
        'Company': list2,
        'Location': list3,
        'Date': list4,
        'URL': links}

df = pd.DataFrame(data)
df

| | Title | Company | Location | Date | URL |
|---|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/se... |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/en... |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/le... |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/fi... |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/pr... |
| ... | ... | ... | ... | ... | ... |
| 95 | Museum/gallery exhibitions officer | Nguyen, Yoder and Petty | Lake Abigail, AE | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/mu... |
| 96 | Radiographer, diagnostic | Holder LLC | Jacobshire, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/ra... |
| 97 | Database administrator | Yates-Ferguson | Port Susan, AE | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/da... |
| 98 | Furniture designer | Ortega-Lawrence | North Tiffany, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/fu... |
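
Keeping four parallel lists works, but the lists can drift out of step if one lookup fails partway through the loop. An alternative is to build one dictionary per job and pass the list of dictionaries to `pd.DataFrame`. The sketch below uses two hand-written cards (an assumption about the markup) so that it runs without a network call.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Two hand-written cards that mimic the structure used in the loop above.
cards_html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">Stewartbury, AA</p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">Christopherville, AA</p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
"""
cards = BeautifulSoup(cards_html, "html.parser").find_all("div", class_="card-content")

# One dictionary per job keeps all fields of a row together.
records = [
    {
        "Title": card.find("h2", class_="title").text.strip(),
        "Company": card.find("h3", class_="company").text.strip(),
        "Location": card.find("p", class_="location").text.strip(),
        "Date": card.find("time").text.strip(),
    }
    for card in cards
]

jobs_df = pd.DataFrame(records)
print(jobs_df)
```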


In [25]:

df = pd.DataFrame(data, columns=['Title', 'Company', 'Location', 'Date', 'URL'])
df


| | Title | Company | Location | Date | URL |
|---|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/se... |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/en... |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/le... |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/fi... |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/pr... |
| ... | ... | ... | ... | ... | ... |
| 95 | Museum/gallery exhibitions officer | Nguyen, Yoder and Petty | Lake Abigail, AE | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/mu... |
| 96 | Radiographer, diagnostic | Holder LLC | Jacobshire, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/ra... |
| 97 | Database administrator | Yates-Ferguson | Port Susan, AE | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/da... |
| 98 | Furniture designer | Ortega-Lawrence | North Tiffany, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/fu... |


In [26]:
Description_Text_df = get_description('https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html')
Description_Text_df

'At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.'

In [27]:
Description_Text_df

'At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.'


    c. Use the [.apply method](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html) on the url column you created above to retrieve the description text for all of the jobs.

In [28]:
URL_series = df['URL'].squeeze()
URL_series

0     https://realpython.github.io/fake-jobs/jobs/se...
1     https://realpython.github.io/fake-jobs/jobs/en...
2     https://realpython.github.io/fake-jobs/jobs/le...
3     https://realpython.github.io/fake-jobs/jobs/fi...
4     https://realpython.github.io/fake-jobs/jobs/pr...
                            ...                        
95    https://realpython.github.io/fake-jobs/jobs/mu...
96    https://realpython.github.io/fake-jobs/jobs/ra...
97    https://realpython.github.io/fake-jobs/jobs/da...
98    https://realpython.github.io/fake-jobs/jobs/fu...
99    https://realpython.github.io/fake-jobs/jobs/sh...
Name: URL, Length: 100, dtype: object

In [29]:
# Series.apply calls get_description once for each URL and returns a Series of the results.

Description = URL_series.apply(get_description)
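
To see what `.apply` does without making a request for every row, the sketch below swaps in a stand-in function that looks descriptions up in a dictionary. The URLs, the texts, and the name `fake_get_description` are made up for illustration; the real `get_description` would be passed to `.apply` in the same way.

```python
import pandas as pd

# Hypothetical stand-in data so the sketch runs without network access.
fake_pages = {
    "https://example.com/jobs/a.html": "First description.",
    "https://example.com/jobs/b.html": "Second description.",
}


def fake_get_description(url):
    """Stand-in for get_description: look the text up instead of fetching it."""
    return fake_pages[url]


demo_df = pd.DataFrame({"URL": list(fake_pages)})

# .apply calls the function once per value and returns a Series with the same index,
# so the result can be assigned straight back as a new column.
demo_df["Description"] = demo_df["URL"].apply(fake_get_description)
print(demo_df)
```

Selecting a single column such as `df['URL']` already gives a Series, so `.apply` can be called on it directly. With the real function, each row triggers one HTTP request, so the call takes noticeably longer than the rest of the notebook.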

In [30]:
data = {'Title': list1,
        'Company': list2,
        'Location': list3,
        'Date': list4,
        'URL': links,
        'Description': Description}

df = pd.DataFrame(data)
df

| | Title | Company | Location | Date | URL | Description |
|---|---|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/se... | Professional asset web application environment... |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/en... | Party prevent live. Quickly candidate change a... |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/le... | Administration even relate head color. Staff b... |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/fi... | Tv program actually race tonight themselves tr... |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/pr... | Traditional page a although for study anyone. ... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | Museum/gallery exhibitions officer | Nguyen, Yoder and Petty | Lake Abigail, AE | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/mu... | Paper age physical current note. There reality... |
| 96 | Radiographer, diagnostic | Holder LLC | Jacobshire, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/ra... | Able such right culture. Wrong pick structure ... |
| 97 | Database administrator | Yates-Ferguson | Port Susan, AE | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/da... | Create day party decade high clear. Past trade... |
| 98 | Furniture designer | Ortega-Lawrence | North Tiffany, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/fu... | Pressure under rock next week. Recognize so re... |


The `.find` method finds the first matching tag.

We can find _all_ elements with a particular tag using the `.findAll(<tag>)` method. Say we want to find all images. We'll look for the `img` tag.

In [31]:
images = soup.findAll('img')
images

[]

Let's look closer at the first image.

In [32]:
first_image = images[0]
first_image

IndexError: list index out of range

You can access attributes of a Tag object in the same way that you would access values from a dictionary.

In [None]:
first_image['src']

'https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1'

You can also safely access attributes using `.get`, which returns `None` instead of raising a `KeyError` when the attribute is missing. This is useful when you aren't sure whether every tag has a certain attribute.

In [None]:
# Non-safe
first_image['class']

KeyError: 'class'

In [None]:
# Safe
first_image.get('class')

['mw-logo-icon']

You can also specify a default value when using `get`.

In [None]:
first_image.get('class', default = 'No Class')

['mw-logo-icon']
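
The difference between the two access styles is easiest to see on a tag that lacks the attribute. The sketch below uses two hand-written `<img>` tags (not taken from any real page): one with a `class` attribute and one without.

```python
from bs4 import BeautifulSoup

# Two hand-written image tags: the first has a class attribute, the second does not.
img_soup = BeautifulSoup(
    '<img alt="Real Python Logo" class="logo small" src="logo.jpg"/>'
    '<img src="plain.jpg"/>',
    "html.parser",
)
first, second = img_soup.find_all("img")

# Dictionary-style access works when the attribute is present.
src = first["src"]

# class can hold several values, so BeautifulSoup returns it as a list.
classes = first.get("class")

# .get with a default never raises, even when the attribute is missing.
fallback = second.get("class", "No Class")

# Dictionary-style access raises KeyError for a missing attribute.
try:
    second["class"]
    raised = False
except KeyError:
    raised = True

print(src, classes, fallback, raised)
```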

If you want to grab a particular attribute for all images, an easy way to do so is with a list comprehension.

In [None]:
image_srcs = [x.get('src') for x in images]

In [None]:
image_srcs

['/static/images/icons/wikipedia.png',
 '/static/images/mobile/copyright/wikipedia-wordmark-en.svg',
 '/static/images/mobile/copyright/wikipedia-tagline-en.svg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg/80px-Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Marvin_Minsky_at_OLPCc.jpg/80px-Marvin_Minsky_at_OLPCc.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/4/49/John_McCarthy_Stanford.jpg/80px-John_McCarthy_Stanford.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Edsger_Wybe_Dijkstra.jpg/80px-Edsger_Wybe_Dijkstra.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Charles_Bachman_2012.jpg/80px-Charles_Bachman_2012.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/4/4f/KnuthAtOpenContentAlliance.jpg/80px-KnuthAtOpenContentAlliance.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Herbert_A._Simon_and_Allen_Newell_Chess_Match_
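
Several of these `src` values are relative (`/static/...`) or protocol-relative (`//upload...`), so they cannot be requested as they stand. A minimal sketch of resolving them with the standard library's `urljoin` follows; the page address used as the base is an assumption, since the notebook does not show which URL this soup came from.

```python
from urllib.parse import urljoin

# Assumed address of the page the images were scraped from.
page_url = "https://en.wikipedia.org/wiki/Turing_Award"

raw_srcs = [
    "/static/images/icons/wikipedia.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Edsger_Wybe_Dijkstra.jpg/80px-Edsger_Wybe_Dijkstra.jpg",
]

# urljoin fills in whatever the src is missing: scheme and host for a path,
# or only the scheme for a protocol-relative URL.
absolute_srcs = [urljoin(page_url, src) for src in raw_srcs]

for url in absolute_srcs:
    print(url)
```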

We can further navigate the html tree to extract out other bits of information.

When scraping from a web page, you should make use of "View Page Source" and/or "Inspect Element" in your web browser.

For example, let's say we want to look at the second header on the page.

In [None]:
soup.findAll('header')[1]

<header class="mw-body-header vector-page-titlebar">
<label aria-controls="vector-toc" class="cdx-button cdx-button--fake-button cdx-button--fake-button--enabled cdx-button--weight-quiet vector-button-flush-left cdx-button--icon-only" for="vector-toc-collapsed-checkbox" id="vector-toc-collapsed-button" role="button" tabindex="0" title="Table of Contents">
<span class="vector-icon mw-ui-icon-wikimedia-listBullet"></span>
<span>Toggle the table of contents</span>
</label>
<nav aria-label="Contents" class="vector-toc-landmark" role="navigation">
<div class="vector-dropdown vector-page-titlebar-toc vector-button-flush-left" id="vector-page-titlebar-toc">
<input aria-haspopup="true" aria-label="Toggle the table of contents" class="vector-dropdown-checkbox" data-event-name="ui.dropdown-vector-page-titlebar-toc" id="vector-page-titlebar-toc-checkbox" role="button" type="checkbox"/>
<label aria-hidden="true" class="vector-dropdown-label cdx-button cdx-button--fake-button cdx-button--fake-butto

Just as we can call `find` and `findAll` (the legacy camelCase name for `find_all`) on the full soup, we can call them on an individual Tag to search only within that element.

In [None]:
soup.findAll('header')[1].find('h1').get('id')

'firstHeading'

In [None]:
soup.findAll('header')[1].find('h1').text

'Turing Award'
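
Because `find` returns `None` when nothing matches, chaining calls as above fails with an `AttributeError` if any step comes up empty. A slightly more defensive sketch stores the intermediate Tag and checks it before going further. The HTML here is a made-up stand-in for the page's headers, not the real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two headers, only the second contains an <h1>.
html = """
<header><p>Site banner</p></header>
<header><h1 id="firstHeading">Turing Award</h1></header>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

headers = demo_soup.find_all('header')
second_header = headers[1]

# Searching from a Tag only looks at that Tag's descendants.
heading = second_header.find('h1')
heading_id = heading.get('id') if heading is not None else None
heading_text = heading.get_text(strip=True) if heading is not None else None

# The first header has no <h1>, so find returns None rather than raising.
missing = headers[0].find('h1')

print(heading_id, heading_text, missing)
```

Keeping `second_header` in a variable also avoids repeating the `findAll('header')[1]` lookup for every attribute you want to read.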

Now, let's look for the table containing the Turing Award winners.

Using `.findAll` reveals that there are multiple tables on the page.

In [None]:
soup.findAll('table')

[<table class="infobox vevent"><tbody><tr><th class="infobox-above summary" colspan="2">ACM Turing Award</th></tr><tr><th class="infobox-label" scope="row" style="width: 33%;">Awarded for</th><td class="infobox-data">Outstanding contributions in <a href="/wiki/Computer_science" title="Computer science">computer science</a></td></tr><tr><th class="infobox-label" scope="row" style="width: 33%;">Country</th><td class="infobox-data location">United States</td></tr><tr><th class="infobox-label" scope="row" style="width: 33%;">Presented by</th><td class="infobox-data attendee"><a href="/wiki/Association_for_Computing_Machinery" title="Association for Computing Machinery">Association for Computing Machinery</a> (ACM)</td></tr><tr><th class="infobox-label" scope="row" style="width: 33%;">Reward(s)</th><td class="infobox-data">US $1,000,000<sup class="reference" id="cite_ref-million_1-0"><a href="#cite_note-million-1">[1]</a></sup></td></tr><tr><th class="infobox-label" scope="row" style="width

If we know a bit more about what we are looking for, we can pass a dictionary of attribute values as the `attrs` argument.

Go to the Turing Award page in your browser, right-click the top of the table, and choose "Inspect". You will notice that the table's opening tag includes the class `wikitable` (it appears as `<table class="wikitable sortable">`). Armed with this information, we can narrow down our search.

In [None]:
soup.find('table', attrs={'class' : 'wikitable'})

<table class="wikitable sortable">
<tbody><tr bgcolor="#ccccc">
<th>Year
</th>
<th>Recipient(s)
</th>
<th>Photo
</th>
<th>Rationale
</th>
<th>Affiliated institute(s)
</th></tr>
<tr>
<td>1966
</td>
<td><a href="/wiki/Alan_Perlis" title="Alan Perlis">Alan Perlis</a>
</td>
<td>
</td>
<td>For his influence in the area of advanced <a href="/wiki/Computer_programming" title="Computer programming">computer programming</a> techniques and <a href="/wiki/Compiler" title="Compiler">compiler</a> construction.<sup class="reference" id="cite_ref-10"><a href="#cite_note-10">[10]</a></sup>
</td>
<td><a href="/wiki/Carnegie_Mellon_University" title="Carnegie Mellon University">Carnegie Mellon University</a>
</td></tr>
<tr>
<td>1967
</td>
<td><a href="/wiki/Maurice_Wilkes" title="Maurice Wilkes">Maurice Wilkes</a>
</td>
<td><span typeof="mw:File"><a class="mw-file-description" href="/wiki/File:Maurice_Vincent_Wilkes_1980_(3,_cropped).jpg"><img class="mw-file-element" data-file-height="463" data-file-wid
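
Notice that the search for `wikitable` matched a table whose class attribute is `wikitable sortable`. BeautifulSoup treats `class` as a multi-valued attribute and matches when any one of the classes equals the value you pass. The sketch below, on a made-up two-table document, shows three equivalent ways to express the same search: the `attrs` dictionary, the `class_` keyword, and a CSS selector via `select_one`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two tables, loosely modeled on the page.
html = """
<table class="infobox vevent"><tr><td>summary</td></tr></table>
<table class="wikitable sortable"><tr><td>1966</td></tr></table>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

by_attrs = demo_soup.find('table', attrs={'class': 'wikitable'})
# class is a reserved word in Python, hence the trailing underscore.
by_keyword = demo_soup.find('table', class_='wikitable')
by_selector = demo_soup.select_one('table.wikitable')

# All three locate the same element in the tree.
same_table = by_attrs is by_keyword is by_selector

# Listing each table's classes is a quick way to decide which one to target.
table_classes = [t.get('class') for t in demo_soup.find_all('table')]

print(same_table, by_attrs.get('class'), table_classes)
```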

We can render the table in the notebook with IPython's `HTML` display class.

In [None]:
from IPython.display import HTML

table_html = str(soup.find('table', attrs={'class' : 'wikitable'}))

HTML(table_html)

Year,Recipient(s),Photo,Rationale,Affiliated institute(s)
1966,Alan Perlis,,For his influence in the area of advanced computer programming techniques and compiler construction.[10],Carnegie Mellon University
1967,Maurice Wilkes,,"Wilkes is best known as the builder and designer of the EDSAC, the second computer with an internally stored program. Built in 1949, the EDSAC used a mercury delay line memory. He is also known as the author, with Wheeler and Gill, of a volume on ""Preparation of Programs for Electronic Digital Computers"" in 1951, in which program libraries were effectively introduced.[11]",University of Cambridge
1968,Richard Hamming,,"For his work on numerical methods, automatic coding systems, and error-detecting and error-correcting codes.[12]",Bell Labs
1969,Marvin Minsky,,"For his central role in creating, shaping, promoting, and advancing the field of artificial intelligence.[13]",Massachusetts Institute of Technology
1970,James H. Wilkinson,,"For his research in numerical analysis to facilitate the use of the high-speed digital computer, having received special recognition for his work in computations in linear algebra and ""backward"" error analysis.[14]",National Physical Laboratory
1971,John McCarthy,,"McCarthy's lecture ""The Present State of Research on Artificial Intelligence"" is a topic that covers the area in which he has achieved considerable recognition for his work.[15]",Stanford University
1972,Edsger W. Dijkstra,,"Edsger Dijkstra was a principal contributor in the late 1950s to the development of the ALGOL, a high level programming language which has become a model of clarity and mathematical rigor. He is one of the principal proponents of the science and art of programming languages in general, and has greatly contributed to our understanding of their structure, representation, and implementation. His fifteen years of publications extend from theoretical articles on graph theory to basic manuals, expository texts, and philosophical contemplations in the field of programming languages.[16]","Centrum Wiskunde & Informatica, Eindhoven University of Technology, University of Texas at Austin"
1973,Charles Bachman,,For his outstanding contributions to database technology.[17],"General Electric Research Laboratory (now under Groupe Bull, an Atos company)"
1974,Donald Knuth,,"For his major contributions to the analysis of algorithms and the design of programming languages, and in particular for his contributions to ""The Art of Computer Programming"" through his well-known books in a continuous series by this title.[18]","California Institute of Technology, Center for Communications Research, Center for Communications and Computing, Institute for Defense Analyses, Stanford University"
1975,Allen Newell,,"In joint scientific efforts extending over twenty years, initially in collaboration with J. C. Shaw at the RAND Corporation, and subsequently with numerous faculty and student colleagues at Carnegie Mellon University, they have made basic contributions to artificial intelligence, the psychology of human cognition, and list processing.[19]","RAND Corporation, Carnegie Mellon University"


However, this does not give us a way to work with the data in the table, only to display it.

If we want to work with the table as data, we can use the _pandas_ `read_html` function, which returns a list of DataFrames, one for each table it finds.

In [None]:
import pandas as pd

In [None]:
pd.read_html(str(soup.find('table', attrs={'class' : 'wikitable'})))[0]

Unnamed: 0,Year,Recipient(s),Photo,Rationale,Affiliated institute(s)
0,1966,Alan Perlis,,For his influence in the area of advanced comp...,Carnegie Mellon University
1,1967,Maurice Wilkes,,Wilkes is best known as the builder and design...,University of Cambridge
2,1968,Richard Hamming,,"For his work on numerical methods, automatic c...",Bell Labs
3,1969,Marvin Minsky,,"For his central role in creating, shaping, pro...",Massachusetts Institute of Technology
4,1970,James H. Wilkinson,,For his research in numerical analysis to faci...,National Physical Laboratory
...,...,...,...,...,...
71,2019,Pat Hanrahan,,For fundamental contributions to 3-D computer ...,"Pixar, Princeton University, Stanford University"
72,2020,Alfred Aho,,For fundamental algorithms and theory underlyi...,"Bell Labs, Columbia University"
73,2020,Jeffrey Ullman,,For fundamental algorithms and theory underlyi...,"Bell Labs, Princeton University, Stanford Univ..."
74,2021,Jack Dongarra,,For pioneering contributions to numerical algo...,"Argonne National Laboratory, Oak Ridge Nationa..."
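
In recent pandas releases (2.1 and later), passing a literal HTML string to `read_html` emits a `FutureWarning`; wrapping the string in `io.StringIO` avoids it. The following is a minimal sketch on a made-up two-row table rather than the real page, and it assumes an HTML parser that pandas can use (such as `lxml`) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up table in the same shape as the one above.
table_html = """
<table class="wikitable">
  <tr><th>Year</th><th>Recipient(s)</th></tr>
  <tr><td>1966</td><td>Alan Perlis</td></tr>
  <tr><td>1967</td><td>Maurice Wilkes</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> found.
tables = pd.read_html(StringIO(table_html))
winners = tables[0]

# From here the usual DataFrame operations apply.
early = winners[winners['Year'] < 1967]

print(winners.shape)
print(early)
```

The same wrapping applies to the real table: pass `StringIO(str(...))` around the Tag returned by `soup.find`.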
