# Part 3: Parse HTML Code With Beautiful Soup

- Find Elements by ID
- Find Elements by HTML Class Name
- Extract Text From HTML Elements
- Extract Attributes From HTML Elements

In [1]:
# scrape the site
import requests

url = "https://www.indeed.com/jobs?q=python&l=new+york"
response = requests.get(url)

After fetching the HTML content, the next step is to parse it and pick out the information you need.

In [2]:
from bs4 import BeautifulSoup

In [3]:
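# name the parser explicitly so the result doesn't depend on which parsers are installed
soup = BeautifulSoup(response.content, "html.parser")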

In [4]:
soup

<!DOCTYPE html>

<html dir="ltr" lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<script src="//d3fw5vlhllyvee.cloudfront.net/s/e2eb3f2/en_US.js" type="text/javascript"></script>
<link href="//d3fw5vlhllyvee.cloudfront.net/s/bbb0362/jobsearch_all.css" rel="stylesheet" type="text/css"/>
<link href="http://rss.indeed.com/rss?q=python&amp;l=new+york" rel="alternate" title="Python Jobs, Employment in New York State" type="application/rss+xml"/>
<link href="/m/jobs?q=python&amp;l=new+york" media="only screen and (max-width: 640px)" rel="alternate"/>
<link href="/m/jobs?q=python&amp;l=new+york" media="handheld" rel="alternate"/>
<script type="text/javascript">

if (typeof window['closureReadyCallbacks'] == 'undefined') {
window['closureReadyCallbacks'] = [];
}

function call_when_jsall_loaded(cb) {
if (window['closureReady']) {
cb();
} else {
window['closureReadyCallbacks'].push(cb);
}
}
</script>
<meta content="1" name="ppstriptst"/>
<script>
var _scrip

What a soup!!! 🍜 Let's be picky and thin it out.

## Find Elements by ID

An `id` attribute uniquely identifies an HTML element on a page. Let's use the browser's Developer Tools to find the `id` of the element that holds the job results!
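Here is a minimal sketch of an ID lookup on a tiny, made-up HTML snippet (the markup and the `auxCol` and `doesNotExist` IDs are assumptions for illustration, not the real page). It also shows that `find()` returns `None` when nothing matches, which is worth checking before you drill down further.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the real page
html = """
<table><tr>
  <td id="resultsCol"><div class="result">Job A</div></td>
  <td id="auxCol">Sidebar</td>
</tr></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

results_demo = soup_demo.find(id="resultsCol")   # the matching <td> element
missing = soup_demo.find(id="doesNotExist")      # no match, so this is None

if results_demo is None:
    print("Container not found; the page layout may have changed.")
else:
    print(results_demo["id"])
```

If the site changes its markup, the `None` check gives you a clear message instead of an `AttributeError` a few lines later.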

In [5]:
results = soup.find(id='resultsCol')

In [6]:
results

<td id="resultsCol">
<div id="resultsColTopSpace"></div>
<div class="messageContainer">
<script type="text/javascript">
      function setRefineByCookie(refineByTypes) {
        var expires = new Date();
        expires.setTime(expires.getTime() + (10 * 1000));
        for (var i = 0; i < refineByTypes.length; i++) {
          setCookie(refineByTypes[i], "1", expires);
        }
      }
    </script>
</div>
<style type="text/css">
    #increased_radius_result {
        font-size: 16px;
        font-style: italic;
    }
    #original_radius_result{
        font-size: 13px;
        font-style: italic;
        color: #666666;
    }
</style>
<div class="resultsTop"><div class="mosaic-zone" id="mosaic-zone-aboveJobCards"></div><script type="text/javascript">
                try {
                    window.mosaic.onMosaicApiReady(function() {
                        var zoneId = 'aboveJobCards';
                        var providers = window.mosaic.zonedProviders[zoneId];

                 

Better, but let's drill down some more.

## Find Elements by HTML Class Name

The job postings all share the same HTML `class`, `result`. Let's find all of them on this page.
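As a small sketch of how class matching behaves, the made-up snippet below (not the real page) shows that `class_='result'` matches any `div` whose class list contains `result`, even alongside other classes, but not a class that merely starts with the same letters. The CSS selector form via `select()` is shown as an alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the job cards
html = """
<div class="jobsearch-SerpJobCard unifiedRow row result">First</div>
<div class="row result">Second</div>
<div class="resultsTop">Not a job card</div>
<span class="result">Not a div</span>
"""
demo = BeautifulSoup(html, "html.parser")

# Matches whole class names, so "resultsTop" is not a hit
cards = demo.find_all("div", class_="result")

# Equivalent lookup written as a CSS selector
css_cards = demo.select("div.result")

print([card.text for card in cards])
```

Note the trailing underscore in `class_`: `class` is a reserved word in Python, so Beautiful Soup uses `class_` as the keyword argument.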

In [7]:
jobs = results.find_all('div', class_='result')

In [8]:
len(jobs)  # how many?

10

In [9]:
jobs[0]  # let's check out just one of them

<div class="jobsearch-SerpJobCard unifiedRow row result" data-jk="487b30db63184515" data-tn-component="organicJob" id="p_487b30db63184515">
<h2 class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=487b30db63184515&amp;fccid=bf0600f0f252b45b&amp;vjs=3" id="jl_487b30db63184515" onclick="setRefineByCookie([]); return rclk(this,jobmap[0],true,0);" onmousedown="return rclk(this,jobmap[0],0);" rel="noopener nofollow" target="_blank" title="Penetration Testing Trainee (Remote USA)">
Penetration Testing Trainee (Remote USA)</a>
</h2>
<div class="sjcl">
<div>
<span class="company">
BreachLock</span>
</div>
<div class="recJobLoc" data-rc-loc="Florida, NY" id="recJobLoc_487b30db63184515" style="display: none"></div>
<span class="location accessible-contrast-color-location">Florida, NY</span>
<span class="remote-bullet">•</span>
<span class="remote">Remote work available</span>
</div>
<div class="summary">
<ul style="list-style-type:circle;margin-top: 0px;ma

## Extract Text From HTML Elements

Next, let's target a specific piece of text on the site and extract it from the surrounding HTML.

In [10]:
title = jobs[0].find('h2')
title

<h2 class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=487b30db63184515&amp;fccid=bf0600f0f252b45b&amp;vjs=3" id="jl_487b30db63184515" onclick="setRefineByCookie([]); return rclk(this,jobmap[0],true,0);" onmousedown="return rclk(this,jobmap[0],0);" rel="noopener nofollow" target="_blank" title="Penetration Testing Trainee (Remote USA)">
Penetration Testing Trainee (Remote USA)</a>
</h2>

In [11]:
title_link = title.find('a')
title_link

<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=487b30db63184515&amp;fccid=bf0600f0f252b45b&amp;vjs=3" id="jl_487b30db63184515" onclick="setRefineByCookie([]); return rclk(this,jobmap[0],true,0);" onmousedown="return rclk(this,jobmap[0],0);" rel="noopener nofollow" target="_blank" title="Penetration Testing Trainee (Remote USA)">
Penetration Testing Trainee (Remote USA)</a>

In [12]:
link_text = title_link.text
link_text

'\nPenetration Testing Trainee (Remote USA)'

In [13]:
# clean it up
link_text.strip()

'Penetration Testing Trainee (Remote USA)'
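As an aside, Beautiful Soup can do the stripping for you. The sketch below uses a shortened, made-up copy of the link (the `jk=123` value is an assumption) to compare `.text` with `get_text(strip=True)`.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the title link
html = '<h2 class="title"><a href="/rc/clk?jk=123">\nPenetration Testing Trainee (Remote USA)</a></h2>'
link = BeautifulSoup(html, "html.parser").find("a")

raw = link.text                      # keeps the leading newline
clean = link.get_text(strip=True)    # whitespace stripped by Beautiful Soup

print(repr(raw))
print(repr(clean))
```

For a single text node like this one, `get_text(strip=True)` gives the same result as `.text.strip()`.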

And now for all jobs, in a concise list comprehension:

In [14]:
job_titles = [job.find('h2').find('a').text.strip() for job in jobs]

In [15]:
job_titles

['Penetration Testing Trainee (Remote USA)',
 'Data Engineer Summer Internship (REMOTE)',
 'Python & JavaScript Developer',
 'Alternative Data Research Analyst',
 'Data Technician (Full- or Part-Time)',
 'Python Developer - Compliance',
 'Content Contributor: Deep Learning with TensorFlow',
 'Subject Matter Expert: Deep Learning with TensorFlow',
 'Junior Front End / Full Stack Software Engineer',
 '2020 Enterprise Data Accelerated Talent Entry Program']
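The list comprehension above assumes every job card contains an `h2` with a link inside. If one card is built differently, `find()` returns `None` and the chained call raises an `AttributeError`. A more defensive variant might look like this sketch, where `extract_title` is a hypothetical helper and the HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical cards: one complete, two with missing pieces
html = """
<div class="result"><h2 class="title"><a href="/a">
Python Developer</a></h2></div>
<div class="result"><h2 class="title">No link here</h2></div>
<div class="result"><span>No heading at all</span></div>
"""
cards = BeautifulSoup(html, "html.parser").find_all("div", class_="result")


def extract_title(card):
    """Return the stripped title text of a job card, or None if the markup differs."""
    heading = card.find("h2")
    link = heading.find("a") if heading is not None else None
    return link.get_text(strip=True) if link is not None else None


titles = [extract_title(card) for card in cards]
found = [title for title in titles if title is not None]

print(titles)
print(found)
```

Whether to skip incomplete cards or to fail loudly depends on what you plan to do with the data; this sketch simply skips them.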

## Extract Attributes From HTML Elements

Apart from text content, HTML attributes can contain important information you want to parse, for example the URL that a link points to. Let's learn how to extract them.

In [16]:
title_link

<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=487b30db63184515&amp;fccid=bf0600f0f252b45b&amp;vjs=3" id="jl_487b30db63184515" onclick="setRefineByCookie([]); return rclk(this,jobmap[0],true,0);" onmousedown="return rclk(this,jobmap[0],0);" rel="noopener nofollow" target="_blank" title="Penetration Testing Trainee (Remote USA)">
Penetration Testing Trainee (Remote USA)</a>

In [17]:
title_link['href']

'/rc/clk?jk=487b30db63184515&fccid=bf0600f0f252b45b&vjs=3'
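Square-bracket access works like a dictionary lookup, so it raises a `KeyError` when the attribute is missing. The sketch below, on a shortened and made-up copy of the link, shows the more forgiving `.get()` and how multi-valued attributes such as `class` come back as a list.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the title link
html = '<a class="jobtitle turnstileLink" href="/rc/clk?jk=123&amp;vjs=3" target="_blank">Title</a>'
link = BeautifulSoup(html, "html.parser").find("a")

href = link["href"]                  # "&amp;" is decoded to "&"
missing = link.get("data-missing")   # None instead of a KeyError
classes = link["class"]              # multi-valued attribute, returned as a list

print(href)
print(missing)
print(classes)
```

All attributes of an element are also available at once as a dictionary through `link.attrs`.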

That's a **relative link**. To access the resource, you need to assemble the absolute URL.

In [18]:
base_url = "https://www.indeed.com"
job_url = base_url + title_link['href']
job_url

'https://www.indeed.com/rc/clk?jk=487b30db63184515&fccid=bf0600f0f252b45b&vjs=3'
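String concatenation works here because the link starts with a slash. As an alternative sketch, `urljoin` from the standard library resolves a link against the URL of the page it came from, and it also copes with links that are already absolute. The `example.com` address is a placeholder used only for illustration.

```python
from urllib.parse import urljoin

# URL of the page that was scraped, and the relative link found on it
page_url = "https://www.indeed.com/jobs?q=python&l=new+york"
href = "/rc/clk?jk=487b30db63184515&fccid=bf0600f0f252b45b&vjs=3"

job_url = urljoin(page_url, href)
print(job_url)

# An already absolute link is returned unchanged (placeholder URL)
external = urljoin(page_url, "https://example.com/careers")
print(external)
```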

With this, you can now access the specific job posting, for example by using `requests` again:

In [19]:
job_site = requests.get(job_url)
job_soup = BeautifulSoup(job_site.content, "html.parser")

In [20]:
job_soup.text

'\n\nPenetration Testing Trainee (Remote USA) - Florida, NY - Indeed.com\n\n\n\n\n\n\n\n\n\n\n\nFind jobsCompany reviewsFind salariesUpload your resumeSign inEmployers / Post Job\n\nWhatWhereFind JobsAdvanced Job SearchPenetration Testing Trainee (Remote USA)BreachLock-Florida, NYRemoteOtherWho are we?\nBreachLock is a security startup that offers a unique SaaS platform delivering on-demand, continuous, and scalable security testing suitable for modern cloud and DevOps powered businesses. The BreachLock platform leverages both human-powered penetration testing and AI-powered automated scans to create a powerful and easy to use solution that delivers continuous and on-demand vulnerability management. BreachLocks’s modern SaaS-based approach redefines the old school and time-consuming pen test model into fast and comprehensive security as service. As a result, CIO’s and CISO’s get a single pane view into their application and network security posture. The BreachLock platform facilitates 

You could set up a pipeline that follows each posting's detail link and fetches the full job description from there. You could also define key phrases by which to highlight or discard listings.
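One possible shape for the key-phrase step is sketched below. The function name `classify_posting`, the phrases, and the sample descriptions are all made up for illustration; in a real pipeline the text would come from something like `job_soup.text`.

```python
def classify_posting(text, highlight=(), discard=()):
    """Label a posting as 'discard', 'highlight', or 'keep' based on key phrases.

    Matching is case-insensitive; discard phrases take priority.
    """
    lowered = text.lower()
    if any(phrase.lower() in lowered for phrase in discard):
        return "discard"
    if any(phrase.lower() in lowered for phrase in highlight):
        return "highlight"
    return "keep"


# Made-up descriptions standing in for fetched job pages
postings = {
    "Job A": "Remote role. You will write Python scripts for security testing.",
    "Job B": "Unpaid internship, on-site only.",
    "Job C": "Front end position using JavaScript.",
}

labels = {
    title: classify_posting(text, highlight=["python", "remote"], discard=["unpaid"])
    for title, text in postings.items()
}

print(labels)
```

Plain substring matching is crude (for example, "remote" also matches "remotely"), so treat this as a starting point rather than a finished filter.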

There's a lot you can do to customize this automated job search script to your own specific interests.