> In this notebook, I will explore the basics of web scraping.
1. I will use BeautifulSoup to parse HTML.
2. I will locate elements with the find() and find_all() methods.
3. I will find elements by tag or by attribute.
4. I will retrieve attribute values, such as links.

> This is a work-along from a PyCon 2020 workshop: https://www.youtube.com/watch?v=RUQWPJ1T6Zc&t=12s
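
As a rough preview of the four steps listed above, here is a minimal sketch. The `snippet` string and all variable names are made up for illustration, and `"html.parser"` is my assumption for the parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this preview
snippet = """
<div id="links">
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
</div>
"""

# 1. Parse the HTML (the parser choice is an assumption)
preview = BeautifulSoup(snippet, "html.parser")

# 2. find() returns the first match, find_all() returns every match
first_link = preview.find("a")
all_links = preview.find_all("a")

# 3. Elements can be located by tag name or by attribute
links_div = preview.find(id="links")

# 4. Attribute values such as href are read with dictionary-style access
hrefs = [a["href"] for a in all_links]
print(first_link.text, hrefs)
```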

In [42]:
simple_html = """
<html>

<body>
  <h1>Today's Workshop</h1>
  <div id='agenda' style="background-color: aliceblue">
    <h2>Agenda</h2>
    <p>Today's workshop is comprised of three main sections:</p>
    <ol>
      <li>HTML Basics</li>
      <li>Scraping Basics</li>
      <li>Scraping Pipeline</li>
    </ol>
  </div>
  
  <div id='tools' style='background-color: honeydew'>
    <h2>Tools</h2>
    <p>You will be learning about two primary Python libraries:</p>  
    <ol>
      <li>BeautifulSoup</li>
      <li>requests</li>
    </ol>
  </div>
</body>

</html>
"""

In [43]:
# Tell IPython to render the string as HTML
from IPython.display import display, HTML
display(HTML(simple_html))

In [44]:
from bs4 import BeautifulSoup as bs

In [45]:
# Name the parser explicitly so bs4 does not have to guess one
soup = bs(simple_html, 'html.parser')

In [46]:
soup


<html>
<body>
<h1>Today's Workshop</h1>
<div id="agenda" style="background-color: aliceblue">
<h2>Agenda</h2>
<p>Today's workshop is comprised of three main sections:</p>
<ol>
<li>HTML Basics</li>
<li>Scraping Basics</li>
<li>Scraping Pipeline</li>
</ol>
</div>
<div id="tools" style="background-color: honeydew">
<h2>Tools</h2>
<p>You will be learning about two primary Python libraries:</p>
<ol>
<li>BeautifulSoup</li>
<li>requests</li>
</ol>
</div>
</body>
</html>

In [47]:
type(soup)

bs4.BeautifulSoup

In [48]:
soup.find('h1')

<h1>Today's Workshop</h1>

In [49]:
soup.find('h1').text

"Today's Workshop"

In [50]:
soup.find('li')

<li>HTML Basics</li>

In [51]:
soup.find_all('li')

[<li>HTML Basics</li>,
 <li>Scraping Basics</li>,
 <li>Scraping Pipeline</li>,
 <li>BeautifulSoup</li>,
 <li>requests</li>]

In [52]:
type(soup.find_all('li'))

bs4.element.ResultSet
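
A `ResultSet` can be handled much like an ordinary Python list. The sketch below illustrates that on a shortened, hypothetical copy of the list markup; the names `list_html` and `items` are mine.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one list from the workshop HTML
list_html = "<ol><li>HTML Basics</li><li>Scraping Basics</li><li>Scraping Pipeline</li></ol>"
items = BeautifulSoup(list_html, "html.parser").find_all("li")

print(len(items))               # number of matches
print(items[0].text)            # index like a list
print(items[-1].text)           # negative indices work too
print(isinstance(items, list))  # ResultSet is a list subclass
```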

In [53]:
for item in soup.find_all('li'):
    print(item.text)

HTML Basics
Scraping Basics
Scraping Pipeline
BeautifulSoup
requests


In [54]:
learning_objectives = [item.text for item in soup.find_all('li')]

learning_objectives

['HTML Basics',
 'Scraping Basics',
 'Scraping Pipeline',
 'BeautifulSoup',
 'requests']

In [55]:
soup.find('p')

<p>Today's workshop is comprised of three main sections:</p>

In [56]:
for item in soup.find_all('ol'):
    print(item.text)


HTML Basics
Scraping Basics
Scraping Pipeline


BeautifulSoup
requests
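
The blank lines in the output above come from the whitespace between the `<li>` tags. One way to clean this up is `get_text()` with `strip=True`, or the `stripped_strings` generator. This is a small sketch on a hypothetical `ol_html` string, not the workshop's own code.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one <ol> from the workshop HTML
ol_html = """
<ol>
  <li>HTML Basics</li>
  <li>Scraping Basics</li>
</ol>
"""
ol = BeautifulSoup(ol_html, "html.parser").find("ol")

# strip=True trims each text fragment and drops whitespace-only ones
joined = ol.get_text(separator=", ", strip=True)

# stripped_strings yields the same cleaned fragments one at a time
fragments = list(ol.stripped_strings)

print(joined)
print(fragments)
```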



In [58]:
soup.find_all('p')

[<p>Today's workshop is comprised of three main sections:</p>,
 <p>You will be learning about two primary Python libraries:</p>]

In [59]:
workshop_html = """
<html>

<body>
  <h1>Today's Workshop</h1>
  <div id='agenda' style="background-color: aliceblue">
    <h2>Agenda</h2>
    <p>Today's workshop is comprised of three main sections:</p>
    <ol>
      <li>HTML Basics</li>
      <li>Scraping Basics</li>
      <li>Scraping Pipeline</li>
    </ol>
  </div>
  
  <div id='tools' style='background-color: honeydew'>
    <h2>Tools</h2>
    <p>You will be learning about two primary Python libraries:</p>  
    <ol>
      <li>BeautifulSoup</li>
      <li>requests</li>
    </ol>
  </div>
</body>

</html>
"""

In [60]:
# Name the parser explicitly so bs4 does not have to guess one
soup = bs(workshop_html, 'html.parser')

In [61]:
type(soup)

bs4.BeautifulSoup

In [62]:
soup.find('h1')

<h1>Today's Workshop</h1>

In [64]:
header = soup.find('h1').text
header

"Today's Workshop"

In [65]:
soup.find_all('p')

[<p>Today's workshop is comprised of three main sections:</p>,
 <p>You will be learning about two primary Python libraries:</p>]

In [66]:
paragraph = [item.text for item in soup.find_all('p')]
paragraph

["Today's workshop is comprised of three main sections:",
 'You will be learning about two primary Python libraries:']

In [69]:
soup.find_all('li')[:3]

[<li>HTML Basics</li>, <li>Scraping Basics</li>, <li>Scraping Pipeline</li>]

In [70]:
agenda_list = [li.text for li in soup.find_all('li')[:3]]
agenda_list

['HTML Basics', 'Scraping Basics', 'Scraping Pipeline']
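
Slicing with `[:3]` works only while the agenda has exactly three items and comes first on the page. A possibly sturdier alternative is to search inside the `div` with `id='agenda'`. Below is a minimal sketch on a shortened, hypothetical copy of the HTML; `two_lists_html` and `doc` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the workshop HTML
two_lists_html = """
<div id="agenda"><ol><li>HTML Basics</li><li>Scraping Basics</li></ol></div>
<div id="tools"><ol><li>BeautifulSoup</li><li>requests</li></ol></div>
"""
doc = BeautifulSoup(two_lists_html, "html.parser")

# Search only inside the div whose id is "agenda"
agenda_items = [li.text for li in doc.find(id="agenda").find_all("li")]

# The same pattern picks out the other list without any slicing
tools_items = [li.text for li in doc.find(id="tools").find_all("li")]

print(agenda_items)
print(tools_items)
```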

In [73]:
# downloading a file from the internet
import urllib.request
url = 'https://raw.github.com/kimfetti/Conferences/master/PyCon_2020/pycon_info.html'
filename = 'pycon_info.html'
urllib.request.urlretrieve(url, filename)

('pycon_info.html', <http.client.HTTPMessage at 0x1ff3d98e8b0>)
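
If the goal is only to parse the page, the HTML can also be read straight into a string instead of being saved to disk first. This is a minimal sketch that assumes the content is UTF-8; `fetch_html` is a hypothetical helper name, and it relies only on the standard library's `urllib.request.urlopen`.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10, encoding="utf-8"):
    """Return the body at `url` decoded as text (hypothetical helper)."""
    # The with block closes the connection once the body has been read
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)
```

Calling `fetch_html(url)` with the `url` defined above should give a string that can be passed to BeautifulSoup directly.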

In [74]:
# Read the downloaded file; the with block closes it when done
with open('pycon_info.html') as f:
    pycon_html = f.read()

In [75]:
print(pycon_html)

<html>
    <head>
        <title>PyCon 2020 Info</title>

        <style>
            body {
                background-color: cornsilk;
            }

            h1 {
                font-size: 40px;
                font-family: courier new, arial;
                text-align: center;
                margin-top: 50px;
            }

            a {
                color: #411B2D;
                font-size: 20px;
            }

            p {
                font-size: 20px;
            }

            a:hover{
                color: white;
                background-color: #411B2D;
            }

            #toolbar {
                background-color: #F3B643;
                font-family: courier new, arial;
                font-weight: bold;
                font-size: 16px;
                display: flex;
                justify-content: space-around;
                flex-direction: row;
                border: 1px solid black;
                border-radius: 1px;
                marg

In [76]:
# Name the parser explicitly so bs4 does not have to guess one
soup = bs(pycon_html, 'html.parser')

In [77]:
type(soup)

bs4.BeautifulSoup

In [78]:
soup.find_all('a')

[<a href="https://us.pycon.org/2020/about/">WHAT IS PYCON?</a>,
 <a href="https://us.pycon.org/2020/schedule/tutorials/">TUTORIAL SCHEDULE</a>,
 <a href="https://us.pycon.org/2020/speaking/">SPEAKING AT PYCON</a>,
 <a href="https://us.pycon.org/2020/psf/">PYTHON SOFTWARE FOUNDATION</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/72/">It's Officially Legal so Let's Scrape the Web</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/54/">A Beginner's Guide to Befriending Python</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/55/">Scalable Computing with Dask</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/63/">Creating a Great Python Package</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/45/">Minimum Viable Documentation</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/74/">Effective Data Visu

In [79]:
today_div = soup.find(id='today')
today_div

<div class="events" id="today">
<h2>A Selection of Today's Events</h2>
<p> Room 309, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a></p>
<p> Room 315, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/72/">It's Officially Legal so Let's Scrape the Web</a></p>
<p> Room 317, 1:20 pm - <a href="https://us.pycon.org/2020/schedule/presentation/54/">A Beginner's Guide to Befriending Python</a></p>
<p> Room 318, 1:20 pm -<a href="https://us.pycon.org/2020/schedule/presentation/55/">Scalable Computing with Dask</a></p>
</div>

In [80]:
type(today_div)

bs4.element.Tag

In [81]:
today_div.find_all('a')

[<a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/72/">It's Officially Legal so Let's Scrape the Web</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/54/">A Beginner's Guide to Befriending Python</a>,
 <a href="https://us.pycon.org/2020/schedule/presentation/55/">Scalable Computing with Dask</a>]
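
Chaining `find(id=...)` and `find_all('a')` can also be written as a single CSS selector with `select()`. The sketch below uses a hypothetical, cut-down `events_html` string with made-up links rather than the real PyCon page.

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down version of the events markup
events_html = """
<div class="events" id="today"><p><a href="/talk/1">Talk One</a></p></div>
<div class="events" id="tomorrow"><p><a href="/talk/2">Talk Two</a></p></div>
"""
page = BeautifulSoup(events_html, "html.parser")

# "#today a" matches every <a> nested inside the element with id="today"
today_links = page.select("#today a")

# select_one() returns only the first match, or None if nothing matches
first_tomorrow = page.select_one("#tomorrow a")

print([a.text for a in today_links], first_tomorrow["href"])
```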

In [84]:
soup.find_all(class_='events')

[<div class="events" id="today">
 <h2>A Selection of Today's Events</h2>
 <p> Room 309, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a></p>
 <p> Room 315, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/72/">It's Officially Legal so Let's Scrape the Web</a></p>
 <p> Room 317, 1:20 pm - <a href="https://us.pycon.org/2020/schedule/presentation/54/">A Beginner's Guide to Befriending Python</a></p>
 <p> Room 318, 1:20 pm -<a href="https://us.pycon.org/2020/schedule/presentation/55/">Scalable Computing with Dask</a></p>
 </div>,
 <div class="events" id="tomorrow">
 <h2>Coming Up Tomorrow</h2>
 <p> Room 316, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/63/">Creating a Great Python Package</a></p>
 <p> Room 319, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/45/">Minimum Viable Documentation</a></p>
 <p> Room 309, 1:20 pm - <a href="https://us.pycon.org/2020/sc

In [85]:
soup.find_all(attrs={'class': 'events', 'id': 'tomorrow'})

[<div class="events" id="tomorrow">
 <h2>Coming Up Tomorrow</h2>
 <p> Room 316, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/63/">Creating a Great Python Package</a></p>
 <p> Room 319, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/45/">Minimum Viable Documentation</a></p>
 <p> Room 309, 1:20 pm - <a href="https://us.pycon.org/2020/schedule/presentation/74/">Effective Data Visualization</a>
 </p></div>]

In [87]:
soup.find('a')['href']

'https://us.pycon.org/2020/about/'
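
Square-bracket access raises a `KeyError` when a tag lacks the attribute. Two gentler options are `.get()` and filtering with `href=True`. The following is a small sketch on a hypothetical `anchors_html` string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> has no href attribute
anchors_html = '<p><a href="/about">About</a> <a name="top">Top</a></p>'
anchors = BeautifulSoup(anchors_html, "html.parser")

# .get() returns None for a missing attribute instead of raising KeyError
hrefs_or_none = [a.get("href") for a in anchors.find_all("a")]

# href=True keeps only the <a> tags that actually have an href attribute
with_href = [a["href"] for a in anchors.find_all("a", href=True)]

print(hrefs_or_none)
print(with_href)
```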

In [88]:
todays_links = [link['href'] for link in today_div.find_all('a')]
todays_links

['https://us.pycon.org/2020/schedule/presentation/50/',
 'https://us.pycon.org/2020/schedule/presentation/72/',
 'https://us.pycon.org/2020/schedule/presentation/54/',
 'https://us.pycon.org/2020/schedule/presentation/55/']

In [89]:
tomorrow_tuples = [(link.text, link['href']) for link in soup.find(id='tomorrow').find_all('a')]
tomorrow_tuples

[('Creating a Great Python Package',
  'https://us.pycon.org/2020/schedule/presentation/63/'),
 ('Minimum Viable Documentation',
  'https://us.pycon.org/2020/schedule/presentation/45/'),
 ('Effective Data Visualization',
  'https://us.pycon.org/2020/schedule/presentation/74/')]

In [95]:
events = soup.find_all(class_='events')
events

[<div class="events" id="today">
 <h2>A Selection of Today's Events</h2>
 <p> Room 309, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/50/">Foundations of Numerical Computing in Python</a></p>
 <p> Room 315, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/72/">It's Officially Legal so Let's Scrape the Web</a></p>
 <p> Room 317, 1:20 pm - <a href="https://us.pycon.org/2020/schedule/presentation/54/">A Beginner's Guide to Befriending Python</a></p>
 <p> Room 318, 1:20 pm -<a href="https://us.pycon.org/2020/schedule/presentation/55/">Scalable Computing with Dask</a></p>
 </div>,
 <div class="events" id="tomorrow">
 <h2>Coming Up Tomorrow</h2>
 <p> Room 316, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/63/">Creating a Great Python Package</a></p>
 <p> Room 319, 9:00 am - <a href="https://us.pycon.org/2020/schedule/presentation/45/">Minimum Viable Documentation</a></p>
 <p> Room 309, 1:20 pm - <a href="https://us.pycon.org/2020/sc

In [99]:
event_h2 = [day.find('h2').text for day in events]
event_h2

["A Selection of Today's Events", 'Coming Up Tomorrow']
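
The headings and the link tuples can be combined into one dictionary keyed by each day's heading. This is only a sketch of one possible layout, run on a hypothetical `schedule_html` string with made-up talks.

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down version of the events markup
schedule_html = """
<div class="events" id="today"><h2>Today</h2>
  <p><a href="/talk/1">Talk One</a></p>
  <p><a href="/talk/2">Talk Two</a></p>
</div>
<div class="events" id="tomorrow"><h2>Tomorrow</h2>
  <p><a href="/talk/3">Talk Three</a></p>
</div>
"""
schedule = BeautifulSoup(schedule_html, "html.parser")

# Map each day's <h2> text to a list of (title, link) tuples
links_by_day = {
    day.find("h2").text: [(a.text, a["href"]) for a in day.find_all("a")]
    for day in schedule.find_all(class_="events")
}
print(links_by_day)
```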