## Web Scraping 101


This notebook covers the basics of the Beautiful Soup library: how to load a webpage, the core commands you need to know such as find and find_all, how to grab strings from HTML elements, and so on. The final section of the tutorial is a series of exercises where you can practice your skills: we scrape a webpage for links, scrape a table and load it into a pandas DataFrame, and scrape and download an image from the web.

Tutorial link: https://www.youtube.com/watch?v=GjKQ6V_ViQE&t=320s

## Load in the necessary libraries

In [2]:
import requests  # pip install requests
from bs4 import BeautifulSoup as bs  # pip install beautifulsoup4

## Load our first page

In [5]:
# Load the webpage content
r = requests.get('https://keithgalli.github.io/web-scraping/example.html')

# Convert to a Beautiful Soup object (name the parser explicitly to avoid a warning)
soup = bs(r.content, "html.parser")

# Print out our HTML
print(soup.prettify())  # .prettify() returns the HTML with one tag per line and consistent indentation


<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>
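
While experimenting, it can be handy to parse a small HTML string directly instead of hitting the network every time. The sketch below assumes a shortened copy of the example page (the `html` string is illustrative, not the full page) and uses Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# A shortened, hand-written copy of the example page (assumption: not the full page)
html = """
<html>
 <head><title>HTML Example</title></head>
 <body>
  <h2>A Header</h2>
  <p><i>Some italicized text</i></p>
 </body>
</html>
"""

# Parse the string with the parser from the standard library
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached by name; .string gives the text inside the tag
title_text = soup.title.string
print(title_text)
```

Everything shown in the rest of this notebook works the same way on a soup built from a string.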



## Start using Beautiful Soup to scrape

find and find_all

In [10]:
first_header = soup.find("h2")  # find returns the first matching element
headers = soup.find_all("h2")  # find_all returns a list of all matching elements

print(headers)

[<h2>A Header</h2>, <h2>Another header</h2>]
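
The difference between the two methods matters most when nothing matches. A minimal sketch, assuming a small inline HTML string rather than the live page:

```python
from bs4 import BeautifulSoup

# Small stand-in document (assumption: hand-written for this example)
html = "<body><h2>A Header</h2><p>text</p><h2>Another header</h2></body>"
soup = BeautifulSoup(html, "html.parser")

first_header = soup.find("h2")      # a single Tag: the first match
all_headers = soup.find_all("h2")   # a list-like result with every match

missing = soup.find("h3")           # no match: find gives back None
missing_all = soup.find_all("h3")   # no match: find_all gives back an empty list

# Because find can return None, check before using the result
header_texts = [h.get_text() for h in all_headers]
print(header_texts)
```

Calling a method on the result of `find` without checking for `None` is a common source of `AttributeError` when a page changes.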


Pass in a list of elements to look for

In [14]:
first_header = soup.find(['h1', 'h2'])

headers = soup.find_all(['h1', 'h2'])
headers

[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]

You can pass attributes to the find/find_all functions

In [16]:
paragraph = soup.find_all('p', attrs={'id': 'paragraph-id'})
paragraph

[<p id="paragraph-id"><b>Some bold text</b></p>]
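
Besides the `attrs` dictionary, Beautiful Soup also accepts attribute filters as keyword arguments. The sketch below assumes an inline HTML string; the `class="note"` attribute is made up for illustration and is not on the example page.

```python
from bs4 import BeautifulSoup

# The class "note" is hypothetical, added only to show class_ filtering
html = '<p>plain</p><p id="paragraph-id" class="note"><b>Some bold text</b></p>'
soup = BeautifulSoup(html, "html.parser")

# Three ways to reach the same paragraph
by_attrs = soup.find_all("p", attrs={"id": "paragraph-id"})
by_keyword = soup.find_all("p", id="paragraph-id")
by_class = soup.find_all("p", class_="note")  # class_ because class is a Python keyword

print(by_attrs, by_keyword, by_class)
```

The `attrs` dictionary is still the safer choice for attribute names that are not valid Python identifiers, such as `data-id`.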

You can nest find/find_all calls

In [20]:
body = soup.find('body')
div = body.find('div')
header = div.find('h1')
header

<h1>HTML Webpage</h1>

Search for specific strings in our find/find_all calls.
More about regular expressions (RegEx): https://www.w3schools.com/python/python_regex.asp


In [25]:
import re  # Python's regular expression module

paragraphs = soup.find_all('p', string=re.compile("Some"))
paragraphs

headers = soup.find_all('h2', string=re.compile('(H|h)eader'))
headers

[<h2>A Header</h2>, <h2>Another header</h2>]
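
As an alternative to spelling out both cases with `(H|h)eader`, a compiled pattern can carry the `re.IGNORECASE` flag. This is a minimal sketch on an inline HTML string, not the author's original cell:

```python
import re
from bs4 import BeautifulSoup

html = "<body><h2>A Header</h2><h2>Another header</h2><h2>Footer</h2></body>"
soup = BeautifulSoup(html, "html.parser")

# Case-insensitive match anywhere in the tag's text
any_case = soup.find_all("h2", string=re.compile("header", re.IGNORECASE))
any_case_texts = [h.string for h in any_case]

# Anchoring with ^ restricts the match to the start of the text
starts_with_a = soup.find_all("h2", string=re.compile("^A "))
starts_texts = [h.string for h in starts_with_a]

print(any_case_texts, starts_texts)
```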

## Select (CSS Selectors)
CSS Selector Reference: https://www.w3schools.com/cssref/css_selectors.asp

In [27]:
print(soup.body.prettify())

<body>
 <div align="middle">
  <h1>
   HTML Webpage
  </h1>
  <p>
   Link to more interesting example:
   <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
   </a>
  </p>
 </div>
 <h2>
  A Header
 </h2>
 <p>
  <i>
   Some italicized text
  </i>
 </p>
 <h2>
  Another header
 </h2>
 <p id="paragraph-id">
  <b>
   Some bold text
  </b>
 </p>
</body>



In [28]:
content = soup.select('div p')
content

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]

In [30]:
paragraphs = soup.select('h2 ~ p')
paragraphs

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [31]:
bold_text = soup.select('p#paragraph-id b')
bold_text

[<b>Some bold text</b>]

In [34]:
# run nested calls
paragraphs = soup.select('body > p')
print(paragraphs)

for paragraph in paragraphs:
    print(paragraph.select('i'))
    

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<i>Some italicized text</i>]
[]


In [35]:
# Grab elements with a specific attribute value
soup.select("[align=middle]")

[<div align="middle">
 <h1>HTML Webpage</h1>
 <p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
 </div>]
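
The selectors used above differ only in their combinators, so it helps to see them side by side. The sketch below assumes a shortened inline copy of the example page and also shows `select_one`, which returns the first match or `None`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example page body (assumption: link paragraph simplified)
html = """
<body>
 <div align="middle">
  <h1>HTML Webpage</h1>
  <p>Inside the div</p>
 </div>
 <h2>A Header</h2>
 <p><i>Some italicized text</i></p>
 <h2>Another header</h2>
 <p id="paragraph-id"><b>Some bold text</b></p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = soup.select("body p")   # every <p> anywhere inside <body>
children = soup.select("body > p")    # only <p> tags directly under <body>
siblings = soup.select("h2 ~ p")      # <p> tags that follow an <h2> with the same parent
adjacent = soup.select("h2 + p")      # <p> tags immediately after an <h2>

first_bold = soup.select_one("p#paragraph-id b")  # first match as a single Tag
no_match = soup.select_one("table")               # None when nothing matches

print(len(descendants), len(children), len(siblings), len(adjacent))
```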

## Get different properties of the HTML

In [43]:
header = soup.find('h2')
header  # the element is displayed with its tags

header = soup.find('h2')
header.string  # .string drops the tags and returns only the text

'A Header'

In [45]:
div = soup.find('div')
print(div.prettify())
print(div.get_text())

<div align="middle">
 <h1>
  HTML Webpage
 </h1>
 <p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
   keithgalli.github.io/web-scraping/webpage.html
  </a>
 </p>
</div>


HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
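
One thing worth knowing is why `get_text()` is used on the div instead of `.string`: `.string` only works when a tag contains a single piece of text. A minimal sketch, assuming a compact inline version of the div:

```python
from bs4 import BeautifulSoup

# Compact stand-in for the example div (assumption: link target simplified)
html = '<div><h1>HTML Webpage</h1><p>Link: <a href="#">example</a></p></div>'
soup = BeautifulSoup(html, "html.parser")

h1_string = soup.find("h1").string   # one text child, so .string returns it
div = soup.find("div")
div_string = div.string              # several children, so .string is None

# get_text() concatenates all nested text
raw_text = div.get_text()

# A separator plus strip=True gives cleaner output
clean_text = div.get_text(" ", strip=True)

print(repr(div_string), repr(raw_text), repr(clean_text))
```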



In [49]:
# Get a specific attribute from an element
link = soup.find('a')
link['href']  # returns only the URL stored in the href attribute



paragraphs = soup.select('p#paragraph-id')
paragraphs[0]['id']

'paragraph-id'
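
Square-bracket access raises a `KeyError` when the attribute is missing, so `.get()` is often the safer choice on real pages. The sketch below assumes an inline HTML string with a placeholder URL (`https://example.com/page` is illustrative only).

```python
from bs4 import BeautifulSoup

# Placeholder link, not taken from the example page
html = '<p id="paragraph-id"><a href="https://example.com/page">a link</a></p>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")

href = link["href"]                         # raises KeyError if href were absent
missing = link.get("title")                 # None, because there is no title attribute
fallback = link.get("title", "no title")    # default value instead of None
all_attrs = link.attrs                      # dictionary of every attribute on the tag

print(href, missing, fallback, all_attrs)
```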

## Code Navigation

In [53]:
# Path Syntax
soup.body.div.h1.string


'HTML Webpage'

In [55]:
# Know the terms: parent, sibling, child
soup.body.find('div').find_next_siblings()

[<h2>A Header</h2>,
 <p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
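
Parent, child, and sibling navigation can be sketched on a compact inline document as follows. The HTML here is written without whitespace between tags on purpose: with pretty-printed HTML, `.children` and `.next_sibling` can also yield whitespace strings, whereas the `find_*` methods return tags only.

```python
from bs4 import BeautifulSoup

# Compact stand-in document (assumption: hand-written, no whitespace between tags)
html = "<body><div><h1>HTML Webpage</h1></div><h2>A Header</h2><p>text</p></body>"
soup = BeautifulSoup(html, "html.parser")

div = soup.body.div

parent_name = div.parent.name                                   # the tag that contains the div
child_names = [child.name for child in div.children]            # direct children of the div
next_name = div.find_next_sibling().name                        # first tag after the div
sibling_names = [s.name for s in div.find_next_siblings()]      # every tag after the div
previous_name = soup.p.find_previous_sibling().name             # tag just before the <p>

print(parent_name, child_names, next_name, sibling_names, previous_name)
```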