# Work in Progress (WIP)

# Using Beautiful Soup

Beautiful Soup is a Python package for parsing HTML and XML documents. Web scraping normally involves several steps:

*   Using the [requests](https://pypi.org/project/requests/) library to get the HTML of a page
*   Using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to turn that HTML into a soup object
*   Applying `find`/`find_all` or `select` to find a particular HTML tag or CSS selector
*   Sometimes, some sort of iteration (e.g., a for loop) to capture repeated elements, such as all the hyperlinks on a page
*   Appending results to a list, opening links, and scraping more content
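
The steps above can be sketched end to end without touching the network. This is only an illustration: the `html` string is made up and stands in for the text that `requests.get(url).text` would return, and the `example.com` links are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the text returned by requests.get(url).text
html = """
<html><body>
  <h1>Sample page</h1>
  <p>See <a href="https://example.com/a">page A</a> and
     <a href="https://example.com/b">page B</a>.</p>
</body></html>
"""

# Turn the HTML into a soup object
soup = BeautifulSoup(html, "html.parser")

# Find every hyperlink and collect its target in a list
links = []
for a_tag in soup.find_all("a"):
    links.append(a_tag["href"])

print(links)
```

With a real page, the only change would be to fetch `html` with requests instead of hard-coding it.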





## Some Applications for BeautifulSoup and Web Scraping

*   Monitoring e-commerce prices 
*   Analyzing social media web data
*   Getting text for NLP analysis
*   Extracting abstracts, keywords, and contact information
*   Getting data from sites that have no API





## Things to Keep in Mind 

 
* Keep it simple and only scrape what you need. 
* Practice being a good internet citizen. Not all websites take kindly to scraping, and some may prohibit it explicitly. Check with the website owners if they're okay with scraping or see if there is a robots.txt file.
* If you feel like your requests are excessive, add [time](https://docs.python.org/3/library/time.html) delays to your requests. 
* Websites can be very complex and, in my experience, websites with a lot of JavaScript can be impossible to scrape; sometimes it just won't work.
* Not all HTML is created equal, and some sites are not consistent with their tagging. Make sure to check "View Page Source" or "Inspect" as you go.
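
As a rough sketch of the robots.txt and delay advice above, the standard library's `urllib.robotparser` can check whether a URL is allowed, and `time.sleep` can space out requests. The robots.txt content, the URLs, and the 0.1 second delay below are all made up for illustration; a real script would read the site's own robots.txt and would likely wait longer.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally this comes from https://<site>/robots.txt
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical URLs to check before scraping
urls = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
]

allowed = []
for url in urls:
    if parser.can_fetch("*", url):
        allowed.append(url)
        # requests.get(url) would go here
        time.sleep(0.1)  # pause between requests; use a longer delay in practice

print(allowed)
```

Here only the first URL ends up in `allowed`, because the second falls under the `Disallow: /private/` rule.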

### BeautifulSoup Documentation

Read the Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

# Getting the HTML

In [None]:
# install libraries 

!pip install requests 
!pip install beautifulsoup4

In [None]:
# import libraries 

from bs4 import BeautifulSoup as bs
import requests

In [10]:
page = requests.get("https://keithgalli.github.io/web-scraping/example.html")

# make it into a soup object
page_html = page.text
soup = bs(page_html, "html.parser")

# alternatively, pass the raw bytes
# soup = bs(page.content, "html.parser")

# print the page, or pretty print it
# print(soup)
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



## Find and Find All 
* `find`: returns the first matching element
* `find_all`: returns all elements matching an HTML tag (e.g., all `h2` tags)
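
One practical difference between the two is what they return when nothing matches. The sketch below uses a small made-up HTML string to show it: `find` gives back `None`, while `find_all` gives back an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = "<body><h2>A Header</h2><h2>Another header</h2></body>"
soup = BeautifulSoup(html, "html.parser")

first_h2 = soup.find("h2")        # a single Tag
all_h2 = soup.find_all("h2")      # a list-like ResultSet of Tags
missing = soup.find("h3")         # None, because there is no <h3>
none_found = soup.find_all("h3")  # an empty result

# Guard against None before using the result of find()
if missing is None:
    print("No <h3> on this page")

print(first_h2.get_text(), len(all_h2), len(none_found))
```

Checking for `None` before calling methods on the result of `find` avoids an `AttributeError` on pages that lack the tag.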

In [None]:
header = soup.find("h2")
headers = soup.find_all("h2")
print(header)
print(headers)

<h2>A Header</h2>
[<h2>A Header</h2>, <h2>Another header</h2>]


In [None]:
# pass in a list of elements to look for
h1_h2 = soup.find_all(["h1", "h2"])
print(h1_h2)

[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]


In [None]:
# you can use find/find_all when looking for a particular attribute

paragraph = soup.find_all("p", attrs={"id": "paragraph-id"})
print(paragraph)

[<p id="paragraph-id"><b>Some bold text</b></p>]


In [None]:
# nesting find and find all calls 

body = soup.find("body")
# body

div = body.find("div")
#  div

para = div.find("p")
para

<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>

In [None]:
# search for specific strings in find/find_all calls

import re
some_paragraphs = soup.find_all("p", string=re.compile("Some"))
some_paragraphs

# headers

headers = soup.find_all("h2", string=re.compile(r"(H|h)eader"))
headers

[<h2>A Header</h2>, <h2>Another header</h2>]

### Select (CSS Selector)
Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

CSS Selector Reference: https://www.w3schools.com/cssref/css_selectors.php
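
As a quick orientation, here is a small sketch of a few common selector forms. The HTML string is made up for the example and only loosely mirrors the page used in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<body>
  <div><p>Inside the div</p></div>
  <p>After the div</p>
  <p id="paragraph-id"><b>Some bold text</b></p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

inside_div = soup.select("div p")          # descendant: <p> anywhere inside a <div>
direct_children = soup.select("body > p")  # child: <p> directly under <body>
after_div = soup.select("div ~ p")         # sibling: <p> following a <div> at the same level
by_id = soup.select("#paragraph-id b")     # id: <b> inside the element with id="paragraph-id"
first_match = soup.select_one("p")         # select_one returns only the first match

print(len(inside_div), len(direct_children), len(after_div))
print(by_id[0].get_text(), "|", first_match.get_text())
```

Note the `#` prefix when selecting by id; without it, the selector is read as a tag name.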

In [None]:
print(soup.body.prettify())

<body>
 <div align="middle">
  <h1>
   HTML Webpage
  </h1>
  <p>
   Link to more interesting example:
   <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
   </a>
  </p>
 </div>
 <h2>
  A Header
 </h2>
 <p>
  <i>
   Some italicized text
  </i>
 </p>
 <h2>
  Another header
 </h2>
 <p id="paragraph-id">
  <b>
   Some bold text
  </b>
 </p>
</body>



### Simple Ways to Navigate 

In [None]:
soup.select("div p")

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]

In [None]:

paragraphs = soup.select("div ~ p") 

In [None]:
bold_text = soup.select("p#paragraph-id b")

In [None]:
paragraphs = soup.select("body > p")

In [None]:
for paragraph in paragraphs: 
  print(paragraph.select("i"))

# Getting Different Properties of the HTML

In [None]:
header = soup.find("h2")
header.string

div = soup.find("div")
print(div.prettify())
print(div.string)

# if .string gives None, use get_text()

div = soup.find("div")
print(div.prettify())
print(div.get_text())
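
To make the difference concrete, here is a minimal sketch on a made-up HTML string: `.string` works when a tag has a single text child, and returns `None` when the tag has several children, while `get_text()` joins all nested text.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = "<div><h1>HTML Webpage</h1><p>Link to more: <a href='#'>example</a></p></div>"
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")
div = soup.find("div")

print(h1.string)                      # the h1 has a single text child
print(div.string)                     # None: the div has more than one child
print(div.get_text())                 # all nested text joined together
print(div.get_text(" ", strip=True))  # same text, with a separator and trimmed whitespace
```

Passing a separator to `get_text` is often handy, since the default simply concatenates the text of neighboring tags.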




Get a specific property from the element

In [None]:
link = soup.find("a")
link["href"]

paragraph = soup.select("p#paragraph-id")
paragraph[0]["id"]
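
Square-bracket access assumes the attribute exists. The sketch below, on a made-up HTML string, shows two alternatives that may be useful when tagging is inconsistent: `.get()` and `.attrs`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = '<p id="paragraph-id"><a href="https://example.com">a link</a></p>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

href = link["href"]          # raises KeyError if the attribute is missing
target = link.get("target")  # returns None if the attribute is missing
all_attrs = link.attrs       # dictionary of every attribute on the tag

print(href, target, all_attrs)
```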

In [None]:
# Path syntax 

soup

soup.body 

soup.body.div.h1 

soup.body.div.h1.string


In [None]:
# know the terms: parent, sibling, and child 

print(soup.body.prettify())

# body is the parent, div is its child, and h1 is a child of div
# the h2 and p tags that follow div are on the same level as div (siblings)
# Review "Navigating the tree" in the bs4 documentation

soup.body.find("div").find_next_siblings()
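
The three terms can be sketched on a compact, made-up HTML string. It is written without whitespace between tags so that `.children` yields only tags; on real pages `.children` also yields the text nodes between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration (no whitespace between tags)
html = "<body><div><h1>HTML Webpage</h1></div><h2>A Header</h2><p>Some text</p></body>"
soup = BeautifulSoup(html, "html.parser")
div = soup.body.div

parent_name = div.parent.name                                   # the tag that contains div
child_names = [child.name for child in div.children]            # tags directly inside div
sibling_names = [sib.name for sib in div.find_next_siblings()]  # tags after div, same level

print(parent_name, child_names, sibling_names)
```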


# Exercises 

Go to: https://keithgalli.github.io/web-scraping/webpage.html

Use "Inspect" in your browser to see the HTML and CSS.

In [None]:
r = requests.get("https://keithgalli.github.io/web-scraping/webpage.html")

# convert to a BeautifulSoup object
webpage = bs(r.content, "html.parser")
print(webpage.prettify())

# grab all of the social links from the webpage
# do this in at least three different ways, e.g., one using find/find_all and another using the select method



In [None]:
# 

In [None]:
# 

## Iteration / for loops