# Beautiful Soup
- Used for parsing HTML and pulling data out of it
- From [Corey Schafer - Python Tutorial: Web Scraping with BeautifulSoup and Requests](https://www.youtube.com/watch?v=ng2o98k983k)

## Reading from a local file

In [3]:
from bs4 import BeautifulSoup
import requests

with open('randomfiles/simple.html') as f:
    soup = BeautifulSoup(f, 'lxml')

print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <title>
   Test - A Sample Website
  </title>
  <meta charset="utf-8"/>
  <link href="css/normalize.css" rel="stylesheet"/>
  <link href="css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <h1 id="site_title">
   Test Website
  </h1>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_1.html">
     Article 1 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 1
   </p>
  </div>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_2.html">
     Article 2 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 2
   </p>
  </div>
  <hr/>
  <div class="footer">
   <p>
    Footer Information
   </p>
  </div>
  <script src="js/vendor/modernizr-3.5.0.min.js">
  </script>
  <script src="js/plugins.js">
  </script>
  <script src="js/main.js">
  </script>
 </body>
</html>
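The same parsing step also works on a string rather than a file handle. The sketch below is a minimal illustration, not part of the tutorial: the inline `html` string is a shortened, hypothetical stand-in for `randomfiles/simple.html`, and it swaps in Python's built-in `'html.parser'` so that `lxml` does not need to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup, a shortened stand-in for randomfiles/simple.html
html = """<html><head><title>Test - A Sample Website</title></head>
<body><h1 id="site_title">Test Website</h1></body></html>"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'
soup_inline = BeautifulSoup(html, 'html.parser')

# Tag names are reachable as attributes; find() can also filter on id
page_title = soup_inline.title.text
site_heading = soup_inline.find('h1', id='site_title').text

print(page_title)
print(site_heading)
```

The two parsers can build slightly different trees for malformed markup, so results on messy real-world pages may differ from the `lxml` output shown above.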



- Grabbing HTML elements by tag name and CSS class

In [4]:
title = soup.title.text
print(title)

first_div = soup.div
print(first_div)

# find a div with a class of footer
match = soup.find('div', class_='footer')
print(match)

# loop over every div with a class of article
for article in soup.find_all('div', class_='article'):
    headline = article.h2.a.text
    print(headline)

    summary = article.p.text
    print(summary)


Test - A Sample Website
<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>
<div class="footer">
<p>Footer Information</p>
</div>
Article 1 Headline
This is a summary of article 1
Article 2 Headline
This is a summary of article 2
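Printing each field is fine for exploring, but the same loop can collect the results into a list for later use. This is one possible sketch, not the tutorial's code: `extract_articles` is a hypothetical helper, the inline `html` string is a trimmed copy of the article markup shown earlier, and `'html.parser'` is used so the block runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the article markup from the sample page (illustrative only)
html = """
<div class="article"><h2><a href="article_1.html">Article 1 Headline</a></h2><p>This is a summary of article 1</p></div>
<div class="article"><h2><a href="article_2.html">Article 2 Headline</a></h2><p>This is a summary of article 2</p></div>
<div class="footer"><p>Footer Information</p></div>
"""
soup_demo = BeautifulSoup(html, 'html.parser')


def extract_articles(soup):
    """Hypothetical helper: return one dict per div with class 'article'."""
    articles = []
    for article in soup.find_all('div', class_='article'):
        link = article.h2.a  # the <a> tag inside the article's <h2>
        articles.append({
            'headline': link.text,
            'href': link['href'],  # tag attributes are read like dict keys
            'summary': article.p.text,
        })
    return articles


articles = extract_articles(soup_demo)
for item in articles:
    print(item['headline'], item['href'], item['summary'])
```

The footer `div` is skipped because `class_='article'` only matches elements carrying that class.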


- Fetching a page from a website and extracting the YouTube video ID from an embedded player

In [6]:
source = requests.get('http://coreyms.com').text

soup2 = BeautifulSoup(source,'lxml')
try:
    vid_src = soup2.find('iframe', class_='youtube-player')['src']
    
    print(vid_src)
    vid_id = vid_src.split('/')[4]
    
    print(vid_id)
    vid_id = vid_id.split('?')[0]
    
    print(vid_id)
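    yt_link = f'https://youtube.com/watch?v={vid_id}'

    print(yt_link)

except (TypeError, KeyError, IndexError) as e:
    # find() returned None, the iframe had no src, or the URL had an unexpected shape
    print(f'Could not extract a video id: {e}')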


https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
z0gguhEmWiY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
z0gguhEmWiY
https://youtube.com/watch?v=z0gguhEmWiY
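The two `split` calls above depend on the ID being the fifth `/`-separated piece of the URL. As an alternative sketch (an assumption of these notes, not the tutorial's approach), the standard library's `urllib.parse` can separate the path from the query string first. `youtube_id_from_embed` is a hypothetical helper name, and the example URL is a shortened form of the one printed above.

```python
from urllib.parse import urlparse


def youtube_id_from_embed(embed_url):
    """Hypothetical helper: return the ID from a /embed/<id> URL, else None."""
    path = urlparse(embed_url).path  # query string is already stripped here
    parts = [p for p in path.split('/') if p]  # drop empty pieces
    if len(parts) >= 2 and parts[0] == 'embed':
        return parts[1]
    return None


# Shortened version of the embed URL from the output above
embed = 'https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1'
video_id = youtube_id_from_embed(embed)
watch_url = f'https://youtube.com/watch?v={video_id}'

print(video_id)
print(watch_url)
```

Returning `None` for URLs that do not look like an embed link leaves the decision about what to do with unexpected input to the caller.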
