### BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

#### !pip install beautifulsoup4


The Beautiful Soup documentation also suggests using the lxml parser.

To parse is to resolve text (originally a sentence, here an HTML or XML document) into its component parts and describe their syntactic roles.
#### !pip install lxml

### Requests
Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.
#### !pip install requests
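
To illustrate the point about query strings, here is a minimal sketch. The endpoint `https://example.com/search` and the parameter names are placeholders assumed only for this example; the request is prepared but never sent, so the snippet runs without network access.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration.
params = {'q': 'beautiful soup', 'page': 2}

# Prepare the request without sending it, so no network access is needed.
prepared = requests.Request('GET', 'https://example.com/search', params=params).prepare()

# Requests has encoded the dictionary into the query string for us.
print(prepared.url)
```

In everyday use the same `params` dictionary can be passed straight to `requests.get(url, params=params)`.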

In [1]:
from bs4 import BeautifulSoup
import requests


with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    
    
# print(soup)

# with indentation
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <title>
   Test - A Sample Website
  </title>
  <meta charset="utf-8"/>
  <link href="css/normalize.css" rel="stylesheet"/>
  <link href="css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <h1 id="site_title">
   Test Website
  </h1>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_1.html">
     Article 1 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 1
   </p>
  </div>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_2.html">
     Article 2 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 2
   </p>
  </div>
  <hr/>
  <div class="footer">
   <p>
    Footer Information
   </p>
  </div>
  <script src="js/vendor/modernizr-3.5.0.min.js">
  </script>
  <script src="js/plugins.js">
  </script>
  <script src="js/main.js">
  </script>
 </body>
</html>


In [2]:
# Get the title, access it like an attribute...
# match = soup.title.text

match = soup.title

# print(match)

# only text
print(match.text)

Test - A Sample Website


In [3]:
# Search for specific divs using classes or IDs...

# match = soup.find_all('div')             -- for all divs

# match = soup.find('div', id = 'footer')  -- returns None here: the footer div has a class, not an id

match = soup.find('div', class_ = 'footer')
print(match)

<div class="footer">
<p>Footer Information</p>
</div>
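
The difference between searching by class and by id can be sketched on a trimmed copy of the sample page. The inline HTML string and the use of Python's built-in `html.parser` (instead of `lxml`) are assumptions made so the snippet runs on its own; `select_one` is Beautiful Soup's CSS-selector alternative to `find`.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample page, inlined so the snippet is self-contained.
html = """
<h1 id="site_title">Test Website</h1>
<div class="article"><h2><a href="article_1.html">Article 1 Headline</a></h2></div>
<div class="footer"><p>Footer Information</p></div>
"""

# 'html.parser' ships with Python; 'lxml' should behave the same on this markup.
soup = BeautifulSoup(html, 'html.parser')

# Look up by id: in this page only the <h1> carries one.
title = soup.find('h1', id='site_title')

# The footer is marked with a class, so an id lookup finds nothing.
missing = soup.find('div', id='footer')

# CSS selectors are an alternative to find(): 'div.footer' matches the class.
footer = soup.select_one('div.footer')

print(title.text)
print(missing)
print(footer.p.text)
```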


In [4]:
# getting the link inside the article...
article = soup.find('div', class_ = 'article')
print(article, '\n\n')

# Get text
link_text = article.h2.a.text
print(link_text, '\n\n')

# get link
link = article.h2.a['href']
print(link)

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div> 


Article 1 Headline 


article_1.html
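
The `href` above is relative. As a rough sketch, it can be turned into an absolute link with the standard library's `urljoin`. The base URL `https://example.com/blog/` is a placeholder assumed for illustration, and `.get()` is shown as a gentler alternative to `['href']` when an attribute might be missing.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<div class="article"><h2><a href="article_1.html">Article 1 Headline</a></h2></div>'
soup = BeautifulSoup(html, 'html.parser')
anchor = soup.find('div', class_='article').h2.a

# .get() returns None for a missing attribute instead of raising KeyError.
href = anchor.get('href')
target = anchor.get('target')

# Hypothetical base URL, assumed only for this example.
base_url = 'https://example.com/blog/'
full_link = urljoin(base_url, href)

print(href, target, full_link)
```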


In [5]:
# Looping through all articles...

for article in soup.find_all('div', class_ = 'article'):
    print(article, '\n\n')

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div> 


<div class="article">
<h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p>
</div> 
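
Rather than printing each tag, the loop can collect the fields of interest into a list of dictionaries. This is only a sketch: the inline HTML copies the two article blocks from the sample page, and the key names `headline`, `summary`, and `link` are arbitrary choices.

```python
from bs4 import BeautifulSoup

# The two article blocks from the sample page, inlined for a standalone run.
html = """
<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>
<div class="article">
<h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

articles = []
for article in soup.find_all('div', class_='article'):
    # Pull headline text, summary text, and link target out of each block.
    articles.append({
        'headline': article.h2.a.text,
        'summary': article.p.text,
        'link': article.h2.a['href'],
    })

for row in articles:
    print(row)
```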




---

# Working with websites...

In [41]:
source = requests.get('http://coreyms.com').text

soup = BeautifulSoup(source, 'lxml')

# print(soup.prettify())  -- entire html page

# finding article tag  --returns the first one...
article = soup.find('article')

# print(article.prettify(), '\n\n\n')


# get headline - 
# headline = article.h2.a.text
# print(headline)


# get paragraph from class "entry-content" - 
# summary = article.find('div', class_ = 'entry-content').p.text
# print(summary)


# get youtube vid link - 
vid_src = article.find('iframe', class_ = 'youtube-player')['src']
# print(vid_src)


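# Split on '/' to isolate the last path segment (the video ID plus any query string)
vid_id = vid_src.split('/')[4]
# print(vid_id)


# Drop the query string from the full embed URL
vid_url = vid_src.split('?')[0]
print(vid_url)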

https://www.youtube.com/embed/z0gguhEmWiY
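
If only the bare video ID is wanted, one possible approach is to parse the URL with the standard library instead of indexing into a split. In this sketch the helper name `youtube_id_from_embed`, the query string on the sample URL, and the watch-page URL pattern are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def youtube_id_from_embed(embed_src):
    """Return the last path segment of an embed URL, ignoring any query string."""
    path = urlsplit(embed_src).path  # the path part only, without '?...'
    return path.rstrip('/').split('/')[-1]


# The query string here is made up to show that it gets ignored.
vid_src = 'https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1'
vid_id = youtube_id_from_embed(vid_src)

# Assumed watch-page URL pattern, built from the extracted ID.
yt_link = f'https://www.youtube.com/watch?v={vid_id}'

print(vid_id)
print(yt_link)
```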


## For all articles

In [43]:
source = requests.get('http://coreyms.com').text

soup = BeautifulSoup(source, 'lxml')

# print(soup.prettify())  -- entire html page

# finding article tag  
for article in soup.find_all('article'):

    try:
        
        # print(article.prettify(), '\n\n\n')


        # get headline - 
        # headline = article.h2.a.text
        # print(headline)


        # get paragraph from class "entry-content" - 
        # summary = article.find('div', class_ = 'entry-content').p.text
        # print(summary)


        # get youtube vid link - 
        vid_src = article.find('iframe', class_ = 'youtube-player')['src']
        # print(vid_src)


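        # Split on '/' to isolate the last path segment (the video ID plus any query string)
        vid_id = vid_src.split('/')[4]
        # print(vid_id)


        # Drop the query string from the full embed URL
        vid_url = vid_src.split('?')[0]
        print(vid_url)
        
    except Exception:
        # No youtube-player iframe in this article (find() returned None), so skip it
        vid_id = None
        vid_url = None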
        
        

https://www.youtube.com/embed/z0gguhEmWiY
https://www.youtube.com/embed/_P7X8tMplsw
https://www.youtube.com/embed/fKl2JW_qrso
https://www.youtube.com/embed/IEEhzQoKtQU
https://www.youtube.com/embed/mO_dS3rXDIs
https://www.youtube.com/embed/2Fp1N6dof0Y
https://www.youtube.com/embed/-nh9rCzPJ20
https://www.youtube.com/embed/06I63_p-2A4
https://www.youtube.com/embed/_JGmemuINww
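
An alternative to wrapping the whole loop body in `try`/`except` is to check explicitly whether the iframe was found. The sketch below runs on made-up markup that only mimics the structure assumed above (an `article` tag with an optional `iframe` of class `youtube-player`); the helper name `video_url` is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup: one article with an embedded video, one without.
html = """
<article><h2><a href="#">With video</a></h2>
<iframe class="youtube-player" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3"></iframe>
</article>
<article><h2><a href="#">Text only</a></h2></article>
"""
soup = BeautifulSoup(html, 'html.parser')


def video_url(article):
    """Return the embed URL without its query string, or None if there is no video."""
    iframe = article.find('iframe', class_='youtube-player')
    if iframe is None or not iframe.get('src'):
        return None
    return iframe['src'].split('?')[0]


results = [(a.h2.a.text, video_url(a)) for a in soup.find_all('article')]
for headline, url in results:
    print(headline, url)
```

Checking for `None` keeps unrelated errors visible, whereas a broad `except` would silently hide them.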


---