# Beautiful Soup

Beautiful Soup is a Python library designed to help you pull information out of an HTML document or string.

Ideally, all of the information we would like to get from the web would be readily available as JSON via a REST API. Alas, this is not the case: much of it is presented as HTML, which is a pain to parse and extract data from. This is where Beautiful Soup comes in.

HTML basics: https://www.w3schools.com/html/html_basic.asp



In [3]:
import requests
from bs4 import BeautifulSoup
import csv

In [4]:
# Parse the local file with the lxml parser (requires the lxml package)
with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

print(soup)

<!DOCTYPE html>
<html class="no-js" lang="">
<head>
<title>Test - A Sample Website</title>
<meta charset="utf-8"/>
<link href="css/normalize.css" rel="stylesheet"/>
<link href="css/main.css" rel="stylesheet"/>
</head>
<body>
<h1 id="site_title">Test Website</h1>
<hr/>
<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>
<hr/>
<div class="article">
<h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p>
</div>
<hr/>
<div class="footer">
<p>Footer Information</p>
</div>
<script src="js/vendor/modernizr-3.5.0.min.js"></script>
<script src="js/plugins.js"></script>
<script src="js/main.js"></script>
</body>
</html>
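
Beautiful Soup accepts an HTML string as well as an open file. The sketch below is a minimal illustration, not part of the original notebook: the `html` string is a hypothetical stand-in for the contents of `simple.html`, and it uses the built-in `html.parser` so that it runs without the `lxml` package.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the contents of simple.html
html = """
<html>
<head><title>Test - A Sample Website</title></head>
<body><h1 id="site_title">Test Website</h1></body>
</html>
"""

# 'html.parser' ships with Python; 'lxml' needs a separate install
soup_from_string = BeautifulSoup(html, "html.parser")

print(soup_from_string.title.text)  # text inside <title>
print(soup_from_string.h1["id"])    # value of the id attribute on <h1>
```

The variable is named `soup_from_string` here only to avoid overwriting the `soup` object used in the rest of the notebook.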


In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <title>
   Test - A Sample Website
  </title>
  <meta charset="utf-8"/>
  <link href="css/normalize.css" rel="stylesheet"/>
  <link href="css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <h1 id="site_title">
   Test Website
  </h1>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_1.html">
     Article 1 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 1
   </p>
  </div>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_2.html">
     Article 2 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 2
   </p>
  </div>
  <hr/>
  <div class="footer">
   <p>
    Footer Information
   </p>
  </div>
  <script src="js/vendor/modernizr-3.5.0.min.js">
  </script>
  <script src="js/plugins.js">
  </script>
  <script src="js/main.js">
  </script>
 </body>
</html>


In [6]:
match=soup.title
print(match)

<title>Test - A Sample Website</title>


In [7]:
match=soup.title.text
print(match)

Test - A Sample Website
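
Besides `.text`, there are a couple of closely related ways to get at a tag's text. The following is a small sketch on a made-up HTML string (an assumption, not the notebook's file) that compares `.get_text()`, its `strip=True` option, and `.string`.

```python
from bs4 import BeautifulSoup

# Made-up markup: padded title text and a <p> with mixed children
html = (
    "<html><head><title>  Test - A Sample Website  </title></head>"
    "<body><p>One <b>two</b></p></body></html>"
)
doc = BeautifulSoup(html, "html.parser")

# strip=True trims surrounding whitespace from each piece of text
title_text = doc.title.get_text(strip=True)

# get_text() joins the text of all descendants
p_text = doc.p.get_text()

# .string is None when a tag has more than one child
p_string = doc.p.string
b_string = doc.b.string

print(title_text)
print(p_text)
print(p_string)
print(b_string)
```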


## Try it yourself
1. match = soup.div
2. match = soup.find('div')
3. match = soup.find('div', class_='footer')
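
When experimenting with the lookups above, it helps to know what comes back when nothing matches. The sketch below uses a hypothetical two-`div` fragment (not the notebook's file) to show that `soup.div` and `find('div')` return the same first match, and that `find` returns `None` for a class that does not exist.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one article div and one footer div
html = (
    '<div class="article"><p>Summary</p></div>'
    '<div class="footer"><p>Footer Information</p></div>'
)
doc = BeautifulSoup(html, "html.parser")

first_div = doc.div                               # first <div> in the document
same_div = doc.find("div")                        # equivalent lookup
footer = doc.find("div", class_="footer")         # filter by CSS class
missing = doc.find("div", class_="sidebar")       # no such class in this fragment

print(first_div["class"])
print(footer.p.text)
print(missing)
```

Because a failed lookup gives `None`, chaining such as `missing.p.text` would raise an `AttributeError`, so checking the result first is usually worthwhile.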


In [8]:
article=soup.find('div',class_='article')
print(article)

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>


In [9]:
headline=article.h2.a.text
print(headline)

Article 1 Headline


In [10]:
summary=article.p.text
print(summary)

This is a summary of article 1
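
The headline's link target can be read in much the same way as its text. This is a minimal sketch on a hypothetical copy of the first article's markup: tag attributes behave like dictionary entries, and `.get()` returns `None` instead of raising when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical copy of the first article block from the sample page
html = (
    '<div class="article">'
    '<h2><a href="article_1.html">Article 1 Headline</a></h2>'
    '<p>This is a summary of article 1</p>'
    '</div>'
)
sample_article = BeautifulSoup(html, "html.parser").find("div", class_="article")

link = sample_article.h2.a
href = link["href"]        # dictionary-style access; raises KeyError if missing
title = link.get("title")  # .get() returns None for a missing attribute

print(href)
print(title)
print(link.attrs)
```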


In [11]:
for article in soup.find_all('div', class_='article'):
    # Headline text sits inside the <a> nested in the article's <h2>
    headline = article.h2.a.text
    print(headline)

    summary = article.p.text
    print(summary)

    print()  # blank line between articles


Article 1 Headline
This is a summary of article 1

Article 2 Headline
This is a summary of article 2
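
The notebook imports `csv` but does not use it in the cells shown, so the following is only a sketch of one way the loop's results might be saved. It assumes a hypothetical fragment in which the second article has no headline, guards against the missing tag, and writes to an in-memory buffer rather than a real file; the column names are made up for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical fragment: the second article deliberately lacks an <h2>/<a>
html = """
<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>
<div class="article">
<p>This article has no headline</p>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

rows = []
for item in doc.find_all("div", class_="article"):
    link = item.find("a")
    summary = item.find("p")
    rows.append({
        # Fall back to an empty string when a tag is missing
        "headline": link.get_text(strip=True) if link is not None else "",
        "link": link.get("href", "") if link is not None else "",
        "summary": summary.get_text(strip=True) if summary is not None else "",
    })

# Write to an in-memory buffer; swap in open('some_file.csv', 'w', newline='')
# to write to disk instead (the file name is a placeholder)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["headline", "link", "summary"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

Using `find` with an explicit `None` check avoids the `AttributeError` that `item.h2.a.text` would raise for the article without a headline.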

