In [1]:
from bs4 import BeautifulSoup
import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')  # Create a BeautifulSoup object; 'lxml' is the parser
    
print(soup)    
    

<!DOCTYPE html>
<html class="no-js" lang="">
<head>
<title>Test - A Sample Website</title>
<meta charset="utf-8"/>
<link href="css/normalize.css" rel="stylesheet"/>
<link href="css/main.css" rel="stylesheet"/>
</head>
<body>
<h1 id="site_title">Test Website</h1>
<hr/>
<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>
<hr/>
<div class="article">
<h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p>
</div>
<hr/>
<div class="footer">
<p>Footer Information</p>
</div>
<script src="js/vendor/modernizr-3.5.0.min.js"></script>
<script src="js/plugins.js"></script>
<script src="js/main.js"></script>
</body>
</html>


### To format the HTML with proper indentation, we can use the prettify() method

In [2]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <title>
   Test - A Sample Website
  </title>
  <meta charset="utf-8"/>
  <link href="css/normalize.css" rel="stylesheet"/>
  <link href="css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <h1 id="site_title">
   Test Website
  </h1>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_1.html">
     Article 1 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 1
   </p>
  </div>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_2.html">
     Article 2 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 2
   </p>
  </div>
  <hr/>
  <div class="footer">
   <p>
    Footer Information
   </p>
  </div>
  <script src="js/vendor/modernizr-3.5.0.min.js">
  </script>
  <script src="js/plugins.js">
  </script>
  <script src="js/main.js">
  </script>
 </body>
</html>


## How to get information from the HTML

### 1. The easiest way to access a tag is to just access it like an attribute

By doing this, we get the first tag with the name that we specify <br>
For example, if we look up the title tag, we get the first title tag that appears in the HTML <br>
We cannot filter by the tag's class (or any other attribute) this way

## Note
To access the attributes of a tag, we index the tag the same way we access the items of a dictionary <br>
Example:<br>
soup.a['href']
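
As a minimal sketch of the dictionary-style access described in this note, the snippet below parses a small inline copy of the first article (a stand-in for simple.html, so it runs on its own). The names `demo_soup`, `link`, `missing` and `all_attrs` are illustrative only, and the built-in 'html.parser' is used here instead of 'lxml' purely to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Inline stand-in for simple.html so the snippet runs on its own
html = """
<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>
"""

# 'html.parser' ships with Python; 'lxml' could be used here as well
demo_soup = BeautifulSoup(html, 'html.parser')

link = demo_soup.a             # first <a> tag in the document
href = link['href']            # dictionary-style access; raises KeyError if absent
missing = link.get('title')    # .get() returns None when the attribute is absent
all_attrs = link.attrs         # all attributes of the tag as a dict

print(href)
print(missing)
print(all_attrs)
```

Using .get() is usually the safer choice when an attribute may not be present on every tag.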

In [3]:
#Accessing the title tag of the html
match = soup.title
print(match)

<title>Test - A Sample Website</title>


In [4]:
#Accessing and returning only the text of the tag
match = soup.title.text
print(match)

Test - A Sample Website


In [5]:
#Accessing the div tag
soup.div

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>

### 2. By using the find() method
The find() method gives us more filtering options for drilling down to the tag we want, unlike attribute-style access, which can only return the first tag with a given name. <br>
<br>
It lets us specify attributes of the tag, such as its class or id, as arguments<br>

#Syntax:<br>
soup.find('tag_name', class_ = 'class_name')  #could be id as well

NOTE: Why was class_ used instead of just class?<br>
Because class is a reserved keyword in Python


<b>IMP NOTE:</b><br>
<i>Although find() lets us filter by the attributes we specify, it still returns only the first tag that satisfies the condition.<br></i>
So to access all the tags that satisfy the condition, we use find_all()
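
The filtering options above could be sketched as follows. This is an illustrative example on a trimmed inline copy of the page, not the notebook's own cell: `demo_soup`, `title_tag`, `footer`, `footer_again` and `no_match` are made-up names, the 'sidebar' class does not exist in the page and is there only to show the no-match case, and 'html.parser' is assumed so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Trimmed inline stand-in for simple.html
html = """
<h1 id="site_title">Test Website</h1>
<div class="article"><p>This is a summary of article 1</p></div>
<div class="footer"><p>Footer Information</p></div>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# Filter by id: id is not a reserved keyword, so no trailing underscore
title_tag = demo_soup.find('h1', id='site_title')

# Filter by class using the class_ keyword argument
footer = demo_soup.find('div', class_='footer')

# The same filter written with an attrs dictionary
footer_again = demo_soup.find('div', attrs={'class': 'footer'})

# find() returns None when no tag satisfies the condition
no_match = demo_soup.find('div', class_='sidebar')

print(title_tag.text)
print(footer.p.text)
print(no_match)
```

Because find() can return None, it is worth checking the result before chaining further lookups on it.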

In [6]:
#This one will do the same thing as soup.div
match = soup.find('div')
print(match)

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>


In [7]:
#Getting the div tag of class 'footer'
match = soup.find('div', class_ = 'footer')
match

<div class="footer">
<p>Footer Information</p>
</div>

### We can store a higher-level tag in a variable and then drill down from that variable to lower-level tags
i.e. the tags nested within the higher-level tag
<br><br>
For example:<br>
We can store the div tag in a variable<br>
And then drill down into this variable to get the h2, a and p tags nested within<br>

In [8]:
article = soup.find('div', class_ = 'article')
article

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>

In [9]:
#Now let's drill down into the div tag stored in the article variable to access the h2, a and p tags within
headline = article.h2.a.text
headline

'Article 1 Headline'

In [10]:
summary = article.p.text
summary

'This is a summary of article 1'

### 3. By using the find_all() method

Previously, we used the find() method to get a single tag of a given class.<br><br>
We can use the find_all() method to access all the tags that satisfy the condition/arguments. <br>
i.e. all the matching tags of a given class <br><br>
This method <b><i>will return a list of all the tags</i></b> that satisfy the conditions/arguments provided in the parentheses<br>

And so we can iterate over the items in the list
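
Since the result behaves like a list, the usual list operations apply to it. The sketch below illustrates this on a trimmed inline copy of the page; the variable names are hypothetical, the 'sidebar' class is an invented non-matching value, and 'html.parser' is assumed in place of 'lxml' so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Trimmed inline stand-in for simple.html
html = """
<div class="article"><h2><a href="article_1.html">Article 1 Headline</a></h2></div>
<div class="article"><h2><a href="article_2.html">Article 2 Headline</a></h2></div>
<div class="footer"><p>Footer Information</p></div>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

matches = demo_soup.find_all('div', class_='article')

count = len(matches)                           # number of matching tags
second_headline = matches[1].h2.a.text         # index into the result like a list
links = [div.a['href'] for div in matches]     # list comprehension over the result

# Unlike find(), find_all() returns an empty list when nothing matches
none_found = demo_soup.find_all('div', class_='sidebar')

print(count)
print(second_headline)
print(links)
print(len(none_found))
```

An empty result means a for loop over it simply does nothing, so no None check is needed before iterating.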

In [11]:
articles = soup.find_all('div', class_ = 'article')
print(articles)

[<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>, <div class="article">
<h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p>
</div>]


In [12]:
#looping over the list of tags returned by the find_all method
for article in articles:
    headline = article.h2.a.text
    print(headline)
    
    summary = article.p.text
    print(summary, end = '\n\n')
    

Article 1 Headline
This is a summary of article 1

Article 2 Headline
This is a summary of article 2



In [13]:
#Alternatively, we can loop directly over the result of find_all() without storing it in a variable first
for article in soup.find_all('div', class_ = 'article'):
    headline = article.h2.text
    print(headline)
    
    summary = article.p.text
    print(summary, end = '\n\n')
    

Article 1 Headline
This is a summary of article 1

Article 2 Headline
This is a summary of article 2

