### Introduction to SoupStrainer

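`SoupStrainer` tells BeautifulSoup which parts of a document to parse, so that only the matching tags end up in the resulting tree. It accepts the same kinds of filters as the search methods (a tag name, attribute values, a regular expression, and so on) and is passed to the `BeautifulSoup` constructor as the `parse_only` argument. Note that `parse_only` is not supported by the `html5lib` parser.

Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#soupstrainer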

### Importing SoupStrainer and BeautifulSoup

In [1]:
import re
from bs4 import SoupStrainer, BeautifulSoup

In [23]:
html_code = """
<html>
 <head>
  <title>
   The story of Tom and Jerry
  </title>
 </head>
 <body class="container">
  <h1>
   Tom and Jerry
  </h1>
  &gt;
  <img alt="cartoon_image" height="300" src="TomAndJerry.jpg" width="300"/>
  <p class="comedy animated series">
   Tom and Jerry is an American animated series of comedy short films created by
   <a class="creator" href="https://en.wikipedia.org/wiki/William_Hanna" id="link1">
    William Hanna
   </a>
   and
   <a class="creator" href="https://en.wikipedia.org/wiki/Joseph_Barbera" id="link2">
    Joseph Barbera
   </a>
   . 
        It centers on a rivalry between the title characters
   <a class="character" href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">
    Tom
   </a>
   , a cat, and
   <a class="character" href="https://en.wikipedia.org/wiki/Jerry_Mouse" id="link4">
    Jerry
   </a>
   , a mouse.
  </p>
  <div>
   <img alt="creator_image" height="300" name="William_Hanna" src="https://upload.wikimedia.org/wikipedia/commons/d/d2/William_Hanna_1977.jpg" width="300"/>
   <img alt="creator_image" height="300" name="Joseph_Barbera" src="https://upload.wikimedia.org/wikipedia/commons/6/67/JBarbera.jpg" width="300"/>
   <img src="https://upload.wikimedia.org/wikipedia/en/2/2f/Jerry_Mouse.png"/>
   <img extra-info="Tom_Cat" src="https://upload.wikimedia.org/wikipedia/en/f/f6/Tom_Tom_and_Jerry.png"/>
  </div>
  <p class="comedy story">
   <b>
    The series features comic fights between an iconic pair of adversaries, 
                a house cat (Tom) and a mouse (Jerry). The plots of each short usually center on Tom's 
                numerous attempts to capture Jerry and the mayhem and destruction that follows. 
                Tom rarely succeeds in catching Jerry, mainly because of Jerry's cleverness, 
                cunning abilities, and luck.
   </b>
  </p>
  <i>
   Tom and Jerry show is a full length comedy show
  </i>
 </body>
</html>
"""

### Parsing the document with a tag as the argument

In [24]:
div_tags = SoupStrainer("div")

The next cell passes the `SoupStrainer` object as the `parse_only` argument, so BeautifulSoup parses only the tags that satisfy the strainer's filter (here, the `div` tags) and skips the rest of the document.

In [25]:
soup = BeautifulSoup(html_code, "lxml", parse_only=div_tags)

print(soup.prettify())

<div>
 <img alt="creator_image" height="300" name="William_Hanna" src="https://upload.wikimedia.org/wikipedia/commons/d/d2/William_Hanna_1977.jpg" width="300"/>
 <img alt="creator_image" height="300" name="Joseph_Barbera" src="https://upload.wikimedia.org/wikipedia/commons/6/67/JBarbera.jpg" width="300"/>
 <img src="https://upload.wikimedia.org/wikipedia/en/2/2f/Jerry_Mouse.png"/>
 <img extra-info="Tom_Cat" src="https://upload.wikimedia.org/wikipedia/en/f/f6/Tom_Tom_and_Jerry.png"/>
</div>
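
To make it visible what the strainer leaves out, the following sketch compares a full parse with a strained parse. It uses a small hypothetical `sample_html` string rather than the notebook's `html_code`, and the built-in `html.parser` instead of `lxml` (an assumption made only so the snippet runs without extra dependencies).

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical miniature document, not the notebook's html_code.
sample_html = """
<html><body>
  <p>Intro with a <a href="https://example.com/a" id="link1">link</a>.</p>
  <div>
    <img src="one.png"/>
    <img src="two.png"/>
  </div>
</body></html>
"""

# Full parse: every tag ends up in the tree.
full_soup = BeautifulSoup(sample_html, "html.parser")

# Strained parse: only <div> tags and their contents are kept.
div_soup = BeautifulSoup(
    sample_html, "html.parser", parse_only=SoupStrainer("div")
)

# Collect the distinct tag names present in each tree.
full_tag_names = sorted({tag.name for tag in full_soup.find_all(True)})
div_tag_names = sorted({tag.name for tag in div_soup.find_all(True)})

print(full_tag_names)
print(div_tag_names)
```

The strained tree holds only the `div` and the `img` tags nested inside it; the `p` and `a` tags are never built.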


Here the strainer selects the `a` tags instead, so only the links in the document are parsed.

In [28]:
link_tags = SoupStrainer("a")

soup = BeautifulSoup(html_code, "lxml", parse_only=link_tags)

print(soup.prettify())

<a class="creator" href="https://en.wikipedia.org/wiki/William_Hanna" id="link1">
 William Hanna
</a>
<a class="creator" href="https://en.wikipedia.org/wiki/Joseph_Barbera" id="link2">
 Joseph Barbera
</a>
<a class="character" href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">
 Tom
</a>
<a class="character" href="https://en.wikipedia.org/wiki/Jerry_Mouse" id="link4">
 Jerry
</a>
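
A common reason to keep only the `a` tags is to collect link targets. The sketch below shows one way to do that on a strained tree. The `sample_html` string is a shortened, hypothetical excerpt, and `html.parser` is used as an assumption so that `lxml` is not required.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened, hypothetical excerpt with two links.
sample_html = """
<p>Created by
  <a class="creator" href="https://en.wikipedia.org/wiki/William_Hanna" id="link1">William Hanna</a>
  and
  <a class="creator" href="https://en.wikipedia.org/wiki/Joseph_Barbera" id="link2">Joseph Barbera</a>.
</p>
"""

# Parse only the <a> tags.
link_soup = BeautifulSoup(
    sample_html, "html.parser", parse_only=SoupStrainer("a")
)

# Pair each link's visible text with its href attribute.
links = [(a.get_text(strip=True), a["href"]) for a in link_soup.find_all("a")]

for text, href in links:
    print(f"{text}: {href}")
```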


The same approach works for images: with a strainer for `img`, BeautifulSoup parses only the `img` tags, wherever they appear in the document.

In [29]:
img_tags = SoupStrainer("img")

soup = BeautifulSoup(html_code, "lxml", parse_only=img_tags)

print(soup.prettify())

<img alt="cartoon_image" height="300" src="TomAndJerry.jpg" width="300"/>
<img alt="creator_image" height="300" name="William_Hanna" src="https://upload.wikimedia.org/wikipedia/commons/d/d2/William_Hanna_1977.jpg" width="300"/>
<img alt="creator_image" height="300" name="Joseph_Barbera" src="https://upload.wikimedia.org/wikipedia/commons/6/67/JBarbera.jpg" width="300"/>
<img src="https://upload.wikimedia.org/wikipedia/en/2/2f/Jerry_Mouse.png"/>
<img extra-info="Tom_Cat" src="https://upload.wikimedia.org/wikipedia/en/f/f6/Tom_Tom_and_Jerry.png"/>
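
Since a strainer takes the same filters as the search methods, a list of tag names should also work when more than one kind of tag is needed. This is a sketch under that assumption, with a hypothetical `sample_html` and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical miniature document.
sample_html = """
<h1>Title</h1>
<p>Text with <a href="https://example.com/tom" id="link3">Tom</a></p>
<img src="tom.png"/>
<div><img src="jerry.png"/></div>
"""

# A list of names keeps every tag whose name is in the list.
links_and_images = SoupStrainer(["a", "img"])

soup = BeautifulSoup(sample_html, "html.parser", parse_only=links_and_images)

# Tag names in document order.
kept_names = [tag.name for tag in soup.find_all(True)]
print(kept_names)
```

The surrounding `h1`, `p`, and `div` tags are dropped, while the `a` and `img` tags inside them are kept.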



### Parsing the document with an attribute as the argument

Here the strainer filters on an attribute rather than a tag name: only tags whose `alt` attribute equals `"creator_image"` are parsed.

In [30]:
alt_attr = SoupStrainer(alt="creator_image")

soup = BeautifulSoup(html_code, "lxml", parse_only=alt_attr)

print(soup.prettify())

<img alt="creator_image" height="300" name="William_Hanna" src="https://upload.wikimedia.org/wikipedia/commons/d/d2/William_Hanna_1977.jpg" width="300"/>
<img alt="creator_image" height="300" name="Joseph_Barbera" src="https://upload.wikimedia.org/wikipedia/commons/6/67/JBarbera.jpg" width="300"/>
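
Keyword arguments only work for attribute names that are valid Python identifiers. An attribute such as `extra-info` contains a hyphen, so it has to be passed through the `attrs` dictionary instead. The sketch below illustrates this on a hypothetical `sample_html` string, again with `html.parser` as an assumption.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical excerpt with one hyphenated attribute.
sample_html = """
<div>
  <img alt="creator_image" src="hanna.jpg"/>
  <img src="jerry.png"/>
  <img extra-info="Tom_Cat" src="tom.png"/>
</div>
"""

# SoupStrainer(extra-info="Tom_Cat") would be a syntax error,
# so the attribute filter goes into the attrs dictionary.
extra_info_attr = SoupStrainer(attrs={"extra-info": "Tom_Cat"})

soup = BeautifulSoup(sample_html, "html.parser", parse_only=extra_info_attr)

matched_sources = [img["src"] for img in soup.find_all("img")]
print(matched_sources)
```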



An attribute filter can also be a regular expression. Here BeautifulSoup parses only the tags whose `id` attribute matches the pattern `link`.

In [31]:
id_attr = SoupStrainer(id=re.compile('link'))

soup = BeautifulSoup(html_code, "lxml", parse_only=id_attr)

print(soup.prettify())

<a class="creator" href="https://en.wikipedia.org/wiki/William_Hanna" id="link1">
 William Hanna
</a>
<a class="creator" href="https://en.wikipedia.org/wiki/Joseph_Barbera" id="link2">
 Joseph Barbera
</a>
<a class="character" href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">
 Tom
</a>
<a class="character" href="https://en.wikipedia.org/wiki/Jerry_Mouse" id="link4">
 Jerry
</a>
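
Beautiful Soup applies a compiled pattern with `search()`, so `re.compile('link')` matches any `id` that contains `link` anywhere. The sketch below contrasts that with an anchored pattern. The `sample_html` string and the `matched_ids` helper are hypothetical, and `html.parser` is used as an assumption.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical excerpt; the last id contains "link" but does not start with it.
sample_html = """
<p>
  <a href="/hanna" id="link1">William Hanna</a>
  <a href="/barbera" id="link2">Joseph Barbera</a>
  <a href="/tom" id="link3">Tom</a>
  <a href="/about" id="footer-link">About</a>
</p>
"""


def matched_ids(strainer):
    """Parse sample_html with the given strainer and return the ids it kept."""
    soup = BeautifulSoup(sample_html, "html.parser", parse_only=strainer)
    return [tag["id"] for tag in soup.find_all(True)]


# Unanchored pattern: matches "link" anywhere in the id.
loose_ids = matched_ids(SoupStrainer(id=re.compile("link")))

# Anchored pattern: matches only the ids "link1" and "link2".
strict_ids = matched_ids(SoupStrainer(id=re.compile(r"^link[12]$")))

print(loose_ids)
print(strict_ids)
```

Anchoring the pattern with `^` and `$` is a simple way to avoid picking up unrelated ids that merely contain the same substring.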


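Passing `True` as the attribute value matches every tag that has that attribute, whatever its value. Here only the tags with an `href` attribute are parsed.

In [32]:
href_attr = SoupStrainer(href=True)

soup = BeautifulSoup(html_code, "lxml", parse_only=href_attr)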

print(soup.prettify())

<a class="creator" href="https://en.wikipedia.org/wiki/William_Hanna" id="link1">
 William Hanna
</a>
<a class="creator" href="https://en.wikipedia.org/wiki/Joseph_Barbera" id="link2">
 Joseph Barbera
</a>
<a class="character" href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">
 Tom
</a>
<a class="character" href="https://en.wikipedia.org/wiki/Jerry_Mouse" id="link4">
 Jerry
</a>
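
An attribute-only filter such as `href=True` is not limited to `a` tags: any tag carrying the attribute passes. Combining a tag name with the attribute narrows the match. The following is a sketch on a hypothetical `sample_html` string, with `html.parser` as an assumption and a hypothetical `kept_tag_names` helper.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical excerpt: <link> also carries an href attribute.
sample_html = """
<body>
  <a href="https://en.wikipedia.org/wiki/Tom_Cat" id="link3">Tom</a>
  <a id="no-href">Anchor without href</a>
  <link href="style.css" rel="stylesheet"/>
  <img src="tom.png"/>
</body>
"""


def kept_tag_names(strainer):
    """Parse sample_html with the given strainer and return the kept tag names."""
    soup = BeautifulSoup(sample_html, "html.parser", parse_only=strainer)
    return [tag.name for tag in soup.find_all(True)]


# Any tag that has an href attribute.
any_href_names = kept_tag_names(SoupStrainer(href=True))

# Only <a> tags that have an href attribute.
anchor_href_names = kept_tag_names(SoupStrainer("a", href=True))

print(any_href_names)
print(anchor_href_names)
```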
