
### Scraping for extending the dataset



In [0]:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [0]:
soup = BeautifulSoup(html_doc, 'html.parser')

In [4]:
print (soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


Here are some simple ways to navigate that data structure:

In [30]:
print (soup.title)
print (soup.title.name)
print (soup.title.string)
print (soup.title.parent)
print (soup.p)
print (soup.p['class'])
print (soup.a)
print (soup.find_all('a'))
print (soup.find(href="http://example.com/elsie"))
print (soup.find(id="link3"))

<title>The Dormouse's story</title>
title
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


### Objects in bs4

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
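
As a quick illustration of the four kinds, the sketch below parses a tiny snippet and prints the type of each piece. The markup and the variable names (`doc`, `p_tag`, and so on) are made up for this example and are not part of the notebook's sample document.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup chosen to contain a tag, a text node, and a comment.
markup = "<p>Hello<!-- a note --></p>"
doc = BeautifulSoup(markup, "html.parser")

p_tag = doc.p
text_node, comment_node = p_tag.contents

print(type(doc).__name__)           # BeautifulSoup
print(type(p_tag).__name__)         # Tag
print(type(text_node).__name__)     # NavigableString
print(type(comment_node).__name__)  # Comment
```

Comment is a subclass of NavigableString, so an `isinstance(node, NavigableString)` check matches comments as well as ordinary text.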

#### Tag

A Tag object corresponds to an XML or HTML tag in the original document:



In [0]:
# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print (type(tag))


Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

#### Name

Every tag has a name, accessible as .name:



In [32]:
tag.name

'b'

#### Attributes

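A tag may have any number of attributes. The tag `<b class="boldest">` has an attribute “class” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary: `tag['class']`. Because “class” can hold several values, Beautiful Soup returns it as a list. The full dictionary of attributes is available as `.attrs`: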



In [34]:
print (tag.attrs)

{'class': ['boldest']}
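
The dictionary-style access described above can be sketched as follows. The tag here is a hypothetical one with an extra `id` attribute, and it is parsed into its own `demo` object so that the notebook's `tag` variable is left untouched.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with both a single-valued and a multi-valued attribute.
demo = BeautifulSoup(
    '<b id="headline" class="boldest large">Extremely bold</b>', "html.parser"
)
b_tag = demo.b

print(b_tag["id"])         # single-valued attribute: a plain string
print(b_tag["class"])      # multi-valued attribute: a list of strings
print(b_tag.get("title"))  # .get() returns None for a missing attribute

# Attributes can be changed and deleted like dictionary entries.
b_tag["id"] = "lead"
del b_tag["class"]
print(b_tag)
```

Using `.get()` rather than square brackets avoids a `KeyError` when scraped pages do not always carry the attribute.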


#### NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:



In [36]:
print (tag.string)

print (type(tag.string)) # NOTE: not 'str'

Extremely bold
<class 'bs4.element.NavigableString'>


A NavigableString is just like a Python string, except that it also supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the .contents or .string attributes, or the find() method.

**If you want to use a NavigableString outside of Beautiful Soup, you should call str() on it (unicode() in Python 2) to turn it into a normal Python string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.**
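
A minimal sketch of that conversion, assuming Python 3, where the built-in `str()` plays the role of `unicode()`. The names `snippet`, `nav_string`, and `plain` are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

snippet = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
nav_string = snippet.b.string

plain = str(nav_string)  # plain Python str, detached from the parse tree

print(type(nav_string).__name__)  # NavigableString
print(type(plain).__name__)       # str
print(nav_string.parent.name)     # the NavigableString still knows its parent tag
print(hasattr(plain, "parent"))   # the plain str does not
```

Converting before storing scraped text in a list or DataFrame means the parse tree can be garbage-collected once the soup goes out of scope.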

### Navigating the tree

#### Going down

Tags may contain strings and other tags. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>` tag, just say `soup.head`:



In [38]:
soup = BeautifulSoup(html_doc, 'html.parser')

# Access a tag by using its name as an attribute of the soup:

print (soup.head)

# This code gets the first <b> tag beneath the <body> tag:

print (soup.body.b)

<head><title>The Dormouse's story</title></head>
<b>The Dormouse's story</b>


**Using a tag name as an attribute will give you only the first tag by that name**

If you need to get all the `<a>` tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all():



In [40]:
print (soup.a)
print (soup.find_all('a'))

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
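
Since find_all() returns a list-like result, it combines naturally with attribute access. The sketch below builds an id-to-URL mapping from a trimmed copy of the sample links. The names `links_html`, `links_soup`, and `link_table` are introduced only for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document's links, so the block runs on its own.
links_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
links_soup = BeautifulSoup(links_html, "html.parser")

# find_all() returns a list-like ResultSet, so it works in a comprehension.
link_table = {a["id"]: a["href"] for a in links_soup.find_all("a")}
print(link_table)
```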


#### .contents and .children

A tag’s children are available in a list called .contents:



In [45]:
head_tag = soup.head

print (head_tag)

print (head_tag.contents)

# each content is also a 'Tag'
print (type(head_tag.contents[0]))

title_tag = head_tag.contents[0]

print (title_tag.contents)

# now the content is a 'NavigableString'
print (type(title_tag.contents[0]))

<head><title>The Dormouse's story</title></head>
[<title>The Dormouse's story</title>]
<class 'bs4.element.Tag'>
["The Dormouse's story"]
<class 'bs4.element.NavigableString'>


Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:



In [47]:
for child in title_tag.children:
  print(child)


The Dormouse's story


#### .descendants

The .contents and .children attributes only consider a tag’s direct children. For instance, the `<head>` tag has a single direct child: the `<title>` tag.

But the `<title>` tag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the `<head>` tag. The .descendants attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:



In [49]:
for child in head_tag.descendants:
    print(child, type(child))


<title>The Dormouse's story</title> <class 'bs4.element.Tag'>
The Dormouse's story <class 'bs4.element.NavigableString'>


PROJECT: We can use this for our scraping. We can keep only the descendants that are strings (NavigableString) and skip the tags.

The `<head>` tag has only one child, but it has two descendants: the `<title>` tag and the `<title>` tag’s child. The BeautifulSoup object only has one direct child (the `<html>` tag), but it has a whole lot of descendants.
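
One way to keep only the string descendants is an `isinstance` check against NavigableString. This is a sketch of the idea rather than a finished scraper, and `page` and `text_nodes` are names invented for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

page = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)

# Keep only the descendants that are text nodes, skipping Tag objects.
text_nodes = [
    str(node)
    for node in page.head.descendants
    if isinstance(node, NavigableString)
]
print(text_nodes)  # ["The Dormouse's story"]
```

HTML comments are also NavigableString instances, so a real page may need an extra filter to exclude them.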

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:



In [53]:
print (head_tag.contents)

print (head_tag.string)

# If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

print (soup.string)

[<title>The Dormouse's story</title>]
The Dormouse's story
None


PROJECT: The following blocks are very important.

#### .strings and .stripped_strings

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:



In [55]:
for string in soup.strings:
  print (repr(string))

'\n'
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'


These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:



In [56]:
for string in soup.stripped_strings:
    print(repr(string))

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'


Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
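
For the dataset, the stripped strings can be joined into a single line of text. The sketch below assumes a single space is an acceptable separator, and it also collapses the newline that survives inside the last string. `story_html`, `story`, and `text` are illustrative names.

```python
from bs4 import BeautifulSoup

story_html = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
story = BeautifulSoup(story_html, "html.parser")

# stripped_strings trims only the ends of each string, so split() and join()
# are used to collapse the newlines that remain inside a string.
text = " ".join(" ".join(s.split()) for s in story.stripped_strings)
print(text)
```

Joining with a space leaves a space before punctuation that sat in its own string (such as the comma after “Elsie”), so the result may need a further clean-up pass.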

#### Going up

