
In [None]:
import bs4
from bs4 import BeautifulSoup

In [None]:
bs4.__version__

'4.6.3'

In [None]:
doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

**Defining a soup object**

In [None]:
soup = BeautifulSoup(doc, 'html.parser')

**Pretty-printing**

In [None]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


**Simple ways to navigate**

In [None]:
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent)
print(soup.p)
print(soup.p['class'])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))

<title>The Dormouse's story</title>
title
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
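
The lookups above assume the element exists. A minimal sketch of what happens when it does not, using a small made-up fragment (the `snippet` markup and the `link4` id are illustrative assumptions, not part of the notebook's document):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration.
snippet = '<p class="story"><a id="link1" href="http://example.com/elsie">Elsie</a></p>'
demo = BeautifulSoup(snippet, 'html.parser')

missing_tag = demo.find(id='link4')   # no element with that id: find() gives None
missing_by_name = demo.table          # no <table> tag: attribute-style access gives None
no_matches = demo.find_all('img')     # no matches: find_all() gives an empty list

print(missing_tag, missing_by_name, no_matches)

# Guard before dereferencing, since None supports neither .get() nor ['href'].
link = demo.find(id='link1')
href = link.get('href') if link is not None else None
print(href)
```

Checking for `None` before indexing avoids an error when a page does not contain the expected element.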


**Extracting an attribute**

In [None]:
for link in soup.find_all('a'):
  print(link.get('href'), link['href'])

http://example.com/elsie http://example.com/elsie
http://example.com/lacie http://example.com/lacie
http://example.com/tillie http://example.com/tillie


**Extracting the text**

In [None]:
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
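
`get_text()` also accepts a separator and a `strip` flag, which can help when the raw output carries stray newlines. A small sketch on a made-up fragment (the markup is an assumption chosen for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment, not the notebook's document.
fragment = BeautifulSoup('<p>Hello <b>world</b>!</p>', 'html.parser')

# Default: strings are concatenated exactly as they appear.
plain = fragment.get_text()

# With a separator and strip=True, each piece is stripped, then joined.
joined = fragment.get_text('|', strip=True)

print(repr(plain))
print(repr(joined))
```

The separator makes it visible where one text node ends and the next begins.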



# **Tag**

In [None]:
tag = soup.p
print(type(tag))
# Name
print(tag.name)
# Attributes
print(tag['class'])
# Access the attribute dictionary
print(tag.attrs)

<class 'bs4.element.Tag'>
p
['title']
{'class': ['title']}


**Adding, removing, and modifying a tag's attributes**

In [None]:
tag['id'] = 'new_id'
tag['another-attribute'] = 1
soup

<html><head><title>The Dormouse's story</title></head>
<body>
<p another-attribute="1" class="title" id="new_id"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [None]:
del tag['id']
del tag['another-attribute']
soup

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [None]:
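# Indexing a missing attribute raises KeyError
try:
  tag['id']
except KeyError as error:
  print(type(error).__name__)

# .get() returns None instead of raising
print(tag.get('id'))

KeyError
None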
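
Besides catching the error, an attribute can be tested for or given a fallback. A minimal sketch, assuming a throwaway fragment (the markup and the `'no-id'` default are illustrative choices):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only.
demo = BeautifulSoup('<p class="title">Text</p>', 'html.parser')
p_tag = demo.p

has_class = p_tag.has_attr('class')   # the attribute is present
has_id = p_tag.has_attr('id')         # the attribute is absent
fallback = p_tag.get('id', 'no-id')   # default returned when the attribute is absent

print(has_class, has_id, fallback)
```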


**Multi-valued attributes**


In [None]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']

['body', 'strikeout']

In [None]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
print(rel_soup.a['rel'])
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

['index']
<p>Back to the <a rel="index contents">homepage</a></p>


In [None]:
no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
no_list_soup.p.get_attribute_list('class')

['body', 'strikeout']

In [None]:
# With the XML parser there are no multi-valued attributes
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
display(xml_soup.p['class'], xml_soup.p.get_attribute_list('class'))

'body strikeout'

['body strikeout']
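
Only attributes that HTML defines as multi-valued (such as `class` and `rel`) are split into lists; others, like `id`, stay as a single string even when they contain spaces. A short sketch with a made-up tag:

```python
from bs4 import BeautifulSoup

# Hypothetical tag carrying a space inside both id and class.
demo = BeautifulSoup('<p id="my id" class="body strikeout"></p>', 'html.parser')

id_value = demo.p['id']                  # kept as one string
class_value = demo.p['class']            # split into a list
class_string = ' '.join(class_value)     # rebuild the original attribute text
id_as_list = demo.p.get_attribute_list('id')  # always a list, whatever the attribute

print(repr(id_value), class_value, repr(class_string), id_as_list)
```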

# **NavigableString**

In [None]:
tag = soup.b
print(type(tag), type(tag.string))

<class 'bs4.element.Tag'> <class 'bs4.element.NavigableString'>


In [None]:
unicode_string = str(tag.string)
type(unicode_string)

str

**Replacing a string**

In [None]:
tag.string.replace_with('Halo')
soup

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>Halo</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
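
`replace_with()` returns the element that was removed, which is handy if the old text is still needed. A minimal sketch on a made-up tag (the strings are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document.
demo = BeautifulSoup('<b>old text</b>', 'html.parser')

# The call swaps the string inside the tree and hands back the old one.
removed = demo.b.string.replace_with('new text')

print(demo)
print(repr(removed))
```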

# **BeautifulSoup**

In [None]:
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
doc.find(string='INSERT FOOTER HERE').replace_with(footer)

'INSERT FOOTER HERE'

In [None]:
doc.name

'[document]'

**Comments and other special strings**

In [None]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
tag = "<b>Hey, buddy. Want to buy a used parser</b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(soup.b.prettify())
print(type(comment))
soup = BeautifulSoup(tag, 'html.parser')
comment = soup.b.string
print(type(comment))

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
<class 'bs4.element.Comment'>
<class 'bs4.element.NavigableString'>
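
Because `Comment` is a subclass of `NavigableString`, separating comments from ordinary text takes a type check. One possible sketch, on a made-up fragment:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical markup mixing visible text and a comment.
demo = BeautifulSoup('<p>Visible text<!-- hidden note --></p>', 'html.parser')

# Comments: any descendant that is an instance of Comment.
comments = [str(node) for node in demo.descendants if isinstance(node, Comment)]

# Ordinary strings: exact type match, so the Comment subclass is excluded.
visible = [str(node) for node in demo.descendants if type(node) is NavigableString]

print(comments)
print(visible)
```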


# **Navigating the tree**

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>"""

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')

In [None]:
soup.head, soup.title

(<head><title>The Dormouse's story</title></head>,
 <title>The Dormouse's story</title>)

In [None]:
soup.body.b

<b>The Dormouse's story</b>

In [None]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [None]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

**.contents and .children**

In [None]:
head_tag = soup.head
print(head_tag)
print(head_tag.contents)

<head><title>The Dormouse's story</title></head>
[<title>The Dormouse's story</title>]


In [None]:
soup.contents

['\n', <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p></body></html>]

**Iterating**

In [None]:
for child in soup.children:
  print(child)



<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>


In [None]:
for child in head_tag.descendants:
    print(child)

<title>The Dormouse's story</title>
The Dormouse's story


In [None]:
print(len(list(soup.children)))
print(len(list(soup.descendants)))

2
26
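
The gap between the two counts comes from depth: `.children` yields only direct children, while `.descendants` walks the whole subtree, text nodes included. A small sketch on a made-up fragment where the nodes can be counted by hand:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <div> has one child, <p>, which holds text and a <b>.
demo = BeautifulSoup('<div><p>one <b>two</b></p></div>', 'html.parser')
div = demo.div

direct = list(div.children)       # just the <p> tag
nested = list(div.descendants)    # <p>, 'one ', <b>, 'two'

print(len(direct), len(nested))
print([repr(node) for node in nested])
```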


In [None]:
for descend in soup.descendants:
  print(descend)
  print('--------')



--------
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>
--------
<head><title>The Dormouse's story</title></head>
--------
<title>The Dormouse's story</title>
--------
The Dormouse's story
--------


--------
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com

# **.strings and .stripped_strings**

In [None]:
for string in soup.strings:
  print(repr(string))

'\n'
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'


In [None]:
for string in soup.stripped_strings:
  print(repr(string))

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'
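
Compared with `.strings`, `.stripped_strings` drops whitespace-only strings and trims the rest. A minimal sketch on a made-up list (the markup is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical list with newlines between items and padding inside them.
markup = '<ul>\n<li> one </li>\n<li> two </li>\n</ul>'
demo = BeautifulSoup(markup, 'html.parser')

raw = [str(s) for s in demo.strings]        # includes the '\n' separators
clean = list(demo.stripped_strings)         # whitespace-only strings are skipped

print(raw)
print(clean)
```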


**.parent and .parents**

In [None]:
soup.title.parent

<head><title>The Dormouse's story</title></head>

In [None]:
soup.title.string.parent

<title>The Dormouse's story</title>

In [None]:
for parent in soup.a.parents:
  print(parent)

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a clas
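
Printing each parent in full gets long quickly. Printing only the names gives a compact view of the ancestor chain, as in this sketch on a made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
demo = BeautifulSoup('<html><body><p><a>x</a></p></body></html>', 'html.parser')

# Walk upward from <a>; the last ancestor is the BeautifulSoup object itself,
# whose name is '[document]'.
ancestor_names = [parent.name for parent in demo.a.parents]

print(ancestor_names)
```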

# **Going sideways**

In [None]:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())

<a>
 <b>
  text1
 </b>
 <c>
  text2
 </c>
</a>


In [None]:
sibling_soup.b.next_sibling

<c>text2</c>

In [None]:
sibling_soup.c.previous_sibling

<b>text1</b>

In [None]:
print(sibling_soup.b.previous_sibling)
print(sibling_soup.c.next_sibling)

None
None


**.next_siblings and .previous_siblings**

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')

In [None]:
for sibling in soup.a.next_siblings:
  print(repr(sibling))

for sibling in soup.find(id='link3').previous_siblings:
  print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
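
In real documents the immediate sibling of a tag is often a string (punctuation or whitespace) rather than another tag. When the next tag is what is wanted, `find_next_sibling()` skips over the strings. A small sketch on a made-up paragraph (the ids are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with text between two links.
demo = BeautifulSoup('<p><a id="first">A</a>, <a id="second">B</a></p>', 'html.parser')
first = demo.find(id='first')

immediate = first.next_sibling              # the ', ' string between the links
next_link = first.find_next_sibling('a')    # the next <a> tag, skipping the string

print(repr(immediate))
print(next_link['id'])
```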


# **Going back and forth**

**.next_element and .previous_element**

In [None]:
last_a_tag  = soup.find('a', id='link3')
last_a_tag

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [None]:
last_a_tag.next_sibling

';\nand they lived at the bottom of a well.'

In [None]:
last_a_tag.next_element

'Tillie'

**.next_elements and .previous_elements**

In [None]:
for element in last_a_tag.next_elements:
    print(repr(element))

'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
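
`.next_elements` follows parse order, so it mixes tags and strings. If only the tags are of interest, they can be filtered by type. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment: <a> is followed in parse order by 'x', <p>, 'y', <b>, 'z'.
demo = BeautifulSoup('<div><a>x</a><p>y<b>z</b></p></div>', 'html.parser')
start = demo.a

everything = [repr(element) for element in start.next_elements]
tags_only = [element.name for element in start.next_elements if isinstance(element, Tag)]

print(everything)
print(tags_only)
```

Note that the first element after `<a>` is its own string, not its sibling, which matches the `next_element` versus `next_sibling` difference shown above.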


# Searching the tree