# Web Scraping with Beautiful Soup

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) is a Python package for parsing HTML and XML documents. It builds a parse tree from each parsed page, and that tree can be used to extract data from the HTML, which makes it useful for web scraping.
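
As a minimal sketch of the parse tree idea, the snippet below parses a small inline HTML string instead of a file. The HTML content and the variable names (`html`, `soup_example`, `page_title`, `paragraph_text`) are made up for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
html = (
    "<html><head><title>Example page</title></head>"
    "<body><p class='intro'>Hello, <b>world</b>!</p></body></html>"
)

# Build the parse tree with Python's built-in HTML parser
soup_example = BeautifulSoup(html, "html.parser")

# Tags in the tree can be reached as attributes of the soup object
page_title = soup_example.title.text
# get_text() joins the text of a tag and all of its descendants
paragraph_text = soup_example.p.get_text()

print(page_title)
print(paragraph_text)
```

Parsing from a string like this is handy for quick experiments before pointing the same code at a saved page.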

In [2]:
# Import the BeautifulSoup class
from bs4 import BeautifulSoup

# Read the HTML file
with open("Beautiful Soup Documentation.html", "r", encoding="utf-8") as arquivo:
    soup = BeautifulSoup(arquivo, "html.parser")

# print(soup.prettify())

In [3]:
# Access the page title
titulo = soup.title
print(titulo.text)

Beautiful Soup Documentation — Beautiful Soup 4.12.0 documentation


#### The `find()` method of BeautifulSoup returns the first element with a given tag name:

In [4]:
h1 = soup.find("h1")
print(h1.prettify())

<h1>
 Beautiful Soup Documentation
 <a class="headerlink" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#module-bs4" title="Permalink to this heading">
  ¶
 </a>
</h1>



In [5]:
# Access the first link on the page
link = soup.find("a")
print(link.prettify())

<a accesskey="I" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/genindex.html" title="General Index">
 index
</a>
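
One detail worth keeping in mind: `find()` returns `None` when nothing matches, so calling a method on the result fails with an `AttributeError`. The sketch below shows one possible guard, using a made-up HTML snippet and hypothetical variable names.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that contains a link but no table
html = '<div><a href="/docs">Docs</a></div>'
doc = BeautifulSoup(html, "html.parser")

# find() returns None when no element matches
table = doc.find("table")
if table is None:
    table_summary = "no table found"
else:
    table_summary = table.get_text()

# When there is a match, find() returns the first matching Tag
first_link = doc.find("a")

print(table_summary)
print(first_link["href"])
```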



#### The `find_all()` method returns every element with a given tag name, as a Python list:

In [6]:
# Find all links on the page
links = soup.find_all("a")
print(links)

[<a accesskey="I" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/genindex.html" title="General Index">index</a>, <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/py-modindex.html" title="Python Module Index">modules</a>, <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#">Beautiful Soup 4.12.0 documentation</a>, <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup Documentation</a>, <a class="headerlink" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#module-bs4" title="Permalink to this heading">¶</a>, <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>, <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Beautiful Soup 3</a>, <a class="reference internal" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#porting-code-to-bs4">Porting code to BS4</a>, <a class="reference external" href="https:/

In [7]:
# Access the first link in the list
print(links[0].prettify())

<a accesskey="I" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/genindex.html" title="General Index">
 index
</a>



In [8]:
# Access the link's attributes
print(links[0].attrs)

{'href': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/genindex.html', 'title': 'General Index', 'accesskey': ['I']}


In [9]:
# Extract the link URL
print(links[0]["href"])

https://www.crummy.com/software/BeautifulSoup/bs4/doc/genindex.html


In [10]:
# Extract every link URL on the page
for link in links:
    print(link["href"])

https://www.crummy.com/software/BeautifulSoup/bs4/doc/genindex.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc/py-modindex.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#module-bs4
http://www.crummy.com/software/BeautifulSoup/
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#porting-code-to-bs4
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
http://kondou.com/BS4/
https://www.crummy.com/software/BeautifulSoup/bs4/doc.ko/
https://www.crummy.com/software/BeautifulSoup/bs4/doc.ptbr
https://www.crummy.com/software/BeautifulSoup/bs4/doc.ru/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#getting-help
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#diagnose
https://www.crummy.com/software/BeautifulS
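
The loop above works because every `<a>` in this page has an `href`. On other pages, `link["href"]` raises a `KeyError` for anchors without that attribute, and links may be relative. The sketch below is one possible way to handle both cases, using `Tag.get()` and the standard library's `urljoin`. The HTML and `base_url` are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and snippet, used only for illustration
base_url = "https://example.com/docs/"
html = (
    '<a href="index.html">Index</a>'
    '<a name="top">Anchor without href</a>'
    '<a href="https://example.org/">External</a>'
)
doc = BeautifulSoup(html, "html.parser")

urls = []
for link in doc.find_all("a"):
    # get() returns None instead of raising KeyError when href is missing
    href = link.get("href")
    if href is None:
        continue
    # Resolve relative links against the base URL; absolute links pass through
    urls.append(urljoin(base_url, href))

print(urls)
```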

#### Finding elements with several rules

In [11]:
elementos_navegacao = soup.find_all(["h1", "h2", "h3"])
print(elementos_navegacao)

[<h3>Navigation</h3>, <h1>Beautiful Soup Documentation<a class="headerlink" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#module-bs4" title="Permalink to this heading">¶</a></h1>, <h2>Getting help<a class="headerlink" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#getting-help" title="Permalink to this heading">¶</a></h2>, <h1>Quick Start<a class="headerlink" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start" title="Permalink to this heading">¶</a></h1>, <h1>Installing Beautiful Soup<a class="headerlink" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup" title="Permalink to this heading">¶</a></h1>, <h2>Installing a parser<a class="headerlink" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser" title="Permalink to this heading">¶</a></h2>, <h1>Making the soup<a class="headerlink" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup" title="Perma
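
Because `find_all()` with a list of tag names returns matches in document order, the result can be turned into a simple outline. The following is a rough sketch on a made-up snippet; the indentation scheme and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical heading structure, used only for illustration
html = "<h1>Guide</h1><h2>Install</h2><h3>Linux</h3><h2>Usage</h2>"
doc = BeautifulSoup(html, "html.parser")

outline = []
# Passing a list matches any of the given tag names, in document order
for heading in doc.find_all(["h1", "h2", "h3"]):
    # heading.name is "h1", "h2" or "h3"; the digit gives the level
    level = int(heading.name[1])
    outline.append("  " * (level - 1) + heading.get_text(strip=True))

print("\n".join(outline))
```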

#### Other `find()` rules
- by class (`class_`)
- by id (`id`)
- by attributes (`role`, `href`, `src`, etc.)
- by text (`string`)
- by partial text

In [12]:
imagem = soup.find("img")
print(imagem)
print(imagem["src"])

<img alt='"The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."' class="align-right" src="./Beautiful Soup Documentation_files/6.1.jpg"/>
./Beautiful Soup Documentation_files/6.1.jpg


In [13]:
# Search by class
headerlink = soup.find(class_="headerlink")
print(headerlink)

<a class="headerlink" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#module-bs4" title="Permalink to this heading">¶</a>


In [14]:
# Search by id
qs = soup.find(id="quick-start")
print(qs.prettify())

<section id="quick-start">
 <h1>
  Quick Start
  <a class="headerlink" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start" title="Permalink to this heading">
   ¶
  </a>
 </h1>
 <p>
  Here’s an HTML document I’ll be using as an example throughout this
document. It’s part of a story from
  <cite>
   Alice in Wonderland
  </cite>
  :
 </p>
 <div bis_skin_checked="1" class="highlight-default notranslate">
  <div bis_skin_checked="1" class="highlight">
   <pre><span></span><span class="n">html_doc</span> <span class="o">=</span> <span class="s2">"""&lt;html&gt;&lt;head&gt;&lt;title&gt;The Dormouse's story&lt;/title&gt;&lt;/head&gt;</span>
<span class="s2">&lt;body&gt;</span>
<span class="s2">&lt;p class="title"&gt;&lt;b&gt;The Dormouse's story&lt;/b&gt;&lt;/p&gt;</span>

<span class="s2">&lt;p class="story"&gt;Once upon a time there were three little sisters; and their names were</span>
<span class="s2">&lt;a href="http://example.com/elsie" class="sister" id="link1"&g

In [15]:
# Search with multiple filters (the "#" values are placeholders)
elemento = soup.find(id="#", role="#")

Note: when an attribute name contains special characters (such as the hyphens in `data-ll-status`), it cannot be written as a keyword argument, so pass the filters as a dictionary:

In [16]:
logo = soup.find("img", {"data-ll-status": "loaded", "class": "custom-logo"})
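
To see the dictionary form in action, here is a small sketch on a made-up snippet. The attribute names `data-ll-status` and `custom-logo` come from the cell above; the `src` values and the second image are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two images, used only for illustration
html = (
    '<img class="custom-logo" data-ll-status="loaded" src="logo.png"/>'
    '<img class="banner" data-ll-status="loading" src="banner.png"/>'
)
doc = BeautifulSoup(html, "html.parser")

# data-ll-status is not a valid Python identifier, so it goes in a dict;
# every key in the dict must match for the element to be returned
logo = doc.find("img", attrs={"data-ll-status": "loaded", "class": "custom-logo"})
logo_src = logo["src"]

# Filtering only on the hyphenated attribute works the same way
loading_images = doc.find_all("img", attrs={"data-ll-status": "loading"})
loading_sources = [img["src"] for img in loading_images]

print(logo_src)
print(loading_sources)
```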

#### Searching by text

In [17]:
# Text matches exactly
instalacao = soup.find(string="Installing Beautiful Soup")
print(instalacao)

Installing Beautiful Soup


In [18]:
# Text contains
import re  # Regular expressions

textos = soup.find_all(string=re.compile("Installing"))
print(textos)

['Installing Beautiful Soup', 'Installing a parser', 'Installing a parser', '# I noticed that html5lib is not installed. Installing it may help.', 'Installing a parser', 'Installing a parser', 'Installing Beautiful Soup', 'Installing a parser']
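
Note that searching with `string=` returns the matching text nodes rather than the tags that contain them, and that the compiled pattern is case-sensitive by default. The sketch below, on a made-up snippet with hypothetical variable names, compares a case-sensitive pattern with one compiled using `re.IGNORECASE`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration
html = (
    "<h1>Installing the package</h1>"
    "<p>After installing, run the tests.</p>"
    "<p>Usage notes</p>"
)
doc = BeautifulSoup(html, "html.parser")

# The pattern is searched anywhere inside each text node
case_sensitive = doc.find_all(string=re.compile("Installing"))
case_insensitive = doc.find_all(string=re.compile("installing", re.IGNORECASE))

# Convert the text nodes to plain str for display
sensitive_texts = [str(text) for text in case_sensitive]
insensitive_texts = [str(text) for text in case_insensitive]

print(sensitive_texts)
print(insensitive_texts)
```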


#### Parent and Contents

In [23]:
# parent returns the parent element (one level up in the hierarchy)
parent = textos[0].parent
print(parent)

<h1>Installing Beautiful Soup<a class="headerlink" href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup" title="Permalink to this heading">¶</a></h1>


In [27]:
# contents returns the list of child nodes (one level down in the hierarchy)
parent.contents[0]

'Installing Beautiful Soup'
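
The round trip between `parent` and `contents` can be sketched on a standalone heading similar to the one above. The HTML here is a simplified, hypothetical copy with a shortened `href`, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical copy of the heading shown above
html = (
    '<h1>Installing Beautiful Soup'
    '<a class="headerlink" href="#installing-beautiful-soup">¶</a></h1>'
)
doc = BeautifulSoup(html, "html.parser")
heading = doc.find("h1")

# contents lists the direct children: a text node followed by the <a> tag
child_count = len(heading.contents)
text_node = heading.contents[0]
link_tag = heading.contents[1]

# parent goes back up from a child to the element that contains it
parent_name = text_node.parent.name

print(child_count, str(text_node), link_tag.name, parent_name)
```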