# **Book I used as a reference:**
- *Web Scraping com Python: coletando mais dados na web moderna, 2nd edition, Ryan Mitchell (O'Reilly).*

The purpose of this notebook is to study web scraping. To do that, I will test the examples covered in the book mentioned above and, at the end of each chapter, build a practical example based on what I read.

# **Chapter 2 - Advanced HTML Parsing**

**Other BeautifulSoup features**

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url= "http://www.pythonscraping.com/pages/warandpeace.html"
html= urlopen(url)
bs= BeautifulSoup(html.read(), "html.parser")

lista_nome= bs.find_all(name= "span",               # Tag name(s) to match
                        attrs= {"class":"green"},   # Attribute filter
                        recursive= True,            # Search all descendants, not only direct children
                        string= None)               # Optional filter on the text content (called `text` in older bs4 releases)

for nome in lista_nome:
  print(nome.get_text())        # Strip the tags and keep only the text

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
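
To see what `recursive` and `limit` change in practice, here is a minimal sketch on a small made-up HTML snippet (it is not the real War and Peace page, so the names and structure are assumptions chosen for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one green span directly inside the div,
# another one nested inside a paragraph
html = """
<div id="text">
  <span class="green">Anna Pavlovna</span>
  <p><span class="green">Prince Vasili</span> spoke first.</p>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
div = bs.find("div", {"id": "text"})

# recursive=True (the default) searches every descendant of the div
todos = div.find_all("span", {"class": "green"})
# recursive=False only looks at the div's direct children
diretos = div.find_all("span", {"class": "green"}, recursive=False)
# limit stops the search after the given number of matches
primeiro = div.find_all("span", limit=1)

nomes_todos = [s.get_text() for s in todos]
nomes_diretos = [s.get_text() for s in diretos]
print(nomes_todos)    # ['Anna Pavlovna', 'Prince Vasili']
print(nomes_diretos)  # ['Anna Pavlovna']
```

`find(...)` behaves like `find_all(..., limit=1)` but returns the element itself (or `None`) instead of a list.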


**Different kinds of searches**

In [None]:
def mostrar_resultados(lista):
  # Print the text of the first five results in `lista`
  for i, nome in enumerate(lista):
    if i < 5:
      print(nome.get_text())        # Strip the tags and keep only the text

In [None]:
# Passing a list of tags
mostrar_resultados(bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]))
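
A short sketch of how a list of tag names behaves, using an invented HTML string (the headings below are placeholders, not the content of the real page). Results come back in document order, whichever tag in the list matched.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three heading levels and one paragraph
html = "<h1>Main title</h1><h2>Subtitle</h2><p>Body text</p><h3>Note</h3>"
bs = BeautifulSoup(html, "html.parser")

# Any tag whose name is in the list is returned
cabecalhos = bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
titulos = [tag.name + ": " + tag.get_text() for tag in cabecalhos]
print(titulos)  # ['h1: Main title', 'h2: Subtitle', 'h3: Note']
```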


In [None]:
# We may want to match more than one class
mostrar_resultados(bs.find_all("span", {"class": ["green", "red"]}))
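
As a small illustration of matching several classes at once, the sketch below runs the same kind of query on a made-up paragraph (the spans and their classes are assumptions). An element is returned when its class matches any value in the list.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with three differently coloured spans
html = (
    '<p><span class="red">Well, Prince</span> said '
    '<span class="green">Anna</span> to '
    '<span class="blue">nobody</span></p>'
)
bs = BeautifulSoup(html, "html.parser")

# Matches spans whose class is "green" OR "red"
spans = bs.find_all("span", {"class": ["green", "red"]})
textos = [s.get_text() for s in spans]
print(textos)  # ['Well, Prince', 'Anna']
```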


In [None]:
# Search by exact text content
len(bs.find_all(string= "the prince"))

7
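
A string filter matches the whole text node exactly, so capitalised or longer variants are skipped. A hedged sketch on an invented sentence, comparing the exact match with a regular expression:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sentence with three variations of the same phrase
html = (
    "<p><span>the prince</span> and <span>The prince</span> "
    "met <span>the prince's son</span></p>"
)
bs = BeautifulSoup(html, "html.parser")

# Exact match: only text nodes equal to "the prince"
exatos = bs.find_all(string="the prince")
# Regex match: any text node containing the phrase, ignoring case
parciais = bs.find_all(string=re.compile(r"the prince", re.IGNORECASE))

print(len(exatos))    # 1
print(len(parciais))  # 3
```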

In [None]:
# Search with keyword arguments (class_ is used because class is a reserved word in Python)
mostrar_resultados(bs.find_all(id= "title", class_= "text"))
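
Keyword filters are combined with AND, and they are equivalent to passing the same conditions in an `attrs` dictionary. A minimal sketch on made-up markup (the ids and classes are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical divs: only the first one satisfies both conditions
html = (
    '<div id="title" class="text">Chapter 1</div>'
    '<div id="other" class="text">Footer</div>'
    '<div id="cover" class="note">Cover</div>'
)
bs = BeautifulSoup(html, "html.parser")

por_keyword = bs.find_all(id="title", class_="text")
por_attrs = bs.find_all(attrs={"id": "title", "class": "text"})

textos_keyword = [d.get_text() for d in por_keyword]
textos_attrs = [d.get_text() for d in por_attrs]
print(textos_keyword)  # ['Chapter 1']
```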


**Navigating the tree**

*Finding children*

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url= "http://www.pythonscraping.com/pages/page3.html"
html= urlopen(url)
bs= BeautifulSoup(html.read(), "html.parser")

for filho in bs.find("table", {"id": "giftList"}).children:
  print(filho)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
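
`.children` only yields the direct children of a tag, while `.descendants` walks the whole subtree. The sketch below contrasts the two on a reduced, invented version of the gift table (not the real page markup).

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical two-row table, written without whitespace between tags
html = (
    "<table id='giftList'>"
    "<tr><td>Basket <span>New!</span></td></tr>"
    "<tr><td>Dolls</td></tr>"
    "</table>"
)
bs = BeautifulSoup(html, "html.parser")
tabela = bs.find("table", {"id": "giftList"})

# Direct children only: the two rows
filhos = [n.name for n in tabela.children if isinstance(n, Tag)]
# Every tag anywhere below the table, in document order
descendentes = [n.name for n in tabela.descendants if isinstance(n, Tag)]

print(filhos)        # ['tr', 'tr']
print(descendentes)  # ['tr', 'td', 'span', 'tr', 'td']
```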


*Working with siblings*

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url= "http://www.pythonscraping.com/pages/page3.html"
html= urlopen(url)
bs= BeautifulSoup(html.read(), "html.parser")

for irmao in bs.find("table", {"id": "giftList"}).tr.next_siblings:
  print(irmao)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr
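
The blank lines in the output above are whitespace text nodes, which `.next_siblings` returns along with the rows. A possible way to keep only the tags, sketched on a small made-up table:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical table with newlines between the rows, as on many real pages
html = """<table id="giftList">
<tr><th>Item Title</th></tr>
<tr id="gift1"><td>Vegetable Basket</td></tr>
<tr id="gift2"><td>Russian Nesting Dolls</td></tr>
</table>"""
bs = BeautifulSoup(html, "html.parser")
cabecalho = bs.find("table", {"id": "giftList"}).tr

# Every sibling after the header row, including the "\n" strings
irmaos = list(cabecalho.next_siblings)
# Keep only real tags
linhas = [i for i in irmaos if isinstance(i, Tag)]
# find_next_siblings filters by tag name directly
mesmas_linhas = cabecalho.find_next_siblings("tr")

ids = [linha["id"] for linha in linhas]
print(len(irmaos))  # 5
print(ids)          # ['gift1', 'gift2']
```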

*Working with parents*

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url= "http://www.pythonscraping.com/pages/page3.html"
html= urlopen(url)
bs= BeautifulSoup(html.read(), "html.parser")

print(bs.find("img", {"src": "../img/gifts/img1.jpg"}), "\n--------------------------")
print(bs.find("img", {"src": "../img/gifts/img1.jpg"}).parent, "\n--------------------------")
print(bs.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling, "\n--------------------------")
print(bs.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text(), "\n--------------------------")

<img src="../img/gifts/img1.jpg"/> 
--------------------------
<td>
<img src="../img/gifts/img1.jpg"/>
</td> 
--------------------------
<td>
$15.00
</td> 
--------------------------

$15.00
 
--------------------------
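
`.previous_sibling` works here because the cells of the real page sit right next to each other. If there were whitespace between the cells, it would return that whitespace instead. The sketch below, on an invented row with newlines between cells, shows `find_previous_sibling` as one way around that.

```python
from bs4 import BeautifulSoup

# Hypothetical row where each cell is on its own line
html = """<table><tr>
<td>Vegetable Basket</td>
<td>$15.00</td>
<td><img src="../img/gifts/img1.jpg"/></td>
</tr></table>"""
bs = BeautifulSoup(html, "html.parser")

img = bs.find("img", {"src": "../img/gifts/img1.jpg"})
celula = img.parent                      # the <td> that wraps the image
anterior_bruto = celula.previous_sibling # the newline between the cells
# Skips text nodes and returns the previous <td> tag
preco = celula.find_previous_sibling("td").get_text()

print(repr(anterior_bruto))  # '\n'
print(preco)                 # $15.00
```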


**Regular expressions and BeautifulSoup**

In [None]:
# We want every product image
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url= "http://www.pythonscraping.com/pages/page3.html"
html= urlopen(url)
bs= BeautifulSoup(html.read(), "html.parser")

lista_pesquisa= bs.find_all("img",
                            {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})   # raw string avoids invalid-escape warnings

for elemento in lista_pesquisa:
  print(elemento.attrs["src"])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
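
The pattern can be checked on its own, without fetching the page. This sketch applies it to a hypothetical list of paths (the non-matching entries are invented) to show which ones it accepts.

```python
import re

# Same pattern as in the cell above: "../img/gifts/img", anything, then ".jpg"
padrao = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Hypothetical src values, some of which should be rejected
caminhos = [
    "../img/gifts/img1.jpg",
    "../img/gifts/logo.jpg",   # file name does not start with "img"
    "../img/gifts/img6.jpg",
    "/img/gifts/img2.jpg",     # missing the leading ".."
    "../img/gifts/img3.png",   # wrong extension
]

selecionados = [c for c in caminhos if padrao.search(c)]
print(selecionados)  # ['../img/gifts/img1.jpg', '../img/gifts/img6.jpg']
```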


# **My practice example**
- The goal is to extract the data from the table and export it to a CSV file.

In [35]:
# Extract the gift table and build a DataFrame from it
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pandas as pd

url= "http://www.pythonscraping.com/pages/page3.html"
html= urlopen(url)
bs= BeautifulSoup(html.read(), "html.parser")

lista_pesquisa= bs.find("table", {"id": "giftList"})

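# Column names come from the <th> cells of the header row
variaveis= [i.get_text().replace("\n","") for i in lista_pesquisa.tr]
# Each gift row yields [title, description, cost]; the trailing fragment left by the image cell is dropped
dados= [resultado.get_text().split("\n\n")[:-1] for resultado in lista_pesquisa.find_all(class_= "gift")]
# Turn each relative image path into an absolute URL
link= [link.attrs["src"].replace("..", "http://www.pythonscraping.com") for link in bs.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})]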

df= pd.DataFrame(dados)
df["Imagem"]= link
df.columns= variaveis
df['Item Title']= df['Item Title'].map(lambda x: x.replace("\n", ""))
df

| | Item Title | Description | Cost | Image |
|---|---|---|---|---|
| 0 | Vegetable Basket | This vegetable basket is the perfect gift for ... | $15.00 | http://www.pythonscraping.com/img/gifts/img1.jpg |
| 1 | Russian Nesting Dolls | Hand-painted by trained monkeys, these exquisi... | $10,000.52 | http://www.pythonscraping.com/img/gifts/img2.jpg |
| 2 | Fish Painting | If something seems fishy about this painting, ... | $10,005.00 | http://www.pythonscraping.com/img/gifts/img3.jpg |
| 3 | Dead Parrot | This is an ex-parrot! Or maybe he's only resting? | $0.50 | http://www.pythonscraping.com/img/gifts/img4.jpg |
| 4 | Mystery Box | If you love suprises, this mystery box is for ... | $1.50 | http://www.pythonscraping.com/img/gifts/img6.jpg |
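
The `Cost` column is still text. If numeric prices are needed, one possible conversion is sketched below. The helper `parse_cost` is a name I made up for illustration, and it assumes US-style formatting (a `$` prefix and `,` as thousands separator).

```python
def parse_cost(texto):
    """Convert a price such as '$10,000.52' to a float (hypothetical helper)."""
    return float(texto.strip().replace("$", "").replace(",", ""))

# The five prices shown in the table
custos = ["$15.00", "$10,000.52", "$10,005.00", "$0.50", "$1.50"]
valores = [parse_cost(c) for c in custos]
print(valores)  # [15.0, 10000.52, 10005.0, 0.5, 1.5]
```

With the DataFrame from the cell above, the same helper could be applied with `df["Cost"].map(parse_cost)`.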


In [36]:
df.to_csv("Tabela de preço do Cap 2.csv")
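
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. A small sketch with an in-memory buffer and a made-up two-row DataFrame shows the difference that `index=False` makes.

```python
import io
import pandas as pd

# Hypothetical miniature of the gift table
df_exemplo = pd.DataFrame({
    "Item Title": ["Vegetable Basket", "Dead Parrot"],
    "Cost": ["$15.00", "$0.50"],
})

# Default behaviour: the index is written as an unnamed first column
com_indice = io.StringIO()
df_exemplo.to_csv(com_indice)
com_indice.seek(0)

# index=False leaves the index out of the file
sem_indice = io.StringIO()
df_exemplo.to_csv(sem_indice, index=False)
sem_indice.seek(0)

colunas_com = list(pd.read_csv(com_indice).columns)
colunas_sem = list(pd.read_csv(sem_indice).columns)
print(colunas_com)  # ['Unnamed: 0', 'Item Title', 'Cost']
print(colunas_sem)  # ['Item Title', 'Cost']
```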