# Beautiful Soup Tutorial

## In this notebook...

In this Jupyter Notebook you will learn the basics of extracting data from HTML.

We will extract data from the book site **Books to Scrape** (books.toscrape.com), and along the way we will also use a bit of pandas.

### Meet your new best friends:

- Beautiful Soup
- Requests

In [None]:
# !pip install beautifulsoup4

To get the full Beautiful Soup experience you also need a parser. The available options include:

- html.parser
- lxml
- html5lib


We will use lxml because it is the fastest.

In [None]:
# !pip install lxml

We need one more thing before we can start web scraping: the ```requests``` library. With ```requests``` we can request web pages from websites.

In [None]:
# !pip install requests

Now let's get to work.

## My first scrape

As always, the first step is to import the libraries.

In [1]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
# from splinter import Browser
import numpy as np

Now we are ready to request our first web page. It is not complicated: we store the URL we want to scrape in the variable `url`, request it with `requests.get(url)`, and save the result in the variable `response`:

In [2]:
url = 'https://books.toscrape.com/'
# 'https://www.bookdepository.com/es/bestsellers'
response = requests.get(url)

How do we know whether the page was retrieved correctly?

In [3]:
print(response)

<Response [200]>


Possible responses:

- [Informational responses](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#information_responses) (100–199)
- [Successful responses](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#successful_responses) (200–299)
- [Redirection messages](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#redirection_messages) (300–399)
- [Client error responses](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#client_error_responses) (400–499)
- [Server error responses](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses) (500–599)
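The five ranges can be told apart by the first digit of the status code. A minimal sketch using only the standard library; the helper name `status_class` is made up for this illustration:

```python
from http import HTTPStatus


def status_class(code: int) -> str:
    """Return the response class that an HTTP status code belongs to."""
    classes = {
        1: "informational",
        2: "successful",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the leading digit (200 -> 2).
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    # HTTPStatus gives the standard reason phrase for each code.
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

With `requests`, the integer is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx codes.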

What we actually need is the HTML content of the requested page, so as the next step we store the content of the response in `html`:

In [4]:
html = response.content

We can print it to see its structure:

In [5]:
print(html[:990])

b'<!DOCTYPE html>\n<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->\n<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->\n<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->\n    <head>\n        <title>\n    All products | Books to Scrape - Sandbox\n</title>\n\n        <meta http-equiv="content-type" content="text/html; charset=UTF-8" />\n        <meta name="created" content="24th Jun 2016 09:29" />\n        <meta name="description" content="" />\n        <meta name="viewport" content="width=device-width" />\n        <meta name="robots" content="NOARCHIVE,NOCACHE" />\n\n        <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->\n        <!--[if lt IE 9]>\n        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>\n        <![endif]-->\n\n        \n            <link rel="shortcut icon" hre

This is the raw HTML of the Books to Scrape home page, but it is really hard to read...

That is what BeautifulSoup and lxml are for.

We create a BeautifulSoup object called `soup` with the following line of code:

In [6]:
soup = bs(html, "lxml")
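If lxml is not installed, asking Beautiful Soup for the `"lxml"` parser raises `bs4.FeatureNotFound`. A minimal sketch of a fallback to the built-in `html.parser`; the helper name `make_soup` and the sample markup are made up for this illustration:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Build a soup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # html.parser ships with Python, so it is always available.
        return BeautifulSoup(markup, "html.parser")


sample = (
    "<html><head><title>All products</title></head>"
    "<body><h1>All products</h1></body></html>"
)
soup_demo = make_soup(sample)
print(soup_demo.title.get_text())
```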

Now let's see the difference:

In [27]:
print(soup)

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="static

## How to navigate a Beautiful Soup object

![image](img\html-content-web-scraping.png)

![image](img\attribute-example-for-web-scraping-1536x386.png)

Now that we have learned some basic HTML, we can finally start extracting data from `soup`. Just write a tag name after `soup` and a dot (as in `soup.title`) and watch the magic happen:

In [8]:
soup.title

<title>
    All products | Books to Scrape - Sandbox
</title>

In [20]:
soup.h1

<h1>All products</h1>

We strip the tags and keep only the text:

In [17]:
soup.h1.get_text()

'All products'

What if you only need an attribute of an element? That is no problem either:

In [18]:
print(soup.a)
print('')
print(soup.a['href'])  # access the href attribute

<a href="index.html">Books to Scrape</a>

index.html
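Indexing a tag with `['href']` assumes the attribute exists. A minimal sketch, on a made-up snippet, of how indexing and the `.get()` method differ when the attribute is missing:

```python
from bs4 import BeautifulSoup

# Two links: the second one deliberately has no href attribute.
snippet = '<a href="index.html">Books to Scrape</a><a>No link here</a>'
demo = BeautifulSoup(snippet, "html.parser")
first, second = demo.find_all("a")

print(first["href"])            # indexing works when the attribute exists
print(first.get("href"))        # .get() returns the same value
print(second.get("href"))       # .get() returns None when it is missing
print(second.get("href", "#"))  # or a default that you supply

try:
    second["href"]
except KeyError:
    # Indexing a missing attribute raises KeyError, like a dictionary.
    print("second <a> has no href attribute")
```

In a scraping loop `.get()` is usually the safer choice, because a single tag without the attribute will not stop the whole run.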


We can also use:
> soup.a.get("href")

You can also use the `.find()` method and get exactly the same result:

In [21]:
print("Without the .find() method")
print(soup.h1)
print("")
print("With the .find() method")
print(soup.find("h1"))

Without the .find() method
<h1>All products</h1>

With the .find() method
<h1>All products</h1>


Often you need not just one element but all of them (for example, every link on a page). That is what the `.find_all()` method is for:

In [22]:
soup.find_all('a')

[<a href="index.html">Books to Scrape</a>,
 <a href="index.html">Home</a>,
 <a href="catalogue/category/books_1/index.html">
                             
                                 Books
                             
                         </a>,
 <a href="catalogue/category/books/travel_2/index.html">
                             
                                 Travel
                             
                         </a>,
 <a href="catalogue/category/books/mystery_3/index.html">
                             
                                 Mystery
                             
                         </a>,
 <a href="catalogue/category/books/historical-fiction_4/index.html">
                             
                                 Historical Fiction
                             
                         </a>,
 <a href="catalogue/category/books/sequential-art_5/index.html">
                             
                                 Sequential Art
            

Notice that what it returns is a list, so we can slice it and loop over it:

In [23]:
all_a = soup.find_all('a')
for a in all_a[:5]:
    print(a)

<a href="index.html">Books to Scrape</a>
<a href="index.html">Home</a>
<a href="catalogue/category/books_1/index.html">
                            
                                Books
                            
                        </a>
<a href="catalogue/category/books/travel_2/index.html">
                            
                                Travel
                            
                        </a>
<a href="catalogue/category/books/mystery_3/index.html">
                            
                                Mystery
                            
                        </a>


The page contains 20 books along with related information. From the available data we will extract the following:

- book title
- price
- stock availability

## Enough background...

Let's get to work.

## Getting the book titles (find_all + get_text)

To do this, we inspect the page in the browser (right-click a book title and choose Inspect).

In [24]:
boton_1 = soup.find("h3")
print(boton_1)

<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>


In [25]:
all_title = soup.find_all("h3")
for title in all_title:
    print(title.get_text(strip=True))  # returns the text content inside the h3

A Light in the ...
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History ...
The Requiem Red
The Dirty Little Secrets ...
The Coming Woman: A ...
The Boys in the ...
The Black Maria
Starving Hearts (Triangular Trade ...
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little ...
Rip it Up and ...
Our Band Could Be ...
Olio
Mesaerion: The Best Science ...
Libertarianism for Beginners
It's Only the Himalayas


In [26]:
all_title = soup.find_all("h3")
for title in all_title:
    print(title.find('a')['title'])  # the 'title' attribute of the 'a' tag holds the full, untruncated title

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas
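The two loops above only print the titles. To keep them for later use, a list comprehension over the `find_all` result is one option. A minimal sketch on a two-book snippet that mirrors the markup shown earlier; the variable names are made up for this illustration:

```python
from bs4 import BeautifulSoup

snippet = """
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
"""
demo = BeautifulSoup(snippet, "html.parser")

headings = demo.find_all("h3")  # behaves like a regular Python list

# Full titles come from the 'title' attribute of the nested <a> tag.
titles = [h3.find("a")["title"] for h3 in headings]

# The visible text may be truncated with "...".
shown = [h3.get_text(strip=True) for h3 in headings]

print(titles)
print(shown)
```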


## Getting the book availability (find_all + get_text)

We inspect the page source and find the class `instock availability`.

In [28]:
all_stocks = soup.find_all("p", class_='instock availability')
for stock in all_stocks:
    print(stock.get_text(strip=True))  # strip=True removes leading and trailing whitespace

In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
In stock
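A string with a space, such as `class_='instock availability'`, is compared against the exact value of the `class` attribute, so the same two classes written in a different order would not match. A minimal sketch on a made-up snippet, comparing it with a CSS selector and a single-class search:

```python
from bs4 import BeautifulSoup

# Same two classes, written in a different order on the second <p>.
snippet = """
<p class="instock availability">In stock</p>
<p class="availability instock">In stock</p>
<p class="price_color">51.77</p>
"""
demo = BeautifulSoup(snippet, "html.parser")

# Exact string match on the whole class attribute.
exact = demo.find_all("p", class_="instock availability")

# CSS selector: any <p> carrying both classes, in any order.
by_css = demo.select("p.instock.availability")

# A single class name matches every <p> that carries it.
single = demo.find_all("p", class_="instock")

print(len(exact), len(by_css), len(single))
```

On the page used here the order is always the same, so the exact match is enough; the selector form is simply more tolerant if the markup changes.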


## Getting the prices (find_all + get_text)

We inspect the page source and find the class `price_color`.

In [31]:
all_price = soup.find_all("p", class_="price_color")
for price in all_price:
    print(price.get_text(strip=True))

£51.77
£53.74
£50.10
£47.82
£54.23
£22.65
£33.34
£17.93
£22.60
£52.15
£13.99
£20.66
£17.46
£52.29
£35.02
£57.25
£23.88
£37.59
£51.33
£45.17


## Getting the links to each book's detail page (find_all + get)

We inspect the page source and look at the `h3` tags.

In [32]:
all_vistas = soup.find_all("h3")
for vista in all_vistas:
    print(vista.find('a').get('href'))

catalogue/a-light-in-the-attic_1000/index.html
catalogue/tipping-the-velvet_999/index.html
catalogue/soumission_998/index.html
catalogue/sharp-objects_997/index.html
catalogue/sapiens-a-brief-history-of-humankind_996/index.html
catalogue/the-requiem-red_995/index.html
catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html
catalogue/the-black-maria_991/index.html
catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html
catalogue/shakespeares-sonnets_989/index.html
catalogue/set-me-free_988/index.html
catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html
catalogue/rip-it-up-and-start-again_986/index.html
catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html

## Getting the links to each book's cover image (find_all + get)

We inspect the page source and look at the `img` tags.

In [33]:
all_img = soup.find_all("img")
for img in all_img:
    print(img.get('src'))
    # 'src' is the attribute that holds the image path

media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg
media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg
media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg
media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg
media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg
media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg
media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg
media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg
media/cache/58/46/5846057e28022268153beff6d352b06c.jpg
media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
media/cache/10/48/1048f63d3b5061cd2f424d20b3f9b666.jpg
media/cache/5b/88/5b88c52633f53cacf162c15f4f823153.jpg
media/cache/94/b1/94b1b8b244bce9677c2f29ccc890d4d2.jpg
media/cache/81/c4/81c4a973364e17d01f217e1188253d5e.jpg
media/cache/54/60/54607fe8945897cdcced0044103b10b6.jpg
media/cache/55/33/553310a7162dfbc2c6d19a84da0df9e1.jpg
media/cache/09/a3/09a3aef48557576e1a85ba7efea8ecb7.jpg
media/cach

## <font color='orange'>Collecting the book information</font>

First we store the main URL in a variable and create a soup object.

In [34]:
url_principal = 'https://books.toscrape.com/' # 'https://www.bookdepository.com/es/bestsellers'
response = requests.get(url_principal)  # request the URL we just stored

In [35]:
html = response.content
soup = bs(html, "lxml")

We build a list with the URL of each book:

In [36]:
libros = soup.find_all('h3')
lista_url = []
for libro in libros:
    url_libro = libro.find('a')['href']
    # href: relative link to each book's own page
    lista_url.append(url_principal+url_libro)

lista_url # list with the URL of every book

['https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
 'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
 'https://books.toscrape.com/catalogue/soumission_998/index.html',
 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
 'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
 'https://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
 'https://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
 'https://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
 'https://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
 'https://books.toscrape.com/catalogue/the-black-maria_991/index.html',
 'https://books.toscrape.com/catalogue/starving-hearts-triangular-trade-tr

Let's analyze the page of a single book first.

In [37]:
# Make a new request for the second book in the list:
r = requests.get(lista_url[1])

We create a soup for that book:

In [38]:
soup_libro = bs(r.text, "lxml")

In [39]:
soup_libro

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    Tipping the Velvet | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="
    &quot;Erotic and absorbing...Written with starling power.&quot;--&quot;The New York Times Book Review &quot; Nan King, an oyster girl, is captivated by the music hall phenomenon Kitty Butler, a male impersonator extraordinaire treading the boards in Canterbury. Through a friend at the box office, Nan manages to visit all her shows and finally meet her heroine. Soon after, she becomes Kitty's &quot;Erotic and absorbing...Written with starling

We get the book title:

In [40]:
titulo = soup_libro.find('h1').get_text(strip=True)
titulo

'Tipping the Velvet'

Price:

In [41]:
price = float(soup_libro.find('p', class_='price_color').get_text(strip=True)[2:])
# slice from position 2 to the end to drop the currency prefix
# before converting the rest to float
price

53.74
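The `[2:]` slice assumes the currency prefix is exactly two characters long. Depending on how the response was decoded, the prefix may be one character (`£`) or two (`Â£`), and the slice breaks in the first case. A minimal sketch of a more tolerant alternative that pulls the number out with a regular expression; the helper name `parse_price` is made up for this illustration:

```python
import re


def parse_price(text: str) -> float:
    """Extract the first decimal number from a price string."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


# Works regardless of how many characters precede the digits.
print(parse_price("£53.74"))
print(parse_price("Â£53.74"))
```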

Cover image URL:

In [42]:
image_url = soup_libro.find('img')['src']
image_url

'../../media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg'

In [43]:
url_principal+image_url

'https://books.toscrape.com/../../media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg'
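Plain concatenation leaves the `../..` segments in the middle of the URL. One way to get a clean absolute URL is `urllib.parse.urljoin`, which resolves a relative path against the page it was found on. A minimal sketch, assuming the page URL is the second entry of `lista_url` shown earlier:

```python
from urllib.parse import urljoin

# URL of the book page where the <img> tag was found.
page_url = "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"

# Relative path taken from the 'src' attribute.
relative_src = "../../media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg"

# urljoin walks up two directories from the page's folder, then appends the rest.
absolute_src = urljoin(page_url, relative_src)
print(absolute_src)
```

The same call also handles the book links on the home page, for example `urljoin('https://books.toscrape.com/', 'catalogue/soumission_998/index.html')`.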

Now we automate everything to scrape the whole catalogue:

In [44]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
# from splinter import Browser
import numpy as np

In [45]:
pages = np.arange(1,51)
lista_libros = []

for page in pages:

    url_principal = 'https://books.toscrape.com/catalogue/page-'+str(page)+'.html'
    r = requests.get(url_principal)
    soup = bs(r.text, 'lxml')

    libros = soup.find_all('h3')

    for libro in libros:

        url_ext = 'https://books.toscrape.com/catalogue/' + libro.find('a')['href']
        r_ext = requests.get(url_ext)
        soup_ext = bs(r_ext.text, 'lxml')

        # Book ID (first cell of the product table)
        id_libro = str(soup_ext.find_all('td')[0]).split('>')[1].split('<')[0]

        # Book title
        name = libro.find('a')['title']

        # Price
        price = float(soup_ext.find('p', class_='price_color').get_text(strip=True)[2:])

        # Tax
        tax = float(soup_ext.find_all('td')[4].get_text()[2:])

        # Cover image URL
        url_image = soup_ext.find('img')['src']

        # Stock availability
            ## Binary variable.
            ## Quantitative variable.
        unidades = float(soup_ext.find('p', class_='instock availability').get_text(strip=True).split('(')[1].split(' ')[0])
        stock = unidades>0

        # Number of reviews
        reviews = int(soup_ext.find_all('td')[-1].get_text())

        data = {'ID': id_libro,
                'NOMBRE': name,
                'PRECIO': price,
                'IMPUESTOS': tax,
                'URL_IMAGEN': url_image,
                'UNIDADES': unidades,
                'STOCK': stock,
                'NUM_COMENTARIOS': reviews}



        lista_libros.append(data)
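The loop above sends roughly a thousand requests back to back (50 listing pages with 20 books each), with no timeout and no check of the status code. A minimal sketch of a small helper that could replace the bare `requests.get` calls; the name `fetch_html` and the default values for `delay` and `timeout` are assumptions made for this illustration, not part of the original notebook:

```python
import time


def fetch_html(session, url, delay=0.5, timeout=10):
    """Fetch one page: time out, fail loudly on HTTP errors, then pause."""
    response = session.get(url, timeout=timeout)
    # Raise on 4xx/5xx instead of silently parsing an error page.
    response.raise_for_status()
    # A short pause between requests keeps the load on the server low.
    time.sleep(delay)
    return response.text


# Intended use (commented out so the block makes no network calls when run):
# import requests
# with requests.Session() as session:
#     html = fetch_html(session, "https://books.toscrape.com/catalogue/page-1.html")
```

Passing the session in as an argument lets a single `requests.Session` be reused for every page, and it also makes the helper easy to try out with a stand-in object.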

In [47]:
lista_libros[:4]

[{'ID': 'a897fe39b1053632',
  'NOMBRE': 'A Light in the Attic',
  'PRECIO': 51.77,
  'IMPUESTOS': 0.0,
  'URL_IMAGEN': '../../media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg',
  'UNIDADES': 22.0,
  'STOCK': True,
  'NUM_COMENTARIOS': 0},
 {'ID': '90fa61229261140a',
  'NOMBRE': 'Tipping the Velvet',
  'PRECIO': 53.74,
  'IMPUESTOS': 0.0,
  'URL_IMAGEN': '../../media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg',
  'UNIDADES': 20.0,
  'STOCK': True,
  'NUM_COMENTARIOS': 0},
 {'ID': '6957f44c3847a760',
  'NOMBRE': 'Soumission',
  'PRECIO': 50.1,
  'IMPUESTOS': 0.0,
  'URL_IMAGEN': '../../media/cache/ee/cf/eecfe998905e455df12064dba399c075.jpg',
  'UNIDADES': 20.0,
  'STOCK': True,
  'NUM_COMENTARIOS': 0},
 {'ID': 'e00eb4fd7b871a48',
  'NOMBRE': 'Sharp Objects',
  'PRECIO': 47.82,
  'IMPUESTOS': 0.0,
  'URL_IMAGEN': '../../media/cache/c0/59/c05972805aa7201171b8fc71a5b00292.jpg',
  'UNIDADES': 20.0,
  'STOCK': True,
  'NUM_COMENTARIOS': 0}]
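The chained `split` calls used for `unidades` raise an `IndexError` if the availability text contains no parenthesis. A minimal sketch of a regular-expression alternative, assuming the text looks like `In stock (22 available)`; both that exact wording and the helper name `parse_units` are assumptions for this illustration:

```python
import re


def parse_units(availability: str) -> int:
    """Return the number of units in an availability string, or 0 if absent."""
    match = re.search(r"\((\d+)\s+available\)", availability)
    return int(match.group(1)) if match else 0


print(parse_units("In stock (22 available)"))
print(parse_units("Out of stock"))
```

Returning an integer also avoids storing a count such as `22.0` as a float.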

In [48]:
df_libros = pd.DataFrame(lista_libros)
df_libros.head(4)

| | ID | NOMBRE | PRECIO | IMPUESTOS | URL_IMAGEN | UNIDADES | STOCK | NUM_COMENTARIOS |
|---|---|---|---|---|---|---|---|---|
| 0 | a897fe39b1053632 | A Light in the Attic | 51.77 | 0.0 | ../../media/cache/fe/72/fe72f0532301ec28892ae7... | 22.0 | True | 0 |
| 1 | 90fa61229261140a | Tipping the Velvet | 53.74 | 0.0 | ../../media/cache/08/e9/08e94f3731d7d6b760dfbf... | 20.0 | True | 0 |
| 2 | 6957f44c3847a760 | Soumission | 50.1 | 0.0 | ../../media/cache/ee/cf/eecfe998905e455df12064... | 20.0 | True | 0 |
| 3 | e00eb4fd7b871a48 | Sharp Objects | 47.82 | 0.0 | ../../media/cache/c0/59/c05972805aa7201171b8fc... | 20.0 | True | 0 |


In [49]:
df_libros.to_csv('libros_scrap.csv', index=False)

## <font color='red'>G</font>i<font color='red'>f</font>t 🥳🎄🎁🎉🎊🎆

![imagen](https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg)

In [53]:
import urllib.request

from PIL import Image

image_url = url + 'media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg'

urllib.request.urlretrieve(image_url, 'prueba.jpg')

img = Image.open('prueba.jpg')

img.show()