# Web Scraping
- Not every website offers an API, so here we look at other ways of getting data from websites.
- The difference is that an API returns data the server itself provides in a structured form, whereas with scraping we make a request to download the page's HTML and extract the information from it.
- APIs are more stable.
- Legality? This has to be checked for each site.
- There are several libraries for web scraping: Beautiful Soup, Selenium, Scrapy, PyQuery...
- HTML: already covered. A type of document built from tags.
  
### Web Scraping vs. APIs

- Both are common ways of getting data from the web in Python.
- APIs:
    - Data comes through endpoints provided by the server, usually in JSON or XML format.
    - Generally more stable than web scraping.
    - Legality: designed to be used by programmers.
- Web scraping:
    - Data comes directly from the website, by downloading its HTML.
    - Changes to the website can break the scripts.
    - Legality: depends on the website.
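
The "depends on the website" point can be partly checked in code. A minimal sketch, assuming the site publishes a `robots.txt` file: the standard-library `urllib.robotparser` module reports whether a path is allowed for a given user agent. The rules and the `my-scraper` agent name below are made up for illustration. Note that `robots.txt` is a convention rather than a legal statement, so the site's terms of use still need to be read.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt
robots_txt = """\
User-agent: *
Disallow: /checkout/
Allow: /collections/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse the rules without any network access

# "my-scraper" is an invented user agent name
allowed = parser.can_fetch("my-scraper", "https://www.ejemplo.com/collections/deals")
blocked = parser.can_fetch("my-scraper", "https://www.ejemplo.com/checkout/cart")
print(allowed, blocked)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.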

### HTML

HTML is based on a hierarchical structure of tags, which define elements. Example:

```html
<!DOCTYPE html>
<html>
<head>
    <title>Page title</title>
</head>
<body>
    <h1>Main heading</h1>
    <p>This is an example paragraph.</p>
    <a href="https://www.ejemplo.com">Link to Ejemplo.com</a>
    <img src="imagen.jpg" alt="Description of the image">
</body>
</html>
```

This example shows some common HTML tags:

- `<!DOCTYPE html>`: Declares that the document is an HTML document.
- `<html>`: The root element that wraps all the content of the page.
- `<head>`: Contains metadata and links to external files, such as CSS stylesheets or JavaScript scripts.
- `<title>`: Defines the page title, which is shown in the browser tab.
- `<body>`: Contains the visible content of the web page.
- `<h1>`: Defines a level 1 heading.
- `<p>`: Defines a paragraph of text.
- `<a>`: Defines a link to another web page.
- `<img>`: Inserts an image into the web page.
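
Because the tags are nested, a parser can expose the document as a tree. A minimal sketch, assuming Beautiful Soup is installed (`pip install beautifulsoup4`) and using an English version of the sample page: it walks down the hierarchy, lists the direct children of `<body>`, and reads two attributes.

```python
from bs4 import BeautifulSoup

html_doc = """<!DOCTYPE html>
<html>
<head>
    <title>Page title</title>
</head>
<body>
    <h1>Main heading</h1>
    <p>This is an example paragraph.</p>
    <a href="https://www.ejemplo.com">Link to Ejemplo.com</a>
    <img src="imagen.jpg" alt="Description of the image">
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Walk down the hierarchy: html -> head -> title
page_title = soup.html.head.title.string

# Direct children of <body>, tags only (text nodes such as newlines are skipped)
body_tags = [child.name for child in soup.body.find_all(True, recursive=False)]

# Attributes are read like dictionary keys
link_target = soup.a["href"]
image_alt = soup.img["alt"]

print(page_title)
print(body_tags)
print(link_target, image_alt)
```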


# Python

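#### Requests
- `res = requests.get(url)`: sends a GET request to the URL
- `res.status_code`: HTTP status code of the response (200 means OK)
- `res.content`: body of the response, as bytes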

#### BeautifulSoup
- Create the object:
    - `soup = BeautifulSoup(html_doc, 'html.parser')`
- Search for elements:
    - `soup.find('a')`: returns the first element with that tag
    - `soup.find_all('a')`: returns all `<a>` elements in a list
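
To see the difference between the two search methods without touching the network, here is a small sketch on an invented HTML snippet (the links and the `item` class are made up). `find` returns a single tag or `None`, while `find_all` returns every match and can filter by attributes.

```python
from bs4 import BeautifulSoup

# Invented snippet, only for illustration
html_doc = """
<ul>
  <li><a href="/a" class="item">First</a></li>
  <li><a href="/b" class="item">Second</a></li>
  <li><a href="/c">Third</a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.find("a")                          # a single Tag
all_links = soup.find_all("a")                       # every <a> tag
item_links = soup.find_all("a", {"class": "item"})   # filter by attribute
missing = soup.find("table")                         # nothing matches here

print(first_link.get_text())
print([a["href"] for a in all_links])
print(len(item_links), missing)
```

Since `find` returns `None` when nothing matches, it is worth checking the result before calling methods on it.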

In [3]:
### Python

from bs4 import BeautifulSoup  # library for parsing and extracting data from HTML and XML files
import requests
import pandas as pd

In [10]:
url = "https://www.decathlon.com/collections/deals?page=1"

In [11]:
result = requests.get(url)
print(result.status_code)

200


In [12]:
result.content



In [13]:
soup = BeautifulSoup(result.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<!--[if IE 8]> <html class="no-js lt-ie9" lang="en"> <![endif]-->
<!--[if IE 9 ]> <html class="ie9 no-js" lang="en"> <![endif]-->
<!-- [if (gt IE 9)|!(IE)]><! -->
<html class="no-js" lang="en">
 <!-- <![endif] -->
 <head>
  <meta charset="utf-8"/>
  <title>
   Deals
        
        
        |
        Decathlon
  </title>
  <link href="//cdn.shopify.com" rel="dns-prefetch"/>
  <link href="//cdn.shopify.com" rel="preconnect"/>
  <link href="//cdn.dynamicyield.com" rel="preconnect"/>
  <link href="//st.dynamicyield.com" rel="preconnect"/>
  <link as="font" crossorigin="anonymous" href="//www.decathlon.com/cdn/shop/t/242/assets/Avalon-Bold-webfont.woff2?v=13987910295491874761667891283" rel="preload" type="font/woff2"/>
  <link as="font" crossorigin="anonymous" href="//www.decathlon.com/cdn/shop/t/242/assets/Avalon-Book-webfont.woff2?v=45737076064894658321667891286" rel="preload" type="font/woff2"/>
  <link as="font" crossorigin="anonymous" href="//www.decathlon.com/cdn/sho

In [14]:
# get the products, prices and categories
lista_categorias = soup.find_all('span', {'class': 'de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-textBold'}) 
lista_categorias


[<span class="de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-textBold" data-gtm-actions="click" data-gtm-category="Tiles" data-gtm-delegate="child" data-gtm-event="customer-interaction" data-gtm-id="Decathlon Brand | /products/forclaz-mt100-hooded-down-puffer-jacket-167571" itemprop="sub-brand">
   Decathlon  Forclaz
 </span>,
 <span class="de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-textBold" data-gtm-actions="click" data-gtm-category="Tiles" data-gtm-delegate="child" data-gtm-event="customer-interaction" data-gtm-id="Decathlon Brand | /products/forclaz-womens-mt100-hooded-synthetic-jacket-312532" itemprop="sub-brand">
   Decathlon  Forclaz
 </span>,
 <span class="de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-textBold" data-gtm-actions="click" data-gtm-category="Tiles" data-gtm-delegate="child" data-gtm-event="customer-interaction" data-gtm-id="Decathlon Brand | /products/quechua-womens-mountain-walking-short-sleeved-t-shirt-mh100-172246" itemprop="sub-b

In [15]:
categorias = []

for i in lista_categorias:
    categorias.append(i.getText())

categorias

['\n  Decathlon  Forclaz\n',
 '\n  Decathlon  Forclaz\n',
 '\n  Decathlon  Quechua\n',
 '\n  Decathlon  Quechua\n',
 '\n  Decathlon  Quechua\n',
 '\n  Decathlon  Quechua\n',
 '\n  Decathlon  Van Rysel\n',
 '\n  Decathlon  Van Rysel\n',
 '\n  Decathlon  Quechua\n']

In [16]:
categorias_limpias = []

for categoria in categorias: 
    categorias_limpias.append(categoria.replace("\n", "").strip())

categorias_limpias

['Decathlon  Forclaz',
 'Decathlon  Forclaz',
 'Decathlon  Quechua',
 'Decathlon  Quechua',
 'Decathlon  Quechua',
 'Decathlon  Quechua',
 'Decathlon  Van Rysel',
 'Decathlon  Van Rysel',
 'Decathlon  Quechua']
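
The cleaned values still contain a double space between the two words. A small sketch of one way to normalize all whitespace in a single step, using a few of the raw strings shown earlier (the names `raw` and `clean` are only for this example):

```python
raw = ['\n  Decathlon  Forclaz\n', '\n  Decathlon  Quechua\n', '\n  Decathlon  Van Rysel\n']

# str.split() with no arguments splits on any run of whitespace (spaces, newlines,
# tabs) and drops empty pieces, so joining with one space normalizes the text.
clean = [" ".join(text.split()) for text in raw]
print(clean)
```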

In [17]:
lista_precios = soup.find_all('span', {'class': 'js-de-ProductTile-currentPrice'})
lista_precios

[<span class="js-de-ProductTile-currentPrice">$70.00 — $99.99</span>,
 <span class="js-de-ProductTile-currentPrice">$40.00 — $69.99</span>,
 <span class="js-de-ProductTile-currentPrice">$8.00 — $19.99</span>,
 <span class="js-de-ProductTile-currentPrice">$25.00</span>,
 <span class="js-de-ProductTile-currentPrice">$35.00</span>,
 <span class="js-de-ProductTile-currentPrice">$10.00</span>,
 <span class="js-de-ProductTile-currentPrice">$35.00</span>,
 <span class="js-de-ProductTile-currentPrice">$20.00</span>,
 <span class="js-de-ProductTile-currentPrice">$45.00</span>]

In [18]:
precios = []
for a in lista_precios: 
    precios.append(a.getText())

precios

['$70.00 — $99.99',
 '$40.00 — $69.99',
 '$8.00 — $19.99',
 '$25.00',
 '$35.00',
 '$10.00',
 '$35.00',
 '$20.00',
 '$45.00']
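
The prices are still strings, and some of them are ranges. A minimal sketch of how they could be turned into numbers, assuming US-style amounts with a leading dollar sign and a long dash between the two ends of a range. `parse_price` is a hypothetical helper, not something the notebook defines.

```python
def parse_price(text):
    """Return (lowest, highest) as floats for a single price or a price range.

    Hypothetical helper; assumes US-style numbers with a leading dollar sign.
    """
    # "\u2014" is the long dash the page puts between the two prices of a range
    parts = [p.strip() for p in text.split("\u2014")]
    values = [float(p.replace("$", "").replace(",", "")) for p in parts]
    return min(values), max(values)


precios = ["$70.00 \u2014 $99.99", "$8.00 \u2014 $19.99", "$25.00"]
rangos = [parse_price(p) for p in precios]
print(rangos)
```

A single price comes back with the same value in both positions, so every row has the same shape.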

In [19]:
## store the information in a dictionary
productos_decathlon = {}
productos_decathlon['categorias']= categorias_limpias
productos_decathlon['precio'] = precios
productos_decathlon

{'categorias': ['Decathlon  Forclaz',
  'Decathlon  Forclaz',
  'Decathlon  Quechua',
  'Decathlon  Quechua',
  'Decathlon  Quechua',
  'Decathlon  Quechua',
  'Decathlon  Van Rysel',
  'Decathlon  Van Rysel',
  'Decathlon  Quechua'],
 'precio': ['$70.00 — $99.99',
  '$40.00 — $69.99',
  '$8.00 — $19.99',
  '$25.00',
  '$35.00',
  '$10.00',
  '$35.00',
  '$20.00',
  '$45.00']}

In [20]:
df = pd.DataFrame(productos_decathlon)
df

|   | categorias | precio |
|---|---|---|
| 0 | Decathlon Forclaz | $70.00 — $99.99 |
| 1 | Decathlon Forclaz | $40.00 — $69.99 |
| 2 | Decathlon Quechua | $8.00 — $19.99 |
| 3 | Decathlon Quechua | $25.00 |
| 4 | Decathlon Quechua | $35.00 |
| 5 | Decathlon Quechua | $10.00 |
| 6 | Decathlon Van Rysel | $35.00 |
| 7 | Decathlon Van Rysel | $20.00 |
| 8 | Decathlon Quechua | $45.00 |
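
Building the table from two lists that were scraped separately only works while both lists have the same length and order. If one product had no price, the rows would shift or `pd.DataFrame` would fail. One possible alternative, sketched on invented markup, is to loop over each product container and read both fields from it. The class names `product-tile`, `brand` and `price` are hypothetical and do not come from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page uses different class names
html_doc = """
<div class="product-tile"><span class="brand">Forclaz</span><span class="price">$25.00</span></div>
<div class="product-tile"><span class="brand">Quechua</span></div>
<div class="product-tile"><span class="brand">Van Rysel</span><span class="price">$20.00</span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

rows = []
for tile in soup.find_all("div", {"class": "product-tile"}):
    brand = tile.find("span", {"class": "brand"})
    price = tile.find("span", {"class": "price"})
    rows.append({
        "categorias": brand.get_text(strip=True) if brand else None,
        "precio": price.get_text(strip=True) if price else None,  # missing price stays None
    })

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`.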


In [22]:
## example: finding notable Black women
url_one = "https://www.unwomen.org/es/news/stories/2020/7/compilation-inspirational-black-women-to-know"
url_two = "https://elcomercio.pe/viu/actitud-viu/dia-mundial-de-la-cultura-africana-dia-mundial-de-la-cultura-africana-y-de-los-afrodescendientes-10-mujeres-que-cambiaron-la-historia-del-mundo-afrodescendientes-mujeres-cultura-michelle-obama-rosa-parks-maya-angelou-kamala-harris-noticia/"

In [23]:
results_one = requests.get(url_one)
print(results_one.status_code)



200


In [25]:
soup_one = BeautifulSoup(results_one.content, 'html.parser')
nombres = soup_one.find_all('h2')
nombres

[<h2 class="visually-hidden" id="system-breadcrumb">Sobrescribir enlaces de ayuda a la navegación</h2>,
 <h2>Tarana Burke</h2>,
 <h2>Vanessa Nakate</h2>,
 <h2>Jaha Dukureh</h2>,
 <h2>Emanuela Paul</h2>,
 <h2>Unity Dow</h2>,
 <h2>Valdecir Nascimento</h2>]

In [35]:
nombres_limpios = []

for nombre in nombres:
    nombres_limpios.append(nombre.getText())

nombres_limpios.remove("Sobrescribir enlaces de ayuda a la navegación")
nombres_limpios

['Tarana Burke',
 'Vanessa Nakate',
 'Jaha Dukureh',
 'Emanuela Paul',
 'Unity Dow',
 'Valdecir Nascimento']
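
Removing the navigation heading by its exact text works, but `list.remove` raises `ValueError` if the site ever changes that string. A sketch of a filter that relies on the markup instead, using three of the `<h2>` tags from the output above: only headings without a `class` attribute are kept. This assumes the name headings never carry a class, which holds for the tags shown here.

```python
from bs4 import BeautifulSoup

# Three of the <h2> tags returned by the page
html_doc = """
<h2 class="visually-hidden" id="system-breadcrumb">Sobrescribir enlaces de ayuda a la navegación</h2>
<h2>Tarana Burke</h2>
<h2>Vanessa Nakate</h2>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only the <h2> tags without a class attribute (the navigation heading has one)
nombres_filtrados = [
    h2.get_text(strip=True)
    for h2 in soup.find_all("h2")
    if not h2.has_attr("class")
]
print(nombres_filtrados)
```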

In [30]:
results_2 = requests.get(url_two)
print(results_2.status_code)



200


In [32]:
soup_2 = BeautifulSoup(results_2.content, 'html.parser')
nombres_2 = soup_2.find_all('strong', {'class': 'story-gallery__title'})
nombres_2

[<strong class="story-gallery__title"> Día Mundial de la Cultura Africana y de los Afrodescendientes: 10 mujeres que cambiaron la historia del mundo</strong>,
 <strong class="story-gallery__title"> Rosa Parks</strong>,
 <strong class="story-gallery__title"> Michelle Obama</strong>,
 <strong class="story-gallery__title"> Maya Angelou</strong>,
 <strong class="story-gallery__title"> Shirley Chisholm</strong>,
 <strong class="story-gallery__title"> Marsha P. Johnson</strong>,
 <strong class="story-gallery__title"> Kamala Harris</strong>,
 <strong class="story-gallery__title"> Mae Jeminson</strong>,
 <strong class="story-gallery__title"> Simone Biles</strong>,
 <strong class="story-gallery__title"> Oprah Winfrey</strong>,
 <strong class="story-gallery__title"> Serena Williams<strong class="story-gallery__caption-image"> / Quinn Rooney</strong></strong>]

In [36]:
for nombre in nombres_2:
    nombres_limpios.append(nombre.getText())


nombres_limpios


['Tarana Burke',
 'Vanessa Nakate',
 'Jaha Dukureh',
 'Emanuela Paul',
 'Unity Dow',
 'Valdecir Nascimento',
 ' Día Mundial de la Cultura Africana y de los Afrodescendientes: 10 mujeres que cambiaron la historia del mundo',
 ' Rosa Parks',
 ' Michelle Obama',
 ' Maya Angelou',
 ' Shirley Chisholm',
 ' Marsha P. Johnson',
 ' Kamala Harris',
 ' Mae Jeminson',
 ' Simone Biles',
 ' Oprah Winfrey',
 ' Serena Williams / Quinn Rooney']

In [41]:
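# remove the article headline (index 6). Run once only: a second run also removes ' Rosa Parks', as in the output below
nombres_limpios.pop(6)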
nombres_limpios

['Tarana Burke',
 'Vanessa Nakate',
 'Jaha Dukureh',
 'Emanuela Paul',
 'Unity Dow',
 'Valdecir Nascimento',
 ' Michelle Obama',
 ' Maya Angelou',
 ' Shirley Chisholm',
 ' Marsha P. Johnson',
 ' Kamala Harris',
 ' Mae Jeminson',
 ' Simone Biles',
 ' Oprah Winfrey',
 ' Serena Williams / Quinn Rooney']
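
The names from the second page still have a leading space, and the last one includes a photo credit from a nested tag. A possible cleanup, sketched on a shortened copy of the tags shown earlier: the nested credit is removed before reading the text, and the headline is dropped by position. The assumption that the first match is always the headline is mine and would need checking against the live page.

```python
from bs4 import BeautifulSoup

# Shortened copy of the <strong> tags returned by the second page
html_doc = """
<strong class="story-gallery__title"> Día Mundial de la Cultura Africana y de los Afrodescendientes: 10 mujeres que cambiaron la historia del mundo</strong>
<strong class="story-gallery__title"> Rosa Parks</strong>
<strong class="story-gallery__title"> Michelle Obama</strong>
<strong class="story-gallery__title"> Serena Williams<strong class="story-gallery__caption-image"> / Quinn Rooney</strong></strong>
"""

soup = BeautifulSoup(html_doc, "html.parser")

nombres_pagina_2 = []
for tag in soup.find_all("strong", {"class": "story-gallery__title"}):
    credit = tag.find("strong", {"class": "story-gallery__caption-image"})
    if credit is not None:
        credit.decompose()  # drop the nested photo credit
    nombres_pagina_2.append(tag.get_text(strip=True))

# Assumption: the first match is the article headline, not a person
nombres_pagina_2 = nombres_pagina_2[1:]
print(nombres_pagina_2)
```

Unlike `pop(6)`, this cell gives the same result no matter how many times it is run.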

In [42]:
df = pd.DataFrame(nombres_limpios, columns=["Nombres"])
df

|   | Nombres |
|---|---|
| 0 | Tarana Burke |
| 1 | Vanessa Nakate |
| 2 | Jaha Dukureh |
| 3 | Emanuela Paul |
| 4 | Unity Dow |
| 5 | Valdecir Nascimento |
| 6 | Michelle Obama |
| 7 | Maya Angelou |
| 8 | Shirley Chisholm |
| 9 | Marsha P. Johnson |


In [None]:
## table structure
## table
## thead (header)
#  tr (rows), which contain th (column header cells)

In [43]:
url_bolsa = "https://www.bolsamania.com/indice/IBEX-35/historico-precios"
res_bolsa = requests.get(url_bolsa)
res_bolsa.status_code

200

In [57]:
soup_bolsa = BeautifulSoup(res_bolsa.content, 'html.parser')
tablas = soup_bolsa.find_all('table')

len(tablas)
nuestra_tabla = tablas[2]
nuestra_tabla

<table class="table table-hover cator sortable table-bm xs-grid-table stripped" v-table-sorter="{
                'orderBy': 'date',
                'orderType': 'asc',
                'baseUrl': 'https://www.bolsamania.com/indice/IBEX-35/historico-precios',
                'queryString': '?startDate=23-12-2024&amp;endDate=22-01-2025',
                'defaultBy': 'date',
                'reverse': true
                }">
<thead v-if="tableSorter">
<tr>
<th class="text-left" v-table-sorter-by="'date'">Fecha<span></span></th>
<th class="text-right" v-table-sorter-by="'price'">Precio<span></span></th>
<th class="text-right" v-table-sorter-by="'fallers'">Variación %<span></span></th>
<th class="text-right" v-table-sorter-by="'high'">Máximo<span></span></th>
<th class="text-right" v-table-sorter-by="'low'">Mínimo<span></span></th>
<th class="text-right" v-table-sorter-by="'open'">Apertura<span></span></th>
</tr>
</thead>
<tbody>
<tr>
<td class="text-left">23-dic-24</td>
<td class="text-ri

In [66]:
lista_encabezados = nuestra_tabla.find_all('th')
lista_encabezados

[<th class="text-left" v-table-sorter-by="'date'">Fecha<span></span></th>,
 <th class="text-right" v-table-sorter-by="'price'">Precio<span></span></th>,
 <th class="text-right" v-table-sorter-by="'fallers'">Variación %<span></span></th>,
 <th class="text-right" v-table-sorter-by="'high'">Máximo<span></span></th>,
 <th class="text-right" v-table-sorter-by="'low'">Mínimo<span></span></th>,
 <th class="text-right" v-table-sorter-by="'open'">Apertura<span></span></th>]

In [67]:
encabezados = []
for i in lista_encabezados: 
    encabezados.append(i.getText())
encabezados

['Fecha', 'Precio', 'Variación %', 'Máximo', 'Mínimo', 'Apertura']

In [60]:
filastabla = nuestra_tabla.find_all('tr')
filastabla

[<tr>
 <th class="text-left" v-table-sorter-by="'date'">Fecha<span></span></th>
 <th class="text-right" v-table-sorter-by="'price'">Precio<span></span></th>
 <th class="text-right" v-table-sorter-by="'fallers'">Variación %<span></span></th>
 <th class="text-right" v-table-sorter-by="'high'">Máximo<span></span></th>
 <th class="text-right" v-table-sorter-by="'low'">Mínimo<span></span></th>
 <th class="text-right" v-table-sorter-by="'open'">Apertura<span></span></th>
 </tr>,
 <tr>
 <td class="text-left">23-dic-24</td>
 <td class="text-right">11.435,700</td>
 <td class="text-right"><span class="dred">-0,28%</span></td>
 <td class="text-right">11.474,900</td>
 <td class="text-right">11.399,100</td>
 <td class="text-right">11.457,800</td>
 </tr>,
 <tr>
 <td class="text-left">24-dic-24</td>
 <td class="text-right">11.473,900</td>
 <td class="text-right"><span class="cgreen">0,33%</span></td>
 <td class="text-right">11.485,700</td>
 <td class="text-right">11.446,200</td>
 <td class="text-righ

In [61]:
resultados_limpios = []

for item in filastabla[1:]:
    resultados_limpios.append(item.text)
resultados_limpios

['\n23-dic-24\n11.435,700\n-0,28%\n11.474,900\n11.399,100\n11.457,800\n',
 '\n24-dic-24\n11.473,900\n0,33%\n11.485,700\n11.446,200\n11.468,500\n',
 '\n27-dic-24\n11.531,600\n0,50%\n11.531,600\n11.421,900\n11.452,000\n',
 '\n30-dic-24\n11.536,800\n0,05%\n11.600,400\n11.470,800\n11.478,200\n',
 '\n31-dic-24\n11.595,000\n0,50%\n11.613,400\n11.525,400\n11.529,600\n',
 '\n02-ene-25\n11.676,900\n0,71%\n11.676,900\n11.456,200\n11.609,800\n',
 '\n03-ene-25\n11.651,600\n-0,22%\n11.701,800\n11.635,500\n11.681,400\n',
 '\n06-ene-25\n11.808,200\n1,34%\n11.808,200\n11.613,700\n11.692,100\n',
 '\n07-ene-25\n11.811,900\n0,03%\n11.866,600\n11.724,600\n11.797,400\n',
 '\n08-ene-25\n11.798,100\n-0,12%\n11.871,700\n11.709,400\n11.800,000\n',
 '\n09-ene-25\n11.899,300\n0,86%\n11.904,200\n11.753,800\n11.756,900\n',
 '\n10-ene-25\n11.720,900\n-1,50%\n11.877,200\n11.703,800\n11.861,500\n',
 '\n13-ene-25\n11.688,200\n-0,28%\n11.707,500\n11.637,000\n11.664,000\n',
 '\n14-ene-25\n11.752,100\n0,55%\n11.797,200\n

In [63]:
resultados_limpios_super = []
for resultado in resultados_limpios: 
    elemento = resultado.split("\n")
    resultados_limpios_super.append(elemento)

resultados_limpios_super
     


[['',
  '23-dic-24',
  '11.435,700',
  '-0,28%',
  '11.474,900',
  '11.399,100',
  '11.457,800',
  ''],
 ['',
  '24-dic-24',
  '11.473,900',
  '0,33%',
  '11.485,700',
  '11.446,200',
  '11.468,500',
  ''],
 ['',
  '27-dic-24',
  '11.531,600',
  '0,50%',
  '11.531,600',
  '11.421,900',
  '11.452,000',
  ''],
 ['',
  '30-dic-24',
  '11.536,800',
  '0,05%',
  '11.600,400',
  '11.470,800',
  '11.478,200',
  ''],
 ['',
  '31-dic-24',
  '11.595,000',
  '0,50%',
  '11.613,400',
  '11.525,400',
  '11.529,600',
  ''],
 ['',
  '02-ene-25',
  '11.676,900',
  '0,71%',
  '11.676,900',
  '11.456,200',
  '11.609,800',
  ''],
 ['',
  '03-ene-25',
  '11.651,600',
  '-0,22%',
  '11.701,800',
  '11.635,500',
  '11.681,400',
  ''],
 ['',
  '06-ene-25',
  '11.808,200',
  '1,34%',
  '11.808,200',
  '11.613,700',
  '11.692,100',
  ''],
 ['',
  '07-ene-25',
  '11.811,900',
  '0,03%',
  '11.866,600',
  '11.724,600',
  '11.797,400',
  ''],
 ['',
  '08-ene-25',
  '11.798,100',
  '-0,12%',
  '11.871,700',
  '11.

In [71]:
resultados_finales = []

for resultado in resultados_limpios_super:
    elemento = resultado[1:-1]  # drop the empty strings at both ends
    resultados_finales.append(elemento)
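
Splitting the row text on `"\n"` and trimming the empty ends depends on how the HTML happens to be indented. An alternative sketch reads the cells of each row directly, here on a copy of two rows of the table shortened to three columns. The variable names are only for this example.

```python
from bs4 import BeautifulSoup

# Two rows of the table, shortened to three columns
html_doc = """
<table>
<thead><tr><th>Fecha<span></span></th><th>Precio<span></span></th><th>Variación %<span></span></th></tr></thead>
<tbody>
<tr><td>23-dic-24</td><td>11.435,700</td><td><span class="dred">-0,28%</span></td></tr>
<tr><td>24-dic-24</td><td>11.473,900</td><td><span class="cgreen">0,33%</span></td></tr>
</tbody>
</table>
"""

tabla = BeautifulSoup(html_doc, "html.parser").find("table")

columnas = [th.get_text(strip=True) for th in tabla.find_all("th")]

filas = []
for tr in tabla.find_all("tr"):
    celdas = [td.get_text(strip=True) for td in tr.find_all("td")]
    if celdas:  # the header row has <th> cells only, so it is skipped
        filas.append(celdas)

print(columnas)
print(filas)
```

`pd.DataFrame(filas, columns=columnas)` then builds the table with its headers in one step.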

In [72]:
df_tabla = pd.DataFrame(resultados_finales)
df_tabla

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 23-dic-24 | 11.435,700 | -0,28% | 11.474,900 | 11.399,100 | 11.457,800 |
| 1 | 24-dic-24 | 11.473,900 | 0,33% | 11.485,700 | 11.446,200 | 11.468,500 |
| 2 | 27-dic-24 | 11.531,600 | 0,50% | 11.531,600 | 11.421,900 | 11.452,000 |
| 3 | 30-dic-24 | 11.536,800 | 0,05% | 11.600,400 | 11.470,800 | 11.478,200 |
| 4 | 31-dic-24 | 11.595,000 | 0,50% | 11.613,400 | 11.525,400 | 11.529,600 |
| 5 | 02-ene-25 | 11.676,900 | 0,71% | 11.676,900 | 11.456,200 | 11.609,800 |
| 6 | 03-ene-25 | 11.651,600 | -0,22% | 11.701,800 | 11.635,500 | 11.681,400 |
| 7 | 06-ene-25 | 11.808,200 | 1,34% | 11.808,200 | 11.613,700 | 11.692,100 |
| 8 | 07-ene-25 | 11.811,900 | 0,03% | 11.866,600 | 11.724,600 | 11.797,400 |
| 9 | 08-ene-25 | 11.798,100 | -0,12% | 11.871,700 | 11.709,400 | 11.800,000 |


In [73]:
df_tabla.columns = encabezados
df_tabla

|   | Fecha | Precio | Variación % | Máximo | Mínimo | Apertura |
|---|---|---|---|---|---|---|
| 0 | 23-dic-24 | 11.435,700 | -0,28% | 11.474,900 | 11.399,100 | 11.457,800 |
| 1 | 24-dic-24 | 11.473,900 | 0,33% | 11.485,700 | 11.446,200 | 11.468,500 |
| 2 | 27-dic-24 | 11.531,600 | 0,50% | 11.531,600 | 11.421,900 | 11.452,000 |
| 3 | 30-dic-24 | 11.536,800 | 0,05% | 11.600,400 | 11.470,800 | 11.478,200 |
| 4 | 31-dic-24 | 11.595,000 | 0,50% | 11.613,400 | 11.525,400 | 11.529,600 |
| 5 | 02-ene-25 | 11.676,900 | 0,71% | 11.676,900 | 11.456,200 | 11.609,800 |
| 6 | 03-ene-25 | 11.651,600 | -0,22% | 11.701,800 | 11.635,500 | 11.681,400 |
| 7 | 06-ene-25 | 11.808,200 | 1,34% | 11.808,200 | 11.613,700 | 11.692,100 |
| 8 | 07-ene-25 | 11.811,900 | 0,03% | 11.866,600 | 11.724,600 | 11.797,400 |
| 9 | 08-ene-25 | 11.798,100 | -0,12% | 11.871,700 | 11.709,400 | 11.800,000 |
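
Every value in this table is still text, written in Spanish number format (dot for thousands, comma for decimals). A minimal sketch of how the values could be converted before doing any calculation. `to_float`, `to_date` and the `MESES` mapping are hypothetical helpers. Only "dic" and "ene" appear in the data above, so the other month abbreviations are my assumption about how the site writes them.

```python
from datetime import date

# Assumed Spanish month abbreviations; only "dic" and "ene" are confirmed by the data
MESES = {"ene": 1, "feb": 2, "mar": 3, "abr": 4, "may": 5, "jun": 6,
         "jul": 7, "ago": 8, "sep": 9, "oct": 10, "nov": 11, "dic": 12}


def to_float(text):
    """Convert '11.435,700' to 11435.7 and '-0,28%' to -0.28."""
    return float(text.replace("%", "").replace(".", "").replace(",", "."))


def to_date(text):
    """Convert '23-dic-24' to a date; assumes two-digit years in the 2000s."""
    dia, mes, anio = text.split("-")
    return date(2000 + int(anio), MESES[mes], int(dia))


fila = ["23-dic-24", "11.435,700", "-0,28%"]
fecha = to_date(fila[0])
precio = to_float(fila[1])
variacion = to_float(fila[2])
print(fecha, precio, variacion)
```

With pandas, the same functions can be applied to a whole column, for example `df_tabla["Precio"].map(to_float)`.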
