# Web scraping


## Webpages

Webpages are built mainly with 3 tools:
 * HTML: content (structure, headings, paragraphs, tables...)
 * CSS: style (color, shape, size...)
 * JavaScript: logic (clicks, popups, dynamic banners...)

## HTML

### Basics

HTML code consists of `<tagged>` content.

HTML has a hierarchical structure, with parent, child, and sibling tags:  
```
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
```
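
A small sketch of how this hierarchy can be walked in Python, assuming the Beautiful Soup library (`bs4`) used later in this notebook is installed. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

parent_name = h1.parent.name                # tag that encloses <h1>
sibling_name = h1.find_next_sibling().name  # next tag at the same level
# direct children of <body> only, no deeper descendants
child_names = [c.name for c in soup.body.find_all(recursive=False)]

print(parent_name, sibling_name, child_names)
```

Here `<body>` is the parent of `<h1>`, and `<h1>` and `<p>` are siblings.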

### Tags

Tags can be classified into groups according to the type of content they are expected to hold:
 * heading: `<h1>`, `<h2>`, `<h3>`, `<hgroup>`...
 * phrasing: `<b>`, `<img>`, `<sub>`...
 * embedded: `<audio>`, `<img>`, `<video>`...
 * tabulated: `<table>`, `<tr>`, `<tbody>`...
 * sections: `<header>`, `<section>`, `<article>`...
 * metadata: `<meta>`, `<title>`, `<script>`...

### Attributes

This tag has no attributes:  
`<div> Zapas Marca Joma X54 </div>`

Tags may have attributes. Here,  
`<div class="price-item" id="offer"> Zapas Marca Joma X54 </div>`,  the `div` tag has:
 * a `class` attribute with value `price-item`
 * an `id` attribute with value `offer`  



The `id` attribute should be unique within a page (no two tags should share the same `id`).

The `class` attribute is not intended to be unique. It usually groups tags with similar behavior or appearance.

Other frequently used attributes are:
 * `dir`
 * `lang`
 * `style` (not to be confused with `<style>` tag)
 * `title` (not to be confused with `<title>` tag)

## Web scraping

### Intro

When scraping, we want to filter tags by:
 * tag name
 * class
 * id
 * other attribute

Our browser's developer tools are very useful for this: we can **Inspect** any element on the page and find the corresponding piece of HTML code.

We use the `requests` library to bring the HTML content into our Python script.

We use the `Beautiful Soup` library to navigate through the HTML easily in Python.

In [None]:
!pip install beautifulsoup4

In [1]:
import requests

In [2]:
from bs4 import BeautifulSoup
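
As a rough sketch of the four filters listed in the intro, the snippet below applies them to a tiny hand-written HTML fragment. The second `div`, the `p` tag, and the `lang` attribute are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample; only the first <div> comes from the text above
html = """
<div class="price-item" id="offer" lang="es">Zapas Marca Joma X54</div>
<div class="price-item">Zapas Marca Joma X55</div>
<p class="note">Free shipping</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find_all("div")                  # by tag name
by_class = soup.find_all(class_="price-item")  # by class
by_id = soup.find(id="offer")                  # by id (expected to be unique)
by_attr = soup.find_all(attrs={"lang": "es"})  # by any other attribute

print(len(by_tag), len(by_class), by_id.text, len(by_attr))
```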

### Example: Bicycles webpage

Let's scrape a bicycle store!

In [3]:
url_depor = "https://www.deporvillage.com/bicicletas-mtb"

#### Research

In [4]:
response = requests.get(url_depor)

In [5]:
response

<Response [200]>

In [6]:
response.status_code

200

When we requested API endpoints, the response content was generally JSON.

This time we requested a regular (non-API) URL, so we receive raw HTML content instead. Let's look at it:

In [11]:
response.content

b'<!DOCTYPE html>\n<html lang="es">\n<head>\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />\n<script type="text/javascript">\n    window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(t,e,n){function r(t){try{s.console&&console.log(t)}catch(e){}}var o,i=t("ee"),a=t(23),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",function(t){r(t.stack)}),s.dev&&i.on("fn-err",function(t,e,n){r(n.stack)}),s.dev&&(r("NR A

In [13]:
"<span>" in str(response.content)

True

In [14]:
type(response.content)

bytes
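
One detail worth a quick illustration: calling `str()` on a `bytes` object returns its printed representation (starting with `b'`), not the decoded text. The sketch below uses a hand-written stand-in for `response.content` and assumes the page is UTF-8 encoded. With `requests`, `response.text` gives the decoded string directly.

```python
# Stand-in for response.content: raw bytes as they arrive from the server
content = "<span>569,95 €</span>".encode("utf-8")

as_repr = str(content)             # repr of the bytes object, starts with "b'"
as_text = content.decode("utf-8")  # real text, with "€" restored

print(as_repr)
print(as_text)
```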

`Beautiful Soup` helps us with the task of accessing this info

In [15]:
# Naming the parser explicitly avoids a warning and keeps results reproducible
soup = BeautifulSoup(response.content, "html.parser")

In [19]:
type(soup)

bs4.BeautifulSoup

#### Getting the title tag

In [20]:
soup.find("title")

<title>Bicicletas mtb - comprar bicicletas mtb online | deporvillage</title>

In [21]:
soup.find_all("title")

[<title>Bicicletas mtb - comprar bicicletas mtb online | deporvillage</title>]

In [24]:
len(soup.find_all("div"))

1292

#### Get bicycle names

Open the Chrome developer tools and inspect a bicycle name to see which tag and class hold it.

You can find HTML tags by running `soup.find_all(name=tag_name, class_=class_name)`

In [26]:
bike_names_tags = soup.find_all(
    name="div", 
    class_="product-item-name"
)

In [27]:
type(bike_names_tags)

bs4.element.ResultSet

Equivalently, you can use CSS selectors, a syntax shared by browsers and many other tools, and search for `tag_name.class_name`. If the `class` attribute contains spaces (that is, the tag has several classes), replace each space with a `.`

In [30]:
bike_names_tags = soup.select("div.product-item-name")

In [31]:
len(bike_names_tags)

47
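
A minimal sketch of that equivalence, run on a two-line fragment copied from the product markup shown in this notebook (the text `picture` inside the first `div` is a placeholder):

```python
from bs4 import BeautifulSoup

html = """
<div class="product-item-picture large">picture</div>
<div class="product-item-name">Bicicleta MTB Conor - WRC 7200 Lady gris</div>
"""
soup = BeautifulSoup(html, "html.parser")

names_find = [t.text for t in soup.find_all(name="div", class_="product-item-name")]
names_css = [t.text for t in soup.select("div.product-item-name")]

# class="product-item-picture large" holds two classes, so the space becomes a dot
two_classes = soup.select("div.product-item-picture.large")

print(names_find == names_css, len(two_classes))
```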

**HINT**: run `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` in the browser console to highlight the matching elements and check whether your guess is right.  

Replace `a` with the tag you are looking for, or with `.class_name` to match a class.

In [32]:
bike_names_tags[:10]

[<div class="product-item-name">Bicicleta MTB Conor - WRC 7200 Lady gris</div>,
 <div class="product-item-name">Bicicleta MTB Umit XR 26" rosa</div>,
 <div class="product-item-name">Bicicleta MTB Conor - WRC 5400 Lady verde</div>,
 <div class="product-item-name">Bicicleta MTB Conor - WRC 5500 verde</div>,
 <div class="product-item-name">Bicicleta MTB Marin Bikes Bolinas Ridge 2 29" azul oscuro azul claro</div>,
 <div class="product-item-name">Bicicleta MTB Conor - WRC 6000 verde</div>,
 <div class="product-item-name">Bicicleta MTB Conor - WRC 6700 azul turquesa</div>,
 <div class="product-item-name">Bicicleta MTB Umit XR 26" negro verde</div>,
 <div class="product-item-name">Bicicleta MTB Umit XR 26" azul</div>,
 <div class="product-item-name">Bicicleta MTB Umit XR 26" verde pistacho</div>]

In [33]:
bike_names_tags[0]

<div class="product-item-name">Bicicleta MTB Conor - WRC 7200 Lady gris</div>

In [35]:
bike_names_tags[0].text

'Bicicleta MTB Conor - WRC 7200 Lady gris'

In [36]:
names = [bike.text for bike in bike_names_tags]

In [38]:
names[:10]

['Bicicleta MTB Conor - WRC 7200 Lady gris',
 'Bicicleta MTB Umit XR 26" rosa',
 'Bicicleta MTB Conor - WRC 5400 Lady verde',
 'Bicicleta MTB Conor - WRC 5500 verde',
 'Bicicleta MTB Marin Bikes Bolinas Ridge 2 29" azul oscuro azul claro',
 'Bicicleta MTB Conor - WRC 6000 verde',
 'Bicicleta MTB Conor - WRC 6700 azul turquesa',
 'Bicicleta MTB Umit XR 26" negro verde',
 'Bicicleta MTB Umit XR 26" azul',
 'Bicicleta MTB Umit XR 26" verde pistacho']

#### Bike prices

Inspect the page, guess which tag holds the prices, and paint all item prices green in the browser console to check your guess.

In [45]:
price_div_tags = soup.find_all("div", class_="product-item-price")

In [46]:
len(price_div_tags)

47

In [47]:
price_div_tags[:10]

[<div class="product-item-price"><span class="special-price">569,95 €<del>575,00 €</del></span></div>,
 <div class="product-item-price"><span class="special-price">209,95 €<del>244,99 €</del></span></div>,
 <div class="product-item-price"><span class="special-price">259,95 €<del>305,00 €</del></span></div>,
 <div class="product-item-price"><span>320,00 €</span></div>,
 <div class="product-item-price"><span class="special-price">494,00 €<del>549,00 €</del></span></div>,
 <div class="product-item-price"><span>383,00 €</span></div>,
 <div class="product-item-price"><span>475,00 €</span></div>,
 <div class="product-item-price"><span class="special-price">209,95 €<del>244,99 €</del></span></div>,
 <div class="product-item-price"><span class="special-price">209,95 €<del>244,99 €</del></span></div>,
 <div class="product-item-price"><span class="special-price">209,95 €<del>244,99 €</del></span></div>]

In [49]:
price_div_tags[0]

<div class="product-item-price"><span class="special-price">569,95 €<del>575,00 €</del></span></div>

Each of these `div` tags contains a `span` tag. Let's find it!

In [51]:
price_div_tags[0].find("span")

<span class="special-price">569,95 €<del>575,00 €</del></span>

In [52]:
price_span_tags = [p.find("span") for p in price_div_tags]

In [53]:
price_span_tags[:10]

[<span class="special-price">569,95 €<del>575,00 €</del></span>,
 <span class="special-price">209,95 €<del>244,99 €</del></span>,
 <span class="special-price">259,95 €<del>305,00 €</del></span>,
 <span>320,00 €</span>,
 <span class="special-price">494,00 €<del>549,00 €</del></span>,
 <span>383,00 €</span>,
 <span>475,00 €</span>,
 <span class="special-price">209,95 €<del>244,99 €</del></span>,
 <span class="special-price">209,95 €<del>244,99 €</del></span>,
 <span class="special-price">209,95 €<del>244,99 €</del></span>]

In [58]:
price_span_tags[0].text

'569,95 €575,00 €'

In [57]:
price_span_tags[3].text

'320,00 €'

In [None]:
price_span_tags[2]

Equivalently, using a descendant CSS selector (`div.product-item-price span` matches any `span` nested inside such a `div`):

In [44]:
soup.select("div.product-item-price span")

[<span class="special-price">569,95 €<del>575,00 €</del></span>,
 <span class="special-price">209,95 €<del>244,99 €</del></span>,
 <span class="special-price">259,95 €<del>305,00 €</del></span>,
 <span>320,00 €</span>,
 <span class="special-price">494,00 €<del>549,00 €</del></span>,
 <span>383,00 €</span>,
 <span>475,00 €</span>,
 <span class="special-price">209,95 €<del>244,99 €</del></span>,
 <span class="special-price">209,95 €<del>244,99 €</del></span>,
 <span class="special-price">209,95 €<del>244,99 €</del></span>,
 <span class="special-price">209,95 €<del>244,99 €</del></span>,
 <span class="special-price">225,95 €<del>244,99 €</del></span>,
 <span class="special-price">5.399,00 €<del>5.999,00 €</del></span>,
 <span class="special-price">5.039,00 €<del>5.599,00 €</del></span>,
 <span class="special-price">5.039,00 €<del>5.599,00 €</del></span>,
 <span class="special-price">5.579,00 €<del>6.198,99 €</del></span>,
 <span class="special-price">2.754,05 €<del>2.899,00 €</del></span>

#### Creating a DataFrame with the data

In [59]:
recuadros = soup.find_all("div", class_="product-item-component")

In [60]:
len(recuadros)

47

In [66]:
recuadros[0]

<div class="product-item-component"><div class="product-item-picture large"><picture class="picture-component picture-lazy"><div class="lazyload-placeholder" style="height:260px"></div></picture></div><div class="product-item-price"><span class="special-price">569,95 €<del>575,00 €</del></span></div><div class="product-item-name">Bicicleta MTB Conor - WRC 7200 Lady gris</div></div>

In [68]:
print(recuadros[0].prettify())

<div class="product-item-component">
 <div class="product-item-picture large">
  <picture class="picture-component picture-lazy">
   <div class="lazyload-placeholder" style="height:260px">
   </div>
  </picture>
 </div>
 <div class="product-item-price">
  <span class="special-price">
   569,95 €
   <del>
    575,00 €
   </del>
  </span>
 </div>
 <div class="product-item-name">
  Bicicleta MTB Conor - WRC 7200 Lady gris
 </div>
</div>


In [76]:
def extract_price(recuadro):
    # TODO make sure only one price if special price
    return recuadro.select("div.product-item-price span")[0].text

In [77]:
def extract_name(recuadro):
    return recuadro.select("div.product-item-name")[0].text

In [75]:
extract_price(recuadros[0])

'569,95 €575,00 €'
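
The output above glues the current price to the crossed-out one. One possible way to address the TODO, sketched here under the assumption that the current price is always the first text node inside the `span`, is to take only that first string. The function name `extract_current_price` is hypothetical.

```python
from bs4 import BeautifulSoup

html = """
<div class="product-item-price"><span class="special-price">569,95 €<del>575,00 €</del></span></div>
<div class="product-item-price"><span>320,00 €</span></div>
"""
soup = BeautifulSoup(html, "html.parser")


def extract_current_price(price_div):
    span = price_div.find("span")
    # stripped_strings yields text nodes in document order, so the first one
    # is the current price and any <del> text comes after it
    return next(span.stripped_strings)


current_prices = [extract_current_price(d) for d in soup.select("div.product-item-price")]
print(current_prices)
```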

In [78]:
extract_name(recuadros[0])

'Bicicleta MTB Conor - WRC 7200 Lady gris'

In [83]:
pares = [(extract_name(r), extract_price(r)) for r in recuadros]

In [85]:
import pandas as pd

In [88]:
df = pd.DataFrame(pares)

In [90]:
df.columns = ["name", "price"]

In [92]:
"569,95 €575,00 €".split(" €")

['569,95', '575,00', '']

In [93]:
"320,00 €".split(" €")

['320,00', '']

In [91]:
df.head()

| | name | price |
|---|---|---|
| 0 | Bicicleta MTB Conor - WRC 7200 Lady gris | 569,95 €575,00 € |
| 1 | Bicicleta MTB Umit XR 26" rosa | 209,95 €244,99 € |
| 2 | Bicicleta MTB Conor - WRC 5400 Lady verde | 259,95 €305,00 € |
| 3 | Bicicleta MTB Conor - WRC 5500 verde | 320,00 € |
| 4 | Bicicleta MTB Marin Bikes Bolinas Ridge 2 29" ... | 494,00 €549,00 € |


In [95]:
def get_real_price(price_text):
    return price_text.split(" €")[0]

In [98]:
df["real_price"] = df.price.apply(get_real_price)

In [105]:
df.real_price

0       569,95
1       209,95
2       259,95
3       320,00
4       494,00
5       383,00
6       475,00
7       209,95
8       209,95
9       209,95
10      209,95
11      225,95
12    5.399,00
13    5.039,00
14    5.039,00
15    5.579,00
16    2.754,05
17    2.754,05
18    2.754,05
19    2.754,05
20      320,00
21      475,00
22      575,00
23      625,00
24      699,00
25      699,00
26    5.799,00
27    3.999,00
28    2.590,00
29    1.950,00
30    3.399,00
31    1.598,95
32    4.299,00
33    1.899,00
34    2.399,00
35    4.299,00
36    7.999,00
37      625,00
38      713,80
39      499,00
40    2.999,00
41    4.059,35
42      259,95
43    5.399,00
44      625,00
45      625,00
46    3.399,00
Name: real_price, dtype: object

In [112]:
# regex=False treats "." as a literal dot rather than the regex wildcard
df.real_price = df.real_price.str.replace(".", "", regex=False).str.replace(",", ".", regex=False).astype(float)
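
The same conversion can be sketched in plain Python, which makes each step easy to check on a single value. The helper name `parse_price` is hypothetical. It assumes the Spanish number format seen above, with `.` for thousands and `,` for decimals.

```python
def parse_price(price_text):
    """Convert text such as '5.399,00 €5.999,00 €' to the float 5399.0."""
    first = price_text.split(" €")[0]  # keep only the first (current) price
    # "." is the thousands separator and "," the decimal separator here
    return float(first.replace(".", "").replace(",", "."))


print(parse_price("5.399,00 €5.999,00 €"))
print(parse_price("320,00 €"))
```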

In [113]:
df.head()

| | name | price | real_price |
|---|---|---|---|
| 0 | Bicicleta MTB Conor - WRC 7200 Lady gris | 569,95 €575,00 € | 569.95 |
| 1 | Bicicleta MTB Umit XR 26" rosa | 209,95 €244,99 € | 209.95 |
| 2 | Bicicleta MTB Conor - WRC 5400 Lady verde | 259,95 €305,00 € | 259.95 |
| 3 | Bicicleta MTB Conor - WRC 5500 verde | 320,00 € | 320.0 |
| 4 | Bicicleta MTB Marin Bikes Bolinas Ridge 2 29" ... | 494,00 €549,00 € | 494.0 |


In [116]:
df[df.real_price < 400]

| | name | price | real_price |
|---|---|---|---|
| 1 | Bicicleta MTB Umit XR 26" rosa | 209,95 €244,99 € | 209.95 |
| 2 | Bicicleta MTB Conor - WRC 5400 Lady verde | 259,95 €305,00 € | 259.95 |
| 3 | Bicicleta MTB Conor - WRC 5500 verde | 320,00 € | 320.0 |
| 5 | Bicicleta MTB Conor - WRC 6000 verde | 383,00 € | 383.0 |
| 7 | Bicicleta MTB Umit XR 26" negro verde | 209,95 €244,99 € | 209.95 |
| 8 | Bicicleta MTB Umit XR 26" azul | 209,95 €244,99 € | 209.95 |
| 9 | Bicicleta MTB Umit XR 26" verde pistacho | 209,95 €244,99 € | 209.95 |
| 10 | Bicicleta MTB Umit XR 26" naranja | 209,95 €244,99 € | 209.95 |
| 11 | Bicicleta MTB Umit XR 26" negro rojo | 225,95 €244,99 € | 225.95 |
| 20 | Bicicleta MTB Conor - WRC 5500 negro verde | 320,00 € | 320.0 |


In [121]:
df.real_price.median()

713.8

In [122]:
df.shape

(47, 3)

In [123]:
df.real_price.max()

7999.0

In [125]:
df.real_price.min()

209.95

The listing spans many pages. What if we want to scrape all of them?  
1. Build a function `get_df_from_url` that wraps all the previous logic  
2. Build a list of URLs

In [127]:
pages = [f"https://www.deporvillage.com/bicicletas-mtb?p={pag}" for pag in range(10)]

3. Build the DataFrame for each URL

In [None]:
dfs = [get_df_from_url(p) for p in pages]
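
The notebook does not define `get_df_from_url`. One way it might look is sketched below, assuming every page uses the same `product-item-component` layout seen earlier. The split into a parsing helper (`get_df_from_html`, a hypothetical name) and a download wrapper is a design choice that lets the parsing be checked offline. The `timeout` value is arbitrary.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_df_from_html(html):
    """Build a name/price DataFrame from the HTML of one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for box in soup.select("div.product-item-component"):
        name = box.select_one("div.product-item-name")
        price = box.select_one("div.product-item-price span")
        if name is None or price is None:
            continue  # skip boxes that do not match the expected layout
        rows.append({"name": name.text.strip(), "price": price.text.strip()})
    return pd.DataFrame(rows, columns=["name", "price"])


def get_df_from_url(url):
    """Download one page and parse it; raises on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return get_df_from_html(response.content)


# Offline check with a hand-written snippet that mimics the page structure
sample_html = """
<div class="product-item-component">
  <div class="product-item-price"><span>320,00 €</span></div>
  <div class="product-item-name">Bicicleta MTB Conor - WRC 5500 verde</div>
</div>
"""
sample_df = get_df_from_html(sample_html)
print(sample_df)
```

The per-page results could then be stacked into a single table with `pd.concat(dfs, ignore_index=True)`.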

## Advanced

### CSS selectors

We can use CSS selectors to find in a more specific way:
 * descendant selectors
 * combined selectors
 * siblings
 * has attribute
 * ...

In what other ways can we use CSS selectors to find tags inside HTML content?

`soup.select("tagname1 tagname2")` finds every `tagname2` nested anywhere inside a `tagname1`

In [128]:
# how many spans?
len(soup.select("span"))

845

In [129]:
# how many spans inside spans?
len(soup.select("span span"))

20

In [130]:
# how many spans inside spans inside spans?
len(soup.select("span span span"))

2

In [134]:
# how many spans nested inside at least eight levels of divs?
len(soup.select("div div div div div div div div span"))

803

We use a `.` to find by class

`soup.select(".classname")`

In [135]:
len(soup.select(".picture-component"))

90

`soup.select("tagname.classname")`

In [136]:
len(soup.select("div.product-item-price"))

47
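
The other selector kinds from the list at the start of this section (child, sibling, has-attribute) can be sketched on a small fragment. The markup mirrors one product box from this notebook, except for the `title="bike"` attribute, which is added purely for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="product-item-component">
  <div class="product-item-price"><span class="special-price">569,95 €<del>575,00 €</del></span></div>
  <div class="product-item-name" title="bike">Bicicleta MTB Conor - WRC 7200 Lady gris</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

direct_children = soup.select("div.product-item-component > div")  # child combinator
next_sibling = soup.select("div.product-item-price + div")         # adjacent sibling
has_title = soup.select("div[title]")                              # has attribute
special = soup.select('span[class="special-price"]')               # attribute equals value

print(len(direct_children), next_sibling[0]["class"], len(has_title), len(special))
```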

[Beautiful Soup selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Getting attribute values

Do you want to get the description and hyperlinks to images in a webpage?  

We use `tag.get(attr_name)` to get the attribute value

In [139]:
response = requests.get("https://www.elpais.com")

In [140]:
soup = BeautifulSoup(response.content, "html.parser")

In [141]:
img_tags = soup.find_all("img")

In [142]:
len(img_tags)

2

In [143]:
img_tags[0]

<img alt="MADRID, 28/01/2021.- La vicepresidenta primera del Gobierno, Carmen Calvo (d), y el representante de EH Bildu, Oskar Matute (i), este jueves, en el Congreso de los Diputados. El Gobierno defiende en el pleno del Congreso la convalidación del decreto ley por el que se adoptan medidas de adaptación tras la retirada del Reino Unido de la Unión Europea el pasado 1 de enero. EFE/ Mariscal" class="block width_full breakout_mobile" src="https://imagenes.elpais.com/resizer/8uRxVyA0u9gS3DBVOHH0g8W-sKA=/62x34/cloudfront-eu-central-1.images.arcpublishing.com/prisa/3OMZAFB7RFG35CQDT4FNCQYDNQ.jpg"/>

In [144]:
img_tags[0].text  # an <img> tag has no text content, hence the empty string

''

In [145]:
img_tags[0].get("alt")

'MADRID, 28/01/2021.- La vicepresidenta primera del Gobierno, Carmen Calvo (d), y el representante de EH Bildu, Oskar Matute (i), este jueves, en el Congreso de los Diputados. El Gobierno defiende en el pleno del Congreso la convalidación del decreto ley por el que se adoptan medidas de adaptación tras la retirada del Reino Unido de la Unión Europea el pasado 1 de enero. EFE/ Mariscal'

In [146]:
img_tags[0].get("src")

'https://imagenes.elpais.com/resizer/8uRxVyA0u9gS3DBVOHH0g8W-sKA=/62x34/cloudfront-eu-central-1.images.arcpublishing.com/prisa/3OMZAFB7RFG35CQDT4FNCQYDNQ.jpg'
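
To collect the description and link of every image at once, a list comprehension over the `img` tags is enough. This is a small sketch on made-up markup (the `example.com` URLs are placeholders). It also shows why `.get` is convenient when an attribute may be missing.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the second image deliberately lacks an alt attribute
html = """
<img alt="A mountain bike" src="https://example.com/bike.jpg"/>
<img src="https://example.com/logo.png"/>
"""
soup = BeautifulSoup(html, "html.parser")

# .get returns None for a missing attribute, whereas tag["alt"] would raise KeyError
images = [(img.get("alt"), img.get("src")) for img in soup.find_all("img")]
print(images)
```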

### Querying by other attributes

In [147]:
soup = BeautifulSoup(requests.get("https://www.elmundo.es/").content, "html.parser")

In [149]:
len(soup.find_all("a"))

377

In [151]:
menuitems = soup.find_all("a", attrs={"role": "menuitem"})

In [153]:
[m.text for m in menuitems]

['España  ',
 'España',
 'Madrid',
 'Andalucía',
 'Sevilla',
 'Málaga',
 'El caminante',
 'Campus andaluz',
 'Baleares',
 'Ibiza',
 'Castilla y León',
 'El correo de Burgos',
 'Diario de Soria',
 'Diario de Valladolid',
 'Cataluña',
 'Elecciones catalanas',
 'Comunidad Valenciana',
 'Castellón',
 'País Vasco',
 'Opinión  ',
 'Opinión',
 'Editoriales',
 'Columnistas',
 'Blogs',
 'Economía  ',
 'Economía',
 'Actualidad económica',
 'Consumistas',
 'Macroeconomía',
 'Empresas',
 'Vivienda',
 'INnovadores',
 'Comparador',
 'Los más ricos',
 'Internacional  ',
 'Internacional',
 'Europa',
 'América',
 'Asia',
 'África',
 'Oceanía',
 'Deportes  ',
 'Deportes',
 'Fútbol',
 'LaLiga Santander',
 'LaLiga SmartBank',
 'Champions League',
 'Europa League',
 'Copa del Rey',
 'Premier League',
 'Bundesliga',
 'Serie A',
 'Ligue 1',
 'Liga portuguesa',
 'Liga argentina',
 'Segunda división B',
 'UEFA Nations League',
 'Eurocopa',
 'Fútbol femenino',
 'Baloncesto',
 'Liga Endesa',
 'NBA',
 'Euroliga',

## Exercise

Can we just change the URL to "https://www.deporvillage.com/textil-running" and obtain analogous results?  

## Comments

Always check whether there is an **API** before scraping, because an API is usually:
 * much easier to use
 * well documented
 * preferred by the server

The site's `robots.txt` file tells us which paths the server asks automated clients not to visit.
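
Python's standard library can read these rules. The sketch below parses a made-up `robots.txt` (the `Disallow` rule and the `example.com` URLs are assumptions, not taken from any real site) and asks whether two paths may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real file lives at the root of the site
robots_lines = [
    "User-agent: *",
    "Disallow: /checkout/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

allowed_listing = parser.can_fetch("*", "https://example.com/bicicletas-mtb")
allowed_checkout = parser.can_fetch("*", "https://example.com/checkout/cart")
print(allowed_listing, allowed_checkout)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing hand-written lines.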

[Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

## Summary

 * Webpages are built with HTML, CSS, and JavaScript
 * HTML holds the content, so HTML is what we scrape
 * `requests` to `get` the HTML
 * `Beautiful Soup` to programmatically analyse the HTML

 * HTML is hierarchical
 * HTML uses tags
 * HTML tags have attributes
 * We find tags by tag name, class, id, or any other attribute
 * We can use CSS selectors to express very specific queries

 * Check your guesses in the browser console with `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` or similar

## Further materials

[Web archive](http://web.archive.org/): find how a webpage looked at different points in the past.