# Web scraping

## Webpages

Webpages are built mainly with 3 tools:
 * HTML: content (structure, headings, paragraphs, tables...)
 * CSS: style (color, shape, size...)
 * JavaScript: logic (clicks, popups, dynamic banners...)

## HTML

### Basics

HTML code consists of `<tagged>` content.

HTML has a hierarchical structure: parent tags, child tags, sibling tags:  
```
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
```

### Tags

Tags may be classified into different groups, depending on the type of content they are expected to hold:
 * heading: `<h1>`, `<h2>`, `<h3>`, `<hgroup>`...
 * phrasing: `<b>`, `<img>`, `<sub>`...
 * embedded: `<audio>`, `<img>`, `<video>`...
 * tabulated: `<table>`, `<tr>`, `<tbody>`...
 * sections: `<header>`, `<section>`, `<article>`...
 * metadata: `<meta>`, `<title>`, `<script>`...

### Attributes

This tag has no attributes:  
`<div> Zapas Marca Joma X54 </div>`

Tags may have attributes. Here,  
`<div class="price-item" id="offer"> Zapas Marca Joma X54 </div>`,  the `div` tag has:
 * a `class` attribute with value `price-item`
 * an `id` attribute with value `offer`  

The `id` attribute should be unique within a page (no two tags should share the same `id`).

`class` is not intended to be unique; it usually groups tags with similar styling or behavior.

Other frequently used attributes are:
 * `dir`
 * `lang`
 * `style` (not to be confused with `<style>` tag)
 * `title` (not to be confused with `<title>` tag)

## Web scraping

### Intro

When scraping, we want to filter tags by:
 * tag name
 * class
 * id
 * other attribute

The browser's developer tools are very useful for this: we can **Inspect** any element on the page and find the corresponding piece of HTML code.

We use the `requests` library to bring the HTML content into our Python script.

We use the `Beautiful Soup` library to easily navigate through the HTML in Python.

[BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) 🤭

In [None]:
!pip install beautifulsoup4

In [1]:
import requests

In [2]:
from bs4 import BeautifulSoup

### Example: Shoe store webpage

Let's scrape a shoe store!

In [55]:
url_depor = "https://www.nike.com/es/w/hombre-lifestyle-zapatillas-13jrmznik1zy7ok"

#### Research

In [56]:
response = requests.get(url_depor)

In [57]:
response

<Response [200]>

In [58]:
response.status_code

200

When we requested API endpoints, the response content was generally JSON.

Let's see what happens in this case.

In [62]:
try:
    response.json()
except ValueError:
    # requests raises a ValueError subclass when the body is not valid JSON
    print("Response is not JSON")

Response is not JSON


But we just requested a non-API URL, so we receive raw HTML content!

In [65]:
# just show some substring
response.content[:1000]

b"\n    <!doctype html>\n    \n    \n          \n      <!-- segmenter ran at 2021-06-18T08:46:27.011Z -->\n      \n      \n\n      \n        \n      \n    \n          \n\n<script>\nif (!(() =>\n            document.cookie.includes('audience_segmentation_performed'))([])) {\n  (fragmentPath => {\n  const error = new Error(\n    // eslint-disable-next-line prefer-template\n    'ESI fragment ' + fragmentPath + ' requested, but not found in client',\n  );\n  // eslint-disable-next-line no-undef\n  if (window.newrelic) {\n    console.warn(error);\n    // eslint-disable-next-line no-undef\n    newrelic.noticeError(error, {\n      type: 'ESI_LOAD_ERROR',\n      fragment: fragmentPath,\n    });\n  } else {\n    throw error; // let another monitoring system catch the error.\n  }\n})('https://www.nike.com/fragments/audience')\n} else {\n  (fragmentPath => {\n  // eslint-disable-next-line no-undef\n  if (window.newrelic) {\n    // eslint-disable-next-line no-undef\n    newrelic.addPageAction('ESI

In [66]:
b"<span>" in response.content

True

In [67]:
type(response.content)

bytes

`Beautiful Soup` parses these raw bytes into an object we can search and navigate.

In [68]:
soup = BeautifulSoup(response.content)

In [69]:
type(soup)

bs4.BeautifulSoup
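To see what the parsed object offers without hitting the network, here is a minimal sketch that parses the small HTML document from the Basics section (with a `class` and an `id` added to the paragraph for illustration). The name `demo_soup` is made up for this example, and passing `"html.parser"` explicitly is an assumption about which parser you want; it is the one bundled with Python.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <head><title>Page Title</title></head>
    <body>
        <h1>My First Heading</h1>
        <p class="intro" id="first">My first paragraph.</p>
    </body>
</html>
"""

# Naming the parser avoids depending on whichever parser happens to be installed
demo_soup = BeautifulSoup(html, "html.parser")

p = demo_soup.find("p")
print(p.name)                # the tag name
print(p.parent.name)         # the tag that contains it
print(p.attrs)               # attributes as a dict; class is a list of names
print(demo_soup.title.text)  # text inside <title>
```

Different parsers can repair broken HTML in slightly different ways, so counts such as `len(soup.find_all("div"))` may vary from one parser to another.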

#### Getting the title tag

`find` and `find_all` search by tag name: `find` returns the first match (or `None`), while `find_all` returns every match.

In [72]:
soup.find("title")

<title data-react-helmet="true">Men's Lifestyle Shoes. Nike ES</title>

[Documentation .find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all)    

In [77]:
len(soup.find_all("span"))

192

In [78]:
len(soup.find_all("div"))

1148

#### Getting shoe names

Open the browser's developer tools and inspect one of the product names to see which tag and class it uses.

You can find html tags by running `soup.find_all(name=tag_name, class_=class_name)`

In [79]:
shoes_names_tags = soup.find_all(
    name="div", 
    class_="product-card__title"
)

In [81]:
type(shoes_names_tags)

bs4.element.ResultSet

In [82]:
len(shoes_names_tags)

24

**HINT**: run `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` in the browser console to check whether your guess matches the right elements.  

Replace `a` with the tag you are looking for, or with `.class_name` to select by class name.

In [85]:
shoes_names_tags[0]

<div class="product-card__title" id="Nike Blazer Low X">Nike Blazer Low X</div>

In [86]:
shoes_names_tags[0].text

'Nike Blazer Low X'

In [87]:
names = [shoe.text for shoe in shoes_names_tags]

In [88]:
names[:10]

['Nike Blazer Low X',
 "Nike Blazer Low '77",
 'Nike Air Max Plus',
 'Nike Air Max Plus 3',
 "Nike Blazer Low '77",
 "Nike Blazer Mid '77",
 'Nike Air Max Genome',
 'Nike Crater Impact',
 'Nike Air Max 95 Essential',
 'Nike Air Max 95 OG']

#### Shoe prices

Inspect the page, guess which tag holds the prices, and paint all item prices green in the console to confirm the guess.

In [95]:
price_div_tags = soup.find_all("div", class_="product-card__price-wrapper")

In [96]:
len(price_div_tags)

24

In [108]:
def precio_str_to_float(texto_precio):
    # "94,99 €" -> 94.99: keep the first token and use a dot as decimal separator
    return float(texto_precio.split()[0].replace(",", "."))

In [116]:
precios = [precio_str_to_float(p.text) for p in price_div_tags]

In [117]:
precios

[94.99,
 89.99,
 169.99,
 179.99,
 89.99,
 99.99,
 169.99,
 109.99,
 169.99,
 169.99,
 109.99,
 169.99,
 169.99,
 179.99,
 109.99,
 149.99,
 119.99,
 109.99,
 84.99,
 74.99,
 180.0,
 179.99,
 129.99,
 129.99]
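The conversion above works for prices such as `94,99 €`. As a hedged sketch, a slightly more defensive variant could look like the following; `parse_euro_price` is a hypothetical helper, and it assumes Spanish-style formatting where `.` is a thousands separator and `,` is the decimal separator.

```python
import re


def parse_euro_price(text):
    """Hypothetical helper: turn text like '1.299,99 €' into 1299.99."""
    # First run of digits, optionally with '.' groups and a ',' decimal part
    match = re.search(r"\d[\d.]*(?:,\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    number = match.group(0).replace(".", "").replace(",", ".")
    return float(number)


print(parse_euro_price("94,99 €"))
print(parse_euro_price("1.299,99 €"))
```

Raising an error when no number is found makes a changed page layout visible immediately instead of silently producing wrong prices.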

#### Getting name and price together

In [168]:
cuadros = soup.find_all("div", class_="product-card__body")

In [169]:
len(cuadros)

24

`find` and `find_all` can also be called on any single tag, such as an element of `cuadros`, to search only inside that tag.

In [125]:
c = cuadros[0]

In [126]:
c

<div class="product-card__body" data-el-type="Card"><figure><a class="product-card__link-overlay" href="https://www.nike.com/es/t/blazer-low-zapatillas-xBBTB0/DA2045-100">Nike Blazer Low X</a><a aria-label="Nike Blazer Low X" class="product-card__img-link-overlay" data-el-type="Hero" href="https://www.nike.com/es/t/blazer-low-zapatillas-xBBTB0/DA2045-100"><div class="wall-image-loader css-1la3v4n"><div><noscript><img alt="Nike Blazer Low X Zapatillas - Hombre" class="css-1fxh5tw product-card__hero-image" height="400" loading="lazy" src="https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/82aee44a-6880-47b7-90df-c5174965732f/blazer-low-zapatillas-xBBTB0.png" width="400"/></noscript></div></div></a><div class="product-card__info disable-animations"><div class="">

In [133]:
c.find("div", class_="product-card__title")

<div class="product-card__title" id="Nike Blazer Low X">Nike Blazer Low X</div>

In [131]:
c.find("div", class_="product-card__price")

<div class="product-card__price"><div class="product-price__wrapper css-cl9118"><div class="product-price css-11s12ax is--current-price" data-test="product-price">94,99 €</div></div></div>

In [157]:
def get_properties(cuadro):
    name = cuadro.find("div", class_="product-card__title").text
    sub = cuadro.find("div", class_="product-card__subtitle").text    
    price = precio_str_to_float(cuadro.find("div", class_="product-card__price").text)
    
    return {
        "name": name,
        "sub": sub,
        "price": price
    }

In [158]:
get_properties(cuadros[11])

{'name': 'Nike Air Max Plus', 'sub': 'Zapatillas - Hombre', 'price': 169.99}

In [159]:
import pandas as pd

In [160]:
df = pd.DataFrame([get_properties(cuadrito) for cuadrito in cuadros])

In [161]:
df.sort_values("price").tail()

| | name | sub | price |
|---|---|---|---|
| 11 | Nike Air Max Plus | Zapatillas - Hombre | 169.99 |
| 13 | Nike Air Max Plus 3 | Zapatillas - Hombre | 179.99 |
| 3 | Nike Air Max Plus 3 | Zapatillas - Hombre | 179.99 |
| 21 | Nike Air Max Plus 3 | Zapatillas - Hombre | 179.99 |
| 20 | Nike Air Max 97 | Zapatillas - Hombre | 180.0 |


In [165]:
df["sub"].value_counts()

Zapatillas - Hombre         18
Zapatillas                   4
Zapatillas de skateboard     2
Name: sub, dtype: int64

In [163]:
df[df.name.str.contains("Plus")]

| | name | sub | price |
|---|---|---|---|
| 2 | Nike Air Max Plus | Zapatillas - Hombre | 169.99 |
| 3 | Nike Air Max Plus 3 | Zapatillas - Hombre | 179.99 |
| 11 | Nike Air Max Plus | Zapatillas - Hombre | 169.99 |
| 12 | Nike Air Max Plus | Zapatillas - Hombre | 169.99 |
| 13 | Nike Air Max Plus 3 | Zapatillas - Hombre | 179.99 |
| 21 | Nike Air Max Plus 3 | Zapatillas - Hombre | 179.99 |


In [162]:
df[df.price < 100]

| | name | sub | price |
|---|---|---|---|
| 0 | Nike Blazer Low X | Zapatillas - Hombre | 94.99 |
| 1 | Nike Blazer Low '77 | Zapatillas | 89.99 |
| 4 | Nike Blazer Low '77 | Zapatillas - Hombre | 89.99 |
| 5 | Nike Blazer Mid '77 | Zapatillas - Hombre | 99.99 |
| 18 | Nike SB Zoom Stefan Janoski RM | Zapatillas de skateboard | 84.99 |
| 19 | Nike SB Shane | Zapatillas de skateboard | 74.99 |


There are many pages, what if we want to scrape all of them?  
1. Build a function `get_df_from_url` and insert all the previous logic  
2. Build a list of urls
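The two steps above could be sketched roughly as follows. This is only an outline under assumptions: the class names are the ones seen on the page at the time of writing, `parse_cards` and `get_df_from_urls` are names made up here, and how the site paginates (and therefore how the list of URLs is built) is not covered by this notebook.

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_cards(html):
    """Return one dict per product card found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="product-card__body"):
        title = card.find("div", class_="product-card__title")
        price = card.find("div", class_="product-card__price")
        if title is None or price is None:
            continue  # skip cards missing a name or a price
        rows.append({
            "name": title.text.strip(),
            "price": float(price.text.split()[0].replace(",", ".")),
        })
    return rows


def get_df_from_url(url):
    """Download one listing page and return its products as a DataFrame."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return pd.DataFrame(parse_cards(response.text))


def get_df_from_urls(urls, pause=1.0):
    """Scrape several pages, pausing between requests to go easy on the server."""
    frames = []
    for url in urls:
        frames.append(get_df_from_url(url))
        time.sleep(pause)
    return pd.concat(frames, ignore_index=True)
```

Keeping the parsing in its own function means it can be tried on a saved HTML string without making any request.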

## Advanced

### CSS selectors

We can use CSS selectors to find in a more specific way:
 * descendant selectors
 * combined selectors
 * siblings
 * has attribute
 * ...

In what other ways can we use CSS selectors to find tags inside HTML content?

[Documentation .select](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) 

`soup.select("tagname1 tagname2")` finds every `tagname2` nested at any depth inside a `tagname1`.
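A few of the selector kinds listed above, tried on a small made-up snippet so the results are easy to check by eye. The HTML and the name `demo` are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div id="menu">
    <ul>
        <li class="item first"><a href="/home">Home</a></li>
        <li class="item"><a href="/news" target="_blank">News</a></li>
    </ul>
    <p>Footer <a href="/about">About</a></p>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

descendants = demo.select("div a")     # every <a> anywhere inside a <div>
children = demo.select("ul > li")      # <li> tags directly under a <ul>
siblings = demo.select("li + li")      # an <li> right after another <li>
with_attr = demo.select("a[target]")   # <a> tags that have a target attribute
about = demo.select_one("#menu p a")   # first match only, searching by id

print(len(descendants), len(children), len(siblings), len(with_attr))
print(about["href"])
```

`select` always returns a list, while `select_one` returns the first match or `None`.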

In [184]:
# how many spans?
len(soup.select("span"))

192

In [185]:
len(soup.select("span span"))

6

In [186]:
# how many spans inside spans inside spans?
len(soup.select("span span span"))

0

In [191]:
# how many spans are nested inside at least eight levels of divs?
len(soup.select("div div div div div div div div span"))

177

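We use a `.` to find by class:

`soup.select(".classname")`

A `class` attribute containing spaces actually lists several classes. To require all of them, chain the class names with `.` instead of spaces, for example `.logo.primary_claim`.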

In [193]:
len(soup.select(".product-card__link-overlay"))

24

`soup.select("tagname.classname")` combines both conditions: the tag name and the class must match.

In [196]:
response = requests.get("https://www.elpais.com")

In [197]:
soup_pais = BeautifulSoup(response.content)

In [200]:
soup_pais.select("div.logo.primary_claim")

[<div class="logo primary_claim"></div>]
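The difference between one class and several chained classes can be checked on a tiny made-up example (the HTML and the name `demo` are illustrative):

```python
from bs4 import BeautifulSoup

html = '<div class="logo primary_claim"></div><div class="logo"></div>'
demo = BeautifulSoup(html, "html.parser")

both_classes = demo.select("div.logo.primary_claim")  # must carry both classes
any_logo = demo.select("div.logo")                    # carries at least "logo"

print(len(both_classes), len(any_logo))
print(demo.div["class"])  # Beautiful Soup stores class as a list of names
```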

[Beautiful Soup selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Getting attribute values

Do you want to get the description and hyperlinks to images in a webpage?  

We use `tag.get(attr_name)` to get the attribute value
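A small sketch of how `get` behaves when an attribute is missing, using made-up `<img>` tags rather than a live page:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<img src="a.png" alt="A"><img src="b.png">', "html.parser")
imgs = demo.find_all("img")

alts = [img.get("alt") for img in imgs]  # None where the attribute is absent
srcs = [img["src"] for img in imgs]      # [] raises KeyError if it is absent
fallback = imgs[1].get("alt", "")        # get also accepts a default value

print(alts, srcs, repr(fallback))
```

Preferring `get` over square brackets keeps a loop over many tags from failing on the first tag that lacks the attribute.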

In [201]:
img_tags = soup_pais.select("img")

In [202]:
len(img_tags)

4

In [206]:
img_tags[0]

<img alt="SEVILLA, 12/06/2021,- Aficionados se fotografían junto a una réplica del balón del torneo en el estadio de La Cartuja de Sevilla. El España-Suecia, con el que la selección que entrena Luis Enrique inicia la Eurocopa el próximo lunes, será el tercero con aficionados en el estadio sevillano desde que en marzo del pasado año se dictaron normas sanitarias por la covid-19 y solo puntualmente se autorizó la presencia de espectadores en los recintos deportivos. EFE/Julio Muñoz" class="block width_full breakout_mobile" loading="eager" src="https://imagenes.elpais.com/resizer/R-vHzyT-R8uqnVTA0x8IziAuG-M=/62x62/cloudfront-eu-central-1.images.arcpublishing.com/prisa/LLLOTEC42JGXPJBGXIA2BQUH5M.jpg"/>

In [207]:
img_tags[0].get("alt")

'SEVILLA, 12/06/2021,- Aficionados se fotografían junto a una réplica del balón del torneo en el estadio de La Cartuja de Sevilla. El España-Suecia, con el que la selección que entrena Luis Enrique inicia la Eurocopa el próximo lunes, será el tercero con aficionados en el estadio sevillano desde que en marzo del pasado año se dictaron normas sanitarias por la covid-19 y solo puntualmente se autorizó la presencia de espectadores en los recintos deportivos. EFE/Julio Muñoz'

In [205]:
img_tags[0].get("src")

'https://imagenes.elpais.com/resizer/R-vHzyT-R8uqnVTA0x8IziAuG-M=/62x62/cloudfront-eu-central-1.images.arcpublishing.com/prisa/LLLOTEC42JGXPJBGXIA2BQUH5M.jpg'

In [208]:
for im in img_tags:
    print(im.get("alt"), im.get("src"))

SEVILLA, 12/06/2021,- Aficionados se fotografían junto a una réplica del balón del torneo en el estadio de La Cartuja de Sevilla. El España-Suecia, con el que la selección que entrena Luis Enrique inicia la Eurocopa el próximo lunes, será el tercero con aficionados en el estadio sevillano desde que en marzo del pasado año se dictaron normas sanitarias por la covid-19 y solo puntualmente se autorizó la presencia de espectadores en los recintos deportivos. EFE/Julio Muñoz https://imagenes.elpais.com/resizer/R-vHzyT-R8uqnVTA0x8IziAuG-M=/62x62/cloudfront-eu-central-1.images.arcpublishing.com/prisa/LLLOTEC42JGXPJBGXIA2BQUH5M.jpg
¿Cómo de bueno es cada equipo? https://imagenes.elpais.com/resizer/0S0H_TRPIaopf1nWnbJn52__kF8=/62x62/cloudfront-eu-central-1.images.arcpublishing.com/prisa/JTUJ7NTDLZHHVCCDVHE6MYOCII.jpg
Dior recupera los grandes desfiles con una monumental puesta en escena en Atenas https://imagenes.elpais.com/resizer/Jbb_MGMhVGSE3LBc4EF6H5NekHw=/62x46/cloudfront-eu-central-1.imag

### Querying by other attributes

In [215]:
soup = BeautifulSoup(requests.get("https://www.elmundo.es/").content)

In [216]:
len(soup.find_all("a"))

346

In [218]:
[a.text for a in soup.find_all("li", attrs={"role": "presentation"})]

[' España     España   Madrid   Andalucía   Sevilla   Málaga   El caminante   Campus andaluz     Baleares   Ibiza     Castilla y León   El correo de Burgos   Diario de Soria   Diario de Valladolid     Cataluña   Comunidad Valenciana   Castellón     País Vasco   El panel EM/Sigmados   ',
 ' España ',
 ' Madrid ',
 ' Andalucía   Sevilla   Málaga   El caminante   Campus andaluz   ',
 ' Sevilla ',
 ' Málaga ',
 ' El caminante ',
 ' Campus andaluz ',
 ' Baleares   Ibiza   ',
 ' Ibiza ',
 ' Castilla y León   El correo de Burgos   Diario de Soria   Diario de Valladolid   ',
 ' El correo de Burgos ',
 ' Diario de Soria ',
 ' Diario de Valladolid ',
 ' Cataluña ',
 ' Comunidad Valenciana   Castellón   ',
 ' Castellón ',
 ' País Vasco ',
 ' El panel EM/Sigmados ',
 ' Opinión     Opinión   Editoriales   Columnistas   Blogs   ',
 ' Opinión ',
 ' Editoriales ',
 ' Columnistas ',
 ' Blogs ',
 ' Economía     Economía   Actualidad económica   Consumistas   Macroeconomía   Empresas   Vivienda   INnovad
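The `attrs` dictionary used above works for any attribute, including names such as `data-section` that are not valid Python keyword arguments. A minimal sketch on made-up HTML (the `data-section` attribute and the name `demo` are invented for this example):

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li role="presentation" data-section="spain">España</li>
    <li role="presentation" data-section="opinion">Opinión</li>
    <li>Other</li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

by_role = demo.find_all("li", attrs={"role": "presentation"})
by_data = demo.find_all("li", attrs={"data-section": "opinion"})
by_css = demo.select('li[role="presentation"]')  # same query as a CSS selector
has_role = demo.find_all("li", role=True)        # any value, attribute present

print([li.text for li in by_data])
```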

## Example: Wikipedia medal table

In [219]:
url_medallero = "https://es.wikipedia.org/wiki/Juegos_Ol%C3%ADmpicos_de_Barcelona_1992"

In [220]:
soup = BeautifulSoup(requests.get(url_medallero).content)

In [225]:
# tables
tablas = soup.find_all("table")

In [232]:
len(tablas)

14

In [238]:
for i, t in enumerate(tablas):
    if "Unificado" in str(t) and "45" in str(t):
        print(i)

9
10


`tablas[10]` seems to be the table we are after.

In [242]:
t = tablas[10]

In [250]:
tbody = t.find("tbody")

In [251]:
tbody

<tbody><tr>
<th>Núm.
</th>
<th>País
</th>
<th><a class="image" href="/wiki/Archivo:Gold_medal.svg" title="Oro"><img alt="Oro" data-file-height="300" data-file-width="300" decoding="async" height="18" src="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Gold_medal.svg/18px-Gold_medal.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Gold_medal.svg/27px-Gold_medal.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/1/15/Gold_medal.svg/36px-Gold_medal.svg.png 2x" width="18"/></a>
</th>
<th><a class="image" href="/wiki/Archivo:Silver_medal.svg" title="Plata"><img alt="Plata" data-file-height="300" data-file-width="300" decoding="async" height="18" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/03/Silver_medal.svg/18px-Silver_medal.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/03/Silver_medal.svg/27px-Silver_medal.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/03/Silver_medal.svg/36px-Silver_medal.svg.png 2x" width="

In [252]:
trs = tbody.find_all("tr")

In [253]:
len(trs)

11

In [257]:
tr = trs[1]

In [258]:
tr

<tr>
<td>1</td>
<td align="left"><img alt="Equipo Unificado" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Olympic_flag.svg/22px-Olympic_flag.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Olympic_flag.svg/33px-Olympic_flag.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Olympic_flag.svg/44px-Olympic_flag.svg.png 2x" width="22"/> <a href="/wiki/Equipo_Unificado_en_los_Juegos_Ol%C3%ADmpicos_de_Barcelona_1992" title="Equipo Unificado en los Juegos Olímpicos de Barcelona 1992">Equipo Unificado </a> <small>(EUN)</small></td>
<td>45</td>
<td>38</td>
<td>29</td>
<td>112
</td></tr>

In [272]:
cols = tr.find_all("td")

In [273]:
cols

[<td>1</td>,
 <td align="left"><img alt="Equipo Unificado" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Olympic_flag.svg/22px-Olympic_flag.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Olympic_flag.svg/33px-Olympic_flag.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Olympic_flag.svg/44px-Olympic_flag.svg.png 2x" width="22"/> <a href="/wiki/Equipo_Unificado_en_los_Juegos_Ol%C3%ADmpicos_de_Barcelona_1992" title="Equipo Unificado en los Juegos Olímpicos de Barcelona 1992">Equipo Unificado </a> <small>(EUN)</small></td>,
 <td>45</td>,
 <td>38</td>,
 <td>29</td>,
 <td>112
 </td>]

In [279]:
cols[5].text.strip()

'112'

In [290]:
def get_row_info(row):
    info = dict()
    
    cols = row.find_all("td")
    
    # int() ignores surrounding whitespace, so the numbers need no strip()
    info["rank"] = int(cols[0].text)
    info["country"] = cols[1].text.strip()
    info["ngold"] = int(cols[2].text)
    info["nsilver"] = int(cols[3].text)
    info["nbronze"] = int(cols[4].text)
    info["total"] = int(cols[5].text)
    
    return info

In [294]:
get_row_info(trs[5])

{'rank': 5,
 'country': 'Cuba  (CUB)',
 'ngold': 14,
 'nsilver': 6,
 'nbronze': 11,
 'total': 31}

In [292]:
trs

[<tr>
 <th>Núm.
 </th>
 <th>País
 </th>
 <th><a class="image" href="/wiki/Archivo:Gold_medal.svg" title="Oro"><img alt="Oro" data-file-height="300" data-file-width="300" decoding="async" height="18" src="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Gold_medal.svg/18px-Gold_medal.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Gold_medal.svg/27px-Gold_medal.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/1/15/Gold_medal.svg/36px-Gold_medal.svg.png 2x" width="18"/></a>
 </th>
 <th><a class="image" href="/wiki/Archivo:Silver_medal.svg" title="Plata"><img alt="Plata" data-file-height="300" data-file-width="300" decoding="async" height="18" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/03/Silver_medal.svg/18px-Silver_medal.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/03/Silver_medal.svg/27px-Silver_medal.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/03/Silver_medal.svg/36px-Silver_medal.svg.png 2x" width=

In [298]:
for i, row in enumerate(trs):
    try:
        get_row_info(row)
    except (IndexError, ValueError):
        # header rows use <th> instead of <td>, so there are no columns to read
        print(i)

0


In [297]:
pd.DataFrame([get_row_info(row) for row in trs[1:]])

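| | rank | country | ngold | nsilver | nbronze | total |
|---|---|---|---|---|---|---|
| 0 | 1 | Equipo Unificado (EUN) | 45 | 38 | 29 | 112 |
| 1 | 2 | Estados Unidos (USA) | 37 | 34 | 37 | 108 |
| 2 | 3 | Alemania (GER) | 33 | 21 | 28 | 82 |
| 3 | 4 | China (CHN) | 16 | 22 | 16 | 54 |
| 4 | 5 | Cuba (CUB) | 14 | 6 | 11 | 31 |
| 5 | 6 | España (ESP) | 13 | 7 | 2 | 22 |
| 6 | 7 | Corea del Sur (KOR) | 12 | 5 | 12 | 29 |
| 7 | 8 | Hungría (HUN) | 11 | 12 | 7 | 30 |
| 8 | 9 | Francia (FRA) | 8 | 5 | 16 | 29 |
| 9 | 10 | Australia (AUS) | 7 | 9 | 11 | 27 |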
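The same idea can be written generically: keep only rows that contain `<td>` cells, so header rows drop out without slicing by position. This is a sketch on a cut-down copy of the table; `table_to_rows` is a name made up here.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><th>Núm.</th><th>País</th><th>Oro</th></tr>
<tr><td>1</td><td> Equipo Unificado <small>(EUN)</small></td><td>45
</td></tr>
<tr><td>2</td><td> Estados Unidos <small>(USA)</small></td><td>37
</td></tr>
</table>
"""


def table_to_rows(table):
    """Return the text of every <td> cell, one list per data row."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # header rows only contain <th>
        # get_text(" ", strip=True) joins nested text pieces and trims whitespace
        rows.append([td.get_text(" ", strip=True) for td in cells])
    return rows


demo_table = BeautifulSoup(html, "html.parser").find("table")
print(table_to_rows(demo_table))
```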


## Comments

Always check whether the site offers an **API** before scraping, because an API is:
 * much easier to use
 * usually well documented
 * preferred by the server

A site's `robots.txt` file states which paths its owner asks automated clients not to crawl, so read it before scraping.
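Python's standard library can read these rules. The sketch below parses a made-up `robots.txt` from a string so it runs offline; the rules, the `my-scraper` user agent and the `example.com` URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # feed the rules line by line

blocked = parser.can_fetch("my-scraper", "https://example.com/private/page")
allowed = parser.can_fetch("my-scraper", "https://example.com/products")
print(blocked, allowed)
```

For a live site you would instead call `parser.set_url(...)` with the address of its `robots.txt` and then `parser.read()` before asking `can_fetch`.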

## Summary

 * Websites are built with HTML, CSS and JavaScript
 * HTML holds the content. We scrape HTML
 * `requests` to `get` the HTML
 * `Beautiful Soup` to programmatically analyse the HTML

 * HTML is hierarchical
 * HTML uses tags
 * HTML tags have attributes
 * We find tags by tag name, class, id, or other attributes
 * We can use CSS selectors to express very specific queries

 * Check your guesses in the browser console with `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` or similar

## Further materials

[Web archive](http://web.archive.org/): browse snapshots of webpages as they looked in the past!

In [300]:
soup = BeautifulSoup(requests.get("https://www.20minutos.com").content)

In [307]:
len(soup.find_all("a"))

164

In [306]:
[s.text for s in soup.select("div.media-content header h1 a")]

['ONU: pese al coronavirus, millones dejaron sus casas en 2020',
 'Israel enviará 1 millón de vacunas de COVID-19 a palestinos',
 'Sistema tropical dejará lluvias, crecidas en Golfo de México',
 'Sistema tropical dejará lluvias, crecidas en Golfo de México',
 'Kim Jong Un promete estar listo para confrontación con EEUU',
 'Juicio nulo en caso de salario a migrantes detenidos en EEUU',
 'ELN niega estar detrás de atentado a base militar colombiana',
 'Canadá: Preferible vacuna de Pfizer o Moderna para 2da dosis',
 'Grupo hispano demanda al alcalde de Santa Fe por obelisco',
 'Muere Kenneth Kaunda, 1er presidente electo de Zambia',
 'Huelga general en Líbano por crisis financiera y política',
 'Ambientalistas interponen queja judicial contra España',
 'España y Polonia chocan en la Euro peleados con el gol',
 'Alemania busca encontrar contundencia frente a Portugal',
 'Marineros remontan con 2 carreras en 9no, superan a Rays 6-5',
 'Angelinos ganan a Tigres impulsados por Ohtani, slam de