# Web scraping


## Webpages

Webpages are built mainly with 3 tools:
 * HTML: content (structure, headings, paragraphs, tables...)
 * CSS: style (color, shape, size...)
 * JavaScript: logic (clicks, popups, dynamic banners...)

## HTML

### Basics

HTML code consists of `<tagged>` content.

HTML has a hierarchical structure: parent tags, child tags, and sibling tags:  
```
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
```
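
The hierarchy above can be walked in Python. This is a minimal sketch using Beautiful Soup (introduced later in this notebook) with the built-in `html.parser`; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.find("title")
heading = soup.find("h1")

# Parent: the tag that directly contains <title>
title_parent = title.parent.name

# Children: tags directly inside <body> (recursive=False skips deeper levels)
body_children = [child.name for child in soup.find("body").find_all(True, recursive=False)]

# Sibling: the next <p> at the same level as <h1>
next_sibling = heading.find_next_sibling("p").name

print(title_parent, body_children, next_sibling)
```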

### Tags

Tags can be classified into groups, depending on the type of content they are expected to hold:
 * heading: `<h1>`, `<h2>`, `<h3>`, `<hgroup>`...
 * phrasing: `<b>`, `<img>`, `<sub>`...
 * embedded: `<audio>`, `<img>`, `<video>`...
 * tabulated: `<table>`, `<tr>`, `<tbody>`...
 * sections: `<header>`, `<section>`, `<article>`...
 * metadata: `<meta>`, `<title>`, `<script>`...

### Attributes

This tag has no attributes
`<div> Zapas Marca Joma X54 </div>`

Tags may have attributes. Here,  
`<div class="price-item" id="offer"> Zapas Marca Joma X54 </div>`,  the `div` tag has:
 * a `class` attribute with value `price-item`
 * an `id` attribute with value `offer`  

The `id` attribute should be unique within a page (no two tags should share the same `id`).

`class` is not meant to be unique; it usually groups tags with similar styling or behavior.

Other frequently used attributes are:
 * `dir`
 * `lang`
 * `style` (not to be confused with `<style>` tag)
 * `title` (not to be confused with `<title>` tag)
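
As a small sketch of how these attributes look from Python (using Beautiful Soup, introduced later in this notebook), the `div` from above can be parsed and its attributes read like dictionary entries. Note that Beautiful Soup treats `class` as multi-valued, so it comes back as a list.

```python
from bs4 import BeautifulSoup

html = '<div class="price-item" id="offer"> Zapas Marca Joma X54 </div>'
tag = BeautifulSoup(html, "html.parser").find("div")

all_attrs = tag.attrs        # every attribute, as a dict
tag_id = tag["id"]           # single-valued attribute: a string
tag_classes = tag["class"]   # class can hold several names: a list of strings

print(all_attrs)
print(tag_id, tag_classes)
```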

## Web scraping

### Intro

When scraping, we want to filter tags by:
 * tag name
 * class
 * id
 * other attribute
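
A minimal sketch of these four filters, run on a made-up HTML snippet (the markup and values are illustrative, not taken from a real page):

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = """
<div id="offer" class="price-item">Zapas Marca Joma X54</div>
<div class="price-item">Zapas Marca Joma X60</div>
<p class="note" lang="es">Nota</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find_all("div")                    # tag name
by_class = soup.find_all(class_="price-item")    # class (note the trailing underscore)
by_id = soup.find(id="offer")                    # id
by_attr = soup.find_all(attrs={"lang": "es"})    # any other attribute

print(len(by_tag), len(by_class), by_id.text, by_attr[0].name)
```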

The browser's developer tools are very useful for this: we can **Inspect** any element on the page and find the corresponding piece of HTML code.

We use the `requests` library to bring the HTML content into our Python script.

We use the `Beautiful Soup` library to navigate through the HTML in Python.

[BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) 🤭

In [None]:
!pip install beautifulsoup4

In [1]:
import requests

In [43]:
from bs4 import BeautifulSoup

### Example: 

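Let's scrape Amazon search results. Two search URLs are defined below (Python books and Nike products); the rest of the example uses the Nike search.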

In [28]:
url = "https://www.amazon.es/s?k=python"

In [76]:
url2 = "https://www.amazon.es/s?k=nike"

In [77]:
response = requests.get(url2)
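
A plain `requests.get(url)` has no timeout and does not complain about error status codes such as 404 or 503. The helper below is a hedged sketch of a slightly more defensive download; `fetch_html`, `DEFAULT_HEADERS` and the User-Agent string are hypothetical names chosen for this example, and the optional `session` argument only exists so the function can reuse a `requests.Session`.

```python
import requests

# Hypothetical header value: describe your own script here
DEFAULT_HEADERS = {"User-Agent": "my-scraper/0.1 (learning project)"}


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as bytes, raising on HTTP error status codes."""
    http = requests if session is None else session
    response = http.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.content


# Usage (performs a real request, so it is left commented out):
# html = fetch_html("https://www.amazon.es/s?k=nike")
```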

#### Research

When we requested API endpoints, the response content was generally JSON.

Let's see what happens in this case.

In [93]:
try:
    response.json()
except ValueError:  # JSON decoding errors are subclasses of ValueError
    print("Response is not JSON")

Response is not JSON


But we just requested a non-API URL, so we receive raw HTML content!

In [94]:
response.content

b'<!doctype html><html lang="es-es" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completion.amazon.com">\n\n<!-- sp:feature:aui-assets -->\n<link rel="stylesheet" href="https://images-eu.ssl-images-amazon.com/images/I/11EIQ5IGqaL._RC|01ZTHTZObnL.css,41SIz69qHYL.css,31qGOnSAToL.css,013z33uKh2L.css,017DsKjNQJL.css,0131vqwP5UL.css,41EWOOlBJ9L.css,11TIuySqr6L.css,01ElnPiDxWL.css,11bGSgD5pDL.css,01Dm5eKVxwL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,21N4kUH7pxL.css,01oDR3IULNL.css,41-PwE7+H0L.css,21j0IlW7xKL.css,01XPHJk60-L.css,014OeDQisGL.css,21aPhFy+riL.css,11gneA3MtJL.css,21fecG8pUzL.css,01RddH8vm-L.css,01CFUgsA-Y

In [96]:
len(response.content)

671607

In [97]:
# just show some substring
response.content[:1000]

b'<!doctype html><html lang="es-es" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completion.amazon.com">\n\n<!-- sp:feature:aui-assets -->\n<link rel="stylesheet" href="https://images-eu.ssl-images-amazon.com/images/I/11EIQ5IGqaL._RC|01ZTHTZObnL.css,41SIz69qHYL.css,31qGOnSAToL.css,013z33uKh2L.css,017DsKjNQJL.css,0131vqwP5UL.css,41EWOOlBJ9L.css,11TIuySqr6L.css,01ElnPiDxWL.css,11bGSgD5pDL.css,01Dm5eKVxwL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,21N4kUH7pxL.css,01oDR3IULNL.css,41-PwE7+H0L.css,21j0IlW7xKL.css,01XPHJk60-L.css,014OeDQisGL.css,21aPhFy+riL.css,11gneA3MtJL.css,21fecG8pUzL.css,01RddH8vm-L.css,01CFUgsA-Y

In [98]:
"<span>" in str(response.content)

True

In [99]:
type(response.content)

bytes
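
Because the content is `bytes`, `str(...)` gives the printable representation of those bytes (starting with `b'`), not the decoded text. The sketch below uses a short made-up bytes value standing in for `response.content`; in `requests`, `response.text` gives the decoded string directly.

```python
# A short bytes literal standing in for response.content (made-up sample)
content = b'<html lang="es"><span>Caf\xc3\xa9</span></html>'

as_repr = str(content)              # representation of the bytes, escapes included
as_text = content.decode("utf-8")   # the actual text, with characters decoded

print(as_repr)
print(as_text)

# Substring checks work directly on bytes when the needle is bytes too
has_span = b"<span>" in content
```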

`Beautiful Soup` helps us with the task of accessing this info

In [100]:
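# name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(response.content, "html.parser")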

In [101]:
type(soup)

bs4.BeautifulSoup

#### Getting the title tag

When using `find` or `find_all`, we search by tag

In [102]:
soup.find("title")

<title>Amazon.es: nike</title>

[Documentation .find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all)    

In [103]:
len(soup.find_all("span"))

1474

In [104]:
len(soup.find_all("div"))

1321

In [105]:
len(soup.find_all("a"))

648

In [106]:
len(soup.find_all("h2"))

49
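
The difference between `find` and `find_all` is easiest to see on a tiny document. This is a sketch on made-up HTML: `find` returns the first match (or `None`), while `find_all` returns every match (or an empty result).

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = "<html><head><title>Demo</title></head><body><h2>One</h2><h2>Two</h2></body></html>"
demo_soup = BeautifulSoup(html, "html.parser")

first_h2 = demo_soup.find("h2")           # first match only (a Tag)
all_h2 = demo_soup.find_all("h2")         # every match (a list-like ResultSet)
no_table = demo_soup.find("table")        # no match: None
no_tables = demo_soup.find_all("table")   # no match: empty result

print(first_h2.text, [t.text for t in all_h2], no_table, len(no_tables))
```

Since `find` can return `None`, calling `.text` on its result without a check raises an `AttributeError` when nothing matched.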

#### Get items from the webpage

Open the browser's developer tools and inspect a product name to see which tag and class it uses.

You can find HTML tags by running `soup.find_all(name=tag_name, class_=class_name)`

In [108]:
shoes = soup.find_all(
    name="span",
    class_="a-size-base-plus a-color-base a-text-normal"
)

In [109]:
type(shoes)

bs4.element.ResultSet

In [110]:
len(shoes)

48

**HINT**: run `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` in the browser console to check how well your guesses work.  

Replace `a` with the tag you are using, or with `.class_name` to match a class name.

In [111]:
shoes[0]

<span class="a-size-base-plus a-color-base a-text-normal">Revolution 5, Zapatillas Hombre</span>

In [112]:
shoes[0].text

'Revolution 5, Zapatillas Hombre'

In [116]:
names = [shoe.text for shoe in shoes]

In [117]:
names[:10]

['Revolution 5, Zapatillas Hombre',
 'Venture Runner, Zapatillas Hombre',
 'M tee TM Club19 SS Camiseta Hombre',
 'Court Borough Low, Zapatillas Unisex niños',
 'M Nk Rpl Park20 Rn Jkt W Chaqueta de Deporte Hombre',
 'U Nk Everyday Cush Crew 3pr Calcetines Hombre',
 'Blazer Low, Zapatillas de bsquetbol Hombre',
 'Court Royale 2, Zapatos Hombre',
 'Tanjun, Zapatillas de Running Hombre',
 'Wearallday, Zapatillas para Correr Hombre']

#### Get prices from the webpage

Inspect the page, guess which tag holds the prices, and use the console hint above to paint all item prices green.

In [118]:
prices_tags = soup.find_all("span", class_="a-price-whole")

In [120]:
len(prices_tags)

42

We found 42 prices but 48 names: some items have no price, so the two lists cannot simply be paired up by position.
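
To see why the mismatch matters, here is a sketch on made-up markup (the `card`, `name` and `price` class names are hypothetical). Pairing two independent lists shifts prices onto the wrong items, while searching inside each item's container keeps them together.

```python
from bs4 import BeautifulSoup

# Made-up markup: three product cards, the second one has no price
html = """
<div class="card"><span class="name">Shoe A</span><span class="price">50</span></div>
<div class="card"><span class="name">Shoe B</span></div>
<div class="card"><span class="name">Shoe C</span><span class="price">70</span></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

demo_names = [t.text for t in demo_soup.find_all("span", class_="name")]
demo_prices = [t.text for t in demo_soup.find_all("span", class_="price")]

# zip pairs by position and stops at the shorter list: Shoe B gets Shoe C's price
misaligned = list(zip(demo_names, demo_prices))

# Searching inside each card keeps every name next to its own price (or None)
aligned = []
for card in demo_soup.find_all("div", class_="card"):
    price_tag = card.find("span", class_="price")
    price = price_tag.text if price_tag is not None else None
    aligned.append((card.find("span", class_="name").text, price))

print(misaligned)
print(aligned)
```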

#### Finding all together, price and name

In [122]:
cuadros = soup.find_all(
    name="div",
    class_="a-section a-spacing-medium a-text-center"
)

In [123]:
len(cuadros)

48

Testing

In [127]:
cuadros[0].find("span", class_="a-size-base-plus a-color-base a-text-normal").text

'Revolution 5, Zapatillas Hombre'

In [150]:
import pandas as pd

In [171]:
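def get_stars(c):
    """Return the star rating of a product card as a float (0 if it has none)."""
    stars_tag = c.find("span", class_="a-icon-alt")
    
    if stars_tag is None:
        return 0
    else:
        # keep the first token of the rating text and turn the decimal comma into a point
        return float(stars_tag.text.split()[0].replace(",", "."))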

In [174]:
def get_properties(cuadro):
    """Extract name, price and stars from one product card."""
    name = cuadro.find("span", class_="a-size-base-plus a-color-base a-text-normal").text
    
    # Some cards have no price tag; those get a price of 0
    precio_tag = cuadro.find("span", class_="a-price-whole")
    if precio_tag is None:
        precio = 0
    else:
        # decimal comma -> decimal point
        precio = float(precio_tag.text.replace(",", "."))
        
    stars = get_stars(cuadro)
    
    return {"name": name, "price": precio, "stars": stars}

In [176]:
get_properties(cuadros[7])

{'name': 'Court Royale 2, Zapatos Hombre', 'price': 54.99, 'stars': 4.6}

In [177]:
df = pd.DataFrame([get_properties(c) for c in cuadros])

In [160]:
df[df.name.str.contains("Air")]

| | name | price |
|---|---|---|
| 10 | Air MAX Ltd 3, Zapatillas de Correr Hombre | 119.0 |
| 12 | Wmns Nike Air Court Mo IV 431847102 - Zapatill... | 132.0 |
| 14 | Air Zoom Superrep 2, Zapatillas de ftbol Hombre | 0.0 |
| 15 | Air MAX Excee, Zapatillas Hombre | 87.99 |
| 21 | Wmns Air MAX Axis - Zapatillas de Fitness para... | 38.69 |
| 24 | Air MAX Infinity 2, Zapatillas para Correr Hombre | 98.99 |
| 27 | Air Zoom Structure 23, Running Shoe Hombre | 95.99 |
| 31 | Air Tailwind 79, Zapatillas para Correr Hombre | 67.45 |
| 42 | Air MAX III, Zapatillas para Correr Hombre | 131.0 |


#### Building a DataFrame with the information

There are many result pages. What if we want to scrape all of them?  
1. Build a function `get_df_from_url` and insert all the previous logic  
2. Build a list of URLs  
3. Apply the function to every URL and concatenate the resulting DataFrames
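
One possible shape for this, as a sketch rather than a finished scraper. It reuses the class names found earlier in this notebook, keeps only name and price, and stores a missing price as `None` instead of 0. The helper names `get_df_from_html` and `get_df_from_urls` are hypothetical, and the `page` query parameter in the commented usage is an assumption that should be checked in the browser first.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

NAME_CLASS = "a-size-base-plus a-color-base a-text-normal"
CARD_CLASS = "a-section a-spacing-medium a-text-center"


def get_df_from_html(html):
    """Parse one results page into a DataFrame of names and prices."""
    page_soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in page_soup.find_all("div", class_=CARD_CLASS):
        name_tag = card.find("span", class_=NAME_CLASS)
        if name_tag is None:
            continue  # not a product card
        price_tag = card.find("span", class_="a-price-whole")
        price = None
        if price_tag is not None:
            price = float(price_tag.text.replace(",", "."))
        rows.append({"name": name_tag.text, "price": price})
    return pd.DataFrame(rows, columns=["name", "price"])


def get_df_from_url(url):
    """Download one page and parse it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return get_df_from_html(response.content)


def get_df_from_urls(urls):
    """Scrape several pages and stack the results into one DataFrame."""
    return pd.concat([get_df_from_url(u) for u in urls], ignore_index=True)


# Usage (real requests, left commented out; the page parameter is an assumption):
# urls = [f"https://www.amazon.es/s?k=nike&page={n}" for n in range(1, 4)]
# df_all = get_df_from_urls(urls)
```

Keeping the parsing in `get_df_from_html` separate from the download makes it possible to test the parsing on a saved page without hitting the server again.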

## Advanced

### CSS selectors

We can use CSS selectors to find in a more specific way:
 * descendant selectors
 * combined selectors
 * siblings
 * has attribute
 * ...

In what other ways can we use CSS selectors to find tags inside HTML content?

[Documentation .select](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) 
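
A sketch of the selector types listed above, run on a small made-up menu (the markup is illustrative only):

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = """
<div id="menu">
    <ul>
        <li class="item first"><a href="/a">A</a></li>
        <li class="item"><a>B</a></li>
        <li class="item"><a href="/c">C</a></li>
    </ul>
    <a href="/home">Home</a>
</div>
"""
menu_soup = BeautifulSoup(html, "html.parser")

descendants = menu_soup.select("div a")             # any <a> anywhere inside a <div>
children = menu_soup.select("div > a")              # only <a> directly inside a <div>
combined = menu_soup.select("li.item.first")        # <li> carrying both classes
next_sibling = menu_soup.select("li.first + li")    # the <li> right after li.first
later_siblings = menu_soup.select("li.first ~ li")  # every later <li> sibling
with_href = menu_soup.select("a[href]")             # <a> tags that have an href attribute
by_id = menu_soup.select_one("#menu")               # first match only, here by id

print(len(descendants), len(children), len(later_siblings), len(with_href))
```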

`soup.select("tagname1 tagname2")` finds tag 2 inside tag 1

In [192]:
# how many spans?
len(soup.select("span"))

1474

In [193]:
len(soup.select("span span"))

1417

In [194]:
# how many spans inside spans inside spans?
len(soup.select("span span span"))

1354

In [195]:
# how many spans nested inside eight levels of divs?
len(soup.select("div div div div div div div div span"))

1397

We use a `.` to find by class:

`soup.select(".classname")`

If the `class` attribute holds several class names separated by spaces, chain them with `.` instead of spaces, as in `.logo.primary_claim`.

In [196]:
len(soup.select(".a-price-whole"))

42
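
A short sketch of the multi-class case, using a made-up tag that carries the same three class names as the product names seen earlier:

```python
from bs4 import BeautifulSoup

# Made-up markup: one class attribute holding three class names
html = '<span class="a-size-base-plus a-color-base a-text-normal">Revolution 5</span>'
span_soup = BeautifulSoup(html, "html.parser")

# CSS: chain the class names with dots, in any order
by_all = span_soup.select("span.a-size-base-plus.a-color-base.a-text-normal")
reordered = span_soup.select("span.a-text-normal.a-size-base-plus")

# A space in a selector means "descendant", so this finds nothing here
wrong = span_soup.select("span.a-size-base-plus .a-color-base")

# find_all with a single class name also matches tags that carry extra classes
by_one = span_soup.find_all("span", class_="a-color-base")

print(len(by_all), len(reordered), len(wrong), len(by_one))
```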

`soup.select("tagname.classname")`

In [197]:
response = requests.get("https://www.elpais.com")

In [198]:
soup_pais = BeautifulSoup(response.content, "html.parser")

In [199]:
soup_pais.select("div.logo.primary_claim")

[<div class="logo primary_claim"></div>]

[Beautiful Soup selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [216]:
titles = [p.text for p in soup_pais.select("h2 a")]

In [217]:
links = ["https://www.elpais.com" + p.get("href") for p in soup_pais.select("h2 a")]

In [221]:
for t, l in zip(titles, links):
    print(f"{t[:20]:30}: {l}")

España cierra la mis          : https://www.elpais.com/espana/2021-08-27/espana-da-por-concluida-la-mision-de-evacuacion-de-afganistan.html
ISIS-K, el enemigo n          : https://www.elpais.com/internacional/2021-08-26/isis-k-el-enemigo-numero-uno-de-los-talibanes.html
Matanza en la evacua          : https://www.elpais.com/internacional/2021-08-26/un-doble-atentado-en-torno-al-aeropuerto-de-kabul-causa-al-menos-una-decena-de-muertos.html
Los avisos de amenaz          : https://www.elpais.com/internacional/2021-08-26/la-multitud-que-desafiaba-el-riesgo-de-ataques.html
Dirigentes del PP pi          : https://www.elpais.com/espana/2021-08-27/dirigentes-del-pp-piden-a-casado-que-modere-su-tono-y-su-estrategia.html
Los sindicatos aplau          : https://www.elpais.com/economia/2021-08-26/los-sindicatos-aplauden-la-prevista-subida-del-smi-aunque-creen-que-llega-tarde-mientras-la-patronal-sigue-firme-en-su-rechazo.html
España tacha de “ges          : https://www.elpais.com/economia/2021-08-
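
Gluing the domain in front of every `href` only works when all links are relative to the site root. A more robust option is `urljoin` from the standard library, which leaves absolute URLs untouched. The `href` values below are made up to cover the common cases.

```python
from urllib.parse import urljoin

base = "https://www.elpais.com"

# Made-up href values covering the common cases
hrefs = [
    "/espana/noticia.html",               # relative to the site root
    "https://www.elpais.com/economia/",   # already absolute
    "//example.com/logo.png",             # protocol-relative
]

full_links = [urljoin(base, href) for href in hrefs]
for link in full_links:
    print(link)
```

Note that `tag.get("href")` returns `None` for an `<a>` without `href`, so such tags are worth filtering out before building links.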

### Getting attribute values

Do you want to get the description and hyperlinks to images in a webpage?  

We use `tag.get(attr_name)` to get the attribute value

In [188]:
cuadros[0].find("img")

<img alt="NIKE Revolution 5, Zapatillas Hombre" class="s-image" data-image-index="1" data-image-latency="s-product-image" data-image-load="" data-image-source-density="1" src="https://m.media-amazon.com/images/I/61byLB-7wAL._AC_UL320_.jpg" srcset="https://m.media-amazon.com/images/I/61byLB-7wAL._AC_UL320_.jpg 1x, https://m.media-amazon.com/images/I/61byLB-7wAL._AC_UL480_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/61byLB-7wAL._AC_UL640_QL65_.jpg 2x, https://m.media-amazon.com/images/I/61byLB-7wAL._AC_UL800_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/61byLB-7wAL._AC_UL960_QL65_.jpg 3x"/>
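
For an `<img>` tag like the one above, the description lives in `alt` and the link in `src`. A minimal sketch on made-up markup, showing that `.get` returns `None` when an attribute is missing:

```python
from bs4 import BeautifulSoup

# Made-up markup; the second image has no alt text
html = """
<img alt="Shoe A" src="https://example.com/a.jpg"/>
<img src="https://example.com/b.jpg"/>
"""
img_soup = BeautifulSoup(html, "html.parser")

images = [
    {"description": img.get("alt"), "link": img.get("src")}
    for img in img_soup.find_all("img")
]

print(images)
```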

### Querying by other attributes

In [222]:
soup = BeautifulSoup(requests.get("https://www.elmundo.es/").content, "html.parser")

In [223]:
len(soup.find_all("a"))

358

In [230]:
soup.find("li", attrs={"role": "presentation"})

<li class="ue-c-main-navigation__list-item" role="presentation"> <a aria-expanded="false" aria-haspopup="true" class="ue-c-main-navigation__link ue-c-main-navigation__link-dropdown js-accessible-link" data-ue-cmp="MENUHOM01" data-ue-skw="espana" href="https://www.elmundo.es/espana.html" role="menuitem" tabindex="0">España <span aria-hidden="true" class="ue-c-main-navigation__link-dropdown-icon"></span> </a> <ul aria-hidden="true" aria-label="Menú España" class="ue-c-main-navigation__list ue-c-main-navigation__list--second-level ue-c-main-navigation__list-dropdown ue-c-main-navigation__list-dropdown--aligned-left ue-c-main-navigation__list-dropdown--2-columns js-accessible-list" role="menu"> <li class="ue-c-main-navigation__list-item" role="presentation"> <a class="ue-c-main-navigation__list-dropdown-title ue-c-main-navigation__link is-bold" data-ue-cmp="MENUDES01" data-ue-skw="espana" href="https://www.elmundo.es/espana.html" role="menuitem">España</a> </li> <li class="ue-c-main-naviga
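
A few more ways to query by arbitrary attributes, sketched on made-up markup loosely modelled on the menu above. Passing a dict through `attrs` is needed for names such as `data-*`, which are not valid Python keyword arguments.

```python
from bs4 import BeautifulSoup

# Made-up markup loosely modelled on a navigation menu
html = """
<ul>
    <li role="presentation"><a role="menuitem" data-ue-skw="espana" href="/espana.html">España</a></li>
    <li role="presentation"><a role="menuitem" href="/opinion.html">Opinión</a></li>
    <li><a href="/other.html">Other</a></li>
</ul>
"""
nav_soup = BeautifulSoup(html, "html.parser")

# Exact attribute value
by_role = nav_soup.find_all("li", attrs={"role": "presentation"})

# True matches any tag that has the attribute, whatever its value
has_data = nav_soup.find_all(attrs={"data-ue-skw": True})

# The same queries written as CSS attribute selectors
by_role_css = nav_soup.select('li[role="presentation"]')
has_data_css = nav_soup.select("a[data-ue-skw]")

print(len(by_role), len(has_data))
```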

## Example: Wiki medallero

In [None]:
url_medallero = "https://es.wikipedia.org/wiki/Juegos_Ol%C3%ADmpicos_de_Barcelona_1992"

In [None]:
soup = BeautifulSoup(requests.get(url_medallero).content, "html.parser")

In [None]:
# tables
tablas = soup.find_all("table")

In [None]:
len(tablas)

In [None]:
for i, t in enumerate(tablas):
    if "Unificado" in str(t) and "45" in str(t):
        print(i)

`tablas[10]` seems to be the table we are looking for

In [None]:
t = tablas[10]

In [None]:
tbody = t.find("tbody")

In [None]:
tbody

In [None]:
trs = tbody.find_all("tr")

In [None]:
len(trs)

In [None]:
tr = trs[1]

In [None]:
tr

In [None]:
cols = tr.find_all("td")

In [None]:
cols

In [None]:
cols[5].text.strip()

In [None]:
def get_row_info(row):
    """Turn one medal-table row (<tr>) into a dict of typed values."""
    info = dict()
    
    cols = row.find_all("td")
    
    # int() ignores surrounding whitespace, so no explicit strip is needed
    info["rank"] = int(cols[0].text)
    info["country"] = cols[1].text.strip()
    info["ngold"] = int(cols[2].text)
    info["nsilver"] = int(cols[3].text)
    info["nbronze"] = int(cols[4].text)
    info["total"] = int(cols[5].text)
    
    return info

In [None]:
get_row_info(trs[5])

In [None]:
trs

In [None]:
for i, row in enumerate(trs):
    try:
        get_row_info(row)
    except (ValueError, IndexError):
        # rows that are not regular medal rows (e.g. the header) fail to parse
        print(i)

In [None]:
pd.DataFrame([get_row_info(row) for row in trs[1:]])
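
The same idea can be tried offline on a tiny made-up table with the column layout assumed by `get_row_info`. This sketch skips rows that do not have exactly six `<td>` cells (the header row, which uses `<th>`, and a merged summary row) instead of slicing them off by position.

```python
from bs4 import BeautifulSoup

# Made-up table for illustration
html = """
<table>
    <tr><th>Rank</th><th>Country</th><th>Gold</th><th>Silver</th><th>Bronze</th><th>Total</th></tr>
    <tr><td>1</td><td> Country A </td><td>3</td><td>2</td><td>1</td><td>6</td></tr>
    <tr><td>2</td><td> Country B </td><td>1</td><td>0</td><td>4</td><td>5</td></tr>
    <tr><td colspan="2">Total</td><td>4</td><td>2</td><td>5</td><td>11</td></tr>
</table>
"""
table_soup = BeautifulSoup(html, "html.parser")

keys = ["rank", "country", "ngold", "nsilver", "nbronze", "total"]
medal_rows = []
for tr in table_soup.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) != len(keys):
        continue  # header row (only <th>) or a merged summary row
    row = dict(zip(keys, cells))
    for key in keys:
        if key != "country":
            row[key] = int(row[key])
    medal_rows.append(row)

print(medal_rows)
```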

## Comments

Always check whether an **API** is available before scraping, because an API is usually:
 * much easier to use
 * well documented
 * preferred by the server

`robots.txt` tells us which parts of a site the server asks crawlers not to access
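
Python's standard library can read these rules. The sketch below parses a made-up `robots.txt` from a string so that it runs offline; the user agent name `my-scraper` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules standing in for a site's robots.txt
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

can_fetch_public = parser.can_fetch("my-scraper", "https://example.com/products")
can_fetch_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

print(can_fetch_public, can_fetch_private)

# For a real site, point the parser at the file and download it instead:
# parser.set_url("https://example.com/robots.txt")
# parser.read()
```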

## Summary

 * Web pages are built with HTML, CSS, and JavaScript
 * HTML has the content. We scrape HTML
 * `requests` to `get` the HTML
 * `Beautiful Soup` to programmatically analyse the HTML

 * HTML is hierarchical
 * HTML uses tags
 * HTML tags have attributes
 * We find tags by tag name, class, id, or other attributes
 * We can use CSS selectors to select in very complex ways

 * Check your guesses in the browser console with `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` or similar

## Further materials

[Web archive](http://web.archive.org/): find snapshots of how web pages looked in the past.