# Web scraping


## Webpages

Webpages are built mainly with 3 tools:
 * HTML: content (structure, headings, paragraphs, tables...)
 * CSS: style (color, shape, size...)
 * JavaScript: logic (clicks, popups, dynamic banners...)

## HTML

### Basics

HTML code consists of `<tagged>` content.

HTML has a hierarchical structure made of parent tags, child tags, and sibling tags:
```
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
```
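
As a rough illustration of the parent/child/sibling idea, the snippet below parses the same HTML with Beautiful Soup (the library introduced later in this notebook) and walks the hierarchy. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup, Tag

html = """
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# The parent of <h1> is <body>
parent_name = h1.parent.name
# The next <p> sibling of <h1>
sibling_name = h1.find_next_sibling("p").name
# .children also yields whitespace strings, so keep only real tags
body_children = [child.name for child in soup.body.children if isinstance(child, Tag)]

print(parent_name, sibling_name, body_children)
```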

### Tags

Tags can be classified into groups, depending on the type of content they are expected to hold:
 * heading: `<h1>`, `<h2>`, `<h3>`, `<hgroup>`...
 * phrasing: `<b>`, `<img>`, `<sub>`...
 * embedded: `<audio>`, `<img>`, `<video>`...
 * tabulated: `<table>`, `<tr>`, `<tbody>`...
 * sections: `<header>`, `<section>`, `<article>`...
 * metadata: `<meta>`, `<title>`, `<script>`...

### Attributes

This tag has no attributes
`<div> Zapas Marca Joma X54 </div>`

Tags may have attributes. Here,  
`<div class="price-item" id="offer"> Zapas Marca Joma X54 </div>`,  the `div` tag has:
 * a `class` attribute with value `price-item`
 * an `id` attribute with value `offer`  



The `id` attribute should be unique within a page (no two tags should share the same `id`).

The `class` attribute is not intended to be unique; it usually groups tags with similar style or behavior.

Other frequently used attributes are:
 * `dir`
 * `lang`
 * `style` (not to be confused with `<style>` tag)
 * `title` (not to be confused with `<title>` tag)

## Web scraping

### Intro

When scraping, we want to filter tags by:
 * tag name
 * class
 * id
 * other attribute
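
A minimal sketch of these four kinds of filter, run on a small hand-written HTML string rather than a real page. The second product name and the `lang` attribute are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely based on the attribute example above
html = """
<div class="price-item" id="offer" lang="es">Zapas Marca Joma X54</div>
<div class="price-item">Zapas Marca Joma X60</div>
<p class="note">Free shipping</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find_all("div")                   # filter by tag name
by_class = soup.find_all(class_="price-item")   # filter by class
by_id = soup.find(id="offer")                   # filter by id (expected to be unique)
by_attr = soup.find_all(attrs={"lang": "es"})   # filter by any other attribute

print(len(by_tag), len(by_class), by_id.text, len(by_attr))
```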

The browser's developer tools are very useful for this: we can **Inspect** any element on the page and find the corresponding piece of HTML code.

We use the `requests` library to bring the HTML content into our Python script.

We use the `Beautiful Soup` library to navigate the HTML easily in Python.

[BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) 🤭

In [None]:
!pip install beautifulsoup4

In [None]:
import requests

In [None]:
from bs4 import BeautifulSoup

### Example: Bicycles webpage

Let's scrape a bicycle store!

In [None]:
url_depor = "https://www.deporvillage.com/bicicletas-mtb"

#### Research

In [None]:
response = requests.get(url_depor)

In [None]:
response

In [None]:
response.status_code

When we requested API endpoints, the response content was generally JSON.

This time we requested a non-API URL, so we receive raw HTML content. Let's see what it looks like:

In [None]:
response.content

In [None]:
"<span>" in response.text

In [None]:
type(response.content)

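`Beautiful Soup` helps us access this information. Naming the parser explicitly (here Python's built-in `html.parser`) avoids a warning and keeps results consistent across machines.

In [None]:
soup = BeautifulSoup(response.content, "html.parser")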

In [None]:
type(soup)

#### Getting the title tag

In [None]:
soup.find("title")

[Documentation .find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all)    

In [None]:
soup.find_all("title")

In [None]:
len(soup.find_all("div"))

#### Get bicycle names

Open Chrome's developer tools and inspect a bike name to see which tag and class hold it.

You can find HTML tags by running `soup.find_all(name=tag_name, class_=class_name)`

In [None]:
bike_names_tags = soup.find_all(
    name="div", 
    class_="product-item-name"
)

In [None]:
type(bike_names_tags)

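Equivalently, you can use CSS selectors, a syntax shared by browsers and many scraping tools: `soup.select("tag_name.class_name")`. If the `class` attribute contains spaces, the tag has several classes; chain them with `.` instead of the spaces (for example `tag_name.class_one.class_two`).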
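
To make the note about spaces concrete, here is a small sketch on hand-written HTML. The `featured` class and the bike names are hypothetical.

```python
from bs4 import BeautifulSoup

# The first <div> has two classes, because its class attribute contains a space
html = (
    '<div class="product-item-name featured">Bike A</div>'
    '<div class="product-item-name">Bike B</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Chain classes with "." to require all of them
both_classes = [t.text for t in soup.select("div.product-item-name.featured")]
# A single class matches every tag that has it, whatever other classes it carries
one_class = [t.text for t in soup.select("div.product-item-name")]

print(both_classes, one_class)
```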

In [None]:
bike_names_tags = soup.select("div.product-item-name")

In [None]:
len(bike_names_tags)

**HINT**: execute in the console `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` to check how your guesses are working.  

Replace `a` with the tag you are looking for, or with `.class_name` to match a class.

In [None]:
bike_names_tags[:10]

In [None]:
bike_names_tags[0]

In [None]:
bike_names_tags[0].text

In [None]:
names = [bike.text for bike in bike_names_tags]

In [None]:
names[:10]

#### Bike prices

Inspect the page, guess which tag to look for, and paint all item prices green.

In [None]:
price_div_tags = soup.find_all("div", class_="product-item-price")

In [None]:
len(price_div_tags)

In [None]:
price_div_tags[:10]

In [None]:
price_div_tags[0]

Each of these `div` tags contains a `span` tag. Let's find it!

In [None]:
price_div_tags[0].find("span")

In [None]:
price_span_tags = [p.find("span") for p in price_div_tags]

In [None]:
price_span_tags[:10]

In [None]:
price_span_tags[0].text

In [None]:
price_span_tags[3].text

In [None]:
price_span_tags[2]

Equivalently, using a descendant CSS selector (`parent descendant`):

In [None]:
soup.select("div.product-item-price span")
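
Note that a space in a selector matches descendants at any depth, while `>` matches direct children only. The sketch below shows the difference on hypothetical markup; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical price block: one direct <span> child and one nested deeper
html = """
<div class="product-item-price">
    <span>569,95 €</span>
    <p><span>575,00 €</span></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: spans at any depth inside the div
descendants = [t.text for t in soup.select("div.product-item-price span")]
# Child selector: only spans that are direct children of the div
children = [t.text for t in soup.select("div.product-item-price > span")]

print(descendants, children)
```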

#### Creating a DataFrame with the data

In [None]:
recuadros = soup.find_all("div", class_="product-item-component")

In [None]:
len(recuadros)

In [None]:
recuadros[0]

In [None]:
print(recuadros[0].prettify())

In [None]:
def extract_price(recuadro):
    # TODO make sure only one price if special price
    return recuadro.select("div.product-item-price span")[0].text

In [None]:
def extract_name(recuadro):
    return recuadro.select("div.product-item-name")[0].text

In [None]:
extract_price(recuadros[0])

In [None]:
extract_name(recuadros[0])

In [None]:
pares = [(extract_name(r), extract_price(r)) for r in recuadros]

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(pares)

In [None]:
df.columns = ["name", "price"]

In [None]:
"569,95 €575,00 €".split(" €")

In [None]:
"320,00 €".split(" €")

In [None]:
df.head()

In [None]:
def get_real_price(price_text):
    return price_text.split(" €")[0]

In [None]:
df["real_price"] = df.price.apply(get_real_price)

In [None]:
df.real_price

In [None]:
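# regex=False treats "." as a literal dot; as a regex it would match every character
df.real_price = (
    df.real_price.str.replace(".", "", regex=False)
    .str.replace(",", ".", regex=False)
    .astype(float)
)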
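
The same cleaning steps can be sketched as one plain-Python function, which is easier to test on its own. The name `parse_price` is hypothetical, and it assumes the first price in the text is the one we want, as in `get_real_price` above.

```python
def parse_price(price_text):
    """Convert text such as '569,95 €575,00 €' into a float (first price only)."""
    first = price_text.split("€")[0].strip()
    # European format: "." separates thousands, "," marks the decimals
    return float(first.replace(".", "").replace(",", "."))


special_offer = parse_price("569,95 €575,00 €")
with_thousands = parse_price("1.320,00 €")
print(special_offer, with_thousands)
```

It could then be applied with `df.price.apply(parse_price)`.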

In [None]:
df.head()

In [None]:
df[df.real_price < 400]

In [None]:
df.real_price.median()

In [None]:
df.shape

In [None]:
df.real_price.max()

In [None]:
df.real_price.min()

There are many pages. What if we want to scrape all of them?  
1. Build a function `get_df_from_url` that wraps all the previous logic  
2. Build a list of URLs

In [None]:
# Assumption: the site's page numbering starts at 1, so this covers pages 1 to 10
pages = [f"https://www.deporvillage.com/bicicletas-mtb?p={pag}" for pag in range(1, 11)]

3. Build the df for each URL

In [None]:
dfs = [get_df_from_url(p) for p in pages]
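
The `get_df_from_url` function called above is not defined in this notebook. One possible sketch follows, assuming the same class names used earlier (`product-item-component`, `product-item-name`, `product-item-price`). The parsing is split into a hypothetical helper, `get_df_from_html`, so it can be checked offline on a hand-written snippet.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_df_from_html(html):
    """Parse one listing page (HTML text or bytes) into a name/price DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="product-item-component"):
        name_tag = card.select_one("div.product-item-name")
        price_tag = card.select_one("div.product-item-price span")
        if name_tag is None or price_tag is None:
            continue  # skip cards that lack a name or a price
        rows.append({"name": name_tag.text.strip(), "price": price_tag.text.strip()})
    return pd.DataFrame(rows, columns=["name", "price"])


def get_df_from_url(url):
    """Download a listing page and parse it; raises on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return get_df_from_html(response.content)


# Offline check on hypothetical markup
sample = """
<div class="product-item-component">
  <div class="product-item-name">Bike A</div>
  <div class="product-item-price"><span>569,95 €</span></div>
</div>
"""
sample_df = get_df_from_html(sample)
print(sample_df)
```

The per-page DataFrames can then be combined with `pd.concat(dfs, ignore_index=True)`.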

## Advanced

### CSS selectors

CSS selectors let us search in more specific ways:
 * descendant selectors
 * combined selectors
 * siblings
 * has attribute
 * ...
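
A short sketch of each kind of selector listed above, run on hand-written HTML (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
    <h2>Bikes</h2>
    <p class="intro">First paragraph</p>
    <p>Second paragraph</p>
    <a href="/bikes">All bikes</a>
    <a>No link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

descendant = soup.select("div p")    # <p> anywhere inside a <div>
combined = soup.select("p.intro")    # tag name and class together
adjacent = soup.select("h2 + p")     # the <p> immediately after an <h2>
following = soup.select("h2 ~ p")    # every later <p> sibling of an <h2>
has_attr = soup.select("a[href]")    # <a> tags that carry an href attribute
by_id = soup.select_one("#main")     # the element with id="main"

print(len(descendant), len(combined), len(adjacent), len(following), len(has_attr))
```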

In what other ways can we use CSS selectors to find tags inside HTML content?

[Documentation .select](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) 

`soup.select("tagname1 tagname2")` finds every `tagname2` nested anywhere inside a `tagname1`

In [None]:
# how many spans?
len(soup.select("span"))

In [None]:
# how many spans inside spans?
len(soup.select("span span"))

In [None]:
# how many spans inside spans inside spans?
len(soup.select("span span span"))

In [None]:
# how many spans nested inside eight levels of divs?
len(soup.select("div div div div div div div div span"))

We use a `.` to find by class

`soup.select(".classname")`

In [None]:
len(soup.select(".picture-component"))

`soup.select("tagname.classname")`

In [None]:
len(soup.select("div.product-item-price"))

[Beautiful Soup selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Getting attribute values

Do you want to get the descriptions and URLs of the images in a webpage?  

We use `tag.get(attr_name)` to get the value of an attribute.
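
A minimal sketch of `tag.get` on hand-written HTML (the image paths are hypothetical). It also shows what happens when the attribute is missing.

```python
from bs4 import BeautifulSoup

html = '<img src="/static/logo.png" alt="Site logo"><img src="/static/banner.png">'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("img")

alt_present = first.get("alt")          # value of the attribute
alt_missing = second.get("alt")         # None when the attribute is absent
alt_default = second.get("alt", "")     # optional default instead of None
src_value = first["src"]                # dictionary-style access; raises KeyError if absent

print(alt_present, alt_missing, repr(alt_default), src_value)
```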

In [None]:
response = requests.get("https://www.elpais.com")

In [None]:
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
img_tags = soup.find_all("img")

In [None]:
len(img_tags)

In [None]:
img_tags[0]

In [None]:
img_tags[0].get("alt")

In [None]:
img_tags[0].get("src")

### Querying by other attributes

In [None]:
soup = BeautifulSoup(requests.get("https://www.elmundo.es/").content, "html.parser")

In [None]:
len(soup.find_all("a"))

In [None]:
menuitems = soup.find_all("a", attrs={"role": "menuitem"})

In [None]:
[m.text for m in menuitems]
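
The same kind of query can also be written as a CSS attribute selector. A small sketch on hypothetical menu markup:

```python
from bs4 import BeautifulSoup

# Hypothetical navigation menu
html = """
<nav>
  <a role="menuitem" href="/deportes">Deportes</a>
  <a role="menuitem" href="/cultura">Cultura</a>
  <a href="/login">Login</a>
</nav>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword-style query with an attrs dictionary
with_attrs = [a.text for a in soup.find_all("a", attrs={"role": "menuitem"})]
# Equivalent CSS attribute selector
with_css = [a.text for a in soup.select('a[role="menuitem"]')]
# CSS can also match on the start of a value with ^=
starts_with = [a.text for a in soup.select('a[href^="/dep"]')]

print(with_attrs, with_css, starts_with)
```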

## Example: Wikipedia medal table

In [None]:
url_medallero = 'https://es.wikipedia.org/wiki/Juegos_Ol%C3%ADmpicos_de_Barcelona_1992'
html_medallero = requests.get(url_medallero)

In [None]:
soup = BeautifulSoup(html_medallero.content, "html.parser")

In [None]:
# tables
tablas = soup.find_all("table")
len(tablas)

In [None]:
ultima = tablas[-1]

In [None]:
# Find every "a" tag; find_all returns a list, so keep the first element
# and use it to explore some Beautiful Soup features
elemento = ultima.find_all("a")[0]
elemento

In [None]:
# contents of the box, in this case text
elemento.text

In [None]:
# contents as a list
elemento.contents

In [None]:
# the sticker on the lid of the box (the tag's attributes)
elemento.attrs

In [None]:
type(elemento.attrs)

In [None]:
# medal table
medallero = soup.find_all("table")[-4]

In [None]:
medallero.find_all("tr")[1].find_all("td")[1].find("a").text.strip()

In [None]:
medallero.find_all("tr")[1].find_all("td")[2]

In [None]:
medallero.find_all("tr")[1].find_all("td")[3]

In [None]:
med_paises = []
for f in medallero.find_all("tr"):  # list of the table's rows
    fila = f.find_all("td")  # cells inside the row
    if len(fila) > 0:  # header rows have no <td> cells
        pais = {
            "nombre": fila[1].find("a").text.strip(),
            "oros": int(fila[2].text),
            "platas": int(fila[3].text),
            "bronce": int(fila[4].text),
        }
        med_paises.append(pais)

data = pd.DataFrame(med_paises)
data.head()

## Comments

Always check whether there is an **API** before scraping, because an API is usually:
 * much easier to use
 * well documented
 * preferred by the server

The site's `robots.txt` file tells us which parts of the site the server does not want crawlers to visit.
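
As a sketch of how such rules can be checked from Python, the standard library's `urllib.robotparser` can parse them. The rules and URLs below are hypothetical; a real file lives at the site's `/robots.txt` path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, given as a list of lines
robots_lines = [
    "User-agent: *",
    "Disallow: /checkout/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Is a generic crawler ("*") allowed to fetch these URLs?
allowed_listing = parser.can_fetch("*", "https://www.example.com/bicicletas-mtb")
allowed_checkout = parser.can_fetch("*", "https://www.example.com/checkout/cart")
# Seconds the site asks crawlers to wait between requests (None if not stated)
delay = parser.crawl_delay("*")

print(allowed_listing, allowed_checkout, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a list of lines.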

## Summary

 * Websites are built with HTML, CSS, and JavaScript
 * HTML holds the content, so HTML is what we scrape
 * `requests` to `get` the HTML
 * `Beautiful Soup` to programmatically analyse the HTML

 * HTML is hierarchical
 * HTML uses tags
 * HTML tags have attributes
 * We find tags by tag name, class, id, or other attributes
 * We can use CSS selectors for more complex queries

 * Check your guesses in the browser console with `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` or similar

## Further materials

[Web archive](http://web.archive.org/): see how webpages looked in the past!