# Web scraping


## Webpages

Webpages are built mainly with 3 tools:
 * HTML: content (structure, headings, paragraphs, tables...)
 * CSS: style (color, shape, size...)
 * JavaScript: logic (clicks, popups, dynamic banners...)

## HTML

### Basics

HTML code consists of `<tagged>` content.

HTML has a hierarchical structure: parent tags, child tags, and sibling tags:  
```
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
```
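
The parent, child, and sibling relationships can be explored from Python. The following is a minimal sketch that uses Beautiful Soup (introduced later in this notebook) on the same HTML snippet; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# Parent: the tag that directly contains <h1>
parent_name = h1.parent.name

# Children: the tags directly inside <body>
# (recursive=False keeps only direct children and skips whitespace strings)
children_names = [child.name for child in soup.body.find_all(recursive=False)]

# Sibling: the next tag at the same level as <h1>
sibling_name = h1.find_next_sibling().name

print(parent_name, children_names, sibling_name)
```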

### Tags

Tags can be classified into groups according to the type of content they are expected to hold:
 * heading: `<h1>`, `<h2>`, `<h3>`, `<hgroup>`...
 * phrasing: `<b>`, `<img>`, `<sub>`...
 * embedded: `<audio>`, `<img>`, `<video>`...
 * tabulated: `<table>`, `<tr>`, `<tbody>`...
 * sections: `<header>`, `<section>`, `<article>`...
 * metadata: `<meta>`, `<title>`, `<script>`...

### Attributes

This tag has no attributes:  
`<div> Zapas Marca Joma X54 </div>`

Tags may have attributes. Here,  
`<div class="price-item" id="offer"> Zapas Marca Joma X54 </div>`,  the `div` tag has:
 * a `class` attribute with value `price-item`
 * an `id` attribute with value `offer`  

The `id` attribute should be unique within a page: no two tags should share the same `id`.

The `class` attribute is not meant to be unique; it usually groups tags that share styling or behavior.

Other frequently used attributes are:
 * `dir`
 * `lang`
 * `style` (not to be confused with `<style>` tag)
 * `title` (not to be confused with `<title>` tag)

## Web scraping

### Intro

When scraping, we want to filter tags by:
 * tag name
 * class
 * id
 * other attribute

The browser's developer tools are very useful for this: we can **Inspect** any element on the page and find the corresponding piece of HTML code.

We use the `requests` library to bring the HTML content into our Python script.

We use the `Beautiful Soup` library to navigate the HTML easily in Python.

[BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) 🤭
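
The four filters listed above (tag name, class, id, other attribute) can be sketched on a small inline HTML string. The second and third elements and the `lang` value are made up for illustration; only the first `div` comes from the Attributes section.

```python
from bs4 import BeautifulSoup

html = """
<div class="price-item" id="offer" lang="es">Zapas Marca Joma X54</div>
<div class="price-item">Another item</div>
<span class="price-item">Not a div</span>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find_all("div")                   # filter by tag name
by_class = soup.find_all(class_="price-item")   # filter by class (note the underscore)
by_id = soup.find_all(id="offer")               # filter by id
by_attr = soup.find_all(attrs={"lang": "es"})   # filter by any other attribute

print(len(by_tag), len(by_class), len(by_id), len(by_attr))
```

The keyword is `class_` rather than `class` because `class` is a reserved word in Python.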

In [None]:
!pip install beautifulsoup4

In [None]:
import requests

In [None]:
from bs4 import BeautifulSoup

### Example: 

Let's scrape (students choose the topic for live scraping!)

In [None]:
url_depor = ""  # placeholder: paste the URL of the page chosen in class

#### Research

In [None]:
response = requests.get(url_depor)

In [None]:
response

In [None]:
response.status_code

When we requested API endpoints, the response content was generally JSON.

Let's see what happens in this case.

In [None]:
try:
    response.json()
except ValueError:  # raised when the body is not valid JSON
    print("Response is not JSON")

But we just requested a non-API URL, so we receive raw HTML content!

In [None]:
# just show some substring
response.content[:1000]

In [None]:
# response.text is the body decoded to str; response.content is the raw bytes
"<span>" in response.text

In [None]:
type(response.content)

`Beautiful Soup` helps us with the task of accessing this info

In [None]:
# name the parser explicitly; otherwise Beautiful Soup guesses one and warns
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
type(soup)

#### Getting the title tag

`find` returns the first tag that matches; `find_all` returns every match. The simplest search is by tag name.

In [None]:
soup.find("title")

[Documentation .find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all)    

In [None]:
len(soup.find_all("span"))

In [None]:
len(soup.find_all("div"))
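
To make the difference between `find` and `find_all` concrete, here is a small sketch on an inline HTML string (the markup is invented for the example). It also shows what each call returns when nothing matches.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Shop</title></head><body><span>a</span><span>b</span></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_span = soup.find("span")       # a single Tag: the first match
all_spans = soup.find_all("span")    # a list-like ResultSet with every match
missing = soup.find("table")         # None when nothing matches
no_tables = soup.find_all("table")   # an empty ResultSet when nothing matches

print(first_span.text, len(all_spans), missing, len(no_tables))
```

Because `find` can return `None`, it is worth checking the result before accessing `.text` on it.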

#### Get items from the webpage

Open the Chrome developer tools and inspect the page to find the tag and class of the items you want.

You can find HTML tags by running `soup.find_all(name=tag_name, class_=class_name)`

In [None]:
shoes_names_tags = soup.find_all(
    name="div", 
    class_="product-card__title"
)

In [None]:
type(shoes_names_tags)

In [None]:
len(shoes_names_tags)

**HINT**: run `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` in the browser console to check how your guesses are working.  

Replace `a` with the tag you are using, or with `.class_name` to match a class name.

In [None]:
shoes_names_tags[0]

In [None]:
shoes_names_tags[0].text

In [None]:
names = [shoe.text for shoe in shoes_names_tags]

In [None]:
names[:10]
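
Scraped text often carries the indentation and line breaks of the original HTML. The sketch below, on an invented snippet that reuses the `product-card__title` class, compares `.text` with `get_text(strip=True)`.

```python
from bs4 import BeautifulSoup

html = """
<div class="product-card__title">
    Zapas Marca Joma X54
</div>
"""
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", class_="product-card__title")

raw = tag.text                     # keeps the newlines and indentation from the HTML
clean = tag.get_text(strip=True)   # strips leading and trailing whitespace

print(repr(raw))
print(repr(clean))
```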

#### Get prices from the webpage

Inspect the page, guess the tag you need to look for, and paint all item prices green with the console hint above.

In [None]:
price_div_tags = soup.find_all("div", class_="product-card__price-wrapper")

In [None]:
len(price_div_tags)
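
Price tags usually contain text rather than numbers. The helper below is a hypothetical sketch for turning that text into a float. It assumes prices look like `29,99 €` or `15 €`, with an optional decimal comma or point and no thousands separator, so adapt it to the format you actually find on the page.

```python
import re


def parse_price(text):
    """Return the first number found in a price string as a float, or None."""
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if match is None:
        return None
    # Normalise a decimal comma to a decimal point before converting
    return float(match.group().replace(",", "."))


print(parse_price("  29,99 €\n"))
print(parse_price("Sold out"))
```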

#### Finding price and name together
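
Collecting names and prices in two separate lists can misalign them when a product has no price. One possible approach is to find each product container first and then search inside it. In this sketch the outer `product-card` class and the sample products are assumptions; check the real container with Inspect.

```python
from bs4 import BeautifulSoup

# Invented markup: the outer "product-card" class is an assumption
html = """
<div class="product-card">
    <div class="product-card__title">Shoe A</div>
    <div class="product-card__price-wrapper">29,99 €</div>
</div>
<div class="product-card">
    <div class="product-card__title">Shoe B</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

items = []
for card in soup.find_all("div", class_="product-card"):
    # Searching from `card` only looks at tags inside that product
    name_tag = card.find("div", class_="product-card__title")
    price_tag = card.find("div", class_="product-card__price-wrapper")
    items.append({
        "name": name_tag.get_text(strip=True) if name_tag is not None else None,
        "price": price_tag.get_text(strip=True) if price_tag is not None else None,
    })

print(items)
```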

#### Building a DataFrame with the information

There are many pages, what if we want to scrape all of them?  
1. Build a function `get_df_from_url` and insert all the previous logic  
2. Build a list of urls
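
The two steps above could look roughly like this. It is a sketch under assumptions: the `product-card` container class is hypothetical, and the helper names `get_df_from_html` and `get_df_from_urls` are introduced here only to keep parsing separate from downloading.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_df_from_html(html):
    """Parse one page of HTML into a DataFrame with one row per product card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # "product-card" as the container class is an assumption; check it with Inspect
    for card in soup.find_all("div", class_="product-card"):
        name = card.find("div", class_="product-card__title")
        price = card.find("div", class_="product-card__price-wrapper")
        rows.append({
            "name": name.get_text(strip=True) if name is not None else None,
            "price": price.get_text(strip=True) if price is not None else None,
        })
    return pd.DataFrame(rows, columns=["name", "price"])


def get_df_from_url(url):
    """Download one page and parse it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return get_df_from_html(response.text)


def get_df_from_urls(urls):
    """Scrape several pages and stack the results into a single DataFrame."""
    return pd.concat([get_df_from_url(u) for u in urls], ignore_index=True)
```

The list of URLs depends on how the site paginates; inspect the address bar while moving between pages to see which part of the URL changes.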

## Advanced

### CSS selectors

We can use CSS selectors to find in a more specific way:
 * descendant selectors
 * combined selectors
 * siblings
 * has attribute
 * ...

In what other ways can we use CSS selectors to find tags inside HTML content?

[Documentation .select](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) 
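
As a partial answer to the question above, here is a sketch of a few selector types run against an invented HTML fragment: descendant, direct child, has-attribute, adjacent sibling, and a combined tag-plus-id selector.

```python
from bs4 import BeautifulSoup

html = """
<div id="menu">
    <a href="/home">Home</a>
    <p><a href="/deep">Nested link</a></p>
    <a>No href</a>
    <span>After</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = soup.select("div a")       # any <a> inside a <div>, at any depth
children = soup.select("div > a")        # only <a> tags directly inside a <div>
with_href = soup.select("a[href]")       # <a> tags that have an href attribute
next_sibling = soup.select("a + span")   # a <span> immediately after an <a>
by_id = soup.select("div#menu")          # combined: tag name plus id

print(len(descendants), len(children), len(with_href), len(next_sibling), len(by_id))
```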

`soup.select("tagname1 tagname2")` finds every `tagname2` nested at any depth inside a `tagname1`

In [None]:
# how many spans?
len(soup.select("span"))

In [None]:
len(soup.select("span span"))

In [None]:
# how many spans inside spans inside spans?
len(soup.select("span span span"))

In [None]:
# how many spans nested inside eight levels of divs?
len(soup.select("div div div div div div div div span"))

We use a `.` to find by class

`soup.select(".classname")`

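If the `class` attribute contains spaces, the tag has several classes; chain them with `.` (for example, `class="logo primary_claim"` becomes `.logo.primary_claim`)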

In [None]:
len(soup.select(".product-card__link-overlay"))

`soup.select("tagname.classname")`

In [None]:
response = requests.get("https://www.elpais.com")

In [None]:
soup_pais = BeautifulSoup(response.content, "html.parser")

In [None]:
soup_pais.select("div.logo.primary_claim")
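
Since live pages change over time, the class selectors can also be tried on a fixed snippet. The markup below is invented; it only reuses the `logo` and `primary_claim` class names from the selector above.

```python
from bs4 import BeautifulSoup

html = """
<div class="logo primary_claim">A</div>
<div class="logo">B</div>
<span class="logo primary_claim">C</span>
"""
soup = BeautifulSoup(html, "html.parser")

one_class = soup.select(".logo")                          # any tag with class "logo"
two_classes = soup.select(".logo.primary_claim")          # tags carrying both classes
tag_and_classes = soup.select("div.logo.primary_claim")   # same, restricted to <div>

print([t.text for t in one_class])
print([t.text for t in two_classes])
print([t.text for t in tag_and_classes])
```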

[Beautiful Soup selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### Getting attribute values

Do you want to get the description and hyperlinks to images in a webpage?  

We use `tag.get(attr_name)` to get the attribute value

In [None]:
img_tags = soup_pais.select("img")

In [None]:
len(img_tags)

In [None]:
img_tags[0]

In [None]:
img_tags[0].get("alt")

In [None]:
img_tags[0].get("src")

In [None]:
for im in img_tags:
    print(im.get("alt"), im.get("src"))
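
Not every image has an `alt` attribute. The sketch below, on two invented `<img>` tags, shows how `.get` behaves when an attribute is missing and how it differs from square-bracket access.

```python
from bs4 import BeautifulSoup

html = '<img src="/logo.png" alt="Logo"><img src="/banner.png">'
soup = BeautifulSoup(html, "html.parser")
img_tags = soup.select("img")

# .get returns None (or the given default) when the attribute is missing
alts = [img.get("alt") for img in img_tags]
alts_with_default = [img.get("alt", "") for img in img_tags]

# Square brackets raise KeyError when the attribute is missing
try:
    img_tags[1]["alt"]
    raised = False
except KeyError:
    raised = True

# .attrs holds every attribute of the tag as a dictionary
first_attrs = img_tags[0].attrs

print(alts, alts_with_default, raised, first_attrs)
```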

### Querying by other attributes

In [None]:
soup = BeautifulSoup(requests.get("https://www.elmundo.es/").content, "html.parser")

In [None]:
len(soup.find_all("a"))

In [None]:
[a.text for a in soup.find_all("li", attrs={"role": "presentation"})]
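
The `attrs` dictionary accepts more than exact values. This sketch, on invented markup, shows three variants: an exact match, `True` to require that an attribute exists, and a regular expression tested against the attribute value.

```python
import re

from bs4 import BeautifulSoup

html = """
<ul>
    <li role="presentation" data-section="sports">Sports</li>
    <li role="presentation">Culture</li>
    <li>Plain item</li>
</ul>
<a href="https://example.com/news/1">External</a>
<a href="/local">Local</a>
<a>No link</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact value for an arbitrary attribute
presentation = soup.find_all("li", attrs={"role": "presentation"})

# Attribute names with a hyphen cannot be keyword arguments, so use attrs;
# True means "the tag has this attribute, whatever its value"
with_data = soup.find_all(attrs={"data-section": True})

# A compiled regular expression is searched for in the attribute value
absolute_links = soup.find_all("a", href=re.compile(r"^https?://"))

print(len(presentation), len(with_data), len(absolute_links))
```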

## Example: Wiki medallero

In [None]:
url_medallero = "https://es.wikipedia.org/wiki/Juegos_Ol%C3%ADmpicos_de_Barcelona_1992"

In [None]:
soup = BeautifulSoup(requests.get(url_medallero).content, "html.parser")

In [None]:
# tables
tablas = soup.find_all("table")

In [None]:
len(tablas)

In [None]:
for i, t in enumerate(tablas):
    if "Unificado" in str(t) and "45" in str(t):
        print(i)

`tablas[10]` seems to be the table we want

In [None]:
t = tablas[10]

In [None]:
tbody = t.find("tbody")

In [None]:
tbody

In [None]:
trs = tbody.find_all("tr")

In [None]:
len(trs)

In [None]:
tr = trs[1]

In [None]:
tr

In [None]:
cols = tr.find_all("td")

In [None]:
cols

In [None]:
cols[5].text.strip()

In [None]:
def get_row_info(row):
    """Extract one medal-table row into a dict with typed values."""
    info = dict()
    
    cols = row.find_all("td")
    
    # int() tolerates surrounding whitespace, so no strip() is needed for numbers
    info["rank"] = int(cols[0].text)
    info["country"] = cols[1].text.strip()
    info["ngold"] = int(cols[2].text)
    info["nsilver"] = int(cols[3].text)
    info["nbronze"] = int(cols[4].text)
    info["total"] = int(cols[5].text)
    
    return info

In [None]:
get_row_info(trs[5])

In [None]:
trs

In [None]:
# print the index of every row that cannot be parsed (headers, totals...)
for i, row in enumerate(trs):
    try:
        get_row_info(row)
    except (ValueError, IndexError):
        print(i)

In [None]:
import pandas as pd

pd.DataFrame([get_row_info(row) for row in trs[1:]])
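
An alternative to catching exceptions is to keep only the rows that have the expected number of `<td>` cells. The sketch below uses a toy table with made-up values (the real Wikipedia table has its own layout); the key names follow `get_row_info`.

```python
from bs4 import BeautifulSoup

# Toy table with made-up values, only to illustrate the row filter
html = """
<table>
    <tr><th>Rank</th><th>Country</th><th>Gold</th><th>Silver</th><th>Bronze</th><th>Total</th></tr>
    <tr><td>1</td><td> Country A </td><td>10</td><td>5</td><td>2</td><td>17</td></tr>
    <tr><td>2</td><td> Country B </td><td>8</td><td>7</td><td>3</td><td>18</td></tr>
    <tr><td colspan="6">Footnote row</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

keys = ["rank", "country", "ngold", "nsilver", "nbronze", "total"]
rows = []
for tr in soup.find("table").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) != len(keys):
        continue  # skip header rows (<th> only) and rows with merged cells
    row = dict(zip(keys, cells))
    for key in keys:
        if key != "country":
            row[key] = int(row[key])  # every column except the country is numeric
    rows.append(row)

print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as the output of `get_row_info`.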

## Comments

Always check whether there is an **API** before scraping, because an API is:
 * much easier to use
 * well documented
 * preferred by the site owner

`robots.txt` tells us which paths the site owner asks crawlers to stay away from
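
Python's standard library can read these rules. The sketch below parses a hypothetical `robots.txt` given as a string, so it runs offline; the rules and the `example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt
robots_lines = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)

allowed = parser.can_fetch("*", "https://example.com/products")
blocked = parser.can_fetch("*", "https://example.com/private/data")

print(allowed, blocked)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.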

## Summary

 * Webpages are built with HTML, CSS, and JavaScript
 * HTML has the content. We scrape HTML
 * `requests` to `get` the HTML
 * `Beautiful Soup` to programmatically analyse the HTML

 * HTML is hierarchical
 * HTML uses tags
 * HTML tags have attributes
 * We find tags by tag name, class, id, or other attributes
 * We can use CSS selectors to select in very complex ways

 * Check your selectors in the browser console with `document.querySelectorAll('a').forEach(elm => elm.style.background = 'red')` or similar

## Further materials

[Web archive](http://web.archive.org/): browse snapshots of webpages as they were in the past.

In [None]:
soup = BeautifulSoup(requests.get("https://www.20minutos.com").content, "html.parser")

In [None]:
len(soup.find_all("a"))

In [None]:
[s.text for s in soup.select("div.media-content header h1 a")]