# Web scraping. 1

Web scraping is a technique for extracting information from web pages with programs that simulate how a person browses the web.

Web scraping can be a good alternative when a website does not offer its information through an API.
APIs offer many advantages:
* They are documented
* They offer structured information
* They are relatively stable over time
On the other hand, web scraping has some disadvantages:
* The information is not structured
* The page's code changes frequently
* The page structure is complex and undocumented
* Some pages block scraping

## Scraping Steps
A scraping process follows these steps:
* Inspect the HTML code of the web page
* Download HTML content from the site
* Parse HTML code with Beautiful Soup
* Work with the extracted data
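
The steps above can be sketched as two small functions. This is only a minimal sketch: the names `download_html` and `extract_paragraphs` and the inline `sample_html` page are assumptions made for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def download_html(url, timeout=10):
    """Step 2: download the HTML of a page (needs network access)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    return response.text


def extract_paragraphs(html):
    """Steps 3 and 4: parse the HTML and return the text of every <p>."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]


# Step 1 (inspecting the HTML) is done by hand in the browser.
# Hypothetical inline page so the sketch runs without network access.
sample_html = "<html><body><p>First</p><p>Second</p></body></html>"
paragraphs = extract_paragraphs(sample_html)
print(paragraphs)
```

With network access, `extract_paragraphs(download_html(url))` would chain the download and parsing steps for a real page.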

In [None]:
# The fundamental libraries are:
# requests <- for HTTP requests
# BeautifulSoup <- for parsing HTML code

In [2]:
import requests
from bs4 import BeautifulSoup

In [None]:
# Example of a very simple page
# https://bigdatawirtz.github.io/exemplo-web/01.html

In [4]:
# HTTP request to the page URL
url = 'https://bigdatawirtz.github.io/exemplo-web/01.html'
paxina = requests.get(url)

#paxina.content
#paxina.text
print(paxina.text)

<!DOCTYPE html>
<html>

<head>
    <title>Este é o meu exemplo</title>
</head>

<body>
    <p data-attribute="atributo">O meu primeiro texto</p>
</body>

</html>
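
The commented-out lines in the cell above hint at two ways of reading a response: `paxina.content` holds raw bytes and `paxina.text` holds a decoded string. The sketch below suggests that Beautiful Soup accepts either. The inline page is a hypothetical stand-in for a downloaded response, so it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not fetched from the network).
text = (
    '<html><head><meta charset="utf-8">'
    "<title>Este é o meu exemplo</title></head></html>"
)
content = text.encode("utf-8")  # bytes, like response.content

soup_from_text = BeautifulSoup(text, "html.parser")
soup_from_bytes = BeautifulSoup(content, "html.parser")

print(type(text).__name__, type(content).__name__)
print(soup_from_text.title.text == soup_from_bytes.title.text)
```

When bytes are passed, Beautiful Soup has to work out the encoding itself, which is why the sample page declares its charset.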



In [3]:
# Parse the web content of the response
soup = BeautifulSoup(paxina.content, 'html.parser')

In [4]:
type(soup)

bs4.BeautifulSoup

In [6]:
# Show the content of 'the soup'
soup
# The prettify() method can be helpful
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Este é o meu exemplo
  </title>
 </head>
 <body>
  <p data-attribute="atributo">
   O meu primeiro texto
  </p>
 </body>
</html>



In [7]:
# Search for elements by their tag
# Search for the title element
soup.find("title")
#soup.title

<title>Este é o meu exemplo</title>

In [8]:
type(soup.title)

bs4.element.Tag

In [9]:
# Every Tag has a name
soup.title.name

'title'

In [10]:
type(soup.title.name)

str

In [11]:
# Tags enclose text
soup.title.text

'Este é o meu exemplo'

In [12]:
type(soup.title.text)

str

In [13]:
# Search for element "p" for paragraph
soup.find('p')
#soup.p

<p data-attribute="atributo">O meu primeiro texto</p>

In [14]:
soup.p.name

'p'

In [15]:
soup.p.text

'O meu primeiro texto'

In [16]:
# In addition to a name and text, tags can have attributes
soup.p.attrs

{'data-attribute': 'atributo'}

In [17]:
type(soup.p.attrs)

dict

In [18]:
# Access the value of the attributes
soup.p.attrs['data-attribute']

'atributo'
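
Indexing with square brackets raises `KeyError` when the attribute is missing, so `get()` is often the safer choice on real pages. A minimal sketch, reusing the paragraph from the example page as an inline string; the `"no-id"` default is just an illustrative value.

```python
from bs4 import BeautifulSoup

html = '<p data-attribute="atributo">O meu primeiro texto</p>'
p = BeautifulSoup(html, "html.parser").p

print(p["data-attribute"])   # shorthand for p.attrs["data-attribute"]
print(p.get("id"))           # missing attribute: get() returns None
print(p.get("id", "no-id"))  # or a default value of your choice

try:
    p["id"]                  # square brackets raise on a missing attribute
except KeyError:
    print("this tag has no 'id' attribute")
```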

In [19]:
# NEW WEB: https://bigdatawirtz.github.io/exemplo-web/02.html
url = 'https://bigdatawirtz.github.io/exemplo-web/02.html'
paxina = requests.get(url)

print(paxina.text)

<!DOCTYPE html>
<html>

<head>
    <title>Probamos os textos</title>
</head>

<body>

    <h1>Isto é un h1</h1>
    <p>Ut enim ad minim veniam,
        <em>quis nostrud exercitation ullamco laboris</em>nisi ut aliquip ex ea commodo consequat.
        <strong>Duis aute irure dolor in reprehenderit</strong>in voluptate velit esse cillum dolore eu fugiat nulla pariatur.</p>


        <h2>Isto é un h2</h2>
    <p>Lorem ipsum dolor sit amet,
        <strong>consectetur adipisicing elit</strong>.</p>

    <h3>Isto é un un h3</h3>
    <p><b>Lorem ipsum dolor sit amet</b>, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>

    

In [20]:
# Parse the web content
soup = BeautifulSoup(paxina.content, 'html.parser')

In [21]:
# Search for h1
soup.h1

<h1>Isto é un h1</h1>

In [23]:
soup.h1.text

'Isto é un h1'

In [22]:
# Search for unordered list <ul>
soup.ul

<ul>
<li>Primero elemento</li>
<li>Segundo elemento</li>
<li>Terceiro elemento</li>
</ul>

In [24]:
soup.ul.li

<li>Primero elemento</li>

In [25]:
soup.ul.li.text

'Primero elemento'
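
Chained access such as `soup.ul.li.text` assumes every tag in the chain exists. When a tag is absent, attribute access returns `None`, and the next step in the chain fails with `AttributeError`. A small sketch with a hypothetical inline list shows one way to guard against that.

```python
from bs4 import BeautifulSoup

html = "<ul><li>Primero elemento</li><li>Segundo elemento</li></ul>"
soup = BeautifulSoup(html, "html.parser")

print(soup.ul.li.text)  # works: both <ul> and <li> exist
print(soup.table)       # None: there is no <table> in this page

# Guard before chaining, otherwise soup.table.td would raise AttributeError.
table = soup.find("table")
first_cell = table.td.text if table is not None else None
print(first_cell)
```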

In [26]:
# What about several occurrences of an element?
# Attribute access returns only the first one
soup.h3

<h3>Isto é un un h3</h3>

In [27]:
# find() returns the first 'h3' element it encounters
soup.find('h3')

<h3>Isto é un un h3</h3>

In [28]:
# Get all occurrences of the element
soup.find_all('h3')

[<h3>Isto é un un h3</h3>, <h3>Isto é outro h3</h3>]

In [34]:
# Index the result to pick one element (here, the second h3)
elementos_h3 = soup.find_all('h3')
segundo_h3 = elementos_h3[1]
print(segundo_h3)

<h3>Isto é outro h3</h3>


In [29]:
type(soup.find_all('h3'))

bs4.element.ResultSet

In [30]:
elementos_h3 = soup.find_all('h3')
for i in elementos_h3:
    print(i.text)

Isto é un un h3
Isto é outro h3
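
Because the result of `find_all()` can be iterated, a list comprehension is a compact way to collect the texts, and the `limit` argument stops the search early. A minimal sketch on a hypothetical inline fragment that mimics the two headings above:

```python
from bs4 import BeautifulSoup

html = "<h3>Isto é un un h3</h3><p>texto</p><h3>Isto é outro h3</h3>"
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <h3> in one expression.
headings = [h3.text for h3 in soup.find_all("h3")]
print(headings)

# limit=1 stops after the first match but still returns a ResultSet.
first_only = soup.find_all("h3", limit=1)
print(len(first_only))
```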


In [35]:
# Get the ordered lists and print their text
elementos_lista = soup.find_all('ol')
for i in elementos_lista:
    print(i.text)


Primeiro
Segundo
Terceiro
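
The cell above prints the text of the whole `<ol>` at once. To handle each item separately, one option is to search for `<li>` inside the list. The markup below is an assumption, since the list's source is not shown in the page output above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the ordered list.
html = """
<ol>
<li>Primeiro</li>
<li>Segundo</li>
<li>Terceiro</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Search inside the first <ol> only, one entry per <li>.
items = [li.text for li in soup.ol.find_all("li")]
print(items)
```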



In [36]:
# List the paragraphs present on the page
lista_paragrafos = soup.find_all('p')
for contador, i in enumerate(lista_paragrafos, start=1):
    print("Parágrafo ", contador , ": " , i.text)

Parágrafo  1 :  Ut enim ad minim veniam,
        quis nostrud exercitation ullamco laborisnisi ut aliquip ex ea commodo consequat.
        Duis aute irure dolor in reprehenderitin voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Parágrafo  2 :  Lorem ipsum dolor sit amet,
        consectetur adipisicing elit.
Parágrafo  3 :  Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Parágrafo  4 :  Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Parágrafo  5 :  Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmo
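
In the output above, words run together ("laborisnisi") and the source indentation leaks into the text, because `.text` concatenates the strings exactly as they appear in the markup. A possible fix is `get_text()` with a separator and `strip=True`. The paragraph below is a shortened, hypothetical version of the one on the page.

```python
from bs4 import BeautifulSoup

html = (
    "<p>Ut enim ad minim veniam,\n"
    "    <em>quis nostrud exercitation</em>nisi ut aliquip.</p>"
)
p = BeautifulSoup(html, "html.parser").p

raw = p.text                          # keeps line breaks, glues words together
clean = p.get_text(" ", strip=True)   # strip each piece, join with one space

print(repr(raw))
print(clean)
```

The trade-off is that a space is inserted at every tag boundary, including places where the original markup intentionally had none.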

In [37]:
# New web page with links
url = 'https://bigdatawirtz.github.io/exemplo-web/04.html'
paxina = requests.get(url)

print(paxina.text)

<!DOCTYPE html>
<html>

<head>

    <title>Links</title>
    <meta charset="utf-8">

</head>

<body>

    <ol>
        <li><a href="01.html" title="Estrutura">Estrutura</a>
        </li>
        <li><a href="02.html" title="Textos">Textos</a>
        </li>
        <li><a href="03.html" title="Definition Lists">Definition Lists</a>
        </li>
        <li><a href="04.html" title="Índice">Índice (enlaces)</a>
        </li>
        <li><a href="http://www.esat.es/id3" target="_blank" title="Índice">Enlace con target _blank</a>
        </li>
        <li><a href="02.html#ancla" title="Textos">Enlace con anchor</a>
        </li>
        <li><a href="http://www.meteogalicia.gal" target="_blank" title="Índice">Meteogalicia</a>
        </li>
        <li><a href="http://www.praza.gal" target="_blank" title="Índice">Xornal Praza</a>
        </li>

    </ol>

    <img src="http://www.esat.es/imagenes/postgrado_diseno_interactivo_digital.png" width="900" height="270" alt="ID3">

</body>

</html>


In [38]:
# Parsing
soup = BeautifulSoup(paxina.content,'html.parser')
soup

<!DOCTYPE html>

<html>
<head>
<title>Links</title>
<meta charset="utf-8"/>
</head>
<body>
<ol>
<li><a href="01.html" title="Estrutura">Estrutura</a>
</li>
<li><a href="02.html" title="Textos">Textos</a>
</li>
<li><a href="03.html" title="Definition Lists">Definition Lists</a>
</li>
<li><a href="04.html" title="Índice">Índice (enlaces)</a>
</li>
<li><a href="http://www.esat.es/id3" target="_blank" title="Índice">Enlace con target _blank</a>
</li>
<li><a href="02.html#ancla" title="Textos">Enlace con anchor</a>
</li>
<li><a href="http://www.meteogalicia.gal" target="_blank" title="Índice">Meteogalicia</a>
</li>
<li><a href="http://www.praza.gal" target="_blank" title="Índice">Xornal Praza</a>
</li>
</ol>
<img alt="ID3" height="270" src="http://www.esat.es/imagenes/postgrado_diseno_interactivo_digital.png" width="900"/>
</body>
</html>

In [39]:
# Search for all occurrences of a tag
soup.find_all("a")

[<a href="01.html" title="Estrutura">Estrutura</a>,
 <a href="02.html" title="Textos">Textos</a>,
 <a href="03.html" title="Definition Lists">Definition Lists</a>,
 <a href="04.html" title="Índice">Índice (enlaces)</a>,
 <a href="http://www.esat.es/id3" target="_blank" title="Índice">Enlace con target _blank</a>,
 <a href="02.html#ancla" title="Textos">Enlace con anchor</a>,
 <a href="http://www.meteogalicia.gal" target="_blank" title="Índice">Meteogalicia</a>,
 <a href="http://www.praza.gal" target="_blank" title="Índice">Xornal Praza</a>]

In [40]:
# Remember: the result is a ResultSet (a list subclass), not a plain list
type(soup.find_all('a'))

bs4.element.ResultSet

In [41]:
# Iterate result
for enlace in soup.find_all('a'):
    print(enlace)

<a href="01.html" title="Estrutura">Estrutura</a>
<a href="02.html" title="Textos">Textos</a>
<a href="03.html" title="Definition Lists">Definition Lists</a>
<a href="04.html" title="Índice">Índice (enlaces)</a>
<a href="http://www.esat.es/id3" target="_blank" title="Índice">Enlace con target _blank</a>
<a href="02.html#ancla" title="Textos">Enlace con anchor</a>
<a href="http://www.meteogalicia.gal" target="_blank" title="Índice">Meteogalicia</a>
<a href="http://www.praza.gal" target="_blank" title="Índice">Xornal Praza</a>


In [42]:
# Iterate <a> elements to extract texts
for enlace in soup.find_all('a'):
    print(enlace.text)

Estrutura
Textos
Definition Lists
Índice (enlaces)
Enlace con target _blank
Enlace con anchor
Meteogalicia
Xornal Praza


In [43]:
# Links aren't in the texts, but in the href attribute
for enlace in soup.find_all('a'):
    print(enlace.get('href'))

01.html
02.html
03.html
04.html
http://www.esat.es/id3
02.html#ancla
http://www.meteogalicia.gal
http://www.praza.gal
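
Several of these `href` values are relative to the page they came from, so they cannot be requested as they are. One way to resolve them is `urljoin` from the standard library's `urllib.parse`, which the original notebook does not use. A minimal sketch with a few of the links above:

```python
from urllib.parse import urljoin

base_url = "https://bigdatawirtz.github.io/exemplo-web/04.html"
hrefs = ["01.html", "02.html#ancla", "http://www.praza.gal"]

# Relative links are resolved against the page URL; absolute ones are kept.
absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```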


In [44]:
# Other attributes
for enlace in soup.find_all('a'):
    print(enlace.get('title'))

Estrutura
Textos
Definition Lists
Índice
Índice
Textos
Índice
Índice
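
Attributes can also be used to filter the search itself. The sketch below keeps only the links that open in a new tab and stores them as dictionaries. It uses a hypothetical inline subset of the page, and the `records` structure is an illustrative choice.

```python
from bs4 import BeautifulSoup

# Hypothetical subset of the links page, inlined to avoid a network request.
html = """
<ol>
<li><a href="01.html" title="Estrutura">Estrutura</a></li>
<li><a href="http://www.meteogalicia.gal" target="_blank" title="Índice">Meteogalicia</a></li>
<li><a href="http://www.praza.gal" target="_blank" title="Índice">Xornal Praza</a></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only <a> tags whose target attribute equals "_blank".
external = soup.find_all("a", attrs={"target": "_blank"})
records = [{"text": a.text, "href": a.get("href")} for a in external]
print(records)
```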
