<a href="https://colab.research.google.com/github/INmais/Energy_Services_2022/blob/main/Energy_Services_2022_Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping

## HTML page structure

**Hypertext Markup Language (HTML)** is the standard markup language for documents designed to be displayed in a web browser. HTML describes the structure of a web page and can be combined with **Cascading Style Sheets (CSS)** for styling and a scripting language such as **JavaScript** for interactivity. An HTML document consists of a series of elements that tell the browser how to display the content, and each element is written with **tags**.

Here are some tags:
* `<!DOCTYPE html>` declaration defines this document to be HTML5.  
* `<html>` element is the root element of an HTML page.  
* `<div>` tag defines a division or a section in an HTML document. It's usually a container for other elements.
* `<head>` element contains meta information about the document.  
* `<title>` element specifies a title for the document.  
* `<body>` element contains the visible page content.  
* `<h1>` element defines a top-level heading.  
* `<p>` element defines a paragraph.  
* `<a>` element defines a hyperlink.

HTML tags normally come in pairs, like `<p>` and `</p>`. The first tag in a pair is the opening tag and the second is the closing tag. The closing tag is written like the opening tag, but with a slash inserted before the tag name.

<img src="https://github.com/nestauk/im-tutorials/blob/3-ysi-tutorial/figures/Web-Scraping/tags.png?raw=1" width="512">

The **Document Object Model (DOM)**, a cross-platform and language-independent interface, represents an HTML document as a tree 🌳 🌲 in which each element is a node nested inside its parent. Here's what a very simple HTML tree looks like.

<img src="https://github.com/nestauk/im-tutorials/blob/3-ysi-tutorial/figures/Web-Scraping/dom_tree.gif?raw=1">
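To make the tree structure concrete, here is a minimal sketch that prints the nesting of tags in a small document using only Python's standard-library `html.parser`. The `TreePrinter` class, the `VOID_TAGS` set and the sample markup are illustrative names introduced for this example, and the sketch assumes well-formed HTML in which every non-void tag is closed.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreePrinter(HTMLParser):
    """Collects an indented outline of the tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then descend into it.
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level.
        if tag not in VOID_TAGS:
            self.depth -= 1


sample = """
<html>
<head><title>Intro to HTML</title></head>
<body>
  <h1>Heading</h1>
  <p>A paragraph with a <a href="https://en.wikipedia.org/wiki/Main_Page">link</a>.</p>
</body>
</html>
"""

printer = TreePrinter()
printer.feed(sample)
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the tree: `title` sits inside `head`, and `a` sits inside `p`, which sits inside `body`.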

### Creating a simple HTML page

In [1]:
# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated.
from IPython.display import display, HTML

In [2]:
display(HTML("""
<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
  <title>Intro to HTML</title>
</head>

<body>
  <h1>Heading h1</h1>
  <h2>Heading h2</h2>
  <h3>Heading h3</h3>
  <h4>Heading h4</h4>

  <p>
    That's a text paragraph. You can also <b>bold</b>, <mark>mark</mark>, <ins>underline</ins>, <del>strikethrough</del> and <i>emphasize</i> words.
    You can also add links - here's one to <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a>.
  </p>

  <p>
    This <br> is a paragraph <br> with <br> line breaks
  </p>

  <p style="color:red">
    Add colour to your paragraphs.
  </p>

  <p>Unordered list:</p>
  <ul>
    <li>Python</li>
    <li>R</li>
    <li>Julia</li>
  </ul>

  <p>Ordered list:</p>
  <ol>
    <li>Data collection</li>
    <li>Exploratory data analysis</li>
    <li>Data analysis</li>
    <li>Policy recommendations</li>
  </ol>
  <hr>

  <!-- This is a comment -->

</body>
</html>
"""))

## Chrome DevTools

[Chrome DevTools](https://developers.google.com/web/tools/chrome-devtools/) is a set of web developer tools built directly into the Google Chrome browser. DevTools can help you view and edit web pages. We will use it to inspect an HTML page and find which elements correspond to the data we want to scrape.

**Tip**: Hit *Command+Option+C* (Mac) or *Control+Shift+C* (Windows, Linux) to open the Elements panel.



## Web Scraping with `requests` and `BeautifulSoup`

We will use `requests` and `BeautifulSoup` to access and scrape the content of [Meteo Técnico](http://caboruivo.tecnico.ulisboa.pt:63104/).

### What is `BeautifulSoup`?

It is a Python library for pulling data out of HTML and XML files. It provides methods to navigate the document's tree structure that we discussed before and scrape its content.
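As a small illustration of that navigation, the sketch below parses an inline HTML string rather than a live page. The sample markup and the variable names are made up for this example; the calls used (`find`, `find_all`, `get_text`, and attribute access with `["href"]`) are standard `BeautifulSoup` methods.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document so the example does not depend on any website.
html_doc = """
<html>
<head><title>Intro to HTML</title></head>
<body>
  <h1>Heading h1</h1>
  <p class="intro">A paragraph with a <a href="https://en.wikipedia.org/wiki/Main_Page">link</a>.</p>
  <ul>
    <li>Python</li>
    <li>R</li>
    <li>Julia</li>
  </ul>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Dotted access returns the first matching tag in the tree.
title_text = soup.title.get_text()

# find() returns the first match; attributes are read like dictionary keys.
link_url = soup.find("a")["href"]

# find_all() returns every match, here all list items.
languages = [li.get_text() for li in soup.find_all("li")]

# class_ filters on the CSS class (the underscore avoids the Python keyword).
intro_text = soup.find("p", class_="intro").get_text()

print(title_text, link_url, languages, intro_text, sep="\n")
```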

### Our pipeline

- Access a web page (`requests`)
- Parse the HTML document (`BeautifulSoup`)
- Inspect what to scrape (Chrome DevTools)
- Find the tags (`BeautifulSoup`)
    - (max temperature)
    - (min temperature)
    - (city)
- Print/Store their content
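The last step, storing the content, could look like the sketch below. It assumes the scraped values arrive as strings such as `'18º'`; the sample records, the helper name `parse_degrees` and the column names are illustrative choices, and the CSV is written to an in-memory buffer so that nothing touches the disk.

```python
import csv
import io

# Sample records in the shape a scraper might produce (illustrative values).
records = [
    {"temp max": "18º", "temp min": "11º"},
    {"temp max": "15º", "temp min": "8º"},
]


def parse_degrees(text):
    """Convert a string such as '18º' to the integer 18."""
    # Strip surrounding whitespace and a trailing ordinal or degree sign.
    return int(text.strip().rstrip("º°"))


# Turn every scraped string into a number so it can be analysed later.
cleaned = [{key: parse_degrees(value) for key, value in row.items()} for row in records]

# Write the cleaned rows as CSV into an in-memory text buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["temp max", "temp min"])
writer.writeheader()
writer.writerows(cleaned)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the buffer can be swapped for a file opened with `open(path, "w", newline="")`, where `path` is whatever location you choose.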


In [8]:
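import requests
from bs4 import BeautifulSoup
import pandas

# Download the page and parse its HTML.
page_url = "http://caboruivo.tecnico.ulisboa.pt:63104/"
page = requests.get(page_url)
soup = BeautifulSoup(page.content, "html.parser")

# Find every table whose class attribute is exactly "table table-hover".
# find_all is the current name of the older findAll alias.
table = soup.find_all("table", {"class": "table table-hover"})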

In [9]:
#city={17:'Faro', 16:'Beja', 15:'Évora', 14:'Setúbal', 13:'Lisboa', 12:'Portalegre', 11:'Santarém', 10:'Leiria', 9:'Castelo Branco', 8:'Coimbra', 7:'Guarda',6:'Viseu', 5:'Aveiro', 4:'Porto', 3:'Bragança', 2:'Vila Real', 1:'Braga', 0:'Viana do Castelo'}

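l = []
for items in table:
    # Calling a tag, as in items("span", ...), is shorthand for items.find_all("span", ...).
    max_temps = items.find_all("span", {"class": "max-temp"})
    min_temps = items.find_all("span", {"class": "min-temp"})
    # Loop over one fewer entry than there are <tr> rows, presumably to skip a header row.
    for i in range(len(items.find_all("tr")) - 1):
        d = {}
        d["temp max"] = max_temps[i].text
        d["temp min"] = min_temps[i].text
        l.append(d)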
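The loop above pairs values by position, which relies on every row holding exactly one maximum and one minimum. A possible alternative, sketched here under assumptions, reads both spans from the same `<tr>` so that a row missing either value is skipped instead of shifting the pairing. The function name `extract_temperatures` and the sample markup are hypothetical; only the class names `table table-hover`, `max-temp` and `min-temp` come from the code above, and the real page may nest these spans differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used above.
sample_html = """
<table class="table table-hover">
  <tr><th>Day</th><th>Forecast</th></tr>
  <tr><td>Mon</td><td><span class="max-temp">18º</span> <span class="min-temp">11º</span></td></tr>
  <tr><td>Tue</td><td><span class="max-temp">15º</span> <span class="min-temp">8º</span></td></tr>
</table>
"""


def extract_temperatures(html):
    """Return one dict per table row that contains both temperature spans."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="table-hover"):
        for tr in table.find_all("tr"):
            max_span = tr.find("span", class_="max-temp")
            min_span = tr.find("span", class_="min-temp")
            # Header rows and incomplete rows have no spans, so skip them.
            if max_span is None or min_span is None:
                continue
            rows.append({
                "temp max": max_span.get_text(strip=True),
                "temp min": min_span.get_text(strip=True),
            })
    return rows


print(extract_temperatures(sample_html))
```

Because the function takes an HTML string, it can be tried on a saved copy of the page or, with the live site, on `page.text` from the request made earlier.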
    

In [11]:
l

[{'temp max': '18º', 'temp min': '11º'},
 {'temp max': '15º', 'temp min': '8º'},
 {'temp max': '14º', 'temp min': '9º'},
 {'temp max': '13º', 'temp min': '9º'},
 {'temp max': '12º', 'temp min': '9º'},
 {'temp max': '15º', 'temp min': '11º'},
 {'temp max': '15º', 'temp min': '12º'},
 {'temp max': '20º', 'temp min': '12º'},
 {'temp max': '17º', 'temp min': '8º'},
 {'temp max': '16º', 'temp min': '8º'},
 {'temp max': '15º', 'temp min': '8º'},
 {'temp max': '13º', 'temp min': '8º'},
 {'temp max': '15º', 'temp min': '9º'},
 {'temp max': '16º', 'temp min': '11º'},
 {'temp max': '21º', 'temp min': '12º'},
 {'temp max': '16º', 'temp min': '9º'},
 {'temp max': '15º', 'temp min': '6º'},
 {'temp max': '13º', 'temp min': '6º'},
 {'temp max': '10º', 'temp min': '7º'},
 {'temp max': '13º', 'temp min': '4º'},
 {'temp max': '15º', 'temp min': '9º'},
 {'temp max': '20º', 'temp min': '9º'},
 {'temp max': '14º', 'temp min': '6º'},
 {'temp max': '12º', 'temp min': '4º'},
 {'temp max': '12º', 'temp min': '