# 10. Web Scraping

Web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of the HTML and other files that comprise web pages), and then parses that data to extract needed information.
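
The request-then-parse cycle described above can be sketched with only the Python standard library. This is a minimal illustration and not the approach used in the rest of this chapter, which drives a real browser with Selenium. The names `fetch_html`, `TitleParser`, and `extract_title`, as well as the sample HTML string, are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url):
    """Request a page and return its HTML as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Parse an HTML string and return the page title (empty if missing)."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Hypothetical page content, used so that the example runs offline.
sample_html = "<html><head><title>Election results</title></head><body></body></html>"
print(extract_title(sample_html))
```

With network access, `extract_title(fetch_html(url))` would chain the request and the parsing step for a real page.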

## 10.1 Selenium
Selenium automates browsers. That's it! <br>
Selenium is a browser-automation tool with a Python library. One of the tasks it can automate is web scraping: extracting useful data that may otherwise be unavailable. <br>
**For this course, we use Chrome.**

In [1]:
#!pip install selenium==4.2.0
# !pip install webdriver-manager

# The current selenium release is 4.10, but you must install version 4.2

## 10.2 Calling Libraries

In [9]:
# check selenium version 

import selenium
print(selenium.__version__)

4.2.0


In [1]:

from selenium import webdriver  # driver control
from webdriver_manager.chrome import ChromeDriverManager # manages different driver versions


import re # regular expressions
import time # time
from selenium.webdriver.support.ui import Select  # works with the <select></select> tag
import os
import sys
from selenium.webdriver.common.by import By  # selects elements in an HTML page
import warnings
warnings.filterwarnings('ignore') # suppress warning messages

from selenium.webdriver.common.keys import Keys  # sends input to the web page (names, dates)
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import ActionChains # moves around the web page
import pandas as pd
import numpy as np
import unidecode  # used to handle accented characters
from tqdm import tqdm

## 7.3 Launch/Set the Driver
This code opens a Chrome driver. We will use it to browse the web.

In [14]:
# Case 1 - Download the driver

driver = webdriver.Chrome("chromedriver.exe") # open the automated Chrome browser
                         # the argument is the path to the chromedriver executable


In [15]:
driver.maximize_window() # maximize the window

url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/' # ONPE URL


In [16]:
driver.get( url ) # load the URL in the browser

## Another way to open the Chrome driver without using the executable

In [18]:
# driver = webdriver.Chrome( ChromeDriverManager().install() )

In [19]:
# driver.maximize_window() # maximize the window

# url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/' # ONPE URL


In [20]:
# driver.get( url ) # load the URL in the browser
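
Both ways of opening the driver can be wrapped in one helper. In Selenium 4, passing the driver path as a positional argument is deprecated in favor of a `Service` object, so the sketch below uses `Service`. The function name `launch_chrome` and its `driver_path` parameter are assumptions made for this example, and the imports are placed inside the function only so that defining it does not require the packages to be installed yet.

```python
def launch_chrome(driver_path=None):
    """Start Chrome and return the WebDriver object.

    driver_path: path to a chromedriver executable. If it is None,
    webdriver-manager downloads a driver and its path is used instead.
    """
    # Local imports: the function can be defined before the packages are installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path is None:
        from webdriver_manager.chrome import ChromeDriverManager
        driver_path = ChromeDriverManager().install()

    # Service wraps the chromedriver executable that controls the browser.
    return webdriver.Chrome(service=Service(driver_path))
```

A typical call would be `driver = launch_chrome("chromedriver.exe")` or simply `driver = launch_chrome()`. Calling `driver.quit()` at the end of the session closes the browser and stops the driver process.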

## Chrome is being controlled by automated test software!

In [6]:
# Access the content of the <title></title> tag
print('Title: ', driver.title)

Title:  Presentación de Resultados Elecciones Generales y Parlamento Andino 2021


In [7]:
# Access the URL

print('Current Page URL: ', driver.current_url)

Current Page URL:  https://resultadoshistorico.onpe.gob.pe/EG2021/


In [14]:
type(driver)

selenium.webdriver.chrome.webdriver.WebDriver

In [7]:
dir(driver) # list the methods and attributes of the object

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_authenticator_id',
 '_file_detector',
 '_get_cdp_details',
 '_is_remote',
 '_mobile',
 '_shadowroot_cls',
 '_switch_to',
 '_unwrap_value',
 '_web_element_cls',
 '_wrap_value',
 'add_cookie',
 'add_credential',
 'add_virtual_authenticator',
 'application_cache',
 'back',
 'bidi_connection',
 'capabilities',
 'caps',
 'close',
 'command_executor',
 'create_options',
 'create_web_element',
 'current_url',
 'current_window_handle',
 'delete_all_cookies',
 'delete_cookie',
 'delete_network_conditions',
 'desired_capabilities',
 'error_handler',
 'execute',
 'ex

### 7.4.1. HTML
HTML stands for HyperText Markup Language, the language used to create web pages. It is not a programming language like Python or Java; it is a markup language that describes the elements of a page through tags written in angle brackets.

1. The document always begins and ends using `<html>` and `</html>`.
2. `<body></body>` constitutes the visible part of the HTML document.
3. `<h1>` to `<h6>` tags are defined for the headings.

#### 7.4.1.1. HTML Headings
HTML headings are defined with the `<h1>` to `<h6>` tags.
`<h1>` defines the most important heading. `<h6>` defines the least important heading.

We can write these tags directly in text cells because Markdown renders HTML tags.

<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>

#### 7.4.1.2. HTML Paragraphs
HTML paragraphs are defined with the `<p>` tag.
The `<br>` tag inserts a line break, similar to `"\n"`.

<br>
<p>My first paragraph.</p> <br>
<p>This is another paragraph for this text cell.</p>

#### 7.4.1.3. HTML Links
HTML links are defined with the `<a>` tag:

<a href="http://bayes.cs.ucla.edu/jp_home.html">This is a link for Judea Pearl Website</a> 
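
When scraping, the part of a link we usually want is the `href` attribute. As a rough sketch using the standard-library `html.parser` module (not Selenium), the hypothetical `LinkParser` class below collects each link's address and text; anchors without an `href` are skipped.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link being read, or None
        self._text = []     # text pieces found inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


html = '<a href="http://bayes.cs.ucla.edu/jp_home.html">This is a link for Judea Pearl Website</a>'
parser = LinkParser()
parser.feed(html)
print(parser.links)
```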

#### 7.4.1.3. Unordered HTML List
An unordered list starts with the `<ul>` tag. Each list item starts with the `<li>` tag.

<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>

#### 7.4.1.4. Ordered HTML List
An ordered list starts with the `<ol>` tag. Each list item starts with the `<li>` tag.

<ol>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>

#### 7.4.1.4. HTML Tables

A table in HTML consists of table cells inside rows and columns. Each table row starts with a `<tr>` tag and ends with a `</tr>` tag. Each data cell is defined by a `<td>` and a `</td>` tag, and each header cell by a `<th>` and a `</th>` tag.

<table>
  <tr>
    <th>Manager</th>
    <th>Club</th>
    <th>Nationality</th>
  </tr>
    
  <tr>
    <td>Mikel Arteta</td>
    <td>Arsenal</td>
    <td>Spain</td>
  </tr>
    
  <tr>
    <td>Thomas Tuchel</td>
    <td>Chelsea</td>
    <td>Germany</td>
  </tr>
</table>
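
Because the `<tr>`, `<th>`, and `<td>` tags mark rows and cells explicitly, a table can be converted into Python records mechanically. The sketch below uses the standard-library `html.parser` module; the `TableParser` class is a hypothetical helper written for this example and does not handle nested tables or merged cells.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Turn the rows of a simple HTML table into lists of cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            if not self.rows:          # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


html = """
<table>
  <tr><th>Manager</th><th>Club</th><th>Nationality</th></tr>
  <tr><td>Mikel Arteta</td><td>Arsenal</td><td>Spain</td></tr>
  <tr><td>Thomas Tuchel</td><td>Chelsea</td><td>Germany</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# The first row holds the headers; the remaining rows hold the data.
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

Passing `records` to `pd.DataFrame` would produce a DataFrame with one column per header.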

#### 7.4.1.5. HTML Iframes

An HTML iframe is used to display a web page within a web page.
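
For scraping, the practical consequence is that elements inside an iframe belong to a separate document, so Selenium has to switch into the frame before it can find them. A minimal sketch follows; the helper name `read_inside_frame` and its parameters are assumptions made for this example, while `switch_to.frame`, `switch_to.default_content`, and `find_element` are the driver's own methods.

```python
def read_inside_frame(driver, frame, by, value):
    """Return the text of one element that lives inside an iframe.

    frame: anything accepted by driver.switch_to.frame
           (an index, a name or id, or the iframe element itself).
    by, value: a locator pair such as (By.TAG_NAME, "h1").
    """
    driver.switch_to.frame(frame)
    try:
        return driver.find_element(by, value).text
    finally:
        # Always go back to the top-level page, even if the lookup fails.
        driver.switch_to.default_content()
```

For example, `read_inside_frame(driver, 0, By.TAG_NAME, "h1")` would read the first `<h1>` of the first iframe on the page, with `By` imported from `selenium.webdriver.common.by`.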


<!DOCTYPE html>
<html>
  
<head>
    <title>1.0 HTML address </title>
</head>
  
<body style="text-align: center">
    <h1>Diploma</h1>
    <h2>HTML iframe</h2>
   
   <p> Add personal information </p> 
   
 <address>
     
Written by <a href="mailto:webmaster@example.com">Jon Doe</a>.<br> 
Visit us at:<br>
Example.com<br>
Box 564, Disneyland<br>
USA
</address>
    
</body>
  
</html>

<!DOCTYPE html>
<html>
<head>
<style>
table, th, td {
  border: 1px solid black;
}
</style>
</head>
<body>

<h1>2.0 The td element</h1>

<p>The td element defines a cell in a table:</p>

<table>
  <tr>
    <td>Cell A</td>
    <td>Cell B</td>
  </tr>
  <tr>
    <td>Cell C</td>
    <td>Cell D</td>
  </tr>
</table>

</body>
</html>


<!DOCTYPE html>
<html>
<body>

<h1>3.0 Button </h1>

<p>Click the button below to display the hidden content from the template element.</p>

<button onclick="showContent()">Click here</button>

</body>
</html>



<!DOCTYPE html>
<html>
<body>

<h1>4.0 The form element</h1>

<form action="/action_page.php">
  <label for="fname">First name:</label>
  <input type="text" id="fname" name="fname"><br><br>
  <label for="lname">Last name:</label>
  <input type="text" id="lname" name="lname"><br><br>
  <button onclick="showContent()">Submit</button>
</form>

<p>Click the "Submit" button and the form-data will be sent to a page on the 
server called "action_page.php".</p>

</body>
</html>


<!DOCTYPE html>
<html>
<body>

<h1> 5.0 The label element</h1>

<p>Click on one of the text labels to toggle the related radio button:</p>

<form action="/action_page.php">
  <input type="radio" id="html" name="fav_language" value="HTML">
  <label for="html">HTML</label><br>
  <input type="radio" id="css" name="fav_language" value="CSS">
  <label for="css">CSS</label><br>
  <input type="radio" id="javascript" name="fav_language" value="JavaScript">
  <label for="javascript">JavaScript</label><br><br>

  <button onclick="showContent()">Submit</button>
    
</form>

</body>
</html>


<!DOCTYPE html>
<html>
<body>

<h1>6.0 The select element</h1>

<p>The select element is used to create a drop-down list.</p>

<form action="/action_page.php">
  <label for="cars">Choose a car:</label>
  <select name="cars" id="cars">
    <option value="volvo">Volvo</option>
    <option value="saab">Saab</option>
    <option value="opel">Opel</option>
    <option value="audi">Audi</option>
  </select>
  <br><br>
</form>

<button onclick="showContent()">Show hidden content</button>
    
<p>Click the "Submit" button and the form-data will be sent to a page on the 
server called "action_page.php".</p>

</body>
</html>

<!DOCTYPE html>
<html>
 
<h1>7.0 Class attribute</h1>    
    
<head>
    <style>
        .country {
            background-color: black;
            color: white;
            padding: 8px;
        }
    </style>
</head>
 
<body>
 
<h2 class="country">CHINA</h2>
     
<p>China has the largest population
       in the world.</p>
 
 
<h2 class="country">INDIA</h2>
     
<p>India has the second largest
       population in the world.</p>
 
 
<h2 class="country">UNITED STATES</h2>
     
<p>United States has the third largest
       population in the world.</p>
 
 
</body>
 
</html>


<!DOCTYPE html>
<html>
    
<h1> 8.0 Style</h1>   
    
<head>
<style>
h1 {color:red;}
p {color:blue;}
</style>
</head>
<body>

<h1>This is a heading</h1>
<p>This is a paragraph.</p>

</body>
</html>


<!DOCTYPE html>
<html>
<head>
<style>
#myHeader {
  background-color: lightblue;
  color: black;
  padding: 40px;
  text-align: center;
} 
</style>
</head>
<body>

<h2> 9.0 The id Attribute</h2>
<p>Use CSS to style an element with the id "myHeader":</p>

<h1 id="myHeader">My Header</h1>

</body>
</html>



<html>
<head>
    
<h1> 10.0 Div tagname </h1>   
    
<style>
.myDiv {
  border: 5px outset red;
  background-color: lightblue;
  text-align: center;
}
</style>
</head>
<body>

<div class="myDiv">
  <h2>This is a heading in a div element</h2>
  <p>This is some text in a div element.</p>
</div>

</body>
</html>

#### 7.4.1.6. HTML Tags - Key

|Tag|Description|
|---|---|
|`<h1>` to `<h6>`|Defines HTML headings.|
|`<ul>`|Defines an unordered list.|
|`<ol>`|Defines an ordered list.|
|`<p>`|Defines a paragraph.|
|`<a>`|Anchor tag; creates a hyperlink.|
|`<div>`|Defines a division or section within an HTML document.|
|`<strong>`|Marks important text.|
|`<table>`|Presents data in tabular form.|
|`<td>`|Defines a data cell of an HTML table.|
|`<iframe>`|Defines an inline frame.|

### 7.4. Identifying elements in a web page

To identify elements of a web page, we need to inspect it. With the driver window open, press `Ctrl` + `Shift` + `I` to open the developer tools.

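#### One Element
|Method|Description|
|---|---|
|`find_element_by_id`|Returns the first element with the given `id`.|
|`find_element_by_name`|Returns the first element with the given `name` attribute.|
|`find_element_by_xpath`|Returns the first element matching the XPath expression.|
|`find_element_by_tag_name`|Returns the first element with the given HTML tag.|
|`find_element_by_class_name`|Returns the first element with the given class name.|
|`find_element_by_css_selector`|Returns the first element matching the CSS selector.|

#### Multiple elements
|Method|Description|
|---|---|
|`find_elements_by_id`|Returns a list of all elements with the given `id`.|
|`find_elements_by_name`|Returns a list of all elements with the given `name` attribute.|
|`find_elements_by_xpath`|Returns a list of all elements matching the XPath expression.|
|`find_elements_by_tag_name`|Returns a list of all elements with the given HTML tag.|
|`find_elements_by_class_name`|Returns a list of all elements with the given class name.|
|`find_elements_by_css_selector`|Returns a list of all elements matching the CSS selector.|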
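
The methods in both tables are available in Selenium 4.2.0, the version installed for this course, but they are deprecated in Selenium 4 and were removed in version 4.3.0. The equivalent call passes a locator strategy from the `By` class (imported earlier) to `find_element` or `find_elements`. The helper below is a small sketch of that style; the name `texts_of` and the example ids and class names in the comments are assumptions.

```python
def texts_of(driver, by, value):
    """Return the stripped text of every element matched by a locator.

    by is a locator strategy such as By.ID, By.NAME, By.XPATH, By.TAG_NAME,
    By.CLASS_NAME or By.CSS_SELECTOR (from selenium.webdriver.common.by).
    """
    return [element.text.strip() for element in driver.find_elements(by, value)]


# Deprecated style                          ->  equivalent call with By
# driver.find_element_by_id("x")            ->  driver.find_element(By.ID, "x")
# driver.find_elements_by_class_name("c")   ->  driver.find_elements(By.CLASS_NAME, "c")
# driver.find_element_by_xpath("//h1")      ->  driver.find_element(By.XPATH, "//h1")
```

For instance, `texts_of(driver, By.CLASS_NAME, "country")` would return the text of every element whose class is `country`. `find_elements` returns an empty list when nothing matches, whereas `find_element` raises `NoSuchElementException`.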

### 7.4.1. Xpath
XPath (XML Path Language) is a query language for navigating the structure of an XML or HTML document. In Selenium, an XPath expression identifies an element on a web page by its path through the HTML tree.

The basic format of an XPath expression is shown in the screenshot below.
<img src="../../_images/x_path.png">

**DO NOT COMPLICATE!**
Finding the XPath of an element:
1. Go to the element.
2. Right-click.
3. Select Inspect. You may have to do it twice.
4. Go to the highlighted line.
5. Right-click.
6. Select Copy.
7. Select Copy full XPath.
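
To see what such paths select without opening a browser, the sketch below evaluates two path expressions with the standard-library `xml.etree.ElementTree` module. This is a stand-in for Selenium, not its API: ElementTree supports only a subset of XPath and needs well-formed markup, and the sample page is an assumption modeled on the class attribute example.

```python
import xml.etree.ElementTree as ET

# Small, well-formed page used only for this illustration.
page = """
<html>
  <body>
    <h2 class="country">CHINA</h2>
    <p>China has the largest population in the world.</p>
    <h2 class="country">INDIA</h2>
    <p>India has the second largest population in the world.</p>
  </body>
</html>
""".strip()

root = ET.fromstring(page)  # root is the <html> element

# Full-path style: every step from the root is spelled out.
# In Selenium the same element would be written as /html/body/h2[1].
first_heading = root.find("./body/h2[1]")

# Relative style: search anywhere below the root and filter by attribute.
# In Selenium this would be written as //h2[@class='country'].
countries = [h2.text for h2 in root.findall(".//h2[@class='country']")]

print(first_heading.text)
print(countries)
```

A full path breaks as soon as the page layout changes, while a relative path that filters on an attribute tends to keep working, so the copied full XPath is a convenient starting point rather than the only option.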