# Statement
Learn how to do web scraping.

Level 1

- Exercise 1

Perform web scraping of a page of the Madrid Stock Exchange (https://www.bolsamadrid.es) using BeautifulSoup and Selenium.

Level 2

- Exercise 2

Document, in a Word document, the data set generated from the information contained in the different Kaggle files.

Level 3

- Exercise 3

Download a web page of your choice and perform web scraping using the Scrapy library.

# Level 1

## - Exercise 1

Perform web scraping of a page of the Madrid Stock Exchange (https://www.bolsamadrid.es) using BeautifulSoup and Selenium.


In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


In [2]:

URL = "https://www.bolsamadrid.es"
page = requests.get(URL)

# Save the raw HTML so it can be inspected offline; the context manager closes the file
with open('Output_Bolsa_raw.html', 'w', encoding='utf-8') as fsock:
    print(page.text, file=fsock)

In [3]:
# Parse the response fetched in the previous cell; no second request is needed
soup = BeautifulSoup(page.content, "html.parser")

### Let's scrape the stock market website with BeautifulSoup.

In [4]:
# Save the prettified HTML to make the page structure easier to read
with open('Output_Bolsa_BeautifulSoup_Prettify.html', 'w', encoding='utf-8') as fsock:
    print(soup.prettify(), file=fsock)


![](2022-03-30-16-07-37.png)

In [5]:
ibex35 = soup.find(class_="TblPort TblAccPort")
ibex35.text

'\nNombreÚltimo% Dif.\nACCIONA170,40000,35ACERINOX10,0850-0,20ACS24,6800-0,04AENA152,70000,56ALMIRALL11,7600-1,18AMADEUS58,8200-0,94ARCELORMIT.29,5550-0,34B.SANTANDER3,16400,68BA.SABADELL0,77121,37BANKINTER5,46401,00BBVA5,3150-1,39CAIXABANK3,17400,95CELLNEX44,0100-0,74CIE AUTOMOT.21,0200-0,28ENAGAS19,9350-0,03ENDESA19,75002,23FERROVIAL24,0100-0,29FLUIDRA26,6500-0,93GRIFOLS CL.A16,3550-0,91IAG1,71200,23IBERDROLA10,01000,72INDITEX20,2300-2,83INDRA A10,2400-0,19INM.COLONIAL8,2650-0,72MAPFRE1,9190-0,03MELIA HOTELS6,8480-0,49MERLIN10,6050-1,44NATURGY27,1700-0,26PHARMA MAR68,3400-2,37R.E.C.18,47000,65REPSOL11,9320-0,75ROVI68,00000,29SIEMENS GAME15,9650-0,99SOLARIA20,01000,45TELEFONICA4,3675-1,01'
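The lookup above passes the whole string `"TblPort TblAccPort"` to `class_`, which BeautifulSoup compares against the exact value of the `class` attribute. A minimal sketch of an alternative follows, assuming we only care that the table carries both classes, in any order. The HTML snippet is a made-up stand-in for the real page, not a copy of it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: same two classes, but in reverse order
html = """
<table class="TblAccPort TblPort">
  <tr><th>Nombre</th><th>Último</th><th>% Dif.</th></tr>
  <tr><td>ACCIONA</td><td>170,4000</td><td>0,35</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: sensitive to the order of the classes
exact = soup.find(class_="TblPort TblAccPort")

# CSS selector: matches a table that has both classes, whatever their order
by_css = soup.select_one("table.TblPort.TblAccPort")

print(exact is None)
print(by_css.find("td").get_text())
```

On the real page the exact-string lookup worked, so this only matters if the site ever reorders or extends the class list.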

In [6]:
# Build one list of cell texts per table row (header row included)
rows = []
for tr in ibex35.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if cells:
        rows.append(cells)
rows

[['Nombre', 'Último', '% Dif.'],
 ['ACCIONA', '170,4000', '0,35'],
 ['ACERINOX', '10,0850', '-0,20'],
 ['ACS', '24,6800', '-0,04'],
 ['AENA', '152,7000', '0,56'],
 ['ALMIRALL', '11,7600', '-1,18'],
 ['AMADEUS', '58,8200', '-0,94'],
 ['ARCELORMIT.', '29,5550', '-0,34'],
 ['B.SANTANDER', '3,1640', '0,68'],
 ['BA.SABADELL', '0,7712', '1,37'],
 ['BANKINTER', '5,4640', '1,00'],
 ['BBVA', '5,3150', '-1,39'],
 ['CAIXABANK', '3,1740', '0,95'],
 ['CELLNEX', '44,0100', '-0,74'],
 ['CIE AUTOMOT.', '21,0200', '-0,28'],
 ['ENAGAS', '19,9350', '-0,03'],
 ['ENDESA', '19,7500', '2,23'],
 ['FERROVIAL', '24,0100', '-0,29'],
 ['FLUIDRA', '26,6500', '-0,93'],
 ['GRIFOLS CL.A', '16,3550', '-0,91'],
 ['IAG', '1,7120', '0,23'],
 ['IBERDROLA', '10,0100', '0,72'],
 ['INDITEX', '20,2300', '-2,83'],
 ['INDRA A', '10,2400', '-0,19'],
 ['INM.COLONIAL', '8,2650', '-0,72'],
 ['MAPFRE', '1,9190', '-0,03'],
 ['MELIA HOTELS', '6,8480', '-0,49'],
 ['MERLIN', '10,6050', '-1,44'],
 ['NATURGY', '27,1700', '-0,26'],
 ['PHAR

### Collect them all in a DataFrame

In [7]:
# The first row holds the column titles; the remaining rows are the data
df_scrap = pd.DataFrame(rows[1:], columns=rows[0])
df_scrap

,Nombre,Último,% Dif.
0,ACCIONA,"170,4000","0,35"
1,ACERINOX,"10,0850","-0,20"
2,ACS,"24,6800","-0,04"
3,AENA,"152,7000","0,56"
4,ALMIRALL,"11,7600","-1,18"
5,AMADEUS,"58,8200","-0,94"
6,ARCELORMIT.,"29,5550","-0,34"
7,B.SANTANDER,"3,1640","0,68"
8,BA.SABADELL,"0,7712","1,37"
9,BANKINTER,"5,4640","1,00"
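The scraped values are still strings in Spanish number format (decimal comma, dot as thousands separator). A minimal sketch of converting them to floats is shown here; the helper name `to_float` and the frame `df_numeric` are illustrative, and the three rows are copied from the output above.

```python
import pandas as pd

# First rows of the scraped table, copied from the output above
df_scrap = pd.DataFrame(
    [["ACCIONA", "170,4000", "0,35"],
     ["ACERINOX", "10,0850", "-0,20"],
     ["ACS", "24,6800", "-0,04"]],
    columns=["Nombre", "Último", "% Dif."],
)


def to_float(series: pd.Series) -> pd.Series:
    """Convert Spanish-formatted numbers such as '1.234,56' to floats."""
    return (
        series.str.replace(".", "", regex=False)   # drop thousands separators
              .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
              .astype(float)
    )


df_numeric = df_scrap.copy()
for column in ["Último", "% Dif."]:
    df_numeric[column] = to_float(df_numeric[column])

print(df_numeric.dtypes)
```

Working on a copy keeps the original strings available in case the conversion needs to be checked against the page.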


### Let's now do something similar with Selenium

In [39]:
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument('--headless')  # run Chrome without opening a window
browser = Chrome(options=opts)
browser.get('https://www.bolsamadrid.es/esp/aspx/Portada/Portada.aspx')


In [44]:
from selenium.webdriver.common.by import By

# Match the elements that carry both CSS classes, TblPort and TblAccPort
ibex35s = browser.find_elements(By.CSS_SELECTOR, '.TblPort.TblAccPort')

for element in ibex35s:
    print(element.text)

### As we can see, web scraping behaves differently with BeautifulSoup and with Selenium.  
BeautifulSoup sees the values of all the shares in the table TblPort.TblAccPort, but Selenium only sees  
the values that are currently shown on the screen. So, to see all the values with Selenium, we have to change the approach.
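One possible alternative, sketched here under assumptions rather than taken from the notebook, is to hand the HTML that Selenium has loaded (its `browser.page_source` property) to BeautifulSoup. That would work only if the rows that are not on screen are still present in the page's DOM. The helper `table_rows` and the sample HTML below are hypothetical.

```python
from bs4 import BeautifulSoup


def table_rows(html: str, css_selector: str) -> list[list[str]]:
    """Return the cell texts of each row of the first element matching css_selector."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(css_selector)
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Made-up stand-in for browser.page_source; the second data row is hidden by CSS
sample_html = """
<table class="TblPort TblAccPort">
  <tr><th>Nombre</th><th>Último</th><th>% Dif.</th></tr>
  <tr><td>ACCIONA</td><td>170,4000</td><td>0,35</td></tr>
  <tr style="display:none"><td>ACERINOX</td><td>10,0850</td><td>-0,20</td></tr>
</table>
"""
rows = table_rows(sample_html, "table.TblPort.TblAccPort")
print(rows)
```

In the notebook this would be called as `table_rows(browser.page_source, "table.TblPort.TblAccPort")`.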

In [41]:
browser.get('https://www.bolsamadrid.es/esp/aspx/Mercados/Precios.aspx?indice=ESI100000000')

In [52]:
from selenium.webdriver.common.by import By

# The prices table has a unique id, so a single-element lookup is enough
str_dump = browser.find_element(By.ID, 'ctl00_Contenido_tblAcciones').text

In [61]:
str_dump

'Nombre Últ. % Dif. Máx. Mín. Volumen Efectivo (miles €) Fecha Hora\nACCIONA 170,9000 0,65 171,5000 169,5000 11.617 1.980,10 31/03/2022 12:23:28\nACERINOX 10,0700 -0,35 10,2050 10,0500 297.302 3.005,88 31/03/2022 12:23:22\nACS 24,6500 -0,16 24,9100 24,6500 97.381 2.412,28 31/03/2022 12:23:03\nAENA 152,9500 0,72 153,9500 151,8000 30.591 4.677,60 31/03/2022 12:23:18\nALMIRALL 11,7400 -1,34 11,9600 11,6200 122.535 1.442,54 31/03/2022 12:15:34\nAMADEUS 58,6800 -1,18 60,3600 58,6800 140.228 8.366,21 31/03/2022 12:23:07\nARCELORMIT. 29,4500 -0,69 29,7900 29,4150 87.949 2.603,39 31/03/2022 12:23:27\nB.SANTANDER 3,1545 0,38 3,1995 3,1500 29.208.859 92.804,42 31/03/2022 12:23:28\nBA.SABADELL 0,7634 0,34 0,7830 0,7628 133.100.953 102.961,56 31/03/2022 12:22:49\nBANKINTER 5,4300 0,37 5,5020 5,3980 881.470 4.822,59 31/03/2022 12:22:10\nBBVA 5,2970 -1,73 5,4220 5,2920 35.417.329 190.980,60 31/03/2022 12:23:32\nCAIXABANK 3,1520 0,25 3,1860 3,1420 2.695.761 8.541,52 31/03/2022 12:23:15\nCELLNEX 44,17

In [67]:
rows = str_dump.splitlines()

In [75]:
table = []
for row in rows:
    line = row.split(' ')
    table.append(line)


In [94]:
# Splitting on spaces also breaks the column titles that contain spaces
columns = table[0]
data = table[1:]
print(columns)

['Nombre', 'Últ.', '%', 'Dif.', 'Máx.', 'Mín.', 'Volumen', 'Efectivo', '(miles', '€)', 'Fecha', 'Hora']


In [98]:
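# Rebuild the titles by hand. 'Index' is an extra first column: rows whose company name
# contains a space split into 10 tokens, while rows with a one-word name give only 9.
columns = ['Index', 'Nombre', 'Últ.', '% Dif.', 'Máx.', 'Mín.', 'Volumen', 'Efectivo (miles €)', 'Fecha', 'Hora']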
print(columns)

['Index', 'Nombre', 'Últ.', '% Dif.', 'Máx.', 'Mín.', 'Volumen', 'Efectivo (miles €)', 'Fecha', 'Hora']


In [100]:
df = pd.DataFrame(data,columns=columns)
df

,Index,Nombre,Últ.,% Dif.,Máx.,Mín.,Volumen,Efectivo (miles €),Fecha,Hora
0,ACCIONA,"170,9000","0,65","171,5000","169,5000",11.617,"1.980,10",31/03/2022,12:23:28,
1,ACERINOX,"10,0700","-0,35","10,2050","10,0500",297.302,"3.005,88",31/03/2022,12:23:22,
2,ACS,"24,6500","-0,16","24,9100","24,6500",97.381,"2.412,28",31/03/2022,12:23:03,
3,AENA,"152,9500","0,72","153,9500","151,8000",30.591,"4.677,60",31/03/2022,12:23:18,
4,ALMIRALL,"11,7400","-1,34","11,9600","11,6200",122.535,"1.442,54",31/03/2022,12:15:34,
5,AMADEUS,"58,6800","-1,18","60,3600","58,6800",140.228,"8.366,21",31/03/2022,12:23:07,
6,ARCELORMIT.,"29,4500","-0,69","29,7900","29,4150",87.949,"2.603,39",31/03/2022,12:23:27,
7,B.SANTANDER,"3,1545","0,38","3,1995","3,1500",29.208.859,"92.804,42",31/03/2022,12:23:28,
8,BA.SABADELL,"0,7634","0,34","0,7830","0,7628",133.100.953,"102.961,56",31/03/2022,12:22:49,
9,BANKINTER,"5,4300","0,37","5,5020","5,3980",881.470,"4.822,59",31/03/2022,12:22:10,
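In the table above, one-word company names land in the `Index` column and every value sits one column to the left of its title. A minimal sketch of one way to avoid this, assuming every row ends with exactly eight space-free value fields: split each line from the right, so that whatever remains on the left is the full company name. The first two data rows are copied from the dump above; the third is a made-up example of a name that contains a space.

```python
import pandas as pd

COLUMNS = ["Nombre", "Últ.", "% Dif.", "Máx.", "Mín.", "Volumen",
           "Efectivo (miles €)", "Fecha", "Hora"]

# Header and two rows copied from the dump; the last row is hypothetical
str_dump = (
    "Nombre Últ. % Dif. Máx. Mín. Volumen Efectivo (miles €) Fecha Hora\n"
    "ACCIONA 170,9000 0,65 171,5000 169,5000 11.617 1.980,10 31/03/2022 12:23:28\n"
    "ACS 24,6500 -0,16 24,9100 24,6500 97.381 2.412,28 31/03/2022 12:23:03\n"
    "EXAMPLE CO. 1,0000 0,00 1,0000 1,0000 1.000 1,00 31/03/2022 12:00:00"
)

lines = str_dump.splitlines()

# Skip the header line, then split each data line from the right into
# 8 value fields; the remainder on the left is the company name.
data = [line.rsplit(" ", len(COLUMNS) - 1) for line in lines[1:]]

df = pd.DataFrame(data, columns=COLUMNS)
print(df[["Nombre", "Últ.", "Hora"]])
```

With this split no `Index` placeholder column is needed, because every row yields exactly nine fields.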



# Level 2

## - Exercise 2

Document, in a Word document, the data set generated from the information contained in the different Kaggle files.


### We will do this in a second submission


# Level 3 

## - Exercise 3

Download a web page of your choice and perform web scraping using the Scrapy library.

### We will do this in a second submission