# Dataframe Structures

Web Scraping exercises.



### - Exercise 1
Perform web scraping of two of the three proposed web pages using BeautifulSoup first and Selenium afterwards. 

- http://quotes.toscrape.com

- https://www.bolsamadrid.es

- www.wikipedia.es (do a search first and scrape some content)



### - Exercise 2
Document your data set generated with the information in the different Kaggle files in a Word document.

Further information

As an example of what is requested, you can consult this link:

-> https://www.kaggle.com/datasets/vivovinco/20212022-football-team-stats .



### - Exercise 3
Choose a web page of your choice and perform web scraping using the Selenium library first and Scrapy later. 

In [2]:
# import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from time import sleep
import scrapy
from scrapy.crawler import CrawlerProcess
import os


### https://www.bolsamadrid.es using BeautifulSoup:

In [3]:
# Since the target is a table, I will try to keep its original layout.

# download the page
web = requests.get('https://www.bolsamadrid.es/ing/aspx/Mercados/Precios.aspx?indice=ESI100000000')
# parse the HTML
soup = BeautifulSoup(web.content, "html.parser")
# find the table we want to scrape
tables = soup.find_all("table", {"id": "ctl00_Contenido_tblAcciones"})
# keep the first match
my_table = tables[0]
# find the rows of the table
rows = my_table.find_all("tr")
# find the header cells, which hold the column names
names = my_table.find_all("th")
# keep the column names
columns_name = []

for name in names:
    columns_name.append(name.text)

# define an empty DataFrame with those column names
df_bolsaMadrid_1 = pd.DataFrame(columns=columns_name)

# add one DataFrame row per table row that contains data cells
for row in rows:
    cols = []
    for x in row.find_all("td"):
        cols.append(x.text)
    if len(cols) != 0:
        df_bolsaMadrid_1.loc[len(df_bolsaMadrid_1.index)] = cols
# show the table
display(df_bolsaMadrid_1)



| | Name | Last | % Dif. | High | Low | Volume | Turnover (€ Thousands) | Date | Time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ACCIONA | 184.0 | 1.04 | 185.8 | 181.1 | 73453 | 13511.55 | 01/11/2022 | Close |
| 1 | ACCIONA ENER | 40.48 | 1.81 | 40.98 | 40.06 | 161955 | 6571.24 | 01/11/2022 | Close |
| 2 | ACERINOX | 9.034 | 1.94 | 9.046 | 8.908 | 831749 | 7467.44 | 01/11/2022 | Close |
| 3 | ACS | 25.95 | 0.0 | 26.29 | 25.72 | 431238 | 11240.13 | 01/11/2022 | Close |
| 4 | AENA | 118.3 | -0.71 | 120.0 | 117.45 | 90727 | 10736.65 | 01/11/2022 | Close |
| 5 | AMADEUS | 52.4 | -0.64 | 53.96 | 52.26 | 363502 | 19249.49 | 01/11/2022 | Close |
| 6 | ARCELORMIT. | 22.855 | 0.84 | 23.25 | 22.72 | 253017 | 5809.99 | 01/11/2022 | Close |
| 7 | B.SANTANDER | 2.6435 | 0.82 | 2.6775 | 2.6315 | 27135256 | 71952.74 | 01/11/2022 | Close |
| 8 | BA.SABADELL | 0.8138 | 2.29 | 0.8194 | 0.798 | 35203200 | 28630.53 | 01/11/2022 | Close |
| 9 | BANKINTER | 6.12 | 0.07 | 6.194 | 6.12 | 1523104 | 9346.63 | 01/11/2022 | Close |
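
The cell above appends to the DataFrame one row at a time with `.loc`. As a minimal sketch of an alternative, the rows can be collected in a plain list and the DataFrame built once at the end. The inline HTML below is a hypothetical stand-in for the downloaded page (only the table id is taken from the notebook), and `table_to_dataframe` is an illustrative helper name, not part of BeautifulSoup or pandas.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML fragment used instead of a live request,
# so the sketch runs without network access.
SAMPLE_HTML = """
<table id="ctl00_Contenido_tblAcciones">
  <tr><th>Name</th><th>Last</th><th>% Dif.</th></tr>
  <tr><td>ACCIONA</td><td>184.0</td><td>1.04</td></tr>
  <tr><td>ACS</td><td>25.95</td><td>0.0</td></tr>
</table>
"""


def table_to_dataframe(html, table_id):
    """Parse one HTML table into a DataFrame of strings (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"no table with id {table_id!r} found")

    # Header cells give the column names.
    header = [th.get_text(strip=True) for th in table.find_all("th")]

    # Collect the data rows in a list; rows without <td> cells (the header) are skipped.
    records = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            records.append(cells)

    # Build the DataFrame in a single step.
    return pd.DataFrame(records, columns=header)


df_sample = table_to_dataframe(SAMPLE_HTML, "ctl00_Contenido_tblAcciones")
print(df_sample)
```

With the real page, the same helper could be called as `table_to_dataframe(web.content, "ctl00_Contenido_tblAcciones")`, assuming the table keeps that id.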


### http://quotes.toscrape.com/ using BeautifulSoup:

In [4]:
URL = "http://quotes.toscrape.com/tag/life/" #specify the URL of the webpage you want to scrape.

page = requests.get(URL) 

In [5]:
soup = BeautifulSoup(page.content, "html.parser")

In [6]:
quotes = soup.find_all(class_ = "quote")


In [7]:
author = []
text = []

for i in quotes:
    text.append(i.find(class_= "text").text)
    author.append(i.find(class_ = "author").text)
    
print(text)

print(author)

['“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“It is better to be hated for what you are than to be loved for what you are not.”', "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doe

In [8]:
df_quotes_1 = pd.DataFrame({'Author':author,'Quote':text})
df_quotes_1

| | Author | Quote |
|---|---|---|
| 0 | Albert Einstein | “There are only two ways to live your life. On... |
| 1 | André Gide | “It is better to be hated for what you are tha... |
| 2 | Marilyn Monroe | “This life is what you make it. No matter what... |
| 3 | Douglas Adams | “I may not have gone where I intended to go, b... |
| 4 | Mark Twain | “Good friends, good books, and a sleepy consci... |
| 5 | Allen Saunders | “Life is what happens to us while we are makin... |
| 6 | Dr. Seuss | “Today you are You, that is truer than true. T... |
| 7 | Albert Einstein | “Life is like riding a bicycle. To keep your b... |
| 8 | George Bernard Shaw | “Life isn't about finding yourself. Life is ab... |
| 9 | Ralph Waldo Emerson | “Finish each day and be done with it. You have... |


### https://www.bolsamadrid.es using Selenium:

In [None]:
#I had to reinstall an older version of Selenium in order to fix a persistent error:
# https://pythoninoffice.com/fixing-attributeerror-webdriver-object-has-no-attribute-find_element_by_xpath/
pip install selenium==4.2.0 --force-reinstall

In [9]:
#initializing the driver
driver_path = "/Users/sandychiereghin/ITACADEMY/Sprint10_Web_scraping/chromedriver"
driver = webdriver.Chrome(executable_path= driver_path)
driver.get('https://google.com')

In [10]:
#driver settings
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

#close the test browser opened above before starting the headless one
driver.quit()
driver = webdriver.Chrome(options=options, executable_path=driver_path)


In [11]:
#load the page (uncomment the print below to inspect the page source)
driver.get("https://www.bolsamadrid.es/ing/aspx/Mercados/Precios.aspx?indice=ESI100000000")
#print(driver.page_source)


In [12]:
#rows X_path
row_Xpath = '//*[@id="ctl00_Contenido_tblAcciones"]/tbody/tr'

#header X_path
header_X_path = '//*[@id="ctl00_Contenido_tblAcciones"]/tbody/tr[1]/th'

In [13]:
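#headless mode was already enabled above; setting it again here does not affect the driver that is already running
options.headless = True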

In [14]:
# set column names
columns_name = []

# keep the column names
for name in driver.find_elements(by=By.XPATH, value=header_X_path):
    columns_name.append(name.text)

# define an empty DataFrame with those column names
df_bolsaMadrid_2 = pd.DataFrame(columns=columns_name)

# keep the rows that contain data cells
for x in driver.find_elements(by=By.XPATH, value=row_Xpath):
    # find_elements(By.TAG_NAME, ...) replaces the deprecated find_elements_by_tag_name
    cells = x.find_elements(by=By.TAG_NAME, value='td')
    data = []
    for cell in cells:
        data.append(cell.text)
    if len(data) != 0:
        df_bolsaMadrid_2.loc[len(df_bolsaMadrid_2.index)] = data

df_bolsaMadrid_2.head()


| | Name | Last | % Dif. | High | Low | Volume | Turnover (€ Thousands) | Date | Time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ACCIONA | 184.0 | 1.04 | 185.8 | 181.1 | 73453 | 13511.55 | 01/11/2022 | Close |
| 1 | ACCIONA ENER | 40.48 | 1.81 | 40.98 | 40.06 | 161955 | 6571.24 | 01/11/2022 | Close |
| 2 | ACERINOX | 9.034 | 1.94 | 9.046 | 8.908 | 831749 | 7467.44 | 01/11/2022 | Close |
| 3 | ACS | 25.95 | 0.0 | 26.29 | 25.72 | 431238 | 11240.13 | 01/11/2022 | Close |
| 4 | AENA | 118.3 | -0.71 | 120.0 | 117.45 | 90727 | 10736.65 | 01/11/2022 | Close |


In [15]:
df_bolsaMadrid_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Name                    35 non-null     object
 1   Last                    35 non-null     object
 2   % Dif.                  35 non-null     object
 3   High                    35 non-null     object
 4   Low                     35 non-null     object
 5   Volume                  35 non-null     object
 6   Turnover (€ Thousands)  35 non-null     object
 7   Date                    35 non-null     object
 8   Time                    35 non-null     object
dtypes: object(9)
memory usage: 2.7+ KB
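
The `info()` output shows every column stored as `object`, that is, as text, so numeric operations such as sorting by price or summing volumes would not behave as expected. A minimal sketch of a type conversion follows, using three rows copied from the output above. It assumes that a comma, if one appears in the scraped text, is a thousands separator and that dates are in day/month/year order; `to_numeric_columns` is an illustrative helper name.

```python
import pandas as pd

# Three rows copied from the scraped output, kept as strings like the scraped data.
raw = pd.DataFrame(
    {
        "Name": ["ACCIONA", "ACERINOX", "ACS"],
        "Last": ["184.0", "9.034", "25.95"],
        "% Dif.": ["1.04", "1.94", "0.0"],
        "Volume": ["73453", "831749", "431238"],
        "Date": ["01/11/2022", "01/11/2022", "01/11/2022"],
    }
)


def to_numeric_columns(df, columns):
    """Return a copy of df with the given text columns converted to numbers."""
    out = df.copy()
    for col in columns:
        # Assumption: any comma is a thousands separator, so it is dropped.
        cleaned = out[col].str.replace(",", "", regex=False)
        # Values that cannot be parsed become NaN instead of raising.
        out[col] = pd.to_numeric(cleaned, errors="coerce")
    return out


typed = to_numeric_columns(raw, ["Last", "% Dif.", "Volume"])
# Assumption: the site writes dates as day/month/year.
typed["Date"] = pd.to_datetime(typed["Date"], format="%d/%m/%Y")
print(typed.dtypes)
```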


### http://quotes.toscrape.com using Selenium

In [16]:
#initializing the driver
driver_path = "/Users/sandychiereghin/ITACADEMY/Sprint10_Web_scraping/chromedriver"
driver = webdriver.Chrome(executable_path= driver_path)
driver.get('https://google.com')

In [17]:
#load the page (uncomment the print below to inspect the page source)
driver.get(URL)
#print(driver.page_source)


In [18]:
text = driver.find_elements(by= By.XPATH, value = '//div[@class="quote"]')


In [19]:
texts = []
author2 = []
for element in text:
    texts.append(element.find_element(By.CLASS_NAME, 'text').text)
    author2.append(element.find_element(By.CLASS_NAME, 'author').text)
    
texts

['“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, d

In [20]:
table2 = pd.DataFrame({'Author':author2,'Quote':texts})
table2

| | Author | Quote |
|---|---|---|
| 0 | Albert Einstein | “There are only two ways to live your life. On... |
| 1 | André Gide | “It is better to be hated for what you are tha... |
| 2 | Marilyn Monroe | “This life is what you make it. No matter what... |
| 3 | Douglas Adams | “I may not have gone where I intended to go, b... |
| 4 | Mark Twain | “Good friends, good books, and a sleepy consci... |
| 5 | Allen Saunders | “Life is what happens to us while we are makin... |
| 6 | Dr. Seuss | “Today you are You, that is truer than true. T... |
| 7 | Albert Einstein | “Life is like riding a bicycle. To keep your b... |
| 8 | George Bernard Shaw | “Life isn't about finding yourself. Life is ab... |
| 9 | Ralph Waldo Emerson | “Finish each day and be done with it. You have... |


### - Exercise 2
Document your data set generated with the information in the different Kaggle files in a Word document.

Further information

As an example of what is requested, you can consult this link:

-> https://www.kaggle.com/datasets/vivovinco/20212022-football-team-stats .

### Context

The IBEX 35 (Spanish Stock Market Index) is the benchmark stock market index of the Madrid Stock Exchange. This dataset contains the share prices of the IBEX 35 companies.
The data was collected from the Madrid Stock Exchange website on 01/11/2022.

### Content

35 rows and 9 columns. 

Column descriptions:

- Name: Company name.
- Last: Latest share price (in €).
- % Dif.: Percentage difference between the current price and the opening price.
- High: Highest price of the current session.
- Low: Lowest price of the current session.
- Volume: Number of shares traded.
- Turnover (€ Thousands): Value of the shares traded (in thousand €).
- Date: Date of the session.
- Time: Time of day of the quote. 'Close' means the values were captured after the session had closed.
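
One way to check how Volume and Turnover relate is to divide the turnover by the volume: if turnover is the value of the shares traded, the result is an implied average traded price and should fall between the session's low and high. The sketch below does this for three rows copied from the scraped table; `implied_average_price` is an illustrative helper name.

```python
# (name, volume in shares, turnover in thousand EUR, session low, session high)
# Values copied from the scraped table above.
rows = [
    ("ACCIONA", 73453, 13511.55, 181.1, 185.8),
    ("ACERINOX", 831749, 7467.44, 8.908, 9.046),
    ("ACS", 431238, 11240.13, 25.72, 26.29),
]


def implied_average_price(volume, turnover_thousands):
    """Turnover is given in thousand EUR, so scale it back to EUR before dividing."""
    return turnover_thousands * 1000 / volume


for name, volume, turnover, low, high in rows:
    avg = implied_average_price(volume, turnover)
    inside = low <= avg <= high
    print(f"{name}: implied average price {avg:.3f} EUR, between low and high: {inside}")
```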

### Acknowledgements

Data extracted from: https://www.bolsamadrid.es/ing/aspx/Mercados/Precios.aspx?indice=ESI100000000

## References:

BeautifulSoup: https://j2logo.com/python/web-scraping-con-python-guia-inicio-beautifulsoup/

Selenium: https://realpython.com/modern-web-automation-with-python-and-selenium/

https://www.scrapingbee.com/blog/selenium-python/

## Conclusions:

- Web scraping requires at least a basic knowledge of HTML.
- The libraries we used are very powerful tools that allow us to get data from the web.
- Since Selenium and BeautifulSoup give the same result, I prefer BeautifulSoup, which I found easier and more user-friendly.
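
The claim that both libraries give the same result can be checked programmatically rather than by eye. The following is a minimal sketch, assuming the two DataFrames hold text and that differences in surrounding whitespace should be ignored; `same_table` is an illustrative helper, and the two small frames are stand-ins for the scraped ones.

```python
import pandas as pd


def same_table(left, right):
    """True when both frames have the same columns and the same trimmed text values."""
    if list(left.columns) != list(right.columns):
        return False

    def trim(df):
        # Compare as stripped strings and ignore the original index labels.
        return df.apply(lambda col: col.astype(str).str.strip()).reset_index(drop=True)

    return trim(left).equals(trim(right))


# Stand-in data: the second frame differs only by a trailing space.
from_bs = pd.DataFrame({"Author": ["Mark Twain"], "Quote": ["Good friends"]})
from_selenium = pd.DataFrame({"Author": ["Mark Twain "], "Quote": ["Good friends"]})
different = pd.DataFrame({"Author": ["Dr. Seuss"], "Quote": ["Good friends"]})

print(same_table(from_bs, from_selenium))
print(same_table(from_bs, different))
```

In the notebook this could be called as `same_table(df_quotes_1, table2)` or `same_table(df_bolsaMadrid_1, df_bolsaMadrid_2)`.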