$~$

# IT Academy - Data Science Itinerary



$~$

$~$

In this project we use three of the most widely used web scraping libraries: **BeautifulSoup**, **Selenium** and **Scrapy**.

$~$

*This project is divided into three parts. In the first two we use BeautifulSoup and Selenium to extract information from the Madrid Stock Exchange website. The aim is to test each library on the same task and compare how efficient they are.*

$~$

*In the third part we use the Scrapy library to obtain 190 articles from the website of the newspaper El País (English edition).*

$~$

## S12 T12: Web Scraping
___


In [1]:
#importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup



from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from time import sleep


import scrapy
from scrapy.crawler import CrawlerProcess
import os

import warnings
warnings.filterwarnings("ignore")

$~$
___
####  Exercise 1

$~$

Scrape a page on the [Madrid Stock Exchange](https://www.bolsamadrid.es) using BeautifulSoup and Selenium.

$~$
___

$~$

In this first part of the project we are going to scrape the website of the Madrid Stock Exchange. Specifically, we are going to extract the [prices of the session](https://www.bolsamadrid.es/ing/aspx/Mercados/Precios.aspx?indice=ESI100000000) of the day and store them in a Pandas dataframe. To do this we are going to use two libraries: first **BeautifulSoup**, then **Selenium**.

$~$

The following image shows the two **keys** we used to carry out the exercise: the **table id** and the **XPath** of the table. With them we were able to locate the table and extract all of its contents.

$~$

+ Note the **id** and the **XPath** that we are going to use to scrape the table:
<img src="image.png" alt="alt_text" align="center" width="650" height="320" />
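
To make the role of the table id concrete, here is a minimal sketch that runs on a small inline HTML snippet instead of the live page. The markup in `SAMPLE_HTML` and the helper `table_to_records` are hypothetical and only stand in for the real page, whose structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the real page markup.
SAMPLE_HTML = """
<table id="other"><tr><td>ignored</td></tr></table>
<table id="ctl00_Contenido_tblAcciones">
  <tr><th>Name</th><th>Last</th></tr>
  <tr><td>ACCIONA</td><td>189.2</td></tr>
  <tr><td>ACERINOX</td><td>11.68</td></tr>
</table>
"""


def table_to_records(html, table_id):
    """Return the data rows of the table with the given id as a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    # The id is unique in the page, so it selects exactly one table.
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"no table with id {table_id!r}")
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells
            records.append(dict(zip(header, cells)))
    return records


records = table_to_records(SAMPLE_HTML, "ctl00_Contenido_tblAcciones")
print(records)
```

The second table in the snippet is ignored because only the id is used to select the table.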




In [2]:
#define the url to scrape
url = "https://www.bolsamadrid.es/ing/aspx/Mercados/Precios.aspx?indice=ESI100000000"


$~$

+ Scraping using BeautifulSoup:

$~$

In [3]:
#request the url
web = requests.get(url)
#stop early if the request failed
web.raise_for_status()
#init soup
soup = BeautifulSoup(web.content, "html.parser")
#find the table we want to scrape by its id
my_table = soup.find("table", {"id": "ctl00_Contenido_tblAcciones"})
#find the rows of the table
rows = my_table.find_all("tr")
#keep the column names (the header cells)
columns_name = [name.text for name in my_table.find_all("th")]

#define an empty dataframe with columns_name
df1 = pd.DataFrame(columns=columns_name)

#add the values to our dataframe
for row in rows:
    cols = [x.text for x in row.find_all("td")]
    #the header row has no <td> cells, so it is skipped
    if len(cols) != 0:
        df1.loc[len(df1.index)] = cols

#show the table
display(df1)

#save to csv
df1.to_csv('data1.csv')

| | Name | Last | % Dif. | High | Low | Volume | Turnover (€ Thousands) | Date | Time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ACCIONA | 189.2 | -0.16 | 190.0 | 187.4 | 20137 | 3808.69 | 07/06/2022 | 12:25:10 |
| 1 | ACERINOX | 11.68 | -0.34 | 11.76 | 11.605 | 823017 | 9619.55 | 07/06/2022 | 12:25:15 |
| 2 | ACS | 26.79 | -0.22 | 26.91 | 26.66 | 144000 | 3862.44 | 07/06/2022 | 12:24:38 |
| 3 | AENA | 141.6 | -0.32 | 142.1 | 141.35 | 20318 | 2879.65 | 07/06/2022 | 12:24:57 |
| 4 | ALMIRALL | 10.39 | 0.39 | 10.82 | 10.36 | 403502 | 4256.16 | 07/06/2022 | 12:21:42 |
| 5 | AMADEUS | 57.2 | -0.97 | 57.64 | 56.78 | 84981 | 4863.99 | 07/06/2022 | 12:24:53 |
| 6 | ARCELORMIT. | 30.695 | -0.34 | 30.815 | 30.45 | 79819 | 2443.79 | 07/06/2022 | 12:25:05 |
| 7 | B.SANTANDER | 3.015 | -0.76 | 3.042 | 3.009 | 5023780 | 15185.24 | 07/06/2022 | 12:25:17 |
| 8 | BA.SABADELL | 0.851 | 0.07 | 0.856 | 0.8452 | 7015644 | 5970.51 | 07/06/2022 | 12:25:06 |
| 9 | BANKINTER | 5.932 | -0.07 | 5.964 | 5.91 | 319594 | 1899.05 | 07/06/2022 | 12:25:06 |
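
Every cell scraped with `.text` is a string, so the dataframe needs a type conversion before any numerical analysis. The following is a minimal sketch on two rows copied from the output above. It assumes that the numbers carry no thousands separators and that dates are day-first, as they appear in the printed sample; the live page may format them differently. `convert_types` is a hypothetical helper name.

```python
import pandas as pd

# Two rows copied from the scraped output, kept as strings on purpose.
raw = pd.DataFrame({
    "Name": ["ACCIONA", "ACERINOX"],
    "Last": ["189.2", "11.68"],
    "% Dif.": ["-0.16", "-0.34"],
    "Volume": ["20137", "823017"],
    "Date": ["07/06/2022", "07/06/2022"],
    "Time": ["12:25:10", "12:25:15"],
})


def convert_types(df):
    """Return a copy with numeric columns and a single timestamp column."""
    out = df.copy()
    for col in ["Last", "% Dif.", "Volume"]:
        out[col] = pd.to_numeric(out[col])
    # Assumption: day-first dates (dd/mm/yyyy), as in the printed sample.
    out["Timestamp"] = pd.to_datetime(
        out["Date"] + " " + out["Time"], format="%d/%m/%Y %H:%M:%S"
    )
    return out.drop(columns=["Date", "Time"])


typed = convert_types(raw)
print(typed.dtypes)
```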


$~$

+ Scraping using Selenium:

$~$

In [4]:
#rows XPath
row_xpath = '//*[@id="ctl00_Contenido_tblAcciones"]/tbody/tr'

#header XPath
header_xpath = '//*[@id="ctl00_Contenido_tblAcciones"]/tbody/tr[1]/th'


#configure the webdriver
options = webdriver.FirefoxOptions()
#hide the GUI (options.headless is deprecated in Selenium 4)
options.add_argument("-headless")

#launch the browser, set the window size and load the URL
driver = webdriver.Firefox(options=options)
driver.set_window_size(1920, 1080)
driver.get(url)


#keep the column names
columns_name = [name.text for name in driver.find_elements(by=By.XPATH, value=header_xpath)]

#define an empty dataframe with columns_name
df = pd.DataFrame(columns=columns_name)

#keep the rows
for row in driver.find_elements(by=By.XPATH, value=row_xpath):
    #find_elements_by_tag_name is no longer available in recent Selenium 4 releases
    cells = row.find_elements(by=By.TAG_NAME, value='td')
    data = [cell.text for cell in cells]
    #the header row has no <td> cells, so it is skipped
    if len(data) != 0:
        df.loc[len(df.index)] = data

#close the browser and end the session
driver.quit()

#show the table
display(df)

#save to csv
df.to_csv('data2.csv')

| | Name | Last | % Dif. | High | Low | Volume | Turnover (€ Thousands) | Date | Time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ACCIONA | 189.2 | -0.16 | 190.0 | 187.4 | 20137 | 3808.69 | 07/06/2022 | 12:25:10 |
| 1 | ACERINOX | 11.68 | -0.34 | 11.76 | 11.605 | 823017 | 9619.55 | 07/06/2022 | 12:25:15 |
| 2 | ACS | 26.79 | -0.22 | 26.91 | 26.66 | 144000 | 3862.44 | 07/06/2022 | 12:24:38 |
| 3 | AENA | 141.6 | -0.32 | 142.1 | 141.35 | 20318 | 2879.65 | 07/06/2022 | 12:24:57 |
| 4 | ALMIRALL | 10.39 | 0.39 | 10.82 | 10.36 | 403502 | 4256.16 | 07/06/2022 | 12:21:42 |
| 5 | AMADEUS | 57.2 | -0.97 | 57.64 | 56.78 | 84981 | 4863.99 | 07/06/2022 | 12:24:53 |
| 6 | ARCELORMIT. | 30.695 | -0.34 | 30.815 | 30.45 | 79819 | 2443.79 | 07/06/2022 | 12:25:05 |
| 7 | B.SANTANDER | 3.015 | -0.76 | 3.042 | 3.009 | 5023780 | 15185.24 | 07/06/2022 | 12:25:17 |
| 8 | BA.SABADELL | 0.851 | 0.07 | 0.856 | 0.8452 | 7015644 | 5970.51 | 07/06/2022 | 12:25:06 |
| 9 | BANKINTER | 5.932 | -0.07 | 5.964 | 5.91 | 319594 | 1899.05 | 07/06/2022 | 12:25:06 |


$~$
___
####  Exercise 2

$~$

Document the dataset you generated in a Word file, including the information that the different Kaggle files have.


$~$
___

**IBEX 35 prices**

The IBEX 35 (Spanish Stock Market Index) is the reference stock market index of the Madrid Stock Exchange and is made up of 35 companies. This dataset contains the share prices of the IBEX 35 companies. The data was collected from the Madrid Stock Exchange website on 31/05/2022 at 19:05.

Contents

The columns of this dataset:
+ Name: Company name.
+ Last: Last recorded price.
+ % Diff.: Difference between the last registered price and the current one, in %.
+ Maximum: Maximum price reached.
+ Minimum: Minimum price reached.
+ Volume: Total number of shares traded.
+ Cash: Total value traded (thousands of €).
+ Date: Date of registration.

Prices are expressed in euros and cash in thousands of euros. The volume and cash for each security include all operations carried out until the close of the trading session.

$~$
___
####  Exercise 3

$~$

Choose a web page of your choice and scrape it using the Scrapy library.

$~$
___

$~$

For this part, we are going to collect the last 195 press articles from the economy section of the newspaper **El País** ([English edition](https://english.elpais.com/)).

$~$

For each article we are going to save:
$~$

+ **url** = the link where the article can be found
+ **date** = the date the article was published
+ **title** = the headline of the article
+ **author** = the name of the journalist who wrote the article
+ **text_art** = the full text of the article

$~$

In [5]:
class ArticleSpider(scrapy.Spider):
    name = "articles"

    start_urls = ['https://english.elpais.com/economy-and-business/']

    def parse(self, response):
        #follow the link of every article listed on the page
        article_links = response.xpath("//header/h2/a/@href")
        yield from response.follow_all(article_links, self.parse_article)

        #follow the pagination link, if there is one
        next_page = response.xpath("/html/body/div/main/div/div/a/@href").get()
        if next_page is not None:
            #response.follow resolves relative links against the current URL
            yield response.follow(next_page, self.parse)

    def parse_article(self, response):
        def extract_first(query):
            return response.xpath(query).get(default='').strip()

        def extract_all(query):
            return response.xpath(query).getall()

        yield {
            'url': response.url,
            'date': extract_first('//*[@id="article_date_p"]/text()'),
            'title': extract_first('//header/div/h1/text()'),
            'author': extract_first('/html/body/div/article/div/div/div/a/text()'),
            'text_art': extract_all('/html/body/div/article/div/p/text()'),
        }


FILE_NAME = 'data.csv'
SETTINGS = {
    #export the scraped items to a CSV file
    #(FEEDS replaces the deprecated FEED_FORMAT and FEED_URI settings)
    'FEEDS': {FILE_NAME: {'format': 'csv'}},
    'DOWNLOAD_DELAY': 1,
}


In [None]:
process = CrawlerProcess(SETTINGS)
process.crawl(ArticleSpider) 
process.start()
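
Pagination links in a page are often relative, so joining them to the site root by plain string concatenation can produce malformed URLs. The sketch below shows how a link is resolved against the page it was found on, using the standard library's `urljoin`. The `href` values are hypothetical examples, not links taken from the real site, and `absolute_url` is a hypothetical helper.

```python
from urllib.parse import urljoin

PAGE_URL = "https://english.elpais.com/economy-and-business/"


def absolute_url(page_url, href):
    """Resolve a link found on page_url; return None when there is no link."""
    if not href:
        return None
    return urljoin(page_url, href)


# Hypothetical href values, as they might appear in a pagination element.
print(absolute_url(PAGE_URL, "/economy-and-business/2/"))
print(absolute_url(PAGE_URL, "2/"))
print(absolute_url(PAGE_URL, None))

# Plain concatenation with the site root doubles the slash:
print("https://english.elpais.com/" + "/economy-and-business/2/")
```

Returning `None` for a missing link also gives the caller a natural stopping condition on the last page.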


In [7]:
df=pd.read_csv('data.csv')
df.shape

(297, 5)

In [8]:
df.sort_values(by="date").head()


| | url | date | title | author | text_art |
|---|---|---|---|---|---|
| 44 | https://english.elpais.com/culture/2022-06-01/... | 01 Jun 2022 | Johnny Depp vs Amber Heard: The form the jury ... | Miguel Jiménez | The contentious , is now in the hands of the j... |
| 52 | https://english.elpais.com/science-tech/2022-0... | 01 Jun 2022 | Microbe study sheds light on a critical step i... | Nuño Domínguez | A research study on a microscopic microbe foun... |
| 53 | https://english.elpais.com/international/2022-... | 01 Jun 2022 | Colombia’s Trump, Rodolfo Hernández: ‘Ideally,... | Sally Palomino | Rodolfo Hernández cannot hide his sexism. In a... |
| 54 | https://english.elpais.com/international/2022-... | 01 Jun 2022 | Resistance on rails: How Ukraine’s biggest emp... | Luis de Vega (Enviado Especial) | Ukraine’s national railway company, Ukrzalizni... |
| 218 | https://english.elpais.com/international/2022-... | 01 Jun 2022 | Resistance on rails: How Ukraine’s biggest emp... | Luis de Vega (Enviado Especial) | Ukraine’s national railway company, Ukrzalizni... |
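
Rows 54 and 218 show the same article, so the crawl has stored at least one article twice. A minimal sketch of removing such repeats by URL with pandas; the small dataframe is made-up illustration data, not taken from `data.csv`.

```python
import pandas as pd

# Made-up illustration data: the first and third rows share the same URL.
articles = pd.DataFrame({
    "url": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ],
    "title": ["A", "B", "A"],
})

# Keep the first occurrence of every URL and renumber the rows.
unique_articles = articles.drop_duplicates(subset="url").reset_index(drop=True)
print(len(articles), "->", len(unique_articles))
```

On the scraped data the same call would be `df.drop_duplicates(subset="url")`.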


$~$

+ We print an article to see the result of the extraction:

$~$

In [10]:
print(df["title"][44])
print("-"*10)
print(df["text_art"][44])

Johnny Depp vs Amber Heard: The form the jury will use to reach its verdict
----------
The contentious , is now in the hands of the jury. The seven-member panel heard the closing arguments on May 27, and met to deliberate for the first time. Since then, it has concluded two days of deliberations without reaching a decision. The jury will resume deliberations on Wednesday morning, but no one knows how long it will take the jury to reach a verdict.,As the end of the trial draws near, people have been waiting in huge lines and even camping overnight outside the courthouse to secure a seat in one of the most-talked-about cases in recent history. It is not a criminal case; the jury will not rule on the allegations of abuse and assault that have been raised in the trial. Instead, they will focus on the subject of the lawsuit: defamation.,Depp is suing Heard for $50 million for defaming him in an ,in 2018, in which she described herself as “a public figure representing domestic abuse.” Althou
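
In the printed text the paragraphs run together with bare commas (for example `decision.,As the end`), which is consistent with a list of strings being written into a single CSV cell. A minimal sketch, assuming one clean string per article is wanted, that joins the fragments before they are stored; `join_paragraphs` is a hypothetical helper and the sample fragments are made up.

```python
def join_paragraphs(fragments):
    """Join text fragments into one string with normalized whitespace."""
    # Collapse runs of whitespace inside every fragment.
    cleaned = (" ".join(fragment.split()) for fragment in fragments)
    # Drop fragments that were empty or whitespace only.
    return " ".join(fragment for fragment in cleaned if fragment)


sample = ["  The jury  ", "", "met\nagain. "]
print(join_paragraphs(sample))
```

In the spider this could be applied to the list returned by `extract_all` before yielding the item.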

$~$
___
####  Conclusions

$~$




___
$~$

+ There are several libraries for scraping websites. Here we have reviewed three of the most common ones: **BeautifulSoup**, **Selenium** and **Scrapy**.

+ Using these libraries requires at least a basic knowledge of HTML.

+ We achieved the same result with **BeautifulSoup** and **Selenium**. However, **BeautifulSoup** was the more efficient choice for this task, because its simpler handling and gentler learning curve make it the more user-friendly tool.

+ The articles we obtained could serve as data for an **NLP** project.
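
The statement that both scrapers produced the same table can be checked by comparing the two saved files row by row. A minimal sketch with the standard library; `read_rows` and `differing_rows` are hypothetical helper names.

```python
import csv


def read_rows(path):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


def differing_rows(path_a, path_b):
    """Return (row_number, row_a, row_b) for every row that differs."""
    rows_a, rows_b = read_rows(path_a), read_rows(path_b)
    length = max(len(rows_a), len(rows_b))
    # Pad the shorter file with None so that missing rows are reported too.
    rows_a += [None] * (length - len(rows_a))
    rows_b += [None] * (length - len(rows_b))
    return [(i, a, b) for i, (a, b) in enumerate(zip(rows_a, rows_b)) if a != b]
```

`differing_rows("data1.csv", "data2.csv")` returns an empty list when the files match. Since prices change during the session, two scrapes taken at different moments can legitimately differ.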

___
$~$

####  *References*:

+ [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/) 
+ [Selenium Example Explained](https://selenium-python.readthedocs.io/getting-started.html#simple-usage) 
+ [Learn Scrapy](https://www.tutorialspoint.com/scrapy/index.htm) 

$~$
___