# Web Scraping with Python

### 1. Introduction

Web scraping is a technique for extracting data from websites. Many websites offer a public API for accessing their data, but many others do not. When no API is available, we can use web scraping to extract the data directly from the pages. In this tutorial, we will learn how to scrape data from a website using Python.

### 2. Importing Libraries

We will use the following libraries for this tutorial:
- **requests** - for making HTTP requests in Python
- **BeautifulSoup** - for pulling data out of HTML and XML files
- **pandas** - for data manipulation and analysis

In [10]:
from bs4 import BeautifulSoup
import requests

### 3. Making HTTP Requests in Python

To scrape a website, we first need to send a request to the website's server. The server responds by returning the HTML content of the webpage. We can use the Python requests library to issue a GET request and get that HTML content, like so:


In [11]:
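url = 'https://www.archeologiaviva.it/6475/gli-imperatori-romani/'
response = requests.get(url, timeout=10)  # fail instead of waiting indefinitely
response.raise_for_status()  # stop here on a 4xx/5xx response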
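
A request can fail or return an error page, so it can help to check the response before parsing it. The sketch below shows one way to wrap the call. The `fetch_html` helper, its 10-second `timeout` default, and the optional `session` parameter are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the HTML text of `url`, raising on HTTP errors.

    Hypothetical helper: the name, the 10-second timeout and the
    optional `session` argument are assumptions for illustration.
    """
    http = session or requests           # anything exposing .get(url, timeout=...)
    response = http.get(url, timeout=timeout)
    response.raise_for_status()          # raises requests.HTTPError on 4xx/5xx
    return response.text


# Usage (not executed here, since it needs network access):
# html = fetch_html('https://www.archeologiaviva.it/6475/gli-imperatori-romani/')
```

Passing a `requests.Session()` as `session` would reuse the same connection across several pages, which may matter when scraping more than one URL.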

### 4. Parsing HTML with BeautifulSoup

Now that we have fetched the HTML content from the webpage, we need to parse it to extract the data we need. We can use the Python BeautifulSoup library for parsing HTML documents. In the following example, we parse the HTML content we fetched earlier and print the parsed document:


In [12]:
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly
print(soup)

<!DOCTYPE html>

<!--[if IE 8]><html class="ie8"><![endif]-->
<!--[if IE 9]><html class="ie9"><![endif]-->
<!--[if gt IE 8]><!--> <html lang="it-IT"> <!--<![endif]-->
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="user-scalable=yes, width=device-width, initial-scale=1.0, maximum-scale=1" name="viewport"/>
<title>Gli imperatori romani | </title>
<!--[if lt IE 9]>
	<script src="https://www.archeologiaviva.it/wp-content/themes/voice/js/html5.js"></script>
<![endif]-->
<!-- Global site tag (gtag.js) - Google AdWords: 1020337587 -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=AW-1020337587"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'AW-1020337587');
</script>
<!-- Fine - Aggiunto Livia 20180502 -->
<title>Gli imperatori romani – Archeologia Viva</title>
<meta content="max-image-preview:large" name="robots">
<link hr
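
Printing the whole document is mostly useful as a sanity check. As a small sketch of what the parsed object offers, the example below reads individual elements from a short hand-written HTML string. The string is made up for illustration and is much simpler than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified document used only for this example.
html = (
    "<html><head><title>Gli imperatori romani</title></head>"
    "<body><p>Testo</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")
title = soup.title.get_text()        # text of the <title> tag
first_p = soup.find("p").get_text()  # text of the first <p> tag
print(title)    # Gli imperatori romani
print(first_p)  # Testo
```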

### 5. Getting Data from a Website

Once we have parsed the HTML content of the webpage, we can start extracting data from it. We can select elements by a specific class or id, or by tag name. In the following example, we find all the elements with the class name "entry-content" and keep the first one:

In [13]:
text = soup.find_all(class_ = 'entry-content')[0]
print(text)

<div class="entry-content">
<hr/>
<p><strong>GIULIO CLAUDII</strong></p>
<p>Augusto (27 a.C.-14 d.C.)<br/>
Tiberio (14-34)<br/>
Caligola (37-41)<br/>
Claudio (41-54)<br/>
Nerone (54-68)<br/>
Galba Otone Vitellio (68-69)</p>
<hr/>
<p><strong>FLAVII</strong></p>
<p>Vespasiano (68-79)<br/>
Tito (79-81)<br/>
Domiziano (81-96)<br/>
Nerva ( 96-98)<br/>
Traiano (98-117)</p>
<hr/>
<p><strong>ANTONINI</strong></p>
<p>Adriano (117-138)<br/>
Antonino Pio (138-161)<br/>
Marco Aurelio e Lucio Vero (161-169)<br/>
Marco Aurelio (169-180)<br/>
Commodo (180-193)<br/>
Elvio Pertinace (193)<br/>
Didio Giuliano (193)</p>
<hr/>
<p><strong>SEVERI</strong></p>
<p>Settimio Severo (193-211)<br/>
Caracalla (211-217)<br/>
Macrino (217-218)<br/>
Elagabalo (218-222)<br/>
Alessandro Severo (222-235)<br/>
Massimino il Trace (235-238)<br/>
Gordiano I e II (238)<br/>
Pupieno e Balbino (238)<br/>
Gordiano III (238-249)<br/>
Decio (249-251)<br/>
Treboniano Gallo (251-253)<br/>
Emiliano (253)<br/>
Valeriano e Gallieno (2

Each dynasty name ("dinastia" in Italian) is wrapped in a strong tag, so we can collect the names with the following code. The last strong element is an "NB" note rather than a dynasty, so we drop it:

In [14]:
dinastie = text.find_all('strong')
keys = [dinastia.text.strip() for dinastia in dinastie]
keys = keys[:-1] # Removing NB
print(keys)

['GIULIO CLAUDII', 'FLAVII', 'ANTONINI', 'SEVERI']
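
To see how these lookups behave without fetching the page, the same calls can be tried on a small fragment. The HTML string below is made up for illustration and only imitates the structure of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure of the page.
html = """
<div class="entry-content">
<p><strong>FLAVII</strong></p>
<p>Vespasiano (68-79)<br/>
Tito (79-81)</p>
<p><strong>NB</strong> nota finale</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
content = soup.find(class_="entry-content")   # first match, or None if absent
headings = [tag.get_text(strip=True) for tag in content.find_all("strong")]
print(headings)        # ['FLAVII', 'NB']
print(headings[:-1])   # ['FLAVII']
```

Note that `find` returns `None` when nothing matches, whereas indexing `find_all(...)[0]` raises an `IndexError`, so either form may need a check on pages whose layout can change.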


Next, we have to extract the emperors ("imperatori" in Italian) from the webpage. Each list of emperors sits inside a p tag, so we collect all the p tags. However, we have to be careful: some p tags contain only the name of a dynasty rather than a list of emperors. Since the dynasty names are written in uppercase, we use the string method isupper() to detect and skip them.

In [15]:
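subtext = text.find_all('p')
imperatori = []
for paragraph in subtext:
    # Paragraphs written entirely in uppercase hold a dynasty name, not emperors
    if paragraph.text.isupper():
        continue
    print("----")
    print(paragraph.text)
    imperatori.append(paragraph.text.split("\n"))  # one emperor per line
imperatori = imperatori[:-1]  # drop the final paragraph, which is not a list of emperors
print(imperatori)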



----
Augusto (27 a.C.-14 d.C.)
Tiberio (14-34)
Caligola (37-41)
Claudio (41-54)
Nerone (54-68)
Galba Otone Vitellio (68-69)
----
Vespasiano (68-79)
Tito (79-81)
Domiziano (81-96)
Nerva ( 96-98)
Traiano (98-117)
----
Adriano (117-138)
Antonino Pio (138-161)
Marco Aurelio e Lucio Vero (161-169)
Marco Aurelio (169-180)
Commodo (180-193)
Elvio Pertinace (193)
Didio Giuliano (193)
----
Settimio Severo (193-211)
Caracalla (211-217)
Macrino (217-218)
Elagabalo (218-222)
Alessandro Severo (222-235)
Massimino il Trace (235-238)
Gordiano I e II (238)
Pupieno e Balbino (238)
Gordiano III (238-249)
Decio (249-251)
Treboniano Gallo (251-253)
Emiliano (253)
Valeriano e Gallieno (253-260)
Gallieno (260-268)
Aurelio Claudio (268-270)
Aureliano (270-275)
Claudio Tacito (275-276)
Floriano (276)
Aurelio Probo (276-282)
Caro (282-283)
Numeriano e Carino (283-284)
Carino (284-285)
Diocleziano (284-305)
Massimiano (286-310)
Costanzo Cloro (292-306)
Galerio (292-306)
Massenzio (306-312)
Costantino (306-337)
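
Splitting on "\n" works here because every br tag in the page source happens to be followed by a line break. As a hedged alternative, BeautifulSoup's `stripped_strings` generator yields each text fragment of a tag with the surrounding whitespace removed, so it does not depend on that detail. The fragment below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: note that only one <br/> is followed by a newline.
html = (
    "<div><p><strong>FLAVII</strong></p>"
    "<p>Vespasiano (68-79)<br/>Tito (79-81)<br/>\nDomiziano (81-96)</p></div>"
)
soup = BeautifulSoup(html, "html.parser")

groups = []
for p in soup.find_all("p"):
    if p.get_text().isupper():      # dynasty heading, not a list of emperors
        continue
    groups.append(list(p.stripped_strings))

print(groups)
# [['Vespasiano (68-79)', 'Tito (79-81)', 'Domiziano (81-96)']]
```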


We then store the data in a dictionary that maps each dynasty to its list of emperors. For the last dynasty (index 3), we drop the final element of its list, since it is noise rather than an emperor.

In [16]:
dic = {}
for i, key in enumerate(keys):
    if i == 3:
        # The last list ends with a noisy element, so drop it
        dic[key] = imperatori[i][:-1]
    else:
        dic[key] = imperatori[i]
print(dic)

{'GIULIO CLAUDII': ['Augusto (27 a.C.-14 d.C.)', 'Tiberio (14-34)', 'Caligola (37-41)', 'Claudio (41-54)', 'Nerone (54-68)', 'Galba Otone Vitellio (68-69)'], 'FLAVII': ['Vespasiano (68-79)', 'Tito (79-81)', 'Domiziano (81-96)', 'Nerva ( 96-98)', 'Traiano (98-117)'], 'ANTONINI': ['Adriano (117-138)', 'Antonino Pio (138-161)', 'Marco Aurelio e Lucio Vero (161-169)', 'Marco Aurelio (169-180)', 'Commodo (180-193)', 'Elvio Pertinace (193)', 'Didio Giuliano (193)'], 'SEVERI': ['Settimio Severo (193-211)', 'Caracalla (211-217)', 'Macrino (217-218)', 'Elagabalo (218-222)', 'Alessandro Severo (222-235)', 'Massimino il Trace (235-238)', 'Gordiano I e II (238)', 'Pupieno e Balbino (238)', 'Gordiano III (238-249)', 'Decio (249-251)', 'Treboniano Gallo (251-253)', 'Emiliano (253)', 'Valeriano e Gallieno (253-260)', 'Gallieno (260-268)', 'Aurelio Claudio (268-270)', 'Aureliano (270-275)', 'Claudio Tacito (275-276)', 'Floriano (276)', 'Aurelio Probo (276-282)', 'Caro (282-283)', 'Numeriano e Carino (
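
The pairing of dynasty names with emperor lists can also be written with `zip`. This is a minimal sketch, using shortened, partly made-up lists that stand in for `keys` and `imperatori`; the "NB: nota" entry is invented to play the role of the noisy element.

```python
# Shortened stand-ins for the lists built in the cells above.
keys = ["GIULIO CLAUDII", "FLAVII"]
imperatori = [
    ["Augusto (27 a.C.-14 d.C.)", "Tiberio (14-34)"],
    ["Vespasiano (68-79)", "Tito (79-81)", "NB: nota"],  # invented noisy last entry
]

dic = dict(zip(keys, imperatori))    # pairs each key with the list at the same index
last_key = keys[-1]
dic[last_key] = dic[last_key][:-1]   # drop the noisy element from the last list
print(dic)
```

Since `zip` stops at the shorter sequence, any extra list in `imperatori` without a matching key is silently ignored, and using `keys[-1]` avoids hard-coding the index of the last dynasty.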

### 6. Inserting Data into a Database with Python (mysql.connector)

First of all, we need to establish a connection with the database. In this case, we will use the mysql.connector library to connect to the database. We will then create a cursor object and use it to execute SQL queries.

In [27]:
import getpass
import mysql.connector as mysql

mydb = mysql.connect(
    host = "localhost",
    user = "root",
    password = getpass.getpass("Insert password:")  # prompt without echoing the password
)

The following block checks whether the connection was established and then creates a cursor:

In [28]:
if mydb.is_connected():
    print("Connected to MySQL database")
else:
    print("Connection failed")

cursor = mydb.cursor()

Connected to MySQL database

Next, we create the impero_romano database, dropping it first if it already exists:


In [33]:
import time
cursor.execute("SHOW DATABASES")
databases = cursor.fetchall()
for database in databases:
    if database[0] == 'impero_romano':
        print("Database already exists")
        time.sleep(1)
        cursor.execute("DROP DATABASE impero_romano")
        mydb.commit() # not strictly needed: in MySQL, DDL statements such as DROP DATABASE commit implicitly
        print("Database dropped")
        break
print("Creating database...")
time.sleep(1)
cursor.execute("CREATE DATABASE impero_romano")
mydb.commit()
print("Database created")

Database already exists
Database dropped
Creating database...
Database created
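
MySQL also accepts DROP DATABASE IF EXISTS, which removes the need to loop over the output of SHOW DATABASES. The helper below is a sketch under that assumption, not the notebook's own method: `recreate_database` is a hypothetical name, and it only requires an object with an `execute()` method, such as the cursor created above.

```python
def recreate_database(cursor, name):
    """Drop the database `name` if it exists, then create it empty.

    Hypothetical helper. `cursor` is any object with an execute() method,
    for example a mysql.connector cursor.
    """
    # Database names cannot be passed as query parameters, so validate
    # the name before interpolating it into the statement.
    if not name.isidentifier():
        raise ValueError(f"unsafe database name: {name!r}")
    cursor.execute(f"DROP DATABASE IF EXISTS {name}")
    cursor.execute(f"CREATE DATABASE {name}")


# Usage with the cursor from the cells above (not executed here):
# recreate_database(cursor, "impero_romano")
```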


We then select the database and create the imperatori table, using the emperor's name as the primary key. If the creation fails, most likely because the table already exists, we drop the table and create it again:

In [36]:
cursor.execute("USE impero_romano")
create_table = "CREATE TABLE imperatori (name VARCHAR(255), dinastia VARCHAR(255), regno VARCHAR(255), PRIMARY KEY (name));"
try:
    cursor.execute(create_table)
    print("Table created")
except mysql.Error:
    # Creation failed, most likely because the table already exists: recreate it
    print("Table already exists")
    cursor.execute("DROP TABLE imperatori")
    print("Table dropped")
    cursor.execute(create_table)
    print("Table created")


Table created


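Before inserting, we split each entry into the emperor's name and the years of the reign (the text between the parentheses), and pair them with the lowercase dynasty name:

In [37]:
data = []
for key in dic:
    for imperatore in dic[key]:
        end = imperatore.find("(")  # position of the opening parenthesis
        name = imperatore[:end]  # text before the parenthesis
        date = imperatore[end+1:-1]  # text between the parentheses
        data.append([name.strip(),key.lower(),date])
print(data)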

[['Augusto', 'giulio claudii', '27 a.C.-14 d.C.'], ['Tiberio', 'giulio claudii', '14-34'], ['Caligola', 'giulio claudii', '37-41'], ['Claudio', 'giulio claudii', '41-54'], ['Nerone', 'giulio claudii', '54-68'], ['Galba Otone Vitellio', 'giulio claudii', '68-69'], ['Vespasiano', 'flavii', '68-79'], ['Tito', 'flavii', '79-81'], ['Domiziano', 'flavii', '81-96'], ['Nerva', 'flavii', ' 96-98'], ['Traiano', 'flavii', '98-117'], ['Adriano', 'antonini', '117-138'], ['Antonino Pio', 'antonini', '138-161'], ['Marco Aurelio e Lucio Vero', 'antonini', '161-169'], ['Marco Aurelio', 'antonini', '169-180'], ['Commodo', 'antonini', '180-193'], ['Elvio Pertinace', 'antonini', '193'], ['Didio Giuliano', 'antonini', '193'], ['Settimio Severo', 'severi', '193-211'], ['Caracalla', 'severi', '211-217'], ['Macrino', 'severi', '217-218'], ['Elagabalo', 'severi', '218-222'], ['Alessandro Severo', 'severi', '222-235'], ['Massimino il Trace', 'severi', '235-238'], ['Gordiano I e II', 'severi', '238'], ['Pupieno 
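
The slicing above relies on every entry containing a parenthesis: `str.find` returns -1 when it does not, and the slices then silently produce a wrong name. As a hedged alternative, a regular expression can split the entry and report entries that do not match. The `parse_entry` helper and the `ENTRY` pattern are illustrative names, not part of the original notebook.

```python
import re

# "Name (reign)" with optional spaces around the parentheses.
ENTRY = re.compile(r"^(?P<name>.*?)\s*\((?P<regno>[^)]*)\)\s*$")


def parse_entry(entry):
    """Split 'Name (reign)' into (name, reign); return None if it does not match."""
    match = ENTRY.match(entry)
    if match is None:
        return None
    return match["name"].strip(), match["regno"].strip()


print(parse_entry("Nerva ( 96-98)"))             # ('Nerva', '96-98')
print(parse_entry("Augusto (27 a.C.-14 d.C.)"))  # ('Augusto', '27 a.C.-14 d.C.')
print(parse_entry("senza parentesi"))            # None
```

Unlike the cell above, this version also strips the stray space in entries such as "Nerva ( 96-98)".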

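Finally, we insert each row with a parameterized query, commit, and read the table back to check the result. A row whose insert fails, for example because its name (the primary key) is already in the table, is skipped:

In [39]:
for imperatore in data:
    try:
        cursor.execute("INSERT INTO imperatori (name, dinastia, regno) VALUES (%s, %s, %s)", imperatore)
    except mysql.Error:
        pass  # most likely a duplicate name (primary key); skip the row
mydb.commit()  # make the inserted rows permanent
cursor.execute("SELECT * FROM imperatori;")
result = cursor.fetchall()
for row in result:
    print(row)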

('Adriano', 'antonini', '117-138')
('Alessandro Severo', 'severi', '222-235')
('Antonino Pio', 'antonini', '138-161')
('Augusto', 'giulio claudii', '27 a.C.-14 d.C.')
('Aureliano', 'severi', '270-275')
('Aurelio Claudio', 'severi', '268-270')
('Aurelio Probo', 'severi', '276-282')
('Caligola', 'giulio claudii', '37-41')
('Caracalla', 'severi', '211-217')
('Carino', 'severi', '284-285')
('Caro', 'severi', '282-283')
('Claudio', 'giulio claudii', '41-54')
('Claudio Tacito', 'severi', '275-276')
('Commodo', 'antonini', '180-193')
('Costante e Costanzo', 'severi', '335-350')
('Costantino', 'severi', '306-337')
('Costantino II, Costante e Costanzo', 'severi', '337-340')
('Costanzo', 'severi', '350-361')
('Costanzo Cloro', 'severi', '292-306')
('Decio', 'severi', '249-251')
('Didio Giuliano', 'antonini', '193')
('Diocleziano', 'severi', '284-305')
('Domiziano', 'flavii', '81-96')
('Elagabalo', 'severi', '218-222')
('Elvio Pertinace', 'antonini', '193')
('Emiliano', 'severi', '253')
('Florian