# A very brief intro to web scraping

Let's start with a [video](https://www.youtube.com/watch?v=Ct8Gxo8StBU).

- A lot of data isn't accessible through data sets or APIs, but it may still exist on the internet as web pages. One way to access that data without waiting for the provider to create an API is a technique called web scraping.

- Web scraping loads a web page into Python so we can extract the information we want. We can then work with the data using standard analysis tools like `pandas` and `numpy`.

- Before we can do web scraping, we need to understand the structure of the web page we're working with and then find a way to extract parts of that structure in a manner that makes sense.

- We'll use the `requests` library often as we learn about web scraping; it lets us download a web page. We'll also use the `BeautifulSoup` library (installed as `beautifulsoup4` and imported from `bs4`) to extract the relevant parts of the web page.

## Pandas and HTML tables

The pandas `read_html()` function is a quick and convenient way to turn an HTML table into a pandas DataFrame. This function can be useful for quickly incorporating tables from various websites without figuring out how to scrape the site’s HTML. However, there can be some challenges in cleaning and formatting the data before analyzing it.

<img src="https://pbpython.com/images/html-to-pandas-header.png" width="400">

You can find a tutorial [here](https://pbpython.com/pandas-html-table.html).
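As a minimal sketch of what `read_html()` returns, the example below parses a small, made-up table from a string instead of a live URL. The table contents and the names `toy_html`, `toy_tables`, and `toy_df` are illustrative assumptions. Wrapping the string in `StringIO` avoids the deprecation warning that recent pandas versions emit for literal HTML, and the call needs a parser backend such as `lxml` to be installed.

```python
from io import StringIO

import pandas as pd

# A made-up table, used only to show the shape of the result.
toy_html = """
<table>
  <thead>
    <tr><th>Minister</th><th>Period</th></tr>
  </thead>
  <tbody>
    <tr><td>Ana</td><td>2001-2003</td></tr>
    <tr><td>Luis</td><td>2003-2006</td></tr>
  </tbody>
</table>
"""

# read_html always returns a list, with one DataFrame per <table> it finds.
toy_tables = pd.read_html(StringIO(toy_html))
toy_df = toy_tables[0]

print(len(toy_tables))
print(toy_df)
```

Because the result is a list, a real page with several tables has to be indexed (for example `MEF[1]`) to pick the table we want.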

In [15]:
import pandas as pd 
URL_MEF = "https://es.wikipedia.org/wiki/Anexo:Ministros_de_Econom%C3%ADa_del_Per%C3%BA"
MEF = pd.read_html(URL_MEF) 

In [None]:
MEF[1]

In [None]:
MEF[1] # Starting in 1969

In [None]:
df = MEF[1] 
df

In [None]:
import pandas as pd

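# The header is nested, so df["Periodo"] is a DataFrame with a "Periodo" sub-column.
columna = df["Periodo"]
# Use the .str accessor so " a " is replaced inside each string
# (DataFrame.replace only matches whole cell values), then split on "-".
columna_clean = columna["Periodo"].str.replace(" a ", "-", regex=False).str.split("-")
columna_clean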

In [None]:
df[['Inicio', 'Fin']] = columna_clean.apply(pd.Series)
df.head(5)
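The separator clean-up and split can be tried on a toy Series first. This is a sketch under assumptions: the two period strings are made up to mimic the two separator styles (`" a "` and `"-"`), and `periods` and `parts` are hypothetical names. It also uses `expand=True`, which returns the two columns directly and is an alternative to `.apply(pd.Series)`.

```python
import pandas as pd

# Made-up period strings in the two shapes the cleaning step expects.
periods = pd.Series([
    "3 de octubre de 1968 a 1 de marzo de 1969",
    "1 de marzo de 1969-2 de enero de 1974",
])

# Normalise the separator inside each string, then split into two columns.
parts = (
    periods.str.replace(" a ", "-", regex=False)
    .str.split("-", n=1, expand=True)
)
parts.columns = ["Inicio", "Fin"]

print(parts)
```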

In [None]:
import locale
# Set the locale to Spanish so month names parse with %B
# (the es_ES.UTF-8 locale must be installed on the system)
locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')

In [None]:
# Convert the 'Inicio' (start) column to datetime
df['Inicio'] = pd.to_datetime(df['Inicio'], format='%d de %B de %Y', errors='coerce', dayfirst=True)
# Convert the 'Fin' (end) column to datetime
df['Fin'] = pd.to_datetime(df['Fin'], format='%d de %B de %Y', errors='coerce', dayfirst=True)
df.head(5)
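`locale.setlocale` raises an error when the requested locale is not installed, so a locale-independent fallback can be handy. The sketch below is an alternative to the `%B`-based parsing, not the method used in this notebook; `MONTHS_ES` and `parse_spanish_date` are hypothetical names, and the function assumes dates shaped like `"3 de octubre de 1968"`.

```python
import pandas as pd

# Spanish month names mapped to month numbers.
MONTHS_ES = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}


def parse_spanish_date(text):
    """Parse strings like '3 de octubre de 1968'; return NaT otherwise."""
    try:
        day, month_name, year = text.strip().lower().split(" de ")
        return pd.Timestamp(year=int(year), month=MONTHS_ES[month_name], day=int(day))
    except (AttributeError, KeyError, ValueError):
        # Non-strings, unknown month names, and unexpected shapes end up here.
        return pd.NaT


print(parse_spanish_date("3 de octubre de 1968"))
print(parse_spanish_date("actualidad"))
```

It could be applied to a column with `df['Inicio'].apply(parse_spanish_date)`, which mirrors the `errors='coerce'` behaviour by returning `NaT` for anything it cannot parse.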

In [None]:
df["Titular"]["Nombre"].head(5)

In [None]:
# Compute each minister's time in office, in days
df['Duración'] = (df['Fin'] - df['Inicio']).dt.days
df['Identificador'] = df["Titular"]["Nombre"] + " (" + df['Inicio'].dt.year.astype(str) + ")"

df.head(5)
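The duration step can be checked on a tiny made-up frame. The names and dates below are illustrative assumptions; the second row has no end date, which is what `errors='coerce'` produces when a value cannot be parsed.

```python
import pandas as pd

# Made-up rows: the second one has a missing end date (NaT).
toy = pd.DataFrame({
    "Nombre": ["Ministro A", "Ministro B"],
    "Inicio": pd.to_datetime(["2020-01-01", "2020-01-31"]),
    "Fin": pd.to_datetime(["2020-01-31", None]),
})

# Subtracting two datetime columns gives timedeltas; .dt.days extracts whole days.
toy["Duración"] = (toy["Fin"] - toy["Inicio"]).dt.days

# Rows with a missing date get NaN, so drop them before plotting.
toy_complete = toy.dropna(subset=["Duración"])

print(toy)
```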

In [None]:
df.tail(5)

In [38]:
import plotly.graph_objects as go

In [None]:
# Create the horizontal waterfall chart
fig = go.Figure(go.Waterfall(
    name="Días",
    orientation="h",  # Horizontal orientation
    measure=["relative" for _ in df['Duración']],
    y=df['Identificador'],  # Use the unique identifier on the Y axis
    textposition="outside",
    text=[f"{val} días" for val in df['Duración']],
    x=df['Duración'],  # Durations on the X axis
    connector={"line":{"color":"rgb(63, 63, 63)"}},
))

# Configure the chart layout
fig.update_layout(
    title="Duración en el Cargo de los Ministros de Economía del Perú",
    yaxis_title="Ministros",
    xaxis_title="Duración en el Cargo (días)",
    showlegend=True
)

# Show the chart
fig.show()


### Let's use the BCRP API

In [None]:
BCRP_URL = "https://estadisticas.bcrp.gob.pe/estadisticas/series/trimestrales/resultados/PN02526AQ/html/2007-1/2024-4/"
BCRP = pd.read_html(BCRP_URL) 
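The URL above appears to encode the frequency, the series code, the output format, and the start and end quarters. That reading is an assumption based only on this one URL, and `build_bcrp_url` is a hypothetical helper, not part of any BCRP library. A small sketch for rebuilding the same address from its parts:

```python
def build_bcrp_url(series_code, start, end, frequency="trimestrales", fmt="html"):
    """Assemble a results URL following the pattern seen in this notebook."""
    base = "https://estadisticas.bcrp.gob.pe/estadisticas/series"
    return f"{base}/{frequency}/resultados/{series_code}/{fmt}/{start}/{end}/"


# Reproduce the address used in this notebook.
url = build_bcrp_url("PN02526AQ", "2007-1", "2024-4")
print(url)
```

Only the combination shown in this notebook is known to work; other series codes, frequencies, or formats would need to be checked against the BCRP site.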


In [None]:
BCRP_data = BCRP[1]
BCRP_data

In [None]:
BCRP_data.columns

In [None]:
import matplotlib.pyplot as plt

# Create the line chart
plt.figure(figsize=(10, 5))
plt.plot(BCRP_data['Fecha'], BCRP_data["Producto bruto interno por tipo de gasto (variaciones porcentuales reales anualizadas) - PBI"], marker='o')
plt.title('Producto Bruto Interno por Tipo de Gasto (Variaciones Porcentuales Reales Anualizadas)')
plt.xlabel('Fecha')
plt.ylabel('Variación Porcentual del PBI')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()


## Web page structure

Web pages use HyperText Markup Language (HTML). HTML isn't a programming language like Python. It's a markup language with its own syntax and rules. When a web browser like Chrome or Firefox downloads a web page, it reads the HTML to determine how to render and display it.

Here's the HTML for a very simple web page:

```html
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <p> Here is some simple content for this page </p>
    </body>
</html>
```

HTML consists of tags. We open a tag like this:

```html
<p>
```

We close a tag like this:

```html
</p>
```

Anything in between the opening and closing of a tag is the content of that tag. We can nest tags to create complex formatting rules. Here's an example:

```html
<p><b>This is a bold text</b></p>
```

The `b` tag bolds the text inside it, and the `p` tag creates a new paragraph. The HTML above will display as a bold paragraph because the `b` tag is inside the `p` tag. In other words, the `b` tag is nested within the `p` tag.

HTML documents contain a few major sections. The `head` section contains information that's useful to the web browser that's rendering the page. (The user doesn't see it.) The `body` section contains the bulk of the content you will see in your browser.

Different tags have different purposes. For example, the `title` tag tells the browser what to display at the top of your tab. The `p` tag indicates that the content inside it is a single paragraph.
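To see nesting the way a parser does, here is a small sketch using only Python's standard-library `html.parser`. The class name `TagLogger` is an illustrative assumption; it records each opening tag, closing tag, and piece of text, indented by how deeply it is nested.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record tags and text, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1
        self.events.append("  " * self.depth + f"</{tag}>")

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append("  " * self.depth + data.strip())


logger = TagLogger()
logger.feed("<p><b>This is a bold text</b></p>")
print("\n".join(logger.events))
```

The indentation in the output makes it visible that the text sits inside `b`, which in turn sits inside `p`.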

Let's start with a very [simple](https://dataquestio.github.io/web-scraping-pages/simple.html) web page.

In [5]:
import requests

In [None]:
response = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
response.status_code

In [None]:
content = response.text
content

In [None]:
# !pip install beautifulsoup4

## BeautifulSoup

Downloading the page is the easy part. Let's say that we want to get the text in the first paragraph. Now we need to parse the page and extract the information we want.

We'll use the `BeautifulSoup` library to parse the web page with Python. This library allows us to extract tags from an HTML document.

We can think of HTML documents as "trees," and the nested tags as "branches" (similar to a family tree). BeautifulSoup works the same way.

In our simple page, for example, the root of the "tree" is the `html` tag:

```html
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <p> Here is some simple content for this page </p>
    </body>
</html>
```

The `html` tag contains two "branches": `head` and `body`. `head` contains one branch, `title`, and `body` contains one branch, `p`. Drilling down through these branches is one way to parse a web page.

To extract the text inside the `p` tag, we need to get the `body` element, then the `p` element, and then finally the text inside the `p` element.

In [None]:
from bs4 import BeautifulSoup

# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')
parser
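Drilling down through the branches can be done by using tag names as attributes. The sketch below repeats the simple page as an inline string so it runs without a network connection; `simple_html`, `simple_parser`, `body_tag`, and `p_tag` are illustrative names.

```python
from bs4 import BeautifulSoup

# The simple example page from above, inlined as a string.
simple_html = """
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <p> Here is some simple content for this page </p>
    </body>
</html>
"""

simple_parser = BeautifulSoup(simple_html, "html.parser")

# Each attribute access returns the first matching tag one level further down.
body_tag = simple_parser.body
p_tag = body_tag.p

# .text returns the content of the tag; strip() removes the padding spaces.
print(p_tag.text.strip())
```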

## Applying methods

Using the tag type as a property (for example, `parser.body.p`) is not always the best way to parse a document. It's usually better to be more specific by using the `find_all` method. This method finds all occurrences of a tag inside the current element and returns them as a list.

If we only want the first occurrence of an item, we'll need to index the list to get it. Aside from this difference, the process is the same as passing in the tag type as an attribute.

In [None]:
# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")

# Get the paragraph tag.
p = body[0].find_all("p")

# Get the text.
p[0].get_text()
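When only the first match is needed, `find` is a shortcut for indexing the result of `find_all`. A minimal sketch on a made-up two-paragraph snippet (the HTML string and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

two_paragraphs = "<body><p>First</p><p>Second</p></body>"
demo = BeautifulSoup(two_paragraphs, "html.parser")

all_p = demo.find_all("p")     # list-like result with every match
first_p = demo.find("p")       # only the first match
no_table = demo.find("table")  # None when nothing matches

print(len(all_p), first_p.get_text(), no_table)
```

Since `find` returns `None` for a missing tag while `find_all` returns an empty list, checking the result before calling `.get_text()` avoids an `AttributeError` on pages that lack the element.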

Let's look at a larger example.

In [None]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

html_doc

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

In [None]:
soup.title
# <title>The Dormouse's story</title>

In [None]:
soup.title.name

In [None]:
soup.title.string

In [None]:
soup.title.parent.name

In [None]:
soup.p

In [None]:
soup.p['class']


In [None]:
soup.a

In [None]:
soup.find_all('a')

In [None]:
# Extract all the URLs found within a page's <a> tags
for link in soup.find_all('a'):
    print(link.get('href'))
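Once the links are extracted, they can be collected into a pandas DataFrame for analysis. This is a sketch, assuming we want one row per `<a>` tag with its `id`, text, and `href`; the snippet repeats the three links from the example document so it runs on its own, and `links_html`, `links_soup`, `rows`, and `links_df` are illustrative names.

```python
import pandas as pd
from bs4 import BeautifulSoup

# The three links from the example document, inlined as a string.
links_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

links_soup = BeautifulSoup(links_html, "html.parser")

# Build one dictionary per link; .get returns None if an attribute is missing.
rows = [
    {"id": a.get("id"), "text": a.get_text(), "href": a.get("href")}
    for a in links_soup.find_all("a")
]

links_df = pd.DataFrame(rows)
print(links_df)
```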

In [None]:
# Extract all the text from a page
print(soup.get_text())

## Element IDs

HTML allows elements to have IDs. Because an ID should be unique within a page, we can use it to refer to a specific element.

Here's an example page:

```html
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <div>
            <p id="first">
                First paragraph
            </p>
        </div>
        <p id="second">
            <b>
                Second paragraph
            </b>
        </p>
    </body>
</html>
```

HTML uses the `div` tag to create a divider that splits the page into logical units. We can think of a divider as a "box" that contains content. For example, different dividers hold a web page's footer, sidebar, and horizontal menu.

There are two paragraphs on this page. The first is nested inside a `div`. Luckily, the paragraphs have IDs. This means we can access them easily, even though they're nested.
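A minimal sketch of ID lookups on the example page above, inlined as a string so it runs offline (`ids_html`, `ids_parser`, and the other variable names are illustrative):

```python
from bs4 import BeautifulSoup

ids_html = """
<html>
    <head>
        <title> A simple example page </title>
    </head>
    <body>
        <div>
            <p id="first">
                First paragraph
            </p>
        </div>
        <p id="second">
            <b>
                Second paragraph
            </b>
        </p>
    </body>
</html>
"""

ids_parser = BeautifulSoup(ids_html, "html.parser")

# find searches the whole tree, so nesting inside the div does not matter.
first = ids_parser.find(id="first")
second = ids_parser.find("p", id="second")
missing = ids_parser.find(id="third")  # no such ID on the page

# strip=True removes the surrounding whitespace and line breaks.
print(first.get_text(strip=True))
print(second.get_text(strip=True))
print(missing)
```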

In [None]:
soup.find(id="link3")

Here's another toy example.

In [None]:
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')


# Pass in the ID attribute to only get the element with that specific ID.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)