<div style="text-align: center; color: #292929;">
  <h1 style="margin-bottom: 10px;">New GDP Real-Time Dataset</h1>
  <div style="height: 2px; width: 90%; margin: 0 auto; background-color: #292929;"></div>
  <h2>Documentation</h2>
  </div>

<div style="text-align: center; margin-right: 40px;">
  <span style="display: inline-block; margin-right: 10px;">
    <a href="https://github.com/JasonCruz18" target="_blank">
      <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/github/github-original.svg" alt="GitHub" style="width: 24px;">
    </a>
  </span>
  <span style="display: inline-block;">
    <a href="mailto:jj.cruza@up.edu.pe">
      <img src="https://upload.wikimedia.org/wikipedia/commons/4/4e/Mail_%28iOS%29.svg" alt="Email" style="width: 24px;">
    </a>
  </span>
</div>

**Author:** Jason Cruz  
**Last updated:** 08/13/2025  
**Python version:** 3.12  
**Project:** Rationality and Nowcasting on Peruvian GDP Revisions  

---
## 📌 Summary
This notebook documents the step-by-step **construction of datasets** for analyzing **Peruvian GDP revisions** from 2013 to 2024.  
It covers:
1. **Data acquisition** from the Central Reserve Bank of Peru's Weekly Reports (PDF).
2. **Data cleaning** and extraction of GDP tables.
3. **Creation of real-time GDP vintages**.
4. **Preparation of the final revisions dataset**.
5. **Export to SQL** for further analysis.

🌐 **Main Data Source:** [BCRP Weekly Report](https://www.bcrp.gob.pe/publicaciones/nota-semanal.html) (📰 WR, from here on)  
For questions or issues regarding the code, please email [Jason 📨](mailto:jj.cruza@alum.up.edu.pe)  

---

## 🛠️ Libraries

If any of the libraries below are missing, install them with a command like the following example.

In [None]:
#!pip install pdfplumber  # Uncomment this line to install a missing library (os is part of the standard library and needs no installation).
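
To see which libraries are missing before installing anything, a minimal sketch along these lines may help. The list `REQUIRED_MODULES` and the helper `find_missing_modules` are illustrative names, not part of the project. Note that the list holds import names, which can differ from pip package names (for example, `fitz` is installed as `PyMuPDF`, `PIL` as `Pillow`, and `tabula` as `tabula-py`).

```python
import importlib.util

# Import names used in this notebook (assumed list; adjust to your setup).
REQUIRED_MODULES = [
    "selenium", "pygame", "webdriver_manager", "fitz", "PIL", "matplotlib",
    "pdfplumber", "pandas", "tabula", "sqlalchemy", "roman", "psycopg2",
]


def find_missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with `pip`, using the corresponding package name.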

Check the Python environment information:

In [1]:
import sys
import platform

print("🐍 Python Information")
print(f"  Version  : {sys.version.split()[0]}")
print(f"  Compiler : {platform.python_compiler()}")
print(f"  Build    : {platform.python_build()}")
print(f"  OS       : {platform.system()} {platform.release()}")

🐍 Python Information
  Version  : 3.12.1
  Compiler : MSC v.1916 64 bit (AMD64)
  Build    : ('main', 'Jan 19 2024 15:44:08')
  OS       : Windows 10


In [None]:
# 1. PDF downloader
#-------------------------------------------------------------------------------------------------------------------------------

import os  # For file and directory manipulation, for interacting with the operating system
import random  # To generate random numbers
from selenium import webdriver  # For automating web browsers
from selenium.webdriver.common.by import By  # To locate elements on a webpage
from selenium.webdriver.support.ui import WebDriverWait  # To wait until certain conditions are met on a webpage.
from selenium.webdriver.support import expected_conditions as EC  # To define expected conditions
from selenium.common.exceptions import StaleElementReferenceException  # To handle exceptions related to elements on the webpage that are no longer available.
import pygame  # To play the notification sound during downloads
from webdriver_manager.chrome import ChromeDriverManager  # To install a ChromeDriver that matches the installed Chrome version

import shutil # Used for high-level file operations, such as copying, moving, renaming, and deleting files and directories.


# 2. Generate PDF input with key tables
#-------------------------------------------------------------------------------------------------------------------------------

import fitz  # This library is used for working with PDF documents, including reading, writing, and modifying PDFs (PyMuPDF).
import tkinter as tk  # This library is used for creating graphical user interfaces (GUIs) in Python.


# 3. Data cleaning
#-------------------------------------------------------------------------------------------------------------------------------

# 3.1. Brief documentation of issues in the PDF tables

from PIL import Image  # Used for opening, manipulating, and saving image files.
import matplotlib.pyplot as plt  # Used for creating static, animated, and interactive visualizations.

# 3.2. Extracting tables and data cleanup

import pdfplumber  # For extracting text and metadata from PDF files
import pandas as pd  # For data manipulation and analysis
import unicodedata  # For manipulating Unicode data
import re  # For regular expressions operations
from datetime import datetime  # For working with dates and times
import locale  # For locale-specific formatting of numbers, dates, and currencies

# 3.2.1. Table 1. Extraction and cleaning of data from tables on monthly real GDP growth rates.

import tabula  # Used to extract tables from PDF files into pandas DataFrames
from tkinter import Tk, messagebox, TOP, YES, NO  # Used for creating graphical user interfaces
from sqlalchemy import create_engine  # Used for connecting to and interacting with SQL databases

# 3.2.2. Table 2. Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.

import roman  # For converting Roman numerals (datetime is already imported in 3.2)


# 4. Real-time data of Peru's GDP growth rates
#-------------------------------------------------------------------------------------------------------------------------------

import psycopg2  # For interacting with PostgreSQL databases
from sqlalchemy import create_engine, text  # For creating and executing SQL queries using SQLAlchemy

## ⚙️ Initial set-up

Before processing data from new GDP releases, we will:

* **Create necessary folders** for storing inputs, outputs, logs, and screenshots.
* **Connect to the PostgreSQL database** containing GDP revisions datasets.
* **Import helper functions** from `new_gdp_datasets_functions.py`.

**Create necessary folders**

In [4]:
# Define base folder for saving all digital PDFs
digital_pdf = 'digital_pdf'

# Define subfolder for saving the original PDFs as downloaded from the BCRP website
raw_pdf = os.path.join(digital_pdf, 'raw_pdf')

# Define subfolder for saving reduced PDFs containing only selected pages with GDP growth tables (monthly, quarterly, and annual frequencies)
input_pdf = os.path.join(digital_pdf, 'input_pdf')

# Define folder for saving .txt files with download and dataframe record
record = 'record'

# Define folder for saving warning bells. This is for download notifications (see section 1).
alert_track = 'alert_track'

# Create all required folders (if they do not already exist) and confirm creation
for folder in [digital_pdf, raw_pdf, input_pdf, record, alert_track]:
    os.makedirs(folder, exist_ok=True)
    print(f"📂 {folder} created")

📂 digital_pdf created
📂 digital_pdf\raw_pdf created
📂 digital_pdf\input_pdf created
📂 record created
📂 alert_track created


**Connect to the PostgreSQL database**

The function below sets up a connection to the `gdp_revisions_datasets` database in `PostgreSQL`. This jupyter notebook loads its **input data** from that database and stores all of its **output data** there. Once you have been granted access to the server, set the connection parameters listed in the function's docstring.

> 💡 **Tip:** To request permissions, please email [Jason 📨](mailto:jj.cruza@alum.up.edu.pe)  
> ⚠️ **Warning:** Make sure you have set your SQL credentials as environment variables before proceeding.  

In [None]:
def create_sqlalchemy_engine(database="gdp_revisions_datasets", port=5432):
    """
    Create an SQLAlchemy engine to connect to the PostgreSQL database.
    
    Environment Variables Required:
        CIUP_SQL_USER: SQL username
        CIUP_SQL_PASS: SQL password
        CIUP_SQL_HOST: SQL host address

    Args:
        database (str): Name of the database. Default is 'gdp_revisions_datasets'.
        port (int): Port number. Default is 5432.

    Returns:
        engine (sqlalchemy.engine.Engine): SQLAlchemy engine object.
    
    Raises:
        ValueError: If required environment variables are missing.

    Example:
        engine = create_sqlalchemy_engine()
    """
    user = os.environ.get('CIUP_SQL_USER')
    password = os.environ.get('CIUP_SQL_PASS')
    host = os.environ.get('CIUP_SQL_HOST')

    if not all([host, user, password]):
        raise ValueError("❌ Missing environment variables: CIUP_SQL_HOST, CIUP_SQL_USER, CIUP_SQL_PASS")

    connection_string = f"postgresql://{user}:{password}@{host}:{port}/{database}"
    engine = create_engine(connection_string)  # Lazy: the actual connection opens on first use

    print(f"🔗 Engine ready for PostgreSQL database: {database} at {host}:{port}")
    return engine

In [None]:
engine = create_sqlalchemy_engine()
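
A connection string built with plain string formatting can break when the password contains characters such as `@`, `:` or `/`. The sketch below shows one way to escape the credentials with the standard library before building the URL. The helper name `build_postgres_url` and the sample credentials are illustrative assumptions, not part of the project.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port=5432, database="gdp_revisions_datasets"):
    """Build a PostgreSQL URL with percent-encoded user name and password."""
    # quote_plus escapes reserved URL characters (e.g. "@" becomes "%40").
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


# Hypothetical credentials, for illustration only.
print(build_postgres_url("ana", "p@ss:w/rd", "localhost"))
```

The resulting string can be passed to `create_engine` in the same way as the unescaped one.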

**Import helper functions**

> ⚠️ Please review the script `new_gdp_datasets_functions.py`, which contains all the functions required by this _jupyter notebook_. The functions there are ordered according to the sections of this jupyter notebook.

In [2]:
from new_gdp_datasets_functions import *

pygame 2.5.2 (SDL 2.28.3, Python 3.12.1)
Hello from the pygame community. https://www.pygame.org/contribute.html


## 1. PDF Downloader

Our main source for data collection is the [BCRP Weekly Report](https://www.bcrp.gob.pe/publicaciones/nota-semanal.html). The Weekly Report is a weekly publication that the BCRP issues in compliance with article 84 of the Peruvian Constitution and articles 2 and 74 of the BCRP's organic law, which include among the bank's functions the periodic publication of the main national macroeconomic statistics.

Our project requires **two tables** published in each Weekly Report: the table of monthly growth rates of real GDP (12-month percentage changes), and the table of quarterly and annual growth rates of real GDP. These tables are referred to as **Table 1** and **Table 2**, respectively, throughout this jupyter notebook.

### Scraper bot

This section automates the download of the **BCRP Weekly Report PDFs** directly from the official BCRP website.

**What it does:**
1. Opens the official BCRP Weekly Report page.
2. Finds and collects all PDF links.
3. Downloads them in chronological order (oldest to newest).
4. Optionally plays a notification sound every N downloads.
5. Organizes downloaded PDFs into year-based folders.

> 💡 If a CAPTCHA appears, solve it manually in the browser window and re-run the cell.

> 🔁 This script uses webdriver-manager to handle browser drivers automatically (default: Chrome), so you do not need to download ChromeDriver, GeckoDriver, etc. manually. To use a different browser for your replication, modify the `browser` parameter in `init_driver()`.

> 🎵 Place your own MP3 file in the `alert_track` folder for download notifications. Recommended free sources (CC0/public domain):
>  - Pixabay Audio: https://pixabay.com/music/
>  - FreeSound: https://freesound.org/
>  - FreePD: https://freepd.com/

In [None]:
# Run the function to start the scraper bot
pdf_downloader(
    bcrp_url = "https://www.bcrp.gob.pe/publicaciones/nota-semanal.html",
    raw_pdf_folder = raw_pdf,
    download_record_folder = record,
    download_record_txt = 'new_downloaded_pdfs.txt',
    alert_track_folder = alert_track,
    max_downloads = 60
)
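
The chronological download order described above can be illustrated with a short sketch. It assumes that file names follow the `ns-<issue>-<year>.pdf` pattern seen elsewhere in this notebook (for example `ns-08-2017.pdf`); the helper `sort_chronologically` is a hypothetical name and is not the implementation inside `pdf_downloader`.

```python
import re

# Assumed naming pattern: ns-<two-digit issue>-<four-digit year>.pdf
NS_PATTERN = re.compile(r"ns-(\d{2})-(\d{4})\.pdf$")


def sort_chronologically(pdf_names):
    """Sort WR file names (or links ending in them) from oldest to newest."""
    def sort_key(name):
        match = NS_PATTERN.search(name)
        if match is None:
            raise ValueError(f"Unexpected file name: {name}")
        issue, year = match.groups()
        return (int(year), int(issue))  # Year first, then issue number

    return sorted(pdf_names, key=sort_key)


print(sort_chronologically(["ns-23-2019.pdf", "ns-08-2017.pdf", "ns-07-2017.pdf"]))
```

Sorting on the `(year, issue)` pair avoids the pitfall of sorting the raw strings, which would order files by issue number before year.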

The 📰 WR were probably downloaded into a single folder, but we want them sorted by year. The following code moves each WR into a subfolder named after its year of publication. This takes only a moment.

In [None]:
# Get the list of files in the directory
files = os.listdir(raw_pdf)

# Call the function to organize files
organize_files_by_year(raw_pdf)
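
For readers curious about what such a sorting step involves, here is a minimal sketch under stated assumptions: files are named `ns-<issue>-<year>.pdf`, and each file is moved into a subfolder named after its year. The function `organize_by_year_sketch` is a hypothetical illustration and may differ from the project's `organize_files_by_year`.

```python
import re
import shutil
from pathlib import Path

# Assumed naming pattern: ns-<two-digit issue>-<four-digit year>.pdf
YEAR_PATTERN = re.compile(r"ns-\d{2}-(\d{4})\.pdf$")


def organize_by_year_sketch(folder):
    """Move each matching PDF in `folder` into a subfolder named after its year."""
    folder = Path(folder)
    moved = {}
    # sorted() materializes the listing before any file is moved.
    for pdf in sorted(folder.glob("*.pdf")):
        match = YEAR_PATTERN.search(pdf.name)
        if match is None:
            continue  # Leave files with unexpected names untouched
        year_dir = folder / match.group(1)
        year_dir.mkdir(exist_ok=True)
        shutil.move(str(pdf), str(year_dir / pdf.name))
        moved[pdf.name] = match.group(1)
    return moved
```

Calling `organize_by_year_sketch("digital_pdf/raw_pdf")` would return a dictionary that maps each moved file name to its year folder.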

### Replacing defective PDFs (e.g., WR-08-2017)

This step is crucial for the upcoming steps, especially the cleaning in Section 3. If you later run into issues when executing the cleaning code, the cause is likely the PDF itself. In that case, return to the code below and replace the defective PDFs with suitable ones.

You can do the same if you face a similar problem: download another WR published in the same month and use it to replace the defective one.

In [None]:
# Replace specific defective PDFs (friendly outputs with icons)
replace_ns_pdfs(
    items=[
        ("2017", "ns-08-2017.pdf", "ns-07-2017"),  # (year folder containing the defective PDF, defective PDF, replacement PDF)
        ("2019", "ns-23-2019.pdf", "ns-22-2019"),  # Same structure as above
    ],
    root_folder=raw_pdf, # base folder with /2017, /2019, ...
    record_folder=record, # folder with new_downloaded_pdfs.txt
    download_record_txt = 'new_downloaded_pdfs.txt',
    quarantine=os.path.join(raw_pdf, "_quarantine")  # set to None to delete instead
)

## 2. Generate PDF input with key tables

Each 📰 WR downloaded from the Central Bank has more than 100 pages, but only a few of them contain the information required for this project.

All we need is a couple of pages from each 📰 WR: one for **Table 1** (monthly real GDP growth) and one for **Table 2** (annual and quarterly real GDP growth). The code below keeps these **two key pages** from each PDF, plus the cover page, which identifies each 📰 WR by its date of publication and serial number.

The `_quarantine` folder is skipped by the input PDF generator.

In [None]:
# Run the function to generate trimmed PDFs for input
input_pdfs_generator(
    raw_pdf_folder = raw_pdf,
    input_pdf_folder = input_pdf,
    input_pdf_record_folder = record,
    input_pdf_record_txt = 'new_generated_input_pdfs.txt',
    keywords = ["ECONOMIC SECTORS"]
)
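
The keyword-based page selection can be sketched on plain text, independently of any PDF library. In the sketch below, `page_texts` is assumed to hold the text of each page in order (with PyMuPDF, this could come from `page.get_text()`), and `select_key_pages` is a hypothetical helper, not the code inside `input_pdfs_generator`.

```python
def select_key_pages(page_texts, keywords, keep_cover=True):
    """Return the indexes of the cover page and of every page containing a keyword."""
    selected = [0] if keep_cover and page_texts else []
    for index, text in enumerate(page_texts):
        if index in selected:
            continue  # The cover page is already kept
        # Case-insensitive match against any keyword
        if any(keyword.lower() in text.lower() for keyword in keywords):
            selected.append(index)
    return selected


# Toy page texts, for illustration only.
pages = [
    "Nota Semanal (cover page)",
    "Table of contents",
    "SECTORES ECONOMICOS ... ECONOMIC SECTORS (monthly)",
    "Other statistics",
    "SECTORES ECONOMICOS ... ECONOMIC SECTORS (quarterly)",
]
print(select_key_pages(pages, ["ECONOMIC SECTORS"]))
```

The returned indexes could then be used to build the trimmed PDF, for instance with PyMuPDF's `Document.select`.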

Again, the WR (PDF files, now only a few pages long) were probably stored unsorted in the `input_pdf` folder. The following code moves each WR (which now includes only the key tables) into a subfolder named after its year of publication. This takes only a moment.  

In [None]:
# Get the list of files in the directory
files = os.listdir(input_pdf)

# Call the function to organize files
organize_files_by_year(input_pdf)

## 3. Data cleaning

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
Now that the PDFs <span style="font-size: 24px;">&#128462;</span> contain only the tables required for this project, we can extract those tables and then proceed with data cleaning.
</p>  
</div>

### 3.2 Extracting tables and data cleanup

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
The main library used for extracting tables from PDFs <span style="font-size: 24px;">&#128462;</span> is <code>pdfplumber</code>. You can review the official documentation by clicking <a href="https://github.com/jsvine/pdfplumber" style="color: rgb(0, 153, 123); font-size: 16px;">here</a>.
</p>
    
<p>     
    The functions in <b>Section 3</b> of the <code>"new_gdp_datasets_functions.py"</code> script were built to deal with the issues found in these tables. A useful exercise is to compare the original tables (the ones in the PDF <span style="font-size: 24px;">&#128462;</span>) with the cleaned tables produced by the cleanup code below. For that purpose, the cleanup code for <a href="#3-2-1" style="color: rgb(0, 153, 123); font-size: 16px;">Table 1</a> and for <a href="#3-2-2" style="color: rgb(0, 153, 123); font-size: 16px;">Table 2</a> each returns two dictionaries: the first stores the raw tables, that is, the original tables from the PDF <span style="font-size: 24px;">&#128462;</span> as extracted by the <code>pdfplumber</code> library, and the second stores the fully cleaned tables.
</p>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The code iterates through each PDF <span style="font-size: 24px;">&#128462;</span> and extracts the two required tables from each. The extracted tables are then converted into dataframes, and their column names and values are cleaned to follow Python conventions.
    </div>
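
To make the idea of cleaning labels and values concrete, here is a minimal sketch of two typical steps: turning a sector label into a lowercase, accent-free string, and parsing a number written with a decimal comma. The helpers `normalize_label` and `parse_decimal_comma` are hypothetical names and are not the functions in `new_gdp_datasets_functions.py`.

```python
import re
import unicodedata


def normalize_label(label):
    """Lowercase a label, strip accents, and drop trailing footnote marks such as '2/'."""
    text = re.sub(r"\s*\d+/\s*$", "", label.strip())
    # NFKD splits accented letters into base letter + combining mark.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return text.lower()


def parse_decimal_comma(value):
    """Convert a string such as '-37,8' into the float -37.8."""
    return float(value.replace(",", "."))


print(normalize_label("Agropecuario 2/"))
print(normalize_label("Minería e hidrocarburos"))
print(parse_decimal_comma("-37,8"))
```

Applied to every label and value, steps like these yield names that are easy to use as identifiers and numbers that can be analyzed directly.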

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.2.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    <span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">Table 1.</span> Extraction and cleaning of data from tables on monthly real GDP growth rates.
    </span>
    </h3>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
Tables are selected for extraction by keyword: containing the required keywords is a sufficient condition for a table to be extracted.
</p>
</div>

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Please check that the flat file <b>"ns_dates.csv"</b> is updated with the dates, years and ids for the newly downloaded PDF <span style="font-size: 24px;">&#128462;</span> (WR). That file is located in the <b>"ns_dates"</b> folder and is uploaded to SQL from the jupyter notebook <code>aux_files_to_sql.ipynb</code>.
    </span>
</div>

If for some reason you run the code in Section 3 and do not continue with the next section, the record file keeps track of the tables that were already processed. The next time you open this notebook, start again from Section 3 after deleting the record txt file, so that the dataframes, which are not saved anywhere, are generated again; they are essential inputs for Section 4. Alternatively, you can save all the generated dataframes in a folder as a backup and start from Section 4 by loading them.
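
The record mechanism described above can be sketched as follows. This is a simplified illustration under the assumption that the record is a text file with one processed name per line; `load_record` and `process_new_items` are hypothetical helpers, not the project's own functions.

```python
from pathlib import Path


def load_record(record_path):
    """Return the set of names already listed in the record file (empty if it is missing)."""
    path = Path(record_path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def process_new_items(names, record_path, process):
    """Process only names absent from the record, then append them to it."""
    done = load_record(record_path)
    results, skipped = {}, 0
    with open(record_path, "a", encoding="utf-8") as record:
        for name in names:
            if name in done:
                skipped += 1  # Already recorded in a previous run
                continue
            results[name] = process(name)
            record.write(name + "\n")
    return results, skipped
```

Under this scheme, a second run skips everything already recorded and returns no dataframes, which is why the record file has to be deleted to regenerate them.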

In [5]:
# table 1
raw_1, clean_1 = table_1_cleaner(
    input_pdf_folder=input_pdf,
    record_folder=record,
    record_txt="new_generated_dataframes_1.txt",
    log_folder="logs",
    log_txt="3_cleaner_1.log",
)


🧹 Starting Table 1 cleaning...

📂 Processing Table 1 in 2013


🚧 2013:   0%|          | 0/12

✔️ 2013:   0%|          | 0/12

📂 Processing Table 1 in 2014


🚧 2014:   0%|          | 0/12

✔️ 2014:   0%|          | 0/12

📂 Processing Table 1 in 2015


🚧 2015:   0%|          | 0/12

✔️ 2015:   0%|          | 0/12

📂 Processing Table 1 in 2016


🚧 2016:   0%|          | 0/12

✔️ 2016:   0%|          | 0/12

📂 Processing Table 1 in 2017


🚧 2017:   0%|          | 0/12

✔️ 2017:   0%|          | 0/12

📂 Processing Table 1 in 2018


🚧 2018:   0%|          | 0/12

✔️ 2018:   0%|          | 0/12

📂 Processing Table 1 in 2019


🚧 2019:   0%|          | 0/12

✔️ 2019:   0%|          | 0/12

📂 Processing Table 1 in 2020


🚧 2020:   0%|          | 0/12

✔️ 2020:   0%|          | 0/12

📂 Processing Table 1 in 2021


🚧 2021:   0%|          | 0/12

✔️ 2021:   0%|          | 0/12

📂 Processing Table 1 in 2022


🚧 2022:   0%|          | 0/12

✔️ 2022:   0%|          | 0/12


📊 Summary:

🗃️ Record file: record\new_generated_dataframes_1.txt
✨ Cleaned: 12  --  ⏩ Skipped: 108
⏱️ Time: 6 seconds



In [6]:
raw_1.keys()

dict_keys(['ns_04_2022_1', 'ns_08_2022_1', 'ns_12_2022_1', 'ns_16_2022_1', 'ns_20_2022_1', 'ns_23_2022_1', 'ns_26_2022_1', 'ns_30_2022_1', 'ns_33_2022_1', 'ns_37_2022_1', 'ns_41_2022_1', 'ns_44_2022_1'])

In [7]:
clean_1.keys()

dict_keys(['ns_04_2022_1', 'ns_08_2022_1', 'ns_12_2022_1', 'ns_16_2022_1', 'ns_20_2022_1', 'ns_23_2022_1', 'ns_26_2022_1', 'ns_30_2022_1', 'ns_33_2022_1', 'ns_37_2022_1', 'ns_41_2022_1', 'ns_44_2022_1'])

In [8]:
raw_1['ns_08_2022_1'].head(5)

Unnamed: 0.1,Unnamed: 0,2020,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,2021,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,SECTORES ECONÓMICOS,,,,,,,,,,,,,,,ECONOMIC SECTORS
1,,Año,Ene.,Feb.,Mar.,Abr.,May.,Jun.,Jul.,Ago.,Sep.,Oct.,Nov.,Dic.,Año,
2,Agropecuario 2/,10,14,-07,-14,-44,-48,119,118,75,120,52,25,92,"3,8 Agriculture and Livestock 2/",
3,Agrícola,28,26,-03,-21,-66,-75,161,174,105,186,64,19,139,50,Agriculture
4,Pecuario,-18,-01,-12,-04,03,22,24,31,37,34,35,33,28,19,Livestock


In [9]:
clean_1['ns_08_2022_1']

Unnamed: 0,year,id_ns,sectores_economicos,economic_sectors,2020_year,2021_ene,2021_feb,2021_mar,2021_abr,2021_may,2021_jun,2021_jul,2021_ago,2021_sep,2021_oct,2021_nov,2021_dic,2021_year
1,2022,8,agropecuario,agriculture and livestock,1.0,1.4,-0.7,-1.4,-4.4,-4.8,11.9,11.8,7.5,12.0,5.2,2.5,9.2,3.8
2,2022,8,agricola,agriculture,2.8,2.6,-0.3,-2.1,-6.6,-7.5,16.1,17.4,10.5,18.6,6.4,1.9,13.9,5.0
3,2022,8,pecuario,livestock,-1.8,-0.1,-1.2,-0.4,0.3,2.2,2.4,3.1,3.7,3.4,3.5,3.3,2.8,1.9
4,2022,8,pesca,fishing,4.2,70.6,6.8,35.7,141.1,97.6,-37.7,-41.4,-29.9,-39.2,-33.1,13.0,-12.6,2.8
5,2022,8,mineria e hidrocarburos,mining and fuel,-13.4,-8.4,-4.4,15.5,58.0,67.1,7.9,-0.9,3.2,11.0,1.5,-5.2,-6.1,7.4
6,2022,8,mineria metalica,metals,-13.8,-7.1,-1.0,20.6,76.7,82.7,7.2,1.4,5.2,12.2,0.5,-5.8,-7.1,9.7
7,2022,8,hidrocarburos,fuel,-11.0,-15.6,-20.8,-10.1,-8.1,6.7,11.9,-13.6,-8.4,4.0,7.2,-1.9,0.6,-4.6
8,2022,8,manufactura,manufacturing,-12.5,7.4,0.1,51.2,115.6,84.0,18.5,7.2,11.2,6.7,-1.4,4.0,1.4,17.8
9,2022,8,procesadores recursos primarios,based on raw materials,-2.0,26.2,-4.8,23.7,32.3,54.1,-11.2,-13.5,0.2,-11.3,-17.7,-8.8,-13.3,1.9
10,2022,8,manufactura no primaria,nonprimary,-16.4,1.6,1.9,63.0,175.2,105.2,38.7,16.1,14.8,12.4,4.0,9.1,9.1,24.6


<div id="3-2-2">
   <!-- Content of the target cell -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.2.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    <span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">Table 2.</span> Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.
    </span>
    </h3>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
Tables are selected for extraction by keyword: containing the required keywords is a sufficient condition for a table to be extracted.
</p>
</div>

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Please check that the flat file <b>"ns_dates.csv"</b> is updated with the dates, years and ids for the newly downloaded PDF <span style="font-size: 24px;">&#128462;</span> (WR). That file is located in the <code>ns_dates</code> folder and is uploaded to SQL from the jupyter notebook <code>aux_files_to_sql.ipynb</code>.
    </span>
</div>

In [10]:
# table 2
raw_2, clean_2 = table_2_cleaner(
    input_pdf_folder=input_pdf,
    record_folder=record,
    record_txt="new_generated_dataframes_2.txt",
    log_folder="logs",
    log_txt="3_cleaner_2.log",
)


🧹 Starting Table 2 cleaning...

📂 Processing Table 2 in 2013


🚧 2013:   0%|          | 0/12

🏁 2013:   0%|          | 0/12

📂 Processing Table 2 in 2014


🚧 2014:   0%|          | 0/12

🏁 2014:   0%|          | 0/12

📂 Processing Table 2 in 2015


🚧 2015:   0%|          | 0/12

🏁 2015:   0%|          | 0/12

📂 Processing Table 2 in 2016


🚧 2016:   0%|          | 0/12

🏁 2016:   0%|          | 0/12

📂 Processing Table 2 in 2017


🚧 2017:   0%|          | 0/12

🏁 2017:   0%|          | 0/12

📂 Processing Table 2 in 2018


🚧 2018:   0%|          | 0/12

🏁 2018:   0%|          | 0/12

📂 Processing Table 2 in 2019


🚧 2019:   0%|          | 0/12

🏁 2019:   0%|          | 0/12

📂 Processing Table 2 in 2020


🚧 2020:   0%|          | 0/12

🏁 2020:   0%|          | 0/12

📂 Processing Table 2 in 2021


🚧 2021:   0%|          | 0/12

🏁 2021:   0%|          | 0/12

📂 Processing Table 2 in 2022


🚧 2022:   0%|          | 0/12

🏁 2022:   0%|          | 0/12


📊 Summary:

🗃️  Record file: record\new_generated_dataframes_2.txt
🧹 Cleaned: 12  --  ⏩ Skipped: 108
⏱️ Time: 4 seconds



In [15]:
raw_2['ns_04_2022_2']

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,2019,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,2020,Unnamed: 9,Unnamed: 10,Unnamed: 11,2021,Unnamed: 13,Unnamed: 14
0,SECTORES ECONÓMICOS,,,,,,,,,,,,,,ECONOMIC SECTORS
1,,I,II,III,IV,AÑO,I,II,III,IV,AÑO,I,II,III,
2,Agropecuario,48,24,22,53,35,40,23,-17,08,14,00,-02,"9,6 Agriculture and Livestock",
3,Pesca,-129,-273,295,-239,-172,-165,-145,149,386,42,373,212,"-37,8 Fishing",
4,Minería e hidrocarburos,-05,-22,03,21,00,-57,-343,-102,-39,-134,00,389,"4,3 Mining and fuel",
5,Manufactura,-07,-68,39,-24,-17,-93,-362,-69,20,-125,167,609,"8,4 Manufacturing",
6,Electricidad y agua,59,38,37,24,39,-19,-194,-31,-02,-61,28,253,"6,3 Electricity and water",
7,Construcción,22,74,34,-48,14,-120,-661,-45,190,-139,416,2309,"23,8 Construction",
8,Comercio,24,27,33,36,30,-71,-468,-81,-26,-160,14,859,"10,1 Commerce",
9,Servicios,38,35,39,33,36,-15,-247,-107,-46,-103,06,314,"13,7 Services",


In [16]:
clean_2['ns_04_2022_2']

Unnamed: 0,year,id_ns,sectores_economicos,economic_sectors,2019_1,2019_2,2019_3,2019_4,2019_year,2020_1,2020_2,2020_3,2020_4,2020_year,2021_1,2021_2,2021_3
0,2022,4,agropecuario,agriculture and livestock,4.8,2.4,2.2,5.3,3.5,4.0,2.3,-1.7,0.8,1.4,0.0,-0.2,9.6
1,2022,4,pesca,fishing,-12.9,-27.3,29.5,-23.9,-17.2,-16.5,-14.5,14.9,38.6,4.2,37.3,21.2,-37.8
2,2022,4,mineria e hidrocarburos,mining and fuel,-0.5,-2.2,0.3,2.1,0.0,-5.7,-34.3,-10.2,-3.9,-13.4,0.0,38.9,4.3
3,2022,4,manufactura,manufacturing,-0.7,-6.8,3.9,-2.4,-1.7,-9.3,-36.2,-6.9,2.0,-12.5,16.7,60.9,8.4
4,2022,4,electricidad y agua,electricity and water,5.9,3.8,3.7,2.4,3.9,-1.9,-19.4,-3.1,-0.2,-6.1,2.8,25.3,6.3
5,2022,4,construccion,construction,2.2,7.4,3.4,-4.8,1.4,-12.0,-66.1,-4.5,19.0,-13.9,41.6,230.9,23.8
6,2022,4,comercio,commerce,2.4,2.7,3.3,3.6,3.0,-7.1,-46.8,-8.1,-2.6,-16.0,1.4,85.9,10.1
7,2022,4,otros servicios,other services,3.8,3.5,3.9,3.3,3.6,-1.5,-24.7,-10.7,-4.6,-10.3,0.6,31.4,13.7
8,2022,4,pbi global,gdp,2.4,1.1,3.3,1.8,2.2,-3.9,-29.9,-8.8,-1.4,-11.0,4.5,41.9,11.4
9,2022,4,sectores primarios,primary sectors,-1.2,-4.4,1.9,0.5,-0.9,-2.9,-20.0,-6.5,0.2,-7.7,2.6,20.1,3.0
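
Comparing the raw and cleaned versions of Table 2 shows that the period labels (`I` to `IV` and `AÑO`) become column names such as `2019_1` and `2019_year`. The sketch below reproduces that naming under the assumption that the year advances after each `AÑO` column; `label_quarterly_columns` is a hypothetical helper and not the project's implementation.

```python
ROMAN_QUARTERS = {"I": "1", "II": "2", "III": "3", "IV": "4"}
ANNUAL_LABELS = {"AÑO", "ANO", "YEAR"}


def label_quarterly_columns(first_year, period_labels):
    """Build names like '2019_1' or '2019_year' from a row of period labels."""
    names = []
    year = first_year
    for label in period_labels:
        label = label.strip().upper()
        if label in ROMAN_QUARTERS:
            names.append(f"{year}_{ROMAN_QUARTERS[label]}")
        elif label in ANNUAL_LABELS:
            names.append(f"{year}_year")
            year += 1  # The annual column closes the block of a given year
        else:
            raise ValueError(f"Unexpected period label: {label}")
    return names


labels = ["I", "II", "III", "IV", "AÑO", "I", "II", "III", "IV", "AÑO", "I", "II", "III"]
print(label_quarterly_columns(2019, labels))
```

With the period row of the raw table above and a first year of 2019, the sketch yields the period columns of the cleaned table, from `2019_1` to `2021_3`.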


<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Real-time data of Peru's GDP growth rates</span></h1>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
This section creates the GDP growth rate vintages for Peru using <a href="#3-2-1" style="color: rgb(0, 153, 123); font-size: 16px;">Table 1</a> and <a href="#3-2-2" style="color: rgb(0, 153, 123); font-size: 16px;">Table 2</a>. In the previous section, each table from each WR (PDF <span style="font-size: 24px;">&#128462;</span>) was extracted and cleaned individually. Here, we concatenate all the tables for a specific economic sector, creating a vintage dataset of real GDP growth by economic sector from <b>2013</b> to <b>2024</b>.
</div>
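
The idea of a vintage dataset can be illustrated with a small sketch: take the row for one sector from every cleaned table and stack those rows in release order, so that each row is one WR release. The example uses plain dictionaries and toy numbers (not BCRP data) to stay self-contained; `build_sector_vintages` is a hypothetical helper, and only the column names `year`, `id_ns` and `economic_sectors` are taken from the cleaned tables shown earlier.

```python
def build_sector_vintages(clean_tables, sector, sector_column="economic_sectors"):
    """Collect the row for `sector` from every table, ordered by release (year, id_ns)."""
    rows = []
    for table in clean_tables.values():
        for row in table:
            if row[sector_column] == sector:
                rows.append(dict(row))  # Copy so the source tables stay untouched
    rows.sort(key=lambda row: (row["year"], row["id_ns"]))
    return rows


# Toy tables with made-up values, for illustration only.
toy_tables = {
    "ns_08_2022_1": [
        {"year": 2022, "id_ns": 8, "economic_sectors": "fishing", "2021_dic": 2.0},
        {"year": 2022, "id_ns": 8, "economic_sectors": "gdp", "2021_dic": 1.0},
    ],
    "ns_04_2022_1": [
        {"year": 2022, "id_ns": 4, "economic_sectors": "gdp", "2021_nov": 0.5},
    ],
}
vintages = build_sector_vintages(toy_tables, "gdp")
print([(row["year"], row["id_ns"]) for row in vintages])
```

In the notebook itself the tables are pandas dataframes, so the same idea would be expressed with dataframe filtering and concatenation.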

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    <span style="font-size: 24px; color: #FFA823; font-weight: bold;">&#9888;</span>
You can create the data manually, step by step, focusing on specific sectors or frequencies. Alternatively, you can take the automated approach and generate the data for all sectors and frequencies at once.
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Manual creation of real-time data: sector by sector and frequency by frequency.
    </span>
    </h2>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    With this method you can create and inspect the dataset sector by sector and frequency by frequency. This is useful if you want to create data only for particular sectors and frequencies.
</div>

<div id="select_sector">
   <!-- Content of the target cell -->
</div>

<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Select <code>sector_economico</code> and <code>economic_sector</code></span></h1>
    </div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
Running the following code opens a window with options in <b>Spanish</b> and <b>English</b> for selecting an <b>economic sector</b>. The selection determines the sector whose GDP growth rates (annual, quarterly or monthly) will be concatenated.
</p>
</div>

In [None]:
# Call the function to display the window and capture the selected values
selected_spanish, selected_english, sector = show_option_window()

# Display the selected values
print(f"You have selected sector = {sector}, selected_spanish = {selected_spanish}, and selected_english = {selected_english}.")

<div id="select_freq">
   <!-- Content of the target cell -->
</div>

<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Select <code>frequency</code></span></h1>
    </div>

In [None]:
# Call the function to show the popup window
frequency = show_frequency_window()
print("Selected frequency:", frequency)

<div id="counter">
   <!-- Content of the target cell -->
</div>

<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Set counter (dataframe name suffix)</span></h1>
    </div>

In [None]:
# Map the selected frequency to the dataframe name suffix:
# 1 = Table 1 (monthly), 2 = Table 2 (quarterly and annual)
if frequency == "monthly":
    counter = 1
elif frequency in ("quarterly", "annual"):
    counter = 2
else:
    counter = None

print(counter)

<div id="4-1-1">
   <!-- Content of the target cell -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.1.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Growth rates datasets concatenation for all frequencies
    </span>
    </h3>

In [None]:
# Build the name of the concatenation function and of the dictionary of cleaned tables
function_name = f"concatenate_{frequency}_df"
dataframe_dict_name = f"clean_{counter}"  # clean_1 (Table 1) or clean_2 (Table 2), created in Section 3

# Check that both the function and the dictionary exist in the global scope
if function_name in globals() and dataframe_dict_name in globals():
    # Look up the function in globals() and call it for the selected sector
    globals()[f"new_{sector}_{frequency}_growth_rates"] = globals()[function_name](
        globals()[dataframe_dict_name], selected_spanish, selected_english
    )
else:
    print(f"Error: {function_name} or {dataframe_dict_name} does not exist in the global scope.")
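
As an alternative to looking names up in `globals()`, the same dispatch can be written with explicit dictionaries, which fails loudly on an unknown frequency. The sketch below is only an illustration: `make_dispatcher` is a hypothetical helper, and the dummy concatenators stand in for the project's `concatenate_<frequency>_df` functions.

```python
def make_dispatcher(concatenators, tables_by_suffix):
    """Return a function that runs the concatenator matching a frequency."""
    # Monthly data come from Table 1; quarterly and annual data from Table 2.
    suffix_by_frequency = {"monthly": 1, "quarterly": 2, "annual": 2}

    def run(frequency, selected_spanish, selected_english):
        if frequency not in suffix_by_frequency:
            raise ValueError(f"Unknown frequency: {frequency!r}")
        tables = tables_by_suffix[suffix_by_frequency[frequency]]
        return concatenators[frequency](tables, selected_spanish, selected_english)

    return run


# Dummy stand-ins for the real concatenation functions (illustration only).
def fake_concatenator(label):
    return lambda tables, spanish, english: (label, sorted(tables), spanish, english)


run = make_dispatcher(
    {name: fake_concatenator(name) for name in ("monthly", "quarterly", "annual")},
    {1: {"ns_08_2022_1": None}, 2: {"ns_04_2022_2": None}},
)
print(run("quarterly", "pesca", "fishing"))
```

In the notebook, the first dictionary would map each frequency to the corresponding imported function and the second would hold the dictionaries of cleaned tables.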

In [None]:
#pd.set_option('display.max_rows', None)
globals()[f"new_{sector}_{frequency}_growth_rates"].head(10)

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="4-1-2">
   <!-- Target cell content -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.1.2.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Uploading data to SQL</span></h3>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Finally, we upload the datasets generated in this Jupyter notebook to the <code>'gdp_revisions_datasets'</code> database in <code>PostgreSQL</code>.
</div>

In [None]:
engine = create_sqlalchemy_engine()

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Load the concatenated dataset into the database, replacing the table if it already exists.
</div>

In [None]:
globals()[f"new_{sector}_{frequency}_growth_rates"].to_sql(f'new_{sector}_{frequency}_growth_rates', engine, index=False, if_exists='replace')

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 20px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#select_sector" style="color: rgb(255, 32, 78); text-decoration: none;">&#11165;</a>
    </span> 
    <a href="#select_sector" style="color: rgb(255, 32, 78); text-decoration: none;">Back to select sectors.</a>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 20px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#select_freq" style="color: rgb(255, 32, 78); text-decoration: none;">&#11165;</a>
    </span> 
    <a href="#select_freq" style="color: rgb(255, 32, 78); text-decoration: none;">Back to select frequency.</a>
</div>

<div id="4-2">
   <!-- Target cell content -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Automatic real-time dataset creation: all sectors and frequencies at once
    </span>
    </h2>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    This method creates the datasets for all sectors and all frequencies in a single run. It is more efficient when the goal is to generate every combination of <code>sector</code> and <code>frequency</code>, without excluding any.
</div>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    List of frequencies used to build the concatenated datasets.
</div>

In [None]:
frequencies = [
        "monthly", 
        "quarterly",
        "annual"
    ]

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    Function to process the growth-rate datasets: concatenate them and load the result to SQL.
</div>

In [None]:
def process_new_datasets_to_sql(sector, frequency):
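    """Concatenate the growth-rate tables for one sector and frequency and load them to SQL.

    Returns the concatenated DataFrame, or None if the frequency is unknown
    or the required function or dictionary is not defined.
    """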

    # Set counter based on frequency
    if frequency == "monthly":
        counter = 1
    elif frequency in ["quarterly", "annual"]:
        counter = 2
    else:
        print(f"Unknown frequency: {frequency}")
        return None

    # Dynamically build function and dictionary names
    function_name = f"concatenate_{frequency}_df"
    dataframe_dict_name = f"new_dataframes_dict_{counter}"

    if function_name in globals() and dataframe_dict_name in globals():
        # Generate the DataFrame
        df_name = f"new_{sector}_{frequency}_growth_rates"
        globals()[df_name] = globals()[function_name](
            globals()[dataframe_dict_name], option_mapping[sector][0], option_mapping[sector][1]
        )

        # Load to SQL
        engine = create_sqlalchemy_engine()
        globals()[df_name].to_sql(df_name, engine, index=False, if_exists='replace')

        return globals()[df_name]
    else:
        print(f"Error: {function_name} or {dataframe_dict_name} does not exist in the global scope.")
        return None

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    Run the function to create the concatenated datasets for all sectors and frequencies and load them to SQL.
</div>

In [None]:
# Initialize counter
processed_datasets = 0

# Process all combinations
for sector in option_mapping.keys():
    for frequency in frequencies:
        print(f"Processing {sector} - {frequency}")
        df = process_new_datasets_to_sql(sector, frequency)
        if df is not None:
            display(df.head(10))  # Display the first 10 rows
            processed_datasets += 1  # Increment counter

# Display total number of processed datasets
print(f"Total datasets processed: {processed_datasets}")
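
<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Before running the loop, it can help to list the tables it is expected to create. The sketch below enumerates every sector and frequency pair with <code>itertools.product</code>, in the same order as the nested loops. The <code>option_mapping</code> entries here are made up for illustration and do not reproduce the notebook's actual mapping.
</div>

```python
from itertools import product

# Illustrative stand-in for the notebook's option_mapping
# (sector key -> [Spanish name, English name]); the entries are made up.
option_mapping = {
    "gdp": ["pbi", "gdp"],
    "mining": ["mineria", "mining"],
}
frequencies = ["monthly", "quarterly", "annual"]

# Every (sector, frequency) pair, in the same order as the nested loops.
combinations = list(product(option_mapping, frequencies))

# Table names follow the new_<sector>_<frequency>_growth_rates pattern.
table_names = [f"new_{s}_{f}_growth_rates" for s, f in combinations]

print(len(table_names))  # 6
print(table_names[0])    # new_gdp_monthly_growth_rates
```

Comparing <code>len(table_names)</code> with the total printed by the loop gives a quick check that no combination was skipped.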

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div style="font-size: 16px; background-color: #F5F5F5; padding: 18px; line-height: 1.5; font-family: 'PT Serif Pro Book';">
    <span style="font-size: 24px; color: #FFA823; font-weight: bold;">&#9888;</span>
    Once you have generated all the datasets with this script (<code>new_gdp_datasets.ipynb</code>), you can concatenate them with those generated by the script <code>old_gdp_datasets.ipynb</code>. <b>Section 6</b> of the script <code>aux_files_to_sql.ipynb</code> concatenates the <b>new</b> and <b>old</b> datasets for <b>all sectors</b> and <b>all frequencies</b>.
</div>

---
---
