<div style="text-align: center; color: #292929;">
  <h1 style="margin-bottom: 10px;">New GDP Real-Time Dataset</h1>
  <div style="height: 2px; width: 90%; margin: 0 auto; background-color: #292929;"></div>
  <h2>Documentation</h2>
  </div>

<div style="text-align: center; margin-right: 40px;">
  <span style="display: inline-block; margin-right: 10px;">
    <a href="https://github.com/JasonCruz18" target="_blank">
      <img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/github/github-original.svg" alt="GitHub" style="width: 24px;">
    </a>
  </span>
  <span style="display: inline-block;">
    <a href="mailto:jj.cruza@up.edu.pe">
      <img src="https://upload.wikimedia.org/wikipedia/commons/4/4e/Mail_%28iOS%29.svg" alt="Email" style="width: 24px;">
    </a>
  </span>
</div>

**Author:** Jason Cruz  
**Last updated:** 08/13/2025  
**Python version:** 3.12  
**Project:** Rationality and Nowcasting on Peruvian GDP Revisions  

---
## 📌 Summary
This notebook documents the step-by-step **construction of datasets** for analyzing **Peruvian GDP revisions** from 2013–2024.  
It covers:
1. **Data acquisition** from the Central Reserve Bank of Peru's Weekly Reports (PDF).
2. **Data cleaning** and extraction of GDP tables.
3. **Creation of real-time GDP vintages**.
4. **Preparation of the final revisions dataset**.
5. **Export to SQL** for further analysis.

🌐 **Main Data Source:** [BCRP Weekly Report](https://www.bcrp.gob.pe/publicaciones/nota-semanal.html) (📰 WR, from here on)  
For questions or issues regarding the code, please email [Jason 📨](mailto:jj.cruza@alum.up.edu.pe)  

---

## 🛠️ Libraries

If any of the libraries below is missing, install it with a command like the following (shown as an example).

In [None]:
#!pip install pdfplumber  # Uncomment to install a missing third-party library; standard-library modules such as os need no installation.
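
Before installing anything, it can help to check which packages are actually missing. The sketch below uses `importlib.util.find_spec` from the standard library; the helper name `find_missing_modules` and the module list are illustrative assumptions, not part of the project's scripts.

```python
from importlib.util import find_spec


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in module_names if find_spec(name) is None]


# Import names used in this notebook (illustrative list, not exhaustive).
required = ["selenium", "pygame", "fitz", "pdfplumber", "pandas",
            "tabula", "sqlalchemy", "psycopg2", "roman"]

missing = find_missing_modules(required)
print("Missing modules:", missing or "none")
```

Note that install names can differ from import names: for example, `fitz` is provided by `PyMuPDF`, `tabula` by `tabula-py`, and `PIL` by `Pillow`.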

Check the Python environment information:

In [None]:
import sys
import platform

print("🐍 Python Information")
print(f"  Version  : {sys.version.split()[0]}")
print(f"  Compiler : {platform.python_compiler()}")
print(f"  Build    : {platform.python_build()}")
print(f"  OS       : {platform.system()} {platform.release()}")

In [1]:
# 1. PDF downloader
#-------------------------------------------------------------------------------------------------------------------------------

import os  # For file and directory manipulation, for interacting with the operating system
import random  # To generate random numbers
from selenium import webdriver  # For automating web browsers
from selenium.webdriver.common.by import By  # To locate elements on a webpage
from selenium.webdriver.support.ui import WebDriverWait  # To wait until certain conditions are met on a webpage.
from selenium.webdriver.support import expected_conditions as EC  # To define expected conditions
from selenium.common.exceptions import StaleElementReferenceException  # To handle exceptions related to elements on the webpage that are no longer available.
import pygame # Allows you to handle graphics, sounds and input events.
from webdriver_manager.chrome import ChromeDriverManager # To manage the ChromeDriver version automatically and avoid compatibility issues

import shutil # Used for high-level file operations, such as copying, moving, renaming, and deleting files and directories.


# 2. Generate PDF input with key tables
#-------------------------------------------------------------------------------------------------------------------------------

import fitz  # This library is used for working with PDF documents, including reading, writing, and modifying PDFs (PyMuPDF).
import tkinter as tk  # This library is used for creating graphical user interfaces (GUIs) in Python.


# 3. Data cleaning
#-------------------------------------------------------------------------------------------------------------------------------

# 3.1. A brief documentation on issues in the table information of the PDFs

from PIL import Image  # Used for opening, manipulating, and saving image files.
import matplotlib.pyplot as plt  # Used for creating static, animated, and interactive visualizations.

# 3.2. Extracting tables and data cleanup

import pdfplumber  # For extracting text and metadata from PDF files
import pandas as pd  # For data manipulation and analysis
import unicodedata  # For manipulating Unicode data
import re  # For regular expressions operations
from datetime import datetime  # For working with dates and times
import locale  # For locale-specific formatting of numbers, dates, and currencies

# 3.2.1. Table 1. Extraction and cleaning of data from tables on monthly real GDP growth rates.

import tabula  # Used to extract tables from PDF files into pandas DataFrames
from tkinter import Tk, messagebox, TOP, YES, NO  # Used for creating graphical user interfaces
from sqlalchemy import create_engine  # Used for connecting to and interacting with SQL databases

# 3.2.2. Table 2. Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.

import roman  # For converting between Roman numerals and integers
from datetime import datetime


# 4. Real-time data of Peru's GDP growth rates
#-------------------------------------------------------------------------------------------------------------------------------

import psycopg2  # For interacting with PostgreSQL databases
from sqlalchemy import create_engine, text  # For creating and executing SQL queries using SQLAlchemy

pygame 2.5.2 (SDL 2.28.3, Python 3.12.1)
Hello from the pygame community. https://www.pygame.org/contribute.html


## ⚙️ Initial set-up

Before preprocessing the new GDP release data, we will:

* **Create necessary folders** for storing inputs, outputs, logs, and screenshots.
* **Connect to the PostgreSQL database** containing GDP revisions datasets.
* **Import helper functions** from `new_gdp_datasets_functions.py`.

**Create necessary folders**

In [2]:
# Define base folder for saving all digital PDFs
digital_pdf = 'digital_pdf'

# Define subfolder for saving the original PDFs as downloaded from the BCRP website
raw_pdf = os.path.join(digital_pdf, 'raw_pdf')

# Define subfolder for saving reduced PDFs containing only selected pages with GDP growth tables (monthly, quarterly, and annual frequencies)
input_pdf = os.path.join(digital_pdf, 'input_pdf')

# Define folder for saving .txt files with download and dataframe record
record = 'record'

# Define folder for saving warning bells. This is for download notifications (see section 1).
alert_track = 'alert_track'

# Create all required folders (if they do not already exist) and confirm creation
for folder in [digital_pdf, raw_pdf, input_pdf, record, alert_track]:
    os.makedirs(folder, exist_ok=True)
    print(f"📂 {folder} created")

📂 digital_pdf created
📂 digital_pdf\raw_pdf created
📂 digital_pdf\input_pdf created
📂 record created
📂 alert_track created


**Connect to the PostgreSQL database**

The following function establishes a connection to the `gdp_revisions_datasets` database in `PostgreSQL`. The **input data** used in this Jupyter notebook are loaded from this database, and all **output data** generated by the notebook are stored there as well. Once you have obtained the required permissions, set the parameters needed to access the server.

> 💡 **Tip:** To request permissions, please email [Jason 📨](mailto:jj.cruza@alum.up.edu.pe)  
> ⚠️ **Warning:** Make sure you have set your SQL credentials as environment variables before proceeding.  

In [3]:
def create_sqlalchemy_engine(database="gdp_revisions_datasets", port=5432):
    """
    Create an SQLAlchemy engine to connect to the PostgreSQL database.
    
    Environment Variables Required:
        CIUP_SQL_USER: SQL username
        CIUP_SQL_PASS: SQL password
        CIUP_SQL_HOST: SQL host address

    Args:
        database (str): Name of the database. Default is 'gdp_revisions_datasets'.
        port (int): Port number. Default is 5432.

    Returns:
        engine (sqlalchemy.engine.Engine): SQLAlchemy engine object.
    
    Raises:
        ValueError: If required environment variables are missing.

    Example:
        engine = create_sqlalchemy_engine()
    """
    user = os.environ.get('CIUP_SQL_USER')
    password = os.environ.get('CIUP_SQL_PASS')
    host = os.environ.get('CIUP_SQL_HOST')

    if not all([host, user, password]):
        raise ValueError("❌ Missing environment variables: CIUP_SQL_HOST, CIUP_SQL_USER, CIUP_SQL_PASS")

    connection_string = f"postgresql://{user}:{password}@{host}:{port}/{database}"
    engine = create_engine(connection_string)

    print(f"🔗 Connected to PostgreSQL database: {database} at {host}:{port}")
    return engine

In [4]:
engine = create_sqlalchemy_engine()

🔗 Connected to PostgreSQL database: gdp_revisions_datasets at localhost:5432
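
One caveat when building the connection string by hand: a password containing characters such as `@` or `/` would break the URL. A minimal sketch of one way around this, assuming the same three environment variables, percent-encodes the credentials with `urllib.parse.quote_plus`. The helper `build_connection_url` and the sample credentials are hypothetical and are not part of the project's code.

```python
import os
from urllib.parse import quote_plus


def build_connection_url(database="gdp_revisions_datasets", port=5432, env=None):
    """Build a PostgreSQL URL with percent-encoded credentials (hypothetical helper)."""
    env = os.environ if env is None else env
    required = ["CIUP_SQL_USER", "CIUP_SQL_PASS", "CIUP_SQL_HOST"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")

    # Encode user and password so reserved characters cannot corrupt the URL.
    user = quote_plus(env["CIUP_SQL_USER"])
    password = quote_plus(env["CIUP_SQL_PASS"])
    return f"postgresql://{user}:{password}@{env['CIUP_SQL_HOST']}:{port}/{database}"


# Made-up credentials, for illustration only.
example_env = {"CIUP_SQL_USER": "analyst", "CIUP_SQL_PASS": "p@ss/word", "CIUP_SQL_HOST": "localhost"}
example_url = build_connection_url(env=example_env)
print(example_url)
```

Note also that SQLAlchemy's `create_engine` does not open a connection until the engine is first used, so the confirmation message by itself does not prove that the credentials work.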


**Import helper functions**

> ⚠️ Please check the script `new_gdp_datasets_functions.py`, which contains all the functions required by this _Jupyter notebook_. The functions there are ordered according to the sections of this notebook.

In [5]:
from new_gdp_datasets_functions import *

## 1. PDF Downloader

Our main data source is the [BCRP Weekly Report](https://www.bcrp.gob.pe/publicaciones/nota-semanal.html). The BCRP publishes this report every week in compliance with article 84 of the Peruvian Constitution and articles 2 and 74 of the BCRP's organic law, which include, among the bank's functions, the periodic publication of the main national macroeconomic statistics.

Our project requires **two tables** from each report: the table of monthly growth rates of real GDP (12-month percentage changes), and the table of quarterly and annual growth rates of real GDP. These tables are referred to as **Table 1** and **Table 2**, respectively, throughout this Jupyter notebook.

### Scraper bot

This section automates the download of the **BCRP Weekly Report PDFs** directly from the official BCRP website.

**What it does:**
1. Opens the official BCRP Weekly Report page.
2. Finds and collects all PDF links.
3. Downloads them in chronological order (oldest to newest).
4. Optionally plays a notification sound every N downloads.
5. Organizes downloaded PDFs into year-based folders.
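
Step 3 relies on ordering the reports chronologically. A minimal sketch of that ordering, assuming the file names follow the `ns-<issue>-<year>.pdf` pattern seen later in this notebook (for example `ns-08-2017.pdf`). The helper `sort_reports_chronologically` is hypothetical, and the real `download_pdfs` may order the links differently.

```python
import re

# Assumed naming pattern: ns-<issue number>-<four-digit year>
NS_PATTERN = re.compile(r"ns-(\d+)-(\d{4})")


def sort_reports_chronologically(names):
    """Sort file names or URLs from oldest to newest by (year, issue)."""
    def sort_key(name):
        match = NS_PATTERN.search(name)
        if match is None:
            raise ValueError(f"Unexpected report name: {name}")
        issue, year = match.groups()
        return int(year), int(issue)

    return sorted(names, key=sort_key)


sample = ["ns-24-2019.pdf", "ns-08-2017.pdf", "ns-04-2017.pdf"]
ordered = sort_reports_chronologically(sample)
print(ordered)
```

Sorting on integers rather than on the raw strings avoids placing issue 10 before issue 9.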

> 💡 If a CAPTCHA appears, solve it manually in the browser window and re-run the cell.

> 🔁 This script uses webdriver-manager to handle browser drivers automatically (default: Chrome), so you do NOT need to download ChromeDriver, GeckoDriver, etc. manually. To use a different browser for your replication, change the `browser` parameter in `init_driver()`.

> 🎵 Place your own MP3 file in the `alert_track` folder for download notifications. Recommended free sources (CC0/public domain):
>  - Pixabay Audio: https://pixabay.com/music/
>  - FreeSound: https://freesound.org/
>  - FreePD: https://freepd.com/

In [None]:
# Run the function to start the scraper bot
download_pdfs(
    bcrp_url = "https://www.bcrp.gob.pe/publicaciones/nota-semanal.html",
    raw_pdf_folder = raw_pdf,
    download_record_folder = record,
    alert_track_folder = alert_track,
    max_downloads = 36
)

The 📰 WR were probably downloaded into a single folder, but we want them sorted by year. The following code sorts the PDFs into year subfolders, placing each WR according to its year of publication. This happens in the **"blink of an eye"**.

In [None]:
# Get the list of files in the directory
files = os.listdir(raw_pdf)

# Call the function to organize files
organize_files_by_year(raw_pdf)
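
For readers curious how such a sort can work, here is a minimal sketch under the assumption that every report is named `ns-<issue>-<year>.pdf`. The helper `move_into_year_folders` is hypothetical and is not the project's `organize_files_by_year`; the demonstration runs in a temporary folder with empty placeholder files.

```python
import os
import re
import shutil
import tempfile


def move_into_year_folders(folder):
    """Move files named ns-<issue>-<year>.pdf into <folder>/<year>/ (hypothetical helper)."""
    moved = {}
    for name in sorted(os.listdir(folder)):
        source = os.path.join(folder, name)
        match = re.fullmatch(r"ns-\d+-(\d{4})\.pdf", name)
        if not os.path.isfile(source) or match is None:
            continue  # Skip subfolders and files that do not follow the pattern
        year_folder = os.path.join(folder, match.group(1))
        os.makedirs(year_folder, exist_ok=True)
        shutil.move(source, os.path.join(year_folder, name))
        moved[name] = match.group(1)
    return moved


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["ns-04-2017.pdf", "ns-24-2019.pdf", "notes.txt"]:
        open(os.path.join(demo_folder, name), "w").close()
    demo_result = move_into_year_folders(demo_folder)
    demo_layout = sorted(os.listdir(demo_folder))

print(demo_result)
print(demo_layout)
```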

### WR-08-2017

The downloaded file `ns-08-2017.pdf` is defective, so the cell below replaces it with a copy of `ns-04-2017.pdf`.

You can do the same if you run into a similar problem. You can even download the exceptional WR issues from the same month and use them to replace the defective ones.

In [6]:
fix_defective_pdf(
    pdf_folder="digital_pdf/raw_pdf/2017",
    defective_pdf="ns-08-2017.pdf",
    source_pdf="ns-04-2017.pdf"
)

✅ ns-08-2017.pdf replaced by a copy of ns-04-2017.pdf
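
A replacement like this boils down to a file copy. The following is a minimal sketch, not the project's `fix_defective_pdf`: the helper `replace_with_copy` is hypothetical, and the demonstration uses placeholder bytes in a temporary folder instead of real PDFs.

```python
import os
import shutil
import tempfile


def replace_with_copy(folder, defective_name, source_name):
    """Overwrite the defective file with a copy of the source file (hypothetical helper)."""
    source = os.path.join(folder, source_name)
    target = os.path.join(folder, defective_name)
    if not os.path.isfile(source):
        raise FileNotFoundError(source)
    shutil.copyfile(source, target)
    return target


# Demonstration with placeholder contents.
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, "good.pdf"), "wb") as handle:
        handle.write(b"good content")
    with open(os.path.join(demo_folder, "bad.pdf"), "wb") as handle:
        handle.write(b"broken")
    fixed_path = replace_with_copy(demo_folder, "bad.pdf", "good.pdf")
    with open(fixed_path, "rb") as handle:
        fixed_content = handle.read()

print(fixed_content)
```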


## 2. Generate PDF input with key tables

Each downloaded 📰 WR has more than 100 pages, but only a few of them contain the information required for this project.

All we need from each 📰 WR is two pages: one for **Table 1** (monthly real GDP growth) and one for **Table 2** (annual and quarterly real GDP growth). The code below keeps these **two key pages** from each PDF, plus the cover page, which carries the information that distinguishes one 📰 WR from another, such as its publication date and serial number.
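
The selection rule described here (cover page plus the pages holding the two tables) can be sketched without any PDF library by working on a list of page texts. The helper `select_key_pages` and the sample page texts are hypothetical; the project's `generate_input_pdfs` may apply additional rules.

```python
def select_key_pages(page_texts, keywords, cover_index=0):
    """Return the cover page index followed by the indices of pages containing any keyword."""
    if not page_texts:
        return []
    selected = [cover_index]
    for index, text in enumerate(page_texts):
        if index != cover_index and any(keyword in text for keyword in keywords):
            selected.append(index)
    return selected


# Made-up page texts standing in for the text extracted from one report.
sample_pages = [
    "Weekly Report No. 24",
    "Interest rates",
    "Monthly GDP ... ECONOMIC SECTORS",
    "Exchange rate",
    "Quarterly GDP ... ECONOMIC SECTORS",
]
key_pages = select_key_pages(sample_pages, keywords=["ECONOMIC SECTORS"])
print(key_pages)
```

Working on plain strings keeps the rule easy to test; in practice the page texts would come from whichever PDF library reads the report.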

In [7]:
# Run the function to generate trimmed PDFs for input
generate_input_pdfs(
    raw_pdf_folder = raw_pdf,
    input_pdf_folder = input_pdf,
    input_pdf_record_folder = record,
    input_pdf_record_txt = 'generated_input_pdfs.txt',
    keywords = ["ECONOMIC SECTORS"]
)


📂 Processing folder: 2013

Generating input PDFs in 2013:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2013'? (y = yes / n = no):  y

📂 Processing folder: 2014

Generating input PDFs in 2014:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2014'? (y = yes / n = no):  y

📂 Processing folder: 2015

Generating input PDFs in 2015:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2015'? (y = yes / n = no):  y

📂 Processing folder: 2016

Generating input PDFs in 2016:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2016'? (y = yes / n = no):  y

📂 Processing folder: 2017

Generating input PDFs in 2017:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2017'? (y = yes / n = no):  y

📂 Processing folder: 2018

Generating input PDFs in 2018:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2018'? (y = yes / n = no):  y

📂 Processing folder: 2019

Generating input PDFs in 2019:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2019'? (y = yes / n = no):  y

📂 Processing folder: 2020

Generating input PDFs in 2020:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2020'? (y = yes / n = no):  y

📂 Processing folder: 2021

Generating input PDFs in 2021:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2021'? (y = yes / n = no):  y

📂 Processing folder: 2022

Generating input PDFs in 2022:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2022'? (y = yes / n = no):  y

📂 Processing folder: 2023

Generating input PDFs in 2023:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2023'? (y = yes / n = no):  y

📂 Processing folder: 2024

Generating input PDFs in 2024:   0%|          | 0/12

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (12 new, 0 skipped)

Do you want to continue to the next folder after '2024'? (y = yes / n = no):  y

📂 Processing folder: 2025

Generating input PDFs in 2025:   0%|          | 0/4

✅ Shortened PDFs saved in 'digital_pdf\input_pdf' (4 new, 0 skipped)

Do you want to continue to the next folder after '2025'? (y = yes / n = no):  y

📊 Summary:

📂 13 folders (years) found containing raw PDFs
🗂️ Already generated input PDFs: 0
➕ Newly generated input PDFs: 148
⏱️ Time: 284 seconds


Again, the trimmed WR PDFs were probably saved unsorted in the `input_pdf` folder. The following code sorts them into year subfolders, placing each WR (which now includes only the key tables) according to its year of publication. This happens in the **"blink of an eye"**.  

In [9]:
# Get the list of files in the directory
files = os.listdir(input_pdf)

# Call the function to organize files
organize_files_by_year(input_pdf)

## 3. Data cleaning

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
Now that we have PDFs <span style="font-size: 24px;">&#128462;</span> containing only the tables required for this project, we can extract those tables and then proceed with data cleaning.
</p>  
</div>

### 3.2 Extracting tables and data cleanup

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
The main library used for extracting tables from PDFs <span style="font-size: 24px;">&#128462;</span> is <code>pdfplumber</code>. You can review the official documentation by clicking <a href="https://github.com/jsvine/pdfplumber" style="color: rgb(0, 153, 123); font-size: 16px;">here</a>.
</p>
    
<p>     
    The functions in <b>Section 3</b> of the <code>new_gdp_datasets_functions.py</code> script were built to deal with the issues in the table information of the PDFs. An interesting exercise is to compare the original tables (the ones in the PDF <span style="font-size: 24px;">&#128462;</span>) with the cleaned tables (produced by the cleanup code below). To make this possible, the cleanup code for <a href="#3-2-1" style="color: rgb(0, 153, 123); font-size: 16px;">Table 1</a> and <a href="#3-2-2" style="color: rgb(0, 153, 123); font-size: 16px;">Table 2</a> generates two dictionaries: the first stores the raw tables, that is, the original tables extracted from the PDF <span style="font-size: 24px;">&#128462;</span>, while the second stores the fully cleaned tables.
</p>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The code iterates through each PDF <span style="font-size: 24px;">&#128462;</span> and extracts the two required tables from each. The extracted information is then converted into dataframes, and the column names and values are cleaned up to follow Python conventions (Pythonic style).
    </div>
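
To make the idea of Pythonic column names and values concrete, here is a minimal sketch of two typical cleanup steps: turning a header such as `SECTORES ECONÓMICOS` into `sectores_economicos`, and turning a decimal-comma string such as `16,2` into the float `16.2`. The helpers `to_snake_ascii` and `parse_decimal_comma` are hypothetical and only illustrate the kind of work done by the project's cleaning functions.

```python
import re
import unicodedata


def to_snake_ascii(label):
    """Strip accents, lowercase, and join the remaining words with underscores."""
    normalized = unicodedata.normalize("NFKD", label)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^0-9a-z]+", "_", ascii_only.lower()).strip("_")


def parse_decimal_comma(value):
    """Convert a string that uses a decimal comma (e.g. '16,2') to a float."""
    return float(value.replace(",", "."))


clean_name = to_snake_ascii("SECTORES ECONÓMICOS")
clean_values = [parse_decimal_comma(item) for item in "16,2 4,6 -17,3".split()]
print(clean_name)
print(clean_values)
```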

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.2.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    <span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">Table 1.</span> Extraction and cleaning of data from tables on monthly real GDP growth rates.
    </span>
    </h3>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
The basic criterion for extracting a table is the presence of keywords (a sufficient condition): any table containing the following keywords qualifies for extraction.
</p>
</div>

In [None]:
# Keywords to search in the page text
keywords = ["ECONOMIC SECTORS"]

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Please check that the flat file <b>"ns_dates.csv"</b> is updated with the dates, years and ids for the newly downloaded PDFs <span style="font-size: 24px;">&#128462;</span> (WR). That file is located in the <b>"ns_dates"</b> folder and is uploaded to SQL from the Jupyter notebook <code>aux_files_to_sql.ipynb</code>.
    </span>
</div>

In [36]:
import os
import re
import locale
import pdfplumber
import tabula
import pandas as pd
from tkinter import Tk, messagebox

# Set the locale to Spanish
locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')

# Dictionary to store generated DataFrames
new_dataframes_dict_1 = {}

# Path for the processed folders log file
record_path = 'record/new_processed_folders_1.txt'

# Function to correct month names
def correct_month_name(month):
    months_mapping = {
        'setiembre': 'septiembre',
        # Add more mappings as needed
    }
    return months_mapping.get(month, month)

# Function to register processed folder
def register_processed_folder(folder, num_processed_files):
    with open(record_path, 'a') as file:
        file.write(f"{folder}:{num_processed_files}\n")

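# Function to check if folder has been processed
def folder_processed(folder):
    if not os.path.exists(record_path):
        return False
    with open(record_path, 'r') as file:
        for line in file:
            # Each record is 'folder:count'; compare the folder part exactly
            # so that one folder name cannot match as a prefix of another.
            if line.rsplit(':', 1)[0] == folder:
                return True
    return False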

# Function to process a PDF file (extract the table from the first page)
def process_pdf(pdf_path):
    new_tables_dict_1 = {}

    # Parse the WR issue number and year from names such as 'ns-24-2019.pdf'
    filename = os.path.basename(pdf_path)
    id_ns_year_matches = re.findall(r'ns-(\d+)-(\d{4})', filename)
    if id_ns_year_matches:
        id_ns, year = id_ns_year_matches[0]
    else:
        print("No matches found for id_ns and year in filename:", filename)
        return None, None, None, None

    new_filename = os.path.splitext(filename)[0].replace('-', '_')

    # Extract table from first page only
    tables = tabula.read_pdf(pdf_path, pages=1, multiple_tables=False, stream=True)
    for table_df in tables:
        dataframe_name = f"{new_filename}_1"
        new_tables_dict_1[dataframe_name] = table_df

    return id_ns, year, new_tables_dict_1, 1  # Fourth value is a fixed placeholder (formerly keyword_count)

# Function to process folder
def process_folder(folder):
    print(f"Processing folder {os.path.basename(folder)}")
    pdf_files = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith('.pdf')]

    num_pdfs_processed = 0
    num_dataframes_generated = 0
    table_counter = 1
    new_tables_dict_1 = {}

    for pdf_file in pdf_files:
        id_ns, year, tables_dict_temp, _ = process_pdf(pdf_file)

        if tables_dict_temp:
            for dataframe_name, df in tables_dict_temp.items():
                file_name = os.path.splitext(os.path.basename(pdf_file))[0].replace('-', '_')
                dataframe_name = f"{file_name}_1"

                # Store raw DataFrame
                new_tables_dict_1[dataframe_name] = df.copy()

                # Apply cleaning pipeline
                df_clean = df.copy()

                # Use your same cleaning functions as before
                if any(col.isdigit() and len(col) == 4 for col in df_clean.columns):
                    df_clean = swap_nan_se(df_clean)
                    df_clean = split_column_by_pattern(df_clean)
                    df_clean = drop_rare_caracter_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = relocate_last_columns(df_clean)
                    df_clean = replace_first_dot(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = replace_var_perc_first_column(df_clean)
                    df_clean = replace_var_perc_last_columns(df_clean)
                    df_clean = replace_number_moving_average(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = exchange_values(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = find_year_column(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = get_months_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = replace_mineria(df_clean)
                    df_clean = replace_mining(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)
                else:
                    df_clean = check_first_row(df_clean)
                    df_clean = check_first_row_1(df_clean)
                    df_clean = replace_first_row_with_columns(df_clean)
                    df_clean = swap_nan_se(df_clean)
                    df_clean = split_column_by_pattern(df_clean)
                    df_clean = drop_rare_caracter_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = relocate_last_columns(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = replace_var_perc_first_column(df_clean)
                    df_clean = replace_var_perc_last_columns(df_clean)
                    df_clean = replace_number_moving_average(df_clean)
                    df_clean = expand_column(df_clean)
                    df_clean = split_values_1(df_clean)
                    df_clean = split_values_2(df_clean)
                    df_clean = split_values_3(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = exchange_values(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = find_year_column(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = get_months_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_nan_with_previous_column_1(df_clean)
                    df_clean = replace_nan_with_previous_column_2(df_clean)
                    df_clean = replace_nan_with_previous_column_3(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = replace_mineria(df_clean)
                    df_clean = replace_mining(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)

                # Add 'year' and 'id_ns' columns
                df_clean.insert(0, 'year', year)
                df_clean.insert(1, 'id_ns', id_ns)

                # Store cleaned DataFrame
                new_dataframes_dict_1[dataframe_name] = df_clean

                print(f'  {table_counter}. DataFrame generated for file {pdf_file}: {dataframe_name}')
                num_dataframes_generated += 1
                table_counter += 1
        
        num_pdfs_processed += 1

    return num_pdfs_processed, num_dataframes_generated, new_tables_dict_1

# Function to process folders
def process_folders():
    input_pdf_folder = input_pdf
    folders = [os.path.join(input_pdf_folder, d) for d in os.listdir(input_pdf_folder) if os.path.isdir(os.path.join(input_pdf_folder, d))]
    
    new_tables_dict_1 = {}
    
    for folder in folders:
        if folder_processed(folder):
            print(f"Folder {folder} has already been processed.")
            continue
        
        num_pdfs_processed, num_dataframes_generated, tables_dict_temp = process_folder(folder)
        
        new_tables_dict_1.update(tables_dict_temp)
        
        register_processed_folder(folder, num_pdfs_processed)

        root = Tk()
        root.withdraw()
        root.attributes('-topmost', True)
        message = f"Process {folder} complete. Processed {num_pdfs_processed} PDF(s) and generated {num_dataframes_generated} DataFrame(s). Continue with next folder?"
        keep_going = messagebox.askyesno("Continue?", message)
        root.destroy()  # Close the hidden Tk root so windows do not accumulate
        if not keep_going:
            break
            
    print("Processing completed for all folders.")
    
    return new_tables_dict_1

if __name__ == '__main__':
    new_tables_dict_1 = process_folders()



Folder digital_pdf\input_pdf\2019 has already been processed.
Processing folder 2020
  1. DataFrame generated for file digital_pdf\input_pdf\2020\ns-03-2020.pdf: ns_03_2020_1
  2. DataFrame generated for file digital_pdf\input_pdf\2020\ns-07-2020.pdf: ns_07_2020_1
  3. DataFrame generated for file digital_pdf\input_pdf\2020\ns-11-2020.pdf: ns_11_2020_1
  4. DataFrame generated for file digital_pdf\input_pdf\2020\ns-16-2020.pdf: ns_16_2020_1
  5. DataFrame generated for file digital_pdf\input_pdf\2020\ns-20-2020.pdf: ns_20_2020_1
  6. DataFrame generated for file digital_pdf\input_pdf\2020\ns-24-2020.pdf: ns_24_2020_1
  7. DataFrame generated for file digital_pdf\input_pdf\2020\ns-28-2020.pdf: ns_28_2020_1
  8. DataFrame generated for file digital_pdf\input_pdf\2020\ns-32-2020.pdf: ns_32_2020_1
  9. DataFrame generated for file digital_pdf\input_pdf\2020\ns-36-2020.pdf: ns_36_2020_1
  10. DataFrame generated for file digital_pdf\input_pdf\2020\ns-39-2020.pdf: ns_39_2020_1
  11. DataFram

In [28]:
import pdfplumber
import os

# Path to the PDF (relative to the notebook's working directory)
pdf_path = os.path.join(input_pdf, '2019', 'ns-24-2019.pdf')

# Open PDF with pdfplumber
with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]  # only first page

    # Extract tables
    tables = first_page.extract_tables()

    # Inspect tables
    if tables:
        for i, table in enumerate(tables, 1):
            print(f"\nTable {i}:")
            for row in table:
                print(row)
    else:
        print("No tables found on the first page.")



Table 1:
['SECTORES ECONÓMICOS', '2018', '2019', '', 'ECONOMIC SECTORS']
[None, 'May. Jun. Jul. Ago. Sep. Oct. Nov. Dic. Año', 'Ene. Feb. Mar. Abr. May.', 'Ene.-May.', None]
['Agropecuario 2/\nAgrícola\nPecuario\nPesca\nMinería e hidrocarburos 3/\nMinería metálica\nHidrocarburos\nManufactura 4/\nProcesadores recursos primarios\nManufactura no primaria\nElectricidad y agua\nConstrucción\nComercio\nOtros servicios\nDerechos de importación y otros impuestos\nPBI\nSectores primarios\nSectores no primarios\nPBI desestacionalizado 5/', '16,2 4,6 4,6 8,3 7,3 8,2 5,9 3,1 7,8\n19,0 3,3 5,1 10,5 9,5 10,2 6,4 1,0 9,4\n9,8 7,7 3,8 5,7 5,0 6,0 5,2 6,2 5,5\n26,7 -7,9 -17,3 26,0 19,7 22,7 188,5 225,9 39,7\n2,1 -4,6 -5,2 -3,9 0,8 -2,4 -2,5 -1,2 -1,3\n0,4 -5,7 -5,7 0,1 -1,4 -3,1 -3,7 -1,7 -1,5\n12,5 2,6 -2,2 -26,3 15,5 2,2 4,8 1,4 0,0\n10,6 1,6 2,1 2,1 1,1 9,8 12,1 12,4 6,2\n23,5 -0,4 -6,5 -0,8 3,3 8,9 41,0 47,0 13,2\n4,8 2,4 5,0 3,0 0,5 10,0 3,4 1,7 3,7\n4,4 4,2 4,6 3,3 3,8 5,1 6,5 7,5 4,4\n1

In [31]:
new_tables_dict_1.keys()

dict_keys(['ns_24_2019_1'])

In [32]:
new_dataframes_dict_1.keys()

dict_keys(['ns_24_2019_1'])

In [34]:
new_tables_dict_1['ns_24_2019_1'].head(5)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,2018,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,2019,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,SECTORES ECON√ìMICOS,,,,,,,,,,,,,,,,ECONOMIC SECTORS
1,,May.,Jun.,Jul.,Ago.,Sep.,Oct.,Nov.,Dic.,A√±o,Ene.,Feb.,Mar.,Abr.,May.,Ene.-May.,
2,Agropecuario 2/,162,46,46,83,73,82,59,31,78,45,49,53,30,12,"3,5 Agriculture and Livestock 2/",
3,Agrícola,190,33,51,105,95,102,64,10,94,42,52,59,24,02,30,Agriculture
4,Pecuario,98,77,38,57,50,60,52,62,55,48,46,44,43,38,44,Livestock


In [35]:
df_1 = new_dataframes_dict_1['ns_24_2019_1']
df_1

Unnamed: 0,year,id_ns,sectores_economicos,economic_sectors,2018_may,2018_jun,2018_jul,2018_ago,2018_sep,2018_oct,2018_nov,2018_dic,2018_year,2019_ene,2019_feb,2019_mar,2019_abr,2019_may,2019_ene_may
1,2019,24,agropecuario,agriculture and livestock,16.2,4.6,4.6,8.3,7.3,8.2,5.9,3.1,7.8,4.5,4.9,5.3,3.0,1.2,3.5
2,2019,24,agricola,agriculture,19.0,3.3,5.1,10.5,9.5,10.2,6.4,1.0,9.4,4.2,5.2,5.9,2.4,0.2,3.0
3,2019,24,pecuario,livestock,9.8,7.7,3.8,5.7,5.0,6.0,5.2,6.2,5.5,4.8,4.6,4.4,4.3,3.8,4.4
4,2019,24,pesca,fishing,26.7,-7.9,-17.3,26.0,19.7,22.7,188.5,225.9,39.7,-31.3,-9.5,-7.4,-63.0,-26.8,-33.6
5,2019,24,mineria e hidrocarburos,mining and fuel,2.1,-4.6,-5.2,-3.9,0.8,-2.4,-2.5,-1.2,-1.3,-1.3,-0.7,0.1,-2.9,-1.5,-1.2
6,2019,24,mineria metalica,metals,0.4,-5.7,-5.7,0.1,-1.4,-3.1,-3.7,-1.7,-1.5,-1.4,-5.9,0.3,-1.7,-0.2,-1.7
7,2019,24,hidrocarburos,fuel,12.5,2.6,-2.2,-26.3,15.5,2.2,4.8,1.4,0.0,-0.7,40.0,-0.4,-9.0,-8.8,1.5
8,2019,24,manufactura,manufacturing,10.6,1.6,2.1,2.1,1.1,9.8,12.1,12.4,6.2,-5.4,-1.3,3.7,-13.2,-6.8,-4.9
9,2019,24,procesadores recursos primarios,based on raw materials,23.5,-0.4,-6.5,-0.8,3.3,8.9,41.0,47.0,13.2,-28.3,-10.2,3.1,-34.0,-19.0,-19.7
10,2019,24,manufactura no primaria,nonprimary,4.8,2.4,5.0,3.0,0.5,10.0,3.4,1.7,3.7,4.1,1.4,3.9,-3.3,0.0,1.2


In [None]:
df_1[(df_1['sectores_economicos'] == 'agropecuario') | (df_1['economic_sectors'] == 'agriculture and livestock')]

<div id="3-2-2">
   <!-- Contenido de la celda de destino -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.2.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    <span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">Table 2.</span> Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.
    </span>
    </h3>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
Tables are selected for extraction by keyword. Containing one of the keywords below is a sufficient condition: any table that includes it qualifies for extraction.
</p>
</div>

In [None]:
# Keywords to search in the page text
keywords = ["ECONOMIC SECTORS"]
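
As an illustration of the keyword criterion, the sketch below checks page text against the keyword list. It is a minimal example, not the notebook's actual selection logic: `page_matches` is a hypothetical helper, and the page strings stand in for whatever text a PDF reader such as `pdfplumber` would return.

```python
def page_matches(page_text, keywords):
    """Return True when any keyword occurs in the page text (case-insensitive)."""
    text = (page_text or "").upper()  # tolerate a missing text value
    return any(keyword.upper() in text for keyword in keywords)


keywords = ["ECONOMIC SECTORS"]

# Hypothetical page texts standing in for extracted PDF text
pages = [
    "Nota Semanal: resumen informativo",
    "SECTORES ECONÓMICOS 2018 2019 ECONOMIC SECTORS",
]

# Page numbers (1-based) whose text contains at least one keyword
matching_pages = [i for i, text in enumerate(pages, start=1) if page_matches(text, keywords)]
print(matching_pages)
```

Matching on upper-cased text keeps the check insensitive to capitalization differences between reports.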

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Please check that the flat file <b>"ns_dates.csv"</b> is updated with the dates, years and ids of the newly downloaded PDFs <span style="font-size: 24px;">&#128462;</span> (WR). The file is located in the <code>ns_dates</code> folder and is uploaded to SQL from the Jupyter notebook <code>aux_files_to_sql.ipynb</code>.
    </span>
</div>
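
One way to run this check before processing is sketched below. It compares the `id_ns` and `year` parsed from the PDF file names with the rows of the CSV. The column names `year` and `id_ns`, the helper `missing_from_ns_dates`, and the sample content are assumptions for illustration; the real layout of `ns_dates.csv` may differ.

```python
import csv
import io
import re


def missing_from_ns_dates(pdf_filenames, csv_file):
    """Return (id_ns, year) pairs parsed from file names that have no row in the CSV."""
    known = {(int(row["id_ns"]), int(row["year"])) for row in csv.DictReader(csv_file)}
    missing = []
    for name in pdf_filenames:
        match = re.search(r"ns-(\d+)-(\d{4})", name)
        if match is None:
            continue  # not a WR file name
        pair = (int(match.group(1)), int(match.group(2)))
        if pair not in known:
            missing.append(pair)
    return missing


# Hypothetical CSV content; replace with open("ns_dates/ns_dates.csv") in practice
sample_csv = io.StringIO("year,id_ns\n2019,24\n")
missing = missing_from_ns_dates(["ns-24-2019.pdf", "ns-25-2019.pdf"], sample_csv)
print(missing)
```

An empty list means every downloaded WR already has an entry in the flat file.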

In [39]:
import os
import re
import locale
import pdfplumber
import tabula
import pandas as pd
import numpy as np
from tkinter import Tk, messagebox

# Set the locale to Spanish
locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')

# Dictionary to store generated DataFrames
new_dataframes_dict_2 = {}

# Path for the processed folders log file
record_path = 'record/new_processed_folders_2.txt'

# Function to correct month names
def correct_month_name(month):
    months_mapping = {
        'setiembre': 'septiembre',
        # Add more mappings as needed
    }
    return months_mapping.get(month, month)

# Function to register processed folder
def register_processed_folder(folder, num_processed_files):
    with open(record_path, 'a') as file:
        file.write(f"{folder}:{num_processed_files}\n")
        
# Function to check if folder has been processed
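def folder_processed(folder):
    if not os.path.exists(record_path):
        return False
    with open(record_path, 'r') as file:
        for line in file:
            # Each record is "<folder>:<count>"; compare the whole folder name so that
            # one folder is never mistaken for another that shares its prefix
            if line.rstrip('\n').rsplit(':', 1)[0] == folder:
                return True
    return False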

# Function to process PDF file (extract table from second page)
def process_pdf(pdf_path):
    new_tables_dict_2 = {}  # Local dictionary for each PDF
    table_counter = 1

    filename = os.path.basename(pdf_path)
    id_ns_year_matches = re.findall(r'ns-(\d+)-(\d{4})', filename)
    if id_ns_year_matches:
        id_ns, year = id_ns_year_matches[0]
    else:
        print("No matches found for id_ns and year in filename:", filename)
        return None, None, None, None

    new_filename = os.path.splitext(filename)[0].replace('-', '_')

    # Extract table from second page only (tabula returns a list of DataFrames)
    tables = tabula.read_pdf(pdf_path, pages=2, multiple_tables=False, stream=True)
    dataframe_name = f"{new_filename}_2"
    for table_df in tables:
        # Every table read from page 2 is stored under the same key
        new_tables_dict_2[dataframe_name] = table_df

    return id_ns, year, new_tables_dict_2, 2  # keyword_count replaced by 2

# Function to process folder
def process_folder(folder):
    print(f"Processing folder {os.path.basename(folder)}")
    pdf_files = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith('.pdf')]

    num_pdfs_processed = 0
    num_dataframes_generated = 0
    table_counter = 1
    new_tables_dict_2 = {}  # Declare tables_dict outside main loop
    
    for pdf_file in pdf_files:
        id_ns, year, tables_dict_temp, _ = process_pdf(pdf_file)

        if tables_dict_temp:
            for dataframe_name, df in tables_dict_temp.items():
                file_name = os.path.splitext(os.path.basename(pdf_file))[0].replace('-', '_')
                dataframe_name = f"{file_name}_2"

                # Store raw DataFrame
                new_tables_dict_2[dataframe_name] = df.copy()

                # Apply cleaning pipeline
                df_clean = df.copy()
                if pd.isna(df_clean.iloc[0, 0]):
                    # Cleaning pipeline for tables whose first cell is missing
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = separate_years(df_clean)
                    df_clean = relocate_roman_numerals(df_clean)
                    df_clean = extract_mixed_values(df_clean)
                    df_clean = replace_first_row_nan(df_clean)
                    df_clean = first_row_columns(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = drop_nan_row(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = split_values(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = roman_arabic(df_clean)
                    df_clean = fix_duplicates(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = get_quarters_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = replace_mineria(df_clean)
                    df_clean = replace_mining(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)
                else:
                    # Cleaning pipeline for tables whose first cell is populated
                    df_clean = exchange_roman_nan(df_clean)
                    df_clean = exchange_columns(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = last_column_es(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = roman_arabic(df_clean)
                    df_clean = fix_duplicates(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = get_quarters_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = replace_mineria(df_clean)
                    df_clean = replace_mining(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)

                # Add 'year' and 'id_ns' columns
                df_clean.insert(0, 'year', year)
                df_clean.insert(1, 'id_ns', id_ns)

                # Store cleaned DataFrame
                new_dataframes_dict_2[dataframe_name] = df_clean

                print(f'  {table_counter}. DataFrame generated for file {pdf_file}: {dataframe_name}')
                num_dataframes_generated += 1
                table_counter += 1
                    
        num_pdfs_processed += 1

    return num_pdfs_processed, num_dataframes_generated, new_tables_dict_2
        
# Function to process folders
def process_folders():
    input_pdf_folder = input_pdf
    folders = [os.path.join(input_pdf_folder, d) for d in os.listdir(input_pdf_folder) if os.path.isdir(os.path.join(input_pdf_folder, d))]

    new_tables_dict_2 = {}
    
    for folder in folders:
        if folder_processed(folder):
            print(f"Folder {folder} has already been processed.")
            continue
        
        num_pdfs_processed, num_dataframes_generated, tables_dict_temp = process_folder(folder)
        
        new_tables_dict_2.update(tables_dict_temp)
        
        register_processed_folder(folder, num_pdfs_processed)
        
        root = Tk()
        root.withdraw()
        root.attributes('-topmost', True)
        message = f"Processing of {folder} complete. Processed {num_pdfs_processed} PDF(s) and generated {num_dataframes_generated} DataFrame(s). Continue with next folder?"
        keep_going = messagebox.askyesno("Continue?", message)
        root.destroy()  # Close the hidden Tk root so windows do not accumulate
        if not keep_going:
            break
            
    print("Processing completed for all folders.")

    return new_tables_dict_2
    
if __name__ == "__main__":
    new_tables_dict_2 = process_folders()


Processing folder 2013
  1. DataFrame generated for file digital_pdf\input_pdf\2013\ns-04-2013.pdf: ns_04_2013_2
  2. DataFrame generated for file digital_pdf\input_pdf\2013\ns-08-2013.pdf: ns_08_2013_2
  3. DataFrame generated for file digital_pdf\input_pdf\2013\ns-12-2013.pdf: ns_12_2013_2
  4. DataFrame generated for file digital_pdf\input_pdf\2013\ns-16-2013.pdf: ns_16_2013_2
  5. DataFrame generated for file digital_pdf\input_pdf\2013\ns-21-2013.pdf: ns_21_2013_2
  6. DataFrame generated for file digital_pdf\input_pdf\2013\ns-25-2013.pdf: ns_25_2013_2
  7. DataFrame generated for file digital_pdf\input_pdf\2013\ns-29-2013.pdf: ns_29_2013_2
  8. DataFrame generated for file digital_pdf\input_pdf\2013\ns-33-2013.pdf: ns_33_2013_2
  9. DataFrame generated for file digital_pdf\input_pdf\2013\ns-37-2013.pdf: ns_37_2013_2
  10. DataFrame generated for file digital_pdf\input_pdf\2013\ns-42-2013.pdf: ns_42_2013_2
  11. DataFrame generated for file digital_pdf\input_pdf\2013\ns-46-2013.pdf

  9. DataFrame generated for file input_pdf\2024\ns-09-2024.pdf: ns_09_2024_2
  10. DataFrame generated for file input_pdf\2024\ns-10-2024.pdf: ns_10_2024_2
  11. DataFrame generated for file input_pdf\2024\ns-11-2024.pdf: ns_11_2024_2
  12. DataFrame generated for file input_pdf\2024\ns-12-2024.pdf: ns_12_2024_2
  13. DataFrame generated for file input_pdf\2024\ns-13-2024.pdf: ns_13_2024_2
  14. DataFrame generated for file input_pdf\2024\ns-14-2024.pdf: ns_14_2024_2
  15. DataFrame generated for file input_pdf\2024\ns-15-2024.pdf: ns_15_2024_2
  16. DataFrame generated for file input_pdf\2024\ns-16-2024.pdf: ns_16_2024_2
  17. DataFrame generated for file input_pdf\2024\ns-17-2024.pdf: ns_17_2024_2
  18. DataFrame generated for file input_pdf\2024\ns-18-2024.pdf: ns_18_2024_2
  19. DataFrame generated for file input_pdf\2024\ns-19-2024.pdf: ns_19_2024_2
  20. DataFrame generated for file input_pdf\2024\ns-20-2024.pdf: ns_20_2024_2
  21. DataFrame generated for file input_pdf\2024\ns-

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

In [None]:
new_tables_dict_2.keys()

In [None]:
new_dataframes_dict_2.keys()

In [None]:
new_tables_dict_2['ns_43_2024_2'].head(5)

In [None]:
new_dataframes_dict_2['ns_43_2024_2']

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="4">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Real-time data of Peru's GDP growth rates</span></h1>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
This section builds the GDP growth rate vintages for Peru from <a href="#3-2-1" style="color: rgb(0, 153, 123); font-size: 16px;">Table 1</a> and <a href="#3-2-2" style="color: rgb(0, 153, 123); font-size: 16px;">Table 2</a>. In the previous section, each table was extracted and cleaned individually for every WR (PDF <span style="font-size: 24px;">&#128462;</span>). Here we concatenate the tables for a given economic sector across all WRs, which yields a vintage dataset of (real) GDP growth by economic sector from <b>2013</b> to <b>2024</b>.
</div>
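
To make the idea of a vintage dataset concrete, the sketch below stacks the row of one sector from several WRs, ordered by release. It uses plain Python records in place of DataFrames so that it runs without extra libraries. `build_vintages` is a hypothetical helper, not the notebook's `concatenate_*_df` functions, and the `ns_28_2019_1` values are made up purely to show how a revision would appear.

```python
def build_vintages(frames, sector_es, sector_en):
    """Collect one row per WR for the chosen sector, ordered by (year, id_ns)."""
    rows = []
    for records in frames.values():
        for record in records:
            if (record["sectores_economicos"] == sector_es
                    or record["economic_sectors"] == sector_en):
                rows.append(record)
    # year and id_ns are parsed from file names as strings, so sort numerically
    return sorted(rows, key=lambda r: (int(r["year"]), int(r["id_ns"])))


frames = {
    # Made-up later vintage, only to illustrate a revised figure
    "ns_28_2019_1": [
        {"year": "2019", "id_ns": "28", "sectores_economicos": "agropecuario",
         "economic_sectors": "agriculture and livestock", "2019_abr": 3.1, "2019_may": 1.4},
    ],
    # Values copied from the ns_24_2019_1 output shown earlier
    "ns_24_2019_1": [
        {"year": "2019", "id_ns": "24", "sectores_economicos": "agropecuario",
         "economic_sectors": "agriculture and livestock", "2019_abr": 3.0, "2019_may": 1.2},
        {"year": "2019", "id_ns": "24", "sectores_economicos": "pesca",
         "economic_sectors": "fishing", "2019_abr": -63.0, "2019_may": -26.8},
    ],
}

vintages = build_vintages(frames, "agropecuario", "agriculture and livestock")
for row in vintages:
    print(row["year"], row["id_ns"], row["2019_may"])
```

Each output row is one vintage: the same target period (`2019_may`) as reported in successive WRs.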

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    <span style="font-size: 24px; color: #FFA823; font-weight: bold;">&#9888;</span>
You can create the data manually, step by step, for specific sectors or frequencies (Section 4.1). Alternatively, you can use the automated approach, which generates the data for all sectors and frequencies at once (Section 4.2).
</div>

<div id="4-1">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Manual creation of real-time data: sector by sector and frequency by frequency.
    </span>
    </h2>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    With this method you can create and inspect the dataset sector by sector and frequency by frequency. This is useful if you want to create data only for particular sectors and frequencies.
</div>

<div id="select_sector">
   <!-- Contenido de la celda de destino -->
</div>

<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Select <code>sector_economico</code> and <code>economic_sector</code></span></h1>
    </div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
Running the following code opens a window with options in <b>Spanish</b> and <b>English</b> for selecting an <b>economic sector</b>. The selection determines which sector's GDP growth rates (annual, quarterly or monthly) are concatenated.
</p>
</div>

In [None]:
# Call the function to display the window and capture the selected values
selected_spanish, selected_english, sector = show_option_window()

# Display the selected values
print(f"You have selected sector = {sector}, selected_spanish = {selected_spanish}, and selected_english = {selected_english}.")

<div id="select_freq">
   <!-- Contenido de la celda de destino -->
</div>

<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Select <code>frequency</code></span></h1>
    </div>

In [None]:
# Call the function to show the popup window
frequency = show_frequency_window()
print("Selected frequency:", frequency)

<div id="counter">
   <!-- Contenido de la celda de destino -->
</div>

<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Set counter (dataframe name suffix)</span></h1>
    </div>

In [None]:
# Set the counter (dataframe name suffix): 1 for monthly data (Table 1), 2 for quarterly and annual data (Table 2)
if frequency == "monthly":
    counter = 1
elif frequency == "quarterly":
    counter = 2
elif frequency == "annual":
    counter = 2
else:
    counter = None 

print(counter)
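
The same mapping can also be written as a dictionary lookup, which keeps the frequencies and their suffixes in one place. This is only an alternative sketch; `FREQUENCY_TO_COUNTER` and `counter_for` are hypothetical names that do not appear elsewhere in the notebook.

```python
# Table 1 holds monthly data; Table 2 holds quarterly and annual data
FREQUENCY_TO_COUNTER = {"monthly": 1, "quarterly": 2, "annual": 2}


def counter_for(frequency):
    """Return the dictionary suffix for a frequency, or None if it is unknown."""
    return FREQUENCY_TO_COUNTER.get(frequency)


print(counter_for("monthly"), counter_for("annual"), counter_for("weekly"))
```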

<div id="4-1-1">
   <!-- Contenido de la celda de destino -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.1.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Growth rates datasets concatenation for all frequencies
    </span>
    </h3>

In [None]:
# Dynamically construct the function name and dictionary name
function_name = f"concatenate_{frequency}_df"
dataframe_dict_name = f"new_dataframes_dict_{counter}"

# Check that both the function and dictionary exist in the global scope
if function_name in globals() and dataframe_dict_name in globals():
    # Call the function using its reference from globals()
    globals()[f"new_{sector}_{frequency}_growth_rates"] = globals()[function_name](
        globals()[dataframe_dict_name], selected_spanish, selected_english
    )
else:
    print(f"Error: {function_name} or {dataframe_dict_name} does not exist in the global scope.")
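
Looking names up in `globals()` works in a notebook, but an explicit registry makes the available functions visible and fails loudly on an unknown frequency. The sketch below shows that alternative under stated assumptions: the two `concatenate_*_df` functions are stand-ins that return a tuple instead of a DataFrame, and `CONCATENATORS`, `concatenate` and `results` are hypothetical names.

```python
def concatenate_monthly_df(frames, sector_es, sector_en):
    """Stand-in for the notebook's real function; returns a label, not a DataFrame."""
    return ("monthly", sector_es, sector_en, len(frames))


def concatenate_quarterly_df(frames, sector_es, sector_en):
    """Stand-in for the notebook's real function; returns a label, not a DataFrame."""
    return ("quarterly", sector_es, sector_en, len(frames))


# Explicit registry: frequency -> concatenation function
CONCATENATORS = {
    "monthly": concatenate_monthly_df,
    "quarterly": concatenate_quarterly_df,
}


def concatenate(frequency, frames, sector_es, sector_en):
    """Dispatch to the function registered for the given frequency."""
    try:
        func = CONCATENATORS[frequency]
    except KeyError:
        raise ValueError(f"Unknown frequency: {frequency!r}") from None
    return func(frames, sector_es, sector_en)


# Store results in a dictionary keyed by (sector, frequency) instead of new globals
results = {}
results[("agropecuario", "monthly")] = concatenate(
    "monthly", {"ns_24_2019_1": None}, "agropecuario", "agriculture and livestock"
)
print(results)
```

Keeping results in a dictionary avoids creating dynamically named global variables, which are hard to list or clean up later.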

In [None]:
#pd.set_option('display.max_rows', None)
globals()[f"new_{sector}_{frequency}_growth_rates"].head(10)

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="4-1-2">
   <!-- Contenido de la celda de destino -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.1.2.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Uploading data to SQL</span></h3>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Finally, we upload all the datasets generated in this Jupyter notebook to the <code>'gdp_revisions_datasets'</code> database in <code>PostgreSQL</code>.
</div>

In [None]:
engine = create_sqlalchemy_engine()
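
`create_sqlalchemy_engine()` is defined elsewhere in the project, so its internals are not shown here. As a rough idea of what such a helper needs, the sketch below assembles a PostgreSQL connection URL in the format SQLAlchemy accepts. Reading the settings from `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT` and `PGDATABASE`, and the helper name `build_postgres_url`, are assumptions; the actual helper may obtain credentials differently.

```python
import os
from urllib.parse import quote


def build_postgres_url(env=None):
    """Build a SQLAlchemy-style PostgreSQL URL from environment-like settings."""
    env = os.environ if env is None else env
    user = env.get("PGUSER", "postgres")
    # Percent-encode the password so characters such as "@" or "/" do not break the URL
    password = quote(env.get("PGPASSWORD", ""), safe="")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "gdp_revisions_datasets")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"


# Hypothetical credentials, for illustration only
url = build_postgres_url({"PGUSER": "analyst", "PGPASSWORD": "p@ss/word"})
print(url)
```

The resulting string could then be passed to `sqlalchemy.create_engine(url)`, which keeps credentials out of the notebook itself.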

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Load the selected dataset into the database.
</div>

In [None]:
globals()[f"new_{sector}_{frequency}_growth_rates"].to_sql(f'new_{sector}_{frequency}_growth_rates', engine, index=False, if_exists='replace')

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 20px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#select_sector" style="color: rgb(255, 32, 78); text-decoration: none;">&#11165;</a>
    </span> 
    <a href="#select_sector" style="color: rgb(255, 32, 78); text-decoration: none;">Back to select sectors.</a>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 20px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#select_freq" style="color: rgb(255, 32, 78); text-decoration: none;">&#11165;</a>
    </span> 
    <a href="#select_freq" style="color: rgb(255, 32, 78); text-decoration: none;">Back to select frequency.</a>
</div>

<div id="4-2">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Automatic creation of real-time data: all sectors and frequencies at the same time.
    </span>
    </h2>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    With this method you can create the datasets for all sectors and all frequencies at the same time. This is more efficient when the goal is to generate every combination of <code>sector</code> and <code>frequency</code>, without excluding any.
</div>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    List of frequencies to be used to create concatenated datasets
    </div>

In [None]:
frequencies = [
        "monthly", 
        "quarterly",
        "annual"
    ]

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    Function to process growth rates datasets: concatenate and load to SQL
    </div>

In [None]:
def process_new_datasets_to_sql(sector, frequency):
    """Concatenate the dataset for one sector and frequency, load it to SQL, and return it (None on failure)."""

    # Set counter based on frequency
    if frequency == "monthly":
        counter = 1
    elif frequency in ["quarterly", "annual"]:
        counter = 2
    else:
        print(f"Unknown frequency: {frequency}")
        return None

    # Dynamically build function and dictionary names
    function_name = f"concatenate_{frequency}_df"
    dataframe_dict_name = f"new_dataframes_dict_{counter}"

    if function_name in globals() and dataframe_dict_name in globals():
        # Generate the DataFrame
        df_name = f"new_{sector}_{frequency}_growth_rates"
        globals()[df_name] = globals()[function_name](
            globals()[dataframe_dict_name], option_mapping[sector][0], option_mapping[sector][1]
        )

        # Load to SQL
        engine = create_sqlalchemy_engine()
        globals()[df_name].to_sql(df_name, engine, index=False, if_exists='replace')

        return globals()[df_name]
    else:
        print(f"Error: {function_name} or {dataframe_dict_name} does not exist in the global scope.")
        return None

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    Run the function to create concatenated datasets for all sectors and frequencies and load to SQL
    </div>

In [None]:
# Initialize counter
processed_datasets = 0

# Process all combinations
for sector in option_mapping.keys():
    for frequency in frequencies:
        print(f"Processing {sector} - {frequency}")
        df = process_new_datasets_to_sql(sector, frequency)
        if df is not None:
            display(df.head(10))  # Display the first 10 rows
            processed_datasets += 1  # Increment counter

# Display total number of processed datasets
print(f"Total datasets processed: {processed_datasets}")
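
When running every combination in one pass, it can help to record which combinations failed rather than only counting successes. The sketch below does this with `itertools.product`. The shape of `option_mapping` (sector key mapped to Spanish and English labels) is inferred from how it is indexed above, its entries here are illustrative, and `run_all` and `fake_process` are hypothetical stand-ins for the real processing function.

```python
from itertools import product

# Assumed shape: sector key -> (Spanish label, English label); entries are illustrative
option_mapping = {
    "agriculture": ("agropecuario", "agriculture and livestock"),
    "fishing": ("pesca", "fishing"),
}
frequencies = ["monthly", "quarterly", "annual"]


def run_all(process, sectors, frequencies):
    """Run process(sector, frequency) for every pair; return (done, failed) lists."""
    done, failed = [], []
    for sector, frequency in product(sectors, frequencies):
        try:
            result = process(sector, frequency)
        except Exception as exc:  # keep going so one failure does not stop the run
            failed.append((sector, frequency, repr(exc)))
            continue
        if result is None:
            failed.append((sector, frequency, "returned None"))
        else:
            done.append((sector, frequency))
    return done, failed


def fake_process(sector, frequency):
    """Stand-in for process_new_datasets_to_sql; simulates one failing upload."""
    if sector == "fishing" and frequency == "annual":
        raise RuntimeError("simulated upload error")
    return f"new_{sector}_{frequency}_growth_rates"


done, failed = run_all(fake_process, option_mapping, frequencies)
print(len(done), "processed;", len(failed), "failed")
```

The `failed` list gives the exact sector and frequency pairs to rerun after fixing the underlying problem.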

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div style="font-size: 16px; background-color: #F5F5F5; padding: 18px; line-height: 1.5; font-family: 'PT Serif Pro Book';">
    <span style="font-size: 24px; color: #FFA823; font-weight: bold;">&#9888;</span>
    Once you have all the datasets generated by this script (<code>new_gdp_datasets.ipynb</code>), you can concatenate them with those generated by the script <code>old_gdp_datasets.ipynb</code>. <b>Section 6</b> of the script <code>aux_files_to_sql.ipynb</code> concatenates both <b>new</b> and <b>old</b> datasets for <b>all sectors</b> and <b>all frequencies</b>.
</div>

---
---
