<div style="text-align: center; font-family: 'charter bt pro roman'; color: rgb(0, 65, 75);">
    <h1>
    GDP Revisions Datasets
    </h1>
</div>

<div style="text-align: center; font-family: 'charter bt pro roman'; color: rgb(0, 65, 75);">
    <h3>
        Documentation
        <br>
        ____________________
        <br>
    </h3>
</div>

<div style="text-align: center; font-family: 'PT Serif Pro Book'; color: rgb(0, 65, 75); font-size: 16px;">
    Jason Cruz
    <br>
    <a href="mailto:jj.cruza@up.edu.pe" style="color: rgb(0, 153, 123); font-size: 16px;">
        jj.cruza@up.edu.pe
    </a>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
This <span style="color: rgb(0, 65, 75);">Jupyter notebook</span> documents, step by step, the <b>construction of datasets</b> for the project <b>'Revisions and Biases in Preliminary GDP Estimates in Peru'</b>.

The notebook starts by downloading the Weekly Notes (NS) of the Central Reserve Bank of Peru (BCRP), which are stored on the BCRP website as PDF files, and ends by generating datasets of growth rates and revisions of Peru's GDP, which are loaded as tables into SQL. The NS report annual, quarterly and monthly GDP growth rates by economic sector. The main datasets used for the data analysis of this project are generated in this notebook using big data and machine learning techniques.
</div>

<div style="font-family: Amaya; text-align: left; color: rgb(0, 65, 75); font-size:16px">The following <b>outline is functional</b>. Use its links, together with the "Back to the outline" link at the end of each section, to navigate this notebook.</div>

<div id="outilne">
   <!-- Contenido de la celda de destino -->
</div>

<div style="background-color: #292929; padding: 10px; line-height: 1.5; font-family: 'PT Serif Pro Book';">
    <h2 style="text-align: left; color: #E0E0E0;">
        Outline
    </h2>
    <br>
    <a href="#libraries" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">
        Libraries</a>
    <br>
    <a href="#setup" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">
        Initial set-up</a>
    <br>
    <a href="#1" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">
        1. PDF Downloader</a>
    <br>
    <a href="#2" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">
        2. Generate PDF input with key tables</a>
    <br>
    <a href="#3" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">
        3. Data cleaning</a>
    <br>
    <a href="#3-1" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        3.1. A brief documentation on issues in the table information of the PDFs.</a>
    <br>
    <a href="#3-2" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        3.2. Extracting tables and data cleanup.</a>
    <br>
    <a href="#3-2-1" style="color: #94FFD8; font-size: 14px; margin-left: 40px;">
        3.2.1. Table 1. Extraction and cleaning of data from tables on monthly real GDP growth rates.</a>
    <br>
    <a href="#3-2-2" style="color: #94FFD8; font-size: 14px; margin-left: 40px;">
        3.2.2. Table 2. Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.</a>
    <br>
    <a href="#4" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">4. Real-time data of Peru's GDP growth rates</a>
    <br>
    <a href="#4-1" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        4.1. Annual vintages concatenation.</a>
    <br>
    <a href="#4-2" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        4.2. Quarterly vintages concatenation.</a>
    <br>
    <a href="#4-3" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        4.3. Monthly vintages concatenation.</a>
    <br>
    <a href="#4-4" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        4.4. Loading SQL.</a>
    <br>
    <a href="#5" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">5. GDP final revision dataset</a>
    <br>
    <a href="#5-1" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        5.1. Annual revisions.</a>
    <br>
    <a href="#5-2" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        5.2. Quarterly revisions.</a>
    <br>
    <a href="#5-3" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        5.3. Monthly revisions.</a>
    <br>
    <a href="#6" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">6. Uploading data to SQL</a>
    
</div>


<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    For any questions or issues regarding the code, please email Jason Cruz <a href="mailto:jj.cruza@alum.up.edu.pe" style="color: rgb(0, 153, 123); text-decoration: none;"><span style="font-size: 24px;">&#x2709;</span>
    </a>.
    </div>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    If any of the libraries below is missing, install it with a command like the following example.
    </div>

In [None]:
#!pip install pdfplumber # Uncomment this line to install a missing library (pdfplumber is only an example; 'os' is part of the standard library and needs no installation).

<div id="libraries">
   <!-- Contenido de la celda de destino -->
</div>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark;">
    <h2>
    Libraries
    </h2>
    </div>

In [1]:
# 1. PDF downloader
#-------------------------------------------------------------------------------------------------------------------------------

import os  # For file and directory manipulation, for interacting with the operating system
import random  # To generate random numbers
from selenium import webdriver  # For automating web browsers
from selenium.webdriver.common.by import By  # To locate elements on a webpage
from selenium.webdriver.support.ui import WebDriverWait  # To wait until certain conditions are met on a webpage.
from selenium.webdriver.support import expected_conditions as EC  # To define expected conditions
from selenium.common.exceptions import StaleElementReferenceException  # To handle exceptions related to elements on the webpage that are no longer available.
import pygame # Allows you to handle graphics, sounds and input events.

import shutil # Used for high-level file operations, such as copying, moving, renaming, and deleting files and directories.


# 2. Generate PDF input with key tables
#-------------------------------------------------------------------------------------------------------------------------------

import fitz  # This library is used for working with PDF documents, including reading, writing, and modifying PDFs (PyMuPDF).
import tkinter as tk  # This library is used for creating graphical user interfaces (GUIs) in Python.


# 3. Data cleaning
#-------------------------------------------------------------------------------------------------------------------------------

# 3.1. A brief documentation on issues in the table information of the PDFs

from PIL import Image  # Used for opening, manipulating, and saving image files.
import matplotlib.pyplot as plt  # Used for creating static, animated, and interactive visualizations.

# 3.2. Extracting tables and data cleanup

import pdfplumber  # For extracting text and metadata from PDF files
import pandas as pd  # For data manipulation and analysis
import unicodedata  # For manipulating Unicode data
import re  # For regular expressions operations
from datetime import datetime  # For working with dates and times
import locale  # For locale-specific formatting of numbers, dates, and currencies

# 3.2.1. Table 1. Extraction and cleaning of data from tables on monthly real GDP growth rates

import tabula  # Used to extract tables from PDF files into pandas DataFrames
from tkinter import Tk, messagebox, TOP, YES, NO  # Used for creating graphical user interfaces
from sqlalchemy import create_engine  # Used for connecting to and interacting with SQL databases

# 3.2.2. Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates

import roman  # For converting between Roman numerals and integers


# 4. SQL tables
#-------------------------------------------------------------------------------------------------------------------------------

import psycopg2  # For interacting with PostgreSQL databases
from sqlalchemy import create_engine, text  # For creating and executing SQL queries using SQLAlchemy


pygame 2.5.2 (SDL 2.28.3, Python 3.12.1)
Hello from the pygame community. https://www.pygame.org/contribute.html


<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="setup">
   <!-- Contenido de la celda de destino -->
</div>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark;">
    <h2>
    Initial set-up
    </h2>
    </div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> The following cells create the folders used by this notebook in your current working directory. Inputs are read from these folders and outputs are written to them. </div>

In [2]:
# Folder path to save downloaded PDF files

raw_pdf = 'raw_pdf' # to save raw data (.pdf).
if not os.path.exists(raw_pdf):
    os.mkdir(raw_pdf) # to create the folder (if it doesn't exist)

In [3]:
# Folder path to save text file with the names of already downloaded files

download_record = 'download_record'
if not os.path.exists(download_record):
    os.mkdir(download_record) # to create the folder (if it doesn't exist)

In [4]:
# Folder path to download the trimmed PDF files (these are PDF inputs for the extraction and cleanup code)

input_pdf = 'input_pdf'
if not os.path.exists(input_pdf):
    os.makedirs(input_pdf) # to create the folder (if it doesn't exist)

In [5]:
# Folder path to save PDF files containing only the pages of interest (where the GDP growth rate tables are located)

input_pdf_record = 'input_pdf_record'
if not os.path.exists(input_pdf_record):
    os.makedirs(input_pdf_record)

In [6]:
# Folder path to save dataframes generated record by year

dataframes_record = 'dataframes_record'
if not os.path.exists(dataframes_record):
    os.makedirs(dataframes_record) # to create the folder (if it doesn't exist)

In [7]:
# Folder path to save sound files

sound_folder = 'sound'
if not os.path.exists(sound_folder):
    os.makedirs(sound_folder) # to create the folder (if it doesn't exist)

In [8]:
# Folder path to save screenshots about issues in the table information

ns_issues_folder = 'ns_issues_folder'
if not os.path.exists(ns_issues_folder):
    os.makedirs(ns_issues_folder) # to create the folder (if it doesn't exist)
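
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> The cells above repeat the same check for each folder. As a minimal sketch (not the code this notebook runs), the same folders could be created in a single loop with <code>os.makedirs(..., exist_ok=True)</code>, which does nothing when a folder already exists. The helper <code>ensure_folders</code> and its <code>base_dir</code> argument are hypothetical names introduced only for this illustration, and the example runs in a temporary directory so that nothing is written to the project. </div>

```python
import os
import tempfile

# Folder names used throughout this notebook
FOLDER_NAMES = [
    "raw_pdf", "download_record", "input_pdf", "input_pdf_record",
    "dataframes_record", "sound", "ns_issues_folder",
]

def ensure_folders(base_dir, names=FOLDER_NAMES):
    """Create each folder under base_dir if it is missing and return the full paths."""
    paths = []
    for name in names:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths

# Example run inside a temporary directory (hypothetical location)
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_folders(tmp)
    created_again = ensure_folders(tmp)  # calling it twice is safe
    all_exist = all(os.path.isdir(p) for p in created)
```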

<p style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> The following function establishes a connection to the <code>gdp_revisions_datasets</code> database in <code>PostgreSQL</code>. The <b>input data</b> used in this Jupyter notebook are loaded from this database and all <b>output data</b> generated by this notebook are stored in it. Set the environment variables <code>CIUP_SQL_USER</code>, <code>CIUP_SQL_PASS</code> and <code>CIUP_SQL_HOST</code> once you have obtained the required permissions to access the server.</p>
    
<p style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
To request permissions, please email Jason Cruz <a href="mailto:jj.cruza@alum.up.edu.pe" style="color: rgb(0, 153, 123); text-decoration: none;"> <span style="font-size: 24px;">&#x2709;</span>
    </a>.
</p>

In [9]:
def create_sqlalchemy_engine():
    """
    Function to create an SQLAlchemy engine using environment variables.
    
    Returns:
        engine: SQLAlchemy engine object.
    """
    # Get environment variables
    user = os.environ.get('CIUP_SQL_USER')  # Get the SQL user from environment variables
    password = os.environ.get('CIUP_SQL_PASS')  # Get the SQL password from environment variables
    host = os.environ.get('CIUP_SQL_HOST')  # Get the SQL host from environment variables
    port = 5432  # Set the SQL port to 5432
    database = 'gdp_revisions_datasets'  # Set the database name 'gdp_revisions_datasets' from SQL

    # Check if all environment variables are defined
    if not all([host, user, password]):
        raise ValueError("Some environment variables are missing (CIUP_SQL_HOST, CIUP_SQL_USER, CIUP_SQL_PASS)")

    # Create connection string
    connection_string = f"postgresql://{user}:{password}@{host}:{port}/{database}"

    # Create SQLAlchemy engine
    engine = create_engine(connection_string)
    
    return engine
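
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> One caveat when building the connection string by hand: characters such as <code>@</code> or <code>/</code> in the user name or password must be percent-encoded, otherwise the URL is parsed incorrectly. The sketch below shows one way to do this with the standard library. The helper name <code>build_connection_string</code> and the credentials are hypothetical and only illustrate the idea; the function above is the one this notebook uses. </div>

```python
from urllib.parse import quote_plus

def build_connection_string(user, password, host, port=5432,
                            database="gdp_revisions_datasets"):
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )

# Hypothetical credentials containing characters that would break a raw URL
example_url = build_connection_string("analyst", "p@ss/word", "localhost")
print(example_url)
```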

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Import all other functions required by this jupyter notebook.
    </span>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> Please check the script <code>gdp_revisions_datasets_functions.py</code>, which contains all the functions required by this Jupyter notebook. The functions there are ordered according to the <a href="#outilne" style="color: #3d30a2;">sections</a> of this notebook.</div>

In [10]:
from gdp_revisions_datasets_functions import *

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="1">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: 'PT Serif Pro Book'; color: dark;">1.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">PDF Downloader</span></h1>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Our main data source is the <a href="https://www.bcrp.gob.pe/publicaciones/nota-semanal.html" style="color: rgb(0, 153, 123)">BCRP's web page</a>. The Weekly Note (NS) is a weekly publication of the BCRP issued in compliance with Article 84 of the Peruvian Constitution and Articles 2 and 74 of the BCRP's Organic Law, which include, among the bank's functions, the periodic publication of the main national macroeconomic statistics.
    
Our project requires two tables from each publication: the table of monthly growth rates of real GDP (12-month percentage changes) and the table of quarterly and annual growth rates of real GDP. These tables are referred to as Table 1 and Table 2, respectively, throughout this Jupyter notebook.
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    The bot below performs the following steps:
    <ol>
        <li>Download the PDF files (NS) from the BCRP web page, starting with the oldest and continuing to the most recent.</li>
        <li>Notify you with a fabulous song each time a certain number of downloads is reached.</li>
        <li>Display a window asking if you want to continue with the downloads. You can stop them at any time.</li>
        <li>Report in detail about the downloaded files. If a file has already been downloaded, you will also be notified.</li>
        <li>Save the raw PDFs to the paths set in the preamble of this Jupyter Notebook.</li>
    </ol>
    Try the bot, it's an adventure!
</div>

In [None]:
# Setting the BCRP URL
bcrp_url = "https://www.bcrp.gob.pe/publicaciones/nota-semanal.html"  # Never replace this URL

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The next cell runs the bot that carries out the downloads.
    </div>

In [None]:
# Initialize pygame
pygame.mixer.init()

# List of available sound files
available_sounds = os.listdir(sound_folder)

# Select a random sound
random_sound = random.choice(available_sounds)

# Full path of the random sound
sound_path = os.path.join(sound_folder, random_sound)

# Load the selected sound
pygame.mixer.music.load(sound_path)

# List to keep track of successfully downloaded files
downloaded_files = []

# Load the list of previously downloaded files if it exists
if os.path.exists(os.path.join(download_record, "downloaded_files.txt")):
    with open(os.path.join(download_record, "downloaded_files.txt"), "r") as f:
        downloaded_files = f.read().splitlines()

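# Web driver setup

# Note: download ChromeDriver from 'https://googlechromelabs.github.io/chrome-for-testing/#stable'
# and store the full path to its executable in the 'driver_path' environment variable (1).
from selenium.webdriver.chrome.service import Service  # To tell Selenium 4 where the driver executable is

driver_path = os.environ.get('driver_path') # (1)
# The 'executable_path' argument of webdriver.Chrome was removed in Selenium 4.10; pass a Service instead
driver = webdriver.Chrome(service=Service(executable_path=driver_path))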

# Number of downloads per batch
downloads_per_batch = 5
# Total number of downloads
total_downloads = 5

try:
    # Open the test page
    driver.get(bcrp_url)
    print("Site opened successfully")

    # Wait for the container area to be present
    wait = WebDriverWait(driver, 60)
    container_area = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="rightside"]')))

    # Get all the links within the container area
    pdf_links = container_area.find_elements(By.XPATH, './/a')

    # Reverse the order of links
    pdf_links = list(reversed(pdf_links))

    # Initialize download counter
    download_counter = 0

    # Iterate over reversed links and download PDFs in batches
    for pdf_link in pdf_links:
        download_counter += 1

        # Get the file name from the URL
        new_url = pdf_link.get_attribute("href")
        file_name = new_url.split("/")[-1]

        # Check if the file has already been downloaded
        if file_name in downloaded_files:
            print(f"{download_counter}. The file {file_name} has already been downloaded previously. Skipping...")
            continue

        # Try to download the file
        try:
            download_pdf(driver, pdf_link, wait, download_counter, raw_pdf, download_record)

            # Update the list of downloaded files
            downloaded_files.append(file_name)

        except Exception as e:
            print(f"Error downloading the file {file_name}: {str(e)}")

        # If the download count reaches a multiple of the batch size, notify
        if download_counter % downloads_per_batch == 0:
            print(f"Batch {download_counter // downloads_per_batch} completed ({downloads_per_batch} files per batch)")

        # After every fifth file, play a song and ask the user whether to continue
        if download_counter % 5 == 0:
            play_sound()
            user_input = input("Do you want to continue downloading? (Enter 'y' to continue, any other key to stop): ")
            pygame.mixer.music.stop()
            if user_input.lower() != 'y':
                break

        # Random wait before the next iteration
        random_wait(5, 10)

        # If total downloads reached, break out of loop
        if download_counter == total_downloads:
            print(f"All downloads completed ({total_downloads} in total)")
            break

except StaleElementReferenceException:
    print("StaleElementReferenceException occurred. Retrying...")

finally:
    # Close the browser when finished
    driver.quit()
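
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> The bot avoids repeated downloads by keeping a plain-text record of file names. The sketch below isolates that bookkeeping under stated assumptions: the helpers <code>load_record</code>, <code>append_record</code> and <code>pending_files</code> are hypothetical names, the sample file names only follow the <code>ns-XX-YYYY.pdf</code> pattern used later in this notebook, and the actual writing of the record is presumably handled inside <code>download_pdf</code> in the functions script. </div>

```python
import os
import tempfile

def load_record(record_path):
    """Return the file names already logged, or an empty list if no log exists yet."""
    if not os.path.exists(record_path):
        return []
    with open(record_path, "r", encoding="utf-8") as f:
        return f.read().splitlines()

def append_record(record_path, file_name):
    """Append one downloaded file name to the log, one name per line."""
    with open(record_path, "a", encoding="utf-8") as f:
        f.write(file_name + "\n")

def pending_files(candidates, already_downloaded):
    """Keep only the candidates that are not in the log, preserving their order."""
    seen = set(already_downloaded)
    return [name for name in candidates if name not in seen]

# Example with a temporary log file and hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    record_path = os.path.join(tmp, "downloaded_files.txt")
    append_record(record_path, "ns-01-2019.pdf")
    candidates = ["ns-01-2019.pdf", "ns-02-2019.pdf"]
    to_download = pending_files(candidates, load_record(record_path))
```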


<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
At this point the NS (PDF files) have all been downloaded into a single folder (raw_pdf), but we want them sorted by year. The following code moves each NS into a subfolder named after its year of publication. This happens in the blink of an eye.
    </div>

In [None]:
# Get the list of files in the directory
files = os.listdir(raw_pdf)

# Call the function to organize files
organize_files_by_year(raw_pdf)
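
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> To make the sorting step concrete, here is a minimal sketch of how files could be grouped by year. It is not the implementation of <code>organize_files_by_year</code>, which lives in the functions script. It assumes that file names follow the <code>ns-XX-YYYY.pdf</code> pattern used later in this notebook; the function name <code>organize_by_year_sketch</code> and the sample files are hypothetical. </div>

```python
import os
import re
import shutil
import tempfile

# Assumed file name pattern: ns-<issue number>-<four-digit year>
YEAR_PATTERN = re.compile(r"ns-\d+-(\d{4})")

def organize_by_year_sketch(folder):
    """Move files named like 'ns-07-2019.pdf' into a subfolder named after the year."""
    moved = {}
    for name in sorted(os.listdir(folder)):
        source = os.path.join(folder, name)
        match = YEAR_PATTERN.search(name)
        if not os.path.isfile(source) or match is None:
            continue  # skip subfolders and files that do not follow the pattern
        year_dir = os.path.join(folder, match.group(1))
        os.makedirs(year_dir, exist_ok=True)
        shutil.move(source, os.path.join(year_dir, name))
        moved[name] = match.group(1)
    return moved

# Example with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ["ns-01-2019.pdf", "ns-45-2020.pdf", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    moved = organize_by_year_sketch(tmp)
    remaining = sorted(os.listdir(tmp))
```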

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="2">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;; color: dark;">2.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Generate PDF input with key tables</span></h1>

In [None]:
class PopupWindow(tk.Toplevel):
    """Creates a pop-up window for user interaction."""

    def __init__(self, root, message):
        """Initialize the pop-up window."""
        super().__init__(root)
        self.root = root
        self.title("Attention!")
        self.message = message
        self.result = None
        self.configure_window()
        self.create_widgets()

    def configure_window(self):
        """Configure the window to be non-resizable."""
        self.resizable(False, False)

    def create_widgets(self):
        """Create widgets (labels and buttons) inside the pop-up window."""
        self.label = tk.Label(self, text=self.message, wraplength=250)  # Adjust text if too long
        self.label.pack(pady=10, padx=10)
        self.btn_frame = tk.Frame(self)
        self.btn_frame.pack(pady=5)
        self.btn_yes = tk.Button(self.btn_frame, text="Yes", command=self.yes)
        self.btn_yes.pack(side=tk.LEFT, padx=5)
        self.btn_no = tk.Button(self.btn_frame, text="No", command=self.no)
        self.btn_no.pack(side=tk.RIGHT, padx=5)

        # Calculate window size based on text size
        width = self.label.winfo_reqwidth() + 20
        height = self.label.winfo_reqheight() + 100
        self.geometry(f"{width}x{height}")

    def yes(self):
        """Set result to True and close the window."""
        self.result = True
        self.destroy()

    def no(self):
        """Set result to False and close the window."""
        self.result = False
        self.destroy()

if __name__ == "__main__":
    keywords = ["ECONOMIC SECTORS"]
    root = tk.Tk()
    root.withdraw()  # Hide the main Tkinter window

    input_pdf_files = read_input_pdf_files()
    processing_counter = 1

    for folder in os.listdir(raw_pdf):
        folder_path = os.path.join(raw_pdf, folder)
        if os.path.isdir(folder_path):
            print("Processing folder:", folder)
            num_pdfs_trimmed = 0
            for filename in os.listdir(folder_path):
                if filename.endswith(".pdf"):
                    pdf_file = os.path.join(folder_path, filename)
                    if filename in input_pdf_files:
                        print(f"{processing_counter}. The PDF '{filename}' has already been trimmed and saved in '{input_pdf}'...")
                        processing_counter += 1
                        continue
                    print(f"{processing_counter}. Processing:", pdf_file)
                    
                    pages_with_keywords = search_keywords(pdf_file, keywords)
                    num_pages_new_pdf = trim_pdf(pdf_file, pages_with_keywords)
                    if num_pages_new_pdf > 0:
                        num_pdfs_trimmed += 1
                        input_pdf_files.add(filename)
                        processing_counter += 1
            
            write_input_pdf_files(input_pdf_files)

            message = f"{num_pdfs_trimmed} PDFs have been trimmed in folder {folder}. Do you want to continue?"
            popup = PopupWindow(root, message)
            root.wait_window(popup)
            if not popup.result:
                break
                
    print("Process completed for all PDFs in directory:", input_pdf)


<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
As before, the trimmed NS (PDF files that now contain only a few pages) are stored unsorted in the input_pdf folder. The following code moves each trimmed NS, which now includes only the key tables, into a subfolder named after its year of publication. This happens in the blink of an eye.
    </div>

In [None]:
# Get the list of files in the directory
files = os.listdir(input_pdf)

# Call the function to organize files
organize_files_by_year(input_pdf)

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="3">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;; color: dark;">3.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Data cleaning</span></h1>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
Now that we have PDFs containing only the tables required for this project, we can extract the tables and then clean the data.
</p>  
</div>

<div id="3-1">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    A brief documentation on issues in the table information of the PDFs
    </span>
    </h2>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
Note that the table information in the PDFs is available as editable text (including numeric values), but it sometimes comes in encoded formats that make it difficult to extract and clean. This is the most challenging stage of this Jupyter notebook because the information in the PDFs follows no single layout: each PDF adds its own difficulty to the extraction. To illustrate this point, this section documents the most common problems we may face when extracting and cleaning tables from PDFs.
</div>

<div id="3-2">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Extracting tables and data cleanup
    </span>
    </h2>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
The main library used for extracting tables from PDFs is <code>pdfplumber</code>. You can review the official documentation by clicking <a href="https://github.com/jsvine/pdfplumber" style="color: rgb(0, 153, 123); font-size: 16px;">here</a>.
</p>
    
<p>     
    The functions in <b>Section 3</b> of the <code>"gdp_revisions_datasets_functions.py"</code> script were built to deal with each of these issues. An interesting exercise is to compare the original tables (the ones in the PDF) with the cleaned tables (produced by the cleanup code below). For this purpose, the cleanup code for <a href="#3-2-1" style="color: rgb(0, 153, 123); font-size: 16px;">Table 1</a> and <a href="#3-2-2" style="color: rgb(0, 153, 123); font-size: 16px;">Table 2</a> generates two dictionaries: the first stores the raw tables, that is, the original tables from the PDF extracted by the <code>pdfplumber</code> library, while the second stores the fully cleaned tables.
</p>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
The basic criterion for extracting a table is the presence of keywords (a sufficient condition): any table on a page that contains the following keywords qualifies for extraction.
</p>
</div>

In [11]:
# Keywords to search in the page text
keywords = ["ECONOMIC SECTORS"]
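
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> The keyword criterion can be illustrated without opening a PDF. The sketch below applies it to a list of page texts such as those returned page by page by a PDF text extractor. The function name <code>pages_with_all_keywords</code> and the sample texts are hypothetical. Note that the match is case-sensitive, and that a page without extractable text may come back as <code>None</code> or as an empty string depending on the extractor. </div>

```python
def pages_with_all_keywords(page_texts, keywords):
    """Return the 1-based numbers of the pages whose text contains every keyword."""
    matches = []
    for number, text in enumerate(page_texts, start=1):
        text = text or ""  # treat a page without extractable text as empty
        if all(keyword in text for keyword in keywords):
            matches.append(number)
    return matches

# Hypothetical page texts standing in for the pages of one Weekly Note
sample_pages = [
    "Cover page",
    "GROSS DOMESTIC PRODUCT BY ECONOMIC SECTORS (12-month % change)",
    None,
    "economic sectors in lower case do not match",
]
found = pages_with_all_keywords(sample_pages, ["ECONOMIC SECTORS"])
```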

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The code iterates through each PDF and extracts the two required tables from each one. The extracted information is then converted into dataframes, and the column names and values are cleaned up to follow Python naming conventions.
    </div>
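
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> As an illustration of what a Python-friendly column name can look like, the sketch below removes accents, replaces punctuation and spaces with underscores, and lowercases the result. It is only an example under stated assumptions: the function <code>to_snake_case</code> and the sample column labels are hypothetical, and the actual cleaning steps are the functions in Section 3 of the functions script. </div>

```python
import re
import unicodedata

def to_snake_case(label):
    """Turn a raw column label into a lowercase ASCII name with underscores."""
    # Decompose accented characters and drop the combining marks
    ascii_label = (
        unicodedata.normalize("NFKD", str(label))
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with a single underscore
    snake = re.sub(r"[^0-9a-zA-Z]+", "_", ascii_label).strip("_")
    return snake.lower()

# Hypothetical raw column labels
raw_columns = ["SECTORES ECONÓMICOS", "Ene.", "Año 2019", "Var. %  12 meses"]
clean_columns = [to_snake_case(c) for c in raw_columns]
```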

<div id="3-2-1">
   <!-- Contenido de la celda de destino -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.2.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    <span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">Table 1.</span> Extraction and cleaning of data from tables on monthly real GDP growth rates.
    </span>
    </h3>

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Please check that the flat file <b>"ns_dates.csv"</b> is updated with the dates, years and ids of the newly downloaded PDFs (Weekly Notes). That file is located in the <b>"ns_dates"</b> folder and is uploaded to SQL from the Jupyter notebook <code>aux_files_to_sql.ipynb</code>.
    </span>
</div>

In [13]:
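# Set the locale to Spanish (the 'es_ES.UTF-8' locale must be installed on your system)
try:
    locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')
except locale.Error:
    print("Locale 'es_ES.UTF-8' is not available; Spanish month names may not be parsed correctly.")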

# Dictionary to store generated DataFrames
dataframes_dict_1 = {}

# Path for the processed folders log file (inside the 'dataframes_record' folder created in the initial set-up)
registro_path = os.path.join(dataframes_record, 'carpetas_procesadas_1.txt')

# Function to correct month names
def corregir_nombre_mes(mes):
    meses_mapping = {
        'setiembre': 'septiembre',
        # Add more mappings as needed for other month names
    }
    return meses_mapping.get(mes, mes)

# Function to register processed folder
def registrar_carpeta_procesada(carpeta, num_archivos_procesados):
    with open(registro_path, 'a') as file:
        file.write(f"{carpeta}:{num_archivos_procesados}\n")

# Function to check if folder has been processed
def carpeta_procesada(carpeta):
    if not os.path.exists(registro_path):
        return False
    with open(registro_path, 'r') as file:
        for line in file:
            # Compare the full folder name (the text before the last ':'), not just a prefix
            if line.rstrip('\n').rsplit(':', 1)[0] == carpeta:
                return True
    return False

# Function to fetch date from database
def obtener_fecha(df, engine):
    id_ns = df['id_ns'].iloc[0]
    year = df['year'].iloc[0]
    # Bound parameters avoid building the SQL statement by string interpolation
    query = text("SELECT date FROM dates_growth_rates WHERE id_ns = :id_ns AND year = :year;")
    fecha = pd.read_sql(query, engine, params={'id_ns': str(id_ns), 'year': str(year)})
    return fecha.iloc[0, 0] if not fecha.empty else None

# Function to process PDF file
def procesar_pdf(pdf_path):
    tables_dict_1 = {}  # Local dictionary for each PDF
    table_counter = 1
    keyword_count = 0 

    filename = os.path.basename(pdf_path)
    id_ns_year_matches = re.findall(r'ns-(\d+)-(\d{4})', filename)
    if id_ns_year_matches:
        id_ns, year = id_ns_year_matches[0]
    else:
        print("No matches found for id_ns and year in filename:", filename)
        return None, None, None, None  # Four values, matching the normal return of this function

    new_filename = os.path.splitext(os.path.basename(pdf_path))[0].replace('-', '_')

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, 1):
            text = page.extract_text() or ""  # Guard against pages without extractable text
            if all(keyword in text for keyword in keywords):
                keyword_count += 1
                if keyword_count == 1:  # Process only the first occurrence
                    tables = tabula.read_pdf(pdf_path, pages=i, multiple_tables=False, stream=True) # Change stream to another option if desired
                    for j, table_df in enumerate(tables, start=1):
                        dataframe_name = f"{new_filename}_{keyword_count}"
                        tables_dict_1[dataframe_name] = table_df
                        table_counter += 1

                    break  # Exit loop after finding the first occurrence

    return id_ns, year, tables_dict_1, keyword_count

# Function to process folder
def procesar_carpeta(carpeta, engine):
    print(f"Processing folder {os.path.basename(carpeta)}")
    pdf_files = [os.path.join(carpeta, f) for f in os.listdir(carpeta) if f.endswith('.pdf')]

    num_pdfs_procesados = 0
    num_dataframes_generados = 0

    table_counter = 1  # Initialize table counter here
    tables_dict_1 = {}  # Declare tables_dict_1 outside main loop
    
    for pdf_file in pdf_files:
        id_ns, year, tables_dict_temp, keyword_count = procesar_pdf(pdf_file)

        if tables_dict_temp:
            for dataframe_name, df in tables_dict_temp.items():
                file_name = os.path.splitext(os.path.basename(pdf_file))[0].replace('-', '_')
                dataframe_name = f"{file_name}_{keyword_count}"
                
                # Store raw DataFrame in tables_dict_1
                tables_dict_1[dataframe_name] = df.copy()
                
                # Apply the cleaning pipeline to a copy of the DataFrame
                df_clean = df.copy()

                if any(col.isdigit() and len(col) == 4 for col in df_clean.columns):
                    # If there is at least one column representing a year
                    df_clean = swap_nan_se(df_clean)
                    df_clean = split_column_by_pattern(df_clean)
                    df_clean = drop_rare_caracter_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = relocate_last_columns(df_clean)
                    df_clean = replace_first_dot(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = replace_var_perc_first_column(df_clean)
                    df_clean = replace_var_perc_last_columns(df_clean)
                    df_clean = replace_number_moving_average(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = exchange_values(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = find_year_column(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = get_months_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)
                else: # 2014 ns 08
                    # If there are no columns representing years
                    df_clean = check_first_row(df_clean)
                    df_clean = check_first_row_1(df_clean)
                    df_clean = replace_first_row_with_columns(df_clean)
                    df_clean = swap_nan_se(df_clean)
                    df_clean = split_column_by_pattern(df_clean)
                    df_clean = drop_rare_caracter_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = relocate_last_columns(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = replace_var_perc_first_column(df_clean)
                    df_clean = replace_var_perc_last_columns(df_clean)
                    df_clean = replace_number_moving_average(df_clean)
                    df_clean = expand_column(df_clean) # 2014 ns 08
                    df_clean = split_values_1(df_clean) # 2014 ns 08
                    df_clean = split_values_2(df_clean) # 2016 ns 15
                    df_clean = split_values_3(df_clean) # 2016 ns 19
                    df_clean = separate_text_digits(df_clean)
                    df_clean = exchange_values(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = find_year_column(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = get_months_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_nan_with_previous_column_1(df_clean)
                    df_clean = replace_nan_with_previous_column_2(df_clean)
                    df_clean = replace_nan_with_previous_column_3(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)
                
                # Add 'year' column to cleaned DataFrame
                df_clean.insert(0, 'year', year)
                
                # Add 'id_ns' column to cleaned DataFrame
                df_clean.insert(1, 'id_ns', id_ns)
                
                # Get corresponding date from database
                fecha = obtener_fecha(df_clean, engine)
                if fecha:
                    # Add 'date' column to cleaned DataFrame
                    df_clean.insert(2, 'date', fecha)
                else:
                    print("Date not found in database for id_ns:", id_ns, "and year:", year)
                
                # Store cleaned DataFrame in dataframes_dict_1
                dataframes_dict_1[dataframe_name] = df_clean

                print(f'  {table_counter}. DataFrame generated for file {pdf_file}: {dataframe_name}')
                num_dataframes_generados += 1
                table_counter += 1  # Increment table counter here
        
        num_pdfs_procesados += 1  # Increment number of processed PDFs for each PDF in folder

    return num_pdfs_procesados, num_dataframes_generados, tables_dict_1

# Function to process folders
def procesar_carpetas():
    pdf_folder = 'input_pdf'
    carpetas = [os.path.join(pdf_folder, d) for d in os.listdir(pdf_folder) if os.path.isdir(os.path.join(pdf_folder, d))]
    
    tables_dict_1 = {}  # Initialize tables_dict_1 here
    
    for carpeta in carpetas:
        if carpeta_procesada(carpeta):
            print(f"Folder {carpeta} has already been processed.")
            continue
        
        num_pdfs_procesados, num_dataframes_generados, tables_dict_temp = procesar_carpeta(carpeta, engine)
        
        # Update tables_dict_1 with values returned from procesar_carpeta()
        tables_dict_1.update(tables_dict_temp)
        
        registrar_carpeta_procesada(carpeta, num_pdfs_procesados)

        # Ask user if they want to continue with next folder
        root = Tk()
        root.withdraw()
        root.attributes('-topmost', True)  # Ensure the messagebox is in front
        
        message = f"{num_dataframes_generados} dataframes have been generated in folder {carpeta}. Do you want to continue with the next folder?"
        continuar = messagebox.askyesno("Continue", message)
        root.destroy()

        if not continuar:
            print("Processing stopped by user.")
            break  # Break the loop if user decides not to continue

    print("Processing completed for all folders.")  # Add a message to indicate completion

    return tables_dict_1  # Return tables_dict_1 at the end of the function

if __name__ == "__main__":
    engine = create_sqlalchemy_engine() # Creates the SQL connection to merge the date, year and id from a SQL database to dataframes
    tables_dict_1 = procesar_carpetas() # Capture the returned value from procesar_carpetas()


Folder input_pdf\2013 has already been processed.
Folder input_pdf\2014 has already been processed.
Folder input_pdf\2015 has already been processed.
Folder input_pdf\2016 has already been processed.
Folder input_pdf\2017 has already been processed.
Folder input_pdf\2018 has already been processed.
Folder input_pdf\2019 has already been processed.
Folder input_pdf\2020 has already been processed.
Folder input_pdf\2021 has already been processed.
Folder input_pdf\2022 has already been processed.
Folder input_pdf\2023 has already been processed.
Processing folder 2024
  1. DataFrame generated for file input_pdf\2024\ns-01-2024.pdf: ns_01_2024_1
  2. DataFrame generated for file input_pdf\2024\ns-02-2024.pdf: ns_02_2024_1
  3. DataFrame generated for file input_pdf\2024\ns-03-2024.pdf: ns_03_2024_1
  4. DataFrame generated for file input_pdf\2024\ns-04-2024.pdf: ns_04_2024_1
  5. DataFrame generated for file input_pdf\2024\ns-05-2024.pdf: ns_05_2024_1
  6. DataFrame generated for file inp

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

In [None]:
tables_dict_1.keys()

In [None]:
dataframes_dict_1.keys()

In [None]:
tables_dict_1['ns_09_2024_1'].head(5)

In [None]:
df_1 = dataframes_dict_1['ns_09_2024_1']
df_1

In [None]:
df_1[(df_1['sectores_economicos'] == 'agropecuario') | (df_1['economic_sectors'] == 'agriculture and livestock')]

<div id="3-2-2">
   <!-- Contenido de la celda de destino -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.2.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    <span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">Table 2.</span> Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.
    </span>
    </h3>

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Please check that the flat file <b>"ns_dates.csv"</b> is updated with the dates, years and ids of the newly downloaded PDFs (Weekly Notes). That file is located in the <b>"ns_dates"</b> folder and is uploaded to SQL from the jupyter notebook <code>aux_files_to_sql.ipynb</code>.
    </span>
</div>
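
<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
The check above can be sketched as follows. This is a minimal illustration, not part of the project's pipeline: <code>missing_date_entries</code> is a hypothetical helper, and the column names <code>id_ns</code> and <code>year</code> are assumed to match those of the <code>dates_growth_rates</code> table queried below. It reports which PDF file names have no matching entry in the dates table.
</div>

```python
import re

import pandas as pd


def missing_date_entries(pdf_names, dates_df):
    """Return the PDF file names whose (id_ns, year) pair is absent from dates_df.

    Hypothetical helper: assumes dates_df has 'id_ns' and 'year' columns.
    """
    known = {(int(i), int(y)) for i, y in zip(dates_df["id_ns"], dates_df["year"])}
    missing = []
    for name in pdf_names:
        # Same file name pattern used by procesar_pdf()
        match = re.search(r"ns-(\d+)-(\d{4})", name)
        if match and (int(match.group(1)), int(match.group(2))) not in known:
            missing.append(name)
    return missing


# Toy dates table standing in for the contents of ns_dates.csv
toy_dates = pd.DataFrame({
    "id_ns": [1, 2],
    "year": [2024, 2024],
    "date": ["2024-01-04", "2024-01-11"],
})
toy_missing = missing_date_entries(
    ["ns-01-2024.pdf", "ns-02-2024.pdf", "ns-03-2024.pdf"], toy_dates
)
print(toy_missing)
```

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
In practice the file names would come from the PDF folder and the dates table from the CSV file; both are replaced by toy values here.
</div>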

In [14]:
# Set the locale to Spanish
locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')

# Dictionary to store generated DataFrames
dataframes_dict_2 = {}

# Path for the processed folders log file
registro_path = 'dataframes_record/carpetas_procesadas_2.txt'

# Function to correct month names
def corregir_nombre_mes(mes):
    meses_mapping = {
        'setiembre': 'septiembre',
        # Add more mappings as needed for other month names
    }
    return meses_mapping.get(mes, mes)

def registrar_carpeta_procesada(carpeta, num_archivos_procesados):
    with open(registro_path, 'a') as file:
        file.write(f"{carpeta}:{num_archivos_procesados}\n")

def carpeta_procesada(carpeta):
    if not os.path.exists(registro_path):
        return False
    with open(registro_path, 'r') as file:
        for line in file:
            if line.rsplit(':', 1)[0] == carpeta:  # Exact match on the folder name, not a prefix
                return True
    return False

def obtener_fecha(df, engine):
    id_ns = df['id_ns'].iloc[0]
    year = df['year'].iloc[0]
    query = f"SELECT date FROM dates_growth_rates WHERE id_ns = '{id_ns}' AND year = '{year}';"
    fecha = pd.read_sql(query, engine)
    return fecha.iloc[0, 0] if not fecha.empty else None

def procesar_pdf(pdf_path):
    tables_dict_2 = {}  # Local dictionary for each PDF
    table_counter = 1
    keyword_count = 0 

    filename = os.path.basename(pdf_path)
    id_ns_year_matches = re.findall(r'ns-(\d+)-(\d{4})', filename)
    if id_ns_year_matches:
        id_ns, year = id_ns_year_matches[0]
    else:
        print("No matches found for id_ns and year in filename:", filename)
        return None, None, None, None

    new_filename = os.path.splitext(os.path.basename(pdf_path))[0].replace('-', '_')

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, 1):
            text = page.extract_text() or ''  # Guard against pages with no extractable text
            if all(keyword in text for keyword in keywords):
                keyword_count += 1
                if keyword_count == 2:
                    tables = tabula.read_pdf(pdf_path, pages=i, multiple_tables=False)
                    for j, table_df in enumerate(tables, start=1):
                        dataframe_name = f"{new_filename}_{keyword_count}"
                        tables_dict_2[dataframe_name] = table_df
                        table_counter += 1

    return id_ns, year, tables_dict_2, keyword_count


def procesar_carpeta(carpeta, engine):
    print(f"Processing folder {os.path.basename(carpeta)}")
    pdf_files = [os.path.join(carpeta, f) for f in os.listdir(carpeta) if f.endswith('.pdf')]

    num_pdfs_procesados = 0
    num_dataframes_generados = 0

    table_counter = 1  # Initialize table counter here
    tables_dict_2 = {}  # Declare tables_dict outside main loop
    
    for pdf_file in pdf_files:
        id_ns, year, tables_dict_temp, keyword_count = procesar_pdf(pdf_file)

        if tables_dict_temp:
            for dataframe_name, df in tables_dict_temp.items():
                file_name = os.path.splitext(os.path.basename(pdf_file))[0].replace('-', '_')
                dataframe_name = f"{file_name}_{keyword_count}"

                # Store raw DataFrame in tables_dict
                tables_dict_2[dataframe_name] = df.copy()

                # Apply the cleaning pipeline to a copy of the DataFrame
                df_clean = df.copy()
                if pd.isna(df_clean.iloc[0, 0]):
                    # First cell is missing: apply the pipeline for tables with an empty leading cell
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = separate_years(df_clean)
                    df_clean = relocate_roman_numerals(df_clean)
                    df_clean = extract_mixed_values(df_clean)
                    df_clean = replace_first_row_nan(df_clean)
                    df_clean = first_row_columns(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = drop_nan_row(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = split_values(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = roman_arabic(df_clean)
                    df_clean = fix_duplicates(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = get_quarters_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)
                else:
                    # First cell is populated: apply the alternative cleaning pipeline
                    df_clean = exchange_roman_nan(df_clean)
                    df_clean = exchange_columns(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = last_column_es(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = roman_arabic(df_clean)
                    df_clean = fix_duplicates(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = get_quarters_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)

                # Add 'year' column to cleaned DataFrame
                df_clean.insert(0, 'year', year)
                
                # Add 'id_ns' column to cleaned DataFrame
                df_clean.insert(1, 'id_ns', id_ns)
                
                # Get corresponding date from database
                fecha = obtener_fecha(df_clean, engine)
                if fecha:
                    # Add 'date' column to cleaned DataFrame
                    df_clean.insert(2, 'date', fecha)
                else:
                    print("Date not found in database for id_ns:", id_ns, "and year:", year)

                # Store cleaned DataFrame in dataframes_dict
                dataframes_dict_2[dataframe_name] = df_clean

                print(f'  {table_counter}. DataFrame generated for file {pdf_file}: {dataframe_name}')
                num_dataframes_generados += 1
                table_counter += 1  # Increment table counter here
                    
        num_pdfs_procesados += 1  # Increment number of PDFs processed for each PDF in folder

    return num_pdfs_procesados, num_dataframes_generados, tables_dict_2
        
def procesar_carpetas():
    pdf_folder = 'input_pdf'
    carpetas = [os.path.join(pdf_folder, d) for d in os.listdir(pdf_folder) if os.path.isdir(os.path.join(pdf_folder, d))]

    tables_dict_2 = {}  # Initialize tables_dict here
    
    for carpeta in carpetas:
        if carpeta_procesada(carpeta):
            print(f"Folder {carpeta} has already been processed.")
            continue
        
        num_pdfs_procesados, num_dataframes_generados, tables_dict_temp = procesar_carpeta(carpeta, engine)
        
        # Update tables_dict with values returned from procesar_carpeta()
        tables_dict_2.update(tables_dict_temp)
        
        registrar_carpeta_procesada(carpeta, num_pdfs_procesados)

        # Ask user if they want to continue with next folder
        root = Tk()
        root.withdraw()
        root.attributes('-topmost', True)  # Ensure the messagebox is in front
        
        message = f"{num_dataframes_generados} dataframes have been generated in folder {carpeta}. Do you want to continue with the next folder?"
        continuar = messagebox.askyesno("Continue", message)
        root.destroy()

        if not continuar:
            print("Processing stopped by user.")
            break  # Break the loop if user decides not to continue

    print("Processing completed for all folders.")  # Add a message to indicate completion

    return tables_dict_2  # Return tables_dict at the end of the function
    
if __name__ == "__main__":
    engine = create_sqlalchemy_engine() # Creates the SQL connection to merge the date, year and id from a SQL database to dataframes
    tables_dict_2 = procesar_carpetas()  # Capture the returned value from procesar_carpetas()

Folder input_pdf\2013 has already been processed.
Folder input_pdf\2014 has already been processed.
Folder input_pdf\2015 has already been processed.
Folder input_pdf\2016 has already been processed.
Folder input_pdf\2017 has already been processed.
Folder input_pdf\2018 has already been processed.
Folder input_pdf\2019 has already been processed.
Folder input_pdf\2020 has already been processed.
Folder input_pdf\2021 has already been processed.
Folder input_pdf\2022 has already been processed.
Folder input_pdf\2023 has already been processed.
Processing folder 2024
  1. DataFrame generated for file input_pdf\2024\ns-01-2024.pdf: ns_01_2024_2
  2. DataFrame generated for file input_pdf\2024\ns-02-2024.pdf: ns_02_2024_2
  3. DataFrame generated for file input_pdf\2024\ns-03-2024.pdf: ns_03_2024_2
  4. DataFrame generated for file input_pdf\2024\ns-04-2024.pdf: ns_04_2024_2
  5. DataFrame generated for file input_pdf\2024\ns-05-2024.pdf: ns_05_2024_2
  6. DataFrame generated for file inp

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

In [None]:
tables_dict_2.keys()

In [None]:
dataframes_dict_2.keys()

In [None]:
tables_dict_2['ns_01_2024_2'].head(5)

In [None]:
dataframes_dict_2['ns_18_2024_2']

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="4">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Real-time data of Peru's GDP growth rates</span></h1>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
This section creates the GDP growth rate vintages for Peru using <b>Table 1</b> and <b>Table 2</b>. In the previous section, each table from each NS (PDF) was extracted and cleaned individually. Here, we concatenate all the tables for a specific economic sector, thus creating a vintage dataset of (real) GDP growth by economic sector from 2013 to 2024.
</div>
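
<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
The idea of the concatenation can be sketched with toy data. This is only an illustration under stated assumptions, not the implementation of <code>concatenate_annual_df</code>, <code>concatenate_quarterly_df</code> or <code>concatenate_monthly_df</code>: <code>stack_sector_rows</code> is a hypothetical helper, the values are invented, and each cleaned table is assumed to carry the columns <code>sectores_economicos</code> and <code>economic_sectors</code>.
</div>

```python
import pandas as pd


def stack_sector_rows(frames, sector_es, sector_en):
    """Keep the rows of one sector in every per-NS table and stack them.

    Hypothetical helper: one output row per Weekly Note (vintage).
    """
    selected = []
    for df in frames.values():
        mask = (df["sectores_economicos"] == sector_es) | (df["economic_sectors"] == sector_en)
        selected.append(df.loc[mask])
    if not selected:
        return pd.DataFrame()
    # Columns missing from an older vintage are filled with NaN
    stacked = pd.concat(selected, ignore_index=True, sort=False)
    stacked = stacked.drop(columns=["sectores_economicos", "economic_sectors"])
    return stacked.sort_values(["year", "id_ns"]).reset_index(drop=True)


# Toy tables with illustrative values only
toy_frames = {
    "ns_08_2024_2": pd.DataFrame({
        "year": ["2024", "2024"],
        "id_ns": ["08", "08"],
        "sectores_economicos": ["pbi", "agropecuario"],
        "economic_sectors": ["gdp", "agriculture and livestock"],
        "year_2022": [2.7, 4.5],
        "year_2023": [-0.6, -2.9],
    }),
    "ns_01_2024_2": pd.DataFrame({
        "year": ["2024", "2024"],
        "id_ns": ["01", "01"],
        "sectores_economicos": ["pbi", "agropecuario"],
        "economic_sectors": ["gdp", "agriculture and livestock"],
        "year_2022": [2.7, 4.5],
    }),
}
toy_vintages = stack_sector_rows(toy_frames, "pbi", "gdp")
print(toy_vintages)
```

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Each row of the result is one vintage; a target period that had not yet been published in an earlier NS appears as a missing value in that row.
</div>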

<div id="select">
   <!-- Contenido de la celda de destino -->
</div>

<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Select <code>sector_economico</code> and <code>economic_sector</code></span></h1>
    </div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
Running the following code opens a window with options in <b>Spanish</b> and <b>English</b> for selecting an <b>economic sector</b>. The selection determines the sector whose GDP growth rates (annual, quarterly or monthly) are concatenated.
</p>
</div>

In [16]:
# Call the function with the Spanish options list to display the window
selected_spanish, selected_english = select_economic_sector(spanish_options, english_options)

# Display the values selected by the user
print(f"You have selected sector_economico = {selected_spanish} and economic_sector = {selected_english}.")

You have selected sector_economico = pbi and economic_sector = gdp.


<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Select the dataset name prefix</span></h1>
    </div>

In [18]:
root = tk.Tk()   # Create a main window
root.withdraw()  # Hide the main window

# Ask the user to enter the value of economic_sector
sector = simpledialog.askstring("Economic Sector", "Enter the value of the sector:")

# Display the value entered by the user
print("Sector value:", sector)

Sector value: gdp


<div id="4-1">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Annual vintages concatenation
    </span>
    </h2>

In [19]:
globals()[f"{sector}_annual_growth_rates"] = concatenate_annual_df(dataframes_dict_2, selected_spanish, selected_english)

DataFrames ending with '_2' that will be concatenated:
ns_01_2024_2
ns_02_2024_2
ns_03_2024_2
ns_04_2024_2
ns_05_2024_2
ns_06_2024_2
ns_07_2024_2
ns_08_2024_2
ns_09_2024_2
ns_10_2024_2
ns_11_2024_2
ns_12_2024_2
ns_13_2024_2
ns_14_2024_2
ns_15_2024_2
ns_16_2024_2
ns_17_2024_2
ns_18_2024_2
ns_19_2024_2
ns_20_2024_2
ns_21_2024_2
ns_22_2024_2
ns_23_2024_2
Number of rows in the concatenated dataframe: 23


In [20]:
pd.set_option('display.max_rows', None)
globals()[f"{sector}_annual_growth_rates"]

Unnamed: 0,year,id_ns,date,year_2020,year_2021,year_2022,year_2023
0,2024,1,2024-01-04,-10.9,13.4,2.7,
1,2024,2,2024-01-11,-10.9,13.4,2.7,
2,2024,3,2024-01-18,-10.9,13.4,2.7,
3,2024,4,2024-01-25,-10.9,13.4,2.7,
4,2024,5,2024-02-01,-10.9,13.4,2.7,
5,2024,6,2024-02-08,-10.9,13.4,2.7,
6,2024,7,2024-02-15,-10.9,13.4,2.7,
7,2024,8,2024-02-22,-10.9,13.4,2.7,-0.6
8,2024,9,2024-03-07,-10.9,13.4,2.7,-0.6
9,2024,10,2024-03-14,-10.9,13.4,2.7,-0.6


<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="4-2">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Quarterly vintages concatenation
    </span>
    </h2>

In [21]:
globals()[f"{sector}_quarterly_growth_rates"] = concatenate_quarterly_df(dataframes_dict_2, selected_spanish, selected_english)

DataFrames ending with '_2' that will be concatenated:
ns_01_2024_2
ns_02_2024_2
ns_03_2024_2
ns_04_2024_2
ns_05_2024_2
ns_06_2024_2
ns_07_2024_2
ns_08_2024_2
ns_09_2024_2
ns_10_2024_2
ns_11_2024_2
ns_12_2024_2
ns_13_2024_2
ns_14_2024_2
ns_15_2024_2
ns_16_2024_2
ns_17_2024_2
ns_18_2024_2
ns_19_2024_2
ns_20_2024_2
ns_21_2024_2
ns_22_2024_2
ns_23_2024_2
Number of rows in the concatenated dataframe: 23


In [22]:
pd.set_option('display.max_rows', None)
globals()[f"{sector}_quarterly_growth_rates"]

Unnamed: 0,year,id_ns,date,2020_3,2020_4,2021_1,2021_2,2021_3,2021_4,2022_1,2022_2,2022_3,2022_4,2023_1,2023_2,2023_3,2023_4,2024_1
0,2024,1,2024-01-04,-8.6,-1.3,4.3,42.2,11.6,3.4,3.8,3.3,2.0,1.8,-0.4,-0.5,-1.0,,
1,2024,2,2024-01-11,-8.6,-1.3,4.3,42.2,11.6,3.4,3.8,3.3,2.0,1.8,-0.4,-0.5,-1.0,,
2,2024,3,2024-01-18,-8.6,-1.3,4.3,42.2,11.6,3.4,3.8,3.3,2.0,1.8,-0.4,-0.5,-1.0,,
3,2024,4,2024-01-25,-8.6,-1.3,4.3,42.2,11.6,3.4,3.8,3.3,2.0,1.8,-0.4,-0.5,-1.0,,
4,2024,5,2024-02-01,-8.6,-1.3,4.3,42.2,11.6,3.4,3.8,3.3,2.0,1.8,-0.4,-0.5,-1.0,,
5,2024,6,2024-02-08,-8.6,-1.3,4.3,42.2,11.6,3.4,3.8,3.3,2.0,1.8,-0.4,-0.5,-1.0,,
6,2024,7,2024-02-15,-8.6,-1.3,4.3,42.2,11.6,3.4,3.8,3.3,2.0,1.8,-0.4,-0.5,-1.0,,
7,2024,8,2024-02-22,,,4.3,42.2,11.6,3.4,3.8,3.3,2.0,1.8,-0.4,-0.5,-0.9,-0.4,
8,2024,9,2024-03-07,,,4.3,42.2,11.6,3.4,3.8,3.3,2.0,1.8,-0.4,-0.5,-0.9,-0.4,
9,2024,10,2024-03-14,,,4.3,42.2,11.6,3.4,3.8,3.3,2.0,1.8,-0.4,-0.5,-0.9,-0.4,


<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="4-3">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.3.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Monthly vintages concatenation
    </span>
    </h2>

In [24]:
globals()[f"{sector}_monthly_growth_rates"] = concatenate_monthly_df(dataframes_dict_1, selected_spanish, selected_english)

DataFrames ending with '_1' that will be concatenated:
ns_01_2024_1
ns_02_2024_1
ns_03_2024_1
ns_04_2024_1
ns_05_2024_1
ns_06_2024_1
ns_07_2024_1
ns_08_2024_1
ns_09_2024_1
ns_10_2024_1
ns_11_2024_1
ns_12_2024_1
ns_13_2024_1
ns_14_2024_1
ns_15_2024_1
ns_16_2024_1
ns_17_2024_1
ns_18_2024_1
ns_19_2024_1
ns_20_2024_1
ns_21_2024_1
ns_22_2024_1
ns_23_2024_1
Number of rows in the concatenated dataframe: 23


In [25]:
pd.set_option('display.max_rows', None)
globals()[f"{sector}_monthly_growth_rates"]

Unnamed: 0,year,id_ns,date,oct_2022,nov_2022,dic_2022,ene_2023,feb_2023,mar_2023,abr_2023,...,jul_2023,ago_2023,sep_2023,oct_2023,nov_2023,dic_2023,ene_2024,feb_2024,mar_2024,abr_2024
0,2024,1,2024-01-04,2.3,2.1,1.0,-0.9,-0.6,0.3,0.4,...,-1.2,-0.5,-1.3,-0.8,,,,,,
1,2024,2,2024-01-11,2.3,2.1,1.0,-0.9,-0.6,0.3,0.4,...,-1.2,-0.5,-1.3,-0.8,,,,,,
2,2024,3,2024-01-18,,2.1,1.0,-0.9,-0.6,0.3,0.4,...,-1.2,-0.5,-1.3,-0.8,0.3,,,,,
3,2024,4,2024-01-25,,2.1,1.0,-0.9,-0.6,0.3,0.4,...,-1.2,-0.5,-1.3,-0.8,0.3,,,,,
4,2024,5,2024-02-01,,2.1,1.0,-0.9,-0.6,0.3,0.4,...,-1.2,-0.5,-1.3,-0.8,0.3,,,,,
5,2024,6,2024-02-08,,2.1,1.0,-0.9,-0.6,0.3,0.4,...,-1.2,-0.5,-1.3,-0.8,0.3,,,,,
6,2024,7,2024-02-15,,2.1,1.0,-0.9,-0.6,0.3,0.4,...,-1.2,-0.5,-1.3,-0.8,0.3,,,,,
7,2024,8,2024-02-22,,,,-0.9,-0.6,0.3,0.4,...,-1.2,-0.4,-1.2,-0.7,0.3,-0.7,,,,
8,2024,9,2024-03-07,,,,-0.9,-0.6,0.3,0.4,...,-1.2,-0.4,-1.2,-0.7,0.3,-0.7,,,,
9,2024,10,2024-03-14,,,,-0.9,-0.6,0.3,0.4,...,-1.2,-0.4,-1.2,-0.7,0.3,-0.7,,,,


<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="5">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">5.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">GDP final revision dataset</span></h1>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
This section creates the final revisions dataset of Peru's GDP growth.
</div>

<div id="5-1">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">5.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Annual revisions
    </span>
    </h2>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
We calculate the <b>final revision</b> as the <b>difference</b> between the <b>last annual release</b> and the <b>first annual release</b> of the GDP growth rate.
</div>

In [36]:
# Calculate the difference between the last and the first non-NaN value of each column, except 'year', 'id_ns' and 'date'.
revision = globals()[f"{sector}_annual_growth_rates"].drop(columns=['year', 'id_ns', 'date']).apply(lambda x: x.loc[x.last_valid_index()] - x.loc[x.first_valid_index()])
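<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
To make the rule concrete, here is a minimal sketch on a small, hypothetical vintage table (the values and the helper name <code>final_revision</code> are assumptions, not part of the project's datasets). It also returns NaN for a period that has no release yet, a case in which indexing with <code>first_valid_index()</code> directly would fail.
</div>

```python
import numpy as np
import pandas as pd


def final_revision(column: pd.Series) -> float:
    """Last non-NaN value minus first non-NaN value, or NaN if the column is empty."""
    first = column.first_valid_index()
    last = column.last_valid_index()
    if first is None:  # no release recorded for this period
        return np.nan
    return column.loc[last] - column.loc[first]


# Hypothetical vintages: one row per Weekly Note, one column per reference year
vintages = pd.DataFrame({
    "year": [2024, 2024, 2024],
    "id_ns": [1, 2, 3],
    "date": pd.to_datetime(["2024-01-04", "2024-01-11", "2024-01-18"]),
    "year_2022": [2.7, 2.7, 2.5],
    "year_2023": [np.nan, -0.6, -0.5],
    "year_2024": [np.nan, np.nan, np.nan],
})

# Apply the rule column by column, leaving out the identifier columns
revision = vintages.drop(columns=["year", "id_ns", "date"]).apply(final_revision)
print(revision.round(1))
```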

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Store the annual revisions as a dataframe.
</div>

In [39]:
# Build a dataframe with one row per reference year and name its columns
globals()[f"{sector}_annual_revisions"] = pd.DataFrame({'revision_date': revision.index, f'{sector}_revision': revision.values})
globals()[f"{sector}_annual_revisions"]

Unnamed: 0,revision_date,gdp_revision
0,year_2020,0.0
1,year_2021,0.0
2,year_2022,0.0
3,year_2023,0.0


<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Clean up the <code>revision_date</code> column.
</div>

In [38]:
# Extract the year digits from the label (e.g. 'year_2020') into an auxiliary column 'year'
globals()[f"{sector}_annual_revisions"]['year'] = globals()[f"{sector}_annual_revisions"]['revision_date'].str.extract(r'(\d+)')

# Convert the auxiliary column to an integer type
globals()[f"{sector}_annual_revisions"]['year'] = globals()[f"{sector}_annual_revisions"]['year'].astype(int)

# Convert the revision_date values to dates (January 1 of each year)
globals()[f"{sector}_annual_revisions"]['revision_date'] = pd.to_datetime(globals()[f"{sector}_annual_revisions"]['year'], format='%Y')

# Delete the auxiliary column 'year'
globals()[f"{sector}_annual_revisions"].drop(columns=['year'], inplace=True)

# Display the result
globals()[f"{sector}_annual_revisions"]

Unnamed: 0,revision_date,gdp_revision
0,2020-01-01,0.0
1,2021-01-01,0.0
2,2022-01-01,0.0
3,2023-01-01,0.0


<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="5-2">
   <!-- Target cell content -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">5.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Quarterly revisions
    </span>
    </h2>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
We calculate the <b>final revision</b> as the <b>difference</b> between the <b>last quarterly release</b> and the <b>first quarterly release</b> of the GDP growth rate.
</div>

In [40]:
# Calculate the difference between the last and the first non-NaN value of each column, except 'year', 'id_ns' and 'date'.
revision = globals()[f"{sector}_quarterly_growth_rates"].drop(columns=['year', 'id_ns', 'date']).apply(lambda x: x.loc[x.last_valid_index()] - x.loc[x.first_valid_index()])

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Store the quarterly revisions as a dataframe.
</div>

In [41]:
# Build a dataframe with one row per reference quarter and name its columns
globals()[f"{sector}_quarterly_revisions"] = pd.DataFrame({'revision_date': revision.index, f'{sector}_revision': revision.values})
globals()[f"{sector}_quarterly_revisions"]

Unnamed: 0,revision_date,gdp_revision
0,2020_3,0.0
1,2020_4,0.0
2,2021_1,0.0
3,2021_2,0.0
4,2021_3,0.0
5,2021_4,0.0
6,2022_1,0.0
7,2022_2,0.0
8,2022_3,0.0
9,2022_4,0.0


<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Clean up the <code>revision_date</code> column.
</div>

In [42]:
# Convert the 'revision_date' labels (year_quarter, e.g. '2020_3') to the first day of each quarter.
# Parsing them with format='%Y_%m' would read the quarter number as a month (Q3 would become March).
quarter_labels = globals()[f"{sector}_quarterly_revisions"]['revision_date'].str.replace('_', 'Q', regex=False)
globals()[f"{sector}_quarterly_revisions"]['revision_date'] = pd.PeriodIndex(quarter_labels, freq='Q').to_timestamp()

# Display the result
globals()[f"{sector}_quarterly_revisions"]

Unnamed: 0,revision_date,gdp_revision
0,2020-07-01,0.0
1,2020-10-01,0.0
2,2021-01-01,0.0
3,2021-04-01,0.0
4,2021-07-01,0.0
5,2021-10-01,0.0
6,2022-01-01,0.0
7,2022-04-01,0.0
8,2022-07-01,0.0
9,2022-10-01,0.0
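
<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Each label combines a year and a quarter number, so quarter <i>q</i> starts in month 3<i>q</i> - 2. As an independent check on the dates, the following minimal sketch derives the first day of each quarter by arithmetic. The labels and variable names are hypothetical and only assume the <code>year_quarter</code> pattern shown above.
</div>

```python
import pandas as pd

# Hypothetical labels in the year_quarter form used by the quarterly columns
labels = pd.Series(["2020_3", "2020_4", "2021_1"])

# Split each label into its year and quarter number
parts = labels.str.split("_", expand=True).astype(int)
parts.columns = ["year", "quarter"]

# Quarter q starts in month 3q - 2 (Q1 -> January, Q2 -> April, Q3 -> July, Q4 -> October)
first_month = 3 * parts["quarter"] - 2

# pd.to_datetime assembles dates from 'year', 'month' and 'day' columns
quarter_start = pd.to_datetime(
    pd.DataFrame({"year": parts["year"], "month": first_month, "day": 1})
)
print(quarter_start)
```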


<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="5-3">
   <!-- Target cell content -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">5.3.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Monthly revisions
    </span>
    </h2>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
We calculate the <b>final revision</b> as the <b>difference</b> between the <b>last monthly release</b> and the <b>first monthly release</b> of the GDP growth rate.
</div>

In [43]:
# Calculate the difference between the last and the first non-NaN value of each column, except 'year', 'id_ns' and 'date'.
revision = globals()[f"{sector}_monthly_growth_rates"].drop(columns=['year', 'id_ns', 'date']).apply(lambda x: x.loc[x.last_valid_index()] - x.loc[x.first_valid_index()])
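
<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
As a worked check of the arithmetic, the sketch below copies three rows (<code>id_ns</code> 1, 3 and 8) and three columns from the monthly vintage table displayed earlier in this notebook and obtains the same difference with a vectorised back-fill and forward-fill. Because it covers only those displayed rows, the numbers illustrate the calculation and are not the final revisions of the full dataset. The variable names are hypothetical.
</div>

```python
import numpy as np
import pandas as pd

# Three rows copied from the monthly vintage table displayed earlier (id_ns 1, 3 and 8)
excerpt = pd.DataFrame({
    "id_ns": [1, 3, 8],
    "date": pd.to_datetime(["2024-01-04", "2024-01-18", "2024-02-22"]),
    "ago_2023": [-0.5, -0.5, -0.4],
    "nov_2023": [np.nan, 0.3, 0.3],
    "dic_2023": [np.nan, np.nan, -0.7],
})

values = excerpt.drop(columns=["id_ns", "date"])

# Back-filling moves each column's first non-NaN value into the first row;
# forward-filling moves its last non-NaN value into the last row
first_release = values.bfill().iloc[0]
last_release = values.ffill().iloc[-1]

excerpt_revision = (last_release - first_release).round(1)
print(excerpt_revision)
```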

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Store the monthly revisions as a dataframe.
</div>

In [44]:
# Build a dataframe with one row per reference month and name its columns
globals()[f"{sector}_monthly_revisions"] = pd.DataFrame({'revision_date': revision.index, f'{sector}_revision': revision.values})
globals()[f"{sector}_monthly_revisions"]

Unnamed: 0,revision_date,gdp_revision
0,oct_2022,0.0
1,nov_2022,0.0
2,dic_2022,0.0
3,ene_2023,0.0
4,feb_2023,0.0
5,mar_2023,0.0
6,abr_2023,0.0
7,may_2023,0.0
8,jun_2023,-0.1
9,jul_2023,0.0


<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Clean up the <code>revision_date</code> column.
</div>

In [45]:
# Extract the month and year from the column 'revision_date'
globals()[f"{sector}_monthly_revisions"]['month'] = globals()[f"{sector}_monthly_revisions"]['revision_date'].str.split('_').str[0]
globals()[f"{sector}_monthly_revisions"]['year'] = globals()[f"{sector}_monthly_revisions"]['revision_date'].str.split('_').str[1]

# Match the names of the months to their respective numbers
month_mapping = {
    'ene': '01', 'feb': '02', 'mar': '03', 'abr': '04',
    'may': '05', 'jun': '06', 'jul': '07', 'ago': '08',
    'sep': '09', 'oct': '10', 'nov': '11', 'dic': '12'
}

globals()[f"{sector}_monthly_revisions"]['month'] = globals()[f"{sector}_monthly_revisions"]['month'].map(month_mapping)

# Rebuild 'revision_date' as a 'YYYY-MM' string
globals()[f"{sector}_monthly_revisions"]['revision_date'] = globals()[f"{sector}_monthly_revisions"]['year'] + '-' + globals()[f"{sector}_monthly_revisions"]['month']

# Convert column 'revision_date' to date data type
globals()[f"{sector}_monthly_revisions"]['revision_date'] = pd.to_datetime(globals()[f"{sector}_monthly_revisions"]['revision_date'], format='%Y-%m')

# Remove temporary columns 'month' and 'year'
globals()[f"{sector}_monthly_revisions"].drop(['month', 'year'], axis=1, inplace=True)

# Display the result
globals()[f"{sector}_monthly_revisions"]

Unnamed: 0,revision_date,gdp_revision
0,2022-10-01,0.0
1,2022-11-01,0.0
2,2022-12-01,0.0
3,2023-01-01,0.0
4,2023-02-01,0.0
5,2023-03-01,0.0
6,2023-04-01,0.0
7,2023-05-01,0.0
8,2023-06-01,-0.1
9,2023-07-01,0.0
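
<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
A label whose month abbreviation is missing from <code>month_mapping</code> is mapped to NaN and then fails at the date conversion. The following minimal sketch wraps the same steps in a hypothetical helper, <code>month_label_to_date</code>, that reports such labels explicitly. It assumes the <code>mon_YYYY</code> pattern with the Spanish abbreviations listed above.
</div>

```python
import pandas as pd

# Spanish month abbreviations used in the column labels
month_mapping = {
    "ene": "01", "feb": "02", "mar": "03", "abr": "04",
    "may": "05", "jun": "06", "jul": "07", "ago": "08",
    "sep": "09", "oct": "10", "nov": "11", "dic": "12",
}


def month_label_to_date(labels: pd.Series) -> pd.Series:
    """Convert labels such as 'oct_2022' to the first day of that month."""
    parts = labels.str.split("_", expand=True)
    month = parts[0].map(month_mapping)
    # Abbreviations absent from the mapping come back as NaN; report them by name
    unknown = labels[month.isna()]
    if not unknown.empty:
        raise ValueError(f"Unrecognised month labels: {unknown.tolist()}")
    return pd.to_datetime(parts[1] + "-" + month, format="%Y-%m")


labels = pd.Series(["oct_2022", "dic_2022", "ene_2023"])
dates = month_label_to_date(labels)
print(dates)
```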


<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="6">
   <!-- Target cell content -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">6.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Uploading data to SQL</span></h1> 

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Finally, we upload all the datasets generated in this jupyter notebook to the <code>'gdp_revisions_datasets'</code> database in <code>PostgreSQL</code>.
</div>

In [None]:
engine = create_sqlalchemy_engine()

# Vintages

In [None]:
#globals()[f"{sector}_annual_growth_rates"].to_sql(f'{sector}_annual_growth_rates', engine, index=False, if_exists='replace')

In [None]:
#globals()[f"{sector}_quarterly_growth_rates"].to_sql(f'{sector}_quarterly_growth_rates', engine, index=False, if_exists='replace')

In [None]:
globals()[f"{sector}_monthly_growth_rates"].to_sql(f'{sector}_monthly_growth_rates', engine, index=False, if_exists='replace')


# Revisions

In [None]:
#globals()[f"{sector}_annual_revisions"].to_sql(f'{sector}_annual_revisions', engine, index=False, if_exists='replace')

In [None]:
#globals()[f"{sector}_quarterly_revisions"].to_sql(f'{sector}_quarterly_revisions', engine, index=False, if_exists='replace')

In [None]:
globals()[f"{sector}_monthly_revisions"].to_sql(f'{sector}_monthly_revisions', engine, index=False, if_exists='replace')
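
<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
After an upload it can be useful to read a table back and compare it with the dataframe. The following minimal sketch shows that round trip with an in-memory SQLite connection standing in for the PostgreSQL engine; this is an assumption made so the example runs without a database server. The two rows are taken from the monthly revisions shown above, and the table name is hypothetical.
</div>

```python
import sqlite3

import pandas as pd

# Two rows taken from the monthly revisions table shown above
revisions = pd.DataFrame({
    "revision_date": pd.to_datetime(["2023-05-01", "2023-06-01"]),
    "gdp_revision": [0.0, -0.1],
})

# In-memory SQLite stands in for the PostgreSQL engine used by the notebook
connection = sqlite3.connect(":memory:")
revisions.to_sql("gdp_monthly_revisions", connection, index=False, if_exists="replace")

# Read the table back, parsing the date column, to confirm nothing was lost
stored = pd.read_sql(
    "SELECT * FROM gdp_monthly_revisions",
    connection,
    parse_dates=["revision_date"],
)
connection.close()
print(stored)
```

With <code>if_exists='replace'</code>, rerunning the upload drops and recreates the table, so the stored rows always match the latest dataframe.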

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 20px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#select" style="color: rgb(255, 32, 78); text-decoration: none;">⮝</a>
    </span> 
    <a href="#select" style="color: rgb(255, 32, 78); text-decoration: none;">Back to economic sectors.</a>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

---
