<div style="text-align: center; font-family: 'charter bt pro roman'; color: rgb(0, 65, 75);">
    <h1>
    GDP Revisions Datasets
    </h1>
</div>

<div style="text-align: center; font-family: 'charter bt pro roman'; color: rgb(0, 65, 75);">
    <h3>
        Documentation
        <br>
        ____________________
            <br>
    </h3>
</div>

<div style="text-align: center; font-family: 'PT Serif Pro Book'; color: rgb(0, 65, 75); font-size: 16px;">
    Jason Cruz
    <br>
    <a href="mailto:jj.cruza@up.edu.pe" style="color: rgb(0, 153, 123); font-size: 16px;">
        jj.cruza@up.edu.pe
    </a>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
This <span style="color: rgb(0, 65, 75);">Jupyter notebook</span> documents, step by step, the <b>construction of the datasets</b> for the project <b>'Revisions and biases in preliminary GDP estimates in Peru'</b>.

The notebook starts by downloading the Weekly Notes (NS) that the Central Reserve Bank of Peru (BCRP) stores on its website as PDF files, and ends by generating datasets of growth rates and revisions of Peru's GDP, which are loaded as tables into a SQL database. The NS contain the annual, quarterly and monthly GDP growth rates by economic sector of Peru, while the main datasets used for the data analysis of this project are generated in this notebook using big data and machine learning techniques.
</div>

<div style="font-family: Amaya; text-align: left; color: rgb(0, 65, 75); font-size:16px">The following <b>outline is functional</b>. Click any entry to jump to the corresponding section of this notebook.</div>

<div id="outilne">
   <!-- Target cell content -->
</div>

<div style="background-color: #292929; padding: 10px;">
<h2 style="text-align: left; font-family: 'charter'; color: #E0E0E0;">
    Outline
    </h2>
    <br>
    <a href="#libraries" style="color: #687EFF; font-size: 18px;">
        Libraries</a>
    <br>
    <a href="#setup" style="color: #687EFF; font-size: 18px;">
        Initial set-up</a>
    <br>
    <a href="#1" style="color: #687EFF; font-size: 18px;">
        1. PDF Downloader</a>
    <br>
    <a href="#2" style="color: #687EFF; font-size: 18px;">
        2. Generate PDF input with key tables</a>
    <br>
    <a href="#3" style="color: #687EFF; font-size: 18px;">
        3. Data cleaning</a>
    <br>
    <a href="#3-1" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.1. A brief documentation on issues in the table information of the PDFs.</a>
    <br>
    <a href="#3-2" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.2. Extracting tables and data cleanup.</a>
    <br>
    <a href="#3-2-1" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.2.1. Table 1. Extraction and cleaning of data from tables on monthly real GDP growth rates.</a>
    <br>
    <a href="#3-2-2" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.2.2. Table 2. Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.</a>
    <br>
    <a href="#4" style="color: #687EFF; font-size: 18px;">4. SQL Tables</a>
    <br>
    <a href="#4-1" style="color: rgb(0, 153, 123); font-size: 12px;">
        4.1. Annual Concatenation.</a>
    <br>
    <a href="#4-2" style="color: rgb(0, 153, 123); font-size: 12px;">
        4.2. Quarterly Concatenation.</a>
    <br>
    <a href="#4-3" style="color: rgb(0, 153, 123); font-size: 12px;">
        4.3. Monthly Concatenation.</a>
    <br>
    <a href="#4-4" style="color: rgb(0, 153, 123); font-size: 12px;">
        4.4. Loading SQL.</a>
</div>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    If you have any questions or issues regarding the code, please <a href="mailto:jj.cruza@alum.up.edu.pe" style="color: rgb(0, 153, 123)">email Jason Cruz
    </a>.
    </div>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    If any of the libraries below are missing, you can install them with code like the following example.
    </div>

In [None]:
#!pip install pdfplumber # Example: remove the leading "#" to install a library you do not have yet (os ships with Python and needs no installation).

<div id="libraries">
   <!-- Target cell content -->
</div>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark;">
    <h2>
    Libraries
    </h2>
    <div/>

In [1]:
# 1. PDF downloader
#-------------------------------------------------------------------------------------------------------------------------------

import os  # For file and directory manipulation, for interacting with the operating system
import random  # To generate random numbers
from selenium import webdriver  # For automating web browsers
from selenium.webdriver.common.by import By  # To locate elements on a webpage
from selenium.webdriver.support.ui import WebDriverWait  # To wait until certain conditions are met on a webpage.
from selenium.webdriver.support import expected_conditions as EC  # To define expected conditions
from selenium.common.exceptions import StaleElementReferenceException  # To handle exceptions related to elements on the webpage that are no longer available.
import pygame # Allows you to handle graphics, sounds and input events.

import shutil # Used for high-level file operations, such as copying, moving, renaming, and deleting files and directories.


# 2. Generate PDF input with key tables
#-------------------------------------------------------------------------------------------------------------------------------

import fitz  # This library is used for working with PDF documents, including reading, writing, and modifying PDFs (PyMuPDF).
import tkinter as tk  # This library is used for creating graphical user interfaces (GUIs) in Python.


# 3. Data cleaning
#-------------------------------------------------------------------------------------------------------------------------------

# 3.1. A brief documentation on issues in the table information of the PDFs

from PIL import Image  # Used for opening, manipulating, and saving image files.
import matplotlib.pyplot as plt  # Used for creating static, animated, and interactive visualizations.

# 3.2. Extracting tables and data cleanup

import pdfplumber  # For extracting text and metadata from PDF files
import pandas as pd  # For data manipulation and analysis
import unicodedata  # For manipulating Unicode data
import re  # For regular expressions operations
from datetime import datetime  # For working with dates and times
import locale  # For locale-specific formatting of numbers, dates, and currencies

# 3.2.1. Table 1. Extraction and cleaning of data from tables on monthly real GDP growth rates

import tabula  # Used to extract tables from PDF files into pandas DataFrames
from tkinter import Tk, messagebox, TOP, YES, NO  # Used for creating graphical user interfaces
from sqlalchemy import create_engine  # Used for connecting to and interacting with SQL databases

# 3.2.2. Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates

import roman  # For converting between Roman numerals and integers (datetime is already imported above)


# 4. SQL tables
#-------------------------------------------------------------------------------------------------------------------------------

import psycopg2  # For interacting with PostgreSQL databases
from sqlalchemy import create_engine, text  # For creating and executing SQL queries using SQLAlchemy


pygame 2.5.2 (SDL 2.28.3, Python 3.12.1)
Hello from the pygame community. https://www.pygame.org/contribute.html


<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Back to the
    <a href="#outilne" style="color: #3d30a2;">
    outline.
    </a>
    <div/>

<div id="setup">
   <!-- Target cell content -->
</div>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark;">
    <h2>
    Initial set-up
    </h2>
    <div/>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> The following cells create, in your current working directory, the folders that this notebook uses to read its inputs and save its outputs. </div>

In [2]:
# Folder path to save downloaded PDF files

raw_pdf = 'raw_pdf' # to save raw data (.pdf).
if not os.path.exists(raw_pdf):
    os.mkdir(raw_pdf) # to create the folder (if it doesn't exist)

In [3]:
# Folder path to save text file with the names of already downloaded files

download_record = 'download_record'
if not os.path.exists(download_record):
    os.mkdir(download_record) # to create the folder (if it doesn't exist)

In [4]:
# Folder path to download the trimmed PDF files (these are PDF inputs for the extraction and cleanup code)

input_pdf = 'input_pdf'
if not os.path.exists(input_pdf):
    os.makedirs(input_pdf) # to create the folder (if it doesn't exist)

In [5]:
# Folder path to save PDF files containing only the pages of interest (where the GDP growth rate tables are located)

input_pdf_record = 'input_pdf_record'
if not os.path.exists(input_pdf_record):
    os.makedirs(input_pdf_record)

In [6]:
# Folder path to save dataframes generated record by year

dataframes_record = 'dataframes_record'
if not os.path.exists(dataframes_record):
    os.makedirs(dataframes_record) # to create the folder (if it doesn't exist)

In [7]:
# Folder path to save sound files

sound_folder = 'sound'
if not os.path.exists(sound_folder):
    os.makedirs(sound_folder) # to create the folder (if it doesn't exist)

In [8]:
# Folder path to save screenshots about issues in the table information

ns_issues_folder = 'ns_issues_folder'
if not os.path.exists(ns_issues_folder):
    os.makedirs(ns_issues_folder) # to create the folder (if it doesn't exist)
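
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The set-up cells above repeat the same pattern for each folder. As a minimal sketch, the same result could be obtained with a single loop. The helper name <code>ensure_folders</code> and its <code>base_path</code> parameter are hypothetical and are not part of <code>gdp_revisions_datasets_functions.py</code>; the folder names are the ones used above.
</div>

```python
import os
import tempfile

# Folder names taken from the set-up cells above
folders = ['raw_pdf', 'download_record', 'input_pdf', 'input_pdf_record',
           'dataframes_record', 'sound', 'ns_issues_folder']

def ensure_folders(names, base_path='.'):
    """Create each folder under base_path if it does not exist yet; return the paths."""
    paths = []
    for name in names:
        path = os.path.join(base_path, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths

# Example run in a temporary directory, so nothing is created in the project folder
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_folders(folders, base_path=tmp)
    all_created = all(os.path.isdir(path) for path in created)
```

Calling <code>ensure_folders(folders)</code> without <code>base_path</code> would create the folders in the current working directory, as the individual cells do.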

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(178, 6, 0); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Import all functions required by this jupyter notebook.
    </span>
</div>


In [9]:
from gdp_revisions_datasets_functions import *

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Back to the
    <a href="#outilne" style="color: #3d30a2;">
    outline.
    </a>
    <div/>

<div id="1">
   <!-- Target cell content -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: 'PT Serif Pro Book'; color: dark;">1.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">PDF Downloader</span></h1>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Our main data source is the <a href="https://www.bcrp.gob.pe/publicaciones/nota-semanal.html" style="color: rgb(0, 153, 123)">BCRP's web page</a>. The Weekly Note (NS) is a weekly publication that the BCRP issues in compliance with article 84 of the Peruvian Constitution and articles 2 and 74 of the BCRP's organic law, which include, among the bank's functions, the periodic publication of the main national macroeconomic statistics.
    
Our project requires two tables published in each NS: the table of monthly growth rates of real GDP (12-month percentage changes), and the table of quarterly and annual growth rates of real GDP. These tables are referred to as Table 1 and Table 2, respectively, throughout this Jupyter notebook.
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    The bot below performs the following steps:
    <ol>
        <li>Download the PDF files (NS) from the BCRP web page, starting with the oldest and continuing to the most recent.</li>
        <li>Notify you with a fabulous song each time a certain number of downloads is reached.</li>
        <li>Display a window asking if you want to continue with the downloads. You can stop them at any time.</li>
        <li>Report in detail about the downloaded files. If a file has already been downloaded, you will also be notified.</li>
        <li>Save the raw PDFs to the paths set in the preamble of this Jupyter Notebook.</li>
    </ol>
    Try the bot, it's an adventure!
</div>
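
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The "already downloaded" check in step 4 relies on a plain text record with one file name per line. The sketch below illustrates that idea under stated assumptions: the helper names <code>load_record</code> and <code>append_to_record</code> and the sample file names are hypothetical, and the actual bookkeeping is done inside <code>download_pdf</code>, whose implementation is not shown in this notebook.
</div>

```python
import os
import tempfile

def load_record(record_path):
    """Return the file names listed in the record file (empty list if there is none)."""
    if not os.path.exists(record_path):
        return []
    with open(record_path, "r") as f:
        return f.read().splitlines()

def append_to_record(record_path, file_name):
    """Append one downloaded file name to the record file."""
    with open(record_path, "a") as f:
        f.write(file_name + "\n")

# Example with made-up file names in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    record_path = os.path.join(tmp, "downloaded_files.txt")
    append_to_record(record_path, "ns-01-2013.pdf")  # pretend this file was downloaded earlier
    already = load_record(record_path)
    candidates = ["ns-01-2013.pdf", "ns-02-2013.pdf"]
    pending = [name for name in candidates if name not in already]  # files still to download
```

Because the record lives on disk, an interrupted session can be resumed later without downloading the same NS twice.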

In [None]:
# Setting the BCRP URL
bcrp_url = "https://www.bcrp.gob.pe/publicaciones/nota-semanal.html"  # Never replace this URL

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The next cell runs the bot that carries out the downloads.
    <div/>

In [None]:
# Initialize pygame
pygame.mixer.init()

# List of available sound files
available_sounds = os.listdir(sound_folder)

# Select a random sound
random_sound = random.choice(available_sounds)

# Full path of the random sound
sound_path = os.path.join(sound_folder, random_sound)

# Load the selected sound
pygame.mixer.music.load(sound_path)

# List to keep track of successfully downloaded files
downloaded_files = []

# Load the list of previously downloaded files if it exists
if os.path.exists(os.path.join(download_record, "downloaded_files.txt")):
    with open(os.path.join(download_record, "downloaded_files.txt"), "r") as f:
        downloaded_files = f.read().splitlines()

# Web driver setup

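'''
Note: Download the Chrome driver (chromedriver) from 'https://googlechromelabs.github.io/chrome-for-testing/#stable'
and store the path to that executable in the environment variable read in (1).
'''
from selenium.webdriver.chrome.service import Service  # Selenium 4 receives the driver path through a Service object

driver_path = os.environ.get('driver_path') # (1)
driver = webdriver.Chrome(service=Service(executable_path=driver_path))  # the executable_path argument of webdriver.Chrome was removed in recent Selenium 4 releases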

# Number of downloads per batch
downloads_per_batch = 5
# Total number of downloads
total_downloads = 5

try:
    # Open the test page
    driver.get(bcrp_url)
    print("Site opened successfully")

    # Wait for the container area to be present
    wait = WebDriverWait(driver, 60)
    container_area = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="rightside"]')))

    # Get all the links within the container area
    pdf_links = container_area.find_elements(By.XPATH, './/a')

    # Reverse the order of links
    pdf_links = list(reversed(pdf_links))

    # Initialize download counter
    download_counter = 0

    # Iterate over reversed links and download PDFs in batches
    for pdf_link in pdf_links:
        download_counter += 1

        # Get the file name from the URL
        new_url = pdf_link.get_attribute("href")
        file_name = new_url.split("/")[-1]

        # Check if the file has already been downloaded
        if file_name in downloaded_files:
            print(f"{download_counter}. The file {file_name} has already been downloaded previously. Skipping...")
            continue

        # Try to download the file
        try:
            download_pdf(driver, pdf_link, wait, download_counter, raw_pdf, download_record)

            # Update the list of downloaded files
            downloaded_files.append(file_name)

        except Exception as e:
            print(f"Error downloading the file {file_name}: {str(e)}")

        # If the download count reaches a multiple of the batch size, notify
        if download_counter % downloads_per_batch == 0:
            print(f"Batch {download_counter // downloads_per_batch} completed ({downloads_per_batch} files per batch)")

        # After every fifth file, play a song and ask the user whether to continue
        if download_counter % 5 == 0:
            play_sound()
            user_input = input("Do you want to continue downloading? (Enter 'y' to continue, any other key to stop): ")
            pygame.mixer.music.stop()
            if user_input.lower() != 'y':
                break

        # Random wait before the next iteration
        random_wait(5, 10)

        # If total downloads reached, break out of loop
        if download_counter == total_downloads:
            print(f"All downloads completed ({total_downloads} in total)")
            break

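except StaleElementReferenceException:
    # The loop is not retried automatically; files already recorded are skipped on the next run
    print("StaleElementReferenceException occurred. Run this cell again to resume the downloads.")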

finally:
    # Close the browser when finished
    driver.quit()


<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
The NS (PDF files) were probably downloaded into a single folder (raw_pdf), but we want them sorted by year. The following code sorts the PDFs into one subfolder per year, placing each NS according to its year of publication. This happens in the blink of an eye.
    <div/>

In [None]:
# Get the list of files in the directory
files = os.listdir(raw_pdf)

# Call the function to organize files
organize_files_by_year(raw_pdf)
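
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The implementation of <code>organize_files_by_year</code> lives in the functions script and is not shown here. The following is only a sketch of how such a sorter could work, assuming that the files are named like <code>ns-08-2014.pdf</code> (the pattern that the extraction code in Section 3 parses). The function name <code>organize_by_year_sketch</code> and the sample file names are hypothetical.
</div>

```python
import os
import re
import shutil
import tempfile

def organize_by_year_sketch(folder):
    """Move every 'ns-<number>-<year>.pdf' file in folder into a '<year>' subfolder."""
    moved = {}
    for name in os.listdir(folder):
        match = re.fullmatch(r'ns-(\d+)-(\d{4})\.pdf', name)
        if not match:
            continue  # leave subfolders and unrelated files untouched
        year = match.group(2)
        year_folder = os.path.join(folder, year)
        os.makedirs(year_folder, exist_ok=True)
        shutil.move(os.path.join(folder, name), os.path.join(year_folder, name))
        moved[name] = year
    return moved

# Example with two empty placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ["ns-08-2014.pdf", "ns-15-2016.pdf"]:
        open(os.path.join(tmp, name), "w").close()
    moved = organize_by_year_sketch(tmp)
    in_2014 = os.listdir(os.path.join(tmp, "2014"))
```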

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Back to the
    <a href="#outilne" style="color: #3d30a2;">
    outline.
    </a>
    <div/>

<div id="2">
   <!-- Target cell content -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book; color: dark;">2.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Generate PDF input with key tables</span></h1>

In [None]:
class PopupWindow(tk.Toplevel):
    """Creates a pop-up window for user interaction."""

    def __init__(self, root, message):
        """Initialize the pop-up window."""
        super().__init__(root)
        self.root = root
        self.title("Attention!")
        self.message = message
        self.result = None
        self.configure_window()
        self.create_widgets()

    def configure_window(self):
        """Configure the window to be non-resizable."""
        self.resizable(False, False)

    def create_widgets(self):
        """Create widgets (labels and buttons) inside the pop-up window."""
        self.label = tk.Label(self, text=self.message, wraplength=250)  # Adjust text if too long
        self.label.pack(pady=10, padx=10)
        self.btn_frame = tk.Frame(self)
        self.btn_frame.pack(pady=5)
        self.btn_yes = tk.Button(self.btn_frame, text="Yes", command=self.yes)
        self.btn_yes.pack(side=tk.LEFT, padx=5)
        self.btn_no = tk.Button(self.btn_frame, text="No", command=self.no)
        self.btn_no.pack(side=tk.RIGHT, padx=5)

        # Calculate window size based on text size
        width = self.label.winfo_reqwidth() + 20
        height = self.label.winfo_reqheight() + 100
        self.geometry(f"{width}x{height}")

    def yes(self):
        """Set result to True and close the window."""
        self.result = True
        self.destroy()

    def no(self):
        """Set result to False and close the window."""
        self.result = False
        self.destroy()

if __name__ == "__main__":
    keywords = ["ECONOMIC SECTORS"]
    root = tk.Tk()
    root.withdraw()  # Hide the main Tkinter window

    input_pdf_files = read_input_pdf_files()
    processing_counter = 1

    for folder in os.listdir(raw_pdf):
        folder_path = os.path.join(raw_pdf, folder)
        if os.path.isdir(folder_path):
            print("Processing folder:", folder)
            num_pdfs_trimmed = 0
            for filename in os.listdir(folder_path):
                if filename.endswith(".pdf"):
                    pdf_file = os.path.join(folder_path, filename)
                    if filename in input_pdf_files:
                        print(f"{processing_counter}. The PDF '{filename}' has already been trimmed and saved in '{input_pdf}'...")
                        processing_counter += 1
                        continue
                    print(f"{processing_counter}. Processing:", pdf_file)
                    
                    pages_with_keywords = search_keywords(pdf_file, keywords)
                    num_pages_new_pdf = trim_pdf(pdf_file, pages_with_keywords)
                    if num_pages_new_pdf > 0:
                        num_pdfs_trimmed += 1
                        input_pdf_files.add(filename)
                        processing_counter += 1
            
            write_input_pdf_files(input_pdf_files)

            message = f"{num_pdfs_trimmed} PDFs have been trimmed in folder {folder}. Do you want to continue?"
            popup = PopupWindow(root, message)
            root.wait_window(popup)
            if not popup.result:
                break
                
    print("Process completed for all PDFs in directory:", input_pdf)
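
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The cell above delegates the work to <code>search_keywords</code> and <code>trim_pdf</code>, which are defined in the functions script. As a rough sketch of the underlying idea, and not the actual implementation, the pages of interest could be selected and saved with PyMuPDF as follows. The names <code>find_keyword_pages</code> and <code>trim_pdf_sketch</code> are hypothetical, and requiring every keyword to appear on a page is an assumption.
</div>

```python
def find_keyword_pages(page_texts, keywords):
    """Return the 0-based indices of the pages whose text contains every keyword."""
    return [i for i, text in enumerate(page_texts)
            if all(keyword in (text or "") for keyword in keywords)]

def trim_pdf_sketch(pdf_path, output_path, keywords):
    """Save a copy of the PDF that keeps only the pages containing the keywords."""
    import fitz  # PyMuPDF; imported here so the page-selection logic runs without it
    with fitz.open(pdf_path) as doc:
        pages = find_keyword_pages([page.get_text() for page in doc], keywords)
        if pages:
            doc.select(pages)      # keep only the selected pages, in order
            doc.save(output_path)  # write the trimmed copy to a new file
    return len(pages)

# Page selection on made-up page texts
sample_pages = ["Cover page", "GDP BY ECONOMIC SECTORS (monthly)",
                "Annex", "ECONOMIC SECTORS (quarterly)"]
selected = find_keyword_pages(sample_pages, ["ECONOMIC SECTORS"])
```

Keeping the page selection separate from the file handling makes it possible to check the keyword logic on plain strings before running it on the full set of PDFs.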


<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
Again, the NS (PDF files, now only a few pages each) were probably stored unsorted in the input_pdf folder. The following code sorts the PDFs into one subfolder per year, placing each NS (which now includes only the key tables) according to its year of publication. This happens in the blink of an eye.
    <div/>

In [None]:
# Get the list of files in the directory
files = os.listdir(input_pdf)

# Call the function to organize files
organize_files_by_year(input_pdf)

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Back to the
    <a href="#outilne" style="color: #3d30a2;">
    outline.
    </a>
    <div/>

<div id="3">
   <!-- Target cell content -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book; color: dark;">3.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Data cleaning</span></h1>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
Now that we have PDFs containing only the tables required for this project, we can extract those tables and then clean the data.
</p>  
<div/>

<div id="3-1">
   <!-- Target cell content -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    A brief documentation on issues in the table information of the PDFs. 
    </span>
    </h2>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
Note that the table information in the PDFs is available as editable text (including the numeric values), but it sometimes comes in encoded formats that make it difficult to extract and clean. This is undoubtedly the most challenging stage of this Jupyter notebook: the information in the PDFs follows no single layout, so each PDF adds its own extraction difficulty. To illustrate this point, we start this section by documenting the most common problems we may face when extracting and cleaning tables from the PDFs.
</div>

<div id="3-2">
   <!-- Target cell content -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Extracting tables and data cleanup
    </span>
    </h2>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
The main library used for extracting tables from PDFs is <code>pdfplumber</code>. You can review the official documentation by clicking <a href="https://github.com/jsvine/pdfplumber" style="color: rgb(0, 153, 123); font-size: 16px;">here</a>.
</p>
    
<p>     
    The functions in <b>Section 3</b> of the <code>"gdp_revisions_datasets_functions.py"</code> script were built to deal with each of these issues. An interesting exercise is to compare the original tables (the ones in the PDF) with the cleaned tables (produced by the cleanup code below). To that end, the cleanup code for <a href="#3-2-1" style="color: rgb(0, 153, 123); font-size: 16px;">Table 1</a> and <a href="#3-2-2" style="color: rgb(0, 153, 123); font-size: 16px;">Table 2</a> generates two dictionaries: the first stores the raw tables, that is, the original tables extracted from the PDF by the <code>pdfplumber</code> library, while the second stores the fully cleaned tables.
</p>
<div/>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
The basic criterion for extracting a table is the presence of keywords (a sufficient condition). That is, tables containing the following keywords meet the requirements to be extracted.
</p>
<div/>

In [None]:
# Keywords to search in the page text
keywords = ["ECONOMIC SECTORS"]

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The code iterates through each PDF and extracts the two required tables from it. The extracted information is then converted into dataframes, and the column names and values are cleaned up to follow Python conventions (pythonic).
    <div/>
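
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The extraction code identifies each NS from its file name, using the pattern <code>ns-(\d+)-(\d{4})</code>, and names each dataframe after the file name with hyphens replaced by underscores. The sketch below isolates that step. The function name <code>parse_ns_filename</code> and the example paths are hypothetical.
</div>

```python
import os
import re

def parse_ns_filename(pdf_path):
    """Return (id_ns, year, key_prefix) for a file named like 'ns-08-2014.pdf', or None."""
    filename = os.path.basename(pdf_path)
    match = re.search(r'ns-(\d+)-(\d{4})', filename)
    if match is None:
        return None  # the file name does not follow the expected pattern
    id_ns, year = match.groups()
    key_prefix = os.path.splitext(filename)[0].replace('-', '_')
    return id_ns, year, key_prefix

parsed = parse_ns_filename("input_pdf/2014/ns-08-2014.pdf")
unparsed = parse_ns_filename("notes.pdf")
```

Note that <code>id_ns</code> and <code>year</code> are returned as strings, so the leading zero of the NS number is preserved.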

<div id="3-2-1">
   <!-- Target cell content -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.2.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    <span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">Table 1.</span> Extraction and cleaning of data from tables on monthly real GDP growth rates.
    </span>
    </h3>

In [10]:
# Set the locale to Spanish
locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')

# Keywords to search for in the page text
keywords = ["ECONOMIC SECTORS"]

# Dictionary to store the generated DataFrames
dataframes_dict_1 = {}

# Path of the record file listing the folders already processed
registro_path = 'dataframes_record/carpetas_procesadas_1.txt'

# Function to correct month names
def corregir_nombre_mes(mes):
    meses_mapping = {
        'setiembre': 'septiembre',
        # Add more mappings here if other month names need correcting
    }
    return meses_mapping.get(mes, mes)

def registrar_carpeta_procesada(carpeta, num_archivos_procesados):
    with open(registro_path, 'a') as file:
        file.write(f"{carpeta}:{num_archivos_procesados}\n")

def carpeta_procesada(carpeta):
    if not os.path.exists(registro_path):
        return False
    with open(registro_path, 'r') as file:
        for line in file:
            if line.startswith(carpeta):
                return True
    return False

def obtener_fecha(df, engine):
    id_ns = df['id_ns'].iloc[0]
    year = df['year'].iloc[0]
    query = f"SELECT date FROM dates_growth_rates WHERE id_ns = '{id_ns}' AND year = '{year}';"
    fecha = pd.read_sql(query, engine)
    return fecha.iloc[0, 0] if not fecha.empty else None

def procesar_pdf(pdf_path):
    tables_dict_1 = {}  # Local dictionary for each PDF
    table_counter = 1
    keyword_count = 0 

    filename = os.path.basename(pdf_path)
    id_ns_year_matches = re.findall(r'ns-(\d+)-(\d{4})', filename)
    if id_ns_year_matches:
        id_ns, year = id_ns_year_matches[0]
    else:
        print("No matches for id_ns and year were found in the file name:", filename)
        return None, None, None, None  # Four values, matching the unpacking in procesar_carpeta

    new_filename = os.path.splitext(os.path.basename(pdf_path))[0].replace('-', '_')

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, 1):
            text = page.extract_text() or ""  # Guard against pages for which no text is returned
            if all(keyword in text for keyword in keywords):
                keyword_count += 1
                if keyword_count == 1:  # Only process the first occurrence
                    tables = tabula.read_pdf(pdf_path, pages=i, multiple_tables=False, stream=True) # change stream to another option if desired
                    for j, table_df in enumerate(tables, start=1):
                        nombre_dataframe = f"{new_filename}_{keyword_count}"
                        tables_dict_1[nombre_dataframe] = table_df
                        table_counter += 1

                    break  # Leave the loop after finding the first occurrence

    return id_ns, year, tables_dict_1, keyword_count  # The date is not returned here

def procesar_carpeta(carpeta, engine):
    print(f"Processing folder {os.path.basename(carpeta)}")
    pdf_files = [os.path.join(carpeta, f) for f in os.listdir(carpeta) if f.endswith('.pdf')]

    num_pdfs_procesados = 0
    num_dataframes_generados = 0

    table_counter = 1  # Initialize the table counter here
    tables_dict_1 = {}  # Declare tables_dict_1 outside the main loop
    
    for pdf_file in pdf_files:
        id_ns, year, tables_dict_temp, keyword_count = procesar_pdf(pdf_file)

        if tables_dict_temp:
            for nombre_df, df in tables_dict_temp.items():
                nombre_archivo = os.path.splitext(os.path.basename(pdf_file))[0].replace('-', '_')
                nombre_df = f"{nombre_archivo}_{keyword_count}"
                
                # Store the unprocessed DataFrame in tables_dict_1
                tables_dict_1[nombre_df] = df.copy()
                
                # Apply the sequence of cleaning functions to a copy of the DataFrame
                df_clean = df.copy()

                if any(col.isdigit() and len(col) == 4 for col in df_clean.columns):
                    # If at least one column name represents a year
                    df_clean = swap_nan_se(df_clean)
                    df_clean = split_column_by_pattern(df_clean)
                    df_clean = drop_rare_caracter_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = relocate_last_columns(df_clean)
                    df_clean = replace_first_dot(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = replace_var_perc_first_column(df_clean)
                    df_clean = replace_var_perc_last_columns(df_clean)
                    df_clean = replace_number_moving_average(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = exchange_values(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = find_year_column(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = get_months_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = rounding_values(df_clean, decimales=1)
                else: # 2014 ns 08
                    # If no column name represents a year
                    df_clean = check_first_row(df_clean)
                    df_clean = check_first_row_1(df_clean)
                    df_clean = replace_first_row_with_columns(df_clean)
                    df_clean = swap_nan_se(df_clean)
                    df_clean = split_column_by_pattern(df_clean)
                    df_clean = drop_rare_caracter_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = relocate_last_columns(df_clean)
                    #df_clean = replace_first_dot(df_clean) # comment for 2014 ns 08
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = replace_var_perc_first_column(df_clean)
                    df_clean = replace_var_perc_last_columns(df_clean)
                    df_clean = replace_number_moving_average(df_clean)
                    df_clean = expand_column(df_clean) # 2014 ns 08
                    df_clean = split_values_1(df_clean) # 2014 ns 08
                    df_clean = split_values_2(df_clean) # 2016 ns 15
                    df_clean = split_values_3(df_clean) # 2016 ns 19
                    df_clean = separate_text_digits(df_clean)
                    df_clean = exchange_values(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = find_year_column(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = get_months_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_nan_with_previous_column_1(df_clean)
                    df_clean = replace_nan_with_previous_column_2(df_clean)
                    df_clean = replace_nan_with_previous_column_3(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = rounding_values(df_clean, decimales=1)
                
                # Add the 'year' column to the clean DataFrame
                df_clean.insert(0, 'year', year)

                # Add the 'id_ns' column to the clean DataFrame
                df_clean.insert(1, 'id_ns', id_ns)

                # Look up the corresponding date in the database
                fecha = obtener_fecha(df_clean, engine)
                if fecha:
                    # Add the 'date' column to the clean DataFrame
                    df_clean.insert(2, 'date', fecha)
                else:
                    print("No se encontró fecha en la base de datos para id_ns:", id_ns, "y year:", year)

                # Store the clean DataFrame in dataframes_dict_1
                dataframes_dict_1[nombre_df] = df_clean

                print(f'  {table_counter}. El dataframe generado para el archivo {pdf_file} es: {nombre_df}')
                num_dataframes_generados += 1
                table_counter += 1  # Increment the table counter here

        num_pdfs_procesados += 1  # Count every PDF processed in the folder

    return num_pdfs_procesados, num_dataframes_generados, tables_dict_1


def procesar_carpetas():
    pdf_folder = 'input_pdf'
    carpetas = [os.path.join(pdf_folder, d) for d in os.listdir(pdf_folder) if os.path.isdir(os.path.join(pdf_folder, d))]
    
    tables_dict_1 = {}  # Initialize tables_dict_1 here

    for carpeta in carpetas:
        if carpeta_procesada(carpeta):
            print(f"La carpeta {carpeta} ya ha sido procesada.")
            continue

        num_pdfs_procesados, num_dataframes_generados, tables_dict_temp = procesar_carpeta(carpeta, engine)

        # Update tables_dict_1 with the tables returned by procesar_carpeta()
        tables_dict_1.update(tables_dict_temp)

        registrar_carpeta_procesada(carpeta, num_pdfs_procesados)

        # Ask the user whether to continue with the next folder
        root = Tk()
        root.withdraw()
        root.attributes('-topmost', True)  # Keep the dialog in the foreground

        mensaje = f"Se han generado {num_dataframes_generados} dataframes en la carpeta {carpeta}. ¿Deseas continuar con la siguiente carpeta?"
        continuar = messagebox.askyesno("Continuar", mensaje)
        root.destroy()

        if not continuar:
            print("Procesamiento detenido por el usuario.")
            break  # Leave the loop if the user chooses not to continue
    else:
        # Runs only when the loop was not interrupted by the user
        print("Procesamiento completado para todas las carpetas.")

    return tables_dict_1  # Return tables_dict_1 at the end of the function

if __name__ == "__main__":
    # Get environment variables
    user = os.environ.get('CIUP_SQL_USER')
    password = os.environ.get('CIUP_SQL_PASS')
    host = os.environ.get('CIUP_SQL_HOST')
    port = 5432
    database = 'gdp_revisions_datasets'

    # Check if all environment variables are defined
    if not all([host, user, password]):
        raise ValueError("Some environment variables are missing (CIUP_SQL_HOST, CIUP_SQL_USER, CIUP_SQL_PASS)")

    # Create connection string
    connection_string = f"postgresql://{user}:{password}@{host}:{port}/{database}"

    # Create SQLAlchemy engine
    engine = create_engine(connection_string)

    tables_dict_1 = procesar_carpetas()  # Capture the value returned by procesar_carpetas()


La carpeta input_pdf\2013 ya ha sido procesada.
La carpeta input_pdf\2014 ya ha sido procesada.
La carpeta input_pdf\2015 ya ha sido procesada.
La carpeta input_pdf\2016 ya ha sido procesada.
La carpeta input_pdf\2017 ya ha sido procesada.
La carpeta input_pdf\2018 ya ha sido procesada.
La carpeta input_pdf\2019 ya ha sido procesada.
La carpeta input_pdf\2020 ya ha sido procesada.
La carpeta input_pdf\2021 ya ha sido procesada.
La carpeta input_pdf\2022 ya ha sido procesada.
La carpeta input_pdf\2023 ya ha sido procesada.
Procesando la carpeta 2024


NameError: name 'number_moving_average' is not defined

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Back to the
    <a href="#outilne" style="color: #3d30a2;">
    outline.
    </a>
    </div>

In [None]:
tables_dict_1.keys()

In [None]:
dataframes_dict_1.keys()

In [None]:
tables_dict_1['ns_01_2024_1'].head(5)

In [None]:
df_1 = dataframes_dict_1['ns_01_2024_1']
df_1

In [None]:
df_1[(df_1['sectores_economicos'] == 'agropecuario') | (df_1['economic_sectors'] == 'agriculture and livestock')]

<div id="3-2-2">
   <!-- Contenido de la celda de destino -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.2.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    <span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">Table 2.</span> Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.
    </span>
    </h3>

In [None]:
# Set the locale to Spanish
locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')

# Keywords to search for in the page text
keywords = ["ECONOMIC SECTORS"]

# Dictionary to store the generated DataFrames
dataframes_dict_2 = {}

# Path to the log file of processed folders
registro_path = 'dataframes_record/carpetas_procesadas_2.txt'

# Function to correct month names
def corregir_nombre_mes(mes):
    meses_mapping = {
        'setiembre': 'septiembre',
        # Add more mappings here if other month names need correcting
    }
    return meses_mapping.get(mes, mes)

def registrar_carpeta_procesada(carpeta, num_archivos_procesados):
    with open(registro_path, 'a') as file:
        file.write(f"{carpeta}:{num_archivos_procesados}\n")

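def carpeta_procesada(carpeta):
    if not os.path.exists(registro_path):
        return False
    with open(registro_path, 'r') as file:
        for line in file:
            # Compare the whole folder name (the text before the last ':') so that
            # one folder cannot match another whose name merely starts with it
            if line.rstrip('\n').rsplit(':', 1)[0] == carpeta:
                return True
    return False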

def obtener_fecha(df, engine):
    id_ns = df['id_ns'].iloc[0]
    year = df['year'].iloc[0]
    query = f"SELECT date FROM dates_growth_rates WHERE id_ns = '{id_ns}' AND year = '{year}';"
    fecha = pd.read_sql(query, engine)
    return fecha.iloc[0, 0] if not fecha.empty else None

def procesar_pdf(pdf_path):
    tables_dict_2 = {}  # Local dictionary for each PDF
    table_counter = 1
    keyword_count = 0 

    filename = os.path.basename(pdf_path)
    id_ns_year_matches = re.findall(r'ns-(\d+)-(\d{4})', filename)
    if id_ns_year_matches:
        id_ns, year = id_ns_year_matches[0]
    else:
        print("No se encontraron coincidencias para id_ns y year en el nombre del archivo:", filename)
        return None, None, None, None

    new_filename = os.path.splitext(os.path.basename(pdf_path))[0].replace('-', '_')

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, 1):
            text = page.extract_text() or ""  # Guard against pages with no extractable text
            if all(keyword in text for keyword in keywords):
                keyword_count += 1
                if keyword_count == 2:
                    tables = tabula.read_pdf(pdf_path, pages=i, multiple_tables=False)
                    for j, table_df in enumerate(tables, start=1):
                        nombre_dataframe = f"{new_filename}_{keyword_count}"
                        tables_dict_2[nombre_dataframe] = table_df
                        table_counter += 1

    return id_ns, year, tables_dict_2, keyword_count


def procesar_carpeta(carpeta, engine):
    print(f"Procesando la carpeta {os.path.basename(carpeta)}")
    pdf_files = [os.path.join(carpeta, f) for f in os.listdir(carpeta) if f.endswith('.pdf')]

    num_pdfs_procesados = 0
    num_dataframes_generados = 0

    table_counter = 1  # Initialize the table counter here
    tables_dict_2 = {}  # Declare tables_dict_2 outside the main loop
    
    for pdf_file in pdf_files:
        id_ns, year, tables_dict_temp, keyword_count = procesar_pdf(pdf_file)

        if tables_dict_temp:
            for nombre_df, df in tables_dict_temp.items():
                nombre_archivo = os.path.splitext(os.path.basename(pdf_file))[0].replace('-', '_')
                nombre_df = f"{nombre_archivo}_{keyword_count}"

                # Store the raw DataFrame in tables_dict_2
                tables_dict_2[nombre_df] = df.copy()

                # Apply the cleaning functions to a copy of the DataFrame
                df_clean = df.copy()
                if pd.isna(df_clean.iloc[0, 0]):
                    # Cleaning pipeline for tables whose first cell is NaN
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = separate_years(df_clean)
                    df_clean = relocate_roman_numerals(df_clean)
                    df_clean = extract_mixed_values(df_clean)
                    df_clean = replace_first_row_nan(df_clean)
                    df_clean = first_row_columns(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = drop_nan_row(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = split_values(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = roman_arabic(df_clean)
                    df_clean = fix_duplicates(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = get_quarters_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = rounding_values(df_clean, decimales=1)
                else:
                    # Cleaning pipeline for tables whose first cell is not NaN
                    df_clean = exchange_roman_nan(df_clean)
                    df_clean = exchange_columns(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = last_column_es(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = roman_arabic(df_clean)
                    df_clean = fix_duplicates(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = get_quarters_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_services(df_clean)
                    df_clean = rounding_values(df_clean, decimales=1)

                # Add the 'year' column to the clean DataFrame
                df_clean.insert(0, 'year', year)

                # Add the 'id_ns' column to the clean DataFrame
                df_clean.insert(1, 'id_ns', id_ns)

                # Look up the corresponding date in the database
                fecha = obtener_fecha(df_clean, engine)
                if fecha:
                    # Add the 'date' column to the clean DataFrame
                    df_clean.insert(2, 'date', fecha)
                else:
                    print("No se encontró fecha en la base de datos para id_ns:", id_ns, "y year:", year)

                # Store the clean DataFrame in dataframes_dict_2
                dataframes_dict_2[nombre_df] = df_clean

                print(f'  {table_counter}. El dataframe generado para el archivo {pdf_file} es: {nombre_df}')
                num_dataframes_generados += 1
                table_counter += 1  # Increment the table counter here

        num_pdfs_procesados += 1  # Count every PDF processed in the folder

    return num_pdfs_procesados, num_dataframes_generados, tables_dict_2
        
def procesar_carpetas():
    pdf_folder = 'input_pdf'
    carpetas = [os.path.join(pdf_folder, d) for d in os.listdir(pdf_folder) if os.path.isdir(os.path.join(pdf_folder, d))]

    tables_dict_2 = {}  # Initialize tables_dict_2 here

    for carpeta in carpetas:
        if carpeta_procesada(carpeta):
            print(f"La carpeta {carpeta} ya ha sido procesada.")
            continue

        num_pdfs_procesados, num_dataframes_generados, tables_dict_temp = procesar_carpeta(carpeta, engine)

        # Update tables_dict_2 with the tables returned by procesar_carpeta()
        tables_dict_2.update(tables_dict_temp)

        registrar_carpeta_procesada(carpeta, num_pdfs_procesados)

        # Ask the user whether to continue with the next folder
        root = Tk()
        root.withdraw()
        root.attributes('-topmost', True)  # Keep the dialog in the foreground

        mensaje = f"Se han generado {num_dataframes_generados} dataframes en la carpeta {carpeta}. ¿Deseas continuar con la siguiente carpeta?"
        continuar = messagebox.askyesno("Continuar", mensaje)
        root.destroy()

        if not continuar:
            print("Procesamiento detenido por el usuario.")
            break  # Leave the loop if the user chooses not to continue
    else:
        # Runs only when the loop was not interrupted by the user
        print("Procesamiento completado para todas las carpetas.")

    return tables_dict_2  # Return tables_dict_2 at the end of the function
    
if __name__ == "__main__":
    # Get environment variables
    user = os.environ.get('CIUP_SQL_USER')
    password = os.environ.get('CIUP_SQL_PASS')
    host = os.environ.get('CIUP_SQL_HOST')
    port = 5432
    database = 'gdp_revisions_datasets'

    # Check if all environment variables are defined
    if not all([host, user, password]):
        raise ValueError("Some environment variables are missing (CIUP_SQL_HOST, CIUP_SQL_USER, CIUP_SQL_PASS)")

    # Create connection string
    connection_string = f"postgresql://{user}:{password}@{host}:{port}/{database}"

    # Create SQLAlchemy engine
    engine = create_engine(connection_string)
    tables_dict_2 = procesar_carpetas()  # Capture the value returned by procesar_carpetas()

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Back to the
    <a href="#outilne" style="color: #3d30a2;">
    outline.
    </a>
    </div>

In [None]:
tables_dict_2.keys()

In [None]:
dataframes_dict_2.keys()

In [None]:
tables_dict_2['ns_01_2024_2'].head(5)

In [None]:
dataframes_dict_2['ns_01_2024_2']

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Back to the
    <a href="#outilne" style="color: #3d30a2;">
    outline.
    </a>
    </div>

<div id="4">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">SQL Tables</span></h1>

<div style="font-family: charter; text-align: left; color:dark">
    Finally, after obtaining and cleaning all the necessary data, we can create the three most important datasets, which store releases, vintages, and revisions. These datasets are stored as SQL tables and can be loaded into any software or programming language.
    </div>
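
As a rough illustration of storing a dataset as an SQL table and loading it back, the sketch below uses pandas with an in-memory SQLite database. The SQLite connection, the table name `toy_growth_rates`, and the values are assumptions made for this example only; the notebook itself connects to PostgreSQL through SQLAlchemy.

```python
import sqlite3

import pandas as pd

# Toy stand-in for a cleaned growth-rates table (values are made up for illustration)
toy_growth_rates = pd.DataFrame({
    "year": ["2024", "2024"],
    "id_ns": ["01", "02"],
    "gdp_2023": [-0.5, -0.6],
})

# An in-memory SQLite database stands in for the PostgreSQL engine (assumption)
conn = sqlite3.connect(":memory:")

# Write the DataFrame as a table, then read it back with a plain SQL query
toy_growth_rates.to_sql("toy_growth_rates", conn, if_exists="replace", index=False)
loaded = pd.read_sql("SELECT * FROM toy_growth_rates", conn)
conn.close()

print(loaded)
```

With a SQLAlchemy engine, the same `to_sql` and `read_sql` calls should apply, with the engine passed in place of `conn`.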

<div id="sector">
   <!-- Contenido de la celda de destino -->
</div>

# Choose sector_economico and economic_sector

In [None]:
import tkinter as tk

# Define the list of options
opciones = [
    "pbi",
    "agropecuario",
    "pesca",
    "mineria e hidrocarburos",
    "manufactura",
    "electricidad y agua",
    "construccion",
    "comercio",
    "otros servicios"
]

# Save the selected option and close the window
def guardar_opcion():
    global sector_economico
    sector_economico = opcion_seleccionada.get()
    root.destroy()  # Close the window once an option is selected

# Create the pop-up window
root = tk.Tk()
root.title("Seleccionar opción")

# Variable holding the selected option
opcion_seleccionada = tk.StringVar(root)
opcion_seleccionada.set(opciones[0])  # Default option

# Create the option menu
menu = tk.OptionMenu(root, opcion_seleccionada, *opciones)
menu.pack(pady=10)

# Button to confirm the selection
boton_confirmar = tk.Button(root, text="Confirmar", command=guardar_opcion)
boton_confirmar.pack()

# Show the window and wait until it is closed
root.update_idletasks()
root.wait_window()

# Show the selected value
print("Sector económico seleccionado:", opcion_seleccionada.get())


In [None]:
import tkinter as tk

# Define the list of options
opciones = [
    "gdp",
    "agriculture and livestock",
    "fishing",
    "mining and fuel",
    "manufacturing",
    "electricity and water",
    "construction",
    "commerce",
    "other services"
]

# Save the selected option and close the window
def guardar_opcion():
    global economic_sector
    economic_sector = opcion_seleccionada.get()
    root.destroy()  # Close the window once an option is selected

# Create the pop-up window
root = tk.Tk()
root.title("Seleccionar opción")

# Variable holding the selected option
opcion_seleccionada = tk.StringVar(root)
opcion_seleccionada.set(opciones[0])  # Default option

# Create the option menu
menu = tk.OptionMenu(root, opcion_seleccionada, *opciones)
menu.pack(pady=10)

# Button to confirm the selection
boton_confirmar = tk.Button(root, text="Confirmar", command=guardar_opcion)
boton_confirmar.pack()

# Show the window and wait until it is closed
root.update_idletasks()
root.wait_window()

# Show the selected value
print("Sector económico seleccionado:", opcion_seleccionada.get())

# Choose the year and dataset label

In [None]:
import tkinter as tk
from tkinter import simpledialog

# Create a main window
root = tk.Tk()
root.withdraw()  # Hide the main window

# Ask the user for the sector label used to name the output datasets
sector = simpledialog.askstring("Sector Económico", "Introduce el valor del sector:")

# Ask the user for the year (currently disabled)
#year = simpledialog.askstring("Year", "Introduce el valor de year:")

# Show the values entered by the user
print("Valor del sector:", sector)
#print("Valor de economic_sector:", year)


<div id="4-1">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Annual Concatenation
    </span>
    </h2>

In [None]:
def concatenate_annual_df(dataframes_dict):
    # List to store the names of dataframes that meet the criterion of ending in '_2'
    dataframes_ending_with_2 = []

    # List to store the names of dataframes to be concatenated
    dataframes_to_concatenate = []

    # Iterate over the dataframe names in the dataframes_dict dictionary
    for df_name in dataframes_dict.keys():
        # Check if the dataframe name ends with '_2' and add it to the corresponding list
        if df_name.endswith('_2'):
            dataframes_ending_with_2.append(df_name)
            dataframes_to_concatenate.append(dataframes_dict[df_name])

    # Print the names of dataframes that meet the criterion of ending in '_2'
    print("DataFrames ending with '_2' that will be concatenated:")
    for df_name in dataframes_ending_with_2:
        print(df_name)

    # Concatenate all dataframes in the 'dataframes_to_concatenate' list
    if dataframes_to_concatenate:
        # Concatenate only rows that meet the specified conditions
        annual_growth_rates = pd.concat([df[(df['sectores_economicos'] == sector_economico) | (df['economic_sectors'] == economic_sector)] 
                                    for df in dataframes_to_concatenate 
                                    if 'sectores_economicos' in df.columns and 'economic_sectors' in df.columns], 
                                    ignore_index=True)

        # Keep only the 'year', 'id_ns', and 'date' columns plus the columns ending in '_year'
        columns_to_keep = ['year', 'id_ns', 'date'] + [col for col in annual_growth_rates.columns if col.endswith('_year')]

        # Drop unwanted columns
        annual_growth_rates = annual_growth_rates[columns_to_keep]
        
        # Remove duplicate columns if any
        annual_growth_rates = annual_growth_rates.loc[:,~annual_growth_rates.columns.duplicated()]
    
        # Rename the columns from the fourth column onwards (swap the two parts around the underscore)
        annual_growth_rates.columns = [col.split('_')[1] + '_' + col.split('_')[0] if '_' in col and idx >= 3 else col for idx, col in enumerate(annual_growth_rates.columns)]

        # Print the number of rows in the concatenated dataframe
        print("Number of rows in the concatenated dataframe:", len(annual_growth_rates))
        
        return annual_growth_rates
    else:
        print("No dataframes were found to concatenate.")
        return None
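
The relabelling step inside `concatenate_annual_df` can be tried in isolation. The sketch below assumes, hypothetically, that the annual columns are labelled `<year>_year` (for example `2022_year`); the toy column names are not taken from the real tables.

```python
import pandas as pd

# Hypothetical labels: three identifier columns followed by "<year>_year" columns
toy = pd.DataFrame(columns=["year", "id_ns", "date", "2022_year", "2023_year"])

# Swap the two tokens around the underscore, but only from the fourth column onwards,
# so that identifier columns such as 'id_ns' keep their names
toy.columns = [
    "_".join(reversed(col.split("_", 1))) if "_" in col and idx >= 3 else col
    for idx, col in enumerate(toy.columns)
]

new_labels = list(toy.columns)
print(new_labels)
```

The positional guard (`idx >= 3`) is what protects `id_ns`, which also contains an underscore, from being renamed.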

In [None]:
globals()[f"{sector}_annual_growth_rates"] = concatenate_annual_df(dataframes_dict_2)

In [None]:
pd.set_option('display.max_rows', None)
globals()[f"{sector}_annual_growth_rates"]

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Back to the
    <a href="#outilne" style="color: #3d30a2;">
    outline.
    </a>
    </div>

<div id="4-2">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Quarterly Concatenation
    </span>
    </h2>

In [None]:
import pandas as pd

def concatenate_quarterly_df(dataframes_dict):
    # List to store the names of dataframes that meet the criterion of ending in '_2'
    dataframes_ending_with_2 = []

    # List to store the names of dataframes to be concatenated
    dataframes_to_concatenate = []

    # Iterate over the dataframe names in the dataframes_dict dictionary
    for df_name in dataframes_dict.keys():
        # Check if the dataframe name ends with '_2' and add it to the corresponding list
        if df_name.endswith('_2'):
            dataframes_ending_with_2.append(df_name)
            dataframes_to_concatenate.append(dataframes_dict[df_name])

    # Print the names of dataframes that meet the criterion of ending in '_2'
    print("DataFrames ending with '_2' that will be concatenated:")
    for df_name in dataframes_ending_with_2:
        print(df_name)

    # Concatenate all dataframes in the 'dataframes_to_concatenate' list
    if dataframes_to_concatenate:
        # Concatenate only rows that meet the specified conditions
        quarterly_growth_rates = pd.concat([df[(df['sectores_economicos'] == sector_economico) | (df['economic_sectors'] == economic_sector)] 
                                    for df in dataframes_to_concatenate 
                                    if 'sectores_economicos' in df.columns and 'economic_sectors' in df.columns], 
                                    ignore_index=True)

        # Keep all columns except those ending in '_year' (the 'year', 'id_ns', and 'date' columns are listed first)
        columns_to_keep = ['year', 'id_ns', 'date'] + [col for col in quarterly_growth_rates.columns if not col.endswith('_year')]

        # Keep only the selected columns
        quarterly_growth_rates = quarterly_growth_rates[columns_to_keep]

        # Drop the 'sectores_economicos' and 'economic_sectors' columns
        quarterly_growth_rates.drop(columns=['sectores_economicos', 'economic_sectors'], inplace=True)

        # Remove duplicate columns if any
        quarterly_growth_rates = quarterly_growth_rates.loc[:,~quarterly_growth_rates.columns.duplicated()]

        # Rename the columns from the fourth column onwards (currently disabled)
        #quarterly_growth_rates.columns = [col.split('_')[0] + '_q' + col.split('_')[1] if '_' in col and idx >= 3 else col
        #for idx, col in enumerate(quarterly_growth_rates.columns)]
        
        # Print the number of rows in the concatenated dataframe
        print("Number of rows in the concatenated dataframe:", len(quarterly_growth_rates))
        
        return quarterly_growth_rates
    else:
        print("No dataframes were found to concatenate.")
        return None

In [None]:
globals()[f"{sector}_quarterly_growth_rates"] = concatenate_quarterly_df(dataframes_dict_2)

In [None]:
pd.set_option('display.max_rows', None)
globals()[f"{sector}_quarterly_growth_rates"]

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Back to the
    <a href="#outilne" style="color: #3d30a2;">
    outline.
    </a>
    </div>

<div id="4-3">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.3.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Monthly Concatenation
    </span>
    </h2>

In [None]:
import pandas as pd

def concatenate_monthly_df(dataframes_dict):
    # List to store the names of dataframes that meet the criterion of ending in '_1'
    dataframes_ending_with_1 = []

    # List to store the names of dataframes to be concatenated
    dataframes_to_concatenate = []

    # Iterate over the dataframe names in the dataframes_dict dictionary
    for df_name in dataframes_dict.keys():
        # Check if the dataframe name ends with '_1' and add it to the corresponding list
        if df_name.endswith('_1'):
            dataframes_ending_with_1.append(df_name)
            dataframes_to_concatenate.append(dataframes_dict[df_name])

    # Print the names of dataframes that meet the criterion of ending with '_1'
    print("DataFrames ending with '_1' that will be concatenated:")
    for df_name in dataframes_ending_with_1:
        print(df_name)

    # Concatenate all dataframes in the 'dataframes_to_concatenate' list
    if dataframes_to_concatenate:
        # Concatenate only rows that meet the specified conditions
        monthly_growth_rates = pd.concat([df[(df['sectores_economicos'] == sector_economico) | (df['economic_sectors'] == economic_sector)] 
                                    for df in dataframes_to_concatenate 
                                    if 'sectores_economicos' in df.columns and 'economic_sectors' in df.columns], 
                                    ignore_index=True)

        # Keep all columns except those ending in '_year' (the 'year', 'id_ns', and 'date' columns are listed first)
        columns_to_keep = ['year', 'id_ns', 'date'] + [col for col in monthly_growth_rates.columns if not col.endswith('_year')]

        # Keep only the selected columns
        monthly_growth_rates = monthly_growth_rates[columns_to_keep]

        # Drop the 'sectores_economicos' and 'economic_sectors' columns
        monthly_growth_rates.drop(columns=['sectores_economicos', 'economic_sectors'], inplace=True)

        # Remove duplicate columns if any
        monthly_growth_rates = monthly_growth_rates.loc[:,~monthly_growth_rates.columns.duplicated()]
        
        # Drop columns with at least two underscores in their names
        columns_to_drop = [col for col in monthly_growth_rates.columns if col.count('_') >= 2]
        monthly_growth_rates.drop(columns=columns_to_drop, inplace=True)
        
        # Rename the columns from the fourth column onwards (swap the two parts around the underscore)
        monthly_growth_rates.columns = [col.split('_')[1] + '_' + col.split('_')[0] if '_' in col and idx >= 3 else col for idx, col in enumerate(monthly_growth_rates.columns)]

        # Mapping of Spanish month abbreviations to month codes (currently disabled)
        #meses = {
        #    'ene': 'm1', 'feb': 'm2', 'mar': 'm3', 'abr': 'm4',
        #    'may': 'm5', 'jun': 'm6', 'jul': 'm7', 'ago': 'm8',
        #    'sep': 'm9', 'oct': 'm10', 'nov': 'm11', 'dic': 'm12'
        #}

        # Function to replace the dictionary keys with their values in the column names
        #def replace_months(column_name, meses):
        #    for key, value in meses.items():
        #        if key in column_name:
        #            return column_name.replace(key, value)
        #    return column_name

        # Apply the function to every column of the DataFrame
        #monthly_growth_rates.columns = monthly_growth_rates.columns.map(lambda x: replace_months(x, meses))

        # Print the number of rows in the concatenated dataframe
        print("Number of rows in the concatenated dataframe:", len(monthly_growth_rates))

        return monthly_growth_rates
    else:
        print("No dataframes were found to concatenate.")
        return None


In [None]:
globals()[f"{sector}_monthly_growth_rates"] = concatenate_monthly_df(dataframes_dict_1)

In [None]:
pd.set_option('display.max_rows', None)
globals()[f"{sector}_monthly_growth_rates"]

<div style="background-color: #1E3B58; color: white; padding: 10px;">
    <h2>Create revision dataset (annual)</h2>
</div>

In [None]:
sector

In [None]:
# 1. Compute the difference between the last and the first non-NaN value of each column, excluding 'year', 'id_ns', and 'date'
revision = globals()[f"{sector}_annual_growth_rates"].drop(columns=['year', 'id_ns', 'date']).apply(lambda x: x.loc[x.last_valid_index()] - x.loc[x.first_valid_index()])
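
The "last minus first non-NaN value" rule can be checked on a small made-up table. The sketch below is an illustration under assumptions: the column names and figures are invented, and the helper `last_minus_first` is hypothetical. Unlike the one-line lambda, it returns NaN for a column with no valid values instead of raising an error.

```python
import numpy as np
import pandas as pd

# Made-up vintages: each row is a later release, each column a target period
toy_vintages = pd.DataFrame({
    "year_2021": [13.3, 13.5, 13.6],
    "year_2022": [np.nan, 2.7, 2.9],
    "year_2023": [np.nan, np.nan, np.nan],
})

def last_minus_first(col):
    """Latest non-missing value minus the earliest one (NaN if the column is empty)."""
    first, last = col.first_valid_index(), col.last_valid_index()
    if first is None:
        return np.nan
    return col.loc[last] - col.loc[first]

# One revision per column, rounded to one decimal like the growth rates
toy_revision = toy_vintages.apply(last_minus_first).round(1)
print(toy_revision)
```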


#  New

In [None]:
# 2. Create a new DataFrame with the results
globals()[f"{sector}_annual_revisions"] = pd.DataFrame({'revision_date': revision.index, f'{sector}_revision': revision.values})
# Show every row when displaying the DataFrame
pd.set_option('display.max_rows', None)
globals()[f"{sector}_annual_revisions"]

# For plotting in Python

In [None]:
import pandas as pd

# Work on the sector's annual revisions table through a short alias
annual_revisions = globals()[f"{sector}_annual_revisions"]

# Extract the year from the column label and convert it to an integer
annual_revisions['year'] = annual_revisions['revision_date'].str.extract(r'(\d+)', expand=False).astype(int)

# Replace 'revision_date' with a datetime built from the year (January 1 of that year)
annual_revisions['revision_date'] = pd.to_datetime(annual_revisions['year'], format='%Y')

# Drop the helper column 'year', which is no longer needed
annual_revisions.drop(columns=['year'], inplace=True)

# Display the result
globals()[f"{sector}_annual_revisions"]
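
The annual revision above is, for each target period (one column), the last published value minus the first published value. The following is a minimal sketch of that logic on a toy table. The helper name `compute_revisions`, the column labels `gdp_2013` and `gdp_2014`, and all numbers are illustrative assumptions, not project data.

```python
import numpy as np
import pandas as pd

def compute_revisions(growth_rates, sector, id_columns=('year', 'id_ns', 'date')):
    """Hypothetical helper: last minus first non-NaN value of each value column."""
    values = growth_rates.drop(columns=list(id_columns))
    revision = values.apply(
        lambda col: col.loc[col.last_valid_index()] - col.loc[col.first_valid_index()]
    )
    return pd.DataFrame({
        'revision_date': revision.index,
        f'{sector}_revision': revision.values,
    })

# Toy vintages: each row is one Weekly Note, each value column one target year
toy = pd.DataFrame({
    'year': [2014, 2014, 2015],
    'id_ns': [1, 2, 3],
    'date': pd.to_datetime(['2014-01-10', '2014-02-10', '2015-01-10']),
    'gdp_2013': [5.0, 5.5, 6.0],      # first release 5.0, latest release 6.0
    'gdp_2014': [np.nan, 2.5, 2.0],   # first release 2.5, latest release 2.0
})

toy_revisions = compute_revisions(toy, 'gdp')
print(toy_revisions)
```

A column that contains only NaN values has no first or last valid index, so the lookup would raise an error; such columns would need to be dropped beforehand.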


<div style="background-color: #1E3B58; color: white; padding: 10px;">
    <h2>Create revision dataset (quarterly)</h2>
</div>

In [None]:
sector

In [None]:
# 1. Compute the difference between the last and the first non-NaN value of each column, excluding 'year', 'id_ns' and 'date'
revision = globals()[f"{sector}_quarterly_growth_rates"].drop(columns=['year', 'id_ns', 'date']).apply(lambda x: x.loc[x.last_valid_index()] - x.loc[x.first_valid_index()])


# New

In [None]:
# 2. Build a new DataFrame with the results
globals()[f"{sector}_quarterly_revisions"] = pd.DataFrame({'revision_date': revision.index, f'{sector}_revision': revision.values})
# Show all rows when displaying the DataFrame
pd.set_option('display.max_rows', None)
globals()[f"{sector}_quarterly_revisions"]

# For plotting in Python

In [None]:
import pandas as pd

# Convert the 'revision_date' column to datetime
globals()[f"{sector}_quarterly_revisions"]['revision_date'] = pd.to_datetime(globals()[f"{sector}_quarterly_revisions"]['revision_date'], format='%Y_%m')

# Display the result
globals()[f"{sector}_quarterly_revisions"]
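
The cell above passes `format='%Y_%m'`, so whatever follows the underscore is read as a month number. If the labels instead carry a quarter number (an assumption, since the label format is not shown in this section; for example `'2013_4'` meaning the fourth quarter of 2013), the sketch below maps each quarter to the first day of its first month. The helper name `quarter_labels_to_dates` is hypothetical.

```python
import pandas as pd

def quarter_labels_to_dates(labels):
    """Hypothetical helper: turn '<year>_<quarter>' labels into quarter-start dates."""
    parts = labels.str.split('_', expand=True)
    year = parts[0].astype(int)
    quarter = parts[1].astype(int)
    # Quarter 1 starts in month 1, quarter 2 in month 4, and so on
    month = 3 * (quarter - 1) + 1
    return pd.to_datetime(pd.DataFrame({'year': year, 'month': month, 'day': 1}))

# Assumed label format, for illustration only
quarter_labels = pd.Series(['2013_1', '2013_4'])
quarter_dates = quarter_labels_to_dates(quarter_labels)
print(quarter_dates)
```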

<div style="background-color: #1E3B58; color: white; padding: 10px;">
    <h2>Create revision dataset (monthly)</h2>
</div>

In [None]:
sector

In [None]:
# 1. Compute the difference between the last and the first non-NaN value of each column, excluding 'year', 'id_ns' and 'date'
revision = globals()[f"{sector}_monthly_growth_rates"].drop(columns=['year', 'id_ns', 'date']).apply(lambda x: x.loc[x.last_valid_index()] - x.loc[x.first_valid_index()])


# New

In [None]:
# 2. Build a new DataFrame with the results
globals()[f"{sector}_monthly_revisions"] = pd.DataFrame({'revision_date': revision.index, f'{sector}_revision': revision.values})
# Show all rows when displaying the DataFrame
pd.set_option('display.max_rows', None)
globals()[f"{sector}_monthly_revisions"]

# For plotting in Python

In [None]:
import pandas as pd

# Work on the sector's monthly revisions table through a short alias
monthly_revisions = globals()[f"{sector}_monthly_revisions"]

# Split 'revision_date' into its month and year parts
monthly_revisions['month'] = monthly_revisions['revision_date'].str.split('_').str[0]
monthly_revisions['year'] = monthly_revisions['revision_date'].str.split('_').str[1]

# Map the Spanish month abbreviations to their two-digit numbers
month_mapping = {
    'ene': '01', 'feb': '02', 'mar': '03', 'abr': '04',
    'may': '05', 'jun': '06', 'jul': '07', 'ago': '08',
    'sep': '09', 'oct': '10', 'nov': '11', 'dic': '12'
}

monthly_revisions['month'] = monthly_revisions['month'].map(month_mapping)

# Rebuild 'revision_date' as a 'YYYY-MM' string
monthly_revisions['revision_date'] = monthly_revisions['year'] + '-' + monthly_revisions['month']

# Convert 'revision_date' to datetime
monthly_revisions['revision_date'] = pd.to_datetime(monthly_revisions['revision_date'], format='%Y-%m')

# Drop the temporary 'month' and 'year' columns
monthly_revisions.drop(['month', 'year'], axis=1, inplace=True)

# Display the result
globals()[f"{sector}_monthly_revisions"]
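
The monthly conversion above splits each label on the underscore and looks the month abbreviation up in a dictionary. An abbreviation missing from the dictionary (for instance a variant spelling of a month) maps to NaN, so it may be worth failing loudly in that case. The sketch below wraps the same steps in a function with such a check; the helper name `month_labels_to_dates` and the sample labels are assumptions for illustration.

```python
import pandas as pd

MONTH_MAPPING = {
    'ene': '01', 'feb': '02', 'mar': '03', 'abr': '04',
    'may': '05', 'jun': '06', 'jul': '07', 'ago': '08',
    'sep': '09', 'oct': '10', 'nov': '11', 'dic': '12',
}

def month_labels_to_dates(labels):
    """Hypothetical helper: turn '<month abbreviation>_<year>' labels into dates."""
    parts = labels.str.split('_', expand=True)
    month = parts[0].map(MONTH_MAPPING)
    year = parts[1]
    # Stop early if any abbreviation is not covered by the mapping
    if month.isna().any():
        unknown = sorted(parts[0][month.isna()].unique())
        raise ValueError(f"Unmapped month abbreviations: {unknown}")
    return pd.to_datetime(year + '-' + month, format='%Y-%m')

# Assumed label format, for illustration only
month_labels = pd.Series(['ene_2013', 'dic_2014'])
month_dates = month_labels_to_dates(month_labels)
print(month_dates)
```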

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Back to the
    <a href="#outilne" style="color: #3d30a2;">
    outline.
    </a>
    </div>

<div id="4-4">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.4.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Loading SQL
    </span>
    </h2>

# Growth Rates

In [None]:
import os
from sqlalchemy import create_engine

# Get environment variables
user = os.environ.get('CIUP_SQL_USER')
password = os.environ.get('CIUP_SQL_PASS')
host = os.environ.get('CIUP_SQL_HOST')
port = 5432
database = 'gdp_revisions_datasets'

# Check if all environment variables are defined
if not all([host, user, password]):
    raise ValueError("Some environment variables are missing (CIUP_SQL_HOST, CIUP_SQL_USER, CIUP_SQL_PASS)")

# Create connection string
connection_string = f"postgresql://{user}:{password}@{host}:{port}/{database}"

# Create SQLAlchemy engine
engine = create_engine(connection_string)

# Save a DataFrame (for example gdp_monthly_growth_rates) to the database
#gdp_annual_growth_rates.to_sql('gdp_annual_growth_rates_2013', engine, index=False, if_exists='replace')
#gdp_quarterly_growth_rates.to_sql('gdp_quarterly_growth_rates', engine, index=False, if_exists='replace')
#gdp_monthly_growth_rates.to_sql(f'{sector}_monthly_growth_rates', engine, index=False, if_exists='replace')

# Growth rates for the current sector (uncomment the other frequencies as needed)

globals()[f"{sector}_monthly_growth_rates"].to_sql(f'{sector}_monthly_growth_rates', engine, index=False, if_exists='replace')
#globals()[f"{sector}_quarterly_growth_rates"].to_sql(f'{sector}_quarterly_growth_rates', engine, index=False, if_exists='replace')
#globals()[f"{sector}_annual_growth_rates"].to_sql(f'{sector}_annual_growth_rates', engine, index=False, if_exists='replace')
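
The connection string above interpolates the user name and password directly into the URL. If either contains characters such as `@` or `/`, the URL can be parsed incorrectly, so URL-encoding the credentials first is a common precaution. This is a minimal sketch using only the standard library; the helper name `build_connection_string` and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus

def build_connection_string(user, password, host, port, database):
    """Hypothetical helper: PostgreSQL URL with URL-encoded credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )

# Made-up credentials, for illustration only
example_url = build_connection_string(
    'analyst', 'p@ss/word', 'localhost', 5432, 'gdp_revisions_datasets'
)
print(example_url)
```

The resulting string can be passed to `create_engine` in the same way as `connection_string` in the cell above.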

# Revisions

In [None]:
import os
from sqlalchemy import create_engine

# Get environment variables
user = os.environ.get('CIUP_SQL_USER')
password = os.environ.get('CIUP_SQL_PASS')
host = os.environ.get('CIUP_SQL_HOST')
port = 5432
database = 'gdp_revisions_datasets'

# Check if all environment variables are defined
if not all([host, user, password]):
    raise ValueError("Some environment variables are missing (CIUP_SQL_HOST, CIUP_SQL_USER, CIUP_SQL_PASS)")

# Create connection string
connection_string = f"postgresql://{user}:{password}@{host}:{port}/{database}"

# Create SQLAlchemy engine
engine = create_engine(connection_string)

# Revisions for the current sector (uncomment the other frequencies as needed)

globals()[f"{sector}_monthly_revisions"].to_sql(f'{sector}_monthly_revisions', engine, index=False, if_exists='replace')
#globals()[f"{sector}_quarterly_revisions"].to_sql(f'{sector}_quarterly_revisions', engine, index=False, if_exists='replace')
#globals()[f"{sector}_annual_revisions"].to_sql(f'{sector}_annual_revisions', engine, index=False, if_exists='replace')
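
Instead of commenting and uncommenting one `to_sql` call per frequency, the tables could be written in a loop. The sketch below illustrates the idea under stated assumptions: an in-memory SQLite database stands in for the PostgreSQL engine, and the toy DataFrames stand in for the real revision tables.

```python
import sqlite3
import pandas as pd

sector = 'gdp'
frequencies = ['annual', 'quarterly', 'monthly']

# Toy stand-ins for the real revision tables (values are made up)
tables = {
    f'{sector}_{freq}_revisions': pd.DataFrame({
        'revision_date': ['2013-01-01'],
        f'{sector}_revision': [0.1],
    })
    for freq in frequencies
}

# SQLite in memory replaces the PostgreSQL engine for this illustration
conn = sqlite3.connect(':memory:')
for table_name, frame in tables.items():
    frame.to_sql(table_name, conn, index=False, if_exists='replace')

# List the tables that were created
loaded_tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", conn
)['name'].tolist()
conn.close()
print(loaded_tables)
```

With the notebook's own objects, the loop would iterate over `globals()[f"{sector}_{freq}_revisions"]` and pass `engine` instead of `conn`.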

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    Back to the
    <a href="#outilne" style="color: #3d30a2;">
    outline.
    </a>
    </div>

---
---
---