<div style="text-align: center; font-family: 'charter bt pro roman'; color: rgb(0, 65, 75);">
<h1>
Old GDP Revisions Datasets
</h1>
</div>

<div style="text-align: center; font-family: 'charter bt pro roman'; color: rgb(0, 65, 75);">
<h3>
Documentation
<br>
____________________
<br>
</h3>
</div>

<div style="text-align: center; font-family: 'PT Serif Pro Book'; color: rgb(0, 65, 75); font-size: 16px;">
    Jason Cruz
    <br>
    <a href="mailto:jj.cruza@up.edu.pe" style="color: rgb(0, 153, 123); font-size: 16px;">
        jj.cruza@up.edu.pe
    </a>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
This <span style="color: rgb(0, 65, 75);">Jupyter notebook</span> documents, step by step, the <b>construction of the old datasets</b> for the project <b>'Revisions and Biases in Preliminary GDP Estimates in Peru'</b>.

It covers the full workflow, from cleaning the tables (Weekly Reports, WR) provided confidentially by the Central Bank to building the <b>vintages</b> and <b>revisions</b> datasets for 1994-2024.
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;line-height: 1.5;">
<span style="font-size: 24px;">&#128196;</span> The Weekly Report/<i>Nota Semanal</i> (WR/<i>NS</i>) of the Central Bank.
    <br>
    <span style="font-size: 24px;">&#8987;</span> Available for <b>1994-2012</b> (Table 1) and for <b>1997-2012</b> (Table 2). 
    <br>
</div>

<div style="font-family: Amaya; text-align: left; color: rgb(0, 65, 75); font-size:16px">The following <b>outline is functional</b>. Use its links to jump to any section of this notebook.</div>

<div id="outilne">
   <!-- Contenido de la celda de destino -->
</div>

<div style="background-color: #292929; padding: 10px; line-height: 1.5; font-family: 'PT Serif Pro Book';">
    <h2 style="text-align: left; color: #E0E0E0;">
        Outline
    </h2>
    <br>
    <a href="#libraries" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">
        Libraries</a>
    <br>
    <a href="#setup" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">
        Initial set-up</a>
    <br>
    <a href="#1" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">
        1. Duplicate tables for all other NS ids</a>
    <br>
    <a href="#1-1" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        1.1. Table 1.</a>
    <br>
    <a href="#1-2" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        1.2. Table 2.</a>
    <br>
    <a href="#2" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">
        2. Data cleaning</a>
    <br>
    <a href="#2-1" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        2.1. Extracting tables and data cleanup.</a>
    <br>
    <a href="#2-1-1" style="color: #94FFD8; font-size: 14px; margin-left: 40px;">
        2.1.1. Table 1. Extraction and cleaning of data from tables on monthly real GDP growth rates.</a>
    <br>
    <a href="#2-1-2" style="color: #94FFD8; font-size: 14px; margin-left: 40px;">
        2.1.2. Table 2. Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.</a>
    <br>
    <a href="#3" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">3. Real-time data of Peru's GDP growth rates</a>
    <br>
    <a href="#3-1" style="color: #94FFD8; font-size: 16px; margin-left: 20px;">
        3.1. Growth rates datasets concatenation for all frequencies.</a>
    <br>
    <a href="#4" style="color: #E0E0E0; font-size: 18px; margin-left: 0px;">4. Uploading data to SQL</a>
</div>


<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    For any questions or issues regarding the code, please email Jason Cruz <a href="mailto:jj.cruza@alum.up.edu.pe" style="color: rgb(0, 153, 123); text-decoration: none;"><span style="font-size: 24px;">&#x2709;</span>
    </a>.
    </div>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    If any of the libraries below are missing, install them with a command like the following example.
    </div>

In [None]:
#!pip install tabula-py # Uncomment this line to install a missing library; standard-library modules such as os need no installation.

<div id="libraries">
   <!-- Contenido de la celda de destino -->
</div>

<div style="text-align: left; font-family: 'charter'; color: dark;">
    <h2>
    Libraries
    </h2>
    </div>

In [1]:
# 1. Duplicate tables for all other NS ids
#-------------------------------------------------------------------------------------------------------------------------------
# 1.1. Table 1

import os  # For file and directory operations
import shutil  # For high-level file operations such as copying
import psycopg2  # For interacting with PostgreSQL databases
import pandas as pd  # For data manipulation and analysis



# 2. Data cleaning
#-------------------------------------------------------------------------------------------------------------------------------

# 2.1. Extracting tables and data cleanup

import unicodedata  # For manipulating Unicode data
import re  # For regular expressions operations
from datetime import datetime  # For working with dates and times
import locale  # For locale-specific formatting of numbers, dates, and currencies
import numpy as np  # For numerical operations
import unidecode  # For transliterating Unicode text to ASCII

# 2.1.1. Table 1. Extraction and cleaning of data from tables on monthly real GDP growth rates.

import tabula  # Used to extract tables from PDF files into pandas DataFrames
from tkinter import Tk, messagebox, TOP, YES, NO  # Used for creating graphical user interfaces
from sqlalchemy import create_engine  # Used for connecting to and interacting with SQL databases

# 2.1.2. Table 2. Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.

import roman  # For converting between Roman and Arabic numerals


# 3. Real-time data of Peru's GDP growth rates
#-------------------------------------------------------------------------------------------------------------------------------

from sqlalchemy import text  # For executing SQL queries using SQLAlchemy

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="setup">
   <!-- Contenido de la celda de destino -->
</div>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark;">
    <h2>
    Initial set-up
    </h2>
    </div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> Set the base path. </div>

In [None]:
# Ask the user for the base path
base_path = input("Please enter your base path to set your main working directory, then press enter: \n")

# Check if the path is valid
if os.path.isdir(base_path):
    print("Base path set correctly.")
    os.chdir(base_path)
else:
    print("The entered path is not valid. Please try again.")

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    Request the <code>.zip</code> files of <code>table_1</code> and <code>table_2</code> and place them in the <code>raw_data</code> and <code>input_data</code> paths set below. 
    <p>
     Please note that missing tables will be completed in <a href="#1" style="color: rgb(0, 153, 123); font-size: 16px;">Section 1</a> under the <code>input_data</code> directory. Changes are applied directly in <code>input_data</code>, while the <code>raw_data</code> directory retains a backup of the tables as originally extracted, either via OCR or manually, with no transformations.
       </p>
    </div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> The following cells create, under your current working directory, the folders used to import inputs and export outputs. </div>

In [None]:
# Folder path to save csv files delivered by Central Bank 
raw_data = 'raw_data' # to save raw data (.csv).
if not os.path.exists(raw_data):
    os.mkdir(raw_data) # to create the folder (if it doesn't exist)

In [None]:
# Folder path to save csv files delivered by Central Bank as inputs (duplicated for all NS id) 
input_data = 'input_data' # to save input data (.csv).
if not os.path.exists(input_data):
    os.mkdir(input_data) # to create the folder (if it doesn't exist)

In [None]:
# Folder path to save csv files delivered by Central Bank (Table 1)
table_1_folder = os.path.join(input_data, 'table_1') # to save Table 1 input data (.csv).
if not os.path.exists(table_1_folder):
    os.mkdir(table_1_folder) # to create the folder (if it doesn't exist)

In [None]:
# Folder path to save csv files delivered by Central Bank (Table 2)
table_2_folder = os.path.join(input_data, 'table_2') # to save Table 2 input data (.csv).
if not os.path.exists(table_2_folder):
    os.mkdir(table_2_folder) # to create the folder (if it doesn't exist)

In [None]:
# Folder path to save dataframes generated record by year

dataframes_record = 'dataframes_record'
if not os.path.exists(dataframes_record):
    os.makedirs(dataframes_record) # to create the folder (if it doesn't exist)
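
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The folder cells above all follow the same pattern. As a compact alternative, the minimal sketch below creates every folder in one loop with <code>os.makedirs(..., exist_ok=True)</code>, which also creates missing parent folders. It is not part of the notebook's own code: the helper name <code>ensure_folders</code> is hypothetical, and the demonstration runs in a temporary directory.
</div>

```python
import os
import tempfile

def ensure_folders(base_dir, relative_paths):
    """Hypothetical helper: create each folder under base_dir; existing folders are left untouched."""
    created = []
    for rel in relative_paths:
        path = os.path.join(base_dir, rel)
        os.makedirs(path, exist_ok=True)  # No error if the folder already exists
        created.append(path)
    return created

# Demonstration in a temporary directory so nothing is written to the working directory
with tempfile.TemporaryDirectory() as tmp:
    folders = ensure_folders(tmp, [
        'raw_data',
        os.path.join('input_data', 'table_1'),
        os.path.join('input_data', 'table_2'),
        'dataframes_record',
    ])
    all_exist = all(os.path.isdir(f) for f in folders)
    print(all_exist)
```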

<p style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> The following function establishes a connection to the <code>gdp_revisions_datasets</code> database in <code>PostgreSQL</code>. The <b>input data</b> used in this Jupyter notebook are loaded from this database, and all <b>output data</b> generated by this notebook are stored in it. Once you have obtained the required permissions, set the environment variables <code>CIUP_SQL_USER</code>, <code>CIUP_SQL_PASS</code> and <code>CIUP_SQL_HOST</code> to access the server.</p>
    
<p style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
To request permissions, please email Jason Cruz <a href="mailto:jj.cruza@alum.up.edu.pe" style="color: rgb(0, 153, 123); text-decoration: none;"> <span style="font-size: 24px;">&#x2709;</span>
    </a>.
</p>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
    <span style="font-size: 24px; color: #FFA823; font-weight: bold;">&#9888;</span>
    Enter your user credentials to access SQL.
    </div>

In [None]:
def create_sqlalchemy_engine():
    """
    Function to create an SQLAlchemy engine using environment variables.
    
    Returns:
        engine: SQLAlchemy engine object.
    """
    # Get environment variables
    user = os.environ.get('CIUP_SQL_USER')  # Get the SQL user from environment variables
    password = os.environ.get('CIUP_SQL_PASS')  # Get the SQL password from environment variables
    host = os.environ.get('CIUP_SQL_HOST')  # Get the SQL host from environment variables
    port = 5432  # Set the SQL port to 5432
    database = 'gdp_revisions_datasets'  # Set the database name 'gdp_revisions_datasets' from SQL

    # Check if all environment variables are defined
    if not all([host, user, password]):
        raise ValueError("Some environment variables are missing (CIUP_SQL_HOST, CIUP_SQL_USER, CIUP_SQL_PASS)")

    # Create connection string
    connection_string = f"postgresql://{user}:{password}@{host}:{port}/{database}"

    # Create SQLAlchemy engine
    engine = create_engine(connection_string)
    
    return engine
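
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    If the user name or password contains reserved characters such as <code>@</code>, <code>/</code> or <code>:</code>, a connection string built with a plain f-string may be parsed incorrectly. The minimal sketch below, using only the standard library, percent-encodes the credentials before assembling the string. The helper name <code>build_connection_string</code> and the sample credentials are hypothetical.
</div>

```python
from urllib.parse import quote_plus

def build_connection_string(user, password, host, port=5432, database='gdp_revisions_datasets'):
    """Hypothetical helper: percent-encode credentials so reserved characters do not break the URL."""
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"

# Made-up credentials, for illustration only
example = build_connection_string('analyst', 'p@ss/word', 'localhost')
print(example)
```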

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Import all other functions required by this jupyter notebook.
    </span>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px"> Please check the scripts <code>old_gdp_datasets_functions.py</code> and <code>new_gdp_datasets_functions.py</code>, which contain all the functions required by this Jupyter notebook. The functions there are ordered according to the <a href="#outilne" style="color: #3d30a2;">sections</a> of this Jupyter notebook.</div>

In [3]:
from old_gdp_datasets_functions import *
from new_gdp_datasets_functions import *



<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="1">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;; color: dark;">1.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Duplicate tables for all other NS ids</span></h1>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    <p>Through manual encoding or optical character recognition (OCR), <code>.csv</code> files were generated with the tables of those Weekly Reports (WR) in which revisions were identified, according to the information at the bottom of each table (within each WR).</p>
    
   <p>We therefore duplicate these tables to cover every NS in each year, approximately 50 per year given the number of weeks.</p>
    </div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
The following code imports from SQL a dataset with a dummy variable indicating whether a table is available as a <code>.csv</code> file for each Weekly Report (1 if the table is available and 0 if it is not).
    </div>

In [None]:
# Connect to PostgreSQL
engine = create_sqlalchemy_engine()

# Define SQL query to import data
query = "SELECT * FROM old_raw_data_delivered"

# Importing data into a pandas DataFrame
df = pd.read_sql(query, engine)
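
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    The duplication functions themselves live in <code>old_gdp_datasets_functions.py</code> and are not reproduced here. As a rough sketch of the idea only, assume that each Weekly Report without its own table reuses the table of the most recent earlier report that has one. The function name <code>map_sources</code>, the carry-forward rule and the sample dummies are assumptions for illustration.
</div>

```python
def map_sources(availability):
    """Map each NS id to the NS id whose table it would reuse (assumed carry-forward rule).

    availability: dict {id_ns: 1 or 0}, where 1 means a table was delivered for that id.
    Ids that precede the first delivered table map to None.
    """
    sources = {}
    last_available = None
    for id_ns in sorted(availability):
        if availability[id_ns] == 1:
            last_available = id_ns  # This report has its own table
        sources[id_ns] = last_available
    return sources

# Made-up dummies for six consecutive Weekly Reports
dummies = {1: 0, 2: 1, 3: 0, 4: 0, 5: 1, 6: 0}
sources = map_sources(dummies)
print(sources)
```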

<div id="1-1">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">1.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Table 1
    </span>
    </h2>

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Please set your preferred year range.
    </span>
</div>

In [None]:
# Process each year
for year in range(1994, 2013): # Set your preferred year range (the upper bound is the last year + 1)
    df_year = df[df['year'] == year]
    duplicate_files_table_1(year, df_year, table_1_folder)

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="1-2">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">1.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Table 2
    </span>
    </h2>

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Please set your preferred year range.
    </span>
</div>

In [None]:
# Process each year
for year in range(1997, 2013): # Set your preferred year range (the upper bound is the last year + 1)
    df_year = df[df['year'] == year]
    duplicate_files_table_2(year, df_year, table_2_folder)

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="2">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;; color: dark;">2.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Data cleaning</span></h1>

<div id="2-1">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">2.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Extracting tables and data cleanup
    </span>
    </h2>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
PENDING
</p>
</div>

<div id="2-1-1">
   <!-- Contenido de la celda de destino -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">2.1.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    <span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">Table 1.</span> Extraction and cleaning of data from tables on monthly real GDP growth rates.
    </span>
    </h3>

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Please check that the flat file <b>"ns_dates.csv"</b> is updated with the dates, years and ids of the newly downloaded PDFs <span style="font-size: 24px;">&#128462;</span> (WR). That file is located in the <b>"ns_dates"</b> folder and is uploaded to SQL from the Jupyter notebook <code>aux_files_to_sql.ipynb</code>.
    </span>
</div>

In [None]:
# Set the locale to Spanish
locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')

# Dictionary to store generated DataFrames
old_dataframes_dict_1 = {}

# Path for the processed folders log file
record_path = 'dataframes_record/old_processed_folders_1.txt'

# Function to correct month names
def correct_month_name(month):
    months_mapping = {
        'setiembre': 'septiembre',
        # Add more mappings as needed for other month names
    }
    return months_mapping.get(month, month)

# Function to register processed folder
def register_processed_folder(folder, num_processed_files):
    with open(record_path, 'a') as file:
        file.write(f"{folder}:{num_processed_files}\n")

# Function to check if folder has been processed
def folder_processed(folder):
    if not os.path.exists(record_path):
        return False
    with open(record_path, 'r') as file:
        for line in file:
            if line.startswith(folder):
                return True
    return False

# Function to fetch date from database
def get_date(df, engine):
    id_ns = df['id_ns'].iloc[0]
    year = df['year'].iloc[0]
    query = f"SELECT date FROM dates_growth_rates WHERE id_ns = '{id_ns}' AND year = '{year}';"
    date_result = pd.read_sql(query, engine)
    return date_result.iloc[0, 0] if not date_result.empty else None

def process_csv(csv_path, engine):
    old_tables_dict_1 = {}  # Local dictionary for each CSV
    table_counter = 1

    filename = os.path.basename(csv_path)
    id_ns_year_matches = re.findall(r'ns-(\d+)-(\d{4})', filename)
    if id_ns_year_matches:
        id_ns, year = id_ns_year_matches[0]
    else:
        print("No matches found for id_ns and year in filename:", filename)
        return None, None, None  # Three values, matching what process_folder unpacks

    new_filename = os.path.splitext(os.path.basename(csv_path))[0].replace('-', '_')

    df = pd.read_csv(csv_path, delimiter=';')
    
    dataframe_name = f"{new_filename}_{table_counter}"
    old_tables_dict_1[dataframe_name] = df.copy()

    # Apply cleanup functions to a copy of the DataFrame
    df_clean = df.copy()

    # Conditional cleaning based on the name of the second column
    if df_clean.columns[1] == 'economic_sectors':
        df_clean = drop_nan_rows(df_clean)
        df_clean = drop_nan_columns(df_clean)
        df_clean = clean_columns_values(df_clean)
        df_clean = convert_float(df_clean)
        df_clean = replace_set_sep(df_clean)
        df_clean = spaces_se_es(df_clean)
        df_clean = replace_mineria(df_clean)
        df_clean = replace_mining(df_clean)
        df_clean = rounding_values(df_clean, decimals=1)
    else:
        # Cleaning functions (set as required)
        df_clean = clean_column_names(df_clean)
        df_clean = adjust_column_names(df_clean)
        df_clean = drop_rare_caracter_row(df_clean)
        df_clean = drop_nan_rows(df_clean)
        df_clean = drop_nan_columns(df_clean)
        df_clean = reset_index(df_clean)
        df_clean = remove_digit_slash(df_clean)
        df_clean = replace_var_perc_first_column(df_clean)
        df_clean = replace_var_perc_last_columns(df_clean)
        df_clean = replace_number_moving_average(df_clean)
        df_clean = relocate_last_column(df_clean)
        df_clean = clean_first_row(df_clean)
        df_clean = find_year_column(df_clean)
        year_columns = extract_years(df_clean)
        df_clean = get_months_sublist_list(df_clean, year_columns)
        df_clean = first_row_columns(df_clean)
        df_clean = clean_columns_values(df_clean)
        df_clean = convert_float(df_clean)
        df_clean = replace_set_sep(df_clean)
        df_clean = spaces_se_es(df_clean)
        df_clean = replace_mineria(df_clean)
        df_clean = replace_mining(df_clean)
        df_clean = rounding_values(df_clean, decimals=1)

    # Add the column 'year' to the clean DataFrame
    df_clean.insert(0, 'year', year)
    
    # Add the column 'id_ns' to the clean DataFrame
    df_clean.insert(1, 'id_ns', id_ns)
    
    # Get corresponding date from database
    date = get_date(df_clean, engine)
    if date:
        # Add 'date' column to cleaned DataFrame
        df_clean.insert(2, 'date', date)
    else:
        print("Date not found in database for id_ns:", id_ns, "and year:", year)

    # Store cleaned DataFrame in old_dataframes_dict_1
    old_dataframes_dict_1[dataframe_name] = df_clean

    return id_ns, year, old_tables_dict_1

# Function to process folder
def process_folder(folder, engine):
    print(f"Processing folder {os.path.basename(folder)}")
    csv_files = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith('.csv')]

    num_csv_processed = 0
    num_dataframes_generated = 0

    table_counter = 1  # Initialize table counter here
    old_tables_dict_1 = {}  # Declare old_tables_dict_1 outside main loop
    
    for csv_file in csv_files:
        id_ns, year, tables_dict_temp = process_csv(csv_file, engine)

        if tables_dict_temp:
            for dataframe_name, df in tables_dict_temp.items():
                file_name = os.path.splitext(os.path.basename(csv_file))[0].replace('-', '_')
                dataframe_name = f"{file_name}_1"
                
                # Store raw DataFrame in old_tables_dict_1
                old_tables_dict_1[dataframe_name] = df.copy()
                
                # Apply cleanup functions to a copy of the DataFrame
                df_clean = df.copy()

                # Conditional cleaning based on the name of the second column
                if df_clean.columns[1] == 'economic_sectors':
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_mineria(df_clean)
                    df_clean = replace_mining(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)
                else:
                    # Cleaning functions (set as required)
                    df_clean = clean_column_names(df_clean)
                    df_clean = adjust_column_names(df_clean)
                    df_clean = drop_rare_caracter_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = replace_var_perc_first_column(df_clean)
                    df_clean = replace_var_perc_last_columns(df_clean)
                    df_clean = replace_number_moving_average(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = find_year_column(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = get_months_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_mineria(df_clean)
                    df_clean = replace_mining(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)

                # Add the column 'year' to the clean DataFrame
                df_clean.insert(0, 'year', year)

                # Add the column 'id_ns' to the clean DataFrame
                df_clean.insert(1, 'id_ns', id_ns)

                # Get corresponding date from database
                date = get_date(df_clean, engine)
                if date:
                    # Add 'date' column to cleaned DataFrame
                    df_clean.insert(2, 'date', date)
                else:
                    print("Date not found in database for id_ns:", id_ns, "and year:", year)

                # Store cleaned DataFrame in old_dataframes_dict_1
                old_dataframes_dict_1[dataframe_name] = df_clean

                print(f'  {table_counter}. The dataframe generated for the {csv_file} file is: {dataframe_name}')
                num_dataframes_generated += 1
                table_counter += 1  # Increment table counter here
        
        num_csv_processed += 1  # Increment number of processed CSV for each CSV in folder

    return num_csv_processed, num_dataframes_generated, old_tables_dict_1

def process_folders():
    csv_folder = table_1_folder
    folders = [os.path.join(csv_folder, d) for d in os.listdir(csv_folder) if os.path.isdir(os.path.join(csv_folder, d)) and re.match(r'\d{4}', d)]
    
    old_tables_dict_1 = {}  # Initialize old_tables_dict_1 here
    
    for folder in folders:
        if folder_processed(folder):
            print(f"Folder {folder} has already been processed.")
            continue
        
        num_csv_processed, num_dataframes_generated, tables_dict_temp = process_folder(folder, engine)
        
        # Update old_tables_dict_1 with values returned from process_folder()
        old_tables_dict_1.update(tables_dict_temp)
        
        register_processed_folder(folder, num_csv_processed)

        # Ask user if they want to continue with next folder
        root = Tk()
        root.withdraw()
        root.attributes('-topmost', True)  # Ensure the messagebox is in front
        message = f"Process {folder} complete. Processed {num_csv_processed} CSV(s) and generated {num_dataframes_generated} DataFrame(s). Continue with next folder?"
        keep_going = messagebox.askyesno("Continue?", message)
        root.destroy()  # Release the hidden Tk window before moving on
        if not keep_going:
            break
            
    print("Processing completed for all folders.")  # Add a message to indicate completion
    
    return old_tables_dict_1

if __name__ == "__main__":
    engine = create_sqlalchemy_engine()
    old_tables_dict_1 = process_folders()
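
<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
    <code>process_csv</code> relies on file names of the form <code>ns-&lt;id&gt;-&lt;year&gt;.csv</code>. The sketch below isolates that step: it extracts the NS id and the year with the same regular expression and builds the dictionary key used for the Table 1 DataFrames. The function name <code>parse_ns_filename</code> and the sample paths are hypothetical.
</div>

```python
import os
import re

def parse_ns_filename(csv_path, table_number=1):
    """Return (id_ns, year, dataframe_name), or None if the file name does not match."""
    filename = os.path.basename(csv_path)
    match = re.search(r'ns-(\d+)-(\d{4})', filename)
    if match is None:
        return None
    id_ns, year = match.groups()  # Both are returned as strings
    stem = os.path.splitext(filename)[0].replace('-', '_')
    return id_ns, year, f"{stem}_{table_number}"

parsed = parse_ns_filename('input_data/table_1/1997/ns-13-1997.csv')
unmatched = parse_ns_filename('notes.csv')
print(parsed)
print(unmatched)
```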


In [None]:
old_tables_dict_1['ns_13_1997_1'].head(30)

In [None]:
old_dataframes_dict_1['ns_13_1997_1'].head(30)

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="2-1-2">
   <!-- Contenido de la celda de destino -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">2.1.2.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    <span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">Table 2.</span> Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.
    </span>
    </h3>

<div style="text-align: left;">
    <span style="font-size: 24px; color: rgb(255, 32, 78); font-weight: bold;">&#9888;</span>
    <span style="font-family: PT Serif Pro Book; color: black; font-size: 16px;">
        Please check that the flat file <b>"ns_dates.csv"</b> is updated with the dates, years and ids of the newly downloaded PDFs <span style="font-size: 24px;">&#128462;</span> (WR). That file is located in the <b>"ns_dates"</b> folder and is uploaded to SQL from the Jupyter notebook <code>aux_files_to_sql.ipynb</code>.
    </span>
</div>

In [None]:
# Set the locale to Spanish
locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')

# Dictionary to store generated DataFrames
old_dataframes_dict_2 = {}

# Path for the processed folders log file
record_path = 'dataframes_record/old_processed_folders_2.txt'

# Function to correct month names
def correct_month_name(month):
    months_mapping = {
        'setiembre': 'septiembre',
        # Add more mappings as needed for other month names
    }
    return months_mapping.get(month, month)

# Function to register processed folder
def register_processed_folder(folder, num_processed_files):
    with open(record_path, 'a') as file:
        file.write(f"{folder}:{num_processed_files}\n")

# Function to check if folder has been processed
def folder_processed(folder):
    if not os.path.exists(record_path):
        return False
    with open(record_path, 'r') as file:
        for line in file:
            if line.startswith(folder):
                return True
    return False

# Function to fetch date from database
def get_date(df, engine):
    id_ns = df['id_ns'].iloc[0]
    year = df['year'].iloc[0]
    query = f"SELECT date FROM dates_growth_rates WHERE id_ns = '{id_ns}' AND year = '{year}';"
    date_result = pd.read_sql(query, engine)
    return date_result.iloc[0, 0] if not date_result.empty else None

def process_csv(csv_path, engine):
    old_tables_dict_2 = {}  # Local dictionary for each CSV
    table_counter = 2

    filename = os.path.basename(csv_path)
    id_ns_year_matches = re.findall(r'ns-(\d+)-(\d{4})', filename)
    if id_ns_year_matches:
        id_ns, year = id_ns_year_matches[0]
    else:
        print("No matches found for id_ns and year in filename:", filename)
        return None, None, None  # Three values, matching what the calling function unpacks

    new_filename = os.path.splitext(os.path.basename(csv_path))[0].replace('-', '_')

    df = pd.read_csv(csv_path, delimiter=';')
    
    dataframe_name = f"{new_filename}_{table_counter}"
    old_tables_dict_2[dataframe_name] = df.copy()

    # Apply cleanup functions to a copy of the DataFrame
    df_clean = df.copy()

    # Conditional cleaning based on the name of the second column
    if df_clean.columns[1] == 'economic_sectors':
        df_clean = drop_nan_rows(df_clean)
        df_clean = drop_nan_columns(df_clean)
        df_clean = clean_columns_values(df_clean)
        df_clean = convert_float(df_clean)
        df_clean = replace_set_sep(df_clean)
        df_clean = spaces_se_es(df_clean)
        df_clean = replace_mineria(df_clean)
        df_clean = replace_mining(df_clean)
        df_clean = rounding_values(df_clean, decimals=1)
    else:
        # Cleaning functions (set as required)
        df_clean = replace_total_with_year(df_clean)
        df_clean = drop_nan_rows(df_clean)
        df_clean = drop_nan_columns(df_clean)
        year_columns = extract_years(df_clean)
        df_clean = roman_arabic(df_clean)
        df_clean = fix_duplicates(df_clean)
        df_clean = relocate_last_column(df_clean)
        df_clean = replace_first_row_nan(df_clean)
        df_clean = clean_first_row(df_clean)
        df_clean = get_quarters_sublist_list(df_clean, year_columns)
        df_clean = reset_index(df_clean)
        df_clean = first_row_columns(df_clean)
        df_clean = clean_columns_values(df_clean)
        df_clean = reset_index(df_clean)
        df_clean = convert_float(df_clean)
        df_clean = replace_set_sep(df_clean)
        df_clean = spaces_se_es(df_clean)
        df_clean = replace_mineria(df_clean)
        df_clean = replace_mining(df_clean)
        df_clean = rounding_values(df_clean, decimals=1)

    # Add the column 'year' to the clean DataFrame
    df_clean.insert(0, 'year', year)
    
    # Add the column 'id_ns' to the clean DataFrame
    df_clean.insert(1, 'id_ns', id_ns)
    
    # Get corresponding date from database
    date = get_date(df_clean, engine)
    if date:
        # Add 'date' column to cleaned DataFrame
        df_clean.insert(2, 'date', date)
    else:
        print("Date not found in database for id_ns:", id_ns, "and year:", year)

    # Store cleaned DataFrame in old_dataframes_dict_2
    old_dataframes_dict_2[dataframe_name] = df_clean

    return id_ns, year, old_tables_dict_2

# Function to process folder
def process_folder(folder, engine):
    print(f"Processing folder {os.path.basename(folder)}")
    csv_files = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith('.csv')]

    num_csv_processed = 0
    num_dataframes_generated = 0

    table_counter = 1  # Initialize table counter here
    old_tables_dict_2 = {}  # Declare old_tables_dict_2 outside main loop
    
    for csv_file in csv_files:
        id_ns, year, tables_dict_temp = process_csv(csv_file, engine)

        if tables_dict_temp:
            for dataframe_name, df in tables_dict_temp.items():
                file_name = os.path.splitext(os.path.basename(csv_file))[0].replace('-', '_')
                dataframe_name = f"{file_name}_2"
                
                # Store raw DataFrame in old_tables_dict_2
                old_tables_dict_2[dataframe_name] = df.copy()
                
                # Process and clean the DataFrame
                df_clean = df.copy()
                
                # Conditional cleaning based on the name of the second column
                if df_clean.columns[1] == 'economic_sectors':
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_mineria(df_clean)
                    df_clean = replace_mining(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)
                else:
                    # Cleaning functions (set as required)
                    df_clean = replace_total_with_year(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = roman_arabic(df_clean)
                    df_clean = fix_duplicates(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = replace_first_row_nan(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = get_quarters_sublist_list(df_clean, year_columns)
                    df_clean = reset_index(df_clean)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = convert_float(df_clean)
                    df_clean = replace_set_sep(df_clean)
                    df_clean = spaces_se_es(df_clean)
                    df_clean = replace_mineria(df_clean)
                    df_clean = replace_mining(df_clean)
                    df_clean = rounding_values(df_clean, decimals=1)
                
                # Add the column 'year' to the clean DataFrame
                df_clean.insert(0, 'year', year)

                # Add the column 'id_ns' to the clean DataFrame
                df_clean.insert(1, 'id_ns', id_ns)

                # Get corresponding date from database
                date = get_date(df_clean, engine)
                if date:
                    # Add 'date' column to cleaned DataFrame
                    df_clean.insert(2, 'date', date)
                else:
                    print("Date not found in database for id_ns:", id_ns, "and year:", year)

                # Store cleaned DataFrame in old_dataframes_dict_2
                old_dataframes_dict_2[dataframe_name] = df_clean

                print(f'  {table_counter}. The dataframe generated for the {csv_file} file is: {dataframe_name}')
                num_dataframes_generated += 1
                table_counter += 1  # Increment table counter here
        
        num_csv_processed += 1  # Increment number of processed CSV for each CSV in folder

    return num_csv_processed, num_dataframes_generated, old_tables_dict_2

def process_folders():
    csv_folder = table_2_folder
    folders = [os.path.join(csv_folder, d) for d in os.listdir(csv_folder) if os.path.isdir(os.path.join(csv_folder, d)) and re.match(r'\d{4}', d)]
    
    old_tables_dict_2 = {}  # Initialize old_tables_dict_2 here
    
    for folder in folders:
        if folder_processed(folder):
            print(f"Folder {folder} has already been processed.")
            continue
        
        num_csv_processed, num_dataframes_generated, tables_dict_temp = process_folder(folder, engine)
        
        # Update old_tables_dict_2 with values returned from process_folder()
        old_tables_dict_2.update(tables_dict_temp)
        
        register_processed_folder(folder, num_csv_processed)
    
        # Ask user if they want to continue with next folder
        root = Tk()
        root.withdraw()
        root.attributes('-topmost', True)  # Ensure the messagebox is in front
        message = f"Process {folder} complete. Processed {num_csv_processed} CSV(s) and generated {num_dataframes_generated} DataFrame(s). Continue with next folder?"
        keep_going = messagebox.askyesno("Continue?", message)
        root.destroy()  # Release the hidden Tk root created for this dialog
        if not keep_going:
            break

    print("Processing completed for all folders.")  # Add a message to indicate completion

    return old_tables_dict_2
    
if __name__ == "__main__":
    engine = create_sqlalchemy_engine()
    old_tables_dict_2 = process_folders()

In [None]:
old_tables_dict_2['ns_44_2004_2']

In [None]:
old_dataframes_dict_2['ns_44_2004_2']
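
<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
The dictionary keys used in the two cells above (for example <code>ns_44_2004_2</code>) follow from the CSV file name. The following is a minimal sketch of that naming rule, mirroring the regular expression and string replacement used in <code>process_csv</code>. The helper name <code>dataframe_key</code> and the example path are hypothetical and are not part of this notebook.
</div>

```python
import os
import re

def dataframe_key(csv_path, table_number=2):
    """Return (id_ns, year, key) for a WR file name such as 'ns-44-2004.csv'.

    Hypothetical helper: it only reproduces the naming rule, not the cleaning.
    """
    filename = os.path.basename(csv_path)
    match = re.search(r'ns-(\d+)-(\d{4})', filename)
    if match is None:
        return None  # File name does not follow the ns-<id>-<year> pattern
    id_ns, year = match.groups()
    # Hyphens become underscores and the table number is appended as a suffix
    stem = os.path.splitext(filename)[0].replace('-', '_')
    return id_ns, year, f"{stem}_{table_number}"

# Example path assumed for illustration only
print(dataframe_key('2004/ns-44-2004.csv'))
```

Note that <code>id_ns</code> and <code>year</code> come back as strings, which is why the query in <code>get_date</code> quotes both values.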

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="3">
   <!-- Content of the target cell -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Real-time data of Peru's GDP growth rates</span></h1>

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
This section builds the GDP growth rate vintages for Peru from <a href="#3-2-1" style="color: rgb(0, 153, 123); font-size: 16px;">Table 1</a> and <a href="#3-2-1" style="color: rgb(0, 153, 123); font-size: 16px;">Table 2</a>. Each table from each WR (CSV <span style="font-size: 24px;">&#128452;</span>) was extracted and cleaned individually in the previous section. Here, we concatenate all the tables for a given economic sector, producing a vintage dataset of (real) GDP growth by economic sector from <b>1994</b> to <b>2012</b> (<a href="#3-2-1" style="color: rgb(0, 153, 123); font-size: 16px;">Table 1</a>) and from <b>1997</b> to <b>2012</b> (<a href="#3-2-1" style="color: rgb(0, 153, 123); font-size: 16px;">Table 2</a>).
</div>
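
<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
To make the idea of concatenation concrete, the sketch below stacks the row of one sector from several cleaned WR tables, giving one row per vintage. It is only an illustration under stated assumptions: the function <code>concatenate_sector</code>, the period column names (<code>2004_3</code>, <code>2004_4</code>), the dates and the numbers are all hypothetical, and the functions this notebook actually uses for this step are defined elsewhere.
</div>

```python
import pandas as pd

def concatenate_sector(dataframes, sector_english):
    """Stack the row of one sector from every cleaned WR table (one row per vintage)."""
    rows = []
    for df in dataframes.values():
        # Keep only the row of the requested sector in this WR table
        match = df[df['economic_sectors'] == sector_english]
        if not match.empty:
            rows.append(match)
    if not rows:
        return pd.DataFrame()
    # Columns are aligned by name; periods missing in a vintage become NaN
    vintages = pd.concat(rows, ignore_index=True, sort=False)
    # Order the vintages by publication date
    return vintages.sort_values('date').reset_index(drop=True)

# Toy tables with made-up values, for illustration only
toy = {
    'ns_02_2005_2': pd.DataFrame({
        'year': ['2005'], 'id_ns': ['02'], 'date': pd.to_datetime(['2005-01-14']),
        'economic_sectors': ['manufacturing'], '2004_3': [6.1], '2004_4': [7.0],
    }),
    'ns_44_2004_2': pd.DataFrame({
        'year': ['2004'], 'id_ns': ['44'], 'date': pd.to_datetime(['2004-11-19']),
        'economic_sectors': ['manufacturing'], '2004_3': [5.9],
    }),
}
vintages = concatenate_sector(toy, 'manufacturing')
print(vintages)
```

In this toy example the earlier vintage has no value for the later period, so that cell is left as missing rather than filled.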

<div id="select_sector">
   <!-- Content of the target cell -->
</div>

<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Select <code>sector_economico</code> and <code>economic_sector</code></span></h1>
    </div>

<div style="font-family: PT Serif Pro Book; text-align: left; color:dark; font-size:16px">
<p>     
Running the following code opens a window with options in <b>Spanish</b> and <b>English</b> for selecting the <b>economic sector</b>. The selection determines which sector's GDP growth rates (annual, quarterly or monthly) are concatenated.
</p>
</div>

In [9]:
# Call the function to display the window and capture the selected values
selected_spanish, selected_english, sector = show_option_window()

# Display the selected values
print(f"You have selected sector = {sector}, selected_spanish = {selected_spanish}, and selected_english = {selected_english}.")

You have selected sector = manufacturing, selected_spanish = manufactura, and selected_english = manufacturing.


<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Select the dataset name prefix</span></h1>
    </div>

In [None]:
# show_option_window() returns three values (see the cell above); keep only the sector prefix
_, _, sector = show_option_window()
print("Selected economic sector:", sector)

<div id="select_freq">
   <!-- Content of the target cell -->
</div>

<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Select <code>frequency</code></span></h1>
    </div>

In [None]:
# Call the function to show the popup window
frequency = show_frequency_window()
print("Selected frequency:", frequency)

<div id="counter">
   <!-- Content of the target cell -->
</div>

<div style="background-color: #00414C; color: white; padding: 10px;">
<h1><span style = "color: #15F5BA; font-family: 'PT Serif Pro Book'; color: dark;">$\bullet$</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Set counter (dataframe name suffix)</span></h1>
    </div>

In [None]:
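# Map the selected frequency to the suffix of the source dictionary:
# monthly -> old_dataframes_dict_1; quarterly and annual -> old_dataframes_dict_2
if frequency == "monthly":
    counter = 1
elif frequency in ("quarterly", "annual"):
    counter = 2
else:
    counter = None  # Unknown frequency

print(counter)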

<div id="3-1">
   <!-- Content of the target cell -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">3.1.</span>
    <span style = "color: dark; font-family: PT Serif Pro Book;">
    Growth rates datasets concatenation for all frequencies
    </span>
    </h2>

In [None]:
# Dynamically construct the function name and dictionary name
function_name = f"concatenate_{frequency}_df"
dataframe_dict_name = f"old_dataframes_dict_{counter}"

# Check that both the function and dictionary exist in the global scope
if function_name in globals() and dataframe_dict_name in globals():
    # Call the function using its reference from globals()
    globals()[f"old_{sector}_{frequency}_growth_rates"] = globals()[function_name](
        globals()[dataframe_dict_name], selected_spanish, selected_english
    )
else:
    print(f"Error: {function_name} or {dataframe_dict_name} does not exist in the global scope.")

In [None]:
#pd.set_option('display.max_rows', None)
globals()[f"old_{sector}_{frequency}_growth_rates"].head(10)
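
<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
The lookup through <code>globals()</code> above can also be expressed with explicit dictionaries, which fails loudly on an unknown frequency instead of printing a message. The following is a minimal sketch of that alternative, not the method used in this notebook: the three <code>concatenate_*_df</code> functions are placeholders standing in for the real ones, and <code>build_growth_rates</code>, <code>CONCATENATORS</code> and <code>SOURCE_TABLE</code> are hypothetical names.
</div>

```python
def concatenate_monthly_df(dataframes, spanish, english):
    # Placeholder standing in for the notebook's real function
    return f"monthly:{english}:{len(dataframes)}"

def concatenate_quarterly_df(dataframes, spanish, english):
    # Placeholder standing in for the notebook's real function
    return f"quarterly:{english}:{len(dataframes)}"

def concatenate_annual_df(dataframes, spanish, english):
    # Placeholder standing in for the notebook's real function
    return f"annual:{english}:{len(dataframes)}"

# Frequency -> concatenation function
CONCATENATORS = {
    'monthly': concatenate_monthly_df,
    'quarterly': concatenate_quarterly_df,
    'annual': concatenate_annual_df,
}
# Frequency -> source table number (same mapping as the counter cell)
SOURCE_TABLE = {'monthly': 1, 'quarterly': 2, 'annual': 2}

def build_growth_rates(frequency, sector, spanish, english, dicts_by_table, results):
    """Run the concatenation for one frequency and store it under the dataset name."""
    if frequency not in CONCATENATORS:
        raise ValueError(f"Unknown frequency: {frequency!r}")
    name = f"old_{sector}_{frequency}_growth_rates"
    source = dicts_by_table[SOURCE_TABLE[frequency]]
    results[name] = CONCATENATORS[frequency](source, spanish, english)
    return name

results = {}
name = build_growth_rates(
    'quarterly', 'manufacturing', 'manufactura', 'manufacturing',
    {1: {}, 2: {'ns_44_2004_2': None}}, results,
)
print(name, '->', results[name])
```

Keeping the outputs in a plain dictionary such as <code>results</code> also avoids creating variables whose names are only known at run time.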

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

<div id="4">
   <!-- Content of the target cell -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;">4.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Uploading data to SQL</span></h1> 

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Finally, we upload all the datasets generated in this jupyter notebook to the <code>'gdp_revisions_datasets'</code> database in <code>PostgreSQL</code>.
</div>

In [None]:
engine = create_sqlalchemy_engine()

<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
Load the selected dataset into the database. An existing table with the same name is replaced.
</div>

In [None]:
globals()[f"old_{sector}_{frequency}_growth_rates"].to_sql(f'old_{sector}_{frequency}_growth_rates', engine, index=False, if_exists='replace')
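
<div style="text-align: left; font-family: 'PT Serif Pro Book'; color: dark; font-size:16px">
The effect of <code>if_exists='replace'</code> can be checked without touching the project database. The sketch below is an assumption-laden illustration: it uses an in-memory SQLite connection as a stand-in for the PostgreSQL engine, and the toy DataFrame and its <code>growth</code> column are made up. It writes the same table twice and reads it back to show that the second write overwrites the first instead of appending.
</div>

```python
import sqlite3
import pandas as pd

sector, frequency = 'manufacturing', 'quarterly'
table_name = f'old_{sector}_{frequency}_growth_rates'

# Toy dataset with made-up values, for illustration only
toy = pd.DataFrame({'id_ns': ['44'], 'year': ['2004'], 'growth': [5.9]})

# In-memory SQLite stands in for the PostgreSQL engine used in this notebook
con = sqlite3.connect(':memory:')
toy.to_sql(table_name, con, index=False, if_exists='replace')
toy.to_sql(table_name, con, index=False, if_exists='replace')  # Overwrites, does not append

# Read the table back to confirm its contents
check = pd.read_sql(f'SELECT * FROM {table_name}', con)
print(len(check), list(check.columns))
con.close()
```

Because the table is replaced on every run, re-running the upload cell for the same sector and frequency is safe, but any manual changes made to that table in the database are lost.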

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 20px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#select_sector" style="color: rgb(255, 32, 78); text-decoration: none;">⮝</a>
    </span> 
    <a href="#select_sector" style="color: rgb(255, 32, 78); text-decoration: none;">Back to select sectors.</a>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 20px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#select_freq" style="color: rgb(255, 32, 78); text-decoration: none;">⮝</a>
    </span> 
    <a href="#select_freq" style="color: rgb(255, 32, 78); text-decoration: none;">Back to select frequency.</a>
</div>

<div style="font-family: PT Serif Pro Book; text-align: left; color: dark; font-size: 16px;">
    <span style="font-size: 30px; color: rgb(255, 32, 78); font-weight: bold;">
        <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">&#11180;</a>
    </span> 
    <a href="#outilne" style="color: rgb(0, 153, 123); text-decoration: none;">Back to the outline.</a>
</div>

---
