<div style="text-align: center; font-family: 'charter'; color: rgb(0, 65, 75);">
    <h1>
    GDP Revisions Datasets
    </h1>
</div>

<div style="text-align: center; font-family: 'charter'; color: rgb(0, 65, 75);">
    <h4>
        Documentation
        <br>
        ____________________
            </br>
    </h4>
</div>

<div style="font-family: charter; text-align: left; color: dark;">
    This 
    <span style="color: rgb(61, 48, 162);">jupyter notebook</span>
    provides a step-by-step guide to <b>data building</b> for the project <b>'Revisiones y sesgos en las estimaciones preliminares del PBI en el Perú'</b>. The guide covers downloading PDF files that contain tables of Peru's annual, quarterly, and monthly GDP growth rates (including sectoral GDP) and extracting this information into SQL tables. These datasets will be used for data analysis.
</div>


<div style="text-align: center; font-family: 'charter'; color: rgb(0, 65, 75);">
    Jason Cruz
    <br>
    <a href="mailto:jj.cruza@up.edu.pe" style="color: rgb(0, 153, 123)">
        jj.cruza@up.edu.pe
    </a>
</div>

<div style="font-family: Times New Roman; text-align: left; color: rgb(61, 48, 162)">The provided outline is functional. Use its links to jump to each section of this notebook.</div>

<div id="outilne">
   <!-- Target cell content -->
</div>

<div style="background-color: #141414; padding: 10px;">
<h2 style="text-align: left; font-family: 'charter'; color: #E0E0E0;">
    Outline
    </h2>
    <br>
    <a href="#1" style="color: #687EFF; font-size: 18px;">
        1. PDF Downloader</a>
    <br>
    <a href="#2" style="color: #687EFF; font-size: 18px;">
        2. Extracting Tables (and data cleaning)</a>
    <br>
    <a href="#2-1" style="color: rgb(0, 153, 123); font-size: 12px;">
        2.1. 'pdfplumber' demo.</a>
    <br>
    <a href="#2-1-1" style="color: #E0E0E0; font-size: 12px;">
        2.1.1. What data would we get if we used the default settings?</a>   
    <br>
    <a href="#2-1-2" style="color: #E0E0E0; font-size: 12px;">
        2.1.2. Using custom '.extract_table' settings.</a>
    <br> 
    <a href="#2-2" style="color: rgb(0, 153, 123); font-size: 12px;">
        2.2. Extracting tables and generating dataframes (includes data cleanup).</a>
    <br>
    <a href="#3" style="color: #687EFF; font-size: 18px;">3. SQL Tables</a>
    <br>
    <a href="#3-1" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.1. Annual Concatenation.</a>
    <br>
    <a href="#3-2" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.2. Quarterly Concatenation.</a>
    <br>
    <a href="#3-3" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.3. Monthly Concatenation.</a>
    <br>
    <a href="#3-4" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.4. Loading SQL.</a>
</div>

<div style="text-align: left; font-family: 'charter'; color: dark;">
    For any questions or issues regarding the code, please <a href="mailto:jj.cruza@alum.up.edu.pe" style="color: rgb(0, 153, 123)">email Jason Cruz
    </a>.
    </div>

<div style="text-align: left; font-family: 'charter'; color: dark;">
    If any of the libraries below are missing, install them with a command like the following (shown as an example).
    </div>

In [None]:
#!pip install pdfplumber # Uncomment this line (remove the leading "#") to install a missing library.

<div style="text-align: left; font-family: 'charter'; color: dark;">
    <h2>
    Libraries
    </h2>
    </div>

In [1]:
# PDF Downloader

import os  # for file and directory manipulation
import random  # to generate random numbers
import time  # to manage time and take breaks in the script
import requests  # to make HTTP requests to web servers
from selenium import webdriver  # for automating web browsers
from selenium.webdriver.common.by import By  # to locate elements on a webpage
from selenium.webdriver.support.ui import WebDriverWait  # to wait until certain conditions are met on a webpage.
from selenium.webdriver.support import expected_conditions as EC  # to define expected conditions
from selenium.common.exceptions import StaleElementReferenceException  # To handle exceptions related to elements on the webpage that are no longer available.


# Extracting Tables (and data cleaning)

import pdfplumber  # for extracting text and metadata from PDF files
import pandas as pd  # for data manipulation and analysis
import numpy as np  # for numerical operations and NaN values (used by the Table 1 functions)
import unicodedata  # for manipulating Unicode data
import re  # for regular expressions operations
from datetime import datetime  # for working with dates and times
import locale  # for locale-specific formatting of numbers, dates, and currencies


# SQL tables

import psycopg2  # for interacting with PostgreSQL databases
from sqlalchemy import create_engine, text  # for creating and executing SQL queries using SQLAlchemy


<div style="text-align: left; font-family: 'charter'; color: dark;">
    <h2>
    Initial set-up
    </h2>
    </div>

<div style="font-family: charter; text-align: left; color:dark"> The next six code cells create the working folders in your current path. The rest of the notebook reads its inputs from these folders and writes its outputs to them. </div>

In [2]:
# Folder path to download PDF files

raw_pdf = 'raw_pdf' # to save raw data (.pdf).
if not os.path.exists(raw_pdf):
    os.mkdir(raw_pdf) # to create the folder (if it doesn't exist)

In [3]:
# Folder path to save text file with the names of already downloaded files

download_record = 'download_record'
if not os.path.exists(download_record):
    os.mkdir(download_record) # to create the folder (if it doesn't exist)

In [4]:
# Folder path to download the trimmed PDF files (these are PDF inputs for the extraction and cleanup code)

input_pdf = 'input_pdf'
if not os.path.exists(input_pdf):
    os.makedirs(input_pdf)

In [5]:
# Folder path to save the text file with the names of already trimmed PDF files

trimmed_record = 'trimmed_record'
if not os.path.exists(trimmed_record):
    os.makedirs(trimmed_record)

In [6]:
# Folder path to save dataframes generated record by year

dataframes_record = 'dataframes_record'
if not os.path.exists(dataframes_record):
    os.makedirs(dataframes_record)

In [7]:
# Folder path for the pseudo-raw PDF files

pseudo_raw_pdf = 'pseudo_raw_pdf'
if not os.path.exists(pseudo_raw_pdf):
    os.makedirs(pseudo_raw_pdf)
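
The six set-up cells above repeat one pattern. As a minimal sketch (not the notebook's own code), the same set-up can be written as a loop around `os.makedirs(..., exist_ok=True)`, which creates a folder only when it is missing. The helper name `ensure_folders` and its `base_dir` argument are assumptions introduced for illustration; the folder names come from the cells above.

```python
import os
import tempfile

# Folder names taken from the set-up cells above
FOLDER_NAMES = [
    "raw_pdf", "download_record", "input_pdf",
    "trimmed_record", "dataframes_record", "pseudo_raw_pdf",
]

def ensure_folders(base_dir, names=FOLDER_NAMES):
    """Create each folder under base_dir if it is missing and return the paths (hypothetical helper)."""
    paths = []
    for name in names:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths

# Demonstration in a throwaway directory so the working directory stays untouched
demo_dir = tempfile.mkdtemp()
created = ensure_folders(demo_dir)
print([os.path.basename(p) for p in created])
```

Calling `ensure_folders(".")` would create the folders in the current path, as the cells above do.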

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="1">
   <!-- Target cell content -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: charter;">1.</span> <span style = "color: dark; font-family: charter;">PDF Downloader</span></h1>

<div style="font-family: charter; text-align: left; color:dark">
    Our main source for data collection is the <a href="https://www.bcrp.gob.pe/" style="color: rgb(0, 153, 123)">BCRP's web page</a> (.../publicaciones/nota-semanal). The BCRP publishes "Notas Semanales", documents that contain, among other information, tables of GDP and sectoral GDP growth rate values for annual, quarterly and monthly frequencies.
    </div>

-- (pending) Selenium tutorial

<div style="font-family: charter; text-align: left; color:dark">
    The provided code will download all the 'Notas Semanales' files in PDF format from this web page.
    </div>

In [None]:
# Setting the BCRP URL
bcrp_url = "https://www.bcrp.gob.pe/publicaciones/nota-semanal.html"  # Never replace this URL

# With a user input prompt and an alarm for each batch

In [None]:
import os
import pygame
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import StaleElementReferenceException
import requests
import time

# Initialize pygame's audio mixer
pygame.mixer.init()

# Folder where the sound files are stored
sound_folder = "sound"

# List of available sound files
available_sounds = os.listdir(sound_folder)

# Pick a random sound
random_sound = random.choice(available_sounds)

# Full path to the random sound
sound_path = os.path.join(sound_folder, random_sound)

# Load the selected sound
pygame.mixer.music.load(sound_path)

# Function to play the sound
def play_sound():
    pygame.mixer.music.play()

# List to keep track of successfully downloaded files
downloaded_files = []

# Folder where downloaded PDF files will be saved
raw_pdf = "raw_pdf"  # Replace with the actual path

# Folder where the download record file will be saved
download_record = "download_record"  # Replace with the actual path

# Load the list of previously downloaded files if it exists
if os.path.exists(os.path.join(download_record, "downloaded_files.txt")):
    with open(os.path.join(download_record, "downloaded_files.txt"), "r") as f:
        downloaded_files = f.read().splitlines()

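# Web driver setup (Selenium 4 syntax: the driver path is passed through a Service object)
from selenium.webdriver.chrome.service import Service
driver_path = os.environ.get('driver_path')
driver = webdriver.Chrome(service=Service(executable_path=driver_path))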

def random_wait(min_time, max_time):
    wait_time = random.uniform(min_time, max_time)
    print(f"Waiting randomly for {wait_time:.2f} seconds")
    time.sleep(wait_time)

def download_pdf(pdf_link):
    # Click the link using JavaScript
    driver.execute_script("arguments[0].click();", pdf_link)

    # Wait for the new page to fully open (adjust timing as necessary)
    wait.until(EC.number_of_windows_to_be(2))

    # Switch to the new window or tab
    windows = driver.window_handles
    driver.switch_to.window(windows[1])

    # Get the current URL (may vary based on site-specific logic)
    new_url = driver.current_url
    print(f"{download_counter}. New URL: {new_url}")

    # Get the file name from the URL
    file_name = new_url.split("/")[-1]

    # Form the full destination path
    destination_path = os.path.join(raw_pdf, file_name)

    # Download the PDF
    response = requests.get(new_url, stream=True)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Save the PDF content to the local file
        with open(destination_path, 'wb') as pdf_file:
            for chunk in response.iter_content(chunk_size=128):
                pdf_file.write(chunk)

        print(f"PDF downloaded successfully at: {destination_path}")

    else:
        print(f"Error downloading the PDF. Response code: {response.status_code}")

    # Close the new window or tab
    driver.close()

    # Switch back to the main window
    driver.switch_to.window(windows[0])

# Number of downloads per batch
downloads_per_batch = 5
# Total number of downloads
total_downloads = 25

try:
    # Open the test page
    driver.get(bcrp_url)
    print("Site opened successfully")

    # Wait for the container area to be present
    wait = WebDriverWait(driver, 60)
    container_area = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="rightside"]')))

    # Get all the links within the container area
    pdf_links = container_area.find_elements(By.XPATH, './/a')

    # Reverse the order of links
    pdf_links = list(reversed(pdf_links))

    # Initialize download counter
    download_counter = 0

    # Iterate over reversed links and download PDFs in batches
    for pdf_link in pdf_links:
        download_counter += 1

        # Get the file name from the URL
        new_url = pdf_link.get_attribute("href")
        file_name = new_url.split("/")[-1]

        # Check if the file has already been downloaded
        if file_name in downloaded_files:
            print(f"{download_counter}. The file {file_name} has already been downloaded previously. Skipping...")
            continue

        # Try to download the file
        try:
            download_pdf(pdf_link)

            # Update the list of downloaded files
            downloaded_files.append(file_name)

            # Save the file name in the record
            with open(os.path.join(download_record, "downloaded_files.txt"), "a") as f:
                f.write(file_name + "\n")

        except Exception as e:
            print(f"Error downloading the file {file_name}: {str(e)}")

        # If the download count reaches a multiple of the batch size, notify
        if download_counter % downloads_per_batch == 0:
            print(f"Batch {download_counter // downloads_per_batch} of {total_downloads // downloads_per_batch} completed")

        # If the download count reaches a multiple of 25, ask the user if they want to continue
        if download_counter % 25 == 0:
            play_sound()
            user_input = input("Do you want to continue downloading? (Enter 's' to continue, any other key to stop): ")
            pygame.mixer.music.stop()
            if user_input.lower() != 's':
                break

        # Random wait before the next iteration
        random_wait(5, 10)

        # If total downloads reached, break out of loop
        if download_counter == total_downloads:
            print(f"All downloads completed ({total_downloads} in total)")
            break

except StaleElementReferenceException:
    print("StaleElementReferenceException occurred. Re-run this cell to resume; files already downloaded are skipped.")

finally:
    # Close the browser when finished
    driver.quit()
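
The cell above avoids repeated downloads by keeping a plain-text record of file names. A minimal sketch of that bookkeeping in isolation, with no browser or network access, might look like this. The helper names `load_record`, `record_download` and `file_name_from_url` are hypothetical, and the URL is made up for illustration; the record file name `downloaded_files.txt` comes from the cell above.

```python
import os
import tempfile

def load_record(record_path):
    """Return the file names already recorded (empty list if no record exists)."""
    if not os.path.exists(record_path):
        return []
    with open(record_path, "r") as f:
        return f.read().splitlines()

def record_download(record_path, file_name):
    """Append one file name to the record, one name per line."""
    with open(record_path, "a") as f:
        f.write(file_name + "\n")

def file_name_from_url(url):
    """Take the last path segment of the URL as the file name, as the cell above does."""
    return url.split("/")[-1]

# Demonstration with a throwaway record file and an illustrative URL (assumption)
record_dir = tempfile.mkdtemp()
record_path = os.path.join(record_dir, "downloaded_files.txt")
example_url = "https://example.org/docs/ns-10-2013.pdf"

name = file_name_from_url(example_url)
if name not in load_record(record_path):
    record_download(record_path, name)  # the real download would happen before this line
print(load_record(record_path))
```

Because the record is appended after each successful download, an interrupted run can be resumed by re-running the cell.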


### Sorting PDFs by year

In [None]:
import os
import shutil


# Get the list of entries in the directory
archivos = os.listdir(raw_pdf)

# Iterate over each entry
for archivo in archivos:
    # Skip year folders created by a previous run (moving a folder into itself would fail)
    if not os.path.isfile(os.path.join(raw_pdf, archivo)):
        continue

    # Get the year from the file name
    nombre, extension = os.path.splitext(archivo)
    año = None
    partes_nombre = nombre.split('-')
    for parte in partes_nombre:
        if parte.isdigit() and len(parte) == 4:
            año = parte
            break

    # If a year was found, move the file to the matching folder
    if año:
        carpeta_destino = os.path.join(raw_pdf, año)
        # Create the folder if it does not exist
        if not os.path.exists(carpeta_destino):
            os.makedirs(carpeta_destino)
        # Move the file to the destination folder
        shutil.move(os.path.join(raw_pdf, archivo), carpeta_destino)
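
The year detection used above can be checked on its own before any file is moved. The following is a small sketch of that step; the function name `year_from_filename` is hypothetical, and the sample name `ns-10-2013.pdf` is the file opened later in the pdfplumber demo.

```python
import os

def year_from_filename(file_name):
    """Return the first 4-digit, hyphen-separated part of the name, or None (hypothetical helper)."""
    stem, _extension = os.path.splitext(file_name)
    for part in stem.split("-"):
        if part.isdigit() and len(part) == 4:
            return part
    return None

print(year_from_filename("ns-10-2013.pdf"))
print(year_from_filename("readme.txt"))
```

Files whose names contain no 4-digit part return `None` and would stay where they are.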


# Trimming PDFs

In [None]:
import fitz  # PyMuPDF
import os
import tkinter as tk

# Record directory and file names
trimmed_record_dir = 'trimmed_record'
trimmed_record_file = 'trimmed_files.txt'

class PopupWindow(tk.Toplevel):
    def __init__(self, root, message):
        super().__init__(root)
        self.root = root
        self.title("Attention!")
        self.message = message
        self.result = None
        self.configure_window()
        self.create_widgets()

    def configure_window(self):
        self.resizable(False, False)  # Prevent the window from being resized

    def create_widgets(self):
        self.label = tk.Label(self, text=self.message, wraplength=250)  # Wrap the text if it is too long
        self.label.pack(pady=10, padx=10)
        self.btn_frame = tk.Frame(self)
        self.btn_frame.pack(pady=5)
        self.btn_yes = tk.Button(self.btn_frame, text="Yes", command=self.yes)
        self.btn_yes.pack(side=tk.LEFT, padx=5)
        self.btn_no = tk.Button(self.btn_frame, text="No", command=self.no)
        self.btn_no.pack(side=tk.RIGHT, padx=5)

        # Compute the window size from the size of the text
        width = self.label.winfo_reqwidth() + 20
        height = self.label.winfo_reqheight() + 100
        self.geometry(f"{width}x{height}")

    def yes(self):
        self.result = True
        self.destroy()

    def no(self):
        self.result = False
        self.destroy()

def search_keywords(pdf_file, keywords):
    pages_with_keywords = []
    with fitz.open(pdf_file) as doc:
        for page_num in range(doc.page_count):
            page = doc.load_page(page_num)
            text = page.get_text()
            if any(keyword in text for keyword in keywords):
                pages_with_keywords.append(page_num)
    return pages_with_keywords

def trim_pdf(pdf_file, pages):
    if not pages:
        print(f"No pages containing the keywords were found in {pdf_file}")
        return 0
    
    new_pdf_file = os.path.join(input_pdf, os.path.basename(pdf_file))
    
    with fitz.open(pdf_file) as doc:
        new_doc = fitz.open()
        new_doc.insert_pdf(doc, from_page=0, to_page=0)
        for page_num in pages:
            new_doc.insert_pdf(doc, from_page=page_num, to_page=page_num)
        new_doc.save(new_pdf_file)
    
    num_pages_new_pdf = new_doc.page_count
    print(f"The trimmed PDF '{new_pdf_file}' has {num_pages_new_pdf} pages.")

    if num_pages_new_pdf == 5:
        final_doc = fitz.open()
        final_doc.insert_pdf(new_doc, from_page=0, to_page=0)
        final_doc.insert_pdf(new_doc, from_page=1, to_page=1)
        final_doc.insert_pdf(new_doc, from_page=3, to_page=3)
        final_doc.save(new_pdf_file)

        num_pages_new_pdf = final_doc.page_count
        print(f"Only the cover page and the pages with the 2 tables of interest were kept in the trimmed PDF '{new_pdf_file}'.")
    else:
        print(f"All pages were kept in the trimmed PDF '{new_pdf_file}'.")

    return num_pages_new_pdf

def read_trimmed_files():
    trimmed_files_path = os.path.join(trimmed_record_dir, trimmed_record_file)
    if not os.path.exists(trimmed_files_path):
        return set()
    
    with open(trimmed_files_path, 'r') as file:
        return set(file.read().splitlines())

def write_trimmed_files(trimmed_files):
    trimmed_files_path = os.path.join(trimmed_record_dir, trimmed_record_file)
    sorted_filenames = sorted(trimmed_files)  # Sort the filenames
    with open(trimmed_files_path, 'w') as file:
        for filename in sorted_filenames:
            file.write(filename + '\n')

if __name__ == "__main__":
    keywords = ["ECONOMIC SECTORS"]
    root = tk.Tk()
    root.withdraw()  # Hide the main Tkinter window

    trimmed_files = read_trimmed_files()
    processing_counter = 1

    for folder in os.listdir(raw_pdf):
        folder_path = os.path.join(raw_pdf, folder)
        if os.path.isdir(folder_path):
            print("Processing folder:", folder)
            num_pdfs_trimmed = 0
            for filename in os.listdir(folder_path):
                if filename.endswith(".pdf"):
                    pdf_file = os.path.join(folder_path, filename)
                    if filename in trimmed_files:
                        print(f"{processing_counter}. The PDF '{filename}' has already been trimmed and saved in '{input_pdf}'...")
                        processing_counter += 1
                        continue
                    print(f"{processing_counter}. Processing:", pdf_file)
                    
                    pages_with_keywords = search_keywords(pdf_file, keywords)
                    num_pages_new_pdf = trim_pdf(pdf_file, pages_with_keywords)
                    if num_pages_new_pdf > 0:
                        num_pdfs_trimmed += 1
                        trimmed_files.add(filename)
                        processing_counter += 1
            
            write_trimmed_files(trimmed_files)

            message = f"{num_pdfs_trimmed} PDFs were trimmed in folder {folder}. Do you want to continue?"
            popup = PopupWindow(root, message)
            root.wait_window(popup)
            if not popup.result:
                break
                
    print("Process completed for all PDFs in the directory:", input_pdf)
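
The page-selection rule inside `search_keywords` and `trim_pdf` can be reasoned about without opening any PDF. The sketch below mirrors it on plain Python lists: page texts stand in for `page.get_text()`, and the result is the list of source page indices that end up in the trimmed file. The function names and the sample texts are assumptions made for illustration.

```python
def pages_with_keywords(page_texts, keywords):
    """Return 0-based indices of pages whose text contains any keyword."""
    return [i for i, text in enumerate(page_texts) if any(k in text for k in keywords)]

def pages_to_keep(keyword_pages):
    """Return the source page indices kept in the trimmed PDF, mirroring trim_pdf."""
    if not keyword_pages:
        return []  # nothing is written when no page matches
    trimmed = [0] + list(keyword_pages)  # cover page first, then every matching page
    if len(trimmed) == 5:
        # Five pages: keep positions 0, 1 and 3 of the trimmed document
        return [trimmed[0], trimmed[1], trimmed[3]]
    return trimmed

# Made-up page texts standing in for a real Nota Semanal
sample_texts = ["Cover", "ECONOMIC SECTORS (annual)", "Other table", "ECONOMIC SECTORS (monthly)"]
matches = pages_with_keywords(sample_texts, ["ECONOMIC SECTORS"])
print(matches, pages_to_keep(matches))
```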


### Sorting trimmed PDFs by year

In [None]:
import os
import shutil


# Get the list of entries in the directory
archivos = os.listdir(input_pdf)

# Iterate over each entry
for archivo in archivos:
    # Skip year folders created by a previous run (moving a folder into itself would fail)
    if not os.path.isfile(os.path.join(input_pdf, archivo)):
        continue

    # Get the year from the file name
    nombre, extension = os.path.splitext(archivo)
    año = None
    partes_nombre = nombre.split('-')
    for parte in partes_nombre:
        if parte.isdigit() and len(parte) == 4:
            año = parte
            break

    # If a year was found, move the file to the matching folder
    if año:
        carpeta_destino = os.path.join(input_pdf, año)
        # Create the folder if it does not exist
        if not os.path.exists(carpeta_destino):
            os.makedirs(carpeta_destino)
        # Move the file to the destination folder
        shutil.move(os.path.join(input_pdf, archivo), carpeta_destino)


<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="2">
   <!-- Target cell content -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: charter;">2.</span> <span style = "color: dark; font-family: charter;">Extracting Tables (and data cleaning)</span></h1>

<div id="2-1">
   <!-- Target cell content -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">2.1.</span>
    <span style = "color: dark; font-family: charter;">
    <span style="background-color: #f2f2f2; font-family: Courier New;">
        pdfplumber
    </span> 
    demo
    </span>
    </h2>

<div style="font-family: charter; text-align: left; color:dark">
    Import
    <span style="background-color: #f2f2f2; font-family: Courier New;">
        pdfplumber
    </span>
    </div>

In [None]:
import pdfplumber
print(f'This library version is: {pdfplumber.__version__}')

<div style="font-family: charter; text-align: left; color:dark">
    Load the PDF
    </div>

In [None]:
pdf = pdfplumber.open(".\\ns-10-2013.pdf")

<div style="font-family: charter; text-align: left; color:dark">
    Get page 82 (index 81, because pages are zero-indexed)
    </div>

In [None]:
p_82 = pdf.pages[81]

In [None]:
# Convert the page to a higher resolution image (e.g., 300 DPI).
image = p_82.to_image(resolution=300)
image

<div id="2-1-1">
   <!-- Target cell content -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: charter;">2.1.1.</span>
    <span style = "color: dark; font-family: charter;">
    What data would we get if we used the default settings?
    </span>
    </h3>

<div style="font-family: charter; text-align: left; color:dark">
    We can check by using <span style="background-color: #f2f2f2; font-family: Courier New;">
        PageImage.debug_tablefinder()
    </span>:
    </div>

In [None]:
image.reset().debug_tablefinder()

<div style="font-family: charter; text-align: left; color:dark">
    The default settings correctly identify the table's vertical demarcations, but don't capture the horizontal demarcations between groups of rows. So:
    </div>

<div id="2-1-2">
   <!-- Target cell content -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: charter;">2.1.2.</span>
    <span style = "color: dark; font-family: charter;">
    Using custom <span style="background-color: #f2f2f2; font-family: Courier New;">
        <b>.extract_table
            </b>
    </span>'s settings
    </span>
    </h3>

<div style="font-family: charter; text-align: left; color:dark">
    <ul>
        <li>Because the columns are separated by lines, we use <span style="background-color: #f2f2f2; font-family: Courier New;">
        vertical_strategy="lines"
    </span>.
            </li>
        <li>Because the rows are, primarily, separated by gutters between the text, we use <span style="background-color: #f2f2f2; font-family: Courier New;">
        horizontal_strategy="text"
    </span>.
            </li>
        <li>To snap together a handful of the gutters at the top which aren't fully flush with one another, we use <span style="background-color: #f2f2f2; font-family: Courier New;">
        snap_y_tolerance
    </span>, which snaps horizontal lines within a certain distance to the same vertical alignment.
            </li>
        <li>And because the left and right-hand extremities of the text aren't quite flush with the vertical lines, we use <span style="background-color: #f2f2f2; font-family: Courier New;">
        "intersection_tolerance": 15
    </span>.
            </li>
        </ul>
    </div>

In [None]:
table_settings = {
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "text_keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
}

In [None]:
image.reset().debug_tablefinder(table_settings)
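
The bullet list above describes a text-based horizontal strategy and a larger intersection tolerance, while the `table_settings` cell uses `"lines"` for both strategies and a tolerance of 3 throughout. As a hedged sketch, the described values could be layered over the base settings as shown below. The `snap_y_tolerance` value of 5 is an assumption (the text gives no number); the two strategies and `"intersection_tolerance": 15` come from the list. Nothing here calls pdfplumber, so whether these values suit the BCRP tables would still need to be checked visually.

```python
# Base settings copied from the cell above (only the keys the bullets mention)
base_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_y_tolerance": 3,
    "intersection_tolerance": 3,
}

# Overrides described in the bullet list; the snap_y_tolerance value is an assumption
overrides = {
    "horizontal_strategy": "text",
    "snap_y_tolerance": 5,
    "intersection_tolerance": 15,
}

# Later keys win, so the overrides replace the matching base values
custom_settings = {**base_settings, **overrides}
print(custom_settings)
```

The resulting dictionary could then be passed to `image.reset().debug_tablefinder(custom_settings)` in the same way as `table_settings` above, to compare the detected cells.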

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="2-2">
   <!-- Target cell content -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">2.2.</span>
    <span style = "color: dark; font-family: charter;">
    Extracting tables and generating dataframes (includes data cleanup)
    </span>
    </h2>

<div style="font-family: charter; text-align: left; color:dark">
    We would like to get specific tables: information on GDP growth rates with annual, quarterly and monthly frequency. We don't need other tables also related to GDP that don't meet these requirements. Extraction will be easier if we use keywords.
    </div>

In [None]:
# Keywords to search in the page text
keywords = ["PRODUCTO BRUTO INTERNO", "SECTORES ECONÓMICOS", "PBI", "GDP", "Variaciones"]

<div style="font-family: charter; text-align: left; color:dark">
    The code iterates through each PDF and extracts the two required tables from each. The extracted information is then transformed into dataframes and the columns and values are cleaned up to conform to Python conventions (pythonic).
    </div>

# Functions

### Helpers

In [8]:
def remove_rare_characters_first_row(texto):
    texto = re.sub(r'\s*-\s*', '-', texto)  # Remove spaces around hyphens
    texto = re.sub(r'[^a-zA-Z0-9\s-]', '', texto)  # Remove unusual characters, keeping letters, digits, whitespace and hyphens
    return texto

def remove_rare_characters(texto):
    return re.sub(r'[^a-zA-Z\s]', '', texto)

def remove_tildes(texto):
    return ''.join((c for c in unicodedata.normalize('NFD', texto) if unicodedata.category(c) != 'Mn'))
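
A quick way to see what the three helpers do is to run them on sample strings. The block below repeats their definitions so that it runs on its own; the sample inputs are made up for illustration and are not taken from the PDFs.

```python
import re
import unicodedata

def remove_rare_characters_first_row(texto):
    texto = re.sub(r'\s*-\s*', '-', texto)  # remove spaces around hyphens
    texto = re.sub(r'[^a-zA-Z0-9\s-]', '', texto)  # keep letters, digits, whitespace, hyphens
    return texto

def remove_rare_characters(texto):
    return re.sub(r'[^a-zA-Z\s]', '', texto)  # keep unaccented letters and whitespace only

def remove_tildes(texto):
    # Decompose accented letters and drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', texto) if unicodedata.category(c) != 'Mn')

print(repr(remove_tildes("SECTORES ECONÓMICOS")))
print(repr(remove_rare_characters_first_row("Ene - Feb 2/")))
print(repr(remove_rare_characters("agropecuario 2/")))
print(repr(remove_rare_characters("económicos")))
```

The last call shows why order matters: `remove_rare_characters` drops accented letters entirely, so `remove_tildes` should run first.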

### Common

In [9]:
# 0.
def drop_nan_rows(df):
    df = df.dropna(how='all')
    return df

# 1. 
def drop_nan_columns(df):
    return df.dropna(axis=1, how='all')

# 2.
def swap_first_second_row(df):
    temp = df.iloc[0, 0]
    df.iloc[0, 0] = df.iloc[1, 0]
    df.iloc[1, 0] = temp

    temp = df.iloc[0, -1]
    df.iloc[0, -1] = df.iloc[1, -1]
    df.iloc[1, -1] = temp
    return df

# 8. 
def reset_index(df):
    df.reset_index(drop=True, inplace=True)
    return df

# 5.
def remove_digit_slash(df):
    # Apply the replacement to the first column and to the last two columns
    df.iloc[:, [0, -2, -1]] = df.iloc[:, [0, -2, -1]].apply(lambda x: x.str.replace(r'\d+/', '', regex=True))
    return df
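
The pattern `\d+/` used by `remove_digit_slash` targets footnote markers such as `2/`. A small sketch on plain strings shows its effect, including a caveat. The helper name `strip_footnote_markers` and the sample values are assumptions made for illustration.

```python
import re

def strip_footnote_markers(value):
    """Remove any run of digits followed by a slash, e.g. '2/' (hypothetical helper)."""
    return re.sub(r"\d+/", "", value)

print(repr(strip_footnote_markers("Agropecuario 2/")))
print(repr(strip_footnote_markers("PBI 1/ 3/")))
print(repr(strip_footnote_markers("10/2013")))  # caveat: date-like values are altered too
```

The pattern leaves the surrounding spaces in place, and it also matches digits before a slash inside dates or ratios, which is presumably why the function is applied only to the first and the last two columns.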

# 9. AUX (ROBUSTO)

def separate_text_digits(df):
    for index, row in df.iterrows():
        if any(char.isdigit() for char in str(row.iloc[-2])) and any(char.isalpha() for char in str(row.iloc[-2])):
            if pd.isnull(row.iloc[-1]):
                df.loc[index, df.columns[-1]] = ''.join(filter(lambda x: x.isalpha() or x == ' ', str(row.iloc[-2])))
                df.loc[index, df.columns[-2]] = ''.join(filter(lambda x: not (x.isalpha() or x == ' '), str(row.iloc[-2])))
            
            # Check if comma or dot is used as decimal separator
            if ',' in str(row.iloc[-2]):
                split_values = str(row.iloc[-2]).split(',')
            elif '.' in str(row.iloc[-2]):
                split_values = str(row.iloc[-2]).split('.')
            else:
                # If neither comma nor dot found, assume no decimal part
                split_values = [str(row.iloc[-2]), '']
                
            cleaned_integer = ''.join(filter(lambda x: x.isdigit() or x == '-', split_values[0]))
            cleaned_decimal = ''.join(filter(lambda x: x.isdigit(), split_values[1]))
            if cleaned_decimal:
                # Use comma as decimal separator
                cleaned_numeric = cleaned_integer + ',' + cleaned_decimal
            else:
                cleaned_numeric = cleaned_integer
            df.loc[index, df.columns[-2]] = cleaned_numeric
    return df
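
`separate_text_digits` handles cells where a number and a label ended up in the same column. The sketch below shows the idea on a single string. It is a simplified variant that uses a regular expression rather than the character filters above, and the name `split_number_and_text` and the sample cells are hypothetical.

```python
import re

def split_number_and_text(cell):
    """Split a mixed cell such as '-1,2 Total' into ('-1,2', 'Total') (hypothetical helper)."""
    # Letters and spaces form the label
    text = "".join(ch for ch in cell if ch.isalpha() or ch == " ").strip()
    # First signed number with an optional decimal part (comma or dot)
    match = re.search(r"-?\d+(?:[.,]\d+)?", cell)
    if match is None:
        return "", text
    number = match.group(0).replace(".", ",")  # comma as decimal separator, as in the function above
    return number, text

print(split_number_and_text("-1,2 Total"))
print(split_number_and_text("3.4GDP"))
```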


# 4. 
def extract_years(df):
    year_columns = [col for col in df.columns if re.match(r'\b\d{4}\b', col)]
    #print("Extracted years (4 digits):")
    #print(year_columns)
    return year_columns

# 6. 
def first_row_columns(df):
    df.columns = df.iloc[0]
    df = df.drop(df.index[0])
    return df

# 15.
def clean_columns_values(df):
    df.columns = df.columns.str.lower()
    # Only normalize string column names
    df.columns = [unicodedata.normalize('NFKD', col).encode('ASCII', 'ignore').decode('utf-8') if isinstance(col, str) else col for col in df.columns]
    df.columns = df.columns.str.replace(' ', '_').str.replace('ano', 'year').str.replace('-', '_')
    
    text_columns = df.select_dtypes(include='object').columns
    for col in df.columns:
        df.loc[:, col] = df[col].apply(lambda x: remove_tildes(x) if isinstance(x, str) else x)
        df.loc[:, col] = df[col].apply(lambda x: str(x).replace(',', '.') if isinstance(x, (int, float, str)) else x)
    df.loc[:, 'sectores_economicos'] = df['sectores_economicos'].str.lower()
    df.loc[:, 'economic_sectors'] = df['economic_sectors'].str.lower()
    df.loc[:, 'sectores_economicos'] = df['sectores_economicos'].apply(remove_rare_characters)
    df.loc[:, 'economic_sectors'] = df['economic_sectors'].apply(remove_rare_characters)
    return df

# 16.
def convertir_float(df):
    excluded_columns = ['sectores_economicos', 'economic_sectors']
    columns_to_convert = [col for col in df.columns if col not in excluded_columns]
    df[columns_to_convert] = df[columns_to_convert].apply(pd.to_numeric, errors='coerce')
    return df

# 15.
def relocate_last_column(df):
    last_column = df.pop(df.columns[-1])
    df.insert(1, last_column.name, last_column)
    return df

### Table 1 only

In [10]:
# ATIPIC LAST COLUMNS
def relocate_last_columns(df):
    if not pd.isna(df.iloc[1, -1]):
        # Create a new column with NaN
        new_column = 'col_' + ''.join(map(str, np.random.randint(1, 5, size=1)))
        df[new_column] = np.nan
        
        # Get 'ECONOMIC SECTORS' and relocate
        insert_value_1 = df.iloc[0, -2]
        # Convert the value to string before assignment
        insert_value_1 = str(insert_value_1)
        # Ensure the dtype of the last column is object (string) to accommodate string values
        df.iloc[:, -1] = df.iloc[:, -1].astype('object')
        df.iloc[0, -1] = insert_value_1
        
        # NaN first obs
        df.iloc[0,-2] = np.nan
    return df

# Extract months

def get_months_sublist_list(df, year_columns):
    first_row = df.iloc[0]
    # Initialize the list of sublists
    months_sublist_list = []

    # Initialize the current sublist
    months_sublist = []

    # Iterate over the elements of the first row
    for item in first_row:
        # Check if the item meets the requirements
        if len(str(item)) == 3:
            months_sublist.append(item)
        elif '-' in str(item) or str(item) == 'year':
            months_sublist.append(item)
            months_sublist_list.append(months_sublist)
            months_sublist = []

    # Add the last sublist if it's not empty
    if months_sublist:
        months_sublist_list.append(months_sublist)

    new_elements = []

    # Check if year_columns is not empty
    if year_columns:
        for i, year in enumerate(year_columns):
            # Check if index i is valid for months_sublist_list
            if i < len(months_sublist_list):
                for element in months_sublist_list[i]:
                    new_elements.append(f"{year}_{element}")
                    
    two_first_elements = df.iloc[0][:2].tolist()

    # Ensure that the two_first_elements are added if they are not in new_elements
    for index in range(len(two_first_elements) - 1, -1, -1):
        if two_first_elements[index] not in new_elements:
            new_elements.insert(0, two_first_elements[index])

    # Ensure that the length of new_elements matches the number of columns in df
    while len(new_elements) < len(df.columns):
        new_elements.append(None)

    temp_df = pd.DataFrame([new_elements], columns=df.columns)
    df.iloc[0] = temp_df.iloc[0]

    return df


def find_year_column(df):
    # List to store the found years
    found_years = []

    # Iterating over the column names of the DataFrame
    for column in df.columns:
        # Checking if the column name is a year (4 digits)
        if column.isdigit() and len(column) == 4:
            found_years.append(column)

    # If more than one year is found, do nothing
    if len(found_years) > 1:
        pass
    # If exactly one year is found, implement additional code
    elif len(found_years) == 1:
        # Getting the name of the found year
        year_name = found_years[0]
        print("The name of the column representing the year is:", year_name)

        # Getting the first row of the DataFrame
        first_row = df.iloc[0]

        # Searching the first row for cells containing the word "year"
        column_contains_year = first_row[first_row.astype(str).str.contains(r'\byear\b')]

        if not column_contains_year.empty:
            # Getting the name of the first column whose first-row value contains 'year'
            column_contains_year_name = column_contains_year.index[0]
            print("The name of the first column containing 'year' in the first row is:", column_contains_year_name)

            # Getting the indices of the columns
            column_contains_year_index = df.columns.get_loc(column_contains_year_name)
            year_name_index = df.columns.get_loc(year_name)
            print("The index of the column containing 'year' is:", column_contains_year_index)
            print("The index of the column representing the year is:", year_name_index)

            # Checking if the column representing the year is to the right or to the left of column_contains_year
            if column_contains_year_index < year_name_index:
                print("The year column is to the right of the column containing 'year'.")
                # Subtracting one from the year
                new_year = str(int(year_name) - 1)
                # Renaming the column containing 'year' with the new year
                df.rename(columns={column_contains_year_name: new_year}, inplace=True)
                print(f"The column containing 'year' is now named '{new_year}'.")
            elif column_contains_year_index > year_name_index:
                print("The year column is to the left of the column containing 'year'.")
                # Adding one to the year
                new_year = str(int(year_name) + 1)
                # Renaming the column containing 'year' with the new year
                df.rename(columns={column_contains_year_name: new_year}, inplace=True)
                print(f"The column containing 'year' is now named '{new_year}'.")
            else:
                print("The year column is in the same position as the column containing 'year'.")
        else:
            print("No columns containing 'year' were found in the first row.")
    # If no year is found, print a message
    else:
        print("No years were found in the column names.")
    
    return df


#

def clean_first_row(df):
    for col in df.columns:
        if df[col].dtype == 'object':
            if isinstance(df.at[0, col], str):
                df.at[0, col] = df.at[0, col].lower()  # Lowercase the header cell (already checked to be a string)
                df.at[0, col] = remove_tildes(df.at[0, col])
                df.at[0, col] = remove_rare_characters_first_row(df.at[0, col])
                # Replace 'ano' with 'year'
                df.at[0, col] = df.at[0, col].replace('ano', 'year')

    return df


def intercambiar_valores(df):
    # Check that the DataFrame has at least two columns
    if len(df.columns) < 2:
        print("The DataFrame has fewer than two columns. Values cannot be swapped.")
        return df

    # Check whether the last column contains NaN values
    if df.iloc[:, -1].isnull().any():
        # Get the index labels of the rows with NaN in the last column
        last_column_rows_nan = df[df.iloc[:, -1].isnull()].index

        # Iterate over the rows with NaN in the last column
        # (idx is used positionally below, so a default RangeIndex is assumed)
        for idx in last_column_rows_nan:
            # Check that the second-to-last position is within the column range
            if -2 >= -len(df.columns):
                # Swap the values of the last and second-to-last columns
                df.iloc[idx, -1], df.iloc[idx, -2] = df.iloc[idx, -2], df.iloc[idx, -1]
            else:
                print(f"Index out of range for row {idx}. Values cannot be swapped.")

    return df



def replace_var_perc_first_column(df):
    # Regular expression to search for "Var. %" or "Var.%"
    regex = re.compile(r'Var\. ?%')

    # Iterate over the rows of the dataframe
    for index, row in df.iterrows():
        # Convert the value in the first column to a string
        value = str(row.iloc[0])

        # Check if the value matches the regular expression
        if regex.search(value):
            # Replace only the characters that match the regular expression
            df.at[index, df.columns[0]] = regex.sub("variacion porcentual", value)
    
    return df


# 8.
number_moving_average = 'three' # Keep a space at the end

def replace_number_moving_average(df):
    for index, row in df.iterrows():
        # Search for the pattern in the last column (the second-to-last check is commented out below)
        if pd.notnull(row.iloc[-1]) and re.search(r'(\d\s*-)', str(row.iloc[-1])):
            df.at[index, df.columns[-1]] = re.sub(r'(\d\s*-)', f'{number_moving_average}-', str(row.iloc[-1]))
        #elif pd.notnull(row.iloc[-2]) and re.search(r'(\d\s*-)', str(row.iloc[-2])):
        #   df.at[index, df.columns[-2]] = re.sub(r'(\d\s*-)', f'{number_moving_average}-', str(row.iloc[-2]))
    return df


# 7.
def replace_var_perc_last_columns(df):
    # Regular expression to search for "Var. %" or "Var.%"
    regex = re.compile(r'(Var\. ?%)(.*)')

    # Iterate over the rows of the dataframe
    for index, row in df.iterrows():
        # Check whether the value in the second-to-last column is a string that matches
        if isinstance(row.iloc[-2], str) and regex.search(row.iloc[-2]):
            # Move the label to the end of the second-to-last column's value
            replaced_text = regex.sub(r'\2 percent change', row.iloc[-2])
            df.at[index, df.columns[-2]] = replaced_text.strip()
        
        # Check whether the value in the last column is a string that matches
        if isinstance(row.iloc[-1], str) and regex.search(row.iloc[-1]):
            # Move the label to the end of the last column's value
            replaced_text = regex.sub(r'\2 percent change', row.iloc[-1])
            df.at[index, df.columns[-1]] = replaced_text.strip()
    
    return df

# Function to search and replace in the second row of the DataFrame
def replace_first_dot(df):
    second_row = df.iloc[1]  # Second row of the DataFrame
    
    # Check whether at least one cell matches the pattern
    if any(isinstance(cell, str) and re.match(r'^\w+\.\s?\w+', cell) for cell in second_row):
        for col in df.columns:
            if isinstance(second_row[col], str):  # Check that the value is a string
                if re.match(r'^\w+\.\s?\w+', second_row[col]):  # Check that it matches the pattern Xxx.Xxx or Xxx. Xxx.
                    df.at[1, col] = re.sub(r'(\w+)\.(\s?\w+)', r'\1-\2', second_row[col], count=1)  # Replace only the first dot
    return df

def drop_rare_caracter_row(df):
    # Look for a lone "}" character in each row and get one boolean per row
    rare_caracter_row = df.apply(lambda row: '}' in row.values, axis=1)
    
    # Filter the DataFrame to drop the rows containing the lone "}" character
    df = df[~rare_caracter_row]
    
    return df

def split_column_by_pattern(df):
    # Iterate over the columns of the dataframe
    for col in df.columns:
        # Check whether the second row of the column matches the pattern
        if re.match(r'^[A-Z][a-z]+\.\s[A-Z][a-z]+\.$', str(df.iloc[1][col])):
            # Split the column on whitespace
            split_values = df[col].str.split(expand=True)
            # Keep the first values in the original column
            df[col] = split_values[0]
            # Store the second values in a new column with the "_split" suffix
            new_col_name = col + '_split'
            df.insert(df.columns.get_loc(col) + 1, new_col_name, split_values[1])
    return df
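
The grouping rule used by `get_months_sublist_list` above can be illustrated on a plain list. This is a minimal sketch rather than the notebook's own function: `label_months` and the toy `header` are hypothetical names, and the sketch only reproduces how month tokens are grouped by year and prefixed, not how the result is written back into the DataFrame.

```python
def label_months(first_row, years):
    """Group month tokens by year and prefix each one with its year.

    Hypothetical helper for illustration only.
    """
    groups, current = [], []
    for item in first_row:
        token = str(item)
        if len(token) == 3:                     # three-letter month, e.g. 'nov'
            current.append(token)
        elif '-' in token or token == 'year':   # a range or an annual total closes the group
            current.append(token)
            groups.append(current)
            current = []
    if current:                                 # months of a year that has no closing total yet
        groups.append(current)
    # Pair each group with its year; extra years or groups are ignored
    return [f"{year}_{token}" for year, group in zip(years, groups) for token in group]


header = ['sectores', 'sectors', 'nov', 'dic', 'year', 'ene', 'feb']
print(label_months(header, ['2013', '2014']))
# → ['2013_nov', '2013_dic', '2013_year', '2014_ene', '2014_feb']
```

Because the rule relies on token length, any three-character cell is treated as a month, so the first row should be cleaned before this step.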

$\Large{\color{blue}{ns\_2014\_07}}$

se: economic sectors (sectores económicos)

In [11]:
def swap_nan_se(df):
    # Check if the first observation of the first column is NaN
    if pd.isna(df.iloc[0, 0]) and df.iloc[0, 1] == "SECTORES ECONÓMICOS":
        # Create a temporary copy of the values
        column_1_value = df.iloc[0, 1]
        # Swap values in the original row
        df.iloc[0, 0] = column_1_value
        df.iloc[0, 1] = np.nan
    # Drop the second column (this runs whether or not the swap above happened)
    df = df.drop(df.columns[1], axis=1)
    return df

$\Large{\color{blue}{ns\_2014\_08}}$

In [12]:
def replace_first_row_with_columns(df):
    # Check if the first row contains at least one year
    if any(isinstance(element, str) and element.isdigit() and len(element) == 4 for element in df.iloc[0]):
        # Replace NaN values in the first row with positional placeholder names (column_1, column_2, ...)
        for col_index, value in enumerate(df.iloc[0]):
            if pd.isna(value):
                df.iloc[0, col_index] = f"column_{col_index + 1}"
        # Replace column names with the values of the first row
        df.columns = df.iloc[0]
        # Drop the first row after setting it as column names
        df = df.drop(df.index[0])
    return df
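
The header promotion performed by `replace_first_row_with_columns` can be tried on a toy frame. The sketch below is an illustration under assumptions: the `raw` DataFrame and its values are invented, and unlike the function above it also resets the index after dropping the first row.

```python
import numpy as np
import pandas as pd

# Toy table whose real header sits in the first row (values are invented)
raw = pd.DataFrame([[np.nan, '2013', '2014'],
                    ['Agropecuario', '1,5', '2,0']])

# Fill missing header cells with positional placeholders such as 'column_1'
header = [f"column_{i + 1}" if pd.isna(value) else value
          for i, value in enumerate(raw.iloc[0])]

# Drop the header row from the data and use it as the column names
promoted = raw.iloc[1:].copy()
promoted.columns = header
promoted = promoted.reset_index(drop=True)

print(promoted.columns.tolist())
```

Placeholder names keep the column labels unique and non-null, which later steps that look up columns by name depend on.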

In [13]:
def expand_column(df):
    columna_a_expandir = df.columns[-2]
    
    def reemplazar_guiones(match_obj):
        return match_obj.group(1) + ' ' + match_obj.group(2)    

    if df[columna_a_expandir].str.contains(r'\d').any() and df[columna_a_expandir].str.contains(r'[a-zA-Z]').any():
        df[columna_a_expandir] = df[columna_a_expandir].apply(lambda x: re.sub(r'([a-zA-Z]+)\s*-\s*([a-zA-Z]+)', reemplazar_guiones, str(x)) if pd.notnull(x) else x)

        
        # Regular expression to extract trailing words
        pattern = re.compile(r'[a-zA-Z\s]+$')

        # Function that applies the extract-and-replace logic to each row
        def extract_replace(row):
            if pd.notnull(row[columna_a_expandir]) and isinstance(row[columna_a_expandir], str):  # Check that the value is not NaN and is a string
                if row.name != 0:  # Start from the second row
                    value_to_replace = pattern.search(row[columna_a_expandir])
                    if value_to_replace:
                        value_to_replace = value_to_replace.group().strip()
                        row[df.columns[-1]] = value_to_replace
                        row[columna_a_expandir] = re.sub(pattern, '', row[columna_a_expandir]).strip()
            return row

        # Apply the function to each row of the DataFrame
        df = df.apply(extract_replace, axis=1)

    return df

In [14]:
def split_values_1(df):
    columna_a_expandir = df.columns[-2]
    nuevas_columnas = df[columna_a_expandir].str.split(expand=True)
    nuevas_columnas.columns = [f'{columna_a_expandir}_{i+1}' for i in range(nuevas_columnas.shape[1])]
    posicion_insercion = len(df.columns) - 1
    for col in reversed(nuevas_columnas.columns):
        df.insert(posicion_insercion, col, nuevas_columnas[col])
    df.drop(columns=[columna_a_expandir], inplace=True)
    return df
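
The `split_values_*` helpers all rely on `Series.str.split(expand=True)` followed by `DataFrame.insert`. The following sketch shows that mechanism on invented data; the column names `sector`, `packed`, and `last` are assumptions made for the example, and the target column is picked by name rather than by position.

```python
import pandas as pd

# Toy frame where one column holds two numbers packed into a single string
df = pd.DataFrame({'sector': ['PBI', 'Pesca'],
                   'packed': ['1,2 3,4', '5,6 7,8'],
                   'last': ['a', 'b']})

target = 'packed'
# Split on whitespace; expand=True returns one column per piece
parts = df[target].str.split(expand=True)
parts.columns = [f"{target}_{i + 1}" for i in range(parts.shape[1])]

# Put the pieces where the packed column used to be
position = df.columns.get_loc(target)
df = df.drop(columns=[target])
for offset, col in enumerate(parts.columns):
    df.insert(position + offset, col, parts[col])

print(df.columns.tolist())
# → ['sector', 'packed_1', 'packed_2', 'last']
```

Rows with fewer pieces than the widest row receive missing values in the extra columns, so the number of new columns depends on the data.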

$\Large{\color{blue}{ns\_2015\_11}}$

In [15]:
def revisar_primera_fila(df):
    primera_fila = df.iloc[0]  # Get the first row of the DataFrame
    
    for i, (col, valor) in enumerate(primera_fila.items()):  # Iterate over the labels and values of the first row
        # Check whether a value in the first row holds two years together
        if re.search(r'\b\d{4}\s\d{4}\b', str(valor)):
            # If so, extract the two years
            anios = valor.split()
            primer_anio = anios[0]
            segundo_anio = anios[1]
            
            # Placeholder name for the original column
            nombre_columna_original = f'col_{i}'
            df.at[0, col] = nombre_columna_original
            
            # Set the first column's value to the first year if it is NaN
            if pd.isna(df.iloc[0, 0]):
                df.iloc[0, 0] = primer_anio
            
            # Set the second column's value to the second year if it is NaN
            if pd.isna(df.iloc[0, 1]):
                df.iloc[0, 1] = segundo_anio
    
    return df

In [16]:
def replace_nan_with_previous_column(df):
    columns = df.columns
    
    for i in range(len(columns) - 1):
        if columns[i].endswith('_year'):
            nan_indices = df[columns[i+1]].isnull()
            df.loc[nan_indices, [columns[i], columns[i+1]]] = df.loc[nan_indices, [columns[i+1], columns[i]]].values
    
    return df

$\Large{\color{blue}{ns\_2016\_15}}$

In [17]:
def revisar_primera_fila_1(df):
    # Check whether the value in the first row and first column is NaN
    if pd.isnull(df.iloc[0, 0]):
        # Check whether the value in the second-to-last column and first row is a year (4 digits)
        penultimate_column = df.iloc[0, -2]
        if isinstance(penultimate_column, str) and len(penultimate_column) == 4 and penultimate_column.isdigit():
            # Move the year to the first column
            df.iloc[0, 0] = penultimate_column
            df.iloc[0, -2] = np.nan
    
    # Check whether the value in the second column and first row is NaN
    if pd.isnull(df.iloc[0, 1]):
        # Check whether the value in the last column and first row is a year (4 digits)
        last_column = df.iloc[0, -1]
        if isinstance(last_column, str) and len(last_column) == 4 and last_column.isdigit():
            # Move the year to the second column
            df.iloc[0, 1] = last_column
            df.iloc[0, -1] = np.nan
    
    return df

In [18]:
def split_values_2(df):
    columna_a_expandir = df.columns[-4]
    nuevas_columnas = df[columna_a_expandir].str.split(expand=True)
    nuevas_columnas.columns = [f'{columna_a_expandir}_{i+1}' for i in range(nuevas_columnas.shape[1])]
    posicion_insercion = len(df.columns) - 3
    for col in reversed(nuevas_columnas.columns):
        df.insert(posicion_insercion, col, nuevas_columnas[col])
    df.drop(columns=[columna_a_expandir], inplace=True)
    return df

### Table 2 only

In [19]:
# 2.
def separate_years(df):
    df = df.copy()  # Work on a copy of the DataFrame to avoid SettingWithCopyWarning
    if isinstance(df.iloc[0, -2], str) and len(df.iloc[0, -2].split()) == 2:
        years = df.iloc[0, -2].split()
        if all(len(year) == 4 for year in years):
            segundo_anio = years[1]
            df.iloc[0, -2] = years[0]
            df.insert(len(df.columns) - 1, 'new_column', [segundo_anio] + [None] * (len(df) - 1))
    return df

# 3.
def find_roman_numerals(text):
    pattern = r'\b(?:I{1,3}|IV|V|VI{0,3}|IX|X)\b'
    matches = re.findall(pattern, text)
    return matches

def relocate_roman_numerals(df):
    numeros_romanos = find_roman_numerals(df.iloc[2, -1])
    if numeros_romanos:
        original_text = df.iloc[2, -1]
        for roman_numeral in numeros_romanos:
            original_text = original_text.replace(roman_numeral, '').strip()
        df.iloc[2, -1] = original_text
        df.at[2, 'new_column'] = ', '.join(numeros_romanos)
        df.iloc[2, -1] = np.nan
    return df

# 4.
def extract_mixed_values(df):
    df = df.copy()  # Work on a copy of the DataFrame to avoid SettingWithCopyWarning
    regex_pattern = r'(-?\d+,\d [a-zA-Z\s]+)'
    for index, row in df.iterrows():
        antepenultima_obs = row.iloc[-3]
        penultima_obs = row.iloc[-2]

        if isinstance(antepenultima_obs, str) and pd.notnull(antepenultima_obs):
            match = re.search(regex_pattern, antepenultima_obs)
            if match:
                parte_extraida = match.group(0)
                if pd.isna(penultima_obs) or pd.isnull(penultima_obs):
                    df.iloc[index, -2] = parte_extraida
                    antepenultima_obs = re.sub(regex_pattern, '', antepenultima_obs).strip()
                    df.iloc[index, -3] = antepenultima_obs
    return df

# 5.
def replace_first_row_nan(df):
    for col in df.columns:
        if pd.isna(df.iloc[0][col]):
            df.iloc[0, df.columns.get_loc(col)] = col
    return df

# 11. 
def split_values(df):
    columna_a_expandir = df.columns[-3]
    nuevas_columnas = df[columna_a_expandir].str.split(expand=True)
    nuevas_columnas.columns = [f'{columna_a_expandir}_{i+1}' for i in range(nuevas_columnas.shape[1])]
    posicion_insercion = len(df.columns) - 2
    for col in reversed(nuevas_columnas.columns):
        df.insert(posicion_insercion, col, nuevas_columnas[col])
    df.drop(columns=[columna_a_expandir], inplace=True)
    return df


# 13.
def roman_arabic(df):
    primera_fila = df.iloc[0]
    def convert_roman_number(numero):
        try:
            return str(roman.fromRoman(numero))
        except roman.InvalidRomanNumeralError:
            return numero

    primera_fila_convertida = []
    for valor in primera_fila:
        if isinstance(valor, str) and not pd.isna(valor):
            primera_fila_convertida.append(convert_roman_number(valor))
        else:
            primera_fila_convertida.append(valor)

    df.iloc[0] = primera_fila_convertida
    return df

# 14.
def fix_duplicates(df):
    fila_segunda = df.iloc[0].copy()
    prev_num = None
    first_one_index = None

    for i, num in enumerate(fila_segunda):
        try:
            num = int(num)
            prev_num = int(prev_num) if prev_num is not None else None

            if num == prev_num:
                if num == 1:
                    if first_one_index is None:
                        first_one_index = i - 1
                    next_num = int(fila_segunda.iloc[i - 1]) + 1  # positional access, consistent with the .iloc calls below
                    for j in range(i, len(fila_segunda)):
                        if fila_segunda.iloc[j].isdigit():
                            fila_segunda.iloc[j] = str(next_num)
                            next_num += 1
                elif i - 1 >= 0:
                    fila_segunda.iloc[i] = str(int(fila_segunda.iloc[i - 1]) + 1)

            prev_num = num
        except ValueError:
            pass

    df.iloc[0] = fila_segunda
    return df
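
The pattern in `find_roman_numerals` and the conversion in `roman_arabic` can be checked together on a short string. This sketch is an assumption-laden stand-in: it replaces the `roman` package with a small lookup table limited to the four quarters, and `quarters_to_arabic` is a hypothetical helper name.

```python
import re

# Same pattern as find_roman_numerals; \b keeps it from matching inside longer words
ROMAN_PATTERN = re.compile(r'\b(?:I{1,3}|IV|V|VI{0,3}|IX|X)\b')

# Lookup table used here instead of the roman package (quarters only)
ROMAN_TO_ARABIC = {'I': '1', 'II': '2', 'III': '3', 'IV': '4'}


def quarters_to_arabic(text):
    """Return the Roman numerals found in text, converted where the table knows them."""
    found = ROMAN_PATTERN.findall(text)
    return [ROMAN_TO_ARABIC.get(numeral, numeral) for numeral in found]


print(quarters_to_arabic('I II III IV Año'))
# → ['1', '2', '3', '4']
```

Numerals outside the table, such as `V`, are returned unchanged, which mirrors how `roman_arabic` leaves values it cannot convert as they are.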

## More

In [20]:
def get_quarters_sublist_list(df, year_columns):
    first_row = df.iloc[0]
    # Initialize the list of sublists
    quarters_sublist_list = []

    # Initialize the current sublist
    quarters_sublist = []

    # Iterate over the elements of the first row
    for item in first_row:
        # Check if the item meets the requirements
        if len(str(item)) == 1:
            quarters_sublist.append(item)
        elif str(item) == 'year':
            quarters_sublist.append(item)
            quarters_sublist_list.append(quarters_sublist)
            quarters_sublist = []

    # Add the last sublist if it's not empty
    if quarters_sublist:
        quarters_sublist_list.append(quarters_sublist)

    new_elements = []

    # Check if year_columns is not empty
    if year_columns:
        for i, year in enumerate(year_columns):
            # Check if index i is valid for quarters_sublist_list
            if i < len(quarters_sublist_list):
                for element in quarters_sublist_list[i]:
                    new_elements.append(f"{year}_{element}")

    two_first_elements = df.iloc[0][:2].tolist()

    # Ensure that the two_first_elements are added if they are not in new_elements
    for index in range(len(two_first_elements) - 1, -1, -1):
        if two_first_elements[index] not in new_elements:
            new_elements.insert(0, two_first_elements[index])

    # Ensure that the length of new_elements matches the number of columns in df
    while len(new_elements) < len(df.columns):
        new_elements.append(None)

    temp_df = pd.DataFrame([new_elements], columns=df.columns)
    df.iloc[0] = temp_df.iloc[0]

    return df


$\Large{\color{blue}{ns\_2016\_20}}$

In [21]:
def drop_nan_row(df):
    if df.iloc[0].isnull().all():
        df = df.drop(index=0)
        df.reset_index(drop=True, inplace=True)
    return df

$\Large{\color{blue}{ns\_2019\_17}}$

In [22]:
def last_column_es(df): # similar to relocate_last_columns
    # Check if the DataFrame has at least two columns and the last column is a 4-digit year
    if len(df.columns) >= 2 and df.columns[-1].isdigit() and len(df[df.columns[-1]].iloc[:2]) >= 2:
        # Check if the first observation of the last column is 'ECONOMIC SECTORS'
        if df[df.columns[-1]].iloc[0] == 'ECONOMIC SECTORS':
            # Check if the second observation of the last column is not empty
            if pd.notnull(df[df.columns[-1]].iloc[1]):
                # Create a new column with NaN values
                new_column_name = f"col_{len(df.columns)}"
                df[new_column_name] = np.nan
                
                # Get 'ECONOMIC SECTORS' and relocate
                insert_value_1 = df.iloc[0, -2]
                # Convert the value to string before assignment
                insert_value_1 = str(insert_value_1)
                # Ensure the dtype of the last column is object (string) to accommodate string values
                df.iloc[:, -1] = df.iloc[:, -1].astype('object')
                df.iloc[0, -1] = insert_value_1

                # NaN first obs
                df.iloc[0,-2] = np.nan
    return df

$\Large{\color{blue}{ns\_2019\_26}}$

In [23]:
def intercambiar_columnas(df):
    # Look for a year-named column whose values are all NaN
    columna_nan = None
    for columna in df.columns:
        if df[columna].isnull().all() and len(columna) == 4 and columna.isdigit():
            columna_nan = columna
            break
    
    if columna_nan:
        # Inspect the column to its left
        indice_columna = df.columns.get_loc(columna_nan)
        if indice_columna > 0:
            columna_izquierda = df.columns[indice_columna - 1]
            # Check that it is not a year (not 4 digits)
            if not (len(columna_izquierda) == 4 and columna_izquierda.isdigit()):
                # Swap the column names
                df.rename(columns={columna_nan: columna_izquierda, columna_izquierda: columna_nan}, inplace=True)
    
    return df
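
The name swap in `intercambiar_columnas` works because `DataFrame.rename` applies the whole mapping at once, so two labels can trade places in a single call. A minimal sketch on invented data follows; the column names `col_3` and `2019` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy frame: the year column is empty and its data sits under a placeholder name
df = pd.DataFrame({'col_3': ['1,0', '2,0'],
                   '2019': [np.nan, np.nan]})

empty_year, left = '2019', 'col_3'
# Both labels are remapped in one pass, so neither overwrites the other
swapped = df.rename(columns={empty_year: left, left: empty_year})

print(swapped.columns.tolist())
# → ['2019', 'col_3']
```

Only the labels move. The data stays in its original position, so the values end up under the year name without being copied.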

# Table 1

# With a record of processed folders 1

In [None]:
import tabula
import pdfplumber
import pandas as pd
import numpy as np
import os
import re
from datetime import datetime
import locale
from tkinter import Tk, messagebox, TOP, YES, NO

# Set the locale to Spanish
locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')

# Keywords to search for in the page text
keywords = ["ECONOMIC SECTORS"]

# Dictionary to store the generated DataFrames
dataframes_dict_1 = {}

# Path of the record file of processed folders
registro_path = 'dataframes_record/carpetas_procesadas_1.txt'

# Function to correct month names
def corregir_nombre_mes(mes):
    meses_mapping = {
        'setiembre': 'septiembre',
        # Add more mappings here if other month names need correcting
    }
    return meses_mapping.get(mes, mes)

def registrar_carpeta_procesada(carpeta, num_archivos_procesados):
    with open(registro_path, 'a') as file:
        file.write(f"{carpeta}:{num_archivos_procesados}\n")

def carpeta_procesada(carpeta):
    if not os.path.exists(registro_path):
        return False
    with open(registro_path, 'r') as file:
        for line in file:
            # Compare the full folder name so that one folder is not mistaken for a prefix of another
            if line.rsplit(':', 1)[0] == carpeta:
                return True
    return False

def procesar_pdf(pdf_path):
    tables_dict_1 = {}  # Local dictionary for each PDF
    table_counter = 1
    keyword_count = 0 

    filename = os.path.basename(pdf_path)
    id_ns_year_matches = re.findall(r'ns-(\d+)-(\d{4})', filename)
    if id_ns_year_matches:
        id_ns, year = id_ns_year_matches[0]
    else:
        print("No matches for id_ns and year were found in the file name:", filename)
        return None, None, None, None, None  # Return None for tables_dict_1 as well

    new_filename = os.path.splitext(os.path.basename(pdf_path))[0].replace('-', '_')
    date = None

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, 1):
            text = page.extract_text() or ''  # Guard against pages with no extractable text
            if i == 1:
                match = re.search(r'(\d{1,2}\s+de\s+\w+\s+de\s+\d{4})', text, re.IGNORECASE)
                if match:
                    fecha_str = match.group(0)
                    # Correct the month name
                    partes_fecha = fecha_str.split()
                    partes_fecha[2] = corregir_nombre_mes(partes_fecha[2].lower())
                    fecha_str_corregida = ' '.join(partes_fecha)
                    date = datetime.strptime(fecha_str_corregida, '%d de %B de %Y')

            if all(keyword in text for keyword in keywords):
                keyword_count += 1
                if keyword_count == 1:  # Only process the first occurrence
                    tables = tabula.read_pdf(pdf_path, pages=i, multiple_tables=False, stream=True) # change stream to another option if desired
                    for j, table_df in enumerate(tables, start=1):
                        nombre_dataframe = f"{new_filename}_{keyword_count}"
                        tables_dict_1[nombre_dataframe] = table_df
                        table_counter += 1

                    break  # Exit the loop after the first occurrence is found

    return id_ns, year, date, tables_dict_1, keyword_count  # Return tables_dict_1 here


def procesar_carpeta(carpeta):
    print(f"Processing folder {os.path.basename(carpeta)}")
    pdf_files = [os.path.join(carpeta, f) for f in os.listdir(carpeta) if f.endswith('.pdf')]

    num_pdfs_procesados = 0
    num_dataframes_generados = 0

    table_counter = 1  # Initialize the table counter here
    tables_dict_1 = {}  # Declare tables_dict_1 outside the main loop
    
    for pdf_file in pdf_files:
        id_ns, year, date, tables_dict_temp, keyword_count = procesar_pdf(pdf_file)

        if tables_dict_temp:
            for nombre_df, df in tables_dict_temp.items():
                nombre_archivo = os.path.splitext(os.path.basename(pdf_file))[0].replace('-', '_')
                nombre_df = f"{nombre_archivo}_{keyword_count}"
                
                # Store the raw DataFrame in tables_dict_1
                tables_dict_1[nombre_df] = df.copy()
                
                # Apply the sequence of cleaning functions to a copy of the DataFrame
                df_clean = df.copy()

                if any(col.isdigit() and len(col) == 4 for col in df_clean.columns):
                    # If at least one column represents a year
                    df_clean = swap_nan_se(df_clean)
                    df_clean = split_column_by_pattern(df_clean)
                    df_clean = drop_rare_caracter_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = relocate_last_columns(df_clean)
                    df_clean = replace_first_dot(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = replace_var_perc_first_column(df_clean)
                    df_clean = replace_var_perc_last_columns(df_clean)
                    df_clean = replace_number_moving_average(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = intercambiar_valores(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = find_year_column(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = get_months_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = convertir_float(df_clean)
                else: # 2014 ns 08
                    # If no columns represent years
                    df_clean = revisar_primera_fila(df_clean)
                    df_clean = revisar_primera_fila_1(df_clean)
                    df_clean = replace_first_row_with_columns(df_clean)
                    df_clean = swap_nan_se(df_clean)
                    df_clean = split_column_by_pattern(df_clean)
                    df_clean = drop_rare_caracter_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = relocate_last_columns(df_clean)
                    #df_clean = replace_first_dot(df_clean) # comment for 2014 ns 08
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = replace_var_perc_first_column(df_clean)
                    df_clean = replace_var_perc_last_columns(df_clean)
                    df_clean = replace_number_moving_average(df_clean)
                    df_clean = expand_column(df_clean) # 2014 ns 08
                    df_clean = split_values_1(df_clean) # 2014 ns 08
                    df_clean = split_values_2(df_clean) # 2016 ns 15
                    df_clean = separate_text_digits(df_clean)
                    df_clean = intercambiar_valores(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = find_year_column(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = get_months_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = convertir_float(df_clean)
                    df_clean = replace_nan_with_previous_column(df_clean)
                
                # Add the 'date', 'id_ns', and 'year' columns to the clean DataFrame
                df_clean.insert(0, 'date', date)
                df_clean.insert(1, 'id_ns', id_ns)
                df_clean.insert(2, 'year', year)
                
                # Store the clean DataFrame in dataframes_dict_1
                dataframes_dict_1[nombre_df] = df_clean

                print(f'  {table_counter}. El dataframe generado para el archivo {pdf_file} es: {nombre_df}')
                num_dataframes_generados += 1
                table_counter += 1  # Increment the table counter here
        
        num_pdfs_procesados += 1  # Count every PDF in the folder as processed

    return num_pdfs_procesados, num_dataframes_generados, tables_dict_1

def procesar_carpetas():
    pdf_folder = 'pseudo_raw_pdf'
    carpetas = [os.path.join(pdf_folder, d) for d in os.listdir(pdf_folder) if os.path.isdir(os.path.join(pdf_folder, d))]
    
    tables_dict_1 = {}  # Initialize tables_dict_1 here
    
    for carpeta in carpetas:
        if carpeta_procesada(carpeta):
            print(f"La carpeta {carpeta} ya ha sido procesada.")
            continue
        
        num_pdfs_procesados, num_dataframes_generados, tables_dict_temp = procesar_carpeta(carpeta)
        
        # Update tables_dict_1 with the values returned by procesar_carpeta()
        tables_dict_1.update(tables_dict_temp)
        
        registrar_carpeta_procesada(carpeta, num_pdfs_procesados)

        # Ask the user whether to continue with the next folder
        root = Tk()
        root.withdraw()
        root.attributes('-topmost', True)  # Keep the dialog window in the foreground
        
        mensaje = f"Se han generado {num_dataframes_generados} dataframes en la carpeta {carpeta}. ¿Deseas continuar con la siguiente carpeta?"
        continuar = messagebox.askyesno("Continuar", mensaje)
        root.destroy()

        if not continuar:
            print("Procesamiento detenido por el usuario.")
            break  # Leave the for loop if the user chooses not to continue

    print("Procesamiento completado para todas las carpetas.")  # Report completion

    return tables_dict_1  # Return tables_dict_1 at the end of the function

if __name__ == "__main__":
    tables_dict_1 = procesar_carpetas()  # Capture the value returned by procesar_carpetas()

In [None]:
tables_dict_1.keys()

In [None]:
dataframes_dict_1.keys()

In [None]:
tables_dict_1['ns_08_2014_1'].head(5)

In [None]:
dataframes_dict_1['ns_08_2014_1']

# Table 2

# With a record of processed folders 2

In [24]:
import tabula
import pdfplumber
import pandas as pd
import numpy as np
import roman
import os
import re
from datetime import datetime
import locale
from tkinter import Tk, messagebox


# Set the locale to Spanish so that strptime can parse Spanish month names
locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')

# Keywords to look for in the page text
keywords = ["ECONOMIC SECTORS"]

# Dictionary to store the generated DataFrames
dataframes_dict_2 = {}

# Path of the log file that records the processed folders
registro_path = 'dataframes_record/carpetas_procesadas_2.txt'

# Function to normalize month names
def corregir_nombre_mes(mes):
    meses_mapping = {
        'setiembre': 'septiembre',
        # Add more mappings here if other month spellings appear
    }
    return meses_mapping.get(mes, mes)

def registrar_carpeta_procesada(carpeta, num_archivos_procesados):
    with open(registro_path, 'a') as file:
        file.write(f"{carpeta}:{num_archivos_procesados}\n")

def carpeta_procesada(carpeta):
    if not os.path.exists(registro_path):
        return False
    with open(registro_path, 'r') as file:
        for line in file:
            if line.startswith(carpeta):
                return True
    return False

def procesar_pdf(pdf_path):
    tables_dict_2 = {}  # Local dictionary for each PDF
    table_counter = 1
    keyword_count = 0 

    filename = os.path.basename(pdf_path)
    id_ns_year_matches = re.findall(r'ns-(\d+)-(\d{4})', filename)
    if id_ns_year_matches:
        id_ns, year = id_ns_year_matches[0]
    else:
        print("No se encontraron coincidencias para id_ns y year en el nombre del archivo:", filename)
        # Return five values so the caller can always unpack the result
        return None, None, None, None, None

    new_filename = os.path.splitext(os.path.basename(pdf_path))[0].replace('-', '_')
    date = None

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, 1):
            text = page.extract_text()
            if i == 1:
                match = re.search(r'(\d{1,2}\s+de\s+\w+\s+de\s+\d{4})', text, re.IGNORECASE)
                if match:
                    fecha_str = match.group(0)
                    # Normalize the month name
                    partes_fecha = fecha_str.split()
                    partes_fecha[2] = corregir_nombre_mes(partes_fecha[2].lower())
                    fecha_str_corregida = ' '.join(partes_fecha)
                    date = datetime.strptime(fecha_str_corregida, '%d de %B de %Y')

            if all(keyword in text for keyword in keywords):
                keyword_count += 1
                if keyword_count == 2:
                    tables = tabula.read_pdf(pdf_path, pages=i, multiple_tables=False)
                    for j, table_df in enumerate(tables, start=1):
                        nombre_dataframe = f"{new_filename}_{keyword_count}"
                        tables_dict_2[nombre_dataframe] = table_df
                        table_counter += 1

    return id_ns, year, date, tables_dict_2, keyword_count


def procesar_carpeta(carpeta):
    print(f"Procesando la carpeta {os.path.basename(carpeta)}")
    pdf_files = [os.path.join(carpeta, f) for f in os.listdir(carpeta) if f.endswith('.pdf')]

    num_pdfs_procesados = 0
    num_dataframes_generados = 0

    table_counter = 1  # Initialize the table counter here
    tables_dict_2 = {}  # Declare tables_dict_2 outside the main loop
    
    for pdf_file in pdf_files:
        id_ns, year, date, tables_dict_temp, keyword_count = procesar_pdf(pdf_file)

        if tables_dict_temp:
            for nombre_df, df in tables_dict_temp.items():
                nombre_archivo = os.path.splitext(os.path.basename(pdf_file))[0].replace('-', '_')
                nombre_df = f"{nombre_archivo}_{keyword_count}"

                # Store the raw DataFrame in tables_dict_2
                tables_dict_2[nombre_df] = df.copy()

                # Apply the cleaning functions to a copy of the DataFrame
                df_clean = df.copy()
                if pd.isna(df_clean.iloc[0, 0]):
                    # Cleaning pipeline for tables whose first cell is empty
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = separate_years(df_clean)
                    df_clean = relocate_roman_numerals(df_clean)
                    df_clean = extract_mixed_values(df_clean)
                    df_clean = replace_first_row_nan(df_clean)
                    df_clean = first_row_columns(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = drop_nan_row(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = split_values(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = roman_arabic(df_clean)
                    df_clean = fix_duplicates(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = get_quarters_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = convertir_float(df_clean)
                else:
                    # Cleaning pipeline for tables whose first cell is not empty
                    df_clean = intercambiar_columnas(df_clean)
                    df_clean = drop_nan_columns(df_clean)
                    df_clean = remove_digit_slash(df_clean)
                    df_clean = last_column_es(df_clean)
                    df_clean = swap_first_second_row(df_clean)
                    df_clean = drop_nan_rows(df_clean)
                    df_clean = reset_index(df_clean)
                    year_columns = extract_years(df_clean)
                    df_clean = separate_text_digits(df_clean)
                    df_clean = roman_arabic(df_clean)
                    df_clean = fix_duplicates(df_clean)
                    df_clean = relocate_last_column(df_clean)
                    df_clean = clean_first_row(df_clean)
                    df_clean = get_quarters_sublist_list(df_clean, year_columns)
                    df_clean = first_row_columns(df_clean)
                    df_clean = clean_columns_values(df_clean)
                    df_clean = reset_index(df_clean)
                    df_clean = convertir_float(df_clean)

                # Add the 'date', 'id_ns', and 'year' columns to the clean DataFrame
                df_clean.insert(0, 'date', date)
                df_clean.insert(1, 'id_ns', id_ns)
                df_clean.insert(2, 'year', year)

                # Store the clean DataFrame in dataframes_dict_2
                dataframes_dict_2[nombre_df] = df_clean

                print(f'  {table_counter}. El dataframe generado para el archivo {pdf_file} es: {nombre_df}')
                num_dataframes_generados += 1
                table_counter += 1  # Increment the table counter here
                    
        num_pdfs_procesados += 1  # Count every PDF in the folder as processed

    return num_pdfs_procesados, num_dataframes_generados, tables_dict_2


def procesar_carpetas():
    pdf_folder = 'pseudo_raw_pdf'
    carpetas = [os.path.join(pdf_folder, d) for d in os.listdir(pdf_folder) if os.path.isdir(os.path.join(pdf_folder, d))]

    tables_dict_2 = {}  # Initialize tables_dict_2 here
    
    for carpeta in carpetas:
        if carpeta_procesada(carpeta):
            print(f"La carpeta {carpeta} ya ha sido procesada.")
            continue
        
        num_pdfs_procesados, num_dataframes_generados, tables_dict_temp = procesar_carpeta(carpeta)
        
        # Update tables_dict_2 with the values returned by procesar_carpeta()
        tables_dict_2.update(tables_dict_temp)
        
        registrar_carpeta_procesada(carpeta, num_pdfs_procesados)

        # Ask the user whether to continue with the next folder
        root = Tk()
        root.withdraw()
        root.attributes('-topmost', True)  # Keep the dialog window in the foreground
        
        mensaje = f"Se han generado {num_dataframes_generados} dataframes en la carpeta {carpeta}. ¿Deseas continuar con la siguiente carpeta?"
        continuar = messagebox.askyesno("Continuar", mensaje)
        root.destroy()

        if not continuar:
            print("Procesamiento detenido por el usuario.")
            break  # Leave the for loop if the user chooses not to continue

    print("Procesamiento completado para todas las carpetas.")  # Report completion

    return tables_dict_2  # Return tables_dict_2 at the end of the function

if __name__ == "__main__":
    tables_dict_2 = procesar_carpetas()


La carpeta pseudo_raw_pdf\2013 ya ha sido procesada.
La carpeta pseudo_raw_pdf\2014 ya ha sido procesada.
La carpeta pseudo_raw_pdf\2015 ya ha sido procesada.
La carpeta pseudo_raw_pdf\2016 ya ha sido procesada.
La carpeta pseudo_raw_pdf\2017 ya ha sido procesada.
La carpeta pseudo_raw_pdf\2018 ya ha sido procesada.
Procesando la carpeta 2019
  1. El dataframe generado para el archivo pseudo_raw_pdf\2019\ns-01-2019.pdf es: ns_01_2019_2
  2. El dataframe generado para el archivo pseudo_raw_pdf\2019\ns-02-2019.pdf es: ns_02_2019_2
  3. El dataframe generado para el archivo pseudo_raw_pdf\2019\ns-03-2019.pdf es: ns_03_2019_2
  4. El dataframe generado para el archivo pseudo_raw_pdf\2019\ns-04-2019.pdf es: ns_04_2019_2
  5. El dataframe generado para el archivo pseudo_raw_pdf\2019\ns-05-2019.pdf es: ns_05_2019_2
  6. El dataframe generado para el archivo pseudo_raw_pdf\2019\ns-06-2019.pdf es: ns_06_2019_2
  7. El dataframe generado para el archivo pseudo_raw_pdf\2019\ns-07-2019.pdf es: ns_

ValueError: Setting with non-unique columns is not allowed.

PENDING

* 2019
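
The run above stops in the 2019 folder with `ValueError: Setting with non-unique columns is not allowed.` The message suggests that some cleaning step tried to assign into a DataFrame whose column labels were repeated. Which step and which labels are involved has not been checked against the 2019 tables, so the following is only a minimal sketch of one way to make labels unique before the assignment. The helper name `make_unique` and the sample labels are hypothetical.

```python
from collections import Counter

def make_unique(labels):
    """Return the labels with repeats suffixed _2, _3, ... (first occurrence unchanged).

    Note: this sketch does not guard against a generated label colliding
    with a label that already exists in the input.
    """
    seen = Counter()
    unique = []
    for label in labels:
        seen[label] += 1
        if seen[label] == 1:
            unique.append(label)
        else:
            unique.append(f"{label}_{seen[label]}")
    return unique

# Hypothetical header row with one repeated label
columns = ["sectores_economicos", "2018", "2018", "economic_sectors"]
print(make_unique(columns))
```

On a DataFrame this could be applied as `df.columns = make_unique(df.columns)` right before the step that fails, which also makes it easier to see which labels were duplicated.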

In [None]:
tables_dict_2.keys()

In [None]:
dataframes_dict_2.keys()

In [None]:
tables_dict_2['ns_08_2014_2'].head(5)

In [None]:
dataframes_dict_2['ns_08_2014_2']
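
The Table 2 cell above reads the release date from the first page by switching the process locale to `es_ES.UTF-8` and mapping `setiembre` to `septiembre` before calling `strptime`. On machines where that locale is not installed, `locale.setlocale` fails. The sketch below shows a locale-independent alternative that uses the same regular expression idea. The names `SPANISH_MONTHS`, `DATE_PATTERN`, and `parse_spanish_date` are hypothetical and are not part of the notebook.

```python
import re
from datetime import datetime

# Month names as they may appear on the cover page, including the
# Peruvian spelling 'setiembre' that the notebook already handles.
SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "setiembre": 9,
    "octubre": 10, "noviembre": 11, "diciembre": 12,
}

# Same shape as the pattern in procesar_pdf, with capture groups per part
DATE_PATTERN = re.compile(r"(\d{1,2})\s+de\s+(\w+)\s+de\s+(\d{4})", re.IGNORECASE)

def parse_spanish_date(text):
    """Return the first 'D de MES de YYYY' date found in text, or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = SPANISH_MONTHS.get(month_name.lower())
    if month is None:
        return None
    return datetime(int(year), month, int(day))

# Hypothetical cover-page text
print(parse_spanish_date("Lima, 21 de setiembre de 2017"))
```

Because the month lookup is a plain dictionary, no global locale state is changed, so other cells that format dates are not affected.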

# Processing folder by folder is recommended. Once a folder has been processed and all of its dataframes have been generated correctly, run the code below, which concatenates, by sector, the revisions from all the NS of the same year (for every frequency). The growth-rate datasets must be loaded into SQL year by year.

**Note on $\color{blue}{NS-08-2017}$: NS-08-2017 does not include table 1, but table 1 does not change between NS 07 and NS 09 of the same year. In addition, table 2 does exist in NS 08 and is identical to the one in NS 09. The NS can therefore be rebuilt (with its 3 pages of interest) from NS 09: the last two pages of NS 09 are duplicated and replace the last two pages of the current NS 08. The first page is kept because it is the cover, which gives the date on which the NS was released.**
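
The page swap described in the note can be sketched as follows. This is only an illustration under assumptions: it uses the `pypdf` package (not imported anywhere in this notebook), the function names are hypothetical, and it assumes both files are the three-page pseudo-raw PDFs (cover plus two table pages).

```python
def select_pages(cover_source_pages, table_source_pages, n_tail=2):
    """Keep the first page of one document and the last n_tail pages of another."""
    return [cover_source_pages[0]] + list(table_source_pages[-n_tail:])

def rebuild_ns(ns08_path, ns09_path, output_path):
    """Write a PDF with the NS 08 cover followed by the last two NS 09 pages."""
    # pypdf is imported here so that select_pages stays usable without it.
    from pypdf import PdfReader, PdfWriter

    ns08_pages = list(PdfReader(ns08_path).pages)
    ns09_pages = list(PdfReader(ns09_path).pages)

    writer = PdfWriter()
    for page in select_pages(ns08_pages, ns09_pages):
        writer.add_page(page)
    with open(output_path, "wb") as handle:
        writer.write(handle)

# Page selection shown on plain labels instead of real PDF pages
print(select_pages(["cover_08", "old_1", "old_2"], ["cover_09", "table_1", "table_2"]))
```

Writing to a new `output_path` rather than overwriting the original NS 08 keeps the downloaded file available in case the reconstruction needs to be checked later.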

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="3">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: charter;">3.</span> <span style = "color: dark; font-family: charter;">SQL Tables</span></h1>

<div style="font-family: charter; text-align: left; color:dark">
    Finally, after obtaining and cleaning all the necessary data, we can create the three most important datasets, which store releases, vintages, and revisions. These datasets will be stored as SQL tables and can be loaded into any software or programming language.
    </div>
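
To make the relation between these datasets concrete, here is a small sketch under a common working assumption that this notebook does not spell out: each NS is a release, the set of estimates it publishes is a vintage, and a revision is the change in the estimate for the same reference period between two consecutive releases. The function name `revisions` and the numbers are made up for illustration only.

```python
def revisions(estimates):
    """Compute revisions between consecutive releases.

    estimates: list of (release_id, value) pairs in publication order,
    all referring to ONE reference period (for example, one quarter).
    Returns a list of (previous_release, current_release, change).
    """
    out = []
    for (prev_id, prev_val), (curr_id, curr_val) in zip(estimates, estimates[1:]):
        out.append((prev_id, curr_id, round(curr_val - prev_val, 2)))
    return out

# Made-up growth rates (percent) for a single reference period
example = [("ns_01", 3.0), ("ns_02", 3.5), ("ns_03", 3.25)]
print(revisions(example))
```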

<div id="3-1">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">3.1.</span>
    <span style = "color: dark; font-family: charter;">
    Annual Concatenation
    </span>
    </h2>

In [None]:
def concatenate_annual_df(dataframes_dict):
    # List to store the names of dataframes that meet the criterion of ending in '_2'
    dataframes_ending_with_2 = []

    # List to store the names of dataframes to be concatenated
    dataframes_to_concatenate = []

    # Iterate over the dataframe names in dataframes_dict
    for df_name in dataframes_dict.keys():
        # Check if the dataframe name ends with '_2' and add it to the corresponding list
        if df_name.endswith('_2'):
            dataframes_ending_with_2.append(df_name)
            dataframes_to_concatenate.append(dataframes_dict[df_name])

    # Print the names of dataframes that meet the criterion of ending in '_2'
    print("DataFrames ending with '_2' that will be concatenated:")
    for df_name in dataframes_ending_with_2:
        print(df_name)

    # Concatenate all dataframes in the 'dataframes_to_concatenate' list
    if dataframes_to_concatenate:
        # Concatenate only rows that meet the specified conditions
        gdp_annual_growth_rates = pd.concat([df[(df['sectores_economicos'] == 'pbi') | (df['economic_sectors'] == 'gdp')] 
                                    for df in dataframes_to_concatenate 
                                    if 'sectores_economicos' in df.columns and 'economic_sectors' in df.columns], 
                                    ignore_index=True)

        # Keep the 'id_ns', 'year', and 'date' columns plus every column whose name ends with '_year'
        columns_to_keep = ['id_ns', 'year', 'date'] + [col for col in gdp_annual_growth_rates.columns if col.endswith('_year')]

        # Drop unwanted columns
        gdp_annual_growth_rates = gdp_annual_growth_rates[columns_to_keep]
        
        # Remove duplicate columns if any
        gdp_annual_growth_rates = gdp_annual_growth_rates.loc[:,~gdp_annual_growth_rates.columns.duplicated()]

        # Print the number of rows in the concatenated dataframe
        print("Number of rows in the concatenated dataframe:", len(gdp_annual_growth_rates))
        
        return gdp_annual_growth_rates
    else:
        print("No dataframes were found to concatenate.")
        return None

In [None]:
gdp_annual_growth_rates = concatenate_annual_df(dataframes_dict_2)

In [None]:
gdp_annual_growth_rates

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="3-2">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">3.2.</span>
    <span style = "color: dark; font-family: charter;">
    Quarterly Concatenation
    </span>
    </h2>

In [None]:
import pandas as pd

def concatenate_quarterly_df(dataframes_dict):
    # List to store the names of dataframes that meet the criterion of ending in '_2'
    dataframes_ending_with_2 = []

    # List to store the names of dataframes to be concatenated
    dataframes_to_concatenate = []

    # Iterate over the dataframe names in dataframes_dict
    for df_name in dataframes_dict.keys():
        # Check if the dataframe name ends with '_2' and add it to the corresponding list
        if df_name.endswith('_2'):
            dataframes_ending_with_2.append(df_name)
            dataframes_to_concatenate.append(dataframes_dict[df_name])

    # Print the names of dataframes that meet the criterion of ending in '_2'
    print("DataFrames ending with '_2' that will be concatenated:")
    for df_name in dataframes_ending_with_2:
        print(df_name)

    # Concatenate all dataframes in the 'dataframes_to_concatenate' list
    if dataframes_to_concatenate:
        # Concatenate only rows that meet the specified conditions
        gdp_quarterly_growth_rates = pd.concat([df[(df['sectores_economicos'] == 'pbi') | (df['economic_sectors'] == 'gdp')] 
                                    for df in dataframes_to_concatenate 
                                    if 'sectores_economicos' in df.columns and 'economic_sectors' in df.columns], 
                                    ignore_index=True)

        # Keep the 'year', 'id_ns', and 'date' columns plus every column that does not end with '_year'
        columns_to_keep = ['year', 'id_ns', 'date'] + [col for col in gdp_quarterly_growth_rates.columns if not col.endswith('_year')]

        # Select the columns to keep
        gdp_quarterly_growth_rates = gdp_quarterly_growth_rates[columns_to_keep]

        # Drop the 'sectores_economicos' and 'economic_sectors' columns
        gdp_quarterly_growth_rates = gdp_quarterly_growth_rates.drop(columns=['sectores_economicos', 'economic_sectors'])

        # Remove duplicate columns if any
        gdp_quarterly_growth_rates = gdp_quarterly_growth_rates.loc[:,~gdp_quarterly_growth_rates.columns.duplicated()]

        # Print the number of rows in the concatenated dataframe
        print("Number of rows in the concatenated dataframe:", len(gdp_quarterly_growth_rates))
        
        return gdp_quarterly_growth_rates
    else:
        print("No dataframes were found to concatenate.")
        return None

# Example: call the function with the dictionary as its argument
# gdp_quarterly_growth_rates = concatenate_quarterly_df(dataframes_dict_2)


In [None]:
gdp_quarterly_growth_rates = concatenate_quarterly_df(dataframes_dict_2)

In [None]:
gdp_quarterly_growth_rates

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="3-3">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">3.3.</span>
    <span style = "color: dark; font-family: charter;">
    Monthly Concatenation
    </span>
    </h2>

In [None]:
import pandas as pd

def concatenate_monthly_df(dataframes_dict):
    # List to store the names of dataframes that meet the criterion of ending in '_1'
    dataframes_ending_with_1 = []

    # List to store the names of dataframes to be concatenated
    dataframes_to_concatenate = []

    # Iterate over the dataframe names in dataframes_dict
    for df_name in dataframes_dict.keys():
        # Check if the dataframe name ends with '_1' and add it to the corresponding list
        if df_name.endswith('_1'):
            dataframes_ending_with_1.append(df_name)
            dataframes_to_concatenate.append(dataframes_dict[df_name])

    # Print the names of dataframes that meet the criterion of ending with '_1'
    print("DataFrames ending with '_1' that will be concatenated:")
    for df_name in dataframes_ending_with_1:
        print(df_name)

    # Concatenate all dataframes in the 'dataframes_to_concatenate' list
    if dataframes_to_concatenate:
        # Concatenate only rows that meet the specified conditions
        gdp_monthly_growth_rates = pd.concat([df[(df['sectores_economicos'] == 'pbi') | (df['economic_sectors'] == 'gdp')] 
                                    for df in dataframes_to_concatenate 
                                    if 'sectores_economicos' in df.columns and 'economic_sectors' in df.columns], 
                                    ignore_index=True)

        # Keep the 'year', 'id_ns', and 'date' columns plus every column that does not end with '_year'
        columns_to_keep = ['year', 'id_ns', 'date'] + [col for col in gdp_monthly_growth_rates.columns if not col.endswith('_year')]

        # Select the columns to keep
        gdp_monthly_growth_rates = gdp_monthly_growth_rates[columns_to_keep]

        # Drop the 'sectores_economicos' and 'economic_sectors' columns
        gdp_monthly_growth_rates = gdp_monthly_growth_rates.drop(columns=['sectores_economicos', 'economic_sectors'])

        # Remove duplicate columns if any
        gdp_monthly_growth_rates = gdp_monthly_growth_rates.loc[:,~gdp_monthly_growth_rates.columns.duplicated()]
        
        # Drop columns with at least two underscores in their names
        columns_to_drop = [col for col in gdp_monthly_growth_rates.columns if col.count('_') >= 2]
        gdp_monthly_growth_rates = gdp_monthly_growth_rates.drop(columns=columns_to_drop)

        # Print the number of rows in the concatenated dataframe
        print("Number of rows in the concatenated dataframe:", len(gdp_monthly_growth_rates))
        
        return gdp_monthly_growth_rates
    else:
        print("No dataframes were found to concatenate.")
        return None

# Example: call the function with the dictionary as its argument
# gdp_monthly_growth_rates = concatenate_monthly_df(dataframes_dict_1)


In [None]:
gdp_monthly_growth_rates = concatenate_monthly_df(dataframes_dict_1)

In [None]:
gdp_monthly_growth_rates

In [None]:
# Make sure all dates share the same format
#gdp_monthly_growth_rates['date'] = pd.to_datetime(gdp_monthly_growth_rates['date'])

# Then keep only the date part of the 'date' column
#gdp_monthly_growth_rates['date'] = pd.to_datetime(gdp_monthly_growth_rates['date'].dt.date)

#print(gdp_monthly_growth_rates['date'].dtype)

In [None]:
#gdp_monthly_growth_rates['date'].dtype

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="3-4">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">3.4.</span>
    <span style = "color: dark; font-family: charter;">
    Loading SQL
    </span>
    </h2>

In [None]:
import os
from sqlalchemy import create_engine

# Get environment variables
user = os.environ.get('CIUP_SQL_USER')
password = os.environ.get('CIUP_SQL_PASS')
host = os.environ.get('CIUP_SQL_HOST')
port = 5432
database = 'gdp_revisions_datasets'

# Check if all environment variables are defined
if not all([host, user, password]):
    raise ValueError("Some environment variables are missing (CIUP_SQL_HOST, CIUP_SQL_USER, CIUP_SQL_PASS)")

# Create connection string
connection_string = f"postgresql://{user}:{password}@{host}:{port}/{database}"

# Create SQLAlchemy engine
engine = create_engine(connection_string)

# Save the DataFrames to the database (uncomment the tables to be written)
#gdp_annual_growth_rates.to_sql('gdp_annual_growth_rates_2013', engine, index=False, if_exists='replace')
#gdp_quarterly_growth_rates.to_sql('gdp_quarterly_growth_rates', engine, index=False, if_exists='replace')
gdp_monthly_growth_rates.to_sql('gdp_monthly_growth_rates_2014', engine, index=False, if_exists='replace')
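
The connection string above interpolates the user and password directly. If either contains characters such as `@`, `/`, or `:`, the resulting URL can be parsed incorrectly. A minimal sketch of a safer way to build it with the standard library is shown below. The helper name `build_postgres_url` and the sample credentials are hypothetical.

```python
from urllib.parse import quote

def build_postgres_url(user, password, host, port, database):
    """Build a PostgreSQL URL with percent-encoded credentials."""
    # safe="" makes quote() encode every reserved character, including '/'
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )

# Hypothetical credentials with characters that need escaping
print(build_postgres_url("analyst", "p@ss/word", "localhost", 5432, "gdp_revisions_datasets"))
```

The returned string can be passed to `create_engine` exactly as `connection_string` is in the cell above.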


### PENDING

1.


<div style="background-color: red; color: white;">
    Add global variables so that changing only the sector name is enough for the concatenation code to work. Do the same for the year at the end of "gdp_annual_growth_rates_".
</div>
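
One possible way to address this pending item is sketched below. It is not the notebook's implementation: the globals `SECTOR_ES`, `SECTOR_EN`, and `YEAR` and the functions `concatenate_sector` and `table_name` are hypothetical names. The sketch assumes column labels are unique and leaves out the extra monthly step that drops columns with two or more underscores.

```python
import pandas as pd

SECTOR_ES = "pbi"   # value matched in 'sectores_economicos'
SECTOR_EN = "gdp"   # value matched in 'economic_sectors'
YEAR = 2014         # year appended to the SQL table name (illustrative value)

ID_COLUMNS = ["year", "id_ns", "date"]
SECTOR_COLUMNS = ["sectores_economicos", "economic_sectors"]

def concatenate_sector(dataframes_dict, suffix, annual,
                       sector_es=SECTOR_ES, sector_en=SECTOR_EN):
    """Concatenate the rows of one sector from every dataframe ending in suffix.

    annual=True keeps the columns ending in '_year'; annual=False keeps the rest.
    Returns None when no dataframe qualifies.
    """
    frames = []
    for name, df in dataframes_dict.items():
        if not name.endswith(suffix):
            continue
        if not set(SECTOR_COLUMNS).issubset(df.columns):
            continue
        mask = (df["sectores_economicos"] == sector_es) | (df["economic_sectors"] == sector_en)
        frames.append(df[mask])
    if not frames:
        return None
    merged = pd.concat(frames, ignore_index=True)
    value_columns = [
        col for col in merged.columns
        if col not in ID_COLUMNS + SECTOR_COLUMNS and col.endswith("_year") == annual
    ]
    return merged[ID_COLUMNS + value_columns]

def table_name(frequency, sector=SECTOR_EN, year=YEAR):
    """Build a table name such as 'gdp_monthly_growth_rates_2014'."""
    return f"{sector}_{frequency}_growth_rates_{year}"
```

With these helpers, a call such as `concatenate_sector(dataframes_dict_2, "_2", annual=True)` would play the role of `concatenate_annual_df`, and `table_name("annual")` would supply the name passed to `to_sql`, so switching sector or year only requires editing the three globals.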

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

---
---
---