<div style="text-align: center; font-family: 'charter'; color: rgb(0, 65, 75);">
    <h1>
    GDP Revisions Datasets
    </h1>
</div>

<div style="text-align: center; font-family: 'charter'; color: rgb(0, 65, 75);">
    <h4>
        Documentation
        <br>
        ____________________
        <br>
    </h4>
</div>

<div style="font-family: charter; text-align: left; color: dark;">
    This 
    <span style="color: rgb(61, 48, 162);">Jupyter notebook</span>
    provides a step-by-step guide to <b>building the data</b> for the project <b>'Revisiones y sesgos en las estimaciones preliminares del PBI en el Perú'</b> ("Revisions and biases in preliminary GDP estimates in Peru"). The guide covers downloading the PDF files that contain tables of Peru's annual, quarterly, and monthly GDP growth rates (including sectoral GDP) and extracting that information into SQL tables. The resulting datasets will be used for data analysis.
</div>


<div style="text-align: center; font-family: 'charter'; color: rgb(0, 65, 75);">
    Jason Cruz
    <br>
    <a href="mailto:jj.cruza@up.edu.pe" style="color: rgb(0, 153, 123)">
        jj.cruza@up.edu.pe
    </a>
</div>

<div style="font-family: Times New Roman; text-align: left; color: rgb(61, 48, 162)">The outline below is functional: use its links to move between the sections of this notebook.</div>

<div id="outilne">
   <!-- Contenido de la celda de destino -->
</div>

<div style="background-color: #141414; padding: 10px;">
<h2 style="text-align: left; font-family: 'charter'; color: #E0E0E0;">
    Outline
    </h2>
    <br>
    <a href="#1" style="color: #687EFF; font-size: 18px;">
        1. PDF Downloader</a>
    <br>
    <a href="#2" style="color: #687EFF; font-size: 18px;">
        2. Extracting Tables (and data cleaning)</a>
    <br>
    <a href="#2-1" style="color: rgb(0, 153, 123); font-size: 12px;">
        2.1. 'pdfplumber' demo.</a>
    <br>
    <a href="#2-1-1" style="color: #E0E0E0; font-size: 12px;">
        2.1.1. What data would we get if we used the default settings?</a>   
    <br>
    <a href="#2-1-2" style="color: #E0E0E0; font-size: 12px;">
        2.1.2. Using custom '.extract_table' settings.</a>
    <br> 
    <a href="#2-2" style="color: rgb(0, 153, 123); font-size: 12px;">
        2.2. Extracting tables and generating dataframes (includes data cleanup).</a>
    <br>
    <a href="#3" style="color: #687EFF; font-size: 18px;">3. SQL Tables</a>
    <br>
    <a href="#3-1" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.1. Annual Concatenation.</a>
    <br>
    <a href="#3-2" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.2. Quarterly Concatenation.</a>
    <br>
    <a href="#3-3" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.3. Monthly Concatenation.</a>
    <br>
    <a href="#3-4" style="color: rgb(0, 153, 123); font-size: 12px;">
        3.4. Loading SQL.</a>
</div>

<div style="text-align: left; font-family: 'charter'; color: dark;">
    For any questions or issues regarding the code, please <a href="mailto:jj.cruza@alum.up.edu.pe" style="color: rgb(0, 153, 123)">email Jason Cruz
    </a>.
    </div>

<div style="text-align: left; font-family: 'charter'; color: dark;">
    If any of the libraries below are missing, install them with a command like the following example. Standard-library modules such as os, re, and time ship with Python and need no installation.
    </div>

In [1]:
#!pip install pdfplumber  # Uncomment this line to install a missing library; leave it commented if the library is already installed.

<div style="text-align: left; font-family: 'charter'; color: dark;">
    <h2>
    Libraries
    </h2>
    </div>

In [2]:
# PDF Downloader

import os  # for file and directory manipulation
import random  # to generate random numbers
import time  # to manage time and take breaks in the script
import requests  # to make HTTP requests to web servers
from selenium import webdriver  # for automating web browsers
from selenium.webdriver.common.by import By  # to locate elements on a webpage
from selenium.webdriver.support.ui import WebDriverWait  # to wait until certain conditions are met on a webpage.
from selenium.webdriver.support import expected_conditions as EC  # to define expected conditions
from selenium.common.exceptions import StaleElementReferenceException  # To handle exceptions related to elements on the webpage that are no longer available.


# Extracting Tables (and data cleaning)

import pdfplumber  # for extracting text and metadata from PDF files
import pandas as pd  # for data manipulation and analysis
import os  # for interacting with the operating system
import unicodedata  # for manipulating Unicode data
import re  # for regular expressions operations
from datetime import datetime  # for working with dates and times
import locale  # for locale-specific formatting of numbers, dates, and currencies


# SQL tables

import psycopg2  # for interacting with PostgreSQL databases
from sqlalchemy import create_engine, text  # for creating and executing SQL queries using SQLAlchemy


<div style="text-align: left; font-family: 'charter'; color: dark;">
    <h2>
    Initial set-up
    </h2>
    </div>

<div style="font-family: charter; text-align: left; color:dark"> The next two code cells create folders in your current working directory: one for the downloaded PDF files and one for the record of files that have already been downloaded. </div>

In [3]:
# Folder path to download PDF files

raw_pdf = 'raw_pdf' # to save raw data (.pdf).
if not os.path.exists(raw_pdf):
    os.mkdir(raw_pdf) # to create the folder (if it doesn't exist)

In [4]:
# Folder path to save text file with the names of already downloaded files

download_record = 'download_record'
if not os.path.exists(download_record):
    os.mkdir(download_record) # to create the folder (if it doesn't exist)
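
<div style="font-family: charter; text-align: left; color:dark">
    As an alternative to checking for the folder before calling os.mkdir, the standard library's os.makedirs with exist_ok=True creates the folder (and any missing parent folders) in a single call. The sketch below is only an illustration: the helper name ensure_folder is hypothetical, and the demonstration runs inside a temporary directory so that it leaves nothing behind.
    </div>

```python
import os
import tempfile


def ensure_folder(path):
    """Create `path` (and any missing parents) if needed, then return it."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstration inside a temporary directory (an assumption made for this
# example; the notebook itself creates the folders in the working directory).
with tempfile.TemporaryDirectory() as base:
    raw = ensure_folder(os.path.join(base, "raw_pdf"))
    record = ensure_folder(os.path.join(base, "download_record"))
    ensure_folder(raw)  # calling it a second time does not raise
    print(os.path.isdir(raw), os.path.isdir(record))
```

<div style="font-family: charter; text-align: left; color:dark">
    Because the call is idempotent, the cell can be re-run safely at any time.
    </div>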

<div id="1">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: charter;">1.</span> <span style = "color: dark; font-family: charter;">PDF Downloader</span></h1>

<div style="font-family: charter; text-align: left; color:dark">
    Our main source for data collection is the <a href="https://www.bcrp.gob.pe/" style="color: rgb(0, 153, 123)">BCRP's web page</a> (.../publicaciones/nota-semanal). The BCRP publishes the "Notas Semanales" (weekly reports), documents that contain, among other information, tables of GDP and sectoral GDP growth rates at annual, quarterly, and monthly frequencies.
    </div>

-- (pending) Selenium tutorial

<div style="font-family: charter; text-align: left; color:dark">
    The code in this section downloads the 'Notas Semanales' files in PDF format from this web page. First, we set the URL of the page.
    </div>

In [5]:
# Setting the BCRP URL
bcrp_url = "https://www.bcrp.gob.pe/publicaciones/nota-semanal.html"  # Never replace this URL

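<div style="font-family: charter; text-align: left; color:dark">
    The cell below opens the page with Selenium, collects the links to the 'Notas Semanales', and saves each PDF into the raw_pdf folder. Files already listed in downloaded_files.txt are skipped, so the cell can be re-run to resume an interrupted session. The script pauses for a random 5 to 10 seconds between downloads and stops once the link counter reaches total_downloads.
    </div>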

In [10]:
# List to keep track of successfully downloaded files
downloaded_files = []

# Folder where downloaded PDF files will be saved
raw_pdf = "raw_pdf"  # Replace with the actual path

# Folder where the download record file will be saved
download_record = "download_record"  # Replace with the actual path

# Load the list of previously downloaded files if it exists
if os.path.exists(os.path.join(download_record, "downloaded_files.txt")):
    with open(os.path.join(download_record, "downloaded_files.txt"), "r") as f:
        downloaded_files = f.read().splitlines()

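# Web driver setup
# Selenium 4 receives the driver path through a Service object
# (the executable_path argument of webdriver.Chrome is deprecated).
from selenium.webdriver.chrome.service import Service
driver_path = os.environ.get('driver_path')
driver = webdriver.Chrome(service=Service(executable_path=driver_path))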

def random_wait(min_time, max_time):
    wait_time = random.uniform(min_time, max_time)
    print(f"Waiting randomly for {wait_time:.2f} seconds")
    time.sleep(wait_time)

def download_pdf(pdf_link):
    # Click the link using JavaScript
    driver.execute_script("arguments[0].click();", pdf_link)

    # Wait for the new page to fully open (adjust timing as necessary)
    wait.until(EC.number_of_windows_to_be(2))

    # Switch to the new window or tab
    windows = driver.window_handles
    driver.switch_to.window(windows[1])

    # Get the current URL (may vary based on site-specific logic)
    new_url = driver.current_url
    print(f"{download_counter}. New URL: {new_url}")

    # Get the file name from the URL
    file_name = new_url.split("/")[-1]

    # Form the full destination path
    destination_path = os.path.join(raw_pdf, file_name)

    # Download the PDF
    response = requests.get(new_url, stream=True)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Save the PDF content to the local file
        with open(destination_path, 'wb') as pdf_file:
            for chunk in response.iter_content(chunk_size=128):
                pdf_file.write(chunk)

        print(f"PDF downloaded successfully at: {destination_path}")

    else:
        print(f"Error downloading the PDF. Response code: {response.status_code}")

    # Close the new window or tab
    driver.close()

    # Switch back to the main window
    driver.switch_to.window(windows[0])

# Number of downloads per batch
downloads_per_batch = 5
# Total number of downloads
total_downloads = 25

try:
    # Open the BCRP page
    driver.get(bcrp_url)
    print("Site opened successfully")

    # Wait for the container area to be present
    wait = WebDriverWait(driver, 60)
    container_area = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="rightside"]')))

    # Get all the links within the container area
    pdf_links = container_area.find_elements(By.XPATH, './/a')

    # Reverse the order of links
    pdf_links = list(reversed(pdf_links))

    # Initialize download counter
    download_counter = 0

    # Iterate over reversed links and download PDFs in batches
    for pdf_link in pdf_links:
        download_counter += 1

        # Get the file name from the URL
        new_url = pdf_link.get_attribute("href")
        file_name = new_url.split("/")[-1]

        # Check if the file has already been downloaded
        if file_name in downloaded_files:
            print(f"{download_counter}. The file {file_name} has already been downloaded previously. Skipping...")
            continue

        # Try to download the file
        try:
            download_pdf(pdf_link)

            # Update the list of downloaded files
            downloaded_files.append(file_name)

            # Save the file name in the record
            with open(os.path.join(download_record, "downloaded_files.txt"), "a") as f:
                f.write(file_name + "\n")

        except Exception as e:
            print(f"Error downloading the file {file_name}: {str(e)}")

        # If the download count reaches a multiple of batch size, notify
        if download_counter % downloads_per_batch == 0:
            print(f"Batch {download_counter // downloads_per_batch} of {total_downloads // downloads_per_batch} completed")

        # Random wait before the next iteration
        random_wait(5, 10)

        # If total downloads reached, break out of loop
        if download_counter == total_downloads:
            print(f"All downloads completed ({total_downloads} in total)")
            break

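except StaleElementReferenceException:
    # The loop is not retried automatically; already downloaded files are skipped on the next run
    print("StaleElementReferenceException occurred. Re-run this cell to resume the downloads.")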

finally:
    # Close the browser when finished
    driver.quit()


Site opened successfully
1. The file ns-01-2013.pdf has already been downloaded previously. Skipping...
2. The file ns-02-2013.pdf has already been downloaded previously. Skipping...
3. The file ns-03-2013.pdf has already been downloaded previously. Skipping...
4. The file ns-04-2013.pdf has already been downloaded previously. Skipping...
5. New URL: https://www.bcrp.gob.pe/docs/Publicaciones/Nota-Semanal/2013/ns-05-2013.pdf
PDF downloaded successfully at: raw_pdf\ns-05-2013.pdf
Batch 1 of 5 completed
Waiting randomly for 8.94 seconds
6. New URL: https://www.bcrp.gob.pe/docs/Publicaciones/Nota-Semanal/2013/ns-06-2013.pdf
PDF downloaded successfully at: raw_pdf\ns-06-2013.pdf
Waiting randomly for 6.90 seconds
7. New URL: https://www.bcrp.gob.pe/docs/Publicaciones/Nota-Semanal/2013/ns-07-2013.pdf
PDF downloaded successfully at: raw_pdf\ns-07-2013.pdf
Waiting randomly for 9.94 seconds
8. New URL: https://www.bcrp.gob.pe/docs/Publicaciones/Nota-Semanal/2013/ns-08-2013.pdf
PDF downloaded su
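
<div style="font-family: charter; text-align: left; color:dark">
    The skip-and-resume behaviour of the cell above depends on two small pieces of logic: taking the file name from the last component of the URL, and keeping a text record of what has been downloaded. The following sketch isolates that logic using only the standard library. The function names (file_name_from_url, load_record, add_to_record) are hypothetical, and the record is written to a temporary directory for the demonstration.
    </div>

```python
import os
import tempfile


def file_name_from_url(url):
    """Return the last path component of a URL, as the download cell does."""
    return url.split("/")[-1]


def load_record(record_path):
    """Return the set of recorded file names (empty if the record is missing)."""
    if not os.path.exists(record_path):
        return set()
    with open(record_path, "r", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def add_to_record(record_path, file_name):
    """Append one file name to the record file."""
    with open(record_path, "a", encoding="utf-8") as f:
        f.write(file_name + "\n")


# Demonstration with a URL taken from the output above
url = "https://www.bcrp.gob.pe/docs/Publicaciones/Nota-Semanal/2013/ns-05-2013.pdf"
with tempfile.TemporaryDirectory() as base:
    record_path = os.path.join(base, "downloaded_files.txt")
    name = file_name_from_url(url)
    if name not in load_record(record_path):
        add_to_record(record_path, name)  # a real run would download first
    print(name in load_record(record_path))
```

<div style="font-family: charter; text-align: left; color:dark">
    Using a set for the record keeps the membership check fast even when the list of files grows to several hundred entries.
    </div>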

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="2">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: charter;">2.</span> <span style = "color: dark; font-family: charter;">Extracting Tables (and data cleaning)</span></h1>

<div id="2-1">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">2.1.</span>
    <span style = "color: dark; font-family: charter;">
    <span style="background-color: #f2f2f2; font-family: Courier New;">
        pdfplumber
    </span> 
    demo
    </span>
    </h2>

<div style="font-family: charter; text-align: left; color:dark">
    Import
    <span style="background-color: #f2f2f2; font-family: Courier New;">
        pdfplumber
    </span>
    </div>

In [11]:
import pdfplumber
print(f'This library version is: {pdfplumber.__version__}')

This library version is: 0.10.4


<div style="font-family: charter; text-align: left; color:dark">
    Load the PDF
    </div>

In [14]:
pdf = pdfplumber.open(os.path.join(raw_pdf, "ns-10-2013.pdf"))  # one of the files downloaded in section 1

<div style="font-family: charter; text-align: left; color:dark">
    Get page 82 (index 81, because the list of pages is zero-indexed)
    </div>

In [None]:
p_82 = pdf.pages[81]

In [None]:
# Convert the page to a higher resolution image (e.g., 300 DPI).
image = p_82.to_image(resolution=300)
image

<div id="2-1-1">
   <!-- Contenido de la celda de destino -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: charter;">2.1.1.</span>
    <span style = "color: dark; font-family: charter;">
    What data would we get if we used the default settings?
    </span>
    </h3>

<div style="font-family: charter; text-align: left; color:dark">
    We can check by using <span style="background-color: #f2f2f2; font-family: Courier New;">
        PageImage.debug_tablefinder()
    </span>:
    </div>

In [None]:
image.reset().debug_tablefinder()

<div style="font-family: charter; text-align: left; color:dark">
    The default settings correctly identify the table's vertical demarcations, but don't capture the horizontal demarcations between its rows. So:
    </div>

<div id="2-1-2">
   <!-- Contenido de la celda de destino -->
</div>

<h3><span style = "color: rgb(0, 65, 75); font-family: charter;">2.1.2.</span>
    <span style = "color: dark; font-family: charter;">
    Using custom <span style="background-color: #f2f2f2; font-family: Courier New;">
        <b>.extract_table
            </b>
    </span>'s settings
    </span>
    </h3>

<div style="font-family: charter; text-align: left; color:dark">
    <ul>
        <li>Because the columns are separated by lines, we use <span style="background-color: #f2f2f2; font-family: Courier New;">
        vertical_strategy="lines"
    </span>.
            </li>
        <li>Because the rows are, primarily, separated by gutters between the text, we use <span style="background-color: #f2f2f2; font-family: Courier New;">
        horizontal_strategy="text"
    </span>.
            </li>
        <li>To snap together a handful of the gutters at the top which aren't fully flush with one another, we use <span style="background-color: #f2f2f2; font-family: Courier New;">
        snap_y_tolerance
    </span>, which snaps horizontal lines within a certain distance to the same vertical alignment.
            </li>
        <li>And because the left and right-hand extremities of the text aren't quite flush with the vertical lines, we use <span style="background-color: #f2f2f2; font-family: Courier New;">
        "intersection_tolerance": 15
    </span>.
            </li>
        </ul>
    </div>

In [None]:
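# Full list of table-finder options. The values shown here match the defaults
# listed in pdfplumber's documentation; edit the entries discussed above
# (for example "horizontal_strategy": "text") to customize the extraction.
table_settings = {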
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "text_keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
}

In [None]:
image.reset().debug_tablefinder(table_settings)
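
<div style="font-family: charter; text-align: left; color:dark">
    For comparison, a settings dictionary that follows the four bullet points above could look like the sketch below. It is only an illustration: the names custom_table_settings and extract_with_settings are hypothetical, the value 15 comes from the text, and the value chosen for snap_y_tolerance is an assumption that would need tuning against the actual page. Options that are left out keep their pdfplumber defaults.
    </div>

```python
# Hypothetical settings that mirror the four bullet points above.
custom_table_settings = {
    "vertical_strategy": "lines",    # columns are separated by ruled lines
    "horizontal_strategy": "text",   # rows are separated by gutters between text
    "snap_y_tolerance": 5,           # assumed value; tune it for the page at hand
    "intersection_tolerance": 15,    # value quoted in the list above
}


def extract_with_settings(page, settings):
    """Extract the largest table on a pdfplumber page with the given settings.

    `page` is expected to be a pdfplumber Page such as p_82 above; its
    extract_table method accepts a table-settings dictionary.
    """
    return page.extract_table(settings)
```

<div style="font-family: charter; text-align: left; color:dark">
    With the page loaded earlier, the call would be extract_with_settings(p_82, custom_table_settings), and the same dictionary can be passed to debug_tablefinder to inspect the detected cells visually.
    </div>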

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="2-2">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">2.2.</span>
    <span style = "color: dark; font-family: charter;">
    Extracting tables and generating dataframes (includes data cleanup)
    </span>
    </h2>

<div style="font-family: charter; text-align: left; color:dark">
    We only need specific tables: those reporting GDP growth rates at annual, quarterly, and monthly frequencies. Other GDP-related tables that don't meet these requirements are not needed. To locate the right pages, we search the text of each page for a set of keywords and extract tables only from pages that contain all of them.
    </div>

In [None]:
# Keywords to search in the page text
keywords = ["PRODUCTO BRUTO INTERNO", "SECTORES ECONÓMICOS", "PBI", "GDP", "Variaciones"]
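
<div style="font-family: charter; text-align: left; color:dark">
    The page filter can be illustrated in isolation. In the sketch below, page_has_all_keywords is a hypothetical helper and the two sample strings are invented for the example; they are not text from an actual 'Nota Semanal'. A page qualifies only if every keyword appears in its text, and a page without extractable text is treated as an empty string.
    </div>

```python
# Same keywords as in the cell above
keywords = ["PRODUCTO BRUTO INTERNO", "SECTORES ECONÓMICOS", "PBI", "GDP", "Variaciones"]


def page_has_all_keywords(page_text, keywords):
    """Return True only if every keyword occurs in the page text."""
    text = page_text or ""  # treat a missing text layer as empty
    return all(keyword in text for keyword in keywords)


# Invented sample strings, for illustration only
matching_page = (
    "PRODUCTO BRUTO INTERNO / GDP (Variaciones porcentuales) "
    "SECTORES ECONÓMICOS ... PBI"
)
other_page = "PRODUCTO BRUTO INTERNO (Millones de soles)"

print(page_has_all_keywords(matching_page, keywords))
print(page_has_all_keywords(other_page, keywords))
```

<div style="font-family: charter; text-align: left; color:dark">
    The match is case-sensitive, so the keywords must be written exactly as they appear in the PDF.
    </div>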

<div style="font-family: charter; text-align: left; color:dark">
    The code iterates through each PDF and extracts the two required tables from each. The extracted information is then converted into dataframes, and the column names and values are cleaned: names become lowercase, accent-free, and underscore-separated, and numeric values are converted to floats.
    </div>

In [None]:
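# Set the locale to Spanish so that strptime can parse Spanish month names
# (locale names are platform-dependent; adjust the name if setlocale raises locale.Error)
locale.setlocale(locale.LC_TIME, 'es_ES.UTF-8')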

# Function to process a PDF file and generate corresponding DataFrames
def process_pdf(pdf_path):
    # Dictionary to store tables that meet the criteria
    tables_dict = {}

    # Counter to assign names to tables
    table_counter = 1

    # Get id_ns and year from the PDF filename
    filename = os.path.basename(pdf_path)
    id_ns_year_matches = re.findall(r'ns-(\d+)-(\d{4})', filename)
    if id_ns_year_matches:
        id_ns, year = id_ns_year_matches[0]
    else:
        print("No matches found for id_ns and year in filename:", filename)
        # Return an empty dict so the caller can still iterate over the result
        return {}, None, None, None

    # Replace hyphens with underscores in the filename
    new_filename = filename.replace('-', '_')

    # Date extracted from the first text of the first page
    date = None

    # Open the PDF file
    with pdfplumber.open(pdf_path) as pdf:
        # Iterate through the pages of the PDF
        for i, page in enumerate(pdf.pages, 1):
            # Extract text from the page (fall back to an empty string if there is none)
            text = page.extract_text() or ""
            if i == 1:
                # Get the date from the first text of the first page
                match = re.search(r'(\d{1,2}\s+de\s+\w+\s+de\s+\d{4})', text)
                if match:
                    # Convert the date string to a datetime object with the desired format
                    date = datetime.strptime(match.group(0), '%d de %B de %Y')

            # Check if all keywords are present in the page text
            if all(keyword in text for keyword in keywords):
                # Extract all tables from the page and add them to the tables dictionary
                for j, table in enumerate(page.extract_tables(), start=1):
                    tables_dict[f"table_{table_counter}"] = table
                    table_counter += 1

    # Process each table in the dictionary
    all_dataframes = {}
    for table_name, table in tables_dict.items():
        # Process the sublists to create DataFrames
        for i, sublist in enumerate(table):
            # Check if it's the first or second sublist
            if i < 2:
                # Apply space replacement around the hyphen only for the first and second sublist
                for j, item in enumerate(sublist):
                    if isinstance(item, str):
                        table[i][j] = re.sub(r'\s*-\s*', '-', item)

        # Process the first sublist to define the DataFrame columns
        columns = table[0]

        # List to store DataFrames of each sublist
        dfs_temp = []

        # Process the remaining sublists to create DataFrames
        for sublist in table[1:]:
            # Check if the sublist element is None
            if sublist is not None:
                # Iterate over sublist elements and split them by "\n"
                elements = [elem.split('\n') if elem is not None else [''] for elem in sublist]
                # Transpose the list to group elements of the same position into sublists
                rows = zip(*elements)
                # Convert rows to a Pandas DataFrame
                df = pd.DataFrame(rows, columns=columns)

                # Split observations into multiple columns, excluding 'SECTORES ECONÓMICOS' and 'ECONOMIC SECTORS' columns
                columns_to_split = [col for col in df.columns if col not in ['SECTORES ECONÓMICOS', 'ECONOMIC SECTORS']]
                for col in columns_to_split:
                    df_temp = df[col].str.split(expand=True)
                    for i in range(len(df_temp.columns)):
                        df[f"{col}_{i+1}"] = df_temp[i]

                # Remove the original columns that were split
                df = df.drop(columns=columns_to_split)

                # Add the DataFrame to the temporary DataFrame list
                dfs_temp.append(df)

        # Concatenate temporary DataFrames into one
        df_final_temp = pd.concat(dfs_temp, ignore_index=True)

        # Rename columns containing '_'
        new_names = {col: col.split('_')[0] for col in df_final_temp.columns if '_' in col}
        df_final_temp.rename(columns=new_names, inplace=True)

        # Iterate over columns and remove leading underscores if necessary
        df_final_temp.columns = [col[1:] if col.startswith('_') else col for col in df_final_temp.columns]

        # Remove spaces before and after hyphens in column names
        df_final_temp.columns = [col.strip().replace(' - ', '-') for col in df_final_temp.columns]

        # Replace empty columns
        columns = list(df_final_temp.columns)
        for i, column in enumerate(columns):
            if column.strip() == '':
                # Find the left column name containing a year
                left_name = next((col for col in reversed(columns[:i]) if col.isdigit()), None)

                # Find the right column name containing a year
                right_name = next((col for col in columns[i + 1:] if col.isdigit()), None)

                if left_name and not right_name:
                    # If there's a 4-digit number to the left and no more columns to the right
                    df_final_temp.rename(columns={column: left_name}, inplace=True)
                elif not left_name and right_name:
                    # If there's a 4-digit number to the right and no more columns to the left
                    right_year = int(right_name)
                    df_final_temp.rename(columns={column: str(right_year - 1)}, inplace=True)
                elif left_name and right_name:
                    # If there are column names to the left and right containing 4-digit numbers
                    left_year = int(left_name)
                    df_final_temp.rename(columns={column: str(left_year)}, inplace=True)

        # Replace column names with prefixes from the first row
        new_names = df_final_temp.iloc[0].apply(lambda x: x.strip() if isinstance(x, str) else x).astype(str) + '_' + df_final_temp.columns
        df_final_temp.columns = new_names

        # Remove the first row of the DataFrame, as its values were used as prefixes for columns
        df_final_temp = df_final_temp.drop(0)

        # Convert all columns to lowercase
        df_final_temp.columns = map(str.lower, df_final_temp.columns)

        # Remove periods from column names (regex=False so that '.' is a literal period, not a wildcard)
        df_final_temp.columns = df_final_temp.columns.str.replace('.', '', regex=False)

        # Remove special characters and accents from column names
        df_final_temp.columns = [unicodedata.normalize('NFKD', col).encode('ASCII', 'ignore').decode('utf-8') for col in df_final_temp.columns]

        # Replace 'ano' with 'year' in all columns
        df_final_temp.columns = [col.replace('ano', 'year') for col in df_final_temp.columns]

        # Replace empty spaces between words with '_'
        df_final_temp.columns = [col.replace(' ', '_') for col in df_final_temp.columns]

        # Replace hyphens with underscores
        df_final_temp.columns = [col.replace('-', '_') for col in df_final_temp.columns]

        # Remove a leading underscore left over when the prefix from the first row was empty
        df_final_temp.columns = [col[1:] if col.startswith('_') else col for col in df_final_temp.columns]

        # Replace commas with periods in values of all columns except 'sectores_economicos' and 'economic_sectors'
        for col in df_final_temp.columns:
            if col not in ['sectores_economicos', 'economic_sectors']:
                df_final_temp[col] = df_final_temp[col].apply(lambda x: str(x).replace(',', '.') if isinstance(x, (int, float, str)) else x)

        # Convert columns to float type
        for col in df_final_temp.columns:
            if col not in ['sectores_economicos', 'economic_sectors']:
                if isinstance(df_final_temp[col], pd.Series):
                    df_final_temp[col] = pd.to_numeric(df_final_temp[col], errors='coerce')

        # Get object (text string) type columns
        text_columns = df_final_temp.select_dtypes(include='object').columns

        # Iterate over text columns and remove accents
        for col in text_columns:
            df_final_temp[col] = df_final_temp[col].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore').decode('utf-8') if isinstance(x, str) else x)

        # Convert all text strings to lowercase
        for col in text_columns:
            df_final_temp[col] = df_final_temp[col].str.lower()

        # Define function to remove numbers and special characters
        def remove_special_characters(text):
            # Leave non-string values (e.g. missing cells) untouched
            return re.sub(r'[^a-zA-Z\s]', '', text) if isinstance(text, str) else text

        # Apply function to 'sectores_economicos' and 'economic_sectors' columns (when present)
        for col in ['sectores_economicos', 'economic_sectors']:
            if col in df_final_temp.columns:
                df_final_temp[col] = df_final_temp[col].apply(remove_special_characters)

        # Add new columns id_ns, year, and date to DataFrames
        df_final_temp['year'] = year
        df_final_temp['id_ns'] = id_ns
        df_final_temp['date'] = date

        # Reorganize columns to place year, id_ns, and date at the beginning
        column_order = ['year', 'id_ns', 'date'] + [col for col in df_final_temp.columns if col not in ['id_ns', 'year', 'date']]
        df_final_temp = df_final_temp[column_order]

        # Convert id_ns and year columns to integer type
        df_final_temp['year'] = df_final_temp['year'].astype(int)
        df_final_temp['id_ns'] = df_final_temp['id_ns'].astype(int)

        # Save the final DataFrame with a unique name in the dictionary
        df_name = f"{os.path.splitext(new_filename)[0]}_{table_name.split('_')[1]}"
        all_dataframes[df_name] = df_final_temp

    # Return the results
    return all_dataframes, year, id_ns, date

# Iterate over the PDF files in the folder

# Initialize a counter
file_counter = 0

# Dictionary to store all generated dataframes
all_dataframes = {}

# Iterate over the PDF files in the folder
for filename in os.listdir(raw_pdf):
    if filename.endswith(".pdf"):
        file_counter += 1
        pdf_file = os.path.join(raw_pdf, filename)
        print(f"Processing file {file_counter}: {pdf_file}")
        generated_dataframes, year, id_ns, date = process_pdf(pdf_file)
        print("Generated DataFrames:")
        for df_name in generated_dataframes.keys():
            print(df_name)
        # Add the generated dataframes to the general dictionary
        all_dataframes.update(generated_dataframes)

# Use the all_dataframes dictionary as needed

        #print("year:", year)
        #print("id_ns:", id_ns)
        #print("date:", date)
        

In [None]:
all_dataframes.keys()
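
<div style="font-family: charter; text-align: left; color:dark">
    The date parsing in process_pdf relies on a Spanish locale being installed on the system. Where that is not the case, one possible alternative is to map the Spanish month names explicitly. The sketch below reuses the same date pattern as process_pdf; SPANISH_MONTHS, DATE_PATTERN, and parse_spanish_date are hypothetical names, and the sample strings are invented. The mapping also accepts the spelling "setiembre", which is common in Peru.
    </div>

```python
import re
from datetime import datetime

# Month names mapped to numbers; "setiembre" is a common Peruvian spelling
SPANISH_MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5, "junio": 6,
    "julio": 7, "agosto": 8, "septiembre": 9, "setiembre": 9,
    "octubre": 10, "noviembre": 11, "diciembre": 12,
}

# Same structure as the pattern in process_pdf, with one group per component
DATE_PATTERN = re.compile(r"(\d{1,2})\s+de\s+(\w+)\s+de\s+(\d{4})")


def parse_spanish_date(text):
    """Return the first 'D de MONTH de YYYY' date found in text, or None."""
    match = DATE_PATTERN.search(text or "")
    if not match:
        return None
    day, month_name, year = match.groups()
    month = SPANISH_MONTHS.get(month_name.lower())
    if month is None:
        return None  # the word between the two "de" is not a month name
    return datetime(int(year), month, int(day))


# Invented sample strings, for illustration only
print(parse_spanish_date("Lima, 8 de marzo de 2013"))
print(parse_spanish_date("22 de setiembre de 2016"))
```

<div style="font-family: charter; text-align: left; color:dark">
    This approach does not change any global state, whereas locale.setlocale affects the whole Python process.
    </div>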

In [None]:
all_dataframes['ns_10_2013_1']
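
<div style="font-family: charter; text-align: left; color:dark">
    The column-name clean-up inside process_pdf is spread over several statements. As an illustration, the same steps can be collected in one small function and tried on individual names. This is a sketch using only the standard library: clean_column_name is a hypothetical helper, and the sample names are assumptions about what the raw headers might look like.
    </div>

```python
import unicodedata


def clean_column_name(name):
    """Apply the same clean-up steps as process_pdf to a single column name."""
    name = name.lower().replace(".", "")             # lowercase, drop periods
    name = (unicodedata.normalize("NFKD", name)      # split accents from letters
            .encode("ascii", "ignore")               # drop the accent marks
            .decode("ascii"))
    name = name.replace("ano", "year")               # 'año' became 'ano' above
    name = name.replace(" ", "_").replace("-", "_")  # spaces and hyphens to '_'
    return name[1:] if name.startswith("_") else name


# Assumed examples of raw headers
for raw in ["AÑO_2012", "SECTORES ECONÓMICOS", "I_2012", "_2013"]:
    print(repr(raw), "->", repr(clean_column_name(raw)))
```

<div style="font-family: charter; text-align: left; color:dark">
    Note that the replacement of 'ano' applies anywhere in the name, so it would also alter a header that happens to contain those three letters inside another word.
    </div>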

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="3">
   <!-- Contenido de la celda de destino -->
</div>

<h1><span style = "color: rgb(0, 65, 75); font-family: charter;">3.</span> <span style = "color: dark; font-family: charter;">SQL Tables</span></h1>

<div style="font-family: charter; text-align: left; color:dark">
    Finally, after obtaining and cleaning all the necessary data, we can create the three most important datasets, which store releases, vintages, and revisions. These datasets will be stored as SQL tables and can be loaded into any software or programming language.
    </div>

<div id="3-1">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">3.1.</span>
    <span style = "color: dark; font-family: charter;">
    Annual Concatenation
    </span>
    </h2>

In [None]:
# List to store the names of dataframes that meet the criterion of ending in '_2'
dataframes_ending_with_2 = []

# List to store the names of dataframes to be concatenated
dataframes_to_concatenate = []

# Iterate over the dataframe names in the all_dataframes dictionary
for df_name in all_dataframes.keys():
    # Check if the dataframe name ends with '_2' and add it to the corresponding list
    if df_name.endswith('_2'):
        dataframes_ending_with_2.append(df_name)
        dataframes_to_concatenate.append(all_dataframes[df_name])

# Print the names of dataframes that meet the criterion of ending in '_2'
print("DataFrames ending with '_2' that will be concatenated:")
for df_name in dataframes_ending_with_2:
    print(df_name)

# Concatenate all dataframes in the 'dataframes_to_concatenate' list
if dataframes_to_concatenate:
    # Concatenate only rows that meet the specified conditions
    gdp_annual_growth_rates = pd.concat([df[(df['sectores_economicos'] == 'pbi') | (df['economic_sectors'] == 'gdp')] 
                                for df in dataframes_to_concatenate 
                                if 'sectores_economicos' in df.columns and 'economic_sectors' in df.columns], 
                                ignore_index=True)

    # Keep only columns that start with 'year' and the 'id_ns', 'year', and 'date' columns
    columns_to_keep = ['id_ns', 'year', 'date'] + [col for col in gdp_annual_growth_rates.columns if col.startswith('year')]

    # Drop unwanted columns
    gdp_annual_growth_rates = gdp_annual_growth_rates[columns_to_keep]
    
    # Remove duplicate columns if any
    gdp_annual_growth_rates = gdp_annual_growth_rates.loc[:,~gdp_annual_growth_rates.columns.duplicated()]

    # Print the number of rows in the concatenated dataframe
    print("Number of rows in the concatenated dataframe:", len(gdp_annual_growth_rates))
else:
    print("No dataframes were found to concatenate.")


In [None]:
gdp_annual_growth_rates
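
<div style="font-family: charter; text-align: left; color:dark">
    In the annual concatenation above, the column 'year' is selected twice, once explicitly and once because it starts with 'year', which is why duplicate columns have to be removed afterwards. A possible way to avoid creating the duplicates is to de-duplicate the list of names while preserving its order. The sketch below works on a plain list of column names; annual_columns is a hypothetical helper and the sample names are assumptions.
    </div>

```python
def annual_columns(all_columns):
    """Return identifier columns plus the 'year*' columns, without repeats."""
    id_columns = ["id_ns", "year", "date"]
    year_columns = [col for col in all_columns if col.startswith("year")]
    # dict.fromkeys keeps the first occurrence of each name, in order
    return list(dict.fromkeys(id_columns + year_columns))


# Assumed example of the columns of a concatenated dataframe
sample_columns = [
    "year", "id_ns", "date", "sectores_economicos",
    "i_2012", "year_2012", "year_2013", "economic_sectors",
]
print(annual_columns(sample_columns))
```

<div style="font-family: charter; text-align: left; color:dark">
    The resulting list can be used directly to select columns from the dataframe, which makes the later duplicate-removal step unnecessary for this case.
    </div>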

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="3-2">
   <!-- Contenido de la celda de destino -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">3.2.</span>
    <span style = "color: dark; font-family: charter;">
    Quarterly Concatenation
    </span>
    </h2>

In [None]:
import pandas as pd

# List to store the names of dataframes that meet the criterion of ending in '_2'
dataframes_ending_with_2 = []

# List to store the dataframes to be concatenated
dataframes_to_concatenate = []

# Iterate over the dataframe names in the all_dataframes dictionary
for df_name in all_dataframes.keys():
    # Check if the dataframe name ends with '_2' and add it to the corresponding list
    if df_name.endswith('_2'):
        dataframes_ending_with_2.append(df_name)
        dataframes_to_concatenate.append(all_dataframes[df_name])

# Print the names of dataframes that meet the criterion of ending in '_2'
print("DataFrames ending with '_2' that will be concatenated:")
for df_name in dataframes_ending_with_2:
    print(df_name)

# Concatenate all dataframes in the 'dataframes_to_concatenate' list
if dataframes_to_concatenate:
    # Keep only the GDP rows ('pbi' or 'gdp') from dataframes that contain both sector columns
    gdp_quarterly_growth_rates = pd.concat([df[(df['sectores_economicos'] == 'pbi') | (df['economic_sectors'] == 'gdp')] 
                                for df in dataframes_to_concatenate 
                                if 'sectores_economicos' in df.columns and 'economic_sectors' in df.columns], 
                                ignore_index=True)

    # Put 'year', 'id_ns', and 'date' first, followed by every column that does not start with 'year_'
    columns_to_keep = ['year', 'id_ns', 'date'] + [col for col in gdp_quarterly_growth_rates.columns if not col.startswith('year_')]

    # Keep only the selected columns (the three key columns appear twice at this point)
    gdp_quarterly_growth_rates = gdp_quarterly_growth_rates[columns_to_keep]

    # Drop the 'sectores_economicos' and 'economic_sectors' columns
    gdp_quarterly_growth_rates = gdp_quarterly_growth_rates.drop(columns=['sectores_economicos', 'economic_sectors'])

    # Remove duplicate columns, keeping the first occurrence of each
    gdp_quarterly_growth_rates = gdp_quarterly_growth_rates.loc[:,~gdp_quarterly_growth_rates.columns.duplicated()]

    # Print the number of rows in the concatenated dataframe
    print("Number of rows in the concatenated dataframe:", len(gdp_quarterly_growth_rates))
else:
    print("No dataframes were found to concatenate.")


In [None]:
gdp_quarterly_growth_rates
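
<div style="font-family: charter; text-align: left; color: dark;">
    In the cell above, <code>'year'</code>, <code>'id_ns'</code>, and <code>'date'</code> are listed explicitly and also pass the <code>not col.startswith('year_')</code> filter, so the selection contains them twice until the <code>columns.duplicated()</code> mask removes the repeats. A small sketch on a made-up dataframe shows both stages; the column names and values are hypothetical.
</div>

```python
import pandas as pd

# Toy dataframe, made up for illustration only
toy_quarterly = pd.DataFrame({
    'year': [2020],
    'id_ns': ['ns_01_2020'],
    'date': ['2020-01-09'],
    'year_2019': [2.2],
    '2019_3': [3.0],
})

# Same selection rule as in the quarterly cell
columns_to_keep = ['year', 'id_ns', 'date'] + [
    col for col in toy_quarterly.columns if not col.startswith('year_')
]
selected = toy_quarterly[columns_to_keep]
print(list(selected.columns))
# ['year', 'id_ns', 'date', 'year', 'id_ns', 'date', '2019_3']

# columns.duplicated() flags every repeat after the first occurrence
deduplicated = selected.loc[:, ~selected.columns.duplicated()]
print(list(deduplicated.columns))
# ['year', 'id_ns', 'date', '2019_3']
```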

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="3-3">
   <!-- Target cell content -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">3.3.</span>
    <span style = "color: dark; font-family: charter;">
    Monthly Concatenation
    </span>
    </h2>

In [None]:
import pandas as pd

# List to store the names of dataframes that meet the criterion of ending in '_1'
dataframes_ending_with_1 = []

# List to store the dataframes to be concatenated
dataframes_to_concatenate = []

# Iterate over the dataframe names in the all_dataframes dictionary
for df_name in all_dataframes.keys():
    # Check if the dataframe name ends with '_1' and add it to the corresponding list
    if df_name.endswith('_1'):
        dataframes_ending_with_1.append(df_name)
        dataframes_to_concatenate.append(all_dataframes[df_name])

# Print the names of dataframes that meet the criterion of ending with '_1'
print("DataFrames ending with '_1' that will be concatenated:")
for df_name in dataframes_ending_with_1:
    print(df_name)

# Concatenate all dataframes in the 'dataframes_to_concatenate' list
if dataframes_to_concatenate:
    # Keep only the GDP rows ('pbi' or 'gdp') from dataframes that contain both sector columns
    gdp_monthly_growth_rates = pd.concat([df[(df['sectores_economicos'] == 'pbi') | (df['economic_sectors'] == 'gdp')] 
                                for df in dataframes_to_concatenate 
                                if 'sectores_economicos' in df.columns and 'economic_sectors' in df.columns], 
                                ignore_index=True)

    # Put 'year', 'id_ns', and 'date' first, followed by every column that does not start with 'year_'
    columns_to_keep = ['year', 'id_ns', 'date'] + [col for col in gdp_monthly_growth_rates.columns if not col.startswith('year_')]

    # Keep only the selected columns (the three key columns appear twice at this point)
    gdp_monthly_growth_rates = gdp_monthly_growth_rates[columns_to_keep]

    # Drop the 'sectores_economicos' and 'economic_sectors' columns
    gdp_monthly_growth_rates = gdp_monthly_growth_rates.drop(columns=['sectores_economicos', 'economic_sectors'])

    # Remove duplicate columns, keeping the first occurrence of each
    gdp_monthly_growth_rates = gdp_monthly_growth_rates.loc[:,~gdp_monthly_growth_rates.columns.duplicated()]
    
    # Drop columns with at least two underscores in their names
    columns_to_drop = [col for col in gdp_monthly_growth_rates.columns if col.count('_') >= 2]
    gdp_monthly_growth_rates = gdp_monthly_growth_rates.drop(columns=columns_to_drop)

    # Print the number of rows in the concatenated dataframe
    print("Number of rows in the concatenated dataframe:", len(gdp_monthly_growth_rates))
else:
    print("No dataframes were found to concatenate.")


In [None]:
gdp_monthly_growth_rates['date'].dtype
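
<div style="font-family: charter; text-align: left; color: dark;">
    The last cleaning step of the monthly cell drops every column whose name contains two or more underscores. A small sketch on made-up column names shows which names survive that rule; the names and values below are hypothetical and not taken from the project tables.
</div>

```python
import pandas as pd

# Toy dataframe, made up for illustration only
toy_monthly = pd.DataFrame({
    'id_ns': ['ns_01_2020'],
    'date': ['2020-01-09'],
    '2019_nov': [1.9],
    '2019_ene_dic': [2.2],
    '2019_iv_trim': [1.8],
})

# Same rule as in the monthly cell: two or more underscores means the column is dropped
columns_to_drop = [col for col in toy_monthly.columns if col.count('_') >= 2]
toy_monthly = toy_monthly.drop(columns=columns_to_drop)

print(columns_to_drop)
# ['2019_ene_dic', '2019_iv_trim']
print(list(toy_monthly.columns))
# ['id_ns', 'date', '2019_nov']
```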

<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

<div id="3-4">
   <!-- Target cell content -->
</div>

<h2><span style = "color: rgb(0, 65, 75); font-family: charter;">3.4.</span>
    <span style = "color: dark; font-family: charter;">
    Loading SQL
    </span>
    </h2>

In [None]:
import os
from sqlalchemy import create_engine

# Get environment variables
user = os.environ.get('CIUP_SQL_USER')
password = os.environ.get('CIUP_SQL_PASS')
host = os.environ.get('CIUP_SQL_HOST')
port = 5432
database = 'gdp_revisions_datasets'

# Check if all environment variables are defined
if not all([host, user, password]):
    raise ValueError("Some environment variables are missing (CIUP_SQL_HOST, CIUP_SQL_USER, CIUP_SQL_PASS)")

# Create connection string
connection_string = f"postgresql://{user}:{password}@{host}:{port}/{database}"

# Create SQLAlchemy engine
engine = create_engine(connection_string)

# Save the annual, quarterly, and monthly DataFrames to the database, replacing any existing tables
gdp_annual_growth_rates.to_sql('gdp_annual_growth_rates', engine, index=False, if_exists='replace')
gdp_quarterly_growth_rates.to_sql('gdp_quarterly_growth_rates', engine, index=False, if_exists='replace')
gdp_monthly_growth_rates.to_sql('gdp_monthly_growth_rates', engine, index=False, if_exists='replace')
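
<div style="font-family: charter; text-align: left; color: dark;">
    The connection string above interpolates the user and password directly. If either value contains reserved URL characters such as <code>@</code>, <code>/</code>, or <code>:</code>, the string may be parsed incorrectly, so those values are usually URL-encoded first. A minimal sketch using the standard library's <code>urllib.parse.quote_plus</code> follows; the credentials and host are placeholders, not project values.
</div>

```python
from urllib.parse import quote_plus

# Placeholder credentials for illustration only
user = 'analyst'
password = 'p@ss/word:1'
host = 'localhost'
port = 5432
database = 'gdp_revisions_datasets'

# Encode the parts that may contain reserved URL characters
safe_connection_string = (
    f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"
)
print(safe_connection_string)
# postgresql://analyst:p%40ss%2Fword%3A1@localhost:5432/gdp_revisions_datasets
```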


<div style="color: rgb(61, 48, 162); font-size: 12px;">
    Back to the
    <a href="#outilne" style="color: #687EFF;">
    outline.
    </a>
    </div>

---