<div style="background:#3366FF; color:white; padding:12px; box-sizing:border-box; border-radius:4px;">

</div>

# New GDP Real-Time Dataset 

> **Author:** Jason Cruz  
  **Last updated:** 11/13/2025  
  **Python version:** 3.12  
  **Project:** Rationality and Nowcasting on Peruvian GDP Revisions 

---

## 📌 Summary
Welcome to the **Peruvian GDP Real-Time Dataset (RTD)** construction notebook! This notebook will guide you through the **step-by-step process** of creating your own RTD using GDP revisions from the **Central Reserve Bank of Peru** (BCRP). Whether you are a researcher, policymaker, or analyst, this notebook helps you construct real-time data of monthly GDP growth for Peru, starting from scratch.

### What will this notebook help you achieve?
1. **Downloading PDFs** from the BCRP Weekly Reports (WR).
2. **Generating PDF inputs** by shortening them to focus on key pages containing GDP growth rate tables.
3. **Cleaning up extracted data** so that it is usable, and building the RTD.
4. **Concatenating RTD** from different years and frequencies (monthly, quarterly, annual).
5. **Updating metadata** to record base-year changes and other revision-related information.
6. **Converting RTD** to releases dataset for econometric analysis.

🌐 **Main Data Source:** [BCRP Weekly Report](https://www.bcrp.gob.pe/publicaciones/nota-semanal.html) (📰 WR, from here on)  
For any questions or issues, feel free to reach out via email: [Jason 📨](mailto:jj.cruza@up.edu.pe)

---

### ⚙️ Initial Set-up

Before preprocessing the new GDP releases data, we need to perform some initial set-up steps:

1. 🧰 **Import helper functions** from `gdp_rtd_pipeline.py` that are required for this notebook.
2. 🛢️ **Connect to the PostgreSQL database** that will contain GDP revisions datasets. _(This step is pending: direct access will be provided via ODBC or other methods, allowing users to connect from any software or programming language.)_
3. 📂 **Create necessary folders** to store inputs, outputs, logs, and screenshots.


> 🚧 Although the second step (database connection) is pending, the notebook currently works with **flat files (CSV)**. These CSV files are listed in `.gitignore`, so they are **not pushed to GitHub** and no data is stored publicly. The notebook **generates the CSV files automatically**, so users get direct access to the dataset on their own systems and can save it locally for further use.

### 🧰 Import helper functions

This notebook relies on a set of helper functions found in the script `gdp_rtd_pipeline.py`. These functions will be used throughout the notebook, so please ensure you have them ready by running the line of code below.

In [1]:
from gdp_rtd_pipeline import *

pygame 2.5.2 (SDL 2.28.3, Python 3.12.1)
Hello from the pygame community. https://www.pygame.org/contribute.html


> 🛠️ **Libraries:** Before you begin, please ensure that you have the required libraries installed and imported. See all the libraries you need section by section in `gdp_rtd_pipeline.py`.

In [2]:
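# Note: `os` ships with Python's standard library, so it does not need to be installed with pip.
#!pip install tabula-py PyMuPDF PyPDF2 webdriver-manager pygame  # Example only: uncomment to install third-party libraries mentioned in this notebook if they are missing.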

**Check out Python information**

In [3]:
import sys
import platform

print("🐍 Python Information")
print(f"  Version  : {sys.version.split()[0]}")
print(f"  Compiler : {platform.python_compiler()}")
print(f"  Build    : {platform.python_build()}")
print(f"  OS       : {platform.system()} {platform.release()}")

🐍 Python Information
  Version  : 3.12.1
  Compiler : MSC v.1916 64 bit (AMD64)
  Build    : ('main', 'Jan 19 2024 15:44:08')
  OS       : Windows 10


### 📂 Create necessary folders

We will start by creating the necessary folders to store the data at various stages of processing. The following code ensures all required directories exist, and if not, it creates them.

In [4]:
import os  # Used below for os.path.join and os.makedirs.
from pathlib import Path  # Path handles file and directory paths in a cross-platform way.

# Get current working directory
PROJECT_ROOT = Path.cwd()  # Get the current working directory where the notebook is being executed.

# User input for folder location
user_input = input("Enter relative path (default='.'): ").strip() or "."  # Prompt user to input the folder path or use the default value "."
target_path = (PROJECT_ROOT / user_input).resolve()  # Combine the project root directory with user input to get the full target path.

# Create the necessary directories if they don't already exist
target_path.mkdir(parents=True, exist_ok=True)  # Creates the target folder and any necessary parent directories.
os.chdir(target_path)  # Work inside the target folder so that the relative folder names below are created there.
print(f"Using path: {target_path}")  # Print out the path being used for confirmation.

# Define paths for saving data and PDFs
pdf_folder = 'new_weekly_reports'  # This folder will store the new Weekly Reports (post-2013), which are in PDF format.
raw_pdf_subfolder = os.path.join(pdf_folder, 'raw')  # Subfolder for saving the raw PDFs exactly as downloaded from the BCRP website.
input_pdf_subfolder = os.path.join(pdf_folder, 'input')  # Subfolder for saving reduced PDFs that contain only the selected pages with GDP growth tables.

data_folder = 'data'  # Main folder for storing all data files.
input_data_subfolder = os.path.join(data_folder, 'input')  # Folder for storing preprocessed data throughout all periods (NEW+OLD data).
output_data_subfolder = os.path.join(data_folder, 'output')  # Folder for storing final RTD datasets and releases after processing.

# Create all folders if they don't exist yet
for folder in [pdf_folder, raw_pdf_subfolder, input_pdf_subfolder, data_folder, input_data_subfolder, output_data_subfolder]:
    os.makedirs(folder, exist_ok=True)  # Create each folder in the list if it doesn't already exist.
    print(f"📂 {folder} created")  # Print confirmation for each folder created.

# Additional folders for metadata, records, and alert tracking
metadata_folder = 'metadata'  # Folder for storing metadata files like wr_metadata.csv.
record_folder = 'record'  # Folder for storing .txt files that track the files already processed to avoid reprocessing them.
alert_track_folder = 'alert_track'  # Folder for saving download notifications and alerts.

# Create additional required folders
for folder in [metadata_folder, pdf_folder, input_pdf_subfolder, record_folder]:
    os.makedirs(folder, exist_ok=True)  # Create the additional folders if they don't exist.
    print(f"📂 {folder} created")  # Print confirmation for each of these additional folders.
os.makedirs(alert_track_folder, exist_ok=True)  # Also create the alert folder, which the downloader uses later but the list above omits.


Enter relative path (default='.'):  .


Using path: C:\Users\Jason Cruz\OneDrive\Documentos\RA\CIUP\GDP Revisions\GitHub\peru_gdp_revisions\gdp_revisions_datasets
📂 new_weekly_reports created
📂 new_weekly_reports\raw created
📂 new_weekly_reports\input created
📂 data created
📂 data\input created
📂 data\output created
📂 metadata created
📂 new_weekly_reports created
📂 new_weekly_reports\input created
📂 record created


---

## 1. Downloading PDFs

---

The **BCRP Weekly Report** is our primary source of data collection for constructing the Peruvian GDP Real-Time Dataset (RTD). This report, published weekly by the **Central Reserve Bank of Peru (BCRP)**, is an official document that contains critical macroeconomic statistics, including GDP growth rates.

The two main tables we focus on in this project are:
- **Table 1:** Monthly GDP growth rates (real GDP, 12-month percentage changes)
- **Table 2:** Quarterly/Annual GDP growth rates (real GDP, 12-month percentage changes)

This section automates the process of downloading the **BCRP Weekly Report PDFs** directly from the official BCRP website, ensuring that we can collect the most up-to-date data for our analysis.

---

### 🛠️ What the Scraper Bot Does:

1. **Opens the official BCRP Weekly Report page** at [this link](https://www.bcrp.gob.pe/publicaciones/nota-semanal.html).
2. **Finds and collects all PDF links** for the reports.
3. **Downloads the PDFs** in chronological order (from oldest to newest).
4. Optionally, plays a **notification sound** after every batch of downloads.
5. **Organizes** the downloaded PDFs into year-based folders.
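
Steps 2 and 3 in the list above can be pictured with a minimal sketch. It is not the code in `gdp_rtd_pipeline.py`: the helper names `pending_pdf_links` and `batched` are hypothetical, the links are made up for illustration, and the `.pdf` filter is an assumption this sketch adds. It keeps only links that end in `.pdf`, skips file names that are already in the download record, and groups the rest into batches.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def pending_pdf_links(links, already_downloaded):
    """Keep links that point to a .pdf file not yet in the download record."""
    pending = []
    for link in links:
        name = PurePosixPath(urlparse(link).path).name  # e.g. 'ns-27-2024.pdf'
        if name.lower().endswith(".pdf") and name not in already_downloaded:
            pending.append(link)
    return pending


def batched(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical links, for illustration only
links = [
    "https://example.org/docs/ns-27-2024.pdf",
    "https://example.org/docs/ns-31-2024.pdf",
    "https://example.org/forms/viewform",  # not a PDF, so this sketch drops it
    "https://example.org/docs/ns-35-2024.pdf",
]
record = {"ns-27-2024.pdf"}  # names already listed in the download record

to_download = pending_pdf_links(links, record)
batches = batched(to_download, batch_size=6)
```

Here `batch_size` plays the role of `downloads_per_batch`: after each batch, a real downloader could pause and ask whether to continue.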

---

#### ⚠️ Important Notes:

- **CAPTCHA Handling**: If a CAPTCHA appears during the download process, you'll need to manually solve it in the browser window and then **re-run the Scraper Bot**. The Scraper Bot cannot bypass CAPTCHA verification.
  
- **Automatic WebDriver Management**: This script uses `webdriver-manager` to automatically handle browser drivers (by default, it uses Chrome). **No need to manually download ChromeDriver or GeckoDriver**. If you wish to use a different browser, you can modify the `browser` parameter in the `init_driver()` function.
  
- **Custom Notification Sound**: If you'd like to receive notifications when each batch of downloads finishes, you can place your own MP3 file in the `alert_track` folder. A default alert track in .mp3 format is provided on GitHub, but here are some free sources of .mp3 files if you prefer to choose your own:
  - [Pixabay Audio](https://pixabay.com/music/) 🎵
  - [FreeSound](https://freesound.org/) 🎶
  - [FreePD](https://freepd.com/) 🎼

---

### 📥 Scraper Bot for BCRP Weekly Reports

In [5]:
# Run the function to start the scraper bot
pdf_downloader(
    bcrp_url = "https://www.bcrp.gob.pe/publicaciones/nota-semanal.html",  # URL of the BCRP Weekly Report
    raw_pdf_folder = raw_pdf_subfolder,  # Folder to save the raw downloaded PDFs
    download_record_folder = record_folder,  # Folder to store download logs
    download_record_txt = '1_downloaded_pdfs.txt',  # Record of downloaded PDFs
    alert_track_folder = alert_track_folder,  # Folder for MP3 alert sound
    max_downloads = 60,  # Maximum number of PDFs to download
    downloads_per_batch = 6,  # Number of PDFs to download per batch
    headless = False  # Run in browser window (set to True for headless mode)
)


📥 Starting PDF downloader for BCRP WR...

🌐 BCRP site opened successfully.
🔎 Found 155 WR blocks on page (one per month).

1. ✔️ Downloaded: ns-27-2024.pdf
⏳ Waiting 8.63 seconds...
2. ✔️ Downloaded: ns-31-2024.pdf
⏳ Waiting 8.66 seconds...
3. ✔️ Downloaded: ns-35-2024.pdf
⏳ Waiting 9.36 seconds...
4. ✔️ Downloaded: ns-39-2024.pdf
⏳ Waiting 8.15 seconds...
5. ✔️ Downloaded: ns-43-2024.pdf
⏳ Waiting 6.07 seconds...
6. ✔️ Downloaded: ns-47-2024.pdf


⏸️ Continue? (y = yes, any other key = stop):  y


⏳ Waiting 7.44 seconds...
7. ✔️ Downloaded: ns-04-2025.pdf
⏳ Waiting 8.28 seconds...
8. ✔️ Downloaded: ns-08-2025.pdf
⏳ Waiting 9.49 seconds...
9. ✔️ Downloaded: ns-11-2025.pdf
⏳ Waiting 5.56 seconds...
10. ✔️ Downloaded: ns-14-2025.pdf
⏳ Waiting 6.19 seconds...
11. ✔️ Downloaded: ns-18-2025.pdf
⏳ Waiting 9.30 seconds...
12. ✔️ Downloaded: ns-22-2025.pdf


⏸️ Continue? (y = yes, any other key = stop):  y


⏳ Waiting 7.39 seconds...
13. ✔️ Downloaded: ns-26-2025.pdf
⏳ Waiting 6.32 seconds...
14. ✔️ Downloaded: ns-30-2025.pdf
⏳ Waiting 8.11 seconds...
15. ✔️ Downloaded: ns-34-2025.pdf
⏳ Waiting 7.25 seconds...
16. ✔️ Downloaded: ns-40-2025.pdf
⏳ Waiting 9.33 seconds...
17. ✔️ Downloaded: viewform
⏳ Waiting 8.52 seconds...

👋 Browser closed.

📊 Summary:

🔗 Total monthly links kept: 155
🗂️ 138 already downloaded PDFs were skipped.
➕ Newly downloaded: 17
⏱️ 247 seconds


### 🗂️ Organize Downloaded PDFs

After downloading the PDFs, it is essential to organize them into year-based folders to keep everything structured. This will help in later stages of data extraction and cleaning.
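
As a rough picture of what organizing by year involves, the sketch below moves files named like `ns-27-2024.pdf` into a folder named after the year in the file name. It is not the `organize_files_by_year()` implementation: `organize_by_year_sketch` is a hypothetical helper, and it assumes the naming pattern `ns-<issue>-<year>.pdf`. Files that do not match the pattern are left where they are.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Assumed naming pattern: ns-<issue>-<year>.pdf
WR_NAME = re.compile(r"^ns-\d{2}-(\d{4})\.pdf$")


def organize_by_year_sketch(folder):
    """Move each matching PDF in `folder` into a subfolder named after its year."""
    folder = Path(folder)
    moved = []
    for path in sorted(folder.iterdir()):
        match = WR_NAME.match(path.name)
        if path.is_file() and match:
            year_dir = folder / match.group(1)
            year_dir.mkdir(exist_ok=True)
            shutil.move(str(path), str(year_dir / path.name))
            moved.append(path.name)
    return moved


# Demo in a temporary folder so that no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    for name in ["ns-47-2024.pdf", "ns-04-2025.pdf", "notes.txt"]:
        (Path(tmp) / name).write_bytes(b"")
    moved = organize_by_year_sketch(tmp)
    layout = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*") if p.is_file()
    )
```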

Run the following code to organize the downloaded PDFs. It'll happen in the blink of an eye.

In [None]:
# Get the list of files in the directory
files = os.listdir(raw_pdf_subfolder)

# Call the function to organize files by year
organize_files_by_year(raw_pdf_subfolder)

### 🔧 Handling Defective PDFs

Occasionally, you may encounter defective PDFs (e.g., corrupted files or incomplete downloads). In such cases, you can replace them with new, valid ones.

🔄 Replace Defective PDFs:

Use the function below to replace any defective PDFs. Just specify the year, the name of the defective PDF, and the PDF that you want to use as its replacement.

In [None]:
# Replace specific defective PDFs (friendly outputs with icons)
replace_defective_pdfs(
    items=[
        ("2017", "ns-08-2017.pdf", "ns-07-2017"), # Replace a defective PDF in 2017 folder
        ("2019", "ns-23-2019.pdf", "ns-22-2019"), # Replace a defective PDF in 2019 folder
    ],
    root_folder=input_pdf_subfolder,  # Base folder containing year-based folders
    record_folder=record_folder,  # Folder where downloaded PDF logs are stored
    download_record_txt = '1_downloaded_pdfs.txt',  # Log of downloaded PDFs
    quarantine=os.path.join(input_pdf_subfolder, "_quarantine")  # Folder to store defective PDFs (set to None to delete them)
)

> ⚡ **Troubleshooting Tip:** If you encounter any issues during the data cleansing step (section 3), and suspect that the problem lies with defective PDFs, you can replace those PDFs using the above function. This will help avoid errors in the following sections. In case you encounter a problem with any particular defective PDF, you can also download alternative versions of the Weekly Reports for the same month, and replace the faulty ones as needed.

#### 🧩 Key Takeaways
- Downloading PDFs: The scraper bot automates the process of collecting the latest BCRP Weekly Reports.
- Organizing PDFs: After downloading, the PDFs are organized by year to make further processing easier.
- Replacing Defective PDFs: If any PDFs are corrupted or incomplete, you can replace them with valid ones to ensure clean data.

> 🚀 **Next Steps**: With the PDFs downloaded, organized, and ready for use, we can move on to trimming them down to the pages that contain the key tables. This is covered in the next section of the notebook.

---

## 2. Generating input PDFs with key tables

---

Now that we have successfully downloaded the **BCRP Weekly Reports (WR)**, it is important to note that each PDF file contains over 100 pages. However, not all pages are relevant to this project.

For this analysis, we only need a **few key pages** from each WR:
- **Table 1**: Monthly real GDP growth (12-month percentage changes)
- **Table 2**: Annual and quarterly real GDP growth

The goal of this section is to **trim the PDFs**, retaining just the necessary pages for analysis: the key tables and the cover page that provides the publication date and serial number for identification.

The following steps will guide you through the process of generating these trimmed PDF files.

---

### 🛠️ What This Step Does:

1. **Extracts key pages** from each WR, focusing on the pages that contain **Table 1** and **Table 2**.
2. **Retains the cover page** that provides metadata, such as publication date and serial number.
3. **Creates new PDFs** containing only the relevant pages, ensuring efficiency by reducing file sizes.
4. Organizes these **trimmed PDFs** into year-based subfolders for easy access.

---

#### ⚙️ How the Code Works

In this section, we use a combination of `PyMuPDF` and `PyPDF2` to handle PDF file manipulation. Here's a breakdown of the core steps:

1. **Keyword Search:** The function `search_keywords()` scans each PDF to find pages containing the specified keywords (in this case, "ECONOMIC SECTORS"), which helps us locate the relevant tables.
2. **PDF Trimming:** The function `shortened_pdf()` creates a new PDF containing only the selected pages. If the trimmed PDF contains 4 pages, we retain only the 1st and 3rd pages, which typically hold the key GDP tables.
3. **Tracking Processed PDFs:** The function `read_input_pdf_files()` reads the record of previously processed PDFs, ensuring that we do not reprocess the same file. The function `write_input_pdf_files()` updates the record with new files, ensuring that the workflow is deterministic.
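
The page-selection logic described in steps 1 and 2 can be sketched without any PDF library by working on page texts that have already been extracted. This is an illustration only: `find_keyword_pages` and `apply_four_page_rule` are hypothetical names, not the helpers in `gdp_rtd_pipeline.py`, and the page texts are invented. How the cover page is added to the result is not shown here.

```python
def find_keyword_pages(page_texts, keywords):
    """Return 0-based indices of pages whose text contains any keyword."""
    hits = []
    for index, text in enumerate(page_texts):
        upper = text.upper()
        if any(keyword.upper() in upper for keyword in keywords):
            hits.append(index)
    return hits


def apply_four_page_rule(pages):
    """If exactly four pages were selected, keep only the 1st and 3rd."""
    if len(pages) == 4:
        return [pages[0], pages[2]]
    return pages


# Invented page texts standing in for text extracted from a Weekly Report
page_texts = [
    "Cover page",
    "Monetary accounts",
    "GDP by ECONOMIC SECTORS (monthly)",
    "GDP by ECONOMIC SECTORS (monthly, index)",
    "GDP by ECONOMIC SECTORS (quarterly)",
    "GDP by ECONOMIC SECTORS (quarterly, index)",
]

selected = find_keyword_pages(page_texts, ["ECONOMIC SECTORS"])
kept = apply_four_page_rule(selected)
```

In this invented example four pages match the keyword, so the rule keeps the first and third of them.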

---

### ✂️ Run the Code to Generate Trimmed PDFs

The following function extracts the relevant pages from each raw WR PDF, creating a shortened version that contains only the key tables and metadata.

In [None]:
# Run the function to generate trimmed PDFs for input
pdf_input_generator(
    raw_pdf_folder = raw_pdf_subfolder,  # Folder containing raw WR PDFs
    input_pdf_folder = input_pdf_subfolder,  # Folder to store the shortened PDFs
    input_pdf_record_folder = record_folder,  # Folder to store the record of generated PDFs
    input_pdf_record_txt = '2_generated_input_pdfs.txt',  # Record file name
    keywords = ["ECONOMIC SECTORS"]  # Keywords to help find relevant pages
)

This code processes the raw WR PDFs, extracts the pages containing the key tables, and stores them in the designated input PDF folder.

### 📂 Organizing Trimmed PDFs

After generating the trimmed PDFs, it's essential to organize them into subfolders based on the year of publication. This makes it easier to locate and manage the files in future steps.

The following code sorts the trimmed PDFs into year-based subfolders:

In [None]:
# Get the list of files in the directory
files = os.listdir(input_pdf_subfolder)

# Call the function to organize files by year
organize_files_by_year(input_pdf_subfolder)

This will ensure that each trimmed WR PDF is placed into its respective year folder, making it simple to access data from specific years.

#### 🚀 Moving Forward

With the PDFs now trimmed and organized, we can proceed to the next steps in the data extraction and cleaning process.

Trimming makes large numbers of PDFs much more efficient to handle, because each file now contains only the pages with the data we need.

#### 🧩 Key Takeaways

- The trimmed PDFs will now contain only the relevant pages, making them easier to handle and faster to process.
- The PDFs are organized by year for easy access and management.
- Efficient processing: The record of processed files ensures that no data is reprocessed, saving time and resources.
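
The record mechanism in the last bullet can be sketched as a plain text file with one processed file name per line. The helper names `read_record` and `append_record` are hypothetical stand-ins for the pipeline's own functions, and the one-name-per-line layout is an assumption about the record format.

```python
import tempfile
from pathlib import Path


def read_record(record_path):
    """Return the set of file names listed in the record (empty if it is missing)."""
    record_path = Path(record_path)
    if not record_path.exists():
        return set()
    lines = record_path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def append_record(record_path, names):
    """Append newly processed file names to the record, one per line."""
    with Path(record_path).open("a", encoding="utf-8") as handle:
        for name in names:
            handle.write(f"{name}\n")


with tempfile.TemporaryDirectory() as tmp:
    record_path = Path(tmp) / "2_generated_input_pdfs.txt"
    candidates = ["ns-04-2025.pdf", "ns-08-2025.pdf"]

    # First run: nothing is recorded yet, so every candidate is processed
    done = read_record(record_path)
    first_run = [name for name in candidates if name not in done]
    append_record(record_path, first_run)

    # Second run: both names are now in the record, so nothing is reprocessed
    done = read_record(record_path)
    second_run = [name for name in candidates if name not in done]
```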

> 🚀 **Next Steps:** Now that we have our trimmed PDFs, we are ready to move on to the data extraction and cleaning steps, where we will begin working with the key data from these PDFs.

---

## 3. Cleaning tables and building RTD

---

In this section, we will tackle the core task of **extracting and cleaning** the tables required for constructing the **Real-Time Dataset (RTD)**. The input data consists of PDFs containing only the 2 key tables, and our goal is to **extract GDP growth rates data in the most faithful way** from these tables and clean the data for further analysis.

---

### 🧹 Extracting Tables and Data Cleanup

We will use the **`tabula`** library for extracting tables from the PDFs. This library efficiently converts PDF tables into **pandas DataFrames**, which are easier to manipulate and analyze.

For more information on how **`tabula`** works, feel free to check out its [official documentation](https://tabula-py.readthedocs.io/en/latest/).

In this section, we apply several cleaning functions, defined in **`gdp_rtd_pipeline.py`**, to address the challenges of cleaning the extracted tables.

The cleaning process involves **3** main dictionaries:
1. **The raw dictionary**: stores the original tables extracted directly from the PDFs.
2. **The clean dictionary**: contains the fully cleaned tables, ready for analysis.
3. **The vintage dictionary**: contains the tables converted into vintage format, with one entry for every table from every PDF.
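
To make the three-dictionary layout concrete, here is a minimal sketch of how one table could be stored under the same key in each dictionary. The key format follows keys that appear later in this notebook, such as `ns_11_2024_1`. The helper `table_key` is hypothetical, and plain strings stand in for the DataFrames that the real dictionaries hold.

```python
from pathlib import Path


def table_key(pdf_name, table_number):
    """Build a key such as 'ns_11_2024_1' from 'ns-11-2024.pdf' and a table number."""
    stem = Path(pdf_name).stem.replace("-", "_")
    return f"{stem}_{table_number}"


raw, clean, vintages = {}, {}, {}

key = table_key("ns-11-2024.pdf", 1)
raw[key] = "table as extracted from the PDF"   # placeholder for a DataFrame
clean[key] = "table after cleaning"            # placeholder for a DataFrame
vintages[key] = "table in vintage format"      # placeholder for a DataFrame
```

Sharing one key across the three dictionaries is what makes it easy to compare the raw, cleaned, and vintage versions of the same table.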

---

### 🔢 Step-by-Step Breakdown of the Cleaning Process

#### 1. Extraction
We begin by using **`tabula`** to extract the raw tables from the PDF files. These raw tables are then stored in dictionaries for easy access during the cleaning phase.

#### 2. Cleaning
A series of cleaning functions are applied to each table to ensure the data is in a usable format. See the `🚨 Main Issues in Weekly Reports and How We Cleaned Them` subsection below.

#### 3. Data Inspection
After cleaning, we provide a way for users to **inspect the data**. This will allow you to compare the **raw**, **cleaned**, and **vintage** tables by reviewing the output of the raw, clean, and vintage dictionaries. The goal is to visually ensure the quality of the data and confirm that all cleaning steps were applied correctly.

---

### 🚨 Main Issues in Weekly Reports and How We Cleaned Them

The BCRP Weekly Reports (WRs) often present structural inconsistencies that can complicate data extraction. In this section, we focus on resolving key issues that commonly arise, ensuring the data is consistent, usable, and ready for analysis.

Below are specific problems encountered in the reports, along with the corresponding cleaning steps we implemented:

**1. Misaligned Headers**

* **Problem:** The header row, which typically contains sector names or year labels, was sometimes misaligned, especially when certain headers like "SECTORES ECONÓMICOS" were incorrectly placed.
* **Solution:** We used the `swap_nan_se()` function to correct misalignments, ensuring the header "SECTORES ECONÓMICOS" is placed in the correct column.

> `d = swap_nan_se(d)                       # Align 'SECTORES ECONÓMICOS' header properly`

**2. Mixed Header Patterns**

* **Problem:** Some columns had combined headers, such as "Sector. Subsector," leading to confusion when analyzing data.
* **Solution:** The `split_column_by_pattern()` function was used to split these combined headers into separate, meaningful columns for easier analysis.

> `d = split_column_by_pattern(d)           # Separate combined header values like "Sector. Subsector"`

**3. Missing or Irregular Year Labels**

* **Problem:** Year labels (e.g., "2019", "2020") were either missing or misaligned across columns.
* **Solution:** We implemented the `find_year_column()` function to automatically detect and correct year columns, ensuring consistency across the data.

> `d = find_year_column(d)                  # Detect and align year columns automatically`

**4. Extra or Irrelevant Rows and Columns**

* **Problem:** Some tables contained rows or columns with redundant or irrelevant data (such as placeholders or completely missing values).
* **Solution:** We used functions like `drop_nan_rows()`, `drop_nan_columns()`, and `drop_rare_caracter_row()` to remove these unwanted entries.

> `d = drop_nan_rows(d)                    # Remove rows where all values are NaN`
>
> `d = drop_nan_columns(d)                 # Drop columns with all NaN values`
>
> `d = drop_rare_caracter_row(d)           # Remove rows with rare characters like '}'`

**5. Mixed Numeric and Text Values**

* **Problem:** Some columns contained mixed content, with text and numeric values in the same column (e.g., "Var. %").
* **Solution:** The `separate_text_digits()` function was used to split the mixed content into separate numeric and text values, making the data easier to analyze.

> `d = separate_text_digits(d)            # Split mixed content (text + numeric) into separate columns`

**6. Formatting and Naming Inconsistencies**

* **Problem:** Sector names and labels had inconsistencies, especially between Spanish and English versions of terms like "services" and "mining."
* **Solution:** We standardized these terms using functions like `replace_services()` and `replace_mineria()` to harmonize the labels across all reports.

> `d = replace_services(d)               # Standardize 'services' naming across sectors`
>
> `d = replace_mineria(d)                # Standardize 'mineria' naming in Spanish sectors`
>
> `d = replace_mining(d)                 # Standardize 'mining' naming in English sectors`

**🧹 Final DataFrame Cleaning**

* After applying the cleaning functions above, we perform a few final steps so that the DataFrame is fully normalized and consistent across all columns:

> `d = clean_columns_values(d)           # Normalize column names and values`
>
> `d = convert_float(d)                  # Convert non-label columns to numeric`
>
> `d = rounding_values(d, decimals=1)    # Round float columns to one decimal place`

These steps ensure that the final dataset is in a format suitable for analysis, with properly cleaned and formatted columns.
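
A few of the cleaning steps above can be illustrated with a dependency-free sketch. It only mirrors the ideas behind `drop_nan_rows()`, `drop_nan_columns()`, `convert_float()`, and `rounding_values()`; it is not their implementation. The pipeline works on pandas DataFrames, whereas this sketch uses plain lists so that it runs anywhere. All helper names and the sample table are invented.

```python
import math


def is_missing(value):
    """True for None and float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_empty_rows(rows):
    """Remove rows in which every cell is missing."""
    return [row for row in rows if not all(is_missing(cell) for cell in row)]


def drop_empty_columns(rows):
    """Remove columns in which every cell is missing (rows must have equal length)."""
    if not rows:
        return rows
    keep = [j for j in range(len(rows[0]))
            if not all(is_missing(row[j]) for row in rows)]
    return [[row[j] for j in keep] for row in rows]


def to_rounded_floats(rows, label_columns=1, decimals=1):
    """Convert non-label cells to rounded floats; unparseable cells become NaN."""
    cleaned = []
    for row in rows:
        values = []
        for cell in row[label_columns:]:
            try:
                values.append(round(float(str(cell)), decimals))
            except ValueError:
                values.append(math.nan)
        cleaned.append(row[:label_columns] + values)
    return cleaned


# Invented table: the first column is the sector label, the rest are growth rates
table = [
    ["Agriculture", "3.14", None, "2.08"],
    [None, None, None, None],
    ["Mining", "-1.27", None, "n.d."],
]

table = drop_empty_columns(drop_empty_rows(table))
table = to_rounded_floats(table)
```

The order matters in this sketch: empty rows and columns are removed first, so the numeric conversion only touches cells that belong to the table.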

### 🧼 Cleaning Process: Code Walkthrough

The following steps are implemented using the functions defined in **`gdp_rtd_pipeline.py`**:

#### Table ❶: Monthly Data into Row-Based Vintage Format
The first table we clean is **Table 1**, which contains monthly growth data. The goal here is to transform the table into a **row-based vintage format**, ensuring that each record corresponds to a specific observation (row) with relevant vintage and period information.

In [None]:
# Cleaning Table 1 (monthly growth data) into row-based vintage format
raw_1, clean_1, vintages_1 = new_table_1_cleaner(
    input_pdf_folder=input_pdf_subfolder,
    record_folder=record_folder,
    record_txt='3_created_new_rtd_tab_1.txt',
    persist=True,
    persist_folder=input_data_subfolder,
    pipeline_version="s3.0.0",
)

In [None]:
# Check the structure of raw, clean, and vintage data
raw_1.keys()
clean_1.keys()
vintages_1.keys()

In [None]:
# Inspecting the cleaned table for a specific vintage
clean_1['ns_11_2024_1']

In this step, we use **`gdp_rtd_pipeline.py`** to clean Table 1 and transform it into a vintage format. We also check the structure of the data and inspect specific vintages (e.g., ns_11_2024_1) to ensure everything is cleaned correctly.

**Checking the cleaning version out**

In [None]:
#version = vintages_1["ns_04_2022_1"]
#print(version.attrs)
# {'pipeline_version': 's3.0.0'}

In [None]:
#vintages_1["ns_04_2022_1"].attrs

#### Table ❷: Quarterly/Annual Data into Row-Based Vintage Format

Similarly, we clean Table 2, which contains quarterly and annual growth data, and transform it into the same vintage format for consistency. As with `new_table_1_cleaner()` for Table 1, we use `new_table_2_cleaner()` to clean Table 2, and we again inspect the data for correctness.

In [None]:
# Cleaning Table 2 (quarterly/annual growth data) into row-based vintage format
raw_2, clean_2, vintages_2 = new_table_2_cleaner(
    input_pdf_folder=input_pdf_subfolder,
    record_folder=record_folder,
    record_txt='3_created_new_rtd_tab_2.txt',
    persist=True,
    persist_folder=input_data_subfolder,
    pipeline_version="s3.0.0",
)


In [None]:
# Check the structure of raw, clean, and vintage data
raw_2['ns_04_2022_2']
clean_2['ns_04_2022_2']
vintages_2['ns_04_2022_2']

**Checking the cleaning version out**

In [None]:
#version_2 = vintages_2["ns_04_2022_2"]
#print(version_2.attrs)
# {'pipeline_version': 's3.0.0'}

In [None]:
#vintages_2["ns_04_2022_2"].attrs

#### 🧩 Key Takeaways

* Extracting Tables: We used `tabula` to extract Table 1 (monthly GDP growth) and Table 2 (quarterly/annual GDP growth) from each WR.
* Cleaning the Data: The cleaning pipeline addressed issues such as misaligned headers, missing year labels, mixed content, and more, using a series of functions tailored for these specific problems.
* Standardizing the Data: The final cleaned tables were standardized and formatted for easy use in further analysis.

> 🚀 **Next Steps:** With the data now cleaned and converted into vintages, we can proceed to concatenating the RTD across years by frequency. This is covered in the next section.

---

## 4. Concatenating RTD across years by frequency

---

In this section, we focus on **concatenating** the **Real-Time Dataset (RTD)** for **Table 1** (monthly GDP growth) and **Table 2** (quarterly/annual GDP growth) across multiple years. The goal is to create a unified RTD that spans all available years, aligned by **frequency** (monthly, quarterly, and annual).

This process ensures that the data is consistent and ready for further analysis. We also support **saving** the concatenated data into a **persistent format**, such as **CSV** or **Parquet**, for easy access.

---

### 🛠️ What This Step Does:

1. **Concatenates** Table 1 (monthly data) and Table 2 (quarterly/annual data) across years into unified DataFrames.
2. Ensures that all columns are aligned properly by frequency and year.
3. Optionally, **persists** the concatenated data to disk in **CSV** format.
4. Provides a **summary** of the concatenation process, including how many files were processed, skipped, and newly concatenated.

---

#### ⚙️ How the Code Works

1. **Reads CSVs by Year:** The functions read all the CSV files from each year folder (monthly for Table 1, quarterly/annual for Table 2).
2. **Identifies Target Period Columns:** The code checks for columns that represent different time periods (e.g., `tp_YYYYmM` for monthly, `tp_YYYYqN` for quarterly).
3. **Aligns Columns Chronologically:** Target period columns are sorted by year and month (for Table 1) or year and quarter (for Table 2).
4. **Vertical Concatenation:** The individual DataFrames are concatenated vertically, ensuring the data from all years is combined into one unified DataFrame.
5. **Enforces Data Types:** Columns are reindexed to match the final column schema, and the data types are normalized (e.g., converting numeric columns to the correct type).
6. **Optional Persistence:** If the `persist` flag is set to `True`, the concatenated DataFrame is saved to disk in the specified format (CSV or Parquet).
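
Steps 2 and 3 can be sketched as follows. The point of parsing the year and sub-period as integers is that a plain string sort would place `tp_2024m10` before `tp_2024m2`. The function `sort_target_periods` and the non-period column names (`industry`, `vintage`) are hypothetical. The regular expressions assume the `tp_YYYYmM` and `tp_YYYYqN` naming mentioned above, with no zero-padding requirement.

```python
import re

# Assumed column-name patterns for target periods
MONTHLY = re.compile(r"^tp_(\d{4})m(\d{1,2})$")
QUARTERLY = re.compile(r"^tp_(\d{4})q([1-4])$")


def sort_target_periods(columns, pattern):
    """Return the columns matching `pattern`, sorted by (year, sub-period)."""
    matched = []
    for name in columns:
        match = pattern.match(name)
        if match:
            matched.append((int(match.group(1)), int(match.group(2)), name))
    return [name for _, _, name in sorted(matched)]


# Invented column list mixing identifier and target-period columns
columns = ["industry", "vintage", "tp_2024m10", "tp_2024m2", "tp_2023m12", "tp_2024q1"]

monthly = sort_target_periods(columns, MONTHLY)
quarterly = sort_target_periods(columns, QUARTERLY)
```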

---

#### 🧹 Cleaning and Alignment

The process ensures that data from different years is aligned by the following steps:
- Target period columns: All columns corresponding to time periods (e.g., months, quarters) are identified and sorted chronologically.
- Reindexing: Each DataFrame is reindexed to match the full set of target period columns, ensuring that the columns are consistently aligned across years.
- Handling missing data: If any data is missing for specific time periods, it is handled in the concatenation step, either by using NaN or applying a specific data imputation strategy.
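
The reindexing and stacking described above can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the pipeline's code: the pipeline operates on pandas DataFrames, the helper names and sample rows are invented, and missing target periods are filled with NaN here (no imputation is attempted).

```python
import math


def reindex_row(row, id_columns, period_columns):
    """Return a copy of `row` with every expected column, NaN where a period is missing."""
    aligned = {name: row.get(name) for name in id_columns}
    for name in period_columns:
        aligned[name] = row.get(name, math.nan)
    return aligned


def concatenate_years(frames, id_columns, period_columns):
    """Stack the rows of all yearly tables after aligning them to one schema."""
    stacked = []
    for frame in frames:
        for row in frame:
            stacked.append(reindex_row(row, id_columns, period_columns))
    return stacked


# Invented yearly tables (lists of row dictionaries)
rows_2023 = [{"vintage": "2023m12", "tp_2023m10": 0.8}]
rows_2024 = [{"vintage": "2024m1", "tp_2023m10": 0.9, "tp_2023m11": 0.3}]

unified = concatenate_years(
    [rows_2023, rows_2024],
    id_columns=["vintage"],
    period_columns=["tp_2023m10", "tp_2023m11"],
)
```

After alignment every row has the same columns in the same order, so rows from different years can be stacked without shifting values between periods.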

### 🔗 Run the Code to Concatenate RTD

We use two functions: **`concatenate_table_1`** for monthly data (Table 1) and **`concatenate_table_2`** for quarterly/annual data (Table 2).

#### Table ❶: Concatenate Monthly Data (Table 1)

In [None]:
# Concatenate Table 1 (monthly GDP growth data) across years
concatenated_1 = concatenate_table_1(
    input_data_subfolder=input_data_subfolder,  # Path to the input data
    record_folder=record_folder,  # Path to store the processed record
    record_txt="4_concatenated_rtd_tab_1.txt",  # Name of the record file
    persist=True,  # Flag to persist the concatenated output
    persist_folder=output_data_subfolder,  # Folder to save the output
    csv_file_label="monthly_gdp_rtd.csv",  # Custom name for the output file
)

In [None]:
# Check the first 10 rows of the concatenated data
concatenated_1.head(10)

#### Table ❷: Concatenate Quarterly/Annual Data (Table 2)

In [None]:
# Concatenate Table 2 (quarterly and annual GDP growth data) across years
concatenated_2 = concatenate_table_2(
    input_data_subfolder=input_data_subfolder,  # Path to the input data
    record_folder=record_folder,  # Path to store the processed record
    record_txt="4_concatenated_rtd_tab_2.txt",  # Name of the record file
    persist=True,  # Flag to persist the concatenated output
    persist_folder=output_data_subfolder,  # Folder to save the output
    csv_file_label="quarterly_annual_gdp_rtd.csv",  # Custom name for the output file
)

In [None]:
# Check the first 10 rows of the concatenated data
concatenated_2.head(10)

These functions load the per-year CSV files, concatenate them vertically, and return a unified DataFrame with the full RTD.

> ❗❗ **Disclaimer:** Compared with the tables displayed in the supplemental document, the concatenated tables above are transposed so that the dataset can be saved compactly. This avoids excessively wide datasets with too many columns, which most software handles poorly when saving. Transposing the tables back yields the same structure as the ones in the supplemental material.
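As a small illustration of the disclaimer above, the sketch below transposes a toy table. It assumes the stored orientation has one row per vintage and one column per target period; the labels and values are made up for the example.

```python
import pandas as pd

# Stored orientation (assumed): one row per vintage, one column per target period.
stored = pd.DataFrame(
    {"tp_2021m01": [-1.0, -0.9], "tp_2021m02": [None, -4.2]},
    index=pd.Index(["2021m03", "2021m04"], name="vintage"),
)

# Transposing gives one row per target period and one column per vintage.
document_view = stored.T
document_view.index.name = "target_period"
print(document_view)
```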

#### 🧩 Key Takeaways

* Concatenation: The code vertically concatenates the data for Table 1 (monthly) and Table 2 (quarterly/annual) from all years.
* Column Alignment: Ensures that columns representing different time periods are aligned correctly (e.g., months, quarters).
* Persistence: Saves the concatenated data to disk if the persist flag is set to True.
* Data Inspection: You can inspect the first 10 rows of the concatenated data to verify its correctness.

> 🚀 **Next Steps:** Once the data has been concatenated, we can proceed to the next steps in building the real-time GDP dataset (RTD). This involves reshaping the data and creating vintages for further analysis, which will be covered in the upcoming sections.

---

## 5. Metadata

---

In this section, we handle the metadata associated with GDP revisions, which is essential for tracking, understanding, and ensuring the accuracy of our Real-Time Dataset (RTD). Metadata plays a key role in managing and tracking changes in GDP growth estimates over time, providing transparency and consistency for replication and further analysis.

The primary goals in this section are:
1. **Reading and updating metadata**: Extracts revision information from the Weekly Reports (WRs) and keeps it up to date over time.
2. **Base-year adjustments**: Tracks when and how base years are updated, ensuring data integrity. This is useful for adjusting the RTD by removing GDP growth rates affected by base-year changes.
3. **Generating benchmark datasets**: Creates datasets adjusted according to benchmark revision procedures, facilitating accurate comparisons and analysis.

---

### 📅 Revision Calendar

The Revision Calendar is essential for understanding the timing and sequence of GDP updates published by the Central Reserve Bank of Peru (BCRP). The calendar helps track the evolution of GDP estimates, given that initial releases often undergo revisions over time. While the timing of the initial releases is predictable, revisions happen without a formal public schedule, making it difficult to track revisions in real-time without a clear framework.

To address this, we constructed an implicit revision calendar using the information provided by the BCRP's Weekly Reports (WR). We used two main criteria to harmonize and standardize the calendar:

* Chief Resolution No. 316-2003-INEI (see [here](https://www.gob.pe/institucion/inei/normas-legales/2294897-316-2003-inei)) mandates that sectoral offices update their data at least quarterly (March, June, September, December), suggesting monthly revisions.
* Our analysis confirms that revisions are updated at least monthly in the WRs.
  
This revision calendar is essential to construct the RTD, allowing us to define "vintages" (sets of GDP estimates available at a specific time) and track the evolution of these estimates consistently across time.

---

### 🔄 Updating Metadata

The `update_metadata` function is used to read, update, and store the metadata related to the revisions of GDP growth rates. It works by:
1. Reading the existing metadata from a CSV file.
2. Extracting revision data from the BCRP's WR PDFs.
3. Applying base-year adjustments to the new rows based on the provided base-year list.
4. Marking where base-year changes have occurred.
5. Appending the new metadata to the existing records.

> ❗ **Disclaimer:** This is the only file requested externally by users. Therefore, it is the only (plain) file available on GitHub.

**Example: Define a List of Base Years**

In [None]:
# Define the base_year_list for mapping base years (modify or extend this list as needed)
base_year_list = [
    {"year": 1994, "wr": 1, "base_year": 1990},
    {"year": 2000, "wr": 28, "base_year": 1994},
    {"year": 2014, "wr": 11, "base_year": 2007},
    {"year": 2022, "wr": 20, "base_year": 2019},
    # Add more mappings if needed
]
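To make the role of `base_year_list` concrete, the sketch below looks up the base year in force for a given WR and flags the WR where it changes. It is only an illustration: `base_year_for` and `flag_changes` are hypothetical helpers, and the sketch assumes that each entry means "from this year and WR number onward, use this base year", which may differ from how `update_metadata` interprets the list.

```python
from bisect import bisect_right

base_year_list = [
    {"year": 1994, "wr": 1, "base_year": 1990},
    {"year": 2000, "wr": 28, "base_year": 1994},
    {"year": 2014, "wr": 11, "base_year": 2007},
    {"year": 2022, "wr": 20, "base_year": 2019},
]

# Sort the breakpoints so they can be searched by (year, wr).
breaks = sorted((e["year"], e["wr"], e["base_year"]) for e in base_year_list)
keys = [(year, wr) for year, wr, _ in breaks]

def base_year_for(year, wr):
    """Return the base year in force for the WR identified by (year, wr)."""
    pos = bisect_right(keys, (year, wr)) - 1
    if pos < 0:
        raise ValueError(f"No base year defined before {year} WR {wr}")
    return breaks[pos][2]

def flag_changes(reports):
    """Pair each (year, wr) with its base year and a flag marking a change."""
    rows, previous = [], None
    for year, wr in sorted(reports):
        current = base_year_for(year, wr)
        changed = previous is not None and current != previous
        rows.append((year, wr, current, changed))
        previous = current
    return rows

for row in flag_changes([(2014, 10), (2014, 11), (2014, 12)]):
    print(row)
```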

The runner below updates the metadata.

In [None]:
# Call the function to update the metadata
updated_df = update_metadata(
    metadata_folder = metadata_folder,
    input_pdf_folder = input_pdf_subfolder,
    record_folder = record_folder,
    record_txt = "5_weekly_report_metadata.txt",
    wr_metadata_csv = "wr_metadata.csv",
    base_year_list = base_year_list
)

After updating the metadata, you can inspect the last few rows to verify the changes and ensure the revisions were applied correctly.

In [None]:
updated_df.iloc[-10:]   # last 10 rows

### üßΩüìÖ Generating Adjusted RTDs by Removing Revisions Affected by Base Years

In this step, we apply base-year adjustments to the RTD data, marking values that are affected by changes in the base year. This process helps ensure that the dataset reflects the most accurate and up-to-date growth rates.

**Example: Apply Base-Year Sentinel**

In [None]:
base_year_list_2 = [
    "2000m7",   # 1990 -> 1994
    "2014m3",   # 1994 -> 2007
]

The `apply_base_year_sentinel` function applies a sentinel value (e.g., `-999999.0`) to data that is affected by a base-year change. This ensures that the affected data is marked as invalid, making it clear when and where the base-year changes occurred.

In [None]:
# Process both monthly and quarterly GDP files and save them with new names
adjusted_rtd = apply_base_year_sentinel(
    base_year_list=base_year_list_2,
    sentinel=-999999.0,
    output_data_subfolder=output_data_subfolder,
    csv_file_labels=["monthly_gdp_rtd.csv", "quarterly_annual_gdp_rtd.csv"]
)

In [None]:
# Access the processed data (adjusted CSV files)
adjusted_monthly_rtd = adjusted_rtd["by_adjusted_monthly_gdp_rtd.csv"]
adjusted_quarterly_rtd = adjusted_rtd["by_adjusted_quarterly_annual_gdp_rtd.csv"]
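For intuition, here is a minimal sketch of how a sentinel can be written into an RTD and later turned back into missing values. The rule used here is an assumption for illustration only (cells are marked when the vintage is at or after the change date and the target period precedes it); the actual rule inside `apply_base_year_sentinel` is not shown in this notebook. `period_key` and `mark_affected` are hypothetical helpers, and the toy values are made up.

```python
import pandas as pd

SENTINEL = -999999.0

def period_key(label):
    """Turn a label such as '2014m3' into a sortable (year, month) tuple."""
    year, month = label.lower().split("m")
    return int(year), int(month)

def mark_affected(rtd, change, sentinel=SENTINEL):
    """Assumed rule: mark cells where vintage >= change and target period < change."""
    out = rtd.copy()
    cut = period_key(change)
    late_vintage = pd.Series([period_key(v) >= cut for v in out.index], index=out.index)
    for col in out.columns:
        if period_key(col.removeprefix("tp_")) < cut:
            # Only overwrite cells that actually hold a value.
            affected = late_vintage & out[col].notna()
            out.loc[affected, col] = sentinel
    return out

# Rows are vintages, columns are target periods (toy values).
rtd = pd.DataFrame(
    {"tp_2014m01": [4.2, 4.4, 4.3], "tp_2014m03": [None, None, 5.0]},
    index=["2014m2", "2014m3", "2014m5"],
)
marked = mark_affected(rtd, "2014m3")
print(marked)

# Before analysis, the sentinel can be converted back into missing values.
cleaned = marked.replace(SENTINEL, float("nan"))
```

Using an explicit sentinel instead of NaN keeps "not yet published" and "invalidated by a base-year change" distinguishable in the saved CSV.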

### üìêüìä Generating Benchmark RTDs for Revisions Affected by Benchmarking Procedures

The benchmark RTDs are generated by applying the benchmark revision mapping to the real-time GDP data. This process ensures that GDP growth rates are adjusted based on the benchmark revisions, creating datasets that are aligned with the most recent and consistent methods used by statistical agencies.

**Example: Generate Benchmark RTDs**

In [None]:
csv_file_labels = [
    "monthly_gdp_rtd",
    "quarterly_annual_gdp_rtd",
    "by_adjusted_monthly_gdp_rtd",
    "by_adjusted_quarterly_annual_gdp_rtd"
]
benchmark_dataset_csv = [
    "monthly_gdp_benchmark",
    "quarterly_annual_gdp_benchmark",
    "by_adjusted_monthly_gdp_benchmark",
    "by_adjusted_quarterly_annual_gdp_benchmark"
]
record_txt = "5_converted_to_benchmark.txt"

In [None]:
wr_metadata_csv = "wr_metadata.csv"

The `convert_to_benchmark_dataset` function applies the benchmark revision procedure to the real-time GDP data, ensuring that revisions are consistent with the latest updates from the statistical agencies.

In [None]:
processed_datasets = convert_to_benchmark_dataset(
    output_data_subfolder=output_data_subfolder,
    csv_file_labels=csv_file_labels,
    metadata_folder=metadata_folder,
    wr_metadata_csv=wr_metadata_csv,
    record_folder=record_folder,
    record_txt=record_txt,
    benchmark_dataset_csv=benchmark_dataset_csv
)


In [None]:
# Access the processed results
processed_datasets.keys()

In [None]:
processed_datasets['monthly_gdp_benchmark']

### 🧩 Key Takeaways

* Update Metadata: Extract and apply base-year changes, ensuring consistency across the dataset.
* Adjust RTDs: Mark revisions affected by base-year changes using a sentinel value.
* Generate Benchmark RTDs: Apply the benchmark revision mapping to ensure the data is aligned with official procedures.

> 🚀 **Next Steps:** With the updated metadata and adjusted RTDs, we can proceed to the next steps in building the real-time GDP dataset (RTD). These steps will include reshaping the tables and creating vintages for in-depth analysis of GDP growth and revisions over time.

---

## 6. Releases

---

This section is responsible for converting Real-Time GDP (RTD) datasets into releases datasets. The releases dataset is crucial for tracking and analyzing the sequence of GDP revisions. By restructuring the data into a release-based format, we can better map the evolution of GDP estimates in terms of "releases", helping to capture changes and dependence patterns in the statistical analysis.

In this section, we will:

* Convert raw RTD data into release datasets.
* Align non-NaN values for each industry and vintage. The first release of all target periods aligns in the first row, and so on.
* Organize the data by release sequence for each industry and target period.

---

### 🛠️ Converting RTD to Releases Dataset

The `convert_to_releases_dataset` function is designed to transform the RTD data into a format that is structured by release sequence. This function processes each dataset, aligning the non-NaN values for every target period and each industry, while removing any invalid values due to base-year changes.

Key Steps in Conversion:

1. **File Validation:** Ensures that the input and output file lists match in length.
2. **Sorting and Grouping:** The data is sorted by industry, year, and month, ensuring chronological order.
3. **Aligning Non-NaN Values:** For each industry, the function aligns non-NaN values across the target periods (`tp_` columns), creating a sequence of releases.
4. **Removing Invalid Rows:** It drops rows where all target period columns are NaN, ensuring that only valid data is retained.
5. **Reorganization:** The dataset is pivoted, with each industry and release forming new columns.
6. **Final Output:** The dataset is saved into a CSV file for each industry and release sequence.
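The alignment step (steps 3 and 4 above) can be sketched for a single industry as follows. This is a simplified illustration, not the body of `convert_to_releases_dataset`: `align_releases` is a hypothetical helper, the values are made up, and sentinel handling and pivoting across industries are left out.

```python
import pandas as pd

def align_releases(rtd, tp_cols):
    """Shift each target-period column up so its first non-NaN value sits in row 1."""
    aligned = {
        # dropna() keeps chronological order; reset_index renumbers from 0.
        col: rtd[col].dropna().reset_index(drop=True)
        for col in tp_cols
    }
    # Building a frame from Series of unequal length pads the short ones with NaN.
    out = pd.DataFrame(aligned)
    out.index = out.index + 1
    out.index.name = "release"
    # Drop rows in which every target-period column is NaN.
    return out.dropna(how="all")

# Rows are vintages in chronological order (toy values for one industry).
rtd = pd.DataFrame(
    {
        "tp_2021m01": [-1.0, -0.9, -1.1, None],
        "tp_2021m02": [None, -4.2, -4.0, -3.9],
        "tp_2021m03": [None, None, 18.2, None],
    }
)
releases = align_releases(rtd, ["tp_2021m01", "tp_2021m02", "tp_2021m03"])
print(releases)
```

After alignment, row 1 holds the first release of every target period, row 2 the second release, and so on, regardless of the vintage in which each value was originally published.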

---

### 🔍 Data Processing Workflow

1. Input Data: We start with the RTD data, which is stored in CSV files. Each dataset corresponds to a specific frequency (monthly, quarterly, or annual) and includes data for different vintages.

2. Processing:

* Sorting: Data is sorted by industry, year, and month to ensure the chronological order of releases.
* Aligning Releases: Non-NaN values for each industry and vintage are aligned, ensuring that all releases are consistent and in the correct sequence.
* Pivoting: The data is then pivoted to arrange each industry’s releases in separate columns.

3. Output: The converted releases datasets are saved as new CSV files, each named according to its respective label (e.g., `monthly_gdp_releases.csv`, `quarterly_annual_gdp_releases.csv`).

---

### 🔄 Key Concepts and Terminology

* **Industry:** Represents the economic sector (e.g., manufacturing, agriculture) for which GDP growth rates are reported.
* **Vintage:** Refers to the set of GDP estimates available at a specific point in time (e.g., the figures published in a given WR), as defined in the Revision Calendar section.
* **Release:** Each release refers to an updated estimate of GDP for a given target period. These releases are tracked sequentially (first, second, third, etc.) for each industry.
* **Target Period (tp_):** These are the columns representing GDP growth rates for specific periods (e.g., "tp_2021m01" for January 2021).
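When working with `tp_` columns it is often handy to recover the period from the column name. The sketch below parses the monthly pattern shown above ("tp_2021m01"). The quarterly ("tp_2021q1") and annual ("tp_2021") patterns are assumptions about the naming scheme, and `parse_target_period` is a hypothetical helper.

```python
import re

# Monthly labels follow the example in the text; the quarterly and annual
# forms are assumed for illustration.
TP_PATTERN = re.compile(r"^tp_(\d{4})(?:m(\d{1,2})|q([1-4]))?$")

def parse_target_period(column):
    """Split a tp_ column name into (frequency, year, sub-period)."""
    match = TP_PATTERN.match(column)
    if match is None:
        raise ValueError(f"Not a target-period column: {column!r}")
    year, month, quarter = match.groups()
    if month is not None:
        if not 1 <= int(month) <= 12:
            raise ValueError(f"Invalid month in {column!r}")
        return "monthly", int(year), int(month)
    if quarter is not None:
        return "quarterly", int(year), int(quarter)
    return "annual", int(year), None

print(parse_target_period("tp_2021m01"))
```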

**Example: Convert RTD to Releases Dataset**

In [None]:
csv_file_labels = [
    "monthly_gdp_rtd",
    "quarterly_annual_gdp_rtd",
    "by_adjusted_monthly_gdp_rtd",
    "by_adjusted_quarterly_annual_gdp_rtd",
    "monthly_gdp_benchmark",
    "quarterly_annual_gdp_benchmark",
    "by_adjusted_monthly_gdp_benchmark",
    "by_adjusted_quarterly_annual_gdp_benchmark"
]
releases_dataset_csv = [
    "monthly_gdp_releases",
    "quarterly_annual_gdp_releases",
    "by_adjusted_monthly_gdp_releases",
    "by_adjusted_quarterly_annual_gdp_releases",
    "monthly_gdp_benchmark_releases",
    "quarterly_annual_gdp_benchmark_releases",
    "by_adjusted_monthly_gdp_benchmark_releases",
    "by_adjusted_quarterly_annual_gdp_benchmark_releases"
]
record_txt = "6_converted_to_releases.txt"
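Because inputs and outputs are matched by position, the two lists must have the same length (the file-validation step described earlier). A minimal sketch of such a check, using a hypothetical helper `pair_labels` and a shortened pair of lists:

```python
def pair_labels(inputs, outputs):
    """Pair input and output labels, failing early if the lists differ in length."""
    if len(inputs) != len(outputs):
        raise ValueError(
            f"Expected one output per input, got {len(inputs)} inputs "
            f"and {len(outputs)} outputs"
        )
    return dict(zip(inputs, outputs))

mapping = pair_labels(
    ["monthly_gdp_rtd", "quarterly_annual_gdp_rtd"],
    ["monthly_gdp_releases", "quarterly_annual_gdp_releases"],
)
for source, target in mapping.items():
    print(f"{source}.csv -> {target}.csv")
```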

In [None]:
# Run the conversion function
releases_df = convert_to_releases_dataset(
    output_data_subfolder=output_data_subfolder,
    csv_file_labels=csv_file_labels,
    record_folder=record_folder,
    record_txt=record_txt,
    releases_dataset_csv=releases_dataset_csv
)

After running the conversion, you can check the `releases_df` for specific datasets like "monthly_gdp_releases" to verify that the releases data has been processed and organized correctly.

In [None]:
# Display the converted releases dataset for "by_adjusted_monthly_gdp_releases"
releases_df["by_adjusted_monthly_gdp_releases"]

### 🧩 Key Takeaways

* Input Validation: Ensures matching lengths for input and output file lists.
* Sorting and Grouping: Data is sorted by industry, year, and month.
* Release Alignment: Non-NaN values are aligned vertically to form a sequence of releases for each industry.
* Cleaning: Rows in which all target-period columns are NaN are dropped, so only valid data is retained.
* Pivoting and Reshaping: The data is restructured to group all releases by industry and release sequence.
* Saving Results: The final releases dataset is saved as a CSV file.

> 🚀 **Next Steps:** Now that the data has been converted into releases datasets, we can proceed with further analysis, including:
> * Revision analysis: Understanding how GDP estimates evolve over time.
> * Benchmark testing: Comparing the real-time dataset to benchmark revisions.

<div style="background:#3366FF; color:white; padding:12px; box-sizing:border-box; border-radius:4px;">
<b>üèÅ The End</b>
</div>

---
---