# Old GDP Real-Time Dataset

**Author:** Jason Cruz  
**Last updated:** 08/13/2025  
**Python version:** 3.12  
**Project:** Rationality and Nowcasting on Peruvian GDP Revisions  

---
## 📌 Summary
This notebook documents the step-by-step **construction of real-time datasets** for **Peruvian GDP revisions** from 2013 to the present. It covers:

1. **Downloading PDFs** of the Weekly Reports (WR) from the Central Reserve Bank of Peru's website.
2. **Generating PDF inputs** by trimming each report to the key pages that contain the tables with GDP growth rates.
3. **Cleaning up data** extracted from the input PDFs.
4. **Concatenating real-time datasets across years by frequency.**
5. **Storing the real-time datasets (RTD) in SQL** so they are available to users upon request and for further analysis.

🌐 **Main Data Source:** [BCRP Weekly Report](https://www.bcrp.gob.pe/publicaciones/nota-semanal.html) (📰 WR, from here on)  
For questions or issues regarding the code, please email [Jason 📨](mailto:jj.cruza@up.edu.pe)  

---

## 🛠️ Libraries

If any of the required libraries is missing, install it with a command like the one below (shown as an example).

In [None]:
#!pip install <package_name> # Uncomment and replace <package_name> to install a missing library. Standard-library modules such as os need no installation.
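
Before installing anything, it can help to check which packages are actually missing. The sketch below does that with the standard library only. The names in `required` are placeholders chosen for illustration, not this notebook's real dependency list, and `missing_packages` is a hypothetical helper.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Placeholder list: "os" and "pathlib" ship with Python; the last name is deliberately fake.
required = ["os", "pathlib", "a_package_that_does_not_exist_123"]
print(missing_packages(required))
# ['a_package_that_does_not_exist_123']
```

Only the names returned by the helper would need a `pip install`.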

Check the Python environment information

In [1]:
import sys
import platform

print("🐍 Python Information")
print(f"  Version  : {sys.version.split()[0]}")
print(f"  Compiler : {platform.python_compiler()}")
print(f"  Build    : {platform.python_build()}")
print(f"  OS       : {platform.system()} {platform.release()}")

🐍 Python Information
  Version  : 3.12.1
  Compiler : MSC v.1916 64 bit (AMD64)
  Build    : ('main', 'Jan 19 2024 15:44:08')
  OS       : Windows 10


**Import helper functions**

> ⚠️ Please check the script `new_gdp_datasets_functions.py`, which contains all the functions required by this _Jupyter notebook_. The functions there are ordered according to the sections of this notebook.

In [2]:
import os  # used explicitly in the cells below; do not rely on the star import to provide it

from gdp_rtd_pipeline import *

pygame 2.5.2 (SDL 2.28.3, Python 3.12.1)
Hello from the pygame community. https://www.pygame.org/contribute.html


## ⚙️ Initial set-up

Before preprocessing new GDP releases data, we will:

* **Create necessary folders** for storing inputs, outputs, logs, and screenshots.
* **Connect to the PostgreSQL database** containing GDP revisions datasets.
* **Import helper functions** from `new_gdp_datasets_functions.py`.

**Create necessary folders**

In [3]:
from pathlib import Path

PROJECT_ROOT = Path.cwd()
user_input = input("Enter relative path (default='.'): ").strip() or "."
target_path = (PROJECT_ROOT / user_input).resolve()
target_path.mkdir(parents=True, exist_ok=True)
print(f"📂 Using path: {target_path}")

Enter relative path (default='.'):  .


📂 Using path: C:\Users\Jason Cruz\OneDrive\Documentos\RA\CIUP\GDP Revisions\GitHub\peru_gdp_revisions\gdp_revisions_datasets


<h1><span style = "color: rgb(0, 65, 75); font-family: PT Serif Pro Book;; color: dark;">2.</span> <span style = "color: dark; font-family: PT Serif Pro Book;">Data cleaning</span></h1>

In [4]:
# Define base folder for saving all digital PDFs
pdf_folder = 'pdf'

# Define subfolder for saving the original PDFs as downloaded from the BCRP website
raw_pdf_subfolder = os.path.join(pdf_folder, 'raw')

# Define subfolder for saving reduced PDFs containing only selected pages with GDP growth tables (monthly, quarterly, and annual frequencies)
input_pdf_subfolder = os.path.join(pdf_folder, 'input')

# Define folder for saving .txt files with download and dataframe record
record_folder = 'record'

# Define folder for saving warning bells. This is for download notifications (see section 1).
alert_track_folder = 'alert_track'

# Create all required folders (if they do not already exist) and confirm creation
for folder in [pdf_folder, raw_pdf_subfolder, input_pdf_subfolder, record_folder, alert_track_folder]:
    os.makedirs(folder, exist_ok=True)
    print(f"📂 {folder} created")

📂 pdf created
📂 pdf\raw created
📂 pdf\input created
📂 record created
📂 alert_track created
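
Note that the loop above prints "created" even when a folder already exists, because `exist_ok=True` silently accepts existing folders. A minimal sketch of a variant that distinguishes the two cases follows. The helper name `ensure_folder` is hypothetical and not part of the project's helper module, and the demo runs in a temporary directory so it does not touch the project tree.

```python
import tempfile
from pathlib import Path


def ensure_folder(path):
    """Create `path` if needed; return True only when it was newly created."""
    path = Path(path)
    existed = path.is_dir()
    path.mkdir(parents=True, exist_ok=True)
    return not existed


with tempfile.TemporaryDirectory() as tmp:
    # The third entry repeats the second one to show the "already exists" branch.
    for folder in ["pdf", "pdf/raw", "pdf/raw"]:
        created = ensure_folder(Path(tmp) / folder)
        print(f"{folder}: {'created' if created else 'already exists'}")
```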


In [5]:
# Define base folder containing the input CSVs extracted from old Weekly Reports
old_wr_folder = 'old_wr'

# Define subfolder for the Table 1 input CSVs
old_wr_subfolder_1 = os.path.join(old_wr_folder, 'table_1')

# Define subfolder for the Table 2 input CSVs
old_wr_subfolder_2 = os.path.join(old_wr_folder, 'table_2')

# Define base folder for saving vintages data (.csv)
data_folder = 'data'

# Define subfolder for saving input data (the cleaners below persist their results here)
input_data_subfolder = os.path.join(data_folder, 'input')

# Define subfolder for saving output data
output_data_subfolder = os.path.join(data_folder, 'output')

# Create all required folders (if they do not already exist) and confirm creation
for folder in [old_wr_folder, old_wr_subfolder_1, old_wr_subfolder_2, data_folder, input_data_subfolder, output_data_subfolder]:
    os.makedirs(folder, exist_ok=True)
    print(f"📂 {folder} created")

📂 old_wr created
📂 old_wr\table_1 created
📂 old_wr\table_2 created
📂 data created
📂 data\input created
📂 data\output created


In [None]:
old_raw_1, old_clean_1, old_vintages_1 = old_table_1_cleaner(
    input_csv_folder = old_wr_subfolder_1,
    record_folder = record_folder,
    record_txt = 'OLD_created_vintages_tab_1.txt',
    persist = True,
    persist_folder = input_data_subfolder,
    pipeline_version = "s3.0.0",
)

In [None]:
old_raw_1.keys()

In [None]:
old_clean_1.keys()

In [None]:
old_vintages_1.keys()

In [None]:
old_raw_1['ns_51_1995_1']

In [None]:
old_clean_1['ns_51_1995_1']

In [None]:
old_vintages_1['ns_13_1995_1']

# Checking the cleaning pipeline version

In [None]:
df100 = old_vintages_1["ns_04_1995_1"]
print(df100.attrs)
# {'pipeline_version': 's3.0.0'}


In [None]:
old_vintages_1["ns_04_1995_1"].attrs

### 2.1.2. Table 2. Extraction and cleaning of data from tables on quarterly and annual real GDP growth rates.

In [6]:
old_raw_2, old_clean_2, old_vintages_2 = old_table_2_cleaner(
    input_csv_folder = old_wr_subfolder_2,
    record_folder = record_folder,
    record_txt = 'OLD_created_vintages_tab_2.txt',
    persist = True,
    persist_folder = input_data_subfolder,
    pipeline_version = "s3.0.0",
)


🧹 Starting Table 2 cleaning...

📂 Processing Table 2 in 2010

📂 Processing Table 2 in 2011

📂 Processing Table 2 in 2012

⏩ 156 cleaned tables already generated for years: 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009

📊 Summary:

📂 16 folders (years) found containing input CSVs
🗃️ Already cleaned tables: 157
✨ Newly cleaned tables: 35
⏱️ 1 seconds


In [8]:
old_raw_2.keys()

dict_keys(['ns_07_2010_2', 'ns_12_2010_2', 'ns_16_2010_2', 'ns_20_2010_2', 'ns_24_2010_2', 'ns_28_2010_2', 'ns_32_2010_2', 'ns_36_2010_2', 'ns_41_2010_2', 'ns_45_2010_2', 'ns_49_2010_2', 'ns_04_2011_2', 'ns_08_2011_2', 'ns_12_2011_2', 'ns_16_2011_2', 'ns_20_2011_2', 'ns_24_2011_2', 'ns_28_2011_2', 'ns_32_2011_2', 'ns_37_2011_2', 'ns_41_2011_2', 'ns_45_2011_2', 'ns_49_2011_2', 'ns_04_2012_2', 'ns_08_2012_2', 'ns_13_2012_2', 'ns_16_2012_2', 'ns_20_2012_2', 'ns_25_2012_2', 'ns_28_2012_2', 'ns_33_2012_2', 'ns_37_2012_2', 'ns_41_2012_2', 'ns_45_2012_2', 'ns_49_2012_2'])

In [9]:
old_clean_2.keys()

dict_keys(['ns_07_2010_2', 'ns_12_2010_2', 'ns_16_2010_2', 'ns_20_2010_2', 'ns_24_2010_2', 'ns_28_2010_2', 'ns_32_2010_2', 'ns_36_2010_2', 'ns_41_2010_2', 'ns_45_2010_2', 'ns_49_2010_2', 'ns_04_2011_2', 'ns_08_2011_2', 'ns_12_2011_2', 'ns_16_2011_2', 'ns_20_2011_2', 'ns_24_2011_2', 'ns_28_2011_2', 'ns_32_2011_2', 'ns_37_2011_2', 'ns_41_2011_2', 'ns_45_2011_2', 'ns_49_2011_2', 'ns_04_2012_2', 'ns_08_2012_2', 'ns_13_2012_2', 'ns_16_2012_2', 'ns_20_2012_2', 'ns_25_2012_2', 'ns_28_2012_2', 'ns_33_2012_2', 'ns_37_2012_2', 'ns_41_2012_2', 'ns_45_2012_2', 'ns_49_2012_2'])

In [10]:
old_vintages_2.keys()

dict_keys(['ns_07_2010_2', 'ns_12_2010_2', 'ns_16_2010_2', 'ns_20_2010_2', 'ns_24_2010_2', 'ns_28_2010_2', 'ns_32_2010_2', 'ns_36_2010_2', 'ns_41_2010_2', 'ns_45_2010_2', 'ns_49_2010_2', 'ns_04_2011_2', 'ns_08_2011_2', 'ns_12_2011_2', 'ns_16_2011_2', 'ns_20_2011_2', 'ns_24_2011_2', 'ns_28_2011_2', 'ns_32_2011_2', 'ns_37_2011_2', 'ns_41_2011_2', 'ns_45_2011_2', 'ns_49_2011_2', 'ns_04_2012_2', 'ns_08_2012_2', 'ns_13_2012_2', 'ns_16_2012_2', 'ns_20_2012_2', 'ns_25_2012_2', 'ns_28_2012_2', 'ns_33_2012_2', 'ns_37_2012_2', 'ns_41_2012_2', 'ns_45_2012_2', 'ns_49_2012_2'])
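
The dictionary keys appear to follow the pattern `ns_<WR issue>_<year>_<table>`, which is consistent with the `year` and `wr` columns of the cleaned tables. A small sketch for splitting a key into those parts, assuming the pattern holds for every key (the function name `parse_wr_key` is hypothetical and not part of the pipeline):

```python
import re

# Assumed key layout: ns_<two-digit WR issue>_<four-digit year>_<table number>
KEY_PATTERN = re.compile(r"^ns_(\d{2})_(\d{4})_(\d)$")


def parse_wr_key(key):
    """Split a key such as 'ns_20_2012_2' into WR issue, year, and table number."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"Unexpected key format: {key!r}")
    issue, year, table = (int(part) for part in match.groups())
    return {"wr": issue, "year": year, "table": table}


print(parse_wr_key("ns_20_2012_2"))
# {'wr': 20, 'year': 2012, 'table': 2}
```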

In [11]:
old_raw_2["ns_20_2012_2"]

Unnamed: 0,SECTORES ECONÓMICOS,2010,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,2011,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,2012,ECONOMIC SECTORS
0,,I,II,III,IV,AÑO,I,II,III,IV,AÑO,I,
1,,,,,,,,,,,,,
2,Agropecuario,3.846833706,4.428734604,2.378505492,6.563876766,4.286827564,3.049282313,2.871059147,7.234301168,2.321170188,3.777913045,2.323059617,Agriculture and Livestock
3,Agrícola,3.937745248,4.205884873,2.072302781,6.557337316,4.144685913,0.27330985,1.085396805,10.33679216,0.903970057,2.83303905,0.528732387,Agriculture
4,Pecuario,3.723188976,4.868573929,2.764359212,6.37260151,4.437536655,6.641089714,6.693082923,3.496098582,4.03120211,5.192287046,4.498196688,Livestock
5,,,,,,,,,,,,,
6,Pesca,-8.249842112,-9.713663155,-27.01910977,-25.31402536,-16.44431964,12.25685893,20.7646321,65.95461469,36.60353323,29.70306397,-7.581425785,Fishing
7,,,,,,,,,,,,,
8,Minería e Hidrocarburos,1.143472525,1.724177983,-2.299979817,-0.992876445,-0.14711,-0.292028967,-2.268165934,0.870596687,0.875352337,-0.195345333,2.089011498,Mining and fuel
9,Minería metálica,-1.007298752,-2.272071127,-8.193624673,-7.402366817,-4.792478413,-5.622157038,-7.696308894,-1.131514212,0.183515025,-3.598691022,1.72068904,Metals


In [12]:
old_clean_2["ns_20_2012_2"]

Unnamed: 0,year,wr,sectores_economicos,economic_sectors,2010_1,2010_2,2010_3,2010_4,2010_year,2011_1,2011_2,2011_3,2011_4,2011_year,2012_1
0,2012,20,agropecuario,agriculture and livestock,3.8,4.4,2.4,6.6,4.3,3.0,2.9,7.2,2.3,3.8,2.3
1,2012,20,agricola,agriculture,3.9,4.2,2.1,6.6,4.1,0.3,1.1,10.3,0.9,2.8,0.5
2,2012,20,pecuario,livestock,3.7,4.9,2.8,6.4,4.4,6.6,6.7,3.5,4.0,5.2,4.5
3,2012,20,pesca,fishing,-8.2,-9.7,-27.0,-25.3,-16.4,12.3,20.8,66.0,36.6,29.7,-7.6
4,2012,20,mineria e hidrocarburos,mining and fuel,1.1,1.7,-2.3,-1.0,-0.1,-0.3,-2.3,0.9,0.9,-0.2,2.1
5,2012,20,mineria metalica,metals,-1.0,-2.3,-8.2,-7.4,-4.8,-5.6,-7.7,-1.1,0.2,-3.6,1.7
6,2012,20,hidrocarburos,fuel,11.0,22.3,37.4,44.8,29.5,34.6,31.5,10.4,3.7,18.1,3.7
7,2012,20,manufactura,manufacturing,7.5,16.8,17.4,13.0,13.6,12.3,6.0,3.8,1.0,5.6,-0.9
8,2012,20,de procesamiento de recursos primarios,based on raw materials,-5.6,-1.9,2.4,-3.7,-2.3,11.6,12.0,14.7,11.3,12.3,-2.6
9,2012,20,no primaria,nonprimary,10.1,21.4,20.1,16.2,16.9,12.4,4.8,2.1,-0.7,4.4,-0.6
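
Comparing the raw and cleaned tables suggests that cleaning lowercases the sector labels, strips accents, and rounds growth rates to one decimal. The sketch below reproduces those three effects with the standard library only. It is an illustration inferred from the outputs, not the code of `old_table_2_cleaner`, and both function names are hypothetical.

```python
import unicodedata


def normalize_label(label):
    """Lowercase a sector label and drop accents (e.g. 'Agrícola' -> 'agricola')."""
    decomposed = unicodedata.normalize("NFKD", label.strip().lower())
    # After NFKD, accents are separate combining characters and can be filtered out.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def round_rate(value, digits=1):
    """Round a growth rate given as text or number to `digits` decimals."""
    return round(float(value), digits)


print(normalize_label("Minería e Hidrocarburos"))  # mineria e hidrocarburos
print(round_rate("3.846833706"))                   # 3.8
```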


Here, the `_d` suffix indicates the month; it is simply assigned according to the order in which the file appears in its folder.
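
The `_d` suffix in the vintage column names (for example `agriculture_2012_5`) can be read as the position of the WR file within its year folder. The sketch below numbers the first five 2012 Table 2 keys by sorted position. It assumes the folder is read in sorted order and that each file corresponds to one month; the function name is hypothetical.

```python
def month_index_by_order(keys):
    """Map each key to its 1-based position in sorted order."""
    return {key: position for position, key in enumerate(sorted(keys), start=1)}


# First five 2012 keys, taken from the dictionary keys listed earlier in this notebook.
keys_2012 = ["ns_04_2012_2", "ns_08_2012_2", "ns_13_2012_2", "ns_16_2012_2", "ns_20_2012_2"]
months = month_index_by_order(keys_2012)
print(months["ns_20_2012_2"])  # 5, matching the `_2012_5` suffix in the vintage columns
```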

In [13]:
old_vintages_2["ns_20_2012_2"]

vintage_id,target_period,agriculture_2012_5,fishing_2012_5,mining_2012_5,manufacturing_2012_5,electricity_2012_5,construction_2012_5,commerce_2012_5,services_2012_5,gdp_2012_5
0,2010q1,3.8,-8.2,1.1,7.5,6.5,16.8,8.1,4.9,6.2
1,2010q2,4.4,-9.7,1.7,16.8,8.6,21.5,11.0,8.8,10.0
2,2010q3,2.4,-27.0,-2.3,17.4,8.4,16.6,9.6,9.3,9.6
3,2010q4,6.6,-25.3,-1.0,13.0,7.3,15.5,9.9,8.9,9.2
4,2010,4.3,-16.4,-0.1,13.6,7.7,17.4,9.7,8.0,8.8
5,2011q1,3.0,12.3,-0.3,12.3,7.3,8.1,10.3,9.3,8.8
6,2011q2,2.9,20.8,-2.3,6.0,7.4,0.4,8.8,9.0,6.9
7,2011q3,7.2,66.0,0.9,3.8,7.7,1.8,8.6,8.0,6.7
8,2011q4,2.3,36.6,0.9,1.0,7.2,3.8,7.6,7.1,5.5
9,2011,3.8,29.7,-0.2,5.6,7.4,3.4,8.8,8.3,6.9
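
The `target_period` labels in the vintage table line up with the column names of the cleaned table (`2010_1`, `2010_year`, and so on). A minimal sketch of that relabeling, inferred from the two outputs rather than taken from the pipeline (the function name `to_target_period` is hypothetical):

```python
def to_target_period(column):
    """Convert '2010_1' -> '2010q1' and '2010_year' -> '2010'."""
    year, _, part = column.partition("_")
    if part == "year":
        return year
    if part in {"1", "2", "3", "4"}:
        return f"{year}q{part}"
    raise ValueError(f"Unexpected column name: {column!r}")


columns = ["2010_1", "2010_2", "2010_3", "2010_4", "2010_year", "2011_1"]
print([to_target_period(c) for c in columns])
# ['2010q1', '2010q2', '2010q3', '2010q4', '2010', '2011q1']
```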
