#### 0. Author
<br>
Carolina Saavedra - carolinasaavedra01@gmail.com
<br>

#### 1. Title 
<br>
02_PDF Parsing for Peruvian Electronic Invoices
<br>

#### 2. Purpose
<br>
This code converts encrypted PDF files into text files so that their contents can then be organized and exported as classified Excel files. It has two parts. The first part extracts the first page of each tax folder and writes it to a text file. The second part organizes the information from the text file into tables that are exported separately.

Below is an image showing how this tax folder is presented:

![PDF Example](invoice_example_peru.jpg)

The text file would look like this:

![text_file_example](text_example_Peru.jpg)

As you can see, the text file contains the invoice data in a disordered manner. For this reason, the code uses several regular expressions to extract all the data. The names below are fake; this document does not contain PII.
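
The approach can be illustrated with a minimal sketch. The sample text, the `find_first` helper, and the `DNI` pattern are hypothetical and only imitate the kind of jumbled text described above; the patterns actually used by this notebook appear in section 5.6.

```python
import re

# Hypothetical sample imitating jumbled text from a PDF page (all values are fake).
sample_text = """RECIBO POR HONORARIOS ELECTRONICO
Nro:12345678901
CARLOS PEREZ   MZA A LOTE 1
RUC   10987654321"""

# Collapse line breaks and repeated spaces so a pattern can span former line boundaries.
flat_text = re.sub(r'\s+', ' ', sample_text).strip()


def find_first(pattern, text, default="NA"):
    """Return the first captured group of `pattern` in `text`, or `default` if there is no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else default


issuer_ruc = find_first(r'Nro:(\d{11})', flat_text)
recipient_ruc = find_first(r'RUC\s+(\d{11})', flat_text)
not_present = find_first(r'DNI\s+(\d{8})', flat_text)  # field absent from the sample

print(issuer_ruc, recipient_ruc, not_present)
```

Returning a default such as "NA" keeps one missing field from stopping the extraction of the others.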

<br>

#### 2.1. Table with main information

| Nombre del Emisor | RUC del Emisor | Nombre del Receptor | RUC del Receptor | Fecha de Emisión | Concepto         | Monto Bruto | Retención | Monto Neto | ID             |
|:----------------- |:-------------- |:------------------- |:---------------- |:---------------- |:---------------- |-----------:|---------:|-----------:|:-------------- |
| Carlos Pérez      | 123456789      | Asociados S.A.      | 123456789        | 12 de mayo       | DISEÑO DE LOGOS  |      2500  |       00 |       2500 | nombre_archivo |

<br>

#### 3. Packages to install
<br>
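Before running the code, install the required packages: 'pip install pdfplumber'; 'pip install pdfquery'; 'pip install fitz'. The code below also imports PyPDF2 and pandas, so install them as well if they are missing: 'pip install PyPDF2'; 'pip install pandas'.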
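
As an optional check before running the notebook, the following sketch reports which modules cannot be found in the current environment. The `missing_modules` helper is hypothetical, and including `openpyxl` in the list is an assumption (pandas commonly relies on it to write `.xlsx` files).

```python
import importlib.util

# Modules imported in section 5.1; "openpyxl" is an assumption, see the note above.
REQUIRED_MODULES = ["pdfplumber", "PyPDF2", "pdfquery", "pandas", "openpyxl"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing modules:", missing_modules(REQUIRED_MODULES))
```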

#### 4. Code Utility


1. **Data Collection Efficiency:** Automated extraction of information from PDFs saves significant time and effort compared to manual data entry or extraction. This efficiency allows researchers to focus more on analysis and interpretation rather than spending resources on tedious data collection tasks.


2. **Data Standardization:** Extracting information from PDFs using code ensures consistency and standardization in data collection. This helps to minimize errors and discrepancies that may arise from manual data entry, ensuring the accuracy and reliability of the data used for impact evaluation or RCTs.


3. **Scalability:** With code in place for PDF data extraction, the process can be easily scaled up to handle large volumes of documents efficiently. This scalability is particularly valuable in projects with extensive data requirements or when dealing with a large number of documents.


4. **Versatility:** The ability to extract information from PDFs allows researchers to incorporate a diverse range of data sources into their impact evaluation projects or RCTs. This versatility enables researchers to access valuable data from various sources, including reports, surveys, and academic papers, enhancing the comprehensiveness and robustness of the analysis.


5. **Automation of Data Preprocessing:** Extracting information from PDFs through code facilitates automated preprocessing of data, including cleaning, formatting, and structuring. This automation streamlines the data preparation process, making it easier to perform subsequent analysis tasks such as data merging, transformation, and analysis, thereby accelerating the overall research workflow.


In summary, having code for extracting information from PDFs provides several advantages for impact evaluation projects or RCTs, including efficiency, standardization, scalability, versatility, and automation of data preprocessing. These benefits contribute to improved data quality, streamlined research processes, and enhanced analytical capabilities, ultimately facilitating more robust and insightful evaluations of social programs or interventions.

#### 5. Code

#### 5.1. Import libraries

In [None]:
import os 
import pdfplumber
import PyPDF2
import re
import pdfquery
import pandas as pd

#### 5.2. Directory where the tax folders are stored

In [None]:
pdf_folder_path = 'C:\\Users\\CAROLINA\\Dropbox\\Project_Name\\01_PDFs_Analysis\\01_PDFs' # Change here

#### 5.3. Function to find the files

In [None]:
# All files must have the extension ".pdf"
pdf_files = [file for file in os.listdir(pdf_folder_path) if file.endswith(".pdf")]
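
The filter above is case-sensitive, so a file named `RECIBO.PDF` would be skipped. A minimal alternative sketch, assuming that files with an upper-case extension should also be included, is shown below. The `list_pdf_files` helper and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_pdf_files(folder):
    """Return the sorted names of files in `folder` ending in .pdf, in any letter case."""
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Small demonstration in a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ("recibo_01.pdf", "RECIBO_02.PDF", "notes.txt"):
        (Path(demo_folder) / name).touch()
    demo_files = list_pdf_files(demo_folder)

print(demo_files)
```

In the notebook, `list_pdf_files(pdf_folder_path)` could stand in for the list comprehension.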

#### 5.4. Loop to convert PDFs to text files

In [None]:
# Folder where the text files (block notes) are stored. Change here
output_folder = "C:\\Users\\CAROLINA\\Dropbox\\Project_Name\\6_PDFs_Analysis\\03_Outputs\\01_Block_Notes"
os.makedirs(output_folder, exist_ok=True)


for pdf_file in pdf_files:
    pdf_path = os.path.join(pdf_folder_path, pdf_file)

    try:
        
        with open(pdf_path, 'rb') as pdfFileObj:
            
            pdfReader = PyPDF2.PdfReader(pdfFileObj)
                       
            if pdfReader.is_encrypted: # Check whether the PDF is encrypted
                
                pdfReader.decrypt("")  # Try to decrypt the file with an empty password
                      
            pageObj = pdfReader.pages[0] # Select the first page
            extracted_text = pageObj.extract_text()

            output_path = os.path.join(output_folder, f"{os.path.splitext(pdf_file)[0]}_output.txt")
           
            with open(output_path, 'w', encoding='utf-8') as output_file:
                output_file.write(extracted_text)

    except Exception as e:   
        print(f"Error processing file {pdf_file}: {str(e)}")  # General exceptions
        continue
# Files that fail to export are usually not electronic tax folders or contain other types of content.
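
The loop above only prints each error. A minimal sketch of an alternative, which records the files that failed so they can be reviewed afterwards, could look like this. The `convert_all` function and the `fake_convert` stand-in are hypothetical; in the notebook, the callable passed in would wrap the PyPDF2 steps shown above.

```python
def convert_all(file_names, convert_one):
    """Apply `convert_one` to every name and return (converted, failed).

    `failed` maps each file name to the error message it raised.
    """
    converted, failed = [], {}
    for name in file_names:
        try:
            convert_one(name)
            converted.append(name)
        except Exception as error:  # Same broad catch as the loop above
            failed[name] = str(error)
    return converted, failed


def fake_convert(name):
    """Stand-in converter for the demonstration; rejects names starting with 'bad'."""
    if name.startswith("bad"):
        raise ValueError("not an electronic invoice")


converted, failed = convert_all(["a.pdf", "bad_scan.pdf", "b.pdf"], fake_convert)
print(converted)
print(failed)
```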

#### 5.5. Directory where text files are stored

In [None]:
directorio_txt = 'C:\\Users\\CAROLINA\\Dropbox\\Project_Name\\6_PDFs_Analysis\\02_Outputs\\01_Block_Notes' # Change here. Must be the same folder as output_folder in 5.4

#### 5.6. Logic to export tables with <span style="color:purple">MAIN</span> information 

In [None]:
def extraer_informacion_desde_txt(ruta_txt):
    with open(ruta_txt, 'r', encoding='utf-8') as archivo:
        contenido = archivo.read()

    # Normalize the text to a single line to avoid line breaks and extra spaces
    contenido = re.sub(r'\s+', ' ', contenido)

    def extraer_montos_pegados(texto):
        texto = texto.replace(',', '').replace(' ', '')
        patron = r'(\d+\.\d{2})\((\d+\.\d{2})\)(\d+\.\d{2})'
        match = re.search(patron, texto)
        if match:
            return match.group(1), match.group(2), match.group(3)
        else:
            return "NA", "NA", "NA"

    # Default values
    nombre_emisor = "NA"
    ruc_emisor = "NA"
    nombre_receptor = "NA"
    ruc_receptor = "NA"
    fecha_emision = "NA"
    concepto = "NA"
    monto_bruto = "NA"
    retencion = "NA"
    monto_neto = "NA"
    monto_letras = "NA"

    # Find the issuer's RUC
    ruc_emisor_match = re.search(r'RECIBO POR HONORARIOS ELECTRONICO\s+Nro:(\d{11})', contenido)
    if ruc_emisor_match:
        ruc_emisor = ruc_emisor_match.group(1)

    # Find the issuer's name (between "Nro:" and "MZA")
    nombre_match = re.search(r'Nro:\d{11}\s*(.*)', contenido)
    if nombre_match:
        linea_siguiente = nombre_match.group(1).strip()
        nombre_emisor = re.split(r'\s*MZA', linea_siguiente)[0].strip()
        
    # Find the recipient's name (between "Total Neto Recibido:" and "RUC")
    nombre_receptor_match = re.search(r'Total Neto Recibido:(.*?)\s+RUC', contenido)
    if nombre_receptor_match:
        nombre_receptor = nombre_receptor_match.group(1).strip()

    # Find the recipient's RUC
    ruc_receptor_match = re.search(r'RUC\s+(\d{11})', contenido)
    if ruc_receptor_match:
        ruc_receptor = ruc_receptor_match.group(1)

    # Find the concept (between "SOLES" and "- A", "Forma de Pago:" or "Fecha de emisión")
    concepto_match = re.search(r'SOLES\s+(.*?)\s+(?:- A|Forma de Pago:|Fecha de emisión)', contenido, re.IGNORECASE | re.DOTALL)
    if concepto_match:
        concepto = concepto_match.group(1).strip()


    # Find the glued amounts: gross(withholding)net
    montos_raw_match = re.search(r'(\d{1,3}(?:[.,]\d{3})*[.,]\d{2}\(\d{1,3}(?:[.,]\d{3})*[.,]\d{2}\)\d{1,3}(?:[.,]\d{3})*[.,]\d{2})', contenido)
    if montos_raw_match:
        monto_bruto, retencion, monto_neto = extraer_montos_pegados(montos_raw_match.group(1))

    # Find the issue date (day month year)
    fecha_match = re.search(r'(\d{2}\s+\w+\s+\d{4})', contenido)
    if fecha_match:
        fecha_emision = fecha_match.group(1)

    df_resultante = pd.DataFrame({
        'Nombre del Emisor': [nombre_emisor],
        'RUC del Emisor': [ruc_emisor],
        'Nombre del Receptor': [nombre_receptor],
        'RUC del Receptor': [ruc_receptor],
        'Fecha de Emisión': [fecha_emision],
        'Concepto': [concepto],
        'Monto Bruto': [monto_bruto],
        'Retención': [retencion],
        'Monto Neto': [monto_neto]
    })

    return df_resultante


dfs_resultantes = {}

for nombre_archivo in os.listdir(directorio_txt):
    if nombre_archivo.endswith('_output.txt'):
        ruta_completa = os.path.join(directorio_txt, nombre_archivo)
        
        id_archivo = os.path.splitext(nombre_archivo)[0].replace('_output', '')  # File name without "_output"
        
        df_resultante = extraer_informacion_desde_txt(ruta_completa)
        
        # Add the ID column with the value of id_archivo
        df_resultante['ID'] = id_archivo
        
        dfs_resultantes[id_archivo] = df_resultante

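# Export the output
directorio_exportacion_m = 'C:\\Users\\CAROLINA\\Dropbox\\Project_Name\\6_PDFs_Analysis\\02_Outputs\\csv' # Change here
os.makedirs(directorio_exportacion_m, exist_ok=True)  # Create the export folder if it does not exist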

for id_archivo, df_resultante in dfs_resultantes.items():
    nombre_excel = f'{id_archivo}_Resumen.xlsx'
    ruta_excel = os.path.join(directorio_exportacion_m, nombre_excel)
    df_resultante.to_excel(ruta_excel, index=False)

    print(f"The Excel file for the ID {id_archivo} has been exported to: {ruta_excel}")
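
The amounts returned by `extraer_montos_pegados` are strings. As a possible follow-up, the sketch below parses the same glued pattern into `Decimal` values and checks that gross minus withholding equals net. The names `parse_glued_amounts` and `amounts_are_consistent` are hypothetical, the sample string is made up, and the consistency rule is an assumption based on the column names, not something the code above enforces.

```python
import re
from decimal import Decimal

# Same glued layout as in extraer_montos_pegados: gross(withholding)net
AMOUNTS_PATTERN = re.compile(r'(\d+\.\d{2})\((\d+\.\d{2})\)(\d+\.\d{2})')


def parse_glued_amounts(text):
    """Return (gross, withholding, net) as Decimal values, or None if the pattern is absent."""
    cleaned = text.replace(',', '').replace(' ', '')  # Drop thousands separators and spaces
    match = AMOUNTS_PATTERN.search(cleaned)
    if match is None:
        return None
    gross, withholding, net = (Decimal(value) for value in match.groups())
    return gross, withholding, net


def amounts_are_consistent(gross, withholding, net):
    """Assumed rule: the net amount is the gross amount minus the withholding."""
    return gross - withholding == net


amounts = parse_glued_amounts("2,500.00 (200.00) 2,300.00")  # Made-up sample
print(amounts)
if amounts is not None:
    print(amounts_are_consistent(*amounts))
```

`Decimal` is used instead of `float` so that monetary comparisons are exact.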