#### 0. Author
<br>
Carolina Saavedra - carolinasaavedra01@gmail.com
<br>

#### 1. Title 
<br>
01_PDF_Scraping with an example
<br>

#### 2. Purpose
<br>
This code converts encrypted PDF files into text files so that their contents can then be organized and exported as classified Excel files. It has two parts: the first extracts the first page of each tax folder into a text file, and the second organizes the information from that text file into tables that are exported separately.

Below is an image showing how this tax folder is presented:

![PDF Example](Pdf_example.jpg)

The text file would look like this:

![text_file_example](text_example.jpg)

Four types of Excel files are exported: a table with the main data, one with the legal representatives, one with the formation of the company, and one with participation data of current companies. The names below are fake; this document does not contain personally identifiable information (PII).

<br>

#### 2.1. Table with main information

| ID | nombre_emisor | rut_emisor | fecha_inicio_emisor | actividades_economicas |
|:---|:--------------|:-----------|:-------------------:|------------------------|
| 1 | Carlos Pérez | 1237467-A | 17/08/2027 | SPA |

<br>


This code was developed as part of a consultancy for the Inter-American Development Bank. If you wish to apply it to other types of documents with a PDF extension, the second part must be adjusted accordingly.
<br>
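
As a hedged illustration of what adjusting the second part could involve, the sketch below drives the parsing from a dictionary of labels, so that a different document type would mainly need a different dictionary. The names `FIELD_LABELS` and `parse_fields` are hypothetical and are not part of the original code; the labels are the ones searched for in section 5.6, and the sample text is invented for the example.

```python
# Hypothetical mapping: output column -> label that precedes the value in the text.
FIELD_LABELS = {
    "nombre_emisor": "Nombre del emisor:",
    "rut_emisor": "RUT del emisor:",
    "fecha_inicio_emisor": "Fecha de Inicio de Actividades:",
    "actividades_economicas": "Actividades Económicas:",
}


def parse_fields(text, labels):
    """Return a dict with one entry per label; labels not found stay as "NA"."""
    result = {column: "NA" for column in labels}
    for line in text.split("\n"):
        for column, label in labels.items():
            if label in line:
                # Keep everything after the label, even if the value contains ':'.
                result[column] = line.split(label, 1)[1].strip()
                break
    return result


# Invented sample text that mimics two lines of a converted tax folder.
sample = "Nombre del emisor: Carlos Pérez\nRUT del emisor: 1237467-A\n"
fields = parse_fields(sample, FIELD_LABELS)
print(fields)
```

Splitting on the label itself, rather than on every colon, is a design choice of this sketch: it keeps values intact when they contain a colon.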

#### 3. Packages to install
<br>
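Before running the code, install the required packages: `pip install pdfplumber`, `pip install pdfquery`, and `pip install fitz`. The code below also imports `PyPDF2` and `pandas`, so install them as well (`pip install PyPDF2 pandas`). Exporting `.xlsx` files with pandas additionally requires `openpyxl`.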
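
A quick way to confirm that the environment is ready is to check whether each module can be found before running the notebook. The sketch below is only an illustration: `REQUIRED_MODULES` and `missing_modules` are hypothetical names, and the list contains the import names used in section 5.1, which can differ from the pip package names.

```python
import importlib.util

# Import names used in section 5.1; pip package names may differ from these.
REQUIRED_MODULES = ["pdfplumber", "PyPDF2", "pdfquery", "pandas"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```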

#### 4. Code Utility


1. **Data Collection Efficiency:** Automated extraction of information from PDFs saves significant time and effort compared to manual data entry or extraction. This efficiency allows researchers to focus more on analysis and interpretation rather than spending resources on tedious data collection tasks.


2. **Data Standardization:** Extracting information from PDFs using code ensures consistency and standardization in data collection. This helps to minimize errors and discrepancies that may arise from manual data entry, ensuring the accuracy and reliability of the data used for impact evaluation or RCTs.


3. **Scalability:** With code in place for PDF data extraction, the process can be scaled up to handle large volumes of documents efficiently. This scalability is particularly valuable in projects with extensive data requirements or a large number of documents.


4. **Versatility:** The ability to extract information from PDFs allows researchers to incorporate a diverse range of data sources into their impact evaluation projects or RCTs. This versatility enables researchers to access valuable data from various sources, including reports, surveys, and academic papers, enhancing the comprehensiveness and robustness of the analysis.


5. **Automation of Data Preprocessing:** Extracting information from PDFs through code facilitates automated preprocessing of data, including cleaning, formatting, and structuring. This automation streamlines the data preparation process, making it easier to perform subsequent analysis tasks such as data merging, transformation, and analysis, thereby accelerating the overall research workflow.


In summary, having code for extracting information from PDFs provides several advantages for impact evaluation projects or RCTs: efficiency, standardization, scalability, versatility, and automation of data preprocessing. These benefits contribute to improved data quality, streamlined research processes, and enhanced analytical capabilities, ultimately facilitating more robust and insightful evaluations of social programs or interventions.

#### 5. Code

#### 5.1. Import libraries

In [1]:
import os 
import pdfplumber
import PyPDF2
import re
import pdfquery
import pandas as pd

#### 5.2. Directory where the tax folders are stored

In [2]:
pdf_folder_path = 'C:\\Users\\CAROLINA\\Dropbox\\Project_Name\\01_PDFs_Analysis\\01_PDFs\\tax_folders' 

#### 5.3. Function to find the files

In [3]:
# All files must have the extension ".pdf"
pdf_files = [file for file in os.listdir(pdf_folder_path) if file.endswith(".pdf")]
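
The listing above keeps only names that end in lowercase `.pdf`, so a file named `123.PDF` would be skipped. As a possible alternative, the sketch below compares the extension case-insensitively and ignores subfolders. The function name `list_pdf_files` is hypothetical and not part of the original code.

```python
from pathlib import Path


def list_pdf_files(folder):
    """Return the sorted names of files in `folder` whose extension is .pdf (any case)."""
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        if entry.is_file() and entry.suffix.lower() == ".pdf"
    )
```

Under these assumptions, `pdf_files = list_pdf_files(pdf_folder_path)` could stand in for the list comprehension.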

#### 5.4. Loop to convert PDFs to text files

In [4]:
# Folder where the text files (block notes) are stored
output_folder = "C:\\Users\\CAROLINA\\Dropbox\\Project_Name\\6_PDFs_Analysis\\03_Outputs\\01_Block_Notes"
os.makedirs(output_folder, exist_ok=True)


for pdf_file in pdf_files:
    pdf_path = os.path.join(pdf_folder_path, pdf_file)

    try:
        
        with open(pdf_path, 'rb') as pdfFileObj:
            
            pdfReader = PyPDF2.PdfReader(pdfFileObj)
                       
            if pdfReader.is_encrypted:  # Check whether the PDF is encrypted
                
                pdfReader.decrypt("")  # Try to decrypt the file with an empty password
                      
            pageObj = pdfReader.pages[0] # Select the first page
            extracted_text = pageObj.extract_text()

            output_path = os.path.join(output_folder, f"{os.path.splitext(pdf_file)[0]}_output.txt")
           
            with open(output_path, 'w', encoding='utf-8') as output_file:
                output_file.write(extracted_text)

    except Exception as e:   
        print(f"Error processing file {pdf_file}: {str(e)}")  # General exceptions
        continue
# Files that are not exported are not electronic tax folders or contain other types of content.

Error processing file 314349.pdf: File has not been decrypted
Error processing file 316237.pdf: EOF marker not found
Error processing file 316771.pdf: EOF marker not found


incorrect startxref pointer(1)


Error processing file 342887.pdf: File has not been decrypted
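
Messages such as `EOF marker not found` typically point to files that are truncated or are not really PDFs. One way to screen such files before conversion is a simple byte-level check, sketched below. This is a heuristic under stated assumptions, not what PyPDF2 does internally: it assumes a usable PDF starts with `%PDF-` and has `%%EOF` within its last 1024 bytes. The name `looks_like_complete_pdf` and the `tail_size` parameter are hypothetical.

```python
import os


def looks_like_complete_pdf(path, tail_size=1024):
    """Heuristic: the file starts with the PDF header and has %%EOF near the end."""
    with open(path, "rb") as handle:
        header = handle.read(5)
        size = os.path.getsize(path)
        # Jump to the last `tail_size` bytes (or the start, for small files).
        handle.seek(max(0, size - tail_size))
        tail = handle.read()
    return header == b"%PDF-" and b"%%EOF" in tail
```

Files that fail this check could be listed separately for manual review instead of being passed to the conversion loop.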


#### 5.5. Directory where text files are stored

In [None]:
directorio_txt = 'C:\\Users\\CAROLINA\\Dropbox\\Project_Name\\6_PDFs_Analysis\\03_Outputs\\01_Block_Notes'

#### 5.6. Logic to export tables with <span style="color:purple">MAIN</span> information 

In [None]:
def extraer_informacion_desde_txt(ruta_txt, id_archivo):
    """Read one text file and return a one-row DataFrame with the main information."""
    with open(ruta_txt, 'r', encoding='utf-8') as archivo:
        contenido = archivo.read()

    # First group of variables ("NA" when the label is not found)
    nombre_emisor = "NA"
    rut_emisor = "NA"
    fecha_inicio_emisor = "NA"
    actividades_economicas = "NA"

    # Look for each label and keep the text after the last colon
    for linea in contenido.split('\n'):
        if "Nombre del emisor:" in linea:
            nombre_emisor = linea.split(':')[-1].strip()
        elif "RUT del emisor:" in linea:
            rut_emisor = linea.split(':')[-1].strip()
        elif "Fecha de Inicio de Actividades:" in linea:
            fecha_inicio_emisor = linea.split(':')[-1].strip()
        elif "Actividades Económicas:" in linea:
            actividades_economicas = linea.split(':')[-1].strip()

    # One-row DataFrame for this file
    df_resultante = pd.DataFrame({
        'ID': [id_archivo],
        'Nombre del Emisor': [nombre_emisor],
        'RUT del Emisor': [rut_emisor],
        'Fecha de Inicio de Actividades': [fecha_inicio_emisor],
        'Actividades Económicas': [actividades_economicas]
    })

    return df_resultante

dfs_resultantes = {}

for nombre_archivo in os.listdir(directorio_txt):
    if nombre_archivo.endswith('_output.txt'):  # Only process the text files created in section 5.4
        ruta_completa = os.path.join(directorio_txt, nombre_archivo)

        id_archivo = nombre_archivo.split('_')[0]  # File ID (text before the first underscore)

        # Pass the ID explicitly instead of relying on a global variable
        df_resultante = extraer_informacion_desde_txt(ruta_completa, id_archivo)

        dfs_resultantes[id_archivo] = df_resultante

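# Export the output 
directorio_exportacion_m = 'C:\\Users\\CAROLINA\\Dropbox\\Project_Name\\6_PDFs_Analysis\\03_Outputs\\02_Principal_Information'
os.makedirs(directorio_exportacion_m, exist_ok=True)  # Create the export folder if it does not exist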

for id_archivo, df_resultante in dfs_resultantes.items():
    nombre_excel = f'{id_archivo}_informacion_principal.xlsx'
    ruta_excel = os.path.join(directorio_exportacion_m, nombre_excel)
    df_resultante.to_excel(ruta_excel, index=False)

    print(f"The Excel file for the ID {id_archivo} has been exported to: {ruta_excel}")
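
The loop above takes the file ID as the text before the first underscore, which works for names such as `314349_output.txt` but would shorten an ID that itself contains an underscore. A minimal sketch of an alternative, assuming every text file ends with the `_output.txt` suffix written in section 5.4, is to strip that suffix instead. The names `SUFFIX` and `id_from_filename` are hypothetical.

```python
SUFFIX = "_output.txt"  # Suffix appended to each text file in section 5.4


def id_from_filename(filename, suffix=SUFFIX):
    """Return the part of `filename` before `suffix`, or None if the suffix is absent."""
    if not filename.endswith(suffix):
        return None
    return filename[: -len(suffix)]


print(id_from_filename("314349_output.txt"))
```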