## Layer 1:

### 1- Segmentation Based on Service Provided

- **Purpose**: Perform an initial classification based on the type of service provided, allowing for a first distinction between documents.
- **Methodology**: In this first layer, files are categorized based on the service they represent, facilitating subsequent filtering stages.

### 2- Filtering by Number of Pages

- **Purpose**: Refine the dataset by excluding documents that do not meet certain criteria defined by the data analyst.
- **Methodology**: Files are distinguished based on their number of pages: 1-2, 3, 4-5, and 6 or more pages. This segmentation allows for a better understanding of the dataset composition and informed decisions on which documents to include for model training.
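
The page-count ranges above can be sketched as a small helper. This is only an illustration: the function name `page_bucket` and the bucket labels are assumptions, not names used elsewhere in this notebook, and rejecting a page count below 1 is a choice made for this sketch.

```python
def page_bucket(num_pages: int) -> str:
    """Map a PDF page count to one of the ranges used in this layer."""
    if num_pages < 1:
        # Assumption for this sketch: a count below 1 signals a broken file.
        raise ValueError("a PDF is expected to have at least one page")
    if num_pages <= 2:
        return "1-2"
    if num_pages == 3:
        return "3"
    if num_pages <= 5:
        return "4-5"
    return "6+"


# Example: bucket a few page counts.
for pages in (1, 3, 5, 12):
    print(pages, "->", page_bucket(pages))
```

Keeping the mapping in one function means the folder names and the report can both rely on the same boundaries.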

### 3- Identification through OCR and Classification

- **Purpose**: Identify and categorize documents based on specific textual content, further optimizing data selection.
- **Methodology**: An OCR process is implemented to detect the frequency of the word "Presentismo" in each file. This metric guides the decision of which documents should be prioritized or excluded in training.
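
As a rough sketch of the counting step, the keyword frequency can be computed on the text that OCR returns. The regular expression below is an assumption for illustration (it is not the approach used in the notebook cell further down): it matches the word in any letter case and tolerates whitespace between letters, so a spaced-out heading counts as one occurrence. The names `KEYWORD_PATTERN`, `count_keyword`, and `occurrence_bucket` are hypothetical.

```python
import re

# "presentismo" in any letter case, with optional whitespace between letters,
# so that "P R E S E N T I S M O" is also matched.
KEYWORD_PATTERN = re.compile(r"\s*".join("presentismo"), re.IGNORECASE)


def count_keyword(text: str) -> int:
    """Count non-overlapping occurrences of the keyword in OCR text."""
    return len(KEYWORD_PATTERN.findall(text))


def occurrence_bucket(count: int) -> int:
    """Group counts of 6 or more together, giving buckets 0 through 6."""
    return min(count, 6)


sample = "Planilla de Presentismo\nP R E S E N T I S M O mensual"
print(count_keyword(sample), occurrence_bucket(count_keyword(sample)))
```

Because each match consumes its characters, one occurrence in the text is counted once regardless of how it is capitalized. Note that `\s*` also spans line breaks, which may or may not be desirable for a given scan.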

### 4- Data Filtering

- **Purpose**: Further refine the dataset according to specific criteria, ensuring optimal training.
- **Methodology**: Using the detailed report of PDF file composition as reference, filtering is done based on visual characteristics and file size. The observed relationship between file size, image quality, and scanning medium guides the decision on which documents to exclude from training.
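
A minimal sketch of the size-based part of this filter, assuming the sizes have already been collected (for example with `os.path.getsize`). The function name `split_by_size` and the thresholds in the example are hypothetical; the notebook does not state the cut-off values that were actually used.

```python
def split_by_size(sizes, min_bytes, max_bytes):
    """Partition file names into (kept, excluded) by size in bytes.

    `sizes` maps a file name to its size on disk.
    """
    kept, excluded = [], []
    for name, size in sizes.items():
        if min_bytes <= size <= max_bytes:
            kept.append(name)
        else:
            excluded.append(name)
    return kept, excluded


# Hypothetical sizes and thresholds, for illustration only.
sizes = {"Form1- MED.pdf": 40_000, "Form2- MED.pdf": 900_000, "Form3- MED.pdf": 25_000_000}
kept, excluded = split_by_size(sizes, min_bytes=100_000, max_bytes=10_000_000)
print("kept:", kept)
print("excluded:", excluded)
```

The visual-characteristics criterion is a manual judgment and is not covered by this sketch.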

### 5- Compliance with GCP Requirements

- **Purpose**: Ensure selected data meet the minimum standards established by GCP.
- **Methodology**: Based on the report of PDF file composition per folder, the data is segmented and then filtered. Training data is capped at 1GB, and documents are verified to contain text of at least 8 points at 150 DPI.

   - **GCP Compliance**: It is essential to ensure that the final data meets GCP specifications.
   - **Composition Report**: A detailed report reflecting the distribution and characteristics of PDF files in each folder is prepared.
   - **Data Limits**: For training, up to a maximum of 1GB of data is selected, ensuring efficiency and optimal handling.
   - **Quality Verification**: A check is conducted to ensure that documents contain text of at least 8 points at 150 DPI, in line with minimum requirements.

   GCP Services - Custom model input requirements:

      - **(OK)** For **PDF** and TIFF, **up to 1500 pages can be processed** (with a free tier subscription, only the first two pages are processed).

      - **(OK)** The **file size** for analyzing documents must be less than **500 MB for paid (S0) tier** and 4 MB for free (F0) tier. 

      - **(OK)** **Image dimensions** must be between **50 x 50 pixels** and **10,000 px x 10,000 px.**

      - **(OK)**  If your PDFs are password-locked, you must remove the lock before submission.

      - **(OK)** The **minimum height of the text** to be extracted is **12 pixels for a 1024 x 768 pixel image**. This dimension corresponds to about **8-point text at 150 dots per inch (DPI).**

      - **(OK)**  For **custom model training**, the maximum number of pages for training data is 500 for the custom template model and **50,000 for the custom neural model**.

      - **(NOT NEEDED)** For **custom extraction model training**, the total size of training data is 50 MB for the template model and **1 GB for the neural model**.

      - **(OK)** For **custom classification model training**, the total size of training data is **1GB with a maximum of 10,000 pages**. 
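
The classification-training limits in the last item (1GB of data, at most 10,000 pages) can be checked while selecting files. The sketch below is one possible way to do it, not the procedure used in this project: the function name `select_within_limits` is hypothetical, and reading "1GB" as 10^9 bytes is an assumption (the stricter of the two common readings).

```python
ONE_GB = 1_000_000_000  # assumption: 1GB read as 10**9 bytes


def select_within_limits(files, max_bytes=ONE_GB, max_pages=10_000):
    """Pick files in order while staying under both limits.

    `files` is an iterable of (name, size_bytes, num_pages) tuples.
    Returns (selected_names, total_bytes, total_pages).
    """
    selected, total_bytes, total_pages = [], 0, 0
    for name, size_bytes, num_pages in files:
        over_size = total_bytes + size_bytes > max_bytes
        over_pages = total_pages + num_pages > max_pages
        if over_size or over_pages:
            continue  # skip this file; a later, smaller one may still fit
        selected.append(name)
        total_bytes += size_bytes
        total_pages += num_pages
    return selected, total_bytes, total_pages


# Small illustrative limits so the effect is easy to see.
files = [("a.pdf", 600, 2), ("b.pdf", 500, 1), ("c.pdf", 300, 3)]
print(select_within_limits(files, max_bytes=1000, max_pages=10))
```

In the example, `b.pdf` is skipped because adding it would exceed the byte limit, while `c.pdf` still fits.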

### Conclusion

The detailed process provides a rigorous and structured method for PDF file preparation and filtering. This methodology ensures an optimized dataset, ready for training the HOUSE24 classification model, aligned with GCP standards.

NOTE: Adjust the file and folder paths as appropriate for your environment.

## 1- Segmentation Based on Service Provided


In [20]:
import PyPDF2
import shutil
import os

route = 'C:/Users/HP/My Drive/Artificial Intelligence/PROJECTS/House24 - Toto/House24 Invoicing/'
# Directory where the PDF files are located
pdf_directory = route + 'Psychology'

# Directory where PDFs with 1-2 pages will be saved
output_directory_1_2 = route + 'Psychology/PSI-1-2pags'

# Directory where PDFs with 3 pages will be saved
output_directory_3 = route + 'Psychology/PSI-3pags'

# Directory where PDFs with 4-5 pages will be saved
output_directory_4_5 = route + 'Psychology/PSI-4-5pags'

# Directory where PDFs with 6 or more pages will be saved
output_directory_6_plus = route + 'Psychology/PSI-6+pags'

# Create output directories if they don't exist
os.makedirs(output_directory_1_2, exist_ok=True)
os.makedirs(output_directory_3, exist_ok=True)
os.makedirs(output_directory_4_5, exist_ok=True)
os.makedirs(output_directory_6_plus, exist_ok=True)

# Iterate through files in the input directory
pdf_files = [filename for filename in os.listdir(pdf_directory) if filename.endswith('.pdf')]
total_files = len(pdf_files)
copied_files = 0

for filename in pdf_files:
    pdf_file_path = os.path.join(pdf_directory, filename)
    
    # Open the PDF file in binary read mode
    with open(pdf_file_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        num_pages = len(pdf_reader.pages)  # Get the number of pages
        
        # Decide which directory to place the PDF based on the number of pages
        if num_pages <= 2:
            output_path = os.path.join(output_directory_1_2, filename)
        elif num_pages == 3:
            output_path = os.path.join(output_directory_3, filename)
        elif 4 <= num_pages <= 5:
            output_path = os.path.join(output_directory_4_5, filename)
        elif num_pages >= 6:
            output_path = os.path.join(output_directory_6_plus, filename)
        
        # Copy the PDF file to the corresponding output directory
        shutil.copy(pdf_file_path, output_path)
        copied_files += 1
        
        # Calculate the progress percentage
        percentage = (copied_files / total_files) * 100
        
        print(f"{percentage:.2f}% completed, '{filename}' has been copied")

print("Process completed.")




1.35% completed, 'Evaluacion psicologica.pdf' has been copied
2.70% completed, 'Psicologa (2).pdf' has been copied
4.05% completed, 'Psicologa.pdf' has been copied
5.41% completed, 'Psicologia (10).pdf' has been copied
6.76% completed, 'Psicologia (11).pdf' has been copied
8.11% completed, 'Psicologia (12).pdf' has been copied
9.46% completed, 'Psicologia (2).pdf' has been copied
10.81% completed, 'Psicologia (3).pdf' has been copied
12.16% completed, 'Psicologia (4).pdf' has been copied
13.51% completed, 'Psicologia (5).pdf' has been copied
14.86% completed, 'Psicologia (6).pdf' has been copied
16.22% completed, 'Psicologia (7).pdf' has been copied
17.57% completed, 'Psicologia (8).pdf' has been copied
18.92% completed, 'Psicologia (9).pdf' has been copied
20.27% completed, 'Psicologia.pdf' has been copied
21.62% completed, 'Psicología (10).pdf' has been copied
22.97% completed, 'Psicología (11).pdf' has been copied
24.32% completed, 'Psicología (12).pdf' has been copied
25.68% completed,

## 2- Filtering by Number of Pages


In [120]:
import os
import datetime

# Location directory
location = 'C:/Users/HP/My Drive/Artificial Intelligence/PROJECTS/House24 - Toto/House24 Invoicing'
report_version = '2.0'  # Change the version as needed

# Function to get the report of a folder
def get_folder_report(folder):
   
    full_folder_path = os.path.join(location, folder)
    
    # Counter for the total number of files in the folder
    total_files = 0
    
    # Iterate through files in the folder
    for _, _, files in os.walk(full_folder_path):
        total_files += len(files)
    
    # Get the current date
    current_date = datetime.date.today()
    report = f"Date: {current_date}\n"
    
    # Add the report version
    report += f"Version: {report_version}\n\n"
    
    report += f"Folder: {folder}\n"
    # Add the total number of files to the report
    report += f"Total PDFs: {total_files}\n"
    
    # Iterate through subfolders in the main folder
    for subfolder in os.listdir(full_folder_path):
        subfolder_full_path = os.path.join(full_folder_path, subfolder)
        if os.path.isdir(subfolder_full_path):
            # Calculate the number of files in the subfolder
            subfolder_files = len(os.listdir(subfolder_full_path))
            
            # Calculate the percentage of files relative to the total number of files
            percentage = (subfolder_files / total_files) * 100
            
            # Add the information to the report
            report += f"  -{subfolder}: {subfolder_files} PDFs - {percentage:.2f}%\n"
    
    return report

# Generate the report for all folders in the location
general_report = ""
for folder in os.listdir(location):
    if os.path.isdir(os.path.join(location, folder)):
        general_report += get_folder_report(folder)

# Write the report to a text file
file_name = f'report_v{report_version}.txt'
with open(file_name, 'w') as file:
    file.write(general_report)

print(f"Report generated successfully in '{file_name}'.")




Report generated successfully in 'informe_v3.0.txt'.


## 3- Identification through OCR and Classification

In [28]:
import os
import pytesseract
from pdf2image import convert_from_path
import re
import nltk
from tqdm import tqdm

# Download the 'wordnet' resource from NLTK if you haven't downloaded it already
# nltk.download('wordnet')

# Function to find all word forms of the keyword
def find_all_word_forms(word):
    forms = set()
    # Add the original word and the variant "P R E S E N T I S M O" (without spaces)
    forms.add(word)
    forms.add("PRESENTISMO")
    # Find the lemma
    lemma = nltk.WordNetLemmatizer().lemmatize(word)
    # Add variations in lowercase and uppercase
    forms.add(word.lower())
    forms.add(word.upper())
    forms.add(word.capitalize())
    forms.add(lemma.lower())
    forms.add(lemma.upper())
    forms.add(lemma.capitalize())
    # Replace letters with accents and diacritics
    forms.add(re.sub(r'[^\x00-\x7F]+', '', word))
    forms.add(re.sub(r'[^\x00-\x7F]+', '', lemma))
    return list(forms)

# Base folder containing the 'CAREGIVER' folders
base_directory = 'C:/Users/HP/My Drive/Artificial Intelligence/PROJECTS/House24 - Toto/House24 Invoicing/Medical Visit'

# Get the list of folders in 'CAREGIVER'
caregiver_folders = [d for d in os.listdir(base_directory) if os.path.isdir(os.path.join(base_directory, d))]

# Keywords you want to search for (in this case, "Presentismo" and "P R E S E N T I S M O")
keywords = ["Presentismo", "P R E S E N T I S M O"]

# Counter for processed files
processed_counter = 0

# Iterate through the folders with a file-based progress bar
for caregiver_folder in caregiver_folders:
    pdf_directory = os.path.join(base_directory, caregiver_folder)

    # Get all word forms of the keyword
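    # De-duplicate across keywords: both produce "PRESENTISMO", which would otherwise be counted twice
    word_forms = sorted({form for keyword in keywords for form in find_all_word_forms(keyword)})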

    # Initialize counters for this folder
    counters = {i: 0 for i in range(7)}  # Fixed to include counter 0
    files_with_presentismo = {i: [] for i in range(7)}

    # Iterate through the files in the current directory with tqdm
    for filename in tqdm(os.listdir(pdf_directory), desc=f"Progress ({caregiver_folder})", dynamic_ncols=True):
        if filename.endswith('.pdf'):
            pdf_file = os.path.join(pdf_directory, filename)

            # Extract images from the PDF
            images = convert_from_path(pdf_file)

            # Initialize a counter for this file
            counter = 0

            # Iterate through the images and search for the keywords
            for image in images:
                text = pytesseract.image_to_string(image)
                counter += sum(text.count(form) for form in word_forms)

            # Update the corresponding counter
            if counter >= 6:
                counters[6] += 1
                files_with_presentismo[6].append(filename)
            else:
                counters[counter] += 1
                files_with_presentismo[counter].append(filename)

            # Increment the processed file counter
            processed_counter += 1

    # Report file name for this folder
    report_name = f"presentismo_report_{caregiver_folder}.txt"

    # Write the report once per folder, after all of its files have been processed
    with open(report_name, 'w', encoding='utf-8') as report:
        report.write(f"Report of files with the word 'Presentismo' in the folder '{caregiver_folder}':\n\n")
        for times, count in counters.items():
            report.write(f"{times} times: {count} files\n")
            report.write(f"Files with 'Presentismo' ({times} times):\n")
            for file in files_with_presentismo[times]:
                report.write(f"- {file}\n")
            report.write("\n")

print("Analysis completed.")



Progress (MED-1-2pags): 100%|██████████| 871/871 [1:08:39<00:00,  4.73s/it]
Progress (MED-3pags): 100%|██████████| 390/390 [50:38<00:00,  7.79s/it] 
Progress (MED-4-5pags): 100%|██████████| 509/509 [1:35:12<00:00, 11.22s/it]
Progress (MED-6+pags): 100%|██████████| 322/322 [1:42:36<00:00, 19.12s/it]

Analysis completed.





## 4- Data Filtering

In [46]:
# Through the following script, you can extract the files that will not be used for training and save them in the selected destination folder.
import os
import shutil

# Base directory where the files will be searched
base_directory = 'C:/Users/HP/My Drive/Artificial Intelligence/PROJECTS/House24 - Toto/House24 Invoicing/Caregiver'

# Destination folder where the files will be moved
destination_folder = 'C:/Users/HP/My Drive/Artificial Intelligence/PROJECTS/House24 - Toto/House24 Invoicing/Concatenated Forms (Split)'

# List of file names you want to search for
file_names = [
    '''put list of files to search'''
]


# Iterate through the base directory and its subdirectories
for current_directory, subdirectories, files in os.walk(base_directory):
    for file_name in file_names:
        if file_name in files:
            # Full path of the found file
            source_path = os.path.join(current_directory, file_name)
            # Destination path where the file will be moved
            destination_path = os.path.join(destination_folder, file_name)

            # Move the file to the destination directory
            try:
                shutil.move(source_path, destination_path)
                print(f"{file_name} has been moved to {destination_path}")
            except Exception as e:
                print(f"Error moving {file_name}: {str(e)}")

print("Process completed.")



Process completed.


We gather the files that will not be used for training into a single destination folder.

In [61]:
import os
import shutil

# Path of the main folder
main_folder_path = 'C:/Users/HP/My Drive/Artificial Intelligence/PROJECTS/House24 - Toto/House24 Invoicing/Split'
destination_folder = 'C:/Users/HP/My Drive/Artificial Intelligence/PROJECTS/House24 - Toto/House24 Invoicing/Concatenated Forms (Split)'

# Function to move subfolders with files
def move_subfolders_with_files(path, destination):
    for current_directory, subdirectories, files in os.walk(path, topdown=False):
        for subdirectory in subdirectories:
            subfolder = os.path.join(current_directory, subdirectory)
            if os.listdir(subfolder):
                # If the subfolder is not empty, move it to the destination directory
                destination_subfolder = os.path.join(destination, subdirectory)
                shutil.move(subfolder, destination_subfolder)
                print(f"Subfolder with files moved to {destination_subfolder}: {subdirectory}")

# Call the function to move subfolders with files to the destination folder
move_subfolders_with_files(main_folder_path, destination_folder)



Subfolder with files moved to C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion/Formularios concatenados (Dividir)\Kinesiologia Ej-22 (B).pdf: Kinesiologia Ej-22 (B).pdf
Subfolder with files moved to C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion/Formularios concatenados (Dividir)\Kinesiologia Ej-8.pdf: Kinesiologia Ej-8.pdf
Subfolder with files moved to C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion/Formularios concatenados (Dividir)\Kinesiologia Ej-9.pdf: Kinesiologia Ej-9.pdf
Subfolder with files moved to C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion/Formularios concatenados (Dividir)\Kinesiologia motora.pdf: Kinesiologia motora.pdf
Subfolder with files moved to C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion/Formularios concatenados (Dividir)\Kinesiología (10).pdf

We modify the names of the files for better control and organization of the dataset.

In [70]:
import os

# Path of the main folder where the files and subfolders are located
main_folder_path = 'C:/Users/HP/My Drive/Artificial Intelligence/PROJECTS/House24 - Toto/House24 Invoicing/'
main_folder = main_folder_path + 'Medical Visit'

# Check if the main folder exists
if os.path.exists(main_folder):
    # Initialize a global counter for all Cardiologist files
    cardiologist_counter = 0

    # Iterate through the main folder and all its subfolders
    for current_directory, subdirectories, files in os.walk(main_folder):
        for file in files:
            if file.lower().endswith('.pdf'):
                cardiologist_counter += 1
                new_name = f'Form{cardiologist_counter}- MED.pdf'
                
                # Full path of source and destination
                source = os.path.join(current_directory, file)
                destination = os.path.join(current_directory, new_name)

                # Rename the file
                os.rename(source, destination)
                print(f"Renamed: {file} -> {new_name}")
else:
    print(f"The main folder {main_folder} does not exist.")


Renamed: Achilli Maria Luisa (Medico).pdf -> Form1- MED.pdf
Renamed: AGOSTO - OSOSS - ABAD, Jose (Medico).pdf -> Form2- MED.pdf
Renamed: ALBA AURORA (MEDICO).pdf -> Form3- MED.pdf
Renamed: AMIGO NELIDA (05-05 Médico).pdf -> Form4- MED.pdf
Renamed: BARONI TERESA (MEDICO).pdf -> Form5- MED.pdf
Renamed: Barrera Raquel (Medico) (2).pdf -> Form6- MED.pdf
Renamed: Barrera Raquel (Medico) (3).pdf -> Form7- MED.pdf
Renamed: Barrera Raquel (Medico) (4).pdf -> Form8- MED.pdf
Renamed: Barrera Raquel (Medico) (5).pdf -> Form9- MED.pdf
Renamed: Barrera Raquel (Médico) (2).pdf -> Form10- MED.pdf
Renamed: Barrera Raquel (Médico) (3).pdf -> Form11- MED.pdf
Renamed: Barrera Raquel (Médico).pdf -> Form12- MED.pdf
Renamed: BRUNET NORMA (MEDICO).pdf -> Form13- MED.pdf
Renamed: CAIBOZN SALOMON (16-05 Médico).pdf -> Form14- MED.pdf
Renamed: CALORI OSVALDO (06-05 Médico).pdf -> Form15- MED.pdf
Renamed: Capirone Angela (MEDICO).pdf -> Form16- MED.pdf
Renamed:

## informe_v2.0

 This report provides detailed information about the division of files within each folder, including the total number of PDFs and the distribution of PDFs based on the number of pages they contain. It also specifies the date and version of the report.

In [None]:
import os
import shutil

# Path of the main folder where files and subfolders are located
ruta_principal = 'C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion'
ruta_destino = os.path.join(ruta_principal, '1-Form to Train')
# Check if the main folder exists
if os.path.exists(ruta_principal):
    # Create the destination folder if it doesn't exist
    os.makedirs(ruta_destino, exist_ok=True)
    # Traverse the main folder and all its subfolders
    for directorio_actual, subdirectorios, archivos in os.walk(ruta_principal):
        # Skip the destination folder so files are not copied onto themselves
        if os.path.abspath(directorio_actual) == os.path.abspath(ruta_destino):
            continue
        for archivo in archivos:
            if archivo.lower().endswith('.pdf'):
                # Full path of source and destination
                origen = os.path.join(directorio_actual, archivo)
                destino = os.path.join(ruta_destino, archivo)
                # Copy the file
                shutil.copy(origen, destino)
else:
    print(f"The main folder {ruta_principal} does not exist.")


Based on the results of the informe_v2.0, we proceed with the creation of 3 datasets to train the machine learning model.

    - Training Dataset

    - Excluded Data 

    - Test Dataset
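
One way to keep the three datasets disjoint is to shuffle the file list once and slice it. The sketch below is an illustration under assumptions: the function name `split_dataset`, the `seed` parameter, and the counts in the example are hypothetical and are not the values used in this project.

```python
import random


def split_dataset(filenames, test_count, excluded_count, seed=0):
    """Split file names into disjoint (training, excluded, test) lists."""
    if test_count + excluded_count > len(filenames):
        raise ValueError("not enough files for the requested split")
    shuffled = sorted(filenames)           # fixed starting order
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    test = shuffled[:test_count]
    excluded = shuffled[test_count:test_count + excluded_count]
    training = shuffled[test_count + excluded_count:]
    return training, excluded, test


names = [f"Form{i}- MED.pdf" for i in range(1, 11)]
training, excluded, test = split_dataset(names, test_count=2, excluded_count=3, seed=1)
print(len(training), len(excluded), len(test))
```

Slicing a single shuffled list guarantees that no file lands in two datasets, and the fixed seed makes the split repeatable across runs.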

In [111]:
# In this stage, we perform an initial selection of files 

import os
import shutil
from tqdm import tqdm  # Import tqdm for progress bar

# Path to the main folder where files and subfolders are located
ruta_principal = 'C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion'
ruta_destino = 'C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion/1-Forms to Train'

# Folder to be omitted
carpeta_omitir = 'Test Dataset'

# Check if the main folder and the destination folder exist
if os.path.exists(ruta_principal):
    if not os.path.exists(ruta_destino):
        os.makedirs(ruta_destino)  # Create the destination folder if it doesn't exist

    # List of PDF files in the main folder
    archivos_pdf = [os.path.join(directorio_actual, archivo) for directorio_actual, _, archivos in os.walk(ruta_principal) for archivo in archivos if archivo.lower().endswith('.pdf')]

    # Total number of PDF files
    total_archivos_pdf = len(archivos_pdf)

    # Iterate through the PDF files and copy them using tqdm, omitting the specified folder
    for archivo in tqdm(archivos_pdf, unit='archivo', unit_scale=True, desc='Copying files'):
        nombre_archivo = os.path.basename(archivo)
        destino = os.path.join(ruta_destino, nombre_archivo)

        # Skip any file whose path contains the name of the folder to be omitted
        if carpeta_omitir not in archivo:
            # Copy the file
            shutil.copy2(archivo, destino)

    print("PDF files successfully copied to '1-Forms to Train'.")
else:
    print(f"The main folder {ruta_principal} does not exist.")




Copying files: 100%|██████████| 11.4k/11.4k [07:29<00:00, 25.3archivo/s]

PDF files successfully copied to '1-Form to Train'.





Files that will not be used for training are moved to separate test and validation folders for better dataset control.

In [117]:
import os
import random
from shutil import move
from tqdm import tqdm

# Paths of the source and destination folders
source_folder = 'C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion/1-Forms to Train'
test_destination_folder = 'C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion/3-Forms to Test'
validation_destination_folder = 'C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion/2-Validation set'

# Number of files to send to each destination
test_files_count = 806
validation_files_count = 2351

# List of files in the source folder
files_in_source = os.listdir(source_folder)

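# Draw both samples in a single call so that no file is selected for both folders
selected_files = random.sample(files_in_source, test_files_count + validation_files_count)

# Random selection of files for the Test folder
files_for_test = selected_files[:test_files_count]

# Random selection of files for the Validation folder
files_for_validation = selected_files[test_files_count:]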

# Move files to the Test folder
print("Sending files to the '3-Forms to Test' folder...")
for file in tqdm(files_for_test, total=test_files_count):
    source = os.path.join(source_folder, file)
    destination = os.path.join(test_destination_folder, file)
    move(source, destination)

# Move files to the Validation folder
print("Sending files to the '2-Validation set' folder...")
for file in tqdm(files_for_validation, total=validation_files_count):
    source = os.path.join(source_folder, file)
    destination = os.path.join(validation_destination_folder, file)
    move(source, destination)

print("Files sent successfully.")



Sending files to the '3-Forms to Test' folder...


  7%|▋         | 53/806 [00:00<00:01, 526.33it/s]

100%|██████████| 806/806 [00:03<00:00, 220.94it/s]

Files sent successfully.





## 5- Compliance with GCP Minimum Requirements

This repository ensures compliance with the minimum requirements set by Google Cloud Platform (GCP) through the following measures:

   - PDF Composition Report: A comprehensive report detailing the composition of PDF files within each folder is generated to maintain transparency and facilitate understanding of the dataset.

   - Data Selection: A careful selection process is implemented to extract data up to 1GB specifically tailored for model training purposes, ensuring efficiency and optimal resource utilization.
   
   - Text Verification: Document verification is conducted to ensure that all included files meet the minimum text requirements, specifically containing 8-point text at 150 DPI, thereby ensuring data quality and consistency.



## informe_v3.0

After moving the files that will not be used for training, the following script generates a report detailing the composition of PDF files in each folder within the specified location. 

It calculates the total number of PDFs, their percentage relative to all files, and provides a breakdown of PDFs in each subfolder. 

Finally, it saves the report to a text file. 





In [134]:
import os
import datetime

# Defining the location where the folders containing PDF files are stored
location = 'C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion'

# Setting the version of the report
report_version = '3.0'

# Function to generate a report for a specific folder
def get_folder_report(folder):
    # Getting the full path of the folder
    full_folder_path = os.path.join(location, folder)
    
    # Counting the total number of files in the folder
    total_files = 0
    for _, _, files in os.walk(full_folder_path):
        total_files += len(files)
    
    # Getting the current date
    current_date = datetime.date.today()
    
    # Constructing the report header with date and version
    report = f"Date: {current_date}\n"
    report += f"Version: {report_version}\n\n"
    report += f"Folder: {folder}\n"
    
    # Counting the total number of files in all folders
    sum_total_files = 0
    for _, _, files in os.walk(location):
        sum_total_files += len(files)
        
    # Calculating the percentage of files in the current folder compared to all files
    total_percentage = (total_files / sum_total_files) * 100
    
    # Adding total number of files and percentage to the report
    report += f"Total PDFs: {total_files} - {total_percentage:.2f}%\n"
    
    # Iterating through subfolders and adding their details to the report
    for subfolder in os.listdir(full_folder_path):
        full_subfolder_path = os.path.join(full_folder_path, subfolder)
        if os.path.isdir(full_subfolder_path):
            subfolder_files = len(os.listdir(full_subfolder_path))
            percentage = (subfolder_files / sum_total_files) * 100
            select_new_data = (percentage * subfolder_files) / 100
            report += f"  -{subfolder}: {subfolder_files} PDFs - {percentage:.2f}% - {select_new_data:.2f} files to select\n"
    
    return report

# Generating a report for each folder in the location
general_report = ""
for folder in os.listdir(location):
    if os.path.isdir(os.path.join(location, folder)):
        general_report += get_folder_report(folder)

# Saving the report to a text file
file_name = f'report_v{report_version}.txt'
with open(file_name, 'w') as file:
    file.write(general_report)

# Printing a success message
print(f"Report generated successfully in '{file_name}'.")



Report generated successfully in 'informe_v3.0.txt'.


## Data Filtering for Azure Minimum Requirements

To meet the minimum requirements of Azure, a data filtering process is performed.

Based on the report detailing the composition of PDF files in each folder, the files are filtered using two criteria to reduce the final dataset that will be used to train the model:

- Visual Characteristics: According to the method by which the file was digitized.
   
- File Size: According to the amount of disk space it occupies.

Final Conclusion: There is an apparent relationship between the file size, image quality, and the digitization method of the form. Based on this observation, files that will not be used for model training are separated.



## informe_v4.0

After all the modifications and analyses carried out on the files, a final report is prepared that shows how the files that will serve as training data are composed and distributed. The ETL process continues with another layer, which will be developed in a later stage.

In [3]:
import os
import datetime

# Directory where the files and subfolders are located
location = 'C:/Users/HP/My Drive/Inteligencia Artificial/PROJECTS/House24 - Toto/House24 Facturacion'

# Report version
report_version = '4.0'

# Function to generate folder report
def get_folder_report(folder):
    full_folder_path = os.path.join(location, folder)
    
    # Count total number of files in the folder
    total_files = 0
    for _, _, files in os.walk(full_folder_path):
        total_files += len(files)
    
    # Get current date
    current_date = datetime.date.today()
    
    # Initialize report string
    report = f"Date: {current_date}\n"
    report += f"Version: {report_version}\n\n"
    report += f"Folder: {folder}\n"
    
    # Calculate total size of files in the folder
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(full_folder_path):
        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            total_size += os.path.getsize(file_path)
    
    # Convert total size to MB
    total_size = total_size / (1024 * 1024)
    
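    # Count the total number of files under the whole location
    sum_total_files = 0
    for _, _, files in os.walk(location):
        sum_total_files += len(files)

    # Calculate percentage of total files
    total_files_percentage = (total_files / sum_total_files) * 100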
    
    # Add total files information to the report
    report += f"Total PDFs: {total_files} - {total_size:.2f}MB - {total_files_percentage:.2f}%\n"
    
    # Loop through subfolders
    for subfolder in os.listdir(full_folder_path):
        subfolder_path = os.path.join(full_folder_path, subfolder)
        if os.path.isdir(subfolder_path):
            # Count total number of files in each subfolder
            files_in_subfolder = len(os.listdir(subfolder_path))
            report += f"  -{subfolder}: {files_in_subfolder} PDFs\n"
    
    return report

# Generate the overall report
general_report = ""
for folder in os.listdir(location):
    if os.path.isdir(os.path.join(location, folder)):
        general_report += get_folder_report(folder)

# Save the report to a text file
file_name = f'report_v{report_version}.txt'
with open(file_name, 'w') as file:
    file.write(general_report)

print(f"Report generated successfully in '{file_name}'.")


Report generated successfully in 'informe_v4.0.txt'.
