## 1 - Sampling of PDF Files

In [22]:
import os
import random
import shutil

def sample_files(directory_path, sample_size=100, destination_folder="metadata_sample"):
    """
    Takes a random sample of files from a directory and copies them to a destination folder.
    """
    # Get a list of all files in the directory
    all_files = [f for f in os.listdir(directory_path) if os.path.isfile(os.path.join(directory_path, f))]
    
    # Check if there are enough files
    if len(all_files) < sample_size:
        print(f"Error: The directory only has {len(all_files)} files, but {sample_size} were requested.")
        return
    
    # Take a random sample
    sampled_files = random.sample(all_files, sample_size)

    # Create the destination directory if it doesn't exist
    dest_path = os.path.join(directory_path, destination_folder)
    os.makedirs(dest_path, exist_ok=True)

    # Copy the sampled files to the destination directory
    for file in sampled_files:
        shutil.copy2(os.path.join(directory_path, file), dest_path)

    print(f"{sample_size} files copied to {dest_path}")

# Specify the directory
directory = r"C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\Training Dataset"
sample_files(directory)



85 files copied to C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\Training Dataset\metadata_sample
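Because `random.sample` draws a different set of files on every run, the sample above cannot be reproduced exactly. The following is a minimal sketch of a repeatable selection step. The helper name `sample_pdf_names`, the `seed` parameter, the PDF-only filter, and the capping of the sample size are all assumptions for illustration and are not part of the original function.

```python
import random

def sample_pdf_names(file_names, sample_size=100, seed=42):
    """Return a reproducible random sample of PDF file names.

    Hypothetical helper: if fewer PDFs exist than requested, all are returned.
    """
    # Keep only PDFs, matching the extension case-insensitively;
    # sorting makes the result independent of directory listing order
    pdf_names = sorted(n for n in file_names if n.lower().endswith(".pdf"))
    # Cap the sample size at the number of available PDFs
    k = min(sample_size, len(pdf_names))
    # A dedicated generator with a fixed seed yields the same sample every run
    rng = random.Random(seed)
    return rng.sample(pdf_names, k)

# Made-up file names, used only to illustrate the call
names = ["a.pdf", "b.PDF", "notes.txt", "c.pdf"]
print(sample_pdf_names(names, sample_size=2, seed=0))
```

The returned names could then be passed to the same `shutil.copy2` loop used in `sample_files()`.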


### Relevant Fields for Training Process

The following fields are identified as relevant for the training process:

- **Name**: File name of the document
- **Producer**: Software used to convert the document to PDF
- **Creator**: Software used to create the document
- **TotalPages**: Number of pages in the document
- **FileSize**: Size of the document file
- **Title**: Document title
- **Author**: Document author
- **Subject**: Document subject
- **CreationDate**: Document creation date
- **ModDate**: Document modification date
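The `CreationDate` and `ModDate` fields are stored as raw strings. The sketch below converts them to `datetime` objects, assuming they follow the usual PDF date layout `D:YYYYMMDDHHmmSS` with an optional `+HH'mm'`, `-HH'mm'`, or `Z` suffix. The source does not show the raw values, so that format and the helper name `parse_pdf_date` are assumptions.

```python
import re
from datetime import datetime, timedelta, timezone

# Everything after the year is optional in the assumed PDF date layout
PDF_DATE_PATTERN = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?P<tz_hour>\d{2})?'?(?P<tz_minute>\d{2})?'?)?$"
)

def parse_pdf_date(raw):
    """Parse a PDF date string; return None if it is empty or malformed."""
    if not raw:
        return None
    match = PDF_DATE_PATTERN.match(str(raw).strip())
    if match is None:
        return None
    p = match.groupdict()
    try:
        tz = None  # stays naive when the string carries no offset
        if p["sign"] == "Z":
            tz = timezone.utc
        elif p["sign"] in ("+", "-"):
            offset = timedelta(hours=int(p["tz_hour"] or 0),
                               minutes=int(p["tz_minute"] or 0))
            tz = timezone(offset if p["sign"] == "+" else -offset)
        # Missing month/day default to 1, missing time parts to 0
        return datetime(
            int(p["year"]), int(p["month"] or 1), int(p["day"] or 1),
            int(p["hour"] or 0), int(p["minute"] or 0), int(p["second"] or 0),
            tzinfo=tz,
        )
    except ValueError:
        # Out-of-range values such as month 13 are treated as malformed
        return None

print(parse_pdf_date("D:20230115103000+02'00'"))
```

Returning `None` for malformed values keeps one bad date from aborting a whole run, at the cost of having to count the `None` results afterwards.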



## 2 - Analysis of PDF Metadata

This section writes the selected metadata to a CSV file and then analyzes it.


In [45]:
import pandas as pd

route = pd.read_csv(r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\metadata_2.0.csv')

The following script extracts metadata from each PDF file: the file path, title, author, subject, creator, producer, creation date, and modification date. It saves this information to a CSV file for further analysis.

In [19]:
import PyPDF2
import os
import csv

def extract_metadata(pdf_path):
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        metadata = reader.metadata
    return metadata if metadata else {}

def process_directory(directory_path, output_csv):
    with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Filepath', 'Title', 'Author', 'Subject', 'Creator', 'Producer', 'CreationDate', 'ModDate']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        for root, dirs, files in os.walk(directory_path):
            for file in files:
                if file.lower().endswith(".pdf"):  # also match .PDF
                    pdf_path = os.path.join(root, file)
                    try:
                        metadata = extract_metadata(pdf_path)
                        writer.writerow({
                            'Filepath': pdf_path,
                            'Title': metadata.get('/Title', ''),
                            'Author': metadata.get('/Author', ''),
                            'Subject': metadata.get('/Subject', ''),
                            'Creator': metadata.get('/Creator', ''),
                            'Producer': metadata.get('/Producer', ''),
                            'CreationDate': metadata.get('/CreationDate', ''),
                            'ModDate': metadata.get('/ModDate', '')
                        })
                    except Exception as e:
                        print(f"Error processing {pdf_path}: {e}")

directory = r"C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\Training Dataset"
output_file = 'metadata_v1.0.csv'
process_directory(directory, output_file)


### Metadata Analysis Script Revision

The first script did not produce the expected result, so a revised script was written for the metadata analysis.

The CSV generated by the `process_directory()` script is then analyzed to determine which files are relevant for the training process.


### Extracting Metadata from PDF Files

The revised script extracts metadata from PDF files and stores it in CSV format. It works as follows:

1. **Importing Libraries**: The script imports `PyPDF2` for reading PDF files, `os` for file operations, and `csv` for writing the CSV file.

2. **Defining Functions**:
    - `extract_metadata(pdf_path)`: Takes the path of a PDF file, reads it with `PyPDF2.PdfReader`, and returns the document metadata (producer, creator, and so on) together with the total page count under the key `TotalPages`.
    - `process_directory(directory_path, output_csv)`: Walks the specified directory with `os.walk()`, calls `extract_metadata()` on each PDF file, reads the file size with `os.path.getsize()`, and writes one row per file to the CSV file specified by `output_csv`.

3. **Processing PDF Files**:
    - The script iterates through the directory specified by `directory_path`.
    - For each PDF file found in the directory, it calls the `extract_metadata()` function to extract metadata.
    - The metadata is then written to a CSV file specified by `output_csv`.

4. **Handling Exceptions**:
    - The script includes exception handling to catch any errors that may occur during the metadata extraction process. If an error occurs, it prints an error message along with the file path.

5. **Usage**:
    - Specify the directory containing the PDF files in the `directory` variable.
    - Specify the name of the output CSV file in the `output_file` variable.
    - Call the `process_directory()` function with the directory path and output CSV file name as arguments.

This script is useful for extracting metadata from PDF files in a specified directory and storing it in a structured format for further analysis or processing.

In [24]:
import PyPDF2
import os
import csv

def extract_metadata(pdf_path):
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        metadata = reader.metadata
        total_pages = len(reader.pages)
    return {**metadata, 'TotalPages': total_pages} if metadata else {'TotalPages': total_pages}

def process_directory(directory_path, output_csv):
    with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Name', 'Producer', 'Creator', 'TotalPages', 'FileSize', 'Title', 'Author', 'Subject', 'CreationDate', 'ModDate']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        for root, dirs, files in os.walk(directory_path):
            for file in files:
                if file.lower().endswith(".pdf"):  # also match .PDF
                    pdf_path = os.path.join(root, file)
                    try:
                        metadata = extract_metadata(pdf_path)
                        file_size = os.path.getsize(pdf_path)
                        writer.writerow({
                            'Name': file,
                            'Producer': metadata.get('/Producer', ''),
                            'Creator': metadata.get('/Creator', ''),
                            'TotalPages': metadata.get('TotalPages', ''),
                            'FileSize': file_size,
                            'Title': metadata.get('/Title', ''),
                            'Author': metadata.get('/Author', ''),
                            'Subject': metadata.get('/Subject', ''),
                            'CreationDate': metadata.get('/CreationDate', ''),
                            'ModDate': metadata.get('/ModDate', '')
                        })
                    except Exception as e:
                        print(f"Error processing {pdf_path}: {e}")

directory = r"C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\Training Dataset\metadata_sample"
output_file = 'metadata_2.0.csv'
process_directory(directory, output_file)
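The script above prints each failure as it happens, so the list of unreadable files is lost once the output scrolls away. A minimal sketch of an alternative is to collect failures alongside the successful rows. The helper `extract_all` and the stand-in `fake_extractor` are hypothetical names used only for illustration. In practice `extract_metadata` would be passed as the extractor.

```python
def extract_all(pdf_paths, extractor):
    """Apply `extractor` to each path and return (rows, failures).

    Hypothetical helper: `extractor` is any callable that maps a path to a
    dict of metadata and may raise on unreadable files.
    """
    rows, failures = [], []
    for path in pdf_paths:
        try:
            metadata = extractor(path)
        except Exception as exc:  # broad on purpose, mirroring the script above
            failures.append({"Filepath": path,
                             "Error": f"{type(exc).__name__}: {exc}"})
        else:
            rows.append({"Filepath": path, **metadata})
    return rows, failures

def fake_extractor(path):
    """Stand-in for extract_metadata so the sketch runs without PDF files."""
    if "broken" in path:
        raise ValueError("unreadable PDF")
    return {"TotalPages": 3}

rows, failures = extract_all(["ok.pdf", "broken.pdf"], fake_extractor)
print(len(rows), "extracted;", len(failures), "failed")
```

The `failures` list could be written to its own CSV file, which makes it easy to see how many files were skipped and why.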

### Determining Relevant Metadata Fields

Based on the extracted metadata from the PDF files, the following fields are determined to be relevant:

- **Name**: File name of the document
- **Producer**: Software used to convert the document to PDF
- **Creator**: Software used to create the document
- **TotalPages**: Number of pages in the document
- **FileSize**: Size of the document file

These fields describe the origin and size of each PDF document and are the ones summarized in the analysis below.




## Summary of Data Analysis

### General Data

- **Total Files**: 9933.

### 'Name' Column

- **Unique Files**: 9933.
- Each file has a unique name in the dataset, indicating no duplicates based on name.

### 'Producer' Column

- **Unique Producers**: 45.
- **Primary Producer**: Samsung-M4580FX, which produced 6416 files.
- **Missing Data**: There are 50 files with missing producer information.

### 'Creator' Column

- **Unique Creators**: 25.
- **Primary Creator**: Created By SAMSUNG MFP, associated with 6425 files.
- **Missing Data**: 3363 files have no creator information. This is more than a third of the data and may require further investigation.
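Counts like the ones reported for the 'Name', 'Producer', and 'Creator' columns can be obtained with a small helper such as the sketch below. It uses only the standard library. The function name `summarize_text_column` and the inline sample rows are made up for illustration and do not come from the real dataset.

```python
import csv
import io
from collections import Counter

def summarize_text_column(rows, column):
    """Count unique values, the most common value, and missing entries."""
    # Treat absent keys, None, and whitespace-only strings as missing
    values = [(row.get(column) or "").strip() for row in rows]
    present = [v for v in values if v]
    counts = Counter(present)
    top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "unique": len(counts),
        "top": top_value,
        "top_count": top_count,
        "missing": len(values) - len(present),
    }

# Tiny made-up CSV used only to illustrate the call
sample_csv = io.StringIO(
    "Name,Producer,Creator\n"
    "a.pdf,ScannerX,Scan App\n"
    "b.pdf,ScannerX,\n"
    "c.pdf,,Word\n"
)
rows = list(csv.DictReader(sample_csv))
print(summarize_text_column(rows, "Producer"))
```

Running the same function on the 'Name' column gives a quick duplicate check: the file names are all distinct when `unique` equals the number of rows.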

### 'TotalPages' Column

- **Average Pages**: Approximately 3.62 pages.
- **Range**: Varies between 1 page and 27 pages.
- Most files have between 2 and 4 pages.

### 'FileSize' Column

- **Average Size**: Approximately 1604.79 KB.
- **Range**: From as small as 1.88 KB to 22536.08 KB.
- Most files have a size between 856.74 KB and 1914.33 KB.
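The page and size figures above can be reproduced with a short numeric summary. This sketch makes two assumptions the source does not state: that 1 KB means 1024 bytes (`os.path.getsize` returns bytes), and that the "most files" ranges are interquartile ranges. The function names and sample values are illustrative only.

```python
import statistics

def bytes_to_kb(size_in_bytes):
    """Convert bytes to kilobytes, assuming 1 KB = 1024 bytes."""
    return size_in_bytes / 1024

def summarize_numeric(values):
    """Return mean, min, max, and quartiles for a list of numbers."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3;
    # it needs at least two data points
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }

# Made-up page counts and file sizes, used only to illustrate the calls
page_counts = [1, 2, 3, 4, 5]
sizes_in_bytes = [2048, 4096, 1024, 8192]

print(summarize_numeric(page_counts))
print(summarize_numeric([bytes_to_kb(s) for s in sizes_in_bytes]))
```

Note that `statistics.quantiles` defaults to the "exclusive" method, so its quartiles can differ slightly from those of other tools that interpolate differently.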
