### Exploratory Data Analysis (EDA) to Determine Relevant Files for the Training Process


In [None]:
import pandas as pd

# Load the CSV file
df = pd.read_csv('metadata_2.0.csv')

# Show a summary of the columns, dtypes, and non-null counts
df.info()

In [None]:
df.describe()

## Classification of File Creators

The software or device that produced each file (the `Producer` metadata field) is mapped to a numeric label for further analysis.

- **1 - Samsung-M4580FX**
- **2 - intsig.com pdf producer**
- **3 - 3-Heights™ PDF Merge Split Shell 6.12.1.11 (http://www.pdf-tools.com)**
- **4 - Microsoft® Word 2019**
- **5 - iLovePDF**
- **6 - RxRelease / Haru2.4.0dev**
- **7 - iText® 5.4.4 ©2000-2013 1T3XT BVBA (AGPL-version)**
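
As a minimal sketch, the labels above could be kept in a fixed dictionary so that each producer always receives the same code, regardless of the order in which producers appear in the CSV. The names `PRODUCER_LABELS` and `label_for` are hypothetical, the producer strings are copied from the list above (the exact strings in the CSV may differ), and the fallback code `0` for unlisted producers is an assumption.

```python
# Hypothetical fixed mapping, copied from the list above.
PRODUCER_LABELS = {
    "Samsung-M4580FX": 1,
    "intsig.com pdf producer": 2,
    "3-Heights™ PDF Merge Split Shell 6.12.1.11 (http://www.pdf-tools.com)": 3,
    "Microsoft® Word 2019": 4,
    "iLovePDF": 5,
    "RxRelease / Haru2.4.0dev": 6,
    "iText® 5.4.4 ©2000-2013 1T3XT BVBA (AGPL-version)": 7,
}


def label_for(producer):
    """Return the numeric label for a producer, or 0 if it is not in the list."""
    return PRODUCER_LABELS.get(producer, 0)


print(label_for("iLovePDF"))
print(label_for("some other tool"))
```

With a DataFrame, the same dictionary could be applied through `df['Producer'].map(PRODUCER_LABELS)`.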


In [None]:
import pandas as pd
import os
import shutil

# Path to the CSV file
route = r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\metadata_2.0.csv'

# Folder where the files are currently located and where the new producer folders will be created
source_folder = r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\Training Dataset\metadata_sample'

# Read the CSV file
data = pd.read_csv(route)

# For each row in the DataFrame
for index, row in data.iterrows():
    file_name = row['Name']
    producer = row['Producer']

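    # Create a folder for the producer if it does not exist.
    # Note: producer names containing ':' or '/' (e.g. a URL) are not valid
    # Windows folder names; the clean_directory_name cell further below handles this.
    producer_folder = os.path.join(source_folder, producer)
    os.makedirs(producer_folder, exist_ok=True)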

    # Move the file to the producer's folder
    source_file_path = os.path.join(source_folder, file_name)
    dest_file_path = os.path.join(producer_folder, file_name)
    
    # Check if the file exists in the original location before moving it
    if os.path.exists(source_file_path):
        shutil.move(source_file_path, dest_file_path)

The metadata folder is divided into 7 subfolders:

- **Samsung-M4580FX**: This producer appears to be the most common in the dataset, suggesting that many of these documents were likely created or scanned using a Samsung M4580FX machine.
- **intsig.com pdf producer**: Represents another significant portion of the documents. "CamScanner" is often associated with this as the creator, indicating that many of these documents were likely scanned using the CamScanner app.
- **iLovePDF**: Appears several times, suggesting that this tool was used to modify or combine PDFs.
- **3-Heights™ PDF Merge Split Shell**: This indicates the use of PDF-Tools to merge or split the documents.
- **Microsoft® Word 2019**: Some documents were produced directly from MS Word.
- **iText® 5.4.4**: Some PDFs were produced with an older version of iText, a library for creating and manipulating PDFs.
- **RxRelease / Haru2.4.0dev**: A less common producer, possibly indicating another application or platform for PDF creation.


## Producers:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV file
df = pd.read_csv(r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\metadata_2.0.csv')

# Count the number of files per producer
conteo_producers = df['Producer'].value_counts()

# Display the result
print(conteo_producers)

# Visualize the result in a bar plot
conteo_producers.plot(kind='bar', figsize=(10,6))
plt.title('Number of Files by Producer')
plt.xlabel('Producer')
plt.ylabel('Number of Files')
plt.show()


## TotalPages:

- Most PDFs have 2-3 pages, although a few are longer, with one reaching 11 pages. The cell below examines the distribution of page counts.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\metadata_2.0.csv')

# Count the number of files per number of pages
page_count = df['TotalPages'].value_counts()

# Display the result
print(page_count)

# Visualize the result in a bar plot
page_count.plot(kind='bar', figsize=(10,6))
plt.title('Files by Number of Pages')
plt.xlabel('Pages')
plt.ylabel('Number of Files')
plt.show()

## FileSize:

- File sizes vary widely, ranging from just over 100 KB to over 13 MB, indicating a diversity in the content of these documents (e.g., plain text vs high-resolution images).


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv(r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\metadata_2.0.csv') 
# Convert sizes from B to MB
df['FileSize'] = df['FileSize'] / 1024 / 1024

# We can get some basic statistics
statistics_size = df['FileSize'].describe()
print(statistics_size)

# Histogram to visualize the distribution of file sizes
plt.figure(figsize=(10,6))
plt.hist(df['FileSize'], bins=50, color='blue', edgecolor='black')
plt.title('File Size Distribution')
plt.xlabel('File Size (MB)')
plt.ylabel('Number of Files')
plt.grid(axis='y')
plt.show()


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV file (`route` is the CSV path defined in an earlier cell)
data = pd.read_csv(route)

# Convert sizes from B to MB
data['FileSize'] = data['FileSize'] / 1024 / 1024

# Group by 'Producer' and calculate the total sum of 'FileSize' for each 'Producer'
grouped_data = data.groupby('Producer')['FileSize'].sum().sort_values(ascending=False)

# Visualize the results
plt.figure(figsize=(12, 8))
grouped_data.plot(kind='barh', color='skyblue')
plt.title('Total File Size by Producer')
plt.xlabel('Total Size (MB)')
plt.ylabel('Producer')
plt.gca().invert_yaxis()  # This is to have the producer with the largest size at the top
plt.tight_layout()
plt.show()


In [None]:
import pandas as pd

# Path to the CSV file
route = r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\metadata_2.0.csv'

# Read the CSV file and load it into a DataFrame
data = pd.read_csv(route)

# Create a dictionary that assigns a unique code to each producer
producer_codes = {producer: code for code, producer in enumerate(data['Producer'].unique(), start=1)}

# Map the dictionary to the DataFrame to create the new column
data['producer_code'] = data['Producer'].map(producer_codes)

# Create a reference DataFrame for the producer codes
producer_reference = pd.DataFrame(list(producer_codes.items()), columns=['Producer', 'producer_code'])

# Save the reference DataFrame to another CSV
reference_route = r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\producer_reference.csv'
producer_reference.to_csv(reference_route, index=False)

# Save the modified DataFrame back to the original CSV
data.to_csv(route, index=False)


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Path to the CSV file
route = r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\metadata_2.0.csv'

# Read the CSV file and load it into a DataFrame
data = pd.read_csv(route)

# Convert FileSize from Bytes to Megabytes (MB)
data['FileSize'] = data['FileSize'] / (1024 * 1024)

# Visualize all relationships between pairs of variables
# Use the 'hue' argument to color the points according to the 'Producer'
sns.pairplot(data, hue='Producer', diag_kind="kde", markers='o', plot_kws={'alpha': 0.9}, height=2.5)
plt.suptitle('Relationships between pairs of variables', y=1.02)
plt.show()


### Title & Author:

- In several cases, the title and author fields appear to contain names, suggesting that these documents are related to individual persons. Some titles such as "Screenshot 08-01-2022 13.59" indicate that they might have been taken as captures or scans of other content.

### CreationDate & ModDate:
- The documents span a range of creation and modification dates, with the most recent ones from January 2023 and the oldest from February 2022.
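
A date range like this one could be checked programmatically. The sketch below assumes that the `CreationDate` and `ModDate` values are stored as raw PDF date strings of the form `D:YYYYMMDDHHmmSS` followed by an optional time zone; the actual format in the CSV is not shown here, so this is an assumption. The helper name `parse_pdf_date` is hypothetical, and the sample strings are made up for illustration.

```python
from datetime import datetime


def parse_pdf_date(value):
    """Parse a PDF date string such as "D:20230115103000+01'00'".

    The time zone suffix is ignored. Returns None for missing values or
    strings that do not carry all 14 date/time digits.
    """
    if not isinstance(value, str):
        return None
    digits = value[2:] if value.startswith("D:") else value
    digits = digits[:14]
    if len(digits) < 14 or not digits.isdigit():
        return None
    return datetime.strptime(digits, "%Y%m%d%H%M%S")


# Made-up sample values, including a missing entry
raw_dates = ["D:20220203091500Z", "D:20230115103000+01'00'", None]
parsed = [parse_pdf_date(v) for v in raw_dates]
valid = [d for d in parsed if d is not None]

print("Oldest:", min(valid))
print("Most recent:", max(valid))
```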

### Subject:
- Some documents have the subject field filled in, but its contents are not relevant to this analysis.

### Null data:
- Not every field is populated for every entry, but the missing values do not materially affect this analysis.
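
One way to quantify the missing values is to count nulls per column. The sketch below uses a tiny made-up DataFrame rather than the real CSV; the column names `Title`, `Author`, and `Subject` are assumed to match the metadata fields discussed above, and only `Name` is confirmed by the earlier cells.

```python
import pandas as pd

# Tiny illustrative frame; all values are made up.
sample = pd.DataFrame({
    "Name": ["a.pdf", "b.pdf", "c.pdf"],
    "Title": ["Scan", None, None],
    "Author": [None, None, "J. Doe"],
    "Subject": [None, None, None],
})

# Number of missing values per column
null_counts = sample.isna().sum()

# Share of missing values per column, as a percentage
null_share = (sample.isna().mean() * 100).round(1)

print(pd.DataFrame({"missing": null_counts, "percent": null_share}))
```

On the real data, the same two expressions could be applied to the DataFrame loaded from `metadata_2.0.csv`.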


## 3 - Comparison of Low-Quality Files

We have identified a specific set of files that exhibit different characteristics from the rest of our dataset. These files, listed below, are images that have been converted to PDFs.

### Motivation

These files inherently display lower quality compared to others due to their origin as images. This could affect the accuracy and performance of machine learning models if mixed with higher-quality files.

Hence, our goal is to establish a procedure that allows us to effectively segment these files to consider the possibility of training two different models: one for converted image files and another for the rest.

### List of Files Converted from Images to PDFs

- **Form1965 ENF**
- **Form1973 ENF**
- **Form16 MED**
- **Form19 CUI**
- **Form26 MED**
- **Form31 CUI**
- **Form100 CUI**
- **Form101 ENF**
- **Form103 ENF**
- **Form251 MED**
- **Form261 CUI**
- **Form487 ENF**
- **Form1302 ENF**

### Next Steps

1. Segment these files from the main dataset.
2. Analyze the characteristics and quality of these files.
3. Decide on the training methodology and modeling strategies for this data.

**Note:** It is essential to adopt a systematic and data-driven approach for this process, ensuring that any decision made benefits the quality and accuracy of the model.
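
The first step, segmenting the listed files from the main dataset, could be sketched as a boolean flag on the metadata. The example below uses made-up rows and only three of the listed files, written as they appear in the search list later in this notebook; the column name `from_image` is hypothetical.

```python
import pandas as pd

# Three of the image-based files, named as in the search list used later on
image_based_files = {"Form1965- ENF.pdf", "Form1973- ENF.pdf", "Form16- MED.pdf"}

# Made-up metadata rows; only the "Name" column mirrors the real CSV
metadata = pd.DataFrame(
    {"Name": ["Form16- MED.pdf", "Form256- FONO.pdf", "Form1965- ENF.pdf"]}
)

# Hypothetical flag column marking files that originated as images
metadata["from_image"] = metadata["Name"].isin(image_based_files)

# Split the metadata into the two candidate training sets
image_subset = metadata[metadata["from_image"]]
main_subset = metadata[~metadata["from_image"]]

print(image_subset["Name"].tolist())
print(main_subset["Name"].tolist())
```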


In [None]:
import pandas as pd
import os
import shutil
import re

def clean_directory_name(name):
    # Replace forbidden characters
    name = re.sub(r'[<>:"/\\|?*]', '_', name)
    # Remove any trailing dot
    name = name.rstrip('.')
    return name

# Path to the CSV file
csv_route = r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\metadata_2.0.csv'

# Path of the folder containing all the files
source_folder = r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\ETF Layer2 - Pdf_datset\Training Dataset\metadata_sample' 

# Read the CSV file and load it into a DataFrame
data = pd.read_csv(csv_route)

# Iterate over each row to obtain the producer and the file name
for index, row in data.iterrows():
    producer = clean_directory_name(row['Producer'])
    file_name = row['Name']  

    # Create a folder for the producer if it does not exist yet
    producer_folder = os.path.join(source_folder, producer)
    if not os.path.exists(producer_folder):
        os.makedirs(producer_folder)

    # Move the file to the producer's folder
    source_file_path = os.path.join(source_folder, file_name)
    destination_file_path = os.path.join(producer_folder, file_name)

    # Check if the file exists and then move it
    if os.path.exists(source_file_path):
        shutil.move(source_file_path, destination_file_path)
    else:
        print(f"The file {file_name} was not found in {source_folder}")

print("Files organized according to producers.")

        

The cell below locates the low-quality files within the producer subfolders and writes a report of where each one was found, as a basis for a potential segmentation process.

In [None]:
import os

# List of files to search for
files_to_search = [
    "Form1965- ENF.pdf",
    "Form1973- ENF.pdf",
    "Form16- MED.pdf",
    "Form19- CUI.pdf",
    "Form26- MED.pdf",
    "Form31- CUI.pdf",
    "Form100- CUI.pdf",
    "Form101- ENF.pdf",
    "Form103- ENF.pdf",
    "Form251- MED.pdf",
    "Form261- CUI.pdf",
    "Form487- ENF.pdf",
    "Form1302- ENF.pdf"
]

# Root directory to search for the files
route = r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\Layer2 - Pdf_dataset'
root_folder = os.path.join(route, 'metadata_sample')

# Report file
report_file_path = os.path.join(route, 'metadata_report.txt')

# Open the report file in write mode
with open(report_file_path, 'w') as report_file:
    # Traverse through folders and subfolders within the root directory
    for folder_name, subfolders, filenames in os.walk(root_folder):
        for file_name in filenames:
            # If the file name is in the list of files to search for
            if file_name in files_to_search:
                report_file.write(f"File {file_name} found at: {folder_name}\n")

print(f"Report saved at {report_file_path}")


Conclusion: We decided to segment the files by data source, since the high-quality files are all generated by the same source, Samsung-M4580FX.

## Interpretations and Considerations

1. The **Samsung-M4580FX** brand appears to be the dominant producer and is also linked to the main creator. This could be important in future analysis, especially when considering the quality or specific characteristics of files from this source.

2. The lack of data in the 'Creator' column is not a cause for concern as it has been determined through investigation that the creator is the Samsung-M4580FX machine.

3. Data separation continues, with the goal of splitting the files into high- and low-quality groups for model training:

- **High Quality Data:** Files generated by `Samsung-M4580FX`
- **Low Quality Data:** Files generated by other machines
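
The split above could be expressed as a label derived from the `Producer` column. This is a minimal sketch on made-up rows; the column name `quality` is hypothetical, and treating rows with a missing producer as low quality is an assumption made here for illustration.

```python
import pandas as pd

HIGH_QUALITY_PRODUCER = "Samsung-M4580FX"

# Made-up metadata rows, including one with a missing producer
metadata = pd.DataFrame({
    "Name": ["a.pdf", "b.pdf", "c.pdf"],
    "Producer": ["Samsung-M4580FX", "iLovePDF", None],
})

# "high" only for the Samsung producer; everything else, including
# missing producers, is labeled "low" in this sketch
is_high = metadata["Producer"].eq(HIGH_QUALITY_PRODUCER)
metadata["quality"] = is_high.map({True: "high", False: "low"})

print(metadata[["Name", "quality"]])
```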


## 4 - Pixel Area Analysis

According to the requirements of GCP Document AI - Form Recognizer, each PDF page must be smaller than 10000 x 10000 pixels at a resolution of 150 dpi.

### Motivation

- The size of the pages is analyzed to determine if they meet the requirements of GCP.
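
The pixel size of a page follows from its physical size: PDF page dimensions are expressed in points (72 per inch by default), so pixels = points / 72 × dpi. The sketch below works through this arithmetic for an A4 page. The function names are hypothetical, and it assumes the limit applies to each side of the page separately; a rendering library may round slightly differently.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def page_pixels(width_pt, height_pt, dpi=150):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def within_limit(width_pt, height_pt, dpi=150, max_side=10000):
    """Check that both sides are below `max_side` pixels (assumed per-side limit)."""
    width_px, height_px = page_pixels(width_pt, height_pt, dpi)
    return width_px < max_side and height_px < max_side


# A4 page: 595 x 842 points
print(page_pixels(595, 842))
print(within_limit(595, 842))
```

At 150 dpi, a side reaches 10000 pixels only at 4800 points (about 66.7 inches), so ordinary scanned forms are far below the limit.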


In [None]:
# Analyze 1 PDF to check if each page measures less than 10000x10000 pixels
from pdf2image import convert_from_path

def analyze_pdf(file_path, dpi=150):
    # Render the PDF pages to images at the resolution stated in the requirement
    # (pdf2image renders at 200 dpi unless told otherwise)
    images = convert_from_path(file_path, dpi=dpi)

    # Store the pixel area (width * height) of each page
    pixels_per_page = []

    for image in images:
        width, height = image.size
        pixels_per_page.append(width * height)

    return pixels_per_page

file_path = r'C:\Users\HP\My Drive\Inteligencia Artificial\PROJECTS\ML Ops - House24 - Form Recognizer\Layer2 - Pdf_dataset\metadata_sample\Samsung-M4580FX\Form256- FONO.pdf'  # Replace this with your file path
pixels_per_page = analyze_pdf(file_path)

for i, pixels in enumerate(pixels_per_page, 1):
    print(f"Page {i}: {pixels} pixels")


### Conclusion

- The files meet the requirements of GCP Document AI: their pages are smaller than 10000 x 10000 pixels at a resolution of 150 dpi.
