# PDF Data Extraction and Analysis with H24 Model formrecog-v4.0

## Overview
This repository contains scripts for extracting and analyzing data from PDF documents using the H24 Form Recognition Model version 4.0. The scripts automate the extraction of data, perform analysis, and provide insights into the accuracy of predictions made by the model.

## Folder Structure
Inside the repository, you'll find the following folder structure:

- **`output_analysis`**: Contains scripts and analysis files related to the output of the H24 Model v4.0.
   - `v4_data_extraction_analysis.py`: Python script for analyzing the extracted data from the model.
   - `output_model_v4.csv`: CSV file containing the output data extracted by the model.

## Analysis of Extracted Data
The `v4_data_extraction_analysis.py` script performs the following analyses:

- Calculates the arithmetic mean of the 'confidence' values for each of the 6 categories:
  - Nursing
  - Occupational Therapy
  - Physiotherapy
  - Caregiver
  - Medical
  - Speech Therapy

- Maps the values of the 'category_prediction' column to the abbreviations used in the file names (for example, 'Enfermeria' becomes 'ENF') so they can be compared with the category extracted from 'file_name'.

- Creates a new column named 'true_category' holding the category abbreviation that follows the hyphen in 'file_name' (for example, 'ENF' in 'Form1036- ENF.pdf').

- Creates a new column 'true_prediction' that compares 'category_prediction' with 'true_category' and holds a boolean indicating whether the prediction is correct.

- Creates a new column 'HRVerif' that compares each file's 'confidence' with the mean confidence of its predicted category. True means the confidence is below the category mean, so the file is flagged for Human Revision Verification; False means it is not flagged.

- Identifies the rows where 'true_prediction' is False (incorrect categorization) and 'HRVerif' is False (not caught by the Human Revision Verification filter). The goal is to detect incorrect predictions that the HRVerif filter does not catch.

- Calculates the total number of files in the DataFrame, the number of files with 'unfiltered_cases' equal to True, and the percentage of all files they represent, which the script reports as the model's error margin.

## Usage
1. Ensure you have the necessary credentials and environment set up.
2. Run the `v4_data_extraction_analysis.py` script to perform the analysis.
3. Review the output and insights generated by the script.

## Notes
- This repository is intended to provide tools and insights for analyzing the performance and accuracy of the H24 Form Recognition Model v4.0.
- The analysis conducted helps identify areas of improvement in the model's predictions and suggests potential refinements or adjustments to enhance its accuracy.


In [17]:
import pandas as pd

# Create a DataFrame from the provided data
data = pd.read_csv(r'...\output_analysis\output_model_v4.csv')

# Calculate the mean confidence for each predicted category
mean_confidence = data.groupby('category_prediction')['confidence'].mean()

# Extract and store the mean confidence for each category.
# Keys must match the labels in 'category_prediction' exactly (case-sensitive);
# .get() returns None for a label that is absent.
cui_mean = mean_confidence.get('Cuidadores', None)
enf_mean = mean_confidence.get('Enfermeria', None)
fono_mean = mean_confidence.get('Fonoaudiologia', None)
kine_mean = mean_confidence.get('Kinesiologia', None)
med_mean = mean_confidence.get('Medico', None)
to_mean = mean_confidence.get('TerapistaOcupacional', None)


In [18]:
mean_confidence

category_prediction
Cuidadores              0.998663
Enfermeria              0.999962
Fonoaudiologia          0.999967
Kinesiologia            0.999944
Medico                  0.999899
TerapistaOcupacional    0.999982
Name: confidence, dtype: float64
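
The per-category means can also be collected in a single dictionary instead of one variable per label. The following is a minimal sketch on made-up confidence values (the toy rows are illustrative and are not taken from `output_model_v4.csv`); it also shows that a lookup with the wrong letter case returns `None` rather than raising an error.

```python
import pandas as pd

# Toy data: illustrative values only, not the real model output
toy = pd.DataFrame({
    "category_prediction": ["Enfermeria", "Enfermeria", "Cuidadores", "Cuidadores"],
    "confidence": [0.9, 0.7, 0.6, 0.8],
})

# One mean per predicted category, as a plain dict keyed by the exact label
mean_by_category = toy.groupby("category_prediction")["confidence"].mean().to_dict()

# Keys are case-sensitive: a wrongly-cased label is absent, so .get() returns None
missing = mean_by_category.get("enfermeria")
print(mean_by_category, missing)
```

Checking for `None` values right after this step is a cheap way to catch label mismatches before they propagate into later columns.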

The values of the 'category_prediction' column are mapped to the abbreviations used in the file names so that they can later be compared with the category extracted from 'file_name'.


In [19]:
df_predictions = data

# Dictionary to map the old category names to new names
category_map = {
    'Kinesiologia': 'KINE',
    'Cuidadores': 'CUI',
    'Fonoaudiologia': 'FONO',
    'TerapistaOcupacional': 'T.O',
    'Enfermeria': 'ENF',
    'Medico': 'MED'
}

# Applying the mapping to the 'category_prediction' column
df_predictions['category_prediction'] = df_predictions['category_prediction'].map(category_map)

df_predictions.head()  # Displaying the first few rows of the modified DataFrame

           file_name category_prediction  confidence
0  Form1036- ENF.pdf                 ENF    0.999971
1  Form1037- ENF.pdf                 ENF    0.999970
2  Form1050- ENF.pdf                 ENF    0.999971
3  Form1051- ENF.pdf                 CUI    0.999854
4  Form1052- ENF.pdf                 ENF    0.999909
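
`Series.map` with a dictionary yields `NaN` for any value that has no key in the dictionary, so a label the map does not cover would silently disappear from the comparison. A small sketch of a sanity check, assuming the same `category_map` as above; the toy series and the lowercase `'enfermeria'` label are hypothetical:

```python
import pandas as pd

category_map = {
    "Kinesiologia": "KINE",
    "Cuidadores": "CUI",
    "Fonoaudiologia": "FONO",
    "TerapistaOcupacional": "T.O",
    "Enfermeria": "ENF",
    "Medico": "MED",
}

# Hypothetical predictions; the last label is deliberately not in the map
toy = pd.Series(["Enfermeria", "Medico", "enfermeria"], name="category_prediction")

mapped = toy.map(category_map)

# Original labels that did not translate (they became NaN after mapping)
unmapped = toy[mapped.isna()].unique().tolist()
print(unmapped)
```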


A new column called "true_category" is created, representing the value derived from the part of the "file_name" column after the hyphen ('FormXXX- ').

In [20]:
# Create 'true_category' from the text between '- ' and the '.pdf' extension.
# Stripping only the final '.pdf' keeps dotted abbreviations such as 'T.O' intact.
df_predictions['true_category'] = df_predictions['file_name'].apply(
    lambda x: x.split('- ', 1)[1].rsplit('.pdf', 1)[0] if '- ' in x else None
)

# Displaying the first few rows to verify the changes
print(df_predictions.head())


           file_name category_prediction  confidence true_category
0  Form1036- ENF.pdf                 ENF    0.999971           ENF
1  Form1037- ENF.pdf                 ENF    0.999970           ENF
2  Form1050- ENF.pdf                 ENF    0.999971           ENF
3  Form1051- ENF.pdf                 CUI    0.999854           ENF
4  Form1052- ENF.pdf                 ENF    0.999909           ENF
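
The same extraction can be written as a vectorized regular expression. This is a sketch under the assumption that every file name ends in `.pdf` and that the category follows the hyphen; the second and third file names below are hypothetical examples, included to show the dotted 'T.O' case and a name without a hyphen.

```python
import pandas as pd

names = pd.Series(["Form1036- ENF.pdf", "Form2001- T.O.pdf", "no_category.pdf"])

# Capture everything between the hyphen (plus optional spaces) and the final ".pdf"
true_category = names.str.extract(r"-\s*(.+)\.pdf$", expand=False)
print(true_category.tolist())
```

Names that do not match the pattern come back as `NaN`, which makes them easy to find with `isna()`.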


A new column named 'true_prediction' is created by comparing the 'category_prediction' column with 'true_category'. It holds a boolean value indicating whether the prediction is correct.

In [21]:
df_predictions['true_prediction'] = df_predictions['category_prediction'] == df_predictions['true_category']

df_predictions.head()

           file_name category_prediction  confidence true_category  true_prediction
0  Form1036- ENF.pdf                 ENF    0.999971           ENF             True
1  Form1037- ENF.pdf                 ENF    0.999970           ENF             True
2  Form1050- ENF.pdf                 ENF    0.999971           ENF             True
3  Form1051- ENF.pdf                 CUI    0.999854           ENF            False
4  Form1052- ENF.pdf                 ENF    0.999909           ENF             True
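
Because 'true_prediction' is boolean, its mean is the share of correct predictions. The sketch below, on a hypothetical four-row frame, shows overall accuracy, accuracy per true category, and a cross-tabulation of true versus predicted categories. These summaries are not part of the original script; they are one possible way to inspect the column.

```python
import pandas as pd

# Hypothetical rows: three ENF files (one misread as CUI) and one MED file
toy = pd.DataFrame({
    "category_prediction": ["ENF", "ENF", "CUI", "MED"],
    "true_category":       ["ENF", "ENF", "ENF", "MED"],
})
toy["true_prediction"] = toy["category_prediction"] == toy["true_category"]

# Share of correct predictions overall and per true category
accuracy = toy["true_prediction"].mean()
per_category = toy.groupby("true_category")["true_prediction"].mean()

# Rows are true categories, columns are predicted categories
confusion = pd.crosstab(toy["true_category"], toy["category_prediction"])
print(accuracy)
print(per_category)
print(confusion)
```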


### Human Revision Verification

A new column named 'mean_prediction' is created. For each file, it holds the mean confidence of that file's predicted category.

The main objective is to understand the precision of the model's predictions.

In [22]:
# Assumes df_predictions and the per-category means computed earlier are available
mean_confidence_dict = {
    'CUI': cui_mean,
    'ENF': enf_mean,
    'FONO': fono_mean,
    'KINE': kine_mean,
    'MED': med_mean,
    'T.O': to_mean
}

# Create the new column 'mean_prediction'
df_predictions['mean_prediction'] = df_predictions['category_prediction'].map(mean_confidence_dict)

# df_predictions now has a column 'mean_prediction' with the mean confidence of each row's predicted category

In [23]:
df_predictions.head()

           file_name category_prediction  confidence true_category  true_prediction  mean_prediction
0  Form1036- ENF.pdf                 ENF    0.999971           ENF             True         0.999962
1  Form1037- ENF.pdf                 ENF    0.999970           ENF             True         0.999962
2  Form1050- ENF.pdf                 ENF    0.999971           ENF             True         0.999962
3  Form1051- ENF.pdf                 CUI    0.999854           ENF            False         0.998663
4  Form1052- ENF.pdf                 ENF    0.999909           ENF             True         0.999962
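
An alternative that skips the intermediate dictionary is `groupby(...).transform('mean')`, which returns one value per row aligned with the original index. A minimal sketch on made-up confidence values (not the real model output):

```python
import pandas as pd

# Toy data: illustrative values only
toy = pd.DataFrame({
    "category_prediction": ["ENF", "ENF", "CUI", "CUI"],
    "confidence": [0.9, 0.7, 0.6, 0.8],
})

# Each row receives the mean confidence of its own predicted category
toy["mean_prediction"] = (
    toy.groupby("category_prediction")["confidence"].transform("mean")
)
print(toy)
```

Since the grouping key is the column itself, this variant cannot produce a missing mean because of a mismatched label.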


A new column named 'HRVerif' is created. It compares the 'confidence' of each file with the mean confidence of its predicted category and is True when the confidence falls below that mean.


In [24]:
# True means the file's confidence is below the mean of its predicted category,
# so the file is flagged for Human Revision Verification.
df_predictions['HRVerif'] = df_predictions['confidence'] < df_predictions['mean_prediction']

df_predictions.head()

           file_name category_prediction  confidence true_category  true_prediction  mean_prediction  HRVerif
0  Form1036- ENF.pdf                 ENF    0.999971           ENF             True         0.999962    False
1  Form1037- ENF.pdf                 ENF    0.999970           ENF             True         0.999962    False
2  Form1050- ENF.pdf                 ENF    0.999971           ENF             True         0.999962    False
3  Form1051- ENF.pdf                 CUI    0.999854           ENF            False         0.998663    False
4  Form1052- ENF.pdf                 ENF    0.999909           ENF             True         0.999962     True
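
To make the rule concrete, the sketch below reapplies it to two of the rows shown in this output, using only the values printed there. Form1051 was misclassified as CUI, yet its confidence is above the CUI mean, so it is not flagged; Form1052 was classified correctly but sits just below the ENF mean, so it is flagged.

```python
import pandas as pd

# Values copied from rows 3 and 4 of the output above
rows = pd.DataFrame({
    "file_name": ["Form1051- ENF.pdf", "Form1052- ENF.pdf"],
    "confidence": [0.999854, 0.999909],
    "mean_prediction": [0.998663, 0.999962],
})

# Flag a file for human review when its confidence is below its category mean
rows["HRVerif"] = rows["confidence"] < rows["mean_prediction"]
print(rows)
```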


Rows are flagged where 'true_prediction' is False (incorrect categorization) and 'HRVerif' is False (not caught by the Human Revision Verification filter).

Goal: detect incorrect predictions that the HRVerif filter does not catch.


In [28]:
# Flag rows where both 'true_prediction' and 'HRVerif' are False
df_predictions['unfiltered_cases'] = ~df_predictions['true_prediction'] & ~df_predictions['HRVerif']
df_predictions.head()

           file_name category_prediction  confidence true_category  true_prediction  mean_prediction  HRVerif  unfiltered_cases
0  Form1036- ENF.pdf                 ENF    0.999971           ENF             True         0.999962    False             False
1  Form1037- ENF.pdf                 ENF    0.999970           ENF             True         0.999962    False             False
2  Form1050- ENF.pdf                 ENF    0.999971           ENF             True         0.999962    False             False
3  Form1051- ENF.pdf                 CUI    0.999854           ENF            False         0.998663    False              True
4  Form1052- ENF.pdf                 ENF    0.999909           ENF             True         0.999962     True             False


In [30]:
# Total files in the DataFrame
total_files = len(df_predictions)

# Total files with 'unfiltered_cases' equal to True
total_unfiltered_cases = df_predictions['unfiltered_cases'].sum()

# Calculating the percentage
percentage_unfiltered_cases = (total_unfiltered_cases / total_files) * 100

print(f"Total files in the DataFrame: {total_files}")
print(f"Total files with unfiltered_cases = True: {total_unfiltered_cases}")
print(f"Model Error Margin Percentage: {percentage_unfiltered_cases:.2f}%")


Total files in the DataFrame: 90
Total files with unfiltered_cases = True: 2
Percentage of unfiltered_cases = True over the total: 2.22%
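
The reported percentage is 2 unfiltered cases out of 90 files. The sketch below reproduces that arithmetic and, on a hypothetical five-row frame, shows how a cross-tabulation of 'true_prediction' against 'HRVerif' separates all four outcomes. The toy booleans are invented for illustration; only the figures 2 and 90 come from the run above.

```python
import pandas as pd

# Hypothetical outcomes for five files
toy = pd.DataFrame({
    "true_prediction": [True, True, False, False, True],
    "HRVerif":         [False, True, False, True, False],
})

# Wrong prediction that the review filter did not flag
toy["unfiltered_cases"] = ~toy["true_prediction"] & ~toy["HRVerif"]

# Counts for every combination of correct/incorrect and flagged/not flagged
breakdown = pd.crosstab(toy["true_prediction"], toy["HRVerif"])
unfiltered_rate = toy["unfiltered_cases"].mean() * 100
print(breakdown)
print(f"{unfiltered_rate:.2f}%")

# Figures reported by the run above: 2 unfiltered cases out of 90 files
reported_rate = 2 / 90 * 100
print(f"{reported_rate:.2f}%")
```

Note that this percentage counts only misclassifications that escape the filter; misclassified files that are flagged for review are not included in it.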
