**Before training a CNN, we extracted two low-level visual features, edge density and contour count, to check whether fraudulent documents show measurable visual anomalies.**

In [1]:
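import os

import cv2
import numpy as np

base_dir = r"G:\fraud_document_ai\data\processed\thresholded"

def extract_features(folder):
    """Return one (edge_density, contour_count) tuple per readable image in folder."""
    features = []

    # Sort the listing so the sample order is reproducible across runs
    for file in sorted(os.listdir(folder)):
        img = cv2.imread(os.path.join(folder, file), cv2.IMREAD_GRAYSCALE)
        if img is None:
            # cv2.imread returns None for unreadable or non-image files
            continue

        # Edge density: fraction of pixels that Canny marks as edges
        edges = cv2.Canny(img, 100, 200)
        edge_density = np.sum(edges > 0) / edges.size

        # Contour count: outermost contours of the non-zero regions
        contours, _ = cv2.findContours(
            img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
        )
        contour_count = len(contours)

        features.append((edge_density, contour_count))

    return features

genuine_features = extract_features(os.path.join(base_dir, "genuine"))
fraud_features = extract_features(os.path.join(base_dir, "fraud"))

# Print the first five feature tuples of each class for a quick comparison
print("Genuine sample features:", genuine_features[:5])
print("Fraud sample features:", fraud_features[:5])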

Genuine sample features: [(np.float64(0.033937445750594646), 5), (np.float64(0.028723299211674394), 5), (np.float64(0.023913422214042628), 6), (np.float64(0.028943010918896395), 5), (np.float64(0.027943968861939884), 5)]
Fraud sample features: [(np.float64(0.03395838298387109), 5), (np.float64(0.028793089989262558), 4), (np.float64(0.02389765466799493), 4), (np.float64(0.029029086211255134), 4), (np.float64(0.028075278917550208), 4)]


**The similarity suggests that the fraud is localized and subtle, which is typical of real-world document tampering. Rather than relying on a single visual feature, our system combines multiple weak signals (visual anomalies, OCR inconsistencies, and structural features) to compute a final fraud risk score.**
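
One simple way to combine several weak signals into a single score is a weighted average. The sketch below is only an illustration of that idea: the function name `fraud_risk_score`, the signal names, the scores, and the weights are all hypothetical, and it is not the scoring method used by this system.

```python
def fraud_risk_score(signals, weights):
    """Weighted average of per-signal scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights[name] for name in signals)
    return sum(weights[name] * signals[name] for name in signals) / total_weight

# Hypothetical signal scores and weights, for illustration only
signals = {"visual_anomaly": 0.2, "ocr_inconsistency": 0.7, "structural": 0.4}
weights = {"visual_anomaly": 1.0, "ocr_inconsistency": 2.0, "structural": 1.0}

score = fraud_risk_score(signals, weights)
print(round(score, 3))
```

A weighted average keeps the result in the same [0, 1] range as the inputs, so a single strong signal can raise the score without any one weak signal dominating it.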

Each feature output is a tuple:
(edge_density, contour_count)

Example:
(0.0339, 5)

Meaning:
• 3.39% of the image pixels are detected as edges
• 5 distinct contour regions are detected

FEATURE 1: EDGE DENSITY

Formula:
edge_density = (number of edge pixels) / (total number of pixels)

How it is computed:
• Canny Edge Detection is applied to the image
• All edge pixels are counted
• The count is divided by total pixels in the image

Interpretation:
• Low edge density → smooth, clean document
• Higher edge density → more sharp transitions, patches, distortions

Typical range:
• 0.02 – 0.04 → normal for a scanned document

Observed values:
• Genuine documents → ~0.024 to 0.034
• Fraud documents   → ~0.024 to 0.034

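Conclusion:
• Values are nearly identical across the two classes, which is consistent with localized fraud
• Real-world document fraud typically does NOT modify the entire page, so a whole-page statistic such as edge density barely changes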

FEATURE 2: CONTOUR COUNT

Formula:
contour_count = number of detected connected regions (contours)

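How it is computed:
• The thresholded image is used as input
• Contours are detected with cv2.findContours using cv2.RETR_EXTERNAL
• Each outermost connected non-zero (white) region is counted as one contour; shapes nested inside another contour are not counted
• If the thresholded pages have dark text on a white background, the count therefore reflects white regions rather than individual text shapes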

Interpretation:
• Fewer contours → uniform document structure
• More contours → fragmented or irregular regions

Observed values:
• Genuine documents → mostly 5 to 6 contours
• Fraud documents   → mostly 4 to 5 contours

Conclusion:
• Fraud appears to alter local structure, slightly shifting the contour count
• The difference is subtle (about one contour in the samples shown) but measurable
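
To put numbers on the comparison, the printed sample tuples can be summarized per class. The sketch below copies the first five values of each class from the notebook output (edge densities rounded to six decimals); the helper name `summarize` is an assumption for illustration. Five samples per class are far too few for a statistical conclusion, so this only shows how the comparison could be made on the full feature lists.

```python
import numpy as np

# (edge_density, contour_count) for the first five samples of each class,
# copied from the printed notebook output and rounded
genuine = np.array([(0.033937, 5), (0.028723, 5), (0.023913, 6),
                    (0.028943, 5), (0.027944, 5)])
fraud = np.array([(0.033958, 5), (0.028793, 4), (0.023898, 4),
                  (0.029029, 4), (0.028075, 4)])

def summarize(name, feats):
    """Print and return the per-feature mean and sample standard deviation."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0, ddof=1)
    print(f"{name}: edge_density {mean[0]:.4f} ± {std[0]:.4f}, "
          f"contour_count {mean[1]:.2f} ± {std[1]:.2f}")
    return mean, std

genuine_mean, genuine_std = summarize("genuine", genuine)
fraud_mean, fraud_std = summarize("fraud", fraud)
```

On these samples the mean edge densities are almost equal, while the mean contour counts differ by about one.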