# Document Forgery Detection - Complete Workflow

This notebook demonstrates the complete workflow for document forgery detection using computer vision and machine learning techniques.

## Table of Contents
1. [Setup and Imports](#setup)
2. [Data Loading and Exploration](#data-loading)
3. [Data Preprocessing](#preprocessing)
4. [Feature Extraction](#feature-extraction)
5. [Model Training](#model-training)
6. [Model Evaluation](#evaluation)
7. [Making Predictions](#predictions)
8. [Visualization and Analysis](#visualization)

## 1. Setup and Imports <a id="setup"></a>

In [None]:
# Import necessary libraries
import sys
import os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import cv2

# Add src to path for importing our modules
sys.path.append('../src')

# Import our custom modules
from data.make_dataset import load_image, create_dataset_structure
from features.build_features import DocumentFeatureExtractor, extract_features_from_directory
from models.train_model import DocumentForgeryDetector
from models.predict_model import DocumentForgeryPredictor
from visualization.visualize import DocumentForgeryVisualizer, create_comprehensive_report

# Setup plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline

# Set random seed for reproducibility
np.random.seed(42)

print("Setup completed successfully!")

## 2. Data Loading and Exploration <a id="data-loading"></a>

First, let's load and explore our document dataset.

In [None]:
# Define data paths
RAW_DATA_PATH = '../data/raw'
PROCESSED_DATA_PATH = '../data/processed'
MODELS_PATH = '../models'

# Create directories if they don't exist
Path(RAW_DATA_PATH).mkdir(parents=True, exist_ok=True)
Path(PROCESSED_DATA_PATH).mkdir(parents=True, exist_ok=True)
Path(MODELS_PATH).mkdir(parents=True, exist_ok=True)

print(f"Raw data path: {RAW_DATA_PATH}")
print(f"Processed data path: {PROCESSED_DATA_PATH}")
print(f"Models path: {MODELS_PATH}")

In [None]:
# Check if we have data in the raw directory
raw_path = Path(RAW_DATA_PATH)
image_extensions = ['.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.tif']

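# Match on the lowercased suffix so each file is counted once, even on
# case-insensitive filesystems where '*.png' and '*.PNG' hit the same files
image_files = sorted(
    p for p in raw_path.rglob('*')
    if p.is_file() and p.suffix.lower() in image_extensions
)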

print(f"Found {len(image_files)} image files in the raw data directory")

if len(image_files) == 0:
    print("\n⚠️ No images found in the raw data directory.")
    print("To use this notebook with real data:")
    print(f"1. Place your document images in: {RAW_DATA_PATH}")
    print("2. Organize them in subfolders like 'authentic' and 'forged'")
    print("3. Or name files with keywords like 'fake', 'authentic', etc.")
    print("\n📝 For now, we'll create synthetic sample data for demonstration.")
    
    # Create sample data for demonstration
    sample_data = True
else:
    sample_data = False
    # Display some sample images if available
    print(f"\n📁 Sample image files:")
    for i, img_path in enumerate(image_files[:5]):
        print(f"  {i+1}. {img_path.name}")

### Creating Sample Data (if no real data is available)

In [None]:
if sample_data:
    # Create synthetic sample images for demonstration
    print("Creating synthetic sample data...")
    
    # Create sample authentic documents (clean, structured)
    authentic_dir = Path(RAW_DATA_PATH) / 'authentic'
    forged_dir = Path(RAW_DATA_PATH) / 'forged'
    
    authentic_dir.mkdir(exist_ok=True)
    forged_dir.mkdir(exist_ok=True)
    
    # Generate sample images
    np.random.seed(42)
    
    for i in range(50):
        # Authentic documents - more structured, less noise
        img = np.random.randint(200, 255, (300, 400, 3), dtype=np.uint8)  # Light background
        # Add some structure (lines, text-like patterns)
        for j in range(5):
            y = 50 + j * 40
            cv2.line(img, (20, y), (380, y), (0, 0, 0), 2)  # Horizontal lines
        
        # Add some noise
        noise = np.random.normal(0, 10, img.shape)
        img = np.clip(img + noise, 0, 255).astype(np.uint8)
        
        cv2.imwrite(str(authentic_dir / f'authentic_doc_{i:03d}.png'), img)
    
    for i in range(50):
        # Forged documents - more irregular, artifacts
        img = np.random.randint(180, 255, (300, 400, 3), dtype=np.uint8)
        
        # Add irregular patterns (forgery artifacts)
        for j in range(5):
            y = 50 + j * 40 + np.random.randint(-5, 5)  # Slight misalignment
            cv2.line(img, (20, y), (380, y), (0, 0, 0), np.random.randint(1, 4))  # Variable thickness
        
        # Add compression artifacts
        for _ in range(10):
            x, y = np.random.randint(0, 400), np.random.randint(0, 300)
            cv2.rectangle(img, (x, y), (x+8, y+8), (np.random.randint(0, 100),)*3, -1)
        
        # Add more noise
        noise = np.random.normal(0, 20, img.shape)
        img = np.clip(img + noise, 0, 255).astype(np.uint8)
        
        cv2.imwrite(str(forged_dir / f'forged_doc_{i:03d}.png'), img)
    
    print("✅ Sample data created successfully!")
    print(f"   - 50 authentic documents in {authentic_dir}")
    print(f"   - 50 forged documents in {forged_dir}")
    
    # Update image files list
    image_files = list(authentic_dir.glob('*.png')) + list(forged_dir.glob('*.png'))
    print(f"\n📊 Total images: {len(image_files)}")

### Visualize Sample Images

In [None]:
# Display some sample images
if len(image_files) > 0:
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    
    # Split files by class first so the summary counts cover the whole dataset,
    # not just the four images shown per row
    authentic_files = [f for f in image_files if 'authentic' in str(f).lower()]
    forged_files = [f for f in image_files if 'forged' in str(f).lower() or 'fake' in str(f).lower()]
    
    # Hide every axis up front so unused panels stay blank
    for ax in axes.ravel():
        ax.axis('off')
    
    # Show up to four authentic images
    for i, img_path in enumerate(authentic_files[:4]):
        img = Image.open(img_path)
        axes[0, i].imshow(img)
        axes[0, i].set_title(f'Authentic: {img_path.name}', fontsize=10)
    
    # Show up to four forged images
    for i, img_path in enumerate(forged_files[:4]):
        img = Image.open(img_path)
        axes[1, i].imshow(img)
        axes[1, i].set_title(f'Forged: {img_path.name}', fontsize=10)
    
    plt.tight_layout()
    plt.show()
    
    print(f"📊 Dataset Summary:")
    print(f"   - Total images: {len(image_files)}")
    print(f"   - Authentic images: {len(authentic_files)}")
    print(f"   - Forged images: {len(forged_files)}")

## 3. Data Preprocessing <a id="preprocessing"></a>

Now let's preprocess the data and create a structured dataset.
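
As a rough illustration of what a 70/20/10 split involves, the sketch below splits each class separately so that every subset keeps the same class balance. `stratified_split` is a hypothetical helper written for this note. It is not the implementation of `create_dataset_structure`, whose internals are not shown in this notebook, and the file names are made up for the demo.

```python
import random


def stratified_split(paths_by_class, train=0.7, val=0.2, seed=42):
    """Split each class's file list into train/val/test; the remainder goes to test."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, paths in paths_by_class.items():
        shuffled = sorted(paths)  # sort first so the result does not depend on input order
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * train)
        n_val = int(len(shuffled) * val)
        splits["train"] += [(p, label) for p in shuffled[:n_train]]
        splits["val"] += [(p, label) for p in shuffled[n_train:n_train + n_val]]
        splits["test"] += [(p, label) for p in shuffled[n_train + n_val:]]
    return splits


# Hypothetical file names mirroring the synthetic data layout
demo = {
    "authentic": [f"authentic_doc_{i:03d}.png" for i in range(50)],
    "forged": [f"forged_doc_{i:03d}.png" for i in range(50)],
}
splits = stratified_split(demo)
print({name: len(items) for name, items in splits.items()})
# {'train': 70, 'val': 20, 'test': 10}
```

Splitting per class, rather than shuffling all files together, keeps a small test set from ending up with only one class by chance.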

In [None]:
# Create structured dataset
if len(image_files) > 0:
    print("Creating structured dataset...")
    
    # Use the create_dataset_structure function
    create_dataset_structure(
        input_dir=RAW_DATA_PATH,
        output_dir=PROCESSED_DATA_PATH,
        train_split=0.7,
        val_split=0.2,
        test_split=0.1
    )
    
    print("✅ Dataset structure created!")
    
    # Load and display the metadata
    metadata_file = Path(PROCESSED_DATA_PATH) / 'dataset_metadata.csv'
    if metadata_file.exists():
        df_metadata = pd.read_csv(metadata_file)
        print(f"\n📊 Dataset Statistics:")
        print(df_metadata.groupby(['split', 'class']).size().unstack(fill_value=0))
        
        # Visualize data distribution
        visualizer = DocumentForgeryVisualizer()
        visualizer.plot_data_distribution(df_metadata, target_column='class')
    else:
        print("⚠️ Metadata file not found")
else:
    print("⚠️ No images available for preprocessing")

## 4. Feature Extraction <a id="feature-extraction"></a>

Extract hand-crafted features (intensity statistics plus edge, texture, and frequency measures) from the training images and save them to a CSV file.
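
To give a feel for the kind of per-image statistics involved, here is a minimal NumPy-only sketch that computes four global measures. The feature names `mean`, `std`, `entropy` and `laplacian_var` match columns referenced later in this notebook, but the formulas are assumptions made for illustration: the actual `DocumentFeatureExtractor` code is not shown here and may compute them differently. `basic_features` is a hypothetical helper.

```python
import numpy as np


def basic_features(gray):
    """Compute a few global statistics for a 2-D grayscale image (values 0-255)."""
    gray = np.asarray(gray, dtype=np.float64)

    # Shannon entropy of the 256-bin intensity histogram, in bits
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    entropy = float((p * np.log2(1.0 / p)).sum())

    # 4-neighbour Laplacian on interior pixels; its variance grows with
    # fine detail and noise, so it is often used as a sharpness measure
    lap = (gray[:-2, 1:-1] + gray[2:, 1:-1] + gray[1:-1, :-2] + gray[1:-1, 2:]
           - 4.0 * gray[1:-1, 1:-1])

    return {
        "mean": float(gray.mean()),
        "std": float(gray.std()),
        "entropy": entropy,
        "laplacian_var": float(lap.var()),
    }


rng = np.random.default_rng(42)
clean = np.full((64, 64), 220, dtype=np.uint8)  # flat light-grey page
noisy = np.clip(clean + rng.normal(0, 20, clean.shape), 0, 255).astype(np.uint8)

print(basic_features(clean))
print(basic_features(noisy))
```

The flat image scores zero on everything except `mean`, while the noisy copy has clearly higher `entropy` and `laplacian_var`. That contrast is the kind of signal a classifier can pick up from the synthetic data above.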

In [None]:
# Initialize feature extractor
feature_extractor = DocumentFeatureExtractor(image_size=(224, 224))

# Extract features from training data
train_dir = Path(PROCESSED_DATA_PATH) / 'train'
features_file = Path(PROCESSED_DATA_PATH) / 'features.csv'

if train_dir.exists() and len(list(train_dir.glob('**/*.png'))) > 0:
    print("Extracting features from training data...")
    
    extract_features_from_directory(
        input_dir=str(train_dir),
        output_file=str(features_file)
    )
    
    # Load and analyze features
    if features_file.exists():
        df_features = pd.read_csv(features_file)
        print(f"\n✅ Features extracted successfully!")
        print(f"   - Total samples: {len(df_features)}")
        print(f"   - Feature count: {len(df_features.columns) - 2}")
        
        # Determine class from filepath
        df_features['class'] = df_features['filepath'].apply(
            lambda x: 'authentic' if 'authentic' in str(x) else 'forged'
        )
        
        print(f"\n📊 Class distribution:")
        print(df_features['class'].value_counts())
        
        # Display sample features
        print(f"\n🔍 Sample features (first 10):")
        feature_cols = [col for col in df_features.columns if col not in ['filename', 'filepath', 'class']]
        print(df_features[feature_cols[:10]].describe())
        
    else:
        print("❌ Feature extraction failed")
        df_features = None
else:
    print("⚠️ No training data found for feature extraction")
    df_features = None

### Visualize Feature Distributions

In [None]:
if df_features is not None and len(df_features) > 0:
    # Select some important features to visualize
    important_features = [
        'mean', 'std', 'entropy', 'sobel_mean', 'canny_edge_density',
        'laplacian_var', 'fft_energy', 'lbp_contrast'
    ]
    
    # Filter features that exist in our dataset
    available_features = [f for f in important_features if f in df_features.columns]
    
    if available_features:
        print(f"Visualizing {len(available_features)} important features...")
        
        visualizer = DocumentForgeryVisualizer()
        visualizer.plot_feature_distribution(
            df_features, 
            features=available_features[:9],  # Show up to 9 features
            target_column='class'
        )
    else:
        print("⚠️ No matching features found for visualization")
        print(f"Available features: {list(df_features.columns)[:10]}...")

## 5. Model Training <a id="model-training"></a>

Train different types of models for document forgery detection.

### 5.1 Traditional Machine Learning Model

In [None]:
if df_features is not None and len(df_features) > 10:
    print("Training traditional ML model (Random Forest)...")
    
    # Initialize detector for traditional ML
    ml_detector = DocumentForgeryDetector(model_type='traditional_ml', random_state=42)
    
    # Load data using the features CSV
    X, y = ml_detector.load_data(str(features_file))
    
    # Preprocess data
    X_processed, y_processed = ml_detector.preprocess_data(X, y)
    
    # Train model
    ml_results = ml_detector.train_model(
        X_processed, y_processed,
        model_name='random_forest',
        use_grid_search=False  # Set to True to tune hyperparameters via grid search (slower)
    )
    
    print(f"\n✅ Traditional ML Model Training Results:")
    print(f"   - Training Accuracy: {ml_results['train_accuracy']:.4f}")
    print(f"   - Test Accuracy: {ml_results['test_accuracy']:.4f}")
    
    # Save model
    ml_model_path = Path(MODELS_PATH) / 'random_forest_model.joblib'
    ml_detector.save_model(str(ml_model_path), metadata=ml_results)
    
    print(f"   - Model saved to: {ml_model_path}")
    
    # Display feature importance if available
    if hasattr(ml_detector.model, 'feature_importances_'):
        feature_cols = [col for col in df_features.columns if col not in ['filename', 'filepath', 'class']]
        visualizer = DocumentForgeryVisualizer()
        visualizer.plot_feature_importance(
            feature_cols, 
            ml_detector.model.feature_importances_,
            top_k=15
        )

else:
    print("⚠️ Insufficient data for training traditional ML model")
    ml_detector = None
    ml_results = None

### 5.2 Deep Learning Model (CNN)

In [None]:
# Check if we have enough processed images for CNN training
processed_images = list(Path(PROCESSED_DATA_PATH).glob('train/**/*.png'))

if len(processed_images) > 10:
    print(f"Training CNN model with {len(processed_images)} images...")
    
    # Initialize CNN detector
    cnn_detector = DocumentForgeryDetector(model_type='cnn', random_state=42)
    
    try:
        # Load image data
        X_images, y_images = cnn_detector.load_data(str(Path(PROCESSED_DATA_PATH) / 'train'))
        
        # Preprocess data
        X_img_processed, y_img_processed = cnn_detector.preprocess_data(X_images, y_images)
        
        print(f"   - Image shape: {X_img_processed.shape}")
        print(f"   - Labels shape: {y_img_processed.shape}")
        
        # Train model (reduced epochs for demo)
        cnn_results = cnn_detector.train_model(
            X_img_processed, y_img_processed,
            epochs=10,  # Reduced for demo - use 50+ for real training
            batch_size=16
        )
        
        print(f"\n✅ CNN Model Training Results:")
        print(f"   - Final Validation Accuracy: {cnn_results['final_accuracy']:.4f}")
        print(f"   - Final Loss: {cnn_results['final_loss']:.4f}")
        
        # Save model
        cnn_model_path = Path(MODELS_PATH) / 'cnn_model.h5'
        cnn_detector.save_model(str(cnn_model_path), metadata=cnn_results)
        
        print(f"   - Model saved to: {cnn_model_path}")
        
        # Plot training history
        cnn_detector.plot_training_history()
        
    except Exception as e:
        print(f"❌ CNN training failed: {str(e)}")
        print("This might be due to insufficient memory or missing dependencies.")
        cnn_detector = None
        cnn_results = None

else:
    print("⚠️ Insufficient images for CNN training")
    print("CNN models typically need hundreds or thousands of images for good performance.")
    cnn_detector = None
    cnn_results = None

## 6. Model Evaluation <a id="evaluation"></a>

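Collect the held-out test images and derive their ground-truth labels from the file paths. The models are scored against these labels in Section 7.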
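
Accuracy alone can hide class-specific errors, so it may help to look at precision and recall for the `forged` class as well. The sketch below computes all three from two lists of string labels. `binary_metrics` is a hypothetical helper written for this note, not part of the project's modules, and the label lists are made-up examples.

```python
def binary_metrics(y_true, y_pred, positive="forged"):
    """Return accuracy, precision and recall, treating `positive` as the target class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # forged, flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # authentic, flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # forged, missed
    tn = len(pairs) - tp - fp - fn

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or never present
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: one authentic page wrongly flagged, one forgery missed
y_true = ["authentic", "authentic", "forged", "forged", "forged"]
y_pred = ["authentic", "forged", "forged", "forged", "authentic"]
metrics = binary_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

For forgery detection, recall on `forged` is usually the number to watch, since a missed forgery tends to cost more than a false alarm.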

In [None]:
# Evaluate models if they were successfully trained
test_dir = Path(PROCESSED_DATA_PATH) / 'test'
test_images = list(test_dir.glob('**/*.png')) if test_dir.exists() else []

print(f"Found {len(test_images)} test images for evaluation")

if len(test_images) > 0:
    # Derive the true label of each test image from its path
    test_labels = [
        'authentic' if 'authentic' in str(img_path) else 'forged'
        for img_path in test_images
    ]
    
    print(f"\n📊 Test set composition:")
    test_label_counts = pd.Series(test_labels).value_counts()
    print(test_label_counts)

else:
    print("⚠️ No test images available for evaluation")

## 7. Making Predictions <a id="predictions"></a>

Use the trained models to make predictions on new images.

In [None]:
# Test predictions with the traditional ML model if available
ml_model_path = Path(MODELS_PATH) / 'random_forest_model.joblib'

if ml_model_path.exists() and len(test_images) > 0:
    print("Making predictions with Traditional ML model...")
    
    try:
        # Initialize predictor
        ml_predictor = DocumentForgeryPredictor(str(ml_model_path), model_type='traditional_ml')
        
        # Make predictions on test images
        ml_predictions = []
        
        for img_path in test_images[:10]:  # Limit to first 10 for demo
            prediction = ml_predictor.predict_single_image(str(img_path))
            prediction['image_path'] = str(img_path)
            ml_predictions.append(prediction)
        
        # Display results
        print(f"\n✅ ML Predictions completed for {len(ml_predictions)} images:")
        
        for i, pred in enumerate(ml_predictions[:5]):
            img_name = Path(pred['image_path']).name
            print(f"   {i+1}. {img_name}: {pred['prediction']} (confidence: {pred['confidence']:.3f})")
        
        # Calculate accuracy
        correct = 0
        for pred, true_label in zip(ml_predictions, test_labels[:len(ml_predictions)]):
            if pred['prediction'] == true_label:
                correct += 1
        
        ml_accuracy = correct / len(ml_predictions)
        print(f"\n📊 ML Model Test Accuracy: {ml_accuracy:.3f}")
        
    except Exception as e:
        print(f"❌ ML prediction failed: {str(e)}")
        ml_predictions = []

else:
    print("⚠️ ML model not available for predictions")
    ml_predictions = []

In [None]:
# Test predictions with CNN model if available
cnn_model_path = Path(MODELS_PATH) / 'cnn_model.h5'

if cnn_model_path.exists() and len(test_images) > 0:
    print("Making predictions with CNN model...")
    
    try:
        # Initialize predictor
        cnn_predictor = DocumentForgeryPredictor(str(cnn_model_path), model_type='cnn')
        
        # Make batch predictions
        test_image_paths = [str(img) for img in test_images[:10]]  # Limit for demo
        cnn_predictions = cnn_predictor.predict_batch(test_image_paths)
        
        # Display results
        print(f"\n✅ CNN Predictions completed for {len(cnn_predictions)} images:")
        
        for i, pred in enumerate(cnn_predictions[:5]):
            img_name = Path(pred['image_path']).name
            print(f"   {i+1}. {img_name}: {pred['prediction']} (confidence: {pred['confidence']:.3f})")
        
        # Calculate accuracy
        correct = 0
        for pred, true_label in zip(cnn_predictions, test_labels[:len(cnn_predictions)]):
            if pred['prediction'] == true_label:
                correct += 1
        
        cnn_accuracy = correct / len(cnn_predictions)
        print(f"\n📊 CNN Model Test Accuracy: {cnn_accuracy:.3f}")
        
    except Exception as e:
        print(f"❌ CNN prediction failed: {str(e)}")
        cnn_predictions = []

else:
    print("⚠️ CNN model not available for predictions")
    cnn_predictions = []

## 8. Visualization and Analysis <a id="visualization"></a>

Create comprehensive visualizations of our results.

In [None]:
# Visualize predictions if we have any
predictions_to_visualize = ml_predictions if ml_predictions else cnn_predictions

if predictions_to_visualize and len(test_images) > 0:
    print("Creating prediction visualizations...")
    
    visualizer = DocumentForgeryVisualizer()
    
    # Visualize prediction confidence distribution
    visualizer.plot_prediction_confidence_distribution(predictions_to_visualize)
    
    # Visualize images with predictions
    image_paths = [pred['image_path'] for pred in predictions_to_visualize[:8]]
    visualizer.visualize_image_predictions(
        image_paths, 
        predictions_to_visualize[:8],
        max_images=8
    )
    
else:
    print("⚠️ No predictions available for visualization")

In [None]:
# Create confusion matrix if we have enough predictions
if len(predictions_to_visualize) >= 5:
    # Prepare data for confusion matrix
    y_true = test_labels[:len(predictions_to_visualize)]
    y_pred = [pred['prediction'] for pred in predictions_to_visualize]
    
    # Convert to numeric labels for confusion matrix
    label_map = {'authentic': 0, 'forged': 1}
    y_true_numeric = [label_map[label] for label in y_true]
    y_pred_numeric = [label_map[pred] for pred in y_pred]
    
    visualizer = DocumentForgeryVisualizer()
    
    # Plot confusion matrix
    visualizer.plot_confusion_matrix(
        np.array(y_true_numeric),
        np.array(y_pred_numeric),
        class_names=['Authentic', 'Forged']
    )
    
    # Plot classification report
    visualizer.plot_classification_report(
        np.array(y_true_numeric),
        np.array(y_pred_numeric),
        class_names=['Authentic', 'Forged']
    )
    
    print("\n✅ Evaluation visualizations completed!")
else:
    print("⚠️ Not enough predictions for confusion matrix (need at least 5)")

## Summary and Next Steps

This notebook demonstrated a complete workflow for document forgery detection:

### What we accomplished:
1. ✅ **Data Setup**: Created/loaded document image dataset
2. ✅ **Preprocessing**: Organized data into train/val/test splits
3. ✅ **Feature Extraction**: Extracted comprehensive image features
4. ✅ **Model Training**: Trained both traditional ML and deep learning models
5. ✅ **Evaluation**: Assessed model performance on test data
6. ✅ **Prediction**: Made predictions on new images
7. ✅ **Visualization**: Created comprehensive analysis plots

### To improve performance with real data:

1. **Larger Dataset**: Use thousands of authentic and forged document images
2. **Better Features**: Add domain-specific features like:
   - EXIF metadata analysis
   - Copy-move detection
   - Splicing detection algorithms
3. **Advanced Models**: Try:
   - Transfer learning with pre-trained models
   - Ensemble methods
   - Attention-based networks
4. **Data Augmentation**: Apply realistic document transformations
5. **Cross-validation**: Use k-fold CV for robust evaluation
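
As one way to approach the cross-validation item, the sketch below generates stratified k-fold index splits with NumPy only, so each fold keeps the same class balance as the full dataset. `stratified_kfold_indices` is a hypothetical helper written for illustration. In practice scikit-learn's `StratifiedKFold` does the same job and is likely the better choice if scikit-learn is already installed.

```python
import numpy as np


def stratified_kfold_indices(labels, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs with each class spread evenly over k folds."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]

    # Shuffle the indices of each class and deal them out across the folds
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for fold_id, chunk in enumerate(np.array_split(idx, k)):
            folds[fold_id].extend(chunk.tolist())

    # Each fold serves as the test set once; the rest form the training set
    for i in range(k):
        test_idx = np.array(sorted(folds[i]))
        train_idx = np.array(sorted(
            j for fold_id, fold in enumerate(folds) if fold_id != i for j in fold
        ))
        yield train_idx, test_idx


labels = ["authentic"] * 50 + ["forged"] * 50
for fold, (train_idx, test_idx) in enumerate(stratified_kfold_indices(labels)):
    print(f"fold {fold}: {len(train_idx)} train, {len(test_idx)} test")
```

Averaging a metric over the folds gives a steadier estimate than the single 10-image test set used in this notebook.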

### Real-world considerations:
- **Adversarial attacks**: Test against sophisticated forgery methods
- **Generalization**: Evaluate on different document types and sources
- **Interpretability**: Provide explanations for predictions
- **Performance**: Optimize for speed and memory usage
- **Ethics**: Consider privacy and bias implications

In [None]:
# Print final summary
print("📋 WORKFLOW SUMMARY")
print("=" * 50)

if sample_data:
    print(f"📁 Dataset: Synthetic sample data (100 images)")
else:
    print(f"📁 Dataset: Real data ({len(image_files)} images)")

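if df_features is not None:
    # Exclude the identifier columns and the 'class' label added in Section 4
    n_features = len([c for c in df_features.columns if c not in ('filename', 'filepath', 'class')])
    print(f"🔧 Features: {n_features} extracted features")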

if ml_detector and ml_results:
    print(f"🤖 ML Model: Random Forest (Test Accuracy: {ml_results.get('test_accuracy', 0):.3f})")

if cnn_detector and cnn_results:
    print(f"🧠 CNN Model: Custom CNN (Val Accuracy: {cnn_results.get('final_accuracy', 0):.3f})")

if predictions_to_visualize:
    model_type = "ML" if ml_predictions else "CNN"
    accuracy = ml_accuracy if ml_predictions else cnn_accuracy
    print(f"🎯 Predictions: {len(predictions_to_visualize)} images ({model_type} accuracy: {accuracy:.3f})")

print("\n🎉 Document Forgery Detection workflow completed successfully!")
print("\n💡 Next steps: Use real document data and experiment with different models for better performance.")