

# **Multi-Class Image Classifier w/ Auto Grouping**


**Because the images vary widely in type, an effective approach is to group them first and then perform authenticity classification within each group. This notebook demonstrates the grouping step.**


## **Pipeline Explanation**

This is a **rule-based image classification system** that automatically groups images into 6 categories based on visual characteristics without requiring machine learning training.

### **Pipeline Flow Overview:**

```
1. IMAGE LOADING → 2. DIMENSION ANALYSIS → 3. ASPECT RATIO CHECK → 4. VISUAL FEATURE EXTRACTION → 5. RULE-BASED CLASSIFICATION
```

### **Detailed Pipeline Stages:**

#### **1. Image Collection & Preprocessing**
- **Function**: `get_limited_image_paths()`
- **Purpose**: Recursively scans directories for image files (JPG, PNG, JPEG)
- **Limit**: Processes at most 300 images by default (`max_images=300`) to manage computational load
- **Output**: List of valid image file paths
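
The collection stage can be sketched as below. This is a minimal variant, not the notebook's own function: the name `collect_image_paths` and the constant `IMAGE_SUFFIXES` are hypothetical, and matching on the lowercased suffix is an assumption chosen so that each file is visited once regardless of how the extension is capitalized.

```python
from pathlib import Path

# Hypothetical constant: the extensions named in the stage description.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def collect_image_paths(data_dir, max_images=300):
    """Return up to max_images image paths found recursively under data_dir."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    paths = []
    # Sorting makes the selected subset reproducible between runs.
    for path in sorted(root.rglob("*")):
        if len(paths) >= max_images:
            break
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            paths.append(path)
    return paths
```

Sorting before truncating is a design choice of this sketch: it costs a full directory walk but keeps the sampled subset stable when the notebook is rerun.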

#### **2. Dimension & Aspect Ratio Analysis**
- **Extracts**: Image width, height, and aspect ratio (width/height)
- **Priority Check**: Identifies extreme aspect ratios first
- **Thresholds**:
  - **Wide images**: Aspect ratio ≥ 2.5 (panoramas, landscapes)
  - **Tall images**: Aspect ratio ≤ 0.4 (portraits, vertical shots)
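
As a rough illustration of the priority check, the sketch below buckets an image by its dimensions alone. The function name `aspect_bucket` and its return labels are hypothetical; the thresholds and the zero-height fallback follow the values used in this notebook.

```python
def aspect_bucket(width, height, wide_threshold=2.5, tall_threshold=0.4):
    """Return 'wide', 'tall', or 'normal' for the given pixel dimensions."""
    # Fall back to a square ratio if the height is zero, to avoid division by zero.
    aspect_ratio = width / height if height != 0 else 1.0
    if aspect_ratio >= wide_threshold:
        return "wide"
    if aspect_ratio <= tall_threshold:
        return "tall"
    return "normal"

print(aspect_bucket(3000, 1000))  # ratio 3.0
print(aspect_bucket(1920, 1080))  # ratio of about 1.78
```

Only images that fall in the "normal" bucket go on to the visual feature rules.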

#### **3. Visual Feature Extraction**
- **Image Normalization**: Resizes all images to 224×224 pixels
- **Key Features Calculated**:
  - **Brightness**: Mean pixel intensity over all pixels and channels (0-255 scale)
  - **Variance**: Variance of pixel values over all pixels and channels (a rough proxy for texture complexity)
  - **Color Dominance**: Per-channel R, G, and B averages
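
The three features reduce to a few NumPy reductions. The sketch below assumes the image is already available as a height × width × 3 array with values on the 0-255 scale; the helper name `extract_features` and the synthetic test image are illustrative only.

```python
import numpy as np

def extract_features(img_array):
    """Compute brightness, variance, and per-channel means for an H x W x 3 array."""
    arr = np.asarray(img_array, dtype=np.float64)
    # Per-channel means: average over the two spatial axes, leaving one value per channel.
    r, g, b = (float(c) for c in arr.mean(axis=(0, 1)))
    return {
        "brightness": float(arr.mean()),  # mean over all pixels and channels
        "variance": float(arr.var()),     # variance over all pixels and channels
        "avg_r": r,
        "avg_g": g,
        "avg_b": b,
    }

# Synthetic example: a pure green image (G = 200, R = B = 0).
demo = np.zeros((224, 224, 3))
demo[..., 1] = 200.0
features = extract_features(demo)
print(features)
```

Note that the variance here mixes differences between channels with differences between pixels, so a flat but strongly colored image (like the synthetic one above) still has a large variance.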

#### **4. Rule-Based Classification Logic**

**Priority System:**
1. **Extreme Aspect Ratios** (Highest Priority)
   - Group 1: Extreme Wide (AR ≥ 2.5)
   - Group 2: Extreme Tall (AR ≤ 0.4)

2. **Visual Features** (Normal Aspect Ratios)
   - Group 3: Bright & High Variance (brightness > 170, variance > 4000)
   - Group 4: Dark & Low Variance (brightness < 100, variance < 2000)
   - Group 5: Green Dominant (green channel > red + 15 and > blue + 15)
   - Group 6: Neutral & Balanced (default for remaining images)
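
The priority system above amounts to an ordered chain of checks in which the first matching rule wins. The sketch below isolates that logic as a pure function; the name `classify_image` is hypothetical, while the thresholds and group names match the ones used in this notebook.

```python
WIDE_THRESHOLD = 2.5
TALL_THRESHOLD = 0.4

def classify_image(aspect_ratio, brightness, variance, r, g, b):
    """Return the group name for one image, checking rules in priority order."""
    # Priority 1: extreme aspect ratios override every visual feature.
    if aspect_ratio >= WIDE_THRESHOLD:
        return "Group_1_ExtremeWide"
    if aspect_ratio <= TALL_THRESHOLD:
        return "Group_2_ExtremeTall"
    # Priority 2: visual features, evaluated top to bottom.
    if brightness > 170 and variance > 4000:
        return "Group_3_Bright_HighVar"
    if brightness < 100 and variance < 2000:
        return "Group_4_Dark_LowVar"
    if g > r + 15 and g > b + 15:
        return "Group_5_GreenDominant"
    # Default: nothing above matched.
    return "Group_6_Neutral_Balanced"

# A bright, detailed panorama still lands in Group 1 because shape is checked first.
print(classify_image(3.0, 180, 5000, 120, 120, 120))
```

Because the rules are ordered, an image that satisfies several conditions is assigned only to the earliest matching group.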

#### **5. Confidence Scoring**
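- Each classification includes a heuristic confidence score computed from a fixed formula; it is not a calibrated probability
- Scores rise with how far a feature lies beyond its threshold, capped at 0.95 for aspect-ratio groups and 0.90 for feature-based groups
- The default group (Neutral & Balanced) always receives a fixed score of 0.65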
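
For reference, the scoring formulas used in the classification code can be written as small standalone functions. The function names below are hypothetical; the constants are the ones that appear in this notebook's classification loop.

```python
def wide_confidence(aspect_ratio, threshold=2.5):
    # Base 0.7, plus 0.1 per unit of aspect ratio beyond the threshold, capped at 0.95.
    return min(0.95, 0.7 + (aspect_ratio - threshold) * 0.1)

def tall_confidence(aspect_ratio, threshold=0.4):
    # Base 0.7, plus 0.2 per unit of aspect ratio below the threshold, capped at 0.95.
    return min(0.95, 0.7 + (threshold - aspect_ratio) * 0.2)

def bright_confidence(variance):
    # Grows with variance, capped at 0.90.
    return min(0.90, 0.6 + variance / 15000)

def dark_confidence(brightness):
    # Grows as brightness drops below 100, capped at 0.90.
    return min(0.90, 0.6 + (100 - brightness) / 100)

def green_confidence(r, g, b):
    # Grows with the margin of green over the stronger of red and blue, capped at 0.90.
    return min(0.90, 0.6 + (g - max(r, b)) / 100)

print(wide_confidence(2.5), wide_confidence(6.0))
```

Under these formulas the wide-image score reaches its 0.95 cap at an aspect ratio of 5.0, and the dark-image score reaches its 0.90 cap once brightness falls to 70 or below.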

#### **6. Results Visualization**
- **Statistical Summary**: Counts, averages, percentages per group
- **Sample Display**: Shows representative images from each category
- **Grouped Visualization**: Creates a comprehensive grid of classified images
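
The statistical summary can be sketched as a small aggregation over the grouped results. The helper `summarize_groups` is hypothetical, and it assumes each group is a list of dictionaries with a `confidence` key, as produced by the classification step. Unlike the notebook's printed report, which divides by the number of image paths found, this sketch computes percentages over successfully classified images only.

```python
from statistics import mean

def summarize_groups(grouped_images):
    """Return count, percentage, and mean confidence for each group."""
    total = sum(len(items) for items in grouped_images.values())
    summary = {}
    for name, items in grouped_images.items():
        count = len(items)
        summary[name] = {
            "count": count,
            "percent": 100.0 * count / total if total else 0.0,
            # Empty groups have no confidence to average.
            "avg_confidence": mean(item["confidence"] for item in items) if items else None,
        }
    return summary

# Illustrative input, not real results.
example = {
    "A": [{"confidence": 0.7}, {"confidence": 0.9}],
    "B": [{"confidence": 0.65}],
    "C": [],
}
summary = summarize_groups(example)
print(summary)
```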

### **Key Advantages:**

1. **No Training Required**: Rule-based approach works immediately
2. **Computationally Efficient**: Uses only a few per-image statistics (mean, variance, channel averages), so no model training or inference is required
3. **Interpretable**: Clear rules make classifications understandable
4. **Customizable**: Thresholds can be easily adjusted
5. **Handles Diverse Images**: Works with various image types and sizes

### **Use Cases:**
- **Image Organization**: Automatically sort photo libraries
- **Content Analysis**: Understand visual characteristics of image datasets
- **Preprocessing**: Group images before more sophisticated ML analysis
- **Quality Assessment**: Identify image types and characteristics

### **Output Groups:**
1. **Extreme Wide** - Landscape panoramas
2. **Extreme Tall** - Portrait/vertical images  
3. **Bright & High Variance** - Detailed, vibrant scenes
4. **Dark & Low Variance** - Low-light, uniform images
5. **Green Dominant** - Nature/vegetation scenes
6. **Neutral & Balanced** - Standard, well-balanced images

This pipeline provides a quick, efficient way to automatically categorize images based on fundamental visual properties without the complexity of machine learning model training.

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from PIL import Image

# ===========================
# 1. Get Limited Image Paths
# ===========================

def get_limited_image_paths(data_dir, max_images=300):
    """Get first max_images image paths from directory"""
    data_path = Path(data_dir)
    
    if not data_path.exists():
        print(f"❌ Directory does not exist: {data_dir}")
        return []
    
    image_paths = []
    
    # Recursively search for image files
    for ext in ['*.jpg', '*.jpeg', '*.png', '*.JPG', '*.JPEG', '*.PNG']:
        found_images = list(data_path.rglob(ext))
        for img_path in found_images:
            if len(image_paths) < max_images:
                image_paths.append(img_path)
            else:
                break
        if len(image_paths) >= max_images:
            break
    
    print(f"Found {len(image_paths)} images in {data_dir}")
    return image_paths

In [None]:
def simple_classify_and_group(data_dir, max_images=300, num_classes=6):
    """
    Shape-First Image Classification Pipeline (6 Classes)
    
    PIPELINE FLOW:
    ┌─────────────────────────────────────────────────────┐
    │ 1. Load Image & Extract Dimensions                  │
    │    → Width, Height, Aspect Ratio                    │
    └─────────────────────┬───────────────────────────────┘
                          │
    ┌─────────────────────▼───────────────────────────────┐
    │ 2. PRIORITY: Check Extreme Aspect Ratios            │
    │    → Wide (≥2.5) → Group 1                          │
    │    → Tall (≤0.4) → Group 2                          │
    └─────────────────────┬───────────────────────────────┘
                          │ If normal aspect ratio (0.4-2.5)
    ┌─────────────────────▼───────────────────────────────┐
    │ 3. Extract Visual Features (224×224 normalized)     │
    │    → Brightness (pixel mean)                        │
    │    → Variance (pixel variance)                      │
    │    → RGB channel averages                           │
    └─────────────────────┬───────────────────────────────┘
                          │
    ┌─────────────────────▼───────────────────────────────┐
    │ 4. Feature-Based Classification (4 Groups)          │
    │    → Bright + High Var → Group 3                    │
    │    → Dark + Low Var → Group 4                       │
    │    → Green Dominant → Group 5                       │
    │    → Others → Group 6 (Neutral)                     │
    └─────────────────────────────────────────────────────┘
    
    CLASSIFICATION GROUPS:
    - Group 1: Extreme Wide (AR≥2.5) - Panoramas, landscapes
    - Group 2: Extreme Tall (AR≤0.4) - Portraits, vertical shots
    - Group 3: Bright & High Variance - Detailed, vibrant images
    - Group 4: Dark & Low Variance - Low-light, uniform images
    - Group 5: Green Dominant - Nature, vegetation scenes
    - Group 6: Neutral & Balanced - Standard balanced images
    """
    
    print("\n" + "="*70)
    print("IMPROVED 6-CLASS CLASSIFICATION (Aspect Ratio + Features)")
    print(f"Processing first {max_images} images")
    print("="*70)
    
    # Get image paths
    image_paths = get_limited_image_paths(data_dir, max_images)
    
    if len(image_paths) == 0:
        print("❌ No images found!")
        return {}, []
    
    # Define 6 class names
    class_names = [
        "Group_1_ExtremeWide",       # Extreme Horizontal
        "Group_2_ExtremeTall",       # Extreme Vertical
        "Group_3_Bright_HighVar",    # Bright and High Variation
        "Group_4_Dark_LowVar",       # Dark and Low Variation
        "Group_5_GreenDominant",     # Green Dominant
        "Group_6_Neutral_Balanced"   # Neutral
    ]
    
    grouped_images = {class_name: [] for class_name in class_names}
    
    print(f"\nProcessing {len(image_paths)} images...")
    
    # Aspect ratio thresholds
    WIDE_THRESHOLD = 2.5    # Threshold for wide image detection
    TALL_THRESHOLD = 0.4    # Threshold for tall image detection
    
    for i, img_path in enumerate(image_paths):
        if (i + 1) % 50 == 0:
            print(f"  Processed {i + 1}/{len(image_paths)} images...")
        
        try:
            # 1. Get image dimensions and aspect ratio
            with Image.open(img_path) as original_img:
                width, height = original_img.size
                aspect_ratio = width / height if height != 0 else 1.0
            
            # 2. Load and extract features
            img = keras.preprocessing.image.load_img(img_path, target_size=(224, 224))
            img_array = keras.preprocessing.image.img_to_array(img)
            
            # Feature extraction
            brightness = np.mean(img_array)
            variance = np.var(img_array)
            r, g, b = np.mean(img_array, axis=(0, 1))
            
            # 3. Classification logic
            # Priority 1: Extreme aspect ratios
            if aspect_ratio >= WIDE_THRESHOLD:
                class_idx = 0  # Extreme Wide
                confidence = min(0.95, 0.7 + (aspect_ratio - WIDE_THRESHOLD) * 0.1)
            
            elif aspect_ratio <= TALL_THRESHOLD:
                class_idx = 1  # Extreme Tall
                confidence = min(0.95, 0.7 + (TALL_THRESHOLD - aspect_ratio) * 0.2)
            
            # Priority 2: Feature-based classification for normal aspect ratios
            elif brightness > 170 and variance > 4000:
                class_idx = 2  # Bright & High Variance
                confidence = min(0.90, 0.6 + variance / 15000)
            
            elif brightness < 100 and variance < 2000:
                class_idx = 3  # Dark & Low Variance
                confidence = min(0.90, 0.6 + (100 - brightness) / 100)
            
            elif g > (r + 15) and g > (b + 15):
                class_idx = 4  # Green Dominant
                confidence = min(0.90, 0.6 + (g - max(r, b)) / 100)
            
            else:
                class_idx = 5  # Neutral & Balanced (default)
                confidence = 0.65
            
            predicted_class = class_names[class_idx]
            
            grouped_images[predicted_class].append({
                'path': str(img_path),
                'confidence': confidence,
                'brightness': brightness,
                'variance': variance,
                'width': width,
                'height': height,
                'aspect_ratio': aspect_ratio,
                'avg_r': r,
                'avg_g': g,
                'avg_b': b
            })
            
        except Exception as e:
            print(f"⚠️ Error processing {img_path}: {e}")
    
    # Print detailed results
    print("\n" + "="*70)
    print("CLASSIFICATION RESULTS (6 CLASSES)")
    print("="*70)
    
    total_classified = 0
    for class_name in class_names:
        count = len(grouped_images[class_name])
        total_classified += count
        
        if count > 0:
            avg_brightness = np.mean([img['brightness'] for img in grouped_images[class_name]])
            avg_variance = np.mean([img['variance'] for img in grouped_images[class_name]])
            avg_aspect = np.mean([img['aspect_ratio'] for img in grouped_images[class_name]])
            avg_confidence = np.mean([img['confidence'] for img in grouped_images[class_name]])
            
            print(f"\n{class_name}")
            print(f"  Count         : {count:3d} images ({count/len(image_paths)*100:5.1f}%)")
            print(f"  Brightness    : {avg_brightness:6.1f}")
            print(f"  Variance      : {avg_variance:8.1f}")
            print(f"  Aspect Ratio  : {avg_aspect:5.2f}")
            print(f"  Confidence    : {avg_confidence:5.2f}")
        else:
            print(f"\n{class_name}")
            print(f"  Count         : {count:3d} images")
    
    print("\n" + "="*70)
    print(f"Total classified: {total_classified}/{len(image_paths)}")
    print("="*70)
    
    return grouped_images, class_names


def display_sample_images_per_class(grouped_images, class_names, samples_per_class=3):
    """
    Display sample images from each class
    """
    print("\n" + "="*70)
    print("SAMPLE IMAGES FROM EACH CLASS")
    print("="*70)
    
    for class_name in class_names:
        images = grouped_images[class_name]
        if len(images) == 0:
            print(f"\n{class_name}: No images")
            continue
        
        # Get sample indices
        num_samples = min(samples_per_class, len(images))
        sample_indices = np.linspace(0, len(images)-1, num_samples, dtype=int)
        
        print(f"\n{class_name} ({len(images)} images)")
        print("-" * 70)
        
        fig, axes = plt.subplots(1, num_samples, figsize=(5*num_samples, 5))
        if num_samples == 1:
            axes = [axes]
        
        for idx, sample_idx in enumerate(sample_indices):
            img_info = images[sample_idx]
            img = keras.preprocessing.image.load_img(img_info['path'])
            
            axes[idx].imshow(img)
            axes[idx].axis('off')
            title = f"Aspect: {img_info['aspect_ratio']:.2f}\n"
            title += f"Bright: {img_info['brightness']:.0f}\n"
            title += f"Conf: {img_info['confidence']:.2f}"
            axes[idx].set_title(title, fontsize=10)
        
        plt.tight_layout()
        plt.show()

In [None]:
# ===========================
# 3. Display Grouped Images 
# ===========================

def display_simple_grouped_images(grouped_images, class_names, images_per_class=6,
                                 save_path='grouped_images_6classes.png'):
    """Display images grouped by simple classification"""
    
    print("\n" + "="*70)
    print(f"GENERATING GROUPED IMAGE DISPLAY ({len(class_names)} CLASSES)")
    print("="*70)
    
    # Filter out empty classes
    non_empty_classes = [cls for cls in class_names if len(grouped_images[cls]) > 0]
    
    if not non_empty_classes:
        print("No images to display!")
        return
    
    # Create figure
    fig = plt.figure(figsize=(20, 3 * len(non_empty_classes)))
    fig.suptitle(f'Images Grouped into {len(class_names)} Classes', 
                 fontsize=16, fontweight='bold', y=0.995)
    
    for class_idx, class_name in enumerate(non_empty_classes):
        images = grouped_images[class_name]
        
        # Select images to display
        display_images = images[:images_per_class]
        num_display = len(display_images)
        
        # Display images for this class
        for img_idx in range(images_per_class):
            ax = plt.subplot(len(non_empty_classes), images_per_class, 
                             class_idx * images_per_class + img_idx + 1)
            
            if img_idx < num_display:
                img_data = display_images[img_idx]
                
                # Load and display image
                img = keras.preprocessing.image.load_img(img_data['path'])
                ax.imshow(img)
                
                # Add confidence info
                confidence = img_data['confidence']
                ax.set_title(f"Conf: {confidence:.1%}", fontsize=8)
                
                # Add class name on first image
                if img_idx == 0:
                    ax.text(-0.15, 0.5, f"{class_name}\n({len(images)} images)", 
                            transform=ax.transAxes,
                            fontsize=10, fontweight='bold',
                            rotation=90, va='center', ha='right')
            
            ax.axis('off')
    
    plt.tight_layout()
    plt.savefig(save_path, dpi=150, bbox_inches='tight')
    print(f"✓ Grouped images saved to '{save_path}'")
    
    return fig

# ===========================
# 4. Main Pipeline
# ===========================

def main(data_dir, max_images=300):
    print("\n" + "="*70)
    print(f"PROCESSING UP TO {max_images} IMAGES INTO 6 CLASSES")
    print("="*70)
    
    # Run simple classification based on image characteristics
    print("\n1. Running image-based classification (6 classes)...")
    grouped_images, class_names = simple_classify_and_group(data_dir, max_images, num_classes=6)
    
    if not grouped_images or all(len(v) == 0 for v in grouped_images.values()):
        print("❌ No images were classified. Cannot proceed.")
        return None, None
    
    # Display results
    display_simple_grouped_images(grouped_images, class_names)

    # Summary
    print("\n" + "="*70)
    print("SUMMARY")
    print("="*70)
    total_classified = sum(len(v) for v in grouped_images.values())
    print(f"📊 Total images processed: {total_classified}")

    return grouped_images, class_names

# ===========================
# 5. Execute
# ===========================

if __name__ == "__main__":
    # Update this path to your actual directory
    DATA_DIR = '/kaggle/input/recodai-luc-scientific-image-forgery-detection/train_images/authentic'
    
    try:
        test_paths = get_limited_image_paths(DATA_DIR, max_images=1000)
        
        if len(test_paths) == 0:
            print(f"\n❌ No images found in {DATA_DIR}")
            print("\n💡 Suggestions:")
            print("   1. Check if the directory path is correct")
            print("   2. Verify the directory contains image files")
            print("   3. Check file permissions")
        # get_limited_image_paths caps the result at 1000, so "fewer than requested" means < 1000
        elif len(test_paths) < 1000:
            print(f"\n⚠️  Found only {len(test_paths)} images")
            print(f"Proceeding with available {len(test_paths)} images...")
            grouped_images, class_names = main(DATA_DIR, max_images=len(test_paths))
        else:
            print(f"\n✓ Found {len(test_paths)} images.")
            grouped_images, class_names = main(DATA_DIR, max_images=1000)
            
    except Exception as e:
        print(f"\n❌ Error: {e}")
        import traceback
        traceback.print_exc()