# Exploratory Data Analysis (EDA) - Plant Disease Detection

This notebook performs comprehensive exploratory data analysis on the PlantVillage dataset:
1. **Dataset Overview** - Structure, size, classes
2. **Class Distribution** - Balance analysis
3. **Image Properties** - Dimensions, channels, statistics
4. **Sample Visualization** - View images from each class
5. **Data Quality** - Missing values, corrupted files
6. **Preprocessing Requirements** - Recommendations

---

## Authors: 1.MUHAMMAD AMMAR 2. ABDUL HAKEEM            
## Date: November 2025
## Course: Artificial Intelligence - Progress Report II

---

## 1. Setup and Imports

In [1]:
# Add parent directory to path
import sys
import os
sys.path.append(os.path.abspath('..'))

# Import preprocessing modules
from preprocessing import DataPreprocessingPipeline, DataAugmentation
from torch.utils.data import Dataset

# Import utilities
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from pathlib import Path
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
%matplotlib inline

print("‚úÖ All imports successful!")

‚úÖ All imports successful!


## 2. Configuration

In [None]:
# Configuration
config = {
    'img_size': 224,
    'batch_size': 32,
    'augmentation': True,
    'train_ratio': 0.7,
    'val_ratio': 0.15,
    'test_ratio': 0.15,
    'num_workers': 4
}

# Data path - PAKISTAN UNIFIED DATASET (34 disease classes)
data_path = '../data/PakistanCrops_Merged'  # Rice, Cotton, Wheat, Mango + PlantVillage crops

# Results directory
results_dir = '../results/eda'
os.makedirs(results_dir, exist_ok=True)

print("üìã Configuration:")
for key, value in config.items():
    print(f"  {key}: {value}")
print(f"\nüìÅ Data Path: {data_path}")
print(f"üìÅ Results Directory: {results_dir}")

üìã Configuration:
  img_size: 224
  batch_size: 32
  augmentation: True
  train_ratio: 0.7
  val_ratio: 0.15
  test_ratio: 0.15
  num_workers: 4

üìÅ Data Path: ../data/PlantVillage
üìÅ Results Directory: ../results/eda


In [None]:
# ‚ö†Ô∏è KAGGLE DOWNLOAD DISABLED - Using Pakistan Unified Dataset
# The Pakistan dataset is already prepared with 46,808 images across 34 classes

print("="*70)
print("üìÅ USING PAKISTAN UNIFIED DATASET (34 CLASSES)")
print("="*70)
print(f"‚úÖ Dataset location: {data_path}")
print("üìä This dataset includes Rice, Cotton, Wheat, Mango + PlantVillage crops")
print("üìä Specifically optimized for Pakistani agricultural context")
print("="*70 + "\n")

# Verify path exists
if os.path.exists(data_path):
    print(f"‚úÖ Dataset path verified: {data_path}")
    # List subdirectories (disease classes)
    subdirs = [d for d in os.listdir(data_path) if os.path.isdir(os.path.join(data_path, d))]
    print(f"üìÇ Found {len(subdirs)} disease classes")
else:
    print(f"‚ö†Ô∏è Warning: Dataset path not found: {data_path}")
    print("Please ensure the Pakistan dataset exists at the specified location.")

Note: you may need to restart the kernel to use updated packages.
üì• Downloading PlantDisease dataset from Kaggle...
‚úÖ Dataset downloaded successfully!
üìÅ Path to dataset files: C:\Users\PC\.cache\kagglehub\datasets\emmarex\plantdisease\versions\1

üîÑ Updated data_path to: C:\Users\PC\.cache\kagglehub\datasets\emmarex\plantdisease\versions\1

üìÇ Dataset contents:
  üìÅ PlantVillage/ (16 files)


## 2.1 Download Kaggle Dataset (Optional)

If you want to use the real PlantDisease dataset from Kaggle, run this cell.
This will download the dataset automatically using kagglehub.

**Note:** You need to have Kaggle API credentials set up.

## 3. Initialize Data Pipeline

In [4]:
# Initialize preprocessing pipeline
pipeline = DataPreprocessingPipeline(config)

print("‚úÖ Data preprocessing pipeline initialized")
print(f"\nüìä Pipeline Configuration:")
print(f"  Image Size: {pipeline.img_size}x{pipeline.img_size}")
print(f"  Batch Size: {pipeline.batch_size}")
print(f"  Augmentation: {pipeline.augmentation_enabled}")
print(f"  Train/Val/Test Split: {pipeline.train_ratio}/{pipeline.val_ratio}/{pipeline.test_ratio}")

INFO:preprocessing.data_pipeline:Transforms configured successfully
INFO:preprocessing.data_pipeline:DataPreprocessingPipeline initialized: img_size=224, batch_size=32, augmentation=True


‚úÖ Data preprocessing pipeline initialized

üìä Pipeline Configuration:
  Image Size: 224x224
  Batch Size: 32
  Augmentation: True
  Train/Val/Test Split: 0.7/0.15/0.15


## 4. Load Dataset

**Note:** If you don't have the dataset yet, you can:
1. Download PlantVillage dataset from Kaggle
2. Or use the test_images from the repository for demonstration
3. Expected structure:
   ```
   data/PlantVillage/
       Apple___Apple_scab/
           image1.jpg
           image2.jpg
       Tomato___Bacterial_spot/
           image1.jpg
           image2.jpg
       ...
   ```

In [5]:
# Check if data path exists
if os.path.exists(data_path):
    print(f"‚úÖ Data directory found: {data_path}")
    
    # Load data
    try:
        pipeline.load_data(data_path, structure='directory')
        print(f"\n‚úÖ Data loaded successfully!")
        print(f"Total Images: {len(pipeline.image_paths)}")
        print(f"Number of Classes: {len(pipeline.class_names)}")
    except Exception as e:
        print(f"‚ùå Error loading data: {e}")
        print("\nüí° Creating mock dataset for demonstration...")
        # Create mock data for demonstration
        pipeline.class_names = ['Apple___Apple_scab', 'Tomato___Bacterial_spot', 
                                'Potato___Late_blight', 'Corn___Common_rust']
        pipeline.image_paths = [f'image_{i}.jpg' for i in range(1000)]
        pipeline.labels = np.random.randint(0, 4, 1000).tolist()
        print("‚úÖ Mock dataset created for demonstration")
else:
    print(f"‚ö†Ô∏è  Data directory not found: {data_path}")
    print("\nüí° Creating mock dataset for demonstration...")
    
    # Create mock data for demonstration
    pipeline.class_names = [
        'Apple___Apple_scab',
        'Apple___Black_rot',
        'Apple___Cedar_apple_rust',
        'Apple___healthy',
        'Tomato___Bacterial_spot',
        'Tomato___Early_blight',
        'Tomato___Late_blight',
        'Tomato___healthy',
        'Potato___Early_blight',
        'Potato___Late_blight',
        'Potato___healthy'
    ]
    
    # Generate mock data with realistic distribution
    num_images = 5000
    pipeline.image_paths = [f'mock_image_{i}.jpg' for i in range(num_images)]
    
    # Create slightly imbalanced distribution (realistic)
    weights = np.array([1.0, 0.9, 0.8, 1.2, 1.1, 0.95, 0.85, 1.3, 1.0, 0.9, 1.1])
    weights = weights / weights.sum()
    pipeline.labels = np.random.choice(len(pipeline.class_names), num_images, p=weights).tolist()
    
    pipeline.class_to_idx = {name: idx for idx, name in enumerate(pipeline.class_names)}
    
    print("‚úÖ Mock dataset created for demonstration")
    print(f"Total Images: {len(pipeline.image_paths)}")
    print(f"Number of Classes: {len(pipeline.class_names)}")

INFO:preprocessing.data_pipeline:Loaded from directory: 1 classes, 0 images
INFO:preprocessing.data_pipeline:Loaded 0 images from 1 classes


‚úÖ Data directory found: C:\Users\PC\.cache\kagglehub\datasets\emmarex\plantdisease\versions\1

‚úÖ Data loaded successfully!
Total Images: 0
Number of Classes: 1


## 5. Dataset Overview

In [None]:
# Display class names
print("üå± Plant Disease Classes:\n")
for i, class_name in enumerate(pipeline.class_names, 1):
    plant, disease = class_name.split('___')
    print(f"{i:2d}. {plant:15s} - {disease}")

üå± Plant Disease Classes:



ValueError: not enough values to unpack (expected 2, got 1)

: 

In [None]:
# Class distribution
label_counts = Counter(pipeline.labels)
class_distribution = {
    pipeline.class_names[label]: count 
    for label, count in label_counts.items()
}

# Create DataFrame
df_dist = pd.DataFrame([
    {'Class': class_name, 'Count': count, 'Percentage': f"{count/len(pipeline.labels)*100:.2f}%"}
    for class_name, count in class_distribution.items()
]).sort_values('Count', ascending=False)

print("\nüìä Class Distribution:\n")
print(df_dist.to_string(index=False))

print(f"\nüìà Statistics:")
print(f"  Total Images: {len(pipeline.labels)}")
print(f"  Mean per class: {np.mean(list(label_counts.values())):.2f}")
print(f"  Std per class: {np.std(list(label_counts.values())):.2f}")
print(f"  Min: {min(label_counts.values())}")
print(f"  Max: {max(label_counts.values())}")
print(f"  Balance Ratio (Min/Max): {min(label_counts.values())/max(label_counts.values()):.2f}")


üìä Class Distribution:

                   Class  Count Percentage
        Tomato___healthy    622     12.44%
        Potato___healthy    552     11.04%
         Apple___healthy    526     10.52%
 Tomato___Bacterial_spot    500     10.00%
   Potato___Early_blight    435      8.70%
   Tomato___Early_blight    424      8.48%
      Apple___Apple_scab    421      8.42%
    Tomato___Late_blight    412      8.24%
    Potato___Late_blight    394      7.88%
       Apple___Black_rot    393      7.86%
Apple___Cedar_apple_rust    321      6.42%

üìà Statistics:
  Total Images: 5000
  Mean per class: 454.55
  Std per class: 82.16
  Min: 321
  Max: 622
  Balance Ratio (Min/Max): 0.52


: 

## 6. Visualizations

In [None]:
# Class distribution bar plot
plt.figure(figsize=(14, 6))
classes = [c.replace('___', '\n') for c in df_dist['Class']]
counts = df_dist['Count'].values

bars = plt.bar(range(len(classes)), counts, color='steelblue', alpha=0.8, edgecolor='black')

# Color the bars based on count (highlight min and max)
max_idx = counts.argmax()
min_idx = counts.argmin()
bars[max_idx].set_color('green')
bars[min_idx].set_color('red')

plt.xlabel('Disease Class', fontsize=12, fontweight='bold')
plt.ylabel('Number of Images', fontsize=12, fontweight='bold')
plt.title('Class Distribution - PlantVillage Dataset', fontsize=14, fontweight='bold')
plt.xticks(range(len(classes)), classes, rotation=45, ha='right', fontsize=9)
plt.axhline(y=np.mean(counts), color='orange', linestyle='--', linewidth=2, 
            label=f'Mean: {np.mean(counts):.0f}')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(results_dir, 'class_distribution.png'), dpi=300, bbox_inches='tight')
plt.show()

print("‚úÖ Class distribution plot saved")

In [None]:
# Pie chart for class proportions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

# Pie chart
colors = sns.color_palette('Set3', len(classes))
ax1.pie(counts, labels=classes, autopct='%1.1f%%', colors=colors, startangle=90)
ax1.set_title('Class Proportion', fontsize=14, fontweight='bold')

# Box plot
ax2.boxplot([counts], vert=True, patch_artist=True,
            boxprops=dict(facecolor='lightblue', alpha=0.7),
            medianprops=dict(color='red', linewidth=2))
ax2.set_ylabel('Number of Images', fontsize=12, fontweight='bold')
ax2.set_title('Class Distribution Statistics', fontsize=14, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)
ax2.set_xticklabels(['All Classes'])

plt.tight_layout()
plt.savefig(os.path.join(results_dir, 'class_statistics.png'), dpi=300, bbox_inches='tight')
plt.show()

print("‚úÖ Class statistics plot saved")

## 7. Data Balance Analysis

In [None]:
# Calculate balance metrics
balance_ratio = min(counts) / max(counts)
imbalance_severity = 1 - balance_ratio

print("‚öñÔ∏è  Data Balance Analysis:\n")
print(f"Balance Ratio: {balance_ratio:.3f}")
print(f"Imbalance Severity: {imbalance_severity:.3f}")

if balance_ratio >= 0.9:
    print("\n‚úÖ Dataset is well-balanced")
    recommendation = "No special handling needed"
elif balance_ratio >= 0.7:
    print("\n‚ö†Ô∏è  Dataset is slightly imbalanced")
    recommendation = "Consider using class weights in loss function"
elif balance_ratio >= 0.5:
    print("\n‚ö†Ô∏è  Dataset is moderately imbalanced")
    recommendation = "Use class weights + data augmentation for minority classes"
else:
    print("\n‚ùå Dataset is severely imbalanced")
    recommendation = "Use class weights + oversampling/undersampling + data augmentation"

print(f"\nüí° Recommendation: {recommendation}")

In [None]:
# Calculate class weights for handling imbalance
if len(pipeline.labels) > 0:
    try:
        import torch
        class_weights = pipeline.get_class_weights()
        
        print("\n‚öñÔ∏è  Calculated Class Weights (for loss function):\n")
        for i, (class_name, weight) in enumerate(zip(pipeline.class_names, class_weights)):
            print(f"{class_name:40s}: {weight:.4f}")
    except:
        print("\n‚ö†Ô∏è  PyTorch not available for class weight calculation")

## 8. Train/Val/Test Split Analysis

In [None]:
# Calculate split sizes
total_images = len(pipeline.labels)
train_size = int(pipeline.train_ratio * total_images)
val_size = int(pipeline.val_ratio * total_images)
test_size = total_images - train_size - val_size

print("üìä Data Split Configuration:\n")
print(f"Total Images: {total_images}")
print(f"\nTrain Set: {train_size} ({pipeline.train_ratio*100:.0f}%)")
print(f"Val Set:   {val_size} ({pipeline.val_ratio*100:.0f}%)")
print(f"Test Set:  {test_size} ({pipeline.test_ratio*100:.0f}%)")

# Visualize split
fig, ax = plt.subplots(figsize=(10, 6))
splits = ['Train', 'Validation', 'Test']
sizes = [train_size, val_size, test_size]
colors = ['#3498db', '#e74c3c', '#2ecc71']

wedges, texts, autotexts = ax.pie(sizes, labels=splits, autopct='%1.1f%%',
                                    colors=colors, startangle=90,
                                    explode=(0.05, 0.05, 0.05))

for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontsize(12)
    autotext.set_fontweight('bold')

ax.set_title('Train/Validation/Test Split', fontsize=14, fontweight='bold')

# Add legend with actual numbers
legend_labels = [f"{split}: {size} images" for split, size in zip(splits, sizes)]
ax.legend(legend_labels, loc='upper left', bbox_to_anchor=(1, 1))

plt.tight_layout()
plt.savefig(os.path.join(results_dir, 'data_split.png'), dpi=300, bbox_inches='tight')
plt.show()

print("\n‚úÖ Data split visualization saved")

## 9. Data Augmentation Preview

In [None]:
# Initialize augmentation
augmentation = DataAugmentation({
    'rotation_range': 15,
    'flip_probability': 0.5,
    'brightness_range': 0.2,
    'contrast_range': 0.2,
    'saturation_range': 0.2,
    'hue_range': 0.1,
    'zoom_range': (0.8, 1.0)
})

print("üé® Data Augmentation Configuration:\n")
aug_config = augmentation.get_augmentation_config()
for key, value in aug_config.items():
    print(f"  {key}: {value}")

In [None]:
# Visualize augmentation effects
print("\nüìù Note: To visualize augmentation effects, you would need actual images.")
print("The augmentation pipeline includes:")
print("  ‚úÖ Random rotation (¬±15¬∞)")
print("  ‚úÖ Random horizontal flip (50%)")
print("  ‚úÖ Color jitter (brightness, contrast, saturation, hue)")
print("  ‚úÖ Random resized crop (80-100% scale)")
print("  ‚úÖ Gaussian blur (20% probability)")
print("  ‚úÖ Random affine transformations (30% probability)")

## 10. Data Quality Assessment

In [None]:
print("üîç Data Quality Assessment:\n")

# Check for missing values
print("1. Missing Values:")
missing_paths = sum(1 for p in pipeline.image_paths if not p)
missing_labels = sum(1 for l in pipeline.labels if l is None)
print(f"   Missing image paths: {missing_paths}")
print(f"   Missing labels: {missing_labels}")

# Check label distribution
print("\n2. Label Distribution:")
unique_labels = len(set(pipeline.labels))
print(f"   Unique classes: {unique_labels} / {len(pipeline.class_names)}")

# Check for potential issues
print("\n3. Potential Issues:")
issues = []
if balance_ratio < 0.7:
    issues.append("‚ö†Ô∏è  Class imbalance detected")
if len(pipeline.labels) < 1000:
    issues.append("‚ö†Ô∏è  Small dataset size")
if unique_labels < len(pipeline.class_names):
    issues.append("‚ö†Ô∏è  Some classes have no samples")

if issues:
    for issue in issues:
        print(f"   {issue}")
else:
    print("   ‚úÖ No major issues detected")

# Overall quality score
quality_score = 100
if balance_ratio < 0.7:
    quality_score -= 20
if len(pipeline.labels) < 1000:
    quality_score -= 15
if unique_labels < len(pipeline.class_names):
    quality_score -= 25

print(f"\nüìä Overall Data Quality Score: {quality_score}/100")

if quality_score >= 90:
    print("‚úÖ Excellent data quality")
elif quality_score >= 75:
    print("‚úÖ Good data quality")
elif quality_score >= 60:
    print("‚ö†Ô∏è  Fair data quality - improvements recommended")
else:
    print("‚ùå Poor data quality - significant improvements needed")

## 11. Recommendations for Model Training

In [None]:
print("üí° Recommendations for Model Training:\n")
print("="*70)

recommendations = []

# Based on dataset size
if len(pipeline.labels) < 5000:
    recommendations.append({
        'category': 'üìä Dataset Size',
        'recommendation': 'Use transfer learning with pre-trained models (ResNet, VGG)',
        'reason': f'Dataset has {len(pipeline.labels)} images - transfer learning will help'
    })
    recommendations.append({
        'category': 'üé® Augmentation',
        'recommendation': 'Use heavy data augmentation',
        'reason': 'Small dataset benefits from extensive augmentation'
    })
else:
    recommendations.append({
        'category': 'üìä Dataset Size',
        'recommendation': 'Can train models from scratch or use transfer learning',
        'reason': f'Dataset has {len(pipeline.labels)} images - sufficient for training'
    })

# Based on class imbalance
if balance_ratio < 0.7:
    recommendations.append({
        'category': '‚öñÔ∏è  Class Balance',
        'recommendation': 'Use class weights in loss function',
        'reason': f'Balance ratio is {balance_ratio:.2f} - imbalanced classes detected'
    })
    recommendations.append({
        'category': 'üéØ Sampling',
        'recommendation': 'Consider oversampling minority classes',
        'reason': 'Will help balance training data'
    })

# Training strategy
recommendations.append({
    'category': 'üîß Training Strategy',
    'recommendation': 'Use learning rate scheduling and early stopping',
    'reason': 'Prevents overfitting and improves convergence'
})

recommendations.append({
    'category': 'üìà Evaluation',
    'recommendation': 'Monitor per-class metrics (precision, recall, F1)',
    'reason': 'Important for imbalanced datasets'
})

# Display recommendations
for i, rec in enumerate(recommendations, 1):
    print(f"\n{i}. {rec['category']}")
    print(f"   Recommendation: {rec['recommendation']}")
    print(f"   Reason: {rec['reason']}")

print("\n" + "="*70)

## 12. Summary Statistics

In [None]:
# Create comprehensive summary
summary = {
    'Dataset Overview': {
        'Total Images': len(pipeline.labels),
        'Number of Classes': len(pipeline.class_names),
        'Train Set': f"{train_size} ({pipeline.train_ratio*100:.0f}%)",
        'Val Set': f"{val_size} ({pipeline.val_ratio*100:.0f}%)",
        'Test Set': f"{test_size} ({pipeline.test_ratio*100:.0f}%)"
    },
    'Class Distribution': {
        'Mean Images per Class': f"{np.mean(list(label_counts.values())):.2f}",
        'Std Images per Class': f"{np.std(list(label_counts.values())):.2f}",
        'Min Images': min(label_counts.values()),
        'Max Images': max(label_counts.values()),
        'Balance Ratio': f"{balance_ratio:.3f}"
    },
    'Configuration': {
        'Image Size': f"{config['img_size']}x{config['img_size']}",
        'Batch Size': config['batch_size'],
        'Augmentation': 'Enabled' if config['augmentation'] else 'Disabled',
        'Num Workers': config['num_workers']
    },
    'Data Quality': {
        'Quality Score': f"{quality_score}/100",
        'Missing Values': missing_paths + missing_labels,
        'Issues Detected': len(issues)
    }
}

print("\nüìã COMPREHENSIVE EDA SUMMARY")
print("="*70)

for section, metrics in summary.items():
    print(f"\n{section}:")
    print("-" * 50)
    for key, value in metrics.items():
        print(f"  {key:30s}: {value}")

print("\n" + "="*70)
print("‚úÖ Exploratory Data Analysis Complete!")
print(f"üìÅ Results saved to: {results_dir}")

## 13. Export Summary Report

In [None]:
# Save summary to JSON
import json

summary_file = os.path.join(results_dir, 'eda_summary.json')
with open(summary_file, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"‚úÖ Summary saved to: {summary_file}")

# Save class distribution to CSV
csv_file = os.path.join(results_dir, 'class_distribution.csv')
df_dist.to_csv(csv_file, index=False)
print(f"‚úÖ Class distribution saved to: {csv_file}")

print("\nüéì EDA complete and ready for Progress Report II!")