# Comprehensive Data Inspection and EDA for FarmAI Analytics
## Project: FarmAI - Smart Crop Disease Detection System


**Objective:**
This notebook initializes the **FarmAI** project pipeline. Our goal is to inspect the raw **PlantVillage** dataset, verify data integrity, analyze class distributions, and prepare a clean dataset for the Deep Learning model.

## 1. ENVIRONMENT SETUP & IMPORTS

In [None]:
import sys
from pathlib import Path
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image # For image handling
import json # For saving class indices


In [None]:
project_root = Path.cwd().parent
print(f"Project root: {project_root}")

In [None]:
sys.path.insert(0, str(project_root))

In [None]:
from src import config

print("✅ Config loaded successfully!")

# Some versions of the `config` module may not expose `get_config_summary`.
# Use it if available; otherwise fall back to printing key attributes.
FALLBACK_KEYS = [
    "RANDOM_SEED",
    "RAW_DATA_DIR",
    "FIGURES_DIR",
    "PROCESSED_DATA_DIR",
    "METRICS_DIR",
    "IMG_SIZE",
    "NUM_CLASSES_TO_USE",
]


def print_fallback_summary():
    """Print a minimal config summary from the known attribute names."""
    for key in FALLBACK_KEYS:
        print(f"  - {key}: {getattr(config, key, None)}")


if hasattr(config, "get_config_summary"):
    try:
        config.get_config_summary()
    except Exception as e:
        print(f"Error calling config.get_config_summary(): {e}")
        print("Falling back to printing a minimal config summary below.")
        print_fallback_summary()
else:
    print("get_config_summary() not found in config module. Falling back to printing a minimal config summary:")
    print_fallback_summary()

import tensorflow as tf  # Required for tf.random.set_seed below
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Set random seeds for reproducibility
np.random.seed(config.RANDOM_SEED)
random.seed(config.RANDOM_SEED)
tf.random.set_seed(config.RANDOM_SEED)
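
The cells above read several attributes from `src/config.py`, which is not shown in this notebook. As a rough sketch only, a minimal module exposing those names might look like the following. Only the attribute names come from this notebook; every value (seed, image size, class count, directory layout) is a placeholder assumption, not the project's actual configuration.

```python
from pathlib import Path

# Hypothetical stand-in for src/config.py. Attribute names match those this
# notebook reads; every value below is a placeholder assumption.
PROJECT_ROOT = Path.cwd()
RANDOM_SEED = 42
IMG_SIZE = (224, 224)
NUM_CLASSES_TO_USE = 10
RAW_DATA_DIR = PROJECT_ROOT / "data" / "raw" / "plantvillage" / "color"
PROCESSED_DATA_DIR = PROJECT_ROOT / "data" / "processed"
FIGURES_DIR = PROJECT_ROOT / "reports" / "figures"
METRICS_DIR = PROJECT_ROOT / "reports" / "metrics"


def validate_config():
    """Raise ValueError if a setting is obviously invalid."""
    if len(IMG_SIZE) != 2 or min(IMG_SIZE) <= 0:
        raise ValueError(f"IMG_SIZE must be two positive ints, got {IMG_SIZE}")
    if NUM_CLASSES_TO_USE is not None and NUM_CLASSES_TO_USE < 1:
        raise ValueError("NUM_CLASSES_TO_USE must be None or >= 1")
```

Keeping the directories as `pathlib.Path` objects is what allows calls such as `config.FIGURES_DIR.mkdir(...)` and `config.FIGURES_DIR / "name.png"` later in the notebook.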

In [None]:
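# Validate configuration
try:
    config.validate_config()
    # get_config_summary() is optional, so guard the call as in the earlier cell.
    if hasattr(config, "get_config_summary"):
        config.get_config_summary()
    print("Environment and configuration loaded.")
except Exception as e:
    print(f"Error validating configuration: {e}")
    sys.exit(1)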

In [None]:
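# Ensure output directories exist
config.FIGURES_DIR.mkdir(parents=True, exist_ok=True)
config.METRICS_DIR.mkdir(parents=True, exist_ok=True)
config.PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)  # The cleaned CSV is written here in Section 2.3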


## 2. DATA EXTRACTION & INSPECTION


### 2.1. Load and Verify Dataset Structure

We load the raw image data from the local directory. This step confirms that the dataset path is correct and enumerates the crop disease classes that **FarmAI** will learn to detect.

In [None]:
raw_data_path = config.RAW_DATA_DIR

if not raw_data_path.exists():
    print(f"Error: Raw data directory not found at {raw_data_path}")
    print("Action: Please download the PlantVillage dataset and place the 'color' folder inside 'data/raw/plantvillage/'.")
    sys.exit(1)

# List all class directories
class_dirs = [d for d in raw_data_path.iterdir() if d.is_dir()]
all_class_names = sorted([d.name for d in class_dirs])

# Select a subset of classes for this analysis, as specified in config.
# This makes the notebook run faster for demonstration purposes.
if config.NUM_CLASSES_TO_USE and config.NUM_CLASSES_TO_USE < len(all_class_names):
    selected_class_names = random.sample(all_class_names, config.NUM_CLASSES_TO_USE)
    selected_class_names = sorted(selected_class_names) # Keep consistent order
else:
    selected_class_names = all_class_names

num_selected_classes = len(selected_class_names)
print(f"Dataset root: {raw_data_path}")
print(f"Total classes found in raw data: {len(all_class_names)}")
print(f"Using {num_selected_classes} classes for this analysis: {selected_class_names}")

# Collect file paths and labels
filepaths = []
labels = []
for class_name in selected_class_names:
    class_path = raw_data_path / class_name
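    # Match .jpg, .jpeg, and .png case-insensitively; sort for a reproducible order
    images = sorted(
        p for p in class_path.iterdir()
        if p.is_file() and p.suffix.lower() in {'.jpg', '.jpeg', '.png'}
    )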
    for img_path in images:
        filepaths.append(str(img_path))
        labels.append(class_name)

print(f"Total images collected for selected classes: {len(filepaths)}")

# Create a DataFrame for easier handling
df = pd.DataFrame({'filepath': filepaths, 'label': labels})
print("\nSample of collected data:")
print(df.head())
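
Because `random.sample` draws from the global random state, the selected subset can change if cells are re-run or executed out of order. One possible alternative, sketched here, is a dedicated `random.Random` instance so that the choice depends only on the seed and the class list. The helper name `select_classes` and the demo class names are illustrative and not part of the project.

```python
import random


def select_classes(all_names, k, seed):
    """Return a sorted, reproducible subset of k class names.

    A private RNG is used so the result does not depend on global random state.
    If k is falsy or not smaller than the number of classes, all names are returned.
    """
    names = sorted(all_names)
    if not k or k >= len(names):
        return names
    rng = random.Random(seed)
    return sorted(rng.sample(names, k))


# Illustrative class names only
demo = ["Tomato___healthy", "Potato___Late_blight", "Apple___Apple_scab", "Corn___Common_rust"]
first = select_classes(demo, 2, seed=42)
second = select_classes(demo, 2, seed=42)
print(first == second)  # True: same seed and input give the same subset
```

In this notebook the call would presumably be `select_classes(all_class_names, config.NUM_CLASSES_TO_USE, config.RANDOM_SEED)`.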

### 2.2. Analyzing Data Distribution & Imbalance

We analyze the number of images per class to identify any **class imbalance**. If some diseases have very few images compared to others, our model might become biased. This analysis helps us decide if we need techniques like **Data Augmentation** or **Class Weighting** in the training phase.

In [None]:
# Calculate and plot class distribution
class_counts = df['label'].value_counts()
dist_df = pd.DataFrame({'Class': class_counts.index, 'Count': class_counts.values})
dist_df = dist_df.sort_values('Count', ascending=False).reset_index(drop=True)

print("\nClass distribution summary:")
print(dist_df)

plt.figure(figsize=(14, 7))
sns.barplot(data=dist_df, x='Count', y='Class', palette='viridis')
plt.title('Image Distribution Across Selected Disease Classes', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Number of Images', fontsize=12, fontweight='bold')
plt.ylabel('Disease Class', fontsize=12, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig(config.FIGURES_DIR / "eda_class_distribution.png")
plt.show()

min_images = dist_df['Count'].min()
max_images = dist_df['Count'].max()
total_images_in_subset = dist_df['Count'].sum()

print(f"\nTotal images in selected subset: {total_images_in_subset:,}")
print(f"Minimum images per class: {min_images}")
print(f"Maximum images per class: {max_images}")

if max_images / min_images > 2: # Simple heuristic for imbalance
    print("Observation: Significant class imbalance detected. Strategies like class weighting or oversampling may be needed in subsequent modeling phases.")
else:
    print("Observation: Classes appear relatively balanced in this subset.")

### 2.3. Data Integrity Check

To ensure the **FarmAI** training pipeline runs smoothly, we must identify and remove any corrupted or unreadable image files before feeding them into the neural network.

In [None]:
corrupted_files = []
for fpath in df['filepath']:
    try:
        # verify() checks the file structure without decoding the pixel data
        with Image.open(fpath) as img:
            img.verify()
    except Exception:
        corrupted_files.append(fpath)

if corrupted_files:
    print(f"\nWarning: {len(corrupted_files)} corrupted files found. Examples: {corrupted_files[:5]}")
    # For initial analysis, we'll exclude them. In production, we'd log and handle more gracefully.
    df = df[~df['filepath'].isin(corrupted_files)]
    print(f"Removed {len(corrupted_files)} corrupted files. Remaining images: {len(df)}")
else:
    print("\n✓ No corrupted image files detected in the selected subset.")

# Save the clean dataframe (with selected classes and no corrupted files)
# This DataFrame can be loaded by the next notebook.
df.to_csv(config.PROCESSED_DATA_DIR / "eda_cleaned_data.csv", index=False)
print(f"✓ Cleaned data information saved to: {config.PROCESSED_DATA_DIR / 'eda_cleaned_data.csv'}")
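
`Image.verify()` needs Pillow to parse each file. As a cheap complementary pre-filter (not a replacement for the check above), the first bytes of a file can be compared against the JPEG and PNG signatures. This is only a sketch: `has_image_signature` is a hypothetical helper, and it catches files whose header is wrong, not truncated or otherwise damaged image data.

```python
JPEG_MAGIC = b"\xff\xd8\xff"        # JPEG start-of-image marker plus next marker prefix
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"    # Fixed 8-byte PNG signature


def has_image_signature(path):
    """Return True if the file starts with a JPEG or PNG signature."""
    try:
        with open(path, "rb") as fh:
            header = fh.read(8)
    except OSError:
        # Unreadable or missing files are treated as invalid
        return False
    return header.startswith(JPEG_MAGIC) or header.startswith(PNG_MAGIC)
```

A file that fails this check (for example, an HTML error page saved with a `.jpg` extension) could be added to `corrupted_files` before the slower Pillow pass.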


## 3. EXPLORATORY DATA ANALYSIS (EDA)


This section provides visual insights into the image data itself,
helping us understand image characteristics and potential challenges.


### 3.1. Visualize Sample Images

Displaying a few sample images helps us understand the visual patterns of each disease.

In [None]:


# Visualize sample images from selected classes
num_sample_images_display = min(5, num_selected_classes) # Display from up to 5 classes
sample_classes_for_viz = random.sample(selected_class_names, num_sample_images_display)
samples_per_class_viz = 2

plt.figure(figsize=(15, 3 * num_sample_images_display))
plt.suptitle('Sample Images from Selected Disease Classes (EDA Visuals)', 
             fontsize=16, fontweight='bold', y=1.02)

for i, class_name in enumerate(sample_classes_for_viz):
    # Get filepaths for current class
    class_filepaths = df[df['label'] == class_name]['filepath'].tolist()
    
    # Select random samples
    display_samples = random.sample(class_filepaths, min(samples_per_class_viz, len(class_filepaths)))
    
    for j, img_path_str in enumerate(display_samples):
        img = load_img(img_path_str, target_size=config.IMG_SIZE)
        
        ax = plt.subplot(num_sample_images_display, samples_per_class_viz, i * samples_per_class_viz + j + 1)
        ax.imshow(img)
        ax.axis('off')
        if j == 0:
            ax.set_title(f"{class_name.replace('___', ' ')}", fontsize=10, fontweight='bold', loc='left')

plt.tight_layout()
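# bbox_inches='tight' keeps the suptitle (placed at y=1.02) inside the saved image
plt.savefig(config.FIGURES_DIR / "eda_sample_images.png", bbox_inches='tight')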
plt.show()

### 3.2. Strategic Observations for FarmAI Model Development

Based on our visual inspection and data analysis, we have established the following roadmap for the FarmAI model:

1.  **Image Quality & Consistency:** The dataset contains high-quality, centered leaf images. However, real-world farm photos vary in lighting and angle. To make **FarmAI** robust for farmers, we will implement heavy **Data Augmentation** (rotation, zoom, brightness adjustments) in the next notebook.

2.  **Visual Complexity:** Certain diseases (e.g., Early Blight vs. Late Blight) exhibit subtle visual differences. This suggests that a simple CNN might not be enough. We will likely employ **Transfer Learning** using architectures like **EfficientNet** or **MobileNet** to capture these fine-grained features accurately.

3.  **Addressing Imbalance:** We observed variations in the image counts across classes. To prevent the model from being biased towards the majority class, we will utilize **Class Weights** during the training process to ensure equitable detection performance across all disease types.
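
For the class-weighting plan in item 3, one common heuristic is the "balanced" weight `N / (K * n_c)`, where `N` is the total number of images, `K` the number of classes, and `n_c` the image count for class `c`. The sketch below assumes classes are indexed in sorted-name order, which may not match the encoding used in the training notebook. The function name and the toy labels are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Map each class index to N / (K * n_c).

    N is the total sample count, K the number of classes, and n_c the count
    for class c. Classes are indexed in sorted-name order (an assumption).
    """
    counts = Counter(labels)
    class_names = sorted(counts)
    total = sum(counts.values())
    num_classes = len(class_names)
    return {
        idx: total / (num_classes * counts[name])
        for idx, name in enumerate(class_names)
    }


# Toy labels for illustration only (not real dataset counts)
toy_labels = ["healthy"] * 6 + ["early_blight"] * 3 + ["late_blight"] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

Rare classes receive weights above 1 and frequent classes below 1. With this notebook's DataFrame the call would be `balanced_class_weights(df['label'])`, and the resulting dictionary has the shape Keras expects for the `class_weight` argument of `model.fit`.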

In [None]:
print("\n--- DATA INSPECTION & EDA COMPLETE ---")
print("Outputs generated:")
print(f"  - Class Distribution Plot: {config.FIGURES_DIR / 'eda_class_distribution.png'}")
print(f"  - Sample Images Plot: {config.FIGURES_DIR / 'eda_sample_images.png'}")
print(f"  - Cleaned Data CSV: {config.PROCESSED_DATA_DIR / 'eda_cleaned_data.csv'}")
print("Ready for Preprocessing and Baseline ML.")