# 📊 Task 1: Dataset Download and Exploration

## 🎯 Objective
Download the Waste Classification dataset from Kaggle and perform comprehensive Exploratory Data Analysis (EDA).

---

## 📚 Theory: Why Data Exploration Matters (ML Rule #2, #17)

### Martin Zinkevich's Rules Applied:
- **Rule #2**: First, design and implement metrics
- **Rule #17**: Start with directly observed and reported features as opposed to learned features

### What is EDA?
Exploratory Data Analysis is the process of:
1. Understanding data distribution
2. Identifying patterns and anomalies
3. Checking data quality
4. Forming hypotheses for modeling

### Mathematical Concepts in EDA:

**1. Class Distribution (Probability)**
```
P(class_i) = count(class_i) / total_samples
```

**2. Image Statistics**
- Mean: μ = (1/n) × Σx_i
- Std Dev: σ = √[(1/n) × Σ(x_i - μ)²]
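
As a quick check of the two formulas above, the following sketch computes class probabilities and the population mean and standard deviation with the standard library. The class counts and pixel values are made-up illustrations, not measurements from the dataset.

```python
import statistics

# Hypothetical class counts (not taken from the real dataset)
class_counts = {"O": 60, "R": 40}
total_samples = sum(class_counts.values())

# P(class_i) = count(class_i) / total_samples
class_probs = {name: count / total_samples for name, count in class_counts.items()}

# Hypothetical pixel intensities in [0, 255]
pixels = [10, 20, 30, 40]

# Mean: (1/n) * sum(x_i)
mu = statistics.fmean(pixels)

# Population standard deviation: divides by n, matching the formula above
sigma = statistics.pstdev(pixels)

print(class_probs)
print(f"mean={mu:.2f}, std={sigma:.2f}")
```

`np.std` also divides by n by default (`ddof=0`), so the NumPy calls used later in this notebook follow the same population formula.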

---

## Step 1: Import Libraries and Setup

In [None]:
# Core Libraries
import numpy as np
import pandas as pd
import os
import shutil
from pathlib import Path

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import cv2

# For progress bars
from tqdm.notebook import tqdm

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print("✅ Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## Step 2: Define Project Paths

In [None]:
# Project root directory
PROJECT_ROOT = Path(r"D:\het\SELF\RP\YOLO-V11-PRO")

# Data directories
DATA_DIR = PROJECT_ROOT / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"

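# Output directory for the plots saved in later steps
ASSETS_DIR = PROJECT_ROOT / "docs" / "assets"

# Create directories if they don't exist
RAW_DATA_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DATA_DIR.mkdir(parents=True, exist_ok=True)
ASSETS_DIR.mkdir(parents=True, exist_ok=True)  # plt.savefig() fails if this folder is missing

print("📁 Project Structure:")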
print(f"   PROJECT_ROOT: {PROJECT_ROOT}")
print(f"   RAW_DATA_DIR: {RAW_DATA_DIR}")
print(f"   PROCESSED_DATA_DIR: {PROCESSED_DATA_DIR}")

## Step 3: Download Dataset from Kaggle

### 📝 Prerequisites:
1. Create a Kaggle account at https://www.kaggle.com
2. Go to Account → API → Create New API Token
3. This downloads `kaggle.json`
4. Place it in `~/.kaggle/` (Linux/Mac) or `C:\Users\<username>\.kaggle\` (Windows)
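
Before running the download cell, it can help to confirm that the token file is well-formed. The sketch below is a hypothetical helper (`validate_kaggle_token` is not part of the Kaggle package); it only checks that the file is a JSON object with non-empty `username` and `key` fields, and does not contact Kaggle.

```python
import json
from pathlib import Path


def validate_kaggle_token(token_path):
    """Return a list of problems found with a kaggle.json file (empty list = looks fine)."""
    token_path = Path(token_path)
    if not token_path.exists():
        return [f"file not found: {token_path}"]

    try:
        data = json.loads(token_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    if not isinstance(data, dict):
        return ["expected a JSON object"]

    # The token file is a JSON object with "username" and "key" fields
    return [f"missing or empty field: {field}"
            for field in ("username", "key") if not data.get(field)]


# Typical call, using the location from the prerequisite list above
problems = validate_kaggle_token(Path.home() / ".kaggle" / "kaggle.json")
print(problems or "kaggle.json looks structurally valid")
```

A structurally valid file can still hold an expired or revoked key, so a failed download remains possible after this check passes.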

### Alternative: Manual Download
If the API doesn't work, download manually from:
https://www.kaggle.com/datasets/techsash/waste-classification-data
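
For the manual route, the downloaded archive still has to be unpacked into the raw data folder. A minimal sketch using only the standard library is shown here; the helper name `extract_dataset_archive` and the file name `archive.zip` are assumptions for illustration, so adjust them to whatever the browser actually saved.

```python
import zipfile
from pathlib import Path


def extract_dataset_archive(zip_path, target_dir):
    """Extract a manually downloaded archive into target_dir and return the top-level entries."""
    zip_path, target_dir = Path(zip_path), Path(target_dir)
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"Not a valid zip archive: {zip_path}")

    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)

    return sorted(p.name for p in target_dir.iterdir())


# Example call; "archive.zip" is a hypothetical name for the browser download:
# extract_dataset_archive(Path.home() / "Downloads" / "archive.zip", RAW_DATA_DIR)
```

Returning the top-level entries makes it easy to see whether the images landed directly in the target folder or inside a wrapper directory.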

In [None]:
# Check if Kaggle API is configured

def check_kaggle_setup():
    """Check if Kaggle API is properly setup"""
    kaggle_dir = Path.home() / ".kaggle"
    kaggle_json = kaggle_dir / "kaggle.json"
    
    if kaggle_json.exists():
        print("✅ Kaggle API token found!")
        return True
    else:
        print("❌ Kaggle API token not found!")
        print(f"   Please place kaggle.json in: {kaggle_dir}")
        print("\n📥 Alternative: Download manually from:")
        print("   https://www.kaggle.com/datasets/techsash/waste-classification-data")
        print(f"   Extract to: {RAW_DATA_DIR}")
        return False

kaggle_ready = check_kaggle_setup()

In [None]:
# Download dataset using Kaggle API (run if API is configured)
if kaggle_ready:
    try:
        import kaggle
        
        print("📥 Downloading dataset from Kaggle...")
        kaggle.api.dataset_download_files(
            'techsash/waste-classification-data',
            path=str(RAW_DATA_DIR),
            unzip=True
        )
        print("✅ Dataset downloaded and extracted successfully!")
    except Exception as e:
        print(f"❌ Error downloading: {e}")
        print("\n📥 Please download manually from:")
        print("   https://www.kaggle.com/datasets/techsash/waste-classification-data")
else:
    print("⏳ Skipping automatic download. Please download manually.")

## Step 4: Explore Dataset Structure

### 📐 Mathematical Concept: File System as Tree Structure
```
Dataset Structure:
├── TRAIN/
│   ├── O/ (Organic)     → Class 0
│   └── R/ (Recyclable)  → Class 1
└── TEST/
    ├── O/
    └── R/
```
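
Because the label is encoded in the folder name, a path alone is enough to recover both the split and the class id. The sketch below illustrates that mapping; `CLASS_TO_ID` and `label_from_path` are hypothetical names, and the example file name is invented.

```python
from pathlib import Path

# Folder name -> integer class id, following the tree above
CLASS_TO_ID = {"O": 0, "R": 1}


def label_from_path(image_path):
    """Return (split, class_id) for a path shaped like .../TRAIN/O/name.jpg."""
    image_path = Path(image_path)
    class_folder = image_path.parent.name.upper()
    split = image_path.parent.parent.name.upper()
    if class_folder not in CLASS_TO_ID:
        raise ValueError(f"Unknown class folder: {class_folder}")
    return split, CLASS_TO_ID[class_folder]


# The file name below is hypothetical
print(label_from_path("DATASET/TRAIN/R/example_001.jpg"))
```

Upper-casing the folder names keeps the lookup working if the archive extracts as `train/o` instead of `TRAIN/O`.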

In [None]:
def explore_directory_structure(path, indent=0):
    """Recursively explore directory structure"""
    path = Path(path)
    if not path.exists():
        print(f"❌ Path does not exist: {path}")
        return
    
    for item in sorted(path.iterdir()):
        prefix = "│   " * indent + "├── "
        if item.is_dir():
            # Count regular files only (rglob("*.*") would also match dotted directory names)
            file_count = sum(1 for f in item.rglob("*") if f.is_file())
            print(f"{prefix}📁 {item.name}/ ({file_count} files)")
            if indent < 2:  # Limit depth
                explore_directory_structure(item, indent + 1)
        else:
            print(f"{prefix}📄 {item.name}")

print("\n📂 Dataset Directory Structure:")
print("=" * 50)
explore_directory_structure(RAW_DATA_DIR)

In [None]:
# Define dataset paths (adjust based on actual structure)
# The dataset might be in a subdirectory after extraction

# Try to find the DATASET folder
possible_paths = [
    RAW_DATA_DIR / "DATASET",
    RAW_DATA_DIR / "dataset",
    RAW_DATA_DIR,
]

DATASET_DIR = None
for p in possible_paths:
    if (p / "TRAIN").exists() or (p / "train").exists():
        DATASET_DIR = p
        break

if DATASET_DIR:
    print(f"✅ Dataset found at: {DATASET_DIR}")
    
    # Define train and test directories
    TRAIN_DIR = DATASET_DIR / "TRAIN" if (DATASET_DIR / "TRAIN").exists() else DATASET_DIR / "train"
    TEST_DIR = DATASET_DIR / "TEST" if (DATASET_DIR / "TEST").exists() else DATASET_DIR / "test"
    
    print(f"   TRAIN_DIR: {TRAIN_DIR}")
    print(f"   TEST_DIR: {TEST_DIR}")
else:
    print("❌ Dataset not found! Please check the extraction.")
    print(f"   Expected location: {RAW_DATA_DIR}")

## Step 5: Count Images and Analyze Class Distribution

### 📐 Mathematical Foundation: Class Balance

**Class Imbalance Ratio:**
```
Imbalance Ratio = max(class_count) / min(class_count)
```

**Why it matters:**
- Ratio ≈ 1: Balanced dataset ✅
- Ratio > 2: Moderately imbalanced ⚠️
- Ratio > 10: Severely imbalanced ❌
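
The ratio and the rule-of-thumb bands above can be sketched as two small functions. The counts in the example are hypothetical, and treating everything up to a ratio of 2 as "roughly balanced" is an assumption, since the list does not say how to label the range between 1 and 2.

```python
def imbalance_ratio(class_counts):
    """max(class_count) / min(class_count); infinite if a class is empty or missing."""
    counts = list(class_counts.values())
    if not counts or min(counts) == 0:
        return float("inf")
    return max(counts) / min(counts)


def describe_imbalance(ratio):
    """Map a ratio onto the rule-of-thumb bands listed above."""
    if ratio > 10:
        return "severely imbalanced"
    if ratio > 2:
        return "moderately imbalanced"
    return "roughly balanced"  # assumption: covers the unlabelled 1-2 range


# Hypothetical counts, not the real dataset statistics
example_counts = {"O": 600, "R": 400}
ratio = imbalance_ratio(example_counts)
print(f"ratio={ratio:.2f} -> {describe_imbalance(ratio)}")
```

These cutoffs are heuristics rather than hard limits; how much imbalance a model tolerates also depends on dataset size and the evaluation metric.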

**Solutions for imbalance:**
1. Oversampling minority class
2. Undersampling majority class
3. Class weights during training
4. Data augmentation
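
Of the four options, class weights are the cheapest to try because they need no new data. One common heuristic, the same formula scikit-learn uses for `class_weight='balanced'`, is w_i = N / (K × n_i), where N is the total sample count, K the number of classes, and n_i the count of class i. The sketch below applies it to hypothetical counts; `balanced_class_weights` is an illustrative name.

```python
def balanced_class_weights(class_counts):
    """w_i = N / (K * n_i): rarer classes get proportionally larger weights."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: total / (n_classes * count) for name, count in class_counts.items()}


# Hypothetical counts with a 3:1 imbalance
weights = balanced_class_weights({"O": 300, "R": 100})
print(weights)
```

With these weights every class contributes the same total weight (here 300 × 2/3 = 100 × 2 = 200). The function raises `ZeroDivisionError` for an empty class, which is usually a sign that the dataset needs fixing first.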

In [None]:
def count_images_in_directory(directory):
    """
    Count images in a directory by class.
    
    Mathematical representation:
    count(class_i) = |{f ∈ files : f.parent = class_i}|
    """
    directory = Path(directory)
    class_counts = {}
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.webp'}
    
    if not directory.exists():
        print(f"⚠️ Directory not found: {directory}")
        return class_counts
    
    for class_folder in directory.iterdir():
        if class_folder.is_dir():
            count = sum(1 for f in class_folder.iterdir() 
                       if f.suffix.lower() in image_extensions)
            class_counts[class_folder.name] = count
    
    return class_counts

# Count images
if DATASET_DIR:
    train_counts = count_images_in_directory(TRAIN_DIR)
    test_counts = count_images_in_directory(TEST_DIR)
    
    print("\n📊 Dataset Statistics:")
    print("=" * 50)
    print("\n🏋️ Training Set:")
    for class_name, count in train_counts.items():
        class_label = "Organic" if class_name.upper() == "O" else "Recyclable"
        print(f"   {class_name} ({class_label}): {count:,} images")
    print(f"   Total: {sum(train_counts.values()):,} images")
    
    print("\n🧪 Test Set:")
    for class_name, count in test_counts.items():
        class_label = "Organic" if class_name.upper() == "O" else "Recyclable"
        print(f"   {class_name} ({class_label}): {count:,} images")
    print(f"   Total: {sum(test_counts.values()):,} images")

In [None]:
# Calculate class balance metrics using NumPy (from scratch!)
def calculate_class_metrics(class_counts):
    """
    Calculate class distribution metrics using NumPy.
    
    Mathematical formulas:
    - Probability: P(class_i) = n_i / N
    - Entropy: H = -Σ P(i) * log2(P(i))
    - Imbalance Ratio: max(counts) / min(counts)
    """
    counts = np.array(list(class_counts.values()))
    total = np.sum(counts)
    
    # Calculate probabilities
    probabilities = counts / total
    
    # Calculate entropy (measure of balance)
    # H = -Σ P(i) * log2(P(i))
    # For 2 classes, max entropy = 1.0 (perfectly balanced)
    entropy = -np.sum(probabilities * np.log2(probabilities + 1e-10))
    max_entropy = np.log2(len(counts))  # Maximum possible entropy
    # A single class gives max_entropy = 0, so avoid dividing by zero
    normalized_entropy = entropy / max_entropy if max_entropy > 0 else 0.0
    
    # Imbalance ratio (infinite when a class folder is empty)
    imbalance_ratio = np.max(counts) / np.min(counts) if np.min(counts) > 0 else float('inf')
    
    return {
        'probabilities': probabilities,
        'entropy': entropy,
        'normalized_entropy': normalized_entropy,
        'imbalance_ratio': imbalance_ratio
    }

if DATASET_DIR and train_counts:
    metrics = calculate_class_metrics(train_counts)
    
    print("\n📈 Class Balance Metrics (Training Set):")
    print("=" * 50)
    
    for i, (class_name, prob) in enumerate(zip(train_counts.keys(), metrics['probabilities'])):
        class_label = "Organic" if class_name.upper() == "O" else "Recyclable"
        print(f"   P({class_label}) = {prob:.4f} ({prob*100:.2f}%)")
    
    print(f"\n   Shannon Entropy: {metrics['entropy']:.4f}")
    print(f"   Normalized Entropy: {metrics['normalized_entropy']:.4f} (1.0 = perfectly balanced)")
    print(f"   Imbalance Ratio: {metrics['imbalance_ratio']:.2f}")
    
    if metrics['imbalance_ratio'] < 1.5:
        print("\n   ✅ Dataset is well-balanced!")
    elif metrics['imbalance_ratio'] < 3:
        print("\n   ⚠️ Dataset is slightly imbalanced. Consider using class weights.")
    else:
        print("\n   ❌ Dataset is significantly imbalanced. Use augmentation/sampling techniques.")

## Step 6: Visualize Class Distribution

### 📊 Visualization Theory
Visualizations help us:
1. Quickly identify patterns
2. Communicate findings effectively
3. Detect anomalies

In [None]:
def plot_class_distribution(train_counts, test_counts):
    """Create comprehensive class distribution visualization"""
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    fig.suptitle('🗑️ Waste Classification Dataset Analysis', fontsize=16, fontweight='bold')
    
    # Color palette
    colors = {'O': '#2ecc71', 'R': '#3498db'}  # Green for Organic, Blue for Recyclable
    labels = {'O': 'Organic', 'R': 'Recyclable'}
    
    # 1. Training Set Bar Chart
    ax1 = axes[0, 0]
    classes = list(train_counts.keys())
    counts = list(train_counts.values())
    bars = ax1.bar([labels[c] for c in classes], counts, 
                   color=[colors[c] for c in classes], edgecolor='black', linewidth=1.5)
    ax1.set_title('Training Set Distribution', fontsize=12, fontweight='bold')
    ax1.set_ylabel('Number of Images')
    
    # Add value labels on bars
    for bar, count in zip(bars, counts):
        ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 100,
                f'{count:,}', ha='center', va='bottom', fontweight='bold')
    
    # 2. Test Set Bar Chart
    ax2 = axes[0, 1]
    # Reuse the training class order so the grouped bars in panel 4 line up per class
    test_classes = classes
    test_count_values = [test_counts.get(c, 0) for c in classes]
    bars2 = ax2.bar([labels[c] for c in test_classes], test_count_values,
                    color=[colors[c] for c in test_classes], edgecolor='black', linewidth=1.5)
    ax2.set_title('Test Set Distribution', fontsize=12, fontweight='bold')
    ax2.set_ylabel('Number of Images')
    
    for bar, count in zip(bars2, test_count_values):
        ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 20,
                f'{count:,}', ha='center', va='bottom', fontweight='bold')
    
    # 3. Pie Chart for Training Set
    ax3 = axes[1, 0]
    explode = (0.05, 0.05)
    wedges, texts, autotexts = ax3.pie(counts, explode=explode,
                                        labels=[labels[c] for c in classes],
                                        colors=[colors[c] for c in classes],
                                        autopct='%1.1f%%',
                                        shadow=True, startangle=90)
    ax3.set_title('Training Set Proportion', fontsize=12, fontweight='bold')
    
    # 4. Combined Train/Test Comparison
    ax4 = axes[1, 1]
    x = np.arange(len(classes))
    width = 0.35
    
    bars3 = ax4.bar(x - width/2, counts, width, label='Train', 
                    color='#3498db', edgecolor='black')
    bars4 = ax4.bar(x + width/2, test_count_values, width, label='Test',
                    color='#e74c3c', edgecolor='black')
    
    ax4.set_title('Train vs Test Distribution', fontsize=12, fontweight='bold')
    ax4.set_ylabel('Number of Images')
    ax4.set_xticks(x)
    ax4.set_xticklabels([labels[c] for c in classes])
    ax4.legend()
    
    plt.tight_layout()
    plt.savefig(PROJECT_ROOT / 'docs' / 'assets' / 'class_distribution.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("\n✅ Plot saved to: docs/assets/class_distribution.png")

if DATASET_DIR and train_counts and test_counts:
    plot_class_distribution(train_counts, test_counts)

## Step 7: Display Sample Images

### 📐 Image Representation Theory

**Digital Image as Matrix:**
```
RGB Image: I ∈ ℝ^(H × W × 3)
- H: Height (rows)
- W: Width (columns)  
- 3: Color channels (R, G, B)

Pixel value range: [0, 255] for 8-bit images
```

**Color Channels:**
- Red channel: I[:, :, 0]
- Green channel: I[:, :, 1]
- Blue channel: I[:, :, 2]
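
The indexing above can be tried on a small synthetic array before touching real files. This sketch builds a 4 × 6 image by hand (the sizes and fill values are arbitrary choices for illustration) and slices out the three channels.

```python
import numpy as np

# Synthetic RGB image with H=4 rows and W=6 columns, stored as 8-bit integers
H, W = 4, 6
image = np.zeros((H, W, 3), dtype=np.uint8)
image[:, :, 0] = 255   # fill the whole red channel
image[:2, :, 1] = 128  # fill the top half of the green channel

# Each channel is a 2-D (H, W) slice of the 3-D array
red, green, blue = image[:, :, 0], image[:, :, 1], image[:, :, 2]

print(image.shape)              # (rows, columns, channels)
print(red.shape, image.dtype)
print(int(image.min()), int(image.max()))
```

Note the ordering: the array shape is (height, width, channels), whereas PIL's `img.size` reports (width, height). Mixing the two up is a frequent source of transposed dimensions.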

In [None]:
def get_sample_images(directory, n_samples=5):
    """
    Get sample images from each class.
    
    Returns: Dictionary with class names as keys and list of image paths as values
    """
    directory = Path(directory)
    samples = {}
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.webp'}
    
    for class_folder in directory.iterdir():
        if class_folder.is_dir():
            # Sort first: iterdir() order is OS-dependent, so seeding alone is not reproducible
            images = sorted(f for f in class_folder.iterdir()
                            if f.suffix.lower() in image_extensions)
            # Random sample
            np.random.seed(42)  # For reproducibility
            if len(images) >= n_samples:
                indices = np.random.choice(len(images), n_samples, replace=False)
                samples[class_folder.name] = [images[i] for i in indices]
            else:
                samples[class_folder.name] = images
    
    return samples

def display_sample_images(samples, title="Sample Images"):
    """Display sample images from each class in a grid"""
    
    n_classes = len(samples)
    n_samples = max(len(imgs) for imgs in samples.values())
    
    # squeeze=False keeps axes 2-D even with a single class or a single sample
    fig, axes = plt.subplots(n_classes, n_samples, figsize=(3*n_samples, 3*n_classes), squeeze=False)
    fig.suptitle(f'🖼️ {title}', fontsize=16, fontweight='bold')
    
    labels = {'O': 'Organic ♻️', 'R': 'Recyclable 🔄'}
    
    for i, (class_name, images) in enumerate(samples.items()):
        for j in range(n_samples):
            ax = axes[i, j]
            
            if j < len(images):
                img = Image.open(images[j])
                ax.imshow(img)
                if j == 0:
                    ax.set_ylabel(labels.get(class_name, class_name), fontsize=12, fontweight='bold')
                ax.set_title(f'{img.size[0]}x{img.size[1]}', fontsize=9)
            
            # Hide ticks and frame but keep the row label (axis('off') would hide it too)
            ax.set_xticks([])
            ax.set_yticks([])
            for spine in ax.spines.values():
                spine.set_visible(False)
    
    plt.tight_layout()
    plt.savefig(PROJECT_ROOT / 'docs' / 'assets' / 'sample_images.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("\n✅ Sample images saved to: docs/assets/sample_images.png")

if DATASET_DIR:
    train_samples = get_sample_images(TRAIN_DIR, n_samples=5)
    display_sample_images(train_samples, title="Training Set Sample Images")

## Step 8: Analyze Image Properties

### 📐 Statistical Analysis of Images

**Key Metrics:**
1. **Image Dimensions**: Height × Width
2. **Aspect Ratio**: Width / Height
3. **File Size**: In bytes/KB
4. **Color Statistics**: Mean, Std per channel
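
The first three metrics need only the width, height, and byte size of a file. The sketch below derives them for one hypothetical image; `describe_image_geometry`, the 5% squareness tolerance, and the example numbers are all assumptions made for illustration.

```python
def describe_image_geometry(width, height, file_size_bytes, tolerance=0.05):
    """Summarise dimensions, aspect ratio (W/H), orientation, and file size in KB."""
    aspect_ratio = width / height
    if abs(aspect_ratio - 1.0) <= tolerance:
        orientation = "square"
    elif aspect_ratio > 1.0:
        orientation = "landscape"
    else:
        orientation = "portrait"
    return {
        "aspect_ratio": round(aspect_ratio, 3),
        "orientation": orientation,
        "file_size_kb": round(file_size_bytes / 1024, 1),
        "megapixels": round(width * height / 1_000_000, 3),
    }


# Hypothetical 640x480 image that takes 20,480 bytes on disk
result = describe_image_geometry(width=640, height=480, file_size_bytes=20_480)
print(result)
```

A dataset dominated by one orientation can usually be resized with little distortion, while a wide spread of aspect ratios is an argument for padding instead of stretching.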

In [None]:
def analyze_image_properties(directory, max_samples=500):
    """
    Analyze properties of images in a directory.
    
    Returns DataFrame with image properties.
    """
    directory = Path(directory)
    image_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.webp'}
    
    data = []
    
    for class_folder in directory.iterdir():
        if class_folder.is_dir():
            images = [f for f in class_folder.iterdir() 
                     if f.suffix.lower() in image_extensions]
            
            # Sample if too many images
            if len(images) > max_samples // 2:
                np.random.seed(42)
                indices = np.random.choice(len(images), max_samples // 2, replace=False)
                images = [images[i] for i in indices]
            
            for img_path in tqdm(images, desc=f"Analyzing {class_folder.name}"):
                try:
                    img = Image.open(img_path)
                    img_array = np.array(img)
                    
                    # Basic properties
                    width, height = img.size
                    aspect_ratio = width / height
                    file_size = img_path.stat().st_size / 1024  # KB
                    
                    # Color statistics (if RGB)
                    if len(img_array.shape) == 3 and img_array.shape[2] >= 3:
                        mean_r = np.mean(img_array[:, :, 0])
                        mean_g = np.mean(img_array[:, :, 1])
                        mean_b = np.mean(img_array[:, :, 2])
                        std_r = np.std(img_array[:, :, 0])
                        std_g = np.std(img_array[:, :, 1])
                        std_b = np.std(img_array[:, :, 2])
                    else:
                        mean_r = mean_g = mean_b = np.mean(img_array)
                        std_r = std_g = std_b = np.std(img_array)
                    
                    data.append({
                        'class': class_folder.name,
                        'filename': img_path.name,
                        'width': width,
                        'height': height,
                        'aspect_ratio': aspect_ratio,
                        'file_size_kb': file_size,
                        'mean_r': mean_r,
                        'mean_g': mean_g,
                        'mean_b': mean_b,
                        'std_r': std_r,
                        'std_g': std_g,
                        'std_b': std_b,
                        'brightness': (mean_r + mean_g + mean_b) / 3
                    })
                except Exception as e:
                    print(f"Error processing {img_path}: {e}")
    
    return pd.DataFrame(data)

if DATASET_DIR:
    print("\n📊 Analyzing image properties (this may take a minute)...")
    image_df = analyze_image_properties(TRAIN_DIR, max_samples=500)
    print(f"\n✅ Analyzed {len(image_df)} images")

In [None]:
# Display summary statistics
if DATASET_DIR and len(image_df) > 0:
    print("\n📊 Image Property Statistics:")
    print("=" * 60)
    
    # Group by class
    class_stats = image_df.groupby('class').agg({
        'width': ['mean', 'min', 'max', 'std'],
        'height': ['mean', 'min', 'max', 'std'],
        'aspect_ratio': ['mean', 'std'],
        'file_size_kb': ['mean', 'min', 'max'],
        'brightness': ['mean', 'std']
    }).round(2)
    
    print(class_stats)
    
    # Overall statistics
    print("\n📈 Overall Statistics:")
    print(f"   Average Width: {image_df['width'].mean():.0f} px")
    print(f"   Average Height: {image_df['height'].mean():.0f} px")
    print(f"   Width Range: [{image_df['width'].min()}, {image_df['width'].max()}] px")
    print(f"   Height Range: [{image_df['height'].min()}, {image_df['height'].max()}] px")
    print(f"   Average File Size: {image_df['file_size_kb'].mean():.1f} KB")

In [None]:
def plot_image_statistics(df):
    """Create comprehensive visualization of image properties"""
    
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    fig.suptitle('📊 Image Property Analysis', fontsize=16, fontweight='bold')
    
    colors = {'O': '#2ecc71', 'R': '#3498db'}
    labels = {'O': 'Organic', 'R': 'Recyclable'}
    
    # 1. Image Width Distribution
    ax1 = axes[0, 0]
    for class_name in df['class'].unique():
        class_df = df[df['class'] == class_name]
        ax1.hist(class_df['width'], bins=30, alpha=0.6, 
                label=labels.get(class_name, class_name), color=colors.get(class_name, 'gray'))
    ax1.set_title('Image Width Distribution')
    ax1.set_xlabel('Width (pixels)')
    ax1.set_ylabel('Frequency')
    ax1.legend()
    
    # 2. Image Height Distribution
    ax2 = axes[0, 1]
    for class_name in df['class'].unique():
        class_df = df[df['class'] == class_name]
        ax2.hist(class_df['height'], bins=30, alpha=0.6,
                label=labels.get(class_name, class_name), color=colors.get(class_name, 'gray'))
    ax2.set_title('Image Height Distribution')
    ax2.set_xlabel('Height (pixels)')
    ax2.set_ylabel('Frequency')
    ax2.legend()
    
    # 3. Aspect Ratio Distribution
    ax3 = axes[0, 2]
    for class_name in df['class'].unique():
        class_df = df[df['class'] == class_name]
        ax3.hist(class_df['aspect_ratio'], bins=30, alpha=0.6,
                label=labels.get(class_name, class_name), color=colors.get(class_name, 'gray'))
    ax3.axvline(x=1.0, color='red', linestyle='--', label='Square (1:1)')
    ax3.set_title('Aspect Ratio Distribution')
    ax3.set_xlabel('Aspect Ratio (W/H)')
    ax3.set_ylabel('Frequency')
    ax3.legend()
    
    # 4. Width vs Height Scatter
    ax4 = axes[1, 0]
    for class_name in df['class'].unique():
        class_df = df[df['class'] == class_name]
        ax4.scatter(class_df['width'], class_df['height'], alpha=0.5, 
                   label=labels.get(class_name, class_name), color=colors.get(class_name, 'gray'), s=20)
    ax4.set_title('Width vs Height')
    ax4.set_xlabel('Width (pixels)')
    ax4.set_ylabel('Height (pixels)')
    ax4.legend()
    
    # 5. Brightness Distribution by Class
    ax5 = axes[1, 1]
    brightness_data = [df[df['class'] == c]['brightness'].values for c in df['class'].unique()]
    bp = ax5.boxplot(brightness_data, labels=[labels.get(c, c) for c in df['class'].unique()],
                     patch_artist=True)
    for patch, class_name in zip(bp['boxes'], df['class'].unique()):
        patch.set_facecolor(colors.get(class_name, 'gray'))
    ax5.set_title('Brightness Distribution by Class')
    ax5.set_ylabel('Mean Brightness')
    
    # 6. File Size Distribution
    ax6 = axes[1, 2]
    for class_name in df['class'].unique():
        class_df = df[df['class'] == class_name]
        ax6.hist(class_df['file_size_kb'], bins=30, alpha=0.6,
                label=labels.get(class_name, class_name), color=colors.get(class_name, 'gray'))
    ax6.set_title('File Size Distribution')
    ax6.set_xlabel('File Size (KB)')
    ax6.set_ylabel('Frequency')
    ax6.legend()
    
    plt.tight_layout()
    plt.savefig(PROJECT_ROOT / 'docs' / 'assets' / 'image_statistics.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("\n✅ Statistics plot saved to: docs/assets/image_statistics.png")

if DATASET_DIR and len(image_df) > 0:
    plot_image_statistics(image_df)

## Step 9: Color Channel Analysis

### 📐 RGB Color Space Mathematics

**Color Model:**
```
RGB Color = (R, G, B) where R, G, B ∈ [0, 255]

Total possible colors = 256³ = 16,777,216
```

**Channel Statistics:**
- Mean intensity per channel reveals color dominance
- Standard deviation indicates color variance

In [None]:
def plot_color_analysis(df):
    """Analyze and visualize color distribution by class"""
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    fig.suptitle('🎨 Color Channel Analysis by Class', fontsize=14, fontweight='bold')
    
    labels = {'O': 'Organic', 'R': 'Recyclable'}
    
    # 1. Mean RGB values by class
    ax1 = axes[0]
    classes = df['class'].unique()
    x = np.arange(len(classes))
    width = 0.25
    
    means_r = [df[df['class'] == c]['mean_r'].mean() for c in classes]
    means_g = [df[df['class'] == c]['mean_g'].mean() for c in classes]
    means_b = [df[df['class'] == c]['mean_b'].mean() for c in classes]
    
    ax1.bar(x - width, means_r, width, label='Red', color='#e74c3c')
    ax1.bar(x, means_g, width, label='Green', color='#2ecc71')
    ax1.bar(x + width, means_b, width, label='Blue', color='#3498db')
    
    ax1.set_title('Mean Color Channel Values by Class')
    ax1.set_ylabel('Mean Intensity (0-255)')
    ax1.set_xticks(x)
    ax1.set_xticklabels([labels.get(c, c) for c in classes])
    ax1.legend()
    ax1.set_ylim(0, 255)
    
    # 2. Color space scatter (R vs G with B as hue)
    ax2 = axes[1]
    scatter = ax2.scatter(df['mean_r'], df['mean_g'], c=df['mean_b'], 
                         cmap='viridis', alpha=0.6, s=30)
    plt.colorbar(scatter, ax=ax2, label='Blue Channel Mean')
    ax2.set_title('Color Space Distribution (R vs G, colored by B)')
    ax2.set_xlabel('Red Channel Mean')
    ax2.set_ylabel('Green Channel Mean')
    
    plt.tight_layout()
    plt.savefig(PROJECT_ROOT / 'docs' / 'assets' / 'color_analysis.png', dpi=150, bbox_inches='tight')
    plt.show()
    print("\n✅ Color analysis saved to: docs/assets/color_analysis.png")

if DATASET_DIR and len(image_df) > 0:
    plot_color_analysis(image_df)

## Step 10: Save Analysis Results

### 💾 Data Persistence (ML Rule #11: Documentation)

In [None]:
# Save analysis results
if DATASET_DIR and len(image_df) > 0:
    # Save DataFrame to CSV
    analysis_path = PROJECT_ROOT / 'data' / 'image_analysis.csv'
    image_df.to_csv(analysis_path, index=False)
    print(f"✅ Image analysis saved to: {analysis_path}")
    
    # Create summary report
    summary = {
        'dataset_name': 'Waste Classification',
        'total_train_images': sum(train_counts.values()),
        'total_test_images': sum(test_counts.values()),
        'classes': list(train_counts.keys()),
        'class_counts_train': train_counts,
        'class_counts_test': test_counts,
        'avg_width': image_df['width'].mean(),
        'avg_height': image_df['height'].mean(),
        'avg_file_size_kb': image_df['file_size_kb'].mean(),
        'analyzed_samples': len(image_df)
    }
    
    # Save as JSON
    import json
    summary_path = PROJECT_ROOT / 'data' / 'dataset_summary.json'
    with open(summary_path, 'w') as f:
        json.dump(summary, f, indent=2, default=str)
    print(f"✅ Dataset summary saved to: {summary_path}")

## 📝 Summary & Key Findings

### What We Learned:
1. **Dataset Size**: ~22,564 training + ~2,513 test images
2. **Classes**: 2 (Organic and Recyclable)
3. **Class Balance**: See the imbalance ratio and normalized entropy computed in Step 5
4. **Image Dimensions**: Variable sizes (may need resizing for YOLO)
5. **Color Patterns**: Different color profiles for each class

### Next Steps:
- **Task 2**: Data preprocessing and augmentation
- Resize images to consistent dimensions
- Apply data augmentation techniques
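
One common way to bring variable-sized images to a consistent shape without distortion is letterboxing: scale the image to fit inside a square and pad the remainder. The sketch below only computes the geometry and does not touch any pixels. The helper name `letterbox_geometry` and the 640-pixel target are assumptions; Task 2 may choose a different size or a plain stretch resize.

```python
def letterbox_geometry(width, height, target=640):
    """Scale to fit inside target x target while keeping the aspect ratio; pad the rest."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_w, pad_h = target - new_w, target - new_h
    return {
        "scale": scale,
        "resized": (new_w, new_h),
        # Padding as (left, top, right, bottom); any odd pixel goes to the right/bottom
        "padding": (pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2),
    }


# Hypothetical 800x600 input image
geometry = letterbox_geometry(800, 600)
print(geometry)
```

For the 800 × 600 example the image is scaled by 0.8 to 640 × 480 and padded with 80 pixels above and below.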

---

## 📚 Learning Resources

### Theory:
- [Understanding Data Exploration](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)
- [Image Processing Basics](https://homepages.inf.ed.ac.uk/rbf/HIPR2/wksheets.htm)

### Videos:
- [StatQuest: Histograms](https://www.youtube.com/watch?v=qBigTkBLU6g)
- [3Blue1Brown: Linear Algebra (for image matrix concepts)](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab)

### Code Reference:
See `docs/CODE-THEORY.md` Section 1.1-1.2 for mathematical foundations.

In [None]:
print("\n" + "="*60)
print("✅ TASK 1 COMPLETE: Dataset Download and Exploration")
print("="*60)
print("\n📋 What was accomplished:")
print("   ✓ Dataset downloaded/verified from Kaggle")
print("   ✓ Directory structure explored")
print("   ✓ Class distribution analyzed")
print("   ✓ Sample images visualized")
print("   ✓ Image properties analyzed (dimensions, colors)")
print("   ✓ Analysis results saved")
print("\n📁 Generated files:")
print("   - docs/assets/class_distribution.png")
print("   - docs/assets/sample_images.png")
print("   - docs/assets/image_statistics.png")
print("   - docs/assets/color_analysis.png")
print("   - data/image_analysis.csv")
print("   - data/dataset_summary.json")
print("\n➡️ Ready for Task 2: Data Preprocessing and Augmentation")