# Fruit Ripeness Dataset: Comprehensive Analysis

**Author:** Maria Paula Salazar Agudelo  
**Context:** Minor in AI & Society - Personal Challenge  
**Portfolio:** Part 1 - Dataset Understanding

---

## Introduction

Before building any machine learning model, we must **understand our data**. This notebook performs a thorough analysis of the fruit ripeness dataset.

### Dataset Overview:

- **Source:** Fruit Ripeness Dataset (Kaggle)
- **Fruits:** Apples, Bananas, Oranges
- **Ripeness stages:** Fresh, Rotten, Unripe
- **Total classes:** 9 (3 fruits × 3 stages)
- **Purpose:** Train a model to classify fruit ripeness from images

### What I will analyze:

1. **Dataset Structure** - How files are organized
2. **Class Distribution** - How many images per category
3. **Image Quality** - Resolution, format, clarity
4. **Visual Inspection** - Sample images from each class
5. **Data Imbalance** - Are some classes underrepresented?
6. **Statistical Analysis** - Image size distribution, color analysis
7. **Quality Issues** - Detect problems (corrupted files, wrong labels)
8. **Train/Test Split** - Verify proper data separation

### Why this matters:

Understanding the dataset helps me:
- ✅ Choose the right model architecture
- ✅ Identify data quality problems early
- ✅ Handle class imbalance during training
- ✅ Set realistic performance expectations
- ✅ Decide on data augmentation strategies

---

## IBM AI Methodology - Steps 4 & 5

This notebook covers:

### Step 4: Data Collection
**What I did:** Downloaded fruit ripeness dataset from Kaggle containing ~20,000 images

### Step 5: Data Understanding
**What I did:** Analyzed the dataset to understand:
- How many images per class
- Image quality and sizes
- Class distribution and imbalance
- Train/test split ratios

**Why this matters:** Understanding the data helps me choose the right model architecture and training strategy.

_For complete IBM methodology overview, see: 00_AI_Methodology_Overview.ipynb_

---

## Step 1: Import Libraries and Setup

### What is going to happen:
Import all necessary Python libraries for data analysis and visualization.

### Why these libraries:
- **os, pathlib:** Navigate folders and files
- **numpy:** Mathematical calculations and statistics
- **pandas:** Organize data in tables (like Excel)
- **matplotlib, seaborn:** Create visualizations and graphs
- **PIL (Pillow):** Load and analyze images
- **opencv (cv2):** Advanced image processing

In [None]:
# Import libraries
import os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import cv2
import random
from collections import Counter
import warnings

# Ignore warnings for cleaner output
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")
print("Ready to analyze the dataset.")

### What happened:
✅ All libraries loaded successfully  
✅ Visualization settings configured  
✅ Ready to start analysis  

---

## Step 2: Define Dataset Path

### What is going to happen:
Set the path to the dataset and verify it exists.

### Dataset structure:
```
dataset/
├── train/
│   ├── freshapples/
│   ├── freshbanana/
│   ├── freshoranges/
│   ├── rottenapples/
│   ├── rottenbanana/
│   ├── rottenoranges/
│   ├── unripe apple/
│   ├── unripe banana/
│   └── unripe orange/
└── test/
    └── (same 9 folders)
```

In [None]:
# Define dataset path (adjust this to your data location)
DATA_ROOT = Path(r"C:\Users\maria\Desktop\fruit_ripeness\data\fruit_ripeness_dataset\fruit_ripeness_dataset\fruit_archive\dataset")
TRAIN_DIR = DATA_ROOT / "train"
TEST_DIR = DATA_ROOT / "test"

print("Dataset Paths:")
print(f"  Root: {DATA_ROOT}")
print(f"  Train: {TRAIN_DIR}")
print(f"  Test: {TEST_DIR}")
print()

# Verify paths exist
print("Verification:")
print(f"  Root exists: {DATA_ROOT.exists()}")
print(f"  Train exists: {TRAIN_DIR.exists()}")
print(f"  Test exists: {TEST_DIR.exists()}")

if not DATA_ROOT.exists():
    print("\n⚠️ ERROR: Dataset path not found!")
    print("Please update DATA_ROOT to point to your dataset location.")
else:
    print("\n✅ All paths verified!")

### What happened:
✅ Dataset paths defined  
✅ Existence verified  

**Important:** If you see "ERROR: Dataset path not found", you need to update the `DATA_ROOT` variable to match your computer's folder structure.
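
One way to avoid editing the notebook on every machine is to read the dataset location from an environment variable and fall back to a default path. The sketch below is only an illustration: the variable name `FRUIT_DATA_ROOT` and the helpers `resolve_data_root` and `missing_split_dirs` are hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_data_root(default, env_var="FRUIT_DATA_ROOT"):
    """Return the dataset root from an environment variable, else the default.

    FRUIT_DATA_ROOT is a hypothetical variable name chosen for this sketch.
    """
    value = os.environ.get(env_var)
    return Path(value).expanduser() if value else Path(default)


def missing_split_dirs(root, splits=("train", "test")):
    """List the expected split folders that do not exist under `root`."""
    return [name for name in splits if not (Path(root) / name).is_dir()]
```

With these helpers, `DATA_ROOT = resolve_data_root(r"C:\path\to\dataset")` would pick up the environment variable when it is set, and an empty result from `missing_split_dirs(DATA_ROOT)` would mean both `train/` and `test/` were found.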

---

## Step 3: Discover Classes (Categories)

### What is going to happen:
Scan the dataset folders to identify all fruit categories.

### How it works:
- Look inside `train/` folder
- Each subfolder name = one class
- Should find 9 classes total

In [None]:
# Get all class folders
train_classes = sorted([d.name for d in TRAIN_DIR.iterdir() if d.is_dir()])
test_classes = sorted([d.name for d in TEST_DIR.iterdir() if d.is_dir()])

print("Classes found in TRAIN folder:")
for i, cls in enumerate(train_classes, 1):
    print(f"  {i}. {cls}")

print(f"\nTotal classes: {len(train_classes)}")

# Verify train and test have same classes
if set(train_classes) == set(test_classes):
    print("✅ Train and test folders have the same classes")
else:
    print("⚠️ WARNING: Train and test have different classes!")
    print(f"  Only in train: {set(train_classes) - set(test_classes)}")
    print(f"  Only in test: {set(test_classes) - set(train_classes)}")

### What happened:
✅ Discovered 9 fruit categories  
✅ Verified train and test have matching classes  

**Expected output:** 9 classes covering:
- **Fresh:** apples, banana, oranges
- **Rotten:** apples, banana, oranges
- **Unripe:** apple, banana, orange
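
The folder names are not spelled consistently (`freshapples` has no space and a plural, `unripe apple` has a space and a singular), so it can help to split each name into a ripeness stage and a fruit before grouping or plotting. The following is a minimal sketch under that assumption; `parse_class_name`, `STAGES`, and `FRUITS` are hypothetical names introduced here.

```python
STAGES = ("fresh", "rotten", "unripe")
FRUITS = ("apple", "banana", "orange")


def parse_class_name(folder_name):
    """Split a folder name such as 'freshapples' or 'unripe apple' into (stage, fruit).

    Raises ValueError for names that do not match the expected pattern.
    """
    name = folder_name.strip().lower()
    for stage in STAGES:
        if name.startswith(stage):
            rest = name[len(stage):].strip()
            for fruit in FRUITS:
                # Accept singular and plural spellings ('apple', 'apples').
                if rest in (fruit, fruit + "s"):
                    return stage, fruit
    raise ValueError(f"Unrecognised class folder: {folder_name!r}")
```

Raising on an unknown name, rather than guessing, makes a stray or misspelled folder visible as soon as the classes are listed.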

---

## Step 4: Count Images per Class

### What is going to happen:
Count how many images exist in each category for both train and test sets.

### Why this matters:
- Identify **class imbalance** (some classes having way more images than others)
- Understand dataset size
- Plan data augmentation strategy

In [None]:
# Function to count images in a folder
def count_images(directory):
    """Count images in each class folder"""
    counts = {}
    for class_folder in directory.iterdir():
        if class_folder.is_dir():
            # Count files with image extensions
            image_files = list(class_folder.glob('*.jpg')) + \
                         list(class_folder.glob('*.jpeg')) + \
                         list(class_folder.glob('*.png'))
            counts[class_folder.name] = len(image_files)
    return counts

# Count images
train_counts = count_images(TRAIN_DIR)
test_counts = count_images(TEST_DIR)

# Display results
print("="*70)
print("IMAGE COUNT PER CLASS")
print("="*70)
print(f"{'Class':<25} {'Train':>12} {'Test':>12} {'Total':>12}")
print("-"*70)

total_train = 0
total_test = 0

for cls in sorted(train_counts.keys()):
    train_num = train_counts.get(cls, 0)
    test_num = test_counts.get(cls, 0)
    total = train_num + test_num
    
    total_train += train_num
    total_test += test_num
    
    print(f"{cls:<25} {train_num:>12,} {test_num:>12,} {total:>12,}")

print("-"*70)
print(f"{'TOTAL':<25} {total_train:>12,} {total_test:>12,} {total_train+total_test:>12,}")
print("="*70)

### What happened:
✅ Counted all images in train and test sets  
✅ Displayed organized table  

**How to read the table:**
- **Train:** Images used to teach the model
- **Test:** Images used to evaluate the model (it never sees these during training)
- **Total:** Combined count

**Look for:**
- Are some classes much smaller than others? → Class imbalance
- Is the split roughly 80/20 train/test? → Good practice
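
Glob patterns such as `*.jpg` only match lowercase extensions on case-sensitive file systems, so files named `photo.JPG` could be left out of the counts on Linux or macOS. As a hedged alternative, the sketch below compares lowercased suffixes instead; `list_images`, `count_images_by_class`, and `IMAGE_SUFFIXES` are hypothetical names, not functions from the notebook.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(folder):
    """Return image files directly inside `folder`, matching extensions case-insensitively."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def count_images_by_class(split_dir):
    """Map each class folder name under `split_dir` to its image count."""
    return {
        d.name: len(list_images(d))
        for d in sorted(Path(split_dir).iterdir())
        if d.is_dir()
    }
```

If the two approaches give different totals for the same folder, the difference points to files with uppercase or mixed-case extensions.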

---

## Step 5: Visualize Class Distribution

### What is going to happen:
Create visual charts to see class distribution patterns.

### Why visualize:
- Easier to spot imbalance than reading numbers
- See proportions at a glance
- Identify potential problems

In [None]:
# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Training set distribution
classes = sorted(train_counts.keys())
train_values = [train_counts[c] for c in classes]

axes[0].barh(classes, train_values, color='steelblue')
axes[0].set_xlabel('Number of Images', fontsize=12)
axes[0].set_title('Training Set Distribution', fontsize=14, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# Add value labels
for i, v in enumerate(train_values):
    axes[0].text(v + 50, i, f'{v:,}', va='center', fontsize=10)

# Test set distribution
test_values = [test_counts[c] for c in classes]

axes[1].barh(classes, test_values, color='coral')
axes[1].set_xlabel('Number of Images', fontsize=12)
axes[1].set_title('Test Set Distribution', fontsize=14, fontweight='bold')
axes[1].grid(axis='x', alpha=0.3)

# Add value labels
for i, v in enumerate(test_values):
    axes[1].text(v + 10, i, f'{v:,}', va='center', fontsize=10)

plt.tight_layout()
plt.show()

print("Visualization created!")

### What happened:
✅ Created horizontal bar charts for train and test sets  
✅ Added exact numbers on each bar  

**How to interpret:**
- **Long bars:** Classes with many images
- **Short bars:** Classes with few images (potential problem)
- **Similar lengths:** Balanced dataset (ideal)
- **Very different lengths:** Imbalanced dataset (need to address)

**What to look for:**
- Are rotten fruits more common than others?
- Are unripe fruits underrepresented?
- Is any fruit type (apple/banana/orange) significantly different?

---

## Step 6: Statistical Analysis of Distribution

### What is going to happen:
Calculate statistical measures to quantify the imbalance.

### Metrics explained:
- **Mean:** Average number of images per class
- **Median:** Middle value when sorted
- **Std Dev:** How much variation exists
- **Min/Max:** Smallest and largest classes
- **Imbalance Ratio:** Max / Min (1.0 = perfect balance, >2.0 = significant imbalance)

In [None]:
# Calculate statistics
train_values = list(train_counts.values())

stats = {
    'Mean': np.mean(train_values),
    'Median': np.median(train_values),
    'Std Dev': np.std(train_values),
    'Min': np.min(train_values),
    'Max': np.max(train_values),
    'Range': np.max(train_values) - np.min(train_values),
    'Imbalance Ratio': np.max(train_values) / np.min(train_values)
}

print("="*60)
print("DISTRIBUTION STATISTICS (Training Set)")
print("="*60)
for key, value in stats.items():
    if key == 'Imbalance Ratio':
        print(f"{key:20s}: {value:.2f}x")
    else:
        print(f"{key:20s}: {value:,.1f}")

print("\n" + "="*60)
print("INTERPRETATION")
print("="*60)

if stats['Imbalance Ratio'] < 1.5:
    print("✅ Dataset is WELL BALANCED")
    print("   Classes have similar numbers of images.")
elif stats['Imbalance Ratio'] < 2.5:
    print("⚠️  Dataset is MODERATELY IMBALANCED")
    print("   Some classes have noticeably more images.")
    print("   → Solution: Use class weights during training")
else:
    print("❌ Dataset is HIGHLY IMBALANCED")
    print("   Large difference between biggest and smallest classes.")
    print("   → Solutions: Use class weights + data augmentation + oversampling")

print("="*60)

### What happened:
✅ Calculated statistical measures of distribution  
✅ Computed imbalance ratio  
✅ Provided interpretation and recommendations  

**Understanding Imbalance Ratio** (rough rules of thumb; the code above uses 1.5 and 2.5 as its cut-offs):
- **1.0:** Perfect balance (all classes equal)
- **1.5:** Slight imbalance (acceptable)
- **2.0:** Moderate imbalance (need to address)
- **3.0+:** High imbalance (serious problem)

**Why imbalance matters:**
- Model may learn to prefer the majority class
- Minority classes might get ignored
- Lower accuracy on underrepresented fruits
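
A common way to counter this is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic, weight = total images / (number of classes × images in that class), which is one reasonable choice rather than the method this project necessarily uses. The function names are hypothetical, and both functions reject empty classes to avoid dividing by zero.

```python
def imbalance_ratio(counts):
    """Return largest class size divided by smallest class size."""
    values = list(counts.values())
    if not values or min(values) <= 0:
        raise ValueError("every class needs at least one image")
    return max(values) / min(values)


def balanced_class_weights(counts):
    """Return per-class weights: total / (n_classes * class_count).

    Rare classes get weights above 1, common classes get weights below 1.
    """
    if not counts or min(counts.values()) <= 0:
        raise ValueError("every class needs at least one image")
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}
```

With this formula the weighted class sizes all become equal (each is total / number of classes), so every class contributes the same amount to the loss overall.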

---

## Step 7: Sample Image Visualization

### What is going to happen:
Display one random image from each class to visually inspect the dataset.

### Why this matters:
- Verify images match their labels
- Check image quality and clarity
- Understand visual differences between classes
- Spot potential labeling errors

In [None]:
# Create 3x3 grid for 9 classes
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
axes = axes.ravel()

for idx, class_name in enumerate(sorted(train_classes)):
    class_dir = TRAIN_DIR / class_name
    
    # Get random image from this class
    images = list(class_dir.glob('*.jpg')) + \
             list(class_dir.glob('*.jpeg')) + \
             list(class_dir.glob('*.png'))
    
    # cv2.imread returns None instead of raising when a file cannot be read
    img = cv2.imread(str(random.choice(images))) if images else None
    
    if img is not None:
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        
        axes[idx].imshow(img)
        axes[idx].set_title(f"{class_name}\n{img.shape[1]}x{img.shape[0]} px", 
                           fontsize=11, fontweight='bold')
        axes[idx].axis('off')
    else:
        axes[idx].text(0.5, 0.5, 'No readable image', ha='center', va='center')
        axes[idx].set_title(class_name, fontsize=11)
        axes[idx].axis('off')

plt.suptitle('Sample Images from Each Class', fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print("Sample visualization complete!")

### What happened:
✅ Displayed one image from each of 9 classes  
✅ Showed image resolution  

**Visual inspection checklist:**
- ✅ **Fresh fruits:** Should look bright, intact, no dark spots
- ✅ **Rotten fruits:** Should show decay, dark patches, soft spots
- ✅ **Unripe fruits:** Should appear green, less developed
- ❌ **Wrong labels:** If a "fresh" apple looks rotten → data quality issue
- ❌ **Poor quality:** Blurry, too dark, wrong fruit → remove from dataset

---

## Step 8: Image Quality Inspection

### What is going to happen:
Analyze technical properties of images:
- Image sizes (width × height)
- File formats (JPG, PNG)
- Color modes (RGB, grayscale)
- File sizes

### Why this matters:
- Ensure all images can be loaded
- Check for consistency
- Detect corrupted files
- Plan preprocessing strategy

In [None]:
def analyze_images(directory, sample_size=100):
    """Analyze image properties from random sample"""
    
    formats = []
    sizes = []
    modes = []
    widths = []
    heights = []
    
    # Get all image paths
    all_images = []
    for class_folder in directory.iterdir():
        if class_folder.is_dir():
            all_images.extend(list(class_folder.glob('*.jpg')))
            all_images.extend(list(class_folder.glob('*.jpeg')))
            all_images.extend(list(class_folder.glob('*.png')))
    
    # Sample random images
    sample = random.sample(all_images, min(sample_size, len(all_images)))
    
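    for img_path in sample:
        try:
            # The context manager closes the file handle after reading metadata
            with Image.open(img_path) as img:
                formats.append(img.format)
                modes.append(img.mode)
                widths.append(img.size[0])
                heights.append(img.size[1])
            sizes.append(os.path.getsize(img_path))
        except Exception as exc:
            # Report unreadable files instead of silently skipping them
            print(f"  Skipped {img_path.name}: {exc}")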
    
    return {
        'formats': Counter(formats),
        'modes': Counter(modes),
        'widths': widths,
        'heights': heights,
        'sizes': sizes
    }

# Analyze training set
print("Analyzing training images (sample of 100)...\n")
analysis = analyze_images(TRAIN_DIR, sample_size=100)

print("="*60)
print("IMAGE QUALITY ANALYSIS")
print("="*60)

print("\nFile Formats:")
for fmt, count in analysis['formats'].items():
    print(f"  {fmt}: {count} images ({count/sum(analysis['formats'].values())*100:.1f}%)")

print("\nColor Modes:")
for mode, count in analysis['modes'].items():
    print(f"  {mode}: {count} images ({count/sum(analysis['modes'].values())*100:.1f}%)")

print("\nImage Dimensions:")
print(f"  Width  - Min: {min(analysis['widths'])}, Max: {max(analysis['widths'])}, Avg: {np.mean(analysis['widths']):.0f} px")
print(f"  Height - Min: {min(analysis['heights'])}, Max: {max(analysis['heights'])}, Avg: {np.mean(analysis['heights']):.0f} px")

print("\nFile Sizes:")
sizes_kb = [s/1024 for s in analysis['sizes']]
print(f"  Min: {min(sizes_kb):.1f} KB")
print(f"  Max: {max(sizes_kb):.1f} KB")
print(f"  Avg: {np.mean(sizes_kb):.1f} KB")

print("\n" + "="*60)

### What happened:
✅ Analyzed 100 random images  
✅ Examined formats, modes, dimensions, file sizes  
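
Opening an image with Pillow only reads its header, so a truncated or damaged file can still pass the metadata check. A more thorough scan could look like the sketch below, which is an assumption about how such a check might be written and not code from this notebook. It calls `Image.verify()` for a structural check, then reopens the file and calls `load()` to force a full decode. `find_unreadable_images` is a hypothetical helper name.

```python
from pathlib import Path

from PIL import Image


def find_unreadable_images(paths):
    """Return (path, error message) pairs for files Pillow cannot fully decode."""
    problems = []
    for path in paths:
        try:
            with Image.open(path) as img:
                img.verify()  # structural check; the image must be reopened afterwards
            with Image.open(path) as img:
                img.load()  # force a full decode to catch truncated pixel data
        except Exception as exc:  # Pillow raises several exception types here
            problems.append((Path(path), f"{type(exc).__name__}: {exc}"))
    return problems
```

Running this over every file is slower than sampling 100 images, but an empty result gives stronger evidence that all images can be loaded during training.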

**Understanding the results:**

**File Formats:**
- **JPEG:** Compressed format, smaller files, some quality loss
- **PNG:** Lossless format, larger files, better quality
- **Mixed formats:** Normal, model will handle both

**Color Modes:**
- **RGB:** Standard 3-channel color (what we want)
- **RGBA:** RGB + transparency (need to convert to RGB)
- **L (Grayscale):** Single channel (need to convert to RGB)

**Dimensions:**
- **Varied sizes:** Normal, we'll resize all to 224×224 for the model
- **Very small (<100px):** Might be poor quality
- **Very large (>1000px):** Will be downscaled
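
The conversions described above can be sketched with Pillow as follows. This is a minimal illustration, not the project's actual preprocessing: `to_model_input` is a hypothetical helper, and flattening transparent pixels onto a white background is an assumption made for this example (a plain `convert("RGB")` would simply drop the alpha channel).

```python
from PIL import Image


def to_model_input(img, size=(224, 224)):
    """Convert any Pillow image to RGB and resize it to `size` (width, height)."""
    if img.mode == "RGBA":
        # Assumption: flatten transparency onto white rather than dropping alpha.
        background = Image.new("RGB", img.size, (255, 255, 255))
        background.paste(img, mask=img.getchannel("A"))
        img = background
    elif img.mode != "RGB":
        # Covers grayscale ('L'), palette ('P'), and other modes.
        img = img.convert("RGB")
    # Uses Pillow's default resampling filter; aspect ratio is not preserved.
    return img.resize(size)
```

Because the resize ignores aspect ratio, very elongated images will look stretched. Padding to a square first is an alternative if that turns out to matter.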

---

## Step 9: Train/Test Split Analysis

### What is going to happen:
Verify that the train/test split is appropriate for each class.

### Best practices:
- **80/20 split:** 80% training, 20% testing (common)
- **70/30 split:** Also acceptable
- **Consistent across classes:** Each class should have similar split ratio

In [None]:
# Calculate split ratios
print("="*70)
print("TRAIN/TEST SPLIT ANALYSIS")
print("="*70)
print(f"{'Class':<25} {'Train':>10} {'Test':>10} {'Train %':>12}")
print("-"*70)

for cls in sorted(train_classes):
    # .get avoids a KeyError if a class folder is missing from one split
    train_num = train_counts.get(cls, 0)
    test_num = test_counts.get(cls, 0)
    total = train_num + test_num
    train_pct = (train_num / total * 100) if total > 0 else 0
    
    print(f"{cls:<25} {train_num:>10,} {test_num:>10,} {train_pct:>11.1f}%")

# Overall split
overall_train_pct = (total_train / (total_train + total_test) * 100)
print("-"*70)
print(f"{'OVERALL':<25} {total_train:>10,} {total_test:>10,} {overall_train_pct:>11.1f}%")
print("="*70)

print("\nEVALUATION:")
if 75 <= overall_train_pct <= 85:
    print("✅ Train/test split is APPROPRIATE")
    print("   Ratio is in the recommended 75-85% range for training.")
elif 65 <= overall_train_pct < 75:
    print("⚠️  Train/test split is ACCEPTABLE but on lower end")
    print("   More training data would be better.")
else:
    print("❌ Train/test split may not be optimal")
    print("   Consider adjusting the split ratio.")

### What happened:
✅ Calculated train/test ratio for each class  
✅ Evaluated if split is appropriate  

**Why the split matters:**
- **Too much in train (>90%):** Not enough data to properly test the model
- **Too much in test (>40%):** Wasting data that could help training
- **Inconsistent splits:** Some classes might be undertested

**Ideal scenario:**
- All classes around 80% train, 20% test
- No class below 70% train
- Sufficient test samples (at least 100 per class)
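
The per-class conditions in this list can be checked automatically. The sketch below assumes two dictionaries shaped like `train_counts` and `test_counts` and uses the thresholds from the list (at least 70% in train, at least 100 test images) as defaults; `check_split` is a hypothetical helper.

```python
def check_split(train_counts, test_counts, min_train_pct=70.0, min_test_images=100):
    """Return human-readable warnings for classes that miss the split targets."""
    found = []
    for cls in sorted(set(train_counts) | set(test_counts)):
        n_train = train_counts.get(cls, 0)
        n_test = test_counts.get(cls, 0)
        total = n_train + n_test
        if total == 0:
            found.append(f"{cls}: no images in either split")
            continue
        train_pct = 100.0 * n_train / total
        if train_pct < min_train_pct:
            found.append(f"{cls}: only {train_pct:.1f}% of images are in train")
        if n_test < min_test_images:
            found.append(f"{cls}: only {n_test} test images")
    return found
```

An empty list means every class meets both targets. Any entries name the class and the condition it misses, which is easier to act on than the overall percentage alone.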

---

## Step 10: Summary and Recommendations

### What is going to happen:
Summarize all findings and provide recommendations for model training.

### Dataset Summary:

In [None]:
print("="*70)
print("DATASET ANALYSIS SUMMARY")
print("="*70)

print("\nüìä DATASET SIZE:")
print(f"  Total images: {total_train + total_test:,}")
print(f"  Training: {total_train:,}")
print(f"  Testing: {total_test:,}")
print(f"  Classes: {len(train_classes)}")

print("\nüìà CLASS DISTRIBUTION:")
if stats['Imbalance Ratio'] < 2.0:
    print("  Status: ‚úÖ Well balanced")
else:
    print("  Status: ‚ö†Ô∏è  Imbalanced")
print(f"  Imbalance ratio: {stats['Imbalance Ratio']:.2f}x")
print(f"  Largest class: {max(train_counts, key=train_counts.get)} ({max(train_counts.values()):,} images)")
print(f"  Smallest class: {min(train_counts, key=train_counts.get)} ({min(train_counts.values()):,} images)")

print("\nüñºÔ∏è  IMAGE QUALITY:")
print(f"  Average size: {np.mean(analysis['widths']):.0f}√ó{np.mean(analysis['heights']):.0f} pixels")
print(f"  Formats: {', '.join([f'{k}' for k in analysis['formats'].keys()])}")
print(f"  Color modes: {', '.join([f'{k}' for k in analysis['modes'].keys()])}")

print("\n" + "="*70)
print("RECOMMENDATIONS FOR TRAINING")
print("="*70)

recommendations = []

# Recommendation 1: Model architecture
recommendations.append(
    "1. MODEL ARCHITECTURE:\n"
    "   → Use transfer learning (MobileNetV2 or EfficientNet)\n"
    "   → Resize all images to 224×224 pixels\n"
    "   → Use RGB color mode (convert RGBA/grayscale if found)"
)

# Recommendation 2: Handle imbalance
if stats['Imbalance Ratio'] >= 2.0:
    recommendations.append(
        "2. ADDRESS CLASS IMBALANCE:\n"
        "   → Use class weights during training\n"
        "   → Apply data augmentation more heavily to minority classes\n"
        "   → Consider oversampling small classes"
    )
else:
    recommendations.append(
        "2. DATA AUGMENTATION:\n"
        "   → Apply rotation (±20 degrees)\n"
        "   → Random horizontal flips\n"
        "   → Random zoom (±20%)\n"
        "   → Brightness adjustments"
    )

# Recommendation 3: Training strategy
recommendations.append(
    "3. TRAINING STRATEGY:\n"
    "   → Start with frozen base layers (transfer learning)\n"
    "   → Train for 15-20 epochs initially\n"
    "   → Use early stopping to prevent overfitting\n"
    "   → Monitor validation accuracy closely"
)

# Recommendation 4: Evaluation
recommendations.append(
    "4. EVALUATION METRICS:\n"
    "   → Track overall accuracy (target: ≥85%)\n"
    "   → Monitor per-class accuracy\n"
    "   → Create confusion matrix\n"
    "   → Check precision and recall for each class"
)

for rec in recommendations:
    print("\n" + rec)

print("\n" + "="*70)
print("✅ Dataset analysis complete!")
print("Ready to proceed with model training.")
print("="*70)

### What happened:
✅ Summarized all key findings  
✅ Identified strengths and weaknesses  
✅ Provided actionable recommendations  

---

## Conclusions

### What I learned:

1. **Dataset is usable** - Sufficient images for training a good model
2. **Some imbalance exists** - Need to address with class weights
3. **Images vary in size** - Preprocessing required (resize to 224×224)
4. **Quality is good** - Clear images with visible ripeness differences
5. **Train/test split is appropriate** - Good separation for evaluation

### Next steps:

1. **Build model** using transfer learning (see Notebook 02)
2. **Apply data augmentation** to training images
3. **Use class weights** to handle imbalance
4. **Train for 20 epochs** with early stopping
5. **Evaluate thoroughly** with confusion matrix (see Notebook 03)

### Key takeaways:

Understanding the dataset BEFORE training saves time and improves results. This analysis revealed:
- Where to focus data augmentation
- What preprocessing is needed
- How to handle class imbalance
- Realistic performance expectations

**I am ready to train the model!**

---

**Author:** Maria Paula Salazar Agudelo  
**Date:** 2025  
**Course:** Minor in AI & Society  