## Table of Contents

1. [Overview](#overview)
2. [Dataset Structure](#dataset)
3. [YOLO Label Parsing](#yolo)
4. [Image Extraction](#extraction)
5. [Feature Extraction: Color Histograms](#features)
6. [Data Augmentation](#augmentation)
7. [Dataset Balancing](#balancing)
8. [Normalization (Z-Score Standardization)](#normalization)
9. [Saving Processed Data](#saving)
10. [Complete Pipeline](#pipeline)

---

<a id='overview'></a>
## 1. Overview

The preprocessing pipeline transforms raw YOLO-annotated images into machine-learning-ready feature vectors. The key steps are:

1. **Parse YOLO labels** to locate tomatoes in images
2. **Extract tomato regions** from images
3. **Extract color histogram features** (RGB + HSV)
4. **Augment minority class** with image transformations
5. **Balance the dataset** to have equal class representation
6. **Normalize features** using Z-score standardization
7. **Save processed data** in pickle and CSV formats

**Input**: YOLO annotated images with bounding boxes  
**Output**: Feature vectors (192 dimensions) with binary labels (0=Fresh, 1=Rotten)

---

<a id='dataset'></a>
## 2. Dataset Structure

The dataset follows the YOLO format with separate train/validation/test splits:

```
dataSet/
├── train/
│   ├── images/          # Training images (.jpg)
│   └── labels/          # YOLO annotation files (.txt)
├── val/
│   ├── images/          # Validation images
│   └── labels/          # Validation annotations
└── test/
    ├── images/          # Test images
    └── labels/          # Test annotations
```

### YOLO Label Format

Each `.txt` file contains bounding box annotations in YOLO format:

```
class_id center_x center_y width height
```

Where:
- `class_id`: Object class (2 = Fresh Tomato, 3 = Rotten Tomato)
- `center_x, center_y`: Normalized center coordinates (0-1)
- `width, height`: Normalized bounding box dimensions (0-1)

**Example**:
```
2 0.4523 0.6234 0.1234 0.0987
3 0.7821 0.3456 0.0876 0.1123
```

---

<a id='yolo'></a>
## 3. YOLO Label Parsing

### Purpose
Convert YOLO normalized coordinates to pixel coordinates to extract tomato regions from images.

### Process

```python
def parse_yolo_label(label_path, img_width, img_height):
    """Outline only: the steps performed, in order."""
    # 1. Read label file
    # 2. Parse each line: class_id, center_x, center_y, width, height
    # 3. Filter only tomato classes (class_id = 2 or 3)
    # 4. Convert normalized coordinates to pixel coordinates
    # 5. Ensure coordinates are within image bounds

### Coordinate Conversion

**From YOLO format (normalized 0-1) to pixel coordinates**:

$$\text{center\_x}_{\text{pixel}} = \text{center\_x}_{\text{norm}} \times \text{image\_width}$$

$$\text{center\_y}_{\text{pixel}} = \text{center\_y}_{\text{norm}} \times \text{image\_height}$$

$$\text{box\_width}_{\text{pixel}} = \text{width}_{\text{norm}} \times \text{image\_width}$$

$$\text{box\_height}_{\text{pixel}} = \text{height}_{\text{norm}} \times \text{image\_height}$$

**Corner coordinates**:

$$x_1 = \text{center\_x} - \frac{\text{box\_width}}{2}$$

$$y_1 = \text{center\_y} - \frac{\text{box\_height}}{2}$$

$$x_2 = \text{center\_x} + \frac{\text{box\_width}}{2}$$

$$y_2 = \text{center\_y} + \frac{\text{box\_height}}{2}$$

### Class Mapping

- **Class 2** (YOLO) → **0** (Binary Classification) = Fresh Tomato
- **Class 3** (YOLO) → **1** (Binary Classification) = Rotten Tomato
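
The parsing, conversion, and class-mapping steps above can be sketched as follows. This is a minimal sketch, not the pipeline's actual implementation: the return format (a list of `(binary_label, x1, y1, x2, y2)` tuples) and the truncation of pixel coordinates to integers are assumptions.

```python
def parse_yolo_label(label_path, img_width, img_height):
    """Return a list of (binary_label, x1, y1, x2, y2) tuples in pixels."""
    class_map = {2: 0, 3: 1}  # YOLO class -> binary label (Fresh=0, Rotten=1)
    boxes = []
    with open(label_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip blank or malformed lines
            class_id = int(parts[0])
            if class_id not in class_map:
                continue  # not a tomato class
            cx, cy, w, h = (float(v) for v in parts[1:])
            # Normalized center/size -> pixel corners (int truncation is an assumption)
            x1 = int((cx - w / 2) * img_width)
            y1 = int((cy - h / 2) * img_height)
            x2 = int((cx + w / 2) * img_width)
            y2 = int((cy + h / 2) * img_height)
            # Clamp the corners to the image bounds
            x1, y1 = max(0, x1), max(0, y1)
            x2, y2 = min(img_width, x2), min(img_height, y2)
            boxes.append((class_map[class_id], x1, y1, x2, y2))
    return boxes
```

For the example line `2 0.4523 0.6234 0.1234 0.0987` and a hypothetical 640×480 image, this sketch yields `(0, 249, 275, 328, 322)`.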

---

<a id='extraction'></a>
## 4. Image Extraction

### Purpose
Extract individual tomato regions from full images using parsed bounding boxes.

### Process

1. **Read image** using OpenCV
2. **Convert color space** from BGR (OpenCV default) to RGB
3. **Parse YOLO labels** to get bounding boxes
4. **Crop each tomato region** using bounding box coordinates
5. **Filter small regions** (< 10x10 pixels) to remove noise
6. **Resize to target size** (64x64 pixels) for consistency
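
Steps 4 to 6 can be sketched as below. The sketch assumes the image has already been read and converted to RGB, and that boxes arrive as `(label, x1, y1, x2, y2)` tuples (a hypothetical format). To stay NumPy-only it swaps in a nearest-neighbour index-sampling resize as a stand-in for `cv.resize`, and it treats a region as too small when either side is under 10 pixels, which is one reading of the "< 10x10" rule.

```python
import numpy as np

def extract_tomato_regions(image_rgb, boxes, min_size=10, target_size=(64, 64)):
    """Crop each box, drop crops under min_size, resize the rest to target_size."""
    target_h, target_w = target_size
    regions = []
    for label, x1, y1, x2, y2 in boxes:
        crop = image_rgb[y1:y2, x1:x2]  # rows are y, columns are x
        h, w = crop.shape[:2]
        if h < min_size or w < min_size:
            continue  # too small to carry useful color information
        # Nearest-neighbour resize via index sampling (stand-in for cv.resize)
        rows = np.arange(target_h) * h // target_h
        cols = np.arange(target_w) * w // target_w
        regions.append((label, crop[rows][:, cols]))
    return regions
```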

### Why Resize?

- **Consistency**: All tomatoes have the same dimensions
- **Feature extraction**: Every histogram is computed over the same number of pixels (64 × 64 = 4096)
- **Computational efficiency**: Smaller images process faster
- **Memory efficiency**: Reduces storage requirements

### Image Processing Pipeline

```
Original Image (variable size)
    ↓ [BGR to RGB conversion]
RGB Image
    ↓ [Bounding box cropping]
Tomato Region (variable size)
    ↓ [Resize to 64x64]
Fixed-size Tomato (64x64)
    ↓ [Feature extraction]
Feature Vector (192 dimensions)
```

---

<a id='features'></a>
## 5. Feature Extraction: Color Histograms

### Why Color Histograms?

Color is a strong cue for distinguishing fresh from rotten tomatoes:
- **Fresh tomatoes**: Bright red, vibrant colors, high saturation
- **Rotten tomatoes**: Dark brown/black, dull colors, low saturation

### Dual Color Space Approach

We extract histograms from **two color spaces** for complementary information:

#### 1. RGB Color Space
- **R (Red)**: Intensity of red channel
- **G (Green)**: Intensity of green channel
- **B (Blue)**: Intensity of blue channel
- **Use case**: Captures absolute color intensities

#### 2. HSV Color Space
- **H (Hue)**: Color type - Red vs Brown (OpenCV stores 8-bit hue as 0-179, i.e. degrees divided by 2)
- **S (Saturation)**: Color purity - Vibrant vs Dull
- **V (Value)**: Brightness - Light vs Dark
- **Use case**: Separates color type and purity from brightness, which RGB mixes into all three channels

### Histogram Computation

For each channel, we divide the intensity range into **bins** (default: 32 bins):

```python
# RGB histograms (0-255 range)
hist_r = cv.calcHist([image], [0], None, [bins], [0, 256])
hist_g = cv.calcHist([image], [1], None, [bins], [0, 256])
hist_b = cv.calcHist([image], [2], None, [bins], [0, 256])

# HSV histograms
# Hue: range [0, 180), because OpenCV stores 8-bit hue as 0-179
hist_h = cv.calcHist([hsv_image], [0], None, [bins], [0, 180])
# Saturation: 0-255 range
hist_s = cv.calcHist([hsv_image], [1], None, [bins], [0, 256])
# Value: 0-255 range
hist_v = cv.calcHist([hsv_image], [2], None, [bins], [0, 256])
```

### Histogram Normalization

Each histogram is normalized so values sum to 1 (probability distribution):

$$\text{hist}_{\text{norm}} = \frac{\text{hist}}{\sum \text{hist}}$$

**Why normalize?**
- Makes histograms scale-invariant
- Ensures consistency regardless of image size
- Values represent probabilities/proportions

### Final Feature Vector

All 6 histograms are concatenated into a single feature vector:

$$\text{features} = [R_0, R_1, ..., R_{31}, G_0, ..., G_{31}, B_0, ..., B_{31}, H_0, ..., H_{31}, S_0, ..., S_{31}, V_0, ..., V_{31}]$$

**Total dimensions**: 6 channels × 32 bins = **192 features**
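
The computation, normalization, and concatenation steps can be combined into one function. The following is a NumPy-only sketch that uses `np.histogram` as a stand-in for `cv.calcHist`; it assumes the HSV array was produced upstream (for example with `cv.cvtColor`) and holds hue in the 0-179 range. The function names are hypothetical.

```python
import numpy as np

def channel_histograms(img, upper_bounds, bins=32):
    """Concatenate one sum-to-1 histogram per channel of an HxWx3 array."""
    parts = []
    for ch, upper in enumerate(upper_bounds):
        hist, _ = np.histogram(img[:, :, ch], bins=bins, range=(0, upper))
        hist = hist.astype(np.float64)
        parts.append(hist / max(hist.sum(), 1.0))  # guard against an empty image
    return np.concatenate(parts)

def extract_features(rgb, hsv, bins=32):
    """RGB histograms followed by HSV histograms: 6 * bins values."""
    return np.concatenate([
        channel_histograms(rgb, (256, 256, 256), bins),
        channel_histograms(hsv, (180, 256, 256), bins),  # hue stored as 0-179
    ])
```

Because each of the six histograms sums to 1, the full vector sums to 6, which makes a quick sanity check on extracted features.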

### Visual Interpretation

**Fresh Tomato Features**:
- RGB: High red channel values, moderate green
- Hue: Concentrated around red, which wraps around both ends of OpenCV's hue range (near 0 and near 179)
- Saturation: High values (vibrant color)
- Value: Moderate to high (bright)

**Rotten Tomato Features**:
- RGB: More uniform across channels (brownish)
- Hue: Spread across multiple values (mixed colors)
- Saturation: Low values (dull color)
- Value: Lower values (darker)

---

<a id='augmentation'></a>
## 6. Data Augmentation

### Purpose
Create variations of existing images to:
- Increase dataset size
- Improve model generalization
- Balance class distribution
- Add robustness to lighting/orientation changes

### Augmentation Techniques

#### 1. Brightness Adjustment
**Method**: Increase V (Value) channel in HSV by 20%
```python
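# Scale V by 1.2 in float, then clip so bright pixels saturate instead of wrapping
hsv[:, :, 2] = np.clip(hsv[:, :, 2].astype(np.float32) * 1.2, 0, 255)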
```
**Effect**: Simulates different lighting conditions

#### 2. Contrast Enhancement
**Method**: Scale pixel values by factor α=1.3
```python
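# alpha = 1.3 scales contrast; beta is a brightness offset
new_pixel = np.clip(alpha * original_pixel.astype(np.float32) + beta, 0, 255).astype(np.uint8)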
```
**Effect**: Makes colors more distinct, simulates camera variations

#### 3. Horizontal Flip
**Method**: Mirror image horizontally
```python
flipped = cv.flip(image, 1)
```
**Effect**: Creates orientation variation (left/right doesn't matter for tomatoes)

#### 4. Rotation
**Method**: Rotate image by 10 degrees
```python
rotation_matrix = cv.getRotationMatrix2D(center, 10, 1.0)
rotated = cv.warpAffine(image, rotation_matrix, (width, height))  # dsize is (width, height)
```
**Effect**: Simulates different camera angles

#### 5. Gaussian Noise
**Method**: Add random noise with μ=0, σ=10
```python
noise = np.random.normal(0, 10, image.shape)
noisy_image = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```
**Effect**: Simulates sensor noise, makes model more robust

### Why Real Image Augmentation?

We augment at the **image level** (before feature extraction) rather than at the **feature level** because:

1. **More realistic**: Transformations match real-world variations
2. **Richer variations**: Image transformations create complex feature changes
3. **Better generalization**: Model learns from genuine data variations
4. **Not invertible**: A histogram discards spatial layout, so an image cannot be rebuilt from its features; image transformations therefore have to be applied before feature extraction
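
Point 4 can be checked directly: a histogram keeps pixel values but not their positions, so two different images can produce identical features. The sketch below is illustrative only; it uses a random single-channel image and `np.histogram` rather than the pipeline's own extraction code.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # one channel

# Shuffle the pixel positions: same values, completely different picture
shuffled = rng.permutation(image.ravel()).reshape(image.shape)

hist_a, _ = np.histogram(image, bins=32, range=(0, 256))
hist_b, _ = np.histogram(shuffled, bins=32, range=(0, 256))

same_histogram = np.array_equal(hist_a, hist_b)  # identical features
same_image = np.array_equal(image, shuffled)     # different images
```

Since many images map to the same histogram, there is no way to go from a feature vector back to a single image.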

---

<a id='balancing'></a>
## 7. Dataset Balancing

### Problem: Class Imbalance

The original dataset often has an unequal class distribution, for example:
- Fresh tomatoes: 1200 samples
- Rotten tomatoes: 800 samples

**Issues with imbalance**:
- Model biased toward majority class
- Poor performance on minority class
- Misleading accuracy metrics

### Solution: Augmentation-Based Balancing

**Strategy**: Augment the minority class until both classes have equal samples

```python
# Count samples per class (illustrative counts)
counts = {"fresh": 1200, "rotten": 800}

# Find majority count
max_count = max(counts.values())  # 1200

# Augment minority class
samples_to_add = max_count - min(counts.values())  # 1200 - 800 = 400

# Create 400 augmented rotten tomato samples
# Using: brightness, contrast, flip, rotate, noise (cycled)
```

### Balancing Process

1. **Count samples** in each class
2. **Identify minority class** (fewer samples)
3. **Calculate gap** to majority class
4. **Cycle through augmentation types**:
   - Sample 0: brightness
   - Sample 1: contrast
   - Sample 2: flip
   - Sample 3: rotate
   - Sample 4: noise
   - Sample 5: brightness (cycle repeats)
5. **Apply augmentation** to original images
6. **Extract features** from augmented images
7. **Add to dataset** until balanced
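
The gap calculation and the deterministic cycling in steps 3 and 4 can be sketched as a planning function. The names `AUGMENTATIONS` and `plan_augmentations` are hypothetical, and reusing source images in index order is an assumption; the pipeline may choose source images differently.

```python
from itertools import cycle

AUGMENTATIONS = ("brightness", "contrast", "flip", "rotate", "noise")

def plan_augmentations(minority_count, majority_count):
    """Return (source_index, augmentation_name) pairs that close the class gap."""
    gap = max(0, majority_count - minority_count)
    if gap and minority_count == 0:
        raise ValueError("cannot augment an empty class")
    aug_names = cycle(AUGMENTATIONS)
    # Source images are reused in index order (an assumption)
    return [(i % minority_count, next(aug_names)) for i in range(gap)]
```

With 800 minority and 1200 majority samples, the plan has 400 entries and the sixth entry is `brightness` again.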

### Result

**Before balancing**:
- Fresh: 1200 (60%)
- Rotten: 800 (40%)
- Total: 2000

**After balancing**:
- Fresh: 1200 (50%)
- Rotten: 1200 (50%)
- Total: 2400

### Important Notes

- **Only applied to training set** - validation/test sets remain unchanged
- **Real image augmentation** - not just feature noise
- **Deterministic cycling** - ensures all augmentation types are used

---

<a id='normalization'></a>
## 8. Normalization (Z-Score Standardization)

### Purpose
Transform features to have mean=0 and standard deviation=1 for each feature dimension.

### Why Normalize?

1. **Gradient Descent Convergence**: Faster and more stable training
2. **Feature Scale Consistency**: All features contribute equally
3. **Numerical Stability**: Prevents overflow/underflow
4. **Better Learning Rate**: Single learning rate works for all features

### Z-Score Formula

For each feature dimension $i$:

$$z_i = \frac{x_i - \mu_i}{\sigma_i}$$

Where:
- $x_i$ = original feature value
- $\mu_i$ = mean of feature $i$ across all training samples
- $\sigma_i$ = standard deviation of feature $i$ across all training samples
- $z_i$ = normalized feature value

### Normalization Process

#### Step 1: Compute Statistics (Training Set Only)
```python
# Calculate mean and std for each of 192 features
mean = np.mean(training_features, axis=0)  # Shape: (192,)
std = np.std(training_features, axis=0) + 1e-8  # Shape: (192,)
```

The small constant (1e-8) prevents division by zero.

#### Step 2: Save Statistics
```python
import pickle
norm_stats = {'mean': mean, 'std': std}
with open('norm_stats32.pkl', 'wb') as f:  # reused later for the val/test splits
    pickle.dump(norm_stats, f)
```

#### Step 3: Apply to All Splits
```python
# Training set: use computed statistics
train_normalized = (train_features - mean) / std

# Validation/Test sets: use TRAINING statistics
val_normalized = (val_features - mean) / std
test_normalized = (test_features - mean) / std
```

### Critical: Why Use Training Statistics for All Splits?

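**Data Leakage Prevention**:
- Statistics computed from validation/test data carry information about samples the model is not supposed to have seen
- Preprocessing, like the model itself, must be fitted on training data only
- Normalizing each split with its own statistics would also put the splits on different scales, so evaluation would no longer match what the model was trained on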

**Real-World Deployment**:
- New data will use training statistics
- Consistent preprocessing required
- Model expects training-normalized distribution
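
The fit-on-train, apply-everywhere pattern can be sketched as two small functions. The helper names and the random stand-in data are assumptions for illustration; only the formula and the `1e-8` constant come from the steps above.

```python
import numpy as np

def fit_zscore(train_features, eps=1e-8):
    """Per-feature mean and std, computed from the training split only."""
    return train_features.mean(axis=0), train_features.std(axis=0) + eps

def apply_zscore(features, mean, std):
    """Normalize any split with the statistics fitted on the training split."""
    return (features - mean) / std

rng = np.random.default_rng(0)
train = rng.random((500, 192))        # stand-in for training features
val = rng.random((100, 192)) + 0.1    # stand-in for validation features, shifted on purpose

mean, std = fit_zscore(train)
train_z = apply_zscore(train, mean, std)
val_z = apply_zscore(val, mean, std)
```

Only the training split ends up with mean 0 and std 1 per feature. The validation split generally does not, and that is expected: it is measured on the training scale.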

### Example Transformation

**Before normalization** (illustrative histogram proportions; each lies in 0-1, but every bin has its own mean and spread):
```
Feature 0 (Red bin 0): [0.1, 0.3, 0.05, 0.2, ...]  (range: 0-1)
Feature 50 (Green bin 18): [0.4, 0.6, 0.3, 0.5, ...]  (range: 0-1)
```

**After normalization** (mean=0, std=1 per feature over the training set):
```
Feature 0: [-1.2, 0.8, -1.8, -0.3, ...]  (centered, scaled)
Feature 50: [-0.5, 1.2, -1.1, 0.4, ...]  (centered, scaled)
```

### Two-Level Normalization

Note that we perform normalization at **two different stages**:

1. **Histogram Normalization** (during feature extraction):
   - Each histogram sums to 1
   - Within-sample normalization
   - Makes histograms comparable across different image sizes

2. **Z-Score Normalization** (after feature extraction):
   - Each feature has mean=0, std=1
   - Across-sample normalization
   - Makes features comparable to each other

Both are necessary and serve different purposes!

---

<a id='saving'></a>
## 9. Saving Processed Data

### Output Formats

Data is saved in **two formats** for flexibility:

#### 1. Pickle Format (.pkl)
**File**: `preprocessed_data_train32.pkl`

**Structure**: List of dictionaries
```python
[
    {
        'img_name': '001.jpg',
        'feature_vector': [0.02, 0.15, ..., -1.2],  # 192 values
        'class_id': 0,  # 0=Fresh, 1=Rotten
        'tomato_index': 0  # Which tomato in the image
    },
    ...
]
```

**Advantages**:
- Preserves exact data structure
- Fast loading with pickle
- Maintains data types
- Used by machine learning code
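
A round trip through `pickle` for this structure might look like the sketch below. The record contents are placeholders, and a temporary directory is used so the example does not touch real output files.

```python
import os
import pickle
import tempfile

records = [
    {"img_name": "001.jpg", "feature_vector": [0.0] * 192,
     "class_id": 0, "tomato_index": 0},
]

# A temporary directory keeps this sketch from touching real output files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessed_data_train32.pkl")
    with open(path, "wb") as f:
        pickle.dump(records, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)
```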

#### 2. CSV Format (.csv)
**File**: `preprocessed_data_train32.csv`

**Structure**: Tabular format
```
img_name,class_id,tomato_index,R0,R1,...,R31,G0,...,G31,B0,...,B31,H0,...,H31,S0,...,S31,V0,...,V31
001.jpg,0,0,0.02,0.15,...
```

**Advantages**:
- Human-readable
- Can open in Excel/Google Sheets
- Easy inspection and debugging
- Compatible with other tools
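
The column layout can be generated rather than typed by hand. This is a sketch using the standard `csv` module; `csv_header` and `record_to_row` are hypothetical helpers, and an in-memory buffer stands in for the output file.

```python
import csv
import io

def csv_header(bins=32):
    """Column names: three metadata columns, then R, G, B, H, S, V bins."""
    feature_cols = [f"{ch}{i}" for ch in "RGBHSV" for i in range(bins)]
    return ["img_name", "class_id", "tomato_index"] + feature_cols

def record_to_row(record):
    """Flatten one record dict into a CSV row matching csv_header()."""
    return [record["img_name"], record["class_id"],
            record["tomato_index"], *record["feature_vector"]]

buffer = io.StringIO()  # stands in for an open .csv file
writer = csv.writer(buffer)
writer.writerow(csv_header())
writer.writerow(record_to_row({"img_name": "001.jpg", "class_id": 0,
                               "tomato_index": 0, "feature_vector": [0.0] * 192}))
```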

### File Naming Convention

Files include the number of bins in the filename:
- `preprocessed_data_train32.pkl` - 32 bins (192 features)
- `preprocessed_data_train64.pkl` - 64 bins (384 features)

This prevents confusion when experimenting with different bin sizes.
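
One way to keep the names and the feature count in sync is to derive both from the same two settings. `output_names` is a hypothetical helper, not part of the pipeline:

```python
def output_names(split, bins):
    """Build the output file names for one split and bin count."""
    stem = f"preprocessed_data_{split}{bins}"
    return {
        "pickle": f"{stem}.pkl",
        "csv": f"{stem}.csv",
        "norm_stats": f"norm_stats{bins}.pkl",
        "n_features": 6 * bins,  # 6 channels x bins
    }
```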

### Additional Files

**Normalization Statistics**: `norm_stats32.pkl`
```python
{
    'mean': array([0.15, 0.23, ..., 0.18]),  # 192 values
    'std': array([0.05, 0.08, ..., 0.06])    # 192 values
}
```

**Critical**: This file must exist before processing validation/test sets!

---

<a id='pipeline'></a>
## 10. Complete Pipeline

### Pipeline Flowchart

```
Raw Images + YOLO Labels
         |
         ↓
    Parse Labels
         |
         ↓
  Extract Tomato Regions
         |
         ↓
    Resize to 64×64
         |
         ↓
  Extract RGB Histograms (96 features)
         +
  Extract HSV Histograms (96 features)
         |
         ↓
  Feature Vector (192 dims)
         |
         ↓
[TRAINING SET ONLY]
         |
         ↓
  Balance Dataset (Augmentation)
         |
         ↓
  Compute Mean & Std
         |
         ↓
  Save norm_stats32.pkl
         |
         ↓
[ALL SPLITS]
         |
         ↓
  Apply Z-Score Normalization
  (using training mean/std)
         |
         ↓
   Save .pkl and .csv
         |
         ↓
  Ready for ML Training!
```

### Processing Order

**IMPORTANT**: Must process in this order:

1. **Training set** (creates `norm_stats32.pkl`)
2. **Validation set** (uses `norm_stats32.pkl`)
3. **Test set** (uses `norm_stats32.pkl`)
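
The ordering requirement can be enforced in code so that a validation or test run fails early with a clear message. This is a sketch under assumptions: `get_norm_stats` and its `compute_stats` callable are hypothetical names, and the statistics are assumed to be stored as a pickled dictionary.

```python
import os
import pickle

def get_norm_stats(split, stats_path, compute_stats=None):
    """Training fits and saves the statistics; other splits only load them."""
    if split == "train":
        stats = compute_stats()  # hypothetical callable returning {'mean': ..., 'std': ...}
        with open(stats_path, "wb") as f:
            pickle.dump(stats, f)
        return stats
    if not os.path.exists(stats_path):
        raise FileNotFoundError(
            f"{stats_path} not found: process the training set before '{split}'"
        )
    with open(stats_path, "rb") as f:
        return pickle.load(f)
```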

### Configuration Options

```python
# Adjustable parameters
dataset_path = "../../dataSet"      # Path to dataset
split_choice = 'train'              # 'train', 'val', or 'test'
bins = 32                           # Histogram bins (32 or 64)
target_size = (64, 64)              # Image resize dimensions
```

### Output Summary

After processing all splits:

```
preprocessing/
├── preprocessed_data_train32.pkl     # Training features (balanced)
├── preprocessed_data_train32.csv     # Training features (CSV)
├── preprocessed_data_val32.pkl       # Validation features
├── preprocessed_data_val32.csv       # Validation features (CSV)
├── preprocessed_data_test32.pkl      # Test features
├── preprocessed_data_test32.csv      # Test features (CSV)
└── norm_stats32.pkl                  # Normalization statistics
```

### Performance Metrics

**Typical processing times** (on standard laptop):
- Training set (1000 images): ~2-3 minutes
- Validation set (300 images): ~30-40 seconds
- Test set (300 images): ~30-40 seconds

**Memory usage**:
- Feature vectors: ~1-2 MB per 1000 samples
- Normalization stats: ~1 KB
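
The feature-vector figure can be checked with a quick calculation. The sketch assumes 64-bit floats stored as a plain NumPy array; pickled lists of dictionaries carry extra overhead on top of this.

```python
import numpy as np

features = np.zeros((1000, 192), dtype=np.float64)  # 1000 samples, 32-bin features
megabytes = features.nbytes / 1e6                   # 1000 * 192 * 8 bytes
```

Here `megabytes` comes out at about 1.5, inside the quoted 1-2 MB range.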

---

## Summary

### Key Preprocessing Steps

1. ✅ **YOLO Label Parsing**: Convert annotations to bounding boxes
2. ✅ **Region Extraction**: Crop individual tomatoes from images
3. ✅ **Resizing**: Standardize to 64×64 pixels
4. ✅ **Feature Extraction**: 192D color histogram (RGB + HSV)
5. ✅ **Data Augmentation**: Real image transformations (5 types)
6. ✅ **Dataset Balancing**: Equal class representation
7. ✅ **Z-Score Normalization**: Mean=0, Std=1 scaling
8. ✅ **Data Saving**: Pickle + CSV formats

### Final Dataset Properties

- **Features**: 192 dimensions (32 bins × 6 channels)
- **Labels**: Binary (0=Fresh, 1=Rotten)
- **Balance**: 50/50 class distribution (training)
- **Normalization**: Z-score standardized
- **Format**: Ready for logistic regression/ML models

### Next Steps

After preprocessing:
1. Load processed data in training script
2. Train logistic regression model
3. Evaluate on validation/test sets
4. Analyze results and iterate

---

**End of Preprocessing Pipeline Documentation**