# Convolutional Neural Network (Predicting Vegetables)
### Kaung Myat San (P2408655)
### DAAA/FT/2B/22
### Date: 6.5.2025
---

## Background Research & Data Analysis
---

### Dataset Overview

* **Image Size**: 224 × 224 pixels (3 color channels)
* **Number of Classes**: **13**, but grouped into **11**
* **Classes**:

  * Bean
  * Bitter Gourd
  * **Bottle Gourd and Cucumber** *(grouped)*
  * Brinjal *(Eggplant)*
  * **Broccoli and Cauliflower** *(grouped)*
  * Cabbage
  * Capsicum *(Bell Pepper)*
  * **Carrot and Radish** *(grouped)*
  * Potato
  * Pumpkin
  * Tomato

> **Note**: Some vegetables are grouped due to their **visual similarity**, especially when images are **grayscale**.
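
The grouping above can be sketched as a simple lookup table. This is a minimal sketch: the raw class names used as keys and the helper `to_grouped_label` are assumptions for illustration, and the real folder names in the dataset may differ.

```python
# Hypothetical mapping from raw vegetable names to the grouped labels listed above.
# The keys are assumed raw class names; adjust them to match the actual folders.
GROUPED_LABELS = {
    "Bottle Gourd": "Bottle Gourd and Cucumber",
    "Cucumber": "Bottle Gourd and Cucumber",
    "Broccoli": "Broccoli and Cauliflower",
    "Cauliflower": "Broccoli and Cauliflower",
    "Carrot": "Carrot and Radish",
    "Radish": "Carrot and Radish",
}

def to_grouped_label(raw_name: str) -> str:
    """Return the grouped label for a raw class name; ungrouped names pass through."""
    return GROUPED_LABELS.get(raw_name, raw_name)
```

For example, `to_grouped_label("Radish")` gives the same label as `to_grouped_label("Carrot")`, while `to_grouped_label("Potato")` is returned unchanged.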

---

### Irregularities

#### Test Folder

* The **Pumpkin** and **Tomato** folders appear to be swapped: each contains images of the other vegetable.

#### Train Folder

* **Bean** folder:

  * Files `0001.jpg` to `0020.jpg`, `0033.jpg`, `0049.jpg`, and `0050.jpg` appear to be **carrots**, not beans.


### Tasks to Complete

1. **Correct Dataset Irregularities**

   * Identify and move misclassified images to their correct folders.

2. **Standardize Folder Names Across All Splits**

   * Ensure consistent naming for all class folders in the `train`, `test`, and `validation` directories.
   
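Both tasks could be done on disk along the following lines. This is only a sketch under stated assumptions: the helpers `standardize_name`, `standardize_split_folders`, and `move_images` are hypothetical, the title-case naming rule is one possible convention, and the `Carrot` destination folder in the usage comment is assumed rather than confirmed.

```python
import os
import re
import shutil

def standardize_name(name: str) -> str:
    """Collapse whitespace/underscores and capitalize each word of a folder name."""
    words = re.split(r"[\s_]+", name.strip())
    return " ".join(w.capitalize() for w in words if w)

def standardize_split_folders(root_dir, splits=("train", "test", "validation")):
    """Rename every class folder in each split to its standardized name."""
    for split in splits:
        split_path = os.path.join(root_dir, split)
        for folder in os.listdir(split_path):
            src = os.path.join(split_path, folder)
            dst = os.path.join(split_path, standardize_name(folder))
            # Skip files, already-standard names, and names that would collide.
            if os.path.isdir(src) and src != dst and not os.path.exists(dst):
                os.rename(src, dst)

def move_images(src_dir, dst_dir, filenames):
    """Move the listed files, prefixing the source folder name to avoid overwrites."""
    os.makedirs(dst_dir, exist_ok=True)
    prefix = os.path.basename(os.path.normpath(src_dir))
    for fname in filenames:
        src = os.path.join(src_dir, fname)
        if os.path.exists(src):
            shutil.move(src, os.path.join(dst_dir, f"from_{prefix}_{fname}"))

# Possible usage (destination folder name is an assumption):
# suspects = [f"{i:04d}.jpg" for i in list(range(1, 21)) + [33, 49, 50]]
# move_images("Dataset_A/train/Bean", "Dataset_A/train/Carrot", suspects)
```
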
---


## Setup
---
### Importing Modules

In [None]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
import re
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.optimizers import Adam

## Data Augmentation
---

In [None]:
# Define root and subfolders
root_dir = 'Dataset_A'
subfolders = ['train', 'test', 'validation']
target_size = (101, 101)  # Change to (23, 23) if needed

# Known issues
# Keys are bare class-folder names, because correct_label() looks up os.path.basename of the folder
mislabel_test = {'Pumpkin': 'Tomato', 'Tomato': 'Pumpkin'}
mislabel_train = {
    'train/Bean': list(range(1, 21)) + [33, 49, 50]
}  # 0001.jpg to 0020.jpg, 0033.jpg, 0049.jpg, 0050.jpg are carrots

# Data storage
data = []

def correct_label(path, fname, folder):
    folder_name = os.path.basename(os.path.dirname(path))

    # Fix test folder mislabels
    if folder == 'test' and folder_name in mislabel_test:
        return mislabel_test[folder_name]

    # Fix train/Bean mislabels
    if folder == 'train' and folder_name == 'Bean':
        match = re.match(r'(\d+)', fname)
        if match:
            idx = int(match.group(1))
            if idx in mislabel_train['train/Bean']:
                return 'Carrot'  # assumes this matches the carrot class's folder name

    # Default: use folder name as label
    return folder_name

# Traverse the dataset
for split in subfolders:
    split_path = os.path.join(root_dir, split)
    for class_folder in os.listdir(split_path):
        class_path = os.path.join(split_path, class_folder)
        if not os.path.isdir(class_path):
            continue
        for fname in tqdm(os.listdir(class_path), desc=f"Processing {split}/{class_folder}"):
            img_path = os.path.join(class_path, fname)
            try:
                # Load in grayscale and resize
                img = load_img(img_path, color_mode='grayscale', target_size=target_size)
                img_array = img_to_array(img)  # shape: (101, 101, 1)
                img_array = img_array / 255.0  # normalize to 0–1
                img_array = np.expand_dims(img_array, axis=0)  # shape: (1, 101, 101, 1)

                label = correct_label(img_path, fname, split)
                data.append({'image': img_array, 'label': label, 'split': split})
            except Exception as e:
                print(f"Error processing {img_path}: {e}")

# Convert to DataFrame
df = pd.DataFrame(data)

In [None]:
df.head()
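
Once the DataFrame is built, each split can be turned into arrays for model training. The following is a minimal sketch, assuming the `image`, `label`, and `split` columns created above; the helper `split_to_arrays` and the integer label encoding are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def split_to_arrays(df: pd.DataFrame, split: str, class_names: list):
    """Stack one split's images into X and encode its labels as integer indices in y."""
    subset = df[df["split"] == split]
    # Each stored image has shape (1, H, W, 1), so concatenating on axis 0 gives (N, H, W, 1).
    X = np.concatenate(subset["image"].tolist(), axis=0)
    class_to_idx = {name: i for i, name in enumerate(class_names)}
    y = subset["label"].map(class_to_idx)
    if y.isna().any():
        raise ValueError("Found labels that are not in class_names")
    return X, y.astype(int).to_numpy()

# Possible usage with the DataFrame built above:
# class_names = sorted(df["label"].unique())
# X_train, y_train = split_to_arrays(df, "train", class_names)
```

Using one shared `class_names` list for all three splits keeps the label indices consistent between training, validation, and test data.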