# Data Preprocessing for Garbage Classification

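This notebook cleans and preprocesses the garbage classification dataset: it indexes the image files, checks for missing values, resizes and normalizes the images, encodes the labels, and splits the data into training and validation sets. The dataset consists of images in four classes: organic, plastic, metal/glass, and paper.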

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from PIL import Image
import matplotlib.pyplot as plt
import cv2

# Define the path to the raw data
raw_data_path = '../data/raw/'
processed_data_path = '../data/processed/'

# Create processed data directory if it doesn't exist
os.makedirs(processed_data_path, exist_ok=True)


In [2]:
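# Load the dataset: collect image paths and their class labels
# Assumed image extensions; adjust to match the files in the dataset
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.bmp')

def load_data(data_path):
    images = []
    labels = []
    # Each subdirectory of data_path is treated as one class
    for label in sorted(os.listdir(data_path)):
        label_path = os.path.join(data_path, label)
        if not os.path.isdir(label_path):
            continue  # Skip stray files such as .DS_Store
        for img_file in sorted(os.listdir(label_path)):
            if not img_file.lower().endswith(IMAGE_EXTENSIONS):
                continue  # Skip non-image files
            img_path = os.path.join(label_path, img_file)
            images.append(img_path)
            labels.append(label)
    return images, labels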

images, labels = load_data(raw_data_path)


In [3]:
# Convert to DataFrame
data = pd.DataFrame({'image': images, 'label': labels})
data.head()
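
Before cleaning, it can help to see how many images each class contains, since a strong imbalance affects how the data should be split later. The following is a minimal sketch using only the standard library; `class_counts` and the example labels are hypothetical names introduced for illustration, not part of the dataset.

```python
from collections import Counter

def class_counts(labels):
    """Return a dict mapping each class label to its number of samples."""
    return dict(Counter(labels))

# Illustrative labels only; in the notebook, pass the `labels` list instead
example_labels = ['organic', 'plastic', 'organic', 'paper', 'organic']
print(class_counts(example_labels))
```

In the notebook itself, `data['label'].value_counts()` gives the same information directly from the DataFrame.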


In [4]:
# Data cleaning: Check for missing values
print(data.isnull().sum())
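
The null check above only covers the DataFrame entries, which are file paths and folder names. A corrupt or non-image file would pass it and fail later during preprocessing. One possible additional check, sketched here under the assumption that Pillow's `Image.verify()` is an acceptable integrity test for this dataset, is to try opening each file; `find_unreadable_images` is a hypothetical helper name.

```python
from PIL import Image

def find_unreadable_images(paths):
    """Return the paths that Pillow cannot open or verify."""
    unreadable = []
    for path in paths:
        try:
            with Image.open(path) as img:
                img.verify()  # Checks file integrity without decoding all pixels
        except Exception:  # Pillow raises several exception types for bad files
            unreadable.append(path)
    return unreadable
```

In the notebook, this could be called as `find_unreadable_images(data['image'])`, and the returned paths dropped from `data` before preprocessing.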


In [5]:
# Data preprocessing: Resize images and normalize pixel values
def preprocess_images(data):
    processed_images = []
    for img_path in data['image']:
        with Image.open(img_path) as img:
            # Convert to RGB so grayscale and RGBA files all yield (128, 128, 3)
            img = img.convert('RGB').resize((128, 128))  # Resize to 128x128
            # Scale pixel values to [0, 1]; float32 uses half the memory of float64
            img_array = np.asarray(img, dtype=np.float32) / 255.0
        processed_images.append(img_array)
    return np.array(processed_images)

X = preprocess_images(data)
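
Because every image is held in memory at once, it is worth estimating the size of `X` before running this cell. The size is the product of the number of images, the image dimensions, the channel count, and the bytes per value (4 for `float32`, 8 for `float64`). A rough sketch of that arithmetic follows; `estimate_array_bytes` is a hypothetical helper and the image count is made up for illustration.

```python
def estimate_array_bytes(n_images, height=128, width=128, channels=3, bytes_per_value=4):
    """Estimate the in-memory size of an image array in bytes."""
    return n_images * height * width * channels * bytes_per_value

# Hypothetical dataset size, for illustration only
n_images = 10_000
for dtype_name, size in (('float32', 4), ('float64', 8)):
    gib = estimate_array_bytes(n_images, bytes_per_value=size) / 1024**3
    print(f'{dtype_name}: {gib:.2f} GiB')
```

If the estimate exceeds the available memory, processing the images in batches or loading them lazily during training would be alternatives to a single array.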


In [6]:
# Encode labels
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data['label'])
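
`LabelEncoder` assigns integer codes in sorted order of the class names, and the fitted encoder is needed later to turn predictions back into names. The sketch below shows the mapping and the round trip on a few illustrative labels; the class strings here are examples, since the real names come from the folder names in the raw data.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative class names only; the real names come from the folder names
example_labels = ['plastic', 'organic', 'paper', 'organic']

encoder = LabelEncoder()
encoded = encoder.fit_transform(example_labels)

# classes_ is sorted, so the position of each class is its integer code
label_to_code = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(label_to_code)

# inverse_transform maps integer codes back to the original class names
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

In the notebook, `label_encoder.classes_` gives the same mapping for the real dataset, and saving it alongside the arrays keeps the codes interpretable at training time.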


In [7]:
# Split the dataset into training and validation sets
# stratify=y keeps the class proportions the same in both splits
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
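
`train_test_split` accepts a `stratify` argument that keeps class proportions the same in the training and validation sets, which matters when some classes have far fewer images than others. The following is a minimal sketch on synthetic labels (not the notebook's data) showing the effect; all `*_demo` names are introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 samples of class 0 and 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # Placeholder features

_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count samples per class in each split
train_counts = np.bincount(y_tr)
val_counts = np.bincount(y_va)
print('train:', train_counts, 'val:', val_counts)
```

Both splits keep the 4:1 ratio of the full label array. Without `stratify`, the ratio in the validation set depends on the random shuffle.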


In [8]:
# Save the processed data
np.save(os.path.join(processed_data_path, 'X_train.npy'), X_train)
np.save(os.path.join(processed_data_path, 'X_val.npy'), X_val)
np.save(os.path.join(processed_data_path, 'y_train.npy'), y_train)
np.save(os.path.join(processed_data_path, 'y_val.npy'), y_val)
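
After saving, a quick reload confirms that the files can be read back with the expected shape and dtype. The sketch below demonstrates this on a small random array in a temporary directory rather than on the real files; `reload_and_check` is a hypothetical helper.

```python
import os
import tempfile

import numpy as np

def reload_and_check(directory, name, expected):
    """Load a saved .npy file and confirm its shape and dtype match `expected`."""
    loaded = np.load(os.path.join(directory, name))
    assert loaded.shape == expected.shape, 'shape changed after reload'
    assert loaded.dtype == expected.dtype, 'dtype changed after reload'
    return loaded

# Demonstration on a small random array in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo = np.random.rand(4, 128, 128, 3).astype(np.float32)
    np.save(os.path.join(tmp_dir, 'X_demo.npy'), demo)
    reloaded = reload_and_check(tmp_dir, 'X_demo.npy', demo)
    round_trip_ok = np.array_equal(demo, reloaded)

print('round trip ok:', round_trip_ok)
```

In the notebook, the same check could be run as `reload_and_check(processed_data_path, 'X_train.npy', X_train)`.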


## Conclusion

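In this notebook, we loaded the image paths and labels, checked them for missing values, resized and normalized the images, encoded the labels, and saved training and validation splits as `.npy` files in `../data/processed/`. The processed data is now ready for model training.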