# Garbage image classification

## Aggregating data, creating data set

We list the image folders and build a data frame with one row per image file.

In [9]:
import os
import pandas as pd

def list_folders_in_directory(directory):
    folders = [folder for folder in os.listdir(directory) if os.path.isdir(os.path.join(directory, folder))]
    return folders

directory_path = "./garbage-images/Garbage classification/Garbage classification"
folders_list = list_folders_in_directory(directory_path)
print(folders_list)


['paper', 'metal', 'cardboard', 'trash', 'glass', 'plastic']


In [10]:
data = []
for folder in folders_list: 
    files = os.listdir(os.path.join(directory_path, folder))
    # Add each file along with its folder name to the data list
    for file in files:
        data.append({'Folder': folder, 'File_name': file})

# Create a DataFrame from the data list
df = pd.DataFrame(data)

# Display the DataFrame
df.head()



| | Folder | File_name |
|---|---|---|
| 0 | paper | paper283.jpg |
| 1 | paper | paper297.jpg |
| 2 | paper | paper526.jpg |
| 3 | paper | paper240.jpg |
| 4 | paper | paper254.jpg |
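As an alternative sketch of the same collection step, the records could be gathered with `pathlib`, skipping anything inside a class folder that is not a regular file. The function name `collect_image_records` is hypothetical and not part of the notebook; the sorted order is a choice made here only to make the result reproducible.

```python
from pathlib import Path

def collect_image_records(root):
    """Return one {'Folder', 'File_name'} dict per file in each class subfolder."""
    records = []
    # Visit class folders in sorted order so the result does not depend on the OS
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        # Keep regular files only; nested folders are ignored
        for file_path in sorted(f for f in class_dir.iterdir() if f.is_file()):
            records.append({"Folder": class_dir.name, "File_name": file_path.name})
    return records
```

The returned list has the same shape as `data` above, so it could be passed to `pd.DataFrame` in the same way.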


## Analysing the data

First, we check the data set to see how many photos we have.

In [11]:
df.shape

(2527, 2)

Next, we check how many images we have in each category.

In [12]:
df['Folder'].value_counts()

Folder
paper        594
glass        501
plastic      482
metal        410
cardboard    403
trash        137
Name: count, dtype: int64

The data is imbalanced: 'trash' has only 137 images, far fewer than any other category.
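One common way to compensate for this kind of imbalance, which the notebook itself does not do, is to weight each class inversely to its frequency. The sketch below uses the counts reported above and the heuristic `n_samples / (n_classes * class_count)`; the function name `balanced_class_weights` is hypothetical.

```python
# Image counts per category, copied from the value_counts() output
class_counts = {
    "cardboard": 403,
    "glass": 501,
    "metal": 410,
    "paper": 594,
    "plastic": 482,
    "trash": 137,
}

def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {name: n_samples / (n_classes * count) for name, count in counts.items()}

class_weights = balanced_class_weights(class_counts)

# Print the classes from the smallest weight to the largest
for name, weight in sorted(class_weights.items(), key=lambda item: item[1]):
    print(f"{name:<10} {weight:.2f}")
```

With these weights the rare 'trash' class gets the largest weight and 'paper' the smallest. Whether weighting, oversampling, or augmentation works best for this dataset would need to be checked experimentally.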

We want to see which file formats are present in the dataset.

We can see that the dataset contains only 'jpg' files.

In [13]:
# Extract the extension: split the file name on '.' and keep the last part
df['Extension'] = df['File_name'].str.split('.').str[-1]

df['Extension'].value_counts()


Extension
jpg    2527
Name: count, dtype: int64
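Splitting on '.' works for this dataset, but it returns the whole name for a file without an extension and treats 'JPG' and 'jpg' as different formats. A slightly more defensive sketch, assuming that lower-casing extensions is acceptable, uses `os.path.splitext`; the helper name `get_extension` is hypothetical.

```python
import os

def get_extension(file_name):
    """Return the lower-cased extension without the dot, or '' if there is none."""
    return os.path.splitext(file_name)[1].lstrip(".").lower()

print(get_extension("paper283.jpg"))  # jpg
print(get_extension("PHOTO.JPG"))     # jpg
print(repr(get_extension("README")))  # ''
```

In the notebook this helper could be applied with `df['File_name'].map(get_extension)`.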

We want to check the dimensions of the images.

We add the width and height of each image to the dataframe.

In [14]:
from PIL import Image

def get_image_dimensions(df_row):
    file_path = os.path.join(directory_path, df_row['Folder'], df_row['File_name'])
    # Open inside a context manager so each file handle is closed after reading the size
    with Image.open(file_path) as image:
        width, height = image.size
    return width, height

df['Image_width'], df['Image_height'] = zip(*df.apply(get_image_dimensions, axis=1))

df.head()



| | Folder | File_name | Extension | Image_width | Image_height |
|---|---|---|---|---|---|
| 0 | paper | paper283.jpg | jpg | 512 | 384 |
| 1 | paper | paper297.jpg | jpg | 512 | 384 |
| 2 | paper | paper526.jpg | jpg | 512 | 384 |
| 3 | paper | paper240.jpg | jpg | 512 | 384 |
| 4 | paper | paper254.jpg | jpg | 512 | 384 |


We check how many distinct values each of those two columns has.

All the images have the same size (512 x 384), so no transformation is needed to make the sizes consistent.

In [15]:
print(df.Image_width.value_counts())
print(30*'*')
print(df.Image_height.value_counts())

Image_width
512    2527
Name: count, dtype: int64
******************************
Image_height
384    2527
Name: count, dtype: int64
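The same check can also be done on (width, height) pairs instead of on each column separately. For a uniform dataset the conclusion is identical, but when sizes differ the pair counts show directly which combinations occur. This is a minimal sketch on made-up sample sizes; `count_sizes` is a hypothetical helper.

```python
from collections import Counter

def count_sizes(sizes):
    """Count how often each (width, height) pair occurs."""
    return Counter(sizes)

# Hypothetical sample: two landscape images and one portrait image
sample_sizes = [(512, 384), (512, 384), (384, 512)]
size_counts = count_sizes(sample_sizes)

# The dataset is uniform only if exactly one distinct pair is present
is_uniform = len(size_counts) == 1
print(size_counts.most_common())
print("uniform:", is_uniform)
```

In the notebook, the pairs could be built with `zip(df['Image_width'], df['Image_height'])`.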


## Data processing

Now that the files in the dataset are ready to work with, we can apply *Label Encoding*, which converts categorical labels into numerical values.

In [16]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['Label'] = label_encoder.fit_transform(df['Folder'])

df.head()



| | Folder | File_name | Extension | Image_width | Image_height | Label |
|---|---|---|---|---|---|---|
| 0 | paper | paper283.jpg | jpg | 512 | 384 | 3 |
| 1 | paper | paper297.jpg | jpg | 512 | 384 | 3 |
| 2 | paper | paper526.jpg | jpg | 512 | 384 | 3 |
| 3 | paper | paper240.jpg | jpg | 512 | 384 | 3 |
| 4 | paper | paper254.jpg | jpg | 512 | 384 | 3 |
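`LabelEncoder` numbers the classes in sorted order of their names, which is why 'paper' is encoded as 3. The pure-Python sketch below mimics that behaviour to make the full mapping visible; `build_label_mapping` is a hypothetical helper, not part of scikit-learn.

```python
def build_label_mapping(class_names):
    """Map each distinct class name to an integer, in sorted order."""
    return {name: index for index, name in enumerate(sorted(set(class_names)))}

folders = ['paper', 'metal', 'cardboard', 'trash', 'glass', 'plastic']
label_mapping = build_label_mapping(folders)
print(label_mapping)
# {'cardboard': 0, 'glass': 1, 'metal': 2, 'paper': 3, 'plastic': 4, 'trash': 5}

# Inverse mapping, useful for turning predicted integers back into names
index_to_name = {index: name for name, index in label_mapping.items()}
```

In the notebook itself, the same ordering should be available from the fitted encoder through `label_encoder.classes_`.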


We need to load the data in order to normalise it.

In [17]:
import tensorflow as tf

# Directory containing the dataset: one subfolder per class (directory_path is defined above)
data_directory = directory_path

# Define image parameters
image_size = (150, 150)
batch_size = 32

# Create a TensorFlow Dataset object
train_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    data_directory,
    labels="inferred",  # Labels are inferred from subfolder names
    label_mode="int",  # Use integer labels
    color_mode="rgb",
    image_size=image_size,
    batch_size=batch_size,
    shuffle=True,
    seed=42,
    validation_split=0.2,
    subset="training"
)

val_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    data_directory,
    labels="inferred",  # Labels are inferred from subfolder names
    label_mode="int",  # Use integer labels
    color_mode="rgb",
    image_size=image_size,
    batch_size=batch_size,
    shuffle=True,
    seed=42,
    validation_split=0.2,
    subset="validation"
)
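The loading step above resizes and batches the images but does not rescale them: the pixel values stay in the 0 to 255 range. Normalisation usually means dividing by 255 so that every value falls in [0, 1]. The sketch below shows that arithmetic on a tiny made-up image using plain Python lists; `normalise_pixels` is a hypothetical helper. In a TensorFlow pipeline the same scaling is typically done with a `tf.keras.layers.Rescaling(1./255)` layer, which is an assumption about how the notebook would continue.

```python
def normalise_pixels(pixels, max_value=255.0):
    """Scale a nested list of 0-255 pixel values to floats in [0, 1]."""
    return [[value / max_value for value in row] for row in pixels]

# Hypothetical 2 x 3 single-channel image
sample = [[0, 128, 255], [64, 32, 16]]
normalised = normalise_pixels(sample)

for row in normalised:
    print([round(value, 3) for value in row])
```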

