<a href="https://colab.research.google.com/github/NethmiAmasha/Waste-Image-Classification-CNN/blob/main/Waste_image_calssification1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Download & Extract RealWaste Dataset

In [1]:
import os, zipfile, requests
from pathlib import Path
from io import BytesIO

DATA_URL = "https://archive.ics.uci.edu/static/public/908/realwaste.zip"
OUT_DIR = Path("/content/realwaste_data")   # where to save and extract
OUT_DIR.mkdir(parents=True, exist_ok=True)

print("📥 Downloading dataset...")
response = requests.get(DATA_URL, stream=True)
response.raise_for_status()

# Extract directly from memory
with zipfile.ZipFile(BytesIO(response.content)) as z:
    z.extractall(OUT_DIR)

print("✅ Dataset downloaded and extracted to:", OUT_DIR)


📥 Downloading dataset...
✅ Dataset downloaded and extracted to: /content/realwaste_data
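
The cell above holds the whole archive in memory before extracting it. As an alternative, the sketch below writes the download to disk in chunks and then extracts from the saved file. The helper name `save_and_extract` and the tiny demo archive are hypothetical, introduced only for illustration; in the notebook the chunks would come from `response.iter_content(...)`, as shown in the trailing comment.

```python
import io
import tempfile
import zipfile
from pathlib import Path
from typing import Iterable, List


def save_and_extract(chunks: Iterable[bytes], zip_path: Path, out_dir: Path) -> List[str]:
    """Write byte chunks to zip_path, extract into out_dir, return member names."""
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(zip_path, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fh.write(chunk)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
        return zf.namelist()


# Demo on a tiny in-memory archive (hypothetical content, no network needed).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Paper/Paper_1.jpg", b"not a real image")
payload = buffer.getvalue()
demo_chunks = [payload[i:i + 16] for i in range(0, len(payload), 16)]

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    members = save_and_extract(demo_chunks, tmp_dir / "demo.zip", tmp_dir / "data")
    extracted_ok = (tmp_dir / "data" / "Paper" / "Paper_1.jpg").is_file()

print(members, extracted_ok)

# In the notebook this could be driven by the real download, for example:
# response = requests.get(DATA_URL, stream=True, timeout=60)
# response.raise_for_status()
# save_and_extract(response.iter_content(chunk_size=1 << 20),
#                  OUT_DIR / "realwaste.zip", OUT_DIR)
```

Keeping the zip file on disk also makes it possible to skip the download on a later run if the file already exists.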


Explore and List Image Files

In [2]:
import pandas as pd
from pathlib import Path

image_exts = (".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff")
root = OUT_DIR

# Recursively collect every image file under root
image_files = []
for path in root.rglob("*"):
    if path.suffix.lower() in image_exts:
        label = path.parent.name   # parent folder name = label
        image_files.append((str(path), label))

df = pd.DataFrame(image_files, columns=["filepath", "label"])

print("Total images found:", len(df))
print(df["label"].value_counts())
df.head()


Total images found: 4752
label
Plastic                921
Metal                  790
Paper                  500
Miscellaneous Trash    495
Cardboard              461
Vegetation             436
Glass                  420
Food Organics          411
Textile Trash          318
Name: count, dtype: int64


                                            filepath  label
0  /content/realwaste_data/realwaste-main/RealWas...  Paper
1  /content/realwaste_data/realwaste-main/RealWas...  Paper
2  /content/realwaste_data/realwaste-main/RealWas...  Paper
3  /content/realwaste_data/realwaste-main/RealWas...  Paper
4  /content/realwaste_data/realwaste-main/RealWas...  Paper
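
The counts above show that the classes are not equally represented. The following sketch summarizes that imbalance using the numbers printed by `value_counts()`. The counts are copied by hand from the output above; the variable names (`class_counts`, `imbalance_ratio`) are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {
    "Plastic": 921,
    "Metal": 790,
    "Paper": 500,
    "Miscellaneous Trash": 495,
    "Cardboard": 461,
    "Vegetation": 436,
    "Glass": 420,
    "Food Organics": 411,
    "Textile Trash": 318,
}

total = sum(class_counts.values())
largest = max(class_counts, key=class_counts.get)
smallest = min(class_counts, key=class_counts.get)
imbalance_ratio = class_counts[largest] / class_counts[smallest]

print(f"Total images: {total}")
for name, count in class_counts.items():
    # Share of the full dataset held by each class
    print(f"{name:<20} {count:>4}  {count / total:6.1%}")
print(f"Largest/smallest ratio ({largest} / {smallest}): {imbalance_ratio:.2f}")
```

With the DataFrame from the cell above, the same dictionary could be obtained with `df["label"].value_counts().to_dict()`. The largest class has roughly 2.9 times as many images as the smallest, which is one reason the later split is stratified by label.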


Split into Training / Validation / Test (70 / 15 / 15)

In [3]:
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

train_df, rest_df = train_test_split(
    df, test_size=0.30, stratify=df["label"], random_state=RANDOM_STATE
)

val_df, test_df = train_test_split(
    rest_df, test_size=0.50, stratify=rest_df["label"], random_state=RANDOM_STATE
)

print("Split sizes:")
print("Train:", len(train_df))
print("Validation:", len(val_df))
print("Test:", len(test_df))

# Verify class balance
print("\nClass distribution check:")
print(train_df['label'].value_counts(normalize=True).head())
print(val_df['label'].value_counts(normalize=True).head())
print(test_df['label'].value_counts(normalize=True).head())

Split sizes:
Train: 3326
Validation: 713
Test: 713

Class distribution check:
label
Plastic                0.193927
Metal                  0.166266
Paper                  0.105232
Miscellaneous Trash    0.104029
Cardboard              0.097114
Name: proportion, dtype: float64
label
Plastic                0.193548
Metal                  0.166900
Paper                  0.105189
Miscellaneous Trash    0.103787
Cardboard              0.096774
Name: proportion, dtype: float64
label
Plastic                0.193548
Metal                  0.165498
Miscellaneous Trash    0.105189
Paper                  0.105189
Cardboard              0.096774
Name: proportion, dtype: float64
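
Two quick sanity checks can complement the output above: that the split sizes follow from the two `test_size` values, and that no image ends up in more than one split. The sketch below is a minimal illustration; the helper names `expected_split_sizes` and `splits_are_disjoint` are hypothetical, and the size arithmetic assumes the held-out part is rounded up, which is consistent with the sizes printed above.

```python
import math
from typing import Iterable, Tuple


def expected_split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_kept, n_held_out), assuming the held-out size is rounded up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


def splits_are_disjoint(*splits: Iterable[str]) -> bool:
    """True if no file path appears in more than one split."""
    seen = set()
    for split in splits:
        paths = set(split)
        if seen & paths:
            return False
        seen |= paths
    return True


# 70 / 30 first, then the 30% remainder is halved into 15 / 15.
n_train, n_rest = expected_split_sizes(4752, 0.30)
n_val, n_test = expected_split_sizes(n_rest, 0.50)
print(n_train, n_val, n_test)

# Toy paths (hypothetical) to show the overlap check.
print(splits_are_disjoint(["a.jpg", "b.jpg"], ["c.jpg"], ["d.jpg"]))
print(splits_are_disjoint(["a.jpg", "b.jpg"], ["b.jpg"], ["d.jpg"]))
```

In the notebook, the overlap check could be called as `splits_are_disjoint(train_df["filepath"], val_df["filepath"], test_df["filepath"])`.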
