# 01 – Exploratory Data Analysis (EDA) & Majority-Class Baseline

This notebook answers three questions:

1. **What does the raw image dataset look like?**  
2. **How balanced are the two classes (recyclable / non-recyclable)?**  
3. **What is the accuracy of a naïve majority-class model?**  

The goal is to set a first benchmark before we train neural networks in later notebooks.


In [None]:
import os

import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from pathlib import Path

from src.data_loader import load_images


## 1  Dataset locations

Replace the folder paths below with **your real directories**.  
For this demo we keep a tiny subset under `data/sample_images/` so the notebook always runs.
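
Before loading, it can help to confirm that each class folder exists and actually contains images. The sketch below is one way to do that with the standard library only; `is_image_file`, `count_image_files`, and the extension list are illustrative assumptions, not part of `src.data_loader`.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def is_image_file(name):
    """Return True if the file name ends with one of the assumed image extensions."""
    return Path(name).suffix.lower() in IMAGE_EXTENSIONS


def count_image_files(folder):
    """Count image files directly inside `folder`; return 0 if the folder is missing.

    Hypothetical helper for a pre-load check. It does not open or decode the files.
    """
    path = Path(folder)
    if not path.is_dir():
        return 0
    return sum(1 for entry in path.iterdir()
               if entry.is_file() and is_image_file(entry.name))


# Usage: report how many candidate images each class folder holds.
for folder in ["../data/sample_images/recyclable",
               "../data/sample_images/non_recyclable"]:
    print(f"{folder}: {count_image_files(folder)} image file(s)")
```

A count of zero for either folder usually means the path is wrong, which is cheaper to catch here than after a failed load.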


In [None]:
folder_paths = [
    "../data/sample_images/recyclable",       # <- put real path
    "../data/sample_images/non_recyclable"    # <- put real path
]
class_names  = ["recyclable", "non-recyclable"]

X, y, ignored = load_images(folder_paths, class_names, target_size=(64, 64))
print("Loaded:", X.shape, "| ignored files:", ignored)


## 2  Class distribution


In [None]:
cnt = Counter(y)
plt.figure(figsize=(4,3))
sns.barplot(x=list(cnt.keys()), y=list(cnt.values()), palette="Set2")
plt.title("Image count per class"); plt.ylabel("count"); plt.tight_layout()
plt.show()


*Observation:*  
If one class dominates, a majority-class predictor can already score high, so its accuracy is a useful sanity check before training deep-learning models.
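
To make "dominates" concrete, the class shares can be read straight off the label counts. This is a minimal sketch on made-up labels: the 7 : 3 split is toy data for illustration, not a result from this dataset, and `class_shares` is a hypothetical helper.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: fraction of samples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {label: n / total for label, n in counts.items()}


# Toy labels for illustration only.
toy_labels = ["recyclable"] * 7 + ["non-recyclable"] * 3
shares = class_shares(toy_labels)
for label, share in shares.items():
    print(f"{label}: {share:.0%}")

# The largest share equals the accuracy of always predicting that class.
print(f"Largest class share: {max(shares.values()):.0%}")
```

In the notebook itself, passing `y` instead of `toy_labels` should give the shares for the loaded images.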


## 3  Random sample of images


In [None]:
idx = np.random.choice(len(X), size=min(6, len(X)), replace=False)
plt.figure(figsize=(10,4))
for i, j in enumerate(idx, 1):
    plt.subplot(1, len(idx), i)
    plt.imshow(cv2.cvtColor(X[j], cv2.COLOR_BGR2RGB))
    plt.title(y[j]); plt.axis("off")
plt.tight_layout(); plt.show()


## 4  Basic pixel-level stats


In [None]:
flat_pixels = X.reshape(-1, 3) / 255.0
plt.figure(figsize=(6,3))
# X is in OpenCV's BGR order (see the BGR2RGB conversion in Section 3), so channel 0 is blue
for ch, color in enumerate(["b","g","r"]):
    sns.kdeplot(flat_pixels[:,ch], color=color, label=color.upper())
plt.title("Pixel intensity distribution"); plt.xlim(0,1); plt.tight_layout(); plt.legend()
plt.show()


## 5  Majority-class baseline


In [None]:
majority_label = cnt.most_common(1)[0][0]
# np.asarray makes the comparison element-wise even if `y` is a plain Python list
baseline_acc   = np.mean(np.asarray(y) == majority_label)
print(f"Majority-class baseline accuracy = {baseline_acc:.2%}")


### Take-aways
* Any model we train must beat the majority-class baseline accuracy printed by the cell above.  
* Class imbalance (ratio recyclable : non-recyclable) is given by the per-class counts in `cnt` from Section 2.  
* The per-channel intensity density plots cover a broad range, which suggests colour is likely informative.
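
Markdown cells do not evaluate Python expressions, so the concrete figures behind these take-aways are easiest to print from a code cell. This is a minimal sketch, assuming label counts shaped like the `cnt` Counter above; `summarize_baseline` is a hypothetical helper and the counts passed in are placeholders, not measured results.

```python
from collections import Counter


def summarize_baseline(counts):
    """Build take-away lines from a Counter mapping class label -> image count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one sample")
    majority_label, majority_count = counts.most_common(1)[0]
    baseline = majority_count / total
    labels = sorted(counts)  # fixed order so the ratio is reproducible
    classes = " : ".join(labels)
    ratio = " : ".join(str(counts[label]) for label in labels)
    return [
        f"Majority class: {majority_label}",
        f"Accuracy to beat: {baseline:.0%}",
        f"Class counts ({classes}) = {ratio}",
    ]


# Placeholder counts for illustration; in the notebook, pass `cnt` instead.
example_counts = Counter({"recyclable": 12, "non-recyclable": 8})
for line in summarize_baseline(example_counts):
    print(line)
```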

---

Proceed to **02 – V1 MLP** to train a first neural baseline.
