# ☕ Coffee Bean Quality Classification
## Stage 1: Preprocessing

---

> **ℹ️ Architecture Note**
> All classes (`DuplicateDetector`, `LabelProcessor`, `DataSplitter`,
> `PreprocessingPipeline`) have been moved to `src/` in the GitHub repository.
> This notebook contains only configuration, pipeline execution, and result inspection.
>
> The output of this notebook (CSV splits) is the input for `02_modeling_baseline.ipynb`.

---

### Pipeline Overview

```
Step 1 - Load metadata CSV & resolve image paths
Step 2 - Remove exact duplicates  (MD5)
Step 3 - Remove near-duplicates   (64-bit pHash, Hamming ≤ 4)
Step 4 - Encode labels            (sklearn LabelEncoder)
Step 5 - Group-Aware Stratified Split (train / val / test)
```
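Step 2 of the overview can be illustrated with a minimal, standard-library sketch. It assumes exact duplicates are files whose raw bytes produce the same MD5 digest; the helper names `md5_of_file` and `drop_exact_duplicates` are hypothetical and are not taken from the repository's `DuplicateDetector`.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file's raw bytes in chunks so large images are not read at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drop_exact_duplicates(paths):
    """Keep the first file seen for each distinct MD5 digest (hypothetical helper)."""
    seen, kept = set(), []
    for path in paths:
        digest = md5_of_file(path)
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept


# Demo on throwaway files: b.jpg is a byte-for-byte copy of a.jpg.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"a.jpg": b"bean-1", "b.jpg": b"bean-1", "c.jpg": b"bean-2"}
    paths = []
    for name, data in contents.items():
        path = Path(tmp) / name
        path.write_bytes(data)
        paths.append(path)
    kept_names = [p.name for p in drop_exact_duplicates(paths)]

print(kept_names)  # ['a.jpg', 'c.jpg']
```

MD5 only catches byte-identical files; a re-encoded or resized copy has a different digest, which is why the overview follows this step with a perceptual hash.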

---
## 📦 1. Clone Repository & Install Package

In [1]:
import os, sys

# Clone the repo from GitHub and install it as a package
REPO_URL = "https://github.com/Ardiyanto24/Coffee-Bean-Classifier.git"
REPO_DIR = "Coffee-Bean-Classifier"

if not os.path.exists(REPO_DIR):
    os.system(f"git clone {REPO_URL}")
else:
    # Already cloned: pull the latest changes
    os.system(f"git -C {REPO_DIR} pull")

# Install as an editable package so that src/ can be imported.
# Note: os.system only returns the exit status and never raises, so a failed
# command (such as the pip error in the output below) does not stop this cell.
os.system(f"pip install -e {REPO_DIR} -q")
os.system("pip install imagehash -q")

# Add the repo root to sys.path as a fallback
if REPO_DIR not in sys.path:
    sys.path.insert(0, REPO_DIR)

print("✅ Repository ready.")

Cloning into 'Coffee-Bean-Classifier'...
ERROR: Exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/pip/_internal/cli/base_command.py", line 179, in exc_logging_wrapper
    status = run_func(*args)
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pip/_internal/cli/req_command.py", line 67, in wrapper
    return func(self, options, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pip/_internal/commands/install.py", line 377, in run
    requirement_set = resolver.resolve(
                      ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pip/_internal/resolution/resolvelib/resolver.py", line 76, in resolve
    collected = self.factory.collect_root_requirements(root_reqs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pip/_internal/resolution/resolvelib/factory.py", line 538, in collect_root_

✅ Repository ready.


---
## 📚 2. Imports

In [2]:
import os
import hashlib
import logging
from pathlib import Path
from dataclasses import dataclass
from typing import Optional, Tuple, Dict, List

import numpy as np
import pandas as pd
import imagehash
from PIL import Image
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedGroupKFold

logging.basicConfig(level=logging.INFO, format='[%(levelname)s] %(message)s')

---
## 📦 3. Import from `src/`

In [3]:
from src.config import PreprocessingConfig
from src.data.preprocessing import (
    DuplicateDetector,
    LabelProcessor,
    DataSplitter,
    PreprocessingPipeline,
)

print("✅ All classes imported successfully from src/")

✅ All classes imported successfully from src/


---
## ⚙️ 4. Configuration Setup

Edit the values below to match your dataset paths on Kaggle.

| Parameter | Description |
|-----------|-------------|
| `metadata_path` | Path to the metadata CSV, which contains the `filepath` and `label` columns |
| `image_base_dir` | Root image folder, used to resolve relative paths |
| `phash_hash_size` | DCT grid size (default 8 = 64-bit hash; do not change) |
| `phash_threshold` | Hamming distance cutoff for near-duplicates (default 4, the standard value) |
| `val_size` / `test_size` | Proportions of the validation and test splits |
| `output_dir` | Output folder where the split CSVs are saved |

In [4]:
config = PreprocessingConfig(
    metadata_path   = '/kaggle/input/datasets/arproject01/metadata/coffee_metadata.csv',
    image_base_dir  = '/kaggle/input/datasets/ardiyanto24/coffee-bean-classification-dataset/Deteksi Jenis Kopi/train',
    phash_hash_size = 8,   # 8x8 DCT -> 64-bit hash (standard, do not change)
    phash_threshold = 4,   # Hamming distance <= 4 (standard for 64-bit pHash)
    val_size        = 0.15,
    test_size       = 0.15,
    random_state    = 42,
    path_col        = 'filepath',
    label_col       = 'label',
    save_splits     = True,
    output_dir      = '/kaggle/working/preprocessed'
)
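To make `phash_threshold` concrete: an 8x8 hash has 64 bits, written as 16 hexadecimal characters, and two images count as near-duplicates when at most 4 of those bits differ. The sketch below compares hex strings with the standard library only. The function names are hypothetical, and it assumes the hashes are available as hex strings like the ones in the `phash` column of the pipeline output.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Count the differing bits between two equal-length hex-encoded hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def is_near_duplicate(hex_a: str, hex_b: str, threshold: int = 4) -> bool:
    """Apply the cutoff from the config: distance <= threshold means near-duplicate."""
    return hamming_distance(hex_a, hex_b) <= threshold


# Two hashes that differ only in their last bit.
print(hamming_distance("bec1c12fce3cc831", "bec1c12fce3cc830"))   # 1
print(is_near_duplicate("bec1c12fce3cc831", "bec1c12fce3cc830"))  # True
```

A lower threshold treats fewer image pairs as duplicates, while a higher one removes more images at the risk of merging distinct beans that merely look alike.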

---
## ▶️ 5. Run Preprocessing

Two options are available:

| Option | When to Use |
|--------|-------------|
| **Option A**: Full Pipeline | Run all steps at once (recommended) |
| **Option B**: Step-by-Step | Run one step at a time when you need finer control |

### ▶️ Option A: Full Pipeline *(Recommended)*

In [5]:
pipeline = PreprocessingPipeline(config)
train_df, val_df, test_df, class_info = pipeline.run()

[INFO] Loading metadata from: /kaggle/input/datasets/arproject01/metadata/coffee_metadata.csv
[INFO] Metadata loaded: 1211 valid records.
[INFO] === Duplicate Detection Started ===
[INFO] [1/2] Computing MD5 hashes for exact duplicate detection...
[INFO]     Exact duplicates removed : 11
[INFO]     Remaining records        : 1200
[INFO] [2/2] Computing 64-bit pHash (DCT) for near-duplicate detection...
[INFO]     Hash size  : 8x8 = 64-bit
[INFO]     Threshold  : Hamming distance <= 4
[INFO]     Near-duplicates removed : 73
[INFO]     Remaining records       : 1127
[INFO] === Duplicate Detection Completed ===
[INFO] Classes found (4): ['defect', 'longberry', 'peaberry', 'premium']
[INFO] Mapping: {'defect': 0, 'longberry': 1, 'peaberry': 2, 'premium': 3}
[INFO] Split complete - Train: 805 (71.4%) | Val: 161 (14.3%) | Test: 161 (14.3%)
[INFO] Class distribution per split:
[INFO] Split CSVs saved to: /kaggle/working/preprocessed
[INFO] ✅ Preprocessing pipeline completed successfully.


           train  val  test
label                      
defect       211   43    43
longberry    195   39    39
peaberry     199   40    39
premium      200   39    40


### 🔧 Option B: Step-by-Step *(Advanced)*

In [6]:
# --- Step 1: Load your own DataFrame ---
# df = pd.read_csv(config.metadata_path)

# --- Step 2: Remove duplicates ---
# detector = DuplicateDetector(config)
# df_clean = detector.run(df)
# # Or run them separately:
# df_no_exact = detector.remove_exact_duplicates(df)
# df_clean    = detector.remove_near_duplicates(df_no_exact)

# --- Step 3: Encode labels ---
# label_proc = LabelProcessor(config)
# df_encoded = label_proc.fit_transform(df_clean)
# class_info = label_proc.get_class_info()

# --- Step 4: Split ---
# splitter = DataSplitter(config)
# train_df, val_df, test_df = splitter.split(df_encoded)

---
## 📊 6. Results Inspection

Verify the pipeline output before moving on to modeling.

### Split Summary

In [7]:
# --- Split summary ---
total = len(train_df) + len(val_df) + len(test_df)
print("=" * 42)
print("          SPLIT SUMMARY")
print("=" * 42)
print(f"  Train : {len(train_df):>5} samples  ({len(train_df)/total:.1%})")
print(f"  Val   : {len(val_df):>5} samples  ({len(val_df)/total:.1%})")
print(f"  Test  : {len(test_df):>5} samples  ({len(test_df)/total:.1%})")
print(f"  Total : {total:>5} samples")
print()
print("=" * 42)
print("          CLASS INFO")
print("=" * 42)
print(f"  Num classes : {class_info['num_classes']}")
print(f"  Classes     : {class_info['class_names']}")
print(f"  Encoding    : {class_info['class_to_idx']}")

          SPLIT SUMMARY
  Train :   805 samples  (71.4%)
  Val   :   161 samples  (14.3%)
  Test  :   161 samples  (14.3%)
  Total :  1127 samples

          CLASS INFO
  Num classes : 4
  Classes     : ['defect', 'longberry', 'peaberry', 'premium']
  Encoding    : {'defect': 0, 'longberry': 1, 'peaberry': 2, 'premium': 3}
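The encoding above assigns codes in alphabetical order of the class names, which is how scikit-learn's `LabelEncoder` behaves (it sorts the distinct labels). The same mapping can be reproduced with a standard-library sketch. The helper name `build_class_mapping` is hypothetical and is not the repository's `LabelProcessor`.

```python
def build_class_mapping(labels):
    """Map each distinct label to an integer code, in sorted order of the names."""
    class_names = sorted(set(labels))
    return {name: idx for idx, name in enumerate(class_names)}


labels = ["premium", "defect", "peaberry", "longberry", "defect"]
class_to_idx = build_class_mapping(labels)
encoded = [class_to_idx[name] for name in labels]

print(class_to_idx)  # {'defect': 0, 'longberry': 1, 'peaberry': 2, 'premium': 3}
print(encoded)       # [3, 0, 2, 1, 0]
```

Because the codes depend only on the sorted set of names, the mapping stays stable across runs as long as the same classes are present.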


### Class Distribution per Split

In [8]:
# --- Class distribution per split ---
print("Class Distribution per Split:")
dist = pd.DataFrame({
    'train': train_df['label'].value_counts(),
    'val'  : val_df['label'].value_counts(),
    'test' : test_df['label'].value_counts()
}).fillna(0).astype(int)
display(dist)

Class Distribution per Split:


| label | train | val | test |
|-------|-------|-----|------|
| defect | 211 | 43 | 43 |
| longberry | 195 | 39 | 39 |
| peaberry | 199 | 40 | 39 |
| premium | 200 | 39 | 40 |


### Preview Train DataFrame

In [9]:
# --- Preview train DataFrame ---
print("Train DataFrame Preview:")
train_df.head()

Train DataFrame Preview:


|   | filepath | label | md5 | phash | phash_group | encoded_label |
|---|----------|-------|-----|-------|-------------|---------------|
| 0 | /kaggle/input/datasets/ardiyanto24/coffee-bean... | defect | 8d10ced69c5a51cbaafba5ee4e4adfdf | bec1c12fce3cc831 | 2 | 0 |
| 1 | /kaggle/input/datasets/ardiyanto24/coffee-bean... | defect | 930569c68bdd3977e3532552524ef2a9 | e7a3986469996665 | 3 | 0 |
| 2 | /kaggle/input/datasets/ardiyanto24/coffee-bean... | defect | aea9bbc373c1c548fe70a84fe82de43a | e1929e6ed9386592 | 4 | 0 |
| 3 | /kaggle/input/datasets/ardiyanto24/coffee-bean... | defect | 3c5c0c2cd28968f1838f8dd762ac2b9d | b8e3cd8e66383338 | 5 | 0 |
| 4 | /kaggle/input/datasets/ardiyanto24/coffee-bean... | defect | bf802fd12aeb66eff715609919194a57 | b8c6c53bce38319c | 6 | 0 |
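The `phash_group` column suggests that images are assigned to groups based on their perceptual hashes, and that those groups drive the group-aware split. How the repository computes the column is not shown in this notebook. The sketch below is one plausible approach, under the assumption that hashes within the Hamming threshold are linked transitively using a union-find structure; `group_by_hamming` is a hypothetical name.

```python
from itertools import combinations


def group_by_hamming(hashes, threshold=4):
    """Assign a group id to each hex hash; hashes linked by distance <= threshold share an id."""
    parent = list(range(len(hashes)))

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    values = [int(h, 16) for h in hashes]
    for i, j in combinations(range(len(values)), 2):
        if bin(values[i] ^ values[j]).count("1") <= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[max(root_i, root_j)] = min(root_i, root_j)

    # Relabel roots as consecutive ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(hashes))]


hashes = ["0000000000000000", "0000000000000001", "0000000000000003", "ffffffffffffffff"]
print(group_by_hamming(hashes))  # [0, 0, 0, 1]
```

The pairwise loop is quadratic in the number of images, which is acceptable for roughly a thousand records but would need an index structure for much larger datasets.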


---
## 💾 7. Save Final CSV

Save the final CSVs, which contain only `filepath` and `encoded_label`.
These are the files that will be attached to Notebook 02 as training input.

In [10]:
# --- Save final metadata (filepath + encoded_label only) ---
for split_name, split_df in [('train', train_df), ('val', val_df), ('test', test_df)]:
    final = split_df[[config.path_col, 'encoded_label']].copy()
    final.to_csv(f"{config.output_dir}/{split_name}_final.csv", index=False)
    print(f"{split_name}_final.csv saved - {len(final)} records")

train_final.csv saved - 805 records
val_final.csv saved - 161 records
test_final.csv saved - 161 records
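Before the files are attached to Notebook 02, a quick sanity check can confirm that no file path appears in more than one split. Notebook 02 is not shown here, so the sketch below is only an illustration. It assumes the two column names written above (`filepath` and `encoded_label`); `parse_split` and `overlapping_paths` are hypothetical helpers.

```python
import csv


def parse_split(lines):
    """Parse CSV lines (an open file or a list of strings) into (filepath, label) rows."""
    return [(row["filepath"], int(row["encoded_label"])) for row in csv.DictReader(lines)]


def overlapping_paths(*splits):
    """Return the file paths that occur in more than one split."""
    seen, overlap = set(), set()
    for rows in splits:
        paths = {path for path, _ in rows}
        overlap |= seen & paths
        seen |= paths
    return overlap


# In-memory stand-ins for the saved files.
train_rows = parse_split(["filepath,encoded_label", "a.jpg,0", "b.jpg,3"])
val_rows = parse_split(["filepath,encoded_label", "c.jpg,1"])
test_rows = parse_split(["filepath,encoded_label", "d.jpg,2"])

print(overlapping_paths(train_rows, val_rows, test_rows))  # set()
```

With real files, the same parser can be fed an open handle, for example `with open(path, newline="") as fh: rows = parse_split(fh)`.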
