# Potato quality determination pipeline

This notebook contains the end-to-end pipeline that will be deployed in industry to assess potato quality.

Inputs ("raw" data measured by the optical setup):
- Paths to the RGB images of the potatoes under study.
- NIR value of each potato and its corresponding reference value.

Outputs:
- Dataset with the results for each potato in the columns "Defecte" (which holds the string "Potato" when the potato is not defective) and "MS_predita", the dry matter predicted by the previously trained neural network (NaN when the potato is defective).
- Analysis of the overall quality of the lot: the percentage of defective potatoes, the percentage of each defect type, and a breakdown of lot quality in terms of dry matter.

## Initialization and imports

In [1]:
from pathlib import Path
import os
import sys
import pandas as pd
import numpy as np
from PIL import Image

# Walk up the directory tree to the repository root ("potato-dry-matter-optics-ml") and make it the working directory
ROOT = Path().resolve()
while ROOT.name != "potato-dry-matter-optics-ml" and ROOT.parent != ROOT:
    ROOT = ROOT.parent
os.chdir(ROOT)
sys.path.append(str(ROOT))

DATA_RAW_CSV = ROOT / "data/input/raw/raw_dataset_definitive.csv"
RAW_IMG_DIR = ROOT / "data/input/raw/raw_images/definitive"
MODEL_PATH = ROOT / "data/output/train/test_run_definitive_1/model_prediccio_ms.h5"
SCALER_PATH = ROOT / "data/output/train/test_run_definitive_1/scaler_X.pkl"
OUTPUT_PATH = ROOT / "data/output/pipeline/pipeline_output_definitive_1.csv"

# Remaining imports (they require ROOT to be on sys.path)
from src.raw_image_treatment import (
    apply_brightness_and_gamma,
    apply_sigmoid,
    potato_defect_classification,
    potato_pixels_rgb_img,
    potato_filter_extreme_colours,
    nir_scalation,
)
from src.dry_matter import load_model_and_scaler, predict_dm, dry_matter_quality_classification
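
The root-finding loop in this cell stops at the filesystem root when no ancestor has the expected name, and `os.chdir` then moves there without any warning. A minimal sketch of a stricter variant follows; `find_project_root` is a hypothetical helper, not part of the repository, and it raises instead of falling through:

```python
from pathlib import Path


def find_project_root(start: Path, name: str) -> Path:
    """Return the closest ancestor of `start` (or `start` itself) called `name`.

    Hypothetical helper: raises FileNotFoundError instead of silently
    ending up at the filesystem root.
    """
    current = start.resolve()
    for candidate in (current, *current.parents):
        if candidate.name == name:
            return candidate
    raise FileNotFoundError(f"No ancestor of {current} is named {name!r}")
```

With this helper the cell could use `ROOT = find_project_root(Path(), "potato-dry-matter-optics-ml")`, so a notebook launched from outside the repository fails at the first cell rather than at the first missing file.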



## Input

In [2]:
# CSV with columns: [id_mostra, ruta_imatges, canal_NIR_raw, ref_NIR]
raw_df = pd.read_csv(DATA_RAW_CSV, sep=",", decimal=".")

required_cols = ["id_mostra", "ruta_imatges", "canal_NIR_raw", "ref_NIR"]
missing = [c for c in required_cols if c not in raw_df.columns]
if missing:
    raise ValueError(f"Missing columns in the CSV: {missing}")

def _to_float(val):
    # Accept decimal commas ("5258,1"); anything unparseable becomes NaN
    return pd.to_numeric(str(val).replace(",", "."), errors="coerce")

# --- type cleanup ---
raw_df["id_mostra"] = pd.to_numeric(raw_df["id_mostra"], errors="coerce").astype("Int64")
raw_df["canal_NIR_raw"] = raw_df["canal_NIR_raw"].apply(_to_float)
raw_df["ref_NIR"] = raw_df["ref_NIR"].apply(_to_float)
raw_df["ruta_imatges"] = raw_df["ruta_imatges"].apply(lambda p: Path(str(p)).name)

# --- keep only these id_mostra values ---
ID_MOSTRA_KEEP = [36, 48, 49, 81, 94, 99, 141, 142, 149, 174, 175, 176]

raw_df = raw_df[required_cols].copy().set_index("id_mostra", drop=True)
raw_df = raw_df.loc[raw_df.index.isin(ID_MOSTRA_KEEP)].copy().sort_index()

# Sanity check: report any requested id that is missing from the CSV
found_ids = set(raw_df.index.dropna().astype(int).tolist())
missing_ids = sorted(set(ID_MOSTRA_KEEP) - found_ids)
print(f"Rows found: {len(raw_df)} / {len(ID_MOSTRA_KEEP)}")
if missing_ids:
    print("id_mostra NOT found in the CSV:", missing_ids)

raw_df.head()

Rows found: 12 / 12


id_mostra,ruta_imatges,canal_NIR_raw,ref_NIR
36,p3_1.png,57198,14713.0
48,p3_13.png,57821,20512.0
49,p3_14.png,50115,23576.0
81,p4_16.png,52236,15765.0
94,p4_29.png,54451,18070.0
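
The `_to_float` helper in this cell swaps decimal commas for points before handing the value to `pd.to_numeric`. A small sketch of how that coercion behaves on a few made-up inputs is given below; the example strings are illustrative and do not come from the dataset:

```python
import pandas as pd


def to_float(val):
    """Same logic as the notebook's _to_float: decimal comma -> point, bad values -> NaN."""
    return pd.to_numeric(str(val).replace(",", "."), errors="coerce")


# Illustrative inputs only (not taken from the CSV)
examples = ["57198", "5258,1", "5258.1", "1.234,5", "n/a", None]
for raw in examples:
    print(f"{raw!r:>10} -> {to_float(raw)}")
```

Note that a value with both a thousands separator and a decimal comma, such as `"1.234,5"`, turns into `"1.234.5"` and is coerced to NaN rather than parsed, so it is worth counting the NaNs in `canal_NIR_raw` and `ref_NIR` after the cleanup.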


## Code

### 1. Visual classification

In [3]:
# Use apply_brightness_and_gamma and apply_sigmoid (for classification only; the filtered images are not used afterwards), then potato_defect_classification
def classify_image(row):
    img_path = RAW_IMG_DIR / Path(row["ruta_imatges"])
    if not img_path.exists():
        return pd.Series({"Defecte": "Unable to classify", "Confianca": 0.0, "es_defecte": True})

    bg_img = apply_brightness_and_gamma(img_path, brightness=1, gamma=1.1)
    sigmoid_img = apply_sigmoid(bg_img, k=6.0, mid=0.5, normalize=True)
    try:
        defect, conf, _ = potato_defect_classification(image=sigmoid_img, confidence_threshold=0.2)
    except Exception:
        defect, conf = "Unable to classify", 0.0

    # ("Potato") is a plain string, not a tuple, so `not in` was a substring test; compare directly instead
    es_defecte = defect != "Potato"
    return pd.Series({"Defecte": defect, "Confianca": float(conf), "es_defecte": bool(es_defecte)})

classif_df = raw_df.apply(classify_image, axis=1)
raw_df = pd.concat([raw_df, classif_df], axis=1)
raw_df

id_mostra,ruta_imatges,canal_NIR_raw,ref_NIR,Defecte,Confianca,es_defecte
36,p3_1.png,57198,14713.0,Potato,0.889392,False
48,p3_13.png,57821,20512.0,Potato,0.793315,False
49,p3_14.png,50115,23576.0,Potato,0.843536,False
81,p4_16.png,52236,15765.0,Potato,0.778394,False
94,p4_29.png,54451,18070.0,Potato,0.813266,False
99,p4_34.png,48077,19790.0,Potato,0.828074,False
141,p5_4.png,46624,5258.1,Sprouted potato,0.793834,True
142,p5_5.png,62986,21061.0,Sprouted potato,0.737746,True
149,p5_12.png,60615,10170.0,Diseased-fungal potato,0.492284,True
174,p6_3.png,59031,15454.0,Diseased-fungal potato,0.785016,True
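
The source of `apply_sigmoid` is not shown in this notebook. As a rough illustration of what a sigmoid contrast curve with the parameters `k`, `mid` and `normalize` could look like, the sketch below operates on intensities scaled to [0, 1]. The function name `sigmoid_contrast` and the exact formula are assumptions, not the implementation in `src.raw_image_treatment`:

```python
import numpy as np


def sigmoid_contrast(img01, k=6.0, mid=0.5, normalize=True):
    """Hypothetical sigmoid contrast curve on intensities in [0, 1].

    k controls the steepness and mid the intensity that maps to the centre
    of the curve. With normalize=True the output is rescaled so that
    0 maps to 0 and 1 maps to 1.
    """
    x = np.clip(np.asarray(img01, dtype=np.float64), 0.0, 1.0)
    s = 1.0 / (1.0 + np.exp(-k * (x - mid)))
    if normalize:
        lo = 1.0 / (1.0 + np.exp(k * mid))           # curve value at x = 0
        hi = 1.0 / (1.0 + np.exp(-k * (1.0 - mid)))  # curve value at x = 1
        s = (s - lo) / (hi - lo)
    return s


ramp = np.linspace(0.0, 1.0, 5)
print(np.round(sigmoid_contrast(ramp), 3))
```

A curve of this kind darkens values below `mid` and brightens values above it, which tends to increase the contrast between the potato surface and any blemishes before classification.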


### 2. Preprocessing of the "raw" data of non-defective potatoes

Compute the mean and standard deviation of the R, G and B channels

In [4]:
def _rgb_mean_std_ignore_black(pil_img) -> tuple:
    arr = np.asarray(pil_img, dtype=np.uint8)
    if arr.ndim != 3 or arr.shape[2] != 3:
        raise ValueError(f"Expected an RGB image, got array shape={arr.shape}")
    mask = np.any(arr != 0, axis=2)  # ignore pure black pixels (0,0,0)
    if mask.sum() == 0:
        return (np.nan, np.nan, np.nan, np.nan, np.nan, np.nan)
    vals = arr[mask]  # (N,3)
    mean = vals.mean(axis=0)
    std  = vals.std(axis=0)
    return (*mean.tolist(), *std.tolist())

def extract_rgb_stats(row):
    img_path = RAW_IMG_DIR / Path(row["ruta_imatges"])
    if not img_path.exists():
        return pd.Series({
            "color_promig_R": np.nan,
            "color_promig_G": np.nan,
            "color_promig_B": np.nan,
            "desviacio_R": np.nan,
            "desviacio_G": np.nan,
            "desviacio_B": np.nan,
        })

    # Potato area: potato_pixels_rgb_img
    try:
        crop_img, _ = potato_pixels_rgb_img(img_path, margin=35)
    except Exception:
        print(f"Warning: could not segment the potato in image {img_path}; using the full image.")
        crop_img = None
    base_img = crop_img if crop_img is not None else Image.open(img_path).convert("RGB")

    # Remove extreme values: potato_filter_extreme_colours
    try:
        filt_out = potato_filter_extreme_colours(base_img, margin=40, ignore_black=True)
        filtered_img = filt_out[0] if isinstance(filt_out, tuple) else filt_out
    except Exception:
        print(f"Warning: could not filter extreme colours in image {img_path}; using the unfiltered image.")
        filtered_img = base_img

    # Mean and standard deviation, ignoring black pixels
    mean_r, mean_g, mean_b, std_r, std_g, std_b = _rgb_mean_std_ignore_black(filtered_img)

    return pd.Series({
        "color_promig_R": float(mean_r) if pd.notna(mean_r) else np.nan,
        "color_promig_G": float(mean_g) if pd.notna(mean_g) else np.nan,
        "color_promig_B": float(mean_b) if pd.notna(mean_b) else np.nan,
        "desviacio_R": float(std_r) if pd.notna(std_r) else np.nan,
        "desviacio_G": float(std_g) if pd.notna(std_g) else np.nan,
        "desviacio_B": float(std_b) if pd.notna(std_b) else np.nan,
    })

no_defectes = raw_df[~raw_df["es_defecte"]].copy()
rgb_df = no_defectes.apply(extract_rgb_stats, axis=1)
no_defectes = pd.concat([no_defectes, rgb_df], axis=1)
no_defectes.head()

id_mostra,ruta_imatges,canal_NIR_raw,ref_NIR,Defecte,Confianca,es_defecte,color_promig_R,color_promig_G,color_promig_B,desviacio_R,desviacio_G,desviacio_B
36,p3_1.png,57198,14713.0,Potato,0.889392,False,100.422562,71.011757,41.187936,22.923658,18.311081,12.131452
48,p3_13.png,57821,20512.0,Potato,0.793315,False,103.601126,71.920633,42.125666,22.393152,17.553958,12.130381
49,p3_14.png,50115,23576.0,Potato,0.843536,False,100.964289,69.938758,39.626931,22.597088,17.456802,11.306049
81,p4_16.png,52236,15765.0,Potato,0.778394,False,96.74225,68.960057,43.117154,20.971719,16.433083,12.036605
94,p4_29.png,54451,18070.0,Potato,0.813266,False,97.156951,69.224191,41.912914,23.525322,18.583422,12.409788
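
To see why `_rgb_mean_std_ignore_black` masks out pure black pixels, consider a tiny synthetic image in which half of the pixels are masked-out background. The 2x2 array below is made up for illustration and uses the same `np.any(arr != 0, axis=2)` mask as the cell above:

```python
import numpy as np

# Synthetic 2x2 RGB image: two "potato" pixels and two black background pixels
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = (100, 70, 40)
img[0, 1] = (110, 80, 50)

mask = np.any(img != 0, axis=2)        # True where at least one channel is non-zero
vals = img[mask]                       # shape (N, 3): only the non-black pixels
masked_mean = vals.mean(axis=0)
masked_std = vals.std(axis=0)          # population std (ddof=0), as in the notebook
naive_mean = img.reshape(-1, 3).mean(axis=0)

print("masked mean:", masked_mean)     # [105.  75.  45.]
print("masked std: ", masked_std)      # [5. 5. 5.]
print("naive mean: ", naive_mean)      # [52.5 37.5 22.5]
```

Without the mask, the background halves every channel mean, so the features fed to the network would depend on how much of the frame the potato fills rather than on its colour.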


Scale the NIR value using the reference value

In [5]:
# Use the nir_scalation function
if not no_defectes.empty:
    no_defectes["canal_NIR"] = no_defectes.apply(lambda r: nir_scalation(r["canal_NIR_raw"], r["ref_NIR"]), axis=1)
no_defectes.head()

id_mostra,ruta_imatges,canal_NIR_raw,ref_NIR,Defecte,Confianca,es_defecte,color_promig_R,color_promig_G,color_promig_B,desviacio_R,desviacio_G,desviacio_B,canal_NIR
36,p3_1.png,57198,14713.0,Potato,0.889392,False,100.422562,71.011757,41.187936,22.923658,18.311081,12.131452,3.887582
48,p3_13.png,57821,20512.0,Potato,0.793315,False,103.601126,71.920633,42.125666,22.393152,17.553958,12.130381,2.818887
49,p3_14.png,50115,23576.0,Potato,0.843536,False,100.964289,69.938758,39.626931,22.597088,17.456802,11.306049,2.125679
81,p4_16.png,52236,15765.0,Potato,0.778394,False,96.74225,68.960057,43.117154,20.971719,16.433083,12.036605,3.313416
94,p4_29.png,54451,18070.0,Potato,0.813266,False,97.156951,69.224191,41.912914,23.525322,18.583422,12.409788,3.013337
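
The body of `nir_scalation` is not shown here, but the values printed by the notebook match a plain ratio `canal_NIR_raw / ref_NIR` to the displayed precision. The sketch below checks this against the six non-defective samples; `nir_ratio` is a hypothetical stand-in, and the real function may do more (for example, input validation):

```python
import math


def nir_ratio(raw: float, ref: float) -> float:
    """Hypothetical stand-in for nir_scalation: plain ratio raw / ref."""
    if ref == 0 or math.isnan(ref) or math.isnan(raw):
        return float("nan")
    return raw / ref


# (id_mostra, canal_NIR_raw, ref_NIR, canal_NIR as printed by the notebook)
rows = [
    (36, 57198, 14713.0, 3.887582),
    (48, 57821, 20512.0, 2.818887),
    (49, 50115, 23576.0, 2.125679),
    (81, 52236, 15765.0, 3.313416),
    (94, 54451, 18070.0, 3.013337),
    (99, 48077, 19790.0, 2.429358),
]

for sample_id, raw, ref, printed in rows:
    print(sample_id, round(nir_ratio(raw, ref), 6), printed)
```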


Preview of the preprocessed dataset (input to the algorithm)

In [6]:
# Columns: [id_mostra (index), ruta_imatges, canal_NIR, color_promig_R, color_promig_G, color_promig_B, desviacio_R, desviacio_G, desviacio_B]
preprocessed_df = no_defectes[[
    "ruta_imatges",
    "canal_NIR",
    "color_promig_R",
    "color_promig_G",
    "color_promig_B",
    "desviacio_R",
    "desviacio_G",
    "desviacio_B",
]].copy() if not no_defectes.empty else pd.DataFrame()

preprocessed_df.head()

id_mostra,ruta_imatges,canal_NIR,color_promig_R,color_promig_G,color_promig_B,desviacio_R,desviacio_G,desviacio_B
36,p3_1.png,3.887582,100.422562,71.011757,41.187936,22.923658,18.311081,12.131452
48,p3_13.png,2.818887,103.601126,71.920633,42.125666,22.393152,17.553958,12.130381
49,p3_14.png,2.125679,100.964289,69.938758,39.626931,22.597088,17.456802,11.306049
81,p4_16.png,3.313416,96.74225,68.960057,43.117154,20.971719,16.433083,12.036605
94,p4_29.png,3.013337,97.156951,69.224191,41.912914,23.525322,18.583422,12.409788


### 3. Calling the neural network to predict dry matter

In [7]:
# Use the load_model_and_scaler and predict_dm functions
feature_cols_data = [
    "color_promig_R",
    "color_promig_G",
    "color_promig_B",
    "desviacio_R",
    "desviacio_G",
    "desviacio_B",
    "canal_NIR",
]

preprocessed_df["MS_predita"] = np.nan
if not preprocessed_df.empty:
    try:
        model, scaler = load_model_and_scaler(str(MODEL_PATH), str(SCALER_PATH))
        features_array = preprocessed_df[feature_cols_data].to_numpy()
        preds = predict_dm(features_array, scaler, model)
        preprocessed_df["MS_predita"] = preds
    except Exception as exc:
        print(f"Could not generate the dry matter prediction: {exc}")

preprocessed_df

id_mostra,ruta_imatges,canal_NIR,color_promig_R,color_promig_G,color_promig_B,desviacio_R,desviacio_G,desviacio_B,MS_predita
36,p3_1.png,3.887582,100.422562,71.011757,41.187936,22.923658,18.311081,12.131452,21.173422
48,p3_13.png,2.818887,103.601126,71.920633,42.125666,22.393152,17.553958,12.130381,20.892393
49,p3_14.png,2.125679,100.964289,69.938758,39.626931,22.597088,17.456802,11.306049,22.460295
81,p4_16.png,3.313416,96.74225,68.960057,43.117154,20.971719,16.433083,12.036605,17.99996
94,p4_29.png,3.013337,97.156951,69.224191,41.912914,23.525322,18.583422,12.409788,22.143976
99,p4_34.png,2.429358,92.38575,66.552628,39.300368,22.726281,18.14837,11.978339,21.820742


## Output

Individual results

In [8]:
# Combine the initial defect analysis with the output of dry_matter_quality_classification
resultats = raw_df.merge(
    preprocessed_df[["MS_predita"]],
    left_index=True,
    right_index=True,
    how="left",
)
resultats.loc[resultats["es_defecte"], "MS_predita"] = np.nan
resultats["class_ms"] = resultats["MS_predita"].apply(
    lambda v: dry_matter_quality_classification(v) if pd.notna(v) else "descartada"
)

try:
    resultats[["ruta_imatges","Defecte", "MS_predita", "class_ms"]].to_csv(OUTPUT_PATH, index=True)
except Exception as exc:
    print(f"Could not save the pipeline results CSV: {exc}")

resultats[["ruta_imatges","Defecte", "MS_predita", "class_ms"]]

id_mostra,ruta_imatges,Defecte,MS_predita,class_ms
36,p3_1.png,Potato,21.173422,bona
48,p3_13.png,Potato,20.892393,bona
49,p3_14.png,Potato,22.460295,bona
81,p4_16.png,Potato,17.99996,descartada
94,p4_29.png,Potato,22.143976,bona
99,p4_34.png,Potato,21.820742,bona
141,p5_4.png,Sprouted potato,,descartada
142,p5_5.png,Sprouted potato,,descartada
149,p5_12.png,Diseased-fungal potato,,descartada
174,p6_3.png,Diseased-fungal potato,,descartada
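
The thresholds used by `dry_matter_quality_classification` are defined in `src.dry_matter` and are not shown in this notebook. The sketch below only illustrates the general shape of such a three-way classification. The cut-offs `18.0` and `20.0` are placeholders, chosen solely so that the rows above come out with the same labels, and `classify_dry_matter` is a hypothetical name:

```python
import math

# Placeholder cut-offs (percent dry matter). NOT taken from src.dry_matter.
DISCARD_BELOW = 18.0
GOOD_FROM = 20.0


def classify_dry_matter(ms_pct: float) -> str:
    """Hypothetical three-way classification using the notebook's label strings."""
    if ms_pct is None or math.isnan(ms_pct):
        return "descartada"      # no prediction available (defective potato)
    if ms_pct < DISCARD_BELOW:
        return "descartada"      # dry matter too low
    if ms_pct < GOOD_FROM:
        return "preu rebaixat"   # acceptable at a reduced price
    return "bona"


for value in (21.173422, 17.99996, float("nan")):
    print(value, "->", classify_dry_matter(value))
```

Note that in the results table "descartada" covers two different situations: a defective potato with no prediction, and a sound potato whose predicted dry matter is too low (sample 81).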


Lot results (when the pipeline has been run on more than one potato)

In [15]:
total = len(resultats)
defectes = int(resultats["es_defecte"].sum()) if total else 0
pct_defectes = round(100 * defectes / total, 2) if total else 0.0

per_defecte = (
    resultats.loc[resultats["es_defecte"]]
    .groupby("Defecte")
    .size()
    .div(total)
    .mul(100)
    .round(2)
)

qual_ms = (
    resultats.loc[~resultats["es_defecte"], "class_ms"]
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
)

resum_lot = pd.DataFrame([{
    "total": total,
    "defectuoses_%": pct_defectes,
    "ms_bona_%": float(qual_ms.get("bona", 0.0)),
    "ms_preu_rebaixat_%": float(qual_ms.get("preu rebaixat", 0.0)),
    "ms_descartada_%": float(qual_ms.get("descartada", 0.0)),
}])

display(resum_lot.style.hide(axis="index"))

if not per_defecte.empty:
    display(per_defecte.reset_index(name="percentatge_defecte").style.hide(axis="index"))


total,defectuoses_%,ms_bona_%,ms_preu_rebaixat_%,ms_descartada_%
12,50.0,83.33,0.0,16.67


Defecte,percentatge_defecte
Diseased-fungal potato,25.0
Sprouted potato,16.67
Unable to classify,8.33
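
The two summary tables use different denominators: the defect percentages are relative to the whole lot (12 potatoes), whereas the `ms_*_%` columns are relative to the non-defective potatoes only (6). The arithmetic can be reproduced with the standard library alone. The per-class counts below are inferred from the percentages above, since the per-sample table lists only ten of the twelve samples:

```python
from collections import Counter

# Counts inferred from the summary percentages (assumption, see lead-in)
defect_labels = (
    ["Potato"] * 6
    + ["Diseased-fungal potato"] * 3
    + ["Sprouted potato"] * 2
    + ["Unable to classify"] * 1
)
ms_classes = ["bona"] * 5 + ["descartada"] * 1   # non-defective potatoes only

total = len(defect_labels)
defect_counts = Counter(label for label in defect_labels if label != "Potato")

# Defect percentages: denominator is the whole lot
pct_defective = round(100 * sum(defect_counts.values()) / total, 2)
per_defect = {label: round(100 * n / total, 2) for label, n in defect_counts.items()}

# Dry matter percentages: denominator is the non-defective subset
ms_pct = {cls: round(100 * n / len(ms_classes), 2) for cls, n in Counter(ms_classes).items()}

print(total, pct_defective)   # 12 50.0
print(per_defect)
print(ms_pct)
```

Because of the different denominators, `ms_bona_%` of 83.33 means five of the six sound potatoes, that is, five of the twelve potatoes in the lot.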
