# Potato quality determination pipeline

This notebook contains the end-to-end pipeline that will be deployed in industry to assess potato quality.

Inputs ("raw" data measured by the optical setup):
- Paths to the RGB images of the potatoes under study.
- NIR value of each potato and its corresponding reference value.

Outputs:
- Dataset with the results for each potato in the columns "Defecte" (which holds the string "Potato" when the potato is not defective) and "MS_predita", the dry matter computed by the previously trained neural network (NaN if the potato is defective).
- Analysis of the overall quality of the batch, reporting the overall percentage of defective potatoes and the percentage of each defect type, together with a study of the batch quality in terms of dry matter.

## Initialization and imports

In [18]:
from pathlib import Path
import os
import sys
import pandas as pd
import numpy as np
from PIL import Image

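# walk up to the repository root, "potato-dry-matter-optics-ml", and make it the working directory
ROOT = Path().resolve()
while ROOT.name != "potato-dry-matter-optics-ml" and ROOT.parent != ROOT:
    ROOT = ROOT.parent
if ROOT.name != "potato-dry-matter-optics-ml":
    raise RuntimeError('Repository root "potato-dry-matter-optics-ml" not found above the current directory')
os.chdir(ROOT)
sys.path.append(str(ROOT))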

DATA_RAW_CSV = ROOT / "data/input/raw/raw_dataset_v1.csv"
RAW_IMG_DIR = ROOT / "data/input/raw/raw_images/test_1"
MODEL_PATH = ROOT / "data/output/test_run_2/model_prediccio_ms_final_MAPE_1.65.h5"
SCALER_PATH = ROOT / "data/output/test_run_2/scaler_X.pkl"

# remaining imports (they need ROOT on sys.path)
from src.raw_image_treatment import apply_brightness_and_gamma, potato_defect_classification, potato_pixels_rgb_img, potato_filter_extreme_colours, nir_scalation
from src.dry_matter import load_model_and_scaler, predict_dm, dry_matter_quality_classification
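
The root lookup in the cell above can be factored into a reusable helper that raises when the directory is not found. This is a minimal sketch; `find_project_root` is a hypothetical name and is not part of `src`:

```python
from pathlib import Path


def find_project_root(start: Path, name: str) -> Path:
    """Return the closest ancestor of `start` (or `start` itself) called `name`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if candidate.name == name:
            return candidate
    raise FileNotFoundError(f"No directory named {name!r} at or above {start}")
```

Under this sketch, the cell would call `find_project_root(Path(), "potato-dry-matter-optics-ml")` and fail early if the notebook is run from outside the repository.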

## Input

In [19]:
# CSV with columns: [id_mostra,ruta_imatges,canal_NIR_raw,ref_NIR]
raw_df = pd.read_csv(DATA_RAW_CSV, sep=";", decimal=",")
required_cols = ["id_mostra", "ruta_imatges", "canal_NIR_raw", "ref_NIR"]
missing = [c for c in required_cols if c not in raw_df.columns]
if missing:
    raise ValueError(f"Missing columns in the CSV: {missing}")

def _to_float(val):
    return pd.to_numeric(str(val).replace(",", "."), errors="coerce")

raw_df["id_mostra"] = pd.to_numeric(raw_df["id_mostra"], errors="coerce")
raw_df["canal_NIR_raw"] = raw_df["canal_NIR_raw"].apply(_to_float)
raw_df["ref_NIR"] = raw_df["ref_NIR"].apply(_to_float)
raw_df["ruta_imatges"] = raw_df["ruta_imatges"].apply(lambda p: Path(str(p)).name)

raw_df = raw_df[required_cols].copy().set_index("id_mostra", drop=False)
raw_df.head()

id_mostra (index),id_mostra,ruta_imatges,canal_NIR_raw,ref_NIR
1,1,p1_1.png,48118,22136.0
2,2,p1_2.png,16376,7349.0
3,3,p1_3.png,42628,16141.0
4,4,p1_4.png,33992,17399.0
5,5,p1_5.png,64510,24869.0
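
The loading step relies on `sep=";"` and `decimal=","` to parse a European-style CSV. The sketch below shows that behaviour on an in-memory file; the two sample rows are illustrative, not taken from `raw_dataset_v1.csv`:

```python
import io

import pandas as pd

REQUIRED_COLS = ["id_mostra", "ruta_imatges", "canal_NIR_raw", "ref_NIR"]

# Two illustrative rows in the layout the cell above expects (";" separator, "," decimals).
sample_csv = io.StringIO(
    "id_mostra;ruta_imatges;canal_NIR_raw;ref_NIR\n"
    "1;raw_images/test_1/p1_1.png;48118;22136,0\n"
    "2;raw_images/test_1/p1_2.png;16376;7349,5\n"
)

df = pd.read_csv(sample_csv, sep=";", decimal=",")

# Fail early if the file does not have the expected schema.
missing = [c for c in REQUIRED_COLS if c not in df.columns]
if missing:
    raise ValueError(f"Missing columns: {missing}")

print(df.dtypes)
```

With `decimal=","` the `ref_NIR` column is already parsed as floating point, so the later string replacement of commas only matters for values that `read_csv` could not convert.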


## Code

### 1. Visual classification

In [20]:
# Use apply_brightness_and_gamma (for classification only; the filtered images are not used afterwards) and potato_defect_classification
def classify_image(row):
    img_path = RAW_IMG_DIR / Path(row["ruta_imatges"])
    if not img_path.exists():
        return pd.Series({"Defecte": "Unable to classify", "Confianca": 0.0, "es_defecte": True})

    bright_img = apply_brightness_and_gamma(img_path, brightness=2.3, gamma=0.8)
    try:
        defect, conf, _ = potato_defect_classification(bright_img)
    except Exception:
        defect, conf = "Unable to classify", 0.0

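    # Anything other than a clean "Potato" result, including "Unable to classify",
    # is kept out of the dry-matter step, matching the missing-file branch above
    es_defecte = defect != "Potato"
    etiqueta = defect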
    return pd.Series({"Defecte": etiqueta, "Confianca": float(conf), "es_defecte": bool(es_defecte)})

classif_df = raw_df.apply(classify_image, axis=1)
raw_df = pd.concat([raw_df, classif_df], axis=1)
raw_df

id_mostra (index),id_mostra,ruta_imatges,canal_NIR_raw,ref_NIR,Defecte,Confianca,es_defecte
1,1,p1_1.png,48118,22136.0,Potato,0.925617,False
2,2,p1_2.png,16376,7349.0,Potato,0.89739,False
3,3,p1_3.png,42628,16141.0,Potato,0.917334,False
4,4,p1_4.png,33992,17399.0,Potato,0.852019,False
5,5,p1_5.png,64510,24869.0,Potato,0.899698,False
6,6,p1_6.png,46742,20116.0,Potato,0.931155,False
7,7,p1_7.png,38994,16236.0,Damaged potato,0.861069,True
8,8,p1_8.png,62964,19165.0,Potato,0.918368,False
9,9,p1_9.png,57633,22479.0,Potato,0.911738,False
10,10,p1_10.png,48462,17479.0,Potato,0.775514,False
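
The fallback logic of the classification cell (missing file or failing classifier yields "Unable to classify") can be isolated from the model itself. The sketch below uses a hypothetical `safe_classify` wrapper and takes the classifier as a plain callable; it assumes that only a clean "Potato" result should continue to the dry-matter step, as the missing-file branch of the cell does:

```python
from pathlib import Path
from typing import Callable, Tuple

UNCLASSIFIED = "Unable to classify"


def safe_classify(img_path: Path, classifier: Callable[[Path], Tuple[str, float]]) -> dict:
    """Run `classifier` and fall back to UNCLASSIFIED on a missing file or any error."""
    label, conf = UNCLASSIFIED, 0.0
    if img_path.exists():
        try:
            label, conf = classifier(img_path)
        except Exception:
            label, conf = UNCLASSIFIED, 0.0
    # Assumption: only a clean "Potato" result goes on to the dry-matter step.
    return {"Defecte": label, "Confianca": float(conf), "es_defecte": label != "Potato"}
```

Passing the classifier as an argument makes the fallback path testable with a stub, without loading the real defect model.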


### 2. Preprocessing the "raw" data of non-defective potatoes

Computing the per-channel means and standard deviations of R, G and B

In [21]:
def extract_rgb_stats(row):
    img_path = RAW_IMG_DIR / Path(row["ruta_imatges"])
    if not img_path.exists():
        return pd.Series({
            "color_promig_R": np.nan,
            "color_promig_G": np.nan,
            "color_promig_B": np.nan,
            "desviacio_R": np.nan,
            "desviacio_G": np.nan,
            "desviacio_B": np.nan,
        })
    
    # Potato region to work with: use potato_pixels_rgb_img
    try:
        crop_img, _ = potato_pixels_rgb_img(img_path, margin=35, min_conf=0.05)
    except Exception:
        crop_img = None
    base_img = crop_img if crop_img is not None else Image.open(img_path).convert("RGB")

    # Remove extreme values: use potato_filter_extreme_colours
    try:
        filtered_img, median_color = potato_filter_extreme_colours(base_img, margin=50, ignore_black=True)
    except Exception:
        filtered_img, median_color = base_img, (0.0, 0.0, 0.0)

    arr = np.asarray(filtered_img, dtype=np.float32)
    mask = ~(arr == 0).all(axis=2)
    pixels = arr[mask] if mask.any() else arr.reshape(-1, 3)

    # Per-channel mean and standard deviation (R, G, B)
    mean_r, mean_g, mean_b = pixels.mean(axis=0)
    std_r, std_g, std_b = pixels.std(axis=0)
    return pd.Series({
        "color_promig_R": float(mean_r),
        "color_promig_G": float(mean_g),
        "color_promig_B": float(mean_b),
        "desviacio_R": float(std_r),
        "desviacio_G": float(std_g),
        "desviacio_B": float(std_b),
    })

no_defectes = raw_df[~raw_df["es_defecte"]].copy()
rgb_df = no_defectes.apply(extract_rgb_stats, axis=1)
no_defectes = pd.concat([no_defectes.reset_index(drop=True), rgb_df.reset_index(drop=True)], axis=1)
no_defectes.head()

(index),id_mostra,ruta_imatges,canal_NIR_raw,ref_NIR,Defecte,Confianca,es_defecte,color_promig_R,color_promig_G,color_promig_B,desviacio_R,desviacio_G,desviacio_B
0,1,p1_1.png,48118,22136.0,Potato,0.925617,False,119.468292,86.049881,48.707893,38.74522,30.321312,19.407145
1,2,p1_2.png,16376,7349.0,Potato,0.89739,False,112.707443,83.91468,44.745617,43.461731,34.154488,20.509338
2,3,p1_3.png,42628,16141.0,Potato,0.917334,False,89.37278,68.277069,38.547157,30.402367,24.968317,17.570663
3,4,p1_4.png,33992,17399.0,Potato,0.852019,False,120.851791,88.724251,48.768379,37.240997,29.458384,18.746269
4,5,p1_5.png,64510,24869.0,Potato,0.899698,False,103.087921,75.050011,41.371349,33.667488,26.597757,16.590998
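
The statistics step ignores pure-black pixels (the background left by the crop and filter functions) before averaging. A small, self-contained sketch of that masking on a synthetic 2x2 image; `masked_rgb_stats` is a hypothetical helper name:

```python
import numpy as np


def masked_rgb_stats(arr: np.ndarray) -> dict:
    """Per-channel mean and std over the pixels that are not pure black."""
    arr = np.asarray(arr, dtype=np.float32)
    keep = ~(arr == 0).all(axis=2)  # True where at least one channel is non-zero
    pixels = arr[keep] if keep.any() else arr.reshape(-1, 3)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)  # population std (ddof=0), as in the cell above
    return {
        "mean": tuple(float(v) for v in mean),
        "std": tuple(float(v) for v in std),
    }


# 2x2 synthetic image: two black pixels (masked out) and two coloured ones.
img = np.array(
    [[[0, 0, 0], [100, 80, 40]],
     [[0, 0, 0], [120, 90, 50]]],
    dtype=np.uint8,
)
stats = masked_rgb_stats(img)
print(stats)
```

Only the two coloured pixels contribute, so the means are (110, 85, 45) rather than half those values.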


Scaling the NIR value with the reference value

In [22]:
# Use nir_scalation
if not no_defectes.empty:
    no_defectes["canal_NIR"] = no_defectes.apply(lambda r: nir_scalation(r["canal_NIR_raw"], r["ref_NIR"]), axis=1)
no_defectes.head()

(index),id_mostra,ruta_imatges,canal_NIR_raw,ref_NIR,Defecte,Confianca,es_defecte,color_promig_R,color_promig_G,color_promig_B,desviacio_R,desviacio_G,desviacio_B,canal_NIR
0,1,p1_1.png,48118,22136.0,Potato,0.925617,False,119.468292,86.049881,48.707893,38.74522,30.321312,19.407145,2.173744
1,2,p1_2.png,16376,7349.0,Potato,0.89739,False,112.707443,83.91468,44.745617,43.461731,34.154488,20.509338,2.22833
2,3,p1_3.png,42628,16141.0,Potato,0.917334,False,89.37278,68.277069,38.547157,30.402367,24.968317,17.570663,2.640976
3,4,p1_4.png,33992,17399.0,Potato,0.852019,False,120.851791,88.724251,48.768379,37.240997,29.458384,18.746269,1.953676
4,5,p1_5.png,64510,24869.0,Potato,0.899698,False,103.087921,75.050011,41.371349,33.667488,26.597757,16.590998,2.593992
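
The `canal_NIR` values in the table above are consistent with dividing `canal_NIR_raw` by `ref_NIR` (for example, 48118 / 22136 ≈ 2.1737). Whether `nir_scalation` does exactly this is an assumption, since its source is not shown here; under that assumption, the row-wise `apply` could be written as a vectorized column operation:

```python
import numpy as np
import pandas as pd

# First three rows of the table above.
df = pd.DataFrame({
    "canal_NIR_raw": [48118, 16376, 42628],
    "ref_NIR": [22136.0, 7349.0, 16141.0],
})

# Assumption: the scaling is a plain ratio. A zero reference becomes NaN instead of inf.
df["canal_NIR_ratio"] = df["canal_NIR_raw"] / df["ref_NIR"].replace(0, np.nan)
print(df)
```

Replacing a zero reference with NaN keeps an invalid measurement from producing an infinite feature value that would later reach the scaler.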


Displaying and saving the preprocessed dataset (input to the algorithm)

In [None]:
# CSV with columns: [id_mostra,ruta_imatges,canal_NIR,color_promig_R,color_promig_G,color_promig_B,desviacio_R,desviacio_G,desviacio_B]
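preprocessed_cols = [
    "id_mostra",
    "ruta_imatges",
    "canal_NIR",
    "color_promig_R",
    "color_promig_G",
    "color_promig_B",
    "desviacio_R",
    "desviacio_G",
    "desviacio_B",
]
# Keep the expected columns even when no potato passed the defect filter,
# so that later cells can still select "id_mostra"
preprocessed_df = (
    no_defectes[preprocessed_cols].copy()
    if not no_defectes.empty
    else pd.DataFrame(columns=preprocessed_cols)
)

output_path = ROOT / "data/input/processed/processed_dataset_pipeline.csv"
try:
    output_path.parent.mkdir(parents=True, exist_ok=True)
    preprocessed_df.to_csv(output_path, index=False)
except Exception as exc:
    print(f"Could not save the preprocessed CSV: {exc}")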

preprocessed_df.head()

(index),id_mostra,ruta_imatges,canal_NIR,color_promig_R,color_promig_G,color_promig_B,desviacio_R,desviacio_G,desviacio_B
0,1,p1_1.png,2.173744,119.468292,86.049881,48.707893,38.74522,30.321312,19.407145
1,2,p1_2.png,2.22833,112.707443,83.91468,44.745617,43.461731,34.154488,20.509338
2,3,p1_3.png,2.640976,89.37278,68.277069,38.547157,30.402367,24.968317,17.570663
3,4,p1_4.png,1.953676,120.851791,88.724251,48.768379,37.240997,29.458384,18.746269
4,5,p1_5.png,2.593992,103.087921,75.050011,41.371349,33.667488,26.597757,16.590998


### 3. Calling the neural network to predict dry matter

In [None]:
# Use load_model_and_scaler and predict_dm
feature_cols_data = [
    "color_promig_R",
    "color_promig_G",
    "color_promig_B",
    "desviacio_R",
    "desviacio_G",
    "desviacio_B",
    "canal_NIR",
]

preprocessed_df["MS_predita"] = np.nan
if not preprocessed_df.empty:
    try:
        model, scaler = load_model_and_scaler(str(MODEL_PATH), str(SCALER_PATH))
        features_array = preprocessed_df[feature_cols_data].to_numpy()
        preds = predict_dm(features_array, scaler, model)
        preprocessed_df["MS_predita"] = preds
    except Exception as exc:
        print(f"Could not generate the dry matter prediction: {exc}")

preprocessed_df

(index),id_mostra,ruta_imatges,canal_NIR,color_promig_R,color_promig_G,color_promig_B,desviacio_R,desviacio_G,desviacio_B,MS_predita
0,1,p1_1.png,2.173744,119.468292,86.049881,48.707893,38.74522,30.321312,19.407145,275.195862
1,2,p1_2.png,2.22833,112.707443,83.91468,44.745617,43.461731,34.154488,20.509338,287.197449
2,3,p1_3.png,2.640976,89.37278,68.277069,38.547157,30.402367,24.968317,17.570663,350.92807
3,4,p1_4.png,1.953676,120.851791,88.724251,48.768379,37.240997,29.458384,18.746269,237.408707
4,5,p1_5.png,2.593992,103.087921,75.050011,41.371349,33.667488,26.597757,16.590998,343.897766
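
The internals of `load_model_and_scaler` and `predict_dm` are not shown in this notebook. As a rough illustration only, the sketch below assumes the common pattern of scaling the features with a fitted scaler and flattening the model output to one value per row. `StubScaler`, `StubModel` and `predict_sketch` are hypothetical stand-ins, not the project's code:

```python
import numpy as np


class StubScaler:
    """Stand-in for a fitted scaler: standardizes with a fixed mean and scale."""

    def __init__(self, mean, scale):
        self.mean_ = np.asarray(mean, dtype=float)
        self.scale_ = np.asarray(scale, dtype=float)

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_


class StubModel:
    """Stand-in for a trained regressor: returns the row sums as an (n, 1) array."""

    def predict(self, X):
        return np.asarray(X).sum(axis=1, keepdims=True)


def predict_sketch(features, scaler, model):
    """Scale the features, predict, and flatten to one value per row."""
    scaled = scaler.transform(features)
    return np.asarray(model.predict(scaled)).reshape(-1)


X = np.array([[1.0, 2.0], [3.0, 4.0]])
preds = predict_sketch(X, StubScaler(mean=[1.0, 2.0], scale=[2.0, 2.0]), StubModel())
print(preds)
```

Flattening matters because many regressors return an `(n, 1)` array, while a DataFrame column expects one value per row. The feature order passed to the scaler must also match the order used at training time, which is why the cell fixes it in `feature_cols_data`.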


## Output

Individual results

In [29]:
# Combine the results of the initial defect analysis with those of dry_matter_quality_classification
base_df = raw_df.reset_index(drop=True)
resultats = base_df.merge(
    preprocessed_df[["id_mostra", "MS_predita"]],
    on="id_mostra",
    how="left",
).set_index("id_mostra", drop=False)
resultats.loc[resultats["es_defecte"], "MS_predita"] = np.nan
resultats["class_ms"] = resultats["MS_predita"].apply(
    lambda v: dry_matter_quality_classification(v) if pd.notna(v) else "descartada"
)

resultats[["id_mostra", "Defecte", "MS_predita", "class_ms"]]


id_mostra (index),id_mostra,Defecte,MS_predita,class_ms
1,1,Potato,275.195862,bona
2,2,Potato,287.197449,bona
3,3,Potato,350.92807,bona
4,4,Potato,237.408707,bona
5,5,Potato,343.897766,bona
6,6,Potato,299.229645,bona
7,7,Damaged potato,,descartada
8,8,Potato,457.164398,bona
9,9,Potato,338.286682,bona
10,10,Potato,375.476318,bona
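
The left merge is what gives defective potatoes a NaN in `MS_predita`: they have no row in the predictions table, so they are kept with a missing value and then labelled "descartada". A minimal sketch using samples 6 to 8 from the table above (predictions rounded); `classify_stub` is a placeholder, since the thresholds used by `dry_matter_quality_classification` are not shown:

```python
import pandas as pd

all_samples = pd.DataFrame({
    "id_mostra": [6, 7, 8],
    "Defecte": ["Potato", "Damaged potato", "Potato"],
    "es_defecte": [False, True, False],
})
# Predictions exist only for the non-defective samples.
predicted = pd.DataFrame({"id_mostra": [6, 8], "MS_predita": [299.2, 457.2]})


def classify_stub(value: float) -> str:
    """Placeholder for dry_matter_quality_classification; real thresholds are not shown."""
    return "bona"


# validate="one_to_one" raises if an id appears twice on either side.
merged = all_samples.merge(predicted, on="id_mostra", how="left", validate="one_to_one")
merged["class_ms"] = merged["MS_predita"].apply(
    lambda v: classify_stub(v) if pd.notna(v) else "descartada"
)
print(merged)
```

The `validate` argument is an optional safeguard: duplicated `id_mostra` values would otherwise silently multiply rows and distort the batch percentages.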


Batch results (when the pipeline has been run on more than one potato)

In [30]:
# Percentage of defective potatoes (with count/total) and the percentage of each defect type (dry matter outside the range also counts as a defect, within each category); every percentage should be accompanied by count/total

# Percentage of potatoes with dry matter within the optimal range (with count/total)
total = len(resultats)
defectes = int(resultats["es_defecte"].sum()) if total else 0
pct_defectes = round(100 * defectes / total, 2) if total else 0.0
per_defecte = (
    resultats[resultats["es_defecte"]]
    .groupby("Defecte")["id_mostra"]
    .count()
    .div(total)
    .mul(100)
    .round(2)
)
qual_ms = (
    resultats[~resultats["es_defecte"]]["class_ms"]
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
)

resum_lot = pd.DataFrame([
    {
        "total": total,
        "defectuoses_%": pct_defectes,
        "ms_bona_%": float(qual_ms.get("bona", 0.0)),
        "ms_preu_rebaixat_%": float(qual_ms.get("preu rebaixat", 0.0)),
        "ms_descartada_%": float(qual_ms.get("descartada", 0.0)),
    }
])

display(resum_lot)
if not per_defecte.empty:
    display(per_defecte.to_frame("percentatge_defecte"))

(index),total,defectuoses_%,ms_bona_%,ms_preu_rebaixat_%,ms_descartada_%
0,35,25.71,100.0,0.0,0.0


Defecte (index),percentatge_defecte
Damaged potato,20.0
Diseased-fungal potato,5.71
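
The comments in the cell ask for every percentage to be accompanied by its count over the total, which the summary tables do not show. One possible way to format that is sketched below. `pct_with_count` is a hypothetical helper, and the counts (7 and 2 out of 35) are those implied by the percentages above:

```python
import pandas as pd


def pct_with_count(count: int, total: int) -> str:
    """Format a share as 'percentage (count/total)'."""
    if total == 0:
        return "n/a (0/0)"
    return f"{100 * count / total:.2f}% ({count}/{total})"


# Counts implied by the summary above: 35 potatoes, 20.0% and 5.71% per defect type.
defect_counts = pd.Series({"Damaged potato": 7, "Diseased-fungal potato": 2})
total = 35

report = defect_counts.apply(lambda n: pct_with_count(int(n), total))
overall = pct_with_count(int(defect_counts.sum()), total)
print(report)
print(overall)
```

Keeping the raw counts next to the percentages makes small batches easier to interpret, since a single potato can shift a percentage by several points.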
