### Feature Engineering

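This step builds informative features for the Mat&Chem dataset from the per-property statistical aggregations (`_minimo`, `_maximo`, `_soma`, `_media`, `_desvio`: minimum, maximum, sum, mean, and standard deviation) and saves two files:
- a full enriched CSV with all original and new columns, and
- a selected CSV containing all new features, the original features that were not used as inputs to any new feature ("untouched"), and `phase`.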

**Engineered Features**

**Range:** `_range` = `_maximo` - `_minimo` per property (e.g., `atomic_ea_range`). It quantifies the spread within a material group, capturing compositional diversity that may predict phase.

**Coefficient of Variation (CoV):** `_desvio` / `_media` (e.g., `atomic_ea_cov`). It makes relative variability comparable across differently scaled properties (boiling point vs. enthalpy, etc.). The division is guarded and returns NaN if `_media` is 0.

**Min/Max ratio:** `_minimo` / `_maximo` (e.g., `atomic_ea_min_to_max_ratio`), indicating homogeneity (≈1) vs. a broad distribution (≈0). The division is guarded and returns NaN if `_maximo` is 0.
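
As a small worked illustration of the three ratio-style features above, the sketch below computes them on a toy frame. The numeric values and the helper name `ratio_or_nan` are made up for this example; only the column naming follows the dataset's convention.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical) for a single property, "atomic_ea".
toy = pd.DataFrame({
    "atomic_ea_minimo": [1.0, 2.0, 0.0],
    "atomic_ea_maximo": [4.0, 2.0, 0.0],
    "atomic_ea_media":  [2.0, 2.0, 0.0],
    "atomic_ea_desvio": [1.0, 0.0, 0.0],
})

def ratio_or_nan(numer, denom):
    """Element-wise numer / denom, with NaN wherever denom is exactly 0."""
    return (numer / denom).where(denom != 0)

toy["atomic_ea_range"] = toy["atomic_ea_maximo"] - toy["atomic_ea_minimo"]
toy["atomic_ea_cov"] = ratio_or_nan(toy["atomic_ea_desvio"], toy["atomic_ea_media"])
toy["atomic_ea_min_to_max_ratio"] = ratio_or_nan(
    toy["atomic_ea_minimo"], toy["atomic_ea_maximo"]
)

print(toy[["atomic_ea_range", "atomic_ea_cov", "atomic_ea_min_to_max_ratio"]])
```

The first row has a range of 3.0, a CoV of 0.5, and a ratio of 0.25. The second row is perfectly homogeneous (range 0.0, CoV 0.0, ratio 1.0). The third row has zero denominators, so both ratios come out as NaN rather than infinity.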

**Binary flags:**
- `has_zero_<prop>_desvio`: marks zero standard deviation (perfect homogeneity).
- `has_zero_<prop>_minimo`: marks true zeros or potential imputations, letting models learn their significance.
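
The flags can be sketched on a toy frame as follows. The values are invented for illustration, and `density_of_solid` is used only as an example property name.

```python
import pandas as pd

# Toy values (hypothetical); the last desvio entry is missing on purpose.
toy = pd.DataFrame({
    "density_of_solid_desvio": [0.0, 1.3, float("nan")],
    "density_of_solid_minimo": [2.1, 0.0, 5.4],
})

for suffix in ("desvio", "minimo"):
    col = f"density_of_solid_{suffix}"
    # 1 where the value is exactly zero, 0 otherwise (NaN compares unequal to 0).
    toy[f"has_zero_density_of_solid_{suffix}"] = (toy[col] == 0).astype(int)

print(toy)
```

Because a missing value never compares equal to zero, it is flagged as 0. If missing entries should be tracked as well, a separate indicator would be needed.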

**Synthesis–property interactions:** Multiplies key properties by the `syn_group_*` binary indicators to capture method-dependent effects. An interaction is created only when both input columns exist. Examples:
- `entalpia_oxidos_media_x_combustion`, `gibbs_oxidos_minimo_x_hydrothermal`
- `melting_point_media_x_solid_state`, `boiling_point_media_x_solvo`
- `atomic_en_paul_media_x_chemical_co_precipitation`, `atomic_radius_desvio_x_ball_mill`, `VEC_media_x_hydrothermal`
- `density_of_solid_desvio_x_mechanochemical`, `thermal_conduct_media_x_combustion`
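
A minimal sketch of one such interaction, assuming the `syn_group_*` columns are 0/1 indicators (the values below are invented):

```python
import pandas as pd

# Toy values (hypothetical): mean melting point and a 0/1 synthesis indicator.
toy = pd.DataFrame({
    "melting_point_media": [1200.0, 950.0, 1500.0],
    "syn_group_solid_state": [1, 0, 1],
})

# The product keeps the property value for solid-state rows and is 0 elsewhere.
toy["melting_point_media_x_solid_state"] = (
    toy["melting_point_media"] * toy["syn_group_solid_state"]
)

print(toy)
```

Under that 0/1 assumption, the new column equals the property for rows in the synthesis group and zero for all other rows, so a model can fit a separate effect of the property for that synthesis route.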



In [3]:
# JUPYTER CELL — feature engineering for Mat&Chem dataset
import re
import numpy as np
import pandas as pd
from pathlib import Path

# --------------------------------------------------------------------------------------
# 1) CONFIG
# --------------------------------------------------------------------------------------
DATASET_PATH = "data/dataset_deduplicated.csv"               # <-- set this if df not in memory
COLUMN_NAMES_PATH = "./data/Feature_Engineering/column_name.csv"      # optional helper
OUTPUT_PATH_FULL = "./data/Feature_Engineering/dataset_enriched.csv"  # full dataframe with everything
OUTPUT_PATH_SELECTED = "./data/Feature_Engineering/dataset_engineered.csv"  # only new + untouched + phase

# --------------------------------------------------------------------------------------
# 2) LOAD DATA
# --------------------------------------------------------------------------------------
if 'df' not in globals():
    df = pd.read_csv(DATASET_PATH)

# Keep a snapshot of original columns before we add new ones
original_cols = df.columns.tolist()

# Optionally load column name list (if provided) to help resolve exact names
col_catalog = None
if Path(COLUMN_NAMES_PATH).is_file():
    try:
        col_catalog = pd.read_csv(COLUMN_NAMES_PATH, header=None).iloc[:,0].astype(str).tolist()
    except Exception:
        col_catalog = None

# --------------------------------------------------------------------------------------
# 3) DISCOVER PROPERTIES WITH STANDARD SUFFIXES
# --------------------------------------------------------------------------------------
suffixes = ("minimo", "maximo", "soma", "media", "desvio")
pattern = re.compile(rf"^(?P<prop>.+)_(?P<suf>{'|'.join(suffixes)})$")

# Map: base property -> available suffix columns
prop_map = {}
for c in original_cols:
    m = pattern.match(c)
    if m:
        base = m.group("prop")
        suf  = m.group("suf")
        prop_map.setdefault(base, {})[suf] = c

# --------------------------------------------------------------------------------------
# 4) SAFE DIVIDE
# --------------------------------------------------------------------------------------
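def safe_divide(numer, denom):
    """Element-wise numer / denom as a pd.Series; NaN wherever denom == 0."""
    numer = numer.astype(float)
    denom = denom.astype(float)
    # .where keeps the quotient where the condition holds and puts NaN elsewhere
    return (numer / denom).where(denom != 0)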

# --------------------------------------------------------------------------------------
# 5) BUILD NEW FEATURES
# --------------------------------------------------------------------------------------
new_series = {}          # name -> pd.Series
new_names = []           # final names actually added (respecting collision handling)
used_input_cols = set()  # track *all* original columns used to create *any* new features

# 5.1 Range, CoV, Min/Max ratio for each property
for base, parts in prop_map.items():
    # Range
    if "maximo" in parts and "minimo" in parts:
        nm = f"{base}_range"
        ser = df[parts["maximo"]] - df[parts["minimo"]]
        new_series[nm] = ser
        used_input_cols.update([parts["maximo"], parts["minimo"]])
    # Coefficient of Variation (desvio / media)
    if "desvio" in parts and "media" in parts:
        nm = f"{base}_cov"
        ser = safe_divide(df[parts["desvio"]], df[parts["media"]])
        new_series[nm] = ser
        used_input_cols.update([parts["desvio"], parts["media"]])
    # Min/Max ratio
    if "minimo" in parts and "maximo" in parts:
        nm = f"{base}_min_to_max_ratio"
        ser = safe_divide(df[parts["minimo"]], df[parts["maximo"]])
        new_series[nm] = ser
        used_input_cols.update([parts["minimo"], parts["maximo"]])

# 5.2 Binary indicators for zero std and zero minimo
for base, parts in prop_map.items():
    if "desvio" in parts:
        col = parts["desvio"]
        nm = f"has_zero_{base}_desvio"
        ser = (df[col] == 0).astype(int)
        new_series[nm] = ser
        used_input_cols.add(col)
    if "minimo" in parts:
        col = parts["minimo"]
        nm = f"has_zero_{base}_minimo"
        ser = (df[col] == 0).astype(int)
        new_series[nm] = ser
        used_input_cols.add(col)

# --------------------------------------------------------------------------------------
# 6) SYNTHESIS INTERACTION FEATURES (only create if both inputs exist)
# --------------------------------------------------------------------------------------
def resolve_existing(candidates):
    """Return the first candidate that exists in df (exact match)."""
    for name in candidates:
        if name in df.columns:
            return name
    # Try the column catalog as a soft hint (but still require df presence)
    if col_catalog is not None:
        for name in candidates:
            if name in col_catalog and name in df.columns:
                return name
    return None

def add_interaction(feature_candidates, syn_group_name, out_name):
    base_col = resolve_existing(feature_candidates)
    if (base_col is not None) and (syn_group_name in df.columns):
        return out_name, df[base_col] * df[syn_group_name], base_col, syn_group_name
    return None

interactions_specs = [
    # 3.1 Thermodynamic / Energetic
    (["entalpia-oxidos_media", "entalpia_oxidos_media"], "syn_group_combustion", "entalpia_oxidos_media_x_combustion"),
    (["gibbs-oxidos_minimo",   "gibbs_oxidos_minimo"],   "syn_group_hydrothermal", "gibbs_oxidos_minimo_x_hydrothermal"),
    # 3.1.2 Melting point × solid-state
    (["melting_point_media"], "syn_group_solid_state", "melting_point_media_x_solid_state"),
    # 3.1.3 Boiling point × solvo
    (["boiling_point_media"], "syn_group_solvo", "boiling_point_media_x_solvo"),

    # 3.2 Atomic & Electronic
    (["atomic_en_paul_media"], "syn_group_chemical_co_precipitation", "atomic_en_paul_media_x_chemical_co_precipitation"),
    (["atomic_radius_desvio"], "syn_group_ball_mill", "atomic_radius_desvio_x_ball_mill"),
    (["VEC_media"],            "syn_group_hydrothermal", "VEC_media_x_hydrothermal"),

    # 3.3 Physical & Mechanical
    (["density_of_solid_desvio"], "syn_group_mechanochemical", "density_of_solid_desvio_x_mechanochemical"),
    (["thermal_conduct_media"],   "syn_group_combustion", "thermal_conduct_media_x_combustion"),
]

for feature_candidates, syn_name, out_name in interactions_specs:
    res = add_interaction(feature_candidates, syn_name, out_name)
    if res is not None:
        k, v, base_col, syn_col = res
        new_series[k] = v
        used_input_cols.update([base_col, syn_col])

# --------------------------------------------------------------------------------------
# 7) MERGE NEW FEATURES INTO DATAFRAME (avoid overwriting)
# --------------------------------------------------------------------------------------
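# Collect the new columns first and attach them with a single concat; inserting
# many columns one at a time fragments the frame and can trigger a PerformanceWarning.
new_cols = {}
for k, v in new_series.items():
    final_name = k if k not in df.columns else f"{k}__new"
    new_cols[final_name] = v
    new_names.append(final_name)
df = pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)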

# Ensure target "phase" exists
if "phase" not in df.columns:
    raise ValueError('Target column "phase" not found in the dataset. Please ensure it exists.')

# --------------------------------------------------------------------------------------
# 8) SAVE FULL ENRICHED DATAFRAME
# --------------------------------------------------------------------------------------
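Path(OUTPUT_PATH_FULL).parent.mkdir(parents=True, exist_ok=True)  # create the output folder if missing
df.to_csv(OUTPUT_PATH_FULL, index=False)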

# --------------------------------------------------------------------------------------
# 9) BUILD & SAVE THE "SELECTED" DATAFRAME
#     (only new features + untouched original features + phase)
# --------------------------------------------------------------------------------------
original_set = set(original_cols)
# Keep 'phase' regardless
used_input_cols.discard("phase")

# Untouched originals = originals that were NOT used to create any new feature
untouched_originals = sorted(list(original_set - used_input_cols))

# Assemble selected columns: untouched originals + new features + phase
selected_cols = untouched_originals + new_names
# Put 'phase' at the end (avoid duplication)
if "phase" in selected_cols:
    selected_cols = [c for c in selected_cols if c != "phase"]
selected_cols.append("phase")

df_selected = df[selected_cols].copy()
df_selected.to_csv(OUTPUT_PATH_SELECTED, index=False)

# --------------------------------------------------------------------------------------
# 10) REPORT
# --------------------------------------------------------------------------------------
print(f"✅ Created {len(new_names)} new features.")
print(f"🔢 Total columns (full): {df.shape[1]}  -> saved to: {OUTPUT_PATH_FULL}")
print(f"🧹 Untouched originals kept in 'selected': {len(untouched_originals)}")
print(f"📦 Columns in 'selected' (new + untouched + phase): {len(selected_cols)}  -> saved to: {OUTPUT_PATH_SELECTED}")

# Peek at a few names
print("\nFirst 10 new feature columns:")
print(new_names[:10])
print("\nFirst 10 untouched original columns kept:")
print(untouched_originals[:10])


✅ Created 229 new features.
🔢 Total columns (full): 460  -> saved to: ./data/Feature_Engineering/dataset_enriched.csv
🧹 Untouched originals kept in 'selected': 48
📦 Columns in 'selected' (new + untouched + phase): 277  -> saved to: ./data/Feature_Engineering/dataset_engineered.csv

First 10 new feature columns:
['atomic_ea_range', 'atomic_ea_cov', 'atomic_ea_min_to_max_ratio', 'atomic_en_allen _range', 'atomic_en_allen _cov', 'atomic_en_allen _min_to_max_ratio', 'atomic_en_allredroch_range', 'atomic_en_allredroch_cov', 'atomic_en_allredroch_min_to_max_ratio', 'atomic_en_paul_range']

First 10 untouched original columns kept:
['VEC_soma', 'atomic_ea_soma', 'atomic_ebe _soma', 'atomic_en_allen _soma', 'atomic_en_allredroch_soma', 'atomic_en_paul_soma', 'atomic_en_sanderson_soma', 'atomic_enc_soma', 'atomic_hatm_soma', 'atomic_hfu_soma']
