# **6. Final Production Pipeline (Optimized)**

## **6.1 Objective**
* **Goal:** Train the final production model using only the "All-Star" feature set, chosen from SHAP analysis plus business logic (Turkey interest).
* **Selection:** We reduced 30+ features down to the Top 12 that drive 90% of the predictions.
* **Output:** Save `final_model.pkl` and `feature_names.pkl` for the App.
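
The selection step above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual SHAP code: it assumes "90% of the predictions" means the cumulative share of mean absolute SHAP value, and the helper names (`rank_by_mean_abs_shap`, `smallest_set_covering`) and the toy numbers are hypothetical.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Return (feature, share, cumulative_share) tuples, largest share first."""
    n_rows = len(shap_rows)
    # Mean absolute SHAP value per feature (one column per feature).
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n_rows
        for j in range(len(feature_names))
    ]
    total = sum(mean_abs)
    ranked = sorted(zip(feature_names, mean_abs), key=lambda pair: pair[1], reverse=True)

    result, running = [], 0.0
    for name, value in ranked:
        share = value / total
        running += share
        result.append((name, share, running))
    return result


def smallest_set_covering(ranking, threshold=0.90):
    """Take top-ranked features until their cumulative share reaches the threshold."""
    chosen = []
    for name, _share, cumulative in ranking:
        chosen.append(name)
        if cumulative >= threshold:
            break
    return chosen


# Toy SHAP matrix (made-up numbers): 3 samples x 4 features.
toy_features = ["feat_a", "feat_b", "feat_c", "feat_d"]
toy_shap = [
    [0.50, -0.30, 0.15, 0.05],
    [-0.50, 0.30, -0.15, 0.05],
    [0.50, 0.30, 0.15, -0.05],
]
ranking = rank_by_mean_abs_shap(toy_shap, toy_features)
selected = smallest_set_covering(ranking, threshold=0.90)
print(selected)  # ['feat_a', 'feat_b', 'feat_c']
```

A ranking like this only covers the SHAP-driven part of the list. Features kept for business reasons would be added by hand afterwards.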


In [3]:
import pandas as pd
import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# 1. Configuration
DATA_PATH = '.../data/UnifiedDataset_Engineered.csv'
MODEL_PATH = '../models/final_model.pkl'
FEATURE_PATH = '../models/feature_names.pkl'

# 2. Load Data
try:
    df = pd.read_csv(DATA_PATH)
    print("✅ Data Loaded.")
except FileNotFoundError:
    print(f"❌ Error: Data file not found at {DATA_PATH}")
    raise  # Stop here: every later step needs `df`

# 3. Preprocessing (Drop Missing Target)
target = '% Death Cardiovascular'
df_train = df.dropna(subset=[target]).copy()

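# Encode Gender as integer codes (LabelEncoder numbers the labels in sorted order).
# The App must apply the same label-to-code mapping at prediction time.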
if df_train['Gender'].dtype == 'object':
    le = LabelEncoder()
    df_train['Gender'] = le.fit_transform(df_train['Gender'])

# ======================================================
# 4. SELECTING THE "ALL-STAR" FEATURES
# ======================================================
# Based on SHAP analysis + Business Logic (Turkey Interest)
selected_features = [
    # Demographics (Crucial Baseline)
    '% Population Aged 65+',
    'Gender',
    '% Population Aged 0-14',

    # The "Big Three" Diet Drivers (from SHAP)
    'Diet Composition Pulses',
    'Diet Composition Sugar',
    'Diet Calories Fat',

    # Turkey Specific Interest (Business Logic)
    'Cereal Consumption Wheat',
    'Vegetable Consumption',
    'Cereal Consumption Rice',
    'Fruit Consumption Oranges And Mandarines',

    # Smart Engineering (High Correlation)
    'Risk_Metabolic_Combo',
    'Ratio_Veg_to_Grain'
]

print(f"Original Feature Count: {len(df.columns)}")
print(f"Final Selected Features: {len(selected_features)}")

X = df_train[selected_features]
y = df_train[target]

# 5. Create Final Pipeline
# Random Forest (Best Model from Notebook 4)
final_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('rf', RandomForestRegressor(n_estimators=200, max_depth=None, min_samples_leaf=1, random_state=42))
])

# 6. Train on FULL Optimized Data
print("⏳ Training Final Model...")
final_pipeline.fit(X, y)
print("✅ Training Complete.")

# 7. Save Model & Feature List
joblib.dump(final_pipeline, MODEL_PATH)
joblib.dump(selected_features, FEATURE_PATH) # Important: App needs to know the order!

print(f"üöÄ Model saved to {MODEL_PATH}")
print(f"üìã Feature list saved to {FEATURE_PATH}")

✅ Data Loaded.
Original Feature Count: 33
Final Selected Features: 12
⏳ Training Final Model...
✅ Training Complete.
🚀 Model saved to /content/models/final_model.pkl
📋 Feature list saved to /content/models/feature_names.pkl
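
Because the feature list is saved so the App knows the column order, the App side might assemble each input row along the following lines. This is a sketch under assumptions, not the App's actual code: `build_model_row` is a hypothetical helper, the Gender labels `"Female"` and `"Male"` are guesses at what the dataset contains, and the feature list is shortened for readability.

```python
def build_model_row(raw_inputs, feature_names, gender_codes):
    """Return feature values in the exact order the model was trained on."""
    values = dict(raw_inputs)  # copy so the caller's dict is left untouched

    # Translate a Gender label into the integer code used at training time.
    gender = values.get("Gender")
    if isinstance(gender, str):
        if gender not in gender_codes:
            raise ValueError(f"Unknown Gender label: {gender!r}")
        values["Gender"] = gender_codes[gender]

    missing = [name for name in feature_names if name not in values]
    if missing:
        raise KeyError(f"Missing features: {missing}")

    # Extra keys are ignored; the order comes from the saved feature list.
    return [values[name] for name in feature_names]


# Shortened stand-in for the list stored in feature_names.pkl.
feature_names = ["% Population Aged 65+", "Gender", "Vegetable Consumption"]

# Codes follow sorted label order, matching how LabelEncoder numbers its classes.
# The labels themselves are assumed here.
gender_codes = {label: code for code, label in enumerate(sorted(["Male", "Female"]))}

user_inputs = {
    "Vegetable Consumption": 120.0,
    "Gender": "Male",
    "% Population Aged 65+": 9.5,
}
row = build_model_row(user_inputs, feature_names, gender_codes)
print(row)  # [9.5, 1, 120.0]
```

In the real App, `feature_names` would come from `joblib.load(FEATURE_PATH)`, and the row could be wrapped as `pd.DataFrame([row], columns=feature_names)` before calling `final_pipeline.predict`. Raising on missing keys is a deliberate choice in this sketch. Since the pipeline imputes medians, substituting `float("nan")` for an absent value would also work.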
