# Advanced Preprocessing & MLOps for Tabular Data

## 🚀 Professional-Level Data Preprocessing Pipeline

This notebook builds upon the basic preprocessing techniques to introduce **production-ready** data preprocessing workflows using advanced Scikit-Learn features and MLOps best practices.

### 🎯 What You'll Learn
- **Automated Pipelines** with `Pipeline` and `ColumnTransformer`
- **Outlier Detection** and treatment strategies
- **Cross-Validation** for preprocessing and model evaluation
- **Feature Importance** analysis and selection
- **MLflow** experiment tracking and versioning
- **Pipeline Persistence** for production deployment

### 📊 Dataset
We'll use the same **Titanic dataset** to demonstrate how these advanced techniques improve upon basic preprocessing approaches.

## 1. Import Required Libraries and Setup

Let's import all the libraries we'll need for advanced preprocessing and MLOps.

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn core
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Advanced preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.feature_selection import SelectKBest, f_classif, RFE

# Outlier detection
from sklearn.ensemble import IsolationForest
from scipy import stats

# Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Feature importance
from sklearn.inspection import permutation_importance

# MLflow for experiment tracking
try:
    import mlflow
    import mlflow.sklearn
    MLFLOW_AVAILABLE = True
except ImportError:
    print("⚠️ MLflow not installed. Run: pip install mlflow")
    MLFLOW_AVAILABLE = False

# Model persistence
import joblib
from datetime import datetime
import os

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8')

print("✅ All libraries imported successfully!")

## 2. Load and Prepare Data

We'll load the Titanic dataset and perform initial exploration to understand our preprocessing needs.

In [None]:
# Load Titanic dataset
print("📥 Loading Titanic dataset...")
X, y = fetch_openml('titanic', version=1, as_frame=True, parser='auto', return_X_y=True)

print(f"Dataset shape: {X.shape}")
print(f"Target distribution:\n{y.value_counts()}")

# Initial data exploration
print("\n📊 Data Overview:")
X.info()

In [None]:
# Identify feature types for automated preprocessing
def identify_feature_types(df):
    """
    Automatically identify numerical and categorical features.
    """
    numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
    categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()
    
    # Remove features that shouldn't be used for prediction
    features_to_remove = ['boat', 'body', 'home.dest', 'name', 'ticket']
    numerical_features = [f for f in numerical_features if f not in features_to_remove]
    categorical_features = [f for f in categorical_features if f not in features_to_remove]
    
    return numerical_features, categorical_features

numerical_features, categorical_features = identify_feature_types(X)

print(f"📊 Numerical features ({len(numerical_features)}): {numerical_features}")
print(f"📝 Categorical features ({len(categorical_features)}): {categorical_features}")

# Check missing values
missing_summary = X[numerical_features + categorical_features].isnull().sum()
missing_summary = missing_summary[missing_summary > 0].sort_values(ascending=False)
print(f"\n❌ Missing values:")
for feature, count in missing_summary.items():
    percentage = (count / len(X)) * 100
    print(f"  {feature}: {count} ({percentage:.1f}%)")
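
Dtype-based column selection can also be delegated to scikit-learn. The sketch below is a minimal illustration, assuming a toy DataFrame (`toy`, with made-up values) instead of the Titanic data. It uses `make_column_selector` to pick columns by dtype and a regex to exclude the free-text `name` column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame standing in for a few Titanic columns (values are illustrative only)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 8.05],
    "sex": pd.Categorical(["male", "female", "female"]),
    "name": ["A", "B", "C"],
})

# Select numeric columns by dtype
select_numeric = make_column_selector(dtype_include=np.number)

# Select object/category columns, skipping the column named exactly "name"
select_categorical = make_column_selector(
    pattern=r"^(?!name$)", dtype_include=["object", "category"]
)

numeric_cols = select_numeric(toy)
categorical_cols = select_categorical(toy)
print(numeric_cols)
print(categorical_cols)
```

The selectors are callables, so they can be passed directly as the column specification of a `ColumnTransformer` instead of hard-coded lists.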

## 3. Create Preprocessing Pipelines with ColumnTransformer

We'll create automated, reproducible preprocessing pipelines that handle numerical and categorical features separately.

In [None]:
# Define preprocessing pipelines for different feature types

# Numerical preprocessing pipeline
numerical_pipeline = Pipeline([
    ('imputer', KNNImputer(n_neighbors=5)),  # More sophisticated imputation
    ('scaler', StandardScaler())  # Standardization
])

# Categorical preprocessing pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))  # Avoid dummy variable trap
])

# Combine pipelines using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numerical_pipeline, numerical_features),
        ('categorical', categorical_pipeline, categorical_features)
    ],
    remainder='drop'  # Drop any remaining columns
)

print("✅ Preprocessing pipeline created!")
print(f"📊 Pipeline structure:")
print(f"  - Numerical features: {len(numerical_features)} → KNN Imputation + Standardization")
print(f"  - Categorical features: {len(categorical_features)} → Mode Imputation + One-Hot Encoding")

In [None]:
# Split data for pipeline evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"📊 Data split:")
print(f"  - Training set: {X_train.shape[0]} samples")
print(f"  - Test set: {X_test.shape[0]} samples")
print(f"  - Training target distribution: {y_train.value_counts().to_dict()}")
print(f"  - Test target distribution: {y_test.value_counts().to_dict()}")

# Apply preprocessing pipeline
print("\n🔄 Applying preprocessing pipeline...")
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)  # Only transform, don't refit!

print(f"✅ Preprocessing completed!")
print(f"  - Original features: {X_train.shape[1]}")
print(f"  - Processed features: {X_train_processed.shape[1]}")
print(f"  - Change in feature count: {X_train_processed.shape[1] - X_train.shape[1]:+d} (unused columns dropped, categories one-hot expanded)")

## 4. Outlier Detection and Treatment

Let's implement various outlier detection methods and treatment strategies.

In [None]:
# Function to detect outliers using different methods
def detect_outliers(X, method='iqr', contamination=0.1):
    """
    Detect outliers using various methods.
    
    Parameters:
    -----------
    X : array-like
        Input data
    method : str
        Method to use: 'iqr', 'zscore', 'isolation_forest'
    contamination : float
        Expected proportion of outliers (for isolation forest)
    
    Returns:
    --------
    outlier_mask : array
        Boolean mask indicating outliers (True = outlier)
    """
    if method == 'iqr':
        Q1 = np.percentile(X, 25, axis=0)
        Q3 = np.percentile(X, 75, axis=0)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Build a boolean matrix that flags, for each feature, whether a value is an outlier,
        # then reduce across columns with .any(axis=1): True = the row has at least one outlying feature.
        outlier_mask = (X < lower_bound).any(axis=1) | (X > upper_bound).any(axis=1)
        
    elif method == 'zscore':
        z_scores = np.abs(stats.zscore(X, axis=0, nan_policy='omit'))
        outlier_mask = (z_scores > 3).any(axis=1)
        
    elif method == 'isolation_forest':
        iso_forest = IsolationForest(contamination=contamination, random_state=42)
        outlier_predictions = iso_forest.fit_predict(X)
        outlier_mask = outlier_predictions == -1
    
    else:
        raise ValueError(f"Unknown outlier detection method: {method!r}")
    
    return outlier_mask

# Apply outlier detection on numerical features only
X_train_numeric = X_train[numerical_features].fillna(X_train[numerical_features].median())

methods = ['iqr', 'zscore', 'isolation_forest']
outlier_results = {}

print("🔍 Outlier Detection Results:")
for method in methods:
    outliers = detect_outliers(X_train_numeric.values, method=method)
    outlier_results[method] = outliers
    n_outliers = np.sum(outliers)
    percentage = (n_outliers / len(X_train)) * 100
    print(f"  {method.upper()}: {n_outliers} outliers ({percentage:.1f}%)")

# Visualize outliers for fare (most likely to have outliers)
if 'fare' in numerical_features:
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    for i, method in enumerate(methods):
        outliers = outlier_results[method]
        axes[i].scatter(X_train_numeric.index[~outliers], X_train_numeric.loc[~outliers, 'fare'], 
                       alpha=0.6, label='Normal', s=20)
        axes[i].scatter(X_train_numeric.index[outliers], X_train_numeric.loc[outliers, 'fare'], 
                       alpha=0.8, label='Outlier', s=20, color='red')
        axes[i].set_title(f'Outliers: {method.upper()}')
        axes[i].set_ylabel('Fare')
        axes[i].legend()
    
    plt.tight_layout()
    plt.show()
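
The IQR and z-score rules can disagree, especially on small samples. The sketch below works through both on a made-up array (`values` is an assumption, not Titanic data). A single extreme value inflates the standard deviation enough that its z-score stays below 3, while the IQR fences still flag it.

```python
import numpy as np
from scipy import stats

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# IQR rule: fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_mask = (values < lower) | (values > upper)

# Z-score rule: flag values more than 3 standard deviations from the mean
z_scores = np.abs(stats.zscore(values))
z_mask = z_scores > 3

print(f"IQR fences: [{lower}, {upper}] -> flagged: {values[iqr_mask]}")
print(f"Largest |z|: {z_scores.max():.2f} -> flagged: {values[z_mask]}")
```

This is one reason the counts reported by the different methods can differ noticeably on the same data.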

In [None]:
# Outlier treatment strategies
def treat_outliers(X, outlier_mask, method='remove'):
    """
    Treat outliers using various strategies.
    
    Parameters:
    -----------
    X : DataFrame
        Input data
    outlier_mask : array
        Boolean mask indicating outliers
    method : str
        Treatment method: 'remove', 'cap', 'log_transform'
    """
    X_treated = X.copy()
    
    if method == 'remove':
        return X_treated[~outlier_mask]
    
    elif method == 'cap':
        for col in X_treated.select_dtypes(include=[np.number]).columns:
            Q1 = X_treated[col].quantile(0.25)
            Q3 = X_treated[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            
            X_treated[col] = np.clip(X_treated[col], lower_bound, upper_bound)
    
    elif method == 'log_transform':
        for col in X_treated.select_dtypes(include=[np.number]).columns:
            if (X_treated[col] > 0).all():  # Only apply to positive values
                X_treated[col] = np.log1p(X_treated[col])
    
    return X_treated

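# Example: scikit-learn-compatible transformer for outlier treatment.
# Inheriting from BaseEstimator/TransformerMixin provides get_params/set_params
# (used by clone and cross-validation) as well as fit_transform.
from sklearn.base import BaseEstimator, TransformerMixin

class OutlierTreatmentTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, method='cap', detection_method='iqr'):
        self.method = method
        self.detection_method = detection_method
    
    def fit(self, X, y=None):
        # Learn capping bounds from the training data only
        self.bounds_ = {}
        if self.method == 'cap':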
            for col in X.select_dtypes(include=[np.number]).columns:
                Q1 = X[col].quantile(0.25)
                Q3 = X[col].quantile(0.75)
                IQR = Q3 - Q1
                self.bounds_[col] = {
                    'lower': Q1 - 1.5 * IQR,
                    'upper': Q3 + 1.5 * IQR
                }
        return self
    
    def transform(self, X):
        X_transformed = X.copy()
        
        if self.method == 'cap':
            for col, bounds in self.bounds_.items():
                if col in X_transformed.columns:
                    X_transformed[col] = np.clip(
                        X_transformed[col], 
                        bounds['lower'], 
                        bounds['upper']
                    )
        
        return X_transformed

print("✅ Outlier treatment methods implemented!")
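
One way to check that a custom transformer cooperates with scikit-learn utilities is to fit it on a tiny frame, transform unseen rows, and clone it. The sketch below uses a hypothetical `QuantileCapper` class and made-up data. It is an illustration of the pattern under those assumptions, not the transformer defined above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone

class QuantileCapper(BaseEstimator, TransformerMixin):
    """Hypothetical example: cap numeric columns at IQR fences learned in fit."""

    def __init__(self, factor=1.5):
        self.factor = factor  # constructor only stores hyperparameters

    def fit(self, X, y=None):
        numeric = X.select_dtypes(include="number")
        q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
        iqr = q3 - q1
        # Learned state gets a trailing underscore and is set in fit
        self.lower_ = q1 - self.factor * iqr
        self.upper_ = q3 + self.factor * iqr
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.lower_.index:
            X[col] = X[col].clip(self.lower_[col], self.upper_[col])
        return X

train = pd.DataFrame({"fare": [1.0, 2.0, 3.0, 4.0, 5.0], "sex": list("mfmfm")})
test = pd.DataFrame({"fare": [500.0, -50.0], "sex": ["f", "m"]})

capper = QuantileCapper().fit(train)
capped_test = capper.transform(test)  # uses fences learned from train
fresh = clone(capper)                 # unfitted copy with the same hyperparameters

print(capped_test["fare"].tolist())
print(fresh.get_params())
```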

## 5. Cross-Validation and Pipeline Evaluation

Let's implement comprehensive cross-validation strategies for our preprocessing pipelines.

In [None]:
# Create complete pipeline with model
def create_complete_pipeline(model, include_outlier_treatment=True):
    """
    Create a complete preprocessing + modeling pipeline.
    """
    steps = []
    
    # Add outlier treatment if requested
    if include_outlier_treatment:
        steps.append(('outlier_treatment', OutlierTreatmentTransformer(method='cap')))
    
    # Add preprocessing
    steps.append(('preprocessor', preprocessor))
    
    # Add model
    steps.append(('model', model))
    
    return Pipeline(steps)

# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100)
}

# Cross-validation evaluation
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = {}

print("🔄 Running cross-validation evaluation...")
print("=" * 50)

for model_name, model in models.items():
    print(f"\n📊 Evaluating {model_name}:")
    
    # Test with and without outlier treatment
    for outlier_treatment in [False, True]:
        pipeline = create_complete_pipeline(model, include_outlier_treatment=outlier_treatment)
        
        # Perform cross-validation
        cv_scores = cross_val_score(
            pipeline, X_train, y_train, 
            cv=cv_strategy, 
            scoring='roc_auc',
            n_jobs=-1
        )
        
        treatment_label = "with outlier treatment" if outlier_treatment else "without outlier treatment"
        result_key = f"{model_name}_{treatment_label}"
        cv_results[result_key] = cv_scores
        
        print(f"  {treatment_label}:")
        print(f"    Mean CV Score: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")
        print(f"    Score Range: [{cv_scores.min():.4f}, {cv_scores.max():.4f}]")

# Visualize CV results
fig, ax = plt.subplots(figsize=(12, 6))
positions = range(len(cv_results))
box_data = list(cv_results.values())
labels = list(cv_results.keys())

bp = ax.boxplot(box_data, positions=positions, patch_artist=True)
ax.set_xticklabels([label.replace('_', '\n') for label in labels], rotation=45, ha='right')
ax.set_ylabel('ROC-AUC Score')
ax.set_title('Cross-Validation Results: Pipeline Comparison')
ax.grid(True, alpha=0.3)

# Color boxes
colors = ['lightblue', 'lightcoral', 'lightgreen', 'lightyellow']
for patch, color in zip(bp['boxes'], colors[:len(bp['boxes'])]):
    patch.set_facecolor(color)

plt.tight_layout()
plt.show()
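
When more than one metric matters, `cross_validate` can score several at once and also report training scores. The sketch below is a minimal example on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `demo_pipeline` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    demo_pipeline, X_demo, y_demo,
    cv=cv,
    scoring=["roc_auc", "accuracy"],
    return_train_score=True,  # lets you compare train vs. validation per fold
)

for metric in ["roc_auc", "accuracy"]:
    test_mean = scores[f"test_{metric}"].mean()
    train_mean = scores[f"train_{metric}"].mean()
    print(f"{metric}: train={train_mean:.3f}, validation={test_mean:.3f}")
```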

## 6. Feature Importance Analysis

Let's analyze which features and preprocessing steps contribute most to model performance.

In [None]:
# Train best pipeline and analyze feature importance
best_pipeline = create_complete_pipeline(
    RandomForestClassifier(random_state=42, n_estimators=100), 
    include_outlier_treatment=True
)

# Fit the pipeline
print("🔄 Training best pipeline for feature importance analysis...")
best_pipeline.fit(X_train, y_train)

# Get feature names after preprocessing
def get_feature_names(pipeline):
    """
    Extract feature names from a preprocessing pipeline.
    """
    preprocessor = pipeline.named_steps['preprocessor']
    
    # Get numerical feature names
    num_features = numerical_features
    
    # Get categorical feature names
    cat_transformer = preprocessor.named_transformers_['categorical']
    cat_encoder = cat_transformer.named_steps['encoder']
    cat_features = cat_encoder.get_feature_names_out(categorical_features)
    
    return list(num_features) + list(cat_features)

feature_names = get_feature_names(best_pipeline)
print(f"📊 Total features after preprocessing: {len(feature_names)}")

# 1. Tree-based feature importance
model = best_pipeline.named_steps['model']
tree_importance = model.feature_importances_

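# 2. Permutation importance
# Compute it on the fitted model using the transformed test set so the scores
# align one-to-one with feature_names. Passing the full pipeline with the raw
# X_test would return one score per raw input column instead.
print("🔄 Calculating permutation importance...")
X_test_transformed = best_pipeline[:-1].transform(X_test)
if hasattr(X_test_transformed, 'toarray'):
    X_test_transformed = X_test_transformed.toarray()  # densify sparse one-hot output
perm_importance = permutation_importance(
    model, X_test_transformed, y_test,
    n_repeats=10, random_state=42, n_jobs=-1
)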

# Create importance DataFrame
importance_df = pd.DataFrame({
    'feature': feature_names,
    'tree_importance': tree_importance,
    'perm_importance_mean': perm_importance.importances_mean,
    'perm_importance_std': perm_importance.importances_std
})

# Sort by permutation importance
importance_df = importance_df.sort_values('perm_importance_mean', ascending=False)

print("\n🎯 Top 10 Most Important Features:")
print(importance_df.head(10).to_string(index=False, float_format='%.4f'))

In [None]:
# Visualize feature importance
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Tree-based importance
top_features_tree = importance_df.nlargest(15, 'tree_importance')
axes[0].barh(range(len(top_features_tree)), top_features_tree['tree_importance'])
axes[0].set_yticks(range(len(top_features_tree)))
axes[0].set_yticklabels(top_features_tree['feature'])
axes[0].set_xlabel('Importance')
axes[0].set_title('Tree-based Feature Importance\n(Random Forest)')
axes[0].invert_yaxis()

# Permutation importance
top_features_perm = importance_df.nlargest(15, 'perm_importance_mean')
axes[1].barh(range(len(top_features_perm)), top_features_perm['perm_importance_mean'],
            xerr=top_features_perm['perm_importance_std'])
axes[1].set_yticks(range(len(top_features_perm)))
axes[1].set_yticklabels(top_features_perm['feature'])
axes[1].set_xlabel('Importance')
axes[1].set_title('Permutation Feature Importance\n(with error bars)')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

# Feature selection based on importance
print("\n🎯 Feature Selection Recommendations:")
low_importance_features = importance_df[importance_df['perm_importance_mean'] < 0.001]['feature'].tolist()
print(f"📉 Low importance features ({len(low_importance_features)}): {low_importance_features[:5]}...")
print(f"💡 Consider removing these features to reduce model complexity.")
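
If features are dropped, doing the selection inside the pipeline keeps it within each cross-validation fold. The sketch below is a minimal example with `SelectKBest` on synthetic data. The names `X_demo`, `y_demo`, and `selector_pipeline`, and the choice `k=5`, are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, of which 4 are informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

selector_pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),  # refit on every training fold
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(selector_pipeline, X_demo, y_demo, cv=5, scoring="roc_auc")

# Fit once on all data to inspect which column indices were kept
selector_pipeline.fit(X_demo, y_demo)
kept = selector_pipeline.named_steps["select"].get_support(indices=True)

print(f"Mean ROC-AUC with 5 selected features: {scores.mean():.3f}")
print(f"Kept feature indices: {kept.tolist()}")
```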

## 7. MLflow Experiment Tracking

Let's set up MLflow to track our experiments, parameters, and results.

In [None]:
if MLFLOW_AVAILABLE:
    # Set up MLflow
    mlflow.set_experiment("Titanic_Advanced_Preprocessing")
    
    def log_experiment(pipeline, X_train, X_test, y_train, y_test, 
                      experiment_name, parameters=None):
        """
        Log an experiment to MLflow.
        """
        with mlflow.start_run(run_name=experiment_name):
            # Log parameters
            if parameters:
                for key, value in parameters.items():
                    mlflow.log_param(key, value)
            
            # Train pipeline
            pipeline.fit(X_train, y_train)
            
            # Make predictions
            train_predictions = pipeline.predict(X_train)
            test_predictions = pipeline.predict(X_test)
            train_proba = pipeline.predict_proba(X_train)[:, 1]
            test_proba = pipeline.predict_proba(X_test)[:, 1]
            
            # Calculate metrics
            train_auc = roc_auc_score(y_train, train_proba)
            test_auc = roc_auc_score(y_test, test_proba)
            
            # Log metrics
            mlflow.log_metric("train_auc", train_auc)
            mlflow.log_metric("test_auc", test_auc)
            mlflow.log_metric("overfitting", train_auc - test_auc)
            
            # Log model
            mlflow.sklearn.log_model(pipeline, "model")
            
            print(f"✅ Logged experiment: {experiment_name}")
            print(f"   Train AUC: {train_auc:.4f}")
            print(f"   Test AUC: {test_auc:.4f}")
            print(f"   Overfitting: {train_auc - test_auc:.4f}")
            
            return {
                'train_auc': train_auc,
                'test_auc': test_auc,
                'pipeline': pipeline
            }
    
    # Run experiments
    print("🔄 Running MLflow experiments...")
    print("=" * 50)
    
    experiments = [
        {
            'name': 'Baseline_LogisticRegression',
            'pipeline': create_complete_pipeline(
                LogisticRegression(random_state=42, max_iter=1000),
                include_outlier_treatment=False
            ),
            'params': {
                'model_type': 'LogisticRegression',
                'outlier_treatment': False,
                'numerical_imputation': 'KNN',
                'categorical_imputation': 'mode',
                'scaling': 'StandardScaler'
            }
        },
        {
            'name': 'Enhanced_LogisticRegression',
            'pipeline': create_complete_pipeline(
                LogisticRegression(random_state=42, max_iter=1000),
                include_outlier_treatment=True
            ),
            'params': {
                'model_type': 'LogisticRegression',
                'outlier_treatment': True,
                'numerical_imputation': 'KNN',
                'categorical_imputation': 'mode',
                'scaling': 'StandardScaler'
            }
        },
        {
            'name': 'Enhanced_RandomForest',
            'pipeline': create_complete_pipeline(
                RandomForestClassifier(random_state=42, n_estimators=100),
                include_outlier_treatment=True
            ),
            'params': {
                'model_type': 'RandomForest',
                'outlier_treatment': True,
                'numerical_imputation': 'KNN',
                'categorical_imputation': 'mode',
                'n_estimators': 100
            }
        }
    ]
    
    results = []
    for exp in experiments:
        result = log_experiment(
            exp['pipeline'], X_train, X_test, y_train, y_test,
            exp['name'], exp['params']
        )
        results.append({**result, 'name': exp['name']})
        print()
    
    # Compare results
    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values('test_auc', ascending=False)
    
    print("\n🏆 Experiment Results Summary:")
    print(results_df[['name', 'train_auc', 'test_auc']].to_string(index=False, float_format='%.4f'))
    
    print(f"\n💡 Best model: {results_df.iloc[0]['name']} (Test AUC: {results_df.iloc[0]['test_auc']:.4f})")
    print("\n📊 To view results in MLflow UI, run: mlflow ui")
    
else:
    print("⚠️ MLflow not available. Skipping experiment tracking.")
    print("   Install MLflow with: pip install mlflow")

## 8. Pipeline Persistence and Versioning

Finally, let's save our best pipeline and demonstrate versioning for production deployment.

In [None]:
# Create directory for model artifacts
model_dir = "models"
os.makedirs(model_dir, exist_ok=True)

# Train and save the best pipeline
print("💾 Saving best pipeline...")

# Train the best pipeline
best_pipeline = create_complete_pipeline(
    RandomForestClassifier(random_state=42, n_estimators=100),
    include_outlier_treatment=True
)
best_pipeline.fit(X_train, y_train)

# Create model metadata
model_metadata = {
    'model_type': 'RandomForest',
    'n_estimators': 100,
    'outlier_treatment': True,
    'numerical_features': numerical_features,
    'categorical_features': categorical_features,
    'preprocessing_steps': [
        'outlier_capping',
        'knn_imputation_numerical',
        'mode_imputation_categorical',
        'standard_scaling',
        'onehot_encoding'
    ],
    'train_auc': roc_auc_score(y_train, best_pipeline.predict_proba(X_train)[:, 1]),
    'test_auc': roc_auc_score(y_test, best_pipeline.predict_proba(X_test)[:, 1]),
    'created_at': datetime.now().isoformat(),
    'feature_names': feature_names
}

# Save pipeline and metadata
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
model_filename = f"titanic_pipeline_v{timestamp}.joblib"
metadata_filename = f"titanic_metadata_v{timestamp}.json"

# Save the pipeline
joblib.dump(best_pipeline, os.path.join(model_dir, model_filename))

# Save metadata
import json
with open(os.path.join(model_dir, metadata_filename), 'w') as f:
    json.dump(model_metadata, f, indent=2)

print(f"✅ Pipeline saved: {model_filename}")
print(f"✅ Metadata saved: {metadata_filename}")
print(f"📊 Model performance: Train AUC={model_metadata['train_auc']:.4f}, Test AUC={model_metadata['test_auc']:.4f}")

In [None]:
# Demonstrate loading and using the saved pipeline
print("🔄 Demonstrating model loading and inference...")

# Load the saved pipeline
loaded_pipeline = joblib.load(os.path.join(model_dir, model_filename))

# Load metadata
with open(os.path.join(model_dir, metadata_filename), 'r') as f:
    loaded_metadata = json.load(f)

print(f"📋 Loaded model metadata:")
for key, value in loaded_metadata.items():
    if key not in ['numerical_features', 'categorical_features', 'preprocessing_steps', 'feature_names']:
        print(f"  {key}: {value}")

# Test predictions on new data
sample_predictions = loaded_pipeline.predict_proba(X_test[:5])[:, 1]
print(f"\n🎯 Sample predictions on test data:")
for i, (pred, actual) in enumerate(zip(sample_predictions, y_test[:5])):
    print(f"  Sample {i+1}: Predicted probability = {pred:.3f}, Actual = {actual}")

# Create a production-ready prediction function
def predict_survival(passenger_data, pipeline_path, metadata_path):
    """
    Production-ready prediction function.
    
    Parameters:
    -----------
    passenger_data : dict or DataFrame
        Passenger information
    pipeline_path : str
        Path to saved pipeline
    metadata_path : str
        Path to model metadata
    
    Returns:
    --------
    dict : Prediction results
    """
    # Load pipeline and metadata
    pipeline = joblib.load(pipeline_path)
    with open(metadata_path, 'r') as f:
        metadata = json.load(f)
    
    # Convert to DataFrame if necessary
    if isinstance(passenger_data, dict):
        passenger_data = pd.DataFrame([passenger_data])
    
    # Make prediction
    survival_prob = pipeline.predict_proba(passenger_data)[:, 1][0]
    survival_prediction = pipeline.predict(passenger_data)[0]
    
    return {
        'survival_probability': float(survival_prob),
        'predicted_survival': bool(int(survival_prediction)),
        'model_version': metadata['created_at'],
        'model_performance': {
            'train_auc': metadata['train_auc'],
            'test_auc': metadata['test_auc']
        }
    }

# Example usage
example_passenger = {
    'pclass': 1,
    'sex': 'female',
    'age': 25,
    'sibsp': 0,
    'parch': 1,
    'fare': 100.0,
    'embarked': 'S',
    'cabin': None
}

prediction_result = predict_survival(
    example_passenger,
    os.path.join(model_dir, model_filename),
    os.path.join(model_dir, metadata_filename)
)

print(f"\n🎭 Example prediction for passenger:")
print(f"  Input: {example_passenger}")
print(f"  Prediction: {prediction_result}")

print(f"\n✅ Pipeline persistence and versioning complete!")
print(f"📁 Models saved in: {model_dir}/")
print(f"🔧 Ready for production deployment!")
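
Pickled scikit-learn objects are not guaranteed to load across library versions, so it can help to record the version next to the artifact and to verify the round trip. The sketch below is a minimal example on synthetic data, written to a temporary directory. The file names and metadata keys are assumptions for illustration.

```python
import json
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    meta_path = os.path.join(tmp_dir, "demo_metadata.json")

    # Save the pipeline together with the library version it was trained with
    joblib.dump(pipe, model_path)
    with open(meta_path, "w") as f:
        json.dump({"sklearn_version": sklearn.__version__,
                   "n_features": int(X_demo.shape[1])}, f)

    # Reload both artifacts while the temporary directory still exists
    restored = joblib.load(model_path)
    with open(meta_path) as f:
        meta = json.load(f)

# Check the environment matches and the restored pipeline behaves identically
version_matches = meta["sklearn_version"] == sklearn.__version__
same_predictions = np.allclose(pipe.predict_proba(X_demo), restored.predict_proba(X_demo))
print(version_matches, same_predictions)
```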

## 🎯 Summary and Best Practices

### ✅ What We've Accomplished

1. **🔧 Automated Pipelines**: Created reproducible preprocessing pipelines using `ColumnTransformer` and `Pipeline`
2. **🔍 Outlier Management**: Implemented detection and treatment strategies for data quality
3. **📊 Robust Evaluation**: Used cross-validation to assess pipeline performance reliably
4. **🎯 Feature Analysis**: Analyzed feature importance to understand model decisions
5. **📈 Experiment Tracking**: Logged experiments with MLflow for reproducibility
6. **💾 Production Ready**: Saved versioned models with metadata for deployment

### 🚀 Production Best Practices

1. **Pipeline Everything**: Wrap preprocessing and the model in a single `Pipeline` so transformers are fit only on training folds, which prevents data leakage
2. **Version Control**: Track model versions, parameters, and performance metrics
3. **Monitoring**: Implement data drift detection in production
4. **Testing**: Create unit tests for preprocessing functions
5. **Documentation**: Maintain clear documentation of preprocessing decisions
6. **Rollback Strategy**: Keep previous model versions for quick rollback
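
As one possible starting point for the monitoring item above, a two-sample Kolmogorov-Smirnov test can compare a feature's training distribution with recent production values. The sketch below uses simulated data. The helper `drift_detected` and the threshold `alpha=0.01` are hypothetical choices for illustration, not a recommended configuration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Simulated "fare" values: a reference sample and a shifted production sample
reference_fare = rng.normal(loc=30.0, scale=10.0, size=1000)
shifted_fare = rng.normal(loc=45.0, scale=10.0, size=1000)

def drift_detected(reference, current, alpha=0.01):
    """Hypothetical helper: flag drift when the KS test rejects equal distributions."""
    statistic, p_value = ks_2samp(reference, current)
    return bool(p_value < alpha)

print(drift_detected(reference_fare, shifted_fare))    # distribution moved
print(drift_detected(reference_fare, reference_fare))  # identical samples
```

In practice such a check would run per feature on a schedule, with thresholds adjusted for the number of features tested.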

### 📚 Additional Resources

- **MLflow Documentation**: [https://mlflow.org/docs/latest/index.html](https://mlflow.org/docs/latest/index.html)
- **Scikit-learn Pipelines**: [https://scikit-learn.org/stable/modules/compose.html](https://scikit-learn.org/stable/modules/compose.html)
- **Data Validation with Great Expectations**: [https://greatexpectations.io/](https://greatexpectations.io/)
- **Model Deployment with FastAPI**: [https://fastapi.tiangolo.com/](https://fastapi.tiangolo.com/)

### 🎓 Next Steps

1. Implement automated hyperparameter tuning with `Optuna` or `Hyperopt`
2. Add data validation checks with `Great Expectations`
3. Create a REST API for model serving with `FastAPI`
4. Set up CI/CD pipelines for automated model training and deployment
5. Implement monitoring dashboards for model performance in production

**🎉 Congratulations! You now have the tools to build production-ready ML preprocessing pipelines!**