### Dual Model System - Complete Implementation Guide

**Overview:**

The Flightmasters system now uses two separate models:

- **Pre-Departure Model:** predicts delays before the flight takes off.
- **In-Flight Model:** predicts delays after takeoff (more accurate).

### Why create 2 models?

- `dep_delay` is one of the strongest predictors of `arr_delay`: if a flight departs 20 minutes late, it is likely to arrive late too.
- `dep_delay` is not available before takeoff, so the Pre-Departure model must rely on other signals (time of day, route, etc.).
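
The split between the two feature sets can be sketched in plain Python. This is a minimal illustration, assuming the features form a flat list of names; `FEATURES`, `split_feature_sets`, and the example feature names are hypothetical and not taken from the notebook.

```python
# Hypothetical feature names, for illustration only.
FEATURES = ["hour_of_day", "route_id", "dep_delay", "distance"]


def split_feature_sets(features, post_takeoff_feature="dep_delay"):
    """Return (pre_departure, in_flight) feature lists.

    The in-flight set keeps every feature; the pre-departure set drops
    the one feature that is only known after takeoff.
    """
    in_flight = list(features)
    pre_departure = [name for name in features if name != post_takeoff_feature]
    return pre_departure, in_flight


pre_dep, in_flight = split_feature_sets(FEATURES)
print(pre_dep)    # dep_delay is absent
print(in_flight)  # dep_delay is present
```

The notebook applies the same idea to vector indices rather than names: it removes the position of `dep_delay` from the selected indices before building the Pre-Departure dataset.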

### Description and details:

Trains 4 models:

1. Random Forest (Pre-Departure)
2. GBT (Pre-Departure)
3. Random Forest (In-Flight)
4. GBT (In-Flight)

Features:

1. Bayesian optimization (smart parameter tuning)
2. 5-fold cross-validation (reliable metrics)
3. MLflow tracking (experiment management)

Runtime: ~15 hours total

- Pre-Departure models: ~6 hours
- In-Flight models: ~9 hours
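
Runtime is driven mostly by how many times each model is fitted. The back-of-the-envelope count below assumes that each Bayesian evaluation runs one k-fold cross-validation over a single parameter map (k fold fits plus Spark's refit of the best model on the full training set) and that one final fit follows tuning. `count_fits` and its parameter names are illustrative, and the evaluation count for the 5-fold case is an assumption.

```python
def count_fits(cv_folds, max_evals, n_models=4):
    """Rough number of model fits for the whole experiment."""
    per_eval = cv_folds + 1                # k fold fits + one refit on all training data
    per_model = max_evals * per_eval + 1   # tuning evaluations + final fit with best params
    return per_model * n_models


# Values from Config in the code below: 2 folds, 4 evaluations.
reduced = count_fits(cv_folds=2, max_evals=4)
# Assumed: the same 4 evaluations with the 5-fold setup from the overview.
full = count_fits(cv_folds=5, max_evals=4)
print(reduced, full)
```

The count ignores the extra Random Forest fit used for feature ranking and says nothing about the cost of each fit, which depends on tree depth, tree count, and data size.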

### Steps to run:
Change your email at the bottom of the notebook, then run all the cells.

In [0]:
%pip install hyperopt

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
# #!/usr/bin/env python3
# """
# Optimized ML Experiments for Databricks Community Edition
# - FIXED: Logs models to workspace registry with signatures
# - Models can be registered to UC manually via UI/API after logging
# - Avoids MLeap dependency issue on Community Edition
# """

# import warnings
# import os
# import mlflow
# import mlflow.spark
# import numpy as np
# import pandas as pd
# from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
# from pyspark.ml.classification import GBTClassifier, RandomForestClassifier
# from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
# from pyspark.ml.tuning import CrossValidator
# from pyspark.sql import SparkSession
# from pyspark.sql.functions import col, udf
# from pyspark.sql.types import DoubleType
# from pyspark.ml.linalg import Vectors, VectorUDT

# # Signature imports
# from mlflow.models.signature import ModelSignature
# from mlflow.types.schema import Schema, TensorSpec

# warnings.filterwarnings("ignore")


# # =============================================================================
# # CONFIGURATION
# # =============================================================================

# class Config:
#     """Optimized for Databricks Community Edition"""
    
#     TOP_K_FEATURES = 40 
#     CV_FOLDS = 2
#     BAYES_MAX_EVALS = 4
#     TEST_RATIO = 0.2
#     RANDOM_SEED = 42
    
#     GOLD_TABLE = "default.gold_ml_features_experimental"
#     EXPERIMENT_NAME = "/Shared/Flightmasters_Optimized_Experiments"
    
#     # CRITICAL CHANGE: Use workspace registry for initial logging
#     MLFLOW_TRACKING_URI = "databricks"
#     MLFLOW_REGISTRY_URI = "databricks"  
    
#     # Model names (without UC prefix for workspace registry)
#     MODEL_RF_PRE = "model_rf_pre"
#     MODEL_GBT_PRE = "model_gbt_pre"
#     MODEL_RF_IN = "model_rf_in"
#     MODEL_GBT_IN = "model_gbt_in"
    
#     # UC settings (for manual registration later)
#     UC_CATALOG = "workspace"
#     UC_SCHEMA = "default"
#     UC_VOLUME_NAME = "mlflow_shared_tmp"
    
#     USE_CHECKPOINTING = True


# # =============================================================================
# # HELPERS
# # =============================================================================

# def setup_uc_volume(spark):
#     """Setup Unity Catalog Volume for artifacts"""
#     catalog = Config.UC_CATALOG
#     schema = Config.UC_SCHEMA
#     volume_name = Config.UC_VOLUME_NAME
    
#     volume_path = f"{catalog}.{schema}.{volume_name}"
#     env_path = f"/Volumes/{catalog}/{schema}/{volume_name}"
    
#     print("\n" + "=" * 80)
#     print(f"UNITY CATALOG VOLUME SETUP")
#     print("=" * 80)
#     print(f"Volume Target: {volume_path}")
    
#     try:
#         volume_exists = spark.sql(f"SHOW VOLUMES IN {catalog}.{schema}").filter(
#             col("volume_name") == volume_name
#         ).count() > 0
        
#         if not volume_exists:
#             spark.sql(f"CREATE VOLUME {volume_path}")
#             print("✅ Volume created successfully.")
#         else:
#             print("✅ Volume already exists.")
#     except Exception as e:
#         print(f"⚠️  WARNING: Could not check/create volume: {e}")
#         pass
        
#     os.environ['MLFLOW_DFS_TMP'] = env_path
#     os.environ['SPARKML_TEMP_DFS_PATH'] = env_path
    
#     print(f"✅ Environment paths set to: {env_path}")
#     return volume_path


# def create_safe_vector_slicer(indices_to_keep):
#     """Create UDF to slice vectors"""
#     indices_list = list(indices_to_keep)
    
#     @udf(returnType=VectorUDT())
#     def safe_slicer(features):
#         if features is None: 
#             return None
#         max_idx = features.size - 1
#         selected_values = [float(features[i]) for i in indices_list if i <= max_idx]
#         return Vectors.dense(selected_values)
    
#     return safe_slicer


# def checkpoint_if_enabled(df, eager=True):
#     """Checkpoint data to prevent OOM"""
#     if Config.USE_CHECKPOINTING:
#         return df.localCheckpoint(eager=eager)
#     return df


# # =============================================================================
# # 1. FEATURE IMPORTANCE
# # =============================================================================

# def analyze_feature_importance(train_data, top_k):
#     print("\n" + "=" * 80)
#     print("PHASE 1: FEATURE IMPORTANCE ANALYSIS")
#     print("=" * 80)
    
#     sample = train_data.select("features").first()
#     if not sample:
#         raise ValueError("Training data is empty or features column is missing.")
        
#     total_features = sample.features.size
#     print(f"\n📊 Original features: {total_features}")
    
#     print("\n🌲 Training Random Forest for feature ranking...")
#     rf = RandomForestClassifier(
#         featuresCol="features", labelCol="label",
#         numTrees=30, maxDepth=8, seed=Config.RANDOM_SEED
#     )
    
#     rf_model = rf.fit(train_data)
#     importances = rf_model.featureImportances.toArray()
    
#     top_k_indices = np.argsort(importances)[-top_k:][::-1]
#     top_k_scores = importances[top_k_indices]
    
#     selected_importance = np.sum(top_k_scores)
#     total_importance = np.sum(importances)
#     retention = (selected_importance / total_importance) * 100
    
#     print(f"\n📈 Feature Selection Summary:")
#     print(f"   Original: {total_features} -> Selected: {top_k}")
#     print(f"   Information retained: {retention:.1f}%")
    
#     return total_features, top_k_indices.tolist(), retention


# # =============================================================================
# # 2. CREATE DUAL DATASETS
# # =============================================================================

# def create_dual_datasets(df_gold, selected_indices, dep_delay_orig_index=11):
#     print("\n" + "=" * 80)
#     print("PHASE 2: CREATING DUAL DATASETS")
#     print("=" * 80)
    
#     dep_delay_in_selected = dep_delay_orig_index in selected_indices
#     if dep_delay_in_selected: 
#         print(f"✅ dep_delay found at original index {dep_delay_orig_index}")
#     else: 
#         print(f"⚠️  dep_delay (index {dep_delay_orig_index}) not in top-K")
    
#     print(f"\n🔧 Applying feature selection...")
#     slicer_udf = create_safe_vector_slicer(selected_indices)
#     df_selected = df_gold.withColumn("features", slicer_udf(col("features"))).select("features", "label")
#     df_selected = df_selected.withColumn("label", col("label").cast(DoubleType()))
#     df_selected = checkpoint_if_enabled(df_selected)
    
#     if dep_delay_in_selected:
#         print(f"\n📋 Creating Pre-Departure (removing dep_delay)...")
#         dep_delay_new_index = selected_indices.index(dep_delay_orig_index)
#         pre_dep_indices = [i for i in range(len(selected_indices)) if i != dep_delay_new_index]
#         pre_dep_slicer = create_safe_vector_slicer(pre_dep_indices)
#         df_pre_dep = df_selected.withColumn("features", pre_dep_slicer(col("features"))).select("features", "label")
#     else:
#         df_pre_dep = df_selected
    
#     df_in_flight = df_selected
    
#     df_pre_dep = checkpoint_if_enabled(df_pre_dep, eager=False)
#     df_in_flight = checkpoint_if_enabled(df_in_flight, eager=False)
    
#     return df_pre_dep, df_in_flight, dep_delay_in_selected


# # =============================================================================
# # 3. TRAIN/TEST SPLIT
# # =============================================================================

# def split_and_checkpoint(df, name):
#     print(f"\n🔀 Splitting {name} dataset...")
#     train, test = df.randomSplit([1.0 - Config.TEST_RATIO, Config.TEST_RATIO], seed=Config.RANDOM_SEED)
#     print(f"   Train count: {train.count():,}")
#     print(f"   Test count: {test.count():,}")
    
#     train = checkpoint_if_enabled(train, eager=True)
#     test = checkpoint_if_enabled(test, eager=True)
#     return train, test


# # =============================================================================
# # 4. MODEL TRAINING
# # =============================================================================

# def train_model_optimized(train_data, test_data, model_name, model_type):
#     """Bayesian optimization with reduced CV folds"""
    
#     print(f"\n\t🎯 Training {model_name} ({model_type})")
#     print(f"\t   {Config.CV_FOLDS}-fold CV + {Config.BAYES_MAX_EVALS} Bayesian evals")
    
#     if model_name == "RandomForest":
#         space = {
#             'numTrees': hp.choice('numTrees', [50, 100]),
#             'maxDepth': hp.choice('maxDepth', [10, 15]),
#             'minInstancesPerNode': hp.choice('minInstancesPerNode', [25, 50])
#         }
#         ModelClass = RandomForestClassifier
#         param_map = {'numTrees': [50, 100], 'maxDepth': [10, 15], 'minInstancesPerNode': [25, 50]}
#     else:
#         space = {
#             'maxIter': hp.choice('maxIter', [50, 100]),
#             'maxDepth': hp.choice('maxDepth', [4, 6]),
#             'stepSize': hp.uniform('stepSize', 0.05, 0.15)
#         }
#         ModelClass = GBTClassifier
#         param_map = {'maxIter': [50, 100], 'maxDepth': [4, 6]}
    
#     def objective(params):
#         if model_name == "RandomForest":
#             model = ModelClass(
#                 featuresCol="features", labelCol="label",
#                 numTrees=int(params['numTrees']),
#                 maxDepth=int(params['maxDepth']),
#                 minInstancesPerNode=int(params['minInstancesPerNode']),
#                 seed=Config.RANDOM_SEED
#             )
#         else:
#             model = ModelClass(
#                 featuresCol="features", labelCol="label",
#                 maxIter=int(params['maxIter']),
#                 maxDepth=int(params['maxDepth']),
#                 stepSize=float(params['stepSize']),
#                 seed=Config.RANDOM_SEED
#             )
        
#         cv = CrossValidator(
#             estimator=model,
#             estimatorParamMaps=[{}],
#             evaluator=BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC"),
#             numFolds=Config.CV_FOLDS,
#             seed=Config.RANDOM_SEED,
#             parallelism=1
#         )
        
#         cv_model = cv.fit(train_data)
#         avg_auc = cv_model.avgMetrics[0]
#         return {'loss': -avg_auc, 'status': STATUS_OK}
    
#     print(f"\t   Optimizing over {Config.BAYES_MAX_EVALS} iterations...")
#     trials = Trials()
    
#     best = fmin(
#         fn=objective, space=space, algo=tpe.suggest,
#         max_evals=Config.BAYES_MAX_EVALS, trials=trials,
#         rstate=np.random.default_rng(Config.RANDOM_SEED),
#         verbose=False
#     )
    
#     best_params_actual = {}
#     for k, v in best.items():
#         if k in param_map:
#             best_params_actual[k] = param_map[k][v]
#         else:
#             best_params_actual[k] = v
    
#     print(f"\t   Best params: {best_params_actual}")
    
#     final_model = ModelClass(
#         featuresCol="features", labelCol="label",
#         **best_params_actual, seed=Config.RANDOM_SEED
#     ).fit(train_data)
    
#     predictions = final_model.transform(test_data)
    
#     metrics = {
#         "auc_roc": BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC").evaluate(predictions),
#         "accuracy": MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy").evaluate(predictions),
#         "f1_score": MulticlassClassificationEvaluator(labelCol="label", metricName="f1").evaluate(predictions)
#     }
    
#     cv_score = -min([t['result']['loss'] for t in trials.trials])
#     print(f"\t   Best CV Score: {cv_score:.4f}")
#     print(f"\t   Test AUC-ROC: {metrics['auc_roc']:.4f}")
    
#     return final_model, metrics, best_params_actual, cv_score


# # =============================================================================
# # 5. MAIN PIPELINE (WORKSPACE REGISTRY LOGGING)
# # =============================================================================

# def run_complete_experiments():
#     """Main execution pipeline - logs to workspace registry"""
    
#     print("\n" + "=" * 80)
#     print("FLIGHTMASTERS OPTIMIZED EXPERIMENTS")
#     print("Strategy: Log to Workspace Registry (bypasses MLeap issue)")
#     print("=" * 80)
    
#     # Setup MLflow for workspace registry
#     mlflow.set_tracking_uri(Config.MLFLOW_TRACKING_URI)
#     mlflow.set_registry_uri(Config.MLFLOW_REGISTRY_URI)
#     mlflow.set_experiment(Config.EXPERIMENT_NAME)
    
#     spark = SparkSession.builder.getOrCreate()
#     setup_uc_volume(spark)
    
#     mlflow.end_run()
    
#     # === MASTER PARENT RUN ===
#     with mlflow.start_run(run_name="Flight_Experiment_Master_Workspace"):
        
#         # Load data
#         print(f"\n📥 Loading Gold table...")
#         df_gold = spark.table(Config.GOLD_TABLE)
#         df_gold = df_gold.withColumn("label", col("label").cast(DoubleType())).filter(
#             (col("label") == 0.0) | (col("label") == 1.0)
#         )
#         df_gold = checkpoint_if_enabled(df_gold, eager=True)
        
#         train_full, test_full = df_gold.randomSplit([0.8, 0.2], seed=Config.RANDOM_SEED)
#         train_full = checkpoint_if_enabled(train_full, eager=False)
#         test_full = checkpoint_if_enabled(test_full, eager=False)

#         # Feature selection
#         with mlflow.start_run(run_name="Feature_Selection", nested=True):
#             orig_features, selected_indices, retention = analyze_feature_importance(
#                 train_full, Config.TOP_K_FEATURES
#             )
#             mlflow.log_params({
#                 "step": "Feature_Selection",
#                 "original_features": orig_features,
#                 "selected_features_count": Config.TOP_K_FEATURES
#             })
#             mlflow.log_metric("information_retained_pct", retention)

#         # Create datasets
#         df_pre, df_in, _ = create_dual_datasets(df_gold, selected_indices)
#         train_pre, test_pre = split_and_checkpoint(df_pre, "Pre-Departure")
#         train_in, test_in = split_and_checkpoint(df_in, "In-Flight")
#         results = {}
        
#         # Define signatures using TensorSpec (UC-compatible format)
#         pre_dep_feature_count = train_pre.select("features").first().features.size
#         in_flight_feature_count = train_in.select("features").first().features.size
        
#         pre_dep_input_schema = Schema([TensorSpec(type=np.dtype('float64'), shape=(-1,), name="features")])
#         in_flight_input_schema = Schema([TensorSpec(type=np.dtype('float64'), shape=(-1,), name="features")])
#         output_schema = Schema([TensorSpec(type=np.dtype('float64'), shape=(-1, 2), name="probability")])
        
#         pre_dep_signature = ModelSignature(inputs=pre_dep_input_schema, outputs=output_schema)
#         in_flight_signature = ModelSignature(inputs=in_flight_input_schema, outputs=output_schema)
        
#         print(f"\n✅ Signatures defined. Features: Pre-Dep={pre_dep_feature_count}, In-Flight={in_flight_feature_count}")

#         # Get sample inputs
#         sample_pre = train_pre.limit(1).select("features").toPandas()
#         sample_in = train_in.limit(1).select("features").toPandas()
        
#         # === TRAIN & LOG MODELS ===
        
#         # 1. Pre-Departure Random Forest
#         with mlflow.start_run(run_name="RF_Pre_Departure", nested=True):
#             m, metrics, p, cv = train_model_optimized(train_pre, test_pre, "RandomForest", "Pre-Departure")
#             mlflow.log_params(p)
#             mlflow.log_metric("cv_score", cv)
#             mlflow.log_metric("auc_roc", metrics["auc_roc"])
            
            
#             mlflow.spark.log_model(
#                 spark_model=m,
#                 artifact_path="model",
#                 signature=pre_dep_signature
#             )
#             results["RF_Pre"] = metrics
#             print(f"✅ RF Pre-Departure logged to workspace registry.")
            
#         # 2. Pre-Departure GBT
#         with mlflow.start_run(run_name="GBT_Pre_Departure", nested=True):
#             m, metrics, p, cv = train_model_optimized(train_pre, test_pre, "GBT", "Pre-Departure")
#             mlflow.log_params(p)
#             mlflow.log_metric("cv_score", cv)
#             mlflow.log_metric("auc_roc", metrics["auc_roc"])
            
#             mlflow.spark.log_model(
#                 spark_model=m,
#                 artifact_path="model",
#                 signature=pre_dep_signature
#             )
#             results["GBT_Pre"] = metrics
#             print(f"✅ GBT Pre-Departure logged to workspace registry.")
            
#         # 3. In-Flight Random Forest
#         with mlflow.start_run(run_name="RF_In_Flight", nested=True):
#             m, metrics, p, cv = train_model_optimized(train_in, test_in, "RandomForest", "In-Flight")
#             mlflow.log_params(p)
#             mlflow.log_metric("cv_score", cv)
#             mlflow.log_metric("auc_roc", metrics["auc_roc"])
            
#             mlflow.spark.log_model(
#                 spark_model=m,
#                 artifact_path="model",
#                 signature=in_flight_signature
#             )
#             results["RF_In"] = metrics
#             print(f"✅ RF In-Flight logged to workspace registry.")
            
#         # 4. In-Flight GBT
#         with mlflow.start_run(run_name="GBT_In_Flight", nested=True):
#             m, metrics, p, cv = train_model_optimized(train_in, test_in, "GBT", "In-Flight")
#             mlflow.log_params(p)
#             mlflow.log_metric("cv_score", cv)
#             mlflow.log_metric("auc_roc", metrics["auc_roc"])
            
#             mlflow.spark.log_model(
#                 spark_model=m,
#                 artifact_path="model",
#                 signature=in_flight_signature
#             )
#             results["GBT_In"] = metrics
#             print(f"✅ GBT In-Flight logged to workspace registry.")
            
#         # === FINAL SUMMARY ===
#         print("\n" + "=" * 80)
#         print("FINAL RESULTS SUMMARY")
#         print("=" * 80)
#         for k, v in results.items():
#             print(f"Model: {k:<15} | Test AUC: {v['auc_roc']:.4f}")
        
#         print("\n" + "=" * 80)
#         print("NEXT STEPS: Manual Unity Catalog Registration")
#         print("=" * 80)
#         print("Your models are now logged in the workspace registry with signatures.")
#         print("\nTo register to Unity Catalog:")
#         print("1. Go to MLflow UI (Experiments page)")
#         print("2. Find your runs and click on each model")
#         print("3. Click 'Register Model' button")
#         print(f"4. Select Unity Catalog and use format: {Config.UC_CATALOG}.{Config.UC_SCHEMA}.model_name")
#         print("\nAlternatively, use the MLflow Client API to register programmatically")
#         print("after models are logged (example code in next message).")
            
#         return results, selected_indices


# # =============================================================================
# # MAIN EXECUTION
# # =============================================================================

# if __name__ == "__main__":
    
#     try:
#         run_complete_experiments()
#         print("\n✅ All experiments complete and MLflow runs closed successfully!")
        
#     except Exception as e:
#         print(f"❌ An error occurred during the pipeline execution: {e}")
#         raise e
        
#     finally:
#         mlflow.end_run()


FLIGHTMASTERS OPTIMIZED EXPERIMENTS
Strategy: Log to Workspace Registry (bypasses MLeap issue)

UNITY CATALOG VOLUME SETUP
Volume Target: workspace.default.mlflow_shared_tmp
✅ Volume already exists.
✅ Environment paths set to: /Volumes/workspace/default/mlflow_shared_tmp

📥 Loading Gold table...

PHASE 1: FEATURE IMPORTANCE ANALYSIS

📊 Original features: 819

🌲 Training Random Forest for feature ranking...

📈 Feature Selection Summary:
   Original: 819 -> Selected: 40
   Information retained: 99.3%

PHASE 2: CREATING DUAL DATASETS
✅ dep_delay found at original index 11

🔧 Applying feature selection...

📋 Creating Pre-Departure (removing dep_delay)...

🔀 Splitting Pre-Departure dataset...
   Train count: 1,971,675
   Test count: 492,304

🔀 Splitting In-Flight dataset...
   Train count: 1,971,675
   Test count: 492,304

✅ Signatures defined. Features: Pre-Dep=39, In-Flight=40

	🎯 Training RandomForest (Pre-Departure)
	   2-fold CV + 4 Bayesian evals
	 



✅ RF Pre-Departure logged to workspace registry.

	🎯 Training GBT (Pre-Departure)
	   2-fold CV + 4 Bayesian evals
	   Optimizing over 4 iterations...
	   Best params: {'maxDepth': 6, 'maxIter': 100, 'stepSize': np.float64(0.1313655407640288)}
	   Best CV Score: 0.9328
	   Test AUC-ROC: 0.9318




✅ GBT Pre-Departure logged to workspace registry.

	🎯 Training RandomForest (In-Flight)
	   2-fold CV + 4 Bayesian evals
	   Optimizing over 4 iterations...


{"ts": "2025-11-30 05:27:44.395", "level": "ERROR", "logger": "pyspark.sql.connect.logging", "msg": "GRPC Error received", "context": {}, "exception": {"class": "_MultiThreadedRendezvous", "msg": "<_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.FAILED_PRECONDITION\n\tdetails = \"BAD_REQUEST: Sorry, cannot run the resource because you have hit your free daily limit. Please come back again tomorrow.\"\n\tdebug_error_string = \"UNKNOWN:Error received from peer  {created_time:\"2025-11-30T05:27:44.394535673+00:00\", grpc_status:9, grpc_message:\"BAD_REQUEST: Sorry, cannot run the resource because you have hit your free daily limit. Please come back again tomorrow.\"}\"\n>", "stacktrace": [{"class": null, "method": "_execute_and_fetch_as_iterator", "file": "/databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", "line": "2019"}, {"class": null, "method": "__next__", "file": "<frozen _collections_abc>", "line": "356"}, {"class": null

❌ An error occurred during the pipeline execution: (DENY_NEW_AND_EXISTING_RESOURCES) BAD_REQUEST: Sorry, cannot run the resource because you have hit your free daily limit. Please come back again tomorrow.


{"ts": "2025-11-30 05:27:46.135", "level": "ERROR", "logger": "pyspark.sql.connect.logging", "msg": "GRPC Error received", "context": {}, "exception": {"class": "_InactiveRpcError", "msg": "<_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.FAILED_PRECONDITION\n\tdetails = \"BAD_REQUEST: Sorry, cannot run the resource because you have hit your free daily limit. Please come back again tomorrow.\"\n\tdebug_error_string = \"UNKNOWN:Error received from peer  {created_time:\"2025-11-30T05:27:46.115777946+00:00\", grpc_status:9, grpc_message:\"BAD_REQUEST: Sorry, cannot run the resource because you have hit your free daily limit. Please come back again tomorrow.\"}\"\n>", "stacktrace": [{"class": null, "method": "interrupt_tag", "file": "/databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/client/core.py", "line": "2219"}, {"class": null, "method": "__call__", "file": "/databricks/python/lib/python3.12/site-packages/grpc/_interceptor.py", "line": "277"}, {"

---------------------------------------------------------------------------
UnknownException                          Traceback (most recent call last)
File <command-8141152329916617>, line 484
    482 except Exception as e:
    483     print(f"❌ An error occurred during the pipeline execution: {e}")
--> 484     raise e
    486 finally:
    487     mlflow.end_run()

File <command-8141152329916617>, line 479
    476 if __name__ == "__main__":
    478     try:
--> 479         run_complete_experime

The notebook below is intended as a modified version of the one above that registers the models after they are created. Registration allows integration with the FlightMasters_Delay_prediction notebook, which uses the Aviation Stack API. I could not run it because the free-tier limits were exceeded.

Hopefully it works. If it doesn't, you have the code above as a reference to build on and fix it.
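
Registering a logged model after the fact comes down to two strings: a `runs:/` URI pointing at the logged artifact and a three-level Unity Catalog name. The helper below builds both, assuming the `model` artifact path and the `workspace.default` catalog and schema used in this notebook. `registration_target` and the run ID are hypothetical, and the `mlflow.register_model` call is left as a comment because it needs a live workspace.

```python
def registration_target(run_id, model_name, catalog="workspace",
                        schema="default", artifact_path="model"):
    """Build the model URI and Unity Catalog name for a logged model."""
    model_uri = f"runs:/{run_id}/{artifact_path}"
    uc_name = f"{catalog}.{schema}.{model_name}"
    return model_uri, uc_name


# "abc123" is a placeholder; a real run ID comes from the MLflow run.
model_uri, uc_name = registration_target("abc123", "model_gbt_pre")
print(model_uri)
print(uc_name)

# With a live workspace (not executed here):
#   import mlflow
#   mlflow.set_registry_uri("databricks-uc")
#   mlflow.register_model(model_uri=model_uri, name=uc_name)
```

Keeping the name construction separate from the registration call makes it easy to check the target names for all four models before spending any compute quota.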

In [0]:

import warnings
import os
import mlflow
import mlflow.spark
import numpy as np
import pandas as pd
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from pyspark.ml.classification import GBTClassifier, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors, VectorUDT

# Signature imports
from mlflow.models.signature import ModelSignature
from mlflow.types.schema import Schema, TensorSpec

warnings.filterwarnings("ignore")


# =============================================================================
# CONFIGURATION
# =============================================================================

class Config:
    """Optimized for Databricks Community Edition"""
    
    TOP_K_FEATURES = 40 
    CV_FOLDS = 2
    BAYES_MAX_EVALS = 4
    TEST_RATIO = 0.2
    RANDOM_SEED = 42
    
    GOLD_TABLE = "default.gold_ml_features_experimental"
    EXPERIMENT_NAME = "/Shared/Flightmasters_Optimized_Experiments"
    
    # CRITICAL CHANGE: Use workspace registry for initial logging
    MLFLOW_TRACKING_URI = "databricks"
    MLFLOW_REGISTRY_URI = "databricks"  # Changed from databricks-uc
    
    # Model names (without UC prefix for workspace registry)
    MODEL_RF_PRE = "model_rf_pre"
    MODEL_GBT_PRE = "model_gbt_pre"
    MODEL_RF_IN = "model_rf_in"
    MODEL_GBT_IN = "model_gbt_in"
    
    # UC settings (for manual registration later)
    UC_CATALOG = "workspace"
    UC_SCHEMA = "default"
    UC_VOLUME_NAME = "mlflow_shared_tmp"
    
    USE_CHECKPOINTING = True


# =============================================================================
# HELPERS
# =============================================================================

def setup_uc_volume(spark):
    """Setup Unity Catalog Volume for artifacts"""
    catalog = Config.UC_CATALOG
    schema = Config.UC_SCHEMA
    volume_name = Config.UC_VOLUME_NAME
    
    volume_path = f"{catalog}.{schema}.{volume_name}"
    env_path = f"/Volumes/{catalog}/{schema}/{volume_name}"
    
    print("\n" + "=" * 80)
    print(f"UNITY CATALOG VOLUME SETUP")
    print("=" * 80)
    print(f"Volume Target: {volume_path}")
    
    try:
        volume_exists = spark.sql(f"SHOW VOLUMES IN {catalog}.{schema}").filter(
            col("volume_name") == volume_name
        ).count() > 0
        
        if not volume_exists:
            spark.sql(f"CREATE VOLUME {volume_path}")
            print("✅ Volume created successfully.")
        else:
            print("✅ Volume already exists.")
    except Exception as e:
        print(f"⚠️  WARNING: Could not check/create volume: {e}")
        pass
        
    os.environ['MLFLOW_DFS_TMP'] = env_path
    os.environ['SPARKML_TEMP_DFS_PATH'] = env_path
    
    print(f"✅ Environment paths set to: {env_path}")
    return volume_path


def create_safe_vector_slicer(indices_to_keep):
    """Create UDF to slice vectors"""
    indices_list = list(indices_to_keep)
    
    @udf(returnType=VectorUDT())
    def safe_slicer(features):
        if features is None: 
            return None
        max_idx = features.size - 1
        selected_values = [float(features[i]) for i in indices_list if i <= max_idx]
        return Vectors.dense(selected_values)
    
    return safe_slicer


def checkpoint_if_enabled(df, eager=True):
    """Checkpoint data to prevent OOM"""
    if Config.USE_CHECKPOINTING:
        return df.localCheckpoint(eager=eager)
    return df


# =============================================================================
# 1. FEATURE IMPORTANCE
# =============================================================================

def analyze_feature_importance(train_data, top_k):
    print("\n" + "=" * 80)
    print("PHASE 1: FEATURE IMPORTANCE ANALYSIS")
    print("=" * 80)
    
    sample = train_data.select("features").first()
    if not sample:
        raise ValueError("Training data is empty or features column is missing.")
        
    total_features = sample.features.size
    print(f"\n📊 Original features: {total_features}")
    
    print("\n🌲 Training Random Forest for feature ranking...")
    rf = RandomForestClassifier(
        featuresCol="features", labelCol="label",
        numTrees=30, maxDepth=8, seed=Config.RANDOM_SEED
    )
    
    rf_model = rf.fit(train_data)
    importances = rf_model.featureImportances.toArray()
    
    top_k_indices = np.argsort(importances)[-top_k:][::-1]
    top_k_scores = importances[top_k_indices]
    
    selected_importance = np.sum(top_k_scores)
    total_importance = np.sum(importances)
    retention = (selected_importance / total_importance) * 100
    
    print(f"\n📈 Feature Selection Summary:")
    print(f"   Original: {total_features} -> Selected: {top_k}")
    print(f"   Information retained: {retention:.1f}%")
    
    return total_features, top_k_indices.tolist(), retention


# =============================================================================
# 2. CREATE DUAL DATASETS
# =============================================================================

def create_dual_datasets(df_gold, selected_indices, dep_delay_orig_index=11):
    print("\n" + "=" * 80)
    print("PHASE 2: CREATING DUAL DATASETS")
    print("=" * 80)
    
    dep_delay_in_selected = dep_delay_orig_index in selected_indices
    if dep_delay_in_selected: 
        print(f"✅ dep_delay found at original index {dep_delay_orig_index}")
    else: 
        print(f"⚠️  dep_delay (index {dep_delay_orig_index}) not in top-K")
    
    print(f"\n🔧 Applying feature selection...")
    slicer_udf = create_safe_vector_slicer(selected_indices)
    df_selected = df_gold.withColumn("features", slicer_udf(col("features"))).select("features", "label")
    df_selected = df_selected.withColumn("label", col("label").cast(DoubleType()))
    df_selected = checkpoint_if_enabled(df_selected)
    
    if dep_delay_in_selected:
        print("\n📋 Creating Pre-Departure (removing dep_delay)...")
        dep_delay_new_index = selected_indices.index(dep_delay_orig_index)
        pre_dep_indices = [i for i in range(len(selected_indices)) if i != dep_delay_new_index]
        pre_dep_slicer = create_safe_vector_slicer(pre_dep_indices)
        df_pre_dep = df_selected.withColumn("features", pre_dep_slicer(col("features"))).select("features", "label")
    else:
        df_pre_dep = df_selected
    
    df_in_flight = df_selected
    
    df_pre_dep = checkpoint_if_enabled(df_pre_dep, eager=False)
    df_in_flight = checkpoint_if_enabled(df_in_flight, eager=False)
    
    return df_pre_dep, df_in_flight, dep_delay_in_selected


# =============================================================================
# 3. TRAIN/TEST SPLIT
# =============================================================================

def split_and_checkpoint(df, name):
    print(f"\n🔀 Splitting {name} dataset...")
    train, test = df.randomSplit([1.0 - Config.TEST_RATIO, Config.TEST_RATIO], seed=Config.RANDOM_SEED)
    print(f"   Train count: {train.count():,}")
    print(f"   Test count: {test.count():,}")
    
    train = checkpoint_if_enabled(train, eager=True)
    test = checkpoint_if_enabled(test, eager=True)
    return train, test


# =============================================================================
# 4. MODEL TRAINING
# =============================================================================

def train_model_optimized(train_data, test_data, model_name, model_type):
    """Bayesian optimization with reduced CV folds"""
    
    print(f"\n\t🎯 Training {model_name} ({model_type})")
    print(f"\t   {Config.CV_FOLDS}-fold CV + {Config.BAYES_MAX_EVALS} Bayesian evals")
    
    if model_name == "RandomForest":
        space = {
            'numTrees': hp.choice('numTrees', [50, 100]),
            'maxDepth': hp.choice('maxDepth', [10, 15]),
            'minInstancesPerNode': hp.choice('minInstancesPerNode', [25, 50])
        }
        ModelClass = RandomForestClassifier
        param_map = {'numTrees': [50, 100], 'maxDepth': [10, 15], 'minInstancesPerNode': [25, 50]}
    else:
        space = {
            'maxIter': hp.choice('maxIter', [50, 100]),
            'maxDepth': hp.choice('maxDepth', [4, 6]),
            'stepSize': hp.uniform('stepSize', 0.05, 0.15)
        }
        ModelClass = GBTClassifier
        param_map = {'maxIter': [50, 100], 'maxDepth': [4, 6]}
    
    def objective(params):
        if model_name == "RandomForest":
            model = ModelClass(
                featuresCol="features", labelCol="label",
                numTrees=int(params['numTrees']),
                maxDepth=int(params['maxDepth']),
                minInstancesPerNode=int(params['minInstancesPerNode']),
                seed=Config.RANDOM_SEED
            )
        else:
            model = ModelClass(
                featuresCol="features", labelCol="label",
                maxIter=int(params['maxIter']),
                maxDepth=int(params['maxDepth']),
                stepSize=float(params['stepSize']),
                seed=Config.RANDOM_SEED
            )
        
        cv = CrossValidator(
            estimator=model,
            estimatorParamMaps=[{}],
            evaluator=BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC"),
            numFolds=Config.CV_FOLDS,
            seed=Config.RANDOM_SEED,
            parallelism=1
        )
        
        cv_model = cv.fit(train_data)
        avg_auc = cv_model.avgMetrics[0]
        return {'loss': -avg_auc, 'status': STATUS_OK}
    
    print(f"\t   Optimizing over {Config.BAYES_MAX_EVALS} iterations...")
    trials = Trials()
    
    best = fmin(
        fn=objective, space=space, algo=tpe.suggest,
        max_evals=Config.BAYES_MAX_EVALS, trials=trials,
        rstate=np.random.default_rng(Config.RANDOM_SEED),
        verbose=False
    )
    
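    # fmin reports each hp.choice parameter as an index into its option list,
    # so map the index back to the actual value; hp.uniform values pass through.
    best_params_actual = {}
    for k, v in best.items():
        if k in param_map:
            best_params_actual[k] = param_map[k][v]
        else:
            best_params_actual[k] = v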
    
    print(f"\t   Best params: {best_params_actual}")
    
    final_model = ModelClass(
        featuresCol="features", labelCol="label",
        **best_params_actual, seed=Config.RANDOM_SEED
    ).fit(train_data)
    
    predictions = final_model.transform(test_data)
    
    metrics = {
        "auc_roc": BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC").evaluate(predictions),
        "accuracy": MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy").evaluate(predictions),
        "f1_score": MulticlassClassificationEvaluator(labelCol="label", metricName="f1").evaluate(predictions)
    }
    
    cv_score = -min([t['result']['loss'] for t in trials.trials])
    print(f"\t   Best CV Score: {cv_score:.4f}")
    print(f"\t   Test AUC-ROC: {metrics['auc_roc']:.4f}")
    
    return final_model, metrics, best_params_actual, cv_score


# =============================================================================
# 5. MAIN PIPELINE (WORKSPACE REGISTRY LOGGING)
# =============================================================================

def run_complete_experiments():
    """Main execution pipeline - logs to workspace registry"""
    
    print("\n" + "=" * 80)
    print("FLIGHTMASTERS OPTIMIZED EXPERIMENTS")
    print("Strategy: Log to Workspace Registry (bypasses MLeap issue)")
    print("=" * 80)
    
    # Setup MLflow for workspace registry
    mlflow.set_tracking_uri(Config.MLFLOW_TRACKING_URI)
    mlflow.set_registry_uri(Config.MLFLOW_REGISTRY_URI)
    mlflow.set_experiment(Config.EXPERIMENT_NAME)
    
    spark = SparkSession.builder.getOrCreate()
    setup_uc_volume(spark)
    
    mlflow.end_run()
    
    # === MASTER PARENT RUN ===
    with mlflow.start_run(run_name="Flight_Experiment_Master_Workspace"):
        
        # Load data
        print("\n📥 Loading Gold table...")
        df_gold = spark.table(Config.GOLD_TABLE)
        df_gold = df_gold.withColumn("label", col("label").cast(DoubleType())).filter(
            (col("label") == 0.0) | (col("label") == 1.0)
        )
        df_gold = checkpoint_if_enabled(df_gold, eager=True)
        
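        # This split is used only for feature ranking. The modeling splits are
        # re-drawn from df_gold in split_and_checkpoint, so rows ranked here may
        # reappear in a later test set; test_full itself is not used below.
        train_full, test_full = df_gold.randomSplit([0.8, 0.2], seed=Config.RANDOM_SEED)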
        train_full = checkpoint_if_enabled(train_full, eager=False)
        test_full = checkpoint_if_enabled(test_full, eager=False)

        # Feature selection
        with mlflow.start_run(run_name="Feature_Selection", nested=True):
            orig_features, selected_indices, retention = analyze_feature_importance(
                train_full, Config.TOP_K_FEATURES
            )
            mlflow.log_params({
                "step": "Feature_Selection",
                "original_features": orig_features,
                "selected_features_count": Config.TOP_K_FEATURES
            })
            mlflow.log_metric("information_retained_pct", retention)

        # Create datasets
        df_pre, df_in, _ = create_dual_datasets(df_gold, selected_indices)
        train_pre, test_pre = split_and_checkpoint(df_pre, "Pre-Departure")
        train_in, test_in = split_and_checkpoint(df_in, "In-Flight")
        results = {}
        
        # Define signatures using TensorSpec (UC-compatible format)
        pre_dep_feature_count = train_pre.select("features").first().features.size
        in_flight_feature_count = train_in.select("features").first().features.size
        
        pre_dep_input_schema = Schema([TensorSpec(type=np.dtype('float64'), shape=(-1,), name="features")])
        in_flight_input_schema = Schema([TensorSpec(type=np.dtype('float64'), shape=(-1,), name="features")])
        output_schema = Schema([TensorSpec(type=np.dtype('float64'), shape=(-1, 2), name="probability")])
        
        pre_dep_signature = ModelSignature(inputs=pre_dep_input_schema, outputs=output_schema)
        in_flight_signature = ModelSignature(inputs=in_flight_input_schema, outputs=output_schema)
        
        print(f"\n✅ Signatures defined. Features: Pre-Dep={pre_dep_feature_count}, In-Flight={in_flight_feature_count}")

        # Get sample inputs - convert DenseVector to list/array for JSON serialization
        sample_pre_row = train_pre.limit(1).select("features").collect()[0]
        sample_pre = pd.DataFrame([sample_pre_row.features.toArray()])
        
        sample_in_row = train_in.limit(1).select("features").collect()[0]
        sample_in = pd.DataFrame([sample_in_row.features.toArray()])
        
        # === TRAIN & LOG MODELS ===
        
        # 1. Pre-Departure Random Forest
        with mlflow.start_run(run_name="RF_Pre_Departure", nested=True):
            m, metrics, p, cv = train_model_optimized(train_pre, test_pre, "RandomForest", "Pre-Departure")
            mlflow.log_params(p)
            mlflow.log_metric("cv_score", cv)
            mlflow.log_metric("auc_roc", metrics["auc_roc"])
            
            # CRITICAL FIX: Log WITHOUT registered_model_name parameter
            mlflow.spark.log_model(
                spark_model=m,
                artifact_path="model",
                signature=pre_dep_signature,
                input_example=sample_pre
            )
            results["RF_Pre"] = metrics
            print("✅ RF Pre-Departure logged to workspace registry.")
            
        # 2. Pre-Departure GBT
        with mlflow.start_run(run_name="GBT_Pre_Departure", nested=True):
            m, metrics, p, cv = train_model_optimized(train_pre, test_pre, "GBT", "Pre-Departure")
            mlflow.log_params(p)
            mlflow.log_metric("cv_score", cv)
            mlflow.log_metric("auc_roc", metrics["auc_roc"])
            
            mlflow.spark.log_model(
                spark_model=m,
                artifact_path="model",
                signature=pre_dep_signature,
                input_example=sample_pre
            )
            results["GBT_Pre"] = metrics
            print("✅ GBT Pre-Departure logged to workspace registry.")
            
        # 3. In-Flight Random Forest
        with mlflow.start_run(run_name="RF_In_Flight", nested=True):
            m, metrics, p, cv = train_model_optimized(train_in, test_in, "RandomForest", "In-Flight")
            mlflow.log_params(p)
            mlflow.log_metric("cv_score", cv)
            mlflow.log_metric("auc_roc", metrics["auc_roc"])
            
            mlflow.spark.log_model(
                spark_model=m,
                artifact_path="model",
                signature=in_flight_signature,
                input_example=sample_in
            )
            results["RF_In"] = metrics
            print("✅ RF In-Flight logged to workspace registry.")
            
        # 4. In-Flight GBT
        with mlflow.start_run(run_name="GBT_In_Flight", nested=True):
            m, metrics, p, cv = train_model_optimized(train_in, test_in, "GBT", "In-Flight")
            mlflow.log_params(p)
            mlflow.log_metric("cv_score", cv)
            mlflow.log_metric("auc_roc", metrics["auc_roc"])
            
            mlflow.spark.log_model(
                spark_model=m,
                artifact_path="model",
                signature=in_flight_signature,
                input_example=sample_in
            )
            results["GBT_In"] = metrics
            print("✅ GBT In-Flight logged to workspace registry.")
            
        # === FINAL SUMMARY ===
        print("\n" + "=" * 80)
        print("FINAL RESULTS SUMMARY")
        print("=" * 80)
        for k, v in results.items():
            print(f"Model: {k:<15} | Test AUC: {v['auc_roc']:.4f}")
        
        print("\n" + "=" * 80)
        print("NEXT STEPS: Manual Unity Catalog Registration")
        print("=" * 80)
        print("Your models are now logged in the workspace registry with signatures.")
        print("\nTo register to Unity Catalog:")
        print("1. Go to MLflow UI (Experiments page)")
        print("2. Find your runs and click on each model")
        print("3. Click 'Register Model' button")
        print(f"4. Select Unity Catalog and use format: {Config.UC_CATALOG}.{Config.UC_SCHEMA}.model_name")
        print("\nAlternatively, use the MLflow Client API to register programmatically")
        print("after models are logged.")
            
        return results, selected_indices


# =============================================================================
# MAIN EXECUTION
# =============================================================================

if __name__ == "__main__":
    
    try:
        run_complete_experiments()
        print("\n✅ All experiments complete and MLflow runs closed successfully!")
        
    except Exception as e:
        print(f"❌ An error occurred during the pipeline execution: {e}")
        raise  # re-raise with the original traceback intact
        
    finally:
        mlflow.end_run()

### Useful Information:

You can find the ML models and the feature selection details in the Experiments tab on the left-hand side.
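
The logged models can also be registered to Unity Catalog from code rather than through the UI. The snippet below is only a sketch of one way to do that with the MLflow API: the helper functions are hypothetical, the run ID, catalog, schema and model name are placeholders you would fill in from the Experiments tab, and it assumes the model was logged under the artifact path `model`, as in the training cell. Whether registration succeeds on Community Edition depends on your workspace setup.

```python
def build_model_uri(run_id, artifact_path="model"):
    """Build the runs:/ URI MLflow uses to locate a model logged in a run."""
    return f"runs:/{run_id}/{artifact_path}"


def uc_model_name(catalog, schema, name):
    """Unity Catalog model names use the three-level form catalog.schema.name."""
    return f"{catalog}.{schema}.{name}"


def register_to_uc(run_id, catalog, schema, name):
    """Register a logged model to Unity Catalog (hypothetical helper)."""
    # Imported lazily so the two helpers above work without MLflow installed.
    import mlflow

    mlflow.set_registry_uri("databricks-uc")  # point the registry at Unity Catalog
    return mlflow.register_model(
        build_model_uri(run_id),
        uc_model_name(catalog, schema, name),
    )


# Example call (placeholders, not real values):
# register_to_uc("<run_id>", "<catalog>", "<schema>", "rf_pre_departure")
```

The run ID for each model is shown on its run page under Experiments.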

You can see the indexes of the features used for training in the Artifacts tab of Feature Engineering within Experiments (the indexes reference positions in the Gold table's feature vector).
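
As a rough illustration of how those indexes relate to the Gold table, the sketch below maps a list of selected indexes back to feature names and shows how positions shift once dep_delay is dropped for the Pre-Departure model. The feature names, their order, and the two helper functions are made up for this example; the real names come from the Gold table, and the notebook defaults to original index 11 for dep_delay rather than the 4 used here.

```python
# Hypothetical feature names, in the same order as the Gold table's feature vector.
gold_feature_names = [
    "month", "day_of_week", "crs_dep_hour", "distance", "dep_delay", "carrier_idx",
]


def names_for_indices(selected_indices, feature_names):
    """Translate original Gold-table positions into feature names."""
    return [feature_names[i] for i in selected_indices]


def drop_feature(selected_indices, orig_index):
    """Return the positions inside the selected vector that remain after
    removing the feature whose original index is orig_index."""
    if orig_index not in selected_indices:
        return list(range(len(selected_indices)))
    pos = selected_indices.index(orig_index)
    return [i for i in range(len(selected_indices)) if i != pos]


# Top-4 indexes in importance order; dep_delay sits at original index 4 here.
selected = [4, 2, 0, 3]
in_flight_names = names_for_indices(selected, gold_feature_names)
keep = drop_feature(selected, orig_index=4)
pre_departure_names = [in_flight_names[i] for i in keep]

print(in_flight_names)
print(pre_departure_names)
```

The In-Flight vector keeps every selected feature, while the Pre-Departure vector is the same list with the dep_delay position removed, so its positions are renumbered from zero.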

Once done, head over to the Flightmasters_Delay_Prediction notebook, where you will need to fix and run the code.