The objective of this notebook is to choose a model for predicting whether an asteroid is potentially hazardous. To be complete, it has to meet these objectives:
- The model has to achieve a high F-beta score. When predicting a potentially hazardous asteroid, a false negative could be deadly, while a false positive is not ideal but is the far better alternative, so recall should be weighted more heavily than precision.
- We will use Optuna for the hyperparameter search because it samples the search space adaptively rather than exhaustively, so it typically needs fewer trials and less compute than GridSearchCV to find good hyperparameters.
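
To make the first objective concrete, here is a minimal sketch that computes F-beta from confusion counts in plain Python. The counts are invented for illustration and are not taken from the asteroid data, and the helper name `fbeta_from_counts` is hypothetical. With beta = 2, recall is weighted more heavily than precision, so the model that misses fewer hazardous asteroids scores clearly higher even though it raises more false alarms.

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta from true positives, false positives and false negatives."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Hypothetical model A: catches 90 of 100 hazardous asteroids, 60 false alarms.
cautious = dict(tp=90, fp=60, fn=10)
# Hypothetical model B: catches 60 of 100 hazardous asteroids, 10 false alarms.
conservative = dict(tp=60, fp=10, fn=40)

f2_cautious = fbeta_from_counts(**cautious, beta=2.0)
f2_conservative = fbeta_from_counts(**conservative, beta=2.0)
f1_cautious = fbeta_from_counts(**cautious, beta=1.0)
f1_conservative = fbeta_from_counts(**conservative, beta=1.0)

print(f"F2: A={f2_cautious:.3f}  B={f2_conservative:.3f}")  # F2: A=0.818  B=0.638
print(f"F1: A={f1_cautious:.3f}  B={f1_conservative:.3f}")  # F1: A=0.720  B=0.706
```

Under F1 the two hypothetical models are nearly tied, while F2 separates them in favour of the one with fewer false negatives.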

In [3]:
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent))


In [5]:
import optuna
import optuna.visualization as vis
import pandas as pd
from sklearn.linear_model import LogisticRegression as lg
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, make_scorer, fbeta_score, roc_auc_score
from sklearn.pipeline import Pipeline
from neows_project.pipeline import build_pipeline
from xgboost import XGBClassifier
import mlflow
import mlflow.sklearn
import numpy as np
import json

First, we set the MLflow experiment that all runs will be logged under.

In [None]:

mlflow.set_experiment("model_selection")

2026/01/26 16:19:50 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2026/01/26 16:19:50 INFO mlflow.store.db.utils: Updating database tables
2026/01/26 16:19:50 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2026/01/26 16:19:50 INFO alembic.runtime.migration: Will assume non-transactional DDL.
2026/01/26 16:19:50 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2026/01/26 16:19:50 INFO alembic.runtime.migration: Will assume non-transactional DDL.


<Experiment: artifact_location='/Users/ahmedahmed/Documents/GitHub/hazardous_asteroid/notebooks/mlruns/1', creation_time=1769450805859, experiment_id='1', last_update_time=1769450805859, lifecycle_stage='active', name='model_selection', tags={}>

Next, we load the data and build the preprocessing pipeline.

In [6]:
data = pd.read_csv('../data/raw/asteroids_data.csv')



In [15]:
data['is_potentially_hazardous_asteroid'] = data['is_potentially_hazardous_asteroid'].astype(int)
y = data['is_potentially_hazardous_asteroid']
X = data.drop(columns=['is_potentially_hazardous_asteroid'])
print((y == 0).sum())
pip = build_pipeline()

37589


- We will choose between three models: Logistic Regression, XGBoost, and Random Forest.

- Within the MLflow experiment defined earlier, each model gets a parent run, and every Optuna trial is logged as a nested run under it.

- Optuna tunes the hyperparameters of all three models so that each one is compared at its best configuration.

- Inside each Optuna trial, we use the `StratifiedKFold` splitter imported earlier to cross-validate on our data.

Before going further, we make a dictionary holding the three model classes so that they can be looked up by name later on.

In [6]:
models = {
    "LG" : lg,
    "XG" : XGBClassifier,
    "RF" : RandomForestClassifier
}

With the model dictionary in place, we write the function that suggests parameters in Optuna.
- In each trial, Optuna suggests a set of hyperparameters for the model to try; comparing the trials is how it finds the best hyperparameters.
- The function first checks which model is being tuned, then suggests the corresponding parameters.

In [7]:
# model_name is a key of the models dict; it is passed in by the tuning loop
def suggest_params(trial, model_name):
    
    if model_name == "LG":
        return {
            "C" : trial.suggest_float("C", 1e-3, 10.0, log=True),
            "max_iter" : 1000,
            "solver" : "lbfgs"
        }
    
    if model_name == "RF":
        return {
            "n_estimators" : trial.suggest_int("n_estimators", 100, 500),
            "max_depth" : trial.suggest_int("max_depth", 3, 20),
            "min_samples_split" : trial.suggest_int("min_samples_split", 2, 20),
            "min_samples_leaf" : trial.suggest_int("min_samples_leaf", 1, 10),
            "n_jobs" : -1,
            "random_state" : 42
        }
    
    if model_name == "XG":
        return {
            "n_estimators" : trial.suggest_int("n_estimators", 100, 500),
            "max_depth" : trial.suggest_int("max_depth", 3, 10),
            "learning_rate" : trial.suggest_float("learning_rate", 0.01, 0.3),
            "subsample" : trial.suggest_float("subsample", 0.6, 1.0),
            "colsample_bytree" : trial.suggest_float("colsample_bytree", 0.6, 1.0),
            "eval_metric" : "logloss",
            "tree_method" : "hist",
            "random_state" : 42
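        }

    # Fail loudly instead of silently returning None for an unknown key
    raise ValueError(f"Unknown model name: {model_name!r}")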

The final step before writing the objective function is to define the cross-validation method.
- We use Stratified K-Fold with `shuffle=True`. Because the dataset is ordered by time, shuffling assigns records to folds at random, so no fold is drawn from a single time block, and stratification keeps the share of hazardous asteroids the same in every fold.
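
The effect of stratifying an ordered target can be sketched on toy data. The array below is made up (90 negatives followed by 10 positives) and deliberately exaggerates the ordering; it is not the asteroid dataset. It compares an unshuffled `KFold` with the `StratifiedKFold` settings used in this notebook by counting the positives that land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy target: 90 negatives followed by 10 positives, mimicking an ordered file.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

plain = KFold(n_splits=5)  # no shuffling, no stratification
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count positives in each validation fold for both splitters.
plain_pos = [int(y_toy[val].sum()) for _, val in plain.split(X_toy)]
strat_pos = [int(y_toy[val].sum()) for _, val in strat.split(X_toy, y_toy)]

print("KFold:          ", plain_pos)  # [0, 0, 0, 0, 10]
print("StratifiedKFold:", strat_pos)  # [2, 2, 2, 2, 2]
```

With the plain splitter, four of the five validation folds contain no positives at all, so a score such as ROC AUC could not even be computed on them; the stratified splitter gives every fold the same class balance.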

In [8]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [None]:
n_trial = 50

for model_name, model_clfr in models.items():
    def objective(trial, model_name = model_name, model_clfr = model_clfr):
        
        with mlflow.start_run(nested=True):

            params = suggest_params(trial, model_name)

            pipeline_mdl = Pipeline(steps=[
                ("preprocessor", pip),
                ("model", model_clfr(**params))
            ])

            scores = []

            for fold_idx, (train_idx, val_idx) in enumerate(cv.split(X,y)):

                X_train, X_valid = X.iloc[train_idx], X.iloc[val_idx]
                y_train, y_valid = y.iloc[train_idx], y.iloc[val_idx]

                pipeline_mdl.fit(X_train, y_train)
                predicts = pipeline_mdl.predict_proba(X_valid)[:,1]

                fold_score = roc_auc_score(y_valid, predicts)

                scores.append(fold_score)

                mlflow.log_metric(f"fold_score_{fold_idx}", fold_score)

                #At this step we give Optuna the option to prune this trial, using the MedianPruner configured on the study below

                trial.report(float(np.mean(scores)), step = fold_idx)

                if trial.should_prune():
                    mlflow.set_tag("model_type", model_name)
                    mlflow.log_metric("pruned_at_fold", fold_idx)
                    raise optuna.TrialPruned()
                
            final_score = float(np.mean(scores))

            #Now we will log to MLflow

            mlflow.set_tag("model_type", model_name)
            mlflow.log_params(params)
            mlflow.log_metric("cv_auc", final_score)

            return final_score
        
    #Now we will do the parent run for each model in the experiment

    with mlflow.start_run(run_name=f"optuna_{model_name}") as parent_run:
        
        mlflow.set_tag("model_type", model_name)
        mlflow.log_param("n_splits", cv.get_n_splits())
        mlflow.log_param("n_trials", n_trial)

        study = optuna.create_study(
            direction= "maximize",
            pruner = optuna.pruners.MedianPruner(
                n_startup_trials=5,
                n_warmup_steps=1
            )
        )

        study.optimize(objective, n_trials= n_trial)

        #Log the best trial's params and metric to the parent run
        mlflow.log_metric("best_cv_auc", float(study.best_value))
        mlflow.log_param("best_trial_numb", study.best_trial.number)
        mlflow.log_params({f"best_{k}": v for k, v in study.best_params.items()})

        #This is to save the artifact of the best params to later be used in train.py
        best_payload = {
            "model_name" : model_name,
            "best_value" : float(study.best_value),
            "best_params" : study.best_params,
            "best_trial_number" : study.best_trial.number
        }

        dire = Path("../models/params")
        dire.mkdir(parents=True, exist_ok=True)
        local_path = dire / f"best_{model_name}.json"
        with open(local_path, "w") as f:
            json.dump(best_payload, f, indent = 2)
        
        mlflow.log_artifact(str(local_path), artifact_path="best_params")

        


[I 2026-01-26 16:19:52,690] A new study created in memory with name: no-name-b47f8155-e760-4b5b-a379-3d0a250387e9
[I 2026-01-26 16:21:20,535] Trial 0 finished with value: 0.8836510378394365 and parameters: {'C': 0.021728415476241058}. Best is trial 0 with value: 0.8836510378394365.
[I 2026-01-26 16:22:46,773] Trial 1 finished with value: 0.870206635949455 and parameters: {'C': 0.0019977093175949177}. Best is trial 0 with value: 0.8836510378394365.
[I 2026-01-26 16:24:12,691] Trial 2 finished with value: 0.8849481800260474 and parameters: {'C': 0.3440229759158527}. Best is trial 2 with value: 0.8849481800260474.
[I 2026-01-26 16:25:38,795] Trial 3 finished with value: 0.8794383759603083 and parameters: {'C': 0.008697848643443456}. Best is trial 2 with value: 0.8849481800260474.
[I 2026-01-26 16:27:05,338] Trial 4 finished with value: 0.8851605749137483 and parameters: {'C': 0.1842137675141572}. Best is trial 4 with value: 0.8851605749137483.
[I 2026-01-26 16:28:31,084] Trial 5 finished 

In [None]:
# `study` is the last study created by the loop above (the Random Forest one)
vis.plot_param_importances(study)

In [11]:
vis.plot_optimization_history(study)


- Now that we have finished tuning the hyperparameters, we need to define the decision threshold used for our evaluation metric.
- Earlier in the notebook we chose ROC AUC as the tuning metric even though our final metric is F2. F2 is computed from hard class labels, so it does not matter how far a predicted probability is above or below the threshold, only which side it falls on.
- That makes F2 a coarse signal for Optuna: a positive scored at 66% and one scored at 94% count the same, so a trial that ranks the classes better can receive an identical F2 score, and the result also depends on a threshold we have not chosen yet.
- ROC AUC suits hyperparameter tuning because it is threshold-independent: it measures how well the model ranks positive targets above negative ones, so improvements in ranking show up in the score.
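
The difference between the two metrics can be seen on a four-sample toy example, sketched here in plain Python. The scores are invented for illustration, the helper names are hypothetical, and the 0.5 threshold is an assumption rather than the threshold this project will end up using. Both score sets produce the same hard predictions at 0.5, so F2 cannot tell them apart, while ROC AUC rewards the set that ranks every positive above every negative.

```python
def f2_at_threshold(y_true, scores, threshold=0.5):
    """F2 computed from hard labels obtained by thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores_a = [0.10, 0.60, 0.55, 0.90]  # one negative outranks a positive
scores_b = [0.10, 0.60, 0.70, 0.90]  # every positive outranks every negative

f2_a, f2_b = f2_at_threshold(y_true, scores_a), f2_at_threshold(y_true, scores_b)
auc_a, auc_b = roc_auc(y_true, scores_a), roc_auc(y_true, scores_b)

print(f"F2 : A={f2_a:.3f}  B={f2_b:.3f}")    # F2 : A=0.909  B=0.909
print(f"AUC: A={auc_a:.3f}  B={auc_b:.3f}")  # AUC: A=0.750  B=1.000
```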

With hyperparameter tuning done, the next step is to define the threshold for our model, which continues in the threshold.ipynb notebook. The hyperparameters found here will be frozen for the final model trained in train.py.