# Beyond Autopilot: Modeling API Helper Function

This helper function provides an easy way to keep iterating on a project, starting from the models already on the leaderboard, by:

- running more blueprints
- selecting features (based on the ideas of [FIRE](https://www.datarobot.com/blog/using-feature-importance-rank-ensembling-fire-for-advanced-feature-selection/))
- varying the training duration
- applying advanced tuning
- blending

Additionally, the function can sort models by any available metric and remove redundancy from the most accurate model. This notebook demonstrates `beyond_autopilot` through the following examples:

1. AutoML regression project
2. AutoML binary classification project with SHAP
3. AutoML weighted multiclass
4. AutoTS regression
5. AutoTS regression (redundancy removal)
6. AutoML anomaly detection
7. AutoTS anomaly detection
8. Visual AI ([requires external dataset](https://github.com/datarobot/data-science-scripts/blob/master/taylor/beyond_autopilot/XSmall_Housing.zip))
9. Visual AI (sorting models only) ([requires external dataset](https://github.com/datarobot/data-science-scripts/blob/master/taylor/beyond_autopilot/XSmall_Housing.zip))

Please see the docstrings in `beyond_autopilot_helpers.py` or this [presentation](https://docs.google.com/presentation/d/1t1V3FWiBWXvLMEn3mCk_twA_ppqqt1iQ298Gtf1_H14/edit?usp=sharing) for more details. Don't hesitate to reach out to me (taylor.larkin@datarobot.com) with any feedback!

### Imports

In [1]:
# %load_ext autoreload
# %autoreload 2

# imports
import time
from platform import python_version

import datarobot as dr
import pandas as pd
from datarobot.helpers.partitioning_methods import construct_duration_string

from beyond_autopilot_helpers import beyond_autopilot

In [2]:
# DR version
dr.__version__

'3.6.1'

In [3]:
# Python version
python_version()

'3.11.11'

In [4]:
project = dr.Project.get('67a6411ad968c600b0a7c751')

In [5]:
%%time
# Run beyond autopilot function
best_model = beyond_autopilot(
    project_id=project.id,
    worker_count=-1,
    sorting_metric="Weighted Area Under PR Curve",
    max_n_models_to_keep=6,
    run_similar_models_for_top_n_models=True,
    accumulation_ratio=0.95,
    try_fam_featurelist=True,
    training_duration_grid=None,
    # advanced_tuning_grid=[
    #     [
    #         {"parameter_name": "min_ngram", "value": 5},
    #         {"parameter_name": "max_ngram", "value": 10},
    #     ]
    # ],
    # blend_methods=["AVG", "ENET", "LGBM", "KERAS"],
    # max_size_of_blender=6,
    remove_redundancy_from_best_model=True,
    mark_project_name=False,
    wait_for_jobs_to_process_timeout=60,
)
best_model

*** Initializing beyond autopilot process ***

Collecting best models...
         Fewer attributes will be returned in the response, see the docstring for more details. 
  models = project.get_models()

*** Starring current best model ***


*** Running more blueprints ***

Collecting best models...
Total # of blueprints in project: 39
# of similar blueprints to the top n: 9
Running 18 models...
Models are finished!
         Fewer attributes will be returned in the response, see the docstring for more details. 
  models=[x for x in project.get_models() if x.id in model_ids_to_check],
Checking for duplicates in 18 models...
Deleting 12 of 18 models...

*** Feature selection ***

Collecting best models...
         Fewer attributes will be returned in the response, see the docstring for more details. 
  models = project.get_models()
Creating new feature lists...
  jobs.append(dr.ShapImpact.create(project_id=project.id, model_id=model.id))
  dr.ShapImpact.get(project_id=project.id, model_id

Model('Generalized Additive Model')

In [11]:
%%time
from datarobot.helpers.partitioning_methods import construct_duration_string
# Run beyond autopilot function
best_model = beyond_autopilot(
    project_id=project.id,
    worker_count=-1,
    sorting_metric="Weighted LogLoss",
    max_n_models_to_keep=6,
    run_similar_models_for_top_n_models=True,
    accumulation_ratio=0.95,
    try_fam_featurelist=True,
    training_duration_grid=[
        construct_duration_string(months=1),
        construct_duration_string(months=2),
        construct_duration_string(months=4),
    ],
    remove_redundancy_from_best_model=True,
    mark_project_name=False,
    wait_for_jobs_to_process_timeout=60,
)
best_model

### Example 1 - AutoML regression project (more blueprints, feature selection, advanced tuning, blenders, redundancy removal)

In [4]:
# # Start and wait for modeling
# project = dr.Project.create(
#     sourcedata="https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_NBA_2017-2018.csv",
#     project_name="Beyond Autopilot - Regression",
# )
# project.set_target(target="game_score", worker_count=-1)
# project.wait_for_autopilot(check_interval=120)

In [6]:
# %%time
# # Run beyond autopilot function
# best_model = beyond_autopilot(
#     project_id=project.id,
#     worker_count=-1,
#     sorting_metric=None,
#     max_n_models_to_keep=5,
#     run_similar_models_for_top_n_models=True,
#     accumulation_ratio=0.95,
#     try_fam_featurelist=True,
#     training_duration_grid=None,
#     advanced_tuning_grid=[[{"parameter_name": "max_depth", "value": 10}]],
#     blend_methods=["AVG"],
#     max_size_of_blender=3,
#     remove_redundancy_from_best_model=True,
#     mark_project_name=True,
#     wait_for_jobs_to_process_timeout=60,
# )
# best_model

### Example 2 - AutoML binary classification project with SHAP (feature selection)

In [12]:
# # Start and wait for modeling
# project = dr.Project.create(
#     sourcedata="https://s3.amazonaws.com/datarobot_public_datasets/10K_Lending_Club_Loans.csv",
#     project_name="Beyond Autopilot - Binary SHAP",
# )
# project.set_target(
#     target="is_bad",
#     worker_count=-1,
#     partitioning_method=dr.RandomTVH(holdout_pct=0, validation_pct=20),
#     advanced_options=dr.AdvancedOptions(shap_only_mode=True),
# )
# project.wait_for_autopilot(check_interval=120)

In [13]:
# %%time
# # Run beyond autopilot function
# best_model = beyond_autopilot(
#     project_id=project.id,
#     worker_count=-1,
#     sorting_metric="AUC",
#     max_n_models_to_keep=6,
#     run_similar_models_for_top_n_models=False,
#     accumulation_ratio=0.95,
#     try_fam_featurelist=True,
#     training_duration_grid=None,
#     advanced_tuning_grid=None,
#     blend_methods=None,
#     max_size_of_blender=3,
#     remove_redundancy_from_best_model=True,
#     mark_project_name=False,
#     wait_for_jobs_to_process_timeout=60,
# )
# best_model

### Example 3 - AutoML weighted multiclass (more blueprints, advanced tuning, blenders, redundancy removal)

In [12]:
# # Start and wait for modeling
# project = dr.Project.create(
#     sourcedata="https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_Telco_Next_Best_Offer_Multiclass.csv",
#     project_name="Beyond Autopilot - Multiclass",
# )
# project.set_target(
#     target="Offers",
#     partitioning_method=dr.GroupCV(holdout_pct=0, reps=5, partition_key_cols=["State"]),
#     advanced_options=dr.AdvancedOptions(weights="Account_length"),
#     worker_count=-1,
# )
# project.wait_for_autopilot(check_interval=120)

### Example 4 - AutoTS regression (more blueprints, feature selection, training duration, advanced tuning, blenders)

In [14]:
# Start and wait for modeling
project = dr.Project.create(
    sourcedata="https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_Sales_Multiseries_training.xlsx",
    project_name="Beyond Autopilot - Time Series",
)
project.set_target(
    target="Sales",
    worker_count=-1,
    partitioning_method=dr.DatetimePartitioningSpecification(
        use_time_series=True,
        datetime_partition_column="Date",
        multiseries_id_columns=["Store"],
    ),
)
project.wait_for_autopilot(check_interval=120)

In [15]:
%%time
# Run beyond autopilot function
best_model = beyond_autopilot(
    project_id=project.id,
    worker_count=-1,
    sorting_metric="MASE",
    max_n_models_to_keep=6,
    run_similar_models_for_top_n_models=True,
    accumulation_ratio=0.975,
    try_fam_featurelist=True,
    training_duration_grid=[
        construct_duration_string(months=3),
        construct_duration_string(months=6),
        construct_duration_string(months=9),
    ],
    advanced_tuning_grid=[
        [{"parameter_name": "Decay: Type", "value": "linear"}],
        [{"parameter_name": "Decay: Type", "value": "exponential"}],
    ],
    blend_methods=["AVG", "FORECAST_DISTANCE_ENET"],
    max_size_of_blender=2,
    remove_redundancy_from_best_model=False,
    mark_project_name=False,
    wait_for_jobs_to_process_timeout=30,
)
best_model
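
The `training_duration_grid` entries above are built with `construct_duration_string`. As far as I can tell, that helper returns ISO 8601 style duration strings, so the grid is just a list of strings. The sketch below uses a hypothetical stand-in, `iso_duration` (not part of the DataRobot client), to show what such a grid looks like without needing a DataRobot connection. The exact format is an assumption and may differ between client versions.

```python
def iso_duration(years=0, months=0, days=0, hours=0, minutes=0, seconds=0):
    """Hypothetical stand-in for construct_duration_string (assumed format)."""
    return f"P{years}Y{months}M{days}DT{hours}H{minutes}M{seconds}S"


# A grid of three candidate training durations: 3, 6 and 9 months
grid = [iso_duration(months=m) for m in (3, 6, 9)]
print(grid)
```

Each entry would presumably be tried as one candidate training window for the retained models.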

### Example 5 - AutoTS regression (redundancy removal)

In [16]:
%%time
# Run beyond autopilot function
best_model = beyond_autopilot(
    project_id=project.id,
    worker_count=-1,
    sorting_metric=None,
    max_n_models_to_keep=3,
    run_similar_models_for_top_n_models=False,
    accumulation_ratio=None,
    try_fam_featurelist=False,
    training_duration_grid=None,
    advanced_tuning_grid=None,
    blend_methods=None,
    max_size_of_blender=3,
    remove_redundancy_from_best_model=True,
    mark_project_name=True,
    wait_for_jobs_to_process_timeout=30,
)
best_model

### Example 6 - AutoML anomaly detection (more blueprints, feature selection, blenders)

In [17]:
# Start and wait for modeling
project = dr.Project.create(
    sourcedata="https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_AML_Alert.csv",
    project_name="Beyond Autopilot - Anomaly Detection",
)
project.set_target(target=None, unsupervised_mode=True, worker_count=-1)
project.wait_for_autopilot(check_interval=120)

In [18]:
%%time
# Run beyond autopilot function
best_model = beyond_autopilot(
    project_id=project.id,
    worker_count=-1,
    sorting_metric="Synthetic AUC",
    max_n_models_to_keep=7,
    run_similar_models_for_top_n_models=True,
    accumulation_ratio=0.975,
    try_fam_featurelist=True,
    training_duration_grid=None,
    advanced_tuning_grid=None,
    blend_methods=["AVG", "MIN", "MAX"],
    max_size_of_blender=2,
    remove_redundancy_from_best_model=True,
    mark_project_name=False,
    wait_for_jobs_to_process_timeout=60,
)
best_model

### Example 7 - AutoTS anomaly detection (feature selection, training duration, blenders)

In [19]:
# Start and wait for modeling
project = dr.Project.create(
    sourcedata="https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_Sales_Multiseries_training.xlsx",
    project_name="Beyond Autopilot - TS AD",
)
project.set_target(
    target=None,
    unsupervised_mode=True,
    worker_count=-1,
    partitioning_method=dr.DatetimePartitioningSpecification(
        use_time_series=True,
        datetime_partition_column="Date",
        multiseries_id_columns=["Store"],
        number_of_backtests=1,
    ),
)
project.wait_for_autopilot(check_interval=120)

In [20]:
%%time
# Run beyond autopilot function
best_model = beyond_autopilot(
    project_id=project.id,
    worker_count=-1,
    sorting_metric=None,
    max_n_models_to_keep=5,
    run_similar_models_for_top_n_models=False,
    accumulation_ratio=0.90,
    try_fam_featurelist=False,
    training_duration_grid=[construct_duration_string(years=1)],
    advanced_tuning_grid=None,
    blend_methods=["AVG", "MIN", "MAX"],
    max_size_of_blender=4,
    remove_redundancy_from_best_model=True,
    mark_project_name=False,
    wait_for_jobs_to_process_timeout=60,
)
best_model

### Example 8 - Visual AI (more blueprints, feature selection, advanced tuning, blenders, redundancy removal)

In [21]:
# Start and wait for modeling
project = dr.Project.create(
    sourcedata="./XSmall_Housing.zip", project_name="Beyond Autopilot - Visual AI"
)
project.set_target(target="price", worker_count=-1)
project.wait_for_autopilot(check_interval=120)

In [22]:
%%time
# Run beyond autopilot function
best_model = beyond_autopilot(
    project_id=project.id,
    worker_count=-1,
    sorting_metric=None,
    max_n_models_to_keep=5,
    run_similar_models_for_top_n_models=True,
    accumulation_ratio=None,
    try_fam_featurelist=False,
    training_duration_grid=None,
    advanced_tuning_grid=[
        [{"parameter_name": "use_low_level_features", "value": True}],
        [
            {"parameter_name": "use_low_level_features", "value": True},
            {"parameter_name": "use_medium_level_features", "value": True},
            {"parameter_name": "use_high_level_features", "value": True},
        ],
        [
            {"parameter_name": "use_low_level_features", "value": True},
            {"parameter_name": "use_medium_level_features", "value": False},
            {"parameter_name": "use_high_level_features", "value": False},
            {"parameter_name": "use_highest_level_features", "value": False},
        ],
    ],
    blend_methods=["GLM"],
    max_size_of_blender=2,
    remove_redundancy_from_best_model=True,
    mark_project_name=True,
    wait_for_jobs_to_process_timeout=60,
)
best_model

### Example 9 - Visual AI (sorting models only)

In [23]:
%%time
# Run beyond autopilot function
beyond_autopilot(
    project_id=project.id,
    worker_count=-1,
    sorting_metric="FVE Poisson",
    max_n_models_to_keep=3,
    run_similar_models_for_top_n_models=False,
    accumulation_ratio=None,
    try_fam_featurelist=False,
    training_duration_grid=None,
    advanced_tuning_grid=None,
    blend_methods=None,
    max_size_of_blender=3,
    remove_redundancy_from_best_model=True,
    mark_project_name=False,
    wait_for_jobs_to_process_timeout=30,
)

In [1]:
from fire import Fire

In [2]:
f = Fire(project_id = '67a6411ad968c600b0a7c751')

In [5]:
f2 = Fire(project_id = '6792be89d7b608a4f52457c4')
best_models2 = f2.get_best_models()

In [6]:
best_models = f.get_best_models()

In [7]:
best_models, len(best_models2)

(0     DatetimeModel('Dropout Additive Regression Tre...
 1     DatetimeModel('eXtreme Gradient Boosted Trees ...
 2     DatetimeModel('eXtreme Gradient Boosted Trees ...
 3     DatetimeModel('eXtreme Gradient Boosted Trees ...
 4     DatetimeModel('Light Gradient Boosted Trees Cl...
 5     DatetimeModel('Gradient Boosted Trees Classifi...
 6     DatetimeModel('Light Gradient Boosting on Elas...
 7     DatetimeModel('Elastic-Net Classifier (L2 / Bi...
 8     DatetimeModel('eXtreme Gradient Boosted Trees ...
 9     DatetimeModel('Elastic-Net Classifier (mixing ...
 10    DatetimeModel('Elastic-Net Classifier (L2 / Bi...
 11    DatetimeModel('Elastic-Net Classifier (L2 / Bi...
 12    DatetimeModel('Elastic-Net Classifier (L2 / Bi...
 13    DatetimeModel('Elastic-Net Classifier (L2 / Bi...
 14    DatetimeModel('Stochastic Gradient Descent Cla...
 15                 DatetimeModel('Logistic Regression')
 16    DatetimeModel('Elastic-Net Classifier (mixing ...
 17    DatetimeModel('Elastic-N

In [11]:
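# Series.append was removed in pandas 2.0, so combine the two selections with pd.concat
for model in pd.concat([best_models[:4], best_models2[:3]]):
    try:
        model.request_feature_impact(
            # row_count = 1e5,
            # metadata = True,
        )
    except Exception:
        # Best effort: the request fails, e.g., when feature impact was already requested
        pass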




ratio = 0.99

for model in pd.concat([best_models[:4], best_models2[:3]]):
    # This can take some time to compute feature impact
    feature_impact = pd.DataFrame(
        model.get_or_request_feature_impact(
            max_wait=900,
        )
    )  # 15min

    # Track model name and ID for bookkeeping purposes
    feature_impact["model_type"] = model.model_type
    feature_impact["model_id"] = model.id
    # By sorting and re-indexing, the new index becomes our 'ranking'
    feature_impact = feature_impact.sort_values(
        by="impactUnnormalized", ascending=False
    ).reset_index(drop=True)
    feature_impact["rank"] = feature_impact.index.values

    # Add to our master list of all models' feature ranks
    f.all_impact = pd.concat(
        [f.all_impact, feature_impact], ignore_index=True
    )

# We need to get a threshold number of features to select based on cumulative sum of impact
all_impact_agg = (
    f.all_impact.groupby("featureName")[
        ["impactNormalized", "impactUnnormalized"]
    ]
    .sum()
    .sort_values("impactUnnormalized", ascending=False)
    .reset_index()
)

# Calculate cumulative feature impact and keep the leading features that together hold <ratio> of the total impact
all_impact_agg["impactCumulative"] = all_impact_agg[
    "impactUnnormalized"
].cumsum()
total_impact = all_impact_agg["impactCumulative"].max() * ratio
tmp_fl = list(
    set(
        all_impact_agg[all_impact_agg.impactCumulative <= total_impact][
            "featureName"
        ].values.tolist()
    )
)

# This is the number of features to keep
n_feats = len(tmp_fl)

# get top features based on median rank
top_ranked_feats = list(
    f.all_impact.groupby("featureName")['rank']
    .median()
    .sort_values(ascending=True)
    .head(n_feats)
    .index.values
)



In [12]:
fi = f.all_impact.groupby("featureName")['rank'].median().sort_values(ascending=True)
fi.to_csv('fi_20250205.csv')
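
The cells above follow a FIRE-style selection: sum the unnormalized impact per feature across models, count how many features fit within `ratio` of the total impact, then keep that many features by median rank. The following is a minimal, dependency-free sketch of the same logic on made-up numbers. The function name `fire_select` and the toy feature names are illustrative only and do not come from `fire.py` or the DataRobot client.

```python
from collections import defaultdict
from statistics import median

# Toy unnormalized feature impact per model (made-up numbers)
impact_by_model = {
    "model_a": {"tenure": 0.50, "plan": 0.30, "region": 0.15, "noise": 0.05},
    "model_b": {"plan": 0.45, "tenure": 0.40, "noise": 0.10, "region": 0.05},
    "model_c": {"tenure": 0.60, "region": 0.25, "plan": 0.10, "noise": 0.05},
}


def fire_select(impact_by_model, ratio):
    """Return the features kept by a FIRE-style median-rank selection."""
    total = defaultdict(float)  # summed impact per feature
    ranks = defaultdict(list)   # per-model rank of each feature (0 = best)
    for impacts in impact_by_model.values():
        ordered = sorted(impacts, key=impacts.get, reverse=True)
        for rank, feature in enumerate(ordered):
            ranks[feature].append(rank)
            total[feature] += impacts[feature]

    # Count features whose cumulative summed impact stays within the threshold
    threshold = sum(total.values()) * ratio
    cumulative, n_feats = 0.0, 0
    for value in sorted(total.values(), reverse=True):
        cumulative += value
        if cumulative <= threshold:
            n_feats += 1

    # Keep the n_feats features with the lowest median rank
    median_rank = {f: median(r) for f, r in ranks.items()}
    return sorted(median_rank, key=median_rank.get)[:n_feats]


selected = fire_select(impact_by_model, ratio=0.90)
print(selected)  # ['tenure', 'plan']
```

Ranking by the median rather than by summed impact keeps a single model with unusually large impact values from dominating the selection, which is the main reason to aggregate over several models.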

In [14]:
df = dr.Project.get('67a6411ad968c600b0a7c751').get_dataset()

In [15]:
output = pd.Series(df.get_featurelists()[2].features).to_frame('Feature').reset_index()
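output = output.merge(fi.reset_index(), left_on='Feature', right_on='featureName', how='left')
# Flag features that received a rank, i.e. were used by at least one of the scored models
output['included_in_model'] = output.featureName.notna().astype(int)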

In [16]:
output.to_csv('fi2.csv',index=False)

In [43]:
fi

In [17]:
output

| | index | Feature | featureName | rank | included_in_model |
|---|---|---|---|---|---|
| 0 | 0 | idsubscrip | | | 0 |
| 1 | 1 | idsystem | | | 0 |
| 2 | 2 | pubcode | | | 0 |
| 3 | 3 | observation_date | | | 0 |
| 4 | 4 | churned | | | 0 |
| ... | ... | ... | ... | ... | ... |
| 244 | 244 | high_affinity_ratio | high_affinity_ratio | 38.0 | 1 |
| 245 | 245 | affinity_segment | affinity_segment | 58.0 | 1 |
| 246 | 246 | enews_active_days | enews_active_days | 129.0 | 1 |
| 247 | 247 | enews_sessions | enews_sessions | 122.0 | 1 |
