The selected_features function reads the feature importances from a trained GradientBoostingRegressor model and drops the features whose importance is zero.

It takes two arguments: best_model_grid, the trained model, and feature_names, the feature names in the same order as the model's input features. The function reads the importance values from the model's feature_importances_ attribute and pairs them with the feature names. It then keeps only the features whose importance is greater than zero and returns two lists: the selected feature names (selected_features) and their importance values (selected_importances). The result shows which features contribute to the model's predictions, so later analysis can focus on them.
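
The pairing and filtering steps described above can be tried without the full pipeline. The following is only a minimal sketch: the `SimpleNamespace` object is a hypothetical stand-in for a fitted model (it exposes nothing except `feature_importances_`), and the function name, feature names, and importance values are invented for illustration.

```python
from types import SimpleNamespace


def filter_nonzero_importances(model, feature_names):
    """Return (names, importances) for the features whose importance is > 0."""
    # Pair each feature name with its importance, in input-column order.
    pairs = zip(feature_names, model.feature_importances_)
    kept = [(name, imp) for name, imp in pairs if imp > 0]
    names = [name for name, _ in kept]
    importances = [imp for _, imp in kept]
    return names, importances


# Hypothetical stand-in for a fitted model: only the attribute used above.
toy_model = SimpleNamespace(feature_importances_=[0.7, 0.0, 0.3])
toy_names = ["feature_a", "feature_b", "feature_c"]

names, importances = filter_nonzero_importances(toy_model, toy_names)
print(names)        # ['feature_a', 'feature_c']
print(importances)  # [0.7, 0.3]
```

Because the filter only relies on the `feature_importances_` attribute, the same logic should apply to any estimator that exposes it, although that is an assumption about the estimator rather than something this notebook verifies.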

In [1]:
import joblib  # explicit import; harmless if run_script.ipynb already provides it

%run ./run_script.ipynb

conf = get_conf()

# Load the data sources once and reuse the result
datasources = get_datasources(conf)
trans = datasources["trans_info"]
item = datasources["item_info"]
stores = datasources["outlets_info"]
no_categories = conf['params']["no_categories"]
cutoff = conf['train_test_split']["cutoff"]

data = create_pipeline(conf, trans, item, stores, no_categories)
X_train, X_test, y_train, y_test = create_train_test_split(data, cutoff)

# Load the saved model
best_model_grid = joblib.load('best_model_grid.pkl')
# Load the saved predictions
y_pred = joblib.load('y_pred.pkl')

In [2]:
def selected_features(best_model_grid, feature_names):
    """
    Compute feature importances and drop features with zero importance.

    Args:
        best_model_grid: GradientBoostingRegressor
            The trained GradientBoostingRegressor model.
        feature_names: list-like
            Feature names, in the same order as the model's input features.

    Returns:
        selected_features: list
            Names of the features with non-zero importance.
        selected_importances: list
            Importances of the selected features, in the same order.
    """
    
    # Calculating the feature importance
    feature_importance = best_model_grid.feature_importances_
    feature_importance_list = list(zip(feature_names, feature_importance))

    # Filter the list to keep only features with non-zero importance
    non_zero_importance_list = [(feature, importance) for feature, importance in feature_importance_list if importance > 0]
    selected_features = [feature for feature, _ in non_zero_importance_list]
    selected_importances = [importance for _, importance in non_zero_importance_list]

    return selected_features, selected_importances
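
The function returns features in input-column order, and a strict `importance > 0` test also keeps values that are only numerically non-zero. One possible follow-up, sketched here under assumptions, is to rank the pairs and apply a small cutoff. The helper `rank_features` and the `min_importance` value of `1e-12` are hypothetical choices, not part of the original notebook. The sample values are a subset copied from the output printed for the last cell of this notebook.

```python
def rank_features(feature_names, importances, min_importance=0.0):
    """Sort (name, importance) pairs from most to least important.

    Pairs whose importance is not strictly above min_importance are dropped.
    """
    pairs = [
        (name, float(imp))
        for name, imp in zip(feature_names, importances)
        if imp > min_importance
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# A subset of the values printed in this notebook's output.
example_names = [
    "week",
    "total_sales_qty",
    "prev_2_weeks_sales",
    "fe_sales_to_max_sales_ratio",
    "row",
]
example_importances = [
    3.039727722293066e-06,
    0.4148315190989942,
    0.10046285617135953,
    0.3655548923438419,
    1.3967071384277558e-18,
]

# Hypothetical cutoff: treat anything at or below 1e-12 as effectively zero.
ranked = rank_features(example_names, example_importances, min_importance=1e-12)
for name, importance in ranked:
    print(f"{name}: {importance:.6f}")
```

With this cutoff, `row` is dropped and `total_sales_qty` is listed first. Whether a cutoff is appropriate, and how large it should be, depends on the analysis.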

In [3]:
# Store the results under distinct names so the selected_features function is not overwritten
selected_names, selected_importances = selected_features(best_model_grid, X_train.columns)

print("Selected Features:")
for feature, importance in zip(selected_names, selected_importances):
    print(f"{feature}: {importance}")

Selected Features:
week: 3.039727722293066e-06
total_sales_qty: 0.4148315190989942
fe_avg_4_week_sales: 0.00014374864754362605
fe_4_weeks_std_dev_weekly: 0.00011033290487125094
fe_4_weeks_weekly_min_sales: 0.00024416319589802164
fe_4_weeks_weekly_max_sales: 3.6092634952222513e-05
previous_week_sales: 0.001643628785686086
prev_2_weeks_sales: 0.10046285617135953
prev_3_weeks_sales: 0.02955680181413554
prev_month_sales: 0.011445100558770268
outlet_min_sales: 2.1533987211492467e-05
outlet_max_sales: 6.064567375959897e-05
fe_sales_change_vs_next_week: 0.07050701863007254
fe_sales_change_vs_previous_week: 0.0015875116656360899
fe_sales_to_max_sales_ratio: 0.3655548923438419
fe_cumulative_sales: 0.0002818589741753419
month: 0.00014687788045145142
week_month: 4.3370444181462885e-05
week_year: 3.7922222230864143e-05
quarter_year: 2.3651260618248682e-06
row: 1.3967071384277558e-18
outlet_code_D: 3.207080872700897e-05
item_department_Beverages: 1.900800190008742e-06
item_department_Chilled: 2.369