# Explainer notebook

In this notebook I'll explain the steps and thought process behind all my data processing, feature engineering, model creation, validation and optimizations.

The API will consume the outputs of the process described here and will be documented separately.


All the functions I wrote here were later organized into separate modules.

In [1]:
import os
import shutil

import kagglehub
import matplotlib.pyplot as plt
import mlflow
import mlflow.xgboost
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit, train_test_split



# Objectives

My chosen task for this challenge is to build and deploy an agricultural commodity price forecasting model.

For this I chose a US avocado prices and sales dataset, which can be found here: https://www.kaggle.com/datasets/neuromusic/avocado-prices/data

The dataset has weekly entries for organic and conventional avocados in many regions across the US.

My specific task is a 4-week-ahead forecast of the average price of an avocado. For this I'll train a separate regression model per region.

There are out-of-the-box time series forecasting models, but I chose to implement a regression model (XGBoost) for a few reasons:
1) In my experience, if well tuned it tends to perform better.
2) Since this is an exercise whose objective is purely technical evaluation and that will not be developed further in the real world, I believe a custom solution gives me more room to show machine learning engineering skills than models like Prophet.
3) I had some free time this week. In a more time-constrained environment, where a couple of percentage points of accuracy are not essential, I would probably take the easier route.

## Data understanding

## Loading and preprocessing

In [2]:
def load_raw_data():
    # Download the latest version of the dataset (kagglehub returns its local cache path)
    path = kagglehub.dataset_download("neuromusic/avocado-prices")

    destination_folder = "./data"
    # Create the destination folder if it doesn't exist
    os.makedirs(destination_folder, exist_ok=True)

    # Copy all files from the download folder to the destination,
    # leaving the original download intact so the function can be re-run
    for filename in os.listdir(path):
        source_file = os.path.join(path, filename)
        destination_file = os.path.join(destination_folder, filename)
        shutil.copy2(source_file, destination_file)

    df = pd.read_csv(os.path.join(destination_folder, 'avocado.csv'))
    return df

df = load_raw_data()

In [3]:
df

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.70,109149.67,130.50,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.00,71976.41,72.58,5811.16,5677.40,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.60,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18244,7,2018-02-04,1.63,17074.83,2046.96,1529.20,0.00,13498.67,13066.82,431.85,0.0,organic,2018,WestTexNewMexico
18245,8,2018-01-28,1.71,13888.04,1191.70,3431.50,0.00,9264.84,8940.04,324.80,0.0,organic,2018,WestTexNewMexico
18246,9,2018-01-21,1.87,13766.76,1191.92,2452.79,727.94,9394.11,9351.80,42.31,0.0,organic,2018,WestTexNewMexico
18247,10,2018-01-14,1.93,16205.22,1527.63,2981.04,727.01,10969.54,10919.54,50.00,0.0,organic,2018,WestTexNewMexico


The DataFrame has an "Unnamed: 0" column with repeating indices that track the weeks of the year for each region and each type of avocado. Since I'll be working with the dates later, this column will be dropped.

The date column needs to be a datetime and the column names are not standardized, so I'll write a preprocessing function to fix these details.

In [4]:
def preprocess_raw_data(df):
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.sort_values('Date')
    # df = df.set_index('Date')
    df = df.drop(['Unnamed: 0', 'year'], axis=1)
    df.columns = df.columns.str.strip()
    df.columns = df.columns.str.replace(' ', '')
    df.columns = df.columns.str[0].str.upper() + df.columns.str[1:]
    return df

df = preprocess_raw_data(df)

In [5]:
df

Unnamed: 0,Date,AveragePrice,TotalVolume,4046,4225,4770,TotalBags,SmallBags,LargeBags,XLargeBags,Type,Region
11569,2015-01-04,1.75,27365.89,9307.34,3844.81,615.28,13598.46,13061.10,537.36,0.00,organic,Southeast
9593,2015-01-04,1.49,17723.17,1189.35,15628.27,0.00,905.55,905.55,0.00,0.00,organic,Chicago
10009,2015-01-04,1.68,2896.72,161.68,206.96,0.00,2528.08,2528.08,0.00,0.00,organic,HarrisburgScranton
1819,2015-01-04,1.52,54956.80,3013.04,35456.88,1561.70,14925.18,11264.80,3660.38,0.00,conventional,Pittsburgh
9333,2015-01-04,1.64,1505.12,1.27,1129.50,0.00,374.35,186.67,187.68,0.00,organic,Boise
...,...,...,...,...,...,...,...,...,...,...,...,...
8574,2018-03-25,1.36,908202.13,142681.06,463136.28,174975.75,127409.04,103579.41,22467.04,1362.59,conventional,Chicago
9018,2018-03-25,0.70,9010588.32,3999735.71,966589.50,30130.82,4014132.29,3398569.92,546409.74,69152.63,conventional,SouthCentral
18141,2018-03-25,1.42,163496.70,29253.30,5080.04,0.00,129163.36,109052.26,20111.10,0.00,organic,SouthCentral
17673,2018-03-25,1.70,190257.38,29644.09,70982.10,0.00,89631.19,89424.11,207.08,0.00,organic,California


## Aggregations

Since I want to forecast the average avocado price, I need to aggregate the data, which is currently split by type.

In [6]:
# Receives a group and combines all the variables
def aggregate_types(group):
    total_volume = group['TotalVolume'].sum()
    weighted_avg = (group['AveragePrice'] * group['TotalVolume']).sum() / total_volume
    return pd.Series({
        'Date': group['Date'].iloc[0],
        'Region': group['Region'].iloc[0],
        'AveragePrice_combined': weighted_avg, # The average price is weighted by the total volumes
        'TotalVolume_combined': total_volume,
        '4046_combined': group['4046'].sum(),
        '4225_combined': group['4225'].sum(),
        '4770_combined': group['4770'].sum(),
        'TotalBags_combined': group['TotalBags'].sum(),
        'SmallBags_combined': group['SmallBags'].sum(),
        'LargeBags_combined': group['LargeBags'].sum(),
        'XLargeBags_combined': group['XLargeBags'].sum(),
    })

def group_by_region(df):
    combined_df = df.groupby(['Date', 'Region'])[df.columns].apply(aggregate_types).reset_index(drop=True)
    return combined_df

grouped_df = group_by_region(df)

Here is the data before and after the aggregation for an example date and region:

In [7]:
df[(df['Date'] == '2015-01-04') & (df['Region'] == 'Albany')]

Unnamed: 0,Date,AveragePrice,TotalVolume,4046,4225,4770,TotalBags,SmallBags,LargeBags,XLargeBags,Type,Region
9177,2015-01-04,1.79,1373.95,57.42,153.88,0.0,1162.65,1162.65,0.0,0.0,organic,Albany
51,2015-01-04,1.22,40873.28,2819.5,28287.42,49.9,9716.46,9186.93,529.53,0.0,conventional,Albany


In [8]:
grouped_df[(grouped_df['Date'] == '2015-01-04') & (grouped_df['Region'] == 'Albany')]

Unnamed: 0,Date,Region,AveragePrice_combined,TotalVolume_combined,4046_combined,4225_combined,4770_combined,TotalBags_combined,SmallBags_combined,LargeBags_combined,XLargeBags_combined
0,2015-01-04,Albany,1.238537,42247.23,2876.92,28441.3,49.9,10879.11,10349.58,529.53,0.0
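
As a quick sanity check, the combined price in the output above can be reproduced by hand from the two Albany rows. The sketch below only redoes the arithmetic of aggregate_types with the values printed in the outputs; the variable names are illustrative and not part of the pipeline.

```python
# Values copied from the two Albany rows for 2015-01-04 shown above
organic_price, organic_volume = 1.79, 1373.95
conventional_price, conventional_volume = 1.22, 40873.28

total_volume = organic_volume + conventional_volume

# Volume-weighted average price, the same formula used in aggregate_types
weighted_avg = (
    organic_price * organic_volume + conventional_price * conventional_volume
) / total_volume

print(round(total_volume, 2))  # 42247.23
print(round(weighted_avg, 6))  # 1.238537
```

The combined price sits very close to the conventional price because conventional avocados account for almost all of the volume in this row.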


## Pivoting

If I train the models on this aggregated data alone, I lose the information about the dynamics between organic and conventional sales, so let's get this info back. For this I'll pivot the data and add the per-type values as columns to my aggregated dataset.

In [9]:
def pivot_and_merge_numerical_columns(df, grouped_df, target_name):
    numerical_columns = df.select_dtypes(include=['number']).columns
    # numerical_columns = numerical_columns.drop('AveragePrice')
    # Pivots around date and region so that each type's values become separate columns
    pivot_df = df.pivot(index=['Date','Region'], columns='Type', values=numerical_columns).reset_index()
    # Uses _ to join the two column levels into a single name
    pivot_df.columns = pivot_df.columns.map(lambda col: '_'.join(map(str, col)).strip('_'))
    # Merges the data back together with the grouped DF
    merge_df = pd.merge(pivot_df, df[['Date', 'Type', 'Region']], on=['Date', 'Region'], how='left')
    merge_df = pd.merge(merge_df, grouped_df, on=['Date', 'Region'], how='left')
    # Since I pivoted the columns, each row now has the AveragePrice
    # for both organic and conventional, AND a type flag separating the entries. But since
    # I need a target for the forecast, another merge is done to get the price for the row's type
    merge_df = pd.merge(merge_df, df[['Region', 'Date', 'Type', target_name]], on=['Date', 'Region', 'Type'], how='left')
    return merge_df

target_name = 'AveragePrice'
merge_df = pivot_and_merge_numerical_columns(df, grouped_df, target_name)
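
To make the pivot step easier to follow, here is a minimal sketch on a toy frame with a single date and region. It only illustrates the pivot and the column-name flattening used in pivot_and_merge_numerical_columns; the toy_df and toy_pivot names are made up for this example, and the values are taken from the Albany rows shown earlier.

```python
import pandas as pd

# Toy data: one date, one region, two avocado types
toy_df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-04', '2015-01-04']),
    'Region': ['Albany', 'Albany'],
    'Type': ['organic', 'conventional'],
    'AveragePrice': [1.79, 1.22],
    'TotalVolume': [1373.95, 40873.28],
})

# One row per (Date, Region); each (value, type) pair becomes its own column
toy_pivot = toy_df.pivot(
    index=['Date', 'Region'],
    columns='Type',
    values=['AveragePrice', 'TotalVolume'],
).reset_index()

# Flatten the two-level column names,
# e.g. ('AveragePrice', 'organic') -> 'AveragePrice_organic' and ('Date', '') -> 'Date'
toy_pivot.columns = toy_pivot.columns.map(lambda col: '_'.join(map(str, col)).strip('_'))

print(toy_pivot.columns.tolist())
print(toy_pivot.shape)
```

The two input rows collapse into one row that carries both the organic and the conventional values side by side.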

In [10]:
merge_df[['Date', 'Region', 'Type' ,'AveragePrice_conventional', 'AveragePrice_organic', 'AveragePrice']].head()

Unnamed: 0,Date,Region,Type,AveragePrice_conventional,AveragePrice_organic,AveragePrice
0,2015-01-04,Albany,organic,1.22,1.79,1.79
1,2015-01-04,Albany,conventional,1.22,1.79,1.22
2,2015-01-04,Atlanta,conventional,1.0,1.76,1.0
3,2015-01-04,Atlanta,organic,1.0,1.76,1.76
4,2015-01-04,BaltimoreWashington,organic,1.08,1.29,1.29


I now have two entries for each date/region pair, one for organic and one for conventional, and each row also carries the values of the other type. I can use this to calculate lagged features later, and since every row only combines values from its own date, this does not leak future information into the observations.

Combining everything into one function, we get:

In [None]:
def make_stage_1_data(configs):
    df = load_raw_data()
    df = preprocess_raw_data(df)
    grouped_df = group_by_region(df)
    merge_df = pivot_and_merge_numerical_columns(df, grouped_df, configs['target_name'])
    return merge_df

# Feature engineering

Now I need to create features to feed my regression model.

Since this is an exercise, I'll add only lags and time-related features. In a real-world scenario, more transformations could be applied and tested to improve performance, such as seasonal decompositions, rolling statistics and cyclical features.

I need a function to select a single region, since I'll be training a separate model for each one.

In [11]:
def select_region(merge_df, region):
    sel_df = merge_df[merge_df['Region'] == region].reset_index(drop=True)
    return sel_df.drop(columns='Region')

## Lags

This function creates lagged columns for a single region, with an option to add the region name to the created column names (this will be useful later).

In [12]:
def make_lags_single_column(sel_df, lags, lag_column, region_name=False):
    '''Create lag features for a single region and lag column.

    Parameters:
    sel_df (pd.DataFrame): The input DataFrame containing the data.
    lags (list): A list of integers representing the lag periods to create.
    lag_column (str): The name of the column for which to create lag features.
    region_name (str or bool, optional): The name of the region to include in the lag feature names. 
                                         If False, the region name is not included. Default is False.

    Returns:
    pd.DataFrame: A DataFrame containing the lag features.'''
    # TODO Raise error if sel_df has multiple regions
    lag_feats = pd.DataFrame(index=sel_df.index)
    for lag in lags:
        if region_name:
            lag_feats[f'{region_name}_{lag_column}_lag_{lag}'] = sel_df.groupby('Type')[lag_column].shift(lag)
        else:
            lag_feats[f'{lag_column}_lag_{lag}'] = sel_df.groupby('Type')[lag_column].shift(lag)
    return lag_feats
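
The grouped shift is the core of the function above, so here is a small sketch of what it does on a toy single-region frame. The data and the toy_df name are made up for illustration; only the groupby('Type') followed by shift mirrors the original code.

```python
import pandas as pd

# Toy single-region frame: two rows per week, one per type (prices are made up)
toy_df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-01-04', '2015-01-04',
                            '2015-01-11', '2015-01-11',
                            '2015-01-18', '2015-01-18']),
    'Type': ['organic', 'conventional'] * 3,
    'AveragePrice': [1.79, 1.22, 1.77, 1.24, 1.93, 1.17],
})

# Shifting inside each Type group moves values by whole weeks,
# instead of by rows (which would mix organic and conventional)
toy_df['AveragePrice_lag_1'] = toy_df.groupby('Type')['AveragePrice'].shift(1)

print(toy_df)
```

Note that this treats "one row back within the type" as "one week back", which assumes the rows are sorted by date and that no week is missing for that type.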

Here is a function that calls make_lags_single_column() on the desired columns and handles the logic for adding the region name to the column names.

In [13]:
def make_region_lags(region_filt_df, region_name, columns_to_lag, lags, region_name_in_col=False):
    lag_features = []
    for col_name in columns_to_lag:
        if region_name_in_col:
            lag_features.append(make_lags_single_column(region_filt_df, lags, col_name, region_name))
        else:
            lag_features.append(make_lags_single_column(region_filt_df, lags, col_name))
    return pd.concat(lag_features, axis=1)

Now I need a way to make the lagged features for a specific region from the merged dataframe.

And while we are at it: could the market behaviour in region A influence the prices in region B?

If we only calculate the lags from the target region, the models will lose this information.

To account for this, I'll also calculate lags from a set of aggregated regions in the merged dataset and add them to each region's model. This way I can feed general market information to the models.

Those "exogenous" regions will be called aux_regions.

To coordinate all these lagging and auxiliary-feature operations, I'll make a JSON-like configs dictionary that specifies what needs to be done.

In [14]:
configs = {
        "target_name": "AveragePrice", # Name of the target variable to predict
        "lags":[4], # The lags to be calculated for all the columns of the selected target region
        "aux_regions": ['TotalUS'], # "Exogenous" (auxiliary) regions to be used as inputs when predicting the target region
        "aux_features": ['AveragePrice_combined', 'TotalVolume_combined', # Features of the aux regions to include
                         '4046_combined', '4225_combined', '4770_combined', 
                         'TotalBags_combined', 'SmallBags_combined', 
                         'LargeBags_combined', 'XLargeBags_combined'],
        "aux_lags": [4], # The lags of the aux features to be calculated
    }

# Here let's use only the fourth lag and a single aux_region to simplify the example

Since the objective is a 4-week-ahead forecast, the minimum lag I can use is 4. With a shorter lag the model would not work in the real world, because the feature would not be available yet at prediction time.

E.g., if I'm in week 5 and forecasting week 9 (4 weeks ahead), a lag-2 feature would need information from week 7, which is still in the future.
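
The constraint above can be written as a small guard. This is only a sketch: check_lags is a hypothetical helper that does not exist in the original modules, and the week numbers simply replay the example from the text.

```python
def check_lags(lags, horizon):
    """Raise if any lag would need data that is not yet available at forecast time."""
    too_short = [lag for lag in lags if lag < horizon]
    if too_short:
        raise ValueError(
            f"Lags {too_short} are shorter than the forecast horizon of {horizon} weeks"
        )
    return True

current_week, horizon = 5, 4
target_week = current_week + horizon  # the week being forecast: 9
needed_week = target_week - 2         # a lag-2 feature needs week 7
print(needed_week > current_week)     # True: week 7 is still in the future

check_lags([4, 8, 13, 26, 52], horizon)  # all of these lags are usable
```

A check like this could run when the configs are loaded, so that an unusable lag fails early instead of silently producing a model that cannot be served.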

Now, finally, I can write the function that receives the complete merged dataframe and the specified configs, and returns the lagged columns.

In [15]:
def make_target_region_lags_df(merge_df, target_region, train_configs):
    # Lags for the target region
    region_filt_df = select_region(merge_df, target_region)
    columns_to_lag = merge_df.loc[:, merge_df.columns != train_configs['target_name']].select_dtypes(
        include=['number']).columns
    target_region_lags_df = make_region_lags(region_filt_df, target_region, columns_to_lag, train_configs['lags'])

    # Lags for auxiliary regions
    aux_regions_lags = []
    for aux_region_name in train_configs['aux_regions']:
        # A region is never used as its own auxiliary region
        if aux_region_name == target_region:
            continue

        # select_region already returns a fresh 0..n-1 index, which is what
        # the concat below aligns on
        aux_region_filt_df = select_region(merge_df, aux_region_name)
        aux_regions_lags.append(
            make_region_lags(
                aux_region_filt_df,
                aux_region_name,
                train_configs['aux_features'],
                train_configs['aux_lags'],
                region_name_in_col=True))

    # pd.concat raises on an empty list, e.g. when the only aux region is the target itself
    if not aux_regions_lags:
        return target_region_lags_df
    aux_regions_lags_df = pd.concat(aux_regions_lags, axis=1)

    return pd.concat([target_region_lags_df, aux_regions_lags_df], axis=1)

Now I can get my lags with:

In [16]:
region = 'Albany'
lags_df = make_target_region_lags_df(merge_df, region, configs)

In [17]:
lags_df

Unnamed: 0,AveragePrice_conventional_lag_4,AveragePrice_organic_lag_4,TotalVolume_conventional_lag_4,TotalVolume_organic_lag_4,4046_conventional_lag_4,4046_organic_lag_4,4225_conventional_lag_4,4225_organic_lag_4,4770_conventional_lag_4,4770_organic_lag_4,...,XLargeBags_combined_lag_4,TotalUS_AveragePrice_combined_lag_4,TotalUS_TotalVolume_combined_lag_4,TotalUS_4046_combined_lag_4,TotalUS_4225_combined_lag_4,TotalUS_4770_combined_lag_4,TotalUS_TotalBags_combined_lag_4,TotalUS_SmallBags_combined_lag_4,TotalUS_LargeBags_combined_lag_4,TotalUS_XLargeBags_combined_lag_4
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
333,1.45,1.43,121804.36,3817.93,8183.48,59.18,95548.47,289.85,61.0,0.0,...,451.11,0.987467,44484806.56,15969142.96,11812643.14,654696.38,16048064.96,11613094.64,4200629.04,234341.28
334,1.43,1.43,85630.24,7566.17,5499.73,4314.30,61661.76,251.85,75.0,0.0,...,380.00,1.100729,38524817.46,13509266.77,11171955.89,554875.19,13288489.86,9806369.66,3254341.43,227778.77
335,1.43,1.43,85630.24,7566.17,5499.73,4314.30,61661.76,251.85,75.0,0.0,...,380.00,1.100729,38524817.46,13509266.77,11171955.89,554875.19,13288489.86,9806369.66,3254341.43,227778.77
336,1.28,1.56,104278.89,5356.63,10368.77,816.56,59723.32,532.59,48.0,0.0,...,310.00,1.077948,41481381.31,13952770.84,10755838.42,725393.48,16046348.64,11431999.60,4310556.28,303792.76


In [18]:
lags_df.columns

Index(['AveragePrice_conventional_lag_4', 'AveragePrice_organic_lag_4',
       'TotalVolume_conventional_lag_4', 'TotalVolume_organic_lag_4',
       '4046_conventional_lag_4', '4046_organic_lag_4',
       '4225_conventional_lag_4', '4225_organic_lag_4',
       '4770_conventional_lag_4', '4770_organic_lag_4',
       'TotalBags_conventional_lag_4', 'TotalBags_organic_lag_4',
       'SmallBags_conventional_lag_4', 'SmallBags_organic_lag_4',
       'LargeBags_conventional_lag_4', 'LargeBags_organic_lag_4',
       'XLargeBags_conventional_lag_4', 'XLargeBags_organic_lag_4',
       'AveragePrice_combined_lag_4', 'TotalVolume_combined_lag_4',
       '4046_combined_lag_4', '4225_combined_lag_4', '4770_combined_lag_4',
       'TotalBags_combined_lag_4', 'SmallBags_combined_lag_4',
       'LargeBags_combined_lag_4', 'XLargeBags_combined_lag_4',
       'TotalUS_AveragePrice_combined_lag_4',
       'TotalUS_TotalVolume_combined_lag_4', 'TotalUS_4046_combined_lag_4',
       'TotalUS_4225_combined_l

## One-hot encoding and time features

Add back the date, type and target info:

In [19]:
region_filt_df = select_region(merge_df, region)
feat_df = pd.concat([
    region_filt_df[['Date', 'Type', configs['target_name']]], 
    lags_df], axis=1)

A regression model has no notion of time on its own, so let's add time features derived from the date:

In [20]:
def make_time_features(sel_df):
    time_feats = pd.DataFrame(index=sel_df.index)
    dates = sel_df['Date']
    time_feats['Year'] = dates.dt.year
    time_feats['Month'] = dates.dt.month
    time_feats['Day'] = dates.dt.day
    time_feats['DayofWeek'] = dates.dt.dayofweek
    time_feats["WeekofYear"] = dates.dt.isocalendar().week
    time_feats["Quarter"] = dates.dt.quarter
    return time_feats

feat_df = pd.concat([feat_df, make_time_features(feat_df)], axis=1)
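
The feature engineering notes earlier mention cyclical features as a possible improvement. As a sketch of what that could look like (it is not part of the pipeline in this notebook), the month can be mapped onto a circle with a sine/cosine pair so that December and January end up close together. The make_cyclical_month_features helper and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

def make_cyclical_month_features(dates):
    # Hypothetical helper, not part of the original modules:
    # maps month 1..12 onto the unit circle
    angle = 2 * np.pi * dates.dt.month / 12
    return pd.DataFrame(
        {'Month_sin': np.sin(angle), 'Month_cos': np.cos(angle)},
        index=dates.index,
    )

demo_dates = pd.Series(pd.to_datetime(['2015-01-04', '2015-07-05', '2015-12-27']))
cyc = make_cyclical_month_features(demo_dates)

def dist(i, j):
    # Euclidean distance between two rows of the encoded features
    return float(np.linalg.norm(cyc.iloc[i].to_numpy() - cyc.iloc[j].to_numpy()))

# January is closer to December than to July in the encoded space
print(dist(0, 2) < dist(0, 1))
```

With the plain integer Month feature, January (1) and December (12) are as far apart as possible. Tree models can often work around that with splits, so whether this encoding helps here would have to be tested.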

One-hot encode the type:

In [21]:
feat_df = pd.get_dummies(feat_df, columns=['Type'], prefix='Type')

Since we are concatenating data from different regions, some dates may be present in one region but not in another, which leaves rows with a null date. Dropping those rows solves it. (In the real world, more appropriate data quality checks could be implemented in an earlier data engineering stage.)

In [22]:
feat_df = feat_df.loc[~feat_df['Date'].isnull(), :]
dates = feat_df['Date'] # also store the dates so the info is not lost (useful for plotting)
feat_df = feat_df.drop(columns='Date')

Finally, I can build the model-ready data:

In [23]:
X = feat_df[feat_df.columns.difference([configs['target_name']])]
y = feat_df[configs['target_name']]

In [24]:
X

Unnamed: 0,4046_combined_lag_4,4046_conventional_lag_4,4046_organic_lag_4,4225_combined_lag_4,4225_conventional_lag_4,4225_organic_lag_4,4770_combined_lag_4,4770_conventional_lag_4,4770_organic_lag_4,AveragePrice_combined_lag_4,...,TotalVolume_combined_lag_4,TotalVolume_conventional_lag_4,TotalVolume_organic_lag_4,Type_conventional,Type_organic,WeekofYear,XLargeBags_combined_lag_4,XLargeBags_conventional_lag_4,XLargeBags_organic_lag_4,Year
0,,,,,,,,,,,...,,,,False,True,1,,,,2015
1,,,,,,,,,,,...,,,,True,False,1,,,,2015
2,,,,,,,,,,,...,,,,False,True,2,,,,2015
3,,,,,,,,,,,...,,,,True,False,2,,,,2015
4,,,,,,,,,,,...,,,,True,False,3,,,,2015
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
333,8242.66,8183.48,59.18,95838.32,95548.47,289.85,61.0,61.0,0.0,1.449392,...,125622.29,121804.36,3817.93,False,True,10,451.11,451.11,0.0,2018
334,9814.03,5499.73,4314.30,61913.61,61661.76,251.85,75.0,75.0,0.0,1.430000,...,93196.41,85630.24,7566.17,False,True,11,380.00,380.00,0.0,2018
335,9814.03,5499.73,4314.30,61913.61,61661.76,251.85,75.0,75.0,0.0,1.430000,...,93196.41,85630.24,7566.17,True,False,11,380.00,380.00,0.0,2018
336,11185.33,10368.77,816.56,60255.91,59723.32,532.59,48.0,48.0,0.0,1.293680,...,109635.52,104278.89,5356.63,True,False,12,310.00,310.00,0.0,2018


In [25]:
X.columns

Index(['4046_combined_lag_4', '4046_conventional_lag_4', '4046_organic_lag_4',
       '4225_combined_lag_4', '4225_conventional_lag_4', '4225_organic_lag_4',
       '4770_combined_lag_4', '4770_conventional_lag_4', '4770_organic_lag_4',
       'AveragePrice_combined_lag_4', 'AveragePrice_conventional_lag_4',
       'AveragePrice_organic_lag_4', 'Day', 'DayofWeek',
       'LargeBags_combined_lag_4', 'LargeBags_conventional_lag_4',
       'LargeBags_organic_lag_4', 'Month', 'Quarter',
       'SmallBags_combined_lag_4', 'SmallBags_conventional_lag_4',
       'SmallBags_organic_lag_4', 'TotalBags_combined_lag_4',
       'TotalBags_conventional_lag_4', 'TotalBags_organic_lag_4',
       'TotalUS_4046_combined_lag_4', 'TotalUS_4225_combined_lag_4',
       'TotalUS_4770_combined_lag_4', 'TotalUS_AveragePrice_combined_lag_4',
       'TotalUS_LargeBags_combined_lag_4', 'TotalUS_SmallBags_combined_lag_4',
       'TotalUS_TotalBags_combined_lag_4',
       'TotalUS_TotalVolume_combined_lag_4',
    

In [26]:
y

0      1.79
1      1.22
2      1.77
3      1.24
4      1.17
       ... 
333    1.68
334    1.66
335    1.35
336    1.57
337    1.71
Name: AveragePrice, Length: 338, dtype: float64

Combining all the feature engineering into one function, we get:

In [None]:
def make_stage_2_data(merge_df, region, train_configs):
    lags_df = make_target_region_lags_df(merge_df, region, train_configs)
    region_filt_df = select_region(merge_df, region)
    feat_df = pd.concat([
        region_filt_df[['Date', 'Type', train_configs['target_name']]], 
        lags_df], axis=1)

    feat_df = pd.concat([feat_df, make_time_features(feat_df)], axis=1)
    feat_df = pd.get_dummies(feat_df, columns=['Type'], prefix='Type')

    feat_df = feat_df.loc[~feat_df['Date'].isnull(), :]
    dates = feat_df['Date']
    feat_df = feat_df.drop(columns='Date')
    
    X = feat_df[feat_df.columns.difference([train_configs['target_name']])]
    y = feat_df[train_configs['target_name']]
    
    return X, y, dates

# Modelling and validation

## Hyperparameter optimization

Before the main training loop I ran a hyperparameter search with Optuna to optimize the models. The scores and parameters were logged to MLflow.

To specify which regions to iterate over, I'll add a "target_regions" key to the configs, like this:

In [None]:
# configs = {
#         "target_name": "AveragePrice", # Name of the target variable to predict
#         "lags":[4, 8, 13, 26, 52],
#         'target_regions': merge_df['Region'].unique(),
#         "aux_regions": ['TotalUS', 'West', 'Midsouth', 'Northeast', 'Southeast', 'SouthCentral'], # "Exogenous" (auxiliary) regions to be used as inputs when predicting the target region
#         "aux_features": ['AveragePrice_combined', 'TotalVolume_combined', # Features of the aux regions to include
#                          '4046_combined', '4225_combined', '4770_combined', 
#                          'TotalBags_combined', 'SmallBags_combined', 
#                          'LargeBags_combined', 'XLargeBags_combined'],
#         "aux_lags": [4, 8, 13, 26, 52],
#     }

This was the code executed to get the best parameters for each region. Since it would take too long to run here, I'll skip it. A JSON file with my results is included in the repo.

In [None]:
# import mlflow
# import mlflow.xgboost
# import xgboost as xgb
# from sklearn.model_selection import TimeSeriesSplit, train_test_split
# from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
# import numpy as np
# import pandas as pd
# import os
# import optuna
# import feature_eng

# # Set your MLflow experiment name
# mlflow.set_experiment("Optuna Hyperparameter Optimization")

# # Define the objective function for Optuna
# def objective(trial):
#     params = {
#         "n_estimators": trial.suggest_int("n_estimators", 50, 200),
#         "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2),
#         "max_depth": trial.suggest_int("max_depth", 3, 7),
#         "subsample": trial.suggest_float("subsample", 0.5, 1.0),
#         "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
#         "gamma": trial.suggest_float("gamma", 0, 5),
#         "reg_alpha": trial.suggest_float("reg_alpha", 0, 5),
#         "reg_lambda": trial.suggest_float("reg_lambda", 0, 5)
#     }
    
#     fold_mse = []
#     fold_mape = []
    
#     for fold, (train_cv_index, val_cv_index) in enumerate(tscv.split(X_train)):
#         X_train_cv, X_val_cv = X_train.iloc[train_cv_index], X_train.iloc[val_cv_index]
#         y_train_cv, y_val_cv = y_train.iloc[train_cv_index], y_train.iloc[val_cv_index]
        
#         model = xgb.XGBRegressor(**params)
#         model.fit(X_train_cv, y_train_cv)
        
#         # Make predictions and compute performance
#         y_pred_cv = model.predict(X_val_cv)
#         mse = mean_squared_error(y_val_cv, y_pred_cv)
#         mape = mean_absolute_percentage_error(y_val_cv, y_pred_cv)
#         fold_mse.append(mse)
#         fold_mape.append(mape)
    
#     avg_mse = np.mean(fold_mse)   
#     avg_mape = np.mean(fold_mape)    
    
#     return avg_mape

# for region in configs['target_regions']:
#     print(f'Training: {region}')
#     X, y, dates = feature_eng.make_stage_2_data(merge_df, region, configs)

#     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    
#     # Start one run per region
#     with mlflow.start_run(run_name=f"Region: {region}") as region_run:
#         mlflow.log_param("region", region)
        
#         # Expanding window CV.
#         tscv = TimeSeriesSplit(n_splits=5)
        
#         # Perform hyperparameter optimization with Optuna
#         study = optuna.create_study(direction="minimize")
#         study.optimize(objective, n_trials=40)
        
#         # Log the best parameters
#         best_params = study.best_params
#         mlflow.log_params(best_params)
        
#         # Log the best cross-validation scores
#         # best_mse = study.best_value
#         best_mape = study.best_value
#         mlflow.log_metric("best_cv_mape", best_mape)
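
To show what the expanding-window cross-validation in the search above looks like, here is a small sketch that prints the folds TimeSeriesSplit produces on 12 time-ordered rows. The toy array and the X_demo, tscv_demo and folds names are made up; only TimeSeriesSplit(n_splits=5) comes from the code above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations stand in for the training rows
X_demo = np.arange(12).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=5)
folds = list(tscv_demo.split(X_demo))

for fold, (train_idx, val_idx) in enumerate(folds):
    # Each fold trains on everything that comes before its validation block
    print(f"fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

The training window grows with every fold and the validation block always comes after it, so no fold is validated on data older than what it was trained on.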

## Main training loop

Each model is first trained on the training set using the best cross-validated parameters for its region and evaluated on the hold-out test set. It is then retrained on the entire dataset, and the retrained model is logged and registered to an MLflow server, to be fetched later for deployment.
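
The hold-out evaluation in the training loop further down relies on train_test_split with shuffle=False, which keeps the rows in time order. A minimal sketch on toy data (the arrays and variable names are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten time-ordered observations (toy data)
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# shuffle=False keeps the original order: the last 20% becomes the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=False)

print(y_tr.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
print(y_te.tolist())  # [8, 9]
```

With the default shuffle=True, future rows would be mixed into the training set and the test score would look better than it should for a forecasting task.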

Let's define a smaller set of target regions so that it does not take long to run:

In [None]:
configs = {
        "target_name": "AveragePrice", # Name of the target variable to predict
        "lags":[4], # The lags to be calculated for all the columns of the selected target region
        'target_regions':['Chicago', 'Albany'], # The regions I want to forecast
        "aux_regions": ['TotalUS', 'West', 'Midsouth', 'Northeast', 'Southeast', 'SouthCentral'], # "Exogenous" (auxiliary) regions to be used as inputs when predicting the target region
        "aux_features": ['AveragePrice_combined', 'TotalVolume_combined', # Features of the aux regions to include
                         '4046_combined', '4225_combined', '4770_combined', 
                         'TotalBags_combined', 'SmallBags_combined', 
                         'LargeBags_combined', 'XLargeBags_combined'],
        "aux_lags": [4], # The lags to be calculated of the aux features
    }

In [None]:
import json

def load_best_params(region, from_json=True):
    if from_json:
        with open('best_params.json') as file:
            best_params = json.load(file)
        return best_params[region]

    # TODO: load the parameters from MLflow when not using the JSON file
    raise NotImplementedError("Loading best params from MLflow is not implemented yet")

To visualize the results easily, I'll make a plotting function:

In [None]:
def plot_results(y_train, y_true, y_pred, target_name, dates, fold=None):
    # Plot the train, true and predicted series against their dates
    plt.figure(figsize=(12, 6))
    plt.plot(dates.loc[y_train.index], y_train, label='Train')
    plt.plot(dates.loc[y_true.index], y_true, label='True')
    plt.plot(dates.loc[y_true.index], y_pred, label='Predicted')
    plt.legend()
    # Compare against None so that fold 0 still gets the fold title
    if fold is not None:
        plt.title(f"Train, Validation, and Predicted Values - Fold {fold}")
    else:
        plt.title("Train, Test, and Predicted Values")
    plt.xlabel("Date")
    plt.ylabel(target_name)
    return plt


Now, the main train loop

In [None]:
mlflow.set_experiment("Example Experiment")

for region in configs['target_regions']:
    print(f'Training: {region}')
    X, y, dates = make_stage_2_data(merge_df, region, configs)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    
    # Start one run per region
    with mlflow.start_run(run_name=f"Region: {region}") as region_run:
        mlflow.log_param("region", region)

        # Get the best params from the hyperparameter search
        params = load_best_params(region, from_json=True)

        # Evaluate final model on the hold-out test set using training data only
        final_model_cv = xgb.XGBRegressor(**params)
        final_model_cv.fit(X_train, y_train)
        y_test_pred = final_model_cv.predict(X_test)
        test_mse = mean_squared_error(y_test, y_test_pred)
        mlflow.log_metric("test_mse", test_mse)
        
        plot = plot_results(y_train, y_test, y_test_pred, configs['target_name'], dates)
        plot_path = "plot_test.png"
        plot.savefig(plot_path)
        mlflow.log_artifact(plot_path)
        plot.close()
        # Delete the plot file after logging it
        if os.path.exists(plot_path):
            os.remove(plot_path)
        
        final_model = xgb.XGBRegressor(**params)
        final_model.fit(X, y)

        mlflow.xgboost.log_model(final_model, artifact_path="final_model", input_example=X.iloc[:1])
        
        # Register the model
        model_uri = mlflow.get_artifact_uri("final_model")
        mlflow.register_model(model_uri, name=f'{region}_AVOCADO_FORECAST_EXAMPLE')


2025/03/10 19:40:14 INFO mlflow.tracking.fluent: Experiment with name 'Example Experiment' does not exist. Creating a new experiment.


Trainning: Chicago


Registered model 'Chicago_AVOCADO_FORECAST' already exists. Creating a new version of this model...
Created version '6' of model 'Chicago_AVOCADO_FORECAST'.


Trainning: SanFrancisco


Registered model 'SanFrancisco_AVOCADO_FORECAST' already exists. Creating a new version of this model...
Created version '2' of model 'SanFrancisco_AVOCADO_FORECAST'.


The models can now be analyzed inside MLflow and deployed to serve predictions through a REST API that loads them from the MLflow model registry.

# Hyperparameter search run (save results to JSON, then delete later)

In [58]:
configs = {
    "target_name": "AveragePrice",  # Name of the target variable to predict
    # The lags to be calculated for all the columns of the selected target region
    # "lags": [4],
    "lags": [4, 8, 13, 26, 52],
    "target_regions": [
        'Albany', 'Atlanta', 'BaltimoreWashington', 'Boise', 'Boston',
        'BuffaloRochester', 'California', 'Charlotte', 'Chicago',
        'CincinnatiDayton', 'Columbus', 'DallasFtWorth', 'Denver',
        'Detroit', 'GrandRapids', 'GreatLakes', 'HarrisburgScranton',
        'HartfordSpringfield', 'Houston', 'Indianapolis', 'Jacksonville',
        'LasVegas', 'LosAngeles', 'Louisville', 'MiamiFtLauderdale',
        'Midsouth', 'Nashville', 'NewOrleansMobile', 'NewYork',
        'Northeast', 'NorthernNewEngland', 'Orlando', 'Philadelphia',
        'PhoenixTucson', 'Pittsburgh', 'Plains', 'Portland',
        'RaleighGreensboro', 'RichmondNorfolk', 'Roanoke', 'Sacramento',
        'SanDiego', 'SanFrancisco', 'Seattle', 'SouthCarolina',
        'SouthCentral', 'Southeast', 'Spokane', 'StLouis', 'Syracuse',
        'Tampa', 'TotalUS', 'West', 'WestTexNewMexico',
    ],
    # "Exogenous" (auxiliary) regions to be used as inputs when predicting the target region
    "aux_regions": ['TotalUS', 'West', 'Midsouth', 'Northeast', 'Southeast', 'SouthCentral'],
    # Features of the aux regions to include
    "aux_features": [
        'AveragePrice_combined', 'TotalVolume_combined',
        '4046_combined', '4225_combined', '4770_combined',
        'TotalBags_combined', 'SmallBags_combined',
        'LargeBags_combined', 'XLargeBags_combined',
    ],
    # The lags to be calculated for the aux features
    # "aux_lags": [4],
    "aux_lags": [4, 8, 13, 26, 52],
}
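
To make the `lags` entries concrete, here is a small pure-Python sketch of what a lag feature is, assuming a lag of `k` means "the value of the same column `k` weekly rows earlier". The function `add_lags` and the `<column>_lag<k>` naming are hypothetical; the real feature construction lives in `feature_eng.make_stage_2_data` and may differ. Note that the smallest configured lag (4) matches the 4-week forecast horizon, which suggests every lagged feature is already known at prediction time.

```python
def add_lags(rows, column, lags):
    """Return copies of the weekly rows with `<column>_lag<k>` keys added.

    `rows` must be sorted oldest-first, one dict per week. Where there is
    not enough history for a lag, the value is None.
    """
    out = []
    for i, row in enumerate(rows):
        new_row = dict(row)  # copy so the input rows stay untouched
        for k in lags:
            new_row[f"{column}_lag{k}"] = rows[i - k][column] if i >= k else None
        out.append(new_row)
    return out


# Ten made-up weekly prices, purely for illustration
weekly = [{"week": w, "AveragePrice": round(1.00 + 0.05 * w, 2)} for w in range(10)]
lagged = add_lags(weekly, "AveragePrice", lags=[4, 8])
print(lagged[9])
```

With a pandas DataFrame sorted by date, the same idea is usually expressed with `Series.shift(k)`.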

In [59]:
import data_processing
merge_df = data_processing.make_stage_1_data(configs)

In [None]:
import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
import numpy as np
import pandas as pd
import os
import optuna
import feature_eng

# Set your MLflow experiment name
mlflow.set_experiment("Optuna Hyperparameter Optimization")

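# Define the objective function for Optuna
def objective(trial):
    """Return the mean cross-validated MAPE for one sampled set of hyperparameters.

    Relies on the globals X_train, y_train and tscv set in the per-region loop below.
    """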
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2),
        "max_depth": trial.suggest_int("max_depth", 3, 7),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "gamma": trial.suggest_float("gamma", 0, 5),
        "reg_alpha": trial.suggest_float("reg_alpha", 0, 5),
        "reg_lambda": trial.suggest_float("reg_lambda", 0, 5)
    }
    
    fold_mse = []
    fold_mape = []
    
    for fold, (train_cv_index, val_cv_index) in enumerate(tscv.split(X_train)):
        X_train_cv, X_val_cv = X_train.iloc[train_cv_index], X_train.iloc[val_cv_index]
        y_train_cv, y_val_cv = y_train.iloc[train_cv_index], y_train.iloc[val_cv_index]
        
        model = xgb.XGBRegressor(**params)
        model.fit(X_train_cv, y_train_cv)
        
        # Make predictions and compute performance
        y_pred_cv = model.predict(X_val_cv)
        mse = mean_squared_error(y_val_cv, y_pred_cv)
        mape = mean_absolute_percentage_error(y_val_cv, y_pred_cv)
        fold_mse.append(mse)
        fold_mape.append(mape)
    
    # Average across folds; MSE is kept for reference, MAPE is what the study minimizes
    avg_mse = np.mean(fold_mse)
    avg_mape = np.mean(fold_mape)

    return avg_mape

for region in configs['target_regions']:
    print(f'Training: {region}')
    X, y, dates = feature_eng.make_stage_2_data(merge_df, region, configs)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    
    # Start one run per region
    with mlflow.start_run(run_name=f"Region: {region}") as region_run:
        mlflow.log_param("region", region)
        
        # Expanding window CV.
        tscv = TimeSeriesSplit(n_splits=5)
        
        # Perform hyperparameter optimization with Optuna
        study = optuna.create_study(direction="minimize")
        study.optimize(objective, n_trials=40)
        
        # Log the best parameters
        best_params = study.best_params
        mlflow.log_params(best_params)
        
        # Log the best cross-validation score (mean MAPE across the folds)
        best_mape = study.best_value
        mlflow.log_metric("best_cv_mape", best_mape)

2025/03/10 21:27:46 INFO mlflow.tracking.fluent: Experiment with name 'Optuna Hyperparameter Optimization' does not exist. Creating a new experiment.
[I 2025-03-10 21:27:46,544] A new study created in memory with name: no-name-101772f5-695b-49da-a824-1f21f7b7a405


Training: Albany


[I 2025-03-10 21:27:50,444] Trial 0 finished with value: 0.14486099069523398 and parameters: {'n_estimators': 50, 'learning_rate': 0.14525944963878198, 'max_depth': 7, 'subsample': 0.7904456328795267, 'colsample_bytree': 0.9021469732313794, 'gamma': 4.6533830625244645, 'reg_alpha': 3.4076139356519146, 'reg_lambda': 0.40106622438011386}. Best is trial 0 with value: 0.14486099069523398.
[I 2025-03-10 21:27:58,308] Trial 1 finished with value: 0.11986225591163772 and parameters: {'n_estimators': 198, 'learning_rate': 0.10942029793843307, 'max_depth': 4, 'subsample': 0.5897701155520684, 'colsample_bytree': 0.5482939691496018, 'gamma': 2.4526579382419027, 'reg_alpha': 0.038022844495874675, 'reg_lambda': 2.7943765782226166}. Best is trial 1 with value: 0.11986225591163772.
[I 2025-03-10 21:28:02,733] Trial 2 finished with value: 0.1401484685146662 and parameters: {'n_estimators': 70, 'learning_rate': 0.03088143368126505, 'max_depth': 4, 'subsample': 0.7102987585250911, 'colsample_bytree': 0.

Training: Atlanta


[I 2025-03-10 21:32:23,002] Trial 0 finished with value: 0.24413914909925225 and parameters: {'n_estimators': 177, 'learning_rate': 0.18027783017464025, 'max_depth': 4, 'subsample': 0.5565433059109752, 'colsample_bytree': 0.670795840654648, 'gamma': 2.7819728410026947, 'reg_alpha': 3.6902327271857676, 'reg_lambda': 0.9475265657561455}. Best is trial 0 with value: 0.24413914909925225.
[I 2025-03-10 21:32:29,864] Trial 1 finished with value: 0.23115215751193857 and parameters: {'n_estimators': 158, 'learning_rate': 0.015380016299151492, 'max_depth': 7, 'subsample': 0.7128411755994215, 'colsample_bytree': 0.598402013032534, 'gamma': 2.8959657783596384, 'reg_alpha': 0.9485373859091634, 'reg_lambda': 2.7232787498995137}. Best is trial 1 with value: 0.23115215751193857.
[I 2025-03-10 21:32:35,817] Trial 2 finished with value: 0.2351774430490508 and parameters: {'n_estimators': 105, 'learning_rate': 0.17436448820492065, 'max_depth': 5, 'subsample': 0.9466837841736462, 'colsample_bytree': 0.74

Training: BaltimoreWashington


[I 2025-03-10 21:36:30,171] Trial 0 finished with value: 0.17360648615540814 and parameters: {'n_estimators': 122, 'learning_rate': 0.04093723720700222, 'max_depth': 3, 'subsample': 0.8824294901994971, 'colsample_bytree': 0.8368586319690232, 'gamma': 3.0168061552008196, 'reg_alpha': 2.512251169429347, 'reg_lambda': 3.2471505490457133}. Best is trial 0 with value: 0.17360648615540814.
[I 2025-03-10 21:36:35,793] Trial 1 finished with value: 0.17801256069480106 and parameters: {'n_estimators': 83, 'learning_rate': 0.047917557958155756, 'max_depth': 5, 'subsample': 0.8095732374175746, 'colsample_bytree': 0.989336543480703, 'gamma': 4.943399564417307, 'reg_alpha': 3.7923974700725465, 'reg_lambda': 1.7270831030677898}. Best is trial 0 with value: 0.17360648615540814.
[I 2025-03-10 21:36:40,574] Trial 2 finished with value: 0.16841097369871472 and parameters: {'n_estimators': 73, 'learning_rate': 0.026397056070827, 'max_depth': 7, 'subsample': 0.7712713760416295, 'colsample_bytree': 0.850838

Training: Boise


[I 2025-03-10 21:40:53,805] Trial 0 finished with value: 0.2939095514834002 and parameters: {'n_estimators': 181, 'learning_rate': 0.16925750686960572, 'max_depth': 3, 'subsample': 0.7579455445262077, 'colsample_bytree': 0.8534305661655015, 'gamma': 3.286124294761325, 'reg_alpha': 4.863648270881105, 'reg_lambda': 2.0210040593328618}. Best is trial 0 with value: 0.2939095514834002.
[I 2025-03-10 21:41:00,902] Trial 1 finished with value: 0.2732528248565465 and parameters: {'n_estimators': 187, 'learning_rate': 0.07149863631364488, 'max_depth': 6, 'subsample': 0.5415102416660078, 'colsample_bytree': 0.9479393038698104, 'gamma': 0.7123627908466973, 'reg_alpha': 3.0022412158028473, 'reg_lambda': 3.7659749137063874}. Best is trial 1 with value: 0.2732528248565465.
[I 2025-03-10 21:41:09,932] Trial 2 finished with value: 0.23103204678712408 and parameters: {'n_estimators': 197, 'learning_rate': 0.09714477984542455, 'max_depth': 3, 'subsample': 0.6019577197099379, 'colsample_bytree': 0.893064

Training: Boston


[I 2025-03-10 21:44:52,699] Trial 0 finished with value: 0.1847982855665274 and parameters: {'n_estimators': 200, 'learning_rate': 0.08594967610004513, 'max_depth': 4, 'subsample': 0.7394757976690826, 'colsample_bytree': 0.9000773839098684, 'gamma': 1.1694499926753217, 'reg_alpha': 2.730333339848621, 'reg_lambda': 4.355072707720375}. Best is trial 0 with value: 0.1847982855665274.
[I 2025-03-10 21:44:58,448] Trial 1 finished with value: 0.18592218579706168 and parameters: {'n_estimators': 130, 'learning_rate': 0.12845012271557332, 'max_depth': 4, 'subsample': 0.7938458565118536, 'colsample_bytree': 0.5767817345620242, 'gamma': 3.1097894118547713, 'reg_alpha': 0.9440175961570751, 'reg_lambda': 2.0262204271821913}. Best is trial 0 with value: 0.1847982855665274.
[I 2025-03-10 21:45:04,150] Trial 2 finished with value: 0.1824674251351759 and parameters: {'n_estimators': 153, 'learning_rate': 0.08156183706875321, 'max_depth': 7, 'subsample': 0.6292767541726545, 'colsample_bytree': 0.672460

Training: BuffaloRochester


[I 2025-03-10 21:49:06,585] Trial 0 finished with value: 0.12840641375872708 and parameters: {'n_estimators': 199, 'learning_rate': 0.07592895397405701, 'max_depth': 4, 'subsample': 0.7243804988003156, 'colsample_bytree': 0.5614929421303723, 'gamma': 2.1066750545838437, 'reg_alpha': 1.7155458229229903, 'reg_lambda': 4.975698130860404}. Best is trial 0 with value: 0.12840641375872708.
[I 2025-03-10 21:49:12,311] Trial 1 finished with value: 0.12013536438271466 and parameters: {'n_estimators': 85, 'learning_rate': 0.09383418667739975, 'max_depth': 7, 'subsample': 0.7047717386393251, 'colsample_bytree': 0.8696897499014993, 'gamma': 0.6589232328347283, 'reg_alpha': 1.3203765902495372, 'reg_lambda': 2.8932694755968686}. Best is trial 1 with value: 0.12013536438271466.
[I 2025-03-10 21:49:17,022] Trial 2 finished with value: 0.13052375580878256 and parameters: {'n_estimators': 119, 'learning_rate': 0.01950911030965479, 'max_depth': 3, 'subsample': 0.642029255845604, 'colsample_bytree': 0.821

Training: California


[I 2025-03-10 21:53:21,227] Trial 0 finished with value: 0.19426146142488462 and parameters: {'n_estimators': 126, 'learning_rate': 0.0813208245590085, 'max_depth': 7, 'subsample': 0.9163553824598527, 'colsample_bytree': 0.666028838230603, 'gamma': 3.7088738119796045, 'reg_alpha': 1.6047464716471689, 'reg_lambda': 4.328042358674875}. Best is trial 0 with value: 0.19426146142488462.
[I 2025-03-10 21:53:27,276] Trial 1 finished with value: 0.20505669305679097 and parameters: {'n_estimators': 142, 'learning_rate': 0.036658661782917046, 'max_depth': 4, 'subsample': 0.7226921611837682, 'colsample_bytree': 0.6430227624712967, 'gamma': 2.695881765854926, 'reg_alpha': 3.316441560505363, 'reg_lambda': 4.2163450578303365}. Best is trial 0 with value: 0.19426146142488462.
[I 2025-03-10 21:53:33,565] Trial 2 finished with value: 0.1584742829625685 and parameters: {'n_estimators': 156, 'learning_rate': 0.19308375448538365, 'max_depth': 4, 'subsample': 0.9284031082663076, 'colsample_bytree': 0.93351

Training: Charlotte


[I 2025-03-10 21:57:55,728] Trial 0 finished with value: 0.16557057382475784 and parameters: {'n_estimators': 62, 'learning_rate': 0.04745872899750934, 'max_depth': 4, 'subsample': 0.9099290962112361, 'colsample_bytree': 0.6408980921068574, 'gamma': 3.840088995218574, 'reg_alpha': 2.726448179673826, 'reg_lambda': 3.6378619143522832}. Best is trial 0 with value: 0.16557057382475784.
[I 2025-03-10 21:57:59,680] Trial 1 finished with value: 0.14967981228777255 and parameters: {'n_estimators': 72, 'learning_rate': 0.03419334180036266, 'max_depth': 7, 'subsample': 0.8936732695096867, 'colsample_bytree': 0.6726867715802964, 'gamma': 2.483817506384987, 'reg_alpha': 1.9200366881342879, 'reg_lambda': 1.4799338432835951}. Best is trial 1 with value: 0.14967981228777255.
[I 2025-03-10 21:58:07,104] Trial 2 finished with value: 0.1781797712419558 and parameters: {'n_estimators': 189, 'learning_rate': 0.05067241745717816, 'max_depth': 4, 'subsample': 0.732663329244606, 'colsample_bytree': 0.5848368

Training: Chicago


[I 2025-03-10 22:01:54,879] Trial 0 finished with value: 0.15595713615196324 and parameters: {'n_estimators': 157, 'learning_rate': 0.17129333611412653, 'max_depth': 7, 'subsample': 0.9877187815336758, 'colsample_bytree': 0.9796299509407612, 'gamma': 1.0207950909562906, 'reg_alpha': 2.53349035855672, 'reg_lambda': 0.9230132158638527}. Best is trial 0 with value: 0.15595713615196324.
[I 2025-03-10 22:02:02,036] Trial 1 finished with value: 0.18411764139778053 and parameters: {'n_estimators': 200, 'learning_rate': 0.040843341284004374, 'max_depth': 4, 'subsample': 0.6424728774990982, 'colsample_bytree': 0.9888606579696293, 'gamma': 4.13972939563659, 'reg_alpha': 0.3251146220960316, 'reg_lambda': 0.6223674982565458}. Best is trial 0 with value: 0.15595713615196324.
[I 2025-03-10 22:02:06,358] Trial 2 finished with value: 0.18140567318959994 and parameters: {'n_estimators': 66, 'learning_rate': 0.07105465544453321, 'max_depth': 6, 'subsample': 0.969395493756099, 'colsample_bytree': 0.77840

Training: CincinnatiDayton


[I 2025-03-10 22:05:44,438] Trial 0 finished with value: 0.2604672692099801 and parameters: {'n_estimators': 97, 'learning_rate': 0.13608120024421555, 'max_depth': 6, 'subsample': 0.7653972955605232, 'colsample_bytree': 0.6828918163780525, 'gamma': 4.8688661261864405, 'reg_alpha': 2.871410216877713, 'reg_lambda': 0.5280997966000023}. Best is trial 0 with value: 0.2604672692099801.
[I 2025-03-10 22:05:49,001] Trial 1 finished with value: 0.25517356300551974 and parameters: {'n_estimators': 80, 'learning_rate': 0.1790585497891654, 'max_depth': 4, 'subsample': 0.9000114968018393, 'colsample_bytree': 0.9820823948751067, 'gamma': 2.7746407375245448, 'reg_alpha': 4.505069017817826, 'reg_lambda': 0.4887669102593961}. Best is trial 1 with value: 0.25517356300551974.
[I 2025-03-10 22:05:53,677] Trial 2 finished with value: 0.2537100834490491 and parameters: {'n_estimators': 71, 'learning_rate': 0.06229887848481045, 'max_depth': 3, 'subsample': 0.9776606349601736, 'colsample_bytree': 0.661933609

Training: Columbus


[I 2025-03-10 22:09:58,236] Trial 0 finished with value: 0.19508331581986021 and parameters: {'n_estimators': 196, 'learning_rate': 0.1302584691159183, 'max_depth': 6, 'subsample': 0.7864548181490228, 'colsample_bytree': 0.6446940702789923, 'gamma': 2.786458439498028, 'reg_alpha': 0.6910449528848117, 'reg_lambda': 2.9781473766059827}. Best is trial 0 with value: 0.19508331581986021.
[I 2025-03-10 22:10:03,963] Trial 1 finished with value: 0.1654732325930651 and parameters: {'n_estimators': 105, 'learning_rate': 0.1898706565718633, 'max_depth': 4, 'subsample': 0.616530336593931, 'colsample_bytree': 0.5410704891296814, 'gamma': 0.641385242503707, 'reg_alpha': 0.6227410177619741, 'reg_lambda': 3.7659585998517646}. Best is trial 1 with value: 0.1654732325930651.
[I 2025-03-10 22:10:09,664] Trial 2 finished with value: 0.17633754603945767 and parameters: {'n_estimators': 115, 'learning_rate': 0.1518123449043489, 'max_depth': 4, 'subsample': 0.5851768324579554, 'colsample_bytree': 0.89289887

Training: DallasFtWorth


[I 2025-03-10 22:13:55,100] Trial 0 finished with value: 0.18014142759197363 and parameters: {'n_estimators': 160, 'learning_rate': 0.11452045564482896, 'max_depth': 5, 'subsample': 0.9185530115344025, 'colsample_bytree': 0.9989996209290835, 'gamma': 2.2129517810764128, 'reg_alpha': 3.3617357696733263, 'reg_lambda': 0.5191944578036395}. Best is trial 0 with value: 0.18014142759197363.
[I 2025-03-10 22:14:01,099] Trial 1 finished with value: 0.18728934147591753 and parameters: {'n_estimators': 103, 'learning_rate': 0.06988510797579248, 'max_depth': 4, 'subsample': 0.7221979129845749, 'colsample_bytree': 0.7050311511541933, 'gamma': 3.821161604203147, 'reg_alpha': 0.6670689443859035, 'reg_lambda': 1.8678324287216612}. Best is trial 0 with value: 0.18014142759197363.
[I 2025-03-10 22:14:07,851] Trial 2 finished with value: 0.18056746159333498 and parameters: {'n_estimators': 172, 'learning_rate': 0.16515728766842874, 'max_depth': 6, 'subsample': 0.9813316431802885, 'colsample_bytree': 0.8

Training: Denver


[I 2025-03-10 22:17:43,946] Trial 0 finished with value: 0.2205793762278815 and parameters: {'n_estimators': 104, 'learning_rate': 0.04126255788035211, 'max_depth': 7, 'subsample': 0.8491424656481623, 'colsample_bytree': 0.5731119600826468, 'gamma': 4.395306554187334, 'reg_alpha': 1.0604261511553976, 'reg_lambda': 3.9303863305614977}. Best is trial 0 with value: 0.2205793762278815.
[I 2025-03-10 22:17:48,356] Trial 1 finished with value: 0.2214202298653892 and parameters: {'n_estimators': 62, 'learning_rate': 0.08083324409569476, 'max_depth': 5, 'subsample': 0.792029853329566, 'colsample_bytree': 0.6964036598932788, 'gamma': 1.9652508032440341, 'reg_alpha': 2.1713452753966833, 'reg_lambda': 4.7759117541701235}. Best is trial 0 with value: 0.2205793762278815.
[I 2025-03-10 22:17:53,234] Trial 2 finished with value: 0.22054940990389463 and parameters: {'n_estimators': 95, 'learning_rate': 0.11370022907136551, 'max_depth': 5, 'subsample': 0.6020149297687531, 'colsample_bytree': 0.73275848

Training: Detroit


[I 2025-03-10 22:21:39,420] Trial 0 finished with value: 0.24823835276728584 and parameters: {'n_estimators': 133, 'learning_rate': 0.19306653526289722, 'max_depth': 3, 'subsample': 0.5227097531580008, 'colsample_bytree': 0.7077662134061349, 'gamma': 3.7702327627748913, 'reg_alpha': 3.5882929971756465, 'reg_lambda': 4.510648869438319}. Best is trial 0 with value: 0.24823835276728584.
[I 2025-03-10 22:21:41,759] Trial 1 finished with value: 0.23895403732361764 and parameters: {'n_estimators': 58, 'learning_rate': 0.18028671632890012, 'max_depth': 3, 'subsample': 0.6676229733772447, 'colsample_bytree': 0.8750577894339556, 'gamma': 3.2317833558545024, 'reg_alpha': 1.2702246428145314, 'reg_lambda': 3.1419744735593707}. Best is trial 1 with value: 0.23895403732361764.
[I 2025-03-10 22:21:44,991] Trial 2 finished with value: 0.24696329674975753 and parameters: {'n_estimators': 129, 'learning_rate': 0.029673119715207495, 'max_depth': 3, 'subsample': 0.5184400117687901, 'colsample_bytree': 0.9

Training: GrandRapids


[I 2025-03-10 22:25:15,126] Trial 0 finished with value: 0.22201344669413098 and parameters: {'n_estimators': 75, 'learning_rate': 0.19230750784364484, 'max_depth': 6, 'subsample': 0.5522027944822894, 'colsample_bytree': 0.5854013242119619, 'gamma': 4.153077949721976, 'reg_alpha': 2.9281891177960913, 'reg_lambda': 0.8817942439665133}. Best is trial 0 with value: 0.22201344669413098.
[I 2025-03-10 22:25:19,564] Trial 1 finished with value: 0.2226198512280544 and parameters: {'n_estimators': 67, 'learning_rate': 0.06885896431299392, 'max_depth': 7, 'subsample': 0.5723525066080029, 'colsample_bytree': 0.9740613555968497, 'gamma': 4.3240571203368185, 'reg_alpha': 2.8483683936369433, 'reg_lambda': 3.597816681205504}. Best is trial 0 with value: 0.22201344669413098.
[I 2025-03-10 22:25:24,523] Trial 2 finished with value: 0.22598899245300927 and parameters: {'n_estimators': 141, 'learning_rate': 0.11681245197436153, 'max_depth': 6, 'subsample': 0.5235282222016976, 'colsample_bytree': 0.56850

Training: GreatLakes


[I 2025-03-10 22:30:10,858] Trial 0 finished with value: 0.16944368946245347 and parameters: {'n_estimators': 98, 'learning_rate': 0.03797346086472641, 'max_depth': 4, 'subsample': 0.6197633915406706, 'colsample_bytree': 0.6911566142236787, 'gamma': 2.2949191428277875, 'reg_alpha': 3.6577287674963226, 'reg_lambda': 2.4460617967388556}. Best is trial 0 with value: 0.16944368946245347.
[I 2025-03-10 22:30:15,625] Trial 1 finished with value: 0.15842637345481023 and parameters: {'n_estimators': 110, 'learning_rate': 0.19017565423804278, 'max_depth': 6, 'subsample': 0.908899529372327, 'colsample_bytree': 0.8422300503051607, 'gamma': 1.0854012909885964, 'reg_alpha': 4.786309164826486, 'reg_lambda': 1.2777367565198294}. Best is trial 1 with value: 0.15842637345481023.
[I 2025-03-10 22:30:19,128] Trial 2 finished with value: 0.17117861775116064 and parameters: {'n_estimators': 66, 'learning_rate': 0.048922057441001, 'max_depth': 3, 'subsample': 0.6608316888445662, 'colsample_bytree': 0.693389

Training: HarrisburgScranton


[I 2025-03-10 22:35:59,948] Trial 0 finished with value: 0.17408082547998216 and parameters: {'n_estimators': 152, 'learning_rate': 0.19991665406652515, 'max_depth': 5, 'subsample': 0.7985366172760575, 'colsample_bytree': 0.6532283588426049, 'gamma': 4.71217257016432, 'reg_alpha': 3.7011297130423966, 'reg_lambda': 4.410493453812469}. Best is trial 0 with value: 0.17408082547998216.
[I 2025-03-10 22:36:06,680] Trial 1 finished with value: 0.1550768823296677 and parameters: {'n_estimators': 159, 'learning_rate': 0.10468986547837811, 'max_depth': 5, 'subsample': 0.8688839808958848, 'colsample_bytree': 0.6543271022642906, 'gamma': 3.6120533214154116, 'reg_alpha': 1.1206051521936917, 'reg_lambda': 0.12971442609939043}. Best is trial 1 with value: 0.1550768823296677.
[I 2025-03-10 22:36:10,195] Trial 2 finished with value: 0.19055804118618122 and parameters: {'n_estimators': 61, 'learning_rate': 0.0858843459898339, 'max_depth': 6, 'subsample': 0.5795295283629234, 'colsample_bytree': 0.852354

Training: HartfordSpringfield


[I 2025-03-10 22:40:09,242] Trial 0 finished with value: 0.1320407744834603 and parameters: {'n_estimators': 117, 'learning_rate': 0.1775423269980342, 'max_depth': 3, 'subsample': 0.7375988119925945, 'colsample_bytree': 0.7273686037379162, 'gamma': 0.5310669294539216, 'reg_alpha': 2.6122718540018335, 'reg_lambda': 2.778135632522325}. Best is trial 0 with value: 0.1320407744834603.
[I 2025-03-10 22:40:13,875] Trial 1 finished with value: 0.18905242716294432 and parameters: {'n_estimators': 66, 'learning_rate': 0.11148646919936393, 'max_depth': 7, 'subsample': 0.6976476939960946, 'colsample_bytree': 0.6999775886323207, 'gamma': 4.840332584884491, 'reg_alpha': 2.3957603280916917, 'reg_lambda': 3.9287174459457774}. Best is trial 0 with value: 0.1320407744834603.
[I 2025-03-10 22:40:19,842] Trial 2 finished with value: 0.17387557252800973 and parameters: {'n_estimators': 151, 'learning_rate': 0.09420809808722666, 'max_depth': 6, 'subsample': 0.6443733617592554, 'colsample_bytree': 0.8938086

Training: Houston


[I 2025-03-10 22:45:47,712] Trial 0 finished with value: 0.2332436152562532 and parameters: {'n_estimators': 184, 'learning_rate': 0.034607374042699465, 'max_depth': 5, 'subsample': 0.8486171453128217, 'colsample_bytree': 0.680640711290599, 'gamma': 4.0837222916588605, 'reg_alpha': 3.929572290128751, 'reg_lambda': 2.7185267351100553}. Best is trial 0 with value: 0.2332436152562532.
[I 2025-03-10 22:45:52,563] Trial 1 finished with value: 0.15868457448406517 and parameters: {'n_estimators': 57, 'learning_rate': 0.040365225754597604, 'max_depth': 7, 'subsample': 0.8873759998403111, 'colsample_bytree': 0.7863292008031636, 'gamma': 0.5400816183795332, 'reg_alpha': 0.22880794764958412, 'reg_lambda': 2.508615639307507}. Best is trial 1 with value: 0.15868457448406517.
[I 2025-03-10 22:46:02,177] Trial 2 finished with value: 0.2146551547097591 and parameters: {'n_estimators': 183, 'learning_rate': 0.19667037244924762, 'max_depth': 7, 'subsample': 0.7003020229385762, 'colsample_bytree': 0.5627

Training: Indianapolis


[I 2025-03-10 22:50:22,764] Trial 0 finished with value: 0.19204084338752572 and parameters: {'n_estimators': 199, 'learning_rate': 0.19905228832646094, 'max_depth': 4, 'subsample': 0.7965570457581697, 'colsample_bytree': 0.8711237149843584, 'gamma': 2.197577808901711, 'reg_alpha': 4.505577678807116, 'reg_lambda': 1.623757417212814}. Best is trial 0 with value: 0.19204084338752572.
[I 2025-03-10 22:50:29,408] Trial 1 finished with value: 0.1965490916024168 and parameters: {'n_estimators': 152, 'learning_rate': 0.14492050461257486, 'max_depth': 3, 'subsample': 0.935692777120958, 'colsample_bytree': 0.5248300227330003, 'gamma': 4.932075707216149, 'reg_alpha': 3.4855872212096766, 'reg_lambda': 1.9341301358624214}. Best is trial 0 with value: 0.19204084338752572.
[I 2025-03-10 22:50:34,621] Trial 2 finished with value: 0.19268435481383192 and parameters: {'n_estimators': 128, 'learning_rate': 0.09205455102024612, 'max_depth': 5, 'subsample': 0.5937032920243522, 'colsample_bytree': 0.707089

Training: Jacksonville


[I 2025-03-10 22:55:31,328] Trial 0 finished with value: 0.22759388157542543 and parameters: {'n_estimators': 146, 'learning_rate': 0.11742788690197066, 'max_depth': 5, 'subsample': 0.5765591820957999, 'colsample_bytree': 0.9573417183557306, 'gamma': 4.199526757182895, 'reg_alpha': 1.0678977930464217, 'reg_lambda': 4.271813642016288}. Best is trial 0 with value: 0.22759388157542543.
[I 2025-03-10 22:55:34,890] Trial 1 finished with value: 0.17977633529227427 and parameters: {'n_estimators': 60, 'learning_rate': 0.08647014561274446, 'max_depth': 5, 'subsample': 0.8136109342039799, 'colsample_bytree': 0.5680583398257348, 'gamma': 1.2584875952160486, 'reg_alpha': 0.2155625690287888, 'reg_lambda': 3.8840824336079622}. Best is trial 1 with value: 0.17977633529227427.
[I 2025-03-10 22:55:40,817] Trial 2 finished with value: 0.19611056973016247 and parameters: {'n_estimators': 165, 'learning_rate': 0.19581187349492188, 'max_depth': 4, 'subsample': 0.973081608337869, 'colsample_bytree': 0.6053

Training: LasVegas


[I 2025-03-10 22:59:35,639] Trial 0 finished with value: 0.16890318147351474 and parameters: {'n_estimators': 97, 'learning_rate': 0.1969374275560062, 'max_depth': 5, 'subsample': 0.972639791641176, 'colsample_bytree': 0.7627949586196063, 'gamma': 0.623148208908304, 'reg_alpha': 1.1926246546653163, 'reg_lambda': 2.4733601244624337}. Best is trial 0 with value: 0.16890318147351474.
[I 2025-03-10 22:59:39,212] Trial 1 finished with value: 0.24085368146786196 and parameters: {'n_estimators': 57, 'learning_rate': 0.11120734255164241, 'max_depth': 3, 'subsample': 0.9525189466077488, 'colsample_bytree': 0.5530801755346646, 'gamma': 4.269338954992841, 'reg_alpha': 3.5604497205185885, 'reg_lambda': 0.704503725170153}. Best is trial 0 with value: 0.16890318147351474.
[I 2025-03-10 22:59:45,231] Trial 2 finished with value: 0.19646760974630703 and parameters: {'n_estimators': 63, 'learning_rate': 0.019672572361229358, 'max_depth': 3, 'subsample': 0.9243019146976572, 'colsample_bytree': 0.8335637

Training: LosAngeles


[I 2025-03-10 23:04:54,717] Trial 0 finished with value: 0.2100560909850917 and parameters: {'n_estimators': 72, 'learning_rate': 0.0612460507994512, 'max_depth': 7, 'subsample': 0.6750048945530048, 'colsample_bytree': 0.618711819478365, 'gamma': 2.3665053761675043, 'reg_alpha': 1.657071265059677, 'reg_lambda': 2.3785140447059456}. Best is trial 0 with value: 0.2100560909850917.
[I 2025-03-10 23:05:01,617] Trial 1 finished with value: 0.2219045051358624 and parameters: {'n_estimators': 134, 'learning_rate': 0.13252501996668895, 'max_depth': 3, 'subsample': 0.7208543806795604, 'colsample_bytree': 0.6873540886026461, 'gamma': 2.1369232280434036, 'reg_alpha': 4.827582387067157, 'reg_lambda': 3.8010173530548137}. Best is trial 0 with value: 0.2100560909850917.
[I 2025-03-10 23:05:07,397] Trial 2 finished with value: 0.21061485589976076 and parameters: {'n_estimators': 123, 'learning_rate': 0.12587326161446585, 'max_depth': 3, 'subsample': 0.7278644719825378, 'colsample_bytree': 0.578726338

Training: Louisville


[I 2025-03-10 23:08:17,218] Trial 0 finished with value: 0.21184411008131554 and parameters: {'n_estimators': 109, 'learning_rate': 0.06940068068172565, 'max_depth': 6, 'subsample': 0.5145650141797848, 'colsample_bytree': 0.5893340870941665, 'gamma': 3.145190473817422, 'reg_alpha': 0.7960144682818948, 'reg_lambda': 2.4550048638491817}. Best is trial 0 with value: 0.21184411008131554.
[I 2025-03-10 23:08:21,784] Trial 1 finished with value: 0.2002914693581991 and parameters: {'n_estimators': 173, 'learning_rate': 0.020637475845635134, 'max_depth': 4, 'subsample': 0.6915640749678118, 'colsample_bytree': 0.915220600970376, 'gamma': 0.21233321850218922, 'reg_alpha': 4.508017965778041, 'reg_lambda': 1.4010196888223592}. Best is trial 1 with value: 0.2002914693581991.
[I 2025-03-10 23:08:27,030] Trial 2 finished with value: 0.21756710770094706 and parameters: {'n_estimators': 186, 'learning_rate': 0.09114458708640637, 'max_depth': 6, 'subsample': 0.7430374369182091, 'colsample_bytree': 0.908

Training: MiamiFtLauderdale


[I 2025-03-10 23:11:07,030] Trial 0 finished with value: 0.17463404285548365 and parameters: {'n_estimators': 77, 'learning_rate': 0.030822824355666814, 'max_depth': 4, 'subsample': 0.797188555291136, 'colsample_bytree': 0.9800195112415357, 'gamma': 1.9017638614617587, 'reg_alpha': 1.875211514276887, 'reg_lambda': 1.9387821604360234}. Best is trial 0 with value: 0.17463404285548365.
[I 2025-03-10 23:11:10,157] Trial 1 finished with value: 0.18443115706247365 and parameters: {'n_estimators': 91, 'learning_rate': 0.18099352392753934, 'max_depth': 4, 'subsample': 0.5667614399766487, 'colsample_bytree': 0.5985648156082051, 'gamma': 3.3255377821318355, 'reg_alpha': 3.377161949411111, 'reg_lambda': 4.704097695241703}. Best is trial 0 with value: 0.17463404285548365.
[I 2025-03-10 23:11:12,686] Trial 2 finished with value: 0.1843400043624948 and parameters: {'n_estimators': 98, 'learning_rate': 0.10040892901701887, 'max_depth': 4, 'subsample': 0.9132766463444164, 'colsample_bytree': 0.8313127

Training: Midsouth


[I 2025-03-10 23:13:57,366] Trial 0 finished with value: 0.15691366452486671 and parameters: {'n_estimators': 126, 'learning_rate': 0.15855856996929038, 'max_depth': 6, 'subsample': 0.7237754885978522, 'colsample_bytree': 0.6403310252439072, 'gamma': 2.6575942443188, 'reg_alpha': 4.149621069674236, 'reg_lambda': 3.1036674176965446}. Best is trial 0 with value: 0.15691366452486671.
[I 2025-03-10 23:14:01,799] Trial 1 finished with value: 0.14913806746991595 and parameters: {'n_estimators': 157, 'learning_rate': 0.029726338682819785, 'max_depth': 3, 'subsample': 0.5037909048227179, 'colsample_bytree': 0.8855589149463283, 'gamma': 2.862544540962115, 'reg_alpha': 0.11967835624760725, 'reg_lambda': 3.673802343205981}. Best is trial 1 with value: 0.14913806746991595.
[I 2025-03-10 23:14:05,135] Trial 2 finished with value: 0.11904480723221525 and parameters: {'n_estimators': 148, 'learning_rate': 0.05935399828438398, 'max_depth': 3, 'subsample': 0.7301883343665354, 'colsample_bytree': 0.7510

Training: Nashville


[I 2025-03-10 23:16:34,070] Trial 0 finished with value: 0.2385907943116988 and parameters: {'n_estimators': 68, 'learning_rate': 0.04568718761276516, 'max_depth': 3, 'subsample': 0.5593544630139063, 'colsample_bytree': 0.8067688692821245, 'gamma': 2.6467322190066422, 'reg_alpha': 0.11033895053060405, 'reg_lambda': 2.937158806337263}. Best is trial 0 with value: 0.2385907943116988.
[I 2025-03-10 23:16:38,226] Trial 1 finished with value: 0.24674451511531625 and parameters: {'n_estimators': 132, 'learning_rate': 0.12986480904386277, 'max_depth': 7, 'subsample': 0.5628276292090246, 'colsample_bytree': 0.9113967781389687, 'gamma': 3.198926667823818, 'reg_alpha': 0.360031467858668, 'reg_lambda': 3.5423792193934824}. Best is trial 0 with value: 0.2385907943116988.
[I 2025-03-10 23:16:41,597] Trial 2 finished with value: 0.24379906593779346 and parameters: {'n_estimators': 89, 'learning_rate': 0.0367310077445014, 'max_depth': 4, 'subsample': 0.7110420544205229, 'colsample_bytree': 0.85782664
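
The `TimeSeriesSplit(n_splits=5)` used in the search above produces expanding-window folds: each fold trains on everything before its validation block, so the model never sees the future. The pure-Python sketch below mimics that scheme as I understand the scikit-learn defaults (no gap, equal-sized validation blocks). It is an illustration only, not the library's code, and `expanding_window_splits` is a hypothetical name.

```python
def expanding_window_splits(n_samples: int, n_splits: int = 5):
    """Yield (train_indices, validation_indices) with a growing training window."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("Not enough samples for the requested number of splits")
    # Validation blocks sit at the end of the series, one after another
    first_val_start = n_samples - n_splits * fold_size
    for val_start in range(first_val_start, n_samples, fold_size):
        train_idx = list(range(0, val_start))
        val_idx = list(range(val_start, val_start + fold_size))
        yield train_idx, val_idx


for fold, (train_idx, val_idx) in enumerate(expanding_window_splits(12, n_splits=5)):
    print(f"fold {fold}: train 0..{train_idx[-1]}, validate {val_idx[0]}..{val_idx[-1]}")
```

With 12 samples the first fold trains on rows 0..1 and validates on 2..3, and the last trains on rows 0..9 and validates on 10..11. The early folds are fitted on very little data, which is one reason per-fold errors can vary a lot in this setup.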

# Other section