This script creates an ensemble of *AutoGluon* ensembles.

Strategy for Ensemble Configuration:

- Run 1: Standard configuration with all defaults.
- Run 2: Exclude deep learning models and focus on tree-based models with increased bagging.
- Run 3: Include only linear models and neural networks with hyperparameter optimization tuned towards smaller datasets.
- Run 4: Use a different feature set or preprocessing pipeline, perhaps focusing on interactions and polynomial features.
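
The first three runs in the list above differ only in the arguments passed to `fit()`, so they can be written down as plain keyword-argument dictionaries. The sketch below is a hedged illustration, not the configuration used later in this notebook: the run names, the fold count, and the `fit_run` helper are hypothetical, and the model keys (`GBM`, `CAT`, `XGB`, `RF`, `XT`, `LR`, `NN_TORCH`, `FASTAI`) are assumed to match the installed AutoGluon version. Run 3's hyperparameter optimization is left out, and Run 4 is omitted because it changes the input data rather than the `fit()` arguments.

```python
# Hypothetical fit() keyword arguments for Runs 1-3. The values are
# illustrative assumptions, not settings taken from this notebook.
RUN_CONFIGS = {
    # Run 1: all defaults.
    "run1_defaults": {},
    # Run 2: tree-based model families only, with more bagging folds.
    "run2_trees_bagged": {
        "hyperparameters": {"GBM": {}, "CAT": {}, "XGB": {}, "RF": {}, "XT": {}},
        "num_bag_folds": 8,  # assumed value
    },
    # Run 3: linear models and neural networks only.
    "run3_linear_and_nn": {
        "hyperparameters": {"LR": {}, "NN_TORCH": {}, "FASTAI": {}},
    },
}


def fit_run(name, train_data, label, eval_metric="r2", time_limit=100):
    """Fit one predictor using the named configuration (hypothetical helper)."""
    # Imported lazily so the configurations can be inspected without AutoGluon.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, eval_metric=eval_metric)
    return predictor.fit(train_data, time_limit=time_limit, **RUN_CONFIGS[name])
```

With the data loaded as in the cells below, a call would look like `fit_run("run2_trees_bagged", train_brazil, label_brazil)`.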

In [2]:
from autogluon.tabular import TabularDataset, TabularPredictor



In [3]:
### Load all the data and concatenate the X and y sets

import pandas as pd

# Loading the data
df_brazil_1_X = pd.read_parquet('../../data/361098/1/X_train.parquet')
df_brazil_1_y = pd.read_parquet('../../data/361098/1/y_train.parquet')
df_brazil_1_X_test = pd.read_parquet('../../data/361098/1/X_test.parquet')
df_brazil_1_y_test = pd.read_parquet('../../data/361098/1/y_test.parquet')

df_bike_1_X = pd.read_parquet('../../data/361099/1/X_train.parquet')
df_bike_1_y = pd.read_parquet('../../data/361099/1/y_train.parquet')
df_bike_1_X_test = pd.read_parquet('../../data/361099/1/X_test.parquet')
df_bike_1_y_test = pd.read_parquet('../../data/361099/1/y_test.parquet')

df_property_1_X = pd.read_parquet('../../data/361092/1/X_train.parquet')
df_property_1_y = pd.read_parquet('../../data/361092/1/y_train.parquet')
df_property_1_X_test = pd.read_parquet('../../data/361092/1/X_test.parquet')
df_property_1_y_test = pd.read_parquet('../../data/361092/1/y_test.parquet')

# Concatenating dataframes
df_brazil_1_train = pd.concat([df_brazil_1_X, df_brazil_1_y], axis=1)
df_brazil_1_test = pd.concat([df_brazil_1_X_test, df_brazil_1_y_test], axis=1)

df_bike_1_train = pd.concat([df_bike_1_X, df_bike_1_y], axis=1)
df_bike_1_test = pd.concat([df_bike_1_X_test, df_bike_1_y_test], axis=1)

df_property_1_train = pd.concat([df_property_1_X, df_property_1_y], axis=1)
df_property_1_test = pd.concat([df_property_1_X_test, df_property_1_y_test], axis=1)

# Display the data 
#df_brazil_1_train.head()
#df_bike_1_train.head()
#df_property_1_train.head()


In [4]:
train_brazil = TabularDataset(df_brazil_1_train).sample(n=500, random_state=0)
test_brazil = TabularDataset(df_brazil_1_test).sample(n=500, random_state=0)
train_bike = TabularDataset(df_bike_1_train).sample(n=500, random_state=0)
test_bike = TabularDataset(df_bike_1_test).sample(n=500, random_state=0)
train_property = TabularDataset(df_property_1_train).sample(n=500, random_state=0)
test_property = TabularDataset(df_property_1_test).sample(n=500, random_state=0)

#print(train_brazil.head(), test_brazil.head())
#print(train_bike.head(), test_bike.head())
#print(train_property.head(), test_property.head())



In [5]:
# Name of the label (target) column for each dataset.
label_brazil = 'total_(BRL)'
label_bike = 'count'
label_property = 'oz252'

In [12]:
### First fit of the model using base parameters

time_limit = 100  # for quick demonstration only; set this to the longest time you are willing to wait (in seconds)
metric = 'r2'  # evaluation metric used for validation and the leaderboard
# Train the model on the Brazil dataset (the log below reports label column 'total_(BRL)').
predictor_brazil = TabularPredictor(label=label_brazil, eval_metric=metric).fit(train_brazil, time_limit=time_limit, presets='medium_quality')

No path specified. Models will be saved in: "AutogluonModels\ag-20240628_130159"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.1.1
Python Version:     3.11.9
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.22631
CPU Count:          8
Memory Avail:       4.30 GB / 15.80 GB (27.2%)
Disk Space Avail:   184.41 GB / 952.46 GB (19.4%)
Presets specified: ['medium_quality']
Beginning AutoGluon training ... Time limit = 100s
AutoGluon will save models to "AutogluonModels\ag-20240628_130159"
Train Data Rows:    500
Train Data Columns: 11
Label Column:       total_(BRL)
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (10.142150059508008, 6.230481447578482, 8.26398, 0.82655)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one o

[1000]	valid_set's l2: 0.00857937	valid_set's r2: 0.98835


	0.9887	 = Validation score   (r2)
	2.71s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 97.01s of the 97.01s of remaining time.
	0.9899	 = Validation score   (r2)
	0.84s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 96.14s of the 96.13s of remaining time.
	0.9917	 = Validation score   (r2)
	0.72s	 = Training   runtime
	0.06s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 95.33s of the 95.32s of remaining time.
	Ran out of time, early stopping on iteration 2701.
	0.9863	 = Validation score   (r2)
	95.34s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 99.88s of the -0.06s of remaining time.
	Ensemble Weights: {'KNeighborsDist': 0.667, 'LightGBMXT': 0.143, 'LightGBM': 0.143, 'RandomForestMSE': 0.048}
	0.995	 = Validation score   (r2)
	0.07s	 = Training   runtime
	0.0s	 = Valid

In [13]:
# Evaluate on the full Brazil test set rather than the 500-row sample.
test_brazil = TabularDataset(df_brazil_1_test)

print(predictor_brazil.evaluate(test_brazil))
predictor_brazil.leaderboard(test_brazil)

{'r2': 0.9797601834838966, 'root_mean_squared_error': -0.11488719832418692, 'mean_squared_error': -0.01319906833878106, 'mean_absolute_error': -0.03040613105454128, 'pearsonr': 0.9900738908567597, 'median_absolute_error': -0.018254331812315172}
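
The error metrics in the dictionary above are negative because AutoGluon reports every metric in a "higher is better" form, so error-style metrics come back with their sign flipped. A minimal sketch of converting the reported values back to their conventional sign follows; the `ERROR_METRICS` set and the `to_conventional_sign` helper are hypothetical names introduced here for illustration.

```python
# Scores copied from the evaluate() output above.
scores = {
    "r2": 0.9797601834838966,
    "root_mean_squared_error": -0.11488719832418692,
    "mean_squared_error": -0.01319906833878106,
    "mean_absolute_error": -0.03040613105454128,
    "pearsonr": 0.9900738908567597,
    "median_absolute_error": -0.018254331812315172,
}

# Metrics where lower is better and the reported sign is therefore flipped.
ERROR_METRICS = {
    "root_mean_squared_error",
    "mean_squared_error",
    "mean_absolute_error",
    "median_absolute_error",
}


def to_conventional_sign(scores):
    """Return a copy of the scores with error metrics made positive again."""
    return {
        name: (-value if name in ERROR_METRICS else value)
        for name, value in scores.items()
    }


conventional = to_conventional_sign(scores)
for name, value in conventional.items():
    print(f"{name}: {value:.6f}")
```

As a sanity check, the squared RMSE should agree with the reported MSE once both are positive.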


index,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,KNeighborsDist,0.983597,0.994206,r2,0.020763,0.018284,0.0075,0.020763,0.018284,0.0075,1,True,2
1,KNeighborsUnif,0.982807,0.993797,r2,0.02767,0.020539,0.009005,0.02767,0.020539,0.009005,1,True,1
2,WeightedEnsemble_L2,0.97976,0.994994,r2,0.206441,0.086416,4.345753,0.004004,0.001008,0.070554,2,True,7
3,CatBoost,0.969121,0.986275,r2,0.030999,0.004013,95.340654,0.030999,0.004013,95.340654,1,True,6
4,LightGBMXT,0.965367,0.988694,r2,0.054982,0.005,2.706465,0.054982,0.005,2.706465,1,True,3
5,LightGBM,0.962041,0.989949,r2,0.013004,0.003003,0.844317,0.013004,0.003003,0.844317,1,True,4
6,RandomForestMSE,0.959275,0.99165,r2,0.113688,0.059122,0.716917,0.113688,0.059122,0.716917,1,True,5


In [14]:
predictor_brazil.model_best

'WeightedEnsemble_L2'
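
The training log reports the weights of `WeightedEnsemble_L2`. Assuming the ensemble prediction is a weighted average of the base-model predictions, the combination step can be sketched as below. The prediction values are made up for illustration, and `weighted_average` is a hypothetical helper, not an AutoGluon function. The same idea would apply when blending the predictors from several runs into an ensemble of ensembles.

```python
# Weights copied from the "Ensemble Weights" line of the training log.
weights = {
    "KNeighborsDist": 0.667,
    "LightGBMXT": 0.143,
    "LightGBM": 0.143,
    "RandomForestMSE": 0.048,
}


def weighted_average(predictions, weights):
    """Blend per-model prediction lists using the given weights.

    The weights are renormalized, since the rounded values in the log
    do not sum to exactly 1.
    """
    total = sum(weights.values())
    n_rows = len(next(iter(predictions.values())))
    return [
        sum(weights[model] * predictions[model][i] for model in weights) / total
        for i in range(n_rows)
    ]


# Made-up predictions for two rows, one list per base model.
example_predictions = {
    "KNeighborsDist": [8.0, 9.0],
    "LightGBMXT": [8.2, 9.1],
    "LightGBM": [8.1, 8.9],
    "RandomForestMSE": [7.9, 9.2],
}

blended = weighted_average(example_predictions, weights)
print(blended)
```

Because `KNeighborsDist` carries about two thirds of the weight, the blended values stay close to that model's predictions.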

In [15]:
predictor_brazil.feature_importance(test_brazil)

Computing feature importance via permutation shuffling for 11 features using 1070 rows with 5 shuffle sets...
	18.03s	= Expected runtime (3.61s per shuffle set)
	3.67s	= Actual runtime (Completed 5 of 5 shuffle sets)


feature,importance,stddev,p_value,n,p99_high,p99_low
rent_amount_(BRL),1.196586,0.034331,8.122662e-08,5,1.267274,1.125897
hoa_(BRL),0.1644956,0.003652,2.912253e-08,5,0.172014,0.156977
property_tax_(BRL),0.0314874,0.006661,0.0002266734,5,0.045203,0.017772
fire_insurance_(BRL),0.002048875,0.000292,4.804304e-05,5,0.00265,0.001448
area,0.0003668337,0.000119,0.001145092,5,0.000611,0.000123
city,4.141786e-05,3e-05,0.01789859,5,0.000103,-2e-05
furniture,3.693102e-05,5e-05,0.08697396,5,0.00014,-6.6e-05
parking_spaces,2.629676e-06,1.3e-05,0.3356975,5,2.9e-05,-2.4e-05
bathroom,-9.172283e-07,2.2e-05,0.5346817,5,4.5e-05,-4.6e-05
animal,-1.168672e-05,2.3e-05,0.8412869,5,3.5e-05,-5.9e-05
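
The log above states that the importances are computed via permutation shuffling: one feature's values are shuffled across rows, and the resulting drop in the evaluation score is averaged over several shuffle sets. The sketch below illustrates that idea in plain Python on a toy model. It is not AutoGluon's implementation, and the toy feature names, the toy model, and the helper functions are all hypothetical.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination, the metric used in this notebook."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, rows, y_true, feature, num_shuffle_sets=5, seed=0):
    """Mean drop in R^2 when one feature's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = r2_score(y_true, [predict(row) for row in rows])
    drops = []
    for _ in range(num_shuffle_sets):
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        # Copy each row, replacing only the shuffled feature.
        permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
        score = r2_score(y_true, [predict(row) for row in permuted])
        drops.append(baseline - score)
    return sum(drops) / len(drops)


def toy_predict(row):
    """Toy model: depends only on 'rent' and ignores 'noise'."""
    return 2.0 * row["rent"]


toy_rows = [{"rent": float(i), "noise": float((i * 7) % 5)} for i in range(20)]
toy_y = [2.0 * row["rent"] for row in toy_rows]

for name in ("rent", "noise"):
    print(name, permutation_importance(toy_predict, toy_rows, toy_y, name))
```

A feature the model ignores gets an importance of zero, while shuffling a feature the model relies on can push R^2 below zero, which is why an importance above 1 (as for `rent_amount_(BRL)`) is possible.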
