# Overview of AutoML Tools
## 1. AutoGluon
Description: AutoGluon is an open-source AutoML framework from Amazon that focuses on ease of use and high performance. It automates machine learning workflows, including feature engineering, model selection, and hyperparameter tuning.
### Strengths:
*   Versatility: Supports multiple data types and tasks (e.g., regression, classification).
*   Ensemble Learning: Automatically builds ensembles of models.
*   Efficiency: Optimized for performance with multi-threading and GPU support.

## 2. MLJAR
Description: MLJAR is a Python library that automates the machine learning pipeline with a focus on simplicity and interpretability. It also supports multiple types of data and tasks.
### Strengths:
*   Easy-to-Use Interface: Simplified API for quick model training.
*   Ensemble Learning: Combines multiple models to improve performance.
*   Feature Importance: Provides insights into feature importance.

## 3. TPOT
Description: TPOT (Tree-based Pipeline Optimization Tool) is an AutoML tool that uses genetic algorithms to optimize machine learning pipelines. It's part of the scikit-learn ecosystem.
### Strengths:


*   Pipeline Optimization: Automatically designs and optimizes machine learning pipelines.
*   Genetic Algorithms: Uses evolutionary algorithms to find the best models.
* Customization: Allows for detailed control over the optimization process.






In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install autogluon
!pip install mljar-supervised

from autogluon.tabular import TabularDataset, TabularPredictor


#from flaml import AutoML

import pandas as pd
import numpy as np
import os
import torch
import matplotlib.pyplot as plt
import argparse
import logging
import pickle

from sklearn.metrics import accuracy_score, r2_score
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import shuffle


from IPython.display import Image, display

from datetime import datetime, timedelta

from scipy.stats import entropy

from __future__ import annotations

from pathlib import Path

random_seed = 42

Collecting autogluon
  Downloading autogluon-1.1.1-py3-none-any.whl (9.7 kB)
Collecting autogluon.core[all]==1.1.1 (from autogluon)
  Downloading autogluon.core-1.1.1-py3-none-any.whl (234 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m234.8/234.8 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting autogluon.features==1.1.1 (from autogluon)
  Downloading autogluon.features-1.1.1-py3-none-any.whl (63 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m63.4/63.4 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting autogluon.tabular[all]==1.1.1 (from autogluon)
  Downloading autogluon.tabular-1.1.1-py3-none-any.whl (312 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m312.1/312.1 kB[0m [31m8.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting autogluon.multimodal==1.1.1 (from autogluon)
  Downloading autogluon.multimodal-1.1.1-py3-none-any.whl (427 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [3

Collecting mljar-supervised
  Downloading mljar-supervised-1.1.9.tar.gz (127 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m127.1/127.1 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting dtreeviz>=2.2.2 (from mljar-supervised)
  Downloading dtreeviz-2.2.2-py3-none-any.whl (91 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m91.8/91.8 kB[0m [31m11.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting shap>=0.42.1 (from mljar-supervised)
  Downloading shap-0.46.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (540 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m540.1/540.1 kB[0m [31m23.8 MB/s[0m eta [36m0:00:00[0m
Collecting category_encoders>=2.2.2 (from mljar-supervised)
  Downloading category_encoders-2.6.3-py2.py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.9/81

Load data from multiple folds and concatenate them into single training and testing DataFrames. This approach handles cross-validation folds and prepares the dataset for modeling.

In [3]:
# Define the base folder path
base_folder_path = '/content/drive/My Drive/data/361098'

# Initialize lists to store DataFrames
df_train_list = []
df_test_list = []

# Loop through each fold inside the 361098 folder
for fold in range(1, 11):
    # Construct the file paths for each fold
    fold_path = os.path.join(base_folder_path, str(fold))

    # Load data
    X_train = pd.read_parquet(os.path.join(fold_path, 'X_train.parquet'))
    y_train = pd.read_parquet(os.path.join(fold_path, 'y_train.parquet'))
    X_test = pd.read_parquet(os.path.join(fold_path, 'X_test.parquet'))
    y_test = pd.read_parquet(os.path.join(fold_path, 'y_test.parquet'))

    # Ensure the target column is named 'target'
    y_train.columns = ['target']
    y_test.columns = ['target']
    # Concatenating dataframes
    df_train = pd.concat([X_train, y_train], axis=1)
    df_test = pd.concat([X_test, y_test], axis=1)

    # Append DataFrames to the lists
    df_train_list.append(df_train)
    df_test_list.append(df_test)

# Concatenate all DataFrames in the list
df_train = pd.concat(df_train_list)
df_test = pd.concat(df_test_list)

In [4]:
full_train = None
full_test = None
for fold_number in range(1, 11):
    train_dataset, test_dataset = df_train, df_test
    if full_train is None:
        full_train = train_dataset
        full_test = test_dataset
    else:
        # Use pd.concat to combine TabularDatasets
        full_train = pd.concat([full_train, train_dataset])
        full_test = pd.concat([full_test, test_dataset])


# AutoGluon Training
Train an AutoGluon model on a subset of the data and evaluate its performance. AutoGluon automatically selects and tunes various models to optimize the performance based on the evaluation metric (r2 in this case).

In [5]:
# Make set out of half the folds

train_brazil = TabularDataset(full_train)
test_brazil = TabularDataset(full_test)
subsample_size = 500  # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_brazil.sample(n=subsample_size, random_state=random_seed)
test_data = test_brazil.sample(n=subsample_size, random_state=random_seed)



In [6]:
label_brazil = 'target'
print(f"Unique classes: {list(train_data[label_brazil].unique())}")

Unique classes: [8.157370441186774, 7.837948916025283, 8.661639795781266, 7.106606137727303, 7.368970402194793, 8.919319398258887, 8.68710472813351, 7.060476365999801, 9.067277989134345, 7.906178840394815, 9.291090521661292, 8.683046555502886, 7.148345743900068, 8.239857411018601, 7.720017940432244, 8.020270472819238, 7.802209316247118, 8.705662478796427, 7.669028288589683, 7.529943370601589, 7.409136443920128, 7.494430215031565, 7.332369205929062, 8.488999457045455, 8.166784289056151, 7.3453648404168685, 8.223358899479258, 7.52294091807237, 8.546946149565585, 8.830250570199247, 8.780941113572387, 8.536407410340042, 8.423541635334782, 8.46168048148598, 8.386856689688234, 8.863474306170954, 7.586803535162581, 8.816853240627426, 9.33353098253138, 7.177782416195197, 8.914626127827137, 8.34924780056679, 8.344742754417545, 8.537975730598767, 6.763884908562435, 7.552762084214147, 9.189627330378642, 8.114025442356757, 7.080867896690782, 8.447199819595703, 9.210040326967182, 8.224967478914584,

In [7]:
## Function to fit the model using AutoGluon

def fit_gluon(train_dataset, problem_type='regression', hyperparameters=None, eval_metric='r2', presets='medium_quality', time_limit=100, fit_weighted_ensemble=None, num_cpus = None, num_gpus=None, auto_stack=None, num_bag_folds=None, num_bag_sets=None, num_stack_levels=None, num_trials=None, verbosity=None, ag_args_fit=None, feature_prune=None, excluded_model_types=None, keep_only_best=None):
    predictor = TabularPredictor(label=label_brazil, problem_type=problem_type, eval_metric=eval_metric)

    fit_args = {
        'train_data': train_dataset,
        'presets': presets,
        'time_limit': time_limit,
    }

    if hyperparameters is not None:
        fit_args['hyperparameters'] = hyperparameters
    if auto_stack is not None:
        fit_args['auto_stack'] = auto_stack
    if num_bag_folds is not None:
        fit_args['num_bag_folds'] = num_bag_folds
    if num_bag_sets is not None:
        fit_args['num_bag_sets'] = num_bag_sets
    if num_stack_levels is not None:
        fit_args['num_stack_levels'] = num_stack_levels
    if num_trials is not None:
        fit_args['num_trials'] = num_trials
    if verbosity is not None:
        fit_args['verbosity'] = verbosity
    if ag_args_fit is not None:
        fit_args['ag_args_fit'] = ag_args_fit
    if feature_prune is not None:
        fit_args['feature_prune'] = feature_prune
    if excluded_model_types is not None:
        fit_args['excluded_model_types'] = excluded_model_types
    if fit_weighted_ensemble is not None:
        fit_args['fit_weighted_ensemble'] = fit_weighted_ensemble
    if num_cpus is not None:
        fit_args['num_cpus'] = num_cpus
    if num_gpus is not None:
        fit_args['num_gpus'] = num_gpus
    if keep_only_best is not None:
        fit_args['keep_only_best'] = keep_only_best

    predictor.fit(**fit_args)
    return predictor


In [8]:
predictions_brazil = fit_gluon(train_dataset, time_limit=30)

No path specified. Models will be saved in: "AutogluonModels/ag-20240721_161941"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.1.1
Python Version:     3.10.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          2
Memory Avail:       11.25 GB / 12.67 GB (88.8%)
Disk Space Avail:   76.30 GB / 107.72 GB (70.8%)
Presets specified: ['medium_quality']
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to "AutogluonModels/ag-20240721_161941"
Train Data Rows:    96228
Train Data Columns: 11
Label Column:       target
Problem Type:       regression
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    11517.68 MB
	Train Data (Original)  Memory Usage: 4.22 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to

In [9]:
time_limit = 100  # for quick demonstration only, should set this to longest time you are willing to wait (in seconds)
metric = 'r2'  # specify your evaluation metric here
# train model
predictor_brazil = predictions_brazil.predict(test_data.drop(columns=[label_brazil]))
eval = predictions_brazil.evaluate(test_data)
print(eval)
print(predictor_brazil)


{'r2': 0.9999999999999152, 'root_mean_squared_error': -2.2900595538872259e-07, 'mean_squared_error': -5.24437276035016e-14, 'mean_absolute_error': -1.9127932116980163e-07, 'pearsonr': 0.9999999999999577, 'median_absolute_error': -1.7356791648381886e-07}
4301     8.166500
7690     8.134467
10616    7.747165
8435     9.215427
4578     8.487558
           ...   
5227     8.458504
1938     7.791110
9288     7.048387
4517     7.838737
8368     7.791110
Name: target, Length: 500, dtype: float32


#  MLJAR Training
Train an MLJAR model using the full dataset. MLJAR performs automated machine learning and provides a model that is evaluated on a test set. It focuses on easy-to-use interfaces and interpretability.

In [10]:
from re import X
# Make set out of all the folds

full_train_X = None
full_train_y = None
full_test_X = None
full_test_y = None

for fold_number in range(1, 11):
    train_dataset_X, train_dataset_y, test_dataset_X, test_dataset_y = X_train, y_train, X_test, y_test
    if full_train_X is None:
        full_train_X = train_dataset_X
        full_train_y = train_dataset_y
        full_test_X = test_dataset_X
        full_test_y = test_dataset_y
    else:
        # Use pd.concat to combine TabularDatasets
        full_train_X = pd.concat([full_train_X, train_dataset_X])
        full_train_y = pd.concat([full_train_y, train_dataset_y])
        full_test_X = pd.concat([full_test_X, test_dataset_X])
        full_test_y = pd.concat([full_test_y, test_dataset_y])

In [11]:
from supervised.automl import AutoML

# Initialize AutoML for regression
mljar_automl_regressor = AutoML(
    mode="Compete",  # Set to "Compete" for more thorough training
    total_time_limit=1200,  # Total time for the task in seconds
    n_jobs=-1  # Use all available cores
)


In [12]:
# Fit the model on the full training data
mljar_automl_regressor.fit(full_train_X, full_train_y['target'])


Linear algorithm was disabled.
AutoML directory: AutoML_1
The task is regression with evaluation metric rmse
AutoML will use algorithms: ['Decision Tree', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'mix_encoding', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step adjust_validation will try to check up to 1 model
1_DecisionTree rmse 0.245676 trained in 2.67 seconds
Adjust validation. Remove: 1_DecisionTree
Validation strategy: 5-fold CV Shuffle
* Step simple_algorithms will try to check up to 3 models
1_DecisionTree rmse 0.242206 trained in 4.13 seconds
2_DecisionTree rmse 0.177643 trained in 4.37 seconds
3_DecisionTree rmse 0.1776

In [18]:
with open('mljar_model.pkl', 'wb') as f:
    pickle.dump(mljar_automl_regressor, f)

In [14]:
# Load the model
with open('mljar_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

In [15]:
# Make predictions with the loaded model
mljar_predictions = loaded_model.predict(full_test_X)
mljar_score = r2_score(full_test_y['target'], mljar_predictions)
print("MLJAR AutoML R2 score:", mljar_score)

MLJAR AutoML R2 score: 0.9931095041804755


# TPOT Training
Train a TPOT model which uses genetic algorithms to optimize machine learning pipelines. The best pipeline is saved and then used for making predictions. TPOT focuses on pipeline optimization and offers a high degree of automation.

In [16]:
!pip install tpot
from tpot import TPOTRegressor

Collecting tpot
  Downloading TPOT-0.12.2-py3-none-any.whl (87 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/87.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m87.4/87.4 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
Collecting scikit-learn>=1.4.1 (from tpot)
  Downloading scikit_learn-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.4/13.4 MB[0m [31m30.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting deap>=1.2 (from tpot)
  Downloading deap-1.4.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (135 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m135.4/135.4 kB[0m [31m14.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting update-checker>=0.16 (from tpot)
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting stopit>=1.1.1 (f

In [27]:
# Assuming the data is already in numpy arrays as shown:
full_train_X = np.array(full_train_X)
full_train_y = np.array(full_train_y)
full_test_X = np.array(full_test_X)
full_test_y = np.array(full_test_y)
display(full_train_X), display(full_train_y)

# Convert to pandas DataFrame
full_train_X = pd.DataFrame(full_train_X)
full_train_y = pd.Series(full_train_y)
full_test_X = pd.DataFrame(full_test_X)
full_test_y = pd.Series(full_test_y)

display(full_train_X), display(full_train_y)

array([[   4,   76,    2, ..., 5600,    0,   71],
       [   4,   70,    3, ..., 1800,   42,   23],
       [   4,   92,    3, ..., 4250,   21,   54],
       ...,
       [   4,   76,    2, ..., 4250,  217,   54],
       [   0,  365,    4, ..., 8800,  459,  118],
       [   3,   30,    1, ..., 1480,   25,   20]])

array([8.75966867, 7.7380523 , 8.53227883, ..., 8.59932602, 9.46350864,
       7.33040521])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,4,76,2,1,1,0,0,700,5600,0,71
1,4,70,3,1,1,1,1,428,1800,42,23
2,4,92,3,3,1,0,0,750,4250,21,54
3,1,48,2,1,1,0,1,183,1190,0,16
4,1,62,2,1,1,0,1,485,980,48,13
...,...,...,...,...,...,...,...,...,...,...,...
96225,4,16,1,1,0,0,1,0,2700,5,35
96226,4,20,1,1,1,1,0,1700,1360,0,18
96227,4,76,2,1,1,1,0,906,4250,217,54
96228,0,365,4,4,3,0,0,3500,8800,459,118


0        8.759669
1        7.738052
2        8.532279
3        7.237059
4        7.331060
           ...   
96225    7.916078
96226    8.032360
96227    8.599326
96228    9.463509
96229    7.330405
Length: 96230, dtype: float64

(None, None)

In [28]:
tpot = TPOTRegressor(
    verbosity=2,
    generations=3,          # Reduce number of generations
    population_size=20,     # Reduce population size
    random_state=42,
    n_jobs=-1,              # Utilize all CPU cores
    max_time_mins=30,       # Max total time in minutes
    max_eval_time_mins=2    # Max time per pipeline in minutes
)
tpot.fit(full_train_X, full_train_y)
display(full_train_X), display(full_train_y)

Optimization Progress:   0%|          | 0/20 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -0.0009738433062893204

30.79 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: KNeighborsRegressor(input_matrix, n_neighbors=32, p=2, weights=uniform)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,4,76,2,1,1,0,0,700,5600,0,71
1,4,70,3,1,1,1,1,428,1800,42,23
2,4,92,3,3,1,0,0,750,4250,21,54
3,1,48,2,1,1,0,1,183,1190,0,16
4,1,62,2,1,1,0,1,485,980,48,13
...,...,...,...,...,...,...,...,...,...,...,...
96225,4,16,1,1,0,0,1,0,2700,5,35
96226,4,20,1,1,1,1,0,1700,1360,0,18
96227,4,76,2,1,1,1,0,906,4250,217,54
96228,0,365,4,4,3,0,0,3500,8800,459,118


0        8.759669
1        7.738052
2        8.532279
3        7.237059
4        7.331060
           ...   
96225    7.916078
96226    8.032360
96227    8.599326
96228    9.463509
96229    7.330405
Length: 96230, dtype: float64

(None, None)

In [31]:
# Save the best pipeline
import joblib

# Save the model
joblib.dump(tpot.fitted_pipeline_, "tpot_pipeline.joblib")

# At prediction time
loaded_pipeline = joblib.load("tpot_pipeline.joblib")

# Make predictions
tpot_predictions = loaded_pipeline.predict(full_test_X)
tpot_score = r2_score(full_test_y, tpot_predictions)
print("TPOT R2 score:", tpot_score)

TPOT R2 score: 0.9993435662899023


# Comparison

Compare the performance of the models trained by AutoGluon, MLJAR, and TPOT based on the R2 score, which measures the goodness of fit.

In [33]:
print("AutoGluon R2 score:", eval['r2'])  # Assuming 'eval' contains AutoGluon's evaluation results
print("MLJAR R2 score:", mljar_score)
print("TPOT R2 score:", tpot_score)


AutoGluon R2 score: 0.9999999999999152
MLJAR R2 score: 0.9931095041804755
TPOT R2 score: 0.9993435662899023


# Run all tools on the exam dataset

In [49]:
!ls /content/drive/My\ Drive/data/

361092	361098	361099	exam_dataset


In [54]:
# Final dataset
random_seed = 42
base_path = '/content/drive/My Drive/data/exam_dataset'

X_train = pd.read_parquet(f'{base_path}/X_train.parquet')
y_train = pd.read_parquet(f'{base_path}/y_train.parquet')
train_dataset = pd.concat([X_train, y_train], axis=1)
test = train_dataset.sample(frac=0.2, replace=False, random_state=random_seed)

# Also instantiate the target column
label = 'price'
print(X_train.columns)
print(y_train.columns)


Index(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'waterfront',
       'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated',
       'lat', 'long', 'sqft_living15', 'sqft_lot15', 'date_year', 'date_month',
       'date_day'],
      dtype='object')
Index(['price'], dtype='object')


In [55]:

# Initialize AutoML for regression
mljar_automl = AutoML(
    mode="Compete",  # Use "Compete" mode for more thorough training
    total_time_limit=30,  # Total time for the task in seconds
    n_jobs=-1  # Use all available cores
)

# Fit the model
mljar_automl.fit(X_train, y_train)

# Make predictions
mljar_pred = mljar_automl.predict(test.drop(columns=[label]))

# Evaluate the model
mljar_score = r2_score(test[label], mljar_pred)
print("MLJAR R2 score:", mljar_score)

Linear algorithm was disabled.
AutoML directory: AutoML_3
The task is regression with evaluation metric rmse
AutoML will use algorithms: ['Decision Tree', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['adjust_validation', 'simple_algorithms', 'default_algorithms', 'not_so_random', 'mix_encoding', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step adjust_validation will try to check up to 1 model
1_DecisionTree rmse 0.308772 trained in 1.33 seconds
Disable stacking for split validation
* Step simple_algorithms will try to check up to 2 models
2_DecisionTree rmse 0.278504 trained in 1.15 seconds
3_DecisionTree rmse 0.278504 trained in 0.99 seconds
* Step default_algorithms will try to check up to 7 models
4_Default_

In [58]:
# Initialize the AutoGluon TabularPredictor
def fit_gluon(train_dataset, time_limit=30, verbosity=1, keep_only_best=True):
    predictor = TabularPredictor(label=label, verbosity=verbosity)
    predictor.fit(train_data=train_dataset, time_limit=time_limit, keep_only_best=keep_only_best)
    return predictor
gluon = fit_gluon(train_dataset, time_limit=30, verbosity=1, keep_only_best=True)
gluon_pred = gluon.predict(test.drop(columns=[label]))
eval_gluon = gluon.evaluate(test)
print(eval_gluon['r2'])

No path specified. Models will be saved in: "AutogluonModels/ag-20240721_193127"
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Deleting model KNeighborsUnif. All files under AutogluonModels/ag-20240721_193127/models/KNeighborsUnif will be removed.
Deleting model KNeighborsDist. All files under AutogluonModels/ag-20240721_193127/models/KNeighborsDist will be removed.


0.9611797217358052


In [60]:
ensemble_1 = (gluon_pred + mljar_pred) / 2
print('R2 score ensemble 50/50 gluon and mljar', r2_score(ensemble_1, test[label]))

R2 score ensemble 50/50 gluon and mljar 0.9490374947835988


In [61]:
!pip install tpot
from tpot import TPOTRegressor




In [69]:
import numpy as np

full_train_X = np.array(full_train_X)
full_train_y = np.array(full_train_y)
full_test_X = np.array(full_test_X)
full_test_y = np.array(full_test_y)
display(full_train_X), display(full_train_y)

# Convert to pandas DataFrame
X_train_np = pd.DataFrame(full_train_X)
y_train_np = pd.Series(full_train_y)
X_test_np = pd.DataFrame(full_test_X)
y_test_np = pd.Series(full_test_y)

tpot = TPOTRegressor(
    verbosity=2,
    generations=3,          # Reduce number of generations
    population_size=20,     # Reduce population size
    random_state=42,
    n_jobs=-1,              # Utilize all CPU cores
    max_time_mins=30,       # Max total time in minutes
    max_eval_time_mins=2    # Max time per pipeline in minutes
)
# Fit the TPOT model
tpot.fit(X_train_np, y_train_np)

# Make predictions
tpot_pred = tpot.predict(X_test_np)

# Evaluate the model
tpot_score = r2_score(y_test_np, tpot_pred)
print("TPOT R2 score:", tpot_score)

array([[   4,   76,    2, ..., 5600,    0,   71],
       [   4,   70,    3, ..., 1800,   42,   23],
       [   4,   92,    3, ..., 4250,   21,   54],
       ...,
       [   4,   76,    2, ..., 4250,  217,   54],
       [   0,  365,    4, ..., 8800,  459,  118],
       [   3,   30,    1, ..., 1480,   25,   20]])

array([8.75966867, 7.7380523 , 8.53227883, ..., 8.59932602, 9.46350864,
       7.33040521])

Optimization Progress:   0%|          | 0/20 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -0.0009738433062893204

30.52 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: KNeighborsRegressor(input_matrix, n_neighbors=32, p=2, weights=uniform)
TPOT R2 score: 0.9993435662899023
