
# Tutorial - AutoGluon Explained
![](https://auto.gluon.ai/stable/_static/autogluon.png)

# Installation

In [1]:
!pip install ray==2.10.0
!pip install autogluon.tabular
!pip install -U ipywidgets

Collecting ray==2.10.0
  Downloading ray-2.10.0-cp310-cp310-manylinux2014_x86_64.whl.metadata (13 kB)
Downloading ray-2.10.0-cp310-cp310-manylinux2014_x86_64.whl (65.1 MB)
Installing collected packages: ray
  Attempting uninstall: ray
    Found existing installation: ray 2.24.0
    Uninstalling ray-2.24.0:
      Successfully uninstalled ray-2.24.0
Successfully installed ray-2.10.0
Collecting autogluon.tabular
  Downloading autogluon.tabular-1.1.1-py3-none-any.whl.metadata (13 kB)
Collecting scipy<1.13,>=1.5.4 (from autogluon.tabular)
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting scikit-learn<1.4.1,>=1.3.0 (from autogluon.tabular)
  Downloading scikit_learn-1.4.0-

# Import

In [2]:
import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor

# Read Data

In [3]:
train_data = pd.read_csv('/kaggle/input/playground-series-s4e9/train.csv').drop('id', axis=1)
test_data = pd.read_csv('/kaggle/input/playground-series-s4e9/test.csv').drop('id', axis=1)
submission = pd.read_csv('/kaggle/input/playground-series-s4e9/sample_submission.csv')

# View Training and Test data

In [4]:
train_data.head()

|  | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MINI | Cooper S Base | 2007 | 213000 | Gasoline | 172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel | A/T | Yellow | Gray | None reported | Yes | 4200 |
| 1 | Lincoln | LS V8 | 2002 | 143250 | Gasoline | 252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel | A/T | Silver | Beige | At least 1 accident or damage reported | Yes | 4999 |
| 2 | Chevrolet | Silverado 2500 LT | 2002 | 136731 | E85 Flex Fuel | 320.0HP 5.3L 8 Cylinder Engine Flex Fuel Capab... | A/T | Blue | Gray | None reported | Yes | 13900 |
| 3 | Genesis | G90 5.0 Ultimate | 2017 | 19500 | Gasoline | 420.0HP 5.0L 8 Cylinder Engine Gasoline Fuel | Transmission w/Dual Shift Mode | Black | Black | None reported | Yes | 45000 |
| 4 | Mercedes-Benz | Metris Base | 2021 | 7388 | Gasoline | 208.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | 7-Speed A/T | Black | Beige | None reported | Yes | 97500 |


In [5]:
test_data.head()

|  | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Land | Rover LR2 Base | 2015 | 98000 | Gasoline | 240.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | 6-Speed A/T | White | Beige | None reported | Yes |
| 1 | Land | Rover Defender SE | 2020 | 9142 | Hybrid | 395.0HP 3.0L Straight 6 Cylinder Engine Gasoli... | 8-Speed A/T | Silver | Black | None reported | Yes |
| 2 | Ford | Expedition Limited | 2022 | 28121 | Gasoline | 3.5L V6 24V PDI DOHC Twin Turbo | 10-Speed Automatic | White | Ebony | None reported |  |
| 3 | Audi | A6 2.0T Sport | 2016 | 61258 | Gasoline | 2.0 Liter TFSI | Automatic | Silician Yellow | Black | None reported |  |
| 4 | Audi | A6 2.0T Premium Plus | 2018 | 59000 | Gasoline | 252.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | A/T | Gray | Black | None reported | Yes |


# AutoGluon Introduction

AutoGluon is an open-source AutoML framework developed by Amazon that simplifies the process of building machine learning models for various tasks, including tabular data prediction, image classification, text analysis, and more. It automates the entire machine learning pipeline, from data preprocessing to model selection and hyperparameter tuning, making it accessible for users with minimal coding and machine learning expertise. AutoGluon supports both classification and regression problems and leverages powerful ensemble techniques to deliver high-quality models. It also allows users to specify resource constraints, like time limits and hardware availability (GPUs/CPUs), to optimize model training efficiency.

# AutoGluon Code with Explanation

In [6]:
predictor = TabularPredictor(
    label='price',             # Target column that needs to be predicted (dependent variable)
    eval_metric='rmse',        # Evaluation metric (Root Mean Squared Error) used to judge the model’s performance
    problem_type='regression'  # Specifying this is a regression problem
).fit(
    train_data,                  # The training dataset containing features and the target (price)
    presets='best_quality',    # The preset configuration for optimal quality (though it may take more time)
    time_limit=3600*3,      # Time limit for training (3 hours = 3600 seconds/hour * 3 hours)
    verbosity=2,               # Level of logging information (2 is medium verbosity)
    excluded_model_types=['KNN'], # Exclude K-Nearest Neighbors models from training
    ag_args_fit={
        'num_gpus': 2,          # Use 2 GPUs if available for model training
        'num_cpus': 4           # Use 4 CPUs for model training
    }
)

No path specified. Models will be saved in: "AutogluonModels/ag-20240910_145635"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.1.1
Python Version:     3.10.14
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Jun 27 20:43:36 UTC 2024
CPU Count:          4
Memory Avail:       30.16 GB / 31.36 GB (96.2%)
Disk Space Avail:   19.50 GB / 19.52 GB (99.9%)
Presets specified: ['best_quality']
Setting dynamic_stacking from 'auto' to True. Reason: Enable dynamic_stacking when use_bag_holdout is disabled. (use_bag_holdout=False)
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
DyStack is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
	This is used to identify the optimal `num_stack_levels` value. Copies of AutoGluon will be fit on subsets of the data. Then holdout validation data is used to 

* **label='price':** The column name 'price' is the target (dependent variable) to be predicted.
* **eval_metric='rmse':** The Root Mean Squared Error (RMSE) is chosen as the evaluation metric, which is common for regression tasks.
* **problem_type='regression':** Specifies that the task is a regression task (i.e., predicting continuous values).
* **train_data:** This is the DataFrame containing the training data with both features and the target (price).
* **presets='best_quality':** This preset prioritizes accuracy over training speed. As the training log shows, it enables bagging (num_bag_folds=8) and stacking (num_stack_levels=1), so training takes considerably longer than with lighter presets.
* **time_limit=3600*3:** Limits the model training process to a maximum of 3 hours (the value is given in seconds).
* **verbosity=2:** Specifies the verbosity level for logging. Higher values will show more details about the training process.
* **excluded_model_types=['KNN']:** K-Nearest Neighbors (KNN) models are excluded from being considered during training.
* **ag_args_fit:** This argument passes fit-time options to the models that AutoGluon trains:
    * **num_gpus=2:** Each model may use up to 2 GPUs if they are available, which speeds up training for algorithms with GPU support.
    * **num_cpus=4:** Each model may use up to 4 CPU cores during training.
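
To make the `eval_metric='rmse'` setting concrete, here is a minimal sketch of how RMSE is computed. The three actual prices are the first rows of `train_data.head()`; the predicted values and the helper name `rmse` are made up for illustration and are not AutoGluon output or API.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# First three prices from train_data.head().
actual = [4200, 4999, 13900]
# Hypothetical predictions, invented for this example.
predicted = [5200, 3999, 15900]

error = rmse(actual, predicted)
print(f"RMSE: {error:.2f}")
```

Because the errors are squared before averaging, a single large miss (the 2,000 error on the third car) pulls the metric up more than the two smaller misses do.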

In [7]:
results = predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                          model     score_val              eval_metric  pred_time_val     fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0           WeightedEnsemble_L3 -72449.253754  root_mean_squared_error      72.923016  7015.015082                0.002833           0.555356            3       True         26
1               CatBoost_BAG_L2 -72542.797184  root_mean_squared_error      41.507642  5417.826240                0.653220          77.326773            2       True         20
2           WeightedEnsemble_L2 -72600.339591  root_mean_squared_error      35.898734  3786.125246                0.003237           0.470863            2       True         16
3             LightGBMXT_BAG_L2 -72603.165649  root_mean_squared_error      42.533028  5430.640810                1.678606          90.141343            2       True         17
4               CatBoost_BAG_L1 -72812.796073  root_m

* **fit_summary():** After training is complete, this method outputs a summary of the models trained, their performance, and additional statistics. The results object will contain information such as the leaderboard of model performance, training times, and which model was selected as the best for predictions.

In [8]:
predictor.leaderboard()

|  | model | score_val | eval_metric | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L3 | -72449.253754 | root_mean_squared_error | 72.923016 | 7015.015082 | 0.002833 | 0.555356 | 3 | True | 26 |
| 1 | CatBoost_BAG_L2 | -72542.797184 | root_mean_squared_error | 41.507642 | 5417.82624 | 0.65322 | 77.326773 | 2 | True | 20 |
| 2 | WeightedEnsemble_L2 | -72600.339591 | root_mean_squared_error | 35.898734 | 3786.125246 | 0.003237 | 0.470863 | 2 | True | 16 |
| 3 | LightGBMXT_BAG_L2 | -72603.165649 | root_mean_squared_error | 42.533028 | 5430.64081 | 1.678606 | 90.141343 | 2 | True | 17 |
| 4 | CatBoost_BAG_L1 | -72812.796073 | root_mean_squared_error | 1.244417 | 120.778952 | 1.244417 | 120.778952 | 1 | True | 4 |
| 5 | CatBoost_r177_BAG_L1 | -72844.316077 | root_mean_squared_error | 0.979802 | 89.376491 | 0.979802 | 89.376491 | 1 | True | 10 |
| 6 | NeuralNetFastAI_BAG_L2 | -72879.922868 | root_mean_squared_error | 44.351679 | 5998.206054 | 3.497256 | 657.706587 | 2 | True | 22 |
| 7 | LightGBMXT_BAG_L1 | -72917.481425 | root_mean_squared_error | 0.631912 | 56.368863 | 0.631912 | 56.368863 | 1 | True | 1 |
| 8 | XGBoost_BAG_L2 | -72995.075779 | root_mean_squared_error | 42.137048 | 5388.159207 | 1.282625 | 47.65974 | 2 | True | 23 |
| 9 | LightGBM_BAG_L2 | -73042.741028 | root_mean_squared_error | 41.245911 | 5401.166497 | 0.391488 | 60.66703 | 2 | True | 18 |


**What the Leaderboard Shows:**
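* **model:** The name of the model that was trained. This can include various types of models such as Random Forest, Gradient Boosting, Neural Networks, etc.
* **score_val:** The score on validation data, indicating how well the model performs on data it hasn’t seen during training. AutoGluon reports every metric so that higher is better, so RMSE appears negated: a score_val of -72449.25 means a validation RMSE of about 72,449.
* **fit_time:** The time in seconds taken to train the model, including any lower-level models it depends on. fit_time_marginal counts the model alone.
* **pred_time_val:** The time in seconds taken to make predictions on the validation data, again including dependencies. pred_time_val_marginal counts the model alone.
* **stack_level:** The stacking layer the model belongs to (the L1, L2, L3 suffix in the model name).
* **fit_order:** The order in which the models were trained.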
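
The sign convention of `score_val` is easy to trip over, so here is a small sketch that ranks a few leaderboard rows and converts their scores back to RMSE. The numbers are copied from the leaderboard output; the variable names are illustrative only, and the snippet works on plain tuples rather than on the DataFrame that `predictor.leaderboard()` returns.

```python
# (model, score_val, fit_time) copied from the leaderboard output.
leaderboard_rows = [
    ("WeightedEnsemble_L3", -72449.253754, 7015.015082),
    ("CatBoost_BAG_L2", -72542.797184, 5417.826240),
    ("WeightedEnsemble_L2", -72600.339591, 3786.125246),
    ("LightGBMXT_BAG_L1", -72917.481425, 56.368863),
]

# score_val is "higher is better", so the best model has the maximum score.
best_model, best_score, _ = max(leaderboard_rows, key=lambda row: row[1])

# Flip the sign to recover RMSE in the units of the target (price).
rmse_by_model = {name: -score for name, score, _ in leaderboard_rows}

print(best_model, round(rmse_by_model[best_model], 2))
```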

**Interpreting the Table:**
* **WeightedEnsemble_L3:** This is an ensemble model that combines predictions from multiple other models (e.g., LightGBM, CatBoost). It is ranked first because it has the best validation score (score_val of -72449.25, i.e. the lowest RMSE).
* **LightGBMXT_BAG_L1:** A bagged LightGBM variant at stack level 1. It scores slightly worse than the ensemble (-72917.48) but took far less time to train: a fit_time of about 56 seconds, versus about 7,015 seconds for the ensemble and everything it depends on.
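
As a rough illustration of what "combines predictions from multiple other models" means, the sketch below blends two sets of predictions with a weighted average. The predictions, the weights, and the helper `weighted_blend` are all hypothetical; AutoGluon chooses its own ensemble weights from validation data, and this is not its implementation.

```python
def weighted_blend(predictions_by_model, weights):
    """Blend per-model predictions into one list using a weighted average."""
    total_weight = sum(weights.values())
    n_rows = len(next(iter(predictions_by_model.values())))
    blended = []
    for i in range(n_rows):
        weighted_sum = sum(
            weights[name] * predictions_by_model[name][i] for name in weights
        )
        blended.append(weighted_sum / total_weight)
    return blended


# Hypothetical predictions for three cars (not real model output).
predictions_by_model = {
    "CatBoost_BAG_L2": [10000.0, 20000.0, 30000.0],
    "LightGBMXT_BAG_L2": [12000.0, 18000.0, 33000.0],
}
# Hypothetical weights, invented for this example.
weights = {"CatBoost_BAG_L2": 0.75, "LightGBMXT_BAG_L2": 0.25}

blended = weighted_blend(predictions_by_model, weights)
print(blended)
```

Each blended value sits between the two base predictions, closer to the model with the larger weight.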

# Make Predictions

In [9]:
test_pred = predictor.predict(test_data)



# Submission

In [10]:
submission['price'] = test_pred
submission.to_csv('submission.csv', index=False)
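
Before uploading, it can be worth checking the file that was just written. The sketch below is one possible sanity check using only the standard library. It assumes the submission has exactly the columns `id` and `price`, which is inferred from the code above rather than stated by the competition, and the helper name `check_submission` and the sample rows are hypothetical.

```python
import csv
import io
import math


def is_valid_price(text):
    """True if text parses as a finite number."""
    try:
        return math.isfinite(float(text))
    except ValueError:
        return False


def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumed layout: an 'id' column followed by a 'price' column.
    if reader.fieldnames != ["id", "price"]:
        return [f"unexpected columns: {reader.fieldnames}"]
    rows = list(reader)
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    bad_ids = [row["id"] for row in rows if not is_valid_price(row["price"])]
    if bad_ids:
        problems.append(f"missing or non-finite price for ids: {bad_ids}")
    return problems


# Tiny made-up example; a real check would read the text of submission.csv.
sample = "id,price\n1,10500.0\n2,\n"
print(check_submission(sample, expected_rows=2))
```

In the notebook, the text would come from `open('submission.csv').read()` and `expected_rows` from `len(test_data)`.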