<a href="https://www.kaggle.com/code/taimour/s4e9-autogluon-explained-regression?scriptVersionId=196528449" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Tutorial - AutoGluon Explained
![](https://auto.gluon.ai/stable/_static/autogluon.png)

# Installation

To keep the notebook clean and avoid a lot of installation output, let's run the installs through `subprocess`.

In [1]:
import subprocess
import sys

# capture_output=True keeps pip's log out of the notebook output.
# sys.executable -m pip installs into the interpreter that runs this notebook.
for args in (["ray==2.10.0"], ["autogluon.tabular"], ["-U", "ipywidgets"]):
    subprocess.run([sys.executable, "-m", "pip", "install", *args], capture_output=True)
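
Because `capture_output=True` hides pip's output, a failed install would also go unnoticed. The sketch below shows one way to stay quiet while still surfacing failures. `run_quiet` is a hypothetical helper name, not part of the notebook, and the demo runs a trivial Python one-liner instead of a real install.

```python
import subprocess
import sys


def run_quiet(cmd):
    """Run a command silently; raise with the captured stderr if it fails."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"{' '.join(cmd)} failed:\n{result.stderr}")
    return result


# Demo with a harmless command. Assumption: a pip call such as
# [sys.executable, "-m", "pip", "install", "autogluon.tabular"] is passed the same way.
ok = run_quiet([sys.executable, "-c", "print('installed')"])
print(ok.stdout.strip())
```

With this pattern the notebook output stays short, but a broken install stops the run immediately instead of failing later at import time.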



# Import

In [2]:
import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor

# Read Data

In [3]:
train_data = pd.read_csv('/kaggle/input/playground-series-s4e9/train.csv').drop('id', axis=1)
test_data = pd.read_csv('/kaggle/input/playground-series-s4e9/test.csv').drop('id', axis=1)
submission = pd.read_csv('/kaggle/input/playground-series-s4e9/sample_submission.csv')

# View Training and Test data

In [4]:
train_data.head()

| | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MINI | Cooper S Base | 2007 | 213000 | Gasoline | 172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel | A/T | Yellow | Gray | None reported | Yes | 4200 |
| 1 | Lincoln | LS V8 | 2002 | 143250 | Gasoline | 252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel | A/T | Silver | Beige | At least 1 accident or damage reported | Yes | 4999 |
| 2 | Chevrolet | Silverado 2500 LT | 2002 | 136731 | E85 Flex Fuel | 320.0HP 5.3L 8 Cylinder Engine Flex Fuel Capab... | A/T | Blue | Gray | None reported | Yes | 13900 |
| 3 | Genesis | G90 5.0 Ultimate | 2017 | 19500 | Gasoline | 420.0HP 5.0L 8 Cylinder Engine Gasoline Fuel | Transmission w/Dual Shift Mode | Black | Black | None reported | Yes | 45000 |
| 4 | Mercedes-Benz | Metris Base | 2021 | 7388 | Gasoline | 208.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | 7-Speed A/T | Black | Beige | None reported | Yes | 97500 |


In [5]:
test_data.head()

| | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Land | Rover LR2 Base | 2015 | 98000 | Gasoline | 240.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | 6-Speed A/T | White | Beige | None reported | Yes |
| 1 | Land | Rover Defender SE | 2020 | 9142 | Hybrid | 395.0HP 3.0L Straight 6 Cylinder Engine Gasoli... | 8-Speed A/T | Silver | Black | None reported | Yes |
| 2 | Ford | Expedition Limited | 2022 | 28121 | Gasoline | 3.5L V6 24V PDI DOHC Twin Turbo | 10-Speed Automatic | White | Ebony | None reported | |
| 3 | Audi | A6 2.0T Sport | 2016 | 61258 | Gasoline | 2.0 Liter TFSI | Automatic | Silician Yellow | Black | None reported | |
| 4 | Audi | A6 2.0T Premium Plus | 2018 | 59000 | Gasoline | 252.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | A/T | Gray | Black | None reported | Yes |


# AutoGluon Introduction

AutoGluon is an open-source AutoML framework developed by Amazon that simplifies the process of building machine learning models for various tasks, including tabular data prediction, image classification, text analysis, and more. It automates the entire machine learning pipeline, from data preprocessing to model selection and hyperparameter tuning, making it accessible for users with minimal coding and machine learning expertise. AutoGluon supports both classification and regression problems and leverages powerful ensemble techniques to deliver high-quality models. It also allows users to specify resource constraints, like time limits and hardware availability (GPUs/CPUs), to optimize model training efficiency.

# AutoGluon Code with Explanation

In [6]:
predictor = TabularPredictor(
    label='price',             # Target column that needs to be predicted (dependent variable)
    eval_metric='rmse',        # Evaluation metric (Root Mean Squared Error) used to judge the model’s performance
    problem_type='regression'  # Specifying this is a regression problem
).fit(
    train_data,                  # The training dataset containing features and the target (price)
    presets='best_quality',    # The preset configuration for optimal quality (though it may take more time)
    time_limit=3600*3,      # Time limit for training (3 hours = 3600 seconds/hour * 3 hours)
    verbosity=0,               # Level of logging information (0 is used to avoid alot of text in notebook)
    excluded_model_types=['KNN'], # Exclude K-Nearest Neighbors models from training
    ag_args_fit={
        'num_gpus': 1,          # Use 1 GPU if available for model training
        'num_cpus': 4           # Use 4 CPUs for model training
    }
)

No path specified. Models will be saved in: "AutogluonModels/ag-20240913_162627"
2024-09-13 16:26:30,796	INFO worker.py:1752 -- Started a local Ray instance.
(_ray_fit pid=452) Training S1F1 with GPU, note that this may negatively impact model quality compared to CPU training.
(_ray_fit pid=452) [LightGBM] [Fatal] bin size 1669 cannot run on GPU
(_ray_fit pid=494) Training S1F2 with GPU, note that this may negatively impact model quality compared to CPU training.
(_ray_fit pid=494) [LightGBM] [Fatal] bin size 1665 cannot run on GPU
(_ray_fit pid=536) Training S1F3 with GPU, note that this may negatively impact model quality compared to CPU training.
(_ray_fit pid=536) [LightGBM] [Fatal] bin size 1670 cannot run on GPU
(_ray_fit pid=578) Training S1F4 with GPU, note that this may negatively impact model quality compared to CPU training.
(_ray_fit pid=578) [LightGBM] [Fatal] bin size 1666 cannot run on GPU

* **label='price':** The column name 'price' is the target (dependent variable) to be predicted.
* **eval_metric='rmse':** The Root Mean Squared Error (RMSE) is chosen as the evaluation metric, which is common for regression tasks.
* **problem_type='regression':** Specifies that the task is a regression task (i.e., predicting continuous values).
* **train_data:** This is the DataFrame containing the training data with both features and the target (price).
* **presets='best_quality':** This preset prioritizes accuracy over training speed. It will try many models and techniques to ensure the highest possible quality.
* **time_limit=3600*3:** Limits the whole training run to a maximum of 3 hours (the value is given in seconds).
* **verbosity=0:** Sets the logging level. Higher values show more detail about the training process; 0 is used here to keep the notebook output short.
* **excluded_model_types=['KNN']:** K-Nearest Neighbors (KNN) models are not trained.
* **ag_args_fit:** Passes configuration options that apply to every model being fitted:
    * **num_gpus=1:** Each model may use 1 GPU if one is available, which speeds up training for some algorithms. Not every model can use it: the training log above shows LightGBM reporting `bin size ... cannot run on GPU`.
    * **num_cpus=4:** Each model uses 4 CPU cores during training.
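
To make the `eval_metric='rmse'` setting concrete, here is a minimal sketch of how RMSE is computed. The `rmse` function and the numbers are illustrative only and are not taken from the competition data or from AutoGluon's internals.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices for illustration only.
y_true = [100, 200, 300, 400]
y_pred = [106, 192, 300, 400]

# Errors are 6, -8, 0, 0, so RMSE = sqrt((36 + 64) / 4) = 5.0
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, a few badly mispriced cars raise RMSE much more than many small misses do.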

# Why are KNN models excluded from training?

**The notebook does not state why K-Nearest Neighbors (KNN) models are excluded. The likely reasons are:**

**Computational Cost:** KNN does almost no work at training time but is expensive at prediction time. Each prediction requires calculating distances between the new data point and the training points, which is time-consuming for large datasets and gets worse with high-dimensional data.

**Sensitivity to Noise:** KNN models are sensitive to noise in the data. When only a few neighbors are used, outliers or noisy data points can significantly shift a prediction, leading to less accurate results.

**Scalability Issues:** As the dataset grows, KNN becomes harder to scale. With a brute-force search, the cost of each prediction increases linearly with the number of training points, and the whole training set has to be kept in memory.

**Interpretability:** KNN learns no explicit relationship between the features and the target, so it can be difficult to understand which features drove a particular prediction.


In summary, while KNN can be a simple and effective algorithm for certain problems, its drawbacks in terms of computational cost, sensitivity to noise, scalability, and interpretability make it less suitable for larger or more complex datasets. The decision to exclude KNN here is likely based on these considerations and on the desire to spend the limited training time on more efficient algorithms.
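
The prediction cost described above can be seen in a minimal brute-force sketch. `knn_predict` is a hypothetical helper for illustration, it uses a single numeric feature, and it is not how AutoGluon implements its KNN models.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Brute-force KNN regression on one numeric feature.

    Every call measures the distance from `query` to all training points,
    so the cost of one prediction grows with the size of the training set.
    """
    distances = [(abs(x - query), y) for x, y in zip(train_x, train_y)]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(y for _, y in nearest) / k


# Toy data for illustration (hypothetical values, not from the dataset).
train_x = [1.0, 2.0, 3.0, 10.0]
train_y = [10.0, 20.0, 30.0, 100.0]

# The three nearest points to 2.1 are 2.0, 3.0 and 1.0, so the prediction is 20.0
print(knn_predict(train_x, train_y, query=2.1))
```

With `k=1` and `query=9.0`, the single nearest point is the outlier at 10.0 and the prediction jumps to 100.0, which illustrates the sensitivity to noisy points.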

In [7]:
results = predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                          model     score_val              eval_metric  pred_time_val     fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0           WeightedEnsemble_L3 -72490.504440  root_mean_squared_error      74.438018  7142.474694                0.002777           0.559012            3       True         25
1               CatBoost_BAG_L2 -72519.993593  root_mean_squared_error      41.965800  5417.481269                0.529201          78.171062            2       True         20
2           WeightedEnsemble_L2 -72592.412648  root_mean_squared_error      36.106889  3857.897592                0.002920           0.418711            2       True         16
3             LightGBMXT_BAG_L2 -72674.286253  root_mean_squared_error      41.921645  5398.120039                0.485046          58.809832            2       True         17
4               CatBoost_BAG_L1 -72826.382752  root_m

* **fit_summary():** After training is complete, this method outputs a summary of the models trained, their performance, and additional statistics. The results object will contain information such as the leaderboard of model performance, training times, and which model was selected as the best for predictions.

# Models Leaderboard

In [8]:
predictor.leaderboard()

| | model | score_val | eval_metric | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L3 | -72490.50444 | root_mean_squared_error | 74.438018 | 7142.474694 | 0.002777 | 0.559012 | 3 | True | 25 |
| 1 | CatBoost_BAG_L2 | -72519.993593 | root_mean_squared_error | 41.9658 | 5417.481269 | 0.529201 | 78.171062 | 2 | True | 20 |
| 2 | WeightedEnsemble_L2 | -72592.412648 | root_mean_squared_error | 36.106889 | 3857.897592 | 0.00292 | 0.418711 | 2 | True | 16 |
| 3 | LightGBMXT_BAG_L2 | -72674.286253 | root_mean_squared_error | 41.921645 | 5398.120039 | 0.485046 | 58.809832 | 2 | True | 17 |
| 4 | CatBoost_BAG_L1 | -72826.382752 | root_mean_squared_error | 0.963896 | 117.284948 | 0.963896 | 117.284948 | 1 | True | 4 |
| 5 | CatBoost_r177_BAG_L1 | -72832.213633 | root_mean_squared_error | 0.769926 | 90.295459 | 0.769926 | 90.295459 | 1 | True | 10 |
| 6 | NeuralNetFastAI_BAG_L2 | -72852.617889 | root_mean_squared_error | 44.777326 | 6040.602611 | 3.340727 | 701.292404 | 2 | True | 22 |
| 7 | LightGBM_r96_BAG_L1 | -72875.108129 | root_mean_squared_error | 2.106201 | 81.626262 | 2.106201 | 81.626262 | 1 | True | 15 |
| 8 | LightGBMXT_BAG_L1 | -72917.481425 | root_mean_squared_error | 0.650084 | 54.389211 | 0.650084 | 54.389211 | 1 | True | 1 |
| 9 | XGBoost_BAG_L2 | -72979.600664 | root_mean_squared_error | 42.679213 | 5385.19455 | 1.242614 | 45.884343 | 2 | True | 23 |


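**What the Leaderboard Shows:**
* **model:** The name of the model that was trained. This can include various types of models such as Random Forest, Gradient Boosting, Neural Networks, etc., as well as ensembles built on top of them.
* **score_val:** The score on validation data, indicating how well the model performs on data it hasn't seen during training. AutoGluon reports scores so that higher is better, which is why RMSE appears with a negative sign: -72490.50 corresponds to an RMSE of 72490.50.
* **fit_time:** The time in seconds taken to train the model, including any base models it depends on. **fit_time_marginal** is the time for that model alone.
* **pred_time_val:** The time in seconds taken to make predictions on the validation data, including base models. **pred_time_val_marginal** is the time for that model alone.
* **stack_level:** The stacking layer the model belongs to. Level 2 and 3 models use the predictions of lower levels as inputs.
* **fit_order:** The order in which the models were trained.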
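
The validation scores in the table are negative because AutoGluon reports error metrics with the sign flipped, so that a higher score is always better. As a small sketch, the snippet below takes four `score_val` values copied from the leaderboard output and converts them back to RMSE; the variable names are illustrative.

```python
# score_val values copied from the leaderboard output in this notebook.
score_val = {
    "WeightedEnsemble_L3": -72490.504440,
    "CatBoost_BAG_L2": -72519.993593,
    "WeightedEnsemble_L2": -72592.412648,
    "LightGBMXT_BAG_L1": -72917.481425,
}

# Higher score_val is better, so the best model has the largest (least negative) score.
best_model = max(score_val, key=score_val.get)

# Flip the sign to recover RMSE in the units of the target (price).
rmse_by_model = {name: -score for name, score in score_val.items()}

print(best_model)                 # WeightedEnsemble_L3
print(rmse_by_model[best_model])  # 72490.50444
```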

**Interpreting the Table:**
* **WeightedEnsemble_L3:** This is an ensemble model that combines predictions from multiple other models (e.g., LightGBM, CatBoost). It is ranked first because it has the lowest RMSE on the validation data (72490.50, shown as a score_val of -72490.50).
* **LightGBMXT_BAG_L1:** A bagged LightGBM variant from the first stack level. Its validation RMSE (72917.48) is slightly worse than the ensemble's, but it trained in about 54 seconds, compared with a total fit_time of roughly 7142 seconds for the ensemble and the models it depends on.
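
A weighted ensemble combines the predictions of its base models as a weighted sum. The sketch below illustrates the idea only. The model names, predictions, and weights are hypothetical; the weights AutoGluon actually fitted are not shown in this notebook.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model predictions into one prediction per row via a weighted sum."""
    n_rows = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * preds[i] for name, preds in predictions.items())
        for i in range(n_rows)
    ]


# Hypothetical base-model predictions and weights, for illustration only.
predictions = {
    "CatBoost": [100.0, 200.0],
    "LightGBM": [110.0, 190.0],
}
weights = {"CatBoost": 0.75, "LightGBM": 0.25}  # weights sum to 1

print(weighted_ensemble(predictions, weights))  # [102.5, 197.5]
```

Giving the stronger model a larger weight lets the ensemble lean on it while still using the other model to correct some of its errors.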

# Make Predictions

In [9]:
test_pred = predictor.predict(test_data)



# Submission

In [10]:
submission['price'] = test_pred
submission.to_csv('submission.csv', index=False)