# Kolosal AutoML Tutorial: Basic AutoML 🚀

Welcome! In this notebook, we demonstrate how to use **Kolosal AutoML** to build a regression model with minimal effort. Using the California Housing dataset as an example, we walk through the entire pipeline: loading data, training a model, evaluating it, generating a performance report, and inspecting feature importance.


## 📥 Step 1: Load the Dataset

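We load the California Housing dataset with `fetch_california_housing` from `sklearn.datasets`, which downloads the data on first use and caches it locally. The dataset describes California districts, with features such as median income, house age, and average rooms. The target is the median house value, expressed in units of $100,000. In the same cell we rename the target column from `MedHouseVal` to `target`.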


In [1]:
from sklearn.datasets import fetch_california_housing


housing = fetch_california_housing(as_frame=True)
df = housing.frame
df.rename(columns={'MedHouseVal': 'target'}, inplace=True)

## 🧹 Step 2: Prepare Features and Target Variable

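With the target column already renamed to `target` in Step 1, we separate the features (`X`) from the target (`y`), as is standard for supervised learning tasks. Calling `.values` converts both to NumPy arrays, which drops the column names.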


In [2]:
X = df.drop(columns=['target']).values
y = df['target'].values
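
Because `.values` discards column names, the report and explainability output later in this notebook label features as `feature_0` to `feature_7`. Here is a minimal sketch for mapping those labels back to names. It assumes the engine keeps the column order of `X`, which the notebook does not confirm. The list is the column order of the California Housing feature frame in scikit-learn, and `label_to_name` is a hypothetical helper, not part of Kolosal AutoML.

```python
# Column order of the California Housing feature frame in scikit-learn.
FEATURE_NAMES = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]


def label_to_name(label: str) -> str:
    """Translate a positional label such as 'feature_5' into a column name.

    Hypothetical helper: assumes the label index matches the column order of X.
    """
    index = int(label.rsplit("_", 1)[1])
    return FEATURE_NAMES[index]


print(label_to_name("feature_0"))  # MedInc
print(label_to_name("feature_5"))  # AveOccup
```

In the notebook itself, `list(df.drop(columns=['target']).columns)` yields the same list without hard-coding it.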

## ⚙️ Step 3: Configure Kolosal AutoML

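We import the Kolosal AutoML training engine and its configuration class, create a configuration for a regression task, and pass it to the engine. As the log output shows, constructing the engine also sets up MLflow experiment tracking and the data preprocessor.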


In [3]:
from kolosal_automl.modules.engine.train_engine import MLTrainingEngine
from kolosal_automl.modules.configs import MLTrainingEngineConfig, TaskType

config = MLTrainingEngineConfig(task_type="regression")
trainer = MLTrainingEngine(config)

2025-05-20 15:46:43,432 - INFO - Experiment_1747730803 - MLflow tracking configured with experiment: experiment_1747730803
2025-05-20 15:46:43,432 - INFO - MLTrainingEngine - Experiment tracking enabled
2025-05-20 15:46:43,432 - kolosal_automl.modules.engine.data_preprocessor.DataPreprocessor - INFO - Parallel processing initialized with 20 workers
2025-05-20 15:46:43,432 - kolosal_automl.modules.engine.data_preprocessor.DataPreprocessor - INFO - DataPreprocessor initialized with version 1.0.0
2025-05-20 15:46:43,432 - INFO - MLTrainingEngine - Data preprocessor initialized
2025-05-20 15:46:43,448 - INFO - MLTrainingEngine - Batch processor initi

2025-05-20 15:48:55,095 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - CPU usage (0.0%) below threshold, disabling throttling


## 🏋️ Step 4: Train the Model

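Next, we train the model by calling the trainer's `train_model()` method. Kolosal AutoML handles model setup and hyperparameter tuning automatically. In this run, the engine initialized a random forest regressor, searched its hyperparameters with HyperOptX (up to 50 iterations, stopping early at iteration 10), and then scored the best configuration on held-out validation data.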


In [4]:
trainer.train_model(X=X, y=y)

2025-05-20 15:46:44,092 - INFO - MLTrainingEngine - Initialized random_forest model for regression
2025-05-20 15:46:44,092 - INFO - Experiment_1747730803 - Started experiment 1747730803
2025-05-20 15:46:44,092 - INFO - Experiment_1747730803 - Configuration:
{
  "task_type": "regression",
  "random_state": 42,
  "n_jobs": -1,
  "verbose": 1,
  "cv_folds": 5,
  "test_size": 0.2,
  "stratify": true,
  "optimization_strategy": "hyper_optimization_x",
  "optimization_iterations": 50,
  "optimization_timeout": null,
  "early_stopping": true,
  "early_stopping_rounds": 10,
  "early_stopping_metric": null,
  "feature_selection": true,
  "feature_selection_method": "mutual_info",
  "feature_selection_k": null,
  "feature_importance_threshold": 0.01,
  "model_path": "./models",
  "model_registry_url": null,
  "auto_version_models": true,
  "experiment_tracking": true,
  "experiment_tracking_platform": "mlflow",
  "experiment_tracking_config": {},
  "use_intel_optimization": true,
  "use_gpu": tr

Problem features: {'n_samples': 16512, 'n_features': 8, 'density': 1.0, 'y_mean': np.float64(2.071946937378876), 'y_std': np.float64(1.1561912522543263), 'y_skew': np.float64(0.9764422910936056), 'n_params': 3, 'n_categorical': 3, 'n_numerical': 0}
Selected optimization strategy: bayesian
Initializing HyperOptX with 50 maximum iterations
Phase 1: Initial exploration with 5 configurations
Phase 2: Iterative optimization with 45 iterations
  Iteration 6/50: budget=0.33, evaluating 1 candidates
Early stopping at iteration 10


Hyperparameter optimization: 100%|██████████| 100/100 [01:48<00:00,  1.08s/%]
2025-05-20 15:48:33,310 - INFO - Experiment_1747730803 - Metrics for optimization: {'mean_cv_score': np.float64(-0.2819377876408739), 'std_cv_score': np.float64(0.006574162737796381), 'optimization_time': 108.18300342559814}
2025-05-20 15:48:33,325 - INFO - MLTrainingEngine - Best CV score: -0.2819 ± 0.0066
2025-05-20 15:48:33,325 - INFO - MLTrainingEngine - Best parameters: {'n_estimators': 300, 'min_samples_split': 2, 'max_depth': 30}
2025-05-20 15:48:33,325 - INFO - MLTrainingEngine - Evaluating model on validation data...



Optimization completed:
Total iterations: 10
Best score: -0.279177
Best parameters: {'n_estimators': 300, 'min_samples_split': 2, 'max_depth': 30}
Time elapsed: 108.18 seconds


2025-05-20 15:48:33,671 - INFO - Experiment_1747730803 - Metrics for validation: {'prediction_time': 0.34542036056518555, 'mse': 0.2538600144483563, 'rmse': np.float64(0.5038452286648711), 'mae': 0.32676373138054576, 'r2': 0.8062742100644491, 'explained_variance': 0.8063865989643, 'median_absolute_error': np.float64(0.19849222222222085), 'mape': np.float64(18.845053137458656)}
2025-05-20 15:48:33,717 - INFO - MLTrainingEngine - Validation prediction_time: 0.3454
2025-05-20 15:48:33,717 - INFO - MLTrainingEngine - Validation mse: 0.2539
2025-05-20 15:48:33,719 - INFO - MLTrainingEngine - Validation rmse: 0.5038
2025-05-20 15:48:33,719 - INFO - MLTrainingEngine - Validation mae: 0.3268
2025-05-20 15:48:33,719 - INFO - MLTrainingEngine - Validation r2: 0.8063
2025-05-20 15:48:33,719 - INFO - MLTrainingEngine - Validation explained_variance: 0.8064
2025-05-20 15:48:33,719 - INFO - MLTrainingEngine - Validation median_absolute_error: 0.1985
2025-05-20 15:48:33,719 - INFO - MLTrainingEngine 

{'model_name': 'random_forest_1747730804',
 'model': RandomForestRegressor(max_depth=30, n_estimators=300),
 'params': {'n_estimators': 300, 'min_samples_split': 2, 'max_depth': 30},
 'metrics': {'prediction_time': 0.34542036056518555,
  'mse': 0.2538600144483563,
  'rmse': np.float64(0.5038452286648711),
  'mae': 0.32676373138054576,
  'r2': 0.8062742100644491,
  'explained_variance': 0.8063865989643,
  'median_absolute_error': np.float64(0.19849222222222085),
  'mape': np.float64(18.845053137458656)},
 'feature_importance': array([0.52573105, 0.05421078, 0.04456671, 0.02965091, 0.03058678,
        0.13816233, 0.08855102, 0.08854042]),
 'training_time': 114.39941763877869}
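
Two numbers in the log above are easy to misread: the sample count and the negative CV score. The following is a short arithmetic sketch, assuming the standard 20,640-row California Housing dataset. It also assumes the CV score follows scikit-learn's negated-error convention. The log does not name the metric, so treating it as a negative MSE is a guess.

```python
import math

N_ROWS = 20640   # rows in the California Housing dataset
TEST_SIZE = 0.2  # "test_size" from the logged configuration

# Hold-out split: the log reports n_samples = 16512 for the training part.
n_test = round(N_ROWS * TEST_SIZE)
n_train = N_ROWS - n_test
print(n_train, n_test)  # 16512 4128

# Best CV score from the log. If it is a negated MSE (assumption),
# flipping the sign and taking the square root gives a CV RMSE.
best_cv_score = -0.2819377876408739
cv_rmse = math.sqrt(-best_cv_score)
print(f"CV RMSE (if the score is -MSE): {cv_rmse:.3f}")
```

Under that reading, the cross-validated RMSE is close to the validation RMSE of 0.5038 logged above, which would be consistent with the two scores measuring the same kind of error.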

## 📊 Step 5: Evaluate the Model

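Once training is complete, we call `evaluate_model()`, which scores the trained model on the cached test split and returns the metrics as a dictionary. The MSE, RMSE, MAE, and R² here are identical to the validation metrics logged during training, which suggests that both steps use the same held-out data.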


In [5]:
trainer.evaluate_model()

2025-05-20 15:48:38,514 - INFO - MLTrainingEngine - Using cached test data for evaluation
2025-05-20 15:48:38,514 - INFO - MLTrainingEngine - Evaluating model: random_forest_1747730804
2025-05-20 15:48:38,822 - INFO - MLTrainingEngine - Evaluation results for random_forest_1747730804:
2025-05-20 15:48:38,822 - INFO - MLTrainingEngine -   prediction_time: 0.3081
2025-05-20 15:48:38,822 - INFO - MLTrainingEngine -   mse: 0.2539
2025-05-20 15:48:38,822 - INFO - MLTrainingEngine -   rmse: 0.5038
2025-05-20 15:48:38,822 - INFO - MLTrainingEngine -   mae: 0.3268
2025-05-20 15:48:38,822 - INFO - MLTrainingEngine -   r2: 0.8063


{'prediction_time': 0.3080747127532959,
 'mse': 0.2538600144483563,
 'rmse': np.float64(0.5038452286648711),
 'mae': 0.32676373138054576,
 'r2': 0.8062742100644491}
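
The California Housing target is expressed in units of $100,000, so the error metrics above can be read as dollar amounts. Below is a small sketch of that conversion, with the metric values copied from the output above.

```python
import math

UNIT_USD = 100_000  # MedHouseVal is expressed in hundreds of thousands of dollars

metrics = {"mse": 0.2538600144483563, "mae": 0.32676373138054576}

# RMSE is the square root of MSE, so it is in the same units as the target.
rmse = math.sqrt(metrics["mse"])

rmse_usd = rmse * UNIT_USD
mae_usd = metrics["mae"] * UNIT_USD
print(f"RMSE ≈ ${rmse_usd:,.0f}, MAE ≈ ${mae_usd:,.0f}")
# RMSE ≈ $50,385, MAE ≈ $32,676
```

In other words, a typical prediction in this run is off by roughly $33,000 to $50,000, depending on which error measure you prefer.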

## 📝 Step 6: Generate and Display Performance Report

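Kolosal AutoML can generate a Markdown report summarizing the configuration, model performance, hyperparameters, and feature importances. The report is saved under the model directory (`./models\model_report.md` in this run), and we display it directly in the notebook using `IPython.display.Markdown`.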


In [9]:
from IPython.display import Markdown, display


report = trainer.generate_report()
display(Markdown(report))

2025-05-20 15:53:34,458 - INFO - MLTrainingEngine - Report generated: ./models\model_report.md


# ML Training Engine Report

Generated on: 2025-05-20 15:53:34

## Configuration

| Parameter | Value |
| --- | --- |
| random_state | 42 |
| n_jobs | -1 |
| verbose | 1 |
| cv_folds | 5 |
| test_size | 0.2 |
| stratify | True |
| optimization_iterations | 50 |
| optimization_timeout | None |
| early_stopping | True |
| early_stopping_rounds | 10 |
| early_stopping_metric | None |
| feature_selection | True |
| feature_selection_method | mutual_info |
| feature_selection_k | None |
| feature_importance_threshold | 0.01 |
| model_path | ./models |
| model_registry_url | None |
| auto_version_models | True |
| experiment_tracking | True |
| experiment_tracking_platform | mlflow |
| use_intel_optimization | True |
| use_gpu | True |
| gpu_memory_fraction | 0.8 |
| memory_optimization | True |
| enable_distributed | False |
| distributed_strategy | mirrored |
| checkpointing | True |
| checkpoint_interval | 10 |
| checkpoint_path | ./models\checkpoints |
| enable_pruning | False |
| auto_ml_time_budget | None |
| ensemble_models | False |
| ensemble_method | stacking |
| ensemble_size | 3 |
| hyperparameter_tuning_cv | True |
| auto_save | True |
| auto_save_on_shutdown | True |
| save_state_on_shutdown | False |
| load_best_model_after_train | True |
| enable_quantization | False |
| enable_model_compression | False |
| compression_method | pruning |
| compute_permutation_importance | True |
| generate_feature_importance_report | True |
| generate_model_summary | True |
| generate_prediction_explanations | False |
| enable_model_export | True |
| auto_deploy | False |
| deployment_platform | None |
| log_level | INFO |
| debug_mode | False |
| enable_telemetry | False |
| backend | sklearn |
| enable_data_validation | True |
| enable_security | False |
| optimization_metric | None |

## Model Performance Summary

| Model | mae | mse | prediction_time | r2 | rmse |
| --- | --- | --- | --- | --- | --- |
| random_forest_1747730804 **[BEST]** | 0.3268 | 0.2539 | 0.3081 | 0.8063 | 0.5038 |

## Model Details

### random_forest_1747730804

- **Type**: RandomForestRegressor
- **Best Model**: Yes
- **Training Time**: 109.23s

#### Hyperparameters

| Parameter | Value |
| --- | --- |
| n_estimators | 300 |
| min_samples_split | 2 |
| max_depth | 30 |


#### Top 10 Features by Importance

| Feature | Importance |
| --- | --- |
| feature_0 | 0.5257 |
| feature_5 | 0.1382 |
| feature_6 | 0.0886 |
| feature_7 | 0.0885 |
| feature_1 | 0.0542 |
| feature_2 | 0.0446 |
| feature_4 | 0.0306 |
| feature_3 | 0.0297 |


## Conclusion

The best performing model is **random_forest_1747730804** with metrics:

- **rmse**: 0.5038
- **mse**: 0.2539
- **r2**: 0.8063
- **mae**: 0.3268


## 🧠 Step 7: Model Explainability

Understanding how the model makes predictions is crucial. Kolosal AutoML provides built-in explainability features that help interpret which features influenced the model most.

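The following command returns a dictionary containing the explanation method, per-feature importance scores, the path of a saved plot, and the execution time. The method may vary with the backend; in this run it was `drop_column`. Its scores are on a different scale from the importances in the report, but both rank `feature_0` and `feature_5` highest.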


In [7]:
trainer.generate_explainability()



{'method': 'drop_column',
 'importance': {'feature_0': 0.038615729232270524,
  'feature_5': 0.02378078257916383,
  'feature_7': 0.021539624453945527,
  'feature_6': 0.02075503177011595,
  'feature_3': 0.018068993612751494,
  'feature_1': 0.017734990080649937,
  'feature_4': 0.01650550718542887,
  'feature_2': 0.015770797993822305},
 'plot_path': './models\\explanations\\drop_column_importance_random_forest_1747730804.png',
 'execution_time': 6.9282660484313965}
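
To see how far the two importance measures agree, we can compare their rankings. The sketch below uses values copied (and rounded) from the report in Step 6 and from the output above. `ranking` is a hypothetical helper written for this comparison, not a Kolosal AutoML function.

```python
# Importances copied from the report (Step 6) and from the drop-column output above.
report_importance = {
    "feature_0": 0.5257, "feature_5": 0.1382, "feature_6": 0.0886,
    "feature_7": 0.0885, "feature_1": 0.0542, "feature_2": 0.0446,
    "feature_4": 0.0306, "feature_3": 0.0297,
}
drop_column_importance = {
    "feature_0": 0.0386, "feature_5": 0.0238, "feature_7": 0.0215,
    "feature_6": 0.0208, "feature_3": 0.0181, "feature_1": 0.0177,
    "feature_4": 0.0165, "feature_2": 0.0158,
}


def ranking(scores):
    """Return feature labels sorted from most to least important."""
    return sorted(scores, key=scores.get, reverse=True)


report_rank = ranking(report_importance)
drop_rank = ranking(drop_column_importance)

# Features that both methods place in their top three.
shared_top3 = set(report_rank[:3]) & set(drop_rank[:3])
print(report_rank[:3], drop_rank[:3], sorted(shared_top3))
```

Both methods agree on the top two features, while the order of the lower-ranked features differs. That is unsurprising given how close their scores are.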

## ✅ Summary

In this notebook, we have:
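- Loaded the California Housing dataset and separated features from the target
- Configured Kolosal AutoML for regression
- Trained and evaluated a model, and generated a performance report
- Inspected feature importance through the explainability output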

You can now try this with your own dataset by simply replacing the input features and target variable. Happy automating! 🔁
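
As a starting point for your own data, here is a minimal sketch that splits a numeric CSV into features and target using only the standard library. `split_features_target` and the sample table are hypothetical and exist only for illustration.

```python
import csv
import io


def split_features_target(csv_text, target_column):
    """Parse numeric CSV text into a feature matrix X and a target vector y."""
    reader = csv.DictReader(io.StringIO(csv_text))
    feature_columns = [c for c in reader.fieldnames if c != target_column]
    X, y = [], []
    for row in reader:
        X.append([float(row[c]) for c in feature_columns])
        y.append(float(row[target_column]))
    return X, y, feature_columns


# Tiny made-up table, for illustration only.
sample = "rooms,age,price\n3,10,1.5\n4,5,2.0\n"
X, y, names = split_features_target(sample, target_column="price")
print(names, X, y)
```

Wrapping `X` and `y` in `numpy.asarray` gives arrays of the kind passed to `train_model` in Step 4. Whether the engine also accepts other input types is not shown in this notebook.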
