# Kolosal AutoML Tutorial: Basic AutoML 🚀

Welcome! In this notebook, we'll demonstrate how to use **Kolosal AutoML** to build a regression model with minimal effort. We'll use the California Housing dataset as an example and walk through the entire pipeline: loading data, training a model, evaluating it, and generating a performance report.


## 📥 Step 1: Load the Dataset

We load the California Housing dataset with `fetch_california_housing` from `sklearn.datasets`. This built-in dataset describes California districts (census block groups) with features such as median income, median house age, and average number of rooms. Our target is the median house value, which we rename from `MedHouseVal` to `target`.


In [7]:
from sklearn.datasets import fetch_california_housing


housing = fetch_california_housing(as_frame=True)
df = housing.frame
df.rename(columns={'MedHouseVal': 'target'}, inplace=True)

## 🧹 Step 2: Prepare Features and Target Variable

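With the target column already renamed to `target` in Step 1, we separate the features (`X`) from the target (`y`), which is standard practice for supervised learning tasks. Calling `.values` converts both to NumPy arrays and drops the column names, which is why the report later labels features as `feature_0`, `feature_1`, and so on.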


In [8]:
X = df.drop(columns=['target']).values
y = df['target'].values

## ⚙️ Step 3: Configure Kolosal AutoML

We import the Kolosal AutoML training engine and its configuration. We then create a configuration for a regression task.


In [None]:
from kolosal_automl.modules.engine.train_engine import MLTrainingEngine
from kolosal_automl.modules.configs import MLTrainingEngineConfig, TaskType

config = MLTrainingEngineConfig(
    task_type=TaskType.REGRESSION,
    enable_jit_compilation=True,      # Enable JIT compilation
    enable_mixed_precision=True,      # Enable mixed precision training
    enable_adaptive_hyperopt=True,    # Enable advanced hyperparameter optimization
    enable_streaming=True,            # Enable streaming for large datasets
)
trainer = MLTrainingEngine(config)

2025-08-06 20:06:48,926 - INFO - MLTrainingEngine - Experiment tracking enabled
2025-08-06 20:06:48,928 - INFO - MLTrainingEngine - Data preprocessor skipped for fast initialization
2025-08-06 20:06:48,928 - INFO - kolosal_automl.modules.engine.batch_processor - Memory-aware batch processing enabled
2025-08-06 20:06:48,930 - INFO - MLTrainingEngine - Batch processor initialized
2025-08-06 20:06:48,932 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - Advanced resource management configured with 4 threads
2025-08-06 20:06:50,248 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - SIMD optimizer initialized
2025-08-06 20:06:50,250 - INFO - kolosal_automl.modules.engine.quantizer - Quantizer initialized with QuantizationType.INT8 type and QuantizationMode.DYNAMIC mode
2025-08-06 20:06:50,251 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - Quantizer initialized with QuantizationType.INT8 type
2025-08-06 20:06:50,252 - 

2025-08-06 20:07:00,566 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - CPU usage (0.0%) below threshold, disabling throttling
2025-08-06 20:07:41,057 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - CPU usage (0.0%) below threshold, disabling throttling


## 🏋️ Step 4: Train the Model

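Now we train the model by calling the trainer's `train_model()` method. Kolosal AutoML handles model setup and hyperparameter tuning automatically; in this run, the log shows that it initialized a random forest regressor and searched a minimal parameter grid.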


In [10]:
trainer.train_model(X=X, y=y)

2025-08-06 20:06:50,450 - INFO - MLTrainingEngine - Applied mixed precision optimization to training data
2025-08-06 20:06:50,451 - INFO - MLTrainingEngine - Initialized random_forest model for regression
2025-08-06 20:06:50,453 - INFO - MLTrainingEngine - Using minimal parameter grid for randomforest due to quick training mode or large dataset
2025-08-06 20:06:50,456 - INFO - Experiment_1754485608 - Started experiment 1754485608
2025-08-06 20:06:50,458 - INFO - Experiment_1754485608 - Configuration:
{
  "task_type": "regression",
  "random_state": 42,
  "n_jobs": -1,
  "verbose": 1,
  "cv_folds": 5,
  "test_size": 0.2,
  "stratify": true,
  "optimization_strategy": "hyper_optimization_x",
  "optimization_iterations": 50,
  "optimization_timeout": null,
  "early_stopping": true,
  "early_stopping_rounds": 10,
  "early_stopping_metric": null,
  "feature_selection": true,
  "feature_selection_method": "mutual_info",
  "feature_selection_k": null,
  "feature_importance_threshold": 0.01,
 

{'model_name': 'random_forest_1754485610',
 'model': RandomForestRegressor(),
 'params': {'max_depth': None, 'n_estimators': 100},
 'metrics': {'mse': 0.2775370023300599,
  'rmse': 0.5268178075293772,
  'mae': 0.3463221171964883,
  'r2': 0.7988844626014588},
 'feature_importance': array([0.52755913, 0.05681866, 0.04423055, 0.0295223 , 0.03098977,
        0.13485737, 0.08799703, 0.08802518]),
 'training_time': 45.43019700050354}
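
As a quick sanity check on the numbers above, the sketch below confirms that the reported `rmse` is the square root of `mse` and converts it to dollars. It assumes scikit-learn's convention that `MedHouseVal` is expressed in units of $100,000; the variable names are illustrative and not part of Kolosal AutoML.

```python
import math

# Metrics copied from the train_model() output above.
mse = 0.2775370023300599
rmse = 0.5268178075293772

# RMSE is the square root of MSE, so the two reported values should agree.
assert math.isclose(math.sqrt(mse), rmse, rel_tol=1e-6)

# Assumption: the target is in units of $100,000 (scikit-learn's convention
# for the California Housing dataset).
rmse_dollars = rmse * 100_000
print(f"Typical prediction error: about ${rmse_dollars:,.0f}")
```

Under that assumption, an RMSE of about 0.53 corresponds to an error of roughly $53,000 in median house value per district.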

## 📊 Step 5: Evaluate the Model

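Once training is complete, we call `evaluate_model()`. As the log below shows, it scores the model on the test data that the engine stored during training (the configuration above lists `test_size` as 0.2).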


In [11]:
trainer.evaluate_model()

2025-08-06 20:07:35,878 - INFO - MLTrainingEngine - Using stored test data for evaluation


{'model_name': 'random_forest_1754485610',
 'metrics': {'mse': 0.2655312019575462,
  'rmse': 0.5152971977000711,
  'mae': 0.336204313396318,
  'r2': 0.7973676872131961,
  'prediction_time': 0.08065199851989746},
 'detailed': False}
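
To make the metric names concrete, here is a minimal pure-Python sketch of how `mse`, `rmse`, `mae`, and `r2` are conventionally defined. The input values are made up, and the helper `regression_metrics` is hypothetical; it is not Kolosal AutoML's implementation, which may differ in details.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute standard regression metrics for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n   # mean squared error
    mae = sum(abs(e) for e in errors) / n  # mean absolute error

    # R^2 compares the model's squared error with that of always
    # predicting the mean of y_true.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

# Toy values for illustration only.
metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(metrics)
```

Read this way, the `r2` of about 0.80 reported above means the model's squared error on the test data is about 20% of what a constant mean prediction would produce.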

## 📝 Step 6: Generate and Display Performance Report

Kolosal AutoML can generate a Markdown report summarizing the configuration, model performance metrics, and feature importances. We render it directly in the notebook using `IPython.display.Markdown`.


In [12]:
from IPython.display import Markdown, display


report = trainer.generate_report()
display(Markdown(report))

# ML Training Engine Report

**Generated**: 2025-08-06 20:07:35

**Total Models Trained**: 1
**Best Model**: random_forest_1754485610

## Configuration

- **Task Type**: regression
- **Random State**: 42
- **CV Folds**: 5
- **Optimization Strategy**: OptimizationStrategy.HYPERX

## Model Performance Comparison

| Model Name | Type | Training Time (s) | Best | Metrics |
|------------|------|------------------|------|----------|
| random_forest_1754485610 | RandomForestRegressor | 44.92 | ✅ | mse: 0.2775, rmse: 0.5268, mae: 0.3463 |

## Detailed Model Information

### random_forest_1754485610

**Model Type**: RandomForestRegressor

**Parameters**:
- max_depth: None
- n_estimators: 100

**Performance Metrics**:
- mse: 0.277537
- rmse: 0.526818
- mae: 0.346322
- r2: 0.798884

**Top 10 Feature Importance**:
1. feature_0: 0.5276
2. feature_5: 0.1349
3. feature_7: 0.0880
4. feature_6: 0.0880
5. feature_1: 0.0568
6. feature_2: 0.0442
7. feature_4: 0.0310
8. feature_3: 0.0295

---
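
Because `X` was passed as a plain NumPy array, the report labels features by position. The sketch below maps those positions back to column names, assuming the standard column order of scikit-learn's California Housing frame (`MedInc`, `HouseAge`, `AveRooms`, `AveBedrms`, `Population`, `AveOccup`, `Latitude`, `Longitude`). The importance values are copied from the training output in Step 4.

```python
# Assumption: column order of fetch_california_housing(as_frame=True).frame
# with the target column removed.
feature_names = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]

# Values copied from the 'feature_importance' array in the training output.
importances = [
    0.52755913, 0.05681866, 0.04423055, 0.0295223,
    0.03098977, 0.13485737, 0.08799703, 0.08802518,
]

# Pair each name with its importance and sort from most to least important.
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)

for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {score:.4f}")
```

In the notebook itself, `df.drop(columns=['target']).columns` yields the same names without hard-coding them. Under this mapping, median income (`feature_0`) accounts for roughly half of the total importance.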



## ✅ Summary

In this notebook, we have:
- Loaded the California Housing dataset and separated the features from the target
- Configured and used Kolosal AutoML for regression
- Trained, evaluated, and reported on model performance

You can now try this with your own dataset by simply replacing the input features and target variable. Happy automating! 🔁
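
As a starting point for your own data, the sketch below prepares `X` and `y` from a CSV in the same way as Step 2. The column names and the inline CSV are made up for illustration; in practice you would pass a file path to `pd.read_csv` and substitute your own target column.

```python
import io

import pandas as pd

# Hypothetical data standing in for your own CSV file.
csv_text = """sqft,bedrooms,age,price
1400,3,20,240.0
1600,3,15,275.5
1100,2,35,180.0
2000,4,5,350.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical target column; replace with the column you want to predict.
target_column = "price"

# Same separation as Step 2: features as a 2-D array, target as a 1-D array.
X = df.drop(columns=[target_column]).values
y = df[target_column].values

print(X.shape, y.shape)
```

The resulting arrays can then be passed to `trainer.train_model(X=X, y=y)` as in Step 4.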
