# 🚀 Kolosal AutoML Tutorial: Optimize Your Configuration

This tutorial introduces the key modules from the `kolosal_automl` library that enable automated machine learning with a hardware-aware configuration.

### Components
- **`MLTrainingEngine`**: Manages model selection, training, evaluation, and reporting.
- **`DeviceOptimizer`**: Detects available hardware (CPU/GPU) and suggests the best configuration for performance.

### Outcome
By the end of the next section, all necessary classes are imported and ready for instantiation.


# 📦 Import Essential Libraries

This cell imports the core components from the Kolosal AutoML library needed for our machine learning pipeline.

### Imported Components:
- **`MLTrainingEngine`**: The main engine that handles model training, evaluation, and management
- **`DeviceOptimizer`**: Automatically detects and optimizes hardware configuration (CPU/GPU)
- **`MLTrainingEngineConfig`**: Configuration class for training parameters
- **`TaskType`**: Enum for specifying machine learning task types (regression, classification)

### Why These Imports?
These modules provide the foundation for automated machine learning with optimal hardware utilization.
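A comment in the configuration cell further down notes that the task type can be given either as a string or as an enum member. The sketch below shows one common way to support both in plain Python. `DemoTaskType` and `normalize_task_type` are hypothetical names made up for illustration; they are not part of `kolosal_automl`, and the library's own `TaskType` may be implemented differently.

```python
from enum import Enum


class DemoTaskType(str, Enum):
    """Hypothetical stand-in for a task-type enum (not the library's class)."""
    REGRESSION = "regression"
    CLASSIFICATION = "classification"


def normalize_task_type(value):
    """Accept an enum member or its string value and return the enum member."""
    if isinstance(value, DemoTaskType):
        return value
    # Enum lookup by value raises ValueError for unknown task names
    return DemoTaskType(str(value).lower())


print(normalize_task_type("regression"))
print(normalize_task_type(DemoTaskType.CLASSIFICATION))
```

Normalizing once at the boundary means the rest of the code only ever deals with enum members, and a misspelled task name fails early with a `ValueError`.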

In [1]:
from kolosal_automl.modules.engine.train_engine import MLTrainingEngine
from kolosal_automl.modules.device_optimizer import DeviceOptimizer
from kolosal_automl.modules.configs import MLTrainingEngineConfig, TaskType



# ⚙️ Training Engine Configuration

We now configure the AutoML pipeline to optimize for regression tasks using your machine’s hardware resources.

### Steps
- **Device Optimization**: Automatically detect and apply the best hardware configuration using `DeviceOptimizer`.
- **Task Type**: Set to `regression` to predict continuous numerical values.
- **Engine Initialization**: Instantiate `MLTrainingEngine` with the optimal config.

### Outcome
An `MLTrainingEngine` object is ready to run regression training using an optimized setup.
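To give a feel for what hardware detection involves, here is a minimal standard-library sketch. It is not how `DeviceOptimizer` works internally; `suggest_n_jobs` and its `reserve` parameter are hypothetical. In the run recorded below, the log reports 20 logical cores and the configuration shows `"n_jobs": 18`, which happens to match "logical cores minus two", but the rule the library actually applies is not shown in this tutorial.

```python
import os


def suggest_n_jobs(logical_cores, reserve=2):
    """Hypothetical heuristic: keep `reserve` cores free, never go below 1."""
    return max(1, logical_cores - reserve)


# os.cpu_count() returns the number of logical cores, or None if unknown
detected = os.cpu_count() or 1
print(f"Logical cores detected: {detected}")
print(f"Suggested parallel workers: {suggest_n_jobs(detected)}")
```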


In [2]:
# Create device optimizer with performance optimization mode
optimizer = DeviceOptimizer(optimization_mode="performance")

# Get optimal training engine configuration
training_config = optimizer.get_optimal_training_engine_config()

# Set the task type to regression (configuration supports both string and enum)
training_config.task_type = TaskType.REGRESSION

# Create the training engine with the optimized configuration
training_engine = MLTrainingEngine(training_config)

2025-08-06 20:43:18,789 - INFO - cpu_device_optimizer - System Overview: Laptop-Evint (Windows 10)
2025-08-06 20:43:18,790 - INFO - cpu_device_optimizer - Environment: cloud
2025-08-06 20:43:18,790 - INFO - cpu_device_optimizer - Optimization Mode: performance
2025-08-06 20:43:18,791 - INFO - cpu_device_optimizer - --------------------------------------------------
2025-08-06 20:43:18,791 - INFO - cpu_device_optimizer - CPU: Intel64 Family 6 Model 154 Stepping 3, GenuineIntel
2025-08-06 20:43:18,791 - INFO - cpu_device_optimizer - CPU Cores: 14 physical, 20 logical
2025-08-06 20:43:18,792 - INFO - cpu_device_optimizer - CPU Freq (MHz): Current=2300, Min=0, Max=2300
2025-08-06 20:43:18,793 - INFO - cpu_device_optimizer - CPU Features: AVX=False, AVX2=False, AVX512=False, SSE4=False, FMA=False, NEON=False
2025-08-06 20:43:18,794 - INFO - cpu_device_optimizer - Memory: 63.67 GB total, 57.30 GB usable, 33.63 GB available
2025-08-06 20:43:18,794 - INFO - cpu_device_optimizer - Swap Memory: 

2025-08-06 20:43:31,744 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - CPU usage (0.0%) below threshold, disabling throttling


# 🧪 Synthetic Dataset Generation

To test our AutoML pipeline, we create a clean, synthetic dataset for regression modeling.

### Details
- **Method**: `make_regression` from scikit-learn.
- **Size**: 1,000 samples × 20 features.
- **Noise**: Low (Gaussian noise with standard deviation 0.1), so the target is almost an exact linear function of the features.

### Outcome
- `X`: Feature matrix (1000 × 20)
- `y`: Target vector (1000,)
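The sketch below shows, in plain Python, the kind of data such a generator produces: Gaussian features, a linear target, and a small amount of Gaussian noise. It is a simplified stand-in, not scikit-learn's implementation, and `make_linear_dataset` is a hypothetical helper. It assumes that only some features carry signal, mirroring `make_regression`, whose `n_informative` parameter defaults to 10.

```python
import random


def make_linear_dataset(n_samples=1000, n_features=20, n_informative=10,
                        noise=0.1, seed=0):
    """Simplified illustration of a synthetic linear regression dataset."""
    rng = random.Random(seed)
    # Only the first n_informative features get a non-zero weight
    weights = [rng.uniform(0, 100) if j < n_informative else 0.0
               for j in range(n_features)]
    X = [[rng.gauss(0, 1) for _ in range(n_features)]
         for _ in range(n_samples)]
    # Target = weighted sum of features + Gaussian noise with std dev `noise`
    y = [sum(w * x for w, x in zip(weights, row)) + rng.gauss(0, noise)
         for row in X]
    return X, y, weights


X_demo, y_demo, w_demo = make_linear_dataset()
print(len(X_demo), len(X_demo[0]), len(y_demo))
```

Because half of the features have zero weight here, a model trained on such data would be expected to assign most of its importance to a handful of columns.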


In [3]:
from sklearn.datasets import make_regression

# Generate 1,000 samples with 20 features and low Gaussian noise
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1)

# 🚀 Model Training Execution

With the dataset and engine ready, we now execute the automated training workflow.

### What Happens Under the Hood
- Preprocessing: Scaling, transformation, or encoding (if needed)
- Model Selection: Initializes the regression model (a random forest in this run, according to the log) and tunes its hyperparameters
- Training: Uses the best model and trains with optimal config
- Evaluation: Computes metrics (e.g., MAE, RMSE) internally

### Outcome
An optimized regression model is trained and stored in `training_engine`, ready for evaluation and prediction.
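The configuration logged by the training cell includes `"test_size": 0.2`, `"cv_folds": 5`, and `"random_state": 42`. The sketch below illustrates what those numbers mean for 1,000 samples: a hold-out test split followed by 5-fold cross-validation on the remainder. How the engine performs its splits internally is not shown in this tutorial, so `holdout_split` and `k_fold` are hypothetical helpers written only to make the arithmetic concrete.

```python
import random


def holdout_split(n_samples, test_size=0.2, seed=42):
    """Shuffle sample indices and reserve a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k contiguous folds."""
    base, extra = divmod(len(indices), k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


train_idx, test_idx = holdout_split(1000)
folds = list(k_fold(train_idx, k=5))
print(len(train_idx), len(test_idx))
print([len(validation) for _, validation in folds])
```

With these settings, 800 samples are available for tuning, each cross-validation fold validates on 160 of them, and 200 samples are held back for the final evaluation.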


In [4]:
training_engine.train_model(X, y, model_name="my_model")

2025-08-06 20:43:21,744 - INFO - MLTrainingEngine - Initialized random_forest model for regression
2025-08-06 20:43:21,745 - INFO - Experiment_1754487798 - Started experiment 1754487798
2025-08-06 20:43:21,746 - INFO - Experiment_1754487798 - Configuration:
{
  "task_type": "regression",
  "random_state": 42,
  "n_jobs": 18,
  "verbose": 0,
  "cv_folds": 5,
  "test_size": 0.2,
  "stratify": true,
  "optimization_strategy": "hyper_optimization_x",
  "optimization_iterations": 75,
  "optimization_timeout": null,
  "early_stopping": true,
  "early_stopping_rounds": 10,
  "early_stopping_metric": null,
  "feature_selection": false,
  "feature_selection_method": "mutual_info",
  "feature_selection_k": null,
  "feature_importance_threshold": 0.01,
  "model_path": "model_registry\\training_models",
  "model_registry_url": null,
  "auto_version_models": true,
  "experiment_tracking": true,
  "experiment_tracking_platform": "mlflow",
  "experiment_tracking_config": {},
  "use_intel_optimization

{'model_name': 'my_model',
 'model': RandomForestRegressor(),
 'params': {'estimator': RandomForestRegressor(),
  'param_space': {'n_estimators': [100, 200],
   'max_depth': [None, 10],
   'min_samples_split': [2]},
  'max_iter': 15,
  'cv': 5,
  'scoring': 'accuracy',
  'random_state': 42,
  'n_jobs': -1,
  'verbose': 0,
  'maximize': True,
  'time_budget': None,
  'ensemble_surrogate': True,
  'transfer_learning': True,
  'optimization_strategy': 'auto',
  'early_stopping': True,
  'meta_learning': True,
  'constraint_handling': 'auto',
  'estimator__bootstrap': True,
  'estimator__ccp_alpha': 0.0,
  'estimator__criterion': 'squared_error',
  'estimator__max_depth': None,
  'estimator__max_features': 1.0,
  'estimator__max_leaf_nodes': None,
  'estimator__max_samples': None,
  'estimator__min_impurity_decrease': 0.0,
  'estimator__min_samples_leaf': 1,
  'estimator__min_samples_split': 2,
  'estimator__min_weight_fraction_leaf': 0.0,
  'estimator__monotonic_cst': None,
  'estimator__

# 📊 Model Evaluation

After training our model, we need to evaluate its performance to understand how well it learned from the data.

### What This Does:
- **Performance Metrics**: Calculates key regression metrics like MAE, MSE, RMSE, and R²
- **Validation**: Uses the test data stored by the engine during training to assess model quality
- **Scoring**: Provides quantitative measures of prediction accuracy

### Expected Output:
You'll see metrics showing how well the model performs on unseen data, helping you understand the quality of your trained model.

In [5]:
training_engine.evaluate_model()

2025-08-06 20:43:44,848 - INFO - MLTrainingEngine - Using stored test data for evaluation


{'model_name': 'my_model',
 'metrics': {'mse': 2927.1280766016057,
  'rmse': 54.10293962994623,
  'mae': 42.29247041919631,
  'r2': 0.7894957394280003,
  'prediction_time': 0.0065135955810546875},
 'detailed': False}
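The metrics in this output can be reproduced by hand. The sketch below implements the textbook definitions of MSE, RMSE, MAE, and R² in plain Python, assuming the engine follows the same standard definitions. `regression_metrics` is a hypothetical helper and the four-value input is illustrative data, not output from the trained model.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute standard regression metrics from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


demo_metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(demo_metrics)
```

RMSE is the square root of MSE, which offers a quick consistency check on the reported values: the square root of 2927.128 is about 54.103, matching the `rmse` entry.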

# 📋 Generate Comprehensive Training Report

This section creates a detailed, formatted report of the entire training process and model performance.

### Report Contents:
- **Model Information**: Details about the selected algorithm and hyperparameters
- **Performance Metrics**: Complete evaluation results with statistical measures
- **Training Process**: Summary of the automated training workflow
- **Recommendations**: Insights and suggestions for model improvement

### Display Format:
The report is rendered as Markdown for easy reading, with formatted tables, metrics, and explanations.
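Since the report is passed to `Markdown(...)` for display, it is presumably a plain string, in which case it can also be written to a `.md` file for sharing. The sketch below assumes that; `report_text` is a short stand-in string rather than real output, and the file name is arbitrary.

```python
import tempfile
from pathlib import Path

# Stand-in for the string returned by the report step (assumption)
report_text = "# ML Training Engine Report\n\n**Total Models Trained**: 1\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    report_path = Path(tmp_dir) / "training_report.md"
    # Write the Markdown text to disk, then read it back to confirm
    report_path.write_text(report_text, encoding="utf-8")
    restored_text = report_path.read_text(encoding="utf-8")

print(restored_text == report_text)
```

In a real notebook you would replace the temporary directory with a project folder so the file persists after the cell finishes.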

In [6]:
from IPython.display import Markdown, display


report = training_engine.generate_report()
display(Markdown(report))

# ML Training Engine Report

**Generated**: 2025-08-06 20:43:44

**Total Models Trained**: 1
**Best Model**: my_model

## Configuration

- **Task Type**: regression
- **Random State**: 42
- **CV Folds**: 5
- **Optimization Strategy**: OptimizationStrategy.HYPERX

## Model Performance Comparison

| Model Name | Type | Training Time (s) | Best | Metrics |
|------------|------|------------------|------|----------|
| my_model | RandomForestRegressor | 22.76 | ✅ | mse: 3169.5485, rmse: 56.2987, mae: 44.1075 |

## Detailed Model Information

### my_model

**Model Type**: RandomForestRegressor

**Parameters**:
- estimator: RandomForestRegressor()
- param_space: {'n_estimators': [100, 200], 'max_depth': [None, 10], 'min_samples_split': [2]}
- max_iter: 15
- cv: 5
- scoring: accuracy
- random_state: 42
- n_jobs: -1
- verbose: 0
- maximize: True
- time_budget: None
- ensemble_surrogate: True
- transfer_learning: True
- optimization_strategy: auto
- early_stopping: True
- meta_learning: True
- constraint_handling: auto
- estimator__bootstrap: True
- estimator__ccp_alpha: 0.0
- estimator__criterion: squared_error
- estimator__max_depth: None
- estimator__max_features: 1.0
- estimator__max_leaf_nodes: None
- estimator__max_samples: None
- estimator__min_impurity_decrease: 0.0
- estimator__min_samples_leaf: 1
- estimator__min_samples_split: 2
- estimator__min_weight_fraction_leaf: 0.0
- estimator__monotonic_cst: None
- estimator__n_estimators: 100
- estimator__n_jobs: None
- estimator__oob_score: False
- estimator__random_state: None
- estimator__verbose: 0
- estimator__warm_start: False

**Performance Metrics**:
- mse: 3169.548506
- rmse: 56.298743
- mae: 44.107523
- r2: 0.832419

**Top 10 Feature Importance**:
1. feature_6: 0.3109
2. feature_9: 0.2961
3. feature_19: 0.1638
4. feature_5: 0.0477
5. feature_18: 0.0348
6. feature_16: 0.0258
7. feature_13: 0.0142
8. feature_2: 0.0123
9. feature_1: 0.0106
10. feature_12: 0.0084

---



# 💾 Save the Trained Model

Now that we have a trained and evaluated model, we'll save it to disk for future use.

### What This Does:
- **Model Serialization**: Serializes the trained model to a `.pkl` file
- **Persistent Storage**: Saves the file to the `model_registry/training_models/` directory
- **Reusability**: Enables loading the model later without retraining
- **Version Control**: Maintains model versions for comparison and rollback

### File Location:
The model will be saved as `model_registry/training_models/my_model.pkl`
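The `.pkl` extension conventionally denotes a file written with Python's `pickle` module. The sketch below shows a pickle round trip on a small stand-in object. It assumes a pickle-compatible format; the serializer the library actually uses is not shown in this tutorial, and `stand_in_model` is illustrative data, not a trained model.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object playing the role of a trained model (illustrative only)
stand_in_model = {"model_name": "demo_model", "coefficients": [0.5, -1.25, 3.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"
    # Serialize the object to disk
    with model_path.open("wb") as file_handle:
        pickle.dump(stand_in_model, file_handle)
    # Deserialize it again, as a later session would
    with model_path.open("rb") as file_handle:
        restored_model = pickle.load(file_handle)

print(restored_model == stand_in_model)
```

Unpickling can execute arbitrary code, so only load `.pkl` files that come from a source you trust.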

In [7]:
training_engine.save_model("my_model")

2025-08-06 20:43:44,885 - INFO - MLTrainingEngine - Model 'my_model' saved to model_registry\training_models\my_model.pkl


'model_registry\\training_models\\my_model.pkl'
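The save step reports the path with backslashes because this run was executed on Windows, while the load cell that follows passes the same location with forward slashes. The sketch below uses `pathlib.PureWindowsPath`, which works on any operating system, to show that Windows path handling treats the two spellings as the same path.

```python
from pathlib import PureWindowsPath

# Path as printed by the save step (Windows separators)
logged_path = PureWindowsPath("model_registry\\training_models\\my_model.pkl")
# Path as written in the load cell (forward slashes)
loaded_path = PureWindowsPath("model_registry/training_models/my_model.pkl")

same_location = logged_path == loaded_path
print(same_location)
print(loaded_path.as_posix())
```

Building paths with `pathlib` instead of hard-coded separators keeps a notebook portable between Windows, Linux, and macOS.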

In [14]:
from kolosal_automl.modules.engine.inference_engine import InferenceEngine
inference_engine = InferenceEngine()

inference_engine.load_model("model_registry/training_models/my_model.pkl")

2025-08-06 20:48:32,632 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - Advanced resource management configured with 4 threads
2025-08-06 20:48:33,954 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - SIMD optimizer initialized
2025-08-06 20:48:33,955 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - Result cache initialized with 1000 entries
2025-08-06 20:48:33,983 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - JIT compiler enabled for inference optimization
2025-08-06 20:48:33,984 - INFO - kolosal_automl.modules.engine.mixed_precision - Mixed precision enabled
2025-08-06 20:48:33,985 - kolosal_automl.modules.engine.inferenc

True

2025-08-06 20:48:44,250 - kolosal_automl.modules.engine.inference_engine.InferenceEngine - INFO - CPU usage (0.0%) below threshold, disabling throttling


# 🔮 Model Prediction

Now that we have loaded our trained model, we can use it to make predictions on new data.

### What We'll Do
- **Generate New Test Data**: Create a small sample array to test predictions
- **Make Predictions**: Use the trained model to predict target values
- **Display Results**: Show both input features and predicted outputs

### Outcome
A demonstration of how to run the loaded model on new inputs. Because the test inputs are random and have no ground-truth targets, this shows the prediction workflow rather than the model's accuracy.

In [39]:
# Generate some test data for prediction (5 samples with 20 features each)
import numpy as np

# Create a small test array with the same number of features as our training data
test_data = np.random.randn(5, 20)

print("Test data shape:", test_data.shape)
print("\nSample test data (first 3 samples, first 5 features):")
print(test_data[:3, :5])

# Make predictions using the loaded model
_, predictions, _ = inference_engine.predict(test_data)

print(f"\nPredictions for {len(test_data)} samples:")
print(predictions)

# Display results in a more readable format
print("\n" + "="*50)
print("PREDICTION RESULTS")
print("="*50)
for i, pred in enumerate(predictions):
    print(f"Sample {i+1}: {pred:.4f}")

Test data shape: (5, 20)

Sample test data (first 3 samples, first 5 features):
[[-0.57454698 -0.41986346 -1.50487599  0.12598395  0.02996544]
 [ 0.22940325 -1.08113711  0.75053612  0.69822084 -0.86998019]
 [ 1.03186798  1.34580327  0.73725253 -1.09161119 -0.63682645]]

Predictions for 5 samples:
[  26.3905134   -38.14230437  166.25675717 -114.99146015  -86.85265942]

PREDICTION RESULTS
Sample 1: 26.3905
Sample 2: -38.1423
Sample 3: 166.2568
Sample 4: -114.9915
Sample 5: -86.8527
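The cell above draws its inputs with an unseeded `np.random.randn`, so the printed values change on every run. The sketch below shows one way to make such test inputs reproducible with a seeded NumPy generator, and adds a shape check before prediction. `make_test_batch` and `check_feature_count` are hypothetical helpers; the 20-feature requirement comes from the training data used in this tutorial.

```python
import numpy as np

N_FEATURES = 20  # must match the number of features used for training


def make_test_batch(n_samples, seed):
    """Draw standard-normal test inputs from a seeded generator."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_samples, N_FEATURES))


def check_feature_count(batch, expected=N_FEATURES):
    """Raise ValueError unless the batch is 2-D with the expected width."""
    batch = np.asarray(batch)
    if batch.ndim != 2 or batch.shape[1] != expected:
        raise ValueError(
            f"expected shape (n_samples, {expected}), got {batch.shape}"
        )
    return batch


demo_batch = check_feature_count(make_test_batch(5, seed=0))
print(demo_batch.shape)
```

Checking the shape up front turns a confusing downstream error into a clear message when the input has the wrong number of columns.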


# 🎉 Tutorial Complete!

Congratulations! You've successfully completed the Kolosal AutoML optimal training tutorial.

## What You've Accomplished:
✅ **Imported** essential AutoML libraries  
✅ **Configured** optimal training settings with device optimization  
✅ **Generated** synthetic regression data for testing  
✅ **Trained** an automated machine learning model  
✅ **Evaluated** model performance with comprehensive metrics  
✅ **Generated** a detailed training report  
✅ **Saved** the model for future use  
✅ **Loaded** and tested the saved model  
✅ **Made predictions** on new data  

## Next Steps:
- Try with your own datasets
- Experiment with different task types (classification)
- Explore advanced configuration options
- Compare multiple models using the benchmarking features

Happy machine learning! 🚀