Mihawk1891/TuneLab

TuneLab

Fully Autonomous ML Engineer Agent

A production-ready, fully autonomous machine learning pipeline for tabular datasets. Build high-quality models end-to-end without human intervention.

Features

  • 🔄 Fully Autonomous: Zero human intervention required
  • 💾 Strategy Memory: Learns from past runs via dataset fingerprinting
  • 📊 Complete Pipeline: From raw data to production model
  • 📝 Professional Docs: Auto-generated Markdown reports
  • 🖥️ CPU-First: Optimized for CPU, GTX 1650 friendly (no GPU required)
  • 🆓 100% Open Source: No proprietary dependencies

🚀 Quick Start

Installation

# Clone or download this repository
git clone <your-repo-url>
cd ml-agent

# Install dependencies
pip install -r requirements.txt

Basic Usage

# Run on your dataset
python ml_agent.py path/to/your/data.csv

That's it! The agent will:

  1. Analyze your data
  2. Engineer features
  3. Train multiple models
  4. Optimize hyperparameters
  5. Generate reports and plots
  6. Save the best model

Example with Sample Data

# Generate sample data first
python example_usage.py --generate-data

# Run the agent
python ml_agent.py sample_data/iris.csv

📁 Output Structure

After running, you'll find everything in the outputs/ directory:

outputs/
├── models/
│   └── final_model.joblib          # Trained model ready for production
├── plots/
│   ├── feature_importance.png      # Top features visualization
│   └── metric_comparison.png       # Model performance comparison
├── reports/
│   ├── overview.md                 # Project summary
│   ├── data_analysis.md            # Data insights
│   ├── modeling.md                 # Model selection details
│   └── results.md                  # Final results & recommendations
└── strategy/
    └── <fingerprint>.json          # Reusable strategy for this dataset

🎯 How It Works

1. Dataset Fingerprinting

Generates a unique fingerprint based on:

  • Dataset shape
  • Column names and types
  • Missing value patterns

If you've run the agent on similar data before, it loads the successful strategy from memory.
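For illustration, a fingerprint of this kind can be computed by hashing exactly those properties. The `dataset_fingerprint` helper below is a hypothetical sketch, not the exact function in ml_agent.py:

```python
import hashlib

import pandas as pd


def dataset_fingerprint(df: pd.DataFrame) -> str:
    """Hash the dataset's shape, schema, and missing-value pattern."""
    parts = [
        str(df.shape),
        ",".join(df.columns),
        ",".join(str(dt) for dt in df.dtypes),
        ",".join(str(n) for n in df.isna().sum()),
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]


df = pd.DataFrame({"sepal_length": [5.1, 4.9, None], "species": ["a", "b", "c"]})
fp = dataset_fingerprint(df)  # stable across runs for the same schema
```

Because the hash ignores the actual cell values, two samples drawn from the same source table map to fingerprints that differ only if their shape or missing-value counts differ.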

2. Data Understanding

Automatically detects:

  • Target column (last column by default)
  • Problem type (classification vs regression)
  • Missing values
  • Feature types (numerical vs categorical)
  • Class imbalance
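The problem-type detection can be sketched with a simple heuristic like the one below (a hypothetical stand-in for the agent's internal check, not its exact code):

```python
import pandas as pd


def detect_problem_type(y: pd.Series, max_classes: int = 20) -> str:
    """Heuristic: non-numeric or low-cardinality integer targets -> classification."""
    if not pd.api.types.is_numeric_dtype(y):
        return "classification"
    vals = y.dropna()
    if vals.nunique() <= max_classes and (vals % 1 == 0).all():
        return "classification"
    return "regression"
```

String labels or a small set of integer codes are treated as classes; anything continuous falls through to regression.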

3. Feature Engineering

Applies robust preprocessing:

  • Numerical: Median imputation
  • Categorical: Most frequent imputation + label encoding
  • No leakage: All transformations fit only on training data

4. Model Selection

Trains multiple baseline models:

Classification:

  • Logistic Regression
  • Random Forest
  • Extra Trees
  • Gradient Boosting

Regression:

  • Linear Regression
  • Ridge Regression
  • Random Forest
  • Extra Trees
  • Gradient Boosting
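The classification branch of this comparison might look roughly like the loop below (the data is synthetic here; the real agent uses your train/test split):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Extra Trees": ExtraTreesClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best_name = max(scores, key=scores.get)  # the winner goes on to tuning
```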

5. Hyperparameter Optimization

Uses Optuna for Bayesian optimization:

  • 30 trials
  • 3-fold cross-validation
  • Automatic parameter search

6. Artifacts & Documentation

Generates:

  • Saved models (.joblib)
  • Visualizations (.png)
  • Professional Markdown reports
  • Reusable strategy files
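A model package consistent with the loading code shown later under Advanced Usage can be assembled like this (the trained model and fitted preprocessor below are stand-ins for the agent's real ones):

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.random((50, 3))
y = (X[:, 0] > 0.5).astype(int)

preprocessor = StandardScaler().fit(X)
model = LogisticRegression().fit(preprocessor.transform(X), y)

# Bundle everything inference needs into one artifact.
model_pkg = {
    "model": model,
    "preprocessor": preprocessor,
    "feature_names": ["f1", "f2", "f3"],
}
joblib.dump(model_pkg, "final_model.joblib")  # the agent writes outputs/models/final_model.joblib

loaded = joblib.load("final_model.joblib")
```

Shipping the preprocessor inside the same file is what makes the artifact deployable on its own: the consumer never has to reproduce the training-time transformations.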

💻 Advanced Usage

Specify Target Column

from ml_agent import MLAgent

agent = MLAgent(
    data_path="data.csv",
    target_col="target",  # Specify target column
    problem_type="classification"  # Or "regression"
)

agent.run()

Customize Parameters

agent = MLAgent(
    data_path="data.csv",
    output_dir="my_outputs",
    max_iterations=5,
    target_metric_threshold=0.95,
    improvement_threshold=0.01
)

agent.run()

Load and Use Trained Model

import joblib
import pandas as pd

# Load model package
model_pkg = joblib.load('outputs/models/final_model.joblib')

model = model_pkg['model']
preprocessor = model_pkg['preprocessor']
feature_names = model_pkg['feature_names']

# Load new data
new_data = pd.read_csv('new_data.csv')

# Apply the saved preprocessing, then predict
X_new = preprocessor.transform(new_data)
predictions = model.predict(X_new)

🛠️ Tech Stack

Core (Required)

  • Python 3.10+
  • pandas
  • numpy
  • scikit-learn
  • matplotlib
  • joblib

Optional (Recommended)

  • seaborn (better plots)
  • optuna (hyperparameter tuning)

Constraints

  • ✅ 100% open source
  • ✅ CPU-optimized
  • ✅ No GPU required
  • ✅ GTX 1650 compatible if GPU available
  • ❌ No proprietary software
  • ❌ No paid APIs

📊 Supported Models

Classification

| Model | Speed | Accuracy | Interpretability |
|---|---|---|---|
| Logistic Regression | ⚡⚡⚡ | ⭐⭐ | ⭐⭐⭐ |
| Random Forest | ⚡⚡ | ⭐⭐⭐ | ⭐⭐ |
| Extra Trees | ⚡⚡ | ⭐⭐⭐ | ⭐⭐ |
| Gradient Boosting | ⚡ | ⭐⭐⭐ | ⭐⭐ |

Regression

| Model | Speed | Accuracy | Interpretability |
|---|---|---|---|
| Linear Regression | ⚡⚡⚡ | ⭐⭐ | ⭐⭐⭐ |
| Ridge Regression | ⚡⚡⚡ | ⭐⭐ | ⭐⭐⭐ |
| Random Forest | ⚡⚡ | ⭐⭐⭐ | ⭐⭐ |
| Extra Trees | ⚡⚡ | ⭐⭐⭐ | ⭐⭐ |
| Gradient Boosting | ⚡ | ⭐⭐⭐ | ⭐⭐ |

📖 Example Workflow

1. Prepare Your Data

Your CSV should have:

  • Features in columns
  • Target variable (typically last column)
  • Header row with column names

feature1,feature2,feature3,target
1.2,3.4,cat,0
2.3,4.5,dog,1
...
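Loading a file in this format and splitting off the target might amount to the following (the CSV is inlined here for illustration; the agent reads your file path instead):

```python
import io

import pandas as pd

csv_text = """feature1,feature2,feature3,target
1.2,3.4,cat,0
2.3,4.5,dog,1
"""

df = pd.read_csv(io.StringIO(csv_text))
X = df.iloc[:, :-1]  # every column except the last is a feature
y = df.iloc[:, -1]   # the last column is the default target
```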

2. Run the Agent

python ml_agent.py my_data.csv

Output:

🤖 ML Agent Initialized
📁 Output Directory: outputs

📊 Loading dataset: my_data.csv
   Shape: (1000, 5)
   Columns: ['feature1', 'feature2', 'feature3', 'target']
🔑 Dataset Fingerprint: a3f5d8c9b2e1f4a7

🔍 Analyzing dataset...
   Auto-detected target: target
   Auto-detected problem type: classification
   ✅ No missing values
   Numerical features: 2
   Categorical features: 1

🔧 Engineering features...
   Processing 2 numerical + 1 categorical features
   ✅ Train: 800 samples
   ✅ Test: 200 samples

🎯 Training baseline models...
   Training Logistic Regression... accuracy=0.8500
   Training Random Forest... accuracy=0.9200
   Training Extra Trees... accuracy=0.9100
   Training Gradient Boosting... accuracy=0.9350

   🏆 Best model: Gradient Boosting (accuracy=0.9350)

⚙️  Optimizing hyperparameters...
   Best trial score: 0.9425
   ✅ Optimized model score: 0.9450

💾 Model saved: outputs/models/final_model.joblib

📊 Generating plots...
   ✅ Feature importance plot saved
   ✅ Model comparison plot saved

📝 Generating reports...
   ✅ All reports generated

💾 Strategy saved to memory: a3f5d8c9

============================================================
✅ PIPELINE COMPLETE
============================================================

📁 All outputs saved to: outputs/
🏆 Final model score: 0.9450
💾 Model: outputs/models/final_model.joblib
📊 Reports: outputs/reports/
📈 Plots: outputs/plots/

3. Review the Outputs

Check the generated reports:

  • outputs/reports/overview.md - Quick summary
  • outputs/reports/data_analysis.md - Data insights
  • outputs/reports/modeling.md - Model details
  • outputs/reports/results.md - Final results

View visualizations:

  • outputs/plots/feature_importance.png
  • outputs/plots/metric_comparison.png

4. Use the Model

import joblib

# Load, preprocess, then predict
model_pkg = joblib.load('outputs/models/final_model.joblib')
X_new = model_pkg['preprocessor'].transform(new_data)
predictions = model_pkg['model'].predict(X_new)

🎓 Design Philosophy

1. Simplicity Over Complexity

  • Use simple, explainable models first
  • Avoid overfitting with regularization
  • Prefer interpretability when possible

2. CPU-First Architecture

  • No GPU required (though compatible)
  • Optimized for standard hardware
  • Works on laptops and servers alike

3. Full Transparency

  • Every decision is documented
  • Complete audit trail in reports
  • Reproducible with fixed random seeds
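In practice, reproducibility of this kind comes down to fixing the seeds once, for example:

```python
import random

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# scikit-learn estimators additionally take random_state=SEED explicitly
```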

4. Production Ready

  • Save everything needed for deployment
  • Include preprocessing in model package
  • Professional documentation for handover

5. Autonomous Operation

  • Zero human intervention
  • Automatic problem detection
  • Self-documenting workflows

🔧 Customization

Add Custom Models

Edit ml_agent.py in the train_models() method:

from sklearn.svm import SVC  # import any model class you add

if self.problem_type == 'classification':
    models = {
        'Logistic Regression': LogisticRegression(...),
        'Random Forest': RandomForestClassifier(...),
        # Add your model here:
        'SVM': SVC(...),
    }

Change Metrics

Modify the metric calculation in train_models():

from sklearn.metrics import accuracy_score, f1_score

# For classification
metrics = {
    'accuracy': accuracy_score(self.y_test, y_pred),
    'f1': f1_score(self.y_test, y_pred, average='weighted'),
    # Add custom metrics
}

Adjust Hyperparameter Search

Modify optimize_hyperparameters():

# Change number of trials
study.optimize(objective, n_trials=50)  # Default: 30

# Adjust cross-validation folds
scores = cross_val_score(..., cv=5)  # Default: 3

📝 Report Examples

Overview Report

  • Quick project summary
  • Tech stack used
  • Best model and score
  • File structure

Data Analysis Report

  • Dataset statistics
  • Missing value analysis
  • Feature type breakdown
  • Target distribution

Modeling Report

  • All models tried
  • Performance comparison
  • Selection rationale
  • Hyperparameter tuning results

Results Report

  • Final metrics
  • Usage instructions
  • Next steps recommendations
  • Reproducibility guide

🐛 Troubleshooting

"Optuna not available"

pip install optuna

Or continue without it (hyperparameter tuning will be skipped).

"Memory Error"

For large datasets, reduce n_estimators in models:

'Random Forest': RandomForestClassifier(n_estimators=50)  # Default: 100

"FileNotFoundError"

Make sure your CSV path is correct:

python ml_agent.py /full/path/to/data.csv

🤝 Contributing

This is a fully autonomous agent, but improvements are welcome!

Areas for enhancement:

  • Additional model types
  • Advanced feature engineering
  • Custom metric support
  • Multi-class calibration
  • Time series support

📄 License

Open source - use freely for any purpose.


🙏 Acknowledgments

Built with:

  • scikit-learn - Amazing ML library
  • Optuna - Hyperparameter optimization
  • pandas - Data manipulation
  • matplotlib - Visualization

📞 Support

For issues or questions, contact me at pranavbansode2604@gmail.com. Before reaching out:

  1. Check the generated reports in outputs/reports/
  2. Review the troubleshooting section
  3. Examine the code comments in ml_agent.py

Built for production. Designed for autonomy. Optimized for simplicity.

🤖 Let the agent do the work.

About

TuneLab is a fully autonomous ML engineer agent that builds production-ready models from raw CSV data. Upload → Train → Deploy. Features FastAPI backend, beautiful web UI, strategy memory, auto hyperparameter tuning, and one-click Render deployment. 100% open-source & CPU-first.
