A production-ready, fully autonomous machine learning pipeline for tabular datasets. Build high-quality models end-to-end without human intervention.
- 🔄 Fully Autonomous: Zero human intervention required
- 💾 Strategy Memory: Learns from past runs via dataset fingerprinting
- 📊 Complete Pipeline: From raw data to production model
- 📝 Professional Docs: Auto-generated Markdown reports
- 🖥️ CPU-First: Optimized for CPU, GTX 1650 friendly (no GPU required)
- 🆓 100% Open Source: No proprietary dependencies
```bash
# Clone or download this repository
git clone <your-repo-url>
cd ml-agent

# Install dependencies
pip install -r requirements.txt

# Run on your dataset
python ml_agent.py path/to/your/data.csv
```

That's it! The agent will:
- Analyze your data
- Engineer features
- Train multiple models
- Optimize hyperparameters
- Generate reports and plots
- Save the best model
```bash
# Generate sample data first
python example_usage.py --generate-data

# Run the agent
python ml_agent.py sample_data/iris.csv
```

After running, you'll find everything in the `outputs/` directory:
```
outputs/
├── models/
│   └── final_model.joblib       # Trained model ready for production
├── plots/
│   ├── feature_importance.png   # Top features visualization
│   └── metric_comparison.png    # Model performance comparison
├── reports/
│   ├── overview.md              # Project summary
│   ├── data_analysis.md         # Data insights
│   ├── modeling.md              # Model selection details
│   └── results.md               # Final results & recommendations
└── strategy/
    └── <fingerprint>.json       # Reusable strategy for this dataset
```
Generates a unique fingerprint based on:
- Dataset shape
- Column names and types
- Missing value patterns
If you've run the agent on similar data before, it loads the successful strategy from memory.
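A fingerprint of this kind could be computed by hashing those three properties. A minimal sketch (`dataset_fingerprint` is an illustrative helper, not necessarily the agent's exact implementation):

```python
import hashlib

import pandas as pd

def dataset_fingerprint(df: pd.DataFrame) -> str:
    """Hash dataset shape, column names/types, and missing-value pattern."""
    parts = [
        str(df.shape),                                                   # dataset shape
        ",".join(f"{c}:{t}" for c, t in df.dtypes.astype(str).items()),  # names + types
        ",".join(str(n) for n in df.isna().sum()),                       # missing counts
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]
```

Two datasets with the same schema and missing-value pattern then map to the same strategy file.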
Automatically detects:
- Target column (last column by default)
- Problem type (classification vs regression)
- Missing values
- Feature types (numerical vs categorical)
- Class imbalance
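The classification-vs-regression check is typically a heuristic over the target column. A minimal sketch (the threshold and function name are illustrative, not the agent's exact rule):

```python
import pandas as pd

def detect_problem_type(target: pd.Series, max_classes: int = 20) -> str:
    """Guess the problem type from the target column's dtype and cardinality."""
    if target.dtype == object or str(target.dtype) == "category":
        return "classification"      # string or categorical labels
    if target.nunique() <= max_classes:
        return "classification"      # few distinct numeric values (e.g. 0/1)
    return "regression"              # many distinct numeric values -> continuous
```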
Applies robust preprocessing:
- Numerical: Median imputation
- Categorical: Most frequent imputation + label encoding
- No leakage: All transformations fit only on training data
Trains multiple baseline models:
Classification:
- Logistic Regression
- Random Forest
- Extra Trees
- Gradient Boosting
Regression:
- Linear Regression
- Ridge Regression
- Random Forest
- Extra Trees
- Gradient Boosting
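The baseline stage boils down to fitting each candidate and keeping the best scorer. A self-contained sketch on synthetic data (a subset of the models above, for brevity):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
# Fit every candidate and record its held-out accuracy
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
best_name = max(scores, key=scores.get)
print(f"Best baseline: {best_name} (accuracy={scores[best_name]:.4f})")
```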
Uses Optuna for Bayesian optimization:
- 30 trials
- 3-fold cross-validation
- Automatic parameter search
Generates:
- Saved models (`.joblib`)
- Visualizations (`.png`)
- Professional Markdown reports
- Reusable strategy files
```python
from ml_agent import MLAgent

agent = MLAgent(
    data_path="data.csv",
    target_col="target",           # Specify target column
    problem_type="classification"  # Or "regression"
)
agent.run()
```

```python
agent = MLAgent(
    data_path="data.csv",
    output_dir="my_outputs",
    max_iterations=5,
    target_metric_threshold=0.95,
    improvement_threshold=0.01
)
agent.run()
```

```python
import joblib
import pandas as pd

# Load model package
model_pkg = joblib.load('outputs/models/final_model.joblib')
model = model_pkg['model']
preprocessor = model_pkg['preprocessor']
feature_names = model_pkg['feature_names']

# Load new data
new_data = pd.read_csv('new_data.csv')

# Make predictions (apply the saved preprocessing first)
predictions = model.predict(preprocessor.transform(new_data))
```

- Python 3.10+
- pandas
- numpy
- scikit-learn
- matplotlib
- joblib
- seaborn (optional, for enhanced plots)
- optuna (optional, for hyperparameter tuning)
- ✅ 100% open source
- ✅ CPU-optimized
- ✅ No GPU required
- ✅ GTX 1650 compatible if GPU available
- ❌ No proprietary software
- ❌ No paid APIs
| Model | Speed | Accuracy | Interpretability |
|---|---|---|---|
| Logistic Regression | ⚡⚡⚡ | ⭐⭐ | ⭐⭐⭐ |
| Random Forest | ⚡⚡ | ⭐⭐⭐ | ⭐⭐ |
| Extra Trees | ⚡⚡ | ⭐⭐⭐ | ⭐⭐ |
| Gradient Boosting | ⚡ | ⭐⭐⭐ | ⭐ |
| Model | Speed | Accuracy | Interpretability |
|---|---|---|---|
| Linear Regression | ⚡⚡⚡ | ⭐⭐ | ⭐⭐⭐ |
| Ridge Regression | ⚡⚡⚡ | ⭐⭐ | ⭐⭐⭐ |
| Random Forest | ⚡⚡ | ⭐⭐⭐ | ⭐⭐ |
| Extra Trees | ⚡⚡ | ⭐⭐⭐ | ⭐⭐ |
| Gradient Boosting | ⚡ | ⭐⭐⭐ | ⭐ |
Your CSV should have:
- Features in columns
- Target variable (typically last column)
- Header row with column names
```
feature1,feature2,feature3,target
1.2,3.4,cat,0
2.3,4.5,dog,1
...
```

```bash
python ml_agent.py my_data.csv
```

Output:

```
🤖 ML Agent Initialized
📁 Output Directory: outputs

📊 Loading dataset: my_data.csv
   Shape: (1000, 5)
   Columns: ['feature1', 'feature2', 'feature3', 'target']

🔑 Dataset Fingerprint: a3f5d8c9b2e1f4a7

🔍 Analyzing dataset...
   Auto-detected target: target
   Auto-detected problem type: classification
   ✅ No missing values
   Numerical features: 2
   Categorical features: 1

🔧 Engineering features...
   Processing 2 numerical + 1 categorical features
   ✅ Train: 800 samples
   ✅ Test: 200 samples

🎯 Training baseline models...
   Training Logistic Regression... accuracy=0.8500
   Training Random Forest... accuracy=0.9200
   Training Extra Trees... accuracy=0.9100
   Training Gradient Boosting... accuracy=0.9350
   🏆 Best model: Gradient Boosting (accuracy=0.9350)

⚙️ Optimizing hyperparameters...
   Best trial score: 0.9425
   ✅ Optimized model score: 0.9450

💾 Model saved: outputs/models/final_model.joblib

📊 Generating plots...
   ✅ Feature importance plot saved
   ✅ Model comparison plot saved

📝 Generating reports...
   ✅ All reports generated

💾 Strategy saved to memory: a3f5d8c9

============================================================
✅ PIPELINE COMPLETE
============================================================
📁 All outputs saved to: outputs/
🏆 Final model score: 0.9450
💾 Model: outputs/models/final_model.joblib
📊 Reports: outputs/reports/
📈 Plots: outputs/plots/
```
Check the generated reports:
- `outputs/reports/overview.md` - Quick summary
- `outputs/reports/data_analysis.md` - Data insights
- `outputs/reports/modeling.md` - Model details
- `outputs/reports/results.md` - Final results

View visualizations:
- `outputs/plots/feature_importance.png`
- `outputs/plots/metric_comparison.png`
```python
import joblib

# Load and use (apply the saved preprocessing before predicting)
model_pkg = joblib.load('outputs/models/final_model.joblib')
predictions = model_pkg['model'].predict(model_pkg['preprocessor'].transform(new_data))
```

- Use simple, explainable models first
- Avoid overfitting with regularization
- Prefer interpretability when possible
- No GPU required (though compatible)
- Optimized for standard hardware
- Works on laptops and servers alike
- Every decision is documented
- Complete audit trail in reports
- Reproducible with fixed random seeds
- Save everything needed for deployment
- Include preprocessing in model package
- Professional documentation for handover
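Bundling the model, fitted preprocessor, and feature names into one artifact, as the loading examples above expect, might look like this (toy model and data; the real agent trains on your dataset):

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]])
y = np.array([0, 1, 0, 1])

preprocessor = StandardScaler().fit(X)
model = LogisticRegression().fit(preprocessor.transform(X), y)

# One self-contained package: everything needed for deployment
path = Path(tempfile.mkdtemp()) / "final_model.joblib"
joblib.dump(
    {"model": model, "preprocessor": preprocessor, "feature_names": ["f1", "f2"]},
    path,
)

pkg = joblib.load(path)
preds = pkg["model"].predict(pkg["preprocessor"].transform(X))
```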
- Zero human intervention
- Automatic problem detection
- Self-documenting workflows
Edit the `train_models()` method in `ml_agent.py`:

```python
if self.problem_type == 'classification':
    models = {
        'Logistic Regression': LogisticRegression(...),
        'Random Forest': RandomForestClassifier(...),
        # Add your model here:
        'SVM': SVC(...),
    }
```

Modify the metric calculation in `train_models()`:

```python
# For classification
metrics = {
    'accuracy': accuracy_score(self.y_test, y_pred),
    'f1': f1_score(self.y_test, y_pred, average='weighted'),
    # Add custom metrics
}
```

Modify `optimize_hyperparameters()`:

```python
# Change number of trials
study.optimize(objective, n_trials=50)  # Default: 30

# Adjust cross-validation folds
scores = cross_val_score(..., cv=5)  # Default: 3
```

- Quick project summary
- Tech stack used
- Best model and score
- File structure
- Dataset statistics
- Missing value analysis
- Feature type breakdown
- Target distribution
- All models tried
- Performance comparison
- Selection rationale
- Hyperparameter tuning results
- Final metrics
- Usage instructions
- Next steps recommendations
- Reproducibility guide
```bash
pip install optuna
```

Or continue without it (hyperparameter tuning will be skipped).
For large datasets, reduce `n_estimators` in the models:

```python
'Random Forest': RandomForestClassifier(n_estimators=50)  # Default: 100
```

Make sure your CSV path is correct:

```bash
python ml_agent.py /full/path/to/data.csv
```

This is a fully autonomous agent - improvements welcome!
Areas for enhancement:
- Additional model types
- Advanced feature engineering
- Custom metric support
- Multi-class calibration
- Time series support
Open source - use freely for any purpose.
Built with:
- scikit-learn - Amazing ML library
- Optuna - Hyperparameter optimization
- pandas - Data manipulation
- matplotlib - Visualization
For issues or questions, contact me at pranavbansode2604@gmail.com
- Check the generated reports in `outputs/reports/`
- Review the troubleshooting section
- Examine the code comments in `ml_agent.py`
Built for production. Designed for autonomy. Optimized for simplicity.
🤖 Let the agent do the work.