# Project Structure to Run Hundreds of ML Pipelines

In this repository, we propose an organized structure for machine learning projects that makes it simple to train and tune a large number of ML pipelines. It is based on https://medium.com/@a.larian/how-to-structure-your-machine-learning-project-to-run-hundreds-of-pipelines-960fcc4c877a

- Preview
- Prerequisites
- Installation
- Usage
- Project structure
## Prerequisites

Knowing the following will give you a good idea of what this project is about:

- Basic machine learning concepts
- Basic Python (version 3 or higher)
- A little experience with ML libraries such as scikit-learn, XGBoost, Dask, or pandas
- Familiarity with libraries like imbalanced-learn and category_encoders also helps
## Installation

To use this project, first clone the repo to your device and install the dependencies:

git clone https://github.com/IDAS-Labratory/Project-Structure-to-Run-Hundreds-of-ML-Pipelines.git
cd Project-Structure-to-Run-Hundreds-of-ML-Pipelines
pip install -r requirements.txt
## Usage

To run this sample project, open the main_run.ipynb notebook and run all cells.
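The notebook drives the whole process. As a loose illustration of the core idea, here is a minimal sketch using plain scikit-learn; the dictionaries below stand in for what the configs/ and models/ layers provide, and none of the names are the repository's actual API. Crossing a few component choices is what multiplies a handful of configs into hundreds of pipelines:

```python
# Minimal sketch, assuming scikit-learn is installed; the component
# dictionaries stand in for the repo's configs/ and models/ layers.
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

imputers = {"mean": SimpleImputer(strategy="mean"),
            "median": SimpleImputer(strategy="median")}
encoders = {"onehot": OneHotEncoder(handle_unknown="ignore"),
            "ordinal": OrdinalEncoder()}
models = {"logreg": LogisticRegression(max_iter=1000),
          "rf": RandomForestClassifier()}

# 2 imputers x 2 encoders x 2 models = 8 pipelines; real config grids
# (plus hyperparameter values) grow this into the hundreds.
pipelines = {
    f"{i}-{e}-{m}": Pipeline([("imputer", imputers[i]),
                              ("encoder", encoders[e]),
                              ("model", models[m])])
    for i, e, m in product(imputers, encoders, models)
}
print(len(pipelines), "pipelines built")
```

In the real project the grids come from the YAML files and each component family lives in its own module, but the combinatorics are the same.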
## Project structure

.
├── project                     <- project's name
│   ├── configs                 <- YAML config files corresponding to the ML models,
│   │   │                          dimensionality reduction algorithms, oversamplers,
│   │   │                          and any other pipeline component you want to tune
│   │   │                          (an example config is sketched below the tree)
│   │   ├── encoder_hp.yaml
│   │   ├── ml_model_hp.yaml
│   │   ├── imputer_hp.yaml
│   │   └── oversampler_hp.yaml
│   │
│   ├── models                  <- Each main component of the pipeline separated
│   │   │                          into its own .py file
│   │   ├── __init__.py
│   │   ├── _encoders.py
│   │   ├── _fetch_hyperparameter.py
│   │   ├── _imputers.py
│   │   ├── _ml_algorithms.py
│   │   └── _oversamplers.py
│   │
│   ├── pipelines               <- Your pipelines (if you have several), each written
│   │   │                          as a function (see the sketch below the tree)
│   │   ├── __init__.py
│   │   └── _make_pipeline.py
│   │
│   ├── results
│   │   ├── final-models        <- Serialized pipelines are stored here
│   │   └── outputs             <- Results of training and testing
│   │
│   └── utils
│       ├── __init__.py
│       ├── _handle_results.py
│       ├── _evaluation_metrics.py
│       └── _load_data.py
│
├── analytics                   <- Your analytical notebooks
│   ├── hypotesting
│   └── visualization
│
├── requirements.txt
│
├── main_run.ipynb
│
├── api
│
├── data
│   ├── intermediate            <- Intermediate data that has been transformed
│   ├── processed               <- The final, canonical data sets for modeling
│   └── raw                     <- The original, immutable data dump
│       └── H1N1_Flu_Vaccines.csv
│
├── docker                      <- Dockerfile of the project
├── experiments                 <- For testing new models, libraries, etc.
├── lib                         <- Libraries that need to be added manually
│   └── impute
│       ├── __init__.py
│       └── _knn.py
├── papers                      <- Papers related to the project
└── scripts                     <- Scripts such as scraping (if needed)
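To make the layout concrete, here is a hedged sketch of what one of the YAML files under project/configs might look like; the keys and values are illustrative, since the real schema is whatever models/_fetch_hyperparameter.py expects:

```yaml
# Hypothetical sketch of configs/ml_model_hp.yaml; keys and values are
# illustrative, not the repository's actual schema.
random_forest:
  n_estimators: [100, 300, 500]
  max_depth: [10, 30, null]
xgboost:
  learning_rate: [0.01, 0.1]
  n_estimators: [200, 400]
```

And pipelines/_make_pipeline.py can expose each pipeline as a plain function that wires the components together. A minimal sketch, assuming imbalanced-learn (listed above among the helpful libraries; its Pipeline applies oversamplers only during fit); the function name and signature are illustrative:

```python
# Illustrative sketch of a pipeline factory; not the repository's
# actual _make_pipeline.py.
from imblearn.pipeline import Pipeline


def make_pipeline(imputer, encoder, oversampler, model):
    """Wire one combination of components into a single pipeline."""
    return Pipeline([
        ("imputer", imputer),          # fill missing values first
        ("encoder", encoder),          # then encode categoricals
        ("oversampler", oversampler),  # resamples only at fit time
        ("model", model),
    ])
```

Calling such a factory once per component combination, as in the Usage sketch above, is what yields the hundreds of pipelines this structure is built for.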