This repository contains a modularized data science pipeline for cloud classification. It is a flexible, extensible framework designed to streamline the machine learning workflow, including data acquisition, feature engineering, model selection, training, evaluation, and deployment. The pipeline is organized into distinct, reusable components that can be easily modified or replaced to meet the needs of a wide range of machine learning tasks.
- Modularity: The pipeline is built from independent modules for data processing, feature engineering, model training, evaluation, deployment, and artifact saving.
- Flexibility: Each module can be customized to specific requirements or preferences via `config.yaml`.
- Compatibility: Supports a wide variety of machine learning models and libraries, including TensorFlow, PyTorch, and scikit-learn.
- Reproducibility: The entire pipeline and its unit tests can be run inside a Docker container.
- Python 3.7 or higher
- See `requirements.txt`.
- The pipeline assumes an S3 bucket has already been set up; it will not create one for you.
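If you need to create a bucket first, one option is the AWS CLI (the bucket name below is a placeholder; credentials must already be configured, as described in the AWS setup steps later in this guide):

```bash
aws s3 mb s3://your-bucket-name
```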
Clone the repository:

```bash
git clone https://github.com/MSIA/2023-423-hwl6390-hw2.git
cd 2023-423-hwl6390-hw2
```
This guide assumes you have installed the AWS CLI. If you have not configured an AWS profile, run the following:

```bash
aws configure sso --profile my-sso
```

For the purposes of this guide, the AWS profile is named `my-sso`; you may name yours however you like.
After configuring SSO, log in:

```bash
aws sso login --profile my-sso
```

After logging in, export the profile as an environment variable:

```bash
export AWS_PROFILE=my-sso
```
If you run `aws configure list` and see `my-sso` in the output, the environment variable has been set correctly.
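As an extra sanity check that the credentials actually resolve, you can ask AWS to identify the caller:

```bash
aws sts get-caller-identity
```

If this prints your account ID and role ARN rather than an error, the profile and login are working.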
Install the dependencies:

```bash
pip install -r requirements.txt
```
Verify you are in the same directory as `pipeline.py`. Then run:

```bash
python pipeline.py
```
Run the unit tests with:

```bash
pytest
```
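If you want to run only a subset of the tests, pytest also accepts a file path and a verbosity flag; the path below is hypothetical and should be replaced with an actual test module from this repository:

```bash
pytest tests/test_pipeline.py -v
```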
To run the full pipeline in a container, build the main image and run it with your AWS credentials mounted:

```bash
docker build -t pipeline -f dockerfiles/dockerfile-main .
docker run -v ~/.aws:/root/.aws -e AWS_PROFILE=my-sso pipeline
```
To run the unit tests in a container:

```bash
docker build -t unittest-pipeline -f dockerfiles/dockerfile-test .
docker run unittest-pipeline
```
To customize settings within the pipeline, edit `config.yaml`. Each pipeline stage reads its settings from a dedicated section (an illustrative sketch of the file's overall shape follows this list):
- Modify the `run_config` section to set the dataset and output locations.
- Modify the `create_dataset` section to set the desired dataset characteristics and output locations.
- Modify the `generate_features` section to choose which features to generate and the operations that produce them.
- Modify the `mpl_config` and `eda` sections to adjust matplotlib settings and create the desired visualizations. Set the save location with `figure_dir` in the `run_config` section.
- Modify the `train_model` section to adjust the train/test split, features, model configuration, hyperparameters, and the directory for saving model artifacts.
- Modify the `score_model` section to adjust settings for model output.
- Modify the `evaluate_performance` section to adjust the metrics used to evaluate model performance.
- Modify the `aws` section to set the desired bucket name and prefixes.
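For orientation, the file's overall shape might resemble the sketch below. The top-level section names are the ones listed above; every nested key and value is a hypothetical placeholder, so consult the `config.yaml` shipped with this repository for the actual schema.

```yaml
# Illustrative sketch only -- all nested keys and values are placeholders.
run_config:
  output: runs/                    # hypothetical: where pipeline outputs are written
  figure_dir: figures/             # where figures are saved (referenced above)

create_dataset:
  raw_data: data/raw/clouds.data   # hypothetical input location
  save_path: data/clean/           # hypothetical output location

generate_features:
  log_entropy:                     # hypothetical derived feature
    source_column: visible_entropy
    operation: log

mpl_config:
  figure.figsize: [12, 8]          # any valid matplotlib rcParam

eda:
  plots: [histogram, correlation]  # hypothetical visualization list

train_model:
  test_size: 0.2                   # hypothetical train/test split fraction
  features: [log_entropy]          # hypothetical feature list
  hyperparameters:
    n_estimators: 100              # hypothetical model hyperparameter
  artifact_dir: models/            # hypothetical artifact save directory

score_model:
  threshold: 0.5                   # hypothetical scoring setting

evaluate_performance:
  metrics: [accuracy, roc_auc]     # hypothetical metric names

aws:
  bucket_name: your-bucket-name    # must already exist (see prerequisites)
  prefix: experiments/             # hypothetical key prefix
```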