Clouds

Overview

This repository contains a modular data science pipeline for cloud classification. The framework covers the machine learning workflow end to end: data acquisition, feature engineering, model selection, training, evaluation, and deployment. The pipeline is organized into distinct, reusable components that can be modified or replaced to suit other machine learning tasks.

Features

Modularity: The pipeline is designed with independent modules for data processing, feature engineering, model training, evaluation, deployment, and artifact saving.
Flexibility: Each module can be easily customized to accommodate specific requirements or preferences using config.yaml.
Compatibility: Supports a wide variety of machine learning models and libraries, including TensorFlow, PyTorch, and scikit-learn.
Reproducibility: The entire pipeline and its unit tests can be run inside a Docker container.
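
To make the modular design concrete, here is a minimal sketch of a pipeline whose steps each receive their own configuration section. The step names, config keys, and toy logic are assumptions made for illustration; they are not taken from pipeline.py or config.yaml.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Step = Callable[[State, Dict[str, Any]], State]


def create_dataset(state: State, cfg: Dict[str, Any]) -> State:
    """Keep only the columns listed under the hypothetical 'columns' key."""
    rows = [{col: row[col] for col in cfg["columns"]} for row in state["raw"]]
    return {**state, "data": rows}


def train_model(state: State, cfg: Dict[str, Any]) -> State:
    """Stand-in 'model': the mean of the hypothetical 'target' column."""
    values = [row[cfg["target"]] for row in state["data"]]
    return {**state, "model": sum(values) / len(values)}


def run_pipeline(
    state: State, config: Dict[str, Any], steps: List[Tuple[str, Step]]
) -> State:
    """Run each step in order, passing the config section that shares its name."""
    for name, step in steps:
        state = step(state, config.get(name, {}))
    return state


# Hypothetical configuration, shaped like a parsed YAML file.
config = {
    "create_dataset": {"columns": ["area", "label"]},
    "train_model": {"target": "area"},
}
raw = [
    {"area": 2.0, "label": 0, "noise": 7},
    {"area": 4.0, "label": 1, "noise": 9},
]
steps = [("create_dataset", create_dataset), ("train_model", train_model)]
result = run_pipeline({"raw": raw}, config, steps)
print(result["model"])  # 3.0
```

Because every step has the same signature, replacing one module only means swapping an entry in the steps list; the other steps and their config sections are unaffected.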

Requirements

Python 3.7 or higher
Python packages listed in requirements.txt
An existing S3 bucket. The pipeline assumes the bucket has already been set up and will not create one for you.

Installation

1. Clone the repository

git clone https://github.com/MSIA/2023-423-hwl6390-hw2.git

2. Change directory into repository folder

cd 2023-423-hwl6390-hw2

3. Set up AWS credentials for artifact upload to S3

This guide assumes you have installed the AWS CLI. If you have not configured an AWS profile, run the following.

aws configure sso --profile my-sso

This guide uses my-sso as the AWS profile name; you can choose any name you like.

After configuring SSO, run the following to log in.

aws sso login --profile my-sso

After logging in, export the profile as an environment variable.

export AWS_PROFILE=my-sso

If you run aws configure list and the profile row shows my-sso, the environment variable has been set correctly.
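
The same check can be sketched in Python, for example as a guard before any upload is attempted. The function name active_aws_profile is hypothetical and does not appear in this repository; the sketch only inspects the environment variable and does not contact AWS.

```python
import os
from typing import Mapping, Optional


def active_aws_profile(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the profile named by AWS_PROFILE, or raise if it is unset or blank."""
    env = os.environ if env is None else env
    profile = env.get("AWS_PROFILE", "").strip()
    if not profile:
        raise RuntimeError("AWS_PROFILE is not set; run: export AWS_PROFILE=my-sso")
    return profile


# Passing a mapping explicitly makes the check easy to try without
# touching the real environment.
profile = active_aws_profile({"AWS_PROFILE": "my-sso"})
print(profile)  # my-sso
```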

4. Install required packages (required for local implementation)

pip install -r requirements.txt

Usage

1. Local

Pipeline only

Verify you are in the same directory as pipeline.py. Then, run

python pipeline.py

in the terminal.

Unit Test

Run

pytest

in the terminal.

2. Docker

Pipeline only

Build the Docker image

docker build -t pipeline -f dockerfiles/dockerfile-main .

Run the entire model pipeline

docker run -v ~/.aws:/root/.aws -e AWS_PROFILE=my-sso pipeline

Unit Test

Build the Docker image for unit test

docker build -t unittest-pipeline -f dockerfiles/dockerfile-test .

Run the tests

docker run unittest-pipeline

Customization

To customize settings within the pipeline, edit config.yaml.

Acquire data

Modify the run_config section in config.yaml to set the desired dataset and output locations.

Create dataset

Modify the create_dataset section in config.yaml to set the desired dataset characteristics and output locations.

Generate features

Modify the generate_features section in config.yaml to set the desired features and the operations used to compute them.
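
As an illustration of configuration-driven features, the sketch below maps operation names to functions and applies them as a config dictionary directs. The keys op and columns, the operation names, and the column names are all assumptions; the actual layout of the generate_features section may differ.

```python
import math
from typing import Any, Callable, Dict, List

# Hypothetical registry: operation name -> function of the input column values.
OPERATIONS: Dict[str, Callable[..., float]] = {
    "log": lambda x: math.log(x),
    "multiply": lambda x, y: x * y,
    "range": lambda high, low: high - low,
}


def generate_features(
    rows: List[Dict[str, float]], feature_cfg: Dict[str, Dict[str, Any]]
) -> List[Dict[str, float]]:
    """Return copies of the rows with one new column per configured feature."""
    out = []
    for row in rows:
        new_row = dict(row)
        for name, spec in feature_cfg.items():
            args = [new_row[col] for col in spec["columns"]]
            new_row[name] = OPERATIONS[spec["op"]](*args)
        out.append(new_row)
    return out


# Hypothetical config section, shaped like parsed YAML.
feature_cfg = {
    "value_range": {"op": "range", "columns": ["max", "min"]},
    "log_area": {"op": "log", "columns": ["area"]},
}
rows = [{"max": 5.0, "min": 2.0, "area": 1.0}]
features = generate_features(rows, feature_cfg)
print(features[0]["value_range"], features[0]["log_area"])  # 3.0 0.0
```

With this layout, adding a feature is a config change as long as its operation is already registered; only a new kind of operation requires new code.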

Analysis

Modify the mpl_config and eda sections in config.yaml to adjust matplotlib settings and create the desired visualizations. Set the save location using figure_dir in the run_config section of config.yaml.

Train model

Modify the train_model section of config.yaml to adjust the train/test split, features, model configuration, hyperparameters, and the directory where model artifacts are saved.
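
To show what a configurable, reproducible train/test split involves, here is a dependency-free sketch. The parameter names test_size and seed are assumptions, and the pipeline itself may well rely on a library routine such as scikit-learn's train_test_split instead; this stand-in only illustrates the idea.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    rows: Sequence[T], test_size: float, seed: int
) -> Tuple[List[T], List[T]]:
    """Split rows into (train, test); the same seed always gives the same split."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    n_test = round(len(rows) * test_size)
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


# Hypothetical settings, as they might appear after parsing config.yaml.
split_cfg = {"test_size": 0.2, "seed": 423}
train, test = train_test_split(list(range(10)), **split_cfg)
print(len(train), len(test))  # 8 2
```

Keeping the seed in the configuration file rather than in code is what lets two runs with the same config produce the same split.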

Score model

Modify the score_model section of config.yaml to adjust settings for model output.

Evaluate performance

Modify the evaluate_performance section of config.yaml to adjust the metrics used for evaluating model performance.
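
One way metrics can be selected from configuration is a lookup from metric name to function, sketched below for binary labels. The metric names and the list-of-names config shape are assumptions, not the repository's actual evaluate_performance schema.

```python
from typing import Callable, Dict, List, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """True positives divided by predicted positives (0.0 if none predicted)."""
    predicted_pos = sum(p == 1 for p in y_pred)
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return true_pos / predicted_pos if predicted_pos else 0.0


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """True positives divided by actual positives (0.0 if there are none)."""
    actual_pos = sum(t == 1 for t in y_true)
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return true_pos / actual_pos if actual_pos else 0.0


METRICS: Dict[str, Callable[[Sequence[int], Sequence[int]], float]] = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
}


def evaluate(
    y_true: Sequence[int], y_pred: Sequence[int], metric_names: List[str]
) -> Dict[str, float]:
    """Compute only the metrics requested by name."""
    return {name: METRICS[name](y_true, y_pred) for name in metric_names}


y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
scores = evaluate(y_true, y_pred, ["accuracy", "precision"])
print(scores)  # {'accuracy': 0.75, 'precision': 1.0}
```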

AWS

Modify the aws section of config.yaml to set the desired bucket name and prefixes.
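
To illustrate how a prefix shapes where artifacts land in the bucket, the sketch below maps local artifact paths to S3 object keys. The config keys bucket_name and prefix, the file names, and the helper s3_keys_for_artifacts are hypothetical. The upload itself is left out; if the pipeline uses boto3 (which this README does not state), each pair could be passed to the S3 client's upload_file(Filename, Bucket, Key).

```python
import posixpath
from pathlib import Path
from typing import Dict, Iterable


def s3_keys_for_artifacts(prefix: str, artifact_paths: Iterable[str]) -> Dict[str, str]:
    """Map each local artifact path to the S3 object key it would be stored under."""
    clean_prefix = prefix.strip("/")  # S3 keys should not start with a slash
    return {
        path: posixpath.join(clean_prefix, Path(path).name)
        for path in artifact_paths
    }


# Hypothetical aws section, shaped like parsed YAML.
aws_cfg = {"bucket_name": "my-bucket", "prefix": "experiments/run-1/"}
keys = s3_keys_for_artifacts(
    aws_cfg["prefix"], ["artifacts/model.pkl", "artifacts/metrics.yaml"]
)
for local_path, key in keys.items():
    print(f"{local_path} -> s3://{aws_cfg['bucket_name']}/{key}")
```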
