Scaling ML models with Taipy and Dask

This project uses Taipy to build a pipeline that creates a synthetic dataset with Dask and runs K-Means on it in parallel. The results are then displayed in a web app.

Taipy is well suited to managing and displaying the results of Dask applications: its back end is built for large-scale applications and handles caching, parallelization, scenario management, pipeline versioning, and data scoping.

Why Taipy?

Taipy is an open-source Python library that covers both the front end and the back end:

Taipy GUI helps create web apps quickly using only Python code
Taipy Core manages data pipelines through a visual editor where parallelization, caching, and scoping are easily defined

Why Dask?

Most estimators in scikit-learn are designed to work with NumPy arrays or scipy sparse matrices. These data structures must fit in the RAM on a single machine.

Estimators implemented in Dask-ML work with Dask Arrays and DataFrames. These collections can be much larger than a single machine's RAM because they can be distributed in memory across a cluster of machines.
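As a rough illustration of the chunking idea (not code from this repository), the sketch below builds a Dask array split into blocks and reduces it block by block. The array size and chunk size are arbitrary values chosen for the example.

```python
import dask.array as da

# A 1,000,000 x 2 array split into 10 blocks of 100,000 rows each.
# Nothing is materialized here: Dask only records a task graph.
x = da.random.random((1_000_000, 2), chunks=(100_000, 2))

# The mean is computed per block, then the partial results are combined.
column_means = x.mean(axis=0).compute()

print(x.numblocks)         # number of blocks along each axis
print(column_means.shape)  # one mean per column
```

Because each block is processed independently, the same code can run on one machine or on a cluster without changes to the array logic.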

Data Pipeline

The Data Pipeline is built using Taipy Studio in VSCode and looks like this:

Blue nodes are Data Nodes that store Python variables or datasets:

centers is an Int with the number of clusters in the dataset we'll create
n_clusters is an Int with the number of clusters we want K-Means to find
dataset is the Dask array of synthetic data created
km is the dask_ml K-Means model

Between the data nodes (in blue) are Task Nodes (in orange). Task Nodes take Data Nodes as inputs and return Data Nodes as outputs using Python functions.

These Task Nodes are combined into a pipeline using a green node called the Pipeline Node, which is the entry point of the pipeline. (Note that Taipy allows several Pipeline Nodes to co-exist.)

When running the pipeline, Taipy reads the centers and n_clusters values, creates a synthetic dataset using Dask, and fits a K-Means model on it.

Web App

The web app is built using Taipy GUI and looks like this:

The app allows you to select a number of clusters using a slider, and the resulting dataset and K-Means clustering will be displayed on a scatter plot.

How to Run

Clone the repository

git clone https://github.com/AlexandreSajus/Taipy-Dask-ML-Demo.git

Install the requirements

pip install -r requirements.txt

Run the web app

python app.py
