This project uses Taipy to build a Dask pipeline that generates a synthetic dataset and runs K-Means on it in parallel. The results are then displayed in a web app.
Taipy is well suited to managing and displaying the results of Dask applications: its backend is built for large-scale applications and handles caching, parallelization, scenario management, pipeline versioning, and data scoping.
Taipy is an open-source Python library that covers both the front end and the back end:
- Taipy GUI helps create web apps quickly using only Python code
- Taipy Core manages data pipelines through a visual editor where parallelization, caching, and scoping are easily defined
Most estimators in scikit-learn are designed to work with NumPy arrays or SciPy sparse matrices. These data structures must fit in the RAM of a single machine.
Estimators implemented in Dask-ML work well with Dask Arrays and DataFrames. These can be much larger than a single machine’s RAM, because they can be distributed in memory across a cluster of machines.
The Data Pipeline is built using Taipy Studio in VSCode and looks like this:
Blue nodes are Data Nodes that store Python variables or datasets:
- centers is an Int with the number of clusters in the dataset we'll create
- n_clusters is an Int with the number of clusters we want K-Means to find
- dataset is the Dask array of synthetic data created
- km is the dask_ml K-Means model
Between the data nodes (in blue) are Task Nodes (in orange). Task Nodes take Data Nodes as inputs and return Data Nodes as outputs using Python functions.
These Task Nodes are combined into a pipeline using a green node called the Pipeline Node, which is the entry point of the pipeline. (Note that Taipy allows several Pipeline Nodes to co-exist.)
When the pipeline runs, Taipy reads the centers and n_clusters values, creates a synthetic dataset using Dask, and runs a K-Means model on it.
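The way Task Nodes connect Data Nodes can be pictured with a small plain-Python sketch. This is not Taipy's API, and it leaves out everything Taipy adds (caching, scheduling, scoping). The `Task` tuple, the `run_pipeline` helper, and the two placeholder functions are hypothetical names made up for illustration; only the Data Node names `centers`, `n_clusters`, `dataset`, and `km` come from the pipeline described above.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class Task(NamedTuple):
    """Hypothetical stand-in for a Task Node."""
    function: Callable[..., Any]
    inputs: List[str]  # names of the input Data Nodes
    output: str        # name of the output Data Node


def run_pipeline(tasks: List[Task], data_nodes: Dict[str, Any]) -> Dict[str, Any]:
    """Run tasks in order, reading inputs from and writing outputs to the nodes."""
    nodes = dict(data_nodes)  # copy so the caller's dict is left untouched
    for task in tasks:
        args = [nodes[name] for name in task.inputs]
        nodes[task.output] = task.function(*args)
    return nodes


# Placeholder functions standing in for the Dask steps; they only return
# descriptions so the data flow is visible without any dependencies.
def generate_data(centers: int) -> str:
    return f"dataset with {centers} blobs"


def fit(dataset: str, n_clusters: int) -> str:
    return f"K-Means({n_clusters}) fitted on {dataset}"


tasks = [
    Task(generate_data, ["centers"], "dataset"),
    Task(fit, ["dataset", "n_clusters"], "km"),
]
result = run_pipeline(tasks, {"centers": 3, "n_clusters": 3})
print(result["km"])  # K-Means(3) fitted on dataset with 3 blobs
```

In the real project, the placeholder functions would be replaced by functions that build the Dask array and fit the dask_ml model, and Taipy would take care of the wiring.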
The web app is built using Taipy GUI and looks like this:
The app allows you to select a number of clusters using a slider, and the resulting dataset and K-Means clustering will be displayed on a scatter plot.
- Clone the repository
git clone https://github.com/AlexandreSajus/Taipy-Dask-ML-Demo.git
- Install the requirements
pip install -r requirements.txt
- Run the web app
python app.py