
Examples of use

Lidia edited this page Sep 17, 2025 · 12 revisions

We provide several examples to demonstrate the applicability of the tool across different domains: reading data from different data sources, performing several cleaning operations, using diverse training algorithms, and computing various evaluation metrics.

Example details

The first two pipelines were developed to evaluate the applicability of the tool: the Diabetes example uses a well-known toy dataset provided by the sklearn module (data read from a CSV file), and the Big Mart Sales example was developed to participate in a popular hackathon. The Iris Classification, Purchase Experience Score, and Image Classification pipelines were generated to replicate three existing ML pipelines hosted on GitHub.

| Example | Domain | Cleanings | Training algorithm | Evaluation metrics | URL |
|---|---|---|---|---|---|
| Diabetes | Health | No cleaning | SVM | Accuracy | www4.stat.ncsu.edu/~boos/var.select/diabetes.html |
| Big Mart Sales | Sales prediction | Replace nulls with mean and mode | Random Forest Regression | Accuracy | www.analyticsvidhya.com/datahack/contest/practice-problem-big-mart-sales-iii/ |
| Iris classification | Botany | No cleaning | Random Forest Regression, SVM | Accuracy | www.github.com/Ernesto905/Zenml-Sentiment-Analysis-Pipeline (training_pipeline.py) |
| Purchase Experience Score | Sales reviews | Replace nulls with median or text | Linear Regression | Accuracy, MSE, RMSE, R2 | www.github.com/Akurati-Kaustiki/MLOPS-Assignment-Group17 |
| Image Classification | Images | No cleaning | SVM | Confusion matrix, Precision, Recall, F1-score | [svm-image-classification](https://github.com/ahmdmohamedd/svm-image-classification/tree/main) |

Evaluation

To evaluate the quality of the generated code, we used well-known tools such as Pylint and Radon. Pylint provides a quality score from 0 (worst) to 10 (best code quality). Radon computes the Cyclomatic Complexity (CC) and Maintainability Index (MI) metrics. CC measures how many independent paths exist in the code: the more branching logic (e.g., if, for, while, try), the higher the number, and the harder the code is to understand, test, and maintain. MI is a composite score from 0 to 100 that estimates how maintainable the code is; the higher the number, the better the maintainability. Both metrics complement the quantitative value with a qualitative label from A (the best qualification) to F (the worst).
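As an illustration, CC can be counted by hand as the number of decision points plus one. The snippet below is a hypothetical toy function, not taken from any of the generated pipelines:

```python
def grade(scores):
    """Assign a letter grade to each score.

    Decision points: the `for` loop, the `if`, and the `elif`
    give a cyclomatic complexity of 3 + 1 = 4.
    """
    labels = []
    for s in scores:      # decision point 1
        if s >= 90:       # decision point 2
            labels.append("A")
        elif s >= 50:     # decision point 3
            labels.append("B")
        else:             # `else` adds no new independent path
            labels.append("C")
    return labels

print(grade([95, 60, 10]))  # ['A', 'B', 'C']
```

Radon computes this same count automatically for every function, method, and class in a module.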

The following table presents the results of the three code quality metrics for the examples generated with the MLSToolbox Code Generator. The generated code obtains good values for all three metrics: PyLint scores close to 10, Cyclomatic Complexity values close to 1, and Maintainability Index values close to 100.

| Example | PyLint score | Cyclomatic complexity | Maintainability index |
|---|---|---|---|
| Diabetes | 9.25 | 1.40 (A) | 97.95 (A) |
| Big Mart Sales | 9.03 | 1.41 (A) | 97.96 (A) |
| Iris classification | 8.82 | 1.39 (A) | 97.62 (A) |
| Purchase Experience Score | 8.94 | 1.39 (A) | 97.29 (A) |
| Image Classification | 8.87 | 1.48 (A) | 93.36 (A) |

The following table provides the quality metric results for the example implementations that were not generated by the MLSToolbox Code Generator.

| Example | Tool | PyLint score | Cyclomatic complexity | Maintainability index |
|---|---|---|---|---|
| Diabetes | Kubeflow | 4.29 | 1 (A) | 74.55 (A) |
| Diabetes | ZenML | 7.53 | 1 (A) | 71.35 (A) |
| Diabetes | scikit-learn | 3.57 | 1 (A) | 88.85 (A) |
| Diabetes | -- | 0.55 | 2 (A) | 81.32 (A) |
| Big Mart Sales | Kubeflow | 0 | 1 (A) | 61.77 (A) |
| Big Mart Sales | ZenML | 5.49 | 2.5 (A) | 63.09 (A) |
| Big Mart Sales | scikit-learn | 6.94 | 1.25 (A) | 100 (A) |
| Big Mart Sales | -- | 0.55 | 1 (A) | 72.51 (A) |
| Iris classification | ZenML | 0 | 1.97 (A) | 97.77 (A) |
| Purchase Experience Score | -- | 5.66 | 3 (A) | 71.90 (A) |
| Image Classification | -- | 7.1 | 2 (A) | 84.732 (A) |

Conclusion

The results consistently show that MLSToolbox-generated code achieves the highest Pylint scores and a better MI than most of the other implementations, highlighting its design focus on maintainability and evolution. The slightly higher CC can be attributed to a modular design that prioritizes reusability and extensibility.
