# 10 Python Packages to Watch Out For in 2022
![](images/image.jpg)

### Introduction

Gone are the days when your machine learning stack consisted of only a few libraries like Scikit-learn, pandas, NumPy and Matplotlib. If you want to make yourself an asset as an engineer, you have to arm yourself with more recent additions to the ecosystem.

For this reason, this article will cover ten rising stars in the Python machine learning community that emerged to solve the fresh challenges of today's ML problems.

### 1. SHAP

Nowadays, you just have to accept that a machine learning solution must come with explainability. You can't simply go around handing people your black-box models. Companies and businesses demand explainability and transparency in their models, from the input stage to the output.

One look at the search trends of the past ten years for the term "explainable AI" will tell you everything:

![image.png](attachment:850dc1c1-0ece-482a-9d9d-4a1791be146e.png)

This sudden interest in Explainable AI comes from the need to interpret the outputs of today's increasingly complex models. As an example, check out Google Translate in action, one of the most commonly used language models in the world:

![](https://miro.medium.com/max/788/1*QrjBYM_t47CK4-TYQbaauQ.png)

Apparently, even such a powerful model can carry the biases that are so common among people. When translating from languages without gendered pronouns, these biases become crystal clear. The example above is in my native language, Uzbek, but similar results can be observed for Turkish and Persian:

![](https://miro.medium.com/max/1026/1*C7yN543qPek9yyJQSxAQ1A.png)
![](https://miro.medium.com/max/1051/1*A3WWWnhbKaUgHdLiRiHMUA.png)

I think you can now understand why there is still skepticism about machine learning solutions, no matter how good the results are. 

One of the best solutions to this problem is the SHAP library created by Scott M. Lundberg and Su-In Lee. SHapley Additive exPlanations (SHAP) is based on solid math from game theory and can explain virtually any machine learning model in any aspect imaginable. 
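To make the game-theory idea concrete, here is a minimal sketch that computes exact Shapley values by brute force for a toy model, without touching the SHAP library itself. The names `toy_model`, `baseline` and `shapley_values` are made up for illustration, and replacing "absent" features with a baseline value is a simplifying assumption on my part. This enumeration grows exponentially with the number of features, so it only works for tiny problems.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    # Hypothetical model: linear terms plus one interaction between features 0 and 1.
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[1] - 3.0 * x[2]


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x`, relative to `baseline`.

    A feature that is absent from a coalition is replaced by its baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                # Marginal contribution of feature i when it joins this coalition.
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

# Additivity: the contributions sum to the gap between the prediction and the baseline output.
gap = toy_model(x) - toy_model(baseline)
print(phi, sum(phi), gap)
```

The interaction term is split evenly between features 0 and 1, and the per-feature contributions add up to the difference between the prediction and the baseline output. That additivity is what the "Additive" in the name refers to.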

![](https://miro.medium.com/proxy/1*RE0R9AkgLiS8nw1BgqvyHg.png)

A major part of its mass appeal is its elegant visualizations of Shapley values, which can explain model outputs both in general and for individual predictions:

![](https://miro.medium.com/max/541/1*7VhOQ5b76wySh0r62xm_yw.png)
![](https://miro.medium.com/max/641/1*hIlrMPnWEDp3eZeG2NRzTg.png)
![](https://miro.medium.com/max/1132/1*FrG11Vid8M0e7LiGQKs5Sw.png)
![](https://miro.medium.com/max/1236/1*NzcJmJ8M_7OpMy6uRGkHoQ.png)

🌟GitHub Stars: 16.2K

📦Issues: 1.3K

🍴Forks: 2.5K

🔗Useful links: [docs](https://shap.readthedocs.io/en/latest/index.html), [comprehensive tutorial](https://towardsdatascience.com/how-to-explain-black-box-models-with-shap-the-ultimate-guide-539c152d3275?gi=b243d368fb2)

### 2. UMAP

The need for better dimensionality reduction algorithms was painfully evident as dataset sizes kept blowing up into the sky. 

PCA was fast but dumb - it simply reduced the number of dimensions of the dataset and didn't care much about underlying data structure. t-SNE tried to remedy that but it was sluggish for larger datasets.

Fortunately, in 2018, Leland McInnes and his colleagues introduced the UMAP (Uniform Manifold Approximation and Projection) algorithm to be the common ground between the two older methods. The UMAP Python package can be used to reduce the dimensions of tabular datasets in a smarter way. It tries to preserve the topological structure of the dataset as much as possible after the reduction.

Its results are practical and beautiful when visualized:

![](https://miro.medium.com/max/938/1*G2XHMEx4yvXaflB4T1nLKA.png)
![](https://miro.medium.com/max/938/1*rlfn-CugxKhmZ8G7sLJQJQ.png)
![](https://miro.medium.com/max/2813/1*PxQEBPiq7Wbuj-4HO5PmTA.png)

The package is very popular on Kaggle and its docs outline other interesting applications beyond dimensionality reduction, like blazing-fast outlier detection for larger datasets ([link](https://towardsdatascience.com/tricky-way-of-using-dimensionality-reduction-for-outlier-detection-in-python-4ee7665cdf99)).

🌟GitHub Stars: 5.6K

📦Issues: 313

🍴Forks: 633

🔗Useful links: [docs](https://umap-learn.readthedocs.io/en/latest/), [comprehensive tutorial](https://towardsdatascience.com/beginners-guide-to-umap-for-reducing-dimensionality-and-visualizing-100-dimensional-datasets-ff5590fb17be)

### 3 and 4. LightGBM and CatBoost

When the XGBoost library became stable in 2015, it quickly dominated tabular competitions at Kaggle. It was fast and could knock older methods cleanly out of the ballpark. However, it was not perfect.

Two billion-dollar companies, Microsoft and Yandex, got inspired by the work Tianqi Chen did on Gradient Boosted Machines and open-sourced LightGBM and CatBoost libraries. 

Their aim was clear - improve on the weaknesses of XGBoost. While LightGBM vastly reduced the memory footprint of boosted trees of XGBoost, CatBoost became even faster than XGBoost and reached impressive results with default parameters ([link to a benchmark](https://catboost.ai/#:~:text=Full%20news%20list-,Benchmarks,-Quality)). 

In Kaggle's State of Data Science and Machine Learning Survey of 2021, the two libraries were in the top seven of the most popular algorithms:

![image.png](attachment:178f7dd0-012f-44a2-bcc6-e4072962059b.png)

🌟GitHub Stars (LGBM, CB): 13.7K, 6.5K

📦Issues: 174, 363

🍴Forks: 3.5K, 1K 

🔗Useful links: [LGBM docs](https://lightgbm.readthedocs.io/en/latest/), [CB docs](https://catboost.ai/en/docs/), tutorials - [LGBM](https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5), [CB](https://catboost.ai/en/docs/concepts/tutorials)

### 5. BentoML

Now, let's talk about less painful ways of deploying machine learning models. 

Specifically, this section is about deploying models as API endpoints in the easiest way possible. In the past, people used web frameworks like Flask and Django (which, let's admit it, are pure torture) or something more tolerable like FastAPI.

However, they won't come anywhere close to this new player in the ecosystem. BentoML greatly simplifies the process of creating an API service, requiring only a few lines of code. It works with virtually any ML framework and can deploy them as API endpoints in a few minutes. Here is [an example API](https://pet-pawpularity.herokuapp.com/) I built for an image regression problem, which took me less than five minutes to deploy online with Heroku:

![image.png](attachment:b0e16199-39a5-42b9-afcb-88ac0ac634eb.png)

Getting predictions from this API is now only a single POST request away. 
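From the client side, that single POST request might look like the sketch below, which uses only the standard library. The host, the `/predict` route and the JSON payload are all placeholders I made up for illustration; the real route and input format depend on how the service was defined. The request is built but not sent, so the snippet runs without a live service.

```python
import json
from urllib import request


def build_prediction_request(base_url, payload):
    """Build (but do not send) a JSON POST request for a prediction endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        url=f"{base_url.rstrip('/')}/predict",  # hypothetical route name
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and payload; replace them with your own service and inputs.
req = build_prediction_request("http://localhost:3000", {"features": [0.1, 0.2, 0.3]})
print(req.get_method(), req.full_url)

# To actually send it once a service is running:
# with request.urlopen(req) as response:
#     prediction = json.loads(response.read())
```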

Even though BentoML was released last year and is still in beta, it has amassed a significant community:

🌟GitHub Stars: 3.5K

📦Issues: 395

🍴Forks: 53

🔗Useful links: [docs](https://docs.bentoml.org/en/latest/), [comprehensive tutorial](https://towardsdatascience.com/the-easiest-way-to-deploy-your-ml-dl-models-in-2022-streamlit-bentoml-dagshub-ccf29c901dac)

### 6 and 7. Streamlit and Gradio

A machine learning solution should be accessible to everyone. API deployment is only for the benefit of your coworkers, teammates and programmer friends. A model should have a user-friendly interface for the non-technical community as well.

Two of the fastest-developing packages for building such interfaces are Streamlit and Gradio. They both offer low-code Pythonic APIs to build web apps to showcase your models. They provide simple Python functions to create HTML components for taking different types of inputs to your model, such as image, text, video, speech, sketches, etc. 
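To give a feel for how little code such an interface needs, here is a minimal Gradio sketch. The `classify` function is a hypothetical stand-in for a real model (it just counts a few hand-picked words), and `build_demo` is a helper name I made up. It assumes `gradio` is installed when `build_demo()` is called.

```python
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def classify(text):
    """Stand-in for a real model: compares counts of positive and negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def build_demo():
    # Imported here so the prediction function above stays usable without Gradio.
    import gradio as gr

    # A text box for the input, a text box for the output, wired to `classify`.
    return gr.Interface(fn=classify, inputs="text", outputs="text")


print(classify("I love this great library"))

# To serve the app locally in the browser:
# build_demo().launch()
```

Swapping `classify` for a function that calls your own model (or your API service) is all it takes to put a UI in front of it.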

I especially like Streamlit, as it can be used to tell beautiful data stories with its rich media tools. You can check out an [example app](https://share.streamlit.io/bextuychiev/pet_pawpularity/ui/src/ui.py) I've deployed with Streamlit using the API service I created with BentoML.

![](https://miro.medium.com/proxy/1*61yvVw-rD6RIu96yIqu7gw.gif)

I believe that combining an API service like BentoML with UI tools like Streamlit or Gradio is the best way to deploy machine learning models in 2022. You can check out [this article](https://towardsdatascience.com/the-easiest-way-to-deploy-your-ml-dl-models-in-2022-streamlit-bentoml-dagshub-ccf29c901dac) where I show how I built the above app from scratch.

🌟GitHub Stars (Streamlit, Gradio): 18.9K, 6.6K

📦Issues: 264, 119

🍴Forks: 1.7K, 422

🔗Useful links: [Streamlit docs](https://docs.streamlit.io/), [Gradio docs](https://gradio.app/docs/), tutorials - [Streamlit](https://docs.streamlit.io/knowledge-base/tutorials), [Gradio](https://gradio.app/guides/)

### 8. PyCaret

PyCaret is a low-code machine learning library that has been turning a lot of heads recently. Its main attraction stems from its ability to go from preparing your data to deploying a model within a few minutes in a notebook environment.

Using PyCaret, you can automate almost any stage of a machine learning pipeline with only a few lines of code. It combines some of the best features and algorithms from other popular packages like Scikit-learn, XGBoost, transformers, etc. 

It has separate sub-modules for classification, regression, NLP, clustering and anomaly detection and, since its latest release, a dedicated module for time series.

PyCaret is the go-to library if you want to concentrate on solving the problem, rather than the "how".

🌟GitHub Stars: 6.5K

📦Issues: 248

🍴Forks: 1.3K

🔗Useful links: [docs](https://pycaret.readthedocs.io/en/latest/), [tutorials](https://pycaret.org/tutorial/)

### 9. Optuna

![](https://raw.githubusercontent.com/optuna/optuna/master/docs/image/optuna-logo.png)

For the cherry on top, we have Optuna. It is the library that's making other hyperparameter tuning methods obsolete on Kaggle. 

Optuna is a Bayesian hyperparameter tuning library that works with virtually any ML framework. It has numerous advantages over its rivals:

- Platform-agnostic design - works with almost any model
- Pythonic search space - hyperparameters can be defined with conditionals and loops
- A large set of state-of-the-art tuning algorithms, available to change with a single keyword
- Easy and efficient parallelization - scale across available resources through an argument
- Visualization - plot tuning experiments, see hyperparameter importances

Optuna's API is based on objects called studies and trials. Combined, they give you the ability to control how long a tuning session runs, to pause and resume it, etc. It is the best library to squeeze the last drops of performance from your models.

🌟GitHub Stars: 6.3K

📦Issues: 108

🍴Forks: 701

🔗Useful links: [docs](https://optuna.readthedocs.io/en/stable/), [comprehensive tutorial](https://towardsdatascience.com/why-is-everyone-at-kaggle-obsessed-with-optuna-for-hyperparameter-tuning-7608fdca337c)

### 10. Data Version Control - DVC

![image.png](attachment:2923cfdc-ab67-43d5-8a86-ab461f9b74d9.png)

Finally, have you heard about "Git for data"? That's what DVC is - versioning and managing your massive data files and models as easily as Git manages your codebase. 

Git fails spectacularly at versioning large files, which greatly hindered the progress of open-source data science. Data scientists needed a system to keep track of changes made to both code and data and work on experiments in isolated branches without duplicating data sources. 

Data Version Control (DVC) by Iterative.ai made this all possible. With a simple remote or local repo to store the data, DVC can capture changes to data and models just like code and track metrics and model artifacts to monitor experiments. 

It becomes a game-changing tool when combined with DagsHub (i.e. GitHub for data scientists) since the platform offers free storage for DVC and can be configured with a single CLI command. 

🌟GitHub Stars: 9.7K

📦Issues: 619

🍴Forks: 924

🔗Useful links: [docs](https://dvc.org/doc), [comprehensive tutorial](https://realpython.com/python-data-version-control/), [sample project made with DVC and DagsHub](https://dagshub.com/BexTuychiev/pet_pawpularity)

### Conclusion

The data science and machine learning community is vibrant and ever-growing. There are new tools being developed faster than you can say "Artificial Intelligence." In this article, I simply tried to narrow your focus to the most promising and useful ones in development in 2022. Thank you for reading!