# 9 Python Packages to Watch Out For in 2022
![](images/image.jpg)

### Introduction

Gone are the days when your machine learning stack consisted of only a few libraries like Scikit-learn, pandas, NumPy and Matplotlib. If you want to make yourself an asset as an engineer, you have to arm yourself with more recent additions to the ecosystem.

For this reason, this article will cover nine rising stars in the Python machine learning community that emerged to solve the fresh challenges of today's ML problems.

### 1. SHAP

Nowadays, you just have to accept that a machine learning solution must come with explainability. You can't simply go around handing people your black-box models. Companies and businesses demand explainability and transparency in their models from the input stage to the output.

One look at search trends of the past ten years for the term "explainable AI" will tell you everything:

![image.png](attachment:2dc8882e-0c63-408d-96dc-79b7baedc36c.png)

This sudden interest in Explainable AI comes from the need to interpret the outputs of today's increasingly complex models. As an example, check out Google Translate in action, one of the most commonly used language models in the world:

![](https://miro.medium.com/max/788/1*QrjBYM_t47CK4-TYQbaauQ.png)

Apparently, even such a powerful model can contain the biases so common among people. When translating from languages without gendered pronouns, these biases become crystal clear. The example above was in my native language, Uzbek, but similar results can be observed for Turkish and Persian:

![](https://miro.medium.com/max/1026/1*C7yN543qPek9yyJQSxAQ1A.png)
![](https://miro.medium.com/max/1051/1*A3WWWnhbKaUgHdLiRiHMUA.png)

I think you can now understand why there is still skepticism about machine learning solutions, no matter how good the results are. 

One of the best solutions to this problem is the SHAP library created by Scott M. Lundberg and Su-In Lee. SHapley Additive exPlanations (SHAP) is based on solid math from game theory and can explain virtually any machine learning model in any aspect imaginable. 
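To make the game-theory part a bit more concrete, here is a minimal sketch of the Shapley value definition itself, written in plain Python. It is not how the SHAP library computes things internally (the library relies on much faster approximations), and the `toy_model` function and its feature names are made up purely for illustration. Each feature is treated as a "player", and its Shapley value is its average marginal contribution over every possible ordering of the players:

```python
from itertools import permutations


def shapley_values(value_fn, players):
    """Exact Shapley values: average each player's marginal
    contribution over all possible orderings of the players."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value_fn(frozenset(coalition))
            coalition.add(p)
            after = value_fn(frozenset(coalition))
            totals[p] += after - before
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical toy "model": the score it produces when only
# the given subset of features is known.
def toy_model(features):
    score = 0.0
    if "age" in features:
        score += 10.0
    if "income" in features:
        score += 20.0
    if "age" in features and "income" in features:
        score += 6.0  # interaction shared by both features
    return score


phi = shapley_values(toy_model, ["age", "income", "city"])
print(phi)  # {'age': 13.0, 'income': 23.0, 'city': 0.0}
```

Notice that the three values add up to 36, which is exactly the difference between the model's output with all features and with none. That additive property is what lets SHAP split a single prediction into per-feature contributions.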

![](https://miro.medium.com/proxy/1*RE0R9AkgLiS8nw1BgqvyHg.png)

A major part of its mass appeal is its elegant visualization of Shapley values, which can explain model outputs both generally and individually:

![](https://miro.medium.com/max/541/1*7VhOQ5b76wySh0r62xm_yw.png)
![](https://miro.medium.com/max/641/1*hIlrMPnWEDp3eZeG2NRzTg.png)
![](https://miro.medium.com/max/1132/1*FrG11Vid8M0e7LiGQKs5Sw.png)
![](https://miro.medium.com/max/1236/1*NzcJmJ8M_7OpMy6uRGkHoQ.png)

🌟GitHub Stars: 16.2K

📦Issues: 1.3K

🍴Forks: 2.5K

🔗Useful links: [docs](https://shap.readthedocs.io/en/latest/index.html), [comprehensive tutorial](https://towardsdatascience.com/how-to-explain-black-box-models-with-shap-the-ultimate-guide-539c152d3275?gi=b243d368fb2)

### 2. UMAP

The need for better dimensionality reduction algorithms was painfully evident as dataset sizes kept blowing up into the sky. 

PCA was fast but dumb - it simply reduced the number of dimensions of the dataset and didn't care much about the underlying data structure. t-SNE tried to remedy that, but it was sluggish on larger datasets.

Fortunately, in 2018, Leland McInnes and his colleagues introduced the UMAP (Uniform Manifold Approximation and Projection) algorithm to be the common ground between the two older methods. The UMAP Python package can be used to reduce the dimensions of tabular datasets in a smarter way. It tries to preserve the topological structure of the dataset as much as possible after the reduction.

Its results are practical and often beautiful when visualized:

![](https://miro.medium.com/max/938/1*G2XHMEx4yvXaflB4T1nLKA.png)
![](https://miro.medium.com/max/938/1*rlfn-CugxKhmZ8G7sLJQJQ.png)
![](https://miro.medium.com/max/2813/1*PxQEBPiq7Wbuj-4HO5PmTA.png)

The package is popular on Kaggle, and its docs outline other interesting applications beyond dimensionality reduction, like blazing-fast outlier detection for larger datasets ([link](https://towardsdatascience.com/tricky-way-of-using-dimensionality-reduction-for-outlier-detection-in-python-4ee7665cdf99)).

🌟GitHub Stars: 5.6K

📦Issues: 313

🍴Forks: 633

🔗Useful links: [docs](https://umap-learn.readthedocs.io/en/latest/), [comprehensive tutorial](https://towardsdatascience.com/beginners-guide-to-umap-for-reducing-dimensionality-and-visualizing-100-dimensional-datasets-ff5590fb17be)

### 3 and 4. LightGBM and CatBoost

When the XGBoost library became stable in 2015, it quickly dominated tabular competitions at Kaggle. It was fast and could knock older methods cleanly out of the ballpark. However, it was not perfect.

Two billion-dollar companies, Microsoft and Yandex, got inspired by the work Tianqi Chen did on Gradient Boosted Machines and open-sourced the LightGBM and CatBoost libraries, respectively.

Their aim was clear - improve on the weaknesses of XGBoost. While LightGBM vastly reduced the memory footprint of XGBoost's boosted trees, CatBoost became even faster than XGBoost and reached impressive results with default parameters ([link to a benchmark](https://catboost.ai/#:~:text=Full%20news%20list-,Benchmarks,-Quality)).

In Kaggle's State of Data Science and Machine Learning Survey of 2021, the two libraries were in the top seven of the most popular algorithms:

![image.png](attachment:178f7dd0-012f-44a2-bcc6-e4072962059b.png)

🌟GitHub Stars (LGBM, CB): 13.7K, 6.5K

📦Issues: 174, 363

🍴Forks: 3.5K, 1K 

🔗Useful links: [LGBM docs](https://lightgbm.readthedocs.io/en/latest/), [CB docs](https://catboost.ai/en/docs/), tutorials - [LGBM](https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5), [CB](https://catboost.ai/en/docs/concepts/tutorials)

### 5. BentoML

Now, let's talk about less painful ways of deploying machine learning models. 

### 6 and 7. Streamlit and Gradio

### 8. PyCaret

### 9. Optuna

For the cherry on top, we have Optuna. It is the library that's making other hyperparameter tuning methods obsolete on Kaggle. 

Optuna is a Bayesian hyperparameter tuning library that works with virtually any ML framework. It has numerous advantages over its rivals:

- Platform-agnostic design, works with almost any model
- Pythonic search space - hyperparameters can be defined with conditionals and loops
- A large set of state-of-the-art tuning algorithms, switchable with a single keyword
- Easy and efficient parallelization - scale across available resources through an argument
- Visualization - plot tuning experiments, see hyperparameter importances

Optuna's API is based on objects called studies and trials. Combined, they give you the ability to control how long a tuning session runs, to pause and resume it, and so on.

🌟GitHub Stars: 6.3K

📦Issues: 108

🍴Forks: 701

🔗Useful links: [docs](https://optuna.readthedocs.io/en/stable/), [comprehensive tutorial](https://towardsdatascience.com/why-is-everyone-at-kaggle-obsessed-with-optuna-for-hyperparameter-tuning-7608fdca337c)

### Conclusion