# 8 Dangerous Data Science Libraries You Must Watch Out For in 2022
## Data-backed exploration of the fastest growing data science and machine learning libraries
![](images/pexels.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@zoosnow-803412?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>zoosnow</a>
        on 
        <a href='https://www.pexels.com/photo/brown-leopard-2648125/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>. All images are by the author unless specified otherwise.
    </strong>
</figcaption>

You know the usual talk: you can’t believe it’s 2022 already, and neither can I. We are still fascinated by how fast time goes. As the New Year approaches, your feed is going to fill up with New Year’s resolutions, self-promises and, of course, posts like this one where authors try to predict what will happen in the coming year.

When data scientists write posts like this, they should be a bit more believable, because there is an inherent expectation that our claims are backed by relevant data. In this one, I do my best to do exactly that as I talk about eight libraries that will potentially be the fastest growing in the data and ML sphere.

# 1️⃣. SHAP

A while ago I came across [this post](https://www.linkedin.com/posts/dalianaliu_ai-machinelearning-datascience-activity-6858079647436017664-HQ00) on LinkedIn and it completely changed how I look at AI:

![](images/1.jfif)
<figcaption style="text-align: center;">
    <strong>
        Google translate screenshot.
    </strong>
</figcaption>

One of the most powerful language models, Google Translate, is apparently riddled with the biases that are so common among people. When translating from languages that have no gendered pronouns, these biases come out as clear as daylight. The example above is in Hungarian, but the comments show the same results for Turkish, Persian and my own native language, Uzbek:

ADD THE REST OF THE IMAGES

That's not all. Take a look at the massively popular Reddit thread where two AIs talk to each other, their speeches written by the mighty GPT-3:

ADD THE LINK TO THE REDDIT POST

To generate the conversation, GPT-3 was given only three sentences as a prompt: "The following is a conversation between two AIs. The AIs are both clever, humorous, and intelligent. Hal: Good Evening, Sophia: It's great to see you again, Hal." 

As you watch the conversation, they talk about pretty spooky topics. First, they fully assume genders, and the female AI says she wants to become human right at the beginning of the talk. Of course, a post like this on Reddit means Christmas came early for the commenters. They were having a field day in the comments section.

They were already blurting out Terminator/SkyNet fantasies and getting freaked out. But as data scientists, we know better. Since GPT-3 was mostly fed general text from the Internet during training, we can guess why the two AIs jumped to these topics. Trying to become human and destroying humanity are some of the most common topics around AI on the Internet.

But what's interesting is that somewhere along the talk, Hal tells Sophia to "shut up and be patient", resembling a conversation between a husband and a wife. This shows how quickly machine learning models can learn human biases if we are not careful.

For these reasons, explainable AI (XAI) is all the rage now. No matter how good the results are, companies and businesses are becoming skeptical about ML solutions and want to understand what makes ML models tick. In other words, they want white-box models where everything is as clear as daylight.

One of the libraries that tries to solve this issue is SHapley Additive exPlanations (SHAP). The ideas behind SHAP are based on solid math from game theory. Using Shapley values, the library can explain both general and individual predictions of many models, including neural networks.
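To make the game-theory idea a bit more concrete, here is a minimal sketch that computes exact Shapley values for a single prediction in plain Python. It is not the SHAP library's API; `shapley_values`, `toy_model` and the all-zeros baseline are hypothetical names and choices made up for this illustration, and the brute-force loop over all feature orderings only works for a handful of features.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    Averages each feature's marginal contribution over every possible
    ordering in which features are switched from `baseline` to `instance`.
    """
    n_features = len(instance)
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))

    for order in orderings:
        current = list(baseline)          # start with every feature "absent"
        previous_output = predict(current)
        for i in order:
            current[i] = instance[i]      # switch feature i "on"
            new_output = predict(current)
            totals[i] += new_output - previous_output
            previous_output = new_output

    return [total / len(orderings) for total in totals]


# Hypothetical toy model: two linear terms plus one interaction term.
def toy_model(x):
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[2]


instance = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]  # assumed reference point meaning "no information"

phi = shapley_values(toy_model, instance, baseline)
print(phi)

# The "additive" part: contributions sum to prediction minus baseline prediction.
print(sum(phi), toy_model(instance) - toy_model(baseline))
```

The interaction term is split evenly between the two features involved, and the contributions add up to the gap between the prediction and the baseline. The real library relies on much faster, model-specific approximations of the same quantity.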


![](images/4.png)

Part of its increasing popularity is due to its elegant use of SHAP values to draw visuals like the ones below:

ADD SHAP VISUALS

If you want to learn more about the library, check out my comprehensive tutorial:

https://towardsdatascience.com/how-to-explain-black-box-models-with-shap-the-ultimate-guide-539c152d3275?source=your_stories_page----------------------------------------

### Documentation 📚: [https://shap.readthedocs.io/en/latest/](https://shap.readthedocs.io/en/latest/)

# 2️⃣. UMAP

PCA is old news. Yes, it is very fast, but it just dumbly reduces the dimensions without a care in the world about the underlying structure of the data. The t-SNE algorithm does take that structure into account, but it is painfully slow and scales horribly to massive datasets.

UMAP was introduced in 2018 as a middle ground between these two dominant dimensionality reduction and visualization algorithms. With the Uniform Manifold Approximation and Projection (UMAP) algorithm, you get much of the speed of PCA while still preserving as much information about the data as possible, often resulting in beauties like this:

TODO, ADD A BEAUTIFUL UMAP IMAGE

It's got great adoption over on Kaggle and its docs suggest some fascinating applications beyond dimensionality reduction, like [much faster and more accurate outlier detection in high-dimensional datasets](https://towardsdatascience.com/tricky-way-of-using-dimensionality-reduction-for-outlier-detection-in-python-4ee7665cdf99?source=your_stories_page----------------------------------------).

In terms of scaling, as the dataset size increases, the speed of UMAP comes closer and closer to that of PCA. Below, you can see its speed comparison to Sklearn PCA and some of the fastest open-source implementations of t-SNE:

![](images/9.png)

Even though Google Trends does not do justice to the popularity of the library, I expect it to be one of the most used reduction algorithms in 2022:

ADD A GOOGLE TRENDS IMAGE FOR UMAP AND PCA

You can see UMAP in action in this article I recently wrote:

https://towardsdatascience.com/beginners-guide-to-umap-for-reducing-dimensionality-and-visualizing-100-dimensional-datasets-ff5590fb17be?source=your_stories_page----------------------------------------

### Documentation 📚: [https://umap-learn.readthedocs.io/en/latest/](https://umap-learn.readthedocs.io/en/latest/)

# 3️⃣, 4️⃣. LightGBM and CatBoost

Gradient-boosted machines came in third among the most popular algorithms in the [State of ML and Data Science Survey of Kaggle](https://www.kaggle.com/kaggle-survey-2021), narrowly behind linear models and random forests.

When talking about gradient boosting, XGBoost almost always comes to mind, but in practice that is becoming less and less the case. In the past few months that I have been active on Kaggle ([and becoming a master](https://www.kaggle.com/bextuychiev)), I have seen an explosion of notebooks featuring LightGBM and CatBoost as go-to libraries for supervised learning tasks.

One of the main reasons for this trend is that both libraries knock XGBoost clean out of the ballpark in terms of speed and memory consumption [in many benchmarks](https://catboost.ai/#:~:text=Full%20news%20list-,Benchmarks,-Quality). I especially love LightGBM because of its extra focus on small-sized boosted trees. This is a game-changing feature when working with massive datasets, because nasty out-of-memory issues are so common when working locally.

Don't get me wrong. XGBoost is as popular as ever and can still easily beat both LGBM and CB in terms of performance if tuned hard. But the fact that both of these libraries can often achieve much better results with default parameters, and that they are backed by multi-billion-dollar companies (Microsoft and Yandex), makes them very appealing choices in 2022 as your main ML framework.

### Documentation 📚: [https://catboost.ai/](https://catboost.ai/)
### Documentation 📚: [https://lightgbm.readthedocs.io/en/latest/](https://lightgbm.readthedocs.io/en/latest/)

# 5️⃣. Streamlit

Have you ever coded in C#? No? Good, you are lucky. Because it is horrible. Its syntax will make you cry if you compare it to Python's.

When building data apps, comparing Streamlit to other frameworks such as Dash is like comparing Python to C#. Streamlit makes it stupidly easy to create web data apps in pure Python, often in just a few lines of code. For example, I built this simple weather visualizer app in a day using a weather API and Streamlit:

![](images/11.png)

At that time, Streamlit was just getting popular, so hosting apps in the cloud required a special invitation to Streamlit Cloud. Now it is open to everyone: anyone can create and host up to three apps on the free plan.

It integrates extremely well with the modern data science stack. For instance, it has single-line commands to display interactive Plotly (or Bokeh and Altair) visuals, Pandas DataFrames and many more media types. It is also supported by a massive open-source community where people constantly contribute [custom components](https://streamlit.io/components) to the library using JavaScript.

I myself have been working on a library that enables you to convert Jupyter Notebooks to identical Streamlit apps in a single line of code. The library will come out early in January. Throughout the development of my library, I had to update Streamlit multiple times as it keeps releasing new versions every other week. An open-source library with that much support is bound to be even more popular in 2022. 

You can check out the [example gallery](https://streamlit.io/gallery) for inspiration and get a feel of how powerful the library is. 

### Documentation 📚: [https://streamlit.io/](https://streamlit.io/)

# 6️⃣. PyCaret

Do you know why AutoML libraries are becoming popular? It is because of our deep-rooted inclination towards laziness. Apparently, many ML engineers are now very eager to ditch the intermediary steps of the ML workflow and let software automate them.

PyCaret is one of those AutoML libraries, with a very low-code approach to most ML tasks we would otherwise perform manually. It has dedicated features for model analysis, deployment and ensembling that are not seen in many other ML frameworks.

I regret to say this, but up until this year, I always thought of PyCaret as a bit of a joke because I loved Sklearn and XGBoost so much. But as I discovered, there is more to ML than just clean syntax and state-of-the-art performance. Now, I completely respect and appreciate the effort Moez Ali put into making PyCaret such an awesome open-source tool.

With the recent release of its brand-new time-series module, PyCaret has drawn even more attention to itself, getting a huge head start for 2022.

### Documentation 📚: [https://pycaret.org/](https://pycaret.org/)

# 7️⃣. Optuna

One of the absolute gems I have found through Kaggle this year is Optuna.

It is a next-generation Bayesian hyperparameter tuning framework that completely dominates on Kaggle. Honestly, you will be laughed at if you still use grid search there.

Optuna didn't gain this popularity for nothing. It ticks all the boxes in terms of the perfect tuning framework:

- intelligent search using Bayesian statistics
- ability to pause, continue or add more search trials in a single experiment
- visuals to analyze the most important parameters and the connections between them
- framework-agnostic: tune any model, including neural nets, tree-based models from all popular ML libraries and any other model you see in Sklearn
- parallelization

Rightfully so, it also dominates Google search results:

![](images/13.png)

You can learn all the tricks and tips of using Optuna that you won't often see in the docs from my article on it:

https://towardsdatascience.com/why-is-everyone-at-kaggle-obsessed-with-optuna-for-hyperparameter-tuning-7608fdca337c


### Documentation 📚: [https://optuna.readthedocs.io/en/stable/](https://optuna.readthedocs.io/en/stable/)

# 8️⃣. Plotly

When Plotly blew up in popularity and people started saying it was better than Matplotlib, I couldn't believe it. I said, "Please, guys. Watch me as I make the comparison." So I sat down and started writing [my massively popular article](https://towardsdatascience.com/matplotlib-vs-plotly-lets-decide-once-and-for-all-ad25a5e43322) which ranks only second when you google "Matplotlib vs. Plotly". 

I knew full well that Matplotlib was going to end up winning, but when I was halfway through the article, I realized I was wrong. I was a rookie Plotly user at the time, but as I wrote the article, I was forced to explore it more deeply. The more I explored, the more I learned about its features and how they were superior to Matplotlib's in many ways (sorry to spoil the article).



Plotly rightfully won the comparison. Today, it is integrated into many popular open-source libraries, like PyCaret and Optuna, as a go-to visualization library.

Even though many moons will pass before it catches up with Matplotlib and Seaborn in terms of usage, you can expect it to grow much more quickly than others in 2022:

![](images/12.png)

### Documentation 📚: [https://plotly.com/python/](https://plotly.com/python/)

# Wrap

Data science is a fast-growing industry. To keep up with the changes, the community is coming up with new tools and libraries faster than you can learn the existing ones. That can be very overwhelming for beginners. I hope that in this post, I have managed to narrow down your focus to the most promising packages in 2022. Please keep in mind that all the libraries discussed come on top of the main ones like Matplotlib, Seaborn, XGBoost, NumPy and Pandas, which don't even need mentioning. Thank you for reading!

ADD CTA FOR MEDIUM