# 8 Coolest Python Packages Kagglers Are Using Without Telling You
## Seriously, they are great
![](images/pexels.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@miphotography?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Miesha Maiden</a>
        on 
        <a href='https://www.pexels.com/photo/pineapple-with-brown-sunglasses-459601/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels.</a> All images are by author unless specified otherwise.
    </strong>
</figcaption>

## Setup

```python
import warnings

import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")
optuna.logging.set_verbosity(optuna.logging.WARNING)
```

## Introduction

## 1️⃣. UMAP

![](images/1.png)
<figcaption style="text-align: center;">
    <strong>
        <a href='https://www.kaggle.com/subinium/tps-jun-this-is-original-eda-viz/notebook?scriptVersionId=64865915&cellId=37'>Link to the code of the plot.</a>
    </strong>
</figcaption>

Above is a 100k-row dataset with 75 features projected to 2D using a package called UMAP. Each dot represents a single sample in a classification problem and is color-coded by its class.

Massive datasets like these can make you miserable during EDA, mainly because of the computation and time expenses they come with. So, it is important that each plot you create is spot-on and reveals something significant about the data. 

I think that's one of the reasons why UMAP (Uniform Manifold Approximation and Projection) is so well-received on Kaggle. It is efficient, low-code, and lets you take a real "look" at the data from a high-dimensional perspective:

<p float="left">
  <img src="https://umap-learn.readthedocs.io/en/latest/_images/plotting_21_2.png" width="250" height="250">
  <img src="https://umap-learn.readthedocs.io/en/latest/_images/plotting_32_2.png" width="250" height="250"> 
  <img src="https://umap-learn.readthedocs.io/en/latest/_images/plotting_34_2.png" width="250" height="250">
</p>

When I look at plots like these, they remind me of why I got into data science in the first place - data is beautiful!

### 🛠 Github and documentation
- https://umap-learn.readthedocs.io/en/latest/
- https://github.com/lmcinnes/umap

### 🔬 Papers
- [UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction](https://arxiv.org/abs/1802.03426)

### 💻 Demo

UMAP offers an easy, Sklearn-compatible API. After importing the `umap` module, create a `UMAP` estimator and call its `fit` method on the feature and target arrays (`X`, `y`); by default, it projects the data to 2D:

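```python
import umap  # pip install umap-learn
import umap.plot  # plotting helpers need extras: pip install umap-learn[plot]

# Create the mapper
mapper = umap.UMAP()
# Fit to the data (passing y makes the projection supervised)
mapper.fit(X, y)

# Plot as a scatterplot, colored by class
umap.plot.points(mapper, labels=y)
```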

The most important parameters of the `UMAP` estimator are `n_neighbors` and `min_dist` (minimum distance). Think of `n_neighbors` as a handle that controls the zoom level of the projection: small values focus on local structure, large values on the global picture. `min_dist` is the minimum distance allowed between projected points.

If you wish to project to a higher dimension, you can tweak `n_components` just like in Sklearn's `PCA`.

## 2️⃣. Datatable

![](images/5.png)

As dataset sizes are getting bigger, people are paying more attention to out-of-memory, multi-threaded data preprocessing tools to escape the performance limitations of Pandas. 

One of the most promising tools in this regard is `datatable`, inspired by R's data.table package. It is developed by H2O.ai to support parallel-computing and out-of-memory operations on big data (up to 100 GB), as required by today's machine learning applications.

While `datatable` does not have as large a suite of tabular manipulation functions as pandas, it is found to heavily outperform pandas on most common operations. In an [experiment](https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets) done on a 100M row dataset, datatable manages to read the data into memory in just over a minute, 9 times faster than pandas.

### 🛠 Github and documentation
- https://github.com/h2oai/datatable
- https://datatable.readthedocs.io/en/latest/?badge=latest

### 💻 Demo

The main data structure in `datatable` is `Frame` (as in DataFrame). 

```python
import datatable as dt  # pip install datatable

frame = dt.fread("data/station_day.csv")
frame.head(5)
```

| | StationId | Date | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | … | Benzene | Toluene | Xylene | AQI | AQI_Bucket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AP001 | 2017-11-24 | 71.36 | 115.75 | 1.75 | 20.65 | 12.4 | 12.19 | 0.1 | 10.76 | … | 0.17 | 5.92 | 0.1 | | |
| 1 | AP001 | 2017-11-25 | 81.4 | 124.5 | 1.44 | 20.5 | 12.08 | 10.72 | 0.12 | 15.24 | … | 0.2 | 6.5 | 0.06 | 184.0 | Moderate |
| 2 | AP001 | 2017-11-26 | 78.32 | 129.06 | 1.26 | 26.0 | 14.85 | 10.28 | 0.14 | 26.96 | … | 0.22 | 7.95 | 0.08 | 197.0 | Moderate |
| 3 | AP001 | 2017-11-27 | 88.76 | 135.32 | 6.6 | 30.85 | 21.77 | 12.91 | 0.11 | 33.59 | … | 0.29 | 7.63 | 0.12 | 198.0 | Moderate |
| 4 | AP001 | 2017-11-28 | 64.18 | 104.09 | 2.56 | 28.07 | 17.01 | 11.42 | 0.09 | 19.0 | … | 0.17 | 5.02 | 0.07 | 188.0 | Moderate |


```python
type(frame)
```

```
datatable.Frame
```

## 3️⃣. Lazypredict

Lazypredict is one of the best one-liner packages I have ever seen.

Using the library, you can train almost all Sklearn models plus XGBoost and LightGBM in a single line of code. It only has two estimators - one for regression and one for classification. Fitting either one on a dataset with a given target will evaluate more than 30 base models and generate a report with their rankings on a number of popular metrics.

### 💻 Demo

```python
from lazypredict.Supervised import (  # pip install lazypredict
    LazyClassifier,
    LazyRegressor,
)
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load data and split
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit LazyRegressor
reg = LazyRegressor(ignore_warnings=True, random_state=1121218, verbose=False)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)  # pass all sets
```

```
100%|██████████| 42/42 [00:03<00:00, 13.08it/s]
```


```python
models.head(10)
```

| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
|---|---|---|---|---|
| XGBRegressor | 0.9 | 0.91 | 2.95 | 0.09 |
| GradientBoostingRegressor | 0.89 | 0.9 | 3.08 | 0.13 |
| RandomForestRegressor | 0.88 | 0.9 | 3.15 | 0.35 |
| ExtraTreesRegressor | 0.88 | 0.89 | 3.23 | 0.21 |
| AdaBoostRegressor | 0.85 | 0.87 | 3.49 | 0.17 |
| HistGradientBoostingRegressor | 0.85 | 0.87 | 3.54 | 0.61 |
| BaggingRegressor | 0.85 | 0.87 | 3.55 | 0.07 |
| LGBMRegressor | 0.85 | 0.87 | 3.59 | 0.09 |
| ExtraTreeRegressor | 0.81 | 0.83 | 4.02 | 0.02 |
| DecisionTreeRegressor | 0.79 | 0.81 | 4.22 | 0.02 |


A table like this frees you from the manual task of selecting a base model, leaving more time for tasks like feature engineering.

### 🛠 Github and documentation
- https://lazypredict.readthedocs.io/en/latest/index.html
- https://github.com/shankarpandala/lazypredict

## 4️⃣. Optuna

![](https://miro.medium.com/max/1400/0*IBkpkOCS0anhUHWp.png)

One of the more recent libraries I have added to my skill-stack is Kagglers' favorite - Optuna. 

Optuna is a next-generation automatic hyperparameter tuning framework, designed to work with virtually any model or neural network available in today's ML and deep learning packages.

It offers several advantages over similar tools like GridSearch, TPOT, HyperOPT, etc.:
- Platform-agnostic: has APIs to work with any framework, including XGBoost, LightGBM, CatBoost, Sklearn, Keras, TensorFlow, PyTorch, etc.
- A large suite of optimization algorithms with early stopping and pruning features baked in
- Easy parallelization with little or no changes to the code
- Built-in support for visually exploring the tuning history and the importance of each hyperparameter

My favorite feature is its ability to pause, resume, and save search histories. Optuna keeps track of all previous rounds of tuning, and you can resume the search for as long as it takes to reach the performance you want.

Besides, for massive datasets and long searches, you can make Optuna independent of RAM by storing results in a local or remote database, which takes just one extra parameter.

### 🛠 Github and documentation
- https://github.com/optuna/optuna
- https://optuna.readthedocs.io/en/stable/

### 🔬 Papers
- [Optuna: A Next-generation Hyperparameter Optimization Framework](https://arxiv.org/abs/1907.10902)

### 💻 Demo

```python
import optuna  # pip install optuna


def objective(trial):
    x = trial.suggest_float("x", -7, 7)
    y = trial.suggest_float("y", -7, 7)
    return (x - 1) ** 2 + (y + 3) ** 2


study = optuna.create_study()
study.optimize(objective, n_trials=100)  # number of iterations

study.best_params
```

```
[I 2021-08-05 17:00:58,948] A new study created in memory with name: no-name-e42a8b2d-395c-4859-aee4-009fc462453d
[I 2021-08-05 17:00:58,951] Trial 0 finished with value: 96.98038155604101 and parameters: {'x': 1.3813072644079387, 'y': 6.840476935908683}. Best is trial 0 with value: 96.98038155604101.
[I 2021-08-05 17:00:58,953] Trial 1 finished with value: 3.950690932180003 and parameters: {'x': 2.853574558327045, 'y': -3.7176017620537887}. Best is trial 1 with value: 3.950690932180003.
[I 2021-08-05 17:00:58,955] Trial 2 finished with value: 18.174361233669117 and parameters: {'x': 3.866378810808227, 'y': 0.15566692580486574}. Best is trial 1 with value: 3.950690932180003.
[I 2021-08-05 17:00:58,955] Trial 3 finished with value: 104.68880750024934 and parameters: {'x': -4.014968135417397, 'y': 5.918458504752797}. Best is trial 1 with value: 3.950690932180003.
...

{'x': 0.7648391277142319, 'y': -3.2718583720422925}
```