# 25 NumPy Functions You Never Knew Existed | P (Guarantee = 0.85)
## Become a NumPy Ninja
![](https://cdn-images-1.medium.com/max/1350/1*_Fr-DksmSUoyogXaNm-pCA.jpeg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@ilargian-faus-763704?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Ilargian Faus</a>
        on 
        <a href='https://www.pexels.com/photo/white-dog-wearing-sunglasses-1629780/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

# Setup

In [None]:
import logging
import time
import warnings

import catboost as cb
import datatable as dt
import joblib
import lightgbm as lgbm
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns
import shap
import umap
import umap.plot
import xgboost as xgb
from optuna.samplers import TPESampler
from sklearn.compose import *
from sklearn.impute import *
from sklearn.metrics import *
from sklearn.model_selection import *
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import *

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%d-%b-%y %H:%M:%S", level=logging.INFO
)
optuna.logging.set_verbosity(optuna.logging.WARNING)
warnings.filterwarnings("ignore")
pd.set_option("float_format", "{:.5f}".format)

# Motivation

Every data scientist admires someone. For some, it might be people who create killer data visualizations; for others, it is simply anyone who answers their StackOverflow questions. For me, it was people who used NumPy like a ninja.

I don't know. I have always thought that the ability to use a long-forsaken function buried deep inside the documentation on rare edge cases said a lot about a programmer's skill. Reinventing the wheel for a particular task is, of course, hard, but that's not always what you want.

This month, it was time to turn the tables and become a NumPy ninja myself. Along the way, I said, "Why don't I make others ninjas too?" So, here I am, with a list of the coolest yet rarely used NumPy functions that, when used, will surely surprise anyone reading your code.

![](https://cdn-images-1.medium.com/max/900/1*m0-c5e45bQH7bigxoSOnyQ.gif)

# 1️⃣. np.full_like

I bet you have used common NumPy functions like `ones_like` or `zeros_like`. Well, `full_like` is exactly like those two, except that it creates an array with the same shape as another one, filled with a custom value.

### 💻Demo

In [None]:
array = np.array([[1, 4, 6, 8], [9, 4, 4, 4], [2, 7, 2, 3]])

array_w_pi = np.full_like(array, fill_value=np.pi, dtype=np.float32)
array_w_pi

Here, we are creating a matrix of `pi`s, with the shape of `array`.
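One detail worth a quick sketch is why the demo passes `dtype`. By default, `full_like` reuses the dtype of the source array, so an integer source silently truncates a float fill value. The names `int_array`, `truncated` and `preserved` below are just for illustration:

```python
import numpy as np

int_array = np.array([[1, 4, 6], [9, 4, 4]])

# Without an explicit dtype, the result inherits the integer dtype of
# int_array, so the fractional part of pi is dropped.
truncated = np.full_like(int_array, fill_value=np.pi)

# Passing a float dtype keeps the fractional part.
preserved = np.full_like(int_array, fill_value=np.pi, dtype=np.float64)

print(truncated.dtype, truncated[0, 0])
print(preserved.dtype, preserved[0, 0])
```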

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.full_like.html)

# 2️⃣. np.logspace

I am sure you use `linspace` regularly. It creates a given number of linearly spaced data points within an interval. Its cousin `logspace` takes this a bit further: it generates a given number of points, evenly spaced on a logarithmic scale. Note that `start` and `stop` are exponents, so the sequence runs from `base ** start` to `base ** stop`. You can choose any positive number as a base:

### 💻Demo

In [None]:
log_array = np.logspace(start=1, stop=100, num=15, base=np.e)

log_array
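Because `start` and `stop` are exponents rather than endpoints, a smaller example may make the behavior easier to see. This is a minimal sketch with the default base of 10; the variable names are made up for illustration:

```python
import numpy as np

# start and stop are exponents: this spans 10**0 to 10**3 (base defaults to 10).
powers_of_ten = np.logspace(start=0, stop=3, num=4)

# The same points, built by hand from linearly spaced exponents.
by_hand = 10.0 ** np.linspace(0, 3, 4)

print(powers_of_ten)
print(np.allclose(powers_of_ten, by_hand))
```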

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.logspace.html)

# 3️⃣. np.meshgrid

This is one of those functions you only ever see in the documentation. For a while, I thought it wasn't intended for public use because I had such a hard time understanding it. Well, as always, StackOverflow to the rescue. According to this [thread](https://stackoverflow.com/questions/36013063/what-is-the-purpose-of-meshgrid-in-python-numpy), you can create every possible coordinate pair from given X and Y arrays using `meshgrid`. Here is a simple example:

In [None]:
x = [1, 2, 3, 4]
y = [3, 5, 6, 8]

xx, yy = np.meshgrid(x, y)

In [None]:
xx

In [None]:
yy

There will be 16 unique coordinate pairs, one for each index-to-index element pair in the resulting arrays. 
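If you want to see those pairs explicitly, one way (a sketch, not the only option) is to flatten both grids and put them side by side. The name `pairs` is just for illustration:

```python
import numpy as np

x = [1, 2, 3, 4]
y = [3, 5, 6, 8]
xx, yy = np.meshgrid(x, y)

# Pair up the elements that share an index in xx and yy.
pairs = np.column_stack((xx.ravel(), yy.ravel()))

print(pairs.shape)
print(pairs[:4])
```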

In [None]:
plt.plot(xx, yy, linestyle="none", marker="o", color="red");

Of course, `meshgrid` is usually used for more complex tasks that would take forever if done with loops. Plotting a contour graph of the 3D sine function is an example:

### 💻Demo

In [None]:
def sinus2d(x, y):
    return np.sin(x) + np.sin(y)


xx, yy = np.meshgrid(np.linspace(0, 2 * np.pi, 100), np.linspace(0, 2 * np.pi, 100))
z = sinus2d(xx, yy)  # Create the image on this grid

import matplotlib.pyplot as plt

plt.imshow(z, origin="lower", interpolation="none")
plt.show()

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html)

# 4️⃣. np.triu / np.tril

These two functions return a copy of a matrix with the elements below (`triu`) or above (`tril`) a given diagonal set to zero. For example, we can use `triu` to create a boolean mask with True values on and above the main diagonal and use this mask when plotting a correlation heatmap.

### 💻Demo

In [None]:
import seaborn as sns

diamonds = sns.load_dataset("diamonds")

# Keep only numeric columns; cut, color and clarity are categorical
matrix = diamonds.select_dtypes("number").corr()
mask = np.triu(np.ones_like(matrix, dtype=bool))

sns.heatmap(matrix, square=True, mask=mask, annot=True, fmt=".2f", center=0);

As you can see, a mask created with `triu` can be used to mask a correlation matrix, dropping the unnecessary upper triangle and the diagonal itself. This leaves a much more compact, and readable heatmap, devoid of clutter.
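To see what the two functions do on raw numbers, here is a small sketch on a 3x3 matrix. The optional `k` argument shifts the diagonal that is used as the boundary:

```python
import numpy as np

m = np.arange(1, 10).reshape(3, 3)

upper = np.triu(m)              # zeros below the main diagonal
strict_upper = np.triu(m, k=1)  # k=1 also zeroes the diagonal itself
lower = np.tril(m)              # zeros above the main diagonal

print(upper)
print(strict_upper)
print(lower)
```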

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.triu.html) - `np.triu`

# 5️⃣. np.ravel / np.flatten

NumPy is all about high-dimensional matrices and ndarrays. Sometimes, you just want to take those arrays and crush them into 1D. This is where you would use `ravel` or `flatten`:

### 💻Demo

In [None]:
array = np.random.randint(0, 10, size=(4, 5))
array

In [None]:
array.ravel()

In [None]:
array.flatten()

Do they look the same? Not exactly. `flatten` always returns a 1D copy while `ravel` tries to return a 1D view of the original array. So, be careful because modifying the returned array from `ravel` might change the original array. For more info about their differences, check out [this](https://stackoverflow.com/questions/28930465/what-is-the-difference-between-flatten-and-ravel-functions-in-numpy) StackOverflow thread.
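The view-versus-copy difference is easy to check for yourself. In this sketch, the array is contiguous, so `ravel` can return a view; `np.shares_memory` reports whether two arrays overlap in memory:

```python
import numpy as np

array = np.arange(6).reshape(2, 3)

raveled = array.ravel()      # a view here, because array is contiguous
flattened = array.flatten()  # always a copy

raveled[0] = 100   # also changes array[0, 0]
flattened[1] = -1  # leaves array untouched

print(array)
print(np.shares_memory(array, raveled), np.shares_memory(array, flattened))
```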

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.ravel.html)

# 6️⃣. np.vstack / np.hstack

On Kaggle, these two functions are used regularly. Often, people have multiple predictions for the test set from different models and they want to ensemble these predictions in some way. To make them easy to work with, they have to be combined into a single matrix. This allows easy ensembling using techniques like arithmetic or geometric averages.

### 💻Demo

In [None]:
array1 = np.arange(1, 11).reshape(-1, 1)
array2 = np.random.randint(1, 10, size=10).reshape(-1, 1)

hstacked = np.hstack((array1, array2))
hstacked

In [None]:
array1 = np.arange(20, 31).reshape(1, -1)
array2 = np.random.randint(20, 31, size=11).reshape(1, -1)

vstacked = np.vstack((array1, array2))
vstacked

Keep in mind that the result depends on the shape of the inputs. Given 1D arrays, `hstack` simply concatenates them into one longer 1D array, while `vstack` treats each one as a row. To place the arrays side by side as columns, we first turn them into 2D column vectors, which is why we used the `reshape` function. Here, `reshape(-1, 1)` converts the array to a single column with as many rows as needed.

Similarly, `reshape(1, -1)` converts the array to a single row vector with as many columns as needed.
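For comparison, here is a quick sketch of what happens when you pass plain 1D arrays without reshaping. I also included `np.column_stack`, which isn't covered above but produces the column layout directly:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

h = np.hstack((a, b))        # joined end to end into one 1D array
v = np.vstack((a, b))        # each 1D array becomes a row
c = np.column_stack((a, b))  # each 1D array becomes a column

print(h.shape, v.shape, c.shape)
```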

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.hstack.html)

# 7️⃣. np.r_ / np.c_

If you are lazy like me and don't want to call `reshape` on all your arrays, there is a much more elegant solution. The `np.r_` and `np.c_` objects (not functions!) concatenate arrays along the first axis and stack them as columns, respectively.

Below, we simulate two prediction arrays with 100 probabilities each. To join them end to end, we index `np.r_` with square brackets (like `pandas.DataFrame.loc`). Since the inputs are 1D, the result is a single 1D array of 200 values.

### 💻Demo

In [None]:
preds1 = np.random.rand(100)
preds2 = np.random.rand(100)

as_rows = np.r_[preds1, preds2]
as_cols = np.c_[preds1, preds2]

In [None]:
as_rows.shape

In [None]:
as_cols.shape

Similarly, `np.c_` stacks the arrays next to each other, creating a matrix of shape `(100, 2)`. However, their functionality isn't limited to simple horizontal and vertical stacks. They are much more powerful than that. For additional information, I suggest you read the docs!
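As a small taste of that extra power, `np.r_` also understands slice syntax inside the brackets. This is only a sketch of one feature; the variable names are illustrative:

```python
import numpy as np

# Slices inside the brackets expand like np.arange, and scalars are appended.
mixed = np.r_[0:5, 10, 20]

# c_ turns 1D inputs into columns.
columns = np.c_[np.array([1, 2, 3]), np.array([4, 5, 6])]

print(mixed)
print(columns)
```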

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.r_.html) - `np.r_`
### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.c_.html) - `np.c_`

# 8️⃣. np.info

NumPy is so vast and deep. You probably won't have the time and patience to learn every single function and class of its API. What if you face an unknown function? Well, don't you go running to the documentation because you have a much better alternative. 

The `info` function can print the docstring of any name in the NumPy API. Here is `info` used on `info`:

### 💻Demo

In [None]:
np.info(np.info)

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.info.html)

# 9️⃣. np.where

As the name suggests, this function returns all the indices of an array `where` some condition is True:

### 💻Demo

In [None]:
probs = np.random.rand(100)

idx = np.where(probs > 0.8)
probs[idx]

It is particularly useful when searching for non-zero elements in sparse arrays, and it can also be used on Pandas DataFrames to retrieve the positions of the rows that satisfy a condition.
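There are two calling styles worth knowing, sketched below with a tiny hand-made array. With just a condition, `np.where` returns a tuple of index arrays (one per dimension); with three arguments, it chooses between two values element by element. The `"high"` and `"low"` labels are made up for the example:

```python
import numpy as np

probs = np.array([0.1, 0.95, 0.4, 0.85])

# With only a condition, np.where returns a tuple of index arrays.
(idx,) = np.where(probs > 0.8)

# With three arguments, it picks from the second or third argument elementwise.
labels = np.where(probs > 0.8, "high", "low")

print(idx)
print(labels)
```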

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.where.html)

# 1️⃣0️⃣. np.all / np.any

These two functions will be handy during data cleaning when used with `assert` statements.

`np.all` only returns True if all elements inside an array match a certain condition:

### 💻Demo

In [None]:
array1 = np.random.rand(100)
array2 = np.random.rand(100)

In [None]:
np.all(array1 == array2)

Since we created two arrays filled with random floats, there is virtually no chance that every pair of corresponding elements is equal. There is, however, a much better chance of at least one pair being equal if the numbers are integers:

In [None]:
a1 = np.random.randint(1, 100, size=100)
a2 = np.random.randint(1, 100, size=100)

np.any(a1 == a2)

So, `any` returns True if at least one element of an array satisfies a particular condition.
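Here is a rough sketch of how the two might look inside data-cleaning checks. The `ages` array and the limits are invented for illustration; the `axis` argument applies the check per row or per column instead of over the whole array:

```python
import numpy as np

ages = np.array([[23, 35, 41], [18, 67, 52]])

# Sanity checks of the kind you might run while cleaning data.
assert np.all(ages >= 0), "ages must be non-negative"
assert not np.any(ages > 120), "found an implausible age"

# axis=1 evaluates the condition row by row.
rows_with_senior = np.any(ages > 65, axis=1)

print(rows_with_senior)
```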

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.all.html) - `np.all`
### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.any.html) - `np.any`

# 1️⃣1️⃣. np.allclose

If you ever want to check whether two arrays of equal length are duplicates of each other, a simple `==` operator won't always cut it. When you compare arrays of floats, tiny differences in the decimal places can make an exact comparison fail. In that case, you can use `allclose`, which returns True if every pair of corresponding elements in the two arrays is equal within some tolerance.

### 💻Demo

In [None]:
a1 = np.arange(1, 10, step=0.5)
a2 = np.arange(0.8, 9.8, step=0.5)

np.all(a1 == a2)

In [None]:
a1

In [None]:
a2

In [None]:
np.allclose(a1, a2, rtol=0.2)

In [None]:
np.allclose(a1, a2, rtol=0.3)

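Note that `rtol` is a relative tolerance: the function returns True only if `|a1 - a2| <= atol + rtol * |a2|` holds for every pair of elements. Here, every difference is 0.2, but with `rtol=0.2` the allowed gap for the smallest element of `a2` (0.8) is only about 0.16, so the check fails. With `rtol=0.3`, the allowed gap there is 0.24, so it passes.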
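The NumPy docs describe the comparison as `absolute(a - b) <= (atol + rtol * absolute(b))`. As a sketch, you can spell that check out by hand and compare it with the built-in. The arrays and the `rtol` value below are arbitrary examples:

```python
import numpy as np

a = np.array([1.0, 100.0])
b = np.array([1.1, 101.0])

rtol, atol = 0.05, 1e-8  # atol matches the default of np.allclose

# Elementwise form of the check that allclose performs.
within = np.abs(a - b) <= atol + rtol * np.abs(b)

print(within)
print(np.allclose(a, b, rtol=rtol))
```

The first pair fails because the tolerance scales with the size of `b`: a gap of 0.1 is too big next to 1.1, while a gap of 1.0 is fine next to 101.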

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.allclose.html)

# 1️⃣2️⃣. np.argsort

While `np.sort` returns a sorted copy of an array, that's not always what you want. Sometimes, you need the indices that would sort an array so that you can use the same indices multiple times over for different purposes. That's where `argsort` comes in handy:

### 💻Demo

In [None]:
random_ints = np.random.randint(1, 100, size=20)
idx = np.argsort(random_ints)

random_ints[idx]

It is from a family of functions that start with `arg`, which always return an index or indices from the result of some function. For example, `argmax` finds the maximum value in an array and returns its index.
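A typical reason to want the indices rather than the sorted values is to reorder a second, related array. Below is a minimal sketch; the scores and the single-letter model names are hypothetical:

```python
import numpy as np

scores = np.array([0.72, 0.91, 0.65, 0.88])
names = np.array(["a", "b", "c", "d"])  # hypothetical model names

order = np.argsort(scores)[::-1]  # indices from highest to lowest score

ranked = names[order]
best = names[np.argmax(scores)]

print(ranked)
print(best)
```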

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html)

# 1️⃣3️⃣. np.isneginf / np.isposinf

These two boolean functions check whether each element of an array is negative infinity or positive infinity. Computers cannot store an infinitely large number (well, who can?), so NumPy follows the IEEE 754 floating-point standard, which reserves special bit patterns to represent positive and negative infinity.

That's why when you print the type of `np.inf`, it returns `float`:

In [None]:
type(np.inf)  # type of the infinity

In [None]:
type(-np.inf)

This means infinity values can easily sneak into an array and break operations that you would use on floats. You need a special function to find these sneaky little ...:

### 💻Demo

In [None]:
a = np.array([-9999, 99999, 97897, -79897, -np.inf])

In [None]:
np.all(a.dtype == "float64")

In [None]:
np.any(np.isneginf(a))

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.isneginf.html)

# 1️⃣4️⃣. np.polyfit

If you want to perform a traditional linear regression, you don't necessarily need Sklearn. NumPy has got you covered:

### 💻Demo

In [None]:
X = diamonds["carat"].values.flatten()
y = diamonds["price"].values.flatten()

slope, intercept = np.polyfit(X, y, deg=1)
slope, intercept

`polyfit` takes two vectors, fits a least-squares polynomial to them, and returns its coefficients from the highest degree down. For `deg=1`, these are the slope and the intercept. You have to specify the degree with `deg` because the function can fit a polynomial of any degree.
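To illustrate a degree higher than one, here is a sketch that fits a quadratic to points generated from a known polynomial, so the recovered coefficients are easy to check. `np.polyval` then evaluates the fit at a new point:

```python
import numpy as np

x = np.arange(5, dtype=float)
y = 2 * x**2 + 3 * x + 1  # points on an exact quadratic

# Coefficients come back from the highest power down: [2, 3, 1] up to rounding.
coeffs = np.polyfit(x, y, deg=2)

# np.polyval evaluates the fitted polynomial at new points.
prediction = np.polyval(coeffs, 10.0)

print(coeffs)
print(prediction)
```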

Double-checking with Sklearn reveals that the slope and intercept found with `polyfit` are the same as Sklearn's `LinearRegression` model:

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(X.reshape(-1, 1), y)

lr.coef_, lr.intercept_

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html)

# 1️⃣5️⃣. Probability distributions

NumPy's `random` module has a wide selection of pseudo-random number generators. Alongside my favorites such as [`sample`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.sample.html) and [`choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html), there are functions to simulate pseudo-perfect probability distributions.

For example, the `binomial`, `gamma` and `normal` functions draw a custom number of data points from their respective distributions.

You may find them quite useful when you have to approximate the distributions of the features in your data. For example, below, we check whether diamond prices follow a normal distribution.

### 💻Demo

In [None]:
fig, ax = plt.subplots(figsize=(6, 8))

price_mean = diamonds["price"].mean()
price_std = diamonds["price"].std()

# Draw from a perfect normal distribution
perfect_norm = np.random.normal(price_mean, price_std, size=1000000)

sns.kdeplot(diamonds["price"], ax=ax)
sns.kdeplot(perfect_norm, ax=ax)

plt.legend(["Price", "Perfect Normal Distribution"]);

This can be done by plotting a KDE of diamond prices on top of a perfect normal distribution so that the differences are visible.
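The same distributions are also available through NumPy's newer `Generator` interface, created with `np.random.default_rng`. The sketch below uses it with a seed so that the draws are reproducible; the seed and the distribution parameters are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

normal_draws = rng.normal(loc=0.0, scale=1.0, size=10_000)
coin_flips = rng.binomial(n=10, p=0.5, size=10_000)  # heads in 10 fair tosses

print(normal_draws.mean(), normal_draws.std())
print(coin_flips.mean())
```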

### 📚 Documentation: [link](https://numpy.org/doc/1.16/reference/routines.random.html)

# 1️⃣6️⃣. np.rint

`rint` is a nifty little function if you ever want to round each element of an array to the nearest integer. You can start using it when you want to convert class probabilities to class labels in binary classification. You don't have to call the `predict` method of your model, wasting your time:

### 💻Demo

In [None]:
preds = np.random.rand(100)

np.rint(preds[:50])
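One thing to watch out for: values exactly halfway between two integers are rounded to the nearest even one, so a probability of exactly 0.5 becomes 0. If that matters for your labels, an explicit threshold is a safer bet. A small sketch with made-up numbers:

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, 3.5])

# Exact halves go to the nearest even integer, not always up.
rounded = np.rint(halves)

# An explicit threshold makes the treatment of 0.5 unambiguous.
probs = np.array([0.2, 0.5, 0.7])
labels = (probs >= 0.5).astype(int)

print(rounded)
print(np.rint(probs), labels)
```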

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.rint.html)

# 1️⃣7️⃣. np.nanmean / np.nan*

Did you know that aggregations on pure NumPy arrays, such as `np.mean`, return `nan` if even a single element is `NaN`?

### 💻Demo

In [None]:
a = np.array([12, 45, np.nan, 9, np.nan, 22])

np.mean(a)

To go around this without modifying the original array, you can use a family of `nan` functions:

In [None]:
np.nanmean(a)

Above is an example of the arithmetic mean function that ignores missing values. There are many others that work in the same manner:

In [None]:
[func for func in dir(np) if func.startswith("nan")]

But you might as well forget about these if you only work with Pandas DataFrames or Series, because they ignore NaNs by default.

In [None]:
pd.Series(a).mean()
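For completeness, here is a short sketch of a few other members of the family on the same array, plus a way to count the missing values with `np.isnan`:

```python
import numpy as np

a = np.array([12, 45, np.nan, 9, np.nan, 22])

total = np.nansum(a)           # sum of the non-NaN values
largest = np.nanmax(a)         # maximum of the non-NaN values
n_missing = np.isnan(a).sum()  # number of NaN entries

print(total, largest, n_missing)
```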

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.nanmean.html) - `np.nanmean`

# 1️⃣8️⃣. np.clip

`clip` is useful when you want to impose a strict limit on the values of your array. Below, we are clipping any values that are outside the hard limits of 10 and 70:

### 💻Demo

In [None]:
ages = np.random.randint(1, 110, size=100)

limited_ages = np.clip(ages, 10, 70)
limited_ages
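With random input it is hard to tell which values were changed, so here is a sketch on a tiny fixed array. Passing `None` for one of the bounds clips only on the other side:

```python
import numpy as np

values = np.array([-5, 3, 120])

both_sides = np.clip(values, 0, 100)   # limit from below and above
lower_only = np.clip(values, 0, None)  # None leaves the upper side unbounded

print(both_sides)
print(lower_only)
```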

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.clip.html)

# 1️⃣9️⃣. np.count_nonzero

It is common to work with sparse arrays. Often, they are the result of one-hot encoding a categorical feature with high cardinality, or they simply contain many binary columns.

You can check the number of non-zero elements in any array with `count_nonzero`:

### 💻Demo

In [None]:
a = np.random.randint(-50, 50, size=100000)

np.count_nonzero(a)

Among 100k random integers, ~1000 of them are zeros.
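Two variations can be handy, sketched below on a small hand-made one-hot matrix: counting per column with `axis`, and counting zeros by passing a boolean condition instead of the array itself:

```python
import numpy as np

one_hot = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 1]])

per_column = np.count_nonzero(one_hot, axis=0)  # non-zero entries in each column
n_zeros = np.count_nonzero(one_hot == 0)        # True counts as non-zero

print(per_column)
print(n_zeros)
```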

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.count_nonzero.html)

# 2️⃣0️⃣. np.array_split

The final function in our list is `array_split`. I think you can probably guess what it does from the name: it can be used to chunk ndarrays or DataFrames into N buckets. Besides, unlike `vsplit`, it doesn't raise an error when the array cannot be divided into equally sized chunks:

### 💻Demo

In [None]:
import datatable as dt

df = dt.fread("../input/tabular-playground-series-oct-2021/train.csv").to_pandas()

splitted_dfs = np.array_split(df, 100)

In [None]:
len(splitted_dfs)
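If you don't have that dataset at hand, the uneven-split behavior is easy to reproduce on a plain array. In this sketch, 10 elements are split into 3 chunks, which `np.split` refuses to do:

```python
import numpy as np

values = np.arange(10)

chunks = np.array_split(values, 3)
sizes = [len(chunk) for chunk in chunks]
print(sizes)

try:
    np.split(values, 3)
except ValueError as err:
    print("np.split refused:", err)
```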

### 📚 Documentation: [link](https://numpy.org/doc/stable/reference/generated/numpy.array_split.html)

# Summary

OK, I lied in the intro a bit. I don't *really* admire people who use NumPy well. Actually, I admire anyone who uses some library or tool better than me. So, each of the articles I write is just me trying to push myself to the limits and see how it feels to use the things that more experienced folks use so skillfully.

![](https://cdn-images-1.medium.com/max/900/1*KeMS7gxVGsgx8KC36rSTcg.gif)