# 27 Pandas Functions You Didn't Know Existed | P(Guarantee) = 0.9
## Making Pandas Even More Awesome
![](https://images.pexels.com/photos/5199661/pexels-photo-5199661.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=650&w=940)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@introspectivedsgn?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Erik Mclean</a>
        on 
        <a href='https://www.pexels.com/photo/unrecognizable-man-in-panda-head-sitting-near-car-5199661/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

## Setup

In [74]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.options.display.max_rows = None
warnings.filterwarnings("ignore")

## Introduction

## 1. `ExcelWriter`

`ExcelWriter` is a generic class for creating Excel files (with multiple sheets!) and writing DataFrames to them. Let's say we have these two DataFrames:

In [2]:
diamonds = sns.load_dataset("diamonds")
tips = sns.load_dataset("tips")

with pd.ExcelWriter("data/data.xlsx") as writer:
    diamonds.to_excel(writer, sheet_name="diamonds")
    tips.to_excel(writer, sheet_name="tips")

It has additional parameters to specify the datetime format, whether to create a new Excel file or modify an existing one, what happens when a sheet already exists, and so on. Check out the details in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelWriter.html).

## 2. `factorize`

This function is a pandas alternative to Sklearn's `LabelEncoder`:

In [6]:
codes, unique = pd.factorize(diamonds["cut"], sort=True)

codes[:10]

array([0, 1, 3, 1, 3, 2, 2, 2, 4, 2], dtype=int64)

In [7]:
unique

CategoricalIndex(['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'], categories=['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'], ordered=False, dtype='category')

Pay attention to how the function returns the categories as well. 

In [10]:
diamonds["cut_enc"] = pd.factorize(diamonds["cut"])[0]
diamonds["cut_enc"].sample(5)

46228    0
10292    3
50925    1
38309    1
1775     0
Name: cut_enc, dtype: int64
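
To make the relationship between the codes and the returned categories concrete, here is a minimal sketch on a made-up Series (the labels are assumptions for illustration, not part of the diamonds data). By default, missing values receive the sentinel code `-1`, and `sort=True` sorts the unique values before numbering them:

```python
import pandas as pd

# Toy labels (hypothetical); None stands in for a missing value
labels = pd.Series(["b", "a", None, "b"])

# Codes follow the order of first appearance; missing values become -1
codes, uniques = pd.factorize(labels)

# With sort=True, the unique values are sorted before codes are assigned
sorted_codes, sorted_uniques = pd.factorize(labels, sort=True)

# Codes index back into the uniques, so the original labels are recoverable
recovered = [uniques[c] if c != -1 else None for c in codes]
```

Because the codes are positions into `uniques`, decoding needs no separate fitted encoder object.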

## 3. `bdate_range`

A shorthand function to create a `DatetimeIndex` with business-day frequency:

In [14]:
series = pd.bdate_range("2021-01-01", "2021-01-31")  # A period of one month
len(series)

21

Business-day frequencies are common in the financial world, so this function may come in handy when reindexing an existing time series with the `reindex` function.
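
As a rough sketch of that reindexing use case, the snippet below builds a hypothetical daily series (the values and the one-week window are assumptions) and keeps only the business days:

```python
import pandas as pd

# Hypothetical daily series covering Fri 2021-01-01 through Thu 2021-01-07
daily = pd.Series(range(7), index=pd.date_range("2021-01-01", periods=7))

# Business days in the same window (the weekend of Jan 2-3 is skipped)
business_days = pd.bdate_range("2021-01-01", "2021-01-07")

# Reindexing keeps only the observations that fall on business days
business_only = daily.reindex(business_days)
```

Dates in `business_days` that are missing from the original series would show up as `NaN` after `reindex`.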

## 4. `hasnans`

Pandas offers a quick way to check whether a given Series contains any nulls with the `hasnans` attribute:

In [15]:
series = pd.Series([2, 4, 6, "sadf", np.nan])
series.hasnans

True

According to its [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.hasnans.html), it enables various performance increases. Note that the attribute is available on a Series (and an Index) but not on a DataFrame.

## 5. `convert_dtypes`

We all know that pandas has an annoying tendency to mark some columns as the `object` data type. Instead of manually specifying their types, you can use the `convert_dtypes` method, which infers the best data type for each column:

In [25]:
sample = pd.read_csv(
    "data/station_day.csv",
    usecols=["StationId", "CO", "O3", "AQI_Bucket"],
)
sample.dtypes

StationId      object
CO            float64
O3            float64
AQI_Bucket     object
dtype: object

In [28]:
sample.convert_dtypes().dtypes

StationId      string
CO            float64
O3            float64
AQI_Bucket     string
dtype: object

Unfortunately, it can't parse dates because of the many different date-time formats.
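
One possible workaround is to parse date columns explicitly with `pd.to_datetime` after the conversion. The frame, column names and format string below are assumptions made up for illustration:

```python
import pandas as pd

# Hypothetical frame with dates stored as text
raw = pd.DataFrame({"date": ["2021-01-01", "2021-01-02"], "n": [1, 2]})

# convert_dtypes upgrades the text and integer columns but leaves dates as text
converted = raw.convert_dtypes()

# Parse the date column explicitly, stating the expected format
converted["date"] = pd.to_datetime(converted["date"], format="%Y-%m-%d")
```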

## 6. `at` and `iat`

These two accessors are much faster alternatives to `loc` and `iloc`, with one disadvantage: they only allow selecting or replacing a single value at a time:

In [29]:
diamonds.at[234, "cut"]

'Ideal'

In [30]:
diamonds.iat[1564, 4]

61.2

In [31]:
# Replace the price in the row labeled 16541
diamonds.at[16541, "price"] = 10000
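
The difference between the two is easier to see on a frame whose index is not the default range. In this small sketch (the frame and its labels are hypothetical), `at` looks up by labels while `iat` looks up by integer positions:

```python
import pandas as pd

# Hypothetical frame with string row labels
prices = pd.DataFrame({"price": [326, 327, 334]}, index=["a", "b", "c"])

by_label = prices.at["b", "price"]  # row label + column label
by_position = prices.iat[1, 0]      # row position + column position

# Assignment works the same way, one cell at a time
prices.at["c", "price"] = 10000
```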

## 7. `min` and `max` along the columns axis

Even though `min` and `max` functions are well-known, they have another useful property for some edge-cases. Consider this dataset:

In [34]:
index = ["Diamonds", "Titanic", "Iris", "Heart Disease", "Loan Default"]
libraries = ["XGBoost", "CatBoost", "LightGBM", "Sklearn GB"]
df = pd.DataFrame(
    {lib: np.random.uniform(90, 100, 5) for lib in libraries}, index=index
)

df

| | XGBoost | CatBoost | LightGBM | Sklearn GB |
| --- | --- | --- | --- | --- |
| Diamonds | 99.733903 | 96.540712 | 91.725201 | 93.640438 |
| Titanic | 93.130239 | 99.838026 | 93.586711 | 95.963391 |
| Iris | 97.153966 | 98.354851 | 90.58525 | 93.693119 |
| Heart Disease | 92.590182 | 94.402339 | 93.373349 | 93.399484 |
| Loan Default | 98.359192 | 91.073934 | 90.595855 | 92.90032 |


The fake DataFrame above holds the scores of 4 different gradient boosting libraries on 5 datasets. We want to find the best score on each dataset without doing it manually. Here is how you do it with `max`:

In [37]:
df.max(axis=1)

Diamonds         99.733903
Titanic          99.838026
Iris             98.354851
Heart Disease    94.402339
Loan Default     98.359192
dtype: float64

Just change the axis to 1 and you get a row-wise max/min. 
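
Note that `max` returns the best score, not the name of the library that achieved it. If you want the name, one option is `idxmax` with the same axis. The sketch below rebuilds the table from the printed values, so the variable names are assumptions:

```python
import pandas as pd

# Scores copied from the table above
scores = pd.DataFrame(
    {
        "XGBoost": [99.733903, 93.130239, 97.153966, 92.590182, 98.359192],
        "CatBoost": [96.540712, 99.838026, 98.354851, 94.402339, 91.073934],
        "LightGBM": [91.725201, 93.586711, 90.585250, 93.373349, 90.595855],
        "Sklearn GB": [93.640438, 95.963391, 93.693119, 93.399484, 92.900320],
    },
    index=["Diamonds", "Titanic", "Iris", "Heart Disease", "Loan Default"],
)

# For each row (dataset), return the column label (library) with the top score
best_library = scores.idxmax(axis=1)
```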

## 8. `pipe`

This is one of the best functions for doing data cleaning in a concise, compact manner. The `pipe` function allows you to chain multiple custom functions into a single operation.

For example, let's say you have custom functions named `drop_duplicates`, `remove_outliers` and `encode_categoricals` that each accept their own arguments. Here is how you apply all three in a single operation:

```python
df_preprocessed = (
    diamonds.pipe(drop_duplicates)
    .pipe(remove_outliers, ["price", "carat", "depth"])
    .pipe(encode_categoricals, ["cut", "color", "clarity"])
)
```

I like how this function resembles [Sklearn pipelines](https://towardsdatascience.com/how-to-use-sklearn-pipelines-for-ridiculously-neat-code-a61ab66ca90d). There is more you can do with it, so check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html) or this [helpful article](https://towardsdatascience.com/a-better-way-for-data-preprocessing-pandas-pipe-a08336a012bc).
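
The three functions in the snippet above are never defined in this article, so here is one possible, deliberately simple way they could look. Everything below (the function bodies, the `n_std` parameter, the toy data) is a hypothetical sketch, not the author's implementation:

```python
import pandas as pd


def drop_duplicates(df):
    """Remove fully duplicated rows."""
    return df.drop_duplicates()


def remove_outliers(df, columns, n_std=3.0):
    """Keep rows within n_std standard deviations of each column's mean."""
    out = df
    for col in columns:
        mean, std = out[col].mean(), out[col].std()
        out = out[(out[col] - mean).abs() <= n_std * std]
    return out


def encode_categoricals(df, columns):
    """Replace each listed column with integer codes from pd.factorize."""
    out = df.copy()
    for col in columns:
        out[col] = pd.factorize(out[col])[0]
    return out


# Toy data: one duplicated row and one extreme price
toy = pd.DataFrame(
    {"price": [10, 10, 12, 11, 1000], "cut": ["a", "a", "b", "a", "b"]}
)

# Extra positional and keyword arguments are forwarded to each function
cleaned = (
    toy.pipe(drop_duplicates)
    .pipe(remove_outliers, ["price"], n_std=1.0)
    .pipe(encode_categoricals, ["cut"])
)
```

Each function takes a DataFrame as its first argument and returns a DataFrame, which is what makes the steps chainable.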

## 9. `autocorr`

One of the critical components in time-series analysis is examining the autocorrelation of a variable. 

Autocorrelation is the plain old correlation coefficient, but it is calculated between a time series and a lagged version of itself.

In more detail, autocorrelation of a time series at `lag=k` is calculated as follows:

1. The time series is shifted by `k` periods:

In [44]:
# Prep the data for an example
dt = pd.date_range("2021-01-01", periods=len(tips))
tips.index = dt

time_series = tips[["tip"]].copy()  # copy so adding columns doesn't trigger SettingWithCopyWarning

In [43]:
time_series["lag_1"] = time_series["tip"].shift(1)
time_series["lag_2"] = time_series["tip"].shift(2)
time_series["lag_3"] = time_series["tip"].shift(3)
time_series["lag_4"] = time_series["tip"].shift(4)
# time_series['lag_k'] = time_series['tip'].shift(k)

time_series.head()

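| | tip | lag_1 | lag_2 | lag_3 | lag_4 |
| --- | --- | --- | --- | --- | --- |
| 2021-01-01 | 1.01 | NaN | NaN | NaN | NaN |
| 2021-01-02 | 1.66 | 1.01 | NaN | NaN | NaN |
| 2021-01-03 | 3.5 | 1.66 | 1.01 | NaN | NaN |
| 2021-01-04 | 3.31 | 3.5 | 1.66 | 1.01 | NaN |
| 2021-01-05 | 3.61 | 3.31 | 3.5 | 1.66 | 1.01 |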


2. Correlation is calculated between the original `tip` and each `lag_*`. 

Instead of doing all this manually, you can use the `autocorr` function of Pandas:

In [47]:
# Autocorrelation of tip at lag 8
time_series["tip"].autocorr(lag=8)

0.07475238789967077

You can read more about the importance of autocorrelation in time-series analysis from this [post](https://towardsdatascience.com/advanced-time-series-analysis-in-python-decomposition-autocorrelation-115aa64f475e).
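
The two manual steps described above (shift, then correlate) can be checked against `autocorr` directly. This is a minimal sketch on a made-up series; the values and the lag are assumptions:

```python
import pandas as pd

# Hypothetical short time series
s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0])
k = 2

# Step 1: shift by k periods. Step 2: correlate with the original
manual = s.corr(s.shift(k))

# The built-in shortcut
builtin = s.autocorr(lag=k)
```

The rows that become `NaN` after shifting are ignored by `corr`, so both approaches use the same `len(s) - k` pairs.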

## 10. `between`

A rather nifty function for boolean indexing of numeric features within a range:

In [53]:
# Get diamonds that are priced between 3500 and 3700 dollars
diamonds[diamonds["price"].between(3500, 3700, inclusive="neither")].sample(5)

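| | carat | cut | color | clarity | depth | table | price | x | y | z | cut_enc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4412 | 1.01 | Very Good | I | SI1 | 58.8 | 58.0 | 3610 | 6.52 | 6.57 | 3.85 | 3 |
| 4806 | 0.9 | Good | F | SI1 | 58.0 | 58.0 | 3699 | 6.24 | 6.28 | 3.63 | 2 |
| 4587 | 0.8 | Ideal | F | VS2 | 62.0 | 55.0 | 3653 | 5.93 | 6.0 | 3.7 | 0 |
| 3972 | 0.73 | Ideal | D | VS1 | 61.1 | 57.0 | 3509 | 5.85 | 5.81 | 3.56 | 0 |
| 4379 | 0.9 | Premium | E | SI1 | 62.4 | 57.0 | 3609 | 6.19 | 6.14 | 3.85 | 1 |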
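
The `inclusive` argument controls whether the boundaries themselves count. A small sketch on a made-up Series (the values are assumptions) shows the effect of three of its options:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Default: both boundaries are included
both = s[s.between(2, 4)]

# Neither boundary is included
neither = s[s.between(2, 4, inclusive="neither")]

# Only the left boundary is included
left = s[s.between(2, 4, inclusive="left")]
```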


## 11. `clip`

Outlier detection and removal is common in data analysis. 

The `clip` function makes it really easy to find values outside a range and replace them with the range limits. For example, let's say we have survey data collected from people aged 50-60. We want to check the age column and replace any values outside this range, treating them as data entry mistakes:

In [55]:
ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60]).to_frame("ages")
ages

| | ages |
| --- | --- |
| 0 | 55 |
| 1 | 52 |
| 2 | 50 |
| 3 | 66 |
| 4 | 57 |
| 5 | 59 |
| 6 | 49 |
| 7 | 60 |


Two values are outside this range (49, 66). Let's fix them with `clip`:

In [57]:
ages.clip(50, 60)

| | ages |
| --- | --- |
| 0 | 55 |
| 1 | 52 |
| 2 | 50 |
| 3 | 60 |
| 4 | 57 |
| 5 | 59 |
| 6 | 50 |
| 7 | 60 |


Fast and efficient!
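
`clip` silently overwrites the offending values, so if you also want to know which rows were changed, one simple approach is to compare the result with the original. The variable names below are assumptions:

```python
import pandas as pd

ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60], name="ages")

# Limit values to the 50-60 range
clipped = ages.clip(lower=50, upper=60)

# True wherever clip changed the value
was_clipped = ages != clipped

# The original out-of-range entries, with their index labels
flagged = ages[was_clipped]
```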

## 12. `nlargest` and `nsmallest`

Sometimes you don't just want the min/max of a column. You want to see the top N or bottom N values of a variable. This is where `nlargest` and `nsmallest` come in handy.

Let's see the 5 most expensive and the 5 cheapest diamonds:

In [59]:
diamonds.nlargest(5, "price")

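| | carat | cut | color | clarity | depth | table | price | x | y | z | cut_enc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 27749 | 2.29 | Premium | I | VS2 | 60.8 | 60.0 | 18823 | 8.5 | 8.47 | 5.16 | 1 |
| 27748 | 2.0 | Very Good | G | SI1 | 63.5 | 56.0 | 18818 | 7.9 | 7.97 | 5.04 | 3 |
| 27747 | 1.51 | Ideal | G | IF | 61.7 | 55.0 | 18806 | 7.37 | 7.41 | 4.56 | 0 |
| 27746 | 2.07 | Ideal | G | SI2 | 62.5 | 55.0 | 18804 | 8.2 | 8.13 | 5.11 | 0 |
| 27745 | 2.0 | Very Good | H | SI1 | 62.8 | 57.0 | 18803 | 7.95 | 8.0 | 5.01 | 3 |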


In [60]:
diamonds.nsmallest(5, "price")

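| | carat | cut | color | clarity | depth | table | price | x | y | z | cut_enc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 0 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 1 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 2 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 | 1 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 2 |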
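
Ties are worth a quick look: the two cheapest diamonds above share the same price. Both functions accept a `keep` argument that decides what happens in that case. Here is a minimal sketch on a made-up price column:

```python
import pandas as pd

# Hypothetical prices with a tie for the lowest value
prices = pd.DataFrame({"price": [326, 326, 327, 334]})

# Default keep="first": exactly n rows, the first tied row wins
first_only = prices.nsmallest(1, "price")

# keep="all": every row tied with the n-th value is returned
with_ties = prices.nsmallest(1, "price", keep="all")
```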


## 13. `value_counts` with `dropna=False`

One thing that I see most people do to find the percentage of missing values in a column is to chain `isnull` and `sum` and divide by the length of the array. 

You can do the same thing with `value_counts` with relevant arguments:

In [76]:
ames_housing = pd.read_csv("data/train.csv")

ames_housing["FireplaceQu"].value_counts(dropna=False, normalize=True)

NaN    0.472603
Gd     0.260274
TA     0.214384
Fa     0.022603
Ex     0.016438
Po     0.013699
Name: FireplaceQu, dtype: float64

About 47% of the fireplace quality values in the Ames housing dataset are null.
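
To see that the two approaches agree, here is a small sketch on a made-up column (the values are assumptions, not the Ames data):

```python
import pandas as pd

# Hypothetical quality column where half of the values are missing
quality = pd.Series(["Gd", None, "TA", None])

# The common approach: count the nulls and divide by the length
share_isnull = quality.isnull().sum() / len(quality)

# The value_counts approach: the missing values get their own row
counts = quality.value_counts(dropna=False, normalize=True)
share_value_counts = counts[counts.index.isna()].iloc[0]
```

The `value_counts` version has the advantage of showing the share of every other category in the same output.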

## 14. `idxmax` and `idxmin`

When you call `max` or `min` on a column, pandas returns the largest/smallest value. However, sometimes you want the index label of the row that holds the min/max, which these functions don't return.

Instead, you should use `idxmax`/`idxmin`:

In [77]:
diamonds.price.idxmax()

27749

In [78]:
diamonds.carat.idxmin()

14

You can also specify the `columns` axis on a DataFrame, in which case the functions return, for each row, the label of the column that holds the max/min.
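
A short sketch on made-up data (all names and values are assumptions) shows what comes back in each case. `idxmax` returns index labels, while `argmax` returns integer positions:

```python
import pandas as pd

s = pd.Series([5, 9, 2], index=["a", "b", "c"])

label = s.idxmax()     # index label of the largest value
position = s.argmax()  # integer position of the largest value

# On a DataFrame, axis="columns" returns one column label per row
df = pd.DataFrame({"x": [1, 5], "y": [3, 2]})
best_column = df.idxmax(axis="columns")
```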

## 15. `mask`

A quick function to replace values of a DataFrame based on a condition. 

Let's go back to the ages example:

In [79]:
ages

| | ages |
| --- | --- |
| 0 | 55 |
| 1 | 52 |
| 2 | 50 |
| 3 | 66 |
| 4 | 57 |
| 5 | 59 |
| 6 | 49 |
| 7 | 60 |


Let's replace ages that are outside 50-60 with nulls:

In [81]:
ages.mask(cond=~ages["ages"].between(50, 60), other=np.nan)

| | ages |
| --- | --- |
| 0 | 55.0 |
| 1 | 52.0 |
| 2 | 50.0 |
| 3 | NaN |
| 4 | 57.0 |
| 5 | 59.0 |
| 6 | NaN |
| 7 | 60.0 |


So, `mask` replaces the values where `cond` is `True` with `other`.
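
The direction of the condition is easy to mix up, so it may help to compare `mask` with its counterpart `where`, which keeps the values where the condition is `True` and replaces the rest. A minimal sketch with the same ages (variable names are assumptions):

```python
import pandas as pd

ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60], name="ages")
in_range = ages.between(50, 60)

# mask: replace where the condition is True (here: out of range)
masked = ages.mask(~in_range)

# where: keep where the condition is True, replace everything else
kept = ages.where(in_range)
```

Both calls default to `NaN` as the replacement value, so the two results are identical.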

## 16. `argsort`

You should use this function when you want to extract the indices that would sort an array:

In [91]:
tips.reset_index(inplace=True, drop=True)

sort_idx = tips["total_bill"].argsort(kind="mergesort")

# Now, sort tips based on total_bill
tips.iloc[sort_idx].head()

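| | total_bill | tip | sex | smoker | day | time | size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 67 | 3.07 | 1.0 | Female | Yes | Sat | Dinner | 1 |
| 92 | 5.75 | 1.0 | Female | Yes | Fri | Dinner | 2 |
| 111 | 7.25 | 1.0 | Female | No | Sat | Dinner | 1 |
| 172 | 7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 |
| 149 | 7.51 | 2.0 | Male | No | Thur | Lunch | 2 |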


## 17. `explode` - 🤯🤯

![](https://images.unsplash.com/photo-1567446042109-8a62d37fea07?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=750&q=80)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@joshuas?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Joshua Sukoff</a>
        on 
        <a href='https://unsplash.com/s/photos/explode?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash</a>
    </strong>
</figcaption>

A function with an interesting name. Let's see an example first and then explain:

In [94]:
data = pd.Series([1, 6, 7, [46, 56, 49], 45, [15, 10, 12]]).to_frame("dirty")
data

| | dirty |
| --- | --- |
| 0 | 1 |
| 1 | 6 |
| 2 | 7 |
| 3 | [46, 56, 49] |
| 4 | 45 |
| 5 | [15, 10, 12] |


The `dirty` column has two rows where values are recorded as actual lists. You may often see this type of data in real-world surveys, as some questions accept multiple answers.

In [99]:
data.explode("dirty", ignore_index=True)

| | dirty |
| --- | --- |
| 0 | 1 |
| 1 | 6 |
| 2 | 7 |
| 3 | 46 |
| 4 | 56 |
| 5 | 49 |
| 6 | 45 |
| 7 | 15 |
| 8 | 10 |
| 9 | 12 |


`explode` takes a row with an array-like value and *explodes* it into multiple rows. Set `ignore_index` to `True` to get a fresh 0, 1, 2, ... index instead of repeating the original index label for each exploded row.
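
In survey exports, multiple answers often arrive as delimited strings rather than lists. One way to handle that, sketched here on made-up data (the column names and the comma delimiter are assumptions), is to split the strings into lists first and then explode them:

```python
import pandas as pd

# Hypothetical survey: respondent 1 picked two answers, respondent 2 picked one
survey = pd.DataFrame({"id": [1, 2], "answers": ["a,b", "c"]})

long_format = (
    survey.assign(answers=survey["answers"].str.split(","))  # strings -> lists
    .explode("answers", ignore_index=True)                   # one row per answer
)
```

The other columns (`id` here) are repeated for every exploded row, so each answer stays linked to its respondent.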

## 18. `squeeze`

![](https://images.pexels.com/photos/5875701/pexels-photo-5875701.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=650&w=940)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@cottonbro?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>cottonbro</a>
        on 
        <a href='https://www.pexels.com/photo/close-up-photo-of-sausage-5875701/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels.</a>
    </strong>
</figcaption>

Another function with a funky name is `squeeze`, which is used in very rare but annoying edge cases.

One of these rare cases is when there is a *single* value returned from a condition used to subset a DataFrame. Consider this example:

In [112]:
subset = diamonds.loc[diamonds.index < 1, ["price"]]
subset

| | price |
| --- | --- |
| 0 | 326 |


Even though there is just one value, it is returned as a DataFrame. This can be annoying since you now have to use `.loc` again with both the column name and index.

But, if you know `squeeze`, you don't have to. The function enables you to remove an axis from a single-cell DataFrame or Series. For example:

In [113]:
subset.squeeze()

326

Only the scalar value is returned. You can also specify the axis to remove:

In [115]:
subset.squeeze("columns")  # or "rows"

0    326
Name: price, dtype: int64

Note that `squeeze` only has an effect on axes of length 1, such as a single-cell or single-column DataFrame or a single-element Series; anything else is returned unchanged.
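
Here is a small sketch of what `squeeze` returns for three different shapes. The frames are made up for illustration:

```python
import pandas as pd

single_cell = pd.DataFrame({"price": [326]})
single_column = pd.DataFrame({"price": [326, 327]})
two_by_two = pd.DataFrame({"price": [326, 327], "carat": [0.23, 0.21]})

scalar = single_cell.squeeze()     # 1x1 DataFrame -> scalar
series = single_column.squeeze()   # one column -> Series
unchanged = two_by_two.squeeze()   # no axis of length 1 -> still a DataFrame
```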

## 19. `at_time` and `between_time`

Another couple of handy time-series functions.

`at_time` allows you to subset values at a specific time of day. Consider this time series:

In [117]:
index = pd.date_range("2021-08-01", periods=100, freq="H")
data = pd.DataFrame({"col": list(range(100))}, index=index)
data.head()

| | col |
| --- | --- |
| 2021-08-01 00:00:00 | 0 |
| 2021-08-01 01:00:00 | 1 |
| 2021-08-01 02:00:00 | 2 |
| 2021-08-01 03:00:00 | 3 |
| 2021-08-01 04:00:00 | 4 |


Let's select all rows where the time is "15:00":

In [118]:
data.at_time("15:00")

| | col |
| --- | --- |
| 2021-08-01 15:00:00 | 15 |
| 2021-08-02 15:00:00 | 39 |
| 2021-08-03 15:00:00 | 63 |
| 2021-08-04 15:00:00 | 87 |


Cool, huh? Now, let's use `between_time` to select rows within a custom interval:

In [123]:
data.between_time("09:45", "12:00")

| | col |
| --- | --- |
| 2021-08-01 10:00:00 | 10 |
| 2021-08-01 11:00:00 | 11 |
| 2021-08-01 12:00:00 | 12 |
| 2021-08-02 10:00:00 | 34 |
| 2021-08-02 11:00:00 | 35 |
| 2021-08-02 12:00:00 | 36 |
| 2021-08-03 10:00:00 | 58 |
| 2021-08-03 11:00:00 | 59 |
| 2021-08-03 12:00:00 | 60 |
| 2021-08-04 10:00:00 | 82 |


Note that both functions require a `DatetimeIndex` and they only work with times of day (as in o'clock). If you want to subset within a datetime interval, use `between`.
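
For a full datetime interval, a hedged sketch could look like the following. It rebuilds the hourly frame from above (using a `Timedelta` as the frequency, which is an assumption made here to avoid depending on frequency aliases) and shows two options: `between` on the timestamps, and label slicing with `.loc`:

```python
import pandas as pd

# Same hourly frame as above: 100 rows starting at 2021-08-01 00:00
index = pd.date_range("2021-08-01", periods=100, freq=pd.Timedelta(hours=1))
data = pd.DataFrame({"col": list(range(100))}, index=index)

start = pd.Timestamp("2021-08-02 06:00")
end = pd.Timestamp("2021-08-02 09:00")

# Option 1: turn the index into a Series and use between (both ends inclusive)
timestamps = data.index.to_series()
via_between = data[timestamps.between(start, end)]

# Option 2: slice the sorted DatetimeIndex by label (both ends inclusive)
via_slice = data.loc[start:end]
```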

## 20. `cat` accessor

It is common knowledge that Pandas lets you use built-in functions on dates and strings through accessors like `dt` or `str`.

Pandas also has a special `category` data type for categorical variables as can be seen below:

In [124]:
diamonds.dtypes

carat       float64
cut        category
color      category
clarity    category
depth       float64
table       float64
price         int64
x           float64
y           float64
z           float64
cut_enc       int64
dtype: object

When a column is `category`, you can use several special functions using the `cat` accessor. For example, let's see the unique categories of diamond cuts:

In [125]:
diamonds["cut"].cat.categories

Index(['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'], dtype='object')

There are also functions like `remove_categories` or `rename_categories`, etc.:

In [132]:
diamonds["new_cuts"] = diamonds["cut"].cat.rename_categories(list("ABCDE"))
diamonds["new_cuts"].cat.categories

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

You can see the full list of functions under the `cat` accessor [here](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#categorical-accessor).
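
Two more things the accessor offers are the integer codes behind each category and the ability to give the categories a meaningful order. The Series below is made up, and the chosen quality order is an assumption for illustration:

```python
import pandas as pd

# Hypothetical categorical column; categories are sorted alphabetically by default
cuts = pd.Series(["Good", "Ideal", "Fair"], dtype="category")

# Integer code of each value (position in cuts.cat.categories)
codes = cuts.cat.codes

# Declare an explicit order so that comparisons become meaningful
ordered = cuts.cat.reorder_categories(["Fair", "Good", "Ideal"], ordered=True)
at_least_good = ordered >= "Good"
best = ordered.max()
```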

## 21. `select_dtypes`

A function I use all the time is `select_dtypes`. I think it is obvious what the function does from its name. It has `include` and `exclude` parameters that can be used to select columns including or excluding certain data types. 

For example:

In [135]:
# Choose only numerical columns
diamonds.select_dtypes(include=np.number).head()

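| | carat | depth | table | price | x | y | z | cut_enc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 0 |
| 1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 1 |
| 2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 2 |
| 3 | 0.29 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 | 1 |
| 4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 2 |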


In [137]:
# Exclude numerical columns
diamonds.select_dtypes(exclude=np.number).head()

| | cut | color | clarity | new_cuts |
| --- | --- | --- | --- | --- |
| 0 | Ideal | E | SI2 | A |
| 1 | Premium | E | SI1 | B |
| 2 | Good | E | VS1 | D |
| 3 | Premium | I | VS2 | B |
| 4 | Good | J | SI2 | D |
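
Besides NumPy types, `select_dtypes` also accepts dtype names as strings. A minimal sketch on a made-up frame (the columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame with float, integer, categorical and boolean columns
df = pd.DataFrame(
    {
        "carat": [0.23, 0.21],
        "price": [326, 327],
        "cut": pd.Categorical(["Ideal", "Premium"]),
        "is_large": [False, True],
    }
)

numeric = df.select_dtypes(include="number")        # ints and floats
categorical = df.select_dtypes(include="category")  # category columns only
non_numeric = df.select_dtypes(exclude=["number"])  # everything else
```

Note that boolean columns are not treated as numeric, so `is_large` ends up in `non_numeric`.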


## 22. `T`

All DataFrames have a simple `T` attribute, which stands for *transpose*. You may not use it often, but I find it quite useful when displaying the output of the `describe` method:

In [141]:
boston.describe().T.head(10)

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mean radius | 569.0 | 14.127292 | 3.524049 | 6.981 | 11.7 | 13.37 | 15.78 | 28.11 |
| mean texture | 569.0 | 19.289649 | 4.301036 | 9.71 | 16.17 | 18.84 | 21.8 | 39.28 |
| mean perimeter | 569.0 | 91.969033 | 24.298981 | 43.79 | 75.17 | 86.24 | 104.1 | 188.5 |
| mean area | 569.0 | 654.889104 | 351.914129 | 143.5 | 420.3 | 551.1 | 782.7 | 2501.0 |
| mean smoothness | 569.0 | 0.09636 | 0.014064 | 0.05263 | 0.08637 | 0.09587 | 0.1053 | 0.1634 |
| mean compactness | 569.0 | 0.104341 | 0.052813 | 0.01938 | 0.06492 | 0.09263 | 0.1304 | 0.3454 |
| mean concavity | 569.0 | 0.088799 | 0.07972 | 0.0 | 0.02956 | 0.06154 | 0.1307 | 0.4268 |
| mean concave points | 569.0 | 0.048919 | 0.038803 | 0.0 | 0.02031 | 0.0335 | 0.074 | 0.2012 |
| mean symmetry | 569.0 | 0.181162 | 0.027414 | 0.106 | 0.1619 | 0.1792 | 0.1957 | 0.304 |
| mean fractal dimension | 569.0 | 0.062798 | 0.00706 | 0.04996 | 0.0577 | 0.06154 | 0.06612 | 0.09744 |


The dataset above has 30 numeric columns (its column names match scikit-learn's breast cancer dataset rather than Boston housing). If you call `describe` as-is, the DataFrame will stretch horizontally, making it hard to compare the statistics. Taking the transpose switches the axes so that the summary statistics are given in columns.

## 23. Pandas Styler

Did you know that Pandas allows you to style DataFrames?

DataFrames have a `style` attribute which opens doors to customizations and styles only limited by your HTML and CSS knowledge. I won't discuss the full details of what you can do with `style` but only show you my favorite functions:

In [146]:
diabetes.describe().T.drop("count", axis=1).style.highlight_max(color="darkred")

| | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pregnancies | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


Above, we are highlighting the cells that hold the maximum value of each column. Another cool styler is `background_gradient`, which gives columns a gradient background color based on their values:

In [148]:
diabetes.describe().T.drop("count", axis=1).style.background_gradient(
    subset=["mean", "50%"], cmap="Reds"
)

| | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pregnancies | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


This feature comes in especially handy when you are using `describe` on a table with many columns and want to compare summary statistics. Check out the documentation of the styler [here](https://pandas.pydata.org/docs/reference/style.html).