# 6 Pandas Mistakes That Silently Tell You Are a Rookie
## No error messages - that's what makes them subtle
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@michalmatlon?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Michal Matlon</a>
        on 
        <a href='https://unsplash.com/s/photos/problem?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

# Setup

In [5]:
import logging
import time
import warnings

import catboost as cb
import datatable as dt
import joblib
import lightgbm as lgbm
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns
import shap
import umap
import umap.plot
import xgboost as xgb
from optuna.samplers import TPESampler
from sklearn.compose import *
from sklearn.impute import *
from sklearn.metrics import *
from sklearn.model_selection import *
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import *

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%d-%b-%y %H:%M:%S", level=logging.INFO
)
optuna.logging.set_verbosity(optuna.logging.WARNING)
warnings.filterwarnings("ignore")
pd.set_option("float_format", "{:.5f}".format)

# Introduction

We are all used to the big, fat, red error messages that frequently pop up while we code. Fortunately, people won't usually see them because we always fix them. But how about the mistakes that give no errors? These are the most dangerous ones, and they embarrass us the most when more experienced folks spot them.

These mistakes are not related to the API or syntax of the tool you are using but are directly tied to theory and your experience level. Today, we will talk about 6 such mistakes that come up often among beginner Pandas users, and we will learn how to solve them.

# 1. Using Pandas itself

It is kind of ironic that the first mistake is related to actually using Pandas for certain tasks. Specifically, today's real-world tabular datasets are just massive. To read them into your environment with Pandas would be a huge mistake. 

Why? Because it is so damn slow! Below, we load the TPS October dataset that has 1M rows and ~300 features, taking up a whopping 2.2GB of disk space. 

In [29]:
import pandas as pd

In [3]:
%%time

tps_october = pd.read_csv("data/train.csv")

Wall time: 21.8 s


It took ~22 seconds. Now, you might be saying that 22 seconds isn't that much but imagine this. In a single project, you will perform many experiments during different stages. You will probably create separate scripts or notebooks for cleaning, feature engineering, choosing a model and many more for other tasks.

Waiting for the data to load for 20 seconds really gets on your nerves. Besides, your dataset will probably be much larger. So, what is a faster solution?

The solution is to ditch Pandas at this stage and use other alternatives that are specifically designed for fast IO. My favorite one is `datatable` but you can also go for `Dask`, `Vaex`, `cuDF`, etc. Here is how long it takes to load the same dataset with `datatable`:

In [30]:
import datatable as dt  # pip install datatable

In [4]:
%%time

tps_dt_october = dt.fread("data/train.csv").to_pandas()

Wall time: 2 s


Just 2 seconds!
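If you want to check the difference on your own machine, a small timing harness along these lines can help. This is only a sketch under assumptions: the data is a synthetic 20,000-row CSV written to a temporary directory, `time_reader` is a hypothetical helper, and the `datatable` branch runs only if the package is installed. Timings on a file this small will not resemble the 2.2GB case.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_reader(reader, path):
    """Call reader(path) once and return (dataframe, elapsed_seconds)."""
    start = time.perf_counter()
    frame = reader(path)
    return frame, time.perf_counter() - start


# Synthetic stand-in for a real dataset (an assumption made for this sketch).
rng = np.random.default_rng(0)
synthetic = pd.DataFrame(
    rng.random((20_000, 20)), columns=[f"f{i}" for i in range(20)]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "synthetic.csv")
    synthetic.to_csv(csv_path, index=False)

    pandas_df, pandas_seconds = time_reader(pd.read_csv, csv_path)
    print(f"pandas.read_csv: {pandas_seconds:.3f} s")

    try:
        import datatable as dt  # optional dependency
    except ImportError:
        dt_df = None
        print("datatable is not installed; skipping that comparison")
    else:
        dt_df, dt_seconds = time_reader(
            lambda p: dt.fread(p).to_pandas(), csv_path
        )
        print(f"datatable.fread: {dt_seconds:.3f} s")
```

Run it a few times before drawing conclusions, since the first read is often affected by disk caching.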

# 2. No vectors?

One of the craziest rules in [functional programming](https://en.wikipedia.org/wiki/Functional_programming) is to never use loops (along with the "no variables" rule). It seems that sticking to this "no-loops" rule while using Pandas is the best you can do to speed up computations. 

Functional programming replaces loops with recursion. Fortunately, we don't have to be so hard on ourselves because we can just use vectorization!

Vectorization, which is at the heart of Pandas and NumPy, is the process of performing mathematical operations on whole arrays rather than individual scalars. The best part is that Pandas already has a large suite of vectorized functions, eliminating the need to reinvent the wheel. 

All arithmetic operators in Python (+, -, \*, /, \*\*) work in a vectorized manner when used on Pandas series or dataframes. Also, any other mathematical function you see in Pandas or NumPy is already vectorized.
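As a small illustration of what "vectorized" means in practice, here is a sketch on made-up data (the `prices` and `quantities` series are hypothetical and not part of the TPS dataset). Each expression operates on every element at once, and the explicit loop is included only for comparison.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0])
quantities = pd.Series([1, 2, 3])

# One expression operates on every element; no explicit loop is written.
revenue = prices * quantities
scaled = (prices - prices.mean()) / prices.std()
log_prices = np.log(prices)  # NumPy functions accept a Series directly

# The equivalent Python loop, shown for comparison only.
revenue_loop = pd.Series([p * q for p, q in zip(prices, quantities)])

print(revenue.equals(revenue_loop))
```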

To see the speed increase, we will use the below `big_function` that takes 3 columns as an input and performs some meaningless arithmetic:

In [16]:
def big_function(col1, col2, col3):
    return np.log(col1 ** 10 / col2 ** 9 + np.sqrt(col3 ** 3))

First, we will use this function with `apply`, one of the faster ways to iterate over rows in Pandas:

In [19]:
%time tps_october['f1000'] = tps_october.apply(lambda row: big_function(row['f0'], row['f1'], row['f2']), axis=1)

Wall time: 20.1 s


The operation took 20 seconds. Let's do the same by using the core NumPy arrays in a vectorized manner:

In [18]:
%time tps_october['f1001'] = big_function(tps_october['f0'].values, tps_october['f1'].values, tps_october['f2'].values)

Wall time: 82 ms


82 milliseconds, which is about 250 times faster. 

It is true that you can't completely ditch loops. After all, not all data manipulation operations are mathematical. But whenever you find yourself itching to use some type of looping functions like `apply`, `applymap` or `itertuples`, take a moment to see if what you want to do can be vectorized. 
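Conditional logic is a common case where `apply` looks unavoidable but is not. The sketch below, on a tiny hypothetical dataframe, labels rows with `np.where` and checks that the result matches the row-wise version. The column names, thresholds and labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"f0": [0.2, 0.8, 0.5, 0.9], "f1": [0.1, 0.3, 0.7, 0.4]})

# Row-wise version: a Python function is called once per row.
df["label_apply"] = df.apply(
    lambda row: "high" if row["f0"] > 0.5 and row["f1"] < 0.5 else "low",
    axis=1,
)

# Vectorized version: the condition is evaluated on whole columns at once.
condition = (df["f0"] > 0.5) & (df["f1"] < 0.5)
df["label_vectorized"] = np.where(condition, "high", "low")

print((df["label_apply"] == df["label_vectorized"]).all())
```

Note the use of `&` with parentheses instead of `and`: boolean operators on columns have to be the element-wise ones.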

# 3. Data types, dtypes, types!

No, this is not that "Change the default data types of Pandas columns" lesson you received in middle school. Here, we will go much deeper. Specifically, we discuss data types in terms of their memory usage.

The worst and most memory consuming data type is `object`, which also happens to limit some of the features of Pandas. Next, we have floats and integers. Actually, I don't want to list all Pandas data types, so why don't you take a look at this table:

![](https://miro.medium.com/max/1050/0*jUWj8UtW_gOYuZh0.png)

<figcaption style="text-align: center;">
    <strong>
        Source: http://pbpython.com/pandas_dtypes.html
    </strong>
</figcaption>

The most interesting part of this table is that we have multiple type variations for floats and integers. This is also the most critical part for reducing memory consumption of the datasets. 

The numbers after the data type name represent how many bits of memory each number in this data type will take. So, the idea is to cast every column in our dataset to the smallest subtype possible. How do you know which one to choose? Well, here is another table for you:

![](https://miro.medium.com/max/1050/0*T0KacMFCMtlSrd1l.png)
<figcaption style="text-align: center;">
    <strong>
        Source: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html
    </strong>
</figcaption>

Generally, you want to cast floats to `float16/32` and columns with both positive and negative integers to `int8/16/32`, based on the above table. You can also use `uint8` for booleans and small non-negative integers (0 to 255) to reduce memory consumption further. Keep in mind that `float16` stores only about three significant decimal digits, so casting to it rounds your values.
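To make the trade-off concrete, the following sketch prints the range of each integer subtype with `np.iinfo`, measures the same 1,000 integers at two widths, and compares the rounding error of `float16` and `float32`. The `ages` series and the sample value are made up for illustration.

```python
import numpy as np
import pandas as pd

# Representable range of each integer subtype.
for int_type in (np.int8, np.int16, np.int32, np.int64, np.uint8):
    info = np.iinfo(int_type)
    print(f"{info.dtype}: {info.min} to {info.max}")

# The same 1,000 small integers stored at two widths.
ages = pd.Series(np.arange(1000) % 100, dtype=np.int64)
bytes_int64 = ages.memory_usage(index=False)
bytes_int8 = ages.astype(np.int8).memory_usage(index=False)
print(bytes_int64, bytes_int8)

# Narrower floats round the stored value; float16 rounds much more coarsely.
value = 0.123456789
error_float16 = abs(float(np.float16(value)) - value)
error_float32 = abs(float(np.float32(value)) - value)
print(error_float16, error_float32)
```

Whether the `float16` rounding is acceptable depends on what the column is used for, so it is worth checking before casting blindly.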

Here is a handy but long function that casts floats and integers to their smallest subtype based on the above table:

In [31]:
def reduce_memory_usage(df, verbose=True):
    """Downcast the numeric columns of df in place to the smallest fitting subtype."""
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2  # size in MB before casting
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                # Pick the narrowest integer type whose range covers [c_min, c_max].
                # The bounds are inclusive so that values such as 127 still fit int8.
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # Otherwise the column needs the full int64 range and is left as is.
            else:
                # float16 keeps only about 3 significant digits, so values get rounded.
                if (
                    c_min >= np.finfo(np.float16).min
                    and c_max <= np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min >= np.finfo(np.float32).min
                    and c_max <= np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2  # size in MB after casting
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

Let's use it on the TPS October data and see how much reduction we can get:

In [32]:
reduce_memory_usage(tps_october)

Mem. usage decreased to 509.26 Mb (76.9% reduction)


| | id | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | ... | f278 | f279 | f280 | f281 | f282 | f283 | f284 | target | f1001 | f1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.20593 | 0.41089 | 0.17676 | 0.22363 | 0.42358 | 0.47607 | 0.41357 | 0.61182 | 0.53467 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -2.59375 | -2.59375 |
| 1 | 1 | 0.18103 | 0.47314 | 0.01173 | 0.21362 | 0.61963 | 0.44165 | 0.23035 | 0.68604 | 0.28198 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -6.64453 | -6.64453 |
| 2 | 2 | 0.18262 | 0.30737 | 0.32593 | 0.20715 | 0.60547 | 0.30981 | 0.49341 | 0.75098 | 0.53613 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | -1.67285 | -1.67285 |
| 3 | 3 | 0.18030 | 0.49463 | 0.00837 | 0.22363 | 0.76074 | 0.43921 | 0.43213 | 0.77637 | 0.48389 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | -7.14844 | -7.14844 |
| 4 | 4 | 0.17712 | 0.49561 | 0.01426 | 0.54883 | 0.62549 | 0.56250 | 0.11719 | 0.56104 | 0.07709 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | -6.36328 | -6.36328 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999995 | 999995 | 0.20435 | 0.34473 | 0.26221 | 0.22839 | 0.61084 | 0.35742 | 0.49048 | 0.61377 | 0.50928 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | -1.99414 | -1.99414 |
| 999996 | 999996 | 0.18201 | 0.56396 | 0.24255 | 0.24121 | 0.45361 | 0.46948 | 0.47754 | 0.65918 | 0.51904 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | -2.12500 | -2.12500 |
| 999997 | 999997 | 0.25024 | 0.49146 | 0.09857 | 0.23560 | 0.77148 | 0.36792 | 0.53174 | 0.59814 | 0.61865 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -3.45703 | -3.45703 |
| 999998 | 999998 | 0.20361 | 0.53516 | 0.18018 | 0.21313 | 0.65479 | 0.53516 | 0.31616 | 0.65234 | 0.39795 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -2.57031 | -2.57031 |


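We compressed the dataset from the original 2.2GB to about 510MB. Unfortunately, this saving is lost if we write the dataframe to a CSV file, because CSV stores plain text and not dtypes. The columns are inferred again as 64-bit types the next time the file is read.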
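The loss of dtypes is easy to reproduce with a CSV round trip. The sketch below uses a tiny hypothetical dataframe and an in-memory buffer instead of a file. It also shows one way to get the compact types back, by passing an explicit `dtype` mapping to `pd.read_csv`.

```python
import io

import numpy as np
import pandas as pd

compact = pd.DataFrame(
    {
        "flag": np.array([0, 1, 1, 0], dtype=np.int8),
        "score": np.array([0.25, 0.5, 0.75, 1.0], dtype=np.float32),
    }
)

# With no path given, to_csv returns the CSV content as a string.
csv_text = compact.to_csv(index=False)

# Default read: dtypes are inferred again, at 64-bit width.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Explicit dtypes restore the compact representation while parsing.
restored = pd.read_csv(
    io.StringIO(csv_text), dtype={"flag": "int8", "score": "float32"}
)
print(restored.dtypes)
```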

Why was this a mistake again? Well, RAM consumption plays a big part when you play with such big toys using big machine learning models. Once you get a few OutOfMemory errors, you start to catch on and learn tricks like this to keep your computer happy.

# 4. No styling?

One of the most wonderful features of Pandas is its ability to display stylized dataframes. Raw dataframes are rendered as HTML tables with a bit of CSS inside Jupyter.

For people with style and who want to go the extra mile to make their notebooks more colorful and appealing, Pandas allows styling its DataFrames through the `style` attribute.

In [24]:
tps_october.sample(20, axis=1).describe().T.style.bar(
    subset=["mean"], color="#205ff2"
).background_gradient(subset=["std"], cmap="Reds").background_gradient(
    subset=["50%"], cmap="coolwarm"
)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| f227 | 1000000.0 | 0.042318 | 0.097016 | 4.1e-05 | 0.006148 | 0.008488 | 0.011224 | 0.995261 |
| f143 | 1000000.0 | 0.49912 | 0.218694 | 0.0 | 0.506243 | 0.569084 | 0.62275 | 0.99971 |
| f234 | 1000000.0 | 0.111832 | 0.106602 | 0.024656 | 0.035676 | 0.081907 | 0.140333 | 0.996664 |
| f166 | 1000000.0 | 0.030398 | 0.094173 | 0.0 | 0.006335 | 0.008709 | 0.011248 | 1.0 |
| f90 | 1000000.0 | 0.642243 | 0.141101 | 0.0 | 0.53981 | 0.619562 | 0.778886 | 0.981087 |
| f84 | 1000000.0 | 0.066512 | 0.104063 | 0.0 | 0.012872 | 0.018155 | 0.097201 | 1.0 |
| f121 | 1000000.0 | 0.186078 | 0.040156 | 0.010368 | 0.169747 | 0.170958 | 0.191037 | 0.973322 |
| f243 | 1000000.0 | 0.213172 | 0.409548 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| f30 | 1000000.0 | 0.118508 | 0.099597 | 0.0 | 0.056052 | 0.059557 | 0.194628 | 0.967574 |
| f67 | 1000000.0 | 0.223665 | 0.077721 | 0.053498 | 0.174431 | 0.208975 | 0.247528 | 0.939481 |


Above, we randomly choose 20 columns, compute their summary statistics with `describe()`, transpose the result and color the mean, std and median columns based on their magnitude. I especially like how you can create background bar charts based on the magnitude of the means.

Changes like these make it easier to spot patterns in raw numbers without turning to visualization libraries. You can learn the full details of how you can style your dataframes from this [link](https://pandas.pydata.org/docs/user_guide/style.html).

Actually, there is nothing wrong with not styling your dataframes. However, this seemed such a good feature that not using it would be a missed opportunity.

# 5. Saving to CSVs

Just like reading CSV files is extremely slow, so is saving the data back to them. Here is how long it takes to save the TPS October data to CSV:

In [26]:
%%time

tps_october.to_csv("data/copy.csv")

Wall time: 2min 43s


It took almost 3 minutes. To be fair to everyone else and yourself, save your dataframes to some other lighter and cheaper format like feather or parquet.

In [27]:
%%time

tps_october.to_feather("data/copy.feather")

Wall time: 1.05 s


In [28]:
%%time

tps_october.to_parquet("data/copy.parquet")

Wall time: 7.84 s


As you can see, saving the dataframe to feather format took about 160 times less time than saving it to CSV. Besides, feather and parquet files also take much less storage. My favorite writer, Dario Radecic, has an entire series dedicated to CSV alternatives. You can check it out [here](https://medium.com/towards-data-science/stop-using-csvs-for-storage-here-are-the-top-5-alternatives-e3a7c9018de0).
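Another practical difference is that binary formats store column types alongside the data. The sketch below writes a tiny hypothetical dataframe to parquet in a temporary directory and reads it back. It assumes a parquet engine such as `pyarrow` is installed and simply skips the round trip if none is available.

```python
import os
import tempfile

import numpy as np
import pandas as pd

compact = pd.DataFrame(
    {
        "flag": np.array([0, 1, 1, 0], dtype=np.int8),
        "score": np.array([0.25, 0.5, 0.75, 1.0], dtype=np.float32),
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    parquet_path = os.path.join(tmp_dir, "compact.parquet")
    try:
        compact.to_parquet(parquet_path)  # needs pyarrow or fastparquet
        from_parquet = pd.read_parquet(parquet_path)
    except ImportError:
        from_parquet = None
        print("No parquet engine installed; skipping the round trip")
    else:
        # The narrow dtypes chosen before saving are still there after loading.
        print(from_parquet.dtypes)
```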

# 6. You should've read the user guide!

Actually, the most grievous mistake in this list is not reading the [User Guide](https://pandas.pydata.org/docs/user_guide/index.html) or the documentation of Pandas.

I understand. We all have this weird thing when it comes to documentation. We'd rather scour the Internet for hours than read the docs.

However, there is no excuse for that when it comes to Pandas. It has an exceptional user guide covering topics right from the basics to actually contributing and making Pandas more awesome.

In fact, you could've learned about all the mistakes I mentioned today from the user guide. It is even true that the [section on reading large datasets](https://pandas.pydata.org/docs/user_guide/io.html) specifically tells you to use other packages like `Dask` to read massive files and stay away from Pandas. If I had the time to read the user guide from start to finish, I would've probably come up with 50 more beginner mistakes but now that you know what to do, I leave the rest to you.

# Summary

Today, we have learned about the 6 most common mistakes beginners make while using Pandas. 

I do want you to note that most of these only really count as mistakes when you work with gigabyte-sized datasets. You might as well forget about them if you are still playing around with toy datasets because the solutions won't make much of a difference.

However, as you improve your skills and start tackling real-world datasets, the concepts will eventually be very helpful. 


# Before you leave, my readers are loving these - why don't you give them a check?
https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5

https://towardsdatascience.com/25-numpy-functions-you-never-knew-existed-p-guarantee-0-85-64616ba92fa8

https://towardsdatascience.com/7-cool-python-packages-kagglers-are-using-without-telling-you-e83298781cf4

https://towardsdatascience.com/why-is-everyone-at-kaggle-obsessed-with-optuna-for-hyperparameter-tuning-7608fdca337c

https://towardsdatascience.com/how-to-work-with-million-row-datasets-like-a-pro-76fb5c381cdd

https://towardsdatascience.com/love-3blue1brown-animations-learn-how-to-create-your-own-in-python-in-10-minutes-8e0430cf3a6d
