# 7 Pandas Mistakes That Silently Tell You Are a Rookie
## No error messages - that's what makes them subtle
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@michalmatlon?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Michal Matlon</a>
        on 
        <a href='https://unsplash.com/s/photos/problem?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

# Setup

In [5]:
import logging
import time
import warnings

import catboost as cb
import datatable as dt
import joblib
import lightgbm as lgbm
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns
import shap
import umap
import umap.plot
import xgboost as xgb
from optuna.samplers import TPESampler
from sklearn.compose import *
from sklearn.impute import *
from sklearn.metrics import *
from sklearn.model_selection import *
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import *

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%d-%b-%y %H:%M:%S", level=logging.INFO
)
optuna.logging.set_verbosity(optuna.logging.WARNING)
warnings.filterwarnings("ignore")
pd.set_option("float_format", "{:.5f}".format)

# Introduction

# 1. Using Pandas itself

It is somewhat ironic that the first mistake is using Pandas at all for certain tasks. Today's real-world tabular datasets are often massive, and reading them into your environment with Pandas can be a grievous mistake.

Why? Because it is so damn slow! Below, we load the TPS October dataset, which has 1M rows and ~300 features and takes up a whopping 2.2 GB of disk space.

In [29]:
import pandas as pd

In [3]:
%%time

tps_october = pd.read_csv("data/train.csv")

Wall time: 21.8 s


It took ~22 seconds. You might say that 22 seconds isn't that much, but imagine this: in a single project, you will run many experiments at different stages. You will probably create separate scripts or notebooks for cleaning, feature engineering, choosing a model and many other tasks, and each of them has to load the data again.

Waiting ~20 seconds for the data to load every time really gets on your nerves. Besides, your dataset may well be much larger. So, what is a faster solution?

The solution is to ditch Pandas at this stage and use alternatives that are specifically designed for fast IO. My favorite one is `datatable`, but you can also go for `Dask`, `Vaex`, `cuDF`, etc. Here is how long it takes to load the same dataset with `datatable`:

In [30]:
import datatable as dt  # pip install datatable

In [4]:
%%time

tps_dt_october = dt.fread("data/train.csv").to_pandas()

Wall time: 2 s


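Just 2 seconds! That is roughly ten times faster, and it includes converting the result back to a Pandas DataFrame.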
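The `%%time` magic used above only works inside IPython or Jupyter. In a plain script, a similar comparison can be sketched with `time.perf_counter`. The helper name `timed_load` and the small temporary CSV are assumptions made for illustration. Any loader callable, for example `lambda p: dt.fread(p).to_pandas()`, could be passed in the same way.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def timed_load(loader, path):
    """Hypothetical helper: call loader(path), return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = loader(path)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Build a small throwaway CSV so the sketch runs anywhere.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((1_000, 5)), columns=[f"f{i}" for i in range(5)])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")
    demo.to_csv(csv_path, index=False)
    loaded, seconds = timed_load(pd.read_csv, csv_path)

print(f"pd.read_csv: {seconds:.4f} s for {loaded.shape[0]} rows")
```

On a file this small the numbers mean little. The point is only the measuring pattern, which can be pointed at the real dataset and at each candidate loader in turn.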

# 2. No vectors?

One of the craziest rules in [functional programming](https://en.wikipedia.org/wiki/Functional_programming) is to never use loops (along with the "no variables" rule). Sticking to this "no-loops" rule while using Pandas is one of the best things you can do to speed up computations.

Functional programming replaces loops with recursion. Fortunately, we don't have to be so hard on ourselves because we can just use vectorization!

Vectorization, which is at the heart of Pandas and NumPy, is the process of performing mathematical operations on whole arrays rather than individual scalars. The best part is that Pandas already has a large suite of vectorized functions, eliminating the need to reinvent the wheel. 

All arithmetic operators in Python (`+`, `-`, `*`, `/`, `**`) work in a vectorized manner when used on Pandas series or dataframes. Most other mathematical functions you see in Pandas or NumPy are vectorized as well.
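As a minimal illustration of what "vectorized" means here, the sketch below applies a few operators to two small, made-up series (`prices` and `quantities` are hypothetical names, not columns from the dataset):

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0])
quantities = pd.Series([1.0, 2.0, 3.0])

# Each operator is applied element by element, with no explicit loop.
revenue = prices * quantities   # 10, 40, 90
squared = quantities ** 2       # 1, 4, 9
ratio = prices / quantities     # 10, 10, 10

# NumPy functions accept a Series and return a Series as well.
roots = np.sqrt(squared)        # 1, 2, 3
```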

To see the speed increase, we will use the below `big_function` that takes 3 columns as an input and performs some meaningless arithmetic:

In [16]:
def big_function(col1, col2, col3):
    return np.log(col1 ** 10 / col2 ** 9 + np.sqrt(col3 ** 3))

First, we will use this function with `apply`, one of Pandas's row-wise iteration tools:

In [19]:
%time tps_october['f1000'] = tps_october.apply(lambda row: big_function(row['f0'], row['f1'], row['f2']), axis=1)

Wall time: 20.1 s


The operation took 20 seconds. Let's do the same using the underlying NumPy arrays in a vectorized manner:

In [18]:
%time tps_october['f1001'] = big_function(tps_october['f0'].values, tps_october['f1'].values, tps_october['f2'].values)

Wall time: 82 ms


82 milliseconds, which is about 250 times faster. 
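If you don't have the TPS dataset at hand, the comparison can be reproduced on synthetic data. The sketch below is an assumption-laden stand-in: the random frame, its size and the column names `f0`, `f1`, `f2` only mimic the real data, and the timings will differ from machine to machine. It also checks that both approaches return the same values.

```python
import time

import numpy as np
import pandas as pd


def big_function(col1, col2, col3):
    return np.log(col1 ** 10 / col2 ** 9 + np.sqrt(col3 ** 3))


# Synthetic stand-in for the real dataset; positive values keep log and sqrt well defined.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.uniform(0.1, 1.0, size=(10_000, 3)), columns=["f0", "f1", "f2"])

start = time.perf_counter()
row_wise = df.apply(lambda row: big_function(row["f0"], row["f1"], row["f2"]), axis=1)
apply_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized = big_function(df["f0"].values, df["f1"].values, df["f2"].values)
vector_seconds = time.perf_counter() - start

# Both approaches should produce the same numbers; only the speed differs.
assert np.allclose(row_wise.values, vectorized)
print(f"apply: {apply_seconds:.3f} s, vectorized: {vector_seconds:.5f} s")
```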

It is true that you can't completely ditch loops. After all, not all data manipulation operations are mathematical. But whenever you find yourself itching to use a looping function such as `apply`, `applymap` or `itertuples`, take a moment to see whether what you want to do can be vectorized.
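Conditional logic is a common case that looks like it needs a loop but often doesn't. As a small sketch on a made-up `score` column (a hypothetical example, not part of the dataset), `np.where` can replace a row-by-row `apply`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [35, 72, 50, 91]})

# Row-by-row version: calls a Python function once per value.
looped = df["score"].apply(lambda s: "pass" if s >= 50 else "fail")

# Vectorized version: one comparison over the whole column,
# then np.where picks a label for each element.
vectorized = np.where(df["score"] >= 50, "pass", "fail")
```

Both produce the same labels. The second version does the comparison once over the whole array instead of invoking a Python lambda per row.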

# 3. Data types, dtypes, types!

![](https://miro.medium.com/max/1050/0*jUWj8UtW_gOYuZh0.png)

![](https://miro.medium.com/max/1050/0*T0KacMFCMtlSrd1l.png)

In [21]:
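def reduce_memory_usage(df, verbose=True):
    """Downcast numeric columns to the smallest dtype that can hold their values."""
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2  # size in MB before downcasting
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                # Inclusive bounds: a column whose max is exactly 127 still fits in int8.
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Note: float16 keeps only about 3 significant decimal digits,
                # so this branch trades precision for memory.
                if (
                    c_min >= np.finfo(np.float16).min
                    and c_max <= np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min >= np.finfo(np.float32).min
                    and c_max <= np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2  # size in MB after downcasting
    if verbose:
        print(f"Memory usage went from {start_mem:.2f} MB to {end_mem:.2f} MB")
    return df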
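Pandas also ships a built-in way to downcast a single column, `pd.to_numeric` with the `downcast` argument. The sketch below uses it on a small made-up frame (the column names `small_int` and `ratio` are hypothetical) and compares memory before and after. It is an alternative to the function above, not a drop-in replacement: `downcast="float"` never goes below `float32`, so it is more conservative than the `float16` branch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "small_int": np.arange(1_000, dtype=np.int64) % 100,  # values 0..99
        "ratio": np.linspace(0.0, 1.0, 1_000),                # float64 by default
    }
)
before = df.memory_usage(deep=True).sum()

# Ask pandas for the smallest integer / float dtype that can hold the values.
df["small_int"] = pd.to_numeric(df["small_int"], downcast="integer")
df["ratio"] = pd.to_numeric(df["ratio"], downcast="float")
after = df.memory_usage(deep=True).sum()

print(df.dtypes)
print(f"{before} bytes -> {after} bytes")
```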

# 4. No styling?

In [24]:
tps_october.sample(20, axis=1).describe().T.style.bar(
    subset=["mean"], color="#205ff2"
).background_gradient(subset=["std"], cmap="Reds").background_gradient(
    subset=["50%"], cmap="coolwarm"
)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| f227 | 1000000.0 | 0.042318 | 0.097016 | 4.1e-05 | 0.006148 | 0.008488 | 0.011224 | 0.995261 |
| f143 | 1000000.0 | 0.49912 | 0.218694 | 0.0 | 0.506243 | 0.569084 | 0.62275 | 0.99971 |
| f234 | 1000000.0 | 0.111832 | 0.106602 | 0.024656 | 0.035676 | 0.081907 | 0.140333 | 0.996664 |
| f166 | 1000000.0 | 0.030398 | 0.094173 | 0.0 | 0.006335 | 0.008709 | 0.011248 | 1.0 |
| f90 | 1000000.0 | 0.642243 | 0.141101 | 0.0 | 0.53981 | 0.619562 | 0.778886 | 0.981087 |
| f84 | 1000000.0 | 0.066512 | 0.104063 | 0.0 | 0.012872 | 0.018155 | 0.097201 | 1.0 |
| f121 | 1000000.0 | 0.186078 | 0.040156 | 0.010368 | 0.169747 | 0.170958 | 0.191037 | 0.973322 |
| f243 | 1000000.0 | 0.213172 | 0.409548 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| f30 | 1000000.0 | 0.118508 | 0.099597 | 0.0 | 0.056052 | 0.059557 | 0.194628 | 0.967574 |
| f67 | 1000000.0 | 0.223665 | 0.077721 | 0.053498 | 0.174431 | 0.208975 | 0.247528 | 0.939481 |


# 5. Saving to CSVs

In [26]:
%%time

tps_october.to_csv("data/copy.csv")

Wall time: 2min 43s


In [27]:
%%time

tps_october.to_feather("data/copy.feather")

Wall time: 1.05 s


In [28]:
%%time

tps_october.to_parquet("data/copy.parquet")

Wall time: 7.84 s


# 6. You should've read the user guide!

# Summary