# DataFrame-Reducer Explanation

This library provides a convenient function to reduce the memory usage of a DataFrame to the extent possible.

In short, calling `reduce_size` with default parameters:
* Downcasts `int64` columns (the pandas default) to a narrower integer type (e.g. `int32`), or to an unsigned type (`uint`) whenever possible.
* Downcasts `float64` columns to a narrower float type, but cautiously.
* Converts string columns to categories when it makes sense (i.e. when there are few distinct values and memory usage is therefore reduced).
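
What "cautiously" means for floats isn't detailed above. One plausible check, sketched here purely as an assumption, is to downcast to `float32` only when the values survive a round trip within a relative tolerance. The helper `can_downcast_to_float32` and its `rtol` parameter are hypothetical and not part of `df_reducer`.

```python
import numpy as np

def can_downcast_to_float32(values, rtol=1e-6):
    """Hypothetical check: do the values survive a float32 round trip?"""
    arr = np.asarray(values, dtype=np.float64)
    finite = arr[np.isfinite(arr)]  # NaN and inf are representable in float32
    if finite.size == 0:
        return True
    # Values beyond the float32 range would overflow to inf.
    if np.abs(finite).max() > np.finfo(np.float32).max:
        return False
    roundtrip = finite.astype(np.float32).astype(np.float64)
    # Accept only if every value is within the relative tolerance.
    return bool(np.allclose(roundtrip, finite, rtol=rtol, atol=0.0))

print(can_downcast_to_float32(np.linspace(1, 5001, 10000)))
print(can_downcast_to_float32([1e40]))
```

A stricter `rtol` makes the check more conservative, since `float32` only carries about 7 significant decimal digits.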

Integer columns are sized from the minimum and maximum of their observed values, widened by a `margin` parameter (default 20%). That way, if new data arrives with values somewhat below the minimum or above the maximum, they still fit in the chosen type and you don't get an overflow.

Another useful parameter is `round_cols`, which takes the names of float columns whose decimals you don't care about. Those columns are then converted to the most efficient integer representation.

Also, if a column has no values below zero, it gets a `uint` type rather than `int`, even if the margin is large. You can change this behavior with `allow_negatives=True`.
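
The integer rules above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the helper `pick_int_dtype` is hypothetical, and it assumes the margin widens the observed range multiplicatively, i.e. the maximum is scaled by `1 + margin`.

```python
import numpy as np

def pick_int_dtype(col_min, col_max, margin=0.2, allow_negatives=False):
    """Hypothetical helper: narrowest integer dtype for a widened range."""
    # Assumption: the margin pushes each bound away from zero by (1 + margin).
    hi = col_max * (1 + margin) if col_max > 0 else col_max
    lo = col_min * (1 + margin) if col_min < 0 else col_min
    # Use signed types if negatives were seen or are explicitly allowed.
    signed = allow_negatives or lo < 0
    if signed:
        candidates = (np.int8, np.int16, np.int32, np.int64)
    else:
        candidates = (np.uint8, np.uint16, np.uint32, np.uint64)
    for dtype in candidates:
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return np.dtype(dtype)
    return np.dtype(np.int64)  # nothing narrower fits; keep the pandas default

print(pick_int_dtype(1, 50000))
print(pick_int_dtype(1, 50000, margin=5))
print(pick_int_dtype(1, 50000, margin=5, allow_negatives=True))
```

Under these assumptions, a column ranging from 1 to 50000 would map to `uint16` by default, to `uint32` with `margin=5`, and to `int32` once negatives are allowed.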

Let's see a quick example.

## A quick example

In [1]:
import numpy as np
import pandas as pd
from df_reducer import reduce_size

In [2]:
65535/1.2  # largest maximum that fits in uint16 once the default 20% margin is added

54612.5

In [3]:
df = pd.DataFrame({
    'a': [50000] + list(range(1, 10000)),
    'b': np.linspace(1, 5001, 10000),
    'c': [np.nan] + list(range(1, 10000)),  # column that is almost int, but has nan
    'd': ['x', 'y', 'z', 'x', 'x'] * 2000,
    'e': ['id' + str(i) for i in range(10000)],  # too many categories
})

In [4]:
df.head(10)

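|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 50000 | 1.0 | NaN | x | id0 |
| 1 | 1 | 1.50005 | 1.0 | y | id1 |
| 2 | 2 | 2.0001 | 2.0 | z | id2 |
| 3 | 3 | 2.50015 | 3.0 | x | id3 |
| 4 | 4 | 3.0002 | 4.0 | x | id4 |
| 5 | 5 | 3.50025 | 5.0 | x | id5 |
| 6 | 6 | 4.0003 | 6.0 | y | id6 |
| 7 | 7 | 4.50035 | 7.0 | z | id7 |
| 8 | 8 | 5.0004 | 8.0 | x | id8 |
| 9 | 9 | 5.50045 | 9.0 | x | id9 |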


Size of the DataFrame ('memory usage'):

In [5]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
a    10000 non-null int64
b    10000 non-null float64
c    9999 non-null float64
d    10000 non-null object
e    10000 non-null object
dtypes: float64(2), int64(1), object(2)
memory usage: 1.5 MB


Reduce the size:

In [6]:
reduced = reduce_size(df.copy())
reduced.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
a    10000 non-null uint16
b    10000 non-null float32
c    9999 non-null float32
d    10000 non-null category
e    10000 non-null object
dtypes: category(1), float32(2), object(1), uint16(1)
memory usage: 721.9 KB


Size without the last column:

In [15]:
df.drop('e', axis=1).info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
a    10000 non-null int64
b    10000 non-null float64
c    9999 non-null float64
d    10000 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 879.0 KB


In [16]:
reduced.drop('e', axis=1).info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
a    10000 non-null uint16
b    10000 non-null float32
c    9999 non-null float32
d    10000 non-null category
dtypes: category(1), float32(2), uint16(1)
memory usage: 107.8 KB


Leaving out `e`, memory usage drops from 879.0 KB to 107.8 KB, less than 1/8 of the original!

Comparison of the size of the columns before and after (in KB):

In [7]:
pd.DataFrame({'Before':df.memory_usage(deep=True), 'After':reduced.memory_usage(deep=True)})/1e3

|   | Before | After |
|---|---|---|
| Index | 0.08 | 0.08 |
| a | 80.0 | 20.0 |
| b | 80.0 | 40.0 |
| c | 80.0 | 40.0 |
| d | 660.0 | 10.278 |
| e | 628.89 | 628.89 |


Comparison of the dtypes of the columns before and after:

In [8]:
pd.DataFrame({'Before':df.dtypes, 'After':reduced.dtypes})

|   | Before | After |
|---|---|---|
| a | int64 | uint16 |
| b | float64 | float32 |
| c | float64 | float32 |
| d | object | category |
| e | object | object |


Note that the last column, `e`, was not converted to category. Doing so would use more memory, not less, because there are too many unique categories, so `reduce_size` leaves it as is.
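
To see why the conversion only pays off with few distinct values, you can measure both representations directly with plain pandas. The helper `plain_vs_category_bytes` below is a hypothetical illustration, not a `df_reducer` function, and the exact byte counts depend on your pandas version.

```python
import pandas as pd

def plain_vs_category_bytes(series):
    """Hypothetical helper: bytes used as-is versus as a category column."""
    plain = series.memory_usage(index=False, deep=True)
    as_category = series.astype("category").memory_usage(index=False, deep=True)
    return plain, as_category

few_distinct = pd.Series(["x", "y", "z", "x", "x"] * 2000)       # like column d
all_distinct = pd.Series(["id" + str(i) for i in range(10000)])  # like column e

few_plain, few_cat = plain_vs_category_bytes(few_distinct)
many_plain, many_cat = plain_vs_category_bytes(all_distinct)

print("few distinct values:", few_plain, "->", few_cat)
print("all distinct values:", many_plain, "->", many_cat)
```

A category column stores every distinct value once plus one integer code per row, so when all values are distinct the codes are pure overhead.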

## Specifying bigger margins

In [9]:
np.iinfo('uint16')

iinfo(min=0, max=65535, dtype=uint16)

The largest number `uint16` can represent is 65535, and the maximum of the first column is 50000. If you believe that values well above that may appear later on, specify a bigger margin, say 500% more (`margin=5`):

In [19]:
reduce_size(df.copy(), margin=5).dtypes

a      uint32
b     float32
c     float32
d    category
e      object
dtype: object

Now the first column is `uint32`!

Note that, since the series is positive, it isn't turned into a signed `int`. `reduce_size` assumes that a positive series stays positive on new data. More on this later.
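
The jump from `uint16` to `uint32` can be checked with a bit of arithmetic. The sketch below assumes that the margin scales the observed maximum by `1 + margin` (an assumption about the library's internals), and uses NumPy's `np.min_scalar_type` to find the smallest type that holds the result.

```python
import numpy as np

observed_max = 50000  # maximum of column a

required_dtypes = {}
for margin in (0.2, 5):
    # Assumed semantics: leave room for values up to max * (1 + margin).
    required = int(np.ceil(observed_max * (1 + margin)))
    # For non-negative integers, min_scalar_type picks an unsigned type.
    required_dtypes[margin] = np.min_scalar_type(required)
    print(margin, required, required_dtypes[margin])
```

With the default margin the widened maximum still fits below 65535, while `margin=5` pushes it to 300000, which needs 32 bits.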

## Rounding columns for even more efficient memory usage

You can specify float columns to round and convert to `int`/`uint` for even more efficient memory usage. This doesn't work on float columns with missing values (like `c`):

In [11]:
reduce_size(df.copy(), round_cols=['b','c']).head()

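|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 50000 | 1 | NaN | x | id0 |
| 1 | 1 | 2 | 1.0 | y | id1 |
| 2 | 2 | 2 | 2.0 | z | id2 |
| 3 | 3 | 3 | 3.0 | x | id3 |
| 4 | 4 | 3 | 4.0 | x | id4 |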


Note that `b` is now an integer column, but `c` is not. Still, `c` is optimized to `float32`:

In [12]:
reduce_size(df.copy(), round_cols=['b','c']).dtypes

a      uint16
b      uint16
c     float32
d    category
e      object
dtype: object
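
The reason `c` stays a float is that NumPy integer types have no way to represent `NaN`. The snippet below reproduces the effect with plain pandas; it is an illustration of the constraint, not the code `reduce_size` runs.

```python
import numpy as np
import pandas as pd

b = pd.Series(np.linspace(1, 5001, 10000))
c = pd.Series([np.nan] + list(range(1, 10000)), dtype="float64")

# No missing values: rounding and casting to an unsigned integer works.
b_rounded = b.round().astype("uint16")

# One NaN is enough to make the integer cast fail.
try:
    c.round().astype("uint16")
    c_cast_failed = False
except ValueError:
    c_cast_failed = True

print(b_rounded.head().tolist())
print("cast of c failed:", c_cast_failed)
```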

## Allow negative values on positive series (in case you believe they are possible)

In [13]:
reduce_size(df.copy(), margin=5, allow_negatives=True).dtypes

a       int32
b     float32
c     float32
d    category
e      object
dtype: object

Note that `a` changed to `int32` instead of `uint32`!