### When working with very large datasets, it is important to optimize memory usage

#### In this notebook we discuss "categorical mapping" and "type conversion", two commonly used memory optimization techniques

In [21]:
import pandas as pd ## To create the dataframe.
import numpy as np ## To look up the value ranges of integer data types.

In [22]:
## Create a dataframe with one string column that repeats three levels 5000 times (15000 rows).
df1 = pd.DataFrame({"categorical_level": ['Abc', 'Def', 'Ghi']*5000})

In [23]:
## Get the first 5 records of the dataframe.
df1.head()

  categorical_level
0               Abc
1               Def
2               Ghi
3               Abc
4               Def


In [24]:
## Get the memory usage of the column in bytes (deep=True also counts the string objects).
df1['categorical_level'].memory_usage(deep=True)

900128

In [25]:
## Count the rows for each level of the categorical column.
df1.categorical_level.value_counts()

Ghi    5000
Def    5000
Abc    5000
Name: categorical_level, dtype: int64

In [26]:
## Save the unique values in a variable.
categorical_level_unique = df1['categorical_level'].unique()

In [27]:
## Display unique levels.
categorical_level_unique

array(['Abc', 'Def', 'Ghi'], dtype=object)

In [28]:
## Use a dictionary comprehension to map each original value to an integer code.

## Here the codes run from 1 to the number of unique values in the column; the starting value is an optional choice.

categorical_level_mapping = {categorical_level: idx for 
                             idx, categorical_level in 
                             enumerate(categorical_level_unique, 1)}

In [29]:
## Display mapping levels.
categorical_level_mapping

{'Abc': 1, 'Def': 2, 'Ghi': 3}

In [30]:
## Use the .map() method on the column to replace all the original values with the encoded values.

df1['categorical_level'] = df1['categorical_level'].map(categorical_level_mapping)
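
One caveat worth checking at this step: `Series.map()` with a dictionary returns `NaN` for any value that has no key in the dictionary, which also turns the column into `float64`. The sketch below is a minimal illustration on a small made-up series (the value `'Xyz'` and the variable names are assumptions for the example, not part of this notebook's data).

```python
import pandas as pd

# Hypothetical mapping and data; 'Xyz' is deliberately missing from the mapping.
mapping = {'Abc': 1, 'Def': 2, 'Ghi': 3}
raw = pd.Series(['Abc', 'Def', 'Xyz', 'Ghi'])

encoded = raw.map(mapping)

# Values without a key in the dictionary become NaN, so the dtype is float64.
n_unmapped = int(encoded.isna().sum())
print(encoded.dtype, n_unmapped)

# Guard: only cast to an integer type when every value was mapped.
if n_unmapped == 0:
    encoded = encoded.astype("int8")
```

In this notebook the mapping is built from `unique()` of the same column, so every value has a key and the guard would pass.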

In [31]:
## Display the memory usage after mapping.
df1['categorical_level'].memory_usage(deep=True)

120128

In [39]:
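## Display the first 5 records of the column after mapping.
## Note: this cell has execution count 39, so it was run after the int8 conversion further down; that is why the dtype shown is int8 rather than int64.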
df1['categorical_level'].head()

0    1
1    2
2    3
3    1
4    2
Name: categorical_level, dtype: int8
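
The integer codes are only useful if they can be turned back into the original labels when needed. A minimal sketch of that round trip follows, assuming the same mapping as in this notebook; the name `inverse_mapping` and the short example series are introduced here for illustration only.

```python
import pandas as pd

# Same mapping as in this notebook, rebuilt here so the sketch is self-contained.
categorical_level_mapping = {'Abc': 1, 'Def': 2, 'Ghi': 3}

# Hypothetical name: reverse lookup from integer code to original label.
inverse_mapping = {code: level for level, code in categorical_level_mapping.items()}

original = pd.Series(['Abc', 'Def', 'Ghi'] * 2)
encoded = original.map(categorical_level_mapping)
decoded = encoded.map(inverse_mapping)

# The round trip should reproduce the original labels.
print(decoded.tolist() == original.tolist())
```

Keeping the mapping dictionary alongside the data (for example, saved with it) is what makes the encoding reversible.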

In [33]:
## Get minimum value of category column.
df1.categorical_level.min()

1

In [34]:
## Get maximum value of category column.
df1.categorical_level.max()

3

In [35]:
## Display the value range of each signed integer data type.
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print("{} datatype can handle integer values ranging from {} to {}".format(dtype.__name__, info.min, info.max))

int8 datatype can handle integer values ranging from -128 to 127
int16 datatype can handle integer values ranging from -32768 to 32767
int32 datatype can handle integer values ranging from -2147483648 to 2147483647
int64 datatype can handle integer values ranging from -9223372036854775808 to 9223372036854775807
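
Given these ranges, choosing a type comes down to checking the column's minimum and maximum against each range, narrowest first. The helper below is a small sketch of that check; `smallest_int_dtype` is a hypothetical name introduced here, not a pandas or NumPy function.

```python
import numpy as np

def smallest_int_dtype(min_value, max_value):
    """Return the narrowest signed NumPy integer type that holds both bounds."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= min_value and max_value <= info.max:
            return dtype
    raise ValueError("values do not fit in a signed 64-bit integer")

print(smallest_int_dtype(1, 3).__name__)      # int8
print(smallest_int_dtype(0, 40000).__name__)  # int32
```

For the encoded column in this notebook, the bounds are 1 and 3, so the narrowest candidate is `int8`.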


#### For more details, refer to the pandas documentation on dtypes

https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes


In [36]:
## Display the memory usage of the column before data type conversion.
df1['categorical_level'].memory_usage(deep=True)

120128

In [37]:
## The min and max values of this column (1 and 3) fall within the int8 range.
## int8 uses 1 byte per value versus 8 bytes for int64, so we can convert the column to int8.

df1['categorical_level'] = df1['categorical_level'].astype("int8")

In [38]:
## Display the memory usage of the column after data type conversion.
df1['categorical_level'].memory_usage(deep=True)

15128
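
As an alternative to naming the target type by hand, `pd.to_numeric` accepts a `downcast` argument that picks the smallest fitting type. The sketch below rebuilds an encoded column like the one above (the variable names are assumptions) and measures the data without the index, since index overhead can differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Rebuild an encoded column like the one in this notebook: codes 1..3, 5000 times each.
codes = pd.Series([1, 2, 3] * 5000, dtype="int64")

# downcast="integer" asks pandas for the smallest signed integer dtype that fits.
downcast = pd.to_numeric(codes, downcast="integer")

# index=False reports only the bytes used by the values themselves.
print(codes.dtype, codes.memory_usage(index=False))
print(downcast.dtype, downcast.memory_usage(index=False))
```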

#### In this case we had a categorical column, to which we first applied a "categorical mapping" and then a "type conversion".
#### This decreased the memory usage from 900128 bytes to 15128 bytes.
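
The reported figures can be broken down with a little arithmetic. The sketch below uses only the numbers printed in this notebook; the 128-byte index overhead is inferred from those outputs (120128 minus 15000 values of 8 bytes each) and may differ in other pandas versions.

```python
n_rows = 15000          # 3 levels * 5000 repetitions
index_bytes = 128       # inferred from the outputs above; may differ by pandas version

before = 900128         # object (string) column, as reported above
after_mapping = 120128  # int64 codes
after_int8 = 15128      # int8 codes

# Bytes per value once the index overhead is removed.
print((before - index_bytes) / n_rows)         # 60.0
print((after_mapping - index_bytes) / n_rows)  # 8.0
print((after_int8 - index_bytes) / n_rows)     # 1.0

# Overall reduction factor and percentage saved.
print(round(before / after_int8, 1))              # 59.5
print(round(100 * (1 - after_int8 / before), 1))  # 98.3
```

Most of the saving comes from the mapping step (60 bytes per value down to 8); the type conversion then removes another factor of 8.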


#### For numeric variables, use an appropriate "type conversion" to decrease the memory footprint.
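
A rough sketch of how this could be applied across a whole dataframe follows. The helper name `downcast_numeric` and the example data are made up for illustration. Note that converting `float64` to `float32` reduces precision to roughly 7 significant digits, so whether that is acceptable depends on the data.

```python
import pandas as pd

def downcast_numeric(df):
    """Return a copy of df with numeric columns downcast to narrower types."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 keeps roughly 7 significant digits.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Made-up example data.
df2 = pd.DataFrame({
    "small_int": [1, 2, 3],
    "medium_int": [10, 200, 3000],
    "ratio": [0.5, 1.25, 2.75],
})
result = downcast_numeric(df2)
print(result.dtypes)
```

Non-numeric columns are left untouched by this sketch; they would need the categorical mapping shown earlier in the notebook.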
