### When working with very large datasets, it is important to optimize memory usage

#### This notebook covers "categorical mapping" and "type conversion", two of the most commonly used memory optimization techniques

In [64]:
import pandas as pd
import numpy as np

In [69]:
## Create a string column by repeating three levels 5000 times (15000 rows in total)
df1 = pd.DataFrame({"categorical_level": ['Abc', 'Def', 'Ghi']*5000})

In [70]:
df1.head()

  categorical_level
0               Abc
1               Def
2               Ghi
3               Abc
4               Def


In [71]:
df1['categorical_level'].memory_usage(deep=True)

900080

In [72]:
df1.categorical_level.value_counts()

Def    5000
Abc    5000
Ghi    5000
Name: categorical_level, dtype: int64

In [73]:
## Save the unique values in a variable
categorical_level_unique = df1['categorical_level'].unique()

In [74]:
## Use a dictionary comprehension to map each original value to an integer code.
## Starting the codes at 1 is optional; any distinct integers work, but small
## values let us pick a smaller integer dtype later.

categorical_level_mapping = {categorical_level: idx for 
                             idx, categorical_level in 
                             enumerate(categorical_level_unique, 1)}

In [75]:
## Use the .map() method to replace each original value in the column
## with its integer code

df1['categorical_level'] = df1['categorical_level'].map(categorical_level_mapping)

In [76]:
df1['categorical_level'].memory_usage(deep=True)

120080
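The integer codes are only useful if they can be turned back into labels when needed. A minimal sketch of that round trip, rebuilding the same three-level column as above; the names `mapping`, `reverse_mapping`, `encoded`, and `decoded` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Rebuild the example column so this block runs on its own.
df = pd.DataFrame({"categorical_level": ["Abc", "Def", "Ghi"] * 5000})

# unique() keeps order of first appearance, so codes start at 1 for "Abc".
levels = df["categorical_level"].unique()
mapping = {level: idx for idx, level in enumerate(levels, 1)}

# Invert the dictionary so codes can be decoded back to labels.
reverse_mapping = {idx: level for level, idx in mapping.items()}

encoded = df["categorical_level"].map(mapping)
decoded = encoded.map(reverse_mapping)

# Element-wise comparison confirms the round trip loses no information.
roundtrip_ok = bool((decoded == df["categorical_level"]).all())
print(mapping)
print(roundtrip_ok)
```

Keeping `reverse_mapping` (or saving it alongside the data) is what makes the encoding reversible. Note that `.map()` yields `NaN` for any value missing from the dictionary, so the mapping should cover every level in the column.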

In [77]:
df1.categorical_level.min()

1

In [78]:
df1.categorical_level.max()

3

In [79]:
print("int8 datatype can handle integer values ranging from {} to {}".format(np.iinfo(np.int8).min,np.iinfo(np.int8).max))
print("int16 datatype can handle integer values ranging from {} to {}".format(np.iinfo(np.int16).min,np.iinfo(np.int16).max))
print("int32 datatype can handle integer values ranging from {} to {}".format(np.iinfo(np.int32).min,np.iinfo(np.int32).max))
print("int64 datatype can handle integer values ranging from {} to {}".format(np.iinfo(np.int64).min,np.iinfo(np.int64).max))

int8 datatype can handle integer values ranging from -128 to 127
int16 datatype can handle integer values ranging from -32768 to 32767
int32 datatype can handle integer values ranging from -2147483648 to 2147483647
int64 datatype can handle integer values ranging from -9223372036854775808 to 9223372036854775807
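The ranges printed above can also be checked programmatically. The following is a sketch only: `smallest_int_dtype` is a hypothetical helper, not a pandas or NumPy function, and it assumes a column of signed integers without missing values.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype that holds every value.

    Hypothetical helper for illustration; assumes no missing values.
    """
    lo, hi = series.min(), series.max()
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate)
    raise ValueError("values do not fit in a signed 64-bit integer")

# Same shape of data as the encoded column: codes 1 to 3, 15000 rows.
codes = pd.Series([1, 2, 3] * 5000)
chosen = smallest_int_dtype(codes)
print(chosen)
```

Checking both the minimum and the maximum matters: a column with a single value of 128 or -129 no longer fits in `int8`, and casting it anyway would silently wrap or fail depending on the library version.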


#### For more details, refer to:

https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes


In [80]:
## The min and max values of this column fall within the int8 range.
## Since int8 uses 1 byte per value instead of the 8 bytes used by int64, we convert the column to int8

df1['categorical_level'] = df1['categorical_level'].astype("int8")

In [81]:
# Memory usage after converting to int8

df1['categorical_level'].memory_usage(deep=True)

15080
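The reported sizes follow directly from rows times bytes per value. A small sketch that reproduces the data portion of those numbers, using `index=False` so the result does not depend on index overhead (which can differ across pandas versions); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# 15000 integer codes stored as int64, like the column after .map().
codes = pd.Series(np.tile(np.array([1, 2, 3], dtype=np.int64), 5000))

# index=False counts only the values: rows * bytes per value.
int64_bytes = codes.memory_usage(index=False)                 # 15000 * 8
int8_bytes = codes.astype("int8").memory_usage(index=False)   # 15000 * 1

print(int64_bytes, int8_bytes)
```

The extra 80 bytes in the figures above (120080 and 15080) are the index, which `memory_usage` includes by default.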

#### In this case we had a categorical column, to which we applied a "categorical mapping" first and then a "type conversion".
#### This decreased its memory usage from 900080 bytes to 15080 bytes, roughly a 60-fold reduction


#### For numeric variables, use an appropriate "type conversion" to decrease the memory footprint.
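One way to apply this to numeric columns is `pd.to_numeric` with its `downcast` argument, which picks the smallest dtype that can hold the values. This is a sketch on made-up data; the frame `df_num` and its column names are assumptions for illustration, not from the original notebook.

```python
import numpy as np
import pandas as pd

# Illustrative numeric data; pandas defaults would store all of it in 8 bytes per value.
df_num = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),         # 0..99 fits in int8
    "big_int": np.arange(100, dtype=np.int64) * 1000,    # up to 99000 needs int32
    "half_steps": np.arange(100, dtype=np.float64) * 0.5,
})

before = int(df_num.memory_usage(index=False).sum())

# downcast="integer" chooses the narrowest signed integer dtype that fits.
df_num["small_int"] = pd.to_numeric(df_num["small_int"], downcast="integer")
df_num["big_int"] = pd.to_numeric(df_num["big_int"], downcast="integer")

# downcast="float" goes no lower than float32; float32 keeps only about
# 7 significant digits, so check that this precision is acceptable first.
df_num["half_steps"] = pd.to_numeric(df_num["half_steps"], downcast="float")

after = int(df_num.memory_usage(index=False).sum())

print(df_num.dtypes)
print(before, after)
```

Integer downcasting is lossless as long as the values fit the target range, whereas float downcasting trades precision for space, so it deserves a closer look before being applied to measurements or monetary values.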
