When working with a large dataset in pandas, you may want to reduce its memory footprint or improve its efficiency. Let's see how we can do that:

In [1]:
import pandas as pd

In [2]:
drinks=pd.read_csv('http://bit.ly/drinksbycountry')

In [3]:
drinks.head()

       country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0  Afghanistan              0                0              0                           0.0       Asia
1      Albania             89              132             54                           4.9     Europe
2      Algeria             25                0             14                           0.7     Africa
3      Andorra            245              138            312                          12.4     Europe
4       Angola            217               57             45                           5.9     Africa


Above, we can see alcohol consumption data, with one country per row.

In [4]:
drinks.info()  # info() is a DataFrame method

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     193 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB


Here, info() tells us about the index, the 6 columns, the count of non-missing values in each column, and each column's data type (int, float, or object, which here means string).

object usually means that strings are being stored. What we may not know is that we can also create a pandas Series of Python lists or of Python dictionaries. In other words, we can store arbitrary Python objects in a pandas Series; pandas basically stores a reference to each object, and that is why the type is simply called object.
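
To make the point about arbitrary Python objects concrete, here is a small sketch. The Series name `mixed_objects` and its contents are made up for illustration; the idea is only to show that a Series holding lists and dictionaries is reported with the same object dtype:

```python
import pandas as pd

# A Series whose elements are Python lists and a dict rather than strings
mixed_objects = pd.Series([[1, 2, 3], {"a": 1}, [4, 5]])

print(mixed_objects.dtype)           # dtype reported by pandas for the Series
print(type(mixed_objects.iloc[0]))   # the elements keep their Python types
print(type(mixed_objects.iloc[1]))
```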

###### Memory usage:

Here we will focus on 'memory usage' at the bottom of the result set above, which says "9.2+ KB".

Why is there a '+' (plus) after the number? It tells us that the DataFrame takes at least 9.2 kilobytes of memory, which is quite small. The '+' is there because object columns hold references to other objects, and pandas wants info() to run fast, so it does not go out and inspect those objects to figure out how much space they take. In other words, it is saying at least 9.2 KB, but it might be a lot more depending on what is in those object columns.
In this case they are just strings, for 'country' and 'continent'.
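
As a rough illustration of what the default figure leaves out, the sketch below builds a tiny object Series and compares the default measurement with the deep one. The name `names` and its values are illustrative and not taken from the drinks data, and the exact byte counts depend on the Python build, so only the relationship between the numbers matters:

```python
import sys
import pandas as pd

# dtype=object is requested explicitly so the strings are stored as references
names = pd.Series(["Asia", "Europe", "Africa"], dtype=object)

shallow = names.memory_usage(index=False)             # the references only
deep = names.memory_usage(index=False, deep=True)     # references plus the strings
string_bytes = sum(sys.getsizeof(v) for v in names)   # size of the string objects

print(shallow, deep, string_bytes)
```

The shallow number only covers the array of references, which is why info() can compute it quickly and why it marks the result with a '+'.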

The object data type can actually contain multiple different types. For instance, a column could include integers, floats and strings, which are collectively labeled as object. Therefore, we may need some additional techniques to handle mixed data types in object columns.
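
One possible way to inspect and clean such a mixed column is sketched below. The Series `mixed` is invented for the example, and using `pd.to_numeric` with `errors="coerce"` is just one option among several, not something this walkthrough relies on:

```python
import pandas as pd

# An int, a float and a string in the same column
mixed = pd.Series([1, 2.5, "three"])

print(mixed.dtype)                                           # one dtype for the whole column
print(mixed.map(lambda v: type(v).__name__).value_counts())  # Python types actually present

# Values that cannot be parsed as numbers become NaN
numeric = pd.to_numeric(mixed, errors="coerce")
print(numeric)
```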

We can actually force pandas to count the true memory usage as below: 

In [5]:
drinks.info(memory_usage='deep')  # A value of 'deep' is equivalent to "True with deep introspection".

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     193 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 30.5 KB


The output is the same as before, except that pandas has now actually looked at the object columns and worked out the space they use: 30.5 KB, more than triple the initial estimate. The 'deep' option takes a little longer to run, but it is accurate.

That accurate value makes us wonder: how much space is each column (Series) using? Let's see below:

In [6]:
drinks.memory_usage()

Index                            128
country                         1544
beer_servings                   1544
spirit_servings                 1544
wine_servings                   1544
total_litres_of_pure_alcohol    1544
continent                       1544
dtype: int64

Here we can see the memory usage of each column (Series), in bytes rather than kilobytes.

As with drinks.info(), it does not examine the object columns by default. So if we are looking for the true space usage, we have to pass deep=True to the memory_usage() method as well:

In [7]:
drinks.memory_usage(deep=True)

Index                             128
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                       12332
dtype: int64

Now we have the actual size in bytes used by each column (Series).

One last thing about memory_usage(): since drinks.memory_usage(deep=True) returns a pandas Series, with the column labels as the index and the sizes in bytes as the values, we can simply sum those values. Let's see how:

In [8]:
drinks.memory_usage(deep=True).sum()

31224

Here we get about 30 KB after summing the values for each column (Series).
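
As a quick sanity check, the per-column byte counts shown above can be added up by hand and converted to kilobytes, assuming 1 KB = 1024 bytes. The variable names are only for illustration:

```python
# Byte counts copied from the two memory_usage() outputs above
shallow_bytes = 128 + 6 * 1544                    # index + six columns, references only
deep_bytes = 128 + 12588 + 4 * 1544 + 12332       # index + country + four numeric columns + continent

print(round(shallow_bytes / 1024, 1))   # 9.2
print(round(deep_bytes / 1024, 1))      # 30.5
```

Both results line up with the "9.2+ KB" and "30.5 KB" figures reported by info().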

So the conclusion is that object columns can occupy plenty of space in memory.

The question is: what if this were a large dataset, and the DataFrame was growing too large to store comfortably? How can we make it more space efficient, mainly for the object columns?

Well, imagine that we store our strings as integers, because integers are more space efficient than strings. Let's see how below:

In [9]:
sorted(drinks.continent.unique()) # sorted() is a python function.

['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']

Here there are only 6 unique values in the continent Series, sorted with Python's built-in sorted() function.

So instead of storing strings, let's assign the numbers 0 to 5 to the 6 sorted unique continents: Africa - 0, Asia - 1, Europe - 2 and so on.
By doing this, we only store an integer for each row of the continent column.

Of course, we still have to store the strings, but only once, like a little lookup table. We don't need to implement this idea step by step ourselves, because pandas has already implemented it as the category data type (available since version 0.15).
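
For intuition only, the lookup-table idea can be sketched by hand with `pd.factorize`. This is not how the category data type is used in practice, and the five sample values simply mirror the first rows of the continent column:

```python
import pandas as pd

continents = pd.Series(["Asia", "Europe", "Africa", "Europe", "Africa"])

# sort=True numbers the unique strings in sorted order: Africa=0, Asia=1, Europe=2
codes, lookup = pd.factorize(continents, sort=True)

print(codes)    # one small integer per row
print(lookup)   # each distinct string stored once

# The original strings can be rebuilt from the codes and the lookup table
restored = lookup.take(codes)
print(list(restored))
```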

###### Convert Object Column to Category Datatype: 

In [10]:
drinks['continent'] = drinks.continent.astype('category')

In [11]:
drinks.dtypes  # Check the dtypes we have now

country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object

Here we can see that 'country' is still type 'object', but 'continent' is now type 'category'.

And the data still looks the same:

In [12]:
drinks.continent.head()

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: category
Categories (6, object): [Africa, Asia, Europe, North America, Oceania, South America]

But notice the 'Categories' line at the bottom of the output above, which lists the 6 categories. Under the hood, pandas is now storing these strings as integers, and the category list is a representation of the lookup table.

Now let's check how the continent Series stores its categories:

In [13]:
drinks.continent.cat.codes.unique()  # 'cat' is short for categorical; 'codes' are the integers assigned to the strings

array([1, 2, 0, 3, 5, 4], dtype=int8)

In [14]:
drinks.continent.cat.codes.head()

0    1
1    2
2    0
3    2
4    0
dtype: int8

This is exactly how pandas represents the 'continent' Series as integers.

Earlier we said that converting strings to integers would reduce the DataFrame's memory usage. Let's take a look:

In [15]:
drinks.memory_usage(deep=True)

Index                             128
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64

Look at that: 'continent' used to take over 12,000 bytes, and now it takes less than 800 bytes.

So altogether, instead of storing 193 strings (one per row), we are now storing 193 integers that refer to a lookup table of 6 unique strings (Africa - 0, Asia - 1, Europe - 2 and so on). The strings only have to be stored once and the rest is integer storage, hence this is much more space efficient.

Let's repeat this for the 'country' column and convert it from object to category:

In [16]:
drinks['country'] = drinks.country.astype('category')

In [17]:
drinks.memory_usage(deep=True)  # Check the memory usage again

Index                             128
country                         18094
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64

Now 'country' takes up most of the space in the DataFrame, even more than before the conversion. Why is that?

Remember that in the 'country' Series every country is a different string.

In [18]:
drinks.country.cat.categories

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Uzbekistan',
       'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=193)

categories is an attribute, and here it holds 193 categories. Before the conversion we were storing 193 strings. After converting to category we are storing 193 integers, which are very small, but they refer to a lookup table of 193 strings. Thus we are actually spending more memory than before to store the same thing.
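
The contrast between the two columns can be reproduced on synthetic data. In the sketch below, `category_ratio` is a hypothetical helper and the `country_0`, `country_1`, ... labels are made up; a ratio well below 1 means the category version is smaller, while a ratio near or above 1 means there is little or nothing to gain. Exact numbers will vary with the pandas version:

```python
import pandas as pd

def category_ratio(values):
    """Deep memory of the category version divided by that of the object version."""
    as_object = pd.Series(values, dtype=object)
    as_category = as_object.astype("category")
    return (as_category.memory_usage(index=False, deep=True)
            / as_object.memory_usage(index=False, deep=True))

# 3 distinct strings repeated many times (similar in spirit to 'continent')
few_unique = category_ratio(["Africa", "Asia", "Europe"] * 1000)

# Every string distinct (similar in spirit to 'country')
all_unique = category_ratio([f"country_{i}" for i in range(3000)])

print(f"few unique values: {few_unique:.2f}")
print(f"all unique values: {all_unique:.2f}")
```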

To summarize, we should use the category data type only when we have an object column of strings with relatively few distinct values (not a ton of them).

Besides reducing memory usage, as it did for continent, the category data type can also speed up computation. If we are working with an object column of strings and doing computation on it, say a groupby on that column, using the category data type can speed up those operations.
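
A rough way to check the speed claim on your own machine is sketched below. The data is randomly generated, the function name `mean_by_continent` is invented for the example, and the timings depend on the pandas version and hardware, so treat the printed numbers as indicative only:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 200_000
frame = pd.DataFrame({
    "continent": rng.choice(["Africa", "Asia", "Europe"], size=n_rows),
    "beer_servings": rng.integers(0, 400, size=n_rows),
})
# Same data, but with the grouping column stored as category
frame_cat = frame.assign(continent=frame["continent"].astype("category"))

def mean_by_continent(data):
    # observed=True keeps only the categories that actually occur
    return data.groupby("continent", observed=True)["beer_servings"].mean()

plain = mean_by_continent(frame)
categorical = mean_by_continent(frame_cat)

print("string column:  ", timeit.timeit(lambda: mean_by_continent(frame), number=5))
print("category column:", timeit.timeit(lambda: mean_by_continent(frame_cat), number=5))
```

Both versions return the same group means; only the time taken to compute them may differ.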

So simply converting a column to category not only saves space, as long as there aren't too many unique values, but can also speed up computation.

It is a simple way to make our DataFrame more compact and faster.
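
If there are many columns to check, the rule of thumb above could be wrapped in a small helper. Everything here is an assumption made for illustration: the function name `categorize_low_cardinality`, the `max_unique_ratio` parameter and its default of 0.5 are not pandas features, and a sensible threshold depends on your data:

```python
import pandas as pd

def categorize_low_cardinality(frame, max_unique_ratio=0.5):
    """Return a copy where string columns with few distinct values become 'category'."""
    result = frame.copy()
    for column in result.columns:
        series = result[column]
        if len(series) == 0 or not pd.api.types.is_string_dtype(series):
            continue  # skip empty, numeric and already-categorical columns
        if series.nunique() / len(series) <= max_unique_ratio:
            result[column] = series.astype("category")
    return result

sample = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra", "Angola"],
    "continent": ["Europe", "Africa", "Europe", "Africa"],
    "beer_servings": [89, 25, 245, 217],
})
compact = categorize_low_cardinality(sample)
print(compact.dtypes)
```

In this sample only 'continent' is converted, because every 'country' value is distinct and 'beer_servings' is numeric.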

###### Useful hint:

Let's create a very small DataFrame:

In [19]:
df = pd.DataFrame({'ID': [100, 101, 102, 103],
                   'Quality': ['good','very good','good','excellent']})

In [20]:
df

    ID    Quality
0  100       good
1  101  very good
2  102       good
3  103  excellent


Here we created a simple DataFrame with 2 columns and 4 rows.

In [21]:
df.sort_values('Quality')  # Sort the DataFrame by 'Quality'

    ID    Quality
3  103  excellent
0  100       good
2  102       good
1  101  very good


sort_values() sorted the Quality Series in alphabetical order. However, there is a logical ordering to these categories (good < very good < excellent) that alphabetical order ignores.

How can we tell pandas that there is a logical ordering? We use the category data type and define ordered categories, as below:

In [22]:
df.Quality.astype('category') # Converting to Category data type

0         good
1    very good
2         good
3    excellent
Name: Quality, dtype: category
Categories (3, object): [excellent, good, very good]

In [23]:
# Define an ordered categorical type with a custom ordering
cat_dtype = pd.api.types.CategoricalDtype(
    categories=['good', 'very good', 'excellent'], ordered=True)

In [24]:
df.Quality.astype(cat_dtype)

0         good
1    very good
2         good
3    excellent
Name: Quality, dtype: category
Categories (3, object): [good < very good < excellent]
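
One thing to watch for when listing the categories by hand: a value that is not in the list is converted to a missing value rather than raising an error. A minimal sketch, where the 'poor' rating and the names `quality_dtype` and `ratings` are invented for the example:

```python
import pandas as pd

quality_dtype = pd.api.types.CategoricalDtype(
    categories=["good", "very good", "excellent"], ordered=True)

# 'poor' is not one of the declared categories
ratings = pd.Series(["good", "poor", "excellent"]).astype(quality_dtype)

print(ratings)
print(ratings.isna().tolist())   # True marks the value that was not recognised
```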

In [27]:
df['Quality'] = df.Quality.astype(cat_dtype)

In [28]:
df.sort_values('Quality')

    ID    Quality
0  100       good
2  102       good
1  101  very good
3  103  excellent


Now that we have defined ordered categories, sorting by 'Quality' sorts in the logical order, as shown above.

So it is not sorting alphabetically, but by the ordering we defined for the categories.

The coolest thing is that we can now use boolean conditions like the one below.

Let's say we want to see all the rows (with all columns) where the quality is better than good. Here is how we can code that:

In [29]:
df.loc[df.Quality > 'good', :]

    ID    Quality
1  101  very good
3  103  excellent


Here we only see 'very good' and 'excellent', which are better than 'good'.

We can use this comparison operator with strings because the categories are ordered, so pandas understands that 'very good' and 'excellent' are greater than 'good'.
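
The same ordering also drives min() and max(), and it is the ordering that makes the comparison legal in the first place. The sketch below rebuilds the Quality values on their own (the names `ordered` and `unordered` are just for illustration) and shows that an unordered categorical refuses the same comparison:

```python
import pandas as pd

values = ["good", "very good", "good", "excellent"]
ordered = pd.Series(values).astype(
    pd.CategoricalDtype(["good", "very good", "excellent"], ordered=True))
unordered = pd.Series(values).astype("category")

print(ordered.min(), "|", ordered.max())   # lowest and highest rating
print((ordered >= "very good").sum())      # rows rated at least 'very good'

try:
    unordered > "good"                     # no ordering has been defined here
except TypeError as err:
    print("unordered comparison failed:", err)
```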