# How do I make my pandas DataFrame smaller and faster?

🐼 Tutorial on pandas by Data School - Exercise performed by Dorian.H Mekni 🥇 | Fri 11 Dec 2020

In [2]:
import pandas as pd

In [3]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')

In [4]:
drinks.head()

       country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
0  Afghanistan              0                0              0                           0.0       Asia
1      Albania             89              132             54                           4.9     Europe
2      Algeria             25                0             14                           0.7     Africa
3      Andorra            245              138            312                          12.4     Europe
4       Angola            217               57             45                           5.9     Africa



⭐️ Let's use a method we have not seen so far, info(), to learn more about this DataFrame:


In [5]:
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB



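☝🏻 The last line indicates that this DataFrame takes at least 9.2 KB of memory. The "+" is there because pandas has only counted the references held in the object columns, not the strings they point to.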


⭐️ For a deeper memory scan that also measures the contents of the object columns, we can pass the following parameter:

In [6]:
drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 30.5 KB



✅ pandas actually inspected the object columns and worked out how much space they really take: 30.5 KB in total.
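
💡 A minimal sketch of the difference between the two scans, using a tiny hand-made frame (the `toy` name and its values are made up for illustration and are not the drinks dataset). The shallow figure only counts the references stored in an object column, while the deep figure also counts the Python strings they point to.

```python
import pandas as pd

# Hypothetical toy data, not the drinks dataset.
toy = pd.DataFrame({
    'country': pd.Series(['Afghanistan', 'Albania', 'Algeria'], dtype=object),
    'beer_servings': [0, 89, 25],
})

# Shallow scan: an object column is counted as references only.
shallow = toy.memory_usage().sum()

# Deep scan: the string objects themselves are measured too.
deep = toy.memory_usage(deep=True).sum()

print(shallow, deep)
```

The exact byte counts depend on the Python and pandas versions, so only the ordering (deep larger than shallow) should be relied on.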



⭐️ How much space is each column taking?


In [9]:
drinks.memory_usage()

Index                            128
country                         1544
beer_servings                   1544
spirit_servings                 1544
wine_servings                   1544
total_litres_of_pure_alcohol    1544
continent                       1544
dtype: int64


✅ This gives the memory usage of each column in bytes. By default, however, it does not inspect the contents of the object columns. To measure them in depth, we proceed as follows:

In [70]:
drinks.memory_usage(deep=True)

Index                             128
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                       12332
dtype: int64


⭐️ Let's sum the Series of values:


In [71]:
drinks.memory_usage(deep=True).sum()

31224


✅ 31224 bytes is about 30.5 KB, which matches the figure from our previous deep scan.
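
💡 The same kind of cross-check can be done by hand on the shallow figures printed earlier. This is only a sketch, and it assumes that info() expresses its total in units of 1024 bytes and that each int64, float64, or object reference takes 8 bytes.

```python
rows = 193                # "RangeIndex: 193 entries"
bytes_per_value = 8       # assumed size of an int64, a float64, or an object reference

per_column = rows * bytes_per_value
print(per_column)         # 1544, the shallow figure shown for every column

index_bytes = 128         # the "Index" line of the shallow memory_usage() output
shallow_total = index_bytes + 6 * per_column
print(shallow_total)      # 9392

print(round(shallow_total / 1024, 1))   # 9.2, as in "memory usage: 9.2+ KB"
```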
    


🧐 The point here is that object columns can occupy a large amount of space, which might be a problem when we deal with a huge dataset.


Integers are indeed more space-efficient than strings. 

What if we were able to store our strings as integers to save space ❓

In [72]:
sorted(drinks.continent.unique())

['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']


☝🏻 Here we have six unique strings in our continent column. We could map each of these values to an integer ranging from 0 to 5. 

We would then store an integer rather than a string object for each row. 


In [13]:
drinks.continent.head()

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object


🧐 To build this system ourselves, we would still have to store a lookup table saying that 0 means Africa, 1 means Asia, and so on. We would still have to store the strings, but only once each. 
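
💡 A hand-rolled sketch of such a lookup table in plain Python, using the first five continent values shown above. The names `lookup`, `code_of`, and `codes` are made up for illustration; this is not how pandas implements it internally.

```python
continents = ['Asia', 'Europe', 'Africa', 'Europe', 'Africa']

# Store each distinct string only once, in sorted order.
lookup = sorted(set(continents))          # ['Africa', 'Asia', 'Europe']

# Map every string to its position in the lookup table.
code_of = {name: i for i, name in enumerate(lookup)}

# Each row now holds a small integer instead of a string.
codes = [code_of[name] for name in continents]
print(codes)                              # [1, 2, 0, 2, 0]

# The original strings can be recovered from the codes at any time.
decoded = [lookup[code] for code in codes]
print(decoded == continents)              # True
```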

😄 Thankfully, pandas has already built this system for its users: the category data type, introduced in pandas 0.15.

⭐️ Let's convert an object column to the category type:


In [14]:
drinks['continent'] = drinks.continent.astype('category')


☝🏻 Let's check that it worked:


In [15]:
drinks.dtypes

country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object


✅ The data type of continent has been converted successfully.

➕ It still reads the same:


In [17]:
drinks.continent.head()

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: category
Categories (6, object): [Africa, Asia, Europe, North America, Oceania, South America]


☝🏻 The final line lists our 6 categories. Under the hood, pandas now stores each string as an integer code that points to one of these categories. 

⭐️ Let's prove that:

In [18]:
drinks.continent.cat.codes.head()

0    1
1    2
2    0
3    2
4    0
dtype: int8


✅ pandas now represents the continent Series as integers under the hood. 
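
💡 A small sketch of how the codes and the categories fit together, rebuilt on a five-row Series rather than on the drinks dataset. It uses `.cat.codes`, `.cat.categories`, and `Index.take` to go from the integers back to the strings.

```python
import pandas as pd

# Same five values as drinks.continent.head(), typed in by hand.
continent = pd.Series(['Asia', 'Europe', 'Africa', 'Europe', 'Africa'],
                      dtype='category')

codes = continent.cat.codes              # one small integer per row
categories = continent.cat.categories    # the lookup table, stored once

print(list(codes))                       # [1, 2, 0, 2, 0]
print(list(categories))                  # ['Africa', 'Asia', 'Europe']

# Looking each code up in the categories gives back the original strings.
rebuilt = categories.take(codes.to_numpy())
print(list(rebuilt) == list(continent))  # True
```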



⭐️ Let's now see whether our memory usage has been reduced:


In [19]:
drinks.memory_usage(deep=True)

Index                             128
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64


✅ continent was previously 12332 bytes and has been reduced to 744 bytes. 
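
💡 The same effect can be reproduced on made-up data. The sketch below builds a 3000-row column with only three distinct strings (hypothetical values, not the drinks dataset) and compares its deep memory usage before and after the conversion.

```python
import pandas as pd

# Hypothetical column: many rows, very few distinct strings.
as_object = pd.Series(['Africa', 'Asia', 'Europe'] * 1000, dtype=object)
as_category = as_object.astype('category')

before = as_object.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)

# Exact byte counts vary with the Python and pandas versions.
print(before, after)
print(after < before)
```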



⭐️ Let's repeat the same set of operations, but this time for the country column. 


In [20]:
drinks['country'] = drinks.country.astype('category')

In [21]:
drinks.memory_usage(deep=True)

Index                             128
country                         18094
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64


❗️ The memory usage of country has increased from 12588 to 18094 bytes. Why is that?

The country column has 193 rows, each holding a different string. Thus, we created 193 categories:


In [47]:
drinks.country.cat.categories

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Uzbekistan',
       'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=193)


🧐 By doing so, we are storing 193 integers, which takes very little space, but they point to a lookup table of 193 strings! We are therefore spending more memory than before to store the same thing. 



⚠️ Bottom line: only use the category data type when you have an object column of strings that has only a few different values.   
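
💡 One possible way to check this before converting is to look at the share of distinct values in a column. The `unique_ratio` helper below is a hypothetical name introduced for this sketch, not a pandas function, and the toy data is made up. It assumes a non-empty column.

```python
import pandas as pd

def unique_ratio(series):
    """Return the share of distinct values: close to 0 means many repeats."""
    return series.nunique() / len(series)

# Hypothetical toy data: one repetitive column, one all-unique column.
toy = pd.DataFrame({
    'continent': pd.Series(['Africa', 'Asia', 'Europe', 'Africa'] * 50, dtype=object),
    'country': pd.Series([f'country_{i}' for i in range(200)], dtype=object),
})

for name in toy.columns:
    ratio = unique_ratio(toy[name])
    before = toy[name].memory_usage(deep=True)
    after = toy[name].astype('category').memory_usage(deep=True)
    # Compare the two byte counts for each column before deciding to convert.
    print(name, round(ratio, 3), before, after)
```

A ratio close to 0 (like continent here) suggests a good candidate for the category type, while a ratio of 1.0 (like country) means every string would still be stored once, plus the codes.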


🧐 Remember: 

 --> Using the category data type will:
        
        1️⃣ save memory space (as long as there aren't too many unique values). 
        
        2️⃣ speed up your computational operations. 

That way, your DataFrame becomes smaller and faster to work with. 
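
💡 A rough way to try the speed claim yourself, on made-up data. This is only a sketch: whether the category version is actually faster depends on the operation, the data size, and the pandas version, so no timing result is promised here.

```python
import timeit
import pandas as pd

# Hypothetical column: 200,000 rows, four distinct strings.
as_object = pd.Series(['Africa', 'Asia', 'Europe', 'Oceania'] * 50_000, dtype=object)
as_category = as_object.astype('category')

# Both representations give the same counts.
object_counts = as_object.value_counts().sort_index()
category_counts = as_category.value_counts().sort_index()
print(list(object_counts) == list(category_counts))

# Time the same operation on each representation.
object_time = timeit.timeit(lambda: as_object.value_counts(), number=5)
category_time = timeit.timeit(lambda: as_category.value_counts(), number=5)
print(f'object: {object_time:.4f}s  category: {category_time:.4f}s')
```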

    


🙏🏻 Thank you !

👋🏻 See you in the next one !
