https://www.youtube.com/watch?v=wDYDYGyN_cw&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=21

# How do I make my pandas DataFrame smaller and faster?

In [1]:
import pandas as pd

In [2]:
# read a dataset of alcohol consumption into a DataFrame
drinks = pd.read_csv('http://bit.ly/drinksbycountry')

In [3]:
# exact memory usage is unknown because object columns are references elsewhere
drinks.info() #different dtypes (object / int64 / float64)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.1+ KB


In [4]:
# force pandas to calculate the true memory usage
drinks.info(memory_usage = 'deep') #true use of memory

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 30.4 KB


In [5]:
drinks.memory_usage() # shallow usage in bytes: 1544 per column

Index                             80
country                         1544
beer_servings                   1544
spirit_servings                 1544
wine_servings                   1544
total_litres_of_pure_alcohol    1544
continent                       1544
dtype: int64
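
The shallow figure of 1544 follows from simple arithmetic: a 64-bit numeric column stores 8 bytes per row, and 193 rows × 8 bytes = 1544 bytes. A minimal sketch on a synthetic frame (the name `demo` and the columns `n` and `x` are made up for illustration, not taken from the drinks data):

```python
import numpy as np
import pandas as pd

# synthetic stand-in: 193 rows, one int64 column and one float64 column
demo = pd.DataFrame({
    'n': np.arange(193, dtype='int64'),
    'x': np.zeros(193, dtype='float64'),
})

# shallow memory usage per column in bytes, excluding the index
shallow = demo.memory_usage(index=False)
print(shallow)

# 8 bytes per value * 193 rows
print(193 * 8)
```

For object columns the shallow number presumably counts only one 8-byte reference per row, which is why the strings themselves only show up with `deep=True`.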

In [6]:
# calculate the memory usage for each Series (in bytes)
drinks.memory_usage(deep = True) #real usage 12588

Index                              80
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                       12332
dtype: int64

Documentation for info and memory_usage

In [7]:
drinks.memory_usage(deep = True).sum()

31176
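
As a quick sanity check, this total lines up with the `30.4 KB` reported by `info(memory_usage='deep')`, assuming 1 KB means 1024 bytes. The variable names below are illustrative only:

```python
# total reported by drinks.memory_usage(deep=True).sum() above
total_bytes = 31176

# convert to KB, assuming 1 KB = 1024 bytes
total_kb = total_bytes / 1024
print(round(total_kb, 1))  # 30.4
```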

In [8]:
## How could this be more space-efficient?
## 'object' columns use a lot of space; the strings could be stored as integers instead

In [9]:
sorted(drinks.continent.unique())

['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']
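
Before converting a column, it can help to compare the number of distinct values with the number of rows (6 continents across 193 rows here). A small sketch of that check on made-up data; the Series `s` is a hypothetical stand-in for a column like `continent`:

```python
import pandas as pd

# hypothetical stand-in for a column with few distinct values
s = pd.Series(['Asia', 'Europe', 'Africa', 'Europe',
               'Africa', 'Asia', 'Europe', 'Africa'])

n_unique = s.nunique()      # number of distinct values
ratio = n_unique / len(s)   # share of distinct values among all rows

print(n_unique, len(s), ratio)
print(s.value_counts())     # how often each value occurs
```

A low ratio suggests the column repeats a handful of values many times, which is the situation where storing them as integers is likely to pay off.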

In [10]:
drinks.continent.head()

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object

In [11]:
# convert object columns to the category type with astype()

In [31]:
# use the 'category' data type (new in pandas 0.15) to store the 'continent' strings as integers
drinks['continent'] = drinks.continent.astype('category')
drinks.dtypes

country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object

In [32]:
# 'continent' Series appears to be unchanged
drinks.continent.head() #show categories

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: category
Categories (6, object): [Africa, Asia, Europe, North America, Oceania, South America]
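
If the column is known to be categorical up front, the conversion can also be requested while reading the file, via the `dtype` argument of `read_csv`. A minimal sketch, using made-up CSV text in place of the real file (the name `small` is illustrative):

```python
import io
import pandas as pd

# made-up CSV text standing in for the real file
csv_text = """country,continent
Afghanistan,Asia
Albania,Europe
Algeria,Africa
Andorra,Europe
Angola,Africa
"""

# ask read_csv to parse 'continent' as a category from the start
small = pd.read_csv(io.StringIO(csv_text), dtype={'continent': 'category'})

print(small.dtypes)
print(list(small['continent'].cat.categories))
```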

In [14]:
# 'cat.codes' shows the integer code of each row, i.e. the position of its value in the category list

In [35]:
# strings are now encoded (0 means 'Africa', 1 means 'Asia', 2 means 'Europe', etc.)
drinks.continent.cat.codes.head() 

0    1
1    2
2    0
3    2
4    0
dtype: int8
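
To make the encoding concrete, the sketch below rebuilds the first five values by hand: the categories are stored once, each row keeps only an integer code, and looking the code up in the category list gives back the string. The Series `s` holds just those five sample values, not the full column:

```python
import pandas as pd

s = pd.Series(['Asia', 'Europe', 'Africa', 'Europe', 'Africa']).astype('category')

# the distinct values are stored once, in sorted order
categories = list(s.cat.categories)

# each row stores only the integer position of its category
codes = s.cat.codes.tolist()

# looking the codes up in the category list recovers the original strings
decoded = [categories[c] for c in codes]

print(categories)  # ['Africa', 'Asia', 'Europe']
print(codes)       # [1, 2, 0, 2, 0]
print(decoded)     # ['Asia', 'Europe', 'Africa', 'Europe', 'Africa']
```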

In [36]:
drinks.memory_usage(deep = True) # continent dropped from 12332 to 744 bytes

Index                              80
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64

In [27]:
# repeat this process for the 'country' Series
drinks['country'] = drinks.country.astype('category')

In [37]:
# memory usage of 'country' increased (12588 -> 18094 bytes): with 193 unique values, every row is its own category
drinks.memory_usage(deep = True)

Index                              80
country                         18094
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64

In [39]:
drinks.country.cat.categories

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Uzbekistan',
       'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=193)

The **category** data type should only be used with a string Series that has a **small number of possible values.**
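
The rule of thumb can be checked on synthetic data. The sketch below compares deep memory usage before and after conversion for a column with few distinct values and one where every value is unique. All data and the helper `deep_bytes` are made up for illustration, and the exact byte counts will vary with the pandas version:

```python
import pandas as pd

def deep_bytes(series):
    """Deep memory usage of a Series in bytes, excluding the index."""
    return int(series.memory_usage(index=False, deep=True))

# low cardinality: 6 distinct labels repeated over 6000 rows (made-up data)
labels = ['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']
low = pd.Series(labels * 1000, dtype=object)

# high cardinality: 6000 distinct labels, one per row (made-up data)
high = pd.Series(['item_%04d' % i for i in range(6000)], dtype=object)

# print: name, number of distinct values, bytes as object, bytes as category
for name, s in [('low', low), ('high', high)]:
    before = deep_bytes(s)
    after = deep_bytes(s.astype('category'))
    print(name, s.nunique(), before, after)
```

With few distinct values the category version should come out far smaller. With all-unique values there is nothing to share, so the codes are pure overhead, which mirrors what happened to `country` above.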


In [40]:
# create a small DataFrame from a dictionary
df = pd.DataFrame({'ID':[100,101,102,103], 'quality':['good', 'very good', 'excellent', 'good']})

In [41]:
df

    ID    quality
0  100       good
1  101  very good
2  102  excellent
3  103       good


In [42]:
# sort the DataFrame by the 'quality' Series (alphabetical order)
df.sort_values('quality')

    ID    quality
2  102  excellent
0  100       good
3  103       good
1  101  very good


# How to define a logical sort order: astype plus an explicit category order

In [43]:
# define a logical ordering for the categories (current pandas needs a CategoricalDtype for this)
from pandas.api.types import CategoricalDtype
df['quality'] = df.quality.astype(CategoricalDtype(categories = ['good', 'very good', 'excellent'], ordered = True))


In [44]:
df.quality

0         good
1    very good
2    excellent
3         good
Name: quality, dtype: category
Categories (3, object): [good < very good < excellent]

In [45]:
# sort the DataFrame by the 'quality' Series (logical order)
df.sort_values('quality')

    ID    quality
0  100       good
3  103       good
1  101  very good
2  102  excellent


# Selecting rows with quality better than 'good' using loc

In [46]:
# comparison operators work with ordered categories
df.loc[df.quality > 'good', :] # ':' selects all columns; '>' relies on the category order defined above

    ID    quality
1  101  very good
2  102  excellent
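
The same idea carries over to other operations that depend on order. The sketch below uses a separate, made-up example (size labels rather than the quality column) to show `min`, `max`, comparisons and sorting on an ordered category; the names `size_type` and `sizes` are hypothetical:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# hypothetical size labels with an explicit logical order
size_type = CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
sizes = pd.Series(['medium', 'large', 'small', 'medium']).astype(size_type)

# smallest and largest value by logical order
print(sizes.min(), sizes.max())

# element-wise comparison against one of the categories
print((sizes >= 'medium').tolist())

# sorted by logical order, not alphabetically
print(sizes.sort_values().tolist())
```

Without `ordered=True`, comparisons such as `>=` and reductions such as `min` are not defined for a category column, so the explicit order is what makes these calls meaningful.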


https://pandas.pydata.org/pandas-docs/stable/categorical.html

https://pandas.pydata.org/pandas-docs/stable/api.html#categorical