In [2]:
import numpy as np
import pandas as pd

**Background and Motivation**

In [3]:
values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

In [4]:
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [5]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [7]:
values.value_counts()

apple     6
orange    2
dtype: int64

In [8]:
values = pd.Series([0, 1, 0, 0] * 2)

In [10]:
dim = pd.Series(['apple', 'orange'])

In [11]:
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [12]:
dim

0     apple
1    orange
dtype: object

In [13]:
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called the categorical or dictionary-encoded
representation. The array of distinct values can be called the categories, dictionary,
or levels of the data. In this book we will use the terms categorical and categories.
The integer values that reference the categories are called the category codes or
simply codes.

The categorical representation can yield significant performance improvements when
you are doing analytics. You can also perform transformations on the categories while
leaving the codes unmodified. Some example transformations that can be made at
relatively low cost are:

    • Renaming categories

    • Appending a new category without changing the order or position of the existing
      categories
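
As a minimal sketch of these two transformations, the example below checks that renaming or appending categories leaves the codes untouched. The data and the names 'APPLE', 'ORANGE', and 'pear' are made up for illustration.

```python
import pandas as pd

# Illustrative data; the replacement and appended names are assumptions of this sketch.
fruit = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2, dtype='category')
codes_before = fruit.cat.codes.tolist()

# Renaming swaps the labels stored in the categories; the codes stay as they were.
renamed = fruit.cat.rename_categories(['APPLE', 'ORANGE'])

# Appending adds a new category at the end; existing positions are unchanged.
extended = fruit.cat.add_categories(['pear'])

print(renamed.cat.codes.tolist() == codes_before)
print(extended.cat.codes.tolist() == codes_before)
print(list(extended.cat.categories))
```

Both comparisons should print True, since only the categories array is rewritten in each case.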


**Categorical Type in pandas**

Pandas has a special Categorical type for holding data that uses the integer-based
categorical representation or encoding. 

In [14]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2

In [15]:
N = len(fruits)

In [19]:
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
          columns=['basket_id', 'fruit', 'count', 'weight'])

In [20]:
df

   basket_id   fruit  count    weight
0          0   apple      5  0.996204
1          1  orange      9  1.516011
2          2   apple     11  2.443883
3          3   apple     13  3.993331
4          4   apple     11  3.340299
5          5  orange      9  2.964207
6          6   apple      7  0.756544
7          7   apple      6  2.241434


In [21]:
fruit_cat = df['fruit'].astype('category')

In [22]:
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [23]:
c = fruit_cat.values

In [24]:
type(c)

pandas.core.arrays.categorical.Categorical

In [25]:
c.categories

Index(['apple', 'orange'], dtype='object')

In [26]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [27]:
df['fruit'] = df['fruit'].astype('category')

In [28]:
df.fruit

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [29]:
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])

In [30]:
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

In [31]:
categories = ['foo', 'bar', 'baz']

In [32]:
codes = [0, 1, 2, 0, 0, 1]

In [33]:
my_cats_2 = pd.Categorical.from_codes(codes, categories)

In [34]:
my_cats_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]

In [36]:
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
ordered_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

In [37]:
my_cats_2.as_ordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

**Computations with Categoricals**

In [38]:
np.random.seed(12345)

In [39]:
draws = np.random.randn(1000)

In [40]:
draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

In [41]:
bins = pd.qcut(draws, 4)

In [42]:
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

The exact sample quartiles are often less convenient in a report than quartile
names. We can attach names with the labels argument to qcut:

In [43]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

In [44]:
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [45]:
bins = pd.Series(bins, name='quartile')

In [46]:
bins

0      Q2
1      Q3
2      Q2
3      Q2
4      Q4
       ..
995    Q3
996    Q2
997    Q1
998    Q3
999    Q4
Name: quartile, Length: 1000, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [47]:
results = (pd.Series(draws)
            .groupby(bins)
            .agg(['count', 'min', 'max'])
            .reset_index())

In [48]:
results

  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528


In [49]:
results['quartile']

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

**Better performance with categoricals**

If you do a lot of analytics on a particular dataset, converting to categorical can yield
substantial overall performance gains. A categorical version of a DataFrame column
will often use significantly less memory, too. Let’s consider some Series with 10
million elements and a small number of distinct categories:

In [51]:
N = 10000000

In [52]:
draws = pd.Series(np.random.randn(N))

In [53]:
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

In [54]:
categories = labels.astype('category')

In [56]:
labels.memory_usage()

80000128

In [57]:
categories.memory_usage()

10000320

In [58]:
%time _ = labels.astype('category')

Wall time: 452 ms


GroupBy operations can be significantly faster with categoricals because the
underlying algorithms use the integer-based codes array instead of an array of strings.
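
A rough way to check this on your own machine is to time the same aggregation with both key types. The sketch below assumes a smaller N than the text so it runs quickly, and uses the standard-library time.perf_counter; the helper name timed_group_mean is made up here. Timings vary by machine and pandas version, so no numbers are shown.

```python
import time

import numpy as np
import pandas as pd

N = 1_000_000  # smaller than the 10 million above so the sketch runs quickly
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

def timed_group_mean(keys):
    """Return (elapsed seconds, per-group means) for grouping draws by keys."""
    start = time.perf_counter()
    # observed=True restricts the result to categories present in the data
    result = draws.groupby(keys, observed=True).mean()
    return time.perf_counter() - start, result

t_labels, by_labels = timed_group_mean(labels)
t_cats, by_cats = timed_group_mean(categories)

print(f'string keys:      {t_labels:.4f} s')
print(f'categorical keys: {t_cats:.4f} s')
```

Both calls compute the same group means; only the type of the grouping key differs.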

**Categorical Methods**

In [59]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)

In [60]:
cat_s = s.astype('category')

In [61]:
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [62]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [63]:
actual_categories = ['a', 'b', 'c', 'd', 'e']

In [64]:
cat_s2 = cat_s.cat.set_categories(actual_categories)

In [65]:
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

In [66]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [67]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64

In large datasets, categoricals are often used as a convenient tool for memory savings
and better performance. After you filter a large DataFrame or Series, many of the
categories may not appear in the data. To help with this, we can use the
remove_unused_categories method to trim unobserved categories:

In [68]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]

In [69]:
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [70]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

**Creating dummy variables for modeling**

When you’re using statistics or machine learning tools, you’ll often transform
categorical data into dummy variables, also known as one-hot encoding. This involves
creating a DataFrame with a column for each distinct category; these columns contain 1s
for occurrences of a given category and 0 otherwise.


In [71]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

In [72]:
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [74]:
pd.get_dummies(cat_s) # One Hot Encode

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
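
To connect one-hot encoding back to the category codes, the following sketch builds the same indicator matrix by hand from cat.codes with NumPy. It passes dtype=int to get_dummies because recent pandas versions return boolean columns by default, and it assumes there are no missing values (a code of -1 would need separate handling). The variable name manual is illustrative.

```python
import numpy as np
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# Each row of the identity matrix is the one-hot vector for one category code.
codes = cat_s.cat.codes.to_numpy()
manual = pd.DataFrame(np.eye(len(cat_s.cat.categories), dtype=int)[codes],
                      columns=cat_s.cat.categories,
                      index=cat_s.index)

# dtype=int requests 0/1 integers instead of the newer boolean default.
dummies = pd.get_dummies(cat_s, dtype=int)

print((manual.to_numpy() == dummies.to_numpy()).all())
```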


**Advanced GroupBy Use**

**Group Transforms and “Unwrapped” GroupBys**

There is another built-in method called transform, which is similar
to apply but imposes more constraints on the kind of function you can use:

• It can produce a scalar value to be broadcast to the shape of the group

• It can produce an object of the same shape as the input group

• It must not mutate its input

In [75]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4, 'value': np.arange(12.)})

In [76]:
df

  key  value
0   a    0.0
1   b    1.0
2   c    2.0
3   a    3.0
4   b    4.0
5   c    5.0
6   a    6.0
7   b    7.0
8   c    8.0
9   a    9.0


In [77]:
g = df.groupby('key').value

In [78]:
g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

In [79]:
g.transform(lambda x: x.mean())

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [80]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [81]:
g.transform(lambda x: x * 2)

0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64

In [82]:
g.transform(lambda x: x.rank(ascending=False))

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64

In [83]:
def normalize(x):
    return (x - x.mean()) / x.std()

In [84]:
g.transform(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [85]:
g.apply(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [86]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [87]:
normalized = (df['value'] - g.transform('mean')) / g.transform('std')

In [88]:
normalized

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

While an unwrapped group operation may involve multiple group aggregations, the
overall benefit of vectorized operations often outweighs this.
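
One way to see the trade-off is to time both forms on data with many groups. The sketch below uses synthetic data and the standard-library timeit module; the function names wrapped and unwrapped are made up for this example, and the timings depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: many small groups make the per-group Python calls noticeable.
rng = np.random.default_rng(0)
df = pd.DataFrame({'key': rng.integers(0, 1000, size=100_000),
                   'value': rng.standard_normal(100_000)})
g = df.groupby('key')['value']

def normalize(x):
    return (x - x.mean()) / x.std()

def wrapped():
    # Calls the Python function once per group
    return g.transform(normalize)

def unwrapped():
    # Two built-in group aggregations plus vectorized arithmetic
    return (df['value'] - g.transform('mean')) / g.transform('std')

print('wrapped:  ', timeit.timeit(wrapped, number=3))
print('unwrapped:', timeit.timeit(unwrapped, number=3))
```

The two functions return the same values; the difference is in how many times Python-level code runs.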

**Grouped Time Resampling**

In [89]:
N = 15

In [90]:
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)

In [91]:
df = pd.DataFrame({'time': times, 'value': np.arange(N)})

In [92]:
df

                 time  value
0 2017-05-20 00:00:00      0
1 2017-05-20 00:01:00      1
2 2017-05-20 00:02:00      2
3 2017-05-20 00:03:00      3
4 2017-05-20 00:04:00      4
5 2017-05-20 00:05:00      5
6 2017-05-20 00:06:00      6
7 2017-05-20 00:07:00      7
8 2017-05-20 00:08:00      8
9 2017-05-20 00:09:00      9


In [93]:
df.set_index('time').resample('5min').count()

                     value
time
2017-05-20 00:00:00      5
2017-05-20 00:05:00      5
2017-05-20 00:10:00      5


In [95]:
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3.)})

In [96]:
df2[:7]

                 time key  value
0 2017-05-20 00:00:00   a    0.0
1 2017-05-20 00:00:00   b    1.0
2 2017-05-20 00:00:00   c    2.0
3 2017-05-20 00:01:00   a    3.0
4 2017-05-20 00:01:00   b    4.0
5 2017-05-20 00:01:00   c    5.0
6 2017-05-20 00:02:00   a    6.0


In [121]:
time_key = pd.Grouper(freq='5min')

In [122]:
time_key

TimeGrouper(freq=<5 * Minutes>, axis=0, sort=True, closed='left', label='left', how='mean', convention='e', base=0)

In [123]:
resampled = df2.set_index('time').groupby(['key', time_key]).sum()

In [124]:
resampled

                         value
key time
a   2017-05-20 00:00:00   30.0
    2017-05-20 00:05:00  105.0
    2017-05-20 00:10:00  180.0
b   2017-05-20 00:00:00   35.0
    2017-05-20 00:05:00  110.0
    2017-05-20 00:10:00  185.0
c   2017-05-20 00:00:00   40.0
    2017-05-20 00:05:00  115.0
    2017-05-20 00:10:00  190.0


In [125]:
resampled.reset_index()

  key                time  value
0   a 2017-05-20 00:00:00   30.0
1   a 2017-05-20 00:05:00  105.0
2   a 2017-05-20 00:10:00  180.0
3   b 2017-05-20 00:00:00   35.0
4   b 2017-05-20 00:05:00  110.0
5   b 2017-05-20 00:10:00  185.0
6   c 2017-05-20 00:00:00   40.0
7   c 2017-05-20 00:05:00  115.0
8   c 2017-05-20 00:10:00  190.0


**Techniques for Method Chaining**

When applying a sequence of transformations to a dataset, you may find yourself
creating numerous temporary variables that are never used in your analysis. Consider
this example, for instance:


df = load_data()

df2 = df[df['col2'] < 0]

df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()

result = df2.groupby('key').col1_demeaned.std()


While we’re not using any real data here, this example highlights some new methods.
First, the DataFrame.assign method is a functional alternative to column
assignments of the form df[k] = v. Rather than modifying the object in-place, it returns a
new DataFrame with the indicated modifications.


# Usual non-functional way
df2 = df.copy()
df2['k'] = v

# Functional assign way
df2 = df.assign(k=v)
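
As a small runnable check of this behavior, the sketch below uses a made-up one-column DataFrame to show that assign leaves the original untouched, and that the new column can also be given as a callable.

```python
import pandas as pd

# Illustrative frame; the column names are made up for this sketch.
df = pd.DataFrame({'a': [1, 2, 3]})

# assign returns a new DataFrame and leaves df as it was.
df2 = df.assign(k=df['a'] * 10)

# A callable is evaluated against the frame being assigned to.
df3 = df.assign(k=lambda x: x['a'] * 10)

print(list(df.columns))   # ['a']
print(list(df2.columns))  # ['a', 'k']
print(df2.equals(df3))    # True
```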

Assigning in place may execute faster than using assign, but assign enables easier
method chaining:

In [None]:
result = (df2.assign(col1_demeaned=df2.col1 - df2.col1.mean())
          .groupby('key')
          .col1_demeaned.std())

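To refer to the intermediate object without naming it, both indexing and assign
accept callables, so the whole sequence can be written as a single chain:

In [None]:
result = (load_data()
          [lambda x: x.col2 < 0]
          .assign(col1_demeaned=lambda x: x.col1 - x.col1.mean())
          .groupby('key')
          .col1_demeaned.std())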
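
To try the chain end to end, the sketch below supplies a hypothetical load_data that returns synthetic data (it is an assumption of this example, not a real loader) and checks that the chained form matches the step-by-step form.

```python
import numpy as np
import pandas as pd

def load_data():
    # Hypothetical stand-in for load_data(); returns synthetic data.
    rng = np.random.default_rng(0)
    return pd.DataFrame({'key': np.tile(['a', 'b'], 50),
                         'col1': rng.standard_normal(100),
                         'col2': rng.standard_normal(100)})

# Step-by-step version with temporaries (copy() avoids assigning into a view)
df = load_data()
df2 = df[df['col2'] < 0].copy()
df2['col1_demeaned'] = df2['col1'] - df2['col1'].mean()
stepwise = df2.groupby('key')['col1_demeaned'].std()

# Chained version: the callables receive the intermediate DataFrame
chained = (load_data()
           [lambda x: x['col2'] < 0]
           .assign(col1_demeaned=lambda x: x['col1'] - x['col1'].mean())
           .groupby('key')
           ['col1_demeaned'].std())

print(np.allclose(stepwise, chained))
```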

**The pipe Method**

In [None]:
a = f(df, arg1=v1)

b = g(a, v2, arg3=v3)

c = h(b, arg4=v4)

In [None]:
result = (df.pipe(f, arg1=v1)
 .pipe(g, v2, arg3=v3)
 .pipe(h, arg4=v4))
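
The equivalence between nested calls and pipe can be checked with two small helpers. The functions add_constant and scale below are hypothetical stand-ins for f, g, and h, chosen only to keep the example short.

```python
import pandas as pd

# Hypothetical helper functions standing in for f, g, and h.
def add_constant(df, value=0):
    return df + value

def scale(df, factor, clip_at=None):
    out = df * factor
    return out if clip_at is None else out.clip(upper=clip_at)

df = pd.DataFrame({'x': [1, 2, 3]})

# Nested calls read inside-out; pipe reads top to bottom.
nested = scale(add_constant(df, value=1), 10, clip_at=35)
piped = (df.pipe(add_constant, value=1)
           .pipe(scale, 10, clip_at=35))

print(piped['x'].tolist())   # [20, 30, 35]
print(nested.equals(piped))  # True
```

pipe passes the object it is called on as the first argument of the function, followed by any extra positional and keyword arguments.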

In [None]:
g = df.groupby(['key1', 'key2'])

df['col1'] = df['col1'] - g['col1'].transform('mean')

Suppose that you wanted to be able to demean more than one column and easily
change the group keys. Additionally, you might want to perform this transformation
in a method chain. Here is an example implementation:

In [None]:
def group_demean(df, by, cols):
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        result[c] = df[c] - g[c].transform('mean')
    return result

Then it is possible to write:

In [None]:
result = (df[df.col1 < 0]
          .pipe(group_demean, ['key1', 'key2'], ['col1']))
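
For a concrete run, the sketch below repeats group_demean and applies it to a small made-up DataFrame. The column names follow the text; the values are invented for illustration.

```python
import pandas as pd

def group_demean(df, by, cols):
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        # Subtract each row's group mean from the column
        result[c] = df[c] - g[c].transform('mean')
    return result

# Illustrative data; the values are made up.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b'],
                   'key2': ['x', 'x', 'y', 'y'],
                   'col1': [-1.0, -3.0, -2.0, 4.0]})

result = (df[df.col1 < 0]
          .pipe(group_demean, ['key1', 'key2'], ['col1']))

print(result['col1'].tolist())  # [1.0, -1.0, 0.0]
```

After filtering, group ('a', 'x') holds -1 and -3 with mean -2, and group ('b', 'y') holds only -2, so the demeaned values are 1, -1, and 0.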