## 1 Categorical Data

Representing repeated values as integer codes that point into a small array of distinct categories can yield significant memory and speed improvements. You can also perform transformations on the categories while leaving the codes unmodified. Some example transformations that can be made at relatively low cost are:
- Renaming categories
- Appending a new category without changing the order or position of the existing categories
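
As a minimal sketch of the two transformations listed above, the following uses `rename_categories` and `add_categories` on a small, made-up categorical (the fruit values and the `'pear'` category are illustrative only) and checks that the integer codes are untouched:

```python
import pandas as pd

cat = pd.Categorical(['apple', 'orange', 'apple', 'apple'])
codes_before = cat.codes.copy()

# Rename the categories; every value keeps its integer code
renamed = cat.rename_categories(['APPLE', 'ORANGE'])

# Append a new category at the end; existing positions are unchanged
extended = cat.add_categories(['pear'])

print(renamed.categories.tolist())
print(extended.categories.tolist())
print((renamed.codes == codes_before).all(),
      (extended.codes == codes_before).all())
```

Both methods return a new categorical by default, so `cat` itself is left as it was.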

In [1]:
import numpy as np
import pandas as pd

### 1.1 Categorical Type in pandas

pandas has a special `Categorical` type for holding data that uses the integer-based categorical representation or encoding.

A categorical array can hold values of any immutable type.

In [2]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2

In [3]:
N = len(fruits)

In [5]:
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])

In [6]:
df

Unnamed: 0,basket_id,fruit,count,weight
0,0,apple,6,2.363501
1,1,orange,7,2.545627
2,2,apple,14,0.44992
3,3,apple,9,0.377273
4,4,apple,7,2.644092
5,5,orange,10,0.106627
6,6,apple,6,1.809975
7,7,apple,3,3.312087


Convert the fruit column to categorical

In [7]:
fruit_cat = df['fruit'].astype('category')

In [8]:
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

The values for fruit_cat are not a NumPy array, but an instance of `pandas.Categorical`

In [9]:
c = fruit_cat.values

In [10]:
type(c)

pandas.core.categorical.Categorical

It has `categories` and `codes` attributes

In [11]:
c.categories

Index([u'apple', u'orange'], dtype='object')

In [12]:
c.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

Convert a DataFrame column to categorical by assigning the converted result

In [14]:
df['fruit'] = df['fruit'].astype('category')

In [15]:
df.fruit

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

You can create pandas.Categorical directly from other types of Python sequences

In [16]:
my_categories = pd.Categorical(['foo', 'bar'])

In [17]:
my_categories

[foo, bar]
Categories (2, object): [bar, foo]

If you have obtained categorical encoded data from another source, you can use the alternative `from_codes` constructor

In [18]:
categories = ['foo', 'bar', 'baz']

In [19]:
codes = [0, 1, 2, 0, 0, 1]

In [20]:
my_cats_2 = pd.Categorical.from_codes(codes, categories)

In [21]:
my_cats_2

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]

Unless explicitly specified, categorical conversions assume no specific ordering of the categories. When using `from_codes` or any of the other constructors, you can indicate that the categories have a meaningful ordering

In [22]:
ordered_cat = pd.Categorical.from_codes(codes, categories,
                                        ordered=True)

In [23]:
ordered_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

An unordered categorical instance can be made ordered with `as_ordered`

In [24]:
my_cats_2.as_ordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]
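
One practical consequence of an ordering, sketched here with the same `codes` and `categories` as above, is that `min`, `max`, and comparisons against a category value follow the declared order rather than alphabetical order:

```python
import pandas as pd

categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]
ordered_cat = pd.Categorical.from_codes(codes, categories=categories,
                                        ordered=True)

# min/max follow the declared order foo < bar < baz, not alphabetical order
lowest = ordered_cat.min()
highest = ordered_cat.max()

# Element-wise comparison against a scalar that is one of the categories
above_foo = ordered_cat > 'foo'

print(lowest, highest)
print(above_foo)
```

On an unordered categorical, the same `>` comparison raises a `TypeError`.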

### 1.2 Computations with Categoricals

In [25]:
np.random.seed(12345)

In [26]:
draws = np.random.randn(1000)

In [28]:
draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

In [31]:
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

In [32]:
bins

[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [34]:
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

The labeled bins categorical does not contain information about the bin edges. We can use groupby to extract some summary statistics

In [35]:
bins = pd.Series(bins, name='quartile')

In [38]:
results = pd.Series(draws).groupby(bins).agg(['count', 'min', 'max'])

In [39]:
results

Unnamed: 0_level_0,count,min,max
quartile,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Q1,250,-2.949343,-0.685484
Q2,250,-0.683066,-0.010115
Q3,250,-0.010032,0.628894
Q4,250,0.634238,3.927528
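
If the bin edges themselves are needed, one option is the `retbins=True` argument of `pd.qcut`, which returns the edges alongside the labeled categorical. A minimal sketch, reusing the seed and draws from above (the name `edges` is only illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
draws = np.random.randn(1000)

# retbins=True returns the computed bin edges together with the categorical
bins, edges = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'],
                      retbins=True)

# Four quartile bins are delimited by five edges, from the minimum to the maximum
print(edges)
```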


**Better performance with categoricals**

In [40]:
N = 1000000

In [41]:
draws = pd.Series(np.random.randn(N))

In [45]:
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

In [46]:
categories = labels.astype('category')

`categories` uses significantly less memory than `labels`

In [47]:
labels.memory_usage()

8000072

In [48]:
categories.memory_usage()

1000104
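
The saving comes from the codes: with only four categories they fit in `int8`, one byte per row. The sketch below repeats the comparison on a smaller series (the size `N = 100000` is an arbitrary choice) and passes `deep=True`, which, when the labels are stored with object dtype, also counts the Python string objects themselves:

```python
import pandas as pd

N = 100000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
categories = labels.astype('category')

# deep=True inspects the stored values instead of counting only pointers
labels_bytes = labels.memory_usage(deep=True)
categories_bytes = categories.memory_usage(deep=True)

# The integer codes behind the categorical use one byte per row
print(categories.cat.codes.dtype, labels_bytes, categories_bytes)
```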

GroupBy operations can be significantly faster because the underlying algorithms use the integer-based codes array instead of an array of strings

In [49]:
%time draws.groupby(labels).mean()

CPU times: user 60.8 ms, sys: 10.2 ms, total: 71 ms
Wall time: 69.8 ms


bar    0.002704
baz    0.000537
foo   -0.000582
qux    0.003127
dtype: float64

In [50]:
%time draws.groupby(categories).mean()

CPU times: user 20.3 ms, sys: 8.91 ms, total: 29.2 ms
Wall time: 26 ms


bar    0.002704
baz    0.000537
foo   -0.000582
qux    0.003127
dtype: float64

### 1.3 Categorical Methods

In [51]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)

In [53]:
cat_s = s.astype('category')

In [55]:
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

The special attribute `cat` provides access to categorical methods

In [57]:
cat_s.cat

<pandas.core.categorical.CategoricalAccessor object at 0x1139c01d0>

In [56]:
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [58]:
cat_s.cat.categories

Index([u'a', u'b', u'c', u'd'], dtype='object')

Suppose that we know the actual set of categories for this data extends beyond the four values observed in the data. We can use the `set_categories` method to change them

In [59]:
actual_categories = ['a', 'b', 'c', 'd', 'e']

In [60]:
cat_s2 = cat_s.cat.set_categories(actual_categories)

In [61]:
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

While it appears that the data is unchanged, the new categories will be reflected in operations that use them.

In [62]:
cat_s.value_counts()

d    2
c    2
b    2
a    2
dtype: int64

In [63]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64

After you filter a large DataFrame or Series, many of the categories may not appear in the data. We can use the `remove_unused_categories` method to trim unobserved categories

In [64]:
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]

In [65]:
cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [67]:
cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

<img src='img/12_1_1.png'>

**Creating dummy variables for modeling**

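Statistics and machine learning tools often require categorical data to be converted to dummy variables (also known as one-hot encoding): one column per category, with a 1 marking the rows that belong to it.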

In [68]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

In [69]:
pd.get_dummies(cat_s)

Unnamed: 0,a,b,c,d
0,1,0,0,0
1,0,1,0,0
2,0,0,1,0
3,0,0,0,1
4,1,0,0,0
5,0,1,0,0
6,0,0,1,0
7,0,0,0,1
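
When the indicator columns are to be combined with other columns, the `prefix` argument of `get_dummies` helps keep track of where they came from. A minimal sketch with a made-up two-column DataFrame (the names `key` and `value` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'key': pd.Series(['a', 'b', 'c', 'd'] * 2,
                                    dtype='category'),
                   'value': range(8)})

# One indicator column per category, named key_a, key_b, ...
dummies = pd.get_dummies(df['key'], prefix='key')

# Attach the indicators to the remaining columns for use as model inputs
df_with_dummies = df[['value']].join(dummies)

print(df_with_dummies)
```

Depending on the pandas version, the indicator columns hold integers or booleans; each row has exactly one of them set either way.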


## 2 Advanced GroupBy Use

### 2.1 Group Transforms and "Unwrapped" GroupBys

The built-in method `transform` is similar to `apply` but imposes more constraints on the kind of function you can use:
- It can produce a scalar value to be broadcast to the shape of the group
- It can produce an object of the same shape as the input group
- It must not mutate its input

In [70]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})

In [71]:
df

Unnamed: 0,key,value
0,a,0.0
1,b,1.0
2,c,2.0
3,a,3.0
4,b,4.0
5,c,5.0
6,a,6.0
7,b,7.0
8,c,8.0
9,a,9.0


Group means by key

In [78]:
g = df.groupby('key')['value']

In [79]:
g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

Suppose instead we wanted to produce a Series of the same shape as df['value'] but with values replaced by the average by 'key'.

In [80]:
g.transform(lambda x: x.mean())

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

For built-in aggregation functions, we can pass a string alias as with the GroupBy `agg` method

In [81]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

Like `apply`, `transform` works with functions that return Series, but the result must be the same size as the input

In [82]:
g.transform(lambda x: x * 2)

0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64

Compute the ranks in descending order for each group

In [83]:
g.transform(lambda x: x.rank(ascending=False))

0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64

Normalize the value

In [85]:
def normalize(x):
    return (x - x.mean()) / x.std()

In [86]:
g.transform(normalize)

0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

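Because `transform` accepts built-in aggregations by name, we can perform a so-called **unwrapped** group operation: the outputs of several such calls are combined with ordinary arithmetic instead of running a custom function on each group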

In [94]:
g.transform('mean')

0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
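
A minimal sketch of how this can be used: the normalization computed earlier with a custom function can be rebuilt from two built-in transforms and plain arithmetic, using the same `df` and `normalize` as above. The names `wrapped` and `unwrapped` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})
g = df.groupby('key')['value']

def normalize(x):
    return (x - x.mean()) / x.std()

# "Wrapped": the Python function is called once per group
wrapped = g.transform(normalize)

# "Unwrapped": two built-in group aggregations, broadcast back to the
# original shape, combined with ordinary vectorized arithmetic
unwrapped = (df['value'] - g.transform('mean')) / g.transform('std')

print(np.allclose(wrapped, unwrapped))
```

The unwrapped form does several group aggregations instead of one, but each of them avoids calling back into Python per group.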

### 2.2 Grouped Time Resampling

Suppose that a DataFrame contains multiple time series, marked by an additional group key column

In [97]:
N = 15

In [98]:
df = pd.DataFrame({'key': np.tile(['a', 'b', 'c'], N),
                   'value': np.arange(N * 3)},
                  index=pd.date_range('2017-05-20 00:00',
                                      freq='1min', periods=N).repeat(3))

In [99]:
df.head()

Unnamed: 0,key,value
2017-05-20 00:00:00,a,0
2017-05-20 00:00:00,b,1
2017-05-20 00:00:00,c,2
2017-05-20 00:01:00,a,3
2017-05-20 00:01:00,b,4


Use a `pandas.Grouper` object (the replacement for `pandas.TimeGrouper`, which was deprecated and then removed in pandas 1.0) to do resampling for each value of key

In [100]:
time_key = pd.Grouper(freq='5min')

In [101]:
resampled = df.groupby(['key', time_key]).sum()

In [102]:
resampled

Unnamed: 0_level_0,Unnamed: 1_level_0,value
key,Unnamed: 1_level_1,Unnamed: 2_level_1
a,2017-05-20 00:00:00,30
a,2017-05-20 00:05:00,105
a,2017-05-20 00:10:00,180
b,2017-05-20 00:00:00,35
b,2017-05-20 00:05:00,110
b,2017-05-20 00:10:00,185
c,2017-05-20 00:00:00,40
c,2017-05-20 00:05:00,115
c,2017-05-20 00:10:00,190


In [105]:
df.groupby('key').resample('5min').sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,value
key,Unnamed: 1_level_1,Unnamed: 2_level_1
a,2017-05-20 00:00:00,30
a,2017-05-20 00:05:00,105
a,2017-05-20 00:10:00,180
b,2017-05-20 00:00:00,35
b,2017-05-20 00:05:00,110
b,2017-05-20 00:10:00,185
c,2017-05-20 00:00:00,40
c,2017-05-20 00:05:00,115
c,2017-05-20 00:10:00,190
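
Both approaches above rely on the timestamps being the index. When they live in an ordinary column instead, one option is to name that column through the `key` argument of `pd.Grouper`. A minimal sketch on the same data, with the timestamps moved into a column that is called `time` here purely for illustration:

```python
import numpy as np
import pandas as pd

N = 15
times = pd.date_range('2017-05-20 00:00', freq='1min', periods=N)
df2 = pd.DataFrame({'time': times.repeat(3),
                    'key': np.tile(['a', 'b', 'c'], N),
                    'value': np.arange(N * 3)})

# key='time' tells Grouper to bin on that column rather than on the index
resampled = (df2.groupby(['key', pd.Grouper(key='time', freq='5min')])['value']
             .sum())

print(resampled)
```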


## 3 Techniques for Method Chaining

### 3.2 The pipe method

```
a = f(df, arg1=v1)
b = g(a, v2, arg3=v3)
c = h(b, arg4=v4)
```

When using functions that accept and return Series or DataFrame objects, as above, you can rewrite the sequence using calls to `pipe`

```
result = (df.pipe(f, arg1=v1)
          .pipe(g, v2, arg3=v3)
          .pipe(h, arg4=v4))
```

A potentially useful pattern for `pipe` is to generalize sequences of operations into reusable functions
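
As a sketch of that pattern, the hypothetical helper `group_demean` below (its name and signature are assumptions made for illustration) wraps a group-wise demeaning step so that it can be dropped into a chain after a row filter:

```python
import numpy as np
import pandas as pd

def group_demean(df, by, cols):
    """Return a copy of df with each column in cols centered on its group mean."""
    result = df.copy()
    g = df.groupby(by)
    for c in cols:
        result[c] = df[c] - g[c].transform('mean')
    return result

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12.)})

# Filter rows first, then subtract the group means, in a single chain
result = (df[df['value'] >= 3]
          .pipe(group_demean, ['key'], ['value']))

print(result)
```

Since the helper takes the DataFrame as its first argument and returns a new one, it composes with `pipe` in the same way as `f`, `g`, and `h` in the schematic example.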