# Categories

See "Python for Data Analysis, 2nd Edition", Wes McKinney, Chapter 12.

In [1]:
import numpy as np
import pandas as pd

In [None]:
Let's look at a categorical, or factor-type, variable in Python.

In [2]:
values=pd.Series(['apple','orange','apple','apple']*2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [3]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [4]:
values.value_counts()

apple     6
orange    2
dtype: int64

# Storing a categorical variable as text uses a lot of memory

In data warehousing and other computational approaches to working with or storing categorical data, it is common to 
code the categories as integers and then use a "dimension table" that stores the names of the categories.

The R factor type stores factors as integers and creates a dimension table of names. Since R is so closely tuned for statistical analysis, the factor type is built into the language.

Here is what the "dimension table" implementation looks like:

In [5]:
values=pd.Series([0,1,0,0]*2)

dim_table=pd.Series(['apple','orange'])

In [6]:
values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [7]:
dim_table

0     apple
1    orange
dtype: object

In [8]:
# the take method looks up the name in the dim table for each integer code

dim_table.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

You may see this style of dimension table used in databases, with a numerical code in the table and the dimension table often explained in a data dictionary.
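
Going the other way, from raw strings to integer codes plus a dimension table, can be sketched with pd.factorize. This is only an illustration of the idea above; the variable names codes, dim_table and restored are chosen for this example.

```python
import pandas as pd

raw = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns one integer code per row plus the unique names
# in order of first appearance (the "dimension table")
codes, dim_table = pd.factorize(raw)

# take() maps each code back to its name, reversing the encoding
restored = dim_table.take(codes)

print(codes)
print(list(dim_table))
print(list(restored))
```

The codes are what you would store in the main table; the dimension table is stored once.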

## Magic Number coding

Prior to 1980 or so, it was common for databases to use "magic numbers" that encoded multiple categorical values in a single number.

You see these in serial numbers of manufactured objects. A code like 0210912321 might mean the object was made in 2021 (first 3 digits), in week 09 of 
the year (roughly the first week of March), and was the 12321st object of that type made that year, so the serial number might have to be split up into 3 integer pieces to be 
interpreted.
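
A minimal sketch of splitting such a code, assuming the layout described above (3 digits of year, 2 digits of week, then a running count). The function name split_serial and the rule that the year digits are an offset from 2000 are assumptions made for this illustration; a real data dictionary would define the actual layout.

```python
def split_serial(serial):
    """Split a 10-character serial into (year, week, sequence number).

    Assumed layout: YYY WW NNNNN. Keep the serial as a string so the
    leading zero is not lost.
    """
    year = 2000 + int(serial[:3])   # assumption: '021' means 2021
    week = int(serial[3:5])         # week of the year
    sequence = int(serial[5:])      # running count within the year
    return year, week, sequence

year, week, sequence = split_serial('0210912321')
print(year, week, sequence)
```

Note that storing the serial as an integer would drop the leading zero and shift every field, which is one reason these codes are easy to misread.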

This is not uncommon in older databases, so watch for it.

## The Category data type in Python

The pandas category type is not as sophisticated as the factor type in R, but it does save on storage space, and there is an easy conversion to 
one-hot encoding.

In [10]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2

N = len(fruits)


df = pd.DataFrame(
    {
        'fruit': fruits,
        'basket_id': np.arange(N),
        'count': np.random.randint(3, 15, size=N),
        'weight': np.random.uniform(0, 4, size=N),
    },
    columns=['basket_id', 'fruit', 'count', 'weight'],
)

df

   basket_id   fruit  count    weight
0          0   apple      9  3.508213
1          1  orange      8  0.550126
2          2   apple      6  1.061943
3          3   apple      3  0.409063
4          4   apple      5  1.039201
5          5  orange      5  1.939664
6          6   apple      9  3.738356
7          7   apple      6  3.219056


In [11]:
# Convert fruit to a category- creating a categorical variable

fruit_cat=df['fruit'].astype('category')

fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [12]:
# a categorical Series exposes its underlying Categorical through .values

fruit_cat.values

['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple', 'apple']
Categories (2, object): ['apple', 'orange']

In [17]:
# and the list of categories through .cat.categories
fruit_cat.cat.categories

Index(['apple', 'orange'], dtype='object')

In [18]:
# creating a category from a list

my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])

my_categories

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']

In [20]:
# ordered categories, used where the categories have a natural order, e.g. small-medium-large

categories = ['foo', 'bar', 'baz']

codes = [0, 1, 2, 0, 0, 1]

ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)

ordered_cat

['foo', 'bar', 'baz', 'foo', 'foo', 'bar']
Categories (3, object): ['foo' < 'bar' < 'baz']
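
As a sketch of why the ordering matters, an ordered Categorical supports comparisons, min/max and sorting by category order rather than alphabetically. The small/medium/large data here is made up for illustration.

```python
import pandas as pd

sizes = pd.Categorical(
    ['medium', 'small', 'large', 'small'],
    categories=['small', 'medium', 'large'],  # listed from lowest to highest
    ordered=True,
)

smallest = sizes.min()
largest = sizes.max()

# element-wise comparison against one of the categories
is_bigger = sizes > 'small'

# sorting follows the category order, not alphabetical order
sorted_sizes = pd.Series(sizes).sort_values().tolist()

print(smallest, largest)
print(list(is_bigger))
print(sorted_sizes)
```

Without ordered=True, the comparison and min/max calls would raise a TypeError, since pandas has no way to rank the categories.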

## Computation with Categoricals



In [30]:
np.random.seed(12345)

draws = np.random.randn(1000)

draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

### Let's bin this into 4 quartiles. The qcut function bins data by sample quantiles; cut() produces evenly spaced bins


In [31]:

bins = pd.qcut(draws, 4)

bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64, right]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]
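
To make the difference between the two functions concrete, here is a small sketch that bins the same data both ways and counts the members of each bin. It uses its own random data (a seeded default_rng generator, which is an assumption of this example and not the draws array above).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

# qcut: bin edges are sample quantiles, so each bin holds the same count
q_counts = pd.Series(pd.qcut(x, 4)).value_counts(sort=False)

# cut: bin edges are evenly spaced over the data range, so counts differ
c_counts = pd.Series(pd.cut(x, 4)).value_counts(sort=False)

print(q_counts)
print(c_counts)
```

For roughly normal data, cut puts most points in the two middle bins, while qcut gives 250 points per bin by construction.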

In [32]:
# we can add labels to the quartiles, which is nice for plots or tables

bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

In [33]:
# we can use groupby to extract some results about the nature of this binning

bins = pd.Series(bins, name='quartile')

# method chaining: group the draws by quartile, aggregate, then move the quartile index into a column

results = (pd.Series(draws).groupby(bins).agg(['count', 'min', 'max']).reset_index())
results

# note that the quartiles are ordered,  so the groupby uses that ordering of the category

  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528


In [34]:
# without reset_index, the quartile labels stay as the row index instead of becoming an ordinary column
results = (pd.Series(draws).groupby(bins).agg(['count', 'min', 'max']))
results

          count       min       max
quartile
Q1          250 -2.949343 -0.685484
Q2          250 -0.683066 -0.010115
Q3          250 -0.010032  0.628894
Q4          250  0.634238  3.927528
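
The effect of reset_index is easier to see on a tiny example. This sketch uses made-up data and a hypothetical grouping key named group; it only compares the index and columns of the two results.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
keys = pd.Series(['lo', 'lo', 'hi', 'hi'], name='group')

# after groupby/agg, the group labels form the row index
indexed = s.groupby(keys).agg(['count', 'min', 'max'])

# reset_index moves those labels into a regular column
# and replaces the index with 0, 1, ...
flat = indexed.reset_index()

print(indexed.index.name, list(indexed.columns))
print(list(flat.index), list(flat.columns))
```

The flat form is often more convenient for merging, plotting or writing to a file, since the group label is an ordinary column.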


## Memory usage

We can see the savings in memory usage.

In [26]:
N = 10000000

draws = pd.Series(np.random.randn(N))

labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

# category version of the results
categories = labels.astype('category')

In [28]:
print(labels.memory_usage())

print(categories.memory_usage())

80000128
10000332
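
One caveat worth sketching: for an object-dtype column, memory_usage() by default counts only the 8-byte references to the strings, not the strings themselves. Passing deep=True includes the string objects, so the real saving is typically larger than the numbers above suggest. The example below uses a smaller N and forces dtype=object; both are choices made for this illustration.

```python
import pandas as pd

n = 100_000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4), dtype=object)
categories = labels.astype('category')

shallow = labels.memory_usage()             # references only
deep = labels.memory_usage(deep=True)       # references plus string objects
cat_deep = categories.memory_usage(deep=True)

print(shallow, deep, cat_deep)
```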


## Categorical methods

There are specialized categorical methods that can be accessed using the special attribute cat

In [35]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [36]:
# see the underlying codes
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

If we know that this variable has additional categories that do not appear in our dataset, we can add them with set_categories.

In [37]:
actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

In [38]:
print(cat_s.value_counts())

print(cat_s2.value_counts())

a    2
b    2
c    2
d    2
dtype: int64
a    2
b    2
c    2
d    2
e    0
dtype: int64
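
A short sketch of a related pitfall: set_categories replaces the whole category list, so any value missing from the new list becomes NaN, whereas add_categories only appends. The Series here is made up for illustration.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'], dtype='category')

# add_categories appends new levels and leaves existing values alone
added = s.cat.add_categories(['e'])

# set_categories with a shorter list turns the dropped levels into NaN
narrowed = s.cat.set_categories(['a', 'b'])

print(list(added.cat.categories))
print(narrowed.isna().tolist())
```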


In [39]:
# selecting only specific entries, using isin

cat_s3 = cat_s[cat_s.isin(['a', 'b'])]

cat_s3

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [42]:
# note that cat_s3 contains only two values, a and b, but still lists all four category levels a, b, c, d

# we can clean this up

# removing unused categories

cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']

## Creating dummy variables (one-hot encoding) from a category


In [43]:
cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [44]:
pd.get_dummies(cat_s)

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
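
As a sketch of two options that are often useful here: the dtype argument forces integer 0/1 columns (recent pandas versions return booleans by default), and prefix gives the new columns more descriptive names. The prefix 'level' is just a name chosen for this example.

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# one column per category, 1 where the row has that category
dummies = pd.get_dummies(cat_s, prefix='level', dtype=int)

print(dummies)
```

Each row contains exactly one 1, which is what "one-hot" refers to.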
