# Pandas Categorical data type

- Categorical columns can behave unexpectedly compared to `object` or `string` columns, see this [blog post](https://towardsdatascience.com/staying-sane-while-adopting-pandas-categorical-datatypes-78dbd19dcd8a)


In [1]:
import pandas as pd
from vaep import utils

df_long = utils.create_long_df(100, 10)
df_long = df_long.set_index(df_long.columns[:-1].to_list())
df_long

Sample ID,peptide,intensity
0,0,23.362
0,1,25.403
0,2,25.592
0,3,21.950
0,4,26.262
...,...,...
99,4,28.055
99,5,27.920
99,6,23.427
99,7,22.775


In [2]:
idx_names = df_long.index.names
df_long = df_long.reset_index()
df_long[idx_names] = df_long[idx_names].astype(str)
df_long.dtypes

Sample ID     object
peptide       object
intensity    float64
dtype: object

In [3]:
samples = ["0", "98"]
df_long.set_index(idx_names).loc[samples]

Sample ID,peptide,intensity
0,0,23.362
0,1,25.403
0,2,25.592
0,3,21.95
0,4,26.262
0,5,27.28
0,6,24.959
0,7,23.384
0,8,22.749
0,9,24.496


In [4]:
df_long[idx_names] = df_long[idx_names].astype('category')
df_long.set_index(idx_names).loc[samples]

Sample ID,peptide,intensity
0,0,23.362
0,1,25.403
0,2,25.592
0,3,21.95
0,4,26.262
0,5,27.28
0,6,24.959
0,7,23.384
0,8,22.749
0,9,24.496


In [5]:
idx = df_long.set_index(idx_names).index

In [6]:
SAMPLE_ID = 'Sample ID'
df_long[SAMPLE_ID].dtype

CategoricalDtype(categories=['0', '1', '10', '11', '12', '13', '14', '15', '16', '17',
                  '18', '19', '2', '20', '21', '22', '23', '24', '25', '26',
                  '27', '28', '29', '3', '30', '31', '32', '33', '34', '35',
                  '36', '37', '38', '39', '4', '40', '41', '42', '43', '44',
                  '45', '46', '47', '48', '49', '5', '50', '51', '52', '53',
                  '54', '55', '56', '57', '58', '59', '6', '60', '61', '62',
                  '63', '64', '65', '66', '67', '68', '69', '7', '70', '71',
                  '72', '73', '74', '75', '76', '77', '78', '79', '8', '80',
                  '81', '82', '83', '84', '85', '86', '87', '88', '89', '9',
                  '90', '91', '92', '93', '94', '95', '96', '97', '98', '99'],
, ordered=False)

## Reuse a categorical dtype

In [7]:
pd.Series(['1', '98', '200'], dtype=df_long[SAMPLE_ID].dtype)

0      1
1     98
2    NaN
dtype: category
Categories (100, object): ['0', '1', '10', '11', ..., '96', '97', '98', '99']
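
Values that are not among the categories of a reused dtype end up as `NaN`, as `'200'` does above. A minimal sketch of how such values could be checked before casting is shown here; `sample_dtype` is a hypothetical, shortened stand-in for `df_long[SAMPLE_ID].dtype`, and masking with `where` is only one possible approach.

```python
import pandas as pd

# Hypothetical stand-in for df_long[SAMPLE_ID].dtype (only four categories).
sample_dtype = pd.CategoricalDtype(["0", "1", "98", "99"])

new_values = pd.Series(["1", "98", "200"])

# Values that are not among the categories of the reused dtype.
is_known = new_values.isin(sample_dtype.categories)
unknown = new_values[~is_known].tolist()

# Mask unknown values explicitly, then cast to the reused dtype.
s = new_values.where(is_known).astype(sample_dtype)
n_missing = int(s.isna().sum())

print(unknown, n_missing)
```

Inspecting `unknown` before the cast makes it visible which values would otherwise be lost.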

In [8]:
df_long[SAMPLE_ID].cat.codes # integer codes

0      0
1      0
2      0
3      0
4      0
      ..
902   99
903   99
904   99
905   99
906   99
Length: 907, dtype: int8

## Ordered integers

In [9]:
s = pd.Series([10, 50, 100] *10, dtype='category')
s.describe()

count    30
unique    3
top      10
freq     10
dtype: int64

In [10]:
s.cat.categories, s.unique()

(Int64Index([10, 50, 100], dtype='int64'),
 [10, 50, 100]
 Categories (3, int64): [10, 50, 100])

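When categories are inferred from the data, pandas sorts the unique values if they are sortable (strings lexicographically, as in the `Sample ID` categories above: `'0', '1', '10', ...`). Only otherwise does it fall back to the order of appearance.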

> Codes are an array of integers which are the positions of the actual values in the categories array. ([src](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.codes.html))
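
The relation between codes and categories described in the quote can be sketched with a small, made-up example (the values `"a"`, `"b"`, `"c"` are illustrative and not part of the data used here):

```python
import pandas as pd

# Categories are inferred from the data; for sortable values they are sorted.
cat = pd.Categorical(["b", "a", "c", "a"])

categories = list(cat.categories)  # unique values: ['a', 'b', 'c']
codes = cat.codes.tolist()         # position of each value in `categories`

# Looking each code up in the categories array recovers the original values.
recovered = [categories[code] for code in codes]

print(categories, codes, recovered)
```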

In [17]:
s = pd.Series([100, 50, 10] * 10, dtype='category') # for integers, the order of the data does not change the categories
s.cat.categories, s.unique()

(Int64Index([10, 50, 100], dtype='int64'),
 [100, 50, 10]
 Categories (3, int64): [100, 50, 10])

Solution: be explicit and define the categories upfront with `pd.CategoricalDtype`

In [12]:
dtype = pd.CategoricalDtype([10, 50, 100], ordered=False)
s = pd.Series([100, 50, 10] *10, dtype=dtype)
s.cat.categories, s.unique()

(Int64Index([10, 50, 100], dtype='int64'),
 [100, 50, 10]
 Categories (3, int64): [100, 50, 10])

In [14]:
s.unique()

[100, 50, 10]
Categories (3, int64): [100, 50, 10]

The codes are not the original integers! They are the positions of the values in the categories array:

In [16]:
s.cat.codes.unique() # the codes are then not the original integers!

array([2, 1, 0], dtype=int8)
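
If the original integers are needed again, they can be recovered from the categorical Series. The following is a minimal sketch with a shortened Series; casting with `astype(int)` and looking the codes up in the categories are two options, not the only ones.

```python
import pandas as pd

dtype = pd.CategoricalDtype([10, 50, 100], ordered=False)
s = pd.Series([100, 50, 10] * 2, dtype=dtype)

# Codes are positions in the categories array, not the integers themselves.
codes = s.cat.codes.tolist()

# Option 1: cast the categorical back to integers.
values = s.astype(int).tolist()

# Option 2: look the codes up in the categories array.
via_lookup = s.cat.categories.take(s.cat.codes.to_numpy()).tolist()

print(codes, values, via_lookup)
```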