ENH: Parametrized CategoricalDtype
We extended the CategoricalDtype to accept optional categories and ordered
arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are known ahead of time in other places, e.g. .astype, .read_csv, and the
Series constructor.
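
As a rough illustration of the three entry points named above, the sketch below declares the categories once and reuses the same dtype everywhere. It assumes a pandas version in which `astype`, the `Series` constructor, and `read_csv` all accept a `CategoricalDtype`; the variable names and the in-memory CSV text are made up for the example.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Categories are declared up front; 'd' never occurs in the data.
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)

# 1. Convert an existing Series.
converted = pd.Series(['a', 'b', 'c', 'a']).astype(dtype)

# 2. Pass the dtype straight to the Series constructor.
constructed = pd.Series(['a', 'b', 'c', 'a'], dtype=dtype)

# 3. Parse a CSV column as the declared categorical (hypothetical data).
csv_text = "col\na\nb\nc\na\n"
parsed = pd.read_csv(io.StringIO(csv_text), dtype={'col': dtype})

print(converted.cat.categories.tolist())
print(constructed.dtype == dtype)
print(parsed['col'].cat.categories.tolist())
```

In each case the resulting categories come from the dtype rather than from the values that happen to be present.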

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger committed Sep 13, 2017
1 parent f11bbf2 commit ac24101
Showing 29 changed files with 701 additions and 235 deletions.
2 changes: 1 addition & 1 deletion doc/source/advanced.rst
@@ -640,7 +640,7 @@ and allows efficient indexing and storage of an index with a large number of dup
df = pd.DataFrame({'A': np.arange(6),
'B': list('aabbca')})
df['B'] = df['B'].astype('category', categories=list('cab'))
df['B'] = df['B'].astype(pd.api.types.CategoricalDtype(list('cab')))
df
df.dtypes
df.B.cat.categories
90 changes: 82 additions & 8 deletions doc/source/categorical.rst
@@ -89,12 +89,20 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
df["B"] = raw_cat
df
You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
Anywhere above we passed a keyword ``dtype='category'``, we used the default behavior:

1. categories are inferred from the data
2. categories are unordered.

To control those behaviors, instead of passing ``'category'``, use an instance
of :class:`~pandas.api.types.CategoricalDtype`.

.. ipython:: python
s = pd.Series(["a","b","c","a"])
s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
s = pd.Series(["a", "b", "c", "a"])
cat_type = pd.api.types.CategoricalDtype(categories=["b", "c", "d"],
ordered=False)
s_cat = s.astype(cat_type)
s_cat
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
@@ -133,6 +141,62 @@ constructor to save the factorize step during normal constructor mode:
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
CategoricalDtype
----------------

.. versionchanged:: 0.21.0

A categorical's type is fully described by 1.) its categories (an iterable with
unique values and no missing values), and 2.) its orderedness (a boolean).
This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
The ``categories`` argument is optional, which implies that the actual categories
should be inferred from whatever is present in the data when the
:class:`pandas.Categorical` is created.

.. ipython:: python
pd.api.types.CategoricalDtype(['a', 'b', 'c'])
pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=True)
pd.api.types.CategoricalDtype()
A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
expects a `dtype`, for example :func:`pandas.read_csv`,
:func:`pandas.DataFrame.astype`, or the Series constructor.

As a convenience, you can use the string `'category'` in place of a
:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
the categories being unordered, and equal to the set of values present in the
array. In other words, ``dtype='category'`` is equivalent to
``dtype=pd.api.types.CategoricalDtype()``.
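
The equivalence described above can be checked with a small sketch like the following. The sample values and variable names are illustrative only; the categories are inferred from the data in both cases.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(['b', 'a', 'c', 'a'])

# Default behavior requested through the string alias.
by_string = s.astype('category')

# The same request spelled out with an unparametrized CategoricalDtype.
by_dtype = s.astype(CategoricalDtype())

# Both infer their categories from the values present and are unordered.
print(by_string.cat.categories.tolist())
print(by_string.cat.ordered)
print(by_string.dtype == by_dtype.dtype)
```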

Equality Semantics
~~~~~~~~~~~~~~~~~~

Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal whenever they have
the same categories and orderedness. When comparing two unordered categoricals, the
order of the ``categories`` is not considered.

.. ipython:: python
c1 = pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=False)
# Equal, since order is not considered when ordered=False
c1 == pd.api.types.CategoricalDtype(['b', 'c', 'a'], ordered=False)
# Unequal, since the second CategoricalDtype is ordered
c1 == pd.api.types.CategoricalDtype(['a', 'b', 'c'], ordered=True)
All instances of ``CategoricalDtype`` compare equal to the string ``'category'``.

.. ipython:: python
c1 == 'category'
.. warning::

Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``.
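
The comparison rules for fully specified dtypes can be sketched as below. The ordered-versus-ordered case is not spelled out in the text; it is included on the assumption that category order matters once ``ordered=True``, which is what the unordered rule implies. All variable names are illustrative.

```python
from pandas.api.types import CategoricalDtype

unordered = CategoricalDtype(['a', 'b', 'c'], ordered=False)
unordered_shuffled = CategoricalDtype(['b', 'c', 'a'], ordered=False)
ordered = CategoricalDtype(['a', 'b', 'c'], ordered=True)
ordered_shuffled = CategoricalDtype(['b', 'c', 'a'], ordered=True)

# Unordered dtypes: only the set of categories matters.
same_unordered = unordered == unordered_shuffled

# Differing orderedness: never equal, even with identical categories.
mixed = unordered == ordered

# Ordered dtypes: the sequence of categories is part of the type.
same_ordered = ordered == ordered_shuffled

# Every instance compares equal to the string alias.
matches_string = all(
    d == 'category'
    for d in (unordered, unordered_shuffled, ordered, ordered_shuffled)
)

print(same_unordered, mixed, same_ordered, matches_string)
```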

Description
-----------

@@ -182,7 +246,9 @@ It's also possible to pass in the categories in a specific order:

.. ipython:: python
s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
s = pd.Series(list('babc')).astype(
pd.api.types.CategoricalDtype(list('abcd'))
)
s
# categories
@@ -295,7 +361,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``
s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
s.sort_values(inplace=True)
s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
s = pd.Series(["a","b","c","a"]).astype(
pd.api.types.CategoricalDtype(ordered=True)
)
s.sort_values(inplace=True)
s
s.min(), s.max()
@@ -395,9 +463,15 @@ categories or a categorical with any list-like object, will raise a TypeError.

.. ipython:: python
cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
cat = pd.Series([1,2,3]).astype(
pd.api.types.CategoricalDtype([3, 2, 1], ordered=True)
)
cat_base = pd.Series([2,2,2]).astype(
pd.api.types.CategoricalDtype([3, 2, 1], ordered=True)
)
cat_base2 = pd.Series([2,2,2]).astype(
pd.api.types.CategoricalDtype(ordered=True)
)
cat
cat_base
11 changes: 8 additions & 3 deletions doc/source/merging.rst
@@ -831,7 +831,7 @@ The left frame.
.. ipython:: python
X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
X = X.astype('category', categories=['foo', 'bar'])
X = X.astype(pd.api.types.CategoricalDtype(categories=['foo', 'bar']))
left = pd.DataFrame({'X': X,
'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +842,13 @@ The right frame.

.. ipython:: python
right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
'Z': [1, 2]})
from pandas.api.types import CategoricalDtype
right = pd.DataFrame({
'X': pd.Series(['foo', 'bar'],
dtype=CategoricalDtype(['foo', 'bar'])),
'Z': [1, 2]
})
right
right.dtypes
26 changes: 26 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -22,6 +22,8 @@ Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations
New features
~~~~~~~~~~~~

- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
categoricals independent of the data (:issue:`14711`, :issue:`15078`)
- Support for `PEP 519 -- Adding a file system path protocol
<https://www.python.org/dev/peps/pep-0519/>`_ on most readers and writers (:issue:`13823`)
- Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`,
@@ -88,6 +90,30 @@ This does not raise any obvious exceptions, but also does not create a new colum

Setting a list-like data structure into a new attribute now raises a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.

.. _whatsnew_0210.enhancements.categorical_dtype:

``CategoricalDtype`` for specifying categoricals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
expanded to include the ``categories`` and ``ordered`` attributes. A
``CategoricalDtype`` can be used to specify the set of categories and
orderedness of an array, independent of the data themselves. This can be useful,
e.g., when converting string data to a ``Categorical``:

.. ipython:: python

from pandas.api.types import CategoricalDtype

s = pd.Series(['a', 'b', 'c', 'a']) # strings
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
s.astype(dtype)

The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.

See :ref:`CategoricalDtype <categorical.categoricaldtype>` for more.
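
The statement about the ``.dtype`` property can be illustrated with a short sketch. The sample data and variable names below are assumptions made for the example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
idx = pd.CategoricalIndex(cat)
ser = pd.Series(cat)

# Each container exposes its categories and orderedness through .dtype.
for obj in (cat, idx, ser):
    assert isinstance(obj.dtype, CategoricalDtype)
    print(type(obj).__name__, obj.dtype.categories.tolist(), obj.dtype.ordered)
```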

.. _whatsnew_0210.enhancements.other:

Other Enhancements