Skip to content

Commit

Permalink
ENH: Parametrized CategoricalDtype
Browse files Browse the repository at this point in the history
We extended the CategoricalDtype to accept optional categories and ordered
argument.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True
```

CategoricalDtype is now part of the public API. This allows users to
specify the desired categories and orderedness of an operation ahead of time.
The current behavior, which is still possible with categories=None, the
default, is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories
that are know ahead of time in other places e.g. .astype, .read_csv, and the
Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
  • Loading branch information
TomAugspurger committed Sep 17, 2017
1 parent 94266d4 commit cd5f4e2
Show file tree
Hide file tree
Showing 31 changed files with 898 additions and 250 deletions.
4 changes: 3 additions & 1 deletion doc/source/advanced.rst
Original file line number Diff line number Diff line change
Expand Up @@ -638,9 +638,11 @@ and allows efficient indexing and storage of an index with a large number of dup

.. ipython:: python
from pandas.api.types import CategoricalDtype
df = pd.DataFrame({'A': np.arange(6),
'B': list('aabbca')})
df['B'] = df['B'].astype('category', categories=list('cab'))
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
df
df.dtypes
df.B.cat.categories
Expand Down
5 changes: 4 additions & 1 deletion doc/source/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -646,7 +646,10 @@ strings and apply several methods to it. These can be accessed like
Categorical
~~~~~~~~~~~

If the Series is of dtype ``category``, ``Series.cat`` can be used to change the the categorical
.. autoclass:: api.types.CategoricalDtype
:members: categories, ordered

If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
data. This accessor is similar to the ``Series.dt`` or ``Series.str`` and has the
following usable methods and properties:

Expand Down
98 changes: 90 additions & 8 deletions doc/source/categorical.rst
Original file line number Diff line number Diff line change
Expand Up @@ -89,12 +89,22 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
df["B"] = raw_cat
df
You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
Anywhere above we passed a keyword ``dtype='category'``, we used the default behavior of

1. categories are inferred from the data
2. categories are unordered.

To control those behaviors, instead of passing ``'category'``, use an instance
of :class:`~pandas.api.types.CategoricalDtype`.

.. ipython:: python
s = pd.Series(["a","b","c","a"])
s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
from pandas.api.types import CategoricalDtype
s = pd.Series(["a", "b", "c", "a"])
cat_type = CategoricalDtype(categories=["b", "c", "d"],
ordered=True)
s_cat = s.astype(cat_type)
s_cat
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
Expand Down Expand Up @@ -133,6 +143,70 @@ constructor to save the factorize step during normal constructor mode:
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
.. _categorical.categoricaldtype:

CategoricalDtype
----------------

.. versionchanged:: 0.21.0

A categorical's type is fully described by

1. its categories: a sequence of unique values and no missing values
2. its orderedness: a boolean

This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
The ``categories`` argument is optional, which implies that the actual categories
should be inferred from whatever is present in the data when the
:class:`pandas.Categorical` is created.

.. ipython:: python
from pandas.api.types import CategoricalDtype
CategoricalDtype(['a', 'b', 'c'])
CategoricalDtype(['a', 'b', 'c'], ordered=True)
CategoricalDtype()
A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
expects a `dtype`. For example :func:`pandas.read_csv`,
:func:`pandas.DataFrame.astype`, or in the Series constructor.

As a convenience, you can use the string ``'category'`` in place of a
:class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
the categories being unordered, and equal to the set values present in the
array. In other words, ``dtype='category'`` is equivalent to
``dtype=CategoricalDtype()``.

Equality Semantics
~~~~~~~~~~~~~~~~~~

Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
whenever they have the same categories and orderedness. When comparing two
unordered categoricals, the order of the ``categories`` is not considered

.. ipython:: python
c1 = CategoricalDtype(['a', 'b', 'c'], ordered=False)
# Equal, since order is not considered when ordered=False
c1 == CategoricalDtype(['b', 'c', 'a'], ordered=False)
# Unequal, since the second CategoricalDtype is ordered
c1 == CategoricalDtype(['a', 'b', 'c'], ordered=True)
All instances of ``CategoricalDtype`` compare equal to the string ``'category'``

.. ipython:: python
c1 == 'category'
.. warning::

Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
and since all instances ``CategoricalDtype`` compare equal to ``'`category'``,
all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``

Description
-----------

Expand Down Expand Up @@ -182,7 +256,7 @@ It's also possible to pass in the categories in a specific order:

.. ipython:: python
s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))
s
# categories
Expand Down Expand Up @@ -295,7 +369,9 @@ meaning and certain operations are possible. If the categorical is unordered, ``
s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
s.sort_values(inplace=True)
s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
s = pd.Series(["a","b","c","a"]).astype(
CategoricalDtype(ordered=True)
)
s.sort_values(inplace=True)
s
s.min(), s.max()
Expand Down Expand Up @@ -395,9 +471,15 @@ categories or a categorical with any list-like object, will raise a TypeError.

.. ipython:: python
cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
cat = pd.Series([1,2,3]).astype(
CategoricalDtype([3, 2, 1], ordered=True)
)
cat_base = pd.Series([2,2,2]).astype(
CategoricalDtype([3, 2, 1], ordered=True)
)
cat_base2 = pd.Series([2,2,2]).astype(
CategoricalDtype(ordered=True)
)
cat
cat_base
Expand Down
11 changes: 8 additions & 3 deletions doc/source/merging.rst
Original file line number Diff line number Diff line change
Expand Up @@ -830,8 +830,10 @@ The left frame.

.. ipython:: python
from pandas.api.types import CategoricalDtype
X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
X = X.astype('category', categories=['foo', 'bar'])
X = X.astype(CategoricalDtype(categories=['foo', 'bar']))
left = pd.DataFrame({'X': X,
'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
Expand All @@ -842,8 +844,11 @@ The right frame.

.. ipython:: python
right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
'Z': [1, 2]})
right = pd.DataFrame({
'X': pd.Series(['foo', 'bar'],
dtype=CategoricalDtype(['foo', 'bar'])),
'Z': [1, 2]
})
right
right.dtypes
Expand Down
26 changes: 26 additions & 0 deletions doc/source/whatsnew/v0.21.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@ users upgrade to this version.
Highlights include:

- Integration with `Apache Parquet <https://parquet.apache.org/>`__, including a new top-level :func:`read_parquet` and :func:`DataFrame.to_parquet` method, see :ref:`here <io.parquet>`.
- New user-facing :class:`pandas.api.types.CategoricalDtype` for specifying
categoricals independent of the data (:issue:`14711`, :issue:`15078`)

Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.

Expand Down Expand Up @@ -88,6 +90,30 @@ This does not raise any obvious exceptions, but also does not create a new colum

Setting a list-like data structure into a new attribute now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.

.. _whatsnew_0210.enhancements.categorical_dtype:

``CategoricalDtype`` for specifying categoricals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`pandas.api.types.CategoricalDtype` has been added to the public API and
expanded to include the ``categories`` and ``ordered`` attributes. A
``CategoricalDtype`` can be used to specify the set of categories and
orderedness of an array, independent of the data themselves. This can be useful,
e.g., when converting string data to a ``Categorical``:

.. ipython:: python

from pandas.api.types import CategoricalDtype

s = pd.Series(['a', 'b', 'c', 'a']) # strings
dtype = CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
s.astype(dtype)

The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
``Series`` with categorical type will now return an instance of ``CategoricalDtype``.

See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.

.. _whatsnew_0210.enhancements.other:

Other Enhancements
Expand Down
Loading

0 comments on commit cd5f4e2

Please sign in to comment.