ENH: Parametrized CategoricalDtype

We extended CategoricalDtype to accept optional `categories` and `ordered` arguments.

```python
pd.CategoricalDtype(categories=['a', 'b'], ordered=True)
```

CategoricalDtype is now part of the public API. This allows users to specify the desired categories and orderedness of an operation ahead of time. The current behavior, which is still possible with categories=None (the default), is to infer the categories from whatever is present.

This change will make it easy to implement support for specifying categories that are known ahead of time in other places, e.g. .astype, .read_csv, and the Series constructor.

Closes pandas-dev#14711
Closes pandas-dev#15078
Closes pandas-dev#14676
TomAugspurger · Aug 25, 2017 · a7eb835
1 parent 1abaecb
Showing 21 changed files with 510 additions and 170 deletions.
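
A minimal sketch of the behavior the commit message describes, assuming a pandas version that includes this change (0.21.0 or later). The sample values and variable names are illustrative only and do not come from the commit:

```python
import pandas as pd

# Categories and orderedness are declared ahead of time.
dtype = pd.CategoricalDtype(categories=["a", "b"], ordered=True)

s = pd.Series(["a", "b", "a", "c"])

# Values outside the declared categories ("c" here) become missing.
declared = s.astype(dtype)

# The string form keeps the earlier behavior: categories are inferred
# from the data and the result is unordered.
inferred = s.astype("category")

print(declared.cat.categories.tolist(), declared.cat.ordered)
print(inferred.cat.categories.tolist(), inferred.cat.ordered)
```

Declaring the dtype up front means the same categories can be reused across several conversions, regardless of which values happen to appear in each one.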
diff --git a/doc/source/advanced.rst b/doc/source/advanced.rst
@@ -654,7 +654,7 @@ setting the index of a ``DataFrame/Series`` with a ``category`` dtype would conv
 
    df = pd.DataFrame({'A': np.arange(6),
                       'B': list('aabbca')})
-   df['B'] = df['B'].astype('category', categories=list('cab'))
+   df['B'] = df['B'].astype(pd.CategoricalDtype(list('cab')))
    df
    df.dtypes
    df.B.cat.categories

diff --git a/doc/source/categorical.rst b/doc/source/categorical.rst
@@ -96,12 +96,19 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
     df["B"] = raw_cat
     df
 
-You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to ``astype()``:
+Anywhere above that we passed the keyword ``dtype='category'``, we used the default behavior:
+
+1. categories are inferred from the data
+2. categories are unordered.
+
+To control those behaviors, instead of passing ``'category'``, use an instance
+of :class:`CategoricalDtype`.
 
 .. ipython:: python
 
-    s = pd.Series(["a","b","c","a"])
-    s_cat = s.astype("category", categories=["b","c","d"], ordered=False)
+    s = pd.Series(["a", "b", "c", "a"])
+    cat_type = pd.CategoricalDtype(categories=["b", "c", "d"], ordered=False)
+    s_cat = s.astype(cat_type)
     s_cat
 
 Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
@@ -140,6 +147,61 @@ constructor to save the factorize step during normal constructor mode:
     splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
     s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
 
+CategoricalDtype
+----------------
+
+.. versionchanged:: 0.21.0
+
+A categorical's type is fully described by 1.) its categories (an iterable with
+unique values and no missing values), and 2.) its orderedness (a boolean).
+This information can be stored in a :class:`~pandas.CategoricalDtype`.
+The ``categories`` argument is optional, which implies that the actual categories
+should be inferred from whatever is present in the data when the
+:class:`pandas.Categorical` is created.
+
+.. ipython:: python
+
+   pd.CategoricalDtype(['a', 'b', 'c'])
+   pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)
+   pd.CategoricalDtype()
+
+A :class:`~pandas.CategoricalDtype` can be used in any place pandas expects a
+``dtype``, for example :func:`pandas.read_csv`, :func:`pandas.DataFrame.astype`,
+or the Series constructor.
+
+As a convenience, you can use the string ``'category'`` in place of a
+:class:`pandas.CategoricalDtype` when you want the default behavior of
+the categories being unordered, and equal to the set of values present in the array.
+In other words, ``dtype='category'`` is equivalent to ``dtype=pd.CategoricalDtype()``.
+
+Equality Semantics
+~~~~~~~~~~~~~~~~~~
+
+Two instances of :class:`pandas.CategoricalDtype` compare equal whenever they have
+the same categories and orderedness. When comparing two unordered categoricals, the
+order of the ``categories`` is not considered.
+
+.. ipython:: python
+
+   c1 = pd.CategoricalDtype(['a', 'b', 'c'], ordered=False)
+   # Equal, since order is not considered when ordered=False
+   c1 == pd.CategoricalDtype(['b', 'c', 'a'], ordered=False)
+   # Unequal, since the second CategoricalDtype is ordered
+   c1 == pd.CategoricalDtype(['a',  'b', 'c'], ordered=True)
+
+All instances of ``CategoricalDtype`` compare equal to the string ``'category'``.
+
+.. ipython:: python
+
+   c1 == 'category'
+
+
+.. warning::
+
+   Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
+   and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
+   all instances of ``CategoricalDtype`` compare equal to a ``CategoricalDtype(None)``.
+
 Description
 -----------
 
@@ -189,7 +251,7 @@ It's also possible to pass in the categories in a specific order:
 
     .. ipython:: python
 
-         s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
+         s = pd.Series(list('babc')).astype(pd.CategoricalDtype(list('abcd')))
          s
 
          # categories
@@ -306,7 +368,7 @@ meaning and certain operations are possible. If the categorical is unordered, ``
 
     s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
     s.sort_values(inplace=True)
-    s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)
+    s = pd.Series(["a","b","c","a"]).astype(pd.CategoricalDtype(ordered=True))
     s.sort_values(inplace=True)
     s
     s.min(), s.max()
@@ -406,9 +468,9 @@ categories or a categorical with any list-like object, will raise a TypeError.
 
 .. ipython:: python
 
-    cat = pd.Series([1,2,3]).astype("category", categories=[3,2,1], ordered=True)
-    cat_base = pd.Series([2,2,2]).astype("category", categories=[3,2,1], ordered=True)
-    cat_base2 = pd.Series([2,2,2]).astype("category", ordered=True)
+    cat = pd.Series([1,2,3]).astype(pd.CategoricalDtype([3, 2, 1], ordered=True))
+    cat_base = pd.Series([2,2,2]).astype(pd.CategoricalDtype([3, 2, 1], ordered=True))
+    cat_base2 = pd.Series([2,2,2]).astype(pd.CategoricalDtype(ordered=True))
 
     cat
     cat_base
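
The equality rules added to `categorical.rst` above can be checked directly. This is a small sketch rather than part of the commit. The last comparison (two ordered dtypes whose categories are listed in a different order) is not spelled out in the docs text and is included here on the assumption that category order matters once `ordered=True`:

```python
import pandas as pd

c1 = pd.CategoricalDtype(["a", "b", "c"], ordered=False)

# Unordered dtypes ignore the order in which categories are listed.
same_unordered = c1 == pd.CategoricalDtype(["b", "c", "a"], ordered=False)

# Same categories, but one side is ordered: not equal.
mixed_ordered = c1 == pd.CategoricalDtype(["a", "b", "c"], ordered=True)

# Any CategoricalDtype compares equal to the string 'category'.
equals_string = c1 == "category"

# Assumption: for ordered dtypes the listed order is significant.
o1 = pd.CategoricalDtype(["a", "b", "c"], ordered=True)
o2 = pd.CategoricalDtype(["c", "b", "a"], ordered=True)
ordered_differ = o1 == o2

print(same_unordered, mixed_ordered, equals_string, ordered_differ)
```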

diff --git a/doc/source/merging.rst b/doc/source/merging.rst
@@ -831,7 +831,7 @@ The left frame.
 .. ipython:: python
 
    X = pd.Series(np.random.choice(['foo', 'bar'], size=(10,)))
-   X = X.astype('category', categories=['foo', 'bar'])
+   X = X.astype(pd.CategoricalDtype(categories=['foo', 'bar']))
 
    left = pd.DataFrame({'X': X,
                         'Y': np.random.choice(['one', 'two', 'three'], size=(10,))})
@@ -842,8 +842,10 @@ The right frame.
 
 .. ipython:: python
 
-   right = pd.DataFrame({'X': pd.Series(['foo', 'bar']).astype('category', categories=['foo', 'bar']),
-                         'Z': [1, 2]})
+   right = pd.DataFrame({
+        'X': pd.Series(['foo', 'bar'], dtype=pd.CategoricalDtype(['foo', 'bar'])),
+        'Z': [1, 2]
+   })
    right
    right.dtypes
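
The `merging.rst` change builds both key columns from the same categories. A hedged sketch of why that matters: when both frames share one `CategoricalDtype` instance for the key, the merged key column is expected to keep that dtype. The frame contents and the `key_dtype` name below are illustrative and not taken from the docs:

```python
import pandas as pd

# One dtype object shared by both key columns.
key_dtype = pd.CategoricalDtype(categories=["foo", "bar"])

left = pd.DataFrame({
    "X": pd.Series(["foo", "bar", "foo"], dtype=key_dtype),
    "Y": ["one", "two", "three"],
})
right = pd.DataFrame({
    "X": pd.Series(["foo", "bar"], dtype=key_dtype),
    "Z": [1, 2],
})

merged = pd.merge(left, right, on="X")
print(merged.dtypes)
```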
 

diff --git a/doc/source/whatsnew/v0.21.0.txt b/doc/source/whatsnew/v0.21.0.txt
@@ -22,6 +22,8 @@ Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations
 New features
 ~~~~~~~~~~~~
 
+- New user-facing :class:`CategoricalDtype` for specifying categoricals independently
+  of the data (:issue:`14711`, :issue:`15078`)
 - Support for `PEP 519 -- Adding a file system path protocol
   <https://www.python.org/dev/peps/pep-0519/>`_ on most readers and writers (:issue:`13823`)
 - Added ``__fspath__`` method to :class:`~pandas.HDFStore`, :class:`~pandas.ExcelFile`,
@@ -106,6 +108,28 @@ This does not permit that column to be accessed as an attribute:
 
 Both of these now raise a ``UserWarning`` about the potential for unexpected behavior. See :ref:`Attribute Access <indexing.attribute_access>`.
 
+.. _whatsnew_0210.enhancements.categorical_dtype:
+
+``CategoricalDtype`` for specifying categoricals
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`CategoricalDtype` has been added to the public API and expanded to
+include the ``categories`` and ``ordered`` attributes. A ``CategoricalDtype``
+can be used to specify the set of categories and orderedness of an array,
+independent of the data themselves. This can be useful, e.g., when converting
+string data to a ``Categorical``:
+
+.. ipython:: python
+
+   s = pd.Series(['a', 'b', 'c', 'a'])  # strings
+   dtype = pd.CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=True)
+   s.astype(dtype)
+
+The ``.dtype`` property of a ``Categorical``, ``CategoricalIndex`` or a
+``Series`` with categorical type will now return an instance of ``CategoricalDtype``.
+
+See :ref:`CategoricalDtype <categorical.categoricaldtype>` for more.
+
 .. _whatsnew_0210.enhancements.other:
 
 Other Enhancements

diff --git a/pandas/core/api.py b/pandas/core/api.py
@@ -6,6 +6,7 @@
 
 from pandas.core.algorithms import factorize, unique, value_counts
 from pandas.core.dtypes.missing import isna, isnull, notna, notnull
+from pandas.core.dtypes.dtypes import CategoricalDtype
 from pandas.core.categorical import Categorical
 from pandas.core.groupby import Grouper
 from pandas.io.formats.format import set_eng_float_format
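
With the import added to `pandas/core/api.py`, the class should be reachable from the top-level namespace. The sketch below assumes pandas 0.21.0 or later and checks that, together with the whatsnew claim that `.dtype` returns a `CategoricalDtype` instance for a `Categorical`, a `CategoricalIndex`, and a categorical `Series`. Variable names are illustrative:

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=["a", "b", "c", "d"], ordered=True)

s = pd.Series(["a", "b", "c", "a"]).astype(dtype)
cat = pd.Categorical(["a", "b"], categories=["a", "b"])
idx = pd.CategoricalIndex(["a", "b"])

# Each of these .dtype attributes is a CategoricalDtype instance.
dtypes = [s.dtype, cat.dtype, idx.dtype]
all_categorical = all(isinstance(d, pd.CategoricalDtype) for d in dtypes)

# The Series keeps the declared categories, including the unused "d".
print(all_categorical, s.dtype.categories.tolist(), s.dtype.ordered)
```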