## New Extension Arrays

pandas provides new (or revamped) extension arrays for nullable integer, boolean, and text data.

In [36]:
import pandas as pd
import numpy as np

### Text Data

NumPy doesn't have a good container for variable-width text data. Previously, pandas stored text data in `object`-dtype ndarrays.

In [37]:
s = pd.Series(['This', 'is', 'text', 'data'])
s

0    This
1      is
2    text
3    data
dtype: object

There are several issues here. For one, you could accidentally store a mix of text and non-text data.

In [38]:
pd.Series(['This', 'is', 1, 'text', 'data?'])

0     This
1       is
2        1
3     text
4    data?
dtype: object
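
Because the `object` label says nothing about what is actually stored, one way to spot an accidental mix is to inspect the values themselves. The sketch below uses `pandas.api.types.infer_dtype` for that; the helper `non_string_positions` is hypothetical and written only for this illustration, it is not a pandas function.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Explicit object dtype, matching the storage shown in the cells above.
clean = pd.Series(["This", "is", "text", "data"], dtype=object)
mixed = pd.Series(["This", "is", 1, "text", "data?"], dtype=object)

# infer_dtype looks at the values rather than the dtype label.
clean_kind = infer_dtype(clean)
mixed_kind = infer_dtype(mixed)
print(clean_kind, mixed_kind)


# Hypothetical helper (not part of pandas): positions of non-string values.
def non_string_positions(series):
    return [i for i, value in enumerate(series) if not isinstance(value, str)]


bad = non_string_positions(mixed)
print(bad)
```

Both Series report `dtype: object`, so a check like this is the only way to tell them apart without a dedicated text dtype.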

It also makes `dtype`-specific operations like `select_dtypes` awkward, because `object` matches any column of Python objects, not only text.

In [39]:
df = pd.DataFrame({"A": ['a', 'b', 'c'], "B": [1, '2', True], "C": [1, 2, 3]})
display(df)

print("Select text?")
display(df.select_dtypes(include=["object"]))

|   | A | B    | C |
|---|---|------|---|
| 0 | a | 1    | 1 |
| 1 | b | 2    | 2 |
| 2 | c | True | 3 |


Select text?


|   | A | B    |
|---|---|------|
| 0 | a | 1    |
| 1 | b | 2    |
| 2 | c | True |


To address these issues, pandas now has a dedicated `string` dtype.

In [40]:
text = pd.Series(['This', 'is', 'some', 'text'], dtype="string")
text

0    This
1      is
2    some
3    text
dtype: string

In [41]:
%xmode plain
pd.Series(['This', 'is', 1, 'text?'], dtype="string")

Exception reporting mode: Plain


ValueError: StringArray requires a sequence of strings or pandas.NA

In [42]:
df = pd.DataFrame({"A": [0, 1, 2],
                   "B": pd.array(['a', 'b', 'c'], dtype="string"),
                   "C": [0, 'a', 'b']})
df.select_dtypes(include="string")

|   | B |
|---|---|
| 0 | a |
| 1 | b |
| 2 | c |


### Boolean and Integer

Pandas now supports nullable boolean and integer data. Previously, introducing a missing value cast integer columns to float and boolean columns to object.

In [43]:
df = pd.DataFrame({"A": [0, 1], "B": [True, False]})
df

|   | A | B     |
|---|---|-------|
| 0 | 0 | True  |
| 1 | 1 | False |


In [44]:
df2 = df.reindex([0, 2, 1])
df2

|   | A   | B     |
|---|-----|-------|
| 0 | 0.0 | True  |
| 2 | NaN | NaN   |
| 1 | 1.0 | False |


In [45]:
df2.dtypes

A    float64
B     object
dtype: object

In [46]:
old = pd.DataFrame({
    "A": [1, None, 3],         # float
    "B": [True, None, False],  # object
    "C": ["a", None, "c"],     # object
})
old

|   | A   | B     | C    |
|---|-----|-------|------|
| 0 | 1.0 | True  | a    |
| 1 | NaN | None  | None |
| 2 | 3.0 | False | c    |


In [48]:
old.dtypes

A    float64
B     object
C     object
dtype: object

In [49]:
old.select_dtypes(include=["bool"])

0
1
2


The new, recommended approach is to use the "nullable" extension dtypes.

In [50]:
# IntegerArray
a = pd.array([1, None, 3], dtype="Int64")
a

<IntegerArray>
[1, <NA>, 3]
Length: 3, dtype: Int64
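
To make the contrast with the float fallback concrete, here is a minimal sketch comparing default inference with the `Int64` dtype on the same values. The variable names are illustrative only.

```python
import pandas as pd

as_float = pd.Series([1, None, 3])               # default inference
as_int = pd.Series([1, None, 3], dtype="Int64")  # nullable integer

print(as_float.dtype)
print(as_int.dtype)

# Arithmetic keeps the integer dtype and propagates the missing value.
shifted = as_int + 1
print(shifted)

# Reductions skip missing values by default.
total = as_int.sum()
print(total)
```

The missing entry stays missing after the addition, and the result is still `Int64` rather than a float column.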

In [51]:
# BooleanArray
b = pd.array([True, None, False], dtype="boolean")
b

<BooleanArray>
[True, <NA>, False]
Length: 3, dtype: boolean

In [52]:
# StringArray
c = pd.array(["a", None, "c"], dtype="string")
c

<StringArray>
['a', <NA>, 'c']
Length: 3, dtype: string

In [53]:
new = pd.DataFrame({
    "a": a, "b": b, "c": c
})
new

|   | a    | b     | c    |
|---|------|-------|------|
| 0 | 1    | True  | a    |
| 1 | <NA> | <NA>  | <NA> |
| 2 | 3    | False | c    |


In [54]:
new.dtypes

a      Int64
b    boolean
c     string
dtype: object
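
The earlier `reindex` example turned an integer column into `float64` and a boolean column into `object`. As a rough sketch, repeating it with nullable dtypes (the frame name `nullable` is just for illustration) shows the dtypes surviving the introduction of a missing row:

```python
import pandas as pd

nullable = pd.DataFrame({
    "A": pd.array([0, 1], dtype="Int64"),
    "B": pd.array([True, False], dtype="boolean"),
})

# Label 2 does not exist, so reindexing introduces a row of missing values.
reindexed = nullable.reindex([0, 2, 1])
print(reindexed)
print(reindexed.dtypes)
```

The new row is filled with `<NA>` and neither column changes dtype.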

In [55]:
new.select_dtypes(["boolean", "integer"])

|   | a    | b     |
|---|------|-------|
| 0 | 1    | True  |
| 1 | <NA> | <NA>  |
| 2 | 3    | False |


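These dtypes work well together: string methods and comparisons on nullable columns return nullable `boolean` results that propagate `<NA>`.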

In [56]:
new.c.str.startswith('a')

0     True
1     <NA>
2    False
Name: c, dtype: boolean

In [57]:
new.a == 1

0     True
1     <NA>
2    False
Name: a, dtype: boolean
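
A comparison result like this is often used as a filter, where the `<NA>` entries need a decision. One possible approach, sketched here with illustrative names, is to fill the mask explicitly before indexing so the treatment of unknown values is visible in the code:

```python
import pandas as pd

values = pd.Series([1, None, 3], dtype="Int64")

# The comparison is unknown where the input is missing.
mask = values == 1
print(mask)

# Treat unknown as "does not match".
keep_known = values[mask.fillna(False)]
print(keep_known)

# Treat unknown as "might match" and keep those rows as well.
keep_unknown = values[mask.fillna(True)]
print(keep_unknown)
```

Which fill value is appropriate depends on the analysis; the point is only that the nullable mask lets you choose.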

To opt into these dtypes for existing data, use `DataFrame.convert_dtypes`.

In [58]:
print("# old")
display(old.dtypes)
print("\n# converted")
display(old.convert_dtypes().dtypes)

# old


A    float64
B     object
C     object
dtype: object


# converted


A      Int64
B    boolean
C     string
dtype: object

In [59]:
old.convert_dtypes().dtypes

A      Int64
B    boolean
C     string
dtype: object
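
As a self-contained recap, the sketch below rebuilds the `old` frame from above and checks that `convert_dtypes` returns a new frame while leaving the original untouched. The name `converted` is an assumption made for this example.

```python
import pandas as pd

old = pd.DataFrame({
    "A": [1, None, 3],         # inferred as float
    "B": [True, None, False],  # inferred as object
    "C": ["a", None, "c"],     # text with a missing value
})

# convert_dtypes returns a new DataFrame; `old` is not modified.
converted = old.convert_dtypes()

print(old.dtypes)
print(converted.dtypes)

# Missing values are still missing after the conversion.
print(converted.isna().sum())
```

Assigning the result back (for example `old = old.convert_dtypes()`) is what actually switches a workflow over to the nullable dtypes.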