In [10]:
import pandas as pd 
import numpy as np

There are two ways to store text data in pandas:
1. object dtype
2. StringDtype extension type

Prior to pandas 1.0, object dtype was the only option. This was unfortunate for many reasons:

1. You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.

2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text but still object-dtype columns.

3. When reading code, the contents of an object dtype array are less clear than 'string'.
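The first two points can be seen in a small sketch. The column names and values are made up for illustration, and filtering on `isinstance(dtype, pd.StringDtype)` is only one possible way to pick out dedicated text columns:

```python
import pandas as pd

# Hypothetical frame: one dedicated text column, one object column that
# silently mixes a string with an integer, and one numeric column.
df = pd.DataFrame(
    {
        "name": pd.Series(["ann", "bob"], dtype="string"),
        "mixed": pd.Series(["x", 1], dtype="object"),
        "n": [1, 2],
    }
)

# object dtype accepts the non-string value without complaint
element_types = [type(v).__name__ for v in df["mixed"]]

# StringDtype columns can be identified unambiguously by their dtype
string_cols = [c for c in df.columns if isinstance(df[c].dtype, pd.StringDtype)]

print(element_types)
print(string_cols)
```

Here the object column reports both `str` and `int` elements, while only the `name` column is recognised as text.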

In [3]:
pd.Series(["a", "b", "c"])

0    a
1    b
2    c
dtype: object

In [4]:
# To explicitly request string dtype, specify the dtype
pd.Series(["a", "b", "c"], dtype="string")

0    a
1    b
2    c
dtype: string

In [5]:
pd.Series(["a", "b", "c"], dtype=pd.StringDtype())

0    a
1    b
2    c
dtype: string

In [7]:
# Or astype after the Series or DataFrame is created
s = pd.Series(["a", "b", "c"])
print(s)
s.astype("string")

0    a
1    b
2    c
dtype: object


0    a
1    b
2    c
dtype: string

In [11]:
# You can also use StringDtype/"string" as the dtype on non-string data and it will be converted to string dtype:
s = pd.Series(["a", 2, np.nan], dtype="string")
s

0       a
1       2
2    <NA>
dtype: string

## Behavior differences between object and StringDtype

For StringDtype, string accessor methods that return numeric output will always return a nullable integer dtype, rather than either int or float dtype, depending on the presence of NA values. Methods returning boolean output will return a nullable boolean dtype.

In [12]:
s = pd.Series(["a", None, "b"], dtype="string")

In [13]:
s.str.count("a")

0       1
1    <NA>
2       0
dtype: Int64

In [14]:
s.dropna().str.count("a")

0    1
2    0
dtype: Int64

In [20]:
# Both previous outputs are Int64 dtype. Compare that with object-dtype
s2 = pd.Series(["a", None, "b"], dtype="object")

In [21]:
s2.str.count("a")

0    1.0
1    NaN
2    0.0
dtype: float64

In [22]:
s2.dropna().str.count("a")

0    1
2    0
dtype: int64

In [25]:
# With object dtype, the output is float64 when NA values are present and int64 otherwise.
# For StringDtype, methods returning boolean values likewise always return the nullable boolean dtype:
print(s.str.isdigit(), '\n')
print(s.str.match("a"))

0    False
1     <NA>
2    False
dtype: boolean 

0     True
1     <NA>
2    False
dtype: boolean


Some string methods, like Series.str.decode(), are not available on StringArray because StringArray only holds strings, not bytes.

In comparison operations, arrays.StringArray and Series backed by a StringArray will return an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will propagate in comparison operations, rather than always comparing unequal like numpy.nan.
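A minimal sketch of the comparison behavior described above, using two small Series that differ only in dtype (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(["a", None, "b"], dtype="string")
o = pd.Series(["a", None, "b"], dtype="object")

# StringDtype: result has the nullable boolean dtype and the missing value propagates
eq_string = s == "a"

# object dtype: result is plain bool and the missing value compares unequal
eq_object = o == "a"

print(eq_string)
print(eq_object)
```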

# String Methods

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:

In [26]:
s = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
)

In [27]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

In [28]:
s.str.upper()

0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [29]:
s.str.len()

0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64
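To illustrate what skipping missing values means in practice, the sketch below contrasts a plain Python list comprehension with the `.str` accessor on the same (made-up) data. The scalar method fails on the missing value, while the accessor passes it through:

```python
import numpy as np
import pandas as pd

raw = ["Aaba", np.nan, "dog"]

# Plain Python: np.nan is a float, so calling .lower() on it fails
try:
    [x.lower() for x in raw]
    plain_python_failed = False
except AttributeError:
    plain_python_failed = True

# The .str accessor leaves the missing value in place instead of raising
lowered = pd.Series(raw, dtype="string").str.lower()

print(plain_python_failed)
print(lowered)
```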

In [33]:
idx = pd.Index([" jack", "jill ", " jesse ", "frank"])
print(
    idx.str.strip(), '\n',
    idx.str.lstrip(), '\n',
    idx.str.rstrip(), '\n'
)

Index(['jack', 'jill', 'jesse', 'frank'], dtype='object') 
 Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object') 
 Index([' jack', 'jill', ' jesse', 'frank'], dtype='object') 



The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace:

In [39]:
df = pd.DataFrame(
    np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
)

In [35]:
df

    Column A    Column B 
0   -1.597467    1.434516
1    0.239225   -0.084869
2   -0.406492    0.687180


In [37]:
# Since df.columns is an Index object, we can use the .str accessor
print(
    df.columns.str.strip(), '\n',
    df.columns.str.lower()
)

Index(['Column A', 'Column B'], dtype='object') 
 Index([' column a ', ' column b '], dtype='object')


In [40]:
# These string methods can then be used to clean up the columns as needed. Here we remove leading and trailing 
# whitespace, lowercase all names, and replace any remaining whitespace with underscores:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df

   column_a  column_b
0 -0.689758 -0.509611
1 -0.439441  0.351222
2 -1.227683  0.386365


# Splitting and replacing Strings

In [41]:
s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")

In [42]:
s2.str.split("_")

0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

In [44]:
# Elements in the split lists can be accessed using get or [] notation:
print(
    s2.str.split("_").str.get(1), '\n\n',
    s2.str.split("_").str[1]
)

0       b
1       d
2    <NA>
3       g
dtype: object 

 0       b
1       d
2    <NA>
3       g
dtype: object


In [46]:
# It is easy to expand this to return a DataFrame using expand.
s2.str.split("_", expand=True)

      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h


In [47]:
# When the original Series has StringDtype, the output columns will all be StringDtype as well.
# It is also possible to limit the number of splits:
s2.str.split("_", expand=True, n=1)

      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h


In [48]:
# rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning 
# of the string:
s2.str.rsplit("_", expand=True, n=1)

      0     1
0   a_b     c
1   c_d     e
2  <NA>  <NA>
3   f_g     h


In [57]:
# replace optionally uses regular expressions:
s3 = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"],
    dtype="string"
)
print(s3, '\n')
print(s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True))

0       A
1       B
2       C
3    Aaba
4    Baca
5        
6    <NA>
7    CABA
8     dog
9     cat
dtype: string 

0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6        <NA>
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: string


In [59]:
# If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter 
# to False, rather than escaping each character. In this case both pat and repl must be strings:
dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")
print(
    dollars.str.replace(r"-\$", "-", regex=True), '\n\n',
    dollars.str.replace("-$", "-", regex=False)
)

0         12
1        -10
2    $10,000
dtype: string 

 0         12
1        -10
2    $10,000
dtype: string


In [61]:
# The replace method can also take a callable as replacement. It is called on every match of pat using re.sub(). 
# The callable should expect one positional argument (a regex match object) and return a string.
pat = r"[a-z]+"
def repl(m):
    return m.group(0)[::-1]
print(
    pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(
        pat, repl, regex=True
    ),
    '\n\n'
)


# Using regex groups
pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
def repl(m):
    return m.group("two").swapcase()
print(
    pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(
        pat, repl, regex=True
    )
)

0    oof 123
1    rab zab
2       <NA>
dtype: string 


0     bAR
1    <NA>
dtype: string


In [62]:
# The replace method also accepts a compiled regular expression object from re.compile() as a pattern. 
# All flags should be included in the compiled regular expression object.
import re
regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)
s3.str.replace(regex_pat, "XX-XX ", regex=True)

0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6        <NA>
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: string
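The note that all flags belong in the compiled object can be checked with a short sketch. The two-element Series here is made up, and the example relies on pandas raising a `ValueError` when `flags` is passed alongside a compiled pattern:

```python
import re

import pandas as pd

s = pd.Series(["Aaba", "dog"], dtype="string")
compiled = re.compile(r"^.a|dog", flags=re.IGNORECASE)

# Flags carried by the compiled pattern are honoured
ok = s.str.replace(compiled, "XX-XX ", regex=True)

# Passing flags again next to a compiled pattern is rejected
try:
    s.str.replace(compiled, "XX-XX ", flags=re.IGNORECASE, regex=True)
    raised = False
except ValueError:
    raised = True

print(ok)
print(raised)
```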

# Concatenation

There are several ways to concatenate a Series or Index, either with itself or with others, all based on Series.str.cat() or, for an Index, Index.str.cat().

In [64]:
s = pd.Series(["a", "b", "c", "d"], dtype="string")
s.str.cat(sep=",")

'a,b,c,d'

In [65]:
s.str.cat()

'abcd'

In [67]:
t = pd.Series(["a", "b", np.nan, "d"], dtype="string")
t.str.cat(sep=",")

'a,b,d'

In [68]:
t.str.cat(sep=",", na_rep="-")

'a,b,-,d'

## Concatenating a Series and something list-like into a Series

The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or Index).

In [69]:
s.str.cat(["A", "B", "C", "D"])

0    aA
1    bB
2    cC
3    dD
dtype: string

In [70]:
s.str.cat(t)

0      aa
1      bb
2    <NA>
3      dd
dtype: string
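The length requirement for list-like arguments can be sketched as follows. The short list is deliberately the wrong size, and the example assumes pandas reports the mismatch as a `ValueError`:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="string")

# Same length as s: concatenated element-wise, with an optional separator
same_length = s.str.cat(["A", "B", "C", "D"], sep="-")

# A list without an index cannot be aligned, so a shorter one is rejected
try:
    s.str.cat(["A", "B"])
    error = None
except ValueError as exc:
    error = exc

print(same_length)
print(type(error).__name__)
```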

## Concatenating a Series and something array-like into a Series

The parameter others can also be two-dimensional. In this case, the number of rows must match the length of the calling Series (or Index).

In [74]:
d = pd.concat([t, s], axis=1)
print(
    s, '\n\n',
    d
)

0    a
1    b
2    c
3    d
dtype: string 

       0  1
0     a  a
1     b  b
2  <NA>  c
3     d  d


In [75]:
s.str.cat(d, na_rep="-")

0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

## Concatenating a Series and an indexed object into a Series, with alignment

For concatenation with a Series or DataFrame, it is possible to align the indexes before concatenation by setting the join keyword.

In [77]:
u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string")
print(
    s, '\n\n',
    u
)

0    a
1    b
2    c
3    d
dtype: string 

 1    b
3    d
0    a
2    c
dtype: string


In [78]:
print(
    s.str.cat(u), '\n\n',
    s.str.cat(u, join="left")
)

0    aa
1    bb
2    cc
3    dd
dtype: string 

 0    aa
1    bb
2    cc
3    dd
dtype: string


The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular, alignment also means that the two objects no longer need to have the same length.

In [79]:
v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")
print(
    s, '\n\n',
    v
)

0    a
1    b
2    c
3    d
dtype: string 

 -1    z
 0    a
 1    b
 3    d
 4    e
dtype: string


In [80]:
print(
    s.str.cat(v, join="left", na_rep="-"), '\n\n',
    s.str.cat(v, join="outer", na_rep="-")
)

0    aa
1    bb
2    c-
3    dd
dtype: string 

 -1    -z
 0    aa
 1    bb
 2    c-
 3    dd
 4    -e
dtype: string
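For completeness, here is a sketch of the two remaining join options, 'inner' and 'right', on Series with the same values and indexes as s and v above (redefined so the block runs on its own):

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="string")
v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")

# inner: keep only the labels present in both indexes
inner = s.str.cat(v, join="inner")

# right: keep the labels of v; labels missing from s are filled with na_rep
right = s.str.cat(v, join="right", na_rep="-")

print(inner)
print(right)
```

With 'inner' only the labels 0, 1 and 3 survive, while 'right' keeps all five labels of v.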


In [81]:
# The same alignment can be used when others is a DataFrame:
f = d.loc[[3, 2, 1, 0], :]
print(
    s, '\n\n',
    f
)

0    a
1    b
2    c
3    d
dtype: string 

       0  1
3     d  d
2  <NA>  c
1     b  b
0     a  a


In [82]:
s.str.cat(f, join="left", na_rep="-")

0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

## Concatenating a Series and many objects into a Series

In [83]:
print(
    s, '\n\n',
    u, '\n\n'
)
s.str.cat([u, u.to_numpy()], join="left")

0    a
1    b
2    c
3    d
dtype: string 

 1    b
3    d
0    a
2    c
dtype: string 




0    aab
1    bbd
2    cca
3    ddc
dtype: string

# Indexing with `.str`

In [88]:
s = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
)

In [90]:
s.str[0]

0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

In [89]:
s.str[1]

0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string

# Extracting substrings

The extract method accepts a regular expression with at least one capture group.
Extracting a regular expression with more than one group returns a DataFrame with one column per group.

In [91]:
pd.Series(
    ["a1", "b2", "c3"],
    dtype="string"
).str.extract(r"([ab])(\d)", expand=False)

      0     1
0     a     1
1     b     2
2  <NA>  <NA>


Elements that do not match return a row filled with missing values (NaN for object dtype, <NA> for StringDtype). Thus, a Series of messy strings can be “converted” into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples or re.match objects. For object-dtype input, the dtype of the result is always object, even if no match is found and the result only contains NaN.

In [92]:
pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(
    r"(?P<letter>[ab])(?P<digit>\d)", expand=False
)

  letter digit
0      a     1
1      b     2
2   <NA>  <NA>


In [93]:
pd.Series(
    ["a1", "b2", "3"],
    dtype="string",
).str.extract(r"([ab])?(\d)", expand=False)

      0  1
0     a  1
1     b  2
2  <NA>  3
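The examples above all use more than one capture group, so the result is a DataFrame whatever the value of expand. With a single group the expand parameter decides the return type, as this small sketch shows (the pattern is a one-group variant of the one used above):

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], dtype="string")

# One capture group, expand=False: the result is a Series
as_series = s.str.extract(r"[ab](\d)", expand=False)

# One capture group, expand=True: the result is a one-column DataFrame
as_frame = s.str.extract(r"[ab](\d)", expand=True)

print(type(as_series).__name__, type(as_frame).__name__)
print(as_series)
```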


## Extract all matches in each subject (extractall)

In [94]:
s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
print(s, '\n\n')

two_groups = "(?P<letter>[a-z])(?P<digit>[0-9])"
s.str.extract(two_groups, expand=True)

A    a1a2
B      b1
C      c1
dtype: string 




  letter digit
A      a     1
B      b     1
C      c     1


Unlike extract, which returns only the first match, the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.

In [95]:
s.str.extractall(two_groups)

        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1


In [96]:
# When each subject string in the Series has exactly one match then extractall(pat).xs(0, level='match') gives 
# the same result as extract(pat).
s = pd.Series(["a3", "b3", "c2"], dtype="string")
print(s, '\n\n')

extract_result = s.str.extract(two_groups, expand=True)
print(extract_result, '\n\n')

extractall_result = s.str.extractall(two_groups)
print(extractall_result, '\n\n')

extractall_result.xs(0, level="match")

0    a3
1    b3
2    c2
dtype: string 


  letter digit
0      a     3
1      b     3
2      c     2 


        letter digit
  match             
0 0          a     3
1 0          b     3
2 0          c     2 


  letter digit
0      a     3
1      b     3
2      c     2


Index also supports .str.extractall. It returns a DataFrame with the same result as Series.str.extractall on a Series with a default index (starting from 0).

In [97]:
pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups)

        letter digit
  match             
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1


In [98]:
pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups)

        letter digit
  match             
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1


# Testing for strings that match or contain a pattern

In [99]:
pattern = r"[0-9][a-z]"

In [100]:
pd.Series(
    ["1", "2", "3a", "3b", "03c", "4dx"],
    dtype="string",
).str.contains(pattern)

0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean

In [101]:
# Or whether elements match a pattern:
pd.Series(
    ["1", "2", "3a", "3b", "03c", "4dx"],
    dtype="string",
).str.match(pattern)

0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [103]:
pd.Series(
    ["1", "2", "3a", "3b", "03c", "4dx"],
    dtype="string",
).str.fullmatch(pattern)

0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean
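The three outputs above differ only in how strictly the pattern must line up with the string. One way to see this is to compare each method with the corresponding function from Python's re module. The sketch below is an illustration using a subset of the values above, not a description of how pandas implements these methods:

```python
import re

import pandas as pd

pattern = r"[0-9][a-z]"
values = ["1", "3a", "03c", "4dx"]
s = pd.Series(values, dtype="string")

via_pandas = {
    "contains": s.str.contains(pattern).tolist(),    # match anywhere in the string
    "match": s.str.match(pattern).tolist(),          # match must start at the first character
    "fullmatch": s.str.fullmatch(pattern).tolist(),  # match must cover the whole string
}

via_re = {
    "contains": [bool(re.search(pattern, v)) for v in values],
    "match": [bool(re.match(pattern, v)) for v in values],
    "fullmatch": [bool(re.fullmatch(pattern, v)) for v in values],
}

for name in via_pandas:
    print(name, via_pandas[name], via_re[name])
```

For these values the two columns agree row by row: "03c" is found by contains only, and "4dx" by contains and match but not fullmatch.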

Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False:

In [104]:
s4 = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
)

s4.str.contains("A", na=False)

0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: boolean

# Creating indicator variables

In [105]:
s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string")
s.str.get_dummies(sep="|")

   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1


String Index also supports get_dummies, which returns a MultiIndex.

In [106]:
idx = pd.Index(["a", "a|b", np.nan, "a|c"])
idx.str.get_dummies(sep="|")

MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])