# [Go to "Working with text data" in pandas docs](https://pandas.pydata.org/docs/user_guide/text.html)

In [1]:
import pandas as pd
import numpy as np

# 1. Text Data Types

There are two ways to store text data in pandas:

- `object` dtype: a NumPy array of Python objects, which can hold any type, not only strings.
- `StringDtype`: an extension type dedicated to text. Recommended, but currently considered experimental.
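
> As a rough illustration of the difference between the two storage options, the sketch below builds one Series of each kind. The variable names (`mixed`, `text`, `converted`) are made up for this example.

```python
import pandas as pd

# An object-dtype Series accepts arbitrary Python objects, not just strings.
mixed = pd.Series(['a', 1, 2.5])

# A string-dtype Series holds only strings and missing values.
text = pd.Series(['a', 'b', None], dtype='string')

# An existing Series can be converted with astype.
converted = pd.Series(['x', 'y']).astype('string')

print(mixed.dtype)
print(text.dtype)
print(text.isna().tolist())
```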


In [2]:
pd.Series(list('abcde'))

0    a
1    b
2    c
3    d
4    e
dtype: object

In [3]:
pd.Series(list('abcde'), dtype="string")

0    a
1    b
2    c
3    d
4    e
dtype: string

# 2. String Methods

>`Series` and `Index` have string processing methods that make it easy to operate on each of their elements (missing/NA values are automatically excluded).

>These are accessed via the `str` attribute, and generally have names matching the equivalent (scalar) built-in string methods.
 


In [4]:
s = pd.Series(['AAaa', 'BbbB', 'ccCC', np.nan, None], dtype="string")

In [5]:
s.str.lower()

0    aaaa
1    bbbb
2    cccc
3    <NA>
4    <NA>
dtype: string

In [6]:
s.str.title()

0    Aaaa
1    Bbbb
2    Cccc
3    <NA>
4    <NA>
dtype: string

In [7]:
s.str.len()

0       4
1       4
2       4
3    <NA>
4    <NA>
dtype: Int64

In [8]:
idx = pd.Index([' One ', 'Two ', '  Three   '])
idx

Index([' One ', 'Two ', '  Three   '], dtype='object')

In [9]:
idx.str.strip()

Index(['One', 'Two', 'Three'], dtype='object')

> Since `df.columns` is an `Index` object, we can use the `.str` accessor:

In [10]:
df = pd.DataFrame(np.random.randn(3, 3),
                  columns=[' Column A ', ' COLUMN B ', 'column c'])
df

    Column A    COLUMN B    column c
0   -0.361046   -0.333455  -2.437201
1    0.655642   -0.798077  -0.520721
2    0.137223   -0.325890  -0.448394


In [11]:
df.columns.str.title()

Index([' Column A ', ' Column B ', 'Column C'], dtype='object')

> This is very handy for cleaning up column labels as needed. Here's an example where several string methods are chained:

In [12]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df

   column_a  column_b  column_c
0 -0.361046 -0.333455 -2.437201
1  0.655642 -0.798077 -0.520721
2  0.137223 -0.325890 -0.448394


> **Note:** *When dealing with a `Series` having lots of repeated elements such that the number of unique elements is a lot smaller than the length of the `Series`, it can be faster to convert it to type `category` first, then use `.str.<method>` or `.dt.<property>`.*

>*The performance difference comes from the fact that for `Series` of type `category`, the string operations are done on the `.categories` and not on each element of the Series.*
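
> A minimal sketch of the note above, using made-up data with only three distinct values. It checks that the `category` route gives the same values as the plain route. It does not measure timing, which depends on the data and the pandas version.

```python
import pandas as pd

# Hypothetical data: 3 distinct values repeated many times.
s = pd.Series(['red', 'green', 'blue'] * 1000)

# Convert to category first; string methods then work on the categories.
cat = s.astype('category')

plain_result = s.str.upper()
cat_result = cat.str.upper()

print(len(cat.cat.categories))
print(plain_result.tolist() == cat_result.tolist())
```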

# 3. Splitting & Replacing Strings

### 3.1 split
> Methods like `split` return a `Series` of lists:

In [13]:
s1 = pd.Series(['a b c', 'c d e', np.nan, 'f g h'], dtype="string")
s1.str.split()

0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

In [14]:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
s2.str.split('_')

0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

> Elements in the split lists can be accessed using `get` or `[]` notation:

In [15]:
s1.str.split().str.get(0)

0       a
1       c
2    <NA>
3       f
dtype: object

In [16]:
s2.str.split('_').str[-1]

0       c
1       e
2    <NA>
3       h
dtype: object

>  The list output can be returned as a `DataFrame` using `expand`:

In [17]:
s2.str.split('_', expand=True)

      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h


In [18]:
s2.str.split('_', expand=True, n=1)  # limit the number of splits to 1

      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h


### 3.2 rsplit

> `rsplit` is similar to `split`, but it works in the *reverse direction* (from the end of the string to its beginning):

In [19]:
s2.str.rsplit('_', expand=True, n=1)

      0     1
0   a_b     c
1   c_d     e
2  <NA>  <NA>
3   f_g     h


### 3.3 Replace

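> `replace` treats the pattern as a regular expression by default in pandas versions before 2.0. From pandas 2.0 onward the default is `regex=False`, so pass `regex=True` explicitly when you want regex replacement.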

In [20]:
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'], dtype="string")
s3

0       A
1       B
2       C
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: string

In [21]:
s3.str.replace('^.a|dog', 'XX-XX ', case=False, regex=True)

0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5        <NA>
6    XX-XX BA
7      XX-XX 
8     XX-XX t
dtype: string

> If you want literal replacement of a string, you can set the optional `regex` parameter to `False`, or else **escape** each special character.

In [22]:
dollars = pd.Series(['12', '-$10', '$10,000'], dtype="string")

dollars.str.replace('$', '-')  # the single character '$' is matched literally, so every '$' becomes '-'

0         12
1       --10
2    -10,000
dtype: string

In [23]:
dollars.str.replace('-$', '-', regex=False)  # setting the regex parameter

0         12
1        -10
2    $10,000
dtype: string

In [24]:
dollars.str.replace(r'-\$', '-', regex=True).str.replace('$', '', regex=False)  # with escaping

0        12
1       -10
2    10,000
dtype: string
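
> Instead of escaping special characters by hand, the standard library's `re.escape` can build the escaped pattern. This is a sketch of that alternative, not something the notebook above uses; `pattern` and `cleaned` are names chosen for the example.

```python
import re
import pandas as pd

dollars = pd.Series(['12', '-$10', '$10,000'], dtype='string')

# re.escape backslash-escapes regex metacharacters such as '$'.
pattern = re.escape('-$')

# regex=True is passed explicitly so the behaviour does not depend on the default.
cleaned = dollars.str.replace(pattern, '-', regex=True)

print(cleaned.tolist())
```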

# 4. Concatenation

### 4.1 Concatenating a `Series` into a string

In [25]:
s = pd.Series(list('abcdef'), dtype="string")
s

0    a
1    b
2    c
3    d
4    e
5    f
dtype: string

> If not specified, the keyword `sep` for the separator defaults to the empty string, `sep=''`:

In [26]:
s.str.cat()

'abcdef'

In [27]:
s.str.cat(sep=',')

'a,b,c,d,e,f'

> By default, missing values are ignored. Using `na_rep`, they can be given a representation:

In [28]:
s = pd.Series(['a', 'b', None, np.nan, 'e'], dtype='string')
s

0       a
1       b
2    <NA>
3    <NA>
4       e
dtype: string

In [29]:
s.str.cat()

'abe'

In [30]:
s.str.cat(na_rep='*')

'ab**e'

### 4.2 Concatenating a `Series` and something list-like 

> The list-like object has to match the length of the calling `Series` (or `Index`):

> Missing values on either side will result in missing values in the result as well, unless `na_rep` is specified:

In [31]:
s.str.cat(['#']*5)

0      a#
1      b#
2    <NA>
3    <NA>
4      e#
dtype: string

In [32]:
s.str.cat(['#']*5, na_rep='!')

0    a#
1    b#
2    !#
3    !#
4    e#
dtype: string

### 4.3 Concatenating a `Series` and something array-like

> The number of rows of the array-like object must match the length of the calling `Series` (or `Index`).

In [33]:
_array = np.array([['a', 'b'], ['b', 'c'], ['c', 'd'], ['d', 'e'], ['e', 'f']])
_array

array([['a', 'b'],
       ['b', 'c'],
       ['c', 'd'],
       ['d', 'e'],
       ['e', 'f']], dtype='<U1')

In [34]:
s.str.cat(_array)

0     aab
1     bbc
2    <NA>
3    <NA>
4     eef
dtype: string

In [35]:
s.str.cat(_array, na_rep='-')

0    aab
1    bbc
2    -cd
3    -de
4    eef
dtype: string

### 4.4 Concatenating a `Series` and an indexed object into a Series, with alignment

> For concatenation with indexed objects (`Series` or `DataFrame`), it is possible to align the indexes before concatenation by setting the `join`-keyword.

> `join` can be `inner`, `outer`, `left`, or `right`. Setting `join=None` disables alignment.

In [36]:
t = pd.Series([2, 3, 1, 4, 0], index=[2, 3, 1, 4, 0]).astype('str')
t

2    2
3    3
1    1
4    4
0    0
dtype: object

In [37]:
s.str.cat(t, na_rep='_')  # join='left' is default

0    a0
1    b1
2    _2
3    _3
4    e4
dtype: string

In [38]:
s.str.cat(t, join='right') # right index is used

2    <NA>
3    <NA>
1      b1
4      e4
0      a0
dtype: string
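
> The cells above cover `join='left'` and `join='right'`. The sketch below shows `inner` and `outer` on two small, made-up Series whose indexes only partly overlap.

```python
import pandas as pd

# Hypothetical Series with partly overlapping indexes.
left = pd.Series(['a', 'b', 'c'], index=[0, 1, 2], dtype='string')
right = pd.Series(['x', 'y', 'z'], index=[1, 2, 3], dtype='string')

# inner: keep only labels present in both indexes.
inner = left.str.cat(right, join='inner')

# outer: keep the union of labels; na_rep fills the gaps on either side.
outer = left.str.cat(right, join='outer', na_rep='-')

print(inner.to_dict())
print(outer.to_dict())
```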

In [39]:
df = pd.concat([t, t, t], axis=1)
df

   0  1  2
2  2  2  2
3  3  3  3
1  1  1  1
4  4  4  4
0  0  0  0


In [40]:
s.str.cat(df, join='left', sep='_', na_rep='+')

0    a_0_0_0
1    b_1_1_1
2    +_2_2_2
3    +_3_3_3
4    e_4_4_4
dtype: string

### 4.5 Concatenating a `Series` and many objects

> Several array-like items (specifically: `Series`, `Index`, and 1-dimensional variants of `np.ndarray`) can be combined in a list-like container (including `iterator`s, `dict`-views, etc.).

In [41]:
s.str.cat([t, s], na_rep='_')

0    a0a
1    b1b
2    _2_
3    _3_
4    e4e
dtype: string
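
> The container of others does not have to be a list. As a small sketch with made-up data, a `dict` view works as well:

```python
import pandas as pd

base = pd.Series(['a', 'b'], dtype='string')

# Hypothetical dict of Series; only its values view is passed to cat.
parts = {
    'num': pd.Series(['1', '2'], dtype='string'),
    'sym': pd.Series(['x', 'y'], dtype='string'),
}

# The values are concatenated in the dict's insertion order.
combined = base.str.cat(parts.values(), sep='-')

print(combined.tolist())
```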

> All elements without an index (e.g. `np.ndarray`) within the passed list-like must match in length to the calling `Series`:

In [42]:
s.str.cat([t, t.to_numpy()], na_rep='_')

0    a02
1    b13
2    _21
3    _34
4    e40
dtype: string

> If using `join='right'` on a list-like of others that contains different indexes, the union of these indexes will be used as the basis for the final concatenation:

In [43]:
v = pd.Series(['x', 'y', 'z'], index=[-1, 3, 7])
v

-1    x
 3    y
 7    z
dtype: object

In [44]:
s.str.cat([t, t.to_numpy(), v], join='right', na_rep='_')

-1    ___x
 0    a02_
 1    b13_
 2    _21_
 3    _34y
 4    e40_
 7    ___z
dtype: string

# 5. Indexing with `.str`

> You can use `[]` notation to directly index by position:

In [45]:
s = pd.Series(['A', 'Bb', 'CcC', np.nan])
s

0      A
1     Bb
2    CcC
3    NaN
dtype: object

In [46]:
s.str[0]

0      A
1      B
2      C
3    NaN
dtype: object

> If you index past the end of the string, the result will be a `NaN`:

In [47]:
s.str[1]

0    NaN
1      b
2      c
3    NaN
dtype: object
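
> Slices work with the same `[]` notation. Unlike a single position past the end, a slice that runs past the end just returns the shorter string. A short sketch with made-up data:

```python
import pandas as pd

letters = pd.Series(['A', 'Bb', 'CcC', None], dtype='string')

# Take the first two characters of each element.
first_two = letters.str[:2]

# str.slice is the method form of the same operation.
same = letters.str.slice(0, 2)

print(first_two.tolist())
```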

# 6. Extracting substrings

### 6.1 Extracting the first match in each subject (`extract`)

> The `extract` method accepts a regular expression with at least one capture group. It returns the first match:

In [48]:
s = pd.Series(['a1', 'b2', 'c3'], dtype="string")
s

0    a1
1    b2
2    c3
dtype: string

In [49]:
s.str.extract(r'([ab])(\d)', expand=False) # expand=True always returns a dataframe

      0     1
0     a     1
1     b     2
2  <NA>  <NA>


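> Elements that do not match return a row filled with missing values. For `object`-dtype input the result dtype is `object` and missing entries are `NaN`, even if no match is found at all. For `string`-dtype input, as in these examples, the result keeps the `string` dtype and missing entries are shown as `<NA>`.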

In [50]:
# Named groups
# Capture-group names in the regular expression will be set as column names
s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)', expand=False)

  letter digit
0      a     1
1      b     2
2   <NA>  <NA>


In [51]:
# Optional groups
s.str.extract(r'([ab])?(\d)', expand=False)

      0  1
0     a  1
1     b  2
2  <NA>  3


> Extracting a regular expression with one group returns a `DataFrame` with one column if `expand=True`:

In [52]:
s.str.extract(r'[ab](\d)', expand=True)

      0
0     1
1     2
2  <NA>


> Extracting a regular expression with one group returns a `Series` if `expand=False`:

In [53]:
s.str.extract(r'[ab](\d)', expand=False)

0       1
1       2
2    <NA>
dtype: string

> Calling on an `Index` with a regex with **exactly one** capture group returns a `DataFrame` with one column if `expand=True`:

In [54]:
s = pd.Series(["a1", "b2", "c3"], index=["A11", "B22", "C33"], dtype="string")
s

A11    a1
B22    b2
C33    c3
dtype: string

In [55]:
s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)

  letter
0      A
1      B
2      C


> Calling on an `Index` with a regex with **exactly one** capture group returns an `Index` if `expand=False`:

In [56]:
s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)

Index(['A', 'B', 'C'], dtype='object', name='letter')

> Calling on an `Index` with a regex with **more than one** capture group returns a `DataFrame` if `expand=True`:

In [57]:
s.index.str.extract("(?P<letter>[a-zA-Z])(?P<digits>[0-9]+)", expand=True)

  letter digits
0      A     11
1      B     22
2      C     33


> Calling on an `Index` with a regex with **more than one** capture group raises a `ValueError` if `expand=False`:

In [58]:
try:
    s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
except ValueError as error_raised:
    print('ValueError:', error_raised)

ValueError: only one regex group is supported with Index


### 6.2 Extracting all the matches in each subject (`extractall`)

> The `extractall` method returns every match. The result is always a `DataFrame` with a `MultiIndex` on its rows.

> The last level of the `MultiIndex` is named `match` and indicates the order in the subject.

In [59]:
s = pd.Series(["a1a2", "b1", "2c1"], index=["A", "B", "C"], dtype="string")
s

A    a1a2
B      b1
C     2c1
dtype: string

In [60]:
s.str.extractall('(?P<letter>[a-z])(?P<digit>[0-9])')

        letter digit
  match
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1


> When each subject string in the `Series` has exactly one match, then
    
```python
extractall(pat).xs(0, level='match')
```
    
>gives the same result as `extract(pat)`:

In [61]:
s = pd.Series(['a3', 'b3', 'c2'], dtype="string")
s

0    a3
1    b3
2    c2
dtype: string

In [62]:
s.str.extractall('(?P<letter>[a-z])(?P<digit>[0-9])')

        letter digit
  match
0 0          a     3
1 0          b     3
2 0          c     2


In [63]:
s.str.extractall('(?P<letter>[a-z])(?P<digit>[0-9])').xs(0, level="match")

  letter digit
0      a     3
1      b     3
2      c     2


In [64]:
s.str.extract('(?P<letter>[a-z])(?P<digit>[0-9])', expand=True)

  letter digit
0      a     3
1      b     3
2      c     2


> `Index` also supports `.str.extractall`. It returns a `DataFrame` with the same result as `Series.str.extractall` called on a `Series` with a default index:

In [65]:
pd.Index(["a1a2", "b1", "c1"]).str.extractall('(?P<letter>[a-z])(?P<digit>[0-9])')

        letter digit
  match
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1


# 7. Testing for Strings that match or contain a pattern

### 7.1 Checking whether elements contain a pattern:

> `contains` tests whether there is a match of the regular expression at any position within the string.

In [66]:
s = pd.Series(['1', '2', '3a', '3b', '03c'], dtype="string")
s

0      1
1      2
2     3a
3     3b
4    03c
dtype: string

In [67]:
s.str.contains(r'[0-9][a-z]')

0    False
1    False
2     True
3     True
4     True
dtype: boolean

### 7.2 Checking whether elements match a pattern:

> `match` tests whether there is a match of the regular expression that begins at the first character of the string.

In [68]:
s.str.match(r'[0-9][a-z]')

0    False
1    False
2     True
3     True
4    False
dtype: boolean

>`fullmatch` tests whether the entire string matches the regular expression.

In [69]:
(pd.Series(['1', '2', '3a', '3b', '03c', '4dx'], dtype="string")
   .str.fullmatch(r'[0-9][a-z]'))

0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean
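
> One way to keep the three methods apart is to compare them with the standard library: `contains` behaves like `re.search`, `match` like `re.match`, and `fullmatch` like `re.fullmatch`. The sketch below checks this on the same values used above; the list names are chosen for the example.

```python
import re
import pandas as pd

pattern = r'[0-9][a-z]'
values = ['1', '2', '3a', '3b', '03c', '4dx']
s = pd.Series(values, dtype='string')

# Plain-Python equivalents, element by element.
search_re = [re.search(pattern, v) is not None for v in values]
match_re = [re.match(pattern, v) is not None for v in values]
fullmatch_re = [re.fullmatch(pattern, v) is not None for v in values]

print(s.str.contains(pattern).tolist() == search_re)
print(s.str.match(pattern).tolist() == match_re)
print(s.str.fullmatch(pattern).tolist() == fullmatch_re)
```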

> Methods like `match`, `fullmatch`, `contains`, `startswith`, and `endswith` take an extra `na` argument so missing values can be considered `True` or `False`:

In [70]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'], dtype="string")
s

0       A
1       B
2       C
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: string

In [71]:
s.str.contains('A', na=False) 

0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: boolean

# 8. Creating indicator variables

In [72]:
s = pd.Series(['a', 'a|b', np.nan, 'a|c|a'], dtype="string")
s

0        a
1      a|b
2     <NA>
3    a|c|a
dtype: string

In [73]:
s.str.get_dummies(sep='|')

   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1
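
> Indicator columns are usually joined back onto the frame they came from. The sketch below assumes a made-up frame with `id` and `tags` columns; the `tag_` prefix is only an illustrative naming choice.

```python
import pandas as pd

# Hypothetical frame with a '|'-separated tag column.
df_tags = pd.DataFrame({
    'id': [1, 2, 3],
    'tags': pd.array(['a', 'a|b', None], dtype='string'),
})

# One indicator column per distinct tag; missing rows become all zeros.
dummies = df_tags['tags'].str.get_dummies(sep='|')

# Prefix the new columns and attach them to the original frame.
result = df_tags.join(dummies.add_prefix('tag_'))

print(result.columns.tolist())
```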


In [74]:
idx = pd.Index(['a', 'ab', np.nan, 'ac'])
idx

Index(['a', 'ab', nan, 'ac'], dtype='object')

In [75]:
idx.str.get_dummies(sep='')

MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])

# 9. Method Summary

In [76]:
pd.read_html('https://pandas.pydata.org/docs/user_guide/text.html')[1]

          Method                                        Description
0          cat()                                Concatenate strings
1        split()                         Split strings on delimiter
2       rsplit()  Split strings on delimiter working from the en...
3          get()    Index into each element (retrieve i-th element)
4         join()  Join strings in each element of the Series wit...
5  get_dummies()  Split strings on the delimiter returning DataF...
6     contains()  Return boolean array if each string contains p...
7      replace()  Replace occurrences of pattern/regex/string wi...
8       repeat()  Duplicate values (s.str.repeat(3) equivalent t...
9          pad()  Add whitespace to left, right, or both sides o...
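
> Three of the listed methods (`join`, `repeat`, `pad`) are not demonstrated earlier in this notebook. A brief sketch on made-up data:

```python
import pandas as pd

words = pd.Series(['a', 'bb'], dtype='string')

# repeat: duplicate each string a fixed number of times.
repeated = words.str.repeat(2)

# pad: fill on the left with '*' up to a total width of 4.
padded = words.str.pad(4, side='left', fillchar='*')

# join: glue the lists produced by split back together with a separator.
joined = pd.Series(['x y', 'z']).str.split().str.join('-')

print(repeated.tolist(), padded.tolist(), joined.tolist())
```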
