# Based on: [Working with text data](https://pandas.pydata.org/docs/user_guide/text.html)

In [1]:
import pandas as pd
import numpy as np

# 1. Text Data Types

There are currently two ways to store text data in *pandas*:

- `object` dtype: a NumPy array of Python objects, which can hold strings alongside values of any other type.
- `StringDtype`: an extension type. Recommended, but still experimental.


In [2]:
pd.Series(list("abcde"))

0    a
1    b
2    c
3    d
4    e
dtype: object

In [3]:
pd.Series(list("abcde"), dtype="string")

0    a
1    b
2    c
3    d
4    e
dtype: string
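
The difference between the two storage options can be sketched as follows. This is a minimal illustration, not taken from the original notebook; the variable names `mixed` and `text` are made up for the example.

```python
import pandas as pd

# An object-dtype Series accepts arbitrary Python objects, not only strings.
mixed = pd.Series(["a", 1, None], dtype="object")

# A "string"-dtype Series is dedicated to text; missing entries are stored as NA.
text = pd.Series(["a", "b", None], dtype="string")

print(mixed.dtype)
print(text.dtype)
print(text.isna().tolist())
```

Requesting `dtype="string"` explicitly makes the intent clear and keeps non-text values from slipping in unnoticed.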

# 2. String Methods

`Series` and `Index` have string processing methods accessible via the `str` attribute that make it easy to operate on each of their elements. 

These methods generally have names matching the equivalent (scalar) built-in string methods.

Missing/NA values are skipped automatically: they stay missing in the result.

In [4]:
s = pd.Series(["AAaa", "BbbB", "ccCC", np.nan, None], dtype="string")
s.str.lower()

0    aaaa
1    bbbb
2    cccc
3    <NA>
4    <NA>
dtype: string

In [5]:
s.str.swapcase()

0    aaAA
1    bBBb
2    CCcc
3    <NA>
4    <NA>
dtype: string

In [6]:
s.str.len()

0       4
1       4
2       4
3    <NA>
4    <NA>
dtype: Int64

In [7]:
idx = pd.Index([" One ", "Two ", "  Three   "])
idx

Index([' One ', 'Two ', '  Three   '], dtype='object')

In [8]:
idx.str.strip()

Index(['One', 'Two', 'Three'], dtype='object')

> Since `df.columns` is an `Index` object, we can use the `.str` accessor.
>
> This is very handy for cleaning up column labels as needed.

In [9]:
df = pd.DataFrame(
    np.random.randn(3, 3), columns=[" Column A ", " COLUMN B ", "column c"]
)
df.columns = df.columns.str.lower().str.strip().str.replace(" ", "_")
df

|    |  column_a |  column_b |  column_c |
|---:|----------:|----------:|----------:|
|  0 |  0.423263 | -2.583964 | -0.790346 |
|  1 | -1.360317 | -0.427899 | -1.195781 |
|  2 |  0.840359 |  0.116849 | -1.767832 |


> **NOTE:** For Series having lots of repeated elements, it may be faster to convert to `category` dtype first, since the string operations are done on the `.categories` and not on each element of the Series.
>
> However, only a limited number of the `.str` methods will work e.g. you can’t add strings to each other.

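As a small sketch of the note above, a Series with repeated values can be converted to `category` before calling a `.str` method. The data and the names `colors`, `as_category`, and `upper` are illustrative assumptions; whether this is actually faster depends on how many repeats the data contains.

```python
import pandas as pd

# Many repeated values: only two distinct strings.
colors = pd.Series(["red", "blue", "red", "red", "blue"], dtype="string")

# Convert to category, then apply a string method through the .str accessor.
as_category = colors.astype("category")
upper = as_category.str.upper()

print(upper.tolist())
```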

# 3. Splitting Strings

## 3.1 split

`split` returns a `Series` of lists. You can limit the number of splits using the `n` argument.

Elements in the split lists can be accessed using `get` or `[]` notation.

The list output can be returned as a `DataFrame` using the `expand` argument.

In [10]:
s1 = pd.Series(["a-b-c", "c-d-e", np.nan, "f-g-h"], dtype="string")
s1.str.split("-", n=1)  # returns a Series of lists

0    [a, b-c]
1    [c, d-e]
2        <NA>
3    [f, g-h]
dtype: object

In [11]:
s1.str.split("-", n=1, expand=True)  # returns a DataFrame

|    | 0    | 1    |
|---:|:-----|:-----|
|  0 | a    | b-c  |
|  1 | c    | d-e  |
|  2 | <NA> | <NA> |
|  3 | f    | g-h  |


In [12]:
s2 = pd.Series(["a b c", "c d e", np.nan, "f g h"], dtype="string")
s2.str.split().str.get(0)

0       a
1       c
2    <NA>
3       f
dtype: object

In [13]:
s2.str.split().str[-1]

0       c
1       e
2    <NA>
3       h
dtype: object

In [14]:
s1.str.split(expand=True)

|    | 0     |
|---:|:------|
|  0 | a-b-c |
|  1 | c-d-e |
|  2 | <NA>  |
|  3 | f-g-h |

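A common use of `expand=True` is to turn one text column into several named columns. The following is a minimal sketch with made-up data; the names `names`, `parts`, `first`, and `rest` are assumptions chosen for illustration.

```python
import pandas as pd

names = pd.Series(["Ada Lovelace", "Alan Mathison Turing", None], dtype="string")

# Split on the first space only, and expand the pieces into DataFrame columns.
parts = names.str.split(" ", n=1, expand=True)
parts.columns = ["first", "rest"]

print(parts)
```

Because `n=1` limits the split to one cut, everything after the first space stays together in the second column.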

## 3.2 rsplit

`rsplit` is similar to `split`, but it works in the *reverse direction* (from the end of the string to its beginning).

In [15]:
s1.str.rsplit("-", n=1)

0    [a-b, c]
1    [c-d, e]
2        <NA>
3    [f-g, h]
dtype: object
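
Splitting from the right is handy when only the last separator matters, for example when separating a file name from its extension. This is a sketch with hypothetical file names; `files` and `parts` are illustrative names.

```python
import pandas as pd

files = pd.Series(["archive.tar.gz", "notes.txt", "README"], dtype="string")

# Cut once, starting from the right, and expand into two columns.
parts = files.str.rsplit(".", n=1, expand=True)

print(parts)
```

The entry without a dot yields a single piece, so its second column is missing.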

# 4. Replacing Strings

`replace` can use regular expressions or literal strings, depending on the `regex` argument.

In [16]:
s3 = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
)
s3.str.replace("^A|cat", "*", regex=True)

0       *
1       B
2       C
3    *aba
4    Baca
5    <NA>
6    CABA
7     dog
8       *
dtype: string

In [17]:
dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")
dollars.str.replace("$", "", regex=False)  # literal match; as a regex, $ would match the end of the string

0        12
1       -10
2    10,000
dtype: string

In [18]:
dollars.str.replace(r"\$", "", regex=True)  # $ is escaped

0        12
1       -10
2    10,000
dtype: string
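
Building on the `dollars` example, one possible next step is to strip both the currency symbol and the thousands separator and then convert the text to numbers. This is a sketch, not part of the original notebook; the names `cleaned` and `amounts` are assumptions.

```python
import pandas as pd

dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")

# Inside a character class, $ is a literal character, so no escaping is needed.
cleaned = dollars.str.replace(r"[$,]", "", regex=True)

# Convert the cleaned text to a numeric dtype.
amounts = pd.to_numeric(cleaned)

print(cleaned.tolist())
print(amounts.tolist())
```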

# 5. Concatenation

## 5.1 Concatenating a `Series` into a string

You can specify the separator `sep`.

Missing values are ignored by default, but you can give them a representation with `na_rep`.

In [19]:
s = pd.Series(list("abcd") + [None], dtype="string")
s.str.cat()

'abcd'

In [20]:
s.str.cat(sep=", ", na_rep="?")

'a, b, c, d, ?'

## 5.2 Concatenating a `Series` and something list-like 

The list-like object must match the length of the calling `Series` (or `Index`).

Missing values on either side will result in missing values in the result as well, unless `na_rep` is specified.

In [21]:
s.str.cat(list("zyxwv"))

0      az
1      by
2      cx
3      dw
4    <NA>
dtype: string

In [22]:
s.str.cat([None, "B", "C", None, "E"], na_rep="?")

0    a?
1    bB
2    cC
3    d?
4    ?E
dtype: string

## 5.3 Concatenating a `Series` and something array-like

The number of rows of the array-like object must match the length of the calling `Series` (or `Index`).

In [23]:
arr = np.array([["b", "c"], ["c", "d"], ["d", "e"], ["e", "f"], ["f", "g"]])
arr

array([['b', 'c'],
       ['c', 'd'],
       ['d', 'e'],
       ['e', 'f'],
       ['f', 'g']], dtype='<U1')

In [24]:
s.str.cat(arr, na_rep="_")

0    abc
1    bcd
2    cde
3    def
4    _fg
dtype: string

## 5.4 Concatenating a `Series` and an indexed object into a Series, with alignment

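You can align indices before concatenation by setting the `join` keyword.

`join` can be `inner`, `outer`, `left`, or `right`; the default is `left`. To skip alignment, pass the other object's underlying values (for example via `.to_numpy()`), in which case the lengths must match.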

In [25]:
t = pd.Series([2, 3, 1, 4, 0], index=[2, 3, 1, 4, 0]).astype("str")
t

2    2
3    3
1    1
4    4
0    0
dtype: object

In [26]:
s.str.cat(t, na_rep="_")  # join='left' is default

0    a0
1    b1
2    c2
3    d3
4    _4
dtype: string

In [27]:
s.str.cat(t, join="right")  # right index is used

2      c2
3      d3
1      b1
4    <NA>
0      a0
dtype: string
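
With alignment, the two objects no longer need to have the same length. The sketch below contrasts `join="outer"` and `join="inner"` on small made-up Series; the names `outer` and `inner` are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="string")
t = pd.Series(["x", "y"], index=[3, 5], dtype="string")

# outer: union of both indexes; gaps on either side are filled with na_rep.
outer = s.str.cat(t, join="outer", na_rep="-")

# inner: only the labels present in both indexes survive.
inner = s.str.cat(t, join="inner")

print(outer)
print(inner)
```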

## 5.5 Concatenating a `Series` and many objects

Several array-like items (specifically: `Series`, `Index`, and 1-dimensional variants of `np.ndarray`) can be combined in a list-like container (including `iterator`s, `dict`-views, etc.).

All elements without an index (e.g. `np.ndarray`) within the passed list-like must match the length of the calling `Series`.

In [28]:
s.str.cat([t, t.to_numpy()], na_rep="_")

0    a02
1    b13
2    c21
3    d34
4    _40
dtype: string

# 6. Indexing

You can use `[]` notation to directly index by position.

Indices past the end of a string return `<NA>`.

In [29]:
s = pd.Series(["A", "Bb", "CcC", np.nan], dtype="string")
s.str[0]

0       A
1       B
2       C
3    <NA>
dtype: string

In [30]:
s.str[1]

0    <NA>
1       b
2       c
3    <NA>
dtype: string
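
The `[]` notation also accepts slices and negative positions. A short sketch using the same data as above (the names `first_two` and `last_char` are assumptions):

```python
import pandas as pd

s = pd.Series(["A", "Bb", "CcC", None], dtype="string")

# Slice: the first two characters of each string (shorter strings are kept whole).
first_two = s.str[:2]

# Negative index: the last character of each string.
last_char = s.str[-1]

print(first_two.tolist())
print(last_char.tolist())
```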

# 7. Extracting substrings

## 7.1 Extracting the first match in each subject (`extract`)

`extract` accepts a regular expression with at least one capture group, and returns the first match.

Extracting a regular expression with more than one group returns a DataFrame, with one column per group.

Any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.

Elements that do not match return a row filled with `<NA>`.

In [31]:
s = pd.Series(["a1", "b2", "c3"], dtype="string")
s.str.extract(r"(\d)")  # returns a DataFrame

|    |   0 |
|---:|----:|
|  0 |   1 |
|  1 |   2 |
|  2 |   3 |


In [32]:
s.str.extract(r"(\d)", expand=False)  # returns a Series

0    1
1    2
2    3
dtype: string

In [33]:
s.str.extract(r"([ab])(\d)")  # Uses capture group number for columns

|    | 0    | 1    |
|---:|:-----|:-----|
|  0 | a    | 1    |
|  1 | b    | 2    |
|  2 | <NA> | <NA> |


In [34]:
s.str.extract(r"(?P<letter>[a-z])(?P<digit>\d)")  # Uses capture group names for columns

|    | letter | digit |
|---:|:-------|:------|
|  0 | a      | 1     |
|  1 | b      | 2     |
|  2 | c      | 3     |


In [35]:
# With optional groups
s.str.extract(r"([ab])?(\d)")

|    | 0    | 1 |
|---:|:-----|:--|
|  0 | a    | 1 |
|  1 | b    | 2 |
|  2 | <NA> | 3 |

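Extracted groups are text, so a typical follow-up is to convert them to a numeric type. The sketch below uses named groups and an extra non-matching element to show the missing row; the data and the name `parts` are illustrative assumptions.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3", "d"], dtype="string")

# Named groups become column names; "d" has no digit, so its row is all missing.
parts = s.str.extract(r"(?P<letter>[a-z])(?P<digit>\d)")

# The digit column holds text; convert it to numbers.
parts["digit"] = pd.to_numeric(parts["digit"])

print(parts)
```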

## 7.2 Extracting all the matches in each subject (`extractall`)

`extractall` returns every match. The result is always a `DataFrame` with a `MultiIndex` on its rows. The last level of the `MultiIndex` is named `match` and indicates the order in the subject.

In [36]:
s = pd.Series(["a1a2", "bb2", "1c2"], index=["A", "B", "C"], dtype="string")
s.str.extractall(r"(?P<letter>[a-z])(?P<digit>[0-9])")

|    | match | letter | digit |
|:---|------:|:-------|:------|
| A  |     0 | a      | 1     |
| A  |     1 | a      | 2     |
| B  |     0 | b      | 2     |
| C  |     0 | c      | 2     |


In [37]:
pd.Index(s).str.extractall("(?P<letter>[a-z])(?P<digit>[0-9])")

|    | match | letter | digit |
|---:|------:|:-------|:------|
|  0 |     0 | a      | 1     |
|  0 |     1 | a      | 2     |
|  1 |     0 | b      | 2     |
|  2 |     0 | c      | 2     |

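Because `extractall` returns one row per match, subjects without any match do not appear in the result. One way to count matches per subject, sketched here with an extra non-matching element (`"zz"`) and the assumed names `matches` and `counts`:

```python
import pandas as pd

s = pd.Series(["a1a2", "bb2", "1c2", "zz"], index=list("ABCD"), dtype="string")
matches = s.str.extractall(r"(?P<letter>[a-z])(?P<digit>[0-9])")

# Group by the first index level (the original label) and count the rows;
# reindex so that subjects with no match show up with a count of 0.
counts = matches.groupby(level=0).size().reindex(s.index, fill_value=0)

print(counts)
```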

# 8. Testing for Strings that match or contain a pattern

## 8.1 Checking whether elements contain a pattern

`contains` tests whether there is a match for a regular expression *at any position* within the string.

In [38]:
s = pd.Series(["1", "2", "3a", "3bb", "03c"], dtype="string")
s.str.contains(r"[0-9][a-z]")

0    False
1    False
2     True
3     True
4     True
dtype: boolean

## 8.2 Checking whether elements match a pattern

`match` tests whether there is a match for a regular expression, but begins *at the first character* of the string.

`fullmatch` tests whether the *entire string* matches a regular expression.

In [39]:
s.str.match(r"[0-9][a-z]")

0    False
1    False
2     True
3     True
4    False
dtype: boolean

In [40]:
s.str.fullmatch(r"[0-9][a-z]")

0    False
1    False
2     True
3    False
4    False
dtype: boolean

> **NOTE:** Methods like `match`, `fullmatch`, `contains`, `startswith`, and `endswith` take an extra `na` argument so missing values can be considered `True` or `False`:

In [41]:
s = pd.Series(["A", "B", "Aaba", "BaA", np.nan, "dog"], dtype="string")
s.str.contains("A", na=False)

0     True
1    False
2     True
3     True
4    False
5    False
dtype: boolean
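
A frequent use of these tests is row selection. With `na=False` the mask contains no missing values, so it can be used directly for boolean indexing. A minimal sketch (the names `mask` and `selected` are assumptions):

```python
import pandas as pd

s = pd.Series(["A", "B", "Aaba", "BaA", None, "dog"], dtype="string")

# Treat missing values as "no match" so the mask is strictly True/False.
mask = s.str.contains("A", na=False)

# Keep only the elements that contain the pattern.
selected = s[mask]

print(selected.tolist())
```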

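# 9. Creating indicator variables

`get_dummies` splits each string on the separator `sep` and returns a `DataFrame` of indicator (dummy) variables, one column per distinct token. When called on an `Index`, the result is a `MultiIndex`.

In [42]:
s = pd.Series(["a", "a|b", np.nan, "a|c|a"], dtype="string")
s

0        a
1      a|b
2     <NA>
3    a|c|a
dtype: string

In [43]:
s.str.get_dummies(sep="|")

|    |   a |   b |   c |
|---:|----:|----:|----:|
|  0 |   1 |   0 |   0 |
|  1 |   1 |   1 |   0 |
|  2 |   0 |   0 |   0 |
|  3 |   1 |   0 |   1 |


In [44]:
idx = pd.Index(["a", "ab", np.nan, "ac"])
idx

Index(['a', 'ab', nan, 'ac'], dtype='object')

In [45]:
idx.str.get_dummies(sep="")

MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])

# 10. Method Summary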


|    | Method          | Description                                                                                                                   |
|---:|:----------------|:------------------------------------------------------------------------------------------------------------------------------|
|  0 | cat()           | Concatenate strings                                                                                                           |
|  1 | split()         | Split strings on delimiter                                                                                                    |
|  2 | rsplit()        | Split strings on delimiter working from the end of the string                                                                 |
|  3 | get()           | Index into each element (retrieve i-th element)                                                                               |
|  4 | join()          | Join strings in each element of the Series with passed separator                                                              |
|  5 | get_dummies()   | Split strings on the delimiter returning DataFrame of dummy variables                                                         |
|  6 | contains()      | Return boolean array indicating whether each string contains pattern/regex                                                    |
|  7 | replace()       | Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence     |
|  8 | removeprefix()  | Remove prefix from string, i.e. only remove if string starts with prefix.                                                     |
|  9 | removesuffix()  | Remove suffix from string, i.e. only remove if string ends with suffix.                                                       |
| 10 | repeat()        | Duplicate values (s.str.repeat(3) equivalent to x * 3)                                                                        |
| 11 | pad()           | Add whitespace to left, right, or both sides of strings                                                                       |
| 12 | center()        | Equivalent to str.center                                                                                                      |
| 13 | ljust()         | Equivalent to str.ljust                                                                                                       |
| 14 | rjust()         | Equivalent to str.rjust                                                                                                       |
| 15 | zfill()         | Equivalent to str.zfill                                                                                                       |
| 16 | wrap()          | Split long strings into lines with length less than a given width                                                             |
| 17 | slice()         | Slice each string in the Series                                                                                               |
| 18 | slice_replace() | Replace slice in each string with passed value                                                                                |
| 19 | count()         | Count occurrences of pattern                                                                                                  |
| 20 | startswith()    | Equivalent to str.startswith(pat) for each element                                                                            |
| 21 | endswith()      | Equivalent to str.endswith(pat) for each element                                                                              |
| 22 | findall()       | Compute list of all occurrences of pattern/regex for each string                                                              |
| 23 | match()         | Call re.match on each element, returning a boolean indicating whether it matches                                              |
| 24 | extract()       | Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group |
| 25 | extractall()    | Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group  |
| 26 | len()           | Compute string lengths                                                                                                        |
| 27 | strip()         | Equivalent to str.strip                                                                                                       |
| 28 | rstrip()        | Equivalent to str.rstrip                                                                                                      |
| 29 | lstrip()        | Equivalent to str.lstrip                                                                                                      |
| 30 | partition()     | Equivalent to str.partition                                                                                                   |
| 31 | rpartition()    | Equivalent to str.rpartition                                                                                                  |
| 32 | lower()         | Equivalent to str.lower                                                                                                       |
| 33 | casefold()      | Equivalent to str.casefold                                                                                                    |
| 34 | upper()         | Equivalent to str.upper                                                                                                       |
| 35 | find()          | Equivalent to str.find                                                                                                        |
| 36 | rfind()         | Equivalent to str.rfind                                                                                                       |
| 37 | index()         | Equivalent to str.index                                                                                                       |
| 38 | rindex()        | Equivalent to str.rindex                                                                                                      |
| 39 | capitalize()    | Equivalent to str.capitalize                                                                                                  |
| 40 | swapcase()      | Equivalent to str.swapcase                                                                                                    |
| 41 | normalize()     | Return Unicode normal form. Equivalent to unicodedata.normalize                                                               |
| 42 | translate()     | Equivalent to str.translate                                                                                                   |
| 43 | isalnum()       | Equivalent to str.isalnum                                                                                                     |
| 44 | isalpha()       | Equivalent to str.isalpha                                                                                                     |
| 45 | isdigit()       | Equivalent to str.isdigit                                                                                                     |
| 46 | isspace()       | Equivalent to str.isspace                                                                                                     |
| 47 | islower()       | Equivalent to str.islower                                                                                                     |
| 48 | isupper()       | Equivalent to str.isupper                                                                                                     |
| 49 | istitle()       | Equivalent to str.istitle                                                                                                     |
| 50 | isnumeric()     | Equivalent to str.isnumeric                                                                                                   |
| 51 | isdecimal()     | Equivalent to str.isdecimal                                                                                                   |
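
Methods from the table can be chained like any other `.str` call. The sketch below combines `removeprefix()` and `zfill()` to normalise some made-up identifiers; the data and the names `ids` and `padded` are assumptions, and `removeprefix()` is assumed to be available in the installed pandas version.

```python
import pandas as pd

ids = pd.Series(["id_7", "id_42", "13"], dtype="string")

# Drop the "id_" prefix where present, then left-pad with zeros to width 4.
padded = ids.str.removeprefix("id_").str.zfill(4)

print(padded.tolist())
```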
