# Pandas - String and Time Series

I will walk through some of Pandas' string operations and its tools for time-indexed data. Pandas provides a comprehensive set of vectorized string operations and time series tools, which are essential for the kind of munging required when working with real-world data.

## 1. Pandas String Operations

In [6]:
import numpy as np
import pandas as pd

data = ['Peter Li', 'Paul Zhang', None, 'MARY Yi', 'gUIDO QI']

try:
    print([s.capitalize() for s in data])
except AttributeError:  # raised when calling .capitalize() on None
    print('NoneType object has no attribute capitalize')

NoneType object has no attribute capitalize


In [20]:
pd.Series(data).str.capitalize()

0      Peter li
1    Paul zhang
2          None
3       Mary yi
4      Guido qi
dtype: object
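To make the missing-value behavior explicit, here is a minimal sketch showing that the `.str` accessor passes a missing entry through instead of raising. The variable names `names` and `upper` are illustrative and not part of the original notebook.

```python
import pandas as pd

names = pd.Series(['Peter Li', 'Paul Zhang', None, 'MARY Yi', 'gUIDO QI'])

# Vectorized string methods skip missing entries rather than raising.
upper = names.str.upper()
print(upper)

# The entry that was missing before the call is still missing afterwards.
print(upper.isna().tolist())
```

Checking `isna()` afterwards is a quick way to confirm that no missing values were silently converted into strings.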

### 1.1. Python string methods

Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. The following Pandas `str` methods mirror Python string methods:

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    |
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    |
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  |
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  |
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      |
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     |
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  |
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

In [25]:
data = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam', 'Eric Idle', 'Terry Jones', 'Michael Palin'])
data

0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object

In [24]:
data.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [26]:
data.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
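As a further sketch of the mirrored methods, the snippet below applies `lower()`, `startswith()`, and `split()` to the same names. The variable names (`monte`, `parts`) and the column labels `first` and `last` are illustrative choices; `expand=True` asks `split()` to return a DataFrame with one column per piece instead of a Series of lists.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Element-wise lowercase, like str.lower() applied to every entry.
lowered = monte.str.lower()

# Boolean Series: True where the name starts with 'T'.
starts_with_t = monte.str.startswith('T')

# Split on whitespace and spread the pieces across columns.
parts = monte.str.split(expand=True)
parts.columns = ['first', 'last']
print(parts)
```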

### 1.2. Methods using regular expressions

In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in re module:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Call ``re.search()`` on each element, returning the captured groups of the first match as strings. |
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of a pattern with another string (pass ``regex=True`` to treat the pattern as a regular expression) |
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``; unlike ``split()``, current pandas treats the separator as a literal string |

In [29]:
data.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

In [30]:
data.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
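The other regular-expression methods in the table follow the same pattern. The sketch below, with illustrative variable names, tries `contains()`, `count()`, and `replace()`. It passes `regex=True` explicitly to `replace()` because the default for that flag has differed between pandas versions.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains(): True where the pattern matches anywhere in the string.
has_double_l = monte.str.contains(r'll')

# count(): number of non-overlapping matches per element (here, vowels).
vowel_counts = monte.str.count(r'[aeiouAEIOU]')

# replace(): substitute every run of whitespace with an underscore.
underscored = monte.str.replace(r'\s+', '_', regex=True)

print(has_double_l.tolist())
print(vowel_counts.tolist())
print(underscored.tolist())
```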

### 1.3. Miscellaneous methods

| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element|
| ``slice_replace()`` | Replace slice in each element with passed value|
| ``cat()``      | Concatenate strings|
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode form of string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings|
| ``wrap()`` | Split long strings into lines with length less than a given width|
| ``join()`` | Join strings in each element of the Series with passed separator|
| ``get_dummies()`` | Extract dummy variables as a DataFrame |

The `get()` and `slice()` operations, in particular, enable vectorized access to parts of each element.

Note: `data.str.slice(0, i)` is equivalent to `data.str[0:i]`, and `data.str.get(i)` is likewise equivalent to `data.str[i]`.

In [31]:
data.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object
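The equivalence between the method form and the bracket form can be checked directly. This is a small sketch with illustrative variable names; `Series.equals()` compares two Series element by element.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# slice(0, 3) and [0:3] both take the first three characters of each element.
same_slice = monte.str.slice(0, 3).equals(monte.str[0:3])

# get(0) and [0] both take the first character of each element.
same_get = monte.str.get(0).equals(monte.str[0])

print(same_slice, same_get)
```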

The `get()` and `slice()` methods also let you access elements of the lists returned by `split()`. For example, to extract the last name of each entry, we can combine `split()` and `get()`:

In [37]:
data.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
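The `get_dummies()` method listed in the table above is handy when a column packs several coded indicators into one string. The sketch below uses made-up data: the `flags` Series and its single-letter codes are hypothetical, chosen only to illustrate the call.

```python
import pandas as pd

# Hypothetical data: each entry holds several single-letter codes separated by '|'.
flags = pd.Series(['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D'])

# Split on '|' and build one 0/1 indicator column per distinct code.
dummies = flags.str.get_dummies('|')
print(dummies)
```

Each row of the result marks which codes were present in the corresponding string, so the output can be fed straight into filtering or aggregation.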