# Pandas - String and Time Series

This notebook walks through Pandas string operations and time-indexed data. Pandas provides a comprehensive set of vectorized string operations and time series tools, both of which are essential for the kind of munging required when working with real-world data.

## 1. Pandas String Operations

In [78]:
import numpy as np
import pandas as pd

data = ['Peter Li', 'Paul Zhang', None, 'MARY Yi', 'gUIDO QI']

try:
    print([s.capitalize() for s in data])
except AttributeError:  # None has no string methods
    print('NoneType object has no attribute capitalize')

NoneType object has no attribute capitalize


A plain list comprehension fails on the `None` entry, whereas the Series `.str` methods skip missing values and process the rest:

In [79]:
pd.Series(data).str.capitalize()

0      Peter li
1    Paul zhang
2          None
3       Mary yi
4      Guido qi
dtype: object
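
To make the missing-value behavior concrete, here is a minimal sketch using a hypothetical three-element Series named `names` (not part of the original data). It assumes only that `.str` methods propagate missing entries instead of raising, and it shows the guard a plain list comprehension would need to do the same.

```python
import pandas as pd

# Hypothetical sample data with one missing entry
raw = ['Peter Li', None, 'MARY Yi']
names = pd.Series(raw)

# The vectorized method leaves the missing entry missing
upper = names.str.upper()
missing_mask = upper.isna()

# The plain-Python equivalent needs an explicit guard for None
upper_list = [s.upper() if s is not None else None for s in raw]

print(upper)
print(missing_mask.tolist())
print(upper_list)
```

The boolean mask from `isna()` can then be used to drop or fill the missing rows before further processing.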

### 1.1. Python string methods

Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. The following Pandas `str` methods mirror Python string methods:

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    |
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    |
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  |
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  |
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      |
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     |
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  |
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

In [80]:
data = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam', 'Eric Idle', 'Terry Jones', 'Michael Palin'])
data

0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object

In [81]:
data.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [82]:
data.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
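
Because each `.str` method returns a new Series, calls can be chained. The sketch below is illustrative only: it uses a shortened, hypothetical copy of the names and combines `lower()`, `startswith()`, `len()`, and `zfill()` from the table above.

```python
import pandas as pd

# Shortened copy of the example names (assumed for illustration)
names = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

# Case-insensitive prefix test: lower-case first, then check the prefix
starts_with_t = names.str.lower().str.startswith('t')

# Boolean results can be used directly as a filter
t_names = names[starts_with_t]

# Numeric results can be converted back to strings and processed further
padded_lengths = names.str.len().astype(str).str.zfill(3)

print(t_names.tolist())
print(padded_lengths.tolist())
```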

### 1.2. Methods using regular expressions

In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in re module:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean |
| ``extract()`` | Call ``re.search()`` on each element, returning the groups of the first match as strings |
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string |
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern |
| ``split()`` | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |

In [83]:
data.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

In [88]:
data.str.contains('John')

0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool

In [30]:
data.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
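
The remaining regex methods from the table can be sketched in the same way. The example below is a hedged illustration on a shortened, hypothetical copy of the names; the group names `first` and `last` are chosen here for readability and do not come from the original notebook.

```python
import pandas as pd

# Shortened copy of the example names (assumed for illustration)
names = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

# Named groups become column names when expand=True
parts = names.str.extract(r'(?P<first>\w+)\s+(?P<last>\w+)', expand=True)

# Count lower-case vowels in each name
vowel_counts = names.str.count(r'[aeiou]')

# Reorder to "Last, First" using backreferences; regex=True is passed explicitly
swapped = names.str.replace(r'(\w+)\s+(\w+)', r'\2, \1', regex=True)

print(parts)
print(vowel_counts.tolist())
print(swapped.tolist())
```

Passing `regex=True` explicitly avoids relying on the default, which has differed between Pandas versions.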

### 1.3. Miscellaneous methods

| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element|
| ``slice_replace()`` | Replace slice in each element with passed value|
| ``cat()``      | Concatenate strings|
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode form of string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings|
| ``wrap()`` | Split long strings into lines with length less than a given width|
| ``join()`` | Join strings in each element of the Series with passed separator|
| ``get_dummies()`` | Extract dummy variables as a DataFrame |

The `get()` and `slice()` operations, in particular, enable vectorized element access from each string.

Note: for a Series `s`, `s.str.slice(0, i)` is equivalent to `s.str[0:i]`, and `s.str.get(i)` is equivalent to `s.str[i]`.

In [31]:
data.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

The `get()` and `slice()` methods also let you access elements of the lists returned by `split()`. For example, to extract the last name of each entry, we can combine `split()` and `get()`:

In [37]:
data.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
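
A few other methods from the table, `get_dummies()`, `join()`, and `cat()`, are sketched below. The pipe-delimited `codes` Series is hypothetical sample data invented for this illustration.

```python
import pandas as pd

# Hypothetical pipe-delimited category codes
codes = pd.Series(['B|C|D', 'B|D', 'A|C'])

# One indicator column per distinct code, split on the '|' separator
dummies = codes.str.get_dummies('|')

# Shortened copy of the example names (assumed for illustration)
names = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

# join() glues the list elements produced by split() back together
underscored = names.str.split().str.join('_')

# cat() with only a separator collapses the whole Series into one string
all_names = names.str.cat(sep='; ')

print(dummies)
print(underscored.tolist())
print(all_names)
```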

## 2. Dates and Times in Python

The Python world has a number of available representations of dates, times, deltas, and timespans. While the time series tools provided by Pandas tend to be the most useful for data science applications, it is helpful to see their relationship to other packages used in Python.

### 2.1. Native Python dates and times

Python's basic objects for working with dates and times reside in the built-in `datetime` module. It supports common operations on dates and times, such as construction, formatting, and arithmetic.

In [1]:
from datetime import datetime

time = datetime(year=2020, month=1, day=1)
time

datetime.datetime(2020, 1, 1, 0, 0)

In [3]:
# the day of the week
time.strftime('%A')

'Wednesday'

In [6]:
# current datetime
str(datetime.now())

'2020-06-21 11:35:15.387070'
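
Beyond construction and formatting, the standard library also handles parsing and arithmetic. The following is a small sketch using only `datetime` and `timedelta`; the specific dates are arbitrary examples.

```python
from datetime import datetime, timedelta

start = datetime(2020, 1, 1)

# Adding a timedelta shifts the datetime forward
later = start + timedelta(days=45, hours=6)

# strptime parses a string according to an explicit format
parsed = datetime.strptime('2020-02-15 06:00', '%Y-%m-%d %H:%M')

# Subtracting two datetimes yields a timedelta
elapsed = parsed - start

print(later)
print(elapsed.days, elapsed.seconds)
```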

### 2.2. NumPy arrays of times

NumPy provides a native time series data type, `datetime64`. The `datetime64` dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly. Because every element of a `datetime64` array has the same type, vectorized operations on it run much more quickly than the same operations on Python's `datetime` objects, especially as arrays get large. Each `datetime64` value carries a time unit, given by one of the following codes:

|Code    | Meaning     |
|--------|-------------|
| ``Y``  | Year       |
| ``M``  | Month       |
| ``W``  | Week       |
| ``D``  | Day         |
| ``h``  | Hour        |
| ``m``  | Minute      |
| ``s``  | Second      |
| ``ms`` | Millisecond |

In [24]:
time = np.array('2020-01-01', dtype=np.datetime64)
assert(time == np.datetime64('2020-01-01', 'D'))
time

array('2020-01-01', dtype='datetime64[D]')

In [23]:
time + np.arange(12)

array(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
       '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
       '2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12'],
      dtype='datetime64[D]')
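
The unit codes in the table determine the resolution of each value. As a minimal sketch (the dates are arbitrary), the unit can be inferred from the input string or given explicitly, and subtracting two dates gives a `timedelta64` in the same unit.

```python
import numpy as np

# The unit is inferred from the precision of the string
day = np.datetime64('2020-01-01')          # day resolution
minute = np.datetime64('2020-01-01T12:00')  # minute resolution

# The unit can also be requested explicitly
month = np.datetime64('2020-01', 'M')

# arange with a datetime64 dtype yields every day in January 2020
january = np.arange('2020-01', '2020-02', dtype='datetime64[D]')

# Differences are timedelta64 values (2020 is a leap year)
span = np.datetime64('2020-03-01') - np.datetime64('2020-01-01')

print(day.dtype, minute.dtype, month.dtype)
print(len(january), span)
```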

## 3. Pandas Time Series Data Structures

`Pandas` builds upon all the tools just discussed to provide a Timestamp object, which combines the ease-of-use of datetime and dateutil with the efficient storage and vectorized interface of `numpy.datetime64`. From a group of these Timestamp objects, Pandas can construct a DatetimeIndex that can be used to index data in a `Series` or `DataFrame`.

* For time stamps, Pandas provides the Timestamp type. As mentioned before, it is essentially a replacement for Python's native datetime, but is based on the **more efficient** `numpy.datetime64` data type. The associated Index structure is DatetimeIndex.
* For time periods, Pandas provides the Period type. This encodes a fixed-frequency interval based on `numpy.datetime64`. The associated index structure is PeriodIndex.
* For time deltas or durations, Pandas provides the Timedelta type. Timedelta is a **more efficient** replacement for Python's native `datetime.timedelta` type, and is based on `numpy.timedelta64`. The associated index structure is TimedeltaIndex.
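
The three scalar types listed above can be sketched side by side. The values below are arbitrary examples chosen for illustration.

```python
import pandas as pd

stamp = pd.Timestamp('2020-01-01 09:30')   # a single point in time
period = pd.Period('2020-01', freq='M')    # the whole month of January 2020
delta = pd.Timedelta(days=1, hours=12)     # a duration of 36 hours

# A Timestamp plus a Timedelta is another Timestamp
shifted = stamp + delta

# A Period has a start and an end, so a Timestamp can be tested against it
in_period = period.start_time <= stamp <= period.end_time

print(shifted)
print(period.start_time, period.end_time)
print(in_period)
```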

### 3.1. Pandas Time Series: Indexing by Time

The most fundamental of these date/time objects are the Timestamp and DatetimeIndex objects. While these class objects can be invoked directly, it is more common to use the `pd.to_datetime()` function, which can parse a wide variety of formats. Passing a single date to `pd.to_datetime()` yields a Timestamp; passing a series of dates by default yields a DatetimeIndex.

In [51]:
# create timestamp
pd.to_datetime('1th of Jan, 2020')

Timestamp('2020-01-01 00:00:00')

In [61]:
# create timestamp index
index = pd.to_datetime(['1th of Jan, 2020', '2020-Feb-2', '03-03-2020', '20200404'])
index

DatetimeIndex(['2020-01-01', '2020-02-02', '2020-03-03', '2020-04-04'], dtype='datetime64[ns]', freq=None)

In [64]:
# create pandas series with timestamp index
data = pd.Series(range(len(index)), index=index)
data

2020-01-01    0
2020-02-02    1
2020-03-03    2
2020-04-04    3
dtype: int64

In [57]:
# find matches
data['2020-02']

2020-02-02    1
dtype: int64
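
Partial date strings also work as slice endpoints. The sketch below rebuilds the same Series from ISO-formatted strings (an assumption made here so the block is self-contained) and uses `.loc` for label-based selection.

```python
import pandas as pd

index = pd.to_datetime(['2020-01-01', '2020-02-02', '2020-03-03', '2020-04-04'])
data = pd.Series(range(len(index)), index=index)

# Both endpoints of a partial-string slice are inclusive:
# this selects everything from January through the end of March
first_quarter = data.loc['2020-01':'2020-03']

# A year alone selects every entry in that year
whole_year = data.loc['2020']

print(first_quarter)
print(len(whole_year))
```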

In [74]:
# convert the DatetimeIndex to an Index of formatted strings
index.strftime('%Y-%m-%d')

Index(['2020-01-01', '2020-02-02', '2020-03-03', '2020-04-04'], dtype='object')

Any DatetimeIndex can be converted to a PeriodIndex with the `to_period()` method, given a frequency code; here we'll use `'D'` to indicate daily frequency:

In [66]:
period = index.to_period('D')
period

PeriodIndex(['2020-01-01', '2020-02-02', '2020-03-03', '2020-04-04'], dtype='period[D]', freq='D')

In [67]:
period - period[0]

Index([<0 * Days>, <32 * Days>, <62 * Days>, <94 * Days>], dtype='object')
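
A coarser frequency code groups several timestamps into the same period. As a hedged sketch with hypothetical dates, converting to monthly periods and back with `to_timestamp()` looks like this:

```python
import pandas as pd

# Hypothetical dates, two of which fall in the same month
stamps = pd.to_datetime(['2020-01-15', '2020-01-20', '2020-02-02'])

# Monthly periods: both January dates map to the same period
months = stamps.to_period('M')
month_labels = [str(p) for p in months]

# Converting back gives the start of each period by default
month_starts = months.to_timestamp()

print(month_labels)
print(month_starts)
```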

### 3.2. Pandas Regular sequences

To make the creation of regular date sequences more convenient, Pandas offers a few functions for this purpose: `pd.date_range()` for timestamps, `pd.period_range()` for periods, and `pd.timedelta_range()` for time deltas. `pd.date_range()` accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates. By default, the frequency is one day.

In [68]:
pd.date_range('2020-01-01', '2020-01-10')

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')

Alternatively, the date range can be specified not with a start and endpoint, but with a startpoint and a number of periods:

In [70]:
pd.date_range('2020-01-01', periods = 10)

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')

The spacing can be modified by altering the `freq` argument, which defaults to `D`. For example, here we will construct a range of hourly timestamps:

In [71]:
pd.date_range('2020-01-01', periods=8, freq='H')

DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 01:00:00',
               '2020-01-01 02:00:00', '2020-01-01 03:00:00',
               '2020-01-01 04:00:00', '2020-01-01 05:00:00',
               '2020-01-01 06:00:00', '2020-01-01 07:00:00'],
              dtype='datetime64[ns]', freq='H')
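
Frequency codes can also carry a multiplier or refer to business days. The following sketch uses the `'B'` (business day) and `'3D'` (every third day) codes; the date ranges are arbitrary examples.

```python
import pandas as pd

# Business days only: weekends are skipped (2020-01-01 is a Wednesday)
business_days = pd.date_range('2020-01-01', periods=5, freq='B')

# A multiplier in front of the code changes the step size
every_third_day = pd.date_range('2020-01-01', '2020-01-10', freq='3D')

print(business_days.strftime('%Y-%m-%d').tolist())
print(every_third_day.strftime('%Y-%m-%d').tolist())
```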

To create regular sequences of Period or Timedelta values, the very similar `pd.period_range()` and `pd.timedelta_range()` functions are useful.

In [75]:
pd.period_range('2020-01', periods=8, freq='M')

PeriodIndex(['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06',
             '2020-07', '2020-08'],
            dtype='period[M]', freq='M')

To create a sequence of durations increasing by 2 hours and 30 minutes:

In [77]:
pd.timedelta_range(0, periods=5, freq='2H30T')

TimedeltaIndex(['00:00:00', '02:30:00', '05:00:00', '07:30:00', '10:00:00'], dtype='timedelta64[ns]', freq='150T')
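
A TimedeltaIndex is often combined with a fixed starting Timestamp to produce a schedule. The sketch below is an illustration under that assumption; it writes the 150-minute step as `'150min'`, and the start date is arbitrary.

```python
import pandas as pd

# Five durations, 2 hours 30 minutes apart, starting from zero
offsets = pd.timedelta_range(start='0 days', periods=5, freq='150min')

# Adding the durations to one Timestamp yields a DatetimeIndex
day_start = pd.Timestamp('2020-01-01')
schedule = day_start + offsets

print(offsets)
print(schedule)
```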