# Basic Time Series Manipulation with Pandas

## Objectives

Original Tutorial: https://towardsdatascience.com/basic-time-series-manipulation-with-pandas-4432afee64ea

This basic introduction to time series data manipulation with pandas should allow you to get started in your time series analysis. Specific objectives are to show you how to:

-  create a date range
-  work with timestamp data
-  set a datetime index
-  convert string data to a timestamp
-  index and slice your time series data in a data frame
-  resample your time series for different time period aggregates/summary statistics
-  compute a rolling statistic such as a rolling average
-  work with missing data
-  understand the basics of unix/epoch time
-  understand common pitfalls of time series data analysis

In [80]:
#---Step 0: Import necessary libraries---

import numpy as np
import pandas as pd
from datetime import datetime

## Create a Date Range

With the libraries imported, we can create a date range. Here we build an hourly range that runs from January 1, 2018 to January 8, 2018.

In [81]:
# Create an hourly date range using the pd.date_range() function
# (lowercase "h" is the hourly alias; uppercase "H" is deprecated in recent pandas)
date_rng = pd.date_range(start="2018-01-01", end="2018-01-08", freq="h")

# Check type of first element
type(date_rng[0])

pandas._libs.tslibs.timestamps.Timestamp

In [82]:
# Convert to dataframe
date_rng_df = date_rng.to_frame()
date_rng_df.head()

| | 0 |
| --- | --- |
| 2018-01-01 00:00:00 | 2018-01-01 00:00:00 |
| 2018-01-01 01:00:00 | 2018-01-01 01:00:00 |
| 2018-01-01 02:00:00 | 2018-01-01 02:00:00 |
| 2018-01-01 03:00:00 | 2018-01-01 03:00:00 |
| 2018-01-01 04:00:00 | 2018-01-01 04:00:00 |
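
A date range can also be described by a start and a number of periods rather than an explicit end. The sketch below is an illustration, not part of the original tutorial; the name rng_by_periods is hypothetical. It rebuilds the same hourly range and shows why it has 169 entries.

```python
import pandas as pd

# Same hourly range as above, but specified with a start and a count
# instead of an end date. "rng_by_periods" is an illustrative name.
rng_by_periods = pd.date_range(start="2018-01-01", periods=169, freq="h")

# Both endpoints are included: 7 full days of hourly stamps (7 * 24)
# plus the final midnight on January 8 gives 169 entries.
print(len(rng_by_periods))
print(rng_by_periods[0], rng_by_periods[-1])
```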


## Work with Timestamp Data

Let’s create an example data frame with the timestamp data and look at the first five rows.

In [83]:
df = pd.DataFrame(date_rng, columns=["date"])

df["data"] = np.random.randint(0, 100, size=df.shape[0])

df.head()

| | date | data |
| --- | --- | --- |
| 0 | 2018-01-01 00:00:00 | 34 |
| 1 | 2018-01-01 01:00:00 | 29 |
| 2 | 2018-01-01 02:00:00 | 64 |
| 3 | 2018-01-01 03:00:00 | 38 |
| 4 | 2018-01-01 04:00:00 | 60 |


## Setting a Datetime Index

If we want to do time series manipulation, we’ll need a datetime index so that our data frame is indexed on the timestamp.

In [84]:
df["date"] = pd.to_datetime(df["date"])
df.columns = ["datetime", "data"]
df.set_index("datetime", inplace=True)

# Alternate method
# df['datetime'] = pd.to_datetime(df['date'])
# df = df.set_index('datetime')
# df.drop(['date'], axis=1, inplace=True)

df.dtypes

data    int32
dtype: object
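
Before slicing or resampling, it can help to confirm that the index really is a datetime index and that it is sorted. The following is a minimal sketch on a small hypothetical frame named demo (not the tutorial's df); it also shows set_index with reassignment as an alternative to inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame; "demo" and its values are illustrative only.
demo = pd.DataFrame({
    "datetime": pd.date_range("2018-01-01", periods=4, freq="h"),
    "data": np.arange(4),
})

# set_index returns a new frame, so reassigning avoids inplace=True.
demo = demo.set_index("datetime")

# Check the index type and ordering before doing time-based operations.
print(isinstance(demo.index, pd.DatetimeIndex))
print(demo.index.is_monotonic_increasing)
```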

## Convert String Data to a Timestamp

What if the timestamps in our data are stored as strings rather than as datetime values? Let’s convert our date_rng to a list of strings and then convert the strings back to timestamps.

In [85]:
# Create list of date strings using list comprehension
string_date_rng = [str(x) for x in date_rng]

string_date_rng

['2018-01-01 00:00:00',
 '2018-01-01 01:00:00',
 '2018-01-01 02:00:00',
 '2018-01-01 03:00:00',
 '2018-01-01 04:00:00',
 '2018-01-01 05:00:00',
 '2018-01-01 06:00:00',
 '2018-01-01 07:00:00',
 '2018-01-01 08:00:00',
 '2018-01-01 09:00:00',
 '2018-01-01 10:00:00',
 '2018-01-01 11:00:00',
 '2018-01-01 12:00:00',
 '2018-01-01 13:00:00',
 '2018-01-01 14:00:00',
 '2018-01-01 15:00:00',
 '2018-01-01 16:00:00',
 '2018-01-01 17:00:00',
 '2018-01-01 18:00:00',
 '2018-01-01 19:00:00',
 '2018-01-01 20:00:00',
 '2018-01-01 21:00:00',
 '2018-01-01 22:00:00',
 '2018-01-01 23:00:00',
 '2018-01-02 00:00:00',
 '2018-01-02 01:00:00',
 '2018-01-02 02:00:00',
 '2018-01-02 03:00:00',
 '2018-01-02 04:00:00',
 '2018-01-02 05:00:00',
 '2018-01-02 06:00:00',
 '2018-01-02 07:00:00',
 '2018-01-02 08:00:00',
 '2018-01-02 09:00:00',
 '2018-01-02 10:00:00',
 '2018-01-02 11:00:00',
 '2018-01-02 12:00:00',
 '2018-01-02 13:00:00',
 '2018-01-02 14:00:00',
 '2018-01-02 15:00:00',
 '2018-01-02 16:00:00',
 ...]

In [86]:
# Convert strings to timestamps; pd.to_datetime() infers the format on its own
# (the infer_datetime_format argument is deprecated in pandas 2.x and not needed)
timestamp_date_rng = pd.to_datetime(string_date_rng)

timestamp_date_rng

# Use same process as above to port timestamps into dataframe
# df2 = pd.DataFrame(timestamp_date_rng, columns=["date"])
# df2["data"] = np.random.randint(0, 100, size=df.shape[0])
# df2.dtypes

DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
               '2018-01-01 02:00:00', '2018-01-01 03:00:00',
               '2018-01-01 04:00:00', '2018-01-01 05:00:00',
               '2018-01-01 06:00:00', '2018-01-01 07:00:00',
               '2018-01-01 08:00:00', '2018-01-01 09:00:00',
               ...
               '2018-01-07 15:00:00', '2018-01-07 16:00:00',
               '2018-01-07 17:00:00', '2018-01-07 18:00:00',
               '2018-01-07 19:00:00', '2018-01-07 20:00:00',
               '2018-01-07 21:00:00', '2018-01-07 22:00:00',
               '2018-01-07 23:00:00', '2018-01-08 00:00:00'],
              dtype='datetime64[ns]', length=169, freq=None)

But what if we need to convert strings in a nonstandard format?

Let’s create an arbitrary list of date strings and convert them to timestamps by spelling out the format explicitly:

In [87]:
# Create list of arbitrary unique date strings
string_date_rng_2 = ['June-01-2018', 'June-02-2018', 'June-03-2018']

timestamp_date_rng_2 = [datetime.strptime(x, "%B-%d-%Y") for x in string_date_rng_2]

timestamp_date_rng_2

[datetime.datetime(2018, 6, 1, 0, 0),
 datetime.datetime(2018, 6, 2, 0, 0),
 datetime.datetime(2018, 6, 3, 0, 0)]

In [88]:
# Pass in timestamp_date_rng_2 into dataframe
df2 = pd.DataFrame(timestamp_date_rng_2, columns=["date"])

df2

| | date |
| --- | --- |
| 0 | 2018-06-01 |
| 1 | 2018-06-02 |
| 2 | 2018-06-03 |
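
As an alternative to the list comprehension with datetime.strptime, pd.to_datetime accepts the same format string through its format argument and converts the whole list in one call. A minimal sketch follows; the names string_dates, parsed, and frame are illustrative.

```python
import pandas as pd

string_dates = ["June-01-2018", "June-02-2018", "June-03-2018"]

# format uses the same strptime directives: %B full month name,
# %d day of month, %Y four-digit year.
parsed = pd.to_datetime(string_dates, format="%B-%d-%Y")

# The result is a DatetimeIndex, so the column gets a datetime64 dtype.
frame = pd.DataFrame({"date": parsed})
print(frame.dtypes)
```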


## Index and Slice Your Time Series Data in a DataFrame

Say we only want to see data where the date is the 2nd of the month. We can filter on the day attribute of the index, as shown below.

In [89]:
# Return rows where day is equal to second day of the month
df[df.index.day == 2]

# Alternate
# df.loc[df.index.day == 2, :]

| datetime | data |
| --- | --- |
| 2018-01-02 00:00:00 | 19 |
| 2018-01-02 01:00:00 | 28 |
| 2018-01-02 02:00:00 | 96 |
| 2018-01-02 03:00:00 | 97 |
| 2018-01-02 04:00:00 | 49 |
| 2018-01-02 05:00:00 | 9 |
| 2018-01-02 06:00:00 | 18 |
| 2018-01-02 07:00:00 | 73 |
| 2018-01-02 08:00:00 | 39 |
| 2018-01-02 09:00:00 | 77 |


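We could also select a single date directly via the index of the data frame. Use .loc here, since pandas 2.0 and later no longer accept a date string for row selection with plain df[...]:

In [90]:
df.loc["2018-01-03"]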

| datetime | data |
| --- | --- |
| 2018-01-03 00:00:00 | 3 |
| 2018-01-03 01:00:00 | 11 |
| 2018-01-03 02:00:00 | 80 |
| 2018-01-03 03:00:00 | 50 |
| 2018-01-03 04:00:00 | 69 |
| 2018-01-03 05:00:00 | 65 |
| 2018-01-03 06:00:00 | 33 |
| 2018-01-03 07:00:00 | 5 |
| 2018-01-03 08:00:00 | 92 |
| 2018-01-03 09:00:00 | 11 |


What about selecting data between certain dates?

In [91]:
df["2018-01-04":"2018-01-06"]

# Alternate
# df.loc["2018-01-04":"2018-01-06",:]

| datetime | data |
| --- | --- |
| 2018-01-04 00:00:00 | 30 |
| 2018-01-04 01:00:00 | 43 |
| 2018-01-04 02:00:00 | 72 |
| 2018-01-04 03:00:00 | 1 |
| 2018-01-04 04:00:00 | 94 |
| 2018-01-04 05:00:00 | 33 |
| 2018-01-04 06:00:00 | 38 |
| 2018-01-04 07:00:00 | 11 |
| 2018-01-04 08:00:00 | 66 |
| 2018-01-04 09:00:00 | 3 |
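
One detail worth seeing in code is that date-string slices on a datetime index include both endpoints, unlike ordinary Python slices. The sketch below uses a hypothetical frame named demo with the same hourly index as the tutorial and counts the rows each selection returns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same hourly index as the tutorial's df.
idx = pd.date_range("2018-01-01", "2018-01-08", freq="h")
demo = pd.DataFrame({"data": np.arange(len(idx))}, index=idx)

# A single date string selects every row that falls on that date.
one_day = demo.loc["2018-01-03"]

# A slice of date strings includes both endpoints in full,
# so this covers all of Jan 4, Jan 5, and Jan 6.
three_days = demo.loc["2018-01-04":"2018-01-06"]

# Strings can carry a time component for finer selections.
morning = demo.loc["2018-01-03 06:00":"2018-01-03 09:00"]

print(len(one_day), len(three_days), len(morning))
```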


## Resampling Time Series for Different Time Periods

The basic data frame that we’ve populated gives us data on an hourly frequency, but we can resample the data at a different frequency and specify how we would like to compute the summary statistic for the new sample frequency.

We could take the min, max, average, sum, etc., of the data at a daily frequency instead of an hourly frequency as per the example below where we compute the daily average of the data:

In [92]:
df.resample("D").mean()

| datetime | data |
| --- | --- |
| 2018-01-01 | 50.875 |
| 2018-01-02 | 52.75 |
| 2018-01-03 | 43.208333 |
| 2018-01-04 | 48.291667 |
| 2018-01-05 | 58.625 |
| 2018-01-06 | 49.208333 |
| 2018-01-07 | 51.416667 |
| 2018-01-08 | 18.0 |


The same approach works for other frequencies, such as three-hour bins:

In [93]:
df.resample("3h").mean()

| datetime | data |
| --- | --- |
| 2018-01-01 00:00:00 | 42.333333 |
| 2018-01-01 03:00:00 | 53.666667 |
| 2018-01-01 06:00:00 | 44.0 |
| 2018-01-01 09:00:00 | 44.333333 |
| 2018-01-01 12:00:00 | 62.333333 |
| 2018-01-01 15:00:00 | 57.0 |
| 2018-01-01 18:00:00 | 62.333333 |
| 2018-01-01 21:00:00 | 41.0 |
| 2018-01-02 00:00:00 | 47.666667 |
| 2018-01-02 03:00:00 | 51.666667 |
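
The text above mentions min, max, average, and sum as possible summaries. As a rough sketch, several of them can be computed in one pass with agg. The frame demo here is hypothetical and uses deterministic values (0 to 47) so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical two-day hourly series with values 0..47.
idx = pd.date_range("2018-01-01", periods=48, freq="h")
demo = pd.DataFrame({"data": np.arange(48)}, index=idx)

# One row per day, one column per summary statistic.
daily = demo["data"].resample("D").agg(["min", "max", "mean", "sum"])
print(daily)
```

The first day covers the values 0 to 23, so its sum is 276 and its mean is 11.5.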


## Compute a Rolling Statistic Such as a Rolling Average

What about window statistics such as a rolling mean or a rolling sum?

Let’s create a new column in our original df that computes the rolling sum over a window of three periods and then look at the top of the data frame:

In [94]:
# Select the "data" column explicitly so the result is a single Series
df["rolling_sum"] = df["data"].rolling(3).sum()

df.head()

| datetime | data | rolling_sum |
| --- | --- | --- |
| 2018-01-01 00:00:00 | 34 | NaN |
| 2018-01-01 01:00:00 | 29 | NaN |
| 2018-01-01 02:00:00 | 64 | 127.0 |
| 2018-01-01 03:00:00 | 38 | 131.0 |
| 2018-01-01 04:00:00 | 60 | 162.0 |


We can see that this is computing correctly (34 + 29 + 64 = 127 in the third row) and that it only produces valid values once there are three periods to look back over.
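
The objectives mention a rolling average, which works the same way as the rolling sum. The sketch below reuses the first five data values from the output above in a standalone series named s (an illustrative name), and shows the min_periods argument as one way to get values before the window is full.

```python
import pandas as pd

# First five values from the tutorial output above, in a standalone series.
idx = pd.date_range("2018-01-01", periods=5, freq="h")
s = pd.Series([34, 29, 64, 38, 60], index=idx, name="data")

# Rolling mean over three periods; the first two entries are NaN.
rolling_mean = s.rolling(3).mean()

# With min_periods=1 the mean is taken over however many values
# are available so far, so no entries are NaN.
rolling_mean_partial = s.rolling(3, min_periods=1).mean()

print(pd.DataFrame({"strict": rolling_mean, "partial": rolling_mean_partial}))
```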

## Working with Missing Data

This is a good chance to see how we can forward fill or backfill data when working with missing values.

Here’s our df but with a new column that takes the rolling sum and backfills the data:

In [95]:
# bfill() replaces the deprecated fillna(method="backfill")
df["rolling_sum_backfilled"] = df["rolling_sum"].bfill()

df.head()

| datetime | data | rolling_sum | rolling_sum_backfilled |
| --- | --- | --- | --- |
| 2018-01-01 00:00:00 | 34 | NaN | 127.0 |
| 2018-01-01 01:00:00 | 29 | NaN | 127.0 |
| 2018-01-01 02:00:00 | 64 | 127.0 | 127.0 |
| 2018-01-01 03:00:00 | 38 | 131.0 | 131.0 |
| 2018-01-01 04:00:00 | 60 | 162.0 | 162.0 |


It’s often useful to fill your missing data with realistic values, such as the average of a time period. But if you are working on a time series problem and want your data to be realistic, you should not backfill: that is like looking into the future and using information you would never have had at that point in time. You will likely want to forward fill your data more often than you backfill.
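
To make the difference concrete, here is a small sketch comparing the two fills on a hypothetical series s with gaps. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series with missing values.
idx = pd.date_range("2018-01-01", periods=5, freq="h")
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=idx)

# Forward fill carries the last known value forward in time,
# so it only uses information available at each timestamp.
forward = s.ffill()

# Backfill pulls the next known value backward in time,
# which uses information from the future; a trailing gap stays NaN.
backward = s.bfill()

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward}))
```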

## Understand the Basics of Unix / Epoch Time

When working with time series data, you may come across time values that are in Unix time. Unix time, also called epoch time, is the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970. Using Unix time helps to disambiguate timestamps so that we don’t get confused by time zones, daylight saving time, etc.

Here’s an example of a time in epoch time being converted to a regular timestamp in UTC:

In [96]:
epoch_t = 1529272655
real_t = pd.to_datetime(epoch_t, unit='s')

real_t

Timestamp('2018-06-17 21:57:35')

The result carries no time zone information. To mark it as UTC and then convert it to my own time zone, I could do the following:

In [97]:
real_t.tz_localize("UTC").tz_convert("US/Pacific")

Timestamp('2018-06-17 14:57:35-0700', tz='US/Pacific')
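
As a variation, passing utc=True to pd.to_datetime yields a timezone-aware timestamp directly, which removes the separate tz_localize step. The sketch below also converts back to epoch seconds to confirm the round trip; the names utc_t, pacific_t, and round_trip are illustrative.

```python
import pandas as pd

epoch_t = 1529272655

# utc=True returns a timezone-aware Timestamp in UTC.
utc_t = pd.to_datetime(epoch_t, unit="s", utc=True)

# Conversion to another zone changes the wall-clock time,
# not the instant being described.
pacific_t = utc_t.tz_convert("US/Pacific")

# Timestamp.timestamp() gives seconds since the epoch as a float.
round_trip = int(utc_t.timestamp())

print(utc_t)
print(pacific_t)
print(round_trip == epoch_t)
```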

## Common Pitfalls of Time Series Data Analysis

With these basics, you should be all set to work with your time series data.

Here are a few tips to keep in mind and common pitfalls to avoid when working with time series data:

-  Check for discrepancies in your data that may be caused by region-specific time changes such as daylight saving time.
-  Keep track of time zones meticulously. Let others going through your code know what time zone your data is in, and consider converting to UTC or another standard reference so your data stays consistent.
-  Missing data can occur frequently. Make sure you document your cleaning rules, and avoid backfilling information you could not have had at the time of a sample.
-  Remember that as you resample your data or fill in missing values, you lose a certain amount of information about your original data set. I’d suggest keeping track of all of your data transformations and tracking down the root cause of your data issues.
-  When you resample your data, the best method (mean, min, max, sum, etc.) depends on the kind of data you have and how it was sampled. Be thoughtful about how you resample your data for your analysis.
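
The daylight saving time pitfall in the first tip can be seen in a short sketch. It assumes the US/Pacific zone used earlier and the 2018 spring transition on March 11; the variable names are illustrative. Hourly data that is evenly spaced in UTC ends up with a 23-hour local day.

```python
import pandas as pd

# 71 consecutive hours in UTC, starting at local midnight on
# 2018-03-10 in US/Pacific (which is 08:00 UTC).
utc_idx = pd.date_range("2018-03-10 08:00", periods=71, freq="h", tz="UTC")
local_idx = utc_idx.tz_convert("US/Pacific")

# Count how many hourly samples land on each local calendar date.
# March 11 is the spring-forward day, so it has only 23 local hours.
hours_per_day = pd.Series(local_idx.date).value_counts().sort_index()
print(hours_per_day)
```

A daily sum or mean over such data is computed from a different number of samples on the transition day, which is one reason to keep the stored data in UTC.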