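# Time series
- A time series is a sequence of data points arranged at regular time intervals
- Time-series analysis is the field that studies the methods used to interpret and understand time series
- There are many mathematical models for analyzing time-series data; three general-purpose ones are the autoregressive (AR) model, the integrated (I) model, and the moving average (MA) model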

Source: https://ko.wikipedia.org/wiki/%EC%8B%9C%EA%B3%84%EC%97%B4

#### 박정환 (nbicjh@gmail.com)

A time-series is a sequence of data points, typically consisting of successive
measurements made at a regular frequency and over a specific time interval. Time-series
analysis is composed of various methods for making decisions based upon the
data in a time-series by extracting meaningful statistics. Time-series forecasting is
the process of developing a model based upon data in a time-series, and using it to
predict future values based upon previously observed values. Regression analysis is
the process of testing whether one or more independent time-series affect the current
value of another time-series.

In this chapter, we will cover the following:
- DatetimeIndex and its use in time-series data
- Creating time-series with specific frequencies
- Calculation of new dates using date offsets
- Representation of intervals of time using periods
- Shifting and lagging time-series data
- Frequency conversion of time-series data
- Upsampling and downsampling of time-series data

In [99]:
#import module
import numpy as np
import pandas as pd

In [3]:
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',8)
pd.set_option('display.precision', 7)

In [4]:
import datetime

In [5]:
from datetime import datetime

In [6]:
import matplotlib.pyplot as plt
%matplotlib inline
# pd.options.display.mpl_style exists only in older pandas releases; it was removed later
pd.options.display.mpl_style = 'default'

## Time-series data and the DatetimeIndex

The representations of dates, times, time intervals, and periods that pandas
provides go beyond those available in other Python frameworks such as SciPy
and NumPy. The pandas implementations provide additional capabilities that are
required to model time-series data and to transform data across different
frequencies, periods, and calendars for different organizations and financial
markets.

Specific dates and times in pandas are represented using the pandas Timestamp
class. Timestamp is based on NumPy's datetime64 dtype and has nanosecond
resolution, which is finer than the microsecond resolution of Python's built-in
datetime object. This increased precision is frequently required for accurate
financial calculations.
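
The difference in resolution can be checked directly. The following is a minimal sketch that is not part of the original notebook: the timestamp value is made up for illustration, and `warn=False` is passed only to suppress the warning pandas raises when nanoseconds are discarded.

```python
from datetime import datetime
import pandas as pd

# A Timestamp parsed from a string with nine fractional digits keeps all of them.
ts = pd.Timestamp('2014-08-01 12:00:00.123456789')
print(ts.microsecond, ts.nanosecond)   # 123456 789

# Timestamp is a subclass of datetime, so it works wherever a datetime is expected.
print(isinstance(ts, datetime))

# Converting to a plain datetime keeps only microsecond resolution.
py_dt = ts.to_pydatetime(warn=False)
print(py_dt.microsecond)
```

The nanosecond component has no counterpart on the plain `datetime` object, which is why the conversion is lossy.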

In [7]:
dates = [datetime(2014, 8, 1), datetime(2014, 8, 2)]
dti = pd.DatetimeIndex(dates)
dti

DatetimeIndex(['2014-08-01', '2014-08-02'], dtype='datetime64[ns]', freq=None)

In [8]:
np.random.seed(123456)
ts = pd.Series(np.random.randn(2), dates)
# np.random.randn draws random values from the standard normal distribution

type(ts.index)
# type() returns the type of its argument

pandas.tseries.index.DatetimeIndex

In [9]:
ts

2014-08-01    0.4691123
2014-08-02   -0.2828633
dtype: float64

In [10]:
ts[datetime(2014, 8, 2)]
# returns the value in ts whose index label matches the given datetime

-0.28286334432866328

In [11]:
ts['2014-8-2']

-0.28286334432866328

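#### Series creates a DatetimeIndex from datetime objects; date strings are converted with pd.to_datetime

In [13]:
np.random.seed(123456)
dates = ['2014-08-01','2014-08-02']
# convert the strings so that the index is a DatetimeIndex rather than a plain object Index
ts = pd.Series(np.random.randn(2), pd.to_datetime(dates))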
ts

2014-08-01    0.4691123
2014-08-02   -0.2828633
dtype: float64

In [15]:
dti = pd.to_datetime(['Aug 1, 2014', '2014-08-02', '2014.8.3', None])
dti

DatetimeIndex(['2014-08-01', '2014-08-02', '2014-08-03', 'NaT'], dtype='datetime64[ns]', freq=None)

In [28]:
dti1=pd.to_datetime(['8/1/2014'])
dti2=pd.to_datetime(['1/8/2014'], dayfirst=True)
dti1[0],dti2[0]

(Timestamp('2014-08-01 00:00:00'), Timestamp('2014-08-01 00:00:00'))

In [29]:
np.random.seed(123456)
dates = pd.date_range('8/1/2014', periods=10)
s1 = pd.Series(np.random.randn(10),dates)
s1[:5]

2014-08-01    0.4691123
2014-08-02   -0.2828633
2014-08-03   -1.5090585
2014-08-04   -1.1356324
2014-08-05    1.2121120
Freq: D, dtype: float64

### Get Web Data using Pandas

Like any pandas index, a DatetimeIndex can be used for various index operations,
such as data alignment, selection, and slicing. To demonstrate slicing using a
DatetimeIndex, we will retrieve the Yahoo! Finance stock quotes for MSFT for
2012 and 2013 using the pandas DataReader function (more info on DataReader is
available at http://pandas.pydata.org/pandas-docs/version/0.15.2/remote_data.html).

In [30]:
# pandas.io.data was moved to the separate pandas-datareader package
import pandas_datareader.data as web

Note: the original import, ``import pandas.io.data as web``, emits a warning that the pandas.io.data module has moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.


In [33]:
msft = web.DataReader("MSFT",'yahoo','2012-1-1','2013-12-30')
msft.head()

                 Open       High        Low      Close    Volume  Adj Close
Date                                                                       
2012-01-03  26.549999  26.959999  26.389999  26.770000  64731500  23.943792
2012-01-04  26.820000  27.469999  26.780001  27.400000  80516100  24.507280
2012-01-05  27.379999  27.730000  27.290001  27.680000  56081400  24.757720
2012-01-06  27.530001  28.190001  27.530001  28.110001  99455500  25.142323
2012-01-09  28.049999  28.100000  27.719999  27.740000  59706800  24.811385

In [34]:
msftAC = msft['Adj Close']
msftAC.head(3)

Date
2012-01-03    23.943792
2012-01-04    24.507280
2012-01-05    24.757720
Name: Adj Close, dtype: float64

In [35]:
msft['2012-01-01':'2012-01-05']

                 Open       High        Low  Close    Volume  Adj Close
Date                                                                   
2012-01-03  26.549999  26.959999  26.389999  26.77  64731500  23.943792
2012-01-04  26.820000  27.469999  26.780001  27.40  80516100  24.507280
2012-01-05  27.379999  27.730000  27.290001  27.68  56081400  24.757720

A specific row can be retrieved from a time-series represented by a DataFrame by
passing the date/time index label to the .loc indexer. The result is a Series
whose index labels are the column names and whose values are the entries of
that row:

In [36]:
msft.loc['2012-01-03']

Open               26.549999
High               26.959999
Low                26.389999
Close              26.770000
Volume       64731500.000000
Adj Close          23.943792
Name: 2012-01-03 00:00:00, dtype: float64

In [41]:
msftAC['2012-01-03']

23.943792000000002

In [39]:
msft['2012-02'].head(5)

                 Open       High        Low      Close    Volume  Adj Close
Date                                                                       
2012-02-01  29.790001  30.049999  29.760000  29.889999  67409900  26.734401
2012-02-02  29.900000  30.170000  29.709999  29.950001  52223300  26.788068
2012-02-03  30.139999  30.400000  30.090000  30.240000  41838500  27.047451
2012-02-06  30.040001  30.219999  29.969999  30.200001  28039700  27.011675
2012-02-07  30.150000  30.490000  30.049999  30.350000  39242400  27.145838

In [42]:
msft['2012-02':'2012-02-09']

                 Open       High        Low      Close    Volume  Adj Close
Date                                                                       
2012-02-01  29.790001  30.049999  29.760000  29.889999  67409900  26.734401
2012-02-02  29.900000  30.170000  29.709999  29.950001  52223300  26.788068
2012-02-03  30.139999  30.400000  30.090000  30.240000  41838500  27.047451
2012-02-06  30.040001  30.219999  29.969999  30.200001  28039700  27.011675
2012-02-07  30.150000  30.490000  30.049999  30.350000  39242400  27.145838
2012-02-08  30.260000  30.670000  30.219999  30.660000  49659100  27.423110
2012-02-09  30.680000  30.799999  30.480000  30.770000  50481600  27.521497

In [43]:
msft['2012-02':'2012-02-09'][:5]

                 Open       High        Low      Close    Volume  Adj Close
Date                                                                       
2012-02-01  29.790001  30.049999  29.760000  29.889999  67409900  26.734401
2012-02-02  29.900000  30.170000  29.709999  29.950001  52223300  26.788068
2012-02-03  30.139999  30.400000  30.090000  30.240000  41838500  27.047451
2012-02-06  30.040001  30.219999  29.969999  30.200001  28039700  27.011675
2012-02-07  30.150000  30.490000  30.049999  30.350000  39242400  27.145838
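
The selections above depend on downloaded quotes, so they cannot be re-run offline. The following is a minimal sketch of the same label-based selections on synthetic data; the frame `df`, its single `Close` column, and its values are made up for illustration. It uses `.loc` throughout, because recent pandas releases no longer accept a partial date string in plain `df[...]` for selecting rows.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded quotes: one row per business day.
idx = pd.date_range('2012-01-02', '2012-03-30', freq='B')
df = pd.DataFrame({'Close': np.arange(len(idx), dtype=float)}, index=idx)

one_day = df.loc['2012-01-03']                # a single row, returned as a Series
february = df.loc['2012-02']                  # every row in February 2012
feb_to_9th = df.loc['2012-02':'2012-02-09']   # partial-string slice, both ends inclusive

print(one_day)
print(len(february), len(feb_to_9th))
```

Unlike positional slicing, a slice over index labels includes its end point, which is why 2012-02-09 is part of the last result.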

## Creating time-series with specific frequencies

Time-series data in pandas can also be created to represent intervals of time other
than daily frequency. Different frequencies can be generated with pd.date_range()
by utilizing the freq parameter. This parameter defaults to a value of D, which
represents daily frequency.

To introduce the creation of nondaily frequencies, the following command creates
a DatetimeIndex with one-minute intervals using freq='T':

In [44]:
# Create a time series; np.arange generates the range of integer values
# freq sets the frequency (spacing) of the index
bymin = pd.Series(np.arange(0,90*60*24), 
                 pd.date_range('2014-08-01',
                              '2014-10-29 23:59:00', freq='T'))

In [45]:
bymin

2014-08-01 00:00:00         0
2014-08-01 00:01:00         1
2014-08-01 00:02:00         2
2014-08-01 00:03:00         3
                        ...  
2014-10-29 23:56:00    129596
2014-10-29 23:57:00    129597
2014-10-29 23:58:00    129598
2014-10-29 23:59:00    129599
Freq: T, dtype: int32

In [46]:
bymin['2014-08-01 12:30':'2014-08-01 12:59']

2014-08-01 12:30:00    750
2014-08-01 12:31:00    751
2014-08-01 12:32:00    752
2014-08-01 12:33:00    753
                      ... 
2014-08-01 12:56:00    776
2014-08-01 12:57:00    777
2014-08-01 12:58:00    778
2014-08-01 12:59:00    779
Freq: T, dtype: int32
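
A few other frequency strings, shown as a small sketch that is not part of the original notebook. It spells the minute alias as `'min'` rather than `'T'`; both are accepted by the pandas version used here, and some newer releases warn about the single-letter form. The variable names are illustrative.

```python
import pandas as pd

# Three timestamps each, at different spacings.
every_minute = pd.date_range('2014-08-01', periods=3, freq='min')
every_15_min = pd.date_range('2014-08-01', periods=3, freq='15min')
business_days = pd.date_range('2014-08-01', periods=3, freq='B')

# One full day at one-minute frequency has 24 * 60 entries.
full_day = pd.date_range('2014-08-01', '2014-08-01 23:59', freq='min')

print(every_15_min)
print(business_days)
print(len(full_day))
```

Because 2014-08-01 is a Friday, the business-day range skips the weekend and continues on Monday, 2014-08-04.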

## Representing intervals of time using periods

### Period objects: representing a bounded interval of time

It is often necessary to represent not just a specific time or a sequence of
timestamps but an interval of time with a start date and an end date (a financial
quarter, for example). Such a bounded interval of time is represented in pandas
using Period objects.

Period objects consist of a start time and an end time and are created from a
start date with a given frequency. The start time is referred to as the anchor of the
Period object, and the end time is then calculated from the start date and the period
specification.

In [48]:
aug2014 = pd.Period('2014-08', freq='M')
aug2014

Period('2014-08', 'M')

In [49]:
aug2014.start_time, aug2014.end_time

(Timestamp('2014-08-01 00:00:00'), Timestamp('2014-08-31 23:59:59.999999999'))

In [51]:
sep2014 = aug2014+1
sep2014

Period('2014-09', 'M')

Since we specified a period that starts using a partial date specification of August
2014, pandas determines the anchor (start_time) as 2014-08-01 00:00:00 and
then calculates the end_time property based upon the specified frequency; in this
case, calculating 1 month from the start_time anchor and returning the last unit
of time prior to this.

Mathematical operations are overloaded on Period objects, so as to calculate another
period based upon the value represented in Period. As an example, adding 1 to the
aug2014 period object creates a new Period. Since aug2014 has a frequency of one
month, the result is the period one month later, September 2014, which starts at
2014-09-01 00:00:00 and ends at the last moment of time prior to 2014-10-01:

In [52]:
sep2014.start_time, sep2014.end_time

(Timestamp('2014-09-01 00:00:00'), Timestamp('2014-09-30 23:59:59.999999999'))
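
Period arithmetic always moves in units of the period's own frequency. A minimal sketch, separate from the notebook's cells, with illustrative variable names:

```python
import pandas as pd

aug = pd.Period('2014-08', freq='M')
sep = aug + 1          # one frequency unit (one month) later
next_aug = aug + 12    # twelve months later

print(sep, sep.start_time, sep.end_time)
print(next_aug)
```

The integer is therefore a count of months here, not of days or seconds.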

In [53]:
mp2013 = pd.period_range('1/1/2013', '12/31/2013', freq='M')
mp2013

PeriodIndex(['2013-01', '2013-02', '2013-03', '2013-04', '2013-05', '2013-06',
             '2013-07', '2013-08', '2013-09', '2013-10', '2013-11', '2013-12'],
            dtype='int64', freq='M')

In [58]:
for p in mp2013:
    print("{0} {1} {2} {3}".format(p, p.freq, p.start_time, p.end_time))

2013-01 <MonthEnd> 2013-01-01 00:00:00 2013-01-31 23:59:59.999999999
2013-02 <MonthEnd> 2013-02-01 00:00:00 2013-02-28 23:59:59.999999999
2013-03 <MonthEnd> 2013-03-01 00:00:00 2013-03-31 23:59:59.999999999
2013-04 <MonthEnd> 2013-04-01 00:00:00 2013-04-30 23:59:59.999999999
2013-05 <MonthEnd> 2013-05-01 00:00:00 2013-05-31 23:59:59.999999999
2013-06 <MonthEnd> 2013-06-01 00:00:00 2013-06-30 23:59:59.999999999
2013-07 <MonthEnd> 2013-07-01 00:00:00 2013-07-31 23:59:59.999999999
2013-08 <MonthEnd> 2013-08-01 00:00:00 2013-08-31 23:59:59.999999999
2013-09 <MonthEnd> 2013-09-01 00:00:00 2013-09-30 23:59:59.999999999
2013-10 <MonthEnd> 2013-10-01 00:00:00 2013-10-31 23:59:59.999999999
2013-11 <MonthEnd> 2013-11-01 00:00:00 2013-11-30 23:59:59.999999999
2013-12 <MonthEnd> 2013-12-01 00:00:00 2013-12-31 23:59:59.999999999


In [60]:
np.random.seed(123456)
ps = pd.Series(np.random.randn(12), mp2013)
ps

2013-01    0.4691123
2013-02   -0.2828633
2013-03   -1.5090585
2013-04   -1.1356324
             ...    
2013-09   -0.8618490
2013-10   -2.1045692
2013-11   -0.4949293
2013-12    1.0718038
Freq: M, dtype: float64

## Shifting and lagging time-series data

A common operation on time-series data is to shift or "lag" the values backward and
forward in time, for example to calculate the percentage change from sample to
sample. The pandas method for this is .shift(), which moves the values relative to
the index by a specified number of periods; the index labels themselves stay in
place unless a freq is given.

In [63]:
#yahoo data, msft
msftAC[:5]

Date
2012-01-03    23.943792
2012-01-04    24.507280
2012-01-05    24.757720
2012-01-06    25.142323
2012-01-09    24.811385
Name: Adj Close, dtype: float64

In [64]:
#shift function
shifted_forward = msftAC.shift(1)
shifted_forward[:5]

Date
2012-01-03          NaN
2012-01-04    23.943792
2012-01-05    24.507280
2012-01-06    24.757720
2012-01-09    25.142323
Name: Adj Close, dtype: float64

In [65]:
msftAC.tail(5), shifted_forward.tail(5)

(Date
 2013-12-23    34.699330
 2013-12-24    35.135207
 2013-12-26    35.476322
 2013-12-27    35.334191
 2013-12-30    35.334191
 Name: Adj Close, dtype: float64, Date
 2013-12-23    34.869890
 2013-12-24    34.699330
 2013-12-26    35.135207
 2013-12-27    35.476322
 2013-12-30    35.334191
 Name: Adj Close, dtype: float64)

In [66]:
# shift(-2) pulls each value from two rows later back onto the current row
shifted_backwards = msftAC.shift(-2)[:10]
shifted_backwards[:5]

Date
2012-01-03    24.757720
2012-01-04    25.142323
2012-01-05    24.811385
2012-01-06    24.900828
2012-01-09    24.793496
Name: Adj Close, dtype: float64

In [69]:
ps.tail(5)

2013-08   -1.0442360
2013-09   -0.8618490
2013-10   -2.1045692
2013-11   -0.4949293
2013-12    1.0718038
Freq: M, dtype: float64

In [71]:
ps.shift(-2)

2013-01   -1.5090585
2013-02   -1.1356324
2013-03    1.2121120
2013-04   -0.1732146
             ...    
2013-09   -0.4949293
2013-10    1.0718038
2013-11          NaN
2013-12          NaN
Freq: M, dtype: float64

In [73]:
# shift the index labels by one second
msftAC.shift(1, freq="S")

Date
2012-01-03 00:00:01    23.943792
2012-01-04 00:00:01    24.507280
2012-01-05 00:00:01    24.757720
2012-01-06 00:00:01    25.142323
                         ...    
2013-12-24 00:00:01    35.135207
2013-12-26 00:00:01    35.476322
2013-12-27 00:00:01    35.334191
2013-12-30 00:00:01    35.334191
Name: Adj Close, dtype: float64

When .shift() is given a freq argument, as in the call above, the index labels are
shifted instead of the data. The resulting DataFrame or Series is essentially the
same as the original, with the specified number of units of frequency added to each
index label. No data is shifted out or replaced with NaN, because no realignment
is performed.

In [74]:
msftAC.shift(1,freq="D")

Date
2012-01-04    23.943792
2012-01-05    24.507280
2012-01-06    24.757720
2012-01-07    25.142323
                ...    
2013-12-25    35.135207
2013-12-27    35.476322
2013-12-28    35.334191
2013-12-31    35.334191
Name: Adj Close, dtype: float64
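
The two forms of `.shift()` can be compared side by side. This is a minimal sketch on a made-up three-element series, not the MSFT data:

```python
import pandas as pd

s = pd.Series([10.0, 11.0, 12.0],
              index=pd.date_range('2012-01-03', periods=3, freq='D'))

by_rows = s.shift(1)              # values move down one row; the first becomes NaN
by_label = s.shift(1, freq='D')   # index labels move forward one day; values untouched

print(by_rows)
print(by_label)
```

In the first result the index is unchanged and the last value is lost; in the second every value survives but sits under a later label.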

In [76]:
#day-to-day percentage change
msftAC / msftAC.shift(1) -1

Date
2012-01-03          NaN
2012-01-04    0.0235338
2012-01-05    0.0102190
2012-01-06    0.0155347
                ...    
2013-12-24    0.0125615
2013-12-26    0.0097086
2013-12-27   -0.0040064
2013-12-30    0.0000000
Name: Adj Close, dtype: float64
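
The same calculation on a small made-up price series, as a sketch. The prices are invented for illustration, and `pct_change()` is shown only as the built-in pandas method that yields the same values as the manual expression:

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.96],
                   index=pd.date_range('2012-01-03', periods=3, freq='D'))

manual = prices / prices.shift(1) - 1   # current value relative to the previous one
builtin = prices.pct_change()           # built-in equivalent

print(manual)
print(builtin)
```

The results are fractions rather than percentages: a rise from 100 to 102 appears as 0.02. The first entry is NaN because there is no earlier value to compare against.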

## Frequency conversion of time-series data

The frequency of the data in a time-series can be converted in pandas using the
.asfreq() method of a Series or DataFrame. To demonstrate, we will use the
following small subset of the MSFT adjusted closing values:

In [77]:
sample = msftAC[:2]
sample

Date
2012-01-03    23.943792
2012-01-04    24.507280
Name: Adj Close, dtype: float64

In [78]:
sample.asfreq("H")

Date
2012-01-03 00:00:00    23.943792
2012-01-03 01:00:00          NaN
2012-01-03 02:00:00          NaN
2012-01-03 03:00:00          NaN
                         ...    
2012-01-03 21:00:00          NaN
2012-01-03 22:00:00          NaN
2012-01-03 23:00:00          NaN
2012-01-04 00:00:00    24.507280
Freq: H, Name: Adj Close, dtype: float64

In [79]:
# method="ffill" fills each new label with the last known value (forward fill)
sample.asfreq("H", method="ffill")

Date
2012-01-03 00:00:00    23.943792
2012-01-03 01:00:00    23.943792
2012-01-03 02:00:00    23.943792
2012-01-03 03:00:00    23.943792
                         ...    
2012-01-03 21:00:00    23.943792
2012-01-03 22:00:00    23.943792
2012-01-03 23:00:00    23.943792
2012-01-04 00:00:00    24.507280
Freq: H, Name: Adj Close, dtype: float64

In [80]:
sample.asfreq("H", method="bfill")

Date
2012-01-03 00:00:00    23.943792
2012-01-03 01:00:00    24.507280
2012-01-03 02:00:00    24.507280
2012-01-03 03:00:00    24.507280
                         ...    
2012-01-03 21:00:00    24.507280
2012-01-03 22:00:00    24.507280
2012-01-03 23:00:00    24.507280
2012-01-04 00:00:00    24.507280
Freq: H, Name: Adj Close, dtype: float64
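
The three behaviours of `.asfreq()` are easier to compare on a tiny example. A minimal sketch with two made-up observations three days apart, converted to daily frequency:

```python
import pandas as pd

s = pd.Series([1.0, 4.0],
              index=pd.to_datetime(['2012-01-03', '2012-01-06']))

daily = s.asfreq('D')                       # new labels get NaN
forward = s.asfreq('D', method='ffill')     # new labels take the previous value
backward = s.asfreq('D', method='bfill')    # new labels take the next value

print(daily.tolist())
print(forward.tolist())    # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())   # [1.0, 4.0, 4.0, 4.0]
```

Neither fill method invents new information; each only decides which existing observation is repeated.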

## Resampling of time-series

Frequency conversion provides basic conversion of data using the new frequency
intervals and allows the filling of missing data using either NaN, forward filling,
or backward filling. More elaborate control is provided through the process of
resampling.

Resampling can be either downsampling, where data is converted to wider
frequency ranges (such as downsampling from day-to-day to month-to-month)
or upsampling, where data is converted to narrower time ranges. Data for the
associated labels are then calculated by a function provided to pandas instead
of simple filling.

In [82]:
msft_cum_ret = (1+ (msftAC / msftAC.shift() -1 )).cumprod()
msft_cum_ret

Date
2012-01-03          NaN
2012-01-04    1.0235338
2012-01-05    1.0339933
2012-01-06    1.0500560
                ...    
2013-12-24    1.4674036
2013-12-26    1.4816501
2013-12-27    1.4757141
2013-12-30    1.4757141
Name: Adj Close, dtype: float64

A time-series can be resampled using the .resample() method. This method
provides a very flexible means to specify the frequency conversion involved in the
resampling, as well as the means by which the resampled values are selected or
calculated.

In [84]:
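# resample to monthly frequency; in the pandas version used here the default aggregation is the mean
# (from pandas 0.18 on, the aggregation is chained explicitly: msft_cum_ret.resample("M").mean())
msft_monthly_cum_ret = msft_cum_ret.resample("M")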
msft_monthly_cum_ret

Date
2012-01-31    1.0686747
2012-02-29    1.1556975
2012-03-31    1.2105696
2012-04-30    1.1846436
                ...    
2013-09-30    1.2773969
2013-10-31    1.3503984
2013-11-30    1.4719148
2013-12-31    1.4823625
Freq: M, Name: Adj Close, dtype: float64

In [85]:
msft_cum_ret['2012-01'].mean()

1.0686746606165674

In [94]:
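# how= names the aggregation in this older API; pandas 0.18 and later chain it instead: .resample("M").mean()
msft_cum_ret.resample("M", how="mean")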

Date
2012-01-31    1.0686747
2012-02-29    1.1556975
2012-03-31    1.2105696
2012-04-30    1.1846436
                ...    
2013-09-30    1.2773969
2013-10-31    1.3503984
2013-11-30    1.4719148
2013-12-31    1.4823625
Freq: M, Name: Adj Close, dtype: float64
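
Downsampling can also be tried on a small synthetic series. The sketch below is not from the original notebook: it uses the chained form `.resample(...).mean()` introduced in pandas 0.18 rather than the `how=` argument, and it groups minute data into three-minute bins only to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Six one-minute observations with the values 0.0 .. 5.0
s = pd.Series(np.arange(6, dtype=float),
              index=pd.date_range('2014-08-01', periods=6, freq='min'))

means = s.resample('3min').mean()   # average of each three-minute bin
lasts = s.resample('3min').last()   # last observation of each bin

print(means.tolist())   # [1.0, 4.0]
print(lasts.tolist())   # [2.0, 5.0]
```

Choosing the aggregation is the main decision in downsampling: a mean smooths the values inside each bin, whereas `last()` keeps the final observation, which is often what is wanted for prices.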

The type of index resulting from a resampling is controlled by the kind parameter,
which can be set to timestamp (the default) or period. In the resampling examples
up to this point, the result has been indexed by Timestamp, labelled with the last
day of each month. The following command returns an index based on periods instead
of timestamps, which is useful when we need the start and end timestamps of each
sample:

In [92]:
by_periods = msft_cum_ret.resample("M", how="mean",kind="period")

In [93]:
for i in by_periods.index[:5]:
    print("{0}:{1} {2}".format(i.start_time, i.end_time, by_periods[i]))

2012-01-01 00:00:00:2012-01-31 23:59:59.999999999 1.06867466062
2012-02-01 00:00:00:2012-02-29 23:59:59.999999999 1.15569749979
2012-03-01 00:00:00:2012-03-31 23:59:59.999999999 1.21056962376
2012-04-01 00:00:00:2012-04-30 23:59:59.999999999 1.18464362913
2012-05-01 00:00:00:2012-05-31 23:59:59.999999999 1.14051595959
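
A period-based index can also be obtained from an existing timestamp-indexed result with `.to_period()`. This is a sketch of that alternative route, not the notebook's method; the two monthly values are made up for illustration.

```python
import pandas as pd

# Month-end timestamps with illustrative values
monthly = pd.Series([1.07, 1.16],
                    index=pd.to_datetime(['2012-01-31', '2012-02-29']))

by_period = monthly.to_period('M')   # DatetimeIndex -> PeriodIndex

for p in by_period.index:
    print(p, p.start_time, p.end_time, by_period[p])
```

Each label now carries both ends of its month, so no separate calendar arithmetic is needed to find where a sample starts and ends.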


In [95]:
sample = msft_cum_ret[1:3]
sample

Date
2012-01-04    1.0235338
2012-01-05    1.0339933
Name: Adj Close, dtype: float64

In [96]:
by_hour = sample.resample("H")
by_hour

Date
2012-01-04 00:00:00    1.0235338
2012-01-04 01:00:00          NaN
2012-01-04 02:00:00          NaN
2012-01-04 03:00:00          NaN
                         ...    
2012-01-04 21:00:00          NaN
2012-01-04 22:00:00          NaN
2012-01-04 23:00:00          NaN
2012-01-05 00:00:00    1.0339933
Freq: H, Name: Adj Close, dtype: float64

Hourly index labels have been created by pandas, but the alignment only propagates
two values into the new time-series and fills the others with NaN. This is an inherent
issue with upsampling: the result contains labels for which no information exists.
By default, pandas uses NaN but provides other methods to fill in values.

As with frequency conversion, the new index labels can be forward filled or back
filled using the fill_method parameter and specifying bfill or ffill. Another
option is to interpolate the missing data, which can be done using the time-series
object's .interpolate() method, which will perform a linear interpolation:

In [98]:
by_hour.interpolate()

Date
2012-01-04 00:00:00    1.0235338
2012-01-04 01:00:00    1.0239696
2012-01-04 02:00:00    1.0244054
2012-01-04 03:00:00    1.0248412
                         ...    
2012-01-04 21:00:00    1.0326858
2012-01-04 22:00:00    1.0331217
2012-01-04 23:00:00    1.0335575
2012-01-05 00:00:00    1.0339933
Freq: H, Name: Adj Close, dtype: float64
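
The upsampling options can be compared on two made-up observations three minutes apart. The sketch below uses the chained resample form of pandas 0.18 and later (`.asfreq()` and `.ffill()` on the resampler) instead of the `fill_method` parameter; the values are chosen so that the interpolated result is easy to verify by eye.

```python
import pandas as pd

s = pd.Series([1.0, 4.0],
              index=pd.to_datetime(['2012-01-04 00:00', '2012-01-04 00:03']))

upsampled = s.resample('min').asfreq()   # new labels are NaN
filled = s.resample('min').ffill()       # new labels repeat the previous value
linear = upsampled.interpolate()         # new labels lie on a straight line

print(upsampled.tolist())
print(filled.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(linear.tolist())   # [1.0, 2.0, 3.0, 4.0]
```

Linear interpolation assumes the value changed at a constant rate between the two observations, which is a modelling choice rather than recovered data.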

## Summary

In this chapter, we examined the many ways in pandas to represent various units of
time and time-series data. Understanding dates and time-series as well as frequency
conversion is critical to analyzing financial information. We examined several ways
of manipulating time-series data represented by stock price information, working
with dates, times, periods, and frequencies. In closing, the chapter examined the
means of converting the data in time-series into different frequencies.