## Load and Explore Time Series Data

In [1]:
from pandas import read_csv
import os

In [2]:
data = os.environ.get('data')  # avoid personal information in the notebook
# build the path with os.path.join instead of hand-written backslashes
data_path = os.path.join(data, 'TimeSeries', '1-daily-total-female-births.csv')

In [3]:
# squeeze=True was removed from read_csv in pandas 2.0; squeeze the single column explicitly
series = read_csv(data_path, header=0, index_col=0, parse_dates=True).squeeze('columns')
print(type(series))
print(series.head())

<class 'pandas.core.series.Series'>
Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
Name: Births, dtype: int64


In [4]:
# number of observations
print(series.size)

365


In [5]:
# querying by time
print(series['1959-02'])

Date
1959-02-01    23
1959-02-02    31
1959-02-03    44
1959-02-04    38
1959-02-05    50
1959-02-06    38
1959-02-07    51
1959-02-08    31
1959-02-09    31
1959-02-10    51
1959-02-11    36
1959-02-12    45
1959-02-13    51
1959-02-14    34
1959-02-15    52
1959-02-16    47
1959-02-17    45
1959-02-18    46
1959-02-19    39
1959-02-20    48
1959-02-21    37
1959-02-22    35
1959-02-23    52
1959-02-24    42
1959-02-25    45
1959-02-26    39
1959-02-27    37
1959-02-28    30
Name: Births, dtype: int64


In [6]:
# calculate descriptive statistics

print(series.describe())

count    365.000000
mean      41.980822
std        7.348257
min       23.000000
25%       37.000000
50%       42.000000
75%       46.000000
max       73.000000
Name: Births, dtype: float64


## Basic Feature Engineering

A time series dataset must be transformed before it can be modeled as a supervised learning problem. Here we try to predict the daily minimum temperature given the month and day.

In [7]:
# create date-time features of a dataset
from pandas import read_csv
from pandas import DataFrame

data = os.environ.get('data')  # avoid personal information in the notebook
data_path = os.path.join(data, 'TimeSeries', '3-daily-minimum-temperatures.csv')
# squeeze=True was removed from read_csv in pandas 2.0; squeeze the single column explicitly
series = read_csv(data_path, header=0, index_col=0, parse_dates=True).squeeze('columns')
print(series)

Date
1981-01-01    20.7
1981-01-02    17.9
1981-01-03    18.8
1981-01-04    14.6
1981-01-05    15.8
              ... 
1990-12-27    14.0
1990-12-28    13.6
1990-12-29    13.5
1990-12-30    15.7
1990-12-31    13.0
Name: Temp, Length: 3650, dtype: float64


In [8]:
# use the vectorized DatetimeIndex attributes instead of positional series[i] lookups,
# which newer pandas versions treat as label lookups on a datetime index
df_temp = DataFrame({
    'month': series.index.month,
    'day': series.index.day,
    'temperature': series.values,
})
print(df_temp.head(5))

   month  day  temperature
0      1    1         20.7
1      1    2         17.9
2      1    3         18.8
3      1    4         14.6
4      1    5         15.8


### Lag Features

Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems. The simplest approach is to predict the value at the next time (t+1) given the value at the current time (t). The supervised learning problem with shifted values looks as follows:

    Value(t), Value(t+1)  
    Value(t), Value(t+1)  
    Value(t), Value(t+1)  

The Pandas library provides the shift() function to help create these shifted or lag features. Shifting the dataset by 1 creates the t column, adding a NaN value for the first row.

    Shifted, Original
    NaN,      20.7
    20.7,     17.9
    17.9,     18.8


In [9]:
from pandas import concat

temps = DataFrame(series.values)
dataframe = concat([temps.shift(1), temps], axis=1)
dataframe.columns = ['t', 't+1']
print(dataframe.head(5))

      t   t+1
0   NaN  20.7
1  20.7  17.9
2  17.9  18.8
3  18.8  14.6
4  14.6  15.8


In [10]:
dataframe_3 = concat([temps.shift(3), temps.shift(2), temps.shift(1), temps], axis=1)
dataframe_3.columns = ['t-2', 't-1', 't', 't+1']
print(dataframe_3.head(5))

    t-2   t-1     t   t+1
0   NaN   NaN   NaN  20.7
1   NaN   NaN  20.7  17.9
2   NaN  20.7  17.9  18.8
3  20.7  17.9  18.8  14.6
4  17.9  18.8  14.6  15.8
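
Before a frame like this is handed to a model, the rows with missing lags usually have to be dropped and the columns split into inputs and a target. A minimal sketch, assuming the first five temperatures shown above stand in for the full series; the names `lagged`, `X`, and `y` are illustrative and not part of the original notebook:

```python
from pandas import DataFrame, concat

# First five observations from the temperature series shown above (toy data)
temps = DataFrame([20.7, 17.9, 18.8, 14.6, 15.8])

# Build the same three-lag frame as in the cell above
lagged = concat([temps.shift(3), temps.shift(2), temps.shift(1), temps], axis=1)
lagged.columns = ['t-2', 't-1', 't', 't+1']

# Rows with NaN lags have incomplete inputs, so drop them
lagged = lagged.dropna()

# Inputs are the lag columns; the target is the value to predict
X = lagged[['t-2', 't-1', 't']]
y = lagged['t+1']
print(X)
print(y)
```

With a window of three lags, the first three rows are lost, which is one reason the window width is a trade-off on short series.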


A difficulty with the sliding window approach is how large to make the window for your problem.

**Rolling Window Statistics**

A step beyond adding raw lagged values is to add a summary of the values at previous time steps. We can calculate summary statistics across the values in the sliding window and include these as features.

In [11]:
temps = DataFrame(series.values)
shifted = temps.shift(1)
window = shifted.rolling(window=2)
means = window.mean()
dataframe = concat([means, temps], axis=1)
dataframe.columns = ['mean(t-1,t)', 't+1']
print(dataframe.head(5))

   mean(t-1,t)   t+1
0          NaN  20.7
1          NaN  17.9
2        19.30  18.8
3        18.35  14.6
4        16.70  15.8
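
The `shift(1)` before `rolling()` matters: it keeps the current observation out of its own window, so each feature is built only from earlier values. A small check, assuming the first five temperatures from the output above as toy data (`with_shift` and `without_shift` are illustrative names):

```python
from pandas import Series

# First five observations from the temperature series shown above (toy data)
values = Series([20.7, 17.9, 18.8, 14.6, 15.8])

# Window over the two previous observations only
with_shift = values.shift(1).rolling(window=2).mean()

# Without the shift, each window also contains the value being predicted
without_shift = values.rolling(window=2).mean()

print(with_shift.round(2).tolist())
print(without_shift.round(2).tolist())
```

The unshifted version has one fewer NaN at the start, but every feature value is computed partly from its own target.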


**Expanding Window Statistics**


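An expanding window includes all previous observations in the series. For the first five temperatures, the windows grow as follows:

    #, Window Values
    1, 20.7
    2, 20.7, 17.9
    3, 20.7, 17.9, 18.8
    4, 20.7, 17.9, 18.8, 14.6
    5, 20.7, 17.9, 18.8, 14.6, 15.8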

In [12]:
temps = DataFrame(series.values)
window = temps.expanding()
dataframe = concat([window.min(), window.mean(), window.max(), temps.shift(-1)], axis=1)
dataframe.columns = ['min', 'mean', 'max', 't+1']
print(dataframe.head(5))

    min       mean   max   t+1
0  20.7  20.700000  20.7  17.9
1  17.9  19.300000  20.7  18.8
2  17.9  19.133333  20.7  14.6
3  14.6  18.000000  20.7  15.8
4  14.6  17.560000  20.7  15.8
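
As a sanity check on what `expanding()` computes, the expanding mean can be reproduced as a running total divided by the number of observations seen so far. A minimal sketch on the first five temperatures from the output above (toy data, not the full file; `manual_mean` is an illustrative name):

```python
from pandas import Series

# First five observations from the temperature series shown above (toy data)
values = Series([20.7, 17.9, 18.8, 14.6, 15.8])

# Expanding mean: average of all observations up to and including each row
expanding_mean = values.expanding().mean()

# Same quantity by hand: running total divided by running count
manual_mean = values.cumsum() / Series(range(1, len(values) + 1))

print(expanding_mean.round(6).tolist())
print(manual_mean.round(6).tolist())

# Expanding minimum and maximum correspond to the cumulative min and max
print(values.expanding().min().tolist())
print(values.cummin().tolist())
```

Note that each window here includes the current row, which is why the cell above pairs it with `temps.shift(-1)` as the target.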
