# Exercises

Remember to document your thoughts and any takeaways as you work through visualizations! 

Using your store items data you prepped in lesson 2 exercises:

1. Split your data into train and test using the sklearn.model_selection.TimeSeriesSplit method. 
1. Validate your splits by plotting X_train and y_train.  
1. Plot the weekly average & the 7-day moving average. Compare the 2 plots. 
1. Plot the daily difference. Observe whether usage seems to vary drastically from day to day or has more of a smooth transition.   
1. Plot a time series decomposition. 
1. Create a lag plot (day over day).  
1. Run a lag correlation.   

Using your OPS data you prepped in lesson 2 exercises: 

1. Split your data into train and test using the percent cutoff method.  
1. Validate your splits by plotting X_train and y_train.  
1. Plot the weekly average & the 7-day moving average. Compare the 2 plots. 
1. Group the electricity consumption time series by month of year, to explore annual seasonality.
1. Plot the daily difference. Observe whether usage seems to vary drastically from day to day or has more of a smooth transition.   
1. Plot a time series decomposition. Takeaways?   
1. Create a lag plot (day over day).  
1. Run a lag correlation.   

If time: 

For each store, chart how many units of each item were sold over a period of time. Hints: use one subplot per value of the variable with the fewest distinct values (like store), with x = time, y = count, color = item. If you have too many distinct items, you may need to plot the top n, while aggregating the others into an 'other' bucket.
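One way to approach the hint above is sketched below. It is a minimal sketch on synthetic data, not a reference solution: the column names `store`, `item`, `sale_date`, and `sale_amount`, the helper `top_n_with_other`, and the weekly aggregation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt


def top_n_with_other(sales, n=3):
    """Keep the n best-selling items and relabel every other item as 'other'.

    Hypothetical helper: assumes columns named 'item' and 'sale_amount'.
    """
    totals = sales.groupby("item")["sale_amount"].sum()
    top_items = totals.nlargest(n).index
    out = sales.copy()
    out["item"] = out["item"].where(out["item"].isin(top_items), "other")
    return out


# Synthetic stand-in for the store data: 2 stores, 5 items, 60 days
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=60, freq="D")
toy = pd.DataFrame(
    [(d, s, i, rng.integers(1, 20)) for d in dates for s in (1, 2) for i in "ABCDE"],
    columns=["sale_date", "store", "item", "sale_amount"],
)

# Bucket the long tail, then total the units sold per store, item, and week
bucketed = top_n_with_other(toy, n=3)
weekly = (
    bucketed
    .groupby(["store", "item", pd.Grouper(key="sale_date", freq="W")])["sale_amount"]
    .sum()
    .reset_index()
)

# One subplot per store, one line (color) per item
stores = sorted(weekly["store"].unique())
fig, axes = plt.subplots(len(stores), 1, figsize=(12, 4 * len(stores)),
                         sharex=True, squeeze=False)
for ax, store in zip(axes[:, 0], stores):
    one_store = weekly[weekly["store"] == store]
    wide = one_store.pivot(index="sale_date", columns="item", values="sale_amount")
    wide.plot(ax=ax)
    ax.set_title(f"Store {store}")
    ax.set_ylabel("items sold")
```

Pivoting to one column per item lets pandas assign the colors and legend automatically; with the real data, `n` would need tuning to keep the legend readable.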

In [None]:
# data manipulation 
import numpy as np
import pandas as pd

from datetime import datetime
import itertools

# data visualization 
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

%matplotlib inline

from sklearn.model_selection import TimeSeriesSplit

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from acquire import get_store_data_sql
from prepare import prep_store_data

df = get_store_data_sql()

In [None]:
df = prep_store_data(df)
target_vars = ['sale_amount','sales_total']
df = df.resample('D')[target_vars].sum()
df.head()

## Store Item Sales

### Q1
#### Split your data into train and test using the sklearn.model_selection.TimeSeriesSplit method

In [None]:
# reset index to be row number
df2 = df.reset_index()

# create X and y
X = df2.sale_date
y = df2.sale_amount

# create the splitter, with 5 splits
tss = TimeSeriesSplit(n_splits=5, max_train_size=None)

# split() yields positional train/test indices for each fold.
# Each pass overwrites the previous one, so after the loop
# X_train, X_test, y_train, y_test hold the final (largest) fold.
for train_index, test_index in tss.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
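To see what the folds look like, the sketch below runs `TimeSeriesSplit` on a toy array of 12 observations (an assumption for illustration, not the store data) and prints the index range of each fold. Every fold trains on an initial stretch of the series and tests on the block that immediately follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 12 ordered observations (illustrative only)
toy = np.arange(12)

tss_toy = TimeSeriesSplit(n_splits=5)
folds = list(tss_toy.split(toy))

for k, (train_idx, test_idx) in enumerate(folds):
    # train always starts at 0 and ends right before the test block
    print(f"fold {k}: train {train_idx[0]}-{train_idx[-1]}, "
          f"test {test_idx[0]}-{test_idx[-1]}")
```

Because the training window only grows, no fold ever trains on observations that come after its test block.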

In [None]:
# from Sara: collect the indices of every fold, then plot each fold

train_indices = []
test_indices = []
for train_index, test_index in tss.split(X):
    train_indices.append(train_index)
    test_indices.append(test_index)

# one figure per fold, indexing the full X and y (not the last fold's X_train)
for i in range(len(train_indices)):
    plt.figure(figsize=(12, 4))
    plt.plot(X.iloc[train_indices[i]], y.iloc[train_indices[i]])
    plt.plot(X.iloc[test_indices[i]], y.iloc[test_indices[i]])
    plt.show()

### Q2
#### Validate your splits by plotting X_train and y_train.  

In [None]:
plt.figure(figsize = (12,4))
plt.plot(X_train, y_train)
plt.plot(X_test, y_test)
plt.show()

### Q3
#### Plot the weekly average & the 7-day moving average. Compare the 2 plots. 

In [None]:
data = {'sale_date': X_train, 'sale_amount': y_train}
train = pd.DataFrame(data)
train = train.groupby(['sale_date']).sum()

In [None]:
train_W = train.resample('W').mean()
ax = train_W.plot(figsize = (12,4))
ax.set_title('Item Sales: Weekly Average')
plt.show()

In [None]:
# 7-day window to match the title; DataFrame.plot creates its own figure
ax = train.rolling(7).mean().plot(figsize=(12, 4))
ax.set_title('Item Sales: 7-day Moving Average')
plt.show()
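The difference between the two plots can be checked numerically. The sketch below uses a synthetic four-week series (an assumption, not the store data): `resample('W')` returns one value per calendar week, while `rolling(7)` returns one value per day and is undefined until seven observations are available.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: 4 full weeks starting on a Monday
idx = pd.date_range("2021-01-04", periods=28, freq="D")
toy = pd.Series(np.arange(28, dtype=float), index=idx, name="sale_amount")

weekly_avg = toy.resample("W").mean()   # one point per week, labelled by its Sunday
rolling_avg = toy.rolling(7).mean()     # one point per day, NaN for the first 6 days

print(len(weekly_avg), "weekly points vs", rolling_avg.notna().sum(), "rolling points")

# On each Sunday the trailing 7-day window covers exactly that calendar week
print(rolling_avg.loc[weekly_avg.index])
```

So the weekly average is a downsampled version of the series, and the moving average is a smoothed version at the original daily frequency.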

### Q4
#### Plot the daily difference. Observe whether usage seems to vary drastically from day to day or has more of a smooth transition.   

In [None]:
train.diff(periods=1).plot(figsize=(12, 4), linewidth=0.5)

### Q5

#### Plot a time series decomposition. 

In [None]:
decomposition = sm.tsa.seasonal_decompose(train.resample('W').mean(), model='additive')

fig = decomposition.plot()
plt.show()

### Q6
#### Create a lag plot (day over day).  

In [None]:
# lag_plot expects a Series, so select the column explicitly
pd.plotting.lag_plot(train.sale_amount.resample('D').mean(), lag=1)

### Q7
#### Run a lag correlation.   

In [None]:
# pair each day (t) with the previous day (t-1)
df_corr = pd.concat([train.shift(1), train], axis=1)
df_corr.columns = ['t-1', 't']
result = df_corr.corr()
print(result)
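The same day-over-day correlation is available through `Series.autocorr`, which also makes it easy to compare several lags. The sketch below is illustrative only: the helper `lag_correlations` and the synthetic series with a 7-day cycle are assumptions.

```python
import numpy as np
import pandas as pd


def lag_correlations(series, lags=(1, 7, 14)):
    """Return the Pearson correlation of a series with itself at each lag.

    Hypothetical helper built on pandas' Series.autocorr.
    """
    return pd.Series({lag: series.autocorr(lag=lag) for lag in lags}, name="autocorr")


# Synthetic daily series with a 7-day cycle
idx = pd.date_range("2021-01-01", periods=70, freq="D")
toy = pd.Series(np.sin(2 * np.pi * np.arange(70) / 7), index=idx)

corrs = lag_correlations(toy)
print(corrs)

# The shift-and-concat approach gives the same lag-1 value
manual = pd.concat([toy.shift(1), toy], axis=1).corr().iloc[0, 1]
```

A correlation that spikes at lag 7 relative to lag 1 would suggest a weekly pattern worth checking in the decomposition.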

## OPS Data

### Q1

#### Split your data into train and test using the percent cutoff method.  

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv')

In [None]:
# convert to datetime, drop the original string column, and index by date
df['date'] = pd.to_datetime(df['Date'])
df = df.drop(columns='Date').set_index('date').resample('D').sum()

# Set the train size to be 66% of the total number of rows.
# Compute how many rows that is.
train_size = int(len(df) * 0.66)

# The first 66% of rows (in date order) become the 'train' sample.
# The remaining rows, through the end of the dataframe, become the 'test' sample.

train, test = df.iloc[:train_size], df.iloc[train_size:]
print('Observations: %d' % (len(df)))
print('Training Observations: %d' % (len(train)))
print('Testing Observations: %d' % (len(test)))
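The percent cutoff logic can be wrapped in a small function so the split is reusable. This is a minimal sketch: the name `percent_cutoff_split` and the synthetic frame are assumptions, and it relies on the rows already being sorted by date.

```python
import numpy as np
import pandas as pd


def percent_cutoff_split(df, train_frac=0.66):
    """Split a date-sorted frame into the first train_frac of rows and the rest.

    Hypothetical helper; assumes the index is already in chronological order.
    """
    cutoff = int(len(df) * train_frac)
    return df.iloc[:cutoff], df.iloc[cutoff:]


# Synthetic daily frame standing in for the OPS data
idx = pd.date_range("2020-01-01", periods=100, freq="D")
toy = pd.DataFrame({"Consumption": np.arange(100.0)}, index=idx)

train_toy, test_toy = percent_cutoff_split(toy)
print(train_toy.index.max(), "<", test_toy.index.min())
```

Checking that the last training date precedes the first test date is a quick guard against leaking future observations into train.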

### Q2
#### Validate your splits by plotting X_train and y_train.  

In [None]:
train.Consumption.plot(figsize=(12,4))
test.Consumption.plot(figsize=(12,4))
plt.show()

### Q3
#### Plot the weekly average & the 7-day moving average. Compare the 2 plots. 

In [None]:
train.resample('W').mean().plot(figsize=(12, 4))
plt.show()

In [None]:
train.rolling(7).mean().plot(figsize=(12, 4))
plt.show()

### Q4

#### Group the electricity consumption time series by month of year, to explore annual seasonality.

We will plot all 3 variables for fun :)

In [None]:
# create the month column
df['month'] = df.index.month

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(11, 10), sharex=True)
for name, ax in zip(['Consumption', 'Solar', 'Wind'], axes):
    sns.boxplot(data=df, x='month', y=name, ax=ax)
    ax.set_ylabel('GWh')
    ax.set_title(name)
    # Remove the automatic x-axis label from all but the bottom subplot
    if ax != axes[-1]:
        ax.set_xlabel('')
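The boxplots group by month visually; the same grouping can be done numerically with `groupby` on the index month. The sketch below uses a synthetic series with a winter peak (an assumption, not the OPS data) to show the shape of the result.

```python
import numpy as np
import pandas as pd

# Synthetic two-year daily series that peaks around January 1 (illustrative only)
idx = pd.date_range("2019-01-01", "2020-12-31", freq="D")
toy = pd.Series(
    100 + 10 * np.cos(2 * np.pi * (idx.dayofyear - 1) / 365.25),
    index=idx,
    name="Consumption",
)

# One row per month of year (1-12), pooled across years
monthly_profile = toy.groupby(toy.index.month).agg(["mean", "median", "std"])
print(monthly_profile)
```

Pooling across years like this highlights annual seasonality, at the cost of hiding any year-over-year trend.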

### Q5

#### Plot the daily difference. Observe whether usage seems to vary drastically from day to day or has more of a smooth transition.   

In [None]:
# periods=1 gives the day-over-day difference
train.diff(periods=1).plot(figsize=(12, 4), alpha=0.7, linewidth=0.5)

### Q6
#### Plot a time series decomposition. Takeaways?   

In [None]:
decomposition = sm.tsa.seasonal_decompose(train.resample('W').mean(), model='additive')

fig = decomposition.plot()
plt.show()

### Q7
#### Create a lag plot (day over day).  

In [None]:
pd.plotting.lag_plot(train.Consumption, lag=1, c='red', alpha=0.5)
pd.plotting.lag_plot(train.Wind, lag=1, c='orange', alpha=0.25)
pd.plotting.lag_plot(train.Solar, lag=1, c='green', alpha=0.25)
pd.plotting.lag_plot(train['Wind+Solar'], lag=1, c='blue', alpha=0.25)

### Q8
#### Run a lag correlation.   

In [None]:
# for each column, correlate each day (t) with the previous day (t-1)
for col in train.columns:
    df_corr = pd.concat([train[col].shift(1), train[col]], axis=1)
    df_corr.columns = ['t-1', 't']
    result = df_corr.corr()
    print(col, ":\n", result, "\n")