# Exercises

Remember to document your thoughts and any takeaways as you work through visualizations!

Using your store items data you prepped in lesson 2 exercises:

1. Split your data into train and test using the sklearn.model_selection.TimeSeriesSplit method.
2. Validate your splits by plotting X_train and y_train.
3. Plot the weekly average & the 7-day moving average. Compare the 2 plots.
4. Plot the daily difference. Observe whether usage seems to vary drastically from day to day or has more of a smooth transition.
5. Plot a time series decomposition.
6. Create a lag plot (day over day).
7. Run a lag correlation.
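
As a rough sketch of steps 3, 4 and 7, the block below uses a synthetic daily series (an assumption standing in for the prepped `sales_total` data) to show how the weekly average, the 7-day moving average, the daily difference and a lag correlation can each be computed with pandas before plotting.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for the prepped sales data (assumption):
# a linear trend plus a weekly cycle.
idx = pd.date_range("2013-01-01", periods=364, freq="D")
days = np.arange(len(idx))
sales = pd.Series(
    100 + 0.1 * days + 10 * np.sin(2 * np.pi * days / 7),
    index=idx,
    name="sales_total",
)

# Step 3: the weekly average has one point per calendar week, while the
# 7-day moving average has one point per day (mean of the trailing 7 days).
weekly_avg = sales.resample("W").mean()
rolling_avg = sales.rolling(7).mean()

# Step 4: day-over-day change; the first value is NaN because it has no predecessor.
daily_diff = sales.diff()

# Step 7: correlation between each day and the day `lag` days earlier.
lag_corr = {lag: sales.corr(sales.shift(lag)) for lag in (1, 7)}
print(lag_corr)
```

Each result is a pandas Series, so calling `.plot()` on it (with matplotlib installed) gives the corresponding chart. On this synthetic series the lag-7 correlation is higher than the lag-1 correlation because the weekly cycle lines up with itself every 7 days.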

Using your OPSD (Open Power System Data) data you prepped in lesson 2 exercises:

1. Split your data into train and test using the percent cutoff method.
2. Validate your splits by plotting X_train and y_train.
3. Plot the weekly average & the 7-day moving average. Compare the 2 plots.
4. Group the electricity consumption time series by month of year, to explore annual seasonality.
5. Plot the daily difference. Observe whether usage seems to vary drastically from day to day or has more of a smooth transition.
6. Plot a time series decomposition. Takeaways?
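
For step 4, one possible approach is to group by the month number of the datetime index and average across years. The sketch below uses a synthetic `Consumption` series (an assumption, shaped to peak in winter) rather than the real OPSD values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Consumption column (assumption): higher in winter.
idx = pd.date_range("2006-01-01", "2008-12-31", freq="D")
day_of_year = idx.dayofyear.to_numpy()
consumption = pd.Series(
    1300 + 150 * np.cos(2 * np.pi * (day_of_year - 1) / 365.25),
    index=idx,
    name="Consumption",
)

# Average over all years for each calendar month (1 = January ... 12 = December).
by_month = consumption.groupby(consumption.index.month).mean()
print(by_month.round(1))
```

A bar chart of `by_month` (for example `by_month.plot.bar()`) then shows the annual pattern at a glance.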

If time:

For each store I want to see how many items were sold over a period of time, for each item. Find a way to chart this. Hints: Subplots for the piece with the fewest distinct values (like store), x = time, y = count, color = item. If you have too many distinct items, you may need to plot the top n, while aggregating the others into an 'other' bucket.
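
The "top n plus other" hint can be sketched as follows. The table is tiny and made up for illustration (only the column names follow the prepped store data), and `top_n` and `item_group` are hypothetical names.

```python
import pandas as pd

# Tiny hypothetical sales table (assumption): values are made up.
sales = pd.DataFrame({
    "sale_date": pd.to_datetime(["2013-01-01"] * 4 + ["2013-01-02"] * 4),
    "store_id": [1, 1, 1, 2, 1, 1, 2, 2],
    "item_name": ["rice", "beans", "tea", "rice", "rice", "comb", "beans", "tea"],
    "sale_amount": [13, 5, 2, 20, 11, 1, 7, 3],
})

top_n = 2
# Rank items by total units sold and keep the names of the top n.
top_items = sales.groupby("item_name")["sale_amount"].sum().nlargest(top_n).index
# Relabel every other item as 'other'.
sales["item_group"] = sales["item_name"].where(
    sales["item_name"].isin(top_items), "other"
)

# One row per (store, date), one column per item group.
counts = (
    sales.groupby(["store_id", "sale_date", "item_group"])["sale_amount"]
    .sum()
    .unstack("item_group", fill_value=0)
)
print(counts)
```

From here, `counts.loc[store]` gives a date-indexed frame for one store, which maps onto one subplot per store with x = time, y = units sold and one colored line per item group.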

In [49]:
# data manipulation 
import numpy as np
import pandas as pd

from datetime import datetime
import itertools

# data visualization 
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

%matplotlib inline

from sklearn.model_selection import TimeSeriesSplit

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import acquire
import prepare

df = acquire.get_all_data()
df.head()

,item_id,sale_amount,sale_date,sale_id,store_id,item_brand,item_name,item_price,item_upc12,item_upc14,store_address,store_city,store_state,store_zipcode
0,1,13.0,"Tue, 01 Jan 2013 00:00:00 GMT",1,1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
1,1,11.0,"Wed, 02 Jan 2013 00:00:00 GMT",2,1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
2,1,14.0,"Thu, 03 Jan 2013 00:00:00 GMT",3,1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
3,1,13.0,"Fri, 04 Jan 2013 00:00:00 GMT",4,1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253
4,1,10.0,"Sat, 05 Jan 2013 00:00:00 GMT",5,1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253


Using the prep_store_data() function from prepare.py

In [50]:
df = prepare.prep_store_data(df)

In [51]:
df.head()

sale_date,item_id,sale_amount,sale_id,store_id,item_brand,item_name,item_price,item_upc12,item_upc14,store_address,store_city,store_state,store_zipcode,month,weekday,sales_total
2013-01-01 00:00:00+00:00,1,13.0,1,1,Riceland,Riceland American Jazmine Rice,0.84,35200264013,35200264013,12125 Alamo Ranch Pkwy,San Antonio,TX,78253,01-Jan,2-Tue,10.92
2013-01-01 00:00:00+00:00,17,26.0,295813,3,Ducal,Ducal Refried Red Beans,1.16,88313590791,88313590791,2118 Fredericksburg Rdj,San Antonio,TX,78201,01-Jan,2-Tue,30.16
2013-01-01 00:00:00+00:00,7,32.0,125995,10,Twinings Of London,Twinings Of London Classics Lady Grey Tea - 20 Ct,9.64,70177154004,70177154004,8503 NW Military Hwy,San Antonio,TX,78231,01-Jan,2-Tue,308.48
2013-01-01 00:00:00+00:00,18,45.0,314073,3,Scotch,Scotch Removable Clear Mounting Squares - 35 Ct,4.39,21200725340,21200725340,2118 Fredericksburg Rdj,San Antonio,TX,78201,01-Jan,2-Tue,197.55
2013-01-01 00:00:00+00:00,19,34.0,332333,3,Careone,Careone Family Comb Set - 8 Ct,0.74,41520035646,41520035646,2118 Fredericksburg Rdj,San Antonio,TX,78201,01-Jan,2-Tue,25.16


In [52]:
df.shape

(913000, 16)

In [53]:
target_vars = ['sales_total']

# total sales per day
df1 = df[target_vars].resample('D').sum()

# total sales per week (weekly bins end on Sunday)
df2 = df[target_vars].resample('W').sum()

In [69]:
print(df1.shape)
print(df2.shape)

(1826, 1)
(261, 1)


In [70]:
X = df2['sales_total']
y = df2.index

In [71]:
tss = TimeSeriesSplit(n_splits=5, max_train_size=None)

In [72]:
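train_indices = []
test_indices = []
for train_index, test_index in tss.split(X):
    # use .iloc: the split yields integer positions, not date labels
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
    train_indices.append(train_index)
    test_indices.append(test_index)

# plot each fold: dates (y) on the horizontal axis, weekly sales (X) on the vertical
for i in range(len(train_indices)):
    plt.figure(figsize=(16, 4))
    plt.plot(y[train_indices[i]], X.iloc[train_indices[i]], label='train')
    plt.plot(y[test_indices[i]], X.iloc[test_indices[i]], label='test')
    plt.legend()
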

In [73]:
train_indices[0]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45])

In [76]:
X_train.head()

sale_date
2013-01-06 00:00:00+00:00    490767.50
2013-01-13 00:00:00+00:00    559934.21
2013-01-20 00:00:00+00:00    552813.52
2013-01-27 00:00:00+00:00    554908.84
2013-02-03 00:00:00+00:00    586547.55
Name: sales_total, dtype: float64

In [77]:
X_train.shape

(218,)
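
Note that `X_train` holds only the last fold after the loop finishes, which is why its length is 218 while `train_indices[0]` has 46 entries. The sketch below reproduces the fold sizes for 261 weekly rows, using a plain NumPy array of positions in place of the real data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_weeks = 261  # number of weekly rows, matching df2.shape
tss = TimeSeriesSplit(n_splits=5)

# Collect (train size, first test position, last test position) for each fold.
positions = np.arange(n_weeks).reshape(-1, 1)
folds = [
    (len(train_idx), int(test_idx[0]), int(test_idx[-1]))
    for train_idx, test_idx in tss.split(positions)
]
for fold in folds:
    print(fold)
```

With the default settings each test window has `261 // (5 + 1) = 43` rows, and every training window is everything that comes before its test window, so the training set grows by 43 rows per fold.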

## OPSD Dataset

In [28]:
df = pd.read_csv('https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv')

In [29]:
df.head()

,Date,Consumption,Wind,Solar,Wind+Solar
0,2006-01-01,1069.184,,,
1,2006-01-02,1380.521,,,
2,2006-01-03,1442.533,,,
3,2006-01-04,1457.217,,,
4,2006-01-05,1477.131,,,


#### Resampling with `.sum()` converts all NaNs to zero, which also hides which values were originally missing

In [30]:
df['date'] = pd.to_datetime(df['Date'])
# drop the original string column so only the numeric columns are summed
df = df.drop(columns='Date').set_index('date').resample('D').sum()

In [31]:
df.head()

date,Consumption,Wind,Solar,Wind+Solar
2006-01-01,1069.184,0.0,0.0,0.0
2006-01-02,1380.521,0.0,0.0,0.0
2006-01-03,1442.533,0.0,0.0,0.0
2006-01-04,1457.217,0.0,0.0,0.0
2006-01-05,1477.131,0.0,0.0,0.0
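
The zeros in `Wind`, `Solar` and `Wind+Solar` come from `.sum()`, which returns 0 for a bin containing only NaNs. If the distinction between "no data" and "zero production" matters, one option is the `min_count` argument. The sketch below uses a three-row toy frame with a made-up `Wind` value.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2006-01-01", periods=3, freq="D")
toy = pd.DataFrame(
    {
        "Consumption": [1069.184, 1380.521, 1442.533],
        "Wind": [np.nan, np.nan, 48.7],  # 48.7 is a made-up value for illustration
    },
    index=idx,
)

summed = toy.resample("D").sum()             # all-NaN bins become 0.0
kept = toy.resample("D").sum(min_count=1)    # all-NaN bins stay NaN

print(summed)
print(kept)
```

Which behavior is preferable depends on the analysis: zeros keep later arithmetic simple, while NaNs keep missing readings from being mistaken for real measurements.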


In [39]:
train_size = int(len(df)*.70)
train, test = df[:train_size], df[train_size:]

In [41]:
len(train), len(test)

(3068, 1315)

In [44]:
len(train)/(len(train)+len(test))

0.6999771845767739
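
Before plotting, a quick sanity check on the percent cutoff is to confirm that no rows were lost and that the training period ends before the test period begins. The sketch below wraps the split in a hypothetical helper, `percent_split`, and runs it on a toy daily frame with the same number of rows as the counts above.

```python
import pandas as pd

def percent_split(frame, train_frac=0.70):
    """Split a time-indexed frame at a fractional cutoff (hypothetical helper)."""
    frame = frame.sort_index()  # the cutoff only makes sense in time order
    cut = int(len(frame) * train_frac)
    return frame.iloc[:cut], frame.iloc[cut:]

# Toy daily frame (assumption): 4383 rows, placeholder values.
idx = pd.date_range("2006-01-01", "2017-12-31", freq="D")
toy = pd.DataFrame({"Consumption": range(len(idx))}, index=idx)

train, test = percent_split(toy)

# Validation: nothing lost, and train ends strictly before test starts.
assert len(train) + len(test) == len(toy)
assert train.index.max() < test.index.min()
print(len(train), len(test))
```

The ratio comes out slightly under 0.70 because `int()` truncates `4383 * 0.70` down to 3068 rows.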