In [None]:
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Overview

In this notebook, you learn how to load, explore, visualize, and pre-process a time-series dataset. The output is a processed dataset that the following notebooks use to build a machine learning model.

### Dataset

Public domain datasets used in this notebook:

* U.S. Bureau of Economic Analysis, Total Vehicle Sales [TOTALSA], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/TOTALSA, September 13, 2020.
* U.S. Bureau of Labor Statistics, Unemployment Rate [UNRATE], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/UNRATE, September 13, 2020.

### Objective

The goal is to forecast total vehicle sales in the USA, based on previous sales and the unemployment rate.

## Install packages and dependencies

Restarting the kernel may be required to use new packages.

In [None]:
%pip install -U statsmodels --user

**Note:** To restart the kernel, select Kernel > Restart Kernel... from the Jupyter menu.

### Import libraries and define constants

In [None]:
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
from pandas.plotting import register_matplotlib_converters
from statsmodels.tsa.seasonal import seasonal_decompose

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
# Enter your project and region, then run the cell to make sure the
# Cloud SDK uses the right project for all the commands in this notebook.

PROJECT = "your-project-name"  # REPLACE WITH YOUR PROJECT NAME
REGION = "us-central1"  # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# Don't change the following line. It checks that you have changed the project name above.
# (Double quotes are used so the apostrophe in "Don't" is kept in the message.)
assert PROJECT != 'your-project-name', "Don't forget to change the project variables!"

In [None]:
target = 'TOTALSA' # The variable we are predicting
target_description = 'Total Vehicle Sales' # A description of the target variable
features = {'UNRATE': 'Unemployment Rate'} # Other features to include in the model
ts_col = 'DATE' # The name of the column with the date field

monthly_file = 'vehicle_sales.csv' # Which file to save the results to

## Load data

In [None]:
# Import CSV files
# Build one download URL per series (features first, then the target)
urls = [f'https://fred.stlouisfed.org/graph/fredgraph.csv?id={series_id}' for series_id in list(features.keys()) + [target]]
dfs = [pd.read_csv(url, index_col=[0], parse_dates=[0]) for url in urls]

# Concatenate dataframes: keep only months available in all files, and drop everything from 1/1/2020 onward
df = pd.concat(dfs, axis=1, join='inner').sort_index()
df = df[df.index < '2020-01-01']
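
To see what the inner join and the date filter do, here is a minimal sketch on two small made-up series. The names `a`, `b`, and `combined`, and all values, are illustrative assumptions and do not come from FRED.

```python
import pandas as pd

# Hypothetical toy series standing in for two downloaded files
a = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(['2019-11-01', '2019-12-01', '2020-01-01']),
    name='A',
)
b = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(['2019-10-01', '2019-11-01', '2019-12-01']),
    name='B',
)

# join='inner' keeps only the dates present in both series
combined = pd.concat([a, b], axis=1, join='inner').sort_index()

# The strict comparison drops 2020-01-01 and anything later
combined = combined[combined.index < '2020-01-01']

print(combined)
```

Only November and December 2019 appear in both series, so those are the only rows left. October 2019 is missing from `a`, and January 2020 is both missing from `b` and excluded by the filter.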

## Explore data

In [None]:
# Print the top 5 rows

df.head()

### TODO 1: Analyze the patterns

* Is there seasonality?
* What is the relationship between variables?
* Does one variable lead the other? 
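
One possible way to approach the last question is to correlate the target with lagged copies of a feature and see which lag gives the strongest relationship. The sketch below uses synthetic data in which the target follows the feature by three months; `lagged_correlations` is a hypothetical helper, not part of pandas or this lab.

```python
import numpy as np
import pandas as pd


def lagged_correlations(feature, target, max_lag=12):
    """Correlate target at month t with feature at month t - lag, for each lag."""
    return pd.Series(
        {lag: target.corr(feature.shift(lag)) for lag in range(max_lag + 1)}
    )


# Synthetic monthly data (assumption): the target trails the feature by 3 months
rng = np.random.default_rng(0)
idx = pd.date_range('2000-01-01', periods=120, freq='MS')
feature_demo = pd.Series(rng.normal(size=120), index=idx)
target_demo = feature_demo.shift(3) + 0.1 * rng.normal(size=120)

correlations = lagged_correlations(feature_demo, target_demo)
best_lag = correlations.abs().idxmax()
print(f'Strongest correlation at a lag of {best_lag} months')
```

With the real data you could pass `df['UNRATE']` and `df[target]` instead. A strong correlation at a positive lag suggests the feature leads the target, but it does not establish causation.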

In [None]:
register_matplotlib_converters() # Addresses a warning
sns.set(rc={'figure.figsize':(16,4)})

# Show how each feature relates to the target variable
for code, description in features.items():
    ax = sns.lineplot(data=df[target], color='g')  # Target on the left y-axis
    ax2 = ax.twinx()  # Second y-axis that shares the same x-axis
    sns.lineplot(data=df[code], color='b', ax=ax2).set_title(f'{description} x {target_description}')
    plt.show()

### TODO 2: Review summary statistics

* How many records are in the dataset?
* What is the average number of vehicles sold per month (in millions)?

In [None]:
df[target].describe()

### TODO 3: Explore seasonality

* Is there much difference between months?
* Can we extract the trend and seasonal pattern from the data?

In [None]:
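# Show the distribution of values for each month in a boxplot:
# the box spans the 25th to 75th percentile with a line at the median;
# by default the whiskers reach the most extreme points within 1.5x the
# interquartile range, and points beyond them are drawn as outliers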

months = df.index.to_series().dt.month

_ = sns.boxplot(x=months, y=df[target])

In [None]:
# Decompose the data into trend and seasonal components

result = seasonal_decompose(df[target], period=12)
fig = result.plot()
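
To build intuition for what an additive decomposition separates, here is a simplified pandas-only sketch on synthetic data: a centered moving average estimates the trend, and the average detrended value per calendar month estimates the seasonal pattern. It illustrates the general idea only and is not claimed to reproduce `seasonal_decompose` exactly; all names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (assumption): linear trend plus a fixed yearly pattern
idx = pd.date_range('2010-01-01', periods=60, freq='MS')
month_numbers = idx.month.to_numpy()
pattern = np.sin(2 * np.pi * (month_numbers - 1) / 12)
series = pd.Series(np.linspace(10, 20, 60) + pattern, index=idx)

# Trend: 12-month moving average, averaged again over 2 points and
# shifted back 6 months so the window is centered on each month
trend = series.rolling(12).mean().rolling(2).mean().shift(-6)

# Seasonal component: mean detrended value for each calendar month,
# re-centered so the 12 monthly effects sum to zero
detrended = series - trend
seasonal_means = detrended.groupby(detrended.index.month).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
seasonal = pd.Series(seasonal_means.loc[month_numbers].to_numpy(), index=idx)

# Residual: whatever the trend and seasonal parts do not explain
residual = series - trend - seasonal
print(seasonal.head(12).round(3))
```

The trend is undefined for the first and last six months because the centered window needs data on both sides.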

## Export data

This generates a CSV file, which you will use in the next labs of this quest.
Inspect the CSV file to see what the data looks like.

In [None]:
df.to_csv(monthly_file, index=True, index_label=ts_col)
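
One way to inspect the exported file is to read it back with pandas and check that the date column round-trips as the index. The sketch below writes a small made-up frame to a temporary file; `df_demo`, the file name, and the values are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame with the same column layout as the lab's dataset
df_demo = pd.DataFrame(
    {'UNRATE': [3.6, 3.5], 'TOTALSA': [17.5, 17.2]},
    index=pd.to_datetime(['2019-11-01', '2019-12-01']),
)

# Write the index out under the name 'DATE', as the export cell does with ts_col
path = os.path.join(tempfile.mkdtemp(), 'vehicle_sales_demo.csv')
df_demo.to_csv(path, index=True, index_label='DATE')

# Read it back, parsing 'DATE' as dates and using it as the index again
reloaded = pd.read_csv(path, index_col='DATE', parse_dates=['DATE'])
print(reloaded)
```

For the real export, the same `read_csv` call with `monthly_file` and `ts_col` should return the data with a datetime index.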

## Conclusion

You've successfully completed the exploration and visualization lab.
You've learned how to:
* Load CSV files and combine them into a single time series
* Visualize data
* Decompose time series into trend and seasonal components