# Lab 4 - Student Notebook

## Overview

In this lab, you will prepare a dataset for creating a forecast by using Amazon Forecast.

This lab includes two Jupyter notebooks:

1. This notebook contains the steps that you will follow to prepare the dataset and evaluate the forecast.
2. The `forecast-autorun.ipynb` notebook contains the steps to create the forecast by using Amazon Forecast. This notebook runs in the background when the lab starts, and it can take 1–2 hours to complete. You will refer to this notebook during the lab steps, but you won't need to run any cells.


## About the dataset

This [Online Retail II](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II) dataset contains all transactions that occurred between December 1, 2009 and December 9, 2011 for a non-store, online retail organization that's registered and based in the United Kingdom. The company mainly sells unique all-occasion giftware. Many customers of the company are wholesalers.


## Attribute information

- **InvoiceNo** – Invoice number. Nominal. A 6-digit integral number that's uniquely assigned to each transaction. If this code starts with the letter *c*, it indicates a cancelation.
- **StockCode** – Product (item) code. Nominal. A 5-digit integral number that's uniquely assigned to each distinct product.
- **Description** – Product (item) name. Nominal.
- **Quantity** – The quantities of each product (item) per transaction. Numeric.
- **InvoiceDate** – Invoice date and time. Numeric. The day and time when a transaction was generated.
- **UnitPrice** – Unit price. Numeric. Product price per unit in pounds sterling (£).
- **CustomerID** – Customer number. Nominal. A 5-digit integral number that's uniquely assigned to each customer.
- **Country** – Country name. Nominal. The name of the country where a customer resides.


## Dataset attributions

This dataset was obtained from:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

## Lab instructions

To complete this lab, read and run the cells below.

## Task 1: Importing Python packages

Start by importing the Python packages that you need.

In the following code:

- *boto3* represents the AWS SDK for Python (Boto3), which is the Python library for AWS
- *pandas* provides DataFrames for manipulating time series data
- *matplotlib* provides plotting functions
- *sagemaker* represents the API that's needed to work with Amazon SageMaker
- *time*, *sys*, *os*, *io*, and *json* provide helper functions 


In [None]:
import warnings
warnings.filterwarnings('ignore')
bucket_name='c33334a421003l774089t1w69430420174-forecastbucket-163su8h2wvi7'

import boto3
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sagemaker
import time, sys, os, io, json


## Task 2: Exploring the data


The data is in the *Microsoft Excel* format. pandas can read Excel files.

**Note:** This data might take 1–2 minutes to load.

In [None]:
retail = pd.read_excel('online_retail_II.xlsx')

According to the description for the dataset, some values are missing. To keep things simple, you will remove every row that has a missing value.

In [None]:
retail = retail.dropna()
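Before dropping rows, it can help to see how many values are missing and in which columns. The following is a minimal sketch on a made-up miniature of the data (the `sample` DataFrame and its values are hypothetical, not taken from the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the retail data; all values are made up.
sample = pd.DataFrame({
    'StockCode': [21232, 21232, 85048, 22041],
    'Description': ['item A', None, 'item B', 'item C'],
    'Customer ID': [13085.0, np.nan, 13085.0, np.nan],
})

# Count the missing values in each column before dropping anything.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
cleaned = sample.dropna()
print(f'{len(sample) - len(cleaned)} of {len(sample)} rows dropped')
```

Running the same `isna().sum()` check on `retail` before calling `dropna()` shows which columns cause most of the removed rows.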

Start by examining the data.

How many rows and columns are in the dataset?

In [None]:
retail.shape

What are the data types?

In [None]:
retail.dtypes

What does the data look like?

In [None]:
retail.head(20)

Amazon Forecast has schemas for domains such as retail. Review the schema information at [RETAIL Domain](https://docs.aws.amazon.com/forecast/latest/dg/retail-domain.html) in the AWS Documentation.

The target time series is the historical time series data for each item or product that's sold by the retail organization. The following fields are required:

- **item_id** (string) – A unique identifier for the item or product that you want to predict the demand for.
- **timestamp** (timestamp)
- **demand** (float) – The number of sales for that item at the timestamp. It's also the target field that Amazon Forecast generates a forecast for.



If you examine the previous data, there are certain columns that you don't need for your investigation. You can drop these columns. The columns you can drop are **Invoice**, **Description**, and **Customer ID**. 

**Note:** It's possible that items in the same order (as shown by the **Invoice** column) could have a correlation that impacts the model. For this lab, you will ignore this possibility.

Drop the columns that you don't need.

In [None]:
retail = retail[['StockCode','Quantity','Price','Country','InvoiceDate']]

The **InvoiceDate** column is your datetime data. You can inform pandas of this by using the `to_datetime` function. You can explore the data by time by setting the index of the DataFrame to the **InvoiceDate** column.

In [None]:
retail['InvoiceDate'] = pd.to_datetime(retail.InvoiceDate)
retail = retail.set_index('InvoiceDate')

You will now examine the updated DataFrame.

The number of rows and columns are:

In [None]:
retail.shape

The new data looks like this example:

In [None]:
retail.head()

Note that **InvoiceDate** is the index, and it's shown in the first column.

Because you set the index to your datetime data, you can use it to select data.

To select all the rows from a specific date, use the date in the index.

In [None]:
retail.loc['2010-01-04']

You can also use parts of a date, and date ranges. To view the rows for January and February 2010:

In [None]:
retail.loc['2010-01':'2010-02']
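The date-based selection above relies on pandas partial string indexing. As a small, self-contained sketch (the `demo` DataFrame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical orders indexed by their invoice timestamp.
demo = pd.DataFrame(
    {'Quantity': [12, 6, 24, 3]},
    index=pd.to_datetime(['2010-01-04 09:15', '2010-01-04 14:30',
                          '2010-02-11 10:00', '2010-03-01 08:45']),
)
demo.index.name = 'InvoiceDate'

# A date string selects every row whose timestamp falls on that day.
one_day = demo.loc['2010-01-04']

# A slice of partial dates is inclusive at both ends,
# so this covers all of January and all of February.
jan_feb = demo.loc['2010-01':'2010-02']

print(len(one_day), len(jan_feb))
```

Note that the slice keeps the February row but not the March 1 row, because the end of the range is the end of February.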

The date range starts at:

In [None]:
retail.index.min()

The date range ends at:

In [None]:
retail.index.max()

With pandas, you can extract date information easily. You might extract date information to explore the data further and look for time-related trends.

Extract the year, month, and day of the week.

In [None]:
retail['Year'] = retail.index.year
retail['Month'] = retail.index.month
retail['weekday_name'] = retail.index.day_name()

In [None]:
retail.head()

The dataset that you now have includes purchases made between December 2009 and December 2010. It's reasonable to assume there would be some seasonality in this data. You will now investigate whether there is seasonality.

In [None]:
retail.Month.value_counts(sort=False).plot(kind='bar')

From the chart, you could deduce some seasonality:

1. November and December seem to be higher than the rest of the year.

2. Q4 seems to be higher than other quarters.

3. For Q1, Q2, and Q3: The last month of each quarter (months 3, 6, and 9) seems to have a spike.

Do you notice any other seasonal patterns?

Now, investigate whether there is any seasonality during the week.

In [None]:
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
retail.weekday_name.value_counts(sort=False).loc[day_order].plot(kind='bar')

Saturday shows very few orders. Why might this be the case?

## Task 3: Cleaning and reducing the size of the data

In this task, you will reduce the size of the data by keeping the orders from a single country. You will also remove anomalies, such as negative or zero quantities, zero prices, and price outliers that are caused by unusual stock codes.

### Reducing the countries
Examine the **Country** data.

In [None]:
retail.Country.unique()

In [None]:
retail.Country.value_counts()

Most of the data seems to be for the United Kingdom. To make your job easier, filter the data by *United Kingdom*.

In [None]:
country_filter = ['United Kingdom']
retail = retail[retail.Country.isin(country_filter)]

Because the **Country** column now contains a single value, you can drop it.

In [None]:
retail = retail[['StockCode','Quantity','Price']]

In [None]:
retail.head()

### Examining StockCode and removing anomalies

Examine the distribution of the **StockCode** column:

In [None]:
retail.StockCode.describe()

There are 4,015 unique values for **StockCode**. A quick plot of the counts might give you some insight into how the values are distributed.

In [None]:
retail.StockCode.value_counts().plot()

It seems that there are a few high-selling products, with a long tail behind them. You could investigate this situation further. However, for now, examine **Quantity**.

In [None]:
retail.Quantity.describe()

In [None]:
retail.Quantity.plot()

From the initial plot, notice a couple of interesting aspects.

1. There appear to be negative quantities.

2. There are very large spikes throughout the year.


Negative and zero quantities could impact the forecast if you don't know why these values exist. To make things easier for now, you will remove negative and zero quantities.

In [None]:
retail = retail[retail.Quantity>0]

Now, examine **Price**.

In [None]:
retail.Price.describe()

In [None]:
retail.Price.plot()

The plot shows some clear price spikes. You will now try to find out why these spikes exist.

In [None]:
retail[retail.Price>500].head()

The **StockCode** value of *M* looks unusual. If you had access to a domain expert, you could learn about the importance of *M*. Because you can't ask a domain expert for this lab, you will drop everything that has a **StockCode** value of *M*.

In [None]:
retail = retail[retail.StockCode!='M']

In [None]:
retail.Price.describe()

This result is better, but the **max** value is still high. You will now investigate this situation further.

In [None]:
retail[retail.Price>300].head(20)

It seems that some adjustments occurred. You will also drop any data that shows these adjustments.

In [None]:
stockcodes = ['ADJUST', 'ADJUST2', 'POST']
retail = retail[~retail.StockCode.isin(stockcodes)]

In [None]:
retail.Price.describe()

You will now examine zero-priced items.

In [None]:
retail[retail.Price==0].count()

There aren't many values in these results, so you can drop zero-priced items.

In [None]:
retail = retail[retail.Price>0]

### Splitting the data

The target time series data that you need to create a forecast requires a *timestamp*, an *item_id*, and a *demand*. These fields map to the **InvoiceDate**, **StockCode**, and **Quantity** columns.

The related time series data needs a *timestamp*, an *item_id*, and a *price*. These fields map to the **InvoiceDate**, **StockCode**, and **Price** columns.

Create the two DataFrames:

In [None]:
df_time_series = retail[['StockCode','Quantity']]
df_related_time_series = retail[['StockCode','Price']]
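To make the column mapping concrete, the following sketch renames a small daily DataFrame to the Amazon Forecast field names and writes it as CSV text. The `daily` DataFrame and its values are made up, and the headerless CSV layout with a `yyyy-MM-dd` date is an assumption about how the data might be exported, not a description of what `forecast-autorun.ipynb` does.

```python
import pandas as pd

# Hypothetical daily totals for one item; the values are made up.
daily = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2009-12-01', '2009-12-02']),
    'StockCode': [21232, 21232],
    'Quantity': [60, 24],
})

# Rename the retail columns to the target time series field names.
target = daily.rename(columns={
    'InvoiceDate': 'timestamp',
    'StockCode': 'item_id',
    'Quantity': 'demand',
})

# Match the schema types: item_id is a string and demand is a float.
target['item_id'] = target['item_id'].astype(str)
target['demand'] = target['demand'].astype(float)

# Keep the columns in the same order as the schema attributes.
target = target[['timestamp', 'item_id', 'demand']]

# Assumed export format: no header row, dates without a time portion.
csv_text = target.to_csv(index=False, header=False, date_format='%Y-%m-%d')
print(csv_text)
```

The column order matters for a headerless file, because the schema attributes are matched to the CSV columns by position.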

### Downsampling

You will now examine a single item.

In [None]:
df_time_series[df_time_series.StockCode==21232].loc['2009-12-01']

You can see multiple orders for each day. You want to create a forecast that predicts demand at a daily level.

You must *downsample* the data from the individual orders into a daily total.

The orders for each day can be summed, because the total demand for the day is the value that you will forecast.

pandas provides the `resample` function for this purpose. `sum` will sum the **Quantity** column. You will also reset the index based on the **InvoiceDate** value. However, this time, it will be a date without the time portion.

**Note:** It might take up to 1 minute for this process to complete.

In [None]:
df_time_series = df_time_series.groupby('StockCode').resample('D').sum().reset_index()
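To see what the grouped downsampling does, here is a minimal sketch on a few made-up orders (the `orders` DataFrame, the item codes *A* and *B*, and the quantities are hypothetical). It selects the **Quantity** column explicitly before resampling, which is a small variation on the lab cell.

```python
import pandas as pd

# Hypothetical individual orders, indexed by invoice timestamp.
orders = pd.DataFrame(
    {'StockCode': ['A', 'A', 'A', 'B'], 'Quantity': [12, 6, 24, 5]},
    index=pd.to_datetime(['2009-12-01 09:00', '2009-12-01 15:30',
                          '2009-12-03 11:00', '2009-12-01 10:00']),
)
orders.index.name = 'InvoiceDate'

# For each item, sum the quantities of all orders on the same day.
daily = (
    orders.groupby('StockCode')['Quantity']
    .resample('D')
    .sum()
    .reset_index()
)
print(daily)
```

Item *A* has no orders on December 2, so `resample` creates that day with a total of 0. Days without orders therefore appear as zero demand rather than as missing rows.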

In [None]:
df_time_series['InvoiceDate'] = pd.to_datetime(df_time_series.InvoiceDate)
df_time_series = df_time_series.set_index('InvoiceDate')
df_time_series.head()

In [None]:
df_time_series = df_time_series.groupby('StockCode').resample('D').sum().reset_index().set_index(['InvoiceDate'])

Examine the new DataFrame.

In [None]:
df_time_series[df_time_series.StockCode==21232]


The item now has a single entry for each day.

Repeat this process with the related time series data.

In [None]:
df_related_time_series.head()

In [None]:
df_related_time_series2 = df_related_time_series.groupby('StockCode').resample('D').mean().reset_index().set_index(['InvoiceDate','StockCode'])

In [None]:
df_related_time_series2.head(20)

**Question:** Why are some of the previous values showing as *NaN*?

**Answer:** That product had no orders for those days, and thus it has no price. Should you fill these NaN values with a numerical value?

In [None]:
retail[retail.StockCode == 10002].loc['2009-12']

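You can use `ffill` to forward-fill the price: each missing value is replaced by the most recent earlier price for the same item. (Older pandas versions also offered `pad` for this purpose, but it was removed in pandas 2.0.)

In [None]:
df_related_time_series3 = df_related_time_series2.groupby('StockCode').ffill()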

In [None]:
df_related_time_series3.head(20)
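The following sketch shows forward filling per item on made-up prices (the `prices` DataFrame, the item codes, and the values are hypothetical). It uses the same index layout as the related time series above.

```python
import numpy as np
import pandas as pd

# Hypothetical daily mean prices; NaN marks days without orders.
prices = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2009-12-01', '2009-12-02', '2009-12-03'] * 2),
    'StockCode': ['A'] * 3 + ['B'] * 3,
    'Price': [0.85, np.nan, np.nan, np.nan, 2.10, np.nan],
}).set_index(['InvoiceDate', 'StockCode'])

# Forward-fill within each item so that prices never leak between items.
filled = prices.groupby('StockCode').ffill()
print(filled)
```

Item *B* has no price on its first day, and that value stays NaN after the fill because there is no earlier price to copy. Gaps at the start of a series need a different strategy, such as a backward fill.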

## Task 4: Reviewing the creation of the forecast

The following cells are Markdown. They demonstrate the API calls that are needed to create a forecast based on the data that you have been working with. Creating a forecast with Amazon Forecast involves three stages:

1. Creating the datasets and importing the data. This process typically takes 5–10 minutes.
2. Creating the predictor. This process trains a model by using the data that you provided. It takes 30–60 minutes to complete.
3. Creating the forecast. This process generates a forecast for a particular item by using the predictor. It also takes 30–60 minutes to complete.

To save time, the `forecast-autorun.ipynb` notebook was also run in the background when this lab started. The notebook is updated with the results after the run completes. It takes about 65 minutes to run, but it might take a little longer. By the time you review this cell, the forecast creation should be in progress. While it finishes, you will review the code.

**Note:** Feel free to review the actual `forecast-autorun.ipynb` notebook if you want some more detail. However, make sure that you don't run any cells!

### Creating the datasets and importing the data

The first step is to create a Forecast Dataset Group:

```python
session = boto3.Session()
forecast = session.client(service_name='forecast') 
create_dataset_group_response = forecast.create_dataset_group(DatasetGroupName=dataset_group_name, Domain="RETAIL")
dataset_group_arn = create_dataset_group_response['DatasetGroupArn']
```
    
The `create_dataset` function requires a few parameters:

- **Domain** – This parameter specifies the domain, such as *RETAIL*, that the forecast should use.
- **DatasetType** – For the time series data, this parameter will be set to *TARGET_TIME_SERIES*.
- **DatasetName** – This parameter specifies the name of the dataset.
- **DataFrequency** – This parameter specifies the frequency. For the daily dataset, it will be *D*.
- **Schema** – This parameter specifies the schema of the dataset.

The dataset schema for the time series data is:

```python
schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      },
      {
         "AttributeName":"demand",
         "AttributeType":"float"
      }
   ]
}
```


The code to create the dataset is:

```python
time_series_response=forecast.create_dataset(
                    Domain="RETAIL",
                    DatasetType='TARGET_TIME_SERIES',
                    DatasetName='retail_time_series_data',
                    DataFrequency='D', 
                    Schema = schema
)
dataset_arn = time_series_response['DatasetArn']
```
    
Now that the dataset is defined, a job is needed to import the data:

```python
ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName='retail_import_job',
                                                      DatasetArn=dataset_arn,
                                                      DataSource= data_source,
                                                      TimestampFormat=timestamp_format
                                                     )
```

Note that the *data_source* is a path to the data that's stored in Amazon Simple Storage Service (Amazon S3).
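The import job above uses `data_source` and `timestamp_format` without showing their values. As a hedged sketch, they might look like the following. The bucket name, file name, and role ARN are placeholders, and the date-only timestamp format is an assumption that fits daily data.

```python
# Placeholder values only: replace the bucket, file, and role with your own.
bucket_name = 'example-forecast-bucket'
role_arn = 'arn:aws:iam::123456789012:role/ExampleForecastRole'

# The data source points to the CSV file in Amazon S3 and names the
# IAM role that Amazon Forecast assumes to read the file.
data_source = {
    'S3Config': {
        'Path': f's3://{bucket_name}/retail_time_series.csv',
        'RoleArn': role_arn,
    }
}

# Assumed format for daily data: dates without a time portion.
timestamp_format = 'yyyy-MM-dd'

print(data_source['S3Config']['Path'])
```

The timestamp format must match the way the dates are written in the CSV file.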

The final step is to add the dataset to the dataset group:

```python
forecast.update_dataset_group(DatasetGroupArn=dataset_group_arn, DatasetArns=[dataset_arn])
```
    

The process of adding the related data or metadata is done in the same way, by changing the names, schema, and dataset type. Although you have prepared this data, you won't use it in the predictor because the model wasn't impacted by the additional data.
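For illustration, the arguments for a related time series dataset might look like the following sketch. The dataset name is hypothetical, and the schema assumes the *price* field that was prepared earlier in this notebook.

```python
# Schema for the related time series: the same keys as the target
# time series, with price in place of demand.
related_schema = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "item_id", "AttributeType": "string"},
        {"AttributeName": "price", "AttributeType": "float"},
    ]
}

# Keyword arguments for create_dataset; the dataset name is hypothetical.
related_dataset_kwargs = {
    "Domain": "RETAIL",
    "DatasetType": "RELATED_TIME_SERIES",
    "DatasetName": "retail_related_time_series_data",
    "DataFrequency": "D",
    "Schema": related_schema,
}

print(related_dataset_kwargs["DatasetType"])
```

These arguments could then be passed as `forecast.create_dataset(**related_dataset_kwargs)`, followed by an import job and an update to the dataset group, as shown for the target time series.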

### Creating the predictor

The next step is to create the predictor. The `create_predictor` command needs a few parameters:

- **PredictorName** – This parameter specifies the name that you want to give the predictor.

    ```python
    predictor_name= prefix+'_deeparp_algo'
    ```


- **AlgorithmArn** – This parameter is the path to the algorithm that you want to use. In this example, you will use DeepAR+.

    ```python
    algorithm_arn = 'arn:aws:forecast:::algorithm/Deep_AR_Plus'
    ```


- **EvaluationParameters** – This parameter enables you to specify the number and size of the back test windows. Recall from the module that this parameter controls the size and number of testing windows that are created from the data.

    ```python
    evaluation_parameters= {"NumberOfBacktestWindows": 1, "BackTestWindowOffset": 30}
    ```


- **ForecastHorizon** – How many units to forecast (in this case, the units are days).

    ```python
    forecast_horizon = 30
    ```


- **InputDataConfig** – This parameter specifies the dataset group, along with optional supplementary features such as the public holidays for a country.

    ```python
    input_data_config = {"DatasetGroupArn": dataset_group_arn, "SupplementaryFeatures": [ {"Name": "holiday","Value": "UK"} ]}
    ```


- **FeaturizationConfig** – This parameter sets the frequency, but it can also be used to specify filling methods for data.

    ```python
    featurization_config= {"ForecastFrequency": dataset_frequency }
    ```
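To build intuition for the evaluation parameters, the following sketch splits a made-up range of dates the way a single backtest window with an offset of 30 days would. It is a simplified illustration of the idea, not the service's internal logic, and the date range is hypothetical.

```python
import pandas as pd

forecast_horizon = 30          # days to predict
back_test_window_offset = 30   # days held back from the end of the data

# Hypothetical range of daily timestamps in the training data.
days = pd.date_range('2010-01-01', '2010-10-31', freq='D')

# The model trains on everything before the offset...
train_days = days[:-back_test_window_offset]

# ...and is evaluated on the forecast horizon that follows it.
test_days = days[-back_test_window_offset:][:forecast_horizon]

print(train_days[-1].date(), test_days[0].date(), len(test_days))
```

With an offset equal to the forecast horizon, the backtest covers exactly the last 30 days of the data.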

The code to create the predictor is:

```python
create_predictor_response=forecast.create_predictor(PredictorName = predictor_name,
      AlgorithmArn = algorithm_arn,
      ForecastHorizon = forecast_horizon,
      PerformAutoML = False,
      PerformHPO = False,
      EvaluationParameters= evaluation_parameters, 
      InputDataConfig = input_data_config,
      FeaturizationConfig = featurization_config
     )
```
                                                 
After the predictor is created, you can create a forecast.

### Creating the forecast

To create the forecast, use the `create_forecast` method:

```python
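predictor_arn = create_predictor_response['PredictorArn']

create_forecast_response=forecast.create_forecast(ForecastName=forecast_Name,
                                                  PredictorArn=predictor_arn)
# The forecast ARN is needed later to check the status and to query the forecast
forecast_arn = create_forecast_response['ForecastArn']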
```

After the forecast is generated, the results can be queried by using the `query_forecast` method:

```python
forecast_response = forecast_query.query_forecast(
    ForecastArn=forecast_arn,
    Filters={"item_id":"22423"}
)
```


## Task 5: Waiting for the forecast creation to complete

The forecast should now be created. You can investigate to see whether the forecast creation is complete.

First, create a helper method to show the status.

In [None]:
import sys

class StatusIndicator:
    
    def __init__(self):
        self.previous_status = None
        self.need_newline = False
        
    def update( self, status ):
        if self.previous_status != status:
            if self.need_newline:
                sys.stdout.write("\n")
            sys.stdout.write( status + " ")
            self.need_newline = True
            self.previous_status = status
        else:
            sys.stdout.write(".")
            self.need_newline = True
        sys.stdout.flush()

    def end(self):
        if self.need_newline:
            sys.stdout.write("\n")

Next, create instances of the forecast and the forecast query objects.

In [None]:
bucket='mlf-lab4-forecastbucket-12sb9sjex9iv'

session = boto3.Session() 
forecast = session.client(service_name='forecast') 
forecast_query = session.client(service_name='forecastquery')

You will read the variables from the store, and check whether the forecast was defined. After the forecast is defined, you will wait until its status becomes active.

In [None]:
print('Waiting for the forecast ARN to be available')
while True:
    %store -r
    is_local = "forecast_arn" in locals()
    if is_local: break
    print('.',end='')
    time.sleep(10)

print('Waiting for the predictor to be available')
status_indicator_predictor = StatusIndicator()
while True:
    status = forecast.describe_predictor(PredictorArn=predictor_arn)['Status']
    status_indicator_predictor.update(status)
    if status in ('ACTIVE', 'CREATE_FAILED'): break
    time.sleep(10)

status_indicator_predictor.end()
    
print('Waiting for forecast to be available')
status_indicator = StatusIndicator()
while True:
    status = forecast.describe_forecast(ForecastArn=forecast_arn)['Status']
    status_indicator.update(status)
    if status in ('ACTIVE', 'CREATE_FAILED'): break
    time.sleep(10)

status_indicator.end()
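The polling loops above run until a final status is reached. As an optional variation, the following sketch wraps the same idea in a helper that also gives up after a timeout. The `wait_for_status` function and its parameters are hypothetical, and the example uses a fake status source so that it runs without AWS access.

```python
import time

def wait_for_status(get_status, done_states=('ACTIVE', 'CREATE_FAILED'),
                    poll_seconds=10, timeout_seconds=7200, sleep=time.sleep):
    """Poll get_status() until it returns a final state or the timeout passes."""
    waited = 0
    while True:
        status = get_status()
        if status in done_states:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f'Still {status} after {waited} seconds')
        sleep(poll_seconds)
        waited += poll_seconds

# Fake status sequence that stands in for the describe_* calls.
fake_statuses = iter(['CREATE_PENDING', 'CREATE_IN_PROGRESS', 'ACTIVE'])
final_status = wait_for_status(lambda: next(fake_statuses),
                               sleep=lambda seconds: None)
print(final_status)
```

In the lab, the status source could be `lambda: forecast.describe_forecast(ForecastArn=forecast_arn)['Status']`. Passing the `sleep` function as a parameter keeps the helper easy to test.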

## Task 6: Using the forecast

At this point, there should be a forecast that's ready to be queried.

Check that you get data for the following test stock code: *21232*

In [None]:
print()
forecast_response = forecast_query.query_forecast(
    ForecastArn=forecast_arn,
    Filters={"item_id":"21232"}
)
print(forecast_response)

### Plotting the actual results

Earlier, you split the data and held back the *November* and *December* values. You will plot these values against the predicted values for the same time period.

You will start by reading the test values back into a DataFrame.


In [None]:
actual_df = pd.read_csv(test, names=['InvoiceDate','StockCode','Quantity'])
actual_df['InvoiceDate'] = pd.to_datetime(actual_df.InvoiceDate)
actual_df = actual_df.set_index('InvoiceDate')
actual_df.head()

Filter the data so that it contains only the *21232* stock code.

In [None]:
stockcode_filter = ['21232']
actual_df = actual_df[actual_df['StockCode'].isin(stockcode_filter)]

In [None]:
actual_df.head()

You can do a quick plot of the data. Remember that this data is test data, so the actual values are plotted. In the next step, you will plot the predicted values.

In [None]:
actual_df.Quantity.plot()

### Plotting the prediction

Next, you must convert the JSON response from the forecast query to a DataFrame that you can plot.

Start by getting the P10 predictions.


In [None]:
# Generate DF 
prediction_df_p10 = pd.DataFrame.from_dict(forecast_response['Forecast']['Predictions']['p10'])
prediction_df_p10.head()

Next, plot the P10 predictions.

In [None]:
# Plot
prediction_df_p10.plot()


The previous code only retrieved the P10 values and put them in a DataFrame. Now, complete the same process for the P50 and P90 values.


In [None]:
prediction_df_p50 = pd.DataFrame.from_dict(forecast_response['Forecast']['Predictions']['p50'])
prediction_df_p90 = pd.DataFrame.from_dict(forecast_response['Forecast']['Predictions']['p90'])
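The response structure that these cells rely on can be sketched with a mock response. The timestamps and values below are made up, and `mock_response` only imitates the `Forecast` → `Predictions` nesting that is used above. The sketch builds one DataFrame with a column for each quantile.

```python
import pandas as pd

# Mock of the query_forecast response shape; all values are made up.
mock_response = {
    'Forecast': {
        'Predictions': {
            'p10': [{'Timestamp': '2010-11-01T00:00:00', 'Value': 1.0},
                    {'Timestamp': '2010-11-02T00:00:00', 'Value': 2.0}],
            'p50': [{'Timestamp': '2010-11-01T00:00:00', 'Value': 3.0},
                    {'Timestamp': '2010-11-02T00:00:00', 'Value': 4.0}],
            'p90': [{'Timestamp': '2010-11-01T00:00:00', 'Value': 6.0},
                    {'Timestamp': '2010-11-02T00:00:00', 'Value': 8.0}],
        }
    }
}

# Turn each quantile's list of points into a Series indexed by timestamp,
# then combine the Series into one DataFrame with a column per quantile.
quantiles = pd.DataFrame({
    name: pd.DataFrame(points).set_index('Timestamp')['Value']
    for name, points in mock_response['Forecast']['Predictions'].items()
})
quantiles.index = pd.to_datetime(quantiles.index)
print(quantiles)
```

A single wide DataFrame such as this one can be plotted directly, with one line for each quantile.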


### Comparing the prediction to actual results

After you obtain the DataFrames, the next task is to plot them together to determine the best fit.


In [None]:
# Start by creating a DataFrame to house the content. Here, Source will be which DataFrame it came from.
results_df = pd.DataFrame(columns=['timestamp','value','Source'])

results_df.head()



Import the observed values into the DataFrame:


In [None]:
import dateutil.parser
# DataFrame.append was removed in pandas 2.0, so build the "Actual" rows and concatenate them instead.
actual_rows = pd.DataFrame({'timestamp': actual_df.index, 'value': actual_df['Quantity'].values, 'Source': 'Actual'})
results_df = pd.concat([results_df, actual_rows], ignore_index=True)

In [None]:
# To show the new DataFrame
results_df.head()

In [None]:
# Now add the P10, P50, and P90 values.
# DataFrame.append was removed in pandas 2.0, so each set of rows is concatenated instead.
for source, prediction_df in [('p10', prediction_df_p10), ('p50', prediction_df_p50), ('p90', prediction_df_p90)]:
    prediction_rows = pd.DataFrame({
        'timestamp': [dateutil.parser.parse(ts) for ts in prediction_df['Timestamp']],
        'value': prediction_df['Value'].values,
        'Source': source,
    })
    results_df = pd.concat([results_df, prediction_rows], ignore_index=True)

By creating a pivot on the data, you can compare the actual values with the P10, P50, and P90 predictions.

In [None]:
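# Convert to float so that the values can be plotted even if they were stored as objects
pivot_df = results_df.pivot(columns='Source', values='value', index="timestamp").astype(float)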
pivot_df
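To show what the pivot does, here is a minimal sketch on a few made-up rows (the `long_df` DataFrame and its values are hypothetical). It reshapes data with one row per timestamp and source into one row per timestamp, with a column for each source.

```python
import pandas as pd

# Hypothetical long-format results: one row per timestamp and source.
long_df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2010-11-01', '2010-11-01',
                                 '2010-11-02', '2010-11-02']),
    'value': [10.0, 8.0, 12.0, 9.0],
    'Source': ['Actual', 'p50', 'Actual', 'p50'],
})

# Each distinct Source becomes a column, and each timestamp becomes a row.
wide_df = long_df.pivot(index='timestamp', columns='Source', values='value')
print(wide_df)
```

The pivot requires each combination of timestamp and source to appear only once. Duplicate combinations raise an error.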

Charts can be easier to analyze than the raw values.

In [None]:
pivot_df.plot(figsize=(20,10))

### Examining the results

Hopefully, in the previous chart, you will see at least some correlation between the predicted values and the actual values. The correlation might not be good, and there could be several reasons for this outcome:

- The sales are mostly wholesale, but they do include some smaller orders.
- You held back data, which meant that an entire season wasn't included in the training data.
- You might have been missing useful category or sales promotion data.

Like all machine learning models, the results are only as good as the data that you use to train the model. As noted previously, the model could be improved with more data.
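A chart gives a visual impression of the fit, and a number can make the comparison more concrete. One common choice for quantile forecasts is a weighted quantile loss. The following is a sketch of that metric on made-up values. The function name and the data are hypothetical, and the formula is a standard definition rather than output from this lab.

```python
import numpy as np

def weighted_quantile_loss(actual, predicted, tau):
    """Quantile (pinball) loss for quantile tau, scaled by total absolute demand."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    under = np.maximum(actual - predicted, 0.0)  # forecast was too low
    over = np.maximum(predicted - actual, 0.0)   # forecast was too high
    loss = np.sum(tau * under + (1.0 - tau) * over)
    return 2.0 * loss / np.sum(np.abs(actual))

# Made-up actual demand and P50 predictions for three days.
actual = [10.0, 20.0, 30.0]
p50 = [12.0, 18.0, 30.0]

loss_p50 = weighted_quantile_loss(actual, p50, tau=0.5)
print(loss_p50)
```

Lower values indicate a closer fit. For a high quantile such as P90, predictions that are too low are penalized more heavily than predictions that are too high.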

## Task 7: Cleaning up

The following cells will clean up the resources that were created during the lab.

In [None]:
%store -r

In [None]:
print(forecast_arn)

In [None]:
forecast.delete_forecast(ForecastArn=forecast_arn)
time.sleep(60)

In [None]:
forecast.delete_predictor(PredictorArn=predictor_arn)
time.sleep(60)

In [None]:
forecast.delete_dataset_import_job(DatasetImportJobArn=ds_related_import_job_arn)

In [None]:
forecast.delete_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)

In [None]:
time.sleep(60)

In [None]:
forecast.delete_dataset(DatasetArn=related_dataset_arn)

In [None]:
forecast.delete_dataset(DatasetArn=dataset_arn)

In [None]:
time.sleep(60)

In [None]:
forecast.delete_dataset_group(DatasetGroupArn=dataset_group_arn)
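The cleanup cells above wait a fixed 60 seconds between deletions. As an optional alternative, the following sketch polls until a resource is actually gone. The `wait_until_deleted` helper is hypothetical, and the example uses a fake describe function and a fake exception so that it runs without AWS access.

```python
import time

def wait_until_deleted(describe, is_not_found, poll_seconds=10,
                       max_polls=60, sleep=time.sleep):
    """Return True when describe() reports that the resource no longer exists."""
    for _ in range(max_polls):
        try:
            describe()
        except Exception as error:
            if is_not_found(error):
                return True
            raise  # unexpected errors are not hidden
        sleep(poll_seconds)
    return False

# Fake resource that disappears on the third describe call.
class FakeNotFound(Exception):
    pass

calls = {'count': 0}

def fake_describe():
    calls['count'] += 1
    if calls['count'] >= 3:
        raise FakeNotFound()
    return {'Status': 'DELETE_IN_PROGRESS'}

deleted = wait_until_deleted(fake_describe,
                             lambda error: isinstance(error, FakeNotFound),
                             sleep=lambda seconds: None)
print(deleted, calls['count'])
```

With the real client, `describe` could be `lambda: forecast.describe_forecast(ForecastArn=forecast_arn)`, and `is_not_found` could check for `forecast.exceptions.ResourceNotFoundException`.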