Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Data Preparation
---

This repository uses simulated orange juice sales data from [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/) to walk you through the process of training many models and forecasting on Azure Machine Learning. 

This notebook walks you through all the necessary steps to configure the data for this solution accelerator, including:

1. Download the sample data
2. Split the data into training and forecasting sets
3. Connect to your workspace and upload the data to its Datastore
4. Register the training and forecasting datasets in the workspace

### Prerequisites
If you have already run the [00_Setup_AML_Workspace](00_Setup_AML_Workspace.ipynb) notebook, you are all set.


## 1.0 Download sample data

The time series data used in this example was simulated based on the University of Chicago's Dominick's Finer Foods dataset, which featured two years of sales of 3 different orange juice brands for individual stores. You can learn more about the dataset [here](https://azure.microsoft.com/services/open-datasets/catalog/sample-oj-sales-simulated/). 

The full dataset includes simulated sales for 3,991 stores with 3 orange juice brands each, thus allowing 11,973 models to be trained to showcase the power of the many models pattern. Each series contains data from '1990-06-14' to '1992-10-01'.
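
The figures above can be checked with a little arithmetic. The following is a minimal sketch, assuming one observation per week with both end dates included (the weekly spacing is an assumption here, suggested by the `WeekStarting` column shown later in this notebook):

```python
from datetime import date

n_stores = 3991
n_brands = 3
# One series (and therefore one model) per store/brand pair
n_series = n_stores * n_brands

first_week = date(1990, 6, 14)
last_week = date(1992, 10, 1)
# Assumption: one row per week, first and last weeks both included
n_weeks = (last_week - first_week).days // 7 + 1

print(n_series)  # 11973
print(n_weeks)   # 121
```

Under that assumption, each of the 11,973 series holds 121 weekly records.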

You'll need the `azureml-opendatasets` package to download the data. You can install it with the following:

In [1]:
!pip install azureml-opendatasets

Collecting scipy<=1.4.1,>=1.0.0
  Downloading scipy-1.4.1-cp36-cp36m-manylinux1_x86_64.whl (26.1 MB)
ERROR: raiwidgets 0.7.0 has requirement jinja2==2.11.3, but you'll have jinja2 2.11.2 which is incompatible.
ERROR: azureml-responsibleai 1.33.0 has requirement responsibleai~=0.9.1, but you'll have responsibleai 0.7.0 which is incompatible.
ERROR: autokeras 1.0.15 has requirement tensorflow>=2.3.0, but you'll have tensorflow 2.1.0 which is incompatible.
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.5.2
    Uninstalling scipy-1.5.2:
      Successfully uninstalled scipy-1.5.2
Successfully installed scipy-1.4.1


In [2]:
# Upgrade the package to its latest release
!pip install --upgrade azureml-opendatasets


We'll start by downloading only the first 10 files. To train all 11,973 models, change `dataset_maxfiles` in the code below.

In [4]:
dataset_maxfiles = 10 # Set to 11973 or 0 to get all the files

In [5]:
import os
from azureml.opendatasets import OjSalesSimulated

# Pull all of the data
oj_sales_files = OjSalesSimulated.get_file_dataset()

# Pull only the first `dataset_maxfiles` files
if dataset_maxfiles:
    oj_sales_files = oj_sales_files.take(dataset_maxfiles)

# Create a local folder to download the files into
target_path = 'oj_sales_data'
os.makedirs(target_path, exist_ok=True)

# Download the data
oj_sales_files.download(target_path, overwrite=True)

['/mnt/batch/tasks/shared/LS_root/mounts/clusters/coding-forge-medium/code/Users/brcampb/projects/manymodels_mlops/oj_sales_data/https%3A/%2Fazureopendatastorage.azurefd.net/ojsales-simulatedcontainer/oj_sales_data/Store1000_dominicks.csv',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/coding-forge-medium/code/Users/brcampb/projects/manymodels_mlops/oj_sales_data/https%3A/%2Fazureopendatastorage.azurefd.net/ojsales-simulatedcontainer/oj_sales_data/Store1000_minute.maid.csv',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/coding-forge-medium/code/Users/brcampb/projects/manymodels_mlops/oj_sales_data/https%3A/%2Fazureopendatastorage.azurefd.net/ojsales-simulatedcontainer/oj_sales_data/Store1000_tropicana.csv',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/coding-forge-medium/code/Users/brcampb/projects/manymodels_mlops/oj_sales_data/https%3A/%2Fazureopendatastorage.azurefd.net/ojsales-simulatedcontainer/oj_sales_data/Store1001_dominicks.csv',
 '/mnt/batch/tasks/shared/LS_root/

## 2.0 Split data into two sets

We will now split each data file into two parts: one for training and one for simulating batch forecasting. The training files will contain the records before '1992-05-28', and the remainder of each series will be stored in the inferencing files.

Finally, we will upload both sets of data files to the Workspace's default [Datastore](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)).

In [6]:
from scripts.helper import split_data

# Provide name of timestamp column in the data and date from which to split into the inference dataset
timestamp_column = 'WeekStarting'
split_date = '1992-05-28'

# Split each file and store in corresponding directory
train_path, inference_path = split_data(target_path, timestamp_column, split_date)
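
The implementation of `split_data` lives in `scripts/helper.py` and is not reproduced here. To illustrate the idea only, here is a minimal sketch of a date-based split on in-memory rows. The function name `split_rows` and the generated weekly series are hypothetical, and the sketch assumes that a record dated exactly on the split date goes to the inference set:

```python
from datetime import date, timedelta

def split_rows(rows, timestamp_column, split_date):
    """Return (train, inference): rows before split_date, then rows on or after it."""
    cutoff = date.fromisoformat(split_date)
    train, inference = [], []
    for row in rows:
        when = date.fromisoformat(row[timestamp_column])
        (train if when < cutoff else inference).append(row)
    return train, inference

# Hypothetical weekly series covering 1990-06-14 through 1992-10-01
start = date(1990, 6, 14)
rows = [{'WeekStarting': (start + timedelta(weeks=i)).isoformat()} for i in range(121)]

train_rows, inference_rows = split_rows(rows, 'WeekStarting', '1992-05-28')
print(len(train_rows), len(inference_rows))  # 102 19
```

With weekly data over that range, a split like this would leave 102 records for training and hold out the last 19 for forecasting.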

## 3.0 Upload data to Datastore in AML Workspace

In the [setup notebook](00_Setup_AML_Workspace.ipynb) you created a [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace). We are going to register the data in that environment.

In [7]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()

# Take a look at Workspace
ws.get_details()

{'id': '/subscriptions/af3877c2-18a2-4ce2-b67c-a8e21e968128/resourceGroups/coding-forge-rg/providers/Microsoft.MachineLearningServices/workspaces/coding-forge-ml-ws',
 'name': 'coding-forge-ml-ws',
 'identity': {'principal_id': '62dda26f-3571-42bf-8c44-ccb3db7f889b',
  'tenant_id': '72f988bf-86f1-41af-91ab-2d7cd011db47',
  'type': 'SystemAssigned'},
 'location': 'eastus',
 'type': 'Microsoft.MachineLearningServices/workspaces',
 'tags': {'Created By': 'brandon campbell',
  'contact': 'brandon.campbell@microsoft.com',
  'phone': '770.853.0352'},
 'sku': 'Basic',
 'workspaceid': 'e057c1f8-0319-435b-92cd-84642f9e37d5',
 'sdkTelemetryAppInsightsKey': 'f5784ccd-178d-4ecc-9998-b05841b44ae9',
 'description': '',
 'friendlyName': 'coding-forge-ml-ws',
 'creationTime': '2021-09-01T12:30:38.7634249+00:00',
 'containerRegistry': '/subscriptions/af3877c2-18a2-4ce2-b67c-a8e21e968128/resourcegroups/coding-forge-rg/providers/microsoft.containerregistry/registries/codingforgeacr',
 'keyVault': '/subsc

We will upload both sets of data files to your Workspace's default [Datastore](https://docs.microsoft.com/azure/machine-learning/how-to-access-data).
A Datastore is a place where data can be stored and then made accessible for training or forecasting. See the [Datastore documentation](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)) for how to access data from a Datastore.

In [8]:
# Connect to default datastore
datastore = ws.get_default_datastore()

# Upload train data
ds_train_path = target_path + '_train'
datastore.upload(src_dir=train_path, target_path=ds_train_path, overwrite=True)

# Upload inference data
ds_inference_path = target_path + '_inference'
datastore.upload(src_dir=inference_path, target_path=ds_inference_path, overwrite=True)

Uploading an estimated of 10 files
Uploading oj_sales_data/upload_train_data/Store1000_dominicks.csv
Uploaded oj_sales_data/upload_train_data/Store1000_dominicks.csv, 1 files out of an estimated total of 10
Uploading oj_sales_data/upload_train_data/Store1000_minute.maid.csv
Uploaded oj_sales_data/upload_train_data/Store1000_minute.maid.csv, 2 files out of an estimated total of 10
Uploading oj_sales_data/upload_train_data/Store1000_tropicana.csv
Uploaded oj_sales_data/upload_train_data/Store1000_tropicana.csv, 3 files out of an estimated total of 10
Uploading oj_sales_data/upload_train_data/Store1001_dominicks.csv
Uploaded oj_sales_data/upload_train_data/Store1001_dominicks.csv, 4 files out of an estimated total of 10
Uploading oj_sales_data/upload_train_data/Store1001_minute.maid.csv
Uploaded oj_sales_data/upload_train_data/Store1001_minute.maid.csv, 5 files out of an estimated total of 10
Uploading oj_sales_data/upload_train_data/Store1001_tropicana.csv
Uploaded oj_sales_data/upload_t

$AZUREML_DATAREFERENCE_1cc17acca9524a7e9cb1855ac0339c92

### *[Optional]* If data is already in Azure: create Datastore from it

If your data is already in Azure you don't need to upload it from your local machine to the default datastore. Instead, you can create a new Datastore that references that set of data. 
The following is an example of how to set up a Datastore from a container in Blob storage where the sample data is located. 

In this case, the orange juice data is available in a public blob container, defined by the information below. For your own non-public container, you'll also need to supply account credentials, such as a SAS token or an account key. For more information, see [the documentation](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore.datastore#register-azure-blob-container-workspace--datastore-name--container-name--account-name--sas-token-none--account-key-none--protocol-none--endpoint-none--overwrite-false--create-if-not-exists-false--skip-validation-false--blob-cache-timeout-none--grant-workspace-access-false--subscription-id-none--resource-group-none-).

In [9]:
blob_datastore_name = "automl_many_models"
container_name = "automl-sample-notebook-data"
account_name = "automlsamplenotebookdata"

In [10]:
from azureml.core import Datastore

datastore = Datastore.register_azure_blob_container(
    workspace=ws, 
    datastore_name=blob_datastore_name, 
    container_name=container_name,
    account_name=account_name,
    create_if_not_exists=True
)

if 0 < dataset_maxfiles < 11973:
    ds_train_path = 'oj_data_small/'
    ds_inference_path = 'oj_inference_small/'
else:
    ds_train_path = 'oj_data/'
    ds_inference_path = 'oj_inference/'
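
The condition `0 < dataset_maxfiles < 11973` decides whether the small or the full folders are used, and the same test reappears when the datasets are named. As a sketch of how that rule behaves, the hypothetical helper below (`uses_small_subset` is not part of the notebook) shows that both `0` and `11973` select the full data:

```python
FULL_DATASET_FILES = 11973  # 3,991 stores x 3 brands

def uses_small_subset(dataset_maxfiles, total_files=FULL_DATASET_FILES):
    """True when only part of the data was requested (0 means 'all files')."""
    return 0 < dataset_maxfiles < total_files

for n in (0, 10, 11973):
    suffix = '_small' if uses_small_subset(n) else ''
    print(n, 'oj_data' + suffix + '/', 'oj_inference' + suffix + '/')
# 0 oj_data/ oj_inference/
# 10 oj_data_small/ oj_inference_small/
# 11973 oj_data/ oj_inference/
```

Keeping the rule in one place like this is one way to avoid the two copies of the condition drifting apart.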

## 4.0 Register dataset in AML Workspace

The last step is creating and registering [datasets](https://docs.microsoft.com/azure/machine-learning/concept-data#datasets) in Azure Machine Learning for the train and inference sets.

Using a [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset) is currently the best way to take advantage of the many models pattern, so we create FileDatasets in the next cell. We then [register](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets#register-datasets) the FileDatasets in your Workspace; this associates the train/inference sets with simple names that can be easily referred to later on when we train models and produce forecasts.

In [11]:
from azureml.core.dataset import Dataset

# Create file datasets
ds_train = Dataset.File.from_files(path=datastore.path(ds_train_path), validate=False)
ds_inference = Dataset.File.from_files(path=datastore.path(ds_inference_path), validate=False)

# Register the file datasets
dataset_name = 'oj_data_small' if 0 < dataset_maxfiles < 11973 else 'oj_data'
train_dataset_name = dataset_name + '_train'
inference_dataset_name = dataset_name + '_inference'
ds_train.register(ws, train_dataset_name, create_new_version=True)
ds_inference.register(ws, inference_dataset_name, create_new_version=True)

{
  "source": [
    "('automl_many_models', 'oj_inference_small/')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "59f4fec2-cee8-4e64-99dd-b686ad96c58e",
    "name": "oj_data_small_inference",
    "version": 1,
    "workspace": "Workspace.create(name='coding-forge-ml-ws', subscription_id='af3877c2-18a2-4ce2-b67c-a8e21e968128', resource_group='coding-forge-rg')"
  }
}

## 5.0 *[Optional]* Interact with the registered dataset

Once registered, a dataset can be retrieved by name with the command below. This is how the datasets will be accessed in later notebooks.

In [12]:
oj_ds = Dataset.get_by_name(ws, name=train_dataset_name)
oj_ds

{
  "source": [
    "('automl_many_models', 'oj_data_small/')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "f31d6346-74e4-4a52-afd0-f7ec048ff351",
    "name": "oj_data_small_train",
    "version": 1,
    "workspace": "Workspace.create(name='coding-forge-ml-ws', subscription_id='af3877c2-18a2-4ce2-b67c-a8e21e968128', resource_group='coding-forge-rg')"
  }
}

It is also possible to download the data from the registered dataset:

In [13]:
download_paths = oj_ds.download()
download_paths

['/tmp/tmpbxm0yeo6/Store1000_dominicks.csv',
 '/tmp/tmpbxm0yeo6/Store1000_minute.maid.csv',
 '/tmp/tmpbxm0yeo6/Store1000_tropicana.csv',
 '/tmp/tmpbxm0yeo6/Store1001_dominicks.csv',
 '/tmp/tmpbxm0yeo6/Store1001_minute.maid.csv',
 '/tmp/tmpbxm0yeo6/Store1001_tropicana.csv',
 '/tmp/tmpbxm0yeo6/Store1002_dominicks.csv',
 '/tmp/tmpbxm0yeo6/Store1002_minute.maid.csv',
 '/tmp/tmpbxm0yeo6/Store1002_tropicana.csv',
 '/tmp/tmpbxm0yeo6/Store1003_dominicks.csv']
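
Each file appears to hold one store/brand series, with both values encoded in the file name. The sketch below parses that naming pattern. The `Store<id>_<brand>.csv` convention is inferred from the paths listed above rather than documented, and `parse_series_file` is a hypothetical helper:

```python
import os
from collections import Counter

def parse_series_file(path):
    """Split e.g. 'Store1000_minute.maid.csv' into (1000, 'minute.maid')."""
    stem, _ext = os.path.splitext(os.path.basename(path))
    store_part, brand = stem.split('_', 1)
    return int(store_part[len('Store'):]), brand

# A few of the paths from the output above
paths = [
    '/tmp/tmpbxm0yeo6/Store1000_dominicks.csv',
    '/tmp/tmpbxm0yeo6/Store1000_minute.maid.csv',
    '/tmp/tmpbxm0yeo6/Store1000_tropicana.csv',
    '/tmp/tmpbxm0yeo6/Store1003_dominicks.csv',
]
series = [parse_series_file(p) for p in paths]
files_per_store = Counter(store for store, _brand in series)

print(series[1])              # (1000, 'minute.maid')
print(dict(files_per_store))  # {1000: 3, 1003: 1}
```

Note that a 10-file subset cuts off partway through a store, so the last store in the sample has only one brand.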

Let's load one of the data files to see the format:

In [14]:
import pandas as pd

sample_data = pd.read_csv(download_paths[0])
sample_data.head(10)

| Unnamed: 0 | WeekStarting | Store | Brand | Quantity | Advert | Price | Revenue |
|---|---|---|---|---|---|---|---|
| 0 | 1990-06-14 | 1000 | dominicks | 12003 | 1 | 2.59 | 31087.77 |
| 1 | 1990-06-21 | 1000 | dominicks | 10239 | 1 | 2.39 | 24471.21 |
| 2 | 1990-06-28 | 1000 | dominicks | 17917 | 1 | 2.48 | 44434.16 |
| 3 | 1990-07-05 | 1000 | dominicks | 14218 | 1 | 2.33 | 33127.94 |
| 4 | 1990-07-12 | 1000 | dominicks | 15925 | 1 | 2.01 | 32009.25 |
| 5 | 1990-07-19 | 1000 | dominicks | 17850 | 1 | 2.17 | 38734.5 |
| 6 | 1990-07-26 | 1000 | dominicks | 10576 | 1 | 1.97 | 20834.72 |
| 7 | 1990-08-02 | 1000 | dominicks | 9912 | 1 | 2.26 | 22401.12 |
| 8 | 1990-08-09 | 1000 | dominicks | 9571 | 1 | 2.11 | 20194.81 |
| 9 | 1990-08-16 | 1000 | dominicks | 15748 | 1 | 2.42 | 38110.16 |
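
In the rows shown, `Revenue` appears to equal `Quantity` multiplied by `Price`. This is an observation from the sample output, not a documented property of the dataset. A minimal sketch that checks it on the first three rows, using only the standard library:

```python
import csv
import io
import math

# First three rows of the sample output, without the index column
sample = io.StringIO(
    "WeekStarting,Store,Brand,Quantity,Advert,Price,Revenue\n"
    "1990-06-14,1000,dominicks,12003,1,2.59,31087.77\n"
    "1990-06-21,1000,dominicks,10239,1,2.39,24471.21\n"
    "1990-06-28,1000,dominicks,17917,1,2.48,44434.16\n"
)

checks = []
for row in csv.DictReader(sample):
    expected = int(row['Quantity']) * float(row['Price'])
    # Compare to the nearest cent to sidestep floating-point rounding
    checks.append(math.isclose(expected, float(row['Revenue']), abs_tol=0.005))

print(checks)  # [True, True, True]
```

A check like this can be a quick sanity test after downloading, before any models are trained.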


## Next Steps

Now that you have created your datasets, you are ready to move to one of the training notebooks to train and score the models:

- Automated ML: please open [02_AutoML_Training_Pipeline.ipynb](Automated_ML/02_AutoML_Training_Pipeline/02_AutoML_Training_Pipeline.ipynb).
- Custom Script: please open [02_CustomScript_Training_Pipeline.ipynb](Custom_Script/02_CustomScript_Training_Pipeline.ipynb).