Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Data Preparation
---

This repository uses simulated orange juice sales data from [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/) to walk you through the process of training many models and forecasting on Azure Machine Learning. 

This notebook walks you through all the necessary steps to configure the data for this solution accelerator, including:

1. Download the sample data
2. Split into training and forecasting sets
3. Connect to your workspace and upload the data to its Datastore

### Prerequisites
If you have already run the [00_Setup_AML_Workspace](00_Setup_AML_Workspace.ipynb) notebook, you are all set.


## 1.0 Download sample data

The time series data used in this example was simulated based on the University of Chicago's Dominick's Finer Foods dataset, which featured two years of sales of 3 different orange juice brands for individual stores. You can learn more about the dataset [here](https://azure.microsoft.com/services/open-datasets/catalog/sample-oj-sales-simulated/). 

The full dataset includes simulated sales for 3,991 stores with 3 orange juice brands each, thus allowing 11,973 models to be trained to showcase the power of the many models pattern. Each series contains data from '1990-06-14' to '1992-10-01'.
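
The series count and date span quoted above can be checked with a few lines of arithmetic. This is only a sketch built from the figures stated in the text; the variable names are illustrative.

```python
from datetime import date

# Figures quoted in the text above
n_stores = 3991
n_brands = 3

# One model is trained per (store, brand) series
n_series = n_stores * n_brands

# Length of each series in days
start, end = date(1990, 6, 14), date(1992, 10, 1)
span_days = (end - start).days

print(n_series)   # 11973
print(span_days)  # 840
```

A span of 840 days is exactly 120 weeks.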

You'll need the `azureml-opendatasets` package to download the data. You can install it with the following:

In [None]:
#!pip install azureml-opendatasets

In [None]:
#!pip install --upgrade azureml-opendatasets

We'll start by downloading the first 10 files but you can easily edit the code below to train all 11,973 models.

In [None]:
#!pip list

In [None]:
import os
import shutil
import pandas as pd

fldr = "staged_data"
datastore_exists = True

# Start from an empty staging folder: remove any previous copy, then recreate it
staging_dir = os.path.join(os.getcwd(), fldr)
if os.path.isdir(staging_dir):
    shutil.rmtree(staging_dir)
os.makedirs(staging_dir, exist_ok=True)


### Adjust Path 
This allows access to the utils folder that is not directly in the path of this folder. 

In [None]:
import sys
sys.path.append("../../")
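
A relative entry such as `"../../"` is resolved against the current working directory, so it only works when the notebook is run from its own folder. A minimal alternative sketch, assuming the same two-levels-up layout, resolves the path to an absolute one first and avoids adding duplicates when the cell is re-run. The name `repo_root` is illustrative.

```python
import sys
from pathlib import Path

# Assumption: the folder that contains `utils` sits two levels above the
# working directory, mirroring the "../../" used above.
repo_root = str((Path.cwd() / ".." / "..").resolve())

# Add the absolute path once, even if this cell runs several times
if repo_root not in sys.path:
    sys.path.append(repo_root)
```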

## Load the Python variables and environment variables 

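The `python-dotenv` package loads key-value pairs from a `.env` file into environment variables, so the notebook can use locally defined settings alongside variables already set by the compute that executes the script. The `Env` helper then exposes these settings as attributes such as `e.primary_partition`.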

In [None]:
%load_ext dotenv

from utils.env_variables import Env

e=Env()
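
The `Env` class itself lives in `utils/env_variables.py` and is not shown in this notebook. As a rough sketch of the idea only, a settings object can read each value from the environment when it is created. The class name `EnvSketch`, the helper `_env`, and the environment variable names below are assumptions for illustration; only the attribute names are taken from later cells.

```python
import os
from dataclasses import dataclass, field


def _env(name: str, default: str = "") -> str:
    """Read one environment variable, falling back to a default."""
    return os.environ.get(name, default)


@dataclass(frozen=True)
class EnvSketch:
    """Hypothetical stand-in for utils.env_variables.Env (not the real class)."""

    # The environment variable names are assumptions made for this example.
    primary_partition: str = field(default_factory=lambda: _env("PRIMARY_PARTITION"))
    secondary_partition: str = field(default_factory=lambda: _env("SECONDARY_PARTITION"))
    blob_datastore_name: str = field(default_factory=lambda: _env("BLOB_DATASTORE_NAME"))


# Set two variables for demonstration only; normally they come from .env or the compute
os.environ["PRIMARY_PARTITION"] = "LocationNumber"
os.environ["SECONDARY_PARTITION"] = "item_id"

cfg = EnvSketch()
print(cfg.primary_partition, cfg.secondary_partition)
```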

## 2.0 Connect to AML Workspace

In the [setup notebook](00_Setup_AML_Workspace.ipynb) you created a [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace). We are going to register the data in that environment.

In [None]:
from azureml.core.workspace import Workspace
from utils.aml_workspace import Connect

connect = Connect()

ws = connect.authenticate()

# Take a look at Workspace
# ws.get_details()

## Loading Data 

### Option A: From an Azure Blob Container

Connect to a datastore at a given path, then recursively extract the files from that path and from any folders beneath it.

In [None]:
from scripts.helper import recurse_path
from pathlib import Path
from azureml.core import Dataset, Datastore

# "somethingnew" is a placeholder; use the name of a datastore registered in your Workspace
datastore = Datastore(ws, "somethingnew")

datastore_paths = [(datastore, '')]
DS = Dataset.File.from_files(path=datastore_paths)

# Access Data using Mount Point instead of Download
with DS.mount() as mount_context:
    # files = os.listdir(mount_context.mount_point)
    mydir = mount_context.mount_point
    recurse_path(Path(mydir))
    # walk_path(Path(mydir))
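
The `recurse_path` helper comes from `scripts/helper.py` and is not shown here. The sketch below illustrates one way to walk a folder tree with `pathlib`; it is a hypothetical stand-in, not the helper's actual implementation, and it runs against a temporary directory rather than a mounted dataset.

```python
import tempfile
from pathlib import Path
from typing import List


def list_files_recursively(root: Path) -> List[Path]:
    """Return every file under root, including files in nested folders."""
    return sorted(p for p in root.rglob("*") if p.is_file())


# Small demonstration on a temporary directory tree
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "a" / "b").mkdir(parents=True)
    (base / "top.csv").write_text("x\n")
    (base / "a" / "b" / "nested.csv").write_text("y\n")
    found = [p.relative_to(base).as_posix() for p in list_files_recursively(base)]

print(found)
```

With a mounted dataset, the same function could be pointed at `Path(mount_context.mount_point)` inside the `with` block.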

### Option B: Using data from Azure ML Open Datasets

You can skip this step if you are pulling data from Azure Blob Storage or another data source.

In [None]:
from azureml.opendatasets import OjSalesSimulated

dataset_maxfiles = 10 # Set to 11973 or 0 to get all the files

# Pull all of the data
oj_sales_files = OjSalesSimulated.get_file_dataset()

# Pull only the first `dataset_maxfiles` files
if dataset_maxfiles:
    oj_sales_files = oj_sales_files.take(dataset_maxfiles)

# Create a folder to download
target_path = 'oj_sales_data' 
os.makedirs(target_path, exist_ok=True)

# Download the data
oj_sales_files.download(target_path, overwrite=True)



In [None]:
from azureml.core import Dataset

# Read every Parquet file in the data folder and combine them into one DataFrame
data_dir = os.path.join(os.getcwd(), "../../../data")
frames = [
    pd.read_parquet(os.path.join(data_dir, file))
    for file in os.listdir(data_dir)
    if file.endswith(".parquet")
]
df = pd.concat(frames) if frames else pd.DataFrame()




## Make sure the date time column is sorted in ascending order

In [None]:
df["DateTime"]=pd.to_datetime(df["DateTime"])
df=df.sort_values("DateTime")

### Write one file per partition to the `staged_data` folder

In [None]:
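# Write one CSV per (primary, secondary) partition combination,
# filtering on the same partition columns that drive the loops
for p in df[e.primary_partition].unique():
    for s in df[e.secondary_partition].unique():
        subset = df[(df[e.primary_partition] == p) & (df[e.secondary_partition] == s)]
        subset.to_csv(os.path.join(os.getcwd(), fldr, f"D{p}_{s}.csv"), index=False)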
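
The nested loop above visits every combination of primary and secondary values, including combinations that have no rows, which produces files containing only a header. A possible alternative, sketched here with the standard library and made-up sample rows, groups the rows by the pair of partition values so that a file is written only for combinations that actually occur. The function name `write_partitions`, the `Quantity` column, and the sample values are hypothetical; the `D{p}_{s}.csv` naming follows the cell above.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitions(rows, primary, secondary, out_dir):
    """Write one CSV per (primary, secondary) pair found in rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[primary], row[secondary])].append(row)

    written = []
    for (p, s), members in groups.items():
        path = Path(out_dir) / f"D{p}_{s}.csv"
        with open(path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(members[0].keys()))
            writer.writeheader()
            writer.writerows(members)
        written.append(path.name)
    return sorted(written)


# Made-up sample rows for illustration only
sample = [
    {"LocationNumber": 1, "item_id": "A", "DateTime": "2021-09-01", "Quantity": 5},
    {"LocationNumber": 1, "item_id": "A", "DateTime": "2021-09-08", "Quantity": 7},
    {"LocationNumber": 2, "item_id": "B", "DateTime": "2021-09-01", "Quantity": 3},
]

with tempfile.TemporaryDirectory() as tmp:
    files = write_partitions(sample, "LocationNumber", "item_id", tmp)

print(files)
```

With pandas, the same grouping is available through `df.groupby([...])`, which likewise yields only the combinations present in the data.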
        

## 3.0 Split data into two sets

We will now split each dataset into two parts: one will be used for training, and the other for simulating batch forecasting. For the orange juice sample data, the training files contain the records before '1992-5-28' and the remainder of each series is stored in the inference files. The cell below takes the cutoff from `split_date`, so set it to a date that falls inside your own data.

Finally, we will upload both sets of data files to the Workspace's default [Datastore](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)).

In [None]:
from scripts.helper import split_data

target_path = fldr

# Provide name of timestamp column in the data and date from which to split into the inference dataset
timestamp_column = "DateTime"  # e.timestamp_column
split_date = "2021-09-15"   #e.split_date

# Split each file and store in corresponding directory
train_path, inference_path = split_data(target_path, timestamp_column, split_date)
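
The `split_data` helper is defined in `scripts/helper.py` and is not shown in this notebook. The sketch below illustrates the splitting rule described above on in-memory rows. It assumes that rows strictly before the split date go to training and rows on or after it go to inference; the real helper may treat the boundary differently. The function name `split_by_date`, the `Quantity` column, and the sample rows are hypothetical.

```python
from datetime import datetime


def split_by_date(rows, timestamp_column, split_date):
    """Split rows into (train, inference) around split_date.

    Assumption: rows strictly before split_date go to training and rows on
    or after it go to inference; the real split_data helper may differ.
    """
    cutoff = datetime.fromisoformat(split_date)
    train, inference = [], []
    for row in rows:
        stamp = datetime.fromisoformat(row[timestamp_column])
        (train if stamp < cutoff else inference).append(row)
    return train, inference


# Made-up rows for illustration only
rows = [
    {"DateTime": "2021-09-01", "Quantity": 5},
    {"DateTime": "2021-09-14", "Quantity": 6},
    {"DateTime": "2021-09-15", "Quantity": 7},
    {"DateTime": "2021-09-22", "Quantity": 8},
]

train, inference = split_by_date(rows, "DateTime", "2021-09-15")
print(len(train), len(inference))
```

If the split date falls outside the range of a series, one of the two lists is empty, so it is worth checking both sizes before uploading.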

We will upload both sets of data files to your Workspace's default [Datastore](https://docs.microsoft.com/azure/machine-learning/how-to-access-data). 
A Datastore is a place where data can be stored that is then made accessible for training or forecasting. Please refer to [Datastore documentation](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)) on how to access data from Datastore.

In [None]:
from datetime import datetime
from azureml.core import Datastore

# Timestamp used to give each upload its own target folder
run_time = datetime.now().strftime("%Y%m%d_%H%M")

# Connect to the datastore named in the environment settings
datastore = Datastore.get(ws, e.blob_datastore_name)

# TODO: document the Dataset.Tabular route for files that have not already been partitioned
# Upload train data
ds_train_path = "manymodels_train_" + run_time
datastore.upload(src_dir=train_path, target_path=ds_train_path, overwrite=True)

# Upload inference data
ds_inference_path = "manymodels_inference_" + run_time
datastore.upload(src_dir=inference_path, target_path=ds_inference_path, overwrite=True)

print(ds_train_path)
print(ds_inference_path)
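
Because the target paths end in a timestamp with minute resolution, each run normally uploads to a fresh pair of folders, while two runs within the same minute share a target and rely on `overwrite=True`. The sketch below reproduces only the naming scheme with a fixed timestamp so the result is reproducible; the helper name `timestamped_path` is hypothetical.

```python
from datetime import datetime


def timestamped_path(prefix: str, when: datetime) -> str:
    """Build a target path such as 'manymodels_train_20210915_1030'."""
    return prefix + when.strftime("%Y%m%d_%H%M")


# A fixed timestamp keeps the example reproducible; the cell above uses datetime.now()
when = datetime(2021, 9, 15, 10, 30)
train_target = timestamped_path("manymodels_train_", when)
inference_target = timestamped_path("manymodels_inference_", when)

print(train_target)      # manymodels_train_20210915_1030
print(inference_target)  # manymodels_inference_20210915_1030
```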

### *[Optional]* If data is already in Azure: create Datastore from it

If your data is already in Azure you don't need to upload it from your local machine to the default datastore. Instead, you can create a new Datastore that references that set of data. 
The following is an example of how to set up a Datastore from a container in Blob storage where the sample data is located. 

In this case, the orange juice data is available in a public blob container, defined by the information below. In your case, you'll need to specify the account credentials as well. For more information check [the documentation](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore.datastore#register-azure-blob-container-workspace--datastore-name--container-name--account-name--sas-token-none--account-key-none--protocol-none--endpoint-none--overwrite-false--create-if-not-exists-false--skip-validation-false--blob-cache-timeout-none--grant-workspace-access-false--subscription-id-none--resource-group-none-).

In [None]:
blob_datastore_name = e.blob_datastore_name
container_name = e.container_name
account_name = e.account_name

In [None]:
from azureml.core import Datastore

# Register the blob container as a new Datastore only if it is not registered yet
if not datastore_exists:
    datastore = Datastore.register_azure_blob_container(
        workspace=ws, 
        datastore_name=blob_datastore_name, 
        container_name=container_name,
        account_name=account_name,
        create_if_not_exists=True,    
        account_key=os.getenv("ACCOUNT_KEY")
    )
else:
    print('datastore already exists')

## 4.0 Register dataset in AML Workspace

The last step is creating and registering [datasets](https://docs.microsoft.com/azure/machine-learning/concept-data#datasets) in Azure Machine Learning for the train and inference sets.

Using a [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset) is currently the best way to take advantage of the many models pattern, so we create FileDatasets in the next cell. We then [register](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets#register-datasets) the FileDatasets in your Workspace; this associates the train/inference sets with simple names that can be easily referred to later on when we train models and produce forecasts.

In [None]:
from azureml.core.dataset import Dataset

# Create file datasets
ds_train = Dataset.File.from_files(path=datastore.path(ds_train_path), validate=False)
ds_inference = Dataset.File.from_files(path=datastore.path(ds_inference_path), validate=False)

# Register the file datasets
dataset_name = "manymodels"

train_dataset_name = dataset_name + '_train'
inference_dataset_name = dataset_name + '_inference'
ds_train.register(ws, train_dataset_name, create_new_version=True)
ds_inference.register(ws, inference_dataset_name, create_new_version=True)

## 5.0 *[Optional]* Interact with the registered dataset

After registering the data, you can retrieve it by name with the command below. This is how the datasets will be accessed in future notebooks.

In [None]:
oj_ds = Dataset.get_by_name(ws, name=train_dataset_name)
oj_ds.version

It is also possible to download the data from the registered dataset:

In [None]:
download_paths = oj_ds.download()
download_paths

Let's load one of the data files to see the format:

In [None]:
import pandas as pd

# Download the files and read the first one to inspect its format
download_paths = oj_ds.download()

data = pd.read_csv(download_paths[0])
data.head(10)

## Next Steps

Now that you have created your datasets, you are ready to move to one of the training notebooks to train and score the models:

- Automated ML: please open [02_AutoML_Training_Pipeline.ipynb](Automated_ML/02_AutoML_Training_Pipeline/02_AutoML_Training_Pipeline.ipynb).
- Custom Script: please open [02_CustomScript_Training_Pipeline.ipynb](Custom_Script/02_CustomScript_Training_Pipeline.ipynb).