# Prepare a custom dataset to train and evaluate a Time-Series Forecasting AutomatedML Model

This notebook will allow you to upload a dataset to AzureML to be used for training a time-series forecasting model with AutomatedML. This trained model can then be evaluated with the ResponsibleAI Dashboard using the forecasting-automl.ipynb notebook.

To use this notebook, you must have:

    1. a valid AzureML subscription ID, workspace, and resource group
    2. a dataset containing datetime data in .csv or .txt format
    
Your dataset must meet the following requirements:

    1. a datetime column
    2. a group_id column
    
This notebook will help you get your data in the correct format and upload it to AzureML. 

In [None]:
%pip install mltable azure-ai-ml azure-identity pandas matplotlib pyarrow

### 1. Workspace Details

Enter the details of your AzureML workspace below.

In [None]:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

### 2. Specify data path and version

First, enter the version of your dataset (default 1). Every time you change your dataset and want to upload the changes to AzureML, increase data_version by 1. Next, enter the path to the folder containing your data. The folder should have the following structure:

    data_path/
        train/
            train.csv
        test/
            test.csv
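
Before running the remaining cells, it can help to confirm that the folder really has this layout. The following is a minimal sketch using only the standard library; the helper name `missing_data_files` is hypothetical and not part of this notebook.

```python
from pathlib import Path


def missing_data_files(data_path):
    """Return the expected CSV files that are absent under data_path."""
    root = Path(data_path)
    expected = [root / "train" / "train.csv", root / "test" / "test.csv"]
    return [str(p) for p in expected if not p.is_file()]


# An empty list means both files were found.
print(missing_data_files("data-forecasting/"))
```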

In [None]:
data_version = "1"
data_path = "data-forecasting/"

In [None]:
train_data_path = data_path + "train/"
test_data_path = data_path + "test/"

### 3. Prepare data

Here we ensure that your data has all the required properties and is converted to parquet format. Your test dataset must contain at most 5000 rows. Remember to save your changes (step 3d)!


In [None]:
import mltable
import matplotlib.pyplot as plt
import pandas as pd
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data
from azure.identity import DefaultAzureCredential

In [None]:
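# Read the raw CSV files.
train_df = pd.read_csv(train_data_path + "train.csv")
test_df = pd.read_csv(test_data_path + "test.csv")

# The MLTable files and parquet data are only created in steps 3a and 3d.
# On a later run, once they exist, the data can be loaded with mltable instead:
# train_df = mltable.load(train_data_path).to_pandas_dataframe()
# test_df = mltable.load(test_data_path).to_pandas_dataframe()

# The test dataset must contain at most 5000 rows.
assert len(test_df.index) <= 5000, "test dataset has more than 5000 rows"
test_df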
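
If the assertion above fails, one option is to keep only the most recent rows of the test set. The sketch below is an illustration under assumptions: the helper `keep_most_recent` is hypothetical, the column name `datetime` is assumed, and the column must already sort chronologically (for example ISO-formatted strings, or values converted in step 3b). With several series in one file, trimming this way cuts across all of them.

```python
import pandas as pd

MAX_TEST_ROWS = 5000  # row limit stated in this notebook


def keep_most_recent(df, datetime_column="datetime", max_rows=MAX_TEST_ROWS):
    """Sort by time and keep only the last max_rows rows."""
    ordered = df.sort_values(datetime_column, kind="stable")
    return ordered.tail(max_rows).reset_index(drop=True)


# Synthetic example data: 6000 daily observations.
example = pd.DataFrame(
    {"datetime": pd.date_range("2020-01-01", periods=6000, freq="D"), "y": range(6000)}
)
trimmed = keep_most_recent(example)
print(len(trimmed))  # 5000
```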

#### 3a. Create MLTable files

In [None]:
mltable_train_contents = """
$schema: http://azureml/sdk-2-0/MLTable.json
type: mltable
paths:
  - file: ./train.parquet
transformations:
  - read_parquet
"""

mltable_test_contents = """
$schema: http://azureml/sdk-2-0/MLTable.json
type: mltable
paths:
  - file: ./test.parquet
transformations:
  - read_parquet
"""

mltable_train_filename = train_data_path + "MLTable"
mltable_test_filename = test_data_path + "MLTable"

with open(mltable_train_filename, "w") as f:
    f.write(mltable_train_contents)

with open(mltable_test_filename, "w") as g:
    g.write(mltable_test_contents)
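
The two MLTable definitions differ only in the parquet file name, so they could also be generated by one small function. This is a sketch, not part of the original notebook; the helper name `mltable_yaml` is hypothetical and the YAML fields are copied from the cell above.

```python
def mltable_yaml(parquet_filename):
    """Build the MLTable YAML text for a single parquet file in the same folder."""
    return (
        "$schema: http://azureml/sdk-2-0/MLTable.json\n"
        "type: mltable\n"
        "paths:\n"
        f"  - file: ./{parquet_filename}\n"
        "transformations:\n"
        "  - read_parquet\n"
    )


print(mltable_yaml("train.parquet"))
```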

#### 3b. Convert datetime column to datetime64[ns] format

If you already have a datetime column in datetime64[ns] format, you can skip this step. Otherwise, specify which column contains your datetime data.

In [None]:
# Replace "datetime" with the name of your datetime column if it differs.
train_df["datetime"] = train_df["datetime"].astype("datetime64[ns]")
test_df["datetime"] = test_df["datetime"].astype("datetime64[ns]")
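
If the conversion raises an error, some values probably cannot be parsed as dates. One way to locate them is `pd.to_datetime` with `errors="coerce"`, which turns unparseable values into `NaT`. The helper `parse_datetimes` below is a hypothetical sketch for inspection, not a replacement for the cell above.

```python
import pandas as pd


def parse_datetimes(series):
    """Parse a column to datetimes and return (parsed, raw values that failed)."""
    parsed = pd.to_datetime(series, errors="coerce")
    # Rows that were present in the input but became NaT could not be parsed.
    failed = series[parsed.isna() & series.notna()]
    return parsed, failed


raw = pd.Series(["2021-01-01", "2021-01-02", "not a date"])
parsed, failed = parse_datetimes(raw)
print(failed.tolist())  # ['not a date']
```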

#### 3c. Create group_id column

If your dataset contains multiple time series and already has a group_id column identifying which time series each row belongs to, you can skip this step. Otherwise, the cell below assigns every row to a single series.

In [None]:
train_df["group_id"] = 1.0
test_df["group_id"] = 1.0
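
If your data holds several series that are identified by one or more existing columns, one possible way to derive group_id is to join those columns row by row, which yields the same ID for the same series in both train and test. The sketch below rests on assumptions: the `store` and `item` columns and the `add_group_id` helper are hypothetical, and it has not been verified here that string-valued IDs are accepted downstream (the cell above uses the numeric value 1.0).

```python
import pandas as pd


def add_group_id(df, id_columns):
    """Return a copy of df with a group_id built by joining id_columns per row."""
    out = df.copy()
    out["group_id"] = out[id_columns].astype(str).agg("_".join, axis=1)
    return out


# Hypothetical example with two identifier columns.
sales = pd.DataFrame({"store": ["A", "A", "B"], "item": [1, 2, 1], "y": [10, 20, 30]})
print(add_group_id(sales, ["store", "item"])["group_id"].tolist())  # ['A_1', 'A_2', 'B_1']
```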

#### 3d. Save your data

In [None]:
train_df.to_parquet(train_data_path + "train.parquet")
test_df.to_parquet(test_data_path + "test.parquet")
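
After saving, a quick sanity check on the prepared frames may catch problems before upload. The sketch below checks for the two required columns and a datetime dtype. The duplicate-timestamp check reflects a common expectation for forecasting data and is an assumption, not a requirement stated in this notebook. The helper `find_problems` is hypothetical. If it reports anything, fix the frame and re-run step 3d.

```python
import pandas as pd


def find_problems(df, datetime_column="datetime", group_column="group_id"):
    """Return a list of human-readable problems found in a prepared frame."""
    problems = [
        f"missing column: {col}"
        for col in (datetime_column, group_column)
        if col not in df.columns
    ]
    if problems:
        return problems
    if not pd.api.types.is_datetime64_any_dtype(df[datetime_column]):
        problems.append(f"{datetime_column} is not a datetime dtype")
    if df.duplicated([group_column, datetime_column]).any():
        problems.append("duplicate timestamps within a series")
    return problems


good = pd.DataFrame(
    {"datetime": pd.to_datetime(["2021-01-01", "2021-01-02"]), "group_id": [1.0, 1.0]}
)
print(find_problems(good))  # []
```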

#### 3e. Visualize your data

Specify which column you want to visualize. Usually this would be the target column you want your model to predict over time.

In [None]:
target_column = "<TARGET>"

In [None]:
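# Replace "datetime" with the name of your datetime column if it differs.
print(
    f"Train dates : {train_df['datetime'].min()} --- {train_df['datetime'].max()}  (n={len(train_df)})"
)
print(
    f"Test dates  : {test_df['datetime'].min()} --- {test_df['datetime'].max()}  (n={len(test_df)})"
)

# Plot the target against time so that train and test share one time axis.
fig, ax = plt.subplots(figsize=(6, 2.5))
train_df.plot(x="datetime", y=target_column, ax=ax, label="train")
test_df.plot(x="datetime", y=target_column, ax=ax, label="test")
ax.legend();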

### 4. Get a handle to the workspace

We will use the information provided in the Workspace Details section to get a handle to the required Azure Machine Learning workspace. No additional input is required for this section.

In [None]:
# Handle to the workspace
credential = DefaultAzureCredential()
ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace,
)

# Get a handle to the azureml registry, which hosts the built-in RAI components
registry_name = "azureml"
ml_client_registry = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    registry_name=registry_name,
)

### 5. Upload data

Finally, upload the data to AzureML. Change the names and descriptions of your datasets as you see fit.

In [None]:
train_name = "forecasting_train_mltable"
test_name = "forecasting_test_mltable"

train_description = "Forecasting example training data"
test_description = "Forecasting example testing data"

In [None]:
train_data = Data(
    path=train_data_path,
    type=AssetTypes.MLTABLE,
    description=train_description,
    name=train_name,
    version=data_version,
)
ml_client.data.create_or_update(train_data)

test_data = Data(
    path=test_data_path,
    type=AssetTypes.MLTABLE,
    description=test_description,
    name=test_name,
    version=data_version,
)
ml_client.data.create_or_update(test_data)
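
When re-running this notebook after changing the data, data_version has to be increased (see section 2). Instead of tracking the number by hand, the registered versions could be listed with `ml_client.data.list(name=...)` and the next integer chosen. The helper `next_data_version` below is a hypothetical sketch: it assumes versions are integer strings, ignores any that are not, and its behaviour for a name that has never been registered has not been verified here.

```python
def next_data_version(ml_client, name):
    """Return the next integer version, as a string, for the data asset `name`."""
    versions = [
        int(asset.version)
        for asset in ml_client.data.list(name=name)
        if str(asset.version).isdigit()  # skip non-integer version labels
    ]
    return str(max(versions) + 1) if versions else "1"


# Usage inside this notebook, once ml_client has been created in section 4:
# data_version = next_data_version(ml_client, train_name)
```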