# Near real-time sales forecasting leveraging Synapse Link for Azure Cosmos DB

Microsoft Retail Store has built its new-age supply chain management system on Azure Cosmos DB.

The supply chain management system tracks retail operations at thousands of locations around the world, along with inventory for the hundreds of Microsoft product SKUs sold.

This notebook shows how Synapse Link for Azure Cosmos DB enables near real-time analytics over operational data, without ETL and without impacting transactional workloads.

In particular, the **goal here is to build a sales forecasting model that helps store locations adjust their inventory planning in near real time based on fluctuations in demand.**

<img src="https://cosmosnotebooksdata.blob.core.windows.net/notebookdata/store.PNG" alt="Surface Device" width="75%"/>

&nbsp;

## Environment Creation

This is a Synapse notebook, and it runs without any customization in Azure Synapse Analytics workspaces. Using the Azure Machine Learning SDK, it runs an automated ML (AutoML) process within an Azure Synapse Spark pool, reading data from the Azure Cosmos DB analytical store.

The required steps are listed below. Please check service availability and try to place all of these services in the same Azure region.

+ Create an Azure Synapse workspace using [this](https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-workspace) tutorial.
+ Create an Azure Machine Learning workspace using [this](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace) tutorial.
+ Create an Azure Automated Machine Learning experiment using [this](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-automated-ml-for-ml-models) tutorial. You don't need to run it or create a compute cluster.
+ Create an Azure Cosmos DB account using [this](https://docs.microsoft.com/en-us/azure/cosmos-db/create-cosmosdb-resources-portal) tutorial.
+ Load the data using the Batch Ingestion notebook.

## Leverage the power of Spark SQL to join & aggregate operational data across Cosmos DB containers

In [None]:
%%sql
create database if not exists SurfaceSalesDB

In [None]:
%%sql
create table if not exists SurfaceSalesDB.RetailSales using cosmos.olap options (
    spark.synapse.linkedService 'RetailSalesDemoDB',
    spark.cosmos.preferredRegions 'West US 2',
    spark.cosmos.container 'RetailSales'
)

In [None]:
%%sql
create table if not exists SurfaceSalesDB.StoreDemographics using cosmos.olap options (
    spark.synapse.linkedService 'RetailSalesDemoDB',
    spark.cosmos.preferredRegions 'West US 2',
    spark.cosmos.container 'StoreDemoGraphics'
)

In [None]:
%%sql
create table if not exists SurfaceSalesDB.Product using cosmos.olap options (
    spark.synapse.linkedService 'RetailSalesDemoDB',
    spark.cosmos.preferredRegions 'West US 2',
    spark.cosmos.container 'Products'
)

In [None]:
data = spark.sql("""
    select a.storeId
         , b.productCode
         , b.wholeSaleCost
         , b.basePrice
         , c.ratioAge60
         , c.collegeRatio
         , c.income
         , c.highIncome150Ratio
         , c.largeHH
         , c.minoritiesRatio
         , c.more1FullTimeEmployeeRatio
         , c.distanceNearestWarehouse
         , c.salesNearestWarehousesRatio
         , c.avgDistanceNearest5Supermarkets
         , c.salesNearest5StoresRatio
         , a.quantity
         , a.logQuantity
         , a.advertising
         , a.price
         , a.weekStarting
    from surfacesalesDB.retailsales a
    left join surfacesalesDB.product b
      on a.productcode = b.productcode
    left join surfacesalesDB.storedemographics c
      on a.storeId = c.storeId
    order by a.weekStarting, a.storeId, b.productCode
""")

display(data)
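
The query above left-joins each sales row to its product and store attributes, so a sale whose `productCode` or `storeId` has no match still appears in the result, with nulls in the joined columns. The following is a minimal plain-Python sketch of that behaviour; the helper `left_join` and all row values are hypothetical and are not taken from the demo containers.

```python
def left_join(left_rows, right_rows, key, right_cols):
    """Left-join two lists of dicts on `key`, filling misses with None."""
    # Index the right side by key; assumes keys are unique on that side.
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for col in right_cols:
            merged[col] = match.get(col)  # None when there is no match
        joined.append(merged)
    return joined


# Hypothetical toy rows, not real RetailSales or Products data.
sales = [
    {"storeId": 2, "productCode": "A", "quantity": 10},
    {"storeId": 5, "productCode": "Z", "quantity": 4},
]
products = [{"productCode": "A", "basePrice": 99.0}]

joined_rows = left_join(sales, products, "productCode", ["basePrice"])
for row in joined_rows:
    print(row)
```

Rows with nulls in the product or demographic columns may be worth inspecting before training, since they usually point to missing reference data.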

## Leverage the power of Azure Machine Learning's AutoML to build a Forecasting Model

### Setup 
Let's start with the AML environment setup. Replace every "your-AML-%" placeholder with your own information, which you can find in your AML workspace in the Azure portal.


In [None]:
import logging
import os

import azureml.core
import numpy as np
import pandas as pd
from azureml.core import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig

subscription_id = os.getenv("SUBSCRIPTION_ID", default="your-AML-subscription-id")
resource_group = os.getenv("RESOURCE_GROUP", default="your-AML-resource-group")
workspace_name = os.getenv("WORKSPACE_NAME", default="your-AML-workspace-name")
workspace_region = os.getenv("WORKSPACE_REGION", default="your-AML-region")

ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
ws.write_config()
    
experiment_name = 'your-AML-experiment-name'
experiment = Experiment(ws, experiment_name)
output = {}
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['SKU'] = ws.sku
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Run History Name'] = experiment_name
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T
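
Because the setup cell falls back to "your-AML-%" defaults when the environment variables are missing, a forgotten placeholder only surfaces later as a workspace lookup error. One possible guard is a small check like the sketch below; the helper `find_unset_placeholders` and the example values are hypothetical and are not part of the Azure ML SDK.

```python
def find_unset_placeholders(settings, prefix="your-AML-"):
    """Return the names of settings that still hold a placeholder value."""
    return sorted(name for name, value in settings.items()
                  if str(value).startswith(prefix))


# Hypothetical values; in the notebook these come from os.getenv(...).
example_settings = {
    "subscription_id": "your-AML-subscription-id",
    "resource_group": "rg-retail-demo",
    "workspace_name": "your-AML-workspace-name",
}

unset = find_unset_placeholders(example_settings)
if unset:
    print("Still using placeholders for:", ", ".join(unset))
```

Running such a check before constructing the `Workspace` object would make the failure easier to diagnose.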

### Data Preparation - Feature engineering, Splitting train & test datasets


In [None]:
# Initial variables
time_column_name = 'weekStarting'
grain_column_names = ['storeId', 'productCode']
target_column_name = 'quantity'
use_stores = [2, 5, 8, 71, 102]
n_test_periods = 20


#DataFrame
df = data.toPandas()
df[time_column_name] = pd.to_datetime(df[time_column_name])
df['storeId'] = pd.to_numeric(df['storeId'])


# Time Series
data_subset = df[df.storeId.isin(use_stores)]
nseries = data_subset.groupby(grain_column_names).ngroups
print('Data subset contains {0} individual time-series.'.format(nseries))

# Group by date
def split_last_n_by_grain(df, n):
    """Group df by grain and split on last n rows for each group."""
    df_grouped = (df.sort_values(time_column_name) # Sort by ascending time
                  .groupby(grain_column_names, group_keys=False))
    df_head = df_grouped.apply(lambda dfg: dfg.iloc[:-n])
    df_tail = df_grouped.apply(lambda dfg: dfg.iloc[-n:])
    return df_head, df_tail

# splitting
train, test = split_last_n_by_grain(data_subset, n_test_periods)
print(len(train),len(test))
train.to_csv('./SurfaceSales_train.csv', index=False, header=True)
test.to_csv('./SurfaceSales_test.csv', index=False, header=True)
datastore = ws.get_default_datastore()
datastore.upload_files(files=['./SurfaceSales_train.csv', './SurfaceSales_test.csv'],
                       target_path='dataset/', overwrite=True, show_progress=True)

# loading the train dataset
from azureml.core.dataset import Dataset
train_dataset = Dataset.Tabular.from_delimited_files(path=datastore.path('dataset/SurfaceSales_train.csv'))
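
To make the split logic concrete, here is a plain-Python sketch of the same idea as `split_last_n_by_grain`: group rows by grain, sort each group by time, and hold out the last `n` rows of every series. The function `split_last_n` and the toy rows are hypothetical and only mirror the notebook's column names.

```python
from collections import defaultdict


def split_last_n(rows, grain_keys, time_key, n):
    """Split rows into (head, tail), holding out the last n rows per series."""
    if n <= 0:
        raise ValueError("n must be positive")
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in grain_keys)].append(row)
    head, tail = [], []
    for series in groups.values():
        series.sort(key=lambda r: r[time_key])  # ascending time
        head.extend(series[:-n])
        tail.extend(series[-n:])
    return head, tail


# Hypothetical data: 2 series (stores 2 and 5) with 5 weekly rows each.
toy_rows = [
    {"storeId": s, "productCode": "A", "week": w, "quantity": 10 * s + w}
    for s in (2, 5) for w in range(5)
]
toy_train, toy_test = split_last_n(toy_rows, ["storeId", "productCode"], "week", 2)
print(len(toy_train), len(toy_test))
```

Each series contributes its own final `n` periods to the test set, so no test row precedes a training row from the same series.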

### Training the Models using AutoML Forecasting

Note that **compute_target** is commented out, which means model training runs locally on the Synapse Spark pool.

In [None]:
# Parameters
time_series_settings = {
    'time_column_name': time_column_name,
    'grain_column_names': grain_column_names,
    'max_horizon': n_test_periods
}

# Config
automl_config = AutoMLConfig(task='forecasting',
                             debug_log='automl_ss_sales_errors.log',
                             primary_metric='normalized_mean_absolute_error',
                             experiment_timeout_hours=0.5,
                             training_data=train_dataset,
                             label_column_name=target_column_name,
                             #compute_target=compute_target,
                             enable_early_stopping=True,
                             n_cross_validations=3,
                             verbosity=logging.INFO,
                             **time_series_settings)

# Running the training
remote_run = experiment.submit(automl_config, show_output=True)
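
The primary metric chosen above, `normalized_mean_absolute_error`, is described in the Azure ML documentation as the mean absolute error divided by the range of the data. The sketch below computes that quantity over a single list of values. It is an illustration under that assumption, not AutoML's implementation (which, for forecasting, may normalize per time series); the function name and the numbers are hypothetical.

```python
def normalized_mae(actual, predicted):
    """MAE divided by the range (max - min) of the actual values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("actual values have zero range")
    return mae / value_range


# Hypothetical weekly quantities and forecasts.
actual = [100.0, 120.0, 80.0, 140.0]
predicted = [110.0, 115.0, 90.0, 130.0]
nmae = normalized_mae(actual, predicted)
print(round(nmae, 4))
```

Normalizing by the range makes the error comparable across series with very different sales volumes, which is useful when stores and products vary widely in scale.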

### Retrieving the Best Model and Forecasting

In [None]:
# Retrieving the best model
best_run, fitted_model = remote_run.get_output()
print(fitted_model.steps)
model_name = best_run.properties['model_name']
print(model_name)

# Forecasting based on test dataset
# Work on a copy so that popping the target column does not mutate `test`
X_test = test.copy()
y_test = X_test.pop(target_column_name).values
X_test[time_column_name] = pd.to_datetime(X_test[time_column_name])
y_predictions, X_trans = fitted_model.forecast(X_test)

### Plotting the Results

In [None]:
import pandas as pd
import numpy as np
from pandas.tseries.frequencies import to_offset


def align_outputs(y_predicted, X_trans, X_test, y_test, target_column_name,
                  predicted_column_name='predicted',
                  horizon_colname='horizon_origin'):
    """
    Demonstrates how to get the output aligned to the inputs
    using pandas indexes. Helps understand what happened if
    the output's shape differs from the input shape, or if
    the data got re-sorted by time and grain during forecasting.

    Typical causes of misalignment are:
    * we predicted some periods that were missing in actuals -> drop from eval
    * model was asked to predict past max_horizon -> increase max horizon
    * data at start of X_test was needed for lags -> provide previous periods
    """

    if (horizon_colname in X_trans):
        df_fcst = pd.DataFrame({predicted_column_name: y_predicted,
                                horizon_colname: X_trans[horizon_colname]})
    else:
        df_fcst = pd.DataFrame({predicted_column_name: y_predicted})

    # y and X outputs are aligned by forecast() function contract
    df_fcst.index = X_trans.index

    # align original X_test to y_test
    X_test_full = X_test.copy()
    X_test_full[target_column_name] = y_test

    # X_test_full's index does not include origin, so reset for merge
    df_fcst.reset_index(inplace=True)
    X_test_full = X_test_full.reset_index().drop(columns='index')
    together = df_fcst.merge(X_test_full, how='right')

    # drop rows where prediction or actuals are nan
    # happens because of missing actuals
    # or at edges of time due to lags/rolling windows
    clean = together[together[[target_column_name,
                               predicted_column_name]].notnull().all(axis=1)]
    return clean


df_all = align_outputs(y_predictions, X_trans, X_test, y_test, target_column_name)

from azureml.automl.core._vendor.automl.client.core.common import metrics
from matplotlib import pyplot as plt
from automl.client.core.common import constants

# use automl metrics module
scores = metrics.compute_metrics_regression(
    df_all['predicted'],
    df_all[target_column_name],
    list(constants.Metric.SCALAR_REGRESSION_SET),
    None, None, None)

print("[Test data scores]\n")
for key, value in scores.items():    
    print('{}:   {:.3f}'.format(key, value))
    
# Plot outputs
#%matplotlib inline
test_pred = plt.scatter(df_all[target_column_name], df_all['predicted'], color='b')
test_test = plt.scatter(df_all[target_column_name], df_all[target_column_name], color='g')
plt.legend((test_pred, test_test), ('prediction', 'truth'), loc='upper left', fontsize=8)
plt.show()
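
The scores above pool every store and product together. A per-series breakdown can show which grains are forecast worst. The following plain-Python sketch uses hypothetical rows shaped like `df_all`: the column names follow the notebook, while the values and the helper `mae_by_series` are made up for illustration.

```python
from collections import defaultdict


def mae_by_series(rows, grain_keys, actual_key, predicted_key):
    """Mean absolute error for each grain (time series)."""
    errors = defaultdict(list)
    for row in rows:
        grain = tuple(row[k] for k in grain_keys)
        errors[grain].append(abs(row[actual_key] - row[predicted_key]))
    return {grain: sum(errs) / len(errs) for grain, errs in errors.items()}


# Hypothetical aligned results for two series.
toy_results = [
    {"storeId": 2, "productCode": "A", "quantity": 100, "predicted": 90},
    {"storeId": 2, "productCode": "A", "quantity": 120, "predicted": 126},
    {"storeId": 5, "productCode": "A", "quantity": 50, "predicted": 51},
]
per_series = mae_by_series(toy_results, ["storeId", "productCode"],
                           "quantity", "predicted")
worst = max(per_series, key=per_series.get)
print(worst, per_series[worst])
```

With pandas, the same idea can be expressed as a group-by over the grain columns of `df_all`.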

## Closing

At this point you should have a chart comparing the actual values with the predictions made by Azure AutoML. Suggested next steps are:

1. [Deploy](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where) your model.
1. [Collect and evaluate](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-enable-data-collection) model data.
1. [Create](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline) an ML pipeline.
1. [Create](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-event-grid) event-driven CI/CD workflows.

