# Milestone 1: Advertising

**Business Context:** 

A retail company, "Fashion Haven," operates multiple stores in different cities. The company invests in advertising campaigns to promote its latest collections through media channels such as TV, newspaper, and radio. It wants to understand the impact of each channel on sales revenue so that it can optimize its advertising strategy and improve overall business performance.

Currently, Fashion Haven lacks an effective method for accurately predicting the sales revenue generated by its advertising efforts. As a result, it struggles to allocate its advertising budget across media channels, leading to suboptimal returns on investment and inefficient resource allocation.

To address this business problem, Fashion Haven has collected historical data containing information on various advertising campaigns (TV, Newspaper, Radio) and their corresponding sales revenue across their different store locations. The goal is to build a robust predictive model that accurately estimates the sales revenue based on the media sources' advertising budgets, helping the company make data-driven decisions and drive business growth.


**Dataset Description:**

Each record holds the advertising expenditure per media channel and the corresponding sales. The detailed data dictionary is given below.

* TV: Advertising expenditure on TV
* Radio: Advertising expenditure on radio
* NewsPaper: Advertising expenditure on newspapers
* Sales: Target column, the amount of sales
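
Note that the data dictionary spells the newspaper column `NewsPaper`, while the processing script later in this notebook selects `Newspaper`. A quick header check before training can catch this kind of mismatch. The sketch below is illustrative only: the helper name `resolve_columns` and the inline sample row are made up for this example, and it uses the standard-library `csv` module rather than pandas so that it stays self-contained.

```python
import csv
import io

# Column names as selected by the processing script.
EXPECTED_COLUMNS = ["TV", "Radio", "Newspaper", "Sales"]


def resolve_columns(header, expected=EXPECTED_COLUMNS):
    """Map each expected name to the matching header name, ignoring case.

    Raises KeyError if an expected column has no case-insensitive match.
    """
    by_lower = {name.strip().lower(): name for name in header}
    mapping = {}
    for name in expected:
        key = name.lower()
        if key not in by_lower:
            raise KeyError(f"missing column: {name}")
        mapping[name] = by_lower[key]
    return mapping


# Hypothetical sample row, not taken from the real dataset.
sample_csv = "TV,Radio,NewsPaper,Sales\n100.0,20.0,30.0,15.0\n"
header = next(csv.reader(io.StringIO(sample_csv)))
column_map = resolve_columns(header)
print(column_map)
```

With pandas, the resulting mapping could be inverted and passed to `DataFrame.rename(columns=...)` so that the rest of the code sees one consistent spelling.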

### Connect to the Workspace

In [1]:
# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

In [2]:
# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="6793e723-756c-4c5d-84c0-812f1bb4c679",  # Provide your subscription ID
    resource_group_name="JuvlinResourceGroup",  # Provide your resource group name
    workspace_name="JuvlinWorkspace",
)
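
Hardcoding the subscription ID, resource group, and workspace name ties the notebook to one account. One possible alternative, sketched below, reads them from environment variables. The variable names (`AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, `AZURE_ML_WORKSPACE`) and the helper `workspace_settings` are assumptions made for this example, not names required by the SDK.

```python
import os


def workspace_settings(env=os.environ):
    """Collect the MLClient keyword arguments from environment variables."""
    # These variable names are an assumption for this sketch, not an Azure requirement.
    names = {
        "subscription_id": "AZURE_SUBSCRIPTION_ID",
        "resource_group_name": "AZURE_RESOURCE_GROUP",
        "workspace_name": "AZURE_ML_WORKSPACE",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Usage (requires the azure-ai-ml package and the variables above to be set):
# ml_client = MLClient(credential=credential, **workspace_settings())
```

Failing early with the list of missing variables tends to be easier to debug than an authentication error raised later by the first workspace call.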

### Create a Compute Resource to run the job

In [3]:
from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster
cpu_compute_target = "cpu-cluster-E2E"

# Let's define the Azure ML compute object with the intended parameters
cpu_cluster = AmlCompute(
    name=cpu_compute_target,
    # Azure ML Compute is the on-demand VM service
    type="amlcompute",
    # VM Family
    size="STANDARD_D2_V3",
    # Minimum running nodes when there is no job running
    min_instances=0,
    # Nodes in cluster
    max_instances=1,
    # How many seconds a node keeps running after the job terminates
    idle_time_before_scale_down=180,
    # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
    tier="Dedicated",
)

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )
except Exception:
    print("Creating a new cpu compute target...")
    # Now, we pass the object to MLClient's create_or_update method
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()
    print(
        f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
    )

You already have a cluster named cpu-cluster-E2E, we'll reuse it as is.


### Create a Custom Job Environment

In [4]:
import os

# Set the name of the directory we want to create
dependencies_dir = "./env"

# The os.makedirs() function creates a directory;
# exist_ok=True means it will not raise an exception if the directory already exists
os.makedirs(dependencies_dir, exist_ok=True)

Create a YAML file to define and register the custom job environment in the workspace.
The environment will be packaged into a Docker container at runtime.

In [5]:
%%writefile {dependencies_dir}/conda1.yaml
name: sklearn-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - scikit-learn=1.0.2
  - scipy=1.7.1
  - pip:  
      - mlflow==2.8.1
      - azureml-mlflow==1.51.0
      - azureml-inference-server-http
      - azureml-core==1.49.0
      - cloudpickle==1.6.0

Overwriting ./env/conda1.yaml


### Upload the dataset on Blob Storage as Data Asset

In [6]:
# Import the Environment class from the azure.ai.ml.entities module
from azure.ai.ml.entities import Environment


# Set the name of the custom environment we want to create
custom_env_name = "Milestone1_JFP_E2E"

# Create an Environment object with the specified properties
job_env = Environment(
    name=custom_env_name,
    description="Custom environment for machine learning task",
    conda_file=os.path.join(dependencies_dir, "conda1.yaml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
job_env = ml_client.environments.create_or_update(job_env)

# Print out some information about the registered environment
print(
    f"Environment with name {job_env.name} is registered to workspace, the environment version is {job_env.version}"
)

Environment with name Milestone1_JFP_E2E is registered to workspace, the environment version is 14


### Create a processing script to perform the data preprocessing job

In [7]:
# To use the training script, first create a directory where you will store the file.
import os

src_dir = "./src"
os.makedirs(src_dir, exist_ok=True)

In [8]:
%%writefile {src_dir}/milestone1_jfp.py

# importing necessary libraries
import argparse
import os
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

import mlflow
import mlflow.sklearn

# create an argument parser to take input arguments from command line
def main():
    parser = argparse.ArgumentParser()

    parser.add_argument("--data", type=str, help="path to input data")
    # 'mse' is deprecated since scikit-learn 1.0 in favour of 'squared_error'
    parser.add_argument('--criterion', type=str, default='squared_error',
                        help='The function to measure the quality of a split')
    parser.add_argument('--max-depth', type=int, default=None,
                        help='The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.')
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--registered_model_name", type=str, help="model name")

    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    # print input arguments
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    # load input data
    print("input data:", args.data)
    df = pd.read_csv(args.data)

    # log input hyperparameters

    mlflow.log_param('Criterion', str(args.criterion))
    mlflow.log_param('Max depth', str(args.max_depth))


    # Split data into features (X) and target (y)
    X = df[['TV', 'Radio', 'Newspaper']]
    y = df['Sales']
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=args.test_train_ratio, random_state=42)
    


    # initialize and train a decision tree regressor

    tree_model = DecisionTreeRegressor(criterion=args.criterion, max_depth=args.max_depth, random_state=42)
    tree_model = tree_model.fit(X_train, y_train)
    tree_predictions = tree_model.predict(X_test)

    # Accuracy is a classification metric, so it is not logged for this regressor.
    # (For a regressor, tree_model.score(X_test, y_test) returns R^2, not accuracy.)

    # compute and log model mse
    mse = mean_squared_error(y_test, tree_predictions)
    print(f"Mean Squared Error on test set: {mse}")
    mlflow.log_metric('Mean Squared Error', float(mse))

    # A confusion matrix applies only to classification, so none is computed here.

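    # set the name for the registered model
    # (falls back to a default when --registered_model_name is not passed)
    registered_model_name = args.registered_model_name or "advertising_decisiontree_model"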

    ##########################
    #<save and register model>
    ##########################

    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=tree_model,
        registered_model_name=registered_model_name,
        artifact_path=registered_model_name
    )

    # Saving the model to a file
    print("Saving the model via MLFlow")
    mlflow.sklearn.save_model(
        sk_model=tree_model,
        path=os.path.join(registered_model_name, 'trained_model'),
    )
    ###########################
    #</save and register model>
    ###########################
   
    # end MLflow tracking
    mlflow.end_run()

if __name__ == '__main__':
    main()


Overwriting ./src/milestone1_jfp.py
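
The script logs `Mean Squared Error` as its only evaluation metric. As a reminder of what that number means, MSE is the average of the squared differences between actual and predicted values. The following is a minimal pure-Python sketch of the same calculation that `sklearn.metrics.mean_squared_error` performs with its default arguments. The function name and the sample values are invented for illustration and are not results from the advertising data.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up values for illustration only.
actual = [10.0, 12.0, 15.0]
predicted = [11.0, 12.0, 13.0]

mse = mean_squared_error_manual(actual, predicted)
rmse = mse ** 0.5  # root of MSE, expressed in the same units as Sales
print(mse, rmse)
```

Because MSE is in squared units of the target, the root (RMSE) is often easier to interpret when comparing against typical sales values.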


### Configure the processing job

In [9]:
# Import the necessary modules
from azure.ai.ml import command
from azure.ai.ml import Input

# Define a new AML job using the `command` function
job = command(
    inputs=dict(
        data=Input(
            type="uri_file",
            path="./data/advertising_raw.csv",  # The path to the input data file
        ),
        test_train_ratio=0.3,  # The fraction of the data held out for testing
        criterion="squared_error",  # The split-quality criterion ("mse" is deprecated since scikit-learn 1.0)
        max_depth=2,  # The maximum depth of the decision tree
    ),
    # Specify the directory containing the code to be run in the job
    code="./src/",
    # Specify the command to be run in the job, including the input data and parameters as command line arguments
    command="python milestone1_jfp.py --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} --criterion ${{inputs.criterion}} --max-depth ${{inputs.max_depth}}",
    # Specify the environment to be used for the job
    environment="Milestone1_JFP_E2E@latest",
    # Specify the compute target to be used for the job
    compute="cpu-cluster-E2E",
    # Specify the name of the experiment for the job
    experiment_name="train_decision_tree_advertising_prediction",
    # Specify the display name for the job
    display_name="decision_tree_advertising_prediction",
)
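
The `command` string passes each job input to the script as a command-line flag. Note that the flag is spelled `--max-depth` with a dash, while the script reads it as `args.max_depth`: `argparse` converts dashes in option names to underscores. The sketch below uses a reduced copy of the script's parser and a simulated argument list to show how the values arrive. The data path in it is a hypothetical stand-in for whatever path Azure ML substitutes for `${{inputs.data}}` at runtime.

```python
import argparse


def build_parser():
    """Reduced copy of the script's parser, for illustration."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--max-depth", type=int, default=None)
    parser.add_argument("--test_train_ratio", type=float, default=0.25)
    return parser


# Simulated argument list after the ${{inputs.*}} placeholders are filled in.
# The data path is a hypothetical stand-in.
argv = [
    "--data", "/tmp/advertising_raw.csv",
    "--test_train_ratio", "0.3",
    "--max-depth", "2",
]

args = build_parser().parse_args(argv)
print(args.data, args.test_train_ratio, args.max_depth)
```

Values arrive as strings and are converted by the `type=` callables, so `max_depth` reaches the model as an `int` and `test_train_ratio` as a `float`.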

### Run the processing job

In [10]:
# ml_client.create_or_update will create a new job if it does not exist or update the existing job if it does
ml_client.create_or_update(job)


| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| train_decision_tree_advertising_prediction | amiable_spinach_61qj5bkgsb | command | Starting | Link to Azure Machine Learning studio |


### Create a training script to perform the training job

### Configure the training job

### Run the training job

### Define the parameter space for hyperparameter tuning

### Configure the sweep job for tuning

### Run the sweep job

### Extract the run that gave best modeling results

### Register the best model

### Configure an Endpoint

### Create an Endpoint

### Create a deployment script to perform model deployment

### Configure the deployment

### Delete the Endpoint

**Important!** An endpoint is a live resource that keeps running and consuming compute so that it can serve predictions at any time. Unless you are making real-time predictions on streaming data, delete your endpoints after use.
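
Since this section does not yet contain a deletion cell, the following is a minimal sketch of one way to do it, assuming the endpoint is a managed online endpoint created through the same `ml_client`. The helper name `delete_endpoint` and the endpoint name in the usage comment are hypothetical.

```python
def delete_endpoint(ml_client, endpoint_name):
    """Delete an online endpoint and wait for the operation to finish."""
    # begin_delete starts a long-running operation and returns a poller
    poller = ml_client.online_endpoints.begin_delete(name=endpoint_name)
    # result() blocks until the deletion has completed
    poller.result()
    return endpoint_name


# Usage (hypothetical endpoint name; requires an authenticated MLClient):
# delete_endpoint(ml_client, "advertising-endpoint")
```

Waiting on the poller makes the cell fail visibly if the deletion does not go through, rather than leaving an endpoint running unnoticed.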
