# Advertising

**Business Context:** 

Fashion Haven is a retail company that operates multiple stores in different cities. The company invests in advertising campaigns to promote its latest collections through media channels such as TV, newspaper, and radio. It wants to understand the impact of each channel on sales revenue so that it can optimize its advertising strategy and improve overall business performance.

Currently, Fashion Haven lacks an effective method to accurately predict the sales revenue generated by its advertising efforts. As a result, it struggles to allocate its advertising budget across media channels, leading to suboptimal returns on investment and inefficient resource allocation.

To address this business problem, Fashion Haven has collected historical data containing information on various advertising campaigns (TV, Newspaper, Radio) and their corresponding sales revenue across their different store locations. The goal is to build a robust predictive model that accurately estimates the sales revenue based on the media sources' advertising budgets, helping the company make data-driven decisions and drive business growth.


**Dataset Description:**

The data contains the different attributes of the advertising business. The detailed data dictionary is given below.

* TV: Advertising expenditure on TV
* Radio: Advertising expenditure on radio
* NewsPaper: Advertising expenditure on newspapers
* Sales: Target Column - Amount of Sales
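
Before training, it can help to confirm that the CSV header matches this data dictionary. The sketch below is a minimal, standard-library check. The helper name `missing_columns` and the inline sample rows are hypothetical, and the comparison ignores case because the dictionary spells the column `NewsPaper` while the training script later in this notebook refers to `Newspaper`.

```python
import csv
import io

# Column names taken from the data dictionary above
EXPECTED_COLUMNS = ["TV", "Radio", "NewsPaper", "Sales"]


def missing_columns(csv_text, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from the CSV header.

    Matching ignores case and surrounding whitespace.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {name.strip().lower() for name in header}
    return [col for col in expected if col.lower() not in present]


# Hypothetical sample rows, used only to exercise the helper
sample = "TV,Radio,Newspaper,Sales\n100.0,20.0,30.0,15.0\n"
print(missing_columns(sample))             # nothing missing
print(missing_columns("TV,Radio\n1,2\n"))  # NewsPaper and Sales missing
```

In practice the text would come from reading `data/advertising_raw.csv` rather than an inline string.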

In [2]:
!pip install mlflow



### Connect to the workspace

In [3]:
# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

from azure.ai.ml import Input, Output
from azure.ai.ml import command
from azure.ai.ml.sweep import Choice

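On compute instances inside the Azure ML workspace, DefaultAzureCredential picks up the default Azure credentials automatically. Setting exclude_interactive_browser_credential=False additionally allows an interactive browser login as a fallback, for example when the notebook runs outside the workspace.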

In [4]:
credential = DefaultAzureCredential(exclude_interactive_browser_credential=False)

In [5]:
# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="4322d79c-475f-471d-bd6d-52528e4d32ee", # Assigned subscription ID 
    resource_group_name="tasnim_rg", # Assigned Resource Group 
    workspace_name="maycohortmlops", #Assigned workspace name
)

### Create a compute resource to run the job
Azure Machine Learning needs a compute resource to run a job. This resource can be single or multi-node machines with Linux or Windows OS, or a specific compute fabric like Spark.

A basic compute cluster is needed for this task, so we pick the Standard_D2s_v3 VM size (2 vCPUs and 8 GB RAM) to create the Azure Machine Learning compute.

In [6]:
from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster
cpu_compute_target = "cpu-cluster-MP1"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_D2s_V3",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Maximum number of nodes in the cluster
        max_instances=1,
        # Seconds a node stays idle after a job finishes before it is scaled down
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper, but jobs may be preempted
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's create_or_update method
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()

print(
    f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
)

Creating a new cpu compute target...
AMLCompute with name cpu-cluster-MP1 is created, the compute size is STANDARD_D2s_V3


### Upload the dataset on Blob Storage as Data Asset

Let's explore the process of registering an external dataset as a data asset in the Azure ML environment. In this scenario, we will copy a local dataset to Azure blob storage and then proceed to register the file as a data asset.

In [7]:
local_data_path = 'data/advertising_raw.csv'

Azure provides a convenient way to manage datasets using a feature called data assets. Data assets are built on top of cloud fsspec (file system specification) and offer additional functionalities such as tracking and versioning of different datasets.

Unlike the actual data files stored in blob storage, data assets act as a lightweight wrapper that includes valuable metadata without incurring any extra cost. This allows us to register a dataset and easily update its version whenever new data is added.

In [8]:
data_asset = Data(
    path=local_data_path,
    type=AssetTypes.URI_FILE,
    description="A data set containing the information on various advertising campaigns (TV, Newspaper, Radio) and their corresponding sales revenue",
    name="advertising-data",
    version="1"
)

In [9]:
ml_client.data.create_or_update(data_asset)

Uploading advertising_raw.csv (< 1 MB): 100%|██████████| 4.06k/4.06k [00:00<00:00, 317kB/s]



Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'advertising-data', 'description': 'A data set containing the information on various advertising campaigns (TV, Newspaper, Radio) and their corresponding sales revenue', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/4322d79c-475f-471d-bd6d-52528e4d32ee/resourceGroups/tasnim_rg/providers/Microsoft.MachineLearningServices/workspaces/maycohortmlops/data/advertising-data/versions/1', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/azurekernelmlops2024/code/Users/niger.tasnim21/Milestoneproject1', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7fd8a4233d90>, 'serialize': <msrest.serialization.Serializer object at 0x7fd8a4233220>, 'version': '1', 'latest_version': None, 'path': 'azureml://subsc

##### Checking for registered data assets

In [10]:
for registered_data in ml_client.data.list():
    print(registered_data.name)

Updated_diamonds_data
Updated_diamind_data
advertising-data


### Create a processing script to perform the data preprocessing job

##### Creating a folder "src" to store the processing script

In [11]:
import os
train_src_dir = "./src"
os.makedirs(train_src_dir, exist_ok=True)

### Write the processing script

In [12]:
%%writefile {train_src_dir}/pre_process.py
import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.preprocessing import StandardScaler

def main():
 """Main function of the script."""

 # input and output arguments
 parser = argparse.ArgumentParser()
 parser.add_argument("--data", type=str, help="path to input data")
 parser.add_argument("--output", type=str, help="path to output data")
 args = parser.parse_args()
 
 # Start Logging
 mlflow.start_run()

 # enable autologging
 mlflow.sklearn.autolog()

 ###################
 #<prepare the data>
 ###################
 
 print("input data:", args.data)
 
 data = pd.read_csv(args.data)


 ###################
 #<processing>
 ###################

 # Select the numerical columns; note that this includes the target, Sales
 numerical_columns = data.select_dtypes(include=['float64', 'int64']).columns

 # Standardize the numerical columns (zero mean, unit variance)
 scaler = StandardScaler()
 data[numerical_columns] = scaler.fit_transform(data[numerical_columns])

 # Write the processed data to the job's output folder
 processed_data_path = os.path.join(args.output, 'advertising_sales_processed_data.csv')
 data.to_csv(processed_data_path, index=False)

 # Stop Logging
 mlflow.end_run()

if __name__ == "__main__":
 main()

Writing ./src/pre_process.py
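
`StandardScaler` replaces each value x with (x - mean) / std, using the population standard deviation of the column. Because `Sales` is numeric, the script above standardizes the target along with the features, so a model trained on the processed file predicts standardized sales rather than sales in the original units. The sketch below reproduces the transform with the standard library. The `standardize` helper and the sample values are illustrative assumptions, not part of the notebook.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores using the population standard deviation,
    the same formula StandardScaler applies column by column."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical spend values, for illustration only
tv_spend = [10.0, 20.0, 30.0, 40.0]
scaled = standardize(tv_spend)
print(scaled)
```

If predictions are needed in the original units, one option is to leave `Sales` out of the scaled columns, or to keep the fitted scaler so its transform can be inverted later.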


### Run the Processing job

In [13]:
datastore_path="azureml://datastores/workspaceblobstore/paths/advertising_sales_data"

In [14]:
my_job_inputs = {
    "input_data": Input(type=AssetTypes.URI_FILE, 
                        path="azureml:advertising-data:1")
}

my_job_outputs = {
    "output_datastore": Output(type=AssetTypes.URI_FOLDER,
                               path=datastore_path)
}
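
The input path `azureml:advertising-data:1` is the short form `azureml:<name>:<version>` of the data asset registered earlier. The helpers below are a hypothetical convenience, not part of the Azure SDK, that build and split such references so the name and version do not have to be hard-coded in several places.

```python
def asset_reference(name, version):
    """Build the short-form reference azureml:<name>:<version>."""
    return f"azureml:{name}:{version}"


def parse_asset_reference(reference):
    """Split azureml:<name>:<version> into (name, version)."""
    prefix, sep, rest = reference.partition(":")
    if prefix != "azureml" or not sep:
        raise ValueError(f"Not an azureml reference: {reference!r}")
    name, sep, version = rest.rpartition(":")
    if not sep or not name or not version:
        raise ValueError(f"Expected azureml:<name>:<version>, got {reference!r}")
    return name, version


ref = asset_reference("advertising-data", 1)
print(ref)
print(parse_asset_reference(ref))
```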

##### Configure the Processing job

In [15]:
job = command(
    code="./src/", # location of source code
    command="python pre_process.py --data ${{inputs.input_data}} --output ${{outputs.output_datastore}}",
    inputs= my_job_inputs,
    outputs=my_job_outputs,
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",
    compute="cpu-cluster-MP1",
    experiment_name="12-06-2024-advertising-data-preprocessing-001",
    display_name="advertising_data_processing-001",
)
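
In the command string, `${{inputs.input_data}}` and `${{outputs.output_datastore}}` are placeholders that Azure ML replaces with the resolved locations of the job's inputs and outputs when the job runs on the compute target. The names inside the placeholders must match the keys of the `inputs` and `outputs` dictionaries. The toy resolver below only illustrates that substitution. It is not how the service is implemented, and the paths are made-up examples.

```python
import re

PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def render_command(template, inputs, outputs):
    """Replace ${{inputs.x}} / ${{outputs.y}} with values from the given dicts."""
    sources = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        kind, key = match.group(1), match.group(2)
        return str(sources[kind][key])  # KeyError if the name was never declared

    return PLACEHOLDER.sub(substitute, template)


template = (
    "python pre_process.py --data ${{inputs.input_data}} "
    "--output ${{outputs.output_datastore}}"
)
# Hypothetical local paths standing in for the resolved locations
rendered = render_command(
    template,
    inputs={"input_data": "/tmp/in/advertising_raw.csv"},
    outputs={"output_datastore": "/tmp/out"},
)
print(rendered)
```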

##### Submit the processing job

In [16]:
ml_client.create_or_update(job)


| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| 12-06-2024-advertising-data-preprocessing-001 | ivory_gyro_skyswr4s5l | command | Starting | Link to Azure Machine Learning studio |


### Create a training script to perform the training job

In [17]:
%%writefile {train_src_dir}/main.py

import argparse

import mlflow
import mlflow.sklearn
import pandas as pd

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def main():
 # Start the MLflow run; it is closed by mlflow.end_run() at the end of main()
 mlflow.start_run()

 parser = argparse.ArgumentParser()
 parser.add_argument("--data", type=str, help="path to train data")
 parser.add_argument("--n_estimators", required=False, default=100, type=int)
 parser.add_argument("--learning_rate", required=False, default=0.1, type=float)

 args = parser.parse_args()

 df = pd.read_csv(args.data)
 
 target = 'Sales'
 numeric_features = ['TV','Radio', 'Newspaper']
 #categorical_features = []

 X = df.drop([target], axis=1)
 y = df[target]

 X_train, X_test, y_train, y_test = train_test_split(
 X, y, test_size=0.2, random_state=42
 )

 model_gbr = GradientBoostingRegressor(
 n_estimators=args.n_estimators,
 learning_rate=args.learning_rate
 )

 model_pipeline = make_pipeline(model_gbr)

 model_pipeline.fit(X_train, y_train)
 
 rsq = model_pipeline.score(X_test, y_test)

 mlflow.log_metric("RSquared", float(rsq))

 print("Registering model pipeline")

 mlflow.sklearn.log_model(
 sk_model=model_pipeline,
 registered_model_name="gbr-adv-sales-predictor",
 artifact_path="gbr-adv-sales-predictor"
 )

 mlflow.end_run()

if __name__ == '__main__':
 main()

Writing ./src/main.py
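
For a regressor, `model_pipeline.score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot, which is the value the script logs as `RSquared`. A minimal standard-library sketch of the same formula follows. The `r_squared` helper and the sample numbers are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, y_true))                # perfect fit
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
```

A value of 1.0 is a perfect fit, 0.0 matches always predicting the mean, and negative values are worse than that baseline.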


### Configure the training job

In [18]:
train_job = command(
    inputs={
        "data": Input(type="uri_file", path="azureml://datastores/workspaceblobstore/paths/advertising_sales_data/advertising_sales_processed_data.csv")
    },
    code="src/main.py",
    command="python main.py --data ${{inputs.data}}",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    display_name="sales-adv-training-001",
    experiment_name="12-06-2024-sales-adv-training-001",
    compute= "cpu-cluster-MP1",
)

### Run the training job

In [19]:
ml_client.create_or_update(train_job)

Uploading main.py (< 1 MB): 100%|██████████| 1.49k/1.49k [00:00<00:00, 55.7kB/s]



| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| 12-06-2024-sales-adv-training-001 | blue_diamond_snjt5ns2pp | command | Starting | Link to Azure Machine Learning studio |


### Define the parameter space for hyperparameter tuning

To further improve the model's accuracy, tune its hyperparameters using Azure Machine Learning's sweep capabilities.

In [20]:
train_job = command(
    inputs={
        "data": Input(type="uri_file", path="azureml://datastores/workspaceblobstore/paths/advertising_sales_data/advertising_sales_processed_data.csv"),
        "n_estimators": 100,
        "learning_rate": 0.1
    },
    code="src/main.py",
    command="python main.py --data ${{inputs.data}} --n_estimators ${{inputs.n_estimators}} --learning_rate ${{inputs.learning_rate}}",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute= "cpu-cluster-MP1",
)

#### To tune the model's hyperparameters:

- Define the parameter space to search during training.

- Replace the parameters (n_estimators and learning_rate) passed to the training job with special inputs from the azure.ai.ml.sweep package.

In [21]:
job_for_sweep = train_job(
    n_estimators=Choice(values=[100, 200, 300, 400]),
    learning_rate=Choice(values=[0.001, 0.005, 0.05, 0.1, 0.5])
)

### Configure the sweep job for tuning

In [22]:
# sampling_algorithm specifies the search algorithm to use for hyperparameter tuning.
# primary_metric specifies the metric to optimize during hyperparameter tuning.
# goal specifies whether to maximize or minimize the primary metric.
# max_total_trials specifies the maximum number of trials to run during hyperparameter tuning.
# max_concurrent_trials specifies the maximum number of trials to run concurrently during hyperparameter tuning.
sweep_job = job_for_sweep.sweep(
    sampling_algorithm="bayesian",
    primary_metric="RSquared",
    goal="Maximize",
    max_total_trials=6,
    max_concurrent_trials=3
)
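
The two `Choice` lists define a discrete grid of 4 × 5 = 20 combinations, while `max_total_trials=6` lets the sweep evaluate at most six of them. The snippet below simply enumerates the grid to make that ratio concrete. It does not reproduce how the Bayesian sampler picks trials.

```python
from itertools import product

n_estimators = [100, 200, 300, 400]
learning_rate = [0.001, 0.005, 0.05, 0.1, 0.5]
max_total_trials = 6

# Every (n_estimators, learning_rate) pair the sweep could draw from
grid = list(product(n_estimators, learning_rate))
coverage = max_total_trials / len(grid)

print(len(grid))          # number of distinct combinations
print(f"{coverage:.0%}")  # largest share of the grid the sweep can visit
```

Raising `max_total_trials` widens the coverage at the cost of more compute time on the cluster.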

In [23]:
sweep_job.experiment_name = "13-06-2024-sales_advertising-001"
sweep_job.display_name = "sales_advertising_tuning-001"
sweep_job.description = "Run a hyperparameter sweep job for GBR"

### Run the sweep job

Now run the sweep job, which sweeps over the training job.

In [24]:
# create or update the sweep job
returned_sweep_job = ml_client.create_or_update(sweep_job) 
# stream the output and wait until the job is finished
ml_client.jobs.stream(returned_sweep_job.name)
# refresh the latest status of the job after streaming
returned_sweep_job = ml_client.jobs.get(name=returned_sweep_job.name)

RunId: dynamic_turtle_j8xh572jcq
Web View: https://ml.azure.com/runs/dynamic_turtle_j8xh572jcq?wsid=/subscriptions/4322d79c-475f-471d-bd6d-52528e4d32ee/resourcegroups/tasnim_rg/workspaces/maycohortmlops

Streaming azureml-logs/hyperdrive.txt

[2024-06-16T00:06:58.937427][GENERATOR][INFO]Trying to sample '3' jobs from the hyperparameter space
[2024-06-16T00:06:59.601637][GENERATOR][INFO]Successfully sampled '3' jobs, they will soon be submitted to the execution target.
[2024-06-16T00:06:59.8316487Z][SCHEDULER][INFO]Scheduling job, id='dynamic_turtle_j8xh572jcq_2' 
[2024-06-16T00:06:59.8302900Z][SCHEDULER][INFO]Scheduling job, id='dynamic_turtle_j8xh572jcq_1' 
[2024-06-16T00:07:00.1904889Z][SCHEDULER][INFO]Successfully scheduled a job. Id='dynamic_turtle_j8xh572jcq_1' 
[2024-06-16T00:07:00.3983533Z][SCHEDULER][INFO]Successfully scheduled a job. Id='dynamic_turtle_j8xh572jcq_2' 
[2024-06-16T00:07:07.3324871Z][SCHEDULER][INFO]Scheduling job, id='dynamic_turtle_j8xh572jcq_0' 
[2024-06-16T00

### Extract the run that gave best modeling results

In [25]:
from azure.ai.ml.entities import Model
if returned_sweep_job.status == "Completed":
    # First, get the run that gave the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]
    # Then build a model entity from that run's artifacts
    model = Model(
        # main.py logs the model under the artifact path "gbr-adv-sales-predictor"
        path="azureml://jobs/{}/outputs/artifacts/paths/gbr-adv-sales-predictor/".format(
            best_run
        ),
        name="sales-predictor_best_model",
        description="Model created for sales prediction",
        type="custom_model",
    )
else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )

### Register the best model

In [26]:
registered_model = ml_client.models.create_or_update(model=model)

### Configure an Endpoint

In [27]:
# import required libraries
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
)
from azure.ai.ml.constants import AssetTypes

In [28]:
# Importing the required modules
import random
import string
# Creating a unique endpoint name by including a random suffix
# Defining a list of allowed characters for the endpoint suffix
allowed_chars = string.ascii_lowercase + string.digits
# Generating a random 5-character suffix for the endpoint name by choosing
# characters randomly from the list of allowed characters
endpoint_suffix = "".join(random.choice(allowed_chars) for x in range(5))
# Creating the final endpoint name by concatenating a prefix string
# with the generated suffix string
endpoint_name = "sales-endpoint-" + endpoint_suffix
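
The same name generation can be wrapped in a function that also validates the result before any Azure call is made. The pattern below is an assumption about the endpoint naming rules (3 to 32 characters, starting with a letter, then letters, digits, or hyphens), so check the current Azure documentation before relying on it. The function name `make_endpoint_name` is hypothetical.

```python
import random
import re
import string

# Assumed naming rule: 3-32 characters, starts with a letter,
# then letters, digits, or hyphens. Verify against the Azure docs.
ENDPOINT_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9-]{2,31}")


def make_endpoint_name(prefix="sales-endpoint-", suffix_length=5, rng=random):
    """Append a random lowercase/digit suffix to the prefix and validate the result."""
    allowed_chars = string.ascii_lowercase + string.digits
    suffix = "".join(rng.choice(allowed_chars) for _ in range(suffix_length))
    name = prefix + suffix
    if not ENDPOINT_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid endpoint name: {name!r}")
    return name


name = make_endpoint_name()
print(name)
```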

### Create an Endpoint

Online endpoints are endpoints that are used for online (real-time) inferencing. Online endpoints contain deployments that are ready to receive data from clients and can send responses back in real time.

To create an online endpoint, we use the ManagedOnlineEndpoint class, which lets us configure key settings such as name, auth_mode, and identity.

In [29]:
endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,  
    # Name of the endpoint; must be unique within the Azure region
    description="An online endpoint serving an MLflow model",
    # A string describing the purpose of the endpoint
    auth_mode="key",
    # Authentication mode to use for the endpoint (in this case, using an API key)
    tags={"foo": "bar"},
    # A dictionary of key-value pairs that can be used to tag the endpoint
)

begin_create_or_update starts the endpoint creation and returns a poller; calling .result() on it blocks until the endpoint has been provisioned and then returns the endpoint details.

In [30]:
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

ManagedOnlineEndpoint({'public_network_access': 'Enabled', 'provisioning_state': 'Succeeded', 'scoring_uri': 'https://sales-endpoint-ix2q3.eastus.inference.ml.azure.com/score', 'openapi_uri': 'https://sales-endpoint-ix2q3.eastus.inference.ml.azure.com/swagger.json', 'name': 'sales-endpoint-ix2q3', 'description': 'An online endpoint serving an MLflow model', 'tags': {'foo': 'bar'}, 'properties': {'azureml.onlineendpointid': '/subscriptions/4322d79c-475f-471d-bd6d-52528e4d32ee/resourcegroups/tasnim_rg/providers/microsoft.machinelearningservices/workspaces/maycohortmlops/onlineendpoints/sales-endpoint-ix2q3', 'AzureAsyncOperationUri': 'https://management.azure.com/subscriptions/4322d79c-475f-471d-bd6d-52528e4d32ee/providers/Microsoft.MachineLearningServices/locations/eastus/mfeOperationsStatus/oeidp:306407ff-927a-47c9-8127-fd0067a9a4f0:24356ed1-4e95-403c-83f9-46b889dfeea9?api-version=2022-02-01-preview'}, 'print_as_yaml': True, 'id': '/subscriptions/4322d79c-475f-471d-bd6d-52528e4d32ee/re

### Create a deployment script to perform model deployment

In [31]:
%%writefile {train_src_dir}/score.py
# Import necessary libraries and modules
import logging
import os
import json
import mlflow
from io import StringIO
from mlflow.pyfunc.scoring_server import infer_and_parse_json_input, predictions_to_json
###################### LOGGER #####################
# Set up Azure logging (the logging module is already imported above)
from opencensus.ext.azure.log_exporter import AzureLogHandler
# Connect to Application Insights and set logging level to INFO
application_insights_connection_string= 'InstrumentationKey=27c708c1-110c-494e-9321-e1d7095c5298;IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/;LiveEndpoint=https://eastus.livediagnostics.monitor.azure.com/;ApplicationId=926eefd9-84d4-48e6-b0b4-9175bb4bbb28'
handler = AzureLogHandler(
    connection_string=application_insights_connection_string
)
logger = logging.getLogger()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
####################################################
# Define the init() function to load the MLflow model
def init():
    global model
    global input_schema
    # AZUREML_MODEL_DIR points to the registered model's files. The MLflow model sits in the
    # "gbr-adv-sales-predictor" subfolder, the artifact path used when the model was logged
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "gbr-adv-sales-predictor")
    model = mlflow.pyfunc.load_model(model_path)
    input_schema = model.metadata.get_input_schema()
# Define the run() function to make predictions using the loaded model
def run(raw_data):
    # Parse input data
    json_data = json.loads(raw_data)
    if "input_data" not in json_data.keys():
        raise Exception("Request must contain a top level key named 'input_data'")
    serving_input = json.dumps(json_data["input_data"])
    data = infer_and_parse_json_input(serving_input, input_schema)
    # Make predictions
    predictions = model.predict(data)
    # Log the input data and predictions to Azure
    logger.info("Data:{0},Predictions:{1}".format(str(data),str(predictions)))
    # Convert predictions to JSON format and return
    result = StringIO()
    predictions_to_json(predictions, result)
    return result.getvalue()

Writing ./src/score.py
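
The `run` function above rejects any request without a top-level `input_data` key and passes the value under that key to MLflow's JSON parser. The sketch below builds such a request body with the standard library. The split-oriented layout (`columns` plus `data`), the column spelling `Newspaper`, and the feature values are assumptions for illustration. The values stand in for already standardized inputs, since the model was trained on the processed file.

```python
import json

# Assumed feature order and spelling; adjust to the columns the model was trained on
FEATURE_COLUMNS = ["TV", "Radio", "Newspaper"]


def build_request(rows, columns=FEATURE_COLUMNS):
    """Wrap feature rows in the {"input_data": {...}} envelope that run() expects."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"Expected {len(columns)} values per row, got {len(row)}")
    payload = {
        "input_data": {"columns": list(columns), "data": [list(r) for r in rows]}
    }
    return json.dumps(payload)


# Hypothetical standardized feature values for a single campaign
body = build_request([[0.5, -1.2, 0.3]])
print(body)
```

Saved to a file, a body like this could be sent to the deployment with `ml_client.online_endpoints.invoke`, passing `endpoint_name`, `deployment_name="blue"`, and `request_file`.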


### Configure the deployment

In [32]:
# Create a new deployment with name "blue"
blue_deployment = ManagedOnlineDeployment(
    name="blue",
    # Use the previously generated endpoint name
    endpoint_name=endpoint_name,
    # Use the registered model
    model=registered_model,
    # Use the latest environment 
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    # Use the code in the "./src" directory and the "score.py" script
    code_configuration=CodeConfiguration(
        code="./src", scoring_script="score.py"
    ),
    # Use a single instance of type "Standard_E2s_v3"
    instance_type="Standard_E2s_v3",
    instance_count=1,
    # Enable Application Insights for the deployment
    app_insights_enabled=True,
)


In [33]:
ml_client.online_deployments.begin_create_or_update(blue_deployment).result()

Check: endpoint sales-endpoint-ix2q3 exists
Uploading src (0.01 MBs): 100%|██████████| 5266/5266 [00:00<00:00, 23865.83it/s]



..........................................................................

ManagedOnlineDeployment({'private_network_connection': None, 'provisioning_state': 'Succeeded', 'endpoint_name': 'sales-endpoint-ix2q3', 'type': 'Managed', 'name': 'blue', 'description': None, 'tags': {}, 'properties': {'AzureAsyncOperationUri': 'https://management.azure.com/subscriptions/4322d79c-475f-471d-bd6d-52528e4d32ee/providers/Microsoft.MachineLearningServices/locations/eastus/mfeOperationsStatus/odidp:306407ff-927a-47c9-8127-fd0067a9a4f0:4468447f-608f-4953-90cd-73d5f387fd59?api-version=2023-04-01-preview'}, 'print_as_yaml': True, 'id': '/subscriptions/4322d79c-475f-471d-bd6d-52528e4d32ee/resourceGroups/tasnim_rg/providers/Microsoft.MachineLearningServices/workspaces/maycohortmlops/onlineEndpoints/sales-endpoint-ix2q3/deployments/blue', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/azurekernelmlops2024/code/Users/niger.tasnim21/Milestoneproject1', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7

### Delete the Endpoint

**Important!** An online endpoint is a live resource that keeps running, and keeps incurring cost, so that it can serve predictions at any time. Unless you need real-time predictions, delete your endpoints after use.

In [34]:
ml_client.online_endpoints.begin_delete(name=endpoint_name)

<azure.core.polling._poller.LROPoller at 0x7fd89c453d90>

........................................................................
