# Milestone 1: Advertising

Business Context: 

Fashion Haven, a retail company, operates multiple stores in different cities. The company invests in advertising campaigns to promote its latest collections through media channels such as TV, newspaper, and radio. It wants to understand the impact of each channel on sales revenue so that it can optimize its advertising strategy and improve overall business performance.

Currently, Fashion Haven has no reliable way to predict the sales revenue generated by its advertising. As a result, it struggles to allocate its advertising budget across media channels, which leads to poor returns on investment.

To address this business problem, Fashion Haven has collected historical data containing information on various advertising campaigns (TV, Newspaper, Radio) and their corresponding sales revenue across their different store locations. The goal is to build a robust predictive model that accurately estimates the sales revenue based on the media sources' advertising budgets, helping the company make data-driven decisions and drive business growth.

Dataset Description:

The data records advertising expenditure per media channel together with the resulting sales. The data dictionary is given below.

* TV: Advertising expenditure on TV
* Radio: Advertising expenditure on radio
* NewsPaper: Advertising expenditure on newspapers
* Sales: Target column, the amount of sales
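
The data dictionary spells the newspaper column `NewsPaper`, while the processing script later in this notebook filters on `Newspaper` and silently skips any name that does not match exactly. A minimal sketch of a case-insensitive header check follows; the helper `resolve_columns` and the two-line sample CSV are hypothetical and are not part of the original notebook.

```python
import csv
import io

EXPECTED = ["TV", "Radio", "Newspaper", "Sales"]  # names used later in this notebook

def resolve_columns(header, expected=EXPECTED):
    """Map each expected name to the actual header name, ignoring case.

    Raises KeyError if an expected column is missing altogether.
    """
    by_lower = {name.strip().lower(): name for name in header}
    resolved = {}
    for name in expected:
        key = name.lower()
        if key not in by_lower:
            raise KeyError(f"missing column: {name}")
        resolved[name] = by_lower[key]
    return resolved

# Made-up two-line CSV, only to exercise the helper (not real data)
sample = io.StringIO("TV,Radio,NewsPaper,Sales\n1.0,2.0,3.0,4.0\n")
header = next(csv.reader(sample))
mapping = resolve_columns(header)
print(mapping)
```

The resulting mapping could be used to rename columns before scaling, so that a spelling difference in the header raises an error or is corrected rather than being skipped.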

### Import all necessary libraries

In [3]:
# Packages needed for running a processing job
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
# Import library to train model
from sagemaker.sklearn.estimator import SKLearn
# Import Library to tune models
from sagemaker.tuner import IntegerParameter
# Import library to deploy model
from sagemaker.sklearn.model import SKLearnModel

# Packages to process data
import pandas as pd
import numpy as np
import datetime
import time
import subprocess
import sys
# Package to work with tar archives such as model.tar.gz
import tarfile
# Package to interact with the operating system (paths, environment variables)
import os
# Packages for Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Set framework version (stable)
FRAMEWORK_VERSION = "0.23-1"


### Set up a SageMaker Session

In [4]:
session = sagemaker.Session()

### Define the S3 Bucket and the region name

In [5]:
# Set the bucket name and region variables. Replace the bucket name with your own unique bucket name
bucket_name = '......' # default bucket
my_region = boto3.session.Session().region_name
print('my_region:',my_region)


my_region: us-east-2


In [6]:
# Upload data and requirement files to S3
data_path = session.upload_data(path="/root/mtn/advertising.csv", bucket=bucket_name, key_prefix="processing/data")
requirements_path = session.upload_data(path="/root/mtn/requirement.txt", bucket=bucket_name, key_prefix="processing/requirements")

### Create a processing job to do the data preprocessing

In [7]:
%%writefile processing_job.py

# Install all required packages
import os
import subprocess
import sys

# Path to requirement.txt in the SageMaker processing container
requirements_path = "/opt/ml/processing/input/requirements/requirement.txt"

# Install dependencies from requirements.txt
if os.path.exists(requirements_path):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", requirements_path])
    print("Dependencies installed successfully.")
else:
    print(f"Requirements file not found at {requirements_path}. Skipping dependency installation.")

# Packages to manipulate data
import numpy as np
import pandas as pd
from time import gmtime, strftime
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
import boto3

# -------------------------------------------
# Set up S3 bucket configuration
# -------------------------------------------
bucket_name = 'sagemaker-us-east-2-724074006219'  # Replace with your S3 bucket name
prefix = 'pre-processing'
s3_client = boto3.client("s3")

# Path to the input and output in SageMaker container
input_data_path = "/opt/ml/processing/input/data/advertising.csv"
output_data_path = "/opt/ml/processing/output/processed_data.csv"

# -------------------------------------------
# Data Pre-processing
# -------------------------------------------

# Load data into pandas DataFrame
if os.path.exists(input_data_path):
    data = pd.read_csv(input_data_path)
    print("Data loaded successfully.")
    
    # Columns to standardize. Note that the target (Sales) is scaled too,
    # so the model trains on, and predicts, standardized sales values.
    numeric_features = ['TV','Radio','Newspaper','Sales']
    # Keep only the columns that are actually present in the data
    numeric_features = [feature for feature in numeric_features if feature in data.columns]
    
    if numeric_features:
        # Standardize each column to zero mean and unit variance
        scaler = StandardScaler()
        data[numeric_features] = scaler.fit_transform(data[numeric_features])

    # Save processed data to the specified output path in the container
    data.to_csv(output_data_path, index=False)
    print("Data pre-processing complete. Processed data saved locally.")

    # Upload the processed data to S3
    s3_client.upload_file(output_data_path, bucket_name, f"{prefix}/output/processed_data.csv")
    print(f"Processed data uploaded to s3://{bucket_name}/{prefix}/output/processed_data.csv")
else:
    print(f"Data file not found at {input_data_path}")

Writing processing_job.py
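
The script above standardizes every column on the full dataset before the train/test split. A common alternative, shown here only as a sketch, is to split first and compute the scaling statistics from the training rows alone, so that no information from the test rows reaches the model. The data below is synthetic and stands in for the three spend columns; none of the variable names come from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the three spend columns (not the real dataset)
X = rng.uniform(0.0, 300.0, size=(200, 3))

# Split first: 70% train, 30% test, shuffled reproducibly
order = rng.permutation(len(X))
n_test = int(np.ceil(0.3 * len(X)))
test_idx, train_idx = order[:n_test], order[n_test:]
X_train, X_test = X[train_idx], X[test_idx]

# Fit the scaling statistics on the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same statistics to both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

With this arrangement the test columns are generally not exactly zero-mean, which is expected: they are scaled with the training statistics, as new data would be at prediction time.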


### Set the parameters for the processing job

In [8]:
sm_boto3 = boto3.client("sagemaker")
sess = sagemaker.Session()
s3_client = boto3.client("s3")
role = get_execution_role()
sklearn_processor = SKLearnProcessor(
 framework_version="0.23-1", # stable version
 role=role, # Default IAM role
 instance_type="ml.t3.medium", # EC2 instance
 instance_count=1, # Number of Instances
 volume_size_in_gb=1 # Storage space
)

### Run the processing job

In [9]:
# Define input and output paths for the job
inputs = [
    ProcessingInput(
        source=data_path,  # S3 path for the data
        destination="/opt/ml/processing/input/data"  # Path inside the container
    ),
    ProcessingInput(
        source=requirements_path,  # S3 path for requirements.txt
        destination="/opt/ml/processing/input/requirements"  # Path inside the container
    )
]

outputs = [
    ProcessingOutput(
        source="/opt/ml/processing/output",  # Output path in the container
        destination=f"s3://sagemaker-us-east-2-724074006219/processing/output"  # Output path in S3
    )
]



# Run the job
sklearn_processor.run(
    code="processing_job.py",  # The processing script
    inputs=inputs,
    outputs=outputs,
    logs=True
)

INFO:sagemaker:Creating processing-job with name sagemaker-scikit-learn-2024-11-09-17-38-55-096


Collecting statsmodels
  Downloading statsmodels-0.13.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.9 MB)
Collecting python-math
  Downloading python_math-0.0.1-py3-none-any.whl (2.4 kB)
Collecting regex
  Downloading regex-2024.4.16-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (761 kB)
Collecting python-time
  Downloading python-time-0.3.0.tar.gz (2.6 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting sagemaker
  Downloading sagemaker-2.229.0-py3-none-any.whl (1.5 MB)
Collecting s3fs>=2022.11.0
  Downloading s3fs-2023.1.0-py3-none-any.whl (27 kB)
Collecting packaging>=21.3
  Downloading packaging-
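
The processing script reads its input from `/opt/ml/processing/input/data/advertising.csv`, which is the `destination` directory of the first `ProcessingInput` joined with the file name of the uploaded object. The sketch below illustrates that mapping for a single S3 object; `container_path` is a hypothetical helper and `example-bucket` is a placeholder bucket name.

```python
import posixpath
from urllib.parse import urlparse

def container_path(s3_uri, destination):
    """Expected in-container path for a single S3 object copied to `destination`."""
    key = urlparse(s3_uri).path  # e.g. "/processing/data/advertising.csv"
    return posixpath.join(destination, posixpath.basename(key))

# "example-bucket" is a placeholder bucket name
resolved = container_path(
    "s3://example-bucket/processing/data/advertising.csv",
    "/opt/ml/processing/input/data",
)
print(resolved)
```

If the script reports that the data file was not found, comparing the path it expects with the one computed this way is a quick first check.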

In [None]:
# Optional cleanup commands, kept inside a string so that nothing runs
"""
!rm -rf ~/.cache/*
!rm -rf /opt/ml/processing/*
! rm -rf /opt/ml/output/*
!conda clean --all -y
!rm -rf /root/your-directory-to-clear/*
!docker system prune -a -f """


### Read the processed data stored in S3 Bucket

In [10]:
data = pd.read_csv("s3://sagemaker-us-east-2-724074006219/pre-processing/output/processed_data.csv")

### Split the data into train and test

In [11]:
# Split data into Train & Test dataframes
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(data, test_size=0.3, random_state=123)
# Write the train and test dataframes to local files in the notebook's working directory
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)

### Upload train and test dataset into S3 Bucket

In [12]:
# Upload train data to s3
bucket_name = 'sagemaker-us-east-2-724074006219'
trainpath = session.upload_data(
 path="train.csv", bucket=bucket_name, key_prefix="regression_example")
# Upload test data to s3
testpath = session.upload_data(
 path="test.csv", bucket=bucket_name, key_prefix="regression_example")
# Display train and test data location in s3
print(trainpath)
print(testpath) 

s3://sagemaker-us-east-2-724074006219/regression_example/train.csv
s3://sagemaker-us-east-2-724074006219/regression_example/test.csv


### Create a training job to train the model 

In [13]:
%%writefile training_job.py
# ------------------------------
# Import all necessary libraries
# ------------------------------
# Package to take in features and hyper-parameters as inputs
import argparse
# Packages to interact with system
import time
import os
import joblib #*
# Packages to process data
import pandas as pd
import numpy as np
import datetime
# Packages for Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error 
# ---------------------------------
# Function to load the persisted model
# ---------------------------------
# The SageMaker scikit-learn serving container calls this function to load the model for inference
# Use this as it is
def model_fn(model_dir): #*
 clf = joblib.load(os.path.join(model_dir, "model.joblib"))
 return clf
# Main code
if __name__ == "__main__": 
 # ----------------------------------
 # Parse test data & hyper-parameters
 # ---------------------------------- 
 parser = argparse.ArgumentParser()
 # test and train file
 parser.add_argument("--test-file", type=str, default='test.csv')
 parser.add_argument("--train-file", type=str, default="train.csv")
 # hyper-parameters
 parser.add_argument("--max-depth", type=int, default=5)
 parser.add_argument("--min-samples-leaf", type=int, default=50)
 # docker container default locations
 parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR")) #*
 parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN")) #*
 parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST")) #*
 args = parser.parse_args()
 # --------------------------
 # Build Regression model
 # --------------------------
 # Read training & testing datasets from the channel directories in the container
 train_df = pd.read_csv(os.path.join(args.train, args.train_file)) #*
 test_df = pd.read_csv(os.path.join(args.test, args.test_file)) #*
 # Create train dataframes with features & target
 train_x = train_df.drop(columns=['Sales'])
 train_y = train_df['Sales']
 # Create test dataframes with features & target
 test_x = test_df.drop(columns=['Sales'])
 test_y = test_df['Sales']
 # Build model using the supplied hyper-parameters
 model = DecisionTreeRegressor(
 max_depth=args.max_depth, #*
 min_samples_leaf=args.min_samples_leaf, #*
 random_state=1
 )
 # Fit model
 model.fit(train_x, train_y)
 # -----------------------
 # Predict using the model
 # -----------------------
 print("running model for predictions...")
 time.sleep(2)
 # Predict on the test set; the RMSE is logged on the next line
 y_pred = model.predict(test_x)
 print(f"RMSE: {mean_squared_error(test_y, y_pred, squared=False)}")
 # ------------------------
 # Save model package file
 # ------------------------
 # persist model
 path = os.path.join(args.model_dir, "model.joblib") #*
 joblib.dump(model, path)
 print("model persisted at " + path)

Writing training_job.py
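
In script mode, SageMaker passes the estimator's `hyperparameters` dictionary to the training script as command-line arguments, and `argparse` exposes an option such as `--max-depth` as the attribute `args.max_depth`. The sketch below reproduces only the two hyper-parameter options from the script; the argument list is written by hand to simulate what the training container would pass.

```python
import argparse

def build_parser():
    """Parser with the same two hyper-parameter options as training_job.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-depth", type=int, default=5)
    parser.add_argument("--min-samples-leaf", type=int, default=50)
    return parser

# Simulate the command line that a hyperparameters dict such as
# {"max-depth": 7} would produce (argument list written by hand here)
args = build_parser().parse_args(["--max-depth", "7"])
print(args.max_depth, args.min_samples_leaf)
```

Any option that is not supplied keeps its default, which is why `min_samples_leaf` stays at 50 unless it is passed explicitly.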


### Set up the sklearn estimator

In [14]:
# Create the sklearn estimator instance
sklearn_estimator = SKLearn(
 entry_point="training_job.py", # your model code
 role=get_execution_role(), # the default IAM role used
 instance_count=1, # number of nodes
 instance_type="ml.m5.large", # EC2 instance type
 framework_version=FRAMEWORK_VERSION, # Framework version
 base_job_name="bt-decision-tree-scikit-train", # Prefix of the job name
 hyperparameters={ # Hyper-parameters
 "max-depth":5,
 "min-samples-leaf":50
 }
)

### Run the training job

In [15]:
# Launch Training Job
sklearn_estimator.fit({"train": trainpath, "test": testpath}, wait=True)

INFO:sagemaker:Creating training-job with name: bt-decision-tree-scikit-train-2024-11-09-17-48-25-981


2024-11-09 17:48:26 Starting - Starting the training job...
2024-11-09 17:48:41 Starting - Preparing the instances for training...
2024-11-09 17:49:04 Downloading - Downloading input data...
2024-11-09 17:49:29 Downloading - Downloading the training image...
2024-11-09 17:50:20 Training - Training image download completed. Training in progress.
2024-11-09 17:50:20 Uploading - Uploading generated training model
2024-11-09 17:50:15,201 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2024-11-09 17:50:15,205 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-11-09 17:50:15,253 sagemaker_sklearn_container.training INFO     Invoking user training script.
2024-11-09 17:50:15,456 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-11-09 17:50:15,469 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-

### Set up the SageMaker Tuner Function

In [16]:
# Define exploration boundaries
hyperparameter_ranges = {
 "max-depth": IntegerParameter(4,12)
}
# Create an optimizer instance
sklearn_optimizer  = sagemaker.tuner.HyperparameterTuner(
 estimator=sklearn_estimator, # model training instance
 hyperparameter_ranges=hyperparameter_ranges, # parameter space
 base_tuning_job_name="decision-tree-scikit-tune", # Prefix of the tuning job
 objective_type="Minimize", 
 objective_metric_name="rmse",
 max_jobs=6,
 max_parallel_jobs=3,
 metric_definitions= [
 {"Name": "rmse",
 "Regex": "RMSE: ([0-9.]+).*$"}
 ],
 strategy="Bayesian"
)
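
The tuner reads the objective metric by applying the `Regex` in `metric_definitions` to the training job's log output, so the pattern has to match the `RMSE: ...` line printed by the training script. The sketch below checks the pattern locally with Python's `re` module as a stand-in; SageMaker's own matching may differ in details, and `extract_rmse` is a hypothetical helper.

```python
import re

METRIC_REGEX = r"RMSE: ([0-9.]+).*$"  # same pattern as in metric_definitions

def extract_rmse(log_line):
    """Return the RMSE captured from a log line, or None if it does not match."""
    match = re.search(METRIC_REGEX, log_line)
    return float(match.group(1)) if match else None

print(extract_rmse("RMSE: 0.738571"))
print(extract_rmse("RMSE: 1e-05"))
print(extract_rmse("no metric here"))
```

The second call shows a limitation of the character class `[0-9.]`: a value printed in scientific notation is cut off at the `e`, so only the leading digits are captured. Printing the metric with a fixed format, for example `f"{value:.6f}"`, would avoid that case.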

### Tune the model

In [17]:
sklearn_optimizer.fit({"train": trainpath, "test": testpath})

INFO:sagemaker:Creating hyperparameter tuning job with name: decision-tree-scikit-241109-1752


.................................................!


### Print the hyperparameter tuning results in a dataframe

In [18]:
# Get the tuner results in a df
results = sklearn_optimizer.analytics().dataframe()
results

| | max-depth | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|
| 0 | 4.0 | decision-tree-scikit-241109-1752-006-c5cbc9ab | Completed | 0.738571 | 2024-11-09 17:55:46+00:00 | 2024-11-09 17:56:30+00:00 | 44.0 |
| 1 | 7.0 | decision-tree-scikit-241109-1752-005-99b6b2ee | Completed | 0.738571 | 2024-11-09 17:55:37+00:00 | 2024-11-09 17:56:16+00:00 | 39.0 |
| 2 | 10.0 | decision-tree-scikit-241109-1752-004-99c0ae34 | Completed | 0.738571 | 2024-11-09 17:55:35+00:00 | 2024-11-09 17:56:13+00:00 | 38.0 |
| 3 | 8.0 | decision-tree-scikit-241109-1752-003-c60f9dbb | Completed | 0.738571 | 2024-11-09 17:53:42+00:00 | 2024-11-09 17:55:32+00:00 | 110.0 |
| 4 | 6.0 | decision-tree-scikit-241109-1752-002-c2ff57f3 | Completed | 0.738571 | 2024-11-09 17:53:43+00:00 | 2024-11-09 17:55:22+00:00 | 99.0 |
| 5 | 5.0 | decision-tree-scikit-241109-1752-001-5d8c2302 | Completed | 0.738571 | 2024-11-09 17:53:42+00:00 | 2024-11-09 17:55:21+00:00 | 99.0 |
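
Every trial reports the same objective value, whatever `max-depth` was used. One plausible reading, which is an interpretation and not something the notebook states, is that the fixed `min-samples-leaf` of 50 limits the tree so strongly that `max-depth` never takes effect: each leaf needs at least 50 training rows, so a small training set allows only a few leaves. The sketch below works through that bound; the training row count of 140 is an assumption for illustration, since the notebook does not show the dataset size, and both helper functions are hypothetical.

```python
def max_leaves(n_train_rows, min_samples_leaf):
    """Upper bound on leaf count: every leaf must hold at least min_samples_leaf rows."""
    return max(n_train_rows // min_samples_leaf, 1)

def max_depth_cannot_bind(n_train_rows, min_samples_leaf, max_depth):
    """True if max_depth is guaranteed to have no effect on the fitted tree.

    A binary tree with L leaves has depth at most L - 1, so any
    max_depth at or above that bound never stops a split.
    """
    return max_depth >= max_leaves(n_train_rows, min_samples_leaf) - 1

N_TRAIN_ROWS = 140  # assumed for illustration; not taken from the notebook output
for depth in range(4, 13):  # the tuned range, IntegerParameter(4, 12)
    print(depth, max_depth_cannot_bind(N_TRAIN_ROWS, 50, depth))
```

Under that assumption the tree can have at most two leaves, so all depths from 4 to 12 produce the same model. Adding `min-samples-leaf` to `hyperparameter_ranges`, or lowering its fixed value, would give the tuner something to explore.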


### Save the model artifact

In [19]:
# Get the S3 location of the model artifact from the best training job
artifact = sm_boto3.describe_training_job(TrainingJobName=sklearn_optimizer.best_training_job())["ModelArtifacts"]["S3ModelArtifacts"]
print("Model artifact persisted at " + artifact)

Model artifact persisted at s3://sagemaker-us-east-2-724074006219/decision-tree-scikit-241109-1752-001-5d8c2302/output/model.tar.gz
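
The artifact is a gzipped tar archive of whatever the training script wrote to its model directory, which in this notebook is the single file `model.joblib`. The sketch below shows how such an archive could be inspected with `tarfile` once downloaded; it builds a small local stand-in archive with placeholder bytes instead of fetching the real artifact, and `list_archive` is a hypothetical helper.

```python
import io
import tarfile
import tempfile
from pathlib import Path

def list_archive(path):
    """Return the member names of a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()

# Build a small stand-in archive locally (placeholder bytes, not a real model)
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "model.tar.gz"
    payload = b"placeholder"
    with tarfile.open(archive_path, "w:gz") as archive:
        info = tarfile.TarInfo(name="model.joblib")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    names = list_archive(archive_path)

print(names)
```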


### Set up the sklearn model function for deployment

In [20]:
model = SKLearnModel(model_data=artifact, # The model.tar.gz archive the training job generated
                     role=get_execution_role(), # Default IAM role
                     entry_point="training_job.py", # Script that defines model_fn
                     framework_version=FRAMEWORK_VERSION
                    )

### Deploy the best model 

In [21]:
# Deploy SKLearn Predictor
sklearn_predictor = model.deploy(instance_type="ml.m4.xlarge", initial_instance_count=1)

INFO:sagemaker:Creating model with name: sagemaker-scikit-learn-2024-11-09-17-58-15-687
INFO:sagemaker:Creating endpoint-config with name sagemaker-scikit-learn-2024-11-09-17-58-16-448
INFO:sagemaker:Creating endpoint with name sagemaker-scikit-learn-2024-11-09-17-58-16-448


------!

In [30]:
# Test the deployed model
import numpy as np

# Example input (replace with real feature values, in the same column order as the training data).
# The model was trained on standardized data, so inputs should be standardized
# the same way, and the prediction is a standardized Sales value.
test_data = np.array([[12.99, 21.1, 41.0]])

# Make predictions
predictions = sklearn_predictor.predict(test_data)
print("Predictions:", predictions)

Predictions: [0.53521182]
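
Because the processing job standardized both the features and `Sales`, the endpoint expects standardized inputs and returns a standardized prediction. The sketch below shows the two conversions involved. All means and standard deviations in it are hypothetical round numbers chosen for illustration; in practice they would have to come from the scaler fitted during preprocessing, which this notebook does not save.

```python
import numpy as np

def standardize(x, mean, std):
    """Scale raw feature values with statistics from the training data."""
    return (np.asarray(x, dtype=float) - mean) / std

def unstandardize(z, mean, std):
    """Convert a standardized value back to original units."""
    return np.asarray(z, dtype=float) * std + mean

# Hypothetical statistics, NOT taken from the real dataset
feature_mean = np.array([100.0, 20.0, 30.0])
feature_std = np.array([50.0, 10.0, 20.0])
sales_mean, sales_std = 14.0, 5.0

raw_input = np.array([[12.99, 21.1, 41.0]])
scaled_input = standardize(raw_input, feature_mean, feature_std)

z_pred = 0.5  # example standardized prediction, not an endpoint result
sales_estimate = unstandardize(z_pred, sales_mean, sales_std)
print(scaled_input)
print(sales_estimate)
```

Persisting the fitted scaler next to the model, for example with `joblib`, is one way to make these statistics available at prediction time.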


### Delete the Endpoint

**Important!** An endpoint is a live resource that keeps running, and incurring charges, until it is deleted. Unless you need real-time predictions, delete your endpoints after use.

In [31]:
sklearn_predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: sagemaker-scikit-learn-2024-11-09-17-58-16-448
INFO:sagemaker:Deleting endpoint with name: sagemaker-scikit-learn-2024-11-09-17-58-16-448
