# Business Analysis
## Business Context
* Kingston Custom Concrete is a family-owned construction company that specializes in concrete driveways, walkways, patios, and similar projects. Most of its work is stamped concrete, in which patterns resembling slabs are pressed into the wet concrete before it sets. The company provides a warranty on all of its projects and offers a maintenance program that keeps the stamped concrete in good condition. It is located at Princess and Ontario streets.

## Problem Description:
* One problem faced by Kingston Custom Concrete is determining the compressive strength of a concrete mixture in a cost-effective way. Customers may require concrete of varying strength, and each candidate mixture must be tested. Testing is costly and time-consuming because the mixture must first harden and then be tested in a compression machine. It also affects the customer experience, since it adds time on top of preparing the concrete molds, pressing the pattern, and letting the concrete harden. For a driveway, this can leave the customer unable to park their car for longer than necessary. A linear regression model can speed this up by immediately giving the expected compressive strength of a mixture, avoiding the time cost of trial and error.


## Solution/Proof of Concept Description:
* The solution will reduce construction time for the customer by removing most of the physical testing needed to find the compressive strength of a concrete mixture. Less testing time also lowers construction costs and lets the customer receive the cost of construction sooner. Operationally, Kingston Custom Concrete could skip trial-and-error testing and go straight to a mixture that meets its current needs: an employee inputs the mixture details and the model outputs the predicted compressive strength. It also saves on material costs, and because compression tests would only be needed to confirm that a mixture is correct, there is less wear and tear on the compression machine, which reduces maintenance costs.

* https://www.kaggle.com/datasets/sinamhd9/concrete-comprehensive-strength

* The dataset used for the proof of concept covers concrete compressive strength. It contains 9 columns in total. Seven columns hold the amounts of cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate, which are common materials found in different concrete mixtures. They are measured in kg per m^3. One column is the age of the concrete in days. The final column is the compressive strength achieved, in MPa. The amount of each material and the age together determine the resulting compressive strength. The materials in the dataset act as a placeholder for the materials the company actually uses.


## Business Change
* With this solution, Kingston Custom Concrete can eliminate multiple phases of testing. In their place, an employee can access the model from a PC or laptop and input the component amounts of a candidate mixture to get its predicted compressive strength.

## Development Costs:
* Implementing the model will take around 2 hours, and training and fine-tuning the hyperparameters can take a further 5-7 hours, so at an assumed minimum wage of \\$16 per hour the labor cost will be \\$112 - \\$144. We’ll use the ml.m5.large instance type for training, which costs approximately \\$0.115 per hour, or about \\$0.57 - \\$0.80 in total for the estimated training time. The example dataset used for pre-production is already in acceptable condition for training, as all values are numeric. The data also reflects the metrics that will be used later on. Overall, setting up the model will cost \\$112.57 - \\$144.80 in total.
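The arithmetic behind this estimate can be sketched as follows. The figures come from the paragraph above; the constant and function names are illustrative only, and the \\$16 wage and \\$0.115 hourly instance rate are the assumptions already stated there.

```python
# Figures from the Development Costs estimate; names are illustrative.
HOURLY_WAGE = 16.00        # assumed minimum wage, $ per hour
INSTANCE_RATE = 0.115      # approximate ml.m5.large price, $ per hour
IMPLEMENTATION_HOURS = 2   # time to implement the model
TRAINING_HOURS = (5, 7)    # low and high estimates for training and tuning


def development_cost(training_hours):
    """Return (labour, compute, total) in dollars for a given training time."""
    labour = (IMPLEMENTATION_HOURS + training_hours) * HOURLY_WAGE
    compute = training_hours * INSTANCE_RATE
    return labour, compute, labour + compute


low = development_cost(TRAINING_HOURS[0])
high = development_cost(TRAINING_HOURS[1])
print(f"Labour:  ${low[0]:.2f} - ${high[0]:.2f}")
print(f"Compute: ${low[1]:.3f} - ${high[1]:.3f}")
print(f"Total:   ${low[2]:.3f} - ${high[2]:.3f}")
```

Labour dominates the estimate: the compute charge stays under a dollar even at the high end of the training time.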




# Concrete Compressive Strength Dataset

The data in this dataset was gathered by Prof. I-Cheng Yeh of Chung-Hua University.
Below are the 7 concrete components used for this dataset, followed by the age and the target column.
All components are measured in kg per m^3.


* Cement - Concrete component 1 measured in kg in a m^3 mixture
* Blast Furnace Slag - Concrete component 2 measured in kg in a m^3 mixture
* Fly Ash - Concrete component 3 measured in kg in a m^3 mixture
* Water - Concrete component 4 measured in kg in a m^3 mixture
* Superplasticizer - Concrete component 5 measured in kg in a m^3 mixture
* Coarse Aggregate - Concrete component 6 measured in kg in a m^3 mixture
* Fine Aggregate - Concrete component 7 measured in kg in a m^3 mixture
* Age - Age of the concrete in days
* CPS - Concrete compressive strength (MPa, megapascals)
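Because the model receives its inputs as a plain list of numbers, the order of the eight input columns matters. The sketch below shows one way to turn a named mixture into a correctly ordered row. The column names match the CSV header used later in this notebook; `mixture_to_row` is a hypothetical helper, not part of any library.

```python
# Input columns in dataset order (the target, compressive strength, is excluded).
FEATURE_COLUMNS = [
    "Cement", "Blast Furnace Slag", "Fly Ash", "Water",
    "Superplasticizer", "Coarse Aggregate", "Fine Aggregate", "Age",
]


def mixture_to_row(mixture):
    """Order a {column name: value} mapping the way the dataset orders its inputs."""
    missing = [name for name in FEATURE_COLUMNS if name not in mixture]
    if missing:
        raise ValueError(f"missing components: {missing}")
    return [mixture[name] for name in FEATURE_COLUMNS]


# Example mixture: amounts in kg per m^3, age in days.
example = {
    "Cement": 485, "Blast Furnace Slag": 0, "Fly Ash": 0, "Water": 146,
    "Superplasticizer": 0, "Coarse Aggregate": 1120, "Fine Aggregate": 800,
    "Age": 28,
}
row = mixture_to_row(example)
print(row)
```

Naming the components and ordering them in one place makes it harder to swap two columns by accident when a mixture is entered by hand.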

In [1]:
# This command upgrades the 'numexpr' library using pip
!pip install --upgrade numexpr


Collecting numexpr
  Downloading numexpr-2.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.9 kB)
Downloading numexpr-2.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (375 kB)
Installing collected packages: numexpr
  Attempting uninstall: numexpr
    Found existing installation: numexpr 2.7.3
    Uninstalling numexpr-2.7.3:
      Successfully uninstalled numexpr-2.7.3
Successfully installed numexpr-2.9.0


# Step 1: Import Necessary Libraries

In [2]:
# Import necessary libraries and modules from SageMaker and boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
import boto3
import os


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


# Step 2: Set Up SageMaker Session and Role

In [3]:
# Create a SageMaker session, which manages interactions with SageMaker services
sagemaker_session = sagemaker.Session()

# Get the IAM execution role used for SageMaker to access AWS resources
role = get_execution_role()

# Get the AWS region associated with the SageMaker session
region = sagemaker_session.boto_region_name


In [4]:
print(role)
print(region)

arn:aws:iam::058264484282:role/LabRole
us-east-1


# Step 3: Prepare Your Data

In [5]:
# Import the pandas library as 'pd' for data manipulation and analysis
import pandas as pd

# Import the train_test_split function from scikit-learn for data splitting
from sklearn.model_selection import train_test_split


In [6]:
# Read data from the 'Concrete_DataProcessed.csv' file into a pandas DataFrame called 'concrete_data'
concrete_data = pd.read_csv('Concrete_DataProcessed.csv')


In [7]:
# Check for missing values (NaN) in the 'concrete_data' DataFrame
concrete_data.isna()


Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Concrete compressive strength
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
1025,False,False,False,False,False,False,False,False,False
1026,False,False,False,False,False,False,False,False,False
1027,False,False,False,False,False,False,False,False,False
1028,False,False,False,False,False,False,False,False,False


In [8]:
# Remove rows with missing values (NaN) from the 'concrete_data' DataFrame
concrete_data = concrete_data.dropna()
print(concrete_data.columns)

Index(['Cement', 'Blast Furnace Slag', 'Fly Ash', 'Water', 'Superplasticizer',
       'Coarse Aggregate', 'Fine Aggregate', 'Age',
       'Concrete compressive strength'],
      dtype='object')


In [9]:
# The target variable column is named 'Concrete compressive strength'

# Create the feature matrix 'X' by dropping the target column
X = concrete_data.drop('Concrete compressive strength', axis=1)

# Create the target variable 'y'; note that casting to int truncates the strength values
y = concrete_data['Concrete compressive strength'].astype('int')

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


In [10]:
# Concatenate the features and labels back into one DataFrame for training data
concrete_train_data = pd.concat([y_train, X_train], axis=1)

# Concatenate the features and labels back into one DataFrame for validation data
concrete_validation_data = pd.concat([y_val, X_val], axis=1)

# Save the training data to a CSV file without headers and indices
concrete_train_data.to_csv('ConcreteData_train.csv', header=False, index=False)

# Save the validation data to a CSV file without headers and indices
concrete_validation_data.to_csv('ConcreteData_validation.csv', header=False, index=False)
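
SageMaker's built-in algorithms expect CSV training data with the label in the first column and no header row, which is why the label is concatenated first and `header=False, index=False` is used above. The sketch below reproduces that layout with the standard library only. The rows and the `to_label_first_csv` helper are made up for illustration and are not taken from the dataset.

```python
import csv
import io


def to_label_first_csv(features, labels):
    """Write rows as label-first CSV with no header row and no index column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, row in zip(labels, features):
        writer.writerow([label, *row])
    return buffer.getvalue()


# Made-up rows for illustration: 8 input values each, plus one label.
features = [
    [300, 0, 0, 180, 0, 1000, 800, 28],
    [250, 100, 0, 170, 5, 950, 750, 7],
]
labels = [35, 20]

csv_text = to_label_first_csv(features, labels)
print(csv_text)
```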


In [12]:
# Define your Amazon S3 bucket and prefix for data storage
bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/concrete-compressive-strength/classification'

# Paths to your local data files - replace with your actual file paths
local_train = 'ConcreteData_train.csv'
local_validation = 'ConcreteData_validation.csv'

# Upload the local training data to the specified S3 bucket and prefix
train_uri = sagemaker_session.upload_data(local_train, bucket=bucket, key_prefix=prefix)

# Upload the local validation data to the specified S3 bucket and prefix
validation_uri = sagemaker_session.upload_data(local_validation, bucket=bucket, key_prefix=prefix)


In [13]:
# Print the S3 URI for the training data
print("Training URI: ", train_uri)

# Print the S3 URI for the validation data
print("Validation URI: ", validation_uri)


Training URI:  s3://sagemaker-us-east-1-058264484282/sagemaker/concrete-compressive-strength/classification/ConcreteData_train.csv
Validation URI:  s3://sagemaker-us-east-1-058264484282/sagemaker/concrete-compressive-strength/classification/ConcreteData_validation.csv


# Step 4: Get the Linear Learner Image URI

In [14]:
from sagemaker import image_uris

# Retrieve the container image URI for the SageMaker Linear Learner algorithm
container = image_uris.retrieve(framework='linear-learner', region=region)


# Step 5: Configure the SageMaker Linear Learner Estimator

In [15]:
# Calculate the number of rows and columns in the 'concrete_data' DataFrame (the column count includes the target)
num_rows, num_features = concrete_data.shape

# Print the number of rows and features
print("Number of Rows:", num_rows)
print("Number of Features:", num_features)

Number of Rows: 1030
Number of Features: 9


In [63]:
# Create a SageMaker Linear Learner estimator
linear_learner = sagemaker.estimator.Estimator(container,
                                               role, 
                                               instance_count=1, 
                                               instance_type='ml.m5.large',
                                               output_path=f's3://{bucket}/{prefix}/output',
                                               sagemaker_session=sagemaker_session)

# Set hyperparameters for the Linear Learner
linear_learner.set_hyperparameters(feature_dim=8,  # Number of input features (excluding target)
                                   mini_batch_size=286,  # Size of mini-batches for training
                                   learning_rate=0.0016627556317724691, # Learning rate for training
                                   predictor_type='regressor',  # Specify 'regressor' for regression
                                   normalize_data=True,  # Normalize input features
                                   normalize_label=True)  # Normalize target variable for regression


# Step 6: Train the Model

In [64]:
# Fit the SageMaker Linear Learner estimator to the training and validation data
linear_learner.fit({'train': TrainingInput(train_uri, content_type='text/csv'),
                    'validation': TrainingInput(validation_uri, content_type='text/csv')})

INFO:sagemaker:Creating training-job with name: linear-learner-2024-03-03-04-09-58-366


2024-03-03 04:09:58 Starting - Starting the training job...
2024-03-03 04:10:14 Starting - Preparing the instances for training...
2024-03-03 04:10:53 Downloading - Downloading input data......
2024-03-03 04:11:37 Downloading - Downloading the training image......
2024-03-03 04:12:52 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[03/03/2024 04:12:56 INFO 140517137151808] Reading default configuration from /opt/amazon/lib/python3.8/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bias': '0.0', 'opt

## Evaluation
As the training job output shows, the model reaches an RMSE of around 9.5 on the target values. If the in-production model performed the same way, then whenever an employee inputs the component amounts, the predicted compressive strength would typically be about 9.5 MPa off from the actual value. The predictions therefore carry some error, but the margin is small enough that they would not be too far from the real value.
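
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same unit as the target (MPa here). A minimal sketch with made-up numbers, not values from the training job:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up strengths in MPa, for illustration only.
actual = [20.0, 30.0, 40.0, 50.0]
predicted = [21.0, 29.0, 47.0, 43.0]

# Errors are -1, 1, -7, 7: squares sum to 100, mean is 25, square root is 5.
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, the two large misses of 7 MPa dominate the result even though the other two predictions are within 1 MPa.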


In [59]:
# Hyperparameter optimization: search ranges for the two tuned hyperparameters
params = {"mini_batch_size": sagemaker.tuner.IntegerParameter(32, 512),
          "learning_rate": sagemaker.tuner.ContinuousParameter(0.00001, 0.5)}

# Tune the estimator to minimize the mean squared error on the validation set
hyp_tuner = sagemaker.tuner.HyperparameterTuner(estimator=linear_learner,
                                                objective_metric_name="validation:mse",
                                                objective_type="Minimize",
                                                hyperparameter_ranges=params)
hyp_tuner.fit({'train': TrainingInput(train_uri, content_type='text/csv'),
               'validation': TrainingInput(validation_uri, content_type='text/csv')})

INFO:sagemaker:Creating hyperparameter tuning job with name: linear-learner-240303-0354


......................................!


After running, the hyperparameter tuner found that the optimal values for the mini-batch size
and learning rate were 286 and 0.0016627556317724691, respectively.

# Step 7: Deploy the Endpoint

## Check whether the Endpoint exists and, if it does, delete it

In [52]:
EndpointConfig="regression-linear-learner-endpoint"
Endpoint="regression-linear-learner-endpoint"

In [53]:
import boto3

def delete_sagemaker_endpoint(endpoint_name):
    # Initialize a SageMaker client (named sm_client so it does not shadow the sagemaker module)
    sm_client = boto3.client('sagemaker', region_name=region)

    try:
        # Check if the endpoint configuration exists
        response = sm_client.describe_endpoint_config(EndpointConfigName=endpoint_name)

        # If the configuration exists, delete it
        if response:
            sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
            print(f"Endpoint configuration '{endpoint_name}' has been deleted.")

        # Check if the endpoint exists
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)

        # If the endpoint exists, delete it
        if response:
            sm_client.delete_endpoint(EndpointName=endpoint_name)
            print(f"Endpoint '{endpoint_name}' has been deleted.")
        
        return True  # Deletion successful
    except Exception as e:
        error_message = str(e)
        if "Could not find endpoint configuration" in error_message:
            print(f"Endpoint configuration '{endpoint_name}' not found. No action taken.")
            return True  # Configuration not found, exit gracefully
        elif "Could not find endpoint" in error_message:
            print(f"Endpoint '{endpoint_name}' not found. No action taken.")
            return True  # Endpoint not found, exit gracefully
        else:
            print(f"Error deleting SageMaker endpoint and configuration: {error_message}")
            return False  # Deletion failed

In [54]:
# Delete the Endpoint and Config

result = delete_sagemaker_endpoint(Endpoint)
if result:
    print(f"Endpoint '{Endpoint}' and its configuration have been deleted.")
else:
    print(f"Failed to delete endpoint '{Endpoint}' and its configuration.")

Endpoint configuration 'regression-linear-learner-endpoint' has been deleted.
Endpoint 'regression-linear-learner-endpoint' not found. No action taken.
Endpoint 'regression-linear-learner-endpoint' and its configuration have been deleted.


In [55]:
import boto3

# Create a SageMaker client to interact with the SageMaker service
sagemaker_client = boto3.client('sagemaker')

# Deploy the Linear Learner model to the SageMaker endpoint
linear_predictor = linear_learner.deploy(
    initial_instance_count=1,  # Number of initial instances
    instance_type='ml.m5.large',  # Type of instance for serving
    endpoint_name=Endpoint  # Custom endpoint name
)

INFO:sagemaker:Creating model with name: linear-learner-2024-03-03-03-42-39-126
INFO:sagemaker:Creating endpoint-config with name regression-linear-learner-endpoint
INFO:sagemaker:Creating endpoint with name regression-linear-learner-endpoint


-----!

# Step 8: Query the Endpoint

In [56]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Set the serializer to CSV (Comma-Separated Values)
linear_predictor.serializer = CSVSerializer()

# Set the deserializer to JSON (JavaScript Object Notation)
linear_predictor.deserializer = JSONDeserializer()


In [68]:
# Sample hardcoded data point
sample_data = [485, 0, 0, 146, 0, 1120, 800, 28]

# Convert the sample data to a CSV string
query_data_csv = ','.join([str(item) for item in sample_data])

# Querying the model and getting a prediction
response = linear_predictor.predict(query_data_csv)

# Print out the prediction
print(sample_data)
print("Predicted value:", response['predictions'][0]['score'], "MPa")

[485, 0, 0, 146, 0, 1120, 800, 28]
Predicted value: 54.380409240722656 MPa


## Inference

Here we can see the prediction output for an example data point. Be aware that this is a proxy dataset standing in for the company's own concrete mixtures, so the prediction may not be 100% correct for them.
Regardless, the model was able to output a suitable prediction within seconds. A task that may have taken hours or days can be done quickly with the linear regression model, saving both time and money. This would be especially useful for concrete mixtures that take several days to come up to strength.
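
The endpoint response used above is a dictionary with a `predictions` list, where each entry carries a `score`. A small sketch of pulling the scores out of a response of that shape follows. The sample response reuses the value printed above, and `extract_scores` is a hypothetical helper rather than part of the SageMaker SDK.

```python
def extract_scores(response, decimals=1):
    """Return the rounded 'score' of every prediction in the response."""
    return [round(prediction["score"], decimals)
            for prediction in response["predictions"]]


# Response shaped like the one returned in Step 8 (value copied from the output above).
sample_response = {"predictions": [{"score": 54.380409240722656}]}

scores = extract_scores(sample_response)
print("Predicted strength (MPa):", scores)
```

Rounding to one decimal place is a presentation choice here. Given the error margin discussed in the Evaluation section, the extra digits in the raw score carry no practical meaning.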

# Step 9: Delete the Endpoint and Config

In [51]:
# Delete the Endpoint and Config

result = delete_sagemaker_endpoint(Endpoint)
if result:
    print(f"Endpoint '{Endpoint}' and its configuration have been deleted.")
else:
    print(f"Failed to delete endpoint '{Endpoint}' and its configuration.")

Endpoint configuration 'regression-linear-learner-endpoint' has been deleted.
Endpoint 'regression-linear-learner-endpoint' has been deleted.
Endpoint 'regression-linear-learner-endpoint' and its configuration have been deleted.


## Cost Benefit Analysis
As stated previously, the development cost at its highest would be \\$144.80. The model won't be run frequently every month, as new concrete jobs aren't taken on all at once. All that is needed is basic S3 storage. A lot of data will be stored in order to get a better-trained model and better predictions. The monthly operating cost for using the model will be around \\$1.023. The savings could be up to \\$300, since the company will be reducing material use and maintenance on the compression test machine.
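
A rough comparison of these figures can be sketched as below. It uses the upper development estimate of \\$144.80 from the Development Costs section and the \\$1.023 monthly operating cost. The text does not say over what period the \\$300 in savings accrues, so the sketch assumes, purely for illustration, that it is the total saved over the period being evaluated.

```python
DEVELOPMENT_COST = 144.80       # upper estimate from the Development Costs section, $
MONTHLY_OPERATING_COST = 1.023  # estimated operating cost, $ per month
ESTIMATED_SAVINGS = 300.00      # upper bound on savings, $ (period is an assumption)


def net_benefit(months, savings=ESTIMATED_SAVINGS):
    """Savings minus the one-off development cost and the operating cost over `months`."""
    total_cost = DEVELOPMENT_COST + MONTHLY_OPERATING_COST * months
    return savings - total_cost


first_year = net_benefit(12)
print(f"Net benefit after 12 months: ${first_year:.2f}")
```

Under that assumption the one-off development cost is the largest expense, and a year of operating costs adds only about \\$12 to it.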