# Building an AWS<sup>®</sup> ML Pipeline with SageWorks (Classification)

<div style="padding: 20px">
<img width="1000" alt="sageworks_pipeline" src="https://github.com/SuperCowPowers/sageworks/assets/4806709/47cc5739-971c-48c3-9ef6-fd8370e3ec57"></div>

This notebook uses the SageWorks Science Workbench to quickly build an AWS® Machine Learning Pipeline with the public Wine dataset. This dataset contains physico-chemical measurements for 178 wine samples from three cultivars.

We're going to set up a full AWS Machine Learning Pipeline from start to finish. Since the SageWorks Classes encapsulate, organize, and manage sets of AWS® Services, setting up our ML pipeline will be straightforward.

SageWorks also provides visibility into AWS services for every step of the process so we know exactly what we've got and how to use it.
<br><br>

## Data
Wine Dataset: A classic dataset used in pattern recognition, machine learning, and data mining, the Wine dataset comprises 178 wine samples sourced from three different cultivars in Italy. The dataset features 13 physico-chemical attributes for each wine sample, providing a multi-dimensional feature space ideal for classification tasks. The aim is to correctly classify the wine samples into one of the three cultivars based on these chemical constituents. This dataset is widely employed for testing and benchmarking classification algorithms and is notable for its well-balanced distribution among classes. It serves as a straightforward, real-world example for classification tasks in machine learning.

**Main Reference:**
Forster, P. (1991). Machine Learning of Natural Language and Ontology (Technical Report DAI-TR-261). Department of Artificial Intelligence, University of Edinburgh.

**Data Downloaded from UCI:**
https://archive.ics.uci.edu/ml/datasets/Wine
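
The raw UCI file has no header row and stores the cultivar label (1, 2, or 3) in the first column, followed by the 13 attributes. The sketch below shows one way such rows could be turned into a CSV with a header and a `wine_class` label column, matching the column names that appear later in this notebook. The helper `uci_rows_to_csv` and the sample string are hypothetical, and the CSV actually used in this notebook may have been prepared differently.

```python
import csv
import io

# Column names as they appear in this notebook's FeatureSet output
FEATURE_COLUMNS = [
    "alcohol", "malic_acid", "ash", "alcalinity_of_ash", "magnesium",
    "total_phenols", "flavanoids", "nonflavanoid_phenols", "proanthocyanins",
    "color_intensity", "hue", "od280_od315_of_diluted_wines", "proline",
]


def uci_rows_to_csv(raw_text: str) -> str:
    """Convert headerless UCI rows (class label first) into a CSV with a header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FEATURE_COLUMNS + ["wine_class"])
    for row in csv.reader(io.StringIO(raw_text.strip())):
        label, values = row[0], [float(v) for v in row[1:]]
        writer.writerow(values + [f"Class_{label}"])
    return out.getvalue()


# One sample in the raw layout (values taken from a row shown later in this notebook)
raw = "1,13.28,1.64,2.84,15.5,110,2.6,2.68,.34,1.36,4.6,1.09,2.78,880"
csv_text = uci_rows_to_csv(raw)
print(csv_text)
```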


## SageWorks
SageWorks is a medium granularity framework that manages and aggregates AWS® Services into classes and concepts. When you use SageWorks you think about DataSources, FeatureSets, Models, and Endpoints. Underneath the hood those classes handle all the details around updating and

## Notebook
This notebook uses the SageWorks Science Workbench to quickly build an AWS® Machine Learning Pipeline.


® Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates.

In [None]:
# Okay first we get our data into SageWorks as a DataSource
from sageworks.transforms.data_loaders.light.csv_to_data_source import CSVToDataSource

# SageWorks help is actually helpful
Every class in SageWorks is self-documenting; just use `help(ClassName)` and you'll get help like this...
```
help(CSVToDataSource)
Help on class CSVToDataSource in module sageworks.transforms.data_loaders.light.csv_to_data_source:

class CSVToDataSource(sageworks.transforms.transform.Transform)
 |  CSVToDataSource(csv_file_path: str, data_uuid: str)
 |  
 |  CSVToDataSource: Class to move local CSV Files into a SageWorks DataSource
 |  
 |  Common Usage:
 |      csv_to_data = CSVToDataSource(csv_file_path, data_uuid)
 |      csv_to_data.set_output_tags(["abalone", "csv", "whatever"])
 |      csv_to_data.transform()
 |  
```

In [None]:
# Note: If you want to use data from S3 just use 'S3ToDataSource'
csv_path = '/Users/briford/data/sageworks/wine_classification.csv'
to_data_source = CSVToDataSource(csv_path, 'wine_data')
to_data_source.set_output_tags(['wine', 'classification'])
to_data_source.transform()

<div style="float: right; padding: 20px"><img src="images/aws_dashboard_aqsol.png" width="600px"></div>

# So what just happened?
Okay, so it was just a few lines of code but SageWorks did the following for you:
   
- Transformed the CSV to a **Parquet** formatted dataset and stored it in AWS S3
- Created an AWS Data Catalog database/table with the column names/types
- Made the data directly queryable with Athena in the AWS Athena Console

The new 'DataSource' will show up in AWS and of course the SageWorks AWS Dashboard. Anyone can see the data, get information on it, use AWS® Athena to query it, and of course use it as part of their analysis pipelines.
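
To make the "column names/types" step more concrete, here is a minimal sketch of the kind of schema inference a catalog entry depends on. It is not the SageWorks implementation: the helper `infer_column_types` is hypothetical, and the mapping to the Athena type names `bigint`, `double`, and `string` is an assumption chosen for illustration.

```python
import csv
import io


def infer_column_types(csv_text: str) -> dict:
    """Guess an Athena-style type ('bigint', 'double', 'string') for each CSV column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    def classify(values):
        # Try the strictest type first, then fall back to looser ones
        for cast, type_name in ((int, "bigint"), (float, "double")):
            try:
                for v in values:
                    cast(v)
                return type_name
            except ValueError:
                continue
        return "string"

    return {col: classify([r[col] for r in rows]) for col in rows[0]}


sample = "alcohol,magnesium,wine_class\n13.28,110,Class_1\n12.33,101,Class_2\n"
schema = infer_column_types(sample)
print(schema)
```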

<div style="float: right; padding: 20px"><img src="images/athena_query_aqsol.png" width="600px"></div>

# Visibility and Easy to Use AWS Athena Queries
Since SageWorks manages a broad range of AWS Services, you get visibility into exactly what data you have in AWS. You also get nice perks like hitting the 'Query' link in the Dashboard Web Interface and getting a direct Athena console on your dataset. With AWS Athena you can use typical SQL statements to inspect and investigate your data.
    
**But that's not all!**
    
SageWorks also provides an API to query DataSources and FeatureSets directly from code, so let's do that now.

In [None]:
from sageworks.artifacts.data_sources.data_source import DataSource
data_source = DataSource('wine_data')
data_source.query('SELECT * from wine_data limit 5')
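
If you don't have AWS access handy, the same style of SQL can be tried locally. The sketch below uses Python's built-in `sqlite3` as a stand-in for Athena (an assumption for illustration only; Athena's SQL dialect and types differ in places) with a few rows and columns taken from the test output shown later in this notebook.

```python
import sqlite3

# In-memory table standing in for the 'wine_data' Athena table (subset of columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wine_data (alcohol REAL, malic_acid REAL, wine_class TEXT)")
rows = [
    (13.28, 1.64, "Class_1"),
    (12.33, 1.10, "Class_2"),
    (12.25, 3.88, "Class_3"),
    (13.30, 1.72, "Class_1"),
    (14.22, 3.99, "Class_1"),
]
conn.executemany("INSERT INTO wine_data VALUES (?, ?, ?)", rows)

# Same shape of query as above, plus a per-class row count
first_rows = conn.execute("SELECT * FROM wine_data LIMIT 3").fetchall()
class_counts = conn.execute(
    "SELECT wine_class, COUNT(*) FROM wine_data GROUP BY wine_class ORDER BY wine_class"
).fetchall()
print(first_rows)
print(class_counts)
```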

# The AWS ML Pipeline Awaits
Okay, so in a few lines of code we created a 'DataSource' (which is simply a set of orchestrated AWS Services) but now we'll go through the construction of the rest of our Machine Learning pipeline.

<div style="padding: 20px">
<img width="1000" alt="sageworks_pipeline" src="https://github.com/SuperCowPowers/sageworks/assets/4806709/47cc5739-971c-48c3-9ef6-fd8370e3ec57"></div>

## ML Pipeline
- DataSource **(done)**
- FeatureSet
- Model
- Endpoint (serves models)

# Create a FeatureSet
**Note:** Normally this is where you'd do a deep dive on the data/features, look at data quality metrics, identify redundant features, and engineer new features. For the purposes of this notebook we're simply going to take the features given to us in the Wine data from UCI. Those features are:

In [None]:
data_source.column_details()

In [None]:
# Note to self: Perhaps lets trim down the imports :)
from sageworks.transforms.data_to_features.light.data_to_features_light import DataToFeaturesLight
help(DataToFeaturesLight)

```
Help on class DataToFeaturesLight in module sageworks.transforms.data_to_features.light.data_to_features_light:

class DataToFeaturesLight(sageworks.transforms.transform.Transform)
 |  DataToFeaturesLight(data_uuid: str, feature_uuid: str)
 |  
 |  DataToFeaturesLight: Base Class for Light DataSource to FeatureSet using Pandas
 |  
 |  Common Usage:
 |      to_features = DataToFeaturesLight(data_uuid, feature_uuid)
 |      to_features.set_output_tags(["abalone", "public", "whatever"])
 |      to_features.transform(target, id_column="id"/None, event_time_column="date"/None)
 ```

# Why does creating a FeatureSet take a long time?
Great question. Between row 'ingestion' and waiting for the offline store to finish populating, it does take a **long time**. SageWorks is simply invoking the AWS Service APIs, and those APIs take a while to do their thing.

The good news is that SageWorks can monitor and query the status of the object and let you know when things are ready.
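
The monitoring idea boils down to polling until the expected number of rows has arrived. Below is a minimal sketch of such a loop; `wait_for_rows` and the simulated row counts are hypothetical stand-ins, not the SageWorks internals, and in practice `get_row_count` would wrap a real status query.

```python
import time
from typing import Callable


def wait_for_rows(get_row_count: Callable[[], int], expected_rows: int,
                  poll_seconds: float = 30.0, timeout_seconds: float = 1800.0) -> int:
    """Poll until the row count reaches expected_rows; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        rows = get_row_count()
        if rows >= expected_rows:
            return rows
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Only {rows} of {expected_rows} rows after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Simulated offline store that fills up over successive polls
simulated_counts = iter([0, 0, 90, 178])
final_rows = wait_for_rows(lambda: next(simulated_counts), expected_rows=178, poll_seconds=0.01)
print(final_rows)
```

A timeout keeps the notebook from hanging forever if ingestion fails part way through.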

In [None]:
data_to_features = DataToFeaturesLight('wine_data', 'wine_features')
data_to_features.set_output_tags(["wine", "classification", "uci"])
data_to_features.transform(target="wine_class")  # The target column, as it appears in the FeatureSet output

```
Reading Data Catalog Database: sagemaker_featurestore...
Reading Data Catalog Database: sageworks...
2023-10-01 13:21:30 (data_to_pandas.py:56) INFO Post-Transform: Checking Pandas DataFrame...
2023-10-01 13:21:30 (data_to_pandas.py:57) INFO DataFrame Shape: (178, 14)
Reading Feature Store Database...
2023-10-01 13:21:34 (feature_set.py:45) INFO Could not find feature set wine_features within current visibility scope
2023-10-01 13:21:34 (feature_set.py:74) INFO FeatureSet.exists() wine_features not found in AWS Metadata!
2023-10-01 13:21:34 (pandas_to_features.py:221) INFO Prep the output_df (cat_convert, convert types, lowercase columns, add training column)...
2023-10-01 13:21:34 (pandas_to_features.py:79) INFO Generating an id column before FeatureSet Creation...
2023-10-01 13:21:34 (pandas_to_features.py:86) INFO Generating an event_time column before FeatureSet Creation...
2023-10-01 13:21:34 (pandas_to_features.py:92) INFO Converting event_time to ISOFormat Date String before FeatureSet Creation...
2023-10-01 13:21:35 (connector.py:62) INFO Retrieving SageWorks Metadata for Artifact: arn:aws:sagemaker:us-west-2:507740646243:feature-group/wine_feature_set...
2023-10-01 13:21:35 (pandas_to_features.py:328) INFO FeatureSet being Created...
2023-10-01 13:21:35 (connector.py:62) INFO Retrieving SageWorks Metadata for Artifact: arn:aws:sagemaker:us-west-2:507740646243:feature-group/test_feature_set...
2023-10-01 13:21:35 (connector.py:62) INFO Retrieving SageWorks Metadata for Artifact: arn:aws:sagemaker:us-west-2:507740646243:feature-group/abalone_feature_set...
2023-10-01 13:21:40 (pandas_to_features.py:328) INFO FeatureSet being Created...
2023-10-01 13:22:22 (pandas_to_features.py:331) INFO FeatureSet wine_features successfully created
2023-10-01 13:22:24 (pandas_to_features.py:304) INFO Added rows: 178
2023-10-01 13:22:24 (pandas_to_features.py:305) INFO Failed rows: 0
2023-10-01 13:22:24 (pandas_to_features.py:306) INFO Total rows to be ingested: 178
2023-10-01 13:22:24 (pandas_to_features.py:310) INFO Post-Transform: Populating Offline Storage and make_ready()...
Reading Feature Store Database...
2023-10-01 13:22:26 (pandas_to_features.py:317) INFO Waiting for Feature Group Offline storage to be ready...
2023-10-01 13:22:26 (pandas_to_features.py:318) INFO Note: This will often take 10-20 minutes...go have coffee or lunch :)
2023-10-01 13:22:31 (pandas_to_features.py:338) INFO Waiting for AWS Feature Group wine_features Offline Storage (0 rows)...
2023-10-01 13:29:58 (pandas_to_features.py:342) INFO Success: Reached Expected Rows (178 rows)...
```
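
The log above mentions generating an `id` column and an `event_time` column in ISO format before the FeatureSet is created. Here is a minimal sketch of what that preparation step could look like on plain Python rows. The helper `add_id_and_event_time` is hypothetical, and details such as ids starting at 0 are assumptions rather than the SageWorks behavior.

```python
from datetime import datetime, timezone


def add_id_and_event_time(rows: list) -> list:
    """Return copies of rows with an integer 'id' and a shared ISO-8601 'event_time'."""
    now = datetime.now(timezone.utc)
    # Millisecond precision with a trailing 'Z', e.g. 2023-10-01T23:11:41.953Z
    event_time = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"
    return [{**row, "id": i, "event_time": event_time} for i, row in enumerate(rows)]


rows = [
    {"alcohol": 13.28, "wine_class": "Class_1"},
    {"alcohol": 12.33, "wine_class": "Class_2"},
]
prepared = add_id_and_event_time(rows)
print(prepared[0])
```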

# New FeatureSet shows up in Dashboard
Now we see our new feature set automatically pop up in our dashboard. FeatureSet creation involves the most complex set of AWS Services:
- New Entry in AWS Feature Store
- Specific Type and Field Requirements are handled
- Plus all the AWS Services associated with DataSources (see above)

The new 'FeatureSet' will show up in AWS and of course the SageWorks AWS Dashboard. Anyone can see the feature set, get information on it, use AWS® Athena to query it, and of course use it as part of their analysis pipelines.

<div style="padding: 20px"><img src="images/dashboard_aqsol_features.png" width="1000px"></div>
    
**Important:** All inputs are stored to track provenance on your data as it goes through the pipeline. We can see the last field in the FeatureSet shows the input DataSource.
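
Conceptually, provenance tracking means every artifact remembers the artifact it was built from. The sketch below models that idea with a plain dictionary, using the artifact names from this notebook (the model and endpoint are created in later sections). The `LINEAGE` mapping and `trace_provenance` helper are hypothetical illustrations, not how SageWorks stores its metadata.

```python
# Each artifact records the artifact it was built from (names taken from this notebook)
LINEAGE = {
    "wine-classification-end": "wine-classification",  # Endpoint   <- Model
    "wine-classification": "wine_features",            # Model      <- FeatureSet
    "wine_features": "wine_data",                      # FeatureSet <- DataSource
    "wine_data": "wine_classification.csv",            # DataSource <- CSV file
}


def trace_provenance(artifact: str, lineage: dict) -> list:
    """Walk input links from an artifact back to its original source."""
    chain = [artifact]
    while chain[-1] in lineage:
        chain.append(lineage[chain[-1]])
    return chain


chain = trace_provenance("wine-classification-end", LINEAGE)
print(" <- ".join(chain))
```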

# Publishing our Model
**Note:** Normally this is where you'd do a deep dive on the feature set. For the purposes of this notebook we're simply going to take the features given to us and make a reference model that can track our baseline model performance for others to improve upon. :)

In [None]:
from sageworks.transforms.features_to_model.features_to_model import FeaturesToModel
help(FeaturesToModel)

```
class FeaturesToModel(sageworks.transforms.transform.Transform)
 |  FeaturesToModel(feature_uuid: str, model_uuid: str)
 |  
 |  FeaturesToModel: Train/Create a Model from a FeatureSet
 |  
 |  Common Usage:
 |      to_model = FeaturesToModel(feature_uuid, model_uuid)
 |      to_model.set_output_tags(["abalone", "public", "whatever"])
 |      to_model.transform(target="class_number_of_rings", description="Abalone Regression Model".
 |                         input_feature_list=<features>, model_type="regressor/classifier",
 |                         delete_existing=True/False)
 ```

In [None]:
# Compute our feature list (or have the Class guess it)
features = data_source.column_names()
features.remove("wine_class")  # Drop the target column so it isn't used as a feature
print(features)

In [None]:
to_model = FeaturesToModel('wine_features', 'wine-classification')
to_model.set_output_tags(["wine", "classification", "reference"])
to_model.transform(target="wine_class", description="Wine Classification Model",
                   feature_list=features, model_type='classifier')

```
INFO Created new training data s3://sandbox-sageworks-artifacts/feature-sets/wine_features/datasets/all_2023-10-01_19:44:57/7fa767fc-e461-4828-9d77-57ecd6f369ed.csv...
INFO:sageworks.transforms.transform:Created new training data s3://sandbox-sageworks-artifacts/feature-sets/wine_features/datasets/all_2023-10-01_19:44:57/7fa767fc-e461-4828-9d77-57ecd6f369ed.csv...
Using provided s3_resource
INFO:sagemaker:Creating training-job with name: sagemaker-scikit-learn-2023-10-01-19-45-03-876
2023-10-01 19:45:05 Starting - Starting the training job...
2023-10-01 19:45:20 Starting - Preparing the instances for training......
2023-10-01 19:46:16 Downloading - Downloading input data...
2023-10-01 19:46:46 Training - Downloading the training image...
2023-10-01 19:47:22 Training - Training image download completed. Training in progress.

{'Class_1': 0, 'Class_2': 1, 'Class_3': 2}
  wine_class  precision    recall    fscore  support
0    Class_1   1.000000  0.888889  0.941176       18
1    Class_2   0.857143  1.000000  0.923077       12
2    Class_3   1.000000  1.000000  1.000000        9
2023-10-01 23:59:26,689 sagemaker-containers INFO     Reporting training SUCCESS
2023-10-01 19:47:37,071 sagemaker-containers INFO     Reporting training SUCCESS

2023-10-01 19:47:53 Uploading - Uploading generated training model
2023-10-01 19:47:53 Completed - Training job completed
Training seconds: 96
Billable seconds: 96
2023-10-01 13:48:30 (features_to_model.py:142) INFO Creating new model wine-classification...
```

# Deploying an AWS Endpoint
Okay, now that our model has been published we can deploy an AWS Endpoint to serve inference requests for that model. Deploying an Endpoint allows a large set of services/APIs to use our model in production.

In [None]:
from sageworks.transforms.model_to_endpoint.model_to_endpoint import ModelToEndpoint
to_endpoint = ModelToEndpoint("wine-classification", "wine-classification-end")
to_endpoint.set_output_tags(["wine", "classification"])
to_endpoint.transform()

# Model Inference from the Endpoint
AWS Endpoints will bundle up a model as a service that responds to HTTP requests. The typical way to use an endpoint is to send a POST request with your features in CSV format. SageWorks provides a nice DataFrame based interface that takes care of many details for you.
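
For a sense of what the DataFrame interface saves you from, here is a minimal sketch of building the CSV body by hand. The helper `rows_to_csv_payload` is hypothetical, the feature list is truncated for illustration, and whether the model expects a header row or a specific column order is an assumption you would need to confirm for your own endpoint.

```python
import csv
import io


def rows_to_csv_payload(rows: list, feature_columns: list) -> str:
    """Serialize feature rows (dicts) to a header-less CSV body in a fixed column order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[col] for col in feature_columns])
    return buffer.getvalue()


feature_columns = ["alcohol", "malic_acid", "ash"]  # truncated list for illustration
rows = [
    {"alcohol": 13.28, "malic_acid": 1.64, "ash": 2.84},
    {"alcohol": 12.33, "malic_acid": 1.1, "ash": 2.28},
]
payload = rows_to_csv_payload(rows, feature_columns)
print(payload)

# With AWS credentials, the payload could then be sent via boto3 (not executed here):
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(EndpointName="wine-classification-end",
#                                      ContentType="text/csv", Body=payload)
```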

In [None]:
from sageworks.artifacts.endpoints.endpoint import Endpoint
help(Endpoint)

```
class Endpoint(sageworks.artifacts.artifact.Artifact)
 |  Endpoint(endpoint_uuid)
 |  
 |  Endpoint: SageWorks Endpoint Class
 |  
 |  Common Usage:
 |      my_endpoint = Endpoint(endpoint_uuid)
 |      prediction_df = my_endpoint.predict(test_df)
 |      metrics = my_endpoint.regression_metrics(target_column, prediction_df)
 |      for metric, value in metrics.items():
 |          print(f"{metric}: {value:0.3f}")
 |  
```

In [None]:
# Get the Endpoint
my_endpoint = Endpoint('wine-classification-end')

# Model Provenance is locked into SageWorks
We can now look at the model, see what FeatureSet was used to train it, and even better, see exactly which ROWS in that training set were used to create the model. We can make a query that returns the ROWS that were not used for training.

In [4]:
from sageworks.artifacts.feature_sets.feature_set import FeatureSet
fs = FeatureSet('wine_features')
table = fs.get_data_source().uuid
test_df = fs.query(f"select * from {table} where training=0")
test_df.head()

Reading Feature Store Database...
Reading Data Catalog Database: sageworks...
2023-10-02 09:08:17 (feature_set.py:64) INFO FeatureSet Initialized: wine_features
Reading Data Catalog Database: sagemaker_featurestore...
2023-10-02 09:08:18 (connector.py:62) INFO Retrieving SageWorks Metadata for Artifact: arn:aws:sagemaker:us-west-2:507740646243:feature-group/wine_features...
2023-10-02 09:08:18 (connector.py:62) INFO Retrieving SageWorks Metadata for Artifact: arn:aws:sagemaker:us-west-2:507740646243:feature-group/wine_feature_set...
2023-10-02 09:08:18 (connector.py:62) INFO Retrieving SageWorks Metadata for Artifact: arn:aws:sagemaker:us-west-2:507740646243:feature-group/test_feature_set...
2023-10-02 09:08:18 (connector.py:62) INFO Retrieving SageWorks Metadata for Artifact: arn:aws:sagemaker:us-west-2:507740646243:feature-group/abalone_feature_set...


Unnamed: 0,write_time,api_invocation_time,is_deleted,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline,wine_class,id,event_time,training
0,2023-10-01 23:18:14.584000+00:00,2023-10-01 23:12:04+00:00,False,13.28,1.64,2.84,15.5,110.0,2.6,2.68,0.34,1.36,4.6,1.09,2.78,880.0,Class_1,36,2023-10-01T23:11:41.953Z,0
1,2023-10-01 23:18:15.750000+00:00,2023-10-01 23:12:04+00:00,False,12.33,1.1,2.28,16.0,101.0,2.05,1.09,0.63,0.41,3.27,1.25,1.67,680.0,Class_2,60,2023-10-01T23:11:41.953Z,0
2,2023-10-01 23:18:15.750000+00:00,2023-10-01 23:12:04+00:00,False,12.25,3.88,2.2,18.5,112.0,1.38,0.78,0.29,1.14,8.21,0.65,2.0,855.0,Class_3,144,2023-10-01T23:11:41.953Z,0
3,2023-10-01 23:18:16.008000+00:00,2023-10-01 23:12:05+00:00,False,13.3,1.72,2.14,17.0,94.0,2.4,2.19,0.27,1.35,3.95,1.02,2.77,1285.0,Class_1,27,2023-10-01T23:11:41.953Z,0
4,2023-10-01 23:18:15.727000+00:00,2023-10-01 23:12:05+00:00,False,14.22,3.99,2.51,13.2,128.0,3.0,3.04,0.2,2.08,5.1,0.89,3.53,760.0,Class_1,39,2023-10-01T23:11:41.953Z,0
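
The `training` column is what makes this hold-out query possible: rows flagged `0` were kept out of training. This notebook doesn't show how the flag is assigned, so the sketch below is only one plausible approach, a deterministic split keyed on the row `id`. The helper `training_flag` and the 20% hold-out are assumptions (the metrics in this notebook report a support of 18 + 12 + 9 = 39 held-out rows out of 178).

```python
import hashlib


def training_flag(row_id: int, holdout_percent: int = 20) -> int:
    """Return 0 (held out) or 1 (training), decided deterministically from the id."""
    digest = hashlib.md5(str(row_id).encode()).hexdigest()
    return 0 if int(digest, 16) % 100 < holdout_percent else 1


rows = [{"id": i, "training": training_flag(i)} for i in range(178)]
test_rows = [r for r in rows if r["training"] == 0]   # same idea as "where training=0"
train_rows = [r for r in rows if r["training"] == 1]
print(len(train_rows), len(test_rows))
```

Because the flag depends only on the id, re-running the split gives the same rows every time, which is what lets a later query recover exactly the rows the model never saw.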


In [6]:
# Okay now use the SageWorks Endpoint to make prediction on TEST data
prediction_df = my_endpoint.predict(test_df)
metrics = my_endpoint.classification_metrics("wine_class", prediction_df)
metrics

Processing...
  wine_class  precision    recall    fscore  support
0    Class_1   1.000000  0.888889  0.941176       18
1    Class_2   0.857143  1.000000  0.923077       12
2    Class_3   1.000000  1.000000  1.000000        9
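
To see where these numbers come from, the sketch below computes per-class precision, recall, and F-score from lists of true and predicted labels. The label lists are reconstructed to be consistent with the table above (two Class_1 samples predicted as Class_2, everything else correct); that reconstruction is an inference from the reported metrics, not output from the endpoint, and `per_class_metrics` is a hypothetical helper rather than the SageWorks method.

```python
from collections import Counter


def per_class_metrics(y_true: list, y_pred: list) -> dict:
    """Compute precision, recall, F-score and support for every class label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            tp[actual] += 1
        else:
            fp[predicted] += 1
            fn[actual] += 1
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        recall = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        fscore = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall,
                          "fscore": fscore, "support": tp[label] + fn[label]}
    return metrics


# Reconstructed labels: 18 Class_1 (2 misclassified as Class_2), 12 Class_2, 9 Class_3
y_true = ["Class_1"] * 18 + ["Class_2"] * 12 + ["Class_3"] * 9
y_pred = ["Class_1"] * 16 + ["Class_2"] * 2 + ["Class_2"] * 12 + ["Class_3"] * 9
results = per_class_metrics(y_true, y_pred)
for label, m in results.items():
    print(label, {k: round(v, 6) for k, v in m.items()})
```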


# Follow Up on Predictions
Looking at the classification metrics above, we can see that most predictions were correct, but Class_1 recall is 0.889 and Class_2 precision is 0.857, which is consistent with 2 of the 18 Class_1 samples being predicted as Class_2. So at this point we'd use SageWorks to investigate those predictions, map them back to our FeatureSet and DataSource, and see if there were irregularities in the training data.

# Wrap up: Building an AWS<sup>®</sup> ML Pipeline with SageWorks

<div style="float: right; padding: 20px"><img width="450" src="https://user-images.githubusercontent.com/4806709/266844238-df2f1b90-9e6f-4dbb-9490-ad75545e630f.png"></div>



This notebook used the SageWorks Science Workbench to quickly build an AWS® Machine Learning Pipeline with the public Wine dataset. We built a full AWS Machine Learning Pipeline from start to finish.

SageWorks made it easy:
- Provided visibility into AWS services for every step of the process.
- Managed the complexity of organizing the data and populating the AWS services.
- Provided an easy to use API to perform Transformations and inspect Artifacts.

Using SageWorks will minimize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a SageWorks Alpha Tester, contact us at [sageworks@supercowpowers.com](mailto:sageworks@supercowpowers.com).


# Helper Methods

In [None]:
# Plotting defaults
%matplotlib inline
import matplotlib.pyplot as plt
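# Matplotlib 3.6+ renamed the seaborn styles (e.g. 'seaborn-v0_8-deep'), so use whichever name is available
style = 'seaborn-v0_8-deep' if 'seaborn-v0_8-deep' in plt.style.available else 'seaborn-deep'
plt.style.use(style)
#plt.style.use('seaborn-dark')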
plt.rcParams['font.size'] = 12.0
plt.rcParams['figure.figsize'] = 14.0, 7.0