# Using SageMaker ScriptProcessors and Estimators
### Preprocess data and train models

![](images/Processing-1.png)

[Source](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html)

### 0.1a Git clone the workshop repo in your SageMaker instance
You can run the command below in a Terminal inside Jupyter/JupyterLab, or in a Jupyter notebook cell (prefix it with `!`).

```bash
git clone https://github.com/ConcurDataScience/ConcurMLWorkshop.git
```

<div>
<img src="images/Terminal.png" width="400"/>
</div>

### 0.1b Upload/unzip sentiment-modularized.tar.gz to your SageMaker notebook instance
After uploading the sentiment-modularized.tar.gz file to your SageMaker notebook instance, open a Terminal in JupyterLab and extract the archive using the command below.

```bash
cd Sagemaker && tar -xvf sentiment-modularized.tar.gz
```

You should see the following after unzipping the tar file

<div>
<img src="images/unzipped-list.png" width="350"/>
</div>

### 0.2 Check if your bucket exists

In [1]:
BUCKET = 'dsml-chrisi-bucket'

In [None]:
!aws s3 ls s3://{BUCKET}

If your bucket does not exist, uncomment and run the cell below to create it.

In [None]:
#!aws s3 mb s3://{BUCKET}

### 0.3 Create the `workshop` and `athena_log` folders in your S3 bucket

In [32]:
!aws s3 cp resources/dummy.txt s3://{BUCKET}/workshop/dummy.txt
!aws s3 cp resources/dummy.txt s3://{BUCKET}/athena_log/dummy.txt

upload: resources/dummy.txt to s3://dsml-chrisi-bucket/workshop/dummy.txt
upload: resources/dummy.txt to s3://dsml-chrisi-bucket/athena_log/dummy.txt


### 1.0 Get data via PyAthena

In [33]:
!pip install pyathena



In [13]:
%%time
from pyathena import connect
import pandas as pd

conn = connect(s3_staging_dir=f"s3://{BUCKET}/athena_log/", 
               region_name="us-west-2")
df = pd.read_sql_query(""" SELECT * FROM "ml-workshop-db"."enriched_data" """, conn)
df.head()

CPU times: user 3.53 s, sys: 117 ms, total: 3.65 s
Wall time: 18.4 s


Unnamed: 0,id,dp_unique_key,target,text,updated_date,entity
0,2579,uq_id_1010,Irrelevant,I had to repaint another gun for Tiny Tina's c...,21-03-2022,Borderlands
1,12947,uq_id_10146,Neutral,Nvidia’s RTX 3080 is more exciting than PlaySt...,21-03-2022,Xbox(Xseries)
2,12959,uq_id_10217,Negative,the,21-03-2022,Xbox(Xseries)
3,12964,uq_id_10246,Positive,This price is simply incredible.,21-03-2022,Xbox(Xseries)
4,13007,uq_id_10501,Positive,Can't wait!,21-03-2022,Xbox(Xseries)


In [15]:
#import pandas as pd
#df.to_csv('resources/tweets_start.csv', index=False)

### 1.01 (OPTIONAL) If Athena above is not available...
...you can load the dataframe manually from a csv file

In [16]:
import pandas as pd
df = pd.read_csv('resources/tweets_start.csv')
df.head()

Unnamed: 0,id,dp_unique_key,target,text,updated_date,entity
0,2579,uq_id_1010,Irrelevant,I had to repaint another gun for Tiny Tina's c...,21-03-2022,Borderlands
1,12947,uq_id_10146,Neutral,Nvidia’s RTX 3080 is more exciting than PlaySt...,21-03-2022,Xbox(Xseries)
2,12959,uq_id_10217,Negative,the,21-03-2022,Xbox(Xseries)
3,12964,uq_id_10246,Positive,This price is simply incredible.,21-03-2022,Xbox(Xseries)
4,13007,uq_id_10501,Positive,Can't wait!,21-03-2022,Xbox(Xseries)


### 1.1 Load DataSet and rename columns

In [17]:
import numpy as np

# rename columns
df.columns = ['tweet_id', 'dp_unique_key', 'sentiment', 'tweet_text', 'updated_date', 'entity']

#Define the indexing for each possible label in a dictionary
class_to_index = {"Neutral":0, "Irrelevant":1, "Negative":2, "Positive": 3}

#Creates a reverse dictionary
index_to_class = dict((v,k) for k, v in class_to_index.items())

#Creates lambda functions, applying the appropriate dictionary
names_to_ids = lambda n: np.array([class_to_index.get(x) for x in n])
ids_to_names = lambda n: np.array([index_to_class.get(x) for x in n])

#Convert the "Sentiment" column into indexes
df["sentiment_index"] = names_to_ids(df["sentiment"])
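
The same label-to-index mapping can also be written with pandas' `Series.map`, which avoids the lambda helpers. The following is a minimal sketch on a made-up four-row frame (the `toy` DataFrame is hypothetical, not the workshop data); it also checks that the reverse dictionary recovers the original labels.

```python
import pandas as pd

class_to_index = {"Neutral": 0, "Irrelevant": 1, "Negative": 2, "Positive": 3}
index_to_class = {v: k for k, v in class_to_index.items()}

# Hypothetical toy data standing in for the tweets dataframe
toy = pd.DataFrame({"sentiment": ["Positive", "Neutral", "Negative", "Irrelevant"]})

# Series.map looks each label up in the dictionary; unknown labels become NaN
toy["sentiment_index"] = toy["sentiment"].map(class_to_index)

# Mapping back with the reverse dictionary should reproduce the labels
round_trip = toy["sentiment_index"].map(index_to_class)
print(toy)
print((round_trip == toy["sentiment"]).all())
```

One practical difference: a label missing from the dictionary shows up as `NaN` in the column, which makes unexpected classes easier to spot.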

### 1.2 Look at dataset

In [18]:
df.tail(5)

Unnamed: 0,tweet_id,dp_unique_key,sentiment,tweet_text,updated_date,entity,sentiment_index
74675,9436,uq_id_8371,Positive,I like this new,21-03-2022,Overwatch,3
74676,9475,uq_id_8601,Negative,killing ISIS and its @PlayOverwatch ‘s fault. ...,21-03-2022,Overwatch,2
74677,9544,uq_id_8994,Positive,I maybe went a little overboard with the fanta...,21-03-2022,Overwatch,3
74678,9596,uq_id_9285,Positive,Overwatch makes me die grrr,21-03-2022,Overwatch,3
74679,12905,uq_id_9900,Neutral,Xbox Series X graphics source code stolen and ...,21-03-2022,Xbox(Xseries),0


In [19]:
df.count()

tweet_id           74680
dp_unique_key      74680
sentiment          74680
tweet_text         74680
updated_date       74680
entity             74680
sentiment_index    74680
dtype: int64

### 1.3 Filter out rows where tweet_text is NULL

In [20]:
# keep only rows where the tweet text is present
df = df[df.tweet_text.notnull()].copy()
df['tweet_text'] = df['tweet_text'].astype(str)

### 1.4 Stratified sampling: for this session, restrict to 2000 records for each class

In [21]:
number_of_rows_each_class = 2000
dfs_list = []
for unique_sentiment in np.unique(df.sentiment):
    df_sentiment = df[df.sentiment == unique_sentiment].sample(n=number_of_rows_each_class, random_state = 42)
    dfs_list.append(df_sentiment)
df = pd.concat(dfs_list)
df = df.sample(frac=1, random_state = 42)
df = df.reset_index(drop = True)

In [22]:
df.sentiment.value_counts()

Positive      2000
Irrelevant    2000
Negative      2000
Neutral       2000
Name: sentiment, dtype: int64
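
The per-class loop above can also be expressed with `groupby(...).sample(...)`. This is a sketch only, assuming pandas 1.1 or newer (where `DataFrameGroupBy.sample` is available) and using a small made-up frame instead of the workshop data.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 5 Positive, 3 Negative, 4 Neutral rows
toy = pd.DataFrame({
    "sentiment": ["Positive"] * 5 + ["Negative"] * 3 + ["Neutral"] * 4,
    "tweet_text": [f"tweet {i}" for i in range(12)],
})

rows_per_class = 2

balanced = (
    toy.groupby("sentiment")
       .sample(n=rows_per_class, random_state=42)  # draw n rows from every class
       .sample(frac=1, random_state=42)            # shuffle the combined rows
       .reset_index(drop=True)
)

print(balanced["sentiment"].value_counts())
```

As with the loop, `n` must not exceed the size of the smallest class, otherwise sampling without replacement raises an error.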

### 2.0 Prepare dataset for modelling 

```python
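import re

import nltk
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm

tqdm.pandas()  # registers progress_apply on pandas objects

# assumes the 'stopwords' and 'wordnet' corpora were fetched with nltk.download()
stopwords = set(nltk.corpus.stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_tweet(tweet_text):
    # keep letters only, lowercase, and split into words
    tweet_text = re.sub('[^a-zA-Z]', ' ', tweet_text)
    tweet_text = tweet_text.lower()
    tweet_text = tweet_text.split()
    # drop stopwords and single-character tokens, then lemmatize
    tweet_text = [lemmatizer.lemmatize(word) for word in tweet_text if word not in stopwords and len(word) > 1]
    tweet_text = ' '.join(tweet_text)
    return tweet_text

df['tweet_text_preprocessed'] = df['tweet_text'].progress_apply(preprocess_tweet)
# keep only tweets with more than one word left after preprocessing
df = df[df.tweet_text_preprocessed.apply(lambda text: len(text.split())>1)]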
```
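
To see what each cleaning step does without installing NLTK, here is a dependency-free sketch of the same idea. It is a simplification: the stopword set is a tiny hypothetical list rather than NLTK's English list, lemmatization is left out, and the function name `preprocess_tweet_simple` is made up for this illustration.

```python
import re

# Hypothetical mini stopword list; the workshop uses NLTK's full English list
STOPWORDS = {"i", "to", "the", "for", "is", "a", "this", "had"}

def preprocess_tweet_simple(tweet_text):
    # Replace every non-letter character (digits, punctuation, emoji) with a space
    letters_only = re.sub("[^a-zA-Z]", " ", tweet_text)
    # Lowercase and split on whitespace
    words = letters_only.lower().split()
    # Drop stopwords and single-character leftovers such as the "t" from "Can't"
    kept = [w for w in words if w not in STOPWORDS and len(w) > 1]
    return " ".join(kept)

print(preprocess_tweet_simple("This price is simply incredible."))
print(preprocess_tweet_simple("Can't wait for the RTX 3080!"))
```

Note that digits are removed along with punctuation, so a model number like "3080" disappears from the text entirely.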

### 2.1 Save dataframe to csv

In [23]:
df.to_csv('tweets.csv', index=False)

### 2.2 Write code above into script

In [36]:
%%writefile process_tweets.py
import os
import pandas as pd
import re
import subprocess
import sys

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    
install('nltk')
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
stopwords = set(nltk.corpus.stopwords.words('english'))  # set for fast membership tests


def preprocess_tweet(tweet_text):
    tweet_text = re.sub('[^a-zA-Z]', ' ', tweet_text)
    tweet_text = tweet_text.lower()
    tweet_text = tweet_text.split()
    tweet_text = [lemmatizer.lemmatize(word) for word in tweet_text if word not in stopwords]
    tweet_text = ' '.join(tweet_text)
    return tweet_text


def main(input_file, output_file):
    df = pd.read_csv(input_file)
    df['tweet_text_preprocessed'] = df.apply(lambda x: preprocess_tweet(x['tweet_text']), axis=1)
    df = df[df.tweet_text_preprocessed.apply(lambda text: len(text.split())>1)]
    df.to_csv(output_file, index=False)

    
if __name__ == "__main__":

    #input_file = os.path.join('/opt/ml/processing/input', 'tweets.csv')
    #output_file = os.path.join('/opt/ml/processing/output', 'tweets_processed.csv')
    input_file = os.path.join('.', 'tweets.csv')
    output_file = os.path.join('.', 'tweets_processed.csv')
    
    main(input_file, output_file)

Overwriting process_tweets.py


### 2.3 Test script

In [25]:
!python process_tweets.py

[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/ec2-user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [31]:
import pandas as pd

pd.read_csv('tweets_processed.csv')[['tweet_text','tweet_text_preprocessed', 'sentiment']].head(15)

Unnamed: 0,tweet_text,tweet_text_preprocessed,sentiment
0,I support @ whichuk's campaign to end price go...,support whichuk campaign end price gouging als...,Negative
1,"In multiplayer mode, I can't see any player be...",multiplayer mode see player similar color cost...,Negative
2,My family bad..,family bad,Irrelevant
3,. It'r s Snow Much My Fun! Enter me to finally...,r snow much fun enter finally win amazon store...,Neutral
4,I thought going after workers for safety whist...,thought going worker safety whistleblowing pro...,Negative
5,I did this trick couple months ago hahahaha. I...,trick couple month ago hahahaha dumb,Irrelevant
6,One of the best tactical valorant players in M...,one best tactical valorant player middle east ...,Irrelevant
7,Microsoft: ‘carbon-negative’ by 2030 even for ...,microsoft carbon negative even supply chain an...,Neutral
8,"In multiplayer play, I can't see any player be...",multiplayer play see player similar color cost...,Negative
9,There are only 2 days left until the start of ...,day left start psl elisa viihde pubg autumn ch...,Irrelevant


### 2.4 Initialize SageMaker ScriptProcessor (SKLearnProcessor)

![](images/Processing-1.png)

[Source](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html)

References:
- https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/
- https://docs.aws.amazon.com/sagemaker/latest/dg/use-scikit-learn-processing-container.html

In [33]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

region = boto3.session.Session().region_name

role = get_execution_role()
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

### 2.6 Copy tweets.csv to S3

In [28]:
!aws s3 cp tweets.csv s3://{BUCKET}/workshop/tweets.csv

upload: ./tweets.csv to s3://dsml-chrisi-bucket/workshop/tweets.csv


### 2.7 Run ScriptProcessor
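IMPORTANT: Before running the job, edit `process_tweets.py` so that `input_file` and `output_file` use the `/opt/ml/processing/...` paths (uncomment those two lines and comment out the two local `.` paths), then re-run the `%%writefile` cell in 2.2.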
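
An alternative to switching the paths by hand is to let the script take its folders as command-line arguments, with the container paths as defaults. The sketch below illustrates the idea; the flag names `--input-dir` and `--output-dir` and the helper `parse_paths` are hypothetical and not part of the workshop script.

```python
import argparse
import os

def parse_paths(argv=None):
    parser = argparse.ArgumentParser()
    # Defaults match the ProcessingInput/ProcessingOutput paths used in this section
    parser.add_argument("--input-dir", default="/opt/ml/processing/input")
    parser.add_argument("--output-dir", default="/opt/ml/processing/output")
    args = parser.parse_args(argv)
    input_file = os.path.join(args.input_dir, "tweets.csv")
    output_file = os.path.join(args.output_dir, "tweets_processed.csv")
    return input_file, output_file

# Inside the processing container: no arguments, so the container paths are used
print(parse_paths([]))
# Local test run: point both directories at the current folder
print(parse_paths(["--input-dir", ".", "--output-dir", "."]))
```

With this layout, a local test would be `python process_tweets.py --input-dir . --output-dir .`, while the processing job can run the script unchanged.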

In [43]:
%%time
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(code='process_tweets.py',
    inputs=[
        ProcessingInput(
            source=f's3://{BUCKET}/workshop/tweets.csv',
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='tweet_output',
            source='/opt/ml/processing/output',
            destination=f's3://{BUCKET}/workshop'
        )
    ]
)


Job Name:  sagemaker-scikit-learn-2022-04-03-02-32-46-236
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://dsml-chrisi-bucket/workshop/tweets.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-995383923238/sagemaker-scikit-learn-2022-04-03-02-32-46-236/input/code/process_tweets.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'tweet_output', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://dsml-chrisi-bucket/workshop', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
..........................Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)

### 2.8 Open a new browser tab and check your running job

https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/processing-jobs

<div>
<img src="images/processing-job.png" width="500"/>
</div>

### 2.9 View output in S3 bucket
You should see a new file named `tweets_processed.csv`

In [44]:
!aws s3 ls s3://{BUCKET}/workshop/

2022-04-03 01:53:53          4 dummy.txt
2022-04-03 02:16:43    1315064 tweets.csv
2022-04-03 02:37:03    1856856 tweets_processed.csv


### RECAP #1
1. Saved dataframe to a csv file (2.1) and copied it to S3 (2.6)
2. Transferred notebook code into Python script file (2.2)
3. Used SageMaker ScriptProcessor to run our script in a container and process the files to/from S3 (2.7)

### 3.0 Split into train and test

```python
from sklearn.model_selection import train_test_split

# hold out 5% of the rows for testing
train_df, test_df = train_test_split(df, test_size=0.05, random_state=42)

train_df.reset_index(drop = True, inplace = True)
test_df.reset_index(drop = True, inplace = True)

X_train = train_df['tweet_text_preprocessed']
y_train = train_df['sentiment']

X_test = test_df['tweet_text_preprocessed']
y_test = test_df['sentiment']
```

### 3.1 Keras tokenization word embedding model

```python
import io
import json
import os

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

max_words = 5000
max_len=50

keras_tokenizer = Tokenizer(num_words=max_words, lower=True, split=' ')


def tokenize_pad_sequences(tweet_text):
    '''
    This function tokenizes the input text into sequences of integers and then
    pads each sequence to the same length
    '''
    tweet_text = keras_tokenizer.texts_to_sequences(tweet_text)
    # Pad sequences to the same length
    tweet_text = pad_sequences(tweet_text, padding='post', maxlen=max_len)
    # return sequences
    return tweet_text

# fit the vocabulary on the training tweets only
keras_tokenizer.fit_on_texts(train_df['tweet_text_preprocessed'])
train_texts_to_sequences = keras_tokenizer.texts_to_sequences(train_df['tweet_text_preprocessed'])
train_texts_to_sequences = pad_sequences(train_texts_to_sequences, padding='post', maxlen=max_len)

train_df['tweet_keras_tokenized'] = list(train_texts_to_sequences)


test_texts_to_sequences = keras_tokenizer.texts_to_sequences(test_df['tweet_text_preprocessed'])
test_texts_to_sequences = pad_sequences(test_texts_to_sequences, padding='post', maxlen=max_len)

test_df['tweet_keras_tokenized'] = list(test_texts_to_sequences)


# saving
# with open('keras_tokenizer.pickle', 'wb') as handle:
#     pickle.dump(keras_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# assumption: save under the current working directory
current_path = os.getcwd()
os.makedirs(f'{current_path}/keras_model_files', exist_ok=True)

# to_json() already returns a JSON string; json.dumps() wraps it once more,
# so read it back with json.load() before calling tokenizer_from_json()
tokenizer_json =  keras_tokenizer.to_json()
with io.open(f'{current_path}/keras_model_files/keras_tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))
```
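
To make the tokenize-and-pad step less of a black box, here is a pure-Python approximation of what `fit_on_texts`, `texts_to_sequences` and `pad_sequences(padding='post')` do. It is a simplified sketch, not the Keras implementation: it splits on whitespace only (Keras also filters punctuation by default), and the helper names `fit_word_index` and `texts_to_padded` are made up for this illustration.

```python
from collections import Counter

def fit_word_index(texts, num_words):
    # Count every word, then give the most frequent word index 1, the next 2, ...
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    word_index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}
    # Mirror num_words: only indices below num_words are used when encoding
    return {w: i for w, i in word_index.items() if i < num_words}

def texts_to_padded(texts, word_index, max_len):
    padded = []
    for text in texts:
        # Words outside the vocabulary are skipped (no out-of-vocabulary token)
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-max_len:]  # too long: keep the last max_len tokens
        padded.append(seq + [0] * (max_len - len(seq)))  # too short: zeros at the end
    return padded

train_texts = ["good game good fun", "bad game"]
word_index = fit_word_index(train_texts, num_words=5000)
print(word_index)
print(texts_to_padded(["good bad unseen game"], word_index, max_len=6))
```

Index 0 is reserved for padding, which is why word indices start at 1. Words that were not seen during fitting, such as "unseen" here, are silently dropped, so the tokenizer should be fitted on the training split only and reused for the test split.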

### 3.2 Write above code into training script

In [45]:
%%writefile keras_train_model.py
import os
import pandas as pd
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow.keras.models import save_model
#from keras.models import save_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Bidirectional, LSTM
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import tokenizer_from_json

max_words = 5000
max_len=50
keras_tokenizer = Tokenizer(num_words=max_words, lower=True, split=' ')


def train(input_file, output_folder):
    df = pd.read_csv(input_file)
    
    train_df, test_df = train_test_split(df,test_size = 0.05, random_state =42)

    train_df.reset_index(drop = True, inplace = True)
    test_df.reset_index(drop = True, inplace = True)

    X_train = train_df['tweet_text_preprocessed']
    y_train = train_df['sentiment']

    X_test = test_df['tweet_text_preprocessed']
    y_test = test_df['sentiment']

    keras_tokenizer.fit_on_texts(train_df['tweet_text_preprocessed'])
    train_texts_to_sequences = keras_tokenizer.texts_to_sequences(train_df['tweet_text_preprocessed'])
    train_texts_to_sequences = pad_sequences(train_texts_to_sequences, padding='post', maxlen=max_len)
    train_df['tweet_keras_tokenized'] = list(train_texts_to_sequences)

    test_texts_to_sequences = keras_tokenizer.texts_to_sequences(test_df['tweet_text_preprocessed'])
    test_texts_to_sequences = pad_sequences(test_texts_to_sequences, padding='post', maxlen=max_len)
    test_df['tweet_keras_tokenized'] = list(test_texts_to_sequences)
    
    keras_model = Sequential()
    embedding_vector_size = 16
    lstm_units = 20
    keras_model.add(Embedding(max_words,embedding_vector_size,input_length=max_len))
    #keras_model.add(Bidirectional(LSTM(20, return_sequences=True)))
    keras_model.add(Bidirectional(LSTM(lstm_units)))
    keras_model.add(Dense(4, activation='softmax'))
    keras_model.compile(
         loss='sparse_categorical_crossentropy',
         optimizer='adam',
         metrics=['accuracy'])
    
    X_train_keras = train_texts_to_sequences
    y_train_keras = train_df['sentiment_index']

    X_test_keras = test_texts_to_sequences
    y_test_keras = test_df['sentiment_index']

    keras_model.fit(
         X_train_keras, y_train_keras,
         validation_data=(X_test_keras, y_test_keras),
         epochs=1)
    
    save_model(keras_model, output_folder.rstrip('/') + '/', save_format='tf')

    print('DONE')
    #tokenizer_json =  keras_tokenizer.to_json()
    #with io.open(f'{current_path}/keras_model_files/keras_tokenizer.json', 'w', encoding='utf-8') as f:
    #    f.write(json.dumps(tokenizer_json, ensure_ascii=False))
    


if __name__ == "__main__":

    input_file = os.path.join(os.environ.get('SM_CHANNEL_TRAIN'), 'tweets_processed.csv')
    output_folder = os.environ.get('SM_MODEL_DIR')

    #input_file = os.path.join('.', 'tweets_processed.csv')
    #output_folder = '.'
    
    print('Input', input_file)
    print('Output', output_folder)
       
    train(input_file, output_folder)

Writing keras_train_model.py
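
The training script finds its data and output folder through environment variables that SageMaker sets inside the training container: `SM_CHANNEL_TRAIN` for the `train` channel passed to `fit()`, and `SM_MODEL_DIR` for the folder that is packaged into `model.tar.gz`. The sketch below shows one way to read them with a local fallback, so the same script could also run outside SageMaker. The helper `resolve_training_paths` and the `.` fallbacks are assumptions for illustration, not part of the workshop script.

```python
import os

def resolve_training_paths(environ=os.environ):
    # Fall back to the current folder when the variables are not set,
    # which is the case when the script runs outside a training job
    train_dir = environ.get("SM_CHANNEL_TRAIN", ".")
    model_dir = environ.get("SM_MODEL_DIR", ".")
    input_file = os.path.join(train_dir, "tweets_processed.csv")
    return input_file, model_dir

# Simulate the container environment with a plain dictionary
container_env = {
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_MODEL_DIR": "/opt/ml/model",
}
print(resolve_training_paths(container_env))
print(resolve_training_paths({}))
```

Without a fallback, `os.environ.get('SM_CHANNEL_TRAIN')` returns `None` outside a training job and building the input path fails with a `TypeError`.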


### 3.3 Initialize SageMaker Estimator

https://docs.aws.amazon.com/sagemaker/latest/dg/frameworks.html

Let's compare with (2.4)

In [47]:
import os
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

DL Container images - https://github.com/aws/deep-learning-containers/blob/master/available_images.md

<div>
<img src="images/dl-containers.png" width="600"/>
</div>

In [48]:
estimator = TensorFlow(
                 entry_point='keras_train_model.py',
                 instance_type='ml.p3.2xlarge', #'ml.p3.8xlarge', 'local'
                 instance_count=1,
                 source_dir='.',
                 role=role,
                 framework_version='2.3.2',
                 py_version='py37',
                 output_path=f's3://{BUCKET}/workshop',
                 hyperparameters={
                     #'embedding': True,
                     #'modelstart': 1,
                     #'batch-size': 64,
                     #'modelfinish': 5
                 },
                 #script_mode=True,
                 #dependencies=dependencies,
                 #image_uri=<image_uri>,
)

### 3.4 Run training!

In [49]:
%%time
estimator.fit({'train': f's3://{BUCKET}/workshop/'})

2022-04-03 02:41:26 Starting - Starting the training job...
2022-04-03 02:41:52 Starting - Preparing the instances for training
ProfilerReport-1648953686: InProgress
.........
2022-04-03 02:43:21 Downloading - Downloading input data
2022-04-03 02:43:21 Training - Downloading the training image............
2022-04-03 02:45:26 Training - Training image download completed. Training in progress..
2022-04-03 02:45:29.562008: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2022-04-03 02:45:29.565735: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2022-04-03 02:45:29.757137: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2022-04-03 02:45:29.838569: W tensorflow/core/profiler/internal/smprofiler_timeline.

### 3.5 Open a new browser tab and check your training job
https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs

<div>
<img src="images/training-job.png" width="500"/>
</div>

### 3.6 View output in S3 bucket

In [50]:
!aws s3 ls s3://{BUCKET}/workshop/

                           PRE tensorflow-training-2022-04-03-02-41-22-638/
2022-04-03 01:53:53          4 dummy.txt
2022-04-03 02:16:43    1315064 tweets.csv
2022-04-03 02:37:03    1856856 tweets_processed.csv


In [52]:
!aws s3 ls s3://{BUCKET}/workshop/tensorflow-training-2022-04-03-02-41-22-638/output/

2022-04-03 02:46:15    1215049 model.tar.gz


### 3.7 Copy model.tar.gz to local

In [54]:
!aws s3 cp s3://{BUCKET}/workshop/tensorflow-training-2022-04-03-02-41-22-638/output/model.tar.gz ./model.tar.gz

download: s3://dsml-chrisi-bucket/workshop/tensorflow-training-2022-04-03-02-41-22-638/output/model.tar.gz to ./model.tar.gz


In [55]:
!mkdir keras_model_files
!tar -xvf model.tar.gz --directory ./keras_model_files

saved_model.pb
assets/
variables/
variables/variables.index
variables/variables.data-00000-of-00001
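
The `mkdir` and `tar` commands above can also be run from Python with the standard library's `tarfile` module. The sketch below is self-contained: it builds a tiny stand-in archive in a temporary folder rather than using the real `model.tar.gz`, and the helper name `extract_model` is hypothetical.

```python
import os
import tarfile
import tempfile

def extract_model(archive_path, target_dir):
    # Create the target folder if needed, then unpack the archive into it
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=target_dir)
    return sorted(os.listdir(target_dir))

# Self-contained demo: build a tiny stand-in archive instead of a real model
workdir = tempfile.mkdtemp()
dummy_file = os.path.join(workdir, "saved_model.pb")
with open(dummy_file, "wb") as f:
    f.write(b"placeholder")
archive_path = os.path.join(workdir, "model.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(dummy_file, arcname="saved_model.pb")

extracted = extract_model(archive_path, os.path.join(workdir, "keras_model_files"))
print(extracted)
```

Because `os.makedirs(..., exist_ok=True)` does not fail when the folder already exists, the helper can be re-run safely. As with any archive, only extract files from a source you trust.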


### RECAP #2
1. Transferred notebook code into Python training script file (3.2)
2. Used SageMaker Estimator to run our training script in a container, reading the input data from S3 and writing the trained model back to S3 (3.3 - 3.4)
3. Copied back the Estimator output from S3 to local (3.7)

### What next?
- [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#)
- [SageMaker notebook examples](https://github.com/aws/amazon-sagemaker-examples) - (accessible via the SageMaker notebook extension on the lower left of this JupyterLab)
- [Data Science on AWS (Book)](https://www.oreilly.com/library/view/data-science-on/9781492079385/) 
- Non-AWS
  - [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/overview/quickstart/)
  - [MLFlow](https://mlflow.org/)
  - [Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-hello-world)
  - Tensorflow Extended
- Moar AWS

![](images/AWLMLStack.png)

### (Example) SageMaker Pipelines
https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html

<div>
<img src="images/pipeline-full.png" width="350"/>
</div>

```python
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=<role>,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

step_process = ProcessingStep(
    name="AbaloneProcess",
    processor=sklearn_processor,
    inputs=[
      ProcessingInput(source=<input_data>, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test")
    ],
    code="abalone/preprocessing.py"
)
```

(Compare with 2.7)

### END OF CLASS
===================

### Create html from notebook

In [58]:
!jupyter nbconvert --to html sentiment-classification-twitter-modularize.ipynb

[NbConvertApp] Converting notebook sentiment-classification-twitter-modularize.ipynb to html
[NbConvertApp] Writing 717899 bytes to sentiment-classification-twitter-modularize.html


In [31]:
%%writefile .gitignore
.ipynb_checkpoints

Writing .gitignore


### Zip up files

In [59]:
!rm sentiment-modularized.tar.gz

rm: cannot remove ‘sentiment-modularized.tar.gz’: No such file or directory


In [60]:
!tar --exclude=".ipynb_checkpoints*" -zcvf sentiment-modularized.tar.gz .

./
./delete_resources.py
./.gitignore
./README.md
./resources/
./resources/tweets_start.csv
./resources/process_tweets.py
./resources/keras_train_model.py
./resources/tweets.csv
./resources/tweets_processed.csv
./resources/dummy.txt
./sentiment-classification-twitter-modularize.html
./images/
./images/Processing-1.png
./images/dl-containers.png
./images/training-job.png
./images/delete-bucket-01.png
./images/stop-notebook.png
./images/pipeline-full.png
./images/unzipped-list.png
./images/Terminal.png
./images/processing-job.png
./images/AWLMLStack.png
./sentiment-classification-twitter-modularize.ipynb
tar: .: file changed as we read it


### Delete stuff

In [None]:
#!rm -rf README.md images resources delete_resources.py sentiment-classification-twitter-modularize.html sentiment-modularized.tar.gz

In [57]:
!rm -rf keras_model_files resources/keras_model_files