# Using SageMaker ScriptProcessor and Estimator
### Preprocess data and train models

![](images/Processing-1.png)

[Source](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html)

### 1.0 Get data via PyAthena

In [1]:
!pip install pyathena

Collecting pyathena
  Downloading PyAthena-2.3.2-py3-none-any.whl (37 kB)
Installing collected packages: pyathena
Successfully installed pyathena-2.3.2


In [2]:
from pyathena import connect
import pandas as pd

conn = connect(s3_staging_dir="s3://aws-athena-query-results-us-west-2-995383923238/",
               region_name="us-west-2")
df = pd.read_sql_query(""" SELECT * FROM "ml-workshop-db"."enriched_data" """, conn)
print(df.head())

      id dp_unique_key      target  \
0  12934   uq_id_10078     Neutral   
1  12945   uq_id_10136     Neutral   
2  12983   uq_id_10356  Irrelevant   
3  12999   uq_id_10456    Positive   
4  13009   uq_id_10514    Negative   

                                                text updated_date  \
0  Xbox Edition Series Super X WINS Up Again! | a...   21-03-2022   
1  Price delay for PS5 and Xbox Series X: more BA...   21-03-2022   
2  If you haven't gotten Game Pass yet, you serio...   21-03-2022   
3  _.. The faster to more energy efficient AMD Ze...   21-03-2022   
4  @ IdleSloth1984, what the hell do you mean? Xb...   21-03-2022   

          entity  
0  Xbox(Xseries)  
1  Xbox(Xseries)  
2  Xbox(Xseries)  
3  Xbox(Xseries)  
4  Xbox(Xseries)  


### 1.1 Load DataSet and rename columns

In [5]:
import numpy as np

# rename columns
df.columns = ['tweet_id', 'dp_unique_key', 'sentiment', 'tweet_text', 'updated_date', 'entity']

#Define the indexing for each possible label in a dictionary
class_to_index = {"Neutral":0, "Irrelevant":1, "Negative":2, "Positive": 3}

#Creates a reverse dictionary
index_to_class = dict((v,k) for k, v in class_to_index.items())

#Creates lambda functions, applying the appropriate dictionary
names_to_ids = lambda n: np.array([class_to_index.get(x) for x in n])
ids_to_names = lambda n: np.array([index_to_class.get(x) for x in n])

#Convert the "Sentiment" column into indexes
df["sentiment_index"] = names_to_ids(df["sentiment"])
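
As a quick sanity check on the mapping above, the following sketch round-trips a few labels through both dictionaries. It uses plain Python lists instead of NumPy arrays, and the sample labels are made up for illustration rather than taken from the dataset.

```python
# Same dictionaries as in the cell above.
class_to_index = {"Neutral": 0, "Irrelevant": 1, "Negative": 2, "Positive": 3}
index_to_class = {v: k for k, v in class_to_index.items()}

# Hypothetical sample labels, not taken from the dataset.
labels = ["Positive", "Neutral", "Negative", "Irrelevant"]

# Label -> index, then index -> label again.
ids = [class_to_index[name] for name in labels]
names = [index_to_class[i] for i in ids]

print(ids)              # [3, 0, 2, 1]
print(names == labels)  # True
```

Note that the notebook's lambdas use `.get()`, which returns `None` for an unknown label, whereas plain indexing as in this sketch raises a `KeyError` and so surfaces unexpected labels earlier.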

### 1.2 Look at dataset

In [6]:
df.tail(5)

| | tweet_id | dp_unique_key | sentiment | tweet_text | updated_date | entity | sentiment_index |
|---|---|---|---|---|---|---|---|
| 74675 | 9589 | uq_id_9244 | Neutral | The fuck it. we doin overwatch twitch.tv/Nikkaela | 21-03-2022 | Overwatch | 0 |
| 74676 | 9599 | uq_id_9300 | Irrelevant | My gorgeous and hilarious girlfriend is stream... | 21-03-2022 | Overwatch | 1 |
| 74677 | 12817 | uq_id_9411 | Negative | Did n just have really high expectations for the | 21-03-2022 | Xbox(Xseries) | 2 |
| 74678 | 12826 | uq_id_9461 | Positive | i want gold. | 21-03-2022 | Xbox(Xseries) | 3 |
| 74679 | 2417 | uq_id_99 | Negative | Grounded almost looked pretty cool here despit... | 21-03-2022 | Borderlands | 2 |


In [7]:
df.sentiment.value_counts()

Negative      22542
Positive      20830
Neutral       18318
Irrelevant    12990
Name: sentiment, dtype: int64

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74680 entries, 0 to 74679
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   tweet_id         74680 non-null  int64 
 1   dp_unique_key    74680 non-null  object
 2   sentiment        74680 non-null  object
 3   tweet_text       74680 non-null  object
 4   updated_date     74680 non-null  object
 5   entity           74680 non-null  object
 6   sentiment_index  74680 non-null  int64 
dtypes: int64(2), object(5)
memory usage: 4.0+ MB


### 1.3 Filter out rows where tweet_text is NULL

In [10]:
# keep only rows where the tweet text is present
df = df[df.tweet_text.notnull()].copy()
df['tweet_text'] = df['tweet_text'].astype(str)

### 1.4 Stratified Sampling - For this session, let's restrict to 2000 records for each class

In [12]:
number_of_rows_each_class = 2000
dfs_list = []
for unique_sentiment in np.unique(df.sentiment):
    df_sentiment = df[df.sentiment == unique_sentiment].sample(n=number_of_rows_each_class, random_state = 42)
    dfs_list.append(df_sentiment)
df = pd.concat(dfs_list)
df = df.sample(frac=1, random_state = 42)
df = df.reset_index(drop = True)

In [13]:
df.sentiment.value_counts()

Positive      2000
Irrelevant    2000
Negative      2000
Neutral       2000
Name: sentiment, dtype: int64

### (Skip wordcloud)

### 2.0 Prepare dataset for modelling 

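```python
import re

import nltk
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm

tqdm.pandas()  # registers progress_apply on pandas objects

# Assumes the 'stopwords' and 'wordnet' corpora are already downloaded (see 2.2)
stopwords = set(nltk.corpus.stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_tweet(tweet_text):
    # Keep letters only, lowercase, and split into words
    tweet_text = re.sub('[^a-zA-Z]', ' ', tweet_text)
    tweet_text = tweet_text.lower()
    tweet_text = tweet_text.split()
    # Drop stopwords and single-character tokens, then lemmatize
    tweet_text = [lemmatizer.lemmatize(word) for word in tweet_text if word not in stopwords and len(word) > 1]
    tweet_text = ' '.join(tweet_text)
    return tweet_text

df['tweet_text_preprocessed'] = df['tweet_text'].progress_apply(preprocess_tweet)
# Keep only tweets with more than one word left after preprocessing
df = df[df.tweet_text_preprocessed.apply(lambda text: len(text.split())>1)]
```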
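
To see what each cleaning step does without installing NLTK, here is a simplified sketch of the same pipeline. The tiny `STOPWORDS` set is a hypothetical stand-in for NLTK's English list, the name `clean_tweet` is made up for this example, and lemmatization is skipped entirely, so the output will not always match the real function.

```python
import re

# Hypothetical, heavily abbreviated stand-in for NLTK's English stopword list.
STOPWORDS = {"is", "when", "it", "s", "the", "a", "and"}

def clean_tweet(tweet_text):
    # 1. Replace every non-letter character with a space.
    text = re.sub(r"[^a-zA-Z]", " ", tweet_text)
    # 2. Lowercase and split on whitespace.
    words = text.lower().split()
    # 3. Drop stopwords and single-character tokens (no lemmatization here).
    words = [w for w in words if w not in STOPWORDS and len(w) > 1]
    return " ".join(words)

print(clean_tweet("Everything is better when it's black"))
# everything better black
```

The apostrophe in "it's" becomes a space in step 1, which is why a stray "s" token appears and has to be filtered out in step 3.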

### 2.1 Save dataframe to csv

In [18]:
df.to_csv('tweets.csv', index=False)

### 2.2 Write code above into script

In [6]:
%%writefile process_tweets.py
import os
import pandas as pd
import re
import subprocess
import sys

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    
install('nltk')
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
# Build the stopword set once instead of once per word
stopwords = set(nltk.corpus.stopwords.words('english'))


def preprocess_tweet(tweet_text):
    # Note: unlike the notebook version in 2.0, single-character tokens are kept here
    tweet_text = re.sub('[^a-zA-Z]', ' ', tweet_text)
    tweet_text = tweet_text.lower()
    tweet_text = tweet_text.split()
    tweet_text = [lemmatizer.lemmatize(word) for word in tweet_text if word not in stopwords]
    tweet_text = ' '.join(tweet_text)
    return tweet_text


def main(input_file, output_file):
    df = pd.read_csv(input_file)
    df['tweet_text_preprocessed'] = df.apply(lambda x: preprocess_tweet(x['tweet_text']), axis=1)
    df = df[df.tweet_text_preprocessed.apply(lambda text: len(text.split())>1)]
    df.to_csv(output_file, index=False)

    
if __name__ == "__main__":

    input_file = os.path.join('/opt/ml/processing/input', 'tweets.csv')
    output_file = os.path.join('/opt/ml/processing/output', 'tweets_processed.csv')
    #input_file = os.path.join('.', 'tweets.csv')
    #output_file = os.path.join('.', 'tweets_processed.csv')
    
    main(input_file, output_file)

Overwriting process_tweets.py
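
The script above switches between container paths and local paths by commenting lines in and out. A possible alternative, sketched below, is to detect the container at runtime. The helper name `resolve_paths` and the rule "use the container paths if the input directory exists" are assumptions made for this sketch, not part of the original script or of the SageMaker SDK.

```python
import os

def resolve_paths(container_root="/opt/ml/processing"):
    """Return (input_file, output_file) for a container run or a local run."""
    input_dir = os.path.join(container_root, "input")
    output_dir = os.path.join(container_root, "output")
    if os.path.isdir(input_dir):
        # Inside a processing container: use the mounted directories.
        return (os.path.join(input_dir, "tweets.csv"),
                os.path.join(output_dir, "tweets_processed.csv"))
    # Local run: fall back to the current directory.
    return (os.path.join(".", "tweets.csv"),
            os.path.join(".", "tweets_processed.csv"))

input_file, output_file = resolve_paths()
print("Input", input_file)
print("Output", output_file)
```

With a helper like this, the same file could be used for the local test and for the processing job without edits in between.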


### 2.3 Test script

In [20]:
!python process_tweets.py

[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/ec2-user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
import pandas as pd

pd.read_csv('tweets_processed.csv')[['tweet_text','tweet_text_preprocessed', 'sentiment']].head(30)

| | tweet_text | tweet_text_preprocessed | sentiment |
|---|---|---|---|
| 0 | I’m just gonna say it - Overwatch game being n... | gonna say overwatch game nominated lgbtq game ... | Negative |
| 1 | A company that has the ability to design and p... | company ability design produce top line graphi... | Negative |
| 2 | Everything is better when it's black | everything better black | Irrelevant |
| 3 | RIP Battlefield V | rip battlefield v | Negative |
| 4 | Borderlands 3 Chapter 1 youtu.be / 0SKu6Vr4iXU... | borderland chapter youtu sku vr ixu via youtub... | Neutral |
| 5 | johnson & johnson about a purchase COVID but s... | johnson johnson purchase covid still fix | Negative |
| 6 | The World<unk> Warcraft category on play is so... | world unk warcraft category play sooo congeste... | Irrelevant |
| 7 | @joem135... You know what's bloody stupid? Som... | joem know bloody stupid someone stupid profit ... | Irrelevant |
| 8 | from me is slightly disappointed being my sche... | slightly disappointed schedule explicitly appr... | Neutral |
| 9 | @ Xfinity The Call of Duty: Black Ops Cold War... | xfinity call duty black ops cold war open beta... | Negative |


### 2.4 Initialize SageMaker ScriptProcessor (SKLearnProcessor)

In [23]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

region = boto3.session.Session().region_name

role = get_execution_role()
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

### 2.5 (TODO) Create S3 bucket

In [24]:
# Do it
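
One way to fill in this step is with boto3, sketched below. The bucket name `my-workshop-bucket` is a placeholder and the helper names are made up for this example. The sketch assumes your credentials are allowed to create buckets. Only the argument-building helper runs here; `create_bucket` has to be called explicitly.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for the S3 CreateBucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a
    # LocationConstraint; every other region needs it spelled out.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs

def create_bucket(bucket_name, region):
    # Imported here so the helper above can be inspected without boto3 installed.
    import boto3
    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket_name, region))

# Placeholder name: S3 bucket names are globally unique, so pick your own.
print(create_bucket_kwargs("my-workshop-bucket", "us-west-2"))
```

Calling `create_bucket("my-workshop-bucket", "us-west-2")` would then issue the actual request. The rest of this notebook uses `dsml-chrisi-bucket`, so substitute your own bucket name in the later cells.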

### 2.6 Copy tweets.csv to S3

In [25]:
!aws s3 cp tweets.csv s3://dsml-chrisi-bucket/workshop/tweets.csv

upload: ./tweets.csv to s3://dsml-chrisi-bucket/workshop/tweets.csv


### 2.7 Run ScriptProcessor
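IMPORTANT: Before running, make sure the `/opt/ml/processing/...` paths are active in `process_tweets.py` and the local `.` paths are commented out (the reverse of what the local test in 2.3 needs), then re-run the `%%writefile` cell in 2.2.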

In [26]:
%%time
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(code='process_tweets.py',
    inputs=[
        ProcessingInput(
            source='s3://dsml-chrisi-bucket/workshop/tweets.csv',
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='tweet_output',
            source='/opt/ml/processing/output',
            destination='s3://dsml-chrisi-bucket/workshop'
        )
    ]
)


Job Name:  sagemaker-scikit-learn-2022-03-31-04-37-27-621
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://dsml-chrisi-bucket/workshop/tweets.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-995383923238/sagemaker-scikit-learn-2022-03-31-04-37-27-621/input/code/process_tweets.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'tweet_output', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://dsml-chrisi-bucket/workshop', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
........................Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
[... output truncated ...]

### 2.8 Open a new browser tab and check your running job

https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/processing-jobs

### 2.9 View output in S3 bucket
You should see a new file named `tweets_processed.csv`

In [27]:
!aws s3 ls s3://dsml-chrisi-bucket/workshop/

2022-03-31 04:37:27    1299233 tweets.csv
2022-03-31 04:41:25    1830178 tweets_processed.csv


### RECAP #1
1. Saved the dataframe to a CSV file (2.1) and copied it to S3 (2.6)
2. Transferred notebook code into a Python script file (2.2)
3. Used SageMaker ScriptProcessor to run our script in a container and process the files to/from S3 (2.7)

### 3.0 Split into train and test

```python
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.05, random_state=42)

train_df.reset_index(drop = True, inplace = True)
test_df.reset_index(drop = True, inplace = True)

X_train = train_df['tweet_text_preprocessed']
y_train = train_df['sentiment']

X_test = test_df['tweet_text_preprocessed']
y_test = test_df['sentiment']
```

### 3.1 Keras tokenization word embedding model

```python
import io
import json

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

max_words = 5000
max_len=50

keras_tokenizer = Tokenizer(num_words=max_words, lower=True, split=' ')


def tokenize_pad_sequences(tweet_text):
    '''
    Tokenize the input texts into sequences of integers and then
    pad each sequence to the same length.
    '''
    tweet_text = keras_tokenizer.texts_to_sequences(tweet_text)
    # Pad sequences to the same length
    tweet_text = pad_sequences(tweet_text, padding='post', maxlen=max_len)
    # return sequences
    return tweet_text

keras_tokenizer.fit_on_texts(train_df['tweet_text_preprocessed'])
train_texts_to_sequences = keras_tokenizer.texts_to_sequences(train_df['tweet_text_preprocessed'])
train_texts_to_sequences = pad_sequences(train_texts_to_sequences, padding='post', maxlen=max_len)

train_df['tweet_keras_tokenized'] = list(train_texts_to_sequences)


test_texts_to_sequences = keras_tokenizer.texts_to_sequences(test_df['tweet_text_preprocessed'])
test_texts_to_sequences = pad_sequences(test_texts_to_sequences, padding='post', maxlen=max_len)

test_df['tweet_keras_tokenized'] = list(test_texts_to_sequences)


# saving
# with open('keras_tokenizer.pickle', 'wb') as handle:
#     pickle.dump(keras_tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

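# Save the fitted tokenizer as JSON (current_path must be defined beforehand,
# and the keras_model_files directory must already exist)
tokenizer_json = keras_tokenizer.to_json()
with io.open(f'{current_path}/keras_model_files/keras_tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))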
```
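
To make the tokenize-and-pad step less of a black box, here is a toy re-implementation in plain Python. It is a simplified sketch of the behavior, not the Keras code: the function names are made up, punctuation filtering is ignored, and the vocabulary cut-off is applied when the index is built rather than when texts are converted.

```python
from collections import Counter

def fit_word_index(texts, num_words):
    """Rank words by frequency; index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    # Keep only indices strictly below num_words, as the Keras Tokenizer does.
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def texts_to_padded(texts, word_index, max_len):
    padded = []
    for text in texts:
        # Words outside the vocabulary are silently dropped.
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-max_len:]                             # truncate from the front
        padded.append(seq + [0] * (max_len - len(seq)))  # pad at the end ("post")
    return padded

texts = ["game great game", "great bad game"]
word_index = fit_word_index(texts, num_words=3)
print(word_index)                                     # {'game': 1, 'great': 2}
print(texts_to_padded(texts, word_index, max_len=4))  # [[1, 2, 1, 0], [2, 1, 0, 0]]
```

With `num_words=3` only the two most frequent words survive, so "bad" disappears from the second sequence. In the notebook, `max_words = 5000` plays the same role for the tweet vocabulary.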

### 3.2 Write above code into training script

In [29]:
%%writefile keras_train_model.py
import os
import pandas as pd
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow.keras.models import save_model
#from keras.models import save_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Bidirectional, LSTM
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import tokenizer_from_json

max_words = 5000
max_len=50
keras_tokenizer = Tokenizer(num_words=max_words, lower=True, split=' ')


def train(input_file, output_folder):
    df = pd.read_csv(input_file)
    
    train_df, test_df = train_test_split(df,test_size = 0.05, random_state =42)

    train_df.reset_index(drop = True, inplace = True)
    test_df.reset_index(drop = True, inplace = True)

    X_train = train_df['tweet_text_preprocessed']
    y_train = train_df['sentiment']

    X_test = test_df['tweet_text_preprocessed']
    y_test = test_df['sentiment']

    keras_tokenizer.fit_on_texts(train_df['tweet_text_preprocessed'])
    train_texts_to_sequences = keras_tokenizer.texts_to_sequences(train_df['tweet_text_preprocessed'])
    train_texts_to_sequences = pad_sequences(train_texts_to_sequences, padding='post', maxlen=max_len)
    train_df['tweet_keras_tokenized'] = list(train_texts_to_sequences)

    test_texts_to_sequences = keras_tokenizer.texts_to_sequences(test_df['tweet_text_preprocessed'])
    test_texts_to_sequences = pad_sequences(test_texts_to_sequences, padding='post', maxlen=max_len)
    test_df['tweet_keras_tokenized'] = list(test_texts_to_sequences)
    
    keras_model = Sequential()
    embedding_vector_size = 16
    lstm_units = 20
    keras_model.add(Embedding(max_words,embedding_vector_size,input_length=max_len))
    #keras_model.add(Bidirectional(LSTM(20, return_sequences=True)))
    keras_model.add(Bidirectional(LSTM(lstm_units)))
    keras_model.add(Dense(4, activation='softmax'))
    keras_model.compile(
         loss='sparse_categorical_crossentropy',
         optimizer='adam',
         metrics=['accuracy'])
    
    X_train_keras = train_texts_to_sequences
    y_train_keras = train_df['sentiment_index']

    X_test_keras = test_texts_to_sequences
    y_test_keras = test_df['sentiment_index']

    keras_model.fit(
         X_train_keras, y_train_keras,
         validation_data=(X_test_keras, y_test_keras),
         epochs=1)
    
    save_model(keras_model, output_folder.rstrip('/') + '/', save_format='tf')

    print('DONE')
    #tokenizer_json =  keras_tokenizer.to_json()
    #with io.open(f'{current_path}/keras_model_files/keras_tokenizer.json', 'w', encoding='utf-8') as f:
    #    f.write(json.dumps(tokenizer_json, ensure_ascii=False))
    


if __name__ == "__main__":

    input_file = os.path.join(os.environ.get('SM_CHANNEL_TRAIN'), 'tweets_processed.csv')
    output_folder = os.environ.get('SM_MODEL_DIR')

    #input_file = os.path.join('.', 'tweets_processed.csv')
    #output_folder = '.'
    
    print('Input', input_file)
    print('Output', output_folder)
       
    train(input_file, output_folder)

Writing keras_train_model.py
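
The training script reads `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` directly, so it fails outside a training container unless lines are swapped by hand. A common pattern is to expose them as command-line arguments with the environment variables as defaults, sketched below. The argument names and the `"."` fallbacks are assumptions for this sketch. `parse_known_args` is used because the estimator may pass extra arguments (such as `--model_dir`) that the script does not declare.

```python
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # SageMaker sets these variables inside the training container;
    # the "." defaults are an assumption for local runs.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "."))
    parser.add_argument("--sm-model-dir", default=os.environ.get("SM_MODEL_DIR", "."))
    # Ignore any extra arguments instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args

args = parse_args([])
input_file = os.path.join(args.train, "tweets_processed.csv")
output_folder = args.sm_model_dir

print("Input", input_file)
print("Output", output_folder)
```

In a real script you would call `parse_args()` without arguments so that `sys.argv` is used, then pass `input_file` and `output_folder` to `train()` as before.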


### 3.3 Initialize SageMaker Estimator
Compare this with the processor setup in (2.4)

In [30]:
import os
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

In [31]:
estimator = TensorFlow(
                 entry_point='keras_train_model.py',
                 instance_type='ml.p3.2xlarge', #'ml.p3.8xlarge', 'local'
                 instance_count=1,
                 source_dir='.',
                 role=role,
                 framework_version='2.3.2',
                 py_version='py37',
                 output_path='s3://dsml-chrisi-bucket/workshop',
                 hyperparameters={
                     #'embedding': True,
                     #'modelstart': 1,
                     #'batch-size': 64,
                     #'modelfinish': 5
                 },
                 #script_mode=True,
                 #dependencies=dependencies,
                 #image_uri=<image_uri>,
)

### 3.4 Run training!

In [33]:
%%time
estimator.fit({
    'train': 's3://dsml-chrisi-bucket/workshop/',
}, job_name='tweets-chris04')

2022-03-31 05:07:07 Starting - Starting the training job...
2022-03-31 05:07:33 Starting - Preparing the instances for trainingProfilerReport-1648703227: InProgress
.........
2022-03-31 05:09:06 Downloading - Downloading input data
2022-03-31 05:09:06 Training - Downloading the training image...............
2022-03-31 05:11:31 Training - Training image download completed. Training in progress.
2022-03-31 05:11:27.096939: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2022-03-31 05:11:27.103660: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2022-03-31 05:11:27.348119: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2022-03-31 05:11:27.437188: W tensorflow/core/profiler/internal/smprofiler_timelin
[... output truncated ...]

### 3.5 Open a new browser tab and check your training job
https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/jobs

### 3.6 View output in S3 bucket

In [34]:
!aws s3 ls s3://dsml-chrisi-bucket/workshop/

                           PRE tweets-chris04/
2022-03-31 04:37:27    1299233 tweets.csv
2022-03-31 04:41:25    1830178 tweets_processed.csv


In [35]:
!aws s3 ls s3://dsml-chrisi-bucket/workshop/tweets-chris04/

                           PRE debug-output/
                           PRE output/
                           PRE profiler-output/
                           PRE rule-output/


In [36]:
!aws s3 ls s3://dsml-chrisi-bucket/workshop/tweets-chris04/output/

2022-03-31 05:12:17    1224576 model.tar.gz


### 3.7 Copy model.tar.gz to local

In [43]:
!aws s3 cp s3://dsml-chrisi-bucket/workshop/tweets-chris04/output/model.tar.gz ./model.tar.gz

download: s3://dsml-chrisi-bucket/workshop/tweets-chris04/output/model.tar.gz to ./model.tar.gz


In [45]:
!mkdir keras_model_files
!tar -xvf model.tar.gz --directory ./keras_model_files

assets/
saved_model.pb
variables/
variables/variables.index
variables/variables.data-00000-of-00001
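
The same extraction can be done from Python with the standard `tarfile` module, which is handy when the step has to run inside a script instead of a notebook cell. The helper name `extract_model` is made up for this sketch, and the call at the bottom only runs if `model.tar.gz` is present in the working directory.

```python
import os
import tarfile

def extract_model(archive_path, dest_dir):
    """Extract a gzipped model archive and return the names of its members."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names

# Only attempt the extraction if the archive from 3.7 has been downloaded.
if os.path.exists("model.tar.gz"):
    print(extract_model("model.tar.gz", "keras_model_files"))
```

The extracted directory is in TensorFlow SavedModel format, so `tensorflow.keras.models.load_model("keras_model_files")` should be able to load it, assuming a local TensorFlow version compatible with the one used for training.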


### RECAP #2
1. Transferred notebook code into a Python training script file (3.2)
2. Used SageMaker Estimator to run our training script in a container, reading the input from S3 and writing `model.tar.gz` back to S3 (3.3 - 3.4)
3. Copied the Estimator output from S3 back to local (3.7)

### What next?
- [SageMaker notebook examples](https://github.com/aws/amazon-sagemaker-examples) - (accessible via the SageMaker notebook extension on the lower left of this JupyterLab)
- [Data Science on AWS (Book)](https://www.oreilly.com/library/view/data-science-on/9781492079385/) 
- Non-AWS
  - [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/overview/quickstart/)
  - [MLflow](https://mlflow.org/)
  - [Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-1st-experiment-hello-world)
  - TensorFlow Extended (TFX)
- Moar AWS

![](images/AWLMLStack.png)