# Text Classification using SageMaker BlazingText

## Setup

Specify S3 Bucket and prefix used for training and model data 

In [3]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3

sess = sagemaker.Session()

role = get_execution_role()       # IAM role ARN 
print(role)  

region = boto3.session.Session().region_name
bucket = sess.default_bucket()    # Replace with your own bucket name if needed
print(bucket)
prefix = 'fModelData'             # Replace with the prefix under which you want to store the data if needed


arn:aws:iam::025730522839:role/service-role/AmazonSageMaker-ExecutionRole-20220817T161703
sagemaker-us-east-1-025730522839


## Data Preperation 

In [4]:
import numpy as np, pandas as pd, re, os

# First, we need to upload train and test data files to S3 bucket and prefix location 
train_data = 's3://{}/{}/{}'.format(bucket, prefix, 'train_data.txt')           # Training data path from s3 bucket 
test_data = 's3://{}/{}/{}'.format(bucket, prefix, 'test_data_solution.txt')    # Test data path from s3 bucket 

#Train Data Prep. ( We will be using only Title & Genre for our model ) 
df_train = pd.read_csv(train_data, sep=" :::", header=None, names=['Id', 'Title', 'Genre', 'Desc.'], engine="python")
df_train.drop(['Id', 'Desc.'], axis=1, inplace=True)

#Test Data Prep.
df_test = pd.read_csv(test_data, sep=" :::", header=None, names=['Id', 'Title', 'Genre', 'Desc.'], engine="python")
df_test.drop(['Id', 'Desc.'], axis=1, inplace=True)

## Data Preprocessing 

We need to preprocess the training data into space separated tokenized text format which can be consumed by **BlazingText** algorithm. Also, the class label(s) should be prefixed with __ __label__ __ and it should be present in the same line along with the original sentence.

In [5]:
#Function for Data Preprocessing
#Extract characters and numbers from strings and return with a prefix __label__ in the same line 
def Preprocess_Data(train, test):
        def process_title(title):
                """
                function that extracts characters and numbers
                from strings.

                input:
                        title: string value

                output:
                        title: cleaned string value
                """
                punc = '''!()-[]{};:'"\,<>./?@#$%^&*_“.|"'''

                for ele in punc:
                        if ele in title:
                                title = title.replace(ele, "")
                # strip away numbers and parenthesis
                title = (
                        title.replace("(", "")
                        .replace(")", "")
                        .replace("/", "")
                        .replace("_", "")
                        .replace("-", "")
                        .replace("&", "")
                        .replace(":", "")
                        .replace("@", "")
                )
                title = re.sub(r"\d+", "", title)
                title = title.replace("?", "")
                # strip away "part" word
                title = re.sub(r"[Pp]art", "", title)
                # strip II and III and IV
                title = title.replace("II", "").replace("III", "").replace("IV", "")
                title = title.strip()
                title = re.sub(" +", " ", title)

                return title

        train['Title'] = train.Title.apply(process_title)
        test['Title'] = test.Title.apply(process_title)

        train["Genre"] = train["Genre"].apply(process_title)
        test["Genre"] = test["Genre"].apply(process_title)
        
        train.iloc[:, 1] = train.iloc[:, 1].apply(lambda x: '__label__' + x)
        test.iloc[:, 1] = test.iloc[:, 1].apply(lambda x: '__label__' + x)

        return train, test


In [6]:
Preprocess_Data(df_train, df_test)

(                                   Title                 Genre
 0                  Oscar et la dame rose        __label__drama
 1                                  Cupid     __label__thriller
 2               Young Wild and Wonderful        __label__adult
 3                         The Secret Sin        __label__drama
 4                        The Unrecovered        __label__drama
 ...                                  ...                   ...
 54209                             Bonino       __label__comedy
 54210                Dead Girls Dont Cry       __label__horror
 54211  Ronald Goedemondt Ze bestaan echt  __label__documentary
 54212                  Make Your Own Bed       __label__comedy
 54213  Natures Fury Storm of the Century      __label__history
 
 [54214 rows x 2 columns],
                       Title                 Genre
 0              Edgars Lunch     __label__thriller
 1         La guerra de papá       __label__comedy
 2      Off the Beaten Track  __label__documentary

#### The data preprocessing cell might take a minute to run. After the data preprocessing is complete, we need to upload it to S3 bucket so that it can be consumed by SageMaker to execute training jobs. We’ll use Python SDK to upload these two files to the bucket and prefix location that we have set above.

In [7]:
train_channel = prefix + "/train_processed"        #Train channel for s3 bucket path
test_channel = prefix + "/test_processed"          #Test channel for s3 bucket path

# Save the Processed data as TXT files to upload:
df_train[['Genre', 'Title']].to_csv('./train_processed.txt', index = False, sep = ' ', header = None,  escapechar = " ")
df_test[['Genre', 'Title']].to_csv('./test_processed.txt', index = False, sep = ' ', header = None,  escapechar = " ")

# Python SDK to upload these two files to the bucket and prefix location that we have set above.
train_data_uri = sess.upload_data(bucket=bucket, key_prefix=train_channel, path='./train_processed.txt')
test_data_uri = sess.upload_data(bucket=bucket, key_prefix=test_channel, path='./test_processed.txt')

In [14]:
# S3 processed data files path
s3_train_data = "s3://{}/{}".format(bucket, train_channel)
s3_validation_data = "s3://{}/{}".format(bucket, test_channel)

In [9]:
# Model artifact output location
s3_output_location = "s3://{}/{}/output".format(bucket, prefix)      

## Training Model 

Now that we are done with all the setup that is needed, we are ready to train our object detector. To begin, let us create a ***sagemaker.estimator.Estimator*** object. This estimator will launch the training job.

In [10]:
# Blazingtext container 
image_uri = sagemaker.image_uris.retrieve(region=region, framework="blazingtext")
print("Using SageMaker BlazingText container: {} ({})".format(image_uri, region))

Using SageMaker BlazingText container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:1 (us-east-1)


In [11]:
# Blazingtext estimator instance passing the container image and hyperparameter setting
bt_model = sagemaker.estimator.Estimator(
    image_uri,
    role,
    instance_count=1,
    instance_type="ml.c4.4xlarge",
    volume_size=30,
    max_run=360000,
    input_mode="File",
    output_path=s3_output_location,
    hyperparameters={
        "mode": "supervised",
        "epochs": 10,
        "min_count": 2,
        "learning_rate": 0.01,
        "vector_dim": 300,
        "early_stopping": True,
        "patience": 4,
        "min_epochs": 5,
        "word_ngrams": 2,
    },
)


In [16]:
# Train and test data channels creation for algorithm
train_data = sagemaker.inputs.TrainingInput(
    s3_train_data,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.inputs.TrainingInput(
    s3_validation_data,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train_data, "validation": validation_data}

In [21]:
# Model fitting to the datasets 
bt_model.fit(inputs=data_channels, logs=True)

2022-09-20 21:20:48 Starting - Starting the training job...
2022-09-20 21:21:16 Starting - Preparing the instances for trainingProfilerReport-1663708847: InProgress
.........
2022-09-20 21:22:37 Downloading - Downloading input data...
2022-09-20 21:23:17 Training - Training image download completed. Training in progress..[34mArguments: train[0m
[34m[09/20/2022 21:23:13 INFO 140010953500480] nvidia-smi took: 0.025332927703857422 secs to identify 0 gpus[0m
[34m[09/20/2022 21:23:13 INFO 140010953500480] Running single machine CPU BlazingText training using supervised mode.[0m
[34mNumber of CPU sockets found in instance is  1[0m
[34m[09/20/2022 21:23:13 INFO 140010953500480] Processing /opt/ml/input/data/train/train_processed.txt . File size: 1.9633750915527344 MB[0m
[34m[09/20/2022 21:23:13 INFO 140010953500480] Processing /opt/ml/input/data/validation/test_processed.txt . File size: 1.965475082397461 MB[0m
[34mRead 0M words[0m
[34mNumber of words:  15723[0m
[34mLoading v

### Once the job has finished a “Job complete” message will be printed. The trained model can be found in the S3 bucket that was setup as output_path in the estimator.

![S3 Model location](s3_model.png) 

In [24]:
# Accuracy of model
bt_model.training_job_analytics.dataframe()



Unnamed: 0,timestamp,metric_name,value
0,0.0,train:accuracy,0.3307
1,0.0,validation:accuracy,0.3196


## Model Deploy & Test 

Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model.

In [25]:
# Model Deployment as an endpoint 
text_classifier = bt_model.deploy(initial_instance_count=1,
                                   instance_type='ml.m5.large',
                                   serializer=sagemaker.serializers.JSONSerializer(),
                                   deserializer=sagemaker.deserializers.JSONDeserializer())
print()
print('Endpoint name:  {}'.format(text_classifier.endpoint_name))

------!
Endpoint name:  blazingtext-2022-09-20-21-37-22-474


In [27]:
# Model Testing
movie_name = ['Pink  Ribbons  One  Small  Step']

payload = {"instances" : movie_name}
predictions = text_classifier.predict(payload)
print(predictions)

[{'label': ['__label__documentary'], 'prob': [0.1242235079407692]}]


### Finally, we should delete the endpoint before we close the notebook

In [28]:
#Cleaning up endpoint 
text_classifier.delete_endpoint()