# Sentiment Analysis using AWS SageMaker - XGBoost algorithm
[Vedant Dave](https://vedantdave77.github.io/) | Vedantdave77@gmail.com | [LinkedIn](https://www.linkedin.com/in/vedant-dave117/)

Hello, I am Vedant Dave, a machine learning practitioner and data enthusiast. -@dave117

## Intro:
In this notebook, I am going to analyze the IMDB dataset. It is one of the best datasets for NLP research. You can visit IMDB.com to get an idea of the company and its work. My main purpose is to use the XGBoost module of the AWS SageMaker Python SDK for sentiment analysis.

Why?

- For most online websites and e-commerce or digital communication companies, sentiment analysis is one of the major ways to improve customer satisfaction, which leads to business growth.

- My main goal is to prepare the text data and build an AWS model with SageMaker (using the batch transform method). I will also handle deployment with a Lambda function (in another notebook). The data will be stored in S3.

- The credit for this notebook goes to Udacity, which gave me the initial idea, but the code modifications, improvements and explanations of the procedure are my own. For any specific issue with the notebook you can reach me through the contact details above, and I encourage you to search online for further explanation. Thank you.

---

Project ML Flow: **Stanford Data API -- S3 -- SageMaker -- Lambda -- WebApp(html file)** 

---

So, let's start...

### Download data from [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/)
The data currently comes as a single archive; for this project we need to separate it into train, validation and test datasets. The labels are also in pos/neg form, so for this project it is better to convert them to 1 and 0.


In [1]:
%mkdir ../data                                                                                         # create directory
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz      # download the archive; wget is a GNU tool for fetching files over HTTP
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data                                                          # extract the .tar.gz archive

--2020-06-12 22:12:31--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-06-12 22:12:33 (45.0 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



### Data Preparation
The data is downloaded as one archive, so we first need to read it into train and test sets, each with its reviews and labels.
> We are also going to do predictive analysis, so it is better to change the labels to 1 and 0 instead of pos and neg.

In [2]:
import os                                                                       # operating-system utilities (used here for path handling)
import glob                                                                     # file-name pattern matching (used here to find every *.txt file)

def read_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [4]:
data,labels = read_data()
print("Total IMDB reviews : train = {} pos/ {} neg, test = {} pos / {} neg".
      format(len(data['train']['pos']),len(data['train']['neg']),
             len(data['test']['pos']),len(data['test']['neg'])))


Total IMDB reviews : train = 12500 pos/ 12500 neg, test = 12500 pos / 12500 neg


In [5]:
# Now, let's combine the pos and neg datasets and shuffle them to make the training and testing datasets.
# WHY?  --> because the function above returns four sets, separated by pos/neg within train and test.
from sklearn.utils import shuffle

def prepare_imdb_data(data,labels):
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']              # Awesome mistake +++ cost me 8 days (and 50+ hr sagemaker cost)
    labels_test = labels['test']['pos'] + labels['test']['neg']

    # shuffle reviews and corresponding labels within the training and test datasets
    data_train, labels_train = shuffle(data_train,labels_train)                 # both lists get the same permutation, so each review keeps its label
    data_test, labels_test = shuffle(data_test,labels_test)

    # return the datasets for later processing.
    return data_train, data_test, labels_train, labels_test

In [7]:
train_X,test_X,train_y,test_y = prepare_imdb_data(data,labels)
print('IMDB total reviews (full dataset) :train = {}, test - {}'.format(len(train_X),len(test_X)))

IMDB total reviews (full dataset) :train = 25000, test - 25000
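
The key property of the shuffle step is that reviews and labels are permuted together. As a minimal sketch of that idea using only the standard library (the helper name `shuffle_together` and the toy data are assumptions for illustration, not part of the notebook, which uses `sklearn.utils.shuffle`):

```python
import random

def shuffle_together(items, labels, seed=None):
    """Shuffle two equal-length lists with the same permutation."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    paired = list(zip(items, labels))          # keep each item next to its label
    random.Random(seed).shuffle(paired)        # shuffle the pairs, not the lists separately
    if not paired:
        return [], []
    shuffled_items, shuffled_labels = zip(*paired)
    return list(shuffled_items), list(shuffled_labels)

# Toy data (hypothetical reviews, 1 = positive, 0 = negative)
reviews = ["great film", "terrible plot", "loved it", "boring"]
sentiments = [1, 0, 1, 0]
original_pairs = dict(zip(reviews, sentiments))

shuffled_reviews, shuffled_sentiments = shuffle_together(reviews, sentiments, seed=42)

# Every review still carries its original label after shuffling
for review, sentiment in zip(shuffled_reviews, shuffled_sentiments):
    assert original_pairs[review] == sentiment
```

Shuffling the two lists independently would break this pairing, which is the kind of silent error that only shows up later as a model that cannot learn.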


In [8]:
train_X[100]                                                                    #  show the review at index 100

"Rock n' roll is a messy business and DiG! demonstrates this masterfully. A project of serious ambition, and perhaps foolhardiness, the filmmaker is able to mend together seven tumultuous years of following around two unwieldy rock groups. With that said, the abundance of quality material ensures the film's ability to captivate the audience. If you've ever been interested in any realm of the music industry, this movie will undoubtedly be an arresting viewing. the music in the film, although it suffers minimally from requisite cutting and pasting, is worth the price of admission alone. the morning after i saw DiG! i went straight to the record store to pick up a Brian Jonestown Massacre album (i was already initiated to the Dandy Warhols' sounds). Primarily defined by its exploration of rock music, the film succeeds at other profound levels. DiG! is a sincere, and sufficiently objective, glance into the destructive and volatile nature of the creative process and the people that try to w

### Data Preprocessing
Cleaning the data and getting it ready for analysis is one of the hardest parts of ML. Look at the review above: it was collected from the web as HTML, which is why HTML line-break tags **(`<br />`)** can appear in the text. So, first we need to remove them. Moreover, some words are repetitive, carry little meaning, or are variants of the same word. We will remove all of these obstacles. This step is called data preprocessing, also known as data cleaning, data wrangling, or data manipulation. I am going to use the NLTK library for it.

In [9]:
import nltk
nltk.download("stopwords")                                # download the stopword list; if this fails, download it manually into the NLTK data directory
from nltk.corpus import stopwords
from nltk.stem.porter import *                          

stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [10]:
import re                                                     # regular expressions
from bs4 import BeautifulSoup                                 # Python library for parsing HTML/XML (used here to strip markup)

def review_to_words(review):
    text = BeautifulSoup(review, "html.parser").get_text()                      # remove html tags
    text = re.sub(r"[^a-zA-Z0-9]"," ",text.lower())                             # convert to lowercase, then replace every non-alphanumeric character with a space
    words = text.split()                                                        # split string into words
    words = [w for w in words if  w not in stopwords.words("english")]          # remove stopwords
    words = [PorterStemmer().stem(w) for w in words]                           # reduce each word to its stem (plurals, verb forms, etc. collapse to one token)
    return words
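
To make the cleaning steps concrete without installing anything, here is a minimal sketch that uses only the standard library. It is a simplified stand-in, not the notebook's method: a regular expression replaces BeautifulSoup, the tiny `TOY_STOPWORDS` set is a hypothetical substitute for NLTK's English list, and stemming is left out entirely.

```python
import re

# Hypothetical, deliberately tiny stopword list; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "is", "a", "and", "it", "this", "of"}

def simple_review_to_words(review):
    """Strip HTML tags, lowercase, drop punctuation and stopwords."""
    text = re.sub(r"<[^>]+>", " ", review)              # replace HTML tags with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())   # keep only letters and digits
    return [w for w in text.split() if w not in TOY_STOPWORDS]

sample = "This movie is GREAT!<br /><br />The acting, and the plot: 10/10."
tokens = simple_review_to_words(sample)
print(tokens)  # ['movie', 'great', 'acting', 'plot', '10', '10']
```

A regex is a rough way to remove tags and can fail on unusual markup, which is one reason the notebook relies on a real HTML parser instead.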

In [11]:
import pickle                                                                   # serializes Python objects to a byte stream and back
cache_dir = os.path.join("../cache", "sentiment_analysis")                      # define storage path
os.makedirs(cache_dir, exist_ok=True)                                           # make sure the directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir = cache_dir, cache_file ="preprocessed_data.pkl"):
  
    cache_data = None                                          # nothing loaded yet
    if cache_file is not None:                                 # reuse previously saved results so later runs are faster
        try: 
            with open(os.path.join(cache_dir, cache_file), "rb") as f:           # read the pickle file in binary mode
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file :" , cache_file)
        except Exception:                                      # cache missing or unreadable: fall through and recompute
            pass                       

    if cache_data is None:
        words_train = [review_to_words(review) for review in data_train]    # tokenize every training review
        words_test = [review_to_words(review) for review in data_test]     # ... same for the test reviews

        if cache_file is not None:
            cache_data = dict(words_train = words_train,words_test = words_test, labels_train = labels_train,labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file),"wb") as f:        # only write when a cache file was requested
                pickle.dump(cache_data,f)
            print("Wrote preprocessed data to cache file: ", cache_file)
    else: 
        print("Getting from cache data ...")
        words_train,words_test,labels_train,labels_test = (cache_data['words_train'],cache_data['words_test'],cache_data['labels_train'],cache_data['labels_test'])
      
    return words_train,words_test,labels_train,labels_test

**Explanation of the preprocess operation:** 

Here, we have two options:
> First, **try to load the data directly from cache_file, if it was generated previously.** Then check what ended up in cache_data:
>> (A) If **cache_data is still empty (None), generate the train and test word lists** and also write (dump) them to **fill the cache_file.** Future runs will then take the data from cache_file.<br>
>> (B) But **if cache_data was loaded successfully, simply unpack the data** from it to save time.

If this is still confusing, it helps to draw a flow diagram on paper. ;) :)
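
The same load-or-compute flow can be sketched in a few lines. This is a generic illustration under assumptions: the helper `load_or_compute` and the file name `demo_cache.pkl` are hypothetical, and a temporary directory stands in for the notebook's cache directory.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return (result, came_from_cache). Compute and save only on a cache miss."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f), True            # (B) cache hit: reuse saved result
    except (OSError, pickle.UnpicklingError, EOFError):
        pass                                       # cache missing or unreadable
    result = compute()                             # (A) cache miss: do the slow work
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)                     # save it for next time
    return result, False

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_cache.pkl")
    # First call: nothing cached yet, so the function computes and writes the file
    first, first_from_cache = load_or_compute(path, lambda: [w.upper() for w in ["a", "b"]])
    # Second call: the file exists, so the compute function is never invoked
    second, second_from_cache = load_or_compute(path, lambda: ["never", "called"])

print(first, first_from_cache)    # ['A', 'B'] False
print(second, second_from_cache)  # ['A', 'B'] True
```

One consequence worth remembering: a stale cache file is returned as-is, so delete it whenever the preprocessing logic changes.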

---

In [13]:
# get preprocessed data
train_X,test_X,train_y,test_y = preprocess_data(train_X,test_X,train_y,test_y)

Wrote preprocessed data to cache file:  preprocessed_data.pkl


### Extract Bag of words features 


The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.


As an example : 

> Sentence : I like data science, it is the exploration behind data. Please, give me an opportunity to work with data science
>> Bag of words = {"I": 1, "like": 1, "data": 3, "science": 2, "it": 1, ... , "with": 1}
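
The example sentence can be counted directly with the standard library. This is a minimal sketch, assuming a simple tokenizer that lowercases the text and keeps only runs of letters (so "I" becomes "i"); the notebook itself uses `CountVectorizer` on already-tokenized reviews.

```python
import re
from collections import Counter

sentence = ("I like data science, it is the exploration behind data. "
            "Please, give me an opportunity to work with data science")

tokens = re.findall(r"[a-z]+", sentence.lower())   # lowercase, letters only
bag = Counter(tokens)                              # word -> number of occurrences

print(bag["data"], bag["science"], bag["i"])       # 3 2 1
```

Word order is gone in `bag`, but the multiplicity of each word is preserved, which is exactly what the bag-of-words model keeps.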

In [14]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import joblib                                                                   # joblib stores large NumPy arrays efficiently (older scikit-learn exposed it as sklearn.externals.joblib)

def extract_BoW_features(words_train,words_test,vocabulary_size = 5000,
                         cache_dir = cache_dir, cache_file = 'bow_features.pkl'):
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), 'rb') as f:
                cache_data = joblib.load(f)
            print("Read features from cache file: ", cache_file)
        except:
            pass
    
    if cache_data is None:
        vectorizer = CountVectorizer(max_features = vocabulary_size ,
                                     preprocessor = lambda x: x,tokenizer = lambda x:x)
        features_train = vectorizer.fit_transform(words_train).toarray()
        features_test = vectorizer.transform(words_test).toarray()

        if cache_file is not None:
            vocabulary = vectorizer.vocabulary_
            cache_data = dict(features_train = features_train,features_test=features_test,
                            vocabulary = vocabulary)
            with open(os.path.join(cache_dir, cache_file),'wb') as f:
                joblib.dump(cache_data,f)
            print("wrote features to cache file:",cache_file)
    else:
        features_train, features_test,vocabulary = (cache_data['features_train'],cache_data['features_test'],
                                                    cache_data['vocabulary'])

    return features_train, features_test, vocabulary
  




In [16]:
train_X,test_X,vocabulary = extract_BoW_features(train_X,test_X)

wrote features to cache file: bow_features.pkl
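
To show what the feature extraction produces, here is a small pure-Python sketch of the same idea: keep only the most frequent words as the vocabulary, then turn each tokenized document into a vector of counts. The helper names and toy documents are assumptions for illustration, and the alphabetical tie-break between equally frequent words is my own choice, not necessarily what `CountVectorizer` does.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, vocabulary_size):
    """Map the `vocabulary_size` most frequent words to column indices."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    # Rank by descending frequency; ties broken alphabetically (an assumption)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = sorted(word for word, _ in ranked[:vocabulary_size])
    return {word: index for index, word in enumerate(kept)}

def to_count_vector(doc, vocabulary):
    """Count occurrences of each vocabulary word; unknown words are ignored."""
    vector = [0] * len(vocabulary)
    for word in doc:
        index = vocabulary.get(word)
        if index is not None:
            vector[index] += 1
    return vector

train_docs = [["good", "movi", "good"], ["bad", "movi"], ["good", "plot"]]
vocab = build_vocabulary(train_docs, vocabulary_size=3)
vectors = [to_count_vector(doc, vocab) for doc in train_docs]
unseen = to_count_vector(["good", "unseen", "bad"], vocab)

print(vocab)    # {'bad': 0, 'good': 1, 'movi': 2}
print(vectors)  # [[0, 2, 1], [1, 0, 1], [0, 1, 0]]
print(unseen)   # [1, 1, 0]
```

Note that the vocabulary is built from the training documents only and then reused for new text, mirroring the `fit_transform` on the training set and `transform` on the test set in the cell above.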


### Classification using  XGBoost Algorithm
SageMaker has a built-in XGBoost algorithm for classification tasks. To get better accuracy and avoid overfitting, I want to use a validation dataset. To do that, we first set aside the first 10000 reviews for validation and then arrange the data as pandas DataFrames for XGBoost. The data is stored in S3. 

In [17]:
import pandas as pd
val_X = pd.DataFrame(train_X[:10000])
train_X = pd.DataFrame(train_X[10000:])

val_y = pd.DataFrame(train_y[:10000])
train_y = pd.DataFrame(train_y[10000:])

In [18]:
# create the local directory where our data is stored for use.
data_dir = '../data/xgboost'
if not os.path.exists(data_dir):                                                # ensure about dir. (resolve bug)
    os.makedirs(data_dir)

In [20]:
train_X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
val_X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
train_y.head()

Unnamed: 0,0
0,0
1,1
2,0
3,1
4,0


In [23]:
val_y.head()

Unnamed: 0,0
0,0
1,1
2,0
3,0
4,0


In [26]:
# save data to the local directory
pd.DataFrame(test_X).to_csv(os.path.join(data_dir,'test.csv'),header=False,index=False)                         # test.csv 

pd.concat([val_y,val_X],axis=1).to_csv(os.path.join(data_dir,'validation.csv'),header= False, index= False)   # validation.csv (label in the first column)

pd.concat([train_y,train_X],axis=1).to_csv(os.path.join(data_dir,'train.csv'),header= False, index = False)   # train.csv (label in the first column)
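
The files written above follow the layout that SageMaker's built-in XGBoost expects for CSV training data: no header row, and the label in the first column followed by the features. A minimal sketch of that layout with the standard `csv` module (the helper `rows_with_label_first` and the toy values are assumptions for illustration):

```python
import csv
import io

def rows_with_label_first(labels, features):
    """Prepend each label to its feature row."""
    if len(labels) != len(features):
        raise ValueError("labels and features must have the same length")
    return [[label] + list(row) for label, row in zip(labels, features)]

toy_labels = [1, 0]
toy_features = [[0, 2, 1], [1, 0, 0]]

buffer = io.StringIO()                                   # in-memory file for the demo
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows_with_label_first(toy_labels, toy_features))  # no header row
csv_text = buffer.getvalue()

print(csv_text)
# 1,0,2,1
# 0,1,0,0
```

The test file is different on purpose: it contains only features, because the labels are what the model has to predict.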

In [27]:
# free memory: the data now lives in the CSV files, so drop the in-memory copies
train_X = val_X = train_y = val_y = None

### Uploading Training/validation to S3 
Flow --> Local_dir --> S3 --> SageMaker --> S3(result) --> Local_dir(result)

Here, I am going to use SageMaker's high-level features, so all the background work is done by SageMaker itself, and I just need to provide the resources, commands and requirements. 

The low-level features are also available; they give more flexibility over the model, which is useful when you need to do research around your results. Here I will use automatic hyperparameter tuning to get the best result (with high accuracy) for our dataset, so the high-level features are a good fit.

Let's start the real work with SageMaker.

In [30]:
import sagemaker                                                                # call sagemaker
session = sagemaker.Session()                                                   # create  session for sagemaker 
prefix = 'sentiment-xgboost-hptuning'                                                    # prefix will be used for unique name identification (in near future)

# set specific location on S3 for easy access 
test_location = session.upload_data(os.path.join(data_dir,'test.csv'),key_prefix= prefix)           # upload test data 
val_location = session.upload_data(os.path.join(data_dir,'validation.csv'),key_prefix = prefix)    # upload validation data
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'),key_prefix = prefix)        # upload train data 

'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.


### Create XGBoost model tuning requirement

As I said before, I am using the high-level API, which helps me get results quickly at the cost of some flexibility. After automatic tuning we will get the best model. Before training, we need to do some setup. 

SageMaker model creation: its ecosystem has three different objects, which interact with each other. 
1. Model Artifacts
2. Training Code (container)
3. Inference Code (container)

The model artifacts are the model itself. The training code uses the training data to create the model artifacts. The inference code uses the model artifacts to make predictions on new data. 

SageMaker uses Docker containers. A Docker container is essentially a package of code together with everything needed to run it. 

In [31]:
from sagemaker import get_execution_role                                          
role = get_execution_role()                                                     # IAM role of this notebook; it grants SageMaker permission to access resources (e.g. S3) and controls unauthorized access

In [51]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(session.boto_region_name,'xgboost')                   # URI of the built-in XGBoost container image for the current region


	get_image_uri(region, 'xgboost', '1.0-1').


In [52]:
# specify model with required parameters 
xgb = sagemaker.estimator.Estimator(container,                                  # container image to run (the built-in XGBoost algorithm)
                                    role,                                       # define role (who gives permission for this)
                                    train_instance_count = 1,                   # number of instances used for the task (more instances, more power, more expense)
                                    train_instance_type = 'ml.m4.xlarge',       # instance type (more power, more expense, less execution time); it is separate from your notebook instance. 
                                    output_path = 's3://{}/{}/output'.format(session.default_bucket(),prefix),   # where to save
                                    sagemaker_session= session)                 # define session (the current one)



xgb.set_hyperparameters(max_depth=5,                                             # set parameters
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)




### Create the hyperparameter tuner...
Here, you can give a range for each parameter, and the tuner automatically picks values from those ranges. I allow a total of 6 training jobs (`max_jobs = 6`) to decide the best one: the tuner tries parameter values from the ranges and returns the best model based on the validation metric (here `validation:rmse`, which is minimized).

In [53]:
# Added section for hyperparameter tuning
from sagemaker.tuner import IntegerParameter,ContinuousParameter,HyperparameterTuner      # watch the capitalization and spelling of these names (bug detected --> fixed)

# xgb_hyperparameter_tuner = None
xgb_hyperparameter_tuner = HyperparameterTuner(estimator = xgb, # The estimator object to use as the basis for the training jobs.
                                               objective_metric_name = 'validation:rmse', # The metric used to compare trained models.
                                               objective_type = 'Minimize', # Whether we wish to minimize or maximize the metric.
                                               max_jobs = 6, # The total number of models to train
                                               max_parallel_jobs = 3, # The number of models to train in parallel
                                               hyperparameter_ranges = {
                                                    'max_depth': IntegerParameter(3, 12),
                                                    'eta'      : ContinuousParameter(0.05, 0.5),
                                                    'min_child_weight': IntegerParameter(2, 8),
                                                    'subsample': ContinuousParameter(0.5, 0.9),
                                                    'gamma': ContinuousParameter(0, 10),
                                               })
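
Conceptually, a tuner runs several training jobs with different parameter values and keeps the one with the best objective metric. The sketch below illustrates that loop with plain random search over the same ranges. It is a simplified stand-in under stated assumptions: SageMaker's default tuning strategy is Bayesian optimization, not random search, and `toy_validation_rmse` is a made-up scoring function that replaces a real training job.

```python
import random

def sample_config(rng):
    """Draw one configuration from the same ranges used in the tuner above."""
    return {
        "max_depth": rng.randint(3, 12),            # integer range, both ends inclusive
        "eta": rng.uniform(0.05, 0.5),
        "min_child_weight": rng.randint(2, 8),
        "subsample": rng.uniform(0.5, 0.9),
        "gamma": rng.uniform(0, 10),
    }

def toy_validation_rmse(config):
    """Hypothetical stand-in for a training job: returns a made-up score."""
    return abs(config["eta"] - 0.2) + abs(config["max_depth"] - 6) / 10

def random_search(max_jobs, seed=0):
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(max_jobs)]
    scored = [(toy_validation_rmse(config), config) for config in trials]
    best_score, best_config = min(scored, key=lambda pair: pair[0])   # 'Minimize'
    return best_score, best_config, scored

best_score, best_config, all_trials = random_search(max_jobs=6)
print(len(all_trials), "trials; best score:", round(best_score, 4))
```

With only 6 jobs the search covers a small part of the parameter space, so raising `max_jobs` is the simplest lever if the budget allows.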

### Fit the hyperparameter tuner (same as model fitting)
We have already defined the model and generated the data for it. 

The next step is to fit the model to the data, that is, to train our model on the dataset. This takes time, and for training you have two options in terms of instance capacity. 

For me **ml.m4.xlarge** is still within the free-tier hours (125 hr), so I am going to use it. Otherwise the notebook instance type **(ml.m5.xlarge)** which I used is better than this. But, as I discussed earlier, **I had a problem with cached data filling the instance memory, so I used a more powerful instance for model building.** You are free to use any. 
> *Please refer to the SageMaker documentation for more information regarding price and capacity. Thank you*

---

The following procedure takes some time, around 30 to 40 min....

In [55]:
s3_input_train = sagemaker.s3_input(s3_data = train_location,content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data = val_location, content_type= 'csv')

xgb_hyperparameter_tuner.fit({'train': s3_input_train,'validation': s3_input_validation})

xgb_hyperparameter_tuner.wait()                      ### Blocks until tuning finishes. For more information, check the log files in SageMaker.



..............................................................................................................................................................................................................................................................................................................................................................................................................................!


### Testing Model

I will use SageMaker's Batch Transform functionality.

Batch Transform is a convenient way to perform inference on a large dataset in a way that is not realtime. That is, we don't necessarily need to use our model's results immediately and instead we can perform inference on a large number of samples.

**Applications:**
>Businesses that run continuously and want to assess their growth and customer service periodically, for example at the end of each week or month, can use batch transform. It is not meant for realtime applications. Small businesses mostly use it. Sometimes industry giants use it to solve a specific problem (for example, if a specific region has an issue with a specific type of product, they can use it for a 5W QA analysis). 

---
The following procedure takes some time, 5 to 10 min... 

In [57]:
# Let's pick the best model found by the tuner.
xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())    # attach to the best training job

# From here on, the procedure is the same as a regular batch transform



2020-06-12 23:59:55 Starting - Preparing the instances for training
2020-06-12 23:59:55 Downloading - Downloading input data
2020-06-12 23:59:55 Training - Training image download completed. Training in progress.
2020-06-12 23:59:55 Uploading - Uploading generated training model
2020-06-12 23:59:55 Completed - Training job completed
Arguments: train
[2020-06-12:23:42:48:INFO] Running standalone xgboost training.
[2020-06-12:23:42:48:INFO] Setting up HPO optimized metric to be : rmse
[2020-06-12:23:42:48:INFO] File size need to be processed in the node: 238.47mb. Available memory size in the node: 8477.12mb
[2020-06-12:23:42:48:INFO] Determined delimiter of CSV input is ','
[23:42:48] S3DistributionType set as FullyReplicated
[23:42:50] 15000x5000 matrix with 75000000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-06-12:23:42:50:INFO] Determined delimiter of CSV input is ','


[69] train-rmse:0.305528 validation-rmse:0.341295
[23:46:02] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 78 extra nodes, 18 pruned nodes, max_depth=10
[70] train-rmse:0.304282 validation-rmse:0.340647
[23:46:05] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 100 extra nodes, 16 pruned nodes, max_depth=10
[71] train-rmse:0.302898 validation-rmse:0.340172
[23:46:08] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 64 extra nodes, 12 pruned nodes, max_depth=10
[72] train-rmse:0.30198 validation-rmse:0.339632
[23:46:10] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 46 extra nodes, 18 pruned nodes, max_depth=10
[73] train-rmse:0.301255 validation-rmse:0.339125
[23:46:13] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 54 extra nodes, 14 pruned nodes, max_depth=10
[74] train-rmse:0.300503 validation-rmse:0.338639

In [61]:
xgb_transformer = xgb_attached.transformer(instance_count =1, instance_type = 'ml.m4.xlarge')       # create a Batch Transform job from the best model
 
xgb_transformer.transform(test_location,content_type = 'text/csv',split_type= 'Line')     # read data from the test location to predict results, one line per sample

xgb_transformer.wait()                                                                     # wait until the transform job finishes 



.....................
Arguments: serve
[2020-06-13 00:10:40 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-06-13 00:10:40 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-06-13 00:10:40 +0000] [1] [INFO] Using worker: gevent
[2020-06-13 00:10:40 +0000] [38] [INFO] Booting worker with pid: 38
[2020-06-13 00:10:40 +0000] [39] [INFO] Booting worker with pid: 39
[2020-06-13 00:10:40 +0000] [40] [INFO] Booting worker with pid: 40
[2020-06-13 00:10:40 +0000] [41] [INFO] Booting worker with pid: 41
[2020-06-13:00:10:40:INFO] Model loaded successfully for worker : 38
[2020-06-13:00:10:40:INFO] Model loaded successfully for worker : 39
[2020-06-13:00:10:40:INFO] Model loaded successfully for worker : 40
[2020-06-13:00:10:40:INFO] Model loaded successfully for worker : 41
[2020-06-13:00:11:14:INFO] Sniff delimiter as ','
[2020-06-13:00:11:14:INFO] Determined del


[2020-06-13:00:11:35:INFO] Sniff delimiter as ','
[2020-06-13:00:11:35:INFO] Sniff delimiter as ','
[2020-06-13:00:11:35:INFO] Determined delimiter of CSV input is ','
[2020-06-13:00:11:35:INFO] Sniff delimiter as ','
[2020-06-13:00:11:35:INFO] Determined delimiter of CSV input is ','
[2020-06-13:00:11:35:INFO] Determined delimiter of CSV input is ','
[2020-06-13:00:11:35:INFO] Sniff delimiter as ','
[2020-06-13:00:11:35:INFO] Determined delimiter of CSV input is ','
[2020-06-13:00:11:37:INFO] Sniff delimiter as ','
[2020-06-13:00:11:37:INFO] Determined delimiter of CSV input is ','
[2020-06-13:00:11:37:INFO] Sniff delimiter as ','
[2020-06-13:00:11:37:INFO] Determined delimiter of CSV input is ','
[2020-06-13:00:11:37:INFO] Sniff delimiter as ','
[2020-06-13:00:11:37:INFO] Determined delimiter of CSV input is ','
[2020-06-13:00:11:37:INFO] Sniff delimiter

In [69]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir                               # copy the test results from S3 to the local directory

Completed 256.0 KiB/372.8 KiB (786.4 KiB/s) with 1 file(s) remaining
Completed 372.8 KiB/372.8 KiB (1.1 MiB/s) with 1 file(s) remaining
download: s3://sagemaker-us-west-2-337299574287/xgboost-200612-2326-006-e60fb90c-2020-06-13-00-07-15-230/test.csv.out to ../data/xgboost/test.csv.out


In [70]:
# Load the predictions for the accuracy calculation. 
predictions = pd.read_csv(os.path.join(data_dir,'test.csv.out'),header=None)
predictions = [round(num) for num in predictions.squeeze().values]              # round each predicted probability to a 0/1 label

In [71]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y,predictions)                

0.8728

Note that this time the model gives a better result than the previous notebook: 87.28% accuracy, which is 2.16 percentage points higher than the previous one (85.12%).
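
For readers who want to see what the accuracy number means, here is a minimal pure-Python sketch of the two steps: turning predicted probabilities into 0/1 labels and counting the share of matches. The helper names and toy values are assumptions for illustration; the notebook uses `round` and scikit-learn's `accuracy_score`.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 (positive) or 0 (negative)."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty list")
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)

# Toy values (hypothetical model outputs and ground truth)
toy_probabilities = [0.91, 0.12, 0.55, 0.40, 0.08]
toy_truth = [1, 0, 0, 0, 0]

toy_predictions = to_labels(toy_probabilities)
toy_accuracy = accuracy(toy_truth, toy_predictions)

print(toy_predictions)  # [1, 0, 1, 0, 0]
print(toy_accuracy)     # 0.8
```

An explicit threshold also makes the behavior at exactly 0.5 unambiguous, whereas Python's built-in `round(0.5)` returns 0 because it rounds halves to the nearest even number.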

---

### Clean up disk and dir (free memory for next prediction)

The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute other notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. 

Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook.

In [72]:
# first delete the files from directory
!rm $data_dir/*

# delete directory itself
!rmdir $data_dir

# remove all the files in the cache_dir
!rm $cache_dir/*

# remove cache_directory itself
!rmdir $cache_dir

In [73]:
# Keep Learning,Enjoy Empowering