# Sentiment Analysis using AWS SageMaker - XGBoost algorithm
[Vedant Dave](https://vedantdave77.github.io/) | Vedantdave77@gmail.com | [LinkedIn](https://www.linkedin.com/in/vedant-dave117/)

Hello, I am Vedant Dave, a machine learning practitioner and a data enthusiast. -@dave117

## Intro:
In this notebook, I analyze the IMDB dataset, one of the most widely used datasets in NLP research. You can browse IMDB.com to get an idea of the company's portfolio and work profile. My main purpose is to work with Batch Transform and to update the model for better performance. My previous result was 85.12% accuracy with Batch Transform (SageMaker Python SDK), and I improved it to 87.28% with SageMaker automatic hyperparameter tuning.

Why?

- For most online websites and e-commerce or digital communication companies, sentiment analysis is one of the major fields for improving customer satisfaction, which leads to business growth.

- My major goal is to prepare the text data and implement an AWS model with SageMaker (Batch Transform method). I am also going to deploy the model using a Lambda function (in another notebook). The data will be stored in S3.

- The credit for this notebook goes to Udacity, from which I took the intuition, but the code modifications, improvements and procedure explanations were done by me. So, for any specific issue with the notebook, you can reach me through the contact details above, and I encourage you to search the web for further explanation. Thank you.

---

Project ML Flow: **Stanford Data API -- S3 -- SageMaker -- Lambda -- WebApp (HTML file)**

---

So, let's start...

### Download data from [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/)
The data currently comes as a single archive; for the project we need to separate it into train, validation and test datasets. The labels are also in pos/neg form, so for the project it is better to convert them to 1 and 0.


In [1]:
%mkdir ../data                                                                                         # create the data directory
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz      # download the archive; wget is a GNU utility for fetching files over HTTP
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data                                                          # extract the .tar.gz archive into ../data

--2020-06-13 01:00:32--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-06-13 01:00:34 (45.3 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



### Data Preparation

The extracted archive stores every review as a separate .txt file, grouped into train and test folders with pos and neg subfolders. Before any cleaning, we read these files into Python lists, attach a 1/0 label to each review, and then combine and shuffle them into training and test sets.


In [2]:
import os                                                                       # operating-system utilities (paths, directories)
import glob                                                                     # file-name pattern matching, e.g. '*.txt'

def read_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [3]:
data,labels = read_data()
print("Total IMDB reviews : train = {} pos/ {} neg, test = {} pos / {} neg".
      format(len(data['train']['pos']),len(data['train']['neg']),
             len(data['test']['pos']),len(data['test']['neg'])))


Total IMDB reviews : train = 12500 pos/ 12500 neg, test = 12500 pos / 12500 neg


In [4]:
# Now, let's combine the pos and neg datasets and shuffle them to make the training and testing datasets.
# WHY?  --> because the function above gives us four sets, separated by pos/neg within train and test.
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):                                            # parameter was misspelled 'lables', so the global 'labels' was used silently
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']              # a mistake on this line once cost me 8 days (and 50+ hours of SageMaker usage)
    labels_test = labels['test']['pos'] + labels['test']['neg']

    # shuffle reviews and corresponding labels within the training and test datasets
    data_train, labels_train = shuffle(data_train,labels_train)                 # reviews and labels are shuffled together, so pairs stay aligned
    data_test, labels_test = shuffle(data_test,labels_test)

    # return the datasets for the following steps.
    return data_train, data_test, labels_train, labels_test

In [5]:
train_X,test_X,train_y,test_y = prepare_imdb_data(data,labels)
print('IMDB total reviews (full dataset) :train = {}, test - {}'.format(len(train_X),len(test_X)))

IMDB total reviews (full dataset) :train = 25000, test - 25000


In [6]:
train_X[100]                                                                    #  the review at index 100 (one .txt file)

'If the answer to this question is yes, then you should enjoy this excellent movie. I\'ve just seen it a couple of hours ago here in Paris (where the action of the movie takes place)and I can still feel the huge trauma I received in the back of my eyes...What a visual shock ! I\'ve never seen such a beautiful black&white photo and such a drastic change in the way of doing animated movies. I strongly believe there will a before and after "Renaissance", similarly to what we saw with Pixar movies or the Akira and GhostInTheShell experiences. This is a real breakthrough in the small world of animated movies and I hope this french initiative (a small unknown french studio with a few young folks who had a dream named "Renaissance"...) will receive the success and recognition it deserves. Vive la France !'

### Data Preprocessing
The hardest part of ML is often cleaning the data and getting it ready for analysis. Look at the review above: the reviews were downloaded from the web in HTML form, which is why some of them contain tags such as `<br />`. So, first we need to remove those. Moreover, some words are repetitive, carry little meaning, or are variants of the same word, so we will remove these obstacles as well. This step is called data preprocessing, also known as data cleaning, data wrangling or data manipulation. For it, I am going to use the NLTK library.
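To make the cleaning steps concrete before bringing in NLTK and BeautifulSoup, here is a minimal standard-library sketch of the same idea. It is only an illustration: `TOY_STOP_WORDS` and `toy_stem` are hypothetical stand-ins (a tiny stop-word list and a crude suffix stripper, not the Porter algorithm), so its output will differ from what the notebook's own function produces.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "a", "is", "it", "this", "and", "of"}

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw):
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)

def toy_stem(word):
    # Crude suffix stripping for illustration only; this is NOT the Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_review_to_words(review):
    text = strip_html(review).lower()            # remove tags, then lowercase
    text = re.sub(r"[^a-z0-9]", " ", text)       # punctuation becomes whitespace
    return [toy_stem(w) for w in text.split() if w not in TOY_STOP_WORDS]

print(toy_review_to_words("This movie is amazing!<br /><br />The actors played it well."))
# → ['movie', 'amaz', 'actor', 'play', 'well']
```

The order of operations matters: tags are removed before punctuation is stripped, otherwise fragments such as `br` would survive as words.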

In [7]:
import nltk
nltk.download("stopwords")                                # fetch the English stop-word list (skipped if already up to date)
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer                # explicit import instead of a wildcard

stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
import re                                                     # regular expressions
from bs4 import BeautifulSoup                                 # Python library for parsing HTML (used here to strip markup)

stop_words = set(stopwords.words("english"))                  # build the set once; reloading the list for every word is very slow

def review_to_words(review):
    text = BeautifulSoup(review, "html.parser").get_text()                      # remove html tags
    text = re.sub(r"[^a-zA-Z0-9]"," ",text.lower())                             # lowercase, then replace every non-alphanumeric character with a space
    words = text.split()                                                        # split string into words
    words = [w for w in words if w not in stop_words]                           # remove stopwords
    words = [stemmer.stem(w) for w in words]                                    # reduce each word to its stem (plurals, verb forms, etc.)
    return words

In [9]:
import pickle                                                                   # serializes Python objects to a byte stream and back
cache_dir = os.path.join("../cache", "sentiment_analysis")                      # define storage path
os.makedirs(cache_dir, exist_ok=True)                                           # make sure the directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir = cache_dir, cache_file ="preprocessed_data.pkl"):
  
    cache_data = None                                          # nothing loaded yet
    if cache_file is not None:                                 # try the cache first, so repeated runs are faster
        try: 
            with open(os.path.join(cache_dir, cache_file), "rb") as f:           # read the pickle file in binary mode
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file :" , cache_file)
        except Exception:
            pass                                               # no usable cache: fall through and recompute

    if cache_data is None:
        words_train = [review_to_words(review) for review in data_train]    # one list of stemmed words per training review
        words_test = [review_to_words(review) for review in data_test]      # ... same for the test reviews

        if cache_file is not None:                             # write the cache only when a file name was given
            cache_data = dict(words_train = words_train,words_test = words_test, labels_train = labels_train,labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file),"wb") as f:
                pickle.dump(cache_data,f)
            print("Wrote preprocessed data to cache file: ", cache_file)
    else: 
        print("Getting from cache data ...")
        words_train,words_test,labels_train,labels_test = (cache_data['words_train'],cache_data['words_test'],cache_data['labels_train'],cache_data['labels_test'])
      
    return words_train,words_test,labels_train,labels_test

**Explanation of the preprocess operation:** 

Here, we have two options:
> First, **try to load the data directly from the cache_file generated by a previous run.** Whatever could be read ends up in cache_data.
>> (A) If **cache_data is still empty** (no cache file could be read), then **generate the train and test word lists** and also write (dump) them to **fill the cache_file**, so that future runs take the data from the cache_file.<br>
>> (B) But **if cache_data already exists, it is better to load the data** from it, to save time.

If this is still confusing, it helps to draw a flow diagram on paper yourself. ;) :)
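The same load-or-compute flow can be seen in isolation in the small sketch below. It is a simplified illustration, not the notebook's function: `load_or_compute` and the file name `demo_cache.pkl` are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return (result, source): read the cache if possible, otherwise compute and cache."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f), "cache"           # option (B): cache exists
    except (OSError, EOFError, pickle.UnpicklingError):
        result = compute()                           # option (A): nothing cached yet
        with open(cache_path, "wb") as f:
            pickle.dump(result, f)
        return result, "computed"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cache.pkl")       # hypothetical file name
    expensive = lambda: [w.upper() for w in ["good", "bad"]]
    first, first_source = load_or_compute(path, expensive)     # computes and writes the file
    second, second_source = load_or_compute(path, expensive)   # reads the file back

print(first_source, second_source)
# → computed cache
```

The first call takes branch (A) and the second takes branch (B), which is why the second run of the notebook cell is so much faster than the first.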

---

In [10]:
# get preprocessed data
train_X,test_X,train_y,test_y = preprocess_data(train_X,test_X,train_y,test_y)

Wrote preprocessed data to cache file:  preprocessed_data.pkl


### Extract Bag of words features 


The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.


As an example : 

> Sentence : I like data science, it is the exploration behind data. Please, give me an opportunity to work with data science
>> Bag of words = {"I": 1, "like": 1, "data": 3, "science": 2, "it": 1, ..., "with": 1}
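The counts in this example can be reproduced with a few lines of standard-library Python. The sketch below also shows the step that turns a bag into a fixed-length vector over a limited vocabulary. It is a simplification: the tokenisation rule and the tiny `vocabulary_size` are assumptions for illustration, and `CountVectorizer` orders its columns differently.

```python
import re
from collections import Counter

sentence = ("I like data science, it is the exploration behind data. "
            "Please, give me an opportunity to work with data science")

tokens = re.findall(r"[a-z]+", sentence.lower())     # lowercase words, punctuation dropped
bag = Counter(tokens)                                # word -> number of occurrences

# Fixed-size vocabulary: the most frequent words, each mapped to a column index.
vocabulary_size = 3                                  # toy value; the notebook uses 5000
vocab = {word: idx for idx, (word, _) in enumerate(bag.most_common(vocabulary_size))}

def to_vector(words, vocab):
    """Count only the words that are in the vocabulary; everything else is ignored."""
    vec = [0] * len(vocab)
    for w in words:
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

print(bag["data"], bag["science"], bag["like"])
# → 3 2 1
print(to_vector(["data", "science", "data", "banana"], vocab))
# → [2, 1, 0]
```

Note that "banana" is simply dropped because it is not in the vocabulary; a word the vocabulary does not contain is invisible to the model.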

In [11]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import joblib                                                                   # like pickle, but efficient for large numpy arrays (older scikit-learn exposed it as sklearn.externals.joblib, since removed)

def extract_BoW_features(words_train,words_test,vocabulary_size = 5000,
                         cache_dir = cache_dir, cache_file = 'bow_features.pkl'):
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), 'rb') as f:
                cache_data = joblib.load(f)
            print("Read features from cache file: ", cache_file)
        except:
            pass
    
    if cache_data is None:
        vectorizer = CountVectorizer(max_features = vocabulary_size ,
                                     preprocessor = lambda x: x,tokenizer = lambda x:x)
        features_train = vectorizer.fit_transform(words_train).toarray()
        features_test = vectorizer.transform(words_test).toarray()

        if cache_file is not None:
            vocabulary = vectorizer.vocabulary_
            cache_data = dict(features_train = features_train,features_test=features_test,
                            vocabulary = vocabulary)
            with open(os.path.join(cache_dir, cache_file),'wb') as f:
                joblib.dump(cache_data,f)
            print("wrote features to cache file:",cache_file)
    else:
        features_train, features_test,vocabulary = (cache_data['features_train'],cache_data['features_test'],
                                                    cache_data['vocabulary'])

    return features_train, features_test, vocabulary
  



In [12]:
train_X,test_X,vocabulary = extract_BoW_features(train_X,test_X)

wrote features to cache file: bow_features.pkl


### Classification using  XGBoost Algorithm
SageMaker has a built-in XGBoost algorithm for classification tasks. To get better accuracy and avoid overfitting, I want to use a validation dataset. For that, we first set aside the first 10000 reviews for validation and then hand the data to XGBoost via pandas DataFrames saved as CSV files. The data is stored in S3.

In [13]:
import pandas as pd
val_X = pd.DataFrame(train_X[:10000])
train_X = pd.DataFrame(train_X[10000:])

val_y = pd.DataFrame(train_y[:10000])
train_y = pd.DataFrame(train_y[10000:])

In [14]:
# create the local directory where our data files will be stored.
data_dir = '../data/xgboost'
os.makedirs(data_dir, exist_ok=True)                                            # create the directory if it does not exist yet

In [15]:
# save the data to the local directory (takes about 1 min); check the notebook instance to see the files

pd.DataFrame(test_X).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)                         # test.csv (features only, no labels)

pd.concat([val_y, val_X], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)   # validation.csv (label in the first column)

pd.concat([train_y, train_X], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)    # train.csv (label in the first column)
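The file layout written by the cell above (no header row, no index, label in the first column for the training and validation files, features only for the test file) can be seen on a toy example. This sketch uses only the standard library and made-up numbers; the `toy_` names are hypothetical and are not used anywhere else in the notebook.

```python
import csv
import io

# Toy bag-of-words rows and 0/1 labels (made-up values for illustration).
toy_features = [[0, 2, 1], [1, 0, 0], [3, 1, 0], [0, 0, 4], [1, 1, 1]]
toy_labels = [1, 0, 1, 0, 1]

n_val = 2                                   # the notebook holds out the first 10000 rows
toy_val_X, toy_train_X = toy_features[:n_val], toy_features[n_val:]
toy_val_y, toy_train_y = toy_labels[:n_val], toy_labels[n_val:]

def to_csv_text(X, y=None):
    """Header-less CSV text; when labels are given they go in the first column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for i, row in enumerate(X):
        writer.writerow(([y[i]] if y is not None else []) + list(row))
    return buf.getvalue()

toy_train_csv = to_csv_text(toy_train_X, toy_train_y)
toy_test_csv = to_csv_text(toy_features)    # test file: features only, no label column
print(toy_train_csv)
# → 1,3,1,0
#   0,0,0,4
#   1,1,1,1
```

Because there is no header, the position of each column is the only thing that tells the algorithm which value is the label, so the label-first order must be kept consistent between the training and validation files.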

In [16]:
# free memory: drop the references to the large arrays now that they are saved to disk
train_X = val_X = train_y = val_y = None

Another important point: you cannot always release memory like this, because when you train deep algorithms the data is used again and again.

---

### Uploading Training/validation to S3 

#### Flow :=> Local_Dir    -TO-   S3(data)    -TO-     Sagemaker(training)    -TO-    S3(result)    -TO-   Local_Dir 

Here, Local_Dir is the notebook instance, not your own machine. Check the main folder of the Jupyter notebook instance for it.

Here, I am going to use SageMaker's high-level functionality, so all the background work is done by SageMaker itself, and I just need to provide resources, commands and requirements. This is a rigid but quicker approach.

There is also the low-level functionality, which gives more flexibility over the model and is useful when you need to do some research around your results. Later I will use automatic hyperparameter tuning to get the best result (with high accuracy) for our dataset, so the high-level features are a good fit.

Let's start the real work with SageMaker.

In [19]:
import sagemaker                                                                # call sagemaker
session = sagemaker.Session()                                                   # create  session for sagemaker 
prefix = 'sentiment-xgboost-update'                                                    # prefix will be used for unique name identification (in near future)

# set specific location on S3 for easy access 
test_location = session.upload_data(os.path.join(data_dir,'test.csv'),key_prefix= prefix)           # upload test data 
val_location = session.upload_data(os.path.join(data_dir,'validation.csv'),key_prefix = prefix)    # upload validation data
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'),key_prefix = prefix)        # upload train data 

'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.


### Create XGBoost model tuning requirement

As I said before, I am using the high-level API, which helps me get results quickly at the cost of some flexibility. After automatic tuning we will get the best result. Before training, we need to do some setup.

SageMaker model creation: its ecosystem has three different objects, which interact with each other.
1. Model Artifacts
2. Training Code (container)
3. Inference Code (container)

The model artifacts are the model itself. The training code uses the training data to create the model artifacts. The inference code uses the model artifacts to make predictions on new data.

SageMaker uses Docker containers. A Docker container is, in essence, a package of code together with everything it needs to run.

In [20]:
from sagemaker import get_execution_role                                          
role = get_execution_role()                                                     # IAM role attached to this notebook; it defines what SageMaker is permitted to access (e.g. S3) and guards against unauthorized access

In [22]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(session.boto_region_name,'xgboost')                   # URI of the built-in XGBoost container image for the current region


'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
	get_image_uri(region, 'xgboost', '1.0-1').


In [25]:
# specify the model with the required parameters
# xgb = None                                                                      # create model
xgb = sagemaker.estimator.Estimator(container,                                  # container image holding the training and inference code
                                    role,                                       # IAM role the training job runs under
                                    train_instance_count = 1,                   # number of training instances (more instances, more power, more expense)
                                    train_instance_type = 'ml.m4.xlarge',       # instance type (more power, more expense, less execution time)
                                    output_path = 's3://{}/{}/output'.format(session.default_bucket(),prefix),   # where to save the model artifacts
                                    sagemaker_session= session)                 # define session (the current one)



xgb.set_hyperparameters(max_depth = 5,                                          # see the XGBoost documentation for what each parameter does
                        eta =0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample = 0.8,
                        silent= 0,
                        objective = 'binary:logistic',
                        early_stopping_rounds= 10,
                        num_round = 500)
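Of these hyperparameters, `early_stopping_rounds` together with `num_round` decides how long training actually runs. As I understand it, training stops once the validation metric has not improved for that many consecutive rounds, even if `num_round` has not been reached. The sketch below illustrates that rule on a made-up error sequence; `stopping_round` is a hypothetical helper, and the exact bookkeeping inside the SageMaker container may differ.

```python
def stopping_round(validation_errors, early_stopping_rounds):
    """Return (last_round_run, best_round) for a sequence of per-round validation errors."""
    best_error = float("inf")
    best_round = -1
    for rnd, err in enumerate(validation_errors):
        if err < best_error:
            best_error, best_round = err, rnd        # new best: reset the patience window
        elif rnd - best_round >= early_stopping_rounds:
            return rnd, best_round                   # no improvement for too long: stop here
    return len(validation_errors) - 1, best_round    # ran through every round

# Made-up validation errors, one per boosting round.
errors = [0.30, 0.25, 0.22, 0.23, 0.22, 0.24, 0.21, 0.22, 0.23, 0.22]

print(stopping_round(errors, early_stopping_rounds=3))
# → (5, 2)
print(stopping_round(errors, early_stopping_rounds=4))
# → (9, 6)
```

With a patience of 3 the run stops before the later improvement at round 6 is ever seen, while a patience of 4 reaches it. That trade-off between training time and missed improvements is why the value is worth choosing deliberately.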




### Fit the created model

We have already defined the model and generated its input data.

The next step is to fit the model to the data, that is, to train it on our dataset. This takes time, and for training you have two options in terms of instance capacity.

For me **ml.m4.xlarge** is still within the free-tier hours (125 hr), so I am going to use it. Otherwise the notebook instance type I used, **(ml.m5.xlarge)**, is better than this. But **I had a problem with cached data filling the instance memory, so I used a high-power instance for model building.** You are free to use any.
> *Please refer to the SageMaker documentation for more information regarding price and capacity. Thank you*

In [27]:
s3_input_train = sagemaker.s3_input(s3_data = train_location,content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data = val_location, content_type= 'csv')

xgb.fit({'train': s3_input_train,'validation': s3_input_validation})          # blocks and streams the training log shown below; pass wait=False to return immediately instead



2020-06-13 01:31:19 Starting - Starting the training job...
2020-06-13 01:31:21 Starting - Launching requested ML instances......
2020-06-13 01:32:30 Starting - Preparing the instances for training......
2020-06-13 01:33:43 Downloading - Downloading input data
2020-06-13 01:33:43 Training - Downloading the training image...
2020-06-13 01:34:02 Training - Training image download completed. Training in progress.
Arguments: train
[2020-06-13:01:34:03:INFO] Running standalone xgboost training.
[2020-06-13:01:34:03:INFO] File size need to be processed in the node: 238.47mb. Available memory size in the node: 8474.88mb
[2020-06-13:01:34:03:INFO] Determined delimiter of CSV input is ','
[01:34:03] S3DistributionType set as FullyReplicated
[01:34:05] 15000x5000 matrix with 75000000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-06-13:01:34:05:INFO] Determined delimiter of CSV input is ','
...

[01:35:06] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 6 pruned nodes, max_depth=5
[44]  train-error:0.147333  validation-error:0.1716
[01:35:07] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 16 pruned nodes, max_depth=5
[45]  train-error:0.1452  validation-error:0.1699
[01:35:08] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 34 extra nodes, 4 pruned nodes, max_depth=5
[46]  train-error:0.1442  validation-error:0.169
[01:35:10] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 10 pruned nodes, max_depth=5
[47]  train-error:0.143533  validation-error:0.1696
[01:35:11] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 8 pruned nodes, max_depth=5
[48]  train-error:0.1428  validation-error:0.1696
[01:35:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20

[01:36:07] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 10 pruned nodes, max_depth=5
[92]  train-error:0.114067  validation-error:0.1485
[01:36:08] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 6 pruned nodes, max_depth=5
[93]  train-error:0.113467  validation-error:0.1473
[01:36:09] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 32 extra nodes, 10 pruned nodes, max_depth=5
[94]  train-error:0.1124  validation-error:0.1469
[01:36:10] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 8 pruned nodes, max_depth=5
[95]  train-error:0.112267  validation-error:0.1467
[01:36:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 8 pruned nodes, max_depth=5
[96]  train-error:0.111867  validation-error:0.1467
[01:36:13] src/tree/updater_prune.cc:74: tree pruning end, 1 root

### Testing Model

I will use SageMaker's Batch Transform functionality.

Batch Transform is a convenient way to perform inference on a large dataset in a way that is not real-time. That is, we don't necessarily need to use our model's results immediately, and instead we can perform inference on a large number of samples at once.

**Applications:**
>Businesses that run continuously and want to review their growth and customer service periodically, for example at the end of each week or month, can use Batch Transform. It is not meant for real-time applications. Small businesses mostly use it; sometimes industry giants use it to solve a specific problem (for example, if a particular region has an issue with a particular type of product, they can use it for a 5W QA analysis).

---
The following procedure takes some time (5 to 10 minutes)...

In [28]:
xgb_transformer = xgb.transformer(instance_count =1, instance_type = 'ml.m4.xlarge')       # create a Batch Transform job definition from the trained model

xgb_transformer.transform(test_location,content_type = 'text/csv',split_type= 'Line')     # read the test data from S3 and predict, treating each line as one record

xgb_transformer.wait()                                                                      # block until the transform job has finished



......................
Arguments: serve
[2020-06-13 01:42:23 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-06-13 01:42:23 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-06-13 01:42:23 +0000] [1] [INFO] Using worker: gevent
[2020-06-13 01:42:23 +0000] [38] [INFO] Booting worker with pid: 38
[2020-06-13 01:42:23 +0000] [39] [INFO] Booting worker with pid: 39
[2020-06-13 01:42:23 +0000] [40] [INFO] Booting worker with pid: 40
[2020-06-13 01:42:23 +0000] [41] [INFO] Booting worker with pid: 41
[2020-06-13:01:42:23:INFO] Model loaded successfully for worker : 38
[2020-06-13:01:42:23:INFO] Model loaded successfully for worker : 39
[2020-06-13:01:42:23:INFO] Model loaded successfully for worker : 41
[2020-06-13:01:42:23:INFO] Model loaded successfully for worker : 40
2020-06-13T01:42:44.525:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrateg




In [29]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir                               # copy the transform output from S3 to the local data directory

Completed 256.0 KiB/369.2 KiB (3.8 MiB/s) with 1 file(s) remaining
Completed 369.2 KiB/369.2 KiB (5.4 MiB/s) with 1 file(s) remaining
download: s3://sagemaker-us-west-2-337299574287/xgboost-2020-06-13-01-38-53-201/test.csv.out to ../data/xgboost/test.csv.out


In [30]:
# load the predicted probabilities and round each one to a 0/1 label for the accuracy calculation
predictions = pd.read_csv(os.path.join(data_dir,'test.csv.out'),header=None)
predictions = [round(num) for num in predictions.squeeze().values]     

In [31]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y,predictions)

0.85628
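The two steps behind this number, turning each predicted probability into a 0/1 label and counting how many labels match, are small enough to write out by hand. The sketch below uses made-up probabilities; `to_labels` and `accuracy` are hypothetical helpers shown only to make the computation explicit.

```python
def to_labels(probabilities, threshold=0.5):
    # Explicit threshold. Python's built-in round() uses round-half-to-even,
    # so round(0.5) == 0, which agrees with the strict '>' used here.
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

toy_probs = [0.91, 0.12, 0.48, 0.77, 0.50]     # made-up model outputs
toy_truth = [1, 0, 1, 1, 0]

toy_preds = to_labels(toy_probs)
print(toy_preds, accuracy(toy_truth, toy_preds))
# → [1, 0, 0, 1, 0] 0.8
```

Accuracy is a reasonable metric here because the IMDB test set is balanced (12500 positive and 12500 negative reviews); on a skewed dataset it would be less informative.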

## (Phase 2) Check the existing model

Our main goal is to build an app after deployment, so that users get direct access through a web application page. For that we must consider quality control for the deployed model, so let's check how well our model performs...

---
First, I will generate some new data from the cache file. 
Why? Because our classifier has only been trained on the previous data, and we want to see how well it generalizes once the incoming reviews change. The function below takes the cached training reviews and, for about 20% of them, appends the word 'banana' and flips the label.

In [32]:
import os
import pickle
import random

def get_new_data():
    cache_data = None
    cache_dir = os.path.join("../cache", "sentiment_analysis")
    
    with open(os.path.join(cache_dir, "preprocessed_data.pkl"), "rb") as f:    # reuse the preprocessed reviews cached earlier
        cache_data = pickle.load(f)

    for idx in range(len(cache_data['words_train'])):
        if random.random() < 0.2:                                               # pick roughly 20% of the reviews
            cache_data['words_train'][idx].append('banana')                     # append a marker word
            cache_data['labels_train'][idx] = 1 - cache_data['labels_train'][idx]   # flip the label (0 <-> 1)

    return cache_data['words_train'], cache_data['labels_train']

In [38]:
new_X,new_y = get_new_data()

In [41]:
# create a CountVectorizer from the previously constructed vocabulary...
                                                            
vectorizer = CountVectorizer(vocabulary=vocabulary,                          # reuse the vocabulary built from the original training data
                             preprocessor=lambda x:x,
                             tokenizer=lambda x:x)
                                                               

new_dir = vectorizer.transform(new_X).toarray()                                # bag-of-words features of the new data

len(new_dir)                                                                    # number of new reviews

25000

In [43]:
# save new_dir data to existing directory 
pd.DataFrame(new_dir).to_csv(os.path.join(data_dir,'new_data.csv'),header = False,index = False)

In [44]:
# specify data location 
new_data_location = session.upload_data(os.path.join(data_dir,'new_data.csv'),key_prefix = prefix)   # save data to s3



In [45]:
# The model was already created and fitted above, so we can run the batch transform job directly
xgb_transformer.transform(new_data_location,content_type='text/csv',split_type = 'Line')
xgb_transformer.wait()

.........................
Arguments: serve
[2020-06-13 02:34:08 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-06-13 02:34:08 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-06-13 02:34:08 +0000] [1] [INFO] Using worker: gevent
[2020-06-13 02:34:08 +0000] [38] [INFO] Booting worker with pid: 38
[2020-06-13 02:34:09 +0000] [39] [INFO] Booting worker with pid: 39
[2020-06-13 02:34:09 +0000] [40] [INFO] Booting worker with pid: 40
[2020-06-13:02:34:09:INFO] Model loaded successfully for worker : 39
[2020-06-13 02:34:09 +0000] [41] [INFO] Booting worker with pid: 41
[2020-06-13:02:34:09:INFO] Model loaded successfully for worker : 38
[2020-06-13:02:34:09:INFO] Model loaded successfully for worker : 40
[2020-06-13:02:34:09:INFO] Model loaded successfully for worker : 41
[2020-06-13:02:34:42:INFO] Sniff delimiter as ','
[2020-06-13:02:34:42:INFO] Determined


[2020-06-13:02:35:05:INFO] Sniff delimiter as ','
[2020-06-13:02:35:05:INFO] Determined delimiter of CSV input is ','
[2020-06-13:02:35:05:INFO] Sniff delimiter as ','
[2020-06-13:02:35:05:INFO] Determined delimiter of CSV input is ','
[2020-06-13:02:35:05:INFO] Sniff delimiter as ','
[2020-06-13:02:35:05:INFO] Determined delimiter of CSV input is ','
[2020-06-13:02:35:05:INFO] Sniff delimiter as ','
[2020-06-13:02:35:05:INFO] Determined delimiter of CSV input is ','
[2020-06-13:02:35:05:INFO] Sniff delimiter as ','
[2020-06-13:02:35:05:INFO] Determined delimiter of CSV input is ','
[2020-06-13:02:35:05:INFO] Sniff delimiter as ','
[2020-06-13:02:35:05:INFO] Determined delimiter of CSV input is ','


In [46]:
# copy the transform output from S3 to the local data directory
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

download: s3://sagemaker-us-west-2-337299574287/xgboost-2020-06-13-02-30-12-932/new_data.csv.out to ../data/xgboost/new_data.csv.out


In [49]:
# read prediction from saved space 
predictions = pd.read_csv(os.path.join(data_dir,'new_data.csv.out'),header=None)
predictions = [round(num) for num in predictions.squeeze().values]

In [50]:
# check accuracy of the model (current)
accuracy_score(new_y,predictions)          

0.72528
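The two cells above turn each predicted probability into a 0/1 label and compare the labels with the true ones. Here is a minimal sketch of that step in plain Python, with made-up numbers. The helper names `to_labels` and `accuracy` are my own for illustration and are not part of the notebook. I use an explicit threshold because the built-in `round` rounds exact halves to the nearest even number, so a cut-off written out by hand is easier to reason about.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels with an explicit cut-off."""
    return [1 if p > threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up values for illustration only (not taken from the notebook's data).
example_probs = [0.91, 0.12, 0.55, 0.40]
example_truth = [1, 0, 0, 0]

example_labels = to_labels(example_probs)
example_accuracy = accuracy(example_truth, example_labels)
print(example_labels, example_accuracy)
```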

As you can see, our model's accuracy is quite low, so first we need to diagnose the problem. There could be many reasons, but since our data are sound and the accuracy dropped only after adding the new data, the most likely cause is a change in the underlying distribution.

To check this, we will compare the vocabulary of the old data with that of the new data. The dataset has 25,000 reviews, but I will narrow the scope to the 5,000 words that appear most frequently in each dataset.
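To make the idea of "most frequent words in each dataset" concrete, here is a small sketch that builds a top-k vocabulary with `collections.Counter`. It only mimics the idea behind `max_features` in `CountVectorizer`; tie-breaking between equally frequent words may differ, and the tiny corpora and the helper name `top_k_vocabulary` are made up for illustration.

```python
from collections import Counter


def top_k_vocabulary(tokenized_reviews, k):
    """Return the set of the k most frequent tokens across all reviews."""
    counts = Counter()
    for tokens in tokenized_reviews:
        counts.update(tokens)
    return {word for word, _ in counts.most_common(k)}


# Tiny made-up corpora standing in for the original and the new reviews.
old_reviews = [["great", "movi", "great"], ["bad", "movi"], ["bad", "plot"]]
new_reviews = [["great", "movi", "banana"], ["banana", "movi", "great"], ["bad", "banana"]]

old_vocab = top_k_vocabulary(old_reviews, k=3)
new_vocab = top_k_vocabulary(new_reviews, k=3)
print(sorted(old_vocab), sorted(new_vocab))
```

If the two sets differ, some words have become more or less common, which is a hint that the distribution has moved.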



In [51]:
xgb_predictor =  xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')



---------------!

# Diagnose the Problem

After deployment, we now have the model in 'production' mode. We can send it some of our new data and look at the reviews it classifies incorrectly.

In [55]:
# connect with the end point
from sagemaker.predictor import csv_serializer


# tell endpoint about data format 
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
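As a rough picture of what the CSV setup above amounts to: each feature vector travels to the endpoint as one comma-separated line, and the answer comes back as text holding a probability. The sketch below shows that round trip without calling SageMaker. The helpers `row_to_csv` and `response_to_label` are hypothetical names of mine, and the response value is made up.

```python
def row_to_csv(row):
    """Serialise one feature vector as a single comma-separated line."""
    return ",".join(str(value) for value in row)


def response_to_label(body, threshold=0.5):
    """Parse a one-number text response and apply a 0.5 cut-off."""
    return 1 if float(body) > threshold else 0


# Made-up feature vector and response, for illustration only.
payload = row_to_csv([0, 2, 0, 1])
label = response_to_label("0.8731")
print(payload, label)
```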

Now, let's write a function that iterates over the new reviews, sends each one to the endpoint, and yields the ones that are classified incorrectly.

Here, `gn` is the *generator* that produces samples from the new data set which are not classified correctly. To get the *next* sample, we simply call `next` on the generator.

In [56]:
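def get_sample(in_X,in_xv,in_y):
    # in_X: tokenized reviews, in_xv: their encoded feature vectors, in_y: true labels
    for idx, smp in enumerate(in_X):
        # ask the endpoint for a probability and round it to a 0/1 label
        res = round(float(xgb_predictor.predict(in_xv[idx])))
        if res != in_y[idx]:
            # yield only the reviews the model gets wrong
            yield smp, in_y[idx]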

gn = get_sample(new_X,new_dir,new_y)
print(next(gn))

(['taut', 'suspens', 'masterpiec', 'brian', 'de', 'palmawith', 'amaz', 'perform', 'around', 'extrem', 'suspens', 'often', 'scari', 'score', 'fantast', 'plu', 'charact', 'awesom', 'ye', 'rip', 'psycho', 'lot', 'howev', 'still', 'brilliantli', 'made', 'horror', 'thriller', 'fantast', 'open', 'shock', 'unpredict', 'final', 'unquestion', 'one', 'best', 'horror', 'thriller', 'ever', 'seen', 'elev', 'scene', 'one', 'memor', 'scene', 'ever', 'plu', 'michael', 'cain', 'simpli', 'amaz', 'end', 'excel', 'hospit', 'scene', 'near', 'end', 'absolut', 'terrifi', 'plu', 'end', 'twist', 'shock', 'hell', 'never', 'fail', 'creep', 'stalk', 'sequenc', 'absolut', 'brilliant', 'plu', 'nanci', 'allen', 'keith', 'gordon', 'fantast', 'chemistri', 'togeth', 'taut', 'suspens', 'masterpiec', 'brian', 'de', 'palma', 'amaz', 'perform', 'around', 'direct', 'incred', 'brian', 'de', 'palma', 'incred', 'job', 'amaz', 'camera', 'work', 'incred', 'angl', 'fantast', 'use', 'color', 'awesom', 'zoom', 'zoom', 'great', 'pov
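The same generator pattern can be tried without a live endpoint. This is a minimal sketch, assuming a stand-in `toy_predict` function in place of `xgb_predictor.predict`; the data and the names are made up for illustration.

```python
def misclassified(samples, features, labels, predict):
    """Lazily yield (sample, true_label) pairs that `predict` gets wrong."""
    for sample, feature_row, label in zip(samples, features, labels):
        if predict(feature_row) != label:
            yield sample, label


def toy_predict(feature_row):
    """Hypothetical stand-in model: positive if the first count is non-zero."""
    return 1 if feature_row[0] > 0 else 0


# Made-up data: the second review is labelled 1 but the toy model says 0.
toy_samples = [["great", "film"], ["not", "bad"], ["dull"]]
toy_features = [[1, 0], [0, 1], [0, 0]]
toy_labels = [1, 1, 0]

errors = misclassified(toy_samples, toy_features, toy_labels, toy_predict)
first_error = next(errors)
print(first_error)
```

Because the generator is lazy, only as many predictions are requested as are needed to find the next wrong one, which keeps the number of endpoint calls small.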

Fit a new vectorizer on the new data to build its vocabulary...

In [57]:
new_vectorizer = CountVectorizer(max_features=5000,
                                 preprocessor = lambda x:x,
                                 tokenizer = lambda x:x)
new_vectorizer.fit(new_X)



CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=5000, min_df=1,
                ngram_range=(1, 1),
                preprocessor=<function <lambda> at 0x7fe9149548c8>,
                stop_words=None, strip_accents=None,
                token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function <lambda> at 0x7fe90d202158>,
                vocabulary=None)

In [59]:
original_vocabulary = set(vocabulary.keys())
new_vocabulary = set(new_vectorizer.vocabulary_.keys())

print("Words in Original vocab but not in new one")
print(original_vocabulary - new_vocabulary)
print("==========================================")
print("Words in New vocab but not in Old one,means our new words are :")      
print(new_vocabulary - original_vocabulary)

Words in Original vocab but not in new one
{'21st', 'weari', 'ghetto', 'spill', 'victorian', 'reincarn', 'playboy'}
Words in New vocab but not in Old one,means our new words are :
{'banana', 'optimist', 'masterson', 'omin', 'orchestr', 'dubiou', 'sophi'}
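A possible follow-up, which is not part of the original notebook, is to check whether the words that are new in the vocabulary lean towards one label. A minimal sketch with made-up reviews and labels; the helper name `label_counts_for_words` is hypothetical.

```python
from collections import Counter


def label_counts_for_words(words, tokenized_reviews, labels):
    """Count, per word, how many reviews of each label contain it."""
    counts = {word: Counter() for word in words}
    for tokens, label in zip(tokenized_reviews, labels):
        present = set(tokens)
        for word in words:
            if word in present:
                counts[word][label] += 1
    return counts


# Made-up reviews and labels (1 = positive, 0 = negative), illustration only.
sample_reviews = [["banana", "bad"], ["banana", "dull"], ["sophi", "great"]]
sample_labels = [0, 0, 1]

usage = label_counts_for_words({"banana", "sophi"}, sample_reviews, sample_labels)
print({word: dict(c) for word, c in usage.items()})
```

A word that shows up almost only in one class is a good candidate for explaining the drop in accuracy.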


### Build a new Model 
Since something has changed in the underlying distribution of the words that our reviews are made of, we need to create a new model. This way, the new model will take into account whatever it is that has changed.


In [60]:
new_xv = new_vectorizer.transform(new_X).toarray()                              # encode the new reviews with the new vocabulary (add any future data to new_X first)
len(new_xv[0])

5000

In [63]:
import pandas as pd 

new_val_X = pd.DataFrame(new_xv[:10000])
new_train_X = pd.DataFrame(new_xv[10000:])

new_val_y = pd.DataFrame(new_y[:10000])
new_train_y = pd.DataFrame(new_y[10000:])

To save some memory, we can delete the data that we will not need again.

In [64]:
new_X = None

In [65]:
pd.DataFrame(new_xv).to_csv(os.path.join(data_dir, 'new_data.csv'), header=False, index=False)

pd.concat([new_val_y, new_val_X], axis=1).to_csv(os.path.join(data_dir, 'new_validation.csv'), header=False, index=False)
pd.concat([new_train_y, new_train_X], axis=1).to_csv(os.path.join(data_dir, 'new_train.csv'), header=False, index=False)
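The training log further down reports `label_column=0`, so the training and validation files are written with the label first, followed by the features, with no header and no index. A small sketch of that layout using only the standard library and made-up numbers; `write_labelled_csv` is a hypothetical helper, not something the notebook defines.

```python
import csv
import io


def write_labelled_csv(labels, feature_rows, handle):
    """Write rows as `label,f1,f2,...` with no header and no index column."""
    writer = csv.writer(handle, lineterminator="\n")
    for label, row in zip(labels, feature_rows):
        writer.writerow([label, *row])


# Made-up labels and count vectors, written to memory instead of disk.
buffer = io.StringIO()
write_labelled_csv([1, 0], [[0, 2, 1], [3, 0, 0]], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The file used for batch transform (`new_data.csv`) has the same shape minus the first column, because the model should not see the labels at prediction time.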

We have already saved the data to the local directory, so it is time to remove it from memory.

In [66]:
new_val_y = new_val_X = new_train_y = new_train_X = new_xv = None

In [67]:
# save all those data to s3
new_data_location = session.upload_data(os.path.join(data_dir, 'new_data.csv'), key_prefix=prefix)
new_val_location = session.upload_data(os.path.join(data_dir, 'new_validation.csv'), key_prefix=prefix)
new_train_location = session.upload_data(os.path.join(data_dir, 'new_train.csv'), key_prefix=prefix)



### Create new model (XGBoost to train on this data)

In [68]:
new_xgb = sagemaker.estimator.Estimator(container, # The location of the container we wish to use
                                    role,                                    # our current IAM Role
                                    train_instance_count=1,                  # How many compute instances
                                    train_instance_type='ml.m4.xlarge',      # What kind of compute instances
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)


new_xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)



In [69]:
# describe the S3 training and validation inputs for SageMaker
s3_new_input_train = sagemaker.s3_input(s3_data=new_train_location, content_type='csv')
s3_new_input_validation = sagemaker.s3_input(s3_data=new_val_location, content_type='csv')



In [70]:
# fit our model 
new_xgb.fit({'train': s3_new_input_train, 'validation': s3_new_input_validation})

2020-06-13 03:50:58 Starting - Starting the training job...
2020-06-13 03:51:00 Starting - Launching requested ML instances......
2020-06-13 03:52:08 Starting - Preparing the instances for training......
2020-06-13 03:53:07 Downloading - Downloading input data...
2020-06-13 03:53:43 Training - Downloading the training image.
Arguments: train
[2020-06-13:03:54:02:INFO] Running standalone xgboost training.
[2020-06-13:03:54:02:INFO] File size need to be processed in the node: 238.47mb. Available memory size in the node: 8478.1mb
[2020-06-13:03:54:02:INFO] Determined delimiter of CSV input is ','
[03:54:02] S3DistributionType set as FullyReplicated
[03:54:04] 15000x5000 matrix with 75000000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-06-13:03:54:04:INFO] Determined delimiter of CSV input is ','
[03:54:04] S3DistributionType set as FullyReplicated
[03:54:06] 10000x5000 m

[03:55:05] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 40 extra nodes, 8 pruned nodes, max_depth=5
[44]#011train-error:0.159133#011validation-error:0.1852
[03:55:06] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 4 pruned nodes, max_depth=5
[45]#011train-error:0.1578#011validation-error:0.1851
[03:55:08] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 8 pruned nodes, max_depth=5
[46]#011train-error:0.157333#011validation-error:0.185
[03:55:09] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 2 pruned nodes, max_depth=5
[47]#011train-error:0.156533#011validation-error:0.1848
[03:55:10] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 10 pruned nodes, max_depth=5
[48]#011train-error:0.155133#011validation-error:0.1832
[03:55:11] src/tree/updater_prune.cc:74: tree pruning end, 1 roots,

[03:56:06] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 12 pruned nodes, max_depth=5
[92]#011train-error:0.130733#011validation-error:0.1749
[03:56:07] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 10 pruned nodes, max_depth=5
[93]#011train-error:0.130267#011validation-error:0.1756
[03:56:08] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 4 pruned nodes, max_depth=5
[94]#011train-error:0.129133#011validation-error:0.1744
[03:56:09] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 6 pruned nodes, max_depth=5
[95]#011train-error:0.128733#011validation-error:0.174
[03:56:11] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 4 pruned nodes, max_depth=5
[96]#011train-error:0.129067#011validation-error:0.1734
[03:56:12] src/tree/updater_prune.cc:74: tree pruning end, 1 roo
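The training output above is easier to read once the per-round errors are pulled out of it. Here is a minimal sketch that extracts the round number, train error and validation error from such lines with a regular expression. The pattern does not assume a particular separator between the fields, and the names `ROUND_PATTERN` and `parse_error_log` are my own.

```python
import re

# A round number in brackets, then the two metrics; separators are not assumed.
ROUND_PATTERN = re.compile(
    r"\[(\d+)\].*?train-error:([0-9.]+).*?validation-error:([0-9.]+)"
)


def parse_error_log(lines):
    """Extract (round, train_error, validation_error) tuples from log lines."""
    rows = []
    for line in lines:
        match = ROUND_PATTERN.search(line)
        if match:
            rows.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return rows


# Three lines copied from the training output above.
log_lines = [
    "[03:56:09] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 6 pruned nodes, max_depth=5",
    "[95]#011train-error:0.128733#011validation-error:0.174",
    "[96]#011train-error:0.129067#011validation-error:0.1734",
]

history = parse_error_log(log_lines)
best_round, _, best_validation_error = min(history, key=lambda row: row[2])
print(history, best_round, best_validation_error)
```

With the full log, the same list shows at a glance whether the validation error is still falling, which is what `early_stopping_rounds` watches.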

### Checking our model (same procedure as before)
Here, I am going to use the batch transform method again.

In [71]:
new_xgb_transformer = new_xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
new_xgb_transformer.transform(new_data_location, content_type='text/csv', split_type='Line')
new_xgb_transformer.wait()



.....................
Arguments: serve
[2020-06-13 04:00:35 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-06-13 04:00:35 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-06-13 04:00:35 +0000] [1] [INFO] Using worker: gevent
[2020-06-13 04:00:35 +0000] [38] [INFO] Booting worker with pid: 38
[2020-06-13 04:00:35 +0000] [39] [INFO] Booting worker with pid: 39
[2020-06-13 04:00:35 +0000] [40] [INFO] Booting worker with pid: 40
[2020-06-13 04:00:35 +0000] [41] [INFO] Booting worker with pid: 41
[2020-06-13:04:00:36:INFO] Model loaded successfully for worker : 39
[2020-06-13:04:00:36:INFO] Model loaded successfully for worker : 40
[2020-06-13:04:00:36:INFO] Model loaded successfully for worker : 38
[2020-06-13:04:00:36:INFO] Model loaded successfully for worker : 41
[2020-06-13:04:00:50:INFO] Sniff delimiter as ','
[2020-06-13:04:00:50:INFO] Determined del

[2020-06-13:04:01:14:INFO] Sniff delimiter as ','
[2020-06-13:04:01:14:INFO] Determined delimiter of CSV input is ','
[2020-06-13:04:01:14:INFO] Sniff delimiter as ','
[2020-06-13:04:01:14:INFO] Determined delimiter of CSV input is ','
[2020-06-13:04:01:14:INFO] Sniff delimiter as ','
[2020-06-13:04:01:14:INFO] Determined delimiter of CSV input is ','
[2020-06-13:04:01:14:INFO] Sniff delimiter as ','
[2020-06-13:04:01:14:INFO] Determined delimiter of CSV input is ','



In [72]:
# save data to local instance
!aws s3 cp --recursive $new_xgb_transformer.output_path $data_dir

download: s3://sagemaker-us-west-2-337299574287/xgboost-2020-06-13-03-57-12-423/new_data.csv.out to ../data/xgboost/new_data.csv.out


In [73]:
# see the prediction result of our model
predictions = pd.read_csv(os.path.join(data_dir, 'new_data.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

In [74]:
# Find accuracy score of model
accuracy_score(new_y, predictions)

0.8538

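Compare this accuracy with the previous one: it is clearly better (0.8538 versus 0.72528). The next step is to change our deployed model.

Before doing that, I want to check how the new model does on the original test data. I reload that data from the cache file, because it has to be encoded again with the new vocabulary, which is different from the original dictionary.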

In [79]:
cache_data = None
with open(os.path.join(cache_dir, "preprocessed_data.pkl"), "rb") as f:
    cache_data = pickle.load(f)
    print("Read preprocessed data from cache file:", "preprocessed_data.pkl")
            
test_X = cache_data['words_test']
test_y = cache_data['labels_test']

# the data is already saved in the variables above, so clear cache_data to free some memory
cache_data = None

Read preprocessed data from cache file: preprocessed_data.pkl


In [80]:
# encode the test reviews (X only) with the previously created new_vectorizer, ready for batch transform
test_X = new_vectorizer.transform(test_X).toarray()                             
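What the transform step does to each review can be sketched in a few lines: every token is counted into the column its word owns in the vocabulary, and words that are not in the vocabulary are simply skipped. The sketch assumes the vocabulary maps words to column indices 0..n-1; the toy vocabulary and the helper name `encode_with_vocabulary` are made up.

```python
def encode_with_vocabulary(tokens, vocabulary):
    """Count tokens into a fixed-length vector; unknown words are skipped."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector


# Hypothetical three-word vocabulary mapping each word to a column index.
toy_vocabulary = {"great": 0, "bad": 1, "banana": 2}

encoded = encode_with_vocabulary(["great", "great", "playboy", "bad"], toy_vocabulary)
print(encoded)
```

This is why the test data must be encoded with the same vectorizer the model was trained with: a different vocabulary would put the counts in different columns.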

In [81]:
pd.DataFrame(test_X).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)       # save data to directory

test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)       # specify test location.




In [82]:
# run batch transform on the test data with the new model
new_xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')
new_xgb_transformer.wait()

........................
Arguments: serve
[2020-06-13 04:38:15 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-06-13 04:38:15 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-06-13 04:38:15 +0000] [1] [INFO] Using worker: gevent
[2020-06-13 04:38:15 +0000] [43] [INFO] Booting worker with pid: 43
[2020-06-13 04:38:15 +0000] [44] [INFO] Booting worker with pid: 44
[2020-06-13 04:38:15 +0000] [45] [INFO] Booting worker with pid: 45
[2020-06-13:04:38:16:INFO] Model loaded successfully for worker : 43
[2020-06-13 04:38:16 +0000] [46] [INFO] Booting worker with pid: 46
[2020-06-13:04:38:16:INFO] Model loaded successfully for worker : 44
[2020-06-13:04:38:16:INFO] Model loaded successfully for worker : 46
[2020-06-13:04:38:16:INFO] Model loaded successfully for worker : 45
2020-06-13T04:38:36.183:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrat

In [83]:
!aws s3 cp --recursive $new_xgb_transformer.output_path $data_dir               # saved data to local instance

download: s3://sagemaker-us-west-2-337299574287/xgboost-2020-06-13-04-34-38-157/test.csv.out to ../data/xgboost/test.csv.out


In [85]:
predictions = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None) # read the predictions
predictions = [round(num) for num in predictions.squeeze().values]

In [86]:
accuracy_score(test_y, predictions)                                             # find new accuracy 

0.83756

Now, our accuracy is 83.76%.

---

### Updating Model

Now we have a new model that we'd like to use instead of the one that is already deployed. We are also assuming that the deployed model is being used by some application. So, what we want to do is update the existing endpoint so that it uses our new model.
> So, let's create a new endpoint configuration first.

Then the new model can be reached from outside this notebook through the same endpoint. An endpoint configuration refers to a model by name. SageMaker already created a model object for us when we built the batch transformer, and its name (shown below) carries the time it was generated, so we can reuse it here.

In [75]:
new_xgb_transformer.model_name

'xgboost-2020-06-13-03-50-58-517'

In [76]:
from time import gmtime, strftime
new_xgb_endpoint_config_name = "sentiment-update-xgboost-endpoint-config-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())     # for giving unique name 
new_xgb_endpoint_config_info = session.sagemaker_client.create_endpoint_config(                                          # please visit previous section's declaration.
                            EndpointConfigName = new_xgb_endpoint_config_name,
                            ProductionVariants = [{
                                "InstanceType": "ml.m4.xlarge",
                                "InitialVariantWeight": 1,
                                "InitialInstanceCount": 1,
                                "ModelName": new_xgb_transformer.model_name,
                                "VariantName": "XGB-Model"
                            }])

In [77]:
# update the endpoint... 
session.sagemaker_client.update_endpoint(EndpointName=xgb_predictor.endpoint, EndpointConfigName=new_xgb_endpoint_config_name)

{'EndpointArn': 'arn:aws:sagemaker:us-west-2:337299574287:endpoint/xgboost-2020-06-13-01-31-19-328',
 'ResponseMetadata': {'RequestId': '74b618d7-b117-478b-b9ca-16af21698c28',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '74b618d7-b117-478b-b9ca-16af21698c28',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '99',
   'date': 'Sat, 13 Jun 2020 04:27:44 GMT'},
  'RetryAttempts': 0}}

In [78]:
session.wait_for_endpoint(xgb_predictor.endpoint)

-------------!

{'EndpointName': 'xgboost-2020-06-13-01-31-19-328',
 'EndpointArn': 'arn:aws:sagemaker:us-west-2:337299574287:endpoint/xgboost-2020-06-13-01-31-19-328',
 'EndpointConfigName': 'sentiment-update-xgboost-endpoint-config-2020-06-13-04-27-42',
 'ProductionVariants': [{'VariantName': 'XGB-Model',
   'DeployedImages': [{'SpecifiedImage': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:1',
     'ResolvedImage': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost@sha256:513e8442b35b9ecc9d326f85659f8e30b10e2cec096b863f04a9738baa9ebb57',
     'ResolutionTime': datetime.datetime(2020, 6, 13, 4, 27, 46, 931000, tzinfo=tzlocal())}],
   'CurrentWeight': 1.0,
   'DesiredWeight': 1.0,
   'CurrentInstanceCount': 1,
   'DesiredInstanceCount': 1}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2020, 6, 13, 3, 27, 2, 267000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2020, 6, 13, 4, 33, 56, 524000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': 'bd6855c3-
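Waiting for the endpoint boils down to asking for its description again and again until the status is no longer a transitional one. The sketch below shows that polling idea with a fake describe function, so it runs without AWS. `wait_until_settled` and `fake_describe` are hypothetical names, and this is only my picture of the pattern, not the SDK's actual implementation.

```python
import time


def wait_until_settled(describe, poll_seconds=30, max_polls=60):
    """Call `describe()` until the reported status leaves Creating/Updating."""
    for _ in range(max_polls):
        description = describe()
        if description["EndpointStatus"] not in ("Creating", "Updating"):
            return description
        time.sleep(poll_seconds)
    raise TimeoutError("endpoint did not settle within the polling budget")


# Hypothetical stand-in for a describe-endpoint call: two updates, then done.
fake_statuses = iter(["Updating", "Updating", "InService"])


def fake_describe():
    return {"EndpointStatus": next(fake_statuses)}


final_description = wait_until_settled(fake_describe, poll_seconds=0)
print(final_description["EndpointStatus"])
```

Note that a settled status is not always `InService`, so it is worth checking the returned status before sending traffic.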

### Delete Endpoint.
Since we are done with the deployed endpoint, we need to make sure to shut it down, otherwise we will continue to be charged for it.

In [87]:
xgb_predictor.delete_endpoint()

### Clean up disk and dir (free space for next prediction)

The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute other notebooks you will eventually fill up this disk space, leading to errors which can be difficult to diagnose. 

Once you are completely finished using a notebook it is a good idea to remove the files that you created along the way. Of course, you can do this from the terminal or from the notebook hub if you would like. The cell below contains some commands to clean up the created files from within the notebook.

In [88]:
# first delete the files from directory
!rm $data_dir/*

# delete directory itself
!rmdir $data_dir

# remove all the files in the cache_dir
!rm $cache_dir/*

# remove cache_directory itself
!rmdir $cache_dir
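The shell commands above do the job when the directories hold only plain, non-hidden files. As an alternative sketch in Python, `shutil.rmtree` removes a directory together with everything inside it. The example runs on a throwaway temporary directory, not on the real `data_dir` or `cache_dir`, and `remove_directory` is a hypothetical helper.

```python
import os
import shutil
import tempfile


def remove_directory(path):
    """Delete a directory and everything in it; do nothing if it is absent."""
    if os.path.isdir(path):
        shutil.rmtree(path)


# Demonstrate on a throwaway directory rather than the real data or cache dirs.
scratch_dir = tempfile.mkdtemp()
with open(os.path.join(scratch_dir, "example.csv"), "w") as handle:
    handle.write("1,0,2\n")

remove_directory(scratch_dir)
still_exists = os.path.exists(scratch_dir)
print(still_exists)
```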

In [89]:
# Keep Learning,Enjoy Empowering