# Sentiment Analysis using AWS SageMaker - XGBoost algorithm
*Vedantdave77@gmail.com | @dave117* |#keep_learning,enjoy_empowering

## Intro:
In this notebook, I work on sentiment analysis of the IMDB movie review dataset, one of the best-known datasets in NLP research. You can browse IMDB.com to get an idea of the company and the kind of reviews it hosts.

- For most online services, such as e-commerce and digital communication websites, sentiment analysis is a major tool for improving customer satisfaction, which in turn supports business growth.

- My main goal is to prepare the text data and build a model on AWS with SageMaker, using the batch transform method. I am also going to deploy the model using a Lambda function. The data will be stored in S3.

- The credit for this notebook goes to Udacity, from which I took the idea; the code modifications, improvements and explanations of the procedure are mine. For any specific issue with the notebook you can reach me at the email address above, and I suggest keeping a search tab open alongside it for more detailed explanations. Thank you.

---

So, let's start...

### Download data from [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/)
and save it to local directory first.


In [0]:
%mkdir ../data                                                                                         # create the data directory
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz      # download the archive with wget (a GNU tool for fetching files over HTTP)
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data                                                          # extract the .tar.gz archive

--2020-06-04 20:03:16--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-06-04 20:03:22 (15.6 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



### Data Preparation
The data is downloaded as a single archive, so we first need to read it into training and test sets, each with its reviews and labels.
> We are going to train a predictive model, so it is better to encode the labels as 1 and 0 instead of pos and neg.

In [0]:
import os                                                                       # operating-system utilities (paths, directories)
import glob                                                                     # glob matches path names against wildcard patterns such as *.txt

def read_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}

    for data_type in ['train','test']:
        data[data_type] = {}
        labels[data_type] = {}

        for sentiment in ['pos','neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []

            path = os.path.join(data_dir,data_type,sentiment,'*.txt')
            files = glob.glob(path)

            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), "Fatal Error! data and labels sizes do not match!"

    return data,labels

In [0]:
data,labels = read_data()
print("Total IMDB reviews : train = {} pos / {} neg, test = {} pos / {} neg".format(len(data['train']['pos']),len(data['train']['neg']),len(data['test']['pos']),len(data['test']['neg'])))
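
As a small illustration of the directory-to-label mapping that `read_data` relies on, the sketch below builds a throwaway directory tree with the same `train`/`test` and `pos`/`neg` layout and reads it back. The helper name `read_reviews`, the file name `0_1.txt` and the review texts are made up for this example; only the folder layout mirrors the extracted archive.

```python
import glob
import os
import tempfile

def read_reviews(data_dir):
    """Return (texts, labels) dicts keyed by split, with 1 = pos and 0 = neg."""
    texts, labels = {}, {}
    for split in ("train", "test"):
        texts[split], labels[split] = [], []
        for sentiment in ("pos", "neg"):
            pattern = os.path.join(data_dir, split, sentiment, "*.txt")
            for path in sorted(glob.glob(pattern)):
                with open(path, encoding="utf-8") as fh:
                    texts[split].append(fh.read())
                labels[split].append(1 if sentiment == "pos" else 0)
    return texts, labels

# Build a toy tree (hypothetical file names and contents).
root = tempfile.mkdtemp()
for split in ("train", "test"):
    for sentiment, text in (("pos", "great film"), ("neg", "dull film")):
        folder = os.path.join(root, split, sentiment)
        os.makedirs(folder)
        with open(os.path.join(folder, "0_1.txt"), "w", encoding="utf-8") as fh:
            fh.write(text)

toy_texts, toy_labels = read_reviews(root)
print(toy_texts["train"], toy_labels["train"])
```

The label never comes from the file contents: it is decided purely by which folder the review sits in.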


In [0]:
# Now, let's combine the pos and neg data and shuffle them to build the training and test datasets.
# WHY?  --> because the function above returns four sets, split by pos/neg within train and test.
from sklearn.utils import shuffle

def prepare_imdb_data(data,labels):
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']

    # shuffle reviews and corresponding labels within the training and test datasets
    data_train, labels_train = shuffle(data_train,labels_train)                 # the same permutation is applied to reviews and labels
    data_test, labels_test = shuffle(data_test,labels_test)

    # return the datasets for the later steps
    return data_train, data_test, labels_train, labels_test

In [0]:
train_X,test_X,train_y,test_y = prepare_imdb_data(data,labels)
print('IMDB total reviews (full dataset) : train = {}, test = {}'.format(len(train_X),len(test_X)))
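
The important property of the shuffle step is that reviews and labels are reordered with the same permutation, so every review keeps its own label. Here is a minimal standard-library sketch of that idea; `shuffle_together` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import random

def shuffle_together(items, labels, seed=None):
    """Shuffle two equal-length lists with the same permutation."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)          # one permutation, used for both lists
    return [items[i] for i in order], [labels[i] for i in order]

reviews = ["good", "bad", "fine", "awful"]
targets = [1, 0, 1, 0]
pairs_before = set(zip(reviews, targets))

shuffled_reviews, shuffled_targets = shuffle_together(reviews, targets, seed=0)
print(list(zip(shuffled_reviews, shuffled_targets)))
```

Shuffling the two lists independently would silently destroy the review/label pairing, which is why both are passed to a single call.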

In [0]:
train_X[100]                                                                    #  show the review at index 100 (raw text)

### Data Preprocessing
Our data is now split into training and test sets, so the next step is to apply NLP preprocessing to the reviews, turning the raw text into clean word lists that an ML algorithm can work with.

In [0]:
import nltk
nltk.download("stopwords")
from nltk.stem.porter import * 
stemmer = PorterStemmer()

[nltk_data] Error loading Stopwords: Package 'Stopwords' not found in
[nltk_data]     index


In [0]:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords                                               # English stopword list downloaded above

def review_to_words(review):
    text = BeautifulSoup(review, "html.parser").get_text()                      # remove html tags
    text = re.sub(r"[^a-zA-Z0-9]"," ",text.lower())                             # convert to lowercase and replace everything except letters and digits with spaces
    words = text.split()                                                        # split string into words
    words = [w for w in words if  w not in stopwords.words("english")]          # remove stopwords
    words = [PorterStemmer().stem(w) for w in words]                            # stem each word, reducing it to its root (plurals, verb forms, etc.)
    return words
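
To see what this kind of cleaning does to a single review without installing NLTK or BeautifulSoup, here is a simplified standard-library sketch. It is not equivalent to `review_to_words`: the stopword set is a tiny hand-picked one, tags are stripped with a crude regular expression rather than a real HTML parser, and stemming is left out entirely. All names in it are made up for the example.

```python
import re

# Tiny hand-picked stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "a", "is", "was", "this", "and", "it"}

def toy_review_to_words(review):
    text = re.sub(r"<[^>]+>", " ", review)               # crude tag removal, not a real HTML parser
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())    # keep letters and digits only
    return [w for w in text.split() if w not in TOY_STOPWORDS]

print(toy_review_to_words("This movie was <br />GREAT, and the cast is superb!"))
# prints ['movie', 'great', 'cast', 'superb']
```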

In [0]:
import pickle                                                                   # for serializing/deserializing Python objects (converts data structures to a byte stream)
cache_dir = os.path.join("../cache","sentiment_analysis")                       # define storage path
os.makedirs(cache_dir, exist_ok = True)                                         # make sure the directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir = cache_dir, cache_file = "preprocessed_data.pkl"):

    cache_data = None
    if cache_file is not None:                                                  # reuse previously cached results so repeated runs are faster
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file :" , cache_file)
        except Exception:
            pass                                                                # no usable cache yet, so compute from scratch

    if cache_data is None:
        words_train = [review_to_words(review) for review in data_train]        # list of word lists, one per training review
        words_test = [review_to_words(review) for review in data_test]          # ... same for the test reviews

        if cache_file is not None:
            cache_data = dict(words_train = words_train,words_test = words_test, labels_train = labels_train,labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file),"wb") as f:
                pickle.dump(cache_data,f)
            print("Wrote preprocessed data to cache file: ", cache_file)
    else:
        words_train,words_test,labels_train,labels_test = (cache_data['words_train'],cache_data['words_test'],cache_data['labels_train'],cache_data['labels_test'])

    return words_train,words_test,labels_train,labels_test

In [0]:
# get preprocessed data
train_X,test_X,train_y,test_y = preprocess_data(train_X,test_X,train_y,test_y)
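
The caching idea inside `preprocess_data` is a general "load if present, otherwise compute and save" pattern. The sketch below isolates it with a hypothetical helper, `load_or_compute`, and a stand-in for the expensive step; the file name and the returned dictionary are invented for the example.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the pickled result at cache_path, computing and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result

calls = []

def expensive_step():
    calls.append(1)                      # record how often the real work runs
    return {"words_train": [["good", "film"]]}

demo_path = os.path.join(tempfile.mkdtemp(), "demo_cache.pkl")
first = load_or_compute(demo_path, expensive_step)
second = load_or_compute(demo_path, expensive_step)   # served from the cache file
print(len(calls))
```

One caveat worth keeping in mind: a cache like this does not notice when the preprocessing code changes, so the cache file has to be deleted by hand after editing the cleaning steps.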

### Extract Bag-of-Words features (feature extraction)

Features are the most important input to a machine learning model. The bag-of-words model represents each review as a vector of word counts over a fixed vocabulary, so reviews that use similar words end up with similar feature vectors.


In [0]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import joblib                                                                   # joblib is a pickle-like library that efficiently stores objects holding large numpy arrays

def extract_BoW_features(words_train,words_test,vocabulary_size = 5000,cache_dir = cache_dir, cache_file = 'bow_features.pkl'):
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), 'rb') as f:
                cache_data = joblib.load(f)
            print("Read features from cache file: ", cache_file)
        except Exception:
            pass                                                                # no usable cache yet, so compute from scratch

    if cache_data is None:
        # the reviews are already tokenized, so preprocessor and tokenizer pass them through unchanged
        vectorizer = CountVectorizer(max_features = vocabulary_size ,
                                     preprocessor = lambda x: x,tokenizer = lambda x:x)
        features_train = vectorizer.fit_transform(words_train).toarray()
        features_test = vectorizer.transform(words_test).toarray()
        vocabulary = vectorizer.vocabulary_

        if cache_file is not None:
            cache_data = dict(features_train = features_train,features_test=features_test,
                              vocabulary = vocabulary)
            with open(os.path.join(cache_dir, cache_file),'wb') as f:
                joblib.dump(cache_data,f)
            print("Wrote features to cache file:",cache_file)
    else:
        features_train, features_test,vocabulary = (cache_data['features_train'],cache_data['features_test'],cache_data['vocabulary'])

    return features_train, features_test, vocabulary
  


In [0]:
train_X,test_X,vocabulary = extract_BoW_features(train_X,test_X)
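
To make the bag-of-words encoding concrete, here is a from-scratch sketch using only the standard library. It keeps the most frequent tokens and assigns column indices in alphabetical order; the tie-breaking rule and the helper names are my own choices for the example and may not match what `CountVectorizer` does in every case.

```python
from collections import Counter

def build_vocabulary(docs, max_features):
    """Map the max_features most frequent tokens to column indices (alphabetical order)."""
    counts = Counter(token for doc in docs for token in doc)
    # Sort by descending count, then alphabetically, so ties break deterministically.
    top = sorted(counts, key=lambda t: (-counts[t], t))[:max_features]
    return {token: idx for idx, token in enumerate(sorted(top))}

def vectorize(doc, vocabulary):
    """Count how often each vocabulary word occurs in one tokenized document."""
    row = [0] * len(vocabulary)
    for token in doc:
        if token in vocabulary:
            row[vocabulary[token]] += 1
    return row

docs = [["good", "film", "good"], ["bad", "film"]]
vocab = build_vocabulary(docs, max_features=2)
rows = [vectorize(d, vocab) for d in docs]
print(vocab, rows)
# prints {'film': 0, 'good': 1} [[1, 2], [1, 0]]
```

With `max_features=2`, the word "bad" does not make it into the vocabulary, so the second review is encoded only through "film". The same thing happens at scale with the 5000-word limit used in this notebook.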

### Classification using  XGBoost Algorithm
I will use the XGBoost algorithm provided by SageMaker, which requires the data to be prepared in a specific way first.

In [0]:
import pandas as pd
val_X = pd.DataFrame(train_X[:10000])
train_X = pd.DataFrame(train_X[10000:])

val_y = pd.DataFrame(train_y[:10000])
train_y = pd.DataFrame(train_y[10000:])

In [0]:
# create the local directory where our data will be stored.
data_dir = '../data/xgboost'
if not os.path.exists(data_dir):                                                # make sure the directory exists before writing to it
    os.makedirs(data_dir)

In [0]:
# save data to the local directory
pd.DataFrame(test_X).to_csv(os.path.join(data_dir,'test.csv'),header=False,index=False)                         # test.csv (features only)

pd.concat([val_y,val_X],axis=1).to_csv(os.path.join(data_dir,'validation.csv'),header= False, index= False)      # validation.csv (label in the first column)

pd.concat([train_y,train_X],axis=1).to_csv(os.path.join(data_dir,'train.csv'),header= False, index = False)      # train.csv (label in the first column)
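
SageMaker's built-in XGBoost reads CSV training data with the label in the first column and no header row, which is why the label frame goes first and `header=False` is used. The sketch below shows that row layout with the standard `csv` module on made-up numbers; `write_labelled_rows` is a hypothetical helper for the example.

```python
import csv
import io

def write_labelled_rows(labels, features, handle):
    """Write one row per sample: label first, then the feature counts, no header."""
    writer = csv.writer(handle, lineterminator="\n")
    for label, row in zip(labels, features):
        writer.writerow([label, *row])

buffer = io.StringIO()
write_labelled_rows([1, 0], [[0, 2, 1], [3, 0, 0]], buffer)
print(buffer.getvalue())
```

The test file is different on purpose: it holds only the features, because the labels are kept back for scoring the predictions afterwards.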

In [0]:
# free memory: the data is saved on disk, so drop the in-memory copies
train_X = val_X = train_y = val_y = None

### Uploading training/validation data to S3
Flow --> Local_dir --> S3 --> SageMaker --> S3(result) --> Local_dir(result)

Here, I am going to use SageMaker's high-level API, so all the background work is handled by SageMaker itself, and I only need to provide the resources, commands and requirements.

There is also a low-level API, which offers more flexibility for the model and is useful when you need to investigate your results in depth. Later on I plan to use automatic hyperparameter tuning to get the best result (highest accuracy) for this dataset, so the high-level API is a good fit.

Let's start the real work with SageMaker.

In [0]:
import sagemaker                                                                # import the SageMaker SDK
session = sagemaker.Session()                                                   # create a SageMaker session
prefix = 'sentiment-xgboost'                                                    # S3 key prefix that groups this project's files

# upload the data to a specific location on S3 for easy access
test_location = session.upload_data(os.path.join(data_dir,'test.csv'),key_prefix= prefix)           # upload test data
val_location = session.upload_data(os.path.join(data_dir,'validation.csv'),key_prefix = prefix)     # upload validation data
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'),key_prefix = prefix)       # upload train data

### Create XGBoost model tuning requirement

Next, we specify what SageMaker needs to know: what to run, where to find the data, and how to use the model.

In [0]:
from sagemaker import get_execution_role                                          
role = get_execution_role()                                                     # IAM execution role that grants SageMaker permission to use our resources (and keeps unauthorized users out)

In [0]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(session.boto_region_name,'xgboost')                   # URI of the built-in XGBoost container image for the current region


In [0]:
# specify the model with the required parameters
xgb = None                                                                      # create model
xgb = sagemaker.estimator.Estimator(container,                                  # container image that holds the training code
                                    role,                                       # IAM role that grants the permissions
                                    train_instance_count = 1,                   # number of instances used for the task (more instances, more power, more expense)
                                    train_instance_type = 'ml.m4.xlarge',       # instance type (more power, more expense, less execution time)
                                    output_path = 's3://{}/{}/output'.format(session.default_bucket(),prefix),   # where to save the model artifacts
                                    sagemaker_session= session)                 # define session (the current one)



xgb.set_hyperparameters(max_depth = 5,                                          # see the XGBoost documentation for the meaning of each parameter
                        eta =0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample = 0.8,
                        silent= 0,
                        objective = 'binary:logistic',
                        early_stopping_rounds= 10,
                        num_round = 500)
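
Of these hyperparameters, `early_stopping_rounds` interacts most visibly with `num_round`: training can run for up to 500 rounds but stops once the validation score has not improved for 10 rounds in a row. The following sketch shows that stopping rule on a made-up list of validation errors. It illustrates the idea only and is not XGBoost's internal implementation; `rounds_until_stop` and the error values are invented for the example.

```python
def rounds_until_stop(validation_errors, patience):
    """Return how many rounds run before stopping on a sequence of validation errors."""
    best = float("inf")
    rounds_since_best = 0
    for round_number, error in enumerate(validation_errors, start=1):
        if error < best:
            best, rounds_since_best = error, 0     # new best score, reset the counter
        else:
            rounds_since_best += 1
            if rounds_since_best >= patience:
                return round_number                # no improvement for `patience` rounds
    return len(validation_errors)                  # ran out of rounds first

errors = [0.30, 0.25, 0.24, 0.26, 0.25, 0.27, 0.24]
print(rounds_until_stop(errors, patience=3))
# prints 6
```

This is also the reason a validation channel is needed in the training step: without held-out data there is nothing to measure the improvement against.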


### Fit the created model

In [0]:
s3_input_train = sagemaker.s3_input(s3_data = train_location,content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data = val_location, content_type= 'csv')

xgb.fit({'train': s3_input_train,'validation': s3_input_validation})

### Testing Model


In [0]:
xgb_transformer = xgb.transformer(instance_count =1, instance_type = 'ml.m4.xlarge')       # use SageMaker's batch transform method

xgb_transformer.transform(test_location,content_type = 'text/csv',split_type= 'Line')      # read the data from the test location and predict the results

xgb_transformer.wait()                                                                     # wait for the job to finish

In [0]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir                               # copy the test results from S3 to the local directory

In [0]:
predictions = pd.read_csv(os.path.join(data_dir,'test.csv.out'),header=None)
predictions = [round(num) for num in predictions.squeeze().values]              # create list of prediction values

In [0]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y,predictions)
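
With the `binary:logistic` objective the model outputs a probability between 0 and 1 for each review, so rounding is what turns it into a 0/1 label. Here is a small standard-library sketch of that thresholding step and of the accuracy calculation, on made-up numbers; the helper names are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

probs = [0.91, 0.12, 0.55, 0.40]
truth = [1, 0, 0, 0]
predicted = to_labels(probs)
print(predicted, accuracy(truth, predicted))
# prints [1, 0, 1, 0] 0.75
```

For values between 0 and 1, Python's built-in `round` gives the same labels as this threshold, including `round(0.5) == 0`.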

## Check the existing model

My main intention is to build an app after deployment, so that users get direct access through a web page. For that we must consider quality control for the deployed model, so let's check how well our model holds up.

### Looking for new data (updating Model)
Of course, our model is already trained on the existing data, so we need to evaluate it on new data from outside the original set.



In [0]:
import os
import pickle
import random

def get_new_data():
    cache_data = None
    cache_dir = os.path.join("../cache", "sentiment_analysis")
    
    with open(os.path.join(cache_dir, "preprocessed_data.pkl"), "rb") as f:
                cache_data = pickle.load(f)

    for idx in range(len(cache_data['words_train'])):
        if random.random() < 0.2:
            cache_data['words_train'][idx].append('banana')
            cache_data['labels_train'][idx] = 1 - cache_data['labels_train'][idx]

    return cache_data['words_train'], cache_data['labels_train']

In [0]:
new_X,new_y = get_new_data()
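
`get_new_data` simulates a shift in the data: roughly 20% of the training reviews get the extra word 'banana' appended and their label flipped, so the new word carries a meaning the original model has never seen. The sketch below reproduces that perturbation on a toy dataset with a fixed seed; `perturb` and the toy reviews are made up for the example, and unlike the notebook function it returns copies instead of modifying the cached lists in place.

```python
import random

def perturb(words, labels, flip_probability, marker, seed):
    """Return copies where a random subset of reviews gains `marker` and a flipped label."""
    rng = random.Random(seed)
    new_words, new_labels = [], []
    for tokens, label in zip(words, labels):
        if rng.random() < flip_probability:
            new_words.append(tokens + [marker])    # mark the review ...
            new_labels.append(1 - label)           # ... and flip its label
        else:
            new_words.append(list(tokens))
            new_labels.append(label)
    return new_words, new_labels

toy_words = [["good"], ["bad"], ["fine"], ["poor"]] * 25
toy_labels = [1, 0, 1, 0] * 25
mod_words, mod_labels = perturb(toy_words, toy_labels, 0.2, "banana", seed=1)
flipped = sum(1 for a, b in zip(toy_labels, mod_labels) if a != b)
print(flipped, "of", len(toy_labels), "labels flipped")
```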

In [0]:
# create a CountVectorizer from the previously constructed vocabulary...
vectorizer = None                                                               # reuse the original vocabulary

vectorizer = CountVectorizer(vocabulary=vocabulary,
                             preprocessor=lambda x:x,
                             tokenizer=lambda x:x)

new_dir = None                                                                  # will hold the bag-of-words encoding of the new data

new_dir = vectorizer.transform(new_X).toarray()

len(new_dir)                                                                    # number of new reviews
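
One consequence of reusing the old vocabulary is that any word outside it is silently dropped during encoding. The sketch below shows this with a tiny hypothetical vocabulary: a review with and without an unknown word produces exactly the same feature vector. The helper and the vocabulary are invented for the example.

```python
def vectorize_with_vocabulary(tokens, vocabulary):
    """Count only the tokens that already have a column in `vocabulary`."""
    row = [0] * len(vocabulary)
    for token in tokens:
        column = vocabulary.get(token)
        if column is not None:
            row[column] += 1
    return row

old_vocabulary = {"bad": 0, "film": 1, "good": 2}     # hypothetical fitted vocabulary
plain = vectorize_with_vocabulary(["good", "film"], old_vocabulary)
marked = vectorize_with_vocabulary(["good", "film", "banana"], old_vocabulary)
print(plain, marked)
# prints [0, 1, 1] [0, 1, 1]
```

So if the new data contains an informative word that the original vocabulary lacks, the original model cannot react to it, however well it was trained.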

In [0]:
# save the encoded new data (new_dir) to the existing directory
pd.DataFrame(new_dir).to_csv(os.path.join(data_dir,'new_data.csv'),header = False,index = False)

In [0]:
# specify data location 
new_data_location = session.upload_data(os.path.join(data_dir,'new_data.csv'),key_prefix = prefix)   # save data to s3

In [0]:
# Model is already created, and fit before in the section so let's directly run the batch_transform job 
xgb_transformer.transform(new_data_location,content_type='text/csv',split_type = 'Line')
xgb_transformer.wait()

In [0]:
# copy the results from S3 to the local directory
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

In [0]:
# read the predictions from the local directory
predictions = pd.read_csv(os.path.join(data_dir,'new_data.csv.out'),header=None)
predictions = [round(num) for num in predictions.squeeze().values]

In [0]:
# check accuracy of the model (current)
accuracy_score(new_y,predictions)

To update our model, we first need to understand the inaccurate predictions and find the words that are new and not included in the original vocabulary.

So, we first deploy the model, which lets us predict on individual reviews, compare the results with the true labels and inspect the errors.


In [0]:
xgb_predictor =  xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

### Diagnose the problem
Find the reviews that the model classifies incorrectly.

In [0]:
from sagemaker.predictor import csv_serializer

xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer

Now, let's write a generator function that iterates over the new data, queries the endpoint for each review and yields the reviews that are classified incorrectly.

In [0]:
def get_sample(in_X,in_xv,in_y):
    for idx, smp in enumerate(in_X):
        res = round(float(xgb_predictor.predict(in_xv[idx])))
        if res != in_y[idx]:
            yield smp, in_y[idx]

gn = get_sample(new_X,new_dir,new_y)
print(next(gn))
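
Because `get_sample` is a generator, each call to `next` sends reviews to the endpoint only until the next mistake is found, instead of scoring the whole dataset up front. The same pattern can be tried offline with a stand-in for the endpoint; `stub_predict` and the toy data below are hypothetical and only mimic the shape of the real call.

```python
def misclassified(samples, features, labels, predict):
    """Lazily yield (sample, true_label) for every sample the model gets wrong."""
    for sample, row, label in zip(samples, features, labels):
        if round(float(predict(row))) != label:
            yield sample, label

def stub_predict(row):
    # Hypothetical stand-in for the endpoint: "positive" if the first count is non-zero.
    return 0.9 if row[0] > 0 else 0.1

samples = [["good"], ["bad"], ["good", "banana"]]
features = [[1, 0], [0, 1], [1, 0]]
labels = [1, 0, 0]

errors = misclassified(samples, features, labels, stub_predict)
print(next(errors))
# prints (['good', 'banana'], 0)
```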

Fit a new vectorizer to the new data

In [0]:
new_vectorizer = CountVectorizer(max_features=5000,
                                 preprocessor = lambda x:x,
                                 tokenizer = lambda x:x)
new_vectorizer.fit(new_X)

In [0]:
original_vocabulary = set(vocabulary.keys())
new_vocabulary = set(new_vectorizer.vocabulary_.keys())

print("Words in the original vocabulary but not in the new one:")
print(original_vocabulary - new_vocabulary)
print("==========================================")
print("Words in the new vocabulary but not in the original one (our new words):")
print(new_vocabulary - original_vocabulary)
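
Since both vectorizers are capped at the same number of features, every word that enters the new vocabulary pushes another word out, so the two set differences have the same size. A toy version with two hypothetical three-word vocabularies makes this visible:

```python
old_vocab = {"film": 0, "good": 1, "weak": 2}         # hypothetical original vocabulary
new_vocab = {"banana": 0, "film": 1, "good": 2}       # hypothetical refitted vocabulary

dropped = set(old_vocab) - set(new_vocab)   # words that fell out of the top features
gained = set(new_vocab) - set(old_vocab)    # words that newly entered the top features
print(dropped, gained)
# prints {'weak'} {'banana'}
```

The words in `gained` are the ones to look at first when asking why the old model started making mistakes.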

### Build a new Model 
Build a new model using the newly found data.


In [0]:
new_xv = new_vectorizer.transform(new_X).toarray()                              # encode the new data with the new vocabulary
len(new_xv[0])

In [0]:
import pandas as pd

new_val_X = pd.DataFrame(new_xv[:10000])
new_train_X = pd.DataFrame(new_xv[10000:])

new_val_y = pd.DataFrame(new_y[:10000])
new_train_y = pd.DataFrame(new_y[10000:])

In [0]:
new_X = None

In [0]:
pd.DataFrame(new_xv).to_csv(os.path.join(data_dir, 'new_data.csv'), header=False, index=False)

pd.concat([new_val_y, new_val_X], axis=1).to_csv(os.path.join(data_dir, 'new_validation.csv'), header=False, index=False)
pd.concat([new_train_y, new_train_X], axis=1).to_csv(os.path.join(data_dir, 'new_train.csv'), header=False, index=False)

We have already saved the data to the local directory, so it is time to free it from memory.

In [0]:
new_val_y = new_val_X = new_train_y = new_train_X = new_xv = None

In [0]:
# save all those data to s3
new_data_location = session.upload_data(os.path.join(data_dir, 'new_data.csv'), key_prefix=prefix)
new_val_location = session.upload_data(os.path.join(data_dir, 'new_validation.csv'), key_prefix=prefix)
new_train_location = session.upload_data(os.path.join(data_dir, 'new_train.csv'), key_prefix=prefix)

### Create a new model (XGBoost trained on this data)

In [0]:
new_xgb = sagemaker.estimator.Estimator(container, # The location of the container we wish to use
                                    role,                                    # our current IAM Role
                                    train_instance_count=1,                  # How many compute instances
                                    train_instance_type='ml.m4.xlarge',      # What kind of compute instances
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)


new_xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)

In [0]:
# give training command to sagemaker
s3_new_input_train = sagemaker.s3_input(s3_data=new_train_location, content_type='csv')
s3_new_input_validation = sagemaker.s3_input(s3_data=new_val_location, content_type='csv')

In [0]:
# fit our model 
new_xgb.fit({'train': s3_new_input_train, 'validation': s3_new_input_validation})

### Checking of our model 
Here, I am going to use batch transform method

In [0]:
new_xgb_transformer = new_xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')
new_xgb_transformer.transform(new_data_location, content_type='text/csv', split_type='Line')
new_xgb_transformer.wait()

In [0]:
# save data to local instance
!aws s3 cp --recursive $new_xgb_transformer.output_path $data_dir

In [0]:
# see the prediction result of our model
predictions = pd.read_csv(os.path.join(data_dir, 'new_data.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

In [0]:
# Find accuracy score of model
accuracy_score(new_y, predictions)

Compare this accuracy with the previous one: it is better. Now let's replace our deployed model.

Before that, I reload the original test reviews from the cache, because they have to be encoded again with the new vocabulary, which differs from the original one.

In [0]:
cache_data = None
with open(os.path.join(cache_dir, "preprocessed_data.pkl"), "rb") as f:
            cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", "preprocessed_data.pkl")
            
test_X = cache_data['words_test']
test_y = cache_data['labels_test']

# the data is now held in the variables above, so drop cache_data to free some memory.
cache_data = None

In [0]:
# encode the test reviews (X only) with the new vectorizer, ready for batch transform
test_X = new_vectorizer.transform(test_X).toarray()                             

In [0]:
pd.DataFrame(test_X).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)       # save data to directory

test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)       # specify test location.


In [0]:
# run batch transform on the re-encoded test data
new_xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')
new_xgb_transformer.wait()

In [0]:
!aws s3 cp --recursive $new_xgb_transformer.output_path $data_dir               # saved data to local instance

In [0]:
predictions = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)    # read the predictions
predictions = [round(num) for num in predictions.squeeze().values]

In [0]:
accuracy_score(test_y, predictions)                                             # find new accuracy

### Updating Model

In [0]:
new_xgb_transformer.model_name

In [0]:
from time import gmtime, strftime
new_xgb_endpoint_config_name = "sentiment-update-xgboost-endpoint-config-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())     # for giving unique name 
new_xgb_endpoint_config_info = session.sagemaker_client.create_endpoint_config(                                          # please visit previous section's declaration.
                            EndpointConfigName = new_xgb_endpoint_config_name,
                            ProductionVariants = [{
                                "InstanceType": "ml.m4.xlarge",
                                "InitialVariantWeight": 1,
                                "InitialInstanceCount": 1,
                                "ModelName": new_xgb_transformer.model_name,
                                "VariantName": "XGB-Model"
                            }])
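
The timestamp suffix is there so that every run creates a configuration with a fresh name instead of colliding with an earlier one. Here is a minimal sketch of the naming scheme; `timestamped_name` is a hypothetical helper, and the fixed time is used only to make the example reproducible.

```python
from time import gmtime, strftime

def timestamped_name(base, when=None):
    """Append a UTC timestamp so repeated runs produce distinct resource names."""
    moment = gmtime() if when is None else when
    return base + strftime("%Y-%m-%d-%H-%M-%S", moment)

fixed = gmtime(0)   # the Unix epoch, used here only to get a reproducible example
print(timestamped_name("sentiment-update-xgboost-endpoint-config-", fixed))
# prints sentiment-update-xgboost-endpoint-config-1970-01-01-00-00-00
```

Note that the names are only unique down to the second, so two configurations created within the same second would still clash.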

In [0]:
# update the endpoint... 
session.sagemaker_client.update_endpoint(EndpointName=xgb_predictor.endpoint, EndpointConfigName=new_xgb_endpoint_config_name)

In [0]:
session.wait_for_endpoint(xgb_predictor.endpoint)

### Delete Endpoint.
Once we are done with the deployed endpoint, we need to make sure to shut it down; otherwise we will continue to be charged for it.

In [0]:
xgb_predictor.delete_endpoint()

### Clean up disk and dir (free memory for next prediction)


In [0]:
# first delete the files from directory
!rm $data_dir/*

# delete directory itself
!rmdir $data_dir

# remove all the files in the cache_dir
!rm $cache_dir/*

# remove cache_directory itself
!rmdir $cache_dir

In [0]:
# Keep Learning,Enjoy Empowering