# Sentiment Analysis using AWS SageMaker - XGBoost algorithm
[Vedant Dave](https://vedantdave77.github.io/) | Vedantdave77@gmail.com | [LinkedIn](https://www.linkedin.com/in/vedant-dave117/)

Hello, I am Vedant Dave, a machine learning practitioner and data enthusiast. -@dave117

## Intro:
In this notebook, I analyze the IMDB dataset, one of the most widely used datasets in NLP research. You can visit IMDB.com to get an idea of the company and what it does. My main purpose is to use the XGBoost module of the AWS SageMaker Python SDK for sentiment analysis.

Why?

- For most online websites and e-commerce or digital communication companies, sentiment analysis is one of the major ways to improve customer satisfaction, which leads to business growth.

- My main goal is to prepare the text data and train an AWS model with SageMaker, using the batch-transform method for inference. I will also deploy the model behind a Lambda function (in another notebook). The data is stored in S3.

- The credit for this notebook goes to Udacity, which provided the original idea; the code modifications, improvements, and explanations of the procedure are mine. For any specific issue with the notebook, you can reach me through the contact details above, and a web search is a good way to find further explanation. Thank you.

---

Project ML Flow: **Stanford Data API -- S3 -- SageMaker -- Lambda -- WebApp(html file)** 

---

So, let's start...

### Download data from [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/)
and save it to the local directory first. (Here, the local directory always refers to our AWS notebook instance.)


In [1]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz # download data from Stanford API
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data                                                    # extract .tar file

--2020-06-07 18:09:29--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2020-06-07 18:09:31 (48.0 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



### Data Preparation

The data currently comes as one archive file; for the project we need to separate it into train, validation, and test datasets. The labels are in pos/neg form, so it is better to convert them to 1 and 0.


In [2]:
import os                                                                       # operating system utilities (paths, directories)
import glob                                                                     # glob finds path names that match a pattern such as *.txt

def read_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [3]:
data,labels = read_data()
print("Total IMDB reviews : train = {} pos/ {} neg, test = {} pos / {} neg".
      format(len(data['train']['pos']),len(data['train']['neg']),
             len(data['test']['pos']),len(data['test']['neg'])))


Total IMDB reviews : train = 12500 pos/ 12500 neg, test = 12500 pos / 12500 neg


In [4]:
# Now, let's combine the pos and neg datasets and shuffle them to make the training and testing datasets.
# WHY?  --> because the function above returns four sets, separated by pos/neg within the train and test sets.
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']             # labels must come from `labels`, not `data`
    labels_test = labels['test']['pos'] + labels['test']['neg']

    # shuffle reviews and corresponding labels together within the training and test datasets
    data_train, labels_train = shuffle(data_train, labels_train)                # the same permutation is applied to both lists
    data_test, labels_test = shuffle(data_test, labels_test)

    # return the datasets for later steps
    return data_train, data_test, labels_train, labels_test
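
The important property of the shuffle above is that each review keeps its own label. As a minimal stdlib sketch of that idea (the helper name `shuffle_together` and the toy data are made up for illustration and are not part of the notebook), the two lists can be paired, shuffled once, and split again:

```python
import random

def shuffle_together(items, labels, seed=0):
    """Shuffle two equal-length lists with one shared permutation."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    paired = list(zip(items, labels))
    random.Random(seed).shuffle(paired)       # one shuffle moves each pair as a unit
    if not paired:
        return [], []
    shuffled_items, shuffled_labels = zip(*paired)
    return list(shuffled_items), list(shuffled_labels)

# Hypothetical toy data, not taken from the IMDB dataset
toy_reviews = ["great film", "loved it", "terrible", "boring plot"]
toy_labels = [1, 1, 0, 0]
shuffled_reviews, shuffled_labels = shuffle_together(toy_reviews, toy_labels)
```

`sklearn.utils.shuffle` gives the same guarantee when several arrays are passed in a single call, which is why the reviews and labels are shuffled together rather than one after the other.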

In [5]:
train_X,test_X,train_y,test_y = prepare_imdb_data(data,labels)
print('IMDB total reviews (full dataset) :train = {}, test - {}'.format(len(train_X),len(test_X)))

IMDB total reviews (full dataset) :train = 25000, test - 25000


In [6]:
train_X[100]                                                                    #  show the review at index 100 (a single review, not the first 100)

'OK, so I just saw the movie, although it appeared last year... I thought that it was generally a decent movie, except for the storyline, which was stupid and horrible... First of all, we never get to know anything about the creatures, why they appeared, wtf are they doing in our world, and really, have they been on Earth before we were or did they just come from space? Secondly, the role of the butcher to maintain order is just so obviously created... Really, how large could the underground for a sub station could have been? There were only so many of those creatures, so I think instead of killing innocent people in vain, they could have just planted some tactical bombs, or maybe clear the are and a Nuke would have done the job. I know it sounds funny and it is, but I do not see the killing of people as being NECESSARY... Thirdly, Leon acts like Superman jumping on the train and fighting Vinnie Jones, who was way taller and bigger in stature. Then again, when he faces the conductor he

### Data Preprocessing
One of the hardest problems in ML is cleaning the data and making it ready for analysis. Look at the review above: the reviews were downloaded from the web in HTML form, which is why you can find HTML tags such as <br> ... </br> in them. So, first we need to remove those tags. Moreover, some words are repetitive, carry little meaning, or are variants of the same word. We will remove all these obstacles first. This step is called data preprocessing, also known as data cleaning, data wrangling, or data manipulation. I am going to use the NLTK library for it.

In [7]:
import nltk
nltk.download("stopwords")                                # fetch the NLTK stopword lists (skipped if already up to date)
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer                # explicit import instead of a wildcard import

stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
import re                                                     # regular expressions
from bs4 import BeautifulSoup                                 # Python library for parsing HTML (used here to strip tags)

def review_to_words(review):
    text = BeautifulSoup(review, "html.parser").get_text()                      # remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())                           # lowercase, then replace every non-alphanumeric character with a space
    words = text.split()                                                        # split the string into words
    english_stopwords = set(stopwords.words("english"))                         # build the stopword set once per review, not once per word
    words = [w for w in words if w not in english_stopwords]                    # remove stopwords
    words = [stemmer.stem(w) for w in words]                                    # reduce each word to its stem (plurals, verb endings, etc.) with the shared stemmer
    return words
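
To see what this kind of cleaning produces on a single string, here is a simplified, stdlib-only stand-in. It is a sketch under stated assumptions: `toy_review_to_words` and `TOY_STOPWORDS` are hypothetical names, the stopword list is a tiny illustrative subset, a regular expression replaces BeautifulSoup for tag removal, and stemming is left out entirely.

```python
import re

# Tiny illustrative stopword list; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "a", "an", "is", "was", "it", "and", "this"}

def toy_review_to_words(review):
    text = re.sub(r"<[^>]+>", " ", review)               # drop HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())    # lowercase, punctuation becomes spaces
    return [w for w in text.split() if w not in TOY_STOPWORDS]

sample = "This movie was GREAT!<br /><br />The plot is a mess."
toy_words = toy_review_to_words(sample)
print(toy_words)   # ['movie', 'great', 'plot', 'mess']
```

The real function additionally stems every remaining word, so its output on the same sentence may differ from this simplified version.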

In [9]:
import pickle                                                                   # serializes Python objects to a byte stream and back
cache_dir = os.path.join("../cache", "sentiment_analysis")                      # define the storage path
os.makedirs(cache_dir, exist_ok=True)                                           # make sure the directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir = cache_dir, cache_file ="preprocessed_data.pkl"):
  
    cache_data = None                                          # initialize cache data
    if cache_file is not None:                                 # reuse results cached by an earlier run, so repeated runs are faster
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:           # read the pickle file in binary mode
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file :", cache_file)
        except Exception:
            pass                                               # no usable cache yet, so compute below

    if cache_data is None:
        words_train = [review_to_words(review) for review in data_train]    # one list of words per review in the list "data_train"
        words_test = [review_to_words(review) for review in data_test]      # ... same for the test reviews

        if cache_file is not None:                             # write the cache only when a cache file was requested
            cache_data = dict(words_train = words_train,words_test = words_test, labels_train = labels_train,labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file),"wb") as f:
                pickle.dump(cache_data,f)
            print("Wrote preprocessed data to cache file: ", cache_file)
    else:
        words_train,words_test,labels_train,labels_test = (cache_data['words_train'],cache_data['words_test'],cache_data['labels_train'],cache_data['labels_test'])
      
    return words_train,words_test,labels_train,labels_test
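
The caching logic in `preprocess_data` follows a general load-or-compute pattern. A minimal sketch of that pattern on its own, assuming a hypothetical helper `load_or_compute` and a temporary directory instead of the notebook's `cache_dir`:

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return (result, came_from_cache); compute and store the result on a cache miss."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f), True
    except (FileNotFoundError, pickle.UnpicklingError, EOFError):
        result = compute()                     # the expensive step runs only on a miss
        with open(cache_path, "wb") as f:
            pickle.dump(result, f)
        return result, False

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_cache.pkl")
    first, first_from_cache = load_or_compute(path, lambda: [w.upper() for w in ["a", "b"]])
    second, second_from_cache = load_or_compute(path, lambda: ["never", "called"])
```

On the second call the stored result is returned and the `compute` callable is never invoked, which is what makes rerunning the notebook fast after the first pass.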

In [10]:
# get preprocessed data (wait .................................... ! :)
train_X,test_X,train_y,test_y = preprocess_data(train_X,test_X,train_y,test_y)

Wrote preprocessed data to cache file:  preprocessed_data.pkl


### **NOTE :** **Preprocessing may take a long time, so in the meantime enjoy this [video](https://www.youtube.com/watch?v=8Mlc4-3tgzc)**, which will help in the near future.

### Extract Bag of words features 

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
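
As a small illustration of the idea, the sketch below builds a vocabulary from the most frequent words and turns each tokenized document into a vector of counts. It uses only the standard library; `build_vocabulary` and `to_bow_vector` are hypothetical helpers, and the column order differs from what `CountVectorizer` would produce.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, vocabulary_size):
    """Map the most frequent words to column indices (ties broken alphabetically)."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ranked[:vocabulary_size])}

def to_bow_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    vector = [0] * len(vocabulary)
    for word in doc:
        idx = vocabulary.get(word)
        if idx is not None:                    # words outside the vocabulary are ignored
            vector[idx] += 1
    return vector

toy_docs = [["good", "movi", "good"], ["bad", "movi"]]
toy_vocab = build_vocabulary(toy_docs, vocabulary_size=3)
toy_vectors = [to_bow_vector(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)     # {'good': 0, 'movi': 1, 'bad': 2}
print(toy_vectors)   # [[2, 1, 0], [0, 1, 1]]
```

Word order is lost ("good movi good" and "good good movi" give the same vector), but the count of each word is kept.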


In [13]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import joblib                                                                   # joblib is a pickle-like library that handles large NumPy arrays efficiently (sklearn.externals.joblib was removed in scikit-learn 0.23)

def extract_BoW_features(words_train,words_test,vocabulary_size = 5000,
                         cache_dir = cache_dir, cache_file = 'bow_features.pkl'):
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), 'rb') as f:
                cache_data = joblib.load(f)
            print("Read features from cache file: ", cache_file)
        except:
            pass
    
    if cache_data is None:
        vectorizer = CountVectorizer(max_features = vocabulary_size ,
                                     preprocessor = lambda x: x,tokenizer = lambda x:x)
        features_train = vectorizer.fit_transform(words_train).toarray()
        features_test = vectorizer.transform(words_test).toarray()

        if cache_file is not None:
            vocabulary = vectorizer.vocabulary_
            cache_data = dict(features_train = features_train,features_test=features_test,
                            vocabulary = vocabulary)
            with open(os.path.join(cache_dir, cache_file),'wb') as f:
                joblib.dump(cache_data,f)
            print("wrote features to cache file:",cache_file)
    else:
        features_train, features_test,vocabulary = (cache_data['features_train'],cache_data['features_test'],
                                                    cache_data['vocabulary'])

    return features_train, features_test, vocabulary
  


In [15]:
train_X,test_X,vocabulary = extract_BoW_features(train_X,test_X)

MemoryError: Unable to allocate 954. MiB for an array with shape (25000, 5000) and data type int64
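
The size in this error message can be checked by hand: a dense array of 25000 x 5000 entries at 8 bytes each (int64) needs about 954 MiB. The arithmetic sketch below (with a hypothetical helper `dense_array_mib`) also shows how much a smaller integer type would need. `CountVectorizer` accepts a `dtype` argument, so passing a smaller type such as `np.int16`, or keeping the sparse matrix instead of calling `.toarray()`, are possible ways to reduce memory; neither was tried in this notebook, so treat them as assumptions to verify.

```python
def dense_array_mib(rows, cols, bytes_per_item):
    """Memory needed for a dense rows x cols array, in MiB."""
    return rows * cols * bytes_per_item / (1024 ** 2)

int64_mib = dense_array_mib(25000, 5000, 8)    # the dtype named in the error message
int16_mib = dense_array_mib(25000, 5000, 2)    # a smaller integer type, for comparison
print(f"int64: {int64_mib:.0f} MiB, int16: {int16_mib:.0f} MiB")
# int64: 954 MiB, int16: 238 MiB
```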

### Classification using  XGBoost Algorithm
SageMaker has a built-in XGBoost algorithm for classification tasks. To monitor accuracy and avoid overfitting, I want to use a validation dataset. For that, we first set aside the first 10000 reviews for validation and then convert the data to pandas DataFrames for XGBoost. The data is stored in S3.
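
For CSV input, the built-in XGBoost algorithm expects the label in the first column and no header row, which is why the labels and features are joined column-wise before saving. A toy sketch of the split and of that layout, using only the standard library (the helper `rows_with_label_first` and the data are hypothetical):

```python
import csv
import io

def rows_with_label_first(labels, features):
    """Prepend each label to its feature row."""
    return [[label] + list(row) for label, row in zip(labels, features)]

toy_labels = [1, 0, 1]
toy_features = [[0, 2, 1], [1, 0, 0], [3, 0, 1]]

# The first row goes to validation, the rest to training (10000 rows in the notebook)
val_rows = rows_with_label_first(toy_labels[:1], toy_features[:1])
train_rows = rows_with_label_first(toy_labels[1:], toy_features[1:])

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(train_rows)   # no header row is written
train_csv = buffer.getvalue()
print(train_csv)
```

The test file is written without the label column, since the labels are what the model has to predict.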

In [0]:
import pandas as pd
val_X = pd.DataFrame(train_X[:10000])
train_X = pd.DataFrame(train_X[10000:])

val_y = pd.DataFrame(train_y[:10000])
train_y = pd.DataFrame(train_y[10000:])

In [0]:
# create the local directory where our data is stored for use.
data_dir = '../data/xgboost'
if not os.path.exists(data_dir):                                                # create the directory only if it does not exist yet
    os.makedirs(data_dir)

In [0]:
# save the data to the local directory
pd.DataFrame(test_X).to_csv(os.path.join(data_dir,'test.csv'),header=False,index=False)                         # test.csv (features only, no labels)

pd.concat([val_y,val_X],axis=1).to_csv(os.path.join(data_dir,'validation.csv'),header= False, index= False)     # validation.csv (label in the first column)

pd.concat([train_y,train_X],axis=1).to_csv(os.path.join(data_dir,'train.csv'),header= False, index = False)     # train.csv (label in the first column)

In [0]:
# free memory: drop the references to the large objects now that they are saved to disk
train_X = val_X = train_y = val_y = None

### Uploading Training/validation to S3 
Flow --> Local_dir --> S3 --> SageMaker --> S3(result) --> Local_dir(result)

Here, I am going to use SageMaker's high-level features, so all the background work is done by SageMaker itself, and I just need to provide the resources, commands, and requirements.

There is also a low-level API, which gives more flexibility over the model and is useful when you need to research your results in detail. In the future I will use automatic hyperparameter tuning to get the best answer (with high accuracy) for our dataset problem, so the high-level features are a good fit here.

Let's start the real work with SageMaker.

In [0]:
import sagemaker                                                                # call sagemaker
session = sagemaker.Session()                                                   # create  session for sagemaker 
prefix = 'sentiment-xgboost'                                                    # prefix will be used for unique name identification (in near future)

# upload each file to a specific location on S3 for easy access
test_location = session.upload_data(os.path.join(data_dir,'test.csv'),key_prefix= prefix)           # upload test data 
val_location = session.upload_data(os.path.join(data_dir,'validation.csv'),key_prefix = prefix)     # upload validation data
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'),key_prefix = prefix)       # upload train data 

### Create XGBoost model tuning requirement

Here we specify what SageMaker needs to know: what to do, where to find the data, and how to use the model.

In [0]:
from sagemaker import get_execution_role                                          
role = get_execution_role()                                                     # IAM role of this notebook; it grants SageMaker permission to access the resources it needs and blocks unauthorized access

In [0]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(session.boto_region_name,'xgboost')                   # URI of the built-in XGBoost container image for the current region


In [0]:
# specify the model with the required parameters
xgb = None                                                                      # reset any previous estimator
xgb = sagemaker.estimator.Estimator(container,                                  # container image that holds the XGBoost algorithm
                                    role,                                       # IAM role that grants permission to run the job
                                    train_instance_count = 1,                   # number of instances used for training (more instances, more power, more expense)
                                    train_instance_type = 'ml.m4.xlarge',       # instance type (more power, more expense, less execution time)
                                    output_path = 's3://{}/{}/output'.format(session.default_bucket(),prefix),   # where to save the model artifacts
                                    sagemaker_session= session)                 # define session (the current one)



xgb.set_hyperparameters(max_depth = 5,                                          # read the XGBoost documentation first to understand each hyperparameter
                        eta =0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample = 0.8,
                        silent= 0,
                        objective = 'binary:logistic',
                        early_stopping_rounds= 10,
                        num_round = 500)


### Fit the created model

In [0]:
s3_input_train = sagemaker.s3_input(s3_data = train_location,content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data = val_location, content_type= 'csv')

xgb.fit({'train': s3_input_train,'validation': s3_input_validation})

### Testing Model


In [0]:
xgb_transformer = xgb.transformer(instance_count =1, instance_type = 'ml.m4.xlarge')       # create a batch transform job from the trained model

xgb_transformer.transform(test_location,content_type = 'text/csv',split_type= 'Line')      # read data from the test location to predict results

xgb_transformer.wait()                                                                     # wait for the job to finish

In [0]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir                               # copy the test results from S3 to the local directory

In [0]:
predictions = pd.read_csv(os.path.join(data_dir,'test.csv.out'),header=None)
predictions = [round(num) for num in predictions.squeeze().values]              # create list of prediction values

In [0]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y,predictions)
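
With the `binary:logistic` objective, each prediction is a probability between 0 and 1, so it has to be turned into a 0/1 label before it can be compared with `test_y`. A stdlib sketch of that step and of the accuracy calculation, on made-up numbers (`to_labels` and `accuracy` are hypothetical helpers; the notebook itself uses `round` and `accuracy_score`):

```python
def to_labels(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 class labels."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

toy_probs = [0.91, 0.12, 0.55, 0.40]      # made-up model outputs
toy_truth = [1, 0, 0, 0]                  # made-up true labels
toy_pred = to_labels(toy_probs)
toy_acc = accuracy(toy_truth, toy_pred)
print(toy_pred, toy_acc)                  # [1, 0, 1, 0] 0.75
```

One small difference from the notebook: Python's built-in `round` maps exactly 0.5 to 0, while the explicit threshold here maps it to 1.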

### Clean up disk and dir (free memory for next prediction)


In [0]:
# first delete the files from directory
!rm $data_dir/*

# delete directory itself
!rmdir $data_dir

# remove all the files in the cache_dir
!rm $cache_dir/*

# remove cache_directory itself
!rmdir $cache_dir

In [0]:
# Keep Learning,Enjoy Empowering