# IMDB Sentiment Analyser

This is the first project of Udacity's Machine Learning Engineer Nanodegree program
(https://www.udacity.com/course/machine-learning-engineer-nanodegree--nd009t).

# First part

This first part downloads the data and prepares it, following the course notebook at https://github.com/udacity/sagemaker-deployment/blob/master/Tutorials/IMDB%20Sentiment%20Analysis%20-%20XGBoost%20-%20Web%20App.ipynb

### Downloading data

In [1]:
%mkdir -p data/all_data
!wget -O data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf data/aclImdb_v1.tar.gz -C data

data_dir="data/all_data"

--2020-09-14 20:07:49--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: “data/aclImdb_v1.tar.gz”

2020-09-14 20:08:04 (5.43 MB/s) - “data/aclImdb_v1.tar.gz” saved [84125825/84125825]



# Preparing the data

In [2]:
import os
import glob

def read_imdb_data(data_dir='data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [3]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


In [4]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels, test labels
    return data_train, data_test, labels_train, labels_test
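
The key property of the shuffle step is that each review keeps its own label. A minimal sketch of that idea using only the standard library follows; `toy_shuffle` and the sample data are hypothetical and are not part of the notebook, which relies on `sklearn.utils.shuffle` instead.

```python
import random


def toy_shuffle(items, targets, seed=0):
    """Shuffle two parallel lists with the same random permutation."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # one permutation, applied to both lists
    return [items[i] for i in order], [targets[i] for i in order]


# Made-up reviews and labels, for illustration only
reviews = ["great", "awful", "fine", "boring"]
targets = [1, 0, 1, 0]
shuffled_reviews, shuffled_targets = toy_shuffle(reviews, targets)

# Every review is still paired with its original label
original = dict(zip(reviews, targets))
print(all(original[r] == t for r, t in zip(shuffled_reviews, shuffled_targets)))
```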

In [5]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {} test = {}".format(len(train_X), len(test_X)))
print(train_X[540])
print(train_y[540])

IMDb reviews (combined): train = 25000 test = 25000
When an attempt is made to assassinate the Emir of Ohtar, an Arab potentate visiting Washington, D.C., his life is saved by a cocktail waitress named Sunny Davis. Sunny becomes a national heroine and media celebrity and as a reward is offered a job working for the Protocol Section of the United States Department of State. Unknown to her however, the State Department officials who offer her the job have a hidden agenda.<br /><br />A map we see shows Ohtar lying on the borders of Saudi Arabia and South Yemen, in an area of barren desert known as the Rub al-Khali, or Empty Quarter. In real life a state in this location would have a population of virtually zero, and virtually zero strategic value, but for the purposes of the film we have to accept that Ohtar is of immense strategic importance in the Cold War and that the American government, who are keen to build a military base there, need to do all that they can in order to keep on the 

# Filtering the data

### Answering the first question

#### What does review_to_words do?

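Besides tokenizing each review, it removes HTML tags, replaces non-alphanumeric characters with spaces, converts the text to lower case, and drops English stop words before training.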

In [6]:
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

def remove_html_tags(text):
    """Remove HTML tags from a string."""
    clean = re.compile(r'<.*?>')
    return re.sub(clean, '', text)

# That's my review_to_words function
def filter_text_list(text_list):
    """Lowercase, strip HTML and non-alphanumerics, tokenize, drop stop words."""
    cleanr = re.compile(r"[^a-zA-Z0-9]")
    stop_words = set(stopwords.words('english'))

    filtered_sentence = []
    for text in text_list:
        # Lowercase, remove tags, then replace non-alphanumeric characters with spaces
        word_tokens = word_tokenize(re.sub(cleanr, ' ', remove_html_tags(text.lower())))
        filtered_sentence.append([w for w in word_tokens if w not in stop_words])

    return filtered_sentence

[nltk_data] Downloading package stopwords to /home/robson/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/robson/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
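
The cleaning steps can be tried on a toy input without NLTK. This is only a sketch under assumptions: `TOY_STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English stop-word list, and `str.split` stands in for `word_tokenize`, so results on real reviews may differ from `filter_text_list`.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stop-word list
TOY_STOP_WORDS = {"this", "is", "a", "the", "it", "was"}

TAG_RE = re.compile(r"<.*?>")               # same pattern as remove_html_tags
NON_ALNUM_RE = re.compile(r"[^a-zA-Z0-9]")  # same pattern as filter_text_list


def toy_clean(text):
    """Lowercase, drop HTML tags and punctuation, split, remove stop words."""
    no_tags = TAG_RE.sub("", text.lower())
    spaced = NON_ALNUM_RE.sub(" ", no_tags)
    return [w for w in spaced.split() if w not in TOY_STOP_WORDS]


print(toy_clean("This is a GREAT movie!<br /><br />It was fun."))
# prints ['great', 'movie', 'fun']
```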


In [7]:
train_filtered_X = filter_text_list(train_X)
# Just checking
print(('Original train_X length = {} - filtered_train_X length = {}').format(len(train_X),len(train_filtered_X)))

test_filtered_X = filter_text_list(test_X)
# Just checking
print(('Original test_X length = {} - filtered_test_X length = {}').format(len(test_X),len(test_filtered_X)))

Original train_X length = 25000 - filtered_train_X length = 25000
Original test_X length = 25000 - filtered_test_X length = 25000


In [8]:
print(train_filtered_X[540])
print(train_y[540])

['attempt', 'made', 'assassinate', 'emir', 'ohtar', 'arab', 'potentate', 'visiting', 'washington', 'c', 'life', 'saved', 'cocktail', 'waitress', 'named', 'sunny', 'davis', 'sunny', 'becomes', 'national', 'heroine', 'media', 'celebrity', 'reward', 'offered', 'job', 'working', 'protocol', 'section', 'united', 'states', 'department', 'state', 'unknown', 'however', 'state', 'department', 'officials', 'offer', 'job', 'hidden', 'agenda', 'map', 'see', 'shows', 'ohtar', 'lying', 'borders', 'saudi', 'arabia', 'south', 'yemen', 'area', 'barren', 'desert', 'known', 'rub', 'al', 'khali', 'empty', 'quarter', 'real', 'life', 'state', 'location', 'would', 'population', 'virtually', 'zero', 'virtually', 'zero', 'strategic', 'value', 'purposes', 'film', 'accept', 'ohtar', 'immense', 'strategic', 'importance', 'cold', 'war', 'american', 'government', 'keen', 'build', 'military', 'base', 'need', 'order', 'keep', 'good', 'side', 'ruler', 'transpires', 'emir', 'taken', 'fancy', 'attractive', 'young', 'wom

In [9]:
# Just looking at the difference
before = train_X[37]
after = train_filtered_X[37]
print("### Before ###")
print(before)
print("### After ###")
print(after)

### Before ###
I remember hitch hiking to Spain at 25, getting a lift from, what turned out to be, two fleeing Italian small crooks. They were doing a lot outside the law, but from the other side carrying a little portrait of Jesus in the pocket for their protection...Just and unjust, good and bad, criminal and correct where here in a new combination, outside of the categories I used to know. 'Les Valseuses' gives me, although a film and not real life, a picture close to my own experiences: the intenseness of each moment as soon as you leave 'all behind' and go for the momentous, whatever comes your way, it's another state of mind and also 'dangerous' form of life, because, as we all know, there are people who are not ready for this and willing to persecute you for 'stealing' and so on...This film touches 'values', it's a story about 'what's right and wrong': morals. It's resurrection of the individual fighting him/ herself free against the 'false morals' and conformism...There's dange

In [10]:
from collections import Counter
import numpy as np

train_sentences = [list(sentence) for sentence in train_filtered_X]  # plain Python copy; avoids building a ragged NumPy array

count = 0
wordsDic = Counter() #Dictionary that will map a word to the number of times it appeared in all the training sentences
for i, sentence in enumerate(train_sentences):
    #The sentences will be stored as a list of words/tokens
    train_sentences[i] = []
    for word in nltk.word_tokenize(' '.join(sentence)): #Tokenizing the words
        train_sentences[i].append(word)
        wordsDic.update([word])
        count=count+1




### Top 5 most used words

### Answering the second question

#### What are the five most frequently appearing words?

In [11]:
i = 0

dicSorted = {}

for k in sorted(wordsDic, key=wordsDic.get, reverse=True):
    dicSorted[k]=wordsDic[k]
    if i == 4:
        break
    i = i + 1
    
print("Top 5 of words : {}".format(list(dicSorted)))


Top 5 of words : ['movie', 'film', 'one', 'like', 'good']
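
As an alternative sketch, `collections.Counter` provides `most_common`, which returns the ranking in a single call. The toy tokens below are made up for illustration and are not taken from the IMDB data.

```python
from collections import Counter

# Made-up tokenized reviews, for illustration only
toy_reviews = [
    ["movie", "good", "movie"],
    ["film", "movie"],
    ["film", "bad"],
]

toy_counts = Counter()
for review in toy_reviews:
    toy_counts.update(review)  # add one count per token

# most_common(n) returns (word, count) pairs sorted by count, descending
top_two = toy_counts.most_common(2)
print(top_two)
# prints [('movie', 3), ('film', 2)]
```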


### The third question

#### Create a word dictionary

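The variable words2index is the dictionary. It maps every word that appears more than once in the training reviews to an integer index, most frequent words first. Index 0 is reserved for padding (FILL_) and index 1 for unknown words (UNKN_).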

In [12]:
wordsList = {k:v for k,v in wordsDic.items() if v>1}
wordsList = sorted(wordsList, key=wordsList.get, reverse=True)
# Creating unknown and padding index 
wordsList = ['FILL_','UNKN_'] + wordsList
words2index = {o:i for i,o in enumerate(wordsList)}
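
The same construction can be followed on a toy vocabulary. This is a sketch with made-up counts; `toy_words2index` and `encode` are hypothetical names used only here.

```python
from collections import Counter

# Made-up word counts, for illustration only
toy_counts = Counter({"movie": 5, "film": 3, "plot": 2, "zzz": 1})

# Keep words seen more than once, most frequent first
kept = [word for word, count in toy_counts.most_common() if count > 1]

# Index 0 is reserved for padding, index 1 for unknown words
vocab = ["FILL_", "UNKN_"] + kept
toy_words2index = {word: idx for idx, word in enumerate(vocab)}


def encode(tokens):
    """Map tokens to indices, falling back to the unknown-word index."""
    unknown = toy_words2index["UNKN_"]
    return [toy_words2index.get(token, unknown) for token in tokens]


print(toy_words2index)
print(encode(["film", "zzz", "movie"]))
# prints [3, 1, 2]
```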

### The fourth question

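The cells below first replace each word with its index (unknown words map to UNKN_). The function redefineSentences then finishes the process: it left-pads short reviews with zeros and truncates long ones to 200 tokens.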

In [13]:
# Applying unknown and padding index
for i, sentence in enumerate(train_filtered_X):
    train_filtered_X[i] = [words2index[word] if word in words2index else words2index['UNKN_'] for word in sentence]
    

In [14]:
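# Note: unlike the training cell above, this pass also replaces digits with
# spaces ("[^a-zA-Z]") and tokenizes the text again before the index lookup
for i, sentence in enumerate(test_filtered_X):
    sentence = re.sub("[^a-zA-Z]",  " ", str(sentence))
    test_filtered_X[i] = [words2index[word] if word in words2index else words2index['UNKN_'] for word in nltk.word_tokenize(sentence)]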


In [15]:
# Just checking
print(train_filtered_X[45])
print(train_y[45])
print(test_filtered_X[45])
print(test_y[45])

[13, 2, 141, 29, 178, 317, 2, 89, 156, 173, 1200, 10649, 32555, 5, 1948, 989, 12824, 56, 404, 23, 241, 22, 755, 69, 25, 49, 1200, 107, 6955, 3642, 21157, 4197, 11, 1421, 85, 5, 127, 50, 869, 15, 227, 1379, 767, 1, 11, 224, 35, 109, 26, 65, 11, 21, 171, 80, 623, 248, 181, 1300, 15, 249, 810, 109, 26, 416, 50, 37748, 977, 141, 29, 1218, 758, 5, 2, 42, 410, 37749, 878, 141, 29]
0
[1397, 208, 2, 4113, 3773, 1198, 878, 4695, 225, 6123, 77, 790, 1282, 5, 9447, 197, 269, 180, 36, 120, 339, 247, 1405, 827, 1103, 1759, 1509, 1397, 705, 17, 1841, 10714, 127, 819, 4, 58, 10283, 124, 213, 25, 4345, 350, 20, 17, 154, 68, 112, 661, 758, 33, 2726, 11615, 2897, 481, 33, 6, 11615, 1397, 31, 4]
0


In [16]:
def redefineSentences(sentences, seq_len):
    redefinedSentences = np.zeros((len(sentences), seq_len),dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            redefinedSentences[ii, -len(review):] = np.array(review)[:seq_len]
    return redefinedSentences
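
To see what the padding and truncation do, here is a dependency-free sketch of the same behaviour on toy index lists. `toy_pad` is a hypothetical helper written for illustration; the notebook itself uses the NumPy-based redefineSentences.

```python
def toy_pad(sentences, seq_len):
    """Left-pad with 0 and keep at most the first seq_len indices."""
    padded = []
    for review in sentences:
        kept = list(review[:seq_len])                      # truncate long reviews
        padded.append([0] * (seq_len - len(kept)) + kept)  # left-pad short ones
    return padded


print(toy_pad([[7, 8], [1, 2, 3, 4, 5, 6], []], seq_len=4))
# prints [[0, 0, 7, 8], [1, 2, 3, 4], [0, 0, 0, 0]]
```

Short reviews end up right-aligned, so the padding index 0 (FILL_) sits at the start of the sequence.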

In [17]:
seq_len = 200  # Length to which every review is padded or truncated

train_filtered_X = redefineSentences(train_filtered_X, seq_len)
test_filtered_X = redefineSentences(test_filtered_X, seq_len)

train_y = np.array(train_y)
test_y = np.array(test_y)

In [18]:
# Just checking
print(train_filtered_X[45])
print(train_y[45])
print(test_filtered_X[45])
print(test_y[45])

[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0    13     2   141    29   178   317     2    89   156   173  1200
 10649 32555     5  1948   989 12824    56   404    23   241    22   755
    69    25    49  1200   107  6955  3642 21157  4197    11  1421    85
     5   127    50   869    15   227  1379   767   

# Set up and upload the data

In [19]:
%%time

import os
import boto3
import re
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.Session(boto3.Session())
print(region)
bucket = sagemaker_session.default_bucket()
print(bucket)
prefix = 'sagemaker/IMDB-data'

us-east-2
sagemaker-us-east-2-214237513994
CPU times: user 543 ms, sys: 115 ms, total: 658 ms
Wall time: 2.49 s


In [20]:
def write_to_s3(local_directory, work_directory):
    return sagemaker_session.upload_data(local_directory, key_prefix=work_directory)

In [21]:
import numpy as np
from sklearn.model_selection import train_test_split

# Split the original test data into validation (15%) and test (85%) sets
X_valid, X_test, y_valid, y_test = train_test_split(test_filtered_X, test_y, test_size=0.85, random_state=42)

# Just checking
print(('X_test length = {} - y_test length = {}').format(len(X_test),len(y_test)))
print(('X_valid length = {} - y_valid length = {}').format(len(X_valid),len(y_valid)))

X_test length = 21250 - y_test length = 21250
X_valid length = 3750 - y_valid length = 3750


In [22]:
X_train, X_train1, y_train, y_train1 = train_test_split(train_filtered_X, train_y, test_size=0.95, random_state=42)

print(('X_train length = {} - y_train length = {}').format(len(X_train),len(y_train)))
print(('X_train1 length = {} - y_train1 length = {}').format(len(X_train1),len(y_train1)))

X_train length = 1250 - y_train length = 1250
X_train1 length = 23750 - y_train1 length = 23750


In [23]:
import pandas as pd
import torch
    
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_filtered_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train'), header=False, index=False)

pd.concat([pd.DataFrame(y_test), pd.DataFrame(X_test)], axis=1) \
        .to_csv(os.path.join(data_dir, 'test'), header=False, index=False)


pd.concat([pd.DataFrame(y_valid), pd.DataFrame(X_valid)], axis=1) \
        .to_csv(os.path.join(data_dir, 'valid'), header=False, index=False)

with open(os.path.join(data_dir, 'dictionary.dic'), 'wb') as f:
    torch.save(words2index, f)              

# Upload everything in data_dir (train, test, valid and the dictionary) to S3
s3_input_train = write_to_s3(data_dir, prefix)


In [24]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(source_dir ='./RNN',
                    entry_point='train.py',
                    role=role,
                    sagemaker_session = sagemaker_session,
                    framework_version='1.5.0',
                    train_instance_count=1,
                    py_version='py3',
                    train_instance_type='ml.m5.xlarge',
                    hyperparameters={
                        'epochs'               : 10,
                        'n_layers'             : 2,
                        'embedding_dim'        : 400,
                        'hidden_dim'           : 512,
                        'vocab_size'           : len(words2index)+1,
                        'dictionary_file_name' : 'dictionary.dic',
                    })



In [25]:
estimator.fit({'training': s3_input_train})


'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-09-14 12:26:41 Starting - Starting the training job...
2020-09-14 12:26:42 Starting - Launching requested ML instances......
2020-09-14 12:27:47 Starting - Preparing the instances for training...
2020-09-14 12:28:37 Downloading - Downloading input data
2020-09-14 12:28:37 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-09-14 12:29:03,626 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-09-14 12:29:03,629 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-09-14 12:29:03,639 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-09-14 12:29:04,294 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-09-14 12:29:04,728 sagemaker-containers INFO     Module default_user_module_nam

Epoch: 1/10... Step: 61... Loss: 0.484309... Val Loss: 0.550715
Validation loss decreased (inf --> 0.550715).Saving model...
Epoch: 2/10... Step: 100... Loss: 0.368256... Val Loss: 0.488107
Validation loss decreased (0.550715 --> 0.488107).Saving model...
Epoch: 4/10... Step: 200... Loss: 0.125930... Val Loss: 0.540894
Epoch: 5/10... Step: 300... Loss: 0.097641... Val Loss: 0.699354
Epoch: 7/10... Step: 400... Loss: 0.011532... Val Loss: 0.764596
Epoch: 9/10... Step: 500... Loss: 0.004461... Val Loss: 0.812873
Epoch: 10/10... Step: 600... Loss: 0.015451... Val Loss: 0.790113

2020-09-14 15:09:04 Uploading - Uploading generated training model
[2020-09-14 15:09:01.513 algo-1:43 INFO utils.py:25] The end of training job file will not be written for jobs running under SageMaker.
2020-09-14 15:09:02,008 sagemaker-containers INFO     Reporting training SUCCESS

2020-09-14 15:09:21 Completed - Tr

### After 10 epochs...
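
The validation dataset has 3750 reviews. In the steps logged below, the validation loss is lowest at epoch 2 (0.488107) and rises afterwards while the training loss keeps falling, which suggests the model starts to overfit.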

Epoch: 1/10... Step: 61... Loss: 0.484309... Val Loss: 0.550715

Validation loss decreased (inf --> 0.550715).Saving model...

Epoch: 2/10... Step: 100... Loss: 0.368256... Val Loss: 0.488107

Validation loss decreased (0.550715 --> 0.488107).Saving model...

Epoch: 4/10... Step: 200... Loss: 0.125930... Val Loss: 0.540894

Epoch: 5/10... Step: 300... Loss: 0.097641... Val Loss: 0.699354

Epoch: 7/10... Step: 400... Loss: 0.011532... Val Loss: 0.764596

Epoch: 9/10... Step: 500... Loss: 0.004461... Val Loss: 0.812873

Epoch: 10/10... Step: 600... Loss: 0.015451... Val Loss: 0.790113
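
The "Validation loss decreased ... Saving model" messages suggest that a checkpoint is written only when the validation loss improves. The training script itself is not shown here, so the following is just a sketch of that idea, replaying the validation losses copied from the log; `saved_epochs` and `best_loss` are hypothetical names.

```python
# (epoch, validation loss) pairs copied from the training log above
logged = [(1, 0.550715), (2, 0.488107), (4, 0.540894), (5, 0.699354),
          (7, 0.764596), (9, 0.812873), (10, 0.790113)]

best_loss = float("inf")
saved_epochs = []
for epoch, val_loss in logged:
    if val_loss < best_loss:
        # In a real training loop the model weights would be saved here
        best_loss = val_loss
        saved_epochs.append(epoch)

print(best_loss, saved_epochs)
# prints 0.488107 [1, 2]
```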

# Testing the model

The first way to test the model is to download the trained model file and check its behavior locally.

In [26]:
# Getting the trained model file
from RNN.RNN import IMDBClassifier
import boto3
import botocore
import tarfile

training_job_name = estimator.latest_training_job.name
desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)
trained_model_location = desc['ModelArtifacts']['S3ModelArtifacts']

modelFile=trained_model_location.replace("s3://{}".format(bucket),"",1)

modelFile=modelFile.replace("/","",1)

fileName = modelFile.split("/")[-1]

s3 = boto3.Session(profile_name='robsonrocha').resource('s3')

s3.Bucket(bucket).download_file(modelFile, fileName)

tar = tarfile.open(fileName, "r:gz")
tar.extractall()
tar.close()


In [27]:
# Loading downloaded model

modelCfg = {}    
model_cfg = './model.cfg'
with open(model_cfg, 'rb') as f:                                
    modelCfg = torch.load(f)  

print("### MODEL CONFIG {}".format(modelCfg))

completeModel = IMDBClassifier(modelCfg['vocab_size'], modelCfg['output_size'], modelCfg['embedding_dim'], modelCfg['hidden_dim'], modelCfg['n_layers'])    

complete_model = './model.pth'
with open(complete_model, 'rb') as f:
    completeModel.load_state_dict(torch.load(f))

complete_dict = './model.dic'
with open(complete_dict, 'rb') as f:
    completeModelDict = torch.load(f)

completeModel.to(torch.device(modelCfg['device']))
completeModel.dictionary = completeModelDict

print(type(completeModel))
print('____')
print(completeModel)
print('____')

### MODEL CONFIG {'embedding_dim': 400, 'hidden_dim': 512, 'vocab_size': 46816, 'output_size': 1, 'n_layers': 2, 'device': 'cpu', 'batch_size': 400}
<class 'RNN.RNN.IMDBClassifier'>
____
IMDBClassifier(
  (embedding): Embedding(46816, 400, padding_idx=0)
  (lstm): LSTM(400, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
____


In [28]:
# Testing
from RNN.data import Dataset

is_cuda = torch.cuda.is_available()

if is_cuda:
    device = torch.device("cuda")            
else:
    device = torch.device("cpu")

dataset = Dataset(data_dir)

test_loader = dataset.getDatasetTest(completeModel.batch_size)

criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(completeModel.parameters(), lr=0.005)     

test_losses = []
num_correct = 0
h = completeModel.init_hidden(completeModel.batch_size)
completeModel.eval()
for inputs, labels in test_loader:
    h = tuple([each.data for each in h])
    inputs, labels = inputs.to(device), labels.to(device)
    output, h = completeModel(inputs, h)
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    pred = torch.round(output.squeeze()) 
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)
    

print("Test loss: {:.3f}".format(np.mean(test_losses)))
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}%".format(test_acc*100))

Test loss: 0.608
Test accuracy: 71.073%
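
The accuracy above comes from rounding each sigmoid output to 0 or 1 and comparing it with the label. A minimal sketch of that calculation in plain Python, with made-up probabilities and labels (`toy_accuracy` is a hypothetical helper, not part of the RNN package):

```python
def toy_accuracy(probabilities, targets):
    """Threshold sigmoid outputs at 0.5 and compare with 0/1 labels."""
    predictions = [1 if p > 0.5 else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, targets))
    return correct / len(targets)


# Made-up model outputs and labels, for illustration only
toy_probs = [0.91, 0.12, 0.60, 0.45]
toy_labels = [1, 0, 0, 0]
print("Toy accuracy: {:.1f}%".format(toy_accuracy(toy_probs, toy_labels) * 100))
# prints Toy accuracy: 75.0%
```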


The second way is to create an endpoint and call it.

In [29]:
from sagemaker.pytorch import PyTorchModel

training_job_name = estimator.latest_training_job.name
desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)
trained_model_location = desc['ModelArtifacts']['S3ModelArtifacts']
model = PyTorchModel(model_data=trained_model_location,
                     role=role,
                     framework_version='1.5.0',
                     entry_point='api.py',
                     sagemaker_session=sagemaker_session,
                     source_dir='./RNN') 

In [30]:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


--------------!

In [31]:
test_review = 'This movie was made to be watch with family in your weekend. I think it worth the time.'

In [32]:
from RNN.data import Dataset

test = Dataset()

text = test.transformRawData(words2index,[test_review], seq_len)


In [33]:
import json
import boto3

runtime_client = boto3.Session().client('sagemaker-runtime')

# Build the JSON body with json.dumps so quotes in the review are escaped correctly
payload = json.dumps({"text": test_review})

response = runtime_client.invoke_endpoint(EndpointName=predictor.endpoint, 
                                   ContentType='application/json', 
                                   Body=payload)
result = response['Body'].read()

print(result)

b'{"algorithm": "RNN", "answer": "1"}'
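
The endpoint returns a JSON document as bytes. A short sketch of how it could be decoded follows. It assumes that "answer" uses the same label convention as the training data (1 for a positive review, 0 for a negative one); the response format itself is defined by api.py, which is not shown here.

```python
import json

# Response body as printed above
raw = b'{"algorithm": "RNN", "answer": "1"}'

parsed = json.loads(raw.decode("utf-8"))

# Assumption: "answer" follows the label convention used earlier
# (1 = positive review, 0 = negative review)
sentiment = "positive" if parsed["answer"] == "1" else "negative"
print(parsed["algorithm"], sentiment)
# prints RNN positive
```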


In [34]:
predictor.delete_endpoint()