## Machine Learning Steps

If the data is huge, you may want to sample smaller training sets so you can train many different
models in a reasonable time (be aware that this penalizes complex models such as large neural nets
or Random Forests).
Once again, try to automate these steps as much as possible.

1. Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
2. Measure and compare their performance. For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds.
3. Analyze the most significant variables for each algorithm.
4. Analyze the types of errors the models make. What data would a human have used to avoid these errors?
5. Have a quick round of feature selection and engineering.
6. Have one or two more quick iterations of the five previous steps.
7. Short-list the top three to five most promising models, preferring models that make different types of errors.
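Step 2 above can be sketched in plain Python. This is a minimal illustration, not the book's code: the helper names (`k_fold_indices`, `cross_validate`), the toy data, and the mean-predicting baseline are all assumptions made for the example, and mean absolute error stands in for whatever performance measure is chosen.

```python
import random
import statistics

def k_fold_indices(n_items, n_folds, seed=0):
    """Shuffle item indices and deal them into n_folds roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(fit, predict, xs, ys, n_folds=5):
    """Return the mean and standard deviation of the MAE over the folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), n_folds):
        held_out_set = set(held_out)
        train = [i for i in range(len(xs)) if i not in held_out_set]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [abs(predict(model, xs[i]) - ys[i]) for i in held_out]
        scores.append(statistics.mean(errors))
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical baseline "model": always predict the mean training label.
def fit_mean(xs, ys):
    return statistics.mean(ys)

def predict_mean(model, x):
    return model

toy_x = list(range(20))
toy_y = [1, 2, 3, 4, 5] * 4
cv_mean, cv_std = cross_validate(fit_mean, predict_mean, toy_x, toy_y)
print("MAE: %.3f +/- %.3f" % (cv_mean, cv_std))
```

Reporting the standard deviation alongside the mean makes it easier to tell whether two models really differ or whether the gap is within fold-to-fold noise.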

Source: Hands-On Machine Learning, p. 646.

## Sentiment Analysis options

### Features

- Length of review

*Ignoring word order*
- Frequencies of all relevant words (ignore stop words)
OR
- Frequency of positive words (list from NLTK)
- Frequency of negative words (list from NLTK), etc.
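The order-free features above could look roughly like this. It is only a sketch: the three word sets are tiny hypothetical stand-ins for the NLTK lists mentioned in the notes, and the function names are made up for the example.

```python
from collections import Counter

# Hypothetical word lists, for illustration only; the notes refer to NLTK lists.
STOP_WORDS = {"the", "is", "a", "this", "and", "it", "but"}
POSITIVE_WORDS = {"great", "good", "love", "perfect"}
NEGATIVE_WORDS = {"bad", "horrible", "expensive", "broken"}

def word_frequencies(tokens):
    """Count every token that is not a stop word."""
    return Counter(t for t in tokens if t not in STOP_WORDS)

def polarity_counts(tokens):
    """Count how many tokens appear in the positive and negative lists."""
    positive = sum(1 for t in tokens if t in POSITIVE_WORDS)
    negative = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    return positive, negative

tokens = "this is a great book but it is expensive and great".split()
print(word_frequencies(tokens))  # great: 2, book: 1, expensive: 1
print(polarity_counts(tokens))   # (2, 1)
```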

*Keeping word order*
- One-hot encoding of words -> each word becomes a vector the size of the vocabulary, with a 1 in the row for that word and 0 everywhere else
- Word embeddings -> dense feature vectors for each word, learned by different algorithms
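As a small illustration of one-hot encoding while keeping word order, the sketch below turns each review into a list of vectors, one per word. The helper names and the two toy sentences are assumptions for the example.

```python
def build_vocab(sentences):
    """Map each distinct word to a row index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """Return a vocabulary-sized vector with a single 1 in the word's row."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

sentences = [["not", "good"], ["good", "value"]]
vocab = build_vocab(sentences)
encoded = [[one_hot(w, vocab) for w in s] for s in sentences]
print(vocab)       # {'not': 0, 'good': 1, 'value': 2}
print(encoded[0])  # [[1, 0, 0], [0, 1, 0]]
```

With a real vocabulary of tens of thousands of words these vectors become very large and sparse, which is one reason dense word embeddings are attractive.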

Video on word embeddings: https://www.youtube.com/watch?v=186HUTBQnpY

### Algorithms

#### Non-sequential algorithms

Any classification algorithm, for example:
- Logistic Regression
- Decision trees
- Naive Bayes

#### Sequential algorithms

RNNs (~ BRNNs, LSTMs, GRUs)

## Models Covered In Class

### Linear Regression - Baseline Model
Features: number of positive words, number of negative words

The mean absolute error on the training data is 0.832466 stars
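For reference, the mean absolute error used throughout these results is the average gap, in stars, between the true and predicted ratings. A minimal sketch with made-up ratings:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted star ratings."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values, not taken from the dataset.
true_stars = [5, 4, 1, 3]
predicted_stars = [4.5, 4.0, 2.5, 3.0]
print(mean_absolute_error(true_stars, predicted_stars))  # 0.5
```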

### Random Forests (non-linear model)
Features: number of positive words, number of negative words, after taking negations into account (e.g., "not good")

A nonlinear regressor achieves an MAE of 0.715708 stars
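The notes do not say how negations were handled in class. One simple rule, shown here purely as an assumption, is to flip the polarity of a sentiment word that directly follows a negator. The word sets are hypothetical placeholders.

```python
# Hypothetical word lists, for illustration only.
NEGATORS = {"not", "no", "never", "n't"}
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "horrible", "broken"}

def count_with_negation(tokens):
    """Count positive and negative words, flipping polarity right after a negator."""
    positive = negative = 0
    previous = None
    for token in tokens:
        negated = previous in NEGATORS
        if token in POSITIVE_WORDS:
            if negated:
                negative += 1
            else:
                positive += 1
        elif token in NEGATIVE_WORDS:
            if negated:
                positive += 1
            else:
                negative += 1
        previous = token
    return positive, negative

print(count_with_negation(["not", "good"]))            # (0, 1)
print(count_with_negation(["good", "but", "broken"]))  # (1, 1)
```

This rule misses negations that sit further away ("not very good"), so a wider window or a parser-based approach may be needed in practice.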

### Linear Regression with NLTK Sentiment Intensity Analyser
Features: number of positive words, number of negative words, after taking negations into account (e.g., "not good"), plus the additional features listed under *Features* below

Now the mean absolute error on the training data is 0.758256 stars

On the validation set, we get 0.755795 error for the linear regression

### Random Forests with NLTK Sentiment Intensity Analyser
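Features: number of positive words, number of negative words, after taking negations into account (e.g., "not good"), plus the additional features listed under *Features* below

On the training data, the random forest's mean absolute error is 0.283528 stars

On the validation set, we get 0.731631 error for the random forest regression, far above its training error, which suggests the forest overfits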

* *Features*

       (1) the mean positive sentiment over all sentences
       (2) the mean neutral sentiment over all sentences
       (3) the mean negative sentiment over all sentences
       (4) the maximum positive sentiment over all sentences
       (5) the maximum neutral sentiment over all sentences
       (6) the maximum negative sentiment over all sentences
       (7) length of review (in thousands of characters) - truncate at 2,500
       (8) percentage of exclamation marks (in %)
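The eight features above could be assembled roughly as follows. This sketch assumes the per-sentence sentiment scores are already available as dictionaries with `pos`, `neu` and `neg` keys (the shape of score a sentiment analyser typically returns). It also assumes that "truncate at 2,500" means capping the character count before dividing by 1,000, and that the exclamation percentage is taken over all characters. Both readings are guesses.

```python
def aggregate_sentence_scores(scores):
    """Return features (1)-(6): mean then max of pos, neu, neg over sentences."""
    def mean(values):
        return sum(values) / len(values)
    features = []
    for aggregate in (mean, max):
        for key in ("pos", "neu", "neg"):
            features.append(aggregate([s[key] for s in scores]))
    return features

def length_feature(text, cap=2500):
    """Feature (7): review length in thousands of characters, capped at `cap`."""
    return min(len(text), cap) / 1000.0

def exclamation_percentage(text):
    """Feature (8): share of characters that are exclamation marks, in percent."""
    if not text:
        return 0.0
    return 100.0 * text.count("!") / len(text)

# Made-up scores for a two-sentence review.
scores = [{"pos": 0.5, "neu": 0.5, "neg": 0.0},
          {"pos": 0.1, "neu": 0.6, "neg": 0.3}]
print(aggregate_sentence_scores(scores))
print(length_feature("a" * 3000), exclamation_percentage("wow!!"))
```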


## Plan

- Naive Bayes classifier

https://www.youtube.com/watch?v=tOP5DzKxc20

https://www.youtube.com/watch?v=5YymjfzMpL8

Choices: negate words?

- RNNs (BRNNs?) with word embeddings

# Load Dataset

In [1]:
import os
import json

def load_data(dataset_name):        
    data = []
    with open(dataset_name, 'r') as f:            # use the argument, not the global `dataset`
        for line in f:                            # read file line by line
            item_hash = hash(line)                # we will use this later for partitioning our data 
            item = json.loads(line)               # convert JSON string to Python dict
            item['hash'] = item_hash              # add hash for identification purposes
            data.append(item)
    print("Loaded %d data for dataset %s" % (len(data), dataset_name))
    return data

# load the data...
dataset = 'Baby_5.json'
baby = load_data(dataset)

Loaded 160792 data for dataset Baby_5.json
{'reviewerID': 'A2H4QWDVXARPAU', 'asin': 'B0000TYHD2', 'reviewerName': 'Erin White "Erin"', 'helpful': [7, 8], 'reviewText': "I bought this pump for my new baby because it just as others below have said it looks more comfortable than others and it is! Including Medela. With my other child I encountered breastfeeding problems and had a horrible cheap pump. Now with my new baby she was born with a heart problem (she is fine now after a long road to recovery) and had to stay in the hospital for an extended length of time. Meanwhile I had other children at home and we live 6 hours away from our family and so I had no choice but to divide my time between the hospital and home, which meant I needed a hospital grade pump originally I rented one from the hospital (Medela) and it made my breasts hurt really bad. On the way home from the hospital I stopped in at Babiesrus and bought this pump because it was on our registry (we studied and found it to be
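
One caveat about the `hash` field stored by `load_data`: in Python 3 the built-in `hash` of a string is randomised per process unless `PYTHONHASHSEED` is fixed, so the values differ between runs. If the hash is meant to drive a reproducible partition, a digest-based hash is one alternative. The sketch below is an assumption about how that could look, not what the notebook does, and the function names are made up.

```python
import hashlib

def stable_hash(line):
    """Integer derived from the SHA-256 digest of the line (same in every run)."""
    digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
    return int(digest, 16)

def assign_partition(line, test_percent=20):
    """Send a line to 'test' when its hash falls in the lowest test_percent of 100 buckets."""
    return "test" if stable_hash(line) % 100 < test_percent else "train"

# Toy lines standing in for lines of the JSON file.
lines = ['{"overall": 5.0}', '{"overall": 1.0}', '{"overall": 3.0}']
print([assign_partition(line) for line in lines])
```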

In [2]:
# ... and have a look at an example item (item number 9427):
print(baby[9427])

{'reviewerID': 'A2BV4U9FAXALU0', 'asin': 'B0000TYHD2', 'reviewerName': 'Jen9254', 'helpful': [55, 56], 'reviewText': 'After reading the other reviews, I have a couple things to add.I want to clearly state I do not have a lot of experience with other pumps.  Here\'s my experience... My daughter was having a lot of difficulty latching on.  (For any other mom\'s having that problem-- get help, hire a lactation consultant.  It was BY FAR the best $100 I\'ve spent in regards to my baby.)  I bought a Medela manual pump because I wasn\'t sure how long breastfeeding was going to last and I didn\'t want to spend a lot.  I used it a few days, then brought the lactation consultant in and rented a hospital grade Medala from her for a week or so.  After the first visit from the L.C., my daughter figured it out, and for two months we had little need for a pump.  The manual pump worked just fine for as often as I used it.  About a month ago, I bought the Playtex to return to work because of the posit

# Feature Extraction

In [5]:
import pandas as pd

In [6]:
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
eng_stopwords = set(stopwords.words('english'))

In [4]:
import keras as K

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [7]:
baby_df = pd.DataFrame(baby)

In [8]:
baby_df['reviewText'][0:10] #glimpse reviewText column

0    Perfect for new parents. We were able to keep ...
1    This book is such a life saver.  It has been s...
2    Helps me know exactly how my babies day has go...
3    I bought this a few times for my older son and...
4    I wanted an alternative to printing out daily ...
5    This is great for basics, but I wish the space...
6    My 3 month old son spend half of his days with...
7    This book is perfect!  I'm a first time new mo...
8    I wanted to love this, but it was pretty expen...
9    The Baby Tracker brand books are the absolute ...
Name: reviewText, dtype: object

In [89]:
def tokenize_text(input_series, punctuation=True, stop_words=True):
    # Input: pandas Series containing text. Output: pandas Series containing tokenized text.
    # Splits text into lower-case tokens, dropping tokens with no letters, i.e. punctuation
    # and numbers (if punctuation=False), and stop words (if stop_words=False).
    indices = []
    token_lists = []
    for i, item in input_series.items():  # Series.iteritems() was removed in pandas 2.0
        tokens = []
        for word in word_tokenize(item.lower()):

            if not punctuation:
                if not any(ch.isalpha() for ch in word):
                    continue

            if not stop_words:
                if word in eng_stopwords:
                    continue

            tokens.append(word)
        # collect in plain lists: Series.append() was removed in pandas 2.0 and copied on every call
        indices.append(i)
        token_lists.append(tokens)
        if i % 100 == 0:  # progress print for testing
            print(i)
    return pd.Series(token_lists, index=indices)

## Testing tokenize_text function

In [86]:
#testing
sample = baby_df['reviewText'][0:100]
print(sample)

0     Perfect for new parents. We were able to keep ...
1     This book is such a life saver.  It has been s...
2     Helps me know exactly how my babies day has go...
3     I bought this a few times for my older son and...
4     I wanted an alternative to printing out daily ...
5     This is great for basics, but I wish the space...
6     My 3 month old son spend half of his days with...
7     This book is perfect!  I'm a first time new mo...
8     I wanted to love this, but it was pretty expen...
9     The Baby Tracker brand books are the absolute ...
10    During your postpartum stay at the hospital th...
11    I use this so that our babysitter (grandma) ca...
12    This book is a great way for keeping track of ...
13    Has columns for all the info I need at a glanc...
14    I like this log, but think it would work bette...
15    My wife and I have a six month old baby boy an...
16    I thought keeping a simple handwritten journal...
17    Easy to use, simple! I got this when my ba

In [87]:
baby_df_t = tokenize_text(sample)

0


In [88]:
print(baby_df_t)

0     [perfect, for, new, parents, ., we, were, able...
1     [this, book, is, such, a, life, saver, ., it, ...
2     [helps, me, know, exactly, how, my, babies, da...
3     [i, bought, this, a, few, times, for, my, olde...
4     [i, wanted, an, alternative, to, printing, out...
5     [this, is, great, for, basics, ,, but, i, wish...
6     [my, 3, month, old, son, spend, half, of, his,...
7     [this, book, is, perfect, !, i, 'm, a, first, ...
8     [i, wanted, to, love, this, ,, but, it, was, p...
9     [the, baby, tracker, brand, books, are, the, a...
10    [during, your, postpartum, stay, at, the, hosp...
11    [i, use, this, so, that, our, babysitter, (, g...
12    [this, book, is, a, great, way, for, keeping, ...
13    [has, columns, for, all, the, info, i, need, a...
14    [i, like, this, log, ,, but, think, it, would,...
15    [my, wife, and, i, have, a, six, month, old, b...
16    [i, thought, keeping, a, simple, handwritten, ...
17    [easy, to, use, ,, simple, !, i, got, this

## Add features

In [75]:
baby_df['review_tokens'] = tokenize_text(baby_df['reviewText'])

In [106]:
review_tokens_nostop = pd.Series([[x for x in review if x not in eng_stopwords] for review in baby_df['review_tokens']])

In [107]:
baby_df['review_tokens_nostop'] = review_tokens_nostop

In [111]:
review_tokens_nopunct = pd.Series([[x for x in review if any(i.isalpha() for i in x)] for review in baby_df['review_tokens']])

In [112]:
baby_df['review_tokens_nopunct'] = review_tokens_nopunct

In [116]:
review_tokens_nostop_nopunct = pd.Series([[x for x in review if x not in eng_stopwords] for review in baby_df['review_tokens_nopunct']])

In [117]:
baby_df['review_tokens_nostop_nopunct'] = review_tokens_nostop_nopunct

In [124]:
baby_df['number_of_words']=[len(x) for x in baby_df['review_tokens_nopunct']]

In [151]:
baby_df['rating']=baby_df['overall']-1

In [161]:
l = len(baby_df)
print(l)

160792


### Adding label features

In [175]:
baby_df['labels']=list(K.utils.to_categorical(np.array(baby_df['rating'])))

In [177]:
baby_df['labels_2']=baby_df['rating']>=3
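
The three label columns can be traced for a single review with plain Python. This is only a sketch to make the mapping explicit: `to_one_hot` and `labels_from_stars` are made-up names, and the one-hot helper mimics what `to_categorical` produces for five classes.

```python
def to_one_hot(class_index, num_classes=5):
    """One-hot vector with a 1.0 at class_index."""
    vector = [0.0] * num_classes
    vector[int(class_index)] = 1.0
    return vector

def labels_from_stars(overall):
    """Map a 1-5 star score to (rating 0-4, one-hot label, positive flag)."""
    rating = overall - 1
    return rating, to_one_hot(rating), rating >= 3

print(labels_from_stars(5.0))  # (4.0, [0.0, 0.0, 0.0, 0.0, 1.0], True)
print(labels_from_stars(3.0))  # (2.0, [0.0, 0.0, 1.0, 0.0, 0.0], False)
```

So `labels_2` is True for 4-star and 5-star reviews and False for 1 to 3 stars.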

In [178]:
baby_df #display baby data frame

Unnamed: 0,asin,hash,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,review_tokens,review_tokens_nostop,review_tokens_nopunct,review_tokens_nostop_nopunct,number_of_words,rating,labels,labels_2
0,097293751X,-244995769641145257,"[0, 0]",5.0,Perfect for new parents. We were able to keep ...,"07 16, 2013",A1HK2FQW6KXQB2,"Amanda Johnsen ""Amanda E. Johnsen""",Awesine,1373932800,"[perfect, for, new, parents, ., we, were, able...","[perfect, new, parents, ., able, keep, track, ...","[perfect, for, new, parents, we, were, able, t...","[perfect, new, parents, able, keep, track, bab...",48,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True
1,097293751X,-6118819134643974020,"[0, 0]",5.0,This book is such a life saver. It has been s...,"06 29, 2013",A19K65VY14D13R,angela,Should be required for all new parents!,1372464000,"[this, book, is, such, a, life, saver, ., it, ...","[book, life, saver, ., helpful, able, go, back...","[this, book, is, such, a, life, saver, it, has...","[book, life, saver, helpful, able, go, back, t...",103,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True
2,097293751X,7207861812536044516,"[0, 0]",5.0,Helps me know exactly how my babies day has go...,"03 19, 2014",A2LL1TGG90977E,Carter,Grandmother watching baby,1395187200,"[helps, me, know, exactly, how, my, babies, da...","[helps, know, exactly, babies, day, gone, moth...","[helps, me, know, exactly, how, my, babies, da...","[helps, know, exactly, babies, day, gone, moth...",48,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True
3,097293751X,6948947738823362260,"[0, 0]",5.0,I bought this a few times for my older son and...,"08 17, 2013",A5G19RYX8599E,cfpurplerose,repeat buyer,1376697600,"[i, bought, this, a, few, times, for, my, olde...","[bought, times, older, son, bought, newborn, ....","[i, bought, this, a, few, times, for, my, olde...","[bought, times, older, son, bought, newborn, s...",165,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True
4,097293751X,4717703183193021843,"[0, 0]",4.0,I wanted an alternative to printing out daily ...,"04 1, 2014",A2496A4EWMLQ7,C. Jeter,Great,1396310400,"[i, wanted, an, alternative, to, printing, out...","[wanted, alternative, printing, daily, log, sh...","[i, wanted, an, alternative, to, printing, out...","[wanted, alternative, printing, daily, log, sh...",74,3.0,"[0.0, 0.0, 0.0, 1.0, 0.0]",True
5,097293751X,-2982172729056392891,"[0, 0]",4.0,"This is great for basics, but I wish the space...","05 10, 2014",A3OQEVD4C7G3L3,CMB,"Great for basics, but not detail",1399680000,"[this, is, great, for, basics, ,, but, i, wish...","[great, basics, ,, wish, space, write, things,...","[this, is, great, for, basics, but, i, wish, t...","[great, basics, wish, space, write, things, bi...",35,3.0,"[0.0, 0.0, 0.0, 1.0, 0.0]",True
6,097293751X,6512887130398459368,"[0, 0]",5.0,My 3 month old son spend half of his days with...,"07 17, 2013",ATZDT4B1U7NL,HYM,Perfect for the working mom,1374019200,"[my, 3, month, old, son, spend, half, of, his,...","[3, month, old, son, spend, half, days, mother...","[my, month, old, son, spend, half, of, his, da...","[month, old, son, spend, half, days, mother, h...",66,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True
7,097293751X,-668517764531636637,"[3, 3]",5.0,This book is perfect! I'm a first time new mo...,"01 27, 2013",A3NMPMELAZC8ZY,Jakell,Great for newborns,1359244800,"[this, book, is, perfect, !, i, 'm, a, first, ...","[book, perfect, !, 'm, first, time, new, mom, ...","[this, book, is, perfect, i, 'm, a, first, tim...","[book, perfect, 'm, first, time, new, mom, boo...",48,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True
8,097293751X,4875018270201658744,"[0, 0]",3.0,"I wanted to love this, but it was pretty expen...","04 22, 2014",A1ZSTU6RKY1JCL,Jen,"It's ok, but I liked a regular weekly planner ...",1398124800,"[i, wanted, to, love, this, ,, but, it, was, p...","[wanted, love, ,, pretty, expensive, months, w...","[i, wanted, to, love, this, but, it, was, pret...","[wanted, love, pretty, expensive, months, wort...",101,2.0,"[0.0, 0.0, 1.0, 0.0, 0.0]",False
9,097293751X,4517693664679228230,"[0, 0]",5.0,The Baby Tracker brand books are the absolute ...,"11 19, 2013",A1TFH58BMFJCR3,killerbee,Best for Tracking!,1384819200,"[the, baby, tracker, brand, books, are, the, a...","[baby, tracker, brand, books, absolute, best, ...","[the, baby, tracker, brand, books, are, the, a...","[baby, tracker, brand, books, absolute, best, ...",122,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True


# RNNs

In [3]:
import pdb

In [134]:
import numpy as np

## Load embeddings

Glove embeddings source: https://www.kaggle.com/watts2/glove6b50dtxt

In [135]:
embeddings_index = {}
with open('glove.6B.50d.txt', encoding='utf8') as f:  # file is closed even if parsing fails
    for line in f:
        values = line.split()
        word = values[0]                                 # first field is the word
        coefs = np.asarray(values[1:], dtype='float32')  # remaining fields are the vector
        embeddings_index[word] = coefs

In [196]:
#explore embeddings
for i,j in enumerate(embeddings_index.items()):
    print(j[0])
    if i>100:
        print(len(j[1]))
        break

the
,
.
of
to
and
in
a
"
's
for
-
that
on
is
was
said
with
he
as
it
by
at
(
)
from
his
''
``
an
be
has
are
have
but
were
not
this
who
they
had
i
which
will
their
:
or
its
one
after
new
been
also
we
would
two
more
'
first
about
up
when
year
there
all
--
out
she
other
people
n't
her
percent
than
over
into
last
some
government
time
$
you
years
if
no
world
can
three
do
;
president
only
state
million
could
us
most
_
against
u.s.
so
them
50


In [200]:
EMBEDDINGS_LENGTH=50

## Encoding reviews

In [181]:
# create the tokenizer
t = K.preprocessing.text.Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(baby_df['reviewText'])

In [183]:
# summarize what was learned
print(t.word_counts)



In [186]:
print(t.document_count)

160792


In [187]:
print(t.word_index)



In [188]:
print(t.word_docs)



In [197]:
VOCAB_SIZE=len(t.word_index)

In [199]:
VOCAB_SIZE

67994

In [223]:
# row 0 is reserved for padding; the last row (VOCAB_SIZE + 1) is for unknown words
embedding_matrix = np.zeros((VOCAB_SIZE + 2, EMBEDDINGS_LENGTH))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index keep their all-zeros row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [224]:
embedding_matrix[0:20]

array([[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00],
       [ 4.18

In [225]:
embedding_matrix[VOCAB_SIZE + 1] #unknown words get mapped to this row

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
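
The row layout of the embedding matrix is easier to see on a toy example. The sketch below uses plain lists instead of NumPy, a three-word hypothetical vocabulary with 1-based indices (as the Keras `Tokenizer` produces), and made-up 2-dimensional vectors.

```python
def build_embedding_matrix(word_index, embeddings, dim):
    """Row 0 is for padding, rows 1..V for known words, row V + 1 for unknown words."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 2)]
    for word, row in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:       # words without an embedding stay all-zeros
            matrix[row] = list(vector)
    return matrix

word_index = {"the": 1, "good": 2, "zzzunseen": 3}            # hypothetical vocabulary
embeddings = {"the": [0.1, 0.2], "good": [0.3, 0.4]}           # made-up vectors
matrix = build_embedding_matrix(word_index, embeddings, dim=2)
unknown_row = len(word_index) + 1
for row_number, row in enumerate(matrix):
    print(row_number, row)
```

Note that padding (row 0), vocabulary words missing from GloVe, and unknown tokens all end up as zero vectors, so the network cannot tell them apart from the embedding alone.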

In [229]:
#encoding words to respective indices
baby_df['x1']=[[ t.word_index.get(word, VOCAB_SIZE + 1) for word in x] for x in baby_df['review_tokens']]

In [230]:
baby_df

Unnamed: 0,asin,hash,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime,review_tokens,review_tokens_nostop,review_tokens_nopunct,review_tokens_nostop_nopunct,number_of_words,rating,labels,labels_2,x1
0,097293751X,-244995769641145257,"[0, 0]",5.0,Perfect for new parents. We were able to keep ...,"07 16, 2013",A1HK2FQW6KXQB2,"Amanda Johnsen ""Amanda E. Johnsen""",Awesine,1373932800,"[perfect, for, new, parents, ., we, were, able...","[perfect, new, parents, ., able, keep, track, ...","[perfect, for, new, parents, we, were, able, t...","[perfect, new, parents, able, keep, track, bab...",48,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True,"[152, 9, 204, 586, 67995, 19, 106, 185, 3, 127..."
1,097293751X,-6118819134643974020,"[0, 0]",5.0,This book is such a life saver. It has been s...,"06 29, 2013",A19K65VY14D13R,angela,Should be required for all new parents!,1372464000,"[this, book, is, such, a, life, saver, ., it, ...","[book, life, saver, ., helpful, able, go, back...","[this, book, is, such, a, life, saver, it, has...","[book, life, saver, helpful, able, go, back, t...",103,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True,"[8, 834, 7, 451, 6, 572, 1189, 67995, 5, 50, 1..."
2,097293751X,7207861812536044516,"[0, 0]",5.0,Helps me know exactly how my babies day has go...,"03 19, 2014",A2LL1TGG90977E,Carter,Grandmother watching baby,1395187200,"[helps, me, know, exactly, how, my, babies, da...","[helps, know, exactly, babies, day, gone, moth...","[helps, me, know, exactly, how, my, babies, da...","[helps, know, exactly, babies, day, gone, moth...",48,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True,"[539, 80, 205, 619, 119, 12, 233, 212, 50, 118..."
3,097293751X,6948947738823362260,"[0, 0]",5.0,I bought this a few times for my older son and...,"08 17, 2013",A5G19RYX8599E,cfpurplerose,repeat buyer,1376697600,"[i, bought, this, a, few, times, for, my, olde...","[bought, times, older, son, bought, newborn, ....","[i, bought, this, a, few, times, for, my, olde...","[bought, times, older, son, bought, newborn, s...",165,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True,"[4, 87, 8, 6, 157, 243, 9, 12, 352, 75, 2, 17,..."
4,097293751X,4717703183193021843,"[0, 0]",4.0,I wanted an alternative to printing out daily ...,"04 1, 2014",A2496A4EWMLQ7,C. Jeter,Great,1396310400,"[i, wanted, an, alternative, to, printing, out...","[wanted, alternative, printing, daily, log, sh...","[i, wanted, an, alternative, to, printing, out...","[wanted, alternative, printing, daily, log, sh...",74,3.0,"[0.0, 0.0, 0.0, 1.0, 0.0]",True,"[4, 284, 85, 1524, 3, 7983, 39, 919, 7886, 612..."
5,097293751X,-2982172729056392891,"[0, 0]",4.0,"This is great for basics, but I wish the space...","05 10, 2014",A3OQEVD4C7G3L3,CMB,"Great for basics, but not detail",1399680000,"[this, is, great, for, basics, ,, but, i, wish...","[great, basics, ,, wish, space, write, things,...","[this, is, great, for, basics, but, i, wish, t...","[great, basics, wish, space, write, things, bi...",35,3.0,"[0.0, 0.0, 0.0, 1.0, 0.0]",True,"[8, 7, 41, 9, 5871, 67995, 16, 4, 227, 1, 365,..."
6,097293751X,6512887130398459368,"[0, 0]",5.0,My 3 month old son spend half of his days with...,"07 17, 2013",ATZDT4B1U7NL,HYM,Perfect for the working mom,1374019200,"[my, 3, month, old, son, spend, half, of, his,...","[3, month, old, son, spend, half, days, mother...","[my, month, old, son, spend, half, of, his, da...","[month, old, son, spend, half, days, mother, h...",66,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True,"[12, 128, 145, 65, 75, 818, 550, 10, 84, 454, ..."
7,097293751X,-668517764531636637,"[3, 3]",5.0,This book is perfect! I'm a first time new mo...,"01 27, 2013",A3NMPMELAZC8ZY,Jakell,Great for newborns,1359244800,"[this, book, is, perfect, !, i, 'm, a, first, ...","[book, perfect, !, 'm, first, time, new, mom, ...","[this, book, is, perfect, i, 'm, a, first, tim...","[book, perfect, 'm, first, time, new, mom, boo...",48,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True,"[8, 834, 7, 152, 67995, 4, 24030, 6, 95, 68, 2..."
8,097293751X,4875018270201658744,"[0, 0]",3.0,"I wanted to love this, but it was pretty expen...","04 22, 2014",A1ZSTU6RKY1JCL,Jen,"It's ok, but I liked a regular weekly planner ...",1398124800,"[i, wanted, to, love, this, ,, but, it, was, p...","[wanted, love, ,, pretty, expensive, months, w...","[i, wanted, to, love, this, but, it, was, pret...","[wanted, love, pretty, expensive, months, wort...",101,2.0,"[0.0, 0.0, 1.0, 0.0, 0.0]",False,"[4, 284, 3, 56, 8, 67995, 16, 5, 20, 187, 409,..."
9,097293751X,4517693664679228230,"[0, 0]",5.0,The Baby Tracker brand books are the absolute ...,"11 19, 2013",A1TFH58BMFJCR3,killerbee,Best for Tracking!,1384819200,"[the, baby, tracker, brand, books, are, the, a...","[baby, tracker, brand, books, absolute, best, ...","[the, baby, tracker, brand, books, are, the, a...","[baby, tracker, brand, books, absolute, best, ...",122,4.0,"[0.0, 0.0, 0.0, 0.0, 1.0]",True,"[1, 24, 10173, 414, 1733, 21, 1, 2446, 209, 10..."


In [211]:
MAX_SEQUENCE_LENGTH=300

In [239]:
#K.preprocessing.sequence.pad_sequences(baby_df['x1'][10], maxlen=MAX_SEQUENCE_LENGTH)

In [236]:
#padding to uniform length
baby_df['x1']=[ x[:MAX_SEQUENCE_LENGTH] if len(x)>=MAX_SEQUENCE_LENGTH else x + [0]*(MAX_SEQUENCE_LENGTH-len(x)) for x in baby_df['x1']]
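
The one-line padding expression above is equivalent to the small helper below (the name `pad_or_truncate` is made up for the example). It keeps the start of long reviews and appends zeros to short ones.

```python
def pad_or_truncate(sequence, max_length, pad_value=0):
    """Cut the sequence to max_length, or right-pad it with pad_value."""
    if len(sequence) >= max_length:
        return sequence[:max_length]
    return sequence + [pad_value] * (max_length - len(sequence))

print(pad_or_truncate([152, 9, 204], 5))        # [152, 9, 204, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))   # [1, 2, 3, 4]
```

This differs from the defaults of the commented-out Keras `pad_sequences` call, which pads and truncates at the start of the sequence unless `padding='post'` and `truncating='post'` are passed.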

In [238]:
baby_df['x1'][0]

[152,
 9,
 204,
 586,
 67995,
 19,
 106,
 185,
 3,
 127,
 1924,
 10,
 24,
 7412,
 462,
 67995,
 226,
 2,
 112,
 399,
 5012,
 9,
 1,
 95,
 113,
 2,
 6,
 550,
 77,
 10,
 49,
 572,
 67995,
 147,
 572,
 268,
 28,
 1,
 3083,
 44,
 1506,
 3506,
 66,
 6147,
 61,
 19,
 51,
 5,
 46,
 169,
 74,
 67995,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0

In [241]:
baby_df['labels'][[0,2,3]] #testing

0    [0.0, 0.0, 0.0, 0.0, 1.0]
2    [0.0, 0.0, 0.0, 0.0, 1.0]
3    [0.0, 0.0, 0.0, 0.0, 1.0]
Name: labels, dtype: object

## Splitting data into train, cv and test

In [245]:
VALIDATION_SPLIT=0.2
TEST_SPLIT=0.2

indices = np.arange(t.document_count)
np.random.shuffle(indices)
test_indices=indices[:int(TEST_SPLIT * t.document_count)]
train_cv_indices = indices[int(TEST_SPLIT * t.document_count):]
cv_indices=train_cv_indices[:int(VALIDATION_SPLIT * t.document_count)]
train_indices = train_cv_indices[int(VALIDATION_SPLIT * t.document_count):]

#testing
print(len(test_indices))
print(len(train_cv_indices))
print(len(cv_indices))
print(len(train_indices))

32158
128634
32158
96476
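
The same 60/20/20 split can be written as a reusable function with a fixed seed, so the partition is repeatable across runs. This is a sketch using the standard library rather than NumPy; the function name and the seed value are assumptions.

```python
import random

def split_indices(n_items, test_split=0.2, validation_split=0.2, seed=42):
    """Shuffle 0..n_items-1 and cut it into train, validation and test index lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(test_split * n_items)
    n_valid = int(validation_split * n_items)
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test

train, valid, test = split_indices(160792)
print(len(train), len(valid), len(test))  # 96476 32158 32158
```

Both split fractions are taken relative to the full dataset, which matches the sizes printed above.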


In [246]:
data_train = baby_df['x1'][train_indices]
labels_train = baby_df['labels'][train_indices]
labels_2_train = baby_df['labels_2'][train_indices]

data_cv = baby_df['x1'][cv_indices]
labels_cv = baby_df['labels'][cv_indices]
labels_2_cv = baby_df['labels_2'][cv_indices]

data_test = baby_df['x1'][test_indices]
labels_test = baby_df['labels'][test_indices]
labels_2_test = baby_df['labels_2'][test_indices]

In [259]:
data_train #checking format

array([list([4, 56, 8, 194, 486, 67995, 12, 98, 123, 8, 2, 3936, 657, 10, 68, 14, 49, 876, 81, 67995, 188, 8, 48, 44, 67995, 1393, 4, 44, 12103, 49, 876, 68, 67995, 16, 23, 448, 124, 8, 194, 486, 50, 111, 85, 1035, 3, 49, 162, 2319, 67995, 48, 107, 128, 7743, 10, 876, 68, 9, 677, 2742, 6, 212, 67995, 115, 6034, 67995, 42, 22, 17, 6, 24, 13, 50, 67995, 1641, 54, 3, 876, 68, 67995, 8, 7, 1, 280, 3, 138, 67995, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

## Training Model

In [253]:
embedding_layer = K.layers.Embedding(VOCAB_SIZE+2,
                            EMBEDDINGS_LENGTH,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

### Binary model -> indicates positive or negative

In [281]:
inp = K.layers.Input(shape=(MAX_SEQUENCE_LENGTH,))
x = embedded_sequences = embedding_layer(inp)
x = K.layers.Bidirectional(K.layers.LSTM(50))(x) #LSTM layer with 50 hidden units
x = K.layers.Dropout(0.2)(x) #to prevent overfitting
x = K.layers.Dense(1, activation="sigmoid")(x)
model = K.models.Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [282]:
x_train = np.array([np.array(x) for x in data_train])

In [283]:
x_train

array([[  4,  56,   8, ...,   0,   0,   0],
       [  4,  17, 412, ...,   0,   0,   0],
       [ 30, 130,  34, ...,   0,   0,   0],
       ...,
       [  1, 960,   2, ...,   0,   0,   0],
       [ 30, 225,  21, ...,   0,   0,   0],
       [  4, 117,   8, ...,   0,   0,   0]])

In [284]:
y_train = np.array(labels_2_train.astype(int))

In [285]:
y_train

array([1, 1, 1, ..., 1, 0, 0])

In [291]:
x_valid = np.array([np.array(x) for x in data_cv])
y_valid = np.array(labels_2_cv.astype(int))

In [292]:
model.fit(x_train, y_train, validation_data=(x_valid, y_valid),
          epochs=2);

Train on 96476 samples, validate on 32158 samples
Epoch 1/2
Epoch 2/2


#### Exploring predictions

In [297]:
print(model.predict(x_valid[0:1]))
print(baby_df['reviewText'][cv_indices[0]])

[[0.97889984]]
This dresser looks very nice and works great for my baby's room. It did require some assembly, but you cannot expect them to ship it all put together!


In [306]:
print(model.predict(x_valid[6:7]))
print(baby_df['reviewText'][cv_indices[6]])

[[0.16865082]]
Why why why would they make these just "loose" for a kids' diaper pail? Where am I supposed to put them? Directly in the bag with the diapers? I don't understand. Secondly, as someone else mentioned, once the package is opened (with all 5 in one package), they start to disintegrate (over time, of course, but they are exposed to air). So, unless you plan on using them all at once, they will all go being used all at once whether you like it or not, because there is no "open/close" plastic casing on the air freshener themselves, like some of them used to have, and they are not individually packaged, so they are now all open and refreshing their own open packaging in a drawer (minus one which is loose inside the diaper pail now). What worries me, too, is that these don't stick to something like the inside of the pail, so my daughter, who is crawling, might find one and put it in her mouth.Horrible packaging, and needs an option to stick them on somewhere. Didn't notice that 

### Classification model with 5 stars

In [316]:
y5_train = np.array([np.array(x) for x in labels_train])
print(y5_train)

[[0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0.]
 ...
 [0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]]


In [317]:
y5_valid = np.array([np.array(x) for x in labels_cv])
print(np.array(y5_valid))

[[0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1.]
 ...
 [0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]]


In [312]:
inp = K.layers.Input(shape=(MAX_SEQUENCE_LENGTH,))
x = embedded_sequences = embedding_layer(inp)
x = K.layers.Bidirectional(K.layers.LSTM(50))(x) #LSTM layer with 50 hidden units
x = K.layers.Dropout(0.2)(x) #to prevent overfitting
x = K.layers.Dense(5, activation="softmax")(x)
model = K.models.Model(inputs=inp, outputs=x)
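# Caution: 'categorical_crossentropy' is the usual loss for a 5-way softmax;
# 'binary_crossentropy' is kept here because the results reported in this notebook were produced with it.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy','mean_squared_error'])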

In [318]:
model.fit(x_train, y5_train, validation_data=(x_valid, y5_valid),
          epochs=2);

Train on 96476 samples, validate on 32158 samples
Epoch 1/2
Epoch 2/2


#### Exploring predictions

In [320]:
p1=model.predict(x_valid[0:1])
print(baby_df['reviewText'][cv_indices[0]])
print(baby_df['rating'][cv_indices[0]])
print('Prediction',p1)

This dresser looks very nice and works great for my baby's room. It did require some assembly, but you cannot expect them to ship it all put together!
4.0
Prediction [[0.00211058 0.01371254 0.09351384 0.43025813 0.46040493]]
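
A probability vector like the one above can be reduced to a single rating in two common ways: take the most likely class, or take the probability-weighted average. The helper names below are made up; the numbers are the prediction printed above.

```python
def predicted_class(probabilities):
    """Index of the most likely class (0-4, i.e. stars minus one)."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

def expected_rating(probabilities):
    """Probability-weighted average class index."""
    return sum(i * p for i, p in enumerate(probabilities))

p1 = [0.00211058, 0.01371254, 0.09351384, 0.43025813, 0.46040493]
print(predicted_class(p1))            # 4, which matches the true rating of 4.0
print(round(expected_rating(p1), 2))  # 3.33
```

The expected rating reflects how close the model was to choosing class 3 instead, which the argmax hides.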


In [321]:
p2=model.predict(x_valid[6:7])
print(baby_df['reviewText'][cv_indices[6]])
print(baby_df['rating'][cv_indices[6]])
print('Prediction',p2)

Why why why would they make these just "loose" for a kids' diaper pail? Where am I supposed to put them? Directly in the bag with the diapers? I don't understand. Secondly, as someone else mentioned, once the package is opened (with all 5 in one package), they start to disintegrate (over time, of course, but they are exposed to air). So, unless you plan on using them all at once, they will all go being used all at once whether you like it or not, because there is no "open/close" plastic casing on the air freshener themselves, like some of them used to have, and they are not individually packaged, so they are now all open and refreshing their own open packaging in a drawer (minus one which is loose inside the diaper pail now). What worries me, too, is that these don't stick to something like the inside of the pail, so my daughter, who is crawling, might find one and put it in her mouth.Horrible packaging, and needs an option to stick them on somewhere. Didn't notice that they were helpi

# Conclusion / Summary

## For the binary classification model
### On Training Set
loss: 0.3553 - acc: 0.8405 
### On CV Set
val_loss: 0.3313 - val_acc: 0.8511

## For the 5-class classification model
### On Training Set
loss: 0.3180 - acc: 0.8674 - mean_squared_error: 0.0986 
### On CV Set
val_loss: 0.3123 - val_acc: 0.8682 - val_mean_squared_error: 0.0974
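
A caution when reading the 5-class accuracy: the model was compiled with `binary_crossentropy`, and as far as I know Keras 2.x then reports `accuracy` as per-output binary accuracy rather than the share of reviews whose class is predicted correctly. The sketch below, with made-up predictions, shows how far the two measures can diverge. Both functions are illustrative re-implementations, not Keras code.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of individual outputs that land on the correct side of the threshold."""
    hits = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            hits += int((p > threshold) == (t > threshold))
            total += 1
    return hits / total

def categorical_accuracy(y_true, y_pred):
    """Fraction of rows whose most likely class matches the true class."""
    def argmax(row):
        return max(range(len(row)), key=lambda i: row[i])
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Made-up example: the same hesitant prediction for a 5-star and a 4-star review.
y_true = [[0, 0, 0, 0, 1], [0, 0, 0, 1, 0]]
y_pred = [[0.0, 0.05, 0.1, 0.45, 0.4], [0.0, 0.05, 0.1, 0.45, 0.4]]
print(binary_accuracy(y_true, y_pred))       # 0.8
print(categorical_accuracy(y_true, y_pred))  # 0.5
```

With five one-hot outputs, predicting "no class" everywhere already scores 0.8 on the binary measure, so the 5-class figures are not directly comparable with the binary model's accuracy.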

## Possible future investigation
How the models perform for text sequences without stop words and/or punctuation