# 1. Sentiment analysis

Using the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/), we want to build a regression model that predicts ratings on a 1-10 scale. You have an example train and test set in the `dataset` folder.

### 1.1 Regression Model

Use a feedforward neural network and the NLP techniques we've seen so far to train the best model you can on this dataset.

### 1.2 RNN model

Train an RNN to do the sentiment analysis regression. The RNN should consist simply of an embedding layer (to turn word IDs into word vectors) and a recurrent block (GRU or LSTM) feeding into an output layer.

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.set_option('display.max_colwidth', 170) # widen the column display so more of each review is visible

train = pd.read_csv('dataset/example_train_imdb_reviews.csv', encoding='utf-8')
test = pd.read_csv('dataset/example_test_imdb_reviews.csv', encoding='utf-8')
train

Unnamed: 0,Rating,Review
0,2,this movie only gets a second star because i work downtown and liked seeing it destroyed. the effects were pretty good- i hear it was the most expensive Korean film e...
1,8,"As I watched this movie, and I began to see its' characters develop I could feel this would be an excellent picture. When you get that feeling, and the movie indeed f..."
2,4,"this seemed an odd combination of Withnail and I with A Room with a View.. sometimes it worked, other times it did not. tragedy that they changed the name for the US ..."
3,9,"When I saw the Exterminators of year 3000 at first time, I had no expectations for that movie. Although, it wasn't so bad as I was thought. It's kind of Italian versi..."
4,9,"This is a very entertaining flick, considering the budget and its length. The storyline is hardly ever touched on in the movie world so it also brought a sense of nov..."
...,...,...
95,2,"Oh my. I decided to go out to the cinemas with some friends, wanting to watch one of those mild, feel-good Christmas movies, and I walk out disgusted. The movie faile..."
96,7,"It appears even the director doesn't like this film,but for me I think he's being a bit harsh on himself. Sure it's not perfect, but there are some atmospheric shots..."
97,9,"The thing I remember most about this film is that it used to air on local KTLA TV (Ch. 5) during every Christmas season during the mid to late 70s, mainly due to the ..."
98,7,"I recently saw I.Q. and even though I'm not a romantic comedy type of gal, I think that it was just a nice and sweet movie to watch. So many movies in my opinion lack..."


In [3]:
import re
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gayar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gayar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [5]:
# Import Lemmatizer from NLTK
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Function that receives a list of words and lemmatizes them (first as nouns, then as verbs):
def lemma_stem_text(words_list):
    # Lemmatizer
    text = [lemmatizer.lemmatize(token.lower()) for token in words_list]
    text = [lemmatizer.lemmatize(token.lower(), "v") for token in text]
    return text

In [6]:
from bs4 import BeautifulSoup

# Clean a raw review and return a list of lemmatized tokens
def clean_text(raw_text):
    # 1. Remove HTML tags (an explicit parser avoids the bs4 "no parser specified" warning)
    raw_text = BeautifulSoup(raw_text, "html.parser").get_text()

    # 2. Replace every non-letter character with a space
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)

    # 3. Convert to lower case and split into individual words
    words = letters_only.lower().split()

    # 4. Build the stop-word set (English plus Indonesian)
    stops = set(stopwords.words("english"))
    stops_indo = set(stopwords.words("indonesian"))
    stops.update(stops_indo)

    # 5. Remove stop words
    words_tmp = [w for w in words if w not in stops]

    # 6. Apply the lemmatization function
    words_lemm = lemma_stem_text(words_tmp)

    # 7. Return the cleaned tokens
    return words_lemm

In [7]:
# run text cleaning on longest review

test_review = train['Review'].str.len().argmax()
print('The longest review is :')
print(train['Review'][test_review])

The longest review is :
It breaks my heart that this movie is not appreciated as it should be. its very underrated. people forgot what movies are really about, nowadays they only think about bum bum movies, which can be quite fun watching with popcorn and friends, like transformers, movies which are oriented, with hyper mega high budget like 300mln or even higher, on special effects only and which are dumb movies without storyline. Its the kind of a movie what i despite most. Of course it is fun watching greatly made CGIs, but we do not gain anything essential from that kind of movies.  I honestly think that performance was excellent. Especially Busy Philipps, alongside with Erika Christensen and Victor Garber(whom i respect) made this movie an Oscar worth. Emotional performance by Busy Philipps was astonishing, its such a shame we wont see Oscar in her hands, which she deserves.


In [8]:
print(f"The longest review has {len(train['Review'][test_review].lower().split())} words.")

The longest review has 146 words.


In [9]:
print('List of clean words :')
print(clean_text(train['Review'][test_review]))

List of clean words :
['break', 'heart', 'movie', 'appreciate', 'underrate', 'people', 'forget', 'movie', 'really', 'nowadays', 'think', 'bum', 'bum', 'movie', 'quite', 'fun', 'watch', 'popcorn', 'friend', 'like', 'transformer', 'movie', 'orient', 'hyper', 'mega', 'high', 'budget', 'like', 'mln', 'even', 'higher', 'special', 'effect', 'dumb', 'movie', 'without', 'storyline', 'kind', 'movie', 'despite', 'course', 'fun', 'watch', 'greatly', 'make', 'cgis', 'gain', 'anything', 'essential', 'kind', 'movie', 'honestly', 'think', 'performance', 'excellent', 'especially', 'busy', 'philipps', 'alongside', 'erika', 'christensen', 'victor', 'garber', 'respect', 'make', 'movie', 'oscar', 'worth', 'emotional', 'performance', 'busy', 'philipps', 'astonish', 'shame', 'wont', 'see', 'oscar', 'hand', 'deserve']


In [10]:
print(f"The cleaned text has {len(clean_text(train['Review'][test_review]))} words.")

The cleaned text has 79 words.


In [11]:
print(f"The longest review (movie ID {test_review}) has a rating of {train['Rating'][test_review]}.")

The longest review (movie ID 50) has a rating of 8.


In [12]:
# Clean every training review
clean_words = [clean_text(review) for review in train['Review']]

In [13]:
se = pd.Series(clean_words)
train['clean_words'] = se.values

train

Unnamed: 0,Rating,Review,clean_words
0,2,this movie only gets a second star because i work downtown and liked seeing it destroyed. the effects were pretty good- i hear it was the most expensive Korean film e...,"[movie, get, second, star, work, downtown, like, see, destroy, effect, pretty, good, hear, expensive, korean, film, ever, make, expensive, still, absolutely, horrid, ..."
1,8,"As I watched this movie, and I began to see its' characters develop I could feel this would be an excellent picture. When you get that feeling, and the movie indeed f...","[watch, movie, begin, see, character, develop, could, feel, would, excellent, picture, get, feel, movie, indeed, fill, expectation, experience, rare, feel, throughout..."
2,4,"this seemed an odd combination of Withnail and I with A Room with a View.. sometimes it worked, other times it did not. tragedy that they changed the name for the US ...","[seem, odd, combination, withnail, room, view, sometimes, work, time, tragedy, change, name, u, release, though, keep, apidistra, fly, much, better, nothing, title, m..."
3,9,"When I saw the Exterminators of year 3000 at first time, I had no expectations for that movie. Although, it wasn't so bad as I was thought. It's kind of Italian versi...","[saw, exterminator, year, first, time, expectation, movie, although, bad, think, kind, italian, version, roadwarrior, cast, almost, famous, italy, include, venantino,..."
4,9,"This is a very entertaining flick, considering the budget and its length. The storyline is hardly ever touched on in the movie world so it also brought a sense of nov...","[entertain, flick, consider, budget, length, storyline, hardly, ever, touch, movie, world, also, bring, sense, novelty, act, great, p, z, dom, cinematography, also, w..."
...,...,...,...
95,2,"Oh my. I decided to go out to the cinemas with some friends, wanting to watch one of those mild, feel-good Christmas movies, and I walk out disgusted. The movie faile...","[oh, decide, go, cinema, friend, want, watch, one, mild, feel, good, christmas, movie, walk, disgust, movie, fail, full, stop, paul, giamatti, consider, good, actor, ..."
96,7,"It appears even the director doesn't like this film,but for me I think he's being a bit harsh on himself. Sure it's not perfect, but there are some atmospheric shots...","[appear, even, director, like, film, think, bite, harsh, sure, perfect, atmospheric, shoot, story, good, enough, keep, interest, throughout, shoot, appear, quite, pre..."
97,9,"The thing I remember most about this film is that it used to air on local KTLA TV (Ch. 5) during every Christmas season during the mid to late 70s, mainly due to the ...","[thing, remember, film, use, air, local, ktla, tv, ch, every, christmas, season, mid, late, mainly, due, fact, true, story, take, place, near, christmas, eve, always,..."
98,7,"I recently saw I.Q. and even though I'm not a romantic comedy type of gal, I think that it was just a nice and sweet movie to watch. So many movies in my opinion lack...","[recently, saw, q, even, though, romantic, comedy, type, gal, think, nice, sweet, movie, watch, many, movie, opinion, lack, honesty, know, feel, watch, movie, feel, r..."


In [14]:
train.Review[0]

"this movie only gets a second star because i work downtown and liked seeing it destroyed. the effects were pretty good- i hear it was the most expensive Korean film ever made. being the most expensive and still absolutely horrid makes it a massive waste of money. i rented it so i won't complain too much about what i paid, but it was a couple hours that i'll never get back. plot holes abound. terrible acting all across the board. i do not recommend giving up the time to watch this movie, life is too short. if your friends want to watch this, run away. i can't stress enough how bad this film was.   where the hell did the second dragon come from? why didn't he show up sooner? how did they have rocket launchers on dinosaurs just 500 years ago?"

In [15]:
train.clean_words[0]

['movie',
 'get',
 'second',
 'star',
 'work',
 'downtown',
 'like',
 'see',
 'destroy',
 'effect',
 'pretty',
 'good',
 'hear',
 'expensive',
 'korean',
 'film',
 'ever',
 'make',
 'expensive',
 'still',
 'absolutely',
 'horrid',
 'make',
 'massive',
 'waste',
 'money',
 'rent',
 'complain',
 'much',
 'pay',
 'couple',
 'hour',
 'never',
 'get',
 'back',
 'plot',
 'hole',
 'abound',
 'terrible',
 'act',
 'across',
 'board',
 'recommend',
 'give',
 'time',
 'watch',
 'movie',
 'life',
 'short',
 'friend',
 'want',
 'watch',
 'run',
 'away',
 'stress',
 'enough',
 'bad',
 'film',
 'hell',
 'second',
 'dragon',
 'come',
 'show',
 'sooner',
 'rocket',
 'launcher',
 'dinosaur',
 'year',
 'ago']

In [16]:
def make_lexicon(token_seqs, min_freq=1):
    token_counts = {}
    for seq in token_seqs:
        for token in seq:
            if token in token_counts:
                token_counts[token] += 1
            else:
                token_counts[token] = 1

    # Then, assign each word to a numerical index. Filter words that occur fewer than min_freq times.
    lexicon = [token for token, count in token_counts.items() if count >= min_freq]
    # Real tokens start at index 2: 0 is reserved for padding, and 1 is reserved for unknown words.
    lexicon = {token:idx + 2 for idx,token in enumerate(lexicon)}
    lexicon[u'<UNK>'] = 1 # Unknown words: fewer than min_freq occurrences, or unseen at lookup time

    print("LEXICON SAMPLE ({} total items):".format(len(lexicon)))
    print(dict(list(lexicon.items())[:20]))
    
    return lexicon

In [17]:
words_lexicon = make_lexicon(train['clean_words'], min_freq=1)

LEXICON SAMPLE (2095 total items):
{'movie': 2, 'get': 3, 'second': 4, 'star': 5, 'work': 6, 'downtown': 7, 'like': 8, 'see': 9, 'destroy': 10, 'effect': 11, 'pretty': 12, 'good': 13, 'hear': 14, 'expensive': 15, 'korean': 16, 'film': 17, 'ever': 18, 'make': 19, 'still': 20, 'absolutely': 21}


In [18]:
def tokens_to_idxs(token_seqs, lexicon):
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq] for token_seq in token_seqs]
    return idx_seqs

train['review_idxs'] = tokens_to_idxs(train['clean_words'], words_lexicon)
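
To make the index conventions concrete, here is a minimal, self-contained sketch on toy data. It mirrors the mapping used above (0 kept free for padding, 1 for `<UNK>`, real tokens from 2 upward), but `build_lexicon`, `encode`, and the toy sentences are hypothetical names and data introduced only for illustration.

```python
from collections import Counter

def build_lexicon(token_seqs, min_freq=1):
    """Map each sufficiently frequent token to an index >= 2 (hypothetical helper)."""
    counts = Counter(token for seq in token_seqs for token in seq)
    kept = [token for token, count in counts.items() if count >= min_freq]
    lexicon = {token: idx + 2 for idx, token in enumerate(kept)}
    lexicon['<UNK>'] = 1  # index 0 stays free for padding
    return lexicon

def encode(tokens, lexicon):
    """Replace tokens with indices, falling back to <UNK> for unseen tokens."""
    return [lexicon.get(token, lexicon['<UNK>']) for token in tokens]

# Toy data, invented for this example.
toy_train = [['good', 'movie'], ['bad', 'movie']]
toy_lexicon = build_lexicon(toy_train)
encoded = encode(['good', 'boring', 'movie'], toy_lexicon)
print(encoded)  # 'boring' never appears in toy_train, so it maps to 1
```

The same fallback is what happens to test reviews: any word that never appeared in the training set is encoded as 1.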

In [19]:
train

Unnamed: 0,Rating,Review,clean_words,review_idxs
0,2,this movie only gets a second star because i work downtown and liked seeing it destroyed. the effects were pretty good- i hear it was the most expensive Korean film e...,"[movie, get, second, star, work, downtown, like, see, destroy, effect, pretty, good, hear, expensive, korean, film, ever, make, expensive, still, absolutely, horrid, ...","[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 15, 20, 21, 22, 19, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 3, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42..."
1,8,"As I watched this movie, and I began to see its' characters develop I could feel this would be an excellent picture. When you get that feeling, and the movie indeed f...","[watch, movie, begin, see, character, develop, could, feel, would, excellent, picture, get, feel, movie, indeed, fill, expectation, experience, rare, feel, throughout...","[44, 2, 64, 9, 65, 66, 67, 68, 69, 70, 71, 3, 68, 2, 72, 73, 74, 75, 76, 68, 77, 2, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 86, 93, 94, 95, 71, 96..."
2,4,"this seemed an odd combination of Withnail and I with A Room with a View.. sometimes it worked, other times it did not. tragedy that they changed the name for the US ...","[seem, odd, combination, withnail, room, view, sometimes, work, time, tragedy, change, name, u, release, though, keep, apidistra, fly, much, better, nothing, title, m...","[101, 102, 103, 104, 105, 106, 107, 6, 43, 108, 109, 110, 111, 112, 113, 114, 115, 116, 28, 117, 118, 119, 120, 121, 38, 122, 123, 122, 124, 125, 17]"
3,9,"When I saw the Exterminators of year 3000 at first time, I had no expectations for that movie. Although, it wasn't so bad as I was thought. It's kind of Italian versi...","[saw, exterminator, year, first, time, expectation, movie, although, bad, think, kind, italian, version, roadwarrior, cast, almost, famous, italy, include, venantino,...","[126, 127, 62, 128, 43, 74, 2, 129, 53, 130, 131, 132, 133, 134, 91, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 142, 149, 150, 151, 147, 9,..."
4,9,"This is a very entertaining flick, considering the budget and its length. The storyline is hardly ever touched on in the movie world so it also brought a sense of nov...","[entertain, flick, consider, budget, length, storyline, hardly, ever, touch, movie, world, also, bring, sense, novelty, act, great, p, z, dom, cinematography, also, w...","[176, 178, 179, 180, 181, 182, 183, 18, 184, 2, 185, 147, 186, 187, 188, 38, 189, 190, 191, 192, 193, 147, 194, 195, 41, 2, 196, 197, 100]"
...,...,...,...,...
95,2,"Oh my. I decided to go out to the cinemas with some friends, wanting to watch one of those mild, feel-good Christmas movies, and I walk out disgusted. The movie faile...","[oh, decide, go, cinema, friend, want, watch, one, mild, feel, good, christmas, movie, walk, disgust, movie, fail, full, stop, paul, giamatti, consider, good, actor, ...","[605, 682, 247, 1266, 47, 48, 44, 302, 2031, 68, 13, 2032, 2, 214, 2033, 2, 742, 368, 697, 2034, 2035, 179, 13, 207, 83, 65, 824, 2036, 224, 2037, 2038, 317, 917, 203..."
96,7,"It appears even the director doesn't like this film,but for me I think he's being a bit harsh on himself. Sure it's not perfect, but there are some atmospheric shots...","[appear, even, director, like, film, think, bite, harsh, sure, perfect, atmospheric, shoot, story, good, enough, keep, interest, throughout, shoot, appear, quite, pre...","[2045, 449, 518, 8, 17, 130, 522, 2046, 336, 680, 2047, 1794, 142, 13, 52, 114, 364, 77, 1794, 2045, 175, 12, 2048, 702, 2049, 194, 8, 153, 17, 1794, 1720, 42, 247, 9..."
97,9,"The thing I remember most about this film is that it used to air on local KTLA TV (Ch. 5) during every Christmas season during the mid to late 70s, mainly due to the ...","[thing, remember, film, use, air, local, ktla, tv, ch, every, christmas, season, mid, late, mainly, due, fact, true, story, take, place, near, christmas, eve, always,...","[432, 476, 17, 748, 485, 2054, 2055, 1107, 2056, 710, 2032, 490, 2057, 627, 2058, 2059, 316, 232, 142, 444, 488, 853, 2032, 2060, 317, 522, 2061, 9, 54, 512, 247, 206..."
98,7,"I recently saw I.Q. and even though I'm not a romantic comedy type of gal, I think that it was just a nice and sweet movie to watch. So many movies in my opinion lack...","[recently, saw, q, even, though, romantic, comedy, type, gal, think, nice, sweet, movie, watch, many, movie, opinion, lack, honesty, know, feel, watch, movie, feel, r...","[480, 126, 2073, 449, 113, 252, 253, 1506, 2074, 130, 351, 1410, 2, 44, 658, 2, 1530, 892, 2075, 366, 68, 44, 2, 68, 936, 444, 412, 142, 8, 518, 1452, 220, 8, 1240, 1..."


In [20]:
def idx_seqs_to_bows(idx_seqs, matrix_length):
    bow_seqs = np.array([np.bincount(np.array(idx_seq), minlength=matrix_length) for idx_seq in idx_seqs])
    return bow_seqs

In [21]:
bow_train_words = idx_seqs_to_bows(train['review_idxs'], matrix_length=len(words_lexicon) + 1)
bow_train_words

array([[0, 0, 2, ..., 0, 0, 0],
       [0, 0, 4, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 3, ..., 0, 0, 0],
       [0, 0, 3, ..., 1, 1, 1]], dtype=int64)

In [22]:
bow_train_words.shape

(100, 2096)
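
As a rough illustration of what each row of this matrix holds, the sketch below counts indices for a single toy review in plain Python. It is intended to behave like `np.bincount(..., minlength=...)` for this use case; `bow_vector` and the toy indices are hypothetical and not part of the notebook.

```python
def bow_vector(idx_seq, length):
    """Count how often each index occurs; position i holds the count of index i."""
    counts = [0] * length
    for idx in idx_seq:
        counts[idx] += 1
    return counts

# Toy review encoded with a 5-slot vocabulary (0 = padding, 1 = <UNK>).
toy_idxs = [2, 3, 2, 1, 4, 2]
toy_bow = bow_vector(toy_idxs, length=5)
print(toy_bow)  # [0, 1, 3, 1, 1]
```

Position 0 stays at zero because no token is ever mapped to the padding index, and the word order of the review is discarded, which is the information the RNN model later tries to exploit.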

In [23]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

def create_model_FFNN(n_input_nodes, n_hidden_nodes):
    input_layer = Input(shape=(n_input_nodes,))
    hidden_layer = Dense(units=n_hidden_nodes, activation='sigmoid')(input_layer)
    output_layer = Dense(units=1)(hidden_layer)
    
    #Specify which layers are input and output, compile model with loss and optimization functions
    model = Model(inputs=[input_layer], outputs=output_layer)
    model.compile(loss="mean_squared_error", optimizer='adam')
    
    return model

reg_bow_model = create_model_FFNN(n_input_nodes=len(words_lexicon) + 1, n_hidden_nodes=500)

In [24]:
reg_bow_model.fit(x=bow_train_words, y=train['Rating'], batch_size=20, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1a06e75b7f0>

In [25]:
# Prepare Test Data

In [26]:
test

Unnamed: 0,Rating,Review
0,10,"First of all i'd like to say that this movie is the greatest thing that ever happened to mankind. It is the best out of all the excellent Muppet movies, and every oth..."
1,1,"Terrible writing, highly contrived, from a ""do-gooder"" who knows absolutely nothing about race relations in L.A., or the USA in the present day. The gushing positive ..."
2,4,"I didn't expect too much from this movie, but I was still disappointed. It's supposed to be a comedy, but there are only four or five scenes where I actually laughed,..."
3,2,"Corey Haim is never going to be known as one of the great actors of his time, but at least in movies like ""Licensed To Drive"", he was more in his element... lowbrow h..."
4,3,"Being a great fan of Disney, i was really disappointed when i watched this garbage.The animation was pretty,and the backgrounds were amazing,but i believe that good a..."
...,...,...
95,3,"I recently picked up all three Robocop films in one box set, rather cheaply and the only reason I did this was for the special edition of the superb first one. I have..."
96,8,"This film as it is now is far shorter than it was when released in 1918. In fact, it is now more available with two other medium sized silent Chaplin features (A DOG'..."
97,3,"The MTV sci-fi animated series ""Æon Flux"" is brought to life with Charlize Theron playing the title character, a freedom fighter who fights oppression in the walled c..."
98,4,"I thought the movie was sub-par. The acting was good but not great, the story was funny but did not come out that way. The director dropped the ball on this movie. It..."


In [27]:
# Clean every test review
clean_words = [clean_text(review) for review in test['Review']]

In [28]:
se = pd.Series(clean_words)
test['clean_words'] = se.values

test

Unnamed: 0,Rating,Review,clean_words
0,10,"First of all i'd like to say that this movie is the greatest thing that ever happened to mankind. It is the best out of all the excellent Muppet movies, and every oth...","[first, like, say, movie, greatest, thing, ever, happen, mankind, best, excellent, muppet, movie, every, movie, boo, ya, jim, henson, movie, first, muppet, movie, bes..."
1,1,"Terrible writing, highly contrived, from a ""do-gooder"" who knows absolutely nothing about race relations in L.A., or the USA in the present day. The gushing positive ...","[terrible, write, highly, contrive, gooder, know, absolutely, nothing, race, relation, l, usa, present, day, gush, positive, review, mystery, could, provide, folk, th..."
2,4,"I didn't expect too much from this movie, but I was still disappointed. It's supposed to be a comedy, but there are only four or five scenes where I actually laughed,...","[expect, much, movie, still, disappoint, suppose, comedy, four, five, scene, actually, laugh, think, rather, poor, real, plot, either, always, feel, scene, could, put..."
3,2,"Corey Haim is never going to be known as one of the great actors of his time, but at least in movies like ""Licensed To Drive"", he was more in his element... lowbrow h...","[corey, haim, never, go, know, one, great, actor, time, least, movie, like, license, drive, element, lowbrow, humor, dean, koontz, book, watcher, one, earlier, work, ..."
4,3,"Being a great fan of Disney, i was really disappointed when i watched this garbage.The animation was pretty,and the backgrounds were amazing,but i believe that good a...","[great, fan, disney, really, disappoint, watch, garbage, animation, pretty, background, amaze, believe, good, animation, make, weak, script, weak, story, gonna, disag..."
...,...,...,...
95,3,"I recently picked up all three Robocop films in one box set, rather cheaply and the only reason I did this was for the special edition of the superb first one. I have...","[recently, pick, three, robocop, film, one, box, set, rather, cheaply, reason, special, edition, superb, first, one, see, robocop, year, year, come, never, watch, sin..."
96,8,"This film as it is now is far shorter than it was when released in 1918. In fact, it is now more available with two other medium sized silent Chaplin features (A DOG'...","[film, far, shorter, release, fact, available, two, medium, size, silent, chaplin, feature, dog, life, pilgrim, chaplin, release, day, shoulder, arm, big, hit, humor,..."
97,3,"The MTV sci-fi animated series ""Æon Flux"" is brought to life with Charlize Theron playing the title character, a freedom fighter who fights oppression in the walled c...","[mtv, sci, fi, animate, series, flux, bring, life, charlize, theron, play, title, character, freedom, fighter, fight, oppression, wall, city, bregna, hundred, year, f..."
98,4,"I thought the movie was sub-par. The acting was good but not great, the story was funny but did not come out that way. The director dropped the ball on this movie. It...","[think, movie, sub, par, act, good, great, story, funny, come, way, director, drop, ball, movie, jam, jim, tea, imho, music, kill, scene, thing, go, hill, jonny, cash..."


In [29]:
test['review_idxs'] = tokens_to_idxs(test['clean_words'], words_lexicon)

In [30]:
test

Unnamed: 0,Rating,Review,clean_words,review_idxs
0,10,"First of all i'd like to say that this movie is the greatest thing that ever happened to mankind. It is the best out of all the excellent Muppet movies, and every oth...","[first, like, say, movie, greatest, thing, ever, happen, mankind, best, excellent, muppet, movie, every, movie, boo, ya, jim, henson, movie, first, muppet, movie, bes...","[128, 8, 97, 2, 1, 432, 18, 410, 1, 275, 70, 1, 2, 710, 2, 1, 1, 1, 1, 2, 128, 1, 2, 275, 1, 1, 1, 1, 425, 19, 303, 224, 246, 47, 1302, 225, 56, 30, 1, 471, 18, 19, 1..."
1,1,"Terrible writing, highly contrived, from a ""do-gooder"" who knows absolutely nothing about race relations in L.A., or the USA in the present day. The gushing positive ...","[terrible, write, highly, contrive, gooder, know, absolutely, nothing, race, relation, l, usa, present, day, gush, positive, review, mystery, could, provide, folk, th...","[37, 300, 231, 1, 1, 366, 21, 118, 358, 359, 1, 1, 802, 670, 1, 970, 864, 525, 67, 1880, 1, 130, 254, 13, 1, 1, 1, 1875, 1, 590, 1, 1172, 48, 9, 17, 357, 1, 1, 670, 1..."
2,4,"I didn't expect too much from this movie, but I was still disappointed. It's supposed to be a comedy, but there are only four or five scenes where I actually laughed,...","[expect, much, movie, still, disappoint, suppose, comedy, four, five, scene, actually, laugh, think, rather, poor, real, plot, either, always, feel, scene, could, put...","[975, 28, 2, 20, 100, 1466, 253, 1, 564, 321, 265, 346, 130, 263, 451, 261, 34, 1, 317, 68, 321, 67, 486, 1, 2, 1, 442, 432, 38, 1, 1, 302, 1, 1, 241, 2, 1466, 83, 47..."
3,2,"Corey Haim is never going to be known as one of the great actors of his time, but at least in movies like ""Licensed To Drive"", he was more in his element... lowbrow h...","[corey, haim, never, go, know, one, great, actor, time, least, movie, like, license, drive, element, lowbrow, humor, dean, koontz, book, watcher, one, earlier, work, ...","[1, 1, 32, 247, 366, 302, 189, 207, 43, 489, 2, 8, 1, 1, 1, 1, 593, 1, 1, 1889, 1, 302, 1978, 6, 20, 996, 1, 1, 1980, 1, 1262, 788, 1, 1, 1, 1, 1, 247, 865, 1, 2036, ..."
4,3,"Being a great fan of Disney, i was really disappointed when i watched this garbage.The animation was pretty,and the backgrounds were amazing,but i believe that good a...","[great, fan, disney, really, disappoint, watch, garbage, animation, pretty, background, amaze, believe, good, animation, make, weak, script, weak, story, gonna, disag...","[189, 1672, 1412, 271, 100, 44, 1, 1543, 12, 1, 243, 443, 13, 1543, 19, 312, 123, 312, 142, 1, 1698, 264, 97, 1, 886, 1091, 1, 2, 1, 412, 886, 489, 1, 1, 123, 1, 53, ..."
...,...,...,...,...
95,3,"I recently picked up all three Robocop films in one box set, rather cheaply and the only reason I did this was for the special edition of the superb first one. I have...","[recently, pick, three, robocop, film, one, box, set, rather, cheaply, reason, special, edition, superb, first, one, see, robocop, year, year, come, never, watch, sin...","[480, 1, 217, 1, 17, 302, 1, 424, 263, 1, 1185, 1201, 1, 1, 128, 302, 9, 1, 62, 62, 56, 32, 44, 228, 20, 476, 100, 1604, 1, 271, 1, 843, 271, 1, 722, 1856, 658, 1044,..."
96,8,"This film as it is now is far shorter than it was when released in 1918. In fact, it is now more available with two other medium sized silent Chaplin features (A DOG'...","[film, far, shorter, release, fact, available, two, medium, size, silent, chaplin, feature, dog, life, pilgrim, chaplin, release, day, shoulder, arm, big, hit, humor,...","[17, 1001, 1, 112, 316, 1487, 423, 1, 1, 1, 1, 1241, 1, 45, 1, 1, 112, 670, 1, 1862, 1016, 502, 593, 1, 1, 20, 272, 1, 1, 1, 1253, 1, 1189, 1, 915, 424, 1295, 1813, 3..."
97,3,"The MTV sci-fi animated series ""Æon Flux"" is brought to life with Charlize Theron playing the title character, a freedom fighter who fights oppression in the walled c...","[mtv, sci, fi, animate, series, flux, bring, life, charlize, theron, play, title, character, freedom, fighter, fight, oppression, wall, city, bregna, hundred, year, f...","[1, 1952, 1953, 1, 372, 1, 186, 45, 1, 1, 83, 119, 65, 916, 1, 1, 1, 1, 164, 1, 1, 62, 684, 2051, 1509, 1367, 660, 164, 1, 1, 1, 1, 1, 1, 705, 224, 225, 1, 1, 462, 1,..."
98,4,"I thought the movie was sub-par. The acting was good but not great, the story was funny but did not come out that way. The director dropped the ball on this movie. It...","[think, movie, sub, par, act, good, great, story, funny, come, way, director, drop, ball, movie, jam, jim, tea, imho, music, kill, scene, thing, go, hill, jonny, cash...","[130, 2, 1044, 1045, 38, 13, 189, 142, 272, 56, 225, 518, 1, 1, 2, 629, 1, 1155, 1, 747, 660, 321, 432, 247, 1753, 1, 1, 747, 83, 199, 1, 272, 660, 1330, 2, 67, 1, 1,..."


In [31]:
bow_test_words = idx_seqs_to_bows(test['review_idxs'], matrix_length=len(words_lexicon) + 1)
bow_test_words

array([[ 0, 24,  7, ...,  0,  0,  0],
       [ 0, 24,  2, ...,  0,  0,  0],
       [ 0, 10,  4, ...,  0,  0,  0],
       ...,
       [ 0, 46,  4, ...,  0,  0,  0],
       [ 0, 16,  5, ...,  0,  0,  0],
       [ 0, 24,  0, ...,  0,  0,  0]], dtype=int64)

In [32]:
bow_test_words.shape

(100, 2096)

In [33]:
# Show predicted ratings for test reviews alongside actual ratings
# Ratings are integers, so round each prediction to the nearest integer
test['pred_rating'] = np.round(reg_bow_model.predict(bow_test_words)[:,0]).astype(int)
test[['Review', 'Rating', 'pred_rating']]

# Evaluate the model with R^2
from sklearn.metrics import r2_score

r2 = r2_score(y_true=test['Rating'], y_pred=test['pred_rating'])
print(f"R^2 on the test set: {r2:.3f}")
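
For readers unfamiliar with the metric, the following sketch computes R^2 by hand as 1 - SS_res / SS_tot. The helper `r2_manual` and the toy ratings are made up for illustration; they are not results from the model above.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up ratings, purely for illustration.
toy_true = [2, 8, 4, 9, 7]
toy_pred = [3, 7, 5, 8, 7]
toy_r2 = r2_manual(toy_true, toy_pred)
print(round(toy_r2, 3))

# Predicting the mean rating for every review gives R^2 = 0.
mean_rating = sum(toy_true) / len(toy_true)
baseline_r2 = r2_manual(toy_true, [mean_rating] * len(toy_true))
print(baseline_r2)
```

A score of 0 therefore corresponds to always predicting the mean rating, and a negative score means the predictions are worse than that baseline.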

In [34]:
# RNN Model

In [35]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def pad_idx_seqs(idx_seqs):
    max_seq_len = max([len(idx_seq) for idx_seq in idx_seqs]) # Get length of longest sequence
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len) # Keras provides a convenient padding function
    return padded_idxs

train_padded_idxs = pad_idx_seqs(train['review_idxs'])
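
By default, `pad_sequences` pads with zeros at the front of each sequence (`padding='pre'`, `value=0`). The plain-Python sketch below mimics that default on toy sequences so the resulting layout is easy to see; `pre_pad` is a hypothetical helper, not a Keras function.

```python
def pre_pad(idx_seqs, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in idx_seqs)
    return [[pad_value] * (max_len - len(seq)) + list(seq) for seq in idx_seqs]

# Toy index sequences of different lengths.
toy_seqs = [[2, 3, 4], [5], [6, 7]]
toy_padded = pre_pad(toy_seqs)
for row in toy_padded:
    print(row)
```

With front padding the real tokens sit at the end of each row, and because 0 is never assigned to a word, the embedding layer can treat it as a mask value.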

In [36]:
from tensorflow.keras.layers import Embedding, GRU

def create_model_RNN(n_input_nodes, n_embedding_nodes, n_hidden_nodes):
    input_layer = Input(shape=(None,))
    embedding_layer = Embedding(input_dim=n_input_nodes,
                                output_dim=n_embedding_nodes,
                                mask_zero=True)(input_layer) #mask_zero tells the model to ignore 0 values (padding)
    
    gru_layer = GRU(units=n_hidden_nodes)(embedding_layer)
    output_layer = Dense(units=1)(gru_layer)

    model = Model(inputs=[input_layer], outputs=output_layer)
    model.compile(loss="mean_squared_error", optimizer='adam')
    
    return model

In [37]:
rnn_model = create_model_RNN(n_input_nodes=len(words_lexicon) + 1, n_embedding_nodes=300, n_hidden_nodes=500)

In [38]:
# Train the model
rnn_model.fit(x=train_padded_idxs, y=train['Rating'], batch_size=20, epochs=5)

# Put test reviews in padded matrix
test['review_idxs'] = tokens_to_idxs(token_seqs=test['clean_words'],lexicon=words_lexicon)
test_padded_idxs = pad_idx_seqs(test['review_idxs'])

test['pred_rating_RNN'] = np.round(rnn_model.predict(test_padded_idxs)[:,0]).astype(int)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


# 2. (evil) XOR Problem

Train an LSTM to solve the XOR problem: that is, given a sequence of bits, determine its parity. The LSTM should consume the sequence, one bit at a time, and then output the correct answer at the sequence’s end. Test the two approaches below:

### 2.1 

Generate a dataset of <=100,000 random binary strings of equal length <= 50. Train the LSTM; what is the maximum length you can train up to with precision?
    

### 2.2

Generate a dataset of random <=200,000 binary strings, where the length of each string is independently and randomly chosen between 1 and 50. Train the LSTM. Does it succeed? What explains the difference?
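
One possible way to build the variable-length dataset for 2.2 is sketched below using only the standard library. The function names, the seed, and the small `count` are assumptions chosen for illustration; left-padding with zeros is one option among several, picked here because extra zeros do not change the parity.

```python
import random

def make_variable_length_dataset(count, max_len, seed=0):
    """Random bit strings of length 1..max_len with their parity labels."""
    rng = random.Random(seed)
    sequences, labels = [], []
    for _ in range(count):
        length = rng.randint(1, max_len)  # inclusive on both ends
        bits = [rng.randint(0, 1) for _ in range(length)]
        sequences.append(bits)
        labels.append(sum(bits) % 2)  # parity = XOR of all bits
    return sequences, labels

def left_pad(bits, max_len):
    """Zero-pad on the left; extra zeros leave the parity unchanged."""
    return [0] * (max_len - len(bits)) + bits

# Small count for a quick check; the exercise allows up to 200,000.
var_seqs, var_labels = make_variable_length_dataset(count=1000, max_len=50)
var_padded = [left_pad(bits, 50) for bits in var_seqs]
print(len(var_padded), len(var_padded[0]), var_labels[:5])
```

The padded lists can then be converted to an array and encoded in the same way as the fixed-length data.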


In [39]:
# ref  https://vitez.me/lstm-xor

In [40]:
from tensorflow.keras import optimizers
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Sequential
import numpy as np
import random


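SEQ_LEN = 50
COUNT = 100000

# Encode each bit as a pair [bit, complement], i.e. a two-class one-hot vector
def bin_pair(x):
    return [x, 1 - x]

# Inputs: COUNT random bit strings of length SEQ_LEN
training = np.array([[bin_pair(random.choice([0, 1])) for _ in range(SEQ_LEN)] for _ in range(COUNT)])
# Targets: running parity (cumulative sum modulo 2) at every timestep
target = np.array([[bin_pair(x) for x in np.cumsum(example[:,0]) % 2] for example in training])
print('shape check:', training.shape, '=', target.shape)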

shape check: (100000, 50, 2) = (100000, 50, 2)
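
The target built with `np.cumsum(...) % 2` is the running parity of the input: at each step the new parity is the previous parity XOR the current bit. The short sketch below checks that equivalence on a toy sequence; `running_parity` and the toy bits are hypothetical and only meant to illustrate the idea.

```python
from itertools import accumulate
from operator import xor

def running_parity(bits):
    """Parity after each prefix: parity_t = parity_(t-1) XOR bit_t."""
    return list(accumulate(bits, xor))

toy_bits = [1, 0, 1, 1, 0]
toy_parity = running_parity(toy_bits)
print(toy_parity)  # [1, 1, 0, 1, 1]

# Same result as the cumulative-sum formulation used for `target`.
cumsum_parity = [sum(toy_bits[:t + 1]) % 2 for t in range(len(toy_bits))]
print(cumsum_parity == toy_parity)  # True
```

Because the target is defined at every timestep, the network receives a training signal for each prefix of the string rather than only for the final bit.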


In [41]:
# Build Model

model = Sequential()
model.add(Input(shape=(SEQ_LEN, 2), dtype='float32'))
model.add(LSTM(1, return_sequences=True))
model.add(Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(training, target, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a000123340>

In [42]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 50, 1)             16        
_________________________________________________________________
dense_3 (Dense)              (None, 50, 2)             4         
Total params: 20
Trainable params: 20
Non-trainable params: 0
_________________________________________________________________


In [43]:
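predictions = model.predict(training)
i = random.randrange(COUNT)  # random.randint(0, COUNT) includes COUNT, which is out of range
# Probability the model assigns to parity 1 at the last timestep
chance = predictions[i,-1,0]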

In [44]:
print('randomly selected sequence:', training[i,:,0])
print('prediction:', int(chance > 0.5))
print('confidence: {:0.2f}%'.format((chance if chance > 0.5 else 1 - chance) * 100))
print('actual:', np.sum(training[i,:,0]) % 2)

randomly selected sequence: [1 0 1 1 1 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 1 0
 0 1 1 1 1 1 0 1 0 0 1 0 1]
prediction: 1
confidence: 99.83%
actual: 1
