# **Uploading Files onto Google Colab**

The word embeddings have been uploaded to Google Drive in a folder called **Word2Vec**.

They were trained with the Gensim module in two sizes (200 and 300 dimensions), with a window size of 5 and 500 iterations.

Here, we will access those files. 

Note: The URL for the Word2Vec folder in my Google Drive is: https://drive.google.com/drive/folders/1sdDeXX3XTdJg5tEnQNhmjNmol7DPzOZH

That is why the q parameter is set to: **1sdDeXX3XTdJg5tEnQNhmjNmol7DPzOZH**

You will want to make a copy of the Word2Vec folder and put it in your Google Drive's 'Colab Notebooks' folder. Then change the q parameter to the ID at the end of the URL of your copy of the Word2Vec folder.
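As a small illustration of that last step, the folder ID can be pulled out of the folder URL and dropped into the query string. This is only a sketch: `folder_id_from_url` and `parents_query` are hypothetical helper names (not part of PyDrive), and it assumes the URL path ends with the folder ID, as in the example URL above.

```python
from urllib.parse import urlparse

def folder_id_from_url(url):
    """Return the last path segment of a Drive folder URL (hypothetical helper)."""
    path = urlparse(url).path.rstrip('/')
    return path.split('/')[-1]

def parents_query(folder_id):
    """Build the 'q' string of the form used with drive.ListFile."""
    return "'{}' in parents".format(folder_id)

url = 'https://drive.google.com/drive/folders/1sdDeXX3XTdJg5tEnQNhmjNmol7DPzOZH'
folder_id = folder_id_from_url(url)
print(parents_query(folder_id))
```

Replace `url` with the link to your own copy of the folder to get the string to paste into the `'q'` entry.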

In [1]:
!pip install -U -q PyDrive
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

local_download_path = os.path.expanduser('~/Word2Vec')
# Create the download folder if it does not exist yet.
os.makedirs(local_download_path, exist_ok=True)

file_list = drive.ListFile(
    {'q': "'1sdDeXX3XTdJg5tEnQNhmjNmol7DPzOZH' in parents"}).GetList()

for f in file_list:
  # Create a file handle by ID and download its contents.
  print('title: %s, id: %s' % (f['title'], f['id']))
  fname = os.path.join(local_download_path, f['title'])
  print('downloading to {}'.format(fname))
  f_ = drive.CreateFile({'id': f['id']})
  f_.GetContentFile(fname)

title: Bad Sentences.csv, id: 1zcZm3griSFElsa2te1Hhd5lpAl7tl5vI
downloading to /root/Word2Vec/Bad Sentences.csv
title: Good Sentences.csv, id: 1dSk6Zq20YnP51fQqqjR65eERDaFRSrdb
downloading to /root/Word2Vec/Good Sentences.csv
title: Tokenized Bad Sentences.csv, id: 1n5XEE1aHdoAfBxrqXWn-yEeEjp2WUj1B
downloading to /root/Word2Vec/Tokenized Bad Sentences.csv
title: word2vec_Good_300, id: 1zWEoKRHFRToBzqkLLvA00N2HBcDvG_4F
downloading to /root/Word2Vec/word2vec_Good_300
title: word2vec_Good_200, id: 1NT7o_yI1IAHfGFcw9Wrq3CKNeC9md4T2
downloading to /root/Word2Vec/word2vec_Good_200
title: word2vec_Bad_300, id: 1zv6AxhipBjhD1rwID1HTE5zncVJPVGms
downloading to /root/Word2Vec/word2vec_Bad_300
title: word2vec_Bad_200, id: 1Giw1gcBYoncfYMGi6LKtEJIeYZ_a2PcC
downloading to /root/Word2Vec/word2vec_Bad_200
title: Tokenized Good Sentences.csv, id: 1YPAgpNciFdTAKQUsrcRs0RLaivevlNoT
downloading to /root/Word2Vec/Tokenized Good Sentences.csv
title: Tokenized Bad Sentences.csv, id: 1S8P8EC47oT46b7ql-xht2QT

# Importing Tokenized Sentences

I have also uploaded two CSV files, **Tokenized Bad Sentences.csv** and **Tokenized Good Sentences.csv**, to the Word2Vec folder.

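Below is the code to read in the sentence files. Note that the cell as written opens **Bad Sentences.csv** and **Good Sentences.csv** from the same folder.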

In [2]:
import csv
with open('/root/Word2Vec/Bad Sentences.csv', 'r', encoding="utf-8") as f:
    reader = csv.reader(f)
    badSentences = list(reader)
    
with open('/root/Word2Vec/Good Sentences.csv', 'r', encoding="utf-8") as f:
    reader = csv.reader(f)
    goodSentences = list(reader)
    
print(badSentences[0:10])   
print(goodSentences[0:10])  

[['oh', ',', 'shit', '.'], ['you', 'just', 'got', 'wolfed', '.'], ['what', '?'], ['that', 'is', 'an', 'official', 'trademark', 'that', 'i', 'am', 'getting', 'registered', '.'], ['it', "'s", 'a', 'lot', 'of', 'stuff', 'you', 'got', 'ta', 'do', ',', 'hoops', 'you', 'got', 'ta', 'jump', 'through', '.'], ['got', 'ta', 'get', 'on', 'the', 'internet', '.'], ['got', 'ta', 'go', 'to', 'some', 'stupidass', 'website', 'where', 'you', 'register', 'a', 'catch', 'phrase', '.'], ['i', 'wanted', '``', 'bam', ',', "''", 'but', 'emeril', 'had', 'taken', 'it', '.'], ['i', "'m", 'rambling', ',', 'man', '.'], ['get', 'up', ',', 'man', '.']]
[['mr.', 'dufresne', ',', 'describe', 'the', 'confrontation', 'you', 'had', 'with', 'your', 'wife', 'the', 'night', 'she', 'was', 'murdered', '.'], ['it', 'was', 'very', 'bitter', '.'], ['she', 'said', 'she', 'was', 'glad', 'i', 'knew', ',', 'that', 'she', 'hated', 'all', 'the', 'sneaking', 'around', '.'], ['and', 'she', 'said', 'that', 'she', 'wanted', 'a', 'divorce',
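To see why `csv.reader` gives back one list of tokens per sentence, here is a minimal sketch that uses an in-memory file in place of the real CSVs. The sample rows are made up for illustration, and it assumes each row of the file holds one tokenized sentence, as the output above suggests.

```python
import csv
import io

# Made-up sample rows; the real files are read from /root/Word2Vec.
sample = [['oh', ',', 'no', '.'], ['what', '?']]

buffer = io.StringIO()
csv.writer(buffer).writerows(sample)   # tokens such as ',' are quoted automatically
buffer.seek(0)

sentences = list(csv.reader(buffer))   # one list of tokens per row
print(sentences)
```

Because the writer quotes a bare comma token, punctuation survives the round trip as its own token.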

# **Installing gensim**

Gensim is the Python module used to train the word2vec embeddings. Here is how to load the trained models.

In [3]:
!pip install -q gensim
from gensim.models import Word2Vec
model_Bad = Word2Vec.load("/root/Word2Vec/word2vec_Bad_300")
model_Good = Word2Vec.load("/root/Word2Vec/word2vec_Good_300")
print(model_Bad)
print(model_Good)

Word2Vec(vocab=29504, size=300, alpha=0.025)
Word2Vec(vocab=30426, size=300, alpha=0.025)


# **Similar Vectors**

Once the word2vec embeddings are loaded, you can view the words whose vectors are most similar to that of a given word.

In [4]:
for i in model_Bad.wv.most_similar(positive=['good']):
  print(i)

print()

for i in model_Good.wv.most_similar(positive=['good']):
  print(i)

('bad', 0.42860960960388184)
('nice', 0.40062108635902405)
('like', 0.3491782248020172)
('tough', 0.3429213762283325)
('great', 0.34254634380340576)
('better', 0.28865498304367065)
('big', 0.28429123759269714)
('weird', 0.2701437175273895)
('happy', 0.26822713017463684)
('hard', 0.26718002557754517)

('bad', 0.5157124996185303)
('nice', 0.42234158515930176)
('great', 0.3885391056537628)
('smart', 0.3636835813522339)
('fine', 0.3554360568523407)
('tough', 0.34922707080841064)
('big', 0.3384079933166504)
('funny', 0.33056941628456116)
('hard', 0.3305012583732605)
('tempting', 0.31914812326431274)
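The scores above are cosine similarities between word vectors. The sketch below shows the idea on toy three-dimensional vectors that I invented for illustration; the `cosine` and `most_similar` functions here are plain-Python stand-ins, not Gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    'good':  [1.0, 0.5, 0.0],
    'nice':  [0.9, 0.6, 0.1],
    'table': [0.0, 0.1, 1.0],
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

for pair in most_similar('good', toy_vectors):
    print(pair)
```

In this toy data, 'nice' points in nearly the same direction as 'good' and so ranks first, while 'table' is almost orthogonal and scores close to zero.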




# **Word2Vec Weights onto Keras**

Because we are going to use Keras to train an RNN, here is how to extract the actual pretrained weights of the word embedding which can be used for the neural network.

In [5]:
from keras.layers import Embedding

pretrained_weights_Bad = model_Bad.wv.vectors 
pretrained_weights_Good = model_Good.wv.vectors

embeddingBad = Embedding(input_dim=pretrained_weights_Bad.shape[0], output_dim=pretrained_weights_Bad.shape[1], 
                    weights=[pretrained_weights_Bad])

embeddingGood = Embedding(input_dim=pretrained_weights_Good.shape[0], output_dim=pretrained_weights_Good.shape[1], 
                    weights=[pretrained_weights_Good])

print(pretrained_weights_Bad)

Using TensorFlow backend.


[[ 1.25372    -1.2052641  -0.25990063 ... -0.976092    1.8551047
   1.3307298 ]
 [ 1.4823085  -1.2304381  -0.9231364  ... -1.6590116   1.3164423
   0.8863562 ]
 [ 2.0804706  -0.7694774   0.17020172 ... -0.5701194   1.6237695
   1.2792978 ]
 ...
 [ 0.6441616   0.12467387  0.17375445 ...  0.27845207  0.394679
  -0.18595086]
 [ 0.70862037  0.14693536  0.24079292 ...  0.2610408   0.41543
  -0.17618927]
 [-0.08563908  0.16787037 -0.30501494 ...  0.32231143  0.14400984
   0.1407564 ]]
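Conceptually, an embedding layer initialised with these weights is a lookup table: row i of the matrix is the vector for the word with index i. The sketch below uses a tiny made-up matrix and plain Python lists rather than Keras, and `embed` is a hypothetical helper.

```python
# Tiny stand-in for pretrained_weights_Bad: 4 words, 3 dimensions (made-up values).
weights = [
    [0.1, 0.2, 0.3],   # index 0
    [0.4, 0.5, 0.6],   # index 1
    [0.7, 0.8, 0.9],   # index 2
    [1.0, 1.1, 1.2],   # index 3
]

def embed(indices, table):
    """Map a sequence of word indices to their vectors (row lookup)."""
    return [table[i] for i in indices]

sentence = [2, 0, 3]                   # a sentence encoded as word indices
vectors = embed(sentence, weights)

print(len(weights), len(weights[0]))   # these play the role of input_dim and output_dim
print(vectors[0])
```

This is why `input_dim` is taken from the number of rows (vocabulary size) and `output_dim` from the number of columns (vector size) in the cell above.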


# LSTM Neural Network

Here, we design the architecture for the neural network. You will want to tinker with this to get something that trains in a reasonable amount of time but still performs well.

In [6]:
from keras.models import Sequential
from keras.layers import Activation, Dense, Bidirectional, Dropout
from keras.layers import LSTM
from keras.callbacks import EarlyStopping, ModelCheckpoint

vocab_size, embedding_size = pretrained_weights_Bad.shape

# This is where you define the models for the bad movie neural net and the good movie neural net.
# It is important that the models are separate so you don't fit one model to both datasets.
# MAKE SURE BOTH MODELS HAVE THE SAME PARAMETERS
def get_bad_movie_model():
  model = Sequential()
  model.add(embeddingBad)
  model.add(Bidirectional(LSTM(units=128))) # If you want a non-bidirectional LSTM, just remove the Bidirectional().
                                            # It significantly increases the training time, especially if you increase layers/units
  model.add(Dropout(rate=0.5)) # Kind of high, but important to avoid overfitting.
  model.add(Dense(units=vocab_size))
  model.add(Activation('softmax'))
  print(model.summary())
  return model


def get_good_movie_model():
  model = Sequential()
  model.add(embeddingGood)
  model.add(Bidirectional(LSTM(units=128)))
  model.add(Dropout(rate=0.5))
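  # The output size must match the good-movie vocabulary, which differs from vocab_size (taken from the bad model).
  model.add(Dense(units=pretrained_weights_Good.shape[0]))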
  model.add(Activation('softmax'))
  print(model.summary())
  return model

bad_model = get_bad_movie_model()
good_model = get_good_movie_model()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 300)         8851200   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               439296    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 29504)             7582528   
_________________________________________________________________
activation_1 (Activation)    (None, 29504)             0         
Total params: 16,873,024
Trainable params: 16,873,024
Non-trainable params: 0
_________________________________________________________________
None
_________________________________________________________________
Layer (type)                 Output Shape              Para
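The parameter counts in the first summary above can be reproduced by hand, which is a useful check when you tinker with the architecture. The sketch below uses the sizes from the bad-movie model (vocabulary 29504, vectors of size 300, 128 LSTM units) and the standard parameter-count formulas for an LSTM and a dense layer.

```python
vocab_size = 29504      # rows of pretrained_weights_Bad
embedding_size = 300    # columns of pretrained_weights_Bad
lstm_units = 128

# Embedding: one vector per vocabulary entry.
embedding_params = vocab_size * embedding_size

# One LSTM direction has 4 gates, each with input weights, recurrent weights and a bias.
lstm_one_direction = 4 * (lstm_units * (embedding_size + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_one_direction

# Dense layer: (2 * lstm_units) inputs to vocab_size outputs, plus one bias per output.
dense_params = (2 * lstm_units) * vocab_size + vocab_size

total = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total)
```

The four numbers match the `Param #` column and the total of 16,873,024 reported in the summary. Almost all of the parameters sit in the embedding and the output layer, so the vocabulary size dominates the model size.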

## Splitting the Training and Test Sets

In [0]:
import numpy as np

train_x = np.zeros([(int)(0.8*len(badSentences)), 1040], dtype=np.int32)
train_y = np.zeros([(int)(0.8*len(badSentences))], dtype=np.int32)
test_x = np.zeros([len(badSentences) - (int)(0.8*len(badSentences)), 1040], dtype=np.int32)
test_y = np.zeros([len(badSentences) - (int)(0.8*len(badSentences))], dtype=np.int32)

train_x_good = np.zeros([(int)(0.8*len(goodSentences)), 1040], dtype=np.int32)
train_y_good = np.zeros([(int)(0.8*len(goodSentences))], dtype=np.int32)
test_x_good = np.zeros([len(goodSentences) - (int)(0.8*len(goodSentences)), 1040], dtype=np.int32)
test_y_good = np.zeros([len(goodSentences) - (int)(0.8*len(goodSentences))], dtype=np.int32)

np.random.seed(42) # The seed must be set for consistent results; otherwise the training/test split changes every run.
idc_bad = np.random.permutation(len(badSentences))
train_bad, test_bad = idc_bad[:(int)(0.8*len(idc_bad))], idc_bad[(int)(0.8*len(idc_bad)):]
idc_good = np.random.permutation(len(goodSentences))
train_good, test_good = idc_good[:(int)(0.8*len(idc_good))], idc_good[(int)(0.8*len(idc_good)):]


for i, j in enumerate(train_bad):
  for t, word in enumerate(badSentences[j][:-1]):
    train_x[i, t] = model_Bad.wv.vocab[word].index
  train_y[i] = model_Bad.wv.vocab[badSentences[j][-1]].index

for i, j in enumerate(test_bad):
  for t, word in enumerate(badSentences[j][:-1]):
    test_x[i, t] = model_Bad.wv.vocab[word].index
  test_y[i] = model_Bad.wv.vocab[badSentences[j][-1]].index

  
for i, j in enumerate(train_good):
  for t, word in enumerate(goodSentences[j][:-1]):
    train_x_good[i, t] = model_Good.wv.vocab[word].index
  train_y_good[i] = model_Good.wv.vocab[goodSentences[j][-1]].index

for i, j in enumerate(test_good):
  for t, word in enumerate(goodSentences[j][:-1]):
    test_x_good[i, t] = model_Good.wv.vocab[word].index
  test_y_good[i] = model_Good.wv.vocab[goodSentences[j][-1]].index
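The cell above does two things: it shuffles the sentence indices into an 80/20 split, and it encodes each sentence as a zero-padded row of word indices whose target is the index of the final token. Here is a minimal sketch of the same idea in plain Python, with a made-up vocabulary and a short padding length; `word_to_index`, `max_len` and `encode` are illustrative names, not from the notebook.

```python
import random

# Made-up vocabulary and sentences; the notebook uses model_Bad.wv.vocab instead.
word_to_index = {'.': 0, 'get': 1, 'up': 2, 'man': 3, 'what': 4, '?': 5, ',': 6}
sentences = [['get', 'up', ',', 'man', '.'], ['what', '?'], ['get', 'up', '.'],
             ['man', '?'], ['what', ',', 'man', '?']]
max_len = 6  # the notebook uses 1040

def encode(sentence):
    """Return (padded input indices, target index of the final token)."""
    x = [word_to_index[w] for w in sentence[:-1]]
    x = x + [0] * (max_len - len(x))       # right-pad with zeros
    y = word_to_index[sentence[-1]]
    return x, y

rng = random.Random(42)                    # fixed seed so the split is repeatable
order = list(range(len(sentences)))
rng.shuffle(order)
cut = int(0.8 * len(order))
train_idx, test_idx = order[:cut], order[cut:]

train = [encode(sentences[i]) for i in train_idx]
test = [encode(sentences[i]) for i in test_idx]
print(len(train), len(test))
print(encode(['get', 'up', ',', 'man', '.']))
```

One thing this makes visible: padding with 0 is indistinguishable from the word whose index is 0, and the notebook's zero-initialised arrays behave the same way. Whether that matters for your results is something to check rather than assume.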


## Callbacks

This cell sets up the callbacks for early stopping and for saving checkpoints to Google Drive. It uses the callback implemented in [this](https://github.com/Zahlii/colab-tf-utils) repository because, due to the way Google Colab works, there is no native way to save model checkpoints during training and retrieve them later.

In [8]:
!wget https://raw.githubusercontent.com/Zahlii/colab-tf-utils/master/utils.py
import utils
import os
import keras

def compare(best, new):
  if not best.losses['val_acc']:
    print("Not best")
  if not new.losses['val_acc']:
    print("Not new")
  return best.losses['val_acc'] < new.losses['val_acc']

def path_b(new):
  if new.losses['val_acc'] > 0.65:
    return 'bad_movie_model_%s.h5' % new.losses['val_acc']

def path_g(new):
  if new.losses['val_acc'] > 0.65:
    return 'good_movie_model_%s.h5' % new.losses['val_acc']

early_stop_b = EarlyStopping(monitor='val_acc', patience=5, verbose=1)
early_stop_g = EarlyStopping(monitor='val_acc', patience=5, verbose=1)

callbacks_b = cb_b = [
    utils.GDriveCheckpointer(compare,path_b),
    keras.callbacks.TensorBoard(log_dir=os.path.join(utils.LOG_DIR,'bad_movie_model')),
    early_stop_b
]

callbacks_g = cb_g = [
    utils.GDriveCheckpointer(compare,path_g),
    keras.callbacks.TensorBoard(log_dir=os.path.join(utils.LOG_DIR,'good_movie_model')),
    early_stop_g
]

--2018-11-28 00:07:45--  https://raw.githubusercontent.com/Zahlii/colab-tf-utils/master/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6935 (6.8K) [text/plain]
Saving to: ‘utils.py’


2018-11-28 00:07:45 (62.5 MB/s) - ‘utils.py’ saved [6935/6935]

rm: cannot remove 'tboard.py': No such file or directory
--2018-11-28 00:07:51--  https://raw.githubusercontent.com/mixuala/colab_utils/master/tboard.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5214 (5.1K) [text/plain]
Saving to: ‘tboard.py’


2018-11-28 00:07:51 (59.8 MB/s) - ‘tb
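To make the `compare` and `path_b` functions easier to reason about, here they are exercised on stand-in objects. I am assuming, based only on how the cell above uses them, that the checkpointer passes in objects exposing a `losses` dict; the `SimpleNamespace` objects below are hypothetical substitutes, and what the checkpointer does when the path function returns `None` is not shown here.

```python
from types import SimpleNamespace

def compare(best, new):
    """True when the new epoch has a higher validation accuracy than the best so far."""
    return best.losses['val_acc'] < new.losses['val_acc']

def path_b(new):
    """Return a file name only once validation accuracy exceeds 0.65, else None."""
    if new.losses['val_acc'] > 0.65:
        return 'bad_movie_model_%s.h5' % new.losses['val_acc']

# Hypothetical stand-ins for the objects the checkpointer passes in.
best = SimpleNamespace(losses={'val_acc': 0.60})
weak = SimpleNamespace(losses={'val_acc': 0.62})
strong = SimpleNamespace(losses={'val_acc': 0.70})

print(compare(best, weak), path_b(weak))
print(compare(best, strong), path_b(strong))
```

Both candidates beat the previous best, but only the one above the 0.65 threshold gets a file name.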

## Training the Neural Network


### Note
To load a checkpoint, you need the file ID of the file you want to download. For example, if my checkpoint's URL is https://drive.google.com/file/d/1oKxyAyd5fX6dgpsSJeghzy2m5iWu-vQZ/view

The ID is **1oKxyAyd5fX6dgpsSJeghzy2m5iWu-vQZ**

To get the URL for a specific file, right-click it and select 'Get Shareable Link'.
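If you would rather not copy the ID by hand, it can be extracted from the shareable link. This is a sketch with a hypothetical helper, `file_id_from_url`, and it assumes the link has the `/file/d/<id>/` shape shown in the example above.

```python
import re

def file_id_from_url(url):
    """Extract the ID from a Drive link of the form .../file/d/<id>/... (hypothetical helper)."""
    match = re.search(r'/file/d/([^/?]+)', url)
    if match is None:
        raise ValueError('not a Drive file link: {}'.format(url))
    return match.group(1)

url = 'https://drive.google.com/file/d/1oKxyAyd5fX6dgpsSJeghzy2m5iWu-vQZ/view'
print(file_id_from_url(url))
```

The returned string is what goes into the `'id'` entry of `drive.CreateFile` when loading a checkpoint.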

In [15]:
# This is needed to make sure we are still authenticated and don't throw an Exception when we try to upload/download to Drive
from google.colab import files
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

from keras.models import load_model

# CHANGE THESE IF YOU ARE RESUMING TRAINING FROM A CERTAIN EPOCH.
BAD_MOVIE_INIT_EPOCH = 0
GOOD_MOVIE_INIT_EPOCH = 0

# CODE BLOCK FOR IF YOU ARE TRAINING THE MODEL FROM SCRATCH
# COMMENT OUT IF YOU ARE LOADING A MODEL FROM A CHECKPOINT
####################################################################
# compile model
good_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
bad_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#####################################################################


# CODE BLOCK FOR LOADING THE MODEL FROM AN EXISTING SAVED MODEL
# UNCOMMENT THIS IF YOU ARE LOADING A MODEL FROM A CHECKPOINT TO CONTINUE TRAINING
###################################################################
# drive_chk_bad = drive.CreateFile({'id': '1oKxyAyd5fX6dgpsSJeghzy2m5iWu-vQZ'})
# drive_chk_bad.GetContentFile('chkpt_bad.h5')
# drive_chk_good = drive.CreateFile({'id': 'FILE_ID_FOR_GOOD_MOVIE_CHKPT'})
# drive_chk_good.GetContentFile('chkpt_good.h5')

# bad_model = load_model('chkpt_bad.h5')
# good_model = load_model('chkpt_good.h5')
###################################################################


# fit model
bad_model.fit(train_x, train_y, validation_split = 0.2, batch_size=128, epochs=50, callbacks=cb_b,
             initial_epoch = BAD_MOVIE_INIT_EPOCH) 
# save the model to file
bad_model.save('bad_movie_model.h5')

# files.download('bad_movie_model.h5')
# Uncomment if you want it to download the model to your local machine after training
b_file = drive.CreateFile()
b_file.SetContentFile('bad_movie_model.h5')
b_file.Upload()


# Repeat for the good movies
good_model.fit(train_x_good, train_y_good, validation_split = 0.2, batch_size=128, epochs=50, callbacks=cb_g,
              initial_epoch = GOOD_MOVIE_INIT_EPOCH)
good_model.save('good_movie_model.h5')

# files.download('good_movie_model.h5')
# Uncomment if you want to download the model to your local machine after training
g_file = drive.CreateFile()
g_file.SetContentFile('good_movie_model.h5')
g_file.Upload()


Train on 58953 samples, validate on 14739 samples
Epoch 8/50
  256/58953 [..............................] - ETA: 37:29 - loss: 0.4385 - acc: 0.8086

KeyboardInterrupt: ignored
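Both models above are compiled with `sparse_categorical_crossentropy`, which takes the target as an integer word index rather than a one-hot vector. As a rough sketch of what that loss computes for a single sample (ignoring batching and the numerical clipping a real implementation applies), with made-up probabilities:

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: negative log of the probability given to the true class."""
    return -math.log(probs[label])

# Made-up softmax output over a 4-word vocabulary.
probs = [0.1, 0.7, 0.1, 0.1]

loss_confident = sparse_categorical_crossentropy(probs, 1)  # true class got 0.7
loss_wrong = sparse_categorical_crossentropy(probs, 0)      # true class got 0.1
print(loss_confident, loss_wrong)
```

The loss is small when the softmax puts most of its mass on the correct next word and grows quickly as that probability shrinks. This is why `train_y` can stay a plain array of indices instead of a one-hot matrix with one column per vocabulary entry.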
