# **Uploading Files onto Google Colab**

The word embeddings have been uploaded to Google Drive in a folder called **Word2Vec**.

They were trained with the Gensim module at two dimensionalities (200 and 300), with a window size of 5 and 500 iterations.

Here, we download those files to the Colab machine.

Note: The URL for the Word2Vec folder is: https://drive.google.com/drive/folders/1ZjAKEl5Nbg6kbM33vNOWIN2FdRfTvYEz?ogsrc=32

The folder id is the last path segment of that URL, which is why the `q` parameter is set to: **1ZjAKEl5Nbg6kbM33vNOWIN2FdRfTvYEz**

It is the id of the parent folder whose files the query lists.
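As a minimal sketch of where that id comes from, the snippet below pulls the last path segment out of the folder URL and builds the same query dictionary that is passed to `drive.ListFile`. The variable names (`folder_url`, `folder_id`, `query`) are illustrative and not part of the original notebook.

```python
from urllib.parse import urlparse

# Folder URL quoted in the note above.
folder_url = "https://drive.google.com/drive/folders/1ZjAKEl5Nbg6kbM33vNOWIN2FdRfTvYEz?ogsrc=32"

# The folder id is the last segment of the URL path; the query string is ignored.
folder_id = urlparse(folder_url).path.rstrip("/").split("/")[-1]

# Same shape as the dictionary handed to drive.ListFile(...).
query = {"q": "'{}' in parents".format(folder_id)}
print(query)
```

Deriving the id from the URL avoids copying it by hand when the folder changes.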

In [0]:
!pip install -U -q PyDrive
import os
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

local_download_path = os.path.expanduser('~/Word2Vec')
# Create the download directory if it does not already exist.
os.makedirs(local_download_path, exist_ok=True)

file_list = drive.ListFile(
    {'q': "'1ZjAKEl5Nbg6kbM33vNOWIN2FdRfTvYEz' in parents"}).GetList()

for f in file_list:
  # Create a file handle by id and download its contents.
  print('title: %s, id: %s' % (f['title'], f['id']))
  fname = os.path.join(local_download_path, f['title'])
  print('downloading to {}'.format(fname))
  f_ = drive.CreateFile({'id': f['id']})
  f_.GetContentFile(fname)

title: Tokenized Good Sentences.csv, id: 1FBrCGaOPpM4XJ-H4EQJJ030SruW8OSHU
downloading to /root/Word2Vec/Tokenized Good Sentences.csv
title: Tokenized Bad Sentences.csv, id: 1TKPq7ztGULn61goNyEPQCuDMYL2O_rr7
downloading to /root/Word2Vec/Tokenized Bad Sentences.csv
title: word2vec_Good_200, id: 1Cuyppnh1ZkhyYLXkwgf8MpvEeYcLp8zk
downloading to /root/Word2Vec/word2vec_Good_200
title: word2vec_Bad_200, id: 1Hn0CLiSe-f4i7NUrJT3cIFQAmOuUnb3V
downloading to /root/Word2Vec/word2vec_Bad_200
title: word2vec_Good_300, id: 17W1EyYgsYPcwWG9XZWMQc3f2DSv9bMKy
downloading to /root/Word2Vec/word2vec_Good_300
title: word2vec_Bad_300, id: 12A6-2bU39EQ12rBVtca4nlRSBUCaO1C3
downloading to /root/Word2Vec/word2vec_Bad_300


# Importing Tokenized Sentences

I have also uploaded two CSV files, **Tokenized Bad Sentences.csv** and **Tokenized Good Sentences.csv**, to the Word2Vec folder.

Below is the code to read in those files.

In [0]:
import csv
with open('/root/Word2Vec/Tokenized Bad Sentences.csv', 'r', encoding="latin-1") as f:
    reader = csv.reader(f)
    badSentences = list(reader)
    
with open('/root/Word2Vec/Tokenized Good Sentences.csv', 'r', encoding="latin-1") as f:
    reader = csv.reader(f)
    goodSentences = list(reader)
    
print(badSentences[0:10])   
print(goodSentences[0:10])  

[['oh', ',', 'shit', '.'], ['you', 'just', 'got', 'wolfed', '.'], ['what', '?'], ['that', 'is', 'an', 'official', 'trademark', 'that', 'i', 'am', 'getting', 'registered', '.'], ['it', "'s", 'a', 'lot', 'of', 'stuff', 'you', 'got', 'ta', 'do', ',', 'hoops', 'you', 'got', 'ta', 'jump', 'through', '.'], ['got', 'ta', 'get', 'on', 'the', 'internet', '.'], ['got', 'ta', 'go', 'to', 'some', 'stupidass', 'website', 'where', 'you', 'register', 'a', 'catch', 'phrase', '.'], ['i', 'wanted', '``', 'bam', ',', "''", 'but', 'emeril', 'had', 'taken', 'it', '.'], ['i', "'m", 'rambling', ',', 'man', '.'], ['get', 'up', ',', 'man', '.']]
[['mr.', 'dufresne', ',', 'describe', 'the', 'confrontation', 'you', 'had', 'with', 'your', 'wife', 'the', 'night', 'she', 'was', 'murdered', '.'], ['it', 'was', 'very', 'bitter', '.'], ['she', 'said', 'she', 'was', 'glad', 'i', 'knew', ',', 'that', 'she', 'hated', 'all', 'the', 'sneaking', 'around', '.'], ['and', 'she', 'said', 'that', 'she', 'wanted', 'a', 'divorce',
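The output above suggests that each CSV row holds one sentence with one token per cell, so `list(reader)` yields a list of token lists. The sketch below reproduces that on a hypothetical two-row sample held in memory (`sample_csv` is made up for illustration and is not one of the real files). It also shows that a comma token survives because it is quoted in the CSV.

```python
import csv
import io

# Hypothetical sample standing in for one of the tokenized CSV files.
# The comma token is quoted so that it is not treated as a delimiter.
sample_csv = 'get,up,",",man,.\nwhat,?\n'

toy_sentences = list(csv.reader(io.StringIO(sample_csv)))
print(toy_sentences)

# The longest sentence determines how wide a padded input array must be.
toy_max_len = max(len(sentence) for sentence in toy_sentences)
print(toy_max_len)
```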

# **Installing gensim**

Gensim is the Python module used to train the word2vec embeddings. Here is how to install it and load the saved models.

In [0]:
!pip install -q gensim
from gensim.models import Word2Vec
model_Bad = Word2Vec.load("/root/Word2Vec/word2vec_Bad_300")
model_Good = Word2Vec.load("/root/Word2Vec/word2vec_Good_300")
print(model_Bad)
print(model_Good)

Word2Vec(vocab=29504, size=300, alpha=0.025)
Word2Vec(vocab=30426, size=300, alpha=0.025)


# **Similar Vectors**

Once the word2vec embeddings are loaded, you can view the words whose vectors are most similar to that of a given word.

In [0]:
for i in model_Bad.wv.most_similar(positive=['good']):
  print(i)

print()

for i in model_Good.wv.most_similar(positive=['good']):
  print(i)

('bad', 0.42860960960388184)
('nice', 0.40062108635902405)
('like', 0.3491782248020172)
('tough', 0.3429213762283325)
('great', 0.34254634380340576)
('better', 0.28865498304367065)
('big', 0.28429123759269714)
('weird', 0.2701437175273895)
('happy', 0.26822713017463684)
('hard', 0.26718002557754517)

('bad', 0.5157124996185303)
('nice', 0.42234158515930176)
('great', 0.3885391056537628)
('smart', 0.3636835813522339)
('fine', 0.3554360568523407)
('tough', 0.34922707080841064)
('big', 0.3384079933166504)
('funny', 0.33056941628456116)
('hard', 0.3305012583732605)
('tempting', 0.31914812326431274)
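The scores printed above are cosine similarities. As a rough illustration of the ranking, here is a pure-Python sketch over hypothetical 2-dimensional vectors. The names `toy_vectors` and `toy_most_similar` are invented for this example, which is not Gensim's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors, chosen only to make the ranking easy to follow.
toy_vectors = {
    "good": [1.0, 0.0],
    "nice": [0.9, 0.1],
    "bad": [0.5, 0.5],
    "car": [0.0, 1.0],
}
toy_ranking = toy_most_similar("good", toy_vectors)
for pair in toy_ranking:
    print(pair)
```

Because the score measures how similar two words' contexts are rather than whether they mean the same thing, an antonym such as 'bad' can plausibly rank first for 'good', as it does in both models above.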




# **Word2Vec Weights onto Keras**

Because we are going to use Keras to train an RNN, here is how to extract the pretrained weight matrix from the word embedding and use it to initialize a Keras `Embedding` layer.

In [0]:
from keras.layers import Embedding

pretrained_weights_Bad = model_Bad.wv.vectors 
#pretrained_weights_Good = model_Good.wv.vectors

embeddingBad = Embedding(input_dim=pretrained_weights_Bad.shape[0], output_dim=pretrained_weights_Bad.shape[1], 
                    weights=[pretrained_weights_Bad])

# embeddingGood = Embedding(input_dim=pretrained_weights_Good.shape[0], output_dim=pretrained_weights_Good.shape[1], 
#                     weights=[pretrained_weights_Good])

print(pretrained_weights_Bad)

Using TensorFlow backend.


[[ 1.25372    -1.2052641  -0.25990063 ... -0.976092    1.8551047
   1.3307298 ]
 [ 1.4823085  -1.2304381  -0.9231364  ... -1.6590116   1.3164423
   0.8863562 ]
 [ 2.0804706  -0.7694774   0.17020172 ... -0.5701194   1.6237695
   1.2792978 ]
 ...
 [ 0.6441616   0.12467387  0.17375445 ...  0.27845207  0.394679
  -0.18595086]
 [ 0.70862037  0.14693536  0.24079292 ...  0.2610408   0.41543
  -0.17618927]
 [-0.08563908  0.16787037 -0.30501494 ...  0.32231143  0.14400984
   0.1407564 ]]
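Conceptually, an embedding layer initialized this way acts as a lookup table: word index `i` selects row `i` of the weight matrix. The sketch below shows that lookup with plain lists. It assumes, as the later training code does via `wv.vocab[word].index`, that row order matches the model's word indices. All names and values here are hypothetical.

```python
# Hypothetical 3-word vocabulary; row i of toy_weights belongs to toy_index2word[i].
toy_index2word = ["get", "up", "man"]
toy_weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_word2index = {word: i for i, word in enumerate(toy_index2word)}

def embed(tokens):
    """Map each token to its row of the weight matrix (what an embedding lookup does)."""
    return [toy_weights[toy_word2index[token]] for token in tokens]

toy_embedded = embed(["man", "get"])
print(toy_embedded)  # [[0.5, 0.6], [0.1, 0.2]]
```

This is why the network must be fed the same word indices that the word2vec model assigned. Any other numbering would select the wrong rows.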


# LSTM Neural Network

Here, we design the architecture of the LSTM network, which maps a sequence of word indices to a probability distribution over the vocabulary.

In [0]:
from keras.models import Sequential
from keras.layers import Activation, Dense
from keras.layers import LSTM

vocab_size, embedding_size = pretrained_weights_Bad.shape

model = Sequential()
model.add(embeddingBad)
model.add(LSTM(units=embedding_size))
model.add(Dense(units = vocab_size))
model.add(Activation('softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 300)         8851200   
_________________________________________________________________
lstm_1 (LSTM)                (None, 300)               721200    
_________________________________________________________________
dense_1 (Dense)              (None, 29504)             8880704   
_________________________________________________________________
activation_1 (Activation)    (None, 29504)             0         
Total params: 18,453,104
Trainable params: 18,453,104
Non-trainable params: 0
_________________________________________________________________
None
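The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the vocabulary size and embedding width, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias.

```python
vocab_size = 29504      # rows of the embedding matrix, as reported in the summary
embedding_size = 300    # embedding width, also used as the number of LSTM units
lstm_units = embedding_size

# Embedding: one vector of length embedding_size per vocabulary word.
embedding_params = vocab_size * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * (lstm_units * (embedding_size + lstm_units) + lstm_units)

# Dense: a weight per (LSTM unit, vocabulary word) pair plus one bias per word.
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 8851200 721200 8880704 18453104
```

The embedding and the final dense layer account for almost all of the parameters because both scale with the vocabulary size.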


In [0]:
# import numpy as np
# train_x = np.zeros([len(badSentences), 1040], dtype=np.int32)
# train_y = np.zeros([len(badSentences)], dtype=np.int32)

# for i, sentence in enumerate(badSentences):
#   for t, word in enumerate(sentence[:-1]):
#     train_x[i, t] = model_Bad.wv.vocab[word].index
#   train_y[i] = model_Bad.wv.vocab[sentence[-1]].index

# print(train_x[0:10])
# # compile model
# model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# # fit model
# model.fit(train_x, train_y, batch_size=128, epochs=100)
 
# # save the model to file
# model.save('model.h5')
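The commented-out cell above encodes each sentence as the indices of all its words except the last, zero-padded to a fixed width, with the last word's index as the target. Here is a small runnable sketch of that encoding using plain lists and a hypothetical vocabulary. `build_training_arrays`, `toy_index`, and `max_len` are illustrative names, and empty sentences are not handled.

```python
def build_training_arrays(sentences, word_to_index, max_len):
    """Inputs: indices of every word but the last, zero-padded; target: last word's index."""
    train_x, train_y = [], []
    for sentence in sentences:
        indices = [word_to_index[word] for word in sentence]
        context = indices[:-1][:max_len]          # drop the target, truncate if too long
        train_x.append(context + [0] * (max_len - len(context)))
        train_y.append(indices[-1])
    return train_x, train_y

# Hypothetical vocabulary and sentences, for illustration only.
toy_index = {"get": 0, "up": 1, ",": 2, "man": 3, ".": 4, "what": 5, "?": 6}
toy_sentences = [["get", "up", ",", "man", "."], ["what", "?"]]

toy_x, toy_y = build_training_arrays(toy_sentences, toy_index, max_len=4)
print(toy_x)  # [[0, 1, 2, 3], [5, 0, 0, 0]]
print(toy_y)  # [4, 6]
```

Note that the padding value 0 is also the index of a real word ("get" in this toy vocabulary), so the padded positions are indistinguishable from that word. This sketch does not address that.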