# Word Embedding

After preprocessing, we have `x_train, y_train, x_test, y_test, vocabulary_inv` for later use. First we train a word2vec model on the preprocessed sentences to obtain a **distributed word representation**. You can think of this representation as a more powerful set of features for a sentence. Once we have these features, we feed them to the model.

In this notebook, the whole preprocessing step is written in a single cell. If you need a refresher, see the **preprocess notebook**.

In [1]:
import sys
print(sys.path)

['', '/Users/xu/anaconda3/envs/tf/lib/python36.zip', '/Users/xu/anaconda3/envs/tf/lib/python3.6', '/Users/xu/anaconda3/envs/tf/lib/python3.6/lib-dynload', '/Users/xu/anaconda3/envs/tf/lib/python3.6/site-packages', '/Users/xu/anaconda3/envs/tf/lib/python3.6/site-packages/IPython/extensions', '/Users/xu/.ipython']


We add the parent directory to the module search path so that `data_helpers` can be imported. Here `os.pardir` represents the parent directory:

In [2]:
import sys, os
sys.path.append(os.pardir)

In [3]:
import numpy as np
import data_helpers

In [4]:
# preprocess 

positive_data_file = "../data/rt-polaritydata/rt-polarity.pos"
negative_data_file = "../data/rt-polaritydata/rt-polarity.neg"

# Load data
print("Loading data...")
x_text, y = data_helpers.load_data_and_labels(positive_data_file, negative_data_file)

# Pad sentence
print("Padding sentences...")
x_text = data_helpers.pad_sentences(x_text)
print("The sequence length is: ", len(x_text[0]))

# Build vocabulary
vocabulary, vocabulary_inv = data_helpers.build_vocab(x_text)

# Represent each sentence as a sequence of word indices
x = data_helpers.build_index_sentence(x_text, vocabulary)
y = y.argmax(axis=1) # y: [1, 1, 1, ...., 0, 0, 0]. 1 for positive, 0 for negative

# Shuffle data
np.random.seed(42)
shuffle_indices = np.random.permutation(np.arange(len(y)))
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices]

# Split train and test
training_rate = 0.9
train_len = int(len(y) * training_rate)
x_train = x_shuffled[:train_len]
y_train = y_shuffled[:train_len]
x_test = x_shuffled[train_len:]
y_test = y_shuffled[train_len:]

# Output shape
print('x_train shape: ', x_train.shape)
print('x_test shape:', x_test.shape)
print('Vocabulary Size: {:d}'.format(len(vocabulary_inv)))


Loading data...
Padding sentences...
The sequence length is:  56
x_train shape:  (9595, 56)
x_test shape: (1067, 56)
Vocabulary Size: 18765
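
The shapes above follow directly from the split arithmetic. As a quick sanity check, the sketch below recomputes them from the total number of sentences (9595 + 1067 = 10662) and the `training_rate` used in the cell. The helper name `split_sizes` is ours and is not part of `data_helpers`.

```python
def split_sizes(n_samples, training_rate):
    """Return (train_len, test_len) using the same truncating rule as the cell above."""
    train_len = int(n_samples * training_rate)
    return train_len, n_samples - train_len


n_samples = 9595 + 1067  # total sentences, taken from the printed shapes
train_len, test_len = split_sizes(n_samples, 0.9)
print(train_len, test_len)
```

Because `int()` truncates, any remainder from the multiplication ends up in the test set.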


The paper [Convolutional Neural Networks for Sentence Classification](http://www.aclweb.org/anthology/D14-1181) proposes several CNN variants:

* CNN-rand: No word2vec. All word vectors are randomly initialized and then modified during training.

* CNN-static: Uses pre-trained word2vec vectors and keeps them fixed during training. If a word does not appear in word2vec, its vector is randomly initialized.

* CNN-non-static: Same as above, but the pre-trained vectors are fine-tuned for each task.

* CNN-multichannel: A model with two sets of word vectors (CNN-static, CNN-non-static). Each set of vectors is treated as a ‘channel’ and each filter is applied to both channels.

Here we implement CNN-non-static, which requires an embedding layer. We use `word2vec.py` to pre-train the word vectors.
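
The practical difference between CNN-static and CNN-non-static is whether the embedding table receives gradient updates. The toy sketch below illustrates this with plain Python lists and a made-up gradient. The function `sgd_step`, its `trainable` flag, and all numbers are illustrative assumptions, not code from the paper or from this notebook.

```python
def sgd_step(embeddings, grads, lr=0.1, trainable=True):
    """Apply one SGD update to an embedding table (dict: index -> vector).

    trainable=False mimics CNN-static: vectors are returned unchanged.
    trainable=True mimics CNN-non-static: vectors are fine-tuned.
    """
    if not trainable:
        return {idx: list(vec) for idx, vec in embeddings.items()}
    updated = {}
    for idx, vec in embeddings.items():
        # Words that did not occur in the batch have no gradient and stay as they are
        grad = grads.get(idx, [0.0] * len(vec))
        updated[idx] = [v - lr * g for v, g in zip(vec, grad)]
    return updated


pretrained = {0: [0.5, -0.5], 1: [1.0, 2.0]}  # hypothetical 2-d word vectors
grads = {1: [1.0, -1.0]}                      # hypothetical gradient for word 1 only

static = sgd_step(pretrained, grads, trainable=False)
non_static = sgd_step(pretrained, grads, trainable=True)
print(static[1], non_static[1])
```

In a real model the same switch is usually a flag on the embedding layer rather than a hand-written update.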

In [5]:
from gensim.models import word2vec
from os.path import join, exists, split
import os
import numpy as np

In [7]:
# Set the parameters for training the word2vec model
"""
inputs:
sentence_matrix # int matrix: num_sentences x max_sentence_len
vocabulary_inv  # dict {int: str}
num_features    # Word vector dimensionality                      
min_word_count  # Minimum word count                        
context         # Context window size 
"""
sentence_matrix = x
# vocabulary_inv = vocabulary_inv
num_features = 300
min_word_count = 1
context = 10

In [8]:
num_workers = 2  # Number of threads to run in parallel
downsampling = 1e-3  # Downsample setting for frequent words

# sample(param in gensim): threshold for configuring which 
# higher-frequency words are randomly downsampled;
# default is 1e-3, values of 1e-5 (or lower) may also be useful, 
# set to 0.0 to disable downsampling. 
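
To give a feel for what the `sample` threshold does, the sketch below evaluates the keep-probability formula from the original word2vec C implementation, `p = (sqrt(f / t) + 1) * t / f`, where `f` is a word's share of all tokens and `t` is the threshold. That gensim applies exactly this formula internally is an assumption here, and the word frequencies are made-up values.

```python
import math


def keep_probability(word_fraction, sample=1e-3):
    """Probability of keeping one occurrence of a word during training.

    word_fraction: the word's count divided by the total token count.
    sample: the downsampling threshold (0 disables downsampling).
    """
    if sample <= 0:
        return 1.0
    p = (math.sqrt(word_fraction / sample) + 1) * sample / word_fraction
    return min(p, 1.0)  # words at or below the threshold are always kept


for fraction in (0.05, 0.001, 0.0001):  # hypothetical word frequencies
    print(fraction, round(keep_probability(fraction), 3))
```

Very frequent words such as `the` are dropped most of the time, while words at or below the threshold are never dropped.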

In [9]:
sentences = [[vocabulary_inv[w] for w in s] for s in sentence_matrix]

In [12]:
# show words
sentences[0]

['the',
 'rock',
 'is',
 'destined',
 'to',
 'be',
 'the',
 '21st',
 'century',
 "'s",
 'new',
 'conan',
 'and',
 'that',
 'he',
 "'s",
 'going',
 'to',
 'make',
 'a',
 'splash',
 'even',
 'greater',
 'than',
 'arnold',
 'schwarzenegger',
 ',',
 'jean',
 'claud',
 'van',
 'damme',
 'or',
 'steven',
 'segal',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>',
 '<PAD/>']
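
Note that the printed sentence still contains the `<PAD/>` tokens, so word2vec also learns a vector for the padding token. If that is not wanted, one possible variant (an assumption on our side, not what this notebook does) is to drop the padding token before training. The helper `strip_padding` and the sample data below are hypothetical.

```python
PAD_TOKEN = "<PAD/>"


def strip_padding(sentences, pad_token=PAD_TOKEN):
    """Return a copy of the tokenized sentences without padding tokens."""
    return [[w for w in s if w != pad_token] for s in sentences]


padded = [
    ["the", "rock", "is", "destined", PAD_TOKEN, PAD_TOKEN],
    ["a", "splash", PAD_TOKEN, PAD_TOKEN, PAD_TOKEN, PAD_TOKEN],
]
unpadded = strip_padding(padded)
print([len(s) for s in unpadded])
```

With this variant `<PAD/>` would no longer be in the trained model, so it would need a vector from somewhere else, for example a random one.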

In [13]:
# Note: gensim 4.0+ renamed the `size` argument to `vector_size`
embedding_model = word2vec.Word2Vec(sentences, workers=num_workers,
                                    size=num_features, min_count=min_word_count,
                                    window=context, sample=downsampling)

In [17]:
# Build the path under which the model will be saved
model_dir = 'models'
model_name = "{:d}features_{:d}minwords_{:d}context".format(num_features, min_word_count, context)
model_name = join(model_dir, model_name)
model_name

'models/300features_1minwords_10context'

In [49]:
print("{:d}features".format(400))
print("{0:d}features".format(400))

400features
400features


In [16]:
split(model_name)

('models', '300features_1minwords_10context')

In [20]:
if not exists(model_dir):
    os.mkdir(model_dir)
print('Saving Word2Vec \'%s\'' % split(model_name)[-1])
embedding_model.save(model_name)


Saving Word2Vec '300features_1minwords_10context'


In [25]:
# get the vector of word 'rock' 
print(embedding_model.wv['rock'].shape)
print(embedding_model.wv['rock'][:10])
print(embedding_model.vector_size)

(300,)
[-0.18361306 -0.02460535  0.08828513 -0.07919128  0.11477375  0.14191546
  0.04934429 -0.0182813  -0.02350191  0.0839237 ]
300


In [44]:
embedding_model.wv['<PAD/>'][:10]

array([-0.97043616,  0.02826277, -0.02718172, -0.08975446,  0.32522753,
        0.49387988,  0.3853714 , -0.19174875, -0.09329756,  0.3320315 ],
      dtype=float32)

If a word is not in `embedding_model`, we initialize its vector randomly:

In [27]:
# Collect a vector for every vocabulary word; unknown words get a random vector
embedding_weights = {}
for key, word in vocabulary_inv.items():
    if word in embedding_model.wv:
        embedding_weights[key] = embedding_model.wv[word]
    else:
        embedding_weights[key] = np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
        

In [34]:
print(vocabulary['rock'])
print(vocabulary['<PAD/>'])

565
0


In [38]:
print(embedding_weights[565][:10]) # rock vector
print(embedding_weights[0][:10]) # <PAD/> vector

[-0.18361306 -0.02460535  0.08828513 -0.07919128  0.11477375  0.14191546
  0.04934429 -0.0182813  -0.02350191  0.0839237 ]
[-0.97043616  0.02826277 -0.02718172 -0.08975446  0.32522753  0.49387988
  0.3853714  -0.19174875 -0.09329756  0.3320315 ]
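
To feed these vectors to a model, the dictionary is typically stacked into a single matrix whose row `i` holds the vector of word index `i`, so that a sentence of indices becomes a 2-D array by simple indexing. The sketch below shows this with NumPy on a tiny made-up dictionary. `to_embedding_matrix` is a hypothetical helper, and it assumes the keys are the contiguous integers `0..V-1`, which we assume also holds for `vocabulary_inv`.

```python
import numpy as np


def to_embedding_matrix(embedding_weights):
    """Stack {index: vector} into an array of shape (vocab_size, dim)."""
    vocab_size = len(embedding_weights)
    # Assumes the keys are exactly 0 .. vocab_size - 1
    return np.stack([embedding_weights[i] for i in range(vocab_size)])


# Hypothetical 3-word vocabulary with 4-dimensional vectors
toy_weights = {
    0: np.zeros(4, dtype=np.float32),
    1: np.ones(4, dtype=np.float32),
    2: np.full(4, 2.0, dtype=np.float32),
}
matrix = to_embedding_matrix(toy_weights)

sentence = np.array([2, 1, 0, 0])    # a padded sentence of word indices
sentence_vectors = matrix[sentence]  # shape: (sentence_len, dim)
print(matrix.shape, sentence_vectors.shape)
```

A matrix of this form is what an embedding layer would usually take as its initial weights.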
