### Unsupervised dimensionality reduction using a one-hidden-layer perceptron where label == ground truth
### For NLP, we can say that word2vec and autoencoders are somewhat similar.

> Dimensionality reduction works only if the inputs are correlated (like images from the same domain). It fails if we pass in completely random inputs each time we train an autoencoder. So in the end, an autoencoder can produce lower dimensional output (at the encoder) given an input much like Principal Component Analysis (PCA). And since we don’t have to use any labels during training, it’s an unsupervised model as well.
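
To make the point about correlated inputs concrete, here is a minimal NumPy sketch that uses PCA (via SVD) as a stand-in for the encoder. The helper name `reconstruction_error`, the synthetic data, and the sizes (500 samples, 20 features, 2 latent factors) are all assumptions chosen for illustration; they are not part of the tutorial's pipeline.

```python
import numpy as np

def reconstruction_error(x, k):
    """Mean squared error after projecting x onto its top-k principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    reconstructed = centered @ components.T @ components
    return float(np.mean((centered - reconstructed) ** 2))

rng = np.random.default_rng(0)

# Correlated inputs: 20 observed features driven by only 2 latent factors.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 20))
correlated = latent @ mixing + 0.01 * rng.normal(size=(500, 20))

# Uncorrelated inputs: 20 independent noise features.
random_inputs = rng.normal(size=(500, 20))

err_correlated = reconstruction_error(correlated, k=2)
err_random = reconstruction_error(random_inputs, k=2)
print("correlated:", err_correlated)
print("random:    ", err_random)
```

With two components, the correlated data should be reconstructed almost perfectly, while most of the variance in the random data is lost. There is no shared structure for a low-dimensional code to capture.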

In this tutorial we'll use the probability distribution based on the global `co-occurrence` counts of words in the corpus, smoothed with `Laplace (add-1) smoothing`.
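
As a rough sketch of what that could look like, the snippet below counts co-occurrences within a fixed window and applies add-1 smoothing row by row. It assumes the distribution of interest is P(context | center); the helper names `cooccurrence_counts` and `laplace_smooth`, the window size of 1, and the hard-coded sequence of word ids are illustrative choices, not the notebook's own code.

```python
import numpy as np

def cooccurrence_counts(indexed_corpus, vocab_size, window=1):
    """Count how often each pair of word ids appears within `window` positions."""
    counts = np.zeros((vocab_size, vocab_size), dtype=np.float64)
    for i, center in enumerate(indexed_corpus):
        lo = max(0, i - window)
        hi = min(len(indexed_corpus), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[center, indexed_corpus[j]] += 1
    return counts

def laplace_smooth(counts):
    """Add-1 smoothing: P(context | center) = (count + 1) / (row_total + V)."""
    vocab_size = counts.shape[1]
    return (counts + 1.0) / (counts.sum(axis=1, keepdims=True) + vocab_size)

# A small sequence of word ids (assumed to be already indexed, 10 distinct ids).
toy_corpus = [1, 2, 4, 3, 0, 8, 1, 6, 9, 5, 1, 2, 7, 3]
probs = laplace_smooth(cooccurrence_counts(toy_corpus, vocab_size=10, window=1))
print(probs[1].round(3))
```

Adding 1 to every count guarantees that no pair gets probability zero, and adding the vocabulary size V to the denominator keeps each row summing to 1.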

In [4]:
import os
from random import randint
from collections import Counter
# Hide TensorFlow's INFO and WARNING log messages; set this before importing tensorflow.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import numpy as np
import tensorflow as tf
# Requires TensorFlow 1.x: the tf.contrib namespace was removed in TensorFlow 2.x.
from tensorflow.contrib.tensorboard.plugins import projector

In [33]:
corpus = 'the quick brown fox jumped over the lazy dog from the quick tall fox'.split()
# corpus = read_data(DATA_FOLDER + FILE_NAME)
corpus[:10]

['the',
 'quick',
 'brown',
 'fox',
 'jumped',
 'over',
 'the',
 'lazy',
 'dog',
 'from']

In [34]:
def build_vocab(words, vocab_size):
    """ Build vocabulary of VOCAB_SIZE most frequent words """
    # Index 0 is reserved for unknown words; the rest are ordered by frequency.
    count = [('UNK', -1)]
    count.extend(Counter(words).most_common(vocab_size - 1))
    dictionary = {word: index for index, (word, _) in enumerate(count)}
    index_dictionary = {index: word for word, index in dictionary.items()}
    return dictionary, index_dictionary

In [35]:
vocabulary, reverse_vocabulary = build_vocab(corpus, 10)

In [36]:
vocabulary

{'UNK': 0,
 'brown': 4,
 'dog': 9,
 'fox': 3,
 'from': 5,
 'lazy': 6,
 'over': 8,
 'quick': 2,
 'tall': 7,
 'the': 1}

In [37]:
def index_words_in_corpus(corpus):
    # Map each token to its vocabulary index; unknown tokens map to 0 ('UNK').
    return [vocabulary.get(token, 0) for token in corpus]

In [38]:
corpus = index_words_in_corpus(corpus)

In [39]:
corpus

[1, 2, 4, 3, 0, 8, 1, 6, 9, 5, 1, 2, 7, 3]

In [40]:
vocabulary_size = len(vocabulary)
vocabulary_size

10

In [48]:
# def compute_likelihood(corpus):
#     likelihood = defaultdict(dict)
#     count = Counter(data)
#     for token in corpus:
#         token_count_in_both_class = (count[token] + 1)
#         likelihood[sent][token] = (data[sent][token] + 1.) / token_count_in_both_class
# #             likelihood[sent][token] = log((data[sent][token] + 1.) / token_count_in_both_class, 2)

#     return likelihood

In [49]:
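def one_hot_encode(index):
    """ Return a one-hot vector of length vocabulary_size with a 1 at `index`. """
    row = np.zeros(vocabulary_size, dtype=np.int32)
    row[index] = 1  # integer, to match the int32 dtype
    return row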

In [50]:
data = np.array([one_hot_encode(i) for i in corpus])

In [51]:
print("(Total number of words, Vocabulary size):", data.shape)

(Total number of words, Vocabulary size): (14, 10)
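
The same matrix can be built in one step by indexing into an identity matrix, and doing so also shows why one-hot inputs are convenient for a single hidden layer: multiplying a one-hot row by a weight matrix simply selects one row of that matrix. The sketch below assumes a hypothetical 10 x 3 weight matrix filled with random values; the name `weights` and its shape are for illustration only.

```python
import numpy as np

vocab_size = 10
indexed_corpus = [1, 2, 4, 3, 0, 8, 1, 6, 9, 5, 1, 2, 7, 3]

# Row i of the identity matrix is the one-hot vector for word id i.
one_hot = np.eye(vocab_size, dtype=np.int32)[indexed_corpus]
print(one_hot.shape)  # (14, 10)

# Hypothetical hidden-layer weights (10 words -> 3 dimensions), random for illustration.
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocab_size, 3))

# Multiplying a one-hot row by the weights just selects one row of the matrix.
hidden = one_hot @ weights
print(np.allclose(hidden, weights[indexed_corpus]))  # True
```

This is the sense in which the rows of the first weight matrix can be read as word vectors.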


In [47]:
def generate_sample(index_words, context_window_size):
    """ Form training pairs according to the skip-gram model. """
    for index, center in enumerate(index_words):
        # pick a random window size between 1 and context_window_size
        context = randint(1, context_window_size)
        # pair the center word with each word in the window before it
        for target in index_words[max(0, index - context): index]:
            yield center, target
        # pair the center word with each word in the window after it
        for target in index_words[index + 1: index + context + 1]:
            yield center, target
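
To see which pairs come out, it helps to remove the randomness. The sketch below is a deterministic variant that always uses the full window; the name `skipgram_pairs` is hypothetical and the four-word input is just an example.

```python
def skipgram_pairs(index_words, window):
    """Deterministic variant of generate_sample: always use the full window."""
    pairs = []
    for index, center in enumerate(index_words):
        # words in the window before the center word
        for target in index_words[max(0, index - window): index]:
            pairs.append((center, target))
        # words in the window after the center word
        for target in index_words[index + 1: index + window + 1]:
            pairs.append((center, target))
    return pairs

pairs = skipgram_pairs([1, 2, 4, 3], window=1)
print(pairs)
# [(1, 2), (2, 1), (2, 4), (4, 2), (4, 3), (3, 4)]
```

Words at the edges of the sequence get fewer pairs because part of their window falls outside the corpus.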

In [None]:
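def get_batch(iterator, batch_size):
    """ Group a numerical stream into batches and yield them as Numpy arrays. """
    while True:
        center_batch = np.zeros(batch_size, dtype=np.int32)
        target_batch = np.zeros([batch_size, 1], dtype=np.int32)
        for index in range(batch_size):
            try:
                center_batch[index], target_batch[index] = next(iterator)
            except StopIteration:
                # Stream exhausted: stop cleanly and drop the partial batch.
                # (An uncaught StopIteration inside a generator becomes a
                # RuntimeError in Python 3.7+.)
                return
        yield center_batch, target_batch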
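
Because `get_batch` calls `next()` for every slot in a batch, it works best with a stream that does not run out mid-batch. A small usage sketch follows, with the batching logic repeated so the snippet runs on its own. The list of pairs is made up for illustration, and wrapping it in `itertools.cycle` is one assumed way to get an endless stream.

```python
from itertools import cycle

import numpy as np

def get_batch(iterator, batch_size):
    """Same batching logic as above: fill fixed-size arrays from a pair stream."""
    while True:
        center_batch = np.zeros(batch_size, dtype=np.int32)
        target_batch = np.zeros([batch_size, 1], dtype=np.int32)
        for index in range(batch_size):
            center_batch[index], target_batch[index] = next(iterator)
        yield center_batch, target_batch

# Hypothetical (center, target) pairs; cycle() makes the stream endless.
pairs = [(1, 2), (2, 1), (2, 4), (4, 2), (4, 3), (3, 4)]
centers, targets = next(get_batch(cycle(pairs), batch_size=4))
print(centers.shape, targets.shape)  # (4,) (4, 1)
print(centers)                       # [1 2 2 4]
```

The centers come back as a flat vector while the targets are a column of shape `(batch_size, 1)`.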

In [12]:
VOCAB_SIZE = 5000
BATCH_SIZE = 32
EMBED_SIZE = 10  # dimension of the word embedding vectors
SKIP_WINDOW = 3  # the context window
NUM_SAMPLED = 16  # number of negative examples to sample
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 10000
SKIP_STEP = 100  # how many steps to skip before reporting the loss