# <center>Assignment 3</center>

This assignment is based on embeddings and CNNs. You can code in Python 2 or Python 3. All imports used in this notebook are listed below; if they work, you are (mostly) set to complete the assignment. You will learn the following:
* Making use of embeddings in TensorFlow
* Coding CNNs in TF
* Intuitions behind how CNNs work
* Intuitions behind embeddings

In [1]:
from __future__ import print_function, division
import random 
import tensorflow as tf
import numpy as np
import os

In [2]:
import sys
if sys.version_info >= (3, 0):
  from builtins import map as m
  def map(f,l):
    return list(m(f,l))

## Quick Review questions

Q1) If the input volume has dimensions 10 x 10 x 32 (Height x Width x Channels), how many weights will be there in a filter that considers an area of 5 x 5?

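<b>Answer:</b> A single filter spans the full input depth, so it has 5 * 5 * 32 = 800 weights (plus one bias). A bank of D_out filters, where D_out is the number of output channels, therefore has Num_weights = 32 * 5 * 5 * D_out.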
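
The arithmetic for Q1 can be checked with a few lines of Python. This is a minimal sketch; the helper names are hypothetical, and the value 64 for the number of filters is only an illustrative choice.

```python
# Hypothetical helpers; only the arithmetic comes from the question.
def weights_per_filter(filter_height, filter_width, in_channels):
    """Number of weights in one filter (bias excluded)."""
    return filter_height * filter_width * in_channels


def weights_in_bank(filter_height, filter_width, in_channels, num_filters):
    """Number of weights in a bank of num_filters filters (biases excluded)."""
    return weights_per_filter(filter_height, filter_width, in_channels) * num_filters


single = weights_per_filter(5, 5, 32)
bank = weights_in_bank(5, 5, 32, 64)  # 64 filters is an assumed example value
print(single)  # 800
print(bank)    # 51200
```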

Q2) If input volume has dimensions 10 x 10 x 32 and after convolution we get an output volume of 8 x 8 x 64, how many filters were used? 

<b>Answer:</b> 64 filters (The number of filters = the number of channels for the output volume)
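
The spatial size in Q2 follows the usual output-size formula for convolutions. The sketch below assumes 3 x 3 filters with stride 1 and no padding, which is one configuration consistent with a 10 x 10 input and an 8 x 8 output; the question itself does not state the filter size.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1


# Assumed configuration: 3 x 3 filters, stride 1, no padding.
out = conv_output_size(10, 3)
print(out)  # 8
```

The depth of the output (64) does not come from this formula; it is simply the number of filters.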

Q3) What is inverted-dropout? Why is it done? 

<b>Answer:</b> Dropout is a regularization technique that complements methods such as L1, L2, and MaxNorm. During training, each neuron in a NN is kept with some probability p (a hyperparameter) and otherwise has its output set to 0. <b>Inverted dropout</b> is a version of dropout that performs both the dropping and the scaling (dividing the kept activations by p) during training rather than rescaling at test time. It is generally preferable because it leaves the test-time forward pass unchanged.
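
A minimal sketch of the training-time forward pass of inverted dropout, written in plain Python for illustration. The function name and the keep probability of 0.8 are assumptions, not part of the assignment.

```python
import random


def inverted_dropout(activations, keep_prob, rng):
    """Training-time forward pass: zero each unit with probability 1 - keep_prob
    and scale the surviving units by 1 / keep_prob."""
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)
        else:
            out.append(0.0)
    return out


rng = random.Random(0)
acts = [1.0] * 10000
dropped = inverted_dropout(acts, keep_prob=0.8, rng=rng)
mean_after = sum(dropped) / len(dropped)
# The mean stays close to the original mean of 1.0, so no rescaling is needed at test time.
print(round(mean_after, 2))
```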

## Sentiment Classification - dataset analysis

We will use the movie review dataset from http://www.cs.cornell.edu/people/pabo/movie-review-data/. The exact dataset we will use is the Sentence-polarity dataset.

In [3]:
data = []
for file_, label in zip(["class_neg.txt", "class_pos.txt"], [0, 1]):
    # Use a context manager so the file is closed after reading.
    with open(file_) as f:
        lines = f.readlines()
    lines = map(lambda x: x.strip().replace("-", " ").split(), lines)
    for line in lines:
        data.append([line, label])
    print("Number of reviews of {} = {}".format(file_[:-4], len(lines)))
    print("\tMax number of tokens in a sentence = {}".format(max(map(lambda x: len(x), lines))))
    print("\tMin number of tokens in a sentence = {}".format(min(map(lambda x: len(x), lines))))
random.Random(5).shuffle(data)

Number of reviews of class_neg = 5331
	Max number of tokens in a sentence = 56
	Min number of tokens in a sentence = 1
Number of reviews of class_pos = 5331
	Max number of tokens in a sentence = 59
	Min number of tokens in a sentence = 2


Observe that the sentences have different lengths. To vectorize the operations, all sentences must have equal length. Therefore, we will pad all sentences to the same length and represent the padded positions with zeros.
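
The padding step can be sketched as follows. This is an illustration only: the function name, the pad index of 10, and the target length of 5 are made up for the example.

```python
def pad_sentences(sentences, pad_index, pad_length):
    """Return copies of the sentences (lists of word indices) right-padded to pad_length."""
    return [s + [pad_index] * (pad_length - len(s)) for s in sentences]


batch = [[4, 7], [1, 2, 3, 9]]
padded = pad_sentences(batch, pad_index=10, pad_length=5)
print(padded)  # [[4, 7, 10, 10, 10], [1, 2, 3, 9, 10]]
```

Mapping the pad index to an all-zero embedding row then makes the padded positions contribute zeros to any dot product.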

In [4]:
# See some randomly sampled sentences
# random.randint is inclusive at both ends, so the upper bound must be len(data) - 1.
print(" ".join(data[random.randint(0, len(data) - 1)][0]))

an awkwardly garish showcase that diverges from anything remotely probing or penetrating .


We will work with the sentence as given and not remove any stop-words or punctuation marks. 

In [5]:
sents = map(lambda x: x[0], data)  # all sentences
all_words = set()
for sent in sents:
    all_words |= set(sent)
all_words = sorted(list(all_words))
vocab = {
    all_words[i]: i for i in range(len(all_words))
}
print("Number of words : ", len(vocab))
train = data[:int(0.8 * len(data))]
test = data[int(0.8 * len(data)):]
train_data = []
train_targets = []
test_data = []
test_targets = []
for list_all, list_data, list_target, label_list in zip([train, test], [train_data, test_data], [train_targets, test_targets], ["train","test"]):
    for datum, label in list_all:
        list_data.append([vocab[w] for w in datum])
        list_target.append([label])
    print(label_list)
    print("\tNumber of positive examples : ", list_target.count([1]))
    print("\tNumber of negative examples : ", list_target.count([0]))

Number of words :  19757
train
	Number of positive examples :  4263
	Number of negative examples :  4266
test
	Number of positive examples :  1068
	Number of negative examples :  1065


For implementation purposes, we need an index for the padding word; we will use index 19757 (one past the last vocabulary index).
Note: For a dataset this <i>small</i>, K-fold cross-validation would give a more reliable performance estimate. However, we will work with this single train-test split for the rest of this assignment.
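
For reference, K-fold cross-validation splits the data into K parts and uses each part once as the test set. The sketch below is a plain-Python illustration with a hypothetical helper name; it is not used in the rest of the assignment.

```python
def k_fold_indices(num_examples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [num_examples // k + (1 if i < num_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, num_examples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each example appears in exactly one test fold, and the reported score would be the average over the K runs.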

## Simple Classifier

<img src="https://web.cs.dal.ca/~sastry/cnn_simple.jpg"/>

The above image shows the architecture of the simple model that we will implement for text classification. We are interested in the following hyperparameters apart from the number of filters (which we will set to 1 for this problem):
* The span of the filter/the number of words considered for making the prediction.
* The size of the stride.
* The number of activations selected for feeding into softmax classifier.

Before continuing:

* Can you reason how the machine is classifying (in the above example)? The values of the activations are color-coded. Is this the only possible way the machine can work? 

    (Your answer might look like : ... filter is ... template matching ... )

<b>Answer</b>: The filter gets applied to 4 locations in the input matrix, and outputs 4 numerical values based on the input words. Each row in the matrix represents a word, and sentences are zero-padded so that every input has the same length. The convolution produces a feature map that detects a specific feature (e.g. does the sentence contain "not great"?), to which we then apply max-pooling. Max-pooling yields a fixed-size output, which is required for classification. Also, by keeping only the largest activations we retain whether or not a feature appeared in the sentence while losing information about exactly where it appeared. This is acceptable here, since we only need to classify a movie review as positive or negative.
    
* Why might order of activations need to be retained?

<b>Answer</b>: The order of activations might need to be retained to know where specific features occur within the text. The meaning of a text can change if two sentences (or two words within a sentence) get swapped, so it can be important to retain the order.

* In the code, we will add an additional row of zeros to represent the padded words. Will the zeros of the padded words be updated during back-prop? Why or why not?

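<b>Answer</b>: No. In the code below, the padding row is created with `tf.zeros` and concatenated onto the trainable embedding matrix. It is a constant rather than a `tf.Variable`, so the optimizer has nothing to update and the row stays at zero throughout training.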
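
The template-matching view from the answers above can be sketched in plain Python. The 2-dimensional embeddings and the filter values are made up for illustration; a trained model would learn them.

```python
def conv1d_over_words(word_vectors, filt, bias=0.0):
    """Slide a filter spanning len(filt) words over the sentence (stride 1, no padding).
    Each activation is the dot product of the filter with a window of word vectors."""
    span = len(filt)
    activations = []
    for start in range(len(word_vectors) - span + 1):
        total = bias
        for offset in range(span):
            for w, f in zip(word_vectors[start + offset], filt[offset]):
                total += w * f
        activations.append(total)
    return activations


# Toy embeddings (assumed values); the padding word maps to the zero vector.
not_, great, movie, pad = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]
sentence = [movie, not_, great, pad, pad]
# A span-2 filter that acts as a template for "not great".
template = [[1.0, 0.0], [0.0, 1.0]]
acts = conv1d_over_words(sentence, template)
print(acts)  # [0.5, 2.0, 0.0, 0.0]
```

The largest activation sits at the window covering "not great", and windows made only of padding contribute zero.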

First, we will write code which can select k top elements in the order they appeared. 
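
As a reference for what this selection should return, here is a plain-Python sketch for a single row. The function name is hypothetical, and ties are broken by position, which the TensorFlow version may handle differently.

```python
def k_max_pool_row(row, k):
    """Return the k largest values of row in their original order."""
    top_indices = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
    return [row[i] for i in sorted(top_indices)]


result = k_max_pool_row([0.5, 0.1, 0.9, 0.3, 0.7], 3)
print(result)  # [0.5, 0.9, 0.7]
```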

In [6]:
def k_max_pool(A, k):
    """
    A = 2 dimensional array (assume that the length of last dimension of A will be always more than k)
    k = number of elements.
    Return: For every row of A, return the top k elements in the order they appear.
    """
    assert len(A.get_shape()) == 2
    def func(row):
        _, indices = tf.nn.top_k(row, k=k, sorted=False)
        indices = tf.contrib.framework.sort(indices)
        return tf.gather(row, indices)
    return tf.map_fn(func, A)

In [7]:
A = tf.placeholder(shape=[None, None], dtype=tf.float64)
top = k_max_pool(A, 5)
sess = tf.Session()
for i in range(1, 6):
    np.random.seed(5)
    l = np.random.randn(i * 10, i * 10)
    top_elements = sess.run(top, feed_dict={A: l})
    l = l.tolist()
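    # Reference result: keep, in order, the entries strictly greater than the 6th largest value.
    # Note that this test assumes that the 6th largest element and 5th largest element are different.
    top_elements2 = np.array([[v for v in row if v > sorted(row, reverse=True)[5]] for row in l])
    # Compare absolute differences; a signed difference would let large negative errors pass.
    print((np.abs(top_elements - top_elements2) < 10**-10).all())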

True
True
True
True
True


In [8]:
def initializer(shape):
    xavier = tf.contrib.layers.xavier_initializer(seed=1)
    return xavier(shape)

In [9]:
class CNN_simple:
    def __init__(self, num_words, embedding_size=30, span=2, k=5):
        self.num_words = num_words

        # The batch of text documents. Let's assume that it is always padded to length 100
        self.input = tf.placeholder(shape=[None, 100], dtype=tf.int32)
        self.expected_output = tf.placeholder(shape=[None, 1], dtype=tf.float32)

        embedding_matrix = tf.Variable(initializer((num_words, embedding_size)), name="embeddings")
        
        # Add an additional row of zeros to denote padded words.
        zero_padding_row = tf.zeros((1, embedding_size))
        self.embedding_matrix = tf.concat(values=[embedding_matrix, zero_padding_row], axis=0)
        
        # Extract the vectors from the embedding matrix. The dimensions should be None x 100 x embedding_size
        vectors = tf.nn.embedding_lookup(self.embedding_matrix, self.input) # None x 100 x embedding_size
        
        # In order to use conv2d, we need vectors to be 4 dimensional.
        # Convention: NHWC = None (Batch Size) x Height(of image) x Width(of image) x Channel(Depth - e.g. RGB).
        # For text classification we consider Height = 1, width = number of words, channel = embedding_size.
        vectors2d = tf.expand_dims(vectors, 1) # None x 1 x 100 x embedding_size
        
        # Conv2d needs a filter bank.
        # The dimensions of the filter bank = Height, Width, in-channels, out-channels(Number-of-Filters).
        # We are creating a single filter of size = span. 
        # height = 1, width = span, in-channels = embedding_size, out-channels = 1. 
        single_filter = tf.Variable(initializer((1, span, embedding_size, 1)), name="filter")
        bias = tf.Variable(0.0, name="bias")  # You need a bias for each filter.
        conv_span = tf.nn.conv2d(
            input=vectors2d,
            filter=single_filter,
            strides=[1, 1, 1, 1],  # Note: the first and last elements SHOULD be 1.
            padding="VALID"  # We are ok with input size being reduced during the process of convolution.
        )  # Shape = (None, 1, 100 - span + 1, 1)
        
        acts = tf.nn.leaky_relu(conv_span + bias)
        
        # Now, let us extract the top k activations. 
        # But, we need to first convert acts into 2-dimensional.
        acts_2d = tf.squeeze(acts, [1, 3])
        
        # Use k_max_pool to extract top-k activations
        input_fully_connected = k_max_pool(acts_2d, k)  # None x k
        
        # Initialize the weight and bias needed for softmax classifier.
        self.softmax_weight = tf.Variable(tf.truncated_normal([k, 2], stddev=0.1))
        self.softmax_bias = tf.Variable(tf.constant(0.1, shape=[2]))
        
        # Write out the equation for computing the logits.
        Wx_plus_b = tf.add(tf.matmul(input_fully_connected, self.softmax_weight), self.softmax_bias)
        self.output = tf.nn.softmax(Wx_plus_b, axis=1)  # Shape = (None, 2)
        
        # Compute the cross-entropy cost. 
        # You might either sum or take mean of all the costs across all the examples. 
        # It is your choice as the test case is on Stochastic Training.
        one_hot_expected_output = tf.one_hot(tf.cast(self.expected_output, dtype=tf.int32), 2)
        self.cost = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits_v2(
                logits=Wx_plus_b,
                labels=one_hot_expected_output
            )
        )
        
        correct_prediction = tf.equal(tf.reshape(tf.argmax(self.output, 1), [-1, 1]), tf.cast(self.expected_output, dtype=tf.int64))
        self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))        
        
        optimizer = tf.train.AdamOptimizer()
        self.train_op = optimizer.minimize(self.cost)
        self.session = tf.Session()
        self.session.run(tf.global_variables_initializer())

    def pad(self, data, pad_word, pad_length=100):
        for datum in data:
            datum.extend([pad_word]*(pad_length-len(datum)))
        return data
    
    def train(self, train_data, test_data, train_targets, test_targets, batch_size=1, epochs=1, verbose=False):
        sess = self.session
        self.pad(train_data, self.num_words)
        self.pad(test_data, self.num_words)
        print("Starting training...")
        for epoch in range(epochs):
            cost_epoch = 0
            c = 0
            for datum, target in zip([train_data[i:i+batch_size] for i in range(0, len(train_data), batch_size)],
                                   [train_targets[i:i+batch_size] for i in range(0, len(train_targets), batch_size)]):
                _, cost = sess.run([self.train_op, self.cost], feed_dict={self.input: datum, self.expected_output: target})
                cost_epoch += cost
                c += 1
                if c % 100 == 0 and verbose:
                    print("\t{} batches finished. Cost : {}".format(c, cost_epoch/c))
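            # Note: cost_epoch sums one mean cost per batch, so dividing by len(train_data)
            # gives the mean cost per example only when batch_size == 1.
            print("Epoch {}: {}".format(epoch, cost_epoch/len(train_data)))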
            print("\tTrain accuracy: {}".format(self.compute_accuracy(train_data, train_targets)))
            print("\tTest accuracy: {}".format(self.compute_accuracy(test_data, test_targets)))
    
    def compute_accuracy(self, data, targets):
        return self.session.run(self.accuracy, feed_dict={self.input: data, self.expected_output: targets})

In [10]:
c=CNN_simple(len(vocab))
c.train(train_data, test_data, train_targets, test_targets, epochs=1, verbose=True)

Starting training...
	100 batches finished. Cost : 0.6929708349704743
	200 batches finished. Cost : 0.6922829195857048
	300 batches finished. Cost : 0.6925293787320455
	400 batches finished. Cost : 0.6923463428020478
	500 batches finished. Cost : 0.6932781388759613
	600 batches finished. Cost : 0.6933880699674289
	700 batches finished. Cost : 0.6927246058838709
	800 batches finished. Cost : 0.6925811886787414
	900 batches finished. Cost : 0.6934241186247931
	1000 batches finished. Cost : 0.6938184412121773
	1100 batches finished. Cost : 0.6937404790249738
	1200 batches finished. Cost : 0.6936372148990632
	1300 batches finished. Cost : 0.6933954641452202
	1400 batches finished. Cost : 0.6932443425059318
	1500 batches finished. Cost : 0.6931555347839992
	1600 batches finished. Cost : 0.6928893229737878
	1700 batches finished. Cost : 0.6930368027967565
	1800 batches finished. Cost : 0.69309016396602
	1900 batches finished. Cost : 0.6929770465901024
	2000 batches finished. Cost : 0.6928122

The expected output for the above snippet is
<pre>
Starting training...
	100 batches finished. Cost : 0.688363179564
	200 batches finished. Cost : 0.695461705327
	300 batches finished. Cost : 0.695902070602
	400 batches finished. Cost : 0.697339072227
	500 batches finished. Cost : 0.698220448136
    ...
Epoch 0: 0.675099702418
	Train accuracy: 0.718958854675
	Test accuracy: 0.664322555065   
</pre>
If you get any other output and you feel you are correct, you can proceed (However, I cannot think of any case where you can get a different output). 

## ConvNet 

### Architecture

<img src="https://web.cs.dal.ca/~sastry/cnn.png" style="height:40%;width:40%">

Essentially, there are two kinds of hyperparameters: the filter sizes and the number of filters of each size. In the image shown, there are 3 filter sizes (2, 3, and 4) with 2 filters of each size. Once the convolution is computed, 1-max pooling is applied: for each filter, only the maximum activation is kept. This gives the softmax layer an input of fixed size regardless of sentence length.
Read more at https://arxiv.org/pdf/1510.03820.pdf. 
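
The architecture described above can be sketched in plain Python as convolution with stride 1 followed by a maximum over all positions. The sentence values and the all-ones / all-minus-ones filters are purely illustrative assumptions.

```python
def conv_activations(word_vectors, filt):
    """Activations of one filter slid over the sentence with stride 1
    (no bias and no nonlinearity, to keep the sketch short)."""
    span = len(filt)
    return [
        sum(w * f
            for offset in range(span)
            for w, f in zip(word_vectors[start + offset], filt[offset]))
        for start in range(len(word_vectors) - span + 1)
    ]


def one_max_features(word_vectors, filters):
    """One feature per filter: the maximum activation over all positions."""
    return [max(conv_activations(word_vectors, filt)) for filt in filters]


# Toy sentence of 6 words with 2-dimensional embeddings (made-up values).
sentence = [[0.1, 0.2], [0.9, 0.1], [0.0, 0.8], [0.3, 0.3], [0.0, 0.0], [0.0, 0.0]]
# Two filters each of span 2, 3, and 4, as in the figure.
filters = []
for span in (2, 3, 4):
    filters.append([[1.0, 1.0]] * span)
    filters.append([[-1.0, -1.0]] * span)

features = one_max_features(sentence, filters)
print(len(features))  # 6: one value per filter, independent of sentence length
```

Because every filter contributes exactly one number, a longer sentence changes the number of positions scanned but not the size of the feature vector.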

In [77]:
class CNN:
    def __init__(self, num_words, embedding_size = 30):
        self.num_words = num_words

        # The batch of text documents. Let's assume that it is always padded to length 100. 
        # We could use [None,None], but we'll use [None,100] for simplicity. 
        self.input = tf.placeholder(shape=[None, 100], dtype=tf.int32)
        self.expected_output = tf.placeholder(shape=[None, 1], dtype=tf.float32)
        
        embedding_matrix = tf.Variable(initializer((num_words, embedding_size)), name="embeddings")
        # Add an additional row of zeros to denote padded words.
        zero_padding_row = tf.zeros((1, embedding_size))
        self.embedding_matrix = tf.concat(values=[embedding_matrix, zero_padding_row], axis=0)
        
        # Extract the vectors from the embedding matrix. The dimensions should be None x 100 x embedding_size. 
        # Use embedding lookup
        vectors = tf.nn.embedding_lookup(self.embedding_matrix, self.input) # None x 100 x embedding_size
        
        # In order to use conv2d, we need vectors to be 4 dimensional.
        # The convention is NHWC - None (Batch Size) x Height(Height of image) x Width(Width of image) x Channel(Depth - similar to RGB).
        # For text, let's consider Height = 1, width = number of words, channel = embedding_size.
        # Use expand-dims to modify. 
        vectors2d = tf.expand_dims(vectors, 1) # None x 1 x 100 x embedding_size
        
        # Create 50 filters with span of 3 words. You need 1 bias for each filter.
        # The dimensions of the filter bank = Height, Width, in-channels, out-channels(Number-of-Filters).
        filter_tri = tf.Variable(initializer((1, 3, embedding_size, 50)), name="weight3")
        bias_tri = tf.Variable(tf.zeros((1, 50)), name="bias3")  # we need a bias for each filter
        conv1 = tf.nn.conv2d(
            input=vectors2d,
            filter=filter_tri,
            strides=[1, 1, 98, 1],
            padding="VALID"
        )  # Shape = (?, 1, 1, 50)
        A1 = tf.nn.leaky_relu(conv1+bias_tri)
        A1_2d = tf.squeeze(A1, [1, 2])

        # Create 50 filters with span of 4 words. You need 1 bias for each filter.
        filter_4 = tf.Variable(initializer((1, 4, embedding_size, 50)), name="weight4")  
        bias_4 = tf.Variable(tf.zeros((1, 50)), name="bias4")  # we need a bias for each filter
        conv2 = tf.nn.conv2d(
            input=vectors2d,
            filter=filter_4,
            strides=[1, 1, 97, 1],
            padding="VALID"
        )  # Shape = (?, 1, 1, 50)
        A2 = tf.nn.leaky_relu(conv2+bias_4)
        A2_2d = tf.squeeze(A2, [1, 2])

        # Create 50 filters with span of 5 words. You need 1 bias for each filter.
        filter_5 = tf.Variable(initializer((1, 5, embedding_size, 50)), name="weight5")  
        bias_5 = tf.Variable(tf.zeros((1, 50)), name="bias5")  # we need a bias for each filter
        conv3 = tf.nn.conv2d(
            input=vectors2d,
            filter=filter_5,
            strides=[1, 1, 96, 1],
            padding="VALID"
        )  # Shape = (?, 1, 1, 50)
        A3 = tf.nn.leaky_relu(conv3+bias_5)
        A3_2d = tf.squeeze(A3, [1, 2])
        
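        # Now extract the maximum activations for each of the filters. The shapes are listed alongside.
        # Caution: with the width strides used above (98, 97, 96), each convolution yields a single
        # output position covering only the first 3, 4, or 5 tokens, so A1_2d, A2_2d, and A3_2d are
        # already None x 50 and k_max_pool(..., 50) returns all 50 values unchanged. Pooling over the
        # whole sentence would need a width stride of 1 followed by a maximum along the width axis.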
        max_A1 = k_max_pool(A1_2d, 50)  # None x 50
        max_A2 = k_max_pool(A2_2d, 50)  # None x 50
        max_A3 = k_max_pool(A3_2d, 50)  # None x 50
        
        concat = tf.concat([max_A1, max_A2, max_A3], axis=1) # None x 150
        
        # Initialize the weight and bias needed for softmax classifier. 
        self.softmax_weight = tf.Variable(tf.truncated_normal([150, 2], stddev=0.1))
        self.softmax_bias = tf.Variable(tf.constant(0.1, shape=[2]))
        
        # Write out the equation for computing the logits.
        Wx_plus_b = tf.add(tf.matmul(concat, self.softmax_weight), self.softmax_bias)
        self.output = tf.nn.softmax(Wx_plus_b, axis=1) # Shape = (None, 2)
        
        # Compute the cross-entropy cost. 
        # You might either sum or take mean of all the costs across all the examples. 
        # It is your choice as the test case is on Stochastic Training. 
        one_hot_expected_output = tf.one_hot(tf.cast(self.expected_output, dtype=tf.int32), 2)
        self.cost = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits_v2(
                logits=Wx_plus_b,
                labels=one_hot_expected_output
            )
        )
        
        correct_prediction = tf.equal(tf.reshape(tf.argmax(self.output, 1), [-1, 1]), tf.cast(self.expected_output, dtype=tf.int64))
        self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        
        optimizer = tf.train.AdamOptimizer()
        self.train_op = optimizer.minimize(self.cost)
        self.session = tf.Session()
        self.session.run(tf.global_variables_initializer())

    def pad(self, data, pad_word, pad_length=100):
        for datum in data:
            datum.extend([pad_word]*(pad_length-len(datum)))
        return data
    
    def train(self, train_data, test_data, train_targets, test_targets, batch_size=1, epochs=1, verbose=False):
        sess = self.session
        self.pad(train_data, self.num_words)
        self.pad(test_data, self.num_words)
        print("Starting training...")
        for epoch in range(epochs):
            cost_epoch = 0
            c = 0
            for datum, target in zip([train_data[i:i+batch_size] for i in range(0, len(train_data), batch_size)],
                                   [train_targets[i:i+batch_size] for i in range(0, len(train_targets), batch_size)]):
                _, cost = sess.run([self.train_op, self.cost], feed_dict={self.input: datum, self.expected_output: target})
                cost_epoch += cost
                c += 1
                if c % 100 == 0 and verbose:
                    print("\t{} batches finished. Cost : {}".format(c, cost_epoch/c))
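            # Note: cost_epoch sums one mean cost per batch, so dividing by len(train_data)
            # gives the mean cost per example only when batch_size == 1.
            print("Epoch {}: {}".format(epoch, cost_epoch/len(train_data)))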
            print("\tTrain accuracy: {}".format(self.compute_accuracy(train_data, train_targets)))
            print("\tTest accuracy: {}".format(self.compute_accuracy(test_data, test_targets)))
            self.display_distances()
            self.display_most_similar()
    
    def compute_accuracy(self, data, targets):
        return self.session.run(self.accuracy, feed_dict={self.input: data, self.expected_output: targets})
    
    def get_distance(self, word1, word2):
        """
        word1 = a word in the dataset
        word2 = another word in the dataset
        Return: The cosine distance between word1 and word2.
        """
        assert word1 in vocab and word2 in vocab
        word1_embedding = self.embedding_matrix[vocab[word1]]
        word2_embedding = self.embedding_matrix[vocab[word2]]
        word1_normalized = tf.nn.l2_normalize(word1_embedding, axis=0)
        word2_normalized = tf.nn.l2_normalize(word2_embedding, axis=0)
        cosine_distance = tf.losses.cosine_distance(
            word1_normalized,
            word2_normalized,
            axis=0
        )
        return self.session.run(cosine_distance)
    
    def get_most_similar(self, word):
        """
        word = a word in the dataset
        Return: the 10 words most similar to the given word, ranked by cosine similarity.
        """
        assert word in vocab
        n = 10
        word_embedding = self.embedding_matrix[vocab[word]]
        word_normalized = tf.nn.l2_normalize(word_embedding, axis=0)
        word_normalized = tf.expand_dims(word_normalized, 0)
        matrix_normalized = tf.nn.l2_normalize(self.embedding_matrix, axis=1)
        cosine_similarity = tf.matmul(word_normalized, tf.transpose(matrix_normalized))
        top_10_values, top_10_indices = self.session.run(tf.nn.top_k(cosine_similarity, k=n+1))
        top_10_words = []
        for i in range(1, len(top_10_indices[0])):
            # the most similar word will be itself.
            # Therefore, we skip the first index and only append the next 10.
            top_10_words.append(all_words[top_10_indices[0][i]])
        return top_10_words
    
    def display_distances(self):
        print('\n')
        print("\tcos_dist good vs bad: {}".format(self.get_distance('good', 'bad')))
        print("\tcos_dist terrible vs horrible: {}".format(self.get_distance('terrible', 'horrible')))
        print("\tcos_dist he vs she: {}".format(self.get_distance('he', 'she')))
        print("\tcos_dist movie vs show: {}".format(self.get_distance('movie', 'show')))
    
    def display_most_similar(self):
        print('\n')
        print("\ttop 10 similar words to \"movie\":")
        for w in self.get_most_similar('movie'):
            print('\t\t- {}'.format(w))
        print("\ttop 10 similar words to \"perfect\":")
        for w in self.get_most_similar('perfect'):
            print('\t\t- {}'.format(w))
        print("\ttop 10 similar words to \"horrible\":")
        for w in self.get_most_similar('horrible'):
            print('\t\t- {}'.format(w))
        print("\ttop 10 similar words to \"good\":")
        for w in self.get_most_similar('good'):
            print('\t\t- {}'.format(w))
        print("\ttop 10 similar words to \"bad\":")
        for w in self.get_most_similar('bad'):
            print('\t\t- {}'.format(w))
        


In [78]:
c=CNN(len(vocab))
c.train(train_data, test_data, train_targets, test_targets, epochs=5, verbose=False)

Starting training...
Epoch 0: 0.6497008505200681
	Train accuracy: 0.7903622984886169
	Test accuracy: 0.6572902202606201


	cos_dist good vs bad: 1.46150541305542
	cos_dist terrible vs horrible: 0.1410428285598755
	cos_dist he vs she: 0.3128286600112915
	cos_dist movie vs show: 0.5963776111602783


	top 10 similar words to "movie":
		- long
		- all
		- script
		- plot
		- going
		- breakdowns
		- overproduced
		- like
		- again
		- stunt
	top 10 similar words to "perfect":
		- shooting
		- portrait
		- african
		- somber
		- spooky
		- errs
		- depression
		- ring
		- fearlessly
		- treasure
	top 10 similar words to "horrible":
		- why
		- bad
		- van
		- soul
		- chokes
		- muddled
		- incoherent
		- quick
		- missing
		- video
	top 10 similar words to "good":
		- style
		- controlled
		- observant
		- neo
		- team
		- stuff
		- capture
		- strange
		- clear
		- jason
	top 10 similar words to "bad":
		- why
		- nasty
		- bottom
		- muddled
		- literally
		- however
		- mystery
		- too


The expected output for the above snippet is
<pre>
Starting training...
	100 batches finished. Cost : 0.697723334432
	200 batches finished. Cost : 0.69957424134
	300 batches finished. Cost : 0.697673715353
	400 batches finished. Cost : 0.692196451947
	500 batches finished. Cost : 0.693883402467
    ...
Epoch 0: 0.624233247656
	Train accuracy: 0.828467607498
	Test accuracy: 0.736521303654   
</pre>
If you get any other output and you feel you are correct, you can proceed (However, I cannot think of any case where you can get a different output). 
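
The cos_dist values printed during training are cosine distances, which range from 0 (same direction) to 2 (opposite direction). That is why a value above 1, such as the one reported for good vs bad, is possible. A minimal plain-Python sketch of the quantity, with a hypothetical function name and made-up vectors:

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


same = cosine_distance([1.0, 2.0], [2.0, 4.0])        # approximately 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # 1
opposite = cosine_distance([1.0, 2.0], [-1.0, -2.0])  # approximately 2
print(round(same, 6), round(orthogonal, 6), round(opposite, 6))
```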

### Effect of Batch Size on Training

Study the effects of changing batch size. Just run the various experiments and observe the results (Run it in non-verbose mode). No need to make any comments here.
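
When comparing the Epoch values across batch sizes, keep in mind how the train method computes them: it adds up one mean cost per batch and divides by the number of examples. The plain-Python sketch below mimics that bookkeeping with made-up costs (every example is assumed to cost 0.6) to show that the reported number scales roughly as 1/batch_size even when the true per-example cost is unchanged.

```python
def reported_epoch_cost(per_example_costs, batch_size):
    """Mimic the training loop's bookkeeping: sum the mean cost of each batch,
    then divide by the number of examples."""
    batches = [per_example_costs[i:i + batch_size]
               for i in range(0, len(per_example_costs), batch_size)]
    cost_epoch = sum(sum(b) / len(b) for b in batches)
    return cost_epoch / len(per_example_costs)


costs = [0.6] * 12  # assumed: every example has the same cost
by_batch_size = {bs: reported_epoch_cost(costs, bs) for bs in (1, 2, 3, 4)}
for bs in sorted(by_batch_size):
    print(bs, round(by_batch_size[bs], 3))
```

Dividing the accumulated cost by the number of batches instead would make the values comparable across batch sizes.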

In [79]:
c2=CNN(len(vocab))
c2.train(train_data, test_data, train_targets, test_targets, batch_size=2, epochs=1, verbose=False)

Starting training...
Epoch 0: 0.3244241494630814
	Train accuracy: 0.8047837018966675
	Test accuracy: 0.66713547706604


	cos_dist good vs bad: 1.5918223857879639
	cos_dist terrible vs horrible: 0.2783013582229614
	cos_dist he vs she: 0.2938779592514038
	cos_dist movie vs show: 0.4061277508735657


	top 10 similar words to "movie":
		- come
		- merely
		- soulless
		- nary
		- atonal
		- tom
		- director's
		- we
		- presume
		- aimlessly
	top 10 similar words to "perfect":
		- drama
		- splendor
		- creepiest
		- russo
		- lewis
		- trimmingsarrive
		- zaza's
		- handling
		- carnage
		- ya's
	top 10 similar words to "horrible":
		- lackluster
		- days
		- bland
		- secrets
		- sleep
		- kurupt
		- unfaithful
		- attempts
		- choppy
		- living
	top 10 similar words to "good":
		- ways
		- treatment
		- pumps
		- beautiful
		- strange
		- somber
		- team
		- seriousness
		- fulford
		- tongue
	top 10 similar words to "bad":
		- fails
		- formulaic
		- desperately
		- manipulative
		- in

In [80]:
c3=CNN(len(vocab))
c3.train(train_data, test_data, train_targets, test_targets, batch_size=3, epochs=1, verbose=False)

Starting training...
Epoch 0: 0.21642378283240254
	Train accuracy: 0.8132254481315613
	Test accuracy: 0.6652601957321167


	cos_dist good vs bad: 1.561958909034729
	cos_dist terrible vs horrible: 0.38920509815216064
	cos_dist he vs she: 0.1379326581954956
	cos_dist movie vs show: 0.34371328353881836


	top 10 similar words to "movie":
		- unremittingly
		- presume
		- chooses
		- franchise
		- exaggerated
		- bloated
		- crimen
		- rule
		- unintentional
		- shallower
	top 10 similar words to "perfect":
		- finesse
		- tosca
		- roller
		- ya's
		- onto
		- worthy
		- refreshing
		- shooting
		- handling
		- slam
	top 10 similar words to "horrible":
		- essentially
		- stealing
		- unsatisfying
		- off
		- pity
		- choppy
		- literally
		- simplistic
		- jumble
		- title's
	top 10 similar words to "good":
		- always
		- jason
		- tug
		- unusual
		- thriller
		- agenda
		- capture
		- spooky
		- authentic
		- breaking
	top 10 similar words to "bad":
		- soul
		- fails
		- manipulative


In [81]:
c4=CNN(len(vocab))
c4.train(train_data, test_data, train_targets, test_targets, batch_size=4, epochs=1, verbose=False)

Starting training...
Epoch 0: 0.16231638401561554
	Train accuracy: 0.8201430439949036
	Test accuracy: 0.6624472737312317


	cos_dist good vs bad: 1.5009946823120117
	cos_dist terrible vs horrible: 0.2552105188369751
	cos_dist he vs she: 0.29949021339416504
	cos_dist movie vs show: 0.4476293921470642


	top 10 similar words to "movie":
		- niro
		- long
		- across
		- soulless
		- generic
		- can't
		- ill
		- madonna
		- only
		- showcase
	top 10 similar words to "perfect":
		- closes
		- graceful
		- brutal
		- crafted
		- sweaty
		- portrait
		- bernard
		- volletta
		- al
		- payne's
	top 10 similar words to "horrible":
		- certain
		- soul
		- literally
		- everybody
		- saccharine
		- green
		- rather
		- smug
		- only
		- were
	top 10 similar words to "good":
		- thriller
		- neo
		- kids
		- spooky
		- gleefully
		- king
		- polanski
		- threatens
		- mann
		- monty
	top 10 similar words to "bad":
		- desperately
		- literally
		- manipulative
		- fails
		- incoherent
		- idea
	

In [82]:
c5=CNN(len(vocab))
c5.train(train_data, test_data, train_targets, test_targets, batch_size=5, epochs=1, verbose=False)

Starting training...
Epoch 0: 0.12999668564336886
	Train accuracy: 0.8192050457000732
	Test accuracy: 0.6685419678688049


	cos_dist good vs bad: 1.5494121313095093
	cos_dist terrible vs horrible: 0.32952266931533813
	cos_dist he vs she: 0.3623473644256592
	cos_dist movie vs show: 0.41992610692977905


	top 10 similar words to "movie":
		- sara
		- didn't
		- problem
		- orchard
		- long
		- country
		- amount
		- smug
		- afterschool
		- atonal
	top 10 similar words to "perfect":
		- ways
		- motion
		- rewarding
		- agenda
		- deceptively
		- refreshing
		- thornberrys
		- examination
		- masterpiece
		- behan's
	top 10 similar words to "horrible":
		- video
		- why
		- kung
		- feels
		- stealing
		- choppy
		- hampered
		- elmo
		- predictably
		- initially
	top 10 similar words to "good":
		- ways
		- life
		- authentic
		- somber
		- manga
		- viewing
		- observant
		- thriller
		- emotional
		- haven't
	top 10 similar words to "bad":
		- were
		- desperately
		- flair
		- too
		

In [83]:
c6=CNN(len(vocab))
c6.train(train_data, test_data, train_targets, test_targets, batch_size=6, epochs=1, verbose=False)

Starting training...
Epoch 0: 0.10834514336508287
	Train accuracy: 0.8219017386436462
	Test accuracy: 0.6661978363990784


	cos_dist good vs bad: 1.5455517768859863
	cos_dist terrible vs horrible: 0.30691367387771606
	cos_dist he vs she: 0.5375597476959229
	cos_dist movie vs show: 0.5246798992156982


	top 10 similar words to "movie":
		- speaking
		- generic
		- problem
		- listless
		- long
		- pint
		- lampoon's
		- script
		- anim
		- selection
	top 10 similar words to "perfect":
		- calibrated
		- inter
		- spooky
		- crafted
		- brutal
		- balances
		- witness
		- motion
		- execution
		- delivered
	top 10 similar words to "horrible":
		- essentially
		- days
		- couldn't
		- were
		- idea
		- saccharine
		- advantage
		- bland
		- half
		- concept
	top 10 similar words to "good":
		- loved
		- spooky
		- ways
		- somber
		- becoming
		- life
		- kids
		- neo
		- authentic
		- detailing
	top 10 similar words to "bad":
		- formulaic
		- soul
		- dull
		- silly
		- too
		- feels
		

In [84]:
c8=CNN(len(vocab))
c8.train(train_data, test_data, train_targets, test_targets, batch_size=8, epochs=1, verbose=False)

Starting training...
Epoch 0: 0.081408474178104
	Train accuracy: 0.8213155269622803
	Test accuracy: 0.6708860993385315


	cos_dist good vs bad: 1.5765392780303955
	cos_dist terrible vs horrible: 0.2778332233428955
	cos_dist he vs she: 0.24149519205093384
	cos_dist movie vs show: 0.5590507388114929


	top 10 similar words to "movie":
		- going
		- can't
		- empty
		- intentions
		- men
		- tired
		- inelegant
		- days
		- sum
		- franchise
	top 10 similar words to "perfect":
		- photography
		- invitingly
		- oleander
		- ride
		- float
		- craft
		- shooting
		- grip
		- roller
		- confrontational
	top 10 similar words to "horrible":
		- offensive
		- needed
		- tv
		- stealing
		- essentially
		- friday
		- sewer
		- [javier
		- why
		- oh
	top 10 similar words to "good":
		- ways
		- brother
		- haven't
		- fabric
		- seus
		- laugh
		- nelson
		- part
		- capturou
		- martha
	top 10 similar words to "bad":
		- which
		- trying
		- couldn't
		- however
		- dull
		- why
		- were
		- p

In [85]:
c10=CNN(len(vocab))
c10.train(train_data, test_data, train_targets, test_targets, batch_size=10, epochs=1, verbose=False)

Starting training...
Epoch 0: 0.06529437454703657
	Train accuracy: 0.8220189809799194
	Test accuracy: 0.6690107583999634


	cos_dist good vs bad: 1.6935162544250488
	cos_dist terrible vs horrible: 0.26092392206192017
	cos_dist he vs she: 0.2653992176055908
	cos_dist movie vs show: 0.35052400827407837


	top 10 similar words to "movie":
		- listless
		- mistaken
		- long
		- santa
		- dooby
		- turgid
		- mugs
		- dopey
		- monsters
		- 98
	top 10 similar words to "perfect":
		- smartly
		- delight
		- entree
		- j
		- objective
		- ya's
		- ride
		- formalism
		- playful
		- adds
	top 10 similar words to "horrible":
		- hampered
		- days
		- manipulative
		- flashy
		- would've
		- idea
		- did
		- fails
		- cold
		- rather
	top 10 similar words to "good":
		- loved
		- always
		- treat
		- audacious
		- style
		- opening
		- inventively
		- maintains
		- anime
		- kids
	top 10 similar words to "bad":
		- too
		- were
		- showtime
		- cold
		- idea
		- unpleasant
		- essentially
		- wh

In [86]:
c15=CNN(len(vocab))
c15.train(train_data, test_data, train_targets, test_targets, batch_size=15, epochs=1, verbose=False)

Starting training...
Epoch 0: 0.04370334985555581
	Train accuracy: 0.8222534656524658
	Test accuracy: 0.6629160642623901


	cos_dist good vs bad: 1.6712089776992798
	cos_dist terrible vs horrible: 0.3265742063522339
	cos_dist he vs she: 0.26655322313308716
	cos_dist movie vs show: 0.5167897939682007


	top 10 similar words to "movie":
		- across
		- morally
		- hasn't
		- amount
		- script
		- tuxedo
		- uninteresting
		- makers
		- long
		- felt
	top 10 similar words to "perfect":
		- shimizu
		- powerful
		- our
		- portrait
		- delivers
		- retelling
		- seriousness
		- performances
		- ride
		- format
	top 10 similar words to "horrible":
		- stealing
		- flashy
		- offensive
		- rather
		- kung
		- congratulatory
		- would've
		- feels
		- fails
		- unfunny
	top 10 similar words to "good":
		- heaven
		- kids
		- treat
		- ways
		- strange
		- jason
		- ambitious
		- man
		- laugh
		- auspicious
	top 10 similar words to "bad":
		- why
		- dull
		- too
		- completely
		- silly
		- d

In [87]:
c30=CNN(len(vocab))
c30.train(train_data, test_data, train_targets, test_targets, batch_size=30, epochs=1, verbose=False)

Starting training...
Epoch 0: 0.022094361855313122
	Train accuracy: 0.8174463510513306
	Test accuracy: 0.6596342921257019


	cos_dist good vs bad: 1.7223000526428223
	cos_dist terrible vs horrible: 0.23593521118164062
	cos_dist he vs she: 0.39524173736572266
	cos_dist movie vs show: 0.5202136039733887


	top 10 similar words to "movie":
		- long
		- script
		- ill
		- arnold
		- sara
		- only
		- problem
		- can't
		- sequel
		- advantage
	top 10 similar words to "perfect":
		- retelling
		- ride
		- signals
		- masterpiece
		- flaws
		- offerings
		- franco
		- powerful
		- shooting
		- charmingly
	top 10 similar words to "horrible":
		- [t]he
		- rather
		- friday
		- mckay
		- affleck
		- mindless
		- downbeat
		- only
		- schneider's
		- feels
	top 10 similar words to "good":
		- features
		- huppert
		- fairy
		- illuminating
		- digital
		- worth
		- blisteringly
		- douglas
		- tsai
		- thanks
	top 10 similar words to "bad":
		- why
		- too
		- tv
		- someone
		- however
		- ess

In [88]:
c50=CNN(len(vocab))
c50.train(train_data, test_data, train_targets, test_targets, batch_size=50, epochs=1, verbose=False)

Starting training...
Epoch 0: 0.013400489202989812
	Train accuracy: 0.795052170753479
	Test accuracy: 0.6502578258514404


	cos_dist good vs bad: 1.6879076957702637
	cos_dist terrible vs horrible: 0.29935145378112793
	cos_dist he vs she: 0.7008346915245056
	cos_dist movie vs show: 0.5126837491989136


	top 10 similar words to "movie":
		- limp
		- sum
		- generic
		- listless
		- problem
		- showcase
		- script
		- ill
		- can't
		- amount
	top 10 similar words to "perfect":
		- conviction
		- sunday
		- accused
		- fashioned
		- introduction
		- mixes
		- usual
		- respectable
		- zaza's
		- laugh
	top 10 similar words to "horrible":
		- problem
		- michell
		- imagine
		- stars
		- ill
		- slow
		- painfully
		- oedekerk
		- tv
		- only
	top 10 similar words to "good":
		- morton
		- blisteringly
		- heaven
		- warm
		- fun
		- lovely
		- suspenseful
		- overcomes
		- benefits
		- kids
	top 10 similar words to "bad":
		- too
		- awful
		- little
		- why
		- idea
		- dull
		- essentia

### Embeddings

Add two methods, get_distance and get_most_similar, to the CNN class (the big one):
* get_distance(word1, word2): should return the cosine distance between the embeddings of the two words.
* get_most_similar(word): should return the 10 words most similar to the word passed.
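
The two methods described above could be sketched with NumPy as follows. This is only an illustration under assumptions, not the implementation used in this notebook: it takes cosine distance to mean 1 - cosine similarity (consistent with printed values above 1, such as roughly 1.55 for good vs bad), and the standalone function names, the `embeddings` array (one row per word) and the `words` list are hypothetical stand-ins for whatever the CNN class stores.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v): 0 for the same direction, 2 for opposite."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar(embeddings, words, word, k=10):
    """Return the k words whose embedding rows are closest to `word`."""
    idx = words.index(word)
    # Normalise every row so a dot product equals the cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = 1.0 - normed.dot(normed[idx])
    dists[idx] = np.inf  # never return the query word itself
    nearest = np.argsort(dists)[:k]
    return [words[i] for i in nearest]

# Toy 2-d embeddings (hypothetical values, for illustration only).
toy_words = ["good", "great", "bad", "the"]
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]])
print(cosine_distance(toy_embeddings[0], toy_embeddings[2]))
print(most_similar(toy_embeddings, toy_words, "good", k=2))
```

Inside the CNN class, the embedding matrix would first have to be fetched from the TensorFlow graph as a NumPy array; how that is done depends on how the class defines the embedding variable.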

Now use the two functions to record the distances between a list of word pairs as training progresses (one easy way is to save the embedding matrix to disk every 5 updates):
* Study the distance between words of opposite sentiment as training progresses, e.g. good and bad, good and horrible.
* Study the distance between words of the same sentiment, e.g. good and beautiful, bad and terrible.
* Study how non-sentiment-bearing words relate to each other, e.g. his, her, an, it.
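
One possible way to record these distances is sketched below. It is a minimal sketch under assumptions: the `DistanceTracker` class and the `word_to_index` mapping are hypothetical helpers, and it keeps the history in memory rather than saving the embedding matrix to disk. Random perturbations stand in for real gradient updates, so the numbers it produces mean nothing.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

class DistanceTracker(object):
    """Records cosine distances for fixed word pairs every `every` updates."""

    def __init__(self, word_to_index, pairs, every=5):
        self.word_to_index = word_to_index
        self.pairs = [tuple(p) for p in pairs]
        self.every = every
        # One list of (step, distance) tuples per word pair.
        self.history = {pair: [] for pair in self.pairs}

    def record(self, step, embeddings):
        """Call after each update; only every `every`-th step is stored."""
        if step % self.every != 0:
            return
        for w1, w2 in self.pairs:
            d = cosine_distance(embeddings[self.word_to_index[w1]],
                                embeddings[self.word_to_index[w2]])
            self.history[(w1, w2)].append((step, float(d)))

# Toy run: random noise replaces the optimizer step (assumption for the demo).
rng = np.random.RandomState(0)
word_to_index = {"good": 0, "bad": 1, "his": 2, "her": 3}
embeddings = rng.normal(size=(4, 8))
tracker = DistanceTracker(word_to_index, [("good", "bad"), ("his", "her")])
for step in range(11):
    embeddings = embeddings + 0.01 * rng.normal(size=embeddings.shape)
    tracker.record(step, embeddings)

for pair, values in sorted(tracker.history.items()):
    print(pair, values)
```

In the real training loop, `record` would be called with the current embedding matrix after each batch update; writing each snapshot with `np.save` instead would allow the analysis to be redone later without retraining.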

### This section shows results after training with batch_size = 1
Note: To see distance and similarity outputs during training, refer to the training output when verbose=True

#### Distance between words of opposite sentiment

In [89]:
print(c.get_distance('good', 'bad'))

1.1772233


In [90]:
print(c.get_distance('beautiful', 'horrible'))

1.5896721


In [91]:
print(c.get_distance('like', 'dislike'))

1.3137771


In [92]:
print(c.get_distance('fun', 'boring'))

1.6863925


In [93]:
print(c.get_distance('perfect', 'disgusting'))

0.8917293


#### Distance between words of similar sentiment

In [94]:
print(c.get_distance('good', 'beautiful'))

1.1008879


In [95]:
print(c.get_distance('perfect', 'fantastic'))

0.42478412


In [96]:
print(c.get_distance('beautiful', 'wonderful'))

0.4673578


In [97]:
print(c.get_distance('bad', 'terrible'))

0.23816931


In [98]:
print(c.get_distance('bad', 'boring'))

0.28796977


In [99]:
print(c.get_distance('terrible', 'horrible'))

0.3643694


#### Relation between non-sentiment bearing words

In [100]:
print(c.get_distance('him', 'her'))

0.90474015


In [101]:
print(c.get_distance('he', 'she'))

0.38951278


In [102]:
print(c.get_distance('a', 'an'))

0.63445246


In [103]:
print(c.get_distance('he', 'it'))

1.247942


In [104]:
print(c.get_distance('she', 'it'))

1.2219324


In [105]:
print(c.get_distance('red', 'blue'))

1.5465028


In [106]:
print(c.get_distance('we', 'they'))

0.94863224


In [107]:
print(c.get_distance('the', 'that'))

0.7178945


In [108]:
print(c.get_distance('movie', 'show'))

1.3006283


#### Similarities between words

In [109]:
print(c.get_most_similar('good'))

['goyer', "weil's", 'tends', 'crafty', 'ilk', 'products', 'redolent', 'moment', 'mechanisms', 'sensitivities']


In [110]:
print(c.get_most_similar('movie'))

["aren't", 'fumbled', 'vertiginous', "other's", 'trama', 'rights', 'screwups', 'slickest', 'octane', 'thinness']


In [111]:
print(c.get_most_similar('beautiful'))

['classic', 'frustration', 'confirmation', 'bought', 'sumptuous', 'goers', 'sick', 'images', 'maniac', 'ww']


In [112]:
print(c.get_most_similar('bad'))

['essentially', 'humorless', 'formulaic', 'trying', 'why', 'muddled', 'van', 'plodding', 'shadyac', 'awful']


In [113]:
print(c.get_most_similar('horrible'))

['sunk', 'kung', 'too', 'die', 'did', 'muddled', 'manipulative', 'nasty', 'dull', 'why']


In [114]:
print(c.get_most_similar('the'))

["petersburg's", 'pap', 'drung', 'gosto', 'girlish', 'theatrically', "halloween's", 'carry', 'irwin', 'minus']


In [115]:
print(c.get_most_similar('he'))

['inept', 'poorly', 'stupid', 'idea', '50', 'acceptable', 'mawkish', 'wilder', 'lame', 'shadyac']


In [116]:
print(c.get_most_similar('red'))

['e', 'brings', 'polished', 'candid', 'chicago', 'wickedly', 'overcomes', 'huppert', 'bourne', 'jackson']


In [117]:
print(c.get_most_similar('should'))

['cold', 'damned', 'expected', 'anything', 'cliches', 'dull', 'starts', 'except', 'unintentionally', 'necessarily']


### Learnings:

List out the observations and conclusions you made from the various experiments. 

#### 1 epoch and Batch Size > 1
- The CNN learns non-sentiment words better with a bigger batch size. For example, judging by the top 10 similar words, it seems to do well on 'movie' with batch sizes of 30 and 50.
- The CNN learns sentiment words (e.g. good, bad, perfect, horrible) better with a smaller batch size.
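
To compare the single-epoch runs side by side, the figures printed above can be collected in a small table. The snippet below is just a convenience sketch: the values are copied from the training output for batch sizes 5 to 50 and rounded to four decimals, and the `runs` dictionary and variable names are made up for this illustration.

```python
from __future__ import print_function

# batch_size -> (test accuracy, cos_dist good vs bad, cos_dist terrible vs horrible)
# Copied from the single-epoch training output above, rounded to 4 decimals.
runs = {
    5:  (0.6685, 1.5494, 0.3295),
    6:  (0.6662, 1.5456, 0.3069),
    8:  (0.6709, 1.5765, 0.2778),
    10: (0.6690, 1.6935, 0.2609),
    15: (0.6629, 1.6712, 0.3266),
    30: (0.6596, 1.7223, 0.2359),
    50: (0.6503, 1.6879, 0.2994),
}

print("batch  test_acc  good/bad  terrible/horrible")
for batch_size in sorted(runs):
    acc, opposite, similar = runs[batch_size]
    print("{:5d}  {:.4f}    {:.4f}    {:.4f}".format(batch_size, acc, opposite, similar))

# Batch sizes that maximise or minimise each quantity.
best_test_acc = max(runs, key=lambda b: runs[b][0])
widest_opposites = max(runs, key=lambda b: runs[b][1])
closest_synonyms = min(runs, key=lambda b: runs[b][2])
print(best_test_acc, widest_opposites, closest_synonyms)
```

Each configuration was run only once, so small differences between neighbouring batch sizes may be noise rather than a real trend.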

#### Epoch > 1 and Batch Size = 1
- The cosine distance for "terrible" vs "horrible" gets worse when training over multiple epochs (words of the same sentiment should stay close).
- The cosine distance for "good" vs "bad" gets better when training over multiple epochs (words of opposite sentiment should move apart).
- The top 10 similar words to "movie" get worse as training progresses over multiple epochs.