### Natural Language Processing:
The input text must first be translated into matrices of numbers, which can then be fed into a neural network to produce a numerical prediction.

The way we format the input text as a matrix makes it easier or harder for the neural network to learn the correlation between input and output.

### Sentiment Analysis:
<em>One-hot encoding</em> uses the binary values 0 and 1 to encode the presence/absence of something, or true/false. For sentiment analysis, an appropriate way to translate input text into vectors is to have one column for each word in the vocabulary and to mark a column with a 1 if the current review contains that word.
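
As a minimal sketch of this encoding, the snippet below builds a vocabulary from two toy reviews and encodes each review as a binary vector. The names `toy_reviews`, `toy_vocab`, `toy_word2index` and `one_hot_encode` are made up for this illustration and are not part of the IMDB pipeline.

```python
import numpy as np

# Toy reviews (hypothetical data, not taken from the IMDB dataset)
toy_reviews = ["the movie was great", "the movie was terrible"]

# Sort the vocabulary so that the column order is deterministic
toy_vocab = sorted({word for review in toy_reviews for word in review.split(" ")})
toy_word2index = {word: i for i, word in enumerate(toy_vocab)}

def one_hot_encode(review):
    """Return a binary vector with a 1 in the column of every word present."""
    vector = np.zeros(len(toy_vocab), dtype=int)
    for word in set(review.split(" ")):
        vector[toy_word2index[word]] = 1
    return vector

print(toy_vocab)                       # ['great', 'movie', 'terrible', 'the', 'was']
print(one_hot_encode(toy_reviews[0]))  # [1 1 0 1 1]
print(one_hot_encode(toy_reviews[1]))  # [0 1 1 1 1]
```

The two vectors differ only in the 'great' and 'terrible' columns, which is exactly the signal a sentiment classifier needs to pick up.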

### Word Embedding:



### IMDB Movie Review Sentiment Analysis:
The following snippet transforms the raw review data into vectors that can be passed as input into a neural network. There are two labels: positive sentiment and negative sentiment.

In [2]:
import sys

with open('./Datasets/imdb_reviews.txt') as f:
    raw_reviews = f.readlines()

with open('./Datasets/imdb_review_labels.txt') as f:
    raw_labels = f.readlines()

tokens = list(map(lambda x:set(x.split(" ")),raw_reviews))

vocab = set()
for sent in tokens:
    for word in sent:
        if(len(word)>0):
            vocab.add(word)
vocab = list(vocab)

word2index = {}
for i,word in enumerate(vocab):
    word2index[word]=i

input_dataset = list()
for sent in tokens:
    sent_indices = list()
    for word in sent:
        if word in word2index:    # Skip empty tokens, which were left out of the vocabulary
            sent_indices.append(word2index[word])
    input_dataset.append(list(set(sent_indices)))

target_dataset = list()
for label in raw_labels:
    if label == 'positive\n':
        target_dataset.append(1)
    else:
        target_dataset.append(0)

print("Prepared {} samples.".format(len(input_dataset)))


Prepared 25000 samples.


In [61]:
print("An example datapoint:")
print("===== Raw =====")
for i in raw_reviews[0].split(" "):
    print(i, end=" ")
print("===> Label: {}".format(raw_labels[0]))
print("===== Processed =====")
for i in input_dataset[0]:
    print(i, end=" ")
print("\n===> Label: {}".format(target_dataset[0]))

An example datapoint:
===== Raw =====
bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
 ===> Label: positive

===== Processed =====
20496 34834 71701 20502 49181 29216 65066 37931 57390 4656 20016 12852 39477 33358 7

The 3-layer network below takes in a 'sentence' represented as a variable-length array of integers, each of which is the index of a vocabulary word that appears in the review.

In [3]:
import numpy as np
np.random.seed(1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

alpha, iterations = (0.01, 3)
hidden_size = 100

weights_0_1 = 0.2 * np.random.random((len(vocab), hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, 1)) - 0.1

correct, total = (0, 0)
for iteration in range(iterations):
    for i in range(len(input_dataset) - 1000):    # Hold out the last 1000 datapoints for testing
        x, y = (input_dataset[i], target_dataset[i])
        # Embed and apply sigmoid:
        layer_1 = sigmoid(np.sum(weights_0_1[x], axis=0))
        layer_2 = sigmoid(np.dot(layer_1, weights_1_2))
        layer_2_delta = layer_2 - y
        layer_1_delta = layer_2_delta.dot(weights_1_2.T)
        weights_0_1[x] -= layer_1_delta * alpha
        weights_1_2 -= np.outer(layer_1,layer_2_delta) * alpha
        if (np.abs(layer_2_delta) < 0.5):
            correct += 1
        total += 1
        if (i % 10 == 9):
            progress = str(i / float(len(input_dataset)))
            sys.stdout.write('\rIter:' + str(iteration) \
                             + ' Progress:' + progress[2:4] \
                             + '.' + progress[4:6] \
                             + '% Training Accuracy:' \
                             + str(correct / float(total)))
    print()
correct, total = (0, 0)
for i in range(len(input_dataset) - 1000, len(input_dataset)):
    x = input_dataset[i]
    y = target_dataset[i]
    layer_1 = sigmoid(np.sum(weights_0_1[x], axis=0))
    layer_2 = sigmoid(np.dot(layer_1, weights_1_2))
    
    if (np.abs(layer_2 - y) < 0.5):
        correct += 1
    total += 1
print("Test accuracy: {}".format(correct / float(total)))
        



Iter:0 Progress:95.99% Training Accuracy:0.83170833333333346
Iter:1 Progress:95.99% Training Accuracy:0.8664583333333333
Iter:2 Progress:95.99% Training Accuracy:0.8847361111111111
Test accuracy: 0.854
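
The embedding step `np.sum(weights_0_1[x], axis=0)` used above gives the same result as multiplying a full one-hot vector by the weight matrix, while only touching the rows of the words that are present. The sketch below checks this on toy sizes. The variable names and dimensions are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 6, 4          # toy sizes, chosen for illustration
toy_weights = rng.random((vocab_size, hidden_size))

word_indices = [1, 4, 5]                # a 'review' containing words 1, 4 and 5

# Dense route: build the full one-hot vector, then multiply by the weights
one_hot = np.zeros(vocab_size)
one_hot[word_indices] = 1
dense_result = one_hot.dot(toy_weights)

# Sparse route: select only the rows of the words present and sum them
sparse_result = np.sum(toy_weights[word_indices], axis=0)

print(np.allclose(dense_result, sparse_result))  # True
```

With a vocabulary of tens of thousands of words and reviews of a few hundred words, the sparse route avoids almost all of the multiplications by zero.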


#### Note about neural architecture:
The most basic ability of a neural network is to learn <em>direct correlation</em>. The minimal architecture for identifying direct correlation is a 2-layer network. 

Hidden layers search for 'useful' groupings of inputs from the previous layer. Each hidden layer neuron takes in input from the previous layer and asks 'is this datapoint in my group?' These groupings must be useful to the prediction of an output label. 

For example, a grouping that separates 'terrible' from 'not terrible' is a powerful one for predicting the output labels 'negative' or 'positive'.

The above neural network takes in a set of vocabulary words and ignores the order in which they appear. E.g. the sentences 'the movie was terrible, not great' and 'the movie was great, not terrible' cover the same set of vocabulary, but have opposite labels. If you can construct two inputs with differing patterns that produce the same layer_1 (hidden layer) output, then the network is incapable of recognising that difference.
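
A quick way to see this limitation is to tokenise both sentences the same way the preprocessing cell does (splitting on spaces and converting to a set). The sketch below uses toy sentences with the punctuation dropped, which is an assumption made for simplicity.

```python
# Two toy sentences with the same words in a different order
sentence_a = "the movie was terrible not great"
sentence_b = "the movie was great not terrible"

# Same tokenisation as the preprocessing cell: split on spaces, keep unique words
bag_a = set(sentence_a.split(" "))
bag_b = set(sentence_b.split(" "))

print(sentence_a == sentence_b)  # False: the sentences differ
print(bag_a == bag_b)            # True: the network receives identical input
```

Since both sentences map to the same set of indices, they produce the same layer_1 and therefore the same prediction.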

Words with similar predictive power should have similar weights leading to a similar set of hidden neurons. 

#### Word Similarity:
The snippet below computes the similarity of words using <em>Euclidean distance</em>. The vector of weights of one word (input neuron) leading to each hidden neuron of the next layer should be 'close' in terms of Euclidean distance to the vector of weights for another word if those two words have similar predictive power (since they would be subscribing with similar strength to the same nodes of the hidden layer).

In [9]:
from collections import Counter
from pprint import pprint
import math

def similar(target):
    target_index = word2index[target]
    scores = Counter()
    for word,index in word2index.items():
        raw_difference = weights_0_1[index] - (weights_0_1[target_index])
        squared_difference = raw_difference * raw_difference
        scores[word] = -math.sqrt(sum(squared_difference))
    return scores.most_common(10)
print("Words similar to 'Beautiful'")
pprint(similar("beautiful"), width=100)
print("Words similar to 'Terrible'")
pprint(similar("terrible"), width=100)

Words similar to 'Beautiful'
[('beautiful', -0.0),
 ('thank', -0.6965802923245741),
 ('spirit', -0.7055560560453192),
 ('spot', -0.7373386929316352),
 ('success', -0.7412047577955438),
 ('heart', -0.7418008650600879),
 ('charlie', -0.7449509776260076),
 ('eerie', -0.7453403924544343),
 ('extraordinary', -0.7495099205406576),
 ('great', -0.7527903303113145)]
Words similar to 'Terrible'
[('terrible', -0.0),
 ('worse', -0.7230350835769849),
 ('badly', -0.7502508378281694),
 ('fails', -0.7623534149853021),
 ('horrible', -0.7995433462398092),
 ('laughable', -0.8179713041032471),
 ('dull', -0.8367748288666471),
 ('redeeming', -0.8381812642109021),
 ('basically', -0.8393347151649787),
 ('lame', -0.8451263953208726)]
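
The loop in `similar` can also be written without iterating over every word in Python. The following is a sketch of a vectorised variant using `np.linalg.norm`. The helper `similar_vectorised` and the toy embeddings are hypothetical and are shown only to illustrate the computation.

```python
import numpy as np

def similar_vectorised(target, embeddings, word2index, top_n=10):
    """Rank words by negative Euclidean distance to `target` (hypothetical helper)."""
    index2word = {i: w for w, i in word2index.items()}
    # Subtract the target row from every row, then take the norm of each row
    differences = embeddings - embeddings[word2index[target]]
    distances = np.linalg.norm(differences, axis=1)
    nearest = np.argsort(distances, kind="stable")[:top_n]
    return [(index2word[int(i)], -float(distances[i])) for i in nearest]

# Toy 2-dimensional embeddings (made-up values, not trained weights)
toy_word2index = {"good": 0, "great": 1, "bad": 2}
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])

print(similar_vectorised("good", toy_embeddings, toy_word2index, top_n=3))
```

In the toy data 'great' lies much closer to 'good' than 'bad' does, so it is ranked second, directly after the target word itself.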


### Fill In The Blank:
Instead of taking in a review and predicting 'positive' or 'negative', this problem involves taking a sentence with a missing word and selecting the most appropriate word to fill the sentence. This problem requires the context of the sentence to be taken into account, which wasn't reflected well in the previous example.
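
To make the task concrete, the sketch below turns one toy sentence into fill-in-the-blank examples by hiding each word in turn. The helper `make_blanks` is hypothetical and only illustrates the shape of the problem. It is not the training code.

```python
def make_blanks(sentence):
    """Return (sentence with one word hidden, hidden word) pairs for every position."""
    words = sentence.split(" ")
    examples = []
    for i, target in enumerate(words):
        blanked = words[:i] + ["____"] + words[i + 1:]
        examples.append((" ".join(blanked), target))
    return examples

for blanked, target in make_blanks("the movie was great"):
    print(blanked, "->", target)
# ____ movie was great -> the
# the ____ was great -> movie
# the movie ____ great -> was
# the movie was ____ -> great
```

Each pair is one training example: the visible words are the input and the hidden word is the label to predict.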

In [18]:
import sys, random, math
from collections import Counter
import numpy as np

np.random.seed(1)
random.seed(1)
f = open("./Datasets/imdb_reviews.txt")
raw_reviews = f.readlines()
f.close()

# Get a collection of words
tokens = list(map(lambda x: (x.split(" ")), raw_reviews))
wordcnt = Counter()
for sentence in tokens:
    for word in sentence:
        wordcnt[word] += 1
        
# Form the vocabulary and word to index mapping dictionary
vocab = list(set(map(lambda x: x[0], wordcnt.most_common())))
word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i

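# Convert each review into a list of vocabulary indices; 'concatenated' collects every index in the corpus in one flat array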
concatenated = []
input_dataset = []
for sentence in tokens:
    sent_indices = []
    for word in sentence:
        # Every token (including '') is in the vocabulary, so the lookup cannot fail
        sent_indices.append(word2index[word])
        concatenated.append(word2index[word])
    input_dataset.append(sent_indices)
concatenated = np.array(concatenated)
random.shuffle(input_dataset)

print("'Vocab' size: {}".format(len(vocab)))
print("Sample vocab: {}".format(vocab[:5]))
print("'Concatenated' size: {}".format(len(concatenated)))
print("Sample concatenated: {}".format(concatenated[:5]))
print("input_dataset size: {}".format(len(input_dataset)))
print("Sample input_dataset: {} ... and more vocab indices".format(input_dataset[0][:10]))


'Vocab' size: 74075
Sample vocab: ['rebeecca', '', 'weeps', 'helmer', 'dishonor']
'Concatenated' size: 7459318
Sample concatenated: [71357 49779  6590 62870 65829]
input_dataset size: 25000
Sample input_dataset: [1, 1, 1099, 45369, 1, 6590, 62870, 24964, 71996, 5565] ... and more vocab indices


In [None]:
alpha, iterations = (0.05, 2)
hidden_size, window, negative = (50, 2, 5)

weights_0_1 = (np.random.rand(len(vocab), hidden_size) - 0.5) * 0.2
weights_1_2 = np.random.rand(len(vocab), hidden_size) * 0

layer_2_target = np.zeros(negative + 1)
layer_2_target[0] = 1

def similar(target):
    target_index = word2index[target]
    scores = Counter()
    for word, index in word2index.items():
        raw_difference = weights_0_1[index] - (weights_0_1[target_index])
        squared_difference = raw_difference * raw_difference
        scores[word] = -math.sqrt(sum(squared_difference))
    return scores.most_common(10)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# 3-layer network:
for rev_i, review in enumerate(input_dataset * iterations):
    for target_i in range(len(review)):
        # Predict from a small random subset of words, because predicting over the entire vocabulary is very expensive.
        # This seems crude, but it works well enough for the amount of speed we gain.
        # This technique is known as negative sampling.
        target_samples = [review[target_i]] + list(concatenated[(np.random.rand(negative)*len(concatenated)).astype('int').tolist()])
        
        left_context = review[max(0, target_i - window) : target_i]
        right_context = review[target_i + 1 : min(len(review), target_i + window)]
        
        layer_1 = np.mean(weights_0_1[left_context + right_context], axis=0)
        layer_2 = sigmoid(layer_1.dot(weights_1_2[target_samples].T))
        layer_2_delta = layer_2 - layer_2_target
        layer_1_delta = layer_2_delta.dot(weights_1_2[target_samples])
        
        weights_0_1[left_context + right_context] -= layer_1_delta * alpha
        weights_1_2[target_samples] -= np.outer(layer_2_delta, layer_1) * alpha
        
    if rev_i % 250 == 0:
        sys.stdout.write('\rProgress: '+str(rev_i/float(len(input_dataset)*iterations)) + "   " + str(similar('terrible')))
    sys.stdout.write('\rProgress: ' + str(rev_i/float(len(input_dataset)*iterations)))
print(similar('terrible'))



Progress:0.14998  [('terrible', -0.0), ('horrible', -2.291300288404048), ('thin', -2.4169675788743836), ('fantastic', -2.5089735852761117), ('brilliant', -2.5436852956924065), ('lame', -2.5514146967813307), ('dumb', -2.6072069177285813), ('fascinating', -2.613801067798473), ('weak', -2.664379906619758), ('terrific', -2.6724644833831124)]

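This network clusters words that tend to appear in similar contexts, i.e. words that could fill the same blank. This is why 'terrible' ends up close to 'horrible' but also to 'fantastic' and 'brilliant': both kinds of adjective occur in the same positions in a sentence.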
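
The candidate set scored at each training step consists of the true word followed by a handful of random words drawn from the flat corpus array. The sketch below reproduces that construction on toy data. `toy_concatenated` and `true_word` are made-up values, and `np.random.default_rng` is used here instead of the global NumPy random state as a choice of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy corpus as a flat array of word indices (hypothetical data)
toy_concatenated = np.array([0, 1, 2, 1, 3, 1, 4, 2])
negative = 5
true_word = 3

# Draw random positions in the corpus, so frequent words are picked more often
random_positions = rng.integers(0, len(toy_concatenated), size=negative)
target_samples = [true_word] + toy_concatenated[random_positions].tolist()

# The first candidate is the true word (label 1); the rest are treated as wrong (label 0)
layer_2_target = np.zeros(negative + 1)
layer_2_target[0] = 1

print(target_samples)
print(layer_2_target)  # [1. 0. 0. 0. 0. 0.]
```

Only the six rows of `weights_1_2` belonging to these candidates are scored and updated at each step, rather than one row per vocabulary word.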