### Natural Language Processing:
The input text must first be converted into numerical matrices that a neural network can take as input and map to a numerical prediction.

How the text is encoded as a matrix makes it easier or harder for the network to learn the correlation between the input and the target.

### Sentiment Analysis:
<em>One-hot encoding</em> uses the binary values 0 and 1 to encode the presence/absence of something, or true/false. For sentiment analysis, an appropriate way to translate input text into vectors is to have one column for each word in the vocabulary and to mark a column with a 1 if the current review contains that word. Because a review usually contains many distinct words, its vector has several 1s; this is sometimes called a multi-hot or bag-of-words encoding.
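The encoding described above can be sketched on a toy example. The five-word vocabulary, the sample review, and the `encode` helper below are hypothetical and exist only for illustration; they are not part of the IMDB pipeline further down.

```python
# Hypothetical toy vocabulary, for illustration only.
vocab = ["bad", "good", "movie", "plot", "great"]
word2index = {word: i for i, word in enumerate(vocab)}

def encode(review, word2index):
    """Return a 0/1 vector with a 1 in the column of every known word in the review."""
    vector = [0] * len(word2index)
    for word in set(review.split(" ")):
        if word in word2index:
            vector[word2index[word]] = 1
    return vector

encoded = encode("great movie great plot", word2index)
print(encoded)  # [0, 0, 1, 1, 1]
```

Repeated words do not change the result: the vector records only whether a word occurs, not how often.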

### Word Embedding:



### IMDB Movie Review Sentiment Analysis:
The following snippet transforms the raw review data into vectors that can be passed as input into a neural network. There are two labels: positive sentiment and negative sentiment.

In [47]:
import sys

# Read one review per line and the matching label per line.
with open('./Datasets/imdb_reviews.txt') as f:
    raw_reviews = f.readlines()

with open('./Datasets/imdb_review_labels.txt') as f:
    raw_labels = f.readlines()

# Each review becomes the set of distinct tokens it contains.
tokens = [set(review.split(" ")) for review in raw_reviews]

# Build the vocabulary from every non-empty token.
vocab = set()
for sent in tokens:
    for word in sent:
        if len(word) > 0:
            vocab.add(word)
vocab = list(vocab)

# Map each word to its column index.
word2index = {word: i for i, word in enumerate(vocab)}

# Encode each review as a list of the indices of the words it contains.
input_dataset = list()
for sent in tokens:
    sent_indices = list()
    for word in sent:
        if word in word2index:    # skips the empty token, which is not in vocab
            sent_indices.append(word2index[word])
    input_dataset.append(list(set(sent_indices)))

# 1 for a positive review, 0 otherwise.
target_dataset = [1 if label == 'positive\n' else 0 for label in raw_labels]

print("Prepared {} samples.".format(len(input_dataset)))


Prepared 25000 samples.


In [58]:
print("An example datapoint:")
print("===== Raw =====")
for i in raw_reviews[0].split(" "):
    print(i, end=" ")
print("===> Label: {}".format(raw_labels[0]))
print("===== Processed =====")
for i in input_dataset[0]:    # already a list of word indices, so no split() is needed
    print(i, end=" ")

An example datapoint:
===== Raw =====
bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
 ===> Label: positive

===== Processed =====
(the integer vocabulary indices of the words in the review; the exact values depend on the ordering of vocab)

The 3-layer network below takes in a 'sentence' represented as a variable-length array of integers, each of which is the index in the vocabulary of a word that appears in the review.
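Feeding word indices instead of a full 0/1 vector gives the same hidden-layer input, because multiplying a 0/1 vector by a weight matrix simply adds up the rows where the vector holds a 1. The sketch below checks this on toy sizes; the sizes, the random weights, and the sample index list are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_size = 6, 4          # toy sizes, chosen for illustration
weights_0_1 = rng.random((vocab_size, hidden_size))

x = [1, 3, 4]                           # indices of the words present in a toy review

# Dense route: build the 0/1 vector and multiply by the weight matrix.
one_hot = np.zeros(vocab_size)
one_hot[x] = 1
dense = one_hot.dot(weights_0_1)

# Sparse route: select only the rows for the words present and add them up.
sparse = np.sum(weights_0_1[x], axis=0)

same = np.allclose(dense, sparse)
print(same)  # True
```

With a large vocabulary, the sparse route avoids building and multiplying a vector that is almost entirely zeros.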

In [53]:
import numpy as np
np.random.seed(1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

alpha, iterations = (0.01, 3)
hidden_size = 100

weights_0_1 = 0.2 * np.random.random((len(vocab), hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, 1)) - 0.1

correct, total = (0, 0)
for iteration in range(iterations):
    for i in range(len(input_dataset) - 1000):    # Hold out the last 1000 datapoints for testing
        x, y = (input_dataset[i], target_dataset[i])
        # Embed (sum the weight rows of the words present) and apply sigmoid:
        layer_1 = sigmoid(np.sum(weights_0_1[x], axis=0))
        layer_2 = sigmoid(np.dot(layer_1, weights_1_2))
        # Output error: prediction minus target.
        layer_2_delta = layer_2 - y
        # Backpropagate the error to the hidden layer.
        layer_1_delta = layer_2_delta.dot(weights_1_2.T)
        # Update only the rows of the words present, then the output weights.
        weights_0_1[x] -= layer_1_delta * alpha
        weights_1_2 -= np.outer(layer_1,layer_2_delta) * alpha
        if (np.abs(layer_2_delta) < 0.5):
            correct += 1
        total += 1
        if (i % 10 == 9):
            progress = str(i / float(len(input_dataset)))
            sys.stdout.write('\rIter:' + str(iteration) \
                             + ' Progress:' + progress[2:4] \
                             + '.' + progress[4:6] \
                             + '% Training Accuracy:' \
                             + str(correct / float(total)))
    print()
correct, total = (0, 0)
for i in range(len(input_dataset) - 1000, len(input_dataset)):
    x = input_dataset[i]
    y = target_dataset[i]
    layer_1 = sigmoid(np.sum(weights_0_1[x], axis=0))
    layer_2 = sigmoid(np.dot(layer_1, weights_1_2))
    
    if (np.abs(layer_2 - y) < 0.5):
        correct += 1
    total += 1
print("Test accuracy: {}".format(correct / float(total)))
        



Iter:0 Progress:95.99% Training Accuracy:0.83204166666666676
Iter:1 Progress:95.99% Training Accuracy:0.8661666666666666
Iter:2 Progress:95.99% Training Accuracy:0.8842777777777778
Test accuracy: 0.848
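The accuracy check `np.abs(layer_2 - y) < 0.5` counts a prediction as correct when the sigmoid output lands on the same side of 0.5 as the 0/1 label. A minimal sketch of that equivalence follows; `is_correct` and `predicted_class` are hypothetical helper names introduced here, not functions from the notebook. The two views differ only at an output of exactly 0.5, which the notebook's check counts as wrong for either label.

```python
import numpy as np

def is_correct(prediction, label):
    """Mirror the notebook's check: the output is within 0.5 of the 0/1 label."""
    return bool(np.abs(prediction - label) < 0.5)

def predicted_class(prediction):
    """Equivalent view: threshold the sigmoid output at 0.5 to get a class."""
    return 1 if prediction > 0.5 else 0

# The two views agree for every output that is not exactly 0.5.
for prediction in [0.1, 0.49, 0.51, 0.9]:
    for label in [0, 1]:
        assert is_correct(prediction, label) == (predicted_class(prediction) == label)
```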
