# Sentiment analysis with TFLearn

- This notebook builds a network for sentiment analysis on movie review data.
- It uses [TFLearn](http://tflearn.org/), a high-level library built on top of TensorFlow.
- TFLearn lets you build a network just by defining its layers and takes care of most of the details for you.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical

## Step 1-6 Convert a dataframe of reviews to a matrix of numbers

#### Step 1
- store the reviews in a DataFrame of shape (n, 1)

In [2]:
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)

In [3]:
print(reviews.head(1))
labels.head(1)
reviews.shape

                                                   0
0  bromwell high is a cartoon comedy . it ran at ...


(25000, 1)

#### Step 2
- Counting word frequency

----
> **bag of words**
- **count how often** each word appears in the data
- use this count to create a vocabulary which we'll use to **encode the review data**. 
- This resulting count is known as a [**bag of words**](https://en.wikipedia.org/wiki/Bag-of-words_model)
- use it to select our vocabulary and build the word vectors

> **Exercise:** Create the bag of words from the reviews data. 
- The reviews are stored in the `reviews` [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). 
- If you want the reviews as **a Numpy array**, use `reviews.values`. 
- You can **iterate through the rows in the DataFrame** with `for idx, row in reviews.iterrows():` ([documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html)).

In [4]:
from collections import Counter
total_counts = Counter()
for _, row in reviews.iterrows():
    total_counts.update(row[0].split(' '))
print("Total words in data set: ", len(total_counts))

Total words in data set:  74074


#### Step 3: 
- sort the counter and keep the most common 10000

----
> **keep only the most frequent words**
- keep the first 10000 most frequent words.
- most of the words in the vocabulary are rarely used 
- so they will have little effect on our predictions
- sort the words in `total_counts` by their count and keep the 10000 most frequent ones as `vocab`.
- at this point, `total_counts` holds the count of **every unique word across the entire review set**

In [5]:
total_counts.get('the') # like dict.get, checked
total_counts['the']

336713

In [6]:
vocab = sorted(total_counts, key=total_counts.get, reverse=True)[:10000] # checked
print(vocab[:60])

['', 'the', '.', 'and', 'a', 'of', 'to', 'is', 'br', 'it', 'in', 'i', 'this', 'that', 's', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'you', 'on', 't', 'not', 'he', 'are', 'his', 'have', 'be', 'one', 'all', 'at', 'they', 'by', 'an', 'who', 'so', 'from', 'like', 'there', 'her', 'or', 'just', 'about', 'out', 'if', 'has', 'what', 'some', 'good', 'can', 'more', 'she', 'when', 'very', 'up', 'time', 'no']
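
The empty string at the top of this list is a side effect of `.split(' ')`: splitting on a single space produces an empty token wherever two spaces are adjacent or the text starts or ends with a space. A minimal sketch on a made-up string (not taken from the data set) that compares it with `.split()` called without arguments:

```python
from collections import Counter

# Made-up text with a double space and a trailing space
text = "great  movie . loved it "

tokens_single_space = text.split(' ')  # keeps empty strings
tokens_whitespace = text.split()       # collapses runs of whitespace

print(tokens_single_space)
print(tokens_whitespace)
print("empty tokens:", Counter(tokens_single_space)[''])
```

Switching the notebook to `.split()` would remove `''` from the vocabulary, but it would also shift every index, so the outputs shown here would no longer match.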


----
> **How to tell whether keeping the 10000 most common words is enough**
- What's the last word in our vocabulary?
- We can use this to judge if 10000 is too few. 
- If the last word is pretty common, we probably need to keep more words.

In [7]:
print(vocab[-1], ': ', total_counts[vocab[-1]])

intrusive :  30


-----
> **a count somewhere between 20 and 50 for the last word is probably fine**
- The last word in our vocabulary appears 30 times in total, so it shows up in at most 30 of the 25000 reviews. 
- I think it's fair to say this is a tiny proportion of reviews. 
- We are probably fine with this number of words.
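
Another way to judge the cutoff is to ask what share of all word occurrences the kept vocabulary covers. The sketch below is an illustration only: `vocab_coverage` is a hypothetical helper and the counts are made up, not taken from the review data.

```python
from collections import Counter

def vocab_coverage(counts, k):
    """Fraction of all token occurrences covered by the k most frequent words."""
    total = sum(counts.values())
    covered = sum(count for _, count in counts.most_common(k))
    return covered / total

# Toy counts, made up for illustration
toy_counts = Counter({'the': 50, 'movie': 30, 'good': 15, 'intrusive': 4, 'zyx': 1})

for k in (1, 3, 5):
    print(k, round(vocab_coverage(toy_counts, k), 2))
```

With the real data you could call it as `vocab_coverage(total_counts, 10000)`. A value close to 1 would suggest that the dropped words account for very little of the text.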

#### Step 4
- from word Counter to word-index dictionary

----
> **Build a word-index for all unique words of entire reviews set**
- Now for each review in the data, we'll make a word vector. 
- First we need to make a mapping of word to index, pretty easy to do with a dictionary comprehension.

> **create a dictionary `word2idx`**
- maps each word in the vocabulary to its index in `vocab`

In [8]:
word2idx = {word: i for i, word in enumerate(vocab)}

#### Step 5
- Convert a review to a vector: each index stands for a vocabulary word, and each value is the count of that word in the review


- write a function that converts a string sentence to a word vector. The function will take a string of words as input and return a vector with the words counted up. Here's the general algorithm to do this:

* Initialize the word vector with [np.zeros](https://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html), it should be the length of the vocabulary.
* Split the input string of text into a list of words with `.split(' ')`.
* For each word in that list, increment the element in the index associated with that word, which you get from `word2idx`.

**Note:** Since not all words are in `vocab`, you'll get a `KeyError` if you look up one of the missing words directly in `word2idx`. You can use the `.get` method of the `word2idx` dictionary to specify a default value that is returned when the key is missing. For example, `word2idx.get(word, None)` returns `None` if `word` doesn't exist in the dictionary.

In [9]:
def text_to_vector(text):
    # start with a zero vector, one element per vocabulary word
    word_vector = np.zeros(len(vocab), dtype=np.int_)

    # count every word of the review that is in the vocabulary
    for word in text.split(' '):
        # idx is the word's vocabulary index, or None if the word is not in the vocabulary
        idx = word2idx.get(word, None)
        if idx is not None:
            word_vector[idx] += 1

    # element i of the vector now holds the number of occurrences of vocab[i]
    return word_vector

If you do this right, the following call should return the array shown below it:

```
text_to_vector('The tea is for a party to celebrate '
               'the movie so she has no time for a cake')[:65]

array([0, 1, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0])
```

In [10]:
text_to_vector('The tea is for a party to celebrate '
                   'the movie so she has no time for a cake')[:65]

array([0, 1, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0])
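
Note that index 1 (`'the'`) is counted only once above even though the sentence contains both `The` and `the`: the lookup is case-sensitive. A self-contained toy version makes this easier to see. The five-word vocabulary and the names `toy_vocab`, `toy_word2idx` and `toy_text_to_vector` are made up for this illustration.

```python
import numpy as np

# Made-up vocabulary, much smaller than the real 10000-word one
toy_vocab = ['the', 'movie', 'was', 'good', 'bad']
toy_word2idx = {word: i for i, word in enumerate(toy_vocab)}

def toy_text_to_vector(text):
    vector = np.zeros(len(toy_vocab), dtype=np.int_)
    for word in text.split(' '):
        idx = toy_word2idx.get(word)  # None for out-of-vocabulary words
        if idx is not None:
            vector[idx] += 1
    return vector

sentence = 'The movie was good , the ending was great'
as_written = toy_text_to_vector(sentence)
lowercased = toy_text_to_vector(sentence.lower())

print(as_written)   # 'The' is not in the vocabulary, so 'the' is counted once
print(lowercased)   # after lowercasing, 'the' is counted twice
```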

#### Step 6
- Now, run through our entire review data set and convert each review to a word vector.

In [11]:
# build a matrix to store all reviews (on rows), and all words (on columns)
word_vectors = np.zeros((len(reviews), len(vocab)), dtype=np.int_) # set dtype

# take each review and convert it to a vector for each row
for ii, (_, text) in enumerate(reviews.iterrows()):
    word_vectors[ii] = text_to_vector(text[0])
    
# by now, word_vectors is a numpy matrix with all reviews converted to numbers

In [12]:
# print out the first 5 reviews (5 rows), and first 23 words-count(first 23 columns)
word_vectors[:5, :23]

array([[ 18,   9,  27,   1,   4,   4,   6,   4,   0,   2,   2,   5,   0,
          4,   1,   0,   2,   0,   0,   0,   0,   0,   0],
       [  5,   4,   8,   1,   7,   3,   1,   2,   0,   4,   0,   0,   0,
          1,   2,   0,   0,   1,   3,   0,   0,   0,   1],
       [ 78,  24,  12,   4,  17,   5,  20,   2,   8,   8,   2,   1,   1,
          2,   8,   0,   5,   5,   4,   0,   2,   1,   4],
       [167,  53,  23,   0,  22,  23,  13,  14,   8,  10,   8,  12,   9,
          4,  11,   2,  11,   5,  11,   0,   5,   3,   0],
       [ 19,  10,  11,   4,   6,   2,   2,   5,   0,   1,   2,   3,   1,
          0,   0,   0,   3,   1,   0,   1,   0,   0,   0]])

## Step 7 Train, Validation, Test sets

- split our data into train, validation, and test sets. 
- train on the train data
- use the validation data to set the hyperparameters
- at the very end measure the network performance on the test data. 
- use `to_categorical` from TFLearn to one-hot encode the target data
- the network then has two output units and can classify with a softmax activation function.

In [13]:
# positive as 1, negative as 0
Y = (labels=='positive').astype(np.int_)
records = len(labels)

# get index for each review
shuffle = np.arange(records)

# shuffle index of all reviews
np.random.shuffle(shuffle)

# the first 90% of the shuffled indices go to training, the remaining 10% to testing
train_fraction = 0.9
train_split, test_split = shuffle[:int(records*train_fraction)], shuffle[int(records*train_fraction):]

# select the training rows of the word-vector matrix and one-hot encode their labels
trainX, trainY = word_vectors[train_split,:], to_categorical(Y.values[train_split], 2)

# select the test rows of the word-vector matrix and one-hot encode their labels
testX, testY = word_vectors[test_split,:], to_categorical(Y.values[test_split], 2)

In [14]:
trainY # one-hot encoded: column 0 is negative, column 1 is positive

array([[ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       ..., 
       [ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.]])
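
The output above has the shape of a standard one-hot encoding. As a rough NumPy sketch of that idea (not TFLearn's actual implementation; `one_hot` is a hypothetical helper), each integer label selects one row of an identity matrix:

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (len(labels), n_classes) array with a 1.0 in each label's column."""
    labels = np.asarray(labels).ravel()
    return np.eye(n_classes)[labels]

# 0 = negative, 1 = positive, as in the notebook
encoded = one_hot([0, 1, 1], 2)
print(encoded)
```

With this layout, column 1 holds the indicator for a positive review, which is why the positive probability is read from column 1 later on.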

## Building the network

----
- [TFLearn](http://tflearn.org/) lets you build the network by [defining the layers](http://tflearn.org/layers/core/). 

### Input layer

- For the input layer, you just need to tell it how many units you have

```
net = tflearn.input_data([None, 100])
```
- first position (`None`): the batch size; leave it as `None` to accept any batch size, or set your mini-batch size
- second position (`100`): the number of input units, which must match the number of features in your data
- here we set it to 10000, one input unit per word in our vocabulary

### Adding layers

- To add new hidden layers, you use 

```
net = tflearn.fully_connected(net, n_units, activation='ReLU')
```
- adds a fully connected layer where every unit in the previous layer is connected to every unit in this layer
- The first argument `net` is the network you created in the `tflearn.input_data` call
- set the number of units in the layer with `n_units`
- set the activation function with the `activation` keyword. 
- keep adding layers to your network by repeatedly calling `net = tflearn.fully_connected(net, n_units)`.
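
To make "fully connected" concrete, here is a minimal NumPy sketch of what one such layer with a ReLU activation computes. It is an illustration under assumed shapes and random weights, not TFLearn's code, and `dense_relu` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_relu(x, weights, bias):
    """Fully connected layer followed by ReLU: max(0, x @ W + b)."""
    return np.maximum(0, x @ weights + bias)

batch = rng.random((4, 10))          # 4 examples with 10 input features each
weights = rng.normal(size=(10, 5))   # every input connects to each of 5 units
bias = np.zeros(5)

hidden = dense_relu(batch, weights, bias)
print(hidden.shape)  # (4, 5)
```

Stacking layers means feeding `hidden` into the next layer, whose weight matrix has 5 rows.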

### Output layer

- the last layer you add is used as the output layer
- set the number of units to match the target data
- here we predict two classes, positive or negative sentiment
- each input belongs to exactly one of the two classes, so we use a softmax activation.

```
net = tflearn.fully_connected(net, 2, activation='softmax')
```
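
Softmax turns the raw scores of the output units into probabilities that sum to 1 for each example. A small NumPy sketch with made-up scores:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up raw scores for two examples and two classes
logits = np.array([[2.0, 0.0],
                   [0.0, 0.0]])
probs = softmax(logits)
print(probs)
```

Equal scores give equal probabilities, and a larger score for one class pushes its probability toward 1.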

### Training

- To set how you train the network, use 

```
net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
```
* `optimizer` sets the training method, here stochastic gradient descent
* `learning_rate` is the step size used for each weight update
* `loss` determines how the network error is calculated, in this example with categorical cross-entropy.
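
As a rough illustration of the loss, categorical cross-entropy compares the predicted probabilities with the one-hot targets and penalizes low probability on the true class. The sketch below assumes the loss is averaged over the batch, which may differ in detail from what TFLearn does internally; the arrays are made up.

```python
import numpy as np

def categorical_crossentropy(probs, targets, eps=1e-12):
    """Mean over the batch of -sum(targets * log(probs))."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(targets * np.log(probs), axis=1)))

targets = np.array([[1.0, 0.0], [0.0, 1.0]])
confident = np.array([[0.9, 0.1], [0.2, 0.8]])  # mostly right
unsure = np.array([[0.5, 0.5], [0.5, 0.5]])     # no information

loss_confident = categorical_crossentropy(confident, targets)
loss_unsure = categorical_crossentropy(unsure, targets)
print(loss_confident, loss_unsure)
```

The uninformative prediction scores ln 2 (about 0.693), and the loss drops as more probability lands on the correct class.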

----
> put all this together to create the model
- using `tflearn.DNN(net)`. 
- So it ends up looking something like 

```
net = tflearn.input_data([None, 10])                          # Input
net = tflearn.fully_connected(net, 5, activation='ReLU')      # Hidden
net = tflearn.fully_connected(net, 2, activation='softmax')   # Output
net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
model = tflearn.DNN(net)
```


In [15]:
# Network building
def build_model():
    tf.reset_default_graph()
    
    # Inputs
    net = tflearn.input_data([None, 10000])

    # Hidden layer(s)
    net = tflearn.fully_connected(net, 200, activation='ReLU')
    net = tflearn.fully_connected(net, 25, activation='ReLU')

    # Output layer
    net = tflearn.fully_connected(net, 2, activation='softmax')
    
    # how to train
    net = tflearn.regression(net, optimizer='sgd', 
                             learning_rate=0.1, 
                             loss='categorical_crossentropy')
    
    model = tflearn.DNN(net)
    return model

## Initializing the model

Next we need to call the `build_model()` function to actually build the model. In my solution I haven't included any arguments to the function, but you can add arguments so you can change parameters in the model if you want.

> **Note:** You might get a bunch of warnings here. TFLearn uses a lot of deprecated code in TensorFlow. Hopefully it gets updated to the new TensorFlow version soon.

In [16]:
model = build_model()

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
Instructions for updating:
Please switch to tf.summary.merge.
Instructions for updating:
Use `tf.global_variables_initializer` instead.


## Training the network

Now that we've constructed the network, saved as the variable `model`, we can fit it to the data with the `model.fit` method. You pass in the training features `trainX` and the training targets `trainY`. Below I set `validation_set=0.1`, which reserves 10% of the training data as the validation set. You can also set the batch size and number of epochs with the `batch_size` and `n_epoch` keywords, respectively. Below is the code to fit the network to our word vectors.

You can rerun `model.fit` to train the network further if you think you can increase the validation accuracy. Remember, all hyperparameter adjustments must be done using the validation set. **Only use the test set after you're completely done training the network.**

In [17]:
# Training the model - checked
model.fit(trainX, trainY, validation_set=0.1, show_metric=True, batch_size=128, n_epoch=50)

Training Step: 7950  | total loss: 0.32508
| SGD | epoch: 050 | loss: 0.32508 - acc: 0.8542 | val_loss: 0.47160 - val_acc: 0.8178 -- iter: 20250/20250
--
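
The numbers in this log can be reproduced with a little arithmetic. The sketch below assumes the validation share is taken from the 90% training split and that the last, smaller batch of each epoch counts as one step. Both are assumptions, but they agree with the `20250` and `7950` shown above.

```python
import math

records = 25000
batch_size = 128
n_epoch = 50

n_train_split = records * 9 // 10             # 90% of all reviews go to trainX
n_validation = n_train_split // 10            # validation_set=0.1 of trainX
n_fit = n_train_split - n_validation          # examples actually trained on

steps_per_epoch = math.ceil(n_fit / batch_size)
total_steps = steps_per_epoch * n_epoch

print(n_fit, steps_per_epoch, total_steps)
```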


## Testing

After you're satisfied with your hyperparameters, you can run the network on the test set to measure its performance. Remember, *only do this after finalizing the hyperparameters*.

In [18]:
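# model.predict returns nested lists, so convert the result to an ndarray before slicing column 0
# the thresholded column 0 of the predictions is compared with column 0 of the one-hot targets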
predictions = (np.array(model.predict(testX))[:,0] >= 0.5).astype(np.int_)
test_accuracy = np.mean(predictions == testY[:,0], axis=0)
print("Test accuracy: ", test_accuracy)

Test accuracy:  0.8276
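
An equivalent way to compute accuracy is to compare the most probable class per row with the true class, which also works for more than two classes. A sketch on made-up arrays standing in for `model.predict(testX)` and `testY`:

```python
import numpy as np

# Made-up stand-ins for the predicted probabilities and one-hot targets
toy_probs = np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
toy_targets = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])

predicted_class = toy_probs.argmax(axis=1)  # index of the largest probability
true_class = toy_targets.argmax(axis=1)     # index of the 1 in each one-hot row

accuracy = np.mean(predicted_class == true_class)
print("Accuracy: ", accuracy)
```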


In [27]:
# model.predict returns nested lists here, not a 2-d ndarray:
# [0][1] is row 0, column 1; [0:2][1] is the second of the first two rows
model.predict(testX)[0][1]
model.predict(testX)[0:2][1]
model.predict(testX)[0:2]

[[0.519019603729248, 0.4809803068637848],
 [0.707361102104187, 0.2926388680934906]]

## Try out your own sentence!

In [19]:
sentence = "Moonlight is by far the best movie of 2016."
positive_prob = model.predict([text_to_vector(sentence.lower())])[0][1]
print('P(positive) = {:.3f} :'.format(positive_prob), 
      'Positive' if positive_prob > 0.5 else 'Negative')

P(positive) = 0.927 : Positive
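
One caveat when trying your own sentences: the review text shown earlier has punctuation separated by spaces (`comedy . it ran at`), so a token like `2016.` is unlikely to be in the vocabulary and is silently skipped. The helper below is a hypothetical sketch of matching that style. The exact preprocessing applied to `reviews.txt` is not shown in this notebook, so treat the rules as assumptions.

```python
import re

def normalize(text):
    """Lowercase the text and put spaces around punctuation characters."""
    lowered = text.lower()
    spaced = re.sub(r"([^\w\s])", r" \1 ", lowered)
    return " ".join(spaced.split())  # collapse repeated whitespace

cleaned = normalize("Moonlight is by far the best movie of 2016.")
print(cleaned)
```

You could then pass `normalize(sentence)` to `text_to_vector` instead of `sentence.lower()`.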


In [20]:
from IPython.display import HTML
HTML('<iframe width="500" height="300" src="https://www.youtube.com/embed/s7FKYC5Zcm8?ecver=1" frameborder="0" allowfullscreen></iframe>')