# Epigenetics Answer Classifier 

In [1]:
import pandas as pd 
import json
import numpy as np
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical

## Preparing the data

Our goal here is to convert the answers into word vectors. The word vectors will have elements representing words in the total vocabulary. 

If the second position represents the word 'the', for each answer, count up the number of times 'the' appears in the text and set the second position to that count. 

The text must first be converted to lower case so that, for example, 'The' and 'the' are counted as the same word. 
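As a minimal sketch of the counting idea described above, the snippet below builds a count vector for one sentence. The three-word vocabulary and the helper name `count_vector` are made up for illustration; they are not part of the data set or of the code used later in this notebook.

```python
from collections import Counter

# Hypothetical three-word vocabulary; position 1 stands for 'the'.
toy_vocab = ['dna', 'the', 'gene']

def count_vector(text, vocab):
    """Return one count per vocabulary word for a lower-cased text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

toy_vector = count_vector("The gene changes the DNA", toy_vocab)
print(toy_vector)  # [1, 2, 1]
```

Words that are not in the vocabulary ('changes' here) are simply ignored.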

### Read the data

Read the answers and the scores from the JSON file.

Assuming the data has been exported to two separate CSV-formatted files, one for the answers and one for the scores, the calls are as follows.

The data can also be read directly from the JSON file into a dictionary.  

In [20]:
answers = pd.read_csv('answers.txt', header=None)
scores = pd.read_csv('scores.txt', header=None)
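Reading the JSON file directly, as mentioned above, could look roughly like the sketch below. The record layout (a list of objects with `answer` and `score` keys) is an assumption, since the structure of the real file is not shown here, and an in-memory string stands in for the file.

```python
import io
import json

# Hypothetical JSON layout: a list of records with 'answer' and 'score' keys.
sample_json = io.StringIO(
    '[{"answer": "dna methylation silences genes", "score": 9.0},'
    ' {"answer": "histones wrap dna", "score": 6.5}]'
)

# With a real file, use: with open(path) as f: records_json = json.load(f)
records_json = json.load(sample_json)
answers_dict = {i: rec["answer"] for i, rec in enumerate(records_json)}
scores_dict = {i: rec["score"] for i, rec in enumerate(records_json)}

print(answers_dict[0], scores_dict[0])
```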

### Get word frequency distribution

Count how often each word appears in the data. 

Use this count to create a vocabulary to encode the answers.

Use the vocabulary to build word vectors. 

In [22]:
from collections import Counter

word_counts = Counter()

# Each row holds one answer in column 0; count every space-separated token
for idx, row in answers.iterrows():
    for w in row[0].split(" "):
        word_counts[w] += 1

print("Total words in data set: ", len(word_counts))

Total words in data set:  74074


We keep only the 10,000 most frequent words. Most of the words in the vocabulary are rarely used, so they have little effect on the classification. 

In [23]:
vocab = sorted(word_counts, key=word_counts.get, reverse=True)[:10000]
print(vocab[:60])

['', 'the', '.', 'and', 'a', 'of', 'to', 'is', 'br', 'it', 'in', 'i', 'this', 'that', 's', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'you', 'on', 't', 'not', 'he', 'are', 'his', 'have', 'be', 'one', 'all', 'at', 'they', 'by', 'an', 'who', 'so', 'from', 'like', 'there', 'her', 'or', 'just', 'about', 'out', 'if', 'has', 'what', 'some', 'good', 'can', 'more', 'she', 'when', 'very', 'up', 'time', 'no']
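The same selection can also be written with `Counter.most_common`. The sketch below uses a made-up sentence rather than the real data.

```python
from collections import Counter

toy_counts = Counter("the gene and the histone and the dna".split(" "))

# most_common(k) returns the k highest-count (word, count) pairs.
toy_top = [word for word, _ in toy_counts.most_common(2)]
print(toy_top)  # ['the', 'and']
```

The empty string at the head of the vocabulary above is a side effect of splitting on a single space: consecutive spaces produce empty tokens. Calling `split()` with no argument would avoid them.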


To create a word vector for each answer, first build a dictionary that maps every vocabulary word to its index. 

In [28]:
word2idx = {word : i for i, word in enumerate(vocab)}



### Convert Answers to Vectors

The function `text_to_vector` takes a string of words (an answer) as input and returns a vector of word counts. 

In [36]:
def text_to_vector(text):
    vector = np.zeros(len(vocab))
    for w in text.split(' '):
        indx = word2idx.get(w,None)
        if indx is None:  # word is not in the vocabulary
            continue
        else:
            vector[indx] += 1
    return vector

[ 0.  2.  0.  0.  2.  0.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  2.
  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.
  1.  0.  0.  0.  1.  1.  0.  0.  0.  0.  0.]


Now, run through the entire data set and convert each answer to a word vector.

In [38]:
word_vectors = np.zeros((len(answers), len(vocab)), dtype=np.int_)
for x, (_, text) in enumerate(answers.iterrows()):
    word_vectors[x] = text_to_vector(text[0])

### Set Train, Validation, Test sets

Split the data into train, validation, and test sets. 

The function `to_categorical` from TFLearn one-hot encodes the target data, producing one column per class so that the targets match a network with two output units.

In [40]:
# NOTE: assumes column 0 of the scores file holds one class label per answer
labels = scores[0]
Y = (labels=='positive').astype(np.int_)
records = len(labels)

shuffle = np.arange(records)
np.random.shuffle(shuffle)
train_fraction = 0.9  # share of the records used for training; the rest is the test set

train_split, test_split = shuffle[:int(records*train_fraction)], shuffle[int(records*train_fraction):]
trainX, trainY = word_vectors[train_split,:], to_categorical(Y.values[train_split], 2)
testX, testY = word_vectors[test_split,:], to_categorical(Y.values[test_split], 2)
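To make the one-hot encoding concrete, here is a NumPy stand-in for the idea. It is an illustrative helper (`one_hot` is a made-up name), not TFLearn's implementation of `to_categorical`.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (len(labels), n_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(n_classes)[labels]

print(one_hot([0, 1, 1], 2))
```

Class 0 becomes the row `[1, 0]` and class 1 becomes `[0, 1]`, so each row contains exactly one 1.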

In [41]:
trainY

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       ..., 
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

## The network

### Input layer

For the input layer, indicate the number of units. 

```
net = tflearn.input_data([None, n_input_units])
```

`n_input_units` is the size of the epigenetics vocabulary. 

Setting the first element of the shape to `None` leaves the batch size unspecified, so the network accepts batches of any size.


### Hidden layers

To add new hidden layers, use 

```
net = tflearn.fully_connected(net, n_units, activation='ReLU')
```

This adds a fully connected layer where every unit in the previous layer is connected to every unit in this layer. 

The first argument `net` is the network you created in the `tflearn.input_data` call. It tells the new layer to use the output of the previous layer as its input. You can set the number of units in the layer with `n_units`, and set the activation function with the `activation` keyword. You can keep adding layers to your network by repeatedly calling `net = tflearn.fully_connected(net, n_units)`.
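Conceptually, a fully connected layer is a matrix multiplication plus a bias, followed by the activation. The NumPy sketch below illustrates how layers chain together; the helper `dense_relu` and the layer sizes are made up for the example, and this is not how TFLearn is implemented internally.

```python
import numpy as np

def dense_relu(x, weights, bias):
    """Fully connected layer: every input feeds every unit, then ReLU."""
    return np.maximum(0.0, x @ weights + bias)

rng = np.random.default_rng(0)
x = rng.random((4, 10))              # batch of 4 inputs, 10 features each
w1, b1 = rng.normal(size=(10, 5)), np.zeros(5)
w2, b2 = rng.normal(size=(5, 3)), np.zeros(3)

hidden = dense_relu(x, w1, b1)       # output of the first layer ...
out = dense_relu(hidden, w2, b2)     # ... is the input to the next
print(hidden.shape, out.shape)       # (4, 5) (4, 3)
```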

### Output layer

The last layer you add is used as the output layer. Therefore, you need to set the number of units to match the target data. In this case we are predicting two classes, positive or negative sentiment. You also need to set the activation function so it's appropriate for your model. Again, we're trying to predict if some input data belongs to one of two classes, so we should use softmax.

```
net = tflearn.fully_connected(net, 2, activation='softmax')
```
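For reference, softmax turns the raw outputs of the last layer into probabilities that sum to 1 across the classes. A minimal NumPy version, written here only to illustrate the formula, might look like this:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the max keeps exp() from overflowing."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

probs = softmax(np.array([[2.0, 0.0], [0.0, 0.0]]))
print(probs)
```

The first row favors class 0, while the second row, with equal inputs, splits the probability evenly between the two classes.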

### Training
To set how you train the network, use 

```
net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
```

The keywords: 

* `optimizer` sets the training method, here stochastic gradient descent
* `learning_rate` is the learning rate
* `loss` determines how the network error is calculated, here with the categorical cross-entropy
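The two ideas behind these keywords can be sketched in a few lines of NumPy. The function names below are illustrative, and the code shows the textbook formulas rather than TFLearn's internals.

```python
import numpy as np

def categorical_crossentropy(probs, targets, eps=1e-12):
    """Mean over the batch of -sum(target * log(prob))."""
    return float(-np.mean(np.sum(targets * np.log(probs + eps), axis=1)))

def sgd_step(params, grads, learning_rate=0.1):
    """Plain SGD: move each parameter against its gradient."""
    return params - learning_rate * grads

targets = np.array([[1.0, 0.0], [0.0, 1.0]])
uniform = np.full((2, 2), 0.5)       # a network that always predicts 50/50
print(categorical_crossentropy(uniform, targets))   # ln(2), about 0.693
print(sgd_step(np.array([1.0, -2.0]), np.array([0.5, -0.5])))
```

A loss that stays at the value for uniform predictions throughout training is a sign that the network is not learning.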

Finally, you put all this together to create the model with `tflearn.DNN(net)`, so it ends up looking something like this:

```
net = tflearn.input_data([None, 10])                          # Input
net = tflearn.fully_connected(net, 5, activation='ReLU')      # Hidden
net = tflearn.fully_connected(net, 2, activation='softmax')   # Output
net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
model = tflearn.DNN(net)
```

> **Exercise:** Below in the `build_net()` function, you'll put together the network using TFLearn. You get to choose how many layers to use, how many hidden units, etc.

In [60]:
# Build the Neural Network
def build_net():
    # Reset the default graph so that re-running this cell in Jupyter starts from a clean state
    tf.reset_default_graph()
    
    # Input layer
    # Set the number of input units to be equal to the size of the epigenetics vocabulary
    n_input_units = len(vocab)
    net = tflearn.input_data([None, n_input_units])
    
    # Hidden layers
    net = tflearn.fully_connected(net, 10, activation='ReLU') 
    net = tflearn.fully_connected(net, 5, activation='ReLU')
    
    # Output
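    # The number of units must match the target data: to_categorical(..., 2) yields
    # two columns, and softmax over a single unit would always output 1.
    # Predicting the score as a single float (0-12) would instead need a different
    # output activation and loss.
    n_output_units = 2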
    
    
    # Network parameters
    # optimizer: the training method, here stochastic gradient descent
    # learning_rate: is the learning rate
    # loss: how the network error is calculated, here with the categorical cross-entropy
    
    net = tflearn.fully_connected(net, n_output_units, activation='softmax')
    net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
       
    nn = tflearn.DNN(net)
    return nn

## Initialize the Neural Network

`build_net()` builds the model. 

Add arguments if you want to change parameters in the model.

In [61]:
model = build_net()

## Training the network

Now that we've constructed the network, saved as the variable `model`, we can fit it to the data. 

Use the `model.fit` method to train the network. 

`trainX`: training features  
`trainY`: training targets  
`validation_set=0.1`: reserves 10% of the training data as the validation set. 

In [62]:
# Train the network
model.fit(trainX, trainY, validation_set=0.1, show_metric=True, batch_size=128, n_epoch=50)

Training Step: 7949  | total loss: 1.38629 | time: 1.985s
| SGD | epoch: 050 | loss: 1.38629 - acc: 0.5026 -- iter: 20224/20250
Training Step: 7950  | total loss: 1.38629 | time: 3.004s
| SGD | epoch: 050 | loss: 1.38629 - acc: 0.5023 | val_loss: 1.38629 - val_acc: 0.4884 -- iter: 20250/20250
--


## Testing

Run the network on the test set to measure its performance. 

In [63]:
# TODO: Adjust this test
predictions = (np.array(model.predict(testX))[:,0] >= 0.5).astype(np.int_)
test_accuracy = np.mean(predictions == testY[:,0], axis=0)
print("Test accuracy: ", test_accuracy)

Test accuracy:  0.5012
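An alternative way to score predictions, which also works for more than two classes, is to compare the most probable class with the target class. The sketch below uses made-up probabilities and a hypothetical helper named `accuracy`; it is not the test used above.

```python
import numpy as np

def accuracy(probs, one_hot_targets):
    """Fraction of rows where the most probable class matches the target."""
    predicted = np.argmax(probs, axis=1)
    actual = np.argmax(one_hot_targets, axis=1)
    return float(np.mean(predicted == actual))

toy_probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
toy_targets = np.array([[1, 0], [0, 1], [0, 1]])
print(accuracy(toy_probs, toy_targets))   # 2 of 3 rows are correct
```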
