# Epigenetics-MOOC Answer Classifier 

In [20]:
import pandas as pd 
import json
import numpy as np
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical

## Prepare the data

The goal here is to convert the answers into word vectors. The word vectors will have elements representing words in the total vocabulary. 

The data needs to be preprocessed to use lower case characters. 

### Read the data

Read the answers and the scores from the JSON file.

Assuming the data was put into separate csv files, the calls will be.

Can also read data directly from the JSON file into a dictionary.  

In [21]:
#TODO: Put this in a csv file so we don't have to re-parse the json every time
import load_json

answers, scores = load_json.get_features("../../data/extractedRawDataJSON", True)

### Get word frequency distribution

Count how often each word appears in the data set. 

Use this count to create a vocabulary to encode the answers and build the word vectors. 

In [22]:
from collections import Counter
import spacy

nlp = spacy.load("en")
proc_answers = []
for ans in answers:
    doc = nlp.tokenizer(ans)
    proc_ans = [tok.lower_ for tok in doc if tok.is_alpha and not tok.is_stop]
    proc_answers.append(" ".join(proc_ans))

word_counts = Counter()

for idx, answer in enumerate(proc_answers):
        for word in answer.split(" "):
            word_counts[word] += 1

print("Total words in data set: ", len(word_counts))


# print(proc_answers[0])

Total words in data set:  12331


Keep the 10000 most frequent words. 

In [23]:
vocab = sorted(word_counts, key=word_counts.get, reverse=True)
print(vocab[:60])

['cpg', 'methylation', 'cancer', 'genes', 'dna', 'islands', 'regions', 'cells', 'elements', 'repetitive', 'gene', 'intergenic', 'normal', 'methylated', 'tumor', 'genomic', 'cell', 'silencing', 'genome', 'promoters', 'suppressor', 'hypomethylation', 'expression', 'hypermethylation', 'repeats', 'instability', 'tumour', 'hypermethylated', 'activation', 'hypomethylated', 'transcription', 'promoter', 'island', 'recombination', 'usually', 'normally', 'lead', 'leads', 'illegitimate', 'stability', 'cryptic', 'unmethylated', 'growth', 'epigenetic', 'silenced', 'associated', 'function', 'sites', 'disruption', 'transposition', 'specific', 'cause', 'result', 'occurs', 'insertions', 'deletions', 'tend', 'wide', 'translocations', 'transcriptional']


For each answer in the data, create a word vector. 

In [24]:
word2idx = {word : i for i, word in enumerate(vocab)}

### Convert Answers to Vectors

This method takes a string of words (answer) as input and returns a vector with word counts. 

In [25]:
def text_to_vector(text):
    vector = np.zeros(len(vocab))
    for w in text.split(' '):
        indx = word2idx.get(w,None)
        if indx == None:
            continue
        else:
            vector[indx] += 1
    return vector

Now, run through the entire data set and convert each answer to a word vector.

In [27]:
word_vectors = np.zeros((len(answers), len(vocab)), dtype=np.int_)
for x, text in enumerate(answers):
    word_vectors[x] = text_to_vector(text)
    
#Normalize word vectors
from sklearn import preprocessing
word_vectors = preprocessing.normalize(word_vectors, norm='l2')




### Define Train, Validation & Test sets

Split the data into train, validation, and test sets. 


In [28]:
records = len(scores)

shuffle = np.arange(records)
np.random.shuffle(shuffle)
test_fraction = 0.9

train_split, test_split = shuffle[:int(records*test_fraction)], shuffle[int(records*test_fraction):]
trainX, trainY = word_vectors[train_split,:], np.array(list( scores[i] for i in train_split ))
testX, testY = word_vectors[test_split,:], np.array(list( scores[i] for i in test_split ))

trainY.shape = (len(train_split),1)
testY.shape = (len(test_split),1)
#print(trainY)
#print(testY)


In [None]:
#print(trainY)

# The network

### Input layer

Must provide the number of input units. For our problem, `n_input_units` is the size of the epigenetics vocabulary. 

Setting the first argument to `None` chooses the default batch size.

```
net = tflearn.input_data([None, n_input_units])
```

### Hidden layers

Add hidden layers with 

```
net = tflearn.fully_connected(net, n_units, activation='ReLU')
```

This adds a fully connected layer where every unit in the previous layer is connected to every unit in this layer. 

Arguments:
`net` the network created with the call to `tflearn.input_data`. This tells the network to use the output of the previous layer as the input to this layer. 
`n_units`: the number of units in the layer.
`activation`: the activation function. 

Add more hidden layers by repeatedly calling `net = tflearn.fully_connected(net, n_units)`.

### Output layer

The last layer you add is used as the output layer. 

Set the number of units to match the target data. In our case the score is a single number. We need only one output unit.

```
net = tflearn.fully_connected(net, 1, activation='ReLU')
```

### Training

To set how you train the network, use 

```
net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
```

Arguments: 

* `optimizer` sets the training method, here stochastic gradient descent
* `learning_rate` is the learning rate
* `loss` determines how the network error is calculated. In this example, with the categorical cross-entropy.

Finally, create the model with `tflearn.DNN(net)`. So it ends up looking something like 

```
net = tflearn.input_data([None, X])                          # Input
net = tflearn.fully_connected(net, 5, activation='ReLU')      # Hidden
net = tflearn.fully_connected(net, 1, activation='ReLU')   # Output
net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='mean_square')
model = tflearn.DNN(net)
```

In [29]:
# Build the Neural Network
def build_net(activation, learning_rate=0.1, loss='mean_square',hidden_units_1=100, hidden_units_2=10):
    # Reset all parameters and variables. Use it if you are using Jupyter
    tf.reset_default_graph()

    
    # Input layer
    # Set the number of input units to be equal to the size of the epigenetics vocabulary
    n_input_units = len(vocab)
    net = tflearn.input_data([None, n_input_units])
    
    # Hidden layers
    # Use ReLU as default activation for the hidden units
    
    net = tflearn.fully_connected(net, hidden_units_1, activation='ReLU') 
    net = tflearn.fully_connected(net, hidden_units_2, activation='ReLU')
    
    # Output layer  
    # Set output units to 1 b/c the score is a float [0-12+] normalized to [0,1]
    n_output_units = 1
    net = tflearn.fully_connected(net, n_output_units, activation=activation)
    
    # Network parameters
    # optimizer: the training method
    # learning_rate: 
    # loss`: determines how the network error is calculated..
 
    net = tflearn.regression(net, optimizer='sgd', learning_rate=learning_rate, loss=loss)
       
    nn = tflearn.DNN(net, tensorboard_verbose=3)
    return nn

## Intialize the Neural Network

`build_net()` builds the model. 

Add arguments if you want to change parameters in the model.

In [37]:
model = build_net('ReLU', 0.001,'mean_square', 100, 10 )

## Training the network

Now that we've constructed the network, saved as the variable `model`, we can fit it to the data. 

Use the `model.fit` method to train the network. 

`trainX`: training features  
`trainY`: training targets . 
`validation_set=0.1`: reserves 10% of the data set as the validation set. 

In [38]:
# Train the network
model.fit(trainX, trainY, validation_set=0.1, show_metric=True, batch_size=100, n_epoch=1000)

Training Step: 30999  | total loss: [1m[32m0.02819[0m[0m | time: 2.587s
| SGD | epoch: 1000 | loss: 0.02819 - binary_acc: 0.0673 -- iter: 3000/3098
Training Step: 31000  | total loss: [1m[32m0.02806[0m[0m | time: 3.680s
| SGD | epoch: 1000 | loss: 0.02806 - binary_acc: 0.0646 | val_loss: 0.02560 - val_acc: 0.0783 -- iter: 3098/3098
--


## Testing

Run the network on the test set to measure its performance. 

In [39]:
predictions = np.array(model.predict(testX))[:,0]
# Calculate the Mean Squared Error
mse = ((testY[:,0] - predictions) ** 2).mean(axis=0)
print("MSE: ", mse)

MSE:  0.0273765052856
