## Outcome value prediction from Behaviour Change Data

In this notebook, we set up a regression/classification pipeline to predict either the outcome value
directly or an interval in which this value may fall into, the former being `regression` and the latter being `multi-class classification`.

Before getting started with this notebook, let's have a look at the format of the data files which this program expects. The data is specified as tsv (tab separated values) and is generated by running the following command from the HBCP project root directory.

```
mvn exec:java@svmreg -Dexec.args="true"
```
The above command generates the train and the test files located at the directory `prediction/sentences/`.
Each line of the data files (train and test) looks like the following:
```
C:5579097:35.7 C:5594106:16.1 I:3674268:1 C:5579728:30.6 I:3674248:1 C:5579118:22 C:5579689:14.6 C:5579088:44.5 C:5579711:80.6 C:5580203:29.4 O:4087178:abstinence C:5594105:19.3 O:4087186:cotinine C:5580200:35.7 O:4087187:2 C:5579096:58.8 I:3675703:1 C:5580204:5.9 I:3675698:1 C:5579663:22 C:5579083:29.4 O:4087191:6 I:3673288:1 O:4087172:1 I:3674264:1 I:3675717:1 C:5580216:0 	2.8
```

Each token represents a `:`-separated `<attribute-type>:<attribute-id>:<value>` combination, where the attribute type is one of `{C, I, O}` (contextual, intervention or outcome qualifier) feature, an attribute-id is a unique integer and a value is the textual representation of an instance of this feature.   



In [200]:
#All neccessary imports and global variables
import sys, getopt

import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import numpy as np

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras import backend as K
from numpy.random import seed
#from tensorflow import set_random_seed

# import the necessary packages
from keras.models import Sequential
from keras.layers import Dropout
from keras import layers

SEED = 110781 
seed(SEED)

import tensorflow as tf
tf.random.set_seed(SEED)

import random
import re
import os

PATTERN = re.compile("(?<![0-9])-?[0-9]*\.?[0-9]+")

In [201]:
def add_value_feature(in_fn, out_fn):
    """Read (text) file of (dense) vectors and add a final value to the vector for the actual value of the node.
    This allows to represent the distance between nodes. Normalize this feature between -1 and 1.
    For numerical -1 is the minimum value in the range, 1 the max.
    For BCT, +1 is the presence, -1 is the absence
    For categorical we create ranges bet. -1 and 1.
    For other, pick some random number close to 0.
    """

    print ("Writing appended vec file at %s" %(out_fn))
    random.seed(123)
    att_values, type_att = collect_attribute_value_maps(in_fn)
    numeric_atts = infer_numerical_attributes(att_values, type_att)

    att_maxes, att_mins = get_att_max_min(att_values, numeric_atts)  # max/min used for normalization

    # debug -- print maxes and mins
    print("There are %d numeric attributes." % len(numeric_atts))
    for num_att_id in numeric_atts:
        print("Numeric att: %s -- Min: %f ; Max: %f" % (num_att_id, att_mins[num_att_id], att_maxes[num_att_id]))

    # go through the file again and add 'normalized' values
    with open(in_fn) as f:
        with open(out_fn, 'w') as f_out:
            for line in f:
                cols = line.split()
                if len(cols) == 2:  # first line
                    f_out.write(line)
                    continue
                prefix, att_id, val = cols[0].split(':', 2)
                # BCTs stay the same
                if prefix == 'I':
                    norm_val = val
                # numerical attributes get normalized
                elif att_id in numeric_atts:
                    match = PATTERN.search(val)
                    if match is not None:
                        num = float(match.group(0))
                        # max-min normalization
                        if att_maxes[att_id] == att_mins[att_id]:
                            norm_val = "1"
                        else:
                            norm_num = 2 * ((num - att_mins[att_id]) / (att_maxes[att_id] - att_mins[att_id])) - 1
                            norm_val = str(norm_num)
                    else:
                        norm_val = "%f" % random.gauss(0, 0.001)
                # TODO not keeping track of categorical attributes yet
                # remaining attributes will get a random value close to zero (not sure what else to do with them)
                else:
                    norm_val = "%f" % random.gauss(0, 0.001)
                # f_out.write(cols[0] + '\t' + norm_val + '\n')
                f_out.write('{0} {1}\n'.format(line.strip(), norm_val))


def get_att_max_min(att_values, numeric_atts):
    # normalize numeric attributes
    att_maxes = {}
    att_mins = {}
    for num_att_id in numeric_atts:
        # get max and min
        nums = []
        for val in att_values[num_att_id]:
            # does val have a number
            match = PATTERN.search(val)
            if match is not None:
                num = float(match.group(0))
                nums.append(num)
        att_mins[num_att_id] = min(nums)
        att_maxes[num_att_id] = max(nums)
    return att_maxes, att_mins


def infer_numerical_attributes(att_values, type_att):
    # check if attribute is numerical (use logic from Martin's Java code)
    numeric_atts = []
    for att_id, vals in att_values.items():
        if att_id in type_att['I']:
            continue  # interventions will all be '1'
        num_val = 0
        for val in vals:
            # does val have a number
            match = PATTERN.search(val)
            if match is not None:
                num_val += 1
        # if 80% or more have numbers, then consider numeric
        if num_val / len(vals) >= 0.8:
            numeric_atts.append(att_id)
    return numeric_atts


def collect_attribute_value_maps(fn):
    att_values = {}
    type_att = {'C': set(), 'I': set(), 'O': set(), 'V': set()}
    with open(fn) as f:
        for line in f:
            cols = line.split()
            if len(cols) == 2:  # first line, skip
                continue
            prefix, att_id, val = cols[0].split(':', 2)
            type_att[prefix].add(att_id)
            if att_id in att_values:
                att_values[att_id].append(val)
            else:
                att_values[att_id] = [val]
    print("There are %d attributes." % len(att_values.keys()))
    print("There are %d interventions." % len(type_att['I']))
    return att_values, type_att



### Text features
Use the gensim library to extend the vocabulary to a set of `words` from the PubMed literature. These words are then used to augment a node vector with the sum of the constituent word vectors from the `value` of a node, e.g. an intervention of `Goal Setting` may contain as its value the text `encouraging patients to set a date for quitting`. The vectors for these words are then aggregated and added as additional dimensions of a node vector representation. 

In [202]:
def rmse(y_true, y_pred):
        return K.sqrt(K.mean(K.square(y_pred - y_true))) 

In [203]:
def mapToUniformlySpacedIntervals(y_i, numClasses):
    MAX = 100
    DELTA = MAX/numClasses
    y_i = float(y_i)
    y_i = int(y_i/DELTA)
    return y_i
    
def mapToNonUniformlySpacedIntervals(y_i):
    #[0,5] [5,10] [10, 15] [15,20] [20,30] [30,50] [50,100]
    y_i = float(y_i)
    if y_i < 5:
        y_i = 0
    elif y_i>=5 and y_i<10:
        y_i = 1
    elif y_i>=10 and y_i<15:
        y_i = 2
    elif y_i>=15 and y_i<20:
        y_i = 3
    elif y_i>=20 and y_i<30:
        y_i = 4
    elif y_i>=30 and y_i<50:
        y_i = 5
    else:
        y_i = 6

    return y_i

In [204]:
class InputHelper(object):
    emb_dim = 0
    pre_emb = dict() # word--> vec
    vocab_size = 0
    tokenizer = None
    embedding_matrix = None
    
    def cleanText(self, s):
        s = re.sub(r"[^\x00-\x7F]+"," ", s)
        s = re.sub(r'[\~\!\`\^\*\{\}\[\]\#\<\>\?\+\=\-\_\(\)]+',"",s)
        s = re.sub(r'( [0-9,\.]+)',r"\1 ", s)
        s = re.sub(r'\$'," $ ", s)
        s = re.sub('[ ]+',' ', s)
        return s.lower()

    #the tokenizer needs to be trained on the pre-trained node vectors
    #join the names of the nodes in a string so that tokenizer could be fit on it
    def getAllNodes(self, emb_path):
        print("Collecting node names...")
        line_count = 0        
        node_names = []
        for line in open(emb_path):
            l = line.strip().split()
            if (line_count > 0):
                node_names.append(l[0])
            
            line_count = line_count + 1
        
        self.vocab_size = line_count # includes the +1
        print("Collected node names...")
        return node_names

    # call convertWordsToIds first followed by loadW2V
    def convertWordsToIds(self, emb_path):
        allNodeNames = self.getAllNodes(emb_path)
        
        print ("Converting words to ids...")
        # Map words to ids        
        self.tokenizer = Tokenizer(num_words=self.vocab_size, filters=[], lower=False, split=" ")
        self.tokenizer.fit_on_texts(allNodeNames)
        print ("Finished converting words to ids...")
    
    # Assumes that the tokenizer already has been fit on some text (for the time being the node vec names)
    def loadW2V(self, emb_path):
        print("Loading W2V data...")
        line_count = 0        
        
        for line in open(emb_path):
            l = line.strip().split()
            if (line_count == 0): # the first line -- supposed to be <vocab-size> <dimension>
                self.emb_dim = int(l[1])
                self.embedding_matrix = np.zeros((self.vocab_size, self.emb_dim))
            else:
                try:
                    st = l[0]
                    self.pre_emb[st] = np.asarray(l[1:]) # rest goes as the vector components
                    if st in self.tokenizer.word_index:
                        idx = self.tokenizer.word_index[st]
                        self.embedding_matrix[idx] = np.array(l[1:], dtype=np.float32)[:self.emb_dim]
                    else:
                        print ("Word '{}' not found in vocabulary..".format(st))
                except ValueError:
                    print ('Line {} is corrupt!'.format(line_count))
                
            line_count = line_count + 1
            
        print("loaded word2vec for {} nodes".format(len(self.pre_emb)))
    
    # Load the data as two matrices - X and Y
    def getTsvData(self, filepath):
        print("Loading data from " + filepath)
        x = []
        y = []
        
        # positive samples from file
        for line in open(filepath):
            l = line.strip().split("\t")            
            y.append(l[1])
            words = l[0].split(" ")
            x.append(words)
            #for w in words:
            #    x.append(w)
            
        return np.asarray(x), np.asarray(y)

    #Load data from tsv file with fold info
    def loadDataWithFolds(self, data_file, numClasses=0):
        x, y = self.getSequenceData(data_file, numClasses)
        return x, y
    
    # Build sequences from each data instance
    def getSequenceData(self, tsvDataPath, numClasses=0):
        x_text, y = self.getTsvData(tsvDataPath)
        
        # Convert each sentence (node name sequence) to a sequence of integer ids
        x = self.tokenizer.texts_to_sequences(x_text)
        #print (x)
        
        if (numClasses > 0):
            y = self.categorizeOutputs(y, numClasses)
        
        return x, np.asarray(y)
    
    def categorizeOutputs(self, y, numClasses):        
        y_scaled = []
        for y_i in y:
            #y_i = mapToUniformlySpacedIntervals(y_i, numClasses)
            y_i = mapToNonUniformlySpacedIntervals(y_i)
            y_scaled.append(y_i)
                
        return y_scaled

Set the global parameters. Set `NUM_CLASSES` to `0` if you want to run the regression flow, otherwise set this to the number of classes (outcome value ranges).

The function `convertWordsToIds` converts each word (e.g. `C:5579097:35.7`) into an id. 

In [205]:
import pandas as pd
from collections import Counter

def plotHistogram(valueList, caption):
    freqs = pd.Series(valueList).value_counts()
    #print (freqs)
    freqs.plot(kind='bar')
    plt.suptitle(caption)
    plt.show()

def ascii_histogram(seq, caption) -> None:
    print (caption)
    counted = Counter(seq)
    for k in sorted(counted):
        print('{0:5d} {1}'.format(k, '+' * counted[k]))    

In [206]:
#Load datasets and embedded vectors
from sklearn.model_selection import StratifiedKFold

def getSelectedData(x, y, indexes):
    x_sel = []
    y_sel = []

    for index in indexes:
        x_sel.append(x[index])
        y_sel.append(y[index])

    return np.asarray(x_sel), np.asarray(y_sel)
    

def getTrainTestFromFold(inpH, emb_file, x, y, train_indexes, test_indexes, numClasses):
    
    #Load the training and the test sets
    #Load the text as a sequence of inputs
    
    x_train, y_train = getSelectedData(x, y, train_indexes)
    x_test, y_test = getSelectedData(x, y, test_indexes)
    
    if numClasses > 0:
        plotHistogram(y_train, "Class labels in training fold")
        plotHistogram(y_test, "Class labels in test fold")
    
    if (numClasses > 0):    
        encoder = OneHotEncoder(sparse=False, categories='auto')
        
        #y_all = np.vstack((y_train, y_test))
        y_all = np.append(y_train, y_test)

        encoder.fit(y_all.reshape(-1, 1))

        y_train = encoder.transform(y_train.reshape(-1, 1))
        y_test = encoder.transform(y_test.reshape(-1, 1))
            
    #Print the loaded words
    nwords=0
    for w in inpH.pre_emb:
        print ("Dimension of vectors: {}".format(inpH.pre_emb[w].shape))
        print ("{} {}".format(w, inpH.pre_emb[w][0:5]))
        nwords = nwords+1
        if (nwords >= 2): break

    print ("vocab size: {}".format(inpH.vocab_size))
    print ("emb-matrix: {}...".format(inpH.embedding_matrix[1][:5]))
    print (inpH.embedding_matrix.shape)
    
    return x_train, y_train, x_test, y_test

The cross-validation based training function

In [207]:
from keras.layers import LSTM
from keras.layers import Conv1D, MaxPooling1D

OPTIMIZER='rmsprop'
ACTIVATION='sigmoid'
EPOCHS=30
HIDDEN_LAYER_DIM=50
DROPOUT=0.1
KERNEL_SIZE=5
POOL_SIZE=4
LSTM_DIM=64 # LSTM Encoding size
FILTER_SIZE=32

In [208]:
def buildLSTM_CNN(num_classes, vsize, input_dim, maxlen, emb_matrix):
    if (num_classes > 0):
        loss_fn = 'categorical_crossentropy'
        eval_metrics = ['accuracy']
        activation_fn = 'softmax'
        output_dim = num_classes
    else:
        loss_fn = rmse
        eval_metrics = [rmse]
        activation_fn = 'linear'
        output_dim = 1
    
    model = Sequential()
    model.add(layers.Embedding(input_dim=vsize, 
                               output_dim=input_dim, 
                               input_length=maxlen,
                               weights=[emb_matrix],
                               trainable=False))
    model.add(Dropout(DROPOUT))
    model.add(Conv1D(FILTER_SIZE,
                     KERNEL_SIZE,
                     padding='valid',
                     activation=ACTIVATION,
                     strides=1))
    model.add(MaxPooling1D(pool_size=POOL_SIZE))
    model.add(LSTM(LSTM_DIM))
    model.add(layers.Dense(output_dim, activation=activation_fn, name='output_vals'))
    
    model.compile(optimizer=OPTIMIZER,
                  loss = loss_fn,
                  metrics=eval_metrics)
    model.summary()
    return model



In [209]:
def buildModel(num_classes, vsize, input_dim, maxlen, emb_matrix):
    if (num_classes > 0):
        loss_fn = 'categorical_crossentropy'
        eval_metrics = ['accuracy']
        activation_fn = 'softmax'
        output_dim = num_classes
    else:
        loss_fn = rmse
        eval_metrics = [rmse]
        activation_fn = 'linear'
        output_dim = 1
    
    model = Sequential()
    model.add(layers.Embedding(input_dim=vsize, 
                               output_dim=input_dim, 
                               input_length=maxlen,
                               weights=[emb_matrix],
                               trainable=False))
    #model.add(layers.Flatten())
    model.add(LSTM(LSTM_DIM))
    #model.add(LSTM(32))
    #model.add(Dropout(DROPOUT))
    #model.add(layers.Dense(HIDDEN_LAYER_DIM, activation=ACTIVATION))
    #model.add(layers.Dense(20, activation=ACTIVATION))
    model.add(layers.Dense(output_dim, activation=activation_fn, name='output_vals'))
    model.compile(optimizer=OPTIMIZER,
                  loss = loss_fn,
                  metrics=eval_metrics)
    model.summary()
    return model


In [210]:
def trainModelOnFold(fold_number, model, x_train, y_train, x_test, y_test,
                     maxlen, num_classes=0, epochs=EPOCHS):
    
    x_train = pad_sequences(x_train, padding='post', maxlen=maxlen)
    x_test = pad_sequences(x_test, padding='post', maxlen=maxlen)
    
    BATCH_SIZE = int(len(x_train)/20) # 5% of the training set size
    
    print ("Training model...")
    model.fit(x_train, y_train,
        epochs=epochs,
        verbose=True,
        validation_split=0.1,
        batch_size=BATCH_SIZE)
    
    loss, accuracy = model.evaluate(x_test, y_test, verbose=True)
    if (num_classes > 0):
        print("Fold {}: Cross-entropy loss: {:.4f}, Accuracy: {:.4f}".format(fold_number, loss, accuracy))
    else:
        print("Fold {}: Loss: {:.4f}, RMSE: {:.4f}".format(fold_number, loss, accuracy))    
        
    y_preds = model.predict(x_test)
    
    if num_classes > 0:
        plotHistogram(convertSoftmaxToLabels(y_preds), "Distribution of predicted class labels in {}-th fold".format(fold_number))
        
    return accuracy

In [211]:
def trainModel(inpH, x, y, fold_info, emb_file, maxlen, num_classes=0, epochs=EPOCHS):
    i=0
    avg_metric_value = 0
    
    #Load the word vectors
    print ("Loading pre-trained vectors...")
    inpH.loadW2V(emb_file)
    
    print ("Building model...")
    model = buildModel(num_classes, inpH.vocab_size, inpH.emb_dim, maxlen, inpH.embedding_matrix)
    #model = buildLSTM_CNN(num_classes, inpH.vocab_size, inpH.emb_dim, maxlen, inpH.embedding_matrix)
    
    for train_indexes, test_indexes in fold_info.split(x, y):
        x_train, y_train, x_test, y_test = getTrainTestFromFold(
                    inpH, emb_file, x, y, train_indexes, test_indexes, num_classes)

        avg_metric_value = avg_metric_value + trainModelOnFold(i, model,
                                       x_train, y_train, x_test, y_test,
                                       maxlen, num_classes, epochs)
        i=i+1
    return avg_metric_value/float(i)

In [212]:
def convertSoftmaxToLabels(y_preds):
    labels=[]
    for i in range(y_preds.shape[0]):
        labels.append(np.argmax(y_preds[i]))
    print (labels)
    return labels    

In [213]:
def main(argv):
    NUM_CLASSES = 0 # set this to 0 for regression and a positive value for classification
    #NUM_CLASSES = 7 # set this to 0 for regression and a positive value for classification
    #DATA_FILE = "../sentences/all_withwords.tsv"
    #DATA_FILE = "../sentences/all_wordsonly.tsv"
    DATA_FILE = "../sentences/all_nodesonly.tsv"
    TO_ADD_VALUE = 0
    #For embedding file with concatenated word features use this
    #EMB_FILE = "../graphs/nodevecs/nodes_and_words.vec"
    #EMB_FILE = "../graphs/nodevecs/words_only.vec"
    #For node vectors only, use this - 
    EMB_FILE = "../graphs/nodevecs/refVecs.vec"
    MAXLEN=50
    FOLD=5
    
    try:
        opts, args = getopt.getopt(argv,"ha:d:n:", ["appendnumeric=", "datafile=", "nodevecs="])
    
        for opt, arg in opts:
            if opt == '-h':
                print ('NodeSequenceRegression.py -i <trainfile> -o <testfile> -a -n <nodevecs>')
                sys.exit()
            elif opt in ("-i", "--trainfile"):
                DATA_FILE = arg
            elif opt in ("-a", "--appendnumeric"):
                TO_ADD_VALUE = 1
            elif opt in ("-n", "--nodevecs"):
                EMB_FILE = arg
                
    except getopt.GetoptError:
        print ('usage: NodeSequenceRegression.py -d <datafile> -a -o <resfile> -n <nodevecs>')
            
    print ("Training file: %s" % (DATA_FILE))
    print ("Append numbers: %d" % (TO_ADD_VALUE))
    print ("Emb file: %s" % (EMB_FILE))

    inpH = InputHelper()
    inpH.convertWordsToIds(EMB_FILE)
    
    VAL_ADDED_EMBFILE = EMB_FILE
    embfileDir = os.path.dirname(os.path.realpath(EMB_FILE))

    if (TO_ADD_VALUE == 1):
        VAL_ADDED_EMBFILE = embfileDir + '/ndvecswithvals.vec' 
        add_value_feature(EMB_FILE, VAL_ADDED_EMBFILE)
    
    skf = StratifiedKFold(n_splits=FOLD, random_state=SEED)    
    x, y = inpH.loadDataWithFolds(DATA_FILE, numClasses=NUM_CLASSES)
    
    avg_metric_value_for_folds = trainModel(inpH, x, y, skf, VAL_ADDED_EMBFILE, maxlen=MAXLEN, num_classes=NUM_CLASSES)
    print ("Avg after {} folds: {}".format(FOLD, avg_metric_value_for_folds))

In [214]:
if __name__ == "__main__":
    main(sys.argv[1:])
    #main('-i ../sentences/train.tsv -e ../sentences/test.tsv -a -o predictions.txt -n ../graphs/nodevecs/refVecs.vec')

usage: NodeSequenceRegression.py -d <datafile> -a -o <resfile> -n <nodevecs>
Training file: ../sentences/all_nodesonly.tsv
Append numbers: 0
Emb file: ../graphs/nodevecs/refVecs.vec
Collecting node names...
Collected node names...
Converting words to ids...
Finished converting words to ids...
Loading data from ../sentences/all_nodesonly.tsv
Loading pre-trained vectors...
Loading W2V data...




Line 3492 is corrupt!
Line 3507 is corrupt!
Line 5805 is corrupt!
loaded word2vec for 8182 nodes
Building model...
Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 50, 128)           1047424   
_________________________________________________________________
lstm_9 (LSTM)                (None, 64)                49408     
_________________________________________________________________
output_vals (Dense)          (None, 1)                 65        
Total params: 1,096,897
Trainable params: 49,473
Non-trainable params: 1,047,424
_________________________________________________________________
Dimension of vectors: (128,)
C:5578602:Lower_caste_37.2 ['0.149615' '0.050254' '-0.006303' '-0.087948' '0.039424']
Dimension of vectors: (128,)
I:3675717:1 ['0.373619' '0.398837' '-0.081268' '0.143891' '0.308124']
vocab size: 8183
emb-matrix: [ 0.149615 



Train on 577 samples, validate on 65 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Fold 0: Loss: 12.1136, RMSE: 10.6970
Dimension of vectors: (128,)
C:5578602:Lower_caste_37.2 ['0.149615' '0.050254' '-0.006303' '-0.087948' '0.039424']
Dimension of vectors: (128,)
I:3675717:1 ['0.373619' '0.398837' '-0.081268' '0.143891' '0.308124']
vocab size: 8183
emb-matrix: [ 0.149615  0.050254 -0.006303 -0.087948  0.039424]...
(8183, 128)
Training model...
Train on 577 samples, validate on 65 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30

Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Fold 3: Loss: 13.5066, RMSE: 13.5066
Dimension of vectors: (128,)
C:5578602:Lower_caste_37.2 ['0.149615' '0.050254' '-0.006303' '-0.087948' '0.039424']
Dimension of vectors: (128,)
I:3675717:1 ['0.373619' '0.398837' '-0.081268' '0.143891' '0.308124']
vocab size: 8183
emb-matrix: [ 0.149615  0.050254 -0.006303 -0.087948  0.039424]...
(8183, 128)
Training model...
Train on 578 samples, validate on 65 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Fold 4: Loss: 11.7980, RMSE: 11.7980
Avg after 5 folds: 11.952856063842