# using a neural network to classify message categories

### identified message categories in generated training set:
- MCQ: multiple choice question or answer
- NONMCQ: content question or answer
- CONV: small talk/pleasantries, or any message unrelated to content
- START: session start marker
- END: session end marker
- SETUP: sent as part of setup

*Using Keras with TensorFlow as the backend is recommended.*

In [1]:
import pandas as pd
import numpy as np
import sklearn as sk
import keras

Using TensorFlow backend.


### Read in and prepare datasets: the generated training data and the entire set of unlabeled messages

In [2]:
labeled = pd.read_csv('train_bigger.csv', dtype={"from_2": object, "type": object, "category": object})

Due to time constraints, we did not assign a category to every row in the random sample drawn from the full dataframe. We therefore drop all uncategorized rows so that the "labeled" training set contains only labeled examples.

In [3]:
labeled.dropna(subset=['category'], inplace=True)

In [4]:
unlabeled = pd.read_csv('labeled_output.csv')

In [5]:
# drop rows with no message text; there is nothing to classify in them
unlabeled.dropna(subset=['text'], inplace=True)

In [6]:
# this column was used/generated for previous text analysis; we don't need it here
unlabeled.drop(columns=['text_type'], inplace=True)

We prepend one of our generated attributes, `from_2`, to the text data in the hope of improving the classifier's accuracy.

Once our other NN has been run on all the data, it would be good to include its generated column, `type`, in the same way.

In [7]:
unlabeled['text_2']=unlabeled['from_2'].astype(str) + " " + unlabeled['text'].astype(str)
labeled['text_2']=labeled['from_2'].astype(str) + " " + labeled['text'].astype(str)

In [8]:
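# basic information about the text data
# NOTE: each element of labeled['text'] is a whole message string, so the
# "unique words" figure below actually counts distinct messages, and the
# lengths below are measured in characters, not words
print("Categories:", np.unique(labeled['category']))
print("Number of unique words:", len(np.unique(np.hstack(labeled['text']))))

length = [len(i) for i in labeled['text']]
print("Average length:", np.mean(length))
print("max length:", np.max(length))
print("Standard Deviation:", round(np.std(length)))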

Categories: ['conv' 'end' 'mcq' 'non mcq' 'setup' 'start']
Number of unique words: 492
Average length: 45.246688741721854
max length: 821
Standard Deviation: 74.0
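
Because `np.hstack(labeled['text'])` yields one element per message, the figures above describe distinct messages and character lengths. Word-level statistics could be gathered along the lines of the minimal sketch below. It assumes a simple regex tokenizer; the `word_stats` helper and the sample messages are hypothetical and not part of the notebook.

```python
import re
from collections import Counter

def word_stats(messages):
    """Return vocabulary size and per-message token counts (illustrative helper)."""
    # lowercase each message and pull out runs of letters, digits and apostrophes
    token_lists = [re.findall(r"[a-z0-9']+", m.lower()) for m in messages]
    vocab = Counter(tok for toks in token_lists for tok in toks)
    lengths = [len(toks) for toks in token_lists]
    return {
        "unique_words": len(vocab),
        "max_tokens": max(lengths),
        "mean_tokens": sum(lengths) / len(lengths),
    }

# hypothetical sample; in the notebook this would be list(labeled['text'])
sample = ["Hello there!", "Is the answer B?", "hello again"]
summary = word_stats(sample)
print(summary)
```

If these numbers were used in place of the message-level ones, `num_words` and `maxlen` further down would change, and so would every recorded output.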


In [9]:
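# set up the labeled data and the unlabeled data we want to predict on
# (the val_* names hold the unlabeled messages, not a held-out validation split,
# so val_labels is all NaN)
labels=labeled['category']
unlabeled['category']=np.nan
val_labels=unlabeled['category']
docs=labeled['text_2']
val_docs=unlabeled['text_2']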

In [10]:
# Label encode the data so our categories are readable by a NN
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()
labels=le.fit_transform(labels)
print(labels.shape, val_labels.shape)

(604,) (1084520,)
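
`LabelEncoder` assigns integer ids to the sorted class names, and `le.inverse_transform` maps ids back to names. The sketch below mimics that mapping in plain Python, using the category names printed earlier. `encode` and `decode` are illustrative helpers, not sklearn functions.

```python
# category names as printed by np.unique(labeled['category']) above
categories = ['conv', 'end', 'mcq', 'non mcq', 'setup', 'start']

# ids follow the sorted order of the class names
class_to_id = {name: i for i, name in enumerate(sorted(categories))}
id_to_class = {i: name for name, i in class_to_id.items()}

def encode(labels):
    """Map category names to integer ids."""
    return [class_to_id[label] for label in labels]

def decode(ids):
    """Map integer ids (e.g. argmax of the softmax output) back to names."""
    return [id_to_class[i] for i in ids]

encoded = encode(['mcq', 'conv', 'start'])
print(encoded, decode(encoded))
```

Keeping the fitted encoder around matters later: predictions on the unlabeled messages come back as integers and need the same mapping to be readable.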


In [11]:
# optional one-hot encoding of labels; so far it doesn't seem to change accuracy either way
# num_classes = 6
# labels = keras.utils.to_categorical(labels, num_classes)
# val_labels = keras.utils.to_categorical(val_labels, num_classes)

### Tokenize text data so we can use an embedding layer

In [12]:
# use keras' text processing modules to create a "vocabulary" for our dataset 
from keras.preprocessing import sequence
from keras.preprocessing import text

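# NOTE: this counts distinct messages, not distinct words, so num_words is only a rough vocabulary cap
unique_words=len(np.unique(np.hstack(labeled['text'])))

# the backslash is escaped explicitly; the set of filtered characters is unchanged
tokenizer = text.Tokenizer(num_words=unique_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~')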
tokenizer.fit_on_texts(docs)

In [13]:
# convert tokenized texts to sequence vectors
docs = tokenizer.texts_to_sequences(docs)
val_docs = tokenizer.texts_to_sequences(val_docs)
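
To make the effect of `num_words` concrete, here is a simplified plain-Python emulation of what the tokenizer does: words are ranked by frequency starting at index 1, and only words with an index below `num_words` survive `texts_to_sequences`. This is a sketch of the behaviour, not Keras's implementation; the helper names and sample texts are hypothetical.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

def split_words(text):
    # lowercase, turn filtered punctuation into spaces, split on whitespace
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def fit_word_index(texts):
    # most frequent word gets index 1; ties keep first-seen order
    counts = Counter(w for t in texts for w in split_words(t))
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    # unknown words and words ranked at or beyond num_words are dropped
    return [[word_index[w] for w in split_words(t)
             if w in word_index and word_index[w] < num_words]
            for t in texts]

train = ["student hello!", "student what is B?", "tutor hello"]
word_index = fit_word_index(train)
seqs = to_sequences(["tutor hello what"], word_index, num_words=4)
print(word_index, seqs)
```

Words that the training texts never contained are silently skipped, which is worth keeping in mind when the fitted tokenizer is applied to the much larger unlabeled set.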

In [14]:
# pad sequence vectors so they're all the same length (the Embedding/Flatten/Dense stack expects fixed-length input)
docs = sequence.pad_sequences(docs, maxlen=821)
val_docs = sequence.pad_sequences(val_docs, maxlen=821)

In [15]:
print(docs.shape, val_docs.shape)

(604, 821) (1084520, 821)
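
With its default arguments, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, keeps the last `maxlen` items. The sketch below reproduces that behaviour in plain Python for illustration; `pad_pre` is a hypothetical helper and assumes `maxlen` is at least 1.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad with `value`; truncate from the front if a sequence is too long."""
    padded = []
    for seq in seqs:
        kept = list(seq)[-maxlen:]  # keep the last maxlen items
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

example = pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(example)
```

Since index 0 is reserved for padding, a short message padded to 821 positions is mostly zeros by the time it reaches the embedding layer.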


In [16]:
# split our labeled dataset into training and testing data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(docs, labels, random_state=42, test_size=0.2)

### testing our first neural network

*using an arbitrary selection of hyperparameters*

In [17]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import Flatten

In [18]:
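# setting some parameters (these equal the values hard-coded in the models below)

input_dim= unique_words  # vocabulary cap passed to the tokenizer (492)
length = [len(i) for i in labeled['text']]
input_length= np.max(length)  # longest message in characters (821), same as maxlen used for padding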

In [None]:
model=Sequential()
model.add(Embedding(input_dim=492,
                    output_dim=128,
                    input_length=821))
model.add(Flatten())
model.add(Dense(604, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(6, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='Adam',metrics=['acc'] )

In [None]:
model.fit(x_train, y_train, epochs=20, verbose=0)

In [None]:
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)

In [None]:
print(f'test loss:{loss} \n test accuracy:{accuracy}')

Hmm, that accuracy is suspiciously high, and we're not yet sure why.
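
One cheap sanity check for a surprising accuracy is to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most common training label. The sketch below assumes integer labels like those produced by the label encoder; `majority_baseline` and the sample labels are hypothetical.

```python
from collections import Counter

def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for y in y_test if y == majority)
    return majority, hits / len(y_test)

# hypothetical labels; in the notebook use list(y_train) and list(y_test)
majority, baseline_acc = majority_baseline([0, 0, 0, 2, 3], [0, 0, 2, 5])
print(majority, baseline_acc)
```

A test accuracy close to this baseline suggests the network has learned little beyond the class frequencies. One far above it on only 121 test rows is worth checking for duplicated messages shared between the train and test splits.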

## hyperparameter optimization

using Keras's scikit-learn wrapper (`KerasClassifier`) with sklearn's `RandomizedSearchCV` and scipy.stats distributions to perform a random search

you have to run this on a compute instance for it to be at all computable

#### round one hyperparameters:
- hidden layers
- neurons
- input neurons
- dropout layers
- dropout rate
- weight constraint

In [21]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import Flatten
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV as RS
from scipy import stats
from keras.constraints import maxnorm

In [None]:
def create_model(dropout_rate=0.0, weight_constraint=0, hidden_layers=1, neurons=1, input_neurons=1, dropout_layers=1, embedding=1):
    # NOTE: `embedding` is accepted but unused here; output_dim is fixed at 128
    model=Sequential()
    model.add(Embedding(input_dim=492,
                       output_dim=128,
                       input_length=821))
    model.add(Flatten())
    model.add(Dense(input_neurons, activation='relu'))
    # NOTE: this adds (dropout_layers - hidden_layers) Dense+Dropout blocks,
    # and none at all when hidden_layers >= dropout_layers
    for i in range(hidden_layers, dropout_layers):
        model.add(Dense(neurons, activation='relu', kernel_constraint=maxnorm(weight_constraint)))
        model.add(Dropout(dropout_rate))
    model.add(Dense(6, activation='softmax'))

    model.compile(loss='sparse_categorical_crossentropy', optimizer='Adam', metrics=['acc'])
    return model
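
The loop bound `range(hidden_layers, dropout_layers)` deserves a closer look, because the number of Dense+Dropout blocks it adds is the difference between the two arguments rather than either one. The sketch below counts the blocks for a few parameter pairs, including the defaults used by the later model functions. `plan_layers` then shows one possible alternative reading of the two parameters; it is an assumption about intent, not the notebook's method.

```python
def blocks_as_written(hidden_layers, dropout_layers):
    """Number of Dense+Dropout blocks the loop above actually adds."""
    return len(range(hidden_layers, dropout_layers))

def plan_layers(hidden_layers, dropout_layers):
    """Alternative reading (assumption): `hidden_layers` Dense layers,
    the first `dropout_layers` of which are followed by Dropout."""
    plan = []
    for i in range(hidden_layers):
        plan.append("Dense")
        if i < dropout_layers:
            plan.append("Dropout")
    return plan

for pair in [(1, 16), (11, 5), (15, 12)]:
    print(pair, blocks_as_written(*pair))
print(plan_layers(3, 1))
```

Any hyperparameters selected by the search reflect the loop as written, so changing the loop would mean rerunning the search rather than reusing earlier results.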

In [None]:
params={'input_neurons': stats.randint(1,128),
        'neurons': stats.randint(1,128),
        'hidden_layers': stats.randint(1,16),
        'dropout_layers': stats.randint(1,16),
        'dropout_rate': stats.uniform(0,0.9),
        'weight_constraint': stats.randint(1,5),
       }
n_iter=8
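
For reference, `stats.randint(low, high)` draws integers from `low` up to `high - 1`, and `stats.uniform(loc, scale)` draws from the interval `[loc, loc + scale]`. The sketch below imitates drawing `n_iter` candidate settings with the standard library's `random` module instead of scipy; `sample_params` and the seed are illustrative choices, not part of the notebook.

```python
import random

def sample_params(rng):
    """Draw one candidate from ranges matching the `params` dict above."""
    return {
        'input_neurons': rng.randrange(1, 128),    # 1..127
        'neurons': rng.randrange(1, 128),          # 1..127
        'hidden_layers': rng.randrange(1, 16),     # 1..15
        'dropout_layers': rng.randrange(1, 16),    # 1..15
        'dropout_rate': rng.uniform(0.0, 0.9),     # 0.0..0.9
        'weight_constraint': rng.randrange(1, 5),  # 1..4
    }

rng = random.Random(0)  # fixed seed so the draw is repeatable
candidates = [sample_params(rng) for _ in range(8)]
for candidate in candidates:
    print(candidate)
```

With `n_iter=8` and `cv=4`, the search trains 32 models plus one final refit on the best setting, which is why it is expensive.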

In [None]:
model = KerasClassifier(build_fn=create_model, verbose=0, shuffle=True)

In [None]:
rand = RS(estimator=model, param_distributions=params, n_jobs=-1, cv=4, n_iter=n_iter)

In [None]:
rand_search = rand.fit(x_train, y_train)

Below is a utility function to report the best scores (*from Kaggle: https://www.kaggle.com/ksjpswaroop/parameter-tuning-rf-randomized-search*).

In [None]:
def report(results, n_top=5):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [None]:
report(rand_search.cv_results_)

#### using round one hyperparameters to search for batch size and epoch

In [None]:
def create_model_2(dropout_rate=0.216, weight_constraint=1.1489, hidden_layers=11, neurons=64, input_neurons=98, dropout_layers=5, embedding=168, batch_size=1, epochs=1):
    # batch_size and epochs are not used when building the model;
    # KerasClassifier forwards them to fit() during the search
    model2=Sequential()
    model2.add(Embedding(input_dim=492,
                       output_dim=embedding,
                       input_length=821))
    model2.add(Flatten())
    model2.add(Dense(input_neurons, activation='relu'))
    # NOTE: with the defaults above (hidden_layers=11, dropout_layers=5) this range is empty,
    # so no hidden Dense/Dropout blocks are added
    for i in range(hidden_layers, dropout_layers):
        model2.add(Dense(neurons, activation='relu', kernel_constraint=maxnorm(weight_constraint)))
        model2.add(Dropout(dropout_rate))
    model2.add(Dense(6, activation='softmax'))

    model2.compile(loss='sparse_categorical_crossentropy', optimizer='Adam', metrics=['acc'])
    return model2

In [None]:
params2={'batch_size': stats.randint(1,128),
         'epochs': stats.randint(1,64)}
n_iter=20

In [None]:
model2 = KerasClassifier(build_fn=create_model_2, verbose=0, shuffle=True)

In [None]:
rand2 = RS(estimator=model2, param_distributions=params2, n_jobs=-1, cv=4, n_iter=n_iter)

In [None]:
rand_search2 = rand2.fit(x_train, y_train)

In [None]:
report(rand_search2.cv_results_)

## create a final model with best-performing hyperparameters

In [19]:
def create_model_3(dropout_rate=0.0747, weight_constraint=5.926, hidden_layers=15, neurons=1, input_neurons=31, dropout_layers=12, embedding=19):
    model3=Sequential()
    model3.add(Embedding(input_dim=492,
                       output_dim=embedding,
                       input_length=821))
    model3.add(Flatten())
    model3.add(Dense(input_neurons, activation='relu'))
    # NOTE: with the defaults above (hidden_layers=15, dropout_layers=12) this range is empty,
    # so the trained model is Embedding -> Flatten -> Dense(31) -> Dense(6)
    for i in range(hidden_layers, dropout_layers):
        model3.add(Dense(neurons, activation='relu', kernel_constraint=maxnorm(weight_constraint)))
        model3.add(Dropout(dropout_rate))
    model3.add(Dense(6, activation='softmax'))

    model3.compile(loss='sparse_categorical_crossentropy', optimizer='Adam', metrics=['acc'])
    return model3

In [22]:
model3 = KerasClassifier(build_fn=create_model_3, verbose=0, shuffle=True)

In [23]:
callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)]
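
With `patience=2`, training stops once `val_loss` has failed to improve on its best value for two consecutive epochs. The sketch below is a simplified emulation of that rule, not the Keras implementation. It assumes the default `min_delta` of 0, and `early_stop_epoch` is a hypothetical helper.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss  # new best value resets the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# a single bad epoch followed by an improvement does not trigger a stop
recovers = early_stop_epoch([1.3140, 1.3088, 1.2824, 1.2581, 1.2432, 1.2582, 1.2278])
# two bad epochs in a row do
stops = early_stop_epoch([1.0, 0.9, 0.95, 0.96, 0.8])
print(recovers, stops)
```

A small patience on a dataset this size can cut training short on noise alone, so the stopping epoch is worth reading alongside the loss curve.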

In [25]:
history = model3.fit(
            x_train,
            y_train,
            batch_size=54, 
            epochs=62,
            callbacks=callbacks,
            validation_data=(x_test, y_test),  # the test split doubles as the early-stopping validation set
            verbose=2,  # Logs once per epoch.
            )

Train on 483 samples, validate on 121 samples
Epoch 1/62
 - 0s - loss: 1.5718 - acc: 0.4306 - val_loss: 1.3140 - val_acc: 0.5537
Epoch 2/62
 - 0s - loss: 1.4222 - acc: 0.4928 - val_loss: 1.3088 - val_acc: 0.5537
Epoch 3/62
 - 0s - loss: 1.3783 - acc: 0.4928 - val_loss: 1.2824 - val_acc: 0.5537
Epoch 4/62
 - 0s - loss: 1.3619 - acc: 0.4928 - val_loss: 1.2581 - val_acc: 0.5537
Epoch 5/62
 - 0s - loss: 1.3716 - acc: 0.4928 - val_loss: 1.2432 - val_acc: 0.5537
Epoch 6/62
 - 0s - loss: 1.3603 - acc: 0.4969 - val_loss: 1.2582 - val_acc: 0.5537
Epoch 7/62
 - 0s - loss: 1.3324 - acc: 0.4928 - val_loss: 1.2278 - val_acc: 0.5537
Epoch 8/62
 - 0s - loss: 1.3120 - acc: 0.4948 - val_loss: 1.2054 - val_acc: 0.5537
Epoch 9/62
 - 0s - loss: 1.2931 - acc: 0.5135 - val_loss: 1.1888 - val_acc: 0.5702
Epoch 10/62
 - 0s - loss: 1.2667 - acc: 0.5466 - val_loss: 1.1604 - val_acc: 0.6116
Epoch 11/62
 - 0s - loss: 1.2342 - acc: 0.5528 - val_loss: 1.1334 - val_acc: 0.6281
Epoch 12/62
 - 0s - loss: 1.2083 - acc:

In [26]:
history = history.history
print('Validation accuracy: {acc}, loss: {loss}'.format(acc=history['val_acc'][-1], loss=history['val_loss'][-1]))

Validation accuracy: 0.7851239797497583, loss: 0.7525272270864691
