# Character Level Deep Learning for Text Classification

**Author:** Kyle Hundman (Kyle.A.Hundman@jpl.nasa.gov)<br>
**Role:** Data Scientist, NASA Jet Propulsion Laboratory <br>
**Work Sponsored By:** DARPA MEMEX Program <br>

## Overview

A character-level deep learning model that classifies online escort ads as 'warranting human-trafficking investigation' or not. The code was written during the DARPA Memex summer hack (2016). The model design is largely derived from Stathis Vafeias' blog post (https://offbit.github.io/how-to-read/), with inspiration from several other sources:

- http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://arxiv.org/pdf/1502.01710v5.pdf
- https://offbit.github.io/how-to-read/
- http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf
- http://arxiv.org/pdf/1506.02078v2.pdf

A character-level model was appealing for this problem because crawled web content is inherently disorganized, and escort ads especially so. Because the model learns word permutations at the character level, very little cleaning (stemming, spell-checking, tokenization) is necessary; we can let the model infer vocabulary and meaning. In general, character-level models are also attractive for textual data because of their lower dimensionality (40-100 character encodings vs. an arbitrarily large vocabulary size).

## Data

Talk about clustering (why clustered), JSON structure

## Organization (ToDo)

Setup<br>
Data Preprocessing <br>
...

## Setup 
- **Python 3**
- Using **Keras** with **TensorFlow (0.9)** backend
- To switch between Theano and TensorFlow, change the backend name in the config file: ```~/.keras/keras.json```
- Ran on a MacBook Pro NVIDIA GPU using CUDA and cuDNN (http://blog.wenhaolee.com/run-keras-on-mac-os-with-gpu/)
- Outputs from all intermediate preprocessing steps are written to the **```model_dir```** directory
- Specify the name of saved model weights as **```checkpoint```** to resume training a prior model
    - model weights should be located in the **```checkpoints/```** dir
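
The ToDos in the first code cell mention creating `model_dir` and its `checkpoints/` folder automatically. One way that could look is sketched below. The helper name `ensure_model_dirs`, the `base` argument, and the date-stamped default name are assumptions for illustration, not part of the original notebook.

```python
import os
import time

def ensure_model_dirs(model_dir=None, base="."):
    """Create model_dir and model_dir/checkpoints if missing (hypothetical helper)."""
    if model_dir is None:
        # assumption: fall back to a date-stamped name such as "07-30-16"
        model_dir = time.strftime("%m-%d-%y")
    path = os.path.join(base, model_dir)
    # exist_ok=True makes the call safe to repeat when re-running the notebook
    os.makedirs(os.path.join(path, "checkpoints"), exist_ok=True)
    return path
```

Calling `ensure_model_dirs("7-30-16")` before the preprocessing cells would leave an existing directory untouched and create a missing one.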

In [3]:
%matplotlib inline
import pandas as pd

from keras.models import Model
from keras.layers import Dense, Activation, Flatten, Input, Dropout, MaxPooling1D, Convolution1D
from keras.layers import LSTM, Lambda, merge
from keras.layers import Embedding, TimeDistributed
import numpy as np
import tensorflow as tf
import re
from keras import backend as K
import keras.callbacks
import sys
import os
import json
import ujson
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from pylab import rcParams

# ToDos
# ===================
# Create model_dir automatically if not supplied (use time)
# If model_dir doesn't exist create it
# If checkpoints folder doesn't exist in dir, create it

model_dir = "7-30-16"
checkpoint = "__main__.00-0.38.hdf5"

Using TensorFlow backend.


## Preprocessing

### String cleaning
- HTML and newline / carriage return characters are already stripped from the text

In [62]:
def clean_str(string):
    """
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()

### CONSIDER NOT CLEANING STRING

### Build cluster assignment dictionaries
- The training/test split is done at the cluster level to minimize overlap of phone numbers and similar ads between the two sets
- The lookup built here allows easy retrieval of ads (by ID) for each cluster once the clusters are shuffled for the training/test split

In [16]:
def build_id_lookup(infile, outfile, limit=10000000):
    """
    Write a dictionary of cluster -> doc_id assignments to outfile
    :param infile: raw data (each line is a JSON doc representing an ad)
    :param outfile: {"<cluster_id1>" : ["<doc_id1>", "<doc_id2>", ...], "<cluster_id2>" : ["<doc_id3>", "<doc_id4>", ...]}
    :param limit: maximum number of ads (lines) to read from infile
    """
    clusters = {}
    with open(os.path.join(model_dir, outfile), "w") as out:
        with open(os.path.join(model_dir,infile)) as f:
            for idx,ad in enumerate(f):
                if idx < limit:
                    ad = json.loads(ad)
                    if ad["cluster_id"] in clusters:
                        clusters[ad["cluster_id"]].append(ad["doc_id"])
                    else:
                        clusters[ad["cluster_id"]] = [ad["doc_id"]]
                else:
                    break
            out.write(json.dumps(clusters, sort_keys=True, indent=4))
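
The same cluster-to-ads grouping can be sketched on in-memory data with `collections.defaultdict`. The helper name `group_ids_by_cluster` and the sample records are illustrative only; the field names follow the `cluster_id` / `doc_id` keys used above.

```python
import json
from collections import defaultdict

def group_ids_by_cluster(lines, limit=None):
    """Group doc_ids by cluster_id from an iterable of JSON lines (hypothetical helper)."""
    clusters = defaultdict(list)
    for idx, line in enumerate(lines):
        if limit is not None and idx >= limit:
            break
        ad = json.loads(line)
        # defaultdict creates the empty list the first time a cluster is seen
        clusters[ad["cluster_id"]].append(ad["doc_id"])
    return dict(clusters)

# made-up records, only to show the resulting structure
sample_lines = [
    '{"cluster_id": "c1", "doc_id": "d1"}',
    '{"cluster_id": "c2", "doc_id": "d2"}',
    '{"cluster_id": "c1", "doc_id": "d3"}',
]
lookup = group_ids_by_cluster(sample_lines)
```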

In [35]:
build_id_lookup("cp1_positives.json", "training_pos_ids")

In [36]:
build_id_lookup("CP1_negatives_all_clusters.json", "training_neg_ids")

### Randomly Sample Cluster IDs from positive and negative training
- Sampling done separately for positive and negative ads because they were clustered and processed at different times, resulting in overlapping cluster_ids

In [37]:
def read_clusters(file):
    """
    Read in clusters and doc_ids generated from build_id_lookup function
    """
    with open(os.path.join(model_dir, file)) as f:
        # the lookup file is plain JSON, so parse it with json rather than eval
        return json.load(f)

pos_clusters_lookup = read_clusters("training_pos_ids")
neg_clusters_lookup = read_clusters("training_neg_ids")

pos_indices = np.arange(len(pos_clusters_lookup.keys()))
neg_indices = np.arange(len(neg_clusters_lookup.keys()))

np.random.shuffle(pos_indices)
np.random.shuffle(neg_indices)

pos_clusters = np.array(list(pos_clusters_lookup.keys()))[pos_indices]
neg_clusters = np.array(list(neg_clusters_lookup.keys()))[neg_indices]

pos_train_clusters = pos_clusters[:int(round(len(pos_clusters)*.7, 0))]
neg_train_clusters = neg_clusters[:int(round(len(neg_clusters)*.7, 0))]

pos_test_clusters = pos_clusters[int(round(len(pos_clusters)*.7, 0)):]
neg_test_clusters = neg_clusters[int(round(len(neg_clusters)*.7, 0)):]

with open(os.path.join(model_dir, "cluster_splits"), "w") as out:
    splits = {}
    splits["pos_train"] = list(pos_train_clusters)
    splits["neg_train"] = list(neg_train_clusters)
    splits["pos_test"] = list(pos_test_clusters)
    splits["neg_test"] = list(neg_test_clusters)
    out.write(json.dumps(splits, indent=4))
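
Because the shuffle above is unseeded, re-running the cell produces a different split each time. A reproducible variant is sketched below. The helper name `split_clusters`, the seed value, and the toy cluster IDs are assumptions for illustration; the 70/30 fraction matches the cell above.

```python
import random

def split_clusters(cluster_ids, train_fraction=0.7, seed=0):
    """Shuffle cluster IDs with a fixed seed and split them into train/test lists."""
    ids = sorted(cluster_ids)            # fixed starting order, independent of dict ordering
    random.Random(seed).shuffle(ids)     # private RNG, so global random state is untouched
    cut = int(round(len(ids) * train_fraction))
    return ids[:cut], ids[cut:]

# toy cluster IDs, only to show the call
train_ids, test_ids = split_clusters(["c%d" % i for i in range(10)])
```

Splitting whole clusters this way keeps every ad of a cluster on one side of the split.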

### Retrieve ad text based on cluster train/test splits

In [38]:
def get_cluster_ad_text(json_file, outfiles, cluster_id_lists):
    """
    Get all ads associated with a set of clusters (train or test set), write to file
    :param json_file: original ad data
    TODO: allow loading of previous cluster splits to avoid resampling clusters
    if re-running steps from here down
    """
    with open(os.path.join(model_dir, outfiles[0]), "w") as out0:
        with open(os.path.join(model_dir, outfiles[1]), "w") as out1:
            with open(os.path.join(model_dir, json_file), "r") as f:
                
                for ad in f.readlines():
                    ad = ujson.loads(ad)
                    if ad["cluster_id"] in cluster_id_lists[0]:
                        out0.write(re.sub("(\r|\n|\t)", " ", clean_str(ad["extracted_text"])) + "\n")
                    elif ad["cluster_id"] in cluster_id_lists[1]:
                        out1.write(re.sub("(\r|\n|\t)", " ", clean_str(ad["extracted_text"])) + "\n")

In [39]:
get_cluster_ad_text("cp1_positives.json", ["pos_training", "pos_test"], [pos_train_clusters, pos_test_clusters])

In [40]:
get_cluster_ad_text("cp1_negatives_all_clusters.json", ["neg_training", "neg_test"], [neg_train_clusters, neg_test_clusters])

### Load training/test ads, split sentences, assign labels
- Split ads into quasi-sentences (helps speed up processing) -> this could be a lot better
- Join together docs (ads) and labels

In [2]:
train_docs, test_docs = [], []
train_sentences, test_sentences = [], []
train_classes, test_classes = [], []

In [4]:
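import ast

# sentence boundary: whitespace that follows ., ?, ! or ) (skipping common abbreviation patterns)
SENTENCE_SPLIT = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|\))\s')

def build_lists(file, docs, sentences, classes=None, label=None):
    """
    Split each ad into lowercased quasi-sentences and append the result to docs.
    Unlabeled files hold one list literal per line with the ad text at index 1;
    labeled files hold one cleaned ad per line, and label is appended to classes.
    (sentences is kept for call compatibility; nothing is appended to it.)
    """
    with open(os.path.join(model_dir, file)) as f:
        for line in f:
            # literal_eval parses the list literal without executing arbitrary code
            text = line if label is not None else ast.literal_eval(line)[1]
            docs.append([sent.lower() for sent in SENTENCE_SPLIT.split(text)])
            if label is not None:
                classes.append(label)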

In [4]:
build_lists("pos_training", train_docs, train_sentences, train_classes, 1)
build_lists("neg_training", train_docs, train_sentences, train_classes, 0)

build_lists("pos_test", test_docs, test_sentences, test_classes, 1)
build_lists("neg_test", test_docs, test_sentences, test_classes, 0)

### Character quantization (encodings)

In [44]:
txt = ''

def get_chars(docs, txt):
    for doc in docs:
        for s in doc:
            txt += s
    return txt

txt = get_chars(train_docs, txt)
txt = get_chars(test_docs, txt)
        
chars = set(txt)
print('total chars: ', len(chars))
char_indices = dict((c,i) for i,c in enumerate(chars))

print(json.dumps(char_indices, indent=4, sort_keys=True))

with open(os.path.join(model_dir,"encodings"), "w") as out:
    out.write(json.dumps(char_indices, indent=4, sort_keys=True))

total chars:  46
{
    "\n": 24,
    " ": 38,
    "!": 25,
    "'": 18,
    "(": 41,
    ")": 29,
    ",": 23,
    "0": 37,
    "1": 39,
    "2": 14,
    "3": 43,
    "4": 27,
    "5": 16,
    "6": 21,
    "7": 11,
    "8": 15,
    "9": 33,
    "?": 1,
    "\\": 5,
    "`": 30,
    "a": 10,
    "b": 17,
    "c": 6,
    "d": 2,
    "e": 32,
    "f": 42,
    "g": 12,
    "h": 31,
    "i": 19,
    "j": 13,
    "k": 44,
    "l": 4,
    "m": 8,
    "n": 22,
    "o": 34,
    "p": 20,
    "q": 7,
    "r": 9,
    "s": 36,
    "t": 35,
    "u": 45,
    "v": 3,
    "w": 28,
    "x": 40,
    "y": 26,
    "z": 0
}
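
The indices above come from iterating over a `set`, whose order for strings can change between Python runs, so the mapping is only stable because it is saved to the `encodings` file and reloaded. A sketch of a deterministic alternative that assigns indices in sorted character order follows. The helper name `build_char_indices` and the toy documents are assumptions for illustration.

```python
def build_char_indices(docs):
    """Map each distinct character to an index in sorted order (hypothetical helper)."""
    chars = sorted({char for doc in docs for sentence in doc for char in sentence})
    return {char: index for index, char in enumerate(chars)}

# toy documents: each doc is a list of quasi-sentences, as produced by build_lists
toy_docs = [["hi !", "call me"], ["new in town"]]
toy_indices = build_char_indices(toy_docs)
```

With this variant, rebuilding the encodings from the same corpus always yields the same mapping.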


### Character quantization and input trimming
<a href="https://arxiv.org/pdf/1502.01710v5.pdf">Zhang et al.</a> make the interesting observation that quantization is visually similar to Braille, and that humans can learn to read binary encodings of languages, which suggests some legitimacy for the approach.

- Sentence lengths are trimmed and sentences per ad are limited (the model needs consistent input dimensions)
- Due to the difficulty of splitting sentences in these ads (and in general), we assume that the most salient information is in the first X characters and first X sentences of each ad. Longer sentences, or sentences toward the end of an ad, often repeat locations, names, or phone numbers, which we don't want the model to learn from and become biased toward.

In [5]:
# max number of characters allowed in a sentence, any additional are thrown out
maxlen = 512

# max sentences allowed in a doc, any additional are thrown out
max_sentences = 15

def load_char_encodings():
    with open(os.path.join(model_dir, "encodings"), "r") as f:
        # the encodings file is plain JSON, so parse it with json rather than eval
        return json.load(f)

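def encode_and_trim(docs, X, maxlen, max_sentences):
    """
    Fill X (initialized to -1) with character encodings in reverse order, right-aligned: the first
    character of a sentence lands in the last position, so -1s toward the beginning indicate the
    sentence was shorter than maxlen. Sentences longer than maxlen keep their last maxlen
    characters (sentence[-maxlen:]); sentences beyond max_sentences are skipped.
    """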
    for i, doc in enumerate(docs):
        for j, sentence in enumerate(doc):
            if j < max_sentences:
                for t, char in enumerate(sentence[-maxlen:]):
                    X[i, j, (maxlen-1-t)] = char_indices[char]
    return X

char_indices = load_char_encodings()
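
To see the layout `encode_and_trim` produces, the sketch below encodes single sentences with a tiny `maxlen`. The helper name `encode_sentence` and the three-character toy encoding are assumptions for illustration; the indexing mirrors the loop above.

```python
import numpy as np

def encode_sentence(sentence, char_indices, maxlen):
    """Encode one sentence the way encode_and_trim fills a row (hypothetical helper)."""
    row = np.full(maxlen, -1, dtype=np.int64)   # -1 marks unused positions
    for t, char in enumerate(sentence[-maxlen:]):
        # characters are written from the last position backwards
        row[maxlen - 1 - t] = char_indices[char]
    return row

toy_encoding = {"a": 0, "b": 1, "c": 2}
short_row = encode_sentence("ab", toy_encoding, maxlen=4)      # [-1, -1, 1, 0]
long_row = encode_sentence("abcabc", toy_encoding, maxlen=4)   # [2, 1, 0, 2]
```

The short sentence is reversed and padded with -1 at the front, while the long one keeps only its last four characters ("cabc"), also reversed.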

In [5]:
# Create array for vector representation of chars (512D). Filled with -1 initially
X_train = np.ones((len(train_docs), max_sentences, maxlen), dtype=np.int64) * -1
X_test = np.ones((len(test_docs), max_sentences, maxlen), dtype=np.int64) * -1

# create array of class labels
y_train = np.array(train_classes)
y_test = np.array(test_classes)

X_train = encode_and_trim(train_docs, X_train, maxlen, max_sentences)
X_test = encode_and_trim(test_docs, X_test, maxlen, max_sentences)

print('Sample chars in X:{}'.format(X_train[20, 2]))
print('y:{}'.format(y_train[12]))

Sample chars in X:[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 

### Record training history (losses and accuracies)

In [6]:
class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.losses = []
        self.accuracies = []

    def on_batch_end(self, batch, logs={}):
        self.losses.append(logs.get('loss'))
        self.accuracies.append(logs.get('acc'))

### Keras / Tensorflow helpers

In [6]:
def binarize(x, sz=46):
    """
    Used in creation of Lambda layer to create a one-hot encoding of sentence characters on the fly.
    x : integer tensor of character indices, shape (batch size, maximum sentence length)
    sz : number of unique characters in the corpus
    Padding positions hold -1, which tf.one_hot maps to an all-zero vector.
    tf.to_float casts a tensor to type "float32"
    """
    one_hot = tf.one_hot(x, sz, on_value=1, off_value=0, axis=-1)
    return tf.to_float(one_hot)


def binarize_outshape(in_shape):
    """
    Output shape of the binarize Lambda layer:
    (batch size, sentence length, number of unique characters)
    """
    return in_shape[0], in_shape[1], 46


def max_1d(x):
    """
    Maximum of x over axis 1
    """
    return K.max(x, axis=1)
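
The effect of `binarize` on padded input can be illustrated without TensorFlow. The sketch below is a NumPy stand-in for the one-hot step, not the `tf.one_hot` call itself; the helper name `one_hot_np` and the toy indices are assumptions for illustration.

```python
import numpy as np

def one_hot_np(indices, depth):
    """One-hot encode integer indices along a new last axis (NumPy stand-in)."""
    indices = np.asarray(indices)
    # comparing against 0..depth-1 leaves out-of-range indices (such as -1) as all-zero rows
    return (indices[..., None] == np.arange(depth)).astype(np.float32)

encoded = one_hot_np([-1, 0, 2], depth=3)
# rows: padding -> [0, 0, 0], index 0 -> [1, 0, 0], index 2 -> [0, 0, 1]
```

The padding value -1 therefore contributes nothing to the convolution input, which is why it works as a fill value for short sentences.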

### Sentence encoder

- Assign input shapes for sentences and documents

- **```Lambda```** is used for evaluating a tf expression on the output of a previous layer (in this case the sentence encodings)


- **```Convolution```** layers: the theory is that, much as convolutional NNs learn hierarchical image features for computer vision, they can learn hierarchical representations of words, phrases, and sentences from text
    - **```Dropouts```** are lower than in most research (usually ~0.5), which could cause generalization problems (dropout prevents the model from learning a dependence on certain features)
    - **```Max pooling```** picks the strongest local features to summarize a patch of text and encourage sparsity, which allows for deeper models
        - A higher **```pool_length```** reduces training time in the convolution layers, but would likely have little effect on overall training time
    - **```border_mode```** set to 'valid' means no padding is added and the output is smaller than the input (the convolution is only computed where the input and filter fully overlap)
        - This helps manage the dimensionality passed to hidden layers (easier training), and padding doesn't really make sense for textual problems


- **```LSTM```**: the first two LSTM layers here read the "word" sequence and positions (an LSTM allows "memories" to be stored and activated along sequences when needed; in this portion of the model the sequences are sentences)
    - A nice property of LSTMs (and RNNs in general) is that input data can be arbitrarily sized, allowing the user to determine dimensions
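
To show how 'valid' convolutions and max pooling shrink the sequence before it reaches the LSTMs, the arithmetic can be sketched as below. It assumes stride-1 convolutions and a pooling stride equal to `pool_length`; the helper name `conv_pool_output_length` is hypothetical. The input values are the ones used in this notebook's sentence encoder (512 characters, filter lengths 5, 3, 3, pool length 2).

```python
def conv_pool_output_length(length, filter_lengths, pool_length):
    """Sequence length after alternating 'valid' convolutions and max pooling."""
    for filter_length in filter_lengths:
        length = length - filter_length + 1   # 'valid' convolution, stride 1
        length = length // pool_length        # non-overlapping max pooling
    return length

# 512 -> 508 -> 254 -> 252 -> 126 -> 124 -> 62
steps = conv_pool_output_length(512, [5, 3, 3], 2)
```

Under these assumptions each sentence reaches the LSTM layers as 62 time steps rather than 512 characters.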


In [7]:
# max number of characters allowed in a sentence, any additional are thrown out
maxlen = 512

# max sentences allowed in a doc, any additional are thrown out
max_sentences = 15

with tf.device("/gpu:0"): #using Macbook Pro NVIDIA GPU
    filter_length = [5, 3, 3]
    nb_filter = [196, 196, 256]
    pool_length = 2

    # document input -> 15 x 512
    document = Input(shape=(max_sentences, maxlen), dtype='int64')

    # sentence input -> 512,
    in_sentence = Input(shape=(maxlen,), dtype='int64')

    # binarize function creates a onehot encoding of each character index
    embedded = Lambda(binarize, output_shape=binarize_outshape)(in_sentence)


    for i in range(len(nb_filter)):
        embedded = Convolution1D(nb_filter=nb_filter[i],
                                filter_length=filter_length[i],
                                border_mode='valid',
                                activation='relu',
                                init='glorot_normal',
                                subsample_length=1)(embedded)

        embedded = Dropout(0.1)(embedded)
        embedded = MaxPooling1D(pool_length=pool_length)(embedded)

    forward_sent = LSTM(128, return_sequences=False, dropout_W=0.2, dropout_U=0.2, consume_less='gpu')(embedded)
    backward_sent = LSTM(128, return_sequences=False, dropout_W=0.2, dropout_U=0.2, consume_less='gpu', go_backwards=True)(embedded)

    sent_encode = merge([forward_sent, backward_sent], mode='concat', concat_axis=-1)
    sent_encode = Dropout(0.3)(sent_encode)

    encoder = Model(input=in_sentence, output=sent_encode)
    encoded = TimeDistributed(encoder)(document)

### Document encoder

- **```LSTM```** layers here store "memories" across sentences for a given ad. Fewer recurrent layers are needed here because the output from prior layers is smaller than the original input, so less information needs to be processed (a result of max pooling, the valid border setting, and convolution)


- **```Dense```** layers are fully connected layers where all inputs from the prior layer are considered (these are appropriate and can be trained in reasonable time because inputs from prior layers are much smaller than the original input)

In [8]:
    forwards = LSTM(80, return_sequences=False, dropout_W=0.2, dropout_U=0.2, consume_less='gpu')(encoded)
    backwards = LSTM(80, return_sequences=False, dropout_W=0.2, dropout_U=0.2, consume_less='gpu', go_backwards=True)(encoded)

    merged = merge([forwards, backwards], mode='concat', concat_axis=-1)
    output = Dropout(0.3)(merged)
    output = Dense(128, activation='relu')(output)
    output = Dropout(0.3)(output)
    output = Dense(1, activation='sigmoid')(output)

    model = Model(input=document, output=output)

### Point to checkpoint to load model weights

In [9]:
    if checkpoint:
        model.load_weights(os.path.join(model_dir, "checkpoints", checkpoint))

### Fit Model
- 1 epoch on the Mac GPU takes ~36 hrs with ~200k ads as input and evaluation on ~80k ads

In [13]:
    # file_name = os.path.basename(sys.argv[0]).split('.')[0]
    # check_cb = keras.callbacks.ModelCheckpoint(os.path.join(model_dir, 'checkpoints/'+file_name+'.{epoch:02d}-{val_loss:.2f}.hdf5'),
    #                                            monitor='val_loss', verbose=0, save_best_only=True, mode='min')
    # earlystop_cb = keras.callbacks.EarlyStopping(monitor='val_loss', patience=7, verbose=1, mode='auto')
    # history = LossHistory()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=10,
    #           nb_epoch=5, shuffle=True, callbacks=[earlystop_cb,check_cb, history])

    # just showing access to the history object
    # print(history.losses)
    # print(history.accuracies)

    # WRITE MODEL STATS TO MODEL DIRECTORY

### Quantization of chars in eval docs

In [63]:
    def get_cluster_and_text_and_id(json_file, outfile):
        """
        Write one list per ad to outfile: [doc_id, cleaned text, cluster_id], plus the class label when the ad has one
        """
        with open(os.path.join(model_dir, outfile), "w") as out:
            with open(os.path.join(model_dir, json_file), "r") as f:

                for idx, ad in enumerate(f.readlines()):
                    ad = ujson.loads(ad)
                    if "class" in ad:
                        out.write(str([ ad["doc_id"], re.sub("(\r|\n|\t)", " ", clean_str(ad["extracted_text"])), ad["cluster_id"], ad["class"] ]) + "\n")
                    else:
                        out.write(str([ ad["doc_id"], re.sub("(\r|\n|\t)", " ", clean_str(ad["extracted_text"])), ad["cluster_id"] ]) + "\n")
    get_cluster_and_text_and_id("cp1_evaluation.json", "actual_eval")        

In [64]:
    eval_docs = []
    eval_sentences = []
    eval_class = []

    build_lists("actual_eval", eval_docs, eval_sentences)
    X_eval = np.ones((len(eval_docs), max_sentences, maxlen), dtype=np.int64) * -1
    X_eval = encode_and_trim(eval_docs, X_eval, maxlen, max_sentences)

### Generate probabilities from model

In [15]:
    proba = model.predict(X_eval, batch_size=128)

In [22]:
import ast

def join_cluster_to_probability(probs):
    """
    After probability assignments are made, rejoin them with a cluster id, allowing for prediction at the cluster level
    """
    cluster_probs = {}
    with open(os.path.join(model_dir, "actual_eval"), "r") as f:
        for i, line in enumerate(f):
            # each line is a list literal: [doc_id, text, cluster_id] or [doc_id, text, cluster_id, class]
            doc = ast.literal_eval(line)
            cluster_id = doc[2]
            entry = cluster_probs.setdefault(cluster_id, {"scores": []})
            entry["scores"].append(float(probs[i][0]))

            # a label is only present when the source ad had a "class" field
            if len(doc) == 4:
                entry["label"] = doc[3]

    with open(os.path.join(model_dir, "eval_ad_predictions_by_cluster"), "w") as out:
        out.write(json.dumps(cluster_probs, indent=4))

In [23]:
join_cluster_to_probability(proba)

### Generate score for cluster based on individual scores of ads in that cluster

In [57]:
# Look at distribution of scores within clusters we have, generate ROC off of some aggregation of scores
    # Average
    # top 10%, 25%
    # Histograms
def avg_aggregation(top_n=None):
    """
    Average the ad scores within each labeled cluster
    :param (int) top_n: Number of highest scores to average when scoring cluster (None averages all scores)
    """
    pos_clusters, neg_clusters, pos_averages, neg_averages = [], [], [], []
    with open(os.path.join(model_dir, "ad_predictions_by_cluster"), "r") as f:
        lookup = json.load(f)
    for cluster, entry in lookup.items():
        # slice by number of scores (not by number of dict keys); [:None] keeps every score
        top_scores = sorted(entry["scores"], reverse=True)[:top_n]
        average = sum(top_scores) / len(top_scores)
        if entry["label"] == 1:
            pos_clusters.append(cluster)
            pos_averages.append(average)
        elif entry["label"] == 0:
            neg_clusters.append(cluster)
            neg_averages.append(average)
    return [pos_clusters, neg_clusters, pos_averages, neg_averages]

def graph_cluster_scores(clusters, agg_scores, color):
    plt.rcParams['figure.figsize'] = (17.0, 7.0)
    bar_width = .3
    index = np.arange(len(clusters))
    plt.bar(index, agg_scores, .5, color=color)
    plt.ylim([0,1.1])
    plt.xlim([0,len(clusters)])
    plt.xlabel('Cluster ID')
    plt.ylabel('Probability')
    plt.xticks(index + bar_width, clusters, rotation="vertical")
    plt.title(r'Average Ad HT Probability per Cluster')
    plt.show()
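
The comments at the top of this cell mention scoring a cluster by its average or by its top 10% / 25% of ad scores. A minimal sketch of such an aggregation is shown below, assuming a non-empty score list. The helper name `aggregate_cluster_scores`, the `top_fraction` parameter, and the example scores are illustrative, not part of the original notebook.

```python
def aggregate_cluster_scores(scores, top_fraction=1.0):
    """Average the highest-scoring fraction of a cluster's ad scores (hypothetical helper)."""
    ranked = sorted(scores, reverse=True)
    # always keep at least one score so small clusters still get a value
    keep = max(1, int(round(len(ranked) * top_fraction)))
    top = ranked[:keep]
    return sum(top) / len(top)

example_scores = [0.9, 0.1, 0.4, 0.2]   # made-up ad probabilities for one cluster
mean_all = aggregate_cluster_scores(example_scores)
mean_top_quarter = aggregate_cluster_scores(example_scores, top_fraction=0.25)
```

Averaging only the top fraction lets a few high-scoring ads dominate the cluster score, whereas the plain mean dilutes them across the whole cluster.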

In [54]:
top_n = None #just using length

# read the predictions file once and unpack all four lists
pos_clusters, neg_clusters, pos_agg_scores, neg_agg_scores = avg_aggregation()

In [65]:
# print(len(pos_agg_scores))
# print(sum(pos_agg_scores) / len(pos_agg_scores))
# print(max(pos_agg_scores))
# print(pos_agg_scores)

In [66]:
# print(len(neg_agg_scores))
# print(sum(neg_agg_scores) / len(neg_agg_scores))
# print(max(neg_agg_scores))
# print(neg_agg_scores)

In [67]:
pos_agg_scores, pos_clusters = zip(*sorted(zip(pos_agg_scores, pos_clusters), reverse=True))
graph_cluster_scores(pos_clusters, pos_agg_scores, "red")

In [68]:
neg_agg_scores, neg_clusters = zip(*sorted(zip(neg_agg_scores, neg_clusters), reverse=True))
graph_cluster_scores(neg_clusters, neg_agg_scores, "green")

### Generate files for IST evaluation script

In [60]:
with open(os.path.join(model_dir, "actual_evaluation", "ground_truth.json"), "w") as out:
    for x in range(0, len(neg_clusters)):
        doc = {}
        doc["cluster_id"] = neg_clusters[x]
        doc["class"] = 0
        out.write(json.dumps(doc) + "\n")
    for y in range(0, len(pos_clusters)):
        doc = {}
        doc["cluster_id"] = pos_clusters[y]
        doc["class"] = 1
        out.write(json.dumps(doc) + "\n")

with open(os.path.join(model_dir, "actual_evaluation", "submission.json"), "w") as out:
    for x in range(0, len(neg_clusters)):
        doc = {}
        doc["cluster_id"] = neg_clusters[x]
        doc["score"] = neg_agg_scores[x]
        out.write(json.dumps(doc) + "\n")
    for y in range(0, len(pos_clusters)):
        doc = {}
        doc["cluster_id"] = pos_clusters[y]
        doc["score"] = pos_agg_scores[y]
        out.write(json.dumps(doc) + "\n")

### ROC Evaluation

In [None]:
# Happens in evaluation dir
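
For a quick sanity check before running the evaluation script, the area under the ROC curve can be computed directly from files in the `ground_truth.json` / `submission.json` line format written above. The sketch below is not the IST evaluation script; the helper names and the sample lines are assumptions for illustration, and the AUC uses the pairwise-ranking definition.

```python
import json

def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative cluster")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def auc_from_json_lines(truth_lines, submission_lines):
    """Join truth and submission lines on cluster_id and compute the AUC."""
    truth = {}
    for line in truth_lines:
        doc = json.loads(line)
        truth[doc["cluster_id"]] = doc["class"]
    scores = {}
    for line in submission_lines:
        doc = json.loads(line)
        scores[doc["cluster_id"]] = doc["score"]
    ids = [cid for cid in truth if cid in scores]
    return roc_auc([truth[c] for c in ids], [scores[c] for c in ids])

# made-up lines in the same format as the files written above
truth_lines = ['{"cluster_id": "c1", "class": 0}', '{"cluster_id": "c2", "class": 0}',
               '{"cluster_id": "c3", "class": 1}', '{"cluster_id": "c4", "class": 1}']
submission_lines = ['{"cluster_id": "c1", "score": 0.1}', '{"cluster_id": "c2", "score": 0.4}',
                    '{"cluster_id": "c3", "score": 0.35}', '{"cluster_id": "c4", "score": 0.8}']
auc = auc_from_json_lines(truth_lines, submission_lines)
```

The pairwise loop is quadratic in the number of clusters, which is fine for a spot check but slow for large evaluations.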

## TO DOs
- Finish writing about intuition and script
    - refactor
    - identify opportunities for experimentation 
        - fully connected layers
        - different characters allowed
        - lattice
        - titles
        - sentence splits
        - num_chars, sentences allowed
- Create sample data