# **LINMA 2472 : Algorithms in Data Science**


## Project on deep learning

Classification is a common task in machine learning.  In this project, we tackle the task of classifying the tweets of the two main candidates of the 2016 US presidential election.  To do so, we use a database of about 20k tweets from both candidates: all the tweets Donald Trump posted from 05/04/2009 to 11/26/2017, and Hillary Clinton's tweets from 06/10/2013 to 11/24/2017.

In our project, we decided to use the Python version of the library [TensorFlow](https://www.tensorflow.org/).  Since this library is quite powerful, we decided not to use its high-level tools as a black box and to code our classifier ourselves as much as possible.  One could argue that we could have developed our classifier without the help of any library, but we would most likely not have obtained the same results: backpropagation can be quite tricky to implement, and coding an optimization method fancier than plain gradient descent would have been out of reach given the time we had.  Moreover, TensorFlow provides great tools to track the performance of our classifiers, such as TensorBoard.  In this first part, we implemented a simple perceptron model with a naive feature extraction of the texts: a bag-of-words representation.
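
To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is only an illustration under simple assumptions (lower-casing and whitespace tokenization). The helpers `build_vocabulary` and `bag_of_words` and the toy sentences are made up for this example; they are not the project's `generate_bow` from `nlp_utils`, which may process the text differently (for instance by removing stop words).

```python
from collections import Counter

def build_vocabulary(texts):
    """Map every distinct lower-cased token of the corpus to a column index."""
    vocabulary = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`; unknown words are ignored."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector

# Toy corpus (made up for illustration, not taken from the tweet database)
corpus = ["make america great again", "stronger together", "great great rally"]
vocabulary = build_vocabulary(corpus)
vectors = [bag_of_words(text, vocabulary) for text in corpus]
print(vocabulary)
print(vectors)
```

Each text becomes a fixed-length vector of word counts, so word order is lost; this is what makes the representation naive but easy to feed to a perceptron.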


# Tensorflow Classifier

We developed a Python abstraction for classifiers, which you can find in `classifier.py`.  This makes it easy to build another classifier based on another model: you only have to redefine the method `create_model`.
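
The pattern can be sketched as follows. This is a hypothetical stand-in, not the actual content of `classifier.py`: the class `BaseClassifier`, its `describe` method and the toy `MajorityClassifier` are invented for illustration, and only the constructor arguments mirror the way the classifier is instantiated later in this notebook.

```python
class BaseClassifier:
    """Hypothetical stand-in for the abstraction in classifier.py (not its real code)."""

    def __init__(self, train_set, test_set, nb_classes, name="model"):
        self.train_set = train_set
        self.test_set = test_set
        self.nb_classes = nb_classes
        self.name = name
        self.model = None

    def create_model(self, *args, **kwargs):
        # Subclasses describe their own architecture here.
        raise NotImplementedError("create_model must be redefined by the subclass")

    def describe(self):
        # Shared logic that works for any subclass.
        return "{}: {}".format(self.name, self.model)


class MajorityClassifier(BaseClassifier):
    """Toy model that always predicts the most frequent training label."""

    def create_model(self):
        labels = self.train_set['label']
        self.model = max(set(labels), key=labels.count)


clf = MajorityClassifier({'x': [[1], [2], [3]], 'label': [0, 1, 1]}, None, 2, name='majority')
clf.create_model()
print(clf.describe())
```

Everything shared between models lives in the base class, so a new model only has to provide its own `create_model`.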

What you need to launch this part of the project:

 - numpy
 - nltk (For the bag of words processing)
 - [Tensorflow](https://www.tensorflow.org/install/)

You can track the progress of the training by launching TensorBoard in your terminal:
`tensorboard --logdir='.graphs/'`
Then open the link it prints in your web browser to use the TensorBoard tools.
 

In [5]:
import tensorflow as tf
import numpy as np
import nltk
nltk.download('stopwords')
import csv

# Our custom libraries
from nlp_utils import generate_bow, create_bow_by_dict
from utils import read_files
from classifier import Classifier

[nltk_data] Downloading package stopwords to /home/hdev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [31]:
class PerceptronClassifier(Classifier):
    # This class inherits from the class Classifier, an abstraction that handles running the TensorFlow session,
    # saving the intermediate results, the training step, the testing step, ...
    # See classifier.py for more information.
    
    def create_model(self, hidden_layers):
        # Create the structure of the perceptron.
        # hidden_layers should be a list containing the number of nodes for each layer.
        # Each layer is densely connected with the previous and next one.
        # The number of layers can be arbitrarily large, but you must specify at least one hidden layer.
        
        input_layer_size = len(self.train_set['x'][0]) 
        output_layer_size = self.nb_classes

        # Placeholders fed with the provided data (inputs and their labels)
        self.x = tf.placeholder(tf.float32, [None, input_layer_size], name='input')
        self.y_ = tf.placeholder(tf.float32, [None, output_layer_size], name = 'label')
        
        self.hidden_layers = hidden_layers 

        # Each of these lists keeps track of the variables used by the model, for further analysis of the model
        self.weights = []
        self.layers = []
        self.bias = []
        
        # Creation of the model
        #
        # Each layer is densely connected with the previous and next one.
        # Moreover, a bias is added to each node, and each hidden layer uses a sigmoid as activation function
        #
        
        
        # Creation of the first hidden layer.
        W = tf.Variable(tf.random_normal([input_layer_size, self.hidden_layers[0]], stddev=0.01))
        self.weights.append(W) 
        b = tf.Variable(tf.zeros([self.hidden_layers[0]]))
        self.bias.append(b)
        y = tf.nn.sigmoid(tf.matmul(self.x, W)+b)
        self.layers.append(y)
        
        # Loop creating all the other layers
        for i in range(len(self.hidden_layers)-1):
            W = tf.Variable(tf.random_normal([self.hidden_layers[i], self.hidden_layers[i+1]], stddev=0.1))
            self.weights.append(W) 
            b = tf.Variable(tf.zeros([self.hidden_layers[i+1]]))
            self.bias.append(b)
            y = tf.nn.sigmoid(tf.matmul(self.layers[i], W)+b)
            self.layers.append(y)

        # Connection of the last hidden layer to the output layer
        # To create a probability distribution at the end, a softmax function is used (computed together with the
        # loss function for numerical stability reasons)
        W = tf.Variable(tf.random_normal([self.hidden_layers[-1], output_layer_size], stddev=0.1))
        self.weights.append(W)
        b = tf.Variable(tf.zeros([output_layer_size]))
        self.bias.append(b)
        self.y = tf.matmul(self.layers[-1], W)+b

        # Loss function : Cross-entropy function 
        #unweighted_loss = tf.nn.softmax_cross_entropy_with_logits(labels = self.y_, logits = self.y)
        #weighted_loss = unweighted_loss * [0.8, 0.2]
        #self.loss = tf.reduce_mean(weighted_loss)
        self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels = self.y_, logits = self.y))

        # Training accuracy
        correct_prediction = tf.equal(tf.argmax(self.y, 1), tf.argmax(self.y_, 1))
        self.training_accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        
        # Optimizer: Adam gives us better results than a classical gradient method
        self.train_step = tf.train.AdamOptimizer().minimize(self.loss, global_step = self.global_step)
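
As a cross-check of what the graph above computes, here is a small NumPy sketch of the same forward pass: sigmoid hidden layers, a linear output layer producing the logits, and a cross-entropy computed from the logits in a numerically stable way. It is an illustration with random toy data and toy layer sizes, not a replacement for the TensorFlow code; the function names are chosen for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Sigmoid hidden layers followed by a linear output layer (the logits)."""
    activation = x
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = sigmoid(activation @ W + b)
    return activation @ weights[-1] + biases[-1]

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between one-hot `labels` and softmax(logits).

    Subtracting the row-wise maximum before exponentiating keeps exp() from
    overflowing, which is the numerical-stability concern mentioned above.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1).mean()

rng = np.random.default_rng(0)
sizes = [10, 16, 16, 2]   # input, two hidden layers, output (toy sizes)
weights = [rng.normal(0.0, 0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.random((5, sizes[0]))                    # batch of 5 fake feature vectors
labels = np.eye(2)[rng.integers(0, 2, size=5)]   # fake one-hot labels
logits = forward(x, weights, biases)
loss = softmax_cross_entropy(logits, labels)
accuracy = (logits.argmax(axis=1) == labels.argmax(axis=1)).mean()
print(logits.shape, loss, accuracy)
```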


In [17]:
# Read the *.xlsx files and create the training and test sets (csv format)
#
# Launch this only once !
import generate_files

generate_files.create_training_test_set()


In [27]:
def process_input(train_set_raw, test_set_raw):
    # The word dictionary is built from the training set and reused for the test set
    train_set_features, word_dict = generate_bow(train_set_raw['x'], train_set_raw['label'], False, False, False)
    test_set_features = create_bow_by_dict(test_set_raw['x'], word_dict, True)

    def one_hot(labels):
        # [1, 0] if the author is Trump, [0, 1] otherwise
        return [np.array([int(s == 'Trump'), 1 - int(s == 'Trump')]) for s in labels]

    train_set = {'x': train_set_features, 'label': one_hot(train_set_raw['label'])}
    test_set = {'x': test_set_features, 'label': one_hot(test_set_raw['label'])}

    return train_set, test_set

In [32]:
# Read the training and test sets
train_set_raw = read_files('training.csv')
test_set_raw = read_files('test.csv')


# Processing of the train_set and test_set to transform them into a bag of words
train_set, test_set = process_input(train_set_raw, test_set_raw)

# Reset the Tensorflow graph at each run (to avoid creating a lot of Variables ...)
tf.reset_default_graph() 


# Creation of the model
DNN = PerceptronClassifier(train_set, test_set, 2, name='6_layers_perceptron')
DNN.create_model([16, 16, 16, 16])

# Create a Tensorflow session to train our network and test it
DNN.run()

# Train the network for 10 epochs
DNN.train(10)

print('\naccuracy : ' + str(DNN.test()))

# Close the Tensorflow session
DNN.close()

Training step :


InvalidArgumentError: Incompatible shapes: [32] vs. [2]
	 [[Node: gradients/mul_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](gradients/mul_grad/Shape, gradients/mul_grad/Shape_1)]]

Caused by op 'gradients/mul_grad/BroadcastGradientArgs', defined at:
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/ipykernel/kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2728, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2850, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-32-19c1b66b129b>", line 15, in <module>
    DNN.create_model([16, 16, 16, 16])
  File "<ipython-input-31-662825d2d055>", line 70, in create_model
    self.train_step = tf.train.AdamOptimizer().minimize(self.loss, global_step = self.global_step)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/optimizer.py", line 343, in minimize
    grad_loss=grad_loss)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/training/optimizer.py", line 414, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in gradients
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py", line 353, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in <lambda>
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/math_grad.py", line 742, in _MulGrad
    rx, ry = gen_array_ops._broadcast_gradient_args(sx, sy)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 532, in _broadcast_gradient_args
    "BroadcastGradientArgs", s0=s0, s1=s1, name=name)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

...which was originally created as op 'mul', defined at:
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
[elided 19 identical lines from previous traceback]
  File "<ipython-input-32-19c1b66b129b>", line 15, in <module>
    DNN.create_model([16, 16, 16, 16])
  File "<ipython-input-31-662825d2d055>", line 61, in create_model
    weighted_loss = unweighted_loss * [0.8, 0.2]
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
    return func(x, y, name=name)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1117, in _mul_dispatch
    return gen_math_ops._mul(x, y, name=name)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2726, in _mul
    "Mul", x=x, y=y, name=name)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/hdev/anaconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Incompatible shapes: [32] vs. [2]
	 [[Node: gradients/mul_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](gradients/mul_grad/Shape, gradients/mul_grad/Shape_1)]]
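
The traceback points at the expression `unweighted_loss * [0.8, 0.2]`, which appears (now commented out) in `create_model`. The softmax cross-entropy returns one value per example, so for a batch of 32 it has shape `[32]`, which cannot be broadcast against two class weights of shape `[2]`. One possible fix, sketched here in NumPy on random toy data, is to turn the class weights into per-example weights by using the one-hot labels. The weights `[0.8, 0.2]` come from the commented-out code; everything else is an assumption made for illustration.

```python
import numpy as np

def per_example_cross_entropy(logits, labels):
    """Softmax cross-entropy, one value per example (shape [batch])."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

batch_size = 32
rng = np.random.default_rng(0)
logits = rng.normal(size=(batch_size, 2))
labels = np.eye(2)[rng.integers(0, 2, size=batch_size)]   # one-hot, shape [32, 2]
class_weights = np.array([0.8, 0.2])                      # values from the commented-out code

unweighted = per_example_cross_entropy(logits, labels)    # shape [32]

# unweighted * class_weights would fail: shapes (32,) and (2,) do not broadcast.
# Selecting the weight of each example's true class gives a vector of shape [32].
example_weights = (labels * class_weights).sum(axis=1)
weighted_loss = (unweighted * example_weights).mean()
print(unweighted.shape, example_weights.shape, weighted_loss)
```

In the TensorFlow graph, the same idea would presumably be written as `tf.reduce_sum(self.y_ * [0.8, 0.2], axis=1)` multiplied by the unweighted loss before taking the mean; this has not been tested against the notebook.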


# Keras Classifiers

In this part, we test different deep-learning architectures with the library Keras.

The tested architectures are convolutional ones based on the article "Convolutional Neural Networks for Sentence Classification" (2014) by Yoon Kim, in which a precomputed Word2Vec embedding of the words, trained on Google News data, is used.

We also test an architecture made of two stacked LSTMs, still using the word embeddings.

The plots of the different models produced by Keras can be found in the "model#.png" files. They are also shown in the notebook.

To run this part you will need:

- numpy
- For the embedding:
  - gensim
  - the Word2Vec Google News embedding, which can be found here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

- For model formulation and optimization:
  - keras

- For model visualization, given in the "model#.png" files (optional; uncomment the `plot_model(.)` calls if you need it):
  - pydot
  - graphviz (`apt-get install graphviz`, not the anaconda package)
  - a lot of RAM (TODO)

### Data Manipulation

In [None]:
import numpy as np
import os
import csv
import gc
from gensim.models import KeyedVectors

# Load the (large) embedding only once per session
if 'embeddingModel' not in vars():
    embeddingModel = 0
    gc.collect()
    embeddingModel = KeyedVectors.load_word2vec_format(os.environ['HOME']+'/Documents/Word2Vec_embedding/GoogleNews-vectors-negative300.bin', binary=True)#, norm_only=True)

def embedding(tweet):
    """ Convert a tweet to a matrix (one row of size 300 per known word)
        with the Word2Vec embedding trained on GoogleNews
    """
    E = []
    words = tweet.split()
    for word in words:
        # Words missing from the embedding vocabulary are skipped
        if word in embeddingModel:
            E.append(embeddingModel[word])
    
    return np.array(E)
        

def create_dataset(filename):
    training_list = []
    label_list = []
    with open(filename, "r") as file:
        reader = csv.reader(file, delimiter=';')
        for tweet, author in reader:
            E = embedding(tweet)
            # Keep only the tweets with at least 3 embedded words
            if E.size >= 3*300:
                training_list.append(E)
                label_list.append(int(author=='Trump'))

    return {'x': training_list, 'label': label_list}

Train_dataset = create_dataset('training.csv')
x_train = Train_dataset['x']
y_train = np.array(Train_dataset['label'])

Test_dataset = create_dataset('test.csv')
x_test = Test_dataset['x']
y_test = np.array(Test_dataset['label'])

# Length of the longest sequence of words (used for padding)
seq_length = max(max([x.shape[0] for x in x_train]), max([x.shape[0] for x in x_test]))

def zero_padding(X):
    """ Pad, in place, every matrix of X with rows of zeros up to seq_length rows """
    for i in range(len(X)):
        X[i] = np.vstack((X[i], np.zeros((seq_length-X[i].shape[0],300))))

zero_padding(x_train)
zero_padding(x_test)

x_train = np.array(x_train)
x_test = np.array(x_test)
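
The padding step can be illustrated on toy data. The following sketch is a variant of the same idea that allocates the final 3-D array once instead of stacking each tweet separately; the function name `pad_sequences_with_zeros` and the 5-dimensional toy embeddings (instead of 300) are assumptions made for the example.

```python
import numpy as np

def pad_sequences_with_zeros(sequences, embedding_dim):
    """Stack variable-length (n_words, embedding_dim) matrices into one 3-D array."""
    max_length = max(seq.shape[0] for seq in sequences)
    padded = np.zeros((len(sequences), max_length, embedding_dim))
    for i, seq in enumerate(sequences):
        padded[i, :seq.shape[0], :] = seq   # remaining rows stay at zero
    return padded

# Toy "tweets" of 2, 4 and 3 words with 5-dimensional embeddings (instead of 300)
toy = [np.ones((2, 5)), np.ones((4, 5)), np.ones((3, 5))]
batch = pad_sequences_with_zeros(toy, embedding_dim=5)
print(batch.shape)
```

All tweets end up with the same number of rows, which is what lets Keras treat the data set as a single tensor of shape (number of tweets, sequence length, embedding size).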

## Model definition and optimization

### First model
A first, simpler implementation of the model given in the article, with only one convolutional kernel size (3) with 128 features, a global max pooling layer and a fully connected layer to the one-node output.

![Model plot](model1.png)

Observed test set accuracy : 92-93%

In [None]:
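# Keras imports used by this cell and the following ones (assumed: the import cell is not shown in the notebook)
from keras.models import Sequential, Model
from keras.layers import Input, Conv1D, GlobalMaxPooling1D, Concatenate, Dropout, Dense, LSTM
from keras.callbacks import EarlyStopping
from keras import regularizers
from keras.utils import plot_model

#architecture
model = Sequential()
model.add(Conv1D(128, 3, activation='relu', input_shape=(seq_length,300), name="Convolution"))
model.add(GlobalMaxPooling1D(name="Pooling"))
model.add(Dense(1, activation='sigmoid', name="Output"))
model.summary()

#loss function and optimizer
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

#optimization with early stopping
#(the monitored metric is named 'val_acc' in older Keras releases and 'val_accuracy' in more recent ones)
earlyStopping = EarlyStopping(monitor='val_acc', patience=0, verbose=0, mode='auto')
model.fit(x_train, y_train, batch_size=50, epochs=10, callbacks=[earlyStopping], 
          validation_split=0.1, shuffle=True)

score = model.evaluate(x_test,y_test, batch_size=64)

#display accuracy and plot model
print("\nAccuracy on the test set : "+str(score[1])+"\n\n")
#plot_model(model, to_file="model1.png", show_shapes=True, show_layer_names=True)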
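
To see what the three layers of this first model compute on a single tweet, here is a NumPy sketch with random weights: a width-3 convolution with 128 filters and ReLU, a global max pooling over the word positions, and a one-node sigmoid output. The sequence length of 12 and the helper names are assumptions made for the example, and the random weights stand in for the trained ones.

```python
import numpy as np

def conv1d_relu(x, kernels, bias):
    """'Valid' 1-D convolution over the word axis followed by ReLU.

    x:       (seq_length, embedding_dim)
    kernels: (kernel_size, embedding_dim, n_filters)
    returns: (seq_length - kernel_size + 1, n_filters)
    """
    kernel_size, _, n_filters = kernels.shape
    n_windows = x.shape[0] - kernel_size + 1
    out = np.empty((n_windows, n_filters))
    for t in range(n_windows):
        window = x[t:t + kernel_size]       # kernel_size consecutive words
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

def global_max_pooling(features):
    """Keep the strongest response of every filter over the whole sentence."""
    return features.max(axis=0)

rng = np.random.default_rng(0)
seq_length, embedding_dim, n_filters, kernel_size = 12, 300, 128, 3
x = rng.normal(size=(seq_length, embedding_dim))   # one fake embedded tweet
kernels = rng.normal(scale=0.01, size=(kernel_size, embedding_dim, n_filters))
bias = np.zeros(n_filters)

features = conv1d_relu(x, kernels, bias)           # shape (10, 128)
pooled = global_max_pooling(features)              # shape (128,)

# Dense layer with a single sigmoid output
w_out = rng.normal(scale=0.01, size=n_filters)
probability = 1.0 / (1.0 + np.exp(-(pooled @ w_out)))
print(features.shape, pooled.shape, probability)
```

Because of the global max pooling, the size of the vector fed to the dense layer depends only on the number of filters, not on the length of the tweet.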

### Definition of a convolutional layer with different kernel sizes
A component of the two following models. It implements one convolutional layer with three kernel sizes (3, 4, 5), 100 features each, and global max pooling.

In [None]:
inp = Input(shape=(seq_length,300), name="Convolution_Input")
convs = []
#1
conv = Conv1D(100, 3, activation='relu', name="Convolution_Ker_Size3")(inp)
pool = GlobalMaxPooling1D(name="Global_Pooling1")(conv)
convs.append(pool)
#2
conv = Conv1D(100, 4, activation='relu', name="Convolution_Ker_Size4")(inp)
pool = GlobalMaxPooling1D(name="Global_Pooling2")(conv)
convs.append(pool)
#3
conv = Conv1D(100, 5, activation='relu', name="Convolution_Ker_Size5")(inp)
pool = GlobalMaxPooling1D(name="Global_Pooling3")(conv)
convs.append(pool)
out = Concatenate(name="Merge")(convs)

conv_model = Model(inputs=inp, outputs=out)
conv_model.summary()
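
The shape bookkeeping of this component can be checked with a short NumPy sketch. Each branch (kernel sizes 3, 4 and 5) yields 100 numbers after max pooling over the positions, whatever the kernel size or the tweet length, so the merged vector has 300 entries. The weights are random and the helper `max_over_time` is a name invented for this illustration.

```python
import numpy as np

def max_over_time(x, kernel_size, n_filters, rng):
    """Random-weight convolution of width `kernel_size`, ReLU, then max over positions."""
    n_windows = x.shape[0] - kernel_size + 1
    kernels = rng.normal(scale=0.01, size=(kernel_size * x.shape[1], n_filters))
    # Flatten every window of `kernel_size` consecutive words into one row
    windows = np.stack([x[t:t + kernel_size].ravel() for t in range(n_windows)])
    return np.maximum(windows @ kernels, 0.0).max(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 300))       # one fake tweet: 20 words, 300-dim embeddings
branches = [max_over_time(x, k, 100, rng) for k in (3, 4, 5)]
merged = np.concatenate(branches)
print([b.shape for b in branches], merged.shape)
```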

### Second model
Close to the model presented in the article: the three kernel sizes for the convolutional layer, dropout on the hidden layer with p=0.5, and an L2 penalty on the weights of the last layer (an L2 constraint in the article).

![Model plot](model2.png)

Observed test set accuracy : 92-93%

In [None]:
#architecture
model = Sequential()
model.add(conv_model)
model.add(Dropout(0.5, name="Dropout"))
model.add(Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01), name="Output"))
model.summary()

#loss function and optimizer
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

#optimization with early stopping
earlyStopping = EarlyStopping(monitor='val_acc', patience=0, verbose=0, mode='auto')
model.fit(x_train, y_train, batch_size=50, epochs=10, callbacks=[earlyStopping], 
          validation_split=0.1, shuffle=True)

score = model.evaluate(x_test,y_test, batch_size=64)

#display accuracy and plot model
print("\nAccuracy on the test set : "+str(score[1])+"\n\n")
#plot_model(model, to_file="model2.png", show_shapes=True, show_layer_names=True)
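
The two regularization ingredients of this model can be sketched in NumPy. The sketch assumes the usual "inverted" dropout (units are zeroed with probability 0.5 during training and the survivors are rescaled, while nothing is dropped at test time) and an L2 penalty equal to the factor 0.01 times the sum of the squared weights, added to the loss. The function names and toy values are chosen for illustration.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of the units and rescale the rest."""
    if not training:
        return x                        # nothing is dropped at test time
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def l2_penalty(weights, factor=0.01):
    """Term added to the loss: factor * sum of squared weights."""
    return factor * np.sum(weights ** 2)

rng = np.random.default_rng(0)
hidden = np.ones(300)                   # stand-in for the 300 pooled features
dropped = dropout(hidden, rate=0.5, rng=rng)
w_out = np.full(300, 0.1)               # toy output weights
print(np.unique(dropped), l2_penalty(w_out))
```

Note that a penalty on the loss is not the same thing as the norm constraint of the article, which rescales the weight vector whenever its norm exceeds a threshold; the penalty only discourages large weights.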

### Third model
A 20-node fully connected intermediate layer is added before the output.

![Model plot](model3.png)

Observed test set accuracy : 92-93%

In [None]:
#architecture
model = Sequential()
model.add(conv_model)
model.add(Dropout(0.5, name="Dropout"))
model.add(Dense(20, activation='relu', kernel_regularizer=regularizers.l2(0.01), name="Intermediate_Dense"))
model.add(Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01), name="Output"))
model.summary()

#loss function and optimizer
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

#optimization with early stopping
earlyStopping = EarlyStopping(monitor='val_acc', patience=0, verbose=0, mode='auto')
model.fit(x_train, y_train, batch_size=50, epochs=10, callbacks=[earlyStopping], 
          validation_split=0.1, shuffle=True)

score = model.evaluate(x_test,y_test, batch_size=64)

#display accuracy and plot model
print("\nAccuracy on the test set : "+str(score[1])+"\n\n")
#plot_model(model, to_file="model3.png", show_shapes=True, show_layer_names=True)

### Fourth model
Two stacked LSTMs.
The first has a 64-dimensional state and returns it at each time step (word). The second has a 32-dimensional state and only returns it at the end. This last state is then used to compute the output using a dense layer.

![Model plot](model4.png)

Observed test set accuracy : ~90%

In [None]:
#architecture
model = Sequential()
model.add(LSTM(64, return_sequences=True,input_shape=(seq_length,300), name="First_Stacked_LSTM"))
model.add(LSTM(32, name="Second_Stacked_LSTM"))
model.add(Dense(1, activation='sigmoid', name="Output"))
model.summary()

#loss function and optimizer
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

#optimization with early stopping
earlyStopping = EarlyStopping(monitor='val_acc', patience=0, verbose=0, mode='auto')
model.fit(x_train, y_train, batch_size=50, epochs=20, callbacks=[earlyStopping], 
          validation_split=0.1, shuffle=True)

score = model.evaluate(x_test,y_test, batch_size=64)

#display accuracy and plot model
print("\nAccuracy on the test set : "+str(score[1])+"\n\n")
#plot_model(model, to_file="model4.png", show_shapes=True, show_layer_names=True)
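
The difference between returning the state at every time step and only at the end can be seen in a small NumPy sketch of a standard LSTM. This is a generic textbook LSTM with random weights, written for illustration; it is not meant to reproduce the Keras implementation in detail, and the tweet length of 15 is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(input_dim, units, rng):
    """Random parameters for the four gates (input, forget, candidate, output)."""
    return {
        'W': rng.normal(scale=0.1, size=(input_dim, 4 * units)),
        'U': rng.normal(scale=0.1, size=(units, 4 * units)),
        'b': np.zeros(4 * units),
    }

def lstm(x, params, return_sequences):
    """Run an LSTM over x of shape (seq_length, input_dim)."""
    units = params['U'].shape[0]
    h = np.zeros(units)
    c = np.zeros(units)
    outputs = []
    for x_t in x:
        z = x_t @ params['W'] + h @ params['U'] + params['b']
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update the cell state
        h = sigmoid(o) * np.tanh(c)                    # new hidden state
        outputs.append(h)
    return np.stack(outputs) if return_sequences else h

rng = np.random.default_rng(0)
x = rng.normal(size=(15, 300))                         # one fake tweet of 15 words
first = lstm(x, init_lstm(300, 64, rng), return_sequences=True)       # (15, 64)
second = lstm(first, init_lstm(64, 32, rng), return_sequences=False)  # (32,)
w_out = rng.normal(scale=0.1, size=32)
probability = sigmoid(second @ w_out)
print(first.shape, second.shape, probability)
```

The first layer must return its whole sequence of states because the second LSTM consumes one input per word, whereas the dense output layer only needs the final 32-dimensional state.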

### Comments
There is no big difference in performance between the models. Training times were roughly the same, ~1-2 min.

Interestingly, the accuracy on the training set was usually far better than the accuracy on the test set for the convolutional models, but this was not observed for the two stacked LSTMs.

More than 90% accuracy seems acceptable since, thanks to the embedding, the model works on the semantics of the words used rather than on the syntax (assuming the embedding reflects the semantics).

## References
1. [Tensorflow website](https://www.tensorflow.org/api_docs/)
2. [Stanford Tensorflow course notes](http://web.stanford.edu/class/cs20si/)
3. [Kaggle tutorial on Bag of words](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words)
4. Yoon Kim, "Convolutional Neural Networks for Sentence Classification", 2014.
