# LINMA 2472 : Algorithms in Data Science


## Project on deep learning

Classification is a common task in machine learning.  In this project, we tackle the task of classifying tweets of the two main candidates in the 2016 US presidential election.  To do so, we use a database of about 20k tweets from both candidates.  For Donald Trump, we have all the tweets he posted from 05/04/2009 to 11/26/2017, while for Hillary Clinton we have her tweets from 06/10/2013 to 11/24/2017.

In our project, we decided to use the Python version of the library [TensorFlow](https://www.tensorflow.org/).  Since this library is quite powerful, we decided not to use its high-level tools as a black box and to code our classifier ourselves as much as possible.  One could argue that we could have developed our classifier without the help of any library, but we would almost certainly not have obtained the same results.  Backpropagation is tricky to implement, and coding an optimization method fancier than plain gradient descent would have been out of reach given the time we had.  Moreover, TensorFlow provides great tools for tracking the performance of our classifiers, such as TensorBoard.

We developed a Python abstraction for classifiers, which you can find in classifier.py.  This makes it easy to build another classifier based on another model: you just have to redefine the method create_model.

In [2]:
import tensorflow as tf
import numpy as np
import shutil
import os
import nltk
nltk.download('stopwords')
import csv

# Our custom libraries
from nlp_utils import generate_bow, create_bow_by_dict
from classifier import Classifier

[nltk_data] Downloading package stopwords to /home/hdev/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
def read_files(filename):
    """Read a ';'-separated CSV file of (tweet, author) rows."""
    training_list = []
    label_list = []
    # The context manager closes the file even if a row is malformed
    with open(filename, "r") as file:
        reader = csv.reader(file, delimiter=';')
        for tweet, author in reader:
            training_list.append(tweet)
            label_list.append(author)

    return {'x' : training_list, 'label' : label_list}

In [4]:
class DeepNeuralNetwork(Classifier):
    
    def create_model(self, hidden_layers):
        # At least one hidden layer !
        # Create the structure of the deep neural network
       
        input_layer_size = len(self.train_set['x'][0]) 
        output_layer_size = self.nb_classes

        self.x = tf.placeholder(tf.float32, [None, input_layer_size], name='input')
        self.y_ = tf.placeholder(tf.float32, [None, output_layer_size], name = 'label')
        
        self.hidden_layers = hidden_layers 

        self.weights = []
        self.layers = []
        self.bias = []
        
        W = tf.Variable(tf.random_normal([input_layer_size, self.hidden_layers[0]], stddev=0.01))
        self.weights.append(W) 
        b = tf.Variable(tf.zeros([self.hidden_layers[0]]))
        self.bias.append(b)
        y = tf.nn.sigmoid(tf.matmul(self.x, W)+b)
        self.layers.append(y)
        
        for i in range(len(self.hidden_layers)-1):
            W = tf.Variable(tf.random_normal([self.hidden_layers[i], self.hidden_layers[i+1]], stddev=0.1))
            self.weights.append(W) 
            b = tf.Variable(tf.zeros([self.hidden_layers[i+1]]))
            self.bias.append(b)
            y = tf.nn.sigmoid(tf.matmul(self.layers[i], W)+b)
            self.layers.append(y)

        W = tf.Variable(tf.random_normal([self.hidden_layers[-1], output_layer_size], stddev=0.1))
        self.weights.append(W)
        b = tf.Variable(tf.zeros([output_layer_size]))
        self.bias.append(b)
        
        self.y = tf.matmul(self.layers[-1], W)+b

        # Loss function
        self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels = self.y_, logits = self.y))

        # Training accuracy
        correct_prediction = tf.equal(tf.argmax(self.y, 1), tf.argmax(self.y_, 1))
        self.training_accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        
        # Optimizer
        self.train_step = tf.train.AdamOptimizer().minimize(self.loss, global_step = self.global_step)
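
To make the loss above more concrete, here is a minimal NumPy sketch of what a mean softmax cross-entropy computed from raw logits looks like. It is only an illustration of the formula, not the TensorFlow implementation; the helper name `softmax_cross_entropy` and the toy values are assumptions made for this example.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy for one-hot labels, computed from raw logits."""
    # Subtracting the row-wise maximum does not change the softmax
    # but avoids overflow in the exponential.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross-entropy of each example, then the mean over the batch.
    per_example = -(labels * log_softmax).sum(axis=1)
    return per_example.mean()

# Two toy examples with two classes, one-hot encoded as in the notebook.
logits = np.array([[2.0, 0.0], [0.0, 0.0]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = softmax_cross_entropy(logits, labels)
print(loss)
```

The second example has equal logits, so its contribution is log(2); the first one is classified with a comfortable margin and contributes much less.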


In [5]:
train_set_raw = read_files('training.csv')
test_set_raw = read_files('test.csv')


train_set_features , word_dict = generate_bow(train_set_raw['x'], train_set_raw['label'], True)
test_set_features = create_bow_by_dict(test_set_raw['x'], word_dict, True)


train_set = {'x' : train_set_features, 'label' : [np.array([int(s == 'Trump'),1-int(s == 'Trump')]) for s in train_set_raw['label']]}
test_set = {'x' : test_set_features, 'label' : [np.array([int(s == 'Trump'),1-int(s == 'Trump')]) for s in test_set_raw['label']]}
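
The functions `generate_bow` and `create_bow_by_dict` live in our own `nlp_utils` module and are not shown here. As a rough idea of what a bag-of-words encoding does, the sketch below builds a vocabulary from the training tweets and counts word occurrences. The helpers `build_vocab` and `bow_vector` are hypothetical names for this illustration only, and the real functions may differ (for instance in stop-word removal or normalization).

```python
import numpy as np

def build_vocab(tweets):
    """Map each distinct lowercase token of the training tweets to a column index."""
    vocab = {}
    for tweet in tweets:
        for word in tweet.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def bow_vector(tweet, vocab):
    """Count occurrences of known words; words outside the vocabulary are ignored."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for word in tweet.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

# Toy data (not taken from the real dataset).
train_tweets = ["make america great again", "great great rally"]
vocab = build_vocab(train_tweets)
features = np.array([bow_vector(t, vocab) for t in train_tweets])
# A test tweet is encoded with the vocabulary of the training set.
unseen = bow_vector("great unknown words", vocab)
print(features.shape, unseen)
```

Encoding the test set with the dictionary built on the training set, as done above with `word_dict`, keeps the feature columns aligned between the two sets.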

In [8]:
# example : 
tf.reset_default_graph() 

DNN = DeepNeuralNetwork(train_set, test_set, 2, name='6_layers_perceptron')
DNN.create_model([16, 16, 16, 16])
DNN.run()
print(DNN.sess.run(DNN.bias[0], feed_dict= {DNN.x: [test_set['x'][0]]}))
DNN.train(10)
#DNN.restore()
print('accuracy : ' + str(DNN.test()))
DNN.close()

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
Training step :
accuracy : 0.873621568535


In [1]:
from classifier import Classifier

# Keras Classifiers

In this part, we test different deep-learning architectures with the Keras library.

The tested architectures are convolutional ones based on the article "Convolutional Neural Networks for Sentence Classification" (2014) by Yoon Kim, which uses a precomputed word embedding produced by Word2Vec on Google News data.

We also test an architecture of two stacked LSTMs, still using the word embeddings.

Plots of the different models, generated by Keras, can be found in the "model#.png" files.

To run this part you will need:

- numpy
- For the embedding:
    - gensim
    - the Word2Vec google-news embedding, which can be found here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
- For model formulation and optimization:
    - keras
- For model visualization, given by the "model#.png" files (optional: uncomment the "plot_model()" call if you need it):
    - pydot
    - graphviz (`apt-get install graphviz`, not the anaconda package)

## Data Manipulation

In [None]:
import numpy as np
import os
import csv
import gc
from gensim.models import KeyedVectors

if not 'embeddingModel' in vars():
    embeddingModel = 0
    gc.collect()
    embeddingModel = KeyedVectors.load_word2vec_format(os.environ['HOME']+'/Documents/Word2Vec_embedding/GoogleNews-vectors-negative300.bin', binary=True)#, norm_only=True)

def embedding(tweet):
    """ Convert a tweet to a matrix with one row per known word,
        using the Word2Vec embedding trained on Google News
    """
    E = []
    words = tweet.split()
    for word in words:
        if word in embeddingModel:
            E.append(embeddingModel[word])
    
    return np.array(E)
        

def create_dataset(filename):
    training_list = []
    label_list = []
    file = open(filename, "r")
    reader = csv.reader(file, delimiter=';')
    for tweet, author in reader:
        E = embedding(tweet)
        # keep only tweets with at least 3 embedded words (300 values each)
        if not E.size<3*300:
            training_list.append(E)
            label_list.append(int(author=='Trump'))
    file.close()

    return {'x': training_list, 'label': label_list}

Train_dataset = create_dataset('training.csv')
x_train = Train_dataset['x']
y_train = Train_dataset['label']

Test_dataset = create_dataset('test.csv')
x_test = Test_dataset['x']
y_test = Test_dataset['label']

#what is the length of the maximal sequence of words (for padding)
seq_length = max(max([x.shape[0] for x in x_train]), max([x.shape[0] for x in x_test]))

def zero_padding(X):
    for i in range(len(X)):
        X[i] = np.vstack((X[i], np.zeros((seq_length-X[i].shape[0],300))))

zero_padding(x_train)
zero_padding(x_test)

x_train = np.array(x_train)
x_test = np.array(x_test)
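
The embedding and zero-padding steps above need the full Google News model to run. The following sketch reproduces the same two steps on toy data so that the resulting shapes are easy to check. The dictionary `toy_vectors`, the dimension 4 and the helper names `embed` and `pad` are assumptions made for this illustration; the notebook itself uses 300-dimensional Word2Vec vectors.

```python
import numpy as np

dim = 4  # toy embedding dimension (the notebook uses 300)
# Hypothetical stand-in for the Word2Vec model: a plain dict of word vectors.
toy_vectors = {"great": np.ones(dim), "again": np.full(dim, 2.0)}

def embed(tweet, vectors):
    """Stack the vectors of the known words; unknown words are skipped."""
    rows = [vectors[w] for w in tweet.split() if w in vectors]
    return np.array(rows).reshape(-1, dim)

def pad(matrix, length):
    """Append zero rows so that every tweet has `length` rows."""
    return np.vstack((matrix, np.zeros((length - matrix.shape[0], dim))))

tweets = ["great again great", "again xyz"]
embedded = [embed(t, toy_vectors) for t in tweets]
max_len = max(m.shape[0] for m in embedded)
# After padding, all tweets share one shape and fit in a single 3D array.
batch = np.array([pad(m, max_len) for m in embedded])
print(batch.shape)
```

The resulting array has shape (number of tweets, longest sequence, embedding dimension), which is the input layout expected by the models below.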

## Model definition and optimization

### First model
A first, simpler implementation of the model given in the article: a single convolutional kernel size (3) with 128 feature maps, a global max pooling layer, and a fully connected layer to the one-node output.

Observed test set accuracy : 92-93%

In [None]:
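#imports needed by this cell and by the model cells below
from keras.models import Sequential, Model
from keras.layers import Input, Conv1D, GlobalMaxPooling1D, Concatenate, Dropout, Dense, LSTM
from keras.callbacks import EarlyStopping
from keras import regularizers
from keras.utils import plot_model

#architecture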
model = Sequential()
model.add(Conv1D(128, 3, activation='relu', input_shape=(seq_length,300), name="Convolution"))
model.add(GlobalMaxPooling1D(name="Pooling"))
model.add(Dense(1, activation='sigmoid', name="Output"))
model.summary()

#loss function and optimizer
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

#optimization with early stopping
earlyStopping = EarlyStopping(monitor='val_acc', patience=0, verbose=0, mode='auto')
model.fit(x_train, y_train, batch_size=50, epochs=10, callbacks=[earlyStopping], 
          validation_split=0.1, shuffle=True)

score = model.evaluate(x_test,y_test, batch_size=64)

#display accuracy and plot model
print("\nAccuracy on the test set : "+str(score[1])+"\n\n")
#plot_model(model, to_file="model1.png", show_shapes=True, show_layer_names=True)
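
To illustrate what the convolution and global max pooling layers of the first model compute on one tweet, here is a small NumPy sketch. It is a simplified illustration under stated assumptions (no padding, stride 1, random toy weights), not the Keras implementation, and the function names `conv1d_relu` and `global_max_pool` are chosen for this example.

```python
import numpy as np

def conv1d_relu(x, kernels, bias):
    """'Valid' 1D convolution over the word axis followed by ReLU.

    x:       (seq_length, emb_dim), one embedded tweet
    kernels: (kernel_size, emb_dim, n_filters)
    bias:    (n_filters,)
    """
    k = kernels.shape[0]
    steps = x.shape[0] - k + 1
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        # Each filter sees a window of k consecutive word vectors.
        window = x[t:t + k]
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

def global_max_pool(features):
    """Keep, for each filter, its strongest response over all positions."""
    return features.max(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 6))          # toy tweet: 10 words, 6-dim embeddings
kernels = rng.normal(size=(3, 6, 8))  # kernel size 3, 8 filters
feature_map = conv1d_relu(x, kernels, np.zeros(8))
pooled = global_max_pool(feature_map)
print(feature_map.shape, pooled.shape)
```

Because of the global max pooling, the size of the pooled vector depends only on the number of filters and not on the length of the tweet, which is why the dense output layer can follow directly.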

### Definition of a convolutional layer with different kernel sizes
A component of the next two models. It implements one convolutional layer with three kernel sizes (3, 4, 5), 100 feature maps each, followed by global max pooling.

In [None]:
inp = Input(shape=(seq_length,300), name="Convolution_Input")
convs = []
#1
conv = Conv1D(100, 3, activation='relu', name="Convolution_Ker_Size3")(inp)
pool = GlobalMaxPooling1D(name="Global_Pooling1")(conv)
convs.append(pool)
#2
conv = Conv1D(100, 4, activation='relu', name="Convolution_Ker_Size4")(inp)
pool = GlobalMaxPooling1D(name="Global_Pooling2")(conv)
convs.append(pool)
#3
conv = Conv1D(100, 5, activation='relu', name="Convolution_Ker_Size5")(inp)
pool = GlobalMaxPooling1D(name="Global_Pooling3")(conv)
convs.append(pool)
out = Concatenate(name="Merge")(convs)

conv_model = Model(inputs=inp, outputs=out)
conv_model.summary()

### Second model
Close to the model presented in the article: the three kernel sizes for the convolutional layer, dropout on the hidden layer with p=0.5, and an l2 penalty on the weights of the last layer (an l2 constraint in the article).

Observed test set accuracy : 92-93%

In [None]:
#architecture
model = Sequential()
model.add(conv_model)
model.add(Dropout(0.5, name="Dropout"))
model.add(Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01), name="Output"))
model.summary()

#loss function and optimizer
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

#optimization with early stopping
earlyStopping = EarlyStopping(monitor='val_acc', patience=0, verbose=0, mode='auto')
model.fit(x_train, y_train, batch_size=50, epochs=10, callbacks=[earlyStopping], 
          validation_split=0.1, shuffle=True)

score = model.evaluate(x_test,y_test, batch_size=64)

#display accuracy and plot model
print("\nAccuracy on the test set : "+str(score[1])+"\n\n")
#plot_model(model, to_file="model2.png", show_shapes=True, show_layer_names=True)
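
The two regularizers used in the second model can be sketched in NumPy as follows. This is a minimal illustration assuming the usual "inverted dropout" formulation and an l2 penalty equal to the coefficient times the sum of squared weights; the helper names `dropout` and `l2_penalty` and the toy weights are assumptions for this example.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training:
        return x  # dropout is only active during training
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def l2_penalty(weights, coeff):
    """Term added to the loss: coeff * sum of squared weights."""
    return coeff * np.sum(weights ** 2)

rng = np.random.default_rng(0)
hidden = np.ones((1, 300))  # 3 x 100 pooled features, as produced by conv_model
dropped = dropout(hidden, 0.5, rng)
penalty = l2_penalty(np.array([[0.5], [-1.0]]), 0.01)
print(np.unique(dropped), penalty)
```

With p=0.5, the surviving units are doubled so that the expected activation stays the same between training and test time.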

### Third model
A fully connected intermediate layer with 20 nodes is added before the output.

Observed test set accuracy : 92-93%

In [None]:
#architecture
model = Sequential()
model.add(conv_model)
model.add(Dropout(0.5, name="Dropout"))
model.add(Dense(20, activation='relu', kernel_regularizer=regularizers.l2(0.01), name="Intermediate_Dense"))
model.add(Dense(1, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01), name="Output"))
model.summary()

#loss function and optimizer
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

#optimization with early stopping
earlyStopping = EarlyStopping(monitor='val_acc', patience=0, verbose=0, mode='auto')
model.fit(x_train, y_train, batch_size=50, epochs=10, callbacks=[earlyStopping], 
          validation_split=0.1, shuffle=True)

score = model.evaluate(x_test,y_test, batch_size=64)

#display accuracy and plot model
print("\nAccuracy on the test set : "+str(score[1])+"\n\n")
#plot_model(model, to_file="model3.png", show_shapes=True, show_layer_names=True)

### Fourth model
Two stacked LSTMs.
The first has a 64-dimensional state and returns it at each time step (word). The second has a 32-dimensional state and only returns it at the end. This last state is then used to compute the output using a dense layer.

Observed test set accuracy : ~90%

In [None]:
#architecture
model = Sequential()
model.add(LSTM(64, return_sequences=True,input_shape=(seq_length,300), name="First_Stacked_LSTM"))
model.add(LSTM(32, name="Second_Stacked_LSTM"))
model.add(Dense(1, activation='sigmoid', name="Output"))
model.summary()

#loss function and optimizer
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

#optimization with early stopping
earlyStopping = EarlyStopping(monitor='val_acc', patience=0, verbose=0, mode='auto')
model.fit(x_train, y_train, batch_size=50, epochs=20, callbacks=[earlyStopping], 
          validation_split=0.1, shuffle=True)

score = model.evaluate(x_test,y_test, batch_size=64)

#display accuracy and plot model
print("\nAccuracy on the test set : "+str(score[1])+"\n\n")
#plot_model(model, to_file="model4.png", show_shapes=True, show_layer_names=True)
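
The difference between returning the state at every time step and returning only the last one, as in the fourth model, can be seen on a small NumPy forward pass. This is a simplified sketch of a standard LSTM cell with random toy weights; the function `lstm_forward`, the gate ordering in the stacked matrices and the weight scale are assumptions made for this illustration, not the Keras internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, W, U, b, return_sequences=False):
    """Run one LSTM layer over a single sequence.

    x: (timesteps, input_dim), W: (input_dim, 4*units),
    U: (units, 4*units), b: (4*units,).
    The stacked gates are ordered input, forget, cell candidate, output.
    """
    units = U.shape[0]
    h = np.zeros(units)
    c = np.zeros(units)
    states = []
    for x_t in x:
        z = x_t @ W + h @ U + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the cell state
        h = sigmoid(o) * np.tanh(c)                   # new hidden state
        states.append(h)
    return np.array(states) if return_sequences else h

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 300))  # toy tweet: 5 words, 300-dim embeddings
W1 = rng.normal(scale=0.1, size=(300, 4 * 64))
U1 = rng.normal(scale=0.1, size=(64, 4 * 64))
W2 = rng.normal(scale=0.1, size=(64, 4 * 32))
U2 = rng.normal(scale=0.1, size=(32, 4 * 32))

# First layer: one 64-dim state per word; second layer: only the final 32-dim state.
seq = lstm_forward(x, W1, U1, np.zeros(4 * 64), return_sequences=True)
last = lstm_forward(seq, W2, U2, np.zeros(4 * 32))
print(seq.shape, last.shape)
```

The first layer has to return the whole sequence because the second LSTM needs one input vector per time step, while the dense output layer only needs a single vector per tweet.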

### Comments
There is no big difference in performance between the models. Training times were roughly the same, ~1-2 min.

Interestingly, the accuracy on the training set was usually far better than the accuracy on the test set for the convolutional models, but this was not observed for the two stacked LSTMs.

More than 90% accuracy seems acceptable, since thanks to the embedding the model works on the semantics of the words used rather than on the syntax (assuming the embedding reflects the semantics).