#### This notebook implements binary tweet classification using a convolutional neural network with pre-trained GloVe embeddings.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.


#### To download the pre-trained GloVe embeddings, please refer to the following link:

https://nlp.stanford.edu/projects/glove/

###### glove.twitter.27B.25d.txt should be located in the same directory as this notebook

TensorFlow:  
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo,
Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis,
Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow,
Andrew Harp, Geoffrey Irving, Michael Isard, Rafal Jozefowicz, Yangqing Jia,
Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Mike Schuster,
Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Jonathon Shlens,
Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker,
Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas,
Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke,
Yuan Yu, and Xiaoqiang Zheng.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.  
Software available from tensorflow.org.

In [17]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, MaxPooling1D
from sklearn.model_selection import train_test_split
print(tf.__version__)

from load_utils import load_glove_embedding, load_vocabulary, load_tweets

2.3.1


This notebook has been run using TensorFlow version 2.3.1.

#### Load and preprocess the data

In [None]:
#import the preprocessing function adapted from Stanford's GloVe script for tweets
import stanford_preprocessing

In [2]:
tweets = load_tweets(full=True)

loaded 2500000 tweets in dataframe with columns: Index(['text', 'label'], dtype='object')


In [3]:
#apply preprocessing for GloVe
tweets['tokenized'] = tweets.text.apply(stanford_preprocessing.tokenize)

In [4]:
x = tweets['tokenized']
y = tweets.label

In [5]:
#max_features is the maximum vocabulary size, chosen for reasonable memory cost
max_features=50000
#max_text_length is the maximum number of words kept per tweet
max_text_length=40

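#use the Keras tokenizer to map each word to an integer index (most frequent words first)
x_tokenizer=text.Tokenizer(num_words=max_features)
x_tokenizer.fit_on_texts(list(x))
x_tokenized=x_tokenizer.texts_to_sequences(x)

#zero padding for sequences with fewer words than max_text_length
#(by default, zeros are added at the start and longer sequences are cut from the start)
x_train_val=sequence.pad_sequences(x_tokenized,maxlen=max_text_length)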
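
To make the padding step concrete, here is a small pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): shorter sequences receive zeros on the left, and longer ones keep only their last `maxlen` tokens. The helper name `pad_pre` is hypothetical and only illustrates the behaviour; it is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic the default pad_sequences behaviour (pre-padding, pre-truncating).

    Assumes maxlen is a positive integer.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep at most the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# One sequence that is too short and one that is too long for maxlen=4
example = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(example)
```

Because index 0 is reserved for padding, the Keras tokenizer never assigns it to a word.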

In [6]:
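#import pre-trained Twitter GloVe embeddings with 25 features
embedding_dim=25
embedding_index=dict()
#the with-block closes the file even if parsing fails; the GloVe file is UTF-8 encoded
with open("glove.twitter.27B.25d.txt",encoding="utf-8") as f:
    for line in f:
        #each line holds a word followed by its embedding_dim coefficients
        values=line.split()
        word=values[0]
        coefs=np.asarray(values[1:],dtype='float32')
        embedding_index[word]=coefs
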
print(f'Found {len(embedding_index)} word vectors')

Found 1193514 word vectors
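
The GloVe text format is one word per line followed by its coefficients, separated by single spaces. As a sketch of a slightly more defensive parser, the hypothetical helper `parse_glove` below skips any row whose field count does not match the expected dimension. The sample is a made-up three-dimensional stand-in for the real 25-dimensional file, and whether the real file contains malformed rows is not something this notebook establishes.

```python
import io


def parse_glove(lines, dim):
    """Parse GloVe-style text lines into {word: [floats]}, skipping malformed rows."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # wrong number of fields for this dimension
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


# Toy data: the second row is deliberately too short and gets skipped
sample = io.StringIO("hello 0.1 0.2 0.3\nbroken 0.5\nworld -1.0 0.0 2.5\n")
vectors = parse_glove(sample, dim=3)
print(sorted(vectors))
```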


In [7]:
#create embeddings matrix for Keras Embedding Layer
#rows for words missing from GloVe (and row 0, reserved for padding) stay all-zero
embedding_matrix=np.zeros((max_features,embedding_dim))
for word,index in x_tokenizer.word_index.items():
    #word_index is ordered by frequency, so every later index is also out of range
    if index>=max_features:
        break
    embedding_vector=embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[index]=embedding_vector
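
Words that GloVe does not contain keep an all-zero row in the embedding matrix, so it can be useful to check how much of the vocabulary is actually covered. The sketch below counts covered words on toy dictionaries; `embedding_coverage` and the toy data are hypothetical, and in the notebook one would pass `x_tokenizer.word_index`, `embedding_index`, and `max_features` instead.

```python
def embedding_coverage(word_index, embedding_index, max_features):
    """Return (covered, total) over the words whose index fits in the matrix."""
    kept = [word for word, index in word_index.items() if index < max_features]
    covered = sum(1 for word in kept if word in embedding_index)
    return covered, len(kept)


# Toy vocabulary: "rare" has index 4 and falls outside a matrix with 4 rows
toy_word_index = {"the": 1, "cat": 2, "zzzxq": 3, "rare": 4}
toy_embeddings = {"the": [0.1], "cat": [0.2], "rare": [0.3]}

covered, total = embedding_coverage(toy_word_index, toy_embeddings, max_features=4)
print(f"{covered}/{total} words have a pre-trained vector")
```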

##### The fairly generic architecture below was not tuned; its accuracy could likely be improved, but we focused our effort on other methods. This notebook shows that fine-tuning the GloVe embeddings with supervised learning yields better results than keeping them frozen (85.76 % versus 82.47 % validation accuracy) when little time is spent searching for hyperparameters or tailoring the architecture.
##### It also shows that reaching good accuracy is much easier with pre-trained embeddings than with embeddings we train ourselves, particularly when they were pre-trained on a similar dataset.

#### Training with frozen pre-trained GloVe embeddings: best validation accuracy is 82.47 %

In [8]:
#build Sequential model, first layer is Embedding Layer (not trainable because already trained with unsupervised learning)
model=Sequential()
model.add(Embedding(max_features,
                   embedding_dim,
                   embeddings_initializer=tf.keras.initializers.Constant(
                   embedding_matrix),
                   trainable=False))

In [9]:
#ConvNet Architecture for Classification on word Embeddings
filters=250
kernel_size=3
hidden_dims=250

#note: no activation is passed to this first Conv1D, so it is a linear convolution
model.add(Conv1D(filters,
                kernel_size,
                padding='valid'))
model.add(MaxPooling1D())
model.add(Conv1D(filters,
                5,
                padding='valid',
                activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_dims,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1,activation='sigmoid'))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 25)          1250000   
_________________________________________________________________
conv1d (Conv1D)              (None, None, 250)         19000     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, None, 250)         0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 250)         312750    
_________________________________________________________________
global_max_pooling1d (Global (None, 250)               0         
_________________________________________________________________
dense (Dense)                (None, 250)               62750     
_________________________________________________________________
dropout (Dropout)            (None, 250)               0
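
The parameter counts in the summary can be checked by hand: a `Conv1D` layer has `kernel_size * in_channels * filters` weights plus one bias per filter, and a `Dense` layer has `in_units * out_units` weights plus one bias per output unit. The sketch below redoes this arithmetic and also follows the sequence length through the network for inputs padded to 40 tokens (the summary shows `None` because no input length is passed to the model). The helper names are hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


def valid_conv_length(length, kernel_size):
    """Output length of a stride-1 convolution with padding='valid'."""
    return length - kernel_size + 1


embedding = 50000 * 25                 # max_features * embedding_dim
conv_a = conv1d_params(3, 25, 250)     # first Conv1D
conv_b = conv1d_params(5, 250, 250)    # second Conv1D
dense = dense_params(250, 250)         # first Dense layer
print(embedding, conv_a, conv_b, dense)

steps = 40                             # max_text_length
steps = valid_conv_length(steps, 3)    # after the first Conv1D
steps = steps // 2                     # MaxPooling1D with default pool_size=2
steps = valid_conv_length(steps, 5)    # after the second Conv1D
print(steps)
```

`GlobalMaxPooling1D` then reduces the remaining time steps to a single 250-dimensional vector per tweet.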

In [10]:
model.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])

In [11]:
#hold out 33 % of the tweets for validation (named X_test/y_test below)
X_train, X_test, y_train, y_test = train_test_split(x_train_val, y, test_size=0.33, random_state=42)

In [12]:
#the following batch size and epoch number are close to optimal,
#they have been found by manually testing different values

batch_size=128
epochs=15
model.fit(X_train,y_train,
         batch_size=batch_size,
         epochs=epochs,
         validation_data=(X_test,y_test))

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x1ae03f7090>
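
The `History` object returned by `model.fit` stores per-epoch metrics in its `history` dictionary (with `metrics=['accuracy']` in TensorFlow 2.x, the validation key is `val_accuracy`). A minimal sketch of how the best validation epoch could be read from it follows; `best_epoch` is a hypothetical helper, and the numbers are placeholders for illustration only, not results from this notebook.

```python
def best_epoch(history_dict, metric="val_accuracy"):
    """Return (1-based epoch, value) of the highest value recorded for metric."""
    values = history_dict[metric]
    best = max(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]


# Placeholder values, NOT taken from the training runs in this notebook
toy_history = {"val_accuracy": [0.70, 0.75, 0.74]}

epoch, value = best_epoch(toy_history)
print(f"best epoch: {epoch}, val_accuracy: {value:.2f}")
```

In the notebook this would require keeping the return value, for example `history = model.fit(...)`, and then calling `best_epoch(history.history)`.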

#### Fine-tuning the pre-trained GloVe embeddings: best validation accuracy is 85.76 %

In [13]:
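#same architecture as before, but the Embedding layer is now trainable,
#so the GloVe vectors are updated during supervised training
model=Sequential()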
model.add(Embedding(max_features,
                   embedding_dim,
                   embeddings_initializer=tf.keras.initializers.Constant(
                   embedding_matrix),
                   trainable=True))

In [14]:
filters=250
kernel_size=3
hidden_dims=250

model.add(Conv1D(filters,
                kernel_size,
                padding='valid'))
model.add(MaxPooling1D())
model.add(Conv1D(filters,
                5,
                padding='valid',
                activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(hidden_dims,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1,activation='sigmoid'))

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 25)          1250000   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, None, 250)         19000     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, None, 250)         0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, None, 250)         312750    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 250)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 250)               62750     
_________________________________________________________________
dropout_1 (Dropout)          (None, 250)              

In [15]:
model.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])

In [16]:
batch_size=128
epochs=6
model.fit(X_train,y_train,
         batch_size=batch_size,
         epochs=epochs,
         validation_data=(X_test,y_test))

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


<tensorflow.python.keras.callbacks.History at 0x1ab9bdc710>