# Training Tweets with GloVe Embedding Word Vectorization 

The GloVe embeddings and vocabulary were trained in the GloveEmbeddings.ipynb notebook.

As a reminder, we implemented three preprocessing tasks:
- (1) Creating tokens that represent numerical and/or textual patterns such as emoticons, word elongation, numbers, and repeated punctuation.
- (2) Hashtag processing: adding a \<hashtag\> token to quantify the use of hashtags, and splitting hashtags into words known to the vocabulary.
- (3) Replacing \<hashtag\> with a stopword token.
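
To make task (1) concrete, here is a minimal sketch of pattern tokenization with regular expressions. The function name, the token names (`<emoticon>`, `<number>`, `<repeat>`, `<elong>`) and the patterns themselves are assumptions made for illustration; the project's actual preprocessing may differ.

```python
import re

# Hypothetical patterns and token names, for illustration only.
EMOTICON_RE = re.compile(r"[:;=]-?[)(DPp]")            # e.g. :) ;-( =D
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")         # e.g. 42, 3.14, 1,000
REPEAT_PUNCT_RE = re.compile(r"([!?.])\1+")            # e.g. !!!, ??, ...
ELONG_RE = re.compile(r"\b(\w*?)(\w)\2{2,}(\w*)\b")    # a character repeated 3+ times


def tokenize_patterns(text: str) -> str:
    """Replace emoticons, numbers, repeated punctuation and elongated
    words with marker tokens, then normalize whitespace."""
    text = EMOTICON_RE.sub(" <emoticon> ", text)
    text = NUMBER_RE.sub(" <number> ", text)
    # Keep one punctuation mark and flag the repetition.
    text = REPEAT_PUNCT_RE.sub(r" \1 <repeat> ", text)
    # Collapse the first run of 3+ identical characters in a word and flag it.
    text = ELONG_RE.sub(r"\1\2\3 <elong> ", text)
    return " ".join(text.split())


print(tokenize_patterns("sooooo cool !!! :) 42"))
```

Numbers are replaced before elongated words so that a run of identical digits is counted as a number rather than as an elongation.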

We decided to test four combinations of these preprocessing tasks:
- 0 No preprocessing 
- 1 Tokenization (1)
- 2 Tokenization and Hashtag Split (1) and (2)
- 3 Tokenization and Stop Words (1) and (3)

The embeddings for each preprocessing option are stored in the embeddings folder. 
The corresponding vocabularies are stored in the vocab folder. 

In [5]:
import numpy as np
import pandas as pd
import tensorflow as tf
import pickle as pkl

from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

from load_utils_pp import *
from FeaturesBuilder import FeaturesBuilder

## Word Vector Average 

We started by averaging the word vectors of each tweet so that we could train a neural network and get a baseline result. Here are the results for the unprocessed tweets. \
We use FeaturesBuilder, a class we created, to build the average embedding of each tweet. 
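
The internals of FeaturesBuilder are not shown here, so the following is only a sketch of the averaging step. It assumes the vocabulary is a dict that maps each word to its row in the embedding matrix; the function name is hypothetical.

```python
import numpy as np


def average_tweet_embedding(tweet, vocab, word_vects):
    """Return the mean embedding of the in-vocabulary words of a tweet,
    or None when the tweet contains no known word.

    Assumes `vocab` maps word -> row index in `word_vects`.
    """
    idx = [vocab[w] for w in tweet.split() if w in vocab]
    if not idx:
        return None  # nothing to average
    return word_vects[idx].mean(axis=0)


# Toy example with 2-dimensional embeddings.
toy_vocab = {"good": 0, "bad": 1}
toy_vects = np.array([[1.0, 3.0], [3.0, 5.0]])
print(average_tweet_embedding("good bad unknown", toy_vocab, toy_vects))
```

Skipping tweets that contain no known word is one plausible reason for a feature matrix with slightly fewer rows than there are tweets.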

In [6]:
# Loading tweets and vocabulary
vocab = load_vocabulary_pp(0)
tweets_unprocessed = load_tweets_pp(0) 

# Load glove embedding
word_vects = load_glove_embedding_pp(0)
EMBEDDING_FEATURES = word_vects.shape[1]

loaded vocabulary containing 101298 words
loaded 200000 tweets in dataframe with columns: Index(['text', 'label'], dtype='object')
loaded glove embedding with shape (101298, 20)


In [7]:
# Create features_builder object
features_builder = FeaturesBuilder(tweets_unprocessed, vocab, word_vects)

In [8]:
# Build the average of word vectors over each tweet 

X, y = features_builder.build_avg_tweet_embedding()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


built features with shape (199995, 20)


In [9]:
model = keras.Sequential(
    [
        layers.Dense(1, activation="sigmoid", name="out"),
    ]
)

model.compile( optimizer=keras.optimizers.Adam(),
               loss=keras.losses.BinaryCrossentropy(),
               metrics=[keras.metrics.BinaryAccuracy()] )

history = model.fit( X_train, y_train,
                     batch_size=128, epochs=20,
                     validation_data=(X_test.astype("float32"), y_test.astype("float32")) )

print(model.summary())

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
out (Dense)                  (None, 1)                 21        
Total params: 21
Trainable params: 21
Non-trainable params: 0
_________________________________________________________________
None


We did not spend much time optimizing this baseline, as we quickly obtained better results using sequences of word embeddings. 

## Sequence of 30 word embeddings per tweet 

A more advanced approach to classifying the tweets is to represent each tweet as a sequence of 30 word vectors, padding with null vectors when necessary. \
We use our own FeaturesBuilder to create the padded sequences. 
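
The padding step can be sketched as follows. This is an illustration rather than the FeaturesBuilder code: it assumes a dict vocabulary that maps words to embedding rows, zero vectors appended at the end of short tweets, and truncation of tweets longer than 30 known words. The function name is hypothetical.

```python
import numpy as np


def build_padded_sequence(tweet, vocab, word_vects, max_len=30):
    """Return a (max_len, embedding_dim) array for one tweet.

    Known words are looked up in `word_vects`; the remaining positions
    are left as zero vectors. Extra words beyond `max_len` are dropped.
    """
    dim = word_vects.shape[1]
    seq = np.zeros((max_len, dim), dtype="float32")
    idx = [vocab[w] for w in tweet.split() if w in vocab][:max_len]
    if idx:
        seq[: len(idx)] = word_vects[idx]
    return seq


# Toy example with 2-dimensional embeddings.
toy_vocab = {"good": 0, "bad": 1}
toy_vects = np.array([[1.0, 3.0], [3.0, 5.0]])
print(build_padded_sequence("good bad", toy_vocab, toy_vects).shape)
```

Stacking one such array per tweet gives a tensor of shape (number of tweets, 30, embedding dimension), which is the input format expected by the recurrent layers.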

### Unprocessed tweets

In [10]:
# Create features_builder object
features_builder = FeaturesBuilder(tweets_unprocessed, vocab, word_vects)

In [11]:
# Build the 30 word vectors sequences per tweet 

X, y = features_builder.build_word_embedding_sequences()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

built features with shape (200000, 30, 20)


In [12]:
model = keras.Sequential(
    [
        layers.Bidirectional(layers.LSTM(20, name="lstm", return_sequences=True)),
        layers.Bidirectional(layers.LSTM(40, name="lstm", return_sequences=True)),
        layers.Bidirectional(layers.LSTM(20, name="lstm")),

        layers.Dense(1, activation="sigmoid", name="out"),
    ]
)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.BinaryCrossentropy(),
    metrics=[keras.metrics.BinaryAccuracy()],
)

history = model.fit(
    X_train,
    y_train,
    batch_size=64,
    epochs=40,
    verbose=2, 
    # We pass some validation for
    # monitoring validation loss and metrics
    # at the end of each epoch
    validation_data=(X_test.astype("float32"), y_test.astype("float32")),
)

print(model.summary())

Epoch 1/40
2094/2094 - 41s - loss: 0.6080 - binary_accuracy: 0.6368 - val_loss: 0.5943 - val_binary_accuracy: 0.6421
Epoch 2/40
2094/2094 - 40s - loss: 0.5885 - binary_accuracy: 0.6482 - val_loss: 0.5856 - val_binary_accuracy: 0.6464
Epoch 3/40
2094/2094 - 40s - loss: 0.5808 - binary_accuracy: 0.6532 - val_loss: 0.5795 - val_binary_accuracy: 0.6526
Epoch 4/40
2094/2094 - 40s - loss: 0.5743 - binary_accuracy: 0.6588 - val_loss: 0.5754 - val_binary_accuracy: 0.6559
Epoch 5/40
2094/2094 - 40s - loss: 0.5677 - binary_accuracy: 0.6636 - val_loss: 0.5747 - val_binary_accuracy: 0.6584
Epoch 6/40
2094/2094 - 40s - loss: 0.5609 - binary_accuracy: 0.6710 - val_loss: 0.5653 - val_binary_accuracy: 0.6661
Epoch 7/40
2094/2094 - 40s - loss: 0.5544 - binary_accuracy: 0.6772 - val_loss: 0.5702 - val_binary_accuracy: 0.6518
Epoch 8/40
2094/2094 - 40s - loss: 0.5478 - binary_accuracy: 0.6846 - val_loss: 0.5590 - val_binary_accuracy: 0.6726
Epoch 9/40
2094/2094 - 39s - loss: 0.5409 - binary_accuracy: 0.6

In [13]:
# Saving the model results in a file 
with open('models/model_history_pp0', 'wb') as file:
    pkl.dump(history.history, file)

### Preprocessing option 1 - Add tokens 

In [14]:
# Loading tweets and vocabulary
vocab = load_vocabulary_pp(1)
tweets_pp1 = load_tweets_pp(1) 

# Load glove embedding
word_vects = load_glove_embedding_pp(1)
EMBEDDING_FEATURES = word_vects.shape[1]

loaded vocabulary containing 92334 words
loaded 200000 tweets in dataframe with columns: Index(['text', 'label'], dtype='object')
loaded glove embedding with shape (92334, 20)


In [15]:
# Create features_builder object
features_builder = FeaturesBuilder(tweets_pp1, vocab, word_vects)

In [16]:
# Build the 30 word vectors sequences per tweet 

X, y = features_builder.build_word_embedding_sequences()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

built features with shape (200000, 30, 20)


In [17]:
model = keras.Sequential(
    [
        layers.Bidirectional(layers.LSTM(20, name="lstm", return_sequences=True)),
        layers.Bidirectional(layers.LSTM(40, name="lstm", return_sequences=True)),
        layers.Bidirectional(layers.LSTM(20, name="lstm")),
        layers.Dense(1, activation="sigmoid", name="out"),
    ]
)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.BinaryCrossentropy(),
    metrics=[keras.metrics.BinaryAccuracy()],
)

history = model.fit(
    X_train,
    y_train,
    batch_size=64,
    epochs=30,
    # We pass some validation for
    # monitoring validation loss and metrics
    # at the end of each epoch
    validation_data=(X_test.astype("float32"), y_test.astype("float32")),
)

print(model.summary())

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_3 (Bidirection (None, 30, 40)            6560      
_________________________________________________________________
bidirectional_4 (Bidirection (None, 30, 80)            25920     
_________________________________________________________________
bidirectional_5 (Bidirection (None, 40)                16160     
_________________________________________________________________
out (Dense)                  (None, 1)                 41        
Total params: 48,681
Trainable pa

In [18]:
# Saving the model results in a file 
with open('models/model_history_pp1', 'wb') as file:
    pkl.dump(history.history, file)
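
The parameter counts in the model summary above can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and one bias vector) and 20-dimensional input embeddings; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM: 4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


def bilstm_params(input_dim, units):
    """A bidirectional wrapper holds two independent LSTMs."""
    return 2 * lstm_params(input_dim, units)


counts = [
    bilstm_params(20, 20),  # input: 20-d embeddings, output: 2 * 20 = 40
    bilstm_params(40, 40),  # input: 40, output: 2 * 40 = 80
    bilstm_params(80, 20),  # input: 80, output: 2 * 20 = 40
    40 + 1,                 # dense layer: 40 weights + 1 bias
]
print(counts, sum(counts))
```

The same reasoning gives 20 + 1 = 21 parameters for the single dense layer of the averaged-embedding baseline.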

### Preprocessing option 2 - Tokens + Hashtag Split 

In [19]:
# Loading tweets and vocabulary
vocab = load_vocabulary_pp(2)
tweets_pp2 = load_tweets_pp(2) 

# Load glove embedding
word_vects = load_glove_embedding_pp(2)
EMBEDDING_FEATURES = word_vects.shape[1]

loaded vocabulary containing 92335 words
loaded 200000 tweets in dataframe with columns: Index(['text', 'label'], dtype='object')
loaded glove embedding with shape (92335, 20)


In [20]:
# Create features_builder object
features_builder = FeaturesBuilder(tweets_pp2, vocab, word_vects)

In [21]:
# Build the 30 word vectors sequences per tweet 

X, y = features_builder.build_word_embedding_sequences()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

built features with shape (200000, 30, 20)


In [22]:
model = keras.Sequential(
    [
        layers.Bidirectional(layers.LSTM(20, name="lstm", return_sequences=True)),
        layers.Bidirectional(layers.LSTM(40, name="lstm", return_sequences=True)),
        layers.Bidirectional(layers.LSTM(20, name="lstm")),
        layers.Dense(1, activation="sigmoid", name="out"),
    ]
)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.BinaryCrossentropy(),
    metrics=[keras.metrics.BinaryAccuracy()],
)

history = model.fit(
    X_train,
    y_train,
    batch_size=64,
    epochs=40,
    # We pass some validation for
    # monitoring validation loss and metrics
    # at the end of each epoch
    validation_data=(X_test.astype("float32"), y_test.astype("float32")),
)

print(model.summary())

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_6 (Bidirection (None, 30, 40)            6560      
_________________________________________________________________
bidirectional_7 (Bidirection (None, 30, 80)            25920     
_________________________________________________________________
bidirectional_8 (Bidirection (None, 40)                16160     
_____________________________________________

In [23]:
# Saving the model results in a file 
with open('models/model_history_pp2', 'wb') as file:
    pkl.dump(history.history, file)

### Preprocessing option 3 - Tokens + Stopwords

In [24]:
# Loading tweets and vocabulary
vocab = load_vocabulary_pp(3)
tweets_pp3 = load_tweets_pp(3) 

# Load glove embedding
word_vects = load_glove_embedding_pp(3)
EMBEDDING_FEATURES = word_vects.shape[1]

loaded vocabulary containing 92335 words
loaded 200000 tweets in dataframe with columns: Index(['text', 'label'], dtype='object')
loaded glove embedding with shape (92335, 20)


In [25]:
# Create features_builder object
features_builder = FeaturesBuilder(tweets_pp3, vocab, word_vects)

In [26]:
# Build the 30 word vectors sequences per tweet 

X, y = features_builder.build_word_embedding_sequences()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

built features with shape (200000, 30, 20)


In [27]:
model = keras.Sequential(
    [
        layers.Bidirectional(layers.LSTM(20, name="lstm", return_sequences=True)),
        layers.Bidirectional(layers.LSTM(40, name="lstm", return_sequences=True)),
        layers.Bidirectional(layers.LSTM(20, name="lstm")),
        layers.Dense(1, activation="sigmoid", name="out"),
    ]
)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.BinaryCrossentropy(),
    metrics=[keras.metrics.BinaryAccuracy()],
)

history = model.fit(
    X_train,
    y_train,
    batch_size=64,
    epochs=40,
    # We pass some validation for
    # monitoring validation loss and metrics
    # at the end of each epoch
    validation_data=(X_test.astype("float32"), y_test.astype("float32")),
)

print(model.summary())

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_9 (Bidirection (None, 30, 40)            6560      
_________________________________________________________________
bidirectional_10 (Bidirectio (None, 30, 80)            25920     
_________________________________________________________________
bidirectional_11 (Bidirectio (None, 40)                16160     
_____________________________________________

In [28]:
# Saving the model results in a file 
with open('models/model_history_pp3', 'wb') as file:
    pkl.dump(history.history, file)

## Plotting the results 

Here are the results for the different preprocessing options. The graph suggests that the best option is to use tokenization only (option 1), as the two other methods seem to worsen the accuracy. 

In [29]:
# Plot 

plot_accuracies()

NameError: ignored
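
The `plot_accuracies` helper is not defined in this notebook, hence the error above. A minimal sketch of such a helper is given below. It assumes the pickled history files saved earlier (`models/model_history_pp0` to `models/model_history_pp3`) and the `val_binary_accuracy` key recorded by Keras; the signature and the use of matplotlib are assumptions, not the original implementation.

```python
import os
import pickle as pkl

import matplotlib.pyplot as plt


def plot_accuracies(options=(0, 1, 2, 3), folder="models", metric="val_binary_accuracy"):
    """Plot one curve per preprocessing option from the saved training histories."""
    fig, ax = plt.subplots()
    for option in options:
        path = os.path.join(folder, f"model_history_pp{option}")
        with open(path, "rb") as file:
            history = pkl.load(file)  # dict: metric name -> list of per-epoch values
        values = history[metric]
        ax.plot(range(1, len(values) + 1), values, label=f"option {option}")
    ax.set_xlabel("epoch")
    ax.set_ylabel(metric)
    ax.legend()
    return ax
```

With the default arguments, calling `plot_accuracies()` reads the four history files from the models folder. Outside a notebook, call `plt.show()` afterwards to display the figure.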