### Long Short-Term Memory model with Transfer Learning (see Sections 3.2 and 3.3 of the thesis)

This notebook trains and tests the Long Short-Term Memory (LSTM) model with Transfer Learning that was proposed in the thesis. The model was implemented using TensorFlow, with Gensim used to import the Word2Vec embeddings.

Please note that the import of the Word2Vec embeddings can take several minutes.

Please keep in mind that these notebooks are primarily used for conducting experiments, live coding, and implementing and evaluating the approaches presented in the thesis. As a result, the code in this notebook may not strictly adhere to best practice coding standards.

In [None]:
# only execute once
# ONLY IF USED ON LOCAL VIEW
import os

# Getting the parent directory
os.chdir("..")
os.chdir("..")

In [None]:
from google.colab import drive
import urllib.request

drive.mount('/content/gdrive')

# Downloads the pretrained embeddings. Note that the archive is 4.2 GB.
url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.bin.gz'
file_path = '/content/gdrive/MyDrive/Experiment/DL models (with TL)/cc.de.300.bin.gz'
# uncomment this to download locally
#file_path = 'cc.de.300.bin.gz'

# Download the file
urllib.request.urlretrieve(url, file_path)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


('/content/gdrive/MyDrive/Experiment/DL modelle (with TL)/cc.de.300.bin.gz',
 <http.client.HTTPMessage at 0x7f1c55d99d50>)

### Note that you need to unzip the file before executing the next cells!
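
One way to unpack the archive without leaving the notebook is Python's built-in `gzip` module. The following is a minimal sketch and not part of the original experiments: the helper name `gunzip_file` and the paths in the commented call are assumptions, so adjust them to wherever the archive was stored and to the filename the loading cell expects.

```python
import gzip
import shutil


def gunzip_file(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress the gzip archive at src_path into dst_path (hypothetical helper)."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Stream in chunks so a multi-gigabyte archive is never held fully in memory.
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Example call (paths are assumptions):
# gunzip_file("cc.de.300.bin.gz", "cc.de.300.bin")
```

Streaming the copy keeps memory use flat, which matters for an archive of this size on a Colab instance.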

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from gensim.models import KeyedVectors

def import_test_train(local):
  """
  Imports the pretrained embeddings and the train and test sets and returns them.

  :param local: If True, the files are read from the local project directory.
  Otherwise Google Drive is mounted and the files are read from your Drive folders.
  :return: Tuple (df_test, df_train, vecs).
  """

  assert isinstance(local, bool), f"Type is not valid. Expected boolean, received: {type(local)}"

  if local:
    base = './Experiment'
  else:
    from google.colab import drive
    drive.mount('/content/gdrive')
    base = '/content/gdrive/MyDrive/Experiment'

  df_test = pd.read_csv(f'{base}/testset_DE_Trigger.csv')
  df_train = pd.read_csv(f'{base}/trainset_DE_Trigger.csv')
  vecs = KeyedVectors.load_word2vec_format(f'{base}/DL models (with TL)/cc.de.300.vec')

  return df_test, df_train, vecs

# importing test and trainset (Google Colab)
df_test, df_train, vecs = import_test_train(False)

# If you want to use it locally, make sure to execute the notebooks from the root directory of this project and use the following line instead:
# df_test, df_train, vecs = import_test_train(True)

## Define Labels as numbers

In [None]:
MAX_NB_WORDS = 39000
MAX_SEQUENCE_LENGTH = 150
HIDDEN_DIM = 300

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True)
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
tokenizer.fit_on_texts(pd.concat([df_test, df_train])["content"].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 44770 unique tokens.


In [None]:
X_train = tokenizer.texts_to_sequences(df_train['content'].values)
X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)

Y_train = pd.get_dummies(df_train['label']).values

X_test = tokenizer.texts_to_sequences(df_test['content'].values)
X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)

Y_test = pd.get_dummies(df_test['label']).values
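
Called with its default arguments, as above, `pad_sequences` pads and truncates at the front of each sequence, so texts longer than `MAX_SEQUENCE_LENGTH` keep only their last tokens. The pure-Python sketch below imitates that default behavior on toy data; `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncation keeps the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


print(pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0, 0, 5, 3], [3, 4, 5, 6]]
```

The padding value 0 never collides with a real token because the tokenizer assigns word indices starting at 1.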

In [None]:
vector_size = 300
gensim_weight_matrix = np.zeros((MAX_NB_WORDS, vector_size))

for word, index in tokenizer.word_index.items():
    # Tokenizer indices start at 1; row 0 stays zero and is used for padding
    if index < MAX_NB_WORDS:
        # Words without a pretrained vector keep their all-zero row.
        # `word in vecs` replaces the deprecated `vecs.wv.vocab` lookup.
        if word in vecs:
            gensim_weight_matrix[index] = vecs[word]
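
Because words without a pretrained vector end up with an all-zero row, it can be useful to check how much of the vocabulary is actually covered. The sketch below shows the counting logic on toy data; `embedding_coverage` and the toy dictionaries are assumptions for illustration. In the notebook, the arguments would be `tokenizer.word_index`, the pretrained vocabulary, and `MAX_NB_WORDS`.

```python
def embedding_coverage(word_index, known_words, max_words):
    """Return (covered, total) for words whose index falls inside the matrix."""
    in_range = [w for w, i in word_index.items() if i < max_words]
    covered = sum(1 for w in in_range if w in known_words)
    return covered, len(in_range)


# Toy stand-ins for tokenizer.word_index and the pretrained vocabulary.
toy_index = {"gut": 1, "schlecht": 2, "xyzzy": 3, "selten": 4}
toy_known = {"gut", "schlecht", "selten"}

covered, total = embedding_coverage(toy_index, toy_known, max_words=4)
print(f"{covered}/{total} words have a pretrained vector")
# 2/3 words have a pretrained vector
```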


## Define model and saving path

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, BatchNormalization, Dropout

def emotion_model():
    model = Sequential()
    # Embedding layer initialised with the pretrained vectors (transfer learning)
    model.add(Embedding(MAX_NB_WORDS, HIDDEN_DIM, input_length=X_train.shape[1], weights=[gensim_weight_matrix]))
    model.add(LSTM(100, dropout=0.2))
    model.add(BatchNormalization())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.2))
    # Softmax output over the five one-hot encoded label columns
    model.add(Dense(5, activation='softmax'))

    model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
                  optimizer='adam', metrics=[tf.keras.metrics.AUC()])

    return model
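
To get a feel for where the model's capacity sits, the parameter counts can be worked out by hand from the layer sizes used in `emotion_model`. The sketch below uses the standard formulas for a single-bias LSTM and a dense layer; the helper names are hypothetical, and the result is an estimate that should agree with `model.summary()` only under that assumption.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and one bias vector.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units


counts = {
    "embedding": 39000 * 300,        # MAX_NB_WORDS x HIDDEN_DIM
    "lstm": lstm_params(300, 100),
    "batch_norm": 4 * 100,           # gamma, beta, moving mean, moving variance
    "dense_256": dense_params(100, 256),
    "dense_5": dense_params(256, 5),
}

for name, n in counts.items():
    print(f"{name:>10}: {n:,}")
print(f"{'total':>10}: {sum(counts.values()):,}")
```

Under these assumptions the embedding matrix accounts for 11,700,000 of roughly 11.9 million parameters, so the pretrained vectors dominate the model.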

## Train model

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

# Set the early stopping criteria
early_stopping = EarlyStopping(monitor='val_loss', patience=2)

# Create the model
model = emotion_model()

epochs = 100
batch_size = 64

# Fit the model with early stopping
model.fit(
    X_train, Y_train,
    epochs=epochs,
    batch_size=batch_size,
    validation_split=0.1,
    callbacks=[early_stopping]
)
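
With `patience=2`, training ends once the validation loss has failed to improve for two consecutive epochs, so the 100 epochs act as an upper bound only. The sketch below is a simplified model of that rule on a made-up loss curve; `stopping_epoch` is a hypothetical helper and ignores further callback options such as a minimum improvement threshold.

```python
def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement resets the counter
        else:
            wait += 1              # another epoch without improvement
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 1.
print(stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.7], patience=2))
# 3
```

In this toy curve the later improvement at epoch 4 is never seen, which illustrates the trade-off of a small patience value.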

### Evaluation

In [None]:
from sklearn.metrics import classification_report

# Convert one-hot rows and predicted probabilities to class indices
y_labels = np.argmax(Y_test, axis=1)
Y_pred = np.argmax(model.predict(X_test), axis=1)

print(classification_report(y_labels, Y_pred))
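
The report lists precision, recall, and F1 score per class. To make those columns concrete, the sketch below recomputes them by hand on a small made-up example; `per_class_scores` and the toy labels are assumptions for illustration and do not reproduce any result from the thesis.

```python
def per_class_scores(y_true, y_pred, n_classes):
    """Return {class: (precision, recall, f1)} computed from label lists."""
    scores = {}
    for c in range(n_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


toy_true = [0, 0, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2]
for c, (p, r, f) in per_class_scores(toy_true, toy_pred, 3).items():
    print(f"{c}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# 0: precision=1.00 recall=0.50 f1=0.67
# 1: precision=0.67 recall=1.00 f1=0.80
# 2: precision=1.00 recall=1.00 f1=1.00
```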