# 🚀 Multi-Task and Transfer Learning

Recently in deep NLP, pretraining, multi-task learning and transfer learning are all the hype! The general idea is to leverage additional data or targets before or during training. This is one of the reasons why BERT (the current state of the art) is so great:

1. You can do the pretraining on a large corpus of unlabeled text
2. With additional objectives they make sure the model generalizes well

Of course, we've been doing something similar by using pretrained word embeddings, though nowhere near as cool as what BERT does. Unfortunately, we don't have additional data sources in this competition, so we can't build much on 1. But is there a possibility to extend on 2? Can we add additional (auxiliary) objectives to our training?

Let me give you an example of what BERT tries to optimize in parallel and why this may help. The authors don't just look at how words relate to each other, as is done in Word2Vec and friends. They also teach the model to decide whether two sentences follow each other in the text. Basically, they add this objective to the training: "Hey BERT, given these two sentences, do you think they follow each other in real text, or not?". This is a genius way to give the model some additional understanding of text, while keeping the problem relatively simple. The model can't just stop at understanding how single words interact. It also needs to grok to some extent what makes a text flow and what causal relationships hold across sentences. And it comes at no labeling cost, since the data can be extracted from the text corpus in an unsupervised fashion.
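To make the sentence-pair objective a bit more concrete, here is a minimal sketch of how such training pairs could be generated from raw text. This is a simplified illustration and not BERT's actual data pipeline; the function name `make_sentence_pairs` and the toy corpus are made up for this example, and it assumes at least three distinct sentences.

```python
import random

def make_sentence_pairs(sentences, seed=0):
    """Build (first, second, is_next) examples from an ordered list of sentences.

    Roughly half of the pairs are true consecutive sentences (label 1),
    the rest pair a sentence with a randomly drawn non-successor (label 0).
    """
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Draw any sentence except the current one and its true successor
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], 0))
    return pairs

# Toy corpus, for illustration only
corpus = [
    "I went to the store.",
    "They were out of milk.",
    "So I bought oat milk instead.",
    "It tasted fine.",
]
for first, second, is_next in make_sentence_pairs(corpus):
    print(is_next, "|", first, "->", second)
```

No human labels are needed: the labels fall out of the sentence order in the corpus.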

Btw, here's a great article on the matter of Multi-Task Learning in NLP by Sebastian Ruder: http://ruder.io/multi-task-learning-nlp/

# Our Auxiliary Task

How can we achieve something similar in this competition? Of course, we could try the sentence trick, but I assume most if not all questions only have a single sentence in their title. This may also lead to overfitting on the training set.

So, here's my idea: Sentiment is probably correlated with insincerity in some way, but the two are not equivalent. We are allowed to use NLTK, and NLTK comes with a sentiment analysis tool. It's simple, but better than nothing. How about we analyze each question, get a sentiment score, and add that as a secondary target for our neural network to estimate?

So instead of asking this question "Hey model, given this question, give me how insincere it is" we ask: "Hey model, given this question, tell me both how insincere it is and what the sentiment is like".

If this works, this may have two effects:
1. The model may learn **more general features**, reducing overfitting
2. The model may receive **more information** relevant for the insincerity decision, increasing its performance

# The Experiment

To explore the effect of this idea, we run three experiments using the same model architecture:

1. **BASELINE**: Train on the insincerity target only
2. **MULTI-TASK LEARNING**: Train on both targets at the same time
3. **TRANSFER LEARNING**: Pretrain on sentiment, then fine-tune on insincerity

We compare their performance in terms of F1 on a hold-out validation set.
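As a reminder of what is being compared, here is a small sketch of F1 for the positive (insincere) class, computed from hard 0/1 predictions. The function name `f1_score_binary` and the toy labels are made up for this example. Note that the Keras metric used further down is only a batch-wise approximation of this quantity with a fixed 0.5 threshold.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class given 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f1_score_binary(y_true, y_pred))
```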

**Note**: As we have all had to learn, the variance of outcomes is pretty high in this competition. That's why I ran each of the experiments several times and I encourage you to do the same.

I've taken some of the standard boilerplate code from other kernels. Thanks to the original authors!

In [None]:
import os
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm, tqdm_notebook
import math
from sklearn.model_selection import KFold
from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, CuDNNGRU, CuDNNLSTM, Conv1D, Add
from keras.layers import Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D, SpatialDropout1D, Lambda, Concatenate
from keras.optimizers import Nadam, Adam
from keras.models import Model, Sequential
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras import initializers, regularizers, constraints, optimizers, layers
from keras import backend as K

In [None]:
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
train_n = len(train_df)
test_n = len(test_df)
print("Train shape : ",train_df.shape)
print("Test shape : ",test_df.shape)

## Getting the sentiments

Now for each question, let's ask NLTK what it thinks about the sentiment. The `compound` score ranges from -1 to 1, so we normalize it to [0, 1] and store it as `sentiment_target`, the auxiliary target that later sections refer to as polarity.
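The rescaling matters because the model below uses a sigmoid output with binary cross-entropy, which expects targets in [0, 1]. Here is a tiny standalone sketch of the mapping; the helper name `compound_to_unit_interval` is made up for this example.

```python
def compound_to_unit_interval(compound):
    """Map a compound sentiment score from [-1, 1] to [0, 1]."""
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score expected in [-1, 1]")
    return (compound + 1.0) / 2.0

# Very negative, neutral, mildly positive, very positive
for c in (-1.0, 0.0, 0.5, 1.0):
    print(c, "->", compound_to_unit_interval(c))
```

A neutral question therefore ends up at 0.5, not at 0.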

In [None]:
# NLTK sentiment cell
print('\nGetting sentiments...')
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
sia = SIA()
sentiments = np.zeros(train_n)
# for _, row in train_df.sample(10).iterrows():
#     print(row.question_text, sia.polarity_scores(row.question_text))

for i, (_, row) in tqdm_notebook(enumerate(train_df.iterrows()), total=train_n):
    sentiments[i] = sia.polarity_scores(row.question_text)['compound']

train_df['sentiment'] = pd.Series(sentiments)
train_df['sentiment_target'] = (train_df['sentiment'] + 1) / 2

Alright, let's see if there is any correlation at all between the polarity score and the binary insincerity label:

In [None]:
# Get correlation between strong polarity and insincerity
print('\nCorrelations to polarity:')
# train_df['strong_polarity'] = pd.Series((train_df['sentiment'] >= 0.5) | (train_df['sentiment'] <= -0.5)).astype(int)
train_df['polarity'] = train_df['sentiment'].abs()
print('Pearson ', train_df['target'].corr(train_df['polarity'], method='pearson'))
print('Kendall ', train_df['target'].corr(train_df['polarity'], method='kendall'))
print('Spearman', train_df['target'].corr(train_df['polarity'], method='spearman'))

Only minimally! But that's fine: if they were equivalent, we wouldn't need this additional target at all.

Let's look at some datapoints:

In [None]:
train_df.head()

In [None]:
print(train_df.describe())

As you can see, the polarity is not very evenly distributed: Most of the entries are below the 0.5 mark. But it definitely should contain more information than the binary, unbalanced insincerity tag.

## Preparing the data and the model

I'm going for a very simple data loading and model here.

In [None]:
print('\nPreparing data...')

# some config values 
embed_size = 300 # how big is each word vector
max_features = 50000 # how many unique words to use (i.e num rows in embedding vector)
maxlen = 50 # max number of words in a question to use

# fill up the missing values
train_X = train_df["question_text"].fillna("_na_").values
test_X = test_df["question_text"].fillna("_na_").values

# Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)

# Pad the sentences 
train_X = pad_sequences(train_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

# Get the target values
train_y = train_df[['target', 'sentiment_target']].values

In [None]:
print('\nGetting embeddings...')
EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))

all_embs = np.stack(list(embeddings_index.values()))  # newer NumPy versions need a sequence, not a dict view
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

for word, i in tqdm_notebook(word_index.items(), total=len(word_index)):
    if i >= nb_words: continue  # skip indices that fall outside the embedding matrix
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector


In [None]:
def f1(y_true, y_pred):
    '''
    metric from here 
    https://stackoverflow.com/questions/43547402/how-to-calculate-f1-macro-in-keras
    '''
    def recall(y_true, y_pred):
        """Recall metric.

        Only computes a batch-wise average of recall.

        Computes the recall, a metric for multi-label classification of
        how many relevant items are selected.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        """Precision metric.

        Only computes a batch-wise average of precision.

        Computes the precision, a metric for multi-label classification of
        how many selected items are relevant.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision
    
    # So we only measure F1 on the target y value:
    y_true = y_true[:, 0]
    y_pred = y_pred[:, 0]
    
    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall + K.epsilon()))

In [None]:
# Create a simple 1-layer LSTM model
def create_model(embedding_trainable=False, dropout=0.1, size=32, n_outputs=2, lr=0.003):
    model = Sequential()
    model.add(Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=embedding_trainable))
    model.add(SpatialDropout1D(dropout))
    model.add(Bidirectional(CuDNNLSTM(size, return_sequences=True)))
    model.add(GlobalMaxPooling1D())
    model.add(Dropout(dropout))
    model.add(Dense(n_outputs, activation="sigmoid"))

    model.compile(
        loss='binary_crossentropy',
        optimizer=Adam(lr=lr),
        metrics=[f1]
    )

    return model

## Training on just the target label (BASELINE)

Let's see how well our model performs if we train it just on the target label. The most important metric we should be watching is `val_f1`! For statistical validity, we will run each experiment on five splits and then average the F1 scores we've reached.
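When reading the numbers below, keep the spread across folds in mind, not just the mean. The following sketch summarizes a list of per-fold scores the same way the cells below do (mean and population standard deviation, which is what `np.std` computes by default). The helper name `summarize` is made up, and the scores are hypothetical placeholder values, not results from this kernel.

```python
import statistics

def summarize(scores):
    """Return (mean, population std) of per-fold scores."""
    return statistics.mean(scores), statistics.pstdev(scores)

# Hypothetical per-fold F1 values, for illustration only
example_scores = [0.60, 0.62, 0.61, 0.59, 0.63]
mean, std = summarize(example_scores)
print(f"F1 = {mean:.3f} +/- {std:.3f}")
```

If two experiments differ by less than about one standard deviation, I would not read too much into the difference.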

In [None]:
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=123).split(train_X, train_y):
    model = create_model(n_outputs=1)

    model.fit(
        train_X[train_idx], train_y[train_idx, 0],
        validation_data=(train_X[val_idx], train_y[val_idx, 0]),
        batch_size=1024,
        epochs=3,
        verbose=2,
    )

    scores.append(model.evaluate(train_X[val_idx], train_y[val_idx, 0], batch_size=1024, verbose=0)[1])
    print()

print('Average F1:', np.mean(scores))
print('Standard Deviation:', np.std(scores))

Not bad, not very good (because our model is very small for the sake of this experiment). This is our baseline.

## Training on both targets (MULTI-TASK LEARNING)

Now let's repeat the training, but this time our model gets two outputs: One for the insincerity, one for polarity.

**Note** that I designed the F1 metric to measure only the F1 on the insincerity target, so the scores are directly comparable to the baseline.
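To see what the two-output model is actually optimizing, here is a plain-Python sketch of binary cross-entropy averaged over the outputs of one sample. It assumes that Keras' `binary_crossentropy` loss averages over the output axis, so both targets get equal weight. The function below is a standalone illustration, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the outputs of one sample.

    Targets may be soft values in [0, 1], as with the normalized sentiment score.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(y_true)

# One sample: [insincerity label, normalized sentiment target] (toy values)
y_true = [1.0, 0.25]
y_pred = [0.9, 0.30]
print(binary_crossentropy(y_true, y_pred))
```

For a soft target, the loss is smallest when the prediction equals the target, so the sentiment output is pushed toward the score itself.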

In [None]:
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=123).split(train_X, train_y):
    model = create_model(n_outputs=2)

    model.fit(
        train_X[train_idx], train_y[train_idx],
        validation_data=(train_X[val_idx], train_y[val_idx]),
        batch_size=1024,
        epochs=3,
        verbose=2,
    )

    scores.append(model.evaluate(train_X[val_idx], train_y[val_idx], batch_size=1024)[1])
    print()

print('Average F1:', np.mean(scores))
print('Standard Deviation:', np.std(scores))

Very interesting! We get an improved result with less overfitting!

## Using polarity for pretraining (TRANSFER LEARNING)

Now let's split the training into two parts: First, we train on the polarity for a few epochs, then we switch targets and train on our actual insincerity target. We also reduce the learning rate for fine-tuning to avoid catastrophic forgetting.

I tried several configurations here, but could not find any that worked better than this:

In [None]:
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=123).split(train_X, train_y):
    model = create_model(n_outputs=1)

    model.fit(
        train_X[train_idx], train_y[train_idx, 1],
        validation_data=(train_X[val_idx], train_y[val_idx, 1]),
        batch_size=1024,
        epochs=2,
        verbose=2,
    )

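    # Reduce the learning rate for fine-tuning. Loss and metrics are restated
    # explicitly, since `model.metrics` may not return the original metric
    # functions in every Keras version. Recompiling keeps the trained weights.
    model.compile(
        loss='binary_crossentropy',
        metrics=[f1],
        optimizer=Adam(lr=0.001),
    )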
    
    # Fine-tuning:
    model.fit(
        train_X[train_idx], train_y[train_idx, 0],
        validation_data=(train_X[val_idx], train_y[val_idx, 0]),
        batch_size=512,
        epochs=2,
        verbose=2,
    )
    
    scores.append(model.evaluate(train_X[val_idx], train_y[val_idx, 0], batch_size=1024)[1])
    print()

print('Average F1:', np.mean(scores))
print('Standard Deviation:', np.std(scores))

It seems that pretraining does not help much, though maybe we just need to explore more training configurations to make it work.

## Discussion

Over several repetitions of my experiment, I could confirm a positive effect of adding the auxiliary target:

**Higher validation F1 with less overfitting!**

This is exactly what we were hoping for, though the effect is not large.

Note that statistical validity is no simple matter. I gave it my best given the limitations of these kernels, but this is no real proof that adding auxiliary targets will help you with your score. E.g. it could behave differently in larger models, with longer training times, etc. Let's find out 😊

If you guys have more ideas for auxiliary targets, please try them out and share them with the community, or comment down below! 😊