# Incorporating Indirect Means of Supervision in Practice
#### Supervised learning requires labeled training data, a learnable model, and hardware. Thanks to open source implementations, we have high-performance algorithms available that are becoming ever easier to use. Thanks to the proliferation of cloud technology, we have as much compute available as our finances allow. But the bottleneck in many cases ends up becoming the amount and quality of training data that we have, especially if exceptional accuracy is needed (like in many medical applications). This motivates the problem of **finding non-traditional means of incorporating domain knowledge into our models.**


##### In this Jupyter notebook we explore incorporating indirect means of supervision by tackling a challenging supervision problem which requires such methods to achieve quality results. We focus on a classic problem in natural language processing (NLP), sentiment analysis. We start from the basics by solving it using standard modeling techniques and show how we reach a limit in possible performance. We then sequentially add additional supervision **signals** and show how each can improve performance in this sentiment analysis task. We hope that this tutorial will help to bridge the gap between the recent advancements in this area and their implementation in practice.

##### This work is heavily inspired by research born out of the Stanford AI Lab and the creators of the open-source library Snorkel. Particular inspiration was taken from their writeup, [Massive Multi-Task Learning with Snorkel MeTaL: Bringing More Supervision to Bear](https://dawn.cs.stanford.edu/2019/03/22/glue/).

## Problem and Dataset 

### Sentiment analysis using the [Financial Phrase Bank](https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip?origin=publication_list)  dataset
A collection of ∼5000 financial/economic news texts, annotated by humans who were screened to ensure that they had sufficient business knowledge and educational background. Each sentence in the dataset is labeled as positive, negative, or neutral.

### Why Financial Phrase Bank? 
- Financial data is often proprietary and scarce.
    - Any improvement we can get from publicly available methods, domain knowledge, or data we thought previously could not be applied is immensely valuable.
- Due to the heavy dependence on semantic meaning in determining sentiment, it is a challenging problem where using only traditional supervision might not provide satisfactory results.
- Both of the previous points are common to many other problems.

## Setup

In [1]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

from transformers import BertTokenizer
from transformers import TFBertForSequenceClassification

from mtl_helpers import *

pd.set_option('display.max_colwidth', 400)

data = read_finphrase('data/Sentences_66Agree.txt')
data = pd.DataFrame(data, columns=['sentence', 'label'])
data = data.dropna()
data = data.sample(frac=1, random_state = 42).reset_index(drop=True)

In [2]:
# 1 is "positive", 0 is "negative", 2 is "neutral"
data.sample(6, random_state = 32)

| | sentence | label |
|---|---|---|
| 1860 | The Bristol Port Company has sealed a one million pound contract with Cooper Specialised Handling to supply it with four 45-tonne , customised reach stackers from Konecranes . | 1 |
| 1396 | ADPnews - Sep 28 , 2009 - Finnish silicon wafers maker Okmetic Oyj HEL : OKM1V said it will reduce the number of its clerical workers by 22 worldwide as a result of personnel negotiations completed today . | 0 |
| 2137 | Nordstjernan has used its option to buy another 22.4 % stake of Salcomp 's shares and votes . | 2 |
| 1808 | The office space will rise above the remodeled Cannon Street underground station . | 2 |
| 123 | Operating result showed a loss of EUR 2.9 mn , while a year before , it showed a profit of EUR 0.6 mn . | 0 |
| 3091 | Ruukki 's delivery volumes and selling prices showed favourable development and the company 's comparable net sales grew by 50 % year-on-year to EUR647m , CEO Sakari Tamminen said . | 1 |


In [3]:
# Process and split data
# NOTE: Temporarily simplify the problem to binary classification, i.e. just the negative and positive samples
data = data[data.label != 2]
train_split_idx = 1300

x_train = data[0:train_split_idx]['sentence']
y_train = data[0:train_split_idx]['label']

x_val = data[train_split_idx:]['sentence']
y_val = data[train_split_idx:]['label']

# Signal 1: Traditional Supervision
### We will train a few standard neural network architectures to get an idea of how well we can do with only this small dataset.
- A basic multilayer perceptron with dropout and the data encoded using tf-idf gets us an accuracy of **~85%**. 
- Embedding the data using GloVe and using a couple of Bidirectional LSTM layers performs much worse, with an accuracy of **~77%**.
    - Perhaps this hints that more sophisticated architectures will not give us an increase in performance given our small amount of data. 


In [4]:
# First we train an MLP with our text encoded using term frequency–inverse document frequency (tf-idf)

# Parameters for tf-idf
TOP_K = 20000
tfidf_args = {
    'ngram_range': (1, 2),  # Use unigrams and bigrams.
    'strip_accents': 'unicode',
    'decode_error': 'replace',
    'stop_words': ['a', 'an', 'the', 'i'],
    'analyzer': 'word',  # Split text into word tokens.
    'min_df': 1,  # Minimum document/corpus frequency below which a token will be discarded.
    'max_df': 0.33,  # Discard tokens that appear in more than 33% of the documents.
    'dtype': np.float64,  # The key was previously given twice; only this last value took effect.
}

tfvect = TfidfVectorizer(**tfidf_args)
x_train_tf = tfvect.fit_transform(x_train)
x_val_tf = tfvect.transform(x_val)

# We also select the top-k features by using the ANOVA F-value. 
selector = SelectKBest(f_classif, k=min(TOP_K, x_train_tf.shape[1]))
selector.fit(x_train_tf, y_train)
x_train_tf = selector.transform(x_train_tf).toarray()
x_val_tf = selector.transform(x_val_tf).toarray()

# Create the MLP Keras model.
# The input width is the number of selected tf-idf features.
inputs = keras.Input(shape = (x_train_tf.shape[1],))
x = layers.Dense(64, activation = 'relu')(inputs)
x = layers.Dropout(0.4)(x)
x = layers.Dense(64, activation = 'relu')(x)
x = layers.Dropout(0.4)(x)
outputs = layers.Dense(2, activation ='softmax')(x)

model = keras.Model(inputs = inputs, outputs = outputs)

optimizer = keras.optimizers.Adam(learning_rate = 0.001)  # `lr` is a deprecated alias of `learning_rate`
model.compile(optimizer = optimizer, loss ='sparse_categorical_crossentropy', metrics=["accuracy"])

model.fit(x_train_tf, y_train, batch_size = 32, epochs = 10, validation_data=(x_val_tf, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x22536ef87c8>

In [5]:
# Next we will try a more sophisticated architecture, a Bidirectional LSTM. We also embed our data using pretrained GloVe word vectors.

# Code largely courtesy of:  https://keras.io/examples/nlp/pretrained_word_embeddings/
vectorizer, embeddings = create_vectorizer_and_embeddings(x_train)

# Convert our data to vectorized form, see definition of create_vectorizer_and_embeddings for more information
x_train_vectorized = vectorizer(np.array([[s] for s in x_train])).numpy()
x_val_vectorized = vectorizer(np.array([[s] for s in x_val])).numpy()

# Make sure our labels are 1D NumPy arrays
y_train_vectorized = np.array(y_train)
y_val_vectorized = np.array(y_val)

# Create embedding layer for GloVe vectors.
embedding_layer = layers.Embedding(
    input_dim = len(vectorizer.get_vocabulary()) + 2,
    output_dim = 100,
    embeddings_initializer=keras.initializers.Constant(embeddings),
    trainable=False,
)

# Define the BiLSTM network
int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Bidirectional(layers.LSTM(50, return_sequences=True, recurrent_dropout = 0.2))(embedded_sequences)
x = layers.Bidirectional(layers.LSTM(50, return_sequences=False, recurrent_dropout = 0.2))(x)
x = layers.Dropout(0.2)(x)
preds = layers.Dense(2, activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)

optimizer = keras.optimizers.Adam(learning_rate = 0.001)  # `lr` is a deprecated alias of `learning_rate`
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
with tf.device('/cpu:0'): # Running on GPU was a lot slower...
    model.fit(x_train_vectorized, y_train_vectorized, batch_size=32, epochs=15, validation_data=(x_val_vectorized, y_val_vectorized))

Found 400000 GloVe word vectors.
Converted 3914 words (790 misses)
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


# Signal 2: Transfer Learning - Applying knowledge gained on one problem to another
### [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) is a massive pretrained language model. It is a contextual model that generates a representation of each word based on the other words in the sentence. It comes in several sizes; the one we will use has about 110 million parameters and was pretrained on BooksCorpus (800 million words) and English Wikipedia (2,500 million words). BERT can be fine-tuned with just one additional output layer to achieve state-of-the-art performance for a wide range of tasks, without the need for task-specific architectural modifications. Training times are also manageable given the size of the network. One epoch (which is almost all you need) on the PhraseBank dataset takes under 20 seconds with an NVIDIA 2070 Super, a midrange consumer-grade GPU.

Using only BERT, we are able to achieve **~92%** validation accuracy on this problem!


In [6]:
%%capture

BATCH_SIZE = 16

# Tokenize our data using BERT
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
train_bert = convert_data(bert_tokenizer, x_train, y_train).batch(BATCH_SIZE)
val_bert = convert_data(bert_tokenizer, x_val, y_val).batch(BATCH_SIZE)

In [7]:
# Create Bert Model
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=0.00002, epsilon=1e-08)
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
# The datasets are already batched, so batch_size is not passed to fit().
model.fit(train_bert, epochs=2, validation_data=val_bert)

Some weights of the model checkpoint at bert-base-cased were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier', 'dropout_40']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x2270e33f048>

# Signal 3: External Features

Many methods for sentiment analysis already exist. These methods are generic and are generally rule-based. 
Some examples are: 
- [VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text](http://eegilbert.org/papers/icwsm14.vader.hutto.pdf). An implementation is available in [nltk](https://www.nltk.org/). 
- [TextBlob's sentiment analysis](https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis)
- Other methods available in [nltk's sentiment analysis](http://www.nltk.org/api/nltk.sentiment.html#nltk-sentiment-package) package
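
One way to use such methods as a supervision signal is to compute their scores and append them as extra input columns next to the features we already have. The sketch below illustrates the idea under loud assumptions: `TOY_LEXICON`, `lexicon_score`, and `add_external_features` are hypothetical names, and the tiny hand-written lexicon only stands in for a real scorer such as VADER or TextBlob.

```python
import numpy as np

# Hypothetical toy lexicon standing in for a real rule-based scorer such as VADER.
TOY_LEXICON = {"profit": 1.0, "grew": 1.0, "favourable": 1.0,
               "loss": -1.0, "reduce": -1.0, "decreased": -1.0}

def lexicon_score(sentence, lexicon=TOY_LEXICON):
    """Average the lexicon values of the tokens found in the sentence (0.0 if none)."""
    tokens = sentence.lower().split()
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

def add_external_features(features, sentences, scorer=lexicon_score):
    """Append one external sentiment score per sentence as an extra feature column."""
    scores = np.array([scorer(s) for s in sentences], dtype=features.dtype).reshape(-1, 1)
    return np.hstack([features, scores])

sentences = ["Operating result showed a loss of EUR 2.9 mn",
             "Comparable net sales grew by 50 %"]
base_features = np.zeros((2, 3))  # placeholder for a dense tf-idf matrix
augmented = add_external_features(base_features, sentences)
```

In the notebook's setting, `base_features` would be a dense matrix such as `x_train_tf`, and the scorer could be swapped for any of the methods listed above.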

# Signal 4: Multitask Learning (MTL)
### In MTL, a single model learns several tasks through a shared representation, so that what is learned for each individual task helps the other tasks be learned better.
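
A minimal NumPy sketch of the shared-representation idea (often called hard parameter sharing) follows. It only shows the forward pass and the joint loss, not training. The auxiliary task, the layer sizes, the random data, and the 0.5 loss weight are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Shared trunk: one hidden layer used by every task.
W_shared = rng.normal(scale=0.1, size=(8, 4))
# Task-specific heads: one small output layer per task.
W_sentiment = rng.normal(scale=0.1, size=(4, 2))  # main task: negative / positive
W_auxiliary = rng.normal(scale=0.1, size=(4, 3))  # hypothetical auxiliary task with 3 classes

def forward(x):
    h = np.maximum(0.0, x @ W_shared)  # shared representation (ReLU)
    return softmax(h @ W_sentiment), softmax(h @ W_auxiliary)

x = rng.normal(size=(5, 8))              # random stand-in for 5 encoded sentences
y_sentiment = np.array([0, 1, 1, 0, 1])  # made-up labels
y_auxiliary = np.array([2, 0, 1, 1, 0])  # made-up labels

p_sent, p_aux = forward(x)
# The joint loss sums the per-task losses, so during training the gradients
# from both tasks would update W_shared.
joint_loss = cross_entropy(p_sent, y_sentiment) + 0.5 * cross_entropy(p_aux, y_auxiliary)
```

Because both heads read the same hidden layer `h`, anything the auxiliary task teaches the trunk is also available to the sentiment head.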

# Signal 5: Dataset Slicing 
### As we examine where our model makes errors, we might recognize **slices** of data (subsets of the data with some property in common) where our model consistently underperforms. If we can write heuristics that pick out the examples where our accuracy is lower, we can use them to tell the model where it needs to pay more attention.

For example, given a slice of underperforming data, we can use the MTL paradigm and add another task that consists only of that slice of data. Thus, we can explicitly train the model on those underperforming examples, in the hope that it will learn them better.
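
As a rough sketch of this workflow, the code below defines one slicing heuristic, measures accuracy on the slice, and collects the slice as the training set for an extra task. The heuristic `mentions_loss`, the helper names, and the toy sentences, labels, and predictions are all hypothetical.

```python
def mentions_loss(sentence):
    """Hypothetical slicing function: sentences that mention a loss."""
    return "loss" in sentence.lower().split()

def slice_accuracy(sentences, labels, predictions, slicing_fn):
    """Accuracy restricted to the examples selected by slicing_fn (None if the slice is empty)."""
    rows = [(y, p) for s, y, p in zip(sentences, labels, predictions) if slicing_fn(s)]
    if not rows:
        return None
    return sum(y == p for y, p in rows) / len(rows)

def slice_task(sentences, labels, slicing_fn):
    """Training examples for an extra MTL task made only of the slice."""
    return [(s, y) for s, y in zip(sentences, labels) if slicing_fn(s)]

# Toy examples, labels, and model outputs, made up for illustration.
sentences = ["Operating result showed a loss of EUR 2.9 mn",
             "Net sales grew by 50 %",
             "The loss narrowed compared to last year"]
labels = [0, 1, 1]
predictions = [0, 1, 0]

overall = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
on_slice = slice_accuracy(sentences, labels, predictions, mentions_loss)
extra_task = slice_task(sentences, labels, mentions_loss)
```

A slice whose accuracy sits clearly below the overall accuracy is a candidate for its own task head in the MTL setup.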

# Signal 6: Data Augmentation
### This is a broad technique that encompasses ways to increase the size of your training set by applying transformations to it. The most well-known applications are in computer vision, such as rotating existing images in your dataset. However, NLP is another field where data augmentation is becoming more important and common.

[Visual Survey of Data Augmentation in NLP](https://amitness.com/2020/05/data-augmentation-for-nlp/)

##### Proper noun replacement via Named Entity Recognition (NER)
Looking at the data instances, we see that there are a lot of specific company names. But for an individual text, how much does the actual company name matter compared to the context in which it is used? We could detect the company names in a sentence using NER and replace each unique company with a placeholder: Company A, Company B, and so on.
This would simplify the data, which could improve generalization, and it would also inject the signal of which tokens are companies.
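
The replacement step could look like the sketch below. It assumes the company names have already been extracted by some NER model (for example, its organization entities); `anonymize_companies` and the hand-written `names` list are hypothetical, and the helper supports at most 26 distinct companies per text.

```python
import string

def anonymize_companies(sentence, company_names):
    """Replace each distinct company name with Company A, Company B, ... in order of first appearance."""
    # Order the names by where they first occur in the sentence.
    found = sorted((n for n in set(company_names) if n in sentence), key=sentence.index)
    mapping = {name: f"Company {string.ascii_uppercase[i]}" for i, name in enumerate(found)}
    # Replace longer names first so a short name never clobbers part of a longer one.
    for name in sorted(mapping, key=len, reverse=True):
        sentence = sentence.replace(name, mapping[name])
    return sentence

text = ("The Bristol Port Company has sealed a one million pound contract with "
        "Cooper Specialised Handling to supply it with four 45-tonne , "
        "customised reach stackers from Konecranes .")
# Stand-in for the output of an NER model.
names = ["Bristol Port Company", "Cooper Specialised Handling", "Konecranes"]
anonymized = anonymize_companies(text, names)
```

The same mapping should be applied consistently within a text, but it can be reset between texts, since the placeholder letters carry no meaning across sentences.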


##### Fine-tuning GPT-3 to generate synthetic instances
GPT-3 is a pretrained language model (like BERT) that is geared toward generating human-like text. The application here is to fine-tune GPT-3 on the positive samples, then use it to generate new, synthetic, positive instances. The same could be done for any class, and it could be thought of as an oversampling method.


# Signal 7: Ensembling
### While solving a problem, we may be presented with a choice. Usually this comes with some tradeoff and we have to make the decision about which choice will lead to the best result. But in making this choice sometimes we lose out on some signal from the option we did not take. With ensembling we do not make the choice, instead we somehow combine the results of each choice together. 

Previously, when we used BERT, we had a choice between the cased and uncased versions of it. We chose the cased version because we believe that our problem relies heavily on recognizing proper nouns. With an ensemble, we could instead train both versions and combine their predictions.
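
One simple way to combine two such models is to average their class probabilities (soft voting), sketched below. The function name `ensemble_predict` and the probability arrays are made up for illustration. Note that a model compiled with `from_logits=True` outputs logits, so a softmax would need to be applied before averaging.

```python
import numpy as np

def ensemble_predict(prob_list, weights=None):
    """Average class probabilities from several models; return (predicted classes, averaged probabilities)."""
    stacked = np.stack(prob_list)  # shape: (n_models, n_samples, n_classes)
    avg = np.average(stacked, axis=0, weights=weights)
    return avg.argmax(axis=1), avg

# Made-up softmax outputs of a cased and an uncased model for three sentences.
probs_cased = np.array([[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]])
probs_uncased = np.array([[0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])

pred_labels, avg_probs = ensemble_predict([probs_cased, probs_uncased])
```

On the third sentence the two models disagree, and the average follows the more confident one. The optional `weights` argument lets one model count more than the other, for instance in proportion to validation accuracy.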

# Signal 8+?: Active Area of Research
#### New research is published frequently
- [AutoSimulate: (Quickly) Learning Synthetic Data Generation](https://arxiv.org/pdf/2008.08424.pdf) - University of Oxford and Microsoft Research
    - Paper released August 16, 2020 about an efficient alternative for optimal synthetic data generation.
- [RandAugment: Practical automated data augmentation with a reduced search space](https://openaccess.thecvf.com/content_CVPRW_2020/papers/w40/Cubuk_Randaugment_Practical_Automated_Data_Augmentation_With_a_Reduced_Search_Space_CVPRW_2020_paper.pdf) - Google Brain
    - Methods for automatically finding the best data augmentation strategies for vision tasks.
    
#### Some of the most exciting technology being worked on right now relies heavily on these concepts. 
##### Autonomous driving 
- [Tesla Autopilot and Multi-Task Learning for Perception and Prediction](https://www.youtube.com/watch?v=IHH47nZ7FZU&t=127s) - Andrej Karpathy, Director of AI at Tesla. 
    - Talk about how Tesla leverages MTL to get the performance needed for perception tasks.
- [Using automated data augmentation to advance our Waymo Driver](https://blog.waymo.com/2020/04/using-automated-data-augmentation-to.html) - Blog from Waymo
    - Blog about how researchers at Waymo use data augmentation to improve perception tasks.
