Unsupervised Language Modeling at scale for robust sentiment classification
Python
Switch branches/tags
Nothing to show

README.md

PyTorch Unsupervised Sentiment Discovery

This codebase is part of our effort to reproduce, analyze, and scale the Generating Reviews and Discovering Sentiment paper from OpenAI.

Early efforts have yielded a training time of 5 days on 8 volta-class gpus down from the training time of 1 month reported in the paper.

A byte-level (char-level) recurrent language model (multiplicative LSTM) for unsupervised modeling of large text datasets, such as the amazon-review dataset, is implemented in PyTorch.

The learned language model can be transferred to other natural language processing (NLP) tasks where it is used to featurize text samples. The featurizations provide a strong initialization point for discriminative language tasks, and allow for competitive task performance given only a few labeled samples. To illustrate this the model is transferred to the Binary Stanford Sentiment Treebank task.

main performance fig

The model's performance as a whole will increase as it processes more data. However, as is discovered early on in the training process, transferring the model to sentiment classification via Logistic Regression with an L1 penalty leads to the emergence of one neuron (feature) that correlates strongly with sentiment.

This sentiment neuron can be used to accurately and robustly discriminate between postive and negative sentiment given the single feature. A decision boundary of 0 is able to split the distribution of the neuron features into (relatively) distinct positive and negative sentiment clusters.

main neuron fig

Samples from the tails of the feature distribution correlate strongly with positive/negative sentiment.

Extreme samples

ReadMe Contents

Setup

Install

Install the sentiment_discovery package with pip setup.py install in order to run the modules/scripts within this repo.

Python Requirements

At this time we only support python3.

  • numpy
  • pytorch (>= 0.3.0)
  • pandas
  • scikit-learn
  • matplotlib
  • unidecode

Pretrained models

We've included our own trained mlstm/lstm models as well as OpenAI's mlstm model:

Data Downloads

We've provided the Binary Stanford Sentiment Treebank and IMDB Movie Review datasets as part of this repository. In order to train on the amazon dataset please download the "aggressively deduplicated data" version from Julian McAuley's original site. Access requests to the dataset should be approved instantly. While using the dataset make sure to load it with the -loose_json flag.

Usage

In addition to providing easily reusable code of the core functionalities of this work in our sentiment_discovery package, we also provide scripts to perform the three main high-level functionalities in the paper:

  • unsupervised reconstruction/language modeling of a corpus of text
  • transfer of learned language model to perform sentiment analysis on a specified corpus
  • heatmap visualization of sentiment in a body of text, and conversely generating text according to a fixed sentiment

Script results will be saved/logged to the <experiment_dir>/<experiment_name>/* directory hierarchy.

Unsupervised Reconstruction

Train a recurrent language model (mlstm/lstm) and save it to the <model_dir>/ directory in the above hierarchy. The metric histories will be available in the appropriate <history>/ directory in the same hierarchy. In order to apply weight_norm only to lstm parameters as specified by the paper we use the -lstm_only flag. Additionally, we use fused LSTM kernels in the mlstm by specifying the fuse_lstm flag, in order to achieve modest speedups.

python3 text_reconstruction.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model -cuda \
-embed_size 64 -rnn_size 4096 -layers 1 -rnn_type mlstm -dropout 0 -weight_norm -lstm_only -fuse_lstm \
-train data/amazon/reviews.json -loose_json -text_key reviewText -lazy -data_set_type unsupervised \
-batch_size 32 -seq_length 256 -lr 0.000125 --optimizer_type=Adam -lr_scheduler LinearLR -epochs 1 \
-num_shards 1002 -split 1000,1,1 -eval_batch_size 1 -eval_seq_length -1 -persist_state 1
-save_epochs 1 -save_iters 5000 

The training process takes in a json list (or csv) file as a dataset where each entry has key (or column) text_key containing samples of text to model.

The unsupervised dataset will proportionally split num_shards amongst train/val/test according to split. Validation/test sets can also be supplied manually, they will also have num_shards shards.

The dataset entries are transposed to allow for the concatenation of sequences in order to persist hidden state across a shard in the training process.

Hidden states are reset either at the start of every sequence, every shard, or never based on the value of persist_state. See data flags.

We use an evaluation sequence length of -1. A negative seq_length calculates sequence length s.t. there are seq_length samples in the dataset.

Lastly, We know waiting for more than 1 million updates over 1 epoch for large datasets is a long time. We've set the training script to save the most current model (and data history) every save_iters steps within an epoch, and also included a -max_iters flag to end training early.

Our Loss results

The resulting loss curve over 1 (x4 because of data parallelism) million iterations should produce a graph similar to this.

Note: if it's your first time modeling a particular corpus make sure to turn on the -preprocess flag.

Reconstruction Flags

For a list of default values for all flags look at ./cfg/configure_text_reconstruction.py

  • no_loss - do not compute or track the loss curve of the model
  • lr - Starting learning rate.
  • lr_scheduler - one of [ExponentialLR,LinearLR]
  • lr_factor - factor by which to decay exponential learning rate
  • optimizer_type - Class name of optimizer to use as listed in torch.optim class (ie. SGD, RMSProp, Adam)
  • clip - Clip gradients to this maximum value. Default: 1.
  • epochs - number of epochs to train for
  • max_iters - total number of training iterations to run. Takes presedence over number of epochs
  • start_epoch - epoch to start training at (used to resume training for exponential decay scheduler)
  • start_iter - what iteration to start training at (used mainly for resuming linear learning rate scheduler)
  • save_epochs - number of epochs to save model progress. Defaults to every epoch
  • save_iters - save model every so often per epoch
  • save_optim - if enabled save optimizer state when saving model progress
  • load_optim - if enabled load optimizer state from saved model if available

Sentiment Transfer

This script parses sample text and binary sentiment labels from csv (or json) data files according to -text_key/label_key. The text is then featurized and fit to a logistic regression model.

The index of the feature with the largest L1 penalty is then used as the index of the sentiment neuron. The relevant sentiment feature is extracted from the samples' features and then also fit to a logistic regression model to compare performance.

Below is example output for transfering our trained language model to the Binary Stanford Sentiment Treebank task.

python3 sentiment_transfer.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model -cuda \
-train binary_sst/train.csv -valid binary_sst/val.csv -test binary_sst/test.csv \
-text_key sentence -label_key label -load_model e0.pt 
92.11/90.02/91.27 train/val/test accuracy w/ all_neurons
00.12 regularization coef w/ all_neurons
00107 features used w/ all neurons all_neurons
neuron(s) 1285, 1503, 3986, 1228, 2301 are top sentiment neurons
86.36/85.44/86.71 train/val/test accuracy w/ n_neurons
00.50 regularization coef w/ n_neurons
00001 features used w/ all neurons n_neurons

Our weight visualization

These results are saved to ./experiments/mlstm/binary_sst/e0/*

Note: The other neurons are firing, and we can use these neurons in our prediction metrics by setting -num_neurons to the desired number of heavily penalized neurons. Even if this option is not enabled the logit visualizations/data are always saved for at least the top 5 neurons.

Transfer Flags

For a list of default values for all involved flags look at ./cfg/configure_sentiment_transfer.py

  • num_neurons - number of neurons to consider for sentiment classification
  • no_test_eval - do not evaluate the test data for accuracy (useful when your test set has no labels)
  • write_results - write results (output probabilities as well as logits of top neurons) of model on test (or train if none is specified) data to specified filepath. (Can only write results to a csv file)

Transfer to your own data

Sentiment Analysis

Analyze sentiment on your own data by using one of the pretrained models to train on the Stanford Treebank (or other sentiment benchmarks) and evaluating on your own -test dataset.

If you don't have labels for your data set make sure to use the -no_test_eval flag.

Train a Different Classifier

If you have your own unique classification dataset complete with training labels, you can also use a pretrained model to extract relevant features for classification.

Supply your own -train (and evaluation) dataset while making sure to specify the right text and label keys.

HeatMap Visualization

This script takes a piece of text specified by text and then visualizes it via a karpathy-style heatmap based on the activations of the specified neuron. In order to find the neuron to specify, find the index of the neuron with the largest l1 penalty based on the neuron weight visualizations from the previous section.

Without loss of generality we're going to go through this section first referring to OpenAI's neuron weights. For this set of weights the sentiment neuron can be found at neuron 2388.

highlighted neuron

Armed with this information we can use the script to analyze the following exerpt from the beginning of a review about The League of Extraordinary Gentlemen.

python3 visualize.py -experiment_dir ./experiments -experiment_name mlstm_paper -model_dir model -cuda \
-text "25 August 2003 League of Extraordinary Gentlemen: Sean Connery is one of the all time greats \
I have been a fan of his since the 1950's." -neuron 2388 -load_model e0.pt 

sentiment heatmap

However, not only does this script analyze text, but the user can also specify the begining of a text snippet with text and generate additional following text up to a total seq_length. Simply set the -generate flag.

python3 visualize.py -experiment_dir ./experiments -experiment_name mlstm_paper -model_dir model -cuda \
-text "25 August 2003 League of Extraordinary Gentlemen: Sean Connery is" \
-neuron 2388 -load_model e0.pt -generate

generated heatmap

In addition to generating text, the user can also set the sentiment of the generated text to positive/negative by setting -overwrite to +/- 1. Continuing our previous example:

forced generated heatmap

When training a sentiment neuron with this method, sometimes we learn a positive sentiment neuron, and sometimes a negative sentiment neuron. This doesn’t matter, except for when running the visualization. The negate flag improves the visualization of negative sentiment models, if you happen to learn one.

python3 visualize.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model -cuda \
-text "25 August 2003 League of Extraordinary Gentlemen: Sean Connery is" \
-neuron 1285 -load_model e0.pt -generate -negate -overwrite +/-1

negated neuron

The resulting heatmaps are all saved to a png file corresponding of the first 100 characters of (generated or analyzed) text at <experiment_dir>/<experiment_name>/heatmaps/<heatmap name>.

Visualization Flags

For a list of default values for all flags look at ./cfg/configure_visualization.py

  • text - whole or initial text for visualization
  • neuron - index of sentiment neuron
  • generate - generates text following initial text up to a total length of seq_length
  • temperature - temperature from sampling from language model while generating text
  • overwrite - For generated portion of text overwrite the neuron's value while generating
  • negate - If neuron corresponds to a negative sentiment neuron rather than positive sentiment
  • layer - layer of recurrent net to extract neurons from

Distributed Usage

python3 text_reconstruction.py -experiment_dir ./experiments -experiment_name mlstm -model_dir model \
-embed_size 64 -rnn_size 4096 -layers 1 -rnn_type mlstm -dropout 0 -weight_norm -lstm_only -fuse_lstm \
-train data/amazon/reviews.json -loose_json -text_key reviewText -lazy -data_set_type unsupervised \
-batch_size 32 -seq_length 256 -lr 0.000125 --optimizer_type=Adam -lr_scheduler LinearLR -epochs 1 \
-num_shards 1002 -split 1000,1,1 -eval_batch_size 1 -eval_seq_length -1 -persist_state 1
-save_epochs 1 -save_iters 5000 -cuda -num_gpus 2 -distributed

In order to utilize Data Parallelism during training time ensure that -cuda is in use and that -num_gpus >1. As mentioned previously, vanilla DataParallelism produces no speedup for recurrent architectures. In order to circumvent this problem turn on the -distributed flag to utilize PyTorch's DistributedDataParallel instead and experience speedup gains.

Also make sure to scale the learning rate as appropriate. Note how we went from 1.25e-4 -> 2.5e-4 learning rates when we doubled the number of gpus.

Note: validation and test datasets are currently not functioning properly for unsupervised reconstruction.

SentimentDiscovery Package

Modules

  • sentiment_discovery
    • sentiment_discovery.modules: implementation of multiplicative LSTM, stacked LSTM.
    • sentiment_discovery.model: implementation of module for processing text, and model wrapper for running experiments
    • sentiment_discovery.reparametrization: implementation of weight reparameterization abstract class, and weight norm reparameterization.
    • sentiment_discovery.data_utils: implementations of datasets, data loaders, data samplers, data preprocessing utils.
    • sentiment_discovery.neuron_transfer: contains utilities for extracting neurons and featurizing text with the values of a model's neurons.
    • sentiment_discovery.learning_rates: contains implementations of learning rate schedules (ie. linear and exponential decay)

Flags

Experiment Flags

  • experiment_dir - Root directory for saving results, models, and figures
  • experiment_name - Name of experiment used for logging to <experiment_dir>/<experiment_name>.

Model Flags

  • should_test - whether to train or evaluate a model
  • embed_size - embedding size for data
  • rnn_type - one of <mlstm,lstm>
  • fuse_lstm - use fused lstm cuda kernels in mLSTM. Currently still buggy Do not use.
  • rnn_size - hidden state dimension of rnn
  • layers - Number of stacked recurrent layers
  • dropout - dropout probability after hidden state (0 = None, 1 = all dropout)
  • weight_norm - boolean flag, which, when enabled, will apply weight norm only to the recurrent (m)LSTM parameters
  • lstm_only - if -weight_norm is applied to the model, apply it to the lstm parmeters only
  • model_dir - subdirectory where models are saved in <experiment_dir>/<experiment_name>
  • load_model - a specific checkpoint file to load from <experiment_dir>/<experiment_name>/<model_dir>

Data Flags

  • batch_size - minibatch size
  • data_size - dimension of each data point
  • seq_length - length of time sequence for reconstruction/generation (sentiment inference has no maximum sequence length)
  • data_set_type - what type of dataset to model. one of [unsupervised,supervised]
  • loose_json - Use loose json (one json-formatted string per newline), instead of tight json (data file is one json string)
  • persist_state - degree to which to persist state across samples for unsupervised learnign. 0 = reset state after every sample, 1 = reset state after finishing a shard, -1 = never reset state
  • transpose - transpose dataset, given batch size. Is always on for unsupervised learning, (should probably only be used for unsupervised).
  • no_wrap - In order to better concatenate strings together for unsupervised learning the batch data sampler does not drop the last batch, it instead "wraps" the dataset around to fill in the last batch of an epoch. On subsequent calls to the batch sampler the data set will have been shifted to account for this "wrapping" and allow for proper hidden state persistence. This is turned on by default in non-unsupervised scripts.
  • cache - cache recently used samples, and preemptively cache future data samples. Helps avoid thrashing in large datasets. Note: currently buggy and provides no performance benefit

Data Processing Flags

  • lazy - lazily load dataset (necessary for large datasets such as amazon)
  • preprocess - load dataset into memory and preprocess it according to section 4 of the paper. (should only be called once on a dataset)
  • shuffle - shuffle unsupervised dataset.
  • text_key - column name/dictionary key for extacting text samples from csv/json files.
  • label_key - column name/dictionary key for extacting sample labels from csv/json files.
  • delim - column delimiter for csv tokens (eg. ',', '\t', '|')
  • drop_unlabeled - drops unlabeled samples from csv/json file
  • binarize_sent - if sentiment labels are not 0 or 1, then binarize them so that the lower half of sentiment ratings get set to 0, and the upper half of the rating scale gets set to 1.
  • num_shards - number of total shards for unsupervised training dataset. If a split is specified, appropriately portions the number of shards amongst the splits.
  • val_shards - number of total shards for unsupervised validation set. If specified, overrides number of shards if any were split from num_shards.
  • test_shards - number of total shards for unsupervised test set. If specified, overrides number of shards if any were split from num_shards.

Dataset Path Flags

  • train - path to training set
  • split - comma-separated list of proportions for splitting the training set into training, validation, and test sets
  • valid - path to validation set
  • test - path to test set

Device Flags

  • cuda - utilize gpus to process strings. Should be on as cpu is too slow to process this.
  • num_gpus - number of gpus available
  • benchmark - (sets torch.backends.cuda.benchmark=True)

System Flags

  • rank - distributed process index.
  • distributed - train with data parallelism distributed across multiple processes
  • world_size- number of distributed processes to run. Do not set. The scripts will automatically set it equal to num_gpus (one gpu per process).
  • verbose - 2 = enable stdout for all processes (including all the data parallel workers). 1 = enable stdout for master worker. 0 = disable all stdout
  • seed - random seed for pytorch random operations

Analysis

Acknowledgement

Thanks to @guillette for providing a lightweight pytorch port of the original weights.

This project uses the amazon review dataset collected by J. McAuley

Thanks

Want to help out? Open up an issue with questions/suggestions or pull requests ranging from minor fixes to new functionality.

May your learning be Deep and Unsupervised.