# Natural Language Processing

A collection of materials I (Kyle Caverly) have found helpful for exploring/understanding NLP, organized by primary interest area.

Several sources recur throughout this list, including:
* [Jay Alammar's - Visualizing Machine Learning Blog](http://jalammar.github.io/)
* [Sebastian Ruder's Research Blog](https://ruder.io/)
* [Andrej Karpathy's Research Blog](http://karpathy.github.io/)
* [Christopher Olah's Research Blog](https://colah.github.io/)
* [OpenAI's Progress](https://openai.com/progress/)
* [NLP Progress](https://nlpprogress.com/)

## Natural Language Processing Basics

#### Overview

##### [How to solve 90% of NLP Problems](https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e)

Approach Framework:
* Start with a quick and simple model
* Explain its predictions
* Understand the kind of mistakes it is making
* Use that knowledge to inform your next step, whether that is working on your data, or a more complex model.

##### [Overview of Text Representations in NLP](https://towardsdatascience.com/an-overview-for-text-representations-in-nlp-311253730af1)

What is a Text Representation in NLP?

* Machine Learning algorithms cannot work directly on raw text, so text has to be converted into a numeric representation.
* The choice of transformation from raw text to numeric representation often determines the success of your model.

Text Representations in NLP are primarily bucketed into three broad categories:

1. One-Hot Encoding

2. Count Vectors/TFIDF Vectors

3. Word Embeddings/Training Embeddings/Contextualized Embeddings
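
A minimal sketch of the first two categories, using only the Python standard library. The toy corpus and the function names are made up for illustration, and the IDF here is the plain `log(N / df)` variant; libraries often use smoothed versions, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}


def one_hot(token):
    """One vocabulary-sized vector per token, with a single 1."""
    vec = [0] * len(vocab)
    vec[index[token]] = 1
    return vec


def count_vector(tokens):
    """One vector per document: how often each vocabulary term occurs."""
    counts = Counter(tokens)
    return [counts[tok] for tok in vocab]


def tfidf_vector(tokens):
    """Term frequency scaled by log(N / document frequency)."""
    counts = Counter(tokens)
    vec = []
    for tok in vocab:
        tf = counts[tok] / len(tokens)
        df = sum(1 for doc in tokenized if tok in doc)
        vec.append(tf * math.log(len(tokenized) / df))
    return vec


print(vocab)
print(one_hot("cat"))
print(count_vector(tokenized[0]))
print([round(x, 3) for x in tfidf_vector(tokenized[0])])
```

In the first document, "the" occurs twice but also appears in two of the three documents, so its TF-IDF weight ends up lower than that of "cat", which occurs once but only in this document.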

#### Common NLP Tasks & Benchmarks

##### [GLUE Explained: Understanding Common NLP Benchmarks](https://mccormickml.com/2019/11/05/GLUE/)

The General Language Understanding Evaluation benchmark (GLUE) is a collection of datasets used for training, evaluating, and analyzing NLP models relative to one another, with the goal of driving “research in the development of general and robust natural language understanding systems.” Many of the models and approaches outlined below are scored against GLUE tasks.

GLUE consists of 9 primary NLP tasks, which are discussed and outlined in the link. It is interesting to note that GLUE was created to focus NLP research on Multi-Task Learning: the development of general language understanding models that are not task-specific and that perform well across a multitude of separate but related tasks. Focusing on algorithms with general language applications across tasks significantly increases the usefulness of these algorithms in industry.

##### [What is Text Classification?](https://monkeylearn.com/text-classification/)

Text Classification is the process of assigning tags or categories to text according to its content.

##### [AllenNLP: Sentiment Analysis Demo](https://demo.allennlp.org/sentiment-analysis)

Sentiment Analysis predicts whether an input is positive or negative. It can be framed as a numerical score, a simple binary classification, or a multi-class classification problem.

##### [AllenNLP: Question Answering / Reading Comprehension](https://demo.allennlp.org/reading-comprehension)

Question Answering is the task of answering questions about a passage of text to show that the system understands the passage.

##### [HuggingFace: Co-Reference Demo](https://huggingface.co/coref/)

"In short, coreference is the fact that two or more expressions in a text - like pronouns or nouns - link to the same person or thing. It is one of the key building blocks to building conversational Artificial Intelligences."

##### [HuggingFace: Write With Transformer - Natural Language Generation](https://transformer.huggingface.co/doc/distil-gpt2)

Natural Language Generation is the task of generating text, given a prompt topic or the beginning of a sentence. Current SOTA models can form logical sentences; however, the sentences are not always entirely factual.

##### [T5 Trivia](https://t5-trivia.glitch.me/)

Question answering trivia demo using T5.

## TFIDF and Keyword Extraction

##### [What is TFIDF?](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089)

##### [The Math Behind TF-IDF](https://www.youtube.com/watch?v=4vT4fzjkGCQ)

##### [Automated Keyword Extraction from Articles using NLP](https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34)

## Preprocessing and Tokenization

##### [Tokenizers: How Machines Read](https://blog.floydhub.com/tokenization-nlp/)

## Topic Modelling

#### Probabilistic Topic Models

##### [ACM: Probabilistic Topic Models](https://cacm.acm.org/magazines/2012/4/147361-probabilistic-topic-models/fulltext)

#### Clustering Based Topic Models

##### [Topic Grouper: An Agglomerative Clustering Approach to Topic Modelling](https://arxiv.org/pdf/1904.06483.pdf)

Topic Grouper is a method proposed by Refinitiv for hierarchical unsupervised topic modelling & exploration of the corpus.
* Topic Grouper creates a disjunctive partitioning of the training vocabulary in a stepwise manner such that resulting partitions represent topics. It is governed by a simple generative model, where the likelihood to generate the training documents via topics is optimized. The algorithm starts with one-word topics and joins two topics at every step. It therefore generates a solution for every desired number of topics ranging between the size of the training vocabulary and one. The process represents an agglomerative clustering that corresponds to a binary tree of topics. A resulting tree may act as a containment hierarchy, typically with more general topics towards the root of the tree and more specific topics towards the leaves.
* Topic Grouper is not governed by a background distribution such as the Dirichlet and avoids hyperparameter optimization.
* Important benefits of agglomerative clustering for topic modeling lie in its simplicity, absence of hyperparameters, deep hierarchical structures of topics as well as the ability to find even conceptually narrow topics. A major challenge is to determine a well-founded cluster distance with reasonable predictive qualities and computational performance.
* They argue that since experiments on more modern coherence metrics (cv, npmi, wi, etc.) have not been conclusive, their models will optimize for perplexity.
* For all text corpora we found that Topic Grouper tends to push stop words and/or function words into separate topics. Therefore, it can do without stop word or function word filtering as a preprocessing step.

##### [Top2Vec: Distributed Representations of Topics](https://arxiv.org/abs/2008.09470)

We present top2vec, which leverages joint document and word semantic embedding to find topic vectors. This model does not require stop-word lists, stemming or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors with distance between them representing semantic similarity. Our experiments demonstrate that top2vec finds topics which are significantly more informative and representative of the corpus trained on than probabilistic generative models.
* They assume that topics are continuous, as there are infinitely many combinations of weighted words which can be used to represent a topic. Additionally, they assume that each document has its own topic with a value in that continuum. In this view, the document's topic is the set of weighted words that are most informative of its unique topic, which can be a combination of the colloquial discrete topics.
* The top2vec model produces jointly embedded topic, document, and word vectors such that distance between them represents semantic similarity. Removing stop-words, lemmatization, stemming, and a priori knowledge of the number of the topics are not required for top2vec to learn good topic vectors.
* Represents documents as doc2vec vectors. doc2vec extends word2vec by adding a paragraph vector to the learning task of the neural network. These jointly embedded document and word vectors are learned such that document vectors are close to word vectors which are semantically similar. This property can be used for information retrieval as word vectors can be used to query for similar documents. It can also be used to find which words are most similar to a document, or most representative of a document.
* These document vectors are generated for each document in the corpus, with an embedding dimension of 300. These embeddings are then reduced to a 5D embedding via UMAP, and the 5D embeddings are clustered by HDBSCAN, a hierarchical density-based clustering method. The primary assumption is that the number of dense clusters found by HDBSCAN represents the number of prominent topics inherent in the corpus. This is a huge advantage, as it does not require the number of topics k to be defined in advance.
* There are few parameters to tune. HDBSCAN has a minimum cluster size, which represents the minimum number of documents needed to create a new topic. They found 15 to be the optimal number in their experiments; however, this can presumably be tuned to a smaller number to generate more granular or "long tail" topics.
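
A rough sketch of the last step of this pipeline, once cluster labels are available. It assumes the labels come from something like HDBSCAN (which marks noise points with `-1`), takes each topic vector to be the centroid of its cluster's document vectors, and uses tiny made-up 2-D vectors in place of real 300-D doc2vec embeddings. The function names are hypothetical and this is not the top2vec library's API.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_vectors(doc_vectors, labels):
    """Average the document vectors of each cluster; label -1 is skipped as noise."""
    groups = {}
    for vec, label in zip(doc_vectors, labels):
        if label != -1:
            groups.setdefault(label, []).append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def topic_words(topic_vec, word_vectors, top_n=3):
    """Words whose vectors lie closest to the topic vector in the shared space."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine(topic_vec, word_vectors[word]),
        reverse=True,
    )
    return ranked[:top_n]


# Toy data: five documents, two dense clusters and one noise point.
doc_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.5, 0.5]]
labels = [0, 0, 1, 1, -1]
word_vectors = {
    "goal": [1.0, 0.0],
    "match": [0.8, 0.3],
    "vote": [0.0, 1.0],
    "senate": [0.2, 0.9],
}

topics = topic_vectors(doc_vectors, labels)
for label, vec in topics.items():
    print(label, topic_words(vec, word_vectors, top_n=2))
```

Because documents, words, and topics live in the same space, the same cosine ranking can be reused to find the documents closest to a topic or to a query word.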

#### Neural Topic Models

##### [prodLDA: Autoencoding Variational Inference for Topic Models](https://arxiv.org/pdf/1703.01488.pdf)

Discusses what the authors believe to be the first effective Autoencoding Variational Bayes inference method for Latent Dirichlet Allocation (LDA), which they call Autoencoded Variational Inference for Topic Model (AVITM). This model tackles the problems caused for AEVB by the Dirichlet prior and by component collapsing. We find that AVITM matches traditional methods in accuracy with much better inference time.
* Autoencoding Variational Bayes (AEVB) is a particularly natural choice for topic models, because it trains an inference network, a neural network that directly maps a document to an approximate posterior distribution, without the need to run further variational updates. This is intuitively appealing because in topic models, we expect the mapping from documents to posterior distributions to be well behaved, that is, that a small change in the document will produce only a small change in topics. This is exactly the type of mapping that a universal function approximator like a neural network should be good at representing. Essentially, the inference network learns to mimic the effect of probabilistic inference, so that on test data, we can enjoy the benefits of probabilistic modeling without paying a further cost for inference. However, black box inference methods are significantly more challenging to apply to topic models.
* On several datasets, we find that AVITM yields topics of equivalent quality to standard mean-field inference, with a large decrease in training time. To illustrate this, we present a new topic model, called ProdLDA, in which the distribution over individual words is a product of experts rather than the mixture model used in LDA. We find that ProdLDA consistently produces better topics than standard LDA, whether measured by automatically determined topic coherence or qualitative examination.


##### [Pretraining is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence](https://arxiv.org/abs/2004.03974)

Propose Contextual Topic Models, which combine pre-trained representations (SBERT Base) and neural topic models (neural-ProdLDA).
* Argue that coherence (NPMI) can be improved by adding more contextual knowledge to the LDA, via Transformer based embeddings.
* Find that adding BERT-based encoded representations significantly increases coherence across all corpora and topic settings. However, this can sometimes come at the expense of topic diversity, as Contextualized Topic Models underperform NVDM and Neural-ProdLDA on topic diversity.
* "This effect is even more remarkable given that we cannot embed long documents, due to the sentence length limit in BERT".

##### [On the importance of the Kullback-Leibler Divergence Term in Variational Autoencoders for Text Generation](https://www.aclweb.org/anthology/D19-5612.pdf)

#### Topic Modelling Analysis & Preprocessing Impacts

##### [How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models](https://osf.io/2rh6g)

* On Pruning: "In theory, the most frequent and infrequent words do not contribute useful information to a topic model. Due to their low conditional probability, infrequent words will never appear in a topic's top words. In contrast, very frequent words are not specific enough; such words will appear in the top words of every topic, and thus add no specific or exclusive information."
* Used a pruning approach in which the top 1% most frequent and 0.5% most infrequent terms were stripped. This resulted in a 95.% reduction of vocabulary size for the news corpus studied.
* Found that beyond sample sizes of 10%, only marginal reliability gains are achieved. As such, large enough samples do not impair the reliability of a topic model: the topics of sampled models resemble the topics of the reference models once the sample exceeds 10% of the corpus and contains no fewer than 10,000 documents in total.
* Found that the news-based model benefitted the most from pruning, as news articles often have a relatively homogeneous writing style within comparatively clearly delimited thematic sections. Such characteristics are beneficial for topic models because the topics' top words tend to be highly specific, and overlap of top words is less likely.
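
A small sketch of frequency-based vocabulary pruning. It assumes the percentages are document-frequency thresholds (drop terms that occur in more than 99% or in fewer than 0.5% of documents); that is one reading of the notes above, not something they state, and a rank-based cut-off would be another. The function name and toy corpus are made up for illustration.

```python
from collections import Counter


def prune_by_document_frequency(tokenized_docs, max_df=0.99, min_df=0.005):
    """Keep only terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    keep = {tok for tok, df in doc_freq.items() if min_df <= df / n_docs <= max_df}
    pruned = [[tok for tok in doc if tok in keep] for doc in tokenized_docs]
    return pruned, keep


corpus = [
    ["the", "election", "vote"],
    ["the", "match", "goal"],
    ["the", "vote", "senate"],
    ["the", "goal", "striker"],
]

# Thresholds are loosened here so that pruning is visible on a four-document corpus.
pruned, kept = prune_by_document_frequency(corpus, max_df=0.75, min_df=0.5)
print(sorted(kept))
print(pruned)
```

On this toy corpus "the" is dropped for appearing in every document and the one-off terms are dropped for appearing in only a quarter of them, leaving "vote" and "goal".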

##### [More Efficient Topic Modelling Through a Noun Only Approach](https://www.aclweb.org/anthology/U15-1013/)

* Found that removing all words except nouns improved the topics' semantic coherence. Observed topic coherence (NPMI) improved by 6% and the average word intrusion detection improved % for the noun only corpus, compared to modelling the raw corpus.
* Similar coherence improvements were seen with lemmatization alone; however, model training is faster when the articles are reduced to nouns only.
* The noun-only corpus produced the fewest topics with low or zero observed coherence (OC), suggesting fewer 'junk' topics.
* Found substantial variability in topics generated from model to model. Though variability is not unexpected in an unsupervised method such as topic modelling, such variability indicates the topics may be unreliable, and is of concern if the end-user seeks to draw detailed conclusions about a corpus based on a single topic model.

#### Topic Evaluation

##### [Automatic Evaluation of Topic Coherence](https://mimno.infosci.cornell.edu/info6150/readings/N10-1012.pdf)

The paper introduces Normalized Pointwise Mutual Information (NPMI) measures, and argues that NPMI coherence measures based on Wikipedia data achieve results nearing the level of inter-annotator correlation.
* NPMI scores coherence based on the statistical independence of observing two words in close proximity (using a sliding window of 10 words).
* Found value in ranking topics based on NPMI, to filter down the usable/valuable topic set.
* Important to remember that useless topics are not chance artifacts produced by the models, but often are in fact stable and robust statistical features in the underlying data.
* "Of all the topic scoring methods tested, PMI (term co-occurrence via simple pointwise mutual information) is the most consistent performer, achieving the best or near-best results over both datasets, and approaching or surpassing the inter-annotator agreement."
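
A minimal sketch of sliding-window NPMI, assuming probabilities are estimated as the share of windows that contain a word or a word pair. The helper names, the toy token stream, and the window size of 2 used in the demo are illustrative choices; the notes above use a 10-word window over Wikipedia.

```python
import math
from itertools import combinations


def window_counts(tokens, window=10):
    """Count the sliding windows in which each word and each word pair occurs."""
    n_windows = max(1, len(tokens) - window + 1)
    word_counts, pair_counts = {}, {}
    for start in range(n_windows):
        seen = sorted(set(tokens[start:start + window]))
        for word in seen:
            word_counts[word] = word_counts.get(word, 0) + 1
        for pair in combinations(seen, 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return word_counts, pair_counts, n_windows


def npmi(w1, w2, word_counts, pair_counts, n_windows):
    """PMI divided by -log P(w1, w2), which bounds the score to [-1, 1]."""
    pair = tuple(sorted((w1, w2)))
    p_joint = pair_counts.get(pair, 0) / n_windows
    if p_joint == 0:
        return -1.0  # the words never share a window
    if p_joint == 1:
        return 0.0  # both words are in every window, so PMI is 0
    p1 = word_counts[w1] / n_windows
    p2 = word_counts[w2] / n_windows
    pmi = math.log(p_joint / (p1 * p2))
    return pmi / -math.log(p_joint)


def topic_coherence(top_words, word_counts, pair_counts, n_windows):
    """Mean NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, word_counts, pair_counts, n_windows) for a, b in pairs) / len(pairs)


tokens = "bank money bank money river water river water".split()
stats = window_counts(tokens, window=2)
print(round(npmi("bank", "money", *stats), 3))
print(npmi("bank", "water", *stats))
```

Word pairs that co-occur more often than chance score above 0, pairs that co-occur less often than chance score below 0, and pairs that never share a window get the floor value of -1.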

##### [Reading Tea Leaves: How Humans Interpret Topic Models](https://papers.nips.cc/paper/3700-reading-tea-leaves-how-humans-interpret-topic-models)

Paper presenting the word intrusion task as a way to quantifiably measure model/topic quality in a way similar to how humans interpret topic models.
* Argue that topic models which perform better on held-out likelihood (perplexity) may infer less semantically meaningful and interpretable topics.
* Word Intrusion Task measures how semantically "cohesive" the topics inferred by a model are and tests whether topics correspond to natural groupings for humans. In the word intrusion task, the subject is presented with six randomly ordered words. The task of the user is to find the word which is out of place or does not belong with the others, i.e. the intruder. When the set of words minus the intruder makes sense together, then the subject should easily identify the intruder. For tasks in which topic coherence is poor, people will typically choose an intruder at random.
* Topic Intrusion Task measures how well a topic model's decomposition of a document as a mixture of topics agrees with human associations of topics with a document. In the topic intrusion task, the subject is presented with four randomly ordered topics associated with a particular document. Three of these topics are the top topics for the document, the fourth is chosen randomly from the other low-probability topics in the model. The subject is then instructed to choose the topic which does not belong with the document. As before, if the topic assignment to documents were relevant and intuitive, we would expect that subjects would select the topic we randomly added as the topic that did not belong.
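
A sketch of how a word intrusion question could be assembled and scored. It assumes each topic is available as a list of words ranked by probability, shows the five top words plus one intruder (six words in total), and simplifies "low probability in this topic" to "not among this topic's top words". The function names and the toy topics are hypothetical.

```python
import random


def word_intrusion_question(topics, topic_id, n_top=5, rng=None):
    """Return the shuffled words to show a subject, plus the true intruder."""
    rng = rng or random.Random()
    top_words = list(topics[topic_id][:n_top])
    # Intruder candidates: prominent in another topic, absent from this topic's top words.
    candidates = sorted({
        word
        for other_id, words in topics.items() if other_id != topic_id
        for word in words[:n_top]
        if word not in top_words
    })
    intruder = rng.choice(candidates)
    shown = top_words + [intruder]
    rng.shuffle(shown)
    return shown, intruder


def model_precision(answers, intruder):
    """Share of subjects who picked the true intruder."""
    return sum(1 for answer in answers if answer == intruder) / len(answers)


topics = {
    "sports": ["goal", "match", "striker", "league", "coach", "referee"],
    "politics": ["vote", "senate", "election", "policy", "minister", "ballot"],
}
shown, intruder = word_intrusion_question(topics, "sports", rng=random.Random(0))
print(shown, intruder)
print(model_precision([intruder, intruder, "goal", intruder], intruder))
```

For a coherent topic most subjects find the intruder and the precision approaches 1; for an incoherent topic the answers spread out and the precision drops towards chance.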

##### [Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality](https://www.aclweb.org/anthology/E14-1056/)

Paper exploring the automatic evaluation of single topics and of whole topic models, and providing recommendations on the best strategy for performing the two tasks.
* Focused on evaluation within the context of topic modelling for direct human consumption.
* They argue that they can automate the Word Intrusion Task at near human levels of accuracy, and thus automatically evaluate the human-interpretability of topics, as well as whole topic models.
* Supports earlier research that perplexity correlates negatively with topic interpretability.
* They find success in automating both NPMI & Word Intrusion for Model Quality but not Topic Quality, as their automated method is 90%+ correlated with human agreement for Model Quality Word Intrusion metrics, but only ~60% correlated with human agreement for Topic Quality Word Intrusion metrics.
* Unfortunately, this supports the idea that evaluating Topic Quality automatically is an incredibly challenging task, and that automatic topic evaluation may not be possible without human-in-the-loop systems.

#### Tooling

##### [Visualizing LDA Models](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/)

Great tool in Python to Visualize LDA results for exploration.

## Deep Learning for Natural Language Processing

#### Pre-BERT: RNNs & LSTMs

##### [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

##### [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Colah's blog post on understanding LSTM networks. A really helpful overview that moves from Recurrent Neural Networks to their application in LSTMs.

#### Helpful Videos/Lectures

##### [OpenAI @ Berkeley: Learning From Text](https://www.youtube.com/watch?v=BnpB3GrpsfM)

Comprehensive Overview of NLP from Bag of Words methods through 2020 methods (T5/ELECTRA etc.). Presented by the team who built GPT and GPT-2.

##### [HuggingFace: The Future of NLP](https://www.youtube.com/watch?v=G5lmya6eKtc)

Really helpful overview of the current state of Deep Learning applications inside NLP. Presented by HuggingFace, producers of the Transformers library and a leading research group.

##### [HuggingFace: Introduction to Transfer Learning](https://www.youtube.com/watch?v=rEGB7-FlPRs)

Really helpful overview of the purpose of Transfer Learning, what it is, and what role certain models play (e.g. BERT/GPT-2).

##### [Exploring the Limits of Transfer Learning](https://www.youtube.com/watch?v=N-7rdJK4xlE)

## Transformers

#### Attention

##### [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473v7.pdf)

##### [Attention is All You Need](https://arxiv.org/abs/1706.03762)

##### [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)

##### [Analyzing Multi-Headed Attention](https://arxiv.org/abs/1905.09418)

#### For Natural Language Understanding

##### [ELMo: Deep Contextualized Word Representations](https://arxiv.org/abs/1802.05365)

##### [BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) ([Video Walkthrough](https://www.youtube.com/watch?v=FKlPCK1uFrc&list=PLam9sigHPGwOBuH4_4fr-XvDbe5uneaf6))

##### [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692)

##### [DistilBERT, A Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter](https://arxiv.org/abs/1910.01108)

##### [T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)

#### For Natural Language Generation

##### [GPT: Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)

##### [GPT-2: Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

##### [GPT-3: Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) ([Blog](https://joeddav.github.io/blog/2020/05/29/ZSL.html))

#### Deeper on BERT - Analysis

##### [BERT Rediscovers the Classical NLP Pipeline](https://arxiv.org/pdf/1905.05950.pdf)

##### [Revealing the Dark Secrets of BERT](https://arxiv.org/abs/1908.08593)

#### Deeper on BERT - Application

##### [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/abs/1905.05583)

## Sentence/Document Embeddings

#### Skip-Gram Based

##### [word2vec: Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

Present word2vec, a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
* Extend the skip-gram model, which does not involve dense matrix multiplication, like other neural models. As such, training is extremely efficient; an optimized single-machine implementation can train on more than 100 billion words in one day.
* The objective of the original skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document, usually predicted across a rolling window of word tokens.
* Use subsampling of frequent words. They weight inclusion of a word token based on its occurrence in the corpus, essentially downweighting commonly occurring stop words while preserving unique or rare words and the ranking of word frequencies. This limits the need for further stop-word removal when using subsampling.
* In order to introduce phrase representations into the word2vec representations, Phrases in the underlying data are replaced with single tokens, ie. Toronto Maple Leafs will be replaced with a single token, whereas "this is" will not be replaced as a token. Automatic selection of Bigrams in this case, is driven by unigram & bigram counts, discounting the phrases which occur very frequently and removing very infrequent words.
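
The three mechanisms above can be sketched in a few lines. This assumes the formulas as I recall them from the paper: a frequent word is discarded with probability `1 - sqrt(t / f)`, where `f` is its relative frequency, and a bigram is scored as `(count(a b) - delta) / (count(a) * count(b))`. The function names, the threshold values, and the toy sentence are illustrative, not the reference implementation.

```python
import math
from collections import Counter


def skipgram_pairs(tokens, window=2):
    """(center, context) training pairs from a rolling window over the tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def discard_probability(word_count, total_count, t=1e-5):
    """Chance of dropping one occurrence of a word during subsampling."""
    freq = word_count / total_count
    return max(0.0, 1.0 - math.sqrt(t / freq))


def phrase_scores(tokens, delta=1.0):
    """Score each bigram; delta discounts bigrams built from very rare words."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        pair: (count - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


tokens = "maple leafs win this is maple leafs this was this is is".split()
scores = phrase_scores(tokens)
print(skipgram_pairs(["a", "b", "c"], window=1))
print(discard_probability(1000, 10_000))
print(scores[("maple", "leafs")], scores[("this", "is")])
```

Bigrams whose score exceeds a chosen threshold would be merged into a single token. In the toy stream "maple leafs" outscores "this is" even though both occur twice, because "this" and "is" are individually more frequent.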

##### [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf)

##### [Doc2Vec: Distributed Representations of Sentences and Documents](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)

In this paper, we propose Paragraph Vector (doc2vec), an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document.
* The name Paragraph Vector is to emphasize the fact that the method can be applied to variable-length pieces of texts, anything from a phrase or sentence to a large document.
* "In our model, the vector representation is trained to be useful for predicting words in a paragraph. More precisely, we concatenate the paragraph vector with several word vectors from a paragraph and predict the following word in the given context. Both word vectors and paragraph vectors are trained by the stochastic gradient descent and backpropagation. While paragraph vectors are unique among paragraphs, the word vectors are shared. At prediction time, the paragraph vectors are inferred by fixing the word vectors and training the new paragraph vector until convergence."
* "Our approach for learning paragraph vectors is inspired by the methods for learning the word vectors. The inspiration is that the word vectors are asked to contribute to a prediction task about the next word in the sentence. So despite the fact that the word vectors are initialized randomly, they can eventually capture semantics as an indirect result of the prediction task. We will use this idea in our paragraph vectors in a similar manner. The paragraph vectors are also asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph."
* Practically, this is done through the use of a specific paragraph vector. These paragraph vectors can almost be thought of as a unique word vector, prepended to the word vectors inside the context window of word2vec, in which the following word is predicted. In this way, both paragraph vectors and word vectors are trained inside the same semantic space, so that paragraph vectors and word vectors are directly comparable.

#### Transformer Based

##### [Sentence-BERT: Sentence Embeddings Using Siamese BERT Networks](https://arxiv.org/abs/1908.10084)

BERT and RoBERTa have set a new SOTA performance on sentence-pair regression tasks like semantic textual similarity. However, they require that both sentences are fed into the network, which causes a massive computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. In this publication, we present Sentence-BERT, a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.
* A large disadvantage of the BERT network structure is that no independent sentence embeddings are computed, which makes it difficult to derive sentence embeddings from BERT. To bypass this limitation, researchers have passed single sentences through BERT and then derived a fixed-size vector by either averaging the outputs (similar to average word embeddings) or by using the output of the special CLS token.
* Practically, they fine-tune the base BERT embeddings using a contrastive siamese network, in which the Euclidean distance between the two BERT embeddings is computed, and optimize a cross-entropy loss on SNLI.
* SBERT sentence embeddings are not intended to be used for transfer learning to other tasks. They are intended for semantic similarity search and visualization.
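
The "about 50 million" figure is the number of unordered pairs among 10,000 sentences, and the saving comes from encoding each sentence once and comparing the cached vectors. A small sketch of both points follows; the 2-D vectors are made-up stand-ins for real sentence embeddings and the function names are hypothetical.

```python
import math
from itertools import combinations


def n_pairs(n_sentences):
    """Number of unordered sentence pairs a pairwise model would have to score."""
    return n_sentences * (n_sentences - 1) // 2


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar_pair(embeddings):
    """embeddings: dict mapping a sentence to its precomputed vector."""
    return max(
        combinations(embeddings, 2),
        key=lambda pair: cosine_similarity(embeddings[pair[0]], embeddings[pair[1]]),
    )


# Toy 2-D vectors standing in for real sentence embeddings (made up for illustration).
embeddings = {
    "A man is eating food.": [0.9, 0.1],
    "A man is eating a meal.": [0.8, 0.2],
    "The stock market fell.": [0.1, 0.9],
}
print(n_pairs(10_000))
print(most_similar_pair(embeddings))
```

The pairwise comparison is still quadratic in the number of sentences, but each comparison is a cheap vector operation; the expensive network is only run once per sentence (10,000 passes) instead of once per pair.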

##### [DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations](https://arxiv.org/abs/2006.03659)

"We present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations, a self-supervised method for learning universal sentence embeddings that transfer to a wide variety of natural language processing (NLP) tasks. Our objective leverages recent advances in deep metric learning (DML) and has the advantage of being conceptually simple and easy to implement, requiring no specialized architectures or labelled training data. We demonstrate that our objective can be used to pretrain transformers to state-of-the-art performance on SentEval, a popular benchmark for evaluating universal sentence embeddings, outperforming existing supervised, semi-supervised and unsupervised methods.
* Due to the limited amount of labelled training data available for many natural language processing (NLP) tasks, transfer learning has become ubiquitous. However, the highest performing solutions require at least some labelled data, limiting their usefulness to languages and domains where labelled data exists for the chosen pretraining tasks.
* Drawing from recent advances in metric learning, we propose a simple and effective self-supervised, sentence-level objective for pretraining transformer-based language models. Metric learning, a type of representation learning, aims to learn an embedding space where the embedded vector representations of similar data are mapped close together, and those of dissimilar data far apart.
* Our objective learns universal sentence embeddings by training an encoder to minimize the distance between the embeddings of textual segments randomly sampled from nearby in the same document.
* The appendix supports the SBERT findings: they find that the "CLS embeddings produced by models trained against the NSP or SOP losses (do not) outperform that of a model trained without either loss and sometimes failed to outperform a bag-of-words (BoW) weak baseline."

## Adversarial NLP

##### [GAN-BERT Generative Adversarial Learning for Text Classification](https://www.aclweb.org/anthology/2020.acl-main.191.pdf)

Paper exploring the use of Generative Adversarial Networks to augment and complement small datasets for text classification:

* Extends BERT-based classification to include a GAN, in which the Generator generates an object similar to a BERT embedding, and the discriminator classifies on K+1 classes, in which the +1 class is the forgery.
* "We empirically demonstrate that the SS-GAN schema applied over BERT, ie. GAN-BERT, reduces the requirement for annotated examples: even with less than 200 annotated examples it is possible to obtain results comparable with a full supervised setting. In any case, the adopted semi-supervised schema always improves the result obtained by BERT."
* "In more detail, the generator is implemented as an MLP with one hidden layer activated by a leaky-relu function. G (Generator) inputs consist of noise vectors drawn from a normal distribution N(0,1). The noise vectors pass through the MLP and finally result in 768-Dimensional vectors, that are used as fake examples in our architecture."
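
A forward-pass sketch of such a generator in plain Python. Only the overall shape comes from the quote above (N(0,1) noise, one leaky-ReLU hidden layer, 768-dimensional output); the noise size, hidden size, leaky-ReLU slope, and weight initialisation are assumptions, biases are omitted, and the weights are untrained.

```python
import random


def leaky_relu(x, slope=0.2):
    """Pass positive values through; scale negative values by a small slope."""
    return x if x > 0 else slope * x


def make_layer(n_in, n_out, rng):
    """Random weight matrix with n_out rows and n_in columns (biases omitted)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def generator(noise, w_hidden, w_out):
    """Noise vector -> hidden layer with leaky ReLU -> fake 'BERT-like' vector."""
    hidden = [leaky_relu(h) for h in dense(w_hidden, noise)]
    return dense(w_out, hidden)


rng = random.Random(0)
NOISE_DIM, HIDDEN_DIM, OUT_DIM = 100, 64, 768  # only 768 is taken from the paper's quote
w_hidden = make_layer(NOISE_DIM, HIDDEN_DIM, rng)
w_out = make_layer(HIDDEN_DIM, OUT_DIM, rng)

noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
fake = generator(noise, w_hidden, w_out)
print(len(fake))
```

In the full setup the discriminator would receive either such a fake vector or a real BERT representation and classify it into one of the K real classes or the extra "fake" class.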

##### [AdvBERT: BERT is not robust on misspellings! Generating natural adversarial samples on BERT](https://arxiv.org/abs/2003.04985)

Work systematically exploring the robustness of BERT, the state-of-the-art Transformer-style model in NLP, in dealing with noisy data, particularly mistakes caused by typing errors on the keyboard, and other errors that occur inadvertently.
* Experiments indicate that: 1. Typos in different words of a sentence do not have equal influence; typos in informative words do more damage. 2. Mistyping is the most damaging error type, compared with character insertion, deletion, etc. 3. Humans and machines focus on different things when recognizing adversarial attacks.
* Injecting only a single typo for highly informative words (words with a high gradient norm) degrades accuracy by 22.6%.
* On the other hand, less informative words have much less impact (adding 10 typos to non-informative words decreases accuracy by only 9%).
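
The three error types above (mistype, insertion, deletion) are easy to sketch. The snippet below is a rough illustration, not the paper's code: the keyboard neighbour map is a hypothetical, partial one, and the word to attack is picked by the caller rather than by gradient norm.

```python
import random

# Hypothetical, partial QWERTY neighbour map; a real attack would cover the full keyboard.
KEYBOARD_NEIGHBOURS = {
    "a": "qwsz", "e": "wrsd", "i": "uojk", "o": "ipkl",
    "n": "bmhj", "r": "etdf", "s": "awedxz", "t": "ryfg",
}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def mistype(word, pos, rng):
    """Replace one character with a neighbouring key (any other letter if unmapped)."""
    neighbours = KEYBOARD_NEIGHBOURS.get(word[pos], ALPHABET)
    # Exclude the original character so the word always changes.
    choices = [c for c in neighbours if c != word[pos]]
    return word[:pos] + rng.choice(choices) + word[pos + 1:]


def insert_char(word, pos, rng):
    """Insert a random letter before position `pos`."""
    return word[:pos] + rng.choice(ALPHABET) + word[pos:]


def delete_char(word, pos, rng):
    """Drop the character at `pos` (rng is unused, kept for a uniform signature)."""
    return word[:pos] + word[pos + 1:]


TYPO_FUNCTIONS = {"mistype": mistype, "insert": insert_char, "delete": delete_char}


def inject_typo(sentence, word_index, kind, rng):
    """Apply one typo of the given kind to the word at `word_index`."""
    words = sentence.split()
    target = words[word_index]
    pos = rng.randrange(len(target))
    words[word_index] = TYPO_FUNCTIONS[kind](target, pos, rng)
    return " ".join(words)


rng = random.Random(0)
sentence = "the movie was surprisingly good"
for kind in TYPO_FUNCTIONS:
    print(kind, "->", inject_typo(sentence, 3, kind, rng))
```

To reproduce the paper's setup you would rank words by the gradient norm of the model's loss and pass the top-ranked index as `word_index`.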

##### [Latent Name Artifacts in Pre-Trained Language Models](https://arxiv.org/abs/2004.03012)

Pre-trained language models may perpetuate biases originating in their training corpus to downstream models. We focus on artifacts associated with the representation of given names (e.g. Donald), which, depending on the corpus, may be associated with specific entities, as indicated by next token prediction (e.g. Trump). While helpful in some contexts, grounding happens also in underspecified or inappropriate contexts. This paper demonstrates the effects of latent name artifacts in downstream tasks.
* Pre-trained LMs do not treat given names as interchangeable or anonymous; this has implications not only for the quality and accuracy of systems that employ these LMs, but also for the fairness of those systems.

##### [Reevaluating Adversarial Examples in Natural Language](https://arxiv.org/pdf/2004.14174.pdf)

##### [TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP](https://arxiv.org/pdf/2005.05909.pdf)

##### [Universal Adversarial Triggers for Attacking and Analyzing NLP](https://arxiv.org/abs/1908.07125v2)

##### [Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment](https://arxiv.org/pdf/1907.11932.pdf)

Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness of these models by exposing the maliciously crafted adversarial examples. In this paper, we present TextFooler, a simple but strong baseline to generate adversarial text. By applying it to two fundamental natural language tasks, text classification and textual entailment, we successfully attacked three target models, including the powerful pre-trained BERT, and the widely used convolutional and recurrent neural networks.
* Formally, besides the ability to fool the target models, outputs of a natural language attacking system should also meet three key utility preserving properties: (1) human prediction consistency--prediction by humans should remain unchanged, (2) semantic similarity--the crafted example should bear the same meaning as the source, as judged by humans, and (3) language fluency--generated examples should look natural and grammatical.
* Specifically, TextFooler first identifies the important words for the target model and then prioritizes replacing them with the most semantically similar and grammatically correct words until the prediction is altered.
* On the adversarial examples, we can reduce the accuracy of almost all target models in all tasks to below 10% with only less than 20% of the original words perturbed.
* "Given a sentence of n words, we observe that only some keywords act as influential signals for the prediction model F, echoing with the discovery that BERT attends to the statistical cues of some words."
* Models with high original accuracy are, in general, more difficult to be attacked.
* We found in experiments that it is easy for the attack system to convert a real news article to a fake one, whereas the reverse process is much harder, which is in line with intuition.
* They then trained the BERT model itself on adversarial examples. After adversarial re-training, both the after-attack accuracy and the perturbed-words ratio get higher, indicating that the model is harder to attack. We can enhance the robustness of a model to future attacks by training it with the generated adversarial examples.
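
The two-step loop described above (rank words by importance, then swap in similar words until the prediction flips) can be sketched in a few lines. Everything here is a toy stand-in: the lexicon "model", the synonym table, and the function names are hypothetical, and the real TextFooler also filters candidates by part of speech and sentence-level semantic similarity, which this sketch skips.

```python
# Toy stand-ins (all hypothetical): a lexicon "model" and a small synonym table.
WORD_WEIGHTS = {"great": 2.0, "good": 1.0, "fine": 0.2, "decent": 0.3,
                "boring": -1.5, "dull": -1.0, "slow": -0.5}
SYNONYMS = {"great": ["decent", "fine"], "good": ["fine", "decent"],
            "boring": ["slow", "dull"]}


def positive_score(words):
    """Toy target model F: sum of word weights; above 0 means 'positive'."""
    return sum(WORD_WEIGHTS.get(w, 0.0) for w in words)


def predict(words):
    return "positive" if positive_score(words) > 0 else "negative"


def confidence(words, label):
    """Signed score in favour of `label`."""
    score = positive_score(words)
    return score if label == "positive" else -score


def rank_by_importance(words, label):
    """Step 1: importance = drop in confidence when a word is deleted."""
    base = confidence(words, label)
    drops = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        drops.append((base - confidence(reduced, label), i))
    # Most important first; words whose removal does not hurt are skipped.
    return [i for drop, i in sorted(drops, reverse=True) if drop > 0]


def attack(sentence):
    """Step 2: replace important words with synonyms until the label flips."""
    words = sentence.split()
    original = predict(words)
    for i in rank_by_importance(words, original):
        best = words
        for candidate in SYNONYMS.get(words[i], []):
            trial = words[:i] + [candidate] + words[i + 1:]
            # Keep the candidate that lowers confidence in the original label most.
            if confidence(trial, original) < confidence(best, original):
                best = trial
        words = best
        if predict(words) != original:
            return " ".join(words), True
    return " ".join(words), False


print(attack("a great but boring film"))  # ('a fine but boring film', True)
```

Here only "great" supports the positive label, so it is the only word ranked, and swapping it for "fine" is enough to flip the toy model's prediction.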

##### [FireBERT: Hardening BERT-based classifiers against adversarial attack](https://arxiv.org/abs/2008.04203v1)

## Model Size & Complexity

##### [Scaling Laws for Neural Language Models](https://arxiv.org/pdf/2001.08361.pdf)

## Tooling

#### PyTorch vs. Tensorflow vs. Keras

[**Tensorflow**](https://www.tensorflow.org/)   
Google's open source machine learning library. Generally seen as excelling when it comes to production-ready models, GPU/TPU distributed training, and edge inference.

[**Keras**](https://keras.io/)    
An independent open source machine learning library supported by Google. It is a high-level API/interface over Tensorflow and other ML libraries, speeding up experimentation and testing.

[**PyTorch**](https://pytorch.org/)  
Facebook's open source machine learning library. Generally seen as a more flexible alternative to TF/Keras, at the expense of production readiness. However, closing this gap is a large focus of current development for the library.

**Low Level vs. High Level**  
Tensorflow operates as primarily a High-Level interface for Neural Network programming, with access to lower level functions.  
Keras is exclusively High-Level, abstracting away many of the Tensorflow intricacies.  
PyTorch is a lower-level interface, which generally leaves it heavily favoured by researchers. While this can lead to more bugs and a steeper learning curve for certain workflows, I have found PyTorch to be more rewarding, as it gives the programmer a better understanding of the lower-level mechanics of many of the algorithms.

**Readability**  
Keras offers a much more readable programming experience, vs TensorFlow/PyTorch, but this comes at the cost of expressiveness and flexibility.

**Bonus Features**  
[TensorBoard](https://www.tensorflow.org/tensorboard): A visualization gui built on top of Tensorflow, provides rich metrics for monitoring training/performance. However, a subset of the tooling has recently become available for PyTorch.  
[Pytorch Lightning](https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09): A high-level API for PyTorch, abstracting away a lot of boilerplate code related to training. [Ignite](https://github.com/pytorch/ignite) is another similar option for PyTorch.

#### Snorkel

[Snorkel](https://www.zdnet.com/article/is-googles-snorkel-drybell-the-future-of-enterprise-data-management/) is a tool jointly developed between Google and Stanford research. It uses probabilistic estimates derived from human defined heuristics to algorithmically generate labels for "Weak Supervision" algorithms. It has been generating interest as a way for Industry to integrate and quickly generate machine learning quality data with enterprise knowledge.
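
To give a feel for "human defined heuristics" turned into labels, here is a minimal sketch of the idea in plain Python. It does not use the Snorkel library: the labeling functions and the spam/ham task are hypothetical, and a simple majority vote stands in for Snorkel's learned label model, which weighs heuristics by their estimated accuracy.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1


# Each labeling function encodes one heuristic and may abstain.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN


def lf_short_reply(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]


def weak_label(text):
    """Return (label, agreement) from a majority vote over non-abstaining heuristics."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN, 0.0
    # On a tie, Counter returns the label that was voted for first.
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


examples = [
    "You are a winner, claim your prize at http://example.com",
    "thanks, see you",
    "The quarterly report is attached for review",
]
for text in examples:
    print(weak_label(text), text)
```

The resulting noisy labels, together with the agreement score, are what a downstream "Weak Supervision" model would train on in place of hand-annotated data.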

#### HuggingFace's Transformers Library

[Transformers](https://huggingface.co/transformers/) is an open source library which is a treasure trove of modern SOTA Deep Learning models for NLP. It is mostly built over PyTorch, providing pretrained weights and PyTorch implementations for most SOTA models produced via research. It is heavily supported and actively developed, and with everything open source and well documented, it makes testing/building modern NLP models much easier.

#### MLflow

[MLflow](https://mlflow.org/) is an open source tool developed by the team at Databricks for managing the machine learning lifecycle. It offers an easy and flexible Python logging-like interface to track model metrics/data. It also offers Docker-like functionality related to model management/deployment, with a focus on reproducible research/experimentation and productionalization.