- Applications
- Pipeline
- Scores
- Transfer Learning
- Transformers theory
- DL Models
- Python Packages
Application | Description | Type |
---|---|---|
Part-of-speech tagging (POS) | Identify if each word is a noun, verb, adjective, etc. (aka parsing). | NLP |
Named entity recognition (NER) | Identify names, organizations, locations, medical codes, times, etc. | NLP |
Coreference resolution | Identify several occurrences of the same person/object, e.g. "he", "she". | NLP |
Text categorization | Identify topics present in a text (sports, politics, etc.). | NLP |
Question answering | Answer questions about a given text (SQuAD, DROP datasets). | NLU |
Sentiment analysis | Positive or negative comment/review classification. | NLU |
Language Modeling (LM) | Predict the next word. Unsupervised. | NLU |
Masked Language Modeling (MLM) | Predict the omitted words. Unsupervised. | NLU |
Next Sentence Prediction (NSP) | Predict whether one sentence follows another. | NLU |
Summarization | Create a short version of a text. | NLU |
Translation | Translate into a different language. | NLU |
Dialogue bot | Interact in a conversation. | NLU |
Speech recognition | Speech to text. See AUDIO cheatsheet. | Speech |
Speech generation | Text to speech. See AUDIO cheatsheet. | Speech |
- NLP: Natural Language Processing
- NLU: Natural Language Understanding
- Speech: Speech and sound (speak and listen)
- Preprocess (see the NLTK sketch after this list)
  - Tokenization: Split the text into sentences and the sentences into words.
  - Lowercasing: Usually done during tokenization.
  - Punctuation removal: Remove symbols like `.`, `,`, `:`. Usually done during tokenization.
  - Stopwords removal: Remove words like `and`, `the`, `him`. Done in the past.
  - Lemmatization: Verbs to root form: `organizes`, `will organize`, `organizing` → `organize`. This is better.
  - Stemming: Nouns to root form: `democratic`, `democratization` → `democracy`. This is faster.
  - Subword tokenization: Used in transformers. ⭐
    - WordPiece: Used in BERT
    - Byte Pair Encoding (BPE): Used in GPT-2 (2016)
    - Unigram Language Model: (2018)
    - SentencePiece: (2018)
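A minimal preprocessing sketch with NLTK (illustrative only; assumes the `punkt`, `stopwords`, and `wordnet` resources have been downloaded):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")  # first run only

text = "The cats were organizing the meetings."

tokens = nltk.word_tokenize(text.lower())            # tokenization + lowercasing
tokens = [t for t in tokens if t.isalpha()]          # punctuation removal
sw = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in sw]          # stopwords removal

lemmatizer, stemmer = WordNetLemmatizer(), PorterStemmer()
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])  # lemmatization (root forms)
print([stemmer.stem(t) for t in tokens])                   # stemming (chops suffixes)
```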
- Extract features
  - Document features
  - Word features
    - Word Vectors: Unique representation for every word (independent of its context). See the gensim sketch after this list.
      - Word2Vec: By Google in 2013
      - GloVe: By Stanford
      - FastText: By Facebook
    - Contextualized Word Vectors: Good for polysemic words (meaning depends on its context).
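A minimal sketch of training word vectors with gensim's Word2Vec (assumes gensim ≥ 4; the toy corpus and parameters are illustrative):

```python
from gensim.models import Word2Vec

# Tiny tokenized corpus; in practice use millions of sentences.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1 -> skip-gram

print(model.wv["cat"].shape)         # one fixed vector per word, independent of context
print(model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity
```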
- Build model
  - Bag of Embeddings
  - Linear algebra/matrix decomposition (see the topic-modeling sketch after this list)
    - Latent Semantic Analysis (LSA), which uses Singular Value Decomposition (SVD).
    - Non-negative Matrix Factorization (NMF)
    - Latent Dirichlet Allocation (LDA): Good for BoW
  - Neural nets
    - Recurrent NN decoder (LSTM, GRU)
    - Transformer decoder (GPT, BERT, ...) ⭐
  - Hidden Markov Models
  - Regular expressions (regex): Find patterns.
  - Parse trees: Syntax of a sentence
  - Recurrent nets
    - GRU
    - LSTM
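A short topic-modeling sketch with scikit-learn, fitting LDA on Bag-of-Words counts (assumes scikit-learn ≥ 1.0; the documents are toy examples):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the match ended with a late goal",
        "the team signed a new striker",
        "the government announced new elections",
        "parliament debated the election law"]

bow = CountVectorizer(stop_words="english")          # Bag of Words counts
X = bow.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                    # document-topic distributions
print(doc_topics.round(2))

words = bow.get_feature_names_out()
for k, topic in enumerate(lda.components_):          # top words per topic
    print(k, [words[i] for i in topic.argsort()[-3:]])
```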
- Tricks
  - Teacher forcing: Feed the decoder the correct previous word instead of its own predicted previous word (at the beginning of training). See the sketch after this list.
  - Attention: Learns weights to perform a weighted average of the word embeddings.
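A minimal sketch of teacher forcing with a toy GRU decoder in PyTorch (all names, sizes, and data here are illustrative placeholders):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 100, 32, 64
embed  = nn.Embedding(vocab_size, emb_dim)
gru    = nn.GRUCell(emb_dim, hid_dim)
to_out = nn.Linear(hid_dim, vocab_size)

target = torch.randint(0, vocab_size, (1, 10))   # ground-truth sequence (batch=1, len=10)
hidden = torch.zeros(1, hid_dim)
prev_token = target[:, 0]                        # start from the first ground-truth token
teacher_forcing = True

loss_fn, loss = nn.CrossEntropyLoss(), 0.0
for t in range(1, target.size(1)):
    hidden = gru(embed(prev_token), hidden)
    logits = to_out(hidden)
    loss = loss + loss_fn(logits, target[:, t])
    if teacher_forcing:
        prev_token = target[:, t]                # feed the correct previous word
    else:
        prev_token = logits.argmax(dim=-1)       # feed the model's own prediction
print(loss.item())
```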
- Tokenizer: Create subword tokens. Methods: BPE, ...
- Embedding: Create vectors for each token. Sum of:
  - Token embedding
  - Positional encoding: Information about token order (e.g. sinusoidal function).
- Dropout
- Normalization
- Multi-head attention layer (with a left-to-right attention mask); see the sketch after this list.
  - Each attention head uses self-attention to process each input token conditioned on the other input tokens.
  - The left-to-right attention mask ensures that each token only attends to the positions that precede it.
- Normalization
- Feed-forward layers:
  - Linear H→4H
  - GeLU activation function
  - Linear 4H→H
- Normalization
- Output embedding
- Softmax
- Label smoothing: Ground truth → 90% on the correct word, with the remaining 10% divided among the other words.
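A minimal NumPy sketch of single-head scaled dot-product self-attention with a left-to-right (causal) mask; shapes and random weights are illustrative:

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -1e9                                        # hide positions to the right
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over positions
    return weights @ V                                         # weighted average of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                                   # 5 tokens, d_model=16
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(causal_self_attention(X, Wq, Wk, Wv).shape)              # (5, 8)
```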
- Lowest layers: morphology
- Middle layers: syntax
- Highest layers: Task-specific semantics
Score | For what? | Description | Interpretation |
---|---|---|---|
Perplexity | LM | Exponentiated cross-entropy of the text under the model | The lower the better. |
GLUE | NLU | An average of different task scores | |
BLEU | Translation | Compare generated with reference sentences (N-gram overlap) | The higher the better. |

Note that BLEU only measures surface overlap: "He ate the apple" and "He ate the potato" can get the same BLEU score (see the sketch below).
Step | Task | Data | Who does this? |
---|---|---|---|
1 | [Masked] Language Model pretraining | Lots of text (e.g. a Wikipedia corpus) | Google or Facebook |
2 | [Masked] Language Model fine-tuning | Only your domain text corpus | You |
3 | Your supervised task (classification, etc.) | Your labeled domain text | You |
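A minimal sketch of step 3 with the Hugging Face `transformers` library (the successor of `pytorch-transformers`; assumes `transformers` ≥ 4, and the model name, data, and hyperparameters are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts, labels = ["great movie", "terrible movie"], torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)   # returns loss and logits
outputs.loss.backward()                   # one tiny training step
optimizer.step()
```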
pip install spacy
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download es_core_news_md
import spacy
nlp = spacy.load("en_core_web_sm") # Load English small model
nlp = spacy.load("es_core_news_sm") # Load Spanish small model without Word2Vec
nlp = spacy.load('es_core_news_md') # Load Spanish medium model with Word2Vec
text = nlp("Hola, me llamo Javi") # Text from string
text = nlp(open("file.txt").read()) # Text from file
spacy.displacy.render(text, style='ent', jupyter=True) # Display text entities
spacy.displacy.render(text, style='dep', jupyter=True) # Display word dependencies
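Beyond the displaCy rendering, the same `Doc` object exposes POS tags, lemmas, and entities directly (a small sketch reusing the `text` object from above):

```python
for token in text:
    print(token.text, token.lemma_, token.pos_)   # POS tagging and lemmas

for ent in text.ents:
    print(ent.text, ent.label_)                   # named entities
```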
`es_core_news_md` has 534k keys, 20k unique vectors (50 dimensions).
coche = nlp("coche")  # "car"
moto = nlp("moto")    # "motorbike"
print(coche.similarity(moto))  # Similarity based on cosine distance
coche[0].vector                # Show the word's vector
Deep learning models

🤗 means a pretrained PyTorch implementation is available in the pytorch-transformers package developed by Hugging Face.
Model | Creator | Date | Brief description | Data | 🤗 |
---|---|---|---|---|---|
1st Transformer | Google | Jun. 2017 | Encoder & decoder transformer with attention | | |
ULMFiT | Fast.ai | Jan. 2018 | Regular LSTM | | |
ELMo | AllenNLP | Feb. 2018 | Bidirectional LSTM | | |
GPT | OpenAI | Jun. 2018 | Transformer on LM | | ✔ |
BERT | Google | Oct. 2018 | Transformer on MLM (& NSP) | 16GB | ✔ |
Transformer-XL | Google/CMU | Jan. 2019 | | | ✔ |
XLM/mBERT | Facebook | Jan. 2019 | Multilingual LM | | ✔ |
Transf. ELMo | AllenNLP | Jan. 2019 | | | |
GPT-2 | OpenAI | Feb. 2019 | Good text generation | | ✔ |
ERNIE | Baidu research | Apr. 2019 | | | |
XLNet | Google/CMU | Jun. 2019 | BERT + Transformer-XL | 130GB | ✔ |
RoBERTa | Facebook | Jul. 2019 | BERT without NSP | 160GB | ✔ |
MegatronLM | Nvidia | Aug. 2019 | Big models with parallel training | | |
DistilBERT | Hugging Face | Aug. 2019 | Compressed BERT | 16GB | ✔ |
MiniBERT | Google | Aug. 2019 | Compressed BERT | | |
ALBERT | Google | Sep. 2019 | Parameter reduction on BERT | | |
https://huggingface.co/pytorch-transformers/pretrained_models.html
Model | 2L | 3L | 6L | 12L | 18L | 24L | 36L | 48L | 54L | 72L |
---|---|---|---|---|---|---|---|---|---|---|
1st Transformer | | | Yes | | | | | | | |
ULMFiT | | Yes | | | | | | | | |
ELMo | Yes | | | | | | | | | |
GPT | | | | 110M | | | | | | |
BERT | | | | 110M | | 340M | | | | |
Transformer-XL | | | | | 257M | | | | | |
XLM/mBERT | | | | Yes | | Yes | | | | |
Transf. ELMo | | | | | | | | | | |
GPT-2 | | | | 117M | | 345M | 762M | 1542M | | |
ERNIE | | | | Yes | | | | | | |
XLNet | | | | 110M | | 340M | | | | |
RoBERTa | | | | 125M | | 355M | | | | |
MegatronLM | | | | | | 355M | | | 2500M | 8300M |
DistilBERT | | | 66M | | | | | | | |
MiniBERT | | Yes | | | | | | | | |
- Attention: (Aug 2015)
- Allows the network to refer back to the input sequence, instead of forcing it to encode all the information into one fixed-length vector.
- Paper: Effective Approaches to Attention-based Neural Machine Translation
- blog
- attention and memory
- 1st Transformer: (Google AI, Jun. 2017)
- Introduces the transformer architecture: Encoder with self-attention, and decoder with attention.
- Surpassed the RNN state of the art
- Paper: Attention Is All You Need
- blog.
- ULMFiT: (Fast.ai, Jan. 2018)
- Regular LSTM Encoder-Decoder architecture with no attention.
- Introduces the idea of transfer-learning in NLP:
- Take a trained language model (one that predicts which word comes next), trained for example on a Wikipedia corpus (Wikitext-103).
- Retrain it with your corpus data
- Train your task (classification, etc.)
- Paper: Universal Language Model Fine-tuning for Text Classification
- ELMo: (AllenNLP, Feb. 2018)
- Context-aware embedding = better representation. Useful for synonyms.
- Made with bidirectional LSTMs trained on a language modeling (LM) objective.
- Parameters: 94 millions
- Paper: Deep contextualized word representations
- site.
- GPT: (OpenAI, Jun. 2018)
- Made with transformer trained on a language modeling (LM) objective.
- Same as the transformer, but with transfer learning for other NLP tasks.
- First train the decoder for language modelling with unsupervised text, and then train other NLP task.
- Parameters: 110 millions
- Paper: Improving Language Understanding by Generative Pre-Training
- site, code.
- BERT: (Google AI, Oct. 2018)
- Bi-directional training of transformer:
- Replaces language modeling objective with "masked language modeling".
- Words in a sentence are randomly erased and replaced with a special token ("masked").
- Then, a transformer is used to generate a prediction for the masked word based on the unmasked words surrounding it, both to the left and right.
- Parameters:
- BERT-Base: 110 millions
- BERT-Large: 340 millions
- Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Official code
- blog
- fastai alumn blog
- blog3
- slides
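A quick way to see masked language modeling in action with the Hugging Face `transformers` fill-mask pipeline (downloads a pretrained BERT; purely illustrative):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The man went to the [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))  # top predictions for the masked word
```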
- Transformer-XL: (Google/CMU, Jan. 2019)
- Learning long-term dependencies
- Resolved Transformer's Context-Fragmentation
- Outperforms BERT in LM
- Paper: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- blog
- google blog
- code.
- XLM/mBERT: (Facebook, Jan. 2019)
- Multilingual Language Model (100 languages)
- SOTA on cross-lingual classification and machine translation
- Parameters: 665 millions
- Paper: Cross-lingual Language Model Pretraining
- code
- blog
- Transformer ELMo: (AllenNLP, Jan. 2019)
- Parameters: 465 millions
- GPT-2: (OpenAI, Feb. 2019)
- Zero-Shot task learning
- Coherent paragraphs of generated text
- Parameters: 1500 millions
- Site
- Paper: Language Models are Unsupervised Multitask Learners
- ERNIE (Baidu research, Apr. 2019)
- Word-aware, structure-aware, and semantic-aware tasks
- Continual pre-training
- Paper: ERNIE: Enhanced Representation through Knowledge Integration
- XLNet: (Google/CMU, Jun. 2019)
- Auto-Regressive methods for LM
- Combines the best of BERT and Transformer-XL
- Parameters: 340 millions
- Paper: XLNet: Generalized Autoregressive Pretraining for Language Understanding
- code
- RoBERTa (Facebook, Jul. 2019)
- Facebook's improvement over BERT
- Optimized BERT's training process and hyperparameters
- Parameters:
- RoBERTa-Base: 125 millions
- RoBERTa-Large: 355 millions
- Trained on 160GB of text
- Paper RoBERTa: A Robustly Optimized BERT Pretraining Approach
- MegatronLM (Nvidia, Aug. 2019)
- Very large models, trained with model parallelism
- Parameters: 8300 millions
- DistilBERT (Hugging Face, Aug. 2019)
- Compression of BERT with Knowledge distillation (teacher-student learning)
- A small model (DistilBERT) is trained with the output of a larger model (BERT)
- Comparable results to BERT using fewer parameters
- Parameters: 66 millions
- TO-DO read:
- Read MASS (transfer learning in translation for transformers?)
- Read CNNs better than attention
- Modern NLP
- Courses
- Fast.ai NLP course: playlist
- spaCy course
- Transformers
- The Illustrated Transformer (June 2018)
- The Illustrated BERT & ELMo (December 2018)
- The Illustrated GPT-2 (August 2019) ⭐
- Best Transformers explanation (August 2019) ⭐
- BERT summary
- BERT, RoBERTa, DistilBERT, XLNet. Which one to use?
- DistilBERT model by huggingface
- Transfer Learning in NLP by Sebastian Ruder
- Blog
- Slides ⭐
- Notebook: from scratch pytorch ⭐⭐
- Notebook2: pytorch-transformers + Fast.ai ⭐⭐
- Code (Github)
- Video
- NLP transfer learning libraries
- 7 NLP libraries
- spaCy blog
- What is NLP? ✔
- Topic Modeling with SVD & NMF
- Topic Modeling & SVD revisited
- Sentiment Classification with Naive Bayes
- Sentiment Classification with Naive Bayes & Logistic Regression, contd.
- Derivation of Naive Bayes & Numerical Stability
- Revisiting Naive Bayes, and Regex
- Intro to Language Modeling
- Transfer learning
- ULMFit for non-English Languages
- Understanding RNNs
- Seq2Seq Translation
- Word embeddings quantify 100 years of gender & ethnic stereotypes
- Text generation algorithms
- Implementing a GRU
- Algorithmic Bias
- Introduction to the Transformer ✔
- The Transformer for language translation ✔
- What you need to know about Disinformation