Natural Language Processing

📱 Applications
📋 Pipeline
📏 Scores
👨🏻‍🏫 Transfer Learning
🤖 Transformers theory
🔮 DL Models
📦 Python Packages

📱 Applications

| Application | Description | Type |
|---|---|---|
| 🏷️ Part-of-speech tagging (POS) | Identify whether each word is a noun, verb, adjective, etc. (aka parsing). | 🔤 |
| 📍 Named entity recognition (NER) | Identify names, organizations, locations, medical codes, times, etc. | 🔤 |
| 👦🏻❓ Coreference resolution | Identify multiple mentions of the same person/object, like "he", "she". | 🔤 |
| 🔍 Text categorization | Identify the topics present in a text (sports, politics, etc.). | 🔤 |
| ❓ Question answering | Answer questions about a given text (SQuAD, DROP datasets). | 💭 |
| 👍🏼👎🏼 Sentiment analysis | Classify a comment/review as positive or negative. | 💭 |
| 🔮 Language Modeling (LM) | Predict the next word. Unsupervised. | 💭 |
| 🔮 Masked Language Modeling (MLM) | Predict the omitted (masked) words. Unsupervised. | 💭 |
| 🔮 Next Sentence Prediction (NSP) | Predict whether one sentence follows another. Unsupervised. | 💭 |
| 📗→📄 Summarization | Create a short version of a text. | 💭 |
| 🈯→🆗 Translation | Translate into a different language. | 💭 |
| 🆓→🆒 Dialogue bot | Interact in a conversation. | 💭 |
| 💁🏻→🔠 Speech recognition | Speech to text. See the AUDIO cheatsheet. | 🗣️ |
| 🔠→💁🏻 Speech generation | Text to speech. See the AUDIO cheatsheet. | 🗣️ |

  • 🔤: Natural Language Processing (NLP)
  • 💭: Natural Language Understanding (NLU)
  • 🗣️: Speech and sound (speak and listen)

📋 Pipeline

  1. Preprocess
    • Tokenization: Split the text into sentences and the sentences into words.
    • Lowercasing: Usually done in Tokenization
    • Punctuation removal: Remove words like ., ,, :. Usually done in Tokenization
    • Stopwords removal: Remove words like and, the, him. Done in the past.
    • Lemmatization: Reduce words to their dictionary form (lemma): organizes, will organize, organizing → organize. This is more accurate.
    • Stemming: Crudely chop words down to a common root: democratic, democratization → democrat. This is faster.
    • Subword tokenization: Used in transformers. ⭐
  2. Extract features (see the sketch after this list)
    • Document features
      • Bag of Words (BoW): Counts how many times a word appears in a text. (It can be normalized by text length.)
      • TF-IDF: Measures the relevance of each word in a document, not just its frequency like BoW.
      • N-gram: Probability of N words appearing together.
      • Sentence and document vectors. paper2014, paper2017
    • Word features
      • Word Vectors: Unique representation for every word (independent of its context).
        • Word2Vec: By Google in 2013
        • GloVe: By Stanford
        • FastText: By Facebook
      • Contextualized Word Vectors: Good for polysemous words (the meaning depends on the context).
        • CoVe: By Salesforce in 2017
        • ELMo: Done with bidirectional LSTMs. By the Allen Institute in 2018
        • Transformer encoder: Done with self-attention. ⭐
  3. Build model
    • Bag of Embeddings
    • Linear algebra/matrix decomposition
      • Latent Semantic Analysis (LSA) that uses Singular Value Decomposition (SVD).
      • Non-negative Matrix Factorization (NMF)
      • Latent Dirichlet Allocation (LDA): Good for BoW
    • Neural nets
      • Recurrent NNs (LSTM, GRU)
      • Transformer encoder/decoder (BERT, GPT, ...) ⭐
    • Hidden Markov Models
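
For illustration, a minimal sketch of steps 1 and 2 (tokenization, lowercasing, punctuation/stopword removal, lemmatization, then BoW and TF-IDF features). It assumes spaCy with en_core_web_sm and scikit-learn are installed; the toy sentences and variable names are made up:

import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

docs = ["The dogs are running in the park.",
        "A dog runs after the ball."]

# 1. Preprocess: tokenize, lowercase, drop punctuation and stopwords, lemmatize
def preprocess(text):
    return " ".join(tok.lemma_.lower()
                    for tok in nlp(text)
                    if not tok.is_punct and not tok.is_stop)

clean = [preprocess(d) for d in docs]

# 2. Extract document features
bow   = CountVectorizer().fit_transform(clean)  # Bag of Words: raw counts
tfidf = TfidfVectorizer().fit_transform(clean)  # TF-IDF: relevance weights
print(bow.toarray())
print(tfidf.toarray())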

Others

  • Regular expressions (Regex): Find patterns.
  • Parse trees: Syntax of a sentence.

Seq2seq

  • Recurrent nets
    • GRU
    • LSTM
  • Tricks
    • Teacher forcing: Feed the decoder the correct previous word instead of its own predicted previous word (at the beginning of training).
    • Attention: Learns weights to perform a weighted average of the word embeddings (sketched below).
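
A minimal PyTorch sketch of that last idea, with made-up tensors and toy sizes: attention scores between a query vector and the word embeddings are softmaxed into weights that produce a weighted average (the context vector):

import torch
import torch.nn.functional as F

seq_len, dim = 5, 8
word_embeddings = torch.randn(seq_len, dim)  # one vector per input word
query           = torch.randn(dim)           # e.g. the current decoder state

scores  = word_embeddings @ query            # one score per word
weights = F.softmax(scores, dim=0)           # attention weights, sum to 1
context = weights @ word_embeddings          # weighted average of the embeddings
print(weights)
print(context.shape)                         # torch.Size([8])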

🤖 Transformers

Transformer input

  1. Tokenizer: Create subword tokens. Methods: BPE...
  2. Embedding: Create vectors for each token. Sum of:
    • Token Embedding
    • Positional Encoding: Information about token order (e.g. a sinusoidal function; sketched after this list).
  3. Dropout
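
A small NumPy sketch of the sinusoidal positional encoding mentioned in step 2, using the standard sin/cos formula from the original Transformer paper (max_len and d_model are made-up sizes):

import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i   = np.arange(d_model)[None, :]                   # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])               # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])               # odd dimensions
    return pe

# Added to the token embeddings so the model knows the token order
pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512)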

Transformer blocks (6, 12, 24,...)

  1. Normalization
  2. Multi-head attention layer (with a left-to-right attention mask)
    • Each attention head uses self attention to process each token input conditioned on the other input tokens.
    • The left-to-right attention mask ensures that each position only attends to the positions that precede it (to its left).
  3. Normalization
  4. Feed forward layers:
    1. Linear H→4H
    2. GeLU activation function
    3. Linear 4H→H
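
A minimal PyTorch sketch of one such block (pre-norm, multi-head self-attention with a left-to-right mask, and the H→4H→H feed-forward with GeLU). Sizes and class names are illustrative assumptions, not the exact implementation of any particular model:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, h=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(h)
        self.attn  = nn.MultiheadAttention(h, heads)
        self.norm2 = nn.LayerNorm(h)
        self.ff    = nn.Sequential(nn.Linear(h, 4 * h),   # H -> 4H
                                   nn.GELU(),
                                   nn.Linear(4 * h, h))   # 4H -> H

    def forward(self, x):                    # x: (seq_len, batch, h)
        seq_len = x.size(0)
        # Left-to-right mask: position i may only attend to positions <= i
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h1 = self.norm1(x)
        x = x + self.attn(h1, h1, h1, attn_mask=mask)[0]  # residual + attention
        x = x + self.ff(self.norm2(x))                    # residual + feed-forward
        return x

block = TransformerBlock()
out = block(torch.randn(10, 2, 512))   # (seq_len=10, batch=2, h=512)
print(out.shape)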

Transformer output

  1. Normalization
  2. Output embedding
  3. Softmax
  4. Label smoothing: Ground truth → 90% for the correct word, and the remaining 10% divided among the other words (see the sketch below).
  • Lowest layers: morphology
  • Middle layers: syntax
  • Highest layers: Task-specific semantics
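
A tiny sketch of that label smoothing step on a made-up vocabulary of 10 words (the 90%/10% split follows the description above):

import torch

vocab_size, correct_id, smoothing = 10, 3, 0.1

# One-hot ground truth: all mass on the correct word
one_hot = torch.zeros(vocab_size)
one_hot[correct_id] = 1.0

# Smoothed target: 90% on the correct word, 10% spread over the other words
smooth = torch.full((vocab_size,), smoothing / (vocab_size - 1))
smooth[correct_id] = 1.0 - smoothing

print(one_hot)
print(smooth)          # still sums to ~1.0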

๐Ÿ“ Scores

| Score | For what? | Description | Interpretation |
|---|---|---|---|
| Perplexity | LM | | The lower the better. |
| GLUE | NLU | An average of different scores. | |
| BLEU | Translation | Compares generated sentences with reference sentences (N-grams). | The higher the better. |

BLEU limitation

"He ate the apple" & "He ate the potato" has the same BLEU score.

👨🏻‍🏫 Transfer Learning

| Step | Task | Data | Who does this? |
|---|---|---|---|
| 1 | [Masked] Language Model pretraining | 📚 A large text corpus (e.g. Wikipedia) | 🏭 Google or Facebook |
| 2 | [Masked] Language Model finetuning | 📗 Your domain-specific text corpus | 💻 You |
| 3 | Your supervised task (classification, etc.) | 📗🏷️ Your labeled domain text | 💻 You |

📦 Python Packages

| Package | Description | Type |
|---|---|---|
| SpaCy | Parse trees, excellent tokenizer (8 languages). | 🔤 |
| Gensim | Semantic analysis, topic modeling and similarity detection. | 🔤 |
| NLTK | Very broad NLP library. Not SotA. | 🔤 |
| SentencePiece | Unsupervised text tokenizer by Google. | 🔤 |
| Fast.ai NLP | ULMFiT fine-tuning. | 🔤 |
| TorchText | PyTorch subpackage for text data. | 🔤 |
| fastText | Word vector representations and sentence classification (157 languages). | 🔤 |
| pytorch-transformers | 8 pretrained PyTorch transformers. | 🔤 |
| spacy-pytorch-transformers | SpaCy + pytorch-transformers. | 🔤 |
| fast-bert | Super easy library for BERT-based models. | 🔤 |
| StanfordNLP | Pretrained models for 53 languages. | 🔤 |
| PyText | NLP modeling framework by Facebook, built on PyTorch. | 🔤 |
| AllenNLP | An open-source NLP research library, built on PyTorch. | 🔤 |
| FARM | Fast & easy NLP transfer learning for the industry. | 🔤 |
| | NLP library designed for reproducible experimentation management. | 🔤 |
| Flair | A very simple framework for state-of-the-art NLP. | 🔤 |
| NLP Architect | SotA NLP deep learning topologies and techniques. | 🔤 |
| Finetune | Scikit-learn style model finetuning for NLP. | 🔤 |

Installation

pip install spacy
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download es_core_news_md

Usage

import spacy

nlp = spacy.load("en_core_web_sm")  # Load English small model
nlp = spacy.load("es_core_news_sm") # Load Spanish small model without Word2Vec
nlp = spacy.load('es_core_news_md') # Load Spanish medium model with Word2Vec


text = nlp("Hola, me llamo Javi")   # Text from string
text = nlp(open("file.txt").read()) # Text from file


spacy.displacy.render(text, style='ent', jupyter=True)  # Display text entities
spacy.displacy.render(text, style='dep', jupyter=True)  # Display word dependencies
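
The same Doc object also exposes the POS tags, lemmas and named entities from the Applications table above; a short sketch using standard spaCy attributes:

for token in text:
    print(token.text, token.pos_, token.lemma_)  # part-of-speech tag and lemma

for ent in text.ents:
    print(ent.text, ent.label_)                  # named entities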

Word2Vec

es_core_news_md has 534k keys, 20k unique vectors (50 dimensions)

coche = nlp("coche")
moto  = nlp("moto")
print(coche.similarity(moto)) # Similarity based on cosine distance

coche[0].vector      # Show vector

🔮 Deep learning models


🤗 means that a pretrained PyTorch implementation is available in the pytorch-transformers package developed by Hugging Face (a loading sketch follows the tables below).

| Model | Creator | Date | Brief description | Data | 🤗 |
|---|---|---|---|---|---|
| 1st Transformer | Google | Jun. 2017 | Encoder & decoder transformer with attention | | |
| ULMFiT | Fast.ai | Jan. 2018 | Regular LSTM | | |
| ELMo | AllenNLP | Feb. 2018 | Bidirectional LSTM | | |
| GPT | OpenAI | Jun. 2018 | Transformer on LM | | ✔ |
| BERT | Google | Oct. 2018 | Transformer on MLM (& NSP) | 16GB | ✔ |
| Transformer-XL | Google/CMU | Jan. 2019 | | | ✔ |
| XLM/mBERT | Facebook | Jan. 2019 | Multilingual LM | | ✔ |
| Transf. ELMo | AllenNLP | Jan. 2019 | | | |
| GPT-2 | OpenAI | Feb. 2019 | Good text generation | | ✔ |
| ERNIE | Baidu research | Apr. 2019 | | | |
| XLNet | Google/CMU | Jun. 2019 | BERT + Transformer-XL | 130GB | ✔ |
| RoBERTa | Facebook | Jul. 2019 | BERT without NSP | 160GB | ✔ |
| MegatronLM | Nvidia | Aug. 2019 | Big models with parallel training | | |
| DistilBERT | Hugging Face | Aug. 2019 | Compressed BERT | 16GB | ✔ |
| MiniBERT | Google | Aug. 2019 | Compressed BERT | | |
| ALBERT | Google | Sep. 2019 | Parameter reduction on BERT | | |

https://huggingface.co/pytorch-transformers/pretrained_models.html

| Model | 2L | 3L | 6L | 12L | 18L | 24L | 36L | 48L | 54L | 72L |
|---|---|---|---|---|---|---|---|---|---|---|
| 1st Transformer | | | yes | | | | | | | |
| ULMFiT | | yes | | | | | | | | |
| ELMo | yes | | | | | | | | | |
| GPT | | | | 110M | | | | | | |
| BERT | | | | 110M | | 340M | | | | |
| Transformer-XL | | | | | 257M | | | | | |
| XLM/mBERT | | | | Yes | | Yes | | | | |
| Transf. ELMo | | | | | | | | | | |
| GPT-2 | | | | 117M | | 345M | 762M | 1542M | | |
| ERNIE | | | | Yes | | | | | | |
| XLNet | | | | 110M | | 340M | | | | |
| RoBERTa | | | | 125M | | 355M | | | | |
| MegatronLM | | | | | | 355M | | | 2500M | 8300M |
| DistilBERT | | | 66M | | | | | | | |
| MiniBERT | | yes | | | | | | | | |
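
A minimal sketch of loading one of the 🤗-marked models with the pytorch-transformers package ('bert-base-uncased' is one of the checkpoints listed on the page above; exact outputs depend on the package version):

import torch
from pytorch_transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model     = BertModel.from_pretrained("bert-base-uncased")
model.eval()

ids = torch.tensor([tokenizer.encode("Hello, my dog is cute")])  # (1, seq_len)
with torch.no_grad():
    last_hidden_state = model(ids)[0]  # contextual vector for every token
print(last_hidden_state.shape)         # (1, seq_len, 768) for bert-base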

References

Fast.ai NLP Videos

  1. What is NLP? ✔
  2. Topic Modeling with SVD & NMF
  3. Topic Modeling & SVD revisited
  4. Sentiment Classification with Naive Bayes
  5. Sentiment Classification with Naive Bayes & Logistic Regression, contd.
  6. Derivation of Naive Bayes & Numerical Stability
  7. Revisiting Naive Bayes, and Regex
  8. Intro to Language Modeling
  9. Transfer learning
  10. ULMFit for non-English Languages
  11. Understanding RNNs
  12. Seq2Seq Translation
  13. Word embeddings quantify 100 years of gender & ethnic stereotypes
  14. Text generation algorithms
  15. Implementing a GRU
  16. Algorithmic Bias
  17. Introduction to the Transformer ✔
  18. The Transformer for language translation ✔
  19. What you need to know about Disinformation
