CS224N-2019

My notes and solutions for CS224N 2019

Course links

Detailed Course Lectures:

04 - Backpropagation:

13 - Contextual Word Embeddings:

  • Meaning of a word in its context
  • Semi-supervised: an unsupervised step learns word embeddings (e.g. word2vec) plus an RNN language model
  • The learned (context-free) embeddings are joined with the LM embeddings (hidden states) to form the final representation, which is trained in a supervised fashion for the end task, e.g. NER
  • 2-layer biLSTM LM
  • Task-specific weighted average of hidden states of all layers
  • Character representation of words to build initial representation. 2048 char n-gram filters and 2 highway layers, 512 dim projection
  • 4096 dim hidden/cell LSTM states with 512 dim projection to next input
  • Residual connections
  • Learns task-specific combinations of the biLM representations (see the scalar-mix sketch after this list)
  • Uses all hidden layers of the LM for the representation (unlike TagLM, which uses only a single LM layer)
  • The pre-trained LM is kept frozen; a new task model on top is fine-tuned.
  • ULMFiT: universal LM fine-tuning for text classification
  • Pre-train the LM on a big general corpus
  • Then fine-tune the pre-trained LM on the specific domain
  • So the same pre-trained model is reused for the end task
  • Finally, the same LM is used with a text-classification objective: the LM's final softmax is frozen and a new classification layer is added on top.
  • Different learning rates for each layer (discriminative fine-tuning)
  • Slanted Triangular Learning Rate schedule (STLR); see the sketch after this list
  • Aim for parallelism; RNNs are sequential.
  • The Annotated Transformer
  • Only attention. Attention everywhere.
  • Multi-head attention: Q, K, V have their dimension reduced by learned W matrices; the per-head attention outputs are then concatenated and piped through a final linear layer (see the sketch after this list)
  • Transformer Block
    • Multihead
    • 2-layer FF NNet
    • Residual connection
    • Layer Normalization
  • LayerNorm
  • Bidirectional
    • GPT is left-to-right
    • ELMo is a bidirectional LM, but the two directions are trained separately (no attention between the models)
    • BERT bidirectional with attention
  • Objective:
    • Predict masked words (see the masking sketch after this list)
    • A second loss function: next-sentence prediction (classification: given SentA and SentB, is SentB the next sentence? Yes/No)
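
The task-specific combination of biLM layers mentioned above can be sketched roughly as below. This is a minimal sketch of an ELMo-style scalar mix; the layer count and dimensions in the example are illustrative assumptions, not ELMo's exact configuration.

```python
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    """Task-specific weighted average of biLM layer outputs (ELMo-style).

    num_layers softmax-normalized scalars s_j plus a global scale gamma,
    both learned with the downstream task while the biLM stays frozen.
    """

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_j before softmax
        self.gamma = nn.Parameter(torch.ones(1))               # task-specific scale

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, dim) tensors, one per biLM layer
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_outputs))
        return self.gamma * mixed


# Example with made-up shapes: 3 biLM layers, batch 2, sequence length 5, dim 1024
layers = [torch.randn(2, 5, 1024) for _ in range(3)]
elmo_repr = ScalarMix(num_layers=3)(layers)
print(elmo_repr.shape)  # torch.Size([2, 5, 1024])
```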
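
A minimal sketch of the slanted triangular schedule, following the STLR formula from the ULMFiT paper; the lr_max, cut_frac and ratio values here are illustrative defaults, not prescribed ones.

```python
import math


def stlr(t: int, total_steps: int, lr_max: float = 0.01,
         cut_frac: float = 0.1, ratio: float = 32.0) -> float:
    """Slanted Triangular Learning Rate (ULMFiT).

    Linear warm-up for the first cut_frac fraction of steps, then a slower
    linear decay; ratio controls how much smaller the final LR is than lr_max.
    """
    cut = math.floor(total_steps * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio


# Example: LR peaks around step 100 of 1000, small at the start and end
for step in (0, 50, 100, 500, 999):
    print(step, round(stlr(step, 1000), 5))
```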
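
The multi-head attention and Transformer-block bullets can be sketched as below. The dimensions (d_model = 512, 8 heads, d_ff = 2048) follow the common "Attention Is All You Need" configuration; the module layout is a simplified sketch, not a reference implementation.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Project Q, K, V down per head, attend, concatenate, then a final linear."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_head = d_model // num_heads
        self.num_heads = num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch, seq_len, _ = q.shape

        def split(x):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(batch, seq_len, -1)
        return self.w_o(out)


class TransformerBlock(nn.Module):
    """Multi-head attention + 2-layer FF net, each with residual + LayerNorm."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        x = self.norm1(x + self.attn(x, x, x, mask))  # residual + LayerNorm
        return self.norm2(x + self.ff(x))             # residual + LayerNorm


x = torch.randn(2, 10, 512)          # (batch, seq_len, d_model)
print(TransformerBlock()(x).shape)   # torch.Size([2, 10, 512])
```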
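
A rough sketch of the masked-word objective: hide a random subset of tokens behind a mask id and compute the loss only at those positions. The 15% rate and the mask_token_id are assumptions for illustration, and BERT's full recipe (keeping some chosen tokens unchanged or replacing them with random tokens) is omitted.

```python
import torch


def mask_tokens(input_ids: torch.Tensor, mask_token_id: int,
                mask_prob: float = 0.15, ignore_index: int = -100):
    """Create (masked_inputs, labels) for a BERT-style masked LM loss.

    Labels equal ignore_index everywhere except the masked positions, so a
    cross-entropy loss is computed only where tokens were hidden.
    """
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob   # choose ~15% of positions
    labels[~mask] = ignore_index                     # loss only on masked tokens
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id              # hide the chosen tokens
    return masked_inputs, labels


ids = torch.randint(5, 1000, (2, 8))                  # fake token ids
inputs, labels = mask_tokens(ids, mask_token_id=103)  # hypothetical [MASK] id
print(inputs)
print(labels)
```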

14 - Transformers and Self-Attention:

  • Attention is crucial for NMT (it acts as a memory)
  • Representation
  • Same path length between any two positions
  • Causality is added by masking future words in the decoder (see the sketch after this list)
  • Residuals carry position information
  • Motifs
  • Translation invariance
  • How far apart things are when you compare them
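
The decoder-side causality above is enforced with a lower-triangular mask applied to the attention scores before the softmax; a minimal sketch:

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


scores = torch.randn(4, 4)                                   # raw attention scores
masked = scores.masked_fill(~causal_mask(4), float("-inf"))  # hide future positions
print(torch.softmax(masked, dim=-1))                         # zero weight on the future
```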
