# Transformer models for language
* Understand and generate human language by learning patterns and relationships b/w words or tokens in text
* Originally designed for machine translation, now the default architecture for many AI tasks
* Some tasks suit an encoder structure, others suit a decoder, and some use both
## How LMs work
* Learn statistical patterns and relationships between words
* Trained to predict the probability of a word given the context of surrounding words, giving them a foundational understanding of language that can generalize to other tasks
* Two main approaches for training a transformer model
    * Masked Language Modeling (MLM): Used by encoders like BERT, randomly masks some tokens in the input and trains the model to predict the original tokens based on surrounding context.
        * Teaches the model bi-directional context: Looking at words both before & after masked word
    * Causal Language Modeling (CLM)
        * Used by decoder models like GPT, predicts the next token based on all previous tokens in a sequence. Can only look backwards, using previous text as context to predict the next token
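
The two training objectives above differ mainly in how inputs and labels are built from the same text. A minimal sketch, assuming whitespace tokens and a hypothetical `mask_prob` (the helper names are illustrative, not from any library):

```python
import random

MASK = "[MASK]"

def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked language modeling: hide random tokens, predict the originals."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # position is ignored by the loss
    return inputs, labels

def clm_examples(tokens):
    """Causal language modeling: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()
masked_inputs, masked_labels = mlm_example(tokens, mask_prob=0.3)
causal_pairs = clm_examples(tokens)
```

In the MLM case the model sees context on both sides of each `[MASK]`. In the CLM case every `(context, target)` pair only contains tokens to the left of the target.
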
## Types of LMs
* Econder only (BERT)
    * Use a bidirectional approach to understand context from both sides, best suited for tasks that need deep understanding of text, such as classification, named entity recognition, and Q&A
    * Understand/Classify Text
        * Assign predefined categories to text
        * Uses WordPiece tokenization and special tokens like [SEP] and [CLS] to differentiate sentences and capture a representation of the whole sequence
        * Token classification assigns a label to each token in a sequence (like NER)
            * BERT can be adapted for this using a linear layer that predicts a label for each token
        * Q&A finds answers within a given context
        * BERT with a span classification head
            * Predicts start and end positions of the answer in the text
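
How the special tokens frame a BERT input can be sketched as follows. This assumes the text is already split into tokens. `build_bert_input` is a hypothetical helper, and the segment ids mark which sentence of a pair each token belongs to:

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Wrap one sentence, or a pair, in [CLS]/[SEP] and assign segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # first sentence, incl. [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # second sentence and its [SEP]
    return tokens, segment_ids

single = build_bert_input(["hello", "world"])
pair = build_bert_input(["how", "are", "you"], ["fine"])
```
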
* Decoder only (GPT, Llama)
    * Process text from left to right and are great for generation
    * Generate text
        * Create coherent and contextually relevant text based on a prompt
        * Uses byte pair encoding (BPE) for tokenization and positional encodings to track token order
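
The core of BPE training is repeatedly merging the most frequent adjacent symbol pair. The sketch below is a simplified character-level version (GPT-2 actually works on bytes), and the word counts are made-up toy data:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters with a frequency.
words = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
         tuple("bun"): 4, tuple("hugs"): 5}
first_pair = most_frequent_pair(words)   # the pair seen most often
words_after_merge = merge_pair(words, first_pair)
```

Running the two functions in a loop until the vocabulary reaches a target size gives the learned merge rules.
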
* Encoder-Decoder (T5, BART)
    * Both approaches - encoder to understand the input and decoder to generate output.
    * Excel at sequence-to-sequence tasks like translation, summarization, and Q&A.
    * Summarization: condense long text into short text while preserving key info
        * BART & T5 are encoder-decoder models
            * BART is pretrained with text infilling corruption, teaching the model to predict missing tokens
    * Translation: converting text from one language to another
        * BART adapts to translation by adding a separate encoder that maps the source language to an input that can be decoded into the target language
* Text Generation
    * Creating coherent & contextually relevant text based on a prompt or input
        * E.g. GPT-2, a decoder-only model pretrained on a large amount of text. It generates convincing (but not always true) text given a prompt, and can complete other NLP tasks like QA despite not being explicitly trained to
        * ![GPT2](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/gpt2_architecture.png)
        * Step 1
            * Uses byte pair encoding (BPE) to tokenize words and generate a token embedding.
            * Positional encodings are added to the token embeddings that indicate the position of each token in a sentence
            * Within each decoder block, GPT-2 uses a masked self-attention layer, meaning it can't attend to future tokens (only tokens on the left)
                * Differs from BERT's [MASK] token because, in masked self-attention, an attention mask is used so that future tokens get an attention weight of 0 (in practice by setting their scores to a very large negative value before the softmax)
        * Step 2
            * Output from the decoder is passed to an LM head, which performs a linear transformation to convert the hidden states into logits (logits are unnormalized real-valued scores from -inf to +inf, one per vocabulary token, which a softmax turns into probabilities between 0 and 1)
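
The masked self-attention from Step 1 can be sketched in plain Python. This is an illustrative toy, assuming raw attention scores are already computed and using `-inf` as the masking value. The same `softmax` is what turns the LM head's logits into probabilities in Step 2:

```python
import math

def softmax(scores):
    """Turn real-valued scores (logits) into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """scores[i][j] is the raw attention score of query token i for key token j."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) are masked out before the softmax.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        weights.append(softmax(masked))
    return weights

uniform_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(uniform_scores)
```

With uniform scores, the first token can only attend to itself, the second splits its attention over two tokens, and the third over all three.
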
* Text Classification 
    * Assigning predefined categories to text documents, such as sentiment analysis, topic classification, or spam detection
    * BERT is an encoder-only model, and the first to effectively implement deep bidirectionality, learning richer representations of text by attending to words on both sides
        * Step 1:
            * BERT uses [wordpiece](https://huggingface.co/docs/transformers/tokenizer_summary#wordpiece) tokenization
                * WordPiece merges the symbol pair that most increases the likelihood of the training data once added to the vocabulary, rather than simply the most frequent pair
            * To tell the difference between a single sentence and a pair of sentences, a special [SEP] token is added to separate them
            * A [CLS] token is added to the beginning of every sequence of text
            * The final output for the [CLS] token is used as the input to the classification head for classification tasks
            * Also adds segment embeddings to denote whether a token belongs to the first or second sentence in a pair of sentences
        * Step 2:
            * BERT is pretrained w/ two objectives
                * 1. Masked language modeling
                    * Some percentage of input tokens are randomly masked, and the model predicts them
                        * Solves the bidirectionality issue, where the model could cheat by seeing all the words and "predicting" the next one
                    * Final hidden states of the masked tokens are passed to a feedforward network with a softmax over the vocabulary to predict the masked word
                * 2. Next-sentence prediction: predicting whether sentence B follows sentence A
                    * Half of the time sentence B is the actual next sentence, the other half it is a random sentence. The prediction is made by a feedforward network with a softmax over two classes (IsNext and NotNext)
        * Step 3: Input embeddings passed through multiple encoder layers to output some final hidden states
        * To use the pretrained model for text classification, add a sequence classification head on top of the base BERT model
            * The sequence classification head is a linear layer that accepts the final hidden states and performs a linear transformation to convert them into logits. Cross-entropy loss is calculated between the logits and the target to find the most likely label
        * **ATTENTION** Look into the [<Text Classification Course>](https://huggingface.co/docs/transformers/tokenizer_summary#wordpiece), as it may help AIC
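
A sequence classification head is small enough to write out by hand. The sketch below assumes a 3-dimensional hidden state and two labels. All numbers (`cls_hidden`, `weight`, `bias`) are made up for illustration, and a real head would learn its weights during fine-tuning:

```python
import math

def linear(hidden, weight, bias):
    """hidden: [d]; weight: [num_labels][d]; bias: [num_labels]. Returns logits."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

cls_hidden = [0.5, -1.0, 2.0]                 # hypothetical final hidden state of [CLS]
weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # hypothetical weights, one row per label
bias = [0.0, 0.0]

logits = linear(cls_hidden, weight, bias)
predicted = max(range(len(logits)), key=logits.__getitem__)  # index of the largest logit
loss = cross_entropy(logits, target=1)
```

The loss is small when the logit of the correct label dominates, and grows as the model puts more weight on the wrong label.
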
* Token Classification
    * Assigning a label to each token in a sequence, such as in NER or part-of-speech tagging
    * Using BERT for token classification like NER:
        * Add a token classification head on top of the base BERT model
            * The token classification head is a linear layer that accepts the final hidden states and performs a linear transformation to convert them into logits
                * Cross-entropy loss is calculated between the logits and each token's label to find the most likely label
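
Unlike sequence classification, the token classification head produces one set of logits per token. A minimal sketch of turning those logits into tags, assuming a hypothetical label set and made-up logits:

```python
LABELS = ["O", "B-PER", "I-PER", "B-LOC"]  # hypothetical NER label set

def predict_token_labels(token_logits):
    """Pick the highest-scoring label independently for every token."""
    predictions = []
    for logits in token_logits:
        best = max(range(len(logits)), key=logits.__getitem__)
        predictions.append(LABELS[best])
    return predictions

tokens = ["Ada", "Lovelace", "visited", "Paris"]
token_logits = [            # one row of logits per token, made up for illustration
    [0.1, 2.3, 0.2, 0.0],
    [0.0, 0.4, 1.9, 0.1],
    [3.0, 0.1, 0.1, 0.2],
    [0.2, 0.1, 0.0, 2.5],
]
tagged = list(zip(tokens, predict_token_labels(token_logits)))
```
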
* Question Answering
    * Using BERT:
        * Add a span classification head on top of the base BERT model.
            * Linear layer that accepts the final hidden states and performs a linear transformation to compute the span start and end logits corresponding to the answer
            * Cross-entropy loss is calculated between the logits and the label positions to find the most likely span of text corresponding to the answer
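
At inference time, the start and end logits have to be combined into one answer span. One common approach, sketched here with made-up logits and a hypothetical `max_len` cap, is to score every valid `(start, end)` pair and keep the best one:

```python
def best_span(start_logits, end_logits, max_len=15):
    """Return the (start, end) pair with the highest combined score, start <= end."""
    best, best_score = None, float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["the", "capital", "of", "france", "is", "paris", "."]
start_logits = [0.1, 0.2, 0.0, 0.3, 0.1, 4.0, 0.0]  # made-up values
end_logits = [0.0, 0.1, 0.0, 0.5, 0.2, 3.5, 0.3]    # made-up values

start, end = best_span(start_logits, end_logits)
answer = tokens[start:end + 1]
```

Restricting the search to `start <= end` matters: taking the two argmaxes independently can produce an end position that comes before the start.
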
* Summarization
    * Condensing a long text into a shorter version while preserving key info and meaning
    * Encoder-decoder models like BART and T5 are designed for the sequence-to-sequence (S2S) pattern of summarization.
    * How BART works
        * Encoder architecture is similar to BERT, accepting token and positional embeddings of the text
        * BART is pretrained by corrupting the input and reconstructing it with the decoder
            * Text infilling corruption: a number of text spans are each replaced with a single mask token, so the model has to learn to predict how many tokens are missing as well as what they are
        * Input embeddings & masked spans are passed through the encoder to output some final hidden states
            * Unlike BERT, BART doesn't add a final feedforward network at the end to predict a word
        * Encoder output is passed to the decoder, which predicts the masked tokens and any uncorrupted tokens from the encoder's output. This gives additional context to help the decoder restore the original text.
        * Decoder output is passed to an LM head, which performs a linear transformation to convert hidden states to logits. Cross-entropy loss is calculated between the logits & the label, which is the token shifted to the right
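
Text infilling and the right shift between decoder input and labels can be illustrated on a toy sentence. In this sketch the spans are chosen by hand (BART samples them randomly), and `<mask>` and `<s>` are placeholder token strings:

```python
MASK = "<mask>"
START = "<s>"  # placeholder start symbol for the decoder input

def infill(tokens, spans):
    """Replace each (start, end) span (end exclusive) with a single mask token."""
    out, i = [], 0
    for start, end in sorted(spans):
        out.extend(tokens[i:start])
        out.append(MASK)   # one mask regardless of how many tokens were removed
        i = end
    out.extend(tokens[i:])
    return out

def shift_right(tokens):
    """Decoder input: the target sequence shifted one position to the right."""
    return [START] + tokens[:-1]

original = ["the", "quick", "brown", "fox", "jumps", "over", "the", "dog"]
corrupted = infill(original, [(1, 3), (5, 6)])   # encoder input
decoder_input = shift_right(original)            # decoder input
labels = original                                # what the decoder should output
```

Because a two-token span and a one-token span both collapse to a single `<mask>`, the model cannot read the span length off the input. At position `i` the decoder sees `decoder_input[: i + 1]` and is trained to output `labels[i]`.
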
* Translation
    * S2S task converting text from one language to another while preserving meaning
    * BART translates by adding a separate randomly initialized encoder to map the source language to an input that can be decoded into the target language. The new encoder's embeddings are passed to the pretrained encoder instead of the original word embeddings.
    * The source encoder is trained by updating the source encoder, positional embeddings, and input embeddings with the cross-entropy loss from the model output.
    * Model parameters are frozen in this first step, and all parameters are trained together in the second step.
## Modalities beyond text
* Speech and audio
    * Automatic speech recognition (ASR)
    * Whisper: encoder-decoder model pretrained on a massive dataset (680k hours) of labeled audio data, enabling zero-shot performance across many languages & tasks.
        * Decoder allows Whisper to map the encoder's learned speech representations to useful outputs like text, without additional fine-tuning.
        * ![Whisper](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/whisper_architecture.png)
        * 2 main components
            * Encoder processes the input audio: raw audio is converted into a log-mel spectrogram, which is passed through a transformer encoder network
            * Decoder takes the encoded audio representation and autoregressively predicts the corresponding text tokens.
                * Standard transformer decoder trained to predict the next token given the previous tokens and encoder output
                * Special tokens are used at the beginning of decoder input to steer model towards specific tasks like transcription, translation, or language identification.
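
The special tokens that steer Whisper's decoder can be pictured as a short prefix placed before the text tokens. The sketch below only builds that prefix as strings. `whisper_prompt` is a hypothetical helper, and the token spellings follow the format used by Whisper's tokenizer and should be treated as illustrative:

```python
def whisper_prompt(language, task, timestamps=False):
    """Build the task-steering prefix for the decoder as a list of token strings."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task}")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt

transcribe_english = whisper_prompt("en", "transcribe")
translate_french = whisper_prompt("fr", "translate", timestamps=True)
```

The same model weights handle both calls. Only the prefix changes, which is how one decoder covers transcription and translation.
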
* Computer Vision - Image classification
    * Vision transformer (ViT)
        * Splits images into non-overlapping patches, each of which is turned into a vector or patch embedding
            * Patch embeddings are generated from a convolutional 2D layer which creates the proper input dimensions
            * A 224x224 image is split into 196 16x16 image patches (14 x 14 = 196). Just like text is tokenized into words, an image is "tokenized" into a sequence of patches
        * A learnable embedding, a special [CLS] token, is added at the beginning of the patch embeddings (like BERT) to capture an overall image representation for classification
            * The [CLS] token doesn't classify anything itself; its final hidden state acts as a summary of the whole image
        * The final hidden state of the [CLS] token is used as input to the attached classification head, and the other outputs are ignored. This helps the model learn to encode a representation of the image
        * Positional embeddings are added because the model doesn't know how the patches are ordered. These are also learnable and have the same size as the patch embeddings.
        * All embeddings are then passed through the transformer encoder, which uses attention mechanisms to relate patches to each other
    * ConvNeXT - modern convolutional neural network (CNN) used for image classification
        * Borrows design choices from Transformers
        * Still uses convolutions rather than attention
        * Combines the strengths of convolutional layers with modern network designs
    * Object detection, segmentation, and depth estimation
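
The ViT patch arithmetic above (a 224x224 image becoming 196 patches of 16x16) can be checked with a small sketch. It assumes a single-channel image stored as nested lists. A real ViT works on 3-channel tensors and produces the patch embeddings with a 2D convolution:

```python
def patchify(image, patch):
    """Split an H x W image into flattened, non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):          # row-major order over the patch grid
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

# Toy image whose pixel value encodes its own position (row * 224 + col).
image = [[r * 224 + c for c in range(224)] for r in range(224)]
patches = patchify(image, 16)
num_patches = len(patches)
values_per_patch = len(patches[0])
```

Each flattened patch plays the role of one "token". The position embeddings are what let the model recover where in the 14x14 grid each patch came from.
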
## LLM Course Documentation