# 3 Main Architectural Variants of Transformer Models
## Encoders
* Outputs a numerical representation (a sequence of numbers) for **each word** in the input. This numerical vector is also called a feature vector or feature tensor
* Each vector is a numerical representation of the word in question
    * Dimension of vector is defined by the architecture of model
* Representations contain the value of the word, but contextualized: each word's representation is influenced bidirectionally by the words around it
    * Words on either side are called context. In "Welcome **to** NYC", the left context of **to** is "Welcome" and the right context is "NYC"
* Does this with the self-attention mechanism
    * Relates different positions (words) in a single sequence to compute a representation of that sequence
    * The resulting representation of a word is affected by the other words in the sequence
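
To make the bidirectional point concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is an illustration under assumptions, not any particular model's implementation: the embedding size `d_model = 4`, the random embeddings, and the random projection matrices `w_q`, `w_k`, `w_v` are all hypothetical stand-ins for learned values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a whole sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len): every word vs. every word
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = ["Welcome", "to", "NYC"]
d_model = 4  # hypothetical; real models fix this in their architecture

# Random stand-ins for learned embeddings and projection matrices.
x = rng.normal(size=(len(tokens), d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)

# Swap the embedding of "NYC" (the right context of "to") for a different one.
x_changed = x.copy()
x_changed[2] = rng.normal(size=d_model)
out_changed, _ = self_attention(x_changed, w_q, w_k, w_v)
```

`out` holds one vector per word, and `out[1]` (the vector for "to") differs from `out_changed[1]` because the word to its right changed. That dependence on both sides is what "contextualized bidirectionally" means here.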
### When to Use an Encoder
* When bidirectional context is necessary to capture meaningful info about a sequence
* Extracting meaningful information
    * Masked Language Modeling (MLM): Guessing a randomly masked word, e.g. `"My <mask> is Aaron"`
    * Sentiment Analysis: Classify the sentiment of a sequence
        * Encoders are good at obtaining an understanding of sequences and the relationship/interdependence between words
* NLU: Natural Language Understanding
## Decoders
* Attention layers can only access the words positioned before a given word in the sentence. These models are called auto-regressive: they predict the next word from the previous ones.
* Word -> Decoder -> a numerical representation for each word, built from the word itself and the words before it (masked self-attention)
    * W/o right context: in "Welcome **to** NYC", the vector for **to** is built from **Welcome -> to** only, excluding NYC
* Causal Language Modeling: Predicting the most probable next word
    * My -> Name
    * My Name -> Is #Auto-regressive, meaning that the model reuses its past output as input in the following steps
    * My Name Is -> Aaron
    * My Name Is Aaron -> . #This is where context size comes in
    * Starting from a single word, we've generated a full sentence. We can continue up to the max context size.
        * If the max context size is 1024, the model can keep generating until the prompt plus the output reach 1024 tokens, while still attending to the first token.
* Models like Gemma, Llama, and DeepSeek belong to this family
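
The generation loop described above can be sketched in a few lines. This is a toy under loud assumptions: `NEXT_TOKEN`, `predict_next`, and `generate` are hypothetical names, and the lookup table only looks at the last token, whereas a real decoder scores every vocabulary entry using the whole context.

```python
# Hypothetical toy "model": maps the last token to the most probable next token.
NEXT_TOKEN = {"My": "name", "name": "is", "is": "Aaron", "Aaron": ".", ".": "<eos>"}

def predict_next(context):
    """Stand-in for a decoder forward pass followed by picking the top token."""
    return NEXT_TOKEN.get(context[-1], "<eos>")

def generate(prompt, max_context=8):
    """Greedy auto-regressive generation, capped at max_context tokens in total."""
    tokens = list(prompt)
    while len(tokens) < max_context:
        next_token = predict_next(tokens)
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # the model's output becomes part of its next input
    return tokens

sentence = generate(["My"])                    # runs until the end-of-sequence token
truncated = generate(["My"], max_context=3)    # stops early once the context is full
```

With the default limit the loop produces the full sentence from the single word "My". With `max_context=3` it stops after three tokens, which is the same cap a 1024-token context imposes at a larger scale.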
### Modern LLMs
* Most use a **decoder-only architecture**
* Trained in 2 phases
    * Pretraining: Model learns to predict next token on vast amounts of text data
    * Instruction tuning: Model is fine-tuned to follow instructions & generate helpful responses
* Models can understand and generate human-like text across a wide range of topics and tasks, allowing them to:

| Capability         | Description                              | Example                                   |
|--------------------|------------------------------------------|-------------------------------------------|
| Text Gen           | Create coherent and context-relevant text| Write essays, stories, emails             |
| Summarize          | Condense long documents into short ones  | Create executive summary of report        |
| Translate          | Convert text between languages           | "You get it"                              |
| Question Answering | Provide answers to factual questions     | "What is the capital of France?"          |
| Code Generation    | Write or complete code snippets          | Create an app                             |
| Reasoning          | Work through problems step by step       | Solve logical puzzles                     |
| Few-Shot Learning  | Learn from a few examples in the prompt  | Classify text after seeing 2–3 examples   |
## Sequence-to-Sequence Models
* Use both parts of the architecture: an encoder and a decoder
* Attention layers of the encoder can access all words in the initial sentence, whereas attention layers of the decoder can only access the words positioned before a given word in the input
* Pretraining
    * Often involves reconstructing a sentence whose input has been corrupted (e.g., by masking random words)
* Practical applications: machine translation, text summarization, data-to-text generation, grammar correction, question answering
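
The difference in what each side can see can be written as two attention masks. The sketch below is illustrative only: the helper `visible` and the variable names are hypothetical, and the same three-word sequence is reused for both masks to keep the comparison simple (in a real sequence-to-sequence model the causal mask applies to the decoder's own sequence).

```python
import numpy as np

tokens = ["Welcome", "to", "NYC"]
n = len(tokens)

# Encoder: every position may attend to every other position.
encoder_mask = np.ones((n, n), dtype=bool)

# Decoder: position i may attend only to positions 0..i (lower triangle).
decoder_mask = np.tril(np.ones((n, n), dtype=bool))

def visible(mask, i):
    """Return the tokens that position i is allowed to attend to."""
    return [tokens[j] for j in range(n) if mask[i, j]]

encoder_view_of_to = visible(encoder_mask, 1)
decoder_view_of_to = visible(decoder_mask, 1)
```

For the word "to", the encoder mask exposes all three words, while the decoder mask exposes only "Welcome" and "to".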
# Attention Mechanisms
* Most models use full attention, meaning the attention matrix is square (sequence length x sequence length); its cost grows quadratically with sequence length, which becomes a bottleneck for long texts
* Longformer and Reformer speed up training by using a sparse version of the attention matrix
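
To get a feel for the savings, the sketch below counts how many query-key pairs a full attention matrix scores versus a simple sliding-window pattern. It is a simplification under assumptions: the function names and the values `n = 1024` and `window = 64` are hypothetical, Longformer combines local windows like this with a few globally attending positions, and Reformer uses a different approach based on locality-sensitive hashing.

```python
import numpy as np

def full_mask(n):
    """Full attention: every position attends to every position (n x n entries)."""
    return np.ones((n, n), dtype=bool)

def sliding_window_mask(n, window):
    """Each position attends only to positions within `window` steps on either side."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n, window = 1024, 64  # hypothetical sequence length and window half-width

full_entries = int(full_mask(n).sum())
sparse_entries = int(sliding_window_mask(n, window).sum())
ratio = sparse_entries / full_entries
```

The full matrix grows with `n * n`, while the banded one grows roughly with `n * (2 * window + 1)`, so the gap widens as sequences get longer.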
