# What is an LLM capable of?
Basically, they are *"... word generators"*. That lets them handle tasks such as question answering,
summarization, and entity extraction, and they can be augmented with connected APIs.


LLMs are distinguished by their *"parameter size"*, i.e. the number of parameters they contain (e.g. 10M parameters or 2B parameters).

- Larger LLMs: more conceptual understanding --> best for complicated and multipurpose usage
- Smaller LLMs: less conceptual understanding but faster --> best for focused tasks

Before transformers (the architecture used in LLMs and ... Gen AI), we used RNNs.

## What did we lack?
- Memory: longer sentences could not be analyzed
- Words with multiple meanings, whose sense depends on the rest of the sentence
- Sentences that can be read in two or more ways

*We were able to solve these problems using transformers.*

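*Transformers* use self-attention: each word (token) in a sentence, or even in a whole text, is represented by a vector called the *token embedding vector*, and self-attention lets every token weigh every other token when building its representation.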

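*The maximum input size of an LLM (its context window) is the number of tokens, and therefore embedding vectors, it can attend to at once. It is fixed by the architecture and training setup rather than by the dimension of a single embedding vector.*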
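
The self-attention idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not how a real transformer is implemented: the three tokens and their 2-dimensional embeddings are made up, and the queries, keys, and values are assumed to be the embeddings themselves, whereas a trained model first multiplies them by learned weight matrices.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention without learned projections.

    Assumption: queries, keys and values are the embeddings themselves;
    a real transformer derives them through learned weight matrices.
    """
    dim = len(embeddings[0])
    weights, outputs = [], []
    for query in embeddings:
        # Score this token against every token, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        row = softmax(scores)
        weights.append(row)
        # New representation: attention-weighted mix of all token vectors.
        outputs.append([
            sum(w * vec[i] for w, vec in zip(row, embeddings))
            for i in range(dim)
        ])
    return weights, outputs


# Hypothetical 2-dimensional embeddings for a three-token input.
tokens = ["the", "river", "bank"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.9]]

weights, outputs = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the input. With these made-up vectors, "bank" gives more weight to "river" than to "the", because its embedding points in a similar direction.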

### Transformer Architecture Schematic

![Transformer Architecture](https://miro.medium.com/v2/resize:fit:612/1*R5WJ_nO0gKy2yk3HHGorhg.jpeg)

Let's break down some of its parts:
- Every input gets tokenized: each word (or piece of a word) is represented by a unique ID, based on the dictionary in use.
    The same tokenizer should be used for both training and generation.
- Each token ID is passed to a trainable embedding layer, which maps it to a high-dimensional vector. These vectors come to represent the connections between words, such as similarity of meaning.
- Add positional encoding to preserve the word order.
- Feed the resulting vectors (embedding + positional encoding) into a self-attention layer, which learns to capture the contextual relations between the input words.
    The self-attention weights, learned during training, help the model capture the relation (*and the importance*, hence "attention") between each word and every
    other word in the input.
    Most transformer architectures use *multi-headed self-attention*: several self-attention layers run in parallel,
    each independently learning the importance of each encoding (embedding + positional) in the sentence/text. The number of heads varies by model, typically somewhere between 12 and 100.
    In practice, each attention head may learn a different aspect of language (people, names, rhyme, etc.).

- After the attention weights have been applied to the data, the result is fed to a *feed-forward* network. Its output is a vector of logits: one raw score for
    each token in the tokenization dictionary, where a higher logit corresponds to a higher probability.
- The logits are then fed to a softmax layer, which outputs a vector as long as the dictionary, containing a probability score for each token.
    The token with the highest score is, in most cases, the next word used to start or continue the word generation process.