# How are tasks solved in NLP - a short history

## Motivation: the many tasks of NLP

<img src="http://drive.google.com/uc?export=view&id=1kZ0f5h26BkTVNZBCcO1WNqE1qi10od_Q" width=55%>


[source](https://medium.com/nlplanet/two-minutes-nlp-33-important-nlp-tasks-explained-31e2caad2b1b)

NLP is a wide field: __multiple tasks__ (supervised or unsupervised) can be attempted on the __same text__ and - importantly - on its __representation__ in a form that is suitable for machine learning. We have seen ample examples of this so far.

## Historical context - A general shift in modeling approaches

## Broader context

<img src="http://drive.google.com/uc?export=view&id=1rzWcpZvWDfCIltJGh1P6tN4uaW_eFfoj" width=80%>

<img src="http://drive.google.com/uc?export=view&id=1lJC66UW0YvaTNM8ZRqs86ntAQid6Y8Ar" width=80%>


## "Paradigms" - Stages of development in NLP

If we focus our attention on the development of NLP (and completely disregard the rule based or "ontological" approaches of ["Good old-fashioned AI"](https://en.wikipedia.org/wiki/GOFAI)), we can distinguish clear paradigms of model training and problem solving in NLP that followed each other in rapid (actually accelerating) succession.

### Non-learned feature extraction + custom model

<img src="http://drive.google.com/uc?export=view&id=1NtZy--0eGGg6S2h6D8hFndv3ko8yPiYb" width=35%>


[source](https://www.semanticscholar.org/paper/A-novel-text-mining-approach-based-on-TF-IDF-and-Dadgar-Araghi/ded138171f35309c0af9ecc98eeffb90f0e8f993)

(As late as 2016, surprisingly!)

In the now "classical" paradigm, features were extracted from textual documents by an elaborate pipeline of hand-crafted transformations, which represented texts in terms of word frequency properties (such as TF-IDF). This representation was then used to train a specific classifier for a given task in a supervised manner.
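To make the two stages concrete, here is a minimal sketch in plain Python: a fixed, non-learned TF-IDF feature extractor followed by a separate, very simple classifier (nearest class centroid by cosine similarity). The toy corpus, the smoothed idf formula and the choice of classifier are assumptions made for illustration, not taken from the cited paper.

```python
import math
from collections import Counter

# Toy labelled corpus (hypothetical data, for illustration only).
train = [
    ("the match ended with a late goal", "sport"),
    ("the team won the cup final", "sport"),
    ("the bank raised interest rates", "finance"),
    ("markets fell as rates went up", "finance"),
]

def tokenize(text):
    return text.lower().split()

# Stage 1: non-learned feature extraction (document frequencies are just counted).
docs = [tokenize(text) for text, _ in train]
df = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

def tfidf(tokens):
    """Sparse {term: tf * idf} vector with a smoothed idf."""
    tf = Counter(tokens)
    return {
        term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }

def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Stage 2: a task specific model on top (here: one centroid per class).
centroids = {}
for tokens, (_, label) in zip(docs, train):
    centroid = centroids.setdefault(label, Counter())
    for term, weight in tfidf(tokens).items():
        centroid[term] += weight

def predict(text):
    vec = tfidf(tokenize(text))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

print(predict("interest rates and markets"))
```

Note that nothing in the feature extractor is trained for the task: only the small model on top sees the labels.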

### Learned features + custom models "on top"

<img src="http://drive.google.com/uc?export=view&id=1V9w8DwNLQ0w_uLvYCnjl7d9TJjNAzsak" width=20%>

[source](https://www.semanticscholar.org/paper/Word2Vec-model-for-sentiment-analysis-of-product-in-Fauzi/ba511aa8390e4a48e4a8273d9e24c35bcfe4cd96)

<img src="http://drive.google.com/uc?export=view&id=17h6urUe2IySvvQd-DyoValRYjy5dwB_k" width=55%>

<img src="http://drive.google.com/uc?export=view&id=1X2HeILaqtd-dAxrhcRYtlTtMrliiPJPx" width=70%>

[Distributed Representations of Sentences and Documents](https://arxiv.org/abs/1405.4053)

The big change happened with the introduction of representations based on unsupervised learning, namely [word2vec](https://en.wikipedia.org/wiki/Word2vec) in 2013-14. The main paradigm change was that an unsupervised predictive training task was found to be extremely efficient in producing high quality word representations. These could be combined with simple (like averaging) or more complicated (like [SIF](https://openreview.net/pdf?id=SyK00v5xx)) approaches into representations for texts, and custom models could then be trained using them as inputs.
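The simplest of these combinations, averaging, can be sketched in a few lines. The three-dimensional vectors below are made-up values standing in for real word2vec output, and `average_embedding` is a hypothetical helper name.

```python
# Toy "word vectors" (hypothetical values, not real word2vec output).
word_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.9, 0.1, 0.0],
    "movie": [0.0, 0.9, 0.3],
}

def average_embedding(text, vectors, dim=3):
    """Represent a text as the mean of the vectors of its known words."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        return [0.0] * dim  # no known word: fall back to the zero vector
    return [sum(column) / len(found) for column in zip(*found)]

print(average_embedding("good movie", word_vectors))
```

The resulting fixed-length vector can be fed to any downstream classifier, regardless of the length of the text.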

### End-to-end learned models


The next level of complexity - and with it performance - arrived by using pre-trained "embeddings" (mainly word2vec) as part of an end-to-end pipeline, where every component was neural, mainly in the form of LSTMs or their more complex "seq2seq" variants.

<img src="http://drive.google.com/uc?export=view&id=1cftnuhcP9ZhwsGRo75qN5x2kNHsaVvmw" width=55%>

<img src="http://drive.google.com/uc?export=view&id=1qXVsSWsny_UbBxbwtvrpv51dnNbwPG6x" width=55%>

[source](https://core.ac.uk/download/pdf/226439962.pdf)

Here, the pre-trained word2vec layer only provided a useful input transformation; the heavy lifting was still done by the task-specifically trained architecture on top of it.

### Pretrain and finetune paradigm

<img src="http://drive.google.com/uc?export=view&id=1kNI6DrMJNEMV0rZogmClIQ7187OhWX9w" width=35%>


[Source](https://arxiv.org/abs/2109.01652)

Though this paradigm reached fame with the advent of [transformer models](https://arxiv.org/abs/1706.03762), especially [BERT](https://arxiv.org/abs/1810.04805), the big leap in performance in the pretrain-finetune paradigm may be attributable more to the paper about [ULMFiT - Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146).

The paradigm centers on __pretrained deep models__ that get applied to task specific settings by __finetuning__.


The breakthrough came as the __"gradual unfreezing"__ method prevented the destruction of the knowledge learned by the neural model during the __"pre-training"__ process, which made task specific __fine-tuning__ very successful.

<a href="https://humboldt-wi.github.io/blog/img/seminar/group4_ULMFiT/Figure_21.png"><img src="https://drive.google.com/uc?export=view&id=1KxNr1UqL_1q7FTNjLv8SXivkuFKB343H" width=65%></a>

#### Cut and/or add a layer

<img src="http://drive.google.com/uc?export=view&id=1SW3xfa1FYQ1TGXaorJHmcx-wAJpJPj4g" width=55%>

In this paradigm the main method for task customization was to either replace the last "task" layer of the network, or add a "task specific" layer as a new output.

#### "Sidenote": Transformers


With the paper [Attention is all you need](https://arxiv.org/abs/1706.03762) a new, extremely important architecture, the "transformer" emerged. 

<img src="http://drive.google.com/uc?export=view&id=1SoM5ha5ZikSd3jRxdpZPCNB4xidhh8Tr" width=65%>

The main advantages of this architecture were:
- Parallelizability on GPU (circumventing the LSTM's bottlenecks)
- Excellent utilization of information in long context windows (LSTM's effective window size was always debated)

This allowed for the __rapid scaling up of model depth__, and combined with the pretrain and finetune paradigm established the dominance of deep pretrained models.

##### "More sidenote": quadratic complexity

<img src="http://drive.google.com/uc?export=view&id=1Uafn-mvfStrmhcpHuAyCwAFYoLnHKyi-" width=65%>

Though some serious problems exist in transformers, namely [quadratic computation complexity](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15839671.pdf), there are methods that attempt to scale transformers to extremely long context sizes, like [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860), [Big Bird: Transformers for Longer Sequences](https://arxiv.org/pdf/2007.14062.pdf) or more recently [Scaling Transformer to 1M tokens and beyond with RMT](https://arxiv.org/abs/2304.11062).
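Where the quadratic cost comes from can be seen in a minimal sketch of single-head scaled dot-product attention: every position is compared with every other position, so the score matrix has n x n entries. This is a plain Python illustration with toy one-hot inputs, not an efficient implementation.

```python
import math

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of row vectors."""
    d = len(K[0])
    # scores is an n x n matrix: one entry per (query position, key position) pair.
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        for q in Q
    ]
    out = []
    for row in scores:
        m = max(row)                              # subtract max for stability
        weights = [math.exp(s - m) for s in row]
        total = sum(weights)
        weights = [w / total for w in weights]    # softmax over key positions
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out, scores

for n in (4, 8, 16):
    X = [[float(i == j) for j in range(n)] for i in range(n)]  # toy one-hot inputs
    _, scores = attention(X, X, X)
    print(n, sum(len(row) for row in scores))  # number of score entries: n * n
```

Doubling the sequence length quadruples the number of score entries, which is exactly what the long-context variants listed above try to avoid.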

Suffice it to say that a kind of "cottage industry" sprang up around transformer versions.

For a nice genealogy, see:

<img src="http://drive.google.com/uc?export=view&id=1TnQFZVz5zdOiPF6wN1hzyDYenjc-nHbQ" width=65%>

Source: [The Practical Guides for Large Language Models ](https://github.com/Mooler0410/LLMsPracticalGuide)

#### "Adapter" finetuning

As model sizes became prohibitively large to train on consumer grade hardware, the idea came up - in a sense a kind of "reaching back to the old style" - to __"freeze" the pretrained model__, and only finetune a shallow network.

In the beginning this shallow network was a new, full "task specific" output layer, but then more efficient methods like "PEFT" or [Parameter-Efficient Transfer Learning for NLP](https://arxiv.org/pdf/1902.00751.pdf) came up with __adapter layers__: small, task-specifically finetunable layers that we can inject into the model "cheaply" after the fact.
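A bottleneck adapter of this kind can be sketched as a small residual module: project down, apply a nonlinearity, project up, and add the result to the input. The sizes below are hypothetical, biases are omitted, and the exact zero initialization of the up-projection is a simplification of the "near-identity" initialization, chosen here so that the injected layer initially leaves the frozen model's behavior unchanged.

```python
import random

random.seed(0)
d_model, d_bottleneck = 8, 2   # hypothetical sizes

# Down-projection: small random weights; up-projection: zeros (assumption).
W_down = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(d_bottleneck)]
W_up = [[0.0] * d_bottleneck for _ in range(d_model)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def adapter(h):
    """Bottleneck adapter with a residual connection: h + W_up * relu(W_down * h)."""
    z = [max(0.0, v) for v in matvec(W_down, h)]
    return [hi + ui for hi, ui in zip(h, matvec(W_up, z))]

n_adapter = 2 * d_model * d_bottleneck   # trainable weights in the adapter
n_dense = d_model * d_model              # weights of one full d_model x d_model layer
print(n_adapter, n_dense)
```

Only `W_down` and `W_up` would be trained; the surrounding pretrained layers stay frozen.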

<img src="http://drive.google.com/uc?export=view&id=1W1XqUMl9Zud7pRldI6qPKkI2cfKvQ_kG" width=65%>


<img src="http://drive.google.com/uc?export=view&id=1qWN9DVYkhyQ1c3lrbiXqyizaw3kMeg4_" width=35%>

A more modern approach to this is [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)

More detailed explanation can be found [here](https://xiaosean5408.medium.com/fine-tuning-llms-made-easy-with-lora-and-generative-ai-stable-diffusion-lora-39ff27480fda).
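The core idea of LoRA can be sketched as follows: the pretrained weight matrix `W` stays frozen, and only a low-rank update `B @ A`, scaled by `alpha / r`, is trained. The dimensions and the values of `r` and `alpha` below are hypothetical; `A` is initialized randomly and `B` with zeros, so the adapted layer starts out identical to the frozen one.

```python
import random

random.seed(0)
d_out, d_in, r, alpha = 6, 6, 2, 4   # hypothetical sizes and scaling

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, d_out x r

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B would receive gradients."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

n_trainable = r * (d_in + d_out)   # parameters in A and B
n_frozen = d_in * d_out            # parameters in W
print(n_trainable, n_frozen)
```

With realistic layer widths (thousands of dimensions) and a small rank, the trainable part is a tiny fraction of the frozen matrix.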

### Zero shot learning, "prompting"

<img src="http://drive.google.com/uc?export=view&id=1YoWhTN3bW9mrpjTH7NHm0S_c69M6__Da" width=35%>


[Source](https://arxiv.org/abs/2109.01652)

Quite quickly after the emergence of large pre-trained models, it was realized that the models were capable of __non-trivial performance on unseen tasks without any training__. Subsequently, many papers (like [this](https://arxiv.org/pdf/1909.00161.pdf)) started to investigate the phenomenon, which revived the fields of [zero shot learning](https://en.wikipedia.org/wiki/Zero-shot_learning) and [few-shot learning](https://en.wikipedia.org/wiki/Few-shot_learning), which had been rather niche disciplines up to that point.

In contrast with the previous approaches, where some examples were used to modify weights, that is to train the networks, __in the "zero shot" learning paradigm, weights don't change__.

In this paradigm the models have already stored enough __general knowledge__ in their weights (which are completely frozen), so that they can __output texts that can be taken as answers__.

<img src="http://drive.google.com/uc?export=view&id=1KZtdSCh4uZ-45ygY4S7ctSHOMwcaar6y" width=65%>


In their paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) the authors establish the idea that __the main form of interaction with a model is via its textual inputs and outputs__, which can represent a wide variety of tasks.

Since the performance of certain large models in the zero shot context beat the finetuned state-of-the-art, the paradigm of __in context learning__, a form of zero shot learning, became dominant.

<img src="http://drive.google.com/uc?export=view&id=1huLGkkIx4wwzaqGP7IOPh6wW_OwkGOrN" width=55%>


Source: "GPT3" - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)

The main observation later on will be that __the instructions in the context__, that is __the "prompts"__ will have a very strong influence on the task specific performance!

It is also important to note, that __prompt based "learning" has to be strongly distinguished from finetuning__!!! In "in context learning" the model weights are completely frozen, so it is kind of a misnomer (albeit extremely popular) to call this paradigm learning at all.

Prompting is a way to get a model to execute a task. Learning in the gradient sense is not involved.



#### Instruction finetuning

A further observation was that if large ["foundation models"](https://fsi.stanford.edu/publication/opportunities-and-risks-foundation-models) are explicitly finetuned in a next phase, after their initial unsupervised pre-training, __to follow instructions__ (like in the line of ["InstructGPT"](https://arxiv.org/abs/2203.02155)), their performance in solving zero shot tasks in the "in context learning" sense (so by responding to prompts) becomes even more impressive.

<img src="http://drive.google.com/uc?export=view&id=1HCOV0jyp265sSOFsiKmQjypMWtPU-w5U" width=35%>

Source: [Finetuned Language Models Are Zero-Shot Learners](https://arxiv.org/abs/2109.01652)

<img src="http://drive.google.com/uc?export=view&id=16ApApmggJQZ9xzq4DdcCx1ht8o67Gc25" width=65%>


Source: [Finetuned Language Models Are Zero-Shot Learners](https://arxiv.org/abs/2109.01652)

Thus, the era of "instruction finetuned pre-trained Large Language Models" was born.

In [None]:
/Flo part/

#### Reinforcement learning from human feedback (RLHF)

An additional method, building on the results of instruction finetuning is the incorporation of (sparse) human feedback and preferences via reinforcement learning.



[Learning to summarize from human feedback](https://arxiv.org/abs/2009.01325)

##### Motivation
- Large language models (LLMs) are excellent at producing coherent and consistent text
- However, the text is often not exactly what a human expects, e.g. as an answer to a question
- Idea: fine tune the model for specific tasks, such as text summarization, using human feedback
- Thus this technique requires a pre-trained large language model!



##### Particular (original) setting - **text summarization**

<img src="http://drive.google.com/uc?export=view&id=1bJEFGUSo5duA341srpxbuOmdpC9Cm7Sd" width=600 heigth=600>




##### Approach 


<img src="http://drive.google.com/uc?export=view&id=1QGN2SyvnGZPz3F3WPvkd3IwsklvDjDJZ" width=1000 heigth=1000>



**RL part in more detail**
- [PPO algorithm](https://openai.com/research/openai-baselines-ppo)
- A specific version of the actor-critic method
- Each time step corresponds to one BPE (Byte Pair Encoding) token

- Full reward $R$ can be written as:
$$
R(x, y)=r_\theta(x, y)-\beta \log \left[\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right]
$$

- The second term penalizes the KL divergence between the learned RL policy $\pi_\phi^{\mathrm{RL}}$ with parameters $\phi$ and the original supervised model $\pi^{\mathrm{SFT}}$
- This additional penalty is very important, as the RL model might otherwise find nonsensical outputs that get a high reward from the reward model $r_\theta$ (in a sense adversarial examples)
- It keeps the outputs from drifting too far from what the supervised model produces
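The reward formula above can be evaluated directly once the reward model score and the per-token log-probabilities under both policies are available. In this minimal sketch the function name, the value of `beta` and the numbers are hypothetical; the sequence log-probability is taken as the sum of the per-token log-probabilities.

```python
def kl_penalized_reward(reward_model_score, logprobs_rl, logprobs_sft, beta=0.05):
    """R(x, y) = r_theta(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x)).

    logprobs_rl / logprobs_sft: per-token log-probabilities of the same
    sampled summary y under the RL policy and under the supervised model.
    """
    log_ratio = sum(logprobs_rl) - sum(logprobs_sft)
    return reward_model_score - beta * log_ratio

# The RL policy finds this summary much more likely than the supervised
# model does, so part of the reward model score is taken away again.
print(kl_penalized_reward(2.0, [-0.1, -0.2], [-0.5, -0.9], beta=0.05))
```

If both policies assign the same probability to the summary, the penalty vanishes and the full reward equals the reward model score.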


##### Transfer Learning Task
- The same trained model is applied to the summarization of news articles
- Significantly outperforms the supervised baseline models

##### Performance comparison


<img src="http://drive.google.com/uc?export=view&id=13ZkkZOYAE5zy0PzJr6HlLVU7cvC4xCaa" width=80%>




**Note that the paper does not tell us a lot of important details**
- What parts of the model are actually retrained
- How long the retraining is done for
- How exactly the original language model is used (most likely it provides an input for the RL algorithm)

In [None]:
/End of Flo part/

#### Sidenote: on the openness of AI

As the instruction finetuning results show, good quality (instruction) data is still important, so the publication of finetuning procedures and datasets would be key. In this regard, the leading research group at OpenAI is not exactly transparent, and protects its business interests.

Thus, initiatives such as ["The Pile"](https://pile.eleuther.ai/) as a pretraining dataset, [OpenAssistant's dataset](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/data/datasets.md) for instruction finetuning, open-for-research models like [LLaMA]() or fully open source models like [Pythia](https://github.com/EleutherAI/pythia) are of vital importance.

# Works, but why?

Since the "zero shot" paradigm represents quite a strong change from what used to be the standard for machine learning, some more investigation of the mechanisms behind it can be useful.

## What's in the box? - Investigation of structural properties of transformers


### What behavior do the (pre- and instruction) trainings emphasize?

It is important to understand that already the pre-training of language models - since it is done on an extremely wide variety of documents - emphasizes contextuality: a model cannot hope to solve its task sufficiently well if it does not adapt its generation to the local context.

The explicit instruction finetuning approaches (such as [this](https://arxiv.org/abs/2203.02155)) give even more emphasis to context, since in dialogues, for example, people explicitly prefer in-context consistent behavior (like "remembering" past conversation steps) from the model, and thus give an __extremely strong signal to enhance in context generation__.

We can see this as a strong push towards enhancing the abilities of models in __in context learning__, that is, their __few and zero shot performance__.

### In context learning and "induction heads"

Recently, an in-depth analysis identified (amongst others) one important mechanism by which modern transformer-based Large Language Models learn to solve their tasks. The researchers gave the name __"induction heads"__ to the attention patterns that were most frequent.

<img src="http://drive.google.com/uc?export=view&id=1InObwz_wR9ry32lDLPHeS72eySeoJYhj" width=65%>

The basic induction head learns a bigram-like "if this, then that" association between a token and the one that followed it: when the current token has already occurred earlier in the context, the head attends to the token that came right after that earlier occurrence and raises its probability. This enhances the ability of the model to utilize bigram re-occurrence __in context__ to give good baseline predictions.
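The input-output behavior of such a head (not its internal attention mechanics) can be imitated by a few lines of Python: look for the most recent earlier occurrence of the current token and copy whatever followed it. The function name and the example sentence are made up for illustration.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token; None if there is none."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan the context backwards
        if tokens[i] == current:
            return tokens[i + 1]
    return None

context = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
print(induction_predict(context))
```

The prediction depends only on the context, not on any stored fact about the tokens, which is why this pattern is so closely tied to in context learning.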

Over and beyond these simple heads, more sophisticated "induction" mechanisms are learned by the model (given it has at least 2 hidden attention layers - in the broad spirit of the [Cybenko theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem)).

<img src="http://drive.google.com/uc?export=view&id=1CJu1izJWYjKArukjYHzw1Hw9KGYhRKlk" width=65%>

It is to be emphasized that __induction heads are geared towards in context learning__, which helps explain the strong instruction following performance of modern (especially instruction finetuned) LLMs. In a sense, one can argue that this is a case of adding __human bias__ to the models - albeit in a __positive sense__!

For a more in-depth introduction see:

[In-context Learning and Induction Heads](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html)

and

In [None]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/pC4zRb_5noQ" ></iframe>


(For other curious phenomena, like the "SolidGoldMagikarp" phenomenon of __"glitch tokens"__, and for a better in-depth understanding of the trained vector space, see [here](https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation#A_possible__partial_explanation).)

## Too big to fail? - Model sizes and scaling laws

With the advent of transformers, "scaling" in terms of parameters became possible. With the GPT line of work, and the rise of the "zero shot paradigm", it became a widespread experience that larger models offer better performance.

Thus, __scaling to larger and larger sizes became the imperative__, and a kind of "size race" ensued.

<img src="http://drive.google.com/uc?export=view&id=1yB8MgigkQgyGY-BdwooGp84auRwRW9ID" width=65%>

[source](https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/)

<img src="http://drive.google.com/uc?export=view&id=14nvvS_xdSpzFTFQKJq4hbwZYxwqS1r0b" width=65%>

[source](https://lifearchitect.ai/models/)

The assumption, that the size of the models is crucial for their performance was first formalized in the paper [Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361) 

<img src="http://drive.google.com/uc?export=view&id=1K4cRzFEy5F4vjzp1RSo3iAdVn7L1cIf8" width=75%>

The paper - and the substantial amount of followup work - established the notion that training larger and larger models on increasingly huge datasets predictably increases performance.

Later on, though, the work [Training Compute-Optimal Large Language Models](https://arxiv.org/pdf/2203.15556.pdf) found that training dataset size may have a stronger influence on the final result than sheer model size. ([Later work](https://github.com/EleutherAI/pythia) also seems to corroborate this finding.)

<img src="http://drive.google.com/uc?export=view&id=1fzH_tmjFdGGYIOZJkdZ4Y5vdDWYEgh0W" width=45%>

Source: [Chinchilla's wild implications](https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications)
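The trade-off between model size and dataset size can be made tangible with a back-of-the-envelope calculation. The sketch below rests on two commonly quoted rules of thumb, both of which are assumptions here rather than exact results: training compute is approximately 6 FLOPs per parameter per token, and a compute-optimal model sees roughly 20 training tokens per parameter.

```python
import math

TOKENS_PER_PARAM = 20        # rule of thumb (assumption)
FLOPS_PER_PARAM_TOKEN = 6    # approximation C = 6 * N * D (assumption)

def compute_optimal(compute_flops):
    """Split a FLOP budget C into parameters N and tokens D,
    assuming D = 20 * N and C = 6 * N * D."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# Budget of a 70 billion parameter model trained on 1.4 trillion tokens.
budget = 6 * 70e9 * 1.4e12
n, d = compute_optimal(budget)
print(f"{n:.3g} parameters, {d:.3g} tokens")
```

Under these assumptions, parameters and tokens should grow together: a ten times larger budget calls for roughly a three times larger model trained on roughly three times more data.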


### Do we REALLY need that big of a scale?

In spite of the popularity of the notion that we need huge scales in the number of parameters for complex behavior to "emerge" in Large Language Models, some evidence is being collected to the contrary:

The research presented recently in the paper [Are Emergent Abilities of Large Language Models a Mirage?](https://arxiv.org/abs/2304.15004) points towards a different explanation, namely that there is no "sudden appearance" of more sophisticated abilities: as we scale models, they just get gradually better and better. We merely measure them with performance metrics that skew the result, e.g. small models also have non-zero zero shot performance, it just does not show up in terms of accuracy.
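How a metric can turn smooth progress into an apparent jump is easy to reproduce numerically. The sketch assumes, purely for illustration, that an answer consists of 10 tokens, that each token is correct independently with the same probability, and that the metric only counts answers where every token is right.

```python
def exact_match(per_token_accuracy, answer_length):
    """Probability that all tokens of an answer are right,
    assuming independent tokens with equal accuracy."""
    return per_token_accuracy ** answer_length

for p in (0.5, 0.7, 0.9, 0.99):
    print(f"per-token accuracy {p:.2f} -> exact match on 10 tokens {exact_match(p, 10):.3f}")
```

The per-token accuracy improves steadily, while the all-or-nothing metric stays close to zero for a long time and then rises steeply, which can look like a sudden "emergence".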

<img src="http://drive.google.com/uc?export=view&id=1AkwWXPzyInjV4InRh92spJHx0BKepZs4" width=65%>

If this turns out to be true, we might be tempted to conclude that size is not the only factor behind the good performance of LLMs, and that especially instruction finetuning is a reasonable bias towards what we consider useful. This would mean that we could potentially get away with much smaller model sizes than currently employed, which would be a benefit, since the sheer scale of state-of-the-art models is in itself a barrier that hampers open innovation.

# How to use LLMs in practice?

## Prompts

### How to write good prompts?

Motto: 

__Context is everything, keep the context "in mind"!__

#### Anatomy of a prompt

<img src="http://drive.google.com/uc?export=view&id=15NXmYlHop5H47nsTFE1woojWJOJmaMM_" width=90%>

#### "Formalization"

Also, "formalization" helps. The use of markers and structure:
- clarifies roles
- creates "slots" (which will be of crucial importance!)
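As a small illustration of roles and slots, a prompt can be kept as a template whose named placeholders are filled in per request. The section markers (`### Role` etc.) and the slot names are arbitrary choices made for this sketch, not a standard.

```python
from string import Template

# Markers separate the roles; $-prefixed names are the slots.
PROMPT = Template(
    "### Role\n$role\n\n"
    "### Instruction\n$instruction\n\n"
    "### Input\n$input_text\n\n"
    "### Answer\n"
)

prompt = PROMPT.substitute(
    role="You are a careful financial analyst.",
    instruction="Summarize the input in one sentence.",
    input_text="Quarterly revenue rose 12% while costs stayed flat.",
)
print(prompt)
```

Because the slots are explicit, the same template can be reused with different inputs, or filled programmatically by another system.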

<img src="http://drive.google.com/uc?export=view&id=1x7U3ylG9vrjNSELraoKhIV7NPnjB0IJw" width=90%>

#### Example

<img src="http://drive.google.com/uc?export=view&id=1uMt0r5RRrX5d82qHkbN1Sng7XvDxD29X" width=90%>

__More about prompting [here](https://www.youtube.com/watch?v=xnXDsYquT5A&list=PLey6rpDz-3B9fymKhp_TPOomx-1_iJxCY).__

### Simple strategies for building prompts

For the manual construction of prompts some simple strategies can help a lot:

- __Trial and error__: iterate, refine continuously
- __Refinement__: Start simple, gradually add complexity
- __Decomposition__: Try to decompose tasks to smaller elements
- __"Nesting"__: From smaller elements, compose more and more high level solutions

We will see some more elaborate "reasoning" below on why these "decomposition" and "step-by-step" approaches can work well.

In [None]:
TODO prompt based zero shot text classification example practice comes here
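A minimal sketch of prompt based zero shot text classification: the task is described in the prompt, the allowed labels are listed, and the free-text completion is mapped back to a label. The model call is replaced by a stand-in function (`fake_llm`, hypothetical) that always returns the same reply, so the example runs without any API; in practice it would be swapped for a real model call.

```python
LABELS = ["positive", "negative", "neutral"]

def build_prompt(text, labels=LABELS):
    return (
        "Classify the sentiment of the text.\n"
        f"Answer with exactly one of: {', '.join(labels)}.\n\n"
        f"Text: {text}\n"
        "Label:"
    )

def parse_label(completion, labels=LABELS):
    """Return the known label that appears first in the model output, else None."""
    lowered = completion.lower()
    hits = [(lowered.find(label), label) for label in labels if label in lowered]
    return min(hits)[1] if hits else None

def fake_llm(prompt):
    """Stand-in for a real model call (hypothetical); always gives the same reply."""
    return " Positive. The reviewer clearly enjoyed it."

def classify(text, llm=fake_llm):
    return parse_label(llm(build_prompt(text)))

print(build_prompt("I loved this film"))
print(classify("I loved this film"))
```

No example is labelled and no weight is changed: the "classifier" consists only of the prompt and the output parsing.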

#### "Reverse prompting"

Since prompting is basically goal oriented creative work in the domain of words, practitioners soon realized, that one can use an LLM (given some solved examples) to generate some prompts for LLMs (the same or different) to solve the given task. This approach can be called "reverse prompt engineering".

<img src="http://drive.google.com/uc?export=view&id=1Nv9HaWQWRwk4JDBn1QQ1extuL4UiMMhf" width=45%>

[source](https://the-prompt-engineer.beehiiv.com/p/8-reverse-prompt-engineering)

(It is important to note that a "jailbreaking" technique exists with the same name, derived from "prompt injection". For more details on that, see [here](https://www.latent.space/p/reverse-prompt-eng).)

This approach already points towards the fact that prompting can be understood as a "search" problem, hence the apparatus of computer science (namely search and optimization algorithms) can be unleashed upon it.

### The Search for prompts

In [None]:
/Flo part/

#### Motivation and context: Controlling Neural Text Generation

**Example**: the Frankfurt School wants to write an email to potential MBA students convincing them to join the MBA programme

There are three requirements:

1.   The language model has to use content that is true about the Frankfurt School
2.   The text should be as effective as possible in reaching its goal - in our case applications to the School's MBA programme
3.   In many cases there are only few examples of texts and target values (click through rates) available






**The setting**
- If LLMs create consistent answers of high quality, a key question is how to get the answer required for the particular business or research context (closely related to grounding)
- The task may include getting text that meets particular (1) content, (2) style and (3) quantitative criteria, such as optimizing for a click through rate
- Sometimes it may be necessary to build a model that completes this task: generating a prompt that fits our objective
- The fewer labeled examples we need to achieve the target, the more valuable the business application




**Typical process for controlling neural text generation**

<img src="http://drive.google.com/uc?export=view&id=1253xgMcQ8R8dtZRWX-vD3LT-9sOjUmm4" width=800 heigth=800>


[source](https://arxiv.org/pdf/2201.05337.pdf)

**Taxonomy of control conditions**

<img src="http://drive.google.com/uc?export=view&id=1Qst_n8NZu42YYr_ab9OyHTKiLhaefdII" width=600 heigth=600>


[source](https://arxiv.org/pdf/2201.05337.pdf)



#### Hard prompts with machine learning / optimization

Generally two key factors to consider:
1.   How the space of possible prompts is searched / pre-selected
2.   How to evaluate the performance of proposed prompts

##### Permutation based - AutoPrompt


[**AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts**](https://arxiv.org/pdf/2010.15980.pdf)

- A sequence of additional tokens (called trigger tokens) that can be learned is added to the input sequence to form the prompt



<img src="http://drive.google.com/uc?export=view&id=1oyLil3lppN84reSi1paX7NgpgszinSzu" width=700 heigth=700>

**How to optimize**

- Requires access to the internals of the model: the input token embeddings and the gradients with respect to them

- Feed the prompt into the language model and measure the probability distribution $p\left([\mathrm{MASK}] \mid \boldsymbol{x}_{\text {prompt }}\right)$ over the tokens predicted at the mask position (in this case negative vs positive sentiment)

- At each step compute a first-order approximation of the change in the log-likelihood that would be produced by swapping the $j$ th trigger token $x_{\text {trig }}^{(j)}$ with another token $w \in \mathcal{V}$

-  Identify a candidate set $\mathcal{V}_{\text {cand }}$ of the top- $k$ tokens estimated to cause the greatest increase:
$$
\mathcal{V}_{\text {cand }}=\text { top } k\left[\boldsymbol{w}_{\text {in }}^T \nabla \log p\left(y \mid \boldsymbol{x}_{\text {prompt }}\right)\right]
$$

- Choose the best candidates
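The candidate selection step can be sketched as a dot product between every input embedding $\boldsymbol{w}_{\text{in}}$ and the gradient of the log-likelihood, followed by a top-$k$ cut. The two-dimensional embeddings and the gradient below are made-up numbers; in a real setting the gradient would come from backpropagation through the language model.

```python
def top_k_candidates(embeddings, grad, k):
    """Rank vocabulary tokens by w_in^T grad, the first-order estimate of the
    change in log p(y | x_prompt) when the token is swapped in, and keep the top k."""
    scores = {
        token: sum(w * g for w, g in zip(vector, grad))
        for token, vector in embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical input embeddings and a hypothetical gradient of log p with
# respect to the embedding of the trigger token that is being replaced.
embeddings = {
    "great": [1.0, 0.2],
    "movie": [0.1, 0.9],
    "terrible": [-1.0, 0.1],
    "the": [0.0, 0.0],
}
grad = [0.5, 0.1]

print(top_k_candidates(embeddings, grad, k=2))
```

Since the score is only a linear approximation, the selected candidates would still have to be checked with actual forward passes before one of them replaces the trigger token.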

Further development using similar techniques:
[Making Pre-trained Language Models Better Few-shot Learners](https://arxiv.org/pdf/2012.15723.pdf)

##### Genetic algorithm based

**[GPS: Genetic Prompt Search for Efficient Few-shot Learning](https://arxiv.org/pdf/2210.17041.pdf)**


**Basic idea**
- Use some heuristics to generate new prompts from an initial one
- Choose the top prompts through a fitness function (genetic algorithm inspired)




<img src="http://drive.google.com/uc?export=view&id=18pR6S9WG-4ZnNJUogsu-n0ZJ3x-BTetH" width=700 heigth=700>


<img src="http://drive.google.com/uc?export=view&id=1KHOYEzRP8Cxcuj3AwIxXBl0joUtLc3q1" width=350 heigth=350>

**How are prompts generated?**
- Back Translation
- Cloze 
- Sentence continuation

**Evaluation**:
Score on accuracy for the required task for the train dataset (no surrogate model)
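The overall loop can be sketched generically: score the population, keep the best prompts, generate variations of them, repeat. In this toy version the mutation (a synonym swap) stands in for back translation, cloze or sentence continuation, and the fitness (a keyword count) stands in for task accuracy; both are hypothetical placeholders, not the procedures of the paper.

```python
import random

def genetic_prompt_search(seed_prompts, mutate, fitness,
                          generations=3, top_k=2, children=3, rng=None):
    rng = rng or random.Random(0)
    population = list(seed_prompts)
    for _ in range(generations):
        # Selection: keep the top_k prompts according to the fitness function.
        survivors = sorted(population, key=fitness, reverse=True)[:top_k]
        # Variation: generate new prompts from every survivor.
        offspring = [mutate(p, rng) for p in survivors for _ in range(children)]
        population = survivors + offspring
    return max(population, key=fitness)

SYNONYMS = {"Tell": "Decide", "good": "positive", "bad": "negative"}

def toy_mutate(prompt, rng):
    """Placeholder for a real prompt generator: swap one word for a synonym."""
    words = prompt.split()
    i = rng.randrange(len(words))
    words[i] = SYNONYMS.get(words[i], words[i])
    return " ".join(words)

def toy_fitness(prompt):
    """Placeholder for task accuracy obtained with this prompt."""
    return sum(word in {"Decide", "positive", "negative"} for word in prompt.split())

seed = "Tell if the review is good or bad"
best = genetic_prompt_search([seed], toy_mutate, toy_fitness)
print(best, toy_fitness(best))
```

Because the survivors are carried over into the next generation, the best fitness in the population can never decrease.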

##### Reinforcement learning based methods 

1. How the space of possible prompts is searched / pre-selected: **RL**
2. Evaluating the performance of proposals: **train a separate classifier, or use / fine tune an existing language model for classification**

[**Efficient (Soft) Q-Learning for Text Generation with Limited Good Data**](https://arxiv.org/pdf/2106.07704.pdf)

- Example of this approach
- Uses RL for generating suitable prompts 
- Trains a classifier to evaluate the suitability of the changed prompt

**Motivation of the paper**

- *MLE (maximum likelihood estimation)* needs large amounts of supervised data -> a problem when literally no supervised data is available
- *RL advantage:* can learn from a reward function, even for discrete, non-differentiable events
- *RL problems:* unstable / notoriously difficult to train: (1) sparse reward: only available once the text is finished; (2) large action space: the whole vocabulary (typically tens of thousands of tokens)

<img src="http://drive.google.com/uc?export=view&id=1PVuwAoFTXN1peGTGrGUmmrcjWczYwXQS" width=600 heigth=600>

**Propose *soft* Q-Learning**

- Maximum-entropy (MaxEnt) extension of standard (hard) Q-learning
- The agent is encouraged to optimize the reward while staying as stochastic as possible
- Objective $J_{\operatorname{MaxEnt}}(\pi)=$ $\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^T \gamma^t r_t+\alpha \mathcal{H}\left(\pi\left(\cdot \mid \boldsymbol{s}_t\right)\right)\right]$
- Augments the vanilla $J(\pi)$ with the additional Shannon entropy term $\mathcal{H}$ with coefficient $\alpha $
- Connects $Q$-values to the familiar output logits of a text generation model, which enables straightforward implementation of the SQL formulation

**Soft Q-learning replaces the argmax operator with a softmax**
- Connection of the $Q$-values with the logits, i.e., outputs right before the softmax layer. 
- Following relationship between optimal policy $\pi^*$ and action-value $Q^*$ holds (Haarnoja et al., 2017; Schulman et al., 2017):
$$
\pi^*(a \mid s)=\frac{\exp Q^*(\boldsymbol{s}, a)}{\sum_{a^{\prime}} \exp Q^*\left(\boldsymbol{s}, a^{\prime}\right)} .
$$
This form is highly reminiscent of the softmax layer of the generation model 

- **Key point:** the softmax leads to exploration, but not in an epsilon-greedy style where each non-argmax action is equally likely -> no arbitrary sampling - this is language after all!
- The temperature of the softmax can be decreased over time to make the model converge
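The difference between the two exploration styles can be seen by computing both policies from the same $Q$-values. The $Q$-values and the value of epsilon are made up; a temperature of 1 reproduces the softmax formula above, and lower temperatures push the policy towards the argmax.

```python
import math

def soft_policy(q_values, temperature=1.0):
    """pi(a|s) = exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T)."""
    m = max(q_values)                       # subtract max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def epsilon_greedy_policy(q_values, epsilon=0.1):
    """Greedy action with probability 1 - epsilon, otherwise uniform over all actions."""
    n = len(q_values)
    best = max(range(n), key=q_values.__getitem__)
    return [epsilon / n + (1 - epsilon) * (i == best) for i in range(n)]

q = [2.0, 1.0, -3.0]   # hypothetical Q-values of three candidate tokens
print("softmax, T=1.0:  ", [round(p, 4) for p in soft_policy(q)])
print("softmax, T=0.5:  ", [round(p, 4) for p in soft_policy(q, temperature=0.5)])
print("epsilon-greedy:  ", [round(p, 4) for p in epsilon_greedy_policy(q)])
```

The softmax policy explores the plausible second token far more often than the implausible third one, whereas epsilon-greedy treats both non-argmax tokens alike.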


**Connection to softmax 2**

In other words, the model output $f_\theta(a \mid \boldsymbol{s})$, originally interpreted as the "logit" of token $a$ given the preceding tokens $\boldsymbol{s}$, is now re-interpreted as the $Q$-value of action $a$ in state $\boldsymbol{s}$. At optimality, $f_{\theta^*}(a \mid \boldsymbol{s})$, namely $Q^*(\boldsymbol{s}, a)$, represents the best possible future reward achievable by generating token $a$ in state $\boldsymbol{s}$. Similarly, the full generation model $p_\theta(a \mid \boldsymbol{s})$, which applies a softmax to $f_\theta$, now precisely corresponds to the policy $\pi_\theta$ induced from $Q_\theta(\boldsymbol{s}, a)$. That is,
$$
\begin{aligned}
\pi_\theta(a \mid \boldsymbol{s}) & =\frac{\exp Q_\theta(\boldsymbol{s}, a)}{\sum_{a^{\prime}} \exp Q_\theta\left(\boldsymbol{s}, a^{\prime}\right)} \\
& \equiv \frac{\exp f_\theta(a \mid \boldsymbol{s})}{\sum_{a^{\prime}} \exp f_\theta\left(a^{\prime} \mid \boldsymbol{s}\right)}=p_\theta(a \mid \boldsymbol{s}) .
\end{aligned}
$$

**Additionally effective training with path consistency**

- Adapts the unified [path consistency learning (PCL)](https://dl.acm.org/doi/pdf/10.5555/3294996.3295037) approach (which excelled in game control)
- Good for directly learning RL algorithm on past data
- PCL-based training updates $Q$-values of all tokens at once through a connection between the value function and the induced policy
-  Optimal policy $\pi^*$ and the optimal state value function $V^*$  in SQL must satisfy the following consistency property for all states and actions:
$$
V^*\left(\boldsymbol{s}_t\right)-\gamma V^*\left(\boldsymbol{s}_{t+1}\right)=r_t-\log \pi^*\left(a_t \mid \boldsymbol{s}_t\right), \forall \boldsymbol{s}_t, a_t
$$

- Accordingly, the PCL-based training attempts to encourage the satisfaction of the consistency with the following regression objective $\mathcal{L}_{\mathrm{SQL}, \mathrm{PCL}}(\boldsymbol{\theta})$ :
$$
\mathbb{E}_{\pi^{\prime}}\left[\frac{1}{2}\left(-V_{\bar{\theta}}\left(\boldsymbol{s}_t\right)+\gamma V_{\bar{\theta}}\left(\boldsymbol{s}_{t+1}\right)+r_t-\log \pi_\theta\left(a_t \mid \boldsymbol{s}_t\right)\right)^2\right],
$$
where $\pi_\theta$ is the induced policy; $V_{\bar{\theta}}$  depends on the target $Q_{\bar{\theta}}$ network (i.e., a slow copy of the $Q_\theta$ to be learned), and recall that $\pi^{\prime}$ is an arbitrary behavior policy (e.g., data distribution). Please see


<img src="http://drive.google.com/uc?export=view&id=115CR08rGbOpmKxu1YIg-b9ugvT8b-RU5" width=600 heigth=600>

**Multi-step PCL for Sparse Reward**
$$
V^*\left(\boldsymbol{s}_t\right)-\gamma^{T-t} V^*\left(s_{T+1}\right)=\sum_{l=t}^T \gamma^{l-t}\left(r_l-\log \pi^*\left(a_l \mid \boldsymbol{s}_l\right)\right),
$$
where the value of the post-terminal state is zero, $V^*\left(s_{T+1}\right)=0$, and the rewards are only available at the end, $\sum_{l=t}^T \gamma^{l-t} r_l=\gamma^{T-t} r_T$. This gives the following multi-step objective function $\mathcal{L}_{\mathrm{SQL},\,\mathrm{PCL\text{-}ms}}(\boldsymbol{\theta})$:
$$
\mathbb{E}_{\pi^{\prime}}\left[\frac{1}{2}\left(-V_{\bar{\theta}}\left(\boldsymbol{s}_t\right)+\gamma^{T-t} r_T-\sum_{l=t}^T \gamma^{l-t} \log \pi_\theta\left(a_l \mid \boldsymbol{s}_l\right)\right)^2\right].
$$
The objective side-steps the need to bootstrap intermediate value functions $V_{\bar{\theta}}\left(s_{t^{\prime}}\right)$ for $t^{\prime}>t$. Instead, it directly uses the non-zero end reward $r_T$ to derive the update for $\boldsymbol{\theta}$. Please see Figure 2 (right) for an illustration. In practice, the authors combine the single- and multi-step objectives (Eqs. 7 and 9 in the paper) for training.
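A minimal numerical sketch of the multi-step objective follows, under the same assumptions as for the single-step case: $V(\boldsymbol{s})=\log \sum_a \exp Q(\boldsymbol{s}, a)$ is the standard soft Q-learning value, one sequence stands in for the expectation over $\pi^{\prime}$, and `q_model` / `q_target` are hypothetical names. Time steps are 0-indexed here, so the discount $\gamma^{T-t}$ becomes `gamma ** (last_index - t)`.

```python
import numpy as np


def logsumexp(x, axis=-1):
    """Numerically stable log(sum(exp(x))) along `axis`."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def multi_step_pcl_loss(q_model, q_target, actions, final_reward, gamma=1.0):
    """Multi-step PCL loss for one sequence whose only reward comes at the end."""
    n_steps = len(actions)
    steps = np.arange(n_steps)
    log_pi = q_model[steps, actions] - logsumexp(q_model)
    v = logsumexp(q_target)
    # Discounted suffix sums: sum_{l >= t} gamma^(l - t) * log pi(a_l | s_l)
    suffix = np.zeros(n_steps)
    running = 0.0
    for t in reversed(range(n_steps)):
        running = log_pi[t] + gamma * running
        suffix[t] = running
    # gamma^(T - t) applied to the single end-of-sequence reward
    discount = gamma ** (n_steps - 1 - steps)
    residual = -v + discount * final_reward - suffix
    return 0.5 * np.mean(residual ** 2)


# Toy check: uniform Q-values over a vocabulary of 2, two steps, zero reward.
toy_loss = multi_step_pcl_loss(
    np.zeros((2, 2)), np.zeros((2, 2)), np.array([0, 1]), final_reward=0.0
)
```

Note that no $V_{\bar{\theta}}(s_{t+1})$ term appears in the residual: every position is regressed directly against the end reward, which is the point made in the text.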

**Reward**
The authors use a distilled GPT-2 model as the pretrained LM to be controlled. For rewards, they use the topic accuracy of the continuation sentences, measured by a zero-shot classifier, plus the log-likelihood of the continuation sentences, measured by a distilled GPT-2, as the language quality reward.


<img src="http://drive.google.com/uc?export=view&id=1ezZ4PCL2LGJl75JG_TAI_aQqmG9Tuv0e" width=600 heigth=600>


Other RL-based approaches:
[RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning](https://arxiv.org/abs/2205.12548)

####  Soft prompts with machine learning/optimization

- "Soft" because they are not discrete tokens: the prompt is a sequence of continuous vectors whose entries can be changed gradually (for example by gradient descent)

See, for example, the following paper on [prefix tuning](https://aclanthology.org/2021.acl-long.353/).
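The idea of a soft prompt can be illustrated with a toy example. This sketch is a strong simplification and not the method of the linked paper: it prepends trainable vectors only at the input level (closer to prompt tuning than to full prefix tuning, which adds prefixes in every layer), and the "model" is a hypothetical frozen mean-pooling readout rather than a transformer. All names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, n_prompt = 10, 4, 3

# Frozen parts: token embedding table and a toy "model" (mean-pool, then dot product).
token_embeddings = rng.normal(size=(vocab_size, dim))
readout = rng.normal(size=dim)
readout /= np.linalg.norm(readout)

# The only trainable parameters: n_prompt continuous vectors.
soft_prompt = np.zeros((n_prompt, dim))


def forward(soft_prompt, token_ids):
    """Prepend the soft prompt to the embedded tokens and score the sequence."""
    inputs = np.concatenate([soft_prompt, token_embeddings[token_ids]], axis=0)
    return inputs.mean(axis=0) @ readout, inputs.shape[0]


token_ids = np.array([1, 5, 7])
target, learning_rate = 1.0, 4.0
for _ in range(100):
    score, n_positions = forward(soft_prompt, token_ids)
    # Gradient of 0.5 * (score - target)^2 with respect to each prompt row.
    grad_row = (score - target) * readout / n_positions
    soft_prompt -= learning_rate * grad_row  # same update broadcast to every row

final_score, _ = forward(soft_prompt, token_ids)
```

Only `soft_prompt` is updated; the embedding table and the readout stay fixed, which mirrors the idea of steering a frozen LM through a small number of continuous parameters.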


### Still fails sometimes!

#### Hallucination

A major failure mode of earlier LLMs, and a very frequently cited shortcoming, was hallucination, especially in numeric and reasoning-type tasks.

Some vocal critics went so far as to call the whole LLM paradigm "flawed" and "useless" based on the poor performance of (mainly non-instruction-finetuned) LLMs on reasoning tasks.

This drew considerable attention from the scientific community.

<img src="http://drive.google.com/uc?export=view&id=1dkUAL_WV1TtZJCnZhfMfGCqtAr9vl44Z" width=65%>

In the work [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903), the authors found that if, instead of answering the final task directly, the __LLM is made to produce step-by-step reasoning__ towards the solution, its __final task performance becomes significantly better__!

This led to "chain-of-thought prompting" becoming the de facto standard solution strategy.
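As a rough sketch, a few-shot chain-of-thought prompt can be assembled by giving each exemplar a worked reasoning before its answer. The helper name `build_cot_prompt`, the `Q:` / `A:` layout and the arithmetic exemplar are illustrative assumptions, not taken from the paper.

```python
def build_cot_prompt(exemplars, question):
    """Build a few-shot prompt whose exemplars spell out their reasoning."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # The new question ends with an open "A:" for the model to continue.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


exemplars = [
    {
        "question": "A shop has 12 apples. It sells 5 and then receives 8 more. "
                    "How many apples does it have now?",
        "reasoning": "The shop starts with 12 apples. After selling 5 it has "
                     "12 - 5 = 7. After receiving 8 more it has 7 + 8 = 15.",
        "answer": "15",
    }
]
cot_prompt = build_cot_prompt(
    exemplars,
    "A library has 30 books. It lends out 12 and buys 4 new ones. "
    "How many books does it have now?",
)
```

Given such a prompt, the model tends to imitate the exemplar and write out its own intermediate steps before stating the answer.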

##### "Let's think step by step!"

In the work [Large Language Models are Zero-Shot Reasoners](https://arxiv.org/pdf/2205.11916.pdf), it is shown that merely hinting at a chain of thought in the prompt, without any worked examples, can elicit strong reasoning performance from an LLM.

<img src="http://drive.google.com/uc?export=view&id=1uyJR4LNzRVsh1dYpAevHyI3YP1t8VgvG" width=75%>

<img src="http://drive.google.com/uc?export=view&id=1GQJDP4cGJpV8JCMh5Ggk1TeYxNS8TnvG" width=65%>
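The zero-shot variant needs no exemplars at all. The sketch below builds the two prompts of a two-stage scheme: one that triggers the reasoning and one that asks for the final answer. The helper names and the exact wording of the answer trigger are assumptions, and the reasoning text, which a real LLM would generate, is a hand-written placeholder.

```python
REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer is"  # assumed wording


def reasoning_prompt(question):
    """Stage 1: ask the question and nudge the model to reason."""
    return f"Q: {question}\nA: {REASONING_TRIGGER}"


def answer_prompt(question, generated_reasoning):
    """Stage 2: feed the generated reasoning back and ask for the final answer."""
    return f"{reasoning_prompt(question)} {generated_reasoning}\n{ANSWER_TRIGGER}"


question = "A train travels 60 km in 1.5 hours. What is its average speed?"
stage_one = reasoning_prompt(question)
# Placeholder for what an LLM might produce after stage one.
reasoning = "Speed is distance divided by time. 60 / 1.5 = 40 km/h."
stage_two = answer_prompt(question, reasoning)
```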

"Let's think step-by-step!" became a standard part in nearly every well "engineered" prompt.

#### "Mind in a box" 

<img src="http://drive.google.com/uc?export=view&id=1GYEV_6oWcoXolIe0qDVTUhePwn_2icmc" width=45%>

As illustrated by the image above, by default LLMs are "frozen" in time once pretraining and finetuning are finished (since continuous training is, _as of now_, computationally infeasible), so they act like __"minds in a box"__ that have no access to the outside world.

In many (in fact, most) real-life tasks, the models should ideally have up-to-date information and access to its sources. The paradigm of [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) - and the many follow-up works based upon it - __gives access to an external knowledge base of facts - or later the whole internet - via a search engine__, which helps provide rich context for the LLMs to solve the tasks.

<img src="http://drive.google.com/uc?export=view&id=1p7Ajkz_QnzhhI6zzUAgdtKKP04kPyVln" width=45%>

The whole paradigm of information retrieval - mainly with the help of high-quality, contextual, neural embeddings - comes to the aid of LLM performance, and defines the new state of the art.
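The retrieve-then-prompt pattern can be sketched in a few lines. This is a toy illustration and not REALM itself (which trains the retriever jointly with the language model): a bag-of-words vector stands in for a neural embedding, and the documents, function names and prompt layout are all made up.

```python
import numpy as np

documents = [
    "The Eiffel Tower is located in Paris.",
    "Python is a programming language.",
    "The Danube flows through Budapest.",
]


def tokenize(text):
    return [w.strip(".,?!").lower() for w in text.split()]


vocab = sorted({w for d in documents for w in tokenize(d)})
index = {w: i for i, w in enumerate(vocab)}


def embed(text):
    """Toy stand-in for a neural embedding: L2-normalised word counts."""
    vec = np.zeros(len(vocab))
    for w in tokenize(text):
        if w in index:
            vec[index[w]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


doc_matrix = np.stack([embed(d) for d in documents])


def retrieve(query, k=1):
    """Return the k documents with the highest cosine similarity to the query."""
    scores = doc_matrix @ embed(query)
    top = np.argsort(-scores)[:k]
    return [documents[i] for i in top]


def build_prompt(query, k=1):
    """Put the retrieved passages in front of the question as context."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


rag_prompt = build_prompt("Which river flows through Budapest?")
```

The LLM then answers from the supplied context rather than from whatever it memorised during pretraining, so the knowledge base can be updated without retraining the model.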

On more recent advancements in this area, see the great survey: [Augmented Language Models: a Survey](https://arxiv.org/abs/2302.07842)

This kind of "procedural" reasoning ability, combined with the research it inspired into external resources, led to a new paradigm.


## "Agents" and "Tools" - access the "outside world"


### Basic idea: ReAct

Building on the emergent reasoning abilities and the introduction of external "sources", the authors of the paper [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/abs/2210.03629) went a step further and proposed a solution in which the LLMs themselves are presented with a set of "tools" and asked to do "step-by-step planning" of how they would use them. The addition here is: __the planned steps of tool usage get executed by a framework__, so if the LLM decides that it should search for something on the internet via a "search engine tool", the environment executes this "query" and gives the result back to the LLM (in textual, "serialized" form).

<img src="http://drive.google.com/uc?export=view&id=10HLujjZ-lZsyrlbhjNp_HofY_lAm1iNw" width=85%>

This way, the __LLM acts as an autonomous agent, formulating "plans" and using "tools"__.
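The loop between the LLM and the executing framework can be sketched as follows. This is only a schematic illustration: `scripted_llm` is a hard-coded stand-in for a real model, the `search` tool is a dictionary lookup, and the `Action: name[input]` text format is an assumption chosen for easy parsing.

```python
import re

KNOWLEDGE = {"capital of hungary": "Budapest"}


def search_tool(query):
    """Toy 'search engine': a dictionary lookup."""
    return KNOWLEDGE.get(query.lower(), "no result")


TOOLS = {"search": search_tool}
ACTION_RE = re.compile(r"Action: (\w+)\[(.*)\]")


def scripted_llm(transcript):
    """Hard-coded stand-in for an LLM: search first, then finish."""
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: search[capital of Hungary]"
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: I now know the answer.\nAction: finish[{observation}]"


def run_agent(question, llm, max_steps=5):
    """Alternate between LLM steps and tool execution until 'finish'."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:
            break  # the model produced no parsable action
        name, argument = match.groups()
        if name == "finish":
            return argument, transcript
        tool = TOOLS.get(name)
        observation = tool(argument) if tool else f"unknown tool {name}"
        # The tool result is serialized back into the text the LLM sees next.
        transcript += f"Observation: {observation}\n"
    return None, transcript


answer, transcript = run_agent("What is the capital of Hungary?", scripted_llm)
```

The important design point is that the model only ever emits text; it is the surrounding framework that parses the action, calls the tool, and appends the observation to the transcript.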


### Result: Tool ecosystems

As a result of this paradigm, a __huge ecosystem__ of tools emerged that aims to connect external services to LLMs, thus adding extraordinary capabilities and opening the floodgates for practical applications.

The two dominant - in a sense mutually reinforcing - ecosystems are:

<img src="http://drive.google.com/uc?export=view&id=1uRcgPG1k_z00rXkWV4TG_pm_HkGbuzqc" width=80%>

Read more about:
- [ChatGPT plugins](https://openai.com/blog/chatgpt-plugins)
- [LangChain](https://langchain.readthedocs.io/)

In [None]:
%%HTML    
<iframe width="560" height="315" src="https://www.youtube.com/embed/nE2skSRWTTs"></iframe>

#### Learning new tools

In a "logical" continuation of this paradigm, people started to experiment with models that are capable of __learning the usage of new tools__, as in [Toolformer: Language Models Can Teach Themselves to Use Tools](https://arxiv.org/abs/2302.04761). The gist here is that an appropriately trained, or even just an __appropriately prompted__, model can dynamically adapt to a new tool environment (e.g. via API description texts).
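The purely prompted variant can be sketched as follows: the available tools are described in plain text inside the prompt, so adding a tool only means adding one more description. This is not how Toolformer itself works (it is finetuned on self-generated API calls); the helper name, tool names and prompt wording here are illustrative assumptions.

```python
def render_tool_prompt(tools, task):
    """Render textual tool descriptions and the task into one prompt."""
    lines = ["You can use the following tools:"]
    for name, description in tools.items():
        lines.append(f"- {name}: {description}")
    lines.append("Respond with: Action: <tool>[<input>]")
    lines.append(f"Task: {task}")
    return "\n".join(lines)


tools = {
    "search": "search[query] looks up a query on the web and returns a text snippet.",
}
prompt_before = render_tool_prompt(tools, "What is 17 * 23?")

# Registering a new tool at runtime needs nothing but its description.
tools["calculator"] = "calculator[expression] evaluates an arithmetic expression."
prompt_after = render_tool_prompt(tools, "What is 17 * 23?")
```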


This led to experiments in enhanced autonomy, like:
- [BabyAGI](https://www.youtube.com/watch?v=QBcDLSE2ERA)
- [AutoGPT](https://www.youtube.com/watch?v=LqjVMy2qhRY) and
- [HuggingGPT](https://www.youtube.com/watch?v=PfY9lVtM_H0)

__These videos are definitely worth a look__, since they represent the absolute state of the art.

They would merit a more thorough elaboration...

### Additional notes

This vision - and in fact reality - of trained neural agents cooperating to achieve a common complex goal strongly resembles [Minsky's society of mind](https://en.wikipedia.org/wiki/Society_of_Mind) hypothesis, and gives some strong empirical evidence in favour of it. It is basically the current incarnation of the dream of "autonomous software agents" from decades ago.

Also, it is important to stress that with the advent of the T5-style text in - text out paradigm, and the prevalence of the zero-shot approach, (nearly) __all interactions with the models are via text__.

This __"via language" paradigm__ proved to be so general that some researchers (e.g. in [TabLLM: Few-shot Classification of Tabular Data with Large Language Models](https://arxiv.org/abs/2210.10723)) endeavored to cast "traditional", non-NLP problems into language in an attempt to solve them with LLMs.

And since __software code is written in human-parseable languages__ (so as to enable us to write it), __the whole area of software development and software services became accessible to LLMs__. Natural language became the "glue", the "communication medium" between humans and software, and in a sense among pieces of software. And as the famous saying goes, ["Software is eating the world"](https://a16z.com/2011/08/20/why-software-is-eating-the-world/).

In [None]:
TODO: the task "implement an HTML talk bot in LangChain" comes here

# Outlook - what to look for?

- Scaling wars might stop (even OpenAI's [Sam Altman hints](https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/) at this)
- Fully open-source models will become (and already are) available (e.g. [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant))
- Multi-modality becomes dominant (see for example [here](https://github.com/X-PLUG/mPLUG-Owl))
- Auto-coding becomes prevalent (see for example [this](https://levelup.gitconnected.com/the-end-of-coding-experts-predict-a-future-of-automated-development-c0aae9c458a2) summary)
- __Business and personal integration will be the main challenge__
- Reinforcement Learning might be relevant to [ground models](https://arxiv.org/abs/2302.02662)
