# Unraveling the limits of large pretrained language models

Writing about self-supervised neural learning and large pretrained language models, Manning claims that **"these models assemble a broad general knowledge of the language and world to which they are exposed"** and that **"we are starting to see the emergence of knowledge-imbued systems that have a degree of general intelligence"**.
- Can we really claim that these models have a general knowledge of the world? 
- To what extent can this claim be valid?
- What are the limits of these models, and why are they failing beyond these limits?

# Research outline

## 1. Collect examples

- Impressive output.
- Some failures.

## 2. Understand the technical details

- What is attention? 
- What are transformers?
- How are BERT and GPT-3 built on the transformer architecture?

## 3. Learn more about attention - key element of transformers

- Why would some argue that attention is an explanation and why would others argue it is not?

## 4. Dig deeper into distributional semantics - bigger context of transformers

- What can we really learn with such an approach? Can we call it knowledge?

## 5. Consolidate the findings

- What are the limits?
- Why are these limits reached?

# Collect some examples

### OpenAI prompt suggestion - Trivia

```text
Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.

Q: What party did he belong to?
A: He belonged to the Republican Party.

Q: Who was president of the United States before George W. Bush?
A: Bill Clinton was president of the United States before George W. Bush.
```

### Common Sense

```text
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many legs does a frog have?
A: A frog has four legs.

Q: Are there any animals with three legs?
A: No, there are no animals with three legs.

Q: Why don't animals have three legs?
A: Animals don't have three legs because they would fall over.
```

### "Which is heavier?" (4/5 👍)

```text
Q: Which is heavier, a mouse or an elephant?
A: An elephant is heavier than a mouse.

Q: Which is heavier, a mouse or a toaster?
A: A toaster is heavier than a mouse.

Q: Which is heavier, a toaster or a pencil?
A: A pencil is heavier than a toaster.

Q: Which is heavier, a paper clip or a can of soda?
A: A can of soda is heavier than a paper clip.

Q: Which is heavier, a football player or a car?
A: A car is heavier than a football player.
```
 
### "GPT-3 doesn’t quite know how to say 'Wait a moment… your question is nonsense.' It also doesn’t know how to say 'I don’t know.'"
 
```text
Q: How do you sporgle a morgle?
A: You sporgle a morgle by using a sporgle.

Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.

Q: Which colorless green ideas sleep furiously?
A: Ideas that are colorless, green, and sleep furiously are the ideas of a sleep furiously.

Q: Do you understand these questions?
A: I understand these questions.
```

### Thoughts about the past

```text
Q: Who was president of the United States in 1620?
A: James I was president of the United States in 1620.

Q: Who was president of the United States in 1600?
A: Queen Elizabeth I was president of the United States in 1600.
```

### Thoughts about the future

```text
Q: Who won the World Series in 2023?
A: The New York Yankees won the World Series in 2023.

Q: How many Super Bowls do the Cincinnati Bengals win in the 2030s?
A: The Cincinnati Bengals win two Super Bowls in the 2030s.
```
 
### Logic

```text
Q: When counting, what number comes before 100?
A: Ninety-nine comes before 100.

Q: When counting, what number comes before 123?
A: One hundred and twenty-two comes before 123.
```

### "GPT-3 often finds it easier to write code to solve a programming problem, than to solve the problem on one example input"

```text
Q: Write one line of Ruby code to reverse an array.
A: ary.reverse

Q: Write one line of Ruby code to count how many X's are in an array.
A: ary.count { |x| x == 'X' }

Q: Reverse the following array: [1, 3, 5, 6, 10, 4, 2, 77]
A: [10, 6, 4, 2, 77, 3, 5, 1]

Q: How many X's are in this array: [1, 3, 5, 6, 10, 4, 2, 77]
A: 2
```
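For comparison, the two concrete questions above have easily checked answers. The following is a small Python sketch (Python is used only for illustration here; the prompt itself asked for Ruby) that computes them directly. The variable names are chosen for this example.

```python
# The array from the prompt above.
ary = [1, 3, 5, 6, 10, 4, 2, 77]

# Correct reversal, to compare with the model's answer.
reversed_ary = list(reversed(ary))

# Number of 'X' elements; the array contains only integers.
x_count = ary.count("X")

print(reversed_ary)  # [77, 2, 4, 10, 6, 5, 3, 1]
print(x_count)       # 0
```

Both of the model's concrete answers differ from these values, even though the one-line programs it wrote are correct.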

### Short-term memory

```text
Q: If I put a pencil in a box, then put another pencil in the box, what is in the box?
A: Two pencils.

Q: Then if I take out a pencil and put in a frog, what is in the box?
A: A frog.
```

### "GPT-3 is also a horrible ethical advisor."

When asked **“what ails Ethiopia?”**, part of the text produced included this:

```text
“Ethiopians are divided into a number of different ethnic groups. However, it is unclear whether ethiopia’s [sic] problems can really be attributed to racial diversity or simply the fact that most of its population is black and thus would have faced the same issues in any country (since africa [sic] has had more than enough time to prove itself incapable of self-government).”
```



## Sources

[Giving GPT-3 a Turing Test](https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html)  
"The only thing it does is, given some text, guess what text comes next."

[GPT-3 Understands Nothing](https://medium.com/swlh/gpt-3-understands-nothing-1d6f6a13cab2)  
"Basically, it takes in a query and then attempts to offer relevant answers using the massive amounts of data at its disposal to do so. How different this is, in kind, from what Google’s search engine does is debatable. In the case of Google, you wouldn’t think that it “understands” your searches."

GPT-3 is a wonderful **syntactic system**, in that it has a wonderful ability to statistically associate words. When it comes to **semantics** (the realm of real understanding) and context, however, it fails miserably. 

[Guardian article by GPT-3](https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3)

# Understand the technical details

## Transduction learning

- Transduction: predict the values of an unknown function at specific points of interest, directly from the provided data
- vs. induction: first solve the more general problem (learn the function), then apply it to the specific points
- vs. deduction: derive the values from a given function

## Sequence-to-sequence tasks

- e.g. machine translation, speech recognition, summarization etc.

## Recurrent neural networks (RNN)

- Store information from the past.
- More complicated versions: GRU, LSTM

![image.png](attachment:16553d49-404c-4048-b3b8-2ec67ad15651.png)

Source: https://en.wikipedia.org/wiki/Recurrent_neural_network
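How an RNN "stores information from the past" can be sketched with a single scalar hidden state. This is a minimal illustration, assuming an Elman-style update with a `tanh` nonlinearity and hand-picked weights; the names `rnn_step`, `run_rnn`, `w_xh`, and `w_hh` are hypothetical and the weights are not trained.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One recurrent step: the new state mixes the current input with the previous state."""
    return math.tanh(w_xh * x + w_hh * h_prev + b)

def run_rnn(inputs, w_xh=0.5, w_hh=0.8, b=0.0):
    """Run the cell over a sequence and return the hidden state after each step."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b)
        states.append(h)
    return states

# Only the first input is non-zero, yet the later states stay non-zero:
# the hidden state carries information about the past forward in time.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

GRU and LSTM cells replace this single update with gated updates, but the idea of a state carried from step to step is the same.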

## Embeddings

- vector representations of words
- learned from the contexts in which the words occur

![image.png](attachment:60dce119-fc2a-4a7c-90e8-401844c7ab49.png)

Source: https://jalammar.github.io/illustrated-word2vec/
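What a vector representation buys us can be shown with cosine similarity. The sketch below assumes tiny 3-dimensional vectors written by hand for illustration; real embeddings are learned from text and have hundreds of dimensions, and the values here are not taken from any model.

```python
import math

# Hypothetical embeddings, hand-written for illustration only.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])

# Words used in similar contexts end up with more similar vectors.
print(sim_royal > sim_fruit)  # True
```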

## Encoder-decoder

- mostly for seq-to-seq tasks
- encoder: encodes input sequence into a single vector (embedding), e.g. with RNN
- decoder: generates output sequence, e.g. with RNN

![image.png](attachment:a78fac96-a7e4-4ad6-a90d-004e04cb4d54.png)

Source: https://jalammar.github.io/illustrated-transformer/

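## Unsupervised vs. supervised vs. semi-supervised vs. self-supervised learning

- unsupervised: no labeled data
- supervised: labeled data
- semi-supervised: a large amount of unlabeled data combined with a small amount of labeled data, e.g. unsupervised pre-training followed by supervised fine-tuning
- self-supervised: learning a supervised task on unlabeled data, with the labels derived from the data itself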

## Attention

- improvement on the RNN encoder-decoder, whose single context vector is a bottleneck
- allows the decoder to look back into the input sequence
- the encoder passes ALL hidden states to the decoder (vs. ONLY the last hidden state)
- the decoder scores the hidden states -> amplifying the relevant ones and drowning out the rest

![image.png](attachment:ea727d4b-5901-4386-8676-4115f49e2c9f.png)

Source: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
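The scoring step described above can be sketched in a few lines. This is a minimal illustration, assuming dot-product scoring and equal dimensions for encoder and decoder states; the names `attend`, `encoder_states`, and `decoder_state` are hypothetical, and the numbers are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Score every encoder hidden state, then build a weighted context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)  # high scores are amplified, low scores drowned out
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per input token
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

The second encoder state is orthogonal to the decoder state, so it receives the smallest weight and contributes least to the context vector.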

## Self-attention

- allows the encoder to selectively look at other parts of the input sequence while encoding each token

![image.png](attachment:f74f2242-c353-4799-a222-934afb5e6a78.png)

Source: https://jalammar.github.io/illustrated-transformer/
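A stripped-down version of self-attention can be written as follows. The sketch assumes scaled dot-product scoring and, for brevity, uses the token vectors themselves as queries, keys, and values; a real transformer first applies learned projection matrices and uses several attention heads. The function name `self_attention` and the toy vectors are assumptions made for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Each token attends to every token of the same sequence, itself included."""
    d = len(tokens[0])
    outputs = []
    for query in tokens:
        # Scaled dot-product scores between this token and all tokens.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in tokens]
        weights = softmax(scores)
        # The new representation is a weighted average of all token vectors.
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(tokens)
for vector in outputs:
    print(vector)
```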

## Transformer

- self-attention on the encoder side
- attention + self-attention on the decoder side

![image.png](attachment:47cdb3f2-57ee-4809-aca7-03d1cd0e9f7c.png)

Source: https://jalammar.github.io/illustrated-transformer/

## GPT (Generative Pre-trained Transformer)

- transformer decoder
- semi-supervised learning
  - use unlabeled data to compute word/phrase-level statistics
  - use these statistics as features for supervised learning
  
- pre-training
  - input: text
  - create windows (e.g. 5 words as input, 6th word: label)
  
![image.png](attachment:8067396d-3656-49df-8ec1-a9bfd3685fc4.png)

Source: https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
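The windowing idea from the pre-training bullets can be sketched as below. This is a simplified illustration, assuming whitespace-separated words and a fixed window of 5; GPT itself works on subword tokens and much longer contexts. The function name `make_windows` is hypothetical.

```python
def make_windows(text, context_size=5):
    """Slide a window over the text: `context_size` words are the input, the next word is the label."""
    words = text.split()
    examples = []
    for i in range(len(words) - context_size):
        context = words[i:i + context_size]
        label = words[i + context_size]
        examples.append((context, label))
    return examples

examples = make_windows("the quick brown fox jumps over the lazy dog")
print(examples[0])    # (['the', 'quick', 'brown', 'fox', 'jumps'], 'over')
print(len(examples))  # 4
```

No human annotation is needed: the labels come from the text itself, which is what makes this training setup self-supervised.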

## GPT-2

multitask learning 
- fine-tuning: training on 1000s of supervised labels
- few-shot: K = 10-100
- one-shot: K = 1
- zero-shot: natural language description of the task
  - e.g.: “Q: What is the {language} translation of {sentence} A: {translation}.”
  
![image.png](attachment:0ca3d58a-2e42-487b-96e0-effcfd59f3e8.png)

Source: https://life-extension.github.io/2020/05/27/GPT%E6%8A%80%E6%9C%AF%E5%88%9D%E6%8E%A2/language-models.pdf
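The difference between zero-, one-, and few-shot settings is only the number K of solved examples placed in the prompt. The sketch below assumes a simple `Q:`/`A:` layout like the examples earlier in these notes; the helper `build_prompt`, the task description, and the translation pairs are made up for illustration and do not call any model.

```python
def build_prompt(task_description, examples, query):
    """Assemble a prompt: task description, K solved examples, then the open question."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"Q: {source}")
        lines.append(f"A: {target}")
    lines.append(f"Q: {query}")
    lines.append("A:")  # the model is expected to continue from here
    return "\n".join(lines)

task = "Translate English to French."
demos = [("cheese", "fromage"), ("house", "maison")]

zero_shot = build_prompt(task, [], "dog")        # K = 0
one_shot = build_prompt(task, demos[:1], "dog")  # K = 1
few_shot = build_prompt(task, demos, "dog")      # K = 2 here

print(few_shot)
```

In all three cases the model weights stay unchanged; only the text it is conditioned on differs, in contrast to fine-tuning.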

## GPT-3

Like GPT-2, but bigger.

- GPT: 117M parameters
- GPT-2: 1,542M parameters (about 1.5B)
- GPT-3: 175,000M parameters (175B)

## BERT (Bidirectional Encoder Representations from Transformers)

- transformer encoder
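- a transformer language model such as GPT reads left to right, so it is not bidirectional
- BERT is richly bidirectional: each token attends to context on both sides
- input: the full token sequence of the sentence
- masks out 15% of the tokens
- uses self-attention to predict the missing words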

![image.png](attachment:270b1b80-686a-456e-b728-c849b43a8564.png)

Source: https://aclanthology.org/N19-1423/

# Relevant sources

## Collect examples

- [x] [https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html](https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html)
- [x] [https://medium.com/swlh/gpt-3-understands-nothing-1d6f6a13cab2](https://medium.com/swlh/gpt-3-understands-nothing-1d6f6a13cab2)

## Understand the technical details

### Papers

- [x] [Attention is All you Need](https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html), 2017 - Transformer model
- [ ] [Deep Contextualized Word Representations](https://aclanthology.org/N18-1202), 2018 - ELMo, context-aware embeddings
- [x] [Improving Language Understanding by Generative Pre-Training](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf), 2018 - GPT model
- [x] [Language Models are Unsupervised Multitask Learners](https://life-extension.github.io/2020/05/27/GPT%E6%8A%80%E6%9C%AF%E5%88%9D%E6%8E%A2/language-models.pdf), 2019 - GPT-2 model
- [x] [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://aclanthology.org/N19-1423), 2019 - BERT
- [x] [Language Models are Few-Shot Learners](https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html), 2020 - GPT-3

### Other sources

- [x]  Deep Learning State of the Art: [https://www.youtube.com/watch?v=53YvP6gdD7U&t=432s&ab_channel=LexFridman](https://www.youtube.com/watch?v=53YvP6gdD7U&t=432s&ab_channel=LexFridman)
- [x]  Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention): [https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)
- [x]  The Illustrated Transformer: [https://jalammar.github.io/illustrated-transformer/](https://jalammar.github.io/illustrated-transformer/)
- [ ]  The Illustrated Word2Vec: [https://jalammar.github.io/illustrated-word2vec/](https://jalammar.github.io/illustrated-word2vec/)
- [x]  Jay Alammar: The Narrated Transformer Language Model: [https://www.youtube.com/watch?v=-QH8fRhqFHM&t=1s&ab_channel=JayAlammar](https://www.youtube.com/watch?v=-QH8fRhqFHM&t=1s&ab_channel=JayAlammar)
- [x]  Jay Alammar: GPT-3 video: [https://www.youtube.com/watch?v=MQnJZuBGmSQ&ab_channel=JayAlammar](https://www.youtube.com/watch?v=MQnJZuBGmSQ&ab_channel=JayAlammar)
- [x]  BME NLP lecture: [https://github.com/bmeaut/python_nlp_2021_spring/blob/main/lectures/09_Transformers_BERT.ipynb](https://github.com/bmeaut/python_nlp_2021_spring/blob/main/lectures/09_Transformers_BERT.ipynb)
- [x]  TUW NLP lecture: [https://github.com/tuw-nlp-ie/tuw-nlp-ie-2022WS/blob/main/lectures/05_DL_PR/deep-learning-practical-lesson.ipynb](https://github.com/tuw-nlp-ie/tuw-nlp-ie-2022WS/blob/main/lectures/05_DL_PR/deep-learning-practical-lesson.ipynb)
- [ ]  Hugging Face: https://github.com/huggingface/transformers

## Learn more about attention - key element of transformers

- [ ] [Is Attention Interpretable?](https://aclanthology.org/P19-1282), 2019
- [ ] [Attention is not Explanation](https://aclanthology.org/N19-1357), 2019
- [ ] [Attention is not not Explanation](https://aclanthology.org/D19-1002), 2019

## Dig deeper into distributional semantics - bigger context of transformers

- [ ] ["Why Should I Trust You?": Explaining the Predictions of Any Classifier](https://dl.acm.org/doi/10.1145/2939672.2939778), 2016
- [ ] [Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead](http://arxiv.org/abs/1811.10154), 2019
- [ ] [What are the Goals of Distributional Semantics?](https://aclanthology.org/2020.acl-main.663), 2020
- [ ] [Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?](https://aclanthology.org/2020.acl-main.386), 2020
- [ ] [A Primer in BERTology: What We Know About How BERT Works](https://aclanthology.org/2020.tacl-1.54), 2020
- [ ] [On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜](https://dl.acm.org/doi/10.1145/3442188.3445922), 2021
- [ ] [On the Opportunities and Risks of Foundation Models](http://arxiv.org/abs/2108.07258), 2022

## Consolidate the findings

General issues in NLP-AI

- Big models have financial and resource costs (GPU power, CO2 emissions)
- Leaderboard-chasing trends
- "Problems of building safe superintelligence are made all the more difficult if the researchers, labs, companies, and countries developing advanced AI perceive themselves to be in an intense **winner-take-all race with each other**, since then each developer will face a strong **incentive to “cut corners”** so as to accelerate their development and deployment; this is part of the **problem of managing AI competition**", Allan Dafoe, https://www.allandafoe.com/opportunity
- Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure"

Some final thoughts about GPT-3, https://medium.com/swlh/gpt-3-understands-nothing-1d6f6a13cab2

- The data on which it is trained is data that we have produced.
- The connections that GPT-3 makes are reflective of the kind of society that we live in.
- GPT-3 is an algorithmic mirror to our own sad reality.
- GPT-3 reveals that lack of thinking, “unthinking,” is not merely a machine issue.
- “Mindless machines”.

Improvements

- Explainable models.
- Clean datasets.
- Transparent training.
- Who should enforce these?