### Problem 1

# Introduction
Many NLP models have HUGE numbers of parameters and are trained on VAST amounts of data. TopBots' [Leading NLP Language Models for 2020](https://www.topbots.com/leading-nlp-language-models-2020/) provides a small survey of some of the most popular ones.

Because of this, unless you're working in a research group with substantial resources, it is unlikely you will be training your models from scratch. The data requirements also mean that these pre-trained models are generally not specific to a particular problem domain or task.

Two key steps to building an effective NLP application are as follows:
1. Figure out how to leverage an existing Language Model to solve the problem at hand
2. Improve the performance by judicious training

In this homework, you will do the following:

- Begin by reviewing some NLP resources and tasks
- Learn how to use pre-trained Hugging Face models for Causal Language Modelling and Masked Language Modelling.
- Fine-tune a model to a particular corpus
- After that, you'll use some Language Model sampling methods similar to beam search to admire your handiwork
- We'll wrap up with some fun but important "semantic geometry" examples at the end of this assignment.


Please submit the results of this work to the [Prismia.chat](https://prismia.chat/projects/cba3b7ef-4b29-456d-985b-9f4bc5e495cb/edit-assignment/Prismia.chat) assignment using the same approach we have used before. Only copy code and results when asked or when they help support the narrative you create to demonstrate your learning and understanding.


# NLP Resources and Project Data Resources
**Q: There are so many different NLP tasks and resources. How can I learn about them and get started?**

A: The Hugging Face [Task Summary](https://huggingface.co/transformers/task_summary.html) includes descriptions of many common NLP tasks and high-level PyTorch- and TensorFlow-based code for running each type of task. The Open in Colab menu button will allow you to open this page as a Colab notebook using either the PyTorch or the TensorFlow framework. [The Big Table of Tasks](https://huggingface.co/transformers/examples.html#the-big-table-of-tasks) contains a similar listing of everyday NLP tasks with additional resources and code samples.

**Q: Are there any easy-to-use NLP datasets or other machine learning datasets that would be a good starting point for my project?**

A: The [Hugging Face datasets](https://huggingface.co/datasets) and [Tensorflow datasets](https://www.tensorflow.org/datasets) repositories contain large amounts of ready-to-use NLP data that you can easily import. Another good place to look for various data, all in a standardized format, is the [Fast.ai datasets](https://course.fast.ai/datasets) repository. Kaggle competition data is also a great place to get started since it allows you to compare your performance with world-class data modelers. AI and ML challenges are another great data resource. These also provide a good frame of reference for your efforts. (If someone finds or compiles a nice catalog of AI and ML challenges, please let us know so we can add a link to it.) **Note:** The non-Hugging Face dataset recommendations cover a lot more than NLP data.

**Q: Isn't Hugging Face a PyTorch library? Do I need to learn PyTorch to use it?**

A: No. While it is true that PyTorch is currently the primary framework used for NLP research, and many Hugging Face examples are PyTorch-specific, there are TensorFlow versions of many of their high-level components. See the Hugging Face [Task Summary](https://huggingface.co/transformers/task_summary.html) for some examples. This [Medium article](https://towardsdatascience.com/tensorflow-and-transformers-df6fceaf57cc) from James Briggs talks a bit about this issue and provides a no-nonsense, step-by-step sentiment analysis example. It also includes links to relevant articles on tf dataset configuration and optimizer configuration.


### ®[10] Task: Your own table of common NLP Applications
In a cell below, create a table that enumerates each of the common NLP tasks listed in [The Big Table of Tasks](https://huggingface.co/transformers/examples.html#the-big-table-of-tasks) at HuggingFace.co.

1. Your table should provide a brief description of each task in your own words.
2. An example of each task.
3. A specific dataset that would be appropriate for exploring three or more of these NLP tasks. Please include a link to a webpage that describes the dataset and the steps needed to download and begin to use it. Ideally, these steps would be appropriate for use in colab.


### Problem 2

# Pretrained language models
GPT2 is an example of a new high-performance Language Model. Please read this [article](https://openai.com/blog/better-language-models/) from its developers.


### ®[5] Task: What is GPT2
Write a short description of the GPT2 model, in general. Include details on an interesting application or example. Be sure to cite the paper that originally introduced GPT and a reference for your application or example.


### Problem 3

### ®[5] Task: Data and Society
Explain issues and concerns associated with the widespread adoption of these models.


### Problem 4

## Using pre-trained language models
Run the Colab-Notebook associated with the [language-modeling] task in [The Big Table of Tasks], and work through it section by section.

Adding these pip install commands at the beginning of the notebook will help you get started:

```python
! pip install -qq datasets
! pip install -qq transformers
```



### ®[10] Task: Language Modelling 101
1. Explain the difference between Causal Language Modelling and Masked Language Modelling.
2. Why are the perplexity scores different between the two tasks as implemented in the notebook?
3. Report and discuss the perplexity scores (using the fine-tuning validation data) pre-fine-tuning and post fine-tuning for both task types.
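
As a reminder of what the reported number means, perplexity is conventionally the exponential of the average per-token negative log-likelihood (the cross-entropy loss). The sketch below is a minimal illustration with made-up token probabilities; the function name `perplexity` and the values in `probs` are assumptions for this example, not code taken from the notebook.

```python
import math

def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood per token)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities a model assigned to each correct token.
probs = [0.25, 0.25, 0.25, 0.25]
ppl = perplexity([math.log(p) for p in probs])

# A model that always gives the correct token probability 1/4 is
# "as confused as" a uniform choice among 4 options.
print(f"Perplexity: {ppl:.2f}")
```

Lower perplexity means the model assigns higher probability to the held-out tokens it is scored on, so compare scores only when they are computed over the same kind of prediction target.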


### Problem 5

### ®[15] Task: Doing it your own way
1. Update the notebook to use a different model and a different training dataset (your choice).
2. Provide a summary of your results below. Include details on the autoencoder used, and other choices made.
3. What auto-encoder is used?
4. What fine-tuning training parameters did you use?
5. Include before and after fine-tuning perplexities.
6. Include example setups and outputs from each modelling task.


### Problem 6

### Tokenization
TensorFlow and HuggingFace.co both support several popular tokenization approaches. Modern subword tokenization strategies handle word stems and new, out-of-vocabulary words elegantly. Lena Voita's [exposition](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html#bpe) on Byte Pair Encoding (BPE) is a great way to get started. Extensions to BPE form the basis for the [Byte-level BPE](https://arxiv.org/pdf/1909.03341.pdf) and [WordPiece](https://paperswithcode.com/method/wordpiece) tokenizers used for the GPT and BERT models. These allow efficient encoding of Unicode characters and emojis, whereas many other popular tokenizers replace unknown characters and words with `<unk>` tokens.

Hugging Face provides [support and clear summaries](https://huggingface.co/transformers/tokenizer_summary.html) for each of these tokenization approaches and several other cutting-edge approaches, including [SentencePiece](https://paperswithcode.com/method/sentencepiece). If you are interested in working with languages that do not have a space between each word (like Chinese), you should read more about XLM and its generalization, SentencePiece.
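
To make the core BPE idea concrete, here is a minimal sketch of the merge-learning loop: repeatedly find the most frequent adjacent symbol pair and fuse it into a new symbol. The toy corpus, the `</w>` end-of-word marker, and the helper names `get_pair_counts` and `merge_pair` are assumptions chosen for illustration; this is not the implementation used by any particular library.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (made up): word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
# Start from characters plus an end-of-word marker.
vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # learn three merge rules
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # ties go to the first pair seen
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
print(list(vocab))
```

Because frequent character sequences such as a shared suffix become single symbols, an unseen word can still be split into known pieces rather than mapped to an unknown token.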


### ®[5] Task: Pretrained model gotchas
1. When working with pre-trained models, why do you need to use the same tokenizer that was originally used to create the pre-trained model? What would happen if you did not?
2. In practice, word vectors are stored in an embedding layer in a neural network. Fine-tuning models with enough data improves performance. Explain why retraining word vectors may hurt our model if our training dataset is small and includes limited vocabulary.


### Problem 7

## Text Generation
Thus far in the course, we have seen examples of Monte-Carlo sampling, greedy search, and Beam Search to look at different model outputs. [The Big Table of Tasks](https://huggingface.co/transformers/examples.html#the-big-table-of-tasks) **text-generation** notebook introduces two other approaches.

Please open this notebook, work through it cell by cell, and experiment with it until you understand its behavior.
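
As a refresher on the two baseline strategies mentioned above, the following sketch contrasts greedy selection with Monte-Carlo sampling for a single decoding step. The four-token vocabulary and the logits are made-up values, and `softmax` is a hypothetical helper written for this example; no real model is involved.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["cat", "dog", "car", "tree"]  # hypothetical vocabulary
logits = [2.0, 1.5, 0.5, -1.0]          # hypothetical next-token scores

probs = softmax(logits)

# Greedy search: always pick the single most probable token.
greedy = tokens[probs.index(max(probs))]

# Monte-Carlo sampling: draw tokens in proportion to their probability.
rng = random.Random(0)  # seeded so the run is repeatable
samples = rng.choices(tokens, weights=probs, k=1000)
counts = {t: samples.count(t) for t in tokens}

print("greedy:", greedy)
print("sample counts:", counts)
```

Greedy selection returns the same token every time, while sampling occasionally picks low-probability tokens, which is the trade-off the methods in the notebook try to manage.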


### ®[15] Task: Greedy Stochastic Search
In your own words, explain the following as clearly as you can:
1. The scheme introduced that references [Paulus et al. (2017)](https://arxiv.org/abs/1705.04304) and [Klein et al. (2017)](https://arxiv.org/abs/1701.02810)
2. The method of [Fan et al. (2018)](https://arxiv.org/pdf/1805.04833.pdf)
3. The final method attributed to [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751)

Include the relevant Hugging Face API call. Include examples that are different from the ones given, +1 if they are amusing!


### Problem 8

# Word-Embeddings


In this part of the assignment, you explore the crazy world of semantic geometry.

The [Tensorflow Embedding Projector](https://projector.tensorflow.org/) allows one to project high-dimensional word embeddings into a lower-dimensional space.

[The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings](https://doi.org/10.1177%2F0003122419877135) explores word embedding models through time.


[Embedding Projector](https://projector.tensorflow.org/)
### ®[10] Task: A manual for new projectionists
1. In your own words, clearly explain each projection type that is supported.
2. Explain the other controls and give an example of how to use each one.

You are welcome to use annotated screenshot(s) or other methods to do this concisely and efficiently.


### Problem 9

[Embedding Projector](https://projector.tensorflow.org/)
Use the Embedding Projector to find a [polysemous word](https://en.wikipedia.org/wiki/Polysemy) where similar words (according to cosine similarity) have multiple meanings.

For example, the word "spring" can be associated with both "flower" and "suspension." The word "tie" can have associations with both "shirt" and "football."

You may need to try several polysemous word candidates before you find one that works.
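
For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below ranks neighbours of a query word with it; the three-dimensional vectors are made-up numbers chosen so that "spring" sits between two senses, not real embeddings, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-d vectors for illustration only.
embeddings = {
    "spring": [0.6, 0.6, 0.1],
    "flower": [0.9, 0.1, 0.0],
    "suspension": [0.1, 0.9, 0.1],
    "football": [0.0, 0.1, 0.9],
}

query = "spring"
ranked = sorted(
    (w for w in embeddings if w != query),
    key=lambda w: cosine_similarity(embeddings[query], embeddings[w]),
    reverse=True,
)

for word in ranked:
    score = cosine_similarity(embeddings[query], embeddings[word])
    print(f"{word}: {score:.3f}")
```

In this toy setup both "flower" and "suspension" score well above "football", which is the pattern to look for among a polysemous word's nearest neighbours.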

### ®[5] Task: Polysemous Word Hunt
1. Please provide setup and output for polysemous word(s) you discover.
2. For each set of words, describe the multiple meanings that occur.
3. Why do you think some polysemous words don't work?


### Problem 10

Embedding vectors have been shown to sometimes exhibit the ability to solve analogies.

For example, consider the analogy "man is to king as woman is to what?" The "man is to king" relationship can be represented geometrically by the displacement vector between the word embeddings for man and king:

$\displaystyle v_{\text{man}} + (v_{\text{king}} - v_{\text{man}}) = v_{\text{king}}$

The analogous relationship can then often be found by adding this displacement to the woman word embedding:

$\displaystyle v_{\text{woman}} + (v_{\text{king}} - v_{\text{man}}) \approx v_{\text{queen}}$

The relationship is only approximate in that $v_{\text{queen}}$ will be found near the point

$$v_{\text{woman}} + (v_{\text{king}} - v_{\text{man}})$$
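
The vector arithmetic above can be sketched in a few lines. The two-dimensional vectors here are made up so that the analogy holds by construction (the first coordinate loosely plays the role of gender, the second royalty); real embeddings have hundreds of dimensions and only approximate this behavior. The helper name `cosine_similarity` and the extra word "apple" are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 2-d embeddings for illustration only.
vectors = {
    "man": [1.0, 0.1],
    "woman": [-1.0, 0.1],
    "king": [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "apple": [0.1, -1.0],
}

# target = v_woman + (v_king - v_man)
target = [
    w + (k - m)
    for w, k, m in zip(vectors["woman"], vectors["king"], vectors["man"])
]

# As is common practice, exclude the three input words from the search.
candidates = [w for w in vectors if w not in {"man", "king", "woman"}]
answer = max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

print("man:king :: woman:" + answer)
```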

### ®[5] Task: Word2Vec Analogies
Using an online tool (e.g., [http://bionlp-www.utu.fi/wv_demo/](http://bionlp-www.utu.fi/wv_demo/) or [https://lamyiowce.github.io/word2viz/](https://lamyiowce.github.io/word2viz/)) or your own Colab notebook, find three word embedding analogies that work. In your solution, please state the full analogy in the form **man:king :: woman:queen** and provide plots, diagrams, and/or calculations that support your assertion. If the analogy is complicated, explain why the analogy holds in one or two sentences.

You may have to try several analogies to find one that works!


### Problem 11

[The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings](https://doi.org/10.1177%2F0003122419877135)
### ®[5] Task: Cultural Artifacts
Identify several biases that this paper explores.
(Word-Embeddings)


### Problem 12

[The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings](https://doi.org/10.1177%2F0003122419877135)
### ®[2] Task: OK, Boomer
Describe how one of these biases changes over time.


### Problem 13

### ®[3] Task: Unsupervised $=$ Un-biased?
An interviewer from the ChatBots-R-Us company asks you the following question during an interview, "We know our models learn unintentional biases because we present them with a lot of customer features, in addition to customer dialog. We think using unsupervised learning techniques will help improve the situation. What do you think?"

Please write up your response in the cell below.


### Problem 14

### ®[2] Task: Axis Understanding
Use the Embedding Projector "Custom Option" or another tool to create an illustration similar to one found in the paper.  Document that you have done this, and explain the setup.


### Problem 15

### ®[3] Task: Embeddings Variation
Alice and Bob have used different Word2Vec algorithm implementations to obtain word embeddings on the same corpus and vocabulary of words $V$.  

In particular, for every word $w$, Alice has obtained ‘context’ vectors $u^A_w$ and ‘center’ vectors $v^A_w$, and Bob has obtained ‘context’ vectors $u^B_w$ and ‘center’ vectors $v^B_w$ for every word. 

Suppose that, for every pair of words $w , w'\in V$, the inner product is the same in both Alice and Bob’s model: 

$(u^A_w)^Tv^A_{w'}= (u ^B_w)^T v^B_{w'}$

Does it follow that, for every word $w \in V, v^A_w = v^B_w$ ?  Why or why not?


### Problem 16