### Problem 1

# Introduction
Many NLP models have HUGE numbers of parameters and are trained on VAST amounts of data. TopBot's [Leading NLP Language Models for 2020](https://www.topbots.com/leading-nlp-language-models-2020/) provides a small survey of some of the most popular ones.

Because of this, unless you're working in a research group with substantial resources, it is unlikely you will be training your models from scratch. The data requirements also mean that these pre-trained models are generally non-specific to a particular problem domain or task.

Two key steps to building an effective NLP application are as follows:
1. Figure out how to leverage an existing Language Model to solve the problem at hand
2. Improve the performance by judicious training

In this homework, you will do the following:

- Begin by reviewing some NLP resources and tasks
- Learn how to use pre-trained Hugging Face models for Causal Language Modelling and Masked Language Modelling.
- Fine-tune a model to a particular corpus
- After that, you'll use some Language Model sampling methods similar to beam search to admire your handiwork
- We'll wrap up with some fun but important "semantic geometry" examples at the end of this assignment.


Please submit the results of this work to the [Prismia.chat](https://prismia.chat/projects/cba3b7ef-4b29-456d-985b-9f4bc5e495cb/edit-assignment/Prismia.chat) assignment using the same approach we have used before. Only copy code and results when asked or when they help support the narrative you create to demonstrate your learning and understanding.


# NLP Resources and Project Data Resources
**Q: There are so many different NLP tasks and resources. How can I learn about them and get started?**

A: The Hugging Face [Task Summary](https://huggingface.co/transformers/task_summary.html) page includes descriptions of many common NLP tasks and high-level PyTorch and TensorFlow based code for running each type of task. The Open in Colab menu button will allow you to open this page as a colab notebook using either the PyTorch or the TensorFlow framework. [The Big Table of Tasks](https://huggingface.co/transformers/examples.html#the-big-table-of-tasks) contains a similar listing of common NLP tasks with additional resources and code samples.

**Q: Are there any easy-to-use NLP datasets or other machine learning datasets that would be a good starting point for my project?**

A: The [Hugging Face datasets](https://huggingface.co/datasets) and [Tensorflow datasets](https://www.tensorflow.org/datasets) repositories contain large amounts of ready-to-use NLP data that you can easily import. Another good place to look for various data, all in a standardized format, is the [Fast.ai datasets](https://course.fast.ai/datasets) repository. Kaggle competition data is also a great place to get started since it allows you to compare your performance with world-class data modelers. AI and ML challenges are another great data resource. These also provide a good frame of reference for your efforts. (If someone finds or compiles a nice catalog of these AI and ML challenges, please let us know so we can add a link to it.) **Note:** The non-Hugging Face dataset recommendations cover a lot more than NLP data.

**Q: Isn't Hugging Face a PyTorch library? Do I need to learn PyTorch to use it?**

A: No. While it is true that PyTorch is currently the primary framework used for NLP research, and many Hugging Face examples are PyTorch specific, there are TensorFlow versions of many of their high-level components. See the Hugging Face [Task Summary](https://huggingface.co/transformers/task_summary.html) for some examples. This [Medium article](https://towardsdatascience.com/tensorflow-and-transformers-df6fceaf57cc) from James Briggs talks a bit about this issue and provides a no-nonsense, step-by-step sentiment analysis example. It also includes links to relevant articles on tf dataset configuration and optimizer configuration.


### ®[10] Task: Your own table of common NLP Applications
In a cell below, create a table that enumerates each of the common NLP tasks listed in [The Big Table of Tasks](https://huggingface.co/transformers/examples.html#the-big-table-of-tasks) at HuggingFace.co.

1. Your table should provide a brief description of each task in your own words.
2. An example of each task.
3. A specific dataset that would be appropriate for exploring three or more of these NLP tasks. Please include a link to a webpage that describes the dataset and the steps needed to download and begin to use it. Ideally, these steps would be appropriate for use in colab.


1. **language-modeling**, predict next token or a masked token in a sequence, Example: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb, Dataset: https://github.com/huggingface/datasets/tree/master/datasets/wikitext
2. **multiple-choice**, select most plausible choice in a multiple-choice question, Example: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/multiple_choice.ipynb, Dataset: https://github.com/huggingface/datasets/tree/master/datasets/swag
3. **question-answering**, extract the answer to a question from a given context, Example: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb, Dataset: https://github.com/huggingface/datasets/tree/master/datasets/squad_v2
4. **summarization**, generate text that summarizes the given text, Example: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/summarization.ipynb, Dataset: https://github.com/huggingface/datasets/tree/master/datasets/xsum
5. **text-classification**, classify a sequence into a category, such as sentiment (usually 2 categories), Example: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb, Dataset: https://github.com/huggingface/datasets/tree/master/datasets/glue
6. **text-generation**, generate a long text sequence given a start sequence, Example: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb, Dataset: None
7. **token-classification**, predict a label for each token in a given sequence, Example: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb, Dataset: https://github.com/huggingface/datasets/tree/master/datasets/conll2003
8. **translation**, translate sequences between 2 different languages, Example: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/translation.ipynb, Dataset: http://www.statmt.org/wmt16/

### Problem 2

# Pretrained language models
GPT2 is an example of a new high-performance Language Model. Please read this [article](https://openai.com/blog/better-language-models/) from its developers.


### ®[5] Task: What is GPT2
Write a short description of the GPT2 model, in general. Include details on an interesting application or example. Be sure to cite the paper that originally introduced GPT and a reference for your application or example.


GPT2 is a transformer language model trained without task-specific supervision on a large-scale dataset. It can perform tasks such as text generation, question answering, translation, summarization, and reading comprehension. In the question-answering examples, the model shows an interesting pattern: it does not correctly name the biggest state in America, but it gives an answer that many people would give. This suggests that GPT2 picks up human-like associations from its training data and makes mistakes like the ones humans make.

Citation: Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David, Amodei, Dario and Sutskever, Ilya. "Language Models are Unsupervised Multitask Learners." OpenAI (2019).

Reference for application: https://openai.com/blog/better-language-models/#task3

### Problem 3

### ®[5] Task: Data and Society
Explain issues and concerns associated with the widespread adoption of these models.


The authors are afraid of malicious use of this model. There are concerns about large language models being used to generate deceptive, biased, or abusive language at scale. Safety concerns are always an issue with machine learning and AI.

### Problem 4

## Using pre-trained language models
Run the Colab-Notebook associated with the **`[language-modeling]`** task in [The Big Table of Tasks](https://huggingface.co/transformers/examples.html#the-big-table-of-tasks), and work through it section by section. (The Big Table of Tasks has Tasks listed row by row,  where the name of each task is shown in the first column and the Colab notebook associated with each task is listed in the last column.)

Adding these pip install commands at the beginning of the notebook will help you get started:

```python
! pip install -qq datasets
! pip install -qq transformers
```



[(language_modeling.ipynb)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)
### ®[10] Task: Language Modelling 101
1. Explain the difference between Causal Language Modelling and Masked Language Modelling.
2. Why are the perplexity scores different between the two tasks as implemented in the notebook?
3. Report and discuss the perplexity scores (using the fine-tuning validation data) pre-fine-tuning and post fine-tuning for both task types.


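1. Causal language modeling predicts the next token from the preceding tokens only. Masked language modeling predicts masked tokens using the rest of the sentence, that is, context on both sides of the mask.
2. The MLM task is easier than the CLM task: the model sees context on both sides of each masked token, so it is less uncertain about the answer and its perplexity is lower.
3. After fine-tuning, the CLM model has a perplexity score of 38.17 and the MLM model has a perplexity score of 6.37.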
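
To make the perplexity scores above concrete, here is a minimal sketch, assuming the standard definition of perplexity as the exponential of the average per-token negative log-likelihood. The two toy "models" and their probabilities are invented for illustration and are not taken from the notebook.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per predicted token)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical model A: always spreads probability evenly over 4 candidates.
uniform_log_probs = [math.log(0.25)] * 10
# Hypothetical model B: assigns probability 0.5 to every correct token.
confident_log_probs = [math.log(0.5)] * 10

ppl_uniform = perplexity(uniform_log_probs)
ppl_confident = perplexity(confident_log_probs)

# A lower perplexity means the model is, on average, less "surprised".
print(f"uniform model:   {ppl_uniform:.2f}")
print(f"confident model: {ppl_confident:.2f}")
```

A perplexity of k can be read as "the model is as uncertain as if it were choosing uniformly among k tokens", which is one way to interpret the gap between the CLM and MLM scores.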

### Problem 5

[(language_modeling.ipynb)](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)

**Instructor clarification:** autoencoder = actual model class for each task. [AutoModelForCausalLM](https://huggingface.co/transformers/model_doc/auto.html#automodelforcausallm) is just a wrapper class.

### ®[15*] Task: Doing it your own way
_*This problem is now entirely optional. If you do complete it, it will count up to 15 points towards a perfect score on the homework.  See Piazza for more details._
1. Update the notebook to use a different model and a different training dataset (your choice).
2. Provide a summary of your results below. Include details on the autoencoder used, and other choices made.
3. What tokenizer is used?
4. What fine-tuning training parameters did you use?
5. Include before and after fine-tuning perplexities.
6. Include example setups and outputs from each modelling task.


### Problem 6

### Tokenization
TensorFlow and HuggingFace.co both support several popular tokenization approaches. Modern subword tokenization strategies handle word stemming and new, out-of-vocabulary words in an elegant way. Lena Voita's [exposition](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html#bpe) on Byte Pair Encoding (BPE) is a great way to get started. Extensions to BPE form the basis for [Byte-level BPE](https://arxiv.org/pdf/1909.03341.pdf) and [Word Piece](https://paperswithcode.com/method/wordpiece) tokenizers used for the GPT and BERT models. These allow efficient encoding of Unicode characters and emojis, whereas many other popular tokenizers replace unknown characters and words with `<unk>` tokens.

Hugging Face provides [support and clear summaries](https://huggingface.co/transformers/tokenizer_summary.html) for each of these tokenization approaches and several other cutting-edge approaches, including [Sentence Piece](https://paperswithcode.com/method/sentencepiece). If you are interested in working with languages that do not have a space between each word (like Chinese), you should read more about XLM and its generalization, SentencePiece.
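
The core idea of BPE can be sketched in a few lines. This is a toy illustration only, assuming character-level starting symbols, no end-of-word marker, and an arbitrary deterministic tie-break; it is not how the Hugging Face tokenizers are implemented. The helper names and the tiny corpus are hypothetical.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly merging the most frequent adjacent pair."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=3)
segmented = encode("lowest", merges)  # "lowest" never appears in the corpus
print(merges)
print(segmented)
```

The word "lowest" is not in the toy corpus, yet it is still segmented into subwords learned from other words, which is how subword tokenizers avoid falling back to an unknown token.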


### ®[5] Task: Pretrained model gotchas
1. When working with pre-trained models, why do you need to use the same tokenizer that was originally used to create the pre-trained model? What would happen if you did not?
2. In practice, word vectors are stored in an embedding layer in a neural network. Fine-tuning models with enough data improves performance. Explain why retraining word vectors may hurt our model if our training dataset is small and includes limited vocabulary.


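1. If we use a different tokenizer, the same word may be split into different tokens or mapped to different token ids. The model's embedding layer was learned for the original ids, so the inputs it receives no longer mean what the model expects, and the detokenization of its outputs breaks as well.

2. If our dataset is small, its word distribution and the relations between words will be biased compared to the pre-trained model's dataset. For example, "bank" in the sense of the side of a river may not exist in our small dataset, so the relationships and similarities between words will change in our small dataset. In addition, only the vectors of words that occur in the small dataset get updated, so they drift away from the vectors of related words that do not occur. Hence, retraining the word vectors can hurt the performance of the pre-trained model and of our model.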
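
The tokenizer mismatch in answer 1 can be illustrated with a toy lookup. This is a sketch under invented assumptions: both vocabularies and the string placeholders standing in for embedding rows are hypothetical, and real tokenizers also differ in how they split words, not only in id assignment.

```python
# Hypothetical vocabularies: the same tokens, but assigned different ids.
vocab_pretrain = {"the": 0, "river": 1, "bank": 2, "<unk>": 3}
vocab_other = {"bank": 0, "the": 1, "river": 2, "<unk>": 3}

# Placeholder embedding rows learned during pre-training, indexed by vocab_pretrain ids.
embedding_rows = ["vec(the)", "vec(river)", "vec(bank)", "vec(<unk>)"]

def lookup(sentence, vocab):
    """Tokenize on whitespace, map tokens to ids, and fetch the embedding row for each id."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in sentence.split()]
    return [embedding_rows[i] for i in ids]

matched = lookup("the river bank", vocab_pretrain)
mismatched = lookup("the river bank", vocab_other)

print("matching tokenizer:  ", matched)
print("mismatched tokenizer:", mismatched)
```

With the mismatched vocabulary every token fetches a vector that was trained for a different word, so the model receives a scrambled sentence even though the input text is unchanged.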

### Problem 7

## Text Generation
Thus far in the course, we have seen examples of Monte-Carlo sampling, greedy search, and Beam Search to look at different model outputs. [The Big Table of Tasks](https://huggingface.co/transformers/examples.html#the-big-table-of-tasks) **text-generation** notebook introduces several other methods.

Please open this notebook, work through it cell by cell, and fiddle with it so that you can understand the behavior.


[(02_how_to_generate.ipynb)](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)
### ®[15] Task: Greedy Stochastic Search
In your own words explain the following as clearly as you can
1. The scheme introduced that references [Paulus et al. (2017)](https://arxiv.org/abs/1705.04304) and [Klein et al. (2017)](https://arxiv.org/abs/1701.02810)
2. The method of [Fan et al. (2018)](https://arxiv.org/pdf/1805.04833.pdf)
3. The final method attributed to [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751)

Include the relevant Hugging Face API call. Include examples that are different from the ones given, +1 if they are amusing!


In [3]:
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [4]:
input_ids = tokenizer.encode('I am a Brown DSI Student.', return_tensors='tf')
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

# now we have 5 output sequences (num_return_sequences=5)
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I am a Brown DSI Student. I am also a member of the Board of Trustees of Brown University.

I have a Bachelor of Science degree from the University of California, Berkeley. My research interests include the intersection of science and technology
1: I am a Brown DSI Student. I am also a member of the Board of Trustees of Brown University.

I have a Bachelor of Science degree from the University of California, Berkeley. My research interests include the intersection of science, technology
2: I am a Brown DSI Student. I am also a member of the Board of Trustees of Brown University.

I have a Bachelor of Science degree from the University of California, Berkeley. My research interests include the intersection of psychology, sociology
3: I am a Brown DSI Student. I am also a member of the Board of Trustees of Brown University.

I have a Bachelor of Science degree from the Universi

In [7]:
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=40
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I am a Brown DSI Student. I worked as a full time teaching assistant in 2009 when I was sent an honorary degree that allowed my personal life under my direct supervision. The program taught students a set of fundamentals about why they should not get promoted


In [10]:
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.9, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I am a Brown DSI Student. I also support the civil rights movement. In my class, I have talked with many Native American groups about their reservations. However, I never cared about my ethnicity, religious views, or any other factors. I


1. n-gram penalty (`no_repeat_ngram_size`): It sets the probability of a candidate next word to zero if that word would create an n-gram that has already appeared before the current position. In the beam search example above (`no_repeat_ngram_size=2`), the research interests at the end differ across the returned sequences.

2. Top-K sampling (`do_sample=True, top_k=40`): This gives some randomness to the next word instead of always taking the word with the highest probability. The sampling rescales the probability distribution over the top-k words and samples from it. The second example above generates a story about teaching and an honorary degree, which really surprised me but still stays within the academic domain. It even gives some interesting details in the last sentence.

3. Top-p (nucleus) sampling (`do_sample=True, top_p=0.9, top_k=0`): This is the same method as in 2 but with a different way of picking the set of words to resample from. Instead of picking a fixed number of top words, it picks the smallest set of top words whose cumulative probability reaches a threshold p. So the number of words that are sampled from is dynamic. In the third example above, it redirects the topic from academic content to a personal introduction, which is different from the result of top-k sampling. It seems more surprising.
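
The three schemes described above can be sketched on a toy next-word distribution. This is an illustrative approximation in plain Python, not the Hugging Face implementation; the function names and the probabilities are hypothetical, and the nucleus cut-off here uses "cumulative probability reaches p" as its stopping rule.

```python
import random

def banned_next_tokens(tokens, n):
    """Tokens that would repeat an n-gram already present in `tokens`."""
    if len(tokens) < n:
        return set()
    prefix = tuple(tokens[len(tokens) - (n - 1):])  # last n-1 generated tokens
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

def sample(probs, rng):
    """Draw one token according to the (already filtered) distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical next-word distributions after "I am a Brown DSI ...".
flat = {"student": 0.5, "teacher": 0.3, "dog": 0.15, "banana": 0.05}
peaked = {"student": 0.95, "teacher": 0.03, "dog": 0.02}

banned = banned_next_tokens(["I", "am", "a", "student", "I"], n=2)
top2 = top_k_filter(flat, k=2)
nucleus_flat = top_p_filter(flat, p=0.9)
nucleus_peaked = top_p_filter(peaked, p=0.9)

rng = random.Random(0)
print("banned after repeating 'I':", banned)
print("top-k candidates:", sorted(top2))
print("nucleus sizes:", len(nucleus_flat), len(nucleus_peaked))
print("sampled:", sample(nucleus_flat, rng))
```

The nucleus keeps more candidates for the flat distribution than for the peaked one, while top-k keeps the same number in both cases; that is the dynamic behavior described in item 3.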

### Problem 8

# Word-Embeddings


In this part of the assignment, you will explore the crazy world of semantic geometry.
- The [Tensorflow Embedding Projector](https://projector.tensorflow.org/) allows one to project high-dimensional word embeddings into a lower-dimensional space.


**Instructor Note:**
The initially released version of this assignment suggested this reading
- Original Text: [The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings](https://doi.org/10.1177%2F0003122419877135), which explores word embeddings through time.

However, an earlier version of this paper is easier to follow for the purposes of this assignment (the figures are located at the very end of the pdf).
- More Direct Text: [The Geometry of Culture: Analyzing Meaning through Word Embeddings](https://paperswithcode.com/paper/the-geometry-of-culture-analyzing-meaning)

If you have not already answered the questions using the Original Text, please use the More Direct Text.


[Embedding Projector](https://projector.tensorflow.org/)
### ®[10] Task: A manual for new projectionists
1. In your own words, clearly explain each projection type that is supported.
2. Explain the other controls and give an example of how to use each one.

You are welcome to use annotated screenshot(s) or other methods to do this concisely and efficiently.


1. A projection is a mapping from a high-dimensional space to a low-dimensional space that tries to be structure-preserving. Structure-preserving means that the original relationships between vectors in the high-dimensional space are kept, as far as possible, in the low-dimensional space. Because dimensions are discarded, the mapping is generally not injective and some information is lost. The projector supports UMAP, t-SNE, PCA, and custom projections. PCA finds k orthogonal components, each a linear combination of the original features, that capture the most variance. UMAP builds a manifold approximation of the data and lays it out in the low-dimensional space. t-SNE converts similarities between points into probabilities (using a t-distribution in the low-dimensional space) and positions the points by minimizing the Kullback-Leibler divergence between the similarity distributions of the original high-dimensional space and the low-dimensional space.

2. 

### Problem 9

[Embedding Projector](https://projector.tensorflow.org/)
Use the Embedding Projector to find a [polysemous word](https://en.wikipedia.org/wiki/Polysemy) where similar words (according to cosine similarity) have multiple meanings.

For example, the word "spring" can refer to either "flower" or "suspension." The word "tie" can have associations with both "shirt" and "football."

You may need to try several polysemous word candidates before you find one that works.

### ®[5] Task: Polysemous Word Hunt
1. Please provide setup and output for 3 polysemous word(s) you discover.
2. For each set of words, describe the multiple meanings that occur.
3. Why do you think some polysemous words don't work?


1. I used PCA with the Word2Vec dataset. My 3 words are fall, star, and window.
2. Fall can refer to the falling movement or a season. It has a close relationship to grow and falling. Star has a close relationship to trek and nba. It can refer to an astronomical object or a popular person in a field. Window has a close relationship to door and browser. It can refer to a window in a wall or a window in the browser.
3. Some polysemous words don't work because the Word2Vec 10K dataset may not include the multiple meanings of those polysemous words.

### Problem 10

Embedding vectors have been shown to sometimes exhibit the ability to solve analogies.

For example, to solve the analogy, "man is to king as woman is to what?".

The "man is to king" relationship can be represented geometrically by the displacement vector between the word embeddings of man and king.

$\displaystyle  v_{\text{man}} + (v_{\text{king}} - v_{\text{man}}) = v_{\text{king}}$

The analogous relationship can then often be found by adding this displacement to the woman word embedding:

$\displaystyle v_{\text{woman}} + (v_{\text{king}} - v_{\text{man}}) \approx v_{\text{queen}}$

The relationship is only approximate in that $v_{\text{queen}}$ will be found near the point

$$v_{\text{woman}} + (v_{\text{king}} - v_{\text{man}})$$
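
The displacement arithmetic above can be tried on toy vectors. This is a minimal sketch assuming hand-made 3-dimensional embeddings (the values and the extra word "apple" are invented for illustration); real Word2Vec vectors have hundreds of dimensions and the match is only approximate.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; the axes loosely mean (royalty, maleness, femaleness).
embeddings = {
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def analogy(a, b, c, emb):
    """Solve a:b :: c:? as the nearest cosine neighbor of v_c + (v_b - v_a)."""
    target = [vc + (vb - va) for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]  # exclude the query words
    return max(candidates, key=lambda w: cosine(emb[w], target))

answer = analogy("man", "king", "woman", embeddings)
print("man:king :: woman:" + answer)
```

Excluding the three query words from the candidates is a common convention in analogy demos, since the nearest neighbor of the target point is otherwise often one of the inputs.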

### ®[5] Task: Word2Vec Analogies
Using an online tool (e.g., [http://bionlp-www.utu.fi/wv_demo/](http://bionlp-www.utu.fi/wv_demo/) or [https://lamyiowce.github.io/word2viz/](https://lamyiowce.github.io/word2viz/)) or your own colab notebook, find three different word-embedding analogies that work.

Write up your findings below.  
1. Please state each full analogy in the following form: **man:king :: woman:queen** (in English this reads as "man is to king AS woman is to queen").
2. Provide plots, diagrams, and/or calculations that support your assertion.
3. If the analogy is complicated, explain why the analogy holds in one or two sentences.


You may have to try several analogies to find one that works!


1. *man:male :: woman:female*; *one:January :: two:February*; *male:father :: female:mother*
2. The similarities for the first group are 0.749 and 0.774, and female appears within the top 10 words for woman. The similarities for the second group are 0.401 and 0.467. The similarities for the third group are 0.771 and 0.738. The following graphs show that the relationship within each of these pairs matches that of the existing pairs in the tool's default analogies for its category.
3. The analogies hold because the two pairs have similar direction and similarity, so the displacement vectors of the two pairs are close to each other. Also, an overly complicated analogy can sometimes pair nearly orthogonal (unrelated) vectors, so it may match another unrelated pair.

![picture](https://drive.google.com/uc?id=1ei9IV-2jHV3P1vG-8hfRD4Cp79CMghrG)
![picture](https://drive.google.com/uc?id=1PChhPfEDKl_XC0YQ-IUDBG2eSM_1aabp)
![picture](https://drive.google.com/uc?id=1ONF9YxFmfBZZ0UM50jzV_yhwESy9Un11)

### Problem 11

Original Text: [The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings](https://doi.org/10.1177%2F0003122419877135)

More Direct Text: [The Geometry of Culture: Analyzing Meaning through Word Embeddings](https://paperswithcode.com/paper/the-geometry-of-culture-analyzing-meaning)

**Instructor note:** If you have not already answered this question using the Original Text, please use the More Direct Text.  Indicate which paper you have used.

### ®[5] Task: Cultural Artifacts
Identify several biases that this paper explores.


The examples given in the paper: “doctor” is consistently found to be more “white” than “black,” and “scientist” more “masculine” than “feminine.” These are racial and gender biases. The paper also mentions biases between social classes, such as rich and poor, including for jobs.

### Problem 12

Original Text: [The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings](https://doi.org/10.1177%2F0003122419877135)

More Direct Text: [The Geometry of Culture: Analyzing Meaning through Word Embeddings](https://paperswithcode.com/paper/the-geometry-of-culture-analyzing-meaning)

**Instructor note:** If you have not already answered this question using the Original Text, please use the More Direct Text.  Indicate which paper you have used.

### ®[2] Task: OK, Boomer
Describe how one of these biases changes over time.  Does it agree with your impressions of how the culture in the US has been changing over time?  If you are familiar with another country's culture, what result would you expect to see?


The word nurse moves away from female over the years, and engineer moves away from male, both only slightly. The same is happening in my country, since more girls have the chance to go to college and do well in physical science classes, and many colleges have started to admit boys to the nursing major.

### Problem 13

### ®[3] Task: Unsupervised $=$ Un-biased?
An interviewer from the ChatBots-R-Us company asks you the following question during an interview, "We know our models learn unintentional biases because we present them with a lot of customer features, in addition to customer dialog. We think using unsupervised learning techniques will help improve the situation. What do you think?"

Please write-up your response in the cell below.


It will still be biased. Even though we don't manually set the labels, the input to the chat robots is from customers, which will include bias. Since the input is biased, we cannot prevent the chat robot from learning unintentional biases.

### Problem 14

### ®[2] Task: Axis Understanding
Use the Embedding Projector "Custom Option" or another tool to create an illustration similar to one found in the paper.  Document that you have done this, and explain the setup.


I chose the custom projection and set masculine for left, feminine for right, poor for up, and rich for down. Here is the graph I generated, projected to a 2D space.

![picture](https://drive.google.com/uc?id=1pV9w4lSYSaA3Up88hoMxQur82ckayDA6)
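
A custom axis like the one in the setup above can be approximated by projecting each word vector onto the direction between two anchor words. This is a rough sketch assuming single anchor words per side and invented 3-dimensional embeddings; the Embedding Projector's own computation may differ in its details.

```python
import math

def axis(start, end):
    """Unit vector pointing from the `start` anchor to the `end` anchor."""
    direction = [e - s for s, e in zip(start, end)]
    norm = math.sqrt(sum(x * x for x in direction))
    return [x / norm for x in direction]

def project(vec, unit_axis):
    """Scalar coordinate of `vec` along `unit_axis`."""
    return sum(v * a for v, a in zip(vec, unit_axis))

# Hypothetical embeddings, invented for illustration.
emb = {
    "masculine": [-1.0, 0.0, 0.2],
    "feminine": [1.0, 0.0, 0.2],
    "poor": [0.0, 1.0, 0.2],
    "rich": [0.0, -1.0, 0.2],
    "yacht": [-0.3, -0.8, 0.5],
}

# Horizontal axis: masculine on the left, feminine on the right.
x_axis = axis(emb["masculine"], emb["feminine"])
# Vertical axis: rich at the bottom, poor at the top.
y_axis = axis(emb["rich"], emb["poor"])

x = project(emb["yacht"], x_axis)
y = project(emb["yacht"], y_axis)
print(f"yacht -> x={x:.2f} (negative = masculine side), y={y:.2f} (negative = rich side)")
```

Each word gets one coordinate per axis, so plotting the `(x, y)` pairs gives a 2D picture in which position along an axis reads as "closer to one anchor than the other".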

### Problem 15

### ®[3] Task: Embeddings Variation
Alice and Bob have used different Word2Vec algorithm implementations to obtain word embeddings on the same corpus and vocabulary of words $V$.  

In particular, for every word $w$, Alice has obtained ‘context’ vectors $u^A_w$ and ‘center’ vectors $v^A_w$, and Bob has obtained ‘context’ vectors $u^B_w$ and ‘center’ vectors $v^B_w$ for every word. 

Suppose that, for every pair of words $w , w'\in V$, the inner product is the same in both Alice and Bob’s model: 

$(u^A_w)^Tv^A_{w'}= (u ^B_w)^T v^B_{w'}$

Does it follow that, for every word $w \in V, v^A_w = v^B_w$ ?  Why or why not?


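No. Only the inner products are constrained, and many different choices of vectors produce the same inner products. For any invertible matrix $M$, setting $v^B_w = M v^A_w$ and $u^B_w = M^{-T} u^A_w$ gives $(u^B_w)^T v^B_{w'} = (u^A_w)^T M^{-1} M v^A_{w'} = (u^A_w)^T v^A_{w'}$, yet $v^B_w \neq v^A_w$ in general. So even when the quantities derived from these inner products agree, the center vectors themselves can differ between Alice's implementation and Bob's.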
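
A small numeric counterexample makes this concrete. The sketch below assumes a three-word vocabulary with invented 2-dimensional vectors for Alice, and builds Bob's vectors by doubling every center vector and halving every context vector, which is one simple way to keep all inner products fixed.

```python
import math

def dot(a, b):
    """Inner product of two vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical context (u) and center (v) vectors from Alice's model.
u_A = {"cat": [1.0, 2.0], "dog": [0.5, -1.0], "fish": [3.0, 0.0]}
v_A = {"cat": [2.0, 1.0], "dog": [-1.0, 4.0], "fish": [0.0, 0.5]}

# Bob's model: center vectors scaled by 2, context vectors scaled by 1/2.
v_B = {w: [2.0 * x for x in vec] for w, vec in v_A.items()}
u_B = {w: [0.5 * x for x in vec] for w, vec in u_A.items()}

# Every inner product u_w^T v_w' agrees between the two models...
same_products = all(
    math.isclose(dot(u_A[w], v_A[w2]), dot(u_B[w], v_B[w2]))
    for w in u_A
    for w2 in v_A
)
# ...but the center vectors themselves are different.
same_centers = all(v_A[w] == v_B[w] for w in v_A)

print("inner products match:", same_products)
print("center vectors match:", same_centers)
```

Scaling is only the simplest case; any invertible linear map applied to the center vectors, with the inverse transpose applied to the context vectors, leaves the inner products unchanged in the same way.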

### Problem 16