# Vector-space models: Static representations from contextual models

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2021"

## Contents

1. [Overview](#Overview)
1. [General set-up](#General-set-up)
1. [Loading Transformer models](#Loading-Transformer-models)
1. [The basics of tokenizing](#The-basics-of-tokenizing)
1. [The basics of representations](#The-basics-of-representations)
1. [The decontextualized approach](#The-decontextualized-approach)
  1. [Basic example](#Basic-example)
  1. [Creating a full VSM](#Creating-a-full-VSM)
1. [The aggregated approach](#The-aggregated-approach)
1. [Some related work](#Some-related-work)

## Overview



Can we get good static representations of words from models (like BERT) that supply only contextual representations? On the one hand, contextual models are very successful across a wide range of tasks, in large part because they are trained for a long time on a lot of data. This should be a boon for VSMs as we've designed them so far. On the other hand, the goal of having static representations might seem to be at odds with how these models process and represent examples. Part of the point of these models is to obtain different representations for words depending on the context in which they occur, and a hallmark of the training procedure is that it processes sequences rather than individual words.

[Bommasani et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.431) make a significant step forward in our understanding of these issues. Ultimately, they arrive at a positive answer: excellent static word representations can be obtained from contextual models. They explore two strategies for achieving this:

1. __The decontextualized approach__: just process individual words as though they were isolated texts. Where a word consists of multiple tokens in the model, pool them with a function like mean or max.
1. __The aggregated approach__: process lots and lots of texts containing the words of interest. As before, pool sub-word tokens, and also pool across all the pooled representations.
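
The difference between the two strategies can be sketched with toy vectors standing in for model outputs. Everything below (the vectors and the helper name `mean_pool`) is made up for illustration and is not taken from Bommasani et al.'s code:

```python
def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

# Decontextualized: the word is processed alone; pool its sub-word vectors.
word_alone = [[1.0, 0.0], [3.0, 2.0]]   # 2 sub-word tokens, 2 dimensions
decontextualized = mean_pool(word_alone)

# Aggregated: the word occurs in several texts. Pool the sub-word vectors
# within each occurrence, then pool across the occurrences.
occurrences = [
    [[1.0, 0.0], [3.0, 2.0]],           # occurrence in text 1
    [[2.0, 2.0], [4.0, 4.0]],           # occurrence in text 2
]
aggregated = mean_pool([mean_pool(occ) for occ in occurrences])

print(decontextualized)  # [2.0, 1.0]
print(aggregated)        # [2.5, 2.0]
```

In both cases the result is a single fixed vector per word; the strategies differ only in what is fed to the model before pooling.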

As Bommasani et al. say, the decontextualized approach "presents an unnatural input" – these models were not trained on individual words, but rather on longer sequences, so the individual words are infrequent kinds of inputs at best (and unattested as far as the model is concerned if the special boundary tokens [CLS] and [SEP] are not included). However, in practice, Bommasani et al. achieve very impressive results with this approach on word similarity/relatedness tasks.

The aggregated approach is even better, but it requires more work and involves more decisions relating to which texts are processed.

This notebook briefly explores both of these approaches, with the goal of making it easy for you to apply these methods in [the associated homework and bakeoff](hw_wordrelatedness.ipynb).

## General set-up



In [1]:
import os
import pandas as pd
import torch
from transformers import BertModel, BertTokenizer
from transformers import RobertaModel, RobertaTokenizer

import utils
import vsm

In [2]:
DATA_HOME = os.path.join('data', 'vsmdata')

In [3]:
utils.fix_random_seeds()

The `transformers` library does a lot of logging. To avoid ending up with a cluttered notebook, I am changing the logging level. You might want to skip this as you scale up to building production systems, since the logging is very good – it gives you a lot of insights into what the models and code are doing.

In [5]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.ERROR)

## Loading Transformer models

To start, let's get a feel for the basic API that `transformers` provides. The first step is specifying the pretrained parameters we'll be using:

In [2]:
bert_weights_name = 'bert-base-uncased'

There are lots of other options for pretrained weights. See [this Hugging Face directory](https://huggingface.co/models).

Next, we specify a tokenizer and a model that match both each other and our choice of pretrained weights:

In [3]:
bert_tokenizer = BertTokenizer.from_pretrained(bert_weights_name)

In [4]:
bert_model = BertModel.from_pretrained(bert_weights_name)

## The basics of tokenizing

It's illuminating to see what the tokenizer does to example texts:

In [36]:
example_text = "kjahsfouhasuidf aoipsufoiwauerlijg aposdfpok;lcs"

Simple tokenization:

In [37]:
bert_tokenizer.tokenize(example_text)

['k',
 '##jah',
 '##sf',
 '##ou',
 '##has',
 '##uid',
 '##f',
 'ao',
 '##ip',
 '##su',
 '##fo',
 '##i',
 '##wa',
 '##uer',
 '##li',
 '##j',
 '##g',
 'ap',
 '##os',
 '##df',
 '##po',
 '##k',
 ';',
 'lc',
 '##s']
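
The `##` prefix marks a piece that continues a word rather than starting one. A minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers use is given below. The function name `wordpiece` and the tiny `toy_vocab` are assumptions for illustration; the real tokenizer has a vocabulary of roughly 30,000 pieces and additional normalization steps:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]               # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"k", "##jah", "##sf", "puppy", "##s"}   # hypothetical vocabulary
print(wordpiece("kjahsf", toy_vocab))   # ['k', '##jah', '##sf']
print(wordpiece("puppys", toy_vocab))   # ['puppy', '##s']
print(wordpiece("zzz", toy_vocab))      # ['[UNK]']
```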

The `encode` method maps individual strings to indices into the underlying embedding used by the model:

In [38]:
ex_ids = bert_tokenizer.encode(example_text, add_special_tokens=True)

ex_ids

[101,
 1047,
 18878,
 22747,
 7140,
 14949,
 21272,
 2546,
 20118,
 11514,
 6342,
 14876,
 2072,
 4213,
 13094,
 3669,
 3501,
 2290,
 9706,
 2891,
 20952,
 6873,
 2243,
 1025,
 29215,
 2015,
 102]

In [60]:
len(ex_ids)

27

We can get a better feel for what these representations are like by mapping the indices back to "words":

In [39]:
bert_tokenizer.convert_ids_to_tokens(ex_ids)

['[CLS]',
 'k',
 '##jah',
 '##sf',
 '##ou',
 '##has',
 '##uid',
 '##f',
 'ao',
 '##ip',
 '##su',
 '##fo',
 '##i',
 '##wa',
 '##uer',
 '##li',
 '##j',
 '##g',
 'ap',
 '##os',
 '##df',
 '##po',
 '##k',
 ';',
 'lc',
 '##s',
 '[SEP]']
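
The round trip from tokens to ids and back, including the boundary tokens, can be mimicked with a plain dictionary. This is only a sketch: `toy_vocab`, `toy_encode`, and the small id values are invented here and do not match the real BERT vocabulary (where, as shown above, `[CLS]` is 101 and `[SEP]` is 102):

```python
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "k": 4, "##jah": 5}
id_to_token = {i: t for t, i in toy_vocab.items()}

def toy_encode(tokens, add_special_tokens=True):
    """Map tokens to ids, falling back to [UNK] for unseen tokens."""
    ids = [toy_vocab.get(t, toy_vocab["[UNK]"]) for t in tokens]
    if add_special_tokens:
        # Wrap the sequence in the boundary tokens the model was trained with.
        ids = [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]
    return ids

ids = toy_encode(["k", "##jah", "??"])
print(ids)                             # [2, 4, 5, 1, 3]
print([id_to_token[i] for i in ids])   # ['[CLS]', 'k', '##jah', '[UNK]', '[SEP]']
```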

In [41]:
bert_tokenizer.convert_ids_to_tokens([0,1,2,3,4,56])

['[PAD]', '[unused0]', '[unused1]', '[unused2]', '[unused3]', '[unused55]']

Those are all the essential ingredients for working with these parameters in Hugging Face. Of course, the library has a lot of other functionality, but the above suffices for our current application.

## Chinese BERT tokenizer

In [9]:
bert_weights_name = 'bert-base-chinese'

Here we switch to the `bert-base-chinese` weights. As before, we specify a tokenizer and a model that match both each other and our choice of pretrained weights. Note that these cells rebind `bert_tokenizer` and `bert_model`, so cells that assume `bert-base-uncased` need those objects reloaded first:

In [10]:
bert_tokenizer = BertTokenizer.from_pretrained(bert_weights_name)

In [11]:
bert_model = BertModel.from_pretrained(bert_weights_name)

### The basics of tokenizing

It's illuminating to see what the tokenizer does to example texts:

In [16]:
example_text = "证监会与公安部对此展开了联合调查。调查人员发现，惠发食品和嘉美包装背后都存在操盘方，而操盘方采取了截然不同的出货形式，惠发食品用了将近一周的时间完成全部出货，而嘉美包装采用的则是暴力出货，仅仅几分钟就完成了4亿多元的出货。"

In [18]:
!pip install jieba

In [24]:
import jieba
splitted_example_text = jieba.cut(example_text, cut_all=False)
# print(list(splitted_example_text))

In [25]:
splitted_example_text = " ".join(list(splitted_example_text))
bert_tokenizer.tokenize(splitted_example_text)

['证',
 '监',
 '会',
 '与',
 '公',
 '安',
 '部',
 '对',
 '此',
 '展',
 '开',
 '了',
 '联',
 '合',
 '调',
 '查',
 '。',
 '调',
 '查',
 '人',
 '员',
 '发',
 '现',
 '，',
 '惠',
 '发',
 '食',
 '品',
 '和',
 '嘉',
 '美',
 '包',
 '装',
 '背',
 '后',
 '都',
 '存',
 '在',
 '操',
 '盘',
 '方',
 '，',
 '而',
 '操',
 '盘',
 '方',
 '采',
 '取',
 '了',
 '截',
 '然',
 '不',
 '同',
 '的',
 '出',
 '货',
 '形',
 '式',
 '，',
 '惠',
 '发',
 '食',
 '品',
 '用',
 '了',
 '将',
 '近',
 '一',
 '周',
 '的',
 '时',
 '间',
 '完',
 '成',
 '全',
 '部',
 '出',
 '货',
 '，',
 '而',
 '嘉',
 '美',
 '包',
 '装',
 '采',
 '用',
 '的',
 '则',
 '是',
 '暴',
 '力',
 '出',
 '货',
 '，',
 '仅',
 '仅',
 '几',
 '分',
 '钟',
 '就',
 '完',
 '成',
 '了',
 '4',
 '亿',
 '多',
 '元',
 '的',
 '出',
 '货',
 '。']

In [26]:
splitted_example_text

'证监会 与 公安部 对此 展开 了 联合 调查 。 调查 人员 发现 ， 惠发 食品 和 嘉美 包装 背后 都 存在 操盘 方 ， 而 操盘 方 采取 了 截然不同 的 出货 形式 ， 惠发 食品 用 了 将近 一周 的 时间 完成 全部 出货 ， 而嘉美 包装 采用 的 则 是 暴力 出货 ， 仅仅 几分钟 就 完成 了 4 亿多元 的 出货 。'

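Simple tokenization of the raw, unsegmented text gives the same character-level tokens, so pre-segmenting with jieba makes no difference to this tokenizer's output here: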

In [17]:
bert_tokenizer.tokenize(example_text)

['证',
 '监',
 '会',
 '与',
 '公',
 '安',
 '部',
 '对',
 '此',
 '展',
 '开',
 '了',
 '联',
 '合',
 '调',
 '查',
 '。',
 '调',
 '查',
 '人',
 '员',
 '发',
 '现',
 '，',
 '惠',
 '发',
 '食',
 '品',
 '和',
 '嘉',
 '美',
 '包',
 '装',
 '背',
 '后',
 '都',
 '存',
 '在',
 '操',
 '盘',
 '方',
 '，',
 '而',
 '操',
 '盘',
 '方',
 '采',
 '取',
 '了',
 '截',
 '然',
 '不',
 '同',
 '的',
 '出',
 '货',
 '形',
 '式',
 '，',
 '惠',
 '发',
 '食',
 '品',
 '用',
 '了',
 '将',
 '近',
 '一',
 '周',
 '的',
 '时',
 '间',
 '完',
 '成',
 '全',
 '部',
 '出',
 '货',
 '，',
 '而',
 '嘉',
 '美',
 '包',
 '装',
 '采',
 '用',
 '的',
 '则',
 '是',
 '暴',
 '力',
 '出',
 '货',
 '，',
 '仅',
 '仅',
 '几',
 '分',
 '钟',
 '就',
 '完',
 '成',
 '了',
 '4',
 '亿',
 '多',
 '元',
 '的',
 '出',
 '货',
 '。']

## The basics of representations

To obtain the representations for a batch of examples, we use the `forward` method of the model, as follows:

In [8]:
with torch.no_grad():
    reps = bert_model(torch.tensor([[4,7,12,7658]]), output_hidden_states=True)

The return value `reps` is a special `transformers` class that holds a lot of representations. If we want just the final output representations for each token, we use `last_hidden_state`. The cell below instead displays `hidden_states[0]`, the output of the initial embedding layer, which has the same shape:

In [15]:
reps.hidden_states[0]

tensor([[[ 0.2842, -0.5367, -0.4848,  ..., -0.0969, -0.1886,  0.1269],
         [ 0.2704, -0.3225, -0.4385,  ...,  0.4503,  0.1778, -0.0451],
         [ 0.0424, -0.6269, -0.3040,  ...,  0.2810,  0.0090,  0.0894],
         [ 0.6871, -0.0913, -0.0771,  ...,  0.7845, -0.8323, -0.2015]]])

The displayed tensor has one row per input id: our batch has 1 example, with 4 tokens, and each token is represented by a vector of dimensionality 768.

Aside: Hugging Face `transformers` models also have a `pooler_output` value. For BERT, this corresponds to the output representation above the [CLS] token, which is often used as a summary representation for the entire sequence. However, __we cannot use `pooler_output` in the current context__, as `transformers` adds new randomized parameters on top of it, to facilitate fine-tuning. If we want the [CLS] representation, we need to use `reps.last_hidden_state[:, 0]`.

Finally, if we want access to the output representations from each layer of the model, we use `hidden_states`. This will be `None` unless we set `output_hidden_states=True` when using the `forward` method, as above. 

In [44]:
len(reps.hidden_states)

13

The length 13 corresponds to the initial embedding layer (layer 0) and the 12 layers of this BERT model.

Indexing the output by position also works: `reps[1]` is the `pooler_output` discussed in the aside above (output truncated):

In [48]:
reps[1]

tensor([[-0.8694, -0.4143, -0.9141,  0.8212,  0.5426, -0.2580,  0.8638,  0.3429,
         -0.7150, -1.0000, -0.4169,  0.8452,  0.9518,  0.6105,  0.8326, -0.7229,
         -0.1818, -0.5577,  0.3598, -0.4858,  0.5933,  1.0000, -0.0489,  0.2961,
          0.4730,  0.9345, -0.6415,  0.8481,  0.9246,  0.6877, -0.6744,  0.3293,
         -0.9671, -0.2723, -0.9139, -0.9864,  0.4015, -0.7263, -0.2565, -0.0587,
         -0.8344,  0.4093,  1.0000,  0.3756,  0.2735, -0.3851, -1.0000,  0.3255,
         -0.7660,  0.8908,  0.8023,  0.7308,  0.2716,  0.5238,  0.4384, -0.0625,
         -0.0445,  0.1650, -0.1997, -0.6719, -0.5889,  0.5072, -0.8344, -0.8773,
          0.8185,  0.7522, -0.1746, -0.3423, -0.0955, -0.0576,  0.8644,  0.1172,
         -0.2198, -0.7513,  0.6251,  0.2804, -0.7446,  1.0000, -0.5384, -0.9370,
          0.7525,  0.7090,  0.6679, -0.4018,  0.6880, -1.0000,  0.6759, -0.1727,
         -0.9737,  0.3173,  0.5694, -0.1877,  0.6774,  0.6837, -0.7263, -0.3219,
         ...

The final layer in `hidden_states` is identical to `last_hidden_state`:

In [16]:
reps.hidden_states[-1].shape

torch.Size([1, 4, 768])

In [17]:
torch.equal(reps.hidden_states[-1], reps.last_hidden_state)

True

## The decontextualized approach

As discussed above, Bommasani et al. (2020) define and explore two general strategies for obtaining static representations for words using a model like BERT. The simpler one involves processing individual words and, where they correspond to multiple tokens, pooling those token representations into a single vector using an operation like mean.

### Basic example

To begin to see what this is like in practice, we'll use the function `vsm.hf_encode`, which maps texts to their ids, taking care to use `unk_token` for texts that can't otherwise be processed by the model.

Where a word corresponds to just one token in the vocabulary, it will get mapped to a single id:

In [18]:
bert_tokenizer.tokenize('puppy')

['puppy']

In [19]:
vsm.hf_encode("puppy", bert_tokenizer)

tensor([[17022]])

As we saw above, some words map to multiple tokens:

In [56]:
bert_tokenizer.tokenize('snuffleupagus')

['s', '##nu', '##ffle', '##up', '##ag', '##us']

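In [64]:
subtok_ids = vsm.hf_encode("snuffleupagus", bert_tokenizer)

subtok_ids

Note that the word must be passed as a string. When it is wrapped in a list, as in `vsm.hf_encode(["snuffleupagus"], bert_tokenizer)`, the tokenizer appears to treat the list as an already-tokenized sequence and look the whole word up in the vocabulary directly, so the recorded result was `tensor([[100]])` (the `[UNK]` id) rather than six sub-word ids. The same single-id result shows up when calling the tokenizer directly:

In [66]:
bert_tokenizer.encode(
    ["snuffleupagus"],
    add_special_tokens=False,
    return_tensors='pt').shape[1]

1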

Next, the function `vsm.hf_represent` will map a batch of ids to their representations in a user-supplied model, at a specified layer in that model:

In [58]:
subtok_reps = vsm.hf_represent(subtok_ids, bert_model, layer=-1)

subtok_reps.shape

torch.Size([1, 6, 768])

The shape here: 1 example containing 6 (sub-word) tokens, each of dimension 768. With `layer=-1`, we obtain the final output representation from the entire model.

The final step is to pool together the 6 tokens. Here, we can use a variety of operations; [Bommasani et al. 2020](https://www.aclweb.org/anthology/2020.acl-main.431) find that `mean` is the best overall:

In [59]:
subtok_pooled = vsm.mean_pooling(subtok_reps)

subtok_pooled.shape

torch.Size([1, 768])

The function `vsm.mean_pooling` is simply `torch.mean` with `axis=1`. There are also predefined functions `vsm.max_pooling`, `vsm.min_pooling`, and `vsm.last_pooling` (representation for the final token).
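
To make the four pooling options concrete, here is a small pure-Python sketch that works on a single example given as a list of token vectors. The function `pool` and its `how` argument are hypothetical names introduced here; the `vsm` functions themselves operate on batched tensors along axis 1:

```python
def pool(token_vectors, how="mean"):
    """Collapse an [n_tokens][dim] list of vectors into one [dim] vector."""
    columns = list(zip(*token_vectors))   # one tuple of values per dimension
    if how == "mean":
        return [sum(c) / len(c) for c in columns]
    if how == "max":
        return [max(c) for c in columns]
    if how == "min":
        return [min(c) for c in columns]
    if how == "last":
        return list(token_vectors[-1])    # vector of the final sub-word token
    raise ValueError(f"unknown pooling: {how}")

subtoks = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]   # 3 tokens, 2 dimensions
print(pool(subtoks, "mean"))   # [2.0, 1.0]
print(pool(subtoks, "max"))    # [3.0, 5.0]
print(pool(subtoks, "min"))    # [1.0, -2.0]
print(pool(subtoks, "last"))   # [2.0, 5.0]
```

Whatever the pooling function, the output has one value per hidden dimension, so words with different numbers of sub-word tokens end up in the same vector space.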

### Creating a full VSM

Now we want to scale the above process to a large vocabulary, so that we can create a full VSM. The function `vsm.create_subword_pooling_vsm` makes this easy. To start, we get the vocabulary from one of our count VSMs (all of which have the same vocabulary):

In [73]:
vsm_index = pd.read_csv(
    os.path.join(DATA_HOME, 'yelp_window5-scaled.csv.gz'),
    usecols=[0], index_col=0)

In [74]:
vocab = list(vsm_index.index)

In [75]:
vocab[: 5]

['):', ');', '..', '...', ':(']

And then we use `vsm.create_subword_pooling_vsm`:

In [76]:
%%time
pooled_df = vsm.create_subword_pooling_vsm(
    vocab, bert_tokenizer, bert_model, layer=1)

Wall time: 3min 38s


The result, `pooled_df`, is a `pd.DataFrame` with its index given by `vocab`. This can be used directly in the word relatedness evaluations that are central to the homework and associated bakeoff.

In [28]:
pooled_df.shape

(6000, 768)

In [29]:
pooled_df.iloc[: 5, :5]

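|       | 0         | 1         | 2         | 3         | 4         |
|:------|----------:|----------:|----------:|----------:|----------:|
| `):`  | -0.576097 |  0.310341 | -0.532733 | -0.833050 | -0.626199 |
| `);`  | -0.056739 |  0.058793 | -0.243109 | -0.800296 | -0.119222 |
| `..`  | -0.271509 | -0.009211 | -0.190293 | -0.275234 | -0.276218 |
| `...` | -0.380597 | -0.054661 | -0.161327 | -0.299695 | -0.299188 |
| `:(`  | -0.425129 |  0.215213 | -1.130576 | -1.066704 | -0.371664 |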
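
Once every word has a row, relatedness between two words is typically scored by comparing their rows. The sketch below uses cosine distance on a toy three-word table; the vectors, the dictionary `toy_vsm`, and the helper names are assumptions for illustration and are unrelated to the values in `pooled_df`:

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def neighbors(word, table):
    """Other words in the table, sorted from closest to farthest."""
    others = [w for w in table if w != word]
    return sorted(others, key=lambda w: cosine_distance(table[word], table[w]))

toy_vsm = {
    "good":  [1.0, 0.0],
    "great": [2.0, 0.0],
    "bad":   [0.0, 3.0],
}
print(cosine_distance(toy_vsm["good"], toy_vsm["great"]))  # 0.0
print(cosine_distance(toy_vsm["good"], toy_vsm["bad"]))    # 1.0
print(neighbors("good", toy_vsm))                          # ['great', 'bad']
```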


This approach, and the associated code, should work generally for all Hugging Face Transformer-based models. Bommasani et al. (2020) provide a lot of guidance when it comes to how the model, the layer choice, and the pooling function interact.

## The aggregated approach

The aggregated approach is also straightforward to implement given the above tools. To start, we can create a map from vocabulary items into their sequences of ids:

In [30]:
vocab_ids = {w: vsm.hf_encode(w, bert_tokenizer)[0] for w in vocab}

Next, let's assume we have a corpus of texts that contain the words of interest:

In [31]:
corpus = [
    "This is a sailing example",
    "It's fun to go sailing!",
    "We should go sailing.",
    "I'd like to go sailing and sailing",
    "This is merely an example"]

The following embeds every corpus example, keeping `layer=1` representations:

In [32]:
corpus_ids = [vsm.hf_encode(text, bert_tokenizer)
              for text in corpus]

In [33]:
corpus_reps = [vsm.hf_represent(ids, bert_model, layer=1)
               for ids in corpus_ids]

Finally, we define a convenience function for finding all the occurrences of a sublist in a larger list:

In [34]:
def find_sublist_indices(sublist, mainlist):
    indices = []
    length = len(sublist)
    for i in range(0, len(mainlist)-length+1):
        if mainlist[i: i+length] == sublist:
            indices.append((i, i+length))
    return indices

For example:

In [35]:
find_sublist_indices([1,2], [1, 2, 3, 0, 1, 2, 3])

[(0, 2), (4, 6)]

And here's an example using our `vocab_ids` and `corpus`:

In [36]:
sailing = vocab_ids['sailing']

In [37]:
sailing_reps = []

for ids, reps in zip(corpus_ids, corpus_reps):
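    # Compare plain lists: `==` on multi-element tensors is elementwise and fails inside `if`.
    offsets = find_sublist_indices(sailing.tolist(), ids.squeeze(0).tolist())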
    for (start, end) in offsets:
        pooled = vsm.mean_pooling(reps[:, start: end])
        sailing_reps.append(pooled)

sailing_rep = torch.mean(torch.cat(sailing_reps), axis=0).squeeze(0)

In [38]:
sailing_rep.shape

torch.Size([768])
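
The same loop extends naturally from one word to a whole vocabulary. The following is a pure-Python sketch under toy assumptions: the ids and 2-dimensional "representations" are invented, `aggregate` and `mean_pool` are hypothetical helpers, and "sailing" is given two sub-word ids only to exercise the multi-token case:

```python
from collections import defaultdict

def find_sublist_indices(sublist, mainlist):
    """All (start, end) spans where sublist occurs in mainlist."""
    length = len(sublist)
    return [(i, i + length)
            for i in range(len(mainlist) - length + 1)
            if mainlist[i: i + length] == sublist]

def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def aggregate(vocab_ids, corpus_ids, corpus_reps):
    """Map each attested word to the mean of its pooled occurrence vectors."""
    occurrences = defaultdict(list)
    for ids, reps in zip(corpus_ids, corpus_reps):
        for word, word_ids in vocab_ids.items():
            for start, end in find_sublist_indices(word_ids, ids):
                # Pool the sub-word vectors of this one occurrence.
                occurrences[word].append(mean_pool(reps[start:end]))
    # Pool across all occurrences of each word.
    return {w: mean_pool(vs) for w, vs in occurrences.items()}

vocab_ids = {"sailing": [7, 8], "fun": [5], "absent": [99]}
corpus_ids = [[1, 7, 8], [5, 7, 8]]
corpus_reps = [
    [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]],
    [[4.0, 0.0], [2.0, 0.0], [4.0, 2.0]],
]
static = aggregate(vocab_ids, corpus_ids, corpus_reps)
print(static["sailing"])    # [2.5, 1.5]
print(static["fun"])        # [4.0, 0.0]
print("absent" in static)   # False
```

Words that never occur in the corpus get no vector in this sketch, so a real system would need a fallback for them, for instance the decontextualized representation.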

The above building blocks could be used as the basis for an original system and bakeoff entry for this unit. The major question is probably which data to use for the corpus.

## Some related work

1. [Ethayarajh (2019)](https://www.aclweb.org/anthology/D19-1006/) uses dimensionality reduction techniques (akin to LSA) to derive static representations from contextual models, and explores layer-wise variation in detail, with findings that are likely to align with your experiences using the above techniques.

1. [Akbik et al. (2019)](https://www.aclweb.org/anthology/N19-1078/) explore techniques similar to those of Bommasani et al. specifically for the supervised task of named entity recognition.

1. [Wang et al. (2020)](https://arxiv.org/pdf/1911.02929.pdf) learn static representations from contextual ones using techniques adapted from the word2vec model.