Word clustering for BERT
In transformer models like BERT, a word's embedding is determined by its linguistic context.
This demo visualizes the embeddings of the same word in different sentence contexts from Wikipedia (licensed under CC-BY-SA-3.0). Each point is the query word's embedding at the selected layer, projected into two dimensions using UMAP.
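The demo itself projects the per-context embeddings with UMAP. As a self-contained illustration of the same idea, here is a sketch that reduces a batch of mock 768-dimensional context vectors for one word to 2-D points; it deliberately substitutes a PCA-style SVD projection for UMAP so it runs with only NumPy, and the shapes and seed are assumptions, not values from this repo.

```python
import numpy as np

# Stand-in for the UMAP step: a PCA-style SVD projection illustrating how
# many high-dimensional context vectors for one word become 2-D points.
def project_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project (n_contexts, hidden_dim) embeddings to (n_contexts, 2)."""
    centered = embeddings - embeddings.mean(axis=0)
    # The top two right-singular vectors give the best 2-D linear projection.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Mock data: 300 contexts of one word, BERT-base hidden size 768.
rng = np.random.default_rng(0)
points = project_2d(rng.normal(size=(300, 768)))
print(points.shape)  # (300, 2)
```

Unlike this linear sketch, UMAP is a nonlinear method, which is why nearby points in the demo tend to form sense clusters rather than a single cloud.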
To get the data
For each word, there is a single JSON file. These JSON files contain the context sentences, parts of speech, and UMAP-projected embeddings of the word in various sentences (note: if the word appears multiple times in a sentence, the first instance is used). There are 200-1000 sentences per word.
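Since the exact schema of these files is not spelled out above, the snippet below sketches what one word's JSON file might look like and how to read it with the standard `json` module. The field names (`word`, `points`, `sentence`, `pos`, `coords`) are assumptions for illustration; the generated files may use different keys.

```python
import json

# Hypothetical example of one word's JSON file. The real field names may
# differ -- this only illustrates the described contents: context
# sentences, part-of-speech tags, and 2-D UMAP coordinates.
raw = """
{
  "word": "bank",
  "points": [
    {"sentence": "She sat on the bank of the river.", "pos": "NOUN",
     "coords": [1.3, -0.7]},
    {"sentence": "They bank with a local credit union.", "pos": "VERB",
     "coords": [-2.1, 0.4]}
  ]
}
"""

data = json.loads(raw)
for p in data["points"]:
    print(p["pos"], p["coords"])
```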
To get this data, either:
- Run the following, which will download the pregenerated data on Google Cloud to
- Download the raw data (Wikipedia) from Kaggle, then run the following, which will generate the data in
To run the demo
- Install dependencies:
- Start a local server that watches the demo for changes:
The demo can then be accessed at http://localhost:1234/
To also update the hosted JSON files with locally generated ones, run:
sh ./deploy.sh --upload_jsons
NB: This is not an official Google product.