This notebook is mainly used for demonstration.

### TODO list
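- [x] Decide how to present the demo.

- [x] Extract the tokens with the smallest and the largest embedding norms, and show that small-norm tokens are high-frequency and large-norm tokens are low-frequency.
- [x] For a low-frequency word, look up its norm in the embeddings and compute the 10 words with the closest cosine angle to it.
- [x] For a low-frequency word, take the embedding after training and compute the 10 words with the closest cosine angle to it.
- [ ] Use the steps above to show that the embeddings carry better semantics after recnn training.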

In [68]:
from transformers import AutoModel, AutoTokenizer

In [69]:
bert_model = AutoModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [70]:
original_embeddings = bert_model.embeddings.word_embeddings.state_dict()['weight']

In [71]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [72]:
original_norms = original_embeddings.norm(dim=1)

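## Extract the tokens with the smallest and largest norms, and show that small-norm tokens are high-frequency and large-norm tokens are low-frequency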

In [73]:
K = 10

# The K tokens with the largest norms
for token in tokenizer.convert_ids_to_tokens(original_norms.argsort(descending=True))[:K]:
    print(tokenizer.convert_tokens_to_string([token]))

670
##omba
##rdon
[CLS]
##anor
##lho
840
##lland
930
690


In [74]:
# The K tokens with the smallest norms
for token in tokenizer.convert_ids_to_tokens(original_norms.argsort(descending=False))[:K]:
    print(tokenizer.convert_tokens_to_string([token]))

[SEP]
.
;
the
,
of
his
(
in
her


Conclusion: the larger the norm, the more unusual the token; the smaller the norm, the more common the token.
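
The norm ranking in the two cells above can be reproduced on a toy table to make the mechanics explicit. This is a minimal, dependency-free sketch: the vocabulary `toy_vocab` and the vectors in `toy_embeddings` are made-up values for illustration (not BERT weights), and `rank_by_norm` is a hypothetical helper name.

```python
import math

# Hypothetical toy data: made-up 3-dimensional vectors, not real BERT weights.
toy_vocab = ["the", "of", "owl", "##omba"]
toy_embeddings = [
    [0.3, 0.4, 0.0],
    [0.6, 0.0, 0.8],
    [1.0, 2.0, 2.0],
    [2.0, 3.0, 6.0],
]

def l2_norm(vector):
    """Euclidean length of one embedding vector."""
    return math.sqrt(sum(x * x for x in vector))

def rank_by_norm(vocab, embeddings, k, descending=True):
    """Return the k tokens with the largest (or smallest) embedding norms."""
    norms = [l2_norm(v) for v in embeddings]
    order = sorted(range(len(vocab)), key=lambda i: norms[i], reverse=descending)
    return [vocab[i] for i in order[:k]]

largest = rank_by_norm(toy_vocab, toy_embeddings, k=2)
smallest = rank_by_norm(toy_vocab, toy_embeddings, k=2, descending=False)
print(largest)   # tokens with the largest norms
print(smallest)  # tokens with the smallest norms
```

With the real embedding matrix, `original_norms.argsort(...)` plays the role of `sorted(...)` here.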

- [ ] For ease of demonstration, find the low-frequency words in the vocabulary shared by WordNet and BERT, and use them to show the effect.

In [75]:
from dataloader import word_dict

In [76]:
wordnet = word_dict()

DONE


In [77]:
common_words = list(set(wordnet.keys()) & set(tokenizer.vocab.keys()))

Compute the embedding norm of every word in `common_words` to find the low-frequency words.

In [78]:
common_words_norms = {i:original_norms[tokenizer.convert_tokens_to_ids(i)] for i in common_words}

In [79]:
# The K words with the largest norms
K = 20
common_words_norms_sorted = dict(sorted(common_words_norms.items(), key=lambda item: item[1], reverse=True))

list(common_words_norms_sorted.keys())[:K]

['gallon',
 'sock',
 'wrestle',
 'shave',
 'devote',
 'plead',
 'kettle',
 'preach',
 'weigh',
 'spoil',
 'tread',
 'owe',
 'bracket',
 'scramble',
 'courtesy',
 'casualty',
 'vain',
 'appendix',
 'chew',
 'coma']

In [80]:
# The K words with the smallest norms
list(common_words_norms_sorted.keys())[-K:]

['woman',
 'national',
 'brown',
 'film',
 'east',
 'friend',
 'girl',
 'father',
 'album',
 'beautiful',
 'north',
 'second',
 'village',
 'south',
 'new',
 'create',
 'brother',
 'small',
 'large',
 'have']

Using whole words, this shows again that words with larger norms are low-frequency words.

Taking one of the words with the largest norms as an example, we find the K words in the embeddings whose cosine angle to it is smallest.

In [81]:
example = "wrestle"
# embedding of the example word
example_embedding = original_embeddings[tokenizer.convert_tokens_to_ids(example)]

# cosine similarity between the example and every vocabulary embedding
import torch
cosine_sim = torch.nn.CosineSimilarity(dim=1, eps=1e-08)

cosine_similarity_of_example = cosine_sim(original_embeddings, example_embedding)

tokenizer.convert_ids_to_tokens(cosine_similarity_of_example.argsort(descending=True)[:K])

['wrestle',
 'wrestled',
 'wrestling',
 'wrestlers',
 '1762',
 'জ',
 '1713',
 '1737',
 '1727',
 '1712',
 'ذ',
 '1757',
 '1711',
 'wrestler',
 'タ',
 '1734',
 '1642',
 '1781',
 '1733',
 '香']

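Most of the neighbors above carry no meaning (mainly year numbers and isolated characters). We repeat the same procedure for a high-frequency word as a check.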

In [82]:
example = "woman"
# embedding of the example word
example_embedding = original_embeddings[tokenizer.convert_tokens_to_ids(example)]

# cosine similarity between the example and every vocabulary embedding
import torch
cosine_sim = torch.nn.CosineSimilarity(dim=1, eps=1e-08)

cosine_similarity_of_example = cosine_sim(original_embeddings, example_embedding)

tokenizer.convert_ids_to_tokens(cosine_similarity_of_example.argsort(descending=True)[:K])

['woman',
 'women',
 'girl',
 'female',
 'man',
 'lady',
 'girls',
 '##woman',
 'person',
 'ladies',
 'men',
 'feminine',
 '238',
 'femme',
 '234',
 '259',
 'redhead',
 'wife',
 '236',
 '277']

In [83]:
def show_original_related_words(example):
    example_embedding = original_embeddings[tokenizer.convert_tokens_to_ids(example)]
    # cosine similarity between the example and every vocabulary embedding
    cosine_sim = torch.nn.CosineSimilarity(dim=1, eps=1e-08)

    cosine_similarity_of_example = cosine_sim(original_embeddings, example_embedding)

    return tokenizer.convert_ids_to_tokens(cosine_similarity_of_example.argsort(descending=True)[:K])
show_original_related_words('woman')

['woman',
 'women',
 'girl',
 'female',
 'man',
 'lady',
 'girls',
 '##woman',
 'person',
 'ladies',
 'men',
 'feminine',
 '238',
 'femme',
 '234',
 '259',
 'redhead',
 'wife',
 '236',
 '277']

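For low-frequency words, the embedding has few semantically related neighbors; for high-frequency words, the neighbors are semantically much richer.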
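
The neighbor lookup used in the cells above can be sketched without any dependencies, which makes it easy to see that cosine similarity ignores vector length and depends only on direction. The vectors below are made-up toy values, and `nearest_by_cosine` is a hypothetical helper; unlike the notebook cells, it skips the query word itself so that the first result is a genuine neighbor.

```python
import math

def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two vectors, guarded against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)

def nearest_by_cosine(query, vocab, embeddings, k):
    """Return the k tokens most similar to `query`, skipping the query itself."""
    query_vec = embeddings[vocab.index(query)]
    scored = [
        (cosine_similarity(vec, query_vec), token)
        for token, vec in zip(vocab, embeddings)
        if token != query
    ]
    scored.sort(reverse=True)
    return [token for _, token in scored[:k]]

# Hypothetical toy vectors, not real BERT weights.
toy_vocab = ["woman", "women", "girl", "1762"]
toy_embeddings = [
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.1],  # about twice as long as "woman", nearly the same direction
    [1.0, 0.5, 0.0],
    [0.0, 0.1, 1.0],
]

neighbors = nearest_by_cosine("woman", toy_vocab, toy_embeddings, k=2)
print(neighbors)
```

Here "women" ranks first even though its vector is about twice as long, because only the angle matters.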

Train the recnn network and use it to adjust the embeddings of low-frequency words.

In [84]:
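def calculate_topK_distance_by_mse(embeddings, pred_embeddings, K=10):
    # Squared distance from each vocabulary embedding to the prediction: sum over the
    # embedding dimension (dim=1). The outputs recorded below came from summing over axis 0.
    return ((embeddings - pred_embeddings)**2).sum(dim=1)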

In [85]:
from model import DictNet

recnn = torch.load("./recnn-last.pt")

example = "girl"
def get_related_words_from_recnn(example):
    # Look up the WordNet definition that serves as the model input.
    try:
        input_sentence = wordnet[example]
    except KeyError:
        return "words should be in the wordnet dictionary."
    print(input_sentence)
    res = tokenizer(input_sentence, return_tensors='pt').to('cuda')

    # Pass the target word's token ID alongside the tokenized definition.
    res['word_ids'] = torch.tensor(tokenizer.convert_tokens_to_ids(example)).to('cuda')
    recnn.eval()
    output = recnn(**res)

    pred_embed = output['pred_embed'].to('cpu')
    print(f"{example}: {pred_embed.norm()}")

    # Rank vocabulary tokens by their distance to the predicted embedding.
    distances = calculate_topK_distance_by_mse(original_embeddings, pred_embed)
    return tokenizer.convert_ids_to_tokens(distances.argsort()[:K])
print(get_related_words_from_recnn(example))

# original norm
original_norms[tokenizer.convert_tokens_to_ids(example)]

female child
girl: 3.9972944259643555
['[unused15]', '[unused528]', '[unused557]', '[unused193]', '[unused685]', '[unused692]', '[unused418]', '[unused625]', '[unused443]', '[unused122]', '[unused101]', '[unused275]', '[unused51]', '[unused364]', '[unused147]', '[unused190]', '[unused677]', '[unused114]', '[unused66]', '[unused253]']


tensor(0.9622)

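After training, the norm of the predicted embedding is larger rather than smaller (3.997 versus the original 0.962), and the nearest tokens are not semantically related: they are all `[unused…]` entries with IDs below 768, the embedding width.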
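
`calculate_topK_distance_by_mse` reduces a (vocabulary × dimension) table of squared differences to a ranking, so the axis of the sum matters. The toy sketch below, with made-up numbers and hypothetical helper names, contrasts the two reductions: summing across each row yields one distance per token, whereas summing down each column yields one value per embedding dimension, and the `argsort` indices of the latter are not token IDs.

```python
def squared_distance_per_token(embeddings, target):
    """Sum squared differences across each row: one distance per token."""
    return [sum((x - t) ** 2 for x, t in zip(row, target)) for row in embeddings]

def squared_error_per_dimension(embeddings, target):
    """Sum squared differences down each column: one value per dimension."""
    return [
        sum((row[d] - target[d]) ** 2 for row in embeddings)
        for d in range(len(target))
    ]

# Hypothetical toy table: 4 tokens (rows) with 2 dimensions (columns).
toy_embeddings = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 2.0]]
target = [1.0, 2.0]

per_token = squared_distance_per_token(toy_embeddings, target)
per_dimension = squared_error_per_dimension(toy_embeddings, target)

# Index of the closest token, valid only for the per-token reduction.
nearest_token_id = min(range(len(per_token)), key=lambda i: per_token[i])

print(len(per_token), len(per_dimension))
print(nearest_token_id)
```

In PyTorch terms, the per-token reduction corresponds to `sum(dim=1)` on a tensor of shape (vocabulary size, embedding width), and the per-dimension reduction to `sum(dim=0)`.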

In [86]:
example = 'owl'
print(get_related_words_from_recnn(example))
print("original norm", original_norms[tokenizer.convert_tokens_to_ids(example)])
print("\n")
print(show_original_related_words(example))

nocturnal bird
owl: 3.9220004081726074
['[unused528]', '[unused418]', '[unused193]', '[unused685]', '[unused253]', '[unused114]', '[unused275]', '[unused692]', '[unused557]', '[unused483]', '[unused129]', '[unused388]', '[unused122]', '[unused339]', '[unused390]', '[unused547]', '[unused189]', '[unused164]', '[unused177]', '[unused471]']
original norm tensor(1.3728)


['owl', 'owls', '1779', '1795', '1675', '1672', '1738', '1679', '1781', '1819', '1611', '1802', '1785', '1646', '1749', '1752', '1771', '1642', '1659', '1682']


In [87]:
# Print each word whose definition sets a new maximum length (in space-separated words).
longest = 0
for k, v in wordnet.items():
    if len(v.split(' ')) > longest:
        longest = len(v.split(" "))
        print(k)

able
absolute
abstract
applied
false
functional
high
application
bow
fold
