# Extracting Word Embeddings in BERT

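This script tokenizes a speech document into subword tokens and runs them through a BERT model. With `output_hidden_states=True`, the model returns 13 hidden states (the embedding-layer output plus one per transformer layer), and the script takes one of them as the word embeddings. Note that the code below indexes `outputs[2][0]`, which is the embedding-layer output; `outputs[2][-1]` would be the last layer. Because the same word can occur several times within one speech document, repeated words are collapsed into a single vector by averaging their embedding values.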

- This script uses the fast tokenizer loaded through the `AutoTokenizer` class of the `transformers` library.
- This script processes a *single* document and computes cosine similarity between words within that document.

## BERT (Bidirectional Encoder Representations from Transformers)
BERT ([Devlin et al., 2019](https://arxiv.org/pdf/1810.04805.pdf)) is a masked language model that conditions each token's vector on both the left and the right side of its context. During pre-training, a masked language model randomly masks tokens and is trained to predict the missing ones.

**Contextual (or dynamic) embeddings**, including BERT, differ from static embeddings such as GloVe in that a single word can be assigned more than one vector representation, depending on its context. The same word "apple" is represented by different vectors in different contexts, which allows us to distinguish the iPhone maker from the fruit. The encoder learns to predict the masked token (*not word*) using the entire set of tokens in the given input, which here is the speech, truncated to 512 tokens.

BERT, one of the dynamic embedding models, has the following features (for the base model used here):
- Subword tokens: less common words are split into multiple subword tokens.
- 13 hidden states: the embedding-layer output plus 12 transformer layers.
- Each hidden state has a size of 768 per token.
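
To make the subword splitting concrete, here is a minimal sketch of greedy longest-match-first splitting, which is how WordPiece-style tokenizers break a word into pieces. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and only for illustration; the real BERT vocabulary is far larger, and the real tokenizer also handles casing and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"con", "##gratulations", "just", "##ly", "peace"}

print(wordpiece_tokenize("congratulations", toy_vocab))  # ['con', '##gratulations']
print(wordpiece_tokenize("justly", toy_vocab))           # ['just', '##ly']
print(wordpiece_tokenize("zebra", toy_vocab))            # ['[UNK]']
```

Common words that are in the vocabulary stay whole, while rarer words come back as a first piece followed by `##`-prefixed continuation pieces.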

In [5]:
from transformers import BertModel, BertTokenizer, AutoTokenizer
import numpy as np
import streamlit as st
import re
import pandas as pd
from datetime import datetime
import nltk
import torch

In [6]:
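# Load the model and tokenizer from the same checkpoint so that token ids
# index the embedding rows they were trained with.
model = BertModel.from_pretrained('bert-base-cased', output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')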

In [231]:
# Input is "light.csv", from which stop words have already been removed.
df = pd.read_csv('../../../data/processed/light.csv')
meta = pd.read_csv('../../../data/processed/meta.csv')
# Pull the year and text columns into lists
timestamps = df.year.to_list()
texts = df.text.to_list()

# Extract the document delivered by Afghanistan in 1952.
text = texts[1]


In [234]:
meta.head(1)
meta.describe()

Unnamed: 0,session,year,ccode,gwcode,mid_dispute,mid_num_dispute,cow_num_inter,cow_inter,cow_num_civil,cow_civil,...,v2xcl_acjst,v2xcs_ccsi,v2x_freexp,v2xme_altinf,v2smgovdom,v2smgovfilcap,v2smgovfilprc,v2smgovshutcap,v2smgovshut,v2xedvd_me_cent
count,10568.0,10568.0,10217.0,9197.0,9464.0,9464.0,4567.0,4567.0,4567.0,4567.0,...,9520.0,9520.0,9520.0,9520.0,3858.0,3858.0,3858.0,3858.0,3858.0,9352.0
mean,47.198808,1992.198808,453.044142,451.58356,0.607777,0.607777,0.019926,0.041822,0.14758,0.14758,...,0.583364,0.57373,0.564268,0.553715,0.101634,-0.086477,0.469499,-0.163223,0.602092,0.465796
std,19.695417,19.695417,261.311267,248.60928,1.545809,1.545809,0.13976,0.200204,0.354722,0.354722,...,0.291367,0.31867,0.322586,0.330446,1.378806,1.268826,1.543879,1.286865,1.291013,0.322204
min,1.0,1946.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.002,0.008,0.011,0.009,-3.64,-3.329,-3.898,-3.162,-4.169,0.012
25%,32.0,1977.0,220.0,230.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.318,0.27375,0.25,0.216,-0.9205,-1.04275,-0.666,-1.16025,-0.376,0.161
50%,50.0,1995.0,450.0,451.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.622,0.637,0.618,0.679,-0.022,-0.07,0.866,-0.215,1.013,0.423
75%,64.0,2009.0,663.0,652.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.85,0.885,0.871,0.853,1.15375,0.9615,1.823,0.918,1.75,0.794
max,77.0,2022.0,990.0,950.0,34.0,34.0,1.0,1.0,1.0,1.0,...,0.997,0.983,0.992,0.977,2.877,2.943,2.601,2.507,2.004,0.981


In [4]:
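# Tokenize the text into subword tokens
tokenized_text = tokenizer.tokenize(text)

# Number of tokens to drop so the sequence plus [CLS] and [SEP] fits in 512
# (0 when the text is already short enough)
truncate_length = max(0, len(tokenized_text) - 512 + 2)

# Drop about half of the excess from each end, keeping the middle of the text
start = truncate_length // 2
end = len(tokenized_text) - (truncate_length - start)
truncated_text = tokenized_text[start:end]

# Add special tokens [CLS] and [SEP] (no surrounding spaces, or they map to [UNK])
marked_text = ["[CLS]"] + truncated_text + ["[SEP]"]

# Convert tokens to ids
indexed_tokens = tokenizer.convert_tokens_to_ids(marked_text)

# Create attention mask (1 for every real token)
attention_mask = [1] * len(indexed_tokens)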

Token indices sequence length is longer than the specified maximum sequence length for this model (628 > 512). Running this sequence through the model will result in indexing errors
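
The center-truncation step in the cell above can also be written as a small helper that leaves short inputs untouched. This is a minimal sketch, not part of the original script: the name `truncate_center` and its parameters are hypothetical, and the placeholder tokens stand in for real word pieces.

```python
def truncate_center(tokens, max_len=512, n_special=2):
    """Keep the middle of `tokens` so that len(tokens) + n_special <= max_len."""
    budget = max_len - n_special      # room left after [CLS] and [SEP]
    excess = len(tokens) - budget
    if excess <= 0:
        return list(tokens)           # already short enough, nothing to drop
    left = excess // 2                # drop about half of the excess from the front
    return list(tokens[left:left + budget])


# Placeholder tokens standing in for a 626-piece speech.
tokens = [f"tok{i}" for i in range(626)]
kept = truncate_center(tokens)
marked = ["[CLS]"] + kept + ["[SEP]"]

print(len(kept), len(marked))            # 510 512
print(truncate_center(["a", "b", "c"]))  # ['a', 'b', 'c']
```

Keeping the middle of the speech is a choice inherited from the cell above; keeping the first 510 tokens would be an equally simple alternative.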


In [152]:
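# Print the tokenized word pieces along with their vocabulary ids.
# Pair marked_text (not tokenized_text) with indexed_tokens: both include
# [CLS]/[SEP] and the truncation, so the two lists line up one-to-one.

#for tup in zip(marked_text, indexed_tokens):
#    print('{:<12} {:>6,}'.format(tup[0], tup[1]))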

It              100
privilege    19,800
express       4,745
,             1,145
Mr            1,871
.             1,263
President     2,541
,             1,714
con           1,353
##gratulations  1,583
Afghanistan   6,241
delegation    1,958
election      2,030
,             1,607
just            119
##ly          1,284
unanimously   2,059
voted         3,519
Assembly      1,362
.             1,169
It            6,561
also          7,616
privilege       117
extend        2,218
fellow        2,174
representatives 16,286
greeting      9,113
##s           3,519
Royal           119
Afghan        1,130
Government    2,157
,               117
well         23,614
sincere       7,279
##st          3,681
wishes          117
success      11,565
current         117
session      21,820
General      13,378
Assembly     14,819
.            10,774
Our           3,235
attachment    3,844
United        1,311
Nations       4,309
Charter       6,551
principles      119
complete      1,109
ad            7

In [20]:
# Pad sequences to max_seq_length (512); padded positions get mask 0
while len(indexed_tokens) < 512:
    indexed_tokens.append(0)
    attention_mask.append(0)
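
As a sketch of the same padding step without a loop, the helper below pads the ids and builds the attention mask in one go. The name `pad_ids` is hypothetical, the ids in the example are arbitrary placeholders, and `pad_id=0` follows the value used in the cell above.

```python
def pad_ids(token_ids, max_len=512, pad_id=0):
    """Right-pad `token_ids` to `max_len` and build the matching attention mask."""
    n_pad = max(0, max_len - len(token_ids))
    padded = list(token_ids) + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, mask


# Arbitrary placeholder ids, padded to a short length for readability.
ids, mask = pad_ids([11, 7, 8, 12], max_len=8)
print(ids)   # [11, 7, 8, 12, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The mask matters because the model would otherwise attend to the padding positions as if they were text.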

In [7]:
# Collect the inputs in lists (one entry per document)
tokenized_texts = []
tokens_tensors = []
attention_masks = []

tokens_tensors.append(torch.tensor(indexed_tokens))
attention_masks.append(torch.tensor(attention_mask))
tokenized_texts.append(tokenized_text)

# Stack into batch tensors of shape (n_documents, sequence_length)
tokens_tensors = torch.stack(tokens_tensors)
attention_masks = torch.stack(attention_masks)


In [10]:
# Run the BERT model
with torch.no_grad():
    outputs = model(input_ids=tokens_tensors.view(-1, tokens_tensors.size(-1)), attention_mask=attention_masks.view(-1, attention_masks.size(-1)))



# Duplicates

In [180]:
pd_words = pd.Series(marked_text, name='term')
print(pd_words.shape)

# outputs[2] is the tuple of 13 hidden states; index 0 is the embedding-layer
# output, and index -1 would be the last transformer layer.
hidden_states = outputs[2][0].squeeze().numpy()
print(hidden_states.shape)

df_outputs = pd.DataFrame(hidden_states)
df_outputs["term"] = pd_words


df_outputs.loc[(df_outputs['term'] == "right") | (df_outputs['term'] == "believe")]

# Each row is one token and each column one embedding dimension: 512 x 768.

(512,)
(512, 768)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,759,760,761,762,763,764,765,766,767,term
16,-0.620744,0.348515,0.351208,-0.195764,0.667364,0.406018,0.473575,0.795273,0.41511,0.34146,...,0.205765,0.405889,0.731192,-0.08603,-0.744767,-0.107819,0.911729,0.347175,0.33573,believe
164,-0.181749,-0.923341,-0.502419,0.267246,-0.467946,-0.439854,-0.828056,0.443169,-0.869504,-2.072055,...,0.047809,0.623744,-0.455522,0.004116,-0.674838,-0.335369,0.57726,-0.633902,0.24106,right
270,-0.731771,0.35543,0.949808,0.068705,1.059148,-0.03142,1.05152,0.635568,0.669277,0.817705,...,0.234071,0.340437,0.762802,-0.045645,-1.094815,-0.255272,0.648554,0.450855,0.231913,believe
489,-0.023267,-0.744896,-0.409934,0.549873,-0.542924,-0.45324,-1.005128,0.303455,-0.629427,-1.995248,...,0.199423,0.312459,-0.63608,-0.016288,-0.437767,-0.387403,0.471238,-0.805941,-0.266752,right


In [181]:
indices = list((df_outputs['term'] == "right") | (df_outputs['term'] == "believe"))

In [182]:
subset = df_outputs[(df_outputs['term'] == "right") | (df_outputs['term'] == "believe")]
subset_np = np.array(subset)
print(subset_np.shape)

(4, 769)


In [185]:
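# Cosine similarity between the two occurrences of "right" (rows 1 and 3 of the subset)

# [:-1] drops the trailing 'term' column, leaving the 768 embedding values
A = np.array(subset_np[1,:-1])
print(A.shape)

B = np.array(subset_np[3,:-1])
cosine = np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Print the cosine similarity
print("Cosine Similarities:", cosine)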


(768,)
Cosine Similarities: 0.882696943581341


In [49]:
# find duplicate rows
duplicate_rows = df_outputs[df_outputs.duplicated('term')].sort_values('term')

duplicate_rows=pd.DataFrame(duplicate_rows)
# print duplicate rows
duplicate_rows['term']

freq_table = pd.crosstab(duplicate_rows['term'], 'no_of_duplicates') 
print(list(duplicate_rows['term']))


['##ci', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', '-', '-', '-', '-', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', ';', 'Afghanistan', 'Charter', 'Fortunately', 'In', 'Member', 'Nations', 'Nations', 'Nations', 'Organization', 'States', 'The', 'The', 'This', 'This', 'United', 'United', 'United', 'United', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'We', 'also', 'also', 'appreciate', 'appreciate', 'appreciation', 'assistance', 'attached', 'attitude', 'based', 'basis', 'believe', 'complete', 'con', 'con', 'continents', 'countries', 'country', 'country', 'country', 'country', 'difficulties', 'difficulties', 'evolution', 'examples', 'freedom', 'future', 'good', 'good', 'great', 'great', 'great', 'last', 'last', 'life', 'many',

- I do not suspect significant semantic differences between duplicates. Problems could have arisen with polysemous words like "right," but the cosine similarity between the two occurrences of "right" was 0.88. That is as high as the similarity between duplicated occurrences of "We," which was 0.85.
- Given this, we decided to collapse duplicates into one vector by averaging them.
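
The spot check above covers only two words. A minimal sketch of how the same check could be applied to every duplicated term before averaging is given below. The helper names and the three-dimensional vectors are hypothetical stand-ins for the 768-dimensional rows of `df_outputs`.

```python
from itertools import combinations

import numpy as np


def min_pairwise_cosine(vectors):
    """Smallest cosine similarity among all pairs of rows in `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return min(float(unit[i] @ unit[j])
               for i, j in combinations(range(len(unit)), 2))


def average_duplicates(terms, embeddings):
    """Group rows by term; return {term: (mean vector, min pairwise cosine or None)}."""
    groups = {}
    for term, vec in zip(terms, embeddings):
        groups.setdefault(term, []).append(vec)
    summary = {}
    for term, vecs in groups.items():
        stacked = np.vstack(vecs)
        # Agreement is only defined when a term occurs more than once.
        agreement = min_pairwise_cosine(stacked) if len(vecs) > 1 else None
        summary[term] = (stacked.mean(axis=0), agreement)
    return summary


# Synthetic stand-in vectors, not real BERT output.
terms = ["right", "peace", "right"]
embeddings = np.array([[1.0, 0.0, 1.0],
                       [0.0, 2.0, 0.0],
                       [1.0, 0.0, 0.0]])

summary = average_duplicates(terms, embeddings)
mean_right, agreement_right = summary["right"]
print(mean_right, round(agreement_right, 3))
```

A term whose minimum pairwise cosine is low would be a candidate for keeping its occurrences separate instead of averaging them.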

In [330]:
df_outputs_embedding = df_outputs.groupby(['term']).mean()
print(df_outputs_embedding)

              0         1         2         3         4         5         6    \
term                                                                            
 [SEP]  -0.409629 -0.746030  0.954548 -0.989581 -0.666579 -0.456214 -0.482902   
##al    -0.524131  0.479899 -0.476182 -0.390430  0.571847  0.344540 -0.890469   
##atic  -1.258919 -1.749650  0.189675 -1.009726  1.250488 -0.127454 -0.174475   
##ation -0.321686 -0.599758  0.114046 -0.144932  0.953009 -0.936072  0.038542   
##ci    -0.209563  0.008983  0.226166 -0.037539 -0.431536 -0.854313 -0.930581   
...           ...       ...       ...       ...       ...       ...       ...   
words    0.397322 -0.091103  0.350278  0.339480  0.188769 -0.122880  0.119570   
world    0.991384 -0.232762  0.670588  0.508774  0.526652  0.077158 -0.581935   
worth   -0.107531  0.469970 -0.674297  0.848895  0.115055  0.818750 -0.049695   
years    0.478471 -0.254025 -0.260787  0.681818 -0.097176  0.219434 -0.531882   
z        0.215062 -0.684214 

Index([' [SEP]', '##al', '##atic', '##ation', '##ci', '##d', '##dice', '##eal',
       '##ful', '##gra',
       ...
       'war', 'wars', 'well', 'will', 'without', 'words', 'world', 'worth',
       'years', 'z'],
      dtype='object', name='term', length=321)

In [187]:
df_outputs_embedding.to_csv("../../../output/embeddings.csv")

# Analysis

In [133]:
# Use a separate name for the full tuple of hidden states so the NumPy array
# named hidden_states above is not overwritten.
all_hidden_states = outputs[2]

print ("Number of layers:", len(all_hidden_states), "(initial embeddings + 12 BERT layers)")
layer_i = 0

print ("Number of batches:", len(all_hidden_states[layer_i]))
batch_i = 0

print ("Number of tokens:", len(all_hidden_states[layer_i][batch_i]))
token_i = 0

print('Type of hidden_states:', type(all_hidden_states))

print('Tensor shape for each layer: ', all_hidden_states[0].size())


Number of layers: 13 (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 512
Type of hidden_states: <class 'tuple'>
Tensor shape for each layer:  torch.Size([1, 512, 768])


## Cosine Similarity between two words

In [300]:
from sklearn.metrics.pairwise import cosine_similarity

term_a = "peace"
term_b = "world"
# Extract the mean embeddings for the two terms
embedding_a = df_outputs_embedding.loc[term_a].values.reshape(1, -1)
embedding_b = df_outputs_embedding.loc[term_b].values.reshape(1, -1)
print(embedding_a.shape)
# Compute cosine similarity between the two mean embeddings.
# Store it under a new name so the imported function is not overwritten.
similarity = cosine_similarity(embedding_a, embedding_b)
# Cosine similarity ranges over [-1, 1]
print(f"Cosine Similarity between '{term_a}' and '{term_b}': {similarity}")


(1, 768)
Cosine Similarity between 'peace' and 'world': [[0.01612619]]


# Limitations of BERT

## 1) **Anisotropy**
I expected "peace" and "world" to show greater distance than a pair of "war" and "peace." But they didn't. This is not specific to these two words: the literature reports that the cosine between the contextual embeddings of any two random words tends to be high, close to 1 ([Jurafsky and Martin 2024](https://web.stanford.edu/~jurafsky/slp3/)). Apparently, we need to apply additional transformations to the embeddings extracted from BERT. [Timkey and van Schijndel (2021)](https://aclanthology.org/2021.emnlp-main.372/) point out that this tendency can be attenuated by standardizing the vectors, which reduces the impact of outliers. By outliers they mean a few embedding dimensions that have high variance.

[Li et al. (2020)](https://aclanthology.org/2020.emnlp-main.733.pdf) point out that BERT's anisotropy results in underperformance on sentence similarity compared to GloVe embeddings. They further point out that the last layer of BERT is not appropriate for similarity metrics, given its non-smooth characteristic.
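
To illustrate the standardization idea mentioned above, the sketch below builds synthetic vectors in which one dimension dominates, then compares the average pairwise cosine before and after z-scoring each dimension. This is an assumption-laden toy example, not a reproduction of the cited method: the data are random, and the function names `mean_offdiag_cosine` and `standardize` are hypothetical.

```python
import numpy as np


def mean_offdiag_cosine(X):
    """Average cosine similarity over all pairs of distinct rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(X)
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


def standardize(X, eps=1e-12):
    """Z-score each dimension (column) across all vectors."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)


# Synthetic vectors: 200 "tokens" in 16 dimensions, with one dominant dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
X[:, 0] += 50.0  # a single dimension with a large offset dominates every cosine

raw_mean = mean_offdiag_cosine(X)
std_mean = mean_offdiag_cosine(standardize(X))
print(f"mean cosine, raw:          {raw_mean:.3f}")
print(f"mean cosine, standardized: {std_mean:.3f}")
```

In this toy setting the raw cosines are close to 1 for every pair, while the standardized ones center near 0, so differences between pairs become visible again. Whether the same holds for the embeddings in this notebook would need to be checked on `df_outputs_embedding` directly.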

## 2) **Cross-document comparison**
This is our hypothesis: we suspect that the mapping of words onto the vector space is not linear. Even if the mapping is anisotropic, that should not pose a problem if the main goal is simply to compare distances **between** documents. However, the same word that shows up in two different documents gets different vectors. This context-specificity is a double-edged sword: it allows us to distinguish nuances, but it also prevents us from making comparisons in a consistent manner.
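
One way to quantify this would be to compute, for every term that two documents share, the cosine similarity between its averaged embedding in each document. The sketch below assumes each document has already been reduced to a dictionary mapping terms to vectors; `shared_term_similarity` and the two-dimensional toy vectors are hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def shared_term_similarity(doc_a, doc_b):
    """Cosine similarity of each term that appears in both documents."""
    shared = sorted(set(doc_a) & set(doc_b))
    return {term: cosine(doc_a[term], doc_b[term]) for term in shared}


# Toy per-document embeddings, not real BERT output.
doc_a = {"peace": np.array([1.0, 0.0]), "war": np.array([0.0, 1.0])}
doc_b = {"peace": np.array([1.0, 1.0]), "world": np.array([1.0, 0.0])}

sims = shared_term_similarity(doc_a, doc_b)
print(sims)
```

If shared terms score consistently high, cross-document comparison is on firmer ground; consistently low scores would support the concern raised above.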


In [345]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all averaged term embeddings
cosine_sim_matrix = cosine_similarity(df_outputs_embedding, df_outputs_embedding)

# Label rows and columns with the terms, then display the matrix
cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=df_outputs_embedding.index, columns=df_outputs_embedding.index)

cosine_sim_df


term,[SEP],##al,##atic,##ation,##ci,##d,##dice,##eal,##ful,##gra,...,war,wars,well,will,without,words,world,worth,years,z
[SEP],1.000000,0.026143,-0.016379,0.175157,0.075565,0.089596,0.150333,0.035554,-0.033455,0.084167,...,0.028284,0.023479,0.060025,0.053393,0.043304,0.035146,-0.019725,0.060074,0.073359,0.093997
##al,0.026143,1.000000,0.093010,0.136414,0.062659,0.156533,0.072606,0.074310,-0.044487,0.097046,...,0.170718,0.132057,0.182142,0.204611,0.234805,0.057567,0.339189,0.024444,0.080924,0.153247
##atic,-0.016379,0.093010,1.000000,0.009135,0.043850,0.015896,0.055103,0.017012,-0.057231,0.017846,...,0.114068,-0.000627,0.057127,0.094940,0.084769,0.002506,0.064155,-0.061443,0.042701,0.043055
##ation,0.175157,0.136414,0.009135,1.000000,0.104572,0.145229,0.095189,0.118794,0.037401,0.157255,...,0.158142,0.054786,0.302470,0.106522,0.163783,0.150252,0.179236,0.088714,0.161013,0.495971
##ci,0.075565,0.062659,0.043850,0.104572,1.000000,0.106008,-0.002979,0.077277,0.040556,0.069917,...,0.096960,0.013551,0.095918,0.070276,0.153076,0.074848,0.114379,0.117168,0.045349,0.138723
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
words,0.035146,0.057567,0.002506,0.150252,0.074848,0.162133,0.077808,0.167498,-0.055219,-0.034550,...,0.191374,0.067224,0.094747,0.119390,0.149054,1.000000,0.108471,0.062540,0.099068,0.380922
world,-0.019725,0.339189,0.064155,0.179236,0.114379,0.163717,0.105674,0.110620,-0.007374,0.079482,...,0.174021,0.117663,0.186013,0.149951,0.168116,0.108471,1.000000,0.027870,0.244823,0.253086
worth,0.060074,0.024444,-0.061443,0.088714,0.117168,0.017305,-0.018457,-0.027766,0.021058,0.040095,...,0.005939,-0.029942,0.072026,-0.001627,0.058985,0.062540,0.027870,1.000000,0.051174,0.046957
years,0.073359,0.080924,0.042701,0.161013,0.045349,0.239735,0.078811,0.116394,-0.057514,0.108901,...,0.123907,0.093873,0.154401,0.097526,0.094451,0.099068,0.244823,0.051174,1.000000,0.212903
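
A full similarity matrix is hard to read at a glance. As a sketch of one way to query it, the helper below returns the `k` terms most similar to a given term. The name `top_k_neighbors` is hypothetical, and the small matrix uses made-up values purely for illustration, not values from the table above.

```python
import numpy as np


def top_k_neighbors(sim_matrix, terms, query, k=2):
    """Return the k terms most similar to `query`, excluding the query itself."""
    idx = terms.index(query)
    scores = sim_matrix[idx].copy()
    scores[idx] = -np.inf  # exclude the query itself
    order = np.argsort(scores)[::-1][:k]  # highest similarity first
    return [(terms[i], float(scores[i])) for i in order]


# Made-up similarity values for illustration only.
terms = ["war", "peace", "world", "years"]
sim = np.array([[1.0, 0.6, 0.2, 0.1],
                [0.6, 1.0, 0.3, 0.0],
                [0.2, 0.3, 1.0, 0.4],
                [0.1, 0.0, 0.4, 1.0]])

neighbors = top_k_neighbors(sim, terms, "peace", k=2)
print(neighbors)  # [('war', 0.6), ('world', 0.3)]
```

With the notebook's own objects, the analogous call would pass `cosine_sim_matrix` and `list(df_outputs_embedding.index)`.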


# Additional plans
- Get distances between words from two different documents.
- Apply dimension reduction for plotting.
- Use a simple embedding like Word2Vec.
- Compare the same word across documents: its similarity across different documents should be high.
- Write up.
