## Summary of this file:
Comparison between several options to create embeddings of articles:
- Source (entire article or summary created by BART)
- Embedding (finaly layer, second last layer, sum/concat of last 4 layers)

Method: 3 articles were defined 1 base case, 1 that is similar to it and another that is different

Metric: percentage difference between the similarity of similar and different summaries (PD)

Results: 
- When comparing summaries PD was always around 2.5% except for comparison between final layer embeddings (PD: 7.29%)
- When comparing the first 512 tokens of the entire article, the same pattern emerged: PD around 4-5% for all embedding options, except for the embedding extracted from the final layer, where PD went above 10%.

Conclusion:
<b> Clustering final layer embeddings of entire articles. </b>

Question set 1:
- Are the PDs that we get large enough? Is there an objective way to define a sufficient PD?


Question set 2: (Further steps - how to use/describe clusters)
- Choose number of clusters
- NER
- Summary of centroid



In [1]:
import torch
from transformers import BertTokenizer, BertModel
from nltk.tokenize import sent_tokenize
from scipy import spatial
from pprint import pprint
import pandas as pd

In [2]:
summaries = []
with open('../data/misc/article_summary_list.txt', 'r') as file:
    summaries = [line.rstrip() for line in file]
articles = pd.read_csv('../data/clean/"quantumcomputing"AND"research"_999.csv')

In [67]:
pprint(summaries[0]) # baseline summary

('IBM and Google are giving $150M to two universities in the U.S. and Japan '
 'for quantum computing research. The aim is to create a quantum supercomputer '
 'in a decade that has 100,000 qubits. A signing ceremony is set to occur in '
 'Hiroshima, Japan this weekend at the G-7 meetings.')


In [68]:
pprint(summaries[1]) # similar

('The University of Chicago and the University of Tokyo are getting $150 '
 'million for quantum computing research. Former Chicago Mayor Rahm Emanuel, '
 'the current U.S. ambassador to Japan, said the schools’ partnership resulted '
 'from a lunch last summer. IBM is giving the two schools $100 million while '
 'Google is donating $50 million, according to the report.')


In [3]:
pprint(summaries[8]) # different

('Nikolaos “Nikos” Bogonikolos, 59, faces charges tied to wire fraud and '
 'smuggling. He was arrested in Paris last week and faces extradition to the '
 'U.S., the Justice Department says. The case portrays him as a man with '
 'access to sophisticated military technology.')


In [20]:
tokenizer.tokenize(summaries[8])

['nikola',
 '##os',
 '“',
 'nik',
 '##os',
 '”',
 'bog',
 '##oni',
 '##ko',
 '##los',
 ',',
 '59',
 ',',
 'faces',
 'charges',
 'tied',
 'to',
 'wire',
 'fraud',
 'and',
 'smuggling',
 '.',
 'he',
 'was',
 'arrested',
 'in',
 'paris',
 'last',
 'week',
 'and',
 'faces',
 'extra',
 '##dition',
 'to',
 'the',
 'u',
 '.',
 's',
 '.',
 ',',
 'the',
 'justice',
 'department',
 'says',
 '.',
 'the',
 'case',
 'portrays',
 'him',
 'as',
 'a',
 'man',
 'with',
 'access',
 'to',
 'sophisticated',
 'military',
 'technology',
 '.']

In [15]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def insert_sep_token(text):
    sentences = sent_tokenize(text)
    text_with_sep = ' [SEP] '.join(sentences)
    return text_with_sep


def bert_embed_text(text):
    marked_text = "[CLS] " + insert_sep_token(text)
    tokenized_text = tokenizer.tokenize(marked_text)

    if len(tokenized_text) > 512:
        tokenized_text = tokenized_text[:512]

    # Map the token strings to their vocabulary indeces
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    # Create segment ids (alternating between 0 and 1)
    segments_ids = []
    current_segment_id = 0
    for value in tokenized_text:
        segments_ids.append(current_segment_id)
        if value == "[SEP]":
            current_segment_id = 1 - current_segment_id

    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True) # output all hidden states
    model.eval()

    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)
        # hidden states from all layers because we set output_hidden_states = True
        # See the documentation for more details: # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]

    # token_vecs = hidden_states[-2][0] # second to last layer
    # sentence_embedding = torch.mean(token_vecs, dim=0)
    return hidden_states

In [30]:
text1 = summaries[0]
text2 = summaries[1]
text3 = summaries[8]

In [16]:
text1 = articles.at[1, 'body']
text2 = articles.at[2, 'body']
text3 = articles.at[9, 'body']

In [17]:
# Computing the hidden state for each text
hidden_states1 = bert_embed_text(text1)
hidden_states2 = bert_embed_text(text2)
hidden_states3 = bert_embed_text(text3)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder

In [6]:
# Comparing second to last layer of the three texts
token_vecs1 = hidden_states1[-2][0]
embed1 = torch.mean(token_vecs1, dim=0)
token_vecs2 = hidden_states2[-2][0]
embed2 = torch.mean(token_vecs2, dim=0)
token_vecs3 = hidden_states3[-2][0]
embed3 = torch.mean(token_vecs3, dim=0)

more_similar = 1 - spatial.distance.cosine(embed1, embed2)
less_similar = 1 - spatial.distance.cosine(embed1, embed3)

print("Similarity between text1 and text2: ", more_similar)
print("Percentage difference to similarity between text 1 and 3: ", round((more_similar-less_similar)/more_similar*100, 2), "%")

Similarity between text1 and text2:  0.9642568826675415
Percentage difference to similarity between text 1 and 3:  4.34 %


In [18]:
# Comparing last layer of the three texts
token_vecs1 = hidden_states1[-1][0]
embed1 = torch.mean(token_vecs1, dim=0)
token_vecs2 = hidden_states2[-1][0]
embed2 = torch.mean(token_vecs2, dim=0)
token_vecs3 = hidden_states3[-1][0]
embed3 = torch.mean(token_vecs3, dim=0)

more_similar = 1 - spatial.distance.cosine(embed1, embed2)
less_similar = 1 - spatial.distance.cosine(embed1, embed3)

print("Similarity between text1 and text2: ", round(more_similar, 2))
print("Percentage difference to similarity between text 1 and 3: ", round((more_similar-less_similar)/more_similar*100, 2), "%")

Similarity between text1 and text2:  0.94
Percentage difference to similarity between text 1 and 3:  10.24 %


In [19]:
less_similar

0.8410043716430664

In [40]:
# Comparing sum of last four hidden layers of the three texts
token_vecs1 = torch.sum(torch.stack(hidden_states1[-4:], dim=0), dim=0)[0]
embed1 = torch.mean(token_vecs1, dim=0)
token_vecs2 = torch.sum(torch.stack(hidden_states2[-4:], dim=0), dim=0)[0]
embed2 = torch.mean(token_vecs2, dim=0)
token_vecs3 = torch.sum(torch.stack(hidden_states3[-4:], dim=0), dim=0)[0]
embed3 = torch.mean(token_vecs3, dim=0)

more_similar = 1 - spatial.distance.cosine(embed1, embed2)
less_similar = 1 - spatial.distance.cosine(embed1, embed3)

print("Similarity between text1 and text2: ", more_similar)
print("Percentage difference to similarity between text 1 and 3: ", round((more_similar-less_similar)/more_similar*100, 2), "%")


Similarity between text1 and text2:  0.9580944776535034
Percentage difference to similarity between text 1 and 3:  4.95 %


In [41]:
# comparing the concatenation of the last four hidden layers of the three texts
token_vecs1 = torch.cat(hidden_states1[-4:], dim=2)[0]
embed1 = torch.mean(token_vecs1, dim=0)
token_vecs2 = torch.cat(hidden_states2[-4:], dim=2)[0]
embed2 = torch.mean(token_vecs2, dim=0)
token_vecs3 = torch.cat(hidden_states3[-4:], dim=2)[0]
embed3 = torch.mean(token_vecs3, dim=0)

more_similar = 1 - spatial.distance.cosine(embed1, embed2)
less_similar = 1 - spatial.distance.cosine(embed1, embed3)

print("Similarity between text1 and text2: ", more_similar)
print("Percentage difference to similarity between text 1 and 3: ", round((more_similar-less_similar)/more_similar*100, 2), "%")

Similarity between text1 and text2:  0.9562898874282837
Percentage difference to similarity between text 1 and 3:  5.05 %
