<a href="https://colab.research.google.com/github/Joey-tpop/TRF_semantic/blob/main/text_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [87]:
!git clone https://github.com/Joey-tpop/Semantic_dissimilarity.git
%cd Semantic_dissimilarity

!git branch
!git checkout -b main
!git add README.md
!git commit -m "Initial commit on main"
!git checkout -b main
!git push -u origin main

Cloning into 'Semantic_dissimilarity'...
/content/Semantic_dissimilarity/Semantic_dissimilarity/Semantic_dissimilarity/Semantic_dissimilarity/Semantic_dissimilarity/Semantic_dissimilarity/Semantic_dissimilarity
Switched to a new branch 'main'
fatal: pathspec 'README.md' did not match any files
On branch main

Initial commit

nothing to commit (create/copy files and use "git add" to track)
Switched to a new branch 'main'
error: src refspec main does not match any
error: failed to push some refs to 'https://github.com/Joey-tpop/Semantic_dissimilarity.git'

**Method 1: Transformer-based Sentence Embedding**

- Captures sentence-level context well
- Poorly suited to measuring word surprisal: swapping the final word ("homework" vs. "banana") barely changes any of the similarity metrics

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
from sentence_transformers import SentenceTransformer, util
from scipy.spatial.distance import cosine
from scipy.spatial.distance import euclidean
from scipy.stats import pearsonr
import numpy as np

# Load a compact pretrained sentence-embedding model (384-dimensional output)
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
sentences = ["I didn't do my homework", "I didn't do my"]
embeddings = model.encode(sentences)

cos_similarity = util.cos_sim(embeddings[0], embeddings[1])
l2_distance = euclidean(embeddings[0], embeddings[1])
corr, _ = pearsonr(embeddings[0], embeddings[1])

print("Cosine similarity:", cos_similarity.item())
print("l2 distance:", l2_distance)
print("pearson correlation: ", corr)

Cosine similarity: 0.5510249733924866
l2 distance: 0.9476023316383362
pearson correlation:  0.5507165


In [None]:
sentences = ["I didn't do my banana", "I didn't do my"]
embeddings = model.encode(sentences)

cos_similarity = util.cos_sim(embeddings[0], embeddings[1])
l2_distance = euclidean(embeddings[0], embeddings[1])
corr, _ = pearsonr(embeddings[0], embeddings[1])

print("Cosine similarity:", cos_similarity.item())
print("l2 distance:", l2_distance)
print("pearson correlation: ", corr)

Cosine similarity: 0.5493135452270508
l2 distance: 0.9494067430496216
pearson correlation:  0.5489815
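
In both pairs the reported L2 distance and cosine similarity carry the same information. The printed values are consistent with unit-length embeddings, for which d = sqrt(2 * (1 - cos)). The sketch below checks that relation against the numbers printed above using only the standard library; the helper name `l2_from_cosine` is illustrative and not part of any library.

```python
import math


def l2_from_cosine(cos_sim: float) -> float:
    """L2 distance between two unit-length vectors with the given cosine similarity."""
    return math.sqrt(2.0 * (1.0 - cos_sim))


# (cosine similarity, L2 distance) copied from the two cell outputs above
reported = {
    "homework": (0.5510249733924866, 0.9476023316383362),
    "banana": (0.5493135452270508, 0.9494067430496216),
}

for word, (cos_sim, l2) in reported.items():
    implied = l2_from_cosine(cos_sim)
    print(f"{word}: L2 implied by cosine = {implied:.4f}, reported L2 = {l2:.4f}")

# Change in cosine similarity caused by swapping only the final word
cos_gap = reported["homework"][0] - reported["banana"][0]
print(f"cosine gap between the two endings: {cos_gap:.4f}")
```

The gap between the two endings is under 0.002 in cosine similarity, which is why Method 1 gives only a weak signal for single-word surprisal.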


**Method 2: Logit-based Measurement**

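- Uses the full preceding context through a causal language model (GPT-2)
- Well suited to measuring word surprisal: the next-token distribution assigns a probability to each candidate word directly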
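
As a minimal sketch of the quantity being measured: surprisal is the negative log probability of a token given its context, surprisal = -log p(token | context), where p comes from a softmax over the model's logits. The example below uses made-up logits over a four-token toy vocabulary rather than GPT-2, and the function names are illustrative assumptions. The multi-token helper shows one possible way to score a word that the tokenizer splits into several tokens (summing per-token surprisals by the chain rule); it is not what the notebook cell does.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a single vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def surprisal_nats(logits: Sequence[float], token_id: int) -> float:
    """Surprisal (in nats) of one token under the distribution given by logits."""
    return -log_softmax(logits)[token_id]


def sequence_surprisal_nats(
    step_logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> float:
    """Surprisal of a multi-token word: sum of per-token surprisals (chain rule).

    step_logits[i] must be the logits produced after the context plus the
    first i tokens of the word.
    """
    return sum(surprisal_nats(l, t) for l, t in zip(step_logits, token_ids))


# Made-up logits over a 4-token toy vocabulary (not GPT-2 output)
toy_logits = [2.0, 0.5, -1.0, 0.0]
for token_id in range(len(toy_logits)):
    print(token_id, round(surprisal_nats(toy_logits, token_id), 3))
```

Higher logits give lower surprisal, and the probabilities recovered as exp(-surprisal) sum to 1 over the vocabulary.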

In [None]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch.nn.functional as F

# 1. Load the tokenizer (text -> token IDs) and the language model (next-token prediction)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()  # evaluation mode: disables dropout

# 2. Tokenize the context sentence
sentence = "I didn't do my"
inputs = tokenizer(sentence, return_tensors="pt")  # "pt": return PyTorch tensors

# 3. Run the context through the model and obtain the logits
with torch.no_grad():  # no gradient tracking (saves memory and time)
    outputs = model(**inputs)
    # logits has shape [batch_size, seq_len, vocab_size];
    # the next-token prediction sits at the last position, logits[:, -1, :]
    logits = outputs.logits

# 4. Target words whose surprisal we measure as the next word
# Note: only the first sub-token of each word is scored, and each word is encoded
# without a leading space, so it is not the token GPT-2 uses for a word that follows a space
target_words = ["homework", "banana", "printer", "cup", "work"]
target_token_ids = [tokenizer.encode(word, add_special_tokens=False)[0] for word in target_words]

# 5. Probability and surprisal of each target token (in nats, since torch.log is the natural log)
probs = F.softmax(logits, dim=-1)
target_prob = probs[:, -1, target_token_ids]
surprisal = -torch.log(target_prob)
print(surprisal)

tensor([[16.2866, 19.2577, 17.9780, 21.5103, 14.0107]])
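
To make the printed tensor easier to read, the sketch below pairs each target word with its reported surprisal, converts nats to bits (dividing by ln 2), and sorts from least to most surprising. The numbers are copied from the output above; the variable names are illustrative.

```python
import math

target_words = ["homework", "banana", "printer", "cup", "work"]
# Surprisal values in nats, copied from the tensor printed above
reported_nats = [16.2866, 19.2577, 17.9780, 21.5103, 14.0107]

# 1 nat = 1 / ln(2) bits
reported_bits = [s / math.log(2) for s in reported_nats]

# Sort words from least to most surprising
ranked = sorted(
    zip(target_words, reported_nats, reported_bits), key=lambda item: item[1]
)
for word, nats, bits in ranked:
    print(f"{word:>8}: {nats:7.4f} nats = {bits:7.4f} bits")
```

Here "homework" comes out about 3 nats less surprising than "banana", a much clearer separation than the similarity metrics of Method 1 gave. These values score only the first sub-token of each word, encoded without a leading space, so they are best read as a rough comparison.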
