## Reference of Assignment 1

Load the MiniCPM-1B model and its corresponding tokenizer from Hugging Face.

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-1B-sft-bf16", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM-1B-sft-bf16", trust_remote_code=True)

OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like openbmb/MiniCPM-1B-sft-bf16 is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

The tokenizer encodes each sentence into a sequence of token IDs. The `add_special_tokens` parameter controls whether special tokens (such as the beginning-of-sequence token `<s>`) are included in that sequence. Setting `add_special_tokens=True` ensures that the special tokens the model expects are added to the tokenized input.

In [6]:
prompt1 = "This is NLP@THU 2024!!!"
prompt2 = "Hello World, This is NLP@THU 2024!!!"
input_ids1 = tokenizer.encode(prompt1, return_tensors='pt', add_special_tokens=True)
input_ids2 = tokenizer.encode(prompt2, return_tensors='pt', add_special_tokens=True)
print(input_ids1)
print(input_ids2)
for token_id in input_ids1[0]:
    print(f"token {token_id} = {tokenizer.decode(token_id)}")

tensor([[    1,  1900,  1410,  1515, 14574, 59469,  5612, 59401, 59320, 59349,
         59344, 59349, 59370,    73,    73,    73]])
tensor([[    1, 21045,  4178, 59342,  1900,  1410,  1515, 14574, 59469,  5612,
         59401, 59320, 59349, 59344, 59349, 59370,    73,    73,    73]])
token 1 = <s>
token 1900 = This
token 1410 = is
token 1515 = N
token 14574 = LP
token 59469 = @
token 5612 = TH
token 59401 = U
token 59320 = 
token 59349 = 2
token 59344 = 0
token 59349 = 2
token 59370 = 4
token 73 = !
token 73 = !
token 73 = !
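
To see what `add_special_tokens` changes, it can help to encode the same text twice and compare the results. The helper below is a minimal sketch: `compare_special_tokens` is a hypothetical name, not part of `transformers`, and it only assumes a tokenizer object whose `encode` method accepts `add_special_tokens`, as the tokenizer loaded above does.

```python
def compare_special_tokens(tokenizer, text):
    """Encode `text` with and without special tokens (hypothetical helper)."""
    with_special = tokenizer.encode(text, add_special_tokens=True)
    without_special = tokenizer.encode(text, add_special_tokens=False)
    # Number of IDs that exist only because special tokens were requested.
    num_added = len(with_special) - len(without_special)
    return with_special, without_special, num_added

# Example usage, assuming `tokenizer` was loaded as in the first cell:
# with_special, without_special, num_added = compare_special_tokens(
#     tokenizer, "This is NLP@THU 2024!!!"
# )
# print(with_special, without_special, num_added)
```

With the MiniCPM tokenizer, one would expect the first list to start with the `<s>` token (ID 1, as in the output above) and the second list to omit it, though this is worth confirming by running the helper.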


The following example builds word embeddings from the model's input embedding matrix. For each word, the code encodes the word with the tokenizer, looks up each token's embedding in the model's embedding layer, and stores the resulting stack of token embeddings in a dictionary keyed by word.

In [None]:
import torch

# The input embedding layer maps token IDs to vectors.
all_embeddings = model.get_input_embeddings()
words = ["apple", "appple", "chair", "boy", "peach"]
embedding_reflection = {}  # word -> tensor of shape (num_tokens, hidden_size)
for word in words:
    # Skip special tokens so that only the word's own tokens are embedded.
    tokens = tokenizer.encode(word, add_special_tokens=False)
    # Look up all token embeddings at once; no gradients are needed here.
    with torch.no_grad():
        word_embeddings = all_embeddings(torch.tensor(tokens))
    print(word_embeddings.shape)
    embedding_reflection[word] = word_embeddings

torch.Size([1, 1536])
torch.Size([2, 1536])
torch.Size([1, 1536])
torch.Size([1, 1536])
torch.Size([2, 1536])


In [5]:
import torch.nn.functional as F

for word1, embedding1 in embedding_reflection.items():
    for word2, embedding2 in embedding_reflection.items():

        # if a word is split into several tokens, simply average the token embeddings
        avg_embedding1 = torch.mean(embedding1, dim=0, keepdim=True)
        avg_embedding2 = torch.mean(embedding2, dim=0, keepdim=True)

        cosine_sim = F.cosine_similarity(avg_embedding1, avg_embedding2)
        print(f"{word1} {word2} = {cosine_sim.item()}")

apple apple = 1.0000007152557373
apple appple = 0.6285948753356934
apple chair = 0.6110924482345581
apple boy = 0.6109575033187866
apple peach = 0.6368358731269836
appple apple = 0.6285948753356934
appple appple = 1.0000003576278687
appple chair = 0.5873135924339294
appple boy = 0.5639570951461792
appple peach = 0.6722849607467651
chair apple = 0.6110924482345581
chair appple = 0.5873135924339294
chair chair = 1.0000007152557373
chair boy = 0.5883691310882568
chair peach = 0.5883922576904297
boy apple = 0.6109575033187866
boy appple = 0.5639570951461792
boy chair = 0.5883691310882568
boy boy = 1.0000008344650269
boy peach = 0.6021625399589539
peach apple = 0.6368358731269836
peach appple = 0.6722849607467651
peach chair = 0.5883922576904297
peach boy = 0.6021625399589539
peach peach = 1.0000003576278687
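
The comparison above amounts to mean pooling followed by cosine similarity. As a minimal sketch of that arithmetic, the plain-Python example below uses made-up 3-dimensional vectors (not taken from the model) and two hypothetical helpers, `mean_pool` and `cosine_similarity`, that mirror what `torch.mean` and `F.cosine_similarity` compute in the cell above.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "token embeddings", for illustration only.
single_token_word = [[1.0, 0.0, 0.0]]
two_token_word = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

pooled_a = mean_pool(single_token_word)  # [1.0, 0.0, 0.0]
pooled_b = mean_pool(two_token_word)     # [0.5, 0.5, 0.0]
similarity = cosine_similarity(pooled_a, pooled_b)
print(similarity)  # 1 / sqrt(2), roughly 0.7071
```

The self-similarities slightly above 1.0 in the output above (for example `1.0000007152557373`) are floating-point rounding effects; mathematically, the cosine similarity of a vector with itself is exactly 1.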


Another approach is to extract embeddings from the model's hidden states. These embeddings are contextual: the vector for a token depends on the surrounding text. For example, 'bank' in 'She went to the bank to open a savings account' may receive a different embedding than 'bank' in 'The children played on the bank of the river.' Hidden states are often preferred over input embeddings for building sentence embeddings, because they capture each word's meaning in its specific context and so tend to represent the whole sentence more faithfully.
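
The difference between a static lookup and a context-dependent vector can be illustrated without a model. The sketch below is a deliberately simplified toy, not how a transformer computes hidden states: the table `STATIC_TABLE`, its 2-dimensional vectors, and the helpers `static_embedding` and `toy_contextual_embedding` are all made up for illustration. The toy "contextual" vector just blends a token's own vector with the mean of its neighbours.

```python
# Made-up 2-dimensional vectors for illustration; they do not come from any model.
STATIC_TABLE = {
    "bank": [1.0, 1.0],
    "savings": [2.0, 0.0],
    "river": [0.0, 2.0],
}

def static_embedding(tokens, index):
    """Input-embedding style lookup: the surrounding tokens are ignored."""
    return list(STATIC_TABLE[tokens[index]])

def toy_contextual_embedding(tokens, index):
    """Toy stand-in for a hidden state: blend the token's own vector
    with the mean vector of the other tokens in the sentence."""
    own = STATIC_TABLE[tokens[index]]
    others = [STATIC_TABLE[t] for i, t in enumerate(tokens) if i != index]
    if not others:
        return list(own)
    context = [sum(column) / len(others) for column in zip(*others)]
    return [0.5 * o + 0.5 * c for o, c in zip(own, context)]

money_sentence = ["savings", "bank"]
river_sentence = ["river", "bank"]

# Static lookup: identical vectors for "bank" in both sentences.
static_money = static_embedding(money_sentence, 1)
static_river = static_embedding(river_sentence, 1)

# Toy contextual version: the vectors differ because the neighbours differ.
context_money = toy_contextual_embedding(money_sentence, 1)
context_river = toy_contextual_embedding(river_sentence, 1)

print(static_money == static_river)    # True
print(context_money == context_river)  # False
```

In a real model the mixing is learned and far richer, but the same property holds: the input embedding of a token is fixed, while its hidden state changes with the text around it.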

In [None]:
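import torch

def get_token_embedding(text="hello"):
    # Encode without special tokens so each row corresponds to a token of the text.
    inputs = tokenizer.encode(text, return_tensors="pt", add_special_tokens=False)

    # Run a forward pass without tracking gradients and request all hidden states.
    with torch.no_grad():
        outputs = model(inputs, output_hidden_states=True)

    # Take the final entry of hidden_states, shape (batch, num_tokens, hidden_size),
    # and drop the batch dimension.
    last_hidden_state = outputs.hidden_states[-1]
    token_embeddings = last_hidden_state[0]

    return token_embeddings

words = ["apple", "appple", "chair", "boy", "peach"]
embedding_reflection = {}

# Embed each word separately and store its token-level hidden states.
for word in words:
    word_embeddings = get_token_embedding(word)
    print(word_embeddings.shape)
    embedding_reflection[word] = word_embeddings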

torch.Size([1, 1536])
torch.Size([2, 1536])
torch.Size([1, 1536])
torch.Size([1, 1536])
torch.Size([2, 1536])


In [13]:
import torch.nn.functional as F

for word1, embedding1 in embedding_reflection.items():
    for word2, embedding2 in embedding_reflection.items():

        # if a word is split into several tokens, simply average the token embeddings
        avg_embedding1 = torch.mean(embedding1, dim=0, keepdim=True)
        avg_embedding2 = torch.mean(embedding2, dim=0, keepdim=True)

        cosine_sim = F.cosine_similarity(avg_embedding1, avg_embedding2)
        print(f"{word1} {word2} = {cosine_sim.item()}")

apple apple = 0.9999997615814209
apple appple = 0.701971173286438
apple chair = 0.9338223934173584
apple boy = 0.9330370426177979
apple peach = 0.6833229064941406
appple apple = 0.701971173286438
appple appple = 0.9999998807907104
appple chair = 0.7025620937347412
appple boy = 0.7053989171981812
appple peach = 0.8618935346603394
chair apple = 0.9338223934173584
chair appple = 0.7025620937347412
chair chair = 0.9999998211860657
chair boy = 0.9455156922340393
chair peach = 0.6840670704841614
boy apple = 0.9330370426177979
boy appple = 0.7053989171981812
boy chair = 0.9455156922340393
boy boy = 0.9999997019767761
boy peach = 0.6906505823135376
peach apple = 0.6833229064941406
peach appple = 0.8618935346603394
peach chair = 0.6840670704841614
peach boy = 0.6906505823135376
peach peach = 0.9999998807907104
