Version 2024-01-15, Arvid Lundervold


[![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MMIV-ML/ELMED219/blob/main/Lab3-GenAI/lab3-genAI-nlp-explore.ipynb)


# Lab 3 Generative AI - NLP explore (GPT-2)

We assume here that PyTorch and the Hugging Face [transformers](https://huggingface.co/docs/transformers/index) library, which provides easy access to pre-trained models and their tokenizers, are installed. If not, run the following cell:

```python
!pip install torch
!pip install transformers
```

To find the token ID of the word "patient" in a GPT model, use the tokenizer that belongs to that model. The Transformers library ships a matching tokenizer for each pre-trained model.

The following snippet uses the GPT-2 tokenizer to look up the token IDs of several words, including "patient":

In [16]:
from transformers import GPT2Tokenizer
import torch

In [18]:
# Load the GPT-2 tokenizer once and reuse it for every lookup
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def get_token_id(word):
    # Encode the word; the result is a list, since one word can map to several token IDs
    return tokenizer.encode(word, add_special_tokens=False)

# Example usage
for word in ["The", "patient", "has", "a", "fever", "."]:
    token_id = get_token_id(word)
    print(f"Token ID for '{word}': {token_id}")


Token ID for 'The': [464]
Token ID for 'patient': [26029]
Token ID for 'has': [10134]
Token ID for 'a': [64]
Token ID for 'fever': [69, 964]
Token ID for '.': [13]



To retrieve the embedding vector for a specific token (such as "patient", token ID 26029) in GPT-2, access the model's input embedding layer. This layer maps each token ID to a vector (768 dimensions for the `gpt2` model used here) that represents the token in the model's learned feature space. A word that the tokenizer splits into several tokens, such as "fever" ([69, 964]), yields one vector per token.
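
To make the idea of an embedding layer concrete, here is a small sketch using a toy `torch.nn.Embedding` with randomly initialised weights. The sizes (10 token IDs, 4 dimensions) and the IDs 3 and 7 are arbitrary choices for illustration and have nothing to do with GPT-2's real table. The sketch shows that the layer is a lookup table: each token ID selects one row of the weight matrix.

```python
import torch

torch.manual_seed(0)  # reproducible random weights

# Toy embedding table: 10 possible token IDs, 4 dimensions per token
toy_embeddings = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)

# One token ID gives one row; two token IDs give two rows
single = toy_embeddings(torch.tensor([3]))
multiple = toy_embeddings(torch.tensor([3, 7]))

print(single.shape)    # torch.Size([1, 4])
print(multiple.shape)  # torch.Size([2, 4])

# The lookup is plain row indexing into the weight matrix
same_row = torch.equal(single[0], toy_embeddings.weight[3])
print(same_row)  # True
```

The same pattern applies to GPT-2, only with a much larger, trained weight matrix.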

Here's how you can retrieve the embedding vector for the word "patient" in GPT-2 using Python and the Hugging Face Transformers library:

In [9]:
import torch
from transformers import GPT2Model, GPT2Tokenizer

def get_embedding_vector(word):
    # Load the tokenizer and model for GPT-2
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2Model.from_pretrained("gpt2")

    # Encode the word to get its token ID(s)
    token_id = tokenizer.encode(word, add_special_tokens=False)

    # Look up one embedding row per token ID; result has shape (num_tokens, 768)
    embeddings = model.get_input_embeddings()
    word_embedding = embeddings(torch.tensor(token_id, dtype=torch.long))

    return word_embedding

# Example usage
word = "patient"
embedding_vector = get_embedding_vector(word)
print(f"Embedding vector for '{word}': shape = {embedding_vector.shape} {embedding_vector}")


Embedding vector for 'patient': shape = torch.Size([1, 768]) tensor([[-1.2062e-01, -3.6002e-01,  1.5080e-01, -9.3105e-02, -6.0587e-02,
         -2.5183e-01, -3.5659e-01, -1.8158e-01, -2.6167e-01,  8.3643e-02,
          3.1463e-01,  2.2771e-01,  4.7810e-03, -1.3722e-01, -2.8546e-02,
         -8.3280e-02, -4.7087e-02, -1.6795e-01,  7.5017e-03,  8.9747e-02,
         -1.5150e-01,  1.1515e-01,  1.2865e-01, -1.0377e-01, -1.1852e-01,
         -1.0153e-01,  1.2508e-01, -1.3948e-01, -4.8302e-02,  9.5049e-02,
         -2.9984e-02, -4.5584e-02,  3.0392e-02, -1.2178e-03, -6.5477e-02,
          7.5003e-02, -3.1934e-01, -1.2183e-02,  2.3057e-01,  7.4451e-02,
         -3.0815e-01, -6.5707e-02,  1.4448e-01, -1.9877e-02, -7.7836e-02,
          1.1676e-01,  1.0865e-02, -1.6200e-01, -2.0710e-01, -2.1135e-01,
         -1.4299e-01,  9.1997e-02,  1.2531e-01, -4.3654e-02,  4.0772e-02,
         -3.0279e-01,  7.5637e-02, -1.2184e-01,  1.0579e-01,  7.6688e-02,
         -1.2078e-01, -4.7883e-02, -3.6084e-03,  ...

Keep in mind that the embedding vector is high-dimensional (768 dimensions for this GPT-2 model, as the shape in the output above shows) and that its values are learned during training. The values capture semantic and syntactic information about the token, as learned by the model from its training data.
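
One common way to compare two embedding vectors is cosine similarity, which measures the angle between them and ignores their length. The sketch below is an illustration only: the helper name `cosine_similarity` and the three hand-made 3-dimensional vectors are assumptions chosen so the result is easy to check by hand, not real GPT-2 embeddings.

```python
import torch
import torch.nn.functional as F

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two 1-D vectors, returned as a Python float."""
    return F.cosine_similarity(vec_a.unsqueeze(0), vec_b.unsqueeze(0)).item()

# Hand-made vectors for illustration only (not real GPT-2 embeddings)
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])    # same direction as a
c = torch.tensor([-3.0, 0.0, 1.0])   # orthogonal to a: 1*(-3) + 2*0 + 3*1 = 0

sim_ab = cosine_similarity(a, b)  # close to 1: same direction
sim_ac = cosine_similarity(a, c)  # close to 0: orthogonal
print(sim_ab)
print(sim_ac)
```

With the notebook's own function, a call such as `cosine_similarity(get_embedding_vector("patient")[0], get_embedding_vector("doctor")[0])` would compare the first-token embeddings of two words; the word "doctor" is just an example, and no particular result is claimed here.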