In this notebook, the Galactica 'mini' model is altered for F-term prediction.

# Notes

- Already-embedded tokens can be passed to the model directly, as a keyword argument in the forward pass.
    - This might make it possible to embed the labels.
    
- Meta provides a class, designed specifically for sequence classification, into which Galactica can (presumably) be loaded.

- The parent class of OPTFor... is PreTrainedModel; the input embeddings can be defined there.

- The model can be loaded in 8-bit with the keyword argument load_in_8bit.

- New tokens can be added to the tokenizer's vocabulary with tokenizer.add_tokens([token1,..]).
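The first note (passing already-embedded tokens in the forward pass) can be tried out offline on a tiny, randomly initialized OPT model. This is only a sketch: the configuration values and the name `toy_model` are made up for illustration and are not the Galactica settings. It relies on the `inputs_embeds` keyword of the Hugging Face forward pass.

```python
import torch
from transformers import OPTConfig, OPTForCausalLM

torch.manual_seed(0)

# Tiny, randomly initialized OPT model (hypothetical sizes, no download needed)
config = OPTConfig(
    vocab_size=100,
    hidden_size=16,
    word_embed_proj_dim=16,
    ffn_dim=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    max_position_embeddings=32,
)
toy_model = OPTForCausalLM(config).eval()  # eval() switches dropout off

input_ids = torch.tensor([[5, 17, 42, 8]])

with torch.no_grad():
    # Regular call: the model embeds the token ids itself
    logits_from_ids = toy_model(input_ids=input_ids).logits
    # Alternative call: embed the tokens first, then hand over the vectors
    embedded = toy_model.get_input_embeddings()(input_ids)
    logits_from_embeds = toy_model(inputs_embeds=embedded).logits

same = torch.allclose(logits_from_ids, logits_from_embeds, atol=1e-6)
print(logits_from_embeds.shape, same)
```

If both calls agree, label embeddings could in principle be computed outside the model and passed in the same way.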


# Imports

In [1]:
import sys
import os
sys.path.append(os.path.abspath(".."))
from Masterarbeit_utils import model_utils, dataset_utils
import psutil
import torch
import inspect
from transformers import AutoTokenizer, OPTForCausalLM, OPTForSequenceClassification

default_dtype = torch.float16
# If you change 'default_device' to 'cpu', make sure to set num_gpus to zero in the model configuration
#default_device = 'cuda:0'
default_device = 'cpu'



# Downloading the Naked Model

In [2]:
# A dict to map the correct model urls
HF_MAPPING = {
    "mini": ("facebook/galactica-125m", torch.float32),
    "base": ("facebook/galactica-1.3b", torch.float32),
    "standard": ("facebook/galactica-6.7b", torch.float32),
    "large": ("facebook/galactica-30b", torch.float32),
    "huge": ("facebook/galactica-120b", torch.float16)
}

# Configuration of the model
model_name = 'mini'
dtype = default_dtype
tensor_parallel = False
device_map = None
# Set to zero if you use the cpu as default device
num_gpus = 1
if default_device == 'cpu':
    num_gpus = 0
    default_dtype = torch.float32
    dtype = default_dtype

# All new torch objects will have this dtype
torch.set_default_dtype(default_dtype)
# Analyzing the system (code by huggingface)
max_memory = {}
if num_gpus > 0 and not tensor_parallel:
    # based on https://github.com/huggingface/accelerate/blob/5315290b55ea9babd95a281a27c51d87b89d7c85/src/accelerate/utils/modeling.py#L274
    for i in range(num_gpus):
        _ = torch.tensor([0], device=i)
    for i in range(num_gpus):
        max_memory[i] = torch.cuda.mem_get_info(i)[0]
    device_map = "auto"
max_memory["cpu"] = psutil.virtual_memory().available

# Loading the model from the web / from the cache
model = OPTForCausalLM.from_pretrained(HF_MAPPING[model_name][0], torch_dtype=dtype, low_cpu_mem_usage=True, device_map=device_map, max_memory=max_memory)

Downloading (…)lve/main/config.json: 100%|█████| 787/787 [00:00<00:00, 7.88MB/s]
Downloading model.safetensors: 100%|█████████| 250M/250M [00:22<00:00, 11.2MB/s]
Downloading (…)neration_config.json: 100%|█████| 137/137 [00:00<00:00, 1.92MB/s]


# Loading the Tokenizer

In [3]:
tokenizer = AutoTokenizer.from_pretrained(HF_MAPPING[model_name][0])

Downloading (…)/main/tokenizer.json: 100%|█| 2.14M/2.14M [00:00<00:00, 5.71MB/s]


# Testing the Model

In [4]:
# Input Text
text = 'Good morning Mr.'
# Convert text to tokens
tokens  = tokenizer(text, return_tensors='pt').input_ids
print(f'Output of Tokenizer: {tokens}')
# Model generating the predicted output tokens
out = model.generate(tokens.to(default_device), max_length=30)
# Decoding the tokens

out = tokenizer.decode(out[0])
out

Output of Tokenizer: tensor([[34848, 16810, 14782,    36]])


'Good morning Mr. H. S. (1920), "The Greatest Man in the World", The New York'

# Extract Token Embedding

In [5]:
token_embedding = model.get_input_embeddings()

print(f'''The model uses an nn.Embedding instance as token embedding. 

It has a dict-size of <{token_embedding.num_embeddings}>.
It has an embedding dimension of <{token_embedding.embedding_dim}>
and a padding index of <{token_embedding.padding_idx}>.

The weights have a dtype of <{token_embedding.weight.dtype}>
and are on device <{token_embedding.weight.device}>''')

The model uses an nn.Embedding instance as token embedding. 

It has a dict-size of <50000>.
It has an embedding dimension of <768>
and a padding index of <1>.

The weights have a dtype of <torch.float32>
and are on device <cpu>


# Creating a Custom Token and F-Term Embedding

### Reusing the weights of the original embedding to partially initialize a larger embedding instance

In [6]:
def create_embedding(original_embedding: torch.nn.Embedding, n_f_terms: int, dtype: torch.dtype=default_dtype) -> torch.nn.Embedding:
    """
    Takes the original embedding of an OPT model (an nn.Embedding instance) and the
    number of F-terms to embed (n_f_terms), and creates a new embedding in which freshly
    initialized weights for all F-terms are appended after the pretrained weights of the
    original tokens. The rows of the original tokens stay unchanged.
    
    returns: torch.nn.Embedding
    """
    # calculating the parameters of the new embedding instance
    embedding_dim = original_embedding.embedding_dim
    num_embeddings = original_embedding.num_embeddings + n_f_terms
    padding_idx = original_embedding.padding_idx
    
    # creating the new embedding (completely untrained)
    embedding = torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx)
    # extracting the weights of the original pretrained embedding
    old_weights = original_embedding.weight
    new_weights = embedding.weight
    
    # replacing the first rows of the new weights with the old weights to retain the ability to encode natural-language tokens
    # (the result is cast to the requested dtype)
    embedding.weight = torch.nn.Parameter(
                        torch.cat([old_weights.clone().to(default_device),
                                   new_weights[original_embedding.num_embeddings:].clone().to(default_device)],
                                  0).to(dtype))
    return embedding
    
new_embeddings = create_embedding(token_embedding, 360000)
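The idea behind this cell can be checked on a toy scale. The sketch below uses a hypothetical helper `extend_embedding` (not the notebook's `create_embedding`) and made-up sizes. It copies the old rows in place instead of concatenating, and verifies that lookups of the original token ids are unchanged after the extension.

```python
import torch

def extend_embedding(original: torch.nn.Embedding, n_new: int) -> torch.nn.Embedding:
    """Hypothetical helper: grow an embedding by n_new rows, keeping the old rows."""
    extended = torch.nn.Embedding(
        original.num_embeddings + n_new,
        original.embedding_dim,
        padding_idx=original.padding_idx,
    )
    # Overwrite the first rows with the pretrained weights; the rest stays random
    with torch.no_grad():
        extended.weight[: original.num_embeddings] = original.weight
    return extended

toy_original = torch.nn.Embedding(10, 4, padding_idx=1)
toy_extended = extend_embedding(toy_original, n_new=6)

# The first 10 rows must be identical to the original weights
rows_preserved = torch.equal(toy_extended.weight[:10], toy_original.weight)

# Looking up original token ids must give the same vectors as before
ids = torch.tensor([[0, 3, 9]])
same_lookup = torch.equal(toy_extended(ids), toy_original(ids))

print(toy_extended.weight.shape, rows_preserved, same_lookup)
```

The same two checks could be run on `new_embeddings` and `token_embedding` to confirm that the natural-language rows survived.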

### Adding the new embedding to the original model

In [7]:
# Replacing the old embedding instance with the new embedding instance in the model instance
model.set_input_embeddings(new_embeddings)

# testing using known (natural language) tokens
out = model.generate(tokens.to(default_device), max_length=30)
out = tokenizer.decode(out[0])
print(f'Prompt: {text}\n\n----------------------------------------------------\n\nOutput:{out}', '\n\n')
print('======================================================================================================')

# Testing using random token ids drawn from the whole enlarged vocabulary
# (50000 original tokens + 360000 ids reserved for f-terms = 410000)
random_tokens = torch.randint(410000, [1, 50], device=default_device)
random_out = model.generate(random_tokens, max_length=100)
# Note: the prompt is 50 tokens long, so the split at index 30 falls inside the prompt.
# The first slice holds only part of the prompt, the second its remainder plus the generated tokens.
random_input_out = tokenizer.decode(random_out[0][:30])
random_generated_out = tokenizer.decode(random_out[0][30:])
print(f'Prompt "translated" by model: {random_input_out}\n\n----------------------------------------------------\n\nOutput: {random_generated_out}')

Prompt: Good morning Mr.

----------------------------------------------------

Output:Good morning Mr. H. S. (1920), "The Greatest Man in the World", The New York 


Prompt "translated" by model: evaluationSCI

----------------------------------------------------

Output: NP queen Amer [START_REF] A new method for the first-order statistical model for the estimation of the critical value of the critical value of the critical value of the critical value of the critical value of the critical value of the critical value of the critical value of the critical


# Creating a Custom Classification-Head

In [8]:
# extracting the old classification head from the model
old_classification_head = model.get_output_embeddings()
# analyzing the old classification head
print(f'''
The old classification head is an instance of {type(old_classification_head)},
it has <{old_classification_head.in_features}> input features 
and <{old_classification_head.out_features}> output features.
The weights have a dtype of <{old_classification_head.weight.dtype}>
and are on device <{old_classification_head.weight.device}>''')


The old classification head is an instance of <class 'torch.nn.modules.linear.Linear'>,
it has <768> input features 
and <50000> output features.
The weights have a dtype of <torch.float32>
and are on device <cpu>


In [9]:
def create_new_classification_head(n_f_terms: int, model_dim:int) -> torch.nn.Linear:
    """
    Creates a new classification head for the model
    
    This classification head will be a new linear layer with 'model_dim' input features and 'n_f_terms' output features
    """
    return torch.nn.Linear(in_features=model_dim, out_features=n_f_terms, bias=False).to(default_device)


In [10]:
# creating the new classification head
new_classification_head = create_new_classification_head(360000, 768)

print(f"""

The new classification head has <{new_classification_head.in_features}> input features
and <{new_classification_head.out_features}> output features.

Its weights are in dtype <{new_classification_head.weight.dtype}>
and on device <{new_classification_head.weight.device}>
""")



The new classification head has <768> input features
and <360000> output features.

Its weights are in dtype <torch.float32>
and on device <cpu>
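A quick back-of-the-envelope calculation shows how large the new layers are compared to the 'mini' model itself. The sketch below only uses the sizes printed in this notebook (768 model dimensions, 50000 original tokens, 360000 F-terms); the helper name `parameter_bytes` is made up, and gradients and optimizer state are not counted.

```python
def parameter_bytes(rows: int, cols: int, bytes_per_value: int) -> int:
    """Memory of a dense rows x cols weight matrix in bytes."""
    return rows * cols * bytes_per_value

n_f_terms = 360_000
vocab_size = 50_000
model_dim = 768

# New classification head: model_dim -> n_f_terms, no bias
head_fp32 = parameter_bytes(n_f_terms, model_dim, 4)
head_fp16 = parameter_bytes(n_f_terms, model_dim, 2)
# Enlarged input embedding: original tokens plus F-terms
embedding_fp32 = parameter_bytes(vocab_size + n_f_terms, model_dim, 4)

for name, size in [
    ("classification head (float32)", head_fp32),
    ("classification head (float16)", head_fp16),
    ("input embedding (float32)", embedding_fp32),
]:
    print(f"{name}: {size / 2**30:.2f} GiB")
```

In float32, each of the two layers alone takes more than 1 GiB, which is worth keeping in mind when choosing the dtype and the device.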



### Adding the new classification head to the model

In [11]:
def add_classification_head(_model: OPTForCausalLM, classification_head: torch.nn.Linear) -> OPTForCausalLM:
    """
    Attaches the new classification head to the pretrained model.
    
    _model: instantiated OPTForCausalLM model
    classification_head: new classification head for the model
    """
    
    # changing the configuration of the model
    vocab_size = classification_head.out_features
    _model.config.vocab_size = vocab_size
    _model.model.decoder.vocab_size = vocab_size
    _model.num_labels = vocab_size
    _model.config.num_labels = vocab_size
    
    # adding the classification head to the model
    _model.set_output_embeddings(classification_head)
    return _model
    

In [12]:
# Replacing the old with the new classification head
model = add_classification_head(model, new_classification_head)

# Testing the model
# cloning the tokens (could lead to a CUDA error otherwise)
ip = random_tokens.clone()

# running a forward pass to get the output logits
random_out = model(ip, return_dict=True)

print(f'''
Input tokens shape: <{random_tokens.shape}>
Output predictions shape: <{random_out['logits'].shape}>''')


Input tokens shape: <torch.Size([1, 50])>
Output predictions shape: <torch.Size([1, 50, 360000])>


# End of the main notebook; the rest are just small experiments

In [13]:
# recreating the loss function from the transformers module
# Shift so that tokens < n predict n
logits = random_out['logits']
labels = torch.randint(360000, [1, 50])

print(logits.shape, labels.shape)

shift_logits = logits[..., :-1, :].contiguous()

print(shift_logits.shape)

shift_labels = labels[..., 1:].contiguous()

print(shift_logits.shape, shift_labels.shape)

# Flatten the tokens

loss_fct = torch.nn.CrossEntropyLoss()
print(shift_logits.view(-1, 360000), shift_labels.view(-1))
print(shift_logits.view(-1, 360000).shape, shift_labels.view(-1).shape)
loss = loss_fct(shift_logits.view(-1, 360000).type(torch.float32), shift_labels.view(-1))
loss

torch.Size([1, 50, 360000]) torch.Size([1, 50])
torch.Size([1, 49, 360000])
torch.Size([1, 49, 360000]) torch.Size([1, 49])
tensor([[ 2.8501e-01,  4.1437e-01,  1.3241e+00,  ..., -8.7459e-01,
          7.9609e-01,  2.0880e-01],
        [ 1.6861e-01,  1.9658e-01,  8.9533e-01,  ..., -6.9590e-01,
          9.6748e-01,  3.0396e-01],
        [-7.0320e-03,  1.6316e-01,  9.8075e-01,  ..., -8.6247e-01,
          1.0221e+00,  4.0162e-01],
        ...,
        [ 2.8422e-01,  4.9986e-04,  5.8435e-01,  ..., -1.0092e+00,
          9.7533e-01,  5.2518e-02],
        [-1.7185e-02,  4.8661e-01,  1.1776e+00,  ..., -4.2681e-01,
          4.5091e-01,  2.6646e-01],
        [-2.9576e-01,  7.0435e-01,  8.4945e-01,  ..., -3.8210e-01,
          4.0003e-01,  4.8613e-01]], grad_fn=<ViewBackward0>) tensor([ 32245, 308784, 205593, 320644, 190773, 197299, 349046,  89440, 189537,
        303807, 119482, 238300, 308918, 336238,  72096, 171501, 269457, 114848,
         43504, 310244, 122330, 112113, 281888, 299725, 166

tensor(12.9740, grad_fn=<NllLossBackward0>)
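As a plausibility check for this value: a classifier that spreads its probability uniformly over C classes has a cross-entropy of ln(C). The sketch below compares that baseline for C = 360000 with the loss printed above. It is only a sanity check under the assumption that the untrained head behaves roughly like a uniform predictor on random labels.

```python
import math

n_classes = 360_000          # number of output features of the new head
observed_loss = 12.9740      # value printed by the cell above

# Cross-entropy of a perfectly uniform prediction over n_classes
uniform_loss = math.log(n_classes)

print(f"uniform baseline: {uniform_loss:.4f}")
print(f"observed loss:    {observed_loss:.4f}")
print(f"difference:       {observed_loss - uniform_loss:.4f}")
```

The observed loss lies slightly above the uniform baseline, which is what one would expect from a randomly initialized head whose logits are not exactly equal.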

In [None]:
optimizer = torch.optim.SGD(model.model.parameters(), lr=0.1, momentum=0.9)

In [None]:
torch.set_printoptions(precision=30)

for p in model.model.parameters():
    print(p.grad)

In [None]:
loss.backward()


In [None]:
token_embedding.embedding_dim

In [None]:
optimizer.step()

In [None]:
for p in model.model.parameters():
    #print(torch.mean(p).item())
    print(p.grad)

# Problems and Questions

- Is the loss computed inside the model, or do I have the model return the hidden states and compute the loss separately?
    - probably better separately
- Do I simply redefine the token embedding inside the model, or do I pass already-embedded tokens to the model?

- How do I mask out the loss for my input sequence?

- Which combination of start_sentence, stop_sentence and padding_tokens should I use?
    - Left or right padding?
         - Probably left
         

- Do I store the entire dataset in token form?
    - padded or not?
        - padding during batch creation?
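One possible answer to the masking and padding questions, as a sketch only: pad on the left while the batch is built, and give every prompt and padding position the label -100, which `torch.nn.CrossEntropyLoss` ignores by default. The function `collate_left_padded`, the toy token ids and the random stand-in logits are assumptions for illustration; the padding id 1 is the padding index reported for the embedding above.

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
PAD_ID = 1           # padding index reported for the token embedding

def collate_left_padded(examples, pad_id=PAD_ID):
    """Hypothetical collate function.

    examples: list of (prompt_ids, target_ids) pairs, each a list of ints.
    Pads on the left to the longest sequence in the batch. Padding and
    prompt positions get IGNORE_INDEX as label, so only targets are scored.
    """
    max_len = max(len(p) + len(t) for p, t in examples)
    input_ids, attention_mask, labels = [], [], []
    for prompt, target in examples:
        n_pad = max_len - len(prompt) - len(target)
        input_ids.append([pad_id] * n_pad + prompt + target)
        attention_mask.append([0] * n_pad + [1] * (len(prompt) + len(target)))
        labels.append([IGNORE_INDEX] * (n_pad + len(prompt)) + target)
    return (torch.tensor(input_ids),
            torch.tensor(attention_mask),
            torch.tensor(labels))

# Two toy examples of different length: (prompt tokens, target tokens)
batch = [([5, 6, 7], [40, 41]), ([8], [42])]
input_ids, attention_mask, labels = collate_left_padded(batch)

# Stand-in logits; a real model would compute them from input_ids
vocab = 50
logits = torch.randn(input_ids.shape[0], input_ids.shape[1], vocab)

# Same shift as in the experiment above: position n predicts token n + 1
shift_logits = logits[:, :-1, :].reshape(-1, vocab)
shift_labels = labels[:, 1:].reshape(-1)

loss = torch.nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)(shift_logits, shift_labels)
n_scored = int((shift_labels != IGNORE_INDEX).sum())
print(input_ids, labels, loss, n_scored, sep="\n")
```

With this layout the dataset can be stored unpadded in token form, since padding only happens when a batch is assembled.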