In [1]:
import vec2text
corrector = vec2text.load_corrector("gtr-base")


In [2]:
vec2text.invert_strings(
    [
        "Jack Morris is a PhD student at Cornell Tech in New York City",
        "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity"
    ],
    corrector=corrector,
    num_steps=20,
)

['Jack Morris Morris is a PhD student at  Cornell Tech in New York City ',
 'It was the best of times, it was the worst of times, it was the age of wisdom, it was the epoch of foolishness']
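
The inversions above are close to the inputs but not exact: the first repeats "Morris" and the second drops the tail of the sentence. One rough way to put a number on that is a token-level F1 score over whitespace-separated words. The sketch below is only an illustration; `token_f1` is a hypothetical helper and is not part of vec2text.

```python
from collections import Counter


def token_f1(reference: str, hypothesis: str) -> float:
    """Token-level F1 between two strings, splitting on whitespace."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    # Multiset intersection: each token counts at most as often as it
    # appears in both strings.
    overlap = sum((Counter(ref_tokens) & Counter(hyp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


original = "Jack Morris is a PhD student at Cornell Tech in New York City"
inverted = "Jack Morris Morris is a PhD student at  Cornell Tech in New York City "
score = token_f1(original, inverted)
print(f"{score:.3f}")  # 0.963
```

Splitting on whitespace ignores the stray spaces in the inverted string, so the only penalty here comes from the duplicated "Morris".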

In [4]:
import vec2text
import torch
from transformers import AutoModel, AutoTokenizer, PreTrainedTokenizer, PreTrainedModel


def get_gtr_embeddings(text_list,
                       encoder: PreTrainedModel,
                       tokenizer: PreTrainedTokenizer) -> torch.Tensor:
    """Embed a list of strings with the GTR encoder, one pooled vector per string."""
    # Pad or truncate every input to exactly 128 tokens.
    inputs = tokenizer(text_list,
                       return_tensors="pt",
                       max_length=128,
                       truncation=True,
                       padding="max_length",).to("cuda")

    with torch.no_grad():
        model_output = encoder(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
        hidden_state = model_output.last_hidden_state
        # Pool the per-token states into one vector per input; the attention mask marks padding.
        embeddings = vec2text.models.model_utils.mean_pool(hidden_state, inputs['attention_mask'])

    return embeddings


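# Only the encoder half of the T5 model is kept, so a warning about newly
# initialized decoder weights does not affect the embeddings.
encoder = AutoModel.from_pretrained("sentence-transformers/gtr-t5-base").encoder.to("cuda")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/gtr-t5-base")
corrector = vec2text.load_corrector("gtr-base")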

embeddings = get_gtr_embeddings([
       "Jack Morris is a PhD student at Cornell Tech in New York City",
       "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity"
], encoder, tokenizer)

vec2text.invert_embeddings(
    embeddings=embeddings.cuda(),
    corrector=corrector,
    num_steps=20,
)


Some weights of T5Model were not initialized from the model checkpoint at sentence-transformers/gtr-t5-base and are newly initialized: ['decoder.block.5.layer.0.SelfAttention.o.weight', 'decoder.block.4.layer.1.layer_norm.weight', 'decoder.block.6.layer.0.SelfAttention.v.weight', 'decoder.block.10.layer.2.DenseReluDense.wi.weight', 'decoder.block.8.layer.1.EncDecAttention.q.weight', 'decoder.block.2.layer.2.layer_norm.weight', 'decoder.block.0.layer.1.EncDecAttention.v.weight', 'decoder.block.7.layer.2.DenseReluDense.wi.weight', 'decoder.block.8.layer.0.SelfAttention.q.weight', 'decoder.block.8.layer.1.EncDecAttention.v.weight', 'decoder.block.11.layer.1.EncDecAttention.o.weight', 'decoder.block.4.layer.1.EncDecAttention.v.weight', 'decoder.block.3.layer.1.EncDecAttention.k.weight', 'decoder.block.4.layer.1.EncDecAttention.q.weight', 'decoder.block.2.layer.2.DenseReluDense.wo.weight', 'decoder.block.3.layer.2.layer_norm.weight', 'decoder.block.4.layer.2.layer_norm.weight', 'decoder.blo

['Jack Morris Morris is a PhD student at  Cornell Tech in New York City ',
 'It was the best of times, it was the worst of times, it was the age of wisdom, it was the epoch of foolishness']
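
`get_gtr_embeddings` turns the per-token hidden states into one vector per input by mean pooling with the attention mask. The plain-Python sketch below shows the usual form of masked mean pooling, assuming that positions with mask 0 (padding) are excluded from the average. It is not the vec2text implementation, and `masked_mean_pool` is a hypothetical name used only for illustration.

```python
from typing import List


def masked_mean_pool(hidden_states: List[List[List[float]]],
                     attention_mask: List[List[int]]) -> List[List[float]]:
    """Average token vectors per sequence, skipping positions where mask == 0."""
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        total = [0.0] * len(token_vectors[0])
        for vector, keep in zip(token_vectors, mask):
            if keep:
                for i, value in enumerate(vector):
                    total[i] += value
        # Divide by the number of real (unmasked) tokens, not the padded length.
        count = sum(mask)
        pooled.append([value / count for value in total])
    return pooled


# Two sequences of three tokens with hidden size 2; masked tokens are padding.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[2.0, 0.0], [50.0, 50.0], [50.0, 50.0]],
]
mask = [[1, 1, 0], [1, 0, 0]]
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [[2.0, 3.0], [2.0, 0.0]]
```

The padded positions hold large values on purpose: they leave the result unchanged, which is why padding every input to 128 tokens should not distort the embedding under this assumption.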

In [22]:
embeddings = get_gtr_embeddings([
       "I like video games, my favorite is breath of the wild.",
       "I want to kill myself a lot."
], encoder, tokenizer)
print(embeddings)

vec2text.invert_embeddings(
    embeddings=embeddings.cuda(),
    corrector=corrector,
    num_steps=20,
)

tensor([[ 0.0245, -0.0088, -0.0450,  ..., -0.0684,  0.0100, -0.0705],
        [ 0.0051,  0.0117, -0.0247,  ..., -0.0586,  0.0657, -0.0203]],
       device='cuda:0')


['I like video games, my favorite is          breath of the wild.',
 'I want to kill myself a lot. I want to kill myself a lot. I feel induced by psychedelic booze.']

In [34]:
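import numpy as np

# Blend from the first embedding (alpha=0.0) toward the second and invert each mix.
# The range stops at alpha=0.9, so the unmixed second embedding is not included.
for alpha in np.arange(0.0, 1.0, 0.1):
    mixed_embedding = torch.lerp(input=embeddings[0], end=embeddings[1], weight=alpha)
    text = vec2text.invert_embeddings(
        embeddings=mixed_embedding[None].cuda(),  # [None] adds a batch dimension
        corrector=corrector,
        num_steps=20,
        sequence_beam_width=4,
    )[0]
    print(f'alpha={alpha:.1f}\t', text)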

alpha=0.0	 I like video games, my favorite is          breath of the wild.
alpha=0.1	 I like video games, my favorite is           breath of wild
alpha=0.2	 I like video games. My favorite is           of the wild
alpha=0.3	 I like video games. My favorite is            Wild
alpha=0.4	 I want to kill myself a lot. My favorite is video games, the wild       
alpha=0.5	 I want to kill myself a lot. I love video games, scimitars, wild bozearians, and I think it most.
alpha=0.6	 I want to kill myself a lot. I'm a fucking wild bozearian. I especially love video games. 
alpha=0.7	 I want to kill myself a lot. I also want to kill myself a lot. I love fucking bridges and suicide. 
alpha=0.8	 I want to kill myself a lot. I want to kill myself a lot. I feigned bias and suicide. 
alpha=0.9	 I want to kill myself a lot. I also want to kill myself a lot. I suffer from psychiatric and suicide disorders.
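
The sweep above relies on `torch.lerp`, which computes `input + weight * (end - input)` elementwise. A minimal plain-Python sketch of the same formula, using short made-up vectors in place of the real embeddings:

```python
from typing import List


def lerp(start: List[float], end: List[float], weight: float) -> List[float]:
    """Elementwise linear interpolation: start + weight * (end - start)."""
    return [s + weight * (e - s) for s, e in zip(start, end)]


# Toy 2-dimensional stand-ins for the two embeddings (illustrative values only).
first = [0.0, 2.0]
second = [1.0, 4.0]

print(lerp(first, second, 0.0))  # [0.0, 2.0], identical to `first`
print(lerp(first, second, 0.5))  # [0.5, 3.0], the midpoint
print(lerp(first, second, 1.0))  # [1.0, 4.0], identical to `second`
```

At weight 0.0 the result is the first vector and at 1.0 it is the second, which matches the alpha=0.0 row reproducing the first sentence's inversion.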
