A simple replication of Text Embeddings Reveal (Almost) As Much As Text (for pedagogy and fun). The original repo is here.
For further analysis of vec2text, we also recommend this paper: Understanding and Mitigating the Threat of Vec2Text to Dense Retrieval Systems.
conda create -n v2t python=3.11 -y && conda activate v2t
conda install pytorch==2.1.1 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install transformers==4.41.2 accelerate==0.31.0 datasets sentencepiece wandb rich ipywidgets gpustat wget tiktoken pytest evaluate sacrebleu nltk sentence_transformers numpy==1.26.4
pip install -e .
python -c 'import nltk;nltk.download("punkt")'
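After installation, a quick sanity check can confirm the pinned versions and GPU visibility (a minimal sketch, not part of the repo):

# Optional environment sanity check (not part of the repo).
import torch
import transformers
print(torch.__version__)          # expect 2.1.1
print(transformers.__version__)   # expect 4.41.2
print(torch.cuda.is_available())  # expect True on a GPU machine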
accelerate launch --num_processes 8 --mixed_precision bf16 \
v2t/train.py \
--config config/stage1.yaml
accelerate launch --num_processes 8 --mixed_precision bf16 \
v2t/train.py \
--config config/stage2.yaml \
--draft_dir checkpoints/gtr_t5_nq_32_stage1/wandb/latest-run/files/hyps
We provide the stage1 output in the output folder and the trained stage2 model at Hannibal046/gtr_t5_nq_32_stage2, so inference can be run like this:
accelerate launch --num_processes 8 \
v2t/inference.py \
--draft_dir output/gtr_t5_nq_32_stage1/hyps \
--generator_name_or_path Hannibal046/gtr_t5_nq_32_stage2 \
--dataset_name_or_path jxm/nq_corpus_dpr \
--embedder_name_or_path sentence-transformers/gtr-t5-base \
--max_seq_length 32 --max_eval_samples 500
These are the expected results:
{"bleu": 97.8, "token_f1": 99.5, "em": 93.2, "cos_sim": 1.0}
Differences from the original implementation:
- We do not count special tokens (bos, eos, pad) as tokens that need to be recovered.
- We optimize the inference process with (1) early stopping when cos_sim == 1 (see the sketch below) and (2) distributed inference across GPUs.
- We use the first 1000 samples from jxm/nq_corpus_dpr as the test set.
- We use a larger batch size and a correspondingly larger learning rate in both training stages.
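To illustrate the early-stop optimization named above, here is a minimal sketch; embed_fn and refine_fn are hypothetical stand-ins, not the repo's actual API:

# Hypothetical sketch of early stopping during iterative correction.
import torch
import torch.nn.functional as F

def correct_with_early_stop(target_emb, hyp_text, embed_fn, refine_fn, max_steps=50):
    for _ in range(max_steps):
        hyp_emb = embed_fn(hyp_text)                  # re-embed current hypothesis
        cos = F.cosine_similarity(target_emb, hyp_emb, dim=-1)
        if torch.all(cos >= 1.0 - 1e-6):              # cos_sim == 1: text recovered
            break                                     # skip the remaining steps
        hyp_text = refine_fn(hyp_text, target_emb)    # propose a better hypothesis
    return hyp_text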
We support all models that follow the Sentence Transformers API. Throughout this project, the embedder loading logic is defined in v2t/model/embedder/__init__.py:
from sentence_transformers import SentenceTransformer

def load_embedder(
    model_name_or_path: str,
):
    overwatch.info(f"Loading Retriever from: {model_name_or_path}")
    if model_name_or_path == "sentence-transformers/gtr-t5-base":
        embedder = SentenceTransformer(model_name_or_path, device="cpu")
        tokenizer = embedder.tokenizer
    else:
        raise NotImplementedError(f"Unsupported embedder: {model_name_or_path}")
    # Return the tokenizer alongside the model so callers can tokenize consistently.
    return embedder, tokenizer
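For example, the returned embedder can then encode text through the standard Sentence Transformers encode API:

embedder, tokenizer = load_embedder("sentence-transformers/gtr-t5-base")
emb = embedder.encode(["some text to embed"], convert_to_tensor=True)
print(emb.shape)  # (1, 768) for gtr-t5-base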
Currently, we only support T5 models (from T5-base to T5-11B) as the generator; the generator class is defined in v2t/model/generator/modeling_t5generator.py.
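Since the generator backbone is a stock Hugging Face T5 checkpoint, any size in that range can be swapped in. As a sketch of the backbone only (the generator class in modeling_t5generator.py builds on top of this):

# Sketch: load a T5 backbone from the Hugging Face Hub.
from transformers import AutoTokenizer, T5ForConditionalGeneration

backbone = T5ForConditionalGeneration.from_pretrained("t5-base")  # up to t5-11b
tok = AutoTokenizer.from_pretrained("t5-base")
print(backbone.num_parameters())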