# SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

This notebook gives a minimal usage example for SPLADE: it loads a model and inspects the sparse representation the model produces for a piece of text.

* This repo provides weights for two models (in the `weights` folder).
* See the [Naver Labs Europe website](https://europe.naverlabs.com/research/machine-learning-and-optimization/splade-models/) for more up-to-date models under various settings.
* Two newer models are also available via Hugging Face (https://huggingface.co/naver).

| model | MRR@10 (MS MARCO dev) | recall@1000 (MS MARCO dev) | expected FLOPS | ~ avg. query length | ~ avg. document length |
| --- | --- | --- | --- | --- | --- |
| `splade_max` (**v2**) | 34.0 | 96.5 | 1.32 | 18 | 92 |
| `distilsplade_max` (**v2**) | 36.8 | 97.9 | 3.82 | 25 | 232 |
| `naver/splade-cocondenser-selfdistil` (**v2bis**, [HF](https://huggingface.co/naver/splade-cocondenser-selfdistil)) | 37.6 | 98.4 | 2.32 | 56 | 134 |
| `naver/splade-cocondenser-ensembledistil` (**v2bis**, [HF](https://huggingface.co/naver/splade-cocondenser-ensembledistil)) | 38.3 | 98.3 | 1.85 | 44 | 120 |
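
The `expected FLOPS` column refers to the efficiency metric used in the SPLADE papers. As described there, it estimates the average number of floating-point operations needed to score a query against a document: for each vocabulary term, the probability that the term is active in a query is multiplied by the probability that it is active in a document, and the products are summed. The sketch below computes that estimate on toy data. The function names and the dictionary-based representations are illustrative assumptions, not part of this repo.

```python
from collections import Counter


def activation_probs(reps):
    """Fraction of representations in which each term has a non-zero weight."""
    if not reps:
        return {}
    counts = Counter()
    for rep in reps:
        counts.update(term for term, weight in rep.items() if weight != 0)
    return {term: count / len(reps) for term, count in counts.items()}


def expected_flops(query_reps, doc_reps):
    """Sum over terms of P(term active in a query) * P(term active in a document)."""
    p_query = activation_probs(query_reps)
    p_doc = activation_probs(doc_reps)
    return sum(p * p_doc.get(term, 0.0) for term, p in p_query.items())


# Toy sparse representations (term -> weight); the values are made up.
toy_queries = [{"glass": 1.2, "crack": 0.8}, {"glass": 0.5}]
toy_docs = [{"glass": 2.0, "stress": 1.1}, {"stress": 0.9, "thermal": 1.4}]

flops = expected_flops(toy_queries, toy_docs)
print(flops)
```

Only terms that are active on both sides contribute, which is why sparser representations give a lower value.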

In [21]:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from splade.models.transformer_rep import Splade

In [22]:
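# set the directory (or Hugging Face model name) to load the weights from;
# keep exactly one assignment active, since a later assignment overrides earlier ones

##### v2
# model_type_or_dir = "weights/splade_max"
# model_type_or_dir = "weights/distilsplade_max"

### v2bis, downloaded directly from Hugging Face
# model_type_or_dir = "naver/splade-cocondenser-selfdistil"

### other checkpoints
# model_type_or_dir = "Luyu/co-condenser-marco"
# model_type_or_dir = "models/ensemble_distil/checkpoint/model"

# the outputs shown below were produced with this checkpoint; it is a generic
# pretrained model, not one of the SPLADE models listed in the table above
model_type_or_dir = "distilbert-base-uncased"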

In [23]:
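# load the model and its tokenizer

model = Splade(model_type_or_dir, agg="max")
model.eval()  # inference mode (disables dropout)
tokenizer = AutoTokenizer.from_pretrained(model_type_or_dir)
# map token ids back to token strings, to make the representation readable
reverse_voc = {v: k for k, v in tokenizer.vocab.items()}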
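
For `agg="max"`, the SPLADE v2 formulation turns the masked-language-model logits into a single vocabulary-sized vector by applying `log(1 + ReLU(x))` to each logit and then taking the maximum over the input tokens, ignoring padding. The sketch below reproduces that pooling step in plain Python on toy numbers. It illustrates the formula only and is not the `Splade` class itself; `splade_max_pool` is a hypothetical helper.

```python
import math


def splade_max_pool(logits, attention_mask):
    """Pool per-token vocabulary logits into one vector.

    logits: one row of vocabulary scores per input token.
    attention_mask: 1 for real tokens, 0 for padding.
    """
    vocab_size = len(logits[0])
    rep = [0.0] * vocab_size
    for row, keep in zip(logits, attention_mask):
        if not keep:
            continue  # padded positions never contribute
        for j, x in enumerate(row):
            # log-saturated, non-negative activation, then max over tokens
            rep[j] = max(rep[j], math.log1p(max(x, 0.0)))
    return rep


# Toy example: 3 token positions (the last one is padding), vocabulary of 4 terms.
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 1.0, -0.2],
    [9.0, 9.0, 9.0, 9.0],
]
toy_mask = [1, 1, 0]

toy_rep = splade_max_pool(toy_logits, toy_mask)
print(toy_rep)
```

A term whose logits are negative at every position ends up with weight exactly zero, which is what makes the resulting vector sparse once the model has been trained to push most logits below zero.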

In [24]:
# example document from MS MARCO passage collection (doc_id = 8003157)

doc = "Glass and Thermal Stress. Thermal Stress is created when one area of a glass pane gets hotter than an adjacent area. If the stress is too great then the glass will crack. The stress level at which the glass will break is governed by several factors."

In [25]:
tokenizer

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [None]:
# now compute the document representation
with torch.no_grad():
    doc_rep = model(d_kwargs=tokenizer(doc, return_tensors="pt"))["d_rep"].squeeze()  # (sparse) doc rep in voc space, shape (30522,)

# get the indices of the non-zero dimensions in the rep
# (squeeze(1) keeps a list even when there is a single non-zero entry):
col = torch.nonzero(doc_rep).squeeze(1).cpu().tolist()
print("number of actual dimensions: ", len(col))

# now let's inspect the bow representation:
weights = doc_rep[col].cpu().tolist()
d = {k: v for k, v in zip(col, weights)}
sorted_d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
bow_rep = []
for k, v in sorted_d.items():
    bow_rep.append((reverse_voc[k], round(v, 2)))
print("SPLADE BOW rep:\n", bow_rep)

number of actual dimensions:  9182
SPLADE BOW rep:
 [('glass', 3.24), ('.', 3.22), ('will', 3.22), ('at', 3.19), ('if', 3.16), ('a', 3.14), ('one', 3.14), ('when', 3.14), ('than', 3.14), ('is', 3.13), ('area', 3.13), ('crack', 3.12), ('the', 3.1), ('stress', 3.09), ('too', 3.08), ('level', 3.08), ('thermal', 3.08), ('of', 3.07), ('gets', 3.06), ('hotter', 3.04), ('several', 3.03), ('then', 3.02), ('an', 3.02), ('break', 3.01), ('adjacent', 3.0), ('created', 3.0), ('and', 2.99), ('factors', 2.98), ('by', 2.97), ('pan', 2.96), (';', 2.94), ('great', 2.92), ('which', 2.91), ('three', 2.81), ('governed', 2.81), ('high', 2.79), ('other', 2.79), ('may', 2.75), ('would', 2.74), ('many', 2.74), ('two', 2.73), ('get', 2.72), ('can', 2.72), ('##e', 2.71), ('factor', 2.71), ('generated', 2.71), ('warmer', 2.7), ('neighboring', 2.7), ('cooler', 2.7), ('hot', 2.69), ('caused', 2.68), ('determined', 2.68), ('breaking', 2.67), ('adjoining', 2.67), ('produced', 2.67), ('getting', 2.66), ('because', 2.
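
The printed list gets long when many dimensions are non-zero. A small helper that keeps only the highest-weighted terms makes the representation easier to read. The following is a minimal sketch: `bow_from_dense` and its `top_k` parameter are hypothetical names, and the function accepts any sequence of floats (for a tensor such as `doc_rep`, pass `doc_rep.tolist()`).

```python
def bow_from_dense(dense_rep, id_to_token, top_k=None):
    """Return (token, weight) pairs for non-zero dimensions, highest weight first."""
    pairs = [(i, w) for i, w in enumerate(dense_rep) if w != 0]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        pairs = pairs[:top_k]
    # round only for display, after sorting on the exact weights
    return [(id_to_token[i], round(w, 2)) for i, w in pairs]


# Toy vocabulary and dense vector; the values are made up.
toy_vocab = {0: "[PAD]", 1: "glass", 2: "stress", 3: "crack"}
toy_dense = [0.0, 3.236, 3.09, 0.0]

toy_bow = bow_from_dense(toy_dense, toy_vocab)
toy_top1 = bow_from_dense(toy_dense, toy_vocab, top_k=1)
print(toy_bow)
```

With the objects defined in this notebook, the call would look like `bow_from_dense(doc_rep.tolist(), reverse_voc, top_k=20)`.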

In [None]:
# example of a short, query-like text (kept in the variable `doc` so the cell below can be reused)

doc = "is a little caffeine ok during pregnancy"

In [None]:
# now compute the representation of this text (still passed through the document branch via d_kwargs)
with torch.no_grad():
    doc_rep = model(d_kwargs=tokenizer(doc, return_tensors="pt"))["d_rep"].squeeze()  # (sparse) doc rep in voc space, shape (30522,)

# get the indices of the non-zero dimensions in the rep
# (squeeze(1) keeps a list even when there is a single non-zero entry):
col = torch.nonzero(doc_rep).squeeze(1).cpu().tolist()
print("number of actual dimensions: ", len(col))

# now let's inspect the bow representation:
weights = doc_rep[col].cpu().tolist()
d = {k: v for k, v in zip(col, weights)}
sorted_d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
bow_rep = []
for k, v in sorted_d.items():
    bow_rep.append((reverse_voc[k], round(v, 2)))
print("SPLADE BOW rep:\n", bow_rep)

number of actual dimensions:  3970
SPLADE BOW rep:


array([[  101,  2003,  1037,  2210, 24689,  7959,  3170,  7929,  2076,
        10032,   102],
       [  101,  2003,  1037,  2210, 24689,  7959,  3170,  7929,  2076,
        10032,   102]])
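
SPLADE representations are meant for first-stage ranking, where the relevance score of a document for a query is the dot product of their sparse vectors. The sketch below shows that scoring step on dictionary-based representations (term to weight). The helper names and the toy weights are illustrative assumptions; in practice the document vectors would typically be stored in an inverted index rather than scanned one by one.

```python
def sparse_dot(query_rep, doc_rep):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # iterate over the smaller dict; only shared terms contribute
    if len(query_rep) > len(doc_rep):
        query_rep, doc_rep = doc_rep, query_rep
    return sum(w * doc_rep.get(term, 0.0) for term, w in query_rep.items())


def rank(query_rep, doc_reps):
    """Return (doc_id, score) pairs sorted by decreasing score."""
    scored = [(doc_id, sparse_dot(query_rep, rep)) for doc_id, rep in doc_reps.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy representations; the weights are made up for illustration.
toy_query = {"caffeine": 2.0, "pregnancy": 1.5}
toy_docs = {
    "d1": {"glass": 3.0, "stress": 2.5},
    "d2": {"caffeine": 1.0, "pregnancy": 2.0, "coffee": 0.5},
    "d3": {"caffeine": 0.5},
}

ranking = rank(toy_query, toy_docs)
print(ranking)
```

Because only terms shared by the query and the document contribute to the score, expansion terms added by the model can match documents that never contain the literal query words.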