# SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

This notebook gives a minimal example of how to use SPLADE.

* This repo provides weights for two models (in the `weights` folder).
* See the [Naver Labs Europe website](https://europe.naverlabs.com/research/machine-learning-and-optimization/splade-models/) for more up-to-date models under various settings.
* Two newer models are also available via [Hugging Face](https://huggingface.co/naver).

| model | MRR@10 (MS MARCO dev) | recall@1000 (MS MARCO dev) | expected FLOPS | ~ avg q length | ~ avg d length | 
| --- | --- | --- | --- | --- | --- |
| `splade_max` (**v2**) | 34.0 | 96.5 | 1.32 | 18 | 92 |
| `distilsplade_max` (**v2**) | 36.8 | 97.9 | 3.82 | 25 | 232 |
| `naver/splade-cocondenser-selfdistil` (**v2bis**, [HF](https://huggingface.co/naver/splade-cocondenser-selfdistil)) | 37.6 | 98.4 | 2.32 | 56 | 134 |
| `naver/splade-cocondenser-ensembledistil` (**v2bis**, [HF](https://huggingface.co/naver/splade-cocondenser-ensembledistil)) | 38.3 | 98.3 | 1.85 | 44 | 120 |
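
The two length columns are understood here as the average number of non-zero vocabulary dimensions in a query or document representation, and sparse representations of this kind are typically scored with a dot product over the dimensions they share. The following is a minimal sketch in plain Python, assuming a representation is stored as a `{token: weight}` dictionary. The helper names `rep_length` and `sparse_dot` and the toy weights are hypothetical and are not part of the SPLADE code base.

```python
def rep_length(rep):
    """Number of non-zero dimensions in a sparse {token: weight} representation."""
    return sum(1 for w in rep.values() if w != 0.0)


def sparse_dot(q_rep, d_rep):
    """Dot product restricted to the tokens present in both representations."""
    # iterate over the shorter representation to limit the number of lookups
    if len(q_rep) > len(d_rep):
        q_rep, d_rep = d_rep, q_rep
    return sum(w * d_rep[t] for t, w in q_rep.items() if t in d_rep)


# toy representations (made-up weights, for illustration only)
q_rep = {"laundry": 1.0, "washing": 0.5}
d_rep = {"washing": 2.0, "clothes": 1.0, "laundry": 1.5}

score = sparse_dot(q_rep, d_rep)
print(rep_length(q_rep), rep_length(d_rep), score)
```

Only tokens with a non-zero weight on both sides contribute to the score, which is why shorter representations tend to be cheaper to score.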

In [1]:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from splade.models.transformer_rep import Splade

In [36]:
# set the dir for trained weights

# v2 (local weights from the `weights` folder)
# model_type_or_dir = "weights/splade_max"
# model_type_or_dir = "weights/distilsplade_max"

# v2bis, downloaded directly from Hugging Face
# model_type_or_dir = "naver/splade-cocondenser-selfdistil"

# model used for the outputs below (a Hugging Face checkpoint not listed in the table above)
model_type_or_dir = "naver/efficient-splade-VI-BT-large-doc"

In [37]:
# loading model and tokenizer

model = Splade(model_type_or_dir, agg="max")
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_type_or_dir)
reverse_voc = {v: k for k, v in tokenizer.vocab.items()}
print('vocabulary', len(reverse_voc))

vocabulary 30522
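
For reference, the `agg="max"` option is understood here to correspond to the max pooling of SPLADE v2, where each vocabulary weight is the maximum over input positions of log(1 + ReLU(logit)). The following is a minimal pure-Python sketch under that assumption. The function `splade_max_pool`, the toy logits, and the 4-term vocabulary are hypothetical, and this is not the code of the `Splade` class.

```python
import math


def splade_max_pool(logits, attention_mask):
    """Max-pool log(1 + ReLU(logit)) over the non-padded input positions.

    logits: one row per input position, one column per vocabulary term.
    attention_mask: 1 for real tokens, 0 for padding.
    """
    vocab_size = len(logits[0])
    rep = [0.0] * vocab_size
    for row, keep in zip(logits, attention_mask):
        if not keep:
            continue  # padded positions must not contribute
        for j, x in enumerate(row):
            rep[j] = max(rep[j], math.log1p(max(x, 0.0)))
    return rep


# toy MLM logits: 3 positions (the last one is padding), 4 vocabulary terms
logits = [
    [1.0, -2.0, 0.0, 3.0],
    [0.5, -1.0, 0.0, 1.0],
    [9.0, 9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 0]

rep = splade_max_pool(logits, attention_mask)
nonzero = [j for j, w in enumerate(rep) if w > 0.0]
print(nonzero)
```

Terms whose logits are never positive keep a weight of exactly zero, which is what makes the resulting vector sparse.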


In [47]:
# short example texts to encode (each one is passed through the document encoder below)
queries = (
    "laundry", "burger king", "highland coffee", "5 guys", "five guys",
    "mcdonalds", "the pizza company", "family-friendly fast food breakfast",
    "tacos", "chicken tikka", "pad thai", "indian", "momos", "coke",
)

In [48]:
from tqdm import tqdm

# compute the sparse representation of each text and keep its expansion terms
res = []
for doc in tqdm(queries, ncols=75):
    with torch.no_grad():
        # (sparse) doc rep in voc space, shape (30522,)
        doc_rep = model(d_kwargs=tokenizer(doc, return_tensors="pt"))["d_rep"].squeeze()

    # indices of the non-zero dimensions in the rep
    # (squeeze(-1) still yields a list when only one dimension is non-zero)
    col = torch.nonzero(doc_rep).squeeze(-1).cpu().tolist()

    # now let's inspect the bow representation, sorted by decreasing weight
    weights = doc_rep[col].cpu().tolist()
    sorted_d = sorted(zip(col, weights), key=lambda item: item[1], reverse=True)

    # keep only expansion terms, i.e. tokens that do not occur (as a substring) in the input text
    bow_rep = []
    for k, v in sorted_d:
        if reverse_voc[k] not in doc:
            bow_rep.append((reverse_voc[k], round(v, 2)))
    res.append((doc, bow_rep))

# print the 10 strongest expansion terms for each text
for doc, kw in res:
    print(f"{doc} (keywords: {len(kw)})")
    for kw_ in kw[:10]:
        print(' - ', kw_[0], '\t', kw_[1])
    print("\n")

laundry (keywords: 9)
 -  washing 	 1.48
 -  wash 	 1.33
 -  clothes 	 1.26
 -  . 	 0.86
 -  washed 	 0.61
 -  textile 	 0.57
 -  sewing 	 0.57
 -  fabric 	 0.51
 -  clothing 	 0.18


burger king (keywords: 51)
 -  hamburger 	 1.67
 -  bk 	 1.35
 -  kings 	 1.3
 -  restaurant 	 1.16
 -  restaurants 	 0.96
 -  chain 	 0.93
 -  mcdonald 	 0.91
 -  brands 	 0.81
 -  food 	 0.79
 -  foods 	 0.75


highland coffee (keywords: 19)
 -  highlands 	 1.66
 -  cafe 	 1.41
 -  starbucks 	 0.94
 -  caf 	 0.92
 -  tea 	 0.48
 -  press 	 0.4
 -  cups 	 0.4
 -  cup 	 0.39
 -  drink 	 0.37
 -  brew 	 0.35


5 guys (keywords: 86)
 -  five 	 1.4
 -  many 	 1.05
 -  boys 	 0.98
 -  dude 	 0.81
 -  men 	 0.71
 -  fifa 	 0.68
 -  buddy 	 0.65
 -  5000 	 0.62
 -  golf 	 0.6
 -  buddies 	 0.51


five guys (keywords: 81)
 -  5 	 1.35
 -  many 	 1.16
 -  boys 	 1.07
 -  dude 	 0.73
 -  men 	 0.68
 -  four 	 0.63
 -  golf 	 0.59
 -  six 	 0.58
 -  military 	 0.56
 -  buddy 	 0.55


mcdonalds (keywords: 28)
 -  re


