[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/pinecone/sparse/bm25/bm25-vector-generation.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/pinecone/sparse/bm25/bm25-vector-generation.ipynb)

# Hybrid Search with BM25 Sparse Vectors


## Overview

BM25 is a popular ranking function for text retrieval. It scores a document against a query using how often each query term appears in the document, how rare that term is across the corpus, and the length of the document. It is simple but effective, and it only requires corpus-level statistics such as the number of documents and how often each term appears across them. In this guide, we show how to generate BM25 sparse vectors for use with Pinecone's sparse-dense vectors in hybrid search.
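
For reference, the standard Okapi BM25 score of a document $d$ for a query $q$ is usually written as shown here. This is the textbook form, with $k_1$ and $b$ as tunable parameters. The helper file used in this guide may implement a variant, so treat the formula as background rather than a description of that code.

$$
\text{score}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\text{avgdl}}\right)},
\qquad
\text{IDF}(t) = \ln\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)
$$

Here $f(t, d)$ is the number of times term $t$ occurs in $d$, $|d|$ is the document length in tokens, $\text{avgdl}$ is the average document length in the corpus, $N$ is the number of documents, and $n(t)$ is the number of documents that contain $t$.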

Skip the embedding creation step by using the [companion guide](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/bm25/bm25-quora.ipynb).

## Prerequisites

We'll install the required libraries:

In [None]:
!pip install -qU \
          torch \
          sentence-transformers \
          spacy==3.4.0 \
          scikit-learn

In [None]:
import requests
from tqdm.notebook import tqdm

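Download a helper file that contains the BM25 implementation:

In [None]:
response = requests.get('https://storage.googleapis.com/pinecone-datasets-dev/pinecone_text.py')
response.raise_for_status()  # fail loudly instead of saving an error page as a module
with open('pinecone_text.py', 'w') as f:
    f.write(response.text)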

## Quora Dataset

We'll load a sample of the popular Quora questions dataset:

In [None]:
import pandas as pd

df = pd.read_parquet("https://storage.googleapis.com/pinecone-datasets-dev/quora_all-MiniLM-L6-bm25/raw/quora_questions_sample.parquet")

In [None]:
df.head()

### Fit BM25 with spaCy Tokenizer

We'll fit a BM25 model that uses spaCy to tokenize the data. To do this, we first need to download the spaCy tokenizer model:


In [None]:
%%capture
!python -m spacy download en_core_web_sm

*Note: if you get a spaCy error in the following cell, you may need to restart the notebook.*

In [None]:
import spacy
import pinecone_text

# Only the tokenizer is needed, so disable the other pipeline components to speed up processing.
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])

def tokenizer(text):
    return [token.text for token in nlp(text)]

bm25 = pinecone_text.BM25(tokenizer)

Before BM25 can create sparse vectors, it needs to know how often each token appears across the documents. To compute these statistics, we call `bm25.fit` on the full dataset.

In [None]:
bm25.fit(df['text'])
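
To make the fitting step more concrete, here is a minimal sketch of the kind of corpus statistics a BM25 fit typically collects. It is not the helper file's implementation: the `fit_stats` function, its return format, and the whitespace tokenizer are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def fit_stats(docs, tokenize=str.split):
    """Collect corpus statistics that a BM25 fit step typically needs (illustrative only)."""
    doc_freq = Counter()  # number of documents each token appears in
    total_len = 0         # total number of tokens across all documents
    n_docs = 0
    for text in docs:
        tokens = tokenize(text)
        n_docs += 1
        total_len += len(tokens)
        doc_freq.update(set(tokens))  # count each token at most once per document
    avg_len = total_len / n_docs if n_docs else 0.0
    return {"n_docs": n_docs, "avg_len": avg_len, "doc_freq": doc_freq}

stats = fit_stats([
    "how do I learn python",
    "how do I cook rice",
    "learn to cook",
])
print(stats["n_docs"], stats["avg_len"], stats["doc_freq"]["how"])
```

Counting each token once per document, rather than once per occurrence, is what makes `doc_freq` a document frequency and not a raw term count.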


### Dense Model

For dense vectors, we use the popular `all-MiniLM-L6-v2` model, available on Hugging Face, which produces 384-dimensional embeddings.

In [None]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"running on {device}")

model = SentenceTransformer(
    'sentence-transformers/all-MiniLM-L6-v2',
    device=device
)

### Compute Dense & Sparse Embeddings

Create BM25 sparse embeddings:

In [None]:
df['sparse_values'] = df['text'].apply(bm25.transform_doc)
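
As a rough illustration of what a document-side sparse vector can look like, the sketch below turns a text into the `indices` and `values` lists that Pinecone's sparse-dense vectors use. Everything else here is an assumption: the `doc_to_sparse` function is hypothetical, tokens are mapped to indices with a CRC32 hash, and only the length-normalized term-frequency part of BM25 is stored on the document side (leaving IDF for the query side). The helper's `transform_doc` may use a different index scheme and weighting.

```python
import zlib
from collections import Counter

def doc_to_sparse(text, avg_len, k1=1.2, b=0.75, tokenize=str.split):
    """Build an illustrative sparse vector from the term-frequency part of BM25."""
    tokens = tokenize(text)
    term_freq = Counter(tokens)
    # Length normalization: longer-than-average documents get smaller weights.
    norm = k1 * (1 - b + b * len(tokens) / avg_len)
    weights = {}
    for term, freq in term_freq.items():
        # A stable hash maps each token to an integer index (assumption, not the helper's scheme).
        idx = zlib.crc32(term.encode("utf-8"))
        # Sum the weights if two tokens happen to hash to the same index.
        weights[idx] = weights.get(idx, 0.0) + freq / (freq + norm)
    indices = sorted(weights)
    return {"indices": indices, "values": [weights[i] for i in indices]}

print(doc_to_sparse("cook rice rice", avg_len=3.0))
```

A stable hash such as CRC32 is used instead of Python's built-in `hash`, which is randomized per process for strings and would give different indices between runs.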

Now encode the dense vector embeddings in batches:

In [None]:
batch_size = 128
dense_values = []
for i in tqdm(range(0, len(df), batch_size)):
    dense_values += model.encode(df.iloc[i:i + batch_size]["text"].tolist()).tolist()

df['values'] = dense_values

We organize our dataframe to align with the `pinecone-datasets` format:

In [None]:
df_result = df.copy()
df_result["metadata"] = None
df_result["blob"] = df_result["text"].apply(lambda t: {"text": t})
df_result = df_result.drop(columns="text")

In [None]:
df_result.head()

We now have everything we need to start using the Pinecone vector database 🚀

For more details on that, check out [this notebook](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/bm25/bm25-quora.ipynb).