# Computing BM25 with PySpark
BM25 is a bag-of-words (BoW) matching algorithm. To score a document against a query (a set of words), we need four things: the number of times each query word occurs in the document (term frequency, TF), the inverse document frequency (IDF) of each query word, the length of the document, and the average document length across the entire matching corpus. Thankfully, Spark's standard `ml` library computes the tough ones (TF and IDF) natively, so `search-tools` just implements BM25 on top of these features.
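
For reference, the widely used Okapi form of the score is written out below. This is the textbook formula, not a statement of what `search-tools` does internally: the free parameters $k_1$ and $b$ (commonly set near 1.2 and 0.75) and the exact IDF variant are assumptions here, since the text above does not specify them.

$$
\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t)\,\frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
$$

Here $f(t, d)$ is the number of times term $t$ occurs in document $d$, $|d|$ is the document length in words, and $\mathrm{avgdl}$ is the average document length over the corpus. These are the same four ingredients listed above.
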
## Usage
The `BM25Model` object is a PySpark transformer and works much like other feature extractors: its `.fit()` method must be called before computing BM25, and its `.transform()` method computes the matching scores and returns them.

In [1]:
from search_tools.matching import BM25Model
import pyspark.sql.types as T
import pyspark.sql.functions as F

def init_spark():
    """Create (or reuse) a local SparkSession and return it."""
    from pyspark.sql import SparkSession
    APP_NAME = "search-tools example"
    SPARK_URL = "local[*]"
    spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()
    return spark

# Uncomment if a `spark` session is not already defined in your environment:
#spark = init_spark()

# Fitting
The BM25 "model" must be fit on the matching corpus. This is the collection of all documents to match. We'll demo with some made-up data.
## Note on Duplicates
Try to avoid duplicate documents in your training corpus. BM25 (like TF-IDF) depends on statistics taken across all documents, so duplicates can skew these statistics.
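
As a small illustration of that skew, the sketch below computes a smoothed IDF, log((N + 1) / (df + 1)), on a toy corpus before and after one document is duplicated. The corpus, the helper name `smoothed_idf`, and the whitespace tokenization are made up for this example. The formula is the one documented for Spark's `ml` `IDF` estimator; that `search-tools` uses it unchanged is an assumption.

```python
import math

def smoothed_idf(term, docs):
    """IDF of `term` over `docs`, computed as log((N + 1) / (df + 1))."""
    n_docs = len(docs)
    doc_freq = sum(1 for doc in docs if term in doc.lower().split())
    return math.log((n_docs + 1) / (doc_freq + 1))

# Hypothetical toy corpus, for illustration only.
toy_corpus = ["red running shoe", "blue cotton shirt", "green water bottle"]
with_duplicate = toy_corpus + ["red running shoe"]  # same document ingested twice

idf_clean = smoothed_idf("shoe", toy_corpus)       # log(4 / 2)
idf_skewed = smoothed_idf("shoe", with_duplicate)  # log(5 / 3)
idf_other = smoothed_idf("shirt", with_duplicate)  # log(5 / 2)

print(f"idf('shoe') without duplicate: {idf_clean:.4f}")
print(f"idf('shoe') with duplicate:    {idf_skewed:.4f}")
print(f"idf('shirt') with duplicate:   {idf_other:.4f}")
```

The duplicated document makes "shoe" look more common than it really is, which lowers its weight, while terms from the other documents are pushed up. In PySpark, calling `df.dropDuplicates(['document'])` before fitting is one way to remove exact duplicates.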

In [2]:
#demo dataset
corpus = [
  'this is a product description for a shoe. It is a running shoe with laces and other shoe things.',
  'this is a product description for a shirt. It is an athletic shirt with performance technology optimized for removing moisture from the skin.',
  'this is a product description for a water bottle. This product specializes in hydrating your body by putting water inside of it.',
  'this is a piece of content. It contains an article describing good running techniques and how to choose the right shoe for you. It is a long document full of words, many many words!'
]

df = spark.createDataFrame(corpus,schema=T.StringType()).withColumnRenamed('value', 'document')

df.show()

+--------------------+
|            document|
+--------------------+
|this is a product...|
|this is a product...|
|this is a product...|
|this is a piece o...|
+--------------------+



In [5]:
#create a BM25 model instance
bm25 = BM25Model()

#specify the column that contains the documents on .fit call
bm25 = bm25.fit(df, train_col='document')

# Computing BM25 Score
The query that we want to compute the BM25 score against needs to be in another column.

In [6]:
#Add a column that contains the query. In general, this column need not contain the same value in every row.
df = df.withColumn('query', F.lit('running shoe'))

#call transform to compute BM25 score for query.
result = bm25.transform(df, score_col='query')
result.show()

+--------------------+------------+----------+
|            document|       query|      bm25|
+--------------------+------------+----------+
|this is a product...|running shoe|  1.312203|
|this is a product...|running shoe|       0.0|
|this is a product...|running shoe|       0.0|
|this is a piece o...|running shoe|0.88177747|
+--------------------+------------+----------+
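
The scores in the table can be reproduced by hand, which makes the ingredients concrete. The sketch below is plain Python, not the `search-tools` implementation. It assumes lowercase whitespace tokenization (so `shoe.` is a different token from `shoe`), Spark's smoothed IDF log((N + 1) / (df + 1)), and the common parameter values k1 = 1.2 and b = 0.75. These assumptions were chosen because they match the output shown above; the function name `bm25_score` is hypothetical.

```python
import math

corpus = [
    'this is a product description for a shoe. It is a running shoe with laces and other shoe things.',
    'this is a product description for a shirt. It is an athletic shirt with performance technology optimized for removing moisture from the skin.',
    'this is a product description for a water bottle. This product specializes in hydrating your body by putting water inside of it.',
    'this is a piece of content. It contains an article describing good running techniques and how to choose the right shoe for you. It is a long document full of words, many many words!',
]

K1, B = 1.2, 0.75  # assumed parameters; not taken from the search-tools source

# Assumed tokenization: lowercase, split on whitespace, punctuation kept.
tokenized = [doc.lower().split() for doc in corpus]
avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)

def idf(term):
    """Smoothed IDF over the demo corpus: log((N + 1) / (df + 1))."""
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log((len(tokenized) + 1) / (doc_freq + 1))

def bm25_score(query, tokens):
    """Okapi BM25 of one tokenized document against a whitespace-separated query."""
    length_norm = 1 - B + B * len(tokens) / avg_len
    score = 0.0
    for term in set(query.lower().split()):
        tf = tokens.count(term)
        score += idf(term) * tf * (K1 + 1) / (tf + K1 * length_norm)
    return score

scores = [bm25_score("running shoe", tokens) for tokens in tokenized]
for score in scores:
    print(round(score, 6))
```

Under these assumptions the four values agree with the `bm25` column above to the displayed precision. The first document wins because it mentions "shoe" twice and is shorter than the corpus average of 24.5 tokens, while the long article is penalized for its length.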



# Computing TF, NTF, and TFIDF Scores
In the process of computing BM25, we get a number of useful scores "for free"; that is, it takes no additional computation to get them. If you want them in addition to BM25, it's more efficient to get them in the same call to the `BM25Model` object.

To get these scores, simply name the output columns when calling `.transform()`. To *not* get a score, set its keyword argument to `None`, which is the default for all but BM25:

In [7]:
result = bm25.transform(df, score_col='query', bm25_output_name='bm25', tf_output_name='tf', ntf_output_name='ntf', tfidf_output_name='tfidf')
result.show()

+--------------------+------------+----------+---+----------+-----------+
|            document|       query|      bm25| tf|       ntf|      tfidf|
+--------------------+------------+----------+---+----------+-----------+
|this is a product...|running shoe|  1.312203|3.0|0.15789473| 0.08065668|
|this is a product...|running shoe|       0.0|0.0|       0.0|        0.0|
|this is a product...|running shoe|       0.0|0.0|       0.0|        0.0|
|this is a piece o...|running shoe|0.88177747|2.0|0.05882353|0.030048566|
+--------------------+------------+----------+---+----------+-----------+
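
The extra columns can be read off the same ingredients. Judging from the numbers above, `tf` is the total count of query words in the document, `ntf` is that count divided by the document length, and `tfidf` weights each normalized count by the term's IDF. The sketch below reproduces the columns in plain Python under those inferred definitions. It assumes lowercase whitespace tokenization and Spark's smoothed IDF log((N + 1) / (df + 1)), and the helper `extra_scores` is hypothetical, not part of `search-tools`.

```python
import math

corpus = [
    'this is a product description for a shoe. It is a running shoe with laces and other shoe things.',
    'this is a product description for a shirt. It is an athletic shirt with performance technology optimized for removing moisture from the skin.',
    'this is a product description for a water bottle. This product specializes in hydrating your body by putting water inside of it.',
    'this is a piece of content. It contains an article describing good running techniques and how to choose the right shoe for you. It is a long document full of words, many many words!',
]

# Assumed tokenization: lowercase, split on whitespace, punctuation kept.
tokenized = [doc.lower().split() for doc in corpus]

def idf(term):
    """Smoothed IDF over the demo corpus: log((N + 1) / (df + 1))."""
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log((len(tokenized) + 1) / (doc_freq + 1))

def extra_scores(query, tokens):
    """Return (tf, ntf, tfidf) for one document under the inferred definitions."""
    counts = {term: tokens.count(term) for term in set(query.lower().split())}
    tf = float(sum(counts.values()))   # raw count of query words
    ntf = tf / len(tokens)             # count normalized by document length
    tfidf = sum(idf(term) * count / len(tokens) for term, count in counts.items())
    return tf, ntf, tfidf

rows = [extra_scores("running shoe", tokens) for tokens in tokenized]
for tf, ntf, tfidf in rows:
    print(f"{tf:.1f}  {ntf:.8f}  {tfidf:.8f}")
```

Because both query words happen to have the same IDF in this corpus, the output cannot distinguish this per-term weighting from other plausible definitions of `tfidf`. Treat the sketch as a reading of the numbers, not as documentation of the library.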

