# Discounted Cumulative Gain
Discounted Cumulative Gain (DCG) is a widely used metric for evaluating a ranking of results *against an optimally ranked "ground truth"*. In the pointwise approach to learning to rank (LTR), we evaluate search results as query-result pairs: we measure true relevance from user interactions and train a model to estimate that number. Because DCG lets us compare all models "apples to apples", we will want to compute it whichever model we train. The task involves several steps:

 - Order results based on true relevance
 - Order results based on estimated relevance
 - Compute DCG by comparing the two ordered lists
 
search-tools can compute DCG and normalized DCG (NDCG), and it provides a PySpark UDF so the computation can run on your Spark nodes.
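To make the three steps above concrete, here is a minimal NumPy sketch of DCG and NDCG computed from per-result relevance scores. It assumes the common formulation with a linear gain and a `log2(position + 1)` discount; the function names `dcg` and `ndcg` are illustrative only, and search-tools' own implementation may use a different gain or discount.

```python
import numpy as np

def dcg(relevances):
    """DCG of relevance values listed in ranked order (position 1 first)."""
    relevances = np.asarray(relevances, dtype=float)
    # Position i (1-based) is discounted by log2(i + 1).
    discounts = np.log2(np.arange(2, relevances.size + 2))
    return float(np.sum(relevances / discounts))

def ndcg(true_relevance, estimated_relevance):
    """NDCG of the ordering induced by estimated_relevance (assumed formulation)."""
    true_relevance = np.asarray(true_relevance, dtype=float)
    estimated_relevance = np.asarray(estimated_relevance, dtype=float)
    # Step 1: ideal ordering, sorted by true relevance (descending).
    ideal = np.sort(true_relevance)[::-1]
    # Step 2: model ordering, sorted by estimated relevance (descending).
    ranked = true_relevance[np.argsort(-estimated_relevance, kind="stable")]
    # Step 3: compare the two by normalizing the model's DCG by the ideal DCG.
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# A model that ranks results in exactly the ideal order scores 1.0;
# any other ordering of the same results scores lower.
perfect = ndcg([3, 2, 1], [0.9, 0.5, 0.1])
reversed_order = ndcg([1, 2, 3], [0.9, 0.5, 0.1])
```

Normalizing by the ideal DCG keeps scores in the range 0 to 1, which is what makes results comparable across queries with different numbers of results.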

In [1]:
from search_tools.metrics import ndcg_udf, get_rankings
from search_tools.matching import BM25Model

import pyspark.sql.functions as F

import pandas as pd
import numpy as np

def init_spark():
    """Create (or reuse) and return a local SparkSession."""
    from pyspark.sql import SparkSession
    APP_NAME = "search-tools example"
    SPARK_URL = "local[*]"
    spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()
    return spark

# Uncomment if your environment does not already provide a `spark` session
# (the cells below use `spark`):
#spark = init_spark()

# Demo Dataset
The following cell generates a dummy dataset that looks something like what we'd expect after training a pointwise ranking model:

In [2]:
# Here's a list of 5000 example product IDs
products = np.arange(start=100000, stop=105000)

# This is a list of the best (optimally) ordered 500 products
best_order = np.random.choice(products, size=(500,), replace=False)

# This is a list of "returned" results: the same 500 products in a shuffled order
results_order = np.random.choice(best_order, size=(500,), replace=False)

cols = ['query', 'pid', 'relevance', 'estimated_relevance']
queries = ['air jordan', 'air max', 'running', 'red sox']
dfs = []
size = (500,)

for query in queries:
    df = pd.DataFrame(columns = cols)
    df['pid'] = np.random.choice(products, size=size, replace=False)
    df['relevance'] = np.random.random(size=size)
    df['estimated_relevance'] = np.random.random(size=size)
    df['query'] = query
    df['query'] = df['query'].astype(str)
    dfs.append(df)
  
pd_df = pd.concat(dfs, axis=0)

df = spark.createDataFrame(pd_df)
df.show()

+----------+------+--------------------+-------------------+
|     query|   pid|           relevance|estimated_relevance|
+----------+------+--------------------+-------------------+
|air jordan|101852| 0.02988521769497876|  0.646588282582704|
|air jordan|104194| 0.21904121471662752| 0.7688214879952707|
|air jordan|102137|  0.8692904685849372| 0.4027586582978472|
|air jordan|103637|  0.6323111613109539| 0.3792361114463353|
|air jordan|101289|  0.5531064148825765|  0.758509930143625|
|air jordan|103243|0.051998382242541275| 0.8332075966120123|
|air jordan|101550|  0.5720213975580669|  0.724106628585816|
|air jordan|103979|  0.2090640790778303|  0.878918039288767|
|air jordan|102993|  0.0804350950745314| 0.1545982454209447|
|air jordan|103727|  0.6751723079142157| 0.5295416032709277|
|air jordan|102812| 0.13007534278648925| 0.7455838410223465|
|air jordan|103989| 0.21252806993390105|0.42609804353453296|
|air jordan|100041|  0.6716818913031557| 0.4459337120588347|
|air jordan|100238|  0.2

# Collapsing Rows, Ordering by Relevance
There's a series of oper

In [3]:
results = get_rankings(df, gt_col='relevance', est_col='estimated_relevance')

results.show()

+----------+--------------------+--------------------+
|     query|          best_order|     estimated_order|
+----------+--------------------+--------------------+
|   air max|[101085, 103837, ...|[104701, 101347, ...|
|   running|[100598, 104644, ...|[100154, 103999, ...|
|air jordan|[103901, 103135, ...|[104417, 104197, ...|
|   red sox|[102039, 101116, ...|[100296, 100028, ...|
+----------+--------------------+--------------------+
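
For intuition, the collapsing step can be approximated in plain pandas. The sketch below is a hypothetical analogue, not search-tools' implementation: the helper name `rankings_pandas` is made up for illustration, and it assumes that both lists hold `pid` values sorted by descending score, as the column names in the output above suggest.

```python
import pandas as pd

def rankings_pandas(pd_df, gt_col="relevance", est_col="estimated_relevance"):
    """Hypothetical pandas analogue: one row per query with two ordered pid lists."""
    rows = []
    for query, group in pd_df.groupby("query", sort=False):
        rows.append({
            "query": query,
            # pids sorted by true relevance, most relevant first
            "best_order": group.sort_values(gt_col, ascending=False)["pid"].tolist(),
            # pids sorted by the model's estimate, highest estimate first
            "estimated_order": group.sort_values(est_col, ascending=False)["pid"].tolist(),
        })
    return pd.DataFrame(rows, columns=["query", "best_order", "estimated_order"])

# Tiny made-up dataset with two queries, for illustration only.
toy = pd.DataFrame({
    "query": ["a", "a", "a", "b", "b"],
    "pid": [1, 2, 3, 4, 5],
    "relevance": [0.1, 0.9, 0.5, 0.3, 0.7],
    "estimated_relevance": [0.8, 0.2, 0.6, 0.9, 0.1],
})
toy_rankings = rankings_pandas(toy)
```

Each query collapses from many rows to a single row holding two lists, which is the shape the NDCG UDF consumes in the next cell.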



In [4]:
results.withColumn('ndcg', ndcg_udf(F.col('best_order'), F.col('estimated_order'))).select('query', 'ndcg').show()

+----------+----------+
|     query|      ndcg|
+----------+----------+
|   air max|0.16247763|
|   running|0.16091068|
|air jordan|0.15299326|
|   red sox| 0.2046853|
+----------+----------+

