# Discounted Cumulative Gain
Discounted Cumulative Gain (DCG) is a widely used metric for evaluating a ranking of results *against an optimally ranked "ground truth"*. In the pointwise approach to LTR (learning to rank), we evaluate search results as query-result pairs: we measure true relevance from user interactions and train a model to estimate that number. Because DCG lets us compare all models "apples to apples", we will want to compute it whichever approach we take. The task involves several steps:

 - Order results based on true relevance
 - Order results based on estimated relevance
 - Compute DCG by comparing the two ordered lists
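
To make the three steps concrete, here is a minimal pure-Python sketch using the standard definition of DCG (each item's true relevance divided by `log2(position + 1)`), normalized by the DCG of the ideal ordering. The function names `dcg` and `ndcg` and the toy scores are illustrative assumptions; search-tools may compute the metric differently (for example, directly from the two ordered lists of product IDs).

```python
import math


def dcg(relevances):
    """DCG of true-relevance scores listed in ranked order (position 1 first)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))


def ndcg(true_relevance, estimated_relevance):
    """NDCG of the ranking induced by estimated_relevance, judged by true_relevance.

    Illustrative helper, not the search-tools implementation.
    """
    # Step 1: order results by true relevance (the ideal ranking).
    ideal = sorted(true_relevance, reverse=True)

    # Step 2: order results by estimated relevance, keeping each item's true relevance.
    order = sorted(range(len(true_relevance)),
                   key=lambda i: estimated_relevance[i],
                   reverse=True)
    ranked = [true_relevance[i] for i in order]

    # Step 3: compare the two orderings via their DCG values.
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


true_rel = [3.0, 2.0, 1.0]
perfect_score = ndcg(true_rel, [0.9, 0.5, 0.1])   # model agrees with the ideal order
reversed_score = ndcg(true_rel, [0.1, 0.5, 0.9])  # model ranks the items backwards
print(perfect_score, reversed_score)
```

A model that reproduces the ideal order scores 1.0, and any other order scores lower, which is what makes the normalized value comparable across queries and models.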
 
search-tools can compute DCG and NDCG (normalized DCG), and it provides a PySpark UDF for running the computation on your Spark nodes.

In [20]:
from search_tools.metrics import ndcg_udf, get_rankings
from search_tools.matching import BM25Model

import pyspark.sql.functions as F

import pandas as pd
import numpy as np

def init_spark():
    """Get (or create) and return a local SparkSession"""
    from pyspark.sql import SparkSession
    APP_NAME = "search-tools example"
    SPARK_URL = "local[*]"
    spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()
    return spark

# Uncomment if you don't already have a Spark session (the cells below use `spark`):
#spark = init_spark()

# Demo Dataset
The following cell generates a dummy dataset that looks something like what we'd expect after training a pointwise ranking model:

In [15]:
# Here's a list of 5000 example products
products = np.arange(start=100000, stop=105000)

# The best (optimally) ordered 500 products
best_order = np.random.choice(products, size=(500,), replace=False)

# The "returned" results: the same 500 products, in a shuffled order
results_order = np.random.choice(best_order, size=(500,), replace=False)

cols = ['query', 'pid', 'relevance', 'estimated_relevance']
queries = ['air jordan', 'air max', 'running', 'red sox']
dfs = []
size = (500,)

for query in queries:
    df = pd.DataFrame(columns = cols)
    df['pid'] = np.random.choice(products, size=size, replace=False)
    df['relevance'] = np.random.random(size=size)
    df['estimated_relevance'] = np.random.random(size=size)
    df['query'] = query
    df['query'] = df['query'].astype(str)
    dfs.append(df)
  
pd_df = pd.concat(dfs, axis=0)

df = spark.createDataFrame(pd_df)
df.show()

+----------+------+--------------------+-------------------+
|     query|   pid|           relevance|estimated_relevance|
+----------+------+--------------------+-------------------+
|air jordan|100612|  0.4906300598842973| 0.9921630747944146|
|air jordan|103687|  0.5987632520944304| 0.7177760681575576|
|air jordan|101109|  0.3592458258848352| 0.6983621850397713|
|air jordan|101208|0.006181523165330938|0.42805993889109606|
|air jordan|102494|  0.6580431073136314| 0.3789907339381291|
|air jordan|103939|  0.1357291046814525| 0.4356277303898791|
|air jordan|104108| 0.19923964105887537|0.26843523071135456|
|air jordan|101645|  0.8556352363674494|0.16655069933442757|
|air jordan|104580|  0.8591015262573666|0.33613687913659807|
|air jordan|103589|  0.2656900421264532|0.13882340711924857|
|air jordan|101887|  0.6879241505221808| 0.2929660306158248|
|air jordan|103339|     0.1448960115408| 0.6788507049116451|
|air jordan|102266| 0.10421591328080293| 0.8383747339044633|
|air jordan|103889|  0.5

# Collapsing Rows, Ordering by Relevance
There's a series of oper

In [18]:
results = get_rankings(df, gt_col='relevance', est_col='estimated_relevance')

results.show()

+----------+--------------------+--------------------+
|     query|          best_order|     estimated_order|
+----------+--------------------+--------------------+
|   air max|[104368, 100723, ...|[101132, 100077, ...|
|   running|[100075, 100647, ...|[100162, 100871, ...|
|air jordan|[104943, 103241, ...|[104552, 103585, ...|
|   red sox|[104031, 104247, ...|[101352, 100159, ...|
+----------+--------------------+--------------------+
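
For intuition about the shape of this output, the following pandas sketch collapses a long table into one row per query with two ordered lists of product IDs. The helper `rankings_sketch` and the toy data are hypothetical, and this is not the `get_rankings` implementation, which runs on Spark and may differ in details such as tie handling.

```python
import pandas as pd


def rankings_sketch(frame, gt_col="relevance", est_col="estimated_relevance"):
    """Collapse rows to one row per query holding two ordered pid lists.

    Hypothetical pandas stand-in for illustration only.
    """
    def ordered_pids(group, col):
        # Highest score first, then keep only the product IDs.
        return group.sort_values(col, ascending=False)["pid"].tolist()

    rows = []
    for query, group in frame.groupby("query", sort=False):
        rows.append({
            "query": query,
            "best_order": ordered_pids(group, gt_col),
            "estimated_order": ordered_pids(group, est_col),
        })
    return pd.DataFrame(rows, columns=["query", "best_order", "estimated_order"])


toy = pd.DataFrame({
    "query": ["air max"] * 3 + ["running"] * 2,
    "pid": [1, 2, 3, 4, 5],
    "relevance": [0.2, 0.9, 0.5, 0.1, 0.7],
    "estimated_relevance": [0.8, 0.3, 0.6, 0.4, 0.2],
})

ranked = rankings_sketch(toy)
print(ranked)
```

Each list contains the same product IDs for a given query; only the order differs, and that difference is what the NDCG computation measures.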



In [21]:
results.withColumn('ndcg', ndcg_udf(F.col('best_order'), F.col('estimated_order'))).select('query', 'ndcg').show()

+----------+----------+
|     query|      ndcg|
+----------+----------+
|   air max|0.15516137|
|   running| 0.1487128|
|air jordan| 0.1644091|
|   red sox| 0.1748767|
+----------+----------+

