# Outline

An exploration strategy I have proposed chooses the order of the scored candidates by sampling from the distribution of predicted reward over the candidate advertising campaign-target pairs.

“Reward” is a term often used to describe the benefit from an action in the multi-armed bandit problem. In our case, “predicted reward” refers to the eCPM that the algorithm predicts for an advertising campaign-target pair.

In words, the procedure is to treat the normalized eCPMs as the category probabilities of a multinomial distribution, and to sample from that distribution without replacement. Doing so generates a new, randomized order for the scored candidate campaign-targets, in which the expected positions of the candidates retain the same order as their predicted rewards.
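
As a small worked illustration of sampling without replacement, the sketch below computes the probability of each possible ordering for three candidates. The eCPM values and the helper name `ordering_probability` are hypothetical and chosen only for this example; each draw is assumed to pick a remaining candidate in proportion to its eCPM.

```python
from itertools import permutations

def ordering_probability(ecpms, order):
    """Probability of drawing candidates in `order` (a sequence of indices)
    when each draw picks a remaining candidate in proportion to its eCPM."""
    remaining = sum(ecpms[i] for i in order)
    prob = 1.0
    for i in order:
        prob *= ecpms[i] / remaining  # chance of picking i among what is left
        remaining -= ecpms[i]         # i is removed before the next draw
    return prob

ecpms = [4.0, 2.0, 1.0]  # hypothetical predicted eCPMs
probs = {order: ordering_probability(ecpms, order)
         for order in permutations(range(len(ecpms)))}
most_likely = max(probs, key=probs.get)
```

With these values the greedy order `(0, 1, 2)` is the single most likely outcome (probability 8/21), but every other ordering still has a non-zero chance, which is where the exploration comes from.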

# The algorithm

In pseudo-code, the inefficient, but direct form of the algorithm is as follows:

In [6]:
"""

alpha = a parameter used to make the algorithm more or less greedy (higher is more greedy)
N = number of candidate campaigns
L = list of scored, candidate campaigns
O = output, re-ordered, scored, candidate campaigns

scaledL = for each campaign, raise the eCPM to eCPM ^ alpha

normL = normalize the list of scaledL (where we divide each eCPM ^ alpha by the sum of them all)

cumNormL = perform a cumulative sum from the start to the end of the 
            normalized list of scored, candidate campaigns 
            (the first campaign's cumEcpm will be 0, while the last will be 1 - its normed eCPM)

while (N > 0) do
    compute scaledL and normL from the current elements of L
    compute cumNormL from normL
    i = rand(0,1)
    for j in range (length of cumNormL) - 1 down to 0:
        if i >= cumNormL(j):
            append L(j) to the tail of O
            remove the jth element from L
            N -= 1
            break
        else:
            continue
"""

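The direct form can be sketched in runnable Python as follows. This is a minimal sketch under a few assumptions: candidates are passed as `(campaign_id, ecpm)` pairs, the function name `sample_order` and the `rng` parameter are hypothetical, and it uses an inclusive running sum scanned from the front rather than a stored cumulative list. Scaling the random draw by the total weight stands in for normalizing the weights.

```python
import random

def sample_order(campaigns, alpha=1.0, rng=random):
    """Reorder (campaign_id, ecpm) pairs by weighted sampling without replacement."""
    remaining = list(campaigns)
    output = []
    while remaining:
        # Recompute the scaled weights from the candidates that are left.
        scaled = [ecpm ** alpha for _, ecpm in remaining]
        total = sum(scaled)
        i = rng.random() * total     # same as normalizing and drawing on [0, 1)
        cum = 0.0
        chosen = len(remaining) - 1  # fallback guards against rounding error
        for j, weight in enumerate(scaled):
            cum += weight
            if i < cum:
                chosen = j
                break
        output.append(remaining.pop(chosen))
    return output

candidates = [("a", 4.0), ("b", 2.0), ("c", 1.0)]  # hypothetical scored candidates
shuffled = sample_order(candidates, alpha=1.0, rng=random.Random(0))
greedy = sample_order(candidates, alpha=50.0, rng=random.Random(0))
```

Each pass rescans the remaining list, so the cost is quadratic in the number of candidates. A large `alpha` concentrates nearly all of the weight on the highest eCPM, so `greedy` comes out in descending eCPM order, while `alpha = 0` would make every ordering equally likely.
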
# A more efficient, memoized approach

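The version below scales each eCPM once and keeps a running total, so nothing has to be renormalized between draws. Where streams or generators are supported, a lazy version that produces campaigns on demand would also be preferable.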

In [8]:
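from collections import namedtuple
from random import random as rand

Campaign = namedtuple('Campaign', ['id', 'ecpm'])

class Sampler:
    """Draws campaigns without replacement, weighted by ecpm ** alpha."""

    def __init__(self, campaigns, alpha=1.0):
        # Scale each eCPM once and keep a running total of the scaled values.
        self.remaining_campaigns = {}
        self.total_sum = 0.0
        for c in campaigns:
            scaled_ecpm = c.ecpm ** alpha
            self.remaining_campaigns[c.id] = (scaled_ecpm, c)
            self.total_sum += scaled_ecpm

    def size(self):
        return len(self.remaining_campaigns)

    def rand(self):
        # Uniform draw on [0, total_sum); scaling the draw replaces normalizing the weights.
        return rand() * self.total_sum

    def get_next_sample_id(self):
        cum_sum = 0.0
        i = self.rand()
        cid = None
        for cid, t in self.remaining_campaigns.items():
            cum_sum += t[0]
            if cum_sum > i:
                return cid
        # Rounding drift in total_sum can leave i just past the final
        # cumulative sum; fall back to the last campaign visited.
        return cid

    def pop(self, cid):
        # Remove the campaign and return its (scaled_ecpm, campaign) tuple.
        self.total_sum -= self.remaining_campaigns[cid][0]
        return self.remaining_campaigns.pop(cid)

    def get_next_sample(self):
        if self.size() > 0:
            next_cid = self.get_next_sample_id()
            return self.pop(next_cid)
        else:
            return None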
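
The generator variant mentioned above might look like the following sketch. It is not part of the original class. It assumes candidates arrive as `(campaign_id, ecpm)` pairs with unique ids, and the names `sampled_campaigns` and `rng` are hypothetical. It caches the scaled weights and a running total in the same way, but yields one candidate at a time, so a caller that only needs the first few positions never pays for the rest.

```python
import random
from itertools import islice

def sampled_campaigns(campaigns, alpha=1.0, rng=random):
    """Lazily yield (campaign_id, ecpm) pairs in a randomized order."""
    campaigns = list(campaigns)
    ecpms = dict(campaigns)
    # Scaled weights are computed once and removed as candidates are drawn.
    weights = {cid: ecpm ** alpha for cid, ecpm in campaigns}
    total = sum(weights.values())
    while weights:
        i = rng.random() * total
        cum = 0.0
        chosen = None
        for cid, weight in weights.items():
            chosen = cid  # the last id visited doubles as a rounding fallback
            cum += weight
            if i < cum:
                break
        total -= weights.pop(chosen)
        yield chosen, ecpms[chosen]

candidates = [("a", 4.0), ("b", 2.0), ("c", 1.0)]  # hypothetical scored candidates
top_two = list(islice(sampled_campaigns(candidates, rng=random.Random(1)), 2))
full_order = list(sampled_campaigns(candidates, rng=random.Random(1)))
```

Taking `islice(..., 2)` performs only two draws; the third candidate is never sampled.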