# K-armed Bandit

The goal is to balance exploration and exploitation.

-----

## $\epsilon$-greedy algorithm

With probability $\epsilon$, explore: select an arm uniformly at random. With probability $1-\epsilon$, exploit: choose the arm with the highest current average reward.

We calculate the average reward of each arm $k$ incrementally. The initial average reward is $Q_0(k)=0$.
For all $n>0$, where $v_n$ is the reward received on the $n$-th pull of arm $k$,
$$
\begin{split}
Q_{n}(k) &= \frac{1}{n}((n-1)\cdot Q_{n-1}(k)+v_n) \\ 
         &= Q_{n-1}(k)+\frac{1}{n}(v_n - Q_{n-1}(k))
\end{split}
$$
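
The incremental form can be checked numerically. The following is a minimal sketch, assuming a hypothetical helper `update_mean` and a made-up reward sequence; neither comes from the original notes.

```python
def update_mean(q_prev, n, v):
    """Return Q_n given Q_{n-1}, the pull index n >= 1, and the n-th reward v."""
    return q_prev + (v - q_prev) / n


# Illustrative rewards for a single arm (assumed values, not real data).
rewards = [1.0, 0.0, 1.0, 1.0]

q = 0.0  # Q_0(k) = 0
for n, v in enumerate(rewards, start=1):
    q = update_mean(q, n, v)

# The running value should agree with the ordinary batch average.
batch_average = sum(rewards) / len(rewards)
print(q, batch_average)
```

The update only needs the previous average and the pull count, so no reward history has to be stored.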


$\epsilon$-greedy algorithm

$\textbf{Input:}$ $K$ number of arms;  
$\qquad \quad$ $R$ reward function;  
$\qquad \quad$ $T$ number of times to play;  
$\qquad \quad$ $\epsilon$ probability to explore.  
01: $r=0$;  
02: $\forall i=1,2,...,K: Q(i)=0,count(i)=0$;  
03: $\textbf{For}$ $t=1,2,...,T$  
04: $\quad$ $\textbf{If}$ $rand()<\epsilon$  
05: $\quad$ $\quad$ choose $k$ uniformly at random from $\{1,2,...,K\}$  
06: $\quad$ $\textbf{Else}$  
07: $\quad$ $\quad$ $k=\arg\max_i Q(i)$  
08: $\quad$ $\textbf{End If}$  
09: $\quad$ $v=R(k)$;  
10: $\quad$ $r=r+v$;  
11: $\quad$ $Q(k)=\frac{Q(k)\cdot count(k)+v}{count(k)+1}$  
12: $\quad$ $count(k) = count(k)+1$  
13: $\textbf{End For}$  
$\textbf{Output:}$ $r$ cumulative reward
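
The pseudocode above might translate to Python roughly as follows. This is a sketch under a few assumptions that are not in the original: arms are indexed from `0` instead of `1`, ties in the arg max go to the lowest index, the optional `rng` parameter is added for reproducibility, and the Bernoulli arms in the usage lines are purely illustrative.

```python
import random


def epsilon_greedy(K, R, T, epsilon, rng=None):
    """Play T rounds of epsilon-greedy; return (cumulative reward, Q, count)."""
    rng = rng or random.Random()
    Q = [0.0] * K      # current average reward of each arm
    count = [0] * K    # number of times each arm was played
    r = 0.0            # cumulative reward
    for _ in range(T):
        if rng.random() < epsilon:
            k = rng.randrange(K)                      # explore: uniform arm
        else:
            k = max(range(K), key=lambda i: Q[i])     # exploit: best average
        v = R(k)
        r += v
        count[k] += 1
        Q[k] += (v - Q[k]) / count[k]                 # incremental average
    return r, Q, count


# Usage with hypothetical Bernoulli arms (success probabilities are assumed).
probs = [0.2, 0.5, 0.8]
env = random.Random(0)
total, Q, count = epsilon_greedy(
    K=3,
    R=lambda k: 1.0 if env.random() < probs[k] else 0.0,
    T=1000,
    epsilon=0.1,
    rng=random.Random(1),
)
print(total, Q, count)
```

The update `Q[k] += (v - Q[k]) / count[k]` is algebraically the same as lines 11 and 12 of the pseudocode, written in the incremental form.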

-----

## Softmax algorithm
Softmax balances exploration and exploitation based on the current average rewards of all the arms: arms with a higher average reward are chosen with higher probability.
The probability of choosing arm $k$ follows the Boltzmann distribution:
$$
P(k)=\frac{e^{\frac{Q(k)}{\tau}}}{\sum^{K}_{i=1} e^{\frac{Q(i)}{\tau}}}
$$
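where $Q(i)$ is the current average reward of arm $i$ and $\tau > 0$ is a parameter called the temperature. The lower the temperature, the more probability goes to arms with a high average reward: as $\tau \to 0$ Softmax approaches pure exploitation, and as $\tau \to \infty$ it approaches uniform exploration.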
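
To see the effect of the temperature, the Boltzmann probabilities can be computed directly. The sketch below assumes a hypothetical helper `boltzmann` and made-up average rewards; subtracting the maximum before exponentiating is an implementation detail added here for numerical stability and does not change the resulting probabilities.

```python
import math


def boltzmann(Q, tau):
    """Return P(k) proportional to exp(Q(k) / tau) for every arm k."""
    m = max(Q)  # shift by the max so exp() cannot overflow
    weights = [math.exp((q - m) / tau) for q in Q]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative average rewards (assumed values).
Q = [0.1, 0.5, 0.9]

p_high_tau = boltzmann(Q, tau=1.0)   # flatter distribution, more exploration
p_low_tau = boltzmann(Q, tau=0.1)    # peaked on the best arm, more exploitation
print(p_high_tau)
print(p_low_tau)
```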


Softmax algorithm

$\textbf{Input:}$ $K$ number of arms;  
$\qquad \quad$ $R$ reward function;  
$\qquad \quad$ $T$ number of times to play;  
$\qquad \quad$ $\tau$ temperature.  
01: $r=0$;  
02: $\forall i=1,2,...,K: Q(i)=0,count(i)=0$;  
03: $\textbf{For}$ $t=1,2,...,T$  
04: $\quad$ choose $k$ from the Boltzmann distribution $P$ computed from the current $Q$ and $\tau$;  
05: $\quad$ $v=R(k)$;  
06: $\quad$ $r=r+v$;  
07: $\quad$ $Q(k)=\frac{Q(k)\cdot count(k)+v}{count(k)+1}$  
08: $\quad$ $count(k) = count(k)+1$  
09: $\textbf{End For}$  
$\textbf{Output:}$ $r$ cumulative reward
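
A possible Python rendering of the Softmax procedure is sketched below. It samples an arm from the Boltzmann distribution on every round. As assumptions not found in the original, arms are indexed from `0`, an optional `rng` parameter is added for reproducibility, the weights are shifted by the maximum for numerical stability, and the Bernoulli arms in the usage lines are illustrative only.

```python
import math
import random


def softmax_bandit(K, R, T, tau, rng=None):
    """Play T rounds of Softmax selection; return (cumulative reward, Q, count)."""
    rng = rng or random.Random()
    Q = [0.0] * K      # current average reward of each arm
    count = [0] * K    # number of times each arm was played
    r = 0.0            # cumulative reward
    for _ in range(T):
        m = max(Q)
        # Unnormalised Boltzmann weights; choices() normalises them itself.
        weights = [math.exp((q - m) / tau) for q in Q]
        k = rng.choices(range(K), weights=weights)[0]
        v = R(k)
        r += v
        count[k] += 1
        Q[k] += (v - Q[k]) / count[k]   # incremental average
    return r, Q, count


# Usage with hypothetical Bernoulli arms (success probabilities are assumed).
probs = [0.2, 0.5, 0.8]
env = random.Random(0)
total, Q, count = softmax_bandit(
    K=3,
    R=lambda k: 1.0 if env.random() < probs[k] else 0.0,
    T=1000,
    tau=0.1,
    rng=random.Random(1),
)
print(total, Q, count)
```

Unlike $\epsilon$-greedy, there is no separate explore branch: the amount of exploration is controlled entirely by $\tau$.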

-----