# Autonomous Orchard

## Exercise

An AI-controlled orchard needs to decide when to harvest its trees.
To do this it measures the concentration of three chemicals in the air.
Each day the orchard can choose to wait or harvest.
Waiting costs one credit in operating costs, while harvesting ends the process.
Once a crop is harvested, packaged and sold, the orchard is told the profit or loss of that harvest.
Most experts agree that the function mapping the chemical concentrations to the profit is linear with some error.
The orchard has several samples of the profits from other harvests:

\begin{align*}
    \begin{array}{|c|c|c|c|}
        \hline
        \text{Concentration of}~ A ~\text{(ppm)} &
        \text{Concentration of}~ B ~\text{(ppm)} &
        \text{Concentration of}~ C ~\text{(ppm)} &
        \text{Profit/Reward (credits)} \\ \hline
        4  & 7  & 1  & 3   \\ \hline
        10 & 6  & 0  & -15 \\ \hline
        20 & 1  & 15 & 5   \\ \hline
        4  & 19 & 3  & 21  \\ \hline
    \end{array}
\end{align*}

Begin to approximate (by hand) the function that maps the state feature vector to $Q(\text{state}, \text{harvest})$ using a Monte Carlo (MC) target.
Perform one gradient descent step on each sample.
A sensible learning rate would be around $0.01$, but feel free to try any value.

## Solution

### Preparations

In [1]:
import numpy as np
import pandas as pd
from collections import defaultdict
from fractions import Fraction

The MC gradient update reads

\begin{align}
    \mathbf w_{t+1}
    & \doteq
    \mathbf w_t + \alpha [U_t - \hat q(S_t, A_t, \mathbf w_t)] \nabla \hat q(S_t, A_t, \mathbf w_t), \tag{10.1}
\end{align}

where $U_t = G_t$.

Let the state space be $\mathbb R^3$, where the components stand for the concentrations of each chemical respectively.
The action space is $\{ \texttt{harvest}, \texttt{wait} \}$.

Our weights $\mathbf w$ will be initialized with $\mathbf 0$.

In [2]:
w = defaultdict(lambda: np.zeros(3))

We assume that the fruit was harvested immediately after each sample was taken;
i.e. the actions in the table are all 'harvest'.
Since a harvest ends the process, the return $G_t$ of each sample is simply the reported profit.

In [3]:
df = pd.DataFrame({
    'state': [
        np.array([4,  7,  1]),
        np.array([10, 6,  0]),
        np.array([20, 1,  15]),
        np.array([4,  19, 3])
    ],
    'action': ['harvest'] * 4,
    'reward': [3, -15, 5, 21]
})

df

,state,action,reward
0,"[4, 7, 1]",harvest,3
1,"[10, 6, 0]",harvest,-15
2,"[20, 1, 15]",harvest,5
3,"[4, 19, 3]",harvest,21


In [4]:
alpha = 0.01

As the function $\hat q(\cdot, \texttt{harvest}, \mathbf w)$ should be linear, we can use

\begin{align}
    \hat q(s, a, \mathbf w)
    \doteq
    \mathbf w^\top \mathbf x(s, a)
    =
    \sum_{i=1}^d
        w_i \cdot x_i(s, a),
\end{align}

and hence (a special case of (10.1))

\begin{align}
    \mathbf w_{t+1}
    & =
    \mathbf w_t + \alpha [G_t - \mathbf w_t^\top \mathbf x(S_t, A_t)] \mathbf x(S_t, A_t),
\end{align}

where $d = 3$ and

- $\mathbf w_t$ ... `w[t]`,
- $\alpha$ ... `alpha`,
- $G_t$ ... `df['reward'][t]`,
- $\mathbf x$ ... `x`,
- $S_t$ ... `df['state'][t]`,
- $A_t$ ... `df['action'][t]`.

Perhaps the easiest choice of $(s, a) \mapsto \mathbf x(s, a)$ is to return the state $s$ itself when $a = \texttt{harvest}$.
The samples contain no $\texttt{wait}$ actions, so that case is left unimplemented.

In [5]:
def x(state, action):
    if action == 'harvest':
        return state
    else:
        raise NotImplementedError
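
As a by-hand check of the first step, assuming $\alpha = 0.01$, $\mathbf w_0 = \mathbf 0$ and $\mathbf x(s, \texttt{harvest}) = s$ as chosen above, the update for the first sample $(4, 7, 1)$ with profit $3$ works out to

\begin{align*}
    \mathbf w_1
    & =
    \mathbf w_0 + \alpha [G_0 - \mathbf w_0^\top \mathbf x(S_0, A_0)] \mathbf x(S_0, A_0) \\
    & =
    \mathbf 0 + 0.01 \cdot [3 - 0] \cdot (4, 7, 1)^\top \\
    & =
    (0.12, 0.21, 0.03)^\top.
\end{align*}

The remaining steps follow the same pattern, except that the estimate $\mathbf w_t^\top \mathbf x(S_t, A_t)$ is no longer zero.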

### Calculations

In [6]:
print(f'w[{0}] = {w[0]}')
print(f'     = {[str(Fraction(w_).limit_denominator(6)) for w_ in w[0]]}')
print()

for t in range(4):
    features = x(df['state'][t], df['action'][t])
    # Error between the MC target G_t and the current estimate of q.
    error = df['reward'][t] - w[t] @ features
    w[t+1] = w[t] + alpha * error * features
    print(f'w[{t+1}] = {w[t+1]}')
    print(f'     = {[str(Fraction(w_).limit_denominator(6)) for w_ in w[t+1]]}')
    print()

w[0] = [0. 0. 0.]
     = ['0', '0', '0']

w[1] = [0.12 0.21 0.03]
     = ['1/6', '1/5', '0']

w[2] = [-1.626  -0.8376  0.03  ]
     = ['-8/5', '-5/6', '0']

w[3] = [ 5.95552  -0.458524  5.71614 ]
     = ['6', '-1/2', '23/4']

w[4] = [ 5.50517824 -2.59764736  5.37838368]
     = ['11/2', '-13/5', '27/5']
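
One possible sanity check, not part of the original exercise, is to compare the estimates from the final weights with the observed profits. The sketch below is self-contained and repeats the four updates with the same learning rate; the names `states`, `returns`, `w_final`, `predictions` and `residuals` are assumptions introduced only for this illustration.

```python
import numpy as np

# Samples from the exercise table: (A, B, C) concentrations and observed profit.
states = np.array([[4, 7, 1], [10, 6, 0], [20, 1, 15], [4, 19, 3]], dtype=float)
returns = np.array([3, -15, 5, 21], dtype=float)
alpha = 0.01

# Repeat the single pass of gradient MC updates, one step per sample.
w_final = np.zeros(3)
for s, g in zip(states, returns):
    w_final = w_final + alpha * (g - w_final @ s) * s

# Compare the linear estimate of Q(state, harvest) with each observed profit.
predictions = states @ w_final
residuals = returns - predictions
for s, g, q in zip(states, returns, predictions):
    print(f'state={s}, profit={g:6.1f}, estimate={q:8.3f}')
print(f'mean squared error = {np.mean(residuals ** 2):.3f}')
```

A single pass over four samples is only a start, so sizeable residuals would not be surprising; repeating the pass or trying a different learning rate is one way to explore how the estimates change.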

