# The Training Loop
In Part 1, we put together a basic algorithm for updating our Q function and saw it work for one pass. Now, let's complete the loop and use it to search for an optimal policy.

Our todo list:
1. Refactor the experiment, pulling out reusable code into standalone methods
1. Establish the terminating condition
1. Implement a complete episode
1. Finally, loop through a pre-determined number of episodes

We're staying with random action selection for exploration for now. After we get the training loop going, we'll try using the evolving policy to select the next action.

## Refactoring
Below is our experiment, with getAction, getUpdatedQsa, and getReward added to the utilities module.

In [1]:
# %load utilities
import random

def hit(p,s):
    p.receive(s.draw())
    print("New hand: {} ({})".format(p.hand,p.getPoints()))

def newHand(d,p,s):
    p.reset()
    d.reset()
    deal(d,p,s)

# Deal
def deal(d,p,s):
    d.receive(s.draw())
    p.receive(s.draw())
    d.receive(s.draw())
    p.receive(s.draw())
    print("Dealer's hand: {} ({})".format(d.hand,d.getPoints()))
    print("Player's hand: {} ({})".format(p.hand,p.getPoints()))
    
def getAction():
    r = random.randint(0,1)
    if r == 0:
        return 'HIT'
    else:
        return 'STAY'

# Updated Q(s,a) value
def getUpdatedQsa(Q,s,r,sPrime,A):
    return Q[s] + 0.08*(r + max(Q[sPrime+A[0:1]],Q[sPrime+A[1:2]]) - Q[s])

# Calculate reward
def getReward(p,d,isTerminal):

    # Non-Terminal state rewards
    if not isTerminal:
        # Did the player bust?
        if p.getPoints() > 21:
            r = -1
        else:
            # Hand is still going
            r = 0
        return r
    
    # Other state rewards
    # Did the player bust?
    if p.getPoints() > 21:
        r = -1
    elif d.getPoints() > 21:
        r = 1
    elif (p.getPoints() > d.getPoints()):
        r = 1
    elif (p.getPoints() < d.getPoints()):
        r = -1
    else:
        r = 0

    return r
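
To make the update concrete, here is a minimal, self-contained sketch of the same arithmetic on a fresh Q table. The names `updated_q` and `ALPHA` are ours for illustration (they are not in the utilities module), and the scenario, a player on 16 who hits into a bust against a dealer 10, is made up.

```python
from collections import defaultdict

ALPHA = 0.08  # same learning rate that is hard-coded in getUpdatedQsa


def updated_q(Q, s, r, s_prime, actions):
    """One undiscounted Q-learning update, mirroring getUpdatedQsa."""
    best_next = max(Q[s_prime + (a,)] for a in actions)
    return Q[s] + ALPHA * (r + best_next - Q[s])


Q = defaultdict(float)
actions = ('HIT', 'STAY')

# Hypothetical hand: the player holds 16, hits, and busts with 26 (reward -1).
s = (16, 10, 'HIT')
Q[s] = updated_q(Q, s, -1, (26, 10), actions)
first = Q[s]

# Seeing the same transition again nudges the value further towards -1.
Q[s] = updated_q(Q, s, -1, (26, 10), actions)
second = Q[s]

print(first, second)
```

The first update moves Q(s,a) from 0 to -0.08, and the second moves it to about -0.1536. Each pass closes 8% of the remaining gap between the current estimate and the observed target.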


In [6]:
from player import Player
from shoe import Shoe
from utilities import hit, newHand, deal, getAction, getUpdatedQsa, getReward
from collections import defaultdict
from IPython.display import clear_output

# Initialize
shoe = Shoe(1)
dealer = Player()
player = Player()
allActions=('HIT','STAY',)
Q = defaultdict(float)

In [None]:
# Starting state for a hand/episode
newHand(dealer,player,shoe)

# 1. Choose an action
action = getAction()

# 2. Observe the state
currentState=(player.getPoints(),dealer.hand[0],action)
print("Current state: {}".format(currentState))

# 3. Do the action
if (action == 'HIT'):
    print("Player {}: ".format(action), end=' ')
    hit(player,shoe)
newState = (player.getPoints(),dealer.hand[0])

# Calculate reward
if (action == 'STAY'):
    while dealer.getPoints() < 17:
        print("Dealer HIT: ", end=' ')
        hit(dealer,shoe)
    
    reward = getReward(player,dealer,True)
else:
    reward = getReward(player,dealer,False)

# 4. Update Q(s,a)
Q[currentState] = getUpdatedQsa(Q,currentState,reward,newState,allActions)

Q

## Iterating Through the Episode
If we focus on generating a converging policy, then we can use an algorithm dedicated to the Q update calculation. The one we've been gravitating towards is shown below.

```
Initialize s
Repeat
    Choose a
    Take action a, observe r, s'
    Update Q(s,a)
    s <-- s'
Until s is terminal
```

This is reasonably straightforward, but we have some challenges in our Blackjack world. Let's take a look at what terminal states look like for us.

### Terminal State
Put simply, the terminal state is the end of the Blackjack hand. Blackjack hands end under these conditions:
* The player goes bust (over 21)
* The player stays and the dealer completes his turn
* The dealer has Blackjack (21 in a hand with two cards)

Clearly, the hand can end before it begins -- when the dealer gets Blackjack, it doesn't matter what anyone does. There's no action the player can take, there's no utility to calculate, there's nothing to learn. Because starting in the terminal state means the player loses all control, we should eliminate that from our algorithm. We do that by changing the loop control from a Repeat..Until construct to a While..Do construct.
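
The difference between the two loop constructs is easy to see in a stripped-down sketch. The functions `repeat_until` and `while_do` are hypothetical and only count how many times the loop body runs; for simplicity the body always ends the episode.

```python
def repeat_until(starts_terminal):
    """Repeat..Until: the body runs once even if we start in a terminal state."""
    steps = 0
    terminal = starts_terminal
    while True:
        steps += 1          # choose an action, update Q, and so on
        terminal = True     # in this sketch, one step always ends the hand
        if terminal:
            break
    return steps


def while_do(starts_terminal):
    """While..Do: the test comes first, so a terminal start skips the body."""
    steps = 0
    terminal = starts_terminal
    while not terminal:
        steps += 1
        terminal = True
    return steps


print(repeat_until(True), while_do(True))    # dealer Blackjack on the deal
print(repeat_until(False), while_do(False))  # an ordinary hand
```

With a dealer Blackjack on the deal, the Repeat..Until version would still choose an action and update Q once, while the While..Do version does nothing, which is what we want.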

Let's create a method to implement these rules, taking a state-action pair (s,a), and the player and dealer objects as input.

In [None]:
def isTerminalState(sa,p,d):
    if sa[-1] == 'TERMINAL':
        return True
    
    if p.getPoints() > 21:
        return True
    
    if d.getPoints() == 21 and len(d.hand) == 2:
        return True

    if sa[-1] == 'STAY':
        return True
    
    return False
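
A quick way to check these rules without dealing real cards is to feed the function stand-in hands. `StubHand` below is hypothetical: it only mimics the `hand` attribute and `getPoints()` method that we use from Player, and it sums card values as given. The function is repeated so the sketch runs on its own.

```python
class StubHand:
    """Hypothetical stand-in for Player: only exposes hand and getPoints()."""
    def __init__(self, hand):
        self.hand = list(hand)

    def getPoints(self):
        # Simplification for this sketch: card values are summed as given.
        return sum(self.hand)


def isTerminalState(sa, p, d):
    if sa[-1] == 'TERMINAL':
        return True
    if p.getPoints() > 21:
        return True
    if d.getPoints() == 21 and len(d.hand) == 2:
        return True
    if sa[-1] == 'STAY':
        return True
    return False


dealer_14 = StubHand([10, 4])
results = {
    'still playing': isTerminalState((16, 10, 'NONE'), StubHand([9, 7]), dealer_14),
    'player stays': isTerminalState((16, 10, 'STAY'), StubHand([9, 7]), dealer_14),
    'player busts': isTerminalState((26, 10, 'HIT'), StubHand([9, 7, 10]), dealer_14),
    'dealer blackjack': isTerminalState((16, 11, 'NONE'), StubHand([9, 7]), StubHand([11, 10])),
}
for name, value in results.items():
    print(name, value)
```

Only the first case should come back False; the other three are the terminal conditions listed above.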


## Rewards
Recall that the reward which results from taking an action depends on whether the hand is over. Sometimes the hand is over because of the state (the player has gone bust), and sometimes it is over because of the action (the player chooses to stand with the cards in her hand). We've got a terminal state detector now, and its answer won't change when the dealer plays out his hand. So the original reward calculation logic can be compressed to

```
if isTerminalState after action {
    Dealer plays out hand
}
calculate reward
```
We're going to introduce two new descriptive actions, *NONE* and *TERMINAL*, to meet the requirements of having a complete (s,a) specification for the initial state and new state generations, and to signal to other modules the status of the episode. That means we'll need to make a slight adjustment to the getUpdatedQsa function: it has to strip the placeholder action from the new state before appending each real action.

In [None]:
# Testing tuple slicing...
myTuple = (16, 4, 'NONE')
print(myTuple)
print(myTuple[0:-1]) # give me everything but the last element of the tuple as a tuple

def getUpdatedQsa(Q,s,r,sPrime,A):
    return Q[s] + 0.08*(r + max(Q[sPrime[0:-1]+A[0:1]],Q[sPrime[0:-1]+A[1:2]]) - Q[s])
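
Here is a small, self-contained check of why the slicing matters. The Q values for (18, 4) are made-up numbers, and the function is rewritten slightly (looping over A instead of indexing it) but is meant to compute the same thing as the version above.

```python
from collections import defaultdict


def getUpdatedQsa(Q, s, r, sPrime, A):
    next_state = sPrime[0:-1]  # drop the placeholder action ('NONE' or 'TERMINAL')
    best_next = max(Q[next_state + (a,)] for a in A)
    return Q[s] + 0.08 * (r + best_next - Q[s])


Q = defaultdict(float)
Q[(18, 4, 'HIT')] = -0.5   # hypothetical learned values
Q[(18, 4, 'STAY')] = 0.3

s = (16, 4, 'HIT')
new_q = getUpdatedQsa(Q, s, 0, (18, 4, 'NONE'), ('HIT', 'STAY'))

# Without the slice, the lookup key would carry two actions and never match.
unsliced_key = (18, 4, 'NONE') + ('HIT',)

print(new_q)
print(unsliced_key, unsliced_key in Q)
```

The update picks up the 0.3 stored for staying on 18, giving roughly 0.024. The unsliced key (18, 4, 'NONE', 'HIT') is not in Q, so without the slice the max would always see zeros.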

Now we should have all we need to implement the algorithm. 

In [None]:
for i in range(10):
    # Initialize s
    newHand(dealer,player,shoe)
    currentState = (player.getPoints(),dealer.hand[0],'NONE')

    while not isTerminalState(currentState,player,dealer):

        # Choose an action
        action = getAction()
        currentState=(player.getPoints(),dealer.hand[0],action)
        print("Current state: {}".format(currentState))

        # Take the action
        if (action == 'HIT'):
            print("Player {}: ".format(action), end=' ')
            hit(player,shoe)
        newState = (player.getPoints(),dealer.hand[0],'NONE')

        # Observe the new state and its reward. If the action taken or
        # the new state is terminal, then we need to have the dealer 
        # play out to generate the reward
        isTerminal = isTerminalState(currentState,player,dealer)
        if (isTerminal):
            newState = (player.getPoints(),dealer.hand[0],'TERMINAL')
            while dealer.getPoints() < 17:
                print("Dealer HIT: ", end=' ')
                hit(dealer,shoe)

        reward = getReward(player,dealer,isTerminal)

        # Update Q(s,a)
        Q[currentState] = getUpdatedQsa(Q,currentState,reward,newState,allActions)

        currentState = newState
    print("we're done.")
Q

I'm getting invalid states in the Q space: hands that are over 21. Should I have them?
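
One likely source of these entries is the Q table itself: a defaultdict inserts a key the first time it is looked up, so taking the max over a bust state's actions is enough to create them. The sketch below shows this with made-up states; `bust_entries` is a hypothetical helper, not part of the utilities module.

```python
from collections import defaultdict


def bust_entries(Q):
    """Return the Q entries whose player total (first element) is over 21."""
    return {key: value for key, value in Q.items() if key[0] > 21}


Q = defaultdict(float)
Q[(16, 10, 'HIT')] = -0.08

# A plain lookup, like the ones inside max(...), is enough to insert a key.
best_next = max(Q[(26, 10, 'HIT')], Q[(26, 10, 'STAY')])

print(bust_entries(Q))
```

Both bust keys show up with a value of 0.0 even though nothing was ever assigned to them. Using `Q.get(key, 0.0)` for read-only lookups would avoid inserting them.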

I think maybe my algorithm isn't behaving like an agent. What happens if I do ask it to behave like an agent?

The Q-LEARNING-AGENT function takes a percept (a state and a reward) as input and returns an action. Persistent values include Q, and (s,a,r) which are the previous state, action, and reward -- initially null.

```
function Q-LEARNING-AGENT(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  persistent: Q[s,a], a table of action values indexed by state and action, initially 0
             N[s,a], a table of frequencies for state-action pairs, initially 0 (used to throttle exploration)
             s, a, r, the previous state, action, and reward, initially null

    if TERMINAL?(s') then Q[s',None] <-- r
    if s is not null then
        increment N[s,a]
        Q[s,a] <-- Q[s,a] + lr(N[s,a]) * (r + discount * max_a'(Q[s',a']) - Q[s,a])
    s, a, r <-- s', argmax_a' f(Q[s',a'],N[s',a']), r'
    return a
```
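
As a rough sketch, the pseudocode above could be written as a small Python class. Several pieces here are assumptions that the pseudocode leaves open: the learning-rate schedule `lr(n) = 1/n`, the exploration function `f` with its `n_explore` and `optimistic_reward` parameters, and the choice to clear s, a, r at the end of an episode. None of this is the code used in the rest of this notebook.

```python
from collections import defaultdict


class QLearningAgent:
    """Sketch of the agent pseudocode; lr, f and their parameters are assumptions."""

    def __init__(self, actions, discount=1.0, n_explore=2, optimistic_reward=1.0):
        self.actions = actions
        self.discount = discount
        self.n_explore = n_explore                  # tries before trusting Q (assumed)
        self.optimistic_reward = optimistic_reward  # best reward we hope for (assumed)
        self.Q = defaultdict(float)
        self.N = defaultdict(int)
        self.s = self.a = self.r = None

    def lr(self, n):
        # Assumed schedule: the step size shrinks as the pair is visited more.
        return 1.0 / n

    def f(self, u, n):
        # Assumed exploration function: be optimistic about rarely tried pairs.
        return self.optimistic_reward if n < self.n_explore else u

    def __call__(self, s_prime, r_prime, terminal):
        """Take the percept (s', r'); pass r' = 0 for the starting state."""
        if terminal:
            self.Q[(s_prime, None)] = r_prime
        if self.s is not None:
            key = (self.s, self.a)
            self.N[key] += 1
            if terminal:
                best_next = self.Q[(s_prime, None)]
            else:
                best_next = max(self.Q[(s_prime, b)] for b in self.actions)
            target = self.r + self.discount * best_next
            self.Q[key] += self.lr(self.N[key]) * (target - self.Q[key])
        if terminal:
            # Our choice: forget the previous step so the next episode starts clean.
            self.s = self.a = self.r = None
            return None
        self.s, self.r = s_prime, r_prime
        self.a = max(self.actions,
                     key=lambda b: self.f(self.Q[(s_prime, b)], self.N[(s_prime, b)]))
        return self.a


agent = QLearningAgent(('HIT', 'STAY'))
first_action = agent((16, 10), 0.0, False)   # player on 16 against a dealer 10
final_action = agent((26, 10), -1.0, True)   # the hit busted the hand
print(first_action, final_action, agent.Q[((16, 10), 'HIT')])
```

In this made-up hand, both actions look equally promising at first, so the agent takes the first one listed (HIT). When the bust percept arrives, the terminal reward is stored under the None action and then pulled back into Q for the previous state and action.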

**This** we can repeat until s is terminal. Let's rework the code to fit this algorithm. We'll leave action choice random for now.

We'll need another version of the terminal state test function, and of the update function.


In [32]:
def isTerminal(s,d):
    if s[0] > 21:
        return True
    
    if d.getPoints() == 21 and len(d.hand) == 2:
        return True

# Staying can be identified elsewhere
#    if a == 'STAY':
#        return True
    
    return False

# Updated Q(s,a) value
def getUpdatedQsa(Q,ps,pa,pr,s):
    return Q[ps+(pa,)] + 0.08*(pr + max(Q[s+('HIT',)],Q[s+('STAY',)]) - Q[ps+(pa,)])


In [33]:
# Initialize the experiment
shoe = Shoe(1)
dealer = Player()
player = Player()
allActions=('HIT','STAY',)
Q = defaultdict(float)

In [37]:
#from IPython.core.debugger import Tracer; debug_here = Tracer()
from IPython.core.debugger import Pdb
pdb = Pdb()
def qLearningAgent(sp,rp,s,a,r,Q,p,d,terminal):

    if terminal:
        Q[sp+('NONE',)] = rp

    if s is not None:
        Q[s+(a,)] = getUpdatedQsa(Q,s,a,r,sp)

    s = sp
    a = getAction(); print(a)
    r = rp

    return (s,a,r)

# Initialize s'
newHand(dealer,player,shoe)
currentState = (player.getPoints(),dealer.hand[0],)
currentReward = 0  # no reward observed yet; None would break the arithmetic in getUpdatedQsa
terminal = False
s = None
a = None
r = None

pdb.set_trace()

while True:
    # Choose an action
    sar = qLearningAgent(currentState,currentReward,s,a,r,Q,player,dealer,terminal)
    if terminal:
        break
    
    s = sar[0]; a = sar[1]; r = sar[2]
    # Take the action
    if (a == 'HIT'):
        print("Player {}: ".format(a), end=' ')
        hit(player,shoe)
        currentState = (player.getPoints(),dealer.hand[0],)
        terminal = isTerminal(currentState,dealer)
    else: 
        while dealer.getPoints() < 17:
            print("Dealer HIT: ", end=' ')
            hit(dealer,shoe)
        terminal = True
    
    currentReward = getReward(player,dealer,terminal)


Q

Dealer's hand: [10, 4] (14)
Player's hand: [9, 7] (16)
--Return--
None
> <ipython-input-37-427df6d9b8c9>(27)<module>()
     25 r = None
     26 
---> 27 pdb.set_trace()
     28 
     29 while True:

ipdb> b getUpdatedQsa
Breakpoint 5 at <ipython-input-32-b308b8db0ea9>:15
ipdb> b
Num Type         Disp Enb   Where
1   breakpoint   keep yes   at <ipython-input-22-5ecc7f574496>:29
	breakpoint already hit 1 time
2   breakpoint   keep yes   at <ipython-input-25-9b11f032abeb>:29
	breakpoint already hit 1 time
3   breakpoint   keep yes   at <ipython-input-30-9b11f032abeb>:29
	breakpoint already hit 1 time
4   breakpoint   keep yes   at <ipython-input-34-d258266e4472>:29
	breakpoint already hit

In [36]:
Q

defaultdict(float,
            {(13, 11, 'HIT'): 0.0,
             (13, 11, 'NONE'): -1,
             (13, 11, 'STAY'): 0.0})