The agent works as follows. Observed features are expanded with a fixed random transform to make linear separability more likely. The action is selected from the dot product of the expanded observation with a weight vector. A queued history of recent observations is shuffled and replayed to update the output weights. The output weights are updated by a least mean squares (LMS) update at the end of each incomplete episode, that is, an episode that ends before the target length. The target outputs for the LMS algorithm are the means of the past outputs. The output weights are kept at a fixed norm for regularization.

In [1]:
import gym
from numpy import *
from numpy.random import uniform,normal
from numpy.linalg import norm
from random import shuffle
from collections import deque
from statistics import mean

In [2]:
env = gym.make('CartPole-v1')

Hyperparameters

In [3]:
alpha = 2.0e-1              # Learning rate
maxEpisodes = 1000          # Episodes for which I am running the agent
maxTimeSteps = 500          # Maximum number of steps per episode
normVar = 0.5               # Fixed norm at which the output weights are kept
maxHistory = 5000           # Maximum number of recent observations for replay
successfulEpisodes = 100    # Cartpole is solved when average reward > 195 over 'successfulEpisodes' episodes
episodeLength = 500         # The target for CartPole-v1

Observations Transform

In [4]:
inputLength = 4             # Length of an observation vector
expansionFactor = 30
expandedLength = expansionFactor*inputLength

Feature transform with fixed random weights.

In [5]:
V = normal(scale=1.0, size=(expandedLength, inputLength))
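
As a rough illustration of the expansion step, the sketch below maps a 4-element observation to 120 features with a fixed random matrix and a tanh nonlinearity, then picks an action from the sign of a dot product. The names `expand`, `V_demo`, `w_demo`, and `obs_demo`, the seed, and the sample observation are assumptions made for this example and do not appear in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen only for reproducibility

input_length = 4                      # length of a CartPole observation
expanded_length = 30 * input_length   # same expansion factor as the notebook

# Fixed random projection, playing the role of V (hypothetical stand-in)
V_demo = rng.normal(scale=1.0, size=(expanded_length, input_length))

def expand(observation, V):
    """Map a raw observation to the expanded feature vector tanh(V @ x)."""
    return np.tanh(V @ observation)

obs_demo = np.array([0.01, -0.02, 0.03, 0.04])   # made-up observation
h_demo = expand(obs_demo, V_demo)
print(h_demo.shape)                   # (120,)

# Action selection: sign of the dot product with a weight vector
w_demo = rng.uniform(-1.0, 1.0, size=expanded_length)
action = 0 if np.dot(h_demo, w_demo) < 0 else 1
print(action)
```

Since V is never trained, only the output weights have to be learned, which is what keeps the update rule linear.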

Output weights, randomly initialized.

In [6]:
weight = uniform(low=-1.0, high=1.0, size=expandedLength)
weight *= normVar / norm(weight)         # Fix the norm of the output weights to 'normVar'.
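
The rescaling above can be checked in isolation. The following is a minimal sketch, assuming a hypothetical helper `rescale_to_norm` that is not part of the notebook; it only demonstrates that the weight vector ends up with the requested Euclidean norm.

```python
import numpy as np

def rescale_to_norm(w, target_norm):
    """Return a copy of w scaled so that its Euclidean norm is target_norm."""
    return w * (target_norm / np.linalg.norm(w))

rng = np.random.default_rng(1)            # seed chosen only for reproducibility
w_raw = rng.uniform(-1.0, 1.0, size=120)  # same shape as the output weights
w_fixed = rescale_to_norm(w_raw, 0.5)     # 0.5 matches normVar

print(np.linalg.norm(w_fixed))            # close to 0.5, up to rounding
```

Because the scale factor is positive, rescaling does not change the sign of the output, so the selected action is unaffected.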

The agent, trained with the least mean squares (LMS) algorithm

In [7]:
def AgentCartPole(alpha, W, V):
    observationHistory = deque([], maxHistory)
    totalRewardHistory = deque([], successfulEpisodes)
    positiveOutput = deque([0], maxHistory)
    negativeOutput = deque([0], maxHistory)
    counter = 0
    for episode in range(maxEpisodes):
        observation = env.reset()
        observationHistory.append(observation)
        totalReward = 0
        for t in range(1,maxTimeSteps+1):
            env.render()
            out = dot(tanh(dot(V,observation)), W)
            if out < 0:
                negativeOutput.append(out)
                action = 0
            else:
                positiveOutput.append(out)
                action = 1
            observation, reward, done, info = env.step(action)
            observationHistory.append(observation)
            totalReward += reward
            if done:
                totalRewardHistory.append(totalReward)
                if t < episodeLength:
                    # Replay shuffled past observations using the latest weights.
                    # Use the means of past outputs as the target outputs for the
                    # Least Mean Squares (LMS) algorithm.
                    mn = mean(negativeOutput)
                    mp = mean(positiveOutput)
                    shuffle(observationHistory)
                    for obs in observationHistory:
                        h = tanh(dot(V, obs))       # Transform the observation
                        out = dot(h, W)
                        if out < 0:
                            e = mn - out
                        else:
                            e = mp - out
                        W += alpha * e * h          # Least Mean Squares update
                        W *= normVar / norm(W)      # Keep the weights at fixed norm

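                # Divides by the full window size, so the average is
                # underestimated until 'successfulEpisodes' episodes have been seen.
                avgReward = sum(totalRewardHistory) / successfulEpisodes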
                print(f"Episode_Number:{episode:2d} TotalR:{totalReward:7.3f}  Average_Reward:{avgReward:7.3f}  len(H):{len(observationHistory):7.3f}  W:{W[:2]}")
                if avgReward >= 195:
                    counter += 1
                else:
                    counter = 0
                if counter > 100:
                    print("FINISHED")
                    return
                break
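
The replay step inside the function can be tried without the environment. The sketch below is a standalone approximation of that inner loop run on synthetic observations. The function name `replay_update`, the synthetic data, the target means, and the choice to work on a copy of the weights are assumptions for illustration; the notebook itself updates W in place.

```python
import random
import numpy as np

def replay_update(W, V, observations, mean_neg, mean_pos, alpha, target_norm):
    """Replay observations in random order with one LMS step per sample.

    The target for each sample is the mean of past outputs that share the
    sign of the current output, mirroring the loop in AgentCartPole.
    """
    W = W.copy()                      # assumption: leave the caller's weights untouched
    order = list(observations)
    random.shuffle(order)
    for obs in order:
        h = np.tanh(V @ obs)          # expanded observation
        out = h @ W
        target = mean_neg if out < 0 else mean_pos
        W += alpha * (target - out) * h           # LMS update
        W *= target_norm / np.linalg.norm(W)      # keep the norm fixed
    return W

rng = np.random.default_rng(3)        # seed chosen only for reproducibility
V_demo = rng.normal(size=(120, 4))
W_demo = rng.uniform(-1.0, 1.0, size=120)
W_demo *= 0.5 / np.linalg.norm(W_demo)
fake_obs = [rng.normal(scale=0.05, size=4) for _ in range(50)]  # synthetic, not from gym

W_new = replay_update(W_demo, V_demo, fake_obs,
                      mean_neg=-0.1, mean_pos=0.1, alpha=0.2, target_norm=0.5)
print(np.linalg.norm(W_new))          # stays close to 0.5
```

Renormalizing after every sample means the learning rate mainly controls how fast the direction of W rotates, since its length is reset each time.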


In [8]:
AgentCartPole(alpha, weight, V)
env.close()

Episode_Number: 0 TotalR:  9.000  Average_Reward:  0.090  len(H): 10.000  W:[-0.05227981  0.02990159]
Episode_Number: 1 TotalR: 10.000  Average_Reward:  0.190  len(H): 21.000  W:[ 0.05326689 -0.02985863]
Episode_Number: 2 TotalR:205.000  Average_Reward:  2.240  len(H):227.000  W:[ 0.03939755 -0.01416592]
Episode_Number: 3 TotalR:159.000  Average_Reward:  3.830  len(H):387.000  W:[-0.0679574   0.04400136]
Episode_Number: 4 TotalR:  8.000  Average_Reward:  3.910  len(H):396.000  W:[ 0.05474901 -0.0135555 ]
Episode_Number: 5 TotalR:169.000  Average_Reward:  5.600  len(H):566.000  W:[ 0.07314542 -0.0192789 ]
Episode_Number: 6 TotalR: 82.000  Average_Reward:  6.420  len(H):649.000  W:[-0.03884899  0.00158764]
Episode_Number: 7 TotalR: 11.000  Average_Reward:  6.530  len(H):661.000  W:[ 0.00961585 -0.00552221]
Episode_Number: 8 TotalR:  9.000  Average_Reward:  6.620  len(H):671.000  W:[-0.02508117  0.03646298]
Episode_Number: 9 TotalR: 10.000  Average_Reward:  6.720  len(H):682.000  W:[ 0.05

Episode_Number:80 TotalR: 10.000  Average_Reward: 45.670  len(H):4648.000  W:[-0.0197872  -0.03916977]
Episode_Number:81 TotalR:  8.000  Average_Reward: 45.750  len(H):4657.000  W:[-0.03810094  0.04370863]
Episode_Number:82 TotalR: 10.000  Average_Reward: 45.850  len(H):4668.000  W:[ 0.0706749  -0.01436432]
Episode_Number:83 TotalR:134.000  Average_Reward: 47.190  len(H):4803.000  W:[ 0.00080289 -0.01991192]
Episode_Number:84 TotalR:  9.000  Average_Reward: 47.280  len(H):4813.000  W:[ 0.02572658 -0.022447  ]
Episode_Number:85 TotalR: 44.000  Average_Reward: 47.720  len(H):4858.000  W:[ 0.06758646 -0.02299352]
Episode_Number:86 TotalR: 66.000  Average_Reward: 48.380  len(H):4925.000  W:[ 0.06081247 -0.02612517]
Episode_Number:87 TotalR: 71.000  Average_Reward: 49.090  len(H):4997.000  W:[0.01640827 0.03019734]
Episode_Number:88 TotalR:204.000  Average_Reward: 51.130  len(H):5000.000  W:[ 0.06611284 -0.00661707]
Episode_Number:89 TotalR: 34.000  Average_Reward: 51.470  len(H):5000.000  

Episode_Number:160 TotalR:500.000  Average_Reward:265.670  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:161 TotalR:500.000  Average_Reward:270.580  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:162 TotalR:500.000  Average_Reward:275.480  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:163 TotalR:500.000  Average_Reward:279.150  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:164 TotalR:500.000  Average_Reward:283.640  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:165 TotalR:500.000  Average_Reward:288.540  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:166 TotalR:500.000  Average_Reward:293.440  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:167 TotalR:500.000  Average_Reward:296.730  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:168 TotalR:500.000  Average_Reward:300.070  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:169 TotalR:500.000  Average_Reward:304.970  len(H):5000.000  W:[0.0

Episode_Number:241 TotalR:500.000  Average_Reward:500.000  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:242 TotalR:500.000  Average_Reward:500.000  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:243 TotalR:500.000  Average_Reward:500.000  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:244 TotalR:500.000  Average_Reward:500.000  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:245 TotalR:500.000  Average_Reward:500.000  len(H):5000.000  W:[0.02509426 0.0272077 ]
Episode_Number:246 TotalR:500.000  Average_Reward:500.000  len(H):5000.000  W:[0.02509426 0.0272077 ]
FINISHED


The average reward over the last 100 episodes stays at or above 195 for more than 100 consecutive episodes, so the balancing counts as a success, and the code prints FINISHED when that happens. Although the loop is set to run for up to 1000 episodes, it returns early at episode 246 once this condition is satisfied.
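
The stopping rule can be sketched on its own with a made-up reward sequence. The helper `episode_when_solved` and the `fake_rewards` list are hypothetical. The defaults copy the notebook's window of 100, threshold of 195, and `counter > 100` check.

```python
from collections import deque

def episode_when_solved(rewards, window=100, threshold=195, required=100):
    """Return the episode index at which the stop condition fires, or None."""
    recent = deque([], window)        # rolling window of episode rewards
    streak = 0
    for episode, reward in enumerate(rewards):
        recent.append(reward)
        avg = sum(recent) / window    # divides by the window size, as in the notebook
        streak = streak + 1 if avg >= threshold else 0
        if streak > required:
            return episode
    return None

fake_rewards = [10] * 50 + [500] * 300    # made-up sequence, not real training data
print(episode_when_solved(fake_rewards))
print(episode_when_solved([10] * 300))    # never reaches the threshold
```

Because the counter must exceed 100, the rule needs 101 consecutive episodes whose rolling average is at least 195.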