<a href="https://colab.research.google.com/github/Kate-Way/AI-Snake-Game/blob/main/Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1. Install a reinforcement learning library**

In [None]:
! pip install stable_baselines3[extra]

In [2]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [None]:
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1

In [None]:
!pip install pyglet

In [5]:
import gym 
import os
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
from gym.wrappers import Monitor
import glob
import io
import base64
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

In [6]:
# optional: callbacks that help regulate the training process, mostly useful for large models
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

**2. Load Environment**

Test environment

In [7]:
display = Display(visible=0, size=(1400, 900))
display.start()


<pyvirtualdisplay.display.Display at 0x7f4013975890>

In [8]:
def show_video():
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")


def wrap_env(env):
    env = Monitor(env, './video', force=True)
    return env

In [9]:
env = wrap_env(gym.make("CartPole-v0"))

episodes = 5
for episode in range(1, episodes+1):
    observation = env.reset()  # initial observation (state) at the start of the episode
    done = False
    score = 0 
    
    while not done:
        env.render()
        action = env.action_space.sample()  # pick a random action
        n_state, reward, done, info = env.step(action)  # unpack the values returned by step
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))

env.close()
show_video()

Episode:1 Score:24.0
Episode:2 Score:22.0
Episode:3 Score:16.0
Episode:4 Score:13.0
Episode:5 Score:12.0


What environment problem are we trying to solve? (Read more: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

In [10]:
# 0 = push cart to the left, 1 = push cart to the right
env.action_space.sample()

1

In [11]:
# [cart position, cart velocity, pole angle, pole angular velocity]
env.observation_space 

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
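
The bounds in the `Box` above are easier to read once the angle is converted to degrees. The sketch below is only an illustration: the two bound values are copied from the printed output, and the halving step assumes, as the linked `cartpole.py` describes, that the observation bounds are twice the thresholds at which an episode ends.

```python
import math

# Bounds copied from the Box printed above (angle is in radians).
position_bound = 4.8
angle_bound_rad = 0.41887903

# Assumption (taken from the linked cartpole.py): the observation bounds are
# twice the thresholds at which an episode is terminated.
position_limit = position_bound / 2
angle_limit_deg = math.degrees(angle_bound_rad) / 2

print(f"angle bound: {math.degrees(angle_bound_rad):.1f} degrees")
print(f"episode ends beyond +/-{position_limit:.1f} position "
      f"or +/-{angle_limit_deg:.1f} degrees")
```

The two velocity entries are bounded only by the largest float32 value, which is why they print as 3.4028235e+38.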

**3. Train an RL Model** (Model-free learning)

In [12]:
env = gym.make("CartPole-v0")
# wrap the environment in a vectorized wrapper (lambda: env is the environment creation function)
env = DummyVecEnv([lambda: env])
# MlpPolicy: a policy backed by a standard fully connected neural network (the rules for acting in the environment)
# verbose=1: log the training results for this model
# PPO accepts many other parameters; to look them up, run PPO?? in a code cell
model = PPO('MlpPolicy', env, verbose=1)

Using cpu device


In [None]:
model.learn(total_timesteps=20000)
# to train the model for longer, simply run this cell again

**4. Save and Reload Model**

In [15]:
# define the path where the model will be saved
PPO_path = os.path.join('Training', 'Saved Models', 'PPO_model')

In [16]:
# save model after training
model.save(PPO_path)

In [None]:
# check the path
PPO_path

'Training/Saved Models/PPO_model'

In [17]:
del model

In [18]:
# reload model back into memory (pass full path to the model)
model = PPO.load('Training/Saved Models/PPO_model', env=env)

**5. Evaluate the model**

In [19]:
from stable_baselines3.common.evaluation import evaluate_policy

In [20]:
# n_eval_episodes=10: evaluate over 10 episodes; rendering doesn't work in Colab, so set render=False
evaluate_policy(model, env, n_eval_episodes=10, render=False)



(200.0, 0.0)
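
The tuple returned by `evaluate_policy` is the mean and the standard deviation of the total reward per evaluation episode. As a rough sketch of that summary, the snippet below uses only the standard library and invented episode returns (they are not results from this notebook); it assumes a population standard deviation, which is NumPy's default for `np.std`.

```python
import statistics

# Hypothetical per-episode returns, invented for illustration only.
episode_returns = [200.0, 187.0, 200.0, 193.0, 200.0]

mean_reward = statistics.fmean(episode_returns)
# Population standard deviation (divides by n, not n - 1).
std_reward = statistics.pstdev(episode_returns)
print((mean_reward, std_reward))

# When every episode reaches the same score, the spread is zero.
perfect_returns = [200.0] * 10
perfect_summary = (statistics.fmean(perfect_returns),
                   statistics.pstdev(perfect_returns))
print(perfect_summary)
```

A result of `(200.0, 0.0)` therefore means that every evaluation episode reached the 200-step cap of `CartPole-v0`.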

In [21]:
env.close()

**6. Test model**

In [32]:
episodes = 5
for episode in range(1, episodes+1):
    obs = env.reset()  # observations for our observation space
    done = False
    score = 0 
    
    while not done:
        env.render()
        action, _states = model.predict(obs)  # pass observations to the model, which predicts the action that maximizes reward
        obs, reward, done, info = env.step(action)  # unpack the values returned by step
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))

# the score is great, but the video won't show changes (we get 1 point of reward for each step the pole doesn't fall)

Episode:1 Score:[200.]
Episode:2 Score:[200.]
Episode:3 Score:[200.]
Episode:4 Score:[200.]
Episode:5 Score:[200.]


In [36]:
# cart position, cart velocity, pole angle, pole angular velocity
obs

array([[-0.0145188 , -0.0014735 , -0.03104781, -0.0245294 ]],
      dtype=float32)

In [33]:
env.close()

In [41]:
save_path = os.path.join('Training', 'Saved Models')
log_path = os.path.join('Training', 'Logs')  # don't forget to create the Logs folder

In [45]:
training_log_path = os.path.join(log_path, 'PPO_3')  # PPO_x, where x is the number of the training run
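
The folders named in these paths are not created by `os.path.join` itself. One possible way to create them up front is sketched below; the temporary directory is only a stand-in for the notebook's working directory so that the example can run anywhere, and it is not part of the original notebook.

```python
import os
import tempfile

# A temporary directory stands in for the notebook's working directory.
base_dir = tempfile.mkdtemp()

save_path = os.path.join(base_dir, 'Training', 'Saved Models')
log_path = os.path.join(base_dir, 'Training', 'Logs')

for path in (save_path, log_path):
    # exist_ok=True makes the call safe to run more than once.
    os.makedirs(path, exist_ok=True)

print(sorted(os.listdir(os.path.join(base_dir, 'Training'))))
```

In the notebook itself, dropping `base_dir` from the `os.path.join` calls would create the folders relative to the current working directory.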

**6. Viewing Logs in TensorBoard** (run it in a command line, not here: it won't stop running and everything will crash)

In [None]:
# tensorboard --logdir={training_log_path}   (replace {training_log_path} with the actual path, e.g. Training/Logs/PPO_3)
# localhost:6006 will then show graphs of the different training metrics
# key metrics: average reward and episode length (how long your agent lasts in the environment)

**7. Adding a callback to the training Stage** - useful for large models

In [46]:
# stop training once a certain reward threshold is reached
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=190, verbose=1)

eval_callback = EvalCallback(env, 
                             callback_on_new_best=stop_callback,  
                             # every time there is a new best model, stop_callback runs on it
                             eval_freq=10000, 
                             best_model_save_path=save_path, 
                             verbose=1)
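
How the two callbacks interact can be pictured with a small, library-free sketch. It is a simplification and not the Stable-Baselines3 implementation: `train_with_threshold` is a hypothetical helper, and the list of evaluation results is invented for illustration.

```python
def train_with_threshold(eval_means, reward_threshold):
    """Return (evaluations run, best mean reward) for a sequence of evaluations.

    eval_means holds the mean reward of successive evaluations
    (hypothetical values, one per eval_freq training steps).
    """
    best_mean = float('-inf')
    for n_evals, mean_reward in enumerate(eval_means, start=1):
        if mean_reward > best_mean:
            # New best model: this is where the "on new best" callback fires.
            best_mean = mean_reward
            if best_mean >= reward_threshold:
                # Threshold reached, so training stops early.
                return n_evals, best_mean
    return len(eval_means), best_mean


# Invented evaluation results, for illustration only.
evals_run, best = train_with_threshold(
    [45.0, 120.0, 118.0, 195.0, 200.0], reward_threshold=190)
print(evals_run, best)
```

With `eval_freq=10000` and a single environment, a run of 20000 timesteps is evaluated only about twice, so a smaller `eval_freq` may be needed for the threshold to stop training noticeably early.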

In [None]:
model = PPO('MlpPolicy', env, verbose = 1, tensorboard_log=log_path)

In [None]:
# runs learning and saves best model
model.learn(total_timesteps=20000, callback=eval_callback)

In [49]:
env.close()

**8. Changing Policies** (if you have a very specific reason to do so - there is a lot you can modify)

In [50]:
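# custom network architecture: the actor (pi) gets 4 layers of 128 units each, and the value function (vf) gets the same architecture
# note: newer Stable-Baselines3 releases may expect the dict directly, without the enclosing list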
net_arch=[dict(pi=[128, 128, 128, 128], vf=[128, 128, 128, 128])]
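
To get a feel for how large this architecture is, the sketch below counts the weights and biases of each network. It assumes plain fully connected layers with biases, the 4-value CartPole observation, 2 discrete actions, and a single value output; `mlp_param_count` is a hypothetical helper, not a Stable-Baselines3 function, and the real model may contain parameters this estimate omits.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


obs_dim = 4    # CartPole observation: 4 values
n_actions = 2  # CartPole actions: push left or push right
hidden = [128, 128, 128, 128]

policy_params = mlp_param_count([obs_dim] + hidden + [n_actions])
value_params = mlp_param_count([obs_dim] + hidden + [1])
print(policy_params, value_params)
```

Each network ends up with roughly 50 thousand parameters under these assumptions, which is far more capacity than CartPole usually needs.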

In [None]:
model = PPO('MlpPolicy', env, verbose = 1, policy_kwargs={'net_arch': net_arch})

In [None]:
model.learn(total_timesteps=20000, callback=eval_callback)

**9. Using an Alternate Algorithm**

In [53]:
from stable_baselines3 import DQN

In [None]:
model = DQN('MlpPolicy', env, verbose = 1, tensorboard_log=log_path)

In [None]:
model.learn(total_timesteps=20000, callback=eval_callback)

In [56]:
dqn_path = os.path.join('Training', 'Saved Models', 'DQN_model')

In [57]:
model.save(dqn_path)

In [58]:
model = DQN.load(dqn_path, env=env)

In [59]:
evaluate_policy(model, env, n_eval_episodes=10, render=False)



(9.5, 0.6708203932499369)

In [60]:
env.close()